A Multi-Agent Emergency Material Allocation Approach Based on a Markov Decision Process Under Demand Uncertainty for Sustainable Disaster Response

Huang, Lu; Hou, Jundong

doi:10.3390/su18115539


by Lu Huang ¹,² and Jundong Hou ¹,*

¹ School of Economics and Management, China University of Geosciences, Wuhan 430074, China

² The Laboratory of Natural Disaster Risk Prevention and Emergency Management, China University of Geosciences, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5539; https://doi.org/10.3390/su18115539

Submission received: 3 April 2026 / Revised: 22 May 2026 / Accepted: 23 May 2026 / Published: 1 June 2026


Abstract

Effective emergency relief allocation in dynamic post-disaster environments depends critically on accurate and timely demand information. From a sustainability perspective, improving allocation accuracy is essential for using scarce rescue resources efficiently and supporting resilient disaster response. However, existing demand forecasting approaches frequently exhibit systematic bias, leading to resource misallocation and diminished rescue outcomes. Although deploying on-site assessment teams can partially mitigate this limitation, a unified framework that systematically embeds field assessment feedback into operational allocation processes remains lacking. To bridge this gap, this study proposes a multi-agent joint assessment-allocation model that facilitates coordinated operations between demand assessment and resource distribution activities. The sequential decision-making process is formulated as a Markov Decision Process (MDP), and deep reinforcement learning is employed to coordinate the actions of assessment and allocation teams, enabling allocation policies to be continuously refined through real-time field feedback. By improving the match between actual demand and material supply, the proposed model aims to support more resource-efficient disaster response under demand uncertainty. An empirical case study based on the 2025 Dingri County earthquake in Tibet is conducted to validate the proposed framework. Results demonstrate that integrating assessment feedback substantially improves resource allocation performance: in multi-site rescue scenarios, the framework increases the number of rescued individuals, reduces mission completion time, and enhances overall demand satisfaction. Further sensitivity analysis reveals that a moderate increase in team size strengthens cross-site coordination, whereas excessive team deployment yields diminishing returns and may generate operational redundancy. 
These findings suggest that sustainable emergency management depends not only on the availability of relief resources, but also on the efficient coordination of real-time information acquisition and material allocation. The proposed framework offers a generalizable approach for integrating real-time information acquisition with dynamic relief allocation. It improves the efficient utilization of scarce rescue resources, reduces avoidable operational redundancy, and strengthens the resilience of emergency response systems, thereby contributing to sustainable disaster risk reduction.

Keywords: sustainable disaster response; emergency material allocation; real-time demand assessment; Markov decision process; disaster resilience; resource use efficiency

1. Introduction

The escalating frequency and severity of natural disasters, compounded by increasing societal vulnerability and exposure, have emerged as one of the most pressing challenges in contemporary disaster risk management. Events such as earthquakes and large-scale floods are typically characterized by rapid evolution, cascading effects, and complex spatial dynamics, propagating impacts across multiple affected sites and posing severe threats to human safety, social stability, and economic continuity. In this context, sustainable disaster response is closely linked to disaster risk reduction and response-system resilience, as emphasized by the United Nations Sustainable Development Goals and the Sendai Framework for Disaster Risk Reduction. Against this backdrop, the timely delivery and efficient allocation of emergency materials, including food, water, medicine, tents, blankets, as well as the deployment of rescue personnel, constitute a critical component of post-disaster response. The quality of allocation decisions directly determines rescue effectiveness and the degree to which the needs of affected populations are met. Evidence from recent large-scale relief operations, including the 2023 Turkiye earthquake and the Luding earthquake in Sichuan, indicates that resource allocation commonly confronts two fundamental challenges [1,2,3]: (i) allocation efficiency is constrained by limited resource stocks and transportation capacity, resulting in critical supply mismatches and delayed assistance; and (ii) demand assessment and resource distribution are typically managed by different organizations with limited coordination mechanisms, which can substantially degrade system performance and amplify disaster losses [4,5]. There is therefore a compelling need for a collaborative multi-actor framework that establishes a structured, closed-loop assessment-allocation feedback mechanism to enhance response performance.

Post-disaster emergency supply allocation has accordingly attracted substantial research attention [6]. Early contributions primarily relied on static optimization models assuming deterministic demand, employing quantitative methods such as integer linear programming, multi-objective programming, and bi-level programming [7,8] to address problems including emergency supply distribution [9], vehicle routing [10], and facility location-allocation [11]. Despite their methodological rigor, static models are poorly suited to practice, as the inherently dynamic and uncertain nature of disaster events causes demand to fluctuate in ways that static solutions cannot accommodate [12]. Dynamic resource allocation models, by contrast, continuously reassess demand across multiple decision epochs and adapt allocation plans accordingly, yielding more responsive and practically effective relief dispatch strategies [13,14,15,16,17,18].

Demand uncertainty constitutes a core challenge in emergency supply allocation [19]. Stochastic programming approaches have been widely adopted to address this challenge by constructing explicit scenario sets to represent uncertain conditions. Illustratively, Wang et al. [9] proposed a two-stage stochastic programming model to jointly optimize facility location and material allocation over predefined scenarios, while Mohamadi and Yaghoubi [20] developed a bi-objective stochastic model for casualty transportation and medical supply distribution under uncertain demand and response timing. A fundamental limitation of stochastic programming approaches, however, is their dependence on predefined scenario sets [6], which fail to capture dynamic feedback effects of decisions on system state transitions and often fail to reflect the rapidly evolving realities of actual disaster response. This limitation is particularly consequential because allocation decisions made at any point in time directly constrain demand fulfillment capacity in subsequent periods. This sequential interdependence highlights the need for decision frameworks that possess foresight, dynamic updating capabilities, and the capacity to account for the long-run consequences of early-stage decisions [21,22,23,24].

The effectiveness of resource allocation depends not only on optimization methodology but equally on the availability of accurate and timely demand information [25]. In the early phase following a disaster, the true magnitude of need is rarely directly observable, and estimates based on pre-disaster population exposure models and vulnerability indices frequently deviate substantially from actual ground conditions [20]. Field assessment is widely recognized as the primary means of obtaining accurate demand data. However, assessment resources, including rescue teams and unmanned aerial vehicles, are themselves limited, and the problem of deploying them optimally to maximize the informational value of assessments under strict time constraints is a critical operational challenge. More fundamentally, information gathered through assessment must be effectively translated into timely allocation decisions [26]: when assessment and allocation agents operate without adequate coordination, information delays and distortions can rapidly erode the value of field intelligence and undermine response quality [27]. However, research that directly integrates field information into decision optimization models remains limited.

Motivated by observations from both practice and academic research, this paper aims to address the following questions:

(1) How can the assessment process increase the number of people rescued by improving the information required for allocation decisions?
(2) How can the assessment agent and the allocation agent be effectively coordinated within a highly uncertain, time-sensitive operational environment?

Multi-agent systems (MAS) and agent-based modeling (ABM) offer powerful tools for addressing such coordination problems. These approaches have been applied to simulate inter-departmental cooperation in emergency management [28,29], evaluate collaborative response strategies [30], and optimize multi-vehicle logistics scheduling [31,32]. A related strand of research combines agent-based simulation with heuristic optimization methods to support emergency decision-making. Beklaryan et al. [33] modeled adaptive rescuer behavior under emergency conditions using agent-based simulation with modified fuzzy clustering, while Hawe et al. [34] employed agent-based simulation to plan resource allocation for a hypothetical two-site major incident. Building on this foundation, Akopov et al. [35] integrated a parallel bi-objective genetic algorithm with agent-based rescuer simulation to optimize evacuation parameters, and Guerrero Granados et al. [36] further combined an Improved Genetic Algorithm with Multi-objective Particle Swarm Optimization within a coordinated multi-agent framework for technological disaster management. However, existing studies largely focus on macro-organizational behavior or isolated operational subtasks. A unified framework that coherently integrates on-site demand assessment and operational resource allocation as a coupled system remains absent from the literature, and the mechanisms by which human and artificial agents can collaborate to complete a closed-loop assessment-allocation process under high uncertainty are not well understood. The Markov Decision Process (MDP) framework offers a principled foundation for modeling such problems, owing to its strengths in sequential decision-making, explicit state transition modeling, and long-term reward optimization [37,38]. The MDP framework has been successfully applied in emergency management contexts, including ambulance dispatch [39,40] and medical resource allocation [41,42]. 
Although several researchers have extended MDP-based approaches to post-disaster resource allocation [24,43], the coordinated integration of assessment and allocation operations within a unified MDP framework has not been adequately investigated.

To address these research gaps, this study proposes an integrated multi-agent framework for disaster response in which the post-disaster decision-making process is formulated as an MDP. A central artificial Decision-Making Agent (DMA) learns an adaptive response policy from environmental interactions and coordinates the operations of a Demand Assessment Team (AT) and a Resource Allocation Team (ST), both composed of human agents operating in the field, thereby enabling a principled and adaptive assessment-allocation feedback loop. To validate the proposed framework, a case study is conducted based on the 2025 Dingri County earthquake in Tibet, comparing rescue performance with and without integrated demand assessment. Complementary sensitivity analyses on team size provide actionable managerial guidance for configuring multi-agent disaster response systems.

This study makes three primary contributions to the literature. First, it introduces a novel multi-agent collaborative MDP model for emergency material allocation, offering a methodological advance in this domain. While decision-making under uncertainty is inherently challenging in single-agent settings, the systematic integration of multiple specialized agents within an MDP framework remains relatively underexplored, particularly when explicit coordination between demand assessment and material allocation roles is required. This work provides practical guidance for future applications of MDP-based models in complex disaster response environments. Second, it proposes a unified operational framework that coordinates the DMA, AT, and ST within a single allocation process, establishing a structured approach to human and artificial intelligence agent collaboration under conditions of spatiotemporal uncertainty in emergency operations. Third, from a sustainability perspective, this study shows that real-time demand assessment and adaptive material allocation can promote the efficient use of scarce emergency resources, reduce avoidable operational delays, and enhance the resilience of disaster response systems. In this regard, the proposed framework extends emergency supply allocation optimization toward sustainable disaster risk reduction.

The remainder of this paper is structured as follows. Section 2 presents the proposed multi-agent decision model and its mathematical formulation. Section 3 describes the empirical case study and discusses the experimental results. Section 4 presents the conclusions, managerial implications, and directions for future research.

2. Problem Definition and Model Description

2.1. Problem Description

The model proposed in this study focuses on optimizing the coordination between demand assessment and material distribution to maximize the number of lives saved in post-disaster response operations. All sets, indices, parameters, and variables are defined in Table 1.

Two foundational demand concepts are adopted: nominal demand ($N_{i}$) and actual demand ($N_{ir}$). Nominal demand represents estimated supply requirements derived from historical exposure data, serving as the initial planning benchmark for relief operations. Actual demand, by contrast, is determined by field personnel using real-time observations and predictive models upon arrival at the affected zone [44]. The central methodological challenge is to establish a mechanism that enables the efficient transition from nominal to actual demand through structured field assessment, thereby improving allocation accuracy. To address this challenge, this study proposes a multi-agent framework for demand assessment and material allocation comprising three specialized agents, each fulfilling an indispensable and complementary role. The multi-agent structure is essential for managing the complexity of real-time decision-making in disaster environments, where task specialization, inter-agent communication, and adaptive coordination are critical determinants of system performance.

The DMA serves as the central artificial learning agent, responsible for scenario simulation, demand forecast refinement, and allocation guidance. In the initial phase of response, the DMA generates demand forecasts using simulated data drawn from a Gaussian distribution with zero mean and unit variance. As the response progresses, the DMA continuously assimilates real-time field intelligence provided by the AT and the ST to revise its demand estimates, enabling dynamic and context-adaptive adjustments to the overall relief strategy.

The Demand Assessment Team (AT) consists of human agents deployed in the affected areas to collect and evaluate on-site demand data. In addition to gathering field information, the AT conducts preliminary route assessments and provides regular updates to the DMA. When unanticipated events arise, such as sudden demand surges or newly identified hazards, the AT promptly evaluates the situation and communicates this information to the DMA. This feedback mechanism allows the DMA to refine its decision models in real time, ensuring that the relief strategy remains adaptive to evolving ground conditions. The DMA, in turn, guides the AT’s operational priorities by directing field personnel toward zones or issues requiring the most urgent attention.

Due to the time-sensitive nature of emergency response, the efficient routing of field agents is critical. The time elapsed between an agent’s dispatch and the completion of service at each zone directly affects the overall efficiency of material allocation. For an agent serving zone $i$ with start time $SF_{i}$, the service time $\varepsilon_{i}$, the completion time $FT_{i}$, and the start time $SF_{j}$ at the next zone $j$ are calculated using Equations (1), (2) and (3), respectively:

$$\varepsilon_{i} = h_{i} \cdot N_{i} \tag{1}$$

$$FT_{i} = SF_{i} + \varepsilon_{i} \tag{2}$$

$$SF_{j} = FT_{i} + T_{i,j} \tag{3}$$

where the service time $\varepsilon_{i}$ is proportional to the material demand $N_{i}$. The performance factor $h_{i}$ is treated as a zone-specific stochastic parameter to reflect the variability in operational capacity across different field conditions. Equation (2) defines the completion time at zone $i$ as the sum of start time and service time. Equation (3) specifies that the start time at the subsequent zone $j$ equals the completion time at the current zone $i$ plus the inter-zone travel time $T_{i,j}$. Equation (4) represents the survival function $\delta(\theta)$, shown in Figure 1. This function indicates that the survival rate decreases exponentially as time $\theta_{i,j}$ increases; consequently, the visit priority assigned to a region also decreases as time progresses. The reverse S-curve further emphasizes the critical importance of the initial few time units in saving lives, as detailed in Edrisi [45].

$$\delta(\theta) = \alpha e^{b \theta^{c}} + d, \quad \alpha, c, d > 0, \; b < 0, \; \alpha + d < 1, \; d \ll \alpha \tag{4}$$
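The timing recursion in Equations (1)–(3) and the survival function in Equation (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the parameter values (`alpha`, `b`, `c`, `d`), the performance factors `h`, the demands `N`, and the travel time table `T` are all hypothetical numbers chosen only to satisfy the stated sign conditions.

```python
import math

def survival(theta, alpha=0.9, b=-0.05, c=2.0, d=0.05):
    """Equation (4): alpha * exp(b * theta**c) + d (illustrative parameters)."""
    return alpha * math.exp(b * theta ** c) + d

def schedule(route, first_start, h, N, T):
    """Apply Equations (1)-(3) along an ordered route.

    Returns {zone: (start, service, finish)}.
    """
    times, start = {}, first_start
    for pos, i in enumerate(route):
        eps = h[i] * N[i]                  # Eq. (1): service time
        finish = start + eps               # Eq. (2): completion time
        times[i] = (start, eps, finish)
        if pos + 1 < len(route):
            # Eq. (3): start time at the next zone on the route
            start = finish + T[(i, route[pos + 1])]
    return times

# Hypothetical two-zone route; every number below is made up.
h = {"A": 0.01, "B": 0.02}
N = {"A": 100, "B": 50}
T = {("A", "B"): 2.0}
times = schedule(["A", "B"], 0.0, h, N, T)

# Survival probability evaluated at each zone's start time.
priorities = {zone: survival(t[0]) for zone, t in times.items()}
```

Under these assumed numbers, the zone reached later receives a lower survival probability, which is the effect the text describes as a declining visit priority.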

The Resource Allocation Team (ST) consists of human agents responsible for distributing material resources in accordance with the DMA’s directives. In addition to executing prescribed allocation plans, ST agents continuously monitor ground conditions for unforeseen disruptions, including road damage and infrastructure failures. Upon identifying such obstacles, the ST immediately notifies the DMA, which can then revise routing instructions or reformulate the distribution plan. This real-time disruption detection and reporting capability represents a critical field intelligence function that the DMA alone cannot replicate. Figure 2 illustrates how the three agents collaborate within the proposed model.

2.2. Markov Decision Process Formulation

The multi-agent coordination problem described above is formally modeled as a Markov Decision Process (MDP). The MDP formulation presented in this section is based on standard frameworks in the operations research and reinforcement learning literature [24]. The original contributions of this study lie in the problem-specific state space and decision variable design, the multi-agent coordination structure, and the survival-probability-based reward function, which are tailored to the post-disaster demand assessment and material allocation context.

2.2.1. State Variable

The system state $S_{t} = (M_{t}, D_{t}) \in S$ at decision epoch $t$ consists of two components: material status and zone request status. The material status component is defined as $M_{t} = (M_{tk})_{k \in K} = (L^{t}, R^{tk})_{k \in K}$, where each element $M_{tk}$ characterizes supply point $k$, indicating which disaster zones can be served and the total materials available $R^{tk}$. The ordered list $L^{t}$ specifies the sequence in which zones are to receive material allocation. The zone request status component is denoted $D_{t} = (D_{ti})_{i \in I}$, where $D_{ti} = (O^{ti}, e^{ti}, W^{ti})$ encodes the geographic location, demand level, and service status of each zone. The demand level $e^{ti}$ is treated as a random variable following a specified probability distribution $J(\mu_{ti}, \sigma_{ti})$ (e.g., normal or log-normal) that captures the inherent uncertainty in post-disaster needs [46,47]. The binary indicator $W^{ti} = 0$ denotes that zone $i$ is currently unserved. The complete state space $S$ contains all feasible realizations of $S_{t}$.

2.2.2. Decision Variable

For each state $S_{t} \in S$, the feasible action set $X_{S_{t}}$ defines all permissible allocation and routing decisions:

$$X_{S_{t}} = \{(A_{ti})_{i \in I}, (B_{ti})_{i \in I}\} \tag{5}$$

The routing component $A_{ti} = (a_{tij}^{z})_{(i,j) \in z} \in \{0, 1\}$ encodes whether the AT selects route $z$ when moving from zone $i$ to zone $j$, while the allocation component $B_{ti} = (b_{ti}^{k})_{i \in L^{t}}$ specifies the quantity of materials allocated from supply point $k$ to disaster zone $i$, following the allocation sequence defined by $L^{t}$. The composite action is:

$$x_{t} = \left((a_{tij}^{z})_{(i,j) \in z}, (b_{ti}^{k})_{i \in L^{t}}\right) \tag{6}$$

All decisions must satisfy the following operational constraints:

$$\sum_{i \in I} b_{ti}^{k} \leq R^{tk}, \quad \forall i \in I, \; i \in L^{t}, \; \forall k \in K \tag{7}$$

$$b_{ti}^{k} \leq e^{ti}, \quad \forall i \in I, \; \forall k \in K \tag{8}$$

$$a_{tij}^{z} \in \{0, 1\}, \quad \forall i, j \in I, \; z = \{1, 2, 3, \dots\} \tag{9}$$

$$\sum_{j \in I} a_{tij}^{z} \leq 1, \quad \forall i \in I, \; (i, j) \in z \tag{10}$$

Equation (7) ensures that the total quantity of materials dispatched across all zones does not exceed the available supply at each depot. Equation (8) restricts material allocation to each zone to its assessed actual demand, preventing over-allocation. Equations (9) and (10) jointly enforce that each zone selects at most one route for movement to the next target zone, ensuring routing feasibility.
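As a rough illustration of how Constraints (7)–(10) might be checked for a candidate action, the sketch below uses plain dictionaries. The data layout (`b[k][i]` for allocations, `a[(i, j, z)]` for route choices) and the non-negativity check on allocations are assumptions made for this example; Constraint (10) is read here as "at most one outgoing route per origin zone".

```python
def is_feasible(b, a, R, e):
    """Check a candidate action against Constraints (7)-(10).

    b[k][i]      quantity sent from supply point k to zone i
    a[(i, j, z)] 1 if route z from zone i to zone j is selected, else 0
    R[k]         stock available at supply point k
    e[i]         assessed demand of zone i
    """
    for k, row in b.items():
        if sum(row.values()) > R[k]:           # Constraint (7): stock limit
            return False
        for i, qty in row.items():
            # Constraint (8): no over-allocation (non-negativity is assumed)
            if qty < 0 or qty > e[i]:
                return False
    if any(v not in (0, 1) for v in a.values()):   # Constraint (9): binary
        return False
    outgoing = {}
    for (i, _j, _z), v in a.items():
        outgoing[i] = outgoing.get(i, 0) + v
    # Constraint (10): at most one selected route leaving each zone
    return all(total <= 1 for total in outgoing.values())

# Hypothetical instance.
R = {"k1": 100}
e = {"A": 60, "B": 70}
b_ok = {"k1": {"A": 50, "B": 40}}
b_over = {"k1": {"A": 60, "B": 70}}
a_ok = {("A", "B", 1): 1, ("A", "B", 2): 0}
a_bad = {("A", "B", 1): 1, ("A", "C", 1): 1}
```

With these made-up numbers, `is_feasible(b_ok, a_ok, R, e)` holds, while `b_over` violates the stock limit and `a_bad` selects two routes out of the same zone.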

2.3. Markov Decision Process-Based Recursive Model

To maximize the total number of individuals rescued, the model integrates the objectives of all three agents into a unified global objective, defined as the product of the survival probability function and the material allocation quantity, rather than optimizing each agent’s objective independently. This formulation captures the joint effect of timely resource delivery and adequate supply on rescue outcomes.

The value function $V(S_{t})$ denotes the maximum expected total discounted reward achievable from state $S_{t}$ under the optimal policy. A myopic baseline, which maximizes only the immediate reward $C(S_{t}, x_{t})$ at each decision epoch, provides a lower-performance benchmark; however, it disregards the long-run consequences of current decisions on future system states [39].

$$V_{0}(S_{t}) = \max_{x_{t}} C(S_{t}, x_{t}) \tag{11}$$

This limitation matters because, given the dynamic and resource-constrained nature of disaster response, routing and allocation decisions made at any epoch directly shape the system’s capacity to address demand in subsequent periods. The full dynamic formulation based on the Bellman optimality principle is [38,39]:

$$V(S_{t}) = \max_{x_{t}} \left( C(S_{t}, x_{t}) + \gamma \sum_{S_{t+1} \in S} P(S_{t+1} \mid S_{t}, x_{t}) \, V(S_{t+1}) \right) \tag{12}$$

where $P$ is the one-step transition matrix encoding the stochastic evolution of demand and network conditions, and $\gamma \in [0, 1]$ is the discount factor governing the relative weight assigned to delayed rewards. The contribution function $C(S_{t}, x_{t})$ aggregates rescue effects across all zones:

$$C(S_{t}, x_{t}) = \sum_{i \in I} o_{t}^{i, x_{t}} \tag{13}$$

where $o_{t}^{i, x_{t}} = b_{ti}^{k} \, \delta(\theta_{i,j})$, $i, j \in I$, quantifies the rescue effect at zone $i$ by combining the material allocation quantity with the time-dependent survival probability [45]. Let $X^{\pi}(S_{t})$ be a decision function (i.e., policy) that depicts allocation decisions for each state $S_{t} \in S$. The optimization objective is to identify the policy $\pi^{*}$ that maximizes the expected total discounted contribution over a finite planning horizon $T$:

$$\max_{\pi \in \Pi} E^{\pi} \sum_{t=0}^{T} \gamma^{t} C(S_{t}, X^{\pi}(S_{t})) \tag{14}$$
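To make Equations (13) and (14) concrete, the following sketch computes the contribution of one epoch as allocated quantity times survival probability, and then the discounted sum along a single realized trajectory. The survival parameters, the allocations, and the elapsed times are hypothetical; the expectation in Equation (14) would be approximated by averaging such returns over many simulated trajectories.

```python
import math

def survival(theta, alpha=0.9, b=-0.05, c=2.0, d=0.05):
    """Survival function of Equation (4) with illustrative parameters."""
    return alpha * math.exp(b * theta ** c) + d

def contribution(allocations, elapsed):
    """Equation (13): sum over zones of quantity * survival probability.

    allocations[i] is the quantity delivered to zone i,
    elapsed[i] is the time at which zone i is reached.
    """
    return sum(qty * survival(elapsed[i]) for i, qty in allocations.items())

def discounted_return(contributions, gamma):
    """Discounted sum inside Equation (14) for one trajectory."""
    return sum(gamma ** t * c_t for t, c_t in enumerate(contributions))

# Hypothetical epoch: the same quantity is worth less when delivered later.
early = contribution({"A": 100}, {"A": 0.0})
late = contribution({"A": 100}, {"A": 6.0})
total = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The comparison between `early` and `late` shows why the reward couples allocation size with timing: delaying the same delivery lowers the contribution.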

The state transition function $\varphi$ describes the deterministic mapping from the current state-action pair $(S_{t}, x_{t})$ to the subsequent system state, such that $S_{t+1} = \varphi(S_{t}, x_{t})$. The optimal policy $\pi^{*}$ satisfies the Bellman optimality equations [24,40] given in Equations (15) and (16).

$$V(S_{t}) = \max_{x_{t} \in X_{S_{t}}} \left( C(S_{t}, x_{t}) + \gamma E\left[ V(S_{t+1}) \mid S_{t}, x_{t} \right] \right) \tag{15}$$

$$\pi^{*}(S_{t}) = \arg\max_{x_{t}} \left[ E\left( C(S_{t}, x_{t}) \right) + \gamma E\left[ V^{*}(\varphi(S_{t}, x_{t})) \right] \right] \tag{16}$$
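The gap between the myopic rule of Equation (11) and the Bellman recursion of Equations (12) and (15) can be seen on a toy problem. The sketch below runs finite-horizon backward induction on a three-state example invented for illustration (the states, actions, and rewards are not from the case study): an action with a smaller immediate reward leads to a state where a larger reward is still reachable.

```python
def backward_induction(states, actions, reward, transition, gamma, horizon):
    """Finite-horizon Bellman recursion; returns (values, per-epoch rules)."""
    V = {s: 0.0 for s in states}
    plan = []
    for _ in range(horizon):
        new_V, rule = {}, {}
        for s in states:
            scores = {
                x: reward(s, x)
                + gamma * sum(p * V[s2] for s2, p in transition(s, x).items())
                for x in actions(s)
            }
            rule[s] = max(scores, key=scores.get)
            new_V[s] = scores[rule[s]]
        V = new_V
        plan.insert(0, rule)   # rules are built from the last epoch backwards
    return V, plan

# Hypothetical toy problem.
ACTIONS = {"s0": ["greedy", "patient"], "good": ["serve"], "done": ["wait"]}
REWARD = {("s0", "greedy"): 5.0, ("s0", "patient"): 3.0,
          ("good", "serve"): 6.0, ("done", "wait"): 0.0}
NEXT = {("s0", "greedy"): "done", ("s0", "patient"): "good",
        ("good", "serve"): "done", ("done", "wait"): "done"}

V, plan = backward_induction(
    list(ACTIONS),
    lambda s: ACTIONS[s],
    lambda s, x: REWARD[(s, x)],
    lambda s, x: {NEXT[(s, x)]: 1.0},
    gamma=0.9,
    horizon=2,
)
myopic_choice = max(ACTIONS["s0"], key=lambda x: REWARD[("s0", x)])
```

In this toy case the myopic rule picks `greedy`, whereas the two-epoch recursion picks `patient` in the first epoch because the discounted follow-up reward outweighs the smaller immediate one.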

Obtaining $\pi^{*}$ from $V^{*}$ requires evaluating $V^{*}(\varphi(S_{t}, x_{t}))$ across all state-action pairs $(S_{t}, x_{t})$. This is computationally intractable in practice due to the lack of complete information on state-action pairs. Therefore, we employ the Q-function as a tractable approximation for $V^{*}$, enabling the DMA to learn a near-optimal policy through iterative interaction with the simulated environment:

$$Q(S_{t}, x_{t}) = E\left[ C(S_{t}, x_{t}) + \gamma \max_{x_{t+1}} Q(S_{t+1}, x_{t+1}) \right] \tag{17}$$

Based on the above definitions of $o_{t}^{i, x_{t}}$, $C(S_{t}, x_{t})$ represents the total number of individuals rescued across all zones at epoch $t$, while the Q-value $Q(S_{t}, x_{t})$ represents the expected discounted cumulative number of rescued individuals over the planning horizon, starting from a given state $S_{t}$ and action $x_{t}$ under the optimal policy.
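Equation (17) is the fixed point that Q-learning approaches through sampled updates. The sketch below uses a tabular Q-table on a tiny invented environment rather than the deep network adopted in this study, so it should be read only as an illustration of the update rule; the learning rate, exploration rate, and the environment itself are assumptions.

```python
import random
from collections import defaultdict

def q_learning(step, actions, start, episodes=2000, gamma=0.9,
               lr=0.1, eps=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        while s is not None:                      # None ends the episode
            acts = actions(s)
            if rng.random() < eps:
                x = rng.choice(acts)              # explore
            else:
                x = max(acts, key=lambda a: Q[(s, a)])   # exploit
            s2, r = step(s, x)
            future = 0.0 if s2 is None else max(Q[(s2, a)] for a in actions(s2))
            # Move Q(s, x) toward the sampled target r + gamma * max Q(s', .)
            Q[(s, x)] += lr * (r + gamma * future - Q[(s, x)])
            s = s2
    return Q

# Hypothetical environment: (next_state, reward) for each state-action pair.
ACTIONS = {"s0": ["greedy", "patient"], "good": ["serve"]}
STEP = {("s0", "greedy"): (None, 5.0),
        ("s0", "patient"): ("good", 3.0),
        ("good", "serve"): (None, 6.0)}
Q = q_learning(lambda s, x: STEP[(s, x)], lambda s: ACTIONS[s], "s0")
```

After training on this toy environment, the learned value of `patient` in `s0` exceeds that of `greedy`, reflecting the discounted downstream reward that a purely immediate comparison would miss.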

The multi-agent decision coordination process is provided in Appendix A.

3. Solution Approach

Upon the onset of a disaster, the system transitions to the online response phase, during which the DMA receives the current environment state from the AT or ST and selects the most appropriate feasible action. The decision problem formulated in Equation (14) is an optimal control problem that aims to maximize the expected discounted cumulative contribution over the disaster-response horizon. In principle, such an optimal control problem can be solved using dynamic programming if the complete transition probability model and reward structure are explicitly known. However, in the proposed post-disaster resource allocation problem, the state-action space is large and high-dimensional, zone demands evolve stochastically, route accessibility changes dynamically, and transition probabilities are difficult to obtain in closed form [48]. Therefore, exact dynamic programming is computationally intractable.

To address the challenges posed by large, high-dimensional state-action spaces and to support stable and efficient learning, we adopt a Dueling Double Deep Q-Network with Prioritized Experience Replay (D3QN-PER) as the learning algorithm [48,49,50]. Unlike exact dynamic programming, deep reinforcement learning learns a parameterized policy directly from environmental interactions without requiring an explicit model of state transitions, making it well-suited to the sequential, uncertain nature of the problem [48]. It should be noted that this study does not employ external metaheuristic optimization methods, such as genetic algorithms, particle swarm optimization, differential evolution, or exact dynamic programming, to generate precomputed optimal or suboptimal decision labels for actions. Instead, training samples are generated as state-transition experiences during simulation, and the policy is improved through temporal-difference learning.

The standard D3QN-PER architecture is drawn from the deep reinforcement learning literature; its technical components are briefly described in Section 3.3 for completeness. The distinguishing features of this study lie in three problem-specific elements embedded within this framework: (i) the multi-agent field intelligence integration mechanism that updates the state representation in real time; (ii) the survival-probability-based reward function that couples material allocation with time-sensitive rescue outcomes; and (iii) the action masking mechanism that enforces operational feasibility constraints at every decision epoch.

3.1. Agent Collaboration in the Proposed Model

This subsection describes the problem-specific coordination mechanism, which represents a distinguishing methodological feature of the proposed framework. At each decision epoch $t$, the DMA maintains the environment state $S_{t} = (M_{t}, D_{t}) \in S$, where $M_{t} = (L^{t}, R^{tk})_{k \in K}$ denotes resource status and $D_{t} = (O^{ti}, e^{ti}, W^{ti})_{i \in I}$ captures the request status of each zone $i$. Zone demand is modeled as a stochastic process $e^{ti} \sim J(\mu_{ti}, \sigma_{ti})$, reflecting the spatiotemporal uncertainty inherent in post-disaster conditions. During online operation, the AT furnishes the DMA with a field intelligence set $l_{t}^{AT} = \{(\tilde{e}_{ti}, \tilde{\mu}_{ti}, \tilde{\sigma}_{ti}), \tilde{\varrho}_{ij}\}$, where $\tilde{e}_{ti}$ is the demand information for zone $i$, and $\tilde{\varrho}_{ij} \in \{0, 1\}$ indicates the operational feasibility of the route between zones $i$ and $j$. The DMA integrates this field intelligence to construct the updated state $\tilde{S}_{t} = H(S_{t}, l_{t}^{AT})$, and subsequently makes decisions based on $\tilde{S}_{t}$, thereby reducing allocation decision bias. The DMA selects action $x_{t} = (A_{t}, B_{t})$ so as to maximize the expected discounted total contribution in Equation (14). After the DMA outputs an action $x_{t} \in X_{\tilde{S}_{t}}$ subject to Equations (5) and (6), the ST executes the prescribed allocation plan, realizing contribution $r_{t+1} = C(\tilde{S}_{t}, x_{t})$, updated resource levels $R^{(t+1)k}$, revised service-state indicators $W^{(t+1)i}$, and the subsequent environment state $S_{t+1}$, in which zone demands are resampled as $e^{(t+1)i} \sim J(\mu_{(t+1)i}, \sigma_{(t+1)i})$. The DMA then forms $\tilde{S}_{t+1}$ in a similar manner and stores the experience transition tuple $(\tilde{S}_{t}, x_{t}, r_{t+1}, \tilde{S}_{t+1})$ in the prioritized replay buffer for subsequent network training.
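The text does not spell out the form of the integration operator $H$, so the sketch below assumes the simplest possible rule: field reports from the AT overwrite the DMA's prior demand parameters and route flags, and zones without a report keep their prior values. The `Zone` record and the function `integrate` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Zone:
    location: tuple   # geographic location O
    demand: float     # demand level e
    mu: float         # mean of the demand distribution J
    sigma: float      # spread of the demand distribution J
    served: int       # service indicator W (0 = unserved)

def integrate(zones, route_ok, intel_demand, intel_routes):
    """Assumed form of H: overwrite priors with AT field reports.

    intel_demand[i] = (e_obs, mu_obs, sigma_obs) for assessed zones only
    intel_routes[(i, j)] = 0/1 feasibility flag reported by the AT
    """
    updated = dict(zones)
    for i, (e_obs, mu_obs, sigma_obs) in intel_demand.items():
        updated[i] = replace(zones[i], demand=e_obs, mu=mu_obs, sigma=sigma_obs)
    routes = {**route_ok, **intel_routes}   # reported flags take precedence
    return updated, routes

# Hypothetical prior state and one AT report.
zones = {"A": Zone((0, 0), 80.0, 80.0, 20.0, 0),
         "B": Zone((3, 4), 40.0, 40.0, 10.0, 0)}
route_ok = {("A", "B"): 1}
new_zones, new_routes = integrate(
    zones, route_ok,
    intel_demand={"A": (120.0, 120.0, 5.0)},
    intel_routes={("A", "B"): 0},
)
```

A more careful design might blend prior and observed parameters (for example, a Bayesian update) instead of overwriting them; the overwrite rule is used here only because it is the least speculative choice.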

3.2. Feasibility Enforcement

Action masking is a problem-specific mechanism introduced to address the large set of infeasible actions arising from route disruptions and zone service constraints in disaster environments [51]. Unlike standard settings in previous studies, where all actions are typically assumed feasible, the highly constrained nature of disaster logistics requires explicit enforcement of feasibility at every decision epoch to prevent operationally invalid decisions. The feasible decision set

X_{{\tilde{S}}_{t}}

is constrained by Equations (7)–(10). Additional operational feasibility conditions are enforced during online dispatch: routes for which

{\tilde{ϱ}}_{ij}

= 0 are treated as physically infeasible and excluded from consideration; zones already under service (

W^{t i}

= 1) are excluded from the candidate set to prevent duplicated resource deployment. To ensure that the DMA never outputs invalid actions, action masking is implemented by restricting decisions to the feasible action subset:

f ({\tilde{S}}_{t}) = \{x \in X_{{\tilde{S}}_{t}} | x s a t i s f i e s E q u a t i o n s (7) - (9) a n d a l l r o u t e, s e r v i c e f e a s i b i l i t y c o n d i t i o n s\}

(18)

The DMA is constrained to select actions exclusively from

f ({\tilde{S}}_{t})

, guaranteeing operational validity at every decision epoch.
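The masking step can be illustrated with a minimal Python sketch. It assumes a flat action encoding in which each action is a candidate destination zone; `route_open[j]` stands in for the route indicator being 1 and `in_service[j]` for the service-state indicator being 1. The function names and sample values are hypothetical, and the remaining constraints of Equations (7)–(10) are not modelled here.

```python
import math

def feasible_mask(route_open, in_service):
    """True for zones that are reachable and not already under service."""
    return [bool(r) and not bool(w) for r, w in zip(route_open, in_service)]

def masked_greedy_action(q_values, mask):
    """Pick the highest-Q action among feasible ones only."""
    if not any(mask):
        raise ValueError("no feasible action in this state")
    # Infeasible actions are assigned -inf so they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, mask)]
    return max(range(len(masked)), key=masked.__getitem__)

# Illustrative values only (not taken from the case study).
route_open = [1, 0, 1, 1]   # the link to zone 1 is impassable
in_service = [0, 0, 1, 0]   # zone 2 is already being served
q_values = [0.4, 9.0, 7.5, 1.2]

mask = feasible_mask(route_open, in_service)
action = masked_greedy_action(q_values, mask)
```

In this toy state the two highest Q-values belong to infeasible zones, so the masked selection falls back to the best remaining feasible zone.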

3.3. D3QN-PER: Neural Architecture

1. Bellman optimality and Q-network approximation

The optimal action-value function satisfies the Bellman optimality equation:

Q^{*} ({\tilde{S}}_{t}, x_{t}) = E [C ({\tilde{S}}_{t}, x_{t}) + γ \max_{x_{t + 1}} Q^{*} ({\tilde{S}}_{t + 1}, x_{t + 1})]

(19)

The optimal action-value function

Q^{*} ({\tilde{S}}_{t}, x_{t})

is approximated by a parameterized neural network

Q ({\tilde{S}}_{t}, x_{t}; ϑ),

where

ϑ

denotes the set of trainable network weights and

Q ({\tilde{S}}_{t}, x_{t}; ϑ)

represents the value estimate produced by the current Q-network for state

{\tilde{S}}_{t}

and action

x_{t}

, following the Deep Q-Network (DQN) paradigm of Mnih et al. [50].

2. Dueling architecture and Double DQN target

The dueling network architecture decomposes the Q-value into a state value term

V ({\tilde{S}}_{t}; ϑ)

and an action-specific advantage term

g ({\tilde{S}}_{t}, x_{t}; ϑ)

, as formalized in Equation (20):

Q ({\tilde{S}}_{t}, x_{t}; ϑ) = V ({\tilde{S}}_{t}; ϑ) + \left(g ({\tilde{S}}_{t}, x_{t}; ϑ) - \frac{1}{|f ({\tilde{S}}_{t})|} \sum_{x^{'} \in f ({\tilde{S}}_{t})} g ({\tilde{S}}_{t}, x^{'}; ϑ)\right)

(20)
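A small Python sketch of the aggregation in Equation (20) follows. It assumes that the advantage mean is taken over the feasible set only, as the normalization by the size of the feasible set suggests. The function name and the numbers are illustrative, not the authors' implementation.

```python
def dueling_q(value, advantages, mask):
    """Combine V and advantages as in Eq. (20), averaging over feasible actions only."""
    feasible = [a for a, ok in zip(advantages, mask) if ok]
    if not feasible:
        raise ValueError("empty feasible set")
    mean_adv = sum(feasible) / len(feasible)
    # Infeasible actions are returned as None so that they cannot be selected.
    return [value + a - mean_adv if ok else None for a, ok in zip(advantages, mask)]

# Illustrative numbers: three actions, the second one infeasible.
q = dueling_q(value=10.0, advantages=[1.0, 5.0, 3.0], mask=[True, False, True])
```

Subtracting the mean advantage means the feasible Q-values average back to the state value, which is what makes the value and advantage streams identifiable.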

Standard Q-learning suffers from systematic overestimation due to the maximization operator. Double DQN alleviates this bias by decoupling action selection from action evaluation when forming the TD target. Specifically, the next action is selected using the online network parameters

ϑ

, while its value is evaluated using the target network parameters

ϑ^{-}

. The resulting TD target is:

y_{t} = r_{t + 1} + γ Q ({\tilde{S}}_{t + 1}, \arg\max_{x^{'} \in f ({\tilde{S}}_{t + 1})} Q ({\tilde{S}}_{t + 1}, x^{'}; ϑ); ϑ^{-}),

(21)

where

y_{t}

is the TD target used to update the Q-value estimate for (

{\tilde{S}}_{t}, x_{t}

), and the target network parameters

ϑ^{-}

are periodically synchronized with the online network to provide stable value estimates throughout training.
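The decoupling of action selection from action evaluation can be sketched in a few lines of Python. The sketch reads Equation (21) as maximizing over the feasible actions of the successor state, which is an interpretation. The handling of terminal states and empty feasible sets is an added assumption, and all names and values are hypothetical.

```python
import math

def double_dqn_target(reward, gamma, q_online_next, q_target_next, next_mask, terminal=False):
    """TD target: select the next action with the online network, evaluate it with the target network."""
    if terminal or not any(next_mask):
        # Assumption: no bootstrapping at episode end or when no action is feasible.
        return reward
    masked = [q if ok else -math.inf for q, ok in zip(q_online_next, next_mask)]
    best = max(range(len(masked)), key=masked.__getitem__)  # selection: online network
    return reward + gamma * q_target_next[best]             # evaluation: target network

# Illustrative values; gamma = 0.98 is the discount factor used in the case study.
y = double_dqn_target(
    reward=5.0,
    gamma=0.98,
    q_online_next=[2.0, 8.0, 6.0],
    q_target_next=[1.0, 4.0, 3.0],
    next_mask=[True, False, True],
)
```

Here the online network prefers the infeasible second action, so the mask redirects the selection to the third action, whose value is then read from the target network.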

3. Prioritized Experience Replay (PER)

PER is a prioritized sampling mechanism that improves the efficiency and stability of Q-network training by replaying transitions with larger temporal-difference errors more frequently.

In standard experience replay, transitions sampled uniformly from the experience replay buffer (ERB) may contain limited learning signal, thereby reducing sample efficiency. Prioritized Experience Replay (PER) addresses this issue by assigning each stored transition a sampling probability proportional to the magnitude of its TD error [52], ensuring that transitions with greater informational content receive disproportionately higher training attention [48]. For the

K

-th transition stored in the ERB, we define the TD error as:

℘_{K} = y_{K} - Q ({\tilde{S}}_{K}, x_{K}; ϑ)

(22)

where

℘_{K}

denotes the TD error of transition

K

. The TD error quantifies the discrepancy between the Q-value estimates produced by the online (prediction) and target networks, and directly determines the priority assigned to each stored transition. The priority score for transition

K

is defined as

ψ_{K} = | ℘_{K} | + Ⅎ,

(23)

where

Ⅎ > 0

is a small constant ensuring that every transition retains a non-zero sampling probability. Based on these priority scores, the probability of sampling the

K

-th transition is:

Λ (K) = \frac{ψ_{K}^{A}}{\sum_{j} ψ_{j}^{A}}

(24)

where

A \in [0, 1]

governs the degree of prioritization (

A

= 0 recovers uniform sampling). Since non-uniform sampling introduces bias relative to the uniform replay distribution, PER applies an importance-sampling (IS) correction weight to each sampled transition [53]:

Φ_{K} = {\left(\frac{1}{N \times Λ (K)}\right)}^{β}

(25)

where

β \in [0, 1]

controls the strength of the IS correction and is typically annealed toward 1 during training, and N denotes the current number of transitions stored in the ERB. The PER-weighted loss function for Q-network training is given by

L (ϑ) = E [Φ_{K} {(y_{K} - Q ({\tilde{S}}_{K}, x_{K}; ϑ))}^{2}]

(26)

where

ϑ

is updated via stochastic gradient descent on this weighted loss. The target network is periodically synchronized with the online Q-network by copying

ϑ^{-} \leftarrow ϑ

. The complete D3QN-PER training procedure is presented in Algorithm 1.
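The priority, sampling, and importance-sampling steps of Equations (23)–(26) can be sketched with the Python standard library alone. In this sketch the small constant `eps`, the value of `beta`, the max-normalization of the weights, and the toy TD errors are assumptions made for illustration. Only the priority exponent 0.6 comes from Table A3.

```python
import random

def priorities(td_errors, eps=1e-6):
    """Eq. (23): priority = |TD error| + small positive constant."""
    return [abs(d) + eps for d in td_errors]

def sampling_probs(prios, alpha):
    """Eq. (24): sampling probability proportional to priority**alpha."""
    scaled = [p ** alpha for p in prios]
    total = sum(scaled)
    return [s / total for s in scaled]

def is_weights(probs, beta, normalize=True):
    """Eq. (25): (1 / (N * P))**beta; dividing by the maximum is a common extra step (assumed)."""
    n = len(probs)
    w = [(1.0 / (n * p)) ** beta for p in probs]
    if normalize:
        m = max(w)
        w = [x / m for x in w]
    return w

def weighted_loss(weights, targets, q_preds):
    """Eq. (26) on a mini-batch: mean of weight * (target - prediction)**2."""
    return sum(w * (y - q) ** 2 for w, y, q in zip(weights, targets, q_preds)) / len(weights)

# Illustrative TD errors for a four-transition buffer (not from the paper).
td_errors = [0.5, -2.0, 0.0, 1.0]
probs = sampling_probs(priorities(td_errors), alpha=0.6)     # exponent from Table A3
weights = is_weights(probs, beta=0.4)                        # beta is an assumed value
uniform = sampling_probs(priorities(td_errors), alpha=0.0)   # exponent 0 gives uniform sampling

rng = random.Random(0)
batch = rng.choices(range(len(probs)), weights=probs, k=2)   # indices of sampled transitions
loss = weighted_loss([weights[i] for i in batch], targets=[1.0, 2.0], q_preds=[0.5, 2.5])
```

The transition with the largest absolute TD error is sampled most often and receives the smallest importance-sampling weight, which is how the correction offsets the over-sampling.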

Algorithm 1: D3QN-PER Training Procedure

Input: discount factor

γ

, replay buffer capacity N, mini-batch size B,
learning rate, target update interval, feasibility masking
rules (Equations (7)–(10), served-zone exclusion rule, route feasibility conditions), PER parameters (

A

,

β

).
Output: learned DMA policy

π (\tilde{S}) = \arg\max_{x \in f (\tilde{S})} Q (\tilde{S}, x; ϑ)

.
1. Initialize online dueling Q-network

Q (\tilde{S}, x; ϑ)

and target network

Q (\tilde{S}, x; ϑ^{-}) \leftarrow

Q (\tilde{S}, x; ϑ)

.
2. Initialize prioritized replay buffer

ϰ

with capacity N.
3. For each training episode (disaster scenario) do
4. Initialize state

S_{0} = (M_{0}, D_{0})

, and form the initial augmented state

{\tilde{S}}_{0}

, where

e^{0 i} \sim J (μ_{0 i}, σ_{0 i})

.
5. for t = 0, 1, …, T − 1 do
6. AT provides field intelligence

l_{t}^{D A T} = \{({\tilde{e}}_{ti}, {\tilde{μ}}_{ti}, {\tilde{σ}}_{ti}), {\tilde{ϱ}}_{ij}\}

;
The DMA updates the state

{\tilde{S}}_{t} = H (S_{t}, l_{t}^{D A T})

.
7. Construct the feasible set

f ({\tilde{S}}_{t})

via constraints (7)–(10) and

W^{t i}

.
8. Select action

x_{t}

: with probability

ϵ

, sample

x_{t}

uniformly from

f ({\tilde{S}}_{t})

;
9. Otherwise,

x_{t} = \arg\max_{x \in f ({\tilde{S}}_{t})} Q ({\tilde{S}}_{t}, x; ϑ)

.
10. The ST executes the prescribed allocation; the AT executes routing if applicable; compute reward

r_{t + 1} = C ({\tilde{S}}_{t}, x_{t})

and next state

S_{t + 1}

;
11. Construct the next augmented state

{\tilde{S}}_{t + 1} = H (S_{t + 1}, l_{t + 1}^{D A T})

by updating

R^{k (t + 1)}

,

W^{i (t + 1)}

and sampling new demands

e^{(t + 1) i} \sim J (μ_{(t + 1) i}, σ_{(t + 1) i})

.
12. Store transition (

{\tilde{S}}_{t}, x_{t}

,

r_{t + 1}

,

{\tilde{S}}_{t + 1}

) in

ϰ

with maximum priority

ψ

.
13. Sample a mini-batch of B transitions from

ϰ

using

Λ (K)

.
14. For each sampled transition K do
15. Compute TD target

y_{K}

using Equation (21).
16. Compute TD error and update priority:

℘_{K} = y_{K} - Q ({\tilde{S}}_{K}, x_{K}; ϑ)

;

ψ_{K} = | ℘_{K} | + Ⅎ

.
17. Compute IS weight

Φ_{K}

using Equation (25).
18. Update ϑ by minimizing the loss function

L (ϑ) = E [Φ_{K} {(y_{K} - Q ({\tilde{S}}_{K}, x_{K}; ϑ))}^{2}]

.
19. At every target update interval

τ

, synchronize the target network

ϑ^{-} \leftarrow

ϑ

.
20. end for
21. end for
22. Return the learned policy

π (\tilde{S}) = \arg\max_{x \in f (\tilde{S})} Q (\tilde{S}, x; ϑ)

.

4. Case Study

This study employs the M6.8 earthquake that struck Dingri County, Shigatse City, Tibet Autonomous Region on 7 January 2025, as an empirical test case to assess the applicability and practical relevance of the proposed framework. According to the China Earthquake Networks Center (CENC), the epicenter was located at 87.45° E, 28.50° N, at a focal depth of approximately 10 km. As of 14 January 2025, a total of 3614 aftershocks had been recorded, with the largest reaching magnitude 5.0. According to official statistics released by the Ministry of Emergency Management of the People’s Republic of China and the Tibet Autonomous Region Emergency Management Department, the earthquake caused direct economic losses of RMB 34.884 billion, affected approximately 300,500 people, injured 860 individuals, and caused the collapse of 2865 houses and damage to an additional 38,806 structures. In response, a national Level II emergency protocol was activated; the Tibet Autonomous Region elevated its response to Level II; and a field command headquarters was established to coordinate rescue operations. Although the response involved a broad coalition of actors, including government agencies, the People’s Liberation Army, non-governmental organizations, and community volunteers, relief supply allocation remained largely fragmented across organizations. Such fragmentation can reduce inter-agency coordination efficiency and overall response effectiveness [49,54]. Therefore, this case provides a highly relevant empirical context for investigating how structured demand assessment can improve allocation decisions under conditions of high spatiotemporal uncertainty, heterogeneous demand profiles, and complex logistics networks.

4.1. Study Area and Data Settings

The study area comprises twelve townships with seismic intensities of VII (Modified Mercalli Scale) or above, namely Quluo Township (QLT), Changsuo Township (CT), Cuoguo Township (TT), Nixia Township (NT), Jiacuo Township (GT), Qudang Township (QDT), Xiegeer Town (XGT), Zhaxizong Township (ZT), Mangpu Township (MT), Lazi Town (LT), Quxia Town (QT), and Xiyin Township (XT), as shown in Figure 3. The failure probabilities of transportation links in the affected zones are estimated primarily using route-specific parameters, including link length and the presence of critical infrastructure elements, such as bridges, following Nadi et al. [44]. Complete zone-level information is provided in Appendix A and Appendix B. Inter-zone travel times were obtained from Baidu Maps. Regarding model parameterization, service performance factors were set to 0.3 for assessment teams and 0.6 for relief supply teams [44]. The survival function is given by

δ (θ) = 0.8 e^{- 0.0001 θ^{2}},

where

θ

denotes the rescue time in hours [45]. The discount factor is set to γ = 0.98 so that future rewards are discounted in the sequential decision process. Following established modeling conventions [47,55,56], the injured population in each zone is assumed to follow a normal distribution whose mean is the nominal demand and whose standard deviation is 3. This distribution is used to generate real-time estimates of the injured population, which in turn are used to estimate rewards. Material demand at zone

i

is calculated as

{D S}_{i} = N_{i r} \times d \times I \times (1 + ℵ)

, where

d

is per-capita demand,

I

is the seasonal factor, and

ℵ

is the loss factor. Following Lin et al. [57] and considering the 2025 average low temperature of −11 °C in Shigatse, Tibet, and its high-altitude conditions, the food demand parameters are set as

d

= 1 kg,

I

= 1.2,

ℵ

= 0.05. All algorithm hyperparameters are configured in accordance with Zeng et al. [49] and Mnih et al. [50], as detailed in Appendix C. The model is solved using the framework described in Section 3. All algorithms were implemented using Python libraries.
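The survival function and the material demand formula above can be evaluated directly. The following Python sketch uses the stated parameters (d = 1 kg, I = 1.2, loss factor 0.05). The population term in the demand formula is not defined in this section, so the sketch takes the QLT entry of Table A1 as a stand-in, which is an assumption made only for illustration.

```python
import math

def survival(theta_hours):
    """Survival function: 0.8 * exp(-0.0001 * theta**2), with theta in hours."""
    return 0.8 * math.exp(-0.0001 * theta_hours ** 2)

def material_demand(n_people, d=1.0, seasonal=1.2, loss=0.05):
    """Material demand: N * d * I * (1 + loss factor), using the food parameters in the text."""
    return n_people * d * seasonal * (1.0 + loss)

s0 = survival(0.0)     # survival factor at the moment of the disaster
s35 = survival(35.6)   # at the mean completion time reported for Scenario 1
s42 = survival(42.5)   # at the mean completion time reported for Scenario 2

# 391 is the QLT entry of Table A1, used here as an illustrative population term.
demand_qlt = material_demand(391)
```

Because the exponent is quadratic in the rescue time, the survival factor decays slowly at first and faster later, so shortening a mission by a few hours matters more the longer the mission already is.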

4.2. Results and Analysis

The single-agent-per-category configuration used in Section 4.2.1 is intentionally simplified to isolate and quantify the contribution of the AT agent in a controlled and tractable setting. Section 4.2.2 then extends the analysis to a twelve-zone, multi-agent configuration to examine system-level scalability and coordination effects under more realistic operational conditions.

4.2.1. Performance Analysis and Contribution of the AT Agent

To isolate and quantify the contribution of the AT agent, the initial analysis focuses on six severely affected zones, namely QLT, CT, QDT, XGT, MT, and LT, which are selected as the primary response targets. These zones are located within or adjacent to the highest-intensity region (seismic intensity ≥ VIII on the Modified Mercalli Scale) and collectively account for the majority of the affected population and collapsed structures in the study area. The transportation network connecting these zones comprises 14 road links (Figure 4). For analytical clarity, a single representative agent is assumed for each agent category. The D3QN-PER algorithm is applied to train the DMA, yielding the optimal assessment route illustrated in Figure 5. The resulting agent properties and operational timelines for the AT and ST are presented in Table 2. To examine the system’s responsiveness to field disruptions, a link failure event is introduced at t = 4.7067 h: after completing service at zone XGT, the AT agent reports that Link 10 has become impassable. Link 10 is selected as the failure location because it is a critical connectivity segment in the six-zone network, lying on the primary route connecting XGT to adjacent zones; its disruption forces a non-trivial re-routing decision that tests the DMA’s adaptive replanning capability. At that moment, the ST agent has just commenced service at XGT, so the DMA must identify an alternative optimal routing plan based on the updated network state. Under this configuration, two scenarios are evaluated: (Scenario 1) full coordination, in which the AT, ST, and DMA are all active; and (Scenario 2) assessment-absent, in which the AT agent is removed and only the ST and DMA operate.

Q-value convergence. Both scenarios were trained using D3QN-PER, and the resulting Q-value convergence curves (measured in number of people rescued) are plotted over training episodes in Figure 6. After approximately 1200 training episodes, the Q-values of both scenarios stabilized within the range of 1800 to 2500: Scenario 1 (full coordination) converged to approximately 2300, whereas Scenario 2 (assessment-absent) stabilized at approximately 2010. During the first 300 episodes, Scenario 1 learned markedly faster, reflecting the informational advantage conferred by the AT agent’s real-time field data. Beyond 300 episodes, Q-value fluctuations gradually diminished in both scenarios, indicating convergence toward stable allocation strategies. A detailed zone-level comparison of optimal actions, demand satisfaction rates, and Q-values is provided in Table 3, which also shows that more individuals are rescued in Scenario 1 than in Scenario 2.

Robustness evaluation. To assess the robustness of the DMA decision-making in a stochastic environment, 100 independent simulation runs were conducted for each scenario. Performance was evaluated using four metrics: failure rate (failure occurs when an agent encounters a destroyed link), mission completion time, number of rescued individuals, and the mean absolute percentage error (MAPE) of Q-values. The integration of AT information reduced the probability of the ST agent encountering an impassable link from 9.50% (Scenario 2) to 1.56% (Scenario 1) (Appendix D, Figure A2), confirming that proactive route assessment substantially reduces operational disruptions. Real-time demand information from the AT also improved Q-value accuracy, with MAPE decreasing from 9.74% to 0.96% (Appendix D, Figure A3), indicating substantially improved state representation and consequent action quality. These gains collectively reduced link failures, shortened the average mission completion time from 42.50 h to 35.60 h (Appendix D, Figure A4), and yielded an 11.34% increase in the rescued population in Scenario 1 relative to Scenario 2 (Appendix D, Figure A5), which is consistent with the zone-level findings in Table 3.

4.2.2. Performance Analysis of Multi-Agent Coordination

This section extends the analysis to all 12 affected zones to evaluate the scalability and generalizability of the proposed framework, and the impact of multi-agent coordination, under more realistic multi-zone operational conditions. Following Yang et al. [43], four AT agents and four ST agents are deployed. Two scenarios are compared: DR1 (full coordination, all agents active) and DR2 (assessment-absent, all AT agents removed). Comparative results are summarized in Figure 6 and Table 4.

DR1 consistently outperforms DR2 in both rescued population and mission completion time across all tested methods, confirming that AT deployment and the associated real-time information sharing generate meaningful and consistent system-level performance gains. The proposed D3QN-PER is further benchmarked against three established baselines: DQN [49], Q-learning [58], and a myopic policy [39,41]. Under DR1, D3QN-PER achieves the highest rescued population of 6173.98, followed by DQN (5500.01), Q-learning (5290.14), and the myopic policy (4874.20). D3QN-PER also attains its maximum rescued population in the shortest time among the learning-based methods (64.69 h). Notably, while the myopic policy reaches its maximum marginally faster (64.09 h), it produces the lowest rescued population of all methods. This finding reflects the inherent limitation of short-horizon greedy decision-making in complex, multi-zone disaster response: optimizing immediate reward without accounting for future consequences ultimately sacrifices substantial rescue capacity.

Overall, DR1 outperforms DR2 in both rescued population and completion time, confirming the benefit of deploying multiple AT agents and supporting the mechanism identified in Table 3 (Section 4.2.1).
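The relative margins implied by the DR1 figures reported above can be checked with a short calculation. The Python sketch below uses only the four rescued-population values quoted in this section; the dictionary and function names are illustrative.

```python
# Rescued population under DR1, as reported in this section.
rescued_dr1 = {
    "D3QN-PER": 6173.98,
    "DQN": 5500.01,
    "Q-learning": 5290.14,
    "Myopic": 4874.20,
}

def relative_gain(best, other):
    """Percentage by which `best` exceeds `other`."""
    return 100.0 * (best - other) / other

gains = {
    name: round(relative_gain(rescued_dr1["D3QN-PER"], value), 1)
    for name, value in rescued_dr1.items()
    if name != "D3QN-PER"
}
# gains == {"DQN": 12.3, "Q-learning": 16.7, "Myopic": 26.7}
```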

4.2.3. Solution Stability Analysis

To assess solution stability and robustness, 20 independent training runs were performed for each method (D3QN-PER, DQN, Q-learning, and myopic policy) under scenarios DR1 and DR2 (Section 4.2.2). Stability was quantified by the coefficient of variation (

CV = σ / μ

), where σ and μ are the standard deviation and mean of rescued individuals across runs; a lower CV denotes greater consistency. Results are shown in Figure 7 and Table 5. D3QN-PER achieved the highest mean number of rescued individuals and the lowest CV values in both scenarios (DR1: 6233, CV = 2.3%; DR2: 5592, CV = 3.0%). Its worst-case result under DR1 (5924) still surpassed the mean of DQN (5557), indicating that its advantage holds even in adverse runs. The myopic policy showed the poorest stability (CV of 9.0% under DR1 and 10.5% under DR2), reflecting its sensitivity to stochastic environmental fluctuations, while DQN and Q-learning exhibited moderate CV values of 4.8–5.7%. The stability of D3QN-PER is attributed to two structural mechanisms: periodic target network synchronization [50], which prevents oscillation of training targets, and importance sampling correction [53], which compensates for the gradient bias induced by PER’s non-uniform sampling.
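The stability metric can be computed as in the following Python sketch. The text does not state whether the sample or the population standard deviation is used, so the sketch exposes both and defaults to the sample version as an assumption. The run values are hypothetical placeholders, not results from the study.

```python
import statistics

def coefficient_of_variation(samples, sample_std=True):
    """CV = standard deviation / mean over independent runs."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples) if sample_std else statistics.pstdev(samples)
    return std / mean

# Hypothetical outcomes of five runs (placeholders only).
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
cv_sample = coefficient_of_variation(runs)
cv_population = coefficient_of_variation(runs, sample_std=False)
```

With 20 runs per method the two definitions differ only slightly, but stating which one is used makes reported CV values reproducible.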

4.2.4. Sensitivity Analysis

This section systematically examines how variations in the numbers of AT and ST agents affect overall rescue performance. The total affected population is fixed at 6499, with a baseline of 3249 rescued individuals established under the condition of zero deployed ST agents. The numbers of AT and ST agents are varied independently while all other parameters are held constant. As illustrated in Figure 8, increasing the total number of deployed agents generally improves the rescued population. However, the marginal benefit of additional AT agents exhibits clear diminishing returns: once the AT count exceeds five, the rescued population reaches a plateau and further AT deployment yields negligible incremental improvement. This finding has a direct practical implication: relief organizations should prioritize balanced AT and ST configurations over unilateral increases in assessment capacity, particularly when operating under tight resource and time constraints. From a sustainable development perspective, this finding is particularly important. Deploying too many assessment teams may consume limited human, logistical, and organizational resources, without necessarily producing a corresponding improvement in rescue effectiveness.

Table 6 presents the AT routing sequences under the reference configuration of ST = 4 as the number of AT agents varies from 3 to 5. Zones for which service is initiated simultaneously are indicated in the original analysis using bold and underlined formatting, whereas the remaining zones are served sequentially. Importantly, the service priority order of the three key affected zones, QLT, XGT, and ZT, remains unchanged across all AT configurations. This stability provides strong evidence that the proposed framework maintains consistent prioritization of key zones even as team compositions vary, demonstrating structural robustness to moderate staffing fluctuations.

Figure 9 presents the zone-level demand satisfaction rates for ST = 4 under different AT levels. The aggregate satisfaction rate is maximized at AT = 4, and the improvement from AT = 3 to AT = 4 substantially exceeds the marginal gain from AT = 4 to AT = 5, further confirming the existence of diminishing returns beyond an optimal AT threshold. Together, the results in Table 6 and Figure 9 show that, although AT staffing influences routing schedules and zone-level satisfaction rates, the fundamental allocation structure and the prioritization of critical zones remain stable. This provides evidence of the operational robustness of the proposed framework under realistic staffing variations.

5. Conclusions

Timely and accurate demand assessment is a fundamental prerequisite for effective and sustainable post-disaster resource allocation. In practice, however, demand information is frequently incomplete, dynamically evolving, and subject to significant noise, posing severe challenges to reliable and timely decision-making. From a sustainability perspective, such information uncertainty may lead not only to inefficient logistics operations, but also to avoidable resource waste and delayed assistance to vulnerable populations. To address this problem, this study proposes a multi-agent decision framework that integrates on-site demand assessment with dynamic resource allocation, thereby overcoming the efficiency bottleneck associated with information delays in conventional stage-wise response approaches. The framework formalizes the post-disaster response process as a Markov Decision Process and enables dynamic coordination among a Demand Assessment Team (AT), a Resource Allocation Team (ST), and a central artificial Decision-Making Agent (DMA), supporting adaptive, field-feedback-driven resource allocation. Three principal findings are derived from an empirical evaluation based on the 2025 Dingri County earthquake in Tibet.

First, the integration of real-time demand assessment substantially improves the quality of individual allocation decisions and overall system-level rescue performance. Incorporating on-site field intelligence from the AT reduces decision bias in the DMA’s state representations, improves the quality of ST allocation decisions, and shortens the time required to rescue a given proportion of the affected population. In multi-zone settings, this mechanism results in larger rescued populations, shorter mission completion times, and higher demand satisfaction rates, collectively highlighting the strategic importance of structured, field-feedback-driven information flow in emergency response systems.

Second, expanding team size generally improves rescue outcomes, but AT capacity exhibits pronounced diminishing marginal returns. The total rescued population increases as the numbers of both AT and ST agents grow; however, the marginal gains from adding AT agents plateau once a threshold is exceeded. In operational terms, this finding implies that relief organizations should seek balanced AT and ST configurations rather than indefinitely expanding assessment capacity, particularly under binding time and resource constraints.

Third, increasing AT capacity alone yields only marginal improvements in system-wide demand satisfaction, while substantial disparities in inter-zone satisfaction rates persist. This finding underscores the necessity of incorporating equity objectives explicitly into emergency response planning: frameworks that optimize solely for aggregate performance metrics may systematically under-serve the most vulnerable zones. Response planners should therefore adopt multi-objective approaches that balance system-level efficiency with equitable resource distribution across all affected areas.

From an algorithmic perspective, the proposed D3QN-PER outperforms all baseline methods in both solution quality and stability. In the DR1 scenario, D3QN-PER yields the highest rescue count of 6173.98, exceeding DQN (5500.01), Q-learning (5290.14), and the myopic policy (4874.20) by 12.3%, 16.7%, and 26.7%, respectively. This performance advantage also persists in the DR2 scenario, indicating that the superiority of D3QN-PER holds across varying scenario configurations. Stability analysis further corroborates these findings: D3QN-PER achieves the lowest coefficient of variation among all compared methods (CV = 2.3% under DR1), confirming that the stochastic nature of its prioritized experience replay mechanism does not compromise solution reliability. Therefore, these results further demonstrate the consistent advantages of D3QN-PER in both solution quality and robustness.

Overall, the findings show that the proposed framework has important implications for sustainability in disaster response. By increasing the number of rescued individuals and improving demand satisfaction across affected zones, the framework supports the social dimension of sustainability. It also improves resource-use efficiency by reducing allocation bias, the selection of failed routes, and redundant team deployment. In addition, the continuous feedback between field assessment and allocation decisions enhances the adaptability and resilience of the emergency response system.

The framework has also been designed with practical deployment in mind, and its resource requirements merit consideration. While the precise figures will depend on local administrative arrangements and the maturity of existing emergency management systems, the 2025 Dingri response—which covered twelve affected townships under a Level II national emergency protocol—suggests that approximately twenty trained personnel would be sufficient at the provincial level. Such a team would comprise a core of operations research analysts overseeing decision-support processes, four to six system and data engineers maintaining the geospatial and real-time information flow, and field coordinators supporting rescue teams in the affected zones. Compared with the RMB 34.884 billion in direct economic losses recorded during the Dingri earthquake, the operational cost of deploying the framework is almost negligible, reinforcing its role as a complementary tool rather than a substitute for existing disaster response systems.

This study has several limitations that point to promising directions for future research. First, the scope of the decision problem could be broadened to encompass a wider range of interdependent emergency response tasks, including search-and-rescue, medical triage, and mass evacuation, thereby extending the current single-phase allocation model into a comprehensive multi-task emergency management framework. Second, the practical applicability of the model could be enhanced by incorporating additional real-world features such as zone-level urgency prioritization, explicit equity constraints, and more realistic operational allocation rules. Third, algorithmic scalability and computational efficiency remain important open challenges. Future research could develop more efficient deep reinforcement learning architectures specifically tailored to disaster response decision problems, and explore the integration of large-scale heterogeneous data sources, including satellite imagery, population mobility data, and telecommunications signals, to improve the robustness and cross-context generalizability of the multi-agent framework. Fourth, while D3QN-PER has been benchmarked against DQN, Q-learning, and the myopic policy—confirming its advantages in solution quality (DR1: 6173.98 rescued individuals) and stability (CV = 2.3%)—direct comparisons with hybrid variants that integrate D3QN with external optimization techniques (i.e., D3QN-PSO, D3QN-DE, and D3QN-GA) were not conducted in this study. Although combinations of DDQN with PSO, DE, and GA have been explored in related domains, the systematic integration of these optimization techniques with the full D3QN architecture remains largely unexamined in the literature. Therefore, this comparison represents a meaningful direction for future research. 
Fifth, the case study integrates real-world data, including inter-site travel times and local meteorological conditions, with scenario-specific parameter assumptions such as service performance factors and demand distribution settings. The results are, therefore, best interpreted as scenario-based analyses, which is consistent with established practice in disaster response modeling [59]. However, the sensitivity of these parameters has not been formally quantified [60]. Systematic sensitivity analysis represents a natural extension of this work, particularly for examining how input uncertainty propagates through allocation decisions and ultimately affects rescue outcomes.

Author Contributions

Conceptualization, L.H. and J.H.; methodology, L.H. and J.H.; software, L.H.; validation, L.H. and J.H.; formal analysis, L.H. and J.H.; investigation, L.H. and J.H.; resources, L.H. and J.H.; data curation, L.H. and J.H.; writing—original draft preparation, L.H.; writing—review and editing, L.H.; visualization, L.H. and J.H.; supervision, J.H.; project administration, L.H. and J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Fund of China [Nos. 72274185, 71874163], the National Social Science Fund of China [No. 23AZD072], the Innovative Research Group Project of the Natural Science Foundation of Hubei Province [No. 2024AFA015], and the Fuzhou Municipal Social Science Planning Project [No. 2025FZB013].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1 illustrates the multi-agent decision coordination process. At time t, the decision-making agent (DMA) observes the current state

S_{t}

and selects an action

x_{t}

using the Q-function, including the assignment of the assessment team (AT), route planning, the supply response team (ST), and the material allocation plan. The AT conducts on-site evaluation and reports real-time demand information to the DMA, while the ST conducts material distribution and reports any operational disruptions encountered. The DMA integrates this information to compute the reward, update the Q-function, and transition to the next state

S_{t + 1}

. This cycle repeats until the planning horizon T is reached, at which point the optimal policy is output.

Figure A1. Multi-agent decision coordination process. Source: Drafted by the authors, Microsoft Visio Professional 2013. Note: Solid arrows indicate sequential process flow within each agent; dashed arrows represent inter-agent communication and directive transmission.

Table A1. Information of affected zones.

Town	Total Population	Instability Ratio	Vulnerable Population ( $N_{i}$ )
QLT	2300	0.17	391
CT	3000	0.15	450
TT	2500	0.15	375
NT	1800	0.16	288
GT	1100	0.16	176
QDT	2900	0.15	435
XGT	10,000	0.12	1200
ZT	3169	0.13	411.97
MT	2900	0.15	435
LT	6906	0.16	1104.96
QT	3400	0.14	476
XT	5400	0.14	756
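The last column of Table A1 appears to equal the total population multiplied by the instability ratio. The Python sketch below reproduces it from the first two columns and sums the result, which agrees with the total affected population of 6499 used in Section 4.2.4. The variable names are illustrative.

```python
# (total population, instability ratio) per zone, copied from Table A1.
zones = {
    "QLT": (2300, 0.17), "CT": (3000, 0.15), "TT": (2500, 0.15),
    "NT": (1800, 0.16), "GT": (1100, 0.16), "QDT": (2900, 0.15),
    "XGT": (10000, 0.12), "ZT": (3169, 0.13), "MT": (2900, 0.15),
    "LT": (6906, 0.16), "QT": (3400, 0.14), "XT": (5400, 0.14),
}

# Vulnerable population per zone = total population * instability ratio.
vulnerable = {zone: pop * ratio for zone, (pop, ratio) in zones.items()}
total_vulnerable = sum(vulnerable.values())
```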

Appendix B

Table A2. Link Information of affected zones.

Failure Probability	Start Zone	End Zone	Distance (km)	Time (min)
0	QLT	CT	20.1	33
0	QLT	TT	21.7	34
0	QLT	NT	53.9	77
0	QLT	GT	42.8	41
0.15	QLT	QDT	150.4	203
0	QLT	XGT	30.8	36
0	QLT	ZT	83.0	111
0	QLT	MT	120.1	128
0	QLT	LT	115.5	120
0	QLT	QT	105.4	107
0	QLT	XT	116.8	117
0	CT	TT	21.7	27
0	CT	NT	44.5	58
0	CT	GT	62.8	63
0.1	CT	QDT	119.6	168
0	CT	XGT	50.8	58
0.1	CT	ZT	171.5	178
0.05	CT	MT	140.0	149
0.1	CT	LT	135.5	165
0	CT	QT	125.3	128
0.05	CT	XT	136.8	138
0	TT	NT	55.4	72
0	TT	GT	64.5	65
0.1	TT	QDT	130.5	171.9
0	TT	XGT	52.3	60
0.1	TT	ZT	173.1	180
0.05	TT	MT	142	153
0.05	TT	LT	137.0	140
0	TT	QT	129	126.9
0.05	TT	XT	138.3	139
0	NT	GT	96.5	99
0.05	NT	QT	106.7	157
0	NT	XT	84.6	92
0.1	NT	ZT	162.8	172
0.05	NT	MT	135.1	152
0.05	NT	LT	126.8	134
0.05	NT	QT	120.5	131
0	NT	XT	108.4	117
0.15	GT	QDT	149.3	206
0	GT	XGT	25.8	29
0	GT	ZT	108.9	116
0	GT	MT	77.4	88
0	GT	LT	72.8	79
0	GT	QT	66	62.7
0	GT	XT	74.1	77
0.15	QDT	XGT	137.5	200
0.15	QDT	ZT	241.4	281
0.15	QDT	MT	213.6	260
0.15	QDT	LT	205.3	243
0.15	QDT	QT	199.1	240
0.15	QDT	XT	186.9	226
0.15	XGT	ZT	134.5	146
0	XGT	MT	103.3	116
0	XGT	LT	98.5	107
0	XGT	QT	88.3	94
0	XGT	XT	99.7	105
0	ZT	MT	62.2	75
0	ZT	LT	36	39
0	ZT	QT	47.1	53
0	ZT	XT	54.6	57
0	MT	LT	26.3	40
0	MT	QT	15.7	27
0	MT	XT	26.9	37
0	LT	QT	11.1	15
0	LT	XT	18.6	18
0	QT	XT	12.3	14
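One possible way to hold the link data for simulation is a dictionary keyed by zone pair. The sketch below uses a handful of rows from Table A2 and makes two assumptions that the table itself does not state: links are treated as symmetric (each pair is listed once), and link failures are sampled as independent Bernoulli draws. The helper names are illustrative.

```python
import random

# A few rows copied from Table A2:
# (start, end) -> (failure probability, distance in km, time in min)
LINKS = {
    ("QLT", "CT"): (0.0, 20.1, 33),
    ("CT", "XGT"): (0.0, 50.8, 58),
    ("XGT", "MT"): (0.0, 103.3, 116),
    ("MT", "LT"): (0.0, 26.3, 40),
    ("QDT", "LT"): (0.15, 205.3, 243),
}


def get_link(a, b):
    """Look up a link in either direction (symmetry is an assumption)."""
    return LINKS.get((a, b)) or LINKS[(b, a)]


def travel_hours(a, b):
    """Convert the tabulated travel time from minutes to hours."""
    return get_link(a, b)[2] / 60.0


def link_available(a, b, rng):
    """Sample link usability as an independent Bernoulli draw (assumed)."""
    return rng.random() >= get_link(a, b)[0]


rng = random.Random(42)
print(round(travel_hours("QLT", "CT"), 4))
print(link_available("LT", "QDT", rng))
```

Under this conversion, the 33 min QLT to CT link corresponds to the 0.5500 h travel time listed for that leg in Table 2.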

Appendix C

Table A3. D3QN-PER Hyperparameter Settings.

Parameter	Value	Explanation
Replay buffer capacity $ϰ$	10,000,000	Maximum number of transitions stored in the replay buffer
Mini-batch size B	64	Number of transitions sampled per gradient update step
Learning rate	0.00025	Step size for neural network parameter optimization
PER priority exponent $α$	0.6	Controls degree of prioritization in experience sampling
Maximum episodes	100,000 (1200)	Number of training episodes
Target update $τ$	400	Frequency of target network parameter synchronization
Exploration rate $ϵ$	1.0	Probability of random action selection
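To illustrate what the PER priority exponent does, the sketch below applies the standard prioritized-replay sampling rule, in which transition $i$ is drawn with probability proportional to its priority raised to the exponent. This is a generic illustration of that rule with the value 0.6 from Table A3; the priorities are made-up numbers, and the authors' implementation may differ in detail.

```python
def per_sampling_probabilities(priorities, alpha=0.6):
    """Return P(i) = p_i**alpha / sum_k p_k**alpha for each transition."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical priorities, e.g. absolute TD errors of three stored transitions.
priorities = [0.1, 1.0, 4.0]

probs = per_sampling_probabilities(priorities)           # exponent from Table A3
uniform = per_sampling_probabilities(priorities, alpha=0.0)  # no prioritization

print([round(p, 3) for p in probs])
print([round(p, 3) for p in uniform])
```

With an exponent of 0 every transition is sampled uniformly, while larger exponents concentrate sampling on high-priority transitions.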

Appendix D

Figure A2. Performance comparison between Scenario 1 (with AT) and Scenario 2 (without AT) in terms of the failure-rate metric. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure A3. Comparison of the performance of Scenario 1 and Scenario 2 in terms of the MAPE metric. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure A4. Performance comparison between Scenario 1 and Scenario 2 in terms of the completion-time metric. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure A5. Performance comparison between Scenario 1 and Scenario 2 in terms of the rescued-population metric. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure 1. The general form of the survival function. Source: Drafted by the authors, Microsoft Visio Professional 2013.

Figure 2. Proposed multi-agent environment with three types of cooperative agents and their connections. Source: Drafted by the authors, Microsoft Visio Professional 2013.

Figure 3. Location map of the study area. Source: Drafted by the authors, ArcGIS 10.8.

Figure 4. The route network for six affected zones. Source: Drafted by the authors, ArcGIS 10.8.

Figure 5. Results of the optimal route for six affected zones. Red arrows indicate the optimal routes; blue arrows indicate non-optimal routes; red numbers denote the route numbers of the optimal routes. Source: Drafted by the authors, ArcGIS 10.8.

Figure 6. Performance comparison of different methods for maximizing the Q-value: (a) DR1; (b) DR2. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure 7. Stability analysis comparing rescue performance across methods under the DR1 and DR2 scenarios. Source: Drafted by the authors, Python 3.11 (Matplotlib library, version 3.8).

Figure 8. Number of rescued individuals under different numbers of AT and ST agents. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.8).

Figure 9. Demand satisfaction rates in each affected zone under different numbers of AT agents. Source: Drafted by the authors, Python 3.12 (Matplotlib library, version 3.

Table 1. Sets, indices, parameters, and variables included in the model.

Notation	Definition
Sets, Indices, and Parameters
I	Set of demand zones, indexed by $i, j$
K	Set of medical supply points, indexed by $k$
S	All feasible states
$X_{S_{t}}$	Set of actions at decision epoch $t$ in state $S_{t}$
$L^{t}$	Zone list defining the material allocation sequence for disaster zones
$z$	Route link, $(i, j) ∈ z$, $z = \{1, 2, 3, \dots\}$
$i, j$	Index of disaster zones
$θ_{i, j}$	Travel time from zone $i$ to zone $j$
$ε_{i}$	Service time at zone $i$
$T_{i, j}$	Travel time from zone $i$ to zone $j$
$FT_{i}$	Completion time at zone $i$
$h_{i}$	Performance factor
$N_{i}$	Nominal demand in zone $i$ (vulnerable population in zone $i$)
$N_{i r}$	Actual demand in zone $i$ (actual injured population in zone $i$)
$γ$	Discount factor
State variables
$S_{t}$	Consists of two components: material status and disaster zone request status, at decision epoch $t$
$M_{t k}$	The status of supply point k, indicating which disaster zones can be serviced at decision epoch $t$
$R^{t k}$	The total available materials from supply point k at decision epoch $t$
$D_{t i}$	Characteristics of the disaster zone $i$ request at decision epoch $t$
$O^{t i}$	Location of a disaster zone $i$ at decision epoch $t$
$e^{t i}$	Demand level of the corresponding zone at decision epoch $t$
$J (μ_{t i}, σ_{t i})$	Probability distribution; explanatory variables $μ_{t i}, σ_{t i}$
$W^{t i}$	Whether disaster zone $i$ is being served at decision epoch $t$; $W^{t i} = 0$ means the zone is not being served
Decision variables
$x_{t}$	Index of the action at decision epoch $t$, $x_{t} ∈ X_{S_{t}}$
$a_{t i j}^{z}$	Whether disaster zone $i$ chooses route $z$ for the material need assessment of the next disaster zone $j$ ($a_{t i j}^{z} = 1$) or not ($a_{t i j}^{z} = 0$) at decision epoch $t$
$b_{t i}^{k}$	Quantity of materials allocated from supply point k to disaster zone $i$ , at decision epoch $t$
Function
$V (S_{t})$	Value function of state $S_{t}$ at time $t$
$C (S_{t}, x_{t})$	The reward function
$o_{t}^{i, x_{t}}$	The rescue effect of taking feasible action $x_{t}$
$δ (θ_{i, j})$	Survival function describing the effectiveness of relief assessment
$Q (s_{t}, x_{t})$	Approximation function (Q-value), represented by the number of people rescued
$X^{π} (S_{t})$	Decision function
$V^{*} (φ (s_{t}, x_{t}))$	The optimal value of the next state $φ (s_{t}, x_{t})$

Table 2. Properties of AT and ST agents.

Zone	$ε_{i}^{AT}$ (h)	$ε_{i}^{ST}$ (h)	Route ($i$ → $j$)	$T_{i, j}$ (h)	$FT_{i}^{AT}$ (h)	$FT_{i}^{ST}$ (h)
QLT	0.9775	2.6067	QLT → CT	0.5500	0.9775	2.6067
CT	1.1250	3.0000	CT → XGT	0.9667	2.6525	6.1567
XGT	1.0875	2.9000	XGT → MT	1.9333	4.7067	10.0234
MT	3.0000	8.0000	MT → LT	0.6667	9.6400	19.9567
LT	1.0875	2.9000	LT → QDT	4.0500	11.3942	23.5234
QDT	2.7624	7.3664	QDT → QDT	0.0000	18.2066	34.9398
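The completion times in Table 2 appear to follow a simple recurrence: the completion time at a zone equals the completion time at the previous zone, plus the travel time of the connecting leg, plus the service time at the current zone. This reading is inferred from the tabulated numbers rather than quoted from the model, and the function name `completion_times` is illustrative. The sketch below recomputes both columns under that assumption.

```python
# Columns copied from Table 2, in row order (hours).
AT_SERVICE = [0.9775, 1.1250, 1.0875, 3.0000, 1.0875, 2.7624]
ST_SERVICE = [2.6067, 3.0000, 2.9000, 8.0000, 2.9000, 7.3664]
TRAVEL = [0.5500, 0.9667, 1.9333, 0.6667, 4.0500]  # legs between consecutive rows


def completion_times(service, travel):
    """Assumed recurrence: FT_j = FT_i + T_ij + service time at j."""
    times = [service[0]]  # the first zone only incurs its own service time
    for leg, service_time in zip(travel, service[1:]):
        times.append(times[-1] + leg + service_time)
    return times


at_completion = [round(t, 4) for t in completion_times(AT_SERVICE, TRAVEL)]
st_completion = [round(t, 4) for t in completion_times(ST_SERVICE, TRAVEL)]
print(at_completion)
print(st_completion)
```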

Table 3. Zone performance comparison between Scenario 1 (full coordination) and Scenario 2 (assessment-absent) using D3QN-PER.

Zone	Scenario 1 (with AT): AT Route	Scenario 1 (with AT): ST Satisfaction (%)	Scenario 1 (with AT): Q-Value	Scenario 2 (without AT): AT Route	Scenario 2 (without AT): ST Satisfaction (%)	Scenario 2 (without AT): Q-Value
QLT	7	70	343	—	63	235
CT	4	80	297	—	80	247
XGT	9	60	339	—	36	311
MT	1	40	754	—	40	298
LT	11	74	337	—	58	703
QDT	5	64	229	—	73	267

Table 4. Comparative performance of four methods under full-coordination (DR1) and assessment-absent (DR2) scenarios.

Scenario	Method	No. of People Rescued	Completion Time (h)
DR1	D3QN-PER	6173.98	64.69
	DQN	5500.01	66.03
	Q-learning	5290.14	67.00
	Myopic policy	4874.20	64.09
DR2	D3QN-PER	5524.09	67.49
	DQN	5300.14	68.67
	Q-learning	5199.14	70.45
	Myopic policy	4549.25	72.60
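As a reading aid for Table 4, the relative change in the number of people rescued between the two scenarios can be computed per method. The sketch below is plain arithmetic on the tabulated values; the choice of DR2 as the baseline and the name `relative_gain_percent` are assumptions made for illustration.

```python
# method -> (people rescued under DR1, people rescued under DR2), from Table 4
RESCUED = {
    "D3QN-PER": (6173.98, 5524.09),
    "DQN": (5500.01, 5300.14),
    "Q-learning": (5290.14, 5199.14),
    "Myopic policy": (4874.20, 4549.25),
}


def relative_gain_percent(with_assessment, without_assessment):
    """Percentage change of DR1 relative to the DR2 baseline (assumed baseline)."""
    return 100.0 * (with_assessment - without_assessment) / without_assessment


gains = {
    method: relative_gain_percent(dr1, dr2)
    for method, (dr1, dr2) in RESCUED.items()
}
for method, gain in gains.items():
    print(f"{method}: {gain:.2f}%")
```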

Table 5. Summary statistics of the rescued population across methods under the DR1 and DR2 scenarios.

Scenario	Method	Mean	Std Dev	CV (%)	Min	Max	IQR
DR1	D3QN-PER	6233	143	2.3	5924	6483	172
	DQN	5557	283	5.1	5070	5981	476
	Q-learning	5284	256	4.8	4619	5632	339
	Myopic policy	4973	447	9.0	4132	5701	627
DR2	D3QN-PER	5592	165	3.0	5235	5881	199
	DQN	5361	304	5.7	4839	5816	511
	Q-learning	5192	277	5.3	4471	5570	368
	Myopic policy	4658	489	10.5	3737	5454	686
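The CV column in Table 5 is consistent with the usual definition of the coefficient of variation, that is, the standard deviation divided by the mean and expressed as a percentage. The sketch below recomputes the column from the tabulated means and standard deviations; the function name is illustrative.

```python
# (scenario, method) -> (mean, standard deviation, reported CV in %), from Table 5
TABLE_5 = {
    ("DR1", "D3QN-PER"): (6233, 143, 2.3),
    ("DR1", "DQN"): (5557, 283, 5.1),
    ("DR1", "Q-learning"): (5284, 256, 4.8),
    ("DR1", "Myopic policy"): (4973, 447, 9.0),
    ("DR2", "D3QN-PER"): (5592, 165, 3.0),
    ("DR2", "DQN"): (5361, 304, 5.7),
    ("DR2", "Q-learning"): (5192, 277, 5.3),
    ("DR2", "Myopic policy"): (4658, 489, 10.5),
}


def coefficient_of_variation(mean, std_dev):
    """CV in percent: 100 * standard deviation / mean."""
    return 100.0 * std_dev / mean


# Recompute the CV for every row, rounded to one decimal as in the table.
recomputed = {
    key: round(coefficient_of_variation(mean, std_dev), 1)
    for key, (mean, std_dev, _) in TABLE_5.items()
}
for key, value in recomputed.items():
    print(key, value)
```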

Table 6. Rescue routing paths under different AT sizes (ST = 4).

AT	Routing Paths (Disaster Zone Sequences)
AT = 3	AT 1: ZT-LT-MT-XT; AT 2: XGT-CT-TT-QDT; AT 3: QLT-GT-QT-NT
AT = 4	AT 1: QLT-TT-NT; AT 2: XGT-CT-QDT; AT 3: GT-LT-QT; AT 4: ZT-MT-XT
AT = 5	AT 1: QLT-GT-QT; AT 2: TT-NT; AT 3: XGT-LT-XT; AT 4: ZT-MT; AT 5: CT-QDT

Note: Bold and underlined zone names indicate zones for which service is initiated simultaneously; the remaining zones are served sequentially.
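A quick consistency check on Table 6 is to confirm that, for each AT size, the listed routes visit every one of the twelve affected zones from Table A1 exactly once. The sketch below parses the route strings as printed; the helper names `parse_routes` and `is_partition` are illustrative.

```python
# The twelve affected zones listed in Table A1.
ZONES = {"QLT", "CT", "TT", "NT", "GT", "QDT", "XGT", "ZT", "MT", "LT", "QT", "XT"}

# Route strings copied from Table 6, keyed by the number of AT agents.
ROUTES = {
    3: "AT 1: ZT-LT-MT-XT; AT 2: XGT-CT-TT-QDT; AT 3: QLT-GT-QT-NT",
    4: "AT 1: QLT-TT-NT; AT 2: XGT-CT-QDT; AT 3: GT-LT-QT; AT 4: ZT-MT-XT",
    5: "AT 1: QLT-GT-QT; AT 2: TT-NT; AT 3: XGT-LT-XT; AT 4: ZT-MT; AT 5: CT-QDT",
}


def parse_routes(text):
    """Split 'AT 1: A-B; AT 2: C-D' into [['A', 'B'], ['C', 'D']]."""
    routes = []
    for part in text.split(";"):
        _, sequence = part.split(":")
        routes.append(sequence.strip().split("-"))
    return routes


def is_partition(routes, zones):
    """True if every zone appears in exactly one route."""
    visited = [zone for route in routes for zone in route]
    return len(visited) == len(set(visited)) and set(visited) == zones


coverage = {n: is_partition(parse_routes(text), ZONES) for n, text in ROUTES.items()}
print(coverage)
```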

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, L.; Hou, J. A Multi-Agent Emergency Material Allocation Approach Based on a Markov Decision Process Under Demand Uncertainty for Sustainable Disaster Response. Sustainability 2026, 18, 5539. https://doi.org/10.3390/su18115539
