A Multi-Depot Dynamic Vehicle Routing Problem with Stochastic Road Capacity: An MDP Model and Dynamic Policy for Post-Decision State Rollout Algorithm in Reinforcement Learning

In the event of a disaster, the road network is often compromised in terms of its capacity and usability conditions. This is a challenge for humanitarian operations in the context of delivering critical medical supplies. To optimise vehicle routing for such a problem, a Multi-Depot Dynamic Vehicle-Routing Problem with Stochastic Road Capacity (MDDVRPSRC) is formulated as a Markov Decision Processes (MDP) model. An Approximate Dynamic Programming (ADP) solution method is adopted where the Post-Decision State Rollout Algorithm (PDS-RA) is applied as the lookahead approach. To perform the rollout effectively for the problem, the PDS-RA is executed for all vehicles assigned for the problem. Then, at the end, a decision is made by the agent. Five types of constructive base heuristics are proposed for the PDS-RA. First, the Teach Base Insertion Heuristic (TBIH-1) is proposed to study the partial random construction approach for the non-obvious decision. The heuristic is extended by proposing TBIH-2 and TBIH-3 to show how Sequential Insertion Heuristic (SIH) (I1) as well as Clarke and Wright (CW) could be executed, respectively, in a dynamic setting as a modification to the TBIH-1. Additionally, another two heuristics: TBIH-4 and TBIH-5 (TBIH-1 with the addition of Dynamic Lookahead SIH (DLASIH) and Dynamic Lookahead CW (DLACW) respectively) are proposed to improve the on-the-go constructed decision rule (dynamic policy on the go) in the lookahead simulations. The results obtained are compared with the matheuristic approach from previous work based on PDS-RA.


Introduction
Recent events have shown that the occurrence of a disaster continues to claim many lives despite the growing number of relief organisations to support and help the victims throughout the world. In the case of the 2015 Nepal earthquake, for example, nearly 9000 lives were lost, and 23,000 people were injured [1]. In the event, critical medical supplies and health personnel were far from lacking, given aid rushed into Nepal as soon as the news went out. The relief supplies received worldwide were so large in volume that the Kathmandu airport was overwhelmed with the sudden increase in air traffic [2]. However, these life-changing resources could not be distributed accordingly and in a timely manner given the lack of coordination between the local and international relief aid providers. This led to inefficiencies in executing relief operations [1,2]. The fact that Nepal is a landlocked country without any sea access also exacerbated the logistical challenges further, as seen from the bottleneck problem in Kathmandu airport [3]. The hilly topography of Nepal is prone to landslides [1,2,4], while continuous aftershocks damaged the road infrastructure, such as the Pasang Lhamu and Araniko highways [5], which compromised the road network to the disaster zone [4,6]. Additionally, the sudden onset of the disaster meant that there was limited availability of vehicles and little preparation for emergency logistics operations [4]. The viability of the long-term humanitarian operations with the available vehicles, with regards to safety and logistics and asset management, were also in question since the road network was compromised [4]. This hindered efficient relief operations in terms of transport and delivery [4,7].
Meanwhile, urgent local medical supplies were rapidly diminishing as both field and local hospitals were overrun by victims seeking immediate treatment [2,8]. If the relief aid and supplies that were laying dormant at the Tribhuvan Airport could have been channelled through effectively, it would certainly have helped alleviate the problem of the urgent need for medical supplies and treatment.
In terms of disaster management preparedness, this case study serves to highlight the critical role of transportation in the event of a disaster. Transportation service is a crucial element when it comes to humanitarian logistics operation, in particular the in-country transportation for delivery of relief supplies [4]. From the 2015 Nepal earthquake event, some observations have been made with regards to ensuring efficient delivery of goods through vehicle routing. The study in [4] pointed out that the geographical topology and mountainous landscapes of Nepal as well as the second earthquake tremors and weather conditions during the event heavily impacted the delivery speed, especially involving the last mile of the delivery. In the disaster event, the road condition and capacity were compromised by major accidents, causing traffic density and loss of cargo [4]. Furthermore, there was an overwhelming demand for the limited number of trucks in terms of capacity and availability due to critical delay and backlogs. This increased the price of transport vehicle procurement to as high as 40%. Additionally, the lack of a decision support system (DSS) to monitor and track transport vehicles, coupled with untrained drivers, also led to a serious shortage of reliable transportation. Landslides, in particular, limited the routing to certain areas. In addition, the risk of accidents due to a unfamiliar route was significant and not helped by landslides which are sensitive to weather conditions. As such, intelligent routing and communication access is pivotal for this particular vehicle routing problem.
The proposed Multi-Depot Dynamic Vehicle-Routing Problem with Stochastic Road Capacity (MDDVRPSRC) model addresses such delivery problems in the setting of relief humanitarian operations by incorporating the aforementioned challenges. The problem of a bottleneck could be solved if "mini" airports were erected at strategic locations (multidepot). The problem of limited transport vehicles in terms of availability and capacity could be addressed at some length through transport vehicles performing split deliveries which is known to be a cheaper alternative by almost 50% [9]. Furthermore, these vehicles could also perform multi-trip deliveries. Communication among Logistics Service Providers (LSPs) and their local contacts helps in updating the road network conditions which may hint towards dynamic problem modelling, with information updates at regular intervals. The uncertainty which is the pivotal aspect of efficient delivery operations should be addressed in terms of dynamic and stochastic settings of road capacity and conditions within the road network [4,7].
The Markov Decision Processes (MDP) modelling framework offers a natural way to address these dynamic and stochastic aspects of the problem. However, incorporating these aspects leads to an exponential growth of state, action and outcome spaces which is needed when solving such a real and complex problem. To deal with the curse of dimensionality [10], the Approximate Dynamic Programming (ADP) approach is commonly applied to solve the problem from the Machine Learning (ML) (Reinforcement Learning, RL) perspective. To address the large outcome space, the Post-Decision State (PDS) is applied as an extended version of the basic MDP modelling framework [11].
The ADP approach is usually applied to approximate the value of the next state s k+1 or the value of the action a k when solving the Bellman equation [12]. In ADP, the lookahead approach is suitable when dealing with the large action and outcome space that constitutes part of the curse of dimensionality. In this paper, the known Post-Decision State Rollout Algorithm (PDS-RA) [13] is applied as part of the lookahead approach in the ADP.
A base heuristic in the PDS-RA is taken as the guiding policy for the decision rule applied as the state transitions within the lookahead horizon. In modelling the VRP through the MDP framework, the decision rule is generally the assignment of vehicles when the state transitions. In the rollout, the transition occurs in the lookahead horizon beginning from the potential next state s k+1 to the lookahead end state s K . In other words, the base decision rule is a route computed for all vehicles to navigate within the future lookahead horizon. In the case of MDDVRPSRC, however, assigning vehicles based on a computed route, which is computed once, becomes problematic. This is due to the stochastic road capacity which may render the computed route unusable in the next lookahead update. One way to navigate around this is to have the route build dynamically on the go by a simple constructive heuristic known as a base heuristic.
From the decision rule or route obtained through this approach at each iteration, only the first assignments of the constructed route are applied while the rest are ignored. This method is feasible and practical due to the less expansive computation performed by the simple constructive heuristic. Here, the Teach Base Insertion heuristic (TBIH-1) is proposed to balance the random exploration and to guide the exploitation by dictating obvious assignments. Extending from this, the authors apply two known constructive heuristics: Sequential Insertion Heuristic SIH(I1) and Clarke and Wright (CW), in a dynamic setting and embed them into the TBIH-1(as proposed TBIH-2 and TBIH-3). To the best of our knowledge, no paper has shown how these constructive heuristics can be executed, as proposed, in a PDS-RA setting. We further derived from these two heuristics: TBIH-4 and TBIH-5 that seek, within their algorithm, promising vehicle assignments by looking up to two steps ahead. To deal further with the large action space, most research provides some mechanism to segregate or cluster vehicles with a specific set of customers to serve. Due to the stochastic road capacity, such mechanism is difficult to adopt. Thus, further approximation is made when computing the optimal action a k with regards to executing the PDS-RA. Through this proposed method, each PDS-RA is executed for each vehicle for every potential assignment that can be given to that vehicle.
The contributions made in this paper are as follows: first, the novel MDDVRPSRC is proposed in a disaster event setting based on the modelling framework of MDP. Next, TBIH-1 heuristic is proposed in this work along with four extended variants, namely TBIH-2, TBIH-3, TBIH-4, and TBIH-5. By doing so, it is shown how SIH(I1) and CW can be applied in the dynamic setting of route-based MDP for PDS-RA [14]. The authors also show how these two can be extended by looking up to two steps ahead. Finally, it is also shown and validated how the near optimal decision can be approximated further by disintegrating the collective assignment decisions to an individual near optimal decision. To the best of our knowledge, both the MDDVRPSRC MDP model and the five base heuristics applied in the PDS-RA algorithm are novel and have not yet been proposed. This work is the extension of research published by [15], where the damage determination of the roads within the road network is referred from. Furthermore, the Poisson stochastic distribution of a stochastic road capacity is also referred from. In this paper, the dynamic and stochastic MDDVRPSRC MDP model is presented to complement the earlier research in [15] that modelled the Deterministic Multi-Depot VRP with Road Capacity (D-MDVRPRC) as well as the two-stage Stochastic Integer Liner Programming (SILP) model of a Multi-Depot VRP with Stochastic Road Capacity (MDVRPSRC-2S). To solve the MDP Model presented in this paper via the PDS-RA algorithm, the five base heuristics are proposed as an alternative to the "cluster first, route second" approach that cannot be applied. Meanwhile, to benchmark these heuristics, the matheuristic rollout presented in [15] is applied where tractable.
This paper is organised as follows: Section 2 describes the literature review focusing on the proposed model and base heuristics. Section 3 describes the problem of MDDVRPSRC where some elements of the models have been referenced from the authors' previous work. The known PDS-RA approach is briefly described in Section 4 as well as how the optimal decision is approximated through the proposed mechanism. Additionally, variants of the proposed base heuristics are presented here. Section 5 presents the computational results. while Section 6 synthesises the findings. Finally, Section 7 concludes the paper.

Literature Review
An extensive synthesis of the literature on works adapting VRP for humanitarian operations can be found in [16,17]. In [17], numerous papers within the last decade (as of 2020) have been reviewed in terms of the application of VRPs for three selected humanitarian operations. Various modelling aspects, such as dynamic and stochastic problem, multi-disaster phase, multi-objectives, multi-trips, multi-depots, split delivery and more, which are relevant to the model proposed in this work, are elaborated. Furthermore, the solution approaches applied are also discussed in detail, especially the challenges when dealing with stochastic and dynamic VRPs. Ref. [15] extends the findings by discussing some papers applying VRP outside the field of humanitarian operation settings but are still relevant to the model that was proposed. Meanwhile, the RL adaptation in solving Supply Chain Management (SCM) is discussed in [18]. Additionally, the adoption of RL in VRP and TSP is discussed in [19].

Stochastic Vehicle Routing Problem for Humanitarian Operations
The survey in [20] discussed the recent dynamic VRP for various applications from 2015 to 2021. From the analysis of their work, it could be observed that the research focusing on dynamic problems usually also address stochastic problems. The opposite, however, may not necessarily be true. As such, some discussion on research works regarding stochastic VRP for humanitarian operations is warranted to complement those discussed in [17]. For example, the recent work of [21] addressed the Inventory Routing Problem (IRP) with an uncertain traffic network. Here the distribution of essential multi-commodities in the chaotic post-disaster phase among relief shelters and distribution centers is modelled through the network flow model. The uncertain traffic network is due to the magnitude of the earthquake's attributes, such as the earthquake's magnitude and the time of its occurrence. Such attributes also affect the vehicle speeds when making split deliveries. Apart from the optimized routing decision, the efficiency of the commodities distribution is further improved through the optimized inventory decision. Both are computed via the simulation optimization technique where the Sample Averaging Approximation (SAA) method is applied.
On the other hand, Ref. [22] presented a two-stage Location Routing Problem (LRP) of distributing first aid relief materials post-earthquake disaster where the complex demand uncertainty is addressed. Here the mixture of uncertain demand from the perspective of randomness (probabilistic theory) and fuzziness (possibility theory) due to the merging of subjective and objective data forms the hybrid demand uncertainties. A scenario-based stochastic demand over a specified interval per scenario is considered for stochastic probabilistic programming due to the strong relationship between events/scenarios such as post-earthquakes and aftershocks and the demands' uncertainty. As such, the demand parameter is considered a fuzzy random variable. In the two-stage robust programming model, the decision for locating a warehouse among existing warehouses is first determined, while the routing and distribution decision is computed in the second stage. The objective is to minimize the cost of warehouse location and the penalty induced by unsatisfied demand.
Here the equivalent crisp model is represented by the Basic Possibilistic Chance Constraint Programming (BPCCP) model and later modified as the two-stage robust programming model to overcome the drawback of BPCCP. A small-scale instance problem based on the actual case study of Hamadan province of Iran, a location prone to earthquakes, is applied to verify the model proposed. The solution obtained using CPLEX indicates that the demand satisfaction level is significantly higher than in conventional scenario-based stochastic programming.
Meanwhile, the work proposed in [23] focused on the redistribution of food to charitable agencies, such as homeless shelters and soup kitchens, by picking up the resource (food) from donors such as grocery stores and restaurants. Here the objective is to assure fairness in distributing the donated foods while also considering the waste implicitly through the constraints introduced. The demand from charitable agencies and the resource (donation) are stochastic in this problem. Additionally, equity is the rate between allocated amounts to respective agencies over the total demands. Some assumptions are made, such as unlimited vehicle capacities and that all donors must be visited before distribution is performed. A decomposition strategy in the form of a heuristic is applied to solve the problem where the recipients and donors are clustered first. Then the route is computed for respective clusters, and resources are allocated for each recipient.
Another multi-objective problem is addressed in [24] to deliver both non-perishable and perishable items to demand points considering uncertainty, such as the location and number of relief centers that should be established at the demand points as well as the delivery means of the relief item post-disasters. A Mixed-Integer Non-Linear programming (MINLP) model is formulated to minimise the total distance travelled, the maximum travelled distance between relief centers and demand points, and the total cost associated with acquiring the relief items and vehicles utilized as well as the inventory cost. This model is solved by GAMS software for small-scale instances. Meanwhile, the larger instances are solved by a Grasshopper Optimization Algorithm (GOA) metaheuristic.
Ref. [25] addressed another multi-objective problem involving the COVID-19 pandemic. In this problem, multi-period collection and delivery of multi-products in a single open and close loop Supply Chain (SC) is modelled through the formulation of transportation problems and the Pick and Delivery VRP (PDVRP). Here the open and close supply chain system involving reusable and non-reusable products is transferred from hybrid depots that produce and recycle the products through a forward and reverse flow. The transfer collection centers located in the affected COVID-19 areas acted as the intermediary between the depots and the hospitals. Heterogeneous vehicles traverse the forward and reverse flow via PDVRP from the transfer collection centers to the hospitals when delivering products through split delivery while receiving the old product for reproduction. Various uncertain parameters are defined involving vehicle cost, production cost, the demand from the hospitals, and the returned products when computing for multiple decision variables focusing on the number and routing of the vehicles, production and return of products, shipping, and inventory of products. Finally, a robust optimization approach is applied where the Tchebycheff method is adopted to solve the complex problem.
A complex problem of relief distribution within the humanitarian logistics network and victim evacuation is addressed by [26]. This problem is based on the Facility Location Problem (FLP) and VRP, where the uncertain demand, transportation time, miscellaneous cost, injured victims, and facility capacity are considered. Moreover, the formulated problem involves stakeholders, such as the suppliers of relief materials (e.g., charitable organizations), the distribution centers, the emergency centers that distribute the aids, distribution units where the evacuated victims are located, and hospitals to which they are sent to for further treatment. Thus, the decisions are based on locating and distributing relief materials and allocating evacuated victims assigned to respective routed vehicles. Multi-objectives are also addressed where the total humanitarian logistics cost, the total time of relief operations, and the variation between the lower and upper bound of the transportation cost of the distribution centers are minimized. The uncertainties in the form of inconsistencies and unclear and inaccurate information are captured through the neutrosophic fuzzy set. These uncertainties are prioritized and dealt with, respectively, through the neutrosophic set and the robust optimisation, where the latter deals with the uncertainties associated with worst-case planning involving victims, facility capacity, and relief supply.
The work by [27] similarly addressed the LRP through the Conditional Value at Risk with Regret (CVaR-R) bi-objective mixed nonlinear programming model in dealing with relief distribution post-disaster. In this problem, the optimal decisions include selecting distribution centers to be made operational among all existing distribution centers, the number of vehicles that should be assigned, the assignment of vehicles to respective demand points, and the allocation of relief aids delivered to each of these demand points. The concept of regret when making these decisions due to inaccurate expectations of demands allows for a novel chance constraint programming approach introduced through the CVaR-R measure. The regret value for each objective to: (1) minimise the total waiting time of demand points and (2) minimise the total system cost is measured by defining possible demand scenarios with respective probability. For each demand scenario, the difference between the ideal objective values (from the deterministic model) and the objective values computed given the demands in each scenario is determined, from which the worst-case scenarios are identified and applied to compute the CVaR-R. To solve the problem model, the Nash Bargaining Solution (NBS) approach is applied with the help of a hybrid Genetic Algorithm (GA) when solving the single LRP version of the problem (via the GA) as well as determining the Pareto frontier (via the Non-Dominated Sorting GA Algorithm II (NSGA-II)) used to compute the NBS.
Finally, Ref. [28] also addressed the risk decision factor regarding the relief distribution with uncertain travel time. This problem is formulated based on the multi-level network via a mixed integer programming model to minimise the total arrival time of vehicles delivering relief materials to only selected demand points instead of satisfying demand for all demand points. Furthermore, the near-optimal delivery route computed ensures that a particular service level is reached throughout the operation by formulating the associated constraints. By introducing the risk-averse approach, the objective function is adapted by incorporating the standard deviation term representing the risk of the decision regarding uncertain time travel. Meanwhile, the non-additional term is the expected total arrival that needs to be minimized. Both are weighted to balance the importance of risk to the decision maker when computing for the near-optimal decision. Here, the Variable Neighbourhood Search (VNS) is employed to solve the problem based on the data obtained from the Haiti earthquake case study in 2010.

Markov Decision Processes Model for Humanitarian Operations
In this paper, the current trends of VRP in the scope of MDP modelling are discussed. It could be derived from [17] that MDP modelling for VRPs in the setting of humanitarian operations is scarce. This is supported by the finding that only [29] addressed the problem in humanitarian operations. Ref. [29] looked at multiple humanitarian operations, such as delivery and search and rescue, by incorporating the Relief Assessment Team (RAT) and the Emergency Relief Team (ERT), using the Decision-Making Agent (DMA) to coordinate the former two. The problem is modelled as an MDP with multiple random parameters, such as the demand and stability of the transport link. Here, the RAT is tasked to assess affected areas and dynamically report the demand situation for each zone. ERT, on the other hand, is tasked to serve the zones assigned by the DMA. Meanwhile, decisions are composed by assigning various combinations of aid organisations to a specified affected zone, as well as routing decisions for both the ERT and RAT. Finally, Q-learning is applied to solve the problem involving a maximum of 7 RAT and 10 ERT.
Other than papers reviewed by [17], there are works such as [30] which plan evacuation routing for the last minute disaster preparedness phase. In this problem, residents are evacuated prior to disaster occurrence via buses that pick up stochastic resident evacuees at bus stops. The problem is modelled as a single trip operation with a homogeneous fleet of vehicles within a finite horizon. Here, the action or decision is the assignment of the next pick-up point for each of the vehicles and the number of evacuees taken, suggesting a split delivery operation. The small instance of the problem is solved exactly using structured value iteration. Meanwhile, dynamic re-routing is applied for a large-scale problem through a reduced network flow MIP and Robust MIP (RMIP) model. Beyond these aforementioned works, related works on modelling a VRP through the framework of the MDP for humanitarian operations are sparse.
Instead, the MDP application is used to address humanitarian operations that revolve around the coverage problem, the allocation problem, the path planning, and the scheduling problem. Ref. [31] computed the evacuation routes during a disaster by modelling the problem as an MDP model. However, the work cannot be regarded as a VRP as the target application is not specified in terms that would constitute a VRP, such as the vehicles' availability or their capacity. Similarly, Ref. [32] used the MDP model to address the problem of clearing the debris from blocked edges or roads in an optimal assignment with uncertain clearance resources. In the post-disaster event, the optimal decision was computed considering the delivery of aid or service for demand nodes. Then we have [33] who addressed the problem of congestion in terms of hospital facilities as well as limited ambulances to rescue patients in the aftermath of a disaster. Considering the stochastic treatment time and transportation availability, the decision for such planning was to allocate ambulances to affected patients and to choose which medical facility the ambulance should be headed for. Two types of vehicles are considered: a dedicated ambulance for a location and another ambulance with flexibility in terms of the location. The latter might suggest split delivery. However, neither capacity nor comprehensive routing were considered in this problem. Dynamic patient treatment times were updated, and the problem was solved based on proposed heuristics that applied a myopic approach with policy improvement.

Markov Decision Processes Model for Industrial Problems and the Application of Approximate Dynamic Programming
Apart from the application for humanitarian operations, the MDP modelling framework has also been adopted for industrial problems since 2000. Interestingly, most of the early works are dedicated to solve the single VRP with stochastic demand such as in [34][35][36][37]. The study in [38] is among the first to look into the theoretical aspect of the pick-up and delivery of a single vehicle-routing problem with stochastic request using MDP modelling. The focus on the multi-vehicle problem is seen later in [39][40][41] for a VRP with stochastic demands. On the other hand, Ref. [42] deviated from stochastic demands by addressing the problem with stochastic customers or stochastic requests. In this type of application, various different and diverse solution strategies were adopted. This contrasted with the problem with stochastic demands which mainly applied the lookahead approach such as the PDS-RA.
Meanwhile, in [14,43] the route-based MDP was introduced as opposed to the conventional formulation of an MDP addressing a VRP while incorporating a dynamic route plan at every decision epoch. This framework was applied in [44] to solve the problem of maximising the number of services within the same period for stochastic service requests using the Value Function Approximation (VFA). The problem was to decide which new stochastic request should be accepted and which should be postponed to the next operation period. Apart from the fixed working time, multi-periods would be considered as multi-trip operations. VFA was applied as one of the ADP solution approaches allowing for online computation while dealing with the decision and outcome spaces. Through the VFA, the state space was segregated based on common features. The resulting MDP model led to a huge number of decision points which was dealt with by introducing a classification of multi-period operations. The application of VFA as shown here is commonly known to ignore some details of the state due to the aggregation mechanism within. The same authors in [45] alleviated this problem by proposing a hybrid of two ADP approaches: the VFA and the Rollout Algorithm (RA) to address the stochastic request for a single vehiclerouting problem. The authors later addressed the same problem in [46] with a different hybrid mechanism of VFA and RA by having the second part of the simulation horizon driven by a base policy dictated by the VFA. This method was effective in increasing the solution quality while reducing the computation time. A single vehicle-routing problem was also addressed by [47] with regards to taxi routing when searching for a passenger. In this MDP model, the transition probability is derived from the taxi-passenger matching probability on a link. Here, an enhanced value iteration was applied to solve the problem by reformulating Bellman's Equation into a series of matrix operations.

Approximate Dynamic Programming in Machine Learning
As can be seen from the example mentioned above, aside for [47], ADP is commonly adopted as a solution approach for problems formulated in MDP. The ADP emerged as a means to solve real and practical MDP problem models. This is due to the complexity of the MDP model that increases exponentially due to the explosion of state, outcome, and action spaces [48,49]. Such problem renders the exact solution to be prohibitive as shown in [50,51]. Afterwards, Ref. [52] coined the term ADP which is otherwise known as RL or neuro-dynamic programming. The interest in ADP sparked around the mid-1990s when it was extensively written on by [53,54]; although the ideas and concepts could be found dating back to the 1950s. For instance, Ref. [55] described the concept of approximating the value of a position in a chess game which is based on the state of the board. He likened the concept to an experienced player evaluating a move roughly but not based on all possible scenarios. Later, Ref. [12], when introducing dynamic programming, hinted at the idea with the mention of value space and approximation. Then [56] not only applied the idea of [55] in evaluating a position move, but also showcased the practical use of the lookahead approach while approximating the value of a move in the game of checkers. At the same time, he considered the possibility of remembering all the positions and moves, much like the concept of a dynamic lookup table. The work was further improved in [57] with regards to a tree search lookahead with an improved alpha-beta pruning scheme based on the memorisation of a board position, known as the book-learning procedure. The basic ideas from the aforementioned works were explored further resulting in pivotal works of [58][59][60] which formed what is known as the modern era of ADP [53]. Following that, Refs. [61][62][63] in his experiments with the game Backgammon afterwards demonstrated the practical application of ADP to solve real and complex problems.

Rollout Algorithm and Post-Decision State Rollout Algorithm in Approximate Dynamic Programming
The authors take advantage of the well-known operations of the VRP from the humanitarian operations' perspective and propose that the model based on MDP is adopted for MDDVRPSRC with a known variant of RA (PDS-RA) being applied to solve the problem online. The intuition for the lookahead approach as elaborated above can be identified back to the work of [55] who thought of evaluating a move by thinking ahead in a lookahead manner. The RA is based on the same principle but is refined by means of the quality of the lookahead which depends on the base policy applied. Furthermore, the horizon of the lookahead plays an important role as was mentioned by [64,65]. Another crucial aspect of the rollout is the Monte Carlo sampling that would enable the multiple lookaheads to account for the stochastic parameter for the simulated episode. Finally, the number of simulations performed would also contribute to a better mean approximation of the state or PDS value. The RA is proposed by [48] belonging to the class type of policy iteration in reinforcement learning. Note that [34] was among the first to develop a specified RA in solving sequencing problems. This work was extended in [35] for a single VRP. In [49], the performance of the rollout policy is compared to the optimistic approximate policy iteration for the single VRP. The former showed a better performance. Cyclic heuristic is applied as the base policy for the RA in [66] to solve the problem of of a VRP with stochastic demand (VRPSD) as well as the problem of the Travelling Salesmen Problem (TSP) with stochastic travel time. The former was also addressed by [36] who presented the two-stage stochastic programming solution approach and compared it to the online approach using RA. The outcome space was addressed exactly for all these problems up to this point. However, Ref. [37] managed to tackle this challenging computation approach by applying Monte Carlo sampling instead in a single-and two-stage RA. The computation was shown to have shrunk by 65% when compared with [35].
In [39], the multi-vehicle VRPSD is finally addressed. The challenges that come with the multi-vehicle problem is addressed through clustering to dedicate a set of different customers to each vehicle. An offline a priori route is computed for respective vehicles, and RA is only applied if a route failure occurred. Although RA was not applied from the get-go, this work highlighted the potential of a clustering mechanism when dealing with a multi-vehicle VRP. This was seen in [40] who made the clustering mechanism dynamic at every decision point as the status of stochastic demands were being updated. The problem was then solved by proposed variants of RA making use of the extension of the MDP framework, pre-decision state (PRE) ,and post-decision state (PDS) , advocated by [67]. Here, the base policy applied was the fixed route heuristic. This heuristic was relaxed forming a known restocking fixed route heuristic in [41], from which policies were iterated and obtained through a local search. These policies were evaluated with the help of the optimal value computed by dynamic programming to help with pruning the search space in the search for a more effective rollout base policy. This base policy was then applied in the RA to obtain a more optimal policy for the problem. The same concept was applied in [68] for the same problem, apart from the duration limit for a single vehicle, where a hybrid of backward and forward recursive dynamic programming was applied instead. In the work of [42], the authors applied RA with the cheapest insertion heuristic as the base policy for the problem of single VRP with stochastic requests. The solution was compared with the solution obtained from a greedy heuristic and VFA, respectively. Meanwhile, Ref. [69] proposed a framework for applying RA for the dynamic and stochastic VRP as an ADP solution approach. In [45], PDS-RA was applied with the VFA method driving the decision rule for the lookahead.

Rollout Algorithm as Matheuristic
An RA with which the base heuristic is driven by the policy of applying a mathematical programming method could be regarded as a matheuristic method. Among those who applied such a technique are [15,[70][71][72][73]. In the work of [70], the authors addressed the scheduling problem with RA, where the decision rule was obtained using the quadratic programming approach. In [71], the Mixed Integer Linear Programming (MILP) was applied to obtain the decision rule for the RA in solving the problem of inventory routing with a single vehicle. Such approaches were also seen in [72,73] for solving the inventory routing problem. In other works, Ref. [15] proposed a matheuristic RA method to solve the multi-vehicle routing problem for humanitarian delivery operations by reducing the two-stage stochastic programming model to two reduced models that was dependent on the vehicle's mode of operation: replenishment or serving an emergency shelter.

Knowledge Gap and Research Contribution
By referring to Section 2.2, it is very clear that a humanitarian VRP, which is being addressed through machine learning, is still lacking. Meanwhile, from the industrial problem's perspective, the literature available shows that solving the VRP problem via reinforcement learning and RA are largely limited to stochastic demands, customers, or request problems. No such approach applied has addressed the problem with stochastic road capacity. The authors intend to fill this gap.
Furthermore, addressing the VRP through MDP formulation and the RA solution method is often limited to those of a single-vehicle problem as seen in the work [35][36][37]41,42,45,49,66,68]. This is evident even in the published works of the last five years. Those who solve multi-vehicle problems such as the work [39][40][41] often resort to a clustering or decomposition method of the vehicles to a set of different customers. Such a method, with the exception of a dynamic decomposition method, could not be performed when addressing MDDVRPSRC as the road capacity; thus the route is uncertain. Furthermore, it is not clear how a dynamic clustering or decomposition method could address the problem with stochastic road capacity while also accounting for the split delivery and multi-trip operations. To the best of the authors' knowledge, there is no literature that has proposed the method of building the decision rule on the go, guided by the constructive heuristic as the policy while performing the rollout. Similarly, no known variants of such heuristics have been introduced to allow for decision ruling on the go. Here the authors intend to fill the research gaps by proposing the application of proposed heuristics (TBIH-1-TBIH-5) within the PDS-RA adopted from [13]. In terms of the modelling approach, to the best of the authors' knowledge, no literature addresses the MDDVRPSRC, whether in the form of a mathematical programming model or in the MDP formulation, especially in humanitarian operations settings. Most literature for MDP formulation in VRPs addressed multi-trip operations only for on-route failure occasions. For example, the work such as [74] addressed the split operation also when triggered by the event of route failure due to stochastic demands. Literature such as [44,45] which addressed the multi-period operations, however, do not include the possibility of split delivery operations. Furthermore, most of these models only described operations involving a single vehicle. We differ from existing literature by intentionally allowing multi-trip operation as well as split delivery to address the limitation of delivery trucks during a disaster event rather than a result of route failure. Addressing multi-vehicles leads to the explosion of action space and is often very difficult to solve without resorting to a clustering approach. The authors therefore present here how with a moderate number of destination nodes in various simulated road network, the MDDVRPSRC could be solved without utilising a clustering or dynamic clustering method.

Problem Statement
The problem of MDDVRPSRC focuses on the delivery problem, one of the crucial humanitarian operations during a disaster and post-disaster event. Here, the road network is represented as an undirected incomplete graph G = (H, E) in graph theory. H is the set of nodes in graph G such that H = {D} {S} {N} where D, S, and N are the set of depots, emergency shelters, and connecting nodes respectively. The connecting nodes represent the junction connecting the edges (i, j) ∈ E, representing the roads such that Note that emergency shelters whose demands are satisfied can act as connecting nodes for vehicles to travel through.
In MDDVRPSRC, the medical supplies are to be delivered to temporary erected emergency shelters, s ∈ S, with different demands, w s , by a homogeneous fleet of vehicles, m ∈ M, with capacity, q m . The delivery of medical supplies is conducted via split delivery to account for the limited number of vehicles during the sudden onset of a disaster. Vehicles are allowed to perform multi-trip deliveries throughout the humanitarian operations to satisfy all the demands. The vehicle capacity, q m , can be replenished to a full capacity, Q, as soon as they return to the depot, d ∈ D. All vehicles must be dispatched throughout the operation until the total demand is satisfied as there is no guarantee that any assigned vehicles might reach their designated emergency shelter. More on this point, a vehicle is considered stranded when it is unable to travel to the next node on the way to the depot for the consecutive decision points of ST once the total demand is satisfied. Unless such an event occurs, all vehicles must return to the depot. However, they are not constrained as to which depot they should return to, advocating the flexibility needed for the humanitarian operations.
All throughout the operation, the road capacity, r i,j , is uncertain. The mean road capacity for each road, (i, j), as well as the capacity distribution deteriorates as the operation time progresses for all damaged roads [15]. The deteriorating road capacity mean is due to the damages inflicted by the subsequent post-disaster events, such as tremors of an earthquake. This is simulated as the outward radial circles originating from the epicentre of the earthquake [15]. For more information of road capacity distribution, road damage determination, and simulation of earthquake tremors, refer to the work of [15].

Agent Solving MDDVRPSRC in Reinforcement Learning
In reinforcement learning, an agent learns to make decisions based on given information of the system. The information is given in the form of the state representation of the system as well as the transition of the state when making a constrained action or decision. An agent learns to make optimal decisions based on the series of interactions it has made from making decisions and in return obtained rewards. An MDP model formulates how an agent sees the system through: (i) the state representation, (ii) how the system transitions, (iii) the constrained actions or decisions it is allow to make as well, and (iv) the reward it received from making a decision. In this work, the agent perceived the problem as a MDDVRPSRC and will make a near optimal decision at every decision point. This agent then becomes a part of the DSS for delivering critical medical supplies during a disaster.
The agent learns of the aforementioned delivery operations of the MDDVRPSRC through a series of discrete states it observes at each decision point, k, beginning with the initial state, s 0 , until the end state, s K , representing the end of delivery operations. The Pre-Decision State (PRE), s k , represents the state of the operations at decision point, k, prior to decision, a k , computed by the agent. The state observed by the agent after decision, a k , is made and denoted as the Post-Decision State (PDS), s a k k . These states (PRE and PDS) are described through the state variables l, t, q, w, e, and r, respectively: • l: current location of all vehicles; • t: the next arrival time to next destination of all vehicles; • q: the capacity status of all vehicles; • w: the demand status for all nodes; • e: the occupancy status of road (i, j) in relation to vehicle m; • r: the road capacity of all roads or edges.
In the MDDVRPSRC MDP formulation, an agent computes a decision, a k , to send a fleet of vehicles to a destination node based on the state that it observed, s k . Once the decision a k is executed, the PRE transitions to PDS, s a k k , deterministically and waits for the next decision point, k + 1, which is triggered at T k+1 by the arrival of one or more vehicles m ∈ M k+1 simultaneously at the emergency shelter s ∈ S or connecting node n ∈ N or depot d ∈ D.
Once the decision point k + 1 is triggered, the random road capacitiesr i,j are observed for all roads (i, j) ∈ E through the dynamic update from the locals at the arrival destinations. At this point, the PDS transitions to PRE, s k+1 , stochastically and the PRE state variables e k and r k are updated. This new random information is now known to the agent (via the updates of e k and r k ) who then uses it to compute the next decision, a k+1 , for all vehicles. Once the decision a k+1 is computed, the PRE transitions to PDS, s a k+1 k+1 . At the same time, demand w h , ∀h ∈ S is served or the vehicle's capacity is replenished or neither (when a vehicle arrives at connecting node h ∈ H ∩ {S + D}) depending on where the vehicles arrived. Hence, the variables of PDS representing the next destination, l a k , arrival time to next destination t a k , capacity, q a k , status edges travelled, e a k and demand status, w a k are updated accordingly. The MDDVRPSRC formulation that follows is developed by referring to [75] and the work of [13]. All parameters, state variables. and decision variables are listed in Tables 1 and 2. Table 1. MDDVRPSRC: Parameters.

Parameters
waiting time for vehicle arrived at a node but is assigned to remain stationary at current node such that decision set of all possible decisions at decision point k given s k A π (s k ) MDP decision rule at decision point k following policy π given s k π (s k ) on-the-go constructed lookahead decision rule at decision point k following policy π given s k η π (s k ) decision rule computed through construction heuristic or CPLEX determined by policy π at decision point k given s k in the lookahead horizon damage unit sustained by road (i, j) [15], such that p i,j ∈ N Z a large negative arrival time for vehicles resting at depot when all demand is served G2 a larger constant acting as reward or penalty ST a limit on how many times a vehicle is allowed to be stationary (stranded) consecutively in terms of decision points F(m, i) function that adds consecutive decision points for all stranded vehicle m at i when all demand is served and all other vehicle is at depot consecutively starting from decision point k strand 0 to current decision point k strand k such that reward for the agent for making decision a k at decision point k when observing(given) the state s k R k,m (s k , a k ) individual reward of vehicle m at decision point k

PRE and PDS Variables
vector of arrival time ∀m ∈ M to assigned destination at decision point k, such that vector of next assigned destination for vehicles, such that a k ∈ H |M| = [a 1 , a 2 , . . . , a |M| ] ∈ A(s k )

MDDVRPSRC Formulation
The pre-decision state (PRE) s k is a multi dimensional vector consisting of other vectors representing each state variable, respectively. Within this vector are the state variables defined in Table 2: The PDS representation shares the same features as the PRE differs only by annotation and that its variables are updated after decision a k is made: Once the decision point is triggered, based on the minimum current values within the arrival time vector t a k , at: the respective decision/assignment is computed for all vehicles including those that arrived as shown in Equation (4) below: Here, the vehicle that is still en route to its destination during the decision point k is denoted by {m ∈ M\M k }.
In the MDDVRPSRC, the decision a k is a |M| dimensional vector in a decision space (a set) A(s k ) given the state s k . The decision involves assigning the next destination for all vehicles at every decision point k, However, the decision space A(s k ) for MDDVRPSRC is too large to obtain a good solution within reasonable computation efforts. Therefore, for every decision point k, the decision by the agent is computed as proposed in [15] where the reduced decision set for a single vehicle O m is as defined for the rollout in Table 1: The state S k transitions deterministically to PDS, s a k k : where the decision is made by the agent (l a k = a k ). This is where the next destination l a k , arrival time t a k , capacity of vehicle q a k , travelled edges status by vehicles e a k , as well as the demands status w a k are updated. At this point, the stochastic road capacity is not known; hence r a k is not updated. The time of arrival t a k m ∈ R, ∀m ∈ M is updated to: where t wait is defined as: and t k is an n-tuple with an increasing order of arrival time.
Meanwhile, the capacity of all vehicles m ∈ M is updated to: The travelled edges e a k are updated ∀(i, j) ∈ E, ∀m ∈ M: Finally, the demand of emergency shelter is also updated ∀h ∈ H: At decision point T k+1 , the uncertainty of the road capacityr i,j ∀(i, j) ∈ E is now observed by the agent which leads to the transition from PDS to PRE, s k+1 : The road capacity at this point is no longer uncertain ( r i,j ∀(i, j) ∈ E) since it has been sampled/known to vehicles that have arrived at their destinations. This information is thus known to the agent. The next destination l m = l a k m (∀m ∈ M), the arrival time t m = t a k m (∀m ∈ M), capacity of the vehicle q m = q a k m (∀m ∈ M), as well as the shelter demand w h = w a k h (∀h ∈ H) remain the same.
The travelled edges status, e k is updated ∀m ∈ M, ∀(i, j) ∈ E: The road capacity r k is at this point observed and updated: where the random road capacityr i,j ∀(i, j) ∈ E is obtained from a random Poisson distribution as described in [15]. When transitioning to PDS, s a k k , the agent receives a reward R k (s k , a k ) contributed by all vehicles m ∈ M at decision point k: where R k,m (s k , a k ), ∀m ∈ M is given by: Here, a policy of decisions denotes the guiding principle on which the decision is based. For example, a policy π B ∈ Π could be a heuristic if a decision function A π B (s k ) is mapped from that heuristic. In this model formulation, the decision a k = A π (s k ) ⊂ A(s k ) [67] is selected among other potential decisions in the decision space, A(s k ) following a certain policy π ∈ ∏ such that the decision rule function A π (s k ) : s k → a k .
The objective (Equation (18)) is to find an optimal policy π such that the expected total rewards are maximised (objective function for the Bellman optimality equation [67]): Hence: where for every decision point k, the optimal decision a k is chosen: a k = A π (s k ) by following the optimal policy π .

MDDVRPSRC Solution Approach
To solve an MDP is to seek an optimal policy π . Through this optimal policy, every decision made by the agent is optimal: A π (s k ) : s k → a k as stated by the principal of optimality [12]. To obtain the optimal policy, the Bellman Equation is solved [12] such that the optimal decision a k is computed for every decision point k for each given state s k observed by the agent. This series of optimal computed decisions is said to be guided by the optimal policy π ∈ Π, and therefore the problem formulated in MDP is solved. To compute the optimal decision, a k , the Bellman Equation could be written as Equation (20) [67]: Due to the curse of dimensionality as seen in Equation (20), computing for an optimal decision is often intractable. To alleviate the curse associated with the outcome or transition space, the PDS is introduced [67] (Equation (21)) and the equation was rewritten as in Equation (22): Even with PDS introduced as above, the computation for an optimal decision is usually challenging, especially when computing the value of the PDS, V a k k (s a k k ) in Equation (22). The ADP approach approximates the value of PDS instead. This is to deal with the large state space in the MDP. Through the ADP approach, Equation (22) is rewritten as Equation (23), where the value of PDS is approximated and thus the decision is computed to near optimality instead: For MDDVRPSRC, this equation is still challenging to solve since the decision space A(s k ) in Equation (23) is too large for a practical number of vehicles to be involved. Furthermore, the decisions consist of combinations of vehicle assignments that would require a long rollout horizon as well as a large number of Monte Carlo simulations. The concern is that the reward obtained for one vehicle may exaggerate the value of the decision for all vehicles collectively if computed prematurely. This is seen in the initial experiments where, given limited computation capabilities on the machine, inefficient assignments for vehicles resulted.
Alternatively, to cope with this challenge, the near optimal decision could be further approximated as in Equation (24) as proposed in work [15]: where the decision space, made of combinations of vehicle assignments (which could be astronomical), could be restricted to a decision space of possible assignments for each vehicle O m instead, as shown in Equation (25): Even with such measures, given the machine's capabilities, the computation is only limited to a small number of rollout horizons and Monte Carlo simulations. However, it is shown in this work that the decisions computed are applicable to this type of problem.
To compute Equation (25), the PDS value, V a m k (s a m k ) is approximated using PDS-RA, as proposed by [40]. However, different base heuristics can be applied to solve for the MDP problems with characteristics such as the one in this work. Finally, this approach is applied to reciprocate the model for the agent's decision in Equation (6).

PDS-RA Algorithm
PDS-RA is one of the RA families first introduced in [40] as an ADP solution algorithm to the dynamic VRP with stochastic demand. PDS-RA takes advantage of the PDS structures that alleviate the problem associated with the outcome or transition space. This thus reduces the number of rollout executions compared to the conventional RA to approximate the value of PDS in a modified Bellman Equation effectively. The general PDS-RA could be referred to in Algorithm 1. Here, the rollout transitioned PDS (simulation) is denoted as s a to avoid confusion with the real-time transitioned PDS, (s a k k ) observed by the agent. In this algorithm, the values of PDS, V a k k (s a k k ) associated with each respected a k ∈ A(s k ) is approximated (V a k k (s a k k )). For each possible PDS associated with the next decision a k ∈ A(s k ), the PDS-RA is executed, and by the end of the execution, the approximated value of PDS, V a k k (s a k k ) is obtained. In every execution of PDS-RA, the base policy π B(s a k k ) ∈ Π is first assigned (policy to apply heuristic B), computing the decision rule function π B(s a k k ) (s k ) for the rollout simulation is based on the heuristic B performed given the PDS, s a k k . Based on this specific decision rule (which is normally the assignment of vehicles to next destination for VRP), the lookahead into the future as far as horizon K is performed. Transiting from the simulated PRE to PDS is enabled by referring to π B(s a k k ) (s k ) when making a decision during the lookahead. This decision is followed by a stochastic transition, transitioning PDS back to PRE (s k = S M,W (s a , W k+1 (ω(k + 1)))) in the lookahead simulation. Here, the random information of road capacities for all roads, (W k+1 (ω(k + 1))), is known by sampling ω(k + 1) ∈ Ω through a known distribution as part of the Monte Carlo simulation approach. By sampling ω(k + 1) ∈ Ω, the exhaustive computation for all random transitions of outcomes in the outcome space Ω is prevented as first observed by [37] in her application of RA.
Each time the transition occurs within the lookahead along the horizon, rewards R k (s k , a k ) are consecutively amassed. At the end of a one-episode lookahead simulation, the sum of rollout rewards B n (π B(s a k k ) , k + 1, K) is obtained. This value is used to update the approximated PDS value of the respective potential decision a k through an incremental mean approach (Algorithm 1: line 14).   [75] based on PDS-RA proposed by [13] and highlighted in [15]).  The PDS-RA is then executed for the next possible decision a k ∈ A(s k ) after which the process repeats. Finally, when the PDS-RA is executed ∀a k ∈ A(s k ), Equation (23) can now be computed based on all approximated PDS values associated with each respective potential next decision for agent to make. Based on the computation of Equation (23), a decision is then made.
It should be noted that Algorithm 1 is applied to the rollout and looks into the future of each potential decision where each decision a k revolves on the assignments of all vehicles a m = a k [m], ∀m ∈ M simultaneously per PDS-RA. The base policy π B(s a k k ) guides the construction of the decision function π B(s a k k ) (s k ) in one go, per PDS-RA execution ∀a k ∈ A(s k ).
In the applied solution, the rollout is executed for every potential next destination for each vehicle a m , ∀m ∈ M, and the value for each PDS associated with the potential next destination is computed by PDS-RA, such that near optimal assignments a m would be computed in Equation (25). These near optimal individual assignments form the near optimal decision as described in Equation (24). The base policy π B(s a k k ) is based on an iterative policy π B(s k ) applied at each lookahead decision point s k to construct the decision rule π B(s a k k ) on-the-go using constructive base heuristics B. The decision rule In Algorithm 2, for example, CPLEX (denoted as CPLEX) is applied instead as the base heuristic. Here, CPLEX is run for a Stochastic Linear Integer Programming (SILP) version of the reduced MDDVRPSRC to construct π CP(s a k k ) (s k ) on the go given the current rollout state s k and this results in η π CP(s k) (s k ). Since here the policy is to apply CPLEX, we denote that the policy that the decision rule follows is of π CP(s a k k ) . A detailed explanation on this matheuristic is given in [15] (Algorithm 2). The authors used this exact configuration (with CPLEX as the base heuristic) as a benchmark where tractable. As a solution approach in general, the Algorithm 3 is referred. The authors first introduced the TBIH-1 heuristic (Algorithm 4) based on a pure random insertion for the non-obvious decisions. Furthermore, other constructive heuristics were applied dynamically (TBIH-2 and TBIH-3), extended from TBIH-1, to construct the decision rule on the go ( π B(s a k k ) ) as shown in Algorithm 3, in contrast to Algorithm 2. Additionally from these constructive heuristics, the authors propose another two new variants (TBIH-4 and TBIH-5) for this problem by introducing the exploitation mechanism on both TBIH-2 and TBIH-3. R k (s k , a k ) = R k (s k , a k ) + λ k R k (s k , a k ) 8: s k = S M,W (s a , W k+1 (n(k + 1))) 9: π CP k (s k ) ← CPLEX(s k ) 10:   In this section, the base heuristics applied are described in general (Algorithm 4), and the elaboration of each is described in the subsections that follow. TBIH-1, TBIH-2, TBIH-3, TBIH-4, and TBIH-5 are the heuristics applied in this work to both validate the model and to cross-compare the performance for each of the models.
The algorithm for each heuristic applied here follows the same main structure of: (i) the teaching part (TP) and (ii) the seeking part (SP).
In the TP, the obvious decisions are chosen without running any heuristics to search for the best next destination. These obvious decisions are stated in Equation (6) and applied to each vehicle:

•
To serve any shelter s ∈ S randomly among possible shelters as the next destination; • To replenish at any depots d ∈ D randomly among possible depots as the next destination; • To return to any depots d ∈ D randomly among possible depots when all demands have been served; • To remain stationary at the current arrival node i while still having capacity (q m = 0) and demands that have not all been served, if the only next possible destination is to a depot; • To wait at the current arrival node i while still having capacity (q m = 0) if the road capacity to the next possible shelter is blocked r i,j = 0; • To remain resting at the depot if the current arrival node is a depot i ∈ D and all demands have been served; • To remain stationary at the current arrival node i if all road capacity to a neighbouring node j are blocked r i,j = 0, ∀(i, j) ∈ E.

50:
else 51: nextD.remove(a m ) 52: if len(nextD)!=0 then 53: a m = random.choice(nextD) 54: else 55: continue 56: end if 57: end if 58: end while 59: if Decision!= 0 then a m = random.choice(nextD) 66: if r i,a m > 0 then 67: Decision.append(a m ) 68: end if 74: else if len(US)==0 AND len(nextD!=0) then 75: if len(nextD)!=1 then 76: a m = random.choice(nextD) 77: while len(nextD)!=0 do 78: if r i,a m > 0 then 79: Decision.append(a m ) 80: These decisions are considered obvious decisions where computation efforts should not be focused on. Instead, only when all of the above TP decisions are not applicable (no obvious decisions), will the heuristic be applied. Ideally, none of these obvious decisions should be specified. Instead, the agent should be able to figure out and learn the obvious decision based on the reward obtained. However, such an ideal mechanism would require a humongous amount of rollout episodes with horizons extending to the termination state on each of them. For practical purposes, limited by computation power, the obvious decisions are filtered out to avoid extensive computation efforts. Hence the term "teach" in TBIH-1's "teaching part" (TP).
For the SP, a purely random selection of the next destinations could be applied as in TBIH-1 (Algorithm 4 (line 122)). In the proposed TBIH-1, the obvious decisions are inserted by "teaching". The non-obvious decision is decided by a purely random selection among possible next destinations. The general structure of TBIH-1 is described in Algorithm 4 where the SP part is shown in line (121-147).
The TP consists of updating the capacity of all the vehicles as well as the demand of the shelters (Algorithm 4 (line 1)). This is performed so that the decision selected is based on the current status of the demand and capacity which are otherwise updated/observed by the agent during the transition from PRE to PDS. The next part involves determining the potential next destinations j for vehicle m and sorting these destinations whether they are emergency shelters, depots or non-depot nodes (line 5-9). Afterwards, the obvious decision selection follows as described in Equation (6)

TBIH-2
Among the first to develop SIH is [76], whose work is based on the generalised savings algorithm. Ref. [77] then introduced three types of SIH to solve the VRP and scheduling problem with a time window. The proposed SIH (I1) constructs a route by considering two criteria: the first involves determining the best place for insertion, c 1 based on c 1,1 and c 1,2 . The second is the consideration for the best un-routed node υ to be inserted c 2 . For the VRP considering time windows, the SIH (I1) is computed with the following equations [77]: where c 1,1 is the generalised savings, and c 1,2 is the time difference between the new service time for j, b j υ and the time prior to insertion of υ. Together, the best insertion place of υ is computed as c(i(υ), υ, j(υ)) given by: where c 1 is given as: Meanwhile, the best υ insertion criterion c 2 is given by: where node 0 in the formulation is the depot; υ is then chosen based on: Since the MDDVRPSRC does not consider time windows, the θ 2 value is given the value of zero and therefore c 1,2 is not considered. This turns c 1 into a generalisation of [76]. Furthermore, both ζ and θ 1 are given the value of one. Both node 0 and i are considered as the current position of vehicle m at s k during the lookahead. In the DSIH, the seed j is chosen randomly by looking one step ahead beyond the next destination of m currently at i in a set such that {j : (υ, j) ∈ E, r υ,j > 0, j = i, j ∈ (set of the immediate neighbor of i), ∀j ∈ H, ∀υ ∈ (set of the immediate neighbor of i)}. This is illustrated in Figure 1a-c. Here, the road capacity is not considered to simplify the description of the DSIH. In Figure 1a, the current position of vehicle m is at node 0 (not a depot) with capacity to serve. Node S5 is an emergency shelter with demand, while the rest are connecting nodes. The purpose of the DSIH is to treat the next possible destination from the current position as υ by treating j as the seed when constructing a route from node 0. In Figure 1b, the node j is identified as node S9, node 4, and node 3. In the DSIH, the seed j is then chosen randomly among these three potential seeds. In Figure 1c, node 3 is chosen randomly as the seed j. Here, two possible nodes could be inserted as the υ: node 1 and node 5. After applying the SIH(I1) (without time window consideration, θ 1 and ζ is given the value one), node 5 is considered the best inserted node and the route η π B(s k ) is constructed from node 0-5-3. The next destination of the vehicle from node 0, a m = η π B(s k ) (s k ) is then selected as node 5. This also means that at the lookahead decision point k, the lookahead route π B(s a k k ) is constructed on the go/updated such that π B(s a k k ) : s k → η π B(s k ) (s k ) with heuristic B as per TBIH-2. In addition, when applying TBIH-2, the route-based MDP concept is applied [14]. This means that at the lookahead decision point k + 1, the vehicle may not necessarily move to node 3 (η π B(s k ) (s k+1 )) next upon arriving at node 5 as the DSIH will be executed at every decision point. In this example, the road capacity is ignored to simplify the explanation of the DSIH that is used in the SP of TBIH-2. In reality, if the edge (0, 1) has no road capacity available (r 0,1 = 0), then SIH(I1) will not be applied as the only possible υ is node 5. list of potential inserted nodes based on the selected seed, InsertList = {υ : ∀υ ∈ nextND, (υ, seed) ∈ E} 9: dictC11 : (i, υ, seed) ← (c i,υ + c υ,seed − c i,seed ), ∀j ∈ InsertList 10: dictC2 : (i, υ, seed) ← (λc i,seed − dictC1(i, υ, seed)), ∀υ ∈ InsertList 11: choose υ from (i, υ, seed) = arg max

TBIH-3
The CW is derived from the work of [78] as an alternative heuristic solution to the method proposed in [79] to solve the general VRP [80]. This algorithm which is also known as the savings algorithm [81] is based on the savings computed for two non-depot nodes (i, j) in a complete graph (where all non-depot nodes have arcs connecting them to the depot). The savings is computed as S i,j = 2c i,0 + 2c j,0 − (c i,0 + c j,0 + c i,j ) where 0 is the depot [81]. The route is then constructed based on the savings computed for all non-depot nodes in a decreasing order, provided that the capacity constraint is respected and the connection between edges are allowed (normally the theoretical application onto a complete graph). In the event that the capacity constraint is violated, a new route is constructed for the next vehicle in the same manner of the remaining savings pair. The concept of the CW algorithm is illustrated in Figure 2 [82]. One of the few surveys on the CW algorithm for VRP can be referred to in [83]. A good example of the CW application can be referred to in [84]. In this work, the application of the Dynamic Clarke and Wright Algorithm (DCW) is proposed and applied. By replacing the random selection of the TBIH-1 in the SP with the DCW, TBIH-1 is then modified to form TBIH-3 and is used as the base heuristic (among other heuristics) in the execution of the PDS-RA. In the DCW, the route-based MDP approach [14] is adopted during the rollout resulting in the on-the-go construction of the decision function π B(s a k k ) . The idea is to apply the CW iteratively (π B(s k ) ) when a decision for a single vehicle's next assignment during the lookahead is required, given that the TBIH-3 has been selected as the base heuristic. Iteratively, the CW is applied when no obvious decision can be taken during the lookahead when transitioning. At the point of decision, the current position l m is regarded as the single depot in the CW while the neighbouring nodes j ∈ O m : O m ⊂ H are considered customers. The example for applying the DCW is illustrated in Figure 3a,b where node 3 is the current arrival spot of vehicle m. Additionally, a shelter in the network example is denoted as the node S2. Using the CW, route η π B(s k ) is constructed from the assumed depot (node 3) and back to the depot. An example of a constructed route η π B(s k ) could be (3 , or both if both edges (5,1) and (4,1) happen to have the highest savings. a m would then be chosen as the first insertion π B(s a k k ) : s k → η π B(s k ) (s k ) of the chosen route to transition within the lookahead horizon.
The algorithm for the TBIH-3 is shown in Algorithm 6 as the extension of Algorithm 4 by replacing lines 121-147. In this algorithm, a decision is computed if no other obvious decision can be chosen. To execute the CW, all possible pairs (j, k) ∈ E are detected from the possible next destination nodes in the list nextND (lines 3-4). If there are no edges that exist, a randomly selected node is chosen as the next destination for the vehicle m (lines 5-9). Otherwise, the savings are computed from these edges and sorted in decreasing order (lines [11][12] prior to constructing the temporary decision function η π B(s k ) (line 16) which constructs the on-the-go base decision function π B(s a k k ) : s k → η π B(s k ) (s k ) (line 17). Lines 148-156 are similar to Algorithm 4 where the computed decision is returned.
if len(pairs == 0) then 6: a m = random.choice(nextND) 7: Decision.append(a m ) 8: r i,a m = max(r i,a m − 1, 0) 9: break 10: else 11: compute savings: (j, k) ← c i,j + c i,k − c j,k ∀(j, k) ∈ Pairs 12: sort (j, k) ∈ Pairs with decreasing savings 13: if len(Pairs)==1 then 14: a m = j : (j, k) ∈ Pairs 15: else 16: construct route π B(s a k k ) : s k → η π B(s k ) (s k ) from i = l m by inserting (j, k) ∈ Pairs as would be done in CW (decreasing savings) 17: end if 19: Decision.append(a m ) 20: r i,a m = max(r i,a m − 1, 0) 21: break 22: end if 23: else 24: Decision.extend(nextND) The application of the SIH in the MDDVRPSRC for the rollout is quite clear, given the chosen "seed customer" is the main driver of the method. In the MDDVRPSRC, the authors concur with [85] that choosing an appropriate seed is very important for insertion heuristics.
For this particular problem of depending on the capacity status of vehicle m, the seed is either the depot in the route to replenish capacity or the emergency shelter in the route to serve a shelter. This is different from Algorithm 5 where the seed is randomly chosen based on potential destinations υ's neighbour. In Figure 1a-c, it is clear that more can be done to guide the vehicles towards fulfilling their task. In these figures, it could be argued that the selection of node 5 (while ignoring the road capacity condition in the network), which might occur through the random selection of seeds that occurs in Algorithm 5, may not be the best choice. Instead, node 1 could be the better choice as it would lead to serving node S9. In the TBIH-4 (Algorithm 7), the potential seeds j are reserved only for those which are either one of the depots, emergency shelters, or neighbours of either. In Figure 1a-c, node 1 will be chosen instead as υ since this node would lead to serving shelter S9. Only node 1 could be inserted as only a one-time insertion procedure in DLASIH is performed at every decision point in the lookahead during the construction of the route. For the shelter or depot which is farther than a one-step lookahead, as seen in Figure 4a-c, the neighbour of either that shelter or depot is then chosen as the seed j. In this case where the vehicle is packed with delivery supplies, node 9 shall be chosen as the seed (j) since it is the neighbour of shelter S7. Since only node 1 can connect to node 9 from node 0 in the one-time insertion, node 1 is then regarded as the next destination υ. Here, no SIH(I1) computation is necessary as the option is rather obvious. In this illustration case, recognising node 9 as the neighbour of emergency shelter S7 helped in trimming down potential seeds to be considered and thus also reduced the number of potential υ.  The potential seeds j are considered in (b) and Node 9 was selected since it is the neighbour of shelter S7 in (c). As a result, node 1 (υ) is inserted and route (0-1-9-S7) is constructed. (c) shows two potential seeds j (node 4 and node 9) in which case a random selection between node 4 and node 9 as a seed is done.
However, if there are more potential υ leading to the neighbours of emergency shelters, then one of these neighbours will be chosen randomly. If more than one possible υ is connected to the chosen seed (j), the SIH(I1) will be executed.
In the TBIH-4 algorithm (Algorithm 7), the selection for j is restricted to those that would lead to either a shelter or depot, depending on q m (lines 5 and 55). However, if such nodes are not available, another lookahead is performed to see if there are potential destinations j that could lead to neighbours of either a depot or shelter (lines 21 and 71). Depending on the case considered, the numbers of possible j from which the seed for the SIH could be chosen from (lines 9, 37, 59, and 94) can be reduced. For either case, the insertion criteria are computed and evaluated such that the on-the-go lookahead route π B(s a k k ) is updated: , and the decision a m = η π B(s k ) (s k ) are returned (line 98 onwards).
If the vehicle is with capacity, inserting node 4 for the on-the-go lookahead route would make more sense. If TBIH-3 (with an embedded DCW) could perform as a sort of lookahead (for nonobvious decisions, SP) for a nearby emergency shelter when a vehicle m has capacity, then the selection for the next destination would be more accurate. This is illustrated in Figure 5a,b. In Figure 5a, the current position of vehicle m is at node 3, and the nearby emergency shelter is node S9. With the exception of node 5, which is the neighbour of node 3, both nodes 1 and 4 are neighbours of node S9. As a result, only nodes 4 and 1 are considered when constructing η π B(s k ) (s k ) even though node 5 is also a neighbour of node 3. This leads to a more promising construction of π B(s a k k ) on the go. If node 5 is taken into consideration, there is a possibility of node 5 being selected as the next destination for vehicle m. This is undesirable as that would lead vehicle m, which has capacity, farther from serving S9. With this concept, the DCW is extended into DLACW (turning TBIH-3 into TBIH-5). The principle of the proposed DLACW is, for most parts, similar with the exception of a mechanism to detect a nearby shelter or depot depending on the capacity status of vehicle m.

Computational Results
This study presents computational results for the following purposes: • To validate the model MDDVRPSRC by observing through the simulation tool the ecosystem (emergency medical supplies delivery) simulated. • To validate the reinforcement learning solution of the agent in conducting the medical delivery operations through decisions computed based on the ADP approach (PDS-RA with 5 proposed base heuristics). • To study the quality of the learning solution through the resulting simulated data by means of a comparative approach against the matheuristic proposed in the work of [15] in the stochastic setting of road capacity and dynamic road damage. • To extend the findings in the work of [15] which serves as a preliminary study for this research.
The experiment is conducted using the authors' developed MDDVRPSRC Decision Support System (DSS) program ( Figure 6) with codes written in Python 2.7 programming language. Embedded in this program is also a network monitoring layout through which the simulation of medical supplies delivery operations in the setting of the MDDVRPSRC is observed in real time (live simulation). Both the MDDVRPSRC model and the computation of the agent's decision are also implemented at the heart of the DSS. As part of the computation of the agent's decision, this program also executes the matheuristic (upon selection) proposed in work of [15] with CPLEX computation executed through the DOCPLEX API for Python. For the experiment, the MDDVRPSRC DSS is run on a laptop computer running on Intel R Core TM i7-7500U CPU at 2.70-2.90 GHz with 20 GB RAM. To test and validate the model and solutions algorithm for such a unique problem as the medical supplies delivery operations with compromising circumstances, the common benchmarks of Perl, Gaskell, and Christofides cannot be applied. So several test instances are designed [86,87], ranging from a small and simple road network, to medium and more challenging networks based on the available experimental apparatus (Figures A1-A10). Along with each test instance are the associated initial damage for each road within the network of the respective test instances [86,87]. These datasets, consisting of test instances and the associated damage files are available in [87] and are shown in Table 3. In Table 3, the test instances are ordered by increasing complexity levels, characterised in terms of total number of nodes as well as the ratio between connecting node to a depot and emergency shelter. The test instances associated with the road network D4N30S10 are placed last as it is hypothesised as the most challenging network among the test instances listed. Each network is comprised of multi-depots, multi emergency shelters with different demands and connecting nodes as described in Section 3.1. The placement of the nodes is based on the lessons learned from the 2015 Nepal earthquake.
In each of the road networks seen in Figures A1-A10, the blue, yellow, and brown nodes each represent the depots, emergency shelters, and connecting nodes, respectively. The violet circle lines represent the outward tremor that originated from an earthquake epicentre (coordinate (460, 180) in all the networks). The degree of the initial road damage is based on the intersection of these circles onto the edges, and the corresponding random road capacity is denoted in red at the center of the edges. The demands of each emergency shelter hovers above in a pink box. The green boxes represent vehicles that have arrived at the nodes where they are currently stationed. The blue boxes represent vehicles en route to each of their next assigned destinations. In Figure A11, the simulation example of an ongoing medical supplies operation is shown. In the road network D8N20S8, five vehicles are assigned to deliver medical supplies to eight emergency shelters with their respective demands. The full road capacity in this network for a city road, normal road, and highway is given in Table 3 as (6,7,8), respectively. In all road networks, the highways are placed at the outer part of the network, while the city roads are placed at the innermost sections of the networks. Normal roads can be found connecting highways with city roads in most cases, especially in the larger networks. At decision point 104, which is at the simulated time of 3097 min (translated as 2:3:37:00), the road capacity for each road changes randomly based on the dynamic road capacity mean for the random distribution of each road. These dynamic deteriorating road capacities in turn depend on the initial damage sustained by the road (given in the damage file of each test instance in the repository [87]). Thus the road capacity for the edges with more interceptions with the radial earthquake tremor circles are seen with a tendency to have less road capacity at random when compared with the edges with less or zero intersections. Hence, vehicles travelling at these edges will suffer longer travel times proportional to the initial damage sustained by the edges as accounted for in the MDDVRPSRC model described in Section 3.3. The work [15] is referred to for more explanation on how the random road capacity is sampled at each decision point. The experiment settings for both the simulation and computation of the agent's decision (PDS-RA) is given in Table 4.
For the model and solution validation, simulated data is compiled ( Figure 6). For each proposed base heuristics (TBIH-1, TBIH-2, TBIH-3, TBIH-4, and TBIH-5) applied for all test instances in Table 3, 10 complete simulations of a medical supplies delivery operation are performed. Out of the 10 complete simulations, there are 10 readings for 4 key measurements:
Average decision computation time (K4). Table 3. Simulated test instances applied to validate the model and solutions algorithm. D3N8S3_4  3  3  8  4  550  6, 7, 8  D3N8S3_8  3  3  8  8  550  6, 7, 8  D3N8S3_15  3  3  8  15  550  6, 7, 8  D3N8S3_30  3  3  8  30  550  6, 7, 8  D3N8S3_50  3  3  8  50  The first three key measurements are self explanatory. The last key measurement is the average time taken for the agent to make one decision at decision point k based on the total computation time divided by the number of decisions made (decision points) to complete the delivery operations simulation. The PDS-RA with the proposed heuristic bases are benchmarked with the matheuristic rollout found in the work of [15] for all vehicle number settings (4, 8, 15, 30, and 50) for the road networks D3N8S8-D7N18S7. For the road network D8N20S8-D4N30S10, however, the benchmarking is completed only up to the vehicle number settings of 4, 8, and 15. This is due to the resultant computational time which is far longer than considered reasonable when compared with the longest computation time obtained among the five proposed heuristics (Figures A14 and A18). With the resulting simulated data, the model is then validated based on the analysis of the output data produced ( Figure 6). Furthermore, the performance of the proposed heuristics compared to the matheuristic rollout applied is observed through a descriptive and comparative analysis. The computational results in the Supplementary file (Tables S1-S81) are collected and recorded for a time span of more than two years; given the hardware available for the experiments. A total of 10 readings were taken for each of the proposed base heuristics applied in the PDS-RA for all test instances. This was for all key measurements (K1-K4) given the stochastic road capacity and dynamic deterioration of the mean road capacity in the problem addressed. From each 10 readings, the descriptive analysis is performed to measure the mean, standard deviation, variance, and covariance of the sample data. The Normality test is performed to determine that a suitable comparative analysis method is applied for benchmarking. A total of 11,600 key readings were recorded as a result of 2900 simulations performed for further analysis involving the key measurements of K1, K2, K3, and K4 mentioned in Section 5. The 2900 simulations consist of 290 sets of 10 readings per set, for each of the 4 key measurements which are then used to compute the average reading. Not all 290 sets tested were found to have a normal distribution based on the Shapiro-Wilk test [88] performed in the Excel [89]. The highest percentage for normal data (around 50%) is only seen in the K3 and K4 measurements. Furthermore, the 10 readings for each key measurement of a test instance is considered small for a parametric test. As such, a non-parametric test (Wilcoxon Signed Rank Test) was applied to test for significance in differences against the matheuristic solution (PDS-RA with CPLEX as base heuristic). Moreover, the Best So Far (BSF) measurement among the solution algorithms applied at each test instance was performed to observe the performance of each PDS-RA of the respective proposed heuristics against the matheuristic rollout.

Instance Depot Shelter Nodes Vehicle Total Demand Max Road Capacity
The full computation results are presented in the supplementary file and the abbreviations applied are listed in Table 5. Furthermore, the general overview of the simulated data collected is shown in Table A1. Thorough investigation and synthesise of the resulting simulation data by means of cross-referencing key values were performed to ensure that no errors are presented.
The results obtained in Tables S1-S81 are further synthesised for numerical analysis focusing on model validation and base heuristics performances. The MDDVRPSRC model is validated based on the trends and patterns observed in Figures A12, A13, A16 and A17. Meanwhile, the performance of the proposed heuristics, as compared to the matheuristic rollout, can be seen in the remaining figures between Figures A12-A27 and in the supplementary file S1 (Figures S1-S62). Figures A12-A15 show the trends for average measurements of each of the 10 sample readings based on all four key measurements, while Figures A16-A19 shows the trends for best measurements among the 10 reading samples for each key measurement. Figures A20-A23 show the total numbers of BSF counts for each algorithm for all 40 instances with the matheuristic rollout benchmark. Figures A24-A27 depict the total numbers of BSF counts in percentage for each algorithm for all 50 instances with and without (omitting 10 test instances for matheuristic rollout due to computation time limitation) the matheuristic rollout benchmark. A more detailed breakdown per test instance of the percentage of BSF associated with each heuristics is shown in Figures S1-S40. Meanwhile , Tables A2-A7 give a more detailed breakdown on numbers of the BSF counts, normal distribution data, and the significant differences for each key measurements. Finally, a detailed performance of each PDS-RA with proposed heuristics for all key measurements is shown in Figures S41-S62.

Discussion
In terms of the MDDVRPSRC MDP model validation, the behaviours plotted in Figures A12, A13, A16 and A17 are conforming to the natural expectation on how the humanitarian operational aspects will shape out based on the key measurements. Figures A12 and A16, for example, show a logical increase of total distance travelled with the increase in the number of vehicles. Here, the increase in total distance is also attributed to the policy that all vehicles must be dispatched for delivery to compensate for the potential risk that a vehicle might be stranded while en route due to the road damage incurred. Furthermore, a stochastic road capacity with multiple dispatches of vehicles might ensure a faster delivery time at the cost of an increase in total distance travelled.
For the road network D3N8S3, the increase of total distance is higher than that of networks D4N11S4, D5N13S5, and D6N16S6. This is comparable to that of network D7N18S7 onwards with operations involving 30 and 50 vehicles. This is due to the large amount of vehicles travelling on a road network with limited roads. The random road capacity as well as deteriorating road conditions cause a bottleneck at some connecting nodes. However, a steady increase of roads in more complex networks alleviates this problem, as shown in networks D4N11S4, D5N13S5, and D6N16S6. Given the increasing demands and more complex networks, a different observation could be made.
The road networks of D7N18S7, D8N20S8, D9N25S9, and D9N30S10, for example, indicate roughly the same trend of total distance travelled with an occasional peak at about 10,000 km for networks D8N20S8 and D9N25S10. However, an obvious increase of total distance travelled can be seen for the network D4N30S10, thereby confirming the hypothesis that this network is the most complex in terms of delivery operations. This is explained by the ratio of depots to connecting nodes where the vehicle has only a limited number of depots to replenish supplies in this network as compared to the other networks. Furthermore, the ratio of depots and shelters also contributes to this observation, showing the difficulties of completing the deliveries given the smaller number of depots to replenish.
Additionally, the reduced number of depots in this network also leads to more connecting options between the depots and connecting nodes which may not necessarily be advantageous to the delivery operations. This is especially true for networks that tend to have a shorter route disabled due to random road capacity and a dynamic reduction of the road capacity due to damage to the road. As a result, a longer route is taken leading to the increase of total distance travel by all vehicles.
All of the observations for the total distance travelled shown in Figures A12 and A16 also apply to the observations seen in Figures A13 and A17. In general, the increase of the total vehicle numbers leads to the reduced delivery operations time (total travel time). For the network D3N8S3, a limited number of vehicles in a small network with high demands relative to the number of vehicles led to an increase in total travel time compared to network D4N11S4. This is due to the longer time required by the smaller number of vehicles to satisfy the total demands within the network. Moreover, the deteriorating road capacity for each damaged road may lead to lesser road availability for an already small road network. This leads to vehicles taking the longer route compared to that of network D4N11S4.
Vehicles may also travel back and forth along the same road due to connecting roads becoming increasingly less available. The bottleneck effect is also seen for the larger number of vehicles when comparing the total time travel within the road network of D3N8S3 with the road networks of D4N11S4, D5N13S5, and D6N16S6. It is also shown clearer here that the bottleneck effect could be alleviated through trends observed for networks D7N18S7, D8N20S8, D9N25S9, and D9N30S10. Similarly the reduced ratio between depots to shelters and depots to connecting nodes leads to a more complex network. This is despite not having the highest number of nodes that contributes to a higher total travel time for some of the algorithms. Interestingly, the matheuristic rollout approach does not show the same observations. This shows the potential of the matheuristic rollout in navigating more complex networks compared to the proposed heuristics.
This, however, comes at the cost of computation time as shown in Figures A14, A15, A18, A19, A22, A23, A26 and A27. This was observed when investigating the performance of the proposed base heuristics against the matheuristic rollout as a benchmark. The total computation time increases for the matheuristic rollout applying CPLEX at every lookahead decision point for road network D5N13S5 onwards for all vehicle settings ( Figures A14 and A18) when compared with the results obtained with PDS-RA applying the proposed base heuristics. This trend is even more obvious in Figures A15 and A19, showing a clear increase in computation time for the agent in making a decision on average. As a result, no BSF count was ever obtained through the matheuristic rollout for the key measurement of K3 and K4 ( Figures A22, A23, A26 and A27).
Apart from showing an exponential increase for both K3 and K4 (see Figures A15  and A19), it is also obvious that this trend depends on the total number of nodes that are involved in the network. This is evident when comparing the two key measurements for networks D9N30S10 and D4N30S10. However, the road networks sharing a similar number of nodes as D4N30S10, such as D9N25S10, do not indicate a similar magnitude of increment. Therefore, it could be concluded that both the number of nodes and complexity associated with each network affect the two key measurements for the matheuristic rollout.
Meanwhile, the performance of the proposed PDS-RA applying base heuristics is further investigated through the BSF count for all instances tested. Figures A22, A23, A26 and A27 confirm the observation made for the matheuristic rollout in terms of computation time (K3 and K4). However. the matheuristic rollout shows clear dominance in terms of the key measurements of K1 and K2 ( Figures A20, A21, A24 and A25). This is especially seen in the breakdown of K1 and K2 in Figures S46 and S52 which Figure S46 interestingly also show a good performance of TBIH-1 for K1. This shows the relevance of the matheuristic approach for complex stochastic problems. In most of the individual networks, the matheuristic approach seems to also perform better compared to the other proposed approaches for K1 and K2 (Figures S46 and S52). However, as it can be seen in Figures A12-A17 with the exception of Figures A14 and A15, the application of PDS-RA with proposed heuristics remains competitive with low gaps of difference. This is also supported by the statistical numerical evidence that show lower significance differences recorded throughout all simulated data involving K1 and K2 when compared with data obtained by the matheuristic rollout (Table A1).
Of those, a vast majority of significance difference is seen in Figures A14, A15, A18 and A19 which corroborates findings in terms of computation time for K3 and K4. Judging from the trends, the practicality of the matheuristic rollout as benchmarked, is shown to be poor at least for the given hardware used for the experimentation. This is despite the good performance shown for K1 and K2, albeit with no significance difference.
On the other hand, the TBIH-1 shows clear advantage as shown from Figures A20-A27 in terms of the BSF count. This is perhaps expected considering the stochastic problem which may favour the exploratory approach more than the exploitation part, as is performed in TBIH-1 with random selection for the SP part of the algorithm. The comparable performance of PDS-RA with TBIH-1 compared to the matheuristic rollout is also shown in most of the road networks, respectively. Furthermore, TBIH-1 is seen at times neck to neck with the benchmark when looking into the performance in each individual network, such as in Figures S26, S29, S33, S34, and S36 among others. It is also noteworthy to see that the algorithm also performs rather well for the network D4N30S10, with the exception of key measurement K2. Moreover, the dominance of the TBIH-1 is increasingly more noticeable for larger networks as best seen in Figures S41, S47, S53, and S58. Meanwhile, both the TBIH-2 and TBIH-4 also perform well in the overall BSF count (as seen in Figures A20-A27) when compared to that of TBIH-3 and TBIH-5 which is based on DCW. This highlights the advantages of the DSIH which centred on the concept of inserting and placing promising nodes in ways that optimise the operation.
DCW, on the other hand, tends to ignore the inner part of the nodes and favour the outer nodes in an attempt to reduce parallel connections to the origin node as CW has always been intended for. This is evident by the performance of TBIH-3 which is the lowest followed by TBIH-2 when looking at BSF counts for both the individual network and overall networks ( Figures S41-S62). Except for K3 in Figure S55, the TBIH-3 only scores a BSF count of one for all other key measurements ( Figures S43, S49 and S60). This translates in a low BSF count obtained in Figures A20-A27 where the TBIH-3 is seen multiple times with BSF counts as low as 0% and 2.5% while topping at most only an 8% as seen in Figure A26. TBIH-5 shows an improved performance when compared to TBIH-3 (for K1, K2, K3, and K4 in Figures S45, S51, and S57) and TBIH-2 (except for K3 and K4) with the addition of a lookahead mechanism for selecting more promising nodes in the route. This demonstrates the strength of exploitation in the heuristics to improve selection. The TBIH-2, however, is better in terms of K3 and K4 ( Figures S54 and S59), displaying the trade off for embedding such features.
Similarly the TBIH-2's performance is improved in TBIH-4 by means of an exploitation mechanism that requires a lookahead in selecting more promising nodes and filtering out those that are not. Unlike the TBIH-3 and TBIH-5, however, the gap in the computation speed between the TBIH-2 and TBIH-4 is not obvious. This shows that the TBIH-5 might be more costly to implement compared to TBIH-4 which improves on the TBIH-2 with less trade-off as seen in Figures S54, S56, S59 and S61. As such, the TBIH-4 could be considered an all-rounder with a balanced performance next to TBIH-1.
It should be noted that the PDS-RA is performed per vehicle when making a collective decision for all vehicles. Furthermore, both the number of Monte Carlo simulations and the length of the lookahead horizon shown in Table 4 could be considered low when compared to other similar work. However, the new perspective of the computing decision, as proposed in Equation (24), demands some compromise be made, especially with limited computational power available for this research. Furthermore, the method applied in this research is necessary to break the usual practice of clustering the emergency hot-spots per vehicle and then computing the routing decision afterwards. Additionally, the research for stochastic road capacity problems with additional consideration for damaged roads is very limited among reinforcement-learning-oriented research. Due to the stochastic road capacity, the resulting key measurements are highly varied as shown in the variance and covariance measurement of each of the simulated samples collected (Tables S2-S81). Ideally, a good amount of Monte Carlo simulations of the rollout and a longer horizon for the lookahead would be best to account for such stochastic problems. A trade-off still needs to be made where the limitation of computation is a concern. If anything, this research proves that the proposed methods could be applied to a machine with limited capability to simulate, visualise, and compute decisions as a DSS for an emergency medical supplies delivery humanitarian operation.
However, more study into this research is warranted. With capable machines, the number of lookahead horizons and the number of Monte Carlo simulations should be increased. With such an increase in parameters, perhaps the TP of the algorithm could be discarded; allowing the agent a pure learning opportunity when making decisions. In the experiments here, this could not be achieved; hence the TP is needed. Furthermore, with enough Monte Carlo simulations, the highly stochastic problem concerning routing can be properly addressed. A longer horizon of the lookahead ensures better decisions in a long-term perspective.
The investigation included a one-factor experiment performed by varying the fixed number of vehicles per road networks, which is a limitation. It should be note,d however that various road networks were tested consisting of varying numbers of depots, emergency shelters, and connecting nodes. Furthermore, given the entry level machine that is utilized, this experiment (involving 2900 simulations and 11,600 measurement readings) took more than one year to complete. Given a more capable machine, factorial experiments should be performed to investigate the performance of the proposed heuristics againist the matheuristic benchmark. For example, through a factorial experiment, the existing network could be expanded into more challenging networks. In D4N30S10 for instance, it would also be interesting to see how the delivery operation with such a number of depots and an increase of connecting nodes fares with a smaller number of emergency shelters as more options for routing become available. Will the agent with the proposed solution method be able to navigate intelligently among these many options? Hence, more studies should be performed with expanded networks where the combination of ratios between depots, connecting nodes, and emergency shelters are varied. For this experiment, the vehicles are placed randomly at depots initially. This is performed to account for the degree of unpreparedness, where coordination should be planned with random accounts of assets. Hence, even though the key measurements are assessed through 10 average readings for each test instance, the initial situation for each simulation run is varied. There are two ways that this study could be expanded further: (1) to increase the number of simulations per test instance to obtain more than 10 readings for a better average reading, and (2) to apply a fixed assignment of vehicles per depot for all simulations. The latter approach, however would not account for a more realistic scenario of emergency medical supplies delivery operations. Finally, in this study, the placement of depots, connecting nodes, and emergency shelters are made such that the findings obtained from the lessons learned in the 2015 Nepal earthquake are addressed. Instead of utilising a simulated network, a more concrete simulation could be performed by applying real networks and incorporating details of the depots, connecting nodes, emergency shelters, vehicles, road damages, and road capacities during that actual disaster event. It is noted that such data is usually of a sensitive nature. However, developing a simulated network allows for flexibility when completing planning exercise and experiments.

Conclusions
As part of the DSS for humanitarian emergency medical supplies delivery operations, the 2015 Nepal earthquake is referred to in developing the MDDVRPSRC MDP model. The presented model focuses on the difficulty in navigating through stochastic road capacity within the compromised road network due to the ongoing tremors from the earthquake. The model also incorporates multi-depots, multi-trips, and split delivery operations. Here the conventional approach of "cluster first, route second" largely applied among related research cannot be applied. Instead, to solve the problem, a lookahead approach of ADP is adopted, where the PDS-RA is applied. As part of the PDS-RA mechanism, five base constructive heuristics are proposed to construct the decision rule on the go dynamically and iteratively. Unlike conventional applications of the PDS-RA in VRP, this research adopted the proposed method in the work of [15] for a consecutive application of the PDS-RA for each vehicle that arrives at every decision point. The resulting individual assignment of vehicles computed collectively forms an MDP decision at every decision point.
The five proposed base heuristics are based on a decision-making strategy that consists of obvious decisions (TP) and non-obvious decisions (SP) to reduce the burden of computation. In the TBIH-1, the SP applied pure random selection for selecting a vehicle's next destination. Alternatively, the principle of constructive heuristics used in SIH(I1) and CW, (coined as DSIH and DCW, respectively) are adopted in the TBIH-2 and the TBIH-3. A lookahead exploitation mechanism is adapted to both the DCW and the DSIH, giving birth to DLASIH and DLACW which is applied in the proposed TBIH-4 and TBIH-5, respectively. These five proposed base heuristics are compared with the matheuristic proposed in the authors' previous work, [15]. Moreover, test instances were developed and made available in the repository [87]. The results presented in the supplementary file validate the model where expected behaviour is observed from the simulated operations based on four key measurements: K1, K2, K3, and K4. Furthermore, the performance of the PDS-RA applied with the proposed five base heuristics shows comparable performance for K1 and K2 with no significant difference recorded. Meanwhile, all the proposed heuristics showed superior performance for K3 and K4 when compared to the matheuristic. The results also highlight the power of exploration associated to pure random selection in the TBIH-1 in addressing a highly stochastic problem such as the MDDVRPSRC. Furthermore, the advantages of exploitation are shown in TBIH-4 and TBIH-5 when compared with the performance of TBIH-2 and TBIH-3, respectively. For problems such as the MDDVRPSRC, it would appear that the DSIH (TBIH-2) and DLASIH (TBIH-4) perform better than their counterparts: DCW (TBIH-3) and DLACW (TBIH-5).

Data Availability Statement:
The primary simulated data is presented in the supplementary file. The test instances from which the simulated data is generated is found in [87].

Figure A21.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for total travel time over all test instances applied (omitting 10 non-benchmarked measurements).

Figure A22.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for total computation time over all test instances applied (omitting 10 non-benchmarked aasurements).

Figure A23.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for average decision computation time over all test instances applied (omitting 10 non-benchmarked measurements).

Figure A24.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for total travelled distance over all test instances applied (including 10 non-benchmarked measurements).

Figure A25.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for total travel time over all test instances applied (including 10 non-benchmarked measurements).

Figure A26.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for total computation time over all test instances applied (including 10 non-benchmarked measurements).

Figure A27.
Algorithms performance compared to matheuristic rollout applied in PDS-RA: total BSF measured for average decision computation time over all test instances applied (including 10 non-benchmarked measurements).