Reinforcement Learning-Based Policy for Haul-Truck Dispatch: A Framework for Earthmoving and Quarry Operations

Hatami, Mohsen; Flood, Ian; Foroutan, Forough

doi:10.3390/buildings16112274


Reinforcement Learning-Based Policy for Haul-Truck Dispatch: A Framework for Earthmoving and Quarry Operations †

by Mohsen Hatami ¹, Ian Flood ¹,* and Forough Foroutan ²

¹ Rinker School, College of Design, Construction and Planning, University of Florida, Gainesville, FL 32611-5703, USA

² FIBER, College of Design, Construction and Planning, University of Florida, Gainesville, FL 32611-5703, USA

* Author to whom correspondence should be addressed.

† This paper is an extended version of our paper published in Advances in Information Technology in Civil and Building Engineering: Proceedings of ICCCBE 2024, Montreal, QC, Canada, 25–28 August 2024; pp. 133–147.

Buildings 2026, 16(11), 2274; https://doi.org/10.3390/buildings16112274

Submission received: 7 February 2026 / Revised: 20 May 2026 / Accepted: 20 May 2026 / Published: 4 June 2026

(This article belongs to the Special Issue Selected Papers from the 20th International Conference on Computing in Civil and Building Engineering (ICCCBE 2024))


Abstract

Truck-to-excavator assignment is a time-critical control problem in open-pit earthmoving systems (mines, quarries, and large cut-and-fill construction sites) where stochastic travel and service times, changing queues, and equipment outages continually alter the best dispatch decision. A deep reinforcement learning (DRL) dispatch policy is developed and trained using a discrete-event simulation (DES) digital twin of the Sungun copper mine haulage system. The dispatch task is formulated as a Markov decision process using state features that represent fleet locations, excavator and dump queues, and short-term congestion conditions. The resulting deep artificial neural network (DANN) policy is tuned via systematic hyperparameter optimisation and evaluated against a priority-based rule-of-thumb dispatch baseline under long-horizon operating traces. Results show that the final trained policy improves the average production rate per truck cycle by approximately 17% while reducing avoidable waiting and maintaining stable performance over extended operation, with inference fast enough for real-time dispatch use. Model fidelity is supported by close agreement between simulated and observed daily completed-cycle counts. Robustness is assessed through controlled truck load-capacity perturbations, and scalability is examined through fleet-size sensitivity, which reveals diminishing returns as additional trucks are added under a fixed excavation–haulage configuration. Practical deployment considerations and implications for construction earthmoving logistics are discussed.

Keywords:

deep reinforcement learning; digital twin; discrete-event simulation; earthmoving logistics; open-pit mining; truck dispatch

1. Introduction

Fleet dispatch is central to productivity in open-pit earthmoving operations, including mines, quarries, and large construction earthwork projects where truck cycles connect loading, haul, dump, and return activities [1,2]. Dispatch decisions must be taken repeatedly under uncertainty, with performance sensitive to congestion, equipment availability, and short-term imbalances between loading and hauling capacity [1,2].

Traditional truck allocation approaches in mining and earthmoving range from fixed dispatching rules to mathematical optimisation and simulation-based optimisation [2,3,4,5]. While effective under stable conditions, many approaches are difficult to deploy for real-time control because system state can change rapidly and optimal decisions depend on non-linear interactions among queues, travel times, and equipment utilisation [2,5]. Deep reinforcement learning (DRL) offers an experience-based strategy that can learn a policy mapping the observed system state directly to a dispatch action once trained [6,7]. However, many studies use simplified environments, test only a limited set of operating conditions, and report little sensitivity analysis. As a result, it remains unclear how robust the learned policy is to variability, how well it scales to larger fleets and more complex networks, and whether it meets real-time decision requirements when evaluated in higher-fidelity simulations that capture realistic congestion, travel times, and downtime [8,9,10].

This article develops a DRL dispatch policy, implemented as a deep artificial neural network (DANN), trained on a discrete-event simulation (DES) digital twin of the Sungun copper mine [11,12,13]. The dispatch problem is formulated as a Markov decision process (MDP) in which the state is a compact summary of the simulator’s current conditions, and each action assigns an available truck to a candidate destination [7,13]. The learned policy is evaluated against representative rule-based dispatch policies and a random benchmark under observed and synthetic operating scenarios [13].

The contributions are: (i) a reproducible DES–DRL training pipeline, implemented with a DANN, that learns multi-destination, multi-excavator truck-allocation policies in a realistic mine digital twin; (ii) a systematic hyperparameter optimisation and sensitivity analysis supporting transparent model selection; (iii) empirical evidence that the learned policy improves production performance relative to representative rule-based baselines across observed operating conditions, with scalability tests on larger synthetic fleets; and (iv) discussion of deployment considerations and transfer to construction earthmoving logistics. Accordingly, the manuscript should be understood as a research study demonstrating the performance improvement of the proposed policy model within a DES-based evaluation environment, rather than as a field-deployment or implementation study. The remainder of the paper is organised as follows: Section 2 presents the Sungun case study and methodology, including the DES–DRL integration and the DRL problem formulation (state, actions, reward, and DANN architecture); Section 3 reports verification/validation and performance results, followed by sensitivity and scalability experiments; Section 4 discusses implications for large-scale fleet allocation in construction and civil earthmoving, along with limitations and future research directions; and Section 5 concludes the paper.

Background and Related Work

Truck dispatch and allocation problems are commonly treated as dynamic routing and queueing control problems in which decisions influence both short-term waiting and longer-term system balance [2]. In practice, haulage systems exhibit non-stationary behaviour driven by changing travel times, variable loading and dumping service times, road conditions, and unplanned downtime [1,2].

Early and widely used approaches include fixed dispatching rules (e.g., shortest queue, nearest shovel) and operations-research formulations such as linear/integer programming and network flow models [2,5]. These methods can yield strong performance when modelling assumptions hold, but they often require frequent re-optimisation or extensive data and calibration to remain effective under stochastic, time-varying conditions [2,5]. Simulation-based optimisation and metaheuristics partially address stochasticity by evaluating candidate policies in a simulator, but they can become computationally expensive when decisions must be made frequently or when the state space is large [2,14,15].

Reinforcement learning (RL) learns a policy through interaction with an environment by maximising cumulative reward, while DRL extends RL with neural function approximators to handle richer state descriptions [6,7]. Recent studies have reported promising results for open-pit truck dispatching and equipment allocation, yet many evaluations are still based on simplified simulators, small fleet sizes, or limited scenario sets [8,13]. Recent mining digital-twin and DRL scheduling literature has also highlighted the need for clearer frameworks, stronger generalisation, and broader comparative evaluation [9,10]. As a result, there remains a need for studies that combine higher-fidelity simulation, clearer state and action definitions, and systematic sensitivity analysis to support more robust and transferable dispatch policies [2,8,13].

This study addresses these gaps by coupling DRL with a DES digital twin that captures queueing dynamics and operational constraints for a real mine case study, and by reporting a transparent hyperparameter search and scalability testing using synthetic large-fleet scenarios [13]. The focus is on producing a policy that has high production performance and remains operationally usable in real-time dispatch settings.

This manuscript is a substantially extended version of the ICCCBE 2024 conference paper [13]. Compared with [13], this journal article adds (i) a complete case-study description and data pipeline for the Sungun copper mine, (ii) a fully specified MDP for truck allocation with explicit state, action, and reward definitions, (iii) a detailed description of the DANN policy structure and the two-mode (exploration/implementation) learning strategy, (iv) systematic hyperparameter optimisation and sensitivity testing (learning rate, mini-batch size, reward length, and network depth/width), and (v) expanded empirical results, including simulator verification/validation and scalability tests using synthetic complex scenarios.

Recent work has explored dispatching rules, mathematical programming, and metaheuristics for truck allocation, as well as RL/DRL approaches for dispatching [2,5,8,9]. This study contributes a simulation-integrated DRL workflow that uses a discrete-event simulator as a safe environment for policy exploration and repeatable evaluation, and it documents practical modelling choices needed to operationalise DRL in a large, stochastic fleet system, including state and reward design and a training curriculum based on reward-length staging.

2. Methodology

This section describes the end-to-end workflow used to develop and evaluate the proposed dispatch approach. The workflow couples a discrete-event simulation digital twin of the Sungun haulage system with a deep reinforcement learning policy implemented as a DANN, enabling learning through repeated simulator interaction and fast inference during evaluation runs [13].

2.1. Theoretical Framework and RL Formulation

Dispatch control is central to open-pit mining operations, particularly for truck allocation and equipment scheduling. Many approaches can be viewed as decision agents that map the observed system state to an action intended to improve operational performance. Broadly, agents can be characterised as search-based or experience-based [16]. Search-based agents evaluate candidate actions explicitly, which can be computationally intensive and can limit real-time use. Experience-based agents learn a direct state-to-action mapping from prior experience and can execute decisions rapidly once trained [16,17]. This study focuses on experience-based approaches by comparing rule-of-thumb baselines with a DRL dispatch policy implemented as a DANN. As shown in Figure 1, the state observed at each dispatch decision point is mapped to a dispatch action, and the resulting state transition provides feedback for learning. Figure 2 highlights the long-horizon nature of dispatch control: different policies can lead to markedly different cumulative performance trajectories as the system evolves, motivating the use of the digital twin to simulate alternative futures and compare policies on aggregate performance rather than only immediate local effects.

Developing a DRL model for truck allocation is challenging because the best action is not known a priori and must be learned from interaction with the operating environment. The agent therefore learns through simulated trial and error in the digital-twin training environment, where candidate decisions are explored and their longer-term consequences are observed in the simulator [7,16]. In this study, learning proceeds by iteratively updating the policy using reward feedback from simulated experience, balancing exploration of alternative dispatch actions with exploitation of the best-performing behaviours learned so far [7]. The digital twin provides a cost-effective, risk-free testbed to train and evaluate policies under stochastic operating conditions over long time horizons [16].

Formally, dispatch is modelled as a Markov decision process (MDP) defined over the simulator dynamics [13]. Decision points occur when a truck becomes available and a new destination allocation decision is required (e.g., after dumping, when the truck is empty and requests a new assignment). The environment transition dynamics are generated by the digital twin, which advances the system through loading, hauling, dumping, and return events with stochastic travel and service times [5]. The following are the MDP elements used in this study:

State, s, captures the current fleet status and operational context [13]. It includes high-level indicators of (i) truck status/phase (e.g., at a loading point, hauling, at the dump, or returning), (ii) truck load status (e.g., empty, partially loaded, or full), and (iii) remaining ore to be mined. For the Sungun case study, the implemented state vector is expanded to include operational inputs such as truck and excavator status, travel times, excavator cycle times, truck loading and unloading times, queue lengths, and production targets and constraints [13]. The complete set of implemented state features is summarised in the Supplementary Materials (Table S1).
Action, a, corresponds to the truck-allocation decision [13]. For the Sungun case study, the action space is defined as selecting the next excavator/haulage route for an available (idle) truck (i.e., one discrete action per excavator/route), while loading and unloading are executed by the simulator once the dispatch destination is chosen [13].
Reward, r, incentivises shorter truck cycles [13]. In this study, reward is defined as:

r = T_max − T_cycle    (1)

where T_cycle is the realised cycle time and T_max is a reference upper bound used to shift rewards upward (so they typically remain positive). Allocations that reduce realised cycle time therefore receive a higher reward [13]. The learning procedure reinforces successful actions by increasing the network output associated with a selected action when that action yields better outcomes, which strengthens the policy's preference for it [13].
Reward length. Rewards are accumulated over a chosen horizon (measured in truck cycles) to balance feedback frequency against reward-signal stability [13]. Reward length is treated as a tuneable hyperparameter and was varied from 2 to 1200 cycles during sensitivity testing to identify a stable, repeatable setting for the comparative experiments.
Policy network (DANN). The dispatch policy is represented by a DANN with a fully connected feedforward structure, ReLU activations, and a sigmoid output layer that maps action scores to the range 0 to 1 [13]. Figure 3 illustrates the network structure used to approximate the dispatch policy, showing how the state feature groups are mapped through the hidden layers to a normalised output score for each candidate excavator/route (one output per destination). The network depth and nodes per hidden layer are treated as tuneable design settings and are selected through the hyperparameter optimisation described later. During training, actions are selected stochastically from the network outputs to generate diverse experience, with an exploration-rate schedule ε decaying from 1.0 to 0.01 to gradually reduce randomness as learning progresses. Key training settings include discount factor γ = 0.99 and experience replay with mini-batch updates to stabilise learning.
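The reward definition in Equation (1) and the reward-length horizon can be illustrated with a minimal sketch. It assumes that accumulation means a simple sum over consecutive, non-overlapping windows of truck cycles; the value of `T_MAX`, the function names, and the sample cycle times are hypothetical and are not taken from the Sungun study.

```python
from typing import List

T_MAX = 60.0  # hypothetical reference upper bound on cycle time (minutes)


def cycle_reward(t_cycle: float, t_max: float = T_MAX) -> float:
    """Equation (1): shorter realised cycles earn larger rewards."""
    return t_max - t_cycle


def windowed_rewards(cycle_times: List[float], reward_length: int,
                     t_max: float = T_MAX) -> List[float]:
    """Sum per-cycle rewards over consecutive windows of `reward_length` cycles.

    Assumption: accumulation is a plain sum. A trailing partial window is
    dropped so that every returned value covers the same horizon.
    """
    if reward_length < 1:
        raise ValueError("reward_length must be at least 1")
    rewards = [cycle_reward(t, t_max) for t in cycle_times]
    n_full = len(rewards) // reward_length
    return [sum(rewards[i * reward_length:(i + 1) * reward_length])
            for i in range(n_full)]


# Illustrative cycle times only (minutes).
example_cycle_times = [42.0, 38.5, 51.0, 40.0, 47.5]
print(windowed_rewards(example_cycle_times, reward_length=2))
```

A short reward length gives frequent but noisy feedback, whereas a long one smooths the signal at the cost of fewer updates per simulated week, which is the trade-off the sensitivity testing explores.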

2.2. DRL Training Approach and DES Integration

This study trains the dispatch policy by repeatedly interacting with the Sungun digital twin. At each dispatch decision point, the agent observes the current system state, selects a truck-allocation action, and receives a reward computed from simulated operational outcomes. The resulting experience is then used to update the policy to maximise expected cumulative reward [7,13]. The policy is implemented as a DANN to support fast inference once trained, enabling real-time dispatch decisions during evaluation runs [13]. Figure 4 summarises the end-to-end workflow used in this study, from data collection and pre-processing through feature engineering, iterative training and validation, and performance evaluation used to assess readiness for deployment. Figure 5 summarises the underlying agent–environment interaction loop used during training and evaluation (state observation, action selection, reward feedback, and policy update).
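To make the policy structure concrete, the following sketch implements a forward pass through a small fully connected network with ReLU hidden layers and a sigmoid output per destination. It is written in plain Python for readability rather than in PyTorch; the layer sizes, the weight initialisation, and the function names are assumptions made for illustration and do not reproduce the tuned architecture.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_network(layer_sizes: Sequence[int], seed: int = 99) -> List[Layer]:
    """Create randomly initialised dense layers (the initialisation scheme is an assumption)."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: List[Layer], state: Sequence[float]) -> List[float]:
    """Map a state vector to one score in (0, 1) per candidate excavator/route."""
    activations = list(state)
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        if idx == last:
            activations = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # sigmoid output
        else:
            activations = [max(0.0, z) for z in pre]                 # ReLU hidden layer
    return activations


# Hypothetical sizes: 6 state features, two hidden layers, 9 destinations.
network = init_network([6, 16, 16, 9])
scores = forward(network, [0.1, 0.4, 0.0, 1.0, 0.3, 0.7])
print([round(s, 3) for s in scores])
```

Because inference is a fixed sequence of matrix-vector products, its cost does not depend on how long the simulator has been running, which is consistent with the fast-inference property described above.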

The digital twin is implemented in Arena and parameterised using published Sungun operational data, following Azadi et al. [11]. It captures the main haulage-cycle logic, stochastic travel and service times, and operational constraints needed to generate realistic state transitions and reward signals for training and evaluation [13]. Input data (e.g., fleet configuration, travel times, loading and dumping times, and operational targets) are pre-processed and integrated into the digital twin and learning environment so that training and evaluation are based on consistent, case-specific operating conditions [11,13].

The learning components are implemented in Python 3.12.6 using PyTorch 2.4.1+cu118 for the DANN policy network and optimisation, while mine-operation dynamics are executed in the Arena-based digital twin. A custom interface links the learning code to the simulator to (i) advance the digital twin under a candidate policy, (ii) extract the next state, (iii) compute rewards from realised cycle times, and (iv) assemble state–action–reward training patterns for updating the DANN [13]. Figure 6 provides a software-level view of the coupling between the Arena-based digital twin and the learning agent, highlighting the modules responsible for state construction, action enumeration, reward calculation, and simulator execution.
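The four interface responsibilities listed above can be sketched as a single collection loop. The `ToySimulator` class is a hypothetical stand-in with invented dynamics; it does not represent the Arena model or its actual programming interface, and the `Pattern` record and `T_max` value are likewise assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pattern:
    """One state-action-reward training pattern."""
    state: List[float]
    action: int
    reward: float
    next_state: List[float]


class ToySimulator:
    """Hypothetical stand-in for the digital twin (not the real Arena interface)."""

    def __init__(self, n_routes: int, seed: int = 99) -> None:
        self.n_routes = n_routes
        self.rng = random.Random(seed)
        self.queues = [0] * n_routes

    def state(self) -> List[float]:
        return [float(q) for q in self.queues]

    def step(self, action: int) -> Tuple[List[float], float]:
        """Send one truck to `action`; return (next_state, realised cycle time)."""
        self.queues[action] += 1
        # Toy dynamics: the cycle lengthens with the queue at the chosen excavator.
        t_cycle = self.rng.triangular(20.0, 40.0, 28.0) + 3.0 * self.queues[action]
        served = self.rng.randrange(self.n_routes)  # one excavator finishes a truck
        self.queues[served] = max(0, self.queues[served] - 1)
        return self.state(), t_cycle


def collect_patterns(sim: ToySimulator, policy: Callable[[List[float]], int],
                     n_decisions: int, t_max: float = 60.0) -> List[Pattern]:
    patterns: List[Pattern] = []
    state = sim.state()
    for _ in range(n_decisions):
        action = policy(state)
        next_state, t_cycle = sim.step(action)   # (i) advance twin, (ii) extract next state
        reward = t_max - t_cycle                 # (iii) reward from realised cycle time
        patterns.append(Pattern(state, action, reward, next_state))  # (iv) assemble pattern
        state = next_state
    return patterns


# Usage with a simple shortest-queue policy as a placeholder.
patterns = collect_patterns(ToySimulator(3), lambda s: s.index(min(s)), n_decisions=5)
print(len(patterns), patterns[0].action)
```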

Figure 7 summarises the two-phase DRL–DANN development cycle used in this study (adapted from Ref. [16]). In Phase I, the Arena-based digital twin is run under exploratory dispatching to generate state–action–reward training patterns over successive truck-cycle decision points; in Phase II, these patterns are used to train/update the DANN policy. The updated policy is then evaluated, and the simulation is restarted, repeating the two phases until performance stabilises. Training proceeds iteratively across RL cycles. In exploration mode, actions are selected stochastically from the policy outputs to generate diverse state–action–reward trajectories. In implementation mode, the learned policy is applied deterministically (e.g., selecting the highest-scoring action) to evaluate performance in non-training runs. These evaluations are run in the digital twin over long simulation horizons (thousands of truck cycles) [13]. A pseudocode summary of the training loop is provided in the Supplementary Materials (Algorithm S1).
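The distinction between exploration mode and implementation mode can be sketched as follows. The text reports that ε decays from 1.0 to 0.01 but does not state the shape of the schedule or the exact sampling rule, so the linear decay and the score-proportional sampling used here are assumptions, as are the function names.

```python
import random
from typing import Optional, Sequence

EPS_START, EPS_END = 1.0, 0.01  # range reported in the text


def epsilon_at(step: int, total_steps: int) -> float:
    """Linear decay from EPS_START to EPS_END (the decay shape is an assumption)."""
    if total_steps <= 1:
        return EPS_END
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def select_action(scores: Sequence[float], mode: str, epsilon: float = 0.0,
                  rng: Optional[random.Random] = None) -> int:
    """Pick a destination index from the network's output scores."""
    indices = list(range(len(scores)))
    if mode == "implementation":
        # Deterministic: highest-scoring destination.
        return max(indices, key=lambda i: scores[i])
    if mode != "exploration":
        raise ValueError("mode must be 'exploration' or 'implementation'")
    rng = rng or random.Random()
    if rng.random() < epsilon or sum(scores) <= 0.0:
        return rng.randrange(len(scores))                 # uniform random exploration
    return rng.choices(indices, weights=scores, k=1)[0]   # score-proportional sampling


rng = random.Random(99)
scores = [0.2, 0.9, 0.4]
print(select_action(scores, "implementation"))
print(select_action(scores, "exploration", epsilon=epsilon_at(0, 1000), rng=rng))
```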

To support repeatable experiments, both the simulator stochastic seed and the neural-network initialisation seed are fixed (both set to 99) [13]. Unless otherwise stated, training uses 1000 optimisation epochs, a replay buffer of 10,000 transitions, and periodic target-network updates every 1000 training steps (dispatch decision points) [13]. All remaining learning and network-design settings are specified in Section 2.5.
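A minimal sketch of these repeatability controls is given below, using the reported seed, buffer capacity, and target-update interval. The eviction rule (oldest transition dropped first) and the class and function names are assumptions, since the text does not specify them.

```python
import random
from collections import deque
from typing import Any, List

SEED = 99                    # reported seed for simulator and network initialisation
BUFFER_CAPACITY = 10_000     # reported replay-buffer size (transitions)
TARGET_UPDATE_EVERY = 1_000  # reported target-network update interval (training steps)


class ReplayBuffer:
    """Fixed-size experience store; the oldest transitions are dropped first (assumption)."""

    def __init__(self, capacity: int = BUFFER_CAPACITY, seed: int = SEED) -> None:
        self._data: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, transition: Any) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int) -> List[Any]:
        """Draw a mini-batch without replacement, reproducibly for a given seed."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)


def should_update_target(step: int) -> bool:
    """True on every TARGET_UPDATE_EVERY-th training step."""
    return step > 0 and step % TARGET_UPDATE_EVERY == 0


buffer = ReplayBuffer()
for i in range(50):
    buffer.push(("state", i, "reward", "next_state"))
print(len(buffer), len(buffer.sample(8)), should_update_target(1000))
```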

Prior to training, continuous inputs (e.g., locations, times, queue lengths, and production targets) are normalised, while categorical status variables are one-hot encoded. Derived features include distance measures (truck-to-excavator and excavator-to-truck), average travel-time indicators, excavator efficiency measures, and average queue-length summaries to better capture congestion and service capacity in the state representation [13].
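The pre-processing step can be illustrated with a short sketch. Min-max scaling is assumed as the normalisation method because the text does not name one, and the scaling bounds, the feature selection, and the function names are hypothetical; the four truck phases follow the state description in Section 2.1.

```python
from typing import List, Sequence

TRUCK_PHASES = ("loading", "hauling", "dumping", "returning")


def min_max(value: float, low: float, high: float) -> float:
    """Scale a continuous input to [0, 1], clipping values outside the bounds."""
    if high <= low:
        raise ValueError("high must exceed low")
    return min(max((value - low) / (high - low), 0.0), 1.0)


def one_hot(category: str, categories: Sequence[str]) -> List[float]:
    """Encode a categorical status variable as a 0/1 vector."""
    if category not in categories:
        raise ValueError(f"unknown category: {category}")
    return [1.0 if c == category else 0.0 for c in categories]


def build_state(queue_lengths: Sequence[int], travel_time_min: float, phase: str,
                max_queue: int = 10, max_travel_min: float = 60.0) -> List[float]:
    """Concatenate normalised queue lengths, a travel-time indicator, and the phase.

    The bounds `max_queue` and `max_travel_min` are hypothetical.
    """
    features = [min_max(q, 0, max_queue) for q in queue_lengths]
    features.append(min_max(travel_time_min, 0.0, max_travel_min))
    features.extend(one_hot(phase, TRUCK_PHASES))
    return features


print(build_state([2, 0, 5], travel_time_min=15.0, phase="returning"))
```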

2.3. Case Study and Operation Simulator (Sungun Mine)

The digital-twin simulator, inspired by prior work on simulation-based decision-agent control in construction manufacturing [16], captures the key truck-allocation dynamics of the Sungun open-pit mining system. The model represents a fleet of 40 trucks serving nine excavators over multiple haulage routes and includes stochastic travel, loading, dumping, and return times, as well as truck breakdowns [11,13]. Figure 8 provides a schematic of the open-pit haulage system represented in the digital twin, highlighting the dispatching decision point and the routing of trucks among the nine excavators, associated queues, haulage routes, and dumping operations [13]. Each evaluation run executes the simulator over a week-long operating horizon to assess operational efficiency and cost-related performance measures. The simulator was parameterised using published Sungun case-study sources: the loading-point characteristics, including bench level, material type, loading device model, production rates, and ore grade, were compiled from Ref. [4], and Table 1, Table 2, Table 3 and Table 4 summarise the model inputs as the triangular process-time distributions used for loading, hauling, returning, and dumping in the digital twin.

The digital twin is parameterised using published Sungun case-study sources rather than a raw historical event log used directly for policy training. Accordingly, the DRL policy is trained on experience generated within the DES environment, and issues such as missing or erroneous training records do not arise in the usual data-driven sense. Instead, the main data-related considerations concern the provenance, adequacy, and representativeness of the published case-study inputs and the modelling assumptions used to translate them into the simulator.

The Sungun copper mine is located in East Azerbaijan, Iran, and serves as the case study used to parameterise and evaluate the dispatch policy in a realistic open-pit haulage setting [12,13]. The operation is characterised by continuous production and variable operating conditions (e.g., mountainous terrain and weather-related disruptions), which motivates modelling stochastic travel and service processes within the digital-twin simulator [12,13].

The digital twin represents a fleet configuration of 40 haul trucks (30-ton capacity) serving nine excavators with heterogeneous loading characteristics, yielding nine corresponding haulage routes (one per excavator) to a dumping area modelled with effectively unconstrained capacity [13]. Experiments are executed over a week-long horizon (7 days), using 20 independent replications and a daily shift length of 990 min to capture performance under stochastic variability and support statistically comparable policy evaluation [13].

Model inputs are compiled from published Sungun sources and case-study data reported in prior work, including fleet specifications, production rates, route distances, and operational disruption factors [11,12,13]. These inputs are processed to parameterise stochastic loading, hauling, dumping, return, and breakdown/maintenance processes within the simulator; triangular distributions are used to represent process-time variability in a computationally efficient replicated simulation setting [4,13].
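Sampling from triangular process-time distributions can be sketched with the Python standard library. The (low, mode, high) values below are hypothetical placeholders and are not the values reported in Tables 1 to 4; note that `random.triangular` takes its arguments in the order low, high, mode.

```python
import random
from statistics import mean
from typing import Dict, Tuple

# Hypothetical (low, mode, high) parameters in minutes; NOT the published values.
PROCESS_TIMES: Dict[str, Tuple[float, float, float]] = {
    "loading": (3.0, 4.0, 6.0),
    "hauling": (10.0, 12.0, 16.0),
    "dumping": (1.0, 1.5, 2.5),
    "returning": (8.0, 9.5, 13.0),
}


def sample_cycle_time(rng: random.Random,
                      params: Dict[str, Tuple[float, float, float]] = PROCESS_TIMES) -> float:
    """Draw one truck-cycle time as the sum of its four stochastic segments."""
    return sum(rng.triangular(low, high, mode)
               for (low, mode, high) in params.values())


def expected_cycle_time(params: Dict[str, Tuple[float, float, float]] = PROCESS_TIMES) -> float:
    """Analytical mean: a triangular distribution has mean (low + mode + high) / 3."""
    return sum((low + mode + high) / 3.0 for (low, mode, high) in params.values())


rng = random.Random(99)
samples = [sample_cycle_time(rng) for _ in range(5000)]
print(round(mean(samples), 2), round(expected_cycle_time(), 2))
```

Triangular distributions need only three parameters per process, which makes them straightforward to specify from published ranges and inexpensive to sample across replicated runs.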

2.4. Control Policies and Baselines

Three dispatch policies are evaluated under identical digital-twin operating conditions to isolate the effect of the control logic on haulage performance. The policies differ only in how an available truck selects its next excavator/route at each dispatch decision point; all subsequent loading, hauling, dumping, and return processes are executed by the simulator using the same stochastic process-time distributions and constraints. The evaluated policies are:

Random policy: Assigns an available truck to one of the candidate excavators/routes using a uniform random selection, providing a non-informed benchmark.
Rule-of-thumb policy: Applies a deterministic heuristic to prioritise candidate excavators/routes based on operational factors available in the state (e.g., material or destination priorities, proximity/travel time, and/or current congestion as reflected by queue lengths) [13].
DANN policy: Selects the next excavator/route using the learned state-to-action mapping represented by the DANN, enabling rapid inference during evaluation runs after training in the digital twin [13].
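The three policies share one signature in spirit: each maps the information available at a decision point to a destination index. The sketch below is illustrative only. In particular, the rule-of-thumb heuristic shown (minimum estimated travel plus waiting time) is a hypothetical example of a deterministic rule and is not the priority-based rule used in the study; the `minutes_per_queued_truck` parameter is likewise invented.

```python
import random
from typing import Sequence


def random_policy(n_routes: int, rng: random.Random) -> int:
    """Non-informed benchmark: uniform choice among candidate routes."""
    return rng.randrange(n_routes)


def rule_of_thumb_policy(queue_lengths: Sequence[int],
                         travel_times_min: Sequence[float],
                         minutes_per_queued_truck: float = 4.0) -> int:
    """Hypothetical deterministic heuristic: smallest travel time plus expected wait.

    Ties resolve to the lowest route index.
    """
    estimates = [t + minutes_per_queued_truck * q
                 for q, t in zip(queue_lengths, travel_times_min)]
    return min(range(len(estimates)), key=estimates.__getitem__)


def learned_policy(scores: Sequence[float]) -> int:
    """Implementation-mode choice: highest network output score."""
    return max(range(len(scores)), key=lambda i: scores[i])


queues = [3, 0, 1]
travel = [10.0, 12.0, 9.0]
print(random_policy(len(queues), random.Random(99)))
print(rule_of_thumb_policy(queues, travel))
print(learned_policy([0.1, 0.7, 0.3]))
```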

To ensure a fair comparison, all control policies are evaluated under identical digital-twin conditions, including the same simulator logic, stochastic input distributions, operating horizon, and replication settings, and are assessed via repeated simulation replications. Policy performance is quantified using the operational measures defined in Table 5 and reported in Section 3.

2.5. Hyperparameter Optimisation

The DANN dispatch policy is trained using deep reinforcement learning within the mining-system digital twin, and policy quality is therefore sensitive to the learning configuration used during training. A structured hyperparameter optimisation was conducted to improve training stability and to establish a repeatable configuration for all subsequent baseline comparisons reported in Section 3.

The optimisation targeted the learning and network-design parameters most directly associated with convergence behaviour and final policy quality: learning rate, mini-batch size, reward length, number of hidden layers, and nodes per hidden layer. These hyperparameters were varied systematically under otherwise identical simulator conditions so that observed differences could be attributed to learning configuration rather than changes in mine-operation dynamics or stochastic inputs.
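One way to vary hyperparameters systematically is a one-factor-at-a-time sweep around a reference configuration, sketched below. Whether the study used this design or a fuller grid is not stated, so the design itself is an assumption. All baseline and candidate values are hypothetical, except that the reward-length candidates span the 2 to 1200 cycle range reported in Section 2.1.

```python
from typing import Any, Dict, List

# Hypothetical reference configuration (not the tuned values in Table 6).
BASELINE: Dict[str, Any] = {
    "learning_rate": 1e-3,
    "batch_size": 64,
    "reward_length": 100,
    "hidden_layers": 3,
    "nodes_per_layer": 64,
}

# Hypothetical candidate values for each tuned hyperparameter.
CANDIDATES: Dict[str, List[Any]] = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "reward_length": [2, 100, 1200],
    "hidden_layers": [2, 3, 4],
    "nodes_per_layer": [32, 64, 128],
}


def one_at_a_time(baseline: Dict[str, Any],
                  candidates: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return the baseline plus every configuration that changes exactly one setting."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


configs = one_at_a_time(BASELINE, CANDIDATES)
print(len(configs))
```

Each configuration would then be trained and evaluated under the same simulator seed and replication settings, so that differences in the performance measures can be attributed to the changed setting.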

Candidate configurations were evaluated using the operational performance measures defined in Table 5, which capture both productivity and system inefficiency/bottlenecks and provide a consistent basis for comparing alternative training settings. The selection of the preferred configuration was based on a multi-metric view that prioritised improvements in completed truck cycles and reductions in idle-truck and idle-excavator penalties, while also considering robustness across replications.

To support replication, Table 6 consolidates the DRL–DANN training configuration and the monitoring and seed controls used for the sensitivity experiments and the subsequent comparative runs. The table is organised into two groups. The first group lists the five hyperparameters tuned in this section, in the same order as the panels in Figure 9. The second group lists the remaining training settings, which were held constant across all candidate configurations. Figure 9 summarises the sensitivity results for the tuned hyperparameters, and Table 6 reports the final optimised settings used in the comparative experiments in Section 3.

2.6. Operating-Parameter Sensitivity and Robustness Tests

In addition to the learning-configuration sensitivity analysis in Section 2.5, an operating-parameter sensitivity analysis was conducted to assess the robustness of the trained dispatch policy under plausible variations in mine conditions. The analysis tests whether policy performance remains stable when key physical and operational inputs deviate from the calibrated baseline used to parameterise the digital twin.

The sensitivity design perturbed a defined subset of simulation inputs by ±10% and ±20% relative to their baseline values. The perturbed inputs included representative haul-route travel times, excavator service/cycle times (truck loading), and truck service times (loading and unloading); queueing effects were not perturbed directly but arose as a consequence of these changes. During these robustness tests, the tuned DRL–DANN training configuration identified in Section 2.5 was held fixed, and all other simulator settings, stochastic inputs, and replication controls were kept consistent with the baseline evaluation settings to support fair comparison across conditions.
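The perturbation design can be sketched as a scaling of selected process-time parameters. Scaling all three triangular parameters by the same factor is an assumption about how the perturbation was applied, and the baseline values and names below are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triangular = Tuple[float, float, float]  # (low, mode, high) in minutes

PERTURBATION_LEVELS = (-0.20, -0.10, 0.10, 0.20)  # levels reported in the text

# Hypothetical baseline parameters; NOT the published Sungun values.
BASELINE_TIMES: Dict[str, Triangular] = {
    "loading": (3.0, 4.0, 6.0),
    "hauling": (10.0, 12.0, 16.0),
    "dumping": (1.0, 1.5, 2.5),
    "returning": (8.0, 9.5, 13.0),
}


def scale_triangular(params: Triangular, level: float) -> Triangular:
    """Shift low, mode, and high by the same relative amount (assumed design)."""
    low, mode, high = params
    factor = 1.0 + level
    return (low * factor, mode * factor, high * factor)


def perturbed_scenarios(baseline: Dict[str, Triangular], targets: Iterable[str],
                        levels: Iterable[float] = PERTURBATION_LEVELS
                        ) -> Dict[float, Dict[str, Triangular]]:
    """Build one scenario per level; inputs outside `targets` keep baseline values."""
    target_set = set(targets)
    return {
        level: {name: scale_triangular(p, level) if name in target_set else p
                for name, p in baseline.items()}
        for level in levels
    }


scenarios = perturbed_scenarios(BASELINE_TIMES, targets=["hauling", "returning"])
print(scenarios[0.10]["hauling"], scenarios[0.10]["loading"])
```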

The perturbed operating parameters were selected to reflect practical sources of variability in haulage operations, including route travel times and loading and unloading service times, so that robustness could be assessed under plausible deviations from the calibrated baseline.

Performance under each perturbed condition was quantified using the operational measures defined in Table 5. Results are reported as changes relative to the baseline condition using the same aggregation approach applied elsewhere in the evaluation, including aggregation across replications over a consistent operating horizon. This enables robustness to be interpreted alongside the baseline comparisons in Section 3.
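The relative-change reporting can be expressed as a small calculation over replication results. This sketch assumes that aggregation means averaging a measure across replications before comparing conditions; the numbers are illustrative and are not results from the study.

```python
from statistics import mean
from typing import Sequence


def relative_change_pct(baseline_runs: Sequence[float],
                        scenario_runs: Sequence[float]) -> float:
    """Percentage change of the replication mean relative to the baseline mean."""
    base = mean(baseline_runs)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    return 100.0 * (mean(scenario_runs) - base) / base


# Illustrative completed-cycle counts per replication (hypothetical values).
baseline_cycles = [1000.0, 1010.0, 990.0, 1000.0]
perturbed_cycles = [950.0, 960.0, 940.0, 950.0]
print(round(relative_change_pct(baseline_cycles, perturbed_cycles), 2))
```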

3. Results

This section evaluates the performance of the proposed DANN dispatch policy on the Sungun case study, using the operational measures defined in Table 5 as the common basis for comparison across all experiments. The section first establishes the headline improvement over the rule-of-thumb baseline using long-horizon operational traces, then shows how policy performance improves from the initial to the final trained model to make clear that the reported gains reflect learning rather than a single favourable realisation. This is followed by a concise credibility assessment in which the calibrated digital twin reproduces observed daily completed truck-cycle counts, before testing robustness under a controlled perturbation of effective truck load capacity. The section closes with a brief scalability assessment that examines how production responds as fleet size increases, providing an operational interpretation of where additional trucks cease to translate into higher throughput under the same excavation-haulage configuration.

3.1. Primary Baseline Comparison: Final Trained Policy Versus the Rule-of-Thumb Policy

The baseline evaluation compares the learned DRL dispatch policy (implemented as a DANN) with the rule-of-thumb policy under identical digital-twin conditions and over a long validation horizon. Figure 10 presents the resulting 6000-cycle production trace for the final trained model, covering both the DANN and rule-of-thumb policies, with performance reported relative to the random policy baseline. The learned policy delivers a clear productivity gain: the average production rate per truck cycle increases by approximately 17% relative to the rule-of-thumb approach, and the advantage is sustained across the full trace. This comparison provides the central evidence that an experience-based policy can outperform a conventional heuristic in a stochastic, congestion-sensitive haulage system.

The improvement is consistent with more efficient allocation at dispatch decision points. In practical terms, the learned policy better balances competing routes and loading points as conditions evolve, reducing avoidable congestion and underutilisation that can arise when a fixed heuristic over-commits trucks to locally attractive choices. The long-horizon trace in Figure 10 therefore serves not only as the primary comparison, but also as a stability check that the advantage persists when the policy is applied over a full operational week in long-horizon simulator evaluation runs.

Truck cycle time, along with the ability to estimate or predict its components, is a central quantity for analysing open-pit haulage performance and comparing operational alternatives [19,20]. To support the interpretation of why dynamic dispatch matters in this case study, Table 7 decomposes representative cycle-time components by excavator and route, including loading, hauling, dumping, and return segments. The table highlights the heterogeneity in route and service characteristics that a dispatch policy must manage in real time. The baseline gains shown in Figure 10 are consistent with a policy that learns to exploit this heterogeneity by steering trucks towards better system-level choices as queues and short-term imbalances emerge, rather than relying on a fixed rule that cannot account for the non-linear interactions among travel times, service times, and queueing.

3.2. Policy Maturation from the Initial to the Final Trained Model

Section 3.1 reports the primary baseline comparison using the final trained policy. To clarify that this advantage is a learned effect rather than an artefact of a particular simulation run, it is helpful to show how dispatch behaviour changes from the initial to the final DRL model. The initial-policy and final-policy long-horizon traces in Figure 11 and Figure 10, respectively, provide this context over the same 6000-truck-cycle operating horizon, with results reported relative to the same random-policy baseline.

The initial-policy trace in Figure 11 reflects an early-stage controller that has not yet developed stable dispatch preferences. Performance is therefore more variable across the horizon as the policy explores alternative actions and encounters different congestion states. The final-policy trace in Figure 10 shows a clearer and more consistent improvement pattern, indicating that training has produced a more reliable response as conditions evolve over time. Read together, the two traces capture the maturation of the learned dispatch policy: the final model more consistently anticipates and mitigates queue growth at excavators and congestion on shared haul routes, rather than reacting myopically to immediate local conditions.

This visual comparison is reinforced by the aggregate effect reported in Section 3.1. Over the same 6000-cycle evaluation horizon, the final trained policy achieves an average production-rate improvement of approximately 17% relative to the rule-of-thumb baseline, whereas the initial policy does not exhibit the same sustained advantage. The progression from Figure 11 to Figure 10 therefore makes explicit that the reported performance gain emerges through training as dispatch decisions become more stable and better aligned with system-level operating conditions. Training progression is also evident at the RL-cycle scale, where average cycle-time performance improves across successive RL cycles in the dissertation evaluation. This motivates focusing the remaining experiments on the final trained policy, while still documenting the learning progression explicitly.

3.3. Digital-Twin Fidelity Against Observed Operations

Before turning to robustness and scalability, it is important to confirm that the calibrated digital twin reproduces the observed production level of the Sungun operation at an aggregate daily scale. Figure 12 compares the number of completed truck cycles generated by the simulator with the corresponding observed daily counts over the same operating period. This comparison provides a practical test of model fidelity, supporting the conclusion that the simulator operates in the correct throughput regime and captures the day-to-day variation present in the measured system.

Figure 12 shows close agreement between simulated and observed daily completed-cycle counts in both magnitude and overall pattern. This supports the fidelity of the calibrated digital twin as the evaluation environment for the dispatch policies studied here. The baseline comparisons in Section 3.1 and Section 3.2, together with the robustness and scalability tests that follow, are all conducted within this same calibrated environment under consistent replication controls, so differences in performance can be interpreted as policy-driven rather than artefacts of an unrealistic operating baseline.
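The paper reports the agreement in Figure 12 graphically. One possible way to quantify such agreement, offered here only as a sketch, is to compute the mean absolute percentage error and the mean bias between simulated and observed daily counts. The function names are illustrative, and the counts below are placeholder values chosen for the example, not data read from Figure 12.

```python
def mean_absolute_percentage_error(observed, simulated):
    """Average of |simulated - observed| / observed, in percent.

    Assumes both sequences have the same length and that every
    observed count is positive.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = [abs(s - o) / o for o, s in zip(observed, simulated)]
    return 100.0 * sum(errors) / len(errors)


def mean_bias(observed, simulated):
    """Average signed difference (simulated minus observed), in cycles per day."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(s - o for o, s in zip(observed, simulated)) / len(observed)


# Placeholder daily completed-cycle counts (hypothetical, not from the paper).
observed_cycles = [100, 110, 90, 105]
simulated_cycles = [102, 108, 93, 105]

mape = mean_absolute_percentage_error(observed_cycles, simulated_cycles)
bias = mean_bias(observed_cycles, simulated_cycles)
print(f"MAPE: {mape:.2f}%  mean bias: {bias:+.2f} cycles/day")
```

A low percentage error with a bias near zero would be consistent with the simulator operating in the correct throughput regime; a large signed bias would instead suggest systematic over- or under-prediction.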

3.4. Robustness Under Operating-Parameter Perturbation

A practical dispatch policy should remain effective when operating conditions deviate from the calibrated baseline used to parameterise the digital twin. Robustness is assessed by perturbing truck load capacity by ±10% and ±20% while holding the trained policy fixed and keeping all other simulator settings and replication controls consistent with the baseline evaluation. Figure 13 reports the resulting sensitivity behaviour relative to the baseline condition using the same performance measures applied in the preceding comparisons.
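The perturbation design can be sketched as a small scenario table. The example below assumes the nominal 30-ton truck capacity given in the headers of Tables 1 to 4 and reuses the simulator seed of 99 from Table 6 as the replication control; the `Scenario` class and `build_scenarios` function are hypothetical names introduced for illustration and do not come from the authors' code.

```python
from dataclasses import dataclass

BASE_CAPACITY_TONS = 30.0                  # nominal truck load (Tables 1 to 4)
PERTURBATION_PERCENT = (-20, -10, 0, 10, 20)
SIMULATOR_SEED = 99                        # simulator stochastic seed (Table 6)


@dataclass(frozen=True)
class Scenario:
    """One robustness run: the trained policy is fixed, only capacity changes."""
    perturbation_percent: int
    truck_capacity_tons: float
    simulator_seed: int


def build_scenarios(base=BASE_CAPACITY_TONS,
                    percents=PERTURBATION_PERCENT,
                    seed=SIMULATOR_SEED):
    """Return one scenario per capacity perturbation, all sharing the same seed."""
    return [Scenario(p, base * (100 + p) / 100, seed) for p in percents]


scenarios = build_scenarios()
for scenario in scenarios:
    print(f"{scenario.perturbation_percent:+d}% -> "
          f"{scenario.truck_capacity_tons:.1f} tons")
```

Under these assumptions the five scenarios correspond to capacities of 24, 27, 30, 33, and 36 tons, with the 0% case serving as the calibrated baseline against which the others are compared.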

The results in Figure 13 show that the policy advantage is preserved under these plausible capacity perturbations. While absolute throughput changes as capacity is increased or decreased, the trained dispatch policy continues to exhibit a stable improvement pattern rather than brittle behaviour when conditions shift away from the calibrated setting. This outcome is consistent with a policy that responds to the haulage, loading, and queueing dynamics of the system, rather than one whose effectiveness is confined to a single narrowly defined operating point, and it provides a natural bridge to the scalability check that follows.

3.5. Scalability with Increasing Fleet Size

Scalability is assessed by examining how production responds as fleet size increases under the same excavation-haulage configuration. Figure 14 reports the relationship between the number of trucks and the resulting production rate. The curve provides a compact systems-level view of how additional hauling capacity translates into throughput as the dispatch problem becomes increasingly congestion- and queueing-dominated.

Figure 14 shows that production gains do not increase proportionally with fleet size. Instead, production rises initially as additional trucks reduce periods of underutilisation, then progressively levels off as excavator service limits, shared-route interactions, and queue formation become the dominant constraints. This behaviour is consistent with the cycle-time heterogeneity reported in Table 7, which highlights that loading, hauling, dumping, and return segments contribute unevenly across excavators and routes. The resulting saturation pattern reinforces the need for a dispatch policy that responds to system state, rather than relying on fixed allocation rules that cannot adapt to changing congestion and local imbalances.
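The levelling-off behaviour can be illustrated with a standard bottleneck bound for a closed loop of trucks serving one loader. This is a deliberately simplified, deterministic sketch and not the DES used in the paper: it ignores stochastic times, dump queueing, and route interactions, and it uses the run 1 values for Excavator 1 from Table 7 (978 s cycle, 122 s loading) purely as example inputs. The function names are illustrative.

```python
import math

SECONDS_PER_HOUR = 3600.0


def loads_per_hour_bound(n_trucks, cycle_time_s, service_time_s):
    """Upper bound on loads per hour for n trucks sharing one loader.

    Throughput cannot exceed what the fleet can deliver with no queueing
    (n / cycle time) nor what the loader can serve (1 / service time).
    """
    if n_trucks < 0:
        raise ValueError("n_trucks must be non-negative")
    fleet_limited = n_trucks * SECONDS_PER_HOUR / cycle_time_s
    server_limited = SECONDS_PER_HOUR / service_time_s
    return min(fleet_limited, server_limited)


def saturation_fleet_size(cycle_time_s, service_time_s):
    """Smallest fleet size at which the loader becomes the binding limit."""
    return math.ceil(cycle_time_s / service_time_s)


CYCLE_S = 978.0    # Excavator 1 total cycle time, run 1 (Table 7)
LOADING_S = 122.0  # Excavator 1 loading time, run 1 (Table 7)

curve = {n: loads_per_hour_bound(n, CYCLE_S, LOADING_S) for n in range(1, 13)}
n_star = saturation_fleet_size(CYCLE_S, LOADING_S)
for n, rate in curve.items():
    print(f"{n:2d} trucks: at most {rate:5.2f} loads/h")
print(f"Loader-limited from {n_star} trucks onward")
```

In this idealised single-loader loop the bound grows linearly with fleet size and then flattens once the loader is continuously busy. The simulated curve in Figure 14 reflects the full multi-excavator system with queueing, so it should be expected to bend more gradually than this bound.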

4. Discussion

4.1. Interpretation of Results and Broader Implications

The results indicate that the learned DRL dispatch policy improves operational performance primarily by anticipating short-term congestion and balancing service across excavators and dump points. Unlike fixed dispatching rules, including the priority-based rule-of-thumb benchmark used here, the policy can condition decisions on multiple simultaneous cues, such as current queue lengths, truck locations, and expected near-term arrivals.

Performance gains are most plausibly explained by reduced idling and more consistent utilisation of constrained resources. When the policy routes trucks to avoid transient congestion build-up, it lowers the probability of long queues forming at a single excavator or dump and reduces the variance in cycle completion times, which in turn stabilises throughput and schedule reliability.

The sensitivity experiments and hyperparameter optimisation reported in this study are important because small changes in state definition, reward weighting, or learning parameters can materially affect both performance and decision stability. The resulting model selection process provides a clearer basis for transferring the approach to other sites or to construction earthmoving contexts where fleet size, travel networks, and operational objectives differ.

Future work should compare the proposed DRL policy against a broader set of benchmark methods, including optimisation-based dispatch strategies and other advanced adaptive approaches, to more comprehensively position its performance across different operating settings.

4.2. Practical Deployment Considerations

If practical deployment were pursued, the decision-support loop would require reliable event data streams (e.g., truck dispatch times, queue observations, and excavator/dump availability) and a clear definition of the decision trigger (e.g., whenever a truck becomes available). In many operations, dispatch updates on the order of seconds to a minute are sufficient; the key requirement is that inference time remains negligible relative to the dispatch cadence.
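The inference-time requirement can be checked with a simple timing harness. The sketch below is an assumption-laden stand-in rather than the trained policy: it uses a single hidden layer of 64 nodes (the optimum in Table 6), nine actions (one per loading point in Tables 1 to 3), random weights, a ReLU activation, no bias terms, and a hypothetical state vector of 20 features, since the actual feature definition is given in the supplementary Table S1.

```python
import random
import statistics
import time

N_FEATURES = 20  # hypothetical state size; the real definition is in Table S1
N_HIDDEN = 64    # nodes per hidden layer (Table 6 optimum)
N_ACTIONS = 9    # one candidate loading point per action (Tables 1 to 3)

rng = random.Random(99)


def random_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


W_HIDDEN = random_matrix(N_HIDDEN, N_FEATURES)
W_OUTPUT = random_matrix(N_ACTIONS, N_HIDDEN)


def greedy_action(state):
    """Forward pass through one ReLU hidden layer, then pick the best score."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, state))) for row in W_HIDDEN]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUTPUT]
    return max(range(N_ACTIONS), key=scores.__getitem__)


def median_latency_s(n_calls=200):
    """Median wall-clock time of one dispatch decision, in seconds."""
    samples = []
    for _ in range(n_calls):
        state = [rng.random() for _ in range(N_FEATURES)]
        start = time.perf_counter()
        greedy_action(state)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


latency = median_latency_s()
print(f"Median decision latency: {latency * 1e3:.3f} ms")
```

Comparing the measured latency with the intended dispatch cadence (seconds to a minute, as noted above) gives a direct check that inference is not the limiting step; in a real deployment the same measurement would be repeated on the target hardware with the trained network.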

A second consideration is policy maintenance. Changes in haul-road geometry, equipment mix, or operating rules may necessitate periodic retraining. A DES digital twin provides a safe environment for updating the policy offline and validating behaviour before deployment, but the approach remains dependent on simulation fidelity and data quality. Any live implementation would also require site-specific governance, operational oversight, performance monitoring, and formal change-management procedures, which are beyond the scope of the present research study.

4.3. Limitations and Threats to Validity

As with any simulation-based study, the present work has limitations that should be considered when interpreting the findings. The primary evaluation is anchored to one detailed case study, and the conclusions depend on both the adequacy and representativeness of the case-study inputs used to parameterise the simulator and the modelling assumptions embedded in the DES, including travel-time distributions, service-time models, breakdown representation, and operational constraints. In addition, although the fleet-size sensitivity experiments provide evidence on scalability trends, they cannot fully reproduce the joint distribution of disruptions and human operational practices observed in the field. These considerations suggest interpreting the results as evidence of policy feasibility and performance potential within a DES-based evaluation environment, rather than as a general guarantee of performance or as certification-level validation for safety-critical deployment. Future work should include broader multi-site field validation, rare-event stress testing, and formal hazard analysis in more deployment-oriented settings.

4.4. Implications for Construction Earthmoving Logistics

Although the case study is a mine, the control structure aligns closely with construction earthmoving and quarry operations, where truck fleets serve multiple loaders and dumps under variable travel conditions. The proposed DES–DRL workflow can therefore be viewed as a general method for developing experience-based dispatch policies in data-enabled construction operations, particularly when combined with site-specific digital twins used for planning and real-time monitoring.

5. Conclusions

This paper develops and evaluates a deep reinforcement learning dispatch policy for truck allocation in a stochastic open-pit haulage system. The approach couples a discrete-event simulation digital twin with DRL training and includes systematic hyperparameter optimisation and sensitivity testing to support transparent model selection.

Across the evaluated experiments, the learned policy improves dispatch performance relative to the rule-of-thumb baseline by reducing avoidable waiting and mitigating transient congestion, while producing actions fast enough to support real-time dispatch. Fleet-size sensitivity results indicate diminishing returns as additional trucks are added under a fixed excavation-haulage configuration, reinforcing the importance of state-responsive dispatch in congestion- and queueing-driven regimes. Robustness tests under plausible truck load-capacity perturbations indicate that performance advantages are preserved when operating conditions deviate from the calibrated baseline, subject to simulator fidelity and the adequacy of the state representation.

Key limitations relate to reliance on simulator fidelity and the availability and quality of operational data for parameterisation, the use of one primary case study, simplifications in the current state representation, limited transfer across sites and conditions, and the use of a single-objective reward formulation. Accordingly, future work should focus on (i) broader field validation, to address simulator- and case-study-dependence; (ii) richer state features (e.g., grade or quality constraints and maintenance state), to address limits in the current state representation; (iii) multi-objective reward formulations that jointly trade off cost, production, and emissions, to address reward simplification; (iv) transfer-learning strategies, to address limited transfer across sites and changing operating conditions; and (v) explainability-oriented analyses to improve interpretation of learned policy behaviour and support decision transparency.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/buildings16112274/s1. The following information complements the methods described in the manuscript and provides additional detail for reproducibility: Table S1, complete state-feature definition; and Algorithm S1, DRL training-loop pseudocode.

Author Contributions

Conceptualization, M.H. and I.F.; Methodology, M.H. and I.F.; Software, M.H. and I.F.; Validation, M.H., I.F. and F.F.; Formal analysis, M.H., I.F. and F.F.; Investigation, M.H. and I.F.; Resources, M.H. and I.F.; Data curation, M.H. and F.F.; Writing—original draft, M.H. and I.F.; Writing—review & editing, M.H., I.F. and F.F.; Visualization, M.H., I.F. and F.F.; Supervision, I.F.; Project administration, I.F.; Funding acquisition, I.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge support from the University of Florida, M.E. Rinker, Sr. School of Construction Management, through internal funds that supported a graduate student teaching assistantship associated with this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hustrulid, W.A.; Kuchta, M.; Martin, R.K. Open Pit Mine Planning and Design, Two Volume Set & CD-ROM Pack, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2013.
2. Ramazan, S.; Dagdelen, K.; Johnson, T.B. Fundamental tree algorithm in optimising production scheduling for open pit mine design. Min. Technol. 2005, 114, 45–54.
3. Ozdemir, B.; Kumral, M. Simulation-based optimization of truck-shovel material handling systems in multi-pit surface mines. Simul. Model. Pract. Theory 2019, 95, 36–48.
4. Mohtasham, M.; Mirzaei Nasirabad, H.; Mahmoodi Markid, A. Development of a goal programming model for optimization of truck allocation in open pit mines. J. Min. Environ. 2017, 8, 359–371.
5. Topal, E.; Ramazan, S. A new MIP model for mine equipment scheduling by minimizing maintenance cost. Eur. J. Oper. Res. 2010, 207, 1065–1071.
6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
7. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; Available online: http://incompleteideas.net/book/the-book-2nd.html (accessed on 19 May 2026).
8. Noriega, R.; Pourrahimian, Y.; Askari-Nasab, H. Deep Reinforcement Learning based real-time open-pit mining truck dispatching system. Comput. Oper. Res. 2025, 173, 106815.
9. van Eyk, L.; Heyns, P.S. A framework to define, design and construct digital twins in the mining industry. Comput. Ind. Eng. 2025, 200, 110805.
10. Khadivi, M.; Charter, T.; Yaghoubi, M.; Jalayer, M.; Ahang, M.; Shojaeinasab, A.; Najjaran, H. Deep reinforcement learning for machine scheduling: Methodology, the state-of-the-art, and future directions. Comput. Ind. Eng. 2025, 200, 110856.
11. Azadi, N.; Monjezi, M.; Ataaeipour, M. Application of Arena Software to Optimize Sungon Copper Mine Transport Fleet. In Proceedings of the International Mining Conference, Olympic Hotel, Tehran, Iran, February 2013; Available online: https://www.researchgate.net/publication/322992825_Application_of_Arena_software_to_optimize_Sungon_copper_mine_transport_fleet (accessed on 6 February 2026).
12. Saadatmand Hashemi, A.; Sattarvand, J. Simulation Based Investigation of Different Fleet Management Paradigms in Open Pit Mines—A Case Study of Sungun Copper Mine. Arch. Min. Sci. 2015, 60, 195–208.
13. Hatami, M.; Flood, I. Optimizing Truck Allocation in Open-Pit Mining Using a Deep Reinforcement Learning Policy. In Advances in Information Technology in Civil and Building Engineering: Proceedings of ICCCBE 2024, Volume 2, Simulation and Automation; Francis, A., Miresco, E., Melhado, S., Eds.; Lecture Notes in Civil Engineering; Springer: Cham, Switzerland, 2025; Volume 629, pp. 133–147.
14. Chang, Y.; Ren, H.; Wang, S. Modelling and Optimizing an Open-Pit Truck Scheduling Problem. Discret. Dyn. Nat. Soc. 2015, 2015, 745378.
15. Ghaziania, H.H.; Monjezi, M.; Mousavi, A.; Dehghani, H.; Bakhtavar, E. Design of Loading and Transportation Fleet in Open-Pit Mines Using Simulation Approach and Metaheuristic Algorithms. J. Min. Environ. 2021, 12, 1177–1188.
16. Flood, I.; Flood, P.D.L. Intelligent Control of Construction Manufacturing Processes Using Deep Reinforcement Learning. In Proceedings of the 12th International Conference on Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH 2022), Lisbon, Portugal, 14–16 July 2022; SciTePress: Setúbal, Portugal, 2022; pp. 112–122.
17. Liu, R.; Piplani, R.; Toro, C. A deep multi-agent reinforcement learning approach to solve dynamic job shop scheduling problem. Comput. Oper. Res. 2023, 159, 106294.
18. Hatami, M. Deep Reinforcement Learning Application for Intelligent Control and Optimization of Truck Allocation in Mining Operations. Doctoral Dissertation, University of Florida, Gainesville, FL, USA, 2023.
19. Chanda, E.K.; Gardiner, S. A comparative study of truck cycle time prediction methods in open-pit mining. Eng. Constr. Archit. Manag. 2010, 17, 446–460.
20. Erarslan, K. Modelling performance and retarder chart of off-highway trucks by cubic splines for cycle time estimation. Min. Technol. 2005, 114, 161–166.

Figure 1. Dynamic system control by a decision agent (adapted from Ref. [16]).

Figure 2. Historic track of a real system to be modelled, followed by simulated alternative future tracks of the system (adapted from Ref. [16]).

Figure 3. DANN architecture and input feature groups for dispatch-policy approximation.

Figure 4. End-to-end workflow for DRL policy development, evaluation, and refinement.

Figure 5. Agent–environment interaction loop for DRL training and policy update.

Figure 6. Software components linking the digital twin and the DRL agent, showing the environment interface and simulator core. Solid arrows indicate object associations and data-flows, while the dashed arrow indicates waiting-truck queue information made available to the simulator.

Figure 7. Two-phase reinforcement learning DANN development cycle (adapted from Ref. [16]).

Figure 8. Schematic model of the open-pit mining system. Solid arrows indicate truck/process flow, while dashed arrows indicate decision-agent state and action links (adapted from Ref. [13]).

Figure 9. Hyperparameter sensitivity results for the DRL-trained DANN dispatch policy (adapted from Ref. [18]).

Figure 10. Performance of the DANN dispatch policy in the final DRL model over 6000 truck cycles (random selection policy used as baseline).

Figure 11. Performance of the DANN dispatch policy in the initial DRL model over 6000 truck cycles (random selection policy used as baseline).

Figure 12. Daily completed-truck-cycle comparison between the simulation model and observed operational data.

Figure 13. Sensitivity of dispatch-policy performance to truck load capacity variation.

Figure 14. Sensitivity analysis: Different numbers of trucks vs. production rate.

Table 1. Loading-time distributions for 30-ton trucks at each loading point (s; triangular distribution).

Loading Point	Minimum Loading Time (s)	Mode of Loading Time (s)	Maximum Loading Time (s)
1	108	122	131
2	117	135	151
3	108	122	131
4	95	109	123
5	108	122	131
6	95	109	123
7	117	135	151
8	108	122	131
9	95	109	123

Table 2. Hauling-time distributions for 30-ton trucks by loading point (s; triangular distribution).

Loading Point	Minimum Hauling Time (s)	Mode of Hauling Time (s)	Maximum Hauling Time (s)
1	385	409	431
2	340	362	385
3	355	376	393
4	351	373	388
5	364	371	379
6	390	396	399
7	389	397	409
8	341	373	391
9	359	371	382

Table 3. Return-time distributions for 30-ton trucks by loading point (s; triangular distribution).

Loading Point	Minimum Return Time (s)	Mode of Return Time (s)	Maximum Return Time (s)
1	325	379	403
2	297	315	336
3	310	339	367
4	313	344	371
5	324	348	369
6	340	355	372
7	338	357	366
8	299	309	323
9	339	357	363

Table 4. Dumping-time distributions for 30-ton trucks at dump points in the target shift (s; triangular distribution).

Dump Point	Minimum Dumping Time (s)	Mode of Dumping Time (s)	Maximum Dumping Time (s)
Dump 1950	62	68	75
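The triangular distributions in Tables 1 to 4 can be sampled directly with the Python standard library. The sketch below draws uncongested cycle times for a truck assigned to loading point 1, using the minimum, mode, and maximum values from the tables, and compares the sample mean with the analytic mean of a triangular distribution, (min + mode + max) / 3. It ignores queueing and breakdowns, so it illustrates only the input distributions and not the authors' DES; the names are illustrative.

```python
import random

# (minimum, mode, maximum) in seconds for loading point 1 and dump 1950,
# copied from Tables 1 to 4.
SEGMENTS_POINT_1 = {
    "loading": (108, 122, 131),
    "hauling": (385, 409, 431),
    "dumping": (62, 68, 75),
    "return": (325, 379, 403),
}


def sample_cycle_time(segments, rng):
    """Draw one uncongested cycle time by sampling each segment independently."""
    # random.triangular takes its arguments in the order (low, high, mode).
    return sum(rng.triangular(low, high, mode)
               for low, mode, high in segments.values())


def analytic_mean(segments):
    """Sum of triangular means, (min + mode + max) / 3, over all segments."""
    return sum((low + mode + high) / 3 for low, mode, high in segments.values())


rng = random.Random(99)
samples = [sample_cycle_time(SEGMENTS_POINT_1, rng) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)
expected_mean = analytic_mean(SEGMENTS_POINT_1)
print(f"Sample mean: {sample_mean:.1f} s, analytic mean: {expected_mean:.1f} s")
```

For loading point 1 the analytic mean is 966 s, slightly below the 978 s obtained by summing the modes, because the loading, hauling, and return distributions are skewed towards their minima.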

Table 5. Performance metrics used for evaluation.

Metric	Description	Importance
Completed Truck Cycles	Number of truck cycles completed during a simulation run	Measures productivity of the mining operation
Idle Truck Penalties	Penalties incurred for idle trucks	Indicates inefficient use of trucks
Waiting Excavator Penalties	Penalties incurred for waiting excavators	Indicates bottlenecks in the mining operation

Table 6. DRL–DANN training configuration for the hyperparameter sensitivity analysis and comparative evaluation.

Fine-Tuned Hyperparameters:	Values:
Learning rate, α.	Initial = 0.01. Range = 0.0001 → 1.0: optimum = 0.001
Mini-batch size.	Initial = 32. Range = 2 → 256: optimum = 64
Reward length.	Initial = 40. Range = 2 → 1200: optimum = 40
Number of hidden layers.	Initial = 5. Range = 1 → 10: optimum = 1
Number of nodes per hidden layer.	Initial = 32. Range = 2 → 256: optimum = 64
Fixed Training Settings:	Values:
Discount factor, γ.	0.99
Exploration rate, ε.	Decays: 1.0 → 0.01
Number of epochs.	1000
Target network update frequency.	Every 1000 training steps (dispatch decision points)
Replay buffer size.	10,000 transitions
Simulator stochastic seed.	99
DANN initialisation seed.	99
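The settings in Table 6 can be gathered into a single configuration object, which makes the selected model easier to reproduce. The sketch below records the optimum and fixed values from the table. Table 6 states only that the exploration rate decays from 1.0 to 0.01, so the geometric per-epoch schedule shown here is an assumption, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Optimum and fixed settings reported in Table 6."""
    learning_rate: float = 0.001
    mini_batch_size: int = 64
    reward_length: int = 40
    hidden_layers: int = 1
    nodes_per_hidden_layer: int = 64
    discount_factor: float = 0.99
    epsilon_start: float = 1.0
    epsilon_end: float = 0.01
    epochs: int = 1000
    target_update_steps: int = 1000
    replay_buffer_size: int = 10_000
    simulator_seed: int = 99
    network_seed: int = 99


def epsilon_at(epoch, config):
    """Exploration rate at a given epoch (0-based).

    Assumed schedule: geometric decay from epsilon_start at the first epoch
    to epsilon_end at the last. The source does not specify the decay shape.
    """
    if not 0 <= epoch < config.epochs:
        raise ValueError("epoch out of range")
    fraction = epoch / (config.epochs - 1)
    ratio = config.epsilon_end / config.epsilon_start
    return config.epsilon_start * ratio ** fraction


config = TrainingConfig()
for epoch in (0, 250, 500, 750, 999):
    print(f"epoch {epoch:4d}: epsilon = {epsilon_at(epoch, config):.4f}")
```

A linear schedule between the same endpoints would be an equally valid reading of Table 6; the geometric form is used here only because it is a common choice for epsilon-greedy exploration.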

Table 7. Detailed time analysis for each excavator during the simulation run, including loading, hauling, dumping, return, and total cycle time (s).

Simulation Run	Excavator	Loading Time (s)	Hauling Time (s)	Dumping Time (s)	Return Time (s)	Total Time (s)
1	Excavator 1	122	409	68	379	978
1	Excavator 2	135	362	68	315	880
1	Excavator 3	122	376	68	339	905
1	Excavator 4	109	373	68	344	894
1	Excavator 5	122	371	68	348	909
1	Excavator 6	109	396	68	355	928
1	Excavator 7	135	397	68	357	957
1	Excavator 8	122	373	68	309	872
1	Excavator 9	109	371	68	357	905
2	Excavator 1	108	385	62	325	880
…	…	…	…	…	…	…

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
