This section describes the end-to-end workflow used to develop and evaluate the proposed dispatch approach. The workflow couples a discrete-event simulation digital twin of the Sungun haulage system with a deep reinforcement learning policy implemented as a DANN, enabling learning through repeated simulator interaction and fast inference during evaluation runs [13].
2.1. Theoretical Framework and RL Formulation
Dispatch control is central to open-pit mining operations, particularly for truck allocation and equipment scheduling. Many approaches can be viewed as decision agents that map the observed system state to an action intended to improve operational performance. Broadly, agents can be characterised as search-based or experience-based [16]. Search-based agents evaluate candidate actions explicitly, which can be computationally intensive and can limit real-time use. Experience-based agents learn a direct state-to-action mapping from prior experience and can execute decisions rapidly once trained [16,17]. This study focuses on experience-based approaches by comparing rule-of-thumb baselines with a DRL dispatch policy implemented as a DANN. As shown in Figure 1, the state observed at each dispatch decision point is mapped to a dispatch action, and the resulting state transition provides feedback for learning.
Figure 2 highlights the long-horizon nature of dispatch control: different policies can lead to markedly different cumulative performance trajectories as the system evolves. This motivates using the digital twin to simulate alternative futures and to compare policies on aggregate performance rather than on immediate local effects alone.
Developing a DRL model for truck allocation is challenging because the best action is not known a priori and must be learned from interaction with the operating environment. The agent therefore learns through simulated trial and error in the digital-twin training environment, where candidate decisions are explored and their longer-term consequences are observed in the simulator [7,16]. In this study, learning proceeds by iteratively updating the policy using reward feedback from simulated experience, balancing exploration of alternative dispatch actions with exploitation of the best-performing behaviours learned so far [7]. The digital twin provides a cost-effective, risk-free testbed to train and evaluate policies under stochastic operating conditions over long time horizons [16].
Formally, dispatch is modelled as a Markov decision process (MDP) defined over the simulator dynamics [13]. Decision points occur when a truck becomes available and a new destination allocation decision is required (e.g., after dumping, when the truck is empty and requests a new assignment). The environment transition dynamics are generated by the digital twin, which advances the system through loading, hauling, dumping, and return events with stochastic travel and service times [5]. The following are the MDP elements used in this study:
State, s, captures the current fleet status and operational context [13]. It includes high-level indicators of (i) truck status/phase (e.g., at a loading point, hauling, at the dump, or returning), (ii) truck load status (e.g., empty, partially loaded, or full), and (iii) remaining ore to be mined. For the Sungun case study, the implemented state vector is expanded to include operational inputs such as truck and excavator status, travel times, excavator cycle times, truck loading and unloading times, queue lengths, and production targets and constraints [13]. The complete set of implemented state features is summarised in the Supplementary Materials (Table S1).
Action, a, corresponds to the truck-allocation decision [13]. For the Sungun case study, the action space is defined as selecting the next excavator/haulage route for an available (idle) truck (i.e., one discrete action per excavator/route), while loading and unloading are executed by the simulator once the dispatch destination is chosen [13].
Reward, r, incentivises shorter truck cycles [13]. In this study, reward is defined as:
$$ r = T_{\max} - T_{\text{cycle}} $$
where $T_{\text{cycle}}$ is the realised cycle time and $T_{\max}$ is a reference upper bound used to shift rewards upward (so they remain typically positive). Allocations that reduce realised cycle time therefore receive a higher reward [13]. The learning procedure reinforces successful actions by increasing the network output associated with a selected action when it yields better outcomes, which strengthens the policy's preference for that action [13].
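As a minimal sketch of the reward described above, the snippet below assumes the simple shifted form r = Tmax − Tcycle implied by the text. The function name `cycle_reward` and the numerical values (a 60-minute reference bound and two example cycle times) are hypothetical and chosen only for illustration.

```python
def cycle_reward(t_cycle: float, t_max: float) -> float:
    """Reward for one completed truck cycle, assuming r = T_max - T_cycle."""
    return t_max - t_cycle


# Hypothetical values in minutes; T_MAX = 60 is illustrative, not a Sungun setting.
T_MAX = 60.0
fast_reward = cycle_reward(32.0, T_MAX)   # shorter cycle
slow_reward = cycle_reward(47.5, T_MAX)   # longer cycle
print(fast_reward, slow_reward)
```

Under this assumed form, a shorter realised cycle always earns the larger reward, and the reward stays positive as long as the cycle time remains below the reference bound.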
Reward length. Rewards are accumulated over a chosen horizon (measured in truck cycles) to balance feedback frequency against reward-signal stability [13]. Reward length is treated as a tuneable hyperparameter and was varied from 2 to 1200 cycles during sensitivity testing to identify a stable, repeatable setting for the comparative experiments.
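The effect of the reward length can be illustrated with a short sketch. It assumes that per-cycle rewards are summed over consecutive, non-overlapping windows of `reward_length` cycles and that an incomplete trailing window is dropped; both choices, and the function name `accumulate_rewards`, are assumptions made for illustration rather than details taken from the study.

```python
def accumulate_rewards(cycle_rewards, reward_length):
    """Sum per-cycle rewards over consecutive windows of `reward_length` cycles.

    An incomplete trailing window is dropped (an assumption of this sketch).
    """
    if reward_length < 1:
        raise ValueError("reward_length must be at least 1")
    last_start = len(cycle_rewards) - reward_length
    return [
        sum(cycle_rewards[i:i + reward_length])
        for i in range(0, last_start + 1, reward_length)
    ]


# Five hypothetical per-cycle rewards grouped with a reward length of 2.
windowed = accumulate_rewards([1.0, 2.0, 3.0, 4.0, 5.0], 2)
print(windowed)
```

A short window gives frequent but noisy feedback, while a long window smooths the signal at the cost of fewer learning updates per simulated shift.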
Policy network (DANN). The dispatch policy is represented by a DANN with a fully connected feedforward structure, ReLU activations, and a sigmoid output layer that maps action scores to the range 0 to 1 [13].
Figure 3 illustrates the network structure used to approximate the dispatch policy, showing how the state feature groups are mapped through the hidden layers to a normalised output score for each candidate excavator/route (one output per destination). The network depth and nodes per hidden layer are treated as tuneable design settings and are selected through the hyperparameter optimisation described later. During training, actions are selected stochastically from the network outputs to generate diverse experience, with an exploration-rate schedule ε decaying from 1.0 to 0.01 to gradually reduce randomness as learning progresses. Key training settings include discount factor γ = 0.99 and experience replay with mini-batch updates to stabilise learning.
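The network structure and exploration schedule can be sketched in plain Python. The sketch assumes an epsilon-greedy selection rule and an exponential decay shape for ε, neither of which is specified in the text beyond the 1.0 to 0.01 range; the layer sizes, random weights, and function names are hypothetical. The study itself uses PyTorch, which is not needed for this illustration.

```python
import math
import random


def forward(state, layers):
    """Fully connected pass: ReLU on hidden layers, sigmoid on the output layer."""
    h = list(state)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if i == len(layers) - 1:
            h = [1.0 / (1.0 + math.exp(-v)) for v in z]  # sigmoid: scores in (0, 1)
        else:
            h = [max(0.0, v) for v in z]                 # ReLU
    return h


def random_layer(n_in, n_out, rng):
    """Randomly initialised layer (illustrative, untrained weights)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.01):
    """Exponential decay from eps_start to eps_end (decay shape is an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start * (eps_end / eps_start) ** frac


def select_action(scores, epsilon, rng):
    """Epsilon-greedy choice: random route with probability epsilon, else best score."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(99)
layers = [random_layer(6, 8, rng), random_layer(8, 9, rng)]  # 9 outputs: one per route
scores = forward([0.0, 1.0, 0.0, 0.0, 0.4, 0.5], layers)
action = select_action(scores, epsilon_at(500, 1000), rng)
```

With this decay shape, ε falls to 0.1 halfway through the schedule, so most mid-training decisions already follow the highest-scoring route.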
2.2. DRL Training Approach and DES Integration
This study trains the dispatch policy by repeatedly interacting with the Sungun digital twin. At each dispatch decision point, the agent observes the current system state, selects a truck-allocation action, and receives a reward computed from simulated operational outcomes. The resulting experience is then used to update the policy to maximise expected cumulative reward [7,13]. The policy is implemented as a DANN to support fast inference once trained, enabling real-time dispatch decisions during evaluation runs [13].
Figure 4 summarises the end-to-end workflow used in this study, from data collection and pre-processing through feature engineering, iterative training and validation, and performance evaluation used to assess readiness for deployment.
Figure 5 summarises the underlying agent–environment interaction loop used during training and evaluation (state observation, action selection, reward feedback, and policy update).
The digital twin is implemented in Arena and parameterised using published Sungun operational data, following Azadi et al. [11]. It captures the main haulage-cycle logic, stochastic travel and service times, and operational constraints needed to generate realistic state transitions and reward signals for training and evaluation [13]. Input data (e.g., fleet configuration, travel times, loading and dumping times, and operational targets) are pre-processed and integrated into the digital twin and learning environment so that training and evaluation are based on consistent, case-specific operating conditions [11,13].
The learning components are implemented in Python 3.12.6 using PyTorch 2.4.1+cu118 for the DANN policy network and optimisation, while mine-operation dynamics are executed in the Arena-based digital twin. A custom interface links the learning code to the simulator to (i) advance the digital twin under a candidate policy, (ii) extract the next state, (iii) compute rewards from realised cycle times, and (iv) assemble state–action–reward training patterns for updating the DANN [13].
Figure 6 provides a software-level view of the coupling between the Arena-based digital twin and the learning agent, highlighting the modules responsible for state construction, action enumeration, reward calculation, and simulator execution.
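The four interface steps can be sketched with a toy stand-in for the simulator. The class `ToyTwin` below is entirely hypothetical and does not reflect the Arena model or the actual interface; its queue logic, cycle-time parameters, and the reward form r = Tmax − Tcycle are assumptions used only to show how state–action–reward patterns might be assembled.

```python
import random


class ToyTwin:
    """Hypothetical stand-in for the simulator; NOT the Arena interface."""

    def __init__(self, n_routes=3, seed=99):
        self.rng = random.Random(seed)
        self.n_routes = n_routes
        self.queues = [0] * n_routes

    def state(self):
        return tuple(self.queues)

    def step(self, route):
        """Send the available truck to `route`; return a realised cycle time (min)."""
        base = self.rng.triangular(20.0, 40.0, 28.0)   # (low, high, mode)
        t_cycle = base + 3.0 * self.queues[route]      # crude queueing delay
        self.queues[route] += 1
        served = self.rng.randrange(self.n_routes)     # one queue is served at random
        self.queues[served] = max(0, self.queues[served] - 1)
        return t_cycle


def collect_patterns(twin, policy, n_decisions, t_max=60.0):
    """Run the twin under `policy` and assemble (s, a, r, s_next) patterns."""
    patterns = []
    for _ in range(n_decisions):
        s = twin.state()
        a = policy(s)
        t_cycle = twin.step(a)      # (i) advance the twin under the policy
        s_next = twin.state()       # (ii) extract the next state
        r = t_max - t_cycle         # (iii) reward from realised cycle time (assumed form)
        patterns.append((s, a, r, s_next))  # (iv) assemble the training pattern
    return patterns


def shortest_queue(state):
    """Simple rule-of-thumb policy used here only to drive the toy simulator."""
    return min(range(len(state)), key=state.__getitem__)


patterns = collect_patterns(ToyTwin(), shortest_queue, 50)
```

Keeping the policy as a plain callable means the same collection routine could be driven by a rule-of-thumb baseline or by a trained network without changing the loop.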
Figure 7 summarises the two-phase DRL–DANN development cycle used in this study (adapted from Ref. [16]). In Phase I, the Arena-based digital twin is run under exploratory dispatching to generate state–action–reward training patterns over successive truck-cycle decision points; in Phase II, these patterns are used to train/update the DANN policy. The updated policy is then evaluated, and the simulation is restarted, repeating the two phases until performance stabilises. Training proceeds iteratively across RL cycles. In exploration mode, actions are selected stochastically from the policy outputs to generate diverse state–action–reward trajectories. In implementation mode, the learned policy is applied deterministically (e.g., selecting the highest-scoring action) to evaluate performance in non-training runs. These evaluations are run in the digital twin over long simulation horizons (thousands of truck cycles) [13]. A pseudocode summary of the training loop is provided in the Supplementary Materials (Algorithm S1).
To support repeatable experiments, both the simulator stochastic seed and the neural-network initialisation seed are fixed (both set to 99) [13]. Unless otherwise stated, training uses 1000 optimisation epochs, a replay buffer of 10,000 transitions, and periodic target-network updates every 1000 training steps (dispatch decision points) [13]. All remaining learning and network-design settings are specified in Section 2.5.
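The seed, replay-buffer, and target-update settings above can be sketched as follows. The buffer capacity, update interval, and seed value come from the text; the placeholder transitions, the mini-batch size of 32, and the total of 12,000 steps are hypothetical. In a PyTorch implementation the network seed would typically be fixed with `torch.manual_seed(99)`.

```python
import random
from collections import deque

SEED = 99                      # fixed seed, as stated in the text
BUFFER_CAPACITY = 10_000       # replay buffer size, as stated in the text
TARGET_UPDATE_EVERY = 1_000    # training steps between target-network updates

rng = random.Random(SEED)
replay = deque(maxlen=BUFFER_CAPACITY)   # oldest transitions are discarded first
target_updates = 0

for step in range(1, 12_001):            # 12,000 steps is an illustrative run length
    replay.append((step, rng.random()))  # placeholder for a (s, a, r, s_next) tuple
    if step % TARGET_UPDATE_EVERY == 0:
        target_updates += 1              # the target network would be synchronised here

# Sample a mini-batch uniformly at random (batch size 32 is hypothetical).
batch = rng.sample(list(replay), 32)
```

Because the buffer is bounded, only the most recent 10,000 transitions remain available for sampling once the run exceeds that length.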
Prior to training, continuous inputs (e.g., locations, times, queue lengths, and production targets) are normalised, while categorical status variables are one-hot encoded. Derived features include distance measures (truck-to-excavator and excavator-to-truck), average travel-time indicators, excavator efficiency measures, and average queue-length summaries to better capture congestion and service capacity in the state representation [13].
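A minimal sketch of this pre-processing step is given below. It assumes min-max scaling for the continuous inputs (the text states only that they are normalised) and uses hypothetical feature names, category labels, and value ranges.

```python
TRUCK_PHASES = ("loading", "hauling", "dumping", "returning")  # illustrative labels


def min_max(value, lo, hi):
    """Scale `value` into [0, 1] given its expected range (min-max is an assumption)."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def one_hot(value, categories):
    """Encode a categorical status as a 0/1 vector with a single active entry."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode(truck_phase, queue_len, travel_time_min):
    """Build a small feature vector; ranges (0-10 trucks, 5-25 min) are hypothetical."""
    return one_hot(truck_phase, TRUCK_PHASES) + [
        min_max(queue_len, 0, 10),
        min_max(travel_time_min, 5.0, 25.0),
    ]


features = encode("hauling", 4, 15.0)
print(features)  # [0.0, 1.0, 0.0, 0.0, 0.4, 0.5]
```

Scaling all inputs to a common range keeps features with large raw magnitudes, such as travel times, from dominating the first network layer.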
2.3. Case Study and Operation Simulator (Sungun Mine)
The digital-twin simulator, inspired by prior work on simulation-based decision-agent control in construction manufacturing [16], captures the key truck-allocation dynamics of the Sungun open-pit mining system. The model represents a fleet of 40 trucks serving nine excavators over multiple haulage routes and includes stochastic travel, loading, dumping, and return times, as well as truck breakdowns [11,13].
Figure 8 provides a schematic of the open-pit haulage system represented in the digital twin, highlighting the dispatching decision point and the routing of trucks among the nine excavators, associated queues, haulage routes, and dumping operations [13]. The simulator is executed over a week-long operating horizon in each evaluation run to evaluate operational efficiency and cost-related performance measures. Model inputs are summarised in Tables 1–4. The simulator was parameterised using published Sungun case-study sources; the loading-point characteristics, including bench level, material type, loading device model, production rates, and ore grade, were compiled from Ref. [4], while Tables 1–4 report the triangular process-time distributions used for loading, hauling, returning, and dumping in the digital twin.
The digital twin is parameterised using published Sungun case-study sources rather than a raw historical event log used directly for policy training. Accordingly, the DRL policy is trained on experience generated within the DES environment, and issues such as missing or erroneous training records do not arise in the usual data-driven sense. Instead, the main data-related considerations concern the provenance, adequacy, and representativeness of the published case-study inputs and the modelling assumptions used to translate them into the simulator.
The Sungun copper mine is located in East Azerbaijan, Iran, and serves as the case study used to parameterise and evaluate the dispatch policy in a realistic open-pit haulage setting [12,13]. The operation is characterised by continuous production and variable operating conditions (e.g., mountainous terrain and weather-related disruptions), which motivates modelling stochastic travel and service processes within the digital-twin simulator [12,13].
The digital twin represents a fleet configuration of 40 haul trucks (30-ton capacity) serving nine excavators with heterogeneous loading characteristics, yielding nine corresponding haulage routes (one per excavator) to a dumping area modelled with effectively unconstrained capacity [13]. Experiments are executed over a week-long horizon (7 days), using 20 independent replications and a daily shift length of 990 min to capture performance under stochastic variability and support statistically comparable policy evaluation [13].
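The experimental design above implies 7 × 990 = 6930 simulated minutes per replication. The sketch below shows one way results from the 20 replications might be aggregated into a mean and a 95% confidence interval. The function `run_replication` is a hypothetical placeholder that returns a synthetic KPI value, and the per-replication seeding scheme is an assumption.

```python
import random
import statistics

N_REPLICATIONS = 20            # from the text
DAYS = 7                       # from the text
SHIFT_MIN = 990                # daily shift length in minutes, from the text
HORIZON_MIN = DAYS * SHIFT_MIN # simulated minutes per replication


def run_replication(seed):
    """Placeholder for one simulator run; returns a synthetic KPI (e.g. cycles)."""
    rng = random.Random(seed)
    return rng.gauss(5000.0, 150.0)   # hypothetical values, not Sungun results


# Assumed seeding scheme: one distinct seed per replication.
results = [run_replication(99 + i) for i in range(N_REPLICATIONS)]

mean_kpi = statistics.mean(results)
# 2.093 is the two-sided 95% Student-t critical value for 19 degrees of freedom.
half_width = 2.093 * statistics.stdev(results) / N_REPLICATIONS ** 0.5
print(f"{HORIZON_MIN} min horizon: {mean_kpi:.1f} +/- {half_width:.1f}")
```

Reporting the interval half-width alongside the mean makes it easier to judge whether a difference between two policies exceeds replication-to-replication noise.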
Model inputs are compiled from published Sungun sources and case-study data reported in prior work, including fleet specifications, production rates, route distances, and operational disruption factors [11,12,13]. These inputs are processed to parameterise stochastic loading, hauling, dumping, return, and breakdown/maintenance processes within the simulator; triangular distributions are used to represent process-time variability in a computationally efficient replicated simulation setting [4,13].
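Sampling from triangular process-time distributions can be sketched with Python's standard library. The (minimum, mode, maximum) values below are hypothetical placeholders and are not the Sungun parameters reported in the tables; note that `random.triangular` takes its arguments in the order low, high, mode.

```python
import random

# Hypothetical (min, mode, max) process times in minutes; NOT the Sungun values.
PROCESS_TIMES = {
    "loading": (2.0, 3.0, 5.0),
    "hauling": (8.0, 10.0, 15.0),
    "dumping": (1.0, 1.5, 3.0),
    "returning": (6.0, 8.0, 12.0),
}


def sample_cycle_time(rng):
    """Draw one truck cycle time as the sum of four triangular process times."""
    return sum(
        rng.triangular(low, high, mode)      # argument order: low, high, mode
        for low, mode, high in PROCESS_TIMES.values()
    )


rng = random.Random(99)
samples = [sample_cycle_time(rng) for _ in range(1000)]
mean_cycle = sum(samples) / len(samples)
```

A triangular distribution needs only three easily elicited parameters and has bounded support, so every sampled cycle here falls between the sum of the minima (17 min) and the sum of the maxima (35 min).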
2.4. Control Policies and Baselines
Three dispatch policies are evaluated under identical digital-twin operating conditions to isolate the effect of the control logic on haulage performance. The policies differ only in how an available truck selects its next excavator/route at each dispatch decision point; all subsequent loading, hauling, dumping, and return processes are executed by the simulator using the same stochastic process-time distributions and constraints. The evaluated policies are:
To ensure a fair comparison, every policy is evaluated with the same simulator logic, stochastic input distributions, operating horizon, and replication settings. Policy performance is quantified using the operational measures defined in Table 5 and reported in Section 3.
2.5. Hyperparameter Optimisation
The DANN dispatch policy is trained using deep reinforcement learning within the mining-system digital twin, and policy quality is therefore sensitive to the learning configuration used during training. A structured hyperparameter optimisation was conducted to improve training stability and to establish a repeatable configuration for all subsequent baseline comparisons reported in Section 3.
The optimisation targeted the learning and network-design parameters most directly associated with convergence behaviour and final policy quality: learning rate, mini-batch size, reward length, number of hidden layers, and nodes per hidden layer. These hyperparameters were varied systematically under otherwise identical simulator conditions so that observed differences could be attributed to learning configuration rather than changes in mine-operation dynamics or stochastic inputs.
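One way to vary these five hyperparameters systematically is a one-at-a-time sweep around a baseline configuration, sketched below. The sweep design, the baseline, and all candidate values are assumptions made for illustration, except that the reward-length candidates span the 2 to 1200 cycle range mentioned earlier.

```python
# Hypothetical baseline and candidate values; only the reward-length range
# (2 to 1200 cycles) is taken from the text.
BASELINE = {
    "learning_rate": 1e-3,
    "batch_size": 64,
    "reward_length": 100,
    "hidden_layers": 2,
    "nodes_per_layer": 64,
}
CANDIDATES = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "reward_length": [2, 100, 1200],
    "hidden_layers": [1, 2, 3],
    "nodes_per_layer": [32, 64, 128],
}


def one_at_a_time(baseline, candidates):
    """Return the baseline plus every config that changes exactly one setting."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


configs = one_at_a_time(BASELINE, CANDIDATES)
print(len(configs))  # 11
```

Changing a single setting per run keeps each observed performance difference attributable to one hyperparameter, at the cost of not exploring interactions between settings.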
Candidate configurations were evaluated using the operational performance measures defined in Table 5, which capture both productivity and system inefficiency/bottlenecks and provide a consistent basis for comparing alternative training settings. The selection of the preferred configuration was based on a multi-metric view that prioritised improvements in completed truck cycles and reductions in idle-truck and idle-excavator penalties, while also considering robustness across replications.
To support replication, Table 6 consolidates the DRL–DANN training configuration and the monitoring and seed controls used for the sensitivity experiments and the subsequent comparative runs. The table is organised into two groups. The first group lists the five hyperparameters tuned in this section, in the same order as the panels in Figure 9. The second group lists the remaining training settings that were held constant across all candidate configurations.
Figure 9 summarises the sensitivity results for the tuned hyperparameters, and Table 6 reports the final optimised settings used in the comparative experiments in Section 3.
2.6. Operating-Parameter Sensitivity and Robustness Tests
In addition to the learning-configuration sensitivity analysis in Section 2.5, an operating-parameter sensitivity analysis was conducted to assess the robustness of the trained dispatch policy under plausible variations in mine conditions. The analysis tests whether policy performance remains stable when key physical and operational inputs deviate from the calibrated baseline used to parameterise the digital twin.
The sensitivity design perturbed a defined subset of simulation inputs by ±10% and ±20% relative to their baseline values. The perturbed inputs included representative haul-route travel times, excavator service/cycle times (truck loading), and truck service times (loading and unloading), together with the resulting queueing effects. During these robustness tests, the tuned DRL–DANN training configuration identified in Section 2.5 was held fixed, and all other simulator settings, stochastic inputs, and replication controls were kept consistent with the baseline evaluation settings to support fair comparison across conditions.
The perturbed operating parameters were selected to reflect practical sources of variability in haulage operations, including route travel times and loading and unloading service times, so that robustness could be assessed under plausible deviations from the calibrated baseline.
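The perturbation design can be sketched as follows. The sketch assumes that inputs are perturbed one at a time and uses hypothetical baseline values and parameter names; only the ±10% and ±20% levels come from the text. A helper for expressing a result as a change relative to the baseline is included.

```python
PERTURBATIONS = (-0.20, -0.10, 0.10, 0.20)   # levels stated in the text

# Hypothetical baseline inputs in minutes; NOT the calibrated Sungun values.
BASELINE_INPUTS = {
    "haul_travel_min": 10.0,
    "loading_min": 3.0,
    "unloading_min": 1.5,
}


def perturbed_scenarios(baseline, factors):
    """One scenario per (input, factor) pair, changing a single input each time."""
    scenarios = []
    for name in baseline:
        for factor in factors:
            inputs = dict(baseline)
            inputs[name] = baseline[name] * (1.0 + factor)
            scenarios.append((name, factor, inputs))
    return scenarios


def relative_change(value, baseline_value):
    """Change relative to the baseline condition, as a fraction."""
    return (value - baseline_value) / baseline_value


scenarios = perturbed_scenarios(BASELINE_INPUTS, PERTURBATIONS)
print(len(scenarios))  # 12 scenarios: 3 inputs x 4 perturbation levels
```

Each scenario would then be run with the same replication settings as the baseline, so that `relative_change` compares like with like.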
Performance under each perturbed condition was quantified using the operational measures defined in Table 5. Results are reported as changes relative to the baseline condition using the same aggregation approach applied elsewhere in the evaluation, including aggregation across replications over a consistent operating horizon. This enables robustness to be interpreted alongside the baseline comparisons in Section 3.