1. Introduction
Open-pit mining constitutes one of the pillars of the modern extractive industry and is responsible for a significant share of global production of metallic and non-metallic minerals. Within this context, loading and haulage operations, commonly referred to as material handling, play an important role in the overall performance of a mine and can account for nearly 50% of the total operational costs of the production system [1,2]. These operations involve the continuous coordination of loading shovels, haul trucks, and unloading points such as crushers, waste dumps, or intermediate stockpiles, forming a highly interdependent and dynamic system.
The operational process of an open-pit mine is characterized by its stochastic nature and the persistent presence of internal and external disturbances. Mechanical failures, variations in road conditions, changes in material properties, and fluctuations in equipment availability directly affect cycle times and system efficiency [3,4]. In this setting, the decision regarding the destination to which a truck should be assigned after completing a dumping operation is far from trivial, as it directly impacts both resource utilization and the fulfillment of short-term production plans.
Traditionally, truck dispatching is treated as a sequence of allocation decisions in which each truck is assigned to a loading unit based on current system information. These centralized dispatching systems remain widely used in industrial practice and perform well under relatively stable conditions. However, they typically rely on reactive rules and short-term criteria, which may not fully account for the temporal interdependencies between consecutive assignments. Their ability to anticipate congestion and coordinate resources over time is, therefore, limited, especially in large-scale operations.
To overcome these limitations, truck dispatching can be approached from a scheduling perspective, where decisions are coordinated over time rather than treated as isolated assignments. In this context, Multi-Agent Systems (MAS) provide a decentralized framework in which trucks and shovels act as autonomous agents that construct schedules through local negotiation. However, without adaptive mechanisms, such systems may reproduce inefficient coordination patterns. Reinforcement learning (RL) offers a way for agents to improve their decisions based on accumulated experience, potentially enhancing both productivity and responsiveness in dynamic mining environments.
The purpose of this work is to investigate whether incorporating reinforcement learning into truck agents can improve operational scheduling performance while maintaining computation times compatible with dynamic mining operations. The proposed approach combines decentralized coordination with agent-level learning, allowing trucks to adapt their decisions based on previous outcomes.
The main contributions of this work are threefold. First, a decentralized MAS for truck dispatching is proposed, where trucks, shovels, and unloading points are modeled as interacting agents. Second, reinforcement learning is incorporated into truck agents to guide their decision-making during coordination processes. Third, a controlled comparison between systems with and without learning is conducted across scenarios of different operational scales.
The remainder of this article is organized as follows. Section 2 reviews related work on mining scheduling and the application of reinforcement learning in this domain. Section 3 describes the problem formulation and the proposed methodology. Section 4 presents the experimental environment and the evaluation metrics. The results are reported in Section 5, followed by a discussion in Section 6. Finally, Section 7 presents the conclusions and outlines directions for future research.
2. Related Work
2.1. Truck Dispatching in Open-Pit Mining
Truck dispatching in open-pit mining has been extensively studied as a core operational problem with the dual objective of improving material movement efficiency and reducing operating costs. Many contributions adopt centralized decision-making strategies based on operations research, simulation modeling, or heuristic procedures, emphasizing the challenges of real-time coordination under uncertainty [1].
Recent work continues to explore simulation-based frameworks to support dispatching decisions. For example, ref. [5] proposed a simulation-optimization approach that partitions truck and shovel assignments across multiple pits, explicitly modeling interaction effects and demonstrating improved productivity through fleet partitioning. Ref. [6] introduced an integrated simulation and optimization framework that also accounts for anthropogenic greenhouse gas emissions.
Beyond simulation, operations research continues to contribute algorithmic models for dispatching. Ref. [7] proposed a bi-objective mathematical model that incorporates the minimization of carbon emissions into the allocation optimization model. Additionally, ref. [8] presented a chance-constrained goal programming model based on four goals to estimate the impact of uncertainty on the efficiency of truck-shovel systems.
Heuristic and metaheuristic approaches have been explored for truck dispatching in open-pit mines because they can deliver high-quality decisions with low computational overhead. A representative metaheuristic line uses genetic algorithms (GA) to evolve dispatching policies; for example, ref. [9] proposed a GA that evolves cyclic finite automata for mine truck dispatching and reported improved shovel utilization and reduced contention compared with commonly used greedy heuristics and linear programming. Another metaheuristic applied to this problem is tabu search: ref. [10] compares a tabu search procedure against a multi-agent dispatching approach on scenarios derived from a Chilean open-pit mine and shows that both methods can generate feasible dispatching schedules, while also discussing efficiency differences across solutions. Overall, these works support the view that metaheuristics can provide practical dispatching solutions under realistic constraints, although their performance and scalability depend on instance size, congestion effects, and how operational constraints are encoded.
While these approaches contribute valuable insights into dispatching strategies, most still treat decisions as short-term assignments and do not fully account for temporal dependencies across multiple dispatch cycles. This limitation motivates the interpretation of truck dispatching as a scheduling problem, where decisions must be coordinated over time to synchronize truck and shovel activities. Within this perspective, only a few studies have proposed scheduling-oriented methods that generate and update sequences of assignments, aiming for improved operational coherence. For instance, ref. [11] applied a hybrid dispatching and metaheuristic scheduling approach with promising results, and ref. [12] examined coordinated scheduling strategies to reduce idle times and queuing effects. However, the computational intensity of these scheduling methods highlights the need for frameworks that can approximate schedule quality without incurring excessive overhead.
Unlike reactive dispatching rules, scheduling formulations explicitly model temporal dependencies between successive truck assignments, allowing the synchronization of loading and haulage activities over multiple cycles. This distinction becomes particularly relevant in large-scale operations, where local allocation decisions may generate cascading congestion effects over time.
2.2. Multi-Agent Systems for Scheduling in Open-Pit Mining
Multi-agent systems (MAS) have been used in mining to support planning, coordination, and the simulation of complex systems, mainly at strategic and tactical levels. Prior studies have applied MAS to production planning, resource management, autonomous equipment coordination, and policy analysis, without focusing on real-time operational allocation [13,14].
A more focused line of research applies MAS directly to the truck-shovel fleet in open-pit mining, modeling trucks and shovels as autonomous agents that coordinate through distributed mechanisms without learning capabilities [15]. That study reports that MAS-based approaches enhance flexibility, scalability, and robustness to operational disturbances while avoiding centralized control, making them a viable alternative to traditional allocation-based dispatching methods.
Previous work by the authors evaluated the non-learning version of the proposed multi-agent system against classical optimization-based approaches, including tabu search [10] and mathematical programming formulations [16], showing competitive performance in schedule generation quality and computational feasibility. The present study builds upon that validated baseline and focuses specifically on analyzing the incremental impact of incorporating reinforcement learning into the same MAS architecture.
2.3. Reinforcement Learning in Mining
Reinforcement learning (RL) has gained increasing attention in mining as a suitable approach for sequential decision-making under uncertainty in dynamic production systems. Unlike rule-based methods, RL enables agents to adapt their behavior through interaction with the environment by maximizing cumulative rewards [17]. Early applications focused on planning and operational control under uncertainty, highlighting the benefits of adaptive approaches over static optimization [18].
More recently, RL has been applied to dynamic dispatching and resource allocation in open-pit mining, showing that learned policies can outperform heuristic strategies in terms of productivity and equipment utilization [19,20]. However, many of these approaches rely on centralized formulations or single decision-making agents, limiting scalability and integration into distributed systems. Additionally, computational challenges related to state-space size, reward design, and learning convergence remain significant barriers in large-scale operational settings [21], motivating the exploration of hybrid and distributed learning architectures.
2.4. Contributions of This Paper
This paper makes the following contributions:
It integrates reinforcement learning at the agent level, enabling trucks to decide whether to participate in a negotiation process.
It provides a controlled comparison between a MAS with learning and an equivalent MAS without learning under identical scenarios and evaluation metrics.
It analyzes both operational performance and computational cost, explicitly characterizing the trade-off between material transported per hour and schedule generation time.
It presents experimental results across scenarios of different scales, offering empirical evidence on the applicability and limitations of reinforcement learning in multi-agent mining systems.
3. Methodology
3.1. Problem Statement
In open-pit mining operations, the haulage and loading process is organized around a repetitive operational cycle in which each truck, once it completes a dumping operation at a destination point, requests a new loading destination to continue its operation. This cycle generally includes the travel of the empty truck to a shovel, the loading operation, the transportation of material to a dumping location, and the subsequent unloading process, after which the truck becomes available for a new assignment.
Figure 1 schematically illustrates this cycle of activities performed by a truck during operations.
Traditionally, the determination of loading destinations has been addressed through centralized dispatching systems that assign trucks to shovels using global system information and a point-wise allocation approach. Under this paradigm, dispatching is conceived as a sequence of reactive decisions in which each truck is individually assigned to a loading resource based on criteria such as queue lengths, estimated production rates, or predefined heuristic rules [1,2]. Each decision is made at the end of a haulage cycle, without explicitly considering the cumulative impact of successive assignments on overall system performance.
Although centralized allocation-based approaches have been widely adopted in industrial practice, numerous studies have documented operational inefficiencies associated with their application in dynamic environments. In particular, the inherent variability of cycle times, road congestion, and unforeseen events generates imbalances that are difficult to anticipate from a centralized perspective. As a result, it is common to observe situations in which trucks form long queues waiting to be loaded at certain shovels, while other shovels remain idle or underutilized due to unbalanced resource allocation [4,11].
These limitations have motivated several authors to reconsider the problem from a scheduling perspective, in which decisions are not restricted to individual assignments but instead aim to organize and coordinate haulage and loading activities over time, explicitly accounting for the interaction among multiple consecutive operational cycles. Unlike allocation-based approaches, scheduling enables the capture of temporal dependencies, the anticipation of congestion, and the evaluation of the cumulative impact of decisions on resource utilization and overall material flow [11,12,16].
3.2. Multi-Agent System for Dynamic Scheduling
The proposed approach models the material handling process as a Multi-Agent System, in which physical and operational entities are represented by autonomous agents. Rather than computing a global schedule in a centralized manner, schedules emerge from local interactions among agents, enabling a distributed and scalable scheduling process that is well suited to the dynamic nature of the operational environment. The proposed MAS-TDLR framework is conceived to support the fulfillment of production objectives while reducing operational effort in open-pit mining operations.
3.2.1. Agents
Within the MAS-TDLR, mining equipment is modeled as a set of interacting agents, each responsible for specific operational decisions. TruckAgents are responsible for managing the execution of haulage tasks and constructing feasible operational schedules that balance productivity and cost. In doing so, they account for vehicle-specific characteristics, including payload capacity, travel speeds under loaded and empty conditions, spotting times, and unloading durations. The mine layout is explicitly considered to support route selection, and truck agents participate in negotiation processes to determine suitable assignments.
ShovelAgents represent loading equipment and are tasked with organizing their loading activities in accordance with production targets. Their decision-making process incorporates operational constraints such as shovel capacity, digging and loading rates, and the predefined destinations associated with the extracted material.
UnloadingPointAgents model facilities where material is deposited, such as crushers, stockpiles, and waste dumps. These agents regulate unloading operations by accounting for local capacity constraints, particularly the number of trucks that can be serviced simultaneously, thereby contributing to a smooth and balanced material flow.
Through the coordinated interaction of these agent types, the MAS-TDLR promotes improved synchronization among mining resources, more effective allocation of equipment, and enhanced overall efficiency of the material handling system.
3.2.2. Coordination Mechanism
Schedule generation is achieved through a distributed negotiation process based on an extended version of the Contract Net Protocol (CNP) [22], following the ideas introduced in [15]. Within this framework, shovelAgents assume the role of negotiation initiators, while truckAgents participate as bidders. Since multiple shovelAgents can initiate negotiations simultaneously, the protocol is designed to support concurrent negotiation processes.
To enable this behavior, the classical CNP is modified with an explicit confirmation stage. Each negotiation starts when a shovelAgent broadcasts a call-for-proposal (CFP) to all truckAgents, specifying the time window during which it is available for loading. Upon receiving a CFP, a truckAgent evaluates its current schedule and, if feasible, replies with a proposal containing its estimated arrival time and the associated operational cost. If the request cannot be accommodated, the truckAgent issues a refusal. During the proposal collection phase, the shovelAgent stores all incoming responses. This phase continues until either a predefined timeout is reached or responses have been received from all truckAgents. If no valid proposals are available, the negotiation terminates without assignment. Otherwise, the shovelAgent selects the most favorable proposal and sends a confirmation request to the corresponding truckAgent.
The confirmation phase may result in acceptance, rejection, or no response within the allotted time. When a confirmation is accepted, the shovelAgent finalizes the assignment and notifies the remaining truckAgents with rejection messages. If the confirmation is refused or times out, the shovelAgent discards the proposal, selects the next best alternative, and repeats the confirmation step. The process ends without an agreement if no proposals remain. This extended negotiation mechanism enables multiple simultaneous negotiations while preserving coordination consistency, thereby improving scheduling performance. The overall interaction flow is illustrated in Figure 2, and an example of a resulting truck schedule is presented in Table 1.
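The shovel-side flow of the extended protocol can be sketched in a few lines of Python. This is an illustrative sketch only, not the JADE implementation used in the paper: the names `Proposal`, `run_negotiation`, `confirm`, and `score` are hypothetical, message passing is replaced by plain function calls, and the scoring function is a placeholder for the shovel's utility.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    truck_id: str
    arrival_time: float  # estimated arrival at the shovel (illustrative unit: minutes)
    cost: float          # operational cost reported by the truck


def run_negotiation(
    responses: list[Optional[Proposal]],
    confirm: Callable[[Proposal], Optional[bool]],
    score: Callable[[Proposal], float],
) -> Optional[Proposal]:
    """Shovel-side view of one negotiation round (hypothetical helper).

    responses: one entry per truck; None stands for a refusal or a timeout.
    confirm:   asks a truck to confirm; returns True (accept), False (reject),
               or None (no answer within the allotted time).
    score:     lower is better; placeholder for the shovel's utility.
    """
    proposals = sorted((p for p in responses if p is not None), key=score)
    for candidate in proposals:         # try the most favorable proposal first
        if confirm(candidate) is True:  # rejection and timeout both fall through
            return candidate            # assignment finalized
    return None                         # no proposals left: no agreement


# Example: T3 arrives first but declines the confirmation, so T1 is assigned.
responses = [Proposal("T1", 12.0, 30.0), None, Proposal("T3", 9.0, 35.0)]
winner = run_negotiation(
    responses,
    confirm=lambda p: p.truck_id != "T3",
    score=lambda p: p.arrival_time,
)
```

The loop over sorted proposals mirrors the "select the next best alternative and repeat the confirmation step" behavior described above; concurrency between shovels is deliberately left out of the sketch.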
3.2.3. Agent Decision-Making
ShovelAgents are responsible for assessing the proposals they receive and selecting the option that best satisfies their operational objectives. This selection process is driven by a utility function adapted from [23], which favors assignments that minimize shovel idle time while simultaneously reducing the overall cost associated with truck operations.
TruckAgents, in turn, face a two-stage decision-making process. First, upon receiving a call-for-proposal from a shovelAgent, a truckAgent must decide whether to participate in the negotiation. This decision is based on an evaluation of its current schedule to determine whether a compatible loading time slot is available. When such a slot exists, the truckAgent estimates the total duration of the required activities and verifies whether the task can be feasibly integrated into its schedule. If these conditions are not met, the truckAgent declines the request; otherwise, it submits a proposal. The second decision concerns the confirmation of a previously submitted proposal. At this stage, the truckAgent considers the expected idle time of the shovel, as communicated in the request for confirmation, together with its involvement in other ongoing negotiations. If the reported idle time exceeds a predefined threshold of one minute, the truckAgent confirms its participation and commits to the assignment. Conversely, if the idle time is shorter, the agent evaluates alternative negotiations that may offer a lower operational cost. When a more advantageous option is available, the agent withdraws from the current negotiation; otherwise, it confirms the original proposal.
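The second-stage (confirmation) rule can be written as a small predicate. The sketch below is a hedged reading of the description above: the function name, its parameters, and the representation of competing negotiations as a list of costs are assumptions made for illustration.

```python
IDLE_THRESHOLD_MIN = 1.0  # one-minute threshold stated in the text


def confirm_proposal(
    shovel_idle_min: float,
    own_cost: float,
    other_costs: list[float],
) -> bool:
    """Return True to confirm the proposal, False to withdraw (hypothetical helper).

    shovel_idle_min: expected shovel idle time reported in the confirmation request.
    own_cost:        operational cost of the proposal under confirmation.
    other_costs:     costs of the truck's other ongoing negotiations.
    """
    if shovel_idle_min > IDLE_THRESHOLD_MIN:
        return True  # idle time above the threshold: commit to the assignment
    # Idle time at or below the threshold: withdraw only if a cheaper option exists.
    return not any(cost < own_cost for cost in other_costs)
```

For instance, `confirm_proposal(0.5, 50.0, [40.0])` returns `False` because a cheaper negotiation is pending, whereas `confirm_proposal(2.0, 50.0, [40.0])` returns `True`.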
3.3. Reinforcement Learning
Each truckAgent in the proposed system integrates a reinforcement learning module to enhance its decision-making process during negotiation and shovel assignment. This module is introduced to address a limitation observed in the baseline system, in which truckAgents repeatedly submit proposals to all available shovels without considering the outcomes of previous negotiations, leading to unnecessary communication overhead and inefficient scheduling decisions. By incorporating reinforcement learning, each truckAgent is able to learn from its own interaction history and progressively adapt its behavior based on prior acceptances and rejections. As a result, the truckAgent develops an adaptive strategy that allows it to identify which shovels are more likely to lead to successful assignments and which should be avoided. This learning mechanism improves negotiation efficiency while preserving the decentralized nature of the multi-agent system and without requiring modifications to the underlying communication protocol.
3.3.1. State Representation and Actions
The learning mechanism adopted by the truckAgents resembles the behavior of a robot navigating a maze, progressively discovering feasible paths toward a final goal by avoiding blocked or unsuccessful transitions based on prior experience. For the truck dispatching case, the learning process of a truckAgent is modeled using a shovel matrix, which represents the historical outcomes of previous negotiations between the truckAgent and each shovelAgent, as illustrated in Figure 3. In this matrix, each row corresponds to a specific shovel, and each column represents the outcome of a past negotiation. The values stored in the cells indicate the historical result of these negotiations: a value of “0” denotes a successful negotiation, whereas a value of “X” denotes a failed negotiation. In the proposed approach, cells marked with “X” are modeled as non-traversable states, acting as obstacles within the decision-making process. The final column of each row contains the symbol “F”, which does not correspond to a negotiation outcome but instead represents the arrival at a terminal state. At the beginning of operations, the matrix is initialized with all cells set to “0” and is progressively updated with the actual outcomes of negotiations, blocking those positions associated with rejected proposals. While the previous description provides an intuitive explanation of the shovel matrix, a formal definition is introduced below to clarify its structure and dimensions.
Let $S = \{s_1, \dots, s_m\}$ denote the set of shovel agents and let $N$ be the number of negotiations considered in the learning process. For each truck agent $t$, a shovel matrix $M^{t}$ of size $m \times (N+1)$ is defined, where each row $i \in \{1, \dots, m\}$ corresponds to a shovel $s_i$, and each column $j \in \{1, \dots, N\}$ represents a specific negotiation instance considered by the agent. The last column $j = N+1$ contains the terminal marker $F$.
Let $o^{t}_{i,j}$ denote the outcome of the negotiation between the truck $t$ and shovel $s_i$ at negotiation index $j$, where $o^{t}_{i,j} = \text{success}$ represents a successful negotiation and $o^{t}_{i,j} = \text{failure}$ represents an unsuccessful one (e.g., rejection, refusal, or timeout). The value stored in the matrix is defined by the mapping
\[
M^{t}_{i,j} =
\begin{cases}
0, & \text{if } o^{t}_{i,j} = \text{success}, \\
X, & \text{if } o^{t}_{i,j} = \text{failure},
\end{cases}
\qquad j \in \{1, \dots, N\}, \qquad M^{t}_{i,N+1} = F.
\]
Accordingly, the matrix directly encodes the outcome of each considered negotiation in its corresponding column, without temporal shifting or reordering. Cells marked with $X$ are treated as non-traversable states in the learning process, while cells marked with $0$ remain traversable. This formulation makes explicit how each matrix entry in Figure 3 is generated from a negotiation outcome and clarifies the dimensional relation between shovels, negotiations, and matrix indices.
Based on this representation, learning is defined using Q-learning, where the agent’s state is determined by its position within the matrix (i.e., the evaluated shovel and the corresponding point in the negotiation history). From each state, the agent can execute discrete actions that involve changing the shovel (upward and downward movements) and exploring the negotiation history (leftward and rightward movements), provided that the resulting transitions lead to traversable cells. When a truck receives a call-for-proposal (CFP), it uses the learned policy to evaluate the row corresponding to the shovel that issued the CFP. If a feasible path exists that allows the agent to reach the terminal state “F” without crossing obstacles, the truckAgent responds with a propose message. Otherwise, when no valid trajectory toward the terminal state exists due to the presence of blocked cells, the truck responds with a refuse message. This mechanism implements a hard filtering strategy based on the structural feasibility of the negotiation history, preventing the submission of proposals to shovelAgents whose accumulated rejections preclude reaching the terminal state.
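The structural feasibility test behind this hard filter can be illustrated with a breadth-first search over the matrix. This is a stand-in for following the learned Q-policy, not the authors' procedure. It assumes that the evaluation starts in the first column of the row of the shovel that issued the CFP and that reaching any "F" cell counts as success; the function name is hypothetical.

```python
from collections import deque


def can_reach_terminal(matrix: list[list[str]], shovel_row: int, start_col: int = 0) -> bool:
    """Return True if an "F" cell is reachable from (shovel_row, start_col) without entering "X"."""
    rows, cols = len(matrix), len(matrix[0])
    if matrix[shovel_row][start_col] == "X":
        return False
    seen = {(shovel_row, start_col)}
    queue = deque([(shovel_row, start_col)])
    while queue:
        r, c = queue.popleft()
        if matrix[r][c] == "F":
            return True
        # Up/down change the shovel; left/right move through the negotiation history.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and (nr, nc) not in seen
                and matrix[nr][nc] != "X"
            ):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def respond_to_cfp(matrix: list[list[str]], shovel_row: int) -> str:
    """Map the feasibility test to the message type sent back to the shovel."""
    return "propose" if can_reach_terminal(matrix, shovel_row) else "refuse"
```

Under these assumptions, a blocked cell in the evaluated row does not by itself force a refusal, because the agent may detour through a neighboring row; a refusal occurs only when every route to "F" is cut off.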
The simplicity of the matrix-based learning representation is intentional, as the goal is not to model all external factors affecting negotiations but to capture stable patterns of repeated rejection at the local agent level. By treating consistently unsuccessful negotiations as blocked states, the learning mechanism provides a robust filtering strategy that remains effective despite external disturbances beyond the agent’s control.
3.3.2. Q-Learning
Learning is implemented using the Q-learning algorithm [24], which enables the iterative estimation of the expected utility of executing a given action from a specific state. In this context, the Q-value represents the desirability of performing a particular move within the maze, taking into account both the immediate reward and the expected future rewards.
The Q-values are updated according to the following rule:
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]
where $\alpha$ denotes the learning rate, $\gamma$ the discount factor, $r$ the reward obtained after executing the action, and $s'$ the new state reached after the move. Through multiple exploration episodes of the shovel matrix, the truckAgent progressively adjusts its Q-values, gradually learning which movements lead to more favorable decisions.
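A tabular version of this update is short enough to show in full. The sketch below assumes the standard Q-learning form and stores Q-values in a dictionary keyed by `(state, action)`; the function name and the dictionary layout are illustrative choices. The default values of `alpha` and `gamma` are the ones reported for the experiments.

```python
def q_update(q, state, action, reward, next_state, next_actions, alpha=0.01, gamma=0.98):
    """Apply one Q-learning update in place and return the new Q(state, action).

    q:            dict mapping (state, action) -> value; missing entries count as 0.0.
    next_actions: actions that are valid in next_state (empty if next_state is terminal).
    """
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# First visit with an empty table and a reward of 10:
# 0 + 0.01 * (10 + 0.98 * 0 - 0) = 0.1 (up to floating-point rounding)
q = {}
value = q_update(q, state=(0, 0), action="right", reward=10, next_state=(0, 1),
                 next_actions=["left", "right"])
```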
The reward scheme is defined with the objective of guiding the learning process toward feasible decisions that are consistent with the adopted hard filtering strategy. In particular, reaching the terminal state “F” yields a high reward of 100, as it represents a successful decision in the evaluation of a call-for-proposal (CFP). Actions that advance through the negotiation history (rightward movements) receive a reward of 10, encouraging the agent to progress toward the terminal state when feasible trajectories exist. Complementarily, actions that involve changing the evaluated shovel (upward and downward movements) are assigned a reward of 1, allowing the exploration of alternative options without dominating the decision-making process. Actions that move backward through the negotiation history (leftward movements) are also assigned a reward of 10, enabling the reconsideration of previous states along the path. Finally, cells marked with “X” are modeled as non-traversable obstacles; therefore, no reward is associated with these cells, and actions that would lead to them are excluded from the set of valid actions.
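The reward scheme can be summarized as a lookup on the action and the cell it leads to. The sketch assumes that entering "F" yields the terminal reward of 100 instead of the ordinary move reward, and it signals moves into "X" as invalid since the text excludes them from the action set; the function name and action labels are hypothetical.

```python
def move_reward(action: str, target_cell: str) -> int:
    """Reward for a move, given the action label and the symbol of the destination cell."""
    if target_cell == "X":
        raise ValueError("moves into blocked cells are not valid actions")
    if target_cell == "F":
        return 100  # terminal state reached
    if action in ("left", "right"):
        return 10   # movement through the negotiation history
    if action in ("up", "down"):
        return 1    # change of evaluated shovel
    raise ValueError(f"unknown action: {action}")
```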
In the Q-learning algorithm, the learning rate was set to α = 0.01 and the discount factor to γ = 0.98 in all experiments. Action selection followed a fixed ε-greedy policy with ε = 0.05, meaning that the truckAgent selected the action with the highest Q-value with probability 0.95 and a random action with probability 0.05. A learning episode was defined as a complete decision trajectory within the shovel matrix, starting from a randomly selected initial state and ending when the agent reached a shovel position (terminal state). Q-values were updated after each state–action transition using the standard Q-learning update rule. For each configuration, 900 training episodes were executed. Although smaller scenarios tend to stabilize earlier due to their reduced state space, a fixed episode horizon was maintained to ensure methodological consistency.
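The action-selection rule with ε = 0.05 can be sketched as follows. The function name, the dictionary layout of the Q-table, and the tie-breaking behavior (first maximal action in list order) are assumptions of this illustration.

```python
import random


def epsilon_greedy(q, state, valid_actions, epsilon=0.05, rng=random):
    """Pick a random valid action with probability epsilon, otherwise the highest-valued one.

    q:   dict mapping (state, action) -> value; missing entries count as 0.0.
    rng: any object exposing random() and choice(), e.g. random.Random(seed).
    """
    if rng.random() < epsilon:
        return rng.choice(valid_actions)  # exploration
    return max(valid_actions, key=lambda a: q.get((state, a), 0.0))  # exploitation
```

Passing a seeded `random.Random` instance as `rng` makes training runs repeatable, which is convenient when comparing configurations over a fixed number of episodes.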
3.4. MAS Implementation
The MAS-TDLR system was developed using the Java Agent DEvelopment Framework (JADE version 4.6.0) [25], a software platform tailored for the construction of multi-agent systems. JADE offers built-in support for agent lifecycle management, standardized interaction protocols, and behavior modeling, as well as graphical tools for observing agent communications and assessing system behavior. The framework is implemented on Java 1.8, providing portability and compatibility across different operating systems.
For the implementation of the open-pit mine environment, the JGraphT library [26] was used to model the mine road network and its operational structure. This library provides efficient data structures and algorithms for graph-based representations, which were employed to represent operational locations (such as shovels, unloading points, and intersections) as nodes, and haul roads as edges with associated traversal costs. This representation enables consistent estimation of travel times and supports structured reasoning about routing and movement within the mining environment, facilitating realistic modeling of transportation dynamics in the proposed multi-agent system.
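To make the graph representation concrete, the sketch below estimates a travel time as the cheapest path over a weighted road graph. It is a library-free Python stand-in using Dijkstra's algorithm, not the JGraphT code of the actual system; the node names, edge weights, and the adjacency-dictionary layout are invented for illustration.

```python
import heapq


def shortest_travel_time(graph: dict[str, dict[str, float]], source: str, target: str) -> float:
    """Dijkstra's algorithm over an adjacency dict {node: {neighbor: traversal_cost}}.

    Returns float("inf") when the target cannot be reached from the source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


# Toy network: costs are travel times in minutes (made-up values).
roads = {
    "shovel_A": {"junction_1": 4.0, "dump_1": 12.0},
    "junction_1": {"crusher_1": 3.0, "dump_1": 6.0},
    "crusher_1": {},
    "dump_1": {},
}
time_to_dump = shortest_travel_time(roads, "shovel_A", "dump_1")  # 10.0 via junction_1
```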
4. Experimental Design
4.1. Scenarios
The experimental environment models an open-pit mining operation in which a fleet of high-capacity trucks transports material from loading shovels to processing and disposal facilities. The system explicitly represents the main operational components of the mining process, including trucks, shovels, crushers, stockpiles, and waste dumps, which interact through a decentralized multi-agent system.
For evaluation purposes, three simulated scenarios were constructed based on real operational data from an open-pit copper mine in Chile. These scenarios represent different operational scales and workload levels, allowing the analysis of system behavior under varying degrees of resource competition and interaction density.
Scenario 1 corresponds to a smaller-scale operation with a reduced number of trucks relative to loading capacity. In this configuration, queues are less frequent and shovels experience lower risk of starvation. This scenario serves as a baseline to observe coordination behavior under moderate system load.
Scenario 2 represents a medium-scale operation in which the number of trucks increases relative to the available loading units. Here, competition for shovels becomes more pronounced, and coordination decisions have a stronger impact on waiting times and shovel utilization.
Scenario 3 reflects a large-scale operation with a higher fleet size and more intense equipment interaction. In this scenario, congestion and queuing effects are more likely to occur, making it suitable for evaluating scalability and the robustness of coordination mechanisms.
The scenarios feature heterogeneous fleets of trucks and shovels operating in twelve-hour shifts. Real operational data, including truck and shovel velocities and their respective capacities, were used to define the properties of the agents, as summarized in Table 2, while the number of shovels and trucks for each scenario is reported in Table 3. The modeled transportation infrastructure consists of 638 nodes and 1330 edges, capturing the layout and connectivity of the mine’s operational road network.
Together, these scenarios enable the evaluation of both operational efficiency and the scalability of the multi-agent system under increasing workload and coordination complexity.
4.2. Evaluation Scheme and Metrics
The evaluation is conducted using the following quantitative metrics:
Operational cost (HH:MM:SS): total time associated with the operational cycles of the trucks during the shift, reported in hours:minutes:seconds format.
Transported material (tons): total amount of material moved during the shift.
Material transported per hour (tons/hour): indicator of the system’s hourly productivity.
Execution time (minutes): computational time required to complete the simulation.
These metrics allow the assessment of both operational efficiency and the dynamic behavior of the learning mechanism incorporated into the system.
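Two of these metrics reduce to simple arithmetic. The sketch below assumes that hourly productivity is the transported material divided by the twelve-hour shift length and that operational cost is accumulated in seconds before being formatted; both helper names are hypothetical.

```python
def tons_per_hour(total_tons: float, shift_hours: float = 12.0) -> float:
    """Hourly productivity over one shift (twelve hours by default)."""
    return total_tons / shift_hours


def format_hms(total_seconds: float) -> str:
    """Format an accumulated duration as HH:MM:SS; hours may exceed 24."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# 24,000 tons moved in a twelve-hour shift gives 2000.0 tons/hour.
rate = tons_per_hour(24000.0)
```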
4.3. Learning Configuration
An important aspect of the experimental design is the controlled variation in the number of negotiations considered by the agent before consolidating its policy. For each scenario, configurations with 3, 5, 6, 7, 8, 9, and 10 negotiations were analyzed to identify a setting that provided stable learning behavior while keeping the computational cost bounded. This approach makes it possible to analyze the impact of the trade-off between exploration and stability in the learning process. All learning algorithm parameters are kept constant throughout the experiments, ensuring comparability across scenarios and configurations.
Figure 4 compares the accumulated reward across the three scenarios and shows how the learning process evolves under the different negotiation configurations. Despite the differences in scale and complexity between the small, medium, and large scenarios, a common pattern emerges: intermediate configurations consistently achieve higher accumulated rewards and exhibit more stable convergence behavior.
In the small scenario, the learning curves rise rapidly and stabilize early, indicating that the environment is relatively easy for the agent to learn. Intermediate configurations, particularly those with 6 and 7 effective negotiations, reach higher final reward values and display smoother convergence, reflecting a more efficient learning process. In contrast, configurations with fewer negotiations converge quickly but to lower reward levels.
In the medium scenario, the learning process becomes more gradual. Intermediate configurations present steeper learning curves and attain higher accumulated rewards, whereas configurations with a small number of negotiations (3 and 5) and very large histories (9 and 10) show reduced stability and lower final performance, suggesting insufficient learning in the former case and excessive variability in the latter.
In the large scenario, this pattern becomes even more pronounced. The configuration with 7 negotiations achieves the highest accumulated reward, with a smooth and stable learning curve. Configurations with larger negotiation histories exhibit noticeable fluctuations and weaker convergence, indicating a noisier and less effective learning process.
Overall, these results highlight the importance of balancing historical depth and learning stability, showing that intermediate configurations provide the best trade-off between exploration and convergence across all scenarios.
Based on these results, the number of effective negotiations selected for each scenario is summarized in Table 4.
The fact that the medium and large scenarios share the same number of negotiations does not contradict their difference in scale, since increasing the scenario size mainly results in a larger number of truck–shovel pairs, rather than requiring a deeper negotiation history per pair. The learning process operates at the level of individual truck–shovel interactions, and once a sufficient number of negotiations is reached to characterize the interaction pattern of a given pair, additional negotiations do not provide further benefit. Therefore, the same negotiation depth remains adequate for both scenarios, even though the total number of interacting pairs increases in the larger case.
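This section does not state the update rule, the state definition, or the reward, so the following is only a minimal sketch of one possible reading of a fixed negotiation depth per truck–shovel pair. It uses the standard tabular Q-learning update (the algorithm named in the Conclusions) and stops updating a pair once a fixed number of negotiations has been recorded. The class name `PairQLearner`, the parameters `alpha`, `gamma`, and `epsilon`, the opaque `state` label, and the reward values are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


class PairQLearner:
    """Toy tabular Q-learner for one truck choosing among shovels.

    Assumption (not from the paper): each truck-shovel pair stops being
    updated after `max_negotiations` recorded negotiations.
    """

    def __init__(self, shovels, max_negotiations=7,
                 alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
        self.shovels = list(shovels)
        self.max_negotiations = max_negotiations
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.q = defaultdict(float)            # (state, shovel) -> value
        self.negotiations = defaultdict(int)   # shovel -> count so far
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice; explore only pairs that can still learn."""
        open_pairs = [s for s in self.shovels
                      if self.negotiations[s] < self.max_negotiations]
        if open_pairs and self.rng.random() < self.epsilon:
            return self.rng.choice(open_pairs)
        return max(self.shovels, key=lambda s: self.q[(state, s)])

    def update(self, state, shovel, reward, next_state):
        """Standard Q-learning update, skipped once the pair is consolidated."""
        if self.negotiations[shovel] >= self.max_negotiations:
            return
        self.negotiations[shovel] += 1
        best_next = max(self.q[(next_state, s)] for s in self.shovels)
        td_target = reward + self.gamma * best_next
        key = (state, shovel)
        self.q[key] += self.alpha * (td_target - self.q[key])


# Usage with made-up values: five negotiation outcomes are offered,
# but only the first three are used because max_negotiations=3.
learner = PairQLearner(["S1", "S2"], max_negotiations=3,
                       alpha=0.5, gamma=0.0, epsilon=0.0)
for _ in range(5):
    learner.update("idle", "S1", reward=10.0, next_state="idle")
```

Under this reading, adding trucks or shovels enlarges the table of pairs but leaves the per-pair depth unchanged, which matches the argument that the medium and large scenarios can share the same setting.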
4.4. Computational Environment
All simulations were executed in a homogeneous computational environment to ensure that the obtained results depend exclusively on the behavior of the proposed system and not on variations in hardware or software configurations.
The experiments were conducted on a laptop computer with the following specifications:
This computational environment was used consistently across all evaluated scenarios and configurations.
Given the stochastic nature of the multi-agent system, which arises from distributed decision-making, negotiation among agents, and the learning component, each experimental configuration was executed 10 times under identical initial conditions. Repeating the simulations helps mitigate the impact of random variations inherent to agent behavior and interaction ordering, leading to more stable and representative results. The values reported in the Results section correspond to the arithmetic mean obtained from these repeated executions. Although dispersion measures such as standard deviation were not systematically recorded, the relative performance ranking between MAS-TDRL and MAS-TDnoRL remained consistent across all runs, which suggests that the observed improvements are stable. This approach is especially relevant in multi-agent systems, where small initial variations can be amplified over time due to the decentralized nature of the decision-making process.
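As a small illustration of how repeated executions could be summarized with a dispersion measure alongside the mean, the sketch below uses Python's standard `statistics` module. The function name `summarize_runs` and the ten input values are hypothetical placeholders chosen for readability; they are not results from the study.

```python
import statistics


def summarize_runs(values):
    """Return the mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {"n": len(values), "mean": mean, "std": std, "cv": std / mean}


# Placeholder values for one metric over 10 repeated runs (NOT data from the paper).
placeholder_runs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]
summary = summarize_runs(placeholder_runs)
print(summary)
```

Reporting the standard deviation or coefficient of variation next to each mean would make it possible to judge whether differences between configurations exceed run-to-run variation.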
5. Results
From a production and scheduling perspective, the comparison between the multi-agent system with reinforcement learning (MAS-TDRL) and the multi-agent system without learning (MAS-TDnoRL) reveals clear and consistent differences in performance indicators. In terms of total material transported, MAS-TDRL systematically outperforms MAS-TDnoRL across all evaluated scenarios, indicating that the incorporation of learning capabilities improves decision-making related to the assignment of trucks to shovels.
Figure 5 presents a direct comparison of the material transported by both approaches. MAS-TDRL achieves transported volumes of 67,726 tons, 230,334 tons, and 447,021 tons in the small, medium, and large scenarios, whereas MAS-TDnoRL transports 53,515 tons, 159,345 tons, and 319,071 tons, respectively. These figures correspond to increases of roughly 27%, 45%, and 40%. The results suggest that reinforcement learning enables truckAgents to select loading opportunities more efficiently, avoiding low-productivity assignments and promoting a higher overall material flow.
The analysis of operational cost, measured as the total accumulated operating time of the trucks, is shown in Figure 6. In this study, operational cost is expressed as equipment operating time because detailed economic data, such as fuel consumption, tire wear, maintenance costs, or driver salaries, were not available from the mining operation. Operating time was therefore adopted as a practical proxy, on the expectation that it correlates strongly with these cost components in real mining systems.
In the small scenario, MAS-TDRL reduces the operational cost from 149:17:13 to 146:00:06 (about 2%), reflecting a direct improvement in temporal efficiency. In the more complex scenarios, the operational cost of MAS-TDRL is higher than that of MAS-TDnoRL: 482:20:56 versus 415:32:53 in the medium scenario (about 16% more) and 1006:39:13 versus 849:05:41 in the large scenario (about 19% more). This increase should be interpreted in the context of the higher productivity achieved by MAS-TDRL, which transports roughly 45% and 40% more material in these two scenarios within the same shift duration. Consequently, higher accumulated operating time may coexist with improved production indicators, reflecting a trade-off between productivity and equipment utilization.
To integrate both effects, Figure 7 presents the indicator of material transported per hour. Despite the increase in operational cost in larger configurations, MAS-TDRL exhibits higher hourly efficiency in all cases. This result indicates that the system is able to transport a greater amount of material per unit of effective time, confirming that the increase in operational time does not imply a loss of productivity, but rather a more efficient utilization of available resources.
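The hourly indicator can be approximately reproduced from the tonnage and operating-time figures reported above. The sketch below assumes that material transported per hour is the total tonnage divided by the accumulated truck operating time; this denominator is an assumption, since the text does not spell it out. The helper names `hms_to_hours`, `tons_per_hour`, and `relative_gain` are illustrative only.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to decimal hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


# (tons transported, accumulated operating time) as reported in the text.
RESULTS = {
    "small": {"MAS-TDRL": (67_726, "146:00:06"),
              "MAS-TDnoRL": (53_515, "149:17:13")},
    "medium": {"MAS-TDRL": (230_334, "482:20:56"),
               "MAS-TDnoRL": (159_345, "415:32:53")},
    "large": {"MAS-TDRL": (447_021, "1006:39:13"),
              "MAS-TDnoRL": (319_071, "849:05:41")},
}


def tons_per_hour(tons: float, hms: str) -> float:
    """Tons per accumulated operating hour (assumed definition)."""
    return tons / hms_to_hours(hms)


def relative_gain(scenario: str) -> float:
    """Percentage gain of MAS-TDRL over MAS-TDnoRL in tons per operating hour."""
    with_rl = tons_per_hour(*RESULTS[scenario]["MAS-TDRL"])
    without_rl = tons_per_hour(*RESULTS[scenario]["MAS-TDnoRL"])
    return 100.0 * (with_rl / without_rl - 1.0)


for name in RESULTS:
    rl_rate = tons_per_hour(*RESULTS[name]["MAS-TDRL"])
    base_rate = tons_per_hour(*RESULTS[name]["MAS-TDnoRL"])
    print(f"{name}: {rl_rate:.1f} vs {base_rate:.1f} tons/h "
          f"({relative_gain(name):+.1f}%)")
```

Under this assumption the gains come out at roughly 29%, 25%, and 18% for the small, medium, and large scenarios, which agrees with the 18–29% range stated in the Conclusions and shows that the relative advantage narrows as the scenario grows.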
Finally, the analysis of execution time (Figure 8) shows that the computational execution time of the learning-based algorithm is lower than that of the non-learning approach in all three scenarios. This suggests that, once the learned policy stabilizes, the system reduces unnecessary exploration and makes decisions in a more direct manner.
6. Discussion
The scope of the present study is limited to evaluating the incremental impact of reinforcement learning within the proposed multi-agent scheduling architecture. The experimental comparison is therefore designed as a controlled internal analysis between the learning-enhanced system and its non-learning counterpart, rather than as a comprehensive benchmark against all existing dispatching or centralized scheduling approaches in the literature. Previous comparisons of the non-learning baseline against optimization-based and metaheuristic approaches have been reported in earlier studies, whereas the focus of this work is specifically on isolating and quantifying the contribution of the learning component.
The results enable a deeper analysis of the implications of incorporating reinforcement learning into a multi-agent system for mining fleet scheduling. Rather than merely improving isolated decisions, the learning mechanism allows agents to gradually adapt to recurring operational patterns and system variability. Such adaptive behavior has been recognized in the broader scheduling and manufacturing research community, where reinforcement learning methods have emerged as a promising alternative to rule-based strategies in dynamic scheduling contexts [27,28,29]. These studies highlight RL’s capacity to autonomously refine policies based on environmental feedback, reducing reliance on static decision rules.
Another observation concerns the balance between coordination intensity and computational effort. The experiments suggest that intermediate levels of agent interaction provide sufficient informational richness for learning while avoiding excessive coordination overhead. When negotiation levels are too low, the learning process may lack enough experience to generalize effective policies; conversely, excessively high interaction levels can increase computational burden without proportional gains. Similar trade-offs between solution quality and computational cost have been discussed in the scheduling literature, where dynamic and adaptive approaches must carefully manage complexity to remain practical [27,28].
From a broader perspective, these findings highlight that learning-enabled MAS should not be evaluated solely in terms of performance gains but also in terms of operational feasibility. The design of interaction protocols, learning parameters, and negotiation frequency plays a central role in determining whether the approach remains practical for large-scale mining systems. Related work has emphasized that reinforcement learning can improve decision quality in dynamic environments but also poses challenges in scalability and computational efficiency [30,31].
Despite the encouraging results, this study presents limitations. First, the evaluation was conducted in a simulation environment, which, although based on real operational data, cannot fully capture all sources of uncertainty and variability present in actual mining operations. Second, the experimental scenarios were derived from data of a single open-pit copper mine, which may limit the generalizability of the findings to mines with different geological, operational, or fleet configurations. Validation across multiple mining sites and heterogeneous operational conditions would strengthen external validity. Third, computational overhead increases in larger scenarios, which may impose constraints under stricter decision-time requirements. These aspects represent important directions for future research and practical validation.
7. Conclusions and Future Work
This paper presented a multi-agent system for open-pit mining fleet scheduling that integrates Q-learning capabilities into truck agents. The decentralized approach was evaluated across scenarios of different operational scales, showing that reinforcement learning can improve production indicators such as total transported material and hourly productivity compared to a multi-agent system without learning. These improvements indicate that learning mechanisms can enhance decision quality in dynamic truck–shovel assignment environments.
This study evaluated whether incorporating reinforcement learning into a multi-agent scheduling system improves operational performance while maintaining computational feasibility. The results show that learning increases material transported per hour by 18–29% compared to the non-learning baseline. Although the learning mechanism introduces additional computational overhead, execution time remains below 10 min even in the largest scenario, which is compatible with scheduling decision cycles in open-pit mining operations. These findings indicate that, in the evaluated scenarios, reinforcement learning can enhance productivity without compromising practical applicability.
Future work will focus on extending the approach with more advanced reinforcement learning algorithms to improve scalability and convergence speed. Additional research directions include incorporating dynamic disruptions, testing cooperative multi-agent learning schemes, and validating the system with higher-fidelity simulation or real operational data. Hybrid approaches combining heuristics and learning also represent a promising direction for further investigation.