Article

Deep Reinforcement Learning-Based Experimental Scheduling System for Clay Mineral Extraction

1
State Key Laboratory of Continental Shale Oil, Daqing 163712, China
2
Exploration and Development Research Institute of Daqing Oilfield, Daqing 163712, China
3
School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 617; https://doi.org/10.3390/electronics15030617
Submission received: 10 November 2025 / Revised: 12 December 2025 / Accepted: 31 December 2025 / Published: 31 January 2026

Abstract

Efficient and non-destructive extraction of clay minerals is fundamental for shale oil and gas reservoir evaluation and enrichment mechanism studies. However, traditional manual extraction experiments face bottlenecks such as low efficiency and reliance on operator experience, which limit their scalability and adaptability to intelligent research demands. To address this, this paper proposes an intelligent experimental scheduling system for clay mineral extraction based on deep reinforcement learning. First, the complex experimental process is deconstructed, and its core scheduling stages are abstracted into a Flexible Job Shop Scheduling Problem (FJSP) model with resting time constraints. Then, a scheduling agent based on the Proximal Policy Optimization (PPO) algorithm is developed and integrated with an improved Heterogeneous Graph Neural Network (HGNN) to represent the relationships among operations, machines, and constraints. This enables effective capture of the complex topological structure of the experimental environment and facilitates efficient sequential decision-making. To facilitate future practical applicability, a four-layer system architecture is proposed, comprising the physical equipment layer, execution control layer, scheduling decision layer, and interactive application layer. A digital twin module is designed to bridge the gap between theoretical scheduling and physical execution. This study focuses on validating the core scheduling algorithm through realistic simulations. Simulation results demonstrate that the proposed HGNN-PPO scheduling method significantly outperforms traditional heuristic rules (FIFO, SPT), meta-heuristic algorithms (GA), and simplified reinforcement learning methods (PPO-MLP). Specifically, in large-scale problems, our method reduces the makespan by over 9% compared to the PPO-MLP baseline, and the algorithm runs more than 30 times faster than GA. This highlights its superior performance and scalability. This study provides an effective solution for intelligent scheduling in automated chemical laboratory workflows and holds significant theoretical and practical value for advancing the intelligentization of experimental sciences, including shale oil and gas research.

1. Introduction

The types, contents, and occurrences of clay minerals in shale are key scientific issues for accurately evaluating shale oil and gas reservoirs and revealing their enrichment mechanisms [1,2,3]. To achieve precise analysis, clay minerals must first be efficiently and non-destructively extracted from shale samples, which forms the foundation of the entire analytical workflow [4]. At present, the clay mineral suspension extraction experiment based on the centrifugation method is the mainstream technology, which significantly improves efficiency compared with the traditional sedimentation method [5,6]. The overall process, illustrated in Figure 1, involves multiple stages, including sample preparation, centrifugal separation, washing, and drying, requiring coordination across various automated equipment.
However, as the research scale expands, the current clay mineral extraction process reveals several bottlenecks. First, traditional manual operation is inefficient, with long experimental cycles and high dependence on operator experience, making it difficult to ensure consistency and accuracy of results. Second, in multi-robot collaborative operations, coordination becomes difficult; during large-scale sample processing, equipment waiting and conflicts severely restrict overall efficiency [7,8]. Finally, the dynamic uncertainty of the experimental environment poses challenges—factors such as variations in sample characteristics, sensor drift, unexpected reagent delays, or equipment status fluctuations may affect the experimental progress. Traditional static scheduling approaches struggle to adapt to such real-time changes, leading to inefficiencies and requiring manual intervention. Against this backdrop, developing a system capable of achieving automated and intelligent scheduling of experimental workflows that can dynamically respond to these uncertainties to improve efficiency, ensure quality, and reduce reliance on human experience has become an urgent demand in this field.
The core challenge of automating this process lies in achieving efficient and dynamic multi-machine coordination. We formally frame this challenge as a Flexible Job Shop Scheduling Problem (FJSP) [9,10] for several clear reasons. First, each shale sample can be treated as a ‘job’ that requires a specific sequence of operations. Second, each experimental step (e.g., centrifugation, liquid transfer) corresponds to an ‘operation’. Third, the availability of multiple robotic systems (e.g., three gantries, one mobile robot) that can perform the same type of operation introduces ‘flexibility’, which is the defining characteristic of FJSP. Therefore, the problem of scheduling N samples through M steps on K machines maps directly onto the FJSP framework.
Our modeling approach follows a structured methodology. We first deconstruct the complex experimental process, abstracting its core scheduling stages into a Flexible Job Shop Scheduling Problem (FJSP) model with resting time constraints. To solve this, we develop a scheduling agent based on the Proximal Policy Optimization (PPO) algorithm, integrated with an improved Heterogeneous Graph Neural Network (HGNN) to effectively represent the complex relationships among operations, machines, and heterogeneous constraints. Furthermore, to bridge the gap between theoretical scheduling and physical execution, we design a comprehensive four-layer system architecture—comprising physical, execution control, scheduling decision, and interactive application layers—which features a digital twin module designed to ensure real-time synchronization between the virtual model and the physical laboratory. The digital twin module is crucial for bridging the gap between theoretical scheduling’s inherent static assumptions and the dynamic realities of a physical lab. It enables the system to continuously update its understanding of the experimental state based on real-world events, allowing the DRL agent to perform reactive scheduling. This approach is built upon several foundational assumptions common in scheduling problems: (1) a machine can only process one operation at a time; (2) operations for a single sample must follow a predefined sequence; (3) once an operation starts, it cannot be interrupted; and (4) an operation can only begin after its predecessor’s resting period is complete. These steps and assumptions provide a robust foundation for developing our intelligent scheduling system.
The main contributions of this paper are as follows:
(1)
We model the clay mineral extraction workflow as an FJSP with resting time constraints and design a four-layer system architecture for practical implementation in an automated laboratory.
(2)
We construct a novel scheduling agent based on an HGNN-PPO algorithm. The improved HGNN effectively captures the complex topological structure and heterogeneous constraints of the experimental environment, enabling high-quality sequential decision-making.
(3)
Through extensive simulation experiments, we demonstrate the superiority of our proposed method over traditional heuristic rules (FIFO, SPT), a meta-heuristic algorithm (GA), and a simplified reinforcement learning approach (PPO-MLP) in terms of efficiency, resource utilization, and scalability.

2. Related Work

To date, little academic research worldwide has focused on key stages of shale oil exploration such as clay mineral extraction. In contrast, scheduling agents within intelligent scheduling systems have been studied extensively.
The core challenge of experimental scheduling systems lies in achieving efficient and dynamic multi-machine coordination. In laboratory settings, the clay mineral suspension extraction experiment features multiple workpieces, multi-step processes, multiple devices, and dynamically arriving tasks. Moreover, the workflow may need real-time adjustments due to environmental variations or operational demands. This problem exhibits a high degree of similarity to the Flexible Job Shop Scheduling Problem (FJSP) [9,10], as both involve complex and dynamic scheduling requirements aligned with task allocation and resource optimization issues found in flexible manufacturing systems.
In recent years, deep reinforcement learning (DRL) has made remarkable progress in the field of flexible shop scheduling due to its strong environmental adaptability and sequential decision-making capabilities [11,12,13]. Current studies mainly follow two technical paths. The first applies classical DRL algorithm frameworks. For instance, Kong Beibei et al. [14] and Dong Hai et al. [15] adopted the PPO algorithm to solve power battery disassembly and traditional FJSPs, respectively; An Youjun et al. [16] employed the DDQN algorithm to address multi-objective scheduling in semiconductor packaging. The second focuses on innovative state representation and network structures to better capture the complex topological relationships inherent in scheduling problems. For example, Cheng et al. [17] extracted multi-dimensional features from shop-floor environments, while Zhao et al. [18] and Lei Kun et al. [19] introduced attention mechanisms and Graph Neural Networks (GNNs), respectively, to enhance the representational power of policy networks, achieving superior solution quality and computational speed compared to traditional heuristic and meta-heuristic algorithms [20].
Although these studies have demonstrated the great potential of DRL in industrial scheduling, directly transferring existing FJSP solutions to the clay mineral extraction experimental scenario faces three major challenges that hinder their applicability. The first is the model mismatch problem. Traditional FJSP models typically assume deterministic industrial environments (e.g., fixed processing times), whereas chemical experiments are characterized by randomness and dynamic uncertainty (e.g., fluctuating reaction rates, sensor drift, unexpected reagent delays, or equipment unavailability). Existing models, when applied in a purely static manner, fail to capture these critical variables. Our approach addresses this by integrating a digital twin module that continuously synchronizes the virtual environment with the physical lab, allowing the scheduling agent to reactively adapt to observed real-time changes, even if the underlying FJSP formulation is based on initial estimates. The second is the constraint heterogeneity problem. While FJSP models with time lags or resting/cooling times have been explored in the Operations Research literature for manufacturing scenarios [21,22], the clay mineral extraction process introduces a unique combination of constraints rarely seen together. These include not only strict resting-time requirements between operations but also “batch-sample collaborative processing” (e.g., simultaneous centrifugation of multiple samples). Existing FJSP algorithms often focus on one type of constraint and lack the capacity to efficiently model such hybrid workflows that combine flexible individual operations with rigid batch-level constraints, making them difficult to apply directly. The third challenge is the theory–execution gap. Most existing FJSP research remains at the algorithm simulation level; the generated scheduling sequences lack a closed-loop interaction mechanism with physical execution units (robots, equipment). As a result, these approaches cannot effectively respond to real-world issues such as execution delays, sensor drift, or equipment malfunctions, making theoretically optimal solutions insufficiently robust or executable in practice. Our proposed four-layer system architecture, particularly the digital twin module, is designed to address this challenge conceptually. The architecture provides a framework for enabling real-time state synchronization, which would allow the agent to observe the actual state of the physical environment and make reactive decisions. This paper validates the core decision-making component of this architecture through simulation, thereby demonstrating its potential to bridge the theory-execution gap and enhance robustness.
To overcome these limitations, this study does not simply apply existing DRL algorithms but deeply adapts and integrates them to suit the specific scenario of clay mineral extraction. The innovations of this paper are threefold: first, a resting-time–constrained FJSP model is developed to accurately reflect the characteristics of the experimental process; second, an improved HGNN-based scheduling agent is designed to handle heterogeneous constraints; and finally, a four-layer system architecture with a digital twin module is established to bridge the gap between theoretical scheduling and physical execution. These developments aim to fill current research gaps and provide a comprehensive and feasible solution for intelligent scheduling in automated chemical laboratory workflows. The proposed system not only promotes laboratory intelligence but also opens new pathways for scientific research in shale oil exploration, carrying significant practical implications and strategic value.

3. Materials and Methods

3.1. Experimental Process Analysis

The workflow of the clay mineral suspension extraction experiment is shown in Figure 2. A laboratory can simultaneously process up to 40 samples. The available equipment includes three gantries, one fixed robotic arm, and one mobile robot. Each sample uses specific beakers as carriers for different operations, referred to as B1, B2, and B3 in the figure.
The operations in the flowchart can be divided into two categories. Operations 1, 2, 4, 5, 6, 7, 9, and 10—marked in orange—are steps that each clay sample must undergo independently. Operations 11, 12, 13, 14, 15, and 16—marked in blue—are collective operations performed on multiple samples simultaneously. The blue operations, being batch-level external actions, only need to be executed sequentially at specific times and thus do not require scheduling optimization. The orange operations, which are independently executed by each sample, consist of eight steps in total. An effective scheduling strategy can be formulated to arrange these operations compactly across available machines, thereby improving overall efficiency.
In summary, the laboratory can handle up to 40 samples, each with eight schedulable operations distributed among four machines. Some operations require a resting time before the next can begin. This setup aligns closely with the Flexible Job Shop Scheduling Problem (FJSP) scenario. Therefore, this paper employs a deep reinforcement learning–based scheduling algorithm to optimize decision-making within this context. For other procedures in the workflow, an intelligent scheduling framework is constructed on top of the scheduling algorithm to handle all processes efficiently.
In the real experimental environment, a server is deployed to run the intelligent scheduling system. At the initial stage, operators input sample information through a human–machine interface (HMI) using voice or graphical interaction. The backend program processes and packages the data, sending it to the scheduling agent. The agent initializes its internal digital environment based on the received data and makes decisions according to the environmental state. The decisions are transmitted via the HMI to the physical machines. When events occur in the physical environment—such as task completion, new sample arrivals, or rework requirements—the agent receives event signals from the machines and updates the corresponding state variables in its digital environment in real time.

3.2. Experimental Process Scheduling Agent

3.2.1. Modeling of FJSP with Resting Time

As described in Section 3.1, each clay sample undergoes eight schedulable operations in the suspension extraction experiment. This study formulates the problem as a Flexible Job Shop Scheduling Problem (FJSP) with resting time, solved using a deep reinforcement learning (DRL) framework based on the PPO algorithm, with environmental feature extraction performed by a Heterogeneous Graph Neural Network (HGNN).
A set of $n$ jobs $J = \{J_1, J_2, \ldots, J_n\}$ must be processed on $m$ machines $M = \{M_1, M_2, \ldots, M_m\}$. Each job $J_i$ consists of multiple operations, where $O_{ij}$ denotes the $j$-th operation of job $J_i$. Each operation can be processed on one machine $M_k$ from its compatible set $M_{ij} \subseteq M$, with a processing time of $t_{ijk}$. The goal is to assign operations to machines and determine their execution order to minimize the overall completion time.
For this study, each operation $O_{ij}$ is followed by a resting time $r_{ij}$. The following assumptions are made:
  • Each machine can process only one operation at a time.
  • All operations belonging to the same job must be executed sequentially.
  • Once an operation starts, processing is continuous and cannot be interrupted (no preemption).
  • Each operation for a sample can begin only after the previous operation has completed and its resting period has elapsed.
The parameters used in this problem are summarized in Table 1.
Based on these parameters, and following [23], the mathematical model for the FJSP with resting time is defined as:
$$\mathrm{Minimize} \;\; \max_{i}\,\{C_{i,n_i}\}, \quad i = 1, 2, \ldots, n \qquad (1)$$
$$\begin{aligned}
\mathrm{s.t.}\quad & C_{i,0} = 0,\;\; C_{i,j} > 0, && \forall i, j & \text{(a)}\\
& \textstyle\sum_{k \in M_{i,j}} X_{i,j,k} = 1, && \forall i, j & \text{(b)}\\
& C_{i,1} - t_{i,1,k} - A_i X_{i,1,k} \ge 0, && \forall i, k & \text{(c)}\\
& \left(C_{i,j} - t_{i,j,k} - r_{i,j-1} - C_{i,j-1}\right) X_{i,j,k} \ge 0, && \forall i, j, k & \text{(d)}\\
& \left(C_{h,g} - t_{h,g,k} - C_{i,j}\right) X_{i,j,k} X_{h,g,k} Y_{i,j,h,g} + \left(C_{i,j} - t_{i,j,k} - C_{h,g}\right) X_{i,j,k} X_{h,g,k} \left(1 - Y_{i,j,h,g}\right) \ge 0, && \forall i, j, h, g, k & \text{(e)}
\end{aligned} \qquad (2)$$
Equation (1) is the objective function, minimizing the maximum completion time across all jobs. Equation (2a) ensures that completion times are nonnegative; (2b) restricts each operation to a single machine; (2c) ensures processing begins only after job arrival; (2d) enforces the resting time between consecutive operations; and (2e) guarantees that no machine executes more than one operation at the same time.
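To make the decision variables above concrete, the following minimal Python sketch builds a toy two-job instance and computes the makespan of a given feasible assignment; it mirrors constraints (2b)–(2e) for a fixed schedule. It is an illustration only (all data and names are hypothetical), not the authors' implementation.

```python
# Illustrative sketch (hypothetical data, not the authors' code): a toy FJSP instance with
# resting times and the makespan of a fixed, feasible assignment.
from collections import defaultdict

# proc[i][j][k] = t_ijk: processing time of operation O_ij on machine M_k (compatible machines only)
proc = {
    0: [{0: 10, 1: 12}, {1: 8}],        # job 0: two operations
    1: [{0: 6},         {0: 9, 1: 7}],  # job 1: two operations
}
rest = {0: [5, 0], 1: [15, 0]}           # r_ij: resting time required after operation O_ij

def makespan(assignment):
    """assignment: list of (job i, op j, machine k) in execution order, respecting job precedence."""
    machine_free = defaultdict(float)    # next idle time of each machine (constraint 2e)
    job_ready = defaultdict(float)       # earliest start of each job's next operation (constraints 2c-2d)
    completion = {}
    for i, j, k in assignment:
        start = max(machine_free[k], job_ready[i])   # machine idle AND predecessor finished + rested
        end = start + proc[i][j][k]                   # processing is uninterrupted
        machine_free[k] = end
        job_ready[i] = end + rest[i][j]               # resting time before the next operation of job i
        completion[(i, j)] = end
    return max(completion.values())                   # C_max = max_i C_{i,n_i}

print(makespan([(0, 0, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1)]))  # -> 46.0
```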

3.2.2. Construction of the Scheduling Algorithm

Based on the mathematical model above, the experimental process scheduling problem is transformed into an FJSP with resting time, solved by a deep reinforcement learning–based agent that iteratively makes scheduling decisions. At each decision step, the agent assigns an unscheduled operation to a compatible machine until all operations are scheduled. The workflow of this method is illustrated in Figure 3. In each iteration, the scheduling state is first converted into a heterogeneous graph structure. Then, an HGNN with a two-stage embedding process is applied to extract operation and machine feature embeddings. These embeddings are input into the decision network, which generates a probability distribution over possible actions, from which a scheduling decision is sampled.
The scheduling agent operates in two modes: training mode and scheduler mode.
In training mode, during each training episode, the program randomly samples an FJSP instance (defined by $n$ jobs, $m$ machines, and corresponding operations) and builds a virtual environment. At each decision time step $t$ (either $t = 0$ or a moment at which schedulable operations and machines are available), the agent observes the current virtual environment state $s_t$ and makes a decision $a_t$, assigning an unscheduled operation to an idle machine starting from time $T(t)$. The environment then transitions to the next decision step $t + 1$. This process repeats until all operations are scheduled.
In scheduler mode, the agent interacts with the real laboratory environment, where the virtual environment functions as a digital twin. At each decision step $t$, the agent observes the virtual state $s_t$ and makes a decision $a_t$, which is simultaneously applied to both the virtual and physical environments. The virtual environment's state transitions are triggered by event signals from the physical system, which update its internal state and advance it to the next decision step $t + 1$. This real-time synchronization is crucial for handling dynamic uncertainties. For instance, if an operation's processing time fluctuates (e.g., due to sample heterogeneity or sensor drift) and takes longer than initially estimated, the physical system reports its completion at the actual later time, and the digital twin immediately updates the machine's status and the operation's completion time. Similarly, if there are unexpected reagent delays or new samples are inserted into the workflow, these events are reported to the digital twin, which then updates the availability of resources or the set of pending jobs. The agent then perceives this updated, real-time state of the environment and makes its next decision based on the current conditions, effectively enabling reactive scheduling. This iterative feedback loop allows the system to adapt continuously to deviations from the initial plan, maintaining operational efficiency and robustness. The process repeats until all operations are completed.
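The closed loop described above can be summarized schematically. The sketch below assumes three hypothetical interfaces, `twin` (the digital-twin state), `agent` (the trained policy), and `bus` (the event channel to and from the execution control layer); it illustrates the reactive cycle rather than reproducing the authors' code.

```python
# Schematic sketch of the scheduler-mode loop; `agent`, `twin`, and `bus` are assumed interfaces.
def run_scheduler(agent, twin, bus):
    while not twin.all_operations_done():
        # React to real-world events: completions, delays, reworks, or newly arrived samples.
        for event in bus.poll():            # e.g., {"type": "op_completed", "job": 3, "op": 2, "time": 412.0}
            twin.apply_event(event)         # update machine idle times, resting countdowns, pending jobs
        while twin.has_feasible_actions():
            state = twin.observe()          # heterogeneous-graph state s_t
            op, machine = agent.act(state)  # sample an "operation-machine" pair a_t from the policy
            twin.start(op, machine)         # mirror the decision in the virtual environment
            bus.dispatch(op, machine)       # and issue the command to the physical machines
```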

3.2.3. Markov Decision Process Modeling

Based on the operating principle of the agent described above, the corresponding Markov Decision Process (MDP) is defined as follows.
  • State: At step $t$, the conditions of all operations and machines constitute the state $s_t$. The initial state $s_0$ is an FJSP instance sampled from a distribution (training mode) or a batch of sample information input from the real environment (scheduler mode).
  • Action: This paper adopts an integrated approach to solve the FJSP by merging operation selection and machine assignment into a single composite decision. Specifically, an action $a_t \in A_t$ at step $t$ is defined as a feasible “operation–machine” pair $(O_{ij}, M_k)$, where operation $O_{ij}$ is executable (i.e., its preceding operation in the same job has finished and the resting time has elapsed) and $M_k \in M_{ij}$ is idle. Operation $O_{ij}$ starts immediately on machine $M_k$. The action set $A_t$ is time-dependent and contains all feasible operation–machine pairs. Since each job has at most one operation ready at any time and $|M_{ij}| \le m$, we have $|A_t| \le n \times m$.
  • Transition: Given state $s_t$ and action $a_t$, the environment deterministically transitions to a new state $s_{t+1}$ at the time when some operation finishes. In this paper, two different states are distinguished by the topology and features of their corresponding heterogeneous graph structures.
  • Reward: The reward is defined as the difference between the estimated makespans in states $s_t$ and $s_{t+1}$, i.e., $r(s_t, a_t, s_{t+1}) = C_{\max}(s_t) - C_{\max}(s_{t+1})$. When the discount factor $\gamma = 1$, the cumulative reward over a completed solution process is $G = \sum_{t=0}^{|O|} r(s_t, a_t, s_{t+1}) = C_{\max}(s_0) - C_{\max}$. For a specific problem instance, $C_{\max}(s_0)$ is a constant, which means minimizing $C_{\max}$ is equivalent to maximizing $G$.
  • Policy: The policy $\pi(a_t \mid s_t)$ defines a probability distribution over all “operation–machine” pairs in the action set $A_t$ for each state $s_t$. A deep reinforcement learning (DRL) algorithm is then designed to parameterize the policy $\pi$ as a neural network and optimize it to maximize the expected cumulative reward.
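The reward above depends on an estimated makespan for partial schedules. A common choice is a simple lower bound, as in the short sketch below; the estimator and the environment fields (`last_completion`, `remaining_ops`) are illustrative assumptions, since the paper does not spell out its estimator.

```python
# Illustrative sketch of the makespan-difference reward; `env` and its fields are hypothetical.
def estimated_makespan(env):
    # Lower bound: for each job, the completion time of its last scheduled operation plus the
    # sum of the shortest remaining processing times (machine contention ignored).
    return max(env.last_completion[i] +
               sum(min(times.values()) for times in env.remaining_ops[i])
               for i in env.jobs)

def reward(env_t, env_t1):
    # r(s_t, a_t, s_{t+1}) = C_max(s_t) - C_max(s_{t+1}); positive when the estimate improves.
    return estimated_makespan(env_t) - estimated_makespan(env_t1)
```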
To train the proposed scheduling policy network, we adopt the Proximal Policy Optimization (PPO) algorithm [24]. PPO is an advanced policy-gradient method operating within the actor–critic framework, known for improved training stability, reliability, and sample efficiency compared with earlier approaches. Its core idea is to limit the magnitude of the policy update at each iteration, avoiding destructive, overly large changes during training and thus ensuring stable convergence.
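For reference, the clipped surrogate objective that PPO maximizes can be written in a few lines of PyTorch; this is the standard formulation from [24] with placeholder tensor names, not the authors' training code.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized)."""
    ratio = torch.exp(new_logp - old_logp)                          # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                    # negative of the surrogate objective
```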

3.2.4. Heterogeneous-Graph-Based Scheduling State Representation

As noted in the previous section, the agent’s ability to act on the environment hinges on the policy network $\pi_\theta$ mapping an input state $s_t$ to an action $a_t$. The raw state information is too voluminous to feed directly into a policy network. A common approach is to represent the scheduling state with a disjunctive graph and apply a graph neural network to extract features. This paper uses a heterogeneous graph structure $H = (O, M, C, E)$ to represent the FJSP scheduling state [25], and extends the operation-state information. Specifically, $O$ is the set of nodes representing all operations, $M$ the set of nodes representing all machines, $C$ the precedence constraints between consecutive operations within the same job, and $E$ the set of all “operation–machine” pairs. Thus, the raw state information obtained by the agent from the virtual environment at each decision point can be defined by the following three items:
(1)
Operation state $u_{ij} \in \mathbb{R}^7$: whether the operation has been planned; the number of machines capable of processing it; its processing time; the operation start time; the number of unscheduled operations within the sample; the estimated completion time of the sample; and the resting time of the sample after this operation.
(2)
Machine state $v_k \in \mathbb{R}^3$: the next idle time of the machine; the number of operations the machine can process; utilization (busy time/total working time).
(3)
Operation–machine pair state $E_{ijk} \in \mathbb{R}$: the processing time of the operation on that machine.
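As a minimal illustration of how these three items could be packed into tensors for the graph network (dimensions follow the text: seven operation features, three machine features, one scalar edge feature), consider the sketch below; the dictionary field names are hypothetical.

```python
import torch

def build_state_tensors(ops, machines, edges):
    # ops: list of dicts with the seven operation attributes listed in (1)
    op_feat = torch.tensor([[o["scheduled"], o["n_compatible"], o["proc_time"], o["start_time"],
                             o["n_unscheduled_in_job"], o["est_job_completion"], o["rest_time"]]
                            for o in ops], dtype=torch.float32)                        # (|O|, 7)
    # machines: list of dicts with the three machine attributes listed in (2)
    m_feat = torch.tensor([[m["next_idle"], m["n_compatible_ops"], m["utilization"]]
                           for m in machines], dtype=torch.float32)                    # (|M|, 3)
    # edges: {(op_index, machine_index): processing time}, the attribute listed in (3)
    edge_index = torch.tensor(list(edges.keys()), dtype=torch.long).t()                # (2, |E|)
    edge_attr = torch.tensor(list(edges.values()), dtype=torch.float32).unsqueeze(-1)  # (|E|, 1)
    return op_feat, m_feat, edge_index, edge_attr
```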

3.2.5. State Feature Extraction with an Improved Heterogeneous Graph Neural Network

We build upon the Heterogeneous Graph Neural Network (HGNN) proposed in [25] and enhance it with multi-head attention to serve as the core module for solving the FJSP. The goal is to extract embeddings for operations and machines from the heterogeneous-graph state of the FJSP to support subsequent scheduling decisions. A two-stage embedding update mechanism is adopted to separately process machine nodes and operation nodes, with a Graph Attention Network (GAT) to strengthen key-information aggregation, as shown in Figure 4. The architecture and workflow are divided into two stages.
Stage 1: Machine-node embedding update. The inputs are the raw machine-node features $\nu_k \in \mathbb{R}^3$ and, for each neighboring operation, the concatenation of the operation–machine edge feature $\lambda_{ijk}$ (processing time) with the raw operation features $\mu_{ij} \in \mathbb{R}^7$, namely $\mu_{ijk} = [\mu_{ij} \,\|\, \lambda_{ijk}] \in \mathbb{R}^8$. The core is an improved application of the GAT mechanism with the following steps:
(1)
Attention coefficient computation: For machine $M_k$ and its neighboring operation $O_{ij}$, compute the attention coefficient $e_{ijk}$, fusing machine and operation features: $e_{ijk} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W_M \nu_k \,\|\, W_O \mu_{ijk}\right]\right)$.
Here $W_M \in \mathbb{R}^{d \times 3}$ and $W_O \in \mathbb{R}^{d \times 8}$ are linear transformation matrices, and $a \in \mathbb{R}^{2d}$ is the attention weight vector.
(2)
Self-attention supplement: Compute the machine’s self-attention coefficient $e_{kk}$ to avoid neglecting its own state: $e_{kk} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W_M \nu_k \,\|\, W_M \nu_k\right]\right)$.
(3)
Normalization and feature aggregation: Apply Softmax normalization to all attention coefficients of the neighboring operations and the machine itself to obtain $\alpha_{ijk}$ and $\alpha_{kk}$, and aggregate features: $\nu_k' = \sigma\!\left(\alpha_{kk} W_M \nu_k + \sum_{O_{ij} \in N_t(M_k)} \alpha_{ijk} W_O \mu_{ijk}\right)$, where $\sigma$ is an activation function (e.g., ReLU) and $N_t(M_k)$ is the set of operations adjacent to machine $M_k$ in state $t$.
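The three steps above can be condensed into a compact PyTorch layer. The sketch below is an illustrative reconstruction of the Stage 1 update for a single machine node (feature dimensions follow the text; the embedding size $d$ is arbitrary), not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MachineGATLayer(nn.Module):
    """Stage 1 sketch: attention over neighboring operations plus a self-attention term."""
    def __init__(self, d=8):
        super().__init__()
        self.W_M = nn.Linear(3, d, bias=False)      # machine transform W_M
        self.W_O = nn.Linear(8, d, bias=False)      # operation(+edge) transform W_O
        self.a = nn.Parameter(torch.randn(2 * d))   # attention weight vector a

    def forward(self, v_k, mu_neighbors):
        # v_k: (3,) raw machine features; mu_neighbors: (N, 8) features of adjacent operations
        hm = self.W_M(v_k)                                                          # (d,)
        ho = self.W_O(mu_neighbors)                                                 # (N, d)
        e_ops = F.leaky_relu(torch.cat([hm.expand_as(ho), ho], dim=-1) @ self.a)    # e_ijk, (N,)
        e_self = F.leaky_relu(torch.cat([hm, hm], dim=-1) @ self.a)                 # e_kk, scalar
        alpha = torch.softmax(torch.cat([e_self.unsqueeze(0), e_ops]), dim=0)       # normalized weights
        return torch.relu(alpha[0] * hm + (alpha[1:].unsqueeze(-1) * ho).sum(dim=0))  # updated nu'_k
```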
Stage 2: Operation-node embedding update. Inputs include the raw operation features $\mu_{ij}$; the features of the predecessor $\mu_{i,j-1}$ and successor $\mu_{i,j+1}$ operations; and the aggregated value of the updated machine embeddings from Stage 1, $\bar{\nu}_{ij} = \sum_{M_k \in N_t(O_{ij})} \nu_k'$. This stage employs a Multi-Layer Perceptron (MLP) fusion approach, i.e., five MLP modules process the different sources of information and their concatenation is projected into the operation embedding:
$$u_{ij}' = \mathrm{MLP}_{\theta_0}\!\left(\mathrm{ELU}\!\left(\left[\mathrm{MLP}_{\theta_1}(\mu_{i,j-1}) \,\|\, \mathrm{MLP}_{\theta_2}(\mu_{i,j+1}) \,\|\, \mathrm{MLP}_{\theta_3}(\bar{\nu}_{ij}) \,\|\, \mathrm{MLP}_{\theta_4}(\mu_{ij})\right]\right)\right) \qquad (3)$$
Here, $\mathrm{MLP}_{\theta}$ denotes an MLP with ELU activation and output dimension $d$, and $\theta_0$–$\theta_4$ are the parameters of the different modules. The Start and End nodes of the graph do not compute embeddings and serve only as structural connectors.
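A companion sketch of the Stage 2 fusion in Equation (3) is given below; the exact placement of the ELU activations and the embedding size $d$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ELU())

class OperationUpdate(nn.Module):
    """Stage 2 sketch: fuse predecessor, successor, aggregated machine, and own features."""
    def __init__(self, d=8):
        super().__init__()
        self.mlp_prev = mlp(7, d)     # MLP_theta1 on mu_{i,j-1}
        self.mlp_next = mlp(7, d)     # MLP_theta2 on mu_{i,j+1}
        self.mlp_mach = mlp(d, d)     # MLP_theta3 on the aggregated machine embedding
        self.mlp_self = mlp(7, d)     # MLP_theta4 on mu_{ij}
        self.mlp_out = mlp(4 * d, d)  # MLP_theta0 on the concatenation

    def forward(self, mu_prev, mu_next, v_bar, mu_self):
        h = torch.cat([self.mlp_prev(mu_prev), self.mlp_next(mu_next),
                       self.mlp_mach(v_bar), self.mlp_self(mu_self)], dim=-1)
        return self.mlp_out(h)        # updated operation embedding u'_ij
```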
In addition, this paper stacks $L$ HGNN layers with identical structure (but independent parameters) to enhance feature extraction. The first layer uses the original node features; subsequent layers take the previous layer’s embeddings as input; edge features (processing time) participate in all layers. We further improve the original HGNN by replacing average pooling with multi-head attention [26]. After $L$ layers, the operation and machine embeddings are fused via multi-head attention and concatenated into a fixed-dimensional state embedding:
$$h_t = \left[\;\sum_{O_{ij} \in O} \alpha_{ij}\, \mu_{ij}^{L} \;\Big\|\; \sum_{M_k \in M} \beta_k\, \nu_k^{L}\;\right] \qquad (4)$$
where $|O|$ and $|M|$ are the numbers of operation and machine nodes; $\mu_{ij}^{L}$ and $\nu_k^{L}$ are the $L$-th-layer outputs; and $\alpha_{ij}$ and $\beta_k$ are the normalized attention weights for operation and machine nodes computed by the multi-head attention mechanism, satisfying $\sum_{O_{ij} \in O} \alpha_{ij} = 1$ and $\sum_{M_k \in M} \beta_k = 1$.
The weight computation proceeds as follows: (1) set learnable query vectors separately for operation and machine nodes; (2) compute the similarity between each node embedding and its query vector via a multi-head attention mechanism; (3) apply the Softmax function to normalize similarities into attention weights; (4) perform weighted summation of node embeddings based on these weights to obtain fused global features for operations and machines. This multi-head attention weighting adaptively focuses on more important node information and, compared with mean pooling, better captures features that are critical to decision-making, thereby enhancing the model’s expressive power.
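This attention pooling can be sketched with the standard `torch.nn.MultiheadAttention` module as a stand-in (the learnable query, head count, and dimensions are assumptions); the same pooling would be applied separately to the operation and machine embeddings, with the two pooled vectors concatenated as in Equation (4).

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learnable-query multi-head attention pooling over a set of node embeddings."""
    def __init__(self, d=8, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d))                # learnable query vector
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, node_emb):                                       # node_emb: (N, d)
        keys = node_emb.unsqueeze(0)                                   # (1, N, d)
        pooled, weights = self.attn(self.query, keys, keys)            # weights sum to 1 over the N nodes
        return pooled.squeeze(0).squeeze(0), weights.squeeze()         # fused feature (d,), weights (N,)

# Usage sketch: h_t = torch.cat([AttentionPool()(op_emb)[0], AttentionPool()(mach_emb)[0]])  # Equation (4)
```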
HGNN offers several advantages. First, it captures complex relationships: through the heterogeneous graph and two-stage embeddings, it effectively models precedence constraints among operations, compatibility between machines and operations, and key characteristics such as processing times. Second, its architecture is inherently size-agnostic. The graph-based structure, combined with attention and pooling mechanisms, allows the same model architecture to process problems of different sizes (e.g., varying numbers of jobs and machines) without modification. This architectural flexibility makes it fundamentally more scalable than approaches like MLP, which require fixed-size input vectors. While this architecture allows for training a single model that generalizes across sizes, to achieve optimal performance for benchmarking at each specific scale, our study trained separate models for each problem size ( n = 10, 20, 40), as detailed in the experimental section. Additionally, it improves decision quality: the extracted embeddings contain global information (e.g., machine utilization, operation priority), supporting the policy network in generating superior scheduling decisions.

3.3. Scheduling System Architecture Design

To translate the scheduling algorithm into a practical solution for the clay mineral extraction laboratory, we propose a stable, efficient, and scalable system architecture. This section outlines the conceptual design of this architecture, which serves as a blueprint for future implementation. To further aid the reader, it is important to explicitly distinguish between the components validated via simulation in this study and those that would require future physical integration. The core of this paper’s contribution lies in the design and simulated validation of the Scheduling Decision Layer and the Digital Twin Environment Module. These components were realized in software to demonstrate the effectiveness of the scheduling intelligence. Conversely, the Physical Layer (e.g., robots, centrifuges), the Execution Control Layer (e.g., device drivers and middleware), and the Interaction Application Layer are described as part of the complete conceptual system and would require physical hardware integration and development for real-world deployment. The design aims to bridge the gap between abstract scheduling decisions and the precise actions of physical robots and equipment.

3.3.1. Overall Architecture Design

The system adopts a layered, modular design philosophy to build a closed-loop control system integrating perception, decision-making, execution, and interaction. The overall architecture is divided into four layers: the Physical Layer, Execution Control Layer, Scheduling Decision Layer, and Interaction Application Layer, as shown in Figure 5.
  • Physical Layer: Located at the bottom, it is the physical foundation of experimental operations. This layer would comprise all hardware units that execute specific experimental tasks, such as gantry robots, a fixed robotic arm, a mobile robot, centrifuges, and sample racks. These devices are the ultimate executors of scheduling commands.
  • Execution Control Layer: Serving as the bridge between the physical devices and the upper layers, it translates the macro-instructions from the Scheduling Decision Layer (e.g., “move Sample 1 from Point A to the centrifuge at Point B”) into concrete, hardware-executable action sequences (e.g., drive motors, control grippers). Through device drivers and middleware such as the Robot Operating System (ROS), this layer achieves precise control of physical devices and continuously collects device states, sensor data, and task-completion signals for upward feedback.
  • Scheduling Decision Layer: This layer would act as the “brain” of the entire system, whose core is the experimental-process scheduling agent developed in Section 3.2. Running on an independent server, it receives task instructions from the Interaction Application Layer and status feedback from the Execution Control Layer. Internally, it maintains a digital-twin environment that is synchronized with the physical world in real time. Based on the state of this environment, the scheduling agent uses a trained deep reinforcement learning model to make sequential decisions, generating optimal “operation–machine” assignment commands and delivering them to the Execution Control Layer.
  • Interaction Application Layer: As the sole entry point for user interaction, it provides a graphical Human–Machine Interface (HMI) for laboratory operators. Through this interface, users input initial sample information, issue experimental tasks, monitor end-to-end progress in real time, handle abnormal events, and view data reports. This layer converts user intent into standardized task data and passes it to the Scheduling Decision Layer. This layered architecture separates decision-making from execution, achieving high cohesion and low coupling between layers. Iterative upgrades to the scheduling algorithms do not affect low-level hardware control, while additions or replacements of hardware devices only require adaptation within the Execution Control Layer, thereby ensuring system flexibility, maintainability, and scalability.

3.3.2. Core Module Function Design

To realize the architecture above, the system is divided into four core functional modules that collaborate with one another.
1.
Scheduling Decision Core Module
This module is the runtime carrier of the scheduling agent. Its main functions include:
  • Task parsing and initialization: It receives sample information from the HMI (e.g., number of samples, types, special requirements), parses it, initializes the digital-twin environment, and generates the initial scheduling problem instance.
  • State perception and decision-making: It is designed to continuously listen for state changes in the digital-twin environment. At each decision moment (e.g., experiment start, completion of an operation), it invokes the HGNN-based state feature extraction network to process the current state, and the PPO policy network outputs the optimal scheduling action.
  • Command generation and dispatch: It converts the decided scheduling action (e.g., “assign operation $O_{ij}$ to machine $M_k$”) into standardized task commands and dispatches them to the Execution Control Layer via the communication module.
2.
Digital Twin Environment Module
This module is a dynamic, high-fidelity mapping of the physical laboratory in the information space and forms the basis for the scheduling agent’s decision-making. It addresses the gap between theoretical models and physical execution. Its core functions are as follows:
  • Real-time state synchronization: By subscribing to events from the Execution Control Layer (e.g., “operation completed,” “equipment failure,” “new sample arrival”), it updates state variables in real time, including each sample’s current operation stage, each machine’s occupancy status and estimated idle time, and resting-time countdowns for samples. This mechanism specifically addresses the dynamic uncertainties such as fluctuating processing times (e.g., sample heterogeneity, sensor drift), unexpected reagent delays, or the insertion of new samples. The digital twin dynamically reflects these changes, providing the scheduling agent with an up-to-date and accurate representation of the real-world conditions.
  • Dynamic environment modeling: It accurately models the core constraints in the experimental process, such as precedence constraints between operations, machine compatibility constraints, batch-processing constraints for samples, and strict resting-time constraints.
  • Future state prediction: When the scheduling agent makes a decision, the digital twin can immediately simulate the future state after executing that decision, providing the basis for evaluating the action’s value (reward computation). This allows the agent to “think” and “plan” in a virtual environment that closely mirrors reality.
3.
Human–Machine Interaction Module
This module aims to provide an intuitive, efficient, and information-rich control interface. Its main functions include:
  • Task management: It allows operators to input pending sample information via the interface or voice commands and to start, pause, or terminate the entire experimental process.
  • Visual status monitoring: It presents real-time views of processing progress for each sample, the load status and task queues for each machine, etc., using dynamic Gantt charts and 3D virtual scenes to achieve a “one-screen overview” of the site.
  • Alarm and exception handling: When the Execution Control Layer reports anomalies such as equipment failures or operation errors, the interface immediately pops up an alert and provides options for manual intervention (e.g., retry, skip the operation, manual takeover).
  • Data statistics and analysis: After the experiment, it automatically generates statistical reports, including makespan, equipment utilization, bottleneck-operation analysis, and other key performance indicators (KPIs), providing data support for process optimization.

3.3.3. System Workflow

After system startup, a typical closed-loop workflow proceeds as follows:
1.
Task initialization: The operator inputs a batch of samples (e.g., 40) via the HMI and clicks “Start Experiment.”
2.
Environment construction: The HMI packages the sample information and sends it to the Scheduling Decision Module. The Scheduling Decision Module initializes the digital-twin environment accordingly and generates all pending operations for all samples.
3.
First decision: The scheduling agent observes the initial state of the digital twin ( t = 0), where the first operation of every sample is executable and all machines are idle. The agent computes the optimal first “operation–machine” assignment via the policy network and issues the command.
4.
Command execution and status feedback: The command is transmitted through the communication module to the Execution Control Layer, which dispatches the appropriate robots to perform the operation. During execution, robot statuses (e.g., “moving,” “operating”) are fed back to the HMI in real time.
5.
Event triggering and state update: Upon completion of the operation, the Execution Control Layer publishes an “operation completed” event. Both the Digital Twin Environment Module and the HMI subscribe to this event. The digital twin updates its internal state (e.g., operation completed, corresponding machine becomes idle, the sample enters resting, or the next operation becomes executable), while the HMI updates the experiment progress.
6.
Iterative decision-making: The state update in the digital twin constitutes a new decision moment. The scheduling agent again observes the new state, selects the optimal action among all currently feasible “operation–machine” pairs, and issues a new command.
7.
Dynamic scheduling: During the experiment, if new samples are added midstream via the HMI or a temporary equipment failure occurs, the corresponding events trigger immediate updates in the digital twin. Based on the latest real state of the environment, the scheduling agent dynamically adjusts subsequent scheduling strategies.
8.
Process completion: This closed-loop cycle continues until all operations for all samples have been scheduled and executed. The system records the total makespan and generates an analysis report.
Through the design above, the system organically integrates the deep reinforcement learning scheduling algorithm with real laboratory automation hardware, forming a complete, intelligent, and automated closed-loop scheduling solution from task input to result output.

4. Results

To comprehensively evaluate the performance of the proposed deep reinforcement learning–based scheduling system (hereafter referred to as HGNN-PPO), a series of simulation experiments were conducted. The experiments aimed to assess the scheduling agent from two dimensions: (1) the agent’s own learning and convergence capability, and (2) performance comparison with traditional scheduling methods in static scheduling problems.

4.1. Experimental Environment and Parameter Settings

The experiments were performed on a server equipped with an Intel Core i9-12900K CPU, 64 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. The algorithm framework was implemented in Python 3.9 and PyTorch 1.12. The PPO algorithm [24] was used for training, and the key hyperparameters are listed in Table 2. Both the actor and critic networks of the PPO algorithm use a 3-layer multi-layer perceptron with a hidden layer size of 64.
To simulate the clay-mineral extraction scenario, the number of machines $m$ was fixed at 4 (three gantries and one mobile robot). We defined three problem scales based on the number of jobs (samples) $n$: small ($n$ = 10), medium ($n$ = 20), and large ($n$ = 40). To ensure the highest possible performance for a fair and rigorous comparison against baseline methods at each scale, we trained a separate agent for each problem size. Each sample contained eight operations, and the processing time of each operation on compatible machines was randomly drawn from a uniform distribution U[5, 30] min. Resting times were sampled from U[0, 60] min. These numerical ranges were determined based on the operational specifications of the actual equipment (e.g., gantry robots, centrifuges) and established protocols within our automated laboratory, ensuring the practical relevance of the simulation instances.
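To clarify how such instances can be constructed, a plausible generator matching these statistics is sketched below; the compatible-machine sampling rule is an assumption, since the paper specifies only the distributions of the processing and resting times.

```python
import random

def generate_instance(n_jobs, n_machines=4, n_ops=8, seed=None):
    """One job = one sample with n_ops schedulable operations; times in minutes."""
    rng = random.Random(seed)
    jobs = []
    for _ in range(n_jobs):
        ops = []
        for _ in range(n_ops):
            compatible = rng.sample(range(n_machines), k=rng.randint(1, n_machines))  # assumed rule
            ops.append({
                "proc": {k: rng.uniform(5, 30) for k in compatible},  # processing time ~ U[5, 30] min
                "rest": rng.uniform(0, 60),                           # resting time ~ U[0, 60] min
            })
        jobs.append(ops)
    return jobs

small_instance = generate_instance(n_jobs=10, seed=42)   # a small-scale (n = 10) instance
```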

4.2. Benchmark Methods

To verify the superiority of the HGNN-PPO algorithm, three representative categories of baseline methods were selected for comparison:
1.
Classical Heuristic Rules [27]:
FIFO (First-In-First-Out): schedules operations in the order of sample arrival—simple and intuitive.
SPT (Shortest Processing Time): prioritizes the operation with the shortest processing time among currently available ones.
2.
Meta-heuristic Algorithm:
Genetic Algorithm (GA) [28]: a classical global search method for combinatorial optimization, widely applied in scheduling problems. To ensure a robust and fair comparison, we implemented a GA specifically tailored for the FJSP structure using a two-vector encoding strategy (comprising an operation sequence vector and a machine assignment vector). The hyperparameters were fine-tuned based on preliminary experiments: the population size was set to 100, and the maximum number of generations was set to 200. We employed binary tournament selection to choose parents. For the genetic operators, we utilized Precedence Operation Crossover (POX) for the operation sequence and uniform crossover for machine assignment, with a combined crossover probability of 0.8. A swap mutation operator was applied with a probability of 0.1 to prevent premature convergence. The algorithm terminates when the maximum number of generations is reached or if the best fitness value remains unchanged for 20 consecutive generations. An illustrative sketch of this encoding and the POX crossover is given after this list.
3.
Simplified Reinforcement Learning Method:
PPO-MLP: uses the same PPO framework but replaces the HGNN state-feature extractor with a simple multilayer perceptron (MLP). This baseline is used to validate the effectiveness of the HGNN module.
4.
Google OR-Tools:
Google OR-Tools is a powerful constraint programming solver with strong performance on industrial scheduling problems. It provides exact or near-optimal reference solutions against which the solution quality of the proposed HGNN-PPO algorithm can be assessed.
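For the GA baseline described in item 2, the two-vector encoding and the POX crossover can be illustrated as follows; this is a schematic reconstruction (the kept-job subset rule and helper names are ours), not the exact implementation used in the experiments.

```python
import random

def pox_crossover(seq1, seq2, n_jobs, rng=random):
    """Operation-sequence vectors list job indices, one gene per operation of that job.
    Genes of a randomly kept job subset stay in their positions from seq1; the remaining
    positions are filled with the other jobs' genes in the order they appear in seq2."""
    keep = set(rng.sample(range(n_jobs), k=max(1, n_jobs // 2)))
    fill = (g for g in seq2 if g not in keep)
    return [g if g in keep else next(fill) for g in seq1]

def uniform_machine_crossover(assign1, assign2, rng=random):
    """Machine-assignment vectors hold one machine index per operation position."""
    return [a if rng.random() < 0.5 else b for a, b in zip(assign1, assign2)]

# Example: 3 jobs x 2 operations; a chromosome is (operation sequence, machine assignment).
p1_seq, p2_seq = [0, 1, 0, 2, 1, 2], [2, 0, 1, 1, 2, 0]
child_seq = pox_crossover(p1_seq, p2_seq, n_jobs=3, rng=random.Random(1))
```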

4.3. Performance Evaluation Metrics

The following four key metrics were used to measure the performance of different scheduling strategies:
  • Makespan (Cmax): the total time required to complete all operations of all samples; the core measure of scheduling efficiency (lower = better).
  • Average Machine Utilization (AMU): the ratio of total machine working time to (machine count × makespan); reflects resource-use efficiency (higher = better).
  • Average Flow Time (AFT): the average time each sample spends from system entry to completion; measures per-sample turnaround efficiency (lower = better).
  • Average Calculation Time (ACT): The average wall-clock time consumed by the algorithm to generate a complete scheduling solution for a given batch of samples. This metric evaluates the computational efficiency of the algorithm, which is critical for the system’s ability to perform real-time dynamic scheduling in response to unexpected events (lower = better).
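The first three metrics can be computed directly from a completed schedule, as sketched below; ACT, by contrast, is the measured wall-clock time of the solver and is not derived from the schedule. The schedule representation used here is hypothetical.

```python
def evaluate(schedule, n_machines, arrival_time=0.0):
    """schedule maps each (job, op_index) pair to a (machine, start, end) record."""
    makespan = max(end for _, _, end in schedule.values())                   # Cmax
    busy_time = sum(end - start for _, start, end in schedule.values())
    amu = busy_time / (n_machines * makespan)                                # Average Machine Utilization
    jobs = {job for job, _ in schedule}
    completion = {job: max(end for (j, _), (_, _, end) in schedule.items() if j == job)
                  for job in jobs}
    aft = sum(c - arrival_time for c in completion.values()) / len(jobs)     # Average Flow Time
    return makespan, amu, aft
```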

4.4. Experimental Results and Analysis

4.4.1. Training Performance of the Agent

First, the HGNN-PPO agent’s training process was evaluated. A total of 1000 training epochs were run, and every 10 epochs the model was validated. Figure 6 shows the variation in makespan on the validation set during training.
As seen in the figure, the makespan on the validation set gradually decreases as training progresses and stabilizes after around 900 epochs. This indicates that the agent successfully learned an effective scheduling policy—by continuously interacting with the environment, it optimized its decision-making behavior to minimize makespan. This confirms the effectiveness of the Markov-decision-process modeling and PPO-based training framework proposed in this paper.

4.4.2. Comparison of Scheduling Performance

We compared HGNN-PPO with baseline methods on static problem instances of small, medium, and large scales. For each scale, 10 instances were randomly generated, and the mean performance values were recorded (Table 3).
To visualize the comparison more clearly, the results are plotted as bar charts in Figure 7.
Combining Table 3 and Figure 7, it is evident that across all scales, the proposed HGNN-PPO method achieved the best performance in Makespan, AMU, and AFT. Compared with the next-best Genetic Algorithm (GA), HGNN-PPO reduced makespan by about 3.2% and improved machine utilization by about 2.1% on large-scale problems (n = 40). The performance advantage of HGNN-PPO becomes increasingly pronounced as problem size grows, demonstrating its excellent performance scalability. While separate models were trained for each scale, this result shows that the underlying HGNN-PPO approach is highly effective at tackling larger and more complex scheduling tasks, maintaining and even extending its performance gap over other methods. Furthermore, HGNN-PPO outperformed PPO-MLP significantly—on large-scale instances, the makespan was reduced by over 9%, confirming that the heterogeneous graph neural network effectively captures complex constraints and topological relationships between operations and machines, yielding higher-quality state representations that guide superior scheduling decisions.
In terms of algorithm efficiency, the analysis of the Average Calculation Time (ACT) metric reveals critical differences. The HGNN-PPO method demonstrated outstanding computational speed, generating a complete schedule in just 0.68 s for large-scale (n = 40) problems. This is significantly faster than the GA, which required 21.3 s, and the commercial solver OR-Tools, which took 36.8 s. This sub-second solving capability is crucial for dynamic scheduling, as it enables the system to respond rapidly to real-time events in the laboratory, such as unexpected equipment failures or the urgent insertion of new samples. While simple heuristic rules like FIFO and SPT offer the fastest computation (<0.1 s), their short-sighted decision-making prevents global optimization, resulting in markedly inferior solution quality and making them unsuitable for primary optimization.
It is also noteworthy that while OR-Tools, as a powerful constraint programming solver, achieved the best or near-optimal Makespan values, this high solution quality came at a significant computational cost, with solving times being over 50 times longer than HGNN-PPO on large instances. This highlights that HGNN-PPO strikes an excellent balance between solution quality and computational efficiency. It can generate a high-quality scheduling solution, comparable or superior to that of a fine-tuned GA, in a fraction of the time. This unique combination of speed and quality makes it highly advantageous for practical, real-time applications where rapid and effective decision-making is paramount.

4.4.3. Ablation Study on HGNN Components

To gain a deeper understanding of the contribution of specific architectural choices within our improved Heterogeneous Graph Neural Network (HGNN), we conducted an ablation study. This study quantifies the impact of the multi-head attention mechanism for global state embedding, the self-attention supplement for machine nodes, and the sequential nature of the two-stage embedding process. We compare the full HGNN-PPO model against simplified variants and the PPO-MLP baseline on large-scale problems (n = 40). The results are presented in Table 4.
  • HGNN-PPO (Full Model): The complete proposed model, incorporating all enhancements.
  • HGNN-PPO (w/o Global Multi-head Attention): This variant replaces the multi-head attention mechanism used for aggregating operation and machine embeddings into a fixed-dimensional global state (as described in Equation (4)) with a simple average pooling operation. This evaluates the impact of adaptively weighting the importance of different node types for the global feature representation.
  • HGNN-PPO (w/o Machine Self-Attention): In the first stage of machine node embedding updates, the self-attention coefficient (α_kk) that allows a machine to consider its own state is removed. Machine nodes only aggregate information from their connected operations. This assesses the importance of a machine’s intrinsic state in its updated representation.
  • HGNN-PPO (Parallel Two-Stage Embedding): The original HGNN uses a sequential two-stage embedding, where operation node updates (Stage 2) explicitly leverage the updated machine embeddings from Stage 1. In this ablated variant, Stage 2 uses the initial machine embeddings, effectively decoupling the sequential information flow. This tests the benefit of the structured, dependent two-stage processing.
The ablation study results clearly demonstrate the individual contributions of the enhanced HGNN components to the overall performance. Each ablated variant shows a performance degradation compared to the full HGNN-PPO model, confirming the efficacy of our architectural choices. Specifically, replacing the global multi-head attention with average pooling led to a Makespan increase of approximately 2.8% (1065.2 vs. 1035.6), indicating that adaptively weighting the importance of different operation and machine embeddings is crucial for forming a rich global state representation. The removal of the self-attention supplement for machine nodes resulted in a Makespan increase of about 4.3% (1080.5 vs. 1035.6), highlighting the importance of machines retaining awareness of their own states alongside aggregated neighbor information. Furthermore, decoupling the sequential information flow in the two-stage embedding by using initial rather than updated machine embeddings in Stage 2 led to a Makespan increase of approximately 5.7% (1095.0 vs. 1035.6). This underscores the benefit of the structured, dependent two-stage processing, where machine states are first refined and then inform the operation state updates. While each component contributes, their combined effect leads to the significant performance gains observed in the full HGNN-PPO. All ablated HGNN variants still outperform the PPO-MLP baseline, reinforcing that the general graph-based representation is superior, and our specific enhancements further optimize this representation.

5. Discussion

The superior performance of our HGNN-PPO method validates the critical role of sophisticated state representation in solving complex scheduling tasks. The significant advantage over the PPO-MLP baseline confirms that the Heterogeneous Graph Neural Network effectively captures the intricate topological relationships and constraints of the workflow. This allows the PPO agent to learn a more globally aware policy, outperforming not only simple heuristics but also strong meta-heuristic methods like Genetic Algorithms, especially as problem complexity increases. While the selected Genetic Algorithm (GA) serves as a representative and widely used metaheuristic benchmark, and the inclusion of OR-Tools provides a powerful solver comparison, we acknowledge that the scope of this study does not cover the entire landscape of state-of-the-art metaheuristics. Other powerful techniques, such as Tabu Search and Simulated Annealing, also exist. A comprehensive benchmark against these methods is a valuable direction for future work to further contextualize the performance of our proposed approach.
From a practical standpoint, this work lays a crucial foundation for a fully automated system. This study is simulation-based and does not yet include implementation on physical hardware. The primary objective was to first validate the feasibility and superiority of the core scheduling intelligence (the HGNN-PPO algorithm) in a highly realistic simulated environment before undertaking the significant engineering effort of physical integration. The simulation parameters were carefully calibrated based on real-world equipment, making the results a reliable predictor of potential performance.
The proposed four-layer architecture, featuring a digital twin, serves as a blueprint for bridging the theory-execution gap. Issues such as communication latency, execution errors, and synchronization are indeed critical engineering challenges that must be addressed during physical implementation. These aspects represent the logical next phase of our research. The current study successfully de-risks this future work by confirming that the proposed scheduling agent is highly effective from an algorithmic perspective.
Despite these advances, limitations suggest future research directions. A key assumption in our current model is the absence of a maximum waiting time constraint, which is valid for the chemically stable samples in the clay mineral extraction workflow. However, we recognize this may not apply to all experimental protocols, especially in time-sensitive biological or chemical assays where sample degradation is a concern. Therefore, future work should focus on several key areas: (1) Modeling Broader Constraints, such as incorporating maximum waiting times to enhance the model’s applicability to a wider range of experimental scenarios. (2) Handling Dynamic Uncertainties, such as equipment failures and urgent job insertions, by integrating techniques like Robust Reinforcement Learning. (3) Multi-Objective Optimization, extending the model to balance competing goals like makespan, energy consumption, and cost. (4) Enhancing Generalization, using transfer learning to enable the agent to adapt to new experimental workflows or laboratory layouts with minimal retraining.
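For item (1), one possible extension, written in the notation of Table 1, augments the resting-time requirement with a hypothetical upper bound on the waiting time; here $S_{i,j+1}$ denotes the start time of the successor operation and $w_{i,j}^{\max}$ is an illustrative parameter that is not part of the current formulation:

\[
C_{i,j} + r_{i,j} \;\le\; S_{i,j+1} \;\le\; C_{i,j} + r_{i,j} + w_{i,j}^{\max},
\qquad
S_{i,j+1} \;=\; C_{i,j+1} - \sum_{k=1}^{m} X_{i,j+1,k}\, t_{i,j+1,k}.
\]

The lower bound corresponds to the resting-time constraint already modeled in this work, while the upper bound would cap how long a sample may wait before its next operation begins, as required by time-sensitive assays.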

6. Conclusions

This study developed an intelligent scheduling system for automated clay mineral extraction, successfully addressing key challenges in laboratory efficiency and coordination. The primary contributions are as follows:
  • Innovative Problem Modeling: We are the first, to our knowledge, to model the automated clay mineral extraction process as a Flexible Job Shop Scheduling Problem (FJSP) with resting time constraints. While the mathematical formulation of FJSP with time lags has been studied in the Operations Research literature, its application to this specific automated chemical workflow is novel and provides an accurate theoretical foundation for its optimization.
  • Advanced Core Algorithm: We proposed an improved HGNN-PPO scheduling agent that effectively extracts features from complex states, enabling superior, globally aware decisions compared to baseline methods. An ablation study further demonstrated the individual contributions of key HGNN components, such as the multi-head attention for global state embedding, the self-attention supplement for machine nodes, and the sequential two-stage embedding, to this enhanced performance.
  • Practical System Architecture: We designed a four-layer system architecture with a digital twin to bridge the theory-execution gap, enabling robust, closed-loop control for real-world deployment.
  • Verified Performance Superiority: Simulations confirmed that HGNN-PPO significantly outperforms traditional and baseline intelligent algorithms across all scales in key metrics, including makespan, machine utilization, and flow time.
In summary, this research delivers a comprehensive and effective solution for intelligent scheduling in automated laboratories, offering significant theoretical and practical value for advancing experimental sciences.

Author Contributions

Conceptualization, B.Z.; methodology, B.Z. and L.H.; software, L.H.; validation, B.Z. and L.H.; formal analysis, L.H.; investigation, Z.L. and S.Z.; resources, Y.L. and Z.L.; data curation, Z.L.; writing—original draft preparation, L.H.; writing—review and editing, L.H.; visualization, B.Z.; supervision, Y.L. and Z.L.; project administration, B.Z., S.Z., Y.L. and Z.L.; funding acquisition, Y.L. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank Harbin Institute of Technology for providing the working and experimental environment.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AFT	Average Flow Time
AMU	Average Machine Utilization
DRL	Deep Reinforcement Learning
FIFO	First-In-First-Out
FJSP	Flexible Job Shop Scheduling Problem
GA	Genetic Algorithm
GAT	Graph Attention Network
GNN	Graph Neural Network
HGNN	Heterogeneous Graph Neural Network
HGNN-PPO	Heterogeneous Graph Neural Network–Proximal Policy Optimization (the proposed deep reinforcement learning–based scheduling method)
HMI	Human–Machine Interface
KPI	Key Performance Indicator
MDP	Markov Decision Process
MLP	Multi-Layer Perceptron
PPO	Proximal Policy Optimization
ROS	Robot Operating System
SPT	Shortest Processing Time

Figure 1. Clay mineral extraction process.
Figure 2. Clay mineral suspension extraction experimental process.
Figure 3. Workflow of the scheduling algorithm.
Figure 4. The two-stage embedding update.
Figure 5. Overall architecture of the scheduling system.
Figure 6. Makespan curve on the validation set during agent training. Horizontal axis: checkpoints; vertical axis: makespan on the validation set.
Figure 7. Comparison of average makespan across three problem scales for different scheduling methods.
Table 1. Parameters of the FJSP.
Variable	Meaning
n	Number of jobs
m	Number of machines
J_i	The i-th job
n_i	Number of operations in job J_i
M_k	The k-th machine
O_ij	The j-th operation of job J_i
M_ij	Set of available machines for operation O_ij
t_ijk	Processing time of operation O_ij on machine M_k
A_i	Arrival time of job J_i
i, h	Job indices, i, h = 1, 2, ..., n
j, g	Operation indices of J_i and J_h, j = 1, 2, ..., n_i; g = 1, 2, ..., n_h
r_ij	Resting time after operation O_ij
C_ij	Completion time of operation O_ij
X_ijk	Binary variable indicating whether O_ij is executed on M_k (1 if yes, 0 otherwise)
Y_ijhg	Binary variable indicating execution order between O_ij and O_hg (1 if O_ij precedes O_hg, −1 otherwise)
k	Machine index, k = 1, 2, ..., m
Table 2. PPO Algorithm Hyperparameters.
Parameter	Value
Learning Rate	0.0002
γ	0.99
λ	0.95
Clip Epsilon	0.2
Batch Size	20
Epochs	1000
Embedding Dimension	128
Number of Attention Heads	2
Number of HGNN Layers	2
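For clarity, the values in Table 2 can be grouped into a single configuration object when reproducing the training setup. The sketch below assumes a Python implementation; the field names (e.g., gae_lambda, clip_epsilon) are illustrative and not taken from our code.

from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Hyperparameters from Table 2; field names are illustrative.
    learning_rate: float = 2e-4    # optimizer step size (0.0002)
    gamma: float = 0.99            # discount factor (γ)
    gae_lambda: float = 0.95       # λ, commonly the advantage-estimation parameter
    clip_epsilon: float = 0.2      # PPO clipping range
    batch_size: int = 20           # batch size per update
    epochs: int = 1000             # number of training epochs
    embed_dim: int = 128           # HGNN embedding dimension
    num_heads: int = 2             # attention heads
    num_hgnn_layers: int = 2       # stacked HGNN layers

config = PPOConfig()
print(config.clip_epsilon)  # 0.2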
Table 3. Performance Comparison on Static Scheduling Problems.
Problem Size (n)	Method	Makespan (min) ↓	AMU (%) ↑	AFT (min) ↓	ACT (s) ↓
10	FIFO	345.2	78.5	210.8	<0.1
10	SPT	310.6	82.1	188.4	<0.1
10	GA	288.1	86.8	175.2	2.52
10	PPO-MLP	302.5	83.3	185.1	0.18
10	HGNN-PPO	281.4	88.2	170.3	0.25
10	OR-Tools	240.1	98.6	146.6	4.45
20	FIFO	610.8	79.2	435.1	<0.1
20	SPT	598.3	81.5	401.7	<0.1
20	GA	545.6	87.3	360.4	6.33
20	PPO-MLP	577.2	84.0	385.9	0.30
20	HGNN-PPO	530.9	89.1	351.6	0.44
20	OR-Tools	459.4	99.2	298.8	11.10
40	FIFO	1228.5	78.9	850.2	0.10
40	SPT	1195.1	80.6	815.5	0.11
40	GA	1070.3	88.4	705.8	21.3
40	PPO-MLP	1140.7	83.1	766.3	0.52
40	HGNN-PPO	1035.6	90.5	682.4	0.68
40	OR-Tools	890.8	99.5	588.1	36.8
Table 4. Ablation Study on HGNN Components for Large-Scale Problems (n = 40).
Method	Makespan (min) ↓	AMU (%) ↑	AFT (min) ↓
HGNN-PPO (Full Model)	1035.6	90.5	682.4
HGNN-PPO (w/o Global Multi-head Attention)	1065.2	89.0	705.8
HGNN-PPO (w/o Machine Self-Attention)	1080.5	88.2	718.1
HGNN-PPO (Parallel Two-Stage Embedding)	1095.0	87.7	725.5
PPO-MLP	1140.7	83.1	766.3