Hybrid Attention-Augmented Deep Reinforcement Learning for Intelligent Machining Process Route Planning

Wang, Ruizhe; Wang, Minrui; Du, Ziyan; Dong, Xiaochuan; Peng, Yibing

doi:10.3390/machines14030343


by Ruizhe Wang, Minrui Wang, Ziyan Du, Xiaochuan Dong and Yibing Peng *

School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Machines 2026, 14(3), 343; https://doi.org/10.3390/machines14030343

Submission received: 2 February 2026 / Revised: 13 March 2026 / Accepted: 17 March 2026 / Published: 18 March 2026

(This article belongs to the Section Advanced Manufacturing)


Abstract

Machining process route planning (MPRP) is vital for autonomous manufacturing yet remains challenging under complex, multi-dimensional engineering constraints. This paper proposes an attention-augmented deep reinforcement learning (DRL) framework to achieve intelligent process orchestration. First, an Optional Process Attribute Adjacency Graph (OPAAG) is established to formally model the “feature–process–resource–constraint” coupling, enhancing the agent’s perception of manufacturing semantics. The architecture synergistically integrates Graph Attention Networks (GAT) to perceive spatial benchmark dependencies and a Transformer-based encoder to capture sequential resource correlations within variable-length machining chains. Furthermore, a dynamic action masking mechanism is integrated to guarantee a 100% constraint satisfaction rate during both training and inference stages. Experimental evaluations across diverse part geometries demonstrate that the proposed method offers significant advantages in cost optimization, inference efficiency, and topological stability compared to traditional heuristic algorithms and standard DRL models. By effectively distilling the search space and maintaining action feasibility, the framework provides an efficient and robust solution for autonomous process planning in complex industrial scenarios.

Keywords: machining process route planning; deep reinforcement learning; graph attention network; transformer; action masking

1. Introduction

In the paradigm of Industry 4.0, machining process route planning (MPRP) serves as a critical component of intelligent manufacturing systems. It acts as the bridge between design and production within Computer-Aided Process Planning (CAPP) by identifying machining features, selecting resources, and determining the optimal sequence of operations. The efficiency of MPRP directly impacts product quality, delivery cycles, and manufacturing costs in high-mix, low-volume production environments [1,2,3].

1.1. Traditional and Heuristic Approaches in MPRP

Early research focused on knowledge-driven systems. Li et al. [4] utilized ontology-based models to represent manufacturing knowledge, while Waiyagan and Bohez [5] developed rule-based systems for feature-based planning. Recent extensions include the use of knowledge graphs for dynamic reasoning by Long et al. [6] and semantic frameworks for cost estimation by Hernandes et al. [7]. Concurrently, MPRP was widely treated as a complex combinatorial optimization problem. Hua et al. [8] employed Genetic Algorithms (GA) to optimize tool changes and setup orientations, while Liu et al. [9] applied Ant Colony Optimization (ACO) to navigate large-scale constraint spaces. Multi-objective meta-heuristics have since been applied to diverse scenarios, including cellular manufacturing reliability [10], non-cutting path optimization [11], and intelligent 3D process generation [12]. Liu et al. [13] further integrated process planning with AGV scheduling tasks, and Peng et al. [14] optimized routes specifically for remanufacturing. Despite their mathematical rigor, these heuristic methods often suffer from high computational costs and limited adaptability when facing the “curse of dimensionality” in complex prismatic parts.

1.2. Deep Learning and Reinforcement Learning in Manufacturing

Advancements in Deep Learning (DL) have enabled automated feature extraction from CAD models. Ding et al. [15] used attribute adjacency graphs (AAG) for feature modeling, and other studies [16] explored multi-view representations. Lei et al. [17] developed MFPointNet for direct feature recognition from point clouds. More recently, graph-based encodings have been used to capture topological interactions, such as the graph convolutional network (GCN) models proposed by Wang et al. [18] and the attention-based semantic frameworks by Du et al. [19]. In parallel, Deep Reinforcement Learning (DRL) has emerged as a powerful paradigm for real-time decision-making in MPRP. Zhang et al. [20] pointed out the advantages of DRL in solving the constrained combinatorial optimization problems that arise in MPRP. Wu et al. [21] proposed an early DRL framework for rapid process generation. Zhao et al. [22] introduced Graph Attention Networks (GAT) into green manufacturing scheduling, and Xiao et al. [23] utilized graph convolutional RL for energy-aware planning. To enhance stability, Zhang et al. [24,25] explored proximal policy optimization (PPO) with specialized reward shaping and exploration mechanisms. Additionally, Multi-Agent RL has been applied to systemic process control [26] and dual-resource scheduling [27].

1.3. Research Gaps and Research Objectives

Despite these advancements, two critical research gaps persist in the current DRL-based MPRP frameworks:

(1) Over-simplification of practical industrial scenarios: Existing DRL models, despite their success, fail to reflect the multi-layered complexity of practical machining scenarios. Specifically, they do not simultaneously account for two levels of dependency: the spatial benchmark (datum) dependencies between features (Zhu et al. [28]) and the internal sequential correlations within individual process chains, especially when handling variable-length sequences (Kwon et al. [29]; Li et al. [30]).
(2) Inefficient distillation of the search space: Current research often relies on coarse data representations that lack deep refinement of manufacturing semantics (Zhang et al. [31]). The result is a redundant and unnecessarily large solution space. Consequently, optimizing the selection among multiple Optional Processes (Su et al. [32]) while maintaining 100% action feasibility remains a challenge.

In light of these gaps, the objective of this research is to develop a comprehensive intelligent MPRP framework that can simultaneously perceive multi-dimensional manufacturing semantics and strict engineering constraints. We aim to construct an environment model that preserves the intricate “feature–process–resource” coupling and design a hybrid neural architecture capable of capturing both global topological constraints and local sequence correlations.

1.4. Proposed Framework, Contributions, and Organization

To achieve these objectives, this paper introduces a Hybrid Attention-augmented DRL framework that differs from prior studies in three respects. First, an Optional Process Attribute Adjacency Graph (OPAAG) formally maps manufacturing constraints into a structured tensor space, giving the agent a high-fidelity numerical representation of machining semantics. Second, a Hybrid Attention mechanism combines GAT and Transformer layers to extract multi-scale features: the GAT layers perceive spatial benchmark dependencies, and the Transformer encoder captures sequential resource correlations within variable-length machining chains. Third, to ensure strict engineering feasibility, a dynamic action masking mechanism is integrated to guarantee a 100% constraint satisfaction rate (CSR) during both training and real-time inference. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art heuristic and DRL methods in cost optimization, convergence speed, and stability.

The remainder of this paper is organized as follows: Section 2 formalizes the MPRP problem and its symbolic representation; Section 3 details the OPAAG construction and defines the DRL agent and the state space; Section 4 presents the proposed hybrid network architecture; Section 5 discusses the experimental results and comparative analysis; and Section 6 concludes the paper.

2. Problem Formulation and Overall Framework

This section establishes a rigorous mathematical foundation for intelligent MPRP. First, the MPRP problem is formalized as a Markov Decision Process (MDP). Second, the graph representation model of the part and the symbolic system for optional operation sets are defined. Finally, the integrated “train-application” dual-stage architecture, which incorporates spatial graph encoding and sequential optimization, is introduced.

2.1. Problem Formalization as MDP

The generation and optimization of machining process routes is essentially a complex sequential decision-making problem under multi-dimensional constraints [33,34,35]. To achieve autonomous decision-making, the MPRP problem is formalized as a Markov Decision Process (MDP) characterized by the quintuple $(S, A, P, R, \gamma)$. At each time step $t$, the agent perceives the system state $s_t$ and executes an action $a_t$ based on the policy $\pi_\theta$. The state $s_t$ is defined as a fusion of static structural attributes $s$ and dynamic process status $d_t$.

The transition $s_t \to s_{t+1}$ reflects the iterative update of the machining schedule, as well as the changes in machine and tool status after a specific operation is selected.

2.2. Symbolic Representation of Part and Operations

To enable deep neural reasoning, the part is structured as an Optional Process Attribute Adjacency Graph (OPAAG), denoted as $G = (V, E)$. The vertex set $V$ represents individual machining features, and the edge set $E$ captures directed spatial benchmark (datum) dependencies based on the Datum Principle. Each machining feature $f_i \in V$ is associated with an Optional Process Set $O_i = \{o_{i,1}, o_{i,2}, \dots, o_{i,k}\}$. A specific operation $o_{i,j} \in O_i$ is represented as a multi-dimensional vector:

$$o_{i,j} = [\mathit{Process}, \mathit{TAD}, \mathit{Tool}, \mathit{Machine}, \mathit{Cost}] \tag{1}$$

where $i$ is the feature index, $j$ is the process index within the current feature, $\mathit{Process}$ denotes the machining step type, $\mathit{TAD}$ is the tool approach direction, $\mathit{Tool}$ and $\mathit{Machine}$ are the required manufacturing resources, and $\mathit{Cost}$ represents the estimated execution cost. This vectorization ensures that the agent can perceive the multi-dimensional coupling of “feature-process-resource” while maintaining full compatibility with the structural representation $G$.

The OPAAG structure, including its topological edges and node attributes, is directly mapped into the MDP state space through a tensorization process detailed in Section 3.2.1.
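As a rough illustration of the symbolic model in this section, the sketch below holds an OPAAG in plain Python containers. The class names, field names, and the example part are hypothetical and not taken from the paper; in particular, storing each edge as a (datum, dependent) pair is an assumed convention, since the text does not fix the edge direction.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """One optional operation o_{i,j} = [Process, TAD, Tool, Machine, Cost]."""
    process: str
    tad: str
    tool: str
    machine: str
    cost: float


@dataclass
class OPAAG:
    """Features as vertices, datum dependencies as directed edges."""
    features: list = field(default_factory=list)
    edges: set = field(default_factory=set)      # (datum, dependent) pairs
    options: dict = field(default_factory=dict)  # feature -> list of Operation

    def add_feature(self, name, operations):
        self.features.append(name)
        self.options[name] = list(operations)

    def add_datum_edge(self, datum, dependent):
        # Assumed direction: the datum feature points to the feature it locates.
        if datum not in self.options or dependent not in self.options:
            raise KeyError("both features must be added before linking them")
        self.edges.add((datum, dependent))

    def datums_of(self, feature):
        """Return the datum features that must precede `feature`."""
        return sorted(d for d, f in self.edges if f == feature)


# Hypothetical two-feature part: a hole located from a plane.
graph = OPAAG()
graph.add_feature("plane_1", [Operation("rough_milling", "+Z", "T01", "M1", 4.0)])
graph.add_feature("hole_1", [
    Operation("drilling", "+Z", "T05", "M1", 2.0),
    Operation("drilling", "+Z", "T06", "M2", 3.0),
])
graph.add_datum_edge("plane_1", "hole_1")
```

Keeping the optional operations in a per-feature list mirrors the set $O_i$, so the number of entries for a feature is the number of choices the agent has for it.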

2.3. Architecture of the Hybrid Attention-Augmented DRL Framework

The proposed framework, as shown in Figure 1, adopts a “train-application” dual-stage architecture to balance offline learning efficiency with online responsiveness. This design allows the agent to internalize complex machining logic during learning and to deploy it for autonomous planning in real time.

2.3.1. Training Stage

The training stage facilitates iterative parameter optimization through intensive agent–environment interaction. During the initial information modeling process, the part’s MBD model is analyzed via the AAGNet algorithm [36] to extract geometric parameters and machining constraints, which are then used to construct the OPAAG. The agent perceives the complex system state $s_t$ via a Transformer-GAT augmented PPO framework. Specifically, a two-layer Graph Attention Network (GAT) encodes spatial benchmark (datum) dependencies from the OPAAG structure. Simultaneously, the Transformer encoder serves as a bridge to capture long-range sequential correlations within variable-length operation sequences, ensuring resource stability and decision coherence. By executing actions $a_t$ (selections of specific $o_{i,j}$) filtered by a dynamic masking mechanism and receiving multi-objective rewards $r_t$, the agent iteratively refines its policy $\pi_\theta$ and value evaluation $V_\omega$. This recursive feedback loop continues until the policy converges toward an optimal mapping between part topologies and cost-efficient routes.

2.3.2. Application Stage

Upon transitioning to the application phase, the matured model autonomously generates process routes for novel parts without requiring further parameter modifications. The process begins with a consistent pre-treatment of the new part to ensure structural compatibility, after which its OPAAG representation is fed into the inference engine. The agent then identifies the optimal action sequence through a series of sequential decisions, leveraging its internalized knowledge of engineering constraints and resource optimization. The decision process terminates when the machining schedule for all features reaches unity (all features processed), ultimately outputting a complete, constraint-compliant, and cost-optimal route that is ready for manufacturing execution.

3. DRL Agent Construction

3.1. Construction of Optional Process Attribute Adjacency Graph

The OPAAG serves as the pivotal bridge connecting raw CAD data with deep learning inference [37]. The construction process transforms unstructured part models into structured graph models enriched with comprehensive machining semantics through the following four stages.

3.1.1. Feature Recognition and Attribute Extraction

The initialization of the OPAAG begins with the automated parsing of the part’s MBD model via the AAGNet intelligent feature recognition algorithm [36]. Utilizing a Graph Convolutional Network (GCN) architecture, AAGNet learns the intrinsic topological connectivity of B-rep faces to identify typical machining features (e.g., holes, slots, planes). For each identified feature $f_i$, a multi-dimensional attribute set, including geometry (dimensions, coordinates), precision ($IT$, $Ra$), and engineering constraints ($\mathit{TAD}$, allowance), is extracted and encapsulated into a node $v_i \in V$, as summarized in Table 1. Simultaneously, based on the Datum Principle, the algorithm identifies the locating reference for each feature and constructs directed edges $e_{ij} \in E$. These edges are explicitly labeled as “Datum Dependent,” forming the foundational topology of the graph and representing the hard precedence constraints required for sequential decision-making, as shown in Figure 2.

3.1.2. Process Chain Matching

As shown in Figure 3, following feature identification, the system performs Process Chain Matching to correlate each node $v_i$ with a feasible sequence of machining steps. The matching logic is driven by the feature type and precision requirements extracted in the previous stage. As detailed in Table 2, the system retrieves typical process routes (e.g., “Drilling $\to$ Expanding $\to$ Reaming” for high-precision holes) from the database to form the process chain $L_i$. To ensure manufacturing integrity, the selection strictly adheres to the hierarchical constraint: “Rough Machining $\to$ Semi-finish Machining $\to$ Finish Machining.” This sequential logic ensures progressive material removal and minimizes thermal deformation, providing the categorical basis for the $\mathit{Process}$ attribute within the operation vector.

3.1.3. Processing Resource Matching

Processing resource matching transforms the abstract process steps into executable operations by assigning physical tools ($\mathit{Tool}$) and machine tools ($\mathit{Machine}$) from digitized repositories summarized in Table 3. This procedure populates the specific values for the optional operation vector $o_{i,j} = [\mathit{Process}, \mathit{TAD}, \mathit{Tool}, \mathit{Machine}, \mathit{Cost}]$. Valid matching is determined by aligning feature requirements with resource capabilities, ensuring the tool geometry fits the feature accessibility and the machine repeatability satisfies the target $IT$ grade. For each feature $f_i$, the collection of all valid resource–process combinations forms the Optional Operation Set $O_i$, which defines the action space boundaries for the DRL agent.

3.1.4. Processing Cost Calculation

The quantitative evaluation of the $\mathit{Cost}$ attribute provides the objective basis for the DRL reward function. For a specific operation $o_{i,j}$, the effective cutting time $T_c$ is calculated based on the material removal volume $V_{mr}$ and the Material Removal Rate (MRR) [38]:

$$\mathit{MRR} = v_c \times f_z \times Z \times a_p \tag{2}$$

$$T_c = \frac{V_{mr}}{\mathit{MRR}} \tag{3}$$

where $v_c$ is cutting speed, $f_z$ is feed per tooth, $Z$ is the number of teeth, and $a_p$ is the depth of cut. The machining cost is derived as $C_{i,j} = k_p \cdot T_c$. To ensure a uniform gradient signal for policy optimization, the raw costs are normalized to a standard interval of $[0, 1]$ relative to the extrema within each set $O_i$:

$$\overline{C}_{i,j} = \frac{C_{i,j} - C_{min,i}}{C_{max,i} - C_{min,i}} \tag{4}$$

where $C_{max,i}$ and $C_{min,i}$ are the maximum and minimum costs within the optional operation set $O_i$ for feature $f_i$.

This normalized cost ensures that the agent can effectively distinguish between various resource combinations during the training process. Processing resource matching, processing cost calculation, and the final synthesized OPAAG are shown in Figure 4.
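The cost pipeline of Equations (2) to (4) can be sketched in a few lines of Python. The function names and the numeric inputs are illustrative assumptions; units are left abstract because the text does not state them, and returning 0 for a set whose costs are all equal is an assumed convention for the case where Equation (4) would divide by zero.

```python
def material_removal_rate(v_c, f_z, teeth, a_p):
    """Eq. (2) as stated in the text: MRR = v_c * f_z * Z * a_p."""
    return v_c * f_z * teeth * a_p


def cutting_time(volume_removed, mrr):
    """Eq. (3): T_c = V_mr / MRR."""
    if mrr <= 0:
        raise ValueError("MRR must be positive")
    return volume_removed / mrr


def machining_cost(t_c, k_p):
    """C_{i,j} = k_p * T_c, with k_p the cost coefficient used in the text."""
    return k_p * t_c


def normalize_costs(costs):
    """Eq. (4): min-max normalization within one optional operation set."""
    c_min, c_max = min(costs), max(costs)
    if c_max == c_min:
        # Assumed convention: identical costs all map to 0.
        return [0.0 for _ in costs]
    return [(c - c_min) / (c_max - c_min) for c in costs]


# Hypothetical set O_i: same removal volume, three different cutting speeds.
raw_costs = [
    machining_cost(cutting_time(600.0, material_removal_rate(v_c, 0.1, 4, 2.0)), k_p=1.5)
    for v_c in (100.0, 150.0, 300.0)
]
normalized = normalize_costs(raw_costs)
```

Because the normalization is taken per feature, the cheapest option of every feature maps to 0 and the most expensive to 1, regardless of how large the feature is.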

3.2. Intelligent DRL Model

3.2.1. State Definition

The state $s_t$ constitutes the perceptual foundation of the reinforcement learning environment and is formalized as a high-dimensional fusion of static attributes $s$ and dynamic attributes $d_t$, denoted as $s_t = [s \oplus d_t]$, as shown in Figure 5. To bridge the gap between symbolic process graphs and neural network architectures, the environment state is formally mapped into a structured three-dimensional tensor with dimensions $|V| \times |O| \times 55$. In this representation, $|V|$ denotes the total number of machining features, $|O|$ represents the maximum number of candidate process units allowed per feature, and 55 represents the comprehensive feature dimension of each process operation.

This study establishes an explicit mapping from the OPAAG $G = (V, E)$ to the state tensor by directly transforming the OPAAG edge information into the edge information of the state space. This topological constraint is expressed as a static binary adjacency matrix $A \in \{0, 1\}^{|V| \times |V|}$, defined as:

$$A_{ij} = \begin{cases} 1, & \text{if } e_{ij} \in E \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $A_{ij} = 1$ explicitly expresses a manufacturing precedence constraint between features, such as the datum-first principle. This provides a stable structural foundation for the agent to perceive topological dependencies during the planning process.
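A minimal sketch of Equation (5), assuming features are indexed from 0 and each edge is given as an index pair $(i, j)$; the function name and the three-feature example are hypothetical.

```python
def adjacency_matrix(num_features, edges):
    """Eq. (5): A[i][j] = 1 if e_ij is in E, otherwise 0."""
    a = [[0] * num_features for _ in range(num_features)]
    for i, j in edges:
        if not (0 <= i < num_features and 0 <= j < num_features):
            raise IndexError("edge endpoint outside the feature set")
        a[i][j] = 1
    return a


# Hypothetical part: feature 0 is linked by datum edges to features 1 and 2.
A = adjacency_matrix(3, [(0, 1), (0, 2)])
```

The matrix is built once per part and stays fixed for the whole episode, which matches its role as the static part of the state.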

Concurrently, the node information of the state space is constructed by concatenating the static attribute matrix derived from each feature node in the OPAAG with its corresponding dynamic information matrix. A core contribution of this representation is that a 3-dimensional dynamic feature vector is attached separately to each candidate operation $o$ under each feature $v$ within the state tensor. In the 55-dimensional feature vector of each process unit, the first 52 dimensions encode static manufacturing semantics: a 16-dimensional one-hot encoding of the process type, which distinguishes machining methods such as milling, drilling, and reaming; an 8-dimensional vector for the tool approach direction, which captures spatial orientation constraints; a 24-dimensional one-hot vector for the cutting tool required to perform the operation; a 3-dimensional vector identifying the available machine tool; and a 1-dimensional normalized scalar measuring the basic processing cost of the operation ($16 + 8 + 24 + 3 + 1 = 52$).

The remaining 3 dynamic dimensions characterize information that evolves with each discrete time step. The mask vector $M_t$ provides a 1-dimensional binary identifier for each operation to filter the search space and enforce feasibility. For instance, if a datum feature remains unmachined, all operations of its dependent features are marked as 0; once an operation meets the precedence and process constraints, its position is marked as 1. The process schedule $P_t$ uses a 1-dimensional scalar to quantify, as a fraction, the completion degree of the machining chain belonging to that feature. For example, a feature with a three-step process chain has its progress updated to 0.67 upon completing two steps, and the decision process terminates only when all progress values reach unity. The action identifier $L_t$ provides a 1-dimensional binary identifier that records the action performed in the preceding decision step. This identifier is essential for calculating non-cutting costs such as tool changes and clamping adjustments, where additional cost penalties are triggered if the current operation requires resources different from those recorded. This tensorized characterization preserves the global topological associations of the OPAAG while converting symbolic manufacturing information into a numeric format for subsequent feature extraction.
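The 55-dimensional layout described above can be made concrete with a small encoder. This is a sketch under stated assumptions: the text says only that the tool approach direction and machine fields are 8- and 3-dimensional vectors, so treating them as one-hot is an assumption, as are the zero padding of unused operation slots and all function names and example indices.

```python
N_PROCESS, N_TAD, N_TOOL, N_MACHINE = 16, 8, 24, 3
STATIC_DIM = N_PROCESS + N_TAD + N_TOOL + N_MACHINE + 1  # 52 static dimensions
DYNAMIC_DIM = 3                                          # mask, progress, last action
UNIT_DIM = STATIC_DIM + DYNAMIC_DIM                      # 55


def one_hot(index, size):
    if not 0 <= index < size:
        raise IndexError("category index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_unit(process, tad, tool, machine, cost, mask, progress, last_action):
    """Concatenate 52 static and 3 dynamic entries for one process unit."""
    static = (one_hot(process, N_PROCESS) + one_hot(tad, N_TAD)
              + one_hot(tool, N_TOOL) + one_hot(machine, N_MACHINE) + [cost])
    dynamic = [float(mask), float(progress), float(last_action)]
    return static + dynamic


def build_state(units, num_features, max_ops):
    """Nested list shaped |V| x |O| x 55; unused slots stay zero (assumed padding)."""
    state = [[[0.0] * UNIT_DIM for _ in range(max_ops)] for _ in range(num_features)]
    for (i, j), vec in units.items():
        state[i][j] = list(vec)
    return state


# Hypothetical unit: two of three chain steps are done, so progress is 2/3.
unit = encode_unit(process=2, tad=0, tool=5, machine=1, cost=0.4,
                   mask=1, progress=2 / 3, last_action=0)
state = build_state({(0, 0): unit}, num_features=2, max_ops=3)
```

Only the last three entries of each unit change during an episode, so an environment can update them in place and leave the static block untouched.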

3.2.2. Action Definition and Masking Mechanism

The action $a^{(t)}$ is defined as the discrete selection of a single machining operation from the global operation set, serving as the core output of the reinforcement learning agent at each time step $t$. This action is represented in a one-hot vector format, $a^{(t)} \in \{0, 1\}^{V \times K \times M}$, where the design of the vector dimensions corresponds directly to the structural hierarchy of the machining process. Specifically, the first dimension $V$ represents the total number of machining features, indicating which feature is selected for processing; the second dimension $K$ corresponds to the length of the process chain for a single feature, specifying the exact process step to be executed; and the third dimension $M$ denotes the number of available tool–machine tool combinations for that particular process step. Within this vector, only the element corresponding to the selected machining operation $o_{i,j}$ is marked as 1, while all remaining elements are 0. For instance, if a specific index in the vector is activated, it precisely maps to a unique combination of a feature, its current process stage, and the assigned physical resources.

This one-hot characterization is designed to work in synergy with the mask vector $M^{(t)}$ to ensure 100% feasibility of the generated process routes. Due to the strict precedence and resource constraints in manufacturing (such as the requirement that a datum feature must be machined before its dependent features, or the prohibition against repeating a completed process), invalid actions are filtered out before the agent performs sampling. By performing an element-wise multiplication between the neural network’s output probability distribution and the mask vector $M^{(t)}$, the probability of selecting an operation that violates engineering constraints is effectively reduced to zero. This masking mechanism ensures that the agent only makes decisions within the feasible search space, which not only prevents the generation of invalid process sequences but also clarifies the structure for the network to output stable probability distributions and perform efficient sampling selection.
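The element-wise masking step can be sketched as follows. The text specifies only the multiplication; renormalizing the surviving probabilities so that they sum to one before sampling is an added assumption, and the function name and example values are hypothetical.

```python
def masked_distribution(probs, mask):
    """Zero out infeasible actions, then renormalize (assumed step)."""
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total == 0:
        raise ValueError("no feasible action under the current mask")
    return [p / total for p in masked]


# Hypothetical flattened action vector with the second action infeasible.
policy_output = [0.5, 0.3, 0.2]
feasible = [1, 0, 1]
distribution = masked_distribution(policy_output, feasible)
```

An action with mask 0 receives probability exactly 0, so sampling from `distribution` can never return it.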

3.2.3. State Transition and Termination Conditions

The state transition process in this framework follows the formal rule $s_{t+1} = T(s_t, a^{(t)})$, where the transition function $T$ acts exclusively upon the dynamic attribute $d^{(t)}$, while the static attribute $s$ remains invariant because it represents inherent part characteristics. When the agent executes a specific action $a^{(t)}$, the system first identifies the corresponding machining operation $o_{i,j}$, which represents the $j$-th process step of feature $f_i$. Based on this execution, the three components of the dynamic attribute are updated as follows. Within the mask vector $M^{(t+1)}$, all operation masks associated with the completed process step $j$ of feature $f_i$ are set to 0 to prevent redundancy, while the masks for the subsequent process step $j + 1$ are set to 1 to enable progression. Simultaneously, any dependent feature operations that fail to meet their updated datum requirements (such as features relying on $f_i$ when $f_i$ has not yet reached the necessary stage) remain masked as 0.

The update of the process schedule $P^{(t+1)}$ ensures that the agent can accurately perceive the completion degree of the manufacturing task. For the selected feature $f_i$, its progress value is incremented as $P_i^{(t+1)} = P_i^{(t)} + 1/K_i$, where $K_i$ denotes the total number of process steps in the feature’s specific process chain $L_i$. This linear update method allows the state to reflect the real-time machining status of the entire part. Concurrently, the previous operation identifier $L^{(t+1)}$ is updated by setting the position corresponding to the current action $a^{(t)}$ to 1 and all other positions to 0. This identifier serves as the critical reference for calculating tool change and clamping costs in the subsequent time step $t + 1$. In scenarios where an invalid action is selected, the state transition function performs no update, resulting in $s_{t+1} = s_t$. The decision-making episode reaches its termination condition when the process schedule for all features reaches unity ($P_i = 1$ for every feature $f_i$), signifying that the complete process route has been generated and all manufacturing requirements have been satisfied. The logic of state updates and transitions is illustrated in Figure 6.
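The transition and termination rules can be sketched at the level of whole features. Several simplifications are assumptions of this sketch rather than statements from the paper: resource choices within a step are ignored, a dependent feature is unlocked only once its datum features are fully machined (the text leaves the "necessary stage" unspecified), and progress is stored as an integer step count so that repeated additions of $1/K_i$ cannot accumulate rounding error. All names are hypothetical.

```python
def feasible_features(done, chain_len, datums):
    """Indices of features whose next process step may be selected."""
    feasible = []
    for i, k in enumerate(chain_len):
        finished = done[i] >= k
        datums_ready = all(done[d] >= chain_len[d] for d in datums.get(i, []))
        if not finished and datums_ready:
            feasible.append(i)
    return feasible


def step(done, chain_len, datums, feature):
    """Apply one action; an invalid action leaves the state unchanged."""
    if feature not in feasible_features(done, chain_len, datums):
        return list(done), False
    new_done = list(done)
    new_done[feature] += 1  # equivalent to P_i <- P_i + 1/K_i
    return new_done, True


def progress(done, chain_len):
    """Fractional schedule P_i for every feature."""
    return [d / k for d, k in zip(done, chain_len)]


def terminated(done, chain_len):
    """Episode ends when every P_i equals 1."""
    return all(d >= k for d, k in zip(done, chain_len))


# Hypothetical part: feature 0 (one step) is the datum of feature 1 (three steps).
chain_len = [1, 3]
datums = {1: [0]}
state0 = [0, 0]
blocked, ok_blocked = step(state0, chain_len, datums, 1)  # datum not yet machined
state1, ok1 = step(state0, chain_len, datums, 0)
state2, _ = step(state1, chain_len, datums, 1)
state3, _ = step(state2, chain_len, datums, 1)
state4, _ = step(state3, chain_len, datums, 1)
```

The list returned by `feasible_features` plays the role of the mask vector at feature granularity: a feature absent from the list corresponds to mask entries of 0.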

3.2.4. Incentive Mechanism

The incentive mechanism within this framework is constructed as a two-layer reward structure, comprising an instant reward $r_t$ and a cumulative reward $r_c$, designed to guide the agent toward acquiring an optimal machining strategy. The total reward effectively balances the immediate quality of individual actions with the overall goal of generating a complete and compliant process route.

The instant reward $r_t$ evaluates the action taken at each discrete time step, following a design logic that penalizes high-cost operations while rewarding actions that comply with machining constraints. It is formulated as a weighted combination of a cost penalty component and a compliance reward component:

$$r_t = \lambda_1 \cdot \left(1 - \frac{C_m + C_t + C_c}{C_{max}}\right) + \lambda_2 \cdot I_{valid} \tag{6}$$

In this equation, the cost penalty term seeks to minimize resource consumption by evaluating the machining cost $C_m$, the tool change cost $C_t$, and the clamping cost $C_c$. Since tool changes and clamping adjustments represent non-cutting time that significantly impacts productivity, they are explicitly integrated into the penalty. The term $C_{max}$ serves as a preset threshold that normalizes the total cost to the $[0, 1]$ interval, ensuring that low-cost actions yield higher reward values. The compliance reward uses an indicator flag $I_{valid}$, which is 1 for valid actions that satisfy all constraints and 0 for invalid actions. Tool change and clamping costs are treated as fixed penalties: $C_t = 0.3$ is triggered when the tool differs from the previous step, and $C_c = 0.5$ is applied when the machine tool or the tool approach direction (TAD) changes. If resources remain unchanged, both $C_t$ and $C_c$ are set to 0.

The weights are set as $\lambda_1 = 0.7$ for cost and $\lambda_2 = 0.3$ for compliance to balance the relative scales of primary machining costs (which possess larger numerical magnitudes) and the compliance rewards. These values were determined through preliminary grid search experiments to ensure that the agent prioritizes cost optimization while maintaining strict adherence to engineering constraints. The fixed penalties $C_t = 0.3$ and $C_c = 0.5$ are assigned based on empirical industrial time–cost ratios, representing the typical non-cutting overhead for tool and setup changes. A brief sensitivity analysis indicated that the model performance remains stable within a $\pm 10\%$ fluctuation in these parameters, confirming the robustness of the reward design.

The cumulative reward $r_c$ provides a holistic evaluation of the decision-making sequence at the conclusion of each episode. It is designed to provide positive feedback for complete strategy generation; thus, if an episode terminates because all feature machining is successfully finished, the agent receives a higher cumulative reward. Conversely, if the episode ends prematurely due to the continuous selection of invalid actions, a negative reward is administered to discourage unproductive exploration. The synergy between $r_t$ and $r_c$ ensures that the agent optimizes both local action-level efficiency and the global integrity of the process route.
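The instant reward of Equation (6) can be sketched as a single function. The weights and fixed penalties are the values given in the text; the value `c_max=1.8` is an assumption (the largest total when the normalized machining cost is 1 and both penalties fire), since the paper only calls $C_{max}$ a preset threshold. Applying no change penalty on the first step, and the dictionary keys, are likewise illustrative choices.

```python
def instant_reward(c_m, prev_op, cur_op, valid,
                   c_max=1.8, lam1=0.7, lam2=0.3):
    """Eq. (6): weighted cost term plus compliance term."""
    c_t = 0.0
    c_c = 0.0
    if prev_op is not None:  # assumed: no change penalty on the first step
        if cur_op["tool"] != prev_op["tool"]:
            c_t = 0.3        # tool change penalty
        if (cur_op["machine"] != prev_op["machine"]
                or cur_op["tad"] != prev_op["tad"]):
            c_c = 0.5        # clamping penalty
    cost_term = 1.0 - (c_m + c_t + c_c) / c_max
    return lam1 * cost_term + lam2 * (1.0 if valid else 0.0)


# Hypothetical consecutive operations.
op_a = {"tool": "T01", "machine": "M1", "tad": "+Z"}
op_b = {"tool": "T05", "machine": "M2", "tad": "+Z"}
best_case = instant_reward(0.0, op_a, op_a, valid=True)
worst_valid_case = instant_reward(1.0, op_a, op_b, valid=True)
```

Under these assumptions a valid action scores between $\lambda_2 = 0.3$ and 1, so the compliance term acts as a floor and the cost term decides the ranking among valid actions.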

4. Network Structure

The network architecture serves as the primary computational vehicle for achieving intelligent decision-making, specifically designed to adapt to the “graph-structured input” and “sequential process transitions” inherent in process planning. This study adopts the Proximal Policy Optimization (PPO) algorithm as the base framework. To effectively capture both topological dependencies (datum relationships between features) and temporal sequential information (process chain order), a Graph Attention Network (GAT) and a Transformer encoder are integrated as core feature extraction modules. The policy network $\pi_\theta$ and the value network $V_\omega$ share this joint feature extraction module, diverging only at their respective output layers to ensure a consistent interpretation of the environmental state. The detailed configuration of the network is shown in Figure 7.

4.1. Sequential Feature Encoding via Transformer

The core function of the Transformer encoder is to process the temporal data within the process chain $L_i$ of each feature. Traditional recurrent architectures often struggle with variable-length sequences, whereas the Transformer effectively captures internal dependencies through its self-attention mechanism [39,40].

Initially, the process chain of each feature is converted into a raw attribute vector

v_{i, j}^{r a w}

containing the type, cost, and resource requirements of each step. To eliminate scale differences, a Layer Normalization step is performed. The encoder then utilizes Query (

Q

), Key (

K

), and Value (

V

) matrices to compute attention weights, identifying the significance of each process step relative to others in the sequence. The scaled dot-product attention is formulated as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V

(7)
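As an illustration of Equation (7), the following is a minimal pure-Python sketch of scaled dot-product attention. The function names and the toy three-step process chain are assumptions made for this example; the sketch uses a single head and no learned projection matrices, so it is not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Equation (7): softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(wj * v[c] for wj, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights

# Toy process chain with three steps and d_k = 2 (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(X, X, X)
```

Each row of `attn` sums to one and gives the weight that one process step places on every step in the chain, so the same code applies to chains of any length.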

where d_{k} is the dimension of the key vectors. The final output is a fixed-dimensional encoding vector that represents the variable-length process chain information:

\mathrm{Embedding}_{v_{i}} = \mathrm{Transformer}(\mathrm{LayerNorm}(v_{i}^{raw}))

(8)

After each node’s information is transformed into a fixed-dimensional encoding vector, it is combined with the edge information of the Optional Process Attribute Adjacency Graph (OPAAG) to form an intermediate graph-structured state, which is then passed to the subsequent GAT layers.

4.2. Spatial Topology Aggregation via GAT

The GAT [41] layers are responsible for aggregating node association data within the adjacency graph G to strengthen the characterization of datum dependencies between features. Because the machining state of a reference feature directly dictates the feasibility of subsequent feature operations, the attention mechanism is utilized to adaptively assign weights to different adjacent nodes.

A two-layer GAT structure is implemented for feature aggregation, calculating the attention weight α_{i j} between a node u_{i} and its neighbor u_{j} via multi-head attention. This weight reflects the manufacturing impact of adjacent nodes on the current node. The feature aggregation formula is defined as:

h_{i}^{(l + 1)} = σ (\sum_{j \in N_{i}} α_{i j} W^{l} h_{j}^{(l)})

(9)

where h_{i}^{(l + 1)} is the node feature output from the (l + 1)-th GAT layer, σ is the ReLU activation function, N_{i} is the set of adjacent nodes for u_{i}, and W^{l} is the weight matrix of the l-th layer.
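The aggregation in Equation (9) can be sketched in pure Python as follows. The paper does not spell out how α_{i j} is computed, so this sketch assumes the standard single-head GAT scoring from [41] (LeakyReLU of a learned vector applied to the concatenated transformed features, followed by a softmax over N_{i}). The graph, the weights `W`, the attention vector `a`, and the use of self-loops are illustrative assumptions.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_layer(h, neighbors, W, a):
    """Single-head sketch of Equation (9).

    h: dict node -> feature list; neighbors: dict node -> non-empty list N_i;
    W: weight matrix (out_dim rows); a: attention vector of length 2 * out_dim.
    """
    z = {i: matvec(W, hi) for i, hi in h.items()}  # W h_j for every node
    h_next, alpha = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalized scores e_ij = LeakyReLU(a^T [z_i || z_j]).
        e = [leaky_relu(sum(ak * x for ak, x in zip(a, z[i] + z[j]))) for j in nbrs]
        peak = max(e)
        exps = [math.exp(x - peak) for x in e]
        total = sum(exps)
        alpha[i] = [x / total for x in exps]  # alpha_ij, softmax over N_i
        agg = [sum(al * z[j][c] for al, j in zip(alpha[i], nbrs))
               for c in range(len(W))]
        h_next[i] = [max(0.0, x) for x in agg]  # sigma = ReLU
    return h_next, alpha

# Hypothetical graph: F1 is the datum of F2 and F3; self-loops are an assumption.
h = {"F1": [1.0, 0.0], "F2": [0.0, 1.0], "F3": [1.0, 1.0]}
neighbors = {"F1": ["F1"], "F2": ["F2", "F1"], "F3": ["F3", "F1"]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.0]
h_next, alpha = gat_layer(h, neighbors, W, a)
```

Stacking two such layers, each with several heads, would correspond to the two-layer multi-head structure described above.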

4.3. Decision Head and Masking Mechanism

The policy network outputs the probability distribution of actions based on current state features. Structural features from the GAT layer are reduced by global average pooling, flattened into a one-dimensional vector, and passed through a fully connected layer. To ensure that every output process route is feasible, a masking mechanism is integrated to suppress invalid actions.

For operations violating engineering constraints (marked with a mask of 0), the corresponding dimension in the feature vector is set to a large negative value of - 10^{9}, which serves as a finite stand-in for the - \infty term in Equation (10) and ensures that the probability of the action is effectively zero after the Softmax calculation. The action probability distribution is expressed as:

P(a_{t} | s_{t}) = \mathrm{Softmax}(\mathrm{Flatten}(\mathrm{GAT}(s_{t})) + (1 - M_{t}) \cdot (- \infty))

(10)

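where P(a_{t} | s_{t}) is the probability of selecting action a_{t} under state s_{t}, M_{t} is the binary action mask (1 for feasible actions, 0 for actions that violate constraints), and \mathrm{Flatten}(\mathrm{GAT}(s_{t})) is the GAT output after pooling and flattening.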
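The masking step can be illustrated with a short pure-Python sketch. It follows the description in the text by overwriting the scores of infeasible actions with -10^9 instead of adding an infinite penalty. The logits and the mask are hypothetical values, and the function name is an assumption for this example.

```python
import math

NEG_LARGE = -1e9  # finite stand-in for -infinity, as described in the text

def masked_softmax(logits, mask):
    """Return action probabilities; mask[k] == 0 marks an infeasible action."""
    if not any(mask):
        raise ValueError("at least one action must be feasible")
    masked = [z if m == 1 else NEG_LARGE for z, m in zip(logits, mask)]
    peak = max(masked)
    exps = [math.exp(z - peak) for z in masked]  # masked entries underflow to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical decision-head outputs
mask = [1, 0, 1, 0]             # hypothetical feasibility mask M_t
probs = masked_softmax(logits, mask)
```

Because the masked probabilities are zero to floating-point precision, an action sampled from `probs` is always feasible, which is what allows the constraint satisfaction rate to stay at 100% as long as at least one feasible action exists.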

4.4. PPO Learning Strategy

The value network evaluates the state-value V_{ω}(s_{t}), representing the expected cumulative reward of subsequent decisions, to provide a baseline for policy updates. It shares the joint feature extraction module with the policy network and uses two fully connected layers with ReLU activation to map the shared features to a one-dimensional output.

Training is conducted using the PPO-clip mechanism to achieve stable parameter updates [42]. To reduce variance, Generalized Advantage Estimation (GAE) is employed to calculate the advantage function \hat{A}_{t}. The policy network is optimized using a clip objective function to prevent excessively large policy shifts. Training terminates when the value network evaluation stabilizes (fluctuation < 10^{-4}) and the Constraint Satisfaction Rate (CSR) reaches 100%, indicating that the agent has learned to generate fully compliant and optimized process routes.
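The two ingredients named above, GAE and the clipped surrogate, can be sketched in a few lines of pure Python. The default values of `gamma`, `lam`, and `eps` are commonly used settings chosen here only for illustration; they are not taken from the paper's hyperparameter table.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    values has len(rewards) + 1 entries; the last entry is the bootstrap
    value of the final state (0.0 when the episode has terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With gamma = lam = 1 and zero value estimates, GAE reduces to the reward-to-go.
adv = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
# A ratio of 1.5 with a positive advantage is clipped at 1 + eps = 1.2.
term = ppo_clip_term(1.5, 2.0)
```

The policy loss is the negative mean of `ppo_clip_term` over a batch, so a sample whose probability ratio has already moved beyond the clip range in the favorable direction contributes no further gradient.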

5. Experimental Verification

In this section, the proposed attention-augmented DRL framework is evaluated through a series of experiments. The objective is to verify the model’s ability to generate valid and optimized process routes for complex prismatic parts and to demonstrate its performance superiority and robustness.

5.1. Experiment Setup

The implementation and training of the proposed model are conducted on a high-performance workstation. The hardware and software environments used to support the neural network training and the geometric feature processing are detailed in Table 4 below.

To ensure the high fidelity of the manufacturing environment, a comprehensive resource library is constructed, containing 3 sets of CNC machine tools with varying precision grades and 24 types of standardized cutting tools [43].

Following the methodology of previous work [37], the training and testing dataset was constructed by collecting 329 historical machining files of various parts from industrial manufacturing plants and laboratories. These cases encompass a wide spectrum of topological complexities, with the number of machining features (|V|) ranging from 7 to 25. The dataset is primarily composed of complex parts, which are foundational to the automotive, aerospace, and general machinery manufacturing sectors. As illustrated in Figure 8, the part geometries include housings, brackets, cylindrical bases, and intricate structural components that require multi-axis processing.

These parts are characterized by a wide variety of machining features and stringent engineering requirements, ensuring that the agent learns to process diverse manufacturing semantics effectively. Specifically, the dataset incorporates diverse feature elements such as planar surfaces, stepped holes, through slots, precision reamed bores, and complex pockets. Moreover, the cases exhibit varying densities of datum dependencies, which necessitate that the agent strictly adheres to fundamental manufacturing principles, such as datum-first and rough-to-finish, when orchestrating the machining sequences. The represented manufacturing scenarios cover three-axis and multi-face machining centers where non-cutting overhead, particularly tool changes and setup adjustments, significantly impacts overall productivity.

The dataset was randomly split into a training set (80%) and a held-out test set (20%). The results reported in Section 5.3 are evaluated on the unseen test set. The training process utilizes the PPO-clip algorithm. To achieve stable convergence and avoid local optima in the high-dimensional action space, the hyperparameters are carefully tuned based on preliminary sensitivity tests. These configurations are summarized in Table 5.
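A reproducible 80/20 split of the 329 cases can be sketched as follows. The seed, the use of integer case identifiers, and the rounding convention are assumptions for this example; the paper does not report the exact number of cases in each subset.

```python
import random

def split_dataset(case_ids, train_fraction=0.8, seed=0):
    """Shuffle case identifiers reproducibly and split them into train and test lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]

# 329 hypothetical case identifiers, one per historical machining file.
train_ids, test_ids = split_dataset(range(329))
```

Under this rounding convention the split yields 263 training cases and 66 held-out cases.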

5.2. Training Convergence and Stability Analysis

This section evaluates the learning behavior and convergence properties of the proposed attention-augmented DRL agent across 3000 training episodes. To perform a comprehensive assessment, the training dataset is partitioned into three complexity levels based on feature quantity: the Simple group (centered around |V| ≈ 7), the Medium group (centered around |V| ≈ 15), and the Complex group (centered around |V| ≈ 25). Performance metrics were sampled every five episodes to capture fine-grained training dynamics among these groups.

Figure 9 illustrates the average cumulative reward curves for the three complexity categories. All levels exhibit a steady upward trend during the initial training stage. Because the integrated masking mechanism filters invalid actions from the outset, the agent avoids the “sparse reward” challenge commonly encountered in standard RL and maintains a 100% Constraint Satisfaction Rate (CSR) throughout the entire training process for all groups. For the Simple group, the agent rapidly identifies optimal process sequences, reaching a stable plateau around episode 1500. As complexity increases, the Medium and Complex groups exhibit slower convergence with pronounced fluctuations. This suggests that the GAT and Transformer modules encounter greater challenges in capturing dense topological dependencies and orchestrating longer process–resource chains. Despite the increased difficulty, all groups reach stable convergence by episode 2200, indicating that the framework adapts to parts with varying feature densities.

To assess learning stability and robust optimization capabilities, three representative parts—Part 1, Part 2, and Part 3—were selected for independent training sessions. These parts possess similar complexity, featuring approximately 12–15 machining features, and involve dense datum dependencies along with multi-resource couplings. Figure 10 shows the training convergence curves for these scenarios.

The consistent convergence behavior across different parts, characterized by synchronized stability and comparable final reward levels, verifies that the proposed Hybrid Attention-DRL model can dependably identify optimal or near-optimal process routes for tasks of similar complexity.

The internal training stability is further examined in Figure 11, which displays the evolution of network losses and policy entropy. The Value Loss (L_{VF}) exhibits an initial transient phase as the critic network learns to assess the state-value function for various part geometries, followed by a steady exponential decrease. The Policy Loss (L_{CLIP}) remains near the zero baseline, demonstrating that the PPO-clip mechanism effectively constrains policy updates within a stable trust region to prevent extreme divergence. Simultaneously, the entropy curve follows a characteristic non-linear decline, representing the agent’s transition from broad exploration across multiple part types to a concentrated, high-confidence decision-making policy.

5.3. Comparative Analysis with Baseline Algorithms

To evaluate the optimization efficiency and generalization capability of the proposed attention-augmented DRL framework, it is compared against five representative baseline algorithms: Genetic Algorithm (GA), Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), Simulated Annealing (SA), and a Standard PPO model. All algorithms are evaluated using the same resource library and the multi-objective cost model.

To ensure a fair and persuasive comparison, the core parameters of the four heuristic algorithms (GA, ACO, PSO, and SA) were determined through a series of preliminary tuning experiments. Following the benchmark settings commonly used in similar Machining Process Route Planning (MPRP) optimization studies, the parameters were selected to ensure that each heuristic could reach a stable convergence state within the given problem scale (7–25 features). The specific configurations are summarized in Table 6.

The comparative study is conducted using the entire constructed dataset (329 machined parts) covering the full range of feature counts. This comprehensive evaluation ensures that the performance metrics reflect the algorithms’ capabilities across diverse topological complexities and datum dependency densities. Each algorithm was executed for 50 independent runs across the dataset to ensure statistical significance. The quantitative results, representing the average performance across all parts, are summarized in Table 7.

The results in Table 7 indicate that the proposed method achieves a comparative performance advantage over the entire dataset. While traditional heuristic algorithms (GA, ACO, PSO, SA) and the standard PPO model are capable of finding viable process routes, the proposed framework provides a modest reduction in total machining costs. Specifically, the proposed method achieves an average total cost of 422.3, representing an improvement of approximately 1.4% over the Standard PPO model and 3.8% over the best-performing heuristic baseline (ACO). This suggests that the integrated GAT and Transformer modules contribute to a more refined perception of resource coupling and topological constraints, leading to marginally better optimization of non-cutting costs (C_{t} and C_{c}). In terms of feasibility, the proposed method maintains a robust 100% CSR over the full range of parts, demonstrating the reliability of the masking mechanism in handling diverse constraint densities. The most notable benefit, however, lies in the computational efficiency and stability. The inference speed of 0.09 s on the RTX 4090 GPU allows for nearly instantaneous process orchestration, and the standard deviation (9.2) is comparable to the Standard PPO model but lower than the heuristic methods, indicating a reliable performance across diverse part geometries.

To assess the significance of the performance gains, a one-tailed t-test was performed between the proposed framework and the strongest heuristic baseline (ACO). Across the 50 independent runs for all 329 parts, the proposed method achieved a significantly lower total cost (p < 0.05). This indicates that the performance improvements are statistically robust and not attributable to the stochastic nature of DRL training.

The overall cost distribution is visualized in Figure 12. The box plot illustrates that the performance ranges of the different algorithms overlap significantly, particularly between the DRL-based models and the best heuristic methods. However, the proposed method maintains a slightly lower median and a more compact distribution, confirming that the attention-augmented DRL approach offers a reliable and efficient option for process orchestration.

The scalability of all investigated algorithms with respect to increasing part complexity is illustrated in Figure 13. To maintain logical consistency with the categorical analysis in Section 5.2, the computation time is evaluated across three landmark complexity levels: Simple (|V| ≈ 7), Medium (|V| ≈ 15), and Complex (|V| ≈ 25).

As shown, the execution time for all four meta-heuristic baselines (GA, ACO, PSO, and SA) exhibits a near-exponential growth pattern as the search space expands with the number of machining features. For instance, ACO’s computation time escalates rapidly from approximately 4.2 s to over 130 s. In stark contrast, the proposed attention-augmented DRL framework displays a remarkably stable and flat inference trajectory, maintaining an average response time of 0.09 s across the entire complexity spectrum. This demonstrates that the integrated GAT and Transformer modules effectively internalize complex manufacturing constraints and dependencies within a fixed-time forward propagation, providing a highly scalable and real-time solution for intelligent process planning in large-scale industrial scenarios.

Furthermore, it is important to emphasize that while this study does not explicitly categorize constraint intensity into discrete gradients, the diverse dataset of 329 cases naturally spans a wide spectrum of manufacturing complexities—ranging from sparse topological dependencies to highly dense engineering constraints. The consistent achievement of a 100% Constraint Satisfaction Rate (CSR) across all test scenarios implicitly demonstrates the robustness of the dynamic masking mechanism and the hybrid attention architecture in handling varying constraint densities. Additionally, the stable performance maintained on the unseen test set (20% of the total cases), which comprises varied part geometries and feature combinations not present during training, serves as a testament to the model’s strong generalization capabilities and its potential for cross-scenario applicability in real-world industrial settings.

5.4. Ablation Study on Neural Components

To quantify the specific contribution of each architectural component to the overall performance, an ablation study was conducted. Four model variants were evaluated across the entire dataset:

Full Model (Proposed): The complete architecture incorporating GAT, Transformer, and the Masking mechanism.
No GAT: The GAT layer is replaced with a standard Multi-Layer Perceptron (MLP) for feature encoding, ignoring spatial topological dependencies among machining features.
No Transformer: The Transformer encoder is removed, relying solely on GAT-aggregated features for decision-making without explicit sequential resource inheritance modeling.
No Masking: The masking mechanism is disabled, allowing the agent to explore the entire action space, including invalid process steps that violate precedence constraints.

The performance metrics are summarized in Table 8.

To further quantify the contribution of each architectural component to the overall performance, a detailed analysis of the ablation variants was conducted based on the results in Table 8. The integration of the Graph Attention Network (GAT) is demonstrated to be pivotal for spatial process optimization; its removal (No GAT variant) results in a 19.1% increase in clamping costs (C_{c}), from 56.1 to 66.8. This underscores the GAT’s critical role in aggregating spatial benchmark dependencies, which allows the agent to effectively group features and minimize redundant setups.

Furthermore, the Transformer encoder is essential for temporal resource management. The ‘No Transformer’ variant leads to a 15.0% rise in tool change costs (C_{t}), increasing from 53.2 to 61.2. This validates the necessity of the self-attention mechanism in capturing long-range sequential correlations within variable-length machining chains for optimal tool inheritance. Notably, the Masking mechanism serves as the foundational guarantee for feasibility; without it (No Masking variant), the agent fails to identify a meaningful optimization gradient, and the Constraint Satisfaction Rate (CSR) plunges to 52.6%. These quantitative findings confirm the synergistic effect of the proposed hybrid attention-augmented architecture in achieving cost-efficient and 100% compliant process orchestration.
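The two percentages quoted for the ablation variants follow directly from the cost values stated in the text, as the short check below shows. The function name is an assumption for this example.

```python
def relative_increase(baseline, variant):
    """Percentage change of variant relative to baseline."""
    return 100.0 * (variant - baseline) / baseline

clamping = relative_increase(56.1, 66.8)     # Full Model vs. No GAT, C_c
tool_change = relative_increase(53.2, 61.2)  # Full Model vs. No Transformer, C_t
print(f"{clamping:.1f}% {tool_change:.1f}%")  # prints "19.1% 15.0%"
```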

The ablation results highlight the functional necessity of each component. As illustrated in Figure 14, the No Masking variant (red line) fails to identify any optimization gradient. Throughout the 3000 training episodes, its reward trajectory remains purely stochastic, wandering irregularly around a low reward level without any significant upward trend. This confirms that without the action-space pruning provided by the masking mechanism, the agent cannot effectively learn the complex precedence rules and resource dependencies, resulting in a drastically low CSR of 52.6%.

In contrast, the Full Model achieves the most efficient convergence and the highest cumulative reward. The No GAT variant exhibits increased setup costs (C_{c}), as it lacks spatial relational encoding to group features by common machining datums. The No Transformer variant shows a degradation in tool change performance (C_{t}), proving that sequential attention is essential for resource inheritance optimization. These results validate the synergistic effect of the proposed hybrid attention architecture in complex process orchestration.

5.5. Case Study

Figure 15 illustrates the decision-making workflow generated for Part 1. The orchestration starts from the identification of initial datum features, which are preprocessed and formally modeled as the initial state of the environment. This state is then fed into the trained DRL framework to derive the final orchestration results. At each step of the sequential decision-making process, the dynamic masking mechanism filters out actions that would violate any of the 31 precedence constraints, enabling the framework to navigate the high-dimensional search space and identify an optimal sequence. The resulting plan comprises 20 machining operations with 12 tool changes and 5 clamping setups. This reduction in non-cutting overheads and auxiliary time illustrates the efficiency of the proposed method in large-scale industrial scenarios.

The detailed process parameters, including feature sequences and resource allocations, are summarized in Table 9. An in-depth analysis of this plan reveals that the DRL agent has effectively internalized professional machining expertise. The agent adhered to the Datum-First strategy by prioritizing the machining of datum planes F1, F2, and F3 from Step 1 to Step 8, thereby establishing stable locating surfaces for subsequent high-precision features. Furthermore, for the IT8 precision hole F4, the model correctly sequenced the multi-step process chain involving drilling, expanding, and reaming in Step 15, Step 19, and Step 20 respectively, which confirms the adherence to the Rough-to-Finish principle. Additionally, the Transformer-based sequential encoding effectively optimized resource correlations by grouping features F6 through F14 for consecutive processing under a unified +X tool access direction and consistent tooling. This case study confirms that the proposed hybrid architecture not only achieves high computational efficiency but also guarantees the generation of process routes that are highly consistent with practical industrial requirements.
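A generated route can be audited against rules such as Datum-First and Rough-to-Finish with a small checker like the one sketched below. The operation names, tool identifiers, and precedence pairs are hypothetical and much smaller than the 20-operation plan in Table 9; counting a change whenever consecutive steps differ is also an assumed convention.

```python
def violated_precedences(route, precedences):
    """Return the (before, after) pairs that the route does not respect."""
    position = {op: idx for idx, op in enumerate(route)}
    return [(a, b) for a, b in precedences if position[a] > position[b]]

def count_changes(route, attribute):
    """Count how often an attribute (tool or setup) switches between consecutive steps."""
    return sum(1 for prev, cur in zip(route, route[1:])
               if attribute[prev] != attribute[cur])

# Hypothetical five-step route: a datum plane followed by a precision hole.
route = ["F1_rough_mill", "F1_finish_mill", "F4_drill", "F4_expand", "F4_ream"]
precedences = [
    ("F1_rough_mill", "F1_finish_mill"),  # rough-to-finish on the datum plane
    ("F1_finish_mill", "F4_drill"),       # datum-first
    ("F4_drill", "F4_expand"),
    ("F4_expand", "F4_ream"),
]
tool = {"F1_rough_mill": "T01", "F1_finish_mill": "T01",
        "F4_drill": "T05", "F4_expand": "T06", "F4_ream": "T07"}
setup = {op: "S1" for op in route}

violations = violated_precedences(route, precedences)
tool_changes = count_changes(route, tool)
setup_changes = count_changes(route, setup)
```

An empty `violations` list means the route satisfies every listed precedence pair, while the two counters give the non-cutting overhead that the cost model penalizes.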

6. Conclusions

Machining process route planning (MPRP) plays a critical role in the transition toward autonomous manufacturing. This paper proposed an attention-augmented deep reinforcement learning (DRL) framework for intelligent process orchestration. The proposed approach synergistically leverages Graph Attention Networks (GAT) to perceive spatial benchmark dependencies between machining features and a Transformer-based encoder to model sequential resource inheritance within machining chains.

The main contributions of this study are as follows:

An Optional Process Attribute Adjacency Graph (OPAAG) was developed to formally model the complex “feature–process–resource–constraint” coupling relationships, enabling the agent to perceive engineering constraints effectively.
A dynamic action masking mechanism was integrated to ensure a 100% constraint satisfaction rate (CSR) during both training and inference stages, providing a robust solution for hard engineering constraints.
The hybrid attention-augmented network was comprehensively evaluated against traditional heuristics (e.g., GA, ACO) and standard DRL models, validating its performance advantages in complex process orchestration.

Experimental results demonstrate that the proposed method achieves an average total cost reduction of approximately 3.8% compared to the best-performing heuristic baseline (ACO). Furthermore, with an inference speed of approximately 0.09 s, the framework significantly outperforms iterative algorithms in efficiency and exhibits superior stability across diverse part geometries.

While this research primarily focuses on optimizing resource-related and time-based costs, the minimization of non-cutting overheads directly benefits manufacturing quality. Specifically, the agent’s capability to minimize clamping setups and tool access direction changes indirectly ensures the consistency of machining precision by reducing cumulative setup errors. In future work, the expansion of this architecture to incorporate more sophisticated machining mechanism constraints and direct quality indicators, such as surface roughness, will be explored. Furthermore, multi-machine collaborative scheduling and dynamic resource environments will be investigated to further enhance its versatility in complex shop-floor scenarios [44,45].

Author Contributions

Conceptualization, R.W.; methodology, R.W.; software, R.W.; validation, R.W.; formal analysis, R.W.; investigation, Z.D.; resources, R.W.; data curation, X.D.; writing—original draft preparation, R.W.; writing—review and editing, Y.P.; visualization, R.W.; supervision, M.W.; project administration, Y.P.; funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the National Key Research and Development Program of China (No. 2022YFB3304100).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Adapa, S.K.; Jagadish. An exhaustive review on intelligent computer-aided process planning in context with various optimisation techniques. Int. J. Mater. Prod. Technol. 2023, 66, 209–231. [Google Scholar] [CrossRef]
Nasir, V.; Sassani, F. A review on deep learning in machining and tool monitoring: Methods, opportunities, and challenges. Int. J. Adv. Manuf. Technol. 2021, 115, 2683–2709. [Google Scholar] [CrossRef]
Li, H.; Zhang, H.; He, Z.; Jia, Y.; Jiang, B.; Huang, X.; Ge, D. Solving integrated process planning and scheduling problem via graph neural network based deep reinforcement learning. arXiv 2024, arXiv:2409.00968. [Google Scholar] [CrossRef]
Li, Y.; Zhou, T. Research on Intelligent Planning Method for Turning Machining Process Based on Knowledge Base. Machines 2025, 13, 417. [Google Scholar] [CrossRef]
Waiyagan, K.; Bohez, E.L.J. Intelligent feature-based process planning for five-axis mill-turn parts. Comput. Ind. 2009, 60, 296–316. [Google Scholar] [CrossRef]
Long, A.; Huang, S.; Tian, Y.; Zhang, Y.; Yu, L.; Chen, Z. An automatic construction and intelligent retrieval method for knowledge graph for machining process specifications. Int. J. Comput. Integr. Manuf. 2025, 1–22. [Google Scholar] [CrossRef]
Hernandes, L.C.; Szejka, A.L.; Mas, F. Intelligent product manufacturing cost estimation framework driven by semantic technologies and knowledge-based systems. Int. J. Comput. Integr. Manuf. 2025, 1–22. [Google Scholar] [CrossRef]
Hua, G.; Zhou, X.; Ruan, X. GA-based synthesis approach for machining scheme selection and operation sequencing optimization for prismatic parts. Int. J. Adv. Manuf. Technol. 2007, 33, 594–603. [Google Scholar] [CrossRef]
Liu, X.; Yi, H.; Ni, Z. Application of ant colony optimization algorithm in process planning optimization. J. Intell. Manuf. 2013, 24, 1–13. [Google Scholar] [CrossRef]
Shirzadi, S.; Tavakkoli-Moghaddam, R.; Kia, R.; Mohammadi, M. A multi-objective imperialist competitive algorithm for integrating intra-cell layout and processing route reliability in a cellular manufacturing system. Int. J. Comput. Integr. Manuf. 2017, 30, 839–855. [Google Scholar] [CrossRef]
Wang, W.; Li, Y.; Huang, L. Rule and branch-and-bound algorithm based sequencing of machining features for process planning of complex parts. J. Intell. Manuf. 2018, 29, 1329–1336. [Google Scholar] [CrossRef]
Jing, X.; Zhu, Y.; Liu, J.; Zhou, H.; Zhao, P.; Liu, X.; Li, Q. Intelligent generation method of 3D machining process based on process knowledge. Int. J. Comput. Integr. Manuf. 2020, 33, 38–61. [Google Scholar] [CrossRef]
Liu, Q.; Wang, C.; Li, X.; Gao, L. An improved genetic algorithm with modified critical path-based searching for integrated process planning and scheduling problem considering automated guided vehicle transportation task. J. Manuf. Syst. 2023, 70, 127–146. [Google Scholar] [CrossRef]
Peng, H.; Wang, H.; Chen, D. Optimization of remanufacturing process routes oriented toward eco-efficiency. Front. Mech. Eng. 2019, 14, 422–433. [Google Scholar] [CrossRef]
Ding, S.; Guo, Z.; Wang, B.; Wang, H.; Ma, F. MBD-Based Machining Feature Recognition and Process Route Optimization. Machines 2022, 10, 906. [Google Scholar] [CrossRef]
Leng, J.; Chen, Q.; Mao, N.; Jiang, P. Combining granular computing technique with deep learning for service planning under social manufacturing contexts. Knowl.-Based Syst. 2018, 143, 295–306. [Google Scholar] [CrossRef]
Lei, R.; Wu, H.; Peng, Y. MFPointNet: A Point Cloud-Based Neural Network Using Selective Downsampling Layer for Machining Feature Recognition. Machines 2022, 10, 1165. [Google Scholar] [CrossRef]
Wang, Z.; Zhang, S.; Zhang, H.; Zhang, Y.; Liang, J.; Huang, R.; Huang, B. Machining feature process route planning based on a graph convolutional neural network. Adv. Eng. Inform. 2024, 59, 102249. [Google Scholar] [CrossRef]
Du, K.; Yang, B.; Wang, S.; Chang, Y.; Li, S.; Yi, G. Relation extraction for manufacturing knowledge graphs based on feature fusion of attention mechanism and graph convolution network. Knowl.-Based Syst. 2022, 255, 109703. [Google Scholar] [CrossRef]
Panzer, M.; Bender, B. Deep reinforcement learning in production systems: A systematic literature review. Int. J. Prod. Res. 2022, 60, 4316–4341. [Google Scholar] [CrossRef]
Wu, W.; Huang, Z.; Zeng, J.; Fan, K. A fast decision-making method for process planning with dynamic machining resources via deep reinforcement learning. J. Manuf. Syst. 2021, 58, 392–411. [Google Scholar] [CrossRef]
Zhao, M.; Mo, L.; Liu, J.; Han, J.; Niu, D. GAT-based deep reinforcement learning algorithm for real-time task scheduling on multicore platform. In Proceedings of the 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; pp. 5674–5679. [Google Scholar] [CrossRef]
Xiao, Q.; Niu, B.; Xue, B.; Hu, L. Graph convolutional reinforcement learning for advanced energy-aware process planning. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 2802–2814. [Google Scholar] [CrossRef]
Zhang, H.; Wang, W.; Zhang, S.; Zhang, Y.; Zhou, J.; Wang, Z.; Huang, B.; Huang, R. A novel method based on deep reinforcement learning for machining process route planning. Robot. Comput. Integr. Manuf. 2024, 86, 102688. [Google Scholar] [CrossRef]
Zhang, H.; Wang, W.; Wang, Y.; Zhang, Y.; Zhou, J.; Huang, B.; Zhang, S. Employing deep reinforcement learning for machining process planning: An improved framework. J. Manuf. Syst. 2025, 78, 370–393. [Google Scholar] [CrossRef]
Li, C.; Chang, Q.; Fan, H. Multi-agent reinforcement learning for integrated manufacturing system-process control. J. Manuf. Syst. 2024, 76, 585–598. [Google Scholar] [CrossRef]
Zhang, N.; Liu, B.; Zhang, J. Dual Resource Scheduling Method of Production Equipment and Rail-Guided Vehicles Based on Proximal Policy Optimization Algorithm. Technologies 2025, 13, 573. [Google Scholar] [CrossRef]
Zhu, G.; Wang, S.; Wang, L. Heterogeneous graph neural network for modeling intelligent manufacturing systems. Meas. Sci. Technol. 2024, 36, 015114. [Google Scholar] [CrossRef]
Kwon, O.R.; Lee, G.T. A predictive model based on Transformer with statistical feature embedding in manufacturing sensor dataset. Int. J. Comput. Integr. Manuf. 2025, 1–16. [Google Scholar] [CrossRef]
Li, W.; Nie, Y.; Yang, F. Multi-Variable Transformer-Based Meta-Learning for Few-Shot Fault Diagnosis of Large-Scale Systems. Sensors 2025, 25, 2941. [Google Scholar] [CrossRef] [PubMed]
Zhang, C.; Wu, Y.; Ma, Y.; Song, W.; Le, Z.; Cao, Z.; Zhang, J. A review on learning to solve combinatorial optimisation problems in manufacturing. IET Collab. Intell. Manuf. 2023, 5, e12072. [Google Scholar] [CrossRef]
Su, C.; Jiang, Q.; Han, Y.; Wang, T.; He, Q.C. Knowledge graph-driven decision support for manufacturing process: A graph neural network-based knowledge reasoning approach. Adv. Eng. Inform. 2025, 64, 103098. [Google Scholar] [CrossRef]
Barto, A.G. Reinforcement learning: Connections, surprises, and challenge. AI Mag. 2019, 40, 3–15. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Petersen, S.; Beattie, C.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Kwon, S.; Oh, Y. Optimal process planning for hybrid additive–subtractive manufacturing using recursive volume decomposition with decision criteria. J. Manuf. Syst. 2023, 71, 360–376. [Google Scholar] [CrossRef]
Wu, H.; Lei, R.; Peng, Y.; Gao, L. AAGNet: A graph neural network towards multi-task machining feature recognition. Robot. Comput. Integr. Manuf. 2024, 86, 102661. [Google Scholar] [CrossRef]
Zhang, L.; Wang, X.; Wu, H.; Peng, Y. A novel approach to part process route generation based on graph neural network encoding. Int. J. Comput. Integr. Manuf. 2025, 1–19. [Google Scholar] [CrossRef]
Huang, B.; Zhang, S.; Huang, R.; Li, X.; Zhang, Y.; Liang, J.C. An effective numerical control machining process optimization approach of part with complex pockets for numerical control process reuse. IEEE Access 2019, 7, 45146–45165. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS: Red Hook, NY, USA, 2017; pp. 5998–6008.
Zhang, J.; Zhang, H.; Xia, C.; Sun, L. Graph-BERT: Only attention is needed for learning graph representations. arXiv 2020, arXiv:2001.05140.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
Vespoli, S.; Mattera, G.; Marchesano, M.; Nele, L.; Guizzi, G. Adaptive manufacturing control with deep reinforcement learning for dynamic WIP management in Industry 4.0. Comput. Ind. Eng. 2025, 202, 110966.
Huang, Z.; Shen, Y.; Li, J.; Fey, M.; Brecher, C. A Survey on AI-Driven Digital Twins in Industry 4.0: Smart Manufacturing and Advanced Robotics. Sensors 2021, 21, 6340.
Ojstersek, R.; Brezocnik, M.; Buchmeister, B. Multi-objective optimization of production scheduling with evolutionary computation: A review. Int. J. Ind. Eng. Comput. 2020, 11, 359–376.

Figure 1. Proposed attention-augmented DRL framework for machining process route planning.

Figure 2. Description of feature recognition.

Figure 3. Description of process chain matching.

Figure 4. Description of process resource matching, process cost calculation and the final synthesized OPAAG.

Figure 5. Description of state definition.

Figure 6. Description of state transition.

Figure 7. Description of the network structure.

Figure 8. Some parts of the dataset.

Figure 9. Learning curves of the agent on representative test part groups with different complexities.

Figure 10. Independent training sessions.

Figure 11. Evolution of training losses and policy entropy.

Figure 12. Statistical comparison of cost optimization performance across the entire dataset.

Figure 13. Influence of part complexity on inference efficiency.

Figure 14. Comparison of convergence curves for different ablation model architectures.

Figure 15. Decision-making evolution and optimized machining workflow generated by the DRL framework for the 14-feature component.

Table 1. Detailed attribute representation of machining feature nodes extracted.

Symbol	Physical Significance and Functional Role
$ID$	A unique identifier assigned to each node $v_{i}$ in the graph.
$Size$	Defines the basic dimensions (length, width, height) of the feature.
$(x, y, z)$	Specifies the spatial coordinates of the feature center or reference point.
$TAD$	Direction of cutting.
$IT$	Represents the dimensional tolerance grade required for the feature.
$f$	Specifies the form and position tolerance requirements.
$R_{a}$	Denotes the required surface roughness of the machined surface.
$\vec{d}_{rem}$	Defines the material removal orientation or tool access direction.
$\delta$	Represents the machining allowance for the initial feature state.
$Type$	Classifies the feature into specific machining categories.

Table 2. Typical machining process routes for plane and hole features.

Category	Typical Machining Process Routes
Plane	Rough Milling
	Rough Milling → Finish Milling
	Rough Milling → Semi-finish Milling → Finish Milling
	Rough Milling → Grinding
	Rough Milling → Finish Milling → Grinding
	Rough Milling → Semi-finish Milling → Finish Milling → Grinding
Hole	Drilling
	Drilling → Reaming
	Drilling → Expanding → Reaming
	Drilling → Expanding → Boring → Fine Boring
	Drilling → Expanding → Internal Grinding
	Drilling → Expanding → Reaming → Internal Grinding
	Drilling → Boring → Internal Grinding → Honing
	Drilling → Expanding → Boring → Internal Grinding → Honing
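The routes in Table 2 can be read as a small lookup of candidate process chains per feature category. The following is a minimal Python sketch of such a lookup, not the authors' implementation: the names `PROCESS_ROUTES`, `Route`, and `candidate_routes` are hypothetical, and the filter on the final operation is only one illustrative way to query the table.

```python
from typing import Dict, List, Optional, Tuple

Route = Tuple[str, ...]

# Candidate process chains transcribed from Table 2.
# PROCESS_ROUTES and candidate_routes are illustrative names (assumptions).
PROCESS_ROUTES: Dict[str, List[Route]] = {
    "Plane": [
        ("Rough Milling",),
        ("Rough Milling", "Finish Milling"),
        ("Rough Milling", "Semi-finish Milling", "Finish Milling"),
        ("Rough Milling", "Grinding"),
        ("Rough Milling", "Finish Milling", "Grinding"),
        ("Rough Milling", "Semi-finish Milling", "Finish Milling", "Grinding"),
    ],
    "Hole": [
        ("Drilling",),
        ("Drilling", "Reaming"),
        ("Drilling", "Expanding", "Reaming"),
        ("Drilling", "Expanding", "Boring", "Fine Boring"),
        ("Drilling", "Expanding", "Internal Grinding"),
        ("Drilling", "Expanding", "Reaming", "Internal Grinding"),
        ("Drilling", "Boring", "Internal Grinding", "Honing"),
        ("Drilling", "Expanding", "Boring", "Internal Grinding", "Honing"),
    ],
}


def candidate_routes(category: str, final_process: Optional[str] = None) -> List[Route]:
    """Return the routes for a category, optionally filtered by their last operation."""
    routes = PROCESS_ROUTES[category]
    if final_process is None:
        return list(routes)
    return [r for r in routes if r[-1] == final_process]


if __name__ == "__main__":
    # Print every hole route that finishes with honing.
    for route in candidate_routes("Hole", final_process="Honing"):
        print(" -> ".join(route))
```

Storing each route as an ordered tuple keeps the operation sequence explicit, so the length of a chain and its finishing operation can be read off directly.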

Table 3. Parameterized structure of the processing resource libraries.

Category	Managed Parameter
Tool	Tool Model
	Cutting Speed ($v_{c}$)
	Feed per Tooth ($f_{z}$)
	Max Cutting Depth ($a_{p}$)
	Process Applicability
Machine	Stroke Limits ($X$, $Y$, $Z$)
	Max Spindle Speed ($n_{max}$)
	Repeatability
	Power Capacity

Table 4. Experimental environment configuration.

Component	Specification/Version
Operating System	Windows 11
CPU	Intel Core i9-13900K @ 3.00 GHz
RAM	32 GB
GPU	NVIDIA GeForce RTX 4090 (24 GB)
Programming Language	Python 3.8
Deep Learning Framework	PyTorch 1.12.1
Acceleration Library	CUDA 11.7

Table 5. Hyperparameter configuration for the PPO-clip training.

Category	Hyperparameter	Value
Optimization	Optimizer	Adam
	Learning Rate	$2 \times 10^{-4}$
	Batch Size	32
PPO Core	Clip Coefficient ($\epsilon$)	0.2
	Discount Factor ($\gamma$)	0.98
	GAE Parameter ($\lambda$)	0.95
Network	Attention Heads	4
	Hidden Dimensions	128
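To make the role of the PPO Core values in Table 5 concrete, the sketch below shows how the discount factor, GAE parameter, and clip coefficient enter the standard generalized advantage estimate and the clipped surrogate term. It is a plain-Python illustration under stated assumptions, not the training code used in the paper: the function names are hypothetical, and the trajectory is assumed to be a single episode with a bootstrap value appended to `values`.

```python
from typing import List

# Values taken from Table 5; everything else here is an illustrative assumption.
CLIP_EPS = 0.2     # clip coefficient (epsilon)
GAMMA = 0.98       # discount factor (gamma)
GAE_LAMBDA = 0.95  # GAE parameter (lambda)


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = GAMMA, lam: float = GAE_LAMBDA) -> List[float]:
    """Generalized advantage estimates for one episode.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio: float, advantage: float, eps: float = CLIP_EPS) -> float:
    """PPO-clip objective term for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    adv = gae_advantages(rewards=[0.0, 1.0], values=[0.0, 0.0, 0.0])
    print(adv)
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With a clip coefficient of 0.2, a probability ratio of 1.5 on a positive advantage contributes as if the ratio were 1.2, which bounds the size of a single policy update.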

Table 6. Detailed parameter configurations for baseline heuristic algorithms.

Algorithm	Key Parameter Settings
GA	Population: 100; Generations: 500; Crossover: 0.8; Mutation: 0.1
ACO	Ants: 50; Iterations: 300; α = 1.0; β = 2.0; Evaporation: 0.5
PSO	Particles: 80; $c_{1} = c_{2} = 2.0$; Inertia weight: 0.9 → 0.4
SA	Initial Temp: 1000; Cooling Rate: 0.95; Termination: 0.01
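As a worked illustration of the SA settings in Table 6, the sketch below counts how many cooling steps separate the initial temperature from the termination value. It assumes a geometric schedule, in which the temperature is multiplied by the cooling rate at each step, and it assumes that "Termination: 0.01" denotes a final temperature; the table does not state either point, and the function name is hypothetical.

```python
def sa_cooling_steps(t_init: float = 1000.0,
                     cooling_rate: float = 0.95,
                     t_end: float = 0.01) -> int:
    """Count geometric cooling steps until the temperature drops to t_end or below.

    Assumes T_{k+1} = cooling_rate * T_k (an assumption, not stated in Table 6).
    """
    steps = 0
    temp = t_init
    while temp > t_end:
        temp *= cooling_rate
        steps += 1
    return steps


if __name__ == "__main__":
    print(sa_cooling_steps())
```

Under these assumptions the schedule runs for 225 temperature levels, since 0.95 raised to the 225th power is the first power that falls below the ratio 0.01/1000.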

Table 7. Overall performance comparison across the entire dataset.

Algorithm	Avg. $C_{total}$	Std. Dev.	$C_{m}$	$C_{t}$	$C_{c}$	CSR (%)	Inference Time (s)
GA	445.2	±13.5	315.4	64.2	65.6	94.5	15.42
ACO	438.8	±15.2	314.8	62.5	61.5	95.2	22.85
PSO	448.6	±16.8	316.2	65.8	66.6	93.8	12.15
SA	452.4	±12.2	318.5	68.4	65.5	96.0	18.60
Std. PPO	428.5	±10.5	313.5	57.0	58.0	98.2	0.08
Proposed	422.3	±9.2	312.8	53.2	56.1	100.0	0.09
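The cost figures in Table 7 can be turned into relative reductions with a few lines of arithmetic. The sketch below is an illustrative calculation on the transcribed averages only; the dictionary and function names are hypothetical, and the percentages are derived here rather than reported in the table.

```python
from typing import Dict, Tuple

# (Avg. C_total, inference time in seconds) transcribed from Table 7.
RESULTS: Dict[str, Tuple[float, float]] = {
    "GA": (445.2, 15.42),
    "ACO": (438.8, 22.85),
    "PSO": (448.6, 12.15),
    "SA": (452.4, 18.60),
    "Std. PPO": (428.5, 0.08),
    "Proposed": (422.3, 0.09),
}


def cost_reduction_pct(baseline: str, proposed: str = "Proposed") -> float:
    """Relative reduction in average total cost versus a baseline, in percent."""
    base_cost = RESULTS[baseline][0]
    new_cost = RESULTS[proposed][0]
    return 100.0 * (base_cost - new_cost) / base_cost


if __name__ == "__main__":
    for name in RESULTS:
        if name != "Proposed":
            print(f"{name}: {cost_reduction_pct(name):.2f}% lower average total cost")
```

On these averages the reduction ranges from about 1.4% against Std. PPO to about 6.7% against SA.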

Table 8. Performance comparison of ablation model variants.

Configuration	Avg. $C_{total}$	$C_{t}$	$C_{c}$	CSR (%)	Convergence Episode
Full Model	422.3	53.2	56.1	100.0	~2150
No GAT	439.5	58.4	66.8	100.0	~2550
No Transformer	431.8	61.2	58.6	100.0	~2300
No Masking	624.5	92.4	88.6	52.6	/

Table 9. Process Planning Result of Part 1.

No.	Feature ID	Feature Type	Process Name	TAD	Cutting Tool	Machine
1	F3	Plane	Rough Milling	−X	End Mill (Φ10)	3-axis Machining Center
2	F1	Plane	Rough Milling	+Y	End Mill (Φ8)	3-axis Machining Center
3	F1	Plane	Finish Milling	+Y	End Mill (Φ6)	3-axis Machining Center
4	F2	Plane	Rough Milling	−Y	End Mill (Φ8)	3-axis Machining Center
5	F2	Plane	Finish Milling	−Y	End Mill (Φ6)	3-axis Machining Center
6	F5	Through Slot	Rough Milling	−X	End Mill (Φ8)	3-axis Machining Center
7	F3	Plane	Finish Milling	−X	End Mill (Φ8)	3-axis Machining Center
8	F5	Through Slot	Finish Milling	−X	End Mill (Φ6)	3-axis Machining Center
9	F9	Through Hole	Drilling	+X	Twist Drill (Φ3)	3-axis Machining Center
10	F10	Through Hole	Drilling	+X	Twist Drill (Φ3)	3-axis Machining Center
11	F13	Through Hole	Drilling	+X	Twist Drill (Φ4)	3-axis Machining Center
12	F14	Through Hole	Drilling	+X	Twist Drill (Φ4)	3-axis Machining Center
13	F11	Through Hole	Drilling	+X	Twist Drill (Φ4)	3-axis Machining Center
14	F12	Through Hole	Drilling	+X	Twist Drill (Φ4)	3-axis Machining Center
15	F4	Through Hole	Drilling	+X	Twist Drill (Φ20)	3-axis Machining Center
16	F6	Through Hole	Drilling	+X	Twist Drill (Φ2)	3-axis Machining Center
17	F8	Through Hole	Drilling	+X	Twist Drill (Φ2)	3-axis Machining Center
18	F7	Through Hole	Drilling	+X	Twist Drill (Φ2)	3-axis Machining Center
19	F4	Through Hole	Expanding	+X	Expansion Cutter (Φ20)	3-axis Machining Center
20	F4	Through Hole	Reaming	+X	Reamer (Φ20)	3-axis Machining Center
