Article

A Bi-Level Intelligent Control Framework Integrating Deep Reinforcement Learning and Bayesian Optimization for Multi-Objective Adaptive Scheduling in Opto-Mechanical Automated Manufacturing

by Lingyu Yin, Zhenhua Fang, Kaicen Li, Jing Chen, Naiji Fan and Mengyang Li *
Laser Fusion Research Center, China Academy of Engineering Physics, Mianyang 621000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 732; https://doi.org/10.3390/app16020732
Submission received: 9 December 2025 / Revised: 6 January 2026 / Accepted: 7 January 2026 / Published: 10 January 2026

Abstract

The opto-mechanical automated manufacturing process, characterized by stringent process constraints, dynamic disturbances, and conflicting optimization objectives, presents significant challenges for traditional scheduling and control approaches. We formulate the scheduling problem within a closed-loop control paradigm and propose a novel bi-level intelligent control framework integrating Deep Reinforcement Learning (DRL) and Bayesian Optimization (BO). An inner DRL agent acts as an adaptive controller, generating control actions (scheduling decisions) by perceiving the system state and learning a near-optimal policy through a carefully designed reward function, while an outer BO loop automatically tunes the DRL’s hyperparameters and reward weights for superior performance. This synergistic BO-DRL mechanism facilitates intelligent and adaptive decision-making. The proposed method is extensively evaluated against standard meta-heuristics, including Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), on a complex 20-job × 20-machine flexible job shop scheduling benchmark specific to opto-mechanical automated manufacturing. The experimental results demonstrate that our BO-DRL algorithm significantly outperforms these benchmarks, achieving reductions in makespan of 13.37% and 25.51% compared to GA and PSO, respectively, alongside higher machine utilization and better on-time delivery. Furthermore, the algorithm exhibits enhanced convergence speed, superior robustness under dynamic disruptions (e.g., machine failures, urgent orders), and excellent scalability to larger problem instances. This study confirms that integrating DRL’s perceptual decision-making capability with BO’s efficient parameter optimization yields a powerful and effective solution for intelligent scheduling in high-precision manufacturing environments.

1. Introduction

The advent of Industry 4.0 and the smart manufacturing paradigm has fundamentally transformed modern production systems, creating a pressing need for highly efficient, adaptive, and intelligent scheduling solutions [1,2]. As a critical decision-making layer in manufacturing execution, Advanced Planning and Scheduling (APS) systems are pivotal for optimizing resource utilization, ensuring on-time delivery, and enhancing overall operational efficiency [3,4]. This is particularly critical in high-precision domains such as opto-mechanical automated manufacturing, where scheduling accuracy directly correlates with final product quality and production line stability [5,6].
The opto-mechanical automated manufacturing process represents a quintessential and highly constrained instance of the Flexible Job Shop Scheduling Problem (FJSP) [7,8]. This domain is characterized by stringent process-specific constraints (e.g., cleanliness grades, thermal stability, vibration isolation) [9], frequent dynamic disruptions (e.g., machine breakdowns, high-priority urgent orders) [10], and inherently conflicting optimization objectives spanning efficiency, utilization, timeliness, and quality [11,12]. The combinatorial complexity of such problems, with solution spaces regularly exceeding $10^{200}$ for practical scenarios, firmly classifies them as strongly NP-hard, rendering traditional optimization methodologies increasingly inadequate [13,14].
Traditional approaches to production scheduling have evolved from exact mathematical programming to heuristic and meta-heuristic algorithms [15,16]. While exact methods guarantee optimality, their computational cost becomes prohibitive for real-world problem sizes [17]. Meta-heuristics like the Genetic Algorithm (GA) [18], Particle Swarm Optimization (PSO) [19], and their advanced variants have demonstrated considerable success in navigating complex solution landscapes [20,21]. However, these population-based search strategies often exhibit significant limitations in modern manufacturing contexts, including slow convergence in high-dimensional spaces, a pronounced tendency to become trapped in local optima, and a fundamental lack of built-in responsiveness to dynamic environmental changes [22,23].
The rise of Deep Reinforcement Learning (DRL) has introduced a transformative approach to solving sequential decision-making problems [24,25]. By integrating the representational power of deep neural networks with the decision-making framework of reinforcement learning, DRL agents can learn optimal policies through direct interaction with their environment [26,27]. This capability has led to groundbreaking applications in various scheduling domains, where DRL models learn to map complex system states directly to effective scheduling actions [28,29]. For instance, ref. [30] applied DRL to dynamic job-shop scheduling, demonstrating its potential in environments subject to uncertainty. Nevertheless, the practical application of DRL in manufacturing is hampered by several persistent challenges, including sparse reward signals, the credit assignment problem over extended time horizons, and acute sensitivity to the settings of numerous hyperparameters [31,32].
Concurrently, Bayesian Optimization (BO) has gained prominence as a sample-efficient methodology for optimizing black-box functions, making it particularly well-suited for hyperparameter tuning of complex models like DRL agents [33,34]. BO constructs a probabilistic surrogate model to approximate the objective function and employs an acquisition function to intelligently guide the search process [35,36]. Research such as [37] has explored BO for scheduling. While BO excels in parameter optimization tasks, it lacks the sophisticated, end-to-end perceptual decision-making capabilities inherent to DRL [38,39].
Emerging research has begun to explore the synergistic potential of combining DRL and BO [40,41]. Initial studies have typically focused on using BO as an offline tool to tune DRL hyperparameters [42] or have addressed multi-objective aspects separately [43]. However, a significant gap remains in the development of a tightly coupled, bi-level intelligent control framework where DRL and BO operate in a synergistic, closed-loop manner, continuously adapting to complex, dynamic environments like opto-mechanical automated manufacturing. From a control theory perspective, the scheduling problem in such a manufacturing system can be formulated as the task of synthesizing a high-level, discrete control policy for a Discrete-Event Dynamic System (DEDS), where the control actions are the dispatching and sequencing decisions. A schematic overview of these conventional scheduling approaches and their inherent limitations is presented in Figure 1.
To bridge this research gap, this paper proposes a novel multi-objective adaptive APS algorithm that deeply integrates Deep Reinforcement Learning and Bayesian Optimization within a unified bi-level architecture. Our framework features an inner DRL agent that functions as a real-time scheduler, making decisions based on a comprehensively designed state space, while an outer BO loop acts as a meta-optimizer, automatically and efficiently tuning the DRL’s critical hyperparameters and reward function weights. This creates a self-improving system that significantly enhances scheduling intelligence, adaptability, and performance in dynamic high-precision manufacturing settings.
The main contributions of this work are fourfold:
  • A novel bi-level intelligent control framework: We propose an integrated BO-DRL architecture that enables synergistic cooperation between perceptual decision-making and efficient parameter optimization, facilitating continuous system self-improvement.
  • Domain-Specific Modeling and Benchmarking: We formalize a complex, large-scale scheduling benchmark for opto-mechanical automated manufacturing, incorporating realistic constraints to provide a rigorous testbed for advanced scheduling algorithms.
  • Comprehensive Multi-Objective DRL Design: We develop a tailored DRL model featuring a graph neural network-based state encoder, a hierarchical action space, and a shaped reward function that dynamically balances competing objectives.
  • Extensive Empirical Validation: We conduct thorough experiments demonstrating our algorithm’s superior performance over established meta-heuristics and alternative approaches in terms of solution quality, convergence speed, robustness, and scalability.
The remainder of this paper is structured as follows: Section 2 details the problem background and description. Section 3 presents the mathematical formulation and complexity analysis. Section 4 discusses the problem difficulties and algorithm challenges. Section 5 elaborates on the proposed BO-DRL algorithm. Section 6 presents the experimental validation and analysis. Finally, Section 8 concludes the paper and suggests future research directions.

2. Problem Background and Description

Opto-mechanical automated manufacturing constitutes a critical stage in precision manufacturing, and its scheduling poses significant theoretical and practical research challenges. This study addresses the problem of governing a large-scale manufacturing system through scheduling, which can be modeled as a Discrete-Event Dynamic System (DEDS). The objective is to synthesize a control policy that optimizes system performance. The benchmark instance studied here not only has the general characteristics of the standard FJSP but also incorporates the special process constraints of opto-mechanical automated manufacturing and adjustment, providing a strict test benchmark for evaluating advanced scheduling algorithms. Compared with ordinary scheduling instances, it is distinctive in three respects: first, machine resources are highly heterogeneous and specialized, and each piece of equipment corresponds to specific process requirements; second, the job routes contain complex sequence constraints and quality dependencies; third, the production environment is subject to dynamic disturbances and process stability requirements. These characteristics bring the instance closer to actual manufacturing scenarios and make it a more valid basis for evaluating algorithm performance.

3. Mathematical Model Analysis

This section presents the mathematical formulation and complexity analysis of the multi-objective adaptive scheduling problem for opto-mechanical manufacturing. We begin by defining the optimization objectives and the system constraints. Subsequently, we analyze the combinatorial complexity of the problem and provide a detailed description of the heterogeneous machine resources and production workflow that characterize the benchmark scenario.

3.1. Optimization Objectives and Constraints

The scheduling problem is formulated as a multi-objective optimization. The primary and composite objectives are defined, along with the key constraints that ensure solution feasibility and model the specific process rules of the domain. To enhance clarity, the key variables used throughout the mathematical model are summarized in Table 1.
The primary objective is to minimize the makespan:
$$\text{Minimize}\quad C_{\max} = \max\{\, C_i \mid i = 1, 2, \ldots, 20 \,\} \tag{1}$$
Additionally, we consider the following multi-objective optimization:
$$\text{Minimize}\quad F = \left[ C_{\max},\; U,\; T_{\max},\; Q \right] \tag{2}$$
where $U$ is the average machine utilization, $T_{\max}$ is the maximum delay time, and $Q$ is the quality index.
In order to comprehensively evaluate the performance of the scheduling algorithm, a multi-dimensional evaluation index system is adopted, as detailed in Table 2. This system encompasses metrics for time efficiency (e.g., makespan, mean flow time), resource utilization (machine utilization), delivery timeliness (number of tardy jobs), and system robustness (disturbance recovery capability). Together, these indicators provide a balanced assessment of scheduling quality across the key operational objectives in manufacturing.
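As an illustration of how the indicators named above could be computed from a finished schedule, the following minimal Python sketch evaluates makespan, mean flow time, average machine utilization, and the number of tardy jobs. The `ScheduledOp` record, the function name, and the toy data are hypothetical, and the sketch assumes that all jobs are released at time 0; the exact definitions used in Table 2 may differ.

```python
from dataclasses import dataclass


@dataclass
class ScheduledOp:
    """One scheduled operation (hypothetical record, not from the paper)."""
    job: int
    machine: int
    start: float
    end: float


def schedule_metrics(ops, due_dates, n_machines):
    """Return (makespan, mean flow time, average utilization, tardy jobs)."""
    completion = {}
    busy = [0.0] * n_machines
    for op in ops:
        # A job completes when its last operation ends.
        completion[op.job] = max(completion.get(op.job, 0.0), op.end)
        busy[op.machine] += op.end - op.start
    makespan = max(completion.values())
    # Assumption: every job is released at t = 0, so flow time = completion time.
    mean_flow = sum(completion.values()) / len(completion)
    utilization = sum(busy) / (n_machines * makespan)
    tardy = sum(1 for j, c in completion.items() if c > due_dates[j])
    return makespan, mean_flow, utilization, tardy


# Toy schedule: 2 jobs, 2 machines.
ops = [
    ScheduledOp(0, 0, 0.0, 3.0), ScheduledOp(0, 1, 3.0, 5.0),
    ScheduledOp(1, 1, 0.0, 2.0), ScheduledOp(1, 0, 3.0, 7.0),
]
print(schedule_metrics(ops, due_dates={0: 6.0, 1: 6.0}, n_machines=2))
```

In the toy schedule, job 1 finishes after its due date, so it is counted as the single tardy job.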
Constraints include:
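$$s_{ij} + p_{ijk} \le s_{i(j+1)}, \quad \forall i, j, k \tag{3}$$
$$s_{ij} + p_{ijk} \le s_{lm} + M\,(1 - y_{ijlmk}), \quad \forall (i, j) \ne (l, m),\ \forall k \tag{4}$$
$$y_{ijlmk} \in \{0, 1\}, \quad \forall i, j, l, m, k \tag{5}$$
$$\sum_{k \in M_{ij}} x_{ijk} = 1, \quad \forall i, j \tag{6}$$
$$x_{ijk} \in \{0, 1\}, \quad \forall i, j, k \tag{7}$$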
The mathematical model established above formally defines the multi-objective scheduling problem for opto-mechanical manufacturing. This formulation encapsulates the system’s key constraints, including operation precedence, machine eligibility, and disjunctive scheduling relations. Together with the variables summarized in Table 1, it provides a precise and complete representation of the decision space and performance objectives. The inherent complexity of this model, which stems from the interaction of binary decision variables, sequencing constraints, and competing objectives, directly motivates the need for the advanced intelligent scheduling approach developed in the subsequent sections.
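The three constraint families described above (operation precedence, machine eligibility, and disjunctive machine usage) can be checked on a candidate schedule with a few lines of code. The sketch below is only an illustration under assumed data structures: `schedule`, `eligible`, and `proc_time` are hypothetical dictionaries, and the disjunctive constraint is checked directly as "no overlap on a machine" rather than through big-M binary variables.

```python
def is_feasible(schedule, eligible, proc_time, eps=1e-9):
    """Check a schedule against precedence, eligibility and no-overlap rules.

    schedule[(i, j)]     -> (machine k, start time s_ij)
    eligible[(i, j)]     -> set of machines M_ij allowed for operation (i, j)
    proc_time[(i, j, k)] -> processing time p_ijk
    """
    # Machine eligibility: the chosen machine must belong to M_ij.
    for (i, j), (k, _) in schedule.items():
        if k not in eligible[(i, j)]:
            return False
    # Precedence: operation j of job i must finish before operation j + 1 starts.
    for (i, j), (k, s) in schedule.items():
        nxt = schedule.get((i, j + 1))
        if nxt is not None and s + proc_time[(i, j, k)] > nxt[1] + eps:
            return False
    # Disjunctive rule: operations sharing a machine must not overlap in time.
    by_machine = {}
    for (i, j), (k, s) in schedule.items():
        by_machine.setdefault(k, []).append((s, s + proc_time[(i, j, k)]))
    for intervals in by_machine.values():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            if e1 > s2 + eps:
                return False
    return True


# Toy instance: job 0 has two operations, job 1 has one.
eligible = {(0, 0): {0, 1}, (0, 1): {1}, (1, 0): {0}}
proc_time = {(0, 0, 0): 3.0, (0, 0, 1): 4.0, (0, 1, 1): 2.0, (1, 0, 0): 2.0}
good = {(0, 0): (0, 0.0), (0, 1): (1, 3.0), (1, 0): (0, 3.0)}
print(is_feasible(good, eligible, proc_time))
```

Process-specific rules such as cleanliness or thermal stability are not modeled here; they would need additional checks that depend on data not given in this section.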

3.2. Problem Complexity Analysis

This benchmark instance contains 20 specialized machines, divided into 5 functional categories, whose descriptions and constraints are summarized in Table 3. The table not only lists the categories but also provides the specific machine identifiers (e.g., M6, M10, M12 for Cleaning), creating a precise mapping between each machine and its functional role. This explicit mapping is crucial for traceability throughout the analysis and directly defines the heterogeneous resource pool, with its specialized constraints, that the scheduling algorithm must navigate.
The production workflow involving these heterogeneous machines is illustrated in Figure 2, showing the sequential dependencies and specialized processing requirements.
To concretely illustrate the job-specific sequencing and machine flexibility, Table 4 details the complete process plan for a representative job (Job 1) from the 20 × 20 benchmark instance. Job 1 must follow a fixed internal sequence of operations, each of which can be processed on one of several alternative machines with varying times. This combination of precedence constraints and machine flexibility encapsulates the core scheduling challenge addressed in this work.
The combinatorial complexity of this example can be formally expressed as:
$$\text{Solution space size} = \prod_{i=1}^{N} \left( |O_i|! \times \prod_{j=1}^{|O_i|} |M_{ij}| \right) \tag{8}$$
where $O_i$ represents the operation set of job $i$, with $|O_i| \in [7, 10]$, and $M_{ij}$ represents the set of available machines for operation $j$ of job $i$, with $|M_{ij}| \in [1, 4]$. By this calculation, the size of the solution space exceeds $10^{200}$, and the problem is strongly NP-hard.
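To give a feel for the magnitudes involved, the sketch below evaluates the solution-space expression in log scale. It assumes the expression is read as the product over jobs of $|O_i|!$ times the product of $|M_{ij}|$ over the operations of job $i$; that reading, the function name, and the two extreme instances are assumptions made for illustration, since the actual per-job operation and machine counts of the benchmark are not listed in this section.

```python
import math


def log10_solution_space(op_counts, machine_counts):
    """log10 of prod_i ( |O_i|! * prod_j |M_ij| ) under the assumed reading.

    op_counts[i]      -> number of operations |O_i| of job i
    machine_counts[i] -> list with |M_ij| for each operation j of job i
    """
    total = 0.0
    for n_ops, machines in zip(op_counts, machine_counts):
        total += math.log10(math.factorial(n_ops))
        total += sum(math.log10(m) for m in machines)
    return total


# Two hypothetical extremes for N = 20 jobs within the stated ranges.
lower = log10_solution_space([7] * 20, [[1] * 7] * 20)
upper = log10_solution_space([10] * 20, [[4] * 10] * 20)
print(f"between about 10^{lower:.1f} and 10^{upper:.1f}")
```

Under this reading, the figure of more than $10^{200}$ quoted in the text lies within the range spanned by the two extremes, toward the upper end.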

4. Problem Difficulties and Algorithm Challenges

Compared with the standard FJSP, the main difficulties of this instance lie in three aspects. First, the opto-mechanical automated manufacturing and adjustment process imposes strict cleanliness requirements, thermal stability constraints, and calibration dependencies. These constraints form a nonlinear feasible region and significantly increase the search difficulty. Second, there is an essential conflict between the quality index $Q$ and the time index $C_{\max}$: high-precision assembly requires longer stabilization time and fine adjustments, while efficiency goals require fast turnaround. Searching this multi-objective Pareto front is far more complex than single-objective optimization. Third, opto-mechanical automated manufacturing and adjustment are extremely sensitive to environmental changes. Small temperature fluctuations $\Delta T$ or vibration interference may require rescheduling. This dynamic nature limits the effectiveness of static scheduling solutions and calls for algorithms with online adjustment capabilities.
Traditional metaheuristic algorithms face significant challenges when processing this example:
  • Genetic Algorithm (GA) has difficulty maintaining feasibility under complex constraints through its crossover and mutation operations, especially for sequence-dependent calibration constraints $G(O_i)$. Its population-based search mechanism is prone to becoming trapped in local optima in such high-dimensional spaces.
  • Particle Swarm Optimization (PSO) is inherently mismatched with discrete scheduling problems due to its continuous optimization characteristics. Although encoding transformations enable its application to scheduling, the physical meaning of the velocity update formula $v_{id}^{t+1} = w\, v_{id}^{t} + c_1 r_1 \left( p_{id} - x_{id}^{t} \right) + c_2 r_2 \left( p_{gd} - x_{id}^{t} \right)$ becomes ambiguous in discrete space. In this formula, $v_{id}^{t}$ and $x_{id}^{t}$ represent the velocity and position of particle $i$ in dimension $d$ at iteration $t$, respectively; $p_{id}$ is the particle’s personal best position (cognitive component); $p_{gd}$ is the swarm’s global best position (social component); $w$ is the inertia weight; $c_1$ and $c_2$ are the cognitive and social acceleration coefficients; and $r_1$, $r_2$ are random numbers uniformly distributed in $[0, 1]$. The continuous nature of this update mechanism makes the search susceptible to local optima in discrete scheduling spaces and significantly reduces the probability of finding the global optimum.
  • Standard Reinforcement Learning (RL) faces critical challenges of sparse rewards and difficult credit assignment. During long sequential decision-making processes, propagating the final performance metric $C_{\max}$ back to intermediate decision steps results in low learning efficiency. Additionally, the enormous dimensionality of the state space $S$ (formally defined in Section 5.1.1) makes training function approximators particularly challenging.
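The mismatch between the continuous PSO update and a discrete sequencing decision can be made concrete with a small sketch. The velocity update follows the formula quoted above; the random-key decoding (sorting operation indices by their continuous position values) is a common encoding assumed here for illustration, not necessarily the one used by the authors. The default coefficients mirror the PSO settings reported in the experimental configuration.

```python
import random


def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, v_max=0.2, rng=random):
    """One PSO velocity and position update for a single particle."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vel = w * v[d] + c1 * r1 * (p_best[d] - x[d]) + c2 * r2 * (g_best[d] - x[d])
        vel = max(-v_max, min(v_max, vel))  # clamp to the maximum velocity
        new_v.append(vel)
        new_x.append(x[d] + vel)
    return new_x, new_v


def decode_random_keys(x):
    """Assumed decoding: operation order = indices sorted by position value."""
    return sorted(range(len(x)), key=lambda d: x[d])


x = [0.9, 0.1, 0.5]
print(decode_random_keys(x))                    # order implied by x
print(decode_random_keys([0.91, 0.12, 0.48]))   # a nearby position, same order
```

Many distinct continuous positions decode to the same operation order, so a velocity step may change the position without changing the schedule at all, which is one way to see why the update loses its direct meaning in discrete space.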

5. Multi-Objective Adaptive APS Algorithm

This study proposes an adaptive hybrid intelligent control framework that integrates Deep Reinforcement Learning (DRL), Bayesian Optimization (BO), and Multi-Objective Evolutionary Algorithm (MOEA). The inner layer employs a DRL agent as a rapid-response decision-maker, responsible for real-time scheduling under given hyperparameters, while an outer BO loop acts as a meta-controller, tasked with automatically tuning the inner controller’s (DRL agent’s) hyperparameters and reward function weights. These two components work in synergy, forming a bi-level optimization system that enhances scheduling effectiveness, response speed, and decision-making intelligence of the APS system in complex and dynamic environments. The overall architecture of the proposed bi-level BO-DRL framework is depicted in Figure 3, illustrating the synergistic interaction between the inner DRL agent and outer BO optimizer.

5.1. Deep Reinforcement Learning Decision Model Design

5.1.1. State Space Design ($s_t \in S$)

The state $s_t$ at decision time $t$ is an element of the state space $S$. It is designed to comprehensively capture both static and dynamic information of the entire production line at time $t$. We design a multi-dimensional feature vector:
$$s_t = \left[ s_t^{\text{machine}},\; s_t^{\text{job}},\; s_t^{\text{global}} \right]$$
  • Machine State Vector ($s_t^{\text{machine}}$): For each machine $M_j$, it includes: current machine status, processed time of the current operation, number of operations in the current queue, and utilization rate within the recent time window.
  • Job State Vector ($s_t^{\text{job}}$): For each job $J_i$, it contains: the ratio of completed operations to total operations; the slack time, $d_i - t - (\text{remaining total processing time})$; the current job status; and the set of available machines for the current operation.
  • Global State Vector ($s_t^{\text{global}}$): System time $t$, average machine utilization, average queue length, and proportion of overdue jobs.
This state space is encoded through a Graph Neural Network (GNN) for feature extraction, obtaining a low-dimensional representation $h_t = f_{\text{extract}}(s_t)$.
To exploit the inherent relational structure of the scheduling system, we model the production state as a heterogeneous graph $G_t = (V, E_t)$. The node set $V = V^M \cup V^J$ consists of machine nodes $v_i^M \in V^M$ and job nodes $v_j^J \in V^J$. Each machine node is associated with the feature vector from $s_t^{\text{machine}}$, and each job node with the feature vector from $s_t^{\text{job}}$. An edge $(v_i^M, v_j^J) \in E_t$ exists if machine $i$ is capable of processing the current operation of job $j$.
We adopt a widely used two-layer GNN that performs message passing over $G_t$. Let $h_v^{(l)}$ denote the feature vector of node $v$ at layer $l$ (with $h_v^{(0)}$ being its input feature).
The update rule for each node $v$ in layer $l + 1$ is:
$$h_v^{(l+1)} = \sigma\left( W^{(l)} \cdot \text{CONCAT}\left( h_v^{(l)},\; \text{AGGREGATE}\left( \{ h_u^{(l)} : u \in N(v) \} \right) \right) \right)$$
where:
  • $N(v)$ is the set of neighbors of node $v$ in $G_t$;
  • AGGREGATE is a permutation-invariant function; we use mean pooling;
  • $W^{(l)}$ is a trainable weight matrix for layer $l$;
  • $\sigma$ denotes the ReLU activation function.
The first GNN layer ($l = 0$) transforms the raw node features and aggregates information from immediate neighbors. The second layer ($l = 1$) further refines these representations and outputs the final node embeddings. After the second layer, we read out the embedding of a designated global context node (which is connected to all machine and job nodes) as the overall state representation $h_t$. This graph-based encoding enables the DRL agent to capture the complex dependencies between jobs and machines, providing a more structured and generalizable perceptual foundation than a flat feature vector.
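The message-passing rule described above (concatenate a node's own features with the mean of its neighbors' features, apply a weight matrix, then ReLU) can be sketched in plain Python. This is a toy illustration only: it assumes all nodes share one feature dimension, uses random untrained weights, and the node names (`M1`, `J1`, `J2`, `ctx`) and helper functions are hypothetical.

```python
import random


def mean_aggregate(h, neighbors):
    """Mean pooling over neighbor features (permutation-invariant)."""
    dim = len(next(iter(h.values())))
    if not neighbors:
        return [0.0] * dim
    return [sum(h[u][d] for u in neighbors) / len(neighbors) for d in range(dim)]


def gnn_layer(h, adj, W):
    """One layer: h_v <- ReLU(W . CONCAT(h_v, mean of neighbor features)).

    h:   node -> feature list (same length for every node, an assumption)
    adj: node -> list of neighbor nodes
    W:   weight matrix with shape out_dim x (2 * in_dim), as nested lists
    """
    out = {}
    for v, feat in h.items():
        z = feat + mean_aggregate(h, adj[v])  # list concatenation = CONCAT
        out[v] = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in W]
    return out


random.seed(0)
dim = 2
W0 = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
W1 = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
h0 = {"M1": [1.0, 0.0], "J1": [0.5, 0.2], "J2": [0.1, 0.9], "ctx": [0.0, 0.0]}
adj = {
    "M1": ["J1", "J2", "ctx"], "J1": ["M1", "ctx"],
    "J2": ["M1", "ctx"], "ctx": ["M1", "J1", "J2"],
}
h2 = gnn_layer(gnn_layer(h0, adj, W0), adj, W1)  # two layers of message passing
h_t = h2["ctx"]  # read out the global context node as the state embedding
print(h_t)
```

In a real heterogeneous graph, machine and job nodes would carry different feature sizes and would first need to be projected to a common dimension; that step is omitted here.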
The proposed state representation provides the foundational information necessary to maintain system liveness. By explicitly including real-time machine status (e.g., idle, busy, failed) in $s_t^{\text{machine}}$ and precise job progress in $s_t^{\text{job}}$, the agent obtains a global view of resource contention and operational dependencies. This comprehensive visibility enables the agent to evaluate and select only those actions that maintain forward progress of all jobs. Therefore, this design of the state space effectively prevents the agent from making sequences of decisions that would lead to a circular wait condition, thereby ensuring that deadlock does not occur in the scheduled system.

5.1.2. Action Space Design (Action Space $A$)

The action $a_t \in A$ is defined as selecting an operation $O_{ij}$ from the set of schedulable operations and assigning it to an available machine $M_k$. This is a hierarchical decision process involving job selection and machine assignment. Job selection refers to the agent first choosing the operation with the highest priority from all unscheduled operations, followed by machine assignment where the optimal machine is selected for that operation from the available machine set $M_{ij}$.
To reduce the dimensionality of the action space, we employ a parameterized policy network $\pi_\theta(a_t \mid s_t)$, whose output layer dimension is $|O|$ (the total number of operations), with each output node corresponding to a priority score for an operation. A probability distribution is obtained through a Softmax layer, and then an operation is selected via a greedy policy. Machine assignment is based on an evaluation network $Q_\phi(s_t, M_k)$ that selects the machine with the highest value.
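The two-stage decision described above can be sketched as follows: a softmax over priority scores followed by a greedy pick of the operation, then an argmax over per-machine values. The function names and toy numbers are hypothetical, and restricting the softmax to currently schedulable operations is an assumption made here so that the selected operation is always valid.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(priority_scores, schedulable, machine_values):
    """Pick an operation greedily, then the machine with the highest value.

    priority_scores: one score per operation (policy network output)
    schedulable:     indices of operations that can be scheduled now
    machine_values:  op index -> {machine id: estimated value}
    """
    probs = softmax([priority_scores[o] for o in schedulable])
    best = max(range(len(schedulable)), key=lambda i: probs[i])
    op = schedulable[best]
    machine = max(machine_values[op], key=machine_values[op].get)
    return op, machine


scores = [0.1, 2.0, 5.0, 1.0]   # operation 2 scores highest overall
schedulable = [0, 1, 3]         # ...but is not schedulable at this step
values = {1: {"M3": 0.2, "M7": 0.9}}
print(select_action(scores, schedulable, values))
```

Because the greedy choice takes the argmax, the softmax does not change which operation is selected; it matters only if the distribution is sampled from or used in a loss.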
The inner DRL agent employs a Q-learning algorithm, updating its action-value function according to:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where $Q(s_t, a_t)$ is the action-value function for state $s_t$ and action $a_t$, $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r_{t+1}$ is the immediate reward, and $\max_{a} Q(s_{t+1}, a)$ is the maximum expected future reward.
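A tabular version of this update makes the roles of the learning rate and discount factor explicit. The sketch below is a simplification for illustration: the paper's agent uses a deep network rather than a table, and the state and action labels are hypothetical.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error.

    Q is a dict keyed by (state, action); unseen pairs default to 0.
    An empty next_actions list is treated as a terminal state.
    """
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "assign_M2")] = 2.0
td = q_update(Q, "s0", "assign_M1", r=1.0, s_next="s1",
              next_actions=["assign_M2", "assign_M3"], alpha=0.5, gamma=0.9)
print(td, Q[("s0", "assign_M1")])
```

The bracketed term in the update formula is the temporal-difference error returned by `q_update`; scaling it by the learning rate gives the change applied to the stored value.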

5.1.3. Reward Function Design (Reward Function $r_t$)

The reward function design follows the Reward Shaping principle to alleviate the sparse reward problem and guide the agent towards learning the objective. The reward signal consists of the following components:
$$r_t = r_t^{\text{complete}} + r_t^{\text{utilize}} + r_t^{\text{due}} + r_t^{\text{penalty}}$$
The completion reward ($r_t^{\text{complete}}$) is given when a job is completed, proportional to the job’s priority $p_i$:
$$r_t^{\text{complete}} = \alpha \cdot p_i$$
The utilization reward ($r_t^{\text{utilize}}$) is granted when a machine $M_j$ finishes an operation, encouraging higher machine utilization:
$$r_t^{\text{utilize}} = \beta \cdot \frac{\text{processing time}}{\text{total time}}$$
The tardiness reward ($r_t^{\text{due}}$) is given when a job $J_i$ is completed, providing a reward/penalty based on its earliness/tardiness:
$$r_t^{\text{due}} = \gamma \cdot \max(0,\; d_i - t) - \eta \cdot \max(0,\; t - d_i)$$
The penalty ($r_t^{\text{penalty}}$) is imposed for constraint violations (such as assigning to an unavailable machine). This immediate negative feedback for invalid assignments is a critical mechanism to prevent the system from reaching a deadlock state, as it actively discourages the agent from selecting actions that could create circular waiting conditions among jobs.
$$r_t^{\text{penalty}} = -P$$
Here, $\alpha$, $\beta$, $\gamma$, $\eta$ are weight coefficients (in this subsection $\alpha$ and $\gamma$ denote reward weights, not the learning rate and discount factor of the Q-learning update), which constitute an important part of the hyperparameters $\lambda$ that need to be tuned by the outer-layer Bayesian optimization.
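The four reward components can be combined in a single function, which also shows where the tunable weights enter. This is a minimal sketch under assumptions: the `RewardWeights` container, the event arguments, and the default values are hypothetical, and the paper does not state the numeric value of the penalty $P$.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    """Reward weights tuned by the outer loop (default values are placeholders)."""
    alpha: float = 1.0    # completion weight
    beta: float = 1.0     # utilization weight
    gamma: float = 0.1    # earliness weight
    eta: float = 0.5      # tardiness weight
    penalty: float = 10.0  # P, magnitude of the constraint-violation penalty


def shaped_reward(w, t, job_done=None, op_done=None, invalid=False):
    """Sum the reward components triggered at time t.

    job_done: (priority p_i, due date d_i) if a job just completed
    op_done:  (processing time, total elapsed time) if an operation just finished
    invalid:  True if the chosen action violated a constraint
    """
    r = 0.0
    if job_done is not None:
        p_i, d_i = job_done
        r += w.alpha * p_i                                               # completion
        r += w.gamma * max(0.0, d_i - t) - w.eta * max(0.0, t - d_i)     # due date
    if op_done is not None:
        proc, total = op_done
        r += w.beta * proc / total                                       # utilization
    if invalid:
        r -= w.penalty                                                   # penalty
    return r


w = RewardWeights(alpha=2.0, beta=1.0, gamma=0.1, eta=0.5, penalty=10.0)
print(shaped_reward(w, t=8.0, job_done=(3.0, 10.0)))   # job finished early
print(shaped_reward(w, t=12.0, job_done=(3.0, 10.0)))  # job finished late
```

With these placeholder weights, finishing the same job two time units late instead of two units early lowers the reward, which is the asymmetry the due-date term is meant to create.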

5.1.4. DRL Agent Architecture and Training Process

The inner-loop DRL agent in this study is designed to learn an optimized policy that maps observed system states to scheduling actions through continuous interaction with the simulation environment, thereby achieving adaptive decision-making under complex and dynamic conditions. The agent’s design and training revolve around a value-based deep network, whose core lies in utilizing the encoded system-state representation to learn the long-term value of scheduling strategies via sequential decision-making.
The agent’s network architecture integrates state perception and decision-generation functions. The perception module relies on the Graph Neural Network encoder detailed in Section 5.1.1, which transforms the structured system state $s_t$ into a semantically rich feature embedding $h_t$. This embedding is then fed into subsequent fully connected network branches for decision-making. Specifically, one branch estimates the state-action value function $Q(s, a; \theta)$, providing a long-term return assessment for each feasible scheduling action; the other branch outputs the overall value $V(s; \phi)$ of the current state, used to evaluate the quality of the state. During training, action selection follows an ε-greedy policy to balance exploration and exploitation.
The training process aims to optimize the network parameters so that the agent can accurately predict action values and improve its policy. Training is based on interaction experiences collected from the simulation environment and uses the Q-learning algorithm for iterative updates. Each interaction yields a state-transition tuple $(s_t, a_t, r_{t+1}, s_{t+1})$, which is stored in an experience-replay buffer. During parameter updates, a batch of historical data is randomly sampled, and a target network is used to compute the temporal-difference target. The parameters of the main Q-network are then updated by minimizing the mean-squared-error loss. The parameters of the target network are periodically copied from the main network to ensure the stability of the learning target. This process is consistent with the update formula given in Section 5.1.2, and the introduction of experience replay improves data efficiency and training stability.
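The training loop described above has three mechanical pieces: a bounded replay buffer, temporal-difference targets computed from a target network, and a mean-squared-error loss. The sketch below shows them without any neural-network library; `target_q` stands in for the target network and is a hypothetical callable, and the `done` flag marking terminal transitions is an addition not present in the four-element tuple of the text. The buffer size, batch size, and discount factor defaults follow the values reported in the experimental configuration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)


def td_targets(batch, target_q, gamma=0.99):
    """TD target per transition: r + gamma * max_a Q_target(s', a), or r if terminal."""
    return [
        r + (0.0 if done else gamma * max(target_q(s_next)))
        for (_, _, r, s_next, done) in batch
    ]


def mse_loss(predictions, targets):
    """Mean-squared error between predicted Q-values and TD targets."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


buffer = ReplayBuffer(capacity=10_000)
buffer.push("s0", 0, 1.0, "s1", False)
buffer.push("s1", 1, 2.0, "s2", True)
batch = buffer.sample(batch_size=32)
print(td_targets(batch, target_q=lambda s: [0.5, 1.5]))
```

In a full implementation, the loss would be minimized with respect to the main network's parameters, and the target network would be refreshed from the main network every fixed number of updates.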
The agent’s learning is guided by the multi-objective reward function defined in Section 5.1.3. The reward signal integrates multiple requirements such as makespan, machine utilization, and due-date satisfaction. The intrinsic weights of the reward components are automatically adjusted by the outer Bayesian optimization, thereby ensuring that the policy learned by the agent aligns with the global optimization objectives.

5.2. Mathematical Model of the BO-DRL Collaborative Mechanism

The outer-layer Bayesian Optimization treats the entire training process of the inner-layer DRL as a black-box function $F$:
$$\lambda^{*} = \arg\min_{\lambda \in \Lambda} \mathbb{E}\left[ F(\lambda) \right]$$
where $\lambda = \{\alpha, \beta, \gamma, \eta, \text{learning rate}, \epsilon, \ldots\}$ is the set of hyperparameters, and $F(\lambda)$ represents the average final makespan obtained by running the DRL agent configured with hyperparameters $\lambda$ on multiple validation scheduling scenarios.
BO models $F(\lambda)$ by constructing a surrogate model using a Gaussian Process (GP):
$$F(\lambda) \sim \mathcal{GP}\left( \mu(\lambda),\; k(\lambda, \lambda') \right)$$
where $\mu(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the covariance function (kernel function). The next evaluation point is then determined through the Expected Improvement (EI) acquisition function:
$$\lambda_{\text{next}} = \arg\max_{\lambda} \operatorname{EI}(\lambda) = \arg\max_{\lambda} \mathbb{E}\left[ \max\left( 0,\; F_{\min} - F(\lambda) \right) \right]$$
where $F_{\min}$ is the current optimal value.
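When the GP posterior at a candidate $\lambda$ is Gaussian with mean $\mu$ and standard deviation $\sigma$, the expectation in the EI definition has a well-known closed form for minimization: $(F_{\min} - \mu)\,\Phi(z) + \sigma\,\varphi(z)$ with $z = (F_{\min} - \mu)/\sigma$. The sketch below evaluates it with the standard library only. The candidate names and posterior values are invented for illustration; in practice they would come from a fitted GP.

```python
import math


def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for minimization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(0.0, f_min - mu)
    z = (f_min - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_min - mu) * cdf + sigma * pdf


# Hypothetical posterior (mean, std) of the makespan for three candidate settings.
candidates = {"lam_a": (70.0, 0.5), "lam_b": (72.0, 4.0), "lam_c": (69.5, 0.1)}
f_min = 69.8  # best makespan observed so far (made-up value)
best = max(candidates, key=lambda k: expected_improvement(*candidates[k], f_min))
print(best)
```

In this toy case the candidate with the worst predicted mean but the largest uncertainty obtains the highest EI, which illustrates how the acquisition function trades off exploitation against exploration.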

5.3. Efficient Rescheduling in Dynamic Environments

The framework adapts to dynamic environments through the following mechanisms:
  • Real-time State Updates: Any dynamic event (such as new order insertion or machine failure) triggers immediate updates to the state $s_t$, enabling the DRL agent to respond based on the latest state information.
  • Rapid DRL Response: The trained DRL policy network $\pi_{\theta^{*}}$ achieves extremely fast forward propagation speeds (at millisecond level), enabling real-time online scheduling.
  • Continuous BO Learning: The system can periodically (e.g., monthly) collect new scheduling data and rerun the BO cycle to optimize hyperparameters $\lambda$, allowing the DRL agent to continuously adapt to changes in the production environment.

6. Experimental Validation

6.1. Experimental Design Rationale and Algorithm Configuration

This section outlines the rationale behind our experimental design and details the parameter settings for all compared algorithms. The design is structured to ensure a rigorous, comprehensive, and fair evaluation of the proposed BO-DRL framework within a context reflecting the complexities of opto-mechanical manufacturing.
(a)
Benchmark Problem Selection: The core experiments utilize a bespoke 20 jobs × 20 machines FJSP benchmark. The choice of this specific scale is pivotal: as quantified by Equation (8), the solution space size for this instance exceeds 10 200 , signifying that the problem’s complexity transitions from the exponential to the hyper-exponential regime. This establishes the 20 × 20 instance as a formidable benchmark that truly challenges the limits of scheduling algorithms. Furthermore, its enrichment with domain-specific constraints (Section 3) ensures that this complexity is not merely combinatorial but also reflects the intricate feasibility rules of high-precision opto-mechanical manufacturing, providing a valid and stringent testbed for advanced algorithms.
(b)
Choice of Baseline Algorithms: Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) are adopted as primary baselines. They are widely recognized as representative and high-performing meta-heuristics for combinatorial scheduling, providing a credible and standard reference point for performance comparison.
(c)
Comprehensive Performance Metrics: A multi-dimensional evaluation system is employed, encompassing makespan, mean flow time, machine utilization, number of tardy jobs, and robustness indices. This holistic approach aligns with practical manufacturing objectives and prevents over-optimization toward a single metric.
(d)
Fairness in Resource Allocation: All algorithms were allocated identical computational resources, including the maximum number of iterations and hardware environment. The parameters for each algorithm were independently fine-tuned through systematic preliminary experiments to ensure that each operated in its most competitive configuration. This approach guarantees that any observed performance differences are attributable to the advantages of the algorithms’ core mechanisms, rather than imbalances in resource allocation or parameter tuning.
(e)
Scalability and Robustness Tests: Systematic tests on instances from 5 × 5 to 100 × 100 assess scalability (Section 6.3), while the dynamic disturbance experiments in Section 6.2.3 assess robustness.
To ensure a fair comparison, all algorithms adopted the same stopping criterion (a maximum of 1000 iterations) and were run with consistent computational resources.
The following parameter settings provided a balanced and competitive baseline for each algorithm, enabling a meaningful comparison of their core capabilities.
(a)
BO-DRL Algorithm: Discount factor 0.99, experience replay buffer size 10,000, batch size 32, employing an ε-greedy strategy.
(b)
Genetic Algorithm (GA): Population size 50, crossover probability 0.8, mutation probability 0.1, tournament selection (size 3), utilizing order crossover (OX) and swap mutation.
(c)
Particle Swarm Optimization (PSO): Population size 50, inertia weight 0.7, individual learning factor 1.5, social learning factor 1.5, maximum velocity 0.2.
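To make the baseline settings above concrete, the following minimal sketch shows how the listed GA operators and the PSO update could look. It is an illustration under stated assumptions, not the authors' code: chromosomes are assumed to be permutations of distinct operation IDs, the OX variant shown (slice from the first parent, remainder filled in the second parent's cyclic order) is one common form, and all function names are hypothetical.

```python
import random

# Settings quoted in the text; the function and key names are illustrative only.
GA = {"pop_size": 50, "p_crossover": 0.8, "p_mutation": 0.1, "tournament_size": 3}
PSO = {"pop_size": 50, "w": 0.7, "c1": 1.5, "c2": 1.5, "v_max": 0.2}


def tournament_select(population, makespans, rng, k=GA["tournament_size"]):
    """Draw k random individuals and return the one with the smallest makespan."""
    contenders = rng.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: makespans[i])]


def order_crossover(p1, p2, rng):
    """OX: keep a random slice of p1, fill the other positions in p2's cyclic order."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]
    kept = set(p1[a:b + 1])
    fill = [g for g in p2[b + 1:] + p2[:b + 1] if g not in kept]
    for pos, gene in zip(list(range(b + 1, n)) + list(range(a)), fill):
        child[pos] = gene
    return child


def swap_mutation(chrom, rng, p=GA["p_mutation"]):
    """With probability p, exchange two randomly chosen positions."""
    chrom = list(chrom)
    if rng.random() < p:
        i, j = rng.sample(range(len(chrom)), 2)
        chrom[i], chrom[j] = chrom[j], chrom[i]
    return chrom


def pso_step(x, v, pbest, gbest, rng, w=PSO["w"], c1=PSO["c1"], c2=PSO["c2"],
             v_max=PSO["v_max"]):
    """One velocity/position update with velocity clamped to [-v_max, v_max]."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        vel = max(-v_max, min(v_max, vel))
        new_v.append(vel)
        new_x.append(xi + vel)
    return new_x, new_v


rng = random.Random(0)
parent_a = list(range(8))
parent_b = [7, 3, 1, 5, 0, 6, 2, 4]
child = swap_mutation(order_crossover(parent_a, parent_b, rng), rng)
new_x, new_v = pso_step([0.1, 0.9], [0.0, 0.0], [0.5, 0.5], [0.8, 0.2], rng)
```

Both GA operators preserve the permutation property, so every child remains a valid operation ordering before decoding; how a continuous PSO position is decoded into a schedule is not covered by this sketch.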

6.2. Algorithm Performance Comparative Analysis

6.2.1. Solution Quality Comparison

To evaluate the comprehensive scheduling performance of the algorithms, a thorough comparison of the BO-DRL, GA, and PSO algorithms was conducted across four key dimensions: makespan, mean flow time, machine utilization, and due date satisfaction. The results are summarized in Table 5. This comparative presentation allows for a direct assessment of each algorithm’s effectiveness: shorter makespan and mean flow time indicate higher production efficiency; higher machine utilization reflects better resource exploitation; and fewer tardy jobs demonstrate superior due-date adherence. The data in Table 5 thus provides the quantitative foundation for concluding that BO-DRL achieves a more balanced and superior overall performance across these competing objectives.
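As an illustration of how the four dimensions can be computed from a finished schedule, the sketch below uses a toy two-job schedule. The definitions are assumptions chosen for the example (flow time as completion minus release, utilization as busy time divided by makespan, a job counted as tardy when it completes after its due date); the data are invented and are not taken from Table 5.

```python
from collections import defaultdict

# (job, machine, start_h, end_h): toy data for illustration only.
schedule = [
    ("J1", "M1", 0.0, 3.0), ("J1", "M2", 3.0, 5.0),
    ("J2", "M2", 0.0, 2.0), ("J2", "M1", 3.0, 7.0),
]
release = {"J1": 0.0, "J2": 0.0}
due = {"J1": 6.0, "J2": 6.0}


def schedule_metrics(schedule, release, due):
    """Return makespan, mean flow time, per-machine utilization and tardy-job count."""
    completion = defaultdict(float)
    busy = defaultdict(float)
    for job, machine, start, end in schedule:
        completion[job] = max(completion[job], end)  # last operation ends the job
        busy[machine] += end - start                 # accumulated processing time
    makespan = max(completion.values())
    mean_flow = sum(completion[j] - release[j] for j in completion) / len(completion)
    utilization = {m: t / makespan for m, t in busy.items()}
    tardy = sum(1 for j in completion if completion[j] > due[j])
    return makespan, mean_flow, utilization, tardy


makespan, mean_flow, utilization, tardy = schedule_metrics(schedule, release, due)
print(makespan, mean_flow, tardy)  # 7.0 6.0 1
```

In this toy case machine M1 is busy for the full 7 h while M2 is busy for 4 h, which shows how a short makespan and an uneven utilization profile can coexist.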
The data analysis indicates that the BO-DRL algorithm achieves superior performance across all key metrics. Specifically:
The BO-DRL algorithm obtained a makespan of only 69.32 h, representing improvements of 13.37% and 25.51% compared to GA (80.02 h) and PSO (93.06 h), respectively. This result directly validates the exceptional capability of the BO-DRL framework in compressing the production cycle and enhancing overall output efficiency.
BO-DRL also achieved the best mean flow time (62.16 h) among the three algorithms, indicating its ability not only to complete the last job quickly but also to accelerate the flow of all jobs holistically, effectively reducing work-in-process inventory.
Regarding machine utilization, BO-DRL reached 55.03%, higher than GA’s 50.23% and PSO’s 43.80%. Crucially, while achieving higher utilization, its standard deviation (11.39%) was comparable to GA (10.30%) and substantially lower than PSO (14.82%). This suggests that BO-DRL raises utilization without a marked loss of load balance relative to GA, and improves on PSO in both respects, avoiding excessive idling of some machines and bottlenecks on others and thereby achieving more stable and efficient use of system resources.
In terms of the number of tardy jobs, both BO-DRL and GA recorded only 1, outperforming PSO’s 3. Combined with the shorter mean flow time and makespan, this indicates that while optimizing time-related metrics, the due-date awareness integrated into BO-DRL’s reward function enables it to intelligently assign higher priority to critical or time-sensitive jobs, thus better meeting due date requirements at a global level.
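The makespan improvements quoted above follow from the usual relative-reduction formula, (baseline − proposed) / baseline. The short check below reproduces them from the reported makespans; the function name is illustrative.

```python
def relative_improvement(baseline, proposed):
    """Percentage reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline * 100.0


# Makespans (h) as reported in the text.
makespan_h = {"BO-DRL": 69.32, "GA": 80.02, "PSO": 93.06}

vs_ga = relative_improvement(makespan_h["GA"], makespan_h["BO-DRL"])
vs_pso = relative_improvement(makespan_h["PSO"], makespan_h["BO-DRL"])
print(f"{vs_ga:.2f}% {vs_pso:.2f}%")  # 13.37% 25.51%
```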
Gantt charts were obtained from the same experimental runs that produced the quantitative metrics in Table 5. A comparison of the charts reveals distinct structural differences corresponding directly to the measured performance. The schedule generated by BO-DRL (Figure 4a) exhibits minimal idle gaps between consecutive operations and a more continuous allocation of tasks across machines. This compact profile visually translates into the shorter makespan and higher utilization recorded in Table 5. In contrast, the schedules produced by GA (Figure 4b) and PSO (Figure 4c) show pronounced machine idle periods (visible as white segments) and longer waiting intervals between operations, which correspond quantitatively to the longer makespan, lower utilization, and greater load imbalance reported in the table. The visual and quantitative representations are therefore consistent and mutually reinforcing: the structural compactness observed in the BO-DRL Gantt chart underlies its superior performance across all measured metrics.
The comprehensive superiority of BO-DRL in solution quality fundamentally stems from the synergistic enhancement effect between DRL and BO. The inner-layer DRL agent, through end-to-end training, learns to make sequential decisions under complex constraints. Its state space design enables the perception of global information, thereby facilitating decisions superior to those derived from local heuristic rules. The outer-layer BO ensures the DRL agent operates under the optimal configuration through automated and efficient hyperparameter optimization, overcoming the suboptimality inherent in manual tuning. In contrast, traditional metaheuristic algorithms (GA, PSO), when confronted with highly constrained, nonlinear problems such as opto-mechanical automated manufacturing, struggle to explore the solution space effectively due to their inherent search mechanisms, easily becoming trapped in local optima, which leads to the observed performance gap.
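The interaction between the two layers can be sketched as an outer loop that repeatedly proposes a configuration, runs the inner training, and observes the resulting makespan. The code below is a minimal illustration, not the authors' implementation: `train_inner_agent` is a hypothetical stand-in (a smooth toy function of two normalized hyperparameters) for a full DRL training run, and the Gaussian-process surrogate with an RBF kernel and the expected-improvement acquisition are assumptions, since this section does not restate which surrogate or acquisition function the framework uses.

```python
import math
import numpy as np


def rbf_kernel(a, b, length=0.3):
    """Squared-exponential kernel between two sets of points."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / length ** 2)


def gp_posterior(x_obs, y_obs, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    mean = k_on.T @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - (k_on * np.linalg.solve(k_oo, k_on)).sum(axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement for a minimization problem."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def train_inner_agent(theta):
    """Hypothetical stand-in for inner DRL training; returns a simulated makespan (h)."""
    return 70.0 + 40.0 * ((theta[0] - 0.3) ** 2 + (theta[1] - 0.7) ** 2)


def outer_bo_loop(n_init=5, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    x_obs = rng.random((n_init, 2))  # initial random configurations in [0, 1)^2
    y_obs = np.array([train_inner_agent(x) for x in x_obs])
    for _ in range(n_iter):
        y_std = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)  # standardize targets
        candidates = rng.random((500, 2))
        mean, std = gp_posterior(x_obs, y_std, candidates)
        x_next = candidates[np.argmax(expected_improvement(mean, std, y_std.min()))]
        x_obs = np.vstack([x_obs, x_next])
        y_obs = np.append(y_obs, train_inner_agent(x_next))
    best = int(np.argmin(y_obs))
    return x_obs[best], float(y_obs[best])


best_theta, best_makespan = outer_bo_loop()
```

In a real deployment each call to the inner function would be a complete training run, which is why a sample-efficient outer optimizer matters: the loop above spends only 20 evaluations in total.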

6.2.2. Convergence Performance Analysis

Convergence performance is a key metric for evaluating the practical utility of an algorithm. Figure 5 illustrates the convergence curves of the three algorithms on a 20 × 20 scale instance, showing the variation in makespan with the number of iterations.
In Figure 5a, BO-DRL reaches its performance plateau at the 341st generation, substantially earlier than GA (902 generations) and PSO (976 generations), which corresponds to 62.24% and 65.09% fewer generations to convergence, respectively. This indicates that the search direction of BO-DRL is more targeted and efficient. Its smooth, monotonically decreasing curve without significant oscillation reflects a stable optimization process based on policy gradients. In contrast, PSO decreases rapidly at first but quickly becomes trapped in a local optimum, showing limited improvement in subsequent iterations. GA converges slowly, and its curve fluctuates considerably, reflecting the undirected nature of its crossover and mutation operations.
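One way to read a convergence generation off a best-so-far curve is to take the first generation whose value lies within a small relative tolerance of the final value. The sketch below uses this rule as an assumption (the paper's exact plateau criterion is not stated in this section) together with an invented toy curve. Applying the speed-up formula to the generation counts quoted above gives about 62.2% and 65.1%, which agrees with the reported figures to one decimal place.

```python
def convergence_generation(best_so_far, rel_tol=1e-3):
    """First 1-based generation whose value is within rel_tol of the final value."""
    final = best_so_far[-1]
    return next(
        gen for gen, value in enumerate(best_so_far, start=1)
        if abs(value - final) <= rel_tol * abs(final)
    )


def speedup_percent(baseline_gen, proposed_gen):
    """Percentage of generations saved relative to the baseline."""
    return (baseline_gen - proposed_gen) / baseline_gen * 100.0


# Toy best-so-far makespan curve (h), invented for illustration.
toy_curve = [100.0, 90.0, 80.0, 75.0, 70.05, 70.0, 70.0, 70.0]
toy_gen = convergence_generation(toy_curve)

vs_ga = speedup_percent(902, 341)
vs_pso = speedup_percent(976, 341)
print(toy_gen, round(vs_ga, 1), round(vs_pso, 1))  # 5 62.2 65.1
```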
The computational efficiency advantage of BO-DRL is quantified in Figure 5b, which compares the total computation time required for a complete optimization run. Under identical computational resources, the total computation time required for BO-DRL to complete one full optimization was reduced by 77.45% and 89.28% compared to GA and PSO, respectively. This significant advantage stems from the intrinsic mechanism of the BO-DRL algorithm: once trained, its forward inference for decision-making occurs at the millisecond level. In contrast, GA and PSO require expensive fitness evaluations (i.e., complete scheduling simulations) for the entire population in each iteration. This makes BO-DRL exceptionally practical for dynamic environments requiring rapid response or frequent rescheduling.
The superior convergence performance of BO-DRL can be attributed to three main factors:
  • Directed Policy Search: Unlike the random search of metaheuristic algorithms, DRL performs directed policy improvement via policy gradients, continuously adjusting decisions towards actions that yield higher cumulative reward, naturally leading to higher search efficiency.
  • Attention Mechanism: As described in Section 5.1.1, the attention mechanism within the state encoding network enables the agent to focus on the most critical scheduling decisions at any moment (e.g., bottleneck machines, urgent jobs), avoiding redundant searches on non-critical decisions. This is a key reason for its ability to quickly escape local optima.
  • BO Preheating Effect: The outer-layer Bayesian optimization provides the DRL agent with a near-optimal initial hyperparameter configuration. This gives the inner-layer DRL training a higher starting point, equivalent to a high-quality “algorithm preheating,” significantly reducing the time required for convergence.
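To illustrate the attention idea in the second factor, the sketch below shows generic scaled dot-product attention pooling over a handful of machine feature vectors. It is not the network of Section 5.1.1: the feature dimensions, the random inputs, and the function name are hypothetical, and the example only demonstrates how attention weights concentrate the state summary on the entities most relevant to the current query.

```python
import numpy as np


def attention_pool(query, keys, values):
    """Scaled dot-product attention: softmax(keys . query / sqrt(d)) applied to values."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights, weights @ values


rng = np.random.default_rng(0)
machine_features = rng.normal(size=(4, 3))  # 4 hypothetical machines, 3 features each
query = rng.normal(size=3)                  # hypothetical encoding of the current decision

weights, context = attention_pool(query, machine_features, machine_features)
```

The weights are positive and sum to one, so `context` is a convex combination of the machine features in which the highest-scoring machines dominate.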
In summary, BO-DRL outperforms not only in final solution quality but also demonstrates substantial advantages in convergence speed, stability, and computational efficiency. This establishes a solid foundation for its deployment in industrial APS systems with high real-time requirements.

6.2.3. Algorithm Robustness Verification

To validate the robustness of the algorithms, we introduced three types of dynamic disturbances into the original benchmark instance. The specific parameters and application dynamics for each perturbation type were designed to reflect realistic production scenarios and are detailed as follows:
(a)
Machine Failures: We randomly selected 5 out of the 20 machines (25% of the total fleet) to simulate unplanned breakdowns. This failure rate represents a moderate-to-high stress scenario for the system. Each failed machine became unavailable for a duration uniformly distributed between 2 and 8 h, after which it resumed operation. Failures were triggered at random time points after the 20th hour of the schedule to simulate mid-production disruptions, ensuring the initial schedule was already in execution.
(b)
Urgent Orders: We inserted 3 new high-priority jobs during the scheduling process. These jobs were released into the system at random times uniformly distributed between the 10th and 30th hours. To reflect their urgency, each was assigned a due date tightness factor of 0.3 (i.e., $\text{due date} = \text{release time} + 0.3 \times \text{total processing time}$), which is significantly tighter than the average factor of 1.2 used for regular jobs in the benchmark. Their internal process plans and machine eligibility were generated with the same complexity distribution as the original benchmark jobs.
(c)
Processing Time Fluctuations: To simulate natural variability in operation execution, the actual processing time for every operation was subject to a random fluctuation. The realized time was set to $p_{ijk}^{\mathrm{actual}} = p_{ijk} \times (1 + \delta)$, where $\delta$ was drawn from a uniform distribution over the interval $[-0.15, +0.15]$, representing a ±15% variation. This range captures typical variability observed in manual adjustment and precision assembly stages.
These three perturbations were applied concurrently in a single, integrated dynamic scenario to test the algorithm’s ability to handle compound uncertainties. The specific parameters (5 machines, 3 urgent jobs, ±15% variation) were chosen to represent a significant yet realistic stress level for the scheduling system, providing a stringent and comprehensive test for robustness.
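The compound scenario described above can be sampled along the following lines. This is a sketch under assumptions rather than the authors' generator: the upper bound on failure onset (`horizon_h`), the range used for the urgent jobs' total processing time, and all names are hypothetical, while the counts, intervals, tightness factor, and ±15% fluctuation follow the text.

```python
import random


def sample_disturbances(proc_times, n_machines=20, horizon_h=60.0, seed=0):
    """Draw one compound scenario: machine failures, urgent jobs, noisy processing times."""
    rng = random.Random(seed)

    # (a) 5 of 20 machines fail after hour 20 for a downtime of 2 to 8 h.
    #     horizon_h is a hypothetical upper bound on the failure onset.
    failures = [
        {"machine": m,
         "start_h": rng.uniform(20.0, horizon_h),
         "duration_h": rng.uniform(2.0, 8.0)}
        for m in rng.sample(range(n_machines), 5)
    ]

    # (b) 3 urgent jobs released between hours 10 and 30, tightness factor 0.3.
    urgent_jobs = []
    for _ in range(3):
        release = rng.uniform(10.0, 30.0)
        total_proc = rng.uniform(5.0, 15.0)  # hypothetical total processing time (h)
        urgent_jobs.append({"release_h": release,
                            "total_proc_h": total_proc,
                            "due_h": release + 0.3 * total_proc})

    # (c) every processing time is scaled by (1 + delta), delta ~ U[-0.15, 0.15].
    actual = {op: p * (1.0 + rng.uniform(-0.15, 0.15)) for op, p in proc_times.items()}
    return failures, urgent_jobs, actual


# Toy nominal processing times keyed by (job, operation index).
nominal = {("J1", 1): 4.0, ("J1", 2): 2.5, ("J2", 1): 6.0}
failures, urgent_jobs, actual = sample_disturbances(nominal)
```

Fixing the seed makes a scenario reproducible, so that all compared algorithms can be evaluated on exactly the same disturbance realization.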
The performance of the algorithms under dynamic disturbances is quantitatively compared in Figure 6, which illustrates the makespan increase rate, rescheduling success rate, and solution quality retention rate.
Under these dynamic conditions, BO-DRL demonstrates exceptional adaptability. Its makespan increased by only 8.7%, and it achieved a rescheduling success rate of 94.2%, significantly outperforming the comparative algorithms. This robustness stems from the closed-loop nature of the DRL-based controller, which continuously perceives the state and reacts. The pre-tuning via BO ensures that the controller’s adaptation policy is near-optimal from the outset.

6.3. Algorithm Scalability Analysis

To comprehensively evaluate the performance of the algorithms across problems of varying scales, we conducted systematic tests on six additional opto-mechanical automated manufacturing benchmark instances ranging from 5 × 5 to 100 × 100. The complete machine categorization, corresponding quantities, and task sequencing information (Gantt charts) for all these scales are provided in the Supplementary Material. The detailed comparison of key performance metrics for BO-DRL, GA, and PSO across these instances is summarized in Table 6. This table reveals consistent trends: BO-DRL achieves the shortest makespan and mean flow time at every scale, maintains competitively high and stable machine utilization, and generally results in fewer tardy jobs compared to both GA and PSO. Notably, its relative advantage in makespan reduction becomes more pronounced as the problem size increases, underscoring its scalability.
Convergence behavior across these different scales is visualized in Figure 7. The side-by-side comparison for each scale clearly demonstrates that BO-DRL not only converges faster but also reaches a lower makespan plateau than both meta-heuristics. In contrast, the convergence profiles of GA and PSO frequently exhibit more erratic improvement patterns and a clear tendency to stagnate at higher objective values, particularly as the problem size and complexity increase in larger instances.
The progression of BO-DRL’s performance advantage is quantified in Figure 8, which charts its relative makespan improvement against GA and PSO across all problem scales. Performance improvement trend analysis demonstrates that the BO-DRL algorithm exhibits significant advantages across problems of different scales. Specifically, the performance improvement of BO-DRL relative to GA increases monotonically from 0.80% at the 5 × 5 scale to 23.96% at the 100 × 100 scale, showing a continuously strengthening trend. This phenomenon indicates that as problem complexity increases, the advantages of BO-DRL’s deep perception and adaptive decision-making capabilities become increasingly prominent.
In the 25 × 25 instance, BO-DRL achieved an 11.6% improvement over GA, supporting its applicability to more complex real-world production environments. The improvement of BO-DRL relative to PSO follows a different pattern: it increases continuously for small to medium scales (5 × 5 to 20 × 20), peaks at 25.51% at the 20 × 20 scale, and then stabilizes around 16–17% for larger-scale problems. This trend reflects the relative adaptability of the PSO algorithm within specific scale ranges, while BO-DRL maintains a clear advantage across all scales.
Key findings include:
  • Scale Adaptability Differences: The improvement of BO-DRL over GA continuously strengthens with increasing scale, indicating its stronger adaptability in complex large-scale problems, while its improvement over PSO stabilizes after peaking at medium scales, reflecting the characteristic differences of different algorithms when dealing with problems of varying complexity.
  • Convergence Performance Advantage: For small to medium-scale problems, BO-DRL’s convergence speed significantly outperforms comparative algorithms, achieving stable solutions on average 60% earlier. For large-scale problems, BO-DRL effectively focuses on key scheduling decisions through its attention mechanism, avoiding redundant searches in invalid solution spaces.
  • Disturbance Resistance Capability: As the problem scale increases, the disturbance resistance performance of all algorithms decreases, but BO-DRL shows the least degradation. For the 100 × 100 ultra-large-scale problem, BO-DRL’s solution quality retention rate is significantly higher than GA (+9.3%) and PSO (+14.8%), demonstrating its exceptional robustness.
In summary, the BO-DRL algorithm demonstrates excellent scalable performance, with its advantages becoming more evident as the problem scale increases. This is primarily attributed to its bi-level optimization architecture: the inner-layer DRL achieves adaptive decision-making through deep perception, effectively handling the complexity of large-scale problems; the outer-layer BO ensures the algorithm operates in an optimal configuration across different scales through efficient hyperparameter optimization. This design makes BO-DRL an ideal choice for solving large-scale scheduling problems in complex manufacturing environments.

6.4. Discussion and Insights

The exceptional performance of BO-DRL in complex opto-mechanical automated manufacturing scheduling can be attributed to the following key factors:
  • Adaptive Decision-Making Capability: Through deep reinforcement learning, BO-DRL can adaptively adjust scheduling strategies based on real-time states, rather than relying on fixed heuristic rules.
  • Constraint Handling Capability: The attention mechanism enables the algorithm to effectively identify and satisfy the complex process constraints in opto-mechanical automated manufacturing.
  • Hyperparameter Optimization: Bayesian optimization ensures that the DRL algorithm always operates under the optimal hyperparameter configuration, fully realizing its learning potential.
  • Knowledge Accumulation and Transfer: Through curriculum learning, BO-DRL can transfer knowledge learned from simple problems to solve complex problems.
In conclusion, the experimental results demonstrate that the BO-DRL algorithm significantly outperforms traditional metaheuristic algorithms in complex opto-mechanical automated manufacturing scheduling problems, showing clear advantages in solution quality, convergence speed, and robustness. It provides an effective solution for intelligent scheduling in complex manufacturing environments.

7. Discussion and Limitations

While the proposed BO-DRL framework demonstrates significant advantages in adaptive scheduling for opto-mechanical manufacturing, it is important to acknowledge its current limitations and scope to provide a balanced view and guide future work.
(a)
Computational Overhead of Offline Training. The initial phase of training the DRL agent and optimizing hyperparameters via BO is computationally intensive. Although the trained agent operates with millisecond-level latency online, this upfront cost must be considered for deployment scenarios where rapid adaptation to a completely new production configuration is required. Future work could explore meta-learning or transfer learning techniques to reduce this cold-start cost.
(b)
Dependence on Simulation Fidelity. The agent’s policy is learned and tuned entirely within a simulated production environment. Its performance in practice is therefore contingent on the accuracy of the simulation model in capturing the dynamics and stochasticity of the real shop floor. Discrepancies between simulation and reality could lead to suboptimal decisions. Enhancing the simulation with digital twin technologies or incorporating online fine-tuning mechanisms are valuable directions.
(c)
Generalizability of Dynamic Disturbance Models. The robustness tests, while designed with realistic parameters, employ specific, pre-defined disturbance profiles (e.g., uniform distribution for downtime). The framework’s performance under unforeseen or more extreme disruption patterns warrants further investigation. Extending the state representation and reward function to handle a broader, less structured set of anomalies remains a challenge.
(d)
Interpretability of the Learned Policy. Like many deep RL-based controllers, the inner decision-making logic of the trained DRL agent is not easily interpretable to human planners. In high-stakes, high-precision manufacturing, a degree of explainability may be required for trust and adoption. Developing methods to explain or distill the agent’s policy into human-understandable rules is an important avenue for future research.
Addressing these limitations will be crucial for transitioning the proposed intelligent control framework from a validated prototype to a robust, trustworthy component of future autonomous manufacturing systems.

8. Conclusions and Future Work

In this study, we developed a multi-objective adaptive scheduling algorithm that integrates deep reinforcement learning with Bayesian optimization to address the complex scheduling requirements in opto-mechanical automated manufacturing. Experimental results demonstrate that the proposed algorithm outperforms traditional metaheuristic methods in solution quality, convergence speed, and robustness. It effectively handles dynamic disturbances and balances conflicts among multiple objectives, providing an efficient and reliable intelligent control solution for autonomous manufacturing systems in high-precision manufacturing environments.
Future research will pursue full autonomy and enhanced intelligence for scheduling systems. Our efforts will focus on three key directions: (1) introducing meta-learning mechanisms to enable rapid adaptation to new production scenarios, (2) developing data-driven dynamic perception systems to anticipate and respond to disruptions proactively, and (3) exploring preference-free multi-objective decision-making to autonomously balance conflicting goals. These advancements are crucial steps toward realizing self-evolving intelligent control systems for manufacturing.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16020732/s1.

Author Contributions

L.Y.: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing—Original Draft, Visualization. Z.F.: Conceptualization, Resources, Writing—Review and Editing, Supervision, Project Administration. K.L.: Methodology, Software, Writing—Review and Editing. J.C.: Methodology, Software, Writing—Review and Editing. N.F.: Resources, Writing—Review and Editing. M.L.: Conceptualization, Resources, Writing—Review and Editing, Supervision, Project Administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Young Talent Fund of the Laser Fusion Research Center (No. RCFCZ7-2024-4).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Martin, P.; Juergen, W. Industry 4.0 and object-oriented development: Incremental and architectural change. J. Technol. Manag. Innov. 2016, 11, 104–110. [Google Scholar] [CrossRef]
  2. Carlos, R.H.; Márquez Ribeiro, C.C. Shop scheduling in manufacturing environments: A review. Int. Trans. Oper. Res. 2022, 29, 3237–3293. [Google Scholar] [CrossRef]
  3. Tsai, M.F.; Wei-Tse, L.I.; Chen, L.W. Dynamic productivity prediction and new production feature selection methods for advanced planning scheduling. J. Inf. Sci. Eng. 2024, 40, 341. [Google Scholar] [CrossRef]
  4. Park, K.T.; Lee, J.Y.; Park, M.W.; Park, Y.H.; Lee, J.Y.; Choi, Y.H. Models and p4r asset description for digital twin-based advanced planning and scheduling using cyber-physical integration for resilient production operation. J. Manuf. Syst. 2024, 77, 127–153. [Google Scholar] [CrossRef]
  5. Yin, L.; Xiong, Z.; Chen, H.; Wang, C. Optimization of JSP based on particle swarm algorithm with oscillation regulation mutation. In Proceedings of the 5th IEEE International Conference on Electronic Engineering and Informatics, Wuhan, China, 30 June–2 July 2023. [Google Scholar]
  6. Márquez Carlos, R.H.; Braganholo, V.; Ribeiro, C.C. An open-source framework for solving shop scheduling problems in manufacturing environments. Ann. Oper. Res. 2025, 351, 1155–1183. [Google Scholar] [CrossRef]
  7. Jing, X.; Yao, X.; Liu, M.Z.J. Multi-agent reinforcement learning based on graph convolutional network for flexible job shop scheduling. J. Intell. Manuf. 2024, 35, 75–93. [Google Scholar] [CrossRef]
  8. Zhang, L.; Feng, Y.; Xiao, Q.; Xu, Y.; Li, D.; Yang, D. Deep reinforcement learning for dynamic flexible job shop scheduling problem considering variable processing times. J. Manuf. Syst. 2023, 71, 257–273. [Google Scholar] [CrossRef]
  9. Zhang, J.; Wang, H.; Liu, B.; Chu, D.; Xu, X.; Pei, G. Virtual assembly framework for performance analysis of large optics. Virtual Real. Intell. Hardw. 2020, 2, 28–42. [Google Scholar] [CrossRef]
  10. Zhang, W.; Peng, Z.; Zhao, F.; Feng, B.; Mei, X. A novel deep reinforcement learning framework based on digital twins for dynamic job shop scheduling problems. Expert Syst. Appl. 2026, 296, 128708. [Google Scholar] [CrossRef]
  11. Zhang, B.; Che, A.; Wang, Y. Grid-based artificial bee colony algorithm for multi-objective job shop scheduling with manual loading and unloading tasks. Expert Syst. Appl. 2024, 245, 123011. [Google Scholar] [CrossRef]
  12. Pan, C.; Yu, T.; Liu, Z.; Tang, H.; Li, X.; Pang, S. R-dmdqn: A rule embedding based dynamic multi-objective deep q-network for mass-individualized production scheduling of printed circuit board. J. Manuf. Syst. 2025, 79, 466–483. [Google Scholar] [CrossRef]
  13. Arulkumar, V.; Raju, K.K.; Pemula, R.; Vigil, M.S.A. An optimized scheduling algorithm for prioritized tasks with shared resources in cloud edge computing. Expert Syst. Appl. 2025, 293, 128594. [Google Scholar] [CrossRef]
  14. Wang, X.; Hu, X.; Zhang, C. Dynamic spatiotemporal scheduling of hull parts under complex constraints in shipbuilding workshop. Int. J. Comput. Integr. Manuf. 2023, 37, 123–148. [Google Scholar] [CrossRef]
  15. Pooranian, Z.; Shojafar, M.; Abawajy, J.H.; Abraham, A. An efficient meta-heuristic algorithm for grid computing. J. Comb. Optim. 2015, 30, 413–434. [Google Scholar] [CrossRef]
  16. Gao, Y.; Yuan, B.; Cui, W. A math-heuristic approach for scheduling the production and delivery of a mobile additive manufacturing hub. Comput. Ind. Eng. 2024, 188, 109929. [Google Scholar] [CrossRef]
  17. Madni, S.H.H.; Latiff, M.S.A.; Abdullahi, M.; Abdulhamid, S.M.; Usman, M.J. Performance comparison of heuristic algorithms for task scheduling in iaas cloud computing environment. PLoS ONE 2017, 12, e0176321. [Google Scholar] [CrossRef] [PubMed]
  18. Rahman, H.F.; Sarker, R.; Essam, D. A genetic algorithm for permutation flow shop scheduling under make to stock production system. Comput. Ind. Eng. 2015, 90, 12–24. [Google Scholar] [CrossRef]
  19. Hu, C.; Zheng, R.; Lu, S.; Liu, X.; Cheng, H. Integrated optimization of production scheduling and maintenance planning with dynamic job arrivals and mold constraints. Comput. Ind. Eng. 2023, 186, 109708. [Google Scholar] [CrossRef]
  20. Sugianto, W.C.; Kim, B.S. Particle swarm optimization for integrated scheduling problem with batch additive manufacturing and batch direct-shipping delivery. Comput. Oper. Res. 2024, 161, 106430. [Google Scholar] [CrossRef]
  21. Wang, Z.; Qi, Y.; Cui, H.; Zhang, J. A hybrid algorithm for order acceptance and scheduling problem in make-to-stock/make-to-order industries. Comput. Ind. Eng. 2019, 127, 841–852. [Google Scholar] [CrossRef]
  22. Zhuang, M.; Zhang, W.; Tang, H.; Li, X.; Wang, K. A multi-objective genetic algorithm based on two-stage reinforcement learning for green flexible shop scheduling problem considering machine speed. Expert Syst. Appl. 2024, 258, 125189. [Google Scholar] [CrossRef]
  23. Yang, H.; Du, Y.; Li, Y.; Qian, W.; Hu, B. A heuristic mutation based genetic algorithm for fast parallel scheduling of steel cold rolling. Chin. J. Mech. Eng. 2025, 38, 1–11. [Google Scholar] [CrossRef]
  24. Wan, L.; Fu, L.; Li, C.; Li, K. Flexible job shop scheduling via deep reinforcement learning with meta-path-based heterogeneous graph neural network. Knowl. Based Syst. 2024, 296, 111940. [Google Scholar] [CrossRef]
  25. Yu, H.; Tang, N.; Zhu, Z.; Guo, Z. Flexible job-shop scheduling via gated recurrent unit and deep reinforcement learning. Knowl. Based Syst. 2025, 330, 114734. [Google Scholar] [CrossRef]
  26. Yuan, M.; Yu, Q.; Zhang, L.; Lu, S.; Li, Z.; Pei, F. Deep reinforcement learning based proximal policy optimization algorithm for dynamic job shop scheduling. Comput. Oper. Res. 2025, 183, 107149. [Google Scholar] [CrossRef]
  27. Geng, Y.; Zhao, N. A Tree neural network deep reinforcement learning for flexible job shop scheduling with transportation constraints. Swarm Evol. Comput. 2025, 98, 102102. [Google Scholar] [CrossRef]
  28. Ding, L.; Guan, Z.; Luo, D.; Yue, L. Data-driven hierarchical multi-policy deep reinforcement learning framework for multi-objective multiplicity dynamic flexible job shop scheduling. J. Manuf. Syst. 2025, 80, 536–562. [Google Scholar] [CrossRef]
  29. Lv, L.; Zhang, C.; Fan, J.; Shen, W. Deep reinforcement learning for job shop scheduling problems: A comprehensive literature review. Knowl. Based Syst. 2025, 321, 113633. [Google Scholar] [CrossRef]
  30. Zhang, Z.; Tang, Q.; Zhang, L.; Li, Z.; Cheng, L. A q-learning-based multi-population algorithm for multi-objective distributed heterogeneous assembly no-idle flowshop scheduling with batch delivery. Expert Syst. Appl. 2025, 263, 125690. [Google Scholar] [CrossRef]
  31. Cheng, W.; Zhang, C.; Meng, L.; Gao, K.; Zhang, B.; Sang, H. A cooperative agent deep reinforcement learning framework for solving flexible job shop scheduling problem with automated guided vehicles. Expert Syst. Appl. 2025, 287, 128142. [Google Scholar] [CrossRef]
  32. Shi, Z.; Si, J.; Zhang, J.; Pang, Z.; Chen, H.; Ding, G. A deep reinforcement learning method based on Hindsight experience replay for multi-objective dynamic job-shop scheduling problem. Expert Syst. Appl. 2025, 284, 127989. [Google Scholar] [CrossRef]
  33. Young, M.T.; Hinkle, J.D.; Kannan, R.; Ramanathan, A. Distributed Bayesian optimization of deep reinforcement learning algorithms. J. Parallel Distrib. Comput. 2020, 139, 43–52. [Google Scholar] [CrossRef]
  34. Patro, S.K.; Shelke, S.; Maitre, N.; Salunkhe, S.S. Optimizing the thermal performance of phase change materials in building applications using deep reinforcement learning and Bayesian optimization. Therm. Sci. Eng. Prog. 2024, 55, 102867. [Google Scholar] [CrossRef]
  35. Paulson, J.A.; Tsay, C. Bayesian optimization as a flexible and efficient design framework for sustainable process systems. Curr. Opin. Green Sustain. Chem. 2025, 51, 100983. [Google Scholar] [CrossRef]
  36. Perez Colo, I.; Saavedra Sueldo, C.; De Paula, M.; Acosta, G.G. Intelligent approach for the industrialization of deep learning solutions applied to fault detection. Expert Syst. Appl. 2023, 233, 120959. [Google Scholar] [CrossRef]
  37. Sun, L.; Lin, L.; Wang, Y.; Gen, M.; Kawakami, H. A Bayesian Optimization-based Evolutionary Algorithm for Flexible Job Shop Scheduling. Procedia Comput. Sci. 2015, 61, 521–526. [Google Scholar] [CrossRef]
  38. Guan, X.; Li, M.Z.F.; Qin, J.; Wang, C. Short-term high-speed rail passenger flow forecasting integrated extended empirical mode decomposition with multivariate and bidirectional support vector machine. Expert Syst. Appl. 2026, 298, 129870. [Google Scholar] [CrossRef]
  39. Muhuri, P.K.; Biswas, S.K. Bayesian optimization algorithm for multi-objective scheduling of time and precedence constrained tasks in heterogeneous multiprocessor systems. Appl. Soft Comput. 2020, 92, 106274. [Google Scholar] [CrossRef]
  40. Papageorgiou, E.; Buzo, A.; Pelz, G.; Noulis, T. Deep reinforcement learning and Bayesian optimization based OpAmp design across the CMOS process space. AEU Int. J. Electron. Commun. 2025, 192, 155697. [Google Scholar] [CrossRef]
  41. Hong, H.; Kim, S.; Kim, W.; Kim, W.; Jeong, J.; Kim, S.S. Design optimization of 3D printed kirigami-inspired composite metamaterials for quasi-zero stiffness using deep reinforcement learning integrated with bayesian optimization. Compos. Struct. 2025, 359, 119031. [Google Scholar] [CrossRef]
  42. Springenberg, J.T.; Klein, A.; Falkner, S.; Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the NIPS’16 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  43. Wang, K.; Chen, Z.; Zhang, L.; Obaidat, M.S.; Cui, J.; Cheng, H. Building a self-evolving digital twin system with bayesian optimization and deep reinforcement learning for complex equipment optimization and control. Tsinghua Sci. Technol. 2026, 31, 199–216. [Google Scholar] [CrossRef]
Figure 1. A schematic overview of existing scheduling algorithms and their limitations.
Figure 2. Schematic overview of the opto-mechanical production workflow. Note: The diagram illustrates the typical process types and potential dependencies (e.g., rework loops) in this domain. It is a conceptual representation; actual jobs have diverse, job-specific operation sequences, as detailed in Table 4.
Figure 3. The proposed bi-level BO-DRL framework for adaptive scheduling.
Figure 4. Comparison of scheduling Gantt charts using different algorithms.
Figure 5. Comparison of convergence performance for the three algorithms (20 × 20 scale).
Figure 6. Comparison of robustness for the three algorithms (20 × 20 scale).
Figure 7. Comparison of convergence performance for the three algorithms across different problem scales.
Figure 8. BO-DRL performance improvement trend.
Table 1. Nomenclature of key variables in the mathematical model.

| Symbol | Description |
|---|---|
| $i, l$ | Index for jobs; used to denote distinct jobs when formulating pairwise constraints (e.g., $i \neq l$). |
| $j, m$ | Index for operations within a job; used to denote distinct operations (e.g., operation $j$ of job $i$ vs. operation $m$ of job $l$). |
| $k$ | Index for machines |
| $t$ | Index for time (iteration in PSO/DRL context) |
| $C_{\max}$ | Makespan (maximum completion time) |
| $C_i$ | Completion time of job $i$ |
| $U$ | Average machine utilization |
| $T_{\max}$ | Maximum tardiness |
| $Q$ | Quality performance index |
| $s_{ij}$ | Start time of operation $j$ of job $i$ |
| $p_{ijk}$ | Processing time of operation $O_{ij}$ on machine $k$ |
| $y_{ijlmk}$ | Binary variable; equals 1 if operations $O_{ij}$ and $O_{lm}$ are both processed on machine $k$, and $O_{ij}$ precedes $O_{lm}$ |
| $x_{ijk}$ | Binary variable; equals 1 if operation $O_{ij}$ is assigned to machine $k$ |
| $d_i$ | Due date of job $i$ |
| $M_{ij}$ | Set of machines capable of processing operation $O_{ij}$ |
| $M$ | A sufficiently large positive number |
| $N_j$ | Total number of jobs |
| $M$ | Total number of machines |
| $N_{\text{tardy}}$ | Number of tardy jobs |
| $\mathbb{1}\{\cdot\}$ | Indicator function (returns 1 if condition is true, else 0) |
| $C_{\max}^{\text{static}}$ | Makespan under the static baseline schedule |
| $C_{\max}^{\text{dynamic}}$ | Makespan under dynamic disruptions |
Table 2. Algorithm performance evaluation indicators.

| Indicator Type | Specific Indicator | Calculation Formula |
|---|---|---|
| Efficiency Index | Makespan | $C_{\max}$ |
| Efficiency Index | Mean flow time | $\frac{1}{N_j}\sum_{i=1}^{N_j} C_i$ |
| Resource Utilization | Machine utilization | $\frac{1}{M}\sum_{k=1}^{M}\frac{\sum p_{ijk}}{C_{\max}}$ |
| Timeliness Indicator | Number of tardy jobs | $N_{\text{tardy}} = \sum_{i=1}^{N_j} \mathbb{1}\{C_i > d_i\}$ |
| Robustness Index | Disturbance recovery capability | $\frac{C_{\max}^{\text{dynamic}}}{C_{\max}^{\text{static}}}$ |
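As a minimal illustration of how the first four indicators in Table 2 can be computed from a finished schedule, the following Python sketch assumes a hypothetical record format `(job, operation, machine, start, duration)` with times in hours. The function name `evaluate_schedule` and the two-job toy data are illustrative assumptions and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical schedule record: (job, operation, machine, start, duration), in hours.
ScheduledOp = Tuple[int, int, int, float, float]


def evaluate_schedule(
    schedule: List[ScheduledOp], due: Dict[int, float], n_machines: int
) -> Dict[str, float]:
    """Compute makespan, mean flow time, machine utilization and tardy-job count."""
    completion: Dict[int, float] = {}  # C_i: latest finish time over a job's operations
    busy: Dict[int, float] = {}        # total processing time booked on each machine
    for job, _op, machine, start, duration in schedule:
        completion[job] = max(completion.get(job, 0.0), start + duration)
        busy[machine] = busy.get(machine, 0.0) + duration

    c_max = max(completion.values())
    return {
        "makespan": c_max,
        "mean_flow_time": sum(completion.values()) / len(completion),
        # (1/M) * sum_k (busy_k / C_max); idle machines contribute zero
        "utilization": sum(busy.values()) / (n_machines * c_max),
        "n_tardy": sum(1 for job, c in completion.items() if c > due[job]),
    }


# Toy instance (assumed data): two jobs, two operations each, two machines.
toy_schedule = [
    (1, 1, 1, 0.0, 3.0),
    (1, 2, 2, 3.0, 2.0),
    (2, 1, 2, 0.0, 2.0),
    (2, 2, 1, 3.0, 4.0),
]
metrics = evaluate_schedule(toy_schedule, due={1: 6.0, 2: 6.0}, n_machines=2)
print(metrics)
```

In this toy case Job 2 finishes at 7 h against a due date of 6 h, so it is the only tardy job. Mean flow time is taken here as the mean completion time, which matches the tabulated formula when all jobs are released at time zero.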
Table 3. Machine resource classification and function description.

| Category | Machine IDs | Function Description | Special Constraints |
|---|---|---|---|
| Cleaning | M6, M10, M12 | Optical component cleaning and dust removal | Cleanliness level requirements |
| Precision coating | M4, M11 | Optical surface coating treatment | Temperature and humidity control |
| High-precision assembly | M3, M8, M9, M13, M18 | Opto-mechanical integrated assembly | Vibration isolation |
| Calibration test | M1, M2, M5, M15, M16, M20 | Optical performance calibration and testing | Constant temperature environment |
| Special handling | M7, M14, M17, M19 | Special process treatment | Dedicated equipment |
Table 4. Operation sequence and machine flexibility for a representative job (Job 1).

| Operation of Job 1 | Equipment (working hours) |
|---|---|
| Op1 | M2 (9.5 h), M6 (2.4 h), M10 (8 h) |
| Op2 | M9 (8.9 h) |
| Op3 | M1 (6.6 h), M17 (5.7 h), M19 (5.2 h) |
| Op4 | M6 (0.8 h) |
| Op5 | M7 (3.8 h), M19 (7.8 h), M1 (2 h) |
| Op6 | M15 (6.4 h), M20 (1.7 h), M5 (4.7 h) |
| Op7 | M15 (9.5 h), M20 (4.3 h), M5 (5.4 h) |
| Op8 | M17 (6.3 h), M5 (7.5 h) |
| Op9 | M13 (3.4 h), M9 (5.1 h) |
| Op10 | M15 (0.8 h), M16 (4.1 h) |
Note: This table details the process plan for Job 1 (from the 20 × 20 benchmark). Its 10 operations (Op1 to Op10) must be processed in sequence. Critically, the machines listed under each operation (e.g., M2, M6, M10 for Op1) are parallel, alternative resources for that operation, from which the scheduler selects one. Different jobs have distinct sequences and eligible machine sets.
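To make the machine-flexibility structure concrete, the sketch below encodes Job 1 from Table 4 as an ordered list of operations, each mapping its eligible machines to a processing time in hours. The helper `fastest_route` is a hypothetical illustration: it picks the shortest-time machine for every operation and ignores contention with other jobs, so its total is only a lower bound on the completion time of Job 1. It is not the scheduling policy used by BO-DRL, GA, or PSO.

```python
from typing import Dict, List, Tuple

# Job 1 (Table 4): ordered operations; each maps eligible machine -> processing time (h).
JOB1: List[Dict[str, float]] = [
    {"M2": 9.5, "M6": 2.4, "M10": 8.0},   # Op1
    {"M9": 8.9},                          # Op2
    {"M1": 6.6, "M17": 5.7, "M19": 5.2},  # Op3
    {"M6": 0.8},                          # Op4
    {"M7": 3.8, "M19": 7.8, "M1": 2.0},   # Op5
    {"M15": 6.4, "M20": 1.7, "M5": 4.7},  # Op6
    {"M15": 9.5, "M20": 4.3, "M5": 5.4},  # Op7
    {"M17": 6.3, "M5": 7.5},              # Op8
    {"M13": 3.4, "M9": 5.1},              # Op9
    {"M15": 0.8, "M16": 4.1},             # Op10
]


def fastest_route(job: List[Dict[str, float]]) -> Tuple[List[Tuple[str, float]], float]:
    """Choose the shortest-time machine per operation, ignoring machine contention."""
    route = [min(op.items(), key=lambda item: item[1]) for op in job]
    return route, sum(hours for _machine, hours in route)


route, lower_bound = fastest_route(JOB1)
for index, (machine, hours) in enumerate(route, start=1):
    print(f"Op{index}: {machine} ({hours} h)")
print(f"Contention-free lower bound for Job 1: {lower_bound:.1f} h")
```

Because the operations must run in sequence, no schedule can finish Job 1 earlier than this sum. A real scheduler may still prefer a slower machine for some operation when the fastest one is occupied by another job.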
Table 5. Comprehensive comparison of algorithm solution quality.

| Algorithm | Makespan (h) | Mean Flow Time (h) | Machine Utilization ± Std (%) | Number of Delayed Jobs |
|---|---|---|---|---|
| BO-DRL | 69.32 | 62.16 | 55.03 ± 11.39 | 1 |
| GA | 80.02 | 68.62 | 50.23 ± 10.30 | 1 |
| PSO | 93.06 | 77.16 | 43.80 ± 14.82 | 3 |
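The makespan reductions quoted in the abstract (13.37% against GA and 25.51% against PSO) follow directly from the Table 5 values. A short Python check, with the helper name `reduction_pct` chosen here purely for illustration:

```python
# Makespan values (hours) for the 20 x 20 benchmark, copied from Table 5.
MAKESPAN_H = {"BO-DRL": 69.32, "GA": 80.02, "PSO": 93.06}


def reduction_pct(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


vs_ga = reduction_pct(MAKESPAN_H["GA"], MAKESPAN_H["BO-DRL"])
vs_pso = reduction_pct(MAKESPAN_H["PSO"], MAKESPAN_H["BO-DRL"])
print(f"vs GA:  {vs_ga:.2f}%")   # (80.02 - 69.32) / 80.02
print(f"vs PSO: {vs_pso:.2f}%")  # (93.06 - 69.32) / 93.06
```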
Table 6. Algorithm scalability analysis.

| Metric | Algorithm | 5 × 5 | 10 × 10 | 15 × 15 | 20 × 20 | 25 × 25 | 50 × 50 | 100 × 100 |
|---|---|---|---|---|---|---|---|---|
| Makespan (hours) | BO-DRL | 46.18 | 69.54 | 69.31 | 69.32 | 73.77 | 90.19 | 102.46 |
| | GA | 46.55 | 71.79 | 75.38 | 80.02 | 89.44 | 114.95 | 134.73 |
| | PSO | 48.82 | 78.83 | 83.70 | 93.06 | 95.02 | 107.40 | 123.23 |
| Mean Flow Time (hours) | BO-DRL | 43.52 | 61.59 | 62.15 | 62.16 | 67.60 | 77.40 | 83.32 |
| | GA | 42.18 | 63.05 | 67.17 | 68.62 | 76.92 | 93.48 | 102.83 |
| | PSO | 44.98 | 69.07 | 75.94 | 77.16 | 75.51 | 87.13 | 98.49 |
| Machine Utilization ± Std (%) | BO-DRL | 62.61 ± 3.96 | 54.98 ± 12.27 | 54.42 ± 14.85 | 55.03 ± 11.39 | 51.81 ± 10.60 | 43.20 ± 13.34 | 38.29 ± 15.42 |
| | GA | 68.30 ± 12.02 | 54.69 ± 5.48 | 47.94 ± 10.49 | 50.23 ± 10.30 | 43.22 ± 9.51 | 33.97 ± 11.93 | 29.17 ± 11.52 |
| | PSO | 61.29 ± 17.04 | 54.66 ± 12.00 | 45.88 ± 12.14 | 43.80 ± 14.82 | 39.51 ± 11.88 | 36.44 ± 14.21 | 31.97 ± 14.21 |
| Number of Tardy Jobs | BO-DRL | 0 | 1 | 2 | 1 | 3 | 13 | 45 |
| | GA | 0 | 1 | 2 | 1 | 10 | 32 | 75 |
| | PSO | 0 | 2 | 3 | 3 | 8 | 20 | 67 |
| Disturbance Resistance (Solution Quality Retention Rate, %) | BO-DRL | 96.3 | 94.7 | 93.1 | 91.5 | 89.6 | 85.2 | 79.1 |
| | GA | 90.2 | 88.9 | 86.8 | 84.8 | 82.5 | 76.9 | 69.8 |
| | PSO | 87.6 | 85.3 | 83.2 | 81.2 | 78.7 | 72.4 | 64.3 |
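One way to read the scalability data is to compare BO-DRL with whichever baseline is stronger at each problem scale. The sketch below does this for the makespan rows of Table 6. The function name `gap_vs_best_baseline` and the choice of "best baseline" as the reference are assumptions made for illustration, not an analysis reported by the authors.

```python
from typing import Dict, List, Tuple

SCALES = ["5x5", "10x10", "15x15", "20x20", "25x25", "50x50", "100x100"]

# Makespan (hours) per problem scale, copied from Table 6.
MAKESPAN: Dict[str, List[float]] = {
    "BO-DRL": [46.18, 69.54, 69.31, 69.32, 73.77, 90.19, 102.46],
    "GA": [46.55, 71.79, 75.38, 80.02, 89.44, 114.95, 134.73],
    "PSO": [48.82, 78.83, 83.70, 93.06, 95.02, 107.40, 123.23],
}


def gap_vs_best_baseline(
    makespan: Dict[str, List[float]], proposed: str = "BO-DRL"
) -> List[Tuple[str, str, float]]:
    """For each scale, return (scale, best baseline, % makespan reduction vs. it)."""
    rows = []
    for idx, scale in enumerate(SCALES):
        baselines = [(name, vals[idx]) for name, vals in makespan.items() if name != proposed]
        best_name, best_value = min(baselines, key=lambda pair: pair[1])
        reduction = 100.0 * (best_value - makespan[proposed][idx]) / best_value
        rows.append((scale, best_name, reduction))
    return rows


gaps = gap_vs_best_baseline(MAKESPAN)
for scale, best_name, reduction in gaps:
    print(f"{scale:>8}: {reduction:5.2f}% shorter than {best_name}")
```

On these values GA is the stronger baseline up to 25 × 25, while PSO yields the shorter makespan of the two at 50 × 50 and 100 × 100.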
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

