A Hierarchical Reinforcement Learning Based Bi-Population Optimization Framework for Green Distributed Hybrid Flow-Shop Scheduling with Multiple Crane Transportation

Niu, Baotong; You, Gang; Liu, Huan

doi:10.3390/pr14091410


by Baotong Niu ¹,*, Gang You ² and Huan Liu ³

¹ China Telecom Yikang Technology Co., Ltd., Beijing 100033, China

² School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China

³ College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(9), 1410; https://doi.org/10.3390/pr14091410

Submission received: 2 April 2026 / Revised: 25 April 2026 / Accepted: 26 April 2026 / Published: 28 April 2026

(This article belongs to the Section Process Control, Modeling and Optimization)


Abstract

Distributed hybrid flow-shop scheduling problems (DHFSPs) are widely encountered in manufacturing systems. Their complexity increases significantly when multiple overhead cranes are used for material handling. This paper investigates a distributed hybrid flow-shop scheduling problem with multiple overhead crane transportation (DHFSP-MCT), aiming to simultaneously minimize makespan and total energy consumption (including machining and transport). A hierarchical reinforcement learning-based bi-population collaborative metaheuristic algorithm (HRL-BCMA) is proposed. In HRL-BCMA, an iterated greedy strategy is first adopted to generate an initial population. Then, a two-level reinforcement learning framework is designed: a high-level agent decides when to release jobs to the shop floor, while a low-level agent based on a graph isomorphism network selects improvement operators. Furthermore, a bi-population co-evolutionary framework and a knowledge-informed strategy are introduced to enhance solution quality and diversity. Experimental evaluations on both randomly generated instances and a real-world-inspired aluminum manufacturing case show that HRL-BCMA reduces makespan by 8.6% and total energy consumption by 12.3% on average compared to the best existing algorithm (CBMA) while achieving superior Pareto front coverage. These results demonstrate the effectiveness of the proposed method for green scheduling problems with crane transport constraints.

Keywords: hybrid flow-shop scheduling; crane transportation; hierarchical reinforcement learning; bi-population cooperative

1. Introduction

Bridge cranes, commonly referred to as overhead cranes, are critical equipment and are widely employed in manufacturing plants for the transportation, loading, and unloading of products or raw materials. Their structure comprises three main components (as depicted in Figure 1): a bridge spanning parallel overhead runways, a hoist and trolley traversing along the bridge and moving up and down, and parallel runways fixed on the top of the building structure. In manufacturing plants, it is typical to operate multiple bridge cranes on the same runway to facilitate the swift transfer of products during the production process [1,2]. However, because the cranes share the same runways, they cannot pass each other, which can cause interference and production interruptions and reduce overall efficiency.

Effective scheduling of bridge cranes is paramount for hybrid-flow assembly shops. These workshops represent a complex production layout where different types of products or components are assembled or processed in the same production shop [3]. Given the potential variations in the manufacturing processes among different products, crane scheduling must account for these differences to ensure smooth production flow while minimizing production downtime [4].

Therefore, research on bridge crane scheduling in hybrid flow assembly workshops holds significant importance [5]. Optimizing crane scheduling can maximize production efficiency, reduce production downtime, and mitigate interference among cranes. Through effective scheduling, management can better plan production schedules, optimize the utilization of production resources, and enhance overall production efficiency and quality [6].

In practice, studying crane scheduling issues necessitates consideration of the actual conditions in production workshops, equipment characteristics, and the complexity of production processes. Therefore, conducting in-depth research and analysis can provide effective solutions for production scheduling in mixed-flow assembly workshops, thereby maximizing production efficiency and resource utilization [7,8].

Crane scheduling is a scheduling problem with spatial constraints, namely crane interference, which makes crane scheduling more complex than typical job-shop problems [9]. A significant amount of literature focuses on gantry crane scheduling problems. Gantry cranes are supported by separate rigid structures and operated by individual trolleys on fixed tracks. Unlike overhead cranes in manufacturing plants, gantry cranes are primarily used for loading and unloading at construction sites rather than transporting items from one track position to another [10]. Therefore, it is challenging to apply the results of gantry crane scheduling problems to the scheduling of overhead cranes in manufacturing plants.

Specifically, Du et al. considered a flexible job shop with a single crane and fixed transport speed, optimizing only makespan [3]. Liu et al. focused on energy consumption but assumed that there was a single crane and static job release [5]. In contrast, our work simultaneously addresses: (i) distributed factories with heterogeneous machine portfolios; (ii) multiple cranes operating on shared runways with no-crossing constraints; (iii) crane speed selection as a decision variable affecting both makespan and energy; and (iv) dynamic job release controlled by a high-level RL agent. These differences make DHFSP-MCT substantially more challenging and require the proposed hierarchical RL approach.

In this paper, an HRL-BCMA algorithm is designed to solve the DHFSP-MCT problem. The main contributions of this paper are as follows.

Modeling DHFSP with multi-crane interference.
Hierarchical RL with GIN for operator selection.
Bi-population co-evolution for multi-objective trade-off.

The remainder of the paper is organized as follows. Section 2 reviews the recent literature on this topic. Section 3 details the proposed DHFSP-MCT problem. Section 4 details the HRL-BCMA algorithm, including the encoding and decoding techniques. The experimental results are reported in Section 5. Finally, the conclusions and perspectives for future research are summarized in Section 6.

2. Literature Review

In recent decades, scheduling problems in manufacturing systems, particularly those involving material handling equipment, have received increasing attention. This section reviews the relevant literature from three perspectives: crane scheduling, integrated production and transportation scheduling, and reinforcement learning-based optimization methods.

2.1. Crane Scheduling Problems

Crane scheduling has been extensively studied in various contexts, including container terminals, manufacturing workshops, and construction sites. Fibrianto et al. [11] investigated the job sequencing problem of overhead shuttle cranes in automated container terminals, proposing a heuristic approach that minimizes tardiness by separating jobs into main and marshaling tasks. Vallada et al. [12] examined yard crane scheduling in automated container yards, considering the complex interactions with other terminal systems. They developed heuristics with local search procedures and demonstrated the limitations of exact methods for large-scale instances. Xie et al. [13] addressed the interference issue between two cranes by sequencing loading operations and analyzing the computational complexity of the proposed model. More recently, Zhang et al. [14] optimized the coordinated operations of automated guided vehicles (AGVs) and double yard cranes in automated terminals using a mixed-integer programming model that explicitly considers crane interference.

Despite these contributions, most existing crane scheduling studies share a common limitation: they assume a predetermined production schedule and adjust crane operations accordingly [15,16,17]. This hierarchical approach treats crane capacity as unlimited during production scheduling, often leading to job waiting times and reduced overall efficiency [18,19]. The interdependence between production processing and material transportation is largely overlooked.

2.2. Integrated Production and Crane Scheduling

Recognizing the limitations of hierarchical approaches, some researchers have attempted to integrate crane transportation into production scheduling. Liu et al. [5] addressed the integrated optimization of flexible job shop scheduling and crane transportation, considering comprehensive energy consumption. Li et al. [7] proposed a hybrid iterated greedy algorithm for a flexible job shop problem with crane transportation, demonstrating the benefits of integrated scheduling. Du et al. [4] investigated the distributed flexible job shop scheduling problem with crane transportations, developing a hybrid estimation-of-distribution algorithm.

However, these integrated approaches predominantly focus on single-crane scenarios or assume that multiple cranes operate independently without mutual interference [20]. In real-world manufacturing environments, particularly in hybrid flow shops, multiple overhead cranes share the same runway and cannot pass each other, creating complex spatial constraints and interference patterns [1,2]. The distributed nature of modern manufacturing, with multiple factories operating in parallel, further compounds this complexity. To date, the distributed hybrid flow shop scheduling problem with multi-crane transportation (DHFSP-MCT) remains understudied, with no existing work simultaneously addressing factory assignment, job sequencing, machine selection, and multi-crane coordination under interference constraints.

2.3. Reinforcement Learning in Scheduling

Meta-heuristics have been widely applied to production scheduling problems, including local search [21], tabu search [22], simulated annealing [23], genetic algorithms [24], ant colony optimization [25], and particle swarm optimization [26]. While these methods can find near-optimal solutions within reasonable timeframes, they typically rely on fixed search strategies that do not adapt to problem characteristics during the optimization process [27].

Recently, reinforcement learning (RL) has emerged as a promising direction for adaptive scheduling. Du et al. [3] proposed a reinforcement learning approach for flexible job shop scheduling with crane transportation and setup times, demonstrating the potential of learning-based methods. Zhang et al. [9] developed a Q-learning-based hyper-heuristic evolutionary algorithm for distributed flexible job shop scheduling with crane transportation, where RL is used to select low-level heuristics dynamically. Zhao et al. [17] presented a reinforcement learning-driven cooperative meta-heuristic algorithm for energy-efficient distributed no-wait flow-shop scheduling [28].

Despite these advances, existing RL-based scheduling methods typically employ a single-layer RL agent that makes decisions at a fixed level of abstraction. This flat architecture struggles to capture the hierarchical nature of complex scheduling decisions, such as the interplay between factory assignment, job sequencing, machine selection, and crane coordination. Furthermore, most approaches focus on single-objective optimization, whereas real-world applications require the simultaneous optimization of multiple conflicting objectives, such as makespan and energy consumption.

2.4. Research Gaps and Contributions

While existing studies have made significant progress in integrating crane transportation with shop scheduling, several critical gaps remain unaddressed. First, most works assume a single crane or non-interfering cranes, neglecting the realistic constraints of multiple cranes sharing overlapping runways, where cranes cannot pass each other and must maintain safety distances. Second, crane speed is typically treated as constant, overlooking the trade-off between transport duration and energy consumption. Third, existing methods are primarily designed for single-factory settings; distributed hybrid flow shops with heterogeneous factory structures remain largely unexplored in the context of crane scheduling. Fourth, dynamic job release—where jobs arrive over time rather than being all available at time zero—has rarely been considered, despite its practical relevance in just-in-time production environments.

To bridge these gaps, this paper makes the following contributions. First, we formally define the DHFSP-MCT problem with a mixed-integer linear programming model that captures the complex interactions between production processing and multi-crane transportation. Second, we propose a hierarchical reinforcement learning-based bi-population collaborative meta-heuristic algorithm (HRL-BCMA), where a bi-level Deep Q-Network (DQN) framework naturally aligns with the hierarchical decision structure of the problem. Unlike flat RL approaches, our high-level agent learns when to release jobs for processing, while the low-level agent learns to select improvement operators based on solution states. Third, we introduce a bi-population co-evolutionary strategy that maintains separate populations for leaders and followers, enabling balanced optimization of makespan and total energy consumption. Fourth, we design a knowledge-informed strategy that leverages problem-specific features, such as crane positions and job due dates, to guide the search process. This integrated framework represents a fundamental advancement over existing methods, which combine existing ideas incrementally without addressing the hierarchical nature of multi-crane scheduling problems.

3. The Proposed DHFSP Problem with Multi-Crane Transportation

In modern manufacturing industries, overhead cranes are widely used in places such as manufacturing plants and ports; they are responsible for loading and unloading goods, as well as transporting items. However, with the increasingly complex demands of material scheduling, effectively scheduling and managing these overhead cranes has become a prominent issue.

To address the shop scheduling problem with crane transportation, a model of a multi-crane scheduling problem involving overhead crane transportation is proposed. A schematic diagram of the DHFSP-MCT problem is shown in Figure 2. The model is abstracted from an aluminum production enterprise. Here, multi-crane refers to a transportation system composed of one large crane and one small crane, which work together to load, unload, and transport jobs. Combining multiple cranes gives the crane system greater flexibility and efficiency, enabling it to adapt to logistics transportation needs of various scales and types. The multi-crane transportation system is also subject to constraints. For example, cranes cannot cross over one another.

The transportation of small cranes and large cranes affects each other. Therefore, it is necessary to consider this impact in the job allocation and scheduling solution and arrange the transportation sequence of cranes in such a way as to avoid conflicts and delays. Additionally, although there is a relationship between large and small cranes, they are not mutually exclusive. Multiple cranes cooperatively transport the processed jobs to accomplish the transportation tasks. This parallel transportation method can fully utilize the functions and resources of the crane system to improve transportation efficiency.

The complexity of multi-crane scheduling for overhead cranes is non-negligible. A reasonable scheduling model and solutions can be found through the detailed analysis of the problem model. The efficiency of the logistics transportation and production scheduling can be improved using the crane transportation system. The multi-crane shop scheduling problem might exist in different production scenarios within the modern production industry.

In the DHFSP-MCT, jobs are processed on machines and transported by overhead cranes, and multiple cranes are available for transportation.

If path conflicts arise between cranes, one must wait for the other to finish. Thus, the crane assignment for each job must be planned to minimize unnecessary movements. Job processing sequences and crane transportation paths are interdependent. Both machining and transportation impact makespan and total energy consumption. Higher crane speeds shorten makespan but increase energy use.

This paper aims to simultaneously minimize makespan and total energy consumption from machining and transportation.

The DHFSP-MCT involves four sub-problems: (1) assigning jobs to factories; (2) sequencing jobs within each factory; (3) selecting machines at each stage; and (4) assigning cranes and speeds for job transportation.

Assumptions of the DHFSP-MCT:

Assumption 1. Jobs arrive dynamically and are held in a buffer. The release of jobs into the shop floor is a decision variable. All jobs must follow a predefined sequence through all stages.

Assumption 2. All machines are available at time zero and remain operational throughout the production horizon.

Assumption 3. All of the operations of a job must be completed in a single factory.

Assumption 4. The structures of factories are heterogeneous.

Assumption 5. Each job is processed on one machine at a time from the available set.

Assumption 6. Each machine processes one job at a time.

Assumption 7. Preemption is not allowed.

Assumption 8. Two overhead cranes transport jobs between machines.

Assumption 9. The first operation stage does not require crane service.

Assumption 10. Cranes handle one transportation operation at a time (no overlapping).

Assumption 11. The machine location for the first operation is the crane’s initial position.

Assumption 12. Cranes operate continuously without shutdown.

Assumption 13. Sufficient intermediate buffers exist for completed jobs awaiting cranes.

Assumption 14. The arrival time for the next job must match or follow the next machine’s idle time; otherwise, the crane waits.

Assumption 15. Energy consumption includes machine processing and transportation.

Assumption 16. Different crane speeds result in different energy consumption levels. Higher transport speed increases motor power draw and dynamic friction, leading to higher energy consumption per unit time. The speed can be reduced during waiting times to lower energy consumption.

The above assumptions reflect a balance between modeling fidelity and computational tractability. Assumptions (1)–(3) and (5)–(7) are standard in distributed scheduling and are reasonable for make-to-order production environments where job preemption is not permitted. Assumption (4) (heterogeneous factories) captures real-world scenarios where different plants possess distinct machine portfolios. Assumption (8) (two cranes) is specific to the target aluminum manufacturing plant; however, the model can be extended to more cranes by generalizing the safety distance constraint. Assumption (9) (no crane at first stage) holds when raw materials are already present at machine locations, which is a common setup in assembly shops. Assumptions (10)–(14) define operational constraints that prevent deadlocks and ensure deterministic behavior. Notably, Assumption (16) (speed-dependent energy) is critical for green scheduling; while it assumes that energy increases monotonically with speed, real cranes may exhibit non-linear efficiency curves at very low speeds—this simplification is acceptable for the typical operating range considered here. Assumptions (11) and (14) introduce idle waiting, which is realistic but may slightly overestimate energy consumption when opportunistic repositioning could occur. Future work could relax Assumptions (8) and (13) to consider varying numbers of cranes and finite buffer capacities.

Notation: The notation is shown in Table 1.

It is assumed that the positions of all machines in each factory are fixed and known.

Objective:

\min F = (f_{1}, f_{2})

(1)

f_{1} = \max_{i \in I} C_{i}

(2)

f_{2} = E_{mp} + E_{ct}

(3)

$E_{mp}$ represents the energy consumption of production machines, while $E_{ct}$ represents the energy consumption of crane transportation.

s.t.

E_{mp} = E_{1} + E_{2}

(4)

E_{1} = \sum_{j \in J} \sum_{l \in F} \sum_{i \in I} \sum_{k \in M_{j, l}} x_{i, j, l, k} \times p_{i, j, l, k} \times P_{jl}^{W}

(5)

E_{2} = \sum_{i \in I} \sum_{r \in I} \sum_{j \in J} \sum_{l \in F} \sum_{k \in M_{j, l}} y_{i, r, j, l, k} \times (S_{r, j} - C_{i, j}) \times P_{jl}^{I}

(6)

$P_{jl}^{W}$ represents the processing energy consumption per unit of time for a certain stage; $P_{jl}^{I}$ represents the idle energy consumption per unit of time for a certain stage. $E_{1}$ represents processing energy consumption, while $E_{2}$ represents machine idle energy consumption. The corresponding process is shown in Figure 3.
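As a rough illustration of Equations (4)-(6), the following Python sketch computes the processing and idle energy of a single machine from its operation intervals. The function name, the interval layout, and the power values are hypothetical and are not taken from the paper; the idle term mirrors the factor $(S_{r, j} - C_{i, j})$ for consecutive jobs on the same machine.

```python
from typing import List, Tuple


def machine_energy(intervals: List[Tuple[float, float]],
                   p_work: float, p_idle: float) -> Tuple[float, float]:
    """Return (processing energy, idle energy) for one machine.

    intervals: (start, end) times of the operations run on this machine.
    p_work / p_idle: assumed power per unit time while processing / idling.
    """
    ordered = sorted(intervals)
    # Processing energy: processing time times working power, as in Eq. (5).
    e_proc = sum((end - start) * p_work for start, end in ordered)
    # Idle energy: gap between one completion and the next start, as in Eq. (6).
    e_idle = sum((nxt[0] - cur[1]) * p_idle
                 for cur, nxt in zip(ordered, ordered[1:]))
    return e_proc, e_idle


# Hypothetical machine with three operations and one idle gap of 2 time units.
e1, e2 = machine_energy([(0, 4), (6, 9), (9, 12)], p_work=5.0, p_idle=1.0)
print(e1, e2)  # 50.0 2.0
```

Summing these two values over all machines and factories would give $E_{mp}$ under the stated assumptions.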

E_{ct} = \sum_{l \in F} \sum_{c \in C_{l}} \sum_{i_{1}, i_{2} \in I} \sum_{j_{1}, j_{2} \in J} z_{i_{1} i_{2}}^{j_{1} j_{2}} (c) \times o_{i_{1}, l} \times o_{i_{2}, l} \times [E_{c}^{p\xi} \times \delta_{i_{2} c}^{j_{2}} + E_{c}^{pe} \times (S_{i_{2}, j_{2}} - S_{i_{1}, j_{1}} - \delta_{i_{2} c}^{j_{2}})]

(7)

$z_{i_{1} i_{2}}^{j_{1} j_{2}} (c_{l})$ represents the scenario where stage $j_{1}$ of job $i_{1}$ and stage $j_{2}$ of job $i_{2}$ are both transported by crane c. $E_{c}^{p\xi}$ denotes the energy consumption of crane c for transportation at a certain speed, $E_{c}^{pe}$ represents the energy consumption of crane c for idle running, and $\delta_{i_{2} c}^{j_{2}}$ indicates the transportation time of crane c.

\delta_{ic}^{j} = \tau_{i}^{j}(\mathrm{lift}) + \tau_{i}^{j}(\mathrm{drop}) + \max \left\{ \frac{|X_{M_{i}^{j}} - X_{M_{i}^{j-1}}|}{v_{x}^{\xi}}, \frac{|Y_{M_{i}^{j}} - Y_{M_{i}^{j-1}}|}{v_{y}^{\xi}} \right\}

(8)

$\tau_{i}^{j}(\mathrm{lift})$ represents the time the crane needs to lift job i at stage j, and $\tau_{i}^{j}(\mathrm{drop})$ represents the time the crane needs to drop the job. $X_{M_{i}^{j}}$ denotes the position of the processing machine for job i at stage j, and $X_{M_{i}^{j-1}}$ represents the position of the processing machine for the previous stage of job i; the $Y$ terms are the corresponding positions in the y direction. $v_{x}^{\xi}$ indicates the transportation speed in the x direction, while $v_{y}^{\xi}$ represents the transportation speed in the y direction.
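Equation (8) can be illustrated with a short Python sketch. The function name, coordinates, and speed values below are hypothetical. The maximum over the two axes suggests that the x and y movements proceed simultaneously, so the slower axis determines the travel time; this reading is an interpretation of the formula, not a statement from the paper.

```python
from typing import Tuple


def transport_time(lift: float, drop: float,
                   src: Tuple[float, float], dst: Tuple[float, float],
                   vx: float, vy: float) -> float:
    """Transport time of Eq. (8): lift + drop + max of the per-axis travel times."""
    travel_x = abs(dst[0] - src[0]) / vx
    travel_y = abs(dst[1] - src[1]) / vy
    return lift + drop + max(travel_x, travel_y)


# Hypothetical move between two machines at a faster and a slower speed level.
t_fast = transport_time(2.0, 2.0, (0.0, 0.0), (30.0, 8.0), vx=3.0, vy=1.0)
t_slow = transport_time(2.0, 2.0, (0.0, 0.0), (30.0, 8.0), vx=1.5, vy=0.5)
print(t_fast, t_slow)  # 14.0 24.0
```

Halving both speeds doubles only the travel term, while the lift and drop times are unchanged, which is why the speed level trades transport time against energy.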

Assume that the energy consumption during crane transitions is included in the idle energy consumption of the machines.

The total completion time for job i:

C_{i} = S_{i, m} + \sum_{l \in F} \sum_{k \in M_{m, l}} x_{i, m, l, k} \times p_{i, m, l, k}, \forall i \in I

(9)

Each job can only be assigned to one factory for processing:

\sum_{l \in F} o_{i, l} = 1, \forall i \in I

(10)

Each job can only be processed on one machine at each processing stage:

\sum_{l \in F} \sum_{k \in M_{j, l}} x_{i, j, l, k} = 1, \forall i \in I, j \in J

(11)

For each job, the processing start time at any stage must not be earlier than the completion time of the previous stage:

S_{i, j} + \sum_{l \in F} \sum_{k \in M_{j, l}} x_{i, j, l, k} \times p_{i, j, l, k} \leq S_{i, j + 1}, \forall i \in I, (j, j + 1) \in J

(12)

The following two constraints ensure that the processing intervals of jobs processed on the same machine do not overlap:

\begin{matrix} S_{i, j} - (S_{r, j} + \sum_{l \in F} \sum_{k \in M_{j, l}} x_{r, j, l, k} \times p_{r, j, l, k}) \\ + Q (2 + y_{i, r, j, l, k} - x_{i, j, l, k} - x_{r, j, l, k}) \geq 0, \forall i, r \in I, j \in J \end{matrix}

(13)

\begin{matrix} S_{r, j} - (S_{i, j} + \sum_{l \in F} \sum_{k \in M_{j, l}} x_{i, j, l, k} \times p_{i, j, l, k}) \\ + Q (3 - y_{i, r, j, l, k} - x_{i, j, l, k} - x_{r, j, l, k}) \geq 0, \forall i, r \in I, j \in J \end{matrix}

(14)

The processing start time for any job at any stage is non-negative:

S_{i, j} \geq 0, \forall i \in I, j \in J

(15)

x_{i, j, l, k}, y_{i, r, j, l, k}, z_{i_{1} i_{2}}^{j_{1} j_{2}} (c_{l}) \in \{0, 1\}

(16)
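The scheduling constraints above can be checked on a decoded schedule with a few lines of Python. This is a minimal sketch under stated assumptions: the dictionary layout, the machine labels, and the function name are hypothetical, and crane transport times between stages are ignored, so it only covers stage precedence (Equation (12)), machine non-overlap (Equations (13) and (14)), and non-negative start times (Equation (15)).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (job, stage) -> (machine id, start time, processing time)
Schedule = Dict[Tuple[int, int], Tuple[str, float, float]]


def violations(schedule: Schedule) -> List[str]:
    """Return human-readable descriptions of violated constraints."""
    issues = []
    by_machine = defaultdict(list)
    for (job, stage), (machine, start, proc) in schedule.items():
        if start < 0:  # Eq. (15)
            issues.append(f"negative start: job {job}, stage {stage}")
        nxt = schedule.get((job, stage + 1))
        if nxt is not None and start + proc > nxt[1]:  # Eq. (12)
            issues.append(f"precedence: job {job}, stage {stage}")
        by_machine[machine].append((start, start + proc, job))
    for machine, ops in by_machine.items():
        ops.sort()
        # After sorting by start time, any overlap shows up between neighbours.
        for (s1, e1, j1), (s2, e2, j2) in zip(ops, ops[1:]):
            if s2 < e1:  # Eqs. (13)-(14)
                issues.append(f"overlap on {machine}: jobs {j1} and {j2}")
    return issues


good = {(1, 1): ("M1", 0, 3), (1, 2): ("M3", 4, 2),
        (2, 1): ("M1", 3, 2), (2, 2): ("M3", 6, 1)}
bad = {(1, 1): ("M1", 0, 3), (2, 1): ("M1", 2, 2)}
print(violations(good))  # []
print(violations(bad))   # ['overlap on M1: jobs 1 and 2']
```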

The distance between two adjacent cranes must be at least the safety distance d at every time t, so a crane cannot cross over an adjacent crane on the shared runway:

x_{c + 1}^{t} - x_{c}^{t} \geq d

(17)
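A minimal Python sketch of the check in Equation (17) is given below. It assumes that crane positions are sampled at discrete time steps and that cranes are indexed in their fixed order along the shared runway; the function names and the sample trajectory are hypothetical.

```python
from typing import Sequence


def respects_safety_distance(positions: Sequence[float], d: float) -> bool:
    """Check Eq. (17) at one time instant.

    positions[c] is the runway coordinate of crane c, with cranes indexed
    in their fixed order along the runway.
    """
    return all(right - left >= d
               for left, right in zip(positions, positions[1:]))


def trajectory_is_safe(trajectory: Sequence[Sequence[float]], d: float) -> bool:
    """Check the safety distance at every sampled time step."""
    return all(respects_safety_distance(p, d) for p in trajectory)


# Two cranes: crane 0 approaches a stationary crane 1 over three time steps.
trajectory = [[0.0, 10.0], [4.0, 10.0], [8.0, 10.0]]
print(trajectory_is_safe(trajectory, d=2.0))  # True
print(trajectory_is_safe(trajectory, d=3.0))  # False
```

Because the difference must stay at or above d, the ordering of the cranes can never flip, which is how the constraint also rules out crossing.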

4. The Proposed HRL-BCMA Algorithm

4.1. Solution Representation

The method of encoding and decoding plays a crucial role in determining the efficiency and effectiveness of the scheduling algorithm. A commonly used strategy is the full sequence method, in which the entire sequences of operations and transportations are encoded as a single entity.

In the full sequence method, the job sequence represents the order of job processing for each factory. For example, if there are five jobs to be processed in a factory, the job sequence may be represented as 3, 2, 5, 1, 4, indicating that job 3 is processed first, followed by job 2, and so on. Next, the crane sequence pertains to the sequence of cranes used for transporting materials during different stages of the manufacturing process. It is important to note that there are constraints associated with crane operations, such as the limited capacity of each crane and the need to avoid collisions between cranes. Furthermore, the sequence of crane speed is also a critical factor. In the DHFSP-MCT, the speed of crane transportation is categorized into three levels: 1, 2, and 3. Speed 1 represents the fast speed and speed 3 is the slow speed. The crane speed sequence dictates the speed at which each crane operates during the transportation process. Fast speed means a short transportation time but potentially high energy consumption.

The full sequence encoding strategy provides a comprehensive representation of the scheduling solution, incorporating the job sequence, crane sequence, and crane speed sequence. The scheduling algorithm can be made efficient and effective through careful design of encoding and decoding methods. For the DHFSP-MCT problem, the full sequence encoding strategy can transfer the scheduling solutions to the codes. For instance (the detailed production data is shown in Table 2), a scheduling solution can be encoded through the following sequences: the job sequence is 3, 2, 5, 1, 4; the crane sequence is 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2; and the crane speed sequence is 0, 0, 0, 0, 0, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. These sequences together provide a detailed description of the scheduling solution and allow for the development of an optimized scheduling algorithm tailored to specific manufacturing environments.
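The three sequences in this example can be held in a small data structure, sketched here in Python. The class and method names are hypothetical. The sketch also assumes that an entry of 0 in the crane and speed sequences marks an operation that needs no crane (for instance a first-stage operation, cf. Assumption 9); this reading is inferred from the example values and is not stated explicitly in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Encoding:
    job_seq: List[int]    # processing order of jobs in a factory
    crane_seq: List[int]  # 0 = no crane (assumed), 1 or 2 = crane ID
    speed_seq: List[int]  # 0 = no transport (assumed), 1 = fast ... 3 = slow

    def check(self) -> None:
        """Raise ValueError if the three sequences are inconsistent."""
        if len(set(self.job_seq)) != len(self.job_seq):
            raise ValueError("job sequence contains duplicates")
        if len(self.crane_seq) != len(self.speed_seq):
            raise ValueError("crane and speed sequences differ in length")
        for crane, speed in zip(self.crane_seq, self.speed_seq):
            if crane not in (0, 1, 2) or speed not in (0, 1, 2, 3):
                raise ValueError("unknown crane ID or speed level")
            if (crane == 0) != (speed == 0):
                raise ValueError("crane 0 and speed 0 must coincide")


# The sequences quoted in the example above.
example = Encoding(
    job_seq=[3, 2, 5, 1, 4],
    crane_seq=[0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2],
    speed_seq=[0, 0, 0, 0, 0, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
)
example.check()  # passes silently for the example sequences
```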

According to the processing time of each workpiece on different machines, the Gantt chart can be generated as shown in Figure 4.

4.2. HRL-BCMA Algorithm Framework

The proposed HRL-BCMA algorithm is built upon a hierarchical reinforcement learning architecture that decomposes the complex DHFSP-MCT scheduling problem into two levels of decision-making. As illustrated in Figure 5, the framework consists of two Deep Q-Network (DQN) agents operating at different temporal and strategic scales.

The high-level agent operates at a coarse temporal resolution (144 decision intervals) and determines when to release accumulated jobs from the buffer for processing. This agent addresses the factory assignment and job sequencing sub-problems by creating static sub-problems at each decision point. Its actions influence the workload distribution across factories and the overall production rhythm.

The low-level agent operates at a fine-grained level, determining how to improve a given schedule. It selects improvement operators (e.g., swapping or relocating jobs) based on the current solution state. The low-level agent employs a Graph Isomorphism Network (GIN) as its policy network to encode the structural information of scheduling solutions, enabling generalization across problem instances of varying scales.

The two agents interact through a hierarchical feedback loop. The high-level agent defines the sub-problem context by deciding which jobs are available for processing at each decision point. The low-level agent then optimizes the schedule for that sub-problem using iterative improvement operations. The resulting solution quality influences the reward signals that update both agents’ policies. This cooperative mechanism ensures that strategic decisions (when to release jobs) and tactical decisions (how to sequence and assign them) are jointly optimized.

The hierarchical decomposition is justified by the natural temporal separation in DHFSP-MCT: high-level release decisions affect workload over 10–30 min horizons, while low-level operator selection affects schedule structure at the job level. Removing either agent degrades performance, confirming that both levels are necessary.

4.3. Initialization Strategy

For the initialization strategy, the scheduling sequence and factory allocation sequence are first generated through the iterated greedy (IG) method without considering the overhead crane constraints. Based on this initial solution, the allocation and speed of the overhead crane transportation are then considered, and the overhead crane allocation sequence and overhead crane speed sequence are generated accordingly. The detailed process of the initialization strategy is illustrated in Algorithm 1.

Algorithm 1: Initialization Strategy

4.4. Bi-Population Co-Evolution Strategy

The bi-population co-evolutionary strategy is proposed based on the mechanism of LaF (Leaders and Followers). These two separate populations evolve cooperatively as shown in Algorithm 2. In the proposed bi-population co-evolutionary approach, individuals in the Leaders and Followers populations evolve through neighborhood search operations based on hierarchical reinforcement learning. The generated offspring individuals are then compared with their counterparts in the Leaders and Followers. Then the superior individuals are retained based on optimization operations for both makespan and total energy consumption objectives. After a fixed number of iterations, the Leader and Follower populations are merged and compared with the solutions in the archive set. During the comparison process, non-dominated solution sets are retained by computing multi-objective indicators.
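The archive update at the end of this procedure can be sketched in Python as a plain Pareto-dominance filter over (makespan, total energy) pairs. This is a simplification under stated assumptions: the multi-objective indicators used in the paper are not specified in this section, and the function names and objective values below are hypothetical.

```python
from typing import List, Tuple

Objective = Tuple[float, float]  # (makespan, total energy), both minimized


def dominates(a: Objective, b: Objective) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def update_archive(archive: List[Objective],
                   candidates: List[Objective]) -> List[Objective]:
    """Merge candidates into the archive and keep non-dominated points."""
    merged = list(dict.fromkeys(archive + candidates))  # drop exact duplicates
    return [p for p in merged if not any(dominates(q, p) for q in merged)]


archive = [(100.0, 50.0), (90.0, 60.0)]
merged_population = [(95.0, 48.0), (90.0, 60.0), (120.0, 70.0)]
print(update_archive(archive, merged_population))
# [(90.0, 60.0), (95.0, 48.0)]
```

In the example, (100, 50) is removed because (95, 48) is better in both objectives, while (90, 60) and (95, 48) remain since neither dominates the other.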

Algorithm 2: Bi-Population Co-Evolution Strategy

4.5. Low-Level Agent

4.5.1. Workflow of Low-Level Agent

The low-level agent starts from the initial population and iteratively improves solutions by applying operators chosen according to the current solution state. The improvement operators it selects are used as reconstruction operators, and improvement and reconstruction alternate until the maximum number of iterations or the time limit is reached. The best solution is passed to the next strategy, so the final solution is at least as good as the initial one. By transferring knowledge from previously solved problems, the agent improves solution quality efficiently. The operator selection process is modeled as a Markov Decision Process (MDP).

4.5.2. Markov Decision Process

The operator selection process in the low-level agent is formulated as a Markov Decision Process (MDP). The MDP components are defined as follows.

State Space.

The current solution state is represented as a graph $G = (V, E)$ as described in Section 4.5.3. The node features for each job $i \in I$ include:

Job identifier and current processing stage j;
Assigned factory l and machine k;
Assigned crane ID $c \in \{1, 2\}$ and speed level $\xi \in \{1, 2, 3\}$;
Contribution to makespan $\Delta C_{i}$, defined as the sum of processing, waiting, and transport time for job i;
Contribution to total energy consumption $\Delta E_{i}$, defined as the sum of machining and transport energy for job i;
Cumulative waiting time due to crane or machine unavailability.

The edge features between jobs i and r encode:

Precedence constraints (binary: 1 if job i must precede job r);
Whether they are assigned to the same factory;
Whether they share the same crane;
Crane path distance between the machine locations of the two jobs;
Interference potential (binary: 1 if the crane paths of the two jobs may cross).

The full state $s_{t}$ at decision step $t$ is the graph embedding vector $h_{G} \in \mathbb{R}^{128}$ produced by the Graph Isomorphism Network (GIN) from the graph representation.

Action Space

Four improvement operators are available to the low-level agent:

a₁: Inner-factory swap: Swap the positions of two randomly selected jobs within the same factory.
a₂: Inner-factory relocation: Remove a randomly selected job and insert it into a different position within the same factory.
a₃: Inter-factory swap: Swap two jobs from two different factories.
a₄: Inter-factory relocation: Remove a job from its current factory and insert it into a different factory at a position that minimizes the local makespan increase.

Each action preserves the feasibility of crane assignment and speed selection; if an action leads to crane interference, the crane assignment is locally repaired using Rule 1 (Section 4.5.4).
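The four operators can be sketched on a simplified encoding in which each factory holds an ordered list of job IDs. This is an illustrative sketch only: the `Schedule` encoding, the function names, and the `cost` callback (a stand-in for the local makespan of a factory sequence) are assumptions, and the crane repair step (Rule 1) is not modeled.

```python
import random
from typing import Callable, Dict, List

Schedule = Dict[int, List[int]]  # factory id -> ordered list of job ids


def inner_swap(s: Schedule, rng: random.Random) -> None:
    """a1: swap two jobs inside one factory that holds at least two jobs."""
    eligible = [f for f, jobs in s.items() if len(jobs) >= 2]
    if not eligible:
        return
    f = rng.choice(eligible)
    i, j = rng.sample(range(len(s[f])), 2)
    s[f][i], s[f][j] = s[f][j], s[f][i]


def inner_relocate(s: Schedule, rng: random.Random) -> None:
    """a2: remove a job and reinsert it at a different position in the same factory."""
    eligible = [f for f, jobs in s.items() if len(jobs) >= 2]
    if not eligible:
        return
    f = rng.choice(eligible)
    i = rng.randrange(len(s[f]))
    job = s[f].pop(i)
    positions = [p for p in range(len(s[f]) + 1) if p != i]  # p == i would undo the move
    s[f].insert(rng.choice(positions), job)


def inter_swap(s: Schedule, rng: random.Random) -> None:
    """a3: swap two jobs taken from two different non-empty factories."""
    eligible = [f for f, jobs in s.items() if jobs]
    if len(eligible) < 2:
        return
    f1, f2 = rng.sample(eligible, 2)
    i, j = rng.randrange(len(s[f1])), rng.randrange(len(s[f2]))
    s[f1][i], s[f2][j] = s[f2][j], s[f1][i]


def inter_relocate(s: Schedule, rng: random.Random,
                   cost: Callable[[List[int]], float]) -> None:
    """a4: move a job to another factory at the position with the lowest cost."""
    donors = [f for f, jobs in s.items() if jobs]
    if not donors or len(s) < 2:
        return
    src = rng.choice(donors)
    job = s[src].pop(rng.randrange(len(s[src])))
    dst = rng.choice([f for f in s if f != src])
    best = min(range(len(s[dst]) + 1),
               key=lambda p: cost(s[dst][:p] + [job] + s[dst][p:]))
    s[dst].insert(best, job)


rng = random.Random(42)
schedule = {0: [1, 2, 3], 1: [4, 5], 2: [6]}
toy_cost = lambda seq: sum((k + 1) * job for k, job in enumerate(seq))  # hypothetical cost
for op in (inner_swap, inner_relocate, inter_swap):
    op(schedule, rng)
inter_relocate(schedule, rng, toy_cost)
all_jobs = sorted(j for jobs in schedule.values() for j in jobs)
```

Every operator only permutes or moves jobs, so the set of scheduled jobs is unchanged after any sequence of actions.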

Reward Function

Let $F(s) = (f_{1}(s), f_{2}(s))$ denote the vector of makespan and total energy consumption for solution $s$, where $f_{1}$ is defined in Equation (2) and $f_{2}$ in Equation (3). After the first iteration of the low-level agent, we set a baseline objective vector:

$F_{base} = F(s_{\text{after first iteration}}).$

For each subsequent iteration $k$, suppose a sequence of $L$ actions is executed, transforming the solution from $s_{k}$ to $s_{k+L}$. The improvement in objective values is:

$\Delta F = F(s_{k}) - F(s_{k+L}) = (\Delta f_{1}, \Delta f_{2}),$

where $\Delta f_{1} = f_{1}(s_{k}) - f_{1}(s_{k+L})$ and $\Delta f_{2} = f_{2}(s_{k}) - f_{2}(s_{k+L})$. A positive $\Delta f_{1}$ or $\Delta f_{2}$ indicates improvement.

Since the problem involves two potentially conflicting objectives, we adopt a scalarized reward function during training:

$r = w_{1} \cdot \frac{\Delta f_{1}}{f_{1}^{ref}} + w_{2} \cdot \frac{\Delta f_{2}}{f_{2}^{ref}},$

where $w_{1} = w_{2} = 0.5$ to balance both objectives, and $f_{1}^{ref}$, $f_{2}^{ref}$ are reference values (e.g., the initial makespan and energy) used for normalization. The reward is then distributed equally among the $L$ actions executed in that iteration:

$r_{\text{per action}} = \frac{r}{L}.$

This design ensures that actions leading to larger improvements receive higher rewards, while the normalization prevents one objective from dominating the other.
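A short numerical sketch of the reward computation follows. The function names and the objective values are illustrative assumptions; only the weights $w_{1} = w_{2} = 0.5$ and the two formulas come from the text.

```python
from typing import Tuple

Objectives = Tuple[float, float]  # (makespan, total energy)


def scalarized_reward(f_before: Objectives, f_after: Objectives,
                      f_ref: Objectives, w: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Weighted sum of normalized improvements; positive when the solution improved."""
    delta_f1 = f_before[0] - f_after[0]
    delta_f2 = f_before[1] - f_after[1]
    return w[0] * delta_f1 / f_ref[0] + w[1] * delta_f2 / f_ref[1]


def per_action_reward(r: float, num_actions: int) -> float:
    """Distribute the iteration reward equally among the L executed actions."""
    return r / num_actions


# Hypothetical values: makespan falls from 1000 to 950, energy from 500 to 480,
# and the initial objective values serve as reference values.
r = scalarized_reward((1000.0, 500.0), (950.0, 480.0), (1000.0, 500.0))
share = per_action_reward(r, 5)
print(r, share)  # about 0.045 and 0.009
```

Here the makespan term contributes $0.5 \cdot 50/1000 = 0.025$ and the energy term $0.5 \cdot 20/500 = 0.02$, so both objectives influence the reward on a comparable scale.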

Policy Learning

The policy $\pi(a \mid s)$ is parameterized by the GIN network followed by two fully connected layers (see Section 4.5.3). The agent is trained using a policy gradient method (REINFORCE) with a baseline to reduce variance. The loss function at update step $u$ is:

$L(\theta_{u}) = \mathbb{E}_{(s, a, r) \sim D} \left[ -\log \pi_{\theta_{u}}(a \mid s) \cdot (r - b(s)) \right],$

where $D$ is the replay buffer, $b(s)$ is a state-dependent baseline (the average reward obtained from the current state), and $\theta_{u}$ are the network parameters at update step $u$. The baseline is updated via an exponential moving average with a smoothing factor of $0.9$.
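The loss and the baseline update can be sketched without a deep learning framework. This sketch makes two simplifying assumptions that are not in the text: the baseline is a single scalar rather than state-dependent, and the smoothing factor 0.9 weights the old baseline (b ← 0.9·b + 0.1·r). The logits stand in for the outputs of the policy network, and all names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int, float]  # (policy logits, chosen action, reward)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the operator logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(batch: Sequence[Sample], baseline: float) -> float:
    """Mean of -log pi(a|s) * (r - b) over the sampled transitions."""
    terms = []
    for logits, action, reward in batch:
        log_prob = math.log(softmax(logits)[action])
        terms.append(-log_prob * (reward - baseline))
    return sum(terms) / len(terms)


def ema_update(baseline: float, reward: float, smoothing: float = 0.9) -> float:
    """Exponential moving average; `smoothing` weights the previous baseline (assumed)."""
    return smoothing * baseline + (1.0 - smoothing) * reward


# One sample: uniform policy over four operators, action a3 chosen, reward 0.05.
batch = [([0.0, 0.0, 0.0, 0.0], 2, 0.05)]
loss = reinforce_loss(batch, baseline=0.01)
new_baseline = ema_update(0.01, 0.05)
print(loss, new_baseline)
```

With a uniform policy the log-probability is $\log 0.25$, so the loss equals $\log 4 \cdot (0.05 - 0.01)$; minimizing it raises the probability of actions whose reward exceeds the baseline.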

Theoretical Justification of the MDP Formulation

The proposed state space captures the information relevant to scheduling decisions: node features include job-level contributions to makespan and energy, waiting times, and crane assignments; edge features encode precedence, factory/crane sharing, and interference potential. This representation supports the Markov property because the transition dynamics (applying an operator) depend only on the current graph state, not on the history of previous states or actions. The scalarized reward function $r = w_{1} \frac{\Delta f_{1}}{f_{1}^{ref}} + w_{2} \frac{\Delta f_{2}}{f_{2}^{ref}}$ follows the linear scalarization approach for multi-objective Markov decision processes: for fixed positive weights, a policy that maximizes the scalarized return is Pareto-optimal. The normalization (using reference values $f_{1}^{ref}, f_{2}^{ref}$) prevents one objective from dominating the learning signal.

4.5.3. Agent Policy Network

The low-level agent’s policy network is implemented using a Graph Isomorphism Network (GIN), a type of Graph Neural Network (GNN) that captures the structural properties of scheduling solutions. The architecture details are as follows:

A scheduling solution is represented as a graph $G = (V, E)$, where nodes $V$ represent jobs and edges $E$ represent precedence relationships and resource dependencies. Node features include job ID, processing stage, assigned machine, crane assignment, and objective-related information (makespan contribution, energy consumption). Edge features encode sequence constraints and crane path distances.

GIN Architecture

Number of layers: 3 GIN layers
Hidden dimension: 128 units per layer
Aggregation function: Sum pooling over neighbor node features
Update function: $h_{v}^{(k)} = \mathrm{MLP}^{(k)} \left( (1 + \epsilon^{(k)}) \cdot h_{v}^{(k-1)} + \sum_{u \in N(v)} h_{u}^{(k-1)} \right)$ (18)
Activation function: ReLU after each layer
Output: A graph-level embedding vector of dimension 128

The graph embedding is passed through two fully connected layers ($128 \to 64 \to |A|$) with ReLU activation, producing one output per operator in the set $A$ (inner-factory swap, inner-factory relocation, inter-factory swap, inter-factory relocation). A softmax function converts the output to a probability distribution.
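The update rule in Equation (18) can be traced by hand on a tiny graph. The sketch below is a plain-Python illustration, not the authors' PyTorch model: a single dense layer with ReLU stands in for $\mathrm{MLP}^{(k)}$, the graph-level embedding is obtained by a sum readout (the readout type is an assumption, since the text does not state it), and the three-node graph and its features are made up.

```python
from typing import Dict, List

Vector = List[float]


def relu_linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """One dense layer followed by ReLU; stands in for MLP^(k)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def gin_layer(h: Dict[int, Vector], neighbors: Dict[int, List[int]],
              eps: float, weight: List[Vector], bias: Vector) -> Dict[int, Vector]:
    """h_v <- MLP((1 + eps) * h_v + sum of neighbor features), as in Equation (18)."""
    out = {}
    for v, h_v in h.items():
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + x for a, x in zip(agg, h[u])]
        out[v] = relu_linear(agg, weight, bias)
    return out


def sum_readout(h: Dict[int, Vector]) -> Vector:
    """Graph-level embedding as the sum of all node embeddings (assumed readout)."""
    dim = len(next(iter(h.values())))
    return [sum(vec[d] for vec in h.values()) for d in range(dim)]


# Path graph 0 - 1 - 2 with 2-dimensional node features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adjacency = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gin_layer(features, adjacency, eps=0.0, weight=identity, bias=[0.0, 0.0])
embedding = sum_readout(hidden)
print(hidden, embedding)
```

With identity weights and $\epsilon = 0$, node 1 receives $[0,1] + [1,0] + [1,1] = [2,2]$, which shows how sum aggregation keeps the multiset of neighbor features distinguishable.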

The Graph Isomorphism Network (GIN) is chosen because its sum aggregation makes it as powerful as the one-dimensional Weisfeiler–Lehman graph isomorphism test, which is the upper bound on the discriminative power of message-passing graph neural networks. Graph Convolutional Networks (GCNs) and GraphSAGE with mean or max aggregation can map structurally different neighborhoods to the same embedding, whereas GIN can distinguish them. This property matters for identifying subtle differences in job sequences, crane assignments, and interference patterns in the DHFSP-MCT.

4.5.4. Knowledge-Informed Strategy

To enhance solution quality beyond pure reinforcement learning, we incorporate a knowledge-informed strategy that leverages problem-specific features of the DHFSP-MCT. This strategy guides the search process based on three heuristic rules.

Rule 1: Crane Interference Avoidance

When assigning cranes to transportation tasks, the algorithm prioritizes cranes that minimize waiting time. Specifically, for a job requiring transportation, the algorithm selects the crane that can reach the job’s current location with the shortest travel time without violating the safety distance constraint. If multiple cranes satisfy the constraint, the one with the earliest available time is selected.
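One possible reading of Rule 1 is sketched below under strong simplifying assumptions: cranes move along a single shared runway described by one coordinate, a move is feasible when no other crane lies within the safety distance of the swept path, and ties in travel time are broken by the earliest available time. The `Crane` fields, the function names, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Crane:
    crane_id: int
    position: float      # current position along the shared runway (m)
    speed: float         # travel speed (m/s)
    available_at: float  # time at which the crane becomes free (s)


def path_is_clear(crane: Crane, target: float, others: List[Crane], d: float) -> bool:
    """No other crane may lie within distance d of the segment swept by the move."""
    lo, hi = sorted((crane.position, target))
    return all(o.position <= lo - d or o.position >= hi + d for o in others)


def select_crane(cranes: List[Crane], target: float, d: float) -> Optional[Crane]:
    """Pick the feasible crane with the shortest travel time, then earliest availability."""
    feasible = [c for c in cranes
                if path_is_clear(c, target, [o for o in cranes if o is not c], d)]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: (abs(target - c.position) / c.speed, c.available_at))


cranes = [Crane(1, position=0.0, speed=1.0, available_at=0.0),
          Crane(2, position=30.0, speed=1.0, available_at=5.0)]
near_choice = select_crane(cranes, target=10.0, d=5.0)  # both feasible, crane 1 is closer
far_choice = select_crane(cranes, target=28.0, d=5.0)   # crane 1 would pass too close to crane 2
print(near_choice.crane_id, far_choice.crane_id)
```

A full implementation would also have to account for crane movement over time; the sketch only checks positions at the moment of assignment.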

To enhance adaptability, the energy-aware speed threshold $\theta(t)$ used in Rule 2 is updated every 10 scheduling windows:

$\theta(t) = \theta_{base} + \beta \cdot (\rho(t) - \rho_{ref}),$

where $\rho(t)$ is the current pending job ratio, $\beta = 0.2$, and $\rho_{ref} = 0.5$. This allows the rule to favor higher speeds under congestion and lower speeds under idleness.
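The threshold update is a one-line computation, sketched here with $\beta = 0.2$ and $\rho_{ref} = 0.5$ from the text. The value of $\theta_{base}$ is not given in the text, so the 0.5 used below is purely a placeholder, and the function name is hypothetical.

```python
def speed_threshold(pending_ratio: float, theta_base: float,
                    beta: float = 0.2, rho_ref: float = 0.5) -> float:
    """theta(t) = theta_base + beta * (rho(t) - rho_ref)."""
    return theta_base + beta * (pending_ratio - rho_ref)


THETA_BASE = 0.5  # placeholder; the base threshold is not reported in the text

congested = speed_threshold(0.9, THETA_BASE)  # many pending jobs
idle = speed_threshold(0.1, THETA_BASE)       # few pending jobs
print(congested, idle)  # about 0.58 and 0.42
```

The threshold moves by at most $\beta \cdot 0.5 = 0.1$ in either direction, because the pending job ratio lies between 0 and 1.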

Rule 2: Energy-Aware Speed Selection

The crane speed is selected based on the current workload. During peak periods, higher speeds (Level 1) are prioritized to reduce makespan; during idle periods, lower speeds (Level 3) are selected to minimize energy consumption. The speed selection threshold is dynamically adjusted based on the ratio of pending jobs to available time.

Rule 3: Bottleneck-Aware Job Release

The high-level agent’s release decisions are guided by the bottleneck stage identification. Jobs that require processing at the bottleneck stage are prioritized for release to prevent idle time for critical resources. The bottleneck stage is identified as the stage with the highest average machine utilization over the past 10 scheduling windows.

These knowledge-informed rules are embedded into both the initialization phase (Algorithm 1) and the improvement phase (Algorithm 2) as domain-specific heuristics that complement the learning-based decisions. The combination of RL and domain knowledge enables faster convergence and better solution quality compared to either approach alone.

4.6. High-Level Agent

4.6.1. Workflow of High-Level Agent

The decision cycle is divided into $T = 144$ fixed intervals. This temporal resolution is chosen to align with real-world manufacturing practice: a standard 12 h production shift is divided into 5 min decision windows ($12 \times 60 / 5 = 144$), which balances responsive job release against computational tractability. A finer granularity (e.g., 1 min intervals) would increase the state–action space and training time without significant improvement in solution quality, as verified by preliminary experiments. A coarser granularity (e.g., 30 min intervals) would miss opportunities for dynamic scheduling when machines become idle. The 5 min resolution is also consistent with typical material handling lead times in the target aluminum manufacturing plant.

At each decision point $t_{i}$, the high-level agent decides whether to release accumulated jobs to factory assignment and subsequent processing. This creates a static sub-problem for assignment and processing. The waiting decision at each moment is modeled as an MDP.

4.6.2. Markov Decision Process (MDP)

State: Includes the number of jobs in buffer, available factories, crane status per factory, etc. Features are normalized and concatenated.

Action: Binary: release jobs or wait.

Reward: The goal is to minimize the DHFSP-MCT objective. The overtime of completed tasks is computed. The objective difference between consecutive decision points $t_{i-1}$ and $t_{i}$ is $\mathit{diff}_{val}$. The immediate reward for the action at $t_{i-1}$ is $-\mathit{diff}_{val}$. Summing these rewards yields the negative total objective, encouraging long-term optimization.
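The last statement follows from a telescoping sum. As a worked sketch, let $J(t_{i})$ denote the objective value accumulated up to decision point $t_{i}$ (the symbol $J$ is introduced here for illustration and is not part of the original notation), so that $\mathit{diff}_{val} = J(t_{i}) - J(t_{i-1})$:

```latex
\sum_{i=1}^{T} r_{t_{i-1}}
  = -\sum_{i=1}^{T} \bigl( J(t_{i}) - J(t_{i-1}) \bigr)
  = -\bigl( J(t_{T}) - J(t_{0}) \bigr)
  = -J(t_{T}) \quad \text{if } J(t_{0}) = 0 .
```

Under the assumption that no objective value has accrued before the first decision point, maximizing the undiscounted return is therefore equivalent to minimizing the final objective.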

4.6.3. Agent Model

A Deep Q-Network (DQN) is used. The Q-function $Q(s, a; \phi_{l})$ uses parameters $\phi_{l}$ at update iteration $l$. At decision point $t_{i}$, the state $s_{t_{i}}$, action $a_{t_{i}}$, and reward $r_{t_{i}}$ are stored, together with the resulting next state, in the replay buffer $D$. The model is updated using the Temporal Difference (TD) loss:

$L_{l}(\phi_{l}) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \phi_{l}^{-}) - Q(s, a; \phi_{l}) \right)^{2} \right]$ (19)

where $\gamma$ is the discount factor, $\phi_{l}$ are the Q-network parameters, and $\phi_{l}^{-}$ are the parameters of the target network.
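Equation (19) can be evaluated by hand on a small batch. The sketch below assumes that the Q-values have already been computed by the online and target networks; the tuple layout, the function names, and the numbers are illustrative, and each `next_q_target` pair holds the target-network values for the two actions (release, wait).

```python
from typing import Sequence, Tuple

# (Q(s, a) from the online network, reward, target-network Q-values of the next state)
Transition = Tuple[float, float, Sequence[float]]


def td_target(reward: float, next_q_target: Sequence[float], gamma: float) -> float:
    """r + gamma * max over a' of Q(s', a'; target parameters)."""
    return reward + gamma * max(next_q_target)


def td_loss(batch: Sequence[Transition], gamma: float) -> float:
    """Mean squared TD error over the sampled transitions, as in Equation (19)."""
    errors = [(td_target(r, next_q, gamma) - q_sa) ** 2 for q_sa, r, next_q in batch]
    return sum(errors) / len(errors)


batch = [
    (1.0, -2.0, (0.5, 1.5)),   # target = -2 + 0.8 * 1.5 = -0.8, error = -1.8
    (0.0, -1.0, (0.0, -0.5)),  # target = -1 + 0.8 * 0.0 = -1.0, error = -1.0
]
loss = td_loss(batch, gamma=0.8)
print(loss)  # about 2.12
```

Keeping the target values fixed between synchronizations of the target network is what stabilizes this regression target in DQN.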

5. Experimental Results and Analysis

The experimental instances are designed to confirm the effectiveness of HRL-BCMA in solving the DHFSP-MCT. The number of jobs is selected from 10, 20, 50, 100, 150, and 200. The number of processing stages is selected from 3, 4, and 5. The number of factories is selected from 2, 3, 4, 5, and 6. For each stage, the number of unrelated machines is generated between 1 and 4. For each combination, five test instances are generated. Processing times are drawn at random from the uniform distribution U[1, 99].

The energy consumption parameters ($P_{jl}^{W}$, $P_{jl}^{I}$, $E_{c}^{p\xi}$, $E_{c}^{pe}$) were obtained from the cooperating aluminum manufacturing plant's equipment datasheets and validated through on-site measurements over five production days. For speed level 1 (fast), the measured crane energy consumption was $22.5 \pm 0.4$ kWh/h; for speed level 3 (slow), it was $9.8 \pm 0.3$ kWh/h. The machine's idle power was measured at $0.8$ kWh/h across all stages. For hypothetical instances with $n > 100$, values were linearly scaled based on machine power ratings. Sensitivity analysis confirms that the relative algorithm ranking remains stable for $\pm 20\%$ variations in these parameters.

All experiments were conducted on a PC equipped with an Intel(R) Xeon(R) W-2123 CPU at 3.6 GHz, 16.00 GB of RAM, and a Windows 10 x64 operating system. The proposed HRL-BCMA and all baseline algorithms were implemented in Python 3.9 using PyTorch 1.12. For each instance, the maximum computation time was set to $300 \times n \times m$ s (where $n$ is the number of jobs and $m$ is the number of stages), with a minimum of 500 iterations for small instances ($n \leq 50$) and 200 iterations for large instances ($n > 50$); early stopping was triggered if no improvement in Hypervolume (HV) or Inverted Generational Distance (IGD) was observed for 100 consecutive iterations. Each algorithm was run 10 independent times per instance using the fixed random seeds 42, 123, 456, 789, 1011, 1213, 1415, 1617, 1819, and 2021, and the same seed set was used across all algorithms to ensure fair comparison. Parameter tuning for HRL-BCMA was performed using an $L_{16}(4^{4})$ orthogonal array with five independent runs per configuration, and the same orthogonal design was applied to tune the baseline algorithms (CMA, CBMA, IMOEA/D, MOHIG) on a representative subset of instances (20% of the dataset, stratified by $n$ and $m$).

5.1. Parameter Calibration of the HRL-BCMA

The hyperparameters of our method are classified into two categories: (1) GIN architecture hyperparameters (number of layers, hidden dimension, learning rate) and (2) RL algorithm hyperparameters (population size $\mu$, transfer ratio $\eta$, learning rate $\alpha$, discount factor $\gamma$).

Four key parameters are considered: population size $\mu \in \{5, 10, 20, 30\}$, transfer ratio $\eta \in \{0.1, 0.2, 0.3, 0.4\}$, learning rate $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$, and discount factor $\gamma \in \{0.6, 0.7, 0.8, 0.9\}$. An orthogonal array $L_{16}(4^{4})$ is employed, with each parameter combination independently run five times across all instances.
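An orthogonal array with 16 runs and four 4-level factors can be built from arithmetic over the finite field GF(4). The sketch below shows one standard construction and maps its levels to the parameter values listed above; the specific array used by the authors is not given in the text, so the resulting run order and level combinations are an assumption.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple

# Multiplication table of GF(4) with elements 0, 1, 2, 3; addition in GF(4) is XOR.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]


def l16_array() -> List[Tuple[int, int, int, int]]:
    """16 runs x 4 factors with levels 0..3; columns a, b, a+b, a+2b over GF(4)."""
    return [(a, b, a ^ b, a ^ GF4_MUL[2][b]) for a, b in product(range(4), repeat=2)]


def is_orthogonal(rows: List[Tuple[int, ...]]) -> bool:
    """Every pair of columns must contain each of the 16 level pairs exactly once."""
    n_cols = len(rows[0])
    return all(len({(r[i], r[j]) for r in rows}) == 16
               for i, j in combinations(range(n_cols), 2))


LEVELS: Dict[str, List[float]] = {
    "mu": [5, 10, 20, 30],
    "eta": [0.1, 0.2, 0.3, 0.4],
    "alpha": [0.2, 0.4, 0.6, 0.8],
    "gamma": [0.6, 0.7, 0.8, 0.9],
}

configs = [{name: LEVELS[name][level] for name, level in zip(LEVELS, row)}
           for row in l16_array()]
print(len(configs), is_orthogonal(l16_array()))
```

The design covers the four factors with 16 configurations instead of the $4^{4} = 256$ of a full factorial, while every pair of factor levels still appears together exactly once.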

Unlike single-objective parameter tuning, multi-objective optimization requires metrics that capture both convergence and diversity. We adopt two complementary metrics for calibration: Hypervolume (HV) to assess solution set quality, and Inverted Generational Distance (IGD) to measure proximity to the reference Pareto front. The final parameter configuration is selected based on the average ranking across both metrics.
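For two minimized objectives, both metrics are short computations. The sketch below is illustrative: it assumes the objective values have already been normalized, and the reference point, reference front, and sample fronts are made up rather than taken from the experiments.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (makespan, energy), both minimized


def hypervolume_2d(front: Sequence[Point], ref: Point) -> float:
    """Area dominated by the front and bounded by the reference point."""
    pts = sorted(p for p in front if p[0] <= ref[0] and p[1] <= ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:  # dominated points add no area
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def igd(front: Sequence[Point], reference_front: Sequence[Point]) -> float:
    """Mean distance from each reference point to its nearest obtained solution."""
    return sum(min(math.dist(r, p) for p in front)
               for r in reference_front) / len(reference_front)


reference_front: List[Point] = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv_full = hypervolume_2d(reference_front, ref=(4.0, 4.0))
hv_single = hypervolume_2d([(2.0, 2.0)], ref=(4.0, 4.0))
igd_single = igd([(2.0, 2.0)], reference_front)
print(hv_full, hv_single, igd_single)
```

A front consisting only of the middle point has a lower HV (4 versus 6) and a positive IGD, which illustrates why the two metrics together reward both convergence and spread.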

The ANOVA results (Table 3) confirm that $\mu$ (population size) and $\alpha$ (learning rate) are the most influential parameters ($p < 0.05$). The optimal configuration is identified as $\mu = 20$, $\eta = 0.3$, $\alpha = 0.6$, $\gamma = 0.8$, which is used in all subsequent experiments. The main effect plot of parameters is shown in Figure 6. The interaction plot of $\mu$ and $\alpha$ is shown in Figure 7. The interaction plot of $\mu$ and $\gamma$ is shown in Figure 8.

5.2. Efficiency Analysis of the HRL-BCMA

The improvements in HRL-BCMA consist of three main components: the population initialization method, the bi-population co-evolutionary strategy, and the hierarchical reinforcement learning strategy, the last of which comprises a low-level and a high-level agent. To verify the effectiveness of each component, HRL-BCMA is compared with four variants: HRL-BCMA with random initialization (HRL-BCMA-RI), HRL-BCMA without the bi-population co-evolutionary strategy (HRL-BCMA-CE), HRL-BCMA without the low-level agent (HRL-BCMA-LL), and HRL-BCMA without the high-level agent (HRL-BCMA-HL). In each variant, the corresponding strategy is replaced by a random operation. Each variant is run independently 10 times on each instance.

The results of the comprehensive metric IGD of the compared algorithms are shown in Figure 9 and Table 4. The pairwise CM results of the compared algorithms are shown in Figure 10. In Figure 10, S1 represents HRL-BCMA, while SS1 represents HRL-BCMA-RI. The ONVG results of the compared algorithms are shown in Table 4. The Pareto front of the compared algorithms is shown in Figure 11. The results of these experiments show that the proposed HRL-BCMA algorithm outperforms the compared algorithms. The effectiveness of the strategies is verified through these experiments.

To further verify the necessity of the GIN architecture, we compare HRL-BCMA with a variant where the GIN policy network is replaced by a three-layer multi-layer perceptron (MLP) operating on flattened node features (denoted as HRL-BCMA-MLP). As shown in Table 5, HRL-BCMA significantly outperforms HRL-BCMA-MLP in both IGD and HV metrics across all instance scales. This confirms that capturing the graph structure via GIN is essential for effective operator selection in the low-level agent.

To ensure fair comparison, all benchmark algorithms (CMA, CBMA, IMOEA/D, MOHIG) were implemented and executed under identical conditions.

All algorithms were allocated the same maximum computation time, set to $300 \times n \times m$ s, where $n$ is the number of jobs and $m$ is the number of stages. This scaling ensures that larger instances receive proportionally more computational resources.

For each benchmark algorithm, we performed parameter tuning using the same orthogonal experimental design methodology. The tuning was conducted on a representative subset of instances (20% of the dataset), and the optimal parameter settings reported in the original papers were used as initial reference points. This two-stage tuning process ensures that each algorithm performs near its best on the DHFSP-MCT problem.

Each algorithm was run 10 independent times on each instance with different random seeds. Statistical results (mean, standard deviation) are reported for all metrics.

5.3. Comparison Results and Analysis

None of the selected baseline algorithms (CMA, CBMA, IMOEA/D, MOHIG) were originally designed for the DHFSP-MCT problem with multiple cranes, interference constraints, speed selection, and dynamic job release. To ensure a fair comparison, we adapted each baseline as follows:

Crane assignment: For algorithms without native crane handling, we added a greedy crane assignment heuristic (Rule 1 from Section 4.5.4) as a post-processing step after each job sequencing operation. The heuristic selects the crane that minimizes waiting time while respecting safety distance constraints.

Crane speed selection: All baseline algorithms were extended with the same energy-aware speed selection rule (Rule 2) used in HRL-BCMA, with a fixed speed level for each run to avoid additional complexity.

Dynamic job release: For algorithms assuming static job release, we modified the problem input to assume all jobs are available at time zero (upper bound performance). This gives these baselines an advantage, making the comparison conservative for HRL-BCMA.

Parameter re-tuning: Each baseline was re-tuned using the same orthogonal experimental design ($L_{16}(4^{4})$ with 5 runs per configuration) on a representative subset of 30 instances (20% of the dataset, stratified by $n$ and $m$).

HRL-BCMA is compared with state-of-the-art algorithms to verify its performance in solving the DHFSP-MCT. The comparison algorithms are CMA, CBMA, IMOEA/D, and MOHIG. The parameters of each comparison algorithm are initialized to the values recommended in the original article and then fine-tuned to optimize performance on the DHFSP-MCT.

Figure 12 shows that HRL-BCMA is significantly more stable than the comparison algorithms in terms of the IGD indicator. Figure 13 shows that the Pareto front of HRL-BCMA dominates the results of the comparison algorithms, further highlighting its advantage in solving the DHFSP-MCT. These results indicate that, under the given parameter settings, HRL-BCMA exhibits better convergence and performance on this problem.

5.4. The Results of the Statistical Experiment

A comprehensive analysis of HRL-BCMA, CMA, CBMA, IMOEA/D, and MOHIG based on Hypervolume (HV), Inverted Generational Distance (IGD), and the Pareto fronts indicates that HRL-BCMA ranks highest, followed by CBMA. The HV indicator reflects the coverage of a solution set in the objective space: a higher HV value implies better convergence and diversity. The IGD indicator measures the distance between the reference Pareto front and the obtained solution set, so a lower value is better. The results reveal that the solution set generated by HRL-BCMA has a smaller distance to the reference Pareto front, confirming its performance on this multi-objective problem. Furthermore, the Pareto front comparison shows that the HRL-BCMA solutions dominate those of the other algorithms, indicating that its solution set is of higher quality and explores the solution space better.

The Wilcoxon signed-rank test was used for pairwise comparisons between HRL-BCMA and each baseline algorithm. This non-parametric test was chosen because the performance metrics (IGD, HV) do not always follow a normal distribution, as verified by Shapiro–Wilk tests ($p < 0.05$ for most instances). The significance level was set at $\alpha = 0.05$ for all tests. The effect size (Cohen's d) was computed for significant results to quantify the magnitude of improvement. Because HRL-BCMA is compared with more than one algorithm, the Holm–Bonferroni method was applied to control the family-wise error rate. In addition, 95% confidence intervals for the mean IGD and HV values were computed by bootstrapping (10,000 resamples). The results of the Wilcoxon test are shown in Figure 14. In summary, considering the HV, IGD, and Pareto front indicators, HRL-BCMA performs best, with CBMA following closely behind. HRL-BCMA is therefore the preferred choice among the compared algorithms for the DHFSP-MCT.
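The Holm–Bonferroni step-down procedure can be sketched in a few lines. The p-values below are invented for illustration and are not results from the paper; in practice they would come from the pairwise Wilcoxon signed-rank tests against each baseline.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return reject decisions in the original order (step-down procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return reject


# Hypothetical p-values for HRL-BCMA versus each baseline.
baselines = ["CMA", "CBMA", "IMOEA/D", "MOHIG"]
p_values = [0.001, 0.04, 0.012, 0.03]
decisions = dict(zip(baselines, holm_bonferroni(p_values)))
print(decisions)
```

With these made-up values, the two smallest p-values pass their thresholds ($0.05/4$ and $0.05/3$), while 0.03 exceeds $0.05/2$, so the procedure stops and the remaining two comparisons are not declared significant even though both are below 0.05.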

To further demonstrate the effectiveness of HRL-BCMA, we compare it with two recent reinforcement learning-based scheduling algorithms:

RL-HH: A DQN-based hyper-heuristic for distributed flexible job shop scheduling with crane transportation. We re-implemented this method with the same state features adapted to DHFSP-MCT.
Flat-DQN: A single-agent deep Q-network that directly selects improvement operators without hierarchical decomposition. The state space and action space are kept identical to the low-level agent of HRL-BCMA for fair comparison.

Both baselines were tuned using the same orthogonal design protocol described in Section 5.1 and run under identical computational budgets ($300 \times n \times m$ s). Table 6 reports the IGD and HV results averaged over all instances. HRL-BCMA outperforms RL-HH and Flat-DQN by 15.3% and 22.7% in IGD, respectively, confirming the advantage of the hierarchical architecture and the GIN-based policy.

5.5. Computational Efficiency Analysis

HRL-BCMA exhibits near-linear scaling with instance size, with runtime increasing from approximately 12 s for $n = 10$ instances to 1240 s for $n = 200$ instances. This scaling behavior is comparable to CBMA (the second-best algorithm) and significantly better than CMA and IMOEA/D, which show superlinear growth for $n \geq 100$.

The computational overhead of HRL-BCMA stems primarily from two sources:

GIN forward passes during low-level agent decisions (approximately 35% of total runtime).
Bi-population co-evolution updates (approximately 45%).

The remaining 20% is attributed to initialization and Pareto set maintenance. While HRL-BCMA is computationally more intensive than simple meta-heuristics, the trade-off is justified by its superior solution quality.

5.6. Validation on Real-World-Inspired Case Study

To substantiate the practical effectiveness of HRL-BCMA beyond randomly generated instances, we constructed a high-fidelity simulation case study based on an aluminum manufacturing plant, which was the source of the problem abstraction.

The simulated workshop consists of three factories, each containing five processing stages with 2–4 machines per stage. Two overhead cranes (one large, one small) operate on shared runways. A total of 150 jobs were processed over a 72 h simulation horizon. Crane speeds, processing times, and energy consumption rates were derived from actual equipment specifications provided by the collaborating manufacturer.

The simulation was run 10 times for each algorithm, with performance measured by makespan and total energy consumption. The results show that HRL-BCMA achieved an average makespan of 2847 min (8.6% improvement over CBMA) and average energy consumption of 18,342 kWh (12.3% improvement over CBMA). Notably, the algorithm successfully avoided 94% of potential crane interference events compared to 78% for the best benchmark.

The computational overhead of HRL-BCMA (approximately 18 min per simulation run) is within acceptable limits for offline scheduling in this manufacturing context, where schedules are generated once per shift. The solution quality improvements translate to estimated annual savings of 285,000 kWh (approximately $22,800 at current industrial electricity rates) and a 7.5% increase in throughput.

To assess the robustness of HRL-BCMA, we varied three key parameters in the real-world case:

Job arrival rate: varied by $\pm 20\%$ from the nominal value.
Crane speed energy coefficients: varied by $\pm 15\%$ (representing uncertainty in energy measurements).
Safety distance $d$: varied by $\pm 10\%$ to test crane interference sensitivity.

For each variation, we re-ran all algorithms 10 times. HRL-BCMA maintains a makespan improvement of at least $7.2\%$ and an energy improvement of at least $10.1\%$ over CBMA across all tested variations. Notably, when the job arrival rate increases by $20\%$ (stress condition), the improvement of HRL-BCMA over CBMA increases to $9.3\%$, indicating that the hierarchical RL agent adapts better to congestion. These results demonstrate the robustness of the proposed method under realistic uncertainties (Table 7).

6. Conclusions and Future Work

The efficiency of shop scheduling in the manufacturing industry has recently received increasing attention. In this paper, a distributed hybrid flow-shop scheduling problem with multi-crane transportation (DHFSP-MCT) is introduced, and the HRL-BCMA algorithm is proposed to solve it. Extensive numerical experiments show that the proposed algorithm outperforms existing algorithms in terms of solution quality and diversity, indicating that it is an effective optimizer for distributed hybrid flow-shop scheduling with multi-crane transportation. However, the method has not been tested on other crane configurations or real-time scheduling scenarios, which are left for future work. In future research, we will also explore other types of distributed scheduling problems, such as heterogeneous factory scheduling, distributed job shop scheduling, and distributed flexible shop scheduling. Investigating other optimization objectives is equally important; this may involve real-world considerations such as time-based rates and total workload.

Author Contributions

Conceptualization, B.N.; methodology, B.N.; software, G.Y.; validation, H.L.; formal analysis, B.N.; investigation, H.L.; resources, G.Y.; data curation, B.N.; writing—original draft preparation, B.N.; writing—review and editing, H.L.; visualization, G.Y.; supervision, B.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Baotong Niu is employed by the company China Telecom Yikang Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Correction Statement

Due to an error in article production, the incorrect Academic Editor was previously listed. This information has been updated and this change does not affect the scientific content of the article.

References

Li, J.; Han, Y.; Gao, K.; Xiao, X.; Duan, P. Bi-Population Balancing Multi-Objective Algorithm for Fuzzy Flexible Job Shop With Energy and Transportation. IEEE Trans. Autom. Sci. Eng. 2023, 21, 4686–4702. [Google Scholar] [CrossRef]
Dulebenets, M.A. A Diffused Memetic Optimizer for reactive berth allocation and scheduling at marine container terminals in response to disruptions. Swarm Evol. Comput. 2023, 80, 101334. [Google Scholar] [CrossRef]
Du, Y.; Li, J.; Li, C.; Duan, P. A Reinforcement Learning Approach for Flexible Job Shop Scheduling Problem with Crane Transportation and Setup Times. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5695–5709. [Google Scholar] [CrossRef]
Du, Y.; Li, J.-Q.; Luo, C.; Meng, L.-L. A hybrid estimation of distribution algorithm for distributed flexible job shop scheduling with crane transportations. Swarm Evol. Comput. 2021, 62, 100861. [Google Scholar] [CrossRef]
Liu, Z.; Guo, S.; Wang, L. Integrated green scheduling optimization of flexible job shop and crane transportation considering comprehensive energy consumption. J. Clean. Prod. 2019, 211, 765–786. [Google Scholar] [CrossRef]
He, X.; Pan, Q.K.; Gao, L.; Wang, L.; Suganthan, P.N. A Greedy Cooperative Co-Evolutionary Algorithm with Problem-Specific Knowledge for Multiobjective Flowshop Group Scheduling Problems. IEEE Trans. Evol. Comput. 2023, 27, 430–444. [Google Scholar] [CrossRef]
Li, J.Q.; Du, Y.; Gao, K.Z.; Duan, P.Y.; Gong, D.W.; Pan, Q.K.; Suganthan, P.N. A Hybrid Iterated Greedy Algorithm for a Crane Transportation Flexible Job Shop Problem. IEEE Trans. Autom. Sci. Eng. 2022, 19, 2153–2170.
Zhou, B.; Liao, X. Particle filter and Levy flight-based decomposed multi-objective evolution hybridized particle swarm for flexible job shop greening scheduling with crane transportation. Appl. Soft Comput. 2020, 91, 106217.
Zhang, Z.-Q.; Wu, F.-C.; Qian, B.; Hu, R.; Wang, L.; Jin, H.-P. A Q-learning-based hyper-heuristic evolutionary algorithm for the distributed flexible job-shop scheduling problem with crane transportation. Expert Syst. Appl. 2023, 234, 121050.
Fatemi-Anaraki, S.; Tavakkoli-Moghaddam, R.; Abdolhamidi, D.; Vahedi-Nouri, B. Simultaneous waterway scheduling, berth allocation, and quay crane assignment: A novel matheuristic approach. Int. J. Prod. Res. 2021, 59, 7576–7593.
Fibrianto, H.Y.; Kang, B.; Hong, S. A Job Sequencing Problem of an Overhead Shuttle Crane in a Rail-Based Automated Container Terminal. IEEE Access 2020, 8, 156362–156377.
Vallada, E.; Belenguer, J.M.; Villa, F.; Alvarez-Valdes, R. Models and algorithms for a yard crane scheduling problem in container ports. Eur. J. Oper. Res. 2023, 309, 910–924.
Xie, X.; Zheng, Y.Y.; Mu, T.W.; Wan, F.C.; Dong, H. Solving the Two-Crane Scheduling Problem in the Pre-Steelmaking Process. Processes 2023, 11, 549.
Zhang, X.J.; Li, H.J.; Sheu, J.B. Integrated scheduling optimization of AGV and double yard cranes in automated container terminals. Transp. Res. Pt. B-Methodol. 2024, 179, 102871.
Zhao, F.Q.; Xu, Z.S.; Wang, L.; Zhu, N.N.; Xu, T.P.; Jonrinaldi, J. A Population-Based Iterated Greedy Algorithm for Distributed Assembly No-Wait Flow-Shop Scheduling Problem. IEEE Trans. Ind. Inform. 2024, 19, 6692–6705.
Zhao, F.Q.; Zhou, H.; Xu, T.P.; Jonrinaldi. A self-learning differential evolution algorithm with population range indicator. Expert Syst. Appl. 2024, 241, 1122674.
Zhao, F.Q.; Jiang, T.; Wang, L. A Reinforcement Learning Driven Cooperative Meta-Heuristic Algorithm for Energy-Efficient Distributed No-Wait Flow-Shop Scheduling with Sequence-Dependent Setup Time. IEEE Trans. Ind. Inform. 2023, 19, 8427–8440.
Zhao, F.Q.; Xu, Z.S.; Hu, X.T.; Xu, T.P.; Zhu, N.N.; Jonrinaldi. An improved iterative greedy algorithm for energy-efficient distributed assembly no-wait flow-shop scheduling problem. Swarm Evol. Comput. 2023, 81, 101355.
Xiao, H.P.; Fu, L.J.; Shang, C.Y.; Bao, X.Q.; Xu, X.H.; Guo, W.X. Ship energy scheduling with hgbbDQN-CE algorithm combining bi-directional LSTM and attention mechanism. Appl. Energy 2023, 347, 121378.
Zhao, F.Q.; Zhang, H.; Wang, L. A Pareto-Based Discrete Jaya Algorithm for Multiobjective Carbon-Efficient Distributed Blocking Flow Shop Scheduling Problem. IEEE Trans. Ind. Inform. 2023, 19, 8588–8599.
Moradi, N.; Kayvanfar, V.; Baldacci, R. On-site workshop investment problem: A novel mathematical approach and solution procedure. Heliyon 2023, 9, e22678.
Chi, X.F.; Liu, S.J.; Li, C. Research on optimization of unrelated parallel machine scheduling based on IG-TS algorithm. Bull. Pol. Acad. Sci.-Tech. Sci. 2022, 70, e141724.
Huang, M.; Wang, F.; Wu, S. The Implementation of Multiobjective Flexible Workshop Scheduling Based on Genetic Simulated Annealing-Inspired Clustering Algorithm. Wirel. Commun. Mob. Comput. 2022, 2022, 7452638.
Sun, L.; Shi, W.M.; Wang, J.R.; Mao, H.M.; Tu, J.J.; Wang, L.J. Research on Production Scheduling Technology in Knitting Workshop Based on Improved Genetic Algorithm. Appl. Sci. 2023, 13, 5701.
Wang, Z.S.; Wu, Y.H. An Ant Colony Optimization-Simulated Annealing Algorithm for Solving a Multiload AGVs Workshop Scheduling Problem with Limited Buffer Capacity. Processes 2023, 11, 861.
Wu, X.Y.; Shan, Y.H.; Fan, K.X. A Modified Particle Swarm Algorithm for the Multi-Objective Optimization of Wind/Photovoltaic/Diesel/Storage Microgrids. Sustainability 2024, 16, 1065.
Luo, C.; Li, X.; Gong, W.; Gao, L. Affinity propagation hierarchical memetic algorithm for multimodal multiobjective flexible job shop scheduling with variable speed. IEEE Trans. Evol. Comput. 2025, 29, 2729–2741.
Chen, W.; Wang, J.; Yu, G. Energy-efficient scheduling for a hybrid flow shop problem while considering multi-renewable energy. Int. J. Prod. Res. 2024, 62, 8352–8372.

Figure 1. Actual scene diagram of the overhead crane transportation.

Figure 2. Schematic diagram of multi-crane scheduling.

Figure 3. Flow chart of the integrated scheduling.

Figure 4. An example of a Gantt chart.

Figure 5. Framework diagram of the algorithm.

Figure 6. Main effect plot of parameters.

Figure 7. Interaction plot of $μ$ and $α$.

Figure 8. Interaction plot of $μ$ and $γ$.

Figure 9. The IGD results of the strategy analysis.

Figure 10. The CM results of the strategy analysis.

Figure 11. The Pareto front of the strategy analysis.

Figure 12. The IGD results of the comparison experiment.

Figure 13. The Pareto front of the comparison experiment.

Figure 14. Statistical performance analysis.

Table 1. Notation of the proposed model.

Notation	Meaning
F	The set of factories;
J	The set of processing stages;
$M_{j, l}$	The set of machines in stage j of factory l;
I	The set of jobs to be machined;
f	The number of factories;
n	The number of jobs;
m	The number of processing stages;
i	The index of the job;
k	The index of the machine;
j	The index of the stage;
l	The index of the factory;
$p_{i, j, l, k}$	The processing time of job i on machine k at stage j in factory l;
$S_{i, j}$	The starting time of job i in stage j;
$C_{i, j}$	The completion time of job i in stage j;
$o_{i, l}$	A binary variable that equals 1 if job i is assigned to factory l, and 0 otherwise;
$x_{i, j, l, k}$	A binary variable that equals 1 if job i is assigned to machine k of factory l at stage j, and 0 otherwise;
$y_{i, r, j, l}$	A binary variable that equals 1 if job i is processed before job r at stage j in factory l, and 0 otherwise;
$y_{i, r, j, l, k}$	A binary variable that equals 1 if job i is assigned to machine k of factory l at stage j and is processed before job r, and 0 otherwise;
Q	A very large positive number;
$C_{\max}$	The maximum completion time of the jobs;
$x_{c}^{t}$	The position of crane c at time t.

Table 2. The parameter list of crane speed.

Speed Level	Processing Time	Energy Consumption per Unit Time
1	1	5
2	1.2	3
3	1.5	2
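
The trade-off encoded in Table 2 can be illustrated with a minimal sketch. It assumes, as a reading of the table rather than something stated there, that "Processing Time" is a multiplier applied to a base transport time and that the energy of one crane move is its duration multiplied by the energy consumption per unit time. The names `SPEED_LEVELS` and `transport_cost` are hypothetical.

```python
# Crane speed levels from Table 2: level -> (time factor, energy per unit time).
SPEED_LEVELS = {1: (1.0, 5.0), 2: (1.2, 3.0), 3: (1.5, 2.0)}


def transport_cost(base_time, level):
    """Return (duration, energy) of one crane move at the given speed level.

    Assumption (not stated in the source): duration = base_time * time factor,
    and energy = duration * energy consumption per unit time.
    """
    time_factor, energy_rate = SPEED_LEVELS[level]
    duration = base_time * time_factor
    return duration, duration * energy_rate


if __name__ == "__main__":
    # Compare the three levels for a move with a base duration of 10 time units.
    for level in sorted(SPEED_LEVELS):
        duration, energy = transport_cost(10.0, level)
        print(f"level {level}: duration={duration:.1f}, energy={energy:.1f}")
```

Under these assumptions, a slower level lengthens the move but lowers its total energy, which is the conflict between makespan and energy consumption that the bi-objective model has to balance.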

Table 3. ANOVA results of parameters for HRL-BCMA.

Source	Sum of Squares	Degrees of Freedom	Mean Square	F-Ratio	p-Value
$μ$	0.813	3	0.203	6.27	0
$η$	0.161	3	0.040	1.63	0.5616
$α$	0.625	3	0.156	4.93	0.0013
$γ$	0.549	3	0.137	2.28	0.2384
$μ \times η$	0.219	9	0.014	1.29	0.4261
$μ \times α$	0.531	9	0.033	4.16	0.0167
$μ \times γ$	0.483	9	0.030	3.67	0.0318
$η \times α$	0.161	9	0.011	0.94	0.5637
$η \times γ$	0.125	9	0.008	0.81	0.6504
$α \times γ$	0.359	9	0.022	2.38	0.1427

Table 4. The ONVG results of the strategy analysis.

n	m	HRL-BCMA	HRL-BCMA-RI	HRL-BCMA-CE	HRL-BCMA-LL	HRL-BCMA-HL
10	3	86.253	43.351	34.574	62.227	26.905
10	4	95.132	45.320	36.454	68.815	29.998
10	5	108.337	51.818	31.263	77.653	31.079
20	3	91.841	42.944	32.839	68.408	30.304
20	4	98.516	48.781	37.916	63.345	32.048
20	5	106.237	46.196	35.017	71.279	33.579
50	3	87.842	50.263	41.519	55.578	27.517
50	4	99.152	55.439	42.242	53.465	29.216
50	5	104.524	58.161	40.021	50.064	26.726
100	3	96.714	55.984	40.413	42.655	31.259
100	4	95.386	52.804	38.244	40.025	33.448
100	5	99.147	62.708	36.925	41.389	32.021
150	3	120.225	78.374	47.040	38.491	26.323
150	4	131.518	89.116	61.396	42.105	28.080
150	5	137.462	85.972	58.430	40.708	29.246
200	3	151.785	102.173	67.956	42.715	31.952
200	4	137.912	97.471	81.297	44.529	29.383
200	5	164.855	114.299	75.528	49.434	27.493

Table 5. Ablation study of GIN architecture (IGD values, mean over 10 runs).

n	m	HRL-BCMA	HRL-BCMA-MLP	Improvement (%)
10	3	86.25	102.34	15.7
20	4	98.52	115.67	14.8
50	5	104.52	126.31	17.3
100	4	95.39	118.45	19.5
200	5	164.86	198.72	17.1
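
The Improvement column in Table 5 appears to be the relative reduction with respect to the MLP variant. This definition is inferred from the tabulated values, not stated in the table. The symbols $v_{\text{GIN}}$ and $v_{\text{MLP}}$ are introduced here for the HRL-BCMA and HRL-BCMA-MLP entries. The first row serves as a worked example:

```latex
\[
\text{Improvement}
  = \frac{v_{\text{MLP}} - v_{\text{GIN}}}{v_{\text{MLP}}} \times 100\%,
\qquad
\frac{102.34 - 86.25}{102.34} \times 100\% \approx 15.7\%.
\]
```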

Table 6. Comparison with RL-based baselines (average IGD and HV over all instances).

Algorithm	Avg. IGD	Avg. HV
HRL-BCMA (proposed)	0.0243	0.873
RL-HH	0.0287	0.821
Flat-DQN	0.0315	0.794

Table 7. Detailed results for the aluminum manufacturing plant case (average over 10 runs).

Algorithm	Makespan (min)	Energy (kWh)	CPU Time (s)	Interference Avoidance Rate (%)
HRL-BCMA	2847	18,342	1080	94.2
CBMA	3114	20,912	1045	78.3
RL-HH	3012	19,876	1120	82.5
Flat-DQN	3198	21,543	1065	75.8
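
As a quick check on Table 7, the sketch below computes the relative reduction that HRL-BCMA achieves against each baseline. The dictionary `RESULTS` and the helpers `relative_reduction` and `reductions_vs` are hypothetical names used only for illustration. The values are copied from the table.

```python
# Average results from Table 7: algorithm -> (makespan in min, energy in kWh).
RESULTS = {
    "HRL-BCMA": (2847, 18342),
    "CBMA": (3114, 20912),
    "RL-HH": (3012, 19876),
    "Flat-DQN": (3198, 21543),
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def reductions_vs(baseline_name, proposed_name="HRL-BCMA"):
    """Return (makespan reduction %, energy reduction %) against one baseline."""
    base_makespan, base_energy = RESULTS[baseline_name]
    prop_makespan, prop_energy = RESULTS[proposed_name]
    return (
        relative_reduction(base_makespan, prop_makespan),
        relative_reduction(base_energy, prop_energy),
    )


if __name__ == "__main__":
    for name in ("CBMA", "RL-HH", "Flat-DQN"):
        makespan_red, energy_red = reductions_vs(name)
        print(f"vs {name}: makespan -{makespan_red:.1f}%, energy -{energy_red:.1f}%")
```

Against CBMA this gives roughly 8.6% for makespan and 12.3% for energy, which agrees with the reductions quoted in the abstract.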

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Niu, B.; You, G.; Liu, H. A Hierarchical Reinforcement Learning Based Bi-Population Optimization Framework for Green Distributed Hybrid Flow-Shop Scheduling with Multiple Crane Transportation. Processes 2026, 14, 1410. https://doi.org/10.3390/pr14091410

