Solving Panel Block Assembly Line Scheduling Problem via a Novel Deep Reinforcement Learning Approach

The panel block is an important "intermediate product" in the shipbuilding process. However, the assembly efficiency of the panel block assembly line is not high, so rational scheduling optimization is of great significance for improving shipbuilding efficiency. Currently, the processing sequence of the panel blocks in the panel block assembly line is mainly determined using heuristic and metaheuristic algorithms. These algorithms have limitations, such as limited problem-solving capacity and low computational efficiency. To address these issues, this study proposes an end-to-end approach based on deep reinforcement learning to solve the scheduling problem of the ship's panel block assembly line. First, a Markov decision model is established, and a disjunctive graph is used to represent the current scheduling status of the panel block assembly line. Then, a policy function based on a graph isomorphism network is designed to extract information from the disjunctive graph state, and it is trained using the Proximal Policy Optimization algorithm. To validate the effectiveness of our method, tests on both real shipbuilding data and publicly available benchmark datasets are conducted. We compare our proposed end-to-end deep reinforcement learning algorithm with heuristic algorithms, metaheuristic algorithms, and the unimproved reinforcement learning algorithm. The experimental results demonstrate that our algorithm outperforms the baseline methods in terms of model performance and computation time. Moreover, our model exhibits strong generalization capabilities for larger instances.


Introduction
In recent years, with the continuous maturation of artificial intelligence technology, machine learning has provided new approaches for solving scheduling problems in complex and uncertain environments. Moreover, integrating machine learning algorithms with job scheduling problems aligns more closely with the core principles of intelligent manufacturing. In particular, with the ongoing development of deep reinforcement learning (DRL) [1,2], intelligent agents learn optimal scheduling strategies through interaction with the environment guided by rewards. This approach has been widely applied to solve workshop scheduling problems. The panel block assembly line in shipbuilding is a specialized workshop whose production process encounters bottlenecks such as block congestion [3]. To address these issues, we propose a DRL-based method to tackle the scheduling problem of the ship's panel block assembly line.
The structural composition of a vessel primarily consists of various profiles and steel materials. The hull of a ship is typically streamlined, but near the midship, the shape tends to become flat [4]. In modern shipbuilding practices, the entire structure of a ship is generally divided into several small sections, referred to as blocks. These individual blocks are then combined to form larger sections or ring sections, which are further assembled to create the complete structure of the vessel. A block serves as the most fundamental intermediate unit in ship structure manufacturing [5]. After the hull is divided into blocks, based on their internal structural characteristics, blocks with complex structures and large outer panel curvatures are referred to as curved blocks, while those with flat or nearly flat profiles are known as panel blocks, as depicted in Figure 1 [6]. For conventional vessels like bulk carriers and tankers, panel blocks can account for more than 60% of the total number of hull blocks. In some cases, the proportion of panel blocks can even reach around 80% for large and ultra-large oil tankers [7]. Furthermore, with the development of larger ships, such as large bulk carriers, oil tankers, and container ships, which commonly have a longer midship section, the demand for panel blocks has significantly increased. Due to the extensive processing required for a large number of panel blocks in the shipbuilding process, the manufacturing and assembly of panel blocks represent the primary bottleneck in the entire shipbuilding process. Furthermore, shipbuilding is an order-based industry, where the panel blocks of the ship for each order have different shapes and volumes, further increasing the complexity of scheduling for the panel block assembly line [8,9]. As a result, the assembly process of the panel block assembly line possesses a high degree of variability and system complexity [10].
However, the production efficiency of most panel block assembly lines is not high [11]. This is primarily due to the continued use of traditional on-site scheduling methods, which make it difficult to obtain optimized scheduling solutions. For instance, some shipyards directly prioritize the production of simple blocks before complex ones without employing means of optimized scheduling to reduce labor hours. Consequently, optimizing the production plan of the panel block assembly line to enhance the overall shipbuilding process efficiency is of paramount importance [12][13][14][15].
In light of these considerations, this paper proposes an end-to-end DRL approach to address the scheduling problem of panel block assembly lines in shipbuilding. Initially, we introduce a scheduling model based on the Markov Decision Process (MDP) [16] and creatively employ a disjunctive graph to represent the scheduling process of panel block assembly lines, thereby comprehensively and logically capturing the current state. This representation effectively integrates the dependencies between operations and the status of each workstation in the panel block assembly line, providing crucial information for generating optimal scheduling decisions. Subsequently, we utilize a Graph Isomorphism Network (GIN) to encode and embed the nodes in the disjunctive graph, enabling efficient computation of the policy network. Based on this approach, we design a policy network capable of handling instances of panel block assembly line scheduling of any size, effectively facilitating the generalization from model training to model deployment without the need for retraining. Finally, we employ the Proximal Policy Optimization (PPO) algorithm [17] to train our policy network. Extensive experiments are conducted using real shipyard data and publicly available benchmark datasets, demonstrating that our proposed method outperforms existing heuristic algorithms, metaheuristic algorithms, and reinforcement learning algorithms in terms of algorithmic performance and computation time. Furthermore, our method exhibits remarkable generalization capabilities when applied to larger-scale instances. The main contributions of this paper are as follows:
(1) We introduce an end-to-end reinforcement learning approach to learn scheduling rules, overcoming limitations such as poor model generalization. This method can effectively solve instances of any scale without the need for retraining;
(2) We present an MDP model for the panel block assembly line scheduling problem, providing a comprehensive definition of states, actions, and rewards within this MDP framework. The algorithms utilized for model training are also elaborated upon;
(3) We propose a graph embedding method that employs disjunctive graphs to represent the state information of the panel block assembly line. This approach directly extracts scheduling features from the disjunctive graph, marking the first instance of combining DRL with disjunctive graphs to address the scheduling problem in shipbuilding's panel block assembly lines.
The remaining sections of the paper are described as follows: Section 2 provides a comprehensive summary of the current research status in the relevant field. Section 3 presents the mathematical model for the scheduling problem and describes the background information on the technologies related to our research. Section 4 elucidates our research methodology, including the establishment of the MDP model, parameterized policy network, and the training process of the algorithm. Section 5 presents the experimental procedure and discusses the results obtained. Finally, in Section 6, we present the conclusions and outline future work for our research.

Literature Review
The scheduling process among the workstations in the panel block assembly line is modeled in this paper as a permutation flow shop scheduling problem with the objective of minimizing the maximum completion time [18]. Sriskandarajah and Hall [19] have proven the NP-hardness of such problems when the number of processing machines exceeds two (m > 2). Currently, heuristic algorithms, metaheuristic algorithms [20,21], and reinforcement learning algorithms are the mainstream research approaches for solving this type of problem. The following is a summary of the current research for each method.

Solving Scheduling Problem via Heuristics
Heuristic algorithms offer advantages such as high solution efficiency and fast computation speed. The Johnson algorithm [22] was the first constructed heuristic algorithm, which can be applied to the two-machine flow shop problem and yield an optimal solution. The Palmer algorithm [23] and the Gupta algorithm [24] employ the construction of processing time slopes to solve the permutation flow shop problem (PFSP). This approach involves converting the slopes of the jobs into function values, sorting them based on increasing or decreasing rules and arranging the job sequence accordingly. The NEH algorithm [25], considered one of the most efficient heuristic algorithms, follows the fundamental idea of prioritizing and allocating jobs based on their total processing time. It then achieves a complete schedule through consecutive job insertions. The priorities of the jobs and the insertion construction method are the two key aspects of the NEH algorithm. Framinan et al. [26] proposed three stages in the development of heuristic algorithms, namely index development, solution construction, and solution improvement. Through experiments, they concluded that the priority assignment method of the NEH algorithm is most effective for the permutation flow shop scheduling problem with objectives of maximum completion time and machine time constraints.
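To make the constructive idea behind these heuristics concrete, the following is a minimal sketch of Johnson's rule for the two-machine flow shop mentioned above. The function names and the data layout (a list of `(p1, p2)` pairs per job) are assumptions made for illustration and are not taken from the cited works.

```python
def johnson_order(times):
    """Return a job order for a two-machine flow shop using Johnson's rule.

    times[j] = (p1, p2): processing times of job j on machine 1 and machine 2.
    (Illustrative data layout, not from the paper.)
    """
    # Jobs that are faster on machine 1 go first, by increasing machine-1 time.
    first = sorted((j for j, (a, b) in enumerate(times) if a <= b),
                   key=lambda j: times[j][0])
    # The remaining jobs go last, by decreasing machine-2 time.
    last = sorted((j for j, (a, b) in enumerate(times) if a > b),
                  key=lambda j: times[j][1], reverse=True)
    return first + last


def two_machine_makespan(times, order):
    """Makespan of a given job order on two machines in series."""
    c1 = c2 = 0
    for j in order:
        c1 += times[j][0]                 # machine 1 is never idle
        c2 = max(c2, c1) + times[j][1]    # machine 2 waits for machine 1
    return c2
```

For example, with `times = [(3, 6), (5, 2), (1, 2), (6, 6), (7, 5)]` the rule places the jobs that are quick on the first machine at the front, which keeps the second machine busy early in the schedule.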

Solving Scheduling Problem via Metaheuristics
Conventional heuristic methods can only address small-scale permutation flow shop scheduling problems. To enhance computational performance, numerous scholars have employed metaheuristic algorithms to tackle various large-scale scheduling problems. Yu (2015) [27] proposed a block-based evolutionary algorithm for solving PFSP. They designed association rules to extract excellent genes, increasing solution diversity and enabling the generation of various blocks for artificial chromosome combinations, thereby improving convergence efficiency. Qun et al. (2015) [28] introduced a hybrid backtracking search algorithm to solve PFSP, establishing a mathematical model with the objective of minimizing the completion time. They devised new crossover and mutation strategies to incorporate simulated annealing mechanisms into random insertion local search, preventing premature convergence to local optima. Korhan et al. (2016) [29] presented an improved iterative greedy algorithm for PFSP, aiming to minimize total tardiness. They combined this algorithm with random search techniques, further enhancing the quality of the solutions. Suansh Deb et al. (2018) [30] designed a metaheuristic algorithm based on rhinoceros natural behavior to minimize the makespan in PFSP. They simplified the search operations and reduced the number of operation parameters required in the mathematical model. Morais et al. (2022) [31] devised three optimization algorithms based on discrete differential evolution to solve the makespan minimization problem in PFSP. While metaheuristic algorithms outperform heuristic algorithms in terms of solution quality, they still suffer from drawbacks such as excessive computation time for large-scale problem instances.

Solving Scheduling Problem via Reinforcement Learning
Considering the fixed structure of heuristic algorithms and metaheuristic algorithms, the search performance is somewhat constrained. Many researchers have attempted to utilize reinforcement learning algorithms to solve scheduling problems. Liu et al. [32] employed a parallel-trained Actor-Critic neural network model to address job shop scheduling problems. The Actor network learns actions under different circumstances, while the Critic assists in evaluating the value of those actions and returns feedback to the Actor network. This algorithm achieved promising results in job shop scheduling problems. Waschneck et al. [33] applied the DQN algorithm to production scheduling, utilizing deep neural networks to train in a reinforcement learning environment with flexible user-defined objectives to optimize production scheduling problems. Lin et al. [34] proposed an MDQN algorithm to solve job shop scheduling problems in an intelligent factory based on edge computing frameworks, and simulation results demonstrated its superior performance compared to other methods. Park et al. [35] introduced a framework that combines graph neural networks with reinforcement learning, yielding excellent results in job shop scheduling problems. Yang et al. [36] employed DRL to investigate dynamic PFSP for implementing intelligent decision-making in dynamic scheduling scenarios, achieving superior results compared to heuristic algorithms. Moreover, the trained network can generate a scheduling action within an average of 2.13 ms. Pan et al. [37] presented a DRL model based on heterogeneous networks to solve PFSP, and experimental results demonstrated its remarkable performance in PFSP problem-solving.

Research Gap
Both heuristic algorithms and metaheuristic algorithms serve as effective means of obtaining feasible solutions to scheduling problems. However, as the problem scale increases, the computational complexity of these algorithms also grows, resulting in significantly longer solution times. Additionally, heuristic and metaheuristic algorithms must be rerun from scratch when tackling problems of different scales, leading to reduced efficiency. Moreover, even a minor adjustment in a parameter within heuristic and metaheuristic algorithms can potentially impact the final results, making them more suitable for small-scale scheduling problems. On the other hand, reinforcement learning algorithms only require training once to address problems of all scales without the need for retraining. However, most applications of reinforcement learning rely on Deep Q-Networks (DQN) to approximate action-value functions, which cannot directly optimize policies. Furthermore, the representation of environmental states often relies on mathematical models, which may not fully capture the scheduling state. To address these issues and minimize the influence of intermediate processes on the computational results, we adopt an end-to-end DRL approach to solve the scheduling problem in the context of shipyard panel block assembly, with the objective of minimizing the maximum completion time.

Symbolic Representation
The symbols used to establish the mathematical model for the scheduling problem in shipyard panel block assembly are presented in Table 1.

Table 1. Nomenclature for various parameters.

Notation          Description
$n$               The number of blocks
$m$               The number of workstations
$B$               The set of blocks
$S$               The set of workstations
$i$               The index of a block in set $B$
$j$               The process number of block $i$
$B_i$             The $i$-th block
$Q_{i,j}$         The operation of block $B_i$ on workstation $S_j$
$p_{i,j}$         The processing time of block $B_i$ on workstation $S_j$
$\pi$             The processing sequence of blocks
$C_{\max}$        The maximum completion time
$C(\pi_i, j)$     The completion time of block $\pi_i$ on workstation $S_j$

Problem Description
The research focuses on the scheduling problem in shipbuilding panel block assembly, where $n$ panel blocks $B = \{B_1, B_2, \ldots, B_n\}$ undergo $m$ assembly processes $\{Q_{i1}, Q_{i2}, \ldots, Q_{im}\}$ in a flow production manner across $m$ workstations $S = \{S_1, S_2, \ldots, S_m\}$. In this study, the panel block assembly line consists of seven sequential stages, as illustrated in Figure 2, giving seven workstations ($m = 7$) and seven assembly processes per block. All blocks are processed in the same order at each workstation, and each block is processed exactly once at each workstation. The processing time of each block at each workstation is known, and infinite buffers exist between workstations [38]. Each block visits the workstations in the order 1 to $m$, and the block processing sequence is denoted $\pi = \{\pi_1, \pi_2, \ldots, \pi_n\}$. The objective is to find the sequence that minimizes the maximum completion time. The mathematical formulation of the problem is described in Equations (1) and (2).
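As an illustration of how the maximum completion time of a given sequence can be evaluated, the sketch below uses the standard completion-time recurrence for a permutation flow shop with infinite buffers: a block starts on a workstation once it has left the previous workstation and the preceding block has left the current one. This recurrence is assumed to correspond to the formulation referenced above; the function names and the nested-list layout of `p` are illustrative.

```python
def completion_times(p, seq):
    """Completion time of every block on every workstation.

    p[i][j]: processing time of block i on workstation j (assumed layout).
    seq:     processing sequence of block indices.
    Returns C with C[k][j] = completion time of the k-th block in seq on station j.
    """
    m = len(p[0])
    C = [[0] * m for _ in seq]
    for k, i in enumerate(seq):
        for j in range(m):
            prev_block = C[k - 1][j] if k > 0 else 0    # station j becomes free
            prev_station = C[k][j - 1] if j > 0 else 0  # block leaves station j-1
            C[k][j] = max(prev_block, prev_station) + p[i][j]
    return C


def makespan(p, seq):
    """Maximum completion time: last block in seq on the last workstation."""
    return completion_times(p, seq)[-1][-1]
```

With two blocks and two workstations, `p = [[3, 2], [1, 4]]`, the two possible sequences give different makespans, which is exactly the quantity the scheduler tries to minimize.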

Reinforcement Learning
Reinforcement learning [39] comprises essential components such as agents, environments, states, actions, and rewards. The interaction between the agent and the environment occurs through states, actions, and rewards. MDP forms the core of reinforcement learning, consisting of the elements $(S, A, P, \gamma, R)$, as defined in Equation (3). It encompasses actions $a \in A$, states $s \in S$, the reward function $r = R(s, a)$, the state transition probabilities $P(s' \mid s, a)$, and the discount factor $\gamma$. In this paper, the notations $a_t$, $s_t$, and $r_t$ are employed to represent the action, state, and reward at step $t$, respectively. The objective of reinforcement learning is to maximize the expected return by learning an optimal scheduling strategy.
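For reference, the return of a single episode is commonly taken to be the discounted sum of its rewards. The short sketch below computes that quantity under this usual definition; the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Discounted return r_0 + gamma * r_1 + gamma^2 * r_2 + ... of one episode."""
    g = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma = 1` the return is simply the sum of the rewards collected along the episode.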

Disjunctive Graph
To express the scheduling state more comprehensively and logically, we utilize a disjunctive graph [40] to represent the scheduling process of the panel block assembly line. Let $N = \{O_{ij} \mid \forall i, j\} \cup \{S, T\}$ denote the set of all operation nodes, where $S$ and $T$ represent the virtual start and end nodes, respectively. The disjunctive graph $G = (N, C, D)$ is then a mixed graph with $N$ as its vertex set. Here, $C$ is the set of conjunctive arcs (directed arcs), which depict the precedence between adjacent operations determined by the process and are usually drawn as solid lines. $D$ is the set of disjunctive arcs (undirected arcs), drawn as dashed lines, which connect operations that are processed on the same workstation. By determining the direction of each disjunctive arc, a solution for the panel block assembly line scheduling instance can be obtained, resulting in a directed acyclic graph (DAG) [41]. Figure 3a,b illustrate an example of a disjunctive graph and its solution for a panel block assembly line scheduling instance. In Figure 3a, which represents a scheduling instance with three blocks and three workstations, the black arrows depict the conjunctive arcs among operations within the same block, while the dashed lines represent the disjunctive arcs connecting operations that require the same workstation across different blocks. Figure 3b presents a complete solution to the scheduling problem, where each disjunctive arc has a direction, and each node is connected by at most two disjunctive arcs.
To enhance the reader's understanding of the scheduling process for the panel block assembly line, we present an illustrative example in Figure 4. As depicted in the diagram, when the input instance consists of five blocks and four workstations, the Gantt chart output reveals that the optimal sequence for block input should follow the order of 4-2-3-1-5.
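The disjunctive graph described above can be sketched in a few lines. The code below builds the node set, the conjunctive arcs, and the disjunctive arcs for an instance in which every block visits every workstation, orients the disjunctive arcs according to a block sequence, and evaluates the resulting DAG by its longest path from the start node to the end node. All names (`build_disjunctive_graph`, `orient_by_sequence`, `longest_path_length`) and the tuple encoding `(i, j)` of an operation are assumptions made for illustration. Orienting only the arcs between consecutive blocks in the sequence matches the observation that each node of a solution touches at most two disjunctive arcs.

```python
from itertools import combinations

START, END = "S", "T"  # virtual start and end nodes


def build_disjunctive_graph(n_blocks, n_stations):
    """Nodes, conjunctive arcs and (undirected) disjunctive arcs of an instance."""
    ops = [(i, j) for i in range(n_blocks) for j in range(n_stations)]
    nodes = [START] + ops + [END]
    conj = []
    for i in range(n_blocks):
        # Process order inside one block: S -> (i,0) -> ... -> (i,m-1) -> T.
        chain = [START] + [(i, j) for j in range(n_stations)] + [END]
        conj.extend(zip(chain, chain[1:]))
    # One undirected arc per pair of blocks competing for the same workstation.
    disj = [((a, j), (b, j))
            for j in range(n_stations)
            for a, b in combinations(range(n_blocks), 2)]
    return nodes, conj, disj


def orient_by_sequence(seq, n_stations):
    """Directed disjunctive arcs between consecutive blocks of seq on each station."""
    return [((a, j), (b, j)) for j in range(n_stations) for a, b in zip(seq, seq[1:])]


def longest_path_length(p, nodes, arcs):
    """Longest node-weighted path S -> T in the DAG; equals the makespan."""
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1

    def weight(v):
        return 0 if v in (START, END) else p[v[0]][v[1]]

    finish = {v: weight(v) for v in nodes}       # earliest finish times
    ready = [v for v in nodes if indeg[v] == 0]  # topological traversal
    while ready:
        u = ready.pop()
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + weight(v))
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return finish[END]
```

For a three-block, three-workstation instance such as the one in Figure 3a, this construction yields 11 nodes, 12 conjunctive arcs, and 9 disjunctive arcs.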

Methods
In this section, we present the fundamental principles of our approach. Firstly, we establish the MDP model for the scheduling problem. Subsequently, we devise a policy network based on GIN to address the task. Finally, we introduce the training algorithm employed and provide a comprehensive account of the training process.

MDP Model
The MDP model serves as a bridge between DRL and scheduling problems. Resolving the scheduling problem in panel block assembly can be viewed as determining the orientations of the disjunctive arcs in the disjunctive graph. The underlying MDP model we establish is as follows:
State: Existing methods do not provide a comprehensive and rational representation of the current scheduling state of the panel block assembly line. Considering the inherent characteristics of the scheduling environment, we therefore represent the state with a disjunctive graph. The disjunctive graph encompasses all the information related to the scheduling environment, including the processing status of each workstation and the number of blocks currently being assembled. The state consists of the processing status of each workstation and the blocks placed on the panel block assembly line. To capture the current scheduling state at step $t$, we utilize the disjunctive graph $G(t) = (N, C(t), D(t))$, where $C(t)$ comprises the set of all directed arcs fixed up to step $t$, and $D(t)$ represents the remaining disjunctive arcs in the graph. The initial state $s_0$ represents the panel block assembly line before scheduling begins, while the terminal state $s_T$ corresponds to a feasible solution when scheduling is complete, at which point $D(T) = \emptyset$, indicating that all disjunctive arcs have been assigned orientations. As the assembly process progresses, the disjunctive graph provides varying state descriptions of the current scheduling environment, and in turn, the changes in the environmental state lead to different disjunctive graphs.
Action: The actions of the intelligent agent involve selecting the block to be placed in the first workstation during the assembly process of the panel block assembly line. The design of the action space should thoroughly consider minimizing idle time across workstations and enhancing their utilization rate.
State transition: As the scheduling environment of the panel block assembly line constantly evolves, the scheduling state progresses from state $s_t$ to the next state $s_{t+1}$. Once the next action $a_t$ has been selected, we first identify the earliest feasible time to allocate $a_t$ on the required workstation. Subsequently, we update the direction of the disjunctive arcs for the workstations based on the current temporal relationships, generating a new disjunctive graph as the new state $s_{t+1}$.
Reward: A well-defined reward is a prerequisite for successful learning in reinforcement learning. The reward should relate directly to the objective of minimizing the makespan. We therefore compute the difference between the partial solutions at two consecutive steps, $U_t = C(s_{t+1}) - C(s_t)$, where $C(s_t) = \max_{i,j} \mathrm{LB}_t(O_{ij})$ is a lower bound on the makespan at step $t$. The immediate reward at each step $t$ is the negative of this difference, i.e., $R(a_t, s_t) = -U_t$. With a discount factor of 1, the rewards telescope, so the cumulative reward equals $C(s_0) - C(s_T)$, where $s_T$ denotes the terminal state. Because $C(s_0)$ is a constant of the instance and $C(s_T)$ is the makespan once all operations are scheduled, maximizing the cumulative reward is equivalent to minimizing the makespan.
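The reward definition can be checked numerically. The sketch below takes a trajectory of lower-bound values $C(s_0), C(s_1), \ldots$ (the numbers in the usage example are made up for illustration) and returns the per-step rewards $-U_t$. Summing them gives the difference between the first and the last lower bound, so the episode return differs from the negative final makespan only by the instance constant $C(s_0)$.

```python
def step_rewards(lower_bounds):
    """Per-step rewards R_t = -(C(s_{t+1}) - C(s_t)) along one episode.

    lower_bounds: [C(s_0), C(s_1), ..., C(s_T)], the makespan lower bound of
    each visited state (illustrative input; how it is computed is not shown).
    """
    return [-(nxt - cur) for cur, nxt in zip(lower_bounds, lower_bounds[1:])]
```

For instance, `step_rewards([10, 10, 12, 15])` yields one reward per scheduling decision, and their sum equals `10 - 15`.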
Policy: For state $s_t$, the stochastic policy $\pi(a_t \mid s_t)$ generates a probability distribution over actions $a_t$, and the action is selected according to this probability distribution.
Graph embedding: The graph isomorphism network (GIN) [42] is a kind of deep neural network [43,44] capable of learning representations of graph-structured data, and it is a recent variant of the graph neural network (GNN). The disjunctive graph we have constructed encompasses all the information regarding the scheduling states, including the processing times of blocks at each workstation and the order of block processing. To extract the state embedded in the disjunctive graph, we parameterize the policy $\pi(a_t \mid s_t)$ as a GIN-based network $\pi_\theta(a_t \mid s_t)$ with trainable parameters $\theta$. Given a graph $G = (V, E)$, GIN calculates a $p$-dimensional embedding for each node $v \in V$ by performing $k$ iterations of update steps.
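One GIN update step can be sketched in plain Python as follows. It uses the standard GIN rule, in which each node's new embedding is an MLP applied to $(1 + \epsilon)$ times its own embedding plus the sum of its neighbors' embeddings. The two-layer MLP, the dictionary-based graph encoding, and all function names are assumptions made for illustration; the network sizes and neighborhood definition used in the paper are not specified in this section.

```python
def linear(x, W, b):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def gin_layer(h, neighbors, W1, b1, W2, b2, eps=0.0):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor embeddings).

    h:         dict node -> embedding (list of floats)
    neighbors: dict node -> list of neighboring nodes
    W1, b1, W2, b2: weights of an illustrative two-layer MLP
    """
    out = {}
    for v, hv in h.items():
        agg = [(1.0 + eps) * x for x in hv]
        for u in neighbors[v]:
            agg = [a + x for a, x in zip(agg, h[u])]
        out[v] = linear(relu(linear(agg, W1, b1)), W2, b2)
    return out
```

Applying `gin_layer` $k$ times lets each node's embedding absorb information from nodes up to $k$ arcs away, which is how the processing times and ordering information spread through the disjunctive graph.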
Action selection: The disjunctive graph provides all the information of the scheduling environment at each decision step t. By transmitting the context information embedded in the disjunctive graph to a multi-layer perceptron network, the probability distribution of all actions at this step t is generated.
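A minimal sketch of this last step is given below, assuming the multi-layer perceptron has already produced one scalar score per block. Restricting the softmax to the blocks that can still be chosen, and sampling from the result, are illustrative choices; the function names are hypothetical.

```python
import math
import random


def action_probabilities(scores, candidates):
    """Softmax over the scores of the currently selectable actions only.

    scores:     dict action -> scalar score (assumed output of the MLP)
    candidates: iterable of actions that may be chosen at this step
    """
    candidates = list(candidates)
    mx = max(scores[a] for a in candidates)  # subtract max for numerical stability
    exps = {a: math.exp(scores[a] - mx) for a in candidates}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def sample_action(probs, rng=random):
    """Draw one action according to the probability distribution."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```

Sampling is typically used during training to keep exploring, whereas choosing the most probable action is a common alternative at test time.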

Learning Algorithm
In this paper, we employ the PPO algorithm to train our agent. PPO is an Actor-Critic algorithm, and its pseudocode is given in Algorithm 1. The Actor refers to the policy network $\pi_\theta$ described above, and the Critic shares the same GIN with the Actor.
We present the scheduling process of our proposed reinforcement learning method in the ship panel block assembly line in Figure 5. This reinforcement learning model consists of an Agent that determines the input order of panel blocks and an environment that captures the current state of the assembly line and the processing information of each station using a disjunctive graph. The environment provides feedback to the agent regarding the current processing status of the assembly line. Subsequently, the agent selects a block to input and makes decisions on the next block to input when certain operations on the block are completed.

Algorithm 1. PPO Algorithm for training our model
Input: number of update epochs $k$; number of PPO steps $M$; number of actors $N$ used to compute rewards and perform updates; actor network $\pi_\theta$ with trainable parameters $\theta$; behavior actor network $\pi_{\theta_{old}}$ with trainable parameters $\theta_{old}$; critic network $V_\phi$ with trainable parameters $\phi$; clipping ratio $\epsilon$; policy loss coefficient $C_p$; value function loss coefficient $C_v$; entropy loss coefficient $C_e$
1   Initialization: initialize the parameter sets of $\pi_\theta$, $\pi_{\theta_{old}}$, and $V_\phi$
2   for $m = 1, \ldots, M$ do
3       Pick $N$ independent scheduling instances from distribution $D$
4       for $n = 1, \ldots, N$ do
5           for $t = 0, 1, 2, \ldots$ do
6               Sample $a_{n,t}$ based on $\pi_{\theta_{old}}(a_{n,t} \mid s_{n,t})$
7               Receive reward $r_{n,t}$ and next state $s_{n,t+1}$
8               $\hat{A}_{n,t} = \sum_{0}^{t} \gamma^{t} r_{n,t} - V_\phi(s_{n,t})$;  $r_{n,t}(\theta) = \pi_\theta(a_{n,t} \mid s_{n,t}) / \pi_{\theta_{old}}(a_{n,t} \mid s_{n,t})$
9               if $s_{n,t}$ is terminal then
10                  break
11              end
12          end
13          $L^{CLIP}(\theta) = \sum_{0}^{t} \min\left(r_{n,t}(\theta)\hat{A}_{n,t},\ \mathrm{clip}(r_{n,t}(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_{n,t}\right)$
14          $L^{E}(\theta) = \sum_{0}^{t} \mathrm{Entropy}(\pi_\theta(a_{n,t} \mid s_{n,t}))$
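The clipped surrogate and entropy terms of Algorithm 1 can be illustrated with scalar inputs. The sketch below is not the training code of the paper: it assumes the action probabilities under the new and old policies and the advantage estimates are already available as plain lists, and the function names and default clipping ratio are illustrative.

```python
import math


def clipped_surrogate(new_probs, old_probs, advantages, eps=0.2):
    """Sum over steps of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old                            # probability ratio r(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(ratio, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total


def entropy(probs):
    """Shannon entropy of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

With a positive advantage the clip caps how much a step can gain from raising an action's probability, while with a negative advantage the unclipped term is kept whenever it is the smaller of the two, so large harmful policy changes remain penalized.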

Computational Experiment
In this section, we present the computational results of our method on instances of various scales. We demonstrate the effectiveness of our model through two validation approaches. Firstly, we conduct tests using real shipyard data, and secondly, we evaluate our model using the publicly available Taillard benchmark dataset [45]. To assess the performance of the proposed model, we compare it to six baseline methods, considering both the minimization of the maximum completion time and the training time of each algorithm.
Datasets: In this study, we trained our model using real processing data of ship hull blocks on a panel block assembly line. The dataset comprises 287 blocks of 40 types, together with their processing data across seven workstations. Table 2 lists the processing times of six randomly selected blocks at each workstation in the assembly line. To train our model, we randomly sampled 50 blocks from this dataset. The trained model was then tested on five instances with 25, 50, 75, 100, and 125 panel blocks. To further evaluate our approach, the trained model was also tested on nine Taillard benchmark test sets ranging from 20 × 5 to 100 × 20.
Baseline methods: In both test cases, our model is compared with the heuristic algorithms LPT and NEH and the metaheuristic algorithms GA and TS. Additionally, to demonstrate the effectiveness of the disjunctive graph state representation, we compare our model with two DRL algorithms, DDQN and PPO, that employ traditional state representation methods. Detailed descriptions of the six baseline methods are provided in Appendix A.
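To make the makespan criterion and the simplest baseline concrete, the following sketch evaluates a block sequence and builds an LPT ordering. It rests on assumptions that are not taken from this study: the line is modelled as a permutation flow shop in which every block visits the workstations in the same order with unlimited buffers, LPT is taken to mean sorting blocks by total processing time in descending order, and the function names and the toy processing times are hypothetical.

```python
def makespan(sequence, proc_times):
    """Completion time of the last block at the last workstation.

    sequence: block indices in input order.
    proc_times[b][k]: processing time of block b at workstation k.
    Assumes a permutation flow shop with unlimited buffers.
    """
    n_stations = len(proc_times[0])
    finish = [0.0] * n_stations      # finish time of the previous block per station
    for block in sequence:
        ready = 0.0                  # time the current block leaves the previous station
        for k in range(n_stations):
            start = max(ready, finish[k])   # wait for both the block and the station
            ready = start + proc_times[block][k]
            finish[k] = ready
    return finish[-1]

def lpt_sequence(proc_times):
    """Longest-processing-time-first order (assumed: by total processing time)."""
    return sorted(range(len(proc_times)), key=lambda b: -sum(proc_times[b]))

# Hypothetical toy instance: 3 blocks, 2 workstations.
toy_times = [[3, 2], [1, 4], [2, 1]]
lpt_order = lpt_sequence(toy_times)
lpt_makespan = makespan(lpt_order, toy_times)
```

On this toy instance the LPT order is not the best one: swapping the first two blocks gives a shorter makespan, which illustrates why a fast dispatching rule can still leave room for learned or search-based sequencing.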
Experimental Settings: Our experiments were conducted entirely in Python 3.8 on a computer equipped with an AMD Ryzen 7 5800H 3.20 GHz CPU and an NVIDIA RTX 3050 Ti GPU. Because the selection of appropriate parameters plays a crucial role in the success of the experiments, Table 3 lists the parameters used during the training of our model.
Evaluation metrics: We use the makespan and the computational time of the model to assess the performance of our method and the baseline approaches.
Table 4 presents the computational results of the six baseline methods and our approach; the makespan is reported in hours, as indicated by the (h) in the table title. We test five groups of instances with different numbers of panel blocks. Our method yields a smaller makespan than all the baseline methods in all cases: it reduces the makespan by 1-2% relative to the two reinforcement learning baselines and by 10% relative to LPT. To show the differences between our model and the baseline methods more directly, Table 5 lists the difference between each baseline's result and ours, and Figure 6 plots how these differences vary with the problem scale. Table 5 and Figure 6 show that our model consistently outperforms the other methods across all problem instances. The metaheuristic algorithms outperform the LPT heuristic, whereas the NEH heuristic performs similarly to the metaheuristic algorithms and even slightly outperforms them. Overall, the reinforcement learning algorithms surpass the other baseline methods.
Among the reinforcement learning algorithms, ours achieves a lower makespan than DDQN and PPO, which do not use the disjunctive graph representation for states. For instance, compared with the DDQN method, our method saves 3.3 h of processing time when the block count is 15 and 14.6 h when the block count is 125. Therefore, in real shipbuilding scenarios, our approach can significantly enhance the efficiency of panel block assembly. As the problem scale increases, the gap in makespan between the other algorithms and the proposed algorithm grows, so the advantage of our method becomes more pronounced. For instance, compared with the PPO algorithm that does not incorporate the disjunctive graph representation for states, the time saved by our algorithm expands from 1.9 h to 10.3 h. This indicates that in practical applications, as ship sizes grow and the demand for panel blocks increases, our algorithm will save even more time.
To give a more complete picture of the model's performance, we further compared the computation times of the algorithms, as shown in Table 6; the computation time is reported in seconds, as indicated by the (s) in the table title. Table 6 shows that the LPT heuristic yields scheduling results almost instantly. By contrast, the other reinforcement learning algorithms, the metaheuristic algorithms, and the NEH algorithm all require slightly longer computation times than our model. Taken together, Tables 4 and 6 show that our algorithm achieves superior scheduling solutions with reduced computational time. Taking the instance with 125 panel blocks as an example, the ordering of the algorithms by makespan, from longest to shortest, is as follows: LPT, GA, NEH, TS, DDQN, PPO, and our algorithm.
On the other hand, when considering the computational time criterion, the ordering from longest to shortest is as follows: TS, GA, NEH, PPO, DDQN, our algorithm, and LPT. It is evident that although LPT exhibits the shortest computational time, it results in the longest makespan. Conversely, our method achieves the shortest makespan while also maintaining a computational time that is second only to LPT. Therefore, considering the comprehensive performance, our approach emerges as the optimal choice.

Computational Results of the Panel Block Assembly Line
Similarly, we calculated the differences in computation time between the baseline methods and our model, as presented in Table 7, and plotted them as a line graph in Figure 7. Table 7 and Figure 7 show that, apart from the LPT algorithm, all baseline methods have longer computation times than our approach, and that the gap widens as the instance scale increases. Although the LPT algorithm yields results almost instantly, its makespan significantly exceeds that of our algorithm. Therefore, considering the overall performance, our model surpasses the other baseline methods and offers a more effective solution to the scheduling problem in ship panel block assembly lines.

Computational Results of Benchmark Instances
In this section, we compare the performance of LPT, NEH, GA, TS, DDQN, PPO without disjunctive graph representation, and our approach on the Taillard benchmark dataset. Table 8 shows the makespan achieved by each algorithm for each problem scale, with the best results in bold. On the whole, the reinforcement learning algorithms perform better than the metaheuristic and heuristic algorithms. Among the reinforcement learning algorithms, our method consistently achieves the smallest makespan across all problem instances, surpassing both DDQN and PPO. When the number of workstations is fixed at five and the number of blocks is 20, the disparities between the algorithms are minimal. However, as the number of blocks increases, our algorithm saves progressively more time. For instance, compared with the PPO algorithm, our algorithm saves 0 h, 13 h, and 33.5 h when the number of blocks is 20, 50, and 100, respectively. The superiority of our algorithm thus becomes increasingly prominent as the number of blocks grows.
Table 9 presents the computational time of our approach and all the baseline methods. LPT remains practically instantaneous on instances of any scale. Although our method is slower than LPT, it is faster than the other baseline methods, and its computation time is acceptable in practical industrial production. Furthermore, for a fixed number of workstations, the computational time our model saves relative to the other five methods (excluding LPT) grows as the number of blocks increases. For instance, compared with the PPO algorithm, which does not use the disjunctive graph representation for states, our algorithm reduces the computational time by 0.07 s, 0.23 s, and 0.54 s when the number of blocks is 20, 50, and 100, respectively.
Moreover, the time savings grow significantly faster than the number of blocks. This observation further underscores our model's capacity to economize computational time as the instance size expands. Based on the results presented in Tables 8 and 9, our method achieves the smallest makespan across all test cases and, apart from LPT, the shortest computational time. Consequently, when considering the comprehensive performance of the model, our approach continues to surpass the baseline methods on the Taillard benchmark.
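The statement that the time savings grow faster than the number of blocks can be checked directly from the three figures quoted above for the comparison with PPO (0.07 s, 0.23 s, and 0.54 s at 20, 50, and 100 blocks). The short calculation below only restates those numbers; the helper name `growth_factors` is introduced here for illustration.

```python
blocks = [20, 50, 100]
savings_s = [0.07, 0.23, 0.54]   # time saved versus PPO, in seconds (from the text)

def growth_factors(values):
    """Each value divided by the first one, i.e. growth relative to the smallest case."""
    return [v / values[0] for v in values]

block_growth = growth_factors(blocks)      # 1x, 2.5x, 5x
saving_growth = growth_factors(savings_s)  # roughly 1x, 3.3x, 7.7x

for n, b, s in zip(blocks, block_growth, saving_growth):
    print(f"{n} blocks: block count x{b:.2f}, time saved x{s:.2f}")
```

Going from 20 to 100 blocks multiplies the block count by 5 but the time saved by about 7.7, which is consistent with the claim.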

Discussion
Across the two test cases, our model outperforms the six baseline methods in makespan and, with the exception of LPT, in computational time. The LPT algorithm computes schedules in near real time but exhibits a significantly larger makespan than the other algorithms. Therefore, under specific conditions that prioritize computation time above all else, the LPT algorithm is more suitable. However, when solution quality and computational time are considered together, our method holds a distinct advantage.
As the instance size scales up from 25 to 125 blocks, our model continues to exhibit superior performance compared to the other algorithms, showcasing its generalization capability on larger-scale instances. With the development of larger ships, the demand for panel blocks increases significantly, and this generalization ability allows our approach to meet that challenge. Additionally, it is noteworthy that our model outperforms the PPO algorithm, which does not utilize the disjunctive graph representation of states, in both sets of test experiments.
In essence, the satisfactory performance of our method can be attributed to the end-to-end model based on DRL efficiently generating feasible scheduling sequences, coupled with the disjunctive graph-based state representation method, which provides a more comprehensive and rational depiction of the current scheduling information. Hence, our model stands as a competitive approach for solving the ship panel block assembly scheduling problem. In practical shipbuilding processes, our method can provide scientific and effective scheduling plans for panel block assembly, reducing construction time for blocks, and facilitating the execution of subsequent work plans, thereby shortening the shipbuilding cycle and lowering costs. Moreover, the scheduling method for panel block assembly lines can offer valuable insights for scheduling other product assembly lines, thereby comprehensively enhancing shipbuilding management and boosting enterprise competitiveness.

Conclusions and Future Work
This study investigates the scheduling problem of ship panel block assembly lines, aiming to minimize the assembly time of panel blocks by determining the sequence of incoming blocks. To address this problem, we propose an end-to-end scheduling method based on DRL. Initially, we establish an MDP model that conforms to the scheduling environment, creatively employing disjunctive graphs to capture the state of the current node and designing a reward function. In order to enhance the learning of information contained in the disjunctive graphs, we introduce a policy function based on GIN and train it using PPO algorithms. We compare our proposed model with heuristic algorithms LPT and NEH, metaheuristic algorithms GA and TS, as well as reinforcement learning algorithms DDQN and PPO without disjunctive graph state representation. Experimental results demonstrate that our algorithm surpasses other baseline methods in terms of both model performance and computational time considerations. Moreover, our model exhibits strong generalization ability when handling larger-scale instances. In the practical production of panel blocks, our approach enables shipyards to save significant time, and as the demand for panel blocks increases with the development of large-scale vessels, our model can effectively respond to these requirements.
However, our proposed DRL method, being a learning-based optimization approach, does not guarantee the discovery of a globally optimal solution. Due to limitations in available data, we were unable to test our model on larger-scale instances. Additionally, we simplified the actual process of panel block assembly in the assembly line. Real-world scheduling can be more complex, and in the future we aim to investigate how to handle unforeseen events that may arise during the actual assembly process, such as workstation machine failures or the introduction of urgent tasks. To better address the scheduling problem in the panel block assembly line, we will enhance the robustness of our model and validate its scalability by testing on larger-scale instances. Furthermore, we will continue exploring improved heuristic and metaheuristic algorithms while conducting additional experiments to verify the consistency of our model across different datasets.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Experience samples take the form of $(s, d, s', r)$, where $s$ represents the state of the panel block assembly line, $d$ represents the scheduling plan, $s'$ represents the new production state after applying scheduling plan $d$ to the panel block assembly line, and $r$ represents the reward obtained when applying scheduling plan $d$ under state $s$. The DDQN consists of the Current Network, responsible for action retrieval, and the Target Network, responsible for action value computation. Both networks have identical structures. We treat the various system states $s$ in the MDP as input values to the neural network. The output of the neural network is a Q-table for different actions, where each dimension of the Q-table maps to a specific action. The value stored at a particular index in the Q-table represents the Q-value of that action. A higher Q-value indicates greater value and rationality for the corresponding action.

$$L^{clip}(\theta', \theta) = \sum_{(s_t, a_t)} \min\left( r_t(\theta) R^{\theta_{old}}(s_t, a_t),\ \mathrm{clip}\left(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon\right) R^{\theta_{old}}(s_t, a_t) \right) \tag{A1}$$

where $\theta'$ represents the new policy parameters, $\theta$ represents the old policy parameters, $r_t(\theta) = \frac{\pi(a_t \mid s_t, \theta')}{\pi(a_t \mid s_t, \theta)}$ denotes the importance sampling ratio that characterizes the similarity between the old and new policies, and $\varepsilon$ is the clipping parameter. PPO utilizes gradient ascent to update the parameters, as shown in Equation (A2).
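Returning to the DDQN baseline described in this appendix, the division of labour between the Current Network (action retrieval) and the Target Network (action value computation) can be sketched as the standard double Q-learning target. This is an illustrative sketch rather than the baseline's actual code, and it is not Equation (A2): the function name `ddqn_target`, the discount factor `gamma`, and the `terminal` flag are assumptions, and the two Q-value lists stand in for the outputs of the two networks at the next state.

```python
def ddqn_target(reward, next_q_current, next_q_target, gamma=0.99, terminal=False):
    """Double DQN learning target for one experience sample.

    next_q_current: Q-values of the Current Network at the next state,
                    used only to pick the action.
    next_q_target:  Q-values of the Target Network at the next state,
                    used only to evaluate the picked action.
    gamma and terminal are hypothetical parameters added for this sketch.
    """
    if terminal:
        return reward  # no future value after the final scheduling decision
    # Action retrieval: greedy action according to the Current Network.
    best_action = max(range(len(next_q_current)), key=lambda a: next_q_current[a])
    # Action value computation: that action's value according to the Target Network.
    return reward + gamma * next_q_target[best_action]
```

Separating selection from evaluation in this way is what distinguishes double Q-learning from a single-network target, where the same estimates both choose and score the action.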