1. Introduction
With the rapid evolution of modern manufacturing toward green, intelligent, and connected paradigms, efficient end-of-life (EOL) product disassembly [1,2,3] and resource recovery have emerged as critical enablers of sustainable circular economies. Smart factories, envisioned under the Industrial Internet of Things (IIoT) paradigm [4], aim to integrate autonomous robotic systems, distributed sensing, and secure data infrastructure to enable real-time adaptive disassembly and reassembly operations. While this work does not yet implement IoT infrastructure directly, the proposed modular scheduling framework is explicitly designed to support future integration with IoT-enabled cyber-physical environments [5,6].
Traditional disassembly line balancing problems (DLBP) primarily focus on maximizing the efficiency of disassembly processes for subassembly recovery [7], often neglecting the reassembly or remanufacturing stages that are essential for closed-loop production. To address this limitation, a more comprehensive model, the disassembly and assembly line, has emerged, offering improved resource reutilization and enabling intelligent production with minimal waste [8,9]. Within such systems, the coordination of robotic operations is not only a logistical challenge but also a potential cybersecurity consideration in connected environments, where idle switching times and predictable task sequences may expose the system to timing-based vulnerabilities.
This motivates the study of the disassembly and assembly line balancing problem (DALBP), which demands holistic modeling and optimization of integrated operations [10]. DALBP accounts for the sequencing and allocation of tasks across multiple robotic workstations [11], including the directional switching time required when robots alternate between disassembly and assembly tasks. For example, in an automotive electronics remanufacturing setting, a robot may be required to disassemble an engine control unit and then immediately assemble a reusable circuit board into a new module. Physical reorientation between spatially separated task zones can introduce non-negligible delays. When accumulated over hundreds of cycles, these delays reduce throughput and increase energy consumption. Studies have also shown that ignoring switching time leads to suboptimal scheduling and operational bottlenecks [12,13].
Existing research has explored related areas in disassembly or assembly line optimization. Qin et al. [14] considered human factors in assembly line balancing, while Yang et al. [15] proposed a salp swarm algorithm for robotic disassembly lines. Yin et al. [16] introduced a heuristic approach for multi-product, human–robot disassembly collaboration, and Mete et al. [17] studied parallel DALBP using ant colony optimization. However, these efforts either treated disassembly and assembly in isolation or did not account for robot switching dynamics and potential integration with cyber-physical systems. Moreover, the intersection of task scheduling, reinforcement learning, and system-level resilience in IoT-connected factories remains underexplored.
To address these challenges, this paper presents a modular and extendable scheduling framework for DALBP that supports future integration with IoT-enabled architectures. Our contributions include:
Proposing an integrated disassembly-assembly line model that explicitly incorporates robotic directional switching time, formulated as a mixed-integer programming (MIP) model to improve practical realism in smart manufacturing contexts.
Designing an improved reinforcement learning algorithm, IQ(λ), which enhances the classic Q(λ) method via eligibility trace decay, structured state–action mappings, and a novel Action Table (AT) that enables flexible task reuse and efficient decision-making.
Validating the approach through extensive experiments on four real-world product instances and comparing its performance against Q-learning, Sarsa, standard Q(λ), and the CPLEX solver. The proposed method achieves superior optimization quality and convergence performance, with strong potential for extension into secure, distributed, and adaptive environments.
Although this work does not yet deploy IoT-based data or federated learning mechanisms, it establishes a robust algorithmic and architectural foundation for their integration. The proposed IQ(λ)-based DALBP framework can be readily extended to incorporate edge intelligence, privacy-aware scheduling, and secure communication protocols, which are essential for future IIoT implementations.
The remainder of this paper is organized as follows. Section 2 and Section 3 introduce the problem statement and mathematical formulation. Section 4 presents the improved IQ(λ) algorithm and its enhancements. Section 5 describes the experimental setup and comparative results. Section 6 discusses the broader implications for the development of secure and IoT-ready disassembly systems, and Section 7 concludes the paper with a summary and future research directions.
2. Problem Description
2.1. Problem Statement
The disassembly-assembly line balancing problem (DALBP) addressed in this study is situated in the context of fully robotic smart manufacturing systems [18]. Motivated by the high-risk nature of handling end-of-life (EOL) electronic products and the growing demand for resource-efficient production, the DALBP focuses on the coordinated scheduling of robotic disassembly and assembly tasks across shared workstations. The core objective is to optimize task allocation and sequencing while minimizing directional switching delays and processing time.
While the current implementation does not yet incorporate IoT infrastructure, the proposed system is explicitly designed to support future extensions involving cyber-physical integration. This includes IoT-enabled sensing, edge/cloud control systems, secure communication protocols, and adaptive decision-making under real-time constraints.
Each disassembly task (d-task) involves removing a specific part or subassembly from an EOL product, while each assembly task (a-task) involves reusing selected components to build a new product. Robots autonomously execute these tasks, subject to limitations in tool compatibility, physical positioning, and potential future constraints imposed by wireless communication and sensor feedback.
The physical layout consists of parallel disassembly and assembly lines connected by conveyors and shared robot workstations. Figure 1 illustrates this configuration using an automotive battery example [19]. Disassembly tasks on the upper line selectively extract reusable components, while assembly tasks on the lower line rebuild new modules using a different task order. In shared workstations (dashed boxes), robots switch between task modes, and the model incorporates switching time due to tool changes or orientation shifts.
Although IoT technologies such as RFID-tracked bins, proximity sensors, and edge controllers are not implemented in this study, the system architecture supports their future integration. Such additions would allow real-time status monitoring, traceability, and secure task dispatch, making the DALBP framework suitable for deployment in cyber-physical production environments [20].
2.2. AND/OR Graphs
To describe the structure and sequence of disassembly operations, this work employs AND-OR graphs, which model logical dependencies between subassemblies and tasks. These graphs are particularly suitable for representing complex hierarchical relationships in products such as automotive batteries, where components can be removed or reassembled in different configurations depending on product variation or wear level.
AND/OR graphs are digitized into the robotic controller and updated through edge devices, enabling adaptive decision-making based on real-time sensing feedback. In the graph, rectangles represent subassemblies [21,22,23], each containing a unique ID and a list of parts. Directed arrows define feasible disassembly transitions. When a subassembly has multiple successors, only one task path is selected at runtime based on heuristic or learned policies.
Figure 2 shows the structural diagram of the product to be disassembled—an automotive battery.
Figure 3 presents the corresponding AND/OR graph of the product. In this work, the AND/OR graph is used to represent the disassembly structure of the product. For example, subassembly <1> can be disassembled into subassemblies <2> and <9> through task 1, indicating an AND relationship between subassemblies <2> and <9>. Subassembly <2> can then be disassembled by either task 2 or task 3, representing an OR relationship between tasks 2 and 3.
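To make these AND/OR relationships concrete, the snippet below sketches one possible in-memory encoding of the graph fragment described above in Python. The dictionary names (applicable_tasks, task_children) are illustrative choices rather than the paper's data format, and only the relationships for subassemblies <1> and <2> are encoded.

```python
# Minimal sketch of an AND/OR graph encoding (illustrative field names).
# A subassembly with several applicable tasks expresses an OR relationship;
# the child subassemblies produced together by one task form an AND relationship.

applicable_tasks = {
    # subassembly id -> disassembly tasks that can process it (OR choices)
    1: [1],        # <1> is disassembled by task 1
    2: [2, 3],     # <2> is disassembled by task 2 OR task 3
}

task_children = {
    # disassembly task id -> subassemblies produced together (AND relationship)
    1: [2, 9],     # task 1 splits <1> into <2> and <9>
}

def feasible_tasks(subassembly_id):
    """Return the disassembly tasks applicable to a given subassembly."""
    return applicable_tasks.get(subassembly_id, [])

print(feasible_tasks(2))  # [2, 3]: an OR choice between tasks 2 and 3
```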
2.3. Matrix Description
To formally encode precedence and compatibility constraints in the DALBP, matrix-based representations are adopted. These matrices form the input to the optimization model and the state representations for reinforcement learning policies.
The precedence matrix is used to define the relationship between two tasks, where j and k denote IoT-updated disassembly tasks. Based on the aforementioned description and the AND/OR graph of the automotive battery, the corresponding conflict precedence matrix is obtained as follows:
The task–subassembly relationship matrix is defined element-wise, where each entry represents the relationship between subassembly i and task j. These relationships are also tracked via IoT-based sensors (e.g., RFID tags, smart vision) to verify component flow consistency across lines. Based on the AND/OR diagram of the automotive battery, the task–subassembly relationship matrix can be obtained as follows:
During the assembly process, assembly tasks must satisfy precedence constraints. These constraints are typically enforced by IoT controllers that validate logic sequences and inter-task timing before dispatching robotic instructions. In this work, the precedence relationships are represented by an assembly precedence matrix, which is defined as follows:

Suppose that the assembly task priority graph for a certain product is shown in Figure 4. Its assembly priority matrix can be represented as follows:
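As a general illustration of how such matrix inputs can be encoded for the optimization model and the learning agent, the snippet below uses NumPy for a small hypothetical three-task example. The numerical entries and the sign convention in the task–subassembly matrix are placeholders for illustration only; they are not the matrices of the automotive battery case or of Figure 4.

```python
import numpy as np

# Hypothetical conflict/precedence matrix between disassembly tasks:
# entry [j, k] = 1 if task j must precede (or conflicts with) task k, else 0.
disassembly_precedence = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])

# Hypothetical task-subassembly relationship matrix (one possible convention):
# entry [i, j] = +1 if task j produces subassembly i,
#                -1 if task j consumes subassembly i, and 0 otherwise.
task_subassembly = np.array([
    [-1,  0,  0],
    [ 1, -1,  0],
    [ 0,  1, -1],
])

# Hypothetical assembly precedence matrix:
# entry [i, k] = 1 if assembly task i must be completed before task k.
assembly_precedence = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
])

# Example check: which tasks must precede assembly task 3 (index 2)?
predecessors_of_task3 = np.where(assembly_precedence[:, 2] == 1)[0] + 1
print(predecessors_of_task3)  # -> [1 2]
```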
3. Mathematical Model
3.1. Model Assumptions
To facilitate the construction of a linear optimization model tailored for IoT-enabled robotic disassembly-assembly lines, the following assumptions are made:
(1) Task Duration and Profit: The execution time of each disassembly and assembly task is known in advance. Profits from each reusable subassembly are predefined, based on IoT-tracked part value and EOL quality data.
(2) Switching Time: Directional switching times between adjacent tasks are measured in real-time through sensor data and robot telemetry. These delays include repositioning, retooling, and network latency penalties from control signal transmission.
(3) Assembly Sequence: The assembly process begins only after disassembly is completed. IoT-based quality checks and integrity verification are performed before initiating downstream operations.
(4) Cyber-Physical Synchronization: Although currently operating offline, the model is intended to be supervised by edge controllers interfacing with a central planner. Latency, synchronization drift, and network effects are assumed negligible here but will be explored in future real-time deployments.
3.2. Notations
| Index of assembly tasks |
| Index of disassembly tasks |
| Index of workstations |
| Index of subassemblies |
| Cycle time for each workstation |
| Total disassembly time for the product |
| Total assembly time for the product |
| Set of conflicting tasks with assembly task i |
| Set of conflicting tasks with disassembly task j |
| Set of all assembly tasks |
| Set of all disassembly tasks |
| Set of all subassemblies |
| Set of all workstations |
| Time to perform assembly task i |
| Time to perform disassembly task j |
| Subassembly node in AND/OR graph |
| Disassembly task node in AND/OR graph |
| Set of immediate predecessors of a node in the AND/OR graph |
| Set of immediate successors of a node in the AND/OR graph |
| Disassembly tasks that can produce subassembly e |
| Similar task pairs (assembly and disassembly) |
| Directional switching time between disassembly tasks j and j′ |
| Directional switching time between assembly tasks i and i′ |
| Unit time cost of performing assembly task i |
| Unit time cost of performing disassembly task j |
| Operating cost of workstation w |
| Penalty cost for ungrouped similar tasks |
| Assembly profit |
| Maximum possible profit |
| Profit from disassembling task j |
3.5. Subject to
Objective function (Equation (1)): Minimizes the total disassembly and assembly time. The constraints are as follows:
Constraint (2): each disassembly task is assigned to exactly one workstation.
Constraint (3): each assembly task is assigned to exactly one workstation.
Constraint (4): each subassembly is generated by exactly one disassembly task.
Constraint (5): the number of executed successor tasks for a subassembly cannot exceed that of its predecessor tasks.
Constraint (6): the total operation time per workstation must not exceed the cycle time.
Constraint (7): the total profit must meet the minimum threshold.
Constraint (8): the predecessor tasks of a subassembly must be completed before its successor tasks.
Constraint (9): assembly tasks are assigned to workstations according to the priority sequence.
Constraint (10): the total disassembly time is at least the sum of task execution time and directional switching time.
Constraint (11): the total assembly time is at least the sum of task execution time and directional switching time.
Constraint (12): conflicting disassembly tasks cannot be assigned to the same workstation.
Constraint (13): conflicting assembly tasks cannot be assigned to the same workstation.
Constraints (14) and (15): a penalty indicator is activated when similar tasks are not co-assigned.
Constraint (16): all decision variables are binary.
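Since the constraints are referenced here only by number, the following minimal Python sketch checks two representative requirements, assignment completeness (Constraints (2)–(3)) and the workstation cycle-time limit (Constraint (6)), for a candidate assignment. The variable names, data values, and the choice to encode assignments as task-to-workstation dictionaries are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical data: task times and a candidate task-to-workstation assignment.
disassembly_time = {1: 2.0, 2: 1.5, 3: 3.0}
assembly_time = {1: 1.0, 2: 2.0}
cycle_time = 6.0

d_assign = {1: 1, 2: 1, 3: 2}   # disassembly task -> workstation
a_assign = {1: 2, 2: 2}         # assembly task -> workstation

def satisfies_constraints(d_assign, a_assign):
    # Constraints (2)/(3): every task appears exactly once in the assignment maps.
    if set(d_assign) != set(disassembly_time) or set(a_assign) != set(assembly_time):
        return False
    # Constraint (6): total operation time per workstation within the cycle time.
    load = {}
    for j, w in d_assign.items():
        load[w] = load.get(w, 0.0) + disassembly_time[j]
    for i, w in a_assign.items():
        load[w] = load.get(w, 0.0) + assembly_time[i]
    return all(t <= cycle_time for t in load.values())

print(satisfies_constraints(d_assign, a_assign))  # True for this example
```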
Table 1 illustrates the operation direction switching times between disassembly tasks using an automobile battery as an example. Suppose a disassembly sequence is 1 → 2 → 4 → 6 → 8. The total direction switching time required for this sequence is 0.5. Adding the disassembly times of these five tasks gives the total disassembly time.
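To illustrate how such a sequence is evaluated, the sketch below computes the total disassembly time as the sum of task times and pairwise switching times. The task times and the switching-time dictionary shown are hypothetical placeholders, not the values from Table 1.

```python
# Hypothetical task times and directional switching times (placeholder values).
task_time = {1: 2.0, 2: 1.5, 4: 3.0, 6: 2.5, 8: 1.0}

# switching_time[(j, k)] is the direction switching delay when task k
# immediately follows task j; pairs not listed incur no switching delay.
switching_time = {(2, 4): 0.2, (6, 8): 0.3}

def total_disassembly_time(sequence):
    """Sum of task execution times plus switching delays along a sequence."""
    total = sum(task_time[j] for j in sequence)
    total += sum(switching_time.get((j, k), 0.0)
                 for j, k in zip(sequence, sequence[1:]))
    return total

print(total_disassembly_time([1, 2, 4, 6, 8]))  # 10.0 task time + 0.5 switching
```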
3.6. Comment on IoT Implications of the Model
Although the model is presently applied in a non-IoT setting, it is designed to support integration with future IoT-enabled infrastructures. Switching times (G) can be extended to capture not only physical repositioning delays but also communication latency between distributed controllers, such as edge-to-cloud feedback delays. Workstation activation variables influence both operational cost and system exposure, with fewer active stations reducing the potential attack surface in a connected environment. Penalty terms, which discourage scattering similar tasks, will be even more relevant in IoT-integrated scenarios, where frequent inter-device communication may elevate synchronization complexity and security risks. While the current implementation assumes a deterministic and offline decision-making environment, future extensions can incorporate online reinforcement learning to account for network variability, edge inference delays, or sensor failures. These enhancements would enable the model to function as a robust foundation for secure, adaptive, and cyber-aware scheduling in next-generation smart manufacturing systems.
4. Algorithm Design
4.1. Action, State, and Reward Design
Although the current implementation does not deploy in a fully IoT-integrated environment, the reinforcement learning (RL) framework is designed to support future extensions to IoT-enabled smart manufacturing settings. In such scenarios, robotic scheduling will be coupled with real-time sensor feedback, edge/cloud coordination, and secure communication. Therefore, the formulation of actions, states, and rewards anticipates these characteristics, aiming to remain compatible with future deployments.
The actions in this work are defined based on the disassembly tasks represented in the AND/OR graph. Due to the presence of the assembly stage, the actions in the assembly process are defined in a more abstract manner, rather than corresponding to detailed physical operations. The state representation is also divided by stage: in the disassembly stage, states correspond to the subassembly nodes in the AND/OR graph, whereas in the assembly stage, states are defined by the task nodes in the assembly precedence graph.
Although the overall objective is to minimize the total time of disassembly and assembly, the assembly tasks must all be completed, and thus their total time is constant across feasible solutions. Therefore, the optimization can focus solely on minimizing the total disassembly time under a given profit constraint. Based on this insight, the reward function is designed as a large constant minus the sum of the disassembly time and the operation direction switching time. If the direction switching time between two consecutive tasks is nonzero, it is added to the task time; otherwise, the reward equals the constant minus the disassembly time alone. The reward function can be expressed as:

R = C − (t + g), if g > 0;  R = C − t, if g = 0

where C is a large constant, t represents the disassembly time of the task, and g denotes the direction switching time.
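A minimal Python sketch of this reward, assuming a hypothetical constant C = 100 and the symbols defined above:

```python
# Reward = large constant minus disassembly time and any direction switching
# time incurred between consecutive tasks (C = 100 is an assumed placeholder).
C = 100.0

def reward(t, g=0.0):
    """t: disassembly time of the executed task; g: direction switching time."""
    return C - (t + g) if g > 0 else C - t

print(reward(2.5))       # no switching: 97.5
print(reward(2.5, 0.3))  # with switching: 97.2
```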
4.2. Traditional Q(λ) Algorithm
Compared with the standard Q-learning algorithm, the Q(λ) algorithm [24,25] introduces the eligibility trace mechanism, which endows the agent with a certain level of memory. This allows the agent to retain a trace of the states and actions it has recently experienced, thereby strengthening the learning of actions associated with higher Q-values. In other words, the algorithm assigns greater weight to states that are closer to the target, which enhances convergence speed. By emphasizing the importance of recently visited and promising paths, Q(λ) accelerates learning in environments with delayed rewards. The process of the Q(λ) algorithm is given in Algorithm 1.
Algorithm 1: Q(λ) Learning Algorithm
1: Input: initialized Q-table Q(s, a), learning rate α, discount factor γ, trace decay factor λ
2: Initialize: eligibility trace e(s, a) ← 0 for all (s, a)
3: repeat
4:   Initialize state s
5:   Choose initial action a in s using the ε-greedy policy
6:   repeat
7:     Execute action a, observe reward r and next state s′
8:     Choose next action a′ from s′ using the ε-greedy policy
9:     δ ← r + γ · max_a″ Q(s′, a″) − Q(s, a)
10:    e(s, a) ← e(s, a) + 1
11:    for all (s, a) do
12:      Q(s, a) ← Q(s, a) + α · δ · e(s, a)
13:      e(s, a) ← γ · λ · e(s, a)
14:    end for
15:    s ← s′
16:    a ← a′
17:  until terminal state reached
18: until convergence criteria met
The core update equation of the Q(λ) algorithm is given by:

δ_t = r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)
Q(s, a) ← Q(s, a) + α · δ_t · e_t(s, a),  for all (s, a)

Here, s_t denotes the state at time step t, representing the environment or system condition the agent is currently in (e.g., the location of a robot and its current task status). The term a_t denotes the action selected by the agent in state s_t, such as performing a disassembly operation or transitioning to another workstation. The parameter α is the learning rate, controlling how much newly acquired information overrides the old Q-values; a higher α leads to faster learning but may cause instability. The discount factor γ determines the importance of future rewards compared to immediate ones, with 0 ≤ γ ≤ 1; a value close to 1 places greater emphasis on long-term reward.
The eligibility trace e_t(s, a) represents a temporary memory trace of the state–action pair (s, a), recording how recently and frequently this pair has been visited. Compared to standard Q-learning, this algorithm introduces an eligibility trace matrix E (initialized to zero). This matrix stores the agent’s trajectory path, with states closer to the terminal state retaining stronger memory traces. Essentially, it adds a memory buffer to Q-learning. However, the values in E decay with each Q-table update, meaning states farther from the current reward have diminishing importance.
The temporal difference (TD) error and TD target in Q(λ) are identical to those in standard Q-learning, with the key difference being the incorporation of importance weights via the eligibility trace E during Q-table updates. In traditional Q(λ), the eligibility trace is updated by a simple increment:

e_t(s_t, a_t) ← e_{t−1}(s_t, a_t) + 1

However, for AND/OR graph searches where agents may oscillate between nodes, this can cause certain states to accumulate excessively high eligibility values. To address this issue, we modify the eligibility trace update as follows:

e_t(s_t, a_t) ← 1
This modified update acts as a truncation mechanism. During state transitions, the eligibility trace is reset to 1 for the current state–action pair rather than being incremented. This prevents excessive accumulation in E values and avoids overestimation of Q-values for oscillating states.
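The following is a minimal, self-contained Python sketch of a single Q(λ) backup with the truncated (reset-to-1) trace described above. The table sizes, state encoding, and parameter values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# Illustrative sizes and hyperparameters (placeholder values).
n_states, n_actions = 12, 8
alpha, gamma, lam = 0.1, 0.9, 0.8

Q = np.zeros((n_states, n_actions))   # Q-table
E = np.zeros((n_states, n_actions))   # eligibility trace matrix

def q_lambda_update(s, a, r, s_next):
    """One Q(lambda) backup with the modified (reset-to-1) trace update."""
    # TD error uses the greedy bootstrap, as in standard Q-learning.
    delta = r + gamma * Q[s_next].max() - Q[s, a]

    # Modified trace update: reset to 1 instead of incrementing, which
    # prevents oscillating states from accumulating excessive eligibility.
    E[s, a] = 1.0

    # Apply the weighted update to all state-action pairs, then decay traces.
    Q[:] += alpha * delta * E
    E[:] *= gamma * lam

# Example transition: state 2, action 3, a placeholder reward, next state 4.
q_lambda_update(2, 3, 97.5, 4)
print(Q[2, 3])
```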
4.3. IQ(λ) Algorithm
To improve the adaptability and global search capability of reinforcement learning in complex disassembly-assembly scenarios, we enhance the conventional Q(λ) algorithm with two major improvements: a memory-efficient eligibility trace update and a dynamic Action Table (AT) for preserving parallel task options. These enhancements make the agent more robust in navigating large, hierarchical task graphs typical of EOL product disassembly.
While the current training environment does not incorporate real-time IoT data, the IQ(λ) architecture is built with extensibility in mind. The Action Table mechanism can accommodate future scenarios where disassembly states are derived from IoT-updated AND/OR graphs based on sensor status or environmental changes. Similarly, task eligibility and reward signals can be adapted to reflect communication reliability, energy usage, or trust scores from federated control nodes in an IoT system.
To address the DALBP, we refine the conventional Q(λ) algorithm through an automotive battery case analysis. During disassembly, when an agent in state 2 executes action 3 (disassembly task 3), two accessible subassemblies, <4> and <5>, emerge, corresponding to non-conflicting tasks 5 and 6. Traditional Q(λ) forces exclusive selection between such parallel-capable tasks, permitting only one state transition while ignoring viable alternatives. This violates fundamental disassembly principles, where non-conflicting tasks can execute concurrently. To resolve this limitation, we introduce an Action Table (AT) that dynamically stores deferred but valid candidate actions, preserving executable options that would otherwise be discarded.
The AT mechanism demonstrates critical utility when processing state 3 after disassembly task 4, where accessible subassemblies <4> and <6> enable subsequent non-conflicting tasks 6 and 7. Selection of either task automatically preserves the alternative in the AT for future transitions, strategically expanding the search space. This innovation proves particularly valuable in large-scale industrial scenarios featuring frequent parallel non-conflicting tasks. By enabling direct retrieval of valid actions from the AT rather than restarting searches from the AND/OR graph root, the approach significantly enhances global search capability while improving adaptability to concurrent operations. Consequently, training efficiency increases, convergence accelerates, and superior solutions emerge within fewer training steps. The IQ(λ) algorithm flowchart is shown in Figure 5.
The steps of the IQ(λ) algorithm are as follows:
Step 1: Initialize the Q-table, the eligibility trace matrix E, the learning rate α, the discount factor γ, and the trace decay factor λ.
Step 2: Train the agent to become familiar with the environment. A random state is selected from the state space, and an action is randomly chosen from the action space. If the selected action is invalid, the algorithm immediately proceeds to the next learning episode. If the action is valid, the agent transitions to a new state based on the selected action and receives a corresponding reward. The value function is updated using the TD error. This process continues until a certain level of reward is achieved or no further actions are available, at which point the next learning episode begins. The process is repeated until the maximum number of training steps is reached.
Step 3: After completing the training phase, the agent enters the deployment phase. Starting from state 1, the agent selects actions. For each decision, the agent first checks whether the Action Table (AT) contains any candidate actions. If so, the Q-values of those actions are compared with those of the currently available actions. The agent then selects an action according to the ε-greedy policy and performs a state transition. Afterward, the eligibility traces are updated according to a decay rule, which gradually reduces the impact of past rewards. The agent continues selecting actions and transitioning states, comparing Q-values from both the current action space and the AT, until the cumulative reward reaches a specified threshold (a minimal sketch of this AT-based selection is given after Step 4).
Step 4: Perform the assembly tasks based on the assembly precedence graph. According to the principle of task similarity, similar disassembly and assembly tasks should be assigned to the same workstation whenever possible.
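The sketch below illustrates one way the Action Table lookup in Step 3 could be realized in Python. The data structures, the greedy tie-breaking, and the assumption that Q is indexable as Q[state][action] (e.g., a NumPy array) are illustrative choices; conflict checks between deferred tasks are omitted for brevity.

```python
import random

epsilon = 0.1          # exploration rate (assumed value)
action_table = set()   # AT: deferred but still-valid candidate actions

def select_action(Q, state, available_actions):
    """Pick an action from the current action space plus the Action Table."""
    # Candidate pool: currently feasible actions and any deferred AT actions.
    candidates = list(set(available_actions) | action_table)

    if random.random() < epsilon:
        action = random.choice(candidates)                   # explore
    else:
        action = max(candidates, key=lambda a: Q[state][a])  # exploit best Q

    # Non-selected but feasible parallel actions are kept for later reuse,
    # so the agent need not restart the search from the AND/OR graph root.
    action_table.update(a for a in available_actions if a != action)
    action_table.discard(action)
    return action
```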
In summary, the IQ(λ) algorithm incorporates significant improvements in both algorithmic structure and its applicability to disassembly-assembly lines. The integration of the Action Table mechanism and the trace update strategy enhances the agent's global search capability and decision efficiency. These enhancements enable the algorithm to effectively solve the DALBP studied in this work, demonstrating that value-based reinforcement learning methods, when properly improved, can be successfully applied to complex scheduling scenarios such as DALBP [26,27,28].
5. Experimental Studies
This section investigates the effectiveness of the improved IQ(λ) algorithm for solving the DALBP. First, the mathematical model is validated using the CPLEX solver to ensure its correctness and completeness. Then, the performance of IQ(λ) is compared with several baseline algorithms, including the original Q(λ) [29,30], Sarsa [31], and Q-learning [32,33]. All experiments are conducted on a computer with an Intel Core i7-10870H processor and 16 GB RAM, running Windows 10. The algorithms are implemented in Python using the PyCharm development environment.
5.1. Test Instances
This work evaluates the performance of the IQ(λ) algorithm using four test cases: a flashlight, a copying machine, an automobile battery, and a hammer drill. Table 2 summarizes the number of tasks and subassemblies associated with each instance [34,35].
Table 3 presents the solution results obtained by the CPLEX solver for each test instance. In this table, assembly task sequences are indicated using curly braces, while parentheses denote tasks that are assigned to the same workstation. No result is provided for Case 4 because the CPLEX solver failed to find a solution for the hammer drill instance within the 3600 s time limit. Since CPLEX relies on a branch-and-bound search whose tree grows rapidly with problem size, the 16 GB of memory on the experimental computer was insufficient to handle this large-scale instance. Consequently, CPLEX did not yield a feasible solution and reported a memory overflow.
5.2. Comparison
This work employs the Sarsa, Q-learning, and Q(λ) algorithms as benchmarks to conduct comparative experiments with the proposed IQ(λ) algorithm. The hyperparameters for the four algorithms across the four test instances are unified, as shown in Table 4. Since the decay factor λ is not involved in Sarsa and Q-learning, the number of training steps, discount rate γ, learning rate α, action space size, and number of training episodes are kept consistent, with λ uniformly set to 0.8 to ensure fairness in comparison.
The results demonstrate that IQ(λ) consistently achieves superior solution quality and faster convergence across all test instances. While the current implementation does not yet include live IoT data streams or device coordination, the algorithm's performance establishes a strong baseline for eventual deployment in smart disassembly environments with cyber-physical feedback.
Table 5 details the experimental results of the optimal objective values and running times for the four algorithms on the four test instances, along with their comparative differences. Analysis of Table 5 indicates that the IQ(λ) algorithm demonstrates significant advantages in both objective value and computational efficiency. In Case 1, IQ(λ), Q-learning, and Q(λ) all found the optimal solution of 38.3; however, IQ(λ) achieved a running time of 0.02 s, substantially lower than Q-learning's 0.07 s and Q(λ)'s 0.03 s. The results for Case 2 are even more striking: only IQ(λ) successfully found a high-quality solution of 45.4 in 0.02 s, whereas Sarsa and Q(λ) failed to find feasible solutions within the limited number of actions. Although Q-learning found a solution, its objective value of 60.2 was significantly worse than that of IQ(λ), and it required more time. For Case 3, all algorithms found the same optimal solution of 58.1, but IQ(λ) converged the fastest, outperforming Sarsa and Q-learning significantly. For the most challenging large-scale Case 4, IQ(λ) again achieved the best performance, obtaining the best objective value of 164.8 in the shortest time of 0.28 s. While Sarsa found a solution, its quality was inferior and its runtime longer; Q-learning failed to find a feasible solution; and Q(λ) found a solution with an 11.41% gap in objective value relative to IQ(λ) and a much higher runtime.
Further investigation into the convergence characteristics of the algorithms is presented in Figure 6, Figure 7, Figure 8 and Figure 9. Taking the TD error convergence process in Case 3 as an example, the IQ(λ) algorithm achieves the fastest convergence, reaching a stable state after approximately 600 training steps. In contrast, Q-learning converges around step 700 but still exhibits minor fluctuations up to step 900. Sarsa and Q(λ) fail to fully converge within the given number of training steps. It is worth noting that although Sarsa and Q(λ) do not completely converge in Case 3, their inherent exploration behavior occasionally led them to select optimal actions by chance, thereby achieving the same optimal objective value as IQ(λ).
Figure 10 illustrates the reward value trajectory of IQ(λ) in Case 3. During the early unstable training phase, reward values were reset to zero. The reward stabilized around step 400 and consistently maintained the optimal value of 58.1 thereafter.
Based on the above experimental findings and cross-validation against the feasible solutions provided by the commercial solver CPLEX, it can be concluded that the proposed IQ(λ) algorithm exhibits significant superiority over Sarsa, Q-learning, and Q(λ) in solving the DALBP addressed in this work. IQ(λ) consistently delivers optimal or highly competitive solutions across all four test instances, with remarkable advantages in both solution quality and computational efficiency. Additionally, its convergence is the fastest and most stable among all compared algorithms, further validating the effectiveness of the algorithm design.
5.3. Ablation Study and Parameter Sensitivity Analysis
To thoroughly analyze the core improvement mechanism and robustness characteristics of the IQ(λ) algorithm, this section conducts ablation studies and parameter sensitivity experiments. The experimental environment and comparison algorithm settings are the same as in Section 5.2, and all results are averages of 20 independent experiments.
5.3.1. Ablation Experiment
The ablation experiments in this work aim to quantitatively evaluate the independent contributions of the three core modules, namely the eligibility trace mechanism, the Action Table, and the switching penalty, to the algorithm's performance. To this end, we selected Case 2 and Case 4 as test subjects, as these two cases contain different numbers of tasks, allowing us to examine the algorithm's performance under varying levels of complexity. In the experiments, all variants used the same parameter configurations as those in Table 4. Specifically, we designed four algorithm variants for comparison: IQ(λ)-Full represents the complete algorithm with all three modules; IQ(λ)-NoTrace disables the eligibility trace mechanism by setting λ to 0; IQ(λ)-NoAT removes the Action Table; and IQ(λ)-NoPenalty eliminates the switching penalty by setting the penalty coefficient to 0. The relevant experimental results are detailed in Table 6.
The complete IQ(λ) algorithm consistently achieves the lowest objective values, fastest computation times, and highest stability across both scenarios. Specifically, the eligibility trace mechanism effectively reduces performance variance and improves solution quality; the Action Table accelerates the search process by reducing computational overhead; and the switching penalty plays a critical role in avoiding ineffective action switches and maintaining solution quality. Disabling any single module leads to clear degradation in solution quality, longer computation time, or increased instability. These results highlight the strong synergy among the modules, demonstrating that the full IQ(λ) algorithm delivers significantly better overall performance.
5.3.2. Parameter Sensitivity Analysis
Parameter sensitivity analysis is conducted to investigate the effects of key algorithm parameters on performance. Specifically, Table 7 presents the sensitivity to the switching penalty coefficient, Table 8 shows the sensitivity to the learning rate α in terms of objective value and convergence steps, and Table 9 illustrates the sensitivity to task scale.
Sensitivity to switching penalty coefficient
Conclusion: the optimal range of the switching penalty coefficient is [0.3, 0.5]. Complex tasks (hammer drill) are more sensitive to this coefficient, achieving an 11.4% improvement in objective value when it increases from 0.0 to 0.5.
Sensitivity to learning rate α
Conclusion: a moderate learning rate α achieves the best trade-off between convergence speed and solution quality; with larger values of α, convergence accelerates but objective values deteriorate significantly, by more than 5%.
Sensitivity to task scale
Conclusion: IQ(λ) demonstrates strong scalability. As the task scale increases from 13 to 46 tasks, the synergy between the eligibility traces and the Action Table becomes more significant, with runtime increasing moderately (0.02 s → 0.28 s), much more slowly than the growth in problem size.
6. Discussion: Implications for IoT Cybersecurity
The integration of IoT technologies into cyber-physical disassembly systems introduces both opportunities for real-time optimization and challenges related to cybersecurity and data privacy. While this work focuses on optimizing the disassembly-assembly line balancing problem (DALBP) through the improved IQ(λ) algorithm, the broader context of IoT-enabled manufacturing necessitates attention to data protection, system integrity, and trust-aware decision-making.
The reinforcement learning framework proposed in this work is inherently extensible to secure and privacy-preserving architectures. In an IoT-enabled disassembly line, sensors and edge devices continuously collect and transmit data related to product conditions, task dependencies, worker safety, and robot dynamics. This data enables adaptive task planning but simultaneously raises concerns over the exposure of operational and user-sensitive information.
Building upon the modular architecture of IQ(λ), the following extensions illustrate how IoT cybersecurity and privacy could be systematically incorporated:
Privacy-Aware Task Sequencing: The task selection mechanism in IQ(λ), enhanced with instruction-tuned LLM guidance, can be adapted to prioritize the early disassembly of components containing sensitive or proprietary data (e.g., memory chips, identity-tagged modules). This ensures these parts are securely isolated or neutralized, reducing the risk of data leakage.
Federated Learning for Worker Privacy: In future IoT-integrated implementations, reinforcement learning agents can be combined with federated learning to enable decentralized policy updates across edge devices. Worker-related data, such as posture, fatigue, or biometric information, can be processed locally at each robotic cell, minimizing exposure and ensuring compliance with privacy regulations such as GDPR or HIPAA.
Secure Communication for Edge Devices: Edge nodes and robotic controllers in the proposed system can benefit from lightweight cryptographic protocols and attribute-based encryption schemes. These mechanisms ensure secure communication of task instructions and status updates without overburdening resource-constrained devices, maintaining the system’s real-time performance.
Trust-Aware Reward Design: The reward function in IQ(λ) can be extended to incorporate trust and safety constraints. For example, task transitions that risk ergonomic overload or require exposure of identifiable worker data can be penalized. Through such design, the algorithm not only optimizes task scheduling but also upholds ethical and legal boundaries in human-robot collaboration.
Edge-AI Integration and Privacy-by-Design: Although not implemented in the current study, the architectural structure of IQ(λ) supports future integration with edge-AI modules. These could enable secure inference at the sensor level, reinforcing a privacy-by-design workflow where data remains local, and only essential decision updates are propagated to the cloud.
In summary, the improved IQ(λ) algorithm provides not only a robust solution for DALBP in smart manufacturing but also a foundation upon which future extensions can build to address emerging concerns in IoT cybersecurity and privacy. As disassembly lines become increasingly connected and autonomous, the intersection of optimization, learning, and secure communication will be critical to ensuring both operational efficiency and data integrity.
7. Conclusions
This paper presents an improved Q(λ)-based reinforcement learning framework to solve the Disassembly and Assembly Line Balancing Problem (DALBP), with a particular focus on modeling directional switching time and conflict constraints in robotic operations. The proposed IQ(λ) algorithm introduces key innovations, including eligibility trace decay, a dynamic Action Table mechanism, and reward shaping that penalizes inefficient task switching. These features significantly enhance the agent's ability to explore and optimize large-scale, parallelizable task sequences.
Through comprehensive experiments on four benchmark cases, IQ(λ) demonstrates superior performance over the baseline algorithms (Q-learning, Sarsa, and Q(λ)) in terms of both objective value and convergence speed. Notably, the algorithm handles complex, large-scale problems such as the hammer drill case, which the commercial solver CPLEX could not solve within the time and memory limits.
Although the current implementation does not yet integrate real-time IoT infrastructure, the modular design of IQ(λ) is specifically tailored to enable future deployment in IIoT-based smart factories. This includes potential extensions involving federated learning, privacy-preserving task planning, lightweight cryptographic communication, and trust-aware optimization. As industrial systems evolve toward interconnected and autonomous paradigms, the proposed approach offers a practical, scalable foundation for secure and efficient robotic scheduling under cyber-physical conditions.
Future work will focus on embedding the algorithm into an edge-enabled disassembly system, incorporating live sensor feedback, cyber defense measures, and cloud-based policy refinement, with the goal of advancing secure and adaptive learning in IoT-driven manufacturing environments.