# A Parallel Deep Reinforcement Learning Framework for Controlling Industrial Assembly Lines


## Abstract


## 1. Introduction

## 2. Related Works and Paper Contributions

#### 2.1. RL Approaches

#### 2.2. DRL Approaches

- ALBPs are in general NP-hard problems that naturally arise in industrial processes; many instances have been formalized, and most solutions are instance-driven, i.e., tailored to the specific use case addressed;
- Several advanced algorithms have been developed to support exact methods, and computational costs have also been reduced by means of heuristics and genetic algorithms; however, with the explosion of exploitable data and the consideration of more complex industrial problems, these improvements are no longer sufficient for more complex ALBP instances;
- Several ML-based approaches have been proposed and applied to industrial problems; although RL-based approaches allow optimal policies to be learned even in a model-free setting, they are limited to small- and medium-sized problems or to simple ALBP instances;
- Large-sized problems and complex ALBP instances can be successfully addressed by DRL techniques; in this context, however, the duration of the training phase and the quality of the solutions remain crucial aspects that still need to be investigated.

#### 2.2.1. Parallel and Distributed DRL Approaches

#### 2.3. Paper Contributions

- A complex ALBP instance has been considered, featuring (i) hard constraints on tasks, resources and workstations and (ii) the need to solve both task and resource assignment problems;
- A parallel DRL approach for solving the mentioned ALBP has been derived in order to reduce the time required by the training phase;
- Given the application sector (integration activities in the aerospace sector), the management of hard constraints has been embedded in the decision problem.
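The second contribution, parallelized agents' training, follows a generic pattern in which several workers collect experience concurrently and merge it into a shared replay buffer. The following is a minimal sketch of that pattern only; the toy environment, helper names, and worker layout are illustrative and not the paper's actual scheme:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def collect_episode(seed, steps=5):
    """One worker: roll out a toy episode and return its transitions."""
    rng = random.Random(seed)
    transitions = []
    state = 0
    for _ in range(steps):
        action = rng.randrange(2)
        next_state = state + action
        reward = -1.0  # e.g., a per-step penalty that favors short schedules
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

def parallel_collect(n_workers=4, steps=5):
    """Run workers concurrently and merge their experience into one buffer."""
    replay_buffer = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for episode in pool.map(collect_episode, range(n_workers),
                                [steps] * n_workers):
            replay_buffer.extend(episode)
    return replay_buffer

buf = parallel_collect()
```

In this pattern, speed-up comes from amortizing environment interaction across workers while a single learner consumes the shared buffer.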

## 3. Problem Formalization

#### 3.1. Assembly Line Balancing Problem

#### 3.2. Proposed Markov Decision Process Framework

#### 3.2.1. State Space

- ${\mathcal{S}}^{\mathcal{W}}\left[k\right]$ captures the workstations’ states: the generic element ${s}_{i}^{\mathcal{W}}\left[k\right]\in {\mathcal{S}}^{\mathcal{W}}\left[k\right]$ represents the state of the i-th workstation, defined as the number of tasks in execution at time k;
- ${\mathcal{S}}^{\mathcal{T}}\left[k\right]$ captures the tasks’ states: the generic element ${s}_{j}^{\mathcal{T}}\left[k\right]\in {\mathcal{S}}^{\mathcal{T}}\left[k\right]$ represents the state of the j-th task, defined as the number of discrete time instants ${d}_{j}\left[k\right]$ left before the task finishes;
- ${\mathcal{S}}^{\mathcal{R}}\left[k\right]$ captures the resources’ states: the generic element ${s}_{i,r}^{\mathcal{R}}\left[k\right]\in {\mathcal{S}}^{\mathcal{R}}\left[k\right]$ represents the state of the resource of type r at the i-th workstation, defined as the number of resource units available there at time k.
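As a concrete illustration, the three state components can be flattened into a single vector suitable as input to an agent's network. The function below is a hypothetical sketch; its name and the example dimensions are illustrative, not from the paper:

```python
import numpy as np

def build_state(tasks_in_execution, remaining_durations, resource_units):
    """Concatenate workstation, task and resource states into one flat vector.

    tasks_in_execution:  s^W[k], one task count per workstation
    remaining_durations: s^T[k], remaining time steps d_j[k] per task
    resource_units:      s^R[k], available units per (workstation, resource type)
    """
    s_w = np.asarray(tasks_in_execution, dtype=float)
    s_t = np.asarray(remaining_durations, dtype=float)
    s_r = np.asarray(resource_units, dtype=float).ravel()
    return np.concatenate([s_w, s_t, s_r])

# Example: 2 workstations, 3 tasks, 2 resource types
state = build_state([1, 0], [4, 0, 2], [[1, 0], [2, 1]])
```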

#### 3.2.2. Action Space

- ${\mathcal{A}}^{\mathcal{T}}\left[k\right]$ represents task control actions: the generic element ${a}_{i,j}^{\mathcal{T}}\left[k\right]\in {\mathcal{A}}^{\mathcal{T}}\left[k\right]$ specifies whether task ${\tau}_{j}$ has been assigned to workstation i at time k (${a}_{i,j}^{\mathcal{T}}\left[k\right]=1$) or not (${a}_{i,j}^{\mathcal{T}}\left[k\right]=0$);
- ${\mathcal{A}}^{\mathcal{R}}\left[k\right]$ represents resource control actions: the generic element ${a}_{i,r}^{\mathcal{R}}\left[k\right]\in {\mathcal{A}}^{\mathcal{R}}\left[k\right]$ specifies whether a unit of resource of type r has been assigned to workstation i at time k (${a}_{i,r}^{\mathcal{R}}\left[k\right]=1$) or not (${a}_{i,r}^{\mathcal{R}}\left[k\right]=0$).
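Since both action sets are binary assignment decisions, a controller typically enumerates only the feasible pairs before the agent chooses among them. The sketch below illustrates this for the task-assignment actions ${a}_{i,j}^{\mathcal{T}}$, using the single-task-per-workstation rule stated in the simulation scenarios; the helper and its arguments are illustrative, not from the paper:

```python
def feasible_task_actions(n_workstations, busy, remaining, started):
    """Enumerate feasible (i, j) pairs for which a^T_{i,j}[k] = 1 is allowed.

    busy[i]      -- True if workstation i is already executing a task
    remaining[j] -- remaining duration d_j[k] of task j
    started[j]   -- True if task j has already been assigned
    """
    actions = []
    for i in range(n_workstations):
        if busy[i]:
            continue  # a workstation executes at most one task at a time
        for j, d in enumerate(remaining):
            if d > 0 and not started[j]:
                actions.append((i, j))
    return actions

# Workstation 0 is busy; task 0 is started, task 1 is finished,
# so only task 2 can be assigned, and only to workstation 1.
acts = feasible_task_actions(2, [True, False], [4, 0, 2],
                             [True, False, False])
```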

#### 3.2.3. Reward Function

#### 3.2.4. Policy Function

## 4. Proposed DRL Task-Control Algorithm with Parallelized Agents’ Training

#### Agents’ Neural Network Structure

## 5. Simulations

- Scenario 1. We evaluate the performance of the algorithm on the problem of scheduling a set of tasks on a given number of workstations, under the following constraints: the precedence constraints among the tasks must be respected, all tasks must finish before a given deadline, and each workstation can execute any task, but only one at any given time;
- Scenario 2. The same as Scenario 1, but constraints on the resources needed by the tasks to execute are also considered. The algorithm must additionally provide an optimized schedule of resource assignments to the workstations, minimizing resource usage (i.e., it should assign only the needed resources, only when needed).
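The constraints of Scenario 1 can be stated compactly as a feasibility check on a tentative schedule. The following is an illustrative sketch under the assumption of integer start times and durations; the paper does not prescribe this representation:

```python
def schedule_is_valid(start, duration, predecessors, deadline):
    """Check a tentative schedule against Scenario 1's constraints.

    start[j]        -- scheduled start time of task j
    duration[j]     -- processing time of task j
    predecessors[j] -- tasks that must finish before task j starts
    """
    for j in range(len(start)):
        # every task must finish before the deadline
        if start[j] + duration[j] > deadline:
            return False
        # precedence: each predecessor must finish before task j starts
        for p in predecessors[j]:
            if start[p] + duration[p] > start[j]:
                return False
    return True

# Tasks 1 and 2 both require task 0 to finish first.
ok = schedule_is_valid([0, 2, 2], [2, 3, 1], [[], [0], [0]], deadline=6)
```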

#### 5.1. Experimental Setup

- The NN implemented inside each DQN agent is a dense, two-layer network with 24 neurons in each layer and ReLU activation functions;
- The Adam optimizer is used [32];
- The learning rate is $\alpha =0.001$;
- The discount factor is $\gamma =0.9$;
- A memory size of 2000 samples is selected;
- The minibatch size is 256 samples;
- Learning is performed over 10,000 episodes.
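For reference, the described architecture (two dense hidden layers of 24 ReLU units each, with linear Q-value outputs) can be sketched framework-independently as a plain forward pass. The weights below are random placeholders and the dimensions are illustrative; the paper itself cites Keras' Adam optimizer [32] for training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters reported in the experimental setup
ALPHA = 0.001   # learning rate
GAMMA = 0.9     # discount factor

def make_dqn(state_dim, n_actions, hidden=24):
    """Randomly initialized weights for a dense 2x24 ReLU Q-network."""
    dims = [state_dim, hidden, hidden, n_actions]
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def q_values(params, state):
    """Forward pass: ReLU on both hidden layers, linear output layer."""
    x = np.asarray(state, dtype=float)
    for idx, (w, b) in enumerate(params):
        x = x @ w + b
        if idx < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

params = make_dqn(state_dim=9, n_actions=6)
q = q_values(params, np.zeros(9))
```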

#### 5.2. Scenario 1

#### 5.3. Scenario 2

#### 5.4. Comparison with State-of-the-Art Approaches

- The total time to complete the tasks (cycle time). This measures how good the algorithm is at optimizing the tasks’ execution;
- The execution time of the algorithm. This measures how quickly the algorithm computes the tasks’ schedule, which is important especially when the schedule must be updated in real time (e.g., due to faults).

#### 5.4.1. MPC for Task Execution Control

#### 5.4.2. Shortest Processing Time Heuristic

- In the assignment problem, if both workstations are free, priority is given to the workstation with the lower ID;
- In the assignment problem, if two or more tasks have the same processing time, priority is given to the task with the lower ID.
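These tie-breaking rules make the SPT baseline deterministic. A minimal sketch of one assignment step under those rules follows; the function and argument names are illustrative, not from the paper:

```python
def spt_assign(free_workstations, pending_tasks, processing_time):
    """Assign pending tasks to free workstations, shortest processing time first.

    Ties on processing time are broken by task ID, and free workstations
    are served in ID order, per the stated tie-breaking rules.
    """
    # Sort tasks by (processing time, task ID): SPT with ID tie-break.
    order = sorted(pending_tasks, key=lambda j: (processing_time[j], j))
    assignment = {}
    for i, j in zip(sorted(free_workstations), order):
        assignment[i] = j
    return assignment

# Tasks 1 and 2 tie at processing time 2, so task 1 wins the tie and
# goes to the lower-ID free workstation.
a = spt_assign([1, 0], [2, 0, 1], {0: 3, 1: 2, 2: 2})
```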

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| ALBP | Assembly line balancing problem |
| DQN | Deep Q-network |
| DRL | Deep reinforcement learning |
| MDP | Markov decision process |
| ML | Machine learning |
| MPC | Model predictive control |
| NN | Neural network |
| RL | Reinforcement learning |
| SPT | Shortest processing time |

## References

- Gourisaria, M.K.; Agrawal, R.; Harshvardhan, G.; Pandey, M.; Rautaray, S.S. Application of Machine Learning in Industry 4.0. Mach. Learn. Theor. Found. Pract. Appl. Stud. Big Data
**2021**, 87, 57–87. [Google Scholar] [CrossRef] - Li, K.; Zhang, T.; Wang, R.; Wang, Y.; Han, Y.; Wang, L. Deep Reinforcement Learning for Combinatorial Optimization: Covering Salesman Problems. IEEE Trans. Cybern.
**2021**, 14. [Google Scholar] [CrossRef] [PubMed] - Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Keuper, J.; Pfreundt, F.-J. Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability. In Proceedings of the 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), Salt Lake City, UT, USA, 14 November 2016; pp. 19–26. [Google Scholar] [CrossRef] [Green Version]
- Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv.
**2019**, 52, 1–43. [Google Scholar] [CrossRef] - Scholl, A.; Becker, C. State-of-the-art exact and heuristic solution procedures for simple assembly line balancing. Eur. J. Oper. Res.
**2006**, 168, 666–693. [Google Scholar] [CrossRef] - Boysen, N.; Fliedner, M.; Scholl, A. A classification of assembly line balancing problems. Eur. J. Oper. Res.
**2007**, 183, 674–693. [Google Scholar] [CrossRef] - Sivasankaran, P.; Shahabudeen, P. Literature review of assembly line balancing problems. Int. J. Adv. Manuf. Technol.
**2014**, 73, 1665–1694. [Google Scholar] [CrossRef] - Kumar, N.; Mahto, D. Assembly Line Balancing: A Review of Developments and Trends in Approach to Industrial Application. Glob. J. Res. Eng. Ind. Eng.
**2013**, 13, 29–50. [Google Scholar] - Rudin, N.; Hoeller, D.; Reist, P.; Hutter, M. Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. arXiv
**2021**, arXiv:2109.11978. [Google Scholar] - SESAME Smart European Space Access thru Modern Exploitation of Data Science. Available online: https://cordis.europa.eu/project/id/821875 (accessed on 21 December 2021).
- Eghtesadifard, M.; Khalifeh, M.; Khorram, M. A systematic review of research themes and hot topics in assembly line balancing through the web of science within 1990–2017. Comput. Ind. Eng.
**2020**, 139. [Google Scholar] [CrossRef] - Tasan, S.O.; Tunal, S. A review of the current applications of genetic algorithms in assembly line balancing. J. Intell. Manuf.
**2008**, 19, 49–69. [Google Scholar] [CrossRef] - Bengio, Y.; Lodi, A.; Prouvost, A. Machine learning for combinatorial optimization: A methodological tour d’horizon. Eur. J. Oper. Res.
**2021**, 290, 405–421. [Google Scholar] [CrossRef] - Zweben, M.; Davis, E.; Daun, B.; Deale, M.J. Scheduling and rescheduling with iterative repair. IEEE Trans. Syst. Man Cybern.
**1993**, 23, 1588–1596. [Google Scholar] [CrossRef] - Zhang, W.; Dietterich, T.G. A reinforcement learning approach to job-shop scheduling. IJCAI
**1995**, 95, 1114–1120. [Google Scholar] - Tassel, P.; Gebser, M.; Schekotihin, K. A Reinforcement Learning Environment For Job-Shop Scheduling. arXiv
**2021**, arXiv:2104.03760. [Google Scholar] - OpenAI Gym. Available online: https://gym.openai.com/ (accessed on 30 September 2010).
- He, Y.; Wu, G.; Chen, Y.; Pedrycz, W. A Two-stage Framework and Reinforcement Learning-based Optimization Algorithms for Complex Scheduling Problems. arXiv
**2021**, arXiv:2103.05847. [Google Scholar] - Mondal, S.S.; Sheoran, N.; Mitra, S. Scheduling of Time-Varying Workloads Using Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 9000–9008. [Google Scholar]
- Zhou, L.; Zhang, L.; Horn, B.K. Deep reinforcement learning-based dynamic scheduling in smart manufacturing. Procedia CIRP
**2020**, 93, 383–388. [Google Scholar] [CrossRef] - Wang, L.; Hu, X.; Wang, Y.; Xu, S.; Ma, S.; Yang, K.; Liu, Z.; Wang, W. Dynamic job-shop scheduling in smart manufacturing using deep reinforcement learning. Comput. Netw.
**2021**, 190, 107969. [Google Scholar] [CrossRef] - Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv
**2015**, arXiv:1509.02971. [Google Scholar] - Liu, C.L.; Chang, C.C.; Tseng, C.J. Actor-critic deep reinforcement learning for solving job shop scheduling problems. IEEE Access
**2020**, 8, 71752–71762. [Google Scholar] [CrossRef] - Oren, J.; Ross, C.; Lefarov, M.; Richter, F.; Taitler, A.; Feldman, Z.; Di Castro, D.; Daniel, C. SOLO: Search Online, Learn Offline for Combinatorial Optimization Problems. In Proceedings of the International Symposium on Combinatorial Search, Guangzhou, China, 26–30 July 2021; Volume 12, pp. 97–105. [Google Scholar]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 1928–1937. [Google Scholar]
- Clemente, A.V.; Castejòn, H.N.; Chandra, A. Efficient Parallel Methods for Deep Reinforcement Learning. arXiv
**2017**, arXiv:1705.04862. [Google Scholar] - Macua, S.V.; Davies, I.; Tukiainen, A.; Munoz de Cote, E. Fully Distributed Actor-Critic Architecture for Multitask Deep Reinforcement Learning. arXiv
**2021**, arXiv:2110.12306v1. [Google Scholar] - Yu, L.; Sun, Y.; Xu, Z.; Shen, C.; Yue, D.; Jiang, T.; Guan, X. Multi-agent deep reinforcement learning for HVAC control in commercial buildings. IEEE Trans. Smart Grid
**2020**, 12, 407–419. [Google Scholar] [CrossRef] - Hanumaiah, V.; Genc, S. Distributed Multi-Agent Deep Reinforcement Learning Framework for Whole-building HVAC Control. arXiv
**2021**, arXiv:2110.13450v1. [Google Scholar] - Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv
**2013**, arXiv:1312.5602. [Google Scholar] - Adam Optimizer. Available online: https://keras.io/api/optimizers/adam/ (accessed on 21 December 2021).
- Liberati, F.; Tortorelli, A.; Mazquiaran, C.; Imran, M.; Panfili, M. Optimal Control of Industrial Assembly Lines. In Proceedings of the 2020 7th International Conference on Control, Decision and Information Technologies (CoDIT), Prague, Czech Republic, 29 June–2 July 2020; Volume 1, pp. 721–726. [Google Scholar]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tortorelli, A.; Imran, M.; Delli Priscoli, F.; Liberati, F.
A Parallel Deep Reinforcement Learning Framework for Controlling Industrial Assembly Lines. *Electronics* **2022**, *11*, 539.
https://doi.org/10.3390/electronics11040539
