# Sortation Control Using Multi-Agent Deep Reinforcement Learning in N-Grid Sortation System


## Abstract


## 1. Introduction

- We propose the design of a compact and efficient sortation system called the n-grid sortation system.
- We present a cooperative multi-agent RL algorithm to control the behavior of each module in the system.
- We describe how the RL agents learn together to optimize the performance of the n-grid sortation system.

## 2. Related Work

#### 2.1. Sortation Systems

#### 2.2. Existing Optimization Work for the Sortation Task

#### 2.3. Deep Reinforcement Learning

## 3. Design of the $\mathit{N}$-Grid Sortation System

- 1∼4n emitters through which parcels are fed into the sortation system,
- $n\times n$ sorters by which the incoming parcels are routed (or diverted) to their specific destination, and
- 1∼4n removers through which parcels are unloaded from the sortation system.

- Optimal routing: sorters should deliver parcels to their specific destination as quickly as possible.
- Congestion control: emitters should control the number of incoming parcels to allow the parcels to be processed and transferred without congestion by the sorters in the system.
- Collision resolution: a system-wide agent should resolve a collision caused by the actions of several routing or emission agents.
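To make the component counts concrete, here is a minimal configuration sketch. The class name and the 6-emitter/6-remover defaults are hypothetical; the paper's three-grid instance fixes $n=3$, giving $3\times 3 = 9$ sorters, with between 1 and $4n$ emitters and removers on the border.

```python
from dataclasses import dataclass

@dataclass
class GridSortationConfig:
    """Component counts for an n-grid sortation system.

    Emitters and removers sit on the border, so each may number
    between 1 and 4n; the defaults below are a hypothetical
    reading of the paper's three-grid instance.
    """
    n: int = 3
    num_emitters: int = 6
    num_removers: int = 6

    @property
    def num_sorters(self) -> int:
        # The n x n interior cells are all sorters.
        return self.n * self.n

    def validate(self) -> None:
        assert 1 <= self.num_emitters <= 4 * self.n
        assert 1 <= self.num_removers <= 4 * self.n

cfg = GridSortationConfig()
cfg.validate()
print(cfg.num_sorters)  # 9 sorters for the three-grid instance
```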

## 4. Design of the Cooperative Multi-Agent RL Algorithm

#### 4.1. Routing Agents

#### 4.2. Emission Agents

#### 4.3. Collision Resolver and Penalty Generator

- The action of moving the parcel closer to the remover is selected first,
- The sorter’s action takes precedence over the emitter’s action, and
- If two or more actions have the same priority, one action is randomly selected.
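The three rules above can be sketched as a small resolver that picks one winner among conflicting actions targeting the same cell. The names `ProposedAction` and `resolve_collision` are hypothetical; in the paper this logic lives in the system-wide component Ω of Algorithm 1.

```python
import random
from typing import NamedTuple

class ProposedAction(NamedTuple):
    agent_id: str
    is_sorter: bool           # sorter actions outrank emitter actions
    distance_to_remover: int  # smaller = the parcel ends up closer to a remover
    target_cell: tuple        # grid cell the action would move a parcel into

def resolve_collision(proposals, rng=random):
    """Select one winner among actions targeting the same cell.

    Priority, per Section 4.3: (1) the action moving a parcel closer
    to a remover, (2) a sorter's action over an emitter's, and
    (3) a random choice among remaining ties. The losers would be
    altered to "Stop" by the caller.
    """
    # Lower tuple sorts first: smaller distance, then sorter (False < True).
    best_key = min((p.distance_to_remover, not p.is_sorter) for p in proposals)
    candidates = [p for p in proposals
                  if (p.distance_to_remover, not p.is_sorter) == best_key]
    winner = rng.choice(candidates)
    losers = [p for p in proposals if p is not winner]
    return winner, losers
```

For example, when a sorter and an emitter contend for the same cell at equal distance, rule (2) makes the sorter win without invoking the random tie-break.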

#### 4.4. Cooperative Multi-Agent RL Algorithm

1. The routing agents and the emission agents initialize their experience replay memories, action-value functions ($Q^{s_i}$ and $Q^{e_j}$), and target action-value functions ($\widehat{Q}^{s_i}$ and $\widehat{Q}^{e_j}$) (see Lines 1∼8 of Algorithm 1).
2. An episode begins with a reset of the target system. At the reset (that is, time step $t=1$), each emitter holds one parcel of a random type, and no parcel is on any sorter or remover. The parcel information on the grid system is set to $\Phi_t$ (see Lines 10∼11 of Algorithm 1 and ① in Figure 3).
3. Based on the states configured with $\Phi_t$ and each routing agent's location, the routing agents select their actions according to the $\epsilon$-greedy policy. If there is no parcel on a routing agent, its action is zero, indicating "Stop" (see Lines 13∼22 of Algorithm 1 and ② in Figure 3).
4. Based on the states configured with $\Phi_t$ and the actions selected by the nine routing agents, the emission agents select their actions according to the $\epsilon$-greedy policy (see Lines 23∼29 of Algorithm 1 and ③ in Figure 3).
5. The selected actions are delivered to the collision resolver, which alters them according to the rules presented in Section 4.3 (see Lines 30∼31 of Algorithm 1 and ④ in Figure 3).
6. The altered actions at time step $t$ are finally performed on the target system, and the rewards at time step $t+1$ are generated from it (see Lines 32∼33 of Algorithm 1 and ⑤ in Figure 3).
7. The rewards are also delivered to the penalty generator (co-located with the collision resolver), which generates the penalized rewards (see Lines 34∼35 of Algorithm 1 and ⑦ in Figure 3). The penalized rewards are delivered to the routing and emission agents, not directly at the current time step $t$ but at the optimization phase of the state-action value functions.
8. The states, actions, altered actions, penalized rewards, parcel information, and episode-ending information are stored in the corresponding replay memory as transition information (see Lines 36∼41 of Algorithm 1). The transition information, including the penalized rewards, is used when the state-action value functions are optimized. Since the actions selected by the routing agents are required for the optimization of the emission agents, $A_t^S$ is configured and put into the transitions for the emission agents.
9. Model weights are copied to the target models every $\tau$ episodes (see Line 43 of Algorithm 1).
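Step (3) above, the $\epsilon$-greedy selection with a forced "Stop" when a sorter is empty, can be sketched as follows (the function and argument names are hypothetical):

```python
import random

STOP = 0  # action index for "Stop"

def select_routing_action(q_values, has_parcel, epsilon, rng=random):
    """Epsilon-greedy action selection for one routing agent.

    If no parcel sits on the sorter, the action is forced to 0
    ("Stop"), as in Lines 13-22 of Algorithm 1. q_values holds the
    agent's state-action values Q^{s_i}(S_t, a), one per action a.
    """
    if not has_parcel:
        return STOP
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# With epsilon = 0 the agent always exploits:
print(select_routing_action([0.1, 0.9, 0.3], True, 0.0))   # 1
print(select_routing_action([0.1, 0.9, 0.3], False, 0.0))  # 0 (Stop)
```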

- L: the total number of parcels arriving at a remover (i.e., the number of parcels classified) and
- M: the number of parcel moves by any one sorter.

Algorithm 1: The proposed multi-agent RL algorithm (Γ: target system, Ω: collision resolver and penalty generator, ϵ ∈ [0, 1]: exploration threshold).

1. A routing agent $s_i$ samples a random mini-batch of $K$ transitions from its replay memory $D^{s_i}$ (see Line 2 of Algorithm 2).
2. For each transition, the routing agent must obtain the next routing agent or remover $\eta$ to which its action is bound, because the target value for the loss equation at time step $t$ comes from $\eta$ (see Line 4 of Algorithm 2).
3. The action moving a parcel to $\eta$ was selected by $s_i$ according to its policy (i.e., $\epsilon$-greedy on $Q^{s_i}$) at the current time step $t$. At time step $t+1$, after one step progresses, the moving parcel passes to $\eta$. Therefore, the target value for the loss equation should be provided by $\eta$ (see Line 9 of Algorithm 2). We call this strategy the deep chain Q-network (DCQN). It is similar to Q-routing [32], a well-known RL algorithm in computer networking, in which each node selects the adjacent node to which it forwards a packet so that the packet reaches its final destination as quickly as possible.
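A minimal sketch of the DCQN target computation, assuming $\eta$'s target network can be evaluated on the next state and that removers, being terminal for a parcel, contribute no bootstrap term (all names here are hypothetical):

```python
def dcqn_target(reward, done, q_next_agent, next_state, gamma=0.99):
    """Target value for routing agent s_i under the DCQN strategy.

    Unlike vanilla DQN, which bootstraps from the same agent's
    target network, the bootstrap term comes from the *next* agent
    eta (the sorter the parcel moves onto), mirroring Q-routing:

        y = r                                    if done or eta is a remover
        y = r + gamma * max_a Q^eta(s', a)       otherwise

    q_next_agent is eta's target network (None when eta is a remover).
    """
    if done or q_next_agent is None:
        return reward
    return reward + gamma * max(q_next_agent(next_state))

# Toy check with a stub target network that ignores its input state:
q_eta = lambda s: [0.0, 2.0, 1.0]
print(dcqn_target(1.0, False, q_eta, None))  # 1.0 + 0.99 * 2.0 = 2.98
```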

Algorithm 2: $Q^{s_i}$ optimization for routing agent $s_i$ ($i = 1, \cdots, 9$).

1. The transitions in a mini-batch are sampled sequentially (see Line 2 of Algorithm 3).
2. The target value for the loss equation at time step $t$ is calculated from (1) the reward at the next time step $t+1$ and (2) the maximum state-action value of the emission agent at the next time step $t+1$ (see Lines 7∼8 of Algorithm 3). Because $S_{k+1}^{e_j}$ must be configured with the actions of the routing agents at time step $t+1$, the transitions in a mini-batch must be sampled sequentially.
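One plausible reading of "sampled sequentially" is to draw a consecutive window from the replay memory, so that each transition's successor at $t+1$ (carrying the routing actions needed for $S_{k+1}^{e_j}$) is available. This is a hedged sketch, not the paper's exact procedure; the function name is hypothetical.

```python
import random

def sample_sequential_minibatch(replay, k, rng=random):
    """Draw K consecutive transitions plus one successor.

    Standard experience replay samples transitions independently;
    here the emission agent needs, for every sampled transition, the
    transition that follows it, so a contiguous window is drawn.
    """
    if len(replay) < k + 1:
        raise ValueError("replay memory too small for a mini-batch of size k")
    start = rng.randrange(len(replay) - k)
    # k transitions for the loss, plus the successor of the last one.
    return replay[start:start + k + 1]
```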

Algorithm 3: $Q^{e_j}$ optimization for emission agent $e_j$ ($j = 1, \cdots, 6$).

## 5. Performance Evaluation

#### 5.1. Training Details and Performance Measure

#### 5.2. Results

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Benzi, F.; Bassi, E.; Marabelli, F.; Belloni, N.; Lombardi, M. IIoT-based Motion Control Efficiency in Automated Warehouses. In Proceedings of the AEIT International Annual Conference (AEIT), Florence, Italy, 18–20 September 2019; pp. 1–6.
- Kirks, T.; Jost, J.; Uhlott, T.; Jakobs, M. Towards Complex Adaptive Control Systems for Human-Robot-Interaction in Intralogistics. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2968–2973.
- Harrison, R. Dynamically Integrating Manufacturing Automation with Logistics. In Proceedings of the 24th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Zaragoza, Spain, 10–13 September 2019; pp. 21–22.
- Seibold, Z.; Stoll, T.; Furmans, K. Layout-optimized Sorting of Goods with Decentralized Controlled Conveying Modules. In Proceedings of the 2013 IEEE International Systems Conference (SysCon), Orlando, FL, USA, 15–18 April 2013; pp. 628–633.
- Jayaraman, A.; Narayanaswamy, R.; Gunal, A.K. A Sortation System Model. Winter Simul. Conf. Proc. **1997**, 1, 866–871.
- Beyer, T.; Jazdi, N.; Gohner, P.; Yousefifar, R. Knowledge-based Planning and Adaptation of Industrial Automation Systems. In Proceedings of the 2015 IEEE 20th Conference on Emerging Technologies and Factory Automation (ETFA), Luxembourg, 8–11 September 2015; pp. 1–4.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
- Hasselt, H.v.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level Control through Deep Reinforcement Learning. Nature **2015**, 518, 529–533.
- Rodriguez-Ramos, A.; Sampedro, C.; Bavle, H.; de la Puente, P.; Campoy, P. A Deep Reinforcement Learning Strategy for UAV Autonomous Landing on a Moving Platform. J. Intell. Robot. Syst. **2019**, 93, 351–366.
- Kim, J.; Lim, H.; Kim, C.; Kim, M.; Hong, Y.; Han, Y. Imitation Reinforcement Learning-Based Remote Rotary Inverted Pendulum Control in OpenFlow Network. IEEE Access **2019**, 7, 36682–36690.
- Xiao, D.; Tan, A. Scaling Up Multi-agent Reinforcement Learning in Complex Domains. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Sydney, NSW, Australia, 9–12 December 2008; Volume 2, pp. 326–329.
- Busoniu, L.; Babuska, R.; De Schutter, B. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Trans. Syst. Man Cybern. Part C **2008**, 38, 156–172.
- Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate with Deep Multi-agent Reinforcement Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2145–2153.
- Omidshafiei, S.; Pazis, J.; Amato, C.; How, J.P.; Vian, J. Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability. In Proceedings of the 34th International Conference on Machine Learning; Precup, D., Teh, Y.W., Eds.; PMLR: Sydney, Australia, 2017; Volume 70, pp. 2681–2690.
- Palmer, G.; Tuyls, K.; Bloembergen, D.; Savani, R. Lenient Multi-Agent Deep Reinforcement Learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Richland, SC, USA, 10–15 July 2018; pp. 443–451.
- Anonymous. Emergent Tool Use from Multi-Agent Autocurricula. Submitted to the International Conference on Learning Representations, **2020**; under review.
- Wang, S.; Wan, J.; Zhang, D.; Li, D.; Zhang, C. Towards Smart Factory for Industry 4.0: A Self-organized Multi-agent System with Big Data Based Feedback and Coordination. Comput. Netw. **2016**, 101.
- Rosendahl, R.; Cala, A.; Kirchheim, K.; Luder, A.; D'Agostino, N. Towards Smart Factory: Multi-Agent Integration on Industrial Standards for Service-oriented Communication and Semantic Data Exchange. 2018.
- Johnson, M.E.; Meller, R.D. Performance Analysis of Split-Case Sorting Systems. Manuf. Serv. Oper. Manag. **2002**, 4, 258–274.
- Pan, F.b. Simulation Design of Express Sorting System - Example of SF's Sorting Center. Open Cybern. Syst. J. **2014**, 8, 1116–1122.
- Gebhardt GridSorter - Decentralized Plug&Play Sorter & Sequenzer. Available online: https://www.gebhardt-foerdertechnik.de/en/products/sorting-technology/gridsorter/ (accessed on 7 December 2019).
- Factoryio Features. Available online: https://factoryio.com/features (accessed on 7 December 2019).
- Lem, H.J.; Mahwah, N. Conveyor Sortation System. 914155. Available online: https://patentimages.storage.googleapis.com/2b/2f/56/6eaeeaeb32b18d/US4249661.pdf (accessed on 7 December 2019).
- Sonderman, D. An Analytical Model for Recirculating Conveyors with Stochastic Inputs and Outputs. Int. J. Prod. Res. **1982**, 20, 591–605.
- Bastani, A.S. Analytical Solution of Closed-loop Conveyor Systems with Discrete and Deterministic Material Flow. Eur. J. Oper. Res. **1988**, 35, 187–192.
- Chen, J.C.; Huang, C.; Chen, T.; Lee, Y. Solving a Sortation Conveyor Layout Design Problem with Simulation-optimization Approach. In Proceedings of the 2019 IEEE 6th International Conference on Industrial Engineering and Applications (ICIEA), Tokyo, Japan, 12–15 April 2019; pp. 551–555.
- Westbrink, F.; Sivanandan, R.; Schütte, T.; Schwung, A. Design Approach and Simulation of a Peristaltic Sortation Machine. In Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Hong Kong, China, 8–12 July 2019; pp. 1127–1132.
- Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A Brief Survey of Deep Reinforcement Learning. arXiv **2017**.
- Watkins, C.J.C.H.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. **1992**, 8, 279–292.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602.
- Boyan, J.A.; Littman, M.L. Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach. In Advances in Neural Information Processing Systems 6; Cowan, J.D., Tesauro, G., Alspector, J., Eds.; The MIT Press: Cambridge, MA, USA, 1994; pp. 671–678.

**Figure 1.** A three-grid instance of the proposed n-grid sortation system. (**a**) Main components of the three-grid sortation system; (**b**) raw state representation of the three-grid sortation system.

**Figure 3.** Behavior inference procedure of the cooperative multi-agent RL in the three-grid sortation system.

**Figure 5.** The change of the average emission rate ($E_e$), average correct classification rate ($C_e$), average wrong classification rate ($W_e$), and sorting performance index ($SPI_e$). We conducted seven experimental runs. The lines plot the average values of the four measures, and the shaded areas indicate the maximum and minimum values of the same measures.

**Figure 6.** The change of the number of parcel collisions and of the total collision penalty imposed on the emission agents. We conducted seven experimental runs. The graphs show the average, maximum, and minimum of the two measures, as in Figure 5.

**Figure 7.** Simulated screen captures of the three-grid sortation system controlled by the saved models, showing emergent behaviors of the agents. In the first sequence of two images at the top, the routing and emission agents appear to consider the locations of parcels placed on other agents and increase throughput with unobtrusive behavior. In the second sequence of two images at the bottom, many parcels of the same type sit on the routing and emission agents while only two removers accept that parcel type; in this situation, the agents appear to route so as to reduce congestion.

**Table 1.**The components of rewards at time step $t+1$ caused by the actions performed by the routing agents and the emission agents at time step t.

Symbol | Type | Coefficient | Possible Values |
---|---|---|---|
$u_{t+1}^{s_i}$ | routing | $\mu_u = -0.1$ | 0 or 1 |
$c_{t+1}$ | correct classification | $\mu_c = 1$ | $0, 1, \cdots, 6$ |
$w_{t+1}$ | wrong classification | $\mu_w = -1$ | $0, 1, \cdots, 6$ |
$p_{t+1}^{s_i}$ | collision penalty | $\mu_p = -0.1$ | 0 or 1 |
$in_{t+1}^{e_j}$ | emission | $\mu_{in} = 0.1$ | 0 or 1 |
$balance_{t+1}$ | in and out balance | $\mu_{balance} = -1$ | $0, 1, \cdots, 6$ |
$p_{t+1}^{e_j}$ | collision penalty | $\mu_p = -0.1$ | 0 or 1 |
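Reading Table 1 as a linear combination of the reward components with their coefficients (a hedged interpretation for illustration, not the paper's exact reward equation; the function names are hypothetical), the rewards could be computed as:

```python
# Coefficients from Table 1 (time step t+1).
MU = {"u": -0.1, "c": 1.0, "w": -1.0, "p": -0.1,
      "in": 0.1, "balance": -1.0}

def routing_reward(u, c, w, p):
    """Reward for routing agent s_i: move cost (u), correct (c) and
    wrong (w) classification counts, and collision penalty (p)."""
    return MU["u"] * u + MU["c"] * c + MU["w"] * w + MU["p"] * p

def emission_reward(in_, balance, p):
    """Reward for emission agent e_j: emission bonus (in_), in/out
    imbalance penalty (balance), and collision penalty (p)."""
    return MU["in"] * in_ + MU["balance"] * balance + MU["p"] * p

# One move (u=1) that yields one correct classification (c=1):
print(routing_reward(1, 1, 0, 0))  # -0.1 + 1.0 = 0.9
```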

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kim, J.-B.; Choi, H.-B.; Hwang, G.-Y.; Kim, K.; Hong, Y.-G.; Han, Y.-H. Sortation Control Using Multi-Agent Deep Reinforcement Learning in *N*-Grid Sortation System. *Sensors* **2020**, *20*, 3401.
https://doi.org/10.3390/s20123401
