# Dynamic Storage Location Assignment in Warehouses Using Deep Reinforcement Learning


## Abstract


## 1. Introduction

- The empirical proof of concept that a real-world DSLAP can be solved end-to-end using DRL.
- Practical design choices for solving the presented DSLAP with DRL.

## 2. Related Work

## 3. Use Case

#### 3.1. Warehouse Outline, Logic and Simulation

#### 3.2. Real-World Data

- A timestamp [YYYY-MM-DD hh-mm-ss] when the location assignment took place.
- The loaded good type identification number.
- The number of articles on the pallet.
- The date on which the pallet was first packed and entered the warehouse system.
- The type of storage location assignment (first entry or re-entry after partial retrieval of articles).
- The class (A, B or C) assigned to the pallet by human workers based on their subjective experience.
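For illustration, the six fields above can be gathered into a small record type. The field names below are our own illustrative choices, not identifiers from the warehouse system:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PalletRecord:
    """One storage location assignment record (field names are illustrative)."""
    assigned_at: datetime   # timestamp of the location assignment
    good_type_id: int       # loaded good type identification number
    num_articles: int       # number of articles on the pallet
    packed_on: datetime     # date the pallet was first packed and entered the system
    is_reentry: bool        # True if re-entry after partial retrieval of articles
    abc_class: str          # expert-assigned class: "A", "B" or "C"
```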

## 4. Reinforcement Learning Approach

#### 4.1. Introduction to Deep Reinforcement Learning

#### 4.2. Action Space and Interaction with the Simulation

#### 4.3. Observation Space

#### 4.4. Reward Design

#### 4.5. Learning Algorithm and Hyperparameters

## 5. Experimental Setup

#### 5.1. Train-Test Split

#### 5.2. Benchmarks

- RANDOM: The simplest benchmark, which samples actions (A, B or C) uniformly at random.
- Just-in-Order: This method follows the intuition that the cheapest zones should be filled first. As long as the capacity utilization of zone A is below 100%, pallets are assigned to zone A; once it is full, pallets are assigned to zone B, and so on.
- ABC: This method represents the currently running system in the warehouse. For this benchmark, we use those classes that were assigned by experts and executed in reality.
- DoS-Quantiles: This method is engineered from historic data and serves as the strongest baseline; it can only be constructed in retrospect. It is based on the duration of stay (DoS) of each good type. Two quantiles, q1 and q2, of the DoS distribution are defined. When the historic average DoS of the good type on a pallet is smaller than or equal to q1, the pallet is assigned to zone A; if it lies between q1 and q2, it is assigned to zone B; otherwise, it is assigned to zone C. In a preliminary grid search over quantile levels q1 ∈ {0.35, 0.40, 0.45, …, 0.95} and q2 ∈ {0.40, 0.45, 0.50, …, 1.00}, q1 = 0.70 and q2 = 0.90 achieved the best results on the whole dataset.
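As an illustration, the DoS-Quantiles rule can be sketched as follows. The function and variable names are our own, and the sketch assumes that the historic mean DoS per good type has already been computed:

```python
import numpy as np

def assign_zones(mean_dos_per_type, q1_level=0.70, q2_level=0.90):
    """Assign each good type to zone A, B, or C from its historic mean
    duration of stay (DoS), using two DoS quantiles as thresholds."""
    dos = np.asarray(list(mean_dos_per_type.values()), dtype=float)
    q1, q2 = np.quantile(dos, [q1_level, q2_level])
    zones = {}
    for good_type, d in mean_dos_per_type.items():
        if d <= q1:
            zones[good_type] = "A"   # short stays -> cheapest zone
        elif d <= q2:
            zones[good_type] = "B"
        else:
            zones[good_type] = "C"   # long stays -> most remote zone
    return zones
```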

## 6. Results

## 7. Discussion and Future Work

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 2.** Learning curve of the DRL agent displaying the average and standard deviation of the cumulative reward on each training episode over training progress in million steps.

| Hyperparameter | Value |
|---|---|
| alpha | 0.0001 |
| steps | 19500 |
| gamma | 0.99 |
| ent_coef | 0.00 |
| gae_lambda | 1 |
| vf_coef | 0.5 |
| n_epochs | 10 |
| batch_size | 256 |
| policy_kwargs: net_arch | [256, 256, 256] |
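Using the naming conventions of Stable-Baselines3, the tabulated settings could be collected as keyword arguments for the PPO constructor. The mapping of table rows to argument names (e.g. alpha to learning_rate) is our assumption, not confirmed by the source:

```python
# Hypothetical mapping of the reported hyperparameters onto
# Stable-Baselines3 PPO keyword arguments (argument names assumed).
ppo_kwargs = dict(
    learning_rate=1e-4,    # alpha
    n_steps=19500,         # steps collected per policy update
    gamma=0.99,            # discount factor
    ent_coef=0.0,          # entropy bonus coefficient
    gae_lambda=1.0,        # GAE lambda (1.0 = plain Monte Carlo returns)
    vf_coef=0.5,           # value-function loss weight
    n_epochs=10,           # optimization epochs per update
    batch_size=256,        # minibatch size
    policy_kwargs=dict(net_arch=[256, 256, 256]),  # three hidden layers
)
```

These arguments would be passed as `PPO("MlpPolicy", env, **ppo_kwargs)` when building the agent.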

| Agent | Total Cost | Assignments (Total) | Assignments (A) | Assignments (B) | Assignments (C) | Mean DoS (A) | Mean DoS (B) | Mean DoS (C) |
|---|---|---|---|---|---|---|---|---|
| PPO (Ours) | 37.78 | 1088 | 214 | 647 | 227 | 2.16 | 4.12 | 9.12 |
| DoS-Quantile | 35.99 | 1088 | 257 | 621 | 210 | 2.58 | 3.83 | 16.12 |
| ABC | 40.34 | 1088 | 262 | 561 | 265 | 2.25 | 4.45 | 7.54 |
| RANDOM | 45.43 | 1088 | 369 | 377 | 342 | 4.41 | 4.95 | 4.29 |
| Just-in-Order | 45.70 | 1088 | 110 | 665 | 313 | 4.87 | 3.87 | 6.08 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Waubert de Puiseau, C.; Nanfack, D.T.; Tercan, H.; Löbbert-Plattfaut, J.; Meisen, T.
Dynamic Storage Location Assignment in Warehouses Using Deep Reinforcement Learning. *Technologies* **2022**, *10*, 129.
https://doi.org/10.3390/technologies10060129
