1. Introduction
Order picking is one of the most resource-intensive warehouse operations, accounting for a substantial share of operational costs in traditional and semi-automated facilities [1]. In its basic form, the order picking routing problem can be modeled as a Traveling Salesman Problem (TSP), which is NP-hard. In practice, however, warehouse systems introduce additional complexities such as order batching, multiple pickers, congestion effects, and integrated assignment and scheduling decisions. A wide range of heuristics and optimization approaches have been proposed for warehouse routing and related problems, often relying on mathematical programming and decomposition techniques to manage congestion, workload balancing, and fleet coordination for robotic warehouses [2,3,4]. In addition, the literature on dynamic vehicle routing problems (DVRPs) addresses routing decisions that must adapt to newly arriving requests or changing system conditions [5], which is relevant in a warehouse setting. Some recent DVRP research explores anticipatory decision-making using approximate dynamic programming and hybrid offline–online approaches to account for future uncertainty and system dynamics [6].
The TSP is typically formulated as a graph-based problem. Graphs can be represented in two common ways: (1) nodes correspond to coordinates in a geometric space, or (2) nodes are defined by pairwise distances in a complete weighted graph. Warehouse layouts often impose grid-like constraints (orthogonal aisles and cross-aisles) [1], where distances follow the Manhattan metric rather than direct Euclidean distances. To model these settings, warehouse graphs are usually constructed with intersections and storage locations as nodes, with edges representing feasible traversal paths. These graphs are then reduced to, or constructed only from, the storage locations on the order path, with distances computed via shortest paths through the layout [7].
Recent advances in Deep Reinforcement Learning (DRL) have shown promising results for combinatorial optimization tasks such as the TSP by training agents to take actions based on environment states and rewards [8]. Although a few studies have explored DRL in warehouse optimization, this line of work remains limited [9,10,11]. Using DRL in warehouse environments could enable agents to make dynamic operational decisions, such as route selection, order batching, and item allocation.
A promising research direction also lies in graph representation learning, which aims to learn vector embeddings of graphs, nodes, or edges [12]. Since warehouses can be represented as graphs, Graph Neural Networks (GNNs) [13] can offer a powerful way to extract embeddings for storage locations and other entities. These embeddings could capture structural information about warehouse layouts and provide richer inputs for decision-making tasks compared to hand-crafted features. For example, item allocation policies might be improved by leveraging learned embeddings that encode layout-specific characteristics. The representation learning capabilities of the proposed approach enable the model to capture structural relationships within the warehouse graph, which can help the policy better reflect operational constraints. In addition, the learned node embeddings provide a compact representation of node roles within the graph, offering insight into the model’s decision process and improving the interpretability of the learning-based approach.
Importantly, the purpose of learning embeddings in this work is not limited to solving a single small TSP instance. Instead, embeddings allow the model to learn structural regularities of warehouse layouts and support policy learning rather than solving each routing instance independently from scratch. This method also provides a foundation for extending the approach toward more complex operational settings, including multiple pickers.
Several studies demonstrate that exploiting structural properties of routing instances can significantly improve performance. For example, clustering-based decomposition methods partition nodes into groups and construct local routes before integrating them into a global solution, thereby improving heuristic efficiency for TSP variants [14]. In warehouse environments, routing problems are often modeled as hierarchical extensions of the TSP that capture additional operational structure [15]. These approaches highlight the importance of identifying and leveraging latent structural properties of the underlying graph.
Graph neural networks address this challenge by learning continuous vector embeddings that encode structural position, connectivity patterns, and relational context within the graph. In contrast to predefined clustering strategies, embeddings provide a structural abstraction that allows the model to capture latent groupings and dependencies without explicit manual decomposition. Such an approach would be essential for developing scalable policies capable of generalizing to realistic warehouse settings involving multiple orders, alternative storage locations, and dynamic operational constraints without decomposition.
In this work, we propose an integrated GNN-based deep reinforcement learning framework for warehouse optimization that jointly addresses order picking path optimization and warehouse layout representation learning. Specifically, we aim to demonstrate the feasibility of training GNN models to produce near optimal order picking routes and learn embeddings of warehouse storage locations. Currently we only focus on optimizing order picking path, but we envision extending this approach to other problems by including orders and items as graph nodes and defining new reward functions, thereby laying the groundwork for a foundational GNN–DRL model that can be fine-tuned for diverse warehouse optimization tasks.
The present study deliberately focuses on a simplified single-picker routing scenario to evaluate the proposed GNN–DRL method in a controlled and analytically transparent setting. This formulation enables direct comparison with well-established TSP solution methods and allows us to isolate and assess the capabilities of the model without additional confounding operational complexities.
The objective of this work is not to replace exact TSP solvers in static environments, where optimal or near-optimal solutions can be obtained efficiently, but rather to establish a generalizable methodology for dynamic graph-structured warehouse optimization. By demonstrating that GNNs can learn meaningful structural embeddings while producing competitive routing policies, we lay the groundwork for extending the approach to more realistic and dynamic settings.
In particular, the proposed method is designed to accommodate future extensions involving multiple pickers, order batching, dynamically arriving orders, congestion effects, and jointly optimized warehouse decisions. The current formulation therefore represents a foundational step toward a holistic approach to optimization in complex warehouse systems. To illustrate the potential of the framework beyond the single-picker setting, we additionally conducted a small-scale experiment involving multiple pickers.
DRL applications that integrate GNNs in the domain of internal logistics remain scarce [16,17,18], and to the best of our knowledge, no existing study explicitly focuses on representation learning, leaving significant unexplored potential for advancing real-time, data-driven warehouse decision making.
Accordingly, this paper addresses the following research question: can graph neural networks learn meaningful embeddings of warehouse layouts that capture their structural properties while producing near-optimal order-picking paths, and can such models generalize across different warehouse configurations?
The main contributions of this paper are threefold:
- (i) We formulate the warehouse order-picking problem as a graph-based deep reinforcement learning task using graph neural networks.
- (ii) We demonstrate that the proposed GNN–DRL approach produces routing solutions with low optimality gaps in simulated warehouse environments, with increasing advantages for larger and more complex orders, and can generalize across different warehouse configurations with appropriate fine-tuning.
- (iii) We show that the learned node embeddings capture meaningful structural properties of warehouse layouts.
We evaluate the proposed approach in a simulated warehouse environment and discuss its applicability to a broader class of warehouse optimization problems. Overall, the presented GNN–DRL framework provides a unified methodology for order-picking optimization and warehouse representation learning, and establishes a flexible foundation for advanced, real-time, data-driven warehouse decision-making systems.
Section 2 describes the warehouse graph construction and the GNN–DRL methodology.
Section 3 presents experimental results, embedding analysis and a smaller preliminary multi-picker experiment.
Section 4 discusses implications, limitations, and future research directions.
2. Materials and Methods
In this study, we used synthetic warehouse layouts modeled as graphs and transformed them into inputs suitable for Graph Neural Networks (GNNs). The GNN was trained using Deep Reinforcement Learning (DRL) to address the order-picking problem, with the additional objective of learning graph representations that capture warehouse structure and support decision-making. Our methodology follows a multi-step pipeline consisting of the following steps:
Graph Construction from Relational Data: The warehouse layout is modeled as a graph derived from relational database representations of the storage locations and aisles of a synthetic warehouse layout. We define nodes corresponding to intersections and storage locations, and edges corresponding to feasible traversal paths. We employ both the PyTorch Geometric (version 2.7.0) [19,20,21] and NetworkX (version 3.6.1) [22] Python (version 3.13.11) libraries to construct graphs. We use PyTorch Geometric graphs for GNN training and NetworkX for computing the shortest distances used in reward functions and for constructing distance matrices. PyTorch Geometric graphs are processed to conform to the input requirements of GNN training.
Graph Verification: To verify the correctness of the constructed graphs, we manually checked the graph representation against the warehouse layout by comparing the distances between connected nodes in the graph with the corresponding distances in a tiny warehouse layout. This check was performed for all edges near intersections, where graph-construction errors are most likely to occur, and for a random sample of edges inside the aisles. This ensured that aisle, cross-aisle, and entrance connections were correctly represented. We also computed and visualized a small set of optimal picking paths using the python-tsp [23] implementation of dynamic programming [24] to check for visible errors in the graph. Together, these checks confirm that the graph representation accurately reflects warehouse layout constraints and distances.
Training GNN with DRL: We integrate GNNs with DRL to learn both embeddings of storage locations that capture layout structure and policy for generating optimized picking routes. The DRL agent produces actions based on warehouse graphs and receives rewards based on path efficiency.
Evaluation of order picking paths produced by GNN: We compare the performance of the GNN–DRL model against classical heuristics in terms of average travel distance. The algorithms used for comparison are the Christofides algorithm [25], Local Search [26], and a basic (ejection-chain) implementation of the Lin–Kernighan heuristic [27]. We used the python-tsp [23] implementations of these algorithms, except for Christofides, for which we used our own implementation.
Embedding Visualization and Evaluation: The learned embeddings from the GNN are visualized and evaluated to assess whether they capture meaningful structural properties of warehouse layouts.
All experiments are conducted on synthetic warehouse layouts of different sizes generated to simulate a realistic grid-like layout. GNN training is run on a consumer-grade GPU (RTX 5060 Ti) in a personal computer with an 8-core, 16-thread CPU and 64 GB of RAM, demonstrating the feasibility of our approach without requiring large-scale computational resources.
The warehouse layout is formalized as a weighted graph $G = (V, E, w)$, where the vertex (node) set $V$ represents storage locations, aisles, intersections, and entrances, and the edge weights $w$ correspond to physical travel distances (Figure 1). Distances are derived from typical Warehouse Management System (WMS) data or, when unavailable, estimated from the geometric layout [
28]. Edge weights (distances) between two adjacent storage locations are defined as the physical widths of the storage location closer to the entrance. In our experiments, all storage locations have the same width of 1 m. For example, in
Figure 1, the distance between storage location 0 and 3 is defined by the width of location 0 and is therefore 1 m. Non-ground locations are connected exclusively to the corresponding ground-level location with distance of 1 m. This allows locations at the same ground position to be treated consistently in the model. Horizontal storage locations are connected to neighboring ground-level storage nodes, with additional edges introduced between nodes on opposite sides of an aisle and weighted by the aisle width. Edges connecting aisle nodes to storage nodes are assigned distances of full aisle length. Intersection nodes are positioned along aisles and connected to the intersecting aisles, the nearest storage nodes, and to successive intersections within the same aisle, thereby forming continuous traversal paths. Both intersection and aisle nodes are removed in the final graph representation for GNN training, after which storage locations previously connected through intersections are linked directly. Further modifications required to prepare the graph as input for the GNN, including the selection of anchor nodes and the picker node, are described later in the text.
The smaller demo warehouse layout used for training contains 600 storage locations, arranged along 3 vertical aisles and 10 horizontal aisles with storage locations being assigned only to horizontal aisles, with two opposing storage walls per aisle of size x × y (x = 10, y = 3). In addition, the layout instance includes 15 intersections, modeled as graph nodes that connect vertical and horizontal aisles. For additional fine tuning and evaluation, we used larger warehouse with 1200 storage locations that has 10 additional horizontal aisles (two columns of 10 horizontal aisles versus two columns of 5 aisles for smaller warehouse).
Figure 2 illustrates the database schema employed to construct the experimental warehouse graph. The schema formalizes the warehouse layout through relational entities representing warehouses, aisles, intersections, storage locations, entrance nodes, and edges. These entities and their relationships are used to represent the corresponding graph structure, where nodes represent physical locations and edges encode navigable connections. For graph construction, distances between intersections located on the same aisle are approximated as 3 m. Note that the illustrated schema does not represent a real-world warehouse management system (WMS) database. Instead, it serves as a representation of the data required to generate the warehouse graph and to improve the reproducibility of the study.
Figure 3 depicts the graph-based representation of the experimental warehouse layout constructed using the Python NetworkX library (version 3.6.1). Nodes represent physical warehouse entities, including entrances, aisles, intersections, and storage locations, and are distinguished by color. Aisles 0, 6 and 12 are vertical aisles with no assigned locations. Edges denote feasible traversal paths between nodes. A similar graph structure as depicted in
Figure 1 and
Figure 3 serves as the input to the GNN, supporting the learning of node embeddings and the extraction of warehouse structural representations for downstream decision-making in order-picking tasks.
Figure 4 illustrates an example of an optimal order-picking route generated from a graph-based representation of the warehouse in a top-down view. The warehouse entrance (denoted by E) represents the starting and ending location of the picking tour. Storage locations with visible IDs and marked with green are ones on the picking path. The numbers displayed above the blue points denote the order in which locations are visited in the optimal picking sequence. Shortest-path distances between all pairs of storage locations are computed using Dijkstra’s algorithm [
29], while the overall optimal picking sequence is determined using dynamic programming. Distances between locations are defined as described in
Figure 1. Each square in
Figure 4 represents an area of
. For non-ground-level locations (
), an additional distance of 1 m must be considered, as illustrated in
Figure 1—although this is not visible in
Figure 4. Similarly, a side change within the same aisle adds 1 m.
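The two computations named above, Dijkstra shortest paths followed by an exact dynamic-programming tour, can be sketched in plain Python. This is a minimal illustration and not the implementation used in the study (which relies on NetworkX and python-tsp); the function names `dijkstra` and `optimal_tour_length`, the Held-Karp formulation, and the four-node toy layout are all assumptions made for the example.

```python
import heapq
from itertools import combinations


def dijkstra(adj, source):
    """Shortest-path distance from `source` to every reachable node.

    `adj` maps each node to a list of (neighbor, distance) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def optimal_tour_length(dm):
    """Held-Karp dynamic programming; the tour starts and ends at index 0."""
    n = len(dm)
    if n == 1:
        return 0.0
    # best[(mask, j)]: shortest path from 0 through the set `mask`, ending at j
    best = {(1 << j, j): dm[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev = mask ^ (1 << j)
                best[(mask, j)] = min(
                    best[(prev, k)] + dm[k][j] for k in subset if k != j
                )
    full = (1 << n) - 2  # every node except the entrance
    return min(best[(full, j)] + dm[j][0] for j in range(1, n))


# Hypothetical toy layout: entrance E and three storage locations on a loop.
edges = [("E", "A", 2.0), ("A", "B", 1.0), ("B", "C", 1.0), ("C", "E", 3.0)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))

nodes = ["E", "A", "B", "C"]
all_pairs = {u: dijkstra(adj, u) for u in nodes}
dm = [[all_pairs[u][v] for v in nodes] for u in nodes]
print(optimal_tour_length(dm))  # 7.0
```

The dynamic program grows exponentially with the number of pick locations, which is consistent with exact solutions being used here only for small orders.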
The diagram in
Figure 5 illustrates the process of constructing the initial graph structure from warehouse layout data. It begins with retrieving data from a database and then creating nodes representing storage locations, aisles, intersections, and the entrance. Edges are added between nodes to reflect connectivity, with distance attributes based on physical measurements (such as storage width or height, and aisle length or width).
Next, all-pairs shortest-path distances are computed using Dijkstra’s algorithm. For GNN processing, intersection and aisle nodes are subsequently removed, and storage nodes that were previously connected through these nodes are directly connected as shown in Figure 1. This graph simplification was observed to improve the quality of the storage-node embeddings, while no change in the average shortest-path distance was observed in early training stages compared with the original graph containing intersection and aisle nodes. We attribute this to the aggregation behavior of GNN layers, whereby storage nodes adjacent to intersections become more similar to one another when intersection nodes are present.
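One way to remove intersection and aisle nodes while keeping travel distances intact is vertex elimination: each removed node is replaced by direct edges between its neighbors, weighted by the length of the path through it. The sketch below is an assumption about how such a simplification could be done, not the procedure used in the study; the function name `eliminate_nodes` and the toy graph are hypothetical.

```python
def eliminate_nodes(adj, remove):
    """Remove nodes from an undirected weighted graph, bridging their neighbors.

    `adj` maps node -> {neighbor: distance}. The input is not modified.
    """
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}
    for x in remove:
        nbrs = adj.pop(x)
        for u in nbrs:
            del adj[u][x]
        items = list(nbrs.items())
        for i, (u, wu) in enumerate(items):
            for v, wv in items[i + 1:]:
                w = wu + wv  # length of the path u -> x -> v
                if w < adj[u].get(v, float("inf")):
                    adj[u][v] = w
                    adj[v][u] = w
    return adj


def add_edge(adj, u, v, w):
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w


# Hypothetical fragment: storage nodes S1..S3 and two intersections 3 m apart.
graph = {}
add_edge(graph, "S1", "I1", 1.0)
add_edge(graph, "S3", "I1", 1.0)
add_edge(graph, "I1", "I2", 3.0)
add_edge(graph, "I2", "S2", 1.0)

storage_only = eliminate_nodes(graph, ["I1", "I2"])
print(storage_only["S1"])
```

Eliminating nodes this way preserves shortest-path distances between the remaining storage nodes, at the cost of adding edges between every pair of former neighbors.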
Finally, a set of anchor nodes (e.g., the first, middle, and last node ID) is selected, and additional anchor-type edges are introduced to store shortest-path distances from each node to these anchor points as shown in
Figure 1. While it is not critical which specific nodes are selected as anchors, it is important that anchor nodes are not located next to each other and that they are well distributed across the warehouse layout. In our implementation, anchor nodes are selected as the first, middle, and last nodes according to the node ID ordering. These nodes provide a good spatial distribution across the warehouse, and the distance between the first and last node corresponds to the maximum possible distance between two storage locations in the graph. All distances are then normalized using minimum and maximum distances from anchor edges, resulting in a graph representation that is subsequently used for training and simulation.
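The anchor selection and normalization step can be illustrated with a short sketch. It assumes a standard min-max form, (d - min) / (max - min), taken over all anchor distances; the text does not state the exact formula, so that form, the function name `anchor_features`, and the five-node toy layout are assumptions.

```python
def anchor_features(dist, node_ids):
    """Return the anchor nodes and each node's normalized distances to them.

    `dist[u][v]` is the shortest-path distance between nodes u and v.
    """
    ids = sorted(node_ids)
    # First, middle, and last node by ID ordering, as described in the text.
    anchors = [ids[0], ids[len(ids) // 2], ids[-1]]
    raw = {n: [dist[n][a] for a in anchors] for n in ids}
    values = [d for row in raw.values() for d in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a degenerate single-node graph
    # Assumed min-max form: (d - min) / (max - min).
    return anchors, {n: [(d - lo) / span for d in row] for n, row in raw.items()}


# Hypothetical toy layout: five storage locations in a row, 1 m apart.
ids = range(5)
dist = {u: {v: float(abs(u - v)) for v in ids} for u in ids}
anchors, features = anchor_features(dist, ids)
print(anchors)      # [0, 2, 4]
print(features[1])  # [0.25, 0.25, 0.75]
```

Each node then carries a small coordinate-like vector that locates it relative to the anchors, without requiring explicit geometric coordinates.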
DRL is used to train the GNN. Unlike supervised learning, where a model is trained to reproduce predefined outputs or labels, DRL trains the model to maximize a reward through interaction with the environment. Training is organized into episodes. During each episode, the model selects actions by following some learned policy, and the resulting changes in the environment state are observed. At the end of the episode, rewards are calculated and used to update the model so that actions leading to higher rewards become more likely. This approach is well suited to problems for which the optimal solution is unknown or computationally expensive to obtain [
8].
In the proposed method, the GNN is trained to construct picking routes for a given order. At each decision step, the GNN selects the next action, which corresponds to the next storage node to be visited. A complete route is generated by repeatedly selecting actions from the learned policy and updating the environment state, including the picker position and the remaining locations to be visited, until all required locations have been visited. The diagram (
Figure 6) illustrates the workflow of a single episode during simulation and training of the GNN-based order-picking agent. Each episode begins with the generation of random orders, after which all subsequent steps are executed in parallel for the entire batch of orders. For each order, a copy of the initial warehouse graph is constructed, and additional features are assigned to storage nodes to indicate their role. Storage nodes along the current path are marked with a binary feature (0 or 1) indicating whether they are part of the active path. An additional picker node is introduced to represent the current location of the picker. This node is connected to the entrance node or storage node that the picker most recently visited (for example see
Figure 1). The GNN produces two outputs:
A score for each storage node, representing the likelihood that the node should be chosen as the next step in the path.
A graph-level score, obtained via mean pooling over all storage-node representations in the final GNN layer followed by a linear feed-forward layer, which gives the value estimate (an approximation of the total distance remaining on the order-picking path).
During action selection, non-candidate storage nodes (those not on the path) are masked out, and a softmax is applied over the candidate scores to generate probabilities. During training, the next action is sampled from this distribution, while during testing, the action with the maximum probability (argmax) is chosen. During training, we remove the last step of each order from the training dataset because only one candidate remains and the choice is forced.
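The masking and selection rule can be sketched without any deep-learning library. The snippet below is an illustration under assumptions: the function names `masked_softmax` and `select_action` and the example scores are hypothetical, and masking is implemented by setting non-candidate scores to negative infinity, which is one common way to do it.

```python
import math
import random


def masked_softmax(scores, is_candidate):
    """Softmax over candidate nodes only; masked nodes get probability 0."""
    if not any(is_candidate):
        raise ValueError("at least one candidate node is required")
    masked = [s if c else float("-inf") for s, c in zip(scores, is_candidate)]
    top = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def select_action(probs, training, rng=random):
    """Sample during training; take the most probable node during testing."""
    if training:
        return rng.choices(range(len(probs)), weights=probs)[0]
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical node scores; only nodes 0 and 2 are still on the picking path.
scores = [2.0, 1.0, 3.0, 0.5]
mask = [True, False, True, False]
probs = masked_softmax(scores, mask)
print(probs)
print(select_action(probs, training=False))  # 2
```

Sampling keeps exploration alive during training, while the argmax rule makes the evaluated routes deterministic.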
The resulting states, actions, value estimates, and rewards are stored, and the process is repeated until all required storage locations for each order have been visited. After completion of the picking tour, rewards are computed based on the total travel distance, including the return to the entrance, and normalized using the minimum and maximum values derived from anchor-based distances. Finally, the GNN is trained for several epochs using the collected episode data, after which the system proceeds to the next episode and repeats the cycle.
Finally, the trained model is evaluated on randomly generated test orders. The testing workflow is the same as the training workflow, except that the GNN training step is skipped because the model has already been trained. The trained model is therefore applied directly to previously unseen orders without need for retraining.
The GNN is trained using Proximal Policy Optimization (PPO) [30], a DRL method that improves on policy gradient methods [31]. The algorithm aims to maximize the advantage-weighted average of the log-probabilities of the taken actions. Actions with higher advantage are reinforced, increasing their probability in future decisions, whereas actions with lower or negative advantage become less likely to be chosen. This guides the policy toward selecting more effective actions and gradually improves overall performance. The policy is optimized by maximizing the policy gradient objective using gradient ascent, which is equivalent to minimizing the negative objective via gradient descent. Formally, the policy gradient loss is defined as:
$$L^{PG}(\theta) = -\hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$
where $\pi_\theta$ denotes the policy parameterized by $\theta$, representing the probability distribution over possible actions, $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ given the state $s_t$ at time step $t$, and $\hat{A}_t$ is the estimated advantage function, which measures how much better (or worse) an action is compared to the expected value under the current policy. $\hat{\mathbb{E}}_t$ represents the empirical expectation over time steps, approximated in practice by averaging over a batch.
PPO does not increase the raw probability of an action in isolation; it controls each update relative to the previous policy. To quantify this change, PPO defines a probability ratio between the new and old policies as
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
which measures how much more (or less) likely the current policy $\pi_\theta$ is to select action $a_t$ in state $s_t$ compared to the previous policy $\pi_{\theta_{\mathrm{old}}}$. The unclipped objective is then expressed as
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right]$$
In PPO, the clipping mechanism is used to prevent the policy from changing too much in a single update. While the policy ratio measures how much more (or less) likely the new policy is to select an action compared to the old policy, large deviations can destabilize learning. Clipping restricts the ratio to stay within a small range around 1, $[1-\epsilon,\ 1+\epsilon]$. If the ratio moves outside this range, the objective uses the clipped value instead. This ensures that actions with high advantage are still encouraged, but the update remains controlled, preventing overly aggressive policy changes that could harm performance. The clipped objective is defined as:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right]$$
where $\epsilon$ is the clipping parameter (typically set to 0.1–0.2).
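A small numeric sketch may help show how clipping acts on individual samples. It follows the standard PPO clipped surrogate from [30]; the function name `clipped_terms` and the example log-probabilities and advantages are hypothetical, and the default clipping value of 0.1 mirrors the setting reported later in this section.

```python
import math


def clipped_terms(new_logp, old_logp, advantages, eps=0.1):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return terms


def clipped_objective(new_logp, old_logp, advantages, eps=0.1):
    """Batch average of the per-sample terms (the quantity to maximize)."""
    terms = clipped_terms(new_logp, old_logp, advantages, eps)
    return sum(terms) / len(terms)


# Hypothetical samples with ratios of about 1.5, 0.5, and 1.0.
new_logp = [math.log(1.5), math.log(0.5), 0.0]
old_logp = [0.0, 0.0, 0.0]
advantages = [1.0, -1.0, 2.0]
print(clipped_terms(new_logp, old_logp, advantages))
print(clipped_objective(new_logp, old_logp, advantages))
```

In the first sample the ratio of 1.5 is capped at 1.1, so a good action gains no further credit once the policy has already moved far toward it; in the second, the ratio of 0.5 is raised to 0.9, so the penalty term is fixed at -0.9 and pushing the action's probability lower earns nothing more.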
The full PPO loss combines three terms: the negated clipped objective $L^{CLIP}$, the value loss $L^{VF}$ (the mean squared error between the predicted and target value estimates) multiplied by the value coefficient, and the entropy $S[\pi_\theta]$ of the action probability distribution multiplied by the entropy coefficient, which is subtracted. The full loss is defined as:
$$L(\theta) = -L^{CLIP}(\theta) + c_1\,L^{VF}(\theta) - c_2\,S[\pi_\theta]$$
where $c_1$ and $c_2$ are weighting coefficients for the value loss $L^{VF}$ and the entropy $S[\pi_\theta]$. Increasing the entropy of the policy helps promote exploration during training. A higher-entropy policy corresponds to a more diverse or less deterministic action distribution.
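The combination of the three terms can be checked with a few lines of arithmetic. This sketch assumes the sign convention in which the clipped objective is negated so that the total is minimized; the function names and the example inputs are hypothetical, while the default coefficients 0.5 and 0.001 are the values reported in this section.

```python
import math


def entropy(probs):
    """Shannon entropy of one action distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_loss(clip_objective, values, targets, action_probs, c1=0.5, c2=0.001):
    """Total loss: -L_CLIP + c1 * value MSE - c2 * mean policy entropy."""
    value_loss = sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)
    mean_entropy = sum(entropy(p) for p in action_probs) / len(action_probs)
    return -clip_objective + c1 * value_loss - c2 * mean_entropy


# Hypothetical batch of two decision steps.
loss = ppo_loss(
    clip_objective=0.2,
    values=[0.5, 1.0],
    targets=[1.0, 1.0],
    action_probs=[[0.5, 0.5], [1.0, 0.0]],
)
print(loss)
```

With an entropy coefficient as small as 0.001, the entropy bonus only nudges the policy toward exploration and is dominated by the other two terms.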
To improve training stability, methods such as Generalized Advantage Estimation (GAE) [32] are used. Without GAE, when the full sequence of rewards for an episode is available, the value target for the critic can be computed as the discounted sum of all future rewards. In our setting, the reward corresponds to the negative distance to the next location, and at the final time step of an episode (when the agent reaches the last location in the path), we add the distance back to the entrance. All distances are normalized using the minimum and maximum values derived from anchor distances. Formally, the target value estimate $V_t^{\mathrm{target}}$ is defined as:
$$V_t^{\mathrm{target}} = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}$$
where $r_t$ is the reward at time step $t$, $T$ is the final time step of the episode, and $\gamma$ is the discount factor. The corresponding advantage can then be expressed as:
$$\hat{A}_t = V_t^{\mathrm{target}} - V(s_t)$$
where $V(s_t)$ denotes the value function prediction for state $s_t$ and $V_t^{\mathrm{target}}$ is the target value estimate.
GAE improves on this by using a smoothed combination of temporal-difference errors, denoted as $\delta_t$, across multiple time steps. The temporal-difference error at time step $t$ is defined as:
$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t)$$
where $r_t$ is the immediate reward, $V(s_t)$ and $V(s_{t+1})$ are the value function estimates for the current and next state, respectively, and $\gamma$ is the discount factor. The parameter $\lambda$ is subsequently used in GAE to control the exponential weighting of $\delta_t$ over time. The advantage estimate $\hat{A}_t$ is then calculated as the sum of $\delta_t$ and the discounted advantage of the next step, so that $\hat{A}_t$ satisfies the following recurrence relation:
$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$$
where $\lambda$ is the GAE smoothing parameter. The value target $V_t^{\mathrm{target}}$ is obtained by adding the advantage estimate $\hat{A}_t$ to the baseline value function estimate $V(s_t)$.
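The GAE recurrence is naturally computed backward from the end of an episode. The sketch below follows the standard formulation from [32]; the function name `gae`, the convention that the value list carries one extra bootstrap entry (zero at the end of a tour), and the example rewards are assumptions, and the discount and smoothing values used here are placeholders rather than the study's settings.

```python
def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one episode.

    `values` holds one more entry than `rewards`: the last entry is the
    value after the final step (0.0 when the episode has ended).
    Returns (advantages, value_targets).
    """
    advantages = [0.0] * len(rewards)
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        next_adv = delta + gamma * lam * next_adv               # recurrence
        advantages[t] = next_adv
    targets = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, targets


# Hypothetical three-step tour; rewards are negative travel distances.
rewards = [-1.0, -2.0, -3.0]
values = [0.0, 0.0, 0.0, 0.0]  # an untrained critic that predicts zero
adv, targets = gae(rewards, values, gamma=1.0, lam=1.0)
print(adv)      # [-6.0, -5.0, -3.0]
print(targets)  # [-6.0, -5.0, -3.0]
```

With the smoothing parameter set to 1 the targets reduce to the discounted return described earlier, and with it set to 0 each advantage reduces to the one-step temporal-difference error, so the parameter trades bias against variance between these two extremes.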
We applied Generalized Advantage Estimation (GAE) using a smoothing parameter and a discount factor .
GNN training hyperparameters are set as follows: the entropy coefficient is 0.001, the value function coefficient is 0.5, and the PPO clipping parameter ($\epsilon$) is 0.1. Optimization is performed using the AdamW [33] optimizer with a learning rate of
and a batch size of 64.
At each episode, 42 random orders of size 25 are generated. For each order, the picker’s path is simulated, rewards are computed, and the model is trained for 8 epochs using the collected trajectories.
The model architecture begins with an embedding layer that increases the feature dimensionality to 32. This is followed by 12 TransformerConv [34] layers with 4 attention heads and an output dimension of 32. After the TransformerConv layers, a linear layer is applied to nodes of the target type (storage nodes), followed by an additional linear layer that produces node-level scores. A softmax function is then applied to obtain action probabilities.
To estimate the global value, mean pooling is applied to the node embeddings after the first linear layer, followed by a linear layer that outputs the graph-level value estimate.
3. Results
Figure 7 shows the progression of the average distance per order during training of the GNN–DRL model and its comparison with classical heuristics. The x-axis represents the number of training episodes, ranging from 0 to 18,500, while the y-axis denotes the average distance traveled per order, measured in meters. Performance is evaluated every 100 episodes on a new set of 1000 randomly generated orders of size 25. The GNN–DRL model is compared against three baseline methods, indicated in the legend: Local Search, Christofides, and Lin–Kernighan. To keep the figure readable, the y-axis is limited to 185 m; the values cut from the figure for episodes 100, 200, 300, and 400 are approximately 259 m, 211 m, 189 m, and 189 m, respectively.
During early training, the GNN–DRL model exhibits higher variance and longer average distances, followed by a gradual improvement as training progresses. After approximately 3000 episodes, the GNN–DRL model consistently outperforms the Christofides heuristic; after around 8000 episodes, it surpasses Local Search; and after approximately 12,000 episodes, it achieves sustained improvements over the Lin–Kernighan heuristic. After that, the GNN–DRL model maintains a consistently lower average distance per order than Local Search, Christofides, and Lin–Kernighan, demonstrating its advantage in the converged regime.
Temporary spikes in the GNN–DRL performance curve are visible throughout training and are attributed to continued exploration, which may allow the model to escape local minima. Each training episode requires on average 1.34 min, resulting in a total training time of approximately 17 days for 18,500 episodes of continuous training. Note that this is a one-time cost. Inference takes around 0.057 s per step (the simulation part in Figure 6), so an order of size 15 takes around 0.86 s and an order of size 25 takes around 1.44 s. For an order size of 25, the Lin–Kernighan heuristic is substantially faster for a single order, requiring less than 0.005 s per order. The computation time of the GNN–DRL approach can be substantially reduced by processing multiple orders in batches. When 1000 orders are evaluated together using the GNN model with a maximum batch size of 256, the average computation time decreases from 1.44 s to approximately 0.086 s per order. Although the Lin–Kernighan heuristic is faster, the GNN–DRL runtime is still practical for real-time use.
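The timing figures above can be reproduced from the reported per-episode and per-step averages. The short calculation below uses only numbers stated in the text; the variable names are illustrative, and small differences from the reported values (for example 1.425 s versus about 1.44 s) are assumed to come from rounding of the per-step time.

```python
# Training cost: 18,500 episodes at an average of 1.34 min each.
episodes = 18_500
minutes_per_episode = 1.34
total_days = episodes * minutes_per_episode / 60 / 24

# Inference cost: one forward pass of about 0.057 s per routing step.
seconds_per_step = 0.057
order_15_seconds = 15 * seconds_per_step
order_25_seconds = 25 * seconds_per_step

# Batched evaluation: reported per-order time drops from 1.44 s to 0.086 s.
batch_speedup = 1.44 / 0.086

print(round(total_days, 1))        # 17.2
print(round(order_15_seconds, 3))  # 0.855
print(round(order_25_seconds, 3))  # 1.425
print(round(batch_speedup, 1))     # 16.7
```

Batching therefore recovers roughly a seventeen-fold reduction in per-order time, although the reported Lin–Kernighan time of under 0.005 s per order remains lower.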
In the final phase of training, the model was trained on orders with sizes ranging from 5 to 25 storage locations, using 42 randomly generated orders per order size for each episode for a total of 50 training episodes. Model performance was first evaluated against optimal solutions for order sizes between 5 and 20. For each order size, 1000 random orders were generated, and the optimal route for each order was computed using dynamic programming. The optimality gap was calculated as the percentage deviation from the optimal solution:
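$$\mathrm{Gap}_i = \frac{L_i^{\mathrm{method}} - L_i^{\mathrm{opt}}}{L_i^{\mathrm{opt}}} \times 100\%$$
where $L_i^{\mathrm{method}}$ denotes the route length produced by the evaluated method for order $i$, and $L_i^{\mathrm{opt}}$ denotes the corresponding optimal route length. The reported gap represents the average gap across all 1000 test orders for a given order size.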
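As a concrete illustration of the gap definition, the following sketch computes per-order gaps, their mean, and the share of orders solved to optimality. The function names and the four example route lengths are hypothetical, and the tolerance used to decide whether a route counts as optimal is an assumption.

```python
def optimality_gaps(method_lengths, optimal_lengths):
    """Per-order percentage deviation from the optimal route length."""
    return [
        100.0 * (m - o) / o for m, o in zip(method_lengths, optimal_lengths)
    ]


def summarize(gaps, tol=1e-9):
    """Mean gap and percentage of orders whose gap is zero within `tol`."""
    mean_gap = sum(gaps) / len(gaps)
    share_optimal = 100.0 * sum(1 for g in gaps if abs(g) <= tol) / len(gaps)
    return mean_gap, share_optimal


# Hypothetical route lengths in meters for four test orders.
method = [100.0, 105.0, 110.0, 100.0]
optimal = [100.0, 100.0, 100.0, 100.0]
gaps = optimality_gaps(method, optimal)
print(gaps)             # [0.0, 5.0, 10.0, 0.0]
print(summarize(gaps))  # (3.75, 50.0)
```

Averaging per-order gaps, rather than comparing average route lengths, weights every order equally regardless of its length.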
Table 1 reports the average route lengths and instance-level optimality gaps for the proposed GNN–DRL method and the Lin–Kernighan heuristic. The mean GNN–DRL optimality gap remains below 2% for all tested order sizes, increasing from 0.77% at order size 5 to 1.72% at order size 20. From order size 13 onward, the GNN–DRL gap remains within a narrow range of 1.65–1.82%, indicating that the increase in the gap becomes less pronounced for larger orders. The percentage of optimal solutions decreases as the order size increases for both methods, which is expected as the routing problem becomes more complex. However, GNN–DRL maintains a comparable or higher percentage of optimal solutions than Lin–Kernighan for larger order sizes. At order size 20, GNN–DRL produces optimal solutions for 33.8% of the tested orders, compared with 22.0% for Lin–Kernighan. The 95th percentile gap further shows that GNN–DRL remains relatively stable across order sizes. For GNN–DRL, the P95 gap stays mostly between approximately 5% and 7%, reaching 5.57% at order size 20. In contrast, the Lin–Kernighan P95 gap increases more clearly and reaches 8.80% at order size 20. Overall, these results indicate that the proposed GNN–DRL method remains close to the exact optimum across the tested order sizes and shows more stable high-percentile behavior for larger instances.
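The three summary statistics discussed for Table 1 can be derived from the per-order gaps, as in the sketch below. The nearest-rank definition of the 95th percentile and the tolerance used to count a solution as optimal are assumptions; the source does not state which conventions were used.

```python
def summarize_gaps(gaps, tol=1e-9):
    """Return (mean gap, percentage of optimal solutions, 95th-percentile gap)."""
    if not gaps:
        raise ValueError("gaps must not be empty")
    ordered = sorted(gaps)
    n = len(ordered)
    mean_gap = sum(ordered) / n
    # An order counts as optimally solved when its gap is zero up to tol (assumed).
    pct_optimal = 100.0 * sum(1 for g in ordered if g <= tol) / n
    # Nearest-rank percentile: smallest value with at least 95% of the data at or below it.
    rank = (95 * n + 99) // 100
    p95_gap = ordered[rank - 1]
    return mean_gap, pct_optimal, p95_gap


# Toy data: 10 optimal routes, 9 routes with a 2% gap, 1 route with a 10% gap.
example_gaps = [0.0] * 10 + [2.0] * 9 + [10.0]
mean_gap, pct_optimal, p95_gap = summarize_gaps(example_gaps)
```

In this toy data the single 10% outlier raises the mean but leaves the nearest-rank 95th percentile at 2%, which shows why the mean and P95 columns can move differently.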
These results should be interpreted as evidence that the proposed GNN–DRL policy can approximate optimal routing decisions, rather than as a claim that it replaces exact methods or highly optimized OR solvers for static TSP-like instances. Exact methods and strong heuristics remain well suited for the current static single-picker setting. However, the low optimality gaps indicate that the proposed graph-based learning approach may provide a useful foundation for more complex joint optimization settings where the problem cannot be reduced to a standard static TSP formulation.
A separate comparison was then conducted for larger order sizes, where exact dynamic programming was not used as the primary reference due to its increasing computational cost. In this experiment, order sizes from 5 to 35 items were evaluated, with 1000 randomly generated orders for each order size. The routes produced by the proposed GNN–DRL approach were compared with those generated by the Lin–Kernighan heuristic. This larger-scale comparison is reported separately from the optimality-gap analysis and is intended to evaluate the scalability of the learned policy relative to a strong heuristic baseline.
Figure 8 jointly illustrates the effect size and statistical significance of the performance difference between the GNN–DRL model and the Lin–Kernighan heuristic as a function of order size. Statistical significance is assessed using a Wilcoxon signed-rank test [35], applied separately for each order size to the paired route lengths obtained from 1000 randomly generated orders. The x-axis represents the order size, while the left y-axis shows the relative performance of GNN–DRL with respect to Lin–Kernighan, expressed as the percentage change in average route length. Negative values indicate superior performance of the GNN–DRL model. The right y-axis reports the statistical significance of this difference using the Wilcoxon signed-rank test, expressed as $-\log_{10}(p)$, with the dotted horizontal line indicating the conventional significance threshold of $p = 0.05$.
Statistically significant improvements ($p < 0.05$) emerge from order size 15 onward, with significance increasing rapidly for larger orders and reaching extremely small p-values that are close to zero (below $10^{-30}$) for most order sizes above 20. The standard deviation values for both methods remain comparable across all order sizes, indicating that the observed performance gains are not driven by increased variance or outliers.
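A paired comparison of this kind can be sketched as follows. The code implements the Wilcoxon signed-rank test with the large-sample normal approximation and without tie or continuity corrections, so it is a simplified stand-in rather than the exact procedure used here; a library routine such as `scipy.stats.wilcoxon` would normally be preferred. The function names and the toy route lengths are assumptions.

```python
import math


def wilcoxon_signed_rank_p(a, b):
    """Two-sided p-value for paired samples (normal approximation, no corrections)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their average 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))


def percent_change(candidate, baseline):
    """Percentage change in mean route length; negative means the candidate is shorter."""
    mean_c = sum(candidate) / len(candidate)
    mean_b = sum(baseline) / len(baseline)
    return 100.0 * (mean_c - mean_b) / mean_b


# Toy paired route lengths (hypothetical): the candidate is always slightly shorter.
gnn = [100.0 + i for i in range(50)]
lk = [x + 1.0 + 0.1 * i for i, x in enumerate(gnn)]
p_value = wilcoxon_signed_rank_p(gnn, lk)
change = percent_change(gnn, lk)
```

Because the test uses only the signs and ranks of the paired differences, a small but consistent reduction in route length can yield a very small p-value even when the percentage change is modest.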
The corresponding Table 2 provides a quantitative comparison between GNN–DRL and Lin–Kernighan across all evaluated order sizes, reporting average route lengths, percentage changes, standard deviations, and Wilcoxon signed-rank test statistics. For small order sizes (5–10 locations), average route lengths are nearly identical, and differences are not consistently statistically significant. As order size increases, the GNN–DRL model increasingly outperforms Lin–Kernighan, as reflected by a growing negative percentage change in route length.
p-values in the tables are reported in scientific notation, representing very small numbers close to zero and indicating strong statistical significance. Overall, Figure 8 and the accompanying Table 2 demonstrate that the performance advantage of the GNN–DRL approach over Lin–Kernighan becomes more pronounced as order size and combinatorial complexity increase. This is visible in the growing percentage difference and in p-values approaching zero, which indicate increasingly significant differences between the two approaches and confirm the effectiveness of GNN–DRL for larger and more challenging order-picking problems.
When the model trained on the smaller warehouse was directly applied to the larger warehouse containing approximately 1200 storage locations, a decrease in performance was observed. To address this, the model was fine-tuned on the larger warehouse by training for an additional 2800 episodes using orders of size 25, with 4 training epochs per episode. Due to GPU memory constraints, the batch size was reduced from 64 (used for the smaller warehouse) to 48, and the learning rate was decreased from to . After this fine-tuning phase, an additional 50 episodes of training were performed using variable order sizes ranging from 5 to 25 locations, following the same procedure as for the smaller warehouse.
Figure 9 illustrates the performance of the GNN–DRL model during fine-tuning, together with the baseline heuristics, when applied to the larger warehouse instance. The x-axis shows the number of training episodes, while the y-axis reports the average distance per order in meters. Results are shown for the GNN–DRL model and three comparison methods: Local Search, Christofides, and Lin–Kernighan.
As shown in Figure 9, the GNN–DRL model consistently achieves lower average distances per order than the Lin–Kernighan heuristic throughout training, despite the increased problem scale. However, the relative performance improvement is smaller than that observed for the smaller warehouse, indicating a slight reduction in percentage gain as warehouse size increases. These results demonstrate that while the GNN–DRL approach generalizes to larger warehouse instances, additional fine-tuning is necessary, and performance gains diminish moderately as the total number of storage locations in the warehouse increases.
Figure 10 illustrates the effect size and statistical significance of the performance difference in the same way as Figure 8, but for the larger warehouse. The solid curve with circular markers represents the percentage improvement of GNN–DRL relative to Lin–Kernighan. For small order sizes, improvements are close to zero and fluctuate around the baseline, indicating negligible performance differences. Order sizes for which the Wilcoxon test does not indicate statistical significance ($p > 0.05$) are sizes 7, 8, 9, and 11; they are highlighted with large red cross markers. These cases correspond to the entries in Table 3 where p-values exceed the 0.05 threshold, confirming that the observed differences for these small order sizes are not statistically significant.
As order size increases, the magnitude of improvement steadily grows, reaching reductions of approximately 1.0–1.6% for larger orders. In parallel, the significance curve rises sharply, indicating rapidly increasing statistical significance. Beyond order size 11, all improvements are statistically significant, with p-values decreasing by several orders of magnitude. For orders above 24 locations, the p-values reported in the table mostly fall below , demonstrating that the observed performance gains are both substantial and highly robust.
Overall, Figure 10 and the accompanying Table 3 consistently show that while performance differences between GNN–DRL and Lin–Kernighan are marginal and statistically insignificant for small orders, the GNN–DRL model achieves increasingly larger and statistically significant improvements as order size, and therefore combinatorial complexity, increases. This confirms that the advantage of the GNN–DRL approach becomes more pronounced in larger and more challenging order-picking scenarios.
To better understand how the GNN captures the warehouse structure, we visualize the learned embeddings of storage nodes. The embeddings are taken from the output of the final TransformerConv layer under the conditions where all storage locations are marked as being on the order path and the picker location is set to the entrance. For visualization purposes, the embedding dimensionality is reduced to two using Uniform Manifold Approximation and Projection (UMAP) [36].
Figure 11 presents the resulting visualization alongside the corresponding warehouse layout grid. On the left, storage locations are colored according to their physical aisles, while on the right, colors represent k-means clusters [37] derived from the learned embeddings. The number of clusters is set equal to the number of aisles containing storage locations. Each point corresponds to a storage location: the point shape (square vs. circle) indicates the side of the aisle, border thickness encodes the x-coordinate (thicker borders correspond to larger x values), and opacity represents the y-coordinate, with more opaque points indicating higher vertical positions.
Spatial mapping enables a direct comparison between the learned embedding space and the physical warehouse layout. In the left part of Figure 11, colors denote individual aisles. While storage locations from the same aisle tend to form broadly coherent clusters, the embedding structure indicates that physical proximity, both between storage locations and relative to the warehouse entrance, also plays an important role and in some cases appears more influential than aisle membership. Storage locations that are spatially close to one another, as well as those with similar distances to the entrance, appear closer in the embedding space even when they belong to different aisles. The strong alignment between the relative positions of storage locations in the embedding space and their positions in the physical warehouse layout demonstrates that the learned embeddings successfully preserve meaningful spatial organization.
Figure 12 shows the corresponding warehouse layout grid, where storage locations are colored to match the k-means clusters visible in the UMAP visualization. Colors are assigned according to the cluster of the ground-level (y = 0) location. This spatial mapping allows a direct comparison between the learned embedding structure and the physical warehouse layout. The strong correspondence between the UMAP projection, the physical aisles, and the distance to the entrance confirms that the learned embeddings preserve meaningful spatial organization.
Anchor nodes (with identifiers 0, 300, and 599) are visually distinguishable in the embedding space, indicating that anchor-based distance features influence the learned representations. While these anchors improve structural awareness during learning, it remains an open question whether their explicit inclusion could be reduced or removed without degrading model performance.
Figure 13 illustrates how the storage-node embeddings change when the picker’s location is modified. In this experiment, the picker node is placed at the storage location with ID 495. The projection shows a noticeable reorganization of clusters, which visually correlates strongly with graph distance to the selected picker location. Notably, all locations within the same aisle as the picker (Aisle 10) form a compact cluster in the embedding space.
Storage locations closer to the picker in the warehouse graph appear more similar in the embedding space, which can also be seen in the k-means clusters projected onto the warehouse layout (Figure 14). This behavior suggests that the learned embeddings dynamically encode task-relevant context, such as the current picker position. In practical applications, this property could be exploited to identify similar storage locations for items that frequently co-occur in orders, as well as to identify dissimilar locations for items that should be spatially separated (e.g., due to congestion or incompatibility constraints).
The presented results support the main motivation outlined in the Introduction: that graph-based representation learning achieved with deep reinforcement learning can effectively capture warehouse layout structure. The visualizations demonstrate that the GNN learns meaningful embeddings.
Importantly, these embeddings are not only descriptive but also operationally useful. The previously demonstrated improvements over classical heuristics indicate that the learned representations can directly support decision-making in order picking. The sensitivity of embeddings to the picker’s location further highlights their adaptability to dynamic operational contexts.
Overall, these results validate the proposed approach and illustrate the potential of GNN-based warehouse representation learning with deep reinforcement learning for real-time, adaptive warehouse optimization, as anticipated in the Introduction.
Preliminary Multipicker Experiment
Real warehouse optimization problems rarely correspond to a simple TSP, as they often involve multiple pickers operating simultaneously and additional constraints such as congestion. To explore how our approach could be extended to a multi-picker setting, we conducted a smaller-scale experiment in which the model produced predictions for two pickers. In this experiment, both pickers worked on the same order. The model therefore had to assign order locations to the two pickers at each step and generate a picking route for each picker, with the objective of minimizing the total distance traveled by both pickers. This experiment is intended as a preliminary illustration of how the proposed graph-based learning framework may be extended toward multi-picker settings, rather than as a full evaluation of practical multi-picker warehouse performance.
Compared to the single-picker formulation, the graph representation is extended by adding a second picker node. After the final TransformerConv layer, we compute element-wise multiplications between storage node embeddings and picker embeddings to produce storage node embeddings for each picker. These embeddings are then passed through linear layers that produce the policy outputs. The value estimate is computed directly from a global pooling operation over storage node embeddings.
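The data flow of this output stage can be sketched with plain Python lists. This is a simplified illustration, not the trained architecture: the linear layers are reduced to a single hypothetical weight vector, mean pooling is assumed as the global pooling operation, and the mask over already-visited locations is an assumption that is not described in this section.

```python
import math


def picker_conditioned(storage_emb, picker_emb):
    """Element-wise product of each storage-node embedding with one picker embedding."""
    return [[s * p for s, p in zip(node, picker_emb)] for node in storage_emb]


def policy_probs(node_emb, weights, mask):
    """One linear score per node, followed by a softmax over the unmasked nodes."""
    logits = [sum(w * x for w, x in zip(weights, node)) for node in node_emb]
    top = max(l for l, m in zip(logits, mask) if m)  # subtract max for stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_value_input(storage_emb):
    """Mean pooling over storage nodes (one possible global pooling)."""
    n = len(storage_emb)
    return [sum(column) / n for column in zip(*storage_emb)]


# Toy embeddings with dimension 2 (hypothetical values).
storage = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pickers = [[1.0, 0.0], [0.0, 1.0]]
mask = [True, True, False]  # third location assumed to be already visited

per_picker = [picker_conditioned(storage, p) for p in pickers]
probs = [policy_probs(emb, [1.0, 1.0], mask) for emb in per_picker]
pooled = pooled_value_input(storage)
```

The element-wise product lets the same storage embeddings yield a different action distribution for each picker, while the pooled vector used for the value estimate is shared between both pickers.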
At each simulation step, the model outputs two actions, one for each picker. If only a single storage location remains to be visited (when the number of locations in the order is odd), the final location is assigned to the picker whose current route has the lower cost. To account for route overlap we introduce an additional penalty: if the predicted paths of the two pickers intersect during a step, the step cost is multiplied by 1.2.
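The step cost and the handling of the last remaining location might look as follows. Only the 1.2 multiplier and the lower-cost rule come from the text; treating the step cost as the sum of both pickers' distances, defining an intersection as the two paths sharing at least one graph node, and breaking ties in favor of the first picker are assumptions.

```python
OVERLAP_PENALTY = 1.2  # multiplier stated in the text


def paths_intersect(path_a, path_b):
    """Assumed overlap test: the two paths share at least one graph node."""
    return not set(path_a).isdisjoint(path_b)


def step_cost(dist_a, dist_b, path_a, path_b):
    """Combined distance of both pickers for one step, penalized on overlap."""
    cost = dist_a + dist_b
    if paths_intersect(path_a, path_b):
        cost *= OVERLAP_PENALTY
    return cost


def picker_for_last_location(route_cost_a, route_cost_b):
    """Index of the picker with the lower current route cost (ties go to picker 0)."""
    return 0 if route_cost_a <= route_cost_b else 1


disjoint_cost = step_cost(4.0, 6.0, [1, 2, 3], [4, 5])
overlap_cost = step_cost(4.0, 6.0, [1, 2, 3], [3, 7])
last_picker = picker_for_last_location(42.0, 37.5)
```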
As a baseline, we combine agglomerative clustering [38] with the Lin–Kernighan TSP heuristic. Storage locations are first partitioned into two clusters using agglomerative clustering. The clusters are then rebalanced to obtain approximately equal sizes based on the average distance to cluster members. Finally, a Lin–Kernighan solver computes a TSP route within each cluster.
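The rebalancing step is described only briefly, so the sketch below shows one plausible reading rather than the procedure actually used: while the cluster sizes differ by more than one, the member of the larger cluster with the smallest average distance to the smaller cluster is moved across. The function name `rebalance` and the toy distance matrix are hypothetical, and the clustering and Lin–Kernighan steps are omitted.

```python
def rebalance(cluster_a, cluster_b, dist):
    """Move nodes between two clusters until their sizes differ by at most one."""
    a, b = list(cluster_a), list(cluster_b)
    while abs(len(a) - len(b)) > 1:
        big, small = (a, b) if len(a) > len(b) else (b, a)

        def avg_distance_to_small(node):
            # An empty target cluster gives no preference between nodes.
            if not small:
                return 0.0
            return sum(dist[node][member] for member in small) / len(small)

        mover = min(big, key=avg_distance_to_small)
        big.remove(mover)
        small.append(mover)
    return a, b


# Toy example: six locations on a line, given by their positions (hypothetical).
positions = [0, 1, 2, 3, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
balanced_a, balanced_b = rebalance([0, 1, 2, 3], [4, 5], dist)
```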
The model used in this experiment consists of eight TransformerConv layers with one attention head and an embedding dimension of 128. The policy head includes a linear layer of size 1024. Training was performed with a learning rate of for the first 700 episodes. Each episode contained 42 orders with sizes between 6 and 10 locations and took around 52 s to finish. Additional hyperparameters included a PPO clipping parameter of 0.2, entropy coefficient of 0.0001, value coefficient of 0.5, batch size of 64, and four training epochs per episode.
After 700 episodes, we reduced the learning rate to and the clipping parameter to 0.1, while increasing order sizes to between 10 and 15 locations. Each episode took about 90 s to finish. We observed that the model converged more quickly when first trained on smaller instances and then gradually exposed to more complex orders. However, the optimal training curriculum remains an open question. For this experiment, we trained the model for a total of 4000 episodes, which took around 93 h.
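The two-stage schedule can be written down as a small lookup, which also allows a rough check of the reported training time. The function names and the dictionary layout are assumptions, learning-rate values are left out of the sketch, and the per-episode durations are the approximate figures quoted in the text.

```python
def curriculum_stage(episode):
    """Training settings for a 1-based episode index, following the two stages above."""
    if episode <= 700:
        return {"order_sizes": (6, 10), "ppo_clip": 0.2, "seconds_per_episode": 52}
    return {"order_sizes": (10, 15), "ppo_clip": 0.1, "seconds_per_episode": 90}


def total_training_hours(num_episodes):
    """Approximate wall-clock training time implied by the per-episode durations."""
    seconds = sum(
        curriculum_stage(episode)["seconds_per_episode"]
        for episode in range(1, num_episodes + 1)
    )
    return seconds / 3600


hours = total_training_hours(4000)
```

With 700 episodes at about 52 s and 3300 episodes at about 90 s, the estimate comes to roughly 92.6 h, in line with the reported total of around 93 h.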
The results in Figure 15 show that, for an order size of 10 locations, the GNN–DRL approach converges to solutions that start to outperform the Clustering + Lin–Kernighan baseline after approximately 1000 training episodes.
Figure 16 and the accompanying Table 4 illustrate the effect size and statistical significance of the performance difference for the multi-picker setting at Episode 4000. The results show that the method can quickly outperform the Clustering + Lin–Kernighan baseline for smaller order sizes (order size ≤ 11), whereas the short training time was not yet sufficient for the model to learn to effectively solve the TSP for larger order sizes. At 4000 episodes the model is still improving, indicating that it has not yet reached its converged state.
The corresponding Table 4 provides a quantitative comparison between the GNN–DRL approach and the Clustering + Lin–Kernighan baseline across all evaluated order sizes. The table reports average route lengths, percentage differences, standard deviations, and the results of the Wilcoxon signed-rank test. For smaller order sizes (≤11), the average route lengths produced by GNN–DRL are significantly shorter than those obtained using the Clustering + Lin–Kernighan heuristic. The higher standard deviation observed for the GNN–DRL approach compared to the Clustering + Lin–Kernighan baseline indicates greater variability in the generated routes. This variability is expected, as the model is still in relatively early stages of training and has not yet consistently converged to near-optimal TSP solutions.
These preliminary results suggest that the proposed framework can be extended toward multi-picker and possibly other warehouse optimization scenarios. However, further work is required to incorporate more realistic operational constraints, including detailed congestion effects and dynamic order arrivals, as well as to provide a more comprehensive large-scale evaluation.
4. Discussion and Conclusions
This study demonstrates that the combined GNN–DRL approach consistently produces routing solutions with low optimality gaps while simultaneously learning meaningful graph-based representations of warehouse layouts across different warehouse sizes and order configurations. While exact optimization methods and advanced heuristics remain highly effective for static TSP formulations, the proposed framework provides a foundation for extending graph-based representation learning and adaptive policy learning toward more dynamic and operationally complex warehouse optimization settings. Additionally, preliminary experiments suggest that the proposed method can be extended to a multi-picker setting.
The embedding analysis together with low optimality gaps in order picking path lengths provides clear evidence that GNNs can capture meaningful structural information. UMAP visualizations reveal that storage locations belonging to the same aisle are visually distinguishable in embedding projections, while neighboring aisles are positioned close to each other in the embedding space. This indicates that warehouse structure is reflected in the learned representations. These observations confirm that the GNNs can properly encode higher-level structural properties of the warehouse layout. Unlike other DRL routing approaches, the learned representations can be inspected and related to physical warehouse structure, revealing clear alignment with aisle organization, spatial proximity, and picker location. This transparency supports interpretability by enabling validation of whether the model encodes meaningful and operationally relevant information rather than spurious patterns. The consistency between embedding structure, warehouse layout, and observed routing performance also serves as an additional sanity check for the correctness of the modeling assumptions and experimental setup, strengthening confidence in the validity of the proposed framework.
Further insight is provided by experiments in which the picker’s location is changed. The resulting reorganization of the embedding space correlates strongly with graph distance to the picker, demonstrating that the learned representations are context-sensitive and dynamically adapt to task-relevant information. This property suggests that the embeddings could support additional warehouse decision-making tasks beyond routing, such as identifying similar or dissimilar storage locations based on operational criteria.
Regarding generalization, the results show that models trained on a smaller warehouse can be transferred to a larger warehouse through additional fine-tuning. Although direct application without adaptation results in reduced performance, a limited fine-tuning phase is sufficient to restore and maintain a consistent advantage over classical heuristics. While relative improvements are slightly smaller in the larger warehouse, the GNN–DRL approach remains robust across scales, suggesting that the learned representations capture reusable structural patterns that generalize across warehouse sizes, rather than overfitting to a single configuration.
Future research could explore the multi-picker setting in more detail and extend the warehouse graph with additional entities such as orders and items. This extension would allow joint optimization of tasks such as order batching and item allocation or reallocation, while extending representation learning to these additional entities.
Despite the encouraging results, several limitations of this study should be acknowledged. First, all experiments were conducted on synthetic warehouse layouts, which, while designed to reflect realistic grid-based structures, may not fully capture the heterogeneity and operational constraints of real-world warehouses. Second, the proposed GNN–DRL approach requires a non-trivial training time, which may limit its applicability in settings where rapid deployment or frequent retraining is required, although this cost is incurred offline and inference remains efficient. However, our additional experiments indicate that training efficiency may be improved through staged training procedures, where the model is first trained on simpler problem instances (e.g., smaller order sizes) and subsequently fine-tuned on more complex instances. Such curriculum-style training showed faster convergence in our experiments, although identifying optimal training procedures remains an open direction for future research. Third, the current formulation relies on anchor-based distance features to support representation learning; while effective, the extent to which these engineered features are necessary warrants further investigation. Finally, the main experimental setup considers a single-picker, single-tour scenario and, apart from the preliminary two-picker experiment, does not yet address multi-picker interactions, congestion effects, or dynamic order arrivals, which are important aspects of practical warehouse operations and represent natural directions for future work.
Overall, the findings provide affirmative answers to the research questions posed in this study. First, the results demonstrate that GNNs can learn embeddings that capture meaningful structural properties of the warehouse layout while providing low optimality gaps in order-picking performance. Second, the learned models exhibit the ability to generalize across different warehouse configurations, with fine-tuning providing an effective mechanism for adapting to different warehouse sizes. Taken together, these findings position GNN–DRL not merely as a competitive routing heuristic, but as a general representation-learning framework for adaptive, graph-structured warehouse decision making.