CAHT: A Constraint-Aware Heterogeneous Transformer for Real-Time Multi-Robot Task Allocation in Warehouse Environments

Gong, Shengshuo; Varlamov, Oleg

doi:10.3390/a19040312



by Shengshuo Gong * and Oleg Varlamov

Department of Information Processing and Management Computer Science and Control Systems, Bauman Moscow State Technical University, 105005 Moscow, Russia

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(4), 312; https://doi.org/10.3390/a19040312

Submission received: 16 March 2026 / Revised: 3 April 2026 / Accepted: 12 April 2026 / Published: 16 April 2026


Abstract

The NP-hard coordination of heterogeneous robots for time-windowed warehouse tasks remains challenging: metaheuristics are precise but slow, whereas neural methods cannot handle heterogeneous constraints, leading to infeasible allocations. This paper presents the Constraint-Aware Heterogeneous Transformer (CAHT), a lightweight encoder–decoder architecture that performs end-to-end task assignment and sequencing in a single forward pass. The central innovation is a dynamic feasibility masking mechanism that enforces capacity and energy constraints directly within the softmax computation, eliminating infeasible allocations at the architectural level. This is complemented by a spatial-bias Transformer encoder and a two-stage supervised–reinforcement learning training paradigm using ALNS-generated labels. Experiments across four problem scales (5–20 robots, 50–200 tasks) demonstrate that CAHT achieves objective values within 7–13% of the ALNS reference while being 29–91× faster (23–104 ms vs. 2–3 s). Constraint violation rates remain below 6%, with time-window satisfaction above 94%. Ablation analysis identifies dynamic masking as the dominant contribution (+213% degradation upon removal), and cross-scale generalization reveals that the optimality gap decreases from 13.0% to 10.7% as the problem scale grows. With only 0.91 M parameters, CAHT occupies a new trade-off point on the Pareto frontier, offering a practical path toward real-time autonomous warehouse coordination.

Keywords:

multi-robot task allocation; heterogeneous fleet; transformer; dynamic constraint masking; warehouse logistics; neural combinatorial optimization

1. Introduction

The explosive growth of e-commerce has placed unprecedented pressure on warehouse fulfillment operations, driving the widespread deployment of autonomous robot fleets [1,2]. Modern distribution centers increasingly rely on heterogeneous fleets—comprising Automated Guided Vehicles (AGVs), Autonomous Mobile Robots (AMRs), and specialized forklift units—each offering distinct trade-offs among speed, payload capacity, and energy efficiency [3]. Orchestrating such diverse fleets to execute hundreds of pickup–delivery tasks under tight time-window constraints while respecting each robot’s physical limitations remains a central challenge in warehouse automation [4,5].

Formally, this Multi-Robot Task Allocation (MRTA) problem generalizes the Heterogeneous Fleet Vehicle Routing Problem with Time Windows (HF-VRPTW) [6], which is NP-hard. Existing solution approaches present a fundamental and unsatisfactory trade-off. Metaheuristic solvers such as Adaptive Large Neighborhood Search (ALNS) [7,8] deliver high-quality solutions but require seconds to minutes of computation, precluding deployment in real-time re-allocation cycles. Simple heuristics (e.g., nearest-first assignment) offer sub-millisecond response but sacrifice 20–30% in solution quality. The emerging field of Neural Combinatorial Optimization (NCO) [9,10] offers the potential to close this gap by learning to produce near-optimal solutions in a single forward pass.

However, existing NCO architectures, including the Attention Model, POMO [11], and recent generalizable solvers [12,13], are ill-suited to heterogeneous warehouse MRTA because they lack: (a) mechanisms for modeling the distinct interaction semantics among heterogeneous robot types and diverse tasks; (b) architectural enforcement of hard constraints such as capacity limits, energy budgets, and zone accessibility [14]; and (c) a way to produce the joint assignment-and-sequencing decisions inherent in multi-robot allocation [15]. Our experiments confirm this diagnosis: POMO, despite retraining on warehouse MRTA data, produces solutions with constraint violation rates exceeding 98%, rendering its outputs effectively unusable.

We therefore propose the Constraint-Aware Heterogeneous Transformer (CAHT), a lightweight neural architecture whose three technical contributions each target one of the limitations above:

(1): Dynamic feasibility masking (addressing limitation b): Hard constraint enforcement is embedded directly into the assignment decoder’s probability computation by setting infeasible robot–task scores to negative infinity before softmax normalization. In the ablation study (Section 4.3), removing this mechanism raises the constraint violation rate by 75 percentage points (from 4.9% to 79.9%) and the objective value by 213%, which suggests that constraint satisfaction in heterogeneous MRTA is not reliably learned from data alone and should be structurally enforced [14].
(2): Spatial-bias Transformer encoding for heterogeneous entities (addressing limitation a): The standard self-attention mechanism is augmented with a learned spatial proximity bias, enabling distance-dependent robot–task interaction modeling. Combined with type-specific input embeddings that distinguish robot categories, this design supports effective representation learning across heterogeneous entity types without requiring explicit graph construction.
(3): End-to-end assignment and sequencing (addressing limitation c): CAHT jointly produces task-to-robot assignments via a bilinear attention decoder and per-robot task execution orders via a GRU-based autoregressive decoder, eliminating the need for separate optimization stages.

Extensive experiments on a synthetic benchmark with ALNS-generated training labels demonstrate that CAHT achieves objective values within 7–13% of ALNS while being 29–91× faster, with strong generalization to unseen problem scales. The model contains only 0.91 M parameters, positioning it as a practical candidate for edge-deployed real-time warehouse automation.

The remainder of this paper is organized as follows. Section 2 reviews related work across three areas: multi-robot task allocation, heterogeneous fleet vehicle routing, and neural combinatorial optimization. Section 3 presents the CAHT methodology, including the problem formulation, model architecture, training paradigm, and inference procedure. Section 4 reports experimental results, including comparative evaluation, ablation analysis, cross-scale generalization, latency profiling, and online simulation. Section 5 concludes the paper and outlines future research directions.

2. Related Work

This section reviews three streams of literature that motivate CAHT. Section 2.1 examines multi-robot task allocation approaches in warehouse settings, highlighting the scalability limitations of centralized optimization and the quality limitations of decentralized methods. Section 2.2 reviews vehicle-routing formulations with heterogeneous fleets, which provide the combinatorial optimization foundation for our problem. Section 2.3 surveys neural combinatorial optimization methods, identifying the critical gap in constraint handling that CAHT addresses.

2.1. Multi-Robot Task Allocation

Multi-robot task allocation has been extensively investigated in the robotics and operations research communities. The foundational taxonomy of Gerkey and Matarić classifies MRTA along three dimensions: single- vs. multi-task robots, single- vs. multi-robot tasks, and instantaneous vs. time-extended allocation. Recent work has addressed MRTA in realistic industrial settings, including production scheduling with heterogeneous robots [4] and warehouse-specific formulations with diverse robotic platforms [3,15]. Choi et al. [5] proposed an optimization framework for multi-robot logistics that integrates scheduling and allocation, while Sioud et al. [15] developed a dedicated model for smart warehouse environments. Market-based approaches, particularly sequential auctions, have seen widespread adoption due to their decentralized nature, though their myopic allocation strategy often yields globally suboptimal assignments. Centralized optimization via mixed-integer programming provides stronger guarantees but scales poorly beyond moderate problem sizes.

2.2. Vehicle Routing with Heterogeneous Fleets

The Heterogeneous Fleet VRPTW extends classical vehicle routing by introducing vehicles with differing capacities, speeds, and operating costs. Metz et al. [6] addressed delay-resistant robust routing with heterogeneous time windows, while Mozhdehi et al. [16] applied deep reinforcement learning to the heterogeneous fleet VRPTW. Kim et al. [17] proposed a clustering-enhanced ant colony approach for multi-trip heterogeneous fleet routing. On the metaheuristic front, Adaptive Large Neighborhood Search (ALNS) has proven particularly effective for VRPTW variants. Voigt [7] provided a comprehensive review and ranking of ALNS operators, Liu et al. [8] developed a parallel ALNS framework on Spark, and Boualamia et al. [18] introduced a reinforcement learning-based adaptation mechanism for ALNS. Industrial solvers such as Google OR-Tools provide accessible alternatives, though their performance is highly sensitive to computation budgets, as demonstrated in Section 4.

2.3. Neural Combinatorial Optimization

Neural combinatorial optimization (NCO) leverages deep learning to construct heuristic policies for NP-hard problems [9,10]. Ye et al. [19] proposed GLOP, a hierarchical partition-and-construct framework for large-scale routing. Fang et al. [12] introduced INViT, a generalizable routing solver with an invariant nested-view Transformer. Gao et al. [13] developed ensemble methods with transferable local policies for VRP. Zheng et al. [20] presented UDC, a unified divide-and-conquer framework for large-scale combinatorial optimization. The RL4CO benchmark [11] provides a systematic evaluation of NCO architectures including the Attention Model, POMO, and MatNet. A critical limitation shared by existing NCO architectures, however, is the absence of explicit constraint handling: feasibility is typically enforced through soft penalty terms or post hoc repair. Bi et al. [14] recently proposed Lagrangian multiplier-based constraint handling for neural VRP solvers, but their approach targets homogeneous fleets and does not address the joint assignment-sequencing structure. The dynamic masking mechanism proposed herein addresses this fundamental gap for heterogeneous MRTA.

3. Methodology

This section presents the CAHT framework in five parts. Section 3.1 formulates the heterogeneous multi-robot task allocation problem as a constrained optimization. Section 3.2 describes the model architecture, comprising heterogeneous input embeddings, a spatial-bias Transformer encoder, a constraint-aware assignment decoder with dynamic feasibility masking, and an autoregressive sequencing decoder. Section 3.3 introduces the two-stage training paradigm combining supervised pretraining and reinforcement learning fine-tuning. Section 3.4 details the inference procedure, and Section 3.5 analyzes model complexity.

3.1. Problem Formulation

Consider a warehouse modeled as a two-dimensional workspace [2], where a heterogeneous fleet of N robots ($R = \{r_{1}, \dots, r_{N}\}$) serves a set of M pickup–delivery tasks ($T = \{t_{1}, \dots, t_{M}\}$). Each robot ($r_{i}$) is characterized by a feature vector encoding its position, battery level, velocity, payload capacity, and kinematic type (AGV, AMR, or forklift). Each task ($t_{j}$) is specified by its pickup and delivery locations, priority, time window $[e_{j}, l_{j}]$, and payload weight ($w_{j}$).

The objective is to find a joint assignment-and-sequencing decision minimizing a weighted combination of energy consumption, makespan, and time-window violations [6,16]. Let $x_{ij} \in \{0, 1\}$ denote the assignment of task j to robot i and $\pi_{i}$ be the task execution sequence for robot i. The objective function is formulated as follows:

$$\min J = \alpha \sum_{i} E_{i}(\pi_{i}) + \beta \sum_{i} C_{i}(\pi_{i}) + \gamma \sum_{i,j} D_{ij} x_{ij} \tag{1}$$

where $E_{i}$ denotes the total energy consumption along the robot’s route (proportional to distance, payload, and kinematic energy rate), $C_{i}$ is the completion time, $D_{ij}$ is the time-window violation penalty, $\alpha = 0.4$, $\beta = 0.4$, and $\gamma = 0.2$. The optimization is subject to the following constraints:

$$\sum_{i} x_{ij} = 1, \quad \forall j \in T \tag{2}$$

$$\sum_{j} w_{j} x_{ij} \leq \mathrm{cap}_{i}, \quad \forall i \in R \tag{3}$$

$$E_{i}(\pi_{i}) \leq \mathrm{bat}_{i}, \quad \forall i \in R \tag{4}$$

$$e_{j} \leq \mathrm{arrival}(j, \pi_{i}) \leq l_{j}, \quad \forall j \text{ assigned to } r_{i} \tag{5}$$

Constraints (2)–(5) enforce single assignment, capacity limits, energy feasibility, and time-window compliance, respectively. This formulation generalizes the Heterogeneous Fleet VRPTW and is NP-hard [6].
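As a minimal sketch of how Equation (1) and constraints (2)–(4) might be evaluated for a candidate solution, the following Python fragment works on per-robot summaries that are assumed to be computed elsewhere (route energy, completion time, time-window penalties). The function names `objective` and `is_feasible` and the data layout are illustrative assumptions, not the authors' implementation; the time-window constraint (5) is omitted because it requires simulating arrival times.

```python
from typing import Dict, List

ALPHA, BETA, GAMMA = 0.4, 0.4, 0.2  # weights stated for Equation (1)


def objective(energy: List[float], completion: List[float],
              tw_penalty: List[float]) -> float:
    """Weighted cost J: per-robot energy and completion time, plus the
    time-window penalties of the assigned (robot, task) pairs."""
    return ALPHA * sum(energy) + BETA * sum(completion) + GAMMA * sum(tw_penalty)


def is_feasible(assign: Dict[int, int], weight: Dict[int, float],
                cap: List[float], energy: List[float], bat: List[float],
                num_tasks: int) -> bool:
    """Check constraints (2)-(4). `assign` maps task index -> robot index."""
    # (2) every task appears exactly once in the assignment
    if sorted(assign) != list(range(num_tasks)):
        return False
    # (3) total payload per robot stays within its capacity
    load = [0.0] * len(cap)
    for task, robot in assign.items():
        load[robot] += weight[task]
    if any(l > c for l, c in zip(load, cap)):
        return False
    # (4) route energy per robot stays within its battery budget
    return all(e <= b for e, b in zip(energy, bat))
```

A solver or learned policy would call `is_feasible` on its output and report `objective` only for comparison purposes; infeasible outputs still receive a cost, which is why the violation rate is tracked separately.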

3.2. Model Architecture

CAHT employs an encoder–dual-decoder architecture comprising: (i) heterogeneous input embeddings, (ii) a spatial-bias Transformer encoder, (iii) a constraint-aware assignment decoder with dynamic masking, and (iv) an autoregressive sequencing decoder. A schematic overview is provided in Figure 1.

3.2.1. Heterogeneous Input Embedding

Robot features ($f_{r} \in \mathbb{R}^{5}$) and task features ($f_{t} \in \mathbb{R}^{8}$) are projected into a shared d = 128-dimensional space via separate linear layers. Learnable type embeddings distinguish robots from task entities, and an additional kinematic-type embedding (indexed by AGV/AMR/Forklift) is added to robot tokens:

$$h_{r}^{0} = W_{r} f_{r} + b_{r} + e_{\mathrm{type}} + e_{\mathrm{kin}}, \qquad h_{t}^{0} = W_{t} f_{t} + b_{t} + e_{\mathrm{type}} \tag{6}$$

The resulting embeddings are concatenated into $H^{0} \in \mathbb{R}^{(N + M) \times d}$ and passed to the encoder.

3.2.2. Spatial-Bias Transformer Encoder

The standard self-attention mechanism is augmented with a learned spatial proximity bias [12,19]. For tokens i and j, the attention score is computed as follows:

$$a_{ij} = \frac{q_{i}^{\top} k_{j}}{\sqrt{d_{k}}} + g_{ij} \tag{7}$$

where $g_{ij} = \mathrm{MLP}(\lVert \mathrm{pos}_{i} - \mathrm{pos}_{j} \rVert_{2})$ is produced by a 2-layer MLP (1 → 64 → 1) applied to the Euclidean distance. The encoder comprises L = 4 layers with 8 attention heads, an FFN dimension of 512, LayerNorm, residual connections, and dropout (0.1).
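The spatially biased score of Equation (7) can be sketched for a single attention head in plain Python. This is an illustration only: the ReLU activation, the random initialization, and the helper names `make_bias_mlp` and `attention_weights` are assumptions, since the text specifies only the 1 → 64 → 1 shape of the bias MLP.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vec = Sequence[float]


def make_bias_mlp(hidden: int = 64, seed: int = 0) -> Callable[[float], float]:
    """Return g(dist) for a 1 -> hidden -> 1 MLP with random weights.
    ReLU is an assumption; the text does not name the activation."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]

    def g(dist: float) -> float:
        hidden_act = [max(0.0, w * dist) for w in w1]
        return sum(w * h for w, h in zip(w2, hidden_act))

    return g


def attention_weights(q: List[Vec], k: List[Vec],
                      pos: List[Tuple[float, float]],
                      g: Callable[[float], float]) -> List[List[float]]:
    """Row-wise softmax of a_ij = q_i . k_j / sqrt(d_k) + g(||pos_i - pos_j||)."""
    d_k = len(q[0])
    rows = []
    for i in range(len(q)):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            + g(math.dist(pos[i], pos[j]))
            for j in range(len(k))
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows
```

With a bias such as `lambda d: -d` and identical queries and keys, nearby tokens receive larger weights than distant ones, which is the qualitative behavior a learned proximity bias is meant to make available to the encoder.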

3.2.3. Constraint-Aware Assignment Decoder

Robot–task compatibility scores are computed via bilinear attention:

$$s_{ij} = (h_{i}^{L})^{\top} W_{a} h_{j}^{L} + v_{a}^{\top} [h_{i}^{L} \,\Vert\, h_{j}^{L}] \tag{8}$$

The core innovation is the dynamic feasibility mask applied before softmax normalization [14]:

$$P(x_{ij} = 1) = \mathrm{softmax}_{i}(s_{ij} + m_{ij}), \qquad m_{ij} \in \{0, -\infty\} \tag{9}$$

The mask is set to $m_{ij} = -\infty$ when the robot’s residual capacity is insufficient for the task payload (constraint (3)) or its remaining battery cannot cover a conservative round-trip energy estimate (constraint (4)). Critically, the mask is recomputed dynamically after each greedy assignment: once a task is allocated, the robot’s state (remaining capacity, position, and battery) is updated before masks are recalculated for the next task.
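The masking loop described above can be sketched as follows. This is a hedged illustration rather than the authors' decoder: each task is reduced to a single location, the round-trip energy estimate (twice the straight-line distance times a per-robot rate) is an assumed formula, and `masked_softmax` and `greedy_assign` are hypothetical names.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax over robots; entries with mask[i] == False get probability 0,
    which is equivalent to adding -inf to the score before normalization."""
    feasible = [s for s, ok in zip(scores, mask) if ok]
    if not feasible:
        return [0.0] * len(scores)
    top = max(feasible)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_assign(scores: List[List[float]], weight: List[float],
                  task_pos: List[Point], robots: List[Dict],
                  energy_rate: List[float]) -> List[Optional[int]]:
    """Assign tasks one at a time, recomputing the mask after each choice.

    scores[i][j] is the compatibility of robot i with task j.
    robots holds mutable 'cap', 'bat', and 'pos' entries per robot.
    Returns the chosen robot per task, or None if no robot is feasible.
    """
    assignment: List[Optional[int]] = []
    for j, (w, p) in enumerate(zip(weight, task_pos)):
        mask = []
        for i, r in enumerate(robots):
            # conservative round-trip energy estimate (illustrative formula)
            need = 2.0 * math.dist(r["pos"], p) * energy_rate[i]
            mask.append(r["cap"] >= w and r["bat"] >= need)
        if not any(mask):
            assignment.append(None)
            continue
        probs = masked_softmax([scores[i][j] for i in range(len(robots))], mask)
        best = max(range(len(robots)), key=lambda i: probs[i])
        chosen = robots[best]
        # update the chosen robot's state before the next mask is computed
        chosen["bat"] -= math.dist(chosen["pos"], p) * energy_rate[best]
        chosen["cap"] -= w
        chosen["pos"] = p
        assignment.append(best)
    return assignment
```

Because the mask zeroes out infeasible robots before the arg-max, a robot with a high learned score but insufficient residual capacity can never be selected, regardless of what the network has learned.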

3.2.4. Autoregressive Sequencing Decoder

For each robot, a single-layer GRU (hidden size 128) generates the task execution order autoregressively. At each decoding step ($\tau$), the context vector ($u^{t}$) attends to remaining unscheduled tasks:

$$P(\pi_{i\tau} = j \mid \pi_{i,<\tau}) = \mathrm{softmax}_{j}\left(\frac{(u^{t})^{\top} h_{j}^{L}}{\sqrt{d}}\right) \tag{10}$$

A time-window mask suppresses selections that would inevitably violate constraint (5). The GRU’s initial hidden state is set to the corresponding robot’s encoder output.
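A rough sketch of time-window-masked greedy decoding for one robot is given below. The `score` callable stands in for the GRU-attention logit of Equation (10), travel time is taken as straight-line distance over speed, and the fallback to unmasked decoding when every remaining task is masked is an assumption introduced here so that the loop always terminates.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Task = Tuple[Point, float, float]  # (location, e_j, l_j)


def sequence_tasks(start: Point, speed: float, tasks: List[Task],
                   score: Callable[[int], float]) -> List[int]:
    """Greedy decode of one robot's task order with a time-window mask."""
    pos, now = start, 0.0
    remaining = list(range(len(tasks)))
    order: List[int] = []
    while remaining:
        arrival = {j: now + math.dist(pos, tasks[j][0]) / speed
                   for j in remaining}
        # mask out tasks whose deadline l_j can no longer be met
        allowed = [j for j in remaining if arrival[j] <= tasks[j][2]]
        # fallback (assumption): if every task is masked, decode unmasked
        candidates = allowed or remaining
        j = max(candidates, key=score)
        now = max(arrival[j], tasks[j][1])  # wait if arriving before e_j
        pos = tasks[j][0]
        order.append(j)
        remaining.remove(j)
    return order
```

In this sketch a task that is already unreachable is pushed to the end of the sequence instead of displacing tasks that can still be served on time; any violation that remains would be left to a later repair step.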

3.3. Two-Stage Training

3.3.1. Stage I: Supervised Pretraining

The model is pretrained using high-quality labels generated by an ALNS solver [13,14]. The composite loss takes the following form:

$$L_{\mathrm{SL}} = L_{\mathrm{assign}} + \lambda L_{\mathrm{seq}} \tag{11}$$

where $L_{\mathrm{assign}}$ is cross-entropy loss on assignment labels, $L_{\mathrm{seq}}$ is the negative log-likelihood of the ground-truth task sequence, and $\lambda = 0.5$.
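For concreteness, the composite loss of Equation (11) can be computed from predicted probabilities as in the sketch below. Averaging the assignment term over tasks and summing the sequence term over decoding steps are assumptions; the text does not state the reduction, and `supervised_loss` is a hypothetical name.

```python
import math
from typing import List

LAMBDA = 0.5  # weight on the sequencing term in Equation (11)


def supervised_loss(assign_probs: List[List[float]], assign_labels: List[int],
                    seq_true_probs: List[float], lam: float = LAMBDA) -> float:
    """L_SL = L_assign + lam * L_seq.

    assign_probs[j][i]: predicted probability that task j goes to robot i.
    assign_labels[j]:   robot chosen for task j in the ALNS label.
    seq_true_probs[t]:  probability the decoder gave to the labeled task
                        at decoding step t.
    """
    l_assign = -sum(math.log(p[y]) for p, y in zip(assign_probs, assign_labels))
    l_assign /= len(assign_labels)
    l_seq = -sum(math.log(p) for p in seq_true_probs)
    return l_assign + lam * l_seq
```

With one task whose labeled robot receives probability 0.5 and a two-step sequence with probabilities 0.5 and 1.0, the loss evaluates to ln 2 + 0.5 · ln 2.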

3.3.2. Stage II: Reinforcement Learning Fine-Tuning

The second stage directly optimizes the deployment objective (J) via REINFORCE [9] with a greedy rollout baseline:

$$\nabla_{\theta} L_{\mathrm{RL}} = \mathbb{E}_{\pi \sim P_{\theta}}\left[(J(\pi) - J(\pi_{\mathrm{bl}})) \nabla_{\theta} \log P_{\theta}(\pi)\right] \tag{12}$$

where $\pi_{\mathrm{bl}}$ denotes the greedy baseline rollout. Training employs K = 8 sampled trajectories, the Adam optimizer (lr = 10⁻⁵), an entropy regularization coefficient of 0.01, and curriculum scheduling from small to large problem scales [11,18].
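The estimator in Equation (12) can be illustrated on a toy one-step categorical policy, where the gradient of log P with respect to the logits has the closed form one-hot(a) minus p. The sketch below is only meant to show the role of the greedy baseline; it omits the entropy term and the curriculum, and `reinforce_grad` is a hypothetical helper.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits: List[float], costs: List[float],
                   k: int = 8, seed: int = 0) -> List[float]:
    """Monte Carlo estimate of Equation (12) for a one-step policy.

    costs[a] plays the role of J(pi); the baseline is the cost of the
    greedy (arg-max) action under the same policy.
    """
    rng = random.Random(seed)
    p = softmax(logits)
    greedy = max(range(len(p)), key=lambda i: p[i])
    baseline = costs[greedy]
    grad = [0.0] * len(logits)
    for _ in range(k):
        a = rng.choices(range(len(p)), weights=p)[0]
        advantage = costs[a] - baseline
        for i in range(len(logits)):
            # d log P(a) / d logit_i = 1[i == a] - p_i
            grad[i] += advantage * ((1.0 if i == a else 0.0) - p[i]) / k
    return grad
```

A gradient-descent step `logits[i] -= lr * grad[i]` then lowers the probability of actions that cost more than the greedy rollout and raises it for actions that cost less.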

3.4. Inference

At inference time, two sequential passes are performed: the assignment decoder greedily allocates tasks with dynamic masking, and the sequencing decoder generates per-robot task orders autoregressively. A lightweight post-processing module resolves any residual violations through local insertion heuristics [7]. The complete pipeline executes in 23–104 ms on the CPU.

3.5. Model Complexity

CAHT is designed for edge deployability, with L = 4 encoder layers (d = 128, 8 heads) and a single-layer GRU decoder (hidden size of 128). The total parameter count is 0.91 million—orders of magnitude smaller than general-purpose language models while being specifically optimized for warehouse MRTA [20].

4. Results and Discussion

4.1. Experimental Setup

4.1.1. Dataset

A synthetic benchmark is constructed with aisle-structured warehouse layouts on a 100 × 100 grid [2]. Three robot types (AGV, AMR, and forklift) are included with calibrated attribute distributions [3]. Task constraints are deliberately designed to ensure instance feasibility: time windows are widened (late = early + U(100, 300)), payload weights are bounded (U(1, 5)), and each instance is validated to ensure a total task load < 60% of fleet capacity. Ground-truth labels are generated by an ALNS solver (3 s budget per instance), with three destroy operators and two repair operators under simulated annealing acceptance [7,8]. For each scale (S, M, and L), the dataset is split into 300 training, 50 validation, and 50 test instances. Training data is augmented 4× via coordinate mirroring (original + 3 reflections), yielding 1200 training instances per scale. The validation set is used for early stopping and hyperparameter selection. The XL scale (20 robots and 200 tasks) contains only 50 test instances for zero-shot generalization evaluation. Table 1 summarizes the dataset and label quality.
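The sampling and augmentation rules stated above can be sketched as follows. Coordinates are assumed to lie in [0, 100] so that a reflection maps x to 100 − x; the range used for the window start `early` is an assumption (the text gives only the window width), and all function names are illustrative.

```python
import random
from typing import List, Tuple

GRID = 100.0  # side length of the square workspace
Point = Tuple[float, float]


def mirror_augment(points: List[Point]) -> List[List[Point]]:
    """Return the original layout plus its 3 reflections
    (x-flip, y-flip, and both), i.e. a 4x augmentation."""
    flips = [(False, False), (True, False), (False, True), (True, True)]
    return [
        [(GRID - x if fx else x, GRID - y if fy else y) for x, y in points]
        for fx, fy in flips
    ]


def sample_task(rng: random.Random) -> Tuple[float, float, float]:
    """Draw (early, late, weight) following the stated distributions."""
    early = rng.uniform(0.0, 200.0)  # assumption: range not given in the text
    late = early + rng.uniform(100.0, 300.0)
    weight = rng.uniform(1.0, 5.0)
    return early, late, weight


def load_ok(weights: List[float], capacities: List[float]) -> bool:
    """Instance filter: total task load below 60% of fleet capacity."""
    return sum(weights) < 0.6 * sum(capacities)
```

Applying `mirror_augment` to each of the 300 training instances per scale yields the 1200 augmented instances mentioned in the text.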

4.1.2. Baselines and Metrics

Five baselines are selected: Nearest-First Greedy (fastest heuristic), OR-Tools (10 s) (industrial solver with heterogeneous capacity, energy, and time-window constraints), ALNS (30 s) (strongest metaheuristic [7]), POMO (state-of-the-art NCO method [11], retrained with ~0.8 M matched parameters), and POMO+Repair (POMO augmented with a post hoc constraint repair module that reassigns capacity- and energy-violating tasks to feasible robots using greedy insertion). The inclusion of POMO+Repair provides a fair comparison by addressing POMO’s inherent lack of constraint handling [14].

Evaluation employs the following five metrics. (1) Objective value (Obj.): the weighted composite cost (J) defined in Equation (1), computed as 0.4 × total energy + 0.4 × makespan + 0.2 × time-window penalty. (2) Gap vs. ALNS (%): the relative increase in objective value compared to the ALNS reference, defined as $\frac{Obj_{method} - Obj_{ALNS}}{Obj_{ALNS}} \times 100$. (3) Constraint Violation Rate (CVR%): the percentage of tasks that violate at least one hard constraint (capacity, energy, or time window), computed as the number of violated tasks divided by M and multiplied by 100. (4) Time-Window Satisfaction (TW%): the percentage of tasks whose completion time falls within the prescribed time window $[e_{j}, l_{j}]$. (5) Inference Time: wall-clock time in milliseconds for a single instance, measured on the CPU. All results are reported as mean ± standard deviation over 50 test instances.
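The three percentage metrics can be written out as short helpers. The sketch below takes the gap as the relative difference to the ALNS objective multiplied by 100; the function names and argument layout are assumptions made for illustration.

```python
from typing import List, Tuple


def gap_vs_alns(obj_method: float, obj_alns: float) -> float:
    """Relative increase over the ALNS reference, in percent."""
    return (obj_method - obj_alns) / obj_alns * 100.0


def constraint_violation_rate(num_violated_tasks: int, num_tasks: int) -> float:
    """Share of tasks violating at least one hard constraint, in percent."""
    return num_violated_tasks / num_tasks * 100.0


def tw_satisfaction(completion: List[float],
                    windows: List[Tuple[float, float]]) -> float:
    """Share of tasks completed inside their window [e_j, l_j], in percent."""
    on_time = sum(1 for c, (e, l) in zip(completion, windows) if e <= c <= l)
    return on_time / len(windows) * 100.0
```

For example, a method with objective 110 against an ALNS objective of 100 has a gap of about 10%, and 5 violated tasks out of 100 give a CVR of 5%.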

4.2. Solution Quality and the Speed–Quality Trade-Off

Table 2 presents the comparative results across three problem scales.

CAHT (SL) achieves objective values within 7–11% of the ALNS reference across all three scales, substantially surpassing every other baseline. The gap to ALNS narrows as the problem scale increases, from +11.0% (S) to +7.1% (L), suggesting that the attention-based architecture scales favorably.

This solution quality comes at a dramatically lower computational cost. As shown in Figure 2, CAHT’s inference times of 23–104 ms translate to a 29–91× speedup over ALNS (2116–3006 ms) and a 96–412× speedup over OR-Tools (10,000 ms). This positions CAHT at a previously unoccupied point on the speed–quality Pareto frontier [10]: within roughly 7–11% of the strongest metaheuristic while operating in the real-time regime.

OR-Tools is configured as a heterogeneous fleet VRP with three constraint dimensions: (1) per-vehicle capacity with heterogeneous upper bounds matching each robot’s payload rating, (2) cumulative energy consumption tracked per vehicle with per-type energy rates and battery limits, and (3) soft time-window upper bounds with weighted penalty terms. The solver uses PATH_CHEAPEST_ARC for initial solutions, GUIDED_LOCAL_SEARCH as a metaheuristic, and a 10 s time limit. Despite this comprehensive configuration, OR-Tools underperforms ALNS because the generic routing solver is not specifically designed for the joint assignment-sequencing structure with heterogeneous energy rates. POMO (raw) produces CVR > 98% as expected, since it lacks any constraint mechanism [14]. However, even POMO+Repair—which applies post hoc greedy insertion to fix capacity and energy violations—remains substantially worse than CAHT, confirming that architectural constraint enforcement outperforms post hoc repair for heterogeneous MRTA [11,14].

Supervised vs. Reinforcement Learning Variants

CAHT (SL) marginally outperforms CAHT (SL+RL) across all scales. Analysis of the RL training curves (Figure 3) reveals that while the training reward shows an upward trend, the high gradient variance of REINFORCE under the limited training regime (300 instances, 15 epochs, and K = 4 samples) prevents stable convergence to an improved policy [9]. The validation objective oscillates around the SL baseline without consistent improvement, indicating that the RL fine-tuning has not yet converged. We retain the two-stage paradigm as a framework contribution: the SL stage provides a strong initialization, and the RL stage offers a principled pathway for further optimization with larger compute budgets. CAHT (SL) is adopted as the primary model for the remaining experiments, and the scaling up of RL training is deferred to future work [18].

4.3. Ablation Study: Why Dynamic Masking Is the Key Innovation

Table 3 presents a systematic ablation on the medium (M) scale.

Dynamic masking is the dominant contribution. Its removal causes a 3.1× increase in objective (from 1997 to 6250) and drives CVR from 4.9% to 79.9%. This 213% degradation exceeds all other components and carries a broader implication for the NCO community: architectural constraint enforcement via masking proves vastly more effective than learning constraint satisfaction from data [14].
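The headline ablation figures follow directly from the values quoted above, as the short calculation below shows. It uses only the numbers stated in the text (objective 1997 vs. 6250, CVR 4.9% vs. 79.9%); the variable names are illustrative.

```python
full_obj, no_mask_obj = 1997.0, 6250.0   # objective with / without masking
full_cvr, no_mask_cvr = 4.9, 79.9        # CVR (%) with / without masking

degradation_pct = (no_mask_obj - full_obj) / full_obj * 100.0
ratio = no_mask_obj / full_obj
cvr_increase_points = no_mask_cvr - full_cvr

print(f"{degradation_pct:.0f}% worse, {ratio:.1f}x, "
      f"+{cvr_increase_points:.1f} points")
# prints "213% worse, 3.1x, +75.0 points"
```

The 213% figure is therefore the relative increase in objective when masking is removed, and the 3.1× figure is the same comparison expressed as a ratio.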

The Transformer encoder is essential (+86.6%). Replacing it with a 2-layer MLP nearly doubles the objective and raises CVR to 46.8%, confirming that global pairwise interaction modeling is critical [12,13].

4.4. Cross-Scale Generalization

In Table 4, the optimality gap decreases monotonically from 13.0% (S) to 10.7% (XL), which we attribute to a statistical smoothing effect in larger instances [20]. CVR remains bounded at 6.3%, even at the zero-shot XL scale, suggesting that dynamic masking provides structural generalization for constraint satisfaction.

4.5. Latency Profiling

As summarized in Table 5, the assignment decoder dominates latency (55–68%), driven by iterative greedy masking. Optimizing this module—through batched mask computation—represents the most promising avenue for further speedup.

4.6. Online Rolling-Horizon Evaluation

As reported in Table 6, CAHT achieves TW satisfaction of 61.3% at 3.9 ms per re-allocation, 78× faster than ALNS (304.9 ms). In higher-throughput environments with more frequent re-optimization, CAHT’s speed advantage would become decisive [2].

4.7. Limitations

This work has several limitations. First, the reduced training regime (300 instances/scale, 30 SL + 15 RL epochs) limits convergence; full-scale training is expected to further narrow the 7–13% gap. Second, evaluation relies on synthetic data; validation on real warehouse logs or high-fidelity simulators [1] would strengthen claims. Third, the assignment decoder’s iterative greedy masking (55–68% of latency) may produce cascading suboptimal allocations when early assignments constrain later options; beam search or sampling-based decoding could mitigate this effect. Fourth, the spatial-bias encoder currently uses Euclidean distance, which does not account for physical obstacles such as racks and restricted zones in real warehouses; however, the modular design of the spatial MLP ($g_{ij} = \mathrm{MLP}(d_{ij})$) allows for straightforward substitution with Manhattan distance or A*-based path distances without architectural changes. Fifth, the largest tested scale (20 robots, 200 tasks) remains below real warehouse scales; extension to 100+ robots and 1000+ tasks is a priority for future work.

5. Conclusions

This paper has presented the Constraint-Aware Heterogeneous Transformer (CAHT), a lightweight end-to-end neural architecture for real-time multi-robot task allocation in warehouse environments. The proposed framework combines dynamic feasibility masking, spatial-bias Transformer encoding, and a two-stage supervised–reinforcement learning training paradigm to achieve a favorable speed–quality balance: objective values within 7–13% of the ALNS metaheuristic, with 29–91× faster inference (23–104 ms vs. 2–3 s), sub-6% constraint violation rates, and more than 94% time-window satisfaction.

Ablation analysis identified dynamic feasibility masking as the single most impactful innovation (+213% degradation upon removal)—a finding with implications for the broader NCO community [9,14]. Cross-scale generalization experiments revealed the encouraging pattern that the optimality gap decreases from 13.0% to 10.7% as the problem scale grows, indicating that the architecture learns transferable allocation patterns. With only 0.91 M parameters, CAHT is deployable on edge computing platforms, offering a practical path toward fully autonomous, real-time warehouse coordination [1,2].

We identify three directions for future work: (i) full-scale training with increased data volume and extended RL fine-tuning [18], (ii) validation on real-world warehouse data and high-fidelity robotic simulation platforms [1], and (iii) architectural optimization of the assignment decoder through differentiable constraint relaxation [14] or parallel masking to reduce the dominant latency bottleneck.

Author Contributions

S.G. led the conceptualization, conducted all experiments and data analysis, and drafted the manuscript. O.V. supervised the project, provided resources, and contributed to the manuscript revision. Both authors reviewed and approved the final version of the manuscript. S.G. serves as the corresponding author responsible for the submission and all correspondence. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The synthetic dataset generation code, model implementation, and training scripts developed in this study are available at [https://github.com/GSS789666/CAHT-code.git, accessed on 17 March 2026]. The dataset itself is procedurally generated using the code provided; full generation parameters are described in Section 4.1 of the manuscript, enabling independent reproduction. No pre-existing or third-party datasets were used in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript: CAHT: Constraint-Aware Heterogeneous Transformer; MRTA: Multi-Robot Task Allocation; ALNS: Adaptive Large Neighborhood Search; AGV: Automated Guided Vehicle; AMR: Autonomous Mobile Robot; NCO: Neural Combinatorial Optimization; VRPTW: Vehicle Routing Problem with Time Windows; HF-VRPTW: Heterogeneous Fleet VRPTW; CVR: Constraint Violation Rate; TW: Time Window; SL: Supervised Learning; RL: Reinforcement Learning; GRU: Gated Recurrent Unit; FFN: Feed-Forward Network; MLP: Multi-Layer Perceptron.

References

1. Keith, R.; La, H.M. Review of autonomous mobile robots for the warehouse environment. arXiv 2024, arXiv:2406.08333.
2. Zhen, L.; Tan, Z.; de Koster, R.; He, X.; Wang, S.; Wang, H. Optimizing warehouse operations with autonomous mobile robots. Transp. Sci. 2025, 59, 1130–1152.
3. Msala, Y.; Oussama, H.; Talea, M.; Aboulfatah, M. A novel method for enhancing warehouse operations using heterogeneous robotic systems for autonomous pick-and-deliver tasks. EAI Endorsed Trans. AI Robot. 2025, 4, 1–13.
4. Shakeri, Z.; Benfriha, K.; Varmazyar, M.; Talhi, E.; Quenehen, A. Production scheduling with multi-robot task allocation in a real industry 4.0 setting. Sci. Rep. 2025, 15, 1795.
5. Choi, B.; Kim, M.; Kim, H. An optimization framework for allocating and scheduling multiple tasks of multiple logistics robots. Mathematics 2025, 13, 1770.
6. Metz, L.; Mutzel, P.; Niemann, T.; Schürmann, L.; Stiller, S.; Tillmann, A.M. Delay-resistant robust vehicle routing with heterogeneous time windows. Comput. Oper. Res. 2024, 164, 106553.
7. Voigt, S. A review and ranking of operators in adaptive large neighborhood search for vehicle routing problems. Eur. J. Oper. Res. 2025, 322, 357–375.
8. Liu, S.; Sun, J.; Duan, X.; Liu, G. Parallel adaptive large neighborhood search based on Spark to solve VRPTW. Sci. Rep. 2024, 14, 23809.
9. Darvariu, V.-A.; Hailes, S.; Musolesi, M. Graph reinforcement learning for combinatorial optimization: A survey and unifying perspective. arXiv 2024, arXiv:2404.06492.
10. Chung, K.T.; Lee, C.K.M.; Tsang, Y.P. Neural combinatorial optimization with reinforcement learning in industrial engineering: A survey. Artif. Intell. Rev. 2025, 58, 130.
11. Berto, F.; Hua, C.; Park, J.; Luttmann, L.; Ma, Y.; Bu, F.; Wang, J.; Ye, H.; Kim, M.; Choi, S.; et al. RL4CO: An extensive reinforcement learning for combinatorial optimization benchmark. In Proceedings of the KDD 2025, Toronto, ON, Canada, 3–7 August 2025.
12. Fang, H.; Song, Z.; Weng, P.; Ban, Y. INViT: A generalizable routing problem solver with invariant nested view Transformer. arXiv 2024, arXiv:2402.02317.
13. Gao, C.; Shang, H.; Xue, K.; Li, D.; Qian, C. Towards generalizable neural solvers for vehicle routing problems via ensemble with transferrable local policy. In Proceedings of the IJCAI-24, Jeju, Republic of Korea, 3–9 August 2024; pp. 6914–6922.
14. Bi, J.; Ma, Y.; Zhou, J.; Song, W.; Cao, Z.; Wu, Y.; Zhang, J. Learning to handle complex constraints for vehicle routing problems. arXiv 2024, arXiv:2410.21066.
15. Sioud, R.; Bamoumen, M.; Hamani, N. A novel model for multi-robot task assignment in smart warehouses. In IN4PL 2024; CCIS 2373; Springer: New York, NY, USA, 2025; pp. 343–353.
16. Mozhdehi, A.; Mohammadizadeh, M.; Wang, Y.; Sun, S.; Wang, X. EFECTIW-ROTER: Deep reinforcement learning approach for solving heterogeneous fleet and demand VRPTW. In Proceedings of the ACM SIGSPATIAL 2024, Atlanta, GA, USA, 29 October–1 November 2024; pp. 17–28.
17. Kim, B.S.; Mozhdehi, A.; Wang, Y.; Sun, S.; Wang, X. Clustering-based enhanced ant colony optimization for multi-trip VRP with heterogeneous fleet and time windows. In Proceedings of the IWCTS’24, Atlanta, GA, USA, 29 October 2024; pp. 46–55.
18. Boualamia, H.; Metrane, A.; Hafidi, I.; Mellouli, O. A new adaptation mechanism of the ALNS algorithm using reinforcement learning. Oper. Res. Forum 2025, 6, 105.
19. Ye, H.; Wang, J.; Liang, H.; Cao, Z.; Li, Y.; Li, F. GLOP: Learning global partition and local construction for solving large-scale routing problems in real-time. In Proceedings of the AAAI-24, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 20284–20292.
20. Zheng, Z.; Zhou, C.; Tong, X.; Yuan, M.; Wang, Z. UDC: A unified neural divide-and-conquer framework for large-scale combinatorial optimization problems. Adv. Neural Inf. Process. Syst. 2024, 37, 6081–6125.

Figure 1. Architectural overview of CAHT.

Figure 2. Inference time scaling. CAHT (blue) remains near the 100 ms threshold, whereas ALNS and OR-Tools run at 10³–10⁴ ms.

Figure 3. RL training curves (M scale). (a) Training reward improves over epochs. (b) Policy gradient loss decreases. (c) Validation objective converges toward the SL baseline (dashed) but does not consistently surpass it, motivating CAHT (SL) as the primary model.

Table 1. Dataset configuration and ALNS label quality. N is the number of robots and M the number of tasks. The training set is augmented 4× via coordinate mirroring (Aug. column).

Scale	N	M	Train	Aug.	Test	ALNS Obj.	CVR%	TW%
S	5	50	300	1200	50	1001.3	0.7	99.3
M	10	100	300	1200	50	1781.5	1.9	98.1
L	15	150	300	1200	50	2575.4	3.5	96.5
XL	20	200	–	–	50	3431.4	4.6	95.4
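
Table 1 reports a 4× augmentation of the training set (300 to 1200 instances) via coordinate mirroring. One way to obtain four variants per instance is to combine the identity with a horizontal flip, a vertical flip, and both flips. The snippet below is a minimal sketch of that idea, not the authors' code: the function name `mirror_augment`, the square workspace of side `size`, and reflection about the workspace center lines are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def mirror_augment(points: List[Point], size: float = 1.0) -> List[List[Point]]:
    """Return four variants of a point set: original, y-flipped,
    x-flipped, and flipped along both axes (hypothetical helper)."""
    variants = []
    for flip_x in (False, True):
        for flip_y in (False, True):
            variants.append([
                (size - x if flip_x else x, size - y if flip_y else y)
                for x, y in points
            ])
    return variants

# One toy instance with two locations in the unit square (made-up values)
instance = [(0.25, 0.5), (0.75, 1.0)]
augmented = mirror_augment(instance)
print(len(augmented))   # 4 variants per instance, so 300 instances give 1200
print(augmented[3])     # both axes mirrored
```

Reflections preserve pairwise distances, so a label computed for the original instance should remain valid for each mirrored copy, provided the non-spatial attributes (time windows, demands, energy) are left unchanged.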

Table 2. (a) Small scale (N = 5, M = 50). (b) Medium scale (N = 10, M = 100). (c) Large scale (N = 15, M = 150).

(a)
Method	Obj. ↓	Gap (%)	CVR% ↓	TW% ↑	Makespan	Time (ms)
ALNS (30 s)	1001.3 ± 98.5	0.0	0.7 ± 1.6	99.3 ± 1.6	820.8 ± 129.7	2115.6
Nearest Greedy	1276.5 ± 189.1	+27.5	2.0 ± 2.6	98.0 ± 2.6	810.8 ± 214.0	0.5
OR-Tools (10 s)	1297.1 ± 241.7	+29.5	25.4 ± 11.0	79.4 ± 5.7	926.2 ± 266.3	10,001.5
POMO	2080.7 ± 758.8	+107.8	99.1 ± 2.7	73.4 ± 10.0	1449.8 ± 244.6	7.8
POMO+Repair	1188.7 ± 211.2	+18.7	6.3 ± 4.8	93.7 ± 4.8	937.4 ± 140.0	9.2
CAHT (SL)	1111.9 ± 152.2	+11.0	3.5 ± 3.0	96.5 ± 3.0	961.2 ± 171.6	20.2
CAHT (SL+RL)	1131.2 ± 170.9	+13.0	4.9 ± 3.4	95.1 ± 3.4	954.4 ± 164.2	21.2
(b)
Method	Obj. ↓	Gap (%)	CVR% ↓	TW% ↑	Makespan	Time (ms)
ALNS (30 s)	1781.5 ± 128.7	0.0	1.9 ± 1.7	98.1 ± 1.7	919.2 ± 113.1	3001.8
Nearest Greedy	2222.6 ± 236.5	+24.8	2.6 ± 2.2	97.4 ± 2.2	949.7 ± 192.2	1.7
OR-Tools (10 s)	2191.1 ± 282.5	+23.0	26.7 ± 8.8	79.2 ± 4.7	992.1 ± 242.9	10,003.2
POMO	10,664.3 ± 2865.0	+498.6	98.4 ± 5.4	45.8 ± 8.2	2566.1 ± 321.8	16.8
POMO+Repair	1995.4 ± 188.7	+12.0	3.7 ± 2.4	96.3 ± 2.4	890.3 ± 138.0	19.5
CAHT (SL)	1947.9 ± 184.7	+9.3	4.0 ± 2.3	96.0 ± 2.3	1075.2 ± 150.2	52.5
CAHT (SL+RL)	1997.1 ± 199.4	+12.1	4.9 ± 2.2	95.1 ± 2.2	1070.9 ± 148.9	50.8
(c)
Method	Obj. ↓	Gap (%)	CVR% ↓	TW% ↑	Makespan	Time (ms)
ALNS (30 s)	2575.4 ± 173.0	0.0	3.5 ± 2.0	96.5 ± 2.0	981.3 ± 113.3	3006.1
Nearest Greedy	3109.5 ± 263.1	+20.7	2.6 ± 1.8	97.4 ± 1.8	1018.7 ± 206.2	3.9
OR-Tools (10 s)	3174.6 ± 455.3	+23.3	28.4 ± 5.7	79.0 ± 4.8	1067.6 ± 233.4	10,008.2
POMO	29,024.2 ± 4580.9	+1027.0	99.6 ± 0.6	33.0 ± 5.5	3672.4 ± 311.5	26.5
POMO+Repair	2952.4 ± 257.3	+14.6	3.3 ± 1.7	96.7 ± 1.7	998.7 ± 203.7	33.1
CAHT (SL)	2757.7 ± 240.1	+7.1	4.4 ± 2.5	95.8 ± 2.0	1133.8 ± 134.7	90.7
CAHT (SL+RL)	2867.0 ± 269.4	+11.3	5.7 ± 2.2	94.3 ± 2.2	1146.7 ± 144.1	93.3

In all result tables, the symbol ↓ indicates that lower values are preferred (Obj., Gap, CVR, Makespan, Time), whereas ↑ indicates that higher values are preferred (TW). Best results per metric are highlighted in bold (excluding the ALNS oracle reference).
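
The Gap (%) column in Table 2 appears to be the relative excess of each method's mean objective over the ALNS reference. The short check below is a sketch under that assumption; `gap_percent` is a hypothetical helper name, and the inputs are the mean objectives from Table 2(a).

```python
def gap_percent(obj: float, ref: float) -> float:
    """Relative excess of an objective over a reference value, in percent."""
    return 100.0 * (obj - ref) / ref

# Mean objectives copied from Table 2(a), small scale
alns_ref = 1001.3
methods = [
    ("Nearest Greedy", 1276.5),
    ("CAHT (SL)", 1111.9),
    ("CAHT (SL+RL)", 1131.2),
]
for name, obj in methods:
    print(f"{name}: {gap_percent(obj, alns_ref):+.1f}%")
# Nearest Greedy: +27.5%
# CAHT (SL): +11.0%
# CAHT (SL+RL): +13.0%
```

These values match the tabulated gaps to one decimal place. A gap computed on mean objectives can differ slightly from the mean of per-instance gaps, so the agreement only shows that the formula is consistent with the table.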

Table 3. Ablation study (medium scale).

Variant	Obj. ↓	ΔObj%	CVR% ↓	TW% ↑	Time (ms)
CAHT Full	1997.1 ± 199.4	+0.0	4.9 ± 2.2	95.1 ± 2.2	51.1
w/o RL	1947.9 ± 184.7	−2.5	4.0 ± 2.3	96.0 ± 2.3	49.0
w/o dynamic masking	6250.0	+213.0	79.9	70.1	62.0
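
Table 3 attributes the largest degradation to removing dynamic masking, which the abstract describes as enforcing capacity and energy constraints directly within the softmax computation. The general technique can be sketched as a softmax that assigns zero probability to infeasible robots. This is an illustration only, not the authors' implementation: the feasibility rule (remaining capacity and energy must cover the task's demand and energy cost) and all function and parameter names are assumptions.

```python
import math
from typing import List

def feasibility_mask(capacity: List[float], energy: List[float],
                     demand: float, energy_cost: float) -> List[bool]:
    """Hypothetical rule: a robot is feasible for a task if its remaining
    capacity and remaining energy both cover the task's requirements."""
    return [c >= demand and e >= energy_cost for c, e in zip(capacity, energy)]

def masked_softmax(scores: List[float], feasible: List[bool]) -> List[float]:
    """Softmax over scores in which infeasible entries get probability 0."""
    kept = [s for s, ok in zip(scores, feasible) if ok]
    if not kept:
        raise ValueError("no feasible robot for this task")
    m = max(kept)  # subtract the largest feasible score for numerical stability
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, feasible)]
    total = sum(exps)
    return [e / total for e in exps]

# Three robots competing for one task (made-up numbers)
scores = [2.0, 1.0, 0.5]
mask = feasibility_mask(capacity=[1.0, 5.0, 5.0], energy=[9.0, 9.0, 0.5],
                        demand=2.0, energy_cost=1.0)
probs = masked_softmax(scores, mask)
print(mask)   # [False, True, False]
print(probs)  # [0.0, 1.0, 0.0]
```

In a dynamic variant, the mask would be recomputed after every assignment as each robot's remaining capacity and energy are consumed, so an infeasible robot can never be sampled at any decoding step.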

Table 4. Cross-scale generalization. XL is zero-shot.

Test Scale	Obj. ↓	ΔObj%	CVR% ↓	TW% ↑	Setting
S (5, 50)	1131.2 ± 170.9	+13.0	4.9 ± 3.4	95.1 ± 3.4	In-distr.
M (10, 100)	1997.1 ± 199.4	+12.1	4.9 ± 2.2	95.1 ± 2.2	In-distr.
L (15, 150)	2867.0 ± 269.4	+11.3	5.7 ± 2.2	94.3 ± 2.2	In-distr.
XL (20, 200)	3797.7 ± 332.3	+10.7	6.3 ± 2.0	93.7 ± 2.0	Zero-shot

Table 5. Latency breakdown (CPU, ms).

Scale	Embed.	Encoder	Assign	Seq.	Total
S	0.18	2.81	13.01	7.44	23.44
M	0.19	5.57	35.74	14.55	56.04
L	0.23	9.79	68.14	21.92	100.08
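
The stage columns of Table 5 can be summed and converted to shares to see where inference time is spent. The sketch below uses only the tabulated values; the dictionary layout and the helper `stage_shares` are hypothetical.

```python
# Stage latencies in ms, copied from Table 5
latency = {
    "S": {"Embed.": 0.18, "Encoder": 2.81, "Assign": 13.01, "Seq.": 7.44},
    "M": {"Embed.": 0.19, "Encoder": 5.57, "Assign": 35.74, "Seq.": 14.55},
    "L": {"Embed.": 0.23, "Encoder": 9.79, "Assign": 68.14, "Seq.": 21.92},
}

def stage_shares(stages: dict) -> dict:
    """Percentage of total latency attributable to each stage."""
    total = sum(stages.values())
    return {name: 100.0 * ms / total for name, ms in stages.items()}

for scale, stages in latency.items():
    shares = stage_shares(stages)
    top = max(shares, key=shares.get)
    print(f"{scale}: total {sum(stages.values()):.2f} ms, "
          f"largest stage {top} ({shares[top]:.1f}%)")
```

The sums agree with the Total column to within rounding (the M row sums to 56.05 against a tabulated 56.04), and the assignment decoder is the largest stage at every scale, growing from roughly 56% of the total at S to roughly 68% at L.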

Table 6. Online simulation (N = 10, duration 300 s, λ = 0.3, reallocation interval 10 s).

Method	Comp.	TW% ↑	Wait (s)	Throughput	Solution Time (ms)
Nearest Greedy	93	98.9	5.0	18.60	0.1
ALNS (1 s)	93	93.5	5.0	18.60	304.9
POMO	93	30.1	5.0	18.60	1.5
CAHT	93	61.3	5.0	18.60	3.9
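
The Throughput column in Table 6 carries no unit. The value 18.60 is consistent with completed tasks per minute over the 300 s simulation, which is an inference from the numbers rather than a definition given in the table. A minimal check, with a hypothetical helper name:

```python
def throughput_per_minute(completed: int, horizon_s: float) -> float:
    """Completed tasks per minute over a simulation horizon in seconds."""
    return completed * 60.0 / horizon_s

# 93 completed tasks in 300 s, as reported for every method in Table 6
print(f"{throughput_per_minute(93, 300.0):.2f}")  # 18.60
```

Because all four methods complete the same 93 tasks, throughput does not separate them in this experiment; the methods differ in time-window satisfaction and solution time.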

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gong, S.; Varlamov, O. CAHT: A Constraint-Aware Heterogeneous Transformer for Real-Time Multi-Robot Task Allocation in Warehouse Environments. Algorithms 2026, 19, 312. https://doi.org/10.3390/a19040312

