On the Use of Biased-Randomized Transformers as Data-Driven Heuristics for Agile Optimization

Juan, Angel A.; Guerrero, Antoni; Escoto, Marc; Panadero, Javier; Garcia-Sanchez, Alvaro; Resende, Mauricio G. C.

doi:10.3390/info17050504


by Angel A. Juan ¹,²,*, Antoni Guerrero ¹,³, Marc Escoto ¹, Javier Panadero ⁴, Alvaro Garcia-Sanchez ⁵ and Mauricio G. C. Resende ⁶

¹ CIGIP-ValgrAI, Universitat Politècnica de València, Ferrandiz-Carbonell, 03802 Alcoy, Spain

² Business Analytics Department, UNIE Universidad, Av. Monforte de Lemos 28, 28029 Madrid, Spain

³ Baobab Soluciones, 55 Jose Abascal, 28003 Madrid, Spain

⁴ Computer Architecture & OS Department, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain

⁵ Department of Organization Engineering, Business Administration and Statistics, Universidad Politécnica de Madrid, Jose Abascal 2, 28006 Madrid, Spain

⁶ Institute of Science and Technology, Federal University of São Paulo, Rua Pedro Vicente 625, São Paulo 01109, Brazil

* Author to whom correspondence should be addressed.

Information 2026, 17(5), 504; https://doi.org/10.3390/info17050504

Submission received: 5 April 2026 / Revised: 10 May 2026 / Accepted: 18 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue Emerging Research in Optimization Algorithms in the Era of Big Data)


Abstract

This paper proposes the concept of biased-randomized transformers, a novel methodology that combines biased-randomization techniques and transformer-based deep learning for ‘agile’ optimization (i.e., real-time optimization that is carried out iteratively in dynamic systems). On the one hand, biased-randomization techniques have been used in the past to inject controlled randomness into greedy heuristics, thus converting them into probabilistic algorithms capable of generating thousands of good-quality solutions while preserving heuristic logic. On the other hand, transformer models can capture complex patterns across thousands of variables. Once trained, these models can be seen as data-driven heuristics able to provide fast solutions to new instances and adapt to changing inputs. The combination of biased-randomization techniques with trained transformers allows for fast exploration and selection of high-quality solutions to NP-hard combinatorial optimization problems. The paper includes two case studies that illustrate the potential of these biased-randomized transformers.

Keywords:

heuristics; agile optimization; biased randomization; transformers

1. Introduction

Combinatorial optimization problems arise in many practical domains such as logistics, scheduling, routing, and production planning. A classical example is the Vehicle Routing Problem [1], which captures key structural features of real applications. Many of these problems are NP-hard, so exact methods based on branch-and-bound or dynamic programming become impractical as instance sizes grow. As a result, heuristics and metaheuristics are widely used to obtain high-quality solutions within limited computing time [2]. Classical metaheuristics such as Genetic Algorithms, Simulated Annealing, and Tabu Search provide flexible frameworks to explore large search spaces while balancing intensification and diversification. Some authors have explored learning-based approaches that use data to guide the search process [3]. In particular, neural combinatorial optimization and reinforcement learning (RL) methods have shown that models such as attention-based architectures can learn implicit solution structures for routing and scheduling problems, often matching or surpassing human-designed heuristics on specific benchmarks [4,5]. Hybrid approaches that combine learning with classical optimization, for example through large neighborhood search guided by neural policies, have further improved scalability and robustness in complex settings [6].

Relative to Greedy Randomized Adaptive Search Procedures or GRASP [7,8], biased-randomized algorithms (BRAs) provide a simple yet effective mechanism to transform deterministic greedy heuristics into probabilistic constructive procedures [9]. Instead of always selecting the best-ranked candidate, BRAs sample from a probability distribution that is skewed toward high-quality options [10]. This preserves the logic of the underlying heuristic while enabling controlled diversification of the search. Theoretical and empirical studies show that this type of non-uniform randomization can approximate multi-start strategies with limited overhead while maintaining fast construction times [11]. BRAs have been successfully applied to routing, scheduling, and network design problems, where they often achieve high-quality solutions in short times. Their structure also makes them well suited for parallel execution, since multiple independent runs can be generated with minimal coordination. This idea is closely related to the concept of ‘agile optimization’ [12]. In this context, the term refers to a computational setting in which high-quality solutions to a combinatorial problem must be generated under strict time constraints, with the additional requirement that the same problem instance may need to be re-solved repeatedly as input data evolves over time or due to external disruptions. The focus is therefore on methods that enable fast, low-latency re-optimization without retraining or full re-optimization from scratch, typically through lightweight inference and parallel solution generation. In the context of combinatorial optimization, the agile-optimization concept has already been employed by several authors with the same or similar meaning [13,14].

In parallel, transformer models, originally introduced in Natural Language Processing (NLP), have emerged as a powerful tool for sequential decision problems due to their ability to capture long-range dependencies through attention mechanisms [15]. Their application to combinatorial optimization has led to promising results on several benchmark problems, especially when trained in a supervised or RL setting [3]. These models can be interpreted as high-dimensional, data-driven heuristics that map problem instances to solution sequences with a low inference time once trained. However, purely deterministic inference may limit their ability to explore alternative high-quality solutions, which is important in many practical contexts.

In this paper, we propose the concept of biased-randomized transformers (BRTs), a methodology that combines biased randomization techniques [16] with transformer models within an agile optimization framework (Figure 1). The transformer is first trained on a large set of high-quality solutions, so it learns a mapping from problem instances to promising decision sequences. At inference time, we introduce biased randomization into the model’s sequential decision process, for example, by sampling from a geometric probability distribution derived from attention scores or output logits. This allows for the generation of multiple diverse solutions from the same trained model while preserving the learned structural patterns. The resulting approach can be parallelized naturally, enabling real-time generation of high-quality solutions under dynamic inputs. The main contribution of this work is the integration of biased randomization and transformer-based learning into a unified methodology, BRT, that combines the interpretability and efficiency of classical heuristics with the adaptability of data-driven models. This bridge between heuristic reasoning and machine learning provides a practical path toward fast and robust decision making in dynamic combinatorial optimization settings.

The main contributions of this work can be summarized as follows: (i) it introduces a hybrid learning–optimization scheme where a transformer is trained on high-quality solutions and used as a parametric policy for sequential decision-making in combinatorial optimization problems; (ii) it incorporates biased randomization into the inference phase of the transformer, using theoretical probability distributions (e.g., geometric-based sampling) to perturb action selection and generate multiple diverse solution trajectories from a single trained model; (iii) it shows how this mechanism naturally enables parallel solution generation, supporting fast re-optimization in dynamic settings without retraining; and (iv) it provides an integrated methodology that connects transformer-based representation learning with metaheuristic-style exploration, bridging data-driven sequence modeling and probabilistic heuristic search for combinatorial optimization problems.

The rest of the paper is structured as follows. Section 2 offers a brief literature review on related work. Section 3 describes how biased-randomized solution sets could be used to train transformer models. Section 4 introduces the novel concept of biased-randomized transformers. Section 5 presents two case studies that illustrate the potential of the proposed methodology. Section 6 details computational experiments, and Section 7 analyzes the results. Finally, Section 8 concludes the paper with discussion and future work.

2. Related Work

In recent years, there has been noticeable progress in combinatorial optimization driven by the integration of machine learning with classical optimization paradigms. In particular, neural combinatorial optimization (NCO) has emerged as a rapidly evolving research area in which data-driven models are trained to approximate or replace human-designed heuristics. Several surveys emphasize that NCO methods, especially those based on RL and supervised learning, can achieve competitive solution quality while significantly reducing inference time in repeated or real-time decision settings [17,18]. Despite these advances, current approaches still face limitations in generalization, scalability, and robustness across heterogeneous problem instances [19,20,21].

A major research direction focuses on attention-based architectures, particularly transformers, for routing, scheduling, and assignment problems. Building on earlier sequence-to-sequence and pointer network models, transformers have shown strong capability in capturing long-range dependencies and structural patterns in combinatorial optimization problems. Several studies show that transformer-based policies trained via RL can achieve competitive performance in vehicle routing and related problems, often outperforming classical neural architectures in benchmark settings [22,23,24]. However, most existing approaches rely on deterministic or near-deterministic decoding strategies, which limits exploration during inference and may reduce solution diversity in complex landscapes. To address this limitation, several works explore hybrid learning and optimization paradigms that integrate neural policies with classical metaheuristic mechanisms. These include neural-guided large neighborhood search, learned improvement heuristics, and preference-based optimization frameworks that explicitly combine learning signals with structured search procedures [25,26]. These approaches illustrate a growing convergence between machine learning and operations research, where learned components are embedded into classical optimization pipelines rather than replacing them entirely. Improvement-based transformer policies have also been proposed, where solutions are iteratively refined using learned attention mechanisms, improving both generalization and solution quality in routing problems [27,28].

Graph-based neural networks (GNNs) also remain an important line of research. Since many combinatorial optimization problems are naturally represented as graphs, GNNs have been widely adopted to encode structural information. Empirical studies show that GNNs are particularly effective at capturing spatial and relational dependencies in routing and scheduling problems, often in combination with RL or hybrid search strategies [19,20,29]. These models are especially effective in settings where feasibility constraints and topological structure play a dominant role in solution quality. Beyond pure learning approaches, hybrid methods combining neural models with classical optimization techniques have gained increasing attention. These approaches aim to combine the generalization capability of machine learning with the reliability and interpretability of traditional heuristics. Examples include neural-guided search procedures, RL-augmented local search, and learning-enhanced metaheuristics, which use learned policies to guide solution construction and improvement steps [30]. These hybrid frameworks are particularly relevant in industrial settings, where both solution quality and computational efficiency are required.

Another important trend concerns robustness, uncertainty, and real-time adaptability. Many practical applications involve stochastic travel times, dynamic demands, and evolving system constraints, requiring fast and adaptive decision-making. NCO methods have been extended to handle uncertainty by integrating attention mechanisms with robust or distribution-aware optimization objectives [31,32]. These approaches illustrate that learning-based models can effectively address stochastic variants of classical problems while maintaining computational efficiency, although robustness across unseen distributions remains an open challenge. Several contributions also extend transformer and RL methods to more realistic and dynamic logistics settings. For instance, transformer-based RL frameworks have been proposed for dynamic routing problems, where decisions must adapt in real time to changing travel conditions without retraining [33,34]. Other works investigate multi-period and energy-constrained routing problems in smart city logistics, showing how RL and heuristic components can be combined to improve scalability and adaptability in real-world scenarios [5]. Finally, some studies emphasize generalization across instance distributions, constraint tightness, and problem sizes. Empirical evidence suggests that many neural combinatorial optimization models suffer from overfitting to specific training distributions, particularly when constraint regimes vary significantly between training and test instances [28]. This has motivated the development of training strategies that explicitly incorporate diversity in instance generation and structured randomness during both training and inference.

All in all, the literature indicates a clear convergence between RL, transformer-based sequence modeling, and metaheuristic optimization. This convergence motivates hybrid frameworks that combine learned decision policies with probabilistic search mechanisms, aiming to achieve both fast inference and high-quality solution exploration in combinatorial optimization problems. Table 1 summarizes the main research directions in NCO and classical metaheuristics. The table highlights the evolution from classical heuristic methods to learning-based approaches, and finally to hybrid frameworks that integrate RL, transformer architectures, graph-based representations, and metaheuristic principles. Despite these advances, most existing approaches still rely on either fully learned deterministic policies or manually designed stochastic search strategies. The proposed BRT methodology addresses this limitation by combining both aspects in a unified framework, where a trained transformer acts as a parameterized heuristic generator and biased-randomized techniques introduce a stochastic control layer over the decoding process. In this way, BRT enables structured probabilistic perturbations of learned policies, supporting controlled diversification at inference time without retraining.

3. Training Transformers with BRAs

Training transformer models for combinatorial optimization involves learning a parameterized policy that constructs high-quality solutions from problem instance representations. Some authors show how transformer architectures can be trained using RL to solve vehicle routing problems, capturing complex state and decision dependencies through self-attention mechanisms [5,35]. An additional, optional component of our approach is the use of biased-randomized heuristics to generate the training data as well. As previously discussed, BRAs transform deterministic heuristics into probabilistic procedures, introducing controlled randomness while preserving heuristic logic. By running BRAs, thousands of high-quality solutions can be generated for each problem instance in a short time. These solution sets provide diverse yet high-quality trajectories,

π^{(i)}

, that are used as ground truth for training the transformer, allowing the model to learn the structure of high-quality solutions and the decision dependencies encoded by the heuristics.

Formally, a combinatorial instance s (e.g., a team orienteering problem, or TOP, with node set N, rewards

r_{i}

, and vehicles V) is mapped to an initial embedding sequence:

{\hat{w}}_{0} \in \mathbb{R}^{(n + 2 + m) \times d_{k}}

(1)

where

n = | N |

,

m = | V |

, and

d_{k}

is the embedding dimension. The model processes each problem instance in a sequential manner, where a specific decision is made at each step to incrementally construct the final solution. To account for this, a learned timestep embedding is incorporated into the sequence to identify the current step of the trajectory, allowing the transformer to distinguish between different stages of the construction of the solution. Node features such as coordinates, rewards, and vehicle constraints are projected into a latent space using learnable matrices

W_{n}

,

W_{f}

, and

W_{v}

. The transformer then applies masked multi-head self-attention to compute contextualized representations respecting feasibility constraints (via a binary mask

M_{t}

at timestep t):

Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}} + M_{t}) V

(2)

where

Q, K, V

are the query, key, and value matrices. Within this architecture, the features are processed through successive layers with residual connections and feed-forward networks; specifically, the multi-head attention at layer l is computed as

Z_{l}^{h} = softmax (\frac{Q_{l, h} K_{l, h}^{⊤}}{\sqrt{d_{k}}}) V_{l, h}, \quad {MHA}^{l} (h^{(l)}) = [Z_{l}^{1}, Z_{l}^{2}, \dots, Z_{l}^{H}] W_{l}^{out}

(3)

and a feed-forward layer with residual normalization produces the next layer

h^{(l + 1)}

. In alternative contexts, such as maintenance scheduling and routing [5], node and time-window features can also be embedded into the transformer.
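As an illustration of Equation (2), the following is a minimal single-head sketch in plain Python. It assumes the feasibility mask is applied additively, with 0 for an allowed position and negative infinity for a forbidden one, which is a common implementation choice rather than a detail confirmed by the text. The function names `softmax` and `masked_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    if m == -math.inf:
        raise ValueError("every position is masked")
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, mask):
    """Single-head attention: softmax(Q K^T / sqrt(d_k) + M) V.

    Q is a list of query vectors; K and V are lists of key and value
    vectors of equal length. mask[i][j] is 0.0 when query i may attend
    to key j and -inf otherwise (assumed additive mask).
    """
    d_k = len(K[0])
    output = []
    for q, mask_row in zip(Q, mask):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + m_j
            for k, m_j in zip(K, mask_row)
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one component at a time.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output
```

Masked keys receive zero weight, so each output row is a convex combination of the value vectors of the feasible positions only.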

The model learns a stochastic policy

p_{θ} (π | s)

over permutations

π = (π_{1}, π_{2}, \dots)

of decision steps, factorized autoregressively as:

p_{θ} (π | s) = \prod_{t = 1}^{N} p_{θ} (π_{t} | s, π_{1 : t - 1})

(4)

assigning probabilities to each choice at step t conditioned on the state and past decisions. While any transformer-based architecture can be adapted for this task, we focus on the Decision Transformer (DT) [36] framework, which treats optimization as a sequence modeling task. In the DT, timesteps are explicitly modeled as separate tokens that track the progression of the episode, enabling the model to associate specific states and actions with their temporal position within the construction sequence. The DT is trained in a supervised manner using a dataset of (instance, solution) pairs to mimic high-quality solutions of BRAs.
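The factorization in Equation (4) can be sketched as follows. Here `step_distribution` is a hypothetical stand-in for the trained model: it applies a softmax over fixed scores for the items not yet chosen, whereas a real transformer would recompute its scores from the instance and the partial solution at every step.

```python
import math


def step_distribution(scores, chosen):
    """Hypothetical p(. | s, pi_{1:t-1}): softmax over the unchosen items."""
    feasible = [i for i in range(len(scores)) if i not in chosen]
    m = max(scores[i] for i in feasible)
    weights = {i: math.exp(scores[i] - m) for i in feasible}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def trajectory_log_prob(scores, trajectory):
    """log p(pi | s) as the sum of the step-wise conditional log-probabilities."""
    log_p = 0.0
    chosen = []
    for action in trajectory:
        probs = step_distribution(scores, chosen)
        log_p += math.log(probs[action])
        chosen.append(action)
    return log_p
```

Because each step distribution is normalized over the remaining feasible choices, the probabilities of all complete permutations sum to one.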

Alternatively, the model can be trained directly from scratch or fine-tuned using RL to maximize the expected reward

R (π)

. Methods such as REINFORCE [37] or proximal policy optimization (PPO) [38] are commonly employed for this purpose. In the case of REINFORCE, the gradient of the loss function is defined as

\nabla_{θ} L (θ | s) = E_{p_{θ} (π | s)} [(R (π) - b (s)) \nabla_{θ} \log p_{θ} (π | s)]

(5)

where a baseline

b (s)

, such as the average reward from previous trajectories or a critic network in the case of PPO, is used to reduce variance and stabilize the learning process. This hybrid flexibility allows the model to either learn purely from environmental interaction or to refine a pre-trained DT.
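A Monte Carlo estimate of Equation (5) can be sketched on a toy one-step problem. The assumptions here are ours: the policy is a softmax over a plain list of logits (standing in for the transformer parameters), each action has a fixed reward, and the baseline is the mean reward of the sampled batch. The function name `reinforce_gradient` is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, rewards, n_samples, rng):
    """Estimate E[(R - b) * grad log p] with respect to the logits."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    # Baseline b(s): average reward of the sampled actions (an assumption;
    # reusing the same batch adds a small bias but keeps the sketch short).
    baseline = sum(rewards[a] for a in actions) / n_samples
    grad = [0.0] * len(logits)
    for a in actions:
        advantage = rewards[a] - baseline
        for j in range(len(logits)):
            # d log softmax(a) / d logit_j = 1[a == j] - probs[j]
            indicator = 1.0 if j == a else 0.0
            grad[j] += advantage * (indicator - probs[j]) / n_samples
    return grad
```

Moving the logits along this estimate raises the probability of actions whose reward exceeds the baseline and lowers the others.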

4. Biased-Randomization of Trained Transformers

Once a transformer model has been trained on solution sets generated by BRAs, it can be used as a data-driven, high-level heuristic to solve new problem instances. The transformer encodes the logic of thousands of heuristic-based solutions, capturing complex dependencies across nodes, resources, and constraints. Unlike classical heuristics, which rely on a few manually designed rules, the transformer can reason over a high-dimensional space of features, making it capable of identifying high-quality decisions in diverse scenarios. To introduce variability and maintain exploratory power, the transformer can itself be embedded in a biased-randomized framework. At each decision step t, the model outputs a probability distribution over feasible choices:

p_{θ} (π_{t} | s, π_{1 : t - 1})

(6)

Instead of always selecting the highest-probability action (greedy decoding), a skewed probability distribution can be applied, as is common in classical BRAs [16]. One simple and effective choice is the Geometric(

β

) distribution, where

β \in (0, 1)

controls the greediness of the selection; values of

β

close to 1 produce more greedy choices favoring higher-probability candidates, while values close to 0 produce nearly uniform sampling, allowing greater exploration. Formally, let the candidates in

F_{t}

be ranked in descending order according to

p_{θ} (π_{t} | s, π_{1 : t - 1})

. The selection probability for candidate i is then

{\tilde{p}}_{i} = β (1 - β)^{i - 1}, i = 1, 2, \dots, | F_{t} |

(7)

which can be normalized to sum to 1 over the truncated set

F_{t}

if needed. This approach allows fast sampling using analytical expressions for the geometric distribution and requires tuning of only a single parameter

β

.
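The rank-based selection described above can be sketched as follows. The sketch uses the parameterization in which values of β close to 1 concentrate the probability on the top-ranked candidate, matching the verbal description, and it normalizes over the truncated candidate set. The names `geometric_rank_probs` and `br_geo_select` are hypothetical, and `model_probs` stands in for the transformer output at one decision step.

```python
import random


def geometric_rank_probs(n_candidates, beta):
    """Truncated geometric probabilities over ranks 1..n (rank 1 is best)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    raw = [beta * (1.0 - beta) ** i for i in range(n_candidates)]
    total = sum(raw)  # renormalize because the support is truncated
    return [r / total for r in raw]


def br_geo_select(model_probs, feasible, beta, rng):
    """Pick one feasible action by geometric sampling over the model ranking."""
    # Rank feasible actions from most to least probable under the model.
    ranked = sorted(feasible, key=lambda a: model_probs[a], reverse=True)
    weights = geometric_rank_probs(len(ranked), beta)
    return rng.choices(ranked, weights=weights, k=1)[0]
```

Only the ordering induced by the model matters here, not the magnitude of its probabilities, which is what distinguishes this rule from sampling directly from the model distribution.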

Multiple biased-randomized runs of the transformer can be executed in parallel, producing a set of distinct solution trajectories

{π^{(1)}, π^{(2)}, \dots, π^{(R)}}

for the same instance. The final solution is selected according to the objective function

π^{*} = \arg \max_{π^{(r)}} R (π^{(r)})

(8)

where

R (π)

is the problem-specific reward or objective value. Parallel execution ensures that diverse high-quality solutions can be produced in milliseconds, thus enabling agile optimization properties as described in Peyman et al. [13].
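The best-of-R selection in Equation (8) can be sketched with independent, separately seeded runs. This is only an illustration under our own assumptions: a thread pool stands in for the batched execution that a GPU implementation would use, and `toy_sample` is a hypothetical placeholder for one biased-randomized decoding run.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def toy_sample(rng):
    """Hypothetical stand-in for one biased-randomized decoding run."""
    return [rng.randint(0, 9) for _ in range(3)]


def best_of_runs(sample_trajectory, reward, n_runs, base_seed=0):
    """Generate n_runs trajectories independently and keep the best one."""

    def one_run(run_index):
        # Each run gets its own generator, so runs do not share random state.
        return sample_trajectory(random.Random(base_seed + run_index))

    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(one_run, range(n_runs)))
    return max(candidates, key=reward)
```

For example, `best_of_runs(toy_sample, sum, 64)` returns the sampled list with the largest sum. Seeding each run separately keeps the result reproducible regardless of how the runs are scheduled.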

5. Illustrative Case Studies

To evaluate the proposed methodology, we consider two case studies based on well-known combinatorial optimization problems. The objective is to illustrate the behavior of the approach across problems with different structural properties and constraint types. In particular, we analyze a routing problem and a packing problem, which makes it possible to assess the flexibility of the methodology in distinct settings. The following subsections present the application of the framework to the TOP and the bi-dimensional knapsack problem (2KP), respectively. For each case, we describe the problem formulation, the experimental setup, and the inference strategies considered.

5.1. Case Study 1: Closed TOP

Figure 2 represents a traditional team orienteering problem (TOP) [39]. In the closed variant of the TOP, each vehicle in a given set K departs from a depot (node 0), visits a subset of the remaining nodes, and returns to the depot, with every non-depot node visited at most once across all routes. The objective is to maximize the total collected reward while respecting a maximum route length

L_{\max}

for each vehicle. Each non-depot node

i \in {1, 2, \dots, n}

is defined by spatial coordinates

(x_{i}, y_{i})

and a reward

p_{i} > 0

, while distances

d_{i j}

are computed using the Euclidean metric. More formally, let

G = (V, E)

be a complete undirected graph with vertex set

V = {0, 1, \dots, n}

.

Let

x_{i j k}

be a binary variable equal to 1 if vehicle k traverses arc

(i, j)

. The objective maximizes the total collected reward:

\max \sum_{k = 1}^{K} \sum_{i = 1}^{n} p_{i} \sum_{j = 0}^{n} x_{i j k}

(9)

Each non-depot node is visited at most once:

\sum_{k = 1}^{K} \sum_{j = 0}^{n} x_{i j k} \leq 1 \forall i \in {1, 2, \dots, n}

(10)

Each vehicle departs from and returns to the depot:

\sum_{j = 0}^{n} x_{0 j k} = 1, \quad \sum_{i = 0}^{n} x_{i 0 k} = 1 \forall k \in K

(11)

Flow conservation ensures route continuity:

\sum_{i = 0}^{n} x_{i h k} = \sum_{j = 0}^{n} x_{h j k} \forall h \in {1, 2, \dots, n}, \forall k \in K

(12)

Route length constraints are enforced:

\sum_{i = 0}^{n} \sum_{j = 0}^{n} d_{i j} x_{i j k} \leq L_{\max} \forall k \in K

(13)

Subtour elimination constraints ensure a single connected route per vehicle [40]:

u_{i k} - u_{j k} + n x_{i j k} \leq n - 1 \forall i \neq j, i, j \in {1, 2, \dots, n}, \forall k \in K

(14)

The auxiliary variables

u_{i k}

define the position of node i in route k:

1 \leq u_{i k} \leq n \forall i \in {1, 2, \dots, n}, \forall k \in K

(15)

Finally, binary constraints are imposed:

x_{i j k} \in {0, 1} \forall i, j \in V, \forall k \in K

(16)

For evaluation, we generate synthetic TOP instances with different numbers of nodes and vehicles. Instances are solved using the transformer-based RL approach in Guerrero et al. [34], considering its deterministic training configuration. The analysis focuses on the inference phase, evaluating the impact of biased randomization during decoding while keeping the training procedure unchanged. Each instance is solved as a sequential decision process. At each step, the policy selects either a feasible unvisited node or a return-to-depot action. Feasibility constraints ensure that nodes are not revisited and that the remaining travel budget allows for the route to be completed. The process continues until all vehicles finish their routes or no feasible actions remain.
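The feasibility rule described above (no revisits, and enough remaining budget to reach the node and still return to the depot) can be sketched as a simple filter. The function name and argument layout are hypothetical, and the return-to-depot action is assumed to be always available and is therefore not listed.

```python
import math


def feasible_nodes(coords, visited, current, remaining, depot=0):
    """Nodes that can be visited next without exceeding the route budget.

    coords: list of (x, y) points, with the depot at index `depot`.
    visited: set of node indices already served by any vehicle.
    current: index of the node where the vehicle currently is.
    remaining: travel budget still available on the current route.
    """
    candidates = []
    for j in range(len(coords)):
        if j == depot or j in visited:
            continue
        leg = math.dist(coords[current], coords[j])   # travel to node j
        back = math.dist(coords[j], coords[depot])    # then return home
        if leg + back <= remaining:
            candidates.append(j)
    return candidates
```

In a decoder, the complement of this list would be masked out before any of the inference strategies selects an action.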

We consider three inference strategies. The first is deterministic greedy decoding, which selects the action with the highest probability at each step. The second, denoted as BR-Native, samples actions directly from the probability distribution provided by the model itself. The third strategy, denoted as BR-Geo, introduces biased randomization through a geometric rank-based mechanism. In this case, candidate actions are ranked according to their predicted probabilities and selected using a geometric distribution controlled by parameter

β \in (0, 1)

, as explained in Juan et al. [16]. This preserves the model ranking while introducing controlled variability. In both probabilistic strategies, multiple trajectories are generated in parallel and the best solution is selected. Additionally, one trajectory is always generated using greedy decoding to guarantee a strong baseline solution, while the remaining trajectories explore alternative constructions. The model architecture and training procedure follow the approach described in Guerrero et al. [34]. Node features, along with contextual route information, are fed into the transformer model, which generates the solution in an autoregressive manner. To ensure validity, infeasible solutions are prevented through the use of masking mechanisms during the decoding process. The model is trained using RL, specifically employing the REINFORCE algorithm with a roll-out baseline.

5.2. Case Study 2: Bi-Dimensional Knapsack Problem

As a second example (Figure 3), we consider the bi-dimensional 0–1 knapsack problem (2KP), where a set of items must be selected subject to two capacity constraints: volume and weight [41].

Each item i is characterized by volume

v_{i}

, weight

w_{i}

, and value

p_{i}

. Let

x_{i}

be a binary decision variable equal to 1 if item i is selected, and 0 otherwise. The objective is to maximize total value while respecting both capacities:

\max \sum_{i = 1}^{n} p_{i} x_{i}

(17)

s.t.

\sum_{i = 1}^{n} v_{i} x_{i} \leq C_{v}

(18)

\sum_{i = 1}^{n} w_{i} x_{i} \leq C_{w}

(19)

x_{i} \in {0, 1} \forall i = 1, 2, \dots, n

(20)

To evaluate scalability and robustness, synthetic instances are generated with controlled variability. The number of items and their attributes

(v_{i}, w_{i}, p_{i})

are sampled independently from uniform distributions. Capacities are defined as a fraction of total item volume and weight, producing instances that range from loosely to tightly constrained. The 2KP is formulated as an autoregressive sequential decision process. At each step t, the agent selects one item from the feasible set. The state includes the remaining capacities

C_{v}^{(t)}

and

C_{w}^{(t)}

, a binary mask indicating selected items, and the index of the last selected item. These elements summarize the information required to select the next action.

Item features

(v_{i}, w_{i}, p_{i})

are embedded once and reused throughout the sequence. The action space consists of selecting a feasible item or a terminal token. Infeasible actions, including already selected items or capacity-violating choices, are masked. The transition updates the state after each selection. When item i is selected, capacities are reduced accordingly, and the item is marked as used. The process terminates when the terminal token is selected or no further feasible actions exist. During inference, solutions are generated sequentially by the transformer model. As a baseline, we consider greedy decoding. In addition, we also evaluate probabilistic strategies based on biased randomization, including multinomial sampling and geometric rank-based sampling, which enable the generation of multiple candidate solutions per instance. In this probabilistic setting, multiple solution trajectories are explored in parallel by replicating the instance across the batch. As in the TOP case, one trajectory is always generated using the greedy policy to ensure a strong baseline, while the remaining trajectories provide diversification.
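The sequential decision process for the 2KP can be sketched as a masked greedy decoding loop. The `score` argument is a hypothetical stand-in for the trained transformer, and in this sketch the terminal action is taken only when no feasible item remains, whereas the model described above may also choose the terminal token earlier.

```python
def feasible_items(volumes, weights, selected, cap_v, cap_w):
    """Unselected items that fit in both remaining capacities."""
    return [
        i
        for i in range(len(volumes))
        if i not in selected and volumes[i] <= cap_v and weights[i] <= cap_w
    ]


def greedy_decode(volumes, weights, values, cap_v, cap_w, score):
    """Build a solution item by item, always taking the best-scored feasible item."""
    selected = []
    while True:
        feasible = feasible_items(volumes, weights, selected, cap_v, cap_w)
        if not feasible:  # only the terminal action is left
            break
        item = max(feasible, key=score)
        selected.append(item)
        # State transition: reduce both remaining capacities.
        cap_v -= volumes[item]
        cap_w -= weights[item]
    return selected, sum(values[i] for i in selected)
```

Replacing the `max` call with a sampling rule, such as sampling from the model distribution or from a geometric distribution over ranks, yields the probabilistic strategies while leaving the masking and the state transition unchanged.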

In contrast to the TOP, this model is trained using a supervised learning approach. A dataset of problem instances paired with their corresponding optimal solutions was generated to serve as the ground truth. Since the 2KP is inherently order-invariant, data augmentation is implemented by permuting the sequence of items within the optimal solutions. This invariance is explicitly incorporated into the loss function, ensuring that the model does not prioritize a specific insertion order during training. Consequently, the network tries to distribute the probability mass among all items belonging to the optimal subset.

6. Computational Experiments

The methodologies described in Section 5 were implemented in Python 3.12.3 and executed on a workstation equipped with 128 GB of RAM, running Ubuntu 24.04.3, and an NVIDIA GeForce RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA). For stochastic inference strategies, the number of parallel executions was set in both cases to

R = 512

, and the highest-reward trajectory among these was selected as the solution for that instance. Additionally, for the BR-Geo strategy, sampling is biased according to a geometric distribution controlled by a parameter

β

, favoring higher-ranked items while still allowing exploration. We evaluate geometric sampling with three values of

β

:

0.2

,

0.5

, and

0.7

, reflecting different exploration-exploitation trade-offs. Finally, for benchmarking, optimal solutions were computed for all problems in both cases and performance was measured using the optimality gap relative to MILP solutions:

Gap (π) = \frac{Z^{*} - Z (π)}{Z^{*}} \times 100

(21)

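where

π

denotes the selected inference strategy,

Z (π)

is the objective value of the solution it returns, and

Z^{*}

is the optimal objective value. We also report the average inference time per instance to assess suitability for real-time decision-making.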
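Equation (21) translates directly into a small helper. The function name is hypothetical, and the guard on the optimal value is our own addition, assuming a maximization problem with a strictly positive optimum.

```python
def optimality_gap(z_opt, z_found):
    """Percentage gap of a found objective value relative to the optimum."""
    if z_opt <= 0:
        raise ValueError("the optimal value must be positive")
    return (z_opt - z_found) / z_opt * 100.0
```

For a maximization problem the gap is zero when the inference strategy reaches the optimum and grows as the returned solution gets worse.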

6.1. Computational Experiments for the TOP

For the case of the TOP, we generated 16 synthetic instances to evaluate the proposed framework. These instances are divided into two groups: a set of 10 smaller instances, each with 20 to 24 nodes and 2 to 3 vehicles, and a set of 6 larger instances, each with 35 to 45 nodes and 3 to 4 vehicles. All instances were generated to ensure route feasibility. Node and depot coordinates, as well as node rewards, were independently sampled from a uniform distribution on [0, 1] and min–max normalized within each instance. For each vehicle, the distance budget was defined as:

D_{\max} = \lVert s - e \rVert_{2} + 1.5 \cdot U(0, 1) + 0.5

(22)

where \lVert s - e \rVert_{2} denotes the Euclidean distance between the start and end depots. This guarantees that traveling directly between depots is always feasible while introducing controlled variability in route capacity. In this case, optimal solutions were computed using Gurobi (https://www.gurobi.com, accessed on 17 May 2026), providing reference rewards for each instance.
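A generator consistent with this description might look as follows. Several details are assumptions rather than facts from the text: the depots are stored at indices 0 and 1, normalization is applied per coordinate axis, and a fresh U(0, 1) draw is used for each vehicle. The names `min_max` and `generate_top_instance` are hypothetical.

```python
import math
import random


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def generate_top_instance(n_nodes, n_vehicles, rng):
    # Index 0 is the start depot and index 1 the end depot (an assumption).
    xs = min_max([rng.random() for _ in range(n_nodes + 2)])
    ys = min_max([rng.random() for _ in range(n_nodes + 2)])
    rewards = min_max([rng.random() for _ in range(n_nodes)])
    direct = math.dist((xs[0], ys[0]), (xs[1], ys[1]))
    # Equation (22), with one independent U(0, 1) draw per vehicle (an assumption).
    budgets = [direct + rng.random() * 1.5 + 0.5 for _ in range(n_vehicles)]
    return {"coords": list(zip(xs, ys)), "rewards": rewards,
            "direct": direct, "budgets": budgets}


instance = generate_top_instance(n_nodes=20, n_vehicles=2, rng=random.Random(42))
```

Every budget lies between the direct depot-to-depot distance plus 0.5 and that distance plus 2.0, so the empty route is always feasible.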

6.2. Computational Experiments for the 2KP

For the 2KP, instances were generated according to the protocol in Section 5.2, with item counts ranging from 20 to 100. For each item, volume, weight, and value are drawn as v_i, w_i ∼ U(1, 20) and p_i ∼ U(1, 100). Finally, backpack capacities are scaled via a random fill ratio ρ ∼ U(0.2, 0.8) as follows:

C_{v} = ρ \sum_{i = 1}^{n} v_{i}, \quad C_{w} = ρ \sum_{i = 1}^{n} w_{i}

(23)

All generated instances are solved to optimality using the OR-Tools mixed-integer programming solver (https://developers.google.com/optimization, accessed on 17 May 2026), which provides a ground truth for precise evaluation. Instances that do not converge or that have empty feasible solutions are discarded. To analyze scalability, problem instances are directly generated with sizes n ∈ {20, 35, 50, 70, 85, 100}, which makes it possible to evaluate how solution quality and computational efficiency evolve as the instance size increases.
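The sampling protocol of Equation (23) can be sketched as below. Continuous uniform draws are assumed (the text does not say whether item attributes are rounded to integers), and the names `generate_2kp_instance` and `is_feasible` are hypothetical.

```python
import random


def generate_2kp_instance(n, rng):
    volumes = [rng.uniform(1, 20) for _ in range(n)]
    weights = [rng.uniform(1, 20) for _ in range(n)]
    values = [rng.uniform(1, 100) for _ in range(n)]
    rho = rng.uniform(0.2, 0.8)  # single fill ratio shared by both capacities
    return {"v": volumes, "w": weights, "p": values, "rho": rho,
            "C_v": rho * sum(volumes), "C_w": rho * sum(weights)}


def is_feasible(selected, inst):
    """Check both capacity constraints for a collection of item indices."""
    used_v = sum(inst["v"][i] for i in selected)
    used_w = sum(inst["w"][i] for i in selected)
    return used_v <= inst["C_v"] and used_w <= inst["C_w"]


sizes = (20, 35, 50, 70, 85, 100)
instances = {n: generate_2kp_instance(n, random.Random(n)) for n in sizes}
```

Since ρ is at most 0.8, selecting every item always violates both capacities, so each instance requires a genuine selection decision.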

7. Analysis of the Results

This section presents the computational results obtained for the two aforementioned case studies and analyzes the performance of different inference strategies. The results are reported separately for the TOP and the 2KP, allowing us to examine how the different inference strategies behave across problems with distinct combinatorial structures and levels of difficulty.

7.1. Results for the TOP

In addition to biased-randomized inference strategies, three standard stochastic decoding mechanisms were also evaluated: temperature sampling, top-k sampling, and nucleus (top-p) sampling [42,43,44]. Temperature sampling modifies the output distribution by scaling the logits with a temperature parameter T before applying the softmax operation, thus controlling the entropy of the distribution. Higher temperatures increase exploration by flattening the probability distribution, whereas lower temperatures make the policy more deterministic. Top-k sampling restricts the candidate set to the k highest-probability actions and samples proportionally within this reduced subset. Finally, nucleus or top-p sampling dynamically selects the smallest subset of actions whose cumulative probability exceeds a threshold p, and sampling is performed within this adaptive subset. After a parameter calibration process, the following values were selected for each strategy: T = 5, k = 5, and p = 0.9. Table 2 reports the objective values obtained by each method, namely the exact MILP solution, the transformer with greedy inference (Transf.), and the different probabilistic decoding variants: BR-Geo with β = 0.5 (similar results are obtained with other values of this parameter), BR-Native, temperature sampling, top-k sampling, and top-p sampling. The table also provides the average objective values across instances and the corresponding per-instance average optimality gaps.
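The three standard decoding mechanisms can be illustrated on a toy logit vector. The sketch below uses plain Python and hypothetical helper names; it shows how each mechanism reshapes the distribution before sampling, not how the experiments were implemented.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T > 0."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most probable actions and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest prefix of ranked actions whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


logits = [4.0, 2.0, 1.0, 0.0]
base = softmax(logits)                    # T = 1
flat = softmax(logits, temperature=5.0)   # flatter, more exploratory
top2 = top_k_filter(base, k=2)
nucleus = top_p_filter(base, p=0.9)
```

With these toy logits, raising the temperature to 5 lowers the probability of the top action considerably, while both the top-2 and the 0.9-nucleus filters discard the two least likely actions.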

As shown in Table 2, the deterministic greedy policy produces an average optimality gap of 4.51%, indicating that strictly exploitative decoding can lead to suboptimal solutions. In contrast, all probabilistic decoding strategies improve the average solution quality across instances. Figure 4 shows that temperature sampling and BR-Geo achieve the best average performance, obtaining similar average optimality gaps. Both approaches consistently produce near-optimal solutions on the smaller instances and improve the results on the larger instances relative to greedy decoding. Despite the strong average performance of temperature sampling, its practical application requires a careful and often time-consuming calibration process. The temperature parameter, T > 0, strongly influences the balance between exploration and exploitation, and relatively small variations may alter the decoding behavior. By contrast, BR-Geo relies on a simpler and more interpretable mechanism based on rank-preserving geometric perturbations. The geometric parameter, β ∈ (0, 1), directly controls the exploration level while preserving the ordering structure learned by the transformer, which makes BR-Geo easier to calibrate.

Unlike standard multinomial sampling, which directly samples from the transformer’s predicted probabilities, geometric biased randomization reshapes the original distribution into a geometric form. This modification increases the probability of selecting lower-ranked actions while still preserving the overall ranking produced by the transformer. Consequently, BR-Geo introduces controlled diversification without fully disconnecting the search from the learned policy structure. In routing problems, highly concentrated probability distributions may sometimes reflect an excessive reliance on patterns observed during training rather than actual optimality. Introducing geometric bias allows the model to explore alternative feasible constructions that would otherwise remain unexplored. Temperature sampling similarly increases exploration by globally modifying the entropy of the distribution, but BR-Geo achieves this diversification in a different manner by perturbing the ranked action selection process instead of directly altering the logits.
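As a worked illustration of this reshaping, assume that ranks are drawn from a truncated geometric distribution, which is one common form in the biased-randomization literature and is not necessarily the exact expression used in our implementation. For n feasible actions ranked by the transformer, the probability of selecting rank k would then be:

```latex
% Assumed truncated geometric distribution over ranks k = 1, ..., n
P(k) = \frac{\beta (1 - \beta)^{k - 1}}{1 - (1 - \beta)^{n}}, \qquad k = 1, \dots, n
% Worked case: \beta = 0.5 and n = 4, so 1 - (1 - \beta)^{n} = 0.9375
P(1) = \frac{0.5}{0.9375} \approx 0.533, \quad
P(2) = \frac{0.25}{0.9375} \approx 0.267, \quad
P(3) = \frac{0.125}{0.9375} \approx 0.133, \quad
P(4) = \frac{0.0625}{0.9375} \approx 0.067
```

If the transformer assigned, say, probabilities (0.92, 0.05, 0.02, 0.01) to these four actions (a hypothetical example), the top-ranked action would still be the most likely choice, but its selection probability would fall from 0.92 to about 0.53, with the ranking left unchanged.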

From a computational perspective, all transformer-based inference strategies remain highly efficient on GPU. Multi-trajectory sampling with R = 512 trajectories requires approximately 0.08 s per instance on average, with negligible differences between greedy, multinomial, geometric, temperature, top-k, and top-p sampling due to parallelization. In contrast, exact MILP optimization for the same instances requires approximately 18 s to solve the smaller instances on average, and approximately 151 s to solve the larger instances, with noticeable variability across problems. This variability leads Gurobi to solve some instances in less than one second, while others require more than 10 min. By comparison, the transformer-based approach exhibits much more stable runtimes, with minimum and maximum inference times of 0.04 and 0.12 s per instance, respectively. These results indicate that the proposed methodology can obtain near-optimal solutions while consistently maintaining real-time performance.

To provide additional insight into the variability and robustness of the proposed approaches, Figure 5 compares the percentage gaps across methodologies, separated into small and large instances. For small instances, the BR-Geo strategy obtains lower dispersion than both the transformer baseline and BR-Native, indicating a higher level of consistency across solutions. This suggests that BR-Geo improves average solution quality while also providing more stable inference behavior for instances close to the training distribution. For larger instances, the dispersion of BR-Geo and BR-Native becomes more comparable, reflecting the increased difficulty associated with extrapolating beyond the training distribution. Nevertheless, clear differences emerge in terms of worst-case behavior. Both the transformer baseline and BR-Native produce high-gap outliers, reaching approximately 24% and 18%, respectively, whereas BR-Geo limits the maximum observed gap to around 9%. This reduction in extreme failure cases further supports the robustness of the proposed BR-Geo mechanism for agile combinatorial optimization.

7.2. Results for the 2KP

For the 2KP, the results of all methods are summarized in Table 3. The table covers the transformer using greedy inference (Transf.) and the two biased-randomized approaches: BR-Geo with a dynamic β value and BR-Native. For each problem size, it reports the average optimality gap with respect to the optimal MILP solution and the number of instances solved to optimality, as a summary of solution quality across methods. Figure 6 complements this information by visualizing the distribution of percentage gaps across methodologies.

Compared to the TOP, the transformer's greedy inference achieves notably smaller average optimality gaps on the 2KP, with an overall value of 2.11%. Both biased-randomized decoding strategies reduce this gap further, to 0.14% for BR-Geo and 0.16% for BR-Native. This improvement is consistent with the relative simplicity of the 2KP, where feasible solutions are easier to construct and the combinatorial complexity is lower than in the TOP. Although greedy decoding remains reasonably competitive for some larger instances, its performance is notably less consistent overall, especially for smaller and medium-sized instances where larger gaps persist. Both stochastic inference strategies consistently produce near-optimal solutions across all tested instance sizes while increasing the number of optimal solutions found. BR-Geo achieves perfect optimality for the smallest instances (N = 20), whereas BR-Native attains perfect optimality for N = 70. Across all instance sizes, both approaches maintain very small gaps and high optimality counts, showing that probabilistic exploration is highly effective even in this comparatively less complex combinatorial setting. These results suggest that the two stochastic approaches perform very similarly on the 2KP, with only marginal differences between them. This contrasts with the TOP experiments, where differences between inference strategies were more pronounced and exploratory behavior had a larger impact on performance. The results therefore reinforce the idea that the relative benefits of each probabilistic decoding strategy are problem-dependent, and that for simpler problems such as the 2KP, multiple stochastic approaches can reliably achieve near-optimal performance.

From a computational standpoint, both transformer-based inference strategies remain extremely efficient, mirroring the behavior observed for the TOP. Multi-trajectory sampling with R = 512 trajectories requires negligible time per instance, and runtimes remain stable across problem sizes. While exact MILP solvers can solve some 2KP instances in fractions of a second, others may require more time due to problem complexity, making transformer-based inference a fast alternative for larger instances.

7.3. Comparing Results Across Problems

While greedy and geometric decoding show consistent behavior across both the TOP and the 2KP, multinomial sampling (BR-Native) behaves differently in each case, performing better in the 2KP than in the TOP. This performance can be explained by the nature of each problem. In the 2KP, the objective function is order-invariant, meaning that any permutation of a given set of items results in the same total profit, volume and weight. From the model perspective, multiple trajectories lead to the same optimal subset. During training, this lack of a unique sequence prevents the transformer from collapsing the probability mass onto a single item, maintaining a higher policy entropy where probabilities remain relatively distributed among several high-quality candidates. This inherent stochasticity allows multinomial sampling to effectively explore alternative trajectories and escape local optima.

Conversely, the TOP is a sequence-dependent problem where temporal and spatial constraints break this symmetry. As the model converges, it learns to prioritize a narrow set of feasible trajectories, leading to an overconfident (low-entropy) policy that often assigns a probability p > 0.9 to a single transition. This causes standard multinomial sampling to frequently mirror greedy decoding, preventing it from deviating enough to escape local minima. In this context, the BR-Geo inference decouples the search from raw probability values, artificially re-introducing the entropy necessary to bypass the model’s biases and explore the solution space more robustly.
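The entropy argument can be checked numerically on a hypothetical overconfident step. In the sketch below, the policy vector, the truncated geometric weights, and the function names are illustrative assumptions; the point is only to compare how often each sampler reproduces the greedy action.

```python
import math
import random


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def geometric_by_rank(probs, beta):
    """Replace probabilities by truncated geometric weights assigned by rank."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    weights = [beta * (1.0 - beta) ** k for k in range(len(order))]
    total = sum(weights)
    reshaped = [0.0] * len(probs)
    for rank, idx in enumerate(order):
        reshaped[idx] = weights[rank] / total
    return reshaped


policy = [0.92, 0.05, 0.02, 0.01]          # hypothetical low-entropy step
reshaped = geometric_by_rank(policy, beta=0.5)

rng = random.Random(0)
native = rng.choices(range(4), weights=policy, k=10_000)
geo = rng.choices(range(4), weights=reshaped, k=10_000)
share_greedy_native = native.count(0) / len(native)
share_greedy_geo = geo.count(0) / len(geo)
```

In this toy setting, multinomial sampling picks the greedy action in roughly nine draws out of ten, whereas the rank-based geometric sampler does so in roughly half of them, while still ranking the actions in the same order as the policy.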

8. Conclusions

This work introduces biased-randomized transformers, a novel methodology that bridges classical heuristic search with data-driven learning to enable agile optimization in dynamic environments. The primary contribution lies in unifying two powerful paradigms (biased randomization and transformer-based deep learning) into a single and agile optimization approach for solving NP-hard combinatorial problems in real time. Unlike purely deterministic or standard probabilistic models, this approach employs the representational power of transformers to learn high-quality solution structures from thousands of trajectories, while simultaneously preserving the exploratory capacity of geometric-based biased randomization through controlled randomness during inference.

Our experimental evaluation shows that this integration leads to performance improvements over both traditional heuristics and greedy transformer-based approaches. On a suite of synthetic instances, we achieve reductions in optimality gaps across different problem types. For instance, on the closed TOP, different probabilistic inference strategies significantly outperform deterministic greedy decoding. Similar effects were observed for the 2KP, where the objective is order-invariant. In addition, we compare biased-randomized decoding with standard stochastic decoding strategies, including temperature sampling, top-k, and top-p sampling. The results show that while temperature sampling can achieve competitive performance after careful tuning, both top-k and top-p are generally less effective than the proposed geometric biased-randomized strategy. The computational efficiency of BRTs suggests potential applicability to real-time optimization settings. All transformer-based inference strategies require less than 0.1 s per instance on average on a modern GPU with R = 512 parallel trajectories (Tables 2 and 3), compared to the highly variable and often lengthy runtime required by exact solvers such as Gurobi or OR-Tools. This consistent, sub-second performance suggests that BRTs can satisfy agile optimization requirements linked to dynamic environments where rapid response times and frequent re-optimization tasks are necessary, at least for the instance sizes considered in our study.

Several lines of future research are described next: (i) we plan to conduct a more systematic investigation of different β values within the geometric bias approach; (ii) we will extend our evaluation to large-scale instances of team orienteering and vehicle routing problems with hundreds of nodes and multiple vehicles, carefully assessing scalability beyond the instance ranges considered in this study while maintaining sub-second inference times; (iii) like any other constructive heuristic, a trained transformer model could be combined with local search operators and integrated into traditional metaheuristic frameworks when agile optimization is not a strong requirement; and (iv) finally, we aim to incorporate stochastic variants of these problems by extending the proposed BRT into simheuristics [45,46], which combine simulation-based sampling with metaheuristic search.

Author Contributions

Conceptualization, A.A.J. and M.G.C.R.; methodology, A.G.-S., A.G. and A.A.J.; software, A.G., M.E. and J.P.; validation, J.P. and A.G.-S.; formal analysis, A.G. and M.E.; writing—original draft preparation, A.A.J., M.E. and A.G.; writing—review and editing, J.P., A.G.-S. and M.G.C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Ministry of Science, Innovation and Universities/AEI (PID2022-138860NB-I00, AIA2025-163553-C44, DIN2024-013395) and the Generalitat Valenciana (2024 CIAICO 117).

Data Availability Statement

All required data is available from the references.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Toth, P.; Vigo, D. Vehicle Routing: Problems, Methods, and Applications; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014.
2. Salhi, S.; Thompson, J. An overview of heuristics and metaheuristics. In The Palgrave Handbook of Operations Research; Salhi, S., Boylan, J., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 353–403.
3. Wu, Y.; Song, W.; Cao, Z.; Zhang, J.; Lim, A. Learning improvement heuristics for solving routing problems. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 5057–5069.
4. Bengio, Y.; Lodi, A.; Prouvost, A. Machine learning for combinatorial optimization: A methodological tour d’horizon. Eur. J. Oper. Res. 2021, 290, 405–421.
5. Guerrero, A.; Juan, A.A.; Garcia-Sanchez, A.; Pita-Romero, L. Optimizing maintenance of energy supply systems in city logistics with heuristics and reinforcement learning. Mathematics 2024, 12, 3140.
6. Hottung, A.; Tierney, K. Neural large neighborhood search for routing problems. Artif. Intell. 2022, 313, 103786.
7. Resende, M.G.; Ribeiro, C.C. Optimization by GRASP; Springer: Berlin/Heidelberg, Germany, 2016.
8. Bamoumen, M.; Elfirdoussi, S.; Ren, L.; Tchernev, N. An efficient GRASP-like algorithm for the multi-product straight pipeline scheduling problem. Comput. Oper. Res. 2023, 150, 106082.
9. Fernandez, S.A.; Carvalho, M.M.; Silva, D.G. A hybrid metaheuristic algorithm for the efficient placement of UAVs. Algorithms 2020, 13, 323.
10. Bresina, J. Heuristic-biased stochastic sampling. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR, USA, 4–8 August 1996; pp. 271–278.
11. Martí, R.; Lozano, J.A.; Mendiburu, A.; Hernando, L. Multi-start methods. In Handbook of Heuristics; Springer: Berlin/Heidelberg, Germany, 2025; pp. 211–230.
12. Gajula, V.; Rajathy, R. An agile optimization algorithm for vitality management along with fusion of sustainable renewable resources in microgrid. Energy Sources Part A Recovery Util. Environ. Eff. 2020, 42, 1580–1598.
13. Peyman, M.; Copado, P.J.; Tordecilla, R.D.; Martins, L.d.C.; Xhafa, F.; Juan, A.A. Edge computing and IoT analytics for agile optimization in intelligent transportation systems. Energies 2021, 14, 6309.
14. Zhou, M.; Lin, X.; Liang, Y. Agile optimization framework: A framework for tensor operator optimization in neural network. Future Gener. Comput. Syst. 2024, 161, 432–444.
15. Liu, T.; Wang, Y.; Sun, J.; Tian, Y.; Huang, Y.; Xue, T.; Li, P.; Liu, Y. The role of transformer models in advancing blockchain technology: A systematic survey. Eng. Appl. Artif. Intell. 2026, 163, 112968.
16. Juan, A.A.; Faulin, J.; Ferrer, A.; Lourenço, H.R.; Barrios, B. MIRHA: Multi-start biased randomization of heuristics with adaptive local search for solving non-smooth routing problems. Top 2013, 21, 109–132.
17. Wang, F.; He, Q.; Li, S. Solving combinatorial optimization problems with deep neural network: A survey. Tsinghua Sci. Technol. 2024, 29, 1266–1282.
18. Chung, K.T.; Lee, C.K.; Tsang, Y.P. Neural combinatorial optimization with reinforcement learning in industrial engineering: A survey. Artif. Intell. Rev. 2025, 58, 130.
19. Cappart, Q.; Chételat, D.; Khalil, E.B.; Lodi, A.; Morris, C.; Veličković, P. Combinatorial optimization and reasoning with graph neural networks. J. Mach. Learn. Res. 2023, 24, 1–61.
20. Joshi, C.K.; Cappart, Q.; Rousseau, L.M.; Laurent, T. Learning the travelling salesperson problem requires rethinking generalization. Constraints 2022, 27, 70–98.
21. Angioni, D.; Archetti, C.; Speranza, M.G. Neural combinatorial optimization: A tutorial. Comput. Oper. Res. 2025, 182, 107102.
22. Berto, F.; Hua, C.; Zepeda, N.G.; Hottung, A.; Wouda, N.A.; Lan, L.; Park, J.; Tierney, K.; Park, J. RouteFinder: Towards foundation models for vehicle routing problems. Trans. Mach. Learn. Res. 2025, 2025.
23. Kool, W.; van Hoof, H.; Welling, M. Attention, learn to solve routing problems! In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
24. Chi, M.; Pang, W.; Wu, X.; Zhao, P.; Li, Y.; Wang, T.; Qian, J.; Xiao, Y.; Wang, L.; Zhou, Y. A generalized neural solver based on LLM-guided heuristic evoluation framework for solving diverse variants of vehicle routing problems. Expert Syst. Appl. 2026, 296, 128876.
25. Fang, Z.; Wang, D.; Chen, J.; Wang, J.; Zhang, Z. UCPO: A universal constrained combinatorial optimization method via preference optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; Volume 40, pp. 36900–36908.
26. Ma, L.; Hao, X.; Zhou, W.; He, Q.; Zhang, R.; Chen, L. A hybrid neural combinatorial optimization framework assisted by automated algorithm design. Complex Intell. Syst. 2024, 10, 8233–8247.
27. Guan, Q.; Cao, H.; Jia, L.; Yan, D.; Chen, B. Synergetic attention-driven transformer: A deep reinforcement learning approach for vehicle routing problems. Expert Syst. Appl. 2025, 274, 126961.
28. Bi, J.; Ma, Y.; Zhou, J.; Song, W.; Cao, Z.; Wu, Y.; Zhang, J. Learning to handle complex constraints for vehicle routing problems. Adv. Neural Inf. Process. Syst. 2024, 37, 93479–93509.
29. Toenshoff, J.; Ritzert, M.; Wolf, H.; Grohe, M. Graph neural networks for maximum constraint satisfaction. Front. Artif. Intell. 2021, 3, 580607.
30. da Costa, P.R.d.O.; Rhuggenaath, J.; Zhang, Y.; Akcay, A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. PMLR 2020, 129, 465–480.
31. Xiao, P.; Zhang, Z.; Chen, J.; Wang, J.; Zhang, Z. Neural combinatorial optimization for robust routing problem with uncertain travel times. Adv. Neural Inf. Process. Syst. 2024, 37, 134841–134867.
32. Wang, Y.; Liang, X. Application of reinforcement learning methods combining graph neural networks and self-attention mechanisms in supply chain route optimization. Sensors 2025, 25, 955.
33. Ammouriova, M.; Guerrero, A.; Tsertsvadze, V.; Schumacher, C.; Juan, A.A. Using reinforcement learning in a dynamic team orienteering problem with electric batteries. Batteries 2024, 10, 411.
34. Guerrero, A.; Escoto, M.; Ammouriova, M.; Men, Y.; Juan, A.A. Using transformers and reinforcement learning for the team orienteering problem under dynamic conditions. Mathematics 2025, 13, 2313.
35. Yan, D.; Guan, Q.; Ou, B.; Yan, B.; Cao, H. Graph-driven deep reinforcement learning for vehicle routing problems with pickup and delivery. Appl. Sci. 2025, 15, 4776.
36. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 15084–15097.
37. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
38. Gu, Y.; Cheng, Y.; Chen, C.P.; Wang, X. Proximal policy optimization with policy feedback. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 4600–4610.
39. Mısır, M.; Gunawan, A.; Vansteenwegen, P. Algorithm selection for the team orienteering problem. In Proceedings of the European Conference on Evolutionary Computation in Combinatorial Optimization (Part of EvoStar); Springer: Berlin/Heidelberg, Germany, 2022; pp. 33–45.
40. Palomo-Martínez, P.J.; Salazar-Aguilar, M.A.; Albornoz, V.M. Formulations for the orienteering problem with additional constraints. Ann. Oper. Res. 2017, 258, 503–545.
41. Pisinger, D.; Toth, P. Knapsack problems. In Handbook of Combinatorial Optimization: Volume 1–3; Springer: Berlin/Heidelberg, Germany, 1998; pp. 299–428.
42. Wiher, G.; Meister, C.; Cotterell, R. On decoding strategies for neural text generators. Trans. Assoc. Comput. Linguist. 2022, 10, 997–1012.
43. Zhu, Y.; Li, J.; Li, G.; Zhao, Y.; Jin, Z.; Mei, H. Hot or cold? Adaptive temperature sampling for code generation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 437–445.
44. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wen, J.R. Decoding and Deployment. In Large Language Models; Springer: Berlin/Heidelberg, Germany, 2025; pp. 259–301.
45. Panadero, J.; Juan, A.A.; Bayliss, C.; Currie, C. Maximising reward from a team of surveillance drones: A simheuristic approach to the stochastic team orienteering problem. Eur. J. Ind. Eng. 2020, 14, 485–516.
46. Nessari, S.; Tavakkoli-Moghaddam, R.; Bakhshi-Khaniki, H.; Bozorgi-Amiri, A. A hybrid simheuristic algorithm for solving bi-objective stochastic flexible job shop scheduling problems. Decis. Anal. J. 2024, 11, 100485.

Figure 1. Overview of the biased-randomization transformer model methodology.

Figure 2. A visual representation of the (open) team orienteering problem.

Figure 3. A visual representation of the bi-dimensional knapsack problem.

Figure 4. Boxplot comparison of the percentage gaps to the optimal solution obtained by the different decoding strategies for the TOP.

Figure 5. Boxplot comparison of percentage gaps to the optimal solution for the TOP, grouped by instance size (small and large).

Figure 6. Boxplot comparison of percentage gaps to the optimal solution for the 2KP.

Table 1. Summary of research lines in neural combinatorial optimization.

Research Line	Main Idea	Representative Methods
Classical metaheuristics	Constructive and improvement heuristics based on deterministic or stochastic search rules	GRASP, tabu search, simulated annealing, biased-randomized heuristics [11,16]
Representation learning for CO	Encoding combinatorial structure via graphs or attention mechanisms to parameterize policies	GNN-based models and attention architectures for routing and SAT-like problems [19,20,23]
RL-based combinatorial optimization	Sequential decision-making trained via reward maximization for constructive or improvement heuristics	Pointer networks, actor-critic routing, Decision Transformer-style policies [23,24]
Hybrid learning and metaheuristics	Integration of learned policies within classical search procedures for improved exploration	Neural LNS, learned improvement heuristics, preference-optimization frameworks [6,25,26]
Robust and adaptive optimization	Learning under uncertainty, dynamics, and distribution shifts in problem instances	Stochastic routing, dynamic VRP, robust neural policies [31,32,33]
Proposed approach (BRT)	Combination of transformer modeling with biased-randomization for controlled exploration	Probabilistic decoding over learned transformer policies with heuristic-guided diversification

Table 2. Performance comparison across inference strategies on the TOP instances.

Problem	MILP	Transf.	BR-Geo (0.5)	BR-Native	Temp. (5)	Top-k (5)	Top-p (0.9)
P1	11.72	11.46	11.68	11.46	11.68	11.46	11.46
P2	6.28	6.28	6.28	6.28	6.28	6.28	6.28
P3	7.71	7.32	7.69	7.32	7.69	7.32	7.32
P4	7.47	7.06	7.06	7.06	7.06	7.06	7.06
P5	8.73	8.19	8.73	8.19	8.73	8.19	8.19
P6	10.81	10.71	10.71	10.71	10.71	10.71	10.71
P7	9.30	9.30	9.30	9.30	9.30	9.30	9.30
P8	11.04	10.61	10.99	10.75	10.99	10.75	10.61
P9	11.20	10.92	11.20	10.94	11.10	10.92	10.94
P10	11.54	11.52	11.52	11.52	11.54	11.52	11.52
P11	16.55	16.55	16.55	16.55	16.55	16.55	16.55
P12	19.85	18.46	19.18	19.36	19.37	18.68	18.48
P13	26.21	22.96	24.43	24.52	24.58	23.90	24.35
P14	16.26	16.03	16.18	16.18	16.26	16.03	16.03
P15	14.32	10.89	13.16	11.85	13.12	12.35	11.44
P16	17.63	17.45	17.63	17.63	17.63	17.63	17.63
Average	12.91	12.23	12.64	12.48	12.66	12.42	12.37
Avg. Gap (%)	–	4.51	1.65	3.23	1.60	3.86	3.44
Avg. Time (s)	68.46	0.08	0.08	0.08	0.08	0.08	0.08
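The per-instance average gaps in the second-to-last row can be recomputed from the objective values above. The short sketch below does so for the Transf. and BR-Geo columns. The lists are copied from Table 2, and small deviations from the reported figures are expected because the table entries are rounded to two decimals.

```python
# Objective values for P1..P16, copied from Table 2.
milp = [11.72, 6.28, 7.71, 7.47, 8.73, 10.81, 9.30, 11.04,
        11.20, 11.54, 16.55, 19.85, 26.21, 16.26, 14.32, 17.63]
transf = [11.46, 6.28, 7.32, 7.06, 8.19, 10.71, 9.30, 10.61,
          10.92, 11.52, 16.55, 18.46, 22.96, 16.03, 10.89, 17.45]
br_geo = [11.68, 6.28, 7.69, 7.06, 8.73, 10.71, 9.30, 10.99,
          11.20, 11.52, 16.55, 19.18, 24.43, 16.18, 13.16, 17.63]


def average_gap(optimal, obtained):
    """Mean of the per-instance gaps of Equation (21), in percent."""
    gaps = [(z_opt - z) / z_opt * 100.0 for z_opt, z in zip(optimal, obtained)]
    return sum(gaps) / len(gaps)


avg_gap_transf = average_gap(milp, transf)   # close to the reported 4.51
avg_gap_br_geo = average_gap(milp, br_geo)   # close to the reported 1.65
```

Averaging the gaps instance by instance, as done here, is not the same as computing a single gap from the average objective values, which would give a smaller figure dominated by the larger instances.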

Table 3. Performance comparison across inference strategies on the 2KP instances.

N Obj.	Transf. Gap (%)	Transf. N Opt.	BR-Geo (dyn) Gap (%)	BR-Geo (dyn) N Opt.	BR-Native Gap (%)	BR-Native N Opt.
20	4.27	0	0.00	10	0.01	9
35	2.34	1	0.21	6	0.35	6
50	2.86	0	0.18	3	0.21	4
70	0.53	1	0.07	6	0.00	10
85	1.57	0	0.10	4	0.14	5
100	1.09	0	0.26	4	0.28	1
Average	2.11	0.33	0.14	5.5	0.16	5.8
Avg. Time (s)	0.08		0.09		0.09

