Heuristics and Learning Models for Dubins MinMax Traveling Salesman Problem

This paper addresses a MinMax variant of the Dubins multiple traveling salesman problem (mTSP). This routing problem arises naturally in mission planning applications involving fixed-wing unmanned vehicles and ground robots. We first formulate the routing problem, referred to as the one-in-a-set Dubins mTSP problem (MD-GmTSP), as a mixed-integer linear program (MILP). We then develop heuristic-based search methods for the MD-GmTSP that use tour construction algorithms to generate initial feasible solutions relatively fast and then improve on these solutions using variants of the variable neighborhood search (VNS) metaheuristic. Finally, we also explore a graph neural network to implicitly learn policies for the MD-GmTSP using a learning-based approach; specifically, we employ an S-sample batch reinforcement learning method on a shared graph neural network architecture and distributed policy networks to solve the MD-GmTSP. All the proposed algorithms are implemented on modified TSPLIB instances, and their performance is evaluated. The results show that learning-based approaches work well for smaller-sized instances, while the VNS-based heuristics find the best solutions for larger instances.


Introduction
The multiple traveling salesman problem (mTSP) is a challenging optimization problem that naturally arises in various real-world routing and scheduling applications. It is a generalization of the well-known traveling salesman problem (TSP) [1,2], where, given a set of n targets and m vehicles, the objective is to find a path for each vehicle such that each target is visited at least once by some vehicle and an objective which depends on the cost of the paths is minimized. The mTSP has many practical applications, including but not limited to industrial robotics [3], transportation and delivery [4], monitoring and surveillance [5], disaster management [6,7], precision agriculture [8], search and rescue missions [9,10], multi-robot task allocation and scheduling [11], print press scheduling [12], satellite surveying networks [13], transportation planning [14], mission planning [15], co-operative planning for autonomous robots [16], and unmanned aerial vehicle planning [17].
The standard mTSP aims to find paths for multiple vehicles without taking into account the dynamics of the vehicles. However, some vehicles, like fixed-wing unmanned aerial vehicles and ground robots, have kinematic constraints that need to be considered while planning their paths. Specifically, these vehicles have yaw rate or turning radius constraints. To address this issue, L.E. Dubins [18] extensively studied the paths for curvature-constrained vehicles moving between two configurations. Dubins considered the model of a vehicle that moves at a constant speed v 0 on a 2D plane, cannot reverse, and has a minimum turning radius ρ; this vehicle is also called the Dubins vehicle in the literature. The configuration of a Dubins vehicle is defined by its position and heading (x, y, θ), where x and y are the vehicle's position coordinates on the plane, and θ is its heading angle. The motion of a Dubins vehicle can be fully defined using the following set of equations: ẋ = v 0 cos θ, ẏ = v 0 sin θ, θ̇ = (v 0 /ρ)u, with u ∈ [−1, 1], where u is the input to the vehicle and ρ is the minimum turning radius of the vehicle. Dubins proposed that in an obstacle-free environment, the shortest path between any two configurations on a 2D plane for a Dubins vehicle must be either the CCC type or the CSC type or a combination of sub-paths of these two types; here, C represents an arc of a circle with a radius of ρ, and S represents a straight line segment. These paths can be expressed as LSL, RSR, RSL, LSR, RLR, or LRL, where L and R indicate counterclockwise (left) and clockwise (right) turns of a C segment, respectively, as illustrated in Figure 1. When the vehicle is also capable of reversing, the shortest path problem was solved by Reeds and Shepp in [19]. Optimal solutions to these problems were also derived by Sussmann and Tang (1991) [20] as well as Boissonnat et al. (1994) [21], utilizing the Pontryagin maximum principle [22]. Tang and Özgüner (2005) [23] and Rathinam et al.
(2007) [24] have further expanded upon Dubins' research by applying it to routing problems involving car-like mobile robots or fixed-wing aerial vehicles that maintain a constant forward speed and adhere to curvature constraints while making turns. The vehicle routing problem for a Dubins vehicle is called the Dubins traveling salesman problem (DTSP) [25]. In the DTSP, the challenge is not only to find the optimal order of the targets to visit but also to determine suitable heading angles for the vehicle when visiting the targets. A feasible solution to the DTSP involves a curved path in which the radius of curvature at any point along the path is at least the minimum turning radius ρ ≥ 0 and each target is visited at least once. Since the problem involves both discrete and continuous decision variables, with the target order being discrete and the heading angles being continuous, finding an optimal solution to the DTSP is an NP-hard problem, as shown by Le Ny et al. [26]. Currently, no exact algorithm exists in the literature that can find an optimal solution to the DTSP. However, over the last two decades, several heuristics and approximation algorithms have been developed to find feasible solutions for the DTSP [23,24,27-34].
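As a sanity check, the kinematic model above is easy to simulate; the sketch below uses simple Euler integration with a constant control input u (the function name and step sizes are illustrative, not from the paper):

```python
import math

def simulate_dubins(x, y, theta, u, v0=1.0, rho=1.0, dt=0.001, steps=1000):
    """Integrate the Dubins kinematics x' = v0*cos(theta), y' = v0*sin(theta),
    theta' = (v0/rho)*u for a constant control u in [-1, 1] via Euler steps.
    u = 0 gives an S segment; u = +1/-1 give minimum-radius C segments."""
    for _ in range(steps):
        x += v0 * math.cos(theta) * dt
        y += v0 * math.sin(theta) * dt
        theta += (v0 / rho) * u * dt
    return x, y, theta
```

For example, with u = 0 the vehicle travels straight along its initial heading, while u = ±1 traces an arc of radius ρ.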

Figure 1. Examples of Dubins path types: RSL, RSR, and LRL.
One popular approach to solving the DTSP, as discussed in several studies [33,35-38], involves sampling the interval [0, 2π] into k candidate headings at each target and posing the resulting problem as a generalized traveling salesman problem (GTSP) formulation.
In the GTSP (also known as the one-in-a-set TSP), given n nodes partitioned into k mutually exclusive sets, the goal is to compute the shortest path that contains exactly one node from each of the k sets. The GTSP is an NP-hard problem since it reduces to the TSP when each cluster contains only one node. The DTSP can be reduced to a one-in-a-set TSP by forming n mutually exclusive sets consisting of k vehicle configurations corresponding to k heading angles at each of the n targets. Increasing the number of angle discretizations at each target for the one-in-a-set TSP variant yields solutions that approach the DTSP optimum. This approach provides a natural way to find a good feasible solution to the DTSP, as noted in previous studies [36]. Figure 2 illustrates the discretization of heading angles at a target for the DTSP. In this work, we explore the MinMax variant of the mTSP for a Dubins-like vehicle by formulating it as a one-in-a-set TSP, called the MinMax Dubins generalized multiple traveling salesman problem (MD-GmTSP), and present three different approaches to solve it. The MinMax variant is an important problem to study as it directly relates the objective to the mission completion time for visiting all the targets by a team of vehicles. This variant is relevant for real-world applications as it accounts for the worst-case scenario in situations where m homogeneous vehicles are assigned travel paths while accounting for limitations on load/delivery capacities or refueling constraints.
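The heading discretization that turns a DTSP instance into a one-in-a-set TSP can be sketched as follows (a minimal illustration; the function name is ours, not from the paper):

```python
import math

def discretize_configurations(targets, h):
    """For each 2D target, generate h candidate configurations (x, y, theta)
    by sampling headings uniformly from [0, 2*pi). The h configurations of a
    target form one cluster of the resulting one-in-a-set TSP: a tour must
    visit exactly one configuration from each cluster."""
    clusters = []
    for (x, y) in targets:
        clusters.append([(x, y, 2.0 * math.pi * c / h) for c in range(h)])
    return clusters
```

With t targets and h headings each, the transformed instance has n = t·h configurations grouped into t clusters.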

The main contributions of this work are as follows:
• We formulate a mixed-integer linear program for the MD-GmTSP (Section 3).
• We develop heuristics to generate feasible solutions for the MD-GmTSP in a relatively short amount of time, followed by a combination of variable neighborhood search (VNS) metaheuristics [39] for improving solution quality (Section 5).
• We explore learning-based methods to solve the MD-GmTSP. An architecture consisting of a shared graph neural network and distributed policy networks is used to define a learning policy for the MD-GmTSP. Reinforcement learning is used to learn the allocation of agents to vertices to generate feasible solutions for the MD-GmTSP, thus eliminating the need for accurate ground truth (Section 6).
• Finally, we implement all the algorithms on a set of modified instances from the TSPLIB library [40] using CPLEX and present simulation results to corroborate the performance of the proposed approaches (Section 7).

Literature Review
The traveling salesman problem (TSP) is one of the most well-known optimization problems in the literature [41]. Several methods, ranging from exact techniques (branch and bound/cut/price) [41] to fast heuristics and approximation algorithms [42,43], have been developed to solve the TSP. Heuristics generally aim to trade off optimality for computational efficiency. The mTSP deals with a generalization of the TSP with multiple vehicles. mTSPs are much harder to solve to optimality compared to the single TSP because targets have to be allocated to vehicles in addition to finding a tour for each of the vehicles. Numerous variants and generalizations of the mTSP have been addressed in the literature. For example, mTSP with motion constraints are addressed in [24,44]; fuel constraints are addressed in [45]; capacity and other resource constraints are addressed in [46,47]. Many recent methods have focused on the application of metaheuristics like genetic algorithms (GA) [48,49], simulated annealing (SA) [50], memetic search [51], tabu search [52], swarm optimization [53], and other methods [54,55] to solve the mTSP problem for the single-depot and multi-depot case.
The mTSP can be classified into two main types based on its objective: the MinSum mTSP and the MinMax mTSP. In the MinSum mTSP, the goal is to minimize the sum of the path costs for visiting all targets by the m vehicles, while in the MinMax mTSP [56-58], the objective is to minimize the maximum among the m path costs. While there are several approximation algorithms for the MinSum mTSP [24], there are very few theoretical results for the MinMax mTSP [59-61]. Furthermore, the MinMax mTSP for vehicles with a turning radius constraint (Dubins-type vehicles) has received limited or no attention in the literature, primarily due to the computational complexity involved in solving the problem.
For the case when all the vehicles are homogeneous without motion constraints, there are several heuristics [62-66] in the literature for solving MinMax mTSPs and related vehicle routing problems. In [62], the authors present four heuristics and found that a linear program (LP)-based heuristic in combination with load balancing and region partitioning ideas performed the best. In [63], the authors present an ant-colony-based metaheuristic for the min-max problem, with limited computational experiments. A transformation-based method that decouples the task partitioning problem amongst the vehicles from the routing problems is presented in [65]. In [66], a new heuristic denoted as MD is presented for a min-max vehicle routing problem. Through extensive computational results, the MD algorithm is shown to be the best amongst the min-max algorithms considered in [66].
Recently, MinMax mTSPs with other constraints have also been considered in surveillance applications. In [67], a mixed-integer linear program (MILP)-based approach is presented along with market-based heuristics for a min-max problem with additional revisit constraints on the targets. In [68], a min-max routing problem is considered where the vehicles are functionally heterogeneous, i.e., there are vehicle-target assignment constraints that specify the subset of targets that each vehicle can visit. Both approximation algorithms and heuristics were presented in [68]. In [69], insertion-based heuristics are presented for a team of underwater autonomous vehicles with functional heterogeneity. An energy-aware VNS is presented for a vehicle routing problem in [70], where there is a fuel or energy constraint for each of the vehicles. There is also recent work [71] on generalized min-max routing problems where the targets are partitioned into sets, and a vehicle is only required to visit exactly one target from each set.

Problem Statement
Let T = {T 1 , T 2 , . . . , T t } be the set of t targets on a 2D plane. Each target is associated with h candidate heading angles. For simplicity, let us assume that these heading angles are obtained by sampling the interval [0, 2π] uniformly, i.e., Φ = {0, 2π/h, . . . , c(2π/h), . . . , 2π}, where c ≤ h. Let {T 0 } denote the initial target (depot). Let (x i , y i ) denote the location of target T i , and let θ i ∈ Φ denote the arrival angle of a vehicle at target T i . The angle of arrival and the angle of departure of a vehicle at a target are assumed to be the same. To simplify the presentation, we also refer to (x i , y i ) as target T i . The configuration of a vehicle at target T i at an angle θ i is denoted by (x i , y i , θ i ) or simply (T i , θ i ).
Let ν = {1, 2, . . . m} be the set of m homogeneous vehicles with a minimum turning radius ρ. Let τ k ⊂ T be the set of targets toured by vehicle k visiting each target exactly once, starting and ending at depot T 0 . Let d k (T i , θ i , T j , θ j ) denote the shortest Dubins path traversed by vehicle k ∈ ν, from (T i , θ i ) to (T j , θ j ). Since the vehicles are homogeneous, d k (T i , θ i , T j , θ j ) remains the same for all k ∈ ν. Let the heading angle at the depot for all the vehicles be the same and be given. The objective of MD-GmTSP is to find a tour for each vehicle and the heading angle at each target in T such that

• Each target in T is visited once by some vehicle at a specified heading angle;
• The tour for each vehicle starts and ends at the depot;
• The length of the longest Dubins tour among the m vehicles is minimized.

Mixed-Integer Linear Programming Formulation
To model the MD-GmTSP as a mixed-integer linear programming problem, we define the related sets, decision variables, and parameters as follows:

Notations
• V := {0, 1, 2, . . . , n}: the set of all possible Dubins vehicle configurations. V is partitioned into mutually exclusive and exhaustive non-empty subsets, where V 0 := {0} corresponds to the configuration at the depot and V 1 , . . . , V t correspond to the t target clusters.
• A := {(i, j) : i, j ∈ V, i ≠ j}: the set of arcs representing the Dubins paths between configurations.
• ν := {1, 2, . . . , m}: the set of homogeneous (uniform) vehicles that serve the targets.
• c kij : the cost of traveling from configuration i to configuration j for vehicle k, where k ∈ ν, i ∈ V p , j ∈ V r , p ≠ r, p, r ∈ T.
• u kp : the rank order (visit order) of cluster V p on the tour of vehicle k ∈ ν, p ∈ T.
• Q := t − m + 1: the maximum number of clusters that a vehicle can visit.

Decision Variables
Let us define the following binary variables:
x kij = 1 if vehicle k ∈ ν traverses arc (i, j) ∈ A, and 0 otherwise;
y kpr = 1 if a path exists from target cluster V p to cluster V r in tour k, p, r ∈ T, k ∈ ν, and 0 otherwise.

Cluster Degree Constraints
For each target cluster V p (excluding V 0 ), there can only be a single outgoing arc to any other target cluster belonging to the tour of any of the m vehicles; this condition is imposed by an outgoing-degree constraint. Similarly, there can only be a single incoming (entering) arc to a target cluster from vertices belonging to other target clusters (excluding V 0 ); this is imposed by an incoming-degree constraint. Finally, for each vehicle, there should be a single arc leaving from and a single arc entering the depot.
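Written in the notation above, these degree constraints would take roughly the following standard GTSP form (a sketch; the paper's exact indexing may differ):

```latex
% Outgoing: exactly one arc leaves each target cluster, over all vehicles
\sum_{k \in \nu} \sum_{\substack{r \in T \cup \{0\} \\ r \neq p}} y_{kpr} = 1, \quad \forall p \in T
% Incoming: exactly one arc enters each target cluster, over all vehicles
\sum_{k \in \nu} \sum_{\substack{p \in T \cup \{0\} \\ p \neq r}} y_{kpr} = 1, \quad \forall r \in T
% Depot: each vehicle leaves and re-enters the depot cluster exactly once
\sum_{r \in T} y_{k0r} = 1, \qquad \sum_{p \in T} y_{kp0} = 1, \quad \forall k \in \nu
```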

Cluster Connectivity Constraints
The entering and leaving nodes should be the same for a given vehicle in each cluster (flow conservation). Flows from target cluster V p to target cluster V r for vehicle k are defined by y kpr ; thus, y kpr should be equal to the sum of the x kij 's from V p to V r .
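In the notation above, these connectivity constraints can be sketched as follows (our reconstruction, following standard generalized-TSP formulations):

```latex
% Flow conservation: a vehicle entering a configuration in cluster V_p
% must also leave from that same configuration
\sum_{j \in V \setminus V_p} x_{kji} = \sum_{j \in V \setminus V_p} x_{kij},
\quad \forall i \in V_p,\; p \in T,\; k \in \nu
% Linking: y_{kpr} equals the total flow from cluster V_p to cluster V_r
\sum_{i \in V_p} \sum_{j \in V_r} x_{kij} = y_{kpr}, \quad \forall p \neq r,\; k \in \nu
```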

Sub-Tour Elimination Constraints
The sub-tour elimination and capacity-bounding constraints are formulated using the auxiliary variables u kp and the bound Q.
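A standard MTZ-style form of these constraints, using the visit orders u kp and the bound Q defined above, would be (a sketch consistent with the stated notation):

```latex
% MTZ-style sub-tour elimination over clusters, per vehicle
u_{kp} - u_{kr} + Q \, y_{kpr} \leq Q - 1, \quad \forall p \neq r,\; p, r \in T,\; k \in \nu
% Capacity bounds on the visit order
1 \leq u_{kp} \leq Q, \quad \forall p \in T,\; k \in \nu
```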

Solving the MILP
To obtain a globally optimal solution for a problem expressed as an MILP, it is common to use solvers such as Gurobi or CPLEX. Although these solvers are effective in solving smaller instances of the Euclidean mTSP (for example, an instance with 100 nodes and 3-5 robots takes up to 3 h to solve using a MILP formulation [72]), they struggle to scale to larger instances that involve multiple robots, especially for difficult problems such as the MD-GmTSP. In the following section, we introduce a variable-neighborhood-search-based heuristic that leverages the problem's structure to generate fast and high-quality solutions.

Heuristics for the MD-GmTSP
In this section, we present a heuristic based on variable neighborhood search (VNS) to solve the MD-GmTSP. VNS is a metaheuristic proposed by Hansen and Mladenovic [39,73] that employs the idea of systematic exploration of neighborhoods of an initial solution and improves it using local search methods. The neighborhood exploration involves descent to a local minimum in the neighborhood of the incumbent solution or an escape from the valleys containing them to obtain better solutions (global minimum).
The heuristic approach can be divided into two broad phases: (1) construction phase-initial solution and (2) improvement phase-neighborhood search. In the construction phase, a tour construction heuristic is used to generate an initial feasible solution for the given problem. Based on this initial solution, the neighborhood search phase aims to obtain better solutions by making incremental improvements. The tour improvement phase starts by defining a set of neighborhood structures in the solution space. These neighborhood search structures, denoted as N 1 , N 2 , . . . , N κ max , can be chosen either arbitrarily or based on a sequence of neighborhood changes with increasing cardinality. Once the neighborhood search structures are fixed, different neighborhoods are explored in deterministic or stochastic ways, depending on the type of VNS metaheuristic used. If a better solution is found in the κ-th neighborhood, then the search is re-centered around the new incumbent.
Otherwise, the search moves to the next neighborhood, κ + 1. Depending on the chosen neighborhood structures, different neighborhoods of the solution space are reached. VNS does not follow a predefined trajectory but explores increasingly distant neighborhoods of the current solution. A jump to a new solution is made only if an improvement is achieved, often retaining favorable characteristics of the incumbent solution to obtain promising neighboring solutions.

Initial Solution: Construction Phase
Construction heuristics are commonly used to address vehicle routing problems, and they can be broadly categorized into three classes: insertion heuristics, savings heuristics, and clustering heuristics. In this work, we focus on insertion-based heuristics for the MD-GmTSP, where the algorithm starts with a closed sub-tour comprising a few targets and progressively inserts new targets into the tour until all the targets are visited. Specifically, we explore two types of insertion heuristics for the MD-GmTSP: (1) greedy k-insertion for MD-GmTSP and (2) cheapest k-insertion for MD-GmTSP.
The key differentiator between these algorithms is the order of target selection and the position at which each target is inserted. The heuristics begin with a tour starting at a depot (initial location). A new target is inserted into the tour such that the increase in tour length is minimized. Greedy insertion algorithms, namely Nearest Insertion and Farthest Insertion, select an unvisited target whose distance from any target in the current tour is minimum and maximum, respectively, for insertion. In contrast, the cheapest-insertion algorithm selects an unvisited target whose insertion causes the lowest increase in tour length. The main steps involved in these construction heuristics for the MD-GmTSP are discussed below.
Consider a set of t targets and their associated heading angles, visited by m Dubins vehicles, and let (T 0 , θ 0 ) correspond to the depot target. To simplify the presentation, we also refer to any target configuration (T i , θ i ) as T i . Let τ v denote a Dubins tour for vehicle v ∈ ν through p (≤ t) targets, starting and ending at the depot. Let d v (T i , T i+1 ) denote the length of the shortest Dubins path from (T i , θ i ) to (T i+1 , θ i+1 ). Consider a target (T j , θ j ) from the set of unvisited targets. The increase in tour length from inserting T j between targets T i and T i+1 in tour τ v , without changing the associated heading angles, is ∆D v (T j ) = d v (T i , T j ) + d v (T j , T i+1 ) − d v (T i , T i+1 ). We wish to insert the new unvisited target T j in a tour τ v at a position such that the increase in the length of tour v (∆D v (T j )) results in a minimum increase in the length of the largest of the m tours. One way of achieving this is by greedily inserting the new target T j into the smallest of the m tours without affecting the rest of the tours.
A relaxed version of the above approach allows modifications to the heading angles of k (≥ 0) targets on either side of the newly inserted target such that the increase in tour length is minimized. This version allows the heading angles of the 2k targets between T i−k and T (i+1)+k to be re-optimized when target T j is inserted into the tour. Algorithm 1 presents a formal description of the greedy k-insertion MD-GmTSP tour construction heuristic.
In the cheapest k-insertion method (Algorithm 2), during each iteration, the algorithm finds the most promising target (from the set of unvisited targets) and the position for insertion in the current tour simultaneously. In this algorithm, we construct new tours by inserting new targets at positions that lead to the smallest increase in tour lengths.

Algorithm 1 Greedy k-insertion for MD-GmTSP.
Initialization: Start with m initial Dubins tours τ v , ∀v ∈ [1, m], starting and ending at the depot (T 0 , θ 0 ).
Repeat while the set of unvisited targets N u is non-empty:
1. Identify the tour τ ν min with the shortest Dubins tour length.
2. Choose a target T j from the list of unvisited targets N u according to one of the following greedy strategies: (a) Random: choose a random target; (b) Nearest: choose the unvisited target closest to the current tours; (c) Farthest: choose the unvisited target farthest from the current tours.
3. Determine a position i in the tour τ ν min such that the increase in tour length is minimized, and insert T j at that position.
4. Remove T j from the set of unvisited targets N u .

Algorithm 2 Cheapest k-insertion for MD-GmTSP
Initialization: Start with m initial Dubins tours τ v , ∀v ∈ [1, m], starting and ending at the depot (T 0 , θ 0 ).
Repeat while the set of unvisited targets N u is non-empty:
1. Identify the tour τ ν min with the shortest Dubins tour length.
2. Identify the target T j from the list of unvisited targets N u and a position i in the tour τ ν min such that the increase in tour length is minimized; insert T j at position i.
3. Remove T j from the set of unvisited targets N u .
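The insertion step shared by Algorithms 1 and 2 can be sketched in code as follows. This is a simplified k = 0 version: `dist(a, b)` stands in for the shortest Dubins path length between two configurations (any metric works for illustration), and the function names are ours:

```python
def insertion_cost(dist, tour, j):
    """Find the position in `tour` (a cyclic sequence, tour[0] the depot)
    where inserting target j causes the smallest length increase
    delta = d(a, j) + d(j, b) - d(a, b) over every tour edge (a, b).
    Returns (best_increase, best_position)."""
    best = (float("inf"), None)
    n = len(tour)
    for i in range(n):
        a, b = tour[i], tour[(i + 1) % n]  # edge (a, b) to be broken
        delta = dist(a, j) + dist(j, b) - dist(a, b)
        if delta < best[0]:
            best = (delta, i + 1)
    return best

def cheapest_insertion(dist, depot, targets, m):
    """Cheapest k-insertion sketch (k = 0): repeatedly pick the shortest of
    the m tours, then insert the unvisited target whose cheapest insertion
    into that tour increases its length the least."""
    tours = [[depot] for _ in range(m)]
    lengths = [0.0] * m
    unvisited = list(targets)
    while unvisited:
        v = min(range(m), key=lambda k: lengths[k])  # shortest tour
        delta, pos, j = min(
            (insertion_cost(dist, tours[v], t) + (t,) for t in unvisited),
            key=lambda r: r[0],
        )
        tours[v].insert(pos, j)
        lengths[v] += delta
        unvisited.remove(j)
    return tours
```

In the full k > 0 variant, each candidate insertion would additionally re-optimize the headings of the 2k neighboring configurations before evaluating the increase.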

Neighborhood Search: Improvement Phase
In the improvement phase, we use VNS-based local search heuristics to enhance the incumbent solution from the construction phase. These heuristics modify the current solution by performing a sequence of operations to produce a new feasible solution that improves the objective function. The algorithm evaluates the effect of modifying the current solution systematically and replaces it with a new solution if it is better. The algorithm may also be allowed to make changes that lead to poorer solutions with the hope of finding a better solution later.
Let I denote a problem instance of the MD-GmTSP and let X represent the set of feasible solutions for this instance. For any x ∈ X, let f (x) denote the objective cost, which, in the case of the MD-GmTSP, corresponds to the Dubins tour length of the largest tour within x. It is important to note that the solution set X is finite, but its size can increase exponentially with the size (number of targets) of an instance of the MD-GmTSP. Our goal is to find a solution x* such that f (x*) ≤ f (x), ∀x ∈ X. To facilitate this search, we define a neighborhood N(x) ⊂ X for each solution x ∈ X, where N is a function that maps a solution x to a set of solutions in its neighborhood. A solution x̄ is defined to be locally optimal (a local optimum) with respect to a neighborhood N if f (x̄) ≤ f (x′), ∀x′ ∈ N(x̄). Improvement heuristics aim to find a locally optimal solution before the termination condition is met. In this paper, we investigate three improvement heuristics based on VNS for solving the MD-GmTSP, namely (1) basic variable neighborhood search (BVNS), (2) variable neighborhood descent (VND), and (3) general variable neighborhood search (GVNS).

Basic Variable Neighborhood Search (BVNS)
BVNS [39] uses a stochastic approach to reach the global minimum. In this approach, the neighborhood search is continued in a random direction, away from the initial solution. The steps of BVNS are explained in Algorithm 3.

Algorithm 3
Steps of the basic VNS by Hansen and Mladenovic [73]
Initialization: Choose a set of neighborhood structures denoted by N κ (κ = 1, . . . , κ max ) for the search process; find an initial solution x; select a termination/stopping condition and repeat the following steps until the termination condition is met:
1. Set κ ← 1.
2. Repeat the following steps until κ = κ max :
(a) Shaking: Generate a feasible solution x′ at random from the κth neighborhood of x (x′ ∈ N κ (x));
(b) Local search: Apply a local search method with x′ as an initial solution; denote the obtained local optimum as x″;
(c) Move or not: If x″ is better than the incumbent x, set x ← x″ and continue the search process with N 1 (κ ← 1); otherwise, set κ ← κ + 1.
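The control flow of the basic VNS above can be sketched as follows; `shake` and `local_search` are problem-specific placeholders (for the MD-GmTSP they would be the shake operators and 2-opt/3-opt searches described later), so this is only a skeleton:

```python
def bvns(x0, cost, shake, local_search, k_max, max_iter=100):
    """Basic VNS skeleton: shake in the k-th neighborhood, descend with the
    local search, and re-center on improvement (move-or-not step).
    `shake(x, k)` must return a random feasible solution from N_k(x);
    `local_search(x)` must return a local optimum reachable from x."""
    x, fx = x0, cost(x0)
    for _ in range(max_iter):
        k = 1
        while k <= k_max:
            x1 = shake(x, k)        # random point in N_k(x)
            x2 = local_search(x1)   # descend to a local optimum
            if cost(x2) < fx:       # move or not
                x, fx = x2, cost(x2)
                k = 1               # restart from the first neighborhood
            else:
                k += 1
    return x, fx
```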

Neighborhood Structures
In this work, we use k-exchange operators to explore neighborhoods of the MD-GmTSP using BVNS. A solution χ′ is considered to be in the k-exchange neighborhood of a solution χ if there is exactly one vehicle V i such that:
1. χ and χ′ differ only in the order of target visits in the tour of vehicle V i ;
2. The tour of vehicle V i in χ′ differs from the tour of vehicle V i in χ by at most k edges.
Shake
'Shaking' is the randomization part of the BVNS heuristic. Shaking perturbs the incumbent solution while ensuring that the new solution still retains certain aspects of it. In this step, a random feasible solution x′ in the current neighborhood of x is found. This random selection enables the algorithm to avoid getting trapped in local optima during the local search procedure. In this work, we investigate two different shake neighborhoods: Shake and Shake mod . Shake refers to picking two random targets T j and T j′ from the tours of two random vehicles v and v′, respectively, and swapping the targets between the tours. The position of insertion for each target is also selected at random. For the MD-GmTSP, in addition to Shake, a modified shake Shake mod is also evaluated. In Shake mod , a random target T j is selected from a random tour τ v and transferred to another tour τ v′ chosen at random. The insertion of targets into tours can be random or greedy (the position leading to a minimal increase in tour length). The different shake neighborhoods are illustrated in Figure 3.
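A minimal sketch of the two shake operators, with tours represented as lists of targets (depot implicit) and random insertion positions; the names mirror the operators above but the representation is ours:

```python
import random

def shake_swap(tours, rng=random):
    """`Shake`: pick two random targets from two random tours, swap them,
    and insert each at a random position in the other tour."""
    new = [list(t) for t in tours]
    v, w = rng.sample(range(len(new)), 2)
    if not new[v] or not new[w]:
        return new
    tj = new[v].pop(rng.randrange(len(new[v])))
    tk = new[w].pop(rng.randrange(len(new[w])))
    new[v].insert(rng.randint(0, len(new[v])), tk)
    new[w].insert(rng.randint(0, len(new[w])), tj)
    return new

def shake_mod(tours, rng=random):
    """`Shake_mod`: move one random target from a random tour into another
    random tour at a random position (a transfer instead of a swap)."""
    new = [list(t) for t in tours]
    v, w = rng.sample(range(len(new)), 2)
    if not new[v]:
        return new
    tj = new[v].pop(rng.randrange(len(new[v])))
    new[w].insert(rng.randint(0, len(new[w])), tj)
    return new
```

Note that `shake_swap` preserves the number of targets per tour, while `shake_mod` changes the target allocation between vehicles.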

Local Search
After obtaining the solution x′ from the Shake procedure, we proceed to enhance it through local search. During this step, all feasible neighbors of x′ within the current neighborhood are explored to identify a potential solution x″ that has a lower cost. It is worth noting that the current neighborhood of x′ may be empty; in that case, we set x″ to be the same as x′, and the search continues in the next neighborhood. Exploration of neighborhoods can be of two types: (a) steepest descent and (b) first descent. The steepest descent and first descent heuristics are detailed in Algorithms 4 and 5, respectively.
For a strongly NP-hard problem like the MD-GmTSP, the steepest descent heuristic is a highly time-consuming process. In this work, a first-improvement version of the 2-opt and 3-opt algorithms is used to explore the different neighborhood structures of the MD-GmTSP.

Algorithm 4 Steps of the steepest descent heuristic [73]
Initialization: Find an initial solution x; choose the neighborhood structure N(x) and the objective function f (x) that calculates the length of the longest tour in x.
Repeat: Find the best neighbor x′ of x in N(x); if f (x′) < f (x), set x ← x′. If all solutions of N(x) have been considered, stop.

2-opt
Given a solution, the 2-opt algorithm, introduced by Croes in 1958 [74], is a local search heuristic that explores all possible 2-exchange neighborhoods of the solution. Its objective is to gradually enhance a feasible tour by iteratively swapping two edges in the current tour with two new edges, aiming to reach a local optimum where no further improvements are possible. During each improving step, the 2-opt algorithm selects two distinct edges (i, i′) and (j, j′) from the tour. It then replaces these edges with the edges (i, j) and (i′, j′), provided that this change reduces the tour length. This process continues until a termination condition is encountered (see Figure 4).
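A first-improvement 2-opt sketch is shown below. Note that `dist` stands in for the shortest Dubins path length between configurations; for true Dubins tours, reversing a segment also changes the headings, so this Euclidean-style sketch only illustrates the move itself:

```python
def two_opt_first_improvement(tour, dist):
    """First-improvement 2-opt: scan pairs of non-adjacent edges
    (i, i+1) and (j, j+1); as soon as replacing them with (i, j) and
    (i+1, j+1) shortens the tour, reverse the segment between them and
    restart. Stops at a 2-optimal (local-minimum) tour."""
    improved = True
    n = len(tour)
    while improved:
        improved = False
        for i in range(n - 1):
            # for i == 0, skip j = n-1 because edge (n-1, 0) shares node 0
            for j in range(i + 2, n - (1 if i == 0 else 0)):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d) - 1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
                    break
            if improved:
                break
    return tour
```

For example, a self-crossing four-node tour is uncrossed by a single 2-exchange, recovering the convex-hull order.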

3-opt
The 3-opt algorithm [75] works similarly to the 2-opt but reconnects three edges instead of two, as seen in Figure 5. There are seven different ways of reconnecting the edges. Each new tour formed by reconnecting the edges is evaluated to find the best one. This process repeats until all possible three-edge combinations in the network have been checked for improvement. A 3-optimal tour is also 2-optimal. A 3-opt exchange yields better solutions but is much slower (O(n 3 ) complexity) compared to the 2-opt (O(n 2 ) complexity).
On completion of the local search, the cost of x″ is compared to the cost of x. If the cost of x″ is less than the cost of x, then x is set to x″, the current neighborhood is reset to the first neighborhood (κ ← 1), and the algorithm continues from the shaking step. Otherwise, if f (x″) ≥ f (x), the solution x″ is discarded and the search continues in the next neighborhood, κ + 1. The algorithm terminates when there are no new neighborhoods to be explored. The steps of neighborhood change are detailed in Algorithm 6.

Algorithm 6
Steps for neighborhood change by Hansen and Mladenovic [73]: if f (x″) < f (x), set x ← x″ and κ ← 1 (re-center the search around the new incumbent); otherwise, set κ ← κ + 1 (move to the next neighborhood).

Variable Neighborhood Descent (VND)
The main difference between the BVNS and the VND approach is that in the VND, the neighborhood changes are deterministic. Algorithm 7 explains the VND metaheuristic in detail.
Depending on the problem, multiple local search methods can be nested to optimize the VND search. The objective of the local search heuristics is to find the local minimum among all the l max neighborhoods. As a result, the likelihood of reaching a global minimum is higher when using VND with a larger l max than when using a single-neighborhood structure.

Algorithm 7
Steps of the basic VND by Hansen and Mladenovic [73]
Initialization: Choose a set of neighborhood structures denoted by N l (l = 1, . . . , l max ) that will be used in the descent process; find an initial solution x; select a termination/stopping condition and repeat the following steps until the termination condition is met:
1. Set l ← 1.
2. Repeat the following steps until l = l max :
(a) Exploration of neighborhood: Find the best neighbor x′ of x (x′ ∈ N l (x));
(b) Move or not: If x′ is better than the incumbent x, set x ← x′ and continue the search process with N 1 (l ← 1); otherwise, set l ← l + 1.

Exploration of Neighborhoods for MD-GmTSP
We use a combination of four different neighborhood structures to explore the neighborhoods of the MD-GmTSP. An x-point move is nested with a k-opt search (2-opt or 3-opt) to explore distant neighborhoods during local search, coupled with greedy and random insertion techniques. The neighborhood structures are discussed in detail below.

• One-point move: Given a solution x′, a one-point move transfers a target T j from tour τ v to a new feasible position in another tour τ v′ (≠ τ v ). The target to be moved is chosen from the tour having the largest tour length and is relocated to a tour having the smallest tour length. The computational complexity of this local search operator is O(n 2 ) for a given solution x′.
• Two-point move: A two-point move swaps a pair of nodes rather than transferring a node between tours as in a one-point move. A target T j belonging to the tour τ v having the largest tour length is swapped with a target T j′ belonging to another tour τ v′ (≠ τ v ). After performing two-point moves from the solution x′, the best solution x″ is returned depending on the search strategy employed (FirstImprovement or SteepestDescent).
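A steepest-descent sketch of the one-point move, with `tour_len` standing in for the Dubins length of a tour (depot legs included); the list-of-lists representation and names are ours:

```python
def one_point_move(tours, tour_len):
    """One-point move (SteepestDescent variant): try relocating each target
    of the longest tour to every position in the shortest tour, keeping the
    relocation that most reduces the MinMax objective. `tours` is a list of
    target lists (one per vehicle); `tour_len(t)` returns the length of t.
    Returns (best_solution_or_None, best_minmax_cost)."""
    lens = [tour_len(t) for t in tours]
    src = max(range(len(tours)), key=lambda k: lens[k])  # longest tour
    dst = min(range(len(tours)), key=lambda k: lens[k])  # shortest tour
    if src == dst:
        return None, max(lens)
    best, best_cost = None, max(lens)
    others = [lens[k] for k in range(len(tours)) if k not in (src, dst)]
    for i, tj in enumerate(tours[src]):
        reduced = tours[src][:i] + tours[src][i + 1:]
        for pos in range(len(tours[dst]) + 1):
            grown = tours[dst][:pos] + [tj] + tours[dst][pos:]
            cost = max([tour_len(grown), tour_len(reduced)] + others)
            if cost < best_cost:
                new = [list(t) for t in tours]
                new[src], new[dst] = reduced, grown
                best, best_cost = new, cost
    return best, best_cost
```

A two-point move would enumerate pairs (one target from each tour) and swap them instead of relocating a single target.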

General Variable Neighborhood Search (GVNS)
The general variable neighborhood search (GVNS) [73] merges the techniques used in both VND (deterministic descent to a local optimum) and BVNS (stochastic search) to reach increasingly distant neighborhoods in the search space. In GVNS, a sequential VND search (VND_seq) improves the incumbent solution, instead of the single local search heuristic used in BVNS. Algorithm 8 details the steps of GVNS as proposed by Hansen and Mladenovic.

Algorithm 8 Steps of the GVNS by Hansen and Mladenovic [73]
Initialization: Choose the set of neighborhood structures denoted by N_κ (κ = 1, . . . , κ_max) that will be used in the shaking phase and the set of neighborhood structures N_l (l = 1, . . . , l_max) that will be used in the local search process; find an initial solution x; choose a termination/stopping condition. Repeat the following steps until the termination condition is met:
1. Set κ ← 1.
2. Repeat the following steps until κ = κ_max:
   (a) Shaking: generate a solution x' at random from the κth neighborhood of x (x' ∈ N_κ(x));
   (b) Local search by VND: set l ← 1 and repeat the following steps until l = l_max:
       - Exploration of neighborhood: find the best neighbor x'' of x' in N_l(x');
       - Move or not: if f(x'') < f(x'), set x' ← x'' and l ← 1; otherwise set l ← l + 1.
   (c) Move or not: if the local optimum x' is better than the incumbent, move there (x ← x') and continue the search with N_1 (κ ← 1); otherwise set κ ← κ + 1.

Neighborhood Search Structures for MD-GmTSP
To ensure the effectiveness of GVNS, it is crucial to employ suitable local search strategies during the execution of the local search by VND (VND_seq). The performance of VND_seq is influenced by the choice of neighborhood structures and the order in which these neighborhoods are explored. A combination of x-point and k-edge move neighborhoods is used in this paper to solve the MD-GmTSP. The details of these neighborhoods are provided below and illustrated in Figure 6.

1. One-point move: In GVNS, we use the same one-point move operator as in BVNS.

2. Two-point move: In GVNS, we use the same two-point move operator as in BVNS.

3. Or-opt2 move: An or-opt2 move selects a string of two adjacent nodes belonging to the tour with the maximum length and transfers it into a new tour. After performing the or-opt2 move for all strings of nodes in x, the best solution x' is returned by the operator depending on the search strategy employed. The or-opt2 operator generates x' ∈ N_2(x), whereas the generic or-optk move operator generates x' ∈ N_k(x). For our use case, no improvement was observed for k > 2, which can be attributed to the complexity of the MD-GmTSP.

4. Three-point move: The three-point move operation involves selecting a pair of adjacent nodes from the tour with the maximum length and exchanging them with a node from another tour. By repeatedly applying three-point moves starting from the initial solution x, the operator generates a new solution x'. Depending on the chosen search strategy, the best solution x' ∈ N_3(x) is returned by the operator.

5. 2-opt move: We use a 2-opt move operator to improve the resulting tours obtained from the other operators by performing intra-tour local optimization.

Local Search by VND for MD-GmTSP
A nested local search strategy is employed to solve the MD-GmTSP using GVNS. The VND_seq procedure specified in Algorithm 9 explores the neighbors of a given solution according to a particular sequence, determined by the non-decreasing order of neighborhood size. If the solution x'' obtained by the local search in a neighborhood is better than the current best x', a 2-opt search is carried out for possible intra-tour improvements, the resulting solution replaces the current best (x' ← x''), and the search restarts at the first neighborhood. Otherwise, the search continues with the next larger neighborhood search operator.

Algorithm 9 VND_seq
1: procedure VND_seq(x', l_max)
2:   l ← 1
3:   while l < l_max do
4:     if l = 1 then
5:       x'' ← onepoint(x')
6:     if l = 2 then
7:       x'' ← twopoint(x')
8:     if l = 3 then
9:       x'' ← or_opt2(x')
10:    if l = 4 then
11:      x'' ← threepoint(x')
12:    if f(x'') < f(x') then
13:      x'' ← 2_opt(x'')
14:      x' ← x''; l ← 1
15:    else
16:      l ← l + 1
17: return x'

Using heuristics to solve NP-hard problems provides a distinct advantage over traditional solvers, as they can produce feasible solutions quickly. The quality of the solution can be further improved by employing improvement heuristics, but this often comes at the cost of computational power and time. Heuristics are rule-based approaches that prioritize computational efficiency over finding the optimal solution. Recently, there has been significant interest in learning-based approaches within the combinatorial optimization research community. These approaches utilize models that can solve large problem instances in real time. The concept behind this approach is that heuristics can be viewed as decision-making policies that can be parameterized using neural networks. These policies can then be trained to solve different combinatorial optimization problems. In the next section, we introduce a reinforcement learning method to develop policies for solving the MD-GmTSP. This method holds promise for achieving high-quality solutions in a shorter amount of time than traditional heuristics.

Learning-Based Approach for the MD-GmTSP
Several machine learning techniques, including neural networks [76], pointer networks [77], attention networks [78,79], and reinforcement learning [80,81], have been explored to solve the TSP. Kool et al. [79] extended a single-agent TSP model to learn strong heuristics for the vehicle routing problem (VRP) and its variants. However, supervised learning, which is successful for most machine learning problems, is not applicable to combinatorial optimization problems due to the unavailability of optimal solutions for large instances, especially for co-operative combinatorial optimization problems such as the multiple-agent TSP, the asymmetric mTSP, and multi-agent task assignment and scheduling problems. These problems pose several challenges, such as an explosion of the search space with the increase in the number of agents and cities, a lack of data with ground-truth solutions, and difficulty in developing an architecture that can capture interactive behaviors among agents. Reinforcement learning (RL) is a commonly explored paradigm for problems with the above difficulties, where the quality of a solution can be verified and provided as a reward to a learning algorithm. Recent studies [82,83] have investigated RL-based approaches to optimize the Euclidean version of the MinMax TSP, but no significant study has been conducted for the Dubins MinMax TSP. The main advantage of the RL-based method is that the majority of computational time and resources are spent on training. Once a trained model is available, a feasible solution can be inferred almost in real time.
In this paper, we utilize the RL framework developed by Hu et al. [82] and expand its application to the Dubins mTSP. The model consists of a shared graph neural network and distributed policy networks that collectively learn a universal policy representation for generating nearly optimal solutions for the Dubins mTSP. To address the challenges of optimization, the extensive search space of the problem is partitioned into two tasks: assigning cities with candidate heading angles to agents and performing small-scale Dubins TSP planning for each agent. As proposed by Hu et al. [82], an S-sample batch reinforcement learning technique is used to address the lack of training data, reducing the gradient approximation variance and significantly enhancing the convergence speed and performance of the RL model.

Framework Architecture
In this work, the graph isomorphism network (GIN) [84] policy is used to summarize the state of the instance graph and assign nodes to each agent for visitation, effectively decomposing the MD-GmTSP into single-agent DTSP problems. This approach facilitates the use of algorithms [85,86], learning-based methods [77,79,80], and solvers [87-89] to quickly obtain near-optimal solutions. Additionally, we utilize a group of distributed policy networks to address the node-to-agent assignment problem.

Graph Embedding
For each node v ∈ V, the graph embedding network GIN [84] computes a p-dimensional feature embedding f_v by sharing information from the neighboring connected nodes according to the graph structure. GIN uses the following update process to parameterize the graph neural network:

	f_v^(k) = MLP^(k)((1 + ε^(k)) · f_v^(k-1) + ∑_{u ∈ N_v} f_u^(k-1)),

where f_v^(k) is the feature vector of node v ∈ V at the k-th iteration/layer, N_v denotes all the neighbors of node v, u is a neighboring node of v (i.e., u ∈ N_v), MLP represents a multi-layer perceptron, and ε is a learnable parameter.
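A single GIN update can be sketched with plain NumPy (a simplified illustration: `W1` and `W2` are assumed weight matrices of a two-layer ReLU MLP shared across nodes, and the sum over neighbors is expressed as a product with the adjacency matrix):

```python
import numpy as np

def gin_layer(F, A, W1, W2, eps=0.0):
    """One GIN update: f_v <- MLP((1 + eps) * f_v + sum_{u in N_v} f_u).
    F is the (n, p) node-feature matrix and A the (n, n) adjacency matrix,
    so A @ F performs the neighborhood sum for every node at once."""
    agg = (1.0 + eps) * F + A @ F       # self term plus neighbor aggregation
    hidden = np.maximum(agg @ W1, 0.0)  # first MLP layer with ReLU
    return hidden @ W2                  # second (linear) MLP layer
```

Stacking several such layers with independent weights yields the k-layer embedding used by the policy.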

Distributed Policy Networks
We propose a two-stage distributed policy network-based approach for designing a graph neural network with the attention mechanism [90]. In the first stage, each agent autonomously generates its agent embedding by leveraging both the global information and the node embeddings present in the graph. In the second stage, each node selects an agent to associate with itself based on its own embedding and all the prior agent embeddings.

Calculation of Agent Embedding
Graph embeddings, denoted as g_f, are computed by max pooling over the set of node features f = {f_1, f_2, . . . , f_n}, i.e., g_f = max(f_1, f_2, . . . , f_n) (element-wise maximum), where f_i ∈ R^p, n is the number of nodes, p is the dimension of the node embedding, and g_f ∈ R^p. These graph embeddings are then used as inputs to construct an agent embedding. To construct the agent embedding, an attention mechanism is employed to generate attention coefficients that signify the significance of a particular node's feature in creating the embedding.

•	Graph context embedding: A graph context embedding is used to ensure that every city, except the depot, is visited by only one agent and that the depot is visited by all agents. By setting the depot as the first node in the graph (f_depot = f_1), we concatenate the depot features with the global embedding to create the graph context embedding f_c ∈ R^2p. The concatenation operation is denoted by [·; ·] and is shown in Equation (12):

	f_c = [g_f; f_1].	(12)

•	Attention mechanism: The attention mechanism [90] is used to convey the importance of a node to an agent a. The keys and values are obtained from the node feature set, and the query for agent a, which is standard for all agents, is computed from the graph context embedding, where d_k and d_v are the dimensions of the keys and values, respectively. A SoftMax is used to compute the attention weights w_ai ∈ [0, 1], which the agent puts on node i:

	w_ai = e^(u_ai) / ∑_j e^(u_aj), i = 2, 3, . . . , n, j = 2, 3, . . . , n,

where u_ai is the compatibility of the query of agent a with the key of node i, given by the scaled dot product u_ai = q_a^T k_i / √d_k.

•	Agent embedding: From the attention weights, we construct the agent embedding as the weighted sum of the values, h_a = ∑_i w_ai v_i.

In this section, we outline the process for determining the policy that assigns an agent to a specific node. To evaluate the importance of each agent to a node, the following equation is applied to each agent individually:

	imp_ai = C tanh(u_ai), i = 2, 3, . . . , n.
Here, d_k is the dimension of the new keys; θ_ak and θ_aq are the neural network parameters that project the embeddings back to d_k dimensions. The importance imp_ai is computed by obtaining u_ai for a new round of attention and clipping the result within [−10, 10] using C tanh(·) with C = 10 [80].
Since each agent is limited to visiting only one city at a time, the agent's importance factor determines which city it visits. A SoftMax function evaluates the probability of an agent visiting a node:

	p_ai = e^(imp_ai) / ∑_{a'} e^(imp_a'i),

where p_ai is the probability of agent a visiting node i; imp_ai is the importance of node i for agent a; a ∈ {1, 2, . . . , m}; and i ∈ {1, 2, . . . , n}. The SoftMax function creates a probabilistic policy that can be optimized using gradient-based techniques. In the context of a policy network, the parameters of the SoftMax function are typically learned from data that reflect the characteristics of the agents being modeled. As such, the policy network may need to be retrained when those characteristics change, such as when different turning radius constraints or a different number of agents are introduced.
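The node-to-agent assignment step can be sketched as follows (a simplified illustration: `H` stands for precomputed agent embeddings and `Wq`, `Wk` are assumed projection matrices, whereas the actual model derives the agent embeddings through the attention mechanism described above):

```python
import numpy as np

def node_to_agent_policy(H, F, Wq, Wk, C=10.0):
    """Assignment-policy sketch: scaled dot-product compatibilities between
    agent embeddings H (m, p) and node features F (n, p) are clipped with
    C * tanh and normalized over agents, so column i of the result is the
    probability distribution over which agent visits node i."""
    Q, K = H @ Wq, F @ Wk
    d_k = Q.shape[1]
    u = (Q @ K.T) / np.sqrt(d_k)                 # (m, n) compatibilities
    imp = C * np.tanh(u)                         # clip importances to [-C, C]
    e = np.exp(imp - imp.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)      # SoftMax over agents
```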

S-Samples Batch Reinforcement Learning
The parameter θ of our model is estimated by maximizing the expected reward of the policy, i.e., θ* = arg max_θ L_R(θ), where

	L_R(θ) = ∑_{x ∈ D} E_{λ ∼ π_θ}[R(λ)] = ∑_{x ∈ D} ∑_λ π_θ(λ) R(λ),	(20)

where D is the training set, λ is an assignment of cities specifying which agent visits each city, R(λ) is the reward of the assignment, and π_θ is the distribution of the assignments given θ, i.e., π_θ(λ) = ∏_{i ∈ {1,...,n}} p_ai. However, the inner sum over all assignments in Equation (20) is intractable to compute, so REINFORCE [91] is used as the estimator for the gradient of the expected reward. An S-sample batch approximation decreases the variance of the gradient estimate and speeds up convergence.
The variance during training is decreased by introducing an advantage function A(λ_s) = R(λ_s) − (1/S) ∑_{s'=1}^{S} R(λ_s'). This results in a new estimate of the gradient of the optimization function:

	∇_θ L_R(θ) ≈ (1/S) ∑_{s=1}^{S} A(λ_s) ∇_θ log π_θ(λ_s).

The pyDubins [92] code computes the reward for every assignment; it calculates the DTSP tour lengths and returns the negative of the maximum tour length over all agents as the reward for the assignment.
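The S-sample batch gradient estimate can be sketched as below (a simplified illustration over a plain logit table rather than the full GIN policy; `reward_fn` stands in for the pyDubins-based tour evaluator):

```python
import numpy as np

def s_sample_gradient(logits, reward_fn, S=10, rng=None):
    """REINFORCE gradient with an S-sample batch baseline (sketch).
    `logits` is an (m, n) table parameterizing p_ai = SoftMax over agents;
    `reward_fn(assignment)` returns the reward (negative max tour length)
    of a node-to-agent assignment. The advantage
    A(lambda_s) = R(lambda_s) - mean_s' R(lambda_s') reduces variance."""
    rng = rng or np.random.default_rng(0)
    m, n = logits.shape
    probs = np.exp(logits - logits.max(axis=0))
    probs /= probs.sum(axis=0)
    # Draw S assignments, one agent per node, from the current policy.
    samples = [np.array([rng.choice(m, p=probs[:, i]) for i in range(n)])
               for _ in range(S)]
    rewards = [reward_fn(a) for a in samples]
    baseline = np.mean(rewards)                  # S-sample batch baseline
    grad = np.zeros_like(logits)
    for assign, r in zip(samples, rewards):
        for i, a in enumerate(assign):
            # Gradient of log p_ai with respect to column i of the logits.
            grad[:, i] += (r - baseline) * (np.eye(m)[a] - probs[:, i])
    return grad / S
```

A gradient-ascent step with this estimate increases the probability of assignments whose reward exceeds the batch average.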

Computational Results
Algorithms were implemented on an Intel NUC Kit (NUC8i7HVK) equipped with an Intel Core i7 processor and 32 GB of RAM. The effectiveness of the proposed methods was evaluated by solving instances from the TSPLIB library [40], which contains sample instances for the TSP and related problems with sizes ranging from 16 to 176 cities. To adapt the TSPLIB instances for the MD-GmTSP, each city was associated with a set of h = 8 candidate heading angles. The minimum turning radius (ρ) was calculated for each instance based on the layout of its cities; specifically, ρ was computed as a function of the pairwise distances between the cities, where (x_i, y_i) and (x_j, y_j) are the 2D coordinates of city i and city j, respectively, for all i, j ∈ I.

MILP Results
Each instance was solved using the MILP formulation for the MD-GmTSP described in Section 3. The CPLEX 12.10 [87] solver was used with a computational time limit of 10,800 s (3 h). Table 1 presents the results of the MD-GmTSP for m = 3 homogeneous vehicles and h = 8 heading angle discretizations. CPLEX was able to generate feasible solutions within the time limit only for instances with fewer than 52 cities and did not find provably optimal solutions for any of the instances. The optimality gap for the best result, obtained on the Ulysses-22 instance with 22 cities, was 14.09%. Table 1. MD-GmTSP results for m = 3 vehicles and h = 8 discretizations on TSPLIB instances using CPLEX 12.10 with a runtime of 3 h (10,800 s). Instances that CPLEX could not solve are indicated by *.

VNS-Based Heuristics Results
The algorithms discussed in Section 5, including the construction heuristics and the variable neighborhood search (VNS)-based improvement heuristics, were implemented in Python3 and tested on the same set of TSPLIB instances for m = 3 vehicles and h = 8 discrete heading angles. Table 2 presents a comparison of the performance of the different construction heuristics (k = 0 relaxation) on the MD-GmTSP. The construction phase allowed angle optimization only for new unvisited targets to be inserted into tours while keeping the rest of the tour unchanged (k = 0). The initial feasible solutions were generated without imposing any time limit. The results show that the cheapest-insertion method produced min-max tours with an average cost 11.81% higher than the nearest-insertion method, while the nearest- and farthest-insertion methods produced tours with identical min-max costs. However, the cheapest-insertion heuristic generated solutions approximately 2.5 times faster than the nearest-insertion method. Table 2. Comparison of MinMax costs of feasible solutions obtained from insertion heuristics for the MD-GmTSP for m = 3 vehicles, h = 8 angle discretizations, and k = 0 relaxation. Times are specified in seconds.

Analysis of VNS Improvements for MD-GmTSP
The impact of different neighborhood structures on improving the MD-GmTSP tours using the BVNS, VND, and GVNS heuristics is shown in Tables 3-5, respectively. These improvement heuristics were executed with a computational time limit of t_max = 600 s (10 min) as the stopping criterion, with FirstImprovement used to switch between neighborhoods. The results in each row of Tables 3-5 represent the average MD-GmTSP solutions over all TSPLIB instances considered. The analysis reveals that the improvement heuristics can achieve solution-quality improvements of up to 30.78%. Furthermore, the average time required to solve these instances ranged from 898.6 s to 1439.96 s. Refer to Figure 7 for sample solutions obtained with the GVNS heuristic. Table 3. Influence of different neighborhood structures in the BVNS heuristic on improving the MD-GmTSP solutions with a t_max = 600 s time limit. Table 6 presents the best-performing combination of construction and improvement heuristics for each instance. The most effective results are achieved using cheapest insertion and nearest insertion as the construction heuristics, together with GVNS with Shake_mod and VND with the one-point move as the improvement heuristics. Refer to Figure 8 for sample solutions corresponding to the att48 instance. It should be noted that solutions generated using the cheapest-insertion construction with VNS have, on average, costs 14.96% higher than the solutions produced by the nearest-insertion construction with VNS. On the other hand, the GVNS + Shake_mod scheme resulted in an average improvement of 3.46% over the initial tours, while the VND + one-point move scheme generated an average improvement of 3.32% over the initial solutions. A comparison between the best solutions obtained from the VNS-based heuristics and the MILP formulation for each instance is given in Table 7.
It is observed that VNS heuristics generally produce superior solutions in significantly less time than the MILP solutions obtained from CPLEX.

Learning-Based Approach
In the learning-based approach, training and testing datasets were generated separately for m Dubins agents and n cities. Each city coordinate was generated uniformly at random in the interval (0, 1)^2 and paired with h = 8 heading angles in the range [0, 2π]. During training, the data were randomly generated at every iteration. To overcome performance limitations, separate models were trained for different numbers of cities and for the radius of curvature (ρ) specific to each instance. A total of 1000 instances were generated for each type and used to train the RL model described in Section 6. The RL model was trained using the Adam optimization method with the following hyperparameters: a learning rate of 1 × 10^−5, a gradient-clipping norm of 3, and S = 10 samples. Figure 9 illustrates the convergence of a model trained on the pr76 dataset after approximately 2000 iterations. The dataset consists of 1000 instances, each comprising 76 cities randomly distributed within the original boundaries of the pr76 instance. The tours for these instances are calculated using a Dubins turning radius of ρ = 980. To assess the effectiveness of the RL model, we compare its solutions to those generated by the heuristics-based approaches for the MD-GmTSP. The trained RL model is used to evaluate the min-max tours for the original pr76 instance, ensuring a valid basis for comparison between the two approaches. Table 8 compares the solutions obtained from the RL model to the best solutions obtained from the MILP and VNS-based formulations. It is important to note that, due to the absence of pre-existing baselines for the Dubins mTSP, it was not possible to determine the optimality gap for the solutions obtained from the RL model. Table 8. Comparison of the best solutions obtained from the RL model against those from the MILP formulation (t_max = 10,800 s) and the VNS-based heuristics (t_max = 600 s).
It can be observed that the RL model outperformed both the MILP formulation and the VNS-based heuristics in solution quality on smaller instances of the problem. However, as the problem instances became larger, the VNS-based heuristics generated superior solutions compared to the RL model. This indicates that the RL model is especially effective for solving MD-GmTSP instances with a relatively small number of cities and Dubins agents, which can be attributed to the fact that the quality of the Dubins tours used during the training phase is higher for smaller instances, while the training-data quality decreases as the instance size grows. The RL model's ability to learn and generalize patterns from the training data enables it to achieve high-quality solutions in these instances. There is scope for additional research and experimentation to enhance the RL model's performance and scalability on larger instances of the MD-GmTSP by leveraging advanced RL techniques and alternative network architectures.

Conclusions
In this paper, we explore different methods to solve MinMax routing problems for a team of Dubins vehicles. We formulate the routing problem as a one-in-a-set mTSP with Dubins paths and refer to it as the MD-GmTSP. We then develop a MILP formulation for the MD-GmTSP and solve it on 16 TSPLIB instances (ranging from 16 to 137 cities) using CPLEX to find optimal solutions. We show that this method does not scale well for large instances involving higher numbers of nodes or robots. As an alternative approach, we develop fast heuristics based on insertion techniques to obtain good feasible solutions for the MD-GmTSP. The results show that, on average, the nearest-insertion algorithm generates solutions with tour costs 11.81% lower than the solutions constructed using the cheapest-insertion algorithm. Subsequently, a combination of VNS-based heuristics with several neighborhood search structures is explored to improve the quality of the initial feasible solutions within a 600 s time limit. The neighborhood structures we studied include 2-opt, 3-opt, k-point, Or-opt, Shake, and Shake_mod. We solved the TSPLIB instances to check the performance of different combinations of heuristics and identified the best combination for solving the MD-GmTSP. Overall, GVNS-based heuristics produced the most promising improvements over the initial solutions. The best-performing combination for the MD-GmTSP was to use the nearest-insertion heuristic to construct the initial tour, followed by a GVNS-type heuristic to improve the solution quality using neighborhood search structures. Finally, we also explore a learning-based method to generate solutions for the MD-GmTSP. Our architecture consists of a shared graph neural network with distributed policy networks that extract a common policy for the Dubins multiple traveling salesmen. An S-sample batch reinforcement learning method trains the model, achieving significant improvements in convergence speed and performance.
The resulting RL model is used to generate fast feasible paths for the MD-GmTSP, and a comprehensive comparison is presented of the solution quality obtained from each of the approaches. Overall, the learning-based methods work well for smaller instances, while the GVNS-based approaches perform better for large instances. Future work can address planning paths in the presence of obstacles and uncertainties in the positions of targets.

Conflicts of Interest:
The authors declare no conflict of interest.