1. Introduction
The history of the travelling salesman problem (TSP) dates back to a book published in 1832 which mentioned optimal routes for the first time [
1]. The first person who formulated this as a mathematical optimization problem was probably Karl Menger [
2]. At a mathematical colloquium in Vienna in 1930, he talked about “the task of finding the shortest path connecting the points for a finite number of points whose pairwise distances are known.” This problem can be stated with the following cost function to be minimized:

$$C(\pi) = \sum_{i=1}^{N-1} d\left(p_{\pi(i)}, p_{\pi(i+1)}\right) + d\left(p_{\pi(N)}, p_{\pi(1)}\right),$$

where the selected points are determined by a permutation vector π of length N. The cost function C(π) is the sum of all Euclidean distances d between pairs of neighbouring points.
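For illustration, this cost function can be computed directly from a permutation vector; the following C sketch (with freely chosen names, not taken from the software discussed later) sums the Euclidean distances of neighbouring points and closes the tour:

#include <math.h>

/* Euclidean distance between points i and j, given their coordinates. */
static double dist(const double *x, const double *y, int i, int j)
{
    double dx = x[i] - x[j];
    double dy = y[i] - y[j];
    return sqrt(dx * dx + dy * dy);
}

/* Length of the closed tour defined by the permutation vector p of length n. */
double tour_length(const double *x, const double *y, const int *p, int n)
{
    double len = 0.0;
    for (int i = 0; i < n; i++)
        len += dist(x, y, p[i], p[(i + 1) % n]); /* wrap around to close the tour */
    return len;
}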
A tour is usually modelled as a graph with nodes from the set V (a.k.a. vertices) and edges from set E, where the nodes correspond to points and the edges connect the points in pairs. The distance between two points corresponds to the length of the connecting edge.
This optimization problem continues to attract many researchers to this day. Originally treated as a two-dimensional problem with symmetric distances as described above, several variants of the original problem have been formulated and studied up to the present.
The methods to deal with the TSP can be divided into exact solvers, which provide the global optimum (i.e., the shortest possible tour), and heuristics trying to find a solution which is at least close to the optimum. The time needed to find the exact solution grows exponentially with the number of tour points (see, for instance, [
3]). Implementations for exact solvers are freely available for academic use [
4] (
www.math.uwaterloo.ca/tsp/concorde/ (accessed on 7 December 2022)). Hougardy and Zhong have created special instances that are difficult to solve with exact solvers [
5].
In contrast, suboptimal solutions can be found in reasonable time by suitable heuristics. A state-of-the-art heuristic has been developed by Helsgaun [
6] (
http://webhotel4.ruc.dk/~keld/research/LKH-3/LKH-3.0.7.tgz (accessed on 7 December 2022)). Equipped with a variety of tour-improvement tools, it has even been able to find optimal tours for instances with up to 85 900 points.
Christofides and Serdyukov independently proposed an algorithm that guarantees solutions within a factor of 1.5 of the optimal tour length (excess ≤ 50%) if the instance is symmetric and obeys the triangle inequality [
7,
8]. In [
9], it was theoretically proven that an improved approximation factor can be obtained in polynomial time. This work was extended by [
10].
The most successful heuristics are based on the chained Lin–Kernighan method [
11], which is one of the methods that try to optimize the entire tour by local improvements. It relies on the ideas of [
12,
13] and applies sequences of typically only 2-opt and 3-opt permutations. A
k-opt permutation intersects a tour of connected points at
k locations, i.e.,
k edges are removed from the graph and
k other edges are inserted, ideally shorter on average than the original edges. A tour is called
k-optimal if no further improvements can be obtained by using
m-opt permutations with m ≤ k. There are also studies including 4-opt and 5-opt [
14]. Such
k-opt permutations steer the solution to a minimum, which is typically not the global one. Laporte gave a concise tour of the subject in [
15].
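To make the 2-opt case concrete (a minimal sketch in C based on the permutation-vector representation above, not code from the cited implementations): removing the edges (p[i], p[i+1]) and (p[j], p[j+1]) and reconnecting the tour amounts to reversing the subsequence between positions i+1 and j.

/* Apply a 2-opt move to the permutation vector p of a tour:
 * the edges (p[i],p[i+1]) and (p[j],p[j+1]) are replaced by
 * (p[i],p[j]) and (p[i+1],p[j+1]), i.e., the subsequence p[i+1..j]
 * is reversed. Requires 0 <= i < j < n-1. */
void two_opt_move(int *p, int i, int j)
{
    int lo = i + 1, hi = j;
    while (lo < hi) {
        int tmp = p[lo];
        p[lo] = p[hi];
        p[hi] = tmp;
        lo++;
        hi--;
    }
}

Such a move is accepted only if the two new edges are shorter in total than the two removed ones.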
In addition to the classical TSP, optimization problems with special constraints have also been studied, such as the vehicle-routing problem, in which not only the distances between points count, but also time constraints and multiple agents with varying capacities have to be considered [
16,
17]. A comprehensive overview is given in [
18]. Time-dependent TSPs were investigated by [
19,
20], wherein the costs depend not only on the distances between points but are also a function of the position within the tour.
A generalized TSP was presented in [
21], wherein points are clustered and the tour to be optimized consists of exactly one point from each cluster. This problem was addressed with a genetic algorithm. Several applications of clustered TSPs were discussed in [
22]. Time constraints have also been an issue recently in solving TSPs with constraint programming [
23]. The authors of [
24] proposed a technique allowing permutations that do not directly lead to a shorter tour and thus can avoid getting stuck in local minima. It can compete with the Lin–Kernighan–Helsgaun (LKH, [
6]) method for difficult problems when starting from a random tour, i.e., without creating an initial tour.
As already pointed out above, the local optimisation by
k-opt permutations and similar operations runs the risk of getting stuck in a local minimum. There are different strategies to overcome this problem. In [
25], for example, a “breakout local search” was proposed to jump from one local minimum to the next and hopefully better local minimum, which of course requires several attempts. A very successful recombination technique initially generates many suboptimal tours converging at different local minima (see for example [
26,
27]). In [
26], this is described as the “principle of backbone edges”. These are connections between points that are present in all quickly solved tours. The entire tour can then be collapsed into a smaller tour that considers only the remaining missing connections, while a path of backbone edges is treated as a single edge. A comparison of techniques can be found in [
28] and the literature cited therein. These ideas of combining information from different trials are closely related to so-called “ant colony optimization” methods whereby the decision-making mimics the use of pheromone trails of ants. Recent developments and relevant citations can be found in [
29].
However, such recombination techniques and other metaheuristics are not practical for very large instances with more than one million points, because even finding the first local minimum can be very time consuming.
Many proposed heuristics only deal with small or medium-sized instances [
30,
31,
32,
33,
34,
35,
36,
37,
38], whereas in [
39] problems with even billions of cities are tackled. In the last decade, heuristics based on the idea of swarm intelligence, such as ant colony optimization, have become quite popular [
29,
32,
33,
35,
37,
38,
40]. However, so far they cannot really compete with state-of-the-art methods like LKH when applied to classical TSP problems.
The processing of large instances can be accelerated by decomposing the tour into shorter subtours which remain connected by using fixed edges [
41,
42]. This can be combined with the use of multiple threads and parallel processing on multicore processors [
29,
42] or by taking advantage of graphical processing units [
43].
The method described in this paper draws on many ideas proposed by various researchers and combines them with new ideas. It aims at optimizing very large instances (up to 10 million points, symmetric Euclidean two-dimensional distances) in a single pass in limited time. To achieve this goal, various techniques are implemented to speed up the processing. The vast number of possible permutations is reduced by a suitable candidate selection based on Delaunay triangulation [
44]. This also allows the precomputation of a sparse distance matrix. As an alternative to the use of a single-level permutation vector, a two-level data structure is considered, and parallel processing based on multithreading is supported. The main contribution of the proposed method is to provide and investigate an alternative to the chained Lin–Kernighan method by systematically applying k-opt permutations with k ranging from 2 to 6. Therefore, the proposed method does not consider recombination or other metaheuristics to escape from local minima, but concentrates on the acceleration of the first run. Dedicated software has been written from scratch. For initial tour generation, candidate set enrichment is proposed and the use of a cluster-based technique [
42] is considered. The developed code will be made available as open source to allow reproducible research on this topic.
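To illustrate what a candidate-based sparse distance matrix can look like (a hypothetical C layout for orientation; the data structures of the published code may differ):

#define MAX_CAND 12 /* assumed upper bound on candidates per point */

typedef struct {
    int    num_cand;            /* number of candidates of this point   */
    int    cand[MAX_CAND];      /* indices of candidate neighbours      */
    double cand_dist[MAX_CAND]; /* precomputed Euclidean distances      */
} CandidateList;

/* One CandidateList per point replaces a full N x N distance matrix,
 * reducing the memory demand from O(N^2) to O(N) for large instances. */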
3. Investigations
3.1. Data and Hardware
The implementation of the proposed method has been tested on several TSP instances chosen from different repositories. The focus was on very large instances and instances with varying characteristics. The compiled set of instances can be found at [
55]. All research reported here has been performed on a Linux server equipped with an AMD EPYC 7452 CPU (2.35 GHz, 32 cores, 128 MB L3 cache) and 125.7 GiB RAM. The source code (ANSI-C) utilizes the Linux-specific pthread library for parallel processing and has been compiled with gcc and the option ‘-Ofast’. Alternatively, the code can be compiled on computers running Windows operating systems. However, the source code does not support parallel processing under Windows yet.
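For orientation, a typical build command matching this setup could look as follows (the file and program names are placeholders, not those of the released sources):

gcc -Ofast -pthread -o sys2to6 main.c -lm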
3.2. General Behaviour
The general behaviour shall be illustrated by using the instance
C316k.0. If other instances show different characteristics for certain settings, this will be mentioned explicitly. Candidate selection is restricted to the five closest candidates, the maximum edge distance between points
Q and
R is kept fixed, parallel processing is turned off, and the one-level structure is used.
Figure 8 shows the progress of optimization and marks special features.
The initial tour with a length of 201 871 162 is provided by DoLoWire (30 s). After an initial loop over all points
P using 2-opt permutations according to
Section 2.4.1, the tour length is reduced to 200 615 308 by 3845 successful permutations. This reduction is possible because DoLoWire does not optimize the initial tour across cluster boundaries. Starting with the second loop, 3-opt operations (see
Section 2.4.2) are also enabled.
Table 9 represents the entire process in numbers. The time measurements have a precision of one second in all experiments.
With each loop, the number of successful permutations decreases until a certain threshold is reached. Then the next k-opt processing is activated and so on. The last loop simply verifies that no further improvement is possible and the processing terminates after about 37 min.
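The loop-and-threshold schedule described above can be sketched as follows (an illustrative reconstruction with hypothetical helpers and an assumed threshold value, not the actual control flow of the program):

/* hypothetical helper: tries all enabled k-opt moves anchored at point P
 * and returns the number of improving moves applied */
extern long try_moves_at(long P, int k_enabled);

void optimize(long n)
{
    int k_enabled = 2;            /* start with 2-opt (3-opt follows in loop 2) */
    const long THRESHOLD = 1000;  /* assumed activation threshold               */

    for (;;) {
        long improvements = 0;
        for (long P = 0; P < n; P++)
            improvements += try_moves_at(P, k_enabled);
        if (improvements == 0 && k_enabled == 6)
            break;                /* final loop confirms the local minimum */
        if (improvements < THRESHOLD && k_enabled < 6)
            k_enabled++;          /* enable the next k-opt level */
    }
}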
3.3. Investigations on Maximum Edge Distance
The number of points
R to be examined depends on the chosen maximum edge distance between points
Q and
R. The variation of
maxEdgeDist between 5 and 200 has been investigated. It affects the course of optimization and the time required to process each loop over
P as can be seen in
Figure 9.
For each parameter set, there are two curves depending on the loop number: one shows the current tour length and the other shows the elapsed time t.
The magenta curve has two kinks; at these locations, the next k-opt permutations are activated. At the same loop positions, the time curve also has kinks because enabling more complex permutations increases the time needed for one loop over all P positions. Once the majority of successful permutations has been performed, the duration per loop remains nearly constant until the set of enabled permutations is further expanded.
The smaller the chosen maximum edge distance, the smaller the improvements per loop, because some helpful permutations are not checked. At the same time, the duration per loop is shorter, i.e., the algorithm is faster. This leads to a flatter slope of the time curves, and the optimizer can perform more loops in the same time. For very low values of maxEdgeDist, the optimization converges to unfavourable local minima. For example, with maxEdgeDist = 5, the algorithm stops after 20 loops, which are processed in three minutes; when using maxEdgeDist = 200, a local minimum is reached after 18 loops and approximately 73 min. Basically, we can see that the time needed to complete a certain loop is a function of maxEdgeDist, which is not a big surprise because maxEdgeDist determines the number of inspected R positions (see Listing 2 line 10).
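If, as its name suggests, maxEdgeDist limits how many tour edges may lie between Q and R, the restricted inner search can be pictured like this (an illustrative C sketch with hypothetical helpers, not Listing 2 itself):

/* hypothetical helpers: map a tour position to a point and test permutations */
extern int  point_at_position(long pos);
extern void check_permutations(int P, int Q, int R);

/* examine only points R within maxEdgeDist tour edges of Q */
void scan_R_candidates(int P, int Q, long pos_of_Q, long n, int maxEdgeDist)
{
    for (int offset = -maxEdgeDist; offset <= maxEdgeDist; offset++) {
        if (offset == 0)
            continue;
        long pos = (pos_of_Q + offset + n) % n;   /* wrap around the closed tour */
        check_permutations(P, Q, point_at_position(pos));
    }
}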
The behaviour within the first five loops is always the same because only 2-opt and 3-opt permutations are performed here, which do not depend on maxEdgeDist. Increasing maxEdgeDist from 100 to 200 only has a minor effect on the achieved improvements per loop. For other instances, it can also be observed that such high values rarely benefit the optimization process when tight time constraints are used.
3.4. Investigations on Maximum Number of Candidates
In
Table 1, it was shown that the average number of candidates is equal to six when selected via Delaunay triangulation. The question is whether all candidates should always be considered or whether this set can be reduced to a smaller number of nearest points.
Figure 10 shows the corresponding curves for different numbers of candidates. In general, we observe that the performance improves when the number of candidates is increased up to seven. Choosing more than seven candidates has almost no effect on the tour length because only about 50% of all points have more than six candidates and the chance that a connection to one of the more distant candidates is favourable decreases. The processing time increases with the number of candidates, although this increase is only noticeable as long as there are points with at least that many candidates. Because the time limit was set to 60 min, the last loops for the largest candidate counts could not be completed, leading to the kinks in the corresponding time curves.
3.5. Investigations on Data Structures
As described above, the simplest structure for storing point-order information is a permutation vector. This is considered a one-level structure. As an alternative, a two-level structure has been implemented, which is much more complicated from the point of view of data organisation, but it can reduce the required number of data accesses.
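For orientation, a typical two-level tour representation (an illustrative sketch, not necessarily the exact layout implemented here) splits the tour into segments of roughly sqrt(N) points, each with its own orientation flag:

typedef struct Segment Segment;

typedef struct {
    int      id;    /* point index                        */
    int      seq;   /* position within the parent segment */
    Segment *seg;   /* pointer to the parent segment      */
} City;

struct Segment {
    City   **cities;    /* points of this segment in tour order       */
    int      count;     /* current number of points in the segment    */
    int      order;     /* position of the segment within the tour    */
    int      reversed;  /* 1 if the segment is traversed backwards    */
    Segment *next, *prev;
};

Queries such as “which of two points comes first on the tour?” and the segment reversals required by k-opt moves then cost O(sqrt(N)) instead of O(N) on a plain permutation vector, which is why such a structure can reduce the number of data accesses while many permutations are applied, at the price of additional bookkeeping.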
Figure 11 shows the progress of optimization for TSP instance C316k.
In the first few seconds, the tour length decreases much faster when using the two-level structure (see
Figure 11a). This advantage persists beyond 10 min (see magnified diagram in
Figure 11b).
Figure 12 illustrates the progress with respect to the number of performed loops over all positions
P.
Loop zero corresponds to the state after initial tour generation. The identical length curves prove that the achieved tour lengths are independent of the chosen data structure, meaning that both algorithms perform exactly the same sequence of permutations. Up to the fifth loop, the two-level structure requires only a tenth of the time compared to the one-level structure. This advantage vanishes when 5-opt permutations are started. The positive effect of the two-level structure only becomes noticeable when many permutations can be applied. As soon as the optimization converges to a local minimum, the search for successful permutations takes more time than the permutations actually performed, and the algorithmic overhead of the two-level structure consumes the time previously saved. After the 16th loop, the two-level structure has become slower on average than the one-level structure.
3.6. Investigations on Chosen Initial Tour Generator
In
Section 2.3, two different methods for generating an initial tour were discussed. The previous tests always relied on DoLoWire which creates much better tours than the GrebCap approach. It is now necessary to investigate what influence the quality of this initial tour has on the optimization process.
Figure 13 depicts the progress as a function of time.
In contrast to previous investigations, the elapsed time additionally includes the time needed for the initial tour generation. According to
Table 2, this is 30 s for DoLoWire and about three seconds for both GrebCap approaches. GrebCap+ denotes the greedy approach, including the candidate set enrichment and point merging (see
Section 2.3.1).
It is obvious that the greedy approach produces an unfortunate initial tour. After 30 s of local optimization with Sys2to6, the tour is still worse than the result of the initialization by DoLoWire and, moreover, the optimization converges to a worse local minimum.
A different picture can be observed for ara238025.tsp (Figure 14).
Here, the cluster-based generator of DoLoWire creates an unfortunate initial tour forcing the optimization process to a worse local minimum.
3.7. Parallel Processing of Tour Segments
As described in
Section 2.5, the search for suitable permutations can be accelerated by parallelization. The influence of the number of parallel jobs on the course of the tour-length reduction has been investigated.
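Conceptually, the segment-parallel search can be organized as in the following simplified pthread sketch (hypothetical helper and type names; the actual scheduling of Section 2.5 is more elaborate):

#include <pthread.h>
#include <stddef.h>

typedef struct {
    long first;  /* first tour position of the segment; boundary edges stay fixed */
    long last;   /* last tour position of the segment                             */
} SegmentJob;

/* hypothetical helper: local k-opt optimization restricted to one segment */
extern void optimize_segment_range(long first, long last);

static void *segment_worker(void *arg)
{
    SegmentJob *job = (SegmentJob *)arg;
    optimize_segment_range(job->first, job->last);
    return NULL;
}

void optimize_in_parallel(SegmentJob *jobs, int num_jobs)
{
    pthread_t tid[num_jobs];
    for (int i = 0; i < num_jobs; i++)
        pthread_create(&tid[i], NULL, segment_worker, &jobs[i]);
    for (int i = 0; i < num_jobs; i++)
        pthread_join(tid[i], NULL);
    /* a subsequent single-job pass exchanges information across segment borders */
}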
The results for C316k.0 are shown in
Figure 15.
The comparison now becomes somewhat more difficult because the order of permutations depends on the size of segments, which is, of course, a function of the chosen number of parallel jobs.
It can be clearly seen that the optimization process distinctly benefits from the use of multiple threads within the first three minutes (
Figure 15a,b). After that, all variants converge to different local minima which is due to the varying order of permutations performed (
Figure 15c). The maximum number of threads is set to 30 because the CPU has 32 physical cores. Two cores are reserved for background processes of the operating system so that they do not influence the time measurement too much.
By using at least eight parallel jobs for this instance, the local minimum can be reached within 60 min. With 30 jobs, the shortest tour can be attained after approximately half an hour. A similar minimum is obtained with 16 jobs, but in this example it takes almost 60 min to reach the point of convergence. When using eight parallel jobs, the minimum is reached after 35 min. Unfortunately, the sequence of permutations here steers the optimization into an unfavourable local minimum.
Most interesting is the influence of multithreading on the largest instance in the TSPs set used.
Figure 16 depicts the optimization progress for E10M.0 containing ten million points as a function of the number of parallel jobs.
The optimization starts with an initial tour of length 2 529 955 468 provided by DoLoWire. Reading the corresponding coordinate and permutation files takes eight seconds. For the single-job case (solid curve), the determination of candidates takes an additional 61 s before the actual optimization can begin. When 2-opt and 3-opt permutations have exhausted their potential (after about 210 s), 5-opt operations are enabled. The tour is then steadily improved until the time limit is reached.
When using two parallel jobs, the preparation time increases to 150 s due to an unfortunate point configuration for the triangulation algorithm used. This computational overhead is incurred after each round of parallel processing, causing the stair-stepped curve up to about 11 min. The following plateau is a consequence of the chosen scheduling mechanism, which switches back to single-job processing too early in this situation. After 30 min, the first round of parallel processing with 5-opt permutations is activated, and it continues until the specified time expires.
The more jobs are used in parallel, the fewer points are contained in each sequence, and the triangulation overhead becomes less pronounced. The curves now appear to be piecewise linear. This is simply a consequence of the uniform distribution of points in E10M.0. Each round of parallel processing corresponds to exactly one loop over all points
P. The chance of improvement is evenly distributed along this loop. With each new round, the probability of successful permutations decreases; the curves flatten accordingly. A similar progression is observed for E3M.0, the second-largest instance of the test set (
Figure 17).
The more segments are processed in parallel, the faster the tour length is improved.
Optimizing the third-largest instance ‘santa1437195’ gives a slightly different picture. The curves in
Figure 18 are less regular because the progress depends on the location of the points that are being processed.
This instance is highly clustered, and different parts of the tour have different optimization potential. Depending on the segmentation chosen, it is also possible that an unfortunate permutation is selected which temporarily leads the optimization in an unfavourable direction. As can be seen from the 30-job curve, the tour length achieved after one hour is not significantly different from the result of the 16-job optimization. The advantage of two jobs over single-thread processing comes into play after about three minutes.
Based on the optimization progress, it can be observed that the acceleration of the improvements does not scale with the number of parallel jobs. Inspecting the plot of instance E3M.0 (
Figure 17), the relation between the number of jobs and the processing time t required to reach a given tour length can be examined.
Parallelization with 30 threads leads only to a fivefold increase in optimization speed. There are at least three reasons for this. First, there is some overhead in managing parallel processing, including candidate set generation. Second, the optimizer misses more and more favourable permutations as the segments to be processed become shorter. Third, and this is the main reason, parallel optimization often has to be interrupted by single-job processing to exchange information between distant segments (see
Section 2.5.3).
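The limited scaling can also be illustrated with a simple Amdahl-type estimate (an illustrative model, not an analysis taken from the measurements): if a fraction s of the total effort behaves as serial work, the achievable speedup with J parallel jobs is

$$S(J) = \frac{1}{s + (1 - s)/J}.$$

A fivefold speedup at J = 30 corresponds to s = 5/29 ≈ 0.17, i.e., under this simple model roughly a sixth of the effort behaves as if it were serial (the management overhead and the single-job interludes mentioned above).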
3.8. Comparison with a State-of-the-Art Method
The proposed method Sys2to6 is now compared to LKH [
56], which is widely known as a state-of-the-art heuristic for finding excellent TSP tours. The performance of Sys2to6 is mainly influenced by the values used for the number of candidates, the maxEdgeDist value, the number of parallel jobs, and the maximal duration of processing. The latter is set to a limit of 30 min (including the initial tour generation), in combination with a fixed number of parallel jobs. The other two parameters, the number of candidates and maxEdgeDist, have been varied, and a good compromise could be found. Note that without time constraints, larger values for both parameters would typically lead to shorter tours.
The default LKH settings are geared toward small and medium-sized instances. For large instances, the preparation of initial values already takes more than an hour even without outputting an initial tour. Forcing LKH to output results within 30 min, also for E10M.0, requires special options (see
Table 10).
The generation of sets of candidates by using the default settings is too time-consuming and must be replaced by the Delaunay method. The corresponding implementation enriches the set of Delaunay candidates as described in
Section 2.3.1 with respect to
Figure 3. Finally, the five Delaunay candidates with the smallest values are used [
57]. It has to be mentioned that LKH can result in better tours for smaller instances when manually optimized settings are used.
The time limit is set to a maximum of 30 min. However, because LKH does not include preprocessing in the time measurement, this limit must be adaptively reduced for each instance. Because the initial tour generation of LKH is dependent on seeding the random number generator, the optimization process is started 10 times with different seed values by using the keyword SEED. The average tour length is chosen for the comparison. Running LKH on instance
Tnm100000 causes the program to abort. This can be prevented by additional use of the option ‘PRECISION = 1’ [
57].
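For orientation, an LKH parameter file along these lines could look as follows (an illustrative example assembled from the options mentioned above and standard LKH keywords; the file name is a placeholder and the actual options used are listed in Table 10):

PROBLEM_FILE = E10M.tsp
CANDIDATE_SET_TYPE = DELAUNAY
MAX_CANDIDATES = 5
TIME_LIMIT = 1800
RUNS = 1
SEED = 1
PRECISION = 1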
Table 11 compares the results in terms of final tour length obtained within 30 min of optimization.
When processing very large instances with LKH, it is not possible to obtain a tour-length value after exactly 30 min of computation; the actual calculation times are about 32 min for E3M.0 and about 38 min for E10M.0.
The instances ExM.0 contain uniformly distributed points. As can be concluded from the results in
Table 11, LKH does an excellent job for these instances. The achieved excesses are less than 0.5%, whereas the tours obtained with Sys2to6 are significantly longer. Something similar can be observed for the VLSI instances. Here, Sys2to6 performs worst, with excess values above 8%.
For the instances Cxk.0, which contain many clusters of points, the proposed method can outperform LKH. The best performance of Sys2to6 in relation to LKH can be observed for the instance santa1437195. It contains 1 437 195 two-dimensional coordinates of households (Finnish National Coordinate System) distributed all over Finland (see for example [
42]). These coordinates also show distinct clusters in large Finnish cities. With the given settings and constraints, LKH has problems dealing with this dataset, resulting in an excess of 7.32%, whereas Sys2to6 only has an excess of 3.69%.
As the performance of Sys2to6 might depend on the chosen initial-tour generator, additional investigations have been conducted by using GrebCap+ instead of DoLoWire. Taking the random component of GrebCap+ into account, the optimization has been performed 10 times for each instance.
Table 12 lists the corresponding results in terms of worst, best, and median tour.
In only three cases is the performance improved when using GrebCap+ instead of DoLoWire for initial tour generation. In addition, the tour lengths significantly increase for the clustered instances C100k, C316k, and santa1437195, which obviously benefit from using DoLoWire as initial tour generator.
The chosen settings for the number of candidates and maxEdgeDist are a compromise between fast processing and a large permutation search space with respect to the chosen time constraint. Additional investigations have shown that, with respect to the limited time span of 30 min, the two largest instances in the studied set would benefit from smaller values for these two parameters, because the local improvements contribute more to the reduction of the tour length than the larger search space does. Fine-grained modification of the parameters results in different optimal settings for different instances. This is mainly due to the modified order of permutation selection, which directs the optimization process to different regions in the hyperspace of tour lengths.