Estimating the Tour Length for the Close Enough Traveling Salesman Problem

: We construct empirically based regression models for estimating the tour length in the Close Enough Traveling Salesman Problem (CETSP). In the CETSP, a customer is considered visited when the salesman visits any point in the customer’s service region. We build our models using as many as 14 independent variables on a set of 780 benchmark instances of the CETSP and compare the estimated tour lengths to the results from a Steiner zone heuristic. We validate our results on a new set of 234 instances that are similar to the 780 benchmark instances. We also generate results for a new set of 72 larger instances. Overall, our models ﬁt the data well and do a very good job of estimating the tour length. In addition, we show that our modeling approach can be used to accurately estimate the optimal tour lengths for the CETSP. radius of the customer service regions and the number of customers up to 20. There are three new instances for each combination of the radius of the customer service regions and the number of customers greater than 20. We estimated the SZVNS tour lengths for the 234 new instances using the Best BIC model. The out-of-sample MPE and MAPE values are − 0.344% and 4.233%, respectively, which indicate that the Best BIC model performs well in estimating the SZVNS tour lengths on the new CETSP instances.


Introduction
Operations researchers have long been interested in estimating the length of tours and routes in the traveling salesman problem (TSP) and the vehicle routing problem (VRP). One of the earliest papers by Beardwood et al. [1] developed analytically derived formulas for the TSP. Christofides and Eilon [2], Hindle and Worthington [3], and Cavdar and Sokol [4] improved these formulas by using empirically estimated parameters. Golden and Alt [5] constructed interval estimates of the optimal solution value. Chien [6] and Kwon et al. [7] used parameters for the shape of the area covering the customers and the depot, the distance between customers, and the coordinates of the customers. Basel and Willemain [8] estimated the optimal tour length for 17 TSP instances using the square root of the number of cities and the variability of tour lengths. Nicola et al. [9] developed empirically based regression models for estimating the travel distance in the TSP, in the capacitated VRP with time windows, and in the multi-region, multi-depot pickup and delivery problem.
As pointed out by Nicola et al. [9], carriers and logistics companies may need to solve a large number of routing problems which might require a significant computational effort. Approaches that generate fast, accurate estimates for the travel distance are highly desirable in practice for a wide range of real-world routing problems. These estimates can be used by companies to make strategic, tactical, and operational decisions quickly [9].
In this paper, we construct empirically based regression models for estimating the tour length in the Close Enough Traveling Salesman Problem (CETSP). In the TSP, a salesman must visit the exact customer location. In the CETSP, a customer has a service region and is considered visited when the salesman visits any point in the customer's service region. The CETSP arises in practical applications including utility meter reading at residential locations using radio frequency identification and using drones for aerial surveillance. In another application, during multi-visit drone routing [10] it becomes important to quickly assess whether the duration to cover a region of interest would exceed a drone's battery life.
In the CETSP, a service region is assumed to be a circular disk centered at the customer location with a specified radius. The objective is to visit all customers in the shortest distance traveled starting and ending at the depot. The TSP is a special case of the CETSP when all radii of the circular disks are zero, making the CETSP at least as difficult to solve as the TSP. In order to solve an instance of the CETSP, it is not enough to determine the sequence in which the customers are visited. We must also determine the locations at which these customers are visited within their respective service regions. In Figure 1, we show an instance of a CETSP with 12 customers denoted by C 1 , . . . , C 12 and a depot denoted by C 0 . The service region is specified by a circle centered at the customer's location. A feasible CETSP tour is shown by the solid lines with arrows. The tour passes through at least one point in the service region of each customer. If a tour passes through an overlap of several disks, all customers that define those disks are served. A Steiner zone is an overlap of disks. If a Steiner zone is contained in at most k disks, it has degree k. For example, the location of each of customers C 1 , C 6 , C 8 , and C 11 is within a degree 1 Steiner zone. Similarly, the location of each of customers C 3 , C 4 , C 5 , C 9 , and C 12 is within a degree 2 Steiner zone. Whereas, the location of each of customers C 2 , C 7 , and C 10 is within a degree 3 Steiner zone. For more details on Steiner zones, see Wang et al. [11]. Progress in computational integer programming has been remarkable over the past 30 years; see Bertsimas et al. [12] for details. In particular, exact solvers for the TSP (e.g., Concorde) can now generate solutions quickly to most large instances with hundreds or thousands of customers. However, there are very few exact approaches for the CETSP. Exact approaches have been developed by Behdani and Smith [13] and Coutinho et al. [14]. Tight bounds have been developed by Carrabs et al. [15]. Coutinho et al. [14] proposed a branch-and-bound algorithm and applied it to instances from the literature. Computation times ranged from seconds for small instances to four hours for large instances with hundreds of customer locations.
The remainder of the paper is organized as follows. In Section 2, we give the regression models and discuss the fitness measures to test the performance of the models. In Section 3, we show the results of the regression models, best subset model selection, and model validation. In Section 4, we present our conclusions and future directions.

Regression Models and Fitness Measures
The Steiner zone variable neighborhood search (SZVNS) heuristic developed by Wang et al. [11] finds high-quality solutions to instances of the CETSP. We use SZVNS tour lengths in our regression models to estimate CETSP tour lengths.
Our regression model can be represented by . . , 15}, i denotes a CETSP instance, and E() denotes the expected value. The dependent variable Y i is the tour length generated by the SZVNS heuristic. In Table 1, we give the definitions of the independent variables for the regression model. Nodes represent customers and the depot. The size of an instance is captured by n and A. MinP, MaxP, VarP, SumMinP, and SumMaxP capture the distances between nodes. MinM, MaxM, SumM, and VarM capture the distances to the average node represented by the mean x-coordinate and the mean y-coordinate. VarX×VarY captures the spread of the instance across the two axes. AvgR captures the mean of the radii of the customer service regions. The service region radius of a depot is always zero. SZ captures the feature of the instance that is exploited by the SZVNS heuristic. AvgR and SZ are used to capture the geometric features unique to the CETSP. These two independent variables would not be used in a regression model that estimates TSP tour lengths.
We use the 780 CETSP benchmark instances and their tour lengths produced by the SZVNS heuristic given in Wang et al. [11]. The node locations (depot and the customers) are generated randomly and all customers in an instance have the same radius for the service regions. The instances have 6,8,10,12,14,16,18,20,25, or 30 customers. The radii for the customer service regions are 0.25, 0.50, or 1.00. There are 30 instances for each combination of the radius of the customer service regions and the number of customers up to 20. There are 10 instances for each combination of the radius of the customer service regions and the number of customers greater than 20.
We use mean percentage error (MPE) and mean absolute percentage error (MAPE) to assess the quality of the approximation of the CETSP tour lengths from the regression model (y i ) with respect to the tour lengths from the SZVNS heuristic (Y i ). MPE and MAPE are defined by 100 where N denotes the number of instances. A value of MPE close to zero indicates that there is almost an equal distribution of instances with tour lengths being overestimated (Y i < y i ) and underestimated (Y i > y i ). The value of MAPE is always non-negative, and a small value indicates that the tour-length estimates are close to the SZVNS tour lengths for most of the instances. We use adjusted R 2 , Studentized residuals, Mallows's C p , and Bayesian information criterion (BIC) to assess the quality of the model fits. Residuals (Y i − y i ) have a mean of zero. Studentized residuals are scaled residuals with unit variance. Mallows's C p and BIC are used in the context of model selection where the goal is to find the best model involving a subset of the independent variables. Mallows's C p addresses the issue of model overfitting by penalizing for adding extra variables. The value of Mallows's C p should be close to the number of independent variables in the model (p) to indicate the absence of overfitting. BIC penalizes a model for having more independent variables, and the penalty increases as the number of instances (size of the data set) increases. The lower the value of BIC, the better is the model fit.

Regression Results
In Table 2, we present the regression results for the 780 instances using all 14 independent variables in our model. The regression model has an adjusted R 2 value of 0.921 which indicates a very good model fit. The variables n and SZ are not significant at the 10% level. The remaining 12 variables are significant at various levels. In Figure 2a, we give the histogram of Studentized residuals for the model with 14 variables. This histogram shows that there is almost an equal distribution of instances with positive and negative residuals. This is also indicated by the MPE value of −0.192% which is close to zero. The MAPE value indicates that the tour-length estimates from the regression model differ by an average of 3.984% from the SZVNS tour lengths.

Best Subset Model Selection
Although the regression model shown with all 14 variables performed well in estimating the SZVNS tour lengths, two of the variables were not significant at the 10% level and one variable was significant only at the 10% level. We now examine whether we can remove some variables to create a parsimonious model without compromising much on the model fit.
We create models with 1 to 14 independent variables in the following way. We start by constructing models with only one independent variable and identifying the 1-variable model that produced the largest adjusted R 2 value. Then we find the 2-variable model that produced the largest adjusted R 2 value, and so on until we consider the 14-variable model. In Table 3, we show the best subset models based on the adjusted R 2 value. For example, out of all possible models with two independent variables, the model containing the variables SumMaxP and AvgR produced the largest adjusted R 2 value. Following this process, we create 14 models that contain 1 to 14 variables. In Table 4, we show the adjusted R 2 , Mallows's C p , and BIC values of the best subset models from Table 3.
Over these 14 models, we select the model that has the largest adjusted R 2 value. This is the model with 13 variables and an adjusted R 2 value of 0.921. There are other models with the same adjusted R 2 value when rounded to three decimal places.
We show the coefficients of the first model in Table 5 under the column labeled "Best adjusted R 2 ". We note that the adjusted R 2 value of 0.921 indicates a very good model fit. The variable SZ is not in this model and the variable n is not significant at the 10% level. The remaining 12 variables are significant at various levels. In Figure 2b, we give the histogram of Studentized residuals for this model. This histogram shows that there is almost an equal distribution of instances with positive and negative residuals. This is also indicated by the MPE value of −0.192% which is close to zero. The MAPE value indicates that the tour-length estimates from this regression model differ by an average of 3.983% from the SZVNS tour lengths. Next, from the 14 models in Table 3, we select the model that has its Mallows's C p value closest to the number of independent variables in the model. This is the model with 10 independent variables and a Mallows's C p value of 13.0.
We show the coefficients of the second model in Table 5 under the column labeled "Best Mallows's C p ". We note that the adjusted R 2 value of 0.921 indicates a very good model fit. The four variables n, VarP, SumM, and SZ are not in this model. The remaining 10 variables are significant at various levels. In Figure 2c, we give the histogram of Studentized residuals for this model. This histogram shows that there is almost an equal distribution of instances with positive and negative residuals. This is also indicated by the MPE value of −0.194% which is close to zero. The MAPE value indicates that the tour-length estimates from this regression model differ by an average of 3.995% from the SZVNS tour lengths.
Finally, from the 14 models in Table 3, we select the model that has the smallest BIC value. This is the model with eight independent variables and a BIC value of −1921. Table 5. Regression results on the three selected best subset models.
We show the coefficients of the third model in Table 5 under the column labeled "Best BIC". We note that the adjusted R 2 value of 0.920 indicates a very good model fit. The six variables n, MaxP, VarP, MaxM, SumM, and SZ are not in this model. The remaining eight variables are significant at the 0.1% level. In Figure 2d, we give the histogram of Studentized residuals for this model. This histogram shows that there is almost an equal distribution of instances with positive and negative residuals. This is also indicated by the MPE value of −0.202% which is close to zero. The MAPE value indicates that the tour-length estimates from this regression model differ by an average of 4.008% from the SZVNS tour lengths.
All three best subset models based on adjusted R 2 , Mallows's C p , and BIC performed well. In fact, their adjusted R 2 values are nearly the same (about 0.921), as are their MAPE values (about 4%). We recommend the model with the least number of independent variables (Best BIC model with eight variables) to estimate the SZVNS tour lengths for CETSP instances with random node locations. The coefficients of all eight variables in the Best BIC model are highly significant (p < 0.001).

Model Validation
We generated 234 new CETSP instances, similar to the 780 instances we used to develop the regression models, to test the Best BIC model with eight variables. The node locations (depot and the customers) are generated randomly and all customers in an instance have the same radius for the service regions. The instances have 6, 8, 10, 12, 14, 16, 18, 20, 25, or  We generated the optimal tour lengths for the 234 new instances using a branch-andbound algorithm [14] in order to test the usefulness of the eight independent variables selected in the Best BIC model in estimating the optimal tour lengths. SZ is the only independent variable specific to the SZVNS algorithm and is not a part of any of the three models shown in Table 5. Therefore, the eight independent variables in the Best BIC model should have more general usage in estimating tour lengths. The regression model built using the eight independent variables in the Best BIC model and trained on the optimal tour lengths of the 234 new instances is given by: optimal i = 13.394 + 0.054 where optimal i denotes the optimal tour length for instance i. The signs and the orders of magnitude of the coefficients are the same as the coefficients in the Best BIC model. The regression model for estimating the optimal tour lengths with eight independent variables has an adjusted R 2 value of 0.937, indicating a very good model fit. The MPE value of −0.161%, which is close to zero, indicates that there is almost an equal distribution of instances with positive and negative residuals. The MAPE value of 3.735% also indicates the usefulness of this model in estimating optimal tour lengths.
The eight independent variables in the Best BIC model adequately capture the geometric properties of the CETSP instances where the node locations are generated randomly. The model can be trained on the tour lengths generated from any heuristic or optimal algorithm to accurately estimate the tour lengths for that specific heuristic or optimal algorithm on other similar CETSP instances.
We generated a new set of 72 larger CETSP instances to further assess the Best BIC model with eight variables. The node locations (depot and the customers) are generated randomly and all customers in an instance have the same radius for the service regions. The instances have 35, 40, 45, or 50 customers. The radii for the customer service regions are 0.25, 0.50, or 1.00. There are six new instances for each combination of the radius of the customer service regions and the number of customers. We estimated the SZVNS tour lengths for the 72 new larger instances using the Best BIC model. The out-of-sample MPE and MAPE values are −0.356% and 4.389%, respectively, which indicate that the Best BIC model performs well in estimating the SZVNS tour lengths on the larger CETSP instances.
We tried generating the optimal tour lengths for the new set of 72 larger instances using the branch-and-bound algorithm from Coutinho et al. [14] with a maximum run time of one hour. We were unable to find the optimal solution to seven instances, so we do not report regression results based on optimal tour lengths. This experiment reinforces the need for a regression-based model to quickly estimate the CETSP tour lengths of larger and more difficult instances for a specific heuristic or optimal algorithm. All CETSP instances used in our experiments are given in Sinha Roy et al. [23].

Conclusions and Future Directions
We applied regression models to an important problem in the routing literature-the CETSP. We demonstrated that it is possible to quickly and accurately estimate tour lengths using a regression model without generating the actual tours. We showed that the tour lengths generated by the SZVNS heuristic or by an optimal algorithm could be estimated with an average error of about 4% by a regression model with eight independent variables where the node locations are generated randomly. In the future, we would like to develop models for instances with node locations that are generated in different structured ways.

Conflicts of Interest:
The authors declare no conflict of interest.