A Novel K-Means Clustering Algorithm with a Noise Algorithm for Capturing Urban Hotspots

Abstract: With the development of cities, urban congestion is nearly an unavoidable problem for almost every large-scale city. Road planning is an effective means to alleviate urban congestion; it is a classical non-deterministic polynomial time (NP) hard problem and has become an important research hotspot in recent years. The K-means clustering algorithm is an iterative clustering analysis algorithm that has been regarded by scholars as an effective means to solve urban road planning problems for the past several decades; however, it is very difficult to determine the number of clusters, and the algorithm is sensitive to the initialization of the cluster centers. In order to solve these problems, a novel K-means clustering algorithm based on a noise algorithm is developed in this paper to capture urban hotspots. The noise algorithm is employed to randomly enhance the attribution of data points and the output results of clustering by adding noise judgment, in order to automatically obtain the number of clusters for the given data and initialize the cluster centers. Four unsupervised evaluation indexes, namely, DB, PBM, SC, and SSE, are directly used to evaluate and analyze the clustering results, and a nonparametric Wilcoxon statistical analysis method is employed to verify the distribution states of and differences between clustering results. Finally, five taxi GPS datasets from Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China) are selected to test and verify the effectiveness of the proposed noise K-means clustering algorithm by comparing it with the fuzzy C-means, K-means, and K-means plus approaches. The comparative experimental results show that the noise algorithm can reasonably obtain the number of clusters and initialize the cluster centers, and the proposed noise K-means clustering algorithm demonstrates better clustering performance, accurately obtains clustering results, and effectively captures urban hotspots.


Introduction
Modern cities have become important engines and hubs driving social development. A city represents the most concentrated residence of people and the gathering place of social resources; both work and life are inseparable from urban support. In recent years, there has been a "big city disease", of which the most prominent phenomenon is urban congestion, which has become a nearly unavoidable problem for almost every large-scale city. Consequently, from the perspective of informatization and intelligence, people have successively used information technology to put forward digital cities and smart cities at the strategic level and have formulated construction schemes to meet the development needs of different cities, hoping to solve the challenges faced in the process of urban development and alleviate urban congestion. In particular, the application of the new generation of cloud computing, big data, the Internet of Things, and artificial intelligence technology has made urban operation more intelligent and has gradually become a reality, making rail transit and urban transportation more predictable and widely applied; however, a city is a densely populated area with a high concentration of both living and vehicle operation. The growth of the world's civil vehicle sales from 2010 to 2020 is shown in Figure 1.
Moreover, population flow is directly related to time, and urban congestion is still an important challenge for every city. The application of big data has served as a basic strategic digital resource in smart cities. Many researchers have analyzed the trajectory GPS data of transportation vehicles in order to mine the hidden information behind the data to reflect the urban operation status and define temporal and spatial change rules [1], in addition to use in traffic congestion status analysis [2][3][4][5][6][7], crowd movement distribution [8][9][10], traffic travel recommendation [11,12], road planning [13,14], urban hotspot discovery [15][16][17][18], and so on. Such research results are directly applied to the construction of a smart city to elucidate more reasonable urban road planning and a more reasonable dispersion of vehicle flow and human flow. Such research methods usually use machine learning algorithms (such as cluster analysis and feature learning) to capture vehicle trajectory patterns, including the origins and destinations (OD) [19][20][21], stops and moves (SM) [22,23], and moving objects (MO) [24,25], from the GPS data. Pongracic et al. [26] proposed a midlatitude Klobuchar correction model to correct the Klobuchar model for midlatitude users. Gu et al. [27] proposed a data-based methodology to estimate the traffic congestion of road segments between bus stops in order to improve the travel time reliability and quality of public transport services. Gao et al. [28] proposed a specific and accurate definition of traffic congestion to quantify the level of traffic congestion and constructed an image-based traffic congestion estimation framework based on a convolutional neural network. Afrin and Yodo [29] proposed a Bayesian network based on speed- and volume-related measures and a probabilistic congestion estimation approach. These models have been used to explain and discover urban operation states, crowd migration hotspots, and other urban operations.
In order to learn the valuable information hidden behind location data, a clustering learning algorithm is a common and simple method that is used in many studies. Cluster analysis, also known as group analysis, is not only a statistical analysis method for studying classification problems (of samples or indexes), but is also an important algorithm for data mining. The data for cluster analysis are composed of several patterns; usually, a pattern is a vector of measurements or a point in multi-dimensional space. Cluster analysis is based on similarity: patterns in one cluster have more similarity to each other than to patterns in other clusters. Clustering analysis algorithms can be divided into partition methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Among them, K-means clustering is the simplest, most widely used, and most computationally efficient clustering algorithm, but it faces three major problems for any given dataset: it is very difficult to find the appropriate number of clusters, optimize the clustering centers, and capture global clustering results.
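To make these three problems concrete, the plain K-means loop can be sketched as follows (a minimal NumPy sketch for illustration, not the implementation used in this paper; note that both K and the random initial centers must be chosen up front, which is exactly the source of the problems listed above):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: K and the initial centers must be supplied up front."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # random init
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([points[labels == i].mean(axis=0)
                                if (labels == i).any() else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Running this sketch with different seeds generally yields different partitions and different final SSE values, which is precisely the initialization sensitivity discussed above.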
For the past several decades, many researchers have proposed new ideas and methods. An improved K-means algorithm based on density (canopy K-means) has been proposed to solve the problem of determining the most suitable number of clusters and the best initial seeds [30]. An evolutionary K-means (EKM) method, based on combining K-means and a genetic algorithm, has been proposed to select parameters automatically through the evolution of partitions, solving the initialization problem of K-means [31]. The K-means++ algorithm has been proposed to quickly capture a better clustering center, addressing the sensitivity of the clustering process to the initial center [32]. Fuzzy C-means (FCM) has been proposed to solve the problem of clustering edge data attribution [33][34][35][36]. Intelligent optimization algorithms combined with the K-means algorithm have been proposed to effectively solve the global optimization of clustering and the sensitivity of the clustering center [37,38]. In addition, some researchers have also proposed improved K-means methods and new application scenarios. For example, deep representations extracted by deep learning techniques have been proposed to improve the clustering performance of K-means clustering [39]. A competing cluster center approach has been proposed to maximize the benefits of cluster centers [40]. Ma and Zhou [41] proposed a novel sharing-based niche genetic algorithm with an initial population based on hybrid K-means clustering in order to obtain the best chromosome and perform K-means clustering. Sun et al. [42] proposed a framework to differentiate between these two types of methods.
These K-means clustering algorithms have adequately realized clustering and have obtained clustering results in actual engineering applications; however, some shortcomings still exist, such as a low processing efficiency, difficulty in determining the number of clusters, sensitively initializing the cluster center, and so on. In order to solve these problems, a novel K-means clustering algorithm based on a noise algorithm, namely, a noise K-means clustering algorithm, is developed here in order to improve the processing efficiency of automatic clustering and avoid both excessive manual configuration of parameter uncertainty and clustering results falling into local optimums in this paper. The noise algorithm is employed to randomly enhance the attribution of data points and output the results of clustering by adding noise judgment in order to automatically obtain the number of clusters for the given data and initialize the center cluster. Four unsupervised evaluation indexes of DB, PBM, SC, and SSE, and the nonparametric Wilcoxon statistical analysis method are employed to evaluate and analyze the clustering results and verify the distribution states and differences. Finally, five taxi GPS datasets, including Beijing, Chongqing, San Francisco, Rome and Aracaju, are selected to test and verify the effectiveness of the proposed noise K-means clustering algorithm in comparison with fuzzy C-means, K-means, and K-means plus approaches. The innovations and main contributions of this paper are described as follows.

• A novel noise K-means clustering algorithm based on a noise algorithm is developed to capture urban hotspots.
• The noise algorithm is employed to randomly enhance the attribution of data points and output results of clustering by adding noise judgment to automatically obtain the number of clusters and initialize the cluster center.
• Four unsupervised evaluation indexes of DB, PBM, SC, and SSE are directly used to evaluate and analyze the clustering results.
• A non-parametric Wilcoxon statistical analysis method is employed to verify the distribution state and difference of clustering results.
• Comprehensive experiments are designed and executed to prove the effectiveness of the proposed noise K-means clustering algorithm with five sets of taxi GPS data.

The Idea of the Noise K-Means Clustering Algorithm
A K-means clustering algorithm is an iterative clustering analysis algorithm that has been regarded as an effective means to solve urban road planning problems by scholars for the past several decades. The algorithm has been widely used in the fields of document classification, customer classification, ride data analysis, criminal network analysis, the detailed analysis of call records, and so on. However, it is very difficult to determine the number of clusters, and the algorithm is sensitive to the initialization of the cluster center. Noise can be used to simulate noise phenomena in nature. Because of its continuity, if an axis in two-dimensional noise is taken as the time axis, the result is a continuously changing one-dimensional function. As such, in order to solve the existing problems of the K-means clustering algorithm and make use of the merits of the noise algorithm, a novel noise-based K-means clustering algorithm is proposed in this paper to obtain a better clustering center and capture urban hotspots. The proposed noise-based K-means clustering algorithm consists of three parts. Firstly, the noise algorithm is employed to randomly enhance the attribution of data points and the output result of clustering by adding noise judgment in order to automatically obtain the number of clusters of the given data and initialize the cluster center. Secondly, the K-means clustering algorithm is employed to optimize the generated clustering center; it is fused with the noise algorithm to form a novel noise-based K-means clustering algorithm. Finally, the proposed noise-based K-means clustering algorithm is used to obtain the clustering results for the given data and capture an excellent clustering center, i.e., the urban hotspots. Four unsupervised evaluation indexes of DB, PBM, SC, and SSE are directly used to evaluate and analyze the clustering results, and nonparametric Wilcoxon statistical analysis is employed to verify the distribution states and differences between clustering results.
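For reference, the nonparametric Wilcoxon comparison mentioned above can be carried out with SciPy's paired signed-rank test; the SSE values below are purely illustrative placeholders, not results from this paper:

```python
from scipy.stats import wilcoxon

# Hypothetical final SSE values from 20 paired, independent runs of two
# algorithms on the same dataset (illustrative numbers, not the paper's results)
sse_noise_kmeans = [101, 98, 99, 103, 97, 100, 102, 96, 99, 101,
                    100, 98, 97, 102, 99, 100, 101, 98, 97, 100]
sse_kmeans       = [110, 108, 112, 109, 111, 107, 113, 110, 109, 108,
                    112, 111, 110, 109, 113, 108, 110, 111, 109, 112]

# Paired signed-rank test on the per-run differences; a p-value below the
# chosen significance level (e.g., 0.05) indicates that the two algorithms'
# SSE distributions differ significantly.
stat, p = wilcoxon(sse_noise_kmeans, sse_kmeans)
```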

The Flow of the Noise K-Means Clustering Algorithm
The flow of the proposed noise K-means clustering algorithm is shown in Figure 2.


The Realization of the Noise-Based K-Means Clustering Algorithm
The detailed steps of the noise-based K-means clustering algorithm are described as follows.
Step 1. Set the clustering number K. According to [43][44][45][46], it is generally believed that the clustering number of a K-means algorithm lies between 2 and √N, where N represents the number of GPS data points. In this paper, more clusters are required to describe the distribution of urban hotspots, and the GPS data points are intensive data, so the clustering number was set within [√N/2, √N].
Step 2. Optimize the clustering number K using binary inversion. The clustering number K is converted from decimal to binary (the number of binary digits is obtained by rounding √N). Then, one digit of the binary number is randomly flipped to generate a new binary number. Finally, the binary number is converted back to decimal and a new K is obtained. The binary inversion of the solution is shown in Figure 3.
Step 3. Optimize the clustering center. Since the locations of the clustering center points are not fixed for the same clustering number K, in order to obtain a better initial distribution of the center points, the sum of squares for error (SSE) is used to evaluate and find the optimal clustering center:

SSE = Σ_{i=1}^{K} Σ_{j=1}^{N} ‖X_ij − X̄_i‖²

where K represents the number of clusters, N represents the number of taxi GPS data points in the cluster, and ‖X_ij − X̄_i‖² represents the squared error of GPS data point X_ij with respect to its cluster center X̄_i.
Step 4. Output the optimized clustering center and capture urban hotspots. The optimal clustering center found is set as the clustering center of the noise-based K-means clustering algorithm (Algorithm 1), which is used to solve the urban road planning problem in order to obtain the clustering results for the given taxi GPS data and thus capture an excellent cluster center, i.e., the urban hotspots.
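The binary inversion move of Step 2 can be sketched as follows (an illustrative Python sketch, not the authors' code; `n_bits` stands for the binary width discussed above):

```python
import random

def flip_one_bit(k, n_bits, rng=None):
    """Step 2 sketch: encode K in binary, flip one random digit, decode back."""
    rng = rng or random.Random()
    bits = list(format(k, f'0{n_bits}b'))      # decimal -> fixed-width binary
    i = rng.randrange(n_bits)                  # pick one digit at random
    bits[i] = '1' if bits[i] == '0' else '0'   # invert it
    return int(''.join(bits), 2)               # binary -> decimal: the new K
```

Because a flipped bit can push the new K outside the allowed interval, Algorithm 1 explicitly checks whether the binary-inversion result still falls within the valid range.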

It can be seen that the complexity of the noise-based K-means clustering algorithm is related to the number of data points in each dataset; it is not directly related to the number of noises or the number of iterations of the algorithm. As such, its time complexity is O(n²).

Algorithm 1. Noise-Based K-Means Clustering Algorithm
Input: Taxi GPS dataset and the number of taxi GPS data points, noise radius rate (which can also be generated randomly), the number of clustering iterations, and the clustering termination condition.
Output: Clustering results and the new clustering center.
1: Initialize the noise radius rate. Record the number of solutions (NS) of the current taxi GPS data.
2: Obtain the current optimal solution. Set the restart interval restart = √(iter_max × NS), rounded to an integer. // iter_max represents the maximum number of iterations
3: Record the basic number of noises NB. Determine the noise reduction rate in the iteration process: decrease = (r_max − r_min)/(iter_max/NB − 1). // r_max and r_min represent the maximum and minimum values of the noise radius, respectively
4: Randomly select an integer within [√N/2, √N] as the initial solution, evaluate it with SSE, and denote it as the optimal solution.
5: Set the maximum noise value rate. Determine whether the new solution generated by binary inversion falls within [√N/2, √N].
6: Generate a new random number within the noise radius range to produce Noise.
7: Select the current solution as the optimal solution when the difference between the current solution and the optimal solution is greater than Noise.
8: Set rate = 0 when the current iteration number is a multiple of 4NS. Set the optimal solution as the new solution when the current iteration number is a multiple of restart. Set rate = rate − decrease when the current iteration number is a multiple of NS.
9: Record the optimal solution and judge whether the iterations and the noise radius are completed.
10: Output the number of clusters and the initialization center of the given taxi GPS dataset.
11: Calculate the distance between each data point and the center points. Attribute each data point to the nearest cluster center. Assign the average of the data points in each cluster as the new clustering center.
12: Calculate the SSE. Determine the termination condition of clustering.
13: Output the clustering result, i.e., the urban hotspots.
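Steps 4-9 of Algorithm 1 behave much like a simulated-annealing search over candidate K values. A heavily simplified, illustrative sketch (uniform random moves stand in for the binary inversion, and the restart and rate bookkeeping is omitted) is:

```python
import random

def noise_search_k(evaluate_sse, k_min, k_max, iters=200,
                   r_max=1.0, r_min=0.0, seed=0):
    """Simplified sketch of the noise-driven search for the cluster number K.

    evaluate_sse(k) should run one clustering pass and return its SSE.
    The noise radius shrinks over the iterations, so worse candidate K values
    are accepted early (escaping local optima) but rejected later.
    """
    rng = random.Random(seed)
    best_k = rng.randint(k_min, k_max)
    best_sse = evaluate_sse(best_k)
    current_k, current_sse = best_k, best_sse
    radius = r_max
    decrease = (r_max - r_min) / iters
    for _ in range(iters):
        k_new = rng.randint(k_min, k_max)     # stand-in for the binary inversion move
        sse_new = evaluate_sse(k_new)
        noise = rng.uniform(0, radius) * current_sse  # random noise allowance
        # accept if better, or if not worse than the current noise allowance
        if sse_new < current_sse + noise:
            current_k, current_sse = k_new, sse_new
        if current_sse < best_sse:
            best_k, best_sse = current_k, current_sse
        radius = max(r_min, radius - decrease)  # anneal the noise radius
    return best_k
```

With a well-behaved SSE curve over K, the annealed noise lets the search wander early and then settle on the K with the lowest observed SSE.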

Obtain the Clustering Number K Value and the Initial Center
The noise algorithm is used to obtain the clustering number K value and the initial center point under the given optimization objective (such as SSE). The current solution and the optimal solution are compared with the added noise, so that the data point attribution and the clustering output results have a certain randomness. Moreover, the search returns to the current optimal solution at a certain interval of iterations, or the optimal solution is refined "noise-free" at a certain interval of iterations.

Optimize Clustering Center
The K-means plus algorithm proposed by Arthur [32] is used to address the sensitivity of K-means to its initial clustering center, and its computational complexity is O(log K). In order to obtain a better clustering number and clustering center, the K-means algorithm may be integrated with the noise algorithm to optimize the clustering center.
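The K-means++ seeding referred to here spreads the initial centers out by sampling each new center with probability proportional to its squared distance from the nearest already-chosen center; a minimal sketch (not the paper's implementation) is:

```python
import numpy as np

def kmeanspp_init(points, k, seed=0):
    """K-means++ seeding sketch: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]   # first center: uniform
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min(((points[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        probs = d2 / d2.sum()
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```

Compared with purely random seeding, this makes it far less likely that two initial centers land inside the same dense region of GPS points.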

Obtain Clustering Result and Capture Excellent Cluster Center
K-means clustering is an unsupervised partition clustering algorithm. It takes distance as the standard of similarity measurement between data objects; that is, the smaller the distance between data objects, the higher their similarity, and the more likely they are to gather in the same cluster. In this paper, the Euclidean distance is used to calculate the distance between data objects, and the SSE is used to evaluate the clustering results.
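The Euclidean assignment rule and SSE evaluation described above can be written compactly as follows (a minimal NumPy sketch):

```python
import numpy as np

def assign_and_sse(points, centers):
    """Assign each point to its nearest center (Euclidean distance) and
    return the labels together with the SSE of the resulting partition."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(len(points)), labels].sum()  # sum of squared errors
    return labels, sse
```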

Urban Taxi GPS Data
In order to verify the effectiveness of the proposed noise K-means clustering algorithm, five taxi GPS datasets from China and abroad are used, as shown in Table 1. Taxi GPS data mainly refer to the vehicle position, direction, and speed information that is regularly recorded by the vehicle via the on-board global positioning system during travel. At present, taxi GPS datasets exist for many cities across the world. The cities in this study, i.e., Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China), are representative large-scale cities from different continents and countries. The DB index [47], PBM index [48], SC (silhouette coefficient) [49], and SSE (sum of squares for error) are directly used to evaluate and analyze the clustering results. These evaluation methods are directly related to the number of clusters. Among them, the DB index is mainly used to evaluate the performance of the noise K-means clustering algorithm: the smaller the DB index, the lower the similarity between clusters and the better the clustering result. The PBM index is also used to evaluate the quality of the clustering structure, describing clustering results and object attribution by defining quality: the higher the PBM index, the better the clustering effect. SC evaluates the clustering results through cluster cohesion and separation: the greater the SC value, the better the clustering effect. SSE is used to evaluate the error probability distribution state and object attribution in clustering: the smaller the SSE, the better the clustering effect. The evaluation results directly affect the effectiveness of urban hotspot discovery.
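Two of these indexes, DB and SC, are available directly in scikit-learn; a small sketch on synthetic two-blob data (illustrative only; PBM and SSE would be computed separately) is:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
# two synthetic, well-separated "hotspot" blobs standing in for clustered GPS points
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

db = davies_bouldin_score(pts, labels)  # lower is better
sc = silhouette_score(pts, labels)      # higher (closer to 1) is better
```

For well-separated clusters such as these, DB stays well below 1 and SC approaches 1, matching the interpretation of the two indexes given above.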
In Table 1, the GPS points are distributed in city areas and some are hotspots; however, it is very difficult to capture the hotspots from GPS datasets.

Experimental Environment and Parameter Setting
The experimental environment based on VMware featured the following: Intel Xeon E5-2658, dominant frequency 2 × 2.10 GHz, 8 GB RAM, and Windows Server 2008; the algorithm was coded in MATLAB 2016b. MATLAB is commercial mathematical software produced by MathWorks (USA) that supports matrix operations, function and data plotting, algorithm implementation, user interface creation, interfacing with programs written in other languages, and so on. It is used in data analysis, wireless communication, deep learning, image processing and computer vision, signal processing, robotics, control systems, and other fields. In our experiment, alternative values were tested and adjusted for some functions to obtain the most reasonable initial values of the parameters. The selected parameter values yield the optimal solution within a reasonable running time. The parameter settings of FCM, K-means, K-means plus, and noise K-means are shown in Table 2. The number of clustering iterations was 200, and each algorithm ran 20 times independently. Generally, the evaluation results will be directly affected when a smaller clustering number is set for a comparison algorithm. FCM selected the clustering center according to the fuzzy parameters; K-means and K-means plus selected the clustering center randomly; noise-based K-means could obtain the clustering number and the initialization of the clustering center automatically. At the same time, the clustering numbers of noise-based K-means, FCM, K-means, and K-means plus were the same, and the clustering evaluation results of the corresponding clustering algorithms were also similar.

Experimental Results and Comparison Analysis
The comparison results of the maximum, average, and minimum values of noise K-means, FCM, K-means, and K-means plus under the evaluation of SC, PBM index, DB index, and SSE for the taxi GPS data are shown in Tables 3-6.
As can be seen from Tables 3-6, the proposed noise K-means clustering algorithm obtains the clustering number of a given GPS dataset, which effectively improves the clustering effect and makes it much easier to find urban hotspots. It also obtains better clustering evaluation results and captures the urban hotspots, which more effectively reflect the urban operating state under the different clustering evaluation methods for the given GPS datasets. As can be seen from Table 3 for SC, the noise-based K-means clustering algorithm has better performance, which indicates that it can better capture excellent clustering centers (urban hotspots). As can be seen from Table 4 for the PBM index, the clustering results of noise K-means and K-means plus performed better on the taxi GPS data. As can be seen from Table 5 for the DB index, K-means, noise K-means, and K-means plus all performed well on the taxi GPS data, and there is little difference in the overall evaluation value. As can be seen from Table 6 for SSE, noise K-means performs very well on all five taxi GPS datasets, because SSE is the optimization target in the clustering process of each clustering algorithm. The average running time for each algorithm when running 20 times is shown in Table 7.
As can be seen from Table 7, the average iteration time of the noise K-means clustering algorithm is longer, because the noise algorithm needs to capture the clustering number of a given dataset and initialize the clustering center. The noise K-means clustering algorithm can obtain a better clustering number for the given GPS data, making it much easier to find urban hotspots. It can also obtain better clustering evaluation results and capture the urban hotspots.
The SSE convergence curves for the FCM, K-means, K-means plus, and noise-based K-means methods are shown in Figures 4-8.

In order to more intuitively display the capture of urban hotspots by clusterin rithm, a visual presentation in Amap system is used in this paper. That is, the ca As can be seen from Figures 4-8, all comparative clustering algorithms, except for FCM, could complete the convergence by iterating about 10 times, which shows that the proposed noise-based K-means clustering algorithm is feasible and suitable for basic partition clustering algorithms.

Visual Presentation of Urban Hotspots
In order to more intuitively display the capture of urban hotspots by the clustering algorithms, a visual presentation in the Amap system is used in this paper. That is, the captured cluster center location information is input into the map system through the Amap API in order to realize the visual presentation of the captured urban hotspots. The obtained experimental results for Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China) are shown in Figures 9-12.
As can be seen from Figures 9-12, the spatial distributions of the urban hotspots captured by the FCM, K-means, and K-means plus clustering algorithms for the given taxi GPS data of Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China) are different. The aggregation degree of taxi operation varies significantly between different cities. From the city hotspot marking results of the taxi GPS data in the five cities obtained by FCM in Figure 9, it can be seen that the clustering effect is not ideal and that there are serious dispersion and local optimum phenomena.
For example, the clustering results for San Francisco (USA) demonstrate local optimum phenomena, and multiple hotspots are close together, which results in an uneven distribution of urban hotspots and deviation from the true hotspots; thus, a road planned with this approach could not benefit many people. From the city hotspot marking results of the taxi GPS data in the five cities obtained by K-means clustering in Figure 10, it can be seen that the clustering centers and numbers of clusters are not very reasonable and do not show optimal results. There are multiple local optimum phenomena, such as the serious phenomenon of multiple hotspots being close together in Rome (Italy) and Beijing (China), and some marker points are not clustered at all, so a planned road could not benefit a large number of people. As can be seen from the city hotspot marking results of the taxi GPS data in the five cities obtained by K-means plus clustering in Figure 11, the obtained clustering effect is better than that obtained with FCM and K-means clustering. The clustering centers and numbers of clusters in most cities are optimal, but there is a local optimum phenomenon in Rome (Italy), where multiple hotspots appear together; a road planned with this approach could still benefit many people. As can be seen from the city hotspot marking results of the taxi GPS data in the five cities obtained by the noise-based K-means clustering in Figure 12, the obtained clustering effect is better than that obtained with FCM, K-means, and K-means plus clustering. The clustering centers and numbers of clusters for all five cities are the best here, and there are no groups of hotspots that are close together. The information reflected by these urban hotspots denotes shopping points, parks, stations, amusement areas, and other public places, which means that roads planned accordingly would benefit the most people.
In summary, the compared experiment results show that the noise-based K-means algorithm can reasonably obtain the number of clusters and initialize the cluster center, and the proposed algorithm shows better clustering performance and accurately obtains clustering results, in addition to effectively capturing urban hotspots.

Wilcoxon Statistical Analysis
A Wilcoxon rank sum test is a non-parametric null hypothesis statistical testing method [50] that is often used to test the significant difference and distribution state of a clustering training process. When statistical validation is performed, the output usually consists of p, h, and stats, where p is computed from the clustering evaluation results u and v. Assuming that the data samples follow a continuous distribution, p tests u and v under the null hypothesis for each pairing (noise-based K-means vs. FCM, noise-based K-means vs. K-means, noise-based K-means vs. K-means plus). As p→0, the difference between u and v becomes more obvious. h is a logical value of 0 or 1: h = 1 means that the null hypothesis is rejected at significance level α, and h = 0 means that the null hypothesis cannot be rejected at α (for example, α = 0.05, where α is the significance level parameter, value range: 0 < α < 1). That is, h = 1 means that the difference between u and v is significant, while h = 0 means that the difference between u and v is not significant. stats consists of two statistics, zval and ranksum: zval is the value of the test statistic under the normal approximation, while ranksum is the rank sum statistic [51]. The statistical analysis results of the Wilcoxon rank sum testing are shown in Tables 8-10. As can be seen from Tables 8-10, there are significant differences among noise-based K-means, FCM, and K-means clustering, which indicates that they are distributed differently in space; however, the differences between noise-based K-means and K-means plus clustering are not significant, especially in Rome (Italy), where the test fails to reach the significance level, indicating that their spatial distributions are similar. Furthermore, as p→0 for each group, it indicates that it is feasible and effective to automatically capture the cluster number and initialize the cluster center by use of a noise algorithm, and that the urban hotspots captured by the noise-based K-means clustering algorithm more effectively represent reality.
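The quantities p, zval, and ranksum described above can be sketched with a minimal stdlib implementation of the two-sided rank sum test under the normal approximation (in practice a library routine such as scipy.stats.ranksums would be used; the tie correction to the variance is omitted here for brevity):

```python
import math

def rank_sum_test(u, v):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (zval, ranksum, p). Tied values receive average ranks;
    the approximation is reasonable for samples of roughly 10+ points.
    """
    n1, n2 = len(u), len(v)
    pooled = sorted(list(u) + list(v))
    # Assign 1-based ranks, averaging over ties.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        ranks.setdefault(pooled[i], avg_rank)
        i = j + 1
    w = sum(ranks[x] for x in u)            # ranksum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd                     # zval
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, w, p
```

The decision rule h then follows directly as h = 1 if p < α and h = 0 otherwise, matching the interpretation in the paragraph above.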

Conclusions
In this paper, a novel noise-based K-means clustering algorithm has been proposed to effectively solve the problems of difficulty in determining the number of clusters and the sensitivity to initialization of the cluster centers in the K-means clustering algorithm. The noise-based K-means clustering algorithm has been applied to capture urban hotspots in large cities from across the world. When the clustering operation was completed, the clustering results were evaluated by the DB index, PBM index, SC, and SSE, and the experimental results for each evaluation standard were statistically analyzed by Wilcoxon rank sum testing to obtain the significant differences in the urban hotspot distribution for each clustering algorithm. The proposed noise-based K-means clustering algorithm obtained better optimal results for urban hotspots than the FCM, K-means, and K-means plus methods. The method presented here can thus better serve a large number of people in large cities. In addition, the proposed noise-based K-means clustering algorithm can also be applied in the fields of document classification, customer classification, ride data analysis, criminal network analysis, the detailed analysis of call records, and so on.
There are also some shortcomings in the method presented here. On the one hand, the amount of GPS data is too small to effectively reflect the distributions and relationships of urban hotspots. On the other hand, it is difficult to effectively avoid specific buildings in the city, even when the optimal urban hotspot results are used in urban road planning. These problems will be addressed in future work.