A Novel K-Means Clustering Method for Locating Urban Hotspots Based on Hybrid Heuristic Initialization

: With rapid economic and demographic growth, trafﬁc conditions in medium and large cities are becoming extremely congested. Numerous metropolitan management organizations hope to promote the coordination of trafﬁc and urban development by formulating and improving traf-ﬁc development strategies. The effectiveness of these solutions depends largely on an accurate assessment of the distribution of urban hotspots (centers of trafﬁc activity). In recent years, many scholars have employed the K-Means clustering technique to identify urban hotspots, believing it to be efﬁcient. K-means clustering is a sort of iterative clustering analysis. When the data dimensionality is large and the sample size is enormous, the K-Means clustering algorithm is sensitive to the initial clustering centers. To mitigate the problem, a hybrid heuristic “fuzzy system-particle swarm-genetic” algorithm, named FPSO-GAK, is employed to obtain better initial clustering centers for the K-Means clustering algorithm. The clustering results are evaluated and analyzed using three-cluster evaluation indexes (SC, SP and SSE) and two-cluster similarity indexes (CI and CSI). A taxi GPS dataset and a multi-source dataset were employed to test and validate the effectiveness of the proposed algorithm in comparison to the Random Swap clustering algorithm (RS), Genetic K-means algorithm (GAK), Particle Swarm Optimization (PSO) based K-Means, PSO based constraint K-Means, PSO based Weighted K-Means, PSO-GA based K-Means and K-Means++ algorithms. The comparison ﬁndings demonstrate that the proposed algorithm can achieve better clustering results, as well as successfully acquire urban hotspots.


Introduction
Cities have traditionally been recognized as the primary drivers of economic activity and play a crucial role in human society.By evaluating the spatial distribution pattern of human economic and social activities, we can determine that humankind has entered a citycentric era [1].The subject of "Urban Problems" has become increasingly prominent due to the rapid expansion of the size and population of major cities around the world [2].The fundamental source of "urban problems" is the conflict between limited local resources and the needs of numerous city residents [3].The typical phenomenon is traffic congestion [4].Faced with the most challenging urban traffic problems, many international metropolitan management organizations hope to promote the coordination of traffic and urban development by formulating and improving traffic development strategies.The effectiveness of these strategies essentially depends on an accurate understanding of urban hotspots distribution [5].Urban hotspots (also called human activity centers) generally have higher traffic densities than other city areas [6].
Urban hotspots are usually the clustering centers of numerous traffic trajectory points that affect population flow, movement patterns and human interaction.The urban road network's complexity and surrounding space environment make it challenging to obtain urban hotspots.The widespread application of information technologies in the cities, such as GPS navigation, provides high-resolution Spatio-temporal perception [7].In particular, GPS navigation services are widely used in vehicles, making it possible to accurately perceive the movement of many vehicles simultaneously in time and space [8].The emergence of GPS trajectory data provides a new approach and method to solve the problem of urban traffic optimization [9].As ordinary day-to-day GPS trajectory data, taxi track data not only reflects urban traffic conditions but also records people's daily travel information [10].How to mine hidden knowledge from massive trajectory data has become an important research topic [11,12].
Many researchers have analyzed the GPS data of vehicle trajectory and drilled the connotative information behind these data, straining to find the spatial-temporal variation models reflecting vehicles distribution [13].These models are applied to public transportation optimization analysis [14], urban population spatial and temporal dynamic distribution analysis [15], private car navigation real-time optimal path analysis [16], municipal road planning [17], urban hotspots searching [18,19], etc.Many research results have been applied to the practice of smart city construction, achieving good results in the balance of urban traffic flow and human flow load, and playing a pivotal role in emergency treatment.
These researchers use many machine learning methods based on GPS data sets, such as clustering analysis algorithm, time series algorithm, deep learning algorithm, reinforcement learning algorithm, and so on.Mariescu-Istodor and Fränti [20] proposed a grid-based method for GPS route analysis for retrieval that achieves efficient route search under noisy and variable sampling rate conditions.Zhang et al. [21] proposed a hybrid method for incremental extraction of urban road network from GPS data set, which can mine cumulative change patterns of road network.Shafabakhsh et al. [22] proposed a spatial kernel density calculation method based on GPS data to provide decision support for the allocation of medical emergency resources.Wang et al. [23] proposed a trajectory clustering method based on adaptive distance and hierarchical clustering based on the optimal cluster number to analyze similarities and anomalies in taxi GPS trajectory.Zhang et al. [24] proposed combining fuzzy C-means clustering with quantitative spatial regression to comprehensively analyze GPS data and establish the relationship model for improving the traffic service level from the perspective of time and space.
To extract the important information and mine the knowledge concealed in the data, a clustering algorithm is typically employed for preliminary data analysis [25,26].The clustering algorithm is a fundamental and effective unsupervised learning method widely used in many fields.Clustering algorithms divide data set into different clusters according to a fixed standard (distance, similarity, etc.) [27], making the similarity of objects in the same cluster as ample as possible and the distinction of objects in the different clusters as large as possible [28].These unsupervised clustering algorithms can be broadly divided into the following six categories: partition-based clustering, density-based clustering, model-based clustering, hierarchical clustering, grid-based clustering, and graph-based clustering [29].Among the above algorithms, the clustering based on partition represented by the K-means algorithm has the advantages of simple structure, high efficiency, easy convergence and muscular universality.As one of the mature clustering algorithms, the K-means algorithm has achieved significant application effects in many fields.But it has disadvantages such as sensitivity to noise and outliers, difficulty in selecting cluster numbers, sensitivity to the initial clustering center, etc.
Numerous scholars have presented many ideas and approaches to locating the ideal cluster centers.The K-means++ algorithm was designed to weaken the K-means algorithm's sensitivity to the initial clustering centers; the algorithm's fundamental idea is to maximize the distance between the original clustering centers [30].Krishna and Murty [31] combined the genetic algorithm with the K-means algorithm to obtain the Genetic K-means algorithm (GAK) converged to the global optimum.The K-means algorithm execution is regarded as the K-means operator viewed as the search operator to replace crossover; Simultaneously, a kind of clustering-specific bias mutation operator is defined, called a distance-based mutation.Lu et al. [32] proposed a new clustering algorithm called Fast Genetic K-means Algorithm (FGAK) inspired by GAK, always converging to the global optimum eventually and running much faster.Islam et al. [33] advanced a previous genetic-searching approach called GenClust, with the intervention of fast hill-climbing of K-means and obtain an algorithm that is faster than its predecessor.Dowlatshahi and Nezamabadi-Pour [34] adapt the structure of stochastic population-based Gravitational Search Algorithm for effectively solving the multivariate data clustering problem.Some scholars have proposed that the combination of nearest-better neighborhood and fuzzy PSO is statistically superior to multi-mode optimization, indicating that the hybrid algorithm has advantages [35].
These improved K-means clustering algorithms achieved good results but still have disadvantages, such as excessive dependence on hyperparameters [36], over-sensitivity to outliers, and sensitivity to cluster center initialization.Toward alleviating these problems, this paper proposes a novel K-means clustering algorithm based on hybrid "fuzzy systemparticle swarm-genetic" to automatically obtain the optimal initial centers, namely the Novel FPSO-GAK clustering algorithm.This algorithm can enhance the processing effect of automatic clustering, avoid too much uncertainty of manual configuration parameters and clustering results falling into local optimum.The initialization operation of the K-means++ was used to obtain a relatively optimal cluster center points group as one initial particle of the Particle Swarm Optimization (PSO).Then, the parameters of the PSO algorithm were optimized by a fuzzy system and the GA algorithm was implemented to search for better clustering centers.Finally, the optimal clustering center points were found after several iterations.Three unsupervised evaluation indexes, the Separation (SP), Silhouette coefficient (SC) and the Sum of Squared Error (SSE), as well as two cluster similarity indexes Centroid index (CI) and Point-level centroid similarity index (CSI), were used to evaluate the clustering results.Five taxi GPS datasets were selected and compared with GAK, PSO based K-Means (PSOK), PSO based constraint K-Means (PSOCK), PSO based Weighted K-Means (PSOWK), PSO-GA based K-Means (PSO-GAK), and K-Means++ algorithms to verify the validity of the proposedalgorithm.
The innovations and main contributions of this paper are as follows: • A novel FPSO-GAK algorithm based on hybrid "fuzzy system-particle swarm-genetic" was designed to obtain excellent non-homogeneous search capability.

•
The PSO algorithm, fuzzy system algorithm, and genetic algorithm were organically combined to obtain the optimal initial clustering centers.• Three unsupervised evaluation indexes (SP, SC and SSE) and two cluster similarity indexes (CI and CSI) were employed to evaluate and analyze the clustering results.

•
The comprehensive experiment was designed and implemented to verify the effectiveness of the novel FPSO-GAK clustering algorithm.

A Novel FPSO-GAK Clustering Method
2.1.The Idea of the Novel FPSO-GAK Method K-means is an exclusive distance-based clustered partitioning algorithm that is implemented iteratively [37].Scholars have regarded it as an effective means to discover urban functional areas in the past decades [38,39].The K-means clustering problem on n points is NP-Hard for dimension d ≥ 2 [40], and it is a demanding but unavoidable problem to determine the optimal clustering centers accurately.The acquisition of the optimal cluster centers depends greatly on the initial cluster centers, which is also an NP-hard problem.In practice, one of the effective methods of approaching NP-hard problems is the heuristic method.Many researchers have applied heuristic algorithms to initial cluster centers [41], exposing the issues that one single heuristic algorithm is prone to fall into local optimization and premature convergence in the process of searching.These problems are caused mainly by the homogeneity of the search strategy of one single heuristic algorithm.The proposed algorithm is formed by a suitable embedded combination of PSO, GA heuristic algorithm and fuzzy system, which maintains the advantages of each algorithm and overcomes local optimization and premature convergence to a great extent.In order to solve the problems existing in the K-means clustering algorithm and make use of the advantages of the hybrid heuristic algorithm, this paper proposes a new K-means clustering algorithm which can obtain better clustering centers and capture urban hotspots under the premise of given cluster number.
The proposed noevel FPSO-GAK algorithm consists of three parts.Given the number of clusters K, perform the following steps: Firstly, the k-means++ initialization method is employed to obtain K initial cluster center points and treat them as one particle.Perform this process for the provided number of rounds to obtain the desired number of particles.Secondly, particles are put into the PSO algorithm.The PSO algorithm is employed to update the particle information and use the fuzzy system to update the parameter Settings of the PSO algorithm; the cycle is executed alternately until the iteration criteria are satisfied.Then, the optimized particles are put into the GA algorithm for optimization.In this stage, the GA, fuzzy, and PSO algorithms are executed alternately until the iterative convergence criteria is satisfied.Finally, the initial clustering center data corresponding to the optimal particle extracted in the above steps are input into the K-means clustering algorithm to cluster the given data.The excellent clustering centers, namely the urban hotspots, are captured.

The Flow of the Novel FPSO-GAK Clustering Method
The flow of the Novel FPSO-GAK clustering algorithm is shown in Figure 1.

The Realization of the Novel FPSO-GAK Clustering Method
The detailed execution steps of the proposed clustering algorithm (pseudo-code is shown in Algorithm 1) are as follows.

Input:
The taxis GPS data set and the number of the data points; The population size of PSO alogrithm, N swarm ; The termination condition and/or The maximum number of clustering iterations; Output: Clustering center points and Clustering results.
1: The population size is set as N swarm (number of particles).And the initial velocity of particles is randomly generated within the given [ν min , ν max ]; the initial velocity of different particles is different.N swarm rounds of initialization operations of K-Means++ are performed in the GPS data set, and each round's initial cluster center points are taken as a particle.2: The fitness value of each particle is calculated and sorted in descending order to find out the optimal fitness value Pbest of each individual and the optimal population value Gbest.3: Operate the fuzzy system, put the calculation results of Formulaes ( 1) and ( 2) into the fuzzy system, and calculate the values of ω, c 1 , and c 2 .The above output results control the inertia weight and the particle velocity and change the particle position distribution.4: Update the velocity and position of the particles according to Formula (4a,b).5: Repeat Step 1 to Step 4 until the termination conditions are met.The fitness of each particle is calculated, and the previous best even number of particles are stored in memory for subsequent genetic algorithm operations.6: The results of Step 5 are manipulated by the genetic algorithm, including crossover, mutation, and elite operation.In particular, genetic selection has been performed in step 5, directly utilizing roulette to pick individuals for the crossover operation, but without genetic rearrangement operation.7: Update the velocity and position of the particles according to Formula (4a,b).8: Determine whether the termination condition is met.If true, output the optimal particle and proceed to Step 9. Otherwise, return to Step 6 until the termination condition is met.9: Run the K-means algorithm.
step.1 The initial parameter setting According to the literature [42][43][44][45], one of the most remarkable features of PSO algorithms is their speed.As per experience, the initial population should range from 50 to 1000; while the convergence will be better with a larger initial population, a population that is too large will also affect the algorithm's speed.In our experiment, the initial population N swarm was between 100 and 200.N swarm rounds of initialization operations of K-means++ were performed in the GPS data set, and each round's initial center points were taken as one particle.Many researchers believe that the clustering number of the K-means algorithm is usually between 2 and √ N, where N is the number of data points in the data set [46,47].The urban hotspots must be depicted at a specified density; the initial clustering number was set to around N 8 .

step.2 Optimizing the initial clustering center points
The fitness and DN values of each particle were calculated for the fuzzy operation.The fitness function, also known as the evaluation function, is a criterion determined by the objective function to assess the quality of individuals within a group.It is always non-negative, the greater the value, the better it is.The fitness function design should be as straightforward as feasible to reduce computation time complexity.The fitness function f in this paper is as follows: where, K represents the number of clusters, J means the number of the point that belong to cluster i.The following formula for DN is used to change the fuzzy system and control the maximum invariant fitness values: where PG and CG indicate the optimal population extremum/individual extremum, respectively, f denotes fitness value of a particular particle, N swarm means the population size of the particle swarm.As found in related literature [48,49], the fitness value of particle swarm is normalized as the parameter of the fuzzy system according to the following formula: Then the particle optimization parameters ω, c 1 , and c 2 are calculated by the fuzzy system according to the fuzzy rules in the Table 1 and Mamdani fuzzy inference algorithm.After a PSO routine runs once, the values of P best and G best are updated according to the sorted fitness values of each particle.Particles update velocity ν id and position z id using the following formula: where ω denotes the inertia weight, c 1 and c 2 are two positive learning factors, r 1 and r 2 are random functions to generate uniformly distributed random values in the range [0, 1).The fuzzy system updates the parameter settings of the PSO algorithm, and the PSO algorithm is employed to update the particle information; the cycle is executed alternately until the iteration conditions are satisfied.The optimized particles obtained in the previous stage are sorted in descending order, and a certain number of the top particles are selected to enter the subsequent genetic algorithm.The top 30% of particles were picked for this paper.Genetic selection operations were performed using roulette algorithms.The gene rearrangement technique based on cosine similarity was used to deal with chromosomes of unequal length and different shapes.According to the fitness value, the following adaptive crossover probability P c formula is designed to realize the genetic single-point crossover operation: where f max represents the maximum fitness value in the current population; f avg represents the average fitness value in the current population; f represents the more significant fitness value of the two chromosomes that participated in crossover operation.The crossover operation is performed when the P c value is greater than the roulette pick probability.Formula 5 works as follows: when the fitness value of chromosomes involved in the crossover is small, crossover operation must be performed; otherwise, the probability of crossover operation is low.Then, the adaptive mutation operation is carried out.Genetic variation is the operation of gene mutation on chromosomes in a population to search for a better solution.Usually, small changes are made to the chromosomes of the population, such as mutating the chromosomes with lower fitness to expand the search space and prevent falling into local optimality.As with crossover operations, a mutation probability is required to perform chromosome mutation operations.This paper obtains the adaptive variation probability P m by using the following formula: where ξ 1 and ξ 2 are gene mutation coefficients; f max represents the maximum fitness value in the current chromosomes population; f avg represents the average fitness value in the current chromosomes population; f represents the fitness value of chromosomes involved in mutations.
After performing a genetic algorithm operation, the generated particle's fitness value and DN value are calculated to start the fuzzy system.The fuzzy system calculates out the particle optimization parameters ω, c 1 , and c 2 parameter.The P best and G best are updated according to the fitness of each particle; then, the particle swarm optimization algorithm is activated to update the particles velocity and position information.Genetic mutation operation, fuzzy system operation and particle swarm operation are performed alternately until iteration conditions are satisfied.Then, the particle with the highest fitness is selected as the initial clustering center of the subsequent K-means algorithm.
step.3 Capturing the optimal urban hotspots On the premise that superior initial clustering centers were obtained, the K-means algorithm can converge to the final outcome more quickly and accurately.Especially when the number of clustering data points is enormous and the dimension of data is high, the initial clustering centers have a substantial influence on the K-means algorithm's efficiency.We used the particle with the highest fitness value acquired in Step 2 as the initial clustering centers of the k-means algorithm to analyze the GPS data set of urban taxis.The final clustering centers of taxi GPS data are the urban hotspots we are searching for.

Synthetic Data Test
To validate the effectiveness of the proposed algorithm, a synthetic dataset was employed for testing.The synthetic dataset (Figure 2) contains 1000 data points divided into 120 clusters with random distribution of centroids, and the data variance within the clusters is 15.There are various enhanced versions of the K-Means algorithm now available.The proposed algorithm was compared with the state-of-the-art Random Swap clustering algorithm (RS) [50], a lightweight and efficient clustering algorithm with linear time complexity depending on the size of the data.The proposed algorithm contains meta-heuristic type methods whose time complexity will be higher than the time complexity of K-means arithmetic.According to the analysis in Section 3.5, it knows that the proposed algorithm's time complexity shows a basically linear relationship with the number of data.It is estimated that the time consuming of the proposed algorithm is about 20 to 50 times that of the random-swap clustering algorithm.At the cost of time complexity, the non-homogeneous search capability of the proposed algorithm can achieve better clustering results on medium and large data sets.Thirty clustering operations were performed on the synthetic data using the proposed algorithm and random swap clustering algorithms according to the accuracy convergence condition and a fixed number of iterations, respectively (the number of clusters for each clustering operation was selected randomly between 70 and 90).The average SC, SP, SSE and ACI (reference to Section 3.6) indexes were used to evaluate the results of clustering, as shown in Table 2. Table 2 demonstrates that the state-of-the-art Random Swap clustering algorithm is an effective clustering algorithm that achieves excellent results with respect to all clustering indexes.The ACI index of FPSO-GAK and Random Swap Clustering algorithms is greater than 0.15, indicating a substantial difference in their clustering-level similarity.The proposed algorithm slightly outperforms the Random Swap Clustering algorithm in many indexes except the time-consuming index.The time complexity of both algorithms shows a linear relationship with the amount of data, but the proposed algorithm will take an order of magnitude more time-consuming than the Random Swap Clustering algorithm.Considering the rapid increase of computing power nowadays, it is acceptable in applications to sacrifice a certain time loss to obtain the improvement of the clustering effect.In the subsequent test and analysis of the real dataset, we will further conduct a comparative analysis of the proposed algorithm with similar types of heuristics algorithms.

Taxis GPS Data Set
For the purpose of verifying the performance of the novel algorithm, five GPS data sets were used as shown in Table 3. GPS data sets typically contain vehicle ID numbers, longitude values, latitude values, altitude values, timestamps, GpsSpeed, etc., which are periodically recorded by the onboard GLOBAL positioning system (GPS).Currently, public transport authorities in many cities have accumulated taxi GPS data sets for several years.In this study, we focused on GPS data sets from representative metropolitan areas in several countries, i.e., Aracaju (Brazil), Beijing (China), Chongqing (China), Rome (Italy), and San Francisco (USA).Data set Aracaju is a taxi GPS data set from Aracaju city in Brazil, which belongs to the standard data set in UCI Machine Learning Repository.Data set Beijing (China) is a taxis GPS data set from downtown Beijing, China, which is from the well-known Geolife Trajectories project [51,52].Data set Chongqing (China) is a taxis GPS data set from urban area Chongqing, China.Data set Rome (Italy) is a taxi GPS data set from Rome city, which belongs to the data set in Crawdad and contains GPS data of about 320 taxis' trajectories [53].Data set San Francisco (USA) is a taxi GPS data set from San Francisco, which belongs to the data set in Crawdad and contains GPS data of about 500 taxis' trajectories [53].The data sets have been cleaned beforehand; just a subset of the original data sets has been extracted, and a few records between 1:00 and 6:00 p.m. have been eliminated.In addition to the taxi vehicle GPS dataset, we validated the proposed algorithms using the multi-source trajectory dataset.The multi-source trajectory dataset is derived from the Routes 2019 subset of the MOPSI dataset (http://cs.uef.fi/mopsi/data,accessed on 5 July 2022), which contains approximately 3 million trajectory data points [54].Trajectories encompass a variety of activities, such as walking, biking, hiking, jogging, driving, taking the bus, and others.Some of these trajectory points record life routines, such as going to work or shopping.The dataset is particularly suitable for migratory pattern mining, person activity identification, and location-based similarity.In the experiments of this paper, we select multi-source trajectory data points within the urban area for analysis, mainly including trajectory point data of vehicles, walking, jogging and biking.The spatial distribution of the multi-source data trajectory points is shown in Figure 3.

Experimental Environment and Parameter Setting
The experimental environment based on a personal computer featured the following: Intel I5-8250U, dominant frequency 4 × 1.60 GHz with 8G RAM, Windows 10 operating system, and the algorithm routines were coded in MATLAB 2018b.The taxi GPS data sets contained subsets Aracaju, Beijing, Chongqing, Rome and San Francisco were used as experimental data sets.In each algorithm routine, clustering performance was evaluated using SC, SP, and SSE indicators, which paid particular attention to SC index.The above algorithm routines were trained 20 times independently in the experiment.The clustering numbers to GAK and K-means++ algorithms were initialized to 100.Then the parameters were adjusted respectively within √ n according to the amount of data sets to acquire the best experimental results.The initial parameter setting of PSO and fuzzy system are shown in Table 4.
For the PSO algorithm: the inertia weight ω keeps the particle moving with inertia, giving it the tendency to expand the search space and the ability to explore new regions.The acceleration constants c 1 and c 2 represent the weights of the statistical acceleration terms that push each particle toward the P best and G best positions.A low value allows the particle to hover outside the target region before being pulled back, while a high value causes the particle to make a sudden dash toward or over the target region.The higher the number of populations (NP), the higher the accuracy will be obtained, but at the same time, it will prolong the computing time.The velocity range (VRP) determines the resolution (or accuracy) in the region between the current position and the best position.If velocity maximum is too high, the particle may fly past the good solution, and if velocity minimum is too small, the particle cannot perform sufficient exploration, leading to becoming stuck in local optima.
For the GA algorithm: the adaptive cross coefficient range (ACCR) is too large, it is easy to destroy the existing favorable mode, increase the randomness and easily miss the optimal individual; it is too small to renew the population effectively.The adaptive variation coefficient range (AVCR)is too small and the diversity of the population declines too fast, which easily leads to the rapid loss of effective genes and is not easy to repair; it is is too large, the diversity of the population can be guaranteed, but the probability of higher-order mode destruction also increases.
For the fuzzy system: the fuzzy range of ω in the PSO (FRW) control the variation range in the defuzzification process.The fuzzy range of c i in the PSO (FRC) control the variation range in the defuzzification process.Tables 5-7 demonstrate that the proposed algorithm has superior clustering performance in terms of Silhouette coefficient, degree of separation and SSE, as well as the ability to capture better clustering results and locate more effective urban hotspots.However, in Table 7's San Francisco data set, the K-Means++ algorithm delivers just slightly worse than the proposed algorithm.It reveals that the K-Means++ algorithm can be used as a preliminary coarse-grained clustering algorithm in such applications.The tables also demonstrate that the overall clustering performance of the RS algorithm is good, and the clustering effect is better than the K-Means++ algorithm in most cases.In Table 5, the average Silhouette coefficient of the proposed algorithm is higher than other algorithms.The greater the Silhouette coefficient, the more compact within a cluster and more dispersed between the clusters, indicating that the clustering effect is superior to the competitors.In Table 6, the proposed algorithm has a higher SP value than other algorithms.The higher the SP value, the greater the separation between clustering clusters, and the higher the clustering algorithm's ability to satisfy the needs of the majority of clustering.Table 7 indicates that the average SSE value of the proposed algorithm in 20 training cycles is superior to other algorithms.Overall, the clustering results of the proposed algorithms all outperformed the comparison algorithms.The above suggests that hybrid heuristic initialization algorithms are effective, practical and feasible.Figures 4-8 show that the proposed algorithm undergoes minimal change during the iterative convergence process.Within the five groups of taxi GPS data sets, the pro-posed really is in the optimal state, the optimization effect is noticeable, and its curve variations are negligible.It indicates that the proposed algorithm can obtain relatively better clustering centers by using the initialization operation of the k -means++ algorithm when initializing the populations.Assuming that the proposed algorithm is equivalent to the K-means++ algorithm, the two algorithms' convergence curves will coincide in the convergence scenario, which is not observed in the experiments.In the iterative optimization process of the proposed algorithm, the fitness value changes slightly, and the location distribution of particles and the clustering center are constantly modified to achieve a more optimal clustering centers distribution throughout the whole GPS data set.The final clustering effect can better accommodate actual requirements.The convergence curves of the proposed algorithm and PSO-GAK algorithm remain close to each other, indicating that the parameters of the proposed algorithm should be further adjusted to improve its convergence effect, which will be studied in future work.
Figures 9-13 show that the proposed algorithm can obtain better particles than the GAK, PSOK, PSOCK, PSOWK, PSO-GAK and K-means++ as the input initial clustering centers of the K-means algorithm.With optimization by the proposed algorithm, improved clustering centers can be captured, significantly reducing the K-means algorithm's sensitivity to initial clustering centers.When SSE is employed as the objective function of the k-means clustering algorithm, the clustering centers obtained by the proposed algorithm are used to perform the K-means clustering algorithm, which has faster convergence characteristics and a lower SSE value overall.In Figure 9, the proposed algorithm begins to converge in the second generation and converges entirely in the fourth generation, and the SSE obtained is significantly smaller than other algorithms.In Figure 10, before the seventh generation, the convergence speed of the proposed algorithm is faster than other algorithms, and the SSE value is also relatively small.After the tenth generation, the SSE value of the proposed algorithm is slightly smaller than that of the K-means++ and GAK algorithms.In Figure 11, proposed converges from about the second generation and converges entirely in the fifth generation.The SSE value is lower than that of other algorithms, showing that the clustering effect is excellent and the separation between clusters is significant, consequently demonstrating the algorithm's effectiveness.In Figure 12, the proposed converges from the second generation and entirely at about the fifth generation, with a very smooth convergence curve.In the meantime, the figure demonstrates that the SSE value is less than that of other algorithms, indicating that the clustering center captured by the proposed algorithm is effective and does not cause k-means clustering to fall into local optimal solutions.In Figure 13, the proposed converges from about the second generation and ultimately converges at the fifth generation.In the beginning stage, the SSE value of the algorithm is substantially lower than that of other algorithms, and after convergence, it is extremely stable.
Overall, experiments on various datasets indicate that the algorithm that combines the genetic algorithm, fussy system, and particle swarm algorithm virtually captures the optimum clustering centers more effectively.

Multi-Source Trajectory Data Set
To further validate the performance of the proposed algorithm, we employed a more complex multi-source dataset.In order to accurately analyze the above evaluation indexes, The number of clusterings is randomly generated between 130 and 150, each algorithm was iterated 120 times in a single run to guarantee convergence.Table 8 is the comparison tables for the average Silhouette coefficient (SC), the average degree of separation (SP), and the average SSE value of 20 training sessions respectively.All eight algorithms converge well within less than 120 iterations in the multi-source dataset.The K-Means++ and Random Swap Clustering algorithms are much less time consuming than the other algorithms, and both achieve very good clustering results.The analysis of the experimental data shows that the Random Swap Clustering algorithm outperforms the K-Means++ algorithm.The Random Swap Clustering algorithm also outperforms single-mode heuristics in many cases.With reasonable time consumption, the evaluation indexes of the proposed algorithm are significantly better than those of the compared algorithms.This may be due to the fact that the proposed hybrid heuristic algorithm has stronger non-homogeneous spatial search capability, but also brings more computing resources and time consumption.Overall, the clustering results of the proposed algorithms all outperformed the comparison algorithms in the terms of above evaluation indexes.The above suggests that the proposed hybrid heuristic algorithms are effective.

Clustering Algorithm Complexity Analysis
The proposed noevel FPSO-GAK algorithm consists of three parts.To facilitate analysis, the following notation is used: the number of GPS data points is N, the number of clustering is K, the particles population of PSO is M. In the first part, the k-Means++ initialization method is employed to obtain particles M population.The time complexity of this part is O(P1) = MKlog(K).In the second part, the hybrid heuristic initialization is utilized to determine optimal initial clustering centers.The iterations number of PSO algorithm is T 1 , The iterations number of GA algorithm is T 2 , number of calculations to fitness function is T 3 .The time complexity of the fuzzy system is related to the number of iterations and the number of fuzzy rules; it takes very little time and is negligible in terms of computational time complexity.The time complexity of the second part is O(P2) = MNKT 1 + MNKT 2 + MT 3 .In the third part, the K-Means clustering algorithm is used to locate the urban hotspots.The iterations number of K-Means clustering algorithm is T 4 .The time complexity of the third part is O(P3) = NKT 4 .
The time complexity of the proposed clustering algorithm is O(FPSO − GAK) = O(P1) + O(P2) + O(P3) ≈ NK((T 1 + T 2 )M + T 4 ).In practical applications, K and M are very small relative to the number of GPS data points N. The number of iterations T of the K-means++ algorithm in which the random initialization procedure fails to acquire the better initialization tends to be extremely high for large-scale datasets.Due to the good non-homogeneous search capability of the proposed hybrid heuristic initialization algorithm, the value of (T 1 + T 2 )M is relatively small.Furthermore, the better-initialized centers makes T 4 would be much smaller than T. The result of their interaction makes: the complexity of the proposed algorithm is of the same order of magnitude as that of k-Means, but it will be slightly larger numerically.
In the actual experimental operation, the PSO-GAK algorithm in literature [55] consumes the most time since its complexity tends to O(N 2 ).The proposed algorithm can better adjust the search parameters automatically, and it will converge significantly faster than the PSO-GAK algorithm.The complexity of an intelligent algorithm is generally higher than the general iterative, recursive algorithm [56].Collectively, it can be seen that the complexity of the proposed algorithm is much less than O(N 2 ) but greater than O(NKT).The complexity of the proposed algorithm is roughly linearly related to the number of samples, and it is very efficient and scalable for handling large data sets.

Clustering Results Similarity Analysis
External indexes have been commonly used to compare clustering algorithms by calculating how many pairs of data points are partitioned consistently between two clustering solutions.For both solutions to be consistent, a pair of points must be assigned to either the same cluster or a distinct cluster.This offers an estimation of point-level similarity but no direct information regarding cluster-level similarity.In most cases, information about the cluster-level similarity of the algorithm is of more interest than the estimation of point-level similarity.Fränti et al. [57] proposed Centroid index (CI) measures cluster-level differences of two solutions, and centroid similarity index (CSI) measures point-level differences of two solutions.The CI index has clear intuitive interpretation on how many clusters are differently located in the two solutions by an integer value.A higher CI value indicates a lower similarity between the two solutions when the number of cluster classes is determined.Considering the effect of the number of clustering, this paper employs a normalized adjusted CI index (ACI) to compare solutions with the following formula: Table 9 shows that the ACI indices of FPSO-GAK and other algorithms are greater than 0.15, indicating a substantial difference in their clustering-level similarity.Obtaining the above outcomes under identical data input conditions reveals that the equivalent mathematical models of the FPSO-GAK algorithm differ significantly from those of other algorithms.However, the value of the ACI index is generally less than 0.25, indicating that the difference between equivalent mathematical models is not fundamental.The CSI indices in Table 10 are more concerned with point-level similarity, and their values indicate similar model analysis information.

Discussion
The K-Means algorithm can achieve very good results on convex, compressible datasets.Urban roads are usually distributed in a neighborhood manner, and the trajectory points of vehicles satisfy convexity and compressibility in a large local area.Using the K-Means algorithm to identify urban hotspots can yield effective results.K-Means algorithm is very sensitive to the initial center points.Excellent initial clustering centers significantly impact the performance of the k-Means clustering algorithm.K-Means is particularly effective in fine-tuning local cluster boundaries, but is incapable of solving global cluster locations.The organic combination of global perception and local fine-tuning is the key to obtaining good initial centers.Literature [50] proposed the method of random swap clustering, which is simple and efficient.Under the condition of a suitable KNN super parameter configuration, this method's random swap efficiency is extremely high, it is an exquisite lightweight clustering method.Inspired by the ideas of randomness and exchange, the FPSO-GAK algorithm combines a variety of heuristic algorithms with the premise of sacrificing a little efficiency to avoid the homogeneity of the random search process and can better obtain global perception ability.At the same time, local search parameters can be fine-tuned automatically to prevent over-dependence on super parameter.
In practical applications, in order to further improve the clustering efficiency, the inertia weight and learning factor can better meet the iterative requirements by adjusting fuzzy rules.At the same time, the combination mode of PSO and GA can also be changed.PSO can improve the phenomenon that GA is prone to early maturity through global optimization in the early stage.while PSO is prone to fall into local optimization in the late stage, the deficiency can be made up through the global optimization in the late stage of GA.
Compared with many existing research algorithms in this field, the proposed algorithm has a better and relatively efficient non-homogeneous search capability, which can effectively find excellent initial clustering centers and at the same time can reduce the sensitivity of K-Means algorithm to noise to a certain extent.However, the proposed algorithm is relatively complex in structure and has a relatively high time complexity compared to many lightweight algorithms.The proposed algorithm is a class of algorithms based on a heuristic mechanism, which can only guarantee to get a better solution and cannot guarantee to obtain the optimal solution for sure.

Conclusions
In this paper, a Novel FPSO-GAK clustering algorithm has been proposed to effectively solve the problems of difficulty in determining the sensitivity of initializing the clustering center for a K-means clustering algorithm.The Novel FPSO-GAK clustering algorithm has been employed to capture urban hotspots in metropolitan areas in several countries worldwide.After the experimental clustering operation was completed, the SC index, SP index, and SSE index were applied to evaluate the clustering performance.The proposed Novel FPSO-GAK clustering algorithm obtained better clustering results for urban hotspots in metropolitan areas over the GAK, PSOK, PSOCK, PSOWK, PSO-GAK, and K-means++ algorithm.The algorithm presented in this paper can better improve the quality of public service for people in cities.In addition, the Novel FPSO-GAK clustering algorithm proposed in this paper can also be applied to the clustering analysis of remote sensing ground object data, animal migration area analysis, urban subway data analysis, urban heating network optimization analysis, detailed analysis of power supply flow of power grid and other fields.
There are also some shortcomings of the algorithm proposed in the paper.On the one hand, we only use the GPS data of taxis.The data amount is relatively small, leading to an insufficient resolution in the refined analysis of the distribution relationship of urban hotspots.On the other hand, the Novel FPSO-GAK clustering algorithm does not add urban road constraints, leading to some hotspots falling into particular areas where traffic vehicles cannot be reached, such as large lake areas in cities.These deficiencies will be further improved in future research work.

Figure 1 .
Figure 1.The flow of the novel FPSO-GAK clustering algorithm.

Figure 3 .
Figure 3.The spatial distribution of the multi-source data trajectory points.

3. 3 .
show the convergence diagrams for the searching optimal initial clustering centers stage of the GAK, PSOK, PSOCK, PSOWK, PSO-GAK, and the proposed algorithms utilizing Section 3.2 parameters in five distinct GPS data sets.Figures 9-13 is the convergence diagram for k-means clustering stage of 7 algorithms (the above six algorithms plus K-means++ algorithm).Tables 5-7 are the comparison tables for the average Silhouette coefficient (SC), the average degree of separation (SP), and the average SSE value of 20

Figure 4 .
Figure 4.The SSE convergence curve for the searching optimal initial clustering centers stage of taxi GPS dataset from Aracaju (Brazil).

Figure 5 .
Figure 5.The SSE convergence curve for the searching optimal initial clustering centers stage of taxi GPS dataset from Beijing (China).

Figure 6 .
Figure 6.The SSE convergence curve for the searching optimal initial clustering centers stage of taxi GPS dataset from Chongqing (China).

Figure 7 .
Figure 7.The SSE convergence curve for the searching optimal initial clustering centers stage of taxi GPS dataset from Roma (Italy).

Figure 8 .
Figure 8.The SSE convergence curve for the searching optimal initial clustering centers stage of taxi GPS dataset from San Francisco (USA).

Figure 9 .
Figure 9.The SSE convergence curve of taxi GPS dataset from Aracaju (Brazil).

Figure 10 .
Figure 10.The SSE convergence curve of taxi GPS dataset from Beijing (China).

Figure 11 .
Figure 11.The SSE convergence curve of taxi GPS dataset from Chongqing (China).

Figure 12 .
Figure 12.The SSE convergence curve of taxi GPS dataset from Roma (Italy).

Figure 13 .
Figure 13.The SSE convergence curve of taxi GPS dataset from San Francisco (USA).

3. 4 .Figure 14 .Figure 15 .Figure 16 .Figure 17 .Figure 18 .Figure 19 .
shows the centers of the clustering results for each algorithm in Baidu Map (put the clustering centers into the maps using the open software interface provided by Baidu Map).On the corresponding map, we intuitively perceive the distribution of clustering results based on various algorithms.According to pertinent maps, different clustering algorithms yield different clustering centers, indicating that different clustering results and identified urban hot areas are distinct.Since experiment GPS datasets are not cleaned according to road location limitations, the map may contain a very small number of outlier points.

3. 4 . 2 .Figure 20 .
Figure20shows the centers of the clustering results for each algorithm in multi-source trajectory data set.On the corresponding map, we intuitively perceive the distribution of clustering results based on various algorithms.

Table 1 .
Fuzzy rules for inertia weight ω, learning factor c 1 and c 2 .
PS: positive small; PM: positive middle; PB: positive big; PR: positive rare.

Table 2 .
Average indexes for clutering operations in synthetic dataset.The number within parentheses represents the number of iterations performed by the algorithms.

Table 3 .
Summary information of Taxis GPS data sets.

Table 4 .
PSO and fuzzy system parameter setting.

Table 5 .
Average Silhouette coefficient of 8 algorithms trained 20 times in 5 GPS data sets.

Table 6 .
Average degree of separation of 8 algorithms trained 20 times in 5 GPS data sets.

Table 7 .
Average SSE of 8 algorithms trained 20 times in 5 GPS data sets.

Table 8 .
Average evaluation indexes of 8 algorithms in multi-source dataset.