An Automatic K-Means Clustering Algorithm of GPS Data Combining a Novel Niche Genetic Algorithm with Noise and Density

: Rapidly growing Global Positioning System (GPS) data plays an important role in trajectory and their applications (e


Introduction
Nowadays, with the prevalence of smart devices equipped with Global Positioning System (GPS) positioning, a large amount of GPS-based data and trajectories is available. A huge amount of information is hidden behind location data, which is very useful in providing many services to people, such as navigation and recommendation systems based on taxi position, the localization of points of interest, the population migration distribution of a city, land use, and the analysis of traffic flow.
The key element of these applications is location (based on GPS), which is required to mine the hidden information and understand the meaning of the trajectories, instead of only considering a trajectory as a combination of recorded GPS data points. Therefore, in these application domains, techniques for mining trajectory patterns and frequent trajectory routes are very important [1]. Trajectories have usually been described by several trajectory patterns, such as origins and destinations (OD) [2-4], stops and moves [5,6], and moving objects [7,8]; furthermore, a large number of clustering algorithms have been used to mine these patterns and produce clustering results. For example, the authors of [6] presented an improved DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm for clustering stops in trajectories; the authors of [3] proposed an OD pattern of regions of interest based on taxi trajectories; Reference [4] presented a new spatio-temporal query model that allows users to visually query taxi trips through OD; a density-based clustering approach aimed at mining the topology of stops of a public transport network was presented in [9]; a hierarchical clustering algorithm was proposed in [10] that first extracts visit points from GPS trajectories and then clusters these visit points in order to discover personally semantic places; Reference [11] proposed a density-based line-segment trajectory clustering algorithm based on a partition-and-group framework; and [12] presented a mode-based clustering algorithm for trajectories, in which an EM (Expectation-Maximization) algorithm is used to determine the cluster memberships.
In general, most of the existing clustering algorithms used in GPS data clustering suffer from their respective drawbacks, which are also deemed to be some of the most difficult and challenging problems in unsupervised machine learning [6,13-15]. Clustering is an important data analysis approach and data mining technique that groups records with similar characteristics into the same cluster and dissimilar records into different clusters, and it has a wide range of applications including machine learning, social network analysis, and the geosciences. Unfortunately, it is difficult to find the number of natural clusters; moreover, the clustering result is sensitive to the selection of the initial seeds. There are many clustering algorithms (e.g., density-based, partition-based) [14], of which K-means (partition-based) is more commonly used than other clustering algorithms (e.g., density-based, model-based), perhaps because of its simplicity and effectiveness. However, K-means requires a user to provide the number of clusters K as an input [15,16], and may converge slowly or to a partition that is significantly inferior to the global optimum [17]. Based on this user-defined number of clusters, K-means randomly selects K initial seeds from a given data set [18]. K-means is generally very sensitive to the quality of these initial seeds, so poor initial seeds easily produce poor quality results [16,19]. Therefore, clustering techniques that are capable of automatically selecting the number of clusters are highly desirable [16]. Another drawback of K-means is that it does not record the quality of the clusters obtained in previous iterations.
To improve the performance and enhance the efficiency of the K-means algorithm, several genetic algorithms (GAs) combined with K-means have been developed in the past few years [20-22]. The use of GAs with K-means can help to avoid the local minima issue of K-means [16,20-23], and can produce better clustering results than K-means or GA clustering alone. For example, GenClust [16] can automatically find the appropriate number of clusters and identify the right genes through a novel initial population selection approach. AGCUK (Automatic genetic clustering for unknown K) [22] is an automatic genetic clustering algorithm that finds the number of clusters and provides the proper clustering partition when K is unknown. GAGR (Genetic algorithm with gene rearrangement) [20] is a GA-based K-means clustering algorithm with gene rearrangement. GGA (Group genetic algorithm) [21] comprises GA-based clustering algorithms with a new grouping method in the initial population. However, these GAs with K-means can lose population diversity because of global optimization problems and weak exploitation capabilities, and the gene size of the chromosomes must be equal in AGCUK and GAGR. In the GGA algorithm, gene sizes need not be equal, but the number of clusters requires user input.
This paper presents a novel clustering algorithm called NoiseClust (a niche genetic algorithm (NGA) combining noise and density with K-means), which combines a novel NGA with K-means for taxi GPS data clustering and is used to mine better OD patterns. NoiseClust includes an improved noise method and K-means++ [24,25], which are used to generate the initial population and initial seeds without requiring a user to input the number of clusters, and which can also handle genes of different size and shape. Meanwhile, density-based niche partitioning and a sharing-based niche method are used to maintain population diversity, capture the global optimal solution, and enhance exploitation capabilities [26,27]; adaptive crossover and mutation probabilities are also employed to avoid local optima.
NoiseClust improves the gene rearrangement operation of GenClust [16] by using the cosine similarity between chromosomes before a crossover. The improved method can handle the similarity between chromosomes, and the distance between taxi GPS data points, when the chromosomes in the crossover operation do not have an equal number of genes. The advantage of the improved method is therefore the ease of selecting the best chromosomes in the crossover operation during each iteration, which strengthens the capability of gene rearrangement in the different NGA operations. Prior to this, a novel initial population method with noise and K-means++ is also proposed in NoiseClust, which aims to strengthen efficiency and reduce the complexity of generating the initial population.
Additionally, the NoiseClust technique integrates an improved NGA and K-means to generate higher quality clustering results than GenClust [16]. The NGA consists of two parts: (1) density estimation is used to determine the number of niches, which maintains population diversity; (2) the Pearson similarity [28] between chromosomes replaces the distance in the sharing function of niches [29], with the advantage of better stability and diversity of the population than a GA (see Section 2.6.2). Our proposed clustering technique overcomes the issues of K-means by using our NGA, which automatically selects high-quality initial seeds. Namely, the NoiseClust clustering algorithm not only automatically captures the number of clusters, overcomes the local minima issue, and avoids sensitive seed selection for K-means, but also strengthens the diversity of the population and avoids premature convergence.
To verify the performance and effectiveness of NoiseClust, four taxi GPS data sets are used as examples (see Section 2.1), and the experiments compare NoiseClust with GenClust [16] and Genetic algorithm K-means (GAK) [30] on three cluster evaluation criteria: the silhouette coefficient (SC) [21], PBM (Pakhira-Bandyopadhyay-Maulik) [31], and SSE (Sum of Squared Errors) [20]. The results indicate that NoiseClust achieves better quality clusters than the GenClust and GAK algorithms. Furthermore, this paper also presents a complexity analysis of NoiseClust in Section 2.10.
Therefore, the main contributions of the paper are summarized as follows:
a. The selection of the initial population combining noise with K-means++.
b. An improved gene rearrangement technique using cosine similarity.
c. Adaptive probabilities of crossover and mutation are used to prevent NoiseClust from getting stuck at a local optimal solution.
d. A novel niche operation is proposed to maintain population diversity.
e. NoiseClust works on four real-world taxi GPS data sets.
The remainder of the paper is organized as follows: Section 2 describes our proposed novel clustering algorithm. Experimental results are presented in Section 3. Finally, Section 4 offers conclusions and future work.

The NoiseClust Clustering Algorithm
NoiseClust is based on the OD of trajectories and finds the best OD cluster centers in a city. Figure 1 gives an overview of the NoiseClust algorithm, which includes the following steps: (1) given a description of the taxi GPS data, four taxi GPS data sets are used to explain the OD of the GPS-based trajectories (see Section 2.1); (2) real taxi GPS data sequences are used to encode chromosomes (see Section 2.2); (3) an initial population approach is proposed that uses noise and K-means++ to initialize the seeds (see Section 2.3); (4) the DBI (Davies-Bouldin index) [32] is employed as the fitness function; (5) a gene rearrangement technique based on cosine similarity and adaptive probabilities of crossover and mutation are used for the genetic operations (see Section 2.5); (6) a density estimation method is presented to determine the number of niches, and a sharing function is also used for the genetic operation, which together form the niche genetic algorithm (NGA) (see Section 2.6); (7) an elitism strategy is used to select the best chromosome, which replaces the worst chromosome in the current iteration (see Section 2.7); (8) the best chromosome is used to seed K-means clustering; (9) the termination condition of K-means clustering is given in Section 2.9; and (10) the complexity of the NoiseClust algorithm is analyzed in Section 2.10.

GPS Data Description
In general, a trajectory consists of many GPS location points, and GPS-based trajectory data not only contain location information (longitude and latitude), but also record time and status (e.g., speed, orientation and status) [33,34]. These trajectory patterns of location can usually be described by OD [2-4]. In this paper, the GPS data were gathered from taxi GPS points in Aracaju (Brazil), Chongqing (China), Roma (Italy), and San Francisco (USA) [35] (Figure 2a-d). However, although we can obtain the distributions of OD, it is more important to understand which areas in a city attract more people, and what the spatial distributions of these attracting locations are. Therefore, to analyze and compare the NoiseClust, GAK, and GenClust algorithms and achieve OD clustering, the taxi GPS data sets are collected and divided into OD, which are then used to encode the chromosomes of the genetic operation (see Section 2.2).

Figure 2 displays the distributions of taxis' OD in the road network of the given land areas (see Table 1) within two minutes in each city. The overall distributions of OD reflect the changing traffic demand and the population migration of citizens who use taxicabs as a transportation tool. As a result, better cluster centers can be captured and used to dispatch taxis and find passengers. Furthermore, traffic information and the population migration distribution can also be obtained in order to explain each city's situation.

Chromosome Representation
For any GA, each chromosome is made up of a sequence of gene codes such as binary digits, floating-point numbers, integers, or symbols (e.g., a, b, c, d), and a chromosome representation is needed to describe each chromosome in the population. The gene coding of the chromosomes determines how the optimization problem and the genetic operators are structured in the algorithm, and the different gene representations of the chromosomes determine the genetic performance of the population. In early GAs, binary digits (0, 1) were usually used, and it has been shown that more natural representations can obtain a more efficient and better optimal solution. However, as the length of the binary strings increases, more CPU time is needed, which causes the genetic performance to decline [20,22,36]. Therefore, an OD representation is used to describe the chromosomes in this paper; furthermore, a noise method is proposed for the initial selection of the chromosomes, where K-means++ is also used to select the initial seeds (see Section 2.3). Each seed of a cluster is represented by a gene of the chromosome, in the same way as in the NoiseClust algorithm. The number of genes of a chromosome is randomly chosen in the interval $[2, \sqrt{n}]$ [37], where n is the number of GPS data points. Additionally, chromosomes differ in the size and shape of their genes; when a seed no longer changes, the corresponding cluster center has been determined. Therefore, the chromosome representation is described by OD as follows: a chromosome can be defined by $CR(G_{i1}, G_{i2}, \ldots, G_{iK})$, or $CR(Seed_{i1}, Seed_{i2}, \ldots, Seed_{iK})$; namely, a gene is also regarded as a seed, and a chromosome consists of seeds in the clustering algorithms [16,23]. Here, CR, G, K, and i denote the chromosomes, the genes, the number of genes of clusters ($2 \le K \le \sqrt{n}$), and the number of GPS data points in a cluster, respectively; note that a chromosome represents a clustering solution. Therefore, the chromosomes are made up of real taxi GPS data representing the seeds (cluster centers) in the initial population.
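As an illustration of this encoding, the following minimal Python sketch builds one chromosome as a list of seeds with a random length K in $[2, \sqrt{n}]$. The function name `random_chromosome`, the synthetic coordinates, and the plain uniform sampling of seeds are assumptions made for the example; NoiseClust itself initializes seeds with noise and K-means++ (Section 2.3).

```python
import math
import random

def random_chromosome(points, rng):
    """Build one chromosome: K real data points used as seeds, 2 <= K <= sqrt(n)."""
    n = len(points)
    k_max = max(2, math.isqrt(n))   # floor(sqrt(n)), never below 2
    k = rng.randint(2, k_max)       # number of genes (clusters), bounds inclusive
    return rng.sample(points, k)    # each gene is a real GPS point (assumed uniform pick)

rng = random.Random(0)              # fixed seed so the sketch is repeatable
# Hypothetical (longitude, latitude) pairs standing in for taxi OD points.
points = [(rng.uniform(-37.1, -37.0), rng.uniform(-11.0, -10.9)) for _ in range(100)]
chromosome = random_chromosome(points, rng)
```

With n = 100 points, every chromosome produced this way has between 2 and 10 genes, so chromosomes in the same population can have different lengths.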

Initial Population Using Noise and K-Means++ Method
The noising method is a recent combinatorial optimization metaheuristic that guides a heuristic search to explore the solution space [38]. It has been applied to K-means clustering [22], as well as to other fields such as task allocation [39] and the clique partitioning problem [40]. The noise method considers the optimal result as the outcome of a series of fluctuating data converging towards the genuine ones; the features and variants of the noise method, and the tuning of its parameters when applied to different combinatorial optimization problems, are summarized in [36]. Compared with other metaheuristics based on elementary transformations, the noise method is based not only on elementary transformations, but also on a descent. The noise is randomly chosen in an interval whose range decreases during the process. For example, if we draw the noise from the interval [−rate, +rate] with a given probability distribution in the taxi GPS data sets, then the noise rate decreases during the iteration process. When the value of the objective function f (1/SSE, where the sum of squared errors (SSE) is used to calculate the noise value) [19] is considered for a given solution, a perturbation called a noise is added to this value. The initial value of the noise rate, rate, should therefore be chosen in such a way that, at the beginning of the noise iteration process, a bad neighboring solution may be accepted. Because the added noises are chosen in [−rate, +rate], an interval centered on zero, the mean of the noise is 0 and its standard deviation tends towards 0 during this process; a good neighboring solution may also be rejected. This induces a neighborhood of size $NS = b(b - 3)/2$ for the elementary transformation of the noise operation [36], where NS denotes the size of the neighborhood and b denotes the number of binary bits of the neighborhood in the elementary transformations. This is a quick and easy way to evaluate the consequences of the transformation. The final solution is the best solution captured during the noise iteration process.
ISPRS Int. J. Geo-Inf. 2017, 6, 392
During the noise iterations, the K-means++ [25] method is employed to capture the initial seeds of the clustering, to handle the objective function value, and to control the decreasing rate. For example, if the noise rate decreases arithmetically, it decreases by $(rate_{\max} - rate_{\min})/N$ after each trial-cluster, where the meaning of N is explained in Algorithm 1. Meanwhile, K-means++ offers an easy and quick way to evaluate the consequences of the elementary transformation; furthermore, it makes it easy to capture the number of initial clusters and achieve population initialization. For example, for the taxi GPS data set of Aracaju (Brazil) (see Figure 2a), when the population size is equal to 30, 30 different numbers of clusters $K_i$ ($i = 1, \ldots, 30$) are initialized by 30 interval rate values of the noise method. As a result, genes of different size and shape are produced in the 30 chromosomes, and the initial seeds are also obtained. Ultimately, population initialization is achieved, as shown in Figure 3. The structure of the algorithm is shown in Algorithm 1.
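The arithmetic decrease of the noise rate and the perturbation of an objective value can be sketched in Python as follows. This is only an illustration under stated assumptions: the noise is drawn from a uniform distribution (the text only says "a given probability distribution"), and the names `noise_rates` and `noised_value` as well as the numeric rates are hypothetical.

```python
import random

def noise_rates(rate_max, rate_min, n_trials):
    """Arithmetically decreasing rates: each step subtracts (rate_max - rate_min) / N."""
    step = (rate_max - rate_min) / n_trials
    return [rate_max - t * step for t in range(n_trials + 1)]

def noised_value(value, rate, rng):
    """Add a noise drawn uniformly (an assumption) from [-rate, +rate] to an objective value."""
    return value + rng.uniform(-rate, rate)

rates = noise_rates(rate_max=0.8, rate_min=0.2, n_trials=3)   # illustrative values only
rng = random.Random(1)
perturbed = [noised_value(1.0, r, rng) for r in rates]
```

Early trials use a wide interval, so a worse neighboring solution can look better after perturbation; later trials use a narrow interval and behave almost like a plain descent.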
Algorithm 1. Initial population using noise and K-means++
Function: KMeanPlus(data, solution, dimension) // K-means++ [25] is employed to handle the objective function value of the noise, select the initial seeds, and control the decreasing rate
Input:
rate_max, rate_min // a group of maximum and minimum noise rates (e.g., 30 pairs of noise rates); their number is usually equal to the population size
NIND // population size (for example, NIND = 30)
total_noise_ele_num // total number of noised elementary transformations; usually equal to the number of iterations used for the noising operation
noise_ele_num // number of noised elementary transformations; usually equal to the population size, or set explicitly
NS // size of the neighborhood
Procedure:
produce a chromosome and record its length (the number of clusters)
END
achieve population (30 chromosomes) initialization
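The seeding step that Algorithm 1 delegates to K-means++ [25] can be sketched as below. This is the standard K-means++ sampling rule written with the Python standard library, not the authors' KMeanPlus function; the name `kmeans_pp_seeds` and the toy points are assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k, rng):
    """Pick k seeds; each new seed is sampled with probability proportional to D(x)^2."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # D(x)^2: squared distance from each point to its nearest chosen seed.
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        # Requires at least one point that differs from the seeds (positive total weight).
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 0.0)]
seeds = kmeans_pp_seeds(toy_points, 3, random.Random(2))
```

Points already chosen as seeds have weight zero, so distinct data points yield distinct seeds.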

Fitness Function
The fitness function assigns a fitness value to each candidate solution. Common clustering criteria or quality indicators include the SSE, DBI (Davies-Bouldin index), Dunn's index, Xie-Beni index, PBM (PBM-index), and COSEC (compactness and separation measure of cluster) [16,31], which can be used as fitness functions for a GA. In this paper, the DBI is used as the clustering measure. The idea of the Davies-Bouldin index [32] is to minimize the intra-cluster distance while maximizing the distances among different clusters; small values of the DBI therefore correspond to compact and well-separated clusters. This index does not behave monotonically with K, so the DBI also allows the optimal number of clusters for the given taxi GPS data set to be validated. The fitness function of a chromosome is then defined as the inverse of the DBI, i.e., $f = 1/DBI$. This fitness function is maximized during the evolutionary process of the NGA in each iteration, which leads to the minimization of the DBI.
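For illustration, the sketch below computes the index in its commonly used textbook form, the mean over clusters of $\max_{j \ne i} (S_i + S_j)/d_{ij}$, and the fitness as its inverse. The scatter $S_i$ is taken as the mean Euclidean distance to the centroid, which is an assumption about the exact variant; the function names are hypothetical.

```python
import math

def centroid(cluster):
    """Component-wise mean of the points in one cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def davies_bouldin(clusters):
    """Textbook DBI: mean over clusters i of max_j (S_i + S_j) / d(c_i, c_j)."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, cents[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Assumes distinct centroids, otherwise the ratio is undefined.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

def fitness(clusters):
    """Fitness = 1 / DBI (undefined when every cluster has zero scatter)."""
    return 1.0 / davies_bouldin(clusters)

toy_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
```

For the two toy clusters, each scatter is 1 and the centroids are 10 apart, so the DBI is 0.2 and the fitness is 5.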

Genetic Operations
This section discusses the methods used for the genetic operations. Adaptive probabilities of crossover and mutation are employed to prevent convergence to a local optimum without requiring the genetic parameters to be defined in advance.

Selection Operation
Chromosomes (e.g., NIND = 30) are sorted in descending order of their fitness values, and a number of the best chromosomes are then chosen from the initial population of NIND chromosomes using the fitness function, which generates a selection population $POP_s$. The chromosome with the maximum fitness value is selected according to this descending order, and a copy of this best chromosome is stored in memory.

Crossover Operation
To handle the different size and shape of the genes, an improved gene rearrangement operation is first presented, then the crossover operation is performed.
Several existing gene rearrangement techniques [20,41] assume that both chromosomes have equal lengths and are therefore unable to rearrange the genes if the lengths of the chromosomes are not equal [16]. The gene rearrangement technique in [16] can handle chromosomes of unequal lengths, but it focuses only on the gene rearrangement operation without considering the structure between chromosomes, which makes it difficult to determine the reference chromosome and the target chromosome. Therefore, we first obtain the best chromosome by sorting the chromosomes in descending order of their fitness values, and a pair of chromosomes $CR_i$, $CR_j$ is then chosen using the existing roulette wheel technique (RWT) [22,28], with a selection probability based on the fitness $f(CR_i, CR_j)$ of the pair relative to the NIND chromosomes of the current population. Next, we use the cosine theorem to calculate the similarity values between the chromosomes, which together form a triangle. When the chromosomes have unequal lengths, zero-valued genes are appended so that each chromosome has the same number of genes. To obtain a useful reference chromosome and target chromosome, these similarity values are translated into angles: the maximum angle of the triangle corresponds to the "triangle side" of the reference chromosome, and the minimum angle corresponds to the "triangle side" of the target chromosome. Finally, the gene rearrangement method in [16] is employed to perform the gene rearrangement operation, which rearranges the genes of the inferior chromosome (called the target chromosome) with respect to the gene arrangement of the superior chromosome (called the reference chromosome). The similarity based on the cosine theorem is computed as
$$\cos(CR_i, CR_j) = \frac{\sum_{k} CR_{ik}\, CR_{jk}}{\sqrt{\sum_{k} CR_{ik}^{2}}\,\sqrt{\sum_{k} CR_{jk}^{2}}}$$
where $CR_{ik}$, $CR_{jk}$ are the genes of each chromosome (the shorter the Euclidean distance between taxi GPS data points, the more similar they are); $CR_i$, $CR_j$ are two chromosome vectors; and $CR_b$ stands for the chromosome with the best fitness in the current iteration.
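The zero-padding and the translation of a cosine similarity into an angle can be sketched as follows. The sketch assumes two-dimensional genes that are flattened into one vector per chromosome; the function names are hypothetical, and the selection of reference and target chromosomes from the triangle is not reproduced here.

```python
import math

def flatten_padded(chromosome, n_genes, dim=2):
    """Flatten the genes into one vector and append zero genes up to n_genes."""
    flat = [x for gene in chromosome for x in gene]
    return flat + [0.0] * (dim * n_genes - len(flat))

def cosine_similarity(cr_i, cr_j, dim=2):
    """Cosine similarity of two chromosomes after padding to equal length."""
    n = max(len(cr_i), len(cr_j))
    u, v = flatten_padded(cr_i, n, dim), flatten_padded(cr_j, n, dim)
    dot = sum(a * b for a, b in zip(u, v))
    # Assumes neither chromosome is the all-zero vector.
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def angle_degrees(similarity):
    """Translate a cosine similarity into an angle, clamping rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, similarity))))

cr_a = [(1.0, 0.0), (0.0, 1.0)]   # two genes
cr_b = [(1.0, 0.0)]               # one gene, padded with one zero gene
angle = angle_degrees(cosine_similarity(cr_a, cr_b))
```

Here the padded vectors are (1, 0, 0, 1) and (1, 0, 0, 0), whose cosine is $1/\sqrt{2}$, i.e., an angle of 45 degrees.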
The main goal of the crossover operation is to create diversity, potentially producing new chromosomes using gene rearrangement and the crossover probability. The chromosomes are sorted in descending order of their fitness values, and all chromosomes of the selection population participate in the crossover operation after gene rearrangement. The operation combines the features of two parent chromosomes to produce two offspring using a single-point crossover operation $SPC(CR_i, CR_j, \alpha)$, which is calculated according to [16,20], where each chromosome is divided into two parts at a random point between two genes. The crossover operation is an information exchange between different potential solutions, and it is applied with the crossover probability $P_c$. From the crossover of a pair of chromosomes, we obtain a pair of offspring chromosomes, which are added to the population of the next generation; every time a chromosome is chosen, it is removed from the population of the current generation [18].
In the single-point crossover operation, $\alpha$ is a crossover parameter with $\alpha \in (0, 1)$. The crossover probability $P_c$ is computed adaptively from $f_{\max}$, $f_{avg}$, and $f$, where $f_{\max}$ denotes the maximum fitness value of the current population; $f_{avg}$ denotes the average fitness value of the population; and $f$ denotes the larger of the fitness values of the two chromosomes to be crossed. The value of $P_c$ increases when the chromosome is quite poor. In contrast, when the chromosome is a good solution, $P_c$ is low, which reduces the likelihood of disrupting a good solution by crossover.
The whole process of chromosome pair selection and crossover continues while chromosomes remain in the current population, and the resulting offspring form the crossover population $POP_c$, which is used for the next generation.
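The behavior described for $P_c$ (low for good parents, high for poor ones) matches a widely used adaptive rule of the Srinivas-Patnaik type, which is sketched below together with a plain single-point crossover for chromosomes of unequal length. Both the functional form and the constants `k1` and `k3` are assumptions, since the exact expressions used by NoiseClust are not reproduced in this text; the $\alpha$ parameter of SPC is likewise not modeled.

```python
import random

def adaptive_pc(f_prime, f_max, f_avg, k1=1.0, k3=1.0):
    """Assumed adaptive rule: P_c shrinks as the better parent approaches f_max."""
    if f_prime >= f_avg and f_max > f_avg:
        return k1 * (f_max - f_prime) / (f_max - f_avg)
    return k3   # below-average parents (or a flat population) always cross

def single_point_crossover(parent_a, parent_b, rng):
    """Cut each parent between two genes and swap the tails (lengths may differ)."""
    cut_a = rng.randint(1, len(parent_a) - 1)   # needs at least 2 genes per parent
    cut_b = rng.randint(1, len(parent_b) - 1)
    return (parent_a[:cut_a] + parent_b[cut_b:],
            parent_b[:cut_b] + parent_a[cut_a:])

rng = random.Random(3)
pa = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
pb = [(8.0, 8.0), (9.0, 9.0)]
child_1, child_2 = single_point_crossover(pa, pb, rng)
```

Because every chromosome has at least two genes ($K \ge 2$), both cut points exist and each child keeps at least two genes.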

Mutation Operation
The basic idea of the mutation operation is to randomly alter one or more genes of a selected chromosome in order to explore different solutions. The mutation operator applies small modifications to the chromosomes in the population; chromosomes with low fitness have a high probability of being randomly changed according to the mutation probability, which helps to explore new regions of the search space and to escape from local optima when the algorithm is near convergence [21], with the random change having a probability equal to the mutation rate. The intuition behind the mutation operation is to introduce extra variability into the population [20], and the mutation probability of each chromosome in the crossover population is calculated. The mutation probability $P_m$ of the ith chromosome is computed adaptively with coefficients $\xi_1$, $\xi_2$, which are equal to 0.5; $f_{\max}$, $f_{avg}$ are the same as defined above; and $f_i$ is the fitness of the ith chromosome under mutation. When $P_m > P_c$, the high fitness solutions rapidly aid the convergence of the NGA, but the low fitness values cannot prevent the GA from getting stuck at a local optimum.
To prevent the GA from getting stuck at a local optimum, the fitness values of the solutions are used to reduce the mutation probability and to explore the search space for the region containing the global optimum.
The mutation process utilized in this paper is the same as that used in [21,42]. Let $f_{\max}$ and $f_{\min}$ be the maximum and minimum fitness values in the current population. For a chromosome with fitness value $f$, a number $\delta \in [-R, R]$ is generated with a uniform distribution, where the range $R$ is calculated from $f$, $f_{\max}$, and $f_{\min}$.
If the maximum and minimum values of the GPS data set along the ith dimension are $f^{i}_{\max}$ and $f^{i}_{\min}$, respectively, then the ith gene of the chromosome after mutation is computed from $\delta$ and these bounds. After the mutation operator, if a chromosome contains identical genes, the twin removal operation in [16] is employed to remove the twin genes from each chromosome, producing a mutation population $POP_m$, and we again update the chromosome with the best fitness at this stage (mutation operator).
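A hedged sketch of the two mutation ingredients follows. The adaptive probability uses the same Srinivas-Patnaik-style form assumed for crossover, with $\xi_1 = \xi_2 = 0.5$ as stated in the text; the displacement rule follows the form commonly used in GA-based clustering, where $R = (f - f_{\min})/(f_{\max} - f_{\min})$. Both formulas are assumptions about expressions that are not reproduced in this text, and the function names are hypothetical.

```python
import random

def adaptive_pm(f_i, f_max, f_avg, xi1=0.5, xi2=0.5):
    """Assumed adaptive rule: fitter chromosomes are mutated less often."""
    if f_i >= f_avg and f_max > f_avg:
        return xi1 * (f_max - f_i) / (f_max - f_avg)
    return xi2

def mutate_gene(gene, f, f_min, f_max, lows, highs, rng):
    """Shift each coordinate by delta in [-R, R] without leaving the data bounds."""
    r = (f - f_min) / (f_max - f_min) if f_max > f_min else 1.0   # assumed form of R
    mutated = []
    for x, lo, hi in zip(gene, lows, highs):
        delta = rng.uniform(-r, r)
        # Move towards the upper bound for delta >= 0, towards the lower bound otherwise.
        mutated.append(x + delta * (hi - x) if delta >= 0 else x + delta * (x - lo))
    return tuple(mutated)

rng = random.Random(4)
new_gene = mutate_gene((0.5, 0.5), f=3.0, f_min=1.0, f_max=5.0,
                       lows=(0.0, 0.0), highs=(1.0, 1.0), rng=rng)
```

Since $|\delta| \le 1$, a gene that starts inside the bounding box of the data stays inside it after mutation.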

Sharing-Based Niche Partitioning Using Density
This section aims to maintain population diversity and prevent premature convergence using the niche technique. We divide the population into a number of niches using a density-based method, and a sharing-based niching method is then presented to adjust the shared fitness values.

Niches Partitioning Based on Density
The density-based method has been widely used for clustering large-scale data because of its simple calculation structure and low computing cost [6,33,43]. In this paper, the density-based method is used to determine the number of niches: all points that are density-reachable from a given point are assigned to the same niche, so it is important to find an appropriate way to estimate the density using a given radius r. In other words, circles of a given density radius r are drawn on the taxi GPS data set, the number of taxi GPS points (the density) is counted in each circle, and these densities are sorted by the number of taxi GPS points in each circle. The circle with the maximum density is then taken as a niche, and the above operations are repeated until all the taxi GPS points have been selected, as shown in Algorithm 2. As a result, the number of niches is obtained.
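The partitioning loop described above can be sketched in Python as follows. The sketch assumes that the circles are centered on data points and that plain Euclidean distance on the coordinates is adequate; the function name `partition_niches` and the toy data are hypothetical.

```python
import math

def partition_niches(points, radius):
    """Greedy density partitioning: peel off the densest circle until no points remain."""
    remaining = list(points)
    niches = []
    while remaining:
        # Density of a candidate center = number of remaining points within `radius`.
        best = max(remaining,
                   key=lambda c: sum(math.dist(c, p) <= radius for p in remaining))
        niches.append([p for p in remaining if math.dist(best, p) <= radius])
        remaining = [p for p in remaining if math.dist(best, p) > radius]
    return niches

toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
niches = partition_niches(toy, radius=0.5)
```

The number of niches is `len(niches)`; this naive version recomputes all pairwise distances in every round, so it is only meant for small examples.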

Algorithm 2. Niches partitioning based on density
Input: GPS taxi data (data), a given density radius r
Output: the result of the niche partitioning, including the number of niches and the number of GPS points in each niche
Procedure:
WHILE data ≠ ∅
    Num_circles ← circles of radius r are placed over the taxi GPS data set // Num_circles denotes the number of density regions
    Num_points ← count the number of taxi GPS points in each density region and record it in Num_points
    Num_niches ← sort Num_points in ascending order of the number of taxi GPS points, take the region with the maximum density, and store it separately
    The remaining taxi GPS points continue the operation until data = ∅
END WHILE

Sharing-Based Niche Method
It has been proven that, when the number of chromosomes in the population is large enough and the niche radius is properly set, the sharing function of an NGA maintains as many niches in the population as there are peaks in the fitness landscape [44,45]. However, the method has several problems, such as stability and maintainability [45]. To improve this performance and overcome the limitations of the sharing level of niches, a modified niching method is introduced and integrated into our approach in order to preserve population diversity during the simultaneous search for a global optimum. Our approach maintains population diversity by adjusting the solutions of the new population with the mutation-operator population $POP_m$. An initial niching population is generated from the mutation operation (for instance, a new population can be defined as $POP_{new} = POP_m + \frac{2}{3} POP_s$). In this paper, fitness sharing modifies the search range by reducing the fitness of chromosomes in densely populated regions. It works by derating the fitness of each chromosome by an amount related to the number of similar chromosomes in the new population. In particular, the shared fitness $f_{share}(i)$ of each chromosome $i$ in generation $g$, given the number of partitioned niches $nich$, is defined as follows, and the chromosome $CR_{best,g}$ with the best fitness is again obtained in each iteration:
$$f_{share}(i) = \frac{f_g(i)}{s_{g,nich}(i)}$$
where $f_g(i)$ is the current fitness of the chromosome, and $s_{g,nich}(i)$ is the sum of niche sharing, which depends on the number of partitioned niches and on the chromosomes in the population in each iteration. The sum of niche sharing in each iteration is calculated as:
where $simi(CR_{best,g}, CR_i)$ denotes the similarity between the chromosome with the best fitness value and the current chromosome $i$ in the new population, using the Pearson correlation-based similarity measure [28]; $simi$ is the sharing function, which measures the degree of sharing between the two chromosomes; and the number of niches $nich$ is obtained in Section 2.6.1. The most common form of the sharing function is defined as:
where $\sigma_{share}$ is the niche radius, which is calculated as per [46]; $|CR_{best}|$ is the number of genes of the best chromosome according to its fitness value order; and $|POP_{new}|$ is the number of chromosomes in the new population.
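The density-based partitioning of Section 2.6.1, which supplies the niche count $nich$ used above, can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the circles of radius r are assumed to be centered on the remaining data points, plain Euclidean distance on the coordinates stands in for whatever distance is applied to longitude and latitude, and the function name `partition_niches` is hypothetical.

```python
import math


def partition_niches(points, r):
    """Greedy density partition: repeatedly peel off the densest r-neighborhood.

    Assumption: candidate circle centers are the remaining data points.
    Returns a list of niches, each a list of points.
    """
    remaining = list(points)
    niches = []
    while remaining:  # loop until no GPS points are left
        # Pick the center whose circle of radius r contains the most points.
        center = max(
            remaining,
            key=lambda c: sum(math.dist(c, p) <= r for p in remaining),
        )
        members = [p for p in remaining if math.dist(center, p) <= r]
        niches.append(members)  # store the maximum-density region separately
        remaining = [p for p in remaining if math.dist(center, p) > r]
    return niches


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 10)]
    result = partition_niches(data, r=0.5)
    print(len(result), [len(niche) for niche in result])
```

Each pass removes at least the chosen center itself, so the loop always terminates; the number of returned niches plays the role of $nich$.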
A gene expression data set consisting of $n$ genes (taxi GPS data points) and $d$ chromosomes is usually expressed as a real-valued $n \times d$ matrix $E = (G_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, d$. Here, each element $G_{ij}$ denotes the expression level of the $i$th gene in the $j$th chromosome. When two chromosomes are unequal in length, zero-valued genes are added to the shorter chromosome so that both contain the same number of genes before the Pearson similarity is computed. Therefore, given two chromosome vectors $CR_i$ and $CR_j$, the Pearson correlation coefficient $simi(CR_i, CR_j)$ between them is calculated as
$$simi(CR_i, CR_j) = \frac{\sum_{l=1}^{n} (G_{li} - \mu_{G_i})(G_{lj} - \mu_{G_j})}{\sqrt{\sum_{l=1}^{n} (G_{li} - \mu_{G_i})^2}\,\sqrt{\sum_{l=1}^{n} (G_{lj} - \mu_{G_j})^2}}$$
where $\mu_{G_i}$ and $\mu_{G_j}$ represent the arithmetic means of the components of the chromosome vectors $CR_i$ and $CR_j$, respectively. When our approach maintains diversity and reaches the global optimum, the number of niches reduces to one. After the niche operation, a new population is generated by using the elitism operator.
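As an illustration of the similarity computation described above, the following minimal sketch computes the Pearson correlation between two chromosomes after zero-padding the shorter one. It assumes that each chromosome is given as a flat list of real values; the function name `pearson_similarity` and the handling of constant chromosomes (returning 0) are assumptions made only for this example.

```python
import math


def pearson_similarity(cr_i, cr_j):
    """Pearson correlation between two chromosomes.

    The shorter chromosome is padded with zero-valued genes so that both
    vectors have the same length before the correlation is computed.
    """
    length = max(len(cr_i), len(cr_j))
    a = list(cr_i) + [0.0] * (length - len(cr_i))
    b = list(cr_j) + [0.0] * (length - len(cr_j))
    mean_a = sum(a) / length
    mean_b = sum(b) / length
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a constant chromosome is treated as uncorrelated.
        return 0.0
    return cov / (norm_a * norm_b)


if __name__ == "__main__":
    print(pearson_similarity([1, 2, 3], [2, 4, 6]))
    print(pearson_similarity([1, 2, 3], [1, 2]))  # second vector padded to [1, 2, 0]
```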

Elitism Operation
The elitism operation keeps track of the best chromosome throughout the iterations and keeps improving the quality of the population in each generation. Based on the results of the niche operation, a new elitism population is generated, and the best chromosome is obtained by sorting the chromosomes in descending order of their fitness values; this chromosome is then used for the K-means clustering operation.
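A minimal sketch of one elitism step is given below. It assumes that a higher fitness value is better (consistent with sorting in descending order) and that the stored elite replaces the worst chromosome of the current generation when it is better; the function name `apply_elitism` and its signature are hypothetical.

```python
def apply_elitism(population, fitness, elite=None, elite_fitness=None):
    """One elitism step.

    Assumption: higher fitness is better. The stored elite replaces the worst
    chromosome if the elite is better; the best chromosome of the resulting
    population is returned as the new elite.
    """
    population = list(population)
    fitness = list(fitness)
    worst = min(range(len(fitness)), key=fitness.__getitem__)
    if elite is not None and elite_fitness > fitness[worst]:
        population[worst] = elite
        fitness[worst] = elite_fitness
    best = max(range(len(fitness)), key=fitness.__getitem__)
    return population, fitness, population[best], fitness[best]


if __name__ == "__main__":
    pop, fit, best, best_fit = apply_elitism(["a", "b", "c"], [0.2, 0.9, 0.5], "e", 0.95)
    print(pop, fit, best, best_fit)
```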

K-Means Clustering Using the Best Chromosome
K-means is one of the most popular and most frequently compared clustering algorithms; however, basic K-means requires the number of clusters as an input parameter, depends heavily on the initialization of the seeds, and can get stuck in local optima. In this paper, we use the genes of the best elitism chromosome as the initial seeds, and the seeds are updated using SSE during K-means clustering, as shown in Algorithm 3. With high-quality initial seeds, K-means is expected to generate a high-quality clustering solution. Meanwhile, the clustering solutions of NoiseClust are displayed on a Google map (see Section 3.2), which shows the OD clustering results.
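The seeding idea can be illustrated with the following sketch of K-means started from a given list of seeds (here standing in for the genes of the best chromosome). It is a simplified illustration rather than the exact procedure: it stops when the seeds no longer change or after `max_iter` iterations instead of using the ε-based condition, it keeps the seed of an empty cluster unchanged, and the name `kmeans_from_seeds` is hypothetical.

```python
import math


def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd iterations started from the given seeds.

    Returns (seeds, labels, sse). Assumptions: Euclidean distance, and an
    empty cluster keeps its previous seed.
    """
    seeds = [tuple(float(c) for c in s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest seed.
        labels = [
            min(range(len(seeds)), key=lambda k: math.dist(p, seeds[k]))
            for p in points
        ]
        # Update step: the cluster mean minimizes the SSE of that cluster.
        new_seeds = []
        for k, old in enumerate(seeds):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_seeds.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_seeds.append(old)
        if new_seeds == seeds:
            break
        seeds = new_seeds
    sse = sum(math.dist(p, seeds[lab]) ** 2 for p, lab in zip(points, labels))
    return seeds, labels, sse


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans_from_seeds(pts, seeds=[(0, 0), (10, 10)]))
```

The number of seeds passed in fixes K, which mirrors the use of the gene count of the best chromosome as the number of clusters.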

Termination Condition
Our approach defines a termination threshold ε (e.g., ε = 0.00001) in order to capture the optimum clustering result in a close neighborhood of a cluster. The K-means clustering search terminates based on the ratio between the change in the fitness value from the current iteration to the next iteration and the current fitness value, which is calculated as
$$\frac{|f(x+1) - f(x)|}{f(x)} < \varepsilon$$
where $f(x)$ denotes the fitness value of the current iteration; $f(x+1)$ represents the fitness value of the next iteration; and $x$ denotes the number of iterations of the K-means clustering operation. The search works on the fitness values of the chromosomes: when the termination condition is satisfied, the better clustering result has been obtained and the GPS points have been assigned to the clusters in their close neighborhood.
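The relative-change test described above can be written as a small helper. This is a sketch under assumptions: the absolute value of the change is used, the comparison is strict, and a zero current fitness is handled as a special case to avoid division by zero; the name `has_converged` is hypothetical.

```python
def has_converged(f_current, f_next, eps=1e-5):
    """Return True when the relative change in fitness falls below eps."""
    if f_current == 0:
        # Assumption: with a zero current fitness, stop only if nothing changed.
        return f_next == 0
    return abs(f_next - f_current) / abs(f_current) < eps


if __name__ == "__main__":
    print(has_converged(100.0, 100.0005))  # relative change of about 5e-6
    print(has_converged(100.0, 99.0))      # relative change of 1e-2
```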

Complexity Analysis of NoiseClust
In fact, many K-means clustering algorithms suffer from their dependence on the number of clusters and on the initial seeds, and from local optima, all of which are avoided in the NoiseClust clustering algorithm. Following [16,47], let the total number of GPS data points be $n$, the population size be NIND ($N$), the number of iterations be MAXGEN ($g$), the number of attributes be $m$, the number of iterations of K-means clustering be $x$, and the maximum number of genes in a chromosome be $K$.

The complexity of NoiseClust is made up of six parts, as follows:
Initialization: This consists of the noise method and K-means++. According to Algorithm 1, the time complexity of the population initialization is $O(nmN)$.
Fitness function: The fitness of each chromosome is calculated using DBI; its cost consists of computing the distances between all pairs of seeds, computing the distance between each point and its closest seed, and sorting in descending order. The time complexity of the fitness function is $O(gNnm^2)$.
Genetic operation: This consists of the selection, crossover, and mutation operations; therefore, the time complexity of the genetic operation is $O(gNK^2m)$.
Niche operation: This consists of the density estimation and the sharing-based computation; therefore, the time complexity of the niche operation is $O(gNn^2m)$.
Elitism operation: After the niche operation, identifying the best and worst chromosomes of a generation has a complexity of $O(gN^2m)$, which is therefore the total time complexity of the elitism operation.
K-means operation: The time complexity of K-means clustering is $O(nmKx)$.
Therefore, the time complexity of NoiseClust is $O(nmN + gNnm^2 + gNK^2m + gNn^2m + gN^2m + nmKx)$. The time complexity of NoiseClust is lower than that of GenClust [16], because the initial population technique and the twin removal of GenClust have a higher complexity; the measured computation times are reported in Section 3.2.
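To get a feeling for which part dominates, the six terms can be evaluated for concrete parameter values. In the sketch below, N = 30 and g = 100 follow the experimental settings of this paper, whereas n, m, K, and x are hypothetical values chosen only for illustration; constant factors are ignored and the function name is an assumption.

```python
def noiseclust_cost_terms(n, m, N, g, K, x):
    """Evaluate each term of the stated complexity bound (constants ignored)."""
    return {
        "initialization": n * m * N,
        "fitness": g * N * n * m**2,
        "genetic": g * N * K**2 * m,
        "niche": g * N * n**2 * m,
        "elitism": g * N**2 * m,
        "kmeans": n * m * K * x,
    }


if __name__ == "__main__":
    # n, m, K, and x are illustrative values, not taken from the experiments.
    terms = noiseclust_cost_terms(n=10_000, m=2, N=30, g=100, K=100, x=50)
    for name, value in sorted(terms.items(), key=lambda item: -item[1]):
        print(f"{name:15s} {value:.3e}")
```

For these values the niche term $gNn^2m$ is the largest, since it is the only one that grows quadratically with the number of GPS points.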

Experimental Results and Discussion
In this section, to test the performance of the NoiseClust algorithm, experiments are conducted on real-world taxi GPS data sets [35] (shown in Table 1); the results show that NoiseClust performs better and is more effective than GenClust [16] and GAK [42]. Computer simulations are conducted in Matlab (v.2016a) (MathWorks, Natick, MA, USA) on an Intel (R) Xeon (R) CPU E5-2658, running at 2 @ 2.10 GHz with 32 GB of RAM in Windows Server 2008. The termination condition of the clustering algorithms is ε = 0.00001. In order to use the same comparison standard as described in [48], all fitness values $f$ of the NGA (and of GenClust and GAK) are normalized using the following formula:
$$f_{norm} = \frac{f - f_{min}}{f_{max} - f_{min}}$$
where $f_{norm}$ is the normalized value, and $f_{max}$ and $f_{min}$ are the maximum and minimum of the $f$ values.
Once all $f$ values of the NGA (and of GenClust and GAK) are normalized to $f_{norm}$, the maximum and minimum fitness values in each iteration are 1 and 0, respectively. The aim is to compare the three GA-based clustering algorithms (NoiseClust, GenClust, and GAK) using the same standard (see Figure 3a-d).
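The normalization amounts to min-max scaling of the fitness values, as in the following short sketch. The function name and the treatment of the degenerate case in which all values are equal (mapped to 0) are assumptions.

```python
def normalize_fitness(values):
    """Min-max normalize fitness values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all values are equal, map them to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(normalize_fitness([2.0, 4.0, 6.0]))
```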

Clustering Evaluation Criteria
Validation or evaluation of the resulting clustering allows us to analyze the result in terms of objective measures [49]. Depending on the information available, the clustering results can be evaluated in terms of the Silhouette coefficient (SC) [21,50], PBM [31], and SSE [20].
This type of evaluation tries to determine the quality of an obtained partition of the data without any external information. Therefore, three of the most useful evaluation criteria are employed, as follows.
SC is a measure that has been used quite often in clustering problems, since it allows the evaluation of the quality of a particular solution as well as of each cluster that makes up that solution [21,28,50]. In other words, it allows the evaluation of a given assignment for a particular observation $G_{jK}$. SC is then defined for the $j$th observation $G_{jK}$:
$$SC_j = \frac{b_j - a_j}{\max(a_j, b_j)}$$
where $a_j$ denotes the average distance between the observation point $G_{jK}$ and the other point vectors of the cluster to which the point is assigned; and $b_j$ denotes the minimum of the average distances obtained for all clusters other than the one assigned to $G_{jK}$. Note that the value of the SC index varies from −1 to 1, and a higher value indicates a better clustering result.
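A compact way to see how $a_j$ and $b_j$ enter the silhouette coefficient is the pure-Python sketch below, which averages the per-observation values over all points. It assumes Euclidean distance, at least two clusters, and a score of 0 for singleton clusters (a common convention); the function name is hypothetical and the quadratic cost makes it suitable only for small examples.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (O(n^2), needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a_j: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_j: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(silhouette_coefficient(pts, [0, 0, 1, 1]))  # compact, well-separated
    print(silhouette_coefficient(pts, [0, 1, 0, 1]))  # poor assignment
```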
PBM [31] is used to measure clustering performance, as it provides a measure of the goodness of clustering for different partitions of a given data set and serves as a validity index of a cluster solution. The fitness function then maximizes the value of this PBM index:
where $n$ denotes the total number of GPS data points in the data set. The power $p$ is used to control the contrast between the different cluster configurations; here, $p = 2$. A larger value of the PBM index implies a better solution.
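For orientation, the PBM index is commonly written in the literature as $\left(\frac{1}{K} \cdot \frac{E_1}{E_K} \cdot D_K\right)^p$, where $E_1$ is the sum of distances of all points to the centroid of the whole data set, $E_K$ is the sum of distances of the points to their own cluster centers, and $D_K$ is the largest distance between two cluster centers. The sketch below assumes this commonly cited form, which may differ in detail from the variant used here; the function name and the convention that labels index into `centers` are assumptions as well.

```python
import math


def pbm_index(points, labels, centers, p=2):
    """PBM index in its commonly cited form (larger is better).

    Assumptions: Euclidean distance; labels[j] is the index in `centers`
    of the cluster to which points[j] is assigned.
    """
    k = len(centers)
    dim = len(points[0])
    grand = tuple(sum(pt[d] for pt in points) / len(points) for d in range(dim))
    e_1 = sum(math.dist(pt, grand) for pt in points)  # one-cluster scatter
    e_k = sum(math.dist(pt, centers[lab]) for pt, lab in zip(points, labels))
    d_k = max(math.dist(a, b) for a in centers for b in centers)
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** p


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(pbm_index(pts, [0, 0, 1, 1], [(0, 1), (10, 1)]))
    print(pbm_index(pts, [0, 1, 0, 1], [(5, 0), (5, 2)]))
```

The first partition groups nearby points and receives a much larger value than the second, which matches the statement that a larger PBM value implies a better solution.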
SSE is the most straightforward and popular distance-based evaluation measure in unsupervised clustering. It only needs to consider the cohesion of the clusters to evaluate the quality of a given partition of the data [21], and is defined as:
$$SSE = \sum_{k=1}^{K} \sum_{CR \in C_k} d^2(CR, seed_k)$$
where $K$ denotes the number of clusters; $C_k$ denotes the set of chromosomes (a chromosome is defined as a vector) assigned to cluster $k$; and $d^2(CR, seed_k)$ is the squared distance from the observed chromosome $CR$ to the seed of cluster $k$, represented by $seed_k$.

Experiments on Taxi GPS Data
The experiments are carried out on four taxi GPS data sets, which are often used for testing clustering algorithms (shown in Table 1 with their characteristics). Table 1 has four columns: the name of the taxi GPS data set, the land area in longitude × latitude (see Figure 2), the number of data points, and the number of clusters obtained for each data set by the three clustering algorithms.
In the experiments, the population size is set to 30; the crossover and mutation probabilities for the GAK algorithm are $P_c = 0.8$ and $P_m = 0.1$ [20,42], respectively; and the parameter settings for GenClust follow [16]. The total number of generations is 100, and the number of clusters for GAK is 100.
For the purpose of comparison, SC, PBM, and SSE are used to evaluate the clustering results on the taxi GPS data sets (Tables 2-4, respectively), and GenClust and GAK are compared with NoiseClust to verify its effectiveness. Determining the number of clusters is important in clustering problems. In this paper, the number of clusters is determined automatically by the clustering algorithms (except for GAK), and NoiseClust also avoids sensitivity to the initial seeds in the initial population; GAK uses the crossover and mutation operations of the standard GA and selects the initial cluster centers randomly, following [42,51]. Bold font indicates the best value for each taxi GPS data set.
Table 2 shows the results obtained by the different algorithms (GAK, GenClust, and NoiseClust) with SC as the evaluation criterion. Note that the results are given in terms of the K-means clustering results, and that the best clustering results on the four taxi GPS data sets are obtained by NoiseClust. Note also that the NoiseClust solutions improve on the results of the GAK and GenClust algorithms.
Table 3 summarizes the clustering results obtained by the different algorithms with the PBM index. NoiseClust obtains the better clustering results (except on the Chongqing, China data set, where it obtains two better values); moreover, its convergence speed is faster than that of GAK and GenClust, as shown in Figure 4. The values indicate the superiority of NoiseClust, which produces better values than the other GA-based clustering algorithms (GAK and GenClust).
Table 4 shows the results obtained by the compared algorithms with SSE. The best result is obtained by the NoiseClust algorithm on Aracaju, Brazil, and NoiseClust obtains good clustering results on all data sets except Chongqing, China. Furthermore, the averaged SSE values indicate that NoiseClust obtains a low error on the taxi GPS data.
For a more careful comparison of the algorithms, the Wilcoxon rank sum test (WRST) technique [52,53] is used to assess the cluster differences between the NoiseClust clusters and those obtained by the GenClust and GAK clustering algorithms.The WRST, as a powerful statistical tool, provides a nonparametric test for two samples when the taxi GPS samples are independent; furthermore, it is a frequently used statistical test to compare an ordinal outcome between two groups of subjects.This test is used to determine whether two independent samples selected from taxi GPS samples have the same distribution, and the test results are given in Table 5.
In the experiment, the WRST tests the null hypothesis that the two samples of clustering evaluation results come from the same distribution, and that any difference observed between them is due to random chance. There are three outputs: p, h, and stats. In Table 5, p denotes the p-value of a two-sided WRST for the two given sets of clustering evaluation results u and v (NoiseClust and GenClust, or NoiseClust and GAK); p is used to test the null hypothesis that the data in u and v are samples from continuous distributions with equal medians against the alternative that they are not. As p → 0, the inconsistency between u and v becomes more evident. h denotes a logical value indicating the test decision: h = 1 indicates a rejection of the null hypothesis, and h = 0 indicates a failure to reject the null hypothesis, at the α (e.g., α = 0.05) significance level. In other words, h = 1 indicates that the overall difference between u and v is significant, and h = 0 indicates that it is not. stats consists of zval and ranksum, where zval is the normal (z) statistic associated with the p-value and ranksum is the WRST statistic. It can be seen from Table 5 that, overall, the differences between NoiseClust and GenClust and between NoiseClust and GAK are statistically significant, although the statistical results vary across the evaluation criteria. For instance, when SC is used to evaluate the clustering results of the Aracaju, Brazil taxi GPS data set, both the p-value of $5.8284 \times 10^{-4}$ and h = 1 (between NoiseClust and GenClust) indicate the rejection of the null hypothesis of equal medians at the default 5% significance level. In contrast, when PBM is used to evaluate the clustering results of the Roma, Italy taxi GPS data set, both the p-value of 0.0539 and h = 0 (between NoiseClust and GAK) indicate that there is insufficient evidence to reject the null hypothesis. Taken together, the comparison of the statistical results indicates that NoiseClust behaves differently from GenClust and GAK and achieves better clustering results than both.
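The roles of p, h, zval, and ranksum can be illustrated with a small pure-Python sketch of the two-sided rank sum test based on the normal approximation. It is a simplified illustration, not the routine used for Table 5: statistical packages typically add continuity and tie corrections or use an exact method for small samples, so their p-values may differ from the ones computed here. The function name and return order are assumptions.

```python
import math


def rank_sum_test(u, v, alpha=0.05):
    """Two-sided Wilcoxon rank sum test via the plain normal approximation.

    Returns (p, h, zval, ranksum); no continuity or tie correction is applied.
    """
    pooled = sorted((value, idx) for idx, value in enumerate(list(u) + list(v)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share the average rank
        for t in range(i, j + 1):
            ranks[pooled[t][1]] = avg_rank
        i = j + 1
    n_u, n_v = len(u), len(v)
    ranksum = sum(ranks[:n_u])  # rank sum of the first sample
    mean = n_u * (n_u + n_v + 1) / 2
    std = math.sqrt(n_u * n_v * (n_u + n_v + 1) / 12)
    zval = (ranksum - mean) / std
    p = math.erfc(abs(zval) / math.sqrt(2))  # two-sided normal tail probability
    return p, int(p < alpha), zval, ranksum


if __name__ == "__main__":
    print(rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
    print(rank_sum_test([1, 2, 3], [1, 2, 3]))
```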
In the experiment, this paper compares the speed of convergence of the three GA-based clustering algorithms (GAK, GenClust, and NoiseClust). For each GPS data set, GAK, GenClust, and NoiseClust are each run for 20 independent trials, initialized respectively with a randomly generated population, with the initial population technique of [16], and with the noise-based and K-means++ initialization; the values are averaged over the trials to account for the stochastic nature of the algorithms. The average DBIs obtained by the three algorithms are shown in Figure 4.
From Figure 4, it can be seen that NoiseClust converges to the desired value in relatively few iterations on the four taxi GPS data sets, and that it provides some improvement in the DBI over GAK and GenClust. The convergence of NoiseClust is fast and shows no fluctuation: it begins to converge after about 10 iterations. Together with Tables 2-4, Figure 4 shows that NoiseClust performs better than the other two existing algorithms for all criteria, and indicates that NoiseClust runs stably, without getting stuck in a local optimum and without premature convergence.
Table 6 presents the average execution time (20 runs per taxi GPS data set) required by the GA-based clustering algorithms when the number of generations is 100; NoiseClust converges in fewer iterations and with a shorter computation time than GenClust. NoiseClust is computationally more expensive than the existing GAK algorithm. Genetic algorithms are generally more time-consuming than basic clustering algorithms (e.g., K-means) [16]. Moreover, NoiseClust needs 10-30 iterations to converge, compared with the more than 90 iterations of GenClust and GAK, which rely on computationally expensive iterations to achieve convergence. However, NoiseClust also uses computationally expensive operations, such as the initial population and niche operations. On the one hand, comparing NoiseClust with GAK, GAK takes ordinary steps without the initial population and niche computations, and can only handle two chromosomes of equal length; therefore, the average execution time of GAK is the shortest. On the other hand, comparing NoiseClust with GenClust, the computationally expensive operations of GenClust are mainly the initial population technique, gene rearrangement, and twin removal in each iteration, which make the average execution time of GenClust the longest. In NoiseClust, by contrast, the twin removal of GenClust is used only in the mutation operation, and the time complexity of the initial population of NoiseClust ($O(nmN)$) is lower than that of GenClust ($O(n^2mN)$); therefore, the average execution time of NoiseClust is at a medium level.
To validate the feasibility of the NoiseClust algorithm, we also use the four taxi GPS data sets to display the best clustering results of NoiseClust in Google Maps. In our application, a software tool is developed based on the Google Maps Application Program Interfaces (APIs), a web-mapping service provided by Google. With this software, a large number of real cluster centers are shown visually on the map, and the regions containing the cluster centers of K-means clustering are labeled. The best clustering results generated by NoiseClust, GAK, and GenClust using DBI are illustrated in Figure 5a-d.
From Figure 5a-d, each cluster reflects places with high population and traffic, and the cluster centers are mainly in the CBDs (central business districts) of Aracaju (Brazil), Chongqing (China), Roma (Italy), and San Francisco (USA). Tourist attractions, parks, resorts, hotels, museums, schools (universities), government buildings, and subway stations are also places with taxi cabs. The clustering results reflect each city's traffic information and population migration distribution. However, the cluster centers of NoiseClust are better than those of GAK and GenClust (e.g., GAK gets stuck at a local optimum in Figure 5(aii), which is also explained by Figure 4a), and the distribution of its cluster centers is rather appropriate. From these applications in different regions, it can be concluded that the NoiseClust algorithm is a useful method for taxi GPS data clustering.

Conclusions and Further Work
In this paper, the presented NoiseClust clustering algorithm aims to achieve better-quality clusters without requiring the user to input the number of clusters or other genetic operation parameters. NoiseClust uses the proposed new NGA with noise and density to avoid getting stuck in a local optimum, while achieving high-quality cluster results for taxi GPS data. That is, NoiseClust can automatically perform the clustering operation and find better OD clustering results.
In NoiseClust, each chromosome represents the seeds of the clusters as a sequence of real-valued taxi GPS data, and noise and K-means++ are used to produce the initial population. Moreover, to reduce the degeneracy caused by different chromosomes describing the same cluster results, an improved gene rearrangement of the chromosome, based on GenClust, is used in NoiseClust. Meanwhile, to obtain a global optimum and maintain population diversity, a density-based sharing niching method is applied in the NGA, and methods for computing the crossover and mutation probabilities are also integrated into the NGA; these processes allow NoiseClust to explore the search space more effectively.
Finally, the genes of the chromosome with the best fitness value are used as the initial seeds of K-means to generate the final cluster results (displayed in a Google map), and the number of genes of the best chromosome is the number of clusters (K). We compared NoiseClust with the GenClust and GAK clustering algorithms using three criteria (SC, PBM, and SSE) on the four taxi GPS data sets, where NoiseClust achieves better overall clustering results than GenClust and GAK (see Tables 2-4).
However, to obtain higher-quality clustering results, NoiseClust still requires a longer execution time than GAK (see Table 6), although its execution time is shorter than that of GenClust. Therefore, our future research plans include reducing the time complexity. In addition, we will study large amounts of GPS data on MapReduce with the GA, the particle swarm algorithm, the ant colony algorithm, K-median, or other clustering methods. In particular, K-median combined with the GA will be a key topic of study for the clustering of GPS data points.
ISPRS Int. J. Geo-Inf. 2017, 6, 392
algorithm (NGA) (see Section 2.6); (7) in Section 2.7, an elitism strategy is used to select the best chromosome, which is used to replace the worst chromosome in the current iteration; (8) the best chromosome is obtained to use K-means clustering; (9) the termination condition of K-means clustering is given in Section 2.9; and (10) the complexity of the NoiseClust algorithm is analyzed in Section 2.1.

Figure 3. Diagram of the initial population using the noise method and K-means++.

Table 1. The experimental data sets of taxi GPS data.

Algorithm 1. Cont.
noise_ele_num // give the number of noised elementary; it is usually equal to the population size, or to a set number of noised elementary
NS // the size of the neighborhood
Call KMeanPlus(data, s, m) // calculate the objective function of noise and the initial seeds; chrom_s denotes the seeds of a chromosome using K-means++
best_sol ← s // the best solution found since the beginning
best_sol(K) ← s // obtain the number of clusters of each chromosome
best_chrom_sol ← chrom_s(K) // produce a chromosome to put into the population
decrease ← (rate_max − rate_min)/[(total_noise_ele_num − noise_ele_num) − 1] // the value by which rate decreases
restart ← √ total_noise_ele_num · NS // gives the frequency of the restart of the current solution
rate ← rate_max // the current value of the noise rate
initial_iteration ← 0 // counts the number of noise trials that have been applied
WHILE initial_iteration < total_noise_ele_num
    initial_iteration ← initial_iteration + 1
    IF rate != 0
        r_rate ← rate
    END
    translate solution s into binary bits and produce the next solution s' // let s' be the next neighbor of s
    translate solution s' into the decimal value K, with K in the range between 2 and √n // produce the number of clusters
    let noise be a random real number uniformly drawn from [−rate, rate]
    IF f(s') − f(s) + noise < 0
        s ← s'
        chrom_s(K) is controlled to produce a chromosome
    END IF
    IF mod(initial_iteration, 4 * NS) == 0 // mod is modulo
        rate = 0 // apply a descent with noise rate (= 0) from s; as a result, a good neighboring solution may be rejected
    END IF
    search s' // search one of its neighbors
    IF s' = ∅
        (s, chrom_s(K)) → Call KMeanPlus(data, s', m)
        best_sol ← s'
    ELSE
        perform solution (s, chrom_s(K)) for the taxi GPS data sets to start producing a chromosome
    END IF

Algorithm 3.
Obtain the number of genes (K) of the best chromosome as the number of clusters;
Initialize the K seeds using the genes of the best chromosome;
WHILE the termination condition (ε = 0.00001) is not met // when ε is met, the number of iterations is recorded in x
    FOR 1 to K
        FOR 1 to m // repeat over the number of attributes
            FOR 1 to n // repeat over the number of taxi GPS data points
                Assign each taxi GPS data point to the closest seed;
                Update the seeds from the taxi GPS data points assigned to each cluster using SSE // the SSE model is described in Section 3.1

Table 2. The maximum, mean, and minimum values of the silhouette coefficient (SC) obtained by the GAK, GenClust, and NoiseClust algorithms over 20 different runs on four real-world taxi GPS data sets.

Table 3. The maximum, mean, and minimum values of PBM obtained by the GAK, GenClust, and NoiseClust algorithms over 20 different runs on four real-world taxi GPS data sets.

Table 4. The maximum, mean, and minimum values of the Sum of Squared Errors (SSE) obtained by the GAK, GenClust, and NoiseClust algorithms over 20 different runs on four real-world taxi GPS data sets.

Taxi GPS Data Set GAK GenClust NoiseClust
The bold font indicates the best value for each taxi GPS data.

Table 5. The Wilcoxon rank sum test (WRST) results for SC, PBM, and SSE, showing where the results obtained by NoiseClust are statistically different from those obtained by the GenClust and GAK clustering algorithms (α = 0.05; α denotes the significance level parameter of the WRST, 0 < α < 1).

Table 6. Overall average computational times (in minutes) of the clustering algorithms for four taxi GPS data sets.