A Varied Density-based Clustering Approach for Event Detection from Heterogeneous Twitter Data

Abstract: Extracting the latent knowledge from Twitter by applying spatial clustering on geotagged tweets provides the ability to discover events and their locations. DBSCAN (density-based spatial clustering of applications with noise), which has been widely used to retrieve events from geotagged tweets, cannot efficiently detect clusters when there is significant spatial heterogeneity in the dataset, as is the case for Twitter data, where the distribution of users, as well as the intensity of publishing tweets, varies over the study area. This study proposes the VDCT (Varied Density-based spatial Clustering for Twitter data) algorithm, which extracts clusters from geotagged tweets by considering spatial heterogeneity. The algorithm employs exponential spline interpolation to determine different search radii for cluster detection. Moreover, in addition to spatial proximity, textual similarities among tweets are also taken into account by the algorithm. To examine the efficiency of the algorithm, geotagged tweets collected during a hurricane in the United States were used for event detection. The output clusters of VDCT were compared to those of DBSCAN. Visual and quantitative comparison of the results supported the feasibility of the proposed method.


Introduction
The dramatic increase in the popularity of social networks has resulted in the production of enormous amounts of "user-generated" data on a daily basis. Twitter, as one of the most popular and fast-growing microblogging services [1], produces over 500 million tweets per day (http://www.internetlivestats.com/twitter-statistics/). In addition, the advent of smart devices equipped with Global Navigation Systems has made it possible to share location alongside the content of tweets. The daily generation of geotagged tweets has enabled scientists to look for advanced techniques to explore latent knowledge and spatial patterns in various contexts, including rumor diffusion [2], user activity pattern mining [3], crime type modeling [4], determining the relationship between social media attitudes and health outcomes [5], and extracting users' communities and discussed topics [6], to name but a few. Among other things, users of Twitter share messages and report information about events they have witnessed (e.g., flooding, earthquakes, hurricanes, tsunamis, terrorist attacks, accidents, festivals, etc.). Monitoring and analyzing such a stream of user-generated data can provide invaluable information about events that could not be gathered from traditional methods and resources [7,8]. Having dynamic information about events, extracted from tweets, enables decision-makers to comprehend what is happening in the field and react appropriately.
This study proposes the VDCT algorithm, which extracts clusters from geotagged tweets with varied densities. The algorithm considers both the spatial proximity of geotagged tweets and their text similarities. The algorithm is designed to work efficiently with large volumes of Twitter data. To evaluate the efficiency of the algorithm, a case study on event detection from geotagged tweets collected during a hurricane in the United States was considered. The outputs of VDCT were quantitatively and visually compared with those of DBSCAN.

Materials and Methods
The overall workflow of the proposed approach for event detection from geotagged tweets is illustrated in Figure 1. The process starts with data collection using the Twitter Streaming API, where geotagged tweets (tweets that include latitude and longitude) are collected and saved. To prepare tweets for spatial clustering, text preprocessing is performed to transform the tweets' contents into words that can be used in the following steps. In the clustering step, the proposed clustering algorithm is used to extract spatial clusters from the geo-located tweets. Finally, the outputs are evaluated and visualized.
ISPRS Int. J. Geo-Inf. 2019, 8, 82

Case-Study and Data Collection
Hurricane Florence was selected as the case study to test how the proposed method can detect spatial events from geotagged tweets. As a Category 1 hurricane, Florence was predicted to have maximum wind speeds between 74 and 95 mph (about 119 to 153 km/h). Powerful waves and walls of water moved inland and led to flooding. The state of North Carolina was severely affected by the hurricane and was selected as the study area in this research. Using the Twitter Streaming API, tweets shared within North Carolina were collected during Hurricane Florence, from 12 September to 19 September. Tweets were filtered using a bounding box, and only tweets that contain latitude and longitude were extracted and saved in the database.
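The bounding-box filtering step can be sketched as follows. This is a minimal illustration, not the collection code used in the study: the record layout (dictionaries with `lon` and `lat` keys), the function names and the approximate bounding box for North Carolina are all assumptions made for the example.

```python
# Hypothetical, approximate bounding box for North Carolina as
# (min_lon, min_lat, max_lon, max_lat). Not the box used in the study.
NC_BBOX = (-84.4, 33.8, -75.4, 36.6)


def in_bbox(lon, lat, bbox=NC_BBOX):
    """Return True if the coordinate lies inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def keep_geotagged(records, bbox=NC_BBOX):
    """Keep only records that carry coordinates inside the bounding box."""
    kept = []
    for rec in records:
        lon, lat = rec.get("lon"), rec.get("lat")
        if lon is None or lat is None:
            continue  # tweet is not geotagged
        if in_bbox(lon, lat, bbox):
            kept.append(rec)
    return kept
```

Records without coordinates are dropped before the spatial test, which mirrors the requirement that only tweets containing latitude and longitude are stored.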

Text Processing
To use the extracted tweets for spatial clustering, a text preprocessing phase is required. Initially, URLs, hashtags, special characters, and numbers are removed. Then, the words are converted to lowercase and stop words are deleted. Finally, the remaining words are reduced to their base form through lemmatization. After the text preprocessing, 8992 geotagged tweets remained in the study area. The final extracted tweets are presented in Figure 2.
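The cleaning steps above can be sketched with the Python standard library. This is an illustrative approximation only: the stop-word list is a tiny hypothetical sample, the regular expressions are assumptions about what counts as a URL or hashtag, and lemmatization is deliberately left out because it needs an external lexical resource.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Remove URLs, hashtags, special characters and numbers, lowercase
    the text, and drop stop words. Lemmatization is not performed here."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # special characters, digits
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

The order of the substitutions matters: URLs and hashtags are removed before the generic character filter, otherwise their fragments would survive as ordinary words.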

VDCT
To cluster geo-located tweets and extract events, a varied density-based clustering algorithm, named VDCT (Varied Density-based spatial Clustering for Twitter data), was developed in this study. Consider $T$ as a collection of geotagged tweets in which each tweet $t \in T$ is represented as a tuple $[x, y, c, l]$, where $x$ and $y$ are the geographical coordinates of the tweet, $c$ is the textual content of the tweet, and $l$ is the cluster label of the tweet, which is undefined at the beginning. The VDCT algorithm receives $T$ as input and returns $T'$, so that every tweet in the result set, $t \in T'$, either has a defined cluster label ($l$ = cluster label) or has its cluster label set to noise ($l$ = noise).

Text Similarity
To extract events, similar tweets must be placed in the same clusters. In this regard, in addition to the Euclidean distance between geotagged tweets, the text similarity of tweets should be considered. Text similarity plays an important role in document clustering and topic modeling [28,29]. As tweets are limited to 140 characters, most text similarity techniques are not efficient for calculating the similarity between them due to the short length of the messages, the informal language, and the large number of spelling and grammatical errors [30,31]. Cosine similarity, however, is a measure that has proved its ability to calculate the text similarity between tweets [32,33]. In text mining, cosine similarity quantifies the similarity between two strings, each of which is represented by a vector. Consider $W$ as the collection of all terms in $T$. For a tweet $t \in T$, its textual content can be represented as the vector $t.c = [n_{w_1}, n_{w_2}, n_{w_3}, \ldots, n_{w_k}]$, where $n_{w_i}$ is the number of times the term $w_i \in W$ occurs in tweet $t$. Given two tweets $t_1 = [x_1, y_1, c_1, l_1]$ and $t_2 = [x_2, y_2, c_2, l_2]$, their similarity is computed according to the cosine formula (Equation (1)) [3], where $n^{(j)}_{w_i}$ denotes the count of term $w_i$ in tweet $t_j$.
$$\mathrm{sim}(t_1, t_2) = \frac{c_1 \cdot c_2}{\lVert c_1 \rVert \, \lVert c_2 \rVert} = \frac{\sum_{i=1}^{k} n^{(1)}_{w_i} \, n^{(2)}_{w_i}}{\sqrt{\sum_{i=1}^{k} \left(n^{(1)}_{w_i}\right)^2} \, \sqrt{\sum_{i=1}^{k} \left(n^{(2)}_{w_i}\right)^2}} \qquad (1)$$
Cosine similarity measures the cosine of the angle between two non-zero vectors. If two vectors are perpendicular (the tweets share no terms), their cosine similarity is 0; when they point in the same direction (for example, the tweets have identical term counts), their cosine similarity is 1.
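As a small worked illustration of the cosine formula on term-count vectors, the sketch below computes the similarity of two token lists. The function name and the choice to return 0 for an empty tweet are assumptions made for the example, not part of the original method.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both tweets contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty tweet is similar to nothing
    return dot / (norm_a * norm_b)
```

For instance, the token lists `["flood", "warning"]` and `["flood", "road"]` share one of two terms each, so their similarity is 1 / (√2 · √2) = 0.5.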

Clustering
The proposed VDCT solution for extracting clusters from geotagged Twitter data is an extension of the DBSCAN algorithm. DBSCAN receives two parameters as inputs, epsilon and minimum points (minPnts), where epsilon is the radius for the neighborhood search and minPnts is the minimum number of points that must exist around a data point so that those points can be considered a cluster. DBSCAN randomly selects a point that has at least minPnts points within the distance epsilon around it. The surrounding points are then called reachable points. The selected point and its reachable points are considered a cluster. The cluster grows repeatedly by adding other points that are within the epsilon distance of the reachable points as new reachable points. The algorithm continues until every point either has a cluster label or has fewer than minPnts points around it and is therefore considered noise [17].
To address the shortcoming of DBSCAN in dealing with spatial heterogeneity, the VDBSCAN (varied density-based spatial clustering of applications with noise) algorithm was proposed by Liu, Zhou, and Wu [26]. VDBSCAN chooses different values for epsilon using the k-dist plot. For all data points in the dataset, the average distances to the k nearest neighbors of each point are computed, sorted in ascending order, and plotted in a graph where the x-axis shows the distance (epsilon) and the y-axis depicts the points sorted by distance. The sharp changes in the plot correspond to suitable epsilon values. With this method, when the density varies significantly across regions, several values of epsilon are determined [26].
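The values behind a k-dist plot can be sketched as follows. This is a brute-force illustration with an assumed function name; it compares every pair of points, so it is suitable only for small inputs and is not how a production implementation would search for neighbors.

```python
import math


def sorted_avg_knn_distances(points, k):
    """For each point, average the distances to its k nearest neighbors,
    then return those averages sorted in ascending order."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    averages = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        averages.append(sum(dists[:k]) / k)
    return sorted(averages)
```

On the points (0, 0), (1, 0), (2, 0) and (10, 0) with k = 1, the three clustered points each have a nearest neighbor at distance 1, while the isolated point has its nearest neighbor at distance 8; this jump in the sorted curve is the kind of sharp change the plot is meant to reveal.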
Although VDBSCAN can effectively deal with spatial heterogeneity, it still encounters two main challenges for event extraction from Twitter data. The first challenge is that VDBSCAN considers only one dimension, spatial proximity, for cluster detection. To develop an effective event detection algorithm for Twitter data, however, the algorithm must consider the similarity between the contents of tweets in addition to their closeness in space. As illustrated in Figure 3, tweets at the center of the image are geographically close, but they refer to different contents and therefore cannot be grouped in the same cluster. The second challenge is that VDBSCAN relies on the k-dist plot to determine the epsilon parameter for the various densities. The k-dist plot is efficient for small data but does not perform well on large datasets [34] such as Twitter data. In particular, it is hard to detect the sharp changes in the plot when there is a large number of data points.
To address these shortcomings, VDCT considers both location and text similarity for cluster detection by using two neighborhood search parameters, $\varepsilon_e$ and $\varepsilon_t$, for spatial proximity and text similarity, respectively. Additionally, the proposed algorithm borrows from VDBSCAN the idea of calculating several values for the neighborhood search radius, but uses exponential spline interpolation, instead of the k-dist plot, to find the different density levels. Figure 4 presents the pseudocode of the VDCT algorithm.
As described in Figure 4, the algorithm receives geotagged tweets ($T$) and the text similarity search radius ($\varepsilon_t$) as input. Each tweet in $T$ contains both geographical coordinates and textual content, but the cluster labels of the tweets are undefined. The algorithm starts by calculating minPnts (Figure 4, line 3), which is the minimum number of tweets that must exist in the neighborhood of a tweet so that those tweets can be considered a cluster. A heuristic for choosing a proper minPnts is to calculate ln(n), where n is the total number of input tweets [35,36].
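The ln(n) heuristic for minPnts can be illustrated in a few lines. The text does not say how the logarithm is turned into an integer, so rounding up (and enforcing a minimum of 1) is an assumption of this sketch, as is the function name.

```python
import math


def heuristic_min_pnts(n):
    """Heuristic minPnts = ln(n), rounded up to an integer (assumed rounding)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return max(1, math.ceil(math.log(n)))
```

For the 8992 tweets of the case study, ln(8992) is roughly 9.10, so this sketch yields 10 under the round-up assumption.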
To calculate $\varepsilon_e$ (Figure 4, line 3 and lines 22 to 29), the varied radii for the neighborhood search, k-nearest neighbor distances are calculated using a kd-tree data structure. Employing a kd-tree makes the computation of the k nearest neighbors (k-NN) more efficient, which is crucial for handling large datasets [37,38]. The value of k is set to the value of minPnts [39]. The average distances to the k nearest neighbors are then calculated and sorted in ascending order. Based on the recommendation of Louhichi, Gzara, and Ben-Abdallah [34], exponential spline interpolation [40] is then used to extract the different density levels. In this approach, an exponential spline curve is fitted to the sorted average distances to the k nearest neighbors. From the exponential spline curve, the inflection points, that is, the points at which the direction of curvature changes, are extracted and considered as candidates for the different density levels and therefore for the $\varepsilon_e$ values. Bronshtein et al. [41] have thoroughly described the procedure for extracting inflection points from exponential splines.
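The idea of reading candidate radii off the inflection points of the sorted distance curve can be sketched as below. Note the substitution: instead of fitting an exponential spline, this simplified stand-in smooths the curve with a centered moving average and looks for sign changes in the discrete second difference. The function name, the window size, the tolerance and the choice to report the distance value at the inflection index are all assumptions of the example.

```python
def candidate_radii(sorted_dists, window=5, tol=1e-9):
    """Return distance values at which the curvature of the (smoothed)
    sorted k-NN distance curve changes sign.

    Simplified stand-in for exponential spline fitting: a centered moving
    average (full windows only, odd window assumed) followed by a discrete
    second difference.
    """
    half = window // 2
    idx = list(range(half, len(sorted_dists) - half))
    smooth = [sum(sorted_dists[i - half:i + half + 1]) / (2 * half + 1) for i in idx]
    radii = []
    prev_sign = 0
    for pos in range(1, len(smooth) - 1):
        second = smooth[pos - 1] - 2 * smooth[pos] + smooth[pos + 1]
        sign = 1 if second > tol else -1 if second < -tol else 0
        if sign == 0:
            continue  # locally straight; no curvature information
        if prev_sign != 0 and sign != prev_sign:
            radii.append(sorted_dists[idx[pos]])  # curvature flipped here
        prev_sign = sign
    return sorted(set(radii))
```

Only full windows are averaged so that a perfectly linear curve produces no spurious inflection points near its ends. A spline fit would give smoother and more robust estimates on noisy data; the sketch only conveys the mechanics.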
After the values of $\varepsilon_e$ are calculated, they are sorted in ascending order. Then, iterating from the lowest to the highest value of $\varepsilon_e$, the algorithm tries to find clusters with different densities and assigns cluster labels to tweets while considering both $\varepsilon_e$ and $\varepsilon_t$ (Figure 4, lines 6 to 17). In each iteration, a tweet that has not yet been given a cluster label (a tweet labeled undefined or noise) is selected and its neighbors are listed. If the number of selected neighbors is less than minPnts, the tweet is considered noise. Otherwise, the tweet and its neighbors are considered a new cluster and receive a new label, and the algorithm tries to expand this cluster by finding other tweets around them that belong to it (Figure 4, lines 16 and 17), searching for the neighbors of the tweets in the neighbor list and merging the results into the neighbor list. The expansion continues until all the tweets in the current cluster are found and labeled. Then, the algorithm continues with the next value of $\varepsilon_e$.
To select the neighbors of a tweet (Figure 4, lines 19 to 23), both the Euclidean distance between tweets and their text similarity are considered. Two tweets are considered neighbors if the Euclidean distance between them is less than $\varepsilon_e$ and their text similarity, calculated with the cosine formula (Equation (1)), is greater than $\varepsilon_t$.
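The neighbor rule and the multi-radius loop described above can be sketched together. This is not the pseudocode of Figure 4: the tuple layout `(x, y, term_counts)`, the function names, the brute-force neighbor search, the DBSCAN-style rule that only tweets with at least minPnts neighbors expand a cluster, and the decision to count already-labeled tweets as neighbors are all assumptions of this illustration.

```python
import math
from collections import Counter

UNDEFINED, NOISE = None, -1


def cosine(a, b):
    """Cosine similarity between two Counter term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def neighbors(tweets, i, eps_e, eps_t):
    """Indices of tweets closer than eps_e and more similar than eps_t."""
    xi, yi, ci = tweets[i]
    return [
        j for j, (xj, yj, cj) in enumerate(tweets)
        if j != i and math.hypot(xi - xj, yi - yj) < eps_e and cosine(ci, cj) > eps_t
    ]


def vdct_sketch(tweets, eps_e_values, eps_t, min_pnts):
    """Label tweets by running a DBSCAN-like pass per spatial radius,
    from the smallest radius to the largest."""
    labels = [UNDEFINED] * len(tweets)
    next_label = 0
    for eps_e in sorted(eps_e_values):
        for i in range(len(tweets)):
            if labels[i] not in (UNDEFINED, NOISE):
                continue  # already clustered in an earlier step
            seeds = neighbors(tweets, i, eps_e, eps_t)
            if len(seeds) < min_pnts:
                labels[i] = NOISE
                continue
            labels[i] = next_label
            queue = list(seeds)
            while queue:  # expand the cluster through dense neighbors
                j = queue.pop()
                if labels[j] not in (UNDEFINED, NOISE):
                    continue
                labels[j] = next_label
                more = neighbors(tweets, j, eps_e, eps_t)
                if len(more) >= min_pnts:
                    queue.extend(more)
            next_label += 1
    return labels
```

With two groups of tweets at nearly the same location, one about flooding and one about job openings, the text threshold keeps them in separate clusters even though a purely spatial search would merge them.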

Quality Measures
The selection of proper measures for the evaluation of clustering algorithms depends on the available information and the methods used [42,43]. Two types of evaluation measures have been used in the literature: internal indices and external indices. While external indices compare the results with existing ground truth, internal measures compare the results of different algorithms to show which algorithm performs better. Under internal evaluation criteria, output clusters with high intra-cluster similarity and low inter-cluster similarity receive better scores. Because it is very hard to collect ground truth data for events that are happening in the real world, three internal measures, the Davies-Bouldin index [44], the Dunn index [45] and the silhouette coefficient [46], have been used in this study to compare the results of the proposed clustering algorithm with those of DBSCAN as the base algorithm. The Davies-Bouldin index is calculated using Equation (2).
$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \qquad (2)$$
In Equation (2), $n$ is the number of clusters, $c_x$ is the centroid of cluster $x$, $\sigma_x$ is the average distance of all objects of cluster $x$ to the centroid of the cluster, and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. According to this criterion, the algorithm that produces the lowest Davies-Bouldin index is considered to perform better.
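A direct reading of the Davies-Bouldin definition can be sketched as follows, assuming clusters are given as lists of coordinate tuples and distances are Euclidean. The function names are illustrative, and the sketch does not guard against two clusters with identical centroids, which would cause a division by zero.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lower is better)."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    cents = [centroid(c) for c in clusters]
    # sigma_x: mean distance of the members of cluster x to its centroid
    sigmas = [
        sum(math.dist(p, cen) for p in c) / len(c)
        for c, cen in zip(clusters, cents)
    ]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max(
            (sigmas[i] + sigmas[j]) / math.dist(cents[i], cents[j])
            for j in range(n) if j != i
        )
    return total / n
```

For two clusters {(0, 0), (2, 0)} and {(10, 0), (12, 0)}, each sigma is 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.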
The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance [45]. This index is calculated using Equation (3).
$$D = \frac{\min_{1 \le i < j \le n} d(i, j)}{\max_{1 \le k \le n} d'(k)} \qquad (3)$$
In Equation (3), $n$ is the number of clusters, $d(i, j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d'(k)$ is the distance between objects in cluster $k$. The distance between clusters $i$ and $j$ can be calculated using various methods, such as measuring the distance between the centroids of the clusters. The algorithm that achieves the higher Dunn index is more efficient.
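One possible instantiation of the Dunn index is sketched below. Two choices are assumptions of the example: the inter-cluster distance is taken as the distance between centroids (one of the options mentioned above), and the intra-cluster distance is taken as the cluster diameter, that is, the largest pairwise distance inside the cluster.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def diameter(points):
    """Largest pairwise distance inside a cluster (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def dunn_index(clusters):
    """Dunn index: smallest centroid separation over largest diameter."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    cents = [centroid(c) for c in clusters]
    min_inter = min(math.dist(a, b) for a, b in combinations(cents, 2))
    max_intra = max(diameter(c) for c in clusters)
    if max_intra == 0:
        raise ValueError("all clusters are single points; index undefined")
    return min_inter / max_intra
```

For the clusters {(0, 0), (2, 0)} and {(10, 0), (12, 0)}, the centroids are 10 apart and each diameter is 2, giving a Dunn index of 5.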
The silhouette coefficient contrasts the average distance to objects in the same cluster with the average distance to objects in other clusters [46]. The coefficient ranges between −1 and 1, where 1 represents the best value. Negative values indicate that samples are wrongly assigned to a cluster. Overlapping clusters result in values near 0. The following equation calculates the silhouette coefficient.
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (4)$$
In Equation (4), $b(i)$ is the mean distance between object $i$ and the objects of the nearest cluster that the object does not belong to, and $a(i)$ is the mean intra-cluster distance of object $i$.
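The silhouette coefficient can be sketched in plain Python as the mean of s(i) over all objects. Clusters are assumed to be lists of coordinate tuples with Euclidean distance; assigning a score of 0 to objects in single-member clusters follows a common convention and is an assumption here.

```python
import math


def silhouette(clusters):
    """Mean silhouette coefficient over all objects in all clusters."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # convention for single-member clusters
                continue
            # a(i): mean distance to the other members of the same cluster
            a = sum(
                math.dist(p, q) for qi, q in enumerate(cluster) if qi != pi
            ) / (len(cluster) - 1)
            # b(i): smallest mean distance to the members of another cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

For the clusters {(0, 0), (2, 0)} and {(10, 0), (12, 0)}, the outer points score 9/11 and the inner points score 7/9, so the mean coefficient is 79/99, or about 0.80.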

Results and Discussion
In addition to the geotagged tweets, $\varepsilon_t$ is the only input parameter of the proposed algorithm. The algorithm received the 8992 geotagged tweets related to Hurricane Florence (Section 2.1), with the value of $\varepsilon_t$ set to 0.5. The algorithm was able to assign cluster labels to the input tweets. To provide a reference for comparison, we also ran DBSCAN on the dataset with minPnts = 10 and epsilon = 0.1.

Parameter Selection
To obtain the value of $\varepsilon_t$ that maximizes the performance of the proposed solution, the model was run with different values of $\varepsilon_t$, from 0.3 to 1, and the silhouette scores were calculated for the output results. As can be seen in Figure 5, the value of 0.5 for $\varepsilon_t$ results in the highest silhouette score. For an $\varepsilon_t$ value of 0.7, VDCT extracted only one cluster and hence the silhouette score was equal to zero. For values higher than 0.7, no cluster was detected and no silhouette score was calculated. Therefore, the value of 0.5 was chosen as the text similarity threshold, which is also in line with the best text similarity threshold used in the literature [47][48][49][50].

Quality Measures' Results
The results of the VDCT and DBSCAN algorithms are presented in Table 1. According to the results, the Dunn index and silhouette coefficient of VDCT are higher than those of DBSCAN. VDCT also achieved a lower Davies-Bouldin index than DBSCAN. On all three internal measures, therefore, VDCT provides more satisfactory results than DBSCAN.

Visual Comparison and Discussion
The distribution of geotagged tweets from 12 September to 19 September is illustrated in Figure 2. The figure shows that the distribution of geotagged tweets varies over the study area and that, therefore, there may be clusters with varied densities. The clusters extracted using the VDCT and DBSCAN clustering algorithms are shown in Figures 6 and 7, respectively, where 11 clusters were extracted by VDCT and 8 clusters were determined by DBSCAN. The locations of the clusters extracted by both algorithms are almost the same. However, VDCT extracted clusters with more detail and higher accuracy in comparison with DBSCAN. In addition, the sizes of the clusters detected by VDCT differ from those of DBSCAN.

Comparing the size and content of the extracted clusters shows that in areas with less density variation, VDCT and DBSCAN extracted almost the same clusters, whereas in areas with higher variation in density the algorithms perform differently. In these areas, VDCT extracted more clusters with different densities. A part of the study area with higher variation in density is depicted in Figure 8. As shown in Figure 8a, by considering different values for $\varepsilon_e$, VDCT was able to extract three distinct clusters (C2, C3 and C4, Figure 6), while DBSCAN clustered all data of this area into a single group (C2, Figure 7). The Word Clouds of cluster C2 of DBSCAN and clusters C2, C3 and C4 of VDCT are illustrated in Figures 9 and 10a-c, respectively. A Word Cloud is a visualization method that displays the frequency and importance of each word in a document by its size. The Word Clouds of the VDCT clusters indicate that the proposed algorithm was able to appropriately separate clusters related to the hurricane from those on other topics. The most frequent words are "forecast", "tstorm" and "today" in cluster C4; "charlotte", "Florence" and "hurricane" in cluster C3; and "charlotte", "opening", "work" and "hiring" in cluster C2. In contrast, all these tweets are clustered together by DBSCAN. Another clear distinction between the outputs concerns cluster C5 of DBSCAN and clusters C7 and C8 of VDCT. As depicted in Figures 11 and 12, two distinct clusters extracted by VDCT are grouped into a single cluster by DBSCAN. VDCT separated the cluster related to the "Raleigh" event from the one related to its "Opening", while DBSCAN clustered all tweets related to "Raleigh" into one group.
Another important issue is the different sizes of the output clusters. Figure 13 shows clusters extracted by the two algorithms that are located at the same places but have different sizes. As the clusters are located at the same location, they indicate the same events, but the clusters extracted by VDCT are smaller than those of DBSCAN. Smaller and more compact clusters seem to be more useful since they make it possible to monitor specific incidents, whereas large clusters may represent both the event and the surrounding area. An example is Figure 11, in which VDCT separated tweets relating to "Raleigh" and its "opening", while DBSCAN considered all these tweets as one cluster. Another example is cluster C8 of DBSCAN and C11 of VDCT, whose Word Clouds are illustrated in Figure 14a,b, respectively. The Word Clouds and frequent words indicate that both clusters point to the same event, which is "hurricane" in "Wilmington". Because the DBSCAN cluster contains more tweets with words irrelevant to the hurricane, the frequency and importance of hurricane-related words such as "warning", "flood" and "tornado" are reduced by the presence of other words, and their sizes are therefore diminished in the Word Cloud of the DBSCAN cluster. In the cluster extracted by VDCT, however, words such as "warning", "hurricane", "flood" and "storm" can be clearly distinguished. The same holds for clusters C7 of DBSCAN and C10 of VDCT, whose Word Clouds are illustrated in Figure 15a,b, respectively. The frequent words "thunderstorm", "tstorm", "severe", "warning" and "forecast" can be quickly recognized in the VDCT cluster, while they cannot be easily identified in the DBSCAN cluster as they are affected by other words. The other clusters (C1, C5, C6 and C9 of VDCT and C1, C3, C4 and C6 of DBSCAN) are almost the same for both algorithms, as the density of tweets does not vary significantly in these places, and their Word Clouds show the same frequent words in the related clusters. The generated Word Clouds of these clusters are illustrated in Figure 16.
same places but with diverse sizes.As clusters are located at the same location, they are indicating the same events.But the extracted clusters by VDCT are smaller than those of DBSCAN.Smaller and more compact clusters seem to be more useful since they provide us with the ability to monitor specific incidents rather than large clusters which may present both the event and the surrounding area.An example is Figure 11, in which VDCT separated tweets relating to "Raleigh" and its "opening" while DBSCAN considered all these tweets as one cluster.Another example is cluster C8 of DBSCAN and C11 of VDCT which their Word Clouds are illustrated in Figure 14a,b, respectively.The extracted Word Clouds and frequent words indicate that both clusters point to the same event which is "hurricane" in "Wilmington".Having more tweets with irrelevant words to the hurricane, the frequency and importance of words related to hurricane such as "warning", "flood", "tornado" are affected by the existence of other words and therefore their sizes are diminished in the Word Cloud of the DBSCAN cluster.However, in VDCT extracted cluster, words such as "warning", "hurricane", "flood" and "storm" can be clearly distinguished.It is the same for clusters C7 of DBSCAN and C10 of VDCT which their Word Cloud are illustrated in Figure 15a,b, respectively.The frequent words, "thunderstorm", "tstorm", "severe", "warning" and "forecast" can be quickly recognized in VDCT cluster, while these words cannot be easily identified in DBSCAN cluster as they are affected by other words.The other clusters (C1, C5, C6 and C9 of VDCT and C1, C3, C4 and C6 of DBSCAN) are almost the same for both algorithms as the density of tweets does not significantly vary in these places and their Word Clouds depict the same frequent words in related clusters.The generated Word Clouds of these clusters are illustrated in Figure 16.An important issue about the presented Word Clouds of the clusters (Figures 8-16) is the existence of 
the names of the places near the main event in the Word Clouds.Although these place names can be considered redundant words from one aspect, they still convey valuable information that can be helpful in determining the location of the events.If we consider an event which has happened in a special place, the users from other areas may share some tweets related to that event.These tweets may create a cluster.However, the words in the Word Cloud of the cluster do not match with its surrounding location names.By considering the name of the surrounding locations and the   An important issue about the presented Word Clouds of the clusters (Figures 8-16) is the existence of the names of the places near the main event in the Word Clouds.Although these place names can be considered redundant words from one aspect, they still convey valuable information that can be helpful in determining the location of the events.If we consider an event which has happened in a special place, the users from other areas may share some tweets related to that event.These tweets may create a cluster.However, the words in the Word Cloud of the cluster do not match with its surrounding location names.By considering the name of the surrounding locations and the An important issue about the presented Word Clouds of the clusters (Figures 8-16) is the existence of the names of the places near the main event in the Word Clouds.Although these place names can be considered redundant words from one aspect, they still convey valuable information that can be helpful in determining the location of the events.If we consider an event which has happened in a special place, the users from other areas may share some tweets related to that event.These tweets may create a cluster.However, the words in the Word Cloud of the cluster do not match with its surrounding location names.By considering the name of the surrounding locations and the Word Cloud of the cluster, the events which happened in other places can be 
identified and single out.There are some studies in this regard that have tried to localize the extracted clusters based on the location names in each area [51][52][53][54].
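The dilution effect described for the Word Clouds can be illustrated with a minimal sketch. It assumes that word size in a Word Cloud is driven by a word's relative frequency within a cluster; the function name `relative_frequencies` and the toy tweets are hypothetical and are not taken from the study dataset.

```python
from collections import Counter


def relative_frequencies(tweets):
    """Return each word's share of all words in a list of tweet texts."""
    words = [w for text in tweets for w in text.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Invented toy tweets (assumption), standing in for a compact cluster and
# for a larger cluster that also absorbs unrelated tweets from nearby.
compact_cluster = [
    "hurricane warning wilmington",
    "flood warning wilmington",
]
merged_cluster = compact_cluster + [
    "hiring now apply",
    "coffee shop opening",
    "traffic on highway",
]

compact = relative_frequencies(compact_cluster)
merged = relative_frequencies(merged_cluster)

# "warning" keeps the same raw count in both clusters, but its share drops
# once unrelated tweets are merged in, so it would be drawn smaller.
print(compact["warning"], merged["warning"])
```

In this toy case the raw count of "warning" is unchanged, yet its share of the cluster vocabulary falls, which is consistent with the smaller hurricane-related words observed in the Word Clouds of the larger DBSCAN clusters.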

Conclusions and Future Works
This study proposed a solution for event extraction from geotagged tweets. In order to overcome the shortcomings of DBSCAN in dealing with density variation in the Twitter dataset, the VDBSCAN algorithm was extended to extract clusters from geotagged tweets. The proposed algorithm, VDCT, employs exponential spline interpolation to determine different search radiuses for cluster detection. In addition to spatial closeness, it utilizes cosine similarity to group tweets with similar content. For evaluation, the output clusters of VDCT were compared to those of DBSCAN. The results demonstrate the ability of VDCT to extract clusters with varied densities from geotagged tweets. In areas where density fluctuated, VDCT was able to extract more precise clusters with different densities and more detail, while DBSCAN merged denser clusters into one in areas with significant variation in density. The comparison of the content of the output clusters also showed that VDCT grouped tweets with more closely related content, while DBSCAN clusters sometimes included tweets with less contextual similarity.
As depicted in the output maps, the number of geotagged tweets varies considerably over different areas. Some areas contain a considerable number of users who share a large number of tweets during weekdays, while others have only a few active users. Therefore, the minimum number of points required to form a cluster could be set differently for different areas, according to the number of active users in each area. In this regard, future work will focus on determining different values for the minimum number of points for the VDCT algorithm based on the number of active users in each area. Additionally, improving the proposed solution so that it can localize the extracted events by considering the names of the surrounding locations will be a field of future investigation. Another issue is that cosine similarity does not consider the semantics of words. Utilizing semantic similarity measures to improve the result of spatial clustering is also planned as future work.
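The merging behaviour of a single search radius can be reproduced on toy data. The following is a minimal sketch, not the VDCT algorithm: it uses one-dimensional points, ignores tweet text, and takes the radii as given rather than deriving them by exponential spline interpolation. The function names, the point coordinates and the radii are assumptions chosen for illustration; the second function loosely follows the varied-density idea of running DBSCAN with increasing radii on the points that are still unclustered.

```python
def dbscan_1d(points, eps, min_pts):
    """Minimal DBSCAN for 1-D points; returns one label per point (-1 = noise)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # A point counts as its own neighbour, as in standard DBSCAN.
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_pts:
                queue.extend(nbrs_j)  # j is a core point, keep expanding
    return labels


def varied_density_1d(points, eps_values, min_pts):
    """Run DBSCAN with increasing radii on the points still left as noise."""
    labels = [-1] * len(points)
    next_id = 0
    for eps in sorted(eps_values):
        remaining = [i for i, label in enumerate(labels) if label == -1]
        sub = dbscan_1d([points[i] for i in remaining], eps, min_pts)
        for i, label in zip(remaining, sub):
            if label != -1:
                labels[i] = next_id + label
        next_id += max(sub, default=-1) + 1
    return labels


def count_clusters(labels):
    return len(set(labels) - {-1})


# Toy data (assumption): two dense groups close together, one sparse group.
points = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 10.0, 11.0, 12.0, 13.0]

large_radius = dbscan_1d(points, eps=1.0, min_pts=3)   # merges the dense groups
small_radius = dbscan_1d(points, eps=0.15, min_pts=3)  # loses the sparse group
varied = varied_density_1d(points, [0.15, 1.0], min_pts=3)

print(count_clusters(large_radius), count_clusters(small_radius), count_clusters(varied))
```

With one large radius the two dense groups collapse into a single cluster, and with one small radius the sparse group is discarded as noise; only the run with two radii recovers all three groups in this toy setting.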
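The limitation of cosine similarity noted above can be seen in a small sketch. It assumes raw term-count vectors over whitespace-separated tokens; the term weighting actually used by VDCT is not stated in this section, and the function name and example tweets are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Invented example tweets (assumption), not taken from the study dataset.
shared_words = cosine_similarity("hurricane warning issued", "hurricane warning today")
same_meaning = cosine_similarity("storm approaching coast", "hurricane nearing shoreline")

# Two of three words overlap in the first pair; the second pair describes a
# similar situation but shares no token, so its similarity is zero.
print(shared_words, same_meaning)
```

The second pair would be treated as unrelated even though both tweets describe a similar situation, which is the gap that a semantic similarity measure is intended to close.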

Figure 4. The pseudocode of the VDCT algorithm.

Figure 5. Variation of the silhouette coefficient in response to different ε_t values.

Figure 8. (a) A part of the study area with more variation in density; (b) clusters extracted by VDCT; and (c) clusters extracted by DBSCAN.

Figure 11. Word Cloud generated for Cluster 5 of DBSCAN.

Figure 12. Word Clouds generated for Clusters (a) 7 and (b) 8 extracted by VDCT and (c) their positions on the map.

Table 1. The output results of internal evaluation criteria.