Urban Crowd Detection Using SOM, DBSCAN and LBSN Data Entropy: A Twitter Experiment in New York and Madrid

Abstract: The user and the physical location are two important concepts associated with each other in social network-based localization services. This work studies urban behavior based on location-based social network (LBSN) data; we focus especially on the detection of abnormal events. The proposed crowd detection system uses the geolocated social data provided by the Twitter application programming interface (API) to automatically detect abnormal events. The methodology we propose consists of using an unsupervised competitive learning algorithm (self-organizing map (SOM)) and a density-based clustering method (density-based spatial clustering of applications with noise (DBSCAN)) to identify and detect crowds. The second stage is to build the entropy model to determine whether the detected crowds fit into the daily pattern with reference to a spatio-temporal entropy model, or whether they should be considered as evidence that something unusual is occurring in the city because of their number, size, location and time of day. To detect an abnormal event in the city, it is sufficient to determine the real entropy model and to compare it with the reference model. For a normal day, the reference model is constructed offline for each time interval. The obtained results confirm the effectiveness of the first stage of our method (the SOM and DBSCAN stage) in detecting and identifying clusters dynamically and mimicking human activity. These findings also clearly confirm the detection of special days in New York City (NYC), which proves the performance of our proposed model.


Introduction
Since the advent of smartphones, activity on social networks has become increasingly important. Our pace of activity is increasing, and the consequences of this intensive lifestyle will be interesting to observe. We are constantly connected; smartphones have definitely changed the way we live. How many gatherings have you attended where four or five people (or more) were absorbed in their phones? This huge smartphone revolution, coupled with the large number of social network users, has given rise to a new type of service in the field of localization, known as location-based social network (LBSN) services. These are applications available on mobile devices via the mobile network that use the geographic location of the device. With the massive development of smartphones, the daily data produced are now almost systematically linked to geographical coordinates (i.e., latitude and longitude). We can cite the example of Flickr [1], whose users can upload locally tagged photos to a social networking service; the example of Foursquare [2], which allows users to share their current location on a website to organize an activity in the real world; and the example of Twitter, which allows Internet users to comment on an event in real time and at the exact location where it takes place. In particular, the exploitation of the large amount of information provided by Twitter, with more than 350 million users (as of 2018), can potentially open new perspectives on the urban structure and urban mobility processes (grouping, trajectories, etc.). An LBSN does not only mean sharing our physical position with our friends; it also reflects an urban structure composed of individuals and resulting from their physical locations.
Location information collected over time can characterize urban prevalence (crowd spread), the spatial distribution of this prevalence at different moments, and the detection of a grouping and the monitoring of its evolution (spatial movement taking temporal aspects into account). These geolocated data are produced in huge quantities, especially by social networks such as Facebook and Twitter. All these data are collectively referred to as "big geosocial data" [3]. They should not be confused with the "volunteered geographic information" (VGI) of [4], as they are precisely characterized by their non-contributory nature. Indeed, since the "check-in" system (i.e., associating a physical place with a publication) appeared with the emergence of geolocated social networks, the social sharing of geographical localization across social web platforms has become commonplace [5]. Like [6], we prefer the term "ambient geospatial information" over VGI. In a societal context where the improvement of urban intelligence is crucial, we argue that the potential offered by the analysis of geosocial data sets is an opportunity that should be seized. More precisely, we focus here on urban crowd detection and ask the following starting question: can big geosocial data be used to identify urban crowds?

Concepts and Definitions of LBSNs
Social networks that include geolocation information in shared content are called location-based social networks. They provide geographical information on a map by physical proximity (real location), unlike the concept of chronological ordering [7]. The emergence of smartphones equipped with sensors, allowing users to locate themselves continuously and anywhere in urban areas, has offered a major development potential in this field. Location-based service (LBS) technologies are behind this geolocation. They allow the content of a mobile device to be customized based on its location. These are application services that use the location data of mobile terminals to provide them with personalized content and applications, depending on their geographical position. This type of service can be used in a wide variety of fields: marketing, advertising, health, work, etc.
LBS can be applied both to a fixed object (typically a point of interest) and to a moving object. In the second case, it will generally be another terminal. With the development of smartphones, LBS can increasingly integrate geolocation data as a search criterion, a set of elements that allows personal assistants, such as Siri (from Apple) or Cortana (from Microsoft), to offer a personalized search for each user integrating this geolocation dimension. The applications providing these services can be grouped into three classes: geo-tagging, point location and trajectory-based.
The user and the physical location are two important concepts that are associated with each other in the social network-based localization service. In the following, we focus on the research philosophy for urban cluster detection based on LBSNs.

Background on Crowd Detection Based on LBSNs
Throughout the day, we visit several physical locations and generate a "location history" through location tags. Figure 1 illustrates the relationship between the user and the history of his physical locations. The sequential connection of these locations in time gives a trajectory for each user. The collection of positions allows us to detect urban groupings in a given region at different times. In this sense, several studies focus on human behaviour and mobility, in order to understand citizens' movements and to detect abnormal groups based on LBSNs. Geolocation is one of the most common issues in LBSNs, and is made possible by various services. For these services, the user's location is only known with a certain precision, especially if it concerns prediction at an arbitrary time. Geolocated data are used in studies following people, usually in the context of the use of a particular space (urban space, station, etc.).

Related Work
Many research studies have investigated the value of using spatialized (distance and position) information to answer various questions, such as the evaluation of city activity [8][9][10], the analysis of spatial mobility [11][12][13], epidemiology [14,15], natural crisis management [16], and spatial planning [17]. For more details on the proposed methods in the context of human mobility, the reader is invited to read in particular [18]. Noulas et al. [19] quantitatively analyzed a colossal mass of data from Foursquare: about 12 million check-ins were collected from geotagged tweets for a total of 679,000 users and a 100-day collection period. The results of Noulas et al. make it possible to conclude that the Foursquare data are representative of the daily movements of the users. Among the most relevant spatial analyses carried out on Foursquare data, we also mention Kelley's research [20]. Ben Khalifa [21] suggested an analysis of geolocated social media data to identify urban crowds in New York City. "This analysis is gathered under a methodology for crowd detection in cities that combines social data mining, density-based clustering and outlier detection into a solution that can operate on-the-fly". According to Gao [22], the availability of big geo-social data on LBSNs provides an unprecedented opportunity to "study human mobile behavior through data analysis in a spatial-temporal-social context, enabling a variety of LBS, from mobile marketing to disaster relief". With the development and rapid adoption of LBSNs, Domínguez [23] proposed a system for the detection of abnormally high or low numbers of citizens in a given area on the basis of these services. Pelechrinis [24] presented practices and methods in the field of urban computing as well as open challenges: civic data and technologies for urban detection, analytical techniques used for urban data analysis, and concrete examples of urban computing applications.
In [25], the authors showed that location-based social media systems such as Instagram and Foursquare can serve as valuable sources of large-scale detection, and provide access to important characteristics of urban social behavior much faster than traditional methods. In [26], the authors studied human activity from mobile socio-demographic data in six Italian cities. Roberts [27] proposed the use of Twitter data for urban green space research. The time and geo-coordinates associated with a sequence of messages or tweets reflect the spatial and temporal movements of people in real life. The purpose of the Comito research [28] was to analyze these movements in order to discover community behaviors and to determine popular travel routes from geo-tagged statuses (i.e., collected geo-tagged data). Kanno [29] "proposed a method that measures demographic snapshots of a city from time- and geo-stamped micro-blog posts and visualizes high-risk evacuation roads on the basis of geographical characteristics and demographics". According to the author, this method allows a high level of situational awareness (hourly) to be achieved in order to provide evacuation routes. Kim [30] "proposed a new system of analyzing the spatiotemporal patterns of social phenomena in real time and the discovery of local topics based on their latent spatiotemporal relationships". Yang et al. [31] presented an observational study of the geolocated activities of users on two social media platforms, performed over a period of three weeks in four European cities. This study showed how demographic, geographical, technological and contextual properties of social media (and their users) can provide very different reflections and interpretations of the reality of an urban environment. The work of Bordogna [32] exploited the timestamped geo-labelled messages posted by Twitter users from their smartphones when they travel, in order to track their journeys.
To learn more about how social media data can be used to infer knowledge about urban dynamics and mobility patterns in an urban area, the reader is invited to read [33]. In [34], two types of data were used to determine the user communities. The authors studied the use of the physical space through the individual data (to track movements) and the overall use of the space. Ahas [35] pointed out that the main shortcomings of the telephone data are the difficulty of accessing the data and the lack of precision of the locations, where the considered location is most often that of the antenna to which the telephone is connected. We can also cite several types of geolocated data of interest by using the classification of Senaratne [36].
In this work, we use geolocated social data provided by the Twitter API to detect and identify urban groupings. Successful grouping identification relies on two different techniques: Kohonen topological maps [37,38], based on unsupervised learning, and DBSCAN [39].

Methodology
The proposed methodology for urban crowd detection consists of three phases: using the SOM to map the input space and select the DBSCAN parameters; applying the DBSCAN algorithm; and building the real and reference entropy models. LBSN data were collected and analyzed. In practice, the SOM mapping is used to carry out a first partitioning of these large data. This first clustering result is then refined by the DBSCAN technique. The use of the SOM in the first stage has two objectives: first, to provide a topological view of the data partitioning, and second, to allow us to propose an appropriate procedure for setting the necessary parameters of the DBSCAN algorithm.
Then, the density-based clustering phase is applied to discover clusters of arbitrary shape and to distinguish noise. The data are then re-analyzed using the DBSCAN algorithm (Figure 2) to determine whether the detected crowds fit into the daily pattern with reference to a spatio-temporal entropy model, or whether they should be considered as evidence of something unusual happening in the city because of their number, size, location and time of day. The DBSCAN algorithm uses two parameters: the distance ε and the minimum number of points MinPts that must lie within a radius ε for these points to be considered a cluster. The input parameters are, therefore, an estimate of the point density of the clusters.

SOM and DBSCAN
After a random initialization of the values of each neuron, the data are submitted one by one to the SOM. Each iteration of the sequential learning of the Kohonen maps consists of two steps. The first step is to randomly select an observation x(t) from the set of inputs, and to present it to the network to determine its winning neuron. The winning neuron (best matching unit (BMU)) of an observation is the one whose referent vector is closest to it in the sense of a given distance (e.g., the Euclidean distance). If c is the winning neuron of the vector x(t), c is determined as follows:

c = arg min_k d(x(t), w_k), (1)

where d(a, b) is the Euclidean distance between a and b, the w_k are the weight vectors, x(t) is the input vector, and k runs over the neurons of the map.
In the second step, the BMU is activated. Its referent vector is updated to be closer to the input vector presented to the network. This update does not only concern the winning neuron, as in competitive learning methods, but also its neighboring neurons, which then see their reference vectors adjusted toward the input vector. The amplitude of this adjustment is determined by the value of a learning step α(t) and the value of a neighborhood function h(t). The parameter α(t) regulates the speed of the learning process. It is initialized with a high value at the beginning, then decreases with the iterations to slow down the learning process as it progresses. The function h(t) defines the rate of change of the neighborhood around the BMU. It depends both on the location of the neurons on the map and on a certain neighborhood radius. In the first iterations, the neighborhood radius is large enough to include a large number of neurons, but it gradually narrows to contain only the winning neuron and its immediate neighbors, or even the winning neuron alone. The rule for updating the reference vectors is as follows:

w_k(t + 1) = w_k(t) + α(t) h_ck(t) [x(t) − w_k(t)], (2)

where c is the winning neuron of the input vector x(t) presented to the network at iteration t, and h_ck(t) is the neighborhood function that defines the proximity between neurons c and k. A more flexible and common neighborhood function is the Gaussian function defined below:

h_ck(t) = exp(−‖r_c − r_k‖² / (2σ(t)²)), (3)

where r_c and r_k are respectively the locations of neuron c and neuron k on the map, and σ(t) is the neighborhood radius at iteration t of the learning process. With such a neighborhood function, the amplitude of the adjustment is graduated according to the distance from the winning neuron, which reserves the maximum amplitude for itself. The result of this unsupervised learning is a non-linear projection of all the observations onto the map. Each observation is attributed to its winning neuron.
In addition to the quantification task, this projection preserves the topology of the data through the use of the neighborhood function. Two neighboring neurons on the map will represent close observations in the data space.
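As a rough illustration, the SOM training loop described above can be sketched as follows. This is a minimal pure-Python sketch over (latitude, longitude) pairs; the grid size, linear decay schedules and iteration count are illustrative choices for the example, not the parameters used in our experiments.

```python
import math
import random

def train_som(data, rows=4, cols=4, iters=500, alpha0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: each neuron on a rows x cols grid holds a
    (lat, lon) weight vector, trained by the BMU + Gaussian neighborhood rule."""
    rng = random.Random(seed)
    # Initialization: weight vectors drawn from the input space (step 1).
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for t in range(iters):
        x = rng.choice(data)  # sampling: draw one observation x(t) (step 2)
        # Best matching unit: neuron whose weight vector is closest to x(t).
        bmu = min(weights, key=lambda n: (weights[n][0] - x[0]) ** 2
                                         + (weights[n][1] - x[1]) ** 2)
        alpha = alpha0 * (1 - t / iters)            # decaying learning step alpha(t)
        sigma = max(sigma0 * (1 - t / iters), 0.5)  # shrinking neighborhood radius sigma(t)
        for n, w in weights.items():
            d2 = (n[0] - bmu[0]) ** 2 + (n[1] - bmu[1]) ** 2  # grid distance to the BMU
            h = math.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighborhood h_ck(t)
            w[0] += alpha * h * (x[0] - w[0])                 # update rule, Equation (2)
            w[1] += alpha * h * (x[1] - w[1])
    return weights
```

Because the update is a convex step toward the sampled observation, the trained weight vectors stay inside the convex hull of the input coordinates, which is what makes the BMUs usable as cluster prototypes in the next stage.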
The obtained result is represented by a set of winning nodes. Each BMU is associated with the most similar set of real data. However, not all data points are taken into account, in order to avoid outliers. We thus obtain a first grouping of the input space (SOM-based clustering).
The DBSCAN algorithm uses two parameters: the distance ε and the minimum number of points MinPts that must lie within a radius ε for these points to be considered a cluster. The input parameters are, therefore, an estimate of the point density of the clusters. The ε-neighborhood of a point x is the set of points in the data set whose distance to x is less than ε:

N_ε(x) = {u ∈ X | d(u, x) < ε}.
We now say that two points u and x are density-connected if we can pass from one to the other through a series of ε-neighborhoods, each containing at least MinPts points. In other words, there is a sequence of inner points v_1, v_2, ..., v_m such that v_1 belongs to the ε-neighborhood of u, v_2 belongs to the ε-neighborhood of v_1, and so on, until x belongs to the ε-neighborhood of v_m. We also say that x is density-reachable from u. However, we do not know the values of ε and MinPts in advance, so it is essential to select them properly. As a result of the SOM stage, the data in the input space can be abstracted to a much smaller number of points (each BMU is associated with the most similar set of real data). The input space can then be seen as a set of BMUs.
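The ε-neighborhood and density-reachability notions can be sketched in a few lines. This is an illustrative implementation over plain coordinate tuples in the Euclidean plane, not the code used in our experiments:

```python
import math

def eps_neighborhood(x, points, eps):
    """N_eps(x): all points of the data set within distance eps of x."""
    return [u for u in points if math.dist(u, x) < eps]

def density_reachable(u, x, points, eps, min_pts):
    """True if x can be reached from u through a chain of eps-neighborhoods,
    each anchored at a core point (a point with at least min_pts neighbors)."""
    visited, frontier = {u}, [u]
    while frontier:
        v = frontier.pop()
        nbrs = eps_neighborhood(v, points, eps)
        if len(nbrs) < min_pts:  # v is not a core point: the chain stops here
            continue
        for w in nbrs:
            if w == x:
                return True
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    return False
```

Note that, following the usual DBSCAN convention, a point counts in its own ε-neighborhood, since d(x, x) = 0 < ε.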
On the basis of these results, we choose the parameters ε and MinPts from the SOM output, where i is the number of clusters and k is the number of tweets in each cluster. Once ε is determined, we compute the number of points NB in the ε-neighborhood of each data point. Subsequently, the MinPts parameter is calculated as the arithmetic average of these counts, which then allows us to establish DBSCAN as follows:

1. Consider each BMU as a cluster centroid.
3. Check that it contains MinPts points or more.
4. Check if there is a BMU_k, k = 1, ..., n and k ≠ j, reachable by density from BMU_j.
5. Build C = C_k(BMU_k, x_1, ..., x_i) ∪ C_j(BMU_j, x_1, ..., x_i) and consider the central point H_c between these two BMUs as the centroid of this new cluster.
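The merge in step 5 can be illustrated with a small sketch. The helper below is hypothetical (names and signature are ours): it unions two density-connected BMU clusters and takes the midpoint H_c of the two BMUs as the new centroid.

```python
def merge_clusters(bmu_j, cluster_j, bmu_k, cluster_k):
    """Step 5 sketch: union two density-connected BMU clusters and take the
    midpoint H_c of the two BMUs as the centroid of the merged cluster."""
    # Order-preserving union of the two point lists (duplicates dropped).
    merged = list(dict.fromkeys(cluster_j + cluster_k))
    # Central point H_c between the two BMUs (coordinate-wise average).
    h_c = ((bmu_j[0] + bmu_k[0]) / 2, (bmu_j[1] + bmu_k[1]) / 2)
    return h_c, merged
```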
As human activity is variable over time, we define a test interval T equal to 30 and 60 min. At each time interval T, and based on the result provided by the SOM and DBSCAN, we can model the grouping behavior in a given area by a circle Z defined by its center C, a radius ε and a density MinPts. The aim is to train the SOM on the input space (geo-localized tweets), and to apply DBSCAN to detect clusters of varied density with different shapes and sizes. The stage can be summarized as follows:

1. Initialization: choose random values for the initial weight vectors ω_k (of the same type as the elements of the input space, a geographic coordinate (latitude and longitude)).
2. Sampling: draw a sample training input vector x(t) from the input space.
3. Find the winning neuron c that has the weight vector closest to the input vector, Equation (1). Keep returning to step 2 until the feature map stops changing.
6. Build the real entropy model (Section 3.2).
10. Compare the two models.
The studied city can be considered as a spatio-temporal model of an urban grouping composed of a set of circles, with each circle representing a grouping.
This model is instantiated t = 24 times if the time interval is T = 60 min, and t = 48 times if T = 30 min. Each instance is associated with a set of symbols β that define the state of the city under study. Thus, the behavior of the city is defined by a sequence S of i symbols, S = β_1, ..., β_i, with β_i = {G_1, ..., G_k}, where k is the number of crowds. Each group is defined by a circle with center C(latitude_x, longitude_y), radius ε and density MinPts.
For example, β_1 = {G_1, G_2, G_3} is the state of the city described by three groupings at the time interval T_1. From the entropy point of view, if the source M always sends the β_1 symbol, then its entropy according to Shannon [40] is nil, i.e., the uncertainty about what the source emits is minimal.
H(M) is, therefore, a reference on the state of the city at this time interval. In this way, we can build a reference on the state of the city for the 24 h based on the Shannon entropy (the reference entropy model). To detect an abnormal grouping in the city, it is sufficient to determine the real entropy model and compare it with the reference model.

Real and Reference Entropy Models
For a discrete random variable M, with i symbols and each symbol β_i having a probability of appearing P_i, the entropy H of the source M is defined as follows:

H(M) = E[−log(P_i)] = −Σ_i P_i log(P_i), (8)

where E denotes the mathematical expectation, and log is the logarithm function. The symbols representing the possible realizations of the random variable M are β_1, ..., β_i. To build the reference entropy model, we look for the nil entropies of M at each interval. As ε and MinPts are provided by DBSCAN and the SOM for the already known time interval T, Equation (9) is written as follows:

H(M) = −Σ_i P_i log(P_i), (9)

where P_i is the probability of occurrence of the {C_1, ..., C_k} locations in a sequence of j measurements for each time interval T; that is to say, the number of times each symbol β_i appears divided by j.
The reference model (for the normal day) is then constructed offline for a period of 28 days, i.e., four tests for each time interval T.
For the reference model, we keep the entropy range of the source defined by [min(entropy), max(entropy)]. For example, for T = [00:00, 00:30] and by applying SOM and DBSCAN, the possible realizations of the source are β_1 and β_2. So, we calculate I = 6 measurements for the possible realizations of source M, which describe the state of the city at the time interval T; the results are shown in Table 1.
The entropy of the source for the first normal Monday is then obtained by applying this definition to the measurements in Table 1.
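As a minimal sketch of this computation, the following computes the entropy of an observed symbol sequence and checks it against a reference band. The symbol names and sequences are illustrative (not the paper's actual measurements), the base-2 logarithm is our choice of log base, and the band values are the normal-Monday interval [0.0344, 0.3544] reported in the Results section.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """H(M) = -sum_i P_i * log2(P_i), with P_i the observed frequency of
    each symbol in the sequence of j measurements."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_abnormal(real_entropy, ref_min, ref_max):
    """Flag the interval as abnormal when the real entropy leaves the
    reference band [min(entropy), max(entropy)] built offline."""
    return not (ref_min <= real_entropy <= ref_max)
```

A source that always emits the same symbol has nil entropy, matching the remark above; a sequence split evenly between two symbols reaches one bit.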

Capturing Tweets
According to the literature, geo-localized tweets represent "1% of the total feed" [40]. A "geo-localized tweet" is associated with a geographic coordinate (latitude and longitude). The public stream from which we gather the tweets is made available by the Twitter API. However, the full Twitter data feed is not accessible to the general public: officially, 1% of the tweet traffic is made available [40,41]. This sample of tweets, in the form of streams, is issued according to a user-defined criterion (geo-location or keywords).
Generally, there are three different ways to collect Twitter data: the Firehose API, the Twitter Search API, and the Twitter Streaming API. Through the Search API, users request tweets that match some sort of "search" criteria. The criteria can be keywords, usernames, locations, named places, etc. A good way to think of the Twitter Search API is to think of how an individual user would do a search directly on Twitter. According to the documentation, the Twitter Search API is rate-limited, currently to 180 requests per 15 min. Unlike the Search API, where you are polling data from tweets that have already happened, the Twitter Streaming API pushes data as tweets happen, in near real-time. The final way to access data is to have access to the full Twitter Firehose. The Twitter Firehose is in fact very similar to the Streaming API in that it pushes data to end users in near real-time; however, the Firehose guarantees delivery of 100% of the tweets that match your criteria, but it is not free.
In this work, we use Twitter's Streaming API for the following reasons: in terms of the amount of data provided, the Streaming API is certainly better than the Search API; it delivers a nearly real-time data set; it allows the region of study to be specified (a shape formed by two geographic coordinates); and it is free.
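The region of study is passed to the Streaming API's `locations` filter as a bounding box given by its southwest and northeast corners, longitude first. A rough way to build such a box around a central point is the equirectangular approximation below, which is adequate for sub-kilometre radii; the helper name and the meters-per-degree constant are illustrative:

```python
import math

def bounding_box(lat, lon, radius_m):
    """Approximate bounding box around a center point, returned as
    [sw_lon, sw_lat, ne_lon, ne_lat] -- the order the Streaming API's
    `locations` filter expects (longitude first, southwest then northeast)."""
    dlat = radius_m / 111_320.0  # ~meters per degree of latitude
    dlon = radius_m / (111_320.0 * math.cos(math.radians(lat)))  # shrinks with latitude
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]
```

For example, `bounding_box(40.758895, -73.985131, 500)` yields a box that encloses the Times Square reference area described below.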

Data Sets
In this study, we classified the corpus of gathered tweets into three datasets. The first dataset, for the special day (from 15:00 of 31 December 2017 to 15:00 of 1 January 2018), was selected to study the behaviour of the city during a special day and to validate the effectiveness of our approach for the detection of an abnormal day. The second dataset was selected to build the reference entropy model. The last one was selected to determine whether the proposed approach is able to detect abnormal days during this period based on the constructed reference model. Several works have followed the same procedure. In [21], the authors used the Streaming API to obtain four datasets: one for the special day (from 15:00 of 31 December 2013 to 15:00 of 1 January 2014) and another for a normal day (from 15:00 of 24 February 2014 to 15:00 of 25 February 2014) in both cities (NYC and Madrid). In [23], the authors used a data set including both normal days and special days, due to festivities like Christmas or natural phenomena like the weekend when Storm Jonas hit the United States; these dates can be used to test outlier detection because they are supposed to exhibit uncommon behaviour (higher densities at Christmas, lower during the storm). The study in [40] relied on one full year of geolocated tweets, posted by users all over the world from 1 January until 31 December 2012; the database consists of 944 M records generated by a total of 13 M users.

Reference Area
Before applying the clustering stage and the real and reference entropy model, it is necessary to define a reference area. In NYC, we selected Times Square as the reference area, the most popular area in Manhattan, where people gather for New Year's Eve. The studied area is defined by the central point P1(−73.985131, 40.758895) and a radius of 500 m. In Madrid, we selected Puerta del Sol as the reference area defined by the central point P2(40.416729, −3.703339) and a radius of 500 m.
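Whether a geolocated tweet falls inside such a reference area can be checked with a great-circle distance test. The sketch below uses the haversine formula and the Times Square central point P1; note that the coordinates are written here in (latitude, longitude) order, and the function names are ours:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points on a spherical Earth."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

TIMES_SQUARE = (40.758895, -73.985131)  # central point P1 as (lat, lon)

def in_reference_area(lat, lon, center=TIMES_SQUARE, radius_m=500.0):
    """True if the point lies within radius_m of the reference area's center."""
    return haversine_m(lat, lon, center[0], center[1]) <= radius_m
```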
The results are summarized in Table 2: column (1) represents the time interval, columns (2) and (3) the values of the input parameters, column (4) the number of clusters in NYC, and column (5) the noise points.

Results
At first glance, we notice that the results in Table 2 show that the number of detected clusters is clearly lower at night [00:00, 06:00]. The results presented in Figure 3 show a higher number of clusters during the special day. They also show that the number of crowds decreases from 22:00 to 05:00 and then increases again. Except for the interval [05:00, 09:00], the detected number of clusters is remarkably higher during New Year's Eve than during a normal day. Therefore, we clearly see a contrast between the two days. A comparison between the number of crowds detected on the normal day (blue) and on New Year's Eve emphasizes the effectiveness of our system for crowd detection and for distinguishing between normal and special days. The procedure for measuring/estimating the parameters (ε and MinPts) assumes a cautious process for cluster detection, without missing or disguising small crowds.
An examination and analysis of the results of Tables 2 and 3 show how the є and MinPts parameters affect the results of the clustering. It represents a dynamic system for the є and MinPts parameters selection. Values of these parameters are considered each hour and are obtained from the geolocated tweets. Tables 2 and 3 show how є and MinPts parameters selection using SOM (Section 3.1) affects the DBSCAN algorithm results. The SOM parameters selection process adopts a conservative approach, and preserves the topology of the data through the use of the neighborhood function, which allows detection of the clusters (even if they are small). Focusing on the number of clusters in the normal and special day of both Tables 2 and 3, we note that the number of clusters in the New Year's Eve day in NYC is always higher than in the normal day, except for a very few cases, which is evidence of something unusual happening in the city. Figure 3 illustrates a comparison between the number of clusters obtained on the normal day and the New Year's Eve day in NYC. Table 4   The ε values were calculated each hour, and are logically higher in the normal day compared to the special day when people began to come together to celebrate. Figure 4 shows the discrepancy in the number of clusters during 24 h in Madrid city within 300 m of Puerta del Sol. This obviously confirms the dissimilarity behavior on a special and a normal day, so we can give a clear picture of the activity and the locations of the urban crowds in the city. Thus, it can be used (SOM and DBSCAN stage) to build with confidence the real and reference entropy models. The ε values were calculated each hour, and are logically higher in the normal day compared to the special day when people began to come together to celebrate. Figure 4 shows the discrepancy in the number of clusters during 24 h in Madrid city within 300 m of Puerta del Sol. 
Figure 4 shows the discrepancy in the number of clusters over 24 h in Madrid within 300 m of Puerta del Sol. This confirms the dissimilar behavior on a special day and a normal day, so we can give a clear picture of the activity and locations of the urban crowds in the city. The SOM and DBSCAN stage can thus be used to build the real and reference entropy models with confidence. Subsequently, we first built the reference entropy model for each day, and then compared it to the real entropy model to detect anything unusual about the crowd behavior in the city. To determine the entropy reference interval for every day of the week, we calculated the maximum and minimum entropy values for each time interval T during the first four weeks of January, February, March and April 2017, i.e., four measurements for each T of every day, and kept the obtained average. For a normal Monday, the entropy values in T1 = [00:00, 01:00] lie within [0.0344, 0.3544]. Figure 6 gives a detailed description of the state of the city for the 7 days of the week for the time interval T = 60 min in New York City.
The purple band shows the evolution of the maximum and minimum reference values of the entropy for each day of the week. This reference is given by the lower and upper bounds that frame the possible entropy values of the source for a normal day. The blue curve represents the evolution of the entropy (six measurements per time interval) of the source for a normal day.
Note that for the Friday entropy model shown in Figure 6, at the time intervals [10:00, 11:00] and [11:00, 12:00], the recorded entropy exceeds the upper limit of the reference. That is to say, for the six entropy measurements, the source always returns the possible realization sequence. This can be explained either by an abnormal event in this interval that results in a stable state of the city, or by a false connection state, i.e., the user is no longer connected but is still considered online. In the following, we limit ourselves to four measurements for each time interval T to avoid false connections.
Observing how the intense changes in entropy (blue curve) are tracked by the reference entropy model (purple band), we find that the entropy algorithm is properly applied to build the reference model. This property is very important for adapting to the dynamics of crowds, as it quickly reveals the creation of a new crowd in the short term. These results validate the proposed entropy approach, which adapts to the dynamics of the city, so we can rely on the entropy process for anomaly detection with confidence.
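The reference-band idea above can be sketched compactly. The exact probability model the paper uses for the entropy source is not reproduced here, so this example makes the simplifying assumption that entropy is computed over the distribution of crowd sizes per interval; all function names are illustrative.

```python
# Sketch of the reference entropy band, assuming entropy is taken over the
# distribution of crowd sizes in each time interval (an assumption; the
# paper's exact source model is not reproduced here).
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of a list of per-cluster counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def reference_band(measurements):
    """measurements[w][t] = entropy of normal week w at interval t.
    Returns per-interval (min, max) bounds, analogous to the offline
    reference built from the four weekly measurements."""
    n_t = len(measurements[0])
    return [(min(m[t] for m in measurements), max(m[t] for m in measurements))
            for t in range(n_t)]

def is_abnormal(real_entropy, band):
    """Flag each interval whose real entropy leaves the reference band."""
    return [not (lo <= h <= hi) for h, (lo, hi) in zip(real_entropy, band)]
```

With a band like [0.0344, 0.3544] for a normal Monday at T1 = [00:00, 01:00], any measured entropy outside those bounds would be flagged, mirroring the comparison between the real and reference models.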
The ε distance ranges from 21 to 150 m on the normal day and from 9 to 60 m on New Year's Eve, as shown in Figure 7. Moreover, ε is usually lower on the special day than on the normal day: a low value of ε designates more clusters closer to each other, which is consistent with a special day. We notice these differences especially in the time interval [00:00, 06:00], where the differences between New Year's Eve and the normal day are remarkable. Turning to the variation of the MinPts parameter, its lowest values vary from two to four tweets on the normal day in the time intervals [04:00, 20:00] and [06:00, 23:00]. On the special day, the MinPts values are higher in the time interval [01:00, 02:00], reflecting few geo-tagged tweets and low activity in the city.
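The sensitivity to ε and MinPts described above can be seen directly in a minimal, textbook implementation of DBSCAN. This is only an illustrative sketch on toy planar points, not the paper's implementation; the parameter values used here are arbitrary.

```python
# Minimal textbook DBSCAN on 2-D points, to illustrate how eps and min_pts
# change the clustering outcome. Toy values only; not the paper's code.

def region(points, i, eps):
    """Indices of all points within distance eps of point i (incl. itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (qx - px) ** 2 + (qy - py) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return a label per point: -1 = noise, >= 0 = cluster id."""
    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = region(points, i, eps)
        if len(neigh) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        cid += 1                    # new cluster seeded by a core point
        labels[i] = cid
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = region(points, j, eps)
            if len(jn) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(jn)
    return labels
```

Shrinking ε (as happens on the special day) splits the data into tighter, more numerous clusters, while raising MinPts suppresses small groups, which is exactly the trade-off the hourly parameter selection has to balance.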

Discussion
With the advent of smartphones, it is very interesting to propose a crowd detection system using geolocated social networks. The idea of density-based clustering for urban crowd detection has recently been applied to social network analysis [21,23,41-43]. Geo-tagged tweets allow one to detect real-world events from social network data. To analyze the behavior of individual users in a geographical area under study, the preferred spatial clustering method is DBSCAN [21], used in a proposed density-based clustering method consisting of two main stages: a training stage and a detection stage. "For the first one it is necessary to mine LBSNs (one or more) in order to gather a representative set of geo-located users' interactions (posts) and construct a geo-located dataset of the citizen's activity (positions) all around the smart city for a whole day (24 h), the reference day. This dataset is analyzed by using a density-based clustering algorithm, with geographical proximity as distance, in order to detect dense groups of users located in the same geographical area at the same period of time". However, using DBSCAN also involves problems. It can be hard to choose the input parameters for high-dimensional data because of the loss of contrast in the distances. Furthermore, LBSN data contain a large amount of information, and finding cluster patterns across several dimensions requires vast computing power, whereas short computing time is always favorable. Last, the clusters can be arbitrary and complex, and finding these shapes can be very cumbersome. It therefore becomes difficult to use DBSCAN on high-dimensional data because of its parameterization. In [23], the authors propose an improvement of [21], consisting of two main phases, to "compare the current activity in the social media stream on-the-fly with the reference cluster that is located in the same area on the same day of the week and at the same time interval.
Therefore, the outlier detection is not performed equally for every cluster, but locally. This means that, instead of comparing the number of points of all the clusters in a wide area, a cluster is only compared with the nearest reference cluster (if there is one which is near enough to be considered as comparable)". Since DBSCAN parameterization on large data is difficult, other studies use the "ordering points to identify the clustering structure" (OPTICS) algorithm, which is similar to DBSCAN but addresses the problem of detecting meaningful clusters in data of varying density [23]. LSDBC, OPTICS and HDBSCAN*, in "which the concept of border points was abandoned, and only core points are considered to be part of a cluster at any time, which is more consistent with the concepts of a density level set", are examples of DBSCAN variants that focus on hierarchical clustering results [43], but they still suffer from high dimensionality.
On the other hand, [44] introduced the concept of entropy and its practical interpretation. The proposed approach exploits the entropy behavior to minimize both drawbacks: the "anomalous data increase the entropy values, so no previous patterns are needed", and the "entropy levels are continuously adapted as long as new geolocated data are extracted from social media". The obtained results validated the effectiveness of entropy applied to social media location data and of the methodology for detecting crowd anomalies.
Furthermore, one of the strengths of the DBSCAN algorithm is that it can be paired with any data type, distance function (Euclidean, great-circle) and indexing technique adequate for the dataset to be analyzed. We therefore used SOM mainly to reduce the input space and to enable the parameter (ε and MinPts) selection process, which then allows us to apply DBSCAN to the spatial and temporal properties of the Twitter data. This fertile ground for applying DBSCAN encouraged us to propose urban crowd detection using the SOM, DBSCAN and LBSN data entropy methodology.
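Since geolocated tweets carry latitude/longitude coordinates, the great-circle distance mentioned above is the natural choice when clustering them. A standard haversine implementation (offered here as a self-contained illustration, not as the paper's code) looks as follows:

```python
# Great-circle (haversine) distance between two (lat, lon) points in degrees,
# suitable as the distance function for density-based clustering of tweets.
import math

def haversine_m(p, q):
    """Distance in meters between points p and q given as (lat, lon)."""
    R = 6371000.0  # mean Earth radius in meters
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))
```

With this as the distance function, an ε of, say, 60 m directly corresponds to the on-the-ground cluster radii reported for the experiments.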
In this work, we use geolocated social data provided by the Twitter API to automatically detect and identify urban groupings. The crowd detection relies successively on two stages:

•
The SOM (unsupervised clustering algorithm) and DBSCAN (density-based clustering algorithm) stage to identify and detect the crowds. This SOM and DBSCAN method for tweet clustering is described in Section 3.1. Tables 2-4 and Figure 4 summarize the clustering results obtained in NYC and Madrid. These results helped create a tool to support the abnormal event detection process, so this was a very important step. Figures 3-5 give a detailed description of the state of the city on normal and special days in Madrid and NYC, illustrating the activity and locations of the urban crowds. Figure 7 shows the dynamics of our system for estimating the parameters ε and MinPts. All these results confirm the effectiveness of the method used in the first stage to detect and identify clusters dynamically, imitating human movements. We therefore have a robust methodology for identifying and detecting crowds that we can rely on.
•
The entropy model stage to detect abnormal events in the crowds. The reference entropy model is constructed offline. Figure 6 illustrates the evolution of the maximum and minimum reference values of the entropy for each day of the week. Figure 8 clearly shows the detection of special days in NYC, and proves the performance of our proposed model in determining whether the detected crowds fit into the daily pattern, or whether they should be considered as evidence of something unusual happening in the city.

The use of SOM in the first stage allows for:
1.
Reducing the input space, which then allows us to apply DBSCAN; the DBSCAN algorithm is difficult to use in very high dimensions.
2.
Detecting, together with DBSCAN, clusters of varied density with different shapes and sizes from a large amount of data containing noise and outliers.
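The two stages above compose naturally into one detection loop. The following glue function is purely hypothetical: the function names, the interval-indexed data layout and the band representation are assumptions made for illustration, with the clustering and entropy steps passed in as callables.

```python
# Hypothetical glue for the two-stage pipeline: cluster each interval's
# geolocated tweets (SOM + DBSCAN stage), derive the real entropy model,
# and flag intervals whose entropy leaves the offline reference band.
# All names and the data layout are illustrative assumptions.

def detect_abnormal_events(tweets_by_interval, reference_band,
                           cluster_fn, entropy_fn):
    """Return one boolean per time interval: True = abnormal event suspected."""
    alerts = []
    for t, tweets in enumerate(tweets_by_interval):
        clusters = cluster_fn(tweets)      # stage 1: crowd detection
        h = entropy_fn(clusters)           # stage 2: real entropy for interval t
        lo, hi = reference_band[t]         # offline reference for interval t
        alerts.append(not (lo <= h <= hi)) # outside the band = abnormal
    return alerts
```

Keeping the two stages behind callables mirrors the paper's separation of concerns: the clustering front end can change (ε, MinPts, SOM size) without touching the entropy comparison.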
The principal limitation of using LBSNs is the poor availability of geolocated tweet data: Twitter data are not fully accessible to the general public, as officially only 1% of the tweet traffic is made available [40,41]. Nevertheless, it remains a very interesting context for studying urban behavior.

Conclusions
Personal location and navigation have become a major field in a mobile society, especially with the huge smartphone revolution coupled with the large number of social network users, where daily produced data are now almost systematically linked to geographical coordinates. This technological revolution has given rise to a new type of service in the field of localization, known as LBSN services; these are applications available on mobile devices via the mobile network that use the geographic location of the device. An LBSN does not only mean sharing our physical position with our friends; it also reflects a natural urban structure. It is precisely in this context that we propose a new system for urban crowd detection using SOM, DBSCAN and entropy based on the LBSN data of the most popular public social network, i.e., Twitter.
The system proposed in this paper consists of two stages. The first is based successively on the unsupervised clustering algorithm SOM and the density-based clustering algorithm DBSCAN to identify and detect crowds. The use of SOM in the first stage has the following objectives: first, to provide a topological view of the partitioning of the data, and second, to allow us to propose an appropriate procedure for selecting the parameters required by the DBSCAN algorithm in order to make it usable with big databases. Once the DBSCAN parameters are obtained, they are applied not to the entire data space but to the set of BMUs, as explained in Section 3.1. The SOM and DBSCAN step helps identify and detect urban crowding, which improves the robustness of our system for detecting abnormal events in the crowds. The second stage is to build a daily city-state reference based on the Shannon entropy (the reference entropy model). To detect an abnormal event, it is sufficient to determine the real entropy model and compare it with the reference model (Section 3.1).
The concept of the abnormal event detection approach can be summarized as follows: identification and detection of clusters (SOM + DBSCAN), building the reference and the real entropy models, and finally comparing the two models. The obtained results prove the correctness and robustness of our method.
Author Contributions: All the authors participated in the conceptualization of this paper. M.S. and M.Z. contributed to proposed idea, the design, implementation, and validation of the SOM and DBSCAN algorithm for crowd's detection. A.D.A. contributed to the data sets and the real-reference entropy model approach to abnormal events detections.
Funding: This research received no external funding.