Mapping Dynamic Urban Land Use Patterns with Crowdsourced Geo-Tagged Social Media (Sina-Weibo) and Commercial Points of Interest Collections in Beijing, China

Abstract: In fast-growing cities, especially large cities in developing countries, land use types are changing rapidly, and different types of land use are mixed together. It is difficult to assess the land use types in these fast-growing cities in a timely and accurate way. To address this problem, this paper presents a multi-source data mining approach to study dynamic urban land use patterns. Spatiotemporal social media data reveal human activity patterns in different areas, social media text data reflect the topics discussed in different areas, and Points of Interest (POI) reflect the distribution of urban facilities in different regions. Human activity patterns, topics of discussion on social media, and the distribution of urban facilities in different regions were combined and analyzed to infer urban land use patterns. We collected 9.5 million geo-tagged Chinese social media (Sina-Weibo) messages from January 2014 to July 2014 in the urban core areas of Beijing and compared them with 385,792 commercial Points of Interest (POI) from Datatang (a Chinese digital data content provider). To estimate urban land use types and patterns in Beijing, a regular grid of 400 m × 400 m was created to divide the urban core areas into 18,492 cells. By analyzing the temporal frequency trends of social media messages within each cell using the K-means clustering algorithm, we identified seven types of land use clusters in Beijing: residential areas, university dormitories, commercial areas, work areas, transportation hubs, and two types of mixed land use areas. Text mining, word clouds, and the distribution analysis of POI were used to verify the estimated land use types. This study can help urban planners create up-to-date land use maps economically and help us better understand dynamic human activity patterns in a city.


Introduction
The increasing popularity of social media services and smartphones has enabled the public to share their daily activities online and to leave digital footprints in urban areas. Collecting geo-tagged social media messages with GPS coordinates within urban areas can help researchers understand spatially oriented human activities and urban spatial patterns. For example, Tsou et al. (2013) [1] demonstrated a research framework for tracking and analyzing the spatial content of social media (Twitter) that can facilitate the tracking of social events (the 2012 U.S. presidential election) from a spatial-temporal perspective. Liu et al. (2014) [2] used location-based social media data to analyze the underlying patterns of trips and spatial interactions in cities, and revisited spatial interaction and distance decay in spatially-embedded networks. Several scholars have used social media, as a source of crowdsourced spatiotemporal data, to understand emergency events, enhance emergency situation awareness, and improve the efficiency of emergency response [3][4][5][6].
Geo-tagged social media messages and GPS tracking data (from mobile phones and vehicles) have been used to understand human movement behavior and spatiotemporal patterns [7][8][9][10]. For example, studying social media "check-in" patterns can provide a better explanation of urban dynamics, as well as a deeper understanding of land use pattern changes [11]. Using mobile telephone positioning data, researchers can analyze the diurnal rhythms of city life and their spatiotemporal differences [12]. Mobile telephone-based sensor data can be used for detecting dynamic urban activities at different times in different cities (Harbin, Paris, and Tallinn) [13]. Using GPS trajectory data from taxi drivers, Liu et al. (2010) [14] revealed taxi drivers' spatial selection of routes and their operation behaviors. Deville (2014) [15] introduced a new approach using mobile phone data to estimate dynamic population densities in near real time.
From an urban planning perspective, researchers have begun to focus on urban area characterization and dynamic patterns associated with various human activities and behaviors [16][17][18][19]. Liu et al. (2012) [20] used seven days of taxi trajectory data to study the relationship between urban land uses and traffic patterns. Their study shows that human mobility data from smartphones can provide a good estimation of urban land use patterns in a timely fashion, which can help urban planners design better routes for mitigating traffic and improving public services. Frias-Martinez et al. (2014) [21] presented another good case study of land use pattern detection using location-based Twitter data in Manhattan, London, and Madrid, showing that geo-located tweets can constitute a complementary data source for urban planners. However, using single-source data to detect urban land use types may introduce bias, especially for large cities in developing countries. The urban land use types of these fast-growing cities are changing rapidly, and different types of land use are mixed together. Analysis from multiple angles is necessary to accurately infer the urban land use types of these cities.
In this paper, we introduce a comprehensive analysis framework for detecting dynamic urban land use patterns by using multiple crowdsourced information services, including a popular social media service in China (Sina-Weibo) and a commercial Points of Interest (POI) collection (Datatang). By using grid-based aggregation methods to analyze the spatiotemporal patterns of social media messages, our research goal is to discover the unique characteristics of urban land use types, such as residential areas, commercial areas, work areas, and transportation hubs. Spatiotemporal data analysis algorithms and text mining methods were adopted to identify different types of urban land use patterns. We used Sina-Weibo application programming interfaces (APIs) to collect Chinese geo-tagged social media messages in Beijing from January 2014 to July 2014 and downloaded a commercial POI data collection for Beijing from Datatang.com. By applying a clustering algorithm (K-means), different areas were classified based on the variations in the hourly temporal trend patterns of social media message frequency. Then we estimated urban land use patterns using multiple procedures. First, social media message (Sina-Weibo) temporal trend patterns were used to estimate land use types, such as residential, commercial, or business (office) areas. Second, using text mining algorithms, we validated the estimated land use patterns by comparing the popular keywords between different categories of land use areas. Third, the distribution of different categories of POI within each land use category was used to verify the detailed urban activities associated with different land use types.

Mapping Urban Dynamics with Social Media and Social Sensor Data
Social media, as crowdsourced data content providers, have been used in business analytics, knowledge discovery, event detection, and dynamic mapping applications [22]. Some researchers adopted a new term, social sensor data, to indicate individual-level crowdsourced geospatial data, such as check-ins (Foursquare), social media messages (Twitter and Sina-Weibo), and online location-based service reviews (Yelp), which contain rich information about spatial interactions and place semantics in local communities. These social sensor data provide an opportunity to understand the socioeconomic environments of urban areas [23]. Spatio-temporal analysis of social sensor data can provide additional insights into how collective social activity shapes urban systems [24].
Currently, a large number of scholars study urban dynamics by using taxi GPS datasets [25], cell phone data [26], or social media data [27]. Researchers have found that there is a significant association between commuting activity and land use types [28]. Using Twitter messages, Han et al. (2015) [29] proposed a new analytic method to identify spatiotemporal differences in the level of geographical awareness of Twitter users living in each U.S. city. In order to discover the hidden logic of connections between areas of a city, a new kind of pattern called the C-pattern was revealed by analyzing frequently co-occurring changes in population densities [30]. Yuan et al. (2012) [31] adopted a topic-based inference model for discovering areas of different functions in a city using both GPS trajectory datasets and POI datasets. Using cell-phone traffic data, a technique was developed for real-time monitoring of population density in an urban area, which could improve the efficiency of urban systems management and planning [32].

Clustering Algorithms for Land Use
Clustering algorithms can be used to aggregate objects of a collection into groups (classes) based on their similarities [33]. In the field of data mining, there are several representative clustering algorithms, such as density-based spatial clustering of applications with noise (DBSCAN), Expectation-Maximization (EM), and K-means. DBSCAN is a well-known implementation of density-based clustering that divides regions of sufficiently high density into clusters [34]. It requires that the number of objects within a certain area is not less than a given threshold value. The DBSCAN algorithm can effectively deal with noise and with spatial clusters of arbitrary shape. However, if the density of the clustered objects is uneven, the clustering quality is poor. For high-dimensional objects, a suitable definition of density is difficult to select.
On the other hand, the K-means algorithm is appropriate for clustering high-dimensional objects. The K-means algorithm [35] is a typical distance-based clustering algorithm that uses the distance between objects as the similarity measure: the smaller the distance between two objects, the more similar they are. K-means is a fast clustering algorithm that can handle large amounts of data efficiently. The number of clusters (k) is an input parameter of the K-means algorithm and is very important for the quality of the clustering results. The gap statistic approach [36] can be used to identify the appropriate number of clusters k for the targeted objects. In this research, since we needed to cluster 24-dimensional objects (the hourly social media temporal trend over one day, 24 h), we used K-means as the clustering algorithm to identify various types of land use patterns.

Text Mining in Social Media
The advance of computational technology over the past decade has enabled dramatic progress in text mining methods and tools. Text mining is a computational process for understanding the content and meaning of text corpora with rich semantics. Text categorization, an important part of text mining, is the process of determining the category of a text according to its content. Several representative approaches can be used for text categorization, such as Latent Dirichlet Allocation (LDA), Support Vector Machine (SVM), and Deep Learning.
LDA [37] is a document theme generation model, which is an unsupervised machine learning technique to identify the underlying themes in a large-scale document collection. These themes (topics) can be used to classify each document item into different categories.
SVM [38] is a supervised learning model that can be used for text classification. In SVM, points (keywords) in a low-dimensional space are mapped into a high-dimensional space so that they become linearly separable. The linear division principle can then be used to determine classification boundaries for each document (corpus).
Deep Learning refers to an approach that builds a learning neural network to simulate the human brain. Word2vec, provided by Google, is one example of a Deep Learning tool [39]. Using the word2vec (word to vector) tool, words can be converted into vectors according to the relationship between words and their context; the similarity between vectors in the resulting vector space indicates the similarity of text semantics. The word2vec tool can help find similar words in a group of documents that together constitute a topic. In our research, word2vec is adopted to reveal the topics hidden in the social media texts.
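The idea that vector similarity stands in for semantic similarity can be illustrated with a minimal sketch. It assumes cosine similarity as the measure, which is the usual choice for word2vec vectors but is not stated in this paper. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for illustration; real word2vec vectors are learned from a corpus and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(core_word, vectors, top_n=3):
    """Rank every other word in `vectors` by cosine similarity to `core_word`."""
    core = vectors[core_word]
    scored = [
        (word, cosine_similarity(core, vec))
        for word, vec in vectors.items()
        if word != core_word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for illustration only (not taken from the paper's data).
toy_vectors = {
    "airport": [0.9, 0.1, 0.0],
    "flight": [0.8, 0.2, 0.1],
    "dormitory": [0.0, 0.9, 0.3],
}

ranking = most_similar("airport", toy_vectors)
```

In this toy setup "flight" ranks above "dormitory" for the core word "airport", which mirrors how related keywords are gathered around a core vocabulary.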

Data Collection
Sina-Weibo, a Twitter-like microblogging system, is the most popular microblogging service in China, with 176 million monthly active users in 2013. The number of daily active Weibo users can reach 4.6 million, with about 100 million messages posted every day (Sina-Weibo, 2013). As with Twitter's messages, which are called "tweets", Sina-Weibo messages are limited to 140 Chinese characters. Each posted message in Sina-Weibo is called a "weibo" (microblog). Sina-Weibo has functions very similar to those of Twitter, such as retweets (RT), mentions (@), and hashtags (#). However, the major difference between Sina-Weibo and Twitter is the content available in user profiles. Sina-Weibo collects more optional personal information in user profiles, such as gender, user location, birthday, and blood type. Sina-Weibo also provides powerful search application programming interfaces (APIs) that allow third parties to collect and analyze its microblog messages. The Sina-Weibo data used in this study were collected from January 2014 to July 2014 within the city of Beijing. In total, 9.5 million geo-tagged Sina-Weibo messages were collected using Sina-Weibo APIs.
Figure 1 illustrates the average daily (24 h) temporal trend of Sina-Weibo message frequency in Beijing, aggregated at one-hour intervals. There are clear daily fluctuations in the frequency of Sina-Weibo messages. Very few Sina-Weibo messages were generated in the early morning (between 2:00 a.m. and 6:00 a.m.). The lowest frequency occurs between 3:00 a.m. and 4:00 a.m. The highest frequency occurs at 10:00 p.m. Relatively high volumes of Sina-Weibo messages were sent between 8:00 a.m. and 6:00 p.m. Lastly, the frequency decreases dramatically between 9:00 p.m. and 2:00 a.m., when people are going to sleep.

Grid-Based Land Use Segmentation and Aggregated Temporal Trends
In order to estimate the spatial variation of land use patterns, we divided our study area (the central part of Beijing within the red rectangle in Figure 2) into a regular grid of 400 m × 400 m cells. There are two reasons for choosing this rectangle as our study area. First, the majority of the collected geo-tagged Sina-Weibo messages in Beijing were located within the Sixth Ring Road (the ring road near the rectangular box in Figure 2). Second, the area within the Sixth Ring Road is the densely populated urban core of Beijing.
As shown in Figure 2, our study area consists of 18,492 (134 × 138) cells within the rectangular box. These cells are sequentially numbered from 0 to 18,491, from left to right and from bottom to top. After the grid-based land use segmentation, we calculated the daily temporal trends of the Sina-Weibo messages within each cell. Based on the characteristics of the daily temporal trends in each cell (24 one-hour time periods as 24 feature values), we used the K-means clustering algorithm to classify the cells into seven different categories (clusters). Cells with similar temporal trend characteristics were classified in the same cluster (land use category).
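The cell numbering described above (0 to 18,491, left to right and then bottom to top) can be sketched as a simple index calculation. This is a minimal illustration, not the authors' code. It assumes projected coordinates in metres and a hypothetical grid origin at the lower-left corner of the study rectangle. The text gives 134 × 138 but does not say which number is the column count, so treating 134 as columns is an assumption.

```python
CELL_SIZE_M = 400.0  # cell edge length stated in the text
N_COLS = 134         # assumption: the text does not say which of 134 and 138 is the column count
N_ROWS = 138


def cell_index(x_m, y_m, origin_x=0.0, origin_y=0.0):
    """Return the cell number for a point given in projected metres.

    Cells are numbered left to right, then bottom to top, starting at 0.
    Returns None for points outside the grid.
    """
    col = int((x_m - origin_x) // CELL_SIZE_M)
    row = int((y_m - origin_y) // CELL_SIZE_M)
    if not (0 <= col < N_COLS and 0 <= row < N_ROWS):
        return None
    return row * N_COLS + col


first_cell = cell_index(10.0, 10.0)
last_cell = cell_index(N_COLS * CELL_SIZE_M - 1.0, N_ROWS * CELL_SIZE_M - 1.0)
```

Under these assumptions the lower-left cell is number 0 and the upper-right cell is number 18,491, matching the range given in the text.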
Within each cell, we counted the total number of Sina-Weibo messages generated within each time period (one hour). Then, we calculated the proportion of messages generated within the cell during each time period, that is, the number of messages in that period divided by the total number of messages generated within the cell in all time periods. In this way, we built a daily temporal trend of social media messages for each cell. However, some cells have very few Sina-Weibo messages within each time period, which makes it difficult to calculate their daily temporal trends. We therefore removed the cells containing fewer than 100 Sina-Weibo messages in total. As a result, 12,277 cells (out of 18,492) were removed from our study, and only the 6215 cells with high numbers of social media messages were used for our urban land use mapping task.
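The per-cell computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each message has already been reduced to a pair of cell number and posting hour (0 to 23), and the function name `hourly_profiles` is hypothetical. The threshold of 100 messages comes from the text.

```python
from collections import defaultdict

MIN_MESSAGES = 100  # cells with fewer messages are dropped, as stated in the text


def hourly_profiles(records, min_messages=MIN_MESSAGES):
    """Build a 24-value daily profile for every sufficiently active cell.

    records: iterable of (cell_id, hour) pairs, with hour in 0..23.
    Returns {cell_id: [p_0, ..., p_23]}, where p_h is the share of the
    cell's messages posted in hour h (each profile sums to 1).
    """
    counts = defaultdict(lambda: [0] * 24)
    for cell_id, hour in records:
        counts[cell_id][hour] += 1

    profiles = {}
    for cell_id, hourly in counts.items():
        total = sum(hourly)
        if total < min_messages:
            continue  # too few messages for a reliable temporal trend
        profiles[cell_id] = [count / total for count in hourly]
    return profiles


# Invented toy records: cell 1 has 100 messages, cell 2 has only 99.
toy_records = [(1, 22)] * 60 + [(1, 9)] * 40 + [(2, 3)] * 99
toy_profiles = hourly_profiles(toy_records)
```

Normalizing by the cell total means that busy and quiet cells are compared by the shape of their daily rhythm rather than by message volume.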
To analyze the temporal change of social media messages (posting times) in each cell, we divided one day into 24 segments (one hour per segment). In each cell, we calculated the proportion of posts in each time period relative to the total number of posts in one day (24 h). We can therefore observe the temporal pattern of posting activity in each cell over 24 h. In order to group the cells with similar daily temporal trends in Sina-Weibo message frequency, the K-means clustering algorithm was adopted to classify the cells based on the 24 one-hour time periods, treated as 24 feature values. To identify the appropriate number of clusters k, we followed the gap statistic approach. With seven clusters, the separation between different clusters was most recognizable, while the differences among cells within the same cluster remained relatively small. Therefore, k = 7 was selected as the input parameter for the K-means algorithm in this study.
These activity temporal patterns, or the Sina-Weibo temporal patterns, for different clusters are shown in Figure 4.
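The clustering step can be sketched with a plain implementation of Lloyd's algorithm for K-means. This is a minimal illustration, not the authors' implementation, and it does not include the gap statistic used to choose k. The function names and the optional `init_centroids` argument are assumptions added so that the toy run is reproducible. In practice each point would be a cell's 24-value hourly profile and k would be 7.

```python
import random

K = 7  # number of clusters selected in the text via the gap statistic


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init_centroids=None, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    if init_centroids is None:
        rng = random.Random(seed)
        init_centroids = rng.sample(points, k)  # random initial centroids
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centroids, labels


# Invented toy profiles: three evening-peaked cells and three daytime-peaked cells.
evening = [0.0] * 24
evening[22] = 1.0
daytime = [0.0] * 24
daytime[10] = 1.0
toy_points = [evening] * 3 + [daytime] * 3
toy_centroids, toy_labels = kmeans(toy_points, 2, init_centroids=[evening, daytime])
```

The toy run uses k = 2 only to keep the example small. Plain K-means is sensitive to the initial centroids, so a real analysis would normally repeat the run with several initializations.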
Figure 3 shows the geographical distribution of the clustering results. The cells belonging to each cluster are relatively dispersed and intermixed. To further explore the characteristics of each cluster, we used the Fifth Ring Road as a boundary to separate the urban core areas from the surrounding areas of Beijing. We calculated the percentages of cluster cells in each category and compared the inner and outer urban areas. The results are shown in Table 1. We found that the areas corresponding to Cluster 3 and Cluster 5 are mainly concentrated inside the urban core areas. The areas corresponding to Cluster 2 and Cluster 4 are mostly located in the outer urban area.

Analysis of Different Clusters with Associated Land Use Types
By analyzing social media message (Sina-Weibo) temporal trend patterns in different areas, we can estimate different types of urban land use in the corresponding areas. In Figure 4, the social media messages from Cluster 6 are mostly generated from 7:00 p.m. to 12:00 a.m., which indicates that most social activity in these regions happens in the evening. Therefore, we estimated Cluster 6 areas as residential areas. Cluster 7 has a completely different temporal pattern from Cluster 6. Social media messages in Cluster 7 are mainly generated from 8:00 a.m. to 5:00 p.m. Therefore, we estimated Cluster 7 as work areas. Another example is Cluster 4, where many social media messages are generated in the morning (6:00 a.m. to 8:00 a.m.) and in the evening (two peaks). We estimated Cluster 4 as transportation hub related areas, such as airports and train stations.
To verify our estimations for each cluster, we used text mining methods to analyze the key messages within each cluster. Ideally, people in work areas are more likely to discuss work topics or business items in their social media messages. People in commercial areas are more likely to discuss topics about shops or restaurants. In order to find the key topics within hundreds of social media messages in each cluster, we applied word2vec, a Deep Learning-based tool provided by Google, for text mining. When a word is selected as a core vocabulary, related keywords and their correlation coefficients can be obtained by applying word2vec to all Sina-Weibo texts. The collection of these keywords together with the core vocabulary represents a topic. The Sina-Weibo texts were originally analyzed in Chinese. To make the results easier to read for non-Chinese readers, we translated the text mining results from Chinese into English in this paper.
For example, we chose the keyword "Airport Terminal" (note that all keywords have been translated from the original Chinese) as a core vocabulary. Then we obtained the related keywords and their correlation coefficients from word2vec, as shown in Table 2. The correlation coefficient represents the degree of correlation between the keyword and the core vocabulary: the higher the value of the correlation coefficient, the closer the relationship between the keyword and the core vocabulary. With the keywords related to the core vocabulary, a Sina-Weibo message can be evaluated for relevance to the core vocabulary (the core topic). If one of these keywords is found in a Sina-Weibo text, the message is identified as relevant to this core topic. The proportions of messages for each topic in different clusters can then be calculated. As shown in Figure 5, people in the Cluster 4 areas are more likely to mention keywords related to airport terminals. This result matches our previous estimation that Cluster 4 areas are transportation related areas, such as airports and train stations.
Using the same method, we examined several other topics. As shown in Table 3, the keywords in the first column are our core vocabularies, which were selected from the high-frequency vocabulary of social media messages in each cluster, and the words in the second column are the keywords related to the corresponding core vocabulary, as generated by word2vec. We discarded the words whose correlation coefficient was less than 0.6. The results are shown in the next section. Due to space limitations, we only show part of the related keywords associated with each core vocabulary.
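The keyword-based relevance test described above can be sketched as follows. This is a minimal illustration under stated assumptions. The keyword scores and messages are invented, the function names are hypothetical, and relevance is checked by simple substring matching (real Chinese text would typically need word segmentation first). The share is computed within each cluster, which is one possible reading of "the proportions of messages for each topic in different clusters". The 0.6 cutoff comes from the text.

```python
SIMILARITY_THRESHOLD = 0.6  # keywords below this correlation are discarded, per the text


def topic_keywords(core_word, related, threshold=SIMILARITY_THRESHOLD):
    """Return the topic's keyword set: the core word plus related words at or above threshold.

    related: {keyword: correlation coefficient}, e.g. as produced by word2vec.
    """
    kept = {word for word, score in related.items() if score >= threshold}
    kept.add(core_word)
    return kept


def is_relevant(message, keywords):
    """A message is relevant to the topic if it contains at least one keyword."""
    return any(keyword in message for keyword in keywords)


def topic_share_by_cluster(messages, keywords):
    """messages: iterable of (cluster_id, text).

    Returns {cluster_id: share of that cluster's messages relevant to the topic}.
    """
    totals, hits = {}, {}
    for cluster_id, text in messages:
        totals[cluster_id] = totals.get(cluster_id, 0) + 1
        if is_relevant(text, keywords):
            hits[cluster_id] = hits.get(cluster_id, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}


# Invented toy data, for illustration only (not the values in Table 2).
toy_related = {"boarding": 0.8, "flight": 0.7, "weather": 0.3}
toy_keywords = topic_keywords("airport terminal", toy_related)
toy_messages = [
    (4, "waiting for my flight"),
    (4, "boarding now"),
    (4, "lunch time"),
    (6, "good night"),
    (6, "finally home"),
]
toy_shares = topic_share_by_cluster(toy_messages, toy_keywords)
```

In the toy data, two of the three messages in cluster 4 match the topic and none in cluster 6 do, which is the kind of contrast Figure 5 reports for the airport terminal topic.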

Commercial POI Analysis for the Verification of Land Use Types
Points of Interest (POI) are specific point locations, such as commercial shops, post offices, and restaurants, which can be used for location-based services (LBS). Different land use areas may contain different types of POI. For example, commercial areas will have more restaurant and shop POI than residential areas. We used the distribution of different types of POI among the seven land use clusters to verify the estimated land use types.
We collected 17 types of POIs relevant to land use types in Beijing from the Datatang, a Chinese online platform and service provider for Big Data sharing and trading.The collected dataset includes 95,588 POIs.Each record contains seven attribute values; CITYCODE, NAME, ADDRESS, TEL, TYPE, X-coordinates and Y-coordinates.There are 182 types of POI found in the attribute TYPE.Table 4 illustrates the list of 17 types of POIs and their associated total numbers for each land use cluster.We also calculated the proportion of each type of POI in each cluster as shown in Equation (1), where N i represents the number of POI of category i.In Table 4, the proportion of some types of POI is large in each cluster, such as "Company" and "Residential Region".In order to eliminate this discrepancy, for each type of POI, we calculated the proportion of the POI belonging to each cluster for all cases of this type of POI.As shown in Equation ( 2), where P i represents the probability of POI belonging to the cluster i for one type of POI.
Thus, we can compare the relative probability between different types of POI.The results will show in the discussion section.
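Since Equations (1) and (2) are not reproduced in this text, the following sketch shows one plausible reading of the two normalizations: a share of each POI type within a cluster (the Equation (1) reading) and, for a single POI type, its share across clusters (the Equation (2) reading). The counts are invented, and the function names are hypothetical.

```python
# Hypothetical POI counts: counts[poi_type][cluster_id] (invented values).
counts = {
    "Company": {5: 300, 7: 700},
    "Cinema": {5: 30, 7: 10},
}


def share_within_cluster(counts, cluster_id):
    """Equation (1) reading: N_i divided by the sum of N_j over all POI types
    inside one cluster."""
    total = sum(by_cluster.get(cluster_id, 0) for by_cluster in counts.values())
    return {
        poi_type: by_cluster.get(cluster_id, 0) / total
        for poi_type, by_cluster in counts.items()
    }


def share_across_clusters(counts, poi_type):
    """Equation (2) reading: for one POI type, the fraction of its POI that
    falls in each cluster."""
    by_cluster = counts[poi_type]
    total = sum(by_cluster.values())
    return {cluster_id: n / total for cluster_id, n in by_cluster.items()}


within_5 = share_within_cluster(counts, 5)
cinema = share_across_clusters(counts, "Cinema")
company = share_across_clusters(counts, "Company")
print(within_5, cinema, company)
```

In this toy example "Company" dominates Cluster 5 under the first normalization simply because companies are numerous everywhere, while the second normalization shows that cinemas are relatively concentrated in Cluster 5 and companies in Cluster 7. This is the kind of discrepancy the second normalization is meant to remove.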

Residential Areas (Cluster 2 and Cluster 6)
Among the seven land use clusters, as shown in Figure 4, we found that Cluster 2 and Cluster 6 have similar temporal trends. Their social media messages show higher activity from 7:00 p.m. to 12:00 a.m., with a very large activity peak at night at 10:00 p.m. (Cluster 2) and 11:00 p.m. (Cluster 6). The activity peak in Cluster 6 is an hour later than the activity peak of Cluster 2. We estimated that Clusters 2 and 6 are more likely to be associated with residential areas.
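As a small illustration of how such peaks can be read from a 24-hour profile, the sketch below locates the busiest hour in a list of hourly message counts. The two profiles are invented and only mimic the one-hour offset described above; `peak_hour` is a hypothetical helper, not part of the original analysis.

```python
def peak_hour(hourly_counts):
    """Return the hour (0-23) with the largest message count."""
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    return max(range(24), key=lambda hour: hourly_counts[hour])


# Invented profiles: flat background with an evening rise and a night peak.
cluster_2 = [10] * 19 + [30, 40, 50, 90, 60]  # peak at 22:00
cluster_6 = [10] * 19 + [30, 40, 50, 70, 95]  # peak at 23:00

lag = peak_hour(cluster_6) - peak_hour(cluster_2)
print(peak_hour(cluster_2), peak_hour(cluster_6), lag)
```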
Figure 6 shows the word cloud from all the Sina-Weibo text messages generated within the areas corresponding to Cluster 2 and Cluster 6. The word cloud contains many residence-related vocabularies, such as "Good night", "Mood", "Mom", and "Home". In addition, we used text mining methods (word2vec) to compare the probability distribution for several core vocabularies (Property, Dormitory, Library, and Campus) in each land use cluster, as shown in Figure 7. We found that Cluster 2 has a higher probability of keywords related to the topic "Property" (see Table 3 for more related keywords). Cluster 6 has a higher probability of topics related to "Dormitory" and "Campus". Therefore, we estimated that Cluster 6 might be university dormitory areas or campus-related residential regions.
We also analyzed the probability of different types of POI within the areas corresponding to Cluster 2 and Cluster 6. Figure 8 illustrates the POI distribution probability results. The two clusters (2 and 6) have significantly more POIs associated with residential areas, convenience stores, and supermarkets. This result supports our previous estimation that Clusters 2 and 6 correspond to residential areas.

Commercial Areas and Work Areas (Cluster 5 and Cluster 7)
In Figure 4, Cluster 5 shows higher message activity from 8:00 a.m. to 12:00 p.m., and Cluster 7 shows a sharper increase in activity between 6:00 a.m. and 9:00 a.m. Then, the number of messages starts to decline at 4:00 p.m. Considering the temporal patterns of social media messages in Cluster 5 and Cluster 7, we estimated that Cluster 5 is more likely to be associated with commercial areas for dining, shopping, and entertainment, while Cluster 7 is more related to work areas or business office areas.
Figure 10 illustrates the word cloud from all the Sina-Weibo text generated within the areas corresponding to Cluster 5 and Cluster 7. This figure shows that commerce-related vocabularies, such as "Shop", "Restaurants", and "Taste", were often mentioned in areas corresponding to Cluster 5 (Figure 10a). Work-related vocabularies, such as "go home", "company", and "rich", were frequently discussed in Cluster 7 areas.
We also used text mining methods (word2vec) to compare the probability distribution for several core vocabularies (Eating, Bar, Manager, and Boss) in each land use cluster, as shown in Figure 11. We found that Cluster 5 messages are more strongly associated with the two vocabulary topics "Eating" and "Bar". As shown in Figure 12, Cluster 5 has higher proportions of POI for "Casual dining", "Malls", "Sporting goods stores", "Cinema", and "foreign restaurants". On the other hand, Cluster 7 has higher proportions of POI related to work areas, such as "Company" and "Industrial Park". This result is consistent with our previous estimation of land use types for Clusters 5 and 7.
Figure 13 displays the spatial distribution of Cluster 5 cells in Beijing. The cells corresponding to Cluster 5 contain the core business districts of Beijing, such as the Xidan, Wangfujing, and Chaowai business districts. Xidan and Wangfujing are well-known traditional business districts with a long history in Beijing, while Chaowai is representative of Beijing's emerging business districts. This further supports our inference that Cluster 5 cells are more likely to be associated with commercial areas for dining, shopping, and entertainment.

Transportation Hub Areas (Cluster 4)
In our previous discussion (Section 3.3), we already estimated that Cluster 4 is related to transportation hubs, such as airports and train stations. Figure 4 shows that the activity peaks of Cluster 4 are between 7:00 a.m. and 8:00 a.m. and between 8:00 p.m. and 11:00 p.m. The spatial analysis of the Cluster 4 locations revealed that they are located around the airport, railway stations, or subway stations.

Mixed Land Use Areas (Cluster 1 and Cluster 3)
Many cities in China have mixed land use types in urban regions. For example, shopping malls can be built in residential areas, and residents can live in industrial or commercial areas. This type of mixed land use can be found in Cluster 1 and Cluster 3, which we estimated to be mixed land use areas. The characteristics of the temporal patterns for Cluster 1 and Cluster 3 are shown in Figure 4. The probability distribution of the 17 types of POI (Figure 14) also shows that the various types of POI are relatively mixed and evenly distributed in both Cluster 1 and Cluster 3 areas.

Conclusions and Future Works
Classifying urban land use areas is an important task in urban planning and urban design. With the rapid development of urban regions and the quick increase of population in cities, it is very difficult and expensive for urban planners and city design specialists to acquire up-to-date land use maps, especially in many developing countries, such as China. The crowdsourced mapping framework proposed in this paper presents a new perspective for exploring urban land use patterns and displaying up-to-date spatial patterns of urban activity.
Social media data is time-sensitive and can reflect conditions in near real-time. More importantly, it has wide public participation and is cheap and easy to obtain. The spatial-temporal characteristics of social media data can be used to explore up-to-date spatial patterns of urban activity. Furthermore, the hidden topics in social media text data and a commercial collection of POI can be combined to reveal urban land use patterns.
Our approach is to use geo-tagged social media messages (Sina-Weibo) as a crowdsourced map data provider. We utilized the public APIs to collect six months of geo-tagged social media messages in Beijing (9.5 million messages in total). There are five key steps in our crowdsourced analysis and mapping framework. First, a regular grid of 400 m × 400 m was created to divide the urban core of Beijing into 18,492 cells. Second, we calculated the number of social media messages within each cell and their temporal frequency trends. Third, we kept only the cells with a high frequency of social media activity and used K-means to categorize these cells into seven types of land use clusters. Fourth, we applied text mining and word clouds to verify our estimation of the land use type for each cluster. Fifth, we used a commercial collection of POI to examine the relevance of the associated POI types in each land use cluster.
We also found some research challenges in this study. First, our methods only focus on the 2D dimensions of urban land use rather than 3D dimensions. In a big city like Beijing, multiple urban land use types (such as residential and commercial) are mixed within high-rise buildings and concentrated housing areas. Urban land use types are more dynamic and mixed in many Chinese cities. By analyzing the fluctuations and message content in social media over time and space, we may be able to monitor the dynamic and mixed features of urban land use patterns in big Chinese cities. Second, we only use a pre-defined grid system (400 m × 400 m) as our spatial analysis unit for urban land use. The size was adopted by following previous research works. However, we may need to examine the sensitivity of our methods to different grid sizes, such as 800 m × 800 m or 200 m × 200 m, in the future. Third, we only considered the temporal trends of social media messages by combining all messages within a cell. We may need to apply some linguistic methods to classify different types of social media messages first and to remove errors and noise before creating the temporal trend graph. Finally, we only collected the social media data from January 2014 to July 2014 (six months). The temporal trend patterns might change in different seasons or months. If we can collect whole-year datasets, we can compare the dynamic changes of land use patterns between summer and winter in Beijing, and these dynamic changes might provide more useful information for urban planning.
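The first three steps of the framework summarized in this section (gridding, hourly frequency profiles, and K-means clustering of active cells) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: coordinates are assumed to be already projected into metres, the K-means routine is a plain Lloyd's algorithm written with the Python standard library, and the toy data uses two clusters rather than the seven reported in the paper. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict

CELL_SIZE_M = 400    # grid size from the text
MIN_MESSAGES = 100   # cells with more than 100 messages are kept (Figure 3 caption)


def cell_of(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to a (column, row) grid cell."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def hourly_profiles(messages, min_messages=MIN_MESSAGES):
    """Build a normalised 24-bin hourly profile for each sufficiently active cell.

    `messages` is an iterable of (x_m, y_m, hour) tuples.
    """
    counts = defaultdict(lambda: [0] * 24)
    for x_m, y_m, hour in messages:
        counts[cell_of(x_m, y_m)][hour] += 1
    profiles = {}
    for cell, bins in counts.items():
        total = sum(bins)
        if total > min_messages:
            profiles[cell] = [b / total for b in bins]
    return profiles


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def burst(x_m, y_m, hour, n):
    """Generate n identical toy messages at one location and hour."""
    return [(x_m, y_m, hour)] * n


messages = (
    burst(50, 50, 22, 150)                                # night-active cell (0, 0)
    + burst(450, 50, 22, 120) + burst(450, 50, 23, 30)    # night-active cell (1, 0)
    + burst(50, 450, 9, 150)                              # morning-active cell (0, 1)
    + burst(450, 450, 9, 120) + burst(450, 450, 10, 30)   # morning-active cell (1, 1)
    + burst(850, 850, 12, 50)                             # sparse cell, filtered out
)

profiles = hourly_profiles(messages)
cells = sorted(profiles)
labels = kmeans([profiles[c] for c in cells], k=2)
cluster_of = dict(zip(cells, labels))
print(cluster_of)
```

Normalising each profile by the cell total makes the clustering respond to the shape of the daily rhythm rather than to the absolute message volume; whether the original study normalised in this way is not stated here and is an assumption of the sketch.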
This research provides a new way to study urban land use patterns in a city from a multidisciplinary perspective. We combine multiple research methods, including GIS, text mining (Deep Learning), K-means, word clouds, and other visualization tools, to explore the dynamic relationships between the temporal trend patterns of social media messages and the land use patterns in Beijing. Social media data, as a major crowdsourced data source, can be linked and aggregated into multiple map layers and GIS datasets for multiple purposes [40]. The fundamental concept of "map overlay" is applied here to combine, integrate, and cross-reference multiple data sources (including geo-tagged social media, texts, and commercial POI data) and explore their dynamic spatiotemporal patterns in maps. We hope that our new method can be used for other types of applications for urban development in the future, such as hourly population density estimations for disaster responses, site selection for commercial companies, and business potential analytics.

Figure 1. The average daily temporal trend (24 h) of Sina-Weibo message frequency in Beijing (aggregated by one-hour time intervals).

Figure 2. Building a grid-based land use segmentation with 400 m × 400 m grids for the urban core of Beijing.
illustrate the distribution of 6215 high frequency cells (color pixels) in the urban core areas. The cells with high frequency of social media messages may indicate the high population density areas in Beijing.

Figure 3. Geographical distributions of the 400 m × 400 m land use cells with high-frequency social media messages (containing more than 100 messages in each cell). The color of each cell indicates the clustering results (from Cluster 1 to 7) within the urban core of Beijing.

Figure 5. The probability distribution of the airport-terminal-related keywords in different clusters.

Figure 7. The probability distribution of different keyword topics in different clusters. (a) Property; (b) Dormitory; (c) Library; and (d) Campus.

Figure 8. The probability distribution of 17 types of POI in Cluster 2 and Cluster 6.

Figure 9 displays the spatial distribution of Cluster 2 and Cluster 6 cells in Beijing. As shown in Figure 9a, the cells corresponding to Cluster 2 (yellow) are distributed outside the Fifth Ring Road, while the cells corresponding to Cluster 6 (blue) are distributed evenly both inside and outside the Fifth Ring Road. Moreover, the cells corresponding to Cluster 6 are concentrated in the Haidian District (the red circle area), which is a well-known region for major universities and science research institutes in Beijing. Figure 9b,c shows the large-scale spatial distribution of areas 1 and 2 in Figure 9a. As can be seen from the figure, the cells corresponding to Cluster 6 are consistent with the geographical extent of several universities (the black dotted line in the figure, including Renmin University of China, Beijing Institute of Technology, Beijing University of Posts and Telecommunications, and so on), which supports our inference.

Figure 9. The spatial distribution of Cluster 2 and Cluster 6 areas in Beijing. (a) The overall spatial distribution; (b) area 1; (c) area 2.

Figure 11. The probability distribution of different keyword topics in different clusters. (a) Eating; (b) Bar; (c) Manager; (d) Boss.

Figure 12. The probability distribution of 17 types of POI in Cluster 5 and Cluster 7.

Figure 13. The spatial distribution of Cluster 5 areas in Beijing.

Figure 14. The probability distribution of 17 types of POI in Cluster 1 (mixed land use) and Cluster 3 (mixed land use).

Table 1. Comparison of inner and outer urban core areas (divided by the Fifth Ring Road) with different land use clusters in Beijing.

Table 2. The related keywords associated with the core vocabulary "Airport Terminal" and their coefficients (keywords were translated from Chinese to English).

Table 3. Core vocabulary and related words for different topics.
Tired of eating, Too hungry, Vegetable dish, Half full, Change to eat, Too full, Quite full, Satiate, Each meal, Supper, Noodles, Eat less, Bowl, Bun, Rice, etc.

Table 4. The proportion distribution of 17 types of POI in different land use type clusters.