Analyzing Urban Spatial Patterns and Functional Zones Using Sina Weibo POI Data: A Case Study of Beijing

With the development of Web 2.0 and the mobile Internet, urban residents, acting as a new type of “sensor”, provide massive amounts of volunteered geographic information (VGI). Quantifying the spatial patterns of VGI plays an increasingly important role in understanding and developing urban spatial functions. Using VGI and social media activity data, this article develops a method to automatically extract and identify urban spatial patterns and functional zones. The method is presented through the case of Beijing, China, and includes three steps: (1) obtain multi-source urban spatial data, such as Weibo data (Weibo is the Chinese equivalent of Twitter), OpenStreetMap data, and population data; (2) use the hierarchical clustering algorithm, the term frequency-inverse document frequency (TF-IDF) method, and an improved k-means clustering algorithm to identify functional zones; (3) compare the identified results with actual urban land use to verify their accuracy. The experimental results show that the method can effectively identify urban functional zones. The results provide new ideas for the study of urban spatial patterns and are of value for optimizing urban spatial planning.


Introduction
In the era of Web 2.0 and the mobile Internet, people use Weibo (the Chinese equivalent of Twitter), online comments, photo sharing, travel records, and other social media to generate, process, and share large amounts of information [1][2][3]. With the spread of global positioning systems (GPS) and wireless cellular positioning technology in mobile devices, most of the information spontaneously created by users automatically carries spatial information [4]. This kind of spatial information is called volunteered geographic information (VGI) in academia [5]. Because VGI is produced in real time, is diverse, and contains creative content, it has great application potential in spatio-temporal analysis, urban planning, environmental monitoring, disaster warning, and public information services [6][7][8][9]. These massive data are gradually being mined and analyzed, marking the arrival of the era of big data. Goodchild also pointed out that we are rapidly entering an era in which ordinary citizens are both consumers and producers of geographic information [1].
The advent of the big data era has put forward new ideas for the study of urban spatial patterns. Currently, data based on location-based service (LBS) technology are the most widely used data in urban research, such as bus card records, taxi trajectory data, mobile phone call records, and login data based on social media [10][11][12][13][14]. These data can be interpreted as a description of the city, and their mining and analysis can lead to a more people-oriented urban spatial pattern [1,2]. Traditional surveying methods based on visual and statistical data have some limitations in the research process, such as being

Sustainability 2021, 13, x FOR PEER REVIEW 3 of 15 spatial structure. The Results and Discussion section presents the experiments and results, and discusses next steps. Finally, the paper ends with the Conclusions section.

Materials and Methods
In this section, we describe our study area and present the data we collected (including data pre-processing). Then, we show how to conduct hotspot analysis based on these data and use clustering methods to identify urban functional areas. The specific research process is shown in Figure 1.

Study Area
The study area was Beijing, China (115°24′39″-117°30′37″ E, 39°26′9″-41°3′32″ N, as shown in Figure 2). Beijing is the capital of China and a typical megacity. With rapid urbanization, its urban scale has expanded 12 times in 55 years [30]. The total area is 16,410.54 km², and the permanent population is 21.53 million. Considering the complexity of its urban space, its large population (who can act as sensors), and the increasingly prominent "big city diseases" (the problems typical of very large cities), Beijing is an ideal study area.


Sina Weibo POIs and Data Categorization
As one of the most popular social media platforms in China, Sina Weibo is characterized by fast updates, a large number of participants, and widely distributed users [3,9]. Most of the information on Sina Weibo is closely related to urban life. Because the content and types of Weibo POIs are very rich, it is best to determine their categories before acquisition. Considering the research content and the special background of Beijing, we divided Weibo POI data into 15 categories (Table 1, modified according to [31]).

The raw data contained problems such as duplicated records and ambiguous place names, which required further data cleaning. We therefore deleted duplicate records and records that did not fit a specific category.
During categorization, some parks were classified as tourist attractions. This is because Beijing has a large number of tourists, and they often visit parks. The parks were therefore classified as tourist attractions rather than public facilities. The number of companies in Beijing is high (73,224 companies out of 113,206 buildings). Therefore, in the following research, companies were not counted in the building category; instead, the distribution of companies was studied separately. After data cleaning and processing, 51,916 company data points and 115,616 classified data points were finally obtained (company was categorized as 06*). The POI data of each category are shown in Figure 3.

OpenStreetMap and Map Segmentation
We collected the road network data of Beijing on 14 April 2015 from OpenStreetMap (https://www.openstreetmap.org). The road data set included 50,816 roads with a total length of 24,877,717 m, and a railway network with 5121 sections and a total length of 3,699,118 m. Then, following OpenStreetMap's road network classification, we selected three levels as the research objects: highway (motorway_link), trunk (trunk_link), and primary (primary_link). These comprised 9655 lines with a total length of 6,077,240 m, as shown in Figure 4a. These three road types constitute the natural division of Beijing. It can be seen intuitively that the outline of Beijing's road network meets the experimental requirements.
In order to better divide the research area into different zones, we removed unnecessary details and ensured the topological correctness of the roads through multilane merging, two-lane road centerline extraction, overpass deletion, and topological relationship correction. After checking the data, we segmented regions according to the centerlines of the road network (Figure 4b).

Population Data
China's 1 km grid population data set was based on land use type data and demographic data obtained from remote sensing data. The data set was used to establish a population spatial distribution model by using the spatial analysis function of the geographic information system to spatialize the statistical population data [32]. We extracted the population distribution data within the border of Beijing from China's 1 km grid population data set (2010). The generated population density distribution map is shown in Figure 4c.
It can be seen from the above analysis that the population density is highest in central Beijing and gradually decreases moving outward from the urban center. In the suburbs, however, there are small areas with high-density centers, showing obvious characteristics of suburbanization. In addition, the population density in the core areas of suburban counties remains high. Generally speaking, the population density in the east is higher than that in the west, especially toward the southeast and Langfang, where the area of high population density is expanding.

Analysis of Urban Spatial Structure

Analyzing Urban Hot Spots Based on Weibo POI Data
Weibo POI data can better describe the distribution of people in a city through the location information of volunteers. In order to further analyze the distribution characteristics, we selected a large number of checked-in POIs in Weibo for analysis, and used the checkin_num of each POI point as a weight to analyze the kernel density.
The kernel density estimation (KDE) algorithm mainly uses a moving unit (equivalent to a window) to estimate the density of a point or line pattern [33]. Let $x_1, \ldots, x_n$ be an independent and identically distributed sample drawn from a population with density function $f$. To estimate the value of $f$ at a point $x$, the Rosenblatt-Parzen kernel estimator is usually used:
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$$
where $k(\cdot)$ is the kernel function; $h > 0$ is the bandwidth; and $(x - x_i)$ represents the distance from the estimated point to the sample $x_i$. In KDE, the choice of the bandwidth $h$ has a great influence on the calculation result. When $h$ increases, the point density changes more smoothly in space, but the density structure is hidden; when $h$ decreases, the estimated point density changes abruptly and unevenly [34].
In the KDE module of ArcGIS, a default bandwidth (search radius) is generated automatically. The larger the search radius, the smoother the generated density grid and the higher the degree of generalization; conversely, the smaller the value, the more detailed the information displayed in the generated grid. In order to obtain more detailed results, we changed the default search radius to 1500 m and the output cell size of the raster image to 100 m.
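The weighted kernel density step can be illustrated with a minimal Python sketch. It is not the ArcGIS implementation: the two-dimensional quartic kernel, the function names, and the sample POIs are assumptions made for illustration. Only the 1500 m search radius, the 100 m cell size, and the use of checkin_num as the weight come from the text. The sketch returns weight per unit area; dividing by the total weight gives the normalized form of the estimator.

```python
import math

def quartic_kernel_2d(u):
    """2-D quartic kernel (assumed choice); integrates to 1 over the unit disc."""
    return (3.0 / math.pi) * (1.0 - u * u) ** 2 if u < 1.0 else 0.0

def weighted_density(px, py, points, h):
    """Weighted kernel density at (px, py).

    points: iterable of (x, y, weight), e.g. weight = checkin_num.
    h: bandwidth (search radius) in the same units as the coordinates.
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    total = 0.0
    for x, y, w in points:
        u = math.hypot(px - x, py - y) / h  # scaled distance to the sample
        total += w * quartic_kernel_2d(u)
    return total / (h * h)

def density_grid(points, h, cell, xmin, ymin, ncols, nrows):
    """Evaluate the density at the centre of every raster cell."""
    return [
        [weighted_density(xmin + (c + 0.5) * cell,
                          ymin + (r + 0.5) * cell, points, h)
         for c in range(ncols)]
        for r in range(nrows)
    ]

# Hypothetical POIs in projected metres: (x, y, checkin_num)
pois = [(500.0, 500.0, 120), (900.0, 700.0, 30), (4000.0, 4000.0, 5)]
grid = density_grid(pois, h=1500.0, cell=100.0,
                    xmin=0.0, ymin=0.0, ncols=50, nrows=50)
```

Raising `h` in this sketch spreads each POI's weight over a larger disc and smooths the grid, which mirrors the bandwidth trade-off described above.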

Identifying Urban Functional Zones
In this section, we used Sina Weibo POI data to analyze urban functional zones. Cluster analysis is a statistical method for studying classification problems, and it is also an important family of algorithms in data mining. In this research, we mainly used the k-means algorithm and the hierarchical clustering algorithm.
• K-means: For a given data set, let the set of n d-dimensional points be $X = \{x_i\}, i = 1, \ldots, n$; let the set of K clusters be $C = \{c_k\}, k = 1, \ldots, K$; and let $\mu_k$ be the mean of cluster $c_k$. The squared error of cluster $c_k$ is
$$J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$$
The goal of k-means can therefore be understood as finding the partition that minimizes the sum of the squared errors over all K clusters:
$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$$
• Hierarchical clustering algorithm: A hierarchical clustering method constructs and maintains a clustering tree formed by clusters and sub-clusters, according to a given distance measure between clusters, until a certain end condition is met. Hierarchical clustering algorithms are divided into agglomerative (bottom-up) and divisive (top-down) variants, according to the direction of the hierarchical decomposition. Unless stated otherwise, the agglomerative variant is the one discussed in this article.
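Both clustering methods can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names, the average-linkage criterion, and the toy points are hypothetical. The k-means function takes its initial centers explicitly, so that a seeded start (for example, centers taken from a hot spot analysis) can be supplied, and it returns the summed squared error J(C).

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm started from the given centers.

    Returns (labels, centers, J), where J is the summed squared error.
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_point(members)
    J = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, J

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage (an assumed linkage).

    Returns a list of clusters, each a list of point indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(squared_distance(points[i], points[j]) ** 0.5
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D feature vectors and seed centers
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, J = kmeans(points, initial_centers=[(0, 0), (10, 10)])
tree_clusters = agglomerative(points, n_clusters=2)
```

On this toy input both methods recover the same two groups. On real data the k-means result depends on the seeds, which is why the choice of initial centers matters.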
TF-IDF (term frequency-inverse document frequency) is a statistical method used to evaluate the importance of a word to one document in a document set or corpus [35]. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency of its appearance across the corpus. TF-IDF is the product TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. In a given document, TF is the frequency of a given word in that document:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of word $t_i$ in file $d_j$, and the denominator is the sum of the number of occurrences of all words in file $d_j$.
IDF is used to measure the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:
$$idf_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}$$
where $|D|$ is the total number of documents in the corpus, and $|\{ j : t_i \in d_j \}|$ is the number of documents containing word $t_i$. Then, according to $tfidf_{i,j} = tf_{i,j} \times idf_i$, a word with a high frequency in a particular file and a low document frequency across the entire file set receives a high TF-IDF weight.
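When TF-IDF is applied to urban functions, one natural reading is to treat each segmented region as a document and each POI category as a word. The following minimal Python sketch follows that reading. The zone identifiers, category names, counts, and the use of the natural logarithm are assumptions for illustration; the text does not specify the logarithm base.

```python
import math

def tf_idf(counts):
    """counts: {zone_id: {category: n_ij}} -> {zone_id: {category: weight}}."""
    n_docs = len(counts)
    # Document frequency: number of zones in which a category occurs.
    df = {}
    for cats in counts.values():
        for cat, n in cats.items():
            if n > 0:
                df[cat] = df.get(cat, 0) + 1
    weights = {}
    for zone, cats in counts.items():
        total = sum(cats.values())  # denominator of tf: all POIs in the zone
        weights[zone] = {
            cat: (n / total) * math.log(n_docs / df[cat])
            for cat, n in cats.items() if n > 0
        }
    return weights

# Hypothetical POI counts per zone
zone_counts = {
    "A": {"restaurant": 8, "embassy": 2},
    "B": {"restaurant": 5, "school": 5},
    "C": {"restaurant": 10},
}
weights = tf_idf(zone_counts)
```

In this example, restaurants occur in every zone and therefore receive a weight of zero, whereas the embassy category, present only in zone A, receives a positive weight there. This is the behaviour that lets rare but characteristic categories indicate a zone's function.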

Weibo Hot Spots Analysis Results
We selected a large number of Weibo POIs (covering every category) for analysis and performed kernel density analysis using the checkin_num of each POI point as the weight. The results are as follows.
It can be seen from Figure 5 that Weibo users are widely distributed across Beijing. In the urban area, activity is mainly concentrated in science and education areas, commercial and entertainment areas, and diplomatic and political areas. The science and education areas contain a large number of universities, and college students are an active group of Weibo users. At the same time, office workers also like to use Weibo when commuting. In diplomatic and political areas, hot spots are mainly concentrated in tourist attractions of political significance, with Tiananmen Square as the center. In commercial and entertainment areas, people mainly use Weibo to share information during leisure or entertainment activities. In addition to the urban area, the Capital International Airport District and Changping District are also hot spots for Weibo user activity. In Changping District, the campuses of some colleges and universities are relatively concentrated, and it is also the area where the Great Wall (Badaling Great Wall) is located, which is a hot spot for Weibo check-ins. The results of the kernel density analysis can be interpreted further using the underlying data. We selected the POI points with high check-in numbers for display, as shown in Table 2.

Identifying Urban Functional Zones
For the 15 categories of POI data we classified, we first used the spatial join tool in ArcGIS to count the POI points in each segmented region. The hot spot analysis tool was then used to detect cluster centers. We selected eight typical categories of POI data to determine the cluster centers from the distribution of hot spots obtained in ArcGIS (as shown in Figure 6).
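The counting step performed with the ArcGIS tool can be approximated outside ArcGIS by a point-in-polygon test. The sketch below is a hypothetical stand-in, not the tool's algorithm: it uses ray casting, assumes non-overlapping regions given as simple polygons, and the region identifiers, coordinates, and categories are invented for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_pois_by_region(pois, regions):
    """pois: list of (x, y, category); regions: {region_id: polygon}.

    Returns {region_id: {category: count}}.
    """
    counts = {rid: {} for rid in regions}
    for x, y, cat in pois:
        for rid, poly in regions.items():
            if point_in_polygon(x, y, poly):
                counts[rid][cat] = counts[rid].get(cat, 0) + 1
                break  # regions are assumed not to overlap
    return counts

# Hypothetical road-bounded regions and POIs
regions = {
    "R1": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "R2": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
pois = [(2, 3, "restaurant"), (5, 5, "restaurant"),
        (12, 4, "school"), (30, 30, "school")]
counts = count_pois_by_region(pois, regions)
```

The resulting per-region category counts are the kind of table that the clustering and TF-IDF steps take as input. POIs that fall outside every region, such as the last one here, are simply not counted.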

In the experiments, we mainly used three methods: the hierarchical clustering method (Figure 7a), the TF-IDF method (Figure 7b), and the improved k-means clustering method (Figure 7c). The improved k-means method takes the aforementioned hot spot analysis results as the initial cluster centers, with the expectation of a better clustering result. The TF-IDF method treats the identification of urban functions as text-topic discovery, and the resulting urban function similarity is then explored further using a plain k-means method. We analyzed and compared these clustering results.
By counting the POI data of each functional zone, we ranked the number of POIs of each category within the functional zone, as shown in Table 3. We then comprehensively analyzed the three clustering results and the statistical data, and finally determined eight categories of functional zones: diplomatic and political centers, science and education areas, mature residential areas, new residential areas, commercial and entertainment areas, tourist attraction areas, areas to be developed, and unclassified areas.

• Diplomatic and political zone
In these areas, a large number of embassies are gathered, and the numbers of POIs for tourist attractions, sports and entertainment, and buildings are large. Because Beijing is the capital, this area is not only the gathering place of embassies, but also the location of Tiananmen Square, the Forbidden City, the Great Hall of the People, and similar sites.

• Science and education zone
In these areas, the number of POIs for science, education, culture, and publicity is the highest, and the location of the area shows that it contains a large number of universities, such as Peking University and Tsinghua University. At the same time, Zhongguancun, China's earliest high-tech development center, brings a large number of high-tech companies and scientific research institutes to these areas. Therefore, there are a large number of buildings and companies in these areas.

• Mature residential zone
In these areas, the number of residential POIs is the largest, and the numbers of POIs for restaurants, public facilities, shopping centers, finance and insurance, tourist attractions, sports and entertainment, and healthcare are also the highest. It can be seen that mature residential areas have the most complete range of service facilities. They are distributed around the core functional areas of the city. At the same time, very few areas in the suburbs have developed into mature residential areas.

• New residential zone
As in the mature residential zones, residential POIs are the largest category in new residential zones, but the numbers of most other categories are lower than in mature residential zones. Highway services, industrial sites, public transportation, and government agencies have the highest numbers of POIs here. This is because this zone is composed of many sub-regions with a large number of government agencies. In addition, it is located in the suburbs and has more highway services and public transportation.

• Commercial and entertainment zone
This functional zone is located near the diplomatic and political zone and next to the mature residential zone, which reflects people's shopping habits. The numbers of POIs across categories in this zone are balanced, and no single category has a high count, which is mainly due to the small number of sub-regions.

• Tourist attractions zone
In this functional zone, there are many public transportation POIs. The distribution of sub-regions shows that the functional zone is mostly located in the suburbs, but the number of tourist attraction POIs is not particularly large.

• Area to be developed
In these areas, the checkin_num values of all types are small, but the number of public transportation and industrial sites is large. At the same time, the distribution shows that these areas are adjacent to new residential areas and located in remote counties.

• Unclassified area
Since Weibo POI check-in data are essentially volunteered geographic information, the number of POIs in some areas is too low for them to be classified.

Verifying the Results
For the results of functional zones obtained by clustering, we evaluated them using the following three measures.
Firstly, we compared the clustering results with the Beijing City Master Plan (2004-2020, shown in Figure 8a), mainly comparing the downtown area. It can be clearly seen from the planning map that area A is land for commercial and financial use, and area B is land for science, teaching, and research, which is in full agreement with the results of this article. In addition, the planning map shows the downtown area as a residential area, which is not inconsistent with functional zones such as the diplomatic and political area that we derived, because residential areas have dominant functions in cities, and functions are established by human activities in the residential environment.
Secondly, we compared the clustering results with the initial cluster centers of k-means (shown in Figure 8b). Because the clustering result of the k-means method itself depends on the initial cluster centers, we started from the clustering method for comparative verification. The eight initial cluster centers selected here fell within the corresponding local areas, which indicates that the selection of the initial cluster centers was effective and that the results are reliable.
Finally, we selected some typical areas to verify the results (outer areas were not selected for comparison because they were mainly unclassified areas and areas to be developed). Among several typical areas selected at random, the results (shown in Figure 8c) show that Xiangshan Park is in a tourist attraction zone; Yongle District is located in a mature residential area; and Peking University and the Zhongguancun campus of the Chinese Academy of Sciences are located in the science, education, and cultural district. The French Embassy and Tiananmen Square are located in the diplomatic and political center. Sanlitun Bar Street is in the commercial and entertainment district.
Combining the above three verification methods, and considering Beijing's highly mixed land use, it can be seen that the functional zoning of Beijing obtained by this method had high accuracy.

Discussion
The significance of this work is the development of a method to automatically identify detailed spatial functional zones. Beyond the easily distinguishable zones (diplomatic and political, science and education, commercial and entertainment), the fine distinction between mature and new residential zones and the delineation of areas to be developed are of greater importance. Taking mature residential zones as a reference, the new residential zones need more effort in promoting service-related facilities, including shopping, finance and insurance, sports and entertainment, and healthcare facilities. The areas to be developed are distinguished by highly ranked public transport and industrial sites, and they are also next to the new residential zones in the suburban areas. This kind of functional zone has great potential, and places near the center zones would enjoy better development if their infrastructure were gradually improved.
We then combined the characteristics of the study area and the research results for further analysis.
Firstly, the central city of Beijing is showing a trend of suburbanization, and the spatial distribution presents a three-level structure of main center, sub-center, and town. However, despite the significant increase in population density in the suburbs, the construction of various kinds of infrastructure in these areas is not yet complete, and the level of urbanization needs to be improved.
Secondly, Beijing is developing towards the southeast. It can be seen that the connection area between Beijing and Tianjin (Langfang, located southeast of Beijing) has a large population and a high spatial distribution density. At the same time, the distribution of Weibo POIs also shows a high density in the southeast direction.
Finally, diplomacy and politics; business and entertainment; and science, education, and culture are the main service functions of the major urban areas. Mature residential areas are located near the city center. In the suburbs and counties of Beijing, there are new residential areas and areas to be developed. Commercial and entertainment areas are less common in suburban counties.
With the process of urbanization, the built-up area of Beijing has become larger and larger, and more and more people live in the suburbs. On the one hand, suburbanization has eased the pressure on population, traffic, and housing in the main urban area, but at the same time many new problems have emerged. For example, many people live in the suburbs but work in the city center, which requires long commutes. On the other hand, the results of the analysis show that the infrastructure in the suburbs is still inadequate, so that needs such as children's schooling and medical care cannot be well addressed.
What should be done? In the process of ensuring the stable development of central urban areas, the development of emerging urban areas should also be balanced and attention should be paid to the equal distribution of resources, such as education, healthcare, and other supporting facilities. In addition, while optimizing the internal structure of the city, it is necessary to integrate Beijing's overall resources for external development, actively drive the surrounding areas, and strive to achieve the coordinated development of the Beijing-Tianjin-Hebei metropolitan area.

Conclusions
Currently, urban residents provide massive VGI, and understanding of the urban spatial pattern plays an increasingly important role in promoting urban spatial development. Using VGI and social media activity data, this article developed a method to automatically extract and identify urban spatial patterns and functional zones. We obtained a total of 167,532 Weibo POI data points in Beijing from 13 April to 17 April 2015, OpenStreetMap road network data on 14 April 2015, and China's 1 km grid population data set. Then, we used the hierarchical clustering algorithm, TF-IDF method, and improved k-means clustering algorithms and identified eight functional zones. The functional zones included the diplomatic and political zone, science and education zone, mature residential zone, new residential zone, commercial and entertainment zone, tourist attractions zone, areas to be developed, and unclassified areas. Finally, we verified the results of the study with the Beijing city master plan and typical areas, and the comparison shows that the clustering results had high accuracy.
The contributions of this work lie in three aspects. Firstly, the feasibility of using user-generated social media data to investigate urban spatial structures was verified. This kind of VGI data is large in volume, easily obtainable, more time-saving, and more people-oriented than traditional datasets, and its application in delineating city functional zones could provide more detailed information. Secondly, by dividing research units using the road network, we obtained the natural areas in Beijing. This street map segmentation method was more consistent with urban function division and was more effective in depicting city heterogeneities than a uniform urban grid. Lastly, the urban functional zones automatically identified from social media data provided more information than generally defined residential or employment areas. The advantage of mature residential zones over new residential zones provides useful information for the future planning of the newly developed areas and areas to be developed, so that sustainable development might be utilized for the creation of well-developed center zones. In general, the use of Weibo POI data and OpenStreetMap road network data, combined with spatial clustering methods, to analyze the urban spatial structure and explore functional areas provides new ideas for the study of urban spatial structure.