A Method for Identifying Urban Functional Zones Based on Landscape Types and Human Activities

Abstract: The effects of land use and socioeconomic changes on urban landscape patterns and functional zones have been increasingly investigated around the world; however, our knowledge of these effects is still inadequate for sustainably managing urban ecosystems. The urban functional zone (UFZ) refers to a kind of regional space that provides specific functions for human activities and reflects the land use type in a city. UFZs are important for urban planning and for exploring urban texture dynamics. They improve understanding of sustainable development for urban ecosystems with extreme environments and unique social backgrounds. However, the identification methods for UFZs are incomplete because of a lack of socioeconomic attributes, as well as their hierarchical relations. Here, we present a hierarchical weighted clustering model to identify UFZs based on the entropy weight method. The data included points of interest (POIs), land use type data, road network data, socioeconomic data, and population density. We found that the adjusted cosine metric and the average criterion were the optimal distance metric and linkage strategy, respectively, to cluster urban zone data. The performance with weighted data was better than that with raw data, and the level of the POI classification scheme and the landscape pattern affected the accuracy of the identification of UFZs. The research indicated that the hierarchical weighted clustering model was a useful method to classify UFZs in order to improve urban planning and environmental management schemes.


Introduction
Efforts to make society and its processes more livable necessitate sustainability, which has long been regarded as one of the most significant policy objectives in the world [1]. The spatial patterns of buildings or functional zones affect the urban heat island (UHI) and the sustainability of a city [2]. Traditional techniques for urban functional zone identification, however, are unable to meet the objectives of sustainable urban development because of a lack of hierarchical relations [3]. The "functional zone" is a concept that describes the social and economic properties that satisfy various needs and accommodate diverse human activities in a certain area [4,5]. Urban functional zones share common social and economic activities and are spatially aggregated by diverse geographic objects and semantically abstracted from land uses [3,6]. Urban populations are increasing, with the number of mega-sized cities expected to increase from 10 in 1990 to 41 in 2030 [7]. According to United Nations reports, urbanization has been rapid in recent decades, and 68% of the world's population is projected to live in urban areas by 2050 [8]. Population density is growing, and urban areas are expanding along with intensive urban growth [9]. Urban areas directly consume land as their physical footprints expand, resulting in landscape and urban function transformation [10-13]. Strong spatial clustering patterns can also be seen in urban socioeconomic activity [14]. These clustering patterns lead to the generation type, population density, and composition of POIs (i.e., human activities), as well as their hierarchical relations. An experiment within the Fifth Ring Road, Beijing, China, was conducted to validate the proposed framework.
We aimed to: (1) map UFZs using the presented framework; (2) test the performance of the data combinations, data weighting methods, clustering strategies, and metric indicators; and (3) analyze the spatial pattern of functional zones by using an example in Beijing, China.

Study Area
Beijing is the capital of China, the world's most populous country. Beijing has become one of the world's fastest expanding cities in recent decades as a result of rapid industrialization and urbanization. The central Beijing area, which is surrounded by multiple ring roads, is made up of several concentric belts of infrastructure and functional zones. The Fifth Ring Road area is the core of the downtown Beijing district, covering 667 km². It is affected by intensified human activities and has a variety of functional zones, such as educational zones, public zones, recreation areas, and business districts. The study area offers a significant diversity of human activities with distinct urban functional zones. The study area was divided into 336 sub-regions (Figure 1), each with a minimum area of 150,000 m². Each segmented zone can be represented by an eigenvector consisting of the amounts or relative amounts of its characteristics. Each segmented region is relatively homogeneous in terms of socio-economic function [43].

Data Sources
Multi-source data were used in this study, including POIs, land use type data, road network data, socioeconomic data, and population density. The POIs were obtained through AMap™ (https://www.amap.com, accessed 1 September 2018), a web-mapping, navigation, and LBS provider. A total of 572,169 POIs were retrieved in September 2018. Recreation, Catering, Automotive Services, Financial, Education, Public, Health Care Services, Hospitality, Residence, Organizations, and Travel are among the 23 categories of POI data. Of these, 20 are stable categories, while the other 3 are real-time incidents, such as traffic accidents and road maintenance incidents. Each POI has six column properties: Name, Coordinates, and Categories in three hierarchy levels (composed of primary, secondary, and third-level classes, otherwise called level 1 (L1), level 2 (L2), and level 3 (L3), respectively) (Figure 2). For example, the level 1 class Education Service, which includes college, middle school, elementary school, and kindergarten, can distinguish between the functional properties of the sub-regions. As a result, a data processing framework is needed to compute the category weights at each of the three levels. POIs were used to represent human activities and their hierarchical relations. The POIs were divided into 20 primary classes (Table 1), 264 secondary classes, and 868 third-level classes. The land use data [49], with a spatial resolution of 10 m, were obtained from the Department of Earth System Science/Institute for Global Change Studies, Tsinghua University (http://data.ess.tsinghua.edu.cn/, accessed 1 January 2020). The land-use composition was described by the proportions of urban areas, urban green land, farmland, and woodland. The proportion of various land uses was used to describe land-use heterogeneity.
The urban road network data come from the OpenStreetMap (OSM) geographic data platform (https://www.openstreetmap.org/, accessed 1 January 2020). Redundant and broken paths were removed so that the network could delineate the functional units of the study area. The population data came from the WorldPop products of 2017, with a spatial resolution of 1 km × 1 km (https://www.worldpop.org/, accessed 1 January 2020). Statistical socioeconomic data (i.e., population, GDP) for 2017 were obtained from the National Bureau of Statistics.

The Framework for Identifying UFZs
The segmented regions within the same cluster have similar characteristic vectors that include the proportion of POIs, land use type, and socio-economic data. The similarity can be gauged by the distance between two segmented regions. The regions have a high degree of resemblance if the distance is small, and we can expect them to act similarly in terms of urban functions. The larger the distance, the smaller the similarity, indicating that the regions diverge significantly. The characteristic vector of each segmented region can be defined as:

$$R_i = \left[C_{i,1}, C_{i,2}, \ldots, C_{i,n}, Lu_j, Pop, GDP\right]$$

Here, $R_i$ is the segmented patch $i$, and $C_{i,n}$ is the amount of one type in a POI classification scheme at the same level within $R_i$. The values n = 20, n = 264, and n = 868 correspond to clustering with POIs at L1, L2, and L3, respectively. $Lu_j$ is the proportion of land use type $j$, $Pop$ is the population density, and $GDP$ is the per capita GDP.
As illustrated in Figure 3, the study area was initially segmented into research units by the road network data. The eigenvectors of the segmented research units were then composed from various data combinations, using POI data at various levels either alone or combined with land use type data, population density, and GDP data. Next, the Shannon entropy was used to calculate the weights of POI classes at various levels. Finally, the clustering results obtained with different similarity metrics and clustering strategies were compared. The experiments covered two data processing datasets, three levels of POI classification schemes, six clustering merging strategies with a vector matrix of hierarchical weighted counts of POIs within each region, and four similarity distance measures.

Hierarchical Weighted Clustering Model
Hierarchical agglomerative clustering algorithms represent a popular unsupervised learning technique that seeks to build a hierarchy of clusters and to discover the natural groups of a set of observations. Clustering is the process of grouping samples so that samples in the same group are as similar as possible, while samples in other groups are as distinct as possible.
Because the actual functional labels of the sub-regions are unknown, we tested different distance measures (Euclidean distance, cosine distance, adjusted cosine distance, and Pearson correlation distance) to categorize functional zones. The Euclidean distance between two attribute vectors is sensitive to the magnitude of the POI counts, but not to the percentages of different features. The cosine distance between two attribute vectors is sensitive to the percentages of different features, but not to the magnitude of the POI counts. The adjusted cosine distance is the cosine distance between two attribute vectors that have been preprocessed by subtracting the mean value. The Pearson correlation distance method computes the Pearson correlation distance between two attribute vectors.
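The four distance measures can be illustrated with a minimal pure-Python sketch. The function names and the three toy region vectors are illustrative assumptions rather than the study's code. In particular, the sketch assumes that the adjusted cosine distance subtracts the per-feature mean taken across all regions, which is one common reading of "subtracting the mean value".

```python
import math

def euclidean(u, v):
    """Euclidean distance; grows with the magnitude of the counts."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pearson_distance(u, v):
    """1 - Pearson correlation: cosine distance after centering each vector on its own mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])

def adjusted_cosine_matrix(rows):
    """Pairwise cosine distances after subtracting each feature's mean over all regions
    (assumed interpretation of the adjusted cosine distance)."""
    n = len(rows)
    col_means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, col_means)] for row in rows]
    return [[cosine_distance(centered[i], centered[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical POI counts for three regions and three POI classes.
regions = [
    [10, 2, 1],   # region A
    [20, 4, 2],   # region B: same proportions as A, twice the counts
    [1, 8, 9],    # region C: different composition
]

d_euclid = euclidean(regions[0], regions[1])
d_cosine = cosine_distance(regions[0], regions[1])
d_pearson = pearson_distance(regions[0], regions[1])
d_adjusted = adjusted_cosine_matrix(regions)
```

Regions A and B have identical proportions, so their cosine and Pearson distances are essentially zero while their Euclidean distance is not. The adjusted cosine distance between them is nonzero because both are compared relative to the average region.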

Weighting Coefficients and Eigenmatrix Construction
The urban functional zone is influenced by the count or weighted count of POIs at each level, which characterizes the intensity of human activity. We propose integrating the entropy weight method, based on Shannon entropy theory, with the hierarchical agglomerative clustering method to balance discrepancies between different sub-regions. This contributes to the identification of UFZs and allows the clustering results on weighted data to be compared with those on raw data.
The weighting coefficients for different POI types were calculated using the Shannon entropy approach. Shannon entropy is a probability theory-based notion that was developed as a measure of information uncertainty. Since the concept of entropy is well adapted to measuring the relative intensities of contrast criteria [50], it can be used to represent the average intrinsic information transmitted for decision-making. It is therefore a practical way to calculate the weights of POI types at different levels. On each subtree of the POI classification scheme, we apply the entropy weight method as follows.
Step 1: Standardization of data. Because the metrics are not on a uniform scale, the data must be standardized by min-max scaling:

$$x'_{Lij} = \frac{x_{Lij} - \min\left(x_{Lij}\right)}{\max\left(x_{Lij}\right) - \min\left(x_{Lij}\right)}$$

where $x'_{Lij}$ is the standardized count of POI type $j$ within region $i$ on a specific scale $L$, $x_{Lij}$ is the corresponding count, and $\min(x_{Lij})$ and $\max(x_{Lij})$ are the minimum and maximum values of POI type $j$ at that level. After this operation, the values lie in the range 0 to 1.
Step 2: Calculating the entropy of information. The entropy of information is a crucial factor in determining the weight of an evaluation metric. In the entropy weight method, a lower information entropy indicates greater variation of the metric across regions and therefore a larger weight. In its standard form, the entropy of information is calculated as:

$$E_{Lj} = -\frac{1}{\ln n}\sum_{i=1}^{n} P_{Lij}\ln P_{Lij}, \qquad P_{Lij} = \frac{x'_{Lij}}{\sum_{i=1}^{n} x'_{Lij}}$$

where $E_{Lj}$ is the entropy of information of POI type $j$ on a specific scale $L$, $P_{Lij}$ is the proportion of POI type $j$ within region $i$ on that scale (with $P_{Lij}\ln P_{Lij}$ taken as 0 when $P_{Lij} = 0$), $n$ is the number of data items (regions) on that scale, and $x'_{Lij}$ is the standardized data.
Step 3: Calculation of the weight. After the entropy of information is calculated, the weight of each metric is determined using the theory of entropy, which indicates the importance of the metric in the evaluation system. In the standard form of the entropy weight method, with $k$ POI types on scale $L$, the weight is:

$$W_{Lj} = \frac{1 - E_{Lj}}{\sum_{j=1}^{k}\left(1 - E_{Lj}\right)}$$

Step 4: Calculation of the weighted value. Given the weight, the weighted value, i.e., a comprehensive score of type $j$ of region $i$ on a specific scale $L$, is obtained as:

$$Z_{Lij} = W_{Lj}\, x'_{Lij}$$

This gives the weighted count of region $i$ at level $L$.
Step 5: The above steps are repeated with the data on the other subtrees of the POI classification scheme and on the next higher level. The weighted eigenvector matrix on a specific scale can then be obtained.
Step 6: The pairwise similarity distances are calculated for each pair of nodes, reflecting their degree of dissimilarity.
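Steps 1 to 4 can be sketched in pure Python under the standard formulation of the entropy weight method, which is assumed here because the text does not spell out every formula. The function name `entropy_weights`, the handling of constant columns, and the toy count matrix are illustrative assumptions. The sketch treats one flat matrix of counts (regions as rows, POI types as columns) rather than the per-subtree procedure described above.

```python
import math

def entropy_weights(counts):
    """Return (weights, weighted_matrix) for a regions x POI-types count matrix.

    Assumes the standard entropy weight method: min-max scaling, normalized
    Shannon entropy per column, and weights proportional to (1 - entropy).
    """
    n = len(counts)
    if n < 2:
        raise ValueError("at least two regions are needed")
    columns = list(zip(*counts))

    # Step 1: min-max standardization of each POI type (column).
    std_cols = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        std_cols.append([(x - lo) / span if span else 0.0 for x in col])

    # Step 2: normalized Shannon entropy of each column.
    entropies = []
    for col in std_cols:
        total = sum(col)
        if total == 0:
            # Constant column: carries no information (assumed convention).
            entropies.append(1.0)
            continue
        e = 0.0
        for x in col:
            p = x / total
            if p > 0:
                e -= p * math.log(p)
        entropies.append(e / math.log(n))

    # Step 3: weights from the degree of diversification (1 - entropy).
    diversity = [1.0 - e for e in entropies]
    total_div = sum(diversity)
    if total_div == 0:
        weights = [1.0 / len(columns)] * len(columns)
    else:
        weights = [d / total_div for d in diversity]

    # Step 4: weighted standardized values per region.
    weighted = [[w * col[i] for w, col in zip(weights, std_cols)]
                for i in range(n)]
    return weights, weighted

# Hypothetical counts: 4 regions x 2 POI types.
toy_counts = [
    [0, 10],
    [5, 12],
    [10, 11],
    [0, 13],
]
weights, weighted = entropy_weights(toy_counts)
```

In this toy example the first POI type is concentrated in fewer regions, so it has lower entropy and receives the larger weight.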
After the pairwise distance matrix is constructed, the HAC algorithm repeatedly identifies the smallest distance in the matrix and merges the corresponding nodes into a linkage tree. Updating the pairwise distance matrix after each merge is a crucial step, and hierarchical agglomerative clustering can be accomplished using various algorithms [51]. To measure the distance between a newly formed cluster and the remaining objects, we used five linkage strategies: single, average, Ward, centroid, and complete.
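The merge loop can be illustrated with a deliberately naive pure-Python sketch of average linkage, one of the five strategies listed. The function name `average_linkage` and the one-dimensional toy data are illustrative assumptions. The sketch recomputes average distances from the original matrix instead of updating the matrix incrementally, which keeps it short but is slower than a production implementation.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage.

    dist is a symmetric n x n distance matrix (list of lists).
    Returns one cluster label per observation.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Mean of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge the closest pair; q > p, so deleting q keeps p valid.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical one-dimensional feature values for five regions.
points = [0, 1, 10, 11, 50]
dist = [[abs(a - b) for b in points] for a in points]
labels = average_linkage(dist, n_clusters=3)
print(labels)  # [0, 0, 1, 1, 2]
```

For real data, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method='average'` would normally be preferred over this sketch.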

Evaluation of Clustering Performance
The most common approaches for assessing the quality of clustering results are cophenetic correlation and some internal indices [52]. The cophenetic correlation coefficient compares (correlates) the actual pairwise distances of all samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances. Suppose that the original dataset $x_i$ is modeled using a cluster method to produce a dendrogram set $t_i$; the cophenetic correlation coefficient can then be denoted as [53]:

$$c = \frac{\sum_{i<j}\left(x(i,j) - \bar{x}\right)\left(t(i,j) - \bar{t}\right)}{\sqrt{\sum_{i<j}\left(x(i,j) - \bar{x}\right)^{2}\,\sum_{i<j}\left(t(i,j) - \bar{t}\right)^{2}}}$$

where $x(i,j) = |x_i - x_j|$ is the ordinary Euclidean distance between the $i$th and $j$th observations, $t(i,j)$ is the dendrogrammatic distance between the model points $t_i$ and $t_j$, and $\bar{x}$ and $\bar{t}$ are the averages of $x(i,j)$ and $t(i,j)$, respectively. The dendrogrammatic distance is the height of the node at which the two points are first joined together. We used the cophenetic correlation coefficient to evaluate the performance of all distance metrics. Then, we evaluated the quality of the clustering results by the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index [54].
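As a small worked illustration, the cophenetic correlation coefficient is the Pearson correlation between the original pairwise distances and the cophenetic (dendrogram) distances. The sketch below uses a hypothetical three-point example whose cophenetic distances were derived by hand under average linkage; the function name and the data are assumptions for illustration only.

```python
import math

def cophenetic_correlation(original, cophenetic):
    """Pearson correlation between two equally ordered lists of pairwise distances."""
    n = len(original)
    mean_x = sum(original) / n
    mean_t = sum(cophenetic) / n
    num = sum((x - mean_x) * (t - mean_t) for x, t in zip(original, cophenetic))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in original)
                    * sum((t - mean_t) ** 2 for t in cophenetic))
    return num / den

# Three observations located at 0, 1, and 5 on a line.
# Pair order: (0, 1), (0, 2), (1, 2).
original = [1.0, 5.0, 4.0]

# Average linkage merges points 0 and 1 at height 1, then joins point 2
# at height (5 + 4) / 2 = 4.5, so both pairs involving point 2 get 4.5.
cophenetic = [1.0, 4.5, 4.5]

c = cophenetic_correlation(original, cophenetic)
print(round(c, 4))  # 0.9707
```

With SciPy available, `scipy.cluster.hierarchy.cophenet` computes the same quantity directly from a linkage matrix.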

The Best Cluster Model Parameters and Strategies
The cophenetic correlation for a cluster tree is defined as the linear correlation coefficient between the cophenetic distances obtained from the tree and the original distances (or dissimilarities) used to construct the tree [55]. It therefore measures how faithfully the cluster tree preserves the differences among observations. Table 2 shows the HAC results for different cluster numbers (n = 5, n = 20, n = 30) at different levels. From the cophenetic correlation results, we found that (1) the weighted data processing method performed better than raw data across all distance metrics; (2) the performance of the adjusted cosine distance metric is similar, regardless of whether the weighted or raw data are used; and (3) the optimal clustering merge strategy differs depending on the number and levels of clusters. When the cluster number was n = 5, the cophenetic correlation result at level 2 was better than at other levels using the adjusted cosine distance metric. When n = 20, the cophenetic correlation coefficient at level 1 was better than at other levels, and the adjusted cosine distance metric, combined with the Ward merge strategy, achieved the maximum cophenetic correlation coefficient, i.e., 0.909. When n = 30, the result at level 1 was better than at other levels, and the adjusted cosine distance metric, combined with the centroid merge strategy, achieved the maximum cophenetic correlation coefficient, i.e., 0.929. Overall, according to the cophenetic correlation coefficient results, the adjusted cosine was the best distance metric, and the performance with weighted data was better than with raw data.
The dendrogram obtained from the clustering process illustrates the number of clusters obtained and their linkage. According to the dendrogram results in Figure 4 and the quality curve in Figure 5, we found that (1) the clustering results with weighted POI and land use data performed better, which indicates that the identification of UFZs should take into account landscape patterns; (2) the optimal combination for clustering the UFZs was the adjusted cosine distance metric with the average linkage strategy; and (3) the silhouette coefficient was the most suitable clustering quality metric, and the optimal number of clusters was 10.
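To make the role of the silhouette coefficient concrete, the following pure-Python sketch computes the mean silhouette value from a distance matrix and a label list. The function name, the convention of scoring singleton clusters as 0, and the toy data are illustrative assumptions; the sketch also assumes at least two clusters and no duplicate observations.

```python
def silhouette_score(dist, labels):
    """Mean silhouette coefficient for a symmetric distance matrix and labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical one-dimensional feature values for four regions.
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]

good = silhouette_score(dist, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score(dist, [0, 1, 0, 1])   # clusters interleaved
```

Repeating the computation for a range of cluster numbers and choosing the number with the highest score mirrors the kind of quality curve described above.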

Spatial Patterns of UFZs
According to the results of the hierarchical weighted agglomerative clustering (Figure 6), the model identified the clusters unambiguously. POI data can be used to identify UFZs to some extent, because they represent the intensity of human activity. Furthermore, by combining POI and land use type data, UFZs can be identified more precisely. Land use type, for example, can be used to identify cultural tourism zones and natural landscape districts. The accuracy and fineness of the clustering results were both affected by the number of clusters and segmented patches, as shown in Figure 6; the finer the segmentation of the study area, the better the clustering results. To express the spatial autocorrelation of clusters, we used Moran's I index to measure the spatial distribution pattern of the two clustering results. The Moran's I analysis revealed that the clustering results based on POI and land use type data had a substantial positive autocorrelation at the 0.05 significance level. The spatial distribution of the clustering results matched that of the actual UFZs (Figure 7). According to the results, clustering the segmented sub-regions calls for weighting the raw data. Combining the adjusted cosine distance metric with the average linkage strategy can be a suitable method if there is no prior knowledge.
UFZs were identified based on the composition of POI classes and land use type data among clusters. Because some regions are very small, their clusters were merged into other clusters. There are seven types of functional zones in downtown Beijing, as shown in Figure 7, including four types of single functional zones and three types of mixed functional zones. The education zone, the recreation green zone, the residence zone, and the social and community zone are single functional zones with areas of 87.4 km², 145.4 km², 153.7 km², and 42.3 km², respectively. With areas of 47.5 km², 73.1 km², and 117.2 km², respectively, the mixed functional zones comprise combined residence and recreation zones, commercial and industrial zones, and commercial residence zones. The residential zone occupied the most space of all, and it was widely spread out across the study area with significant disparities. On the perimeter, the proportion of the residential zone was higher than in the center. Recreation green zones are denser in the north, but they lie relatively far from the residential areas. Recreation green zones are more dispersed in the south, where they surround the residential areas. As a result, the recreation equity in the south is better. Education zones are concentrated in the Haidian District, i.e., northwest of the study area. Commercial zones are always found in conjunction with other functional zones, such as residence zones, recreation zones, and industrial zones. The results also indicate that Beijing, as a metropolis, has a relatively effective functional zone plan.
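For reference, the global Moran's I statistic can be sketched in a few lines of pure Python. The example assumes one numeric attribute per region and a binary adjacency matrix for four regions arranged in a chain. How cluster labels were encoded and how significance was tested in the study are not stated, so both are left out, and all names and values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four regions in a row; 1 marks regions that share a border.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 5, 5], adjacency)    # similar values are neighbors
alternating = morans_i([1, 5, 1, 5], adjacency)  # neighbors always differ
```

The clustered arrangement yields a positive value (1/3) and the alternating arrangement yields -1, which matches the usual reading of positive and negative spatial autocorrelation.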

Methodological Advantages and Limitations
Hierarchical agglomerative clustering is a popular unsupervised learning technique for discovering the natural groupings of a set of observations, and we built on it to identify the UFZs. In this study, we proposed a hierarchical weighted clustering model that uses weighted POI, land use, and socio-economic data to cluster segmented sub-regions divided by road networks. To identify the urban functional zones, the weights of the POI category scheme, the POI level, the distance metrics, and the clustering merge strategies were integrated into the clustering model. This study could expand the traditional understanding of clustering based on the individual densities of POIs.
Previous approaches either reduce the raw data into new categories [4,51], which results in the loss of feature information, or simply classify the regions using the raw POI densities [4]. The most significant benefits of our study are that it provides a general paradigm for identifying UFZs and helps to quickly analyze the impact of different characteristic vectors on classification results. Furthermore, we can identify the segmented zones without any prior knowledge of the label data. Additionally, unlike the K-means algorithm, whose results can be inconsistent between runs, the hierarchical weighted clustering model obtains consistent clustering results.
The other advantage is that the entropy weight method was integrated into the evaluation system, making it possible to automatically calculate the weights of hierarchical POI categories. When identifying UFZs, previous studies usually fail to consider the effects of the POI classification level and the weights of POI categories [56]. The entropy weight method has the potential to increase efficiency, unlike the Delphi consensus technique, which requires too much time and money to obtain valuable responses through questionnaires. Furthermore, rather than relying on a particular region, it yields objective and convenient scores in each study area.
There are some limitations to this framework. For example, it is an unsupervised framework, and uncertainty analysis can be problematic due to the lack of prior knowledge in this method. Because the identification of UFZs is based on the feature vectors generated by POIs, some inconsistencies may exist when compared to actual urban functional property. Although this approach has simplified the data processing procedure to consider the weighted and hierarchical relations of POIs, it has yet to establish a uniform mechanism for evaluating the performance of UFZ classification, and all of the processes in this study may need to be repeated in other areas of research.

Application for Sustainable Urban Planning
The hierarchical weighted clustering model, as opposed to the previous method based on POI density, is more conducive to analysis and less prone to misinterpretation regarding the weighted and hierarchical relations of POIs. The hierarchical weighted clustering model is a social-based, planning-oriented, and data-driven classification system linked with urban function, and it may also be used to connect human activity intensity and UFZ identification. UFZs could identify the heterogeneity of the urban internal thermal environment and quantify the basic units of the effect of anthropogenic heat, as reiterated in a published article on the effects of UHI [15]. UFZs can provide more precise information than land use and cover data, and they correspond to the basic planning units defined by a city's UFZ pattern. Therefore, the HAC model and the resulting UFZs can provide a consistent mapping for urban planning and energy saving inside a city, allowing the UFZs to be applied to city management practices. In general, it is difficult to quantify the impact of human activities on urban heat island effects in an ecological environment because we cannot scientifically partition the intensity of human activities.
This method also has practical significance, and our methodology can advance the understanding of local contexts. For example, the results of the functional zones can be used to identify the factors of traffic congestion caused by urban planning, analyze the relationship between rainfall water capacity and wettability of small-leaved lime and poplar in different city zones [57], plan a fresh food distribution center based on functional zones for fresh product logistics [58], and provide a means of calibration and reference for urban planning by monitoring the temporal and spatial variability of UFZs [6,59]. Overall, the hierarchical weighted clustering model provides new insights into the methodology of UFZ identification and the quantitative assessment of the weights of POI categories, as well as into wider applications concerning the impacts of human activities or UFZs on the natural ecological landscape.

Conclusions
This study proposed an identification model of UFZs, annotated the social properties using POIs and land use data, and provided some potential solutions for the sustainable development of a city based on the pattern of its urban functional zones. We demonstrated the availability and feasibility of the hierarchical weighted clustering model. The adjusted cosine metric and the average criterion were the optimal distance metric and linkage strategy, respectively, and their combination gave the best performance and quality of clustering results within the Fifth Ring Road, Beijing, China. Compared with remote sensing images, which primarily depict physical properties, the results of the clustering model based on POI data can be viewed as a complementary social sensing view of urban planning and human activities. Although semantically meaningful UFZs were identified, the hierarchical weighted clustering model is an unsupervised approach with limits in identifying the actual urban functions. In addition, more research is needed to recognize the social functions accurately while taking into account building height and building density in the study area. This study also provides a valuable method for correlating the natural characteristics and social activities in a densely populated region.

Data Availability Statement: The data and code presented in this study are available on request from the corresponding author.