DFCNN-Based Semantic Recognition of Urban Functional Zones by Integrating Remote Sensing Data and POI Data

Abstract: The urban functional zone, as a special fundamental unit of the city, helps in understanding the complex interaction between human space activities and environmental changes. Based on the recognition of the physical and social semantics of buildings, combining remote sensing data and social sensing data is an effective way to quickly and accurately comprehend urban functional zone patterns. At the object level, this paper proposes a novel object-wise recognition strategy based on very high spatial resolution images (VHSRI) and social sensing data. First, buildings are extracted according to the physical semantics of objects; second, remote sensing and point of interest (POI) data are combined to comprehend their spatial distribution and functional semantics in the social function context; finally, urban functional zones are recognized and determined from buildings with physical and social functional semantics. Regarding the extraction of building geometrical information, this paper, given the importance of building boundary information, introduces the deeper edge feature map (DEFM) into segmentation and classification, and improves the result of building boundary recognition. Given the difficulty of understanding deeper semantic and spatial information and the limitations of traditional convolutional neural network (CNN) models in feature extraction, we propose the Deeper-Feature Convolutional Neural Network (DFCNN), which is able to extract more and deeper features for building semantic recognition. Experimental results on a Google Earth image of Shenzhen City show that the proposed method and model are able to effectively, quickly, and accurately recognize urban functional zones by combining building physical semantics and social functional semantics, and are able to ensure the accuracy of urban functional zone recognition.


Introduction
Functional zones are the fundamental units of the city, which not only reflect the complex spatial distribution and socio-economic functions of the city but also help to understand the complex interaction between human space activities and environmental changes. Different functional zone units interact with each other, which results in shaping the complexity of the city [1][2][3]. Buildings are one of the most important components of a city, and the same kind of building often has the same requirements for area, location, and function, which leads to the aggregation of the same type of buildings in urban space. Thus, various functional zones are formed, such as commercial, residential, industrial, shantytown,

Stratified Scale Estimation Strategy
Fixed parameters are not satisfactory for all geographic objects; the optimal segmentation scale of a building differs from that of vegetation, water, or other objects. In addition, the spatial distribution pattern of geographic objects is often affected by scale [50][51][52][53][54]. Similar geographic objects tend to gather and dominate in a certain space and have similar scales and features.
Based on the above reasons, we propose a stratified scale estimation strategy, which combines area division based on the normalized grey level co-occurrence matrix (NGLCM) with spatial scale estimation to obtain the segmentation objects. As shown in Figure 2, first, the entire image is stratified into several large regions by multi-texture computing, and then the spatial scale of each region is estimated to implement fine-scale segmentation [52,55,56]. To some extent, the strategy can avoid the blindness and subjectivity of scale parameter selection, can satisfy the suitability and accuracy of different geographic objects, and can improve the efficiency of experiments.
The grey level co-occurrence matrix (GLCM) is a matrix that records the spatial combinations (angles and distances) of center pixels and neighborhood pixels for different window sizes and step sizes. Most texture measures are weighted averages of the normalized GLCM (NGLCM) element values [57]. The purpose of the weighted averages is to emphasize the relative importance of different values in the normalized GLCM. Homogeneity mainly measures the uniformity of the texture distribution of the image. Entropy measures the randomness of the information contained in the image, that is, the complexity of its grey-level distribution. Dividing the image into regions based on the normalized GLCM creates the premise for appropriate scale estimation of geographic objects.
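The homogeneity and entropy measures described above can be sketched with a minimal NumPy illustration; the single pixel offset, the assumption of pre-quantized grey levels, and the border handling are simplifications on our part, not the paper's exact implementation:

```python
import numpy as np

def nglcm(img, dx=1, dy=0, levels=8):
    """Normalized GLCM for one pixel offset; img holds integer grey levels 0..levels-1."""
    h, w = img.shape
    glcm = np.zeros((levels, levels))
    # Count co-occurrences of a centre pixel and its neighbour at offset (dy, dx).
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            glcm[img[y, x], img[y + dy, x + dx]] += 1
    return glcm / glcm.sum()  # normalize so the entries are joint probabilities

def homogeneity(P):
    # Weighted average that emphasizes near-diagonal (similar-value) pairs;
    # a perfectly uniform texture gives 1.
    i, j = np.indices(P.shape)
    return float(np.sum(P / (1.0 + (i - j) ** 2)))

def entropy(P):
    # Randomness of the grey-level distribution; complex textures give larger values.
    nz = P[P > 0]
    return float(-np.sum(nz * np.log2(nz)))
```

For a flat image this offset gives homogeneity 1 and entropy 0, while a checkerboard gives homogeneity 0.5 and entropy of 1 bit, matching the intuition that homogeneity rewards uniform textures and entropy rewards complex ones.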
In this paper, spatial scale estimation is used to calculate the average local variance of different windows in global images [55,58]. The basic theories of spatial statistics and spectral statistics can be applied to object-oriented remote-sensing image segmentation scale estimation. Based on the spatial information and attributes of remote-sensing data, we can summarize the three basic meanings of the scale parameters of objects: the spatial scale parameter h s (the spatial distance or the range of spatial correlation between patches), the attribute scale parameter h r (attribute distance or attribute difference between patches), and the merge threshold parameter M (size of patch or number of pixels).
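As a rough sketch of this idea, the average local variance (ALV) of non-overlapping windows can be computed for a series of window sizes and the spatial scale parameter read off where the ALV curve flattens; the 5% plateau threshold below is our own assumption, not a value from the paper:

```python
import numpy as np

def average_local_variance(img, win):
    """Mean of the local variances of all non-overlapping win x win tiles."""
    h, w = img.shape
    tiles = [img[y:y + win, x:x + win]
             for y in range(0, h - win + 1, win)
             for x in range(0, w - win + 1, win)]
    return float(np.mean([t.var() for t in tiles]))

def estimate_spatial_scale(img, windows):
    """Pick the window at which the ALV curve's relative growth first drops
    below 5% (a common plateau heuristic; the threshold is an assumption)."""
    alv = [average_local_variance(img, win) for win in windows]
    for k in range(1, len(alv)):
        if alv[k - 1] > 0 and (alv[k] - alv[k - 1]) / alv[k - 1] < 0.05:
            return windows[k - 1], alv
    return windows[-1], alv
```

The returned window size plays the role of a data-driven estimate of the dominant object scale in a region, replacing a manually chosen segmentation parameter.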
The stratified scale estimation strategy can satisfy the suitability and accuracy of different geographic object scales to a certain extent. This paper takes mean-shift segmentation as an example to demonstrate the feasibility of this strategy.

DFCNN-Based Semantic Recognition of Buildings
Semantics contain not only the features of certain object categories but also the corresponding complex distribution patterns and spatial structures [20]. This study adopts the self-built DFCNN semantic recognition model, which learns features automatically and is therefore able to interpret remote sensing images through richer and deeper features and to realize structural semantic recognition.

Deeper-Feature CNN (DFCNN)
Traditional classification methods have a limited ability to express object features, especially for VHSRI, which contain more complex object information, and they often produce unsatisfactory classification results [18,[59][60][61][62][63]]. Convolutional neural networks, as the core of deep learning, are well suited to image processing and image understanding, and can automatically extract rich, deep features from images, thanks to the boundary and spatial information preserved between successive convolutions [64]. Therefore, CNNs show great potential in automatic feature extraction and complex object recognition for VHSRI [33,[65][66][67]].
Most current CNN models merely stack more convolution layers to deepen the network in the hope of obtaining better performance. However, not only can the depth and width of the network be adjusted, but the perceptual width can also be increased if the filters running on the same convolution layer are multi-scaled. Thus, inspired by the inception module, we built the Deeper-Feature CNN (DFCNN) to promote the efficiency of mining deeper semantics and semantic information among objects [68,69]. The DFCNN consists of five convolution modules, five pooling layers, and a fully connected layer, and the score function is Softmax. The number of DFCNN channels is flexible and determined by the specific data. Figure 3 is a self-explanatory description of the structure of the DFCNN.
A convolutional module, which contains a 1 × 1 convolutional filter, two 3 × 3 convolution filters, and a 3 × 3 depthwise separable convolution, was adopted to replace the traditional convolutional layer. Depthwise separable convolutions divide the convolution kernel into two separate kernels: a depthwise convolution and a pointwise convolution (Figure 4). Depthwise convolution is the normal convolution of the input image without changing the depth, whereas pointwise convolution uses a 1 × 1 kernel whose depth is the number of channels of the input image. Unlike traditional convolution layers, the proposed convolutional module separates the region and channel and puts more emphasis on the region, which is favorable for the extraction of multi-layer deep features.
Compared to other convolutional neural networks, the DFCNN can mine deeper semantics. As the DFCNN considers multiple convolutional kernels with different sizes, it gains better robustness and a larger receptive field, and obtains more and deeper features than a CNN with single filters. Moreover, the addition of depthwise separable convolutions ensures that the information from multiple channels is fully used and that more features are extracted. The ability of the DFCNN to fully extract more and deeper features is the key to probing the physical and social semantics of buildings.
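The parameter savings that motivate the depthwise separable design can be checked with simple weight counts (bias terms are ignored, and the channel sizes in the example are illustrative, not the DFCNN's actual ones):

```python
def standard_conv_params(k, c_in, c_out):
    # One k x k kernel per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise part: one k x k kernel per input channel (spatial filtering only).
    # Pointwise part: 1 x 1 kernels that mix channels up to the output depth.
    return k * k * c_in + c_in * c_out
```

For a 3 × 3 kernel with 64 input and 128 output channels, a standard convolution needs 73,728 weights while the depthwise separable version needs 8768, roughly an 8.4-fold reduction, which is what makes the extra parallel filters of the module affordable.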

Physical Semantic Recognition of Buildings
Different objects with the same spectral characteristics and the same objects with different spectral characteristics impact the recognition of buildings; therefore, it is difficult to extract buildings satisfactorily with traditional shallow features. We therefore integrated VHSRI and the DEFM to mine the deeper semantic information of geographic objects based on the DFCNN model, and to extract buildings from their physical semantics. The process is shown in Figure 5.
The high-frequency part of an image is the more sensitive and interesting part, and the details of the image are often related to it. The deeper edge feature map (DEFM) can better reflect the detailed information of the image, and the DEFM can also be considered to contain much of the structure and edge information of the image. Owing to the influence of the window on the image structure, a Gaussian weighting method was used to calculate the deeper edge features. The specific formula is calculated as:

M_{x,y} = Σ_{p∈L} ω_p η_p,  VM_{x,y} = Σ_{p∈L} ω_p (η_p − M_{x,y})²,

where VM_{x,y} and M_{x,y} are the local variance and mean based on the centered pixel (x, y), L is a window of m × m, η_p is a pixel in the window, and ω_p is its Gaussian weight, with p ∈ L.
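A Gaussian-weighted local variance of this kind can be sketched directly in NumPy; the window size, σ, and reflective border padding below are our assumptions, not values stated in the paper:

```python
import numpy as np

def gaussian_weights(m, sigma=1.0):
    """m x m Gaussian window, normalized so the weights sum to 1."""
    ax = np.arange(m) - (m - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

def deep_edge_feature_map(img, m=3, sigma=1.0):
    """Gaussian-weighted local variance at every pixel (borders padded by reflection).
    Homogeneous regions give values near 0; edges give large responses."""
    w = gaussian_weights(m, sigma)
    pad = m // 2
    p = np.pad(img.astype(float), pad, mode='reflect')
    h, wd = img.shape
    out = np.zeros((h, wd), dtype=float)
    for y in range(h):
        for x in range(wd):
            win = p[y:y + m, x:x + m]
            mu = np.sum(w * win)                    # weighted local mean M_xy
            out[y, x] = np.sum(w * (win - mu) ** 2)  # weighted local variance VM_xy
    return out
```

On a step-edge image the response is zero in the flat regions and peaks along the edge, which is the behavior the DEFM relies on for building boundaries.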

Social Semantic Recognition of Buildings
Although the DFCNN-based method is effective for recognizing the physical semantics of buildings, it is difficult for it to detect the social function attributes of buildings. To fully comprehend the semantics of building objects from VHSRI and social context information, we integrated POI data with buildings at the object level. Building objects can be assigned high-level social semantic information by the DFCNN, such as residential districts, commercial districts, hospitals, and so on. The process of social functional semantic recognition of buildings is shown in Figure 6.

Unfortunately, the quality of POI data varies by category. For example, there are significantly more commercial POIs than other types: POI data are generated by human activities and are therefore mainly concentrated in commercial areas. This results in an uneven distribution of POI data, so the original POI data cannot satisfactorily represent the volume and spatial distribution of geographical objects. Therefore, the kernel density analysis method was adopted to fully mine the spatial and semantic granularity information of the POI data, which is a convenient way to extract the semantic information of social functions. The following formulas define how to calculate the kernel density of POI data and how to determine the default search radius.
The predicted density at (x, y) is determined by the following formula:

Density(x, y) = (1 / r²) · Σ_{i: dist_i < r} (3/π) · p_i · (1 − (dist_i / r)²)²,

where i represents the input POI points, i = 1, . . . , n; p_i refers to the population field value of the POI data, which represents the magnitude of point i; dist_i is the distance between point i and position (x, y); and r is the search radius.
The search radius is defined as follows:

SearchRadius = 0.9 · min(SD, sqrt(1 / ln 2) · D_m) · n^(−0.2),

where D_m represents the median distance from the mean center; n refers to the sum of the population field values of the POI data, and SD is the standard distance:
SD = sqrt( Σ_i (x_i − X)² / n + Σ_i (y_i − Y)² / n ),

where x_i, y_i are the coordinates of point i; X, Y represent the mean center of the points, and n refers to the number of POI points.
The above formulas calculate the density at each position, which is then multiplied by the sum of the population field values of the POI data. Finally, the result is output to the center of each pixel to obtain the kernel density image of the POI data.
To some extent, the kernel density image of POI data not only makes up for the imbalance of the number of POI types but also reduces the impact of the inaccurate position of POI data. The kernel density image allows POI and VHSRI to complement one another, which is conducive to the recognition of social function semantics.
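A kernel density surface of this kind can be sketched with the widely used quartic kernel; the exact kernel shape and the handling of the population field here are our assumptions for illustration, not necessarily the paper's exact choices:

```python
import numpy as np

def kernel_density(points, weights, radius, grid_x, grid_y):
    """Quartic-kernel density surface for weighted points on a regular grid.
    points: list of (x, y); weights: population field values p_i."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    density = np.zeros_like(gx, dtype=float)
    for (px, py), p in zip(points, weights):
        d = np.hypot(gx - px, gy - py)
        inside = d < radius
        # Quartic kernel: (3/pi) * (1 - (d/r)^2)^2 inside the radius, 0 outside.
        density[inside] += p * (3.0 / np.pi) * (1.0 - (d[inside] / radius) ** 2) ** 2
    return density / radius ** 2
```

Each POI contributes a smooth bump whose height scales with its population field value, so dense clusters of weighted points produce the high-density cores that the functional-zone recognition exploits.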

Maximum Area-Based Urban Functional Zone Recognition
Because buildings in a block can have multiple overlapping functions, this paper adopts a maximum-area method. To be specific, we summed the building area of each functional category in a block, and the functional category with the maximum area was selected to represent the main functional attribute of the zone. Based on this information, the map of the functional zones of the city is presented in this section.
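The maximum-area rule reduces to a few lines of Python; the labels and areas in the example are illustrative:

```python
from collections import defaultdict

def dominant_function(buildings):
    """buildings: list of (function_label, area) pairs for one block.
    Returns the label whose summed building area is largest."""
    totals = defaultdict(float)
    for label, area in buildings:
        totals[label] += area
    return max(totals, key=totals.get)
```

For example, a block with 700 m² of residential buildings and 650 m² of commercial buildings is labeled residential, even though the single largest building may be commercial, because the rule compares summed areas per category.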

Experiments and Results
The experiments were performed on an Ubuntu 16.04 operating system with a 3.4 GHz Core i7-6700 CPU, 8 GB of RAM, and an NVIDIA TITAN X GPU with 12 GB of memory. TensorFlow 1.7 was selected as the deep learning framework.

Study Area and Data Sets
This study focuses on urban functional zones, and the central city of Shenzhen (26.8 km²) was selected as the research area (Figure 7). Shenzhen is one of China's mega economic centers and an international city, and it is also China's first fully urbanized city.
The urban building styles are different, and the functional zones are extremely complex and heterogeneous, which presents a huge challenge for the proposed method. Three types of data were used in this experiment, as outlined below.
Google Maps image data: The image was acquired by the WorldView-3 satellite in June 2018, with a size of 19,726 × 15,111 pixels and a spatial resolution of 0.31 m. The image was used to extract geographic objects and their features and spatial patterns, and this information was then used for classifying land cover and recognizing functional zones.
Urban road network data: As shown in Figure 8, 3880 detailed urban road vectors obtained from OpenStreetMap (OSM) were used to generate 551 blocks. After geometric correction, the urban road network was matched with the image.
POI data: 33,755 POIs acquired from Amap were divided into seven categories according to social function attributes, that is, commercial services, public services, residential quarters, factory industries, schools, hospitals, and urban green, as shown in Figure 8.


Results of the Stratified Scale Estimation
According to the strategy of stratified scale estimation based on the normalized GLCM, the segmentation parameters of the VHSRI were continuously fine-tuned by experience, and Shenzhen's urban area was divided into 12 regions. Then, spatial scale estimation was performed on each region; the estimates of h_s and h_r are shown in Figures 9 and 10. Moreover, taking mean-shift segmentation as an example, Table 1 outlines the local segmentation image estimation parameters, and the results of the stratified scale estimation are shown in Figure 11.

Labeled samples are crucial in the training of DFCNN models, and they can be divided into training data and validation data. Training data are used to train the DFCNN weights and biases, while validation data are used to optimize hyper-parameters and to evaluate the overfitting of the DFCNN after training. In this experiment, a labeled sample was composed of a square patch and the land-cover class of its central pixel. The study area was divided into seven land-cover categories according to their characteristics, namely, building, road, tree, grassland, water, shadow, and bare land. The samples of each category were selected separately by manual visual interpretation. This not only considers all of the categories in the image but also ensures that the number of samples in each category is appropriate and accurate, so that the different categories can be randomly and evenly represented.
To enhance the robustness of the CNN, 80% of the samples were randomly selected as labeled samples. These labeled samples were then randomly divided into a training data set (80%) and a validation data set (20%). The labels of the samples were treated as true values and were used to calculate the difference between the predicted value and the true value. The exact numbers of samples used for training and validation are listed in Table 2.
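The two-stage 80/20 sampling described above can be sketched as follows; the random seed and the use of an index permutation are our own choices:

```python
import numpy as np

def split_samples(n_samples, seed=0):
    """Mimic the paper's sampling scheme: 80% of all samples are labeled,
    and those labeled samples are split 80/20 into training and validation
    index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    labeled = idx[: int(0.8 * n_samples)]
    n_train = int(0.8 * len(labeled))
    return labeled[:n_train], labeled[n_train:]
```

Out of 1000 samples this yields 640 training and 160 validation indices, with the remaining 200 left unlabeled, and the two sets never overlap because they come from one permutation.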

Parameter Setting for DFCNN
Before training the DFCNN network, some parameters needed to be preset. The number of input image channels was four, determined by the fusion of the remote sensing images and the deeper edge feature map. The learning rate, which controls the progress of the learning model, was 0.01, a value derived from a large number of training runs. One complete pass of the data set through the neural network is called an epoch. Based on empirical evidence, the number of epochs was set to 100, which meets the training accuracy requirement while improving the training speed of the model. Because the large number of samples cannot be passed through the neural network at one time, the data set is divided into batches of size 100, and the number of batches is the number of training samples divided by the batch size. To avoid overfitting of the DFCNN, the dropout rate was set to 0.5.
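The stated settings can be summarized as follows; the training-set size is a placeholder, and rounding the final partial batch up is our assumption:

```python
# Hyper-parameter summary (values from the text; n_train is illustrative).
n_channels = 4        # fused VHSRI bands + deeper edge feature map
learning_rate = 0.01
epochs = 100
batch_size = 100
dropout_rate = 0.5
n_train = 640         # placeholder training-set size

# Batches per epoch = training samples / batch size (ceiling division
# so that the last, partially filled batch is still processed).
batches_per_epoch = -(-n_train // batch_size)
```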

Ground Truth Validation Points
To reflect the classification of geographic objects accurately and to extract buildings appropriately, the ratio of verification points to training sample points was set, according to experience, at 4:1; thus, in this experiment, more than 5000 random ground truth verification points were generated separately, as shown in Table 3. These verification points are different from the validation data in Section 3.3.1, and they were used to evaluate the accuracy of the final classification results. Finally, a confusion matrix was used to compare the ground truth verification points with the classification results. Figure 12 illustrates the distribution of the ground truth verification points.
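The confusion-matrix comparison can be sketched as follows, assuming the land-cover classes are encoded as integers 0..n−1:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts verification points of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def overall_accuracy(cm):
    # Correctly classified points lie on the diagonal.
    return np.trace(cm) / cm.sum()
```

Per-class producer's and user's accuracies follow from the same matrix by normalizing its rows and columns, respectively.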

Results of the Building Extraction
The recognition of urban buildings based on physical semantics is a significant prerequisite for the recognition of urban functional zones. The results of the building extraction are shown in this section.
The scale effect is an inevitable and important issue in VHSRI classification. The feature expression of the same training sample point varies with the scale: a small scale captures more of the object's own features (spectrum, shape, texture, etc.), while a large scale focuses on information about its surroundings (proximity, spatial distribution, etc.). Therefore, we accounted for the scale effect and carried out multiple sets of experiments. The size of the training window, that is, the scale, was varied in steps of 20. The specific experimental results, combining multiple single-scale and multi-scale features, are discussed in Section 4.2.3.
In some areas, objects of the same type were classified into different categories and scattered, and the originally homogeneous patches were "broken", resembling TV static. This phenomenon is called the "salt-and-pepper effect". As shown in Figures 13 and 14, small-scale results in particular (scale 15 and scale 35) show an obvious salt-and-pepper effect. As the scale increases, this phenomenon gradually weakens, and both the classification accuracy of geographic objects and the building extraction accuracy first increase and then decrease. Building recognition accuracy was best at scale 155, reaching 97.0%. Figure 15 shows the overall accuracy of geographic object classification and the building recognition accuracy.
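
The training window ("scale") described above can be illustrated as a centred patch extraction; the function below is a sketch under the assumption that edge cases are handled by reflect-padding (the paper does not specify its border handling):

```python
import numpy as np

def extract_patch(image, row, col, scale):
    """Extract a (scale x scale) training window centred on a sample point.
    Reflect-padding ensures that points near the border still yield a full
    window."""
    half = scale // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    return padded[row:row + scale, col:col + scale, :]

# 4 channels: VHSRI bands plus the deeper edge feature map.
img = np.zeros((200, 200, 4), dtype=np.float32)
patch = extract_patch(img, 10, 190, 155)  # best-performing single scale
print(patch.shape)  # (155, 155, 4)
```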


Social Functional Semantic Refinement for Buildings
Buildings with physical semantics were recognized and extracted in the previous stage. However, buildings with similar appearances can have different functions. Figure 16 shows the kernel density analysis of the POIs. The population field value and search radius were set according to the quality and quantity of POIs in each category. For example, the number of school and hospital POIs was small; to obtain an appropriate functional radiation range for schools and hospitals, their weights and radii were set larger. Conversely, the numerous commercial POIs were assigned a smaller weight and radius. All kernel density images were integrated with the VHSRI, and the resulting 11-channel image was input into the DFCNN model for training. The buildings were mined for social functional semantic information under geographical constraints, and the social attributes of the buildings were thereby refined.
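
The per-category kernel density surfaces can be sketched with a simple Gaussian kernel; the paper uses a GIS kernel density tool, so the function, grid size, and weight/radius values below are illustrative assumptions only:

```python
import numpy as np

def kernel_density(points, weight, radius, grid_size=100):
    """Gaussian kernel density surface for one POI category.
    Sparse categories (schools, hospitals) get a larger weight and radius;
    dense commercial POIs get a smaller one, as described in the text."""
    ys, xs = np.mgrid[0:grid_size, 0:grid_size]
    density = np.zeros((grid_size, grid_size))
    for px, py in points:
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        density += weight * np.exp(-d2 / (2 * radius ** 2))
    return density

# Hypothetical POI locations on a 100x100 grid.
school_density = kernel_density([(20, 30)], weight=5.0, radius=25)
shop_density = kernel_density([(50, 50), (52, 48)], weight=1.0, radius=8)
# Each category yields one density channel; stacking the category channels
# with the image bands gives the 11-channel input mentioned above.
```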

Results of the Urban Functional Zones
There will always be buildings with multiple functional attributes in one zone (Figure 17A-C), which means that the functions of zones overlap or mix, but there is always a dominant social function guiding each zone. Therefore, we adopted the Maxarea voting strategy: the dominant functional attribute of a zone is the social function with the largest total building area in that zone. The results of the urban functional zones are shown in Figure 18. Shenzhen is a city with multiple, comprehensively distributed functions, but commercial and residential zones occupy 63.57% of the total. Shenzhen's urban structure was shaped by China's national conditions and the relationship between people and land, and it greatly facilitates people's daily needs for clothing, food, and shelter. To further verify the accuracy of the urban functional zones, the results were evaluated using a confusion matrix, as shown in Table 4.
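
The Maxarea voting strategy reduces to a simple area tally per zone; the labels and areas below are hypothetical:

```python
def maxarea_vote(buildings):
    """Assign a zone the social function whose buildings cover the largest
    total area (the Maxarea voting strategy described above).
    `buildings` is a list of (function_label, area) pairs for one zone."""
    totals = {}
    for label, area in buildings:
        totals[label] = totals.get(label, 0.0) + area
    return max(totals, key=totals.get)

# Hypothetical zone: residential buildings cover the largest total area.
zone = [("residential", 1200.0), ("commercial", 800.0),
        ("residential", 300.0), ("education", 950.0)]
print(maxarea_vote(zone))  # residential
```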

Effectiveness of the Stratified Scale Estimation
To verify the effectiveness of the stratified scale estimation, we directly recognized and extracted buildings across the entire image without stratified processing. As shown in Figure 19, the accuracy of the stratified scale estimation is clearly higher than that of the non-stratified process. The best single-scale precision appeared at scale 155. The stratified strategy groups objects with homogeneous features, which reduces experimental time and provides a better environment for segmentation. Scale estimation avoids, to a certain extent, the large number of experiments otherwise needed to find the most appropriate scale, and it improves experimental efficiency while satisfying the accuracy requirements of segmentation and classification. Therefore, stratified scale estimation has demonstrable suitability and effectiveness, and it has practical significance for the classification and feature extraction of remote sensing imagery.

Effectiveness of the Deeper Edge Feature Map
Figure 20 shows the deeper edge feature map (DEFM) of the study area, and the three subregions (A-C) exemplify the deeper edge features. To verify the contribution of the DEFM, we treated the deeper edge features as the only variable and conducted comparative experiments. The results are given in Table 5. Across the scale comparison, as the scale becomes larger, the accuracy of the experiments supported by deeper edge features gradually improves, with building accuracy peaking at scale 155. This proves that the deeper edge feature map provides richer building detail and semantic information, such as the surrounding environment, and it also corroborates, from another angle, the effectiveness of the chosen training scale for building extraction. The above experiments prove that the combination of VHSRI and DEFM is effective: on the one hand, it saves time in feature selection to a certain extent; on the other hand, it provides accurate target boundary information and surrounding semantic information for object-based deep learning methods.
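
The paper derives the DEFM from a deeper edge-feature network; as a simple stand-in, the sketch below uses a Sobel gradient magnitude to show how an edge map is fused with the image bands as an extra channel (the kernel and fusion layout are illustrative assumptions, not the paper's actual DEFM):

```python
import numpy as np

def sobel_edge_map(gray):
    """Gradient-magnitude edge map (a simple stand-in for the learned DEFM)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

rgb = np.random.rand(64, 64, 3).astype(np.float32)
edges = sobel_edge_map(rgb.mean(axis=2))
fused = np.dstack([rgb, edges])  # 4-channel network input: bands + edge map
print(fused.shape)  # (64, 64, 4)
```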

Effectiveness of the DFCNN Structure for Building Extraction
To prove the superiority and robustness of the DFCNN, we used Random Forest (RF) and AlexNet as comparison methods for building recognition on the Google Earth imagery of Shenzhen. Traditional methods for feature extraction and classification of high-resolution images, especially in complex urban scenes, consider only shallow information. The introduction of AlexNet addresses the problem that semantic information goes unused and significantly improves building extraction accuracy. The self-built DFCNN model makes fuller use of the image information and mines deeper semantic information, and thus extracts buildings more accurately. A comparison of the accuracy of the three methods is given in Table 6. With the DFCNN, the overall accuracy increased from 80.99% to 96.65%, and the building accuracy increased significantly, by about 29.1%. As shown in Figure 21, the Random Forest method remains unsatisfactory from a visual point of view: many trees and buildings are mistakenly classified as shadows, accompanied by the salt-and-pepper effect. The buildings recognized by AlexNet correspond well to their locations in the original image. As shown in Figure 22, almost all the buildings in areas A and B were completely extracted. However, the recognition ability of AlexNet appears limited: owing to the complexity of buildings and interference from roads with the same spectrum and texture as buildings, some misclassifications still occur. In contrast, the DFCNN performs better in this zone, which further proves its advantage in mining deeper semantics and its capability of semantic mapping.

Multi-Scale Training
Objects in nature take different shapes, but a single tile can hardly represent a building, and a single leaf can hardly explain a tree or a whole forest [70,71]. Therefore, different scales need to be taken into account. A small scale reflects the low-level features inside the object, while a large scale contains more high-level information about the surrounding environment [65,66,72]. A scale-combination strategy can take into account information at different macro and micro levels of the image, and can also analyze and explain different geographic phenomena at different scales.
Based on the above, a multi-scale strategy was adopted alongside the single-scale experiments. A scale of around 75 was set as the boundary, and three different combinations of large, medium, and small scales were used. The multi-scale classification results and building extraction are shown in Figures 23 and 24, and Figure 25 shows a multi-scale precision point-line diagram. It can be seen from these figures that the multi-scale final classification results all perform well: the salt-and-pepper phenomenon is almost eliminated, and the accuracy generally improves. However, the accuracy of the three scale groups 15-75-135, 35-75-115, and 55-95-135 is lower than that of the single scale 155. In contrast, the three groups 15-95-175, 35-95-155, and 55-115-175 exceed the best single scale. This is because scale 155 performed best in the single-scale experiments, containing richer semantic information about the object itself and its surrounding environment; the large scales in the latter three combinations all contain the deeper semantic information of scale 155, which further improves classification accuracy and building accuracy. Therefore, the multi-scale strategy offers guidance for further improving the accuracy of building extraction.
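
The scale-combination idea can be sketched as extracting windows at several scales around the same sample point, resizing them to a common size, and stacking the channels; the resizing method (nearest-neighbour) and the output size are illustrative assumptions, since the paper does not specify how the scales are fused:

```python
import numpy as np

def nn_resize(patch, size):
    """Nearest-neighbour resize of an (h, w, c) patch to (size, size, c)."""
    h, w = patch.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return patch[rows][:, cols]

def multi_scale_stack(image, row, col, scales, out_size):
    """Centre a window at each scale on the same sample point, resize to a
    common size, and concatenate along the channel axis."""
    stacks = []
    for s in scales:
        half = s // 2
        padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                        mode="reflect")
        patch = padded[row:row + s, col:col + s, :]
        stacks.append(nn_resize(patch, out_size))
    return np.concatenate(stacks, axis=2)

img = np.zeros((300, 300, 4), dtype=np.float32)  # 4-channel VHSRI + DEFM input
x = multi_scale_stack(img, 150, 150, scales=(15, 95, 175), out_size=95)
print(x.shape)  # (95, 95, 12)
```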

Pros and Cons
The functional zone recognition strategy proposed in this paper replaces methods that recognize urban functional zones directly from remote sensing data or social sensing data alone. The combination of POIs and VHSRI effectively realizes the geographical matching of the physical semantics and functional semantics of objects. The stratified scale estimation strategy solves the scale parameter problem to a certain extent, and the DFCNN model provides more and deeper semantics, offering strong support for the recognition of complex urban functional zones.
However, this method faces the following problems:

1.
The recognition of functional zones depends largely on the accuracy of the building extraction. The urban functional zone recognition strategy in this paper is based on the functional semantics of buildings; therefore, the misclassification and omission of buildings affect the division of functional zones.

2.
Although buildings contribute greatly to functional zones, other geographic objects are also part of them. For example, in parks, water and vegetation contribute more; the greening rates of commercial and residential areas differ; and the complexity and grade of roads also differ. Therefore, a strategy that considers only building attributes has certain limitations, and the attributes of all geographic objects within an area should be fully considered.

3.
Furthermore, spatial relationships and distributions were not fully taken into account. The construction of zones must follow certain spatial syntax, which is directly reflected in the spatial relationship between buildings, and even between various geographic objects [73,74]. Similarly, the relationship between geographic objects also affects the evolution of functional zones.
Therefore, in future studies, we plan to explore the spatial relationship of geographical objects and the hybrid expression and evolution of regional functions.

Conclusions
This paper integrates POI social sensing data and VHSRI to realize the recognition of urban functional zones based on the functional semantic recognition of urban buildings. The main contributions of this paper are integrating the deeper edge feature map (DEFM) with the VHSRI and using a multi-scale DFCNN to extract urban buildings. In addition, the recognition of urban functional zones in Shenzhen, a mega-city in China, verifies the accuracy and effectiveness of the method.

1.
Due to the complexity of urban ground features, we adopted a strategy of stratified scale estimation. To some extent, this method avoids the blindness and subjectivity of scale parameter selection and effectively improves the efficiency of the experiment. At the same time, it can also meet the suitability of different geographic object scales and can improve the accuracy of building extraction.

2.
In view of the diversity of the spectrum, shape, and texture of urban buildings, low-level information no longer suffices for the recognition and extraction of buildings. In this paper, a DFCNN model was designed based on the inception module, and the DEFM was used to mine higher-level building semantics and improve the accuracy of building extraction.

3.
The organic fusion of POIs and VHSRI was input into the DFCNN model to explore geographic objects and to comprehend functional semantic information in the context of social functions. Then, the Maxarea voting strategy was adopted to label each zone with its dominant function. This method effectively combines buildings' physical and social functional semantics and realizes the recognition of urban functional zones. Compared with previous studies, the method proposed in this paper can accurately describe semantically labeled buildings in complex urban environments. In addition, such buildings can be abstracted into functional zones through social functional information, which realizes the recognition of urban functional zones from the bottom up, from objects to scenes.
However, although POIs and VHSRI greatly improve the accuracy of urban functional zone recognition, some problems still need to be addressed. For example, resolving the three limitations identified in Section 4.3 should be the focus of future research. In addition, our method proved effective for Shenzhen, but its adaptability to the rest of the country, and even the world, requires further verification. Finally, in future research, we will use more effective strategies to mine the deeper semantic information and spatial relationships of objects, further advancing the mapping and analysis of urban functional zones [75,76].