An Ensemble Learning Approach for Urban Land Use Mapping Based on Remote Sensing Imagery and Social Sensing Data

Abstract: Urban land use mapping is crucial for effective urban management and planning due to the rapid change of urban processes. State-of-the-art approaches rely heavily on the socioeconomic, topographical, infrastructural and land cover information of urban environments via feeding them into ad hoc classifiers for land use classification. Yet, the major challenge lies in the lack of a universal and reliable approach for the extraction and combination of physical and socioeconomic features derived from remote sensing imagery and social sensing data. This article proposes an ensemble-learning-approach-based solution of integrating a rich body of features derived from high resolution satellite images, street-view images, building footprints, points-of-interest (POIs) and social media check-ins for the urban land use mapping task. The proposed approach can statistically differentiate the importance of input feature variables and provides a good explanation for the relationships between land cover, socioeconomic activities and land use categories. We apply the proposed method to infer the land use distribution in fine-grained spatial granularity within the Fifth Ring Road of Beijing and achieve an average classification accuracy of 74.2% over nine typical land use types. The results also indicate that our model outperforms several alternative models that have been widely utilized as baselines for land use classification.


Introduction
Automatic urban land use mapping is crucial for effective urban management and planning. It provides an essential tool for examining the way social, economic and ecological factors shape the spatial structure and change of urban processes under both empirical and simulation scenarios [1]. State-of-the-art approaches rely heavily on the socioeconomic, topographical, infrastructural and land cover information of urban environments via feeding them into ad hoc classifiers for land use classification. Examples include the derivation of the physical characteristics (i.e., urban structure) of the built-up environments from remote sensing imagery [2], and the extraction [...] Similarly, mobile phone data, which are more sophisticated and information-intensive, can be used to effectively label land use and urban patterns [30][31][32].
Furthermore, relevant studies have demonstrated that advanced approaches with a combination of remote sensing and social sensing data for urban land use discrimination can yield significantly improved performance compared with previous approaches. For instance, Hu et al. [33] constructed POI density, NDVI, NVBI and other characteristics from Sina Weibo POIs and Landsat remote sensing images to measure and explore the feature similarity between different land use types. Xia et al. [34] developed an approach to combine multisource features from remote sensing and geolocation datasets, including night-time lights, vegetation cover, land surface temperature, population density, LRD, accessibility and road networks, to extract urban areas at large scales. Zhang et al. [35] proposed a Hierarchical Semantic Cognition framework for the classification of urban functional zones based on objects segmented from the remote sensing images and labeled with nearby POI information. Liu et al. [36] developed a new classification framework to identify dominant urban land use type at the level of traffic analysis zones by integrating natural-physical features from remote sensing images and socioeconomic features from social media data. Zhang et al. [37] developed a method to synthetically utilize spectral and structural features from GF2 image data and spatiotemporal distributions of Weibo check-ins and POI data as the input of random forest to differentiate land use types. Nonetheless, the major challenge lies in the fact that existing research lacks a universal and reliable approach for the extraction and combination of physical and socioeconomic features derived from remote sensing and social sensing data sources. It is still challenging to accurately identify fine-grained urban land use based on the comprehensive information of urban environments derived from multisourced remote sensing and social sensing data.
To fill this research gap, this article proposes an ensemble learning solution that integrates rich features from remote sensing and social sensing data for urban land use mapping tasks. We utilize diverse datasets, including high resolution satellite images, street-view images, building footprints, POIs and social media check-ins, in the ensemble framework. We then extract essential physical and socioeconomic features of urban environments from these datasets based on image segmentation and feature embedding methods. Last, we apply a state-of-the-art ensemble learning approach to determine urban land uses by weighting the derived physical and socioeconomic features of built-up environments. In addition, we empirically verify the efficacy of our proposed approach in the city of Beijing, China.
To the best of our knowledge, we used both the most extensive data sources and an indicator-sensitive ensemble model to achieve a comprehensive perception of urban land use. The contribution of this study lies mainly in the following aspects. First, in terms of delineating urban morphology and extracting physical features, we use high-resolution remote sensing images and street-view images, and extract land cover and scene attributes of city zones by deep learning approaches to construct a semantically rich physical feature space. Secondly, we integrate POI and check-in data to construct a socioeconomic feature space. In particular, check-in information with user classification is used as the socioeconomic feature for the first time, that is, temporal variations of check-in by locals and nonlocals. Thirdly, by using the ensemble learning model, we systematically discuss and quantitatively compare the effects of various physical and socioeconomic indicators on predictions of urban land use attributes.

The Ensemble Learning Model
Ensemble learning is a popular and effective machine learning approach for classification problems. It builds and combines multiple machine learning algorithms to achieve the learning task, and therefore, is a suitable method for dealing with massive quantities of sparse remote and social sensing data. More importantly, it can statistically determine the importance of feature variables, thus providing a good explanation for the relationships between land cover, socioeconomic activities and land use.
Our proposed ensemble learning framework for urban land use mapping is illustrated in Figure 1. We first categorize the input data into two types: (1) Image data. This consists of satellite images (from Google Earth with a spatial resolution of 1 m in 2018) and street-view images (from Tencent, one of the largest web mapping platforms in China), which capture the physical characteristics of the urban built-up environment; (2) Non-image data. This includes (publicly available) building footprints, POIs (from Baidu, one of the leading location service providers in China) and social media check-ins (from Weibo, one of the leading social media sharing platforms in China), which provide additional characteristics concerning the socioeconomic activities over urban land parcels. Then, we apply state-of-the-art deep learning models, i.e., DeepLabV3+ [38] for image segmentation and ResNet-50 for scene recognition, to extract physical features from the image data. Meanwhile, we calculate the spatial and temporal characteristics of the non-image data in order to extract the socioeconomic features. The resultant features include the land cover proportion of the land parcel, the geometric attributes of the buildings, the scene categories of the street-view images, the densities and types of the POIs, and the volumes and temporal variations of the check-ins (see Appendix A for the descriptions of our datasets and the full list of derived features). Last, we feed these features into a state-of-the-art ensemble learning model, XGBoost [39], for urban land use classification and validate the classification accuracy at both fine and coarse granularities using five-fold cross-validation. We also compare the performance of our model with several alternative baseline models, including K-means clustering, latent Dirichlet allocation (LDA)-based topic modeling and the random forest (RF) classifier.
Remote Sens. 2020, 08, x

Extraction of Physical Features
Physical features include the 2D land cover information extracted from remote sensing images and the 3D scene information extracted from street-view images. The semantic segmentation model of remote sensing imagery can be used to obtain pixel-level land cover results. CNN-based models represented by FCN [40] have been applied to semantic segmentation of remote sensing imagery. Some subsequent improved models such as SegNet [41] introduced the encoder-decoder structure, and DeepLabV3+ [38] further improved the precision of semantic segmentation by combining encoder-decoder structure with SPP (Spatial Pyramid Pooling) [42]. The scene vectors of street blocks (parcels) can be gathered by using scene extraction models of street-view images. Since AlexNet, as a multilayer convolutional neural network, has been shown to have excellent performance [43], subsequent CNN networks such as VGG, GoogleNet and ResNet have continuously improved the scene classification precision [44,45].
The 2D land cover information is extracted using the DeepLabV3+ architecture (see Figure 2). The model follows the classical encoder-decoder structure, where the encoder captures semantic information and the decoder recovers spatial information. In detail, the encoder consists of two modules: (1) The Deep Convolutional Neural Network (DCNN) module, which uses the Xception_65 backbone network to extract features. We remove the fully connected layer and replace the max-pooling operations with depth-wise separable convolutions to adapt the original Xception network to our classification task and to obtain more detailed local features. To accelerate the convergence of the network and to avoid overfitting, we further impose a batch normalization operation and the ReLU activation function after each 3 × 3 depth-wise convolution; (2) The Atrous Spatial Pyramid Pooling (ASPP) module, which processes the output features from the DCNN and probes the convolutional features at multiple scales by applying atrous convolution at different rates. After the encoder, the decoder up-samples the encoder features bilinearly by a factor of 4, and then concatenates them with the corresponding low-level features from the network backbone (whose number of channels is reduced by a 1 × 1 convolution) at the same spatial resolution. Thereafter, we apply 3 × 3 convolutions and another bilinear up-sampling by a factor of 4 to refine the features and obtain a result of the same size as the input image.
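Once a pixel-level land cover map is available, the land cover proportion of a parcel can be derived by counting labelled pixels inside the parcel. The following is a minimal sketch of that step, not the authors' implementation: the function name `land_cover_proportions`, the list-of-lists mask layout and the class names are all assumptions made for illustration.

```python
from collections import Counter

def land_cover_proportions(label_mask, parcel_mask, classes):
    """Share of each land cover class among the pixels inside one parcel.

    label_mask  : 2D list of class names, one per pixel (segmentation output)
    parcel_mask : 2D list of booleans, True where the pixel lies in the parcel
    classes     : ordered list of class names that fixes the feature layout
    """
    counts = Counter(
        label
        for label_row, mask_row in zip(label_mask, parcel_mask)
        for label, inside in zip(label_row, mask_row)
        if inside
    )
    total = sum(counts.values())
    if total == 0:
        # Parcel covers no pixel of this tile: return an all-zero vector
        return [0.0] * len(classes)
    return [counts[c] / total for c in classes]

# Toy 2 x 4 tile; the class names are illustrative only
labels = [["building", "building", "road", "vegetation"],
          ["building", "road", "road", "water"]]
parcel = [[True, True, True, False],
          [True, True, True, False]]
class_order = ["building", "road", "vegetation", "water"]
props = land_cover_proportions(labels, parcel, class_order)
print(props)
```

Fixing the class order up front keeps the feature vector layout identical across parcels, which matters when the vectors are later stacked into one training matrix.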
To enrich the physical features of an urban built-up environment, the 3D scene of a street-view image is further recognized by the ResNet-50 model (see Figure 3). We pretrain the model using the Places2 dataset [46], which contains approximately 10 million images belonging to 365 common scene types. Applying the model, we obtain a 365-dimensional output, indicating the probabilities that a given input street-view image belongs to each of the 365 scene types. We summarize the probabilities of all the street-view images in each parcel and take the predominant scene assignment as the final scene category of the land parcel. It is worth noting that although street-view images are limited in their spatiotemporal coverage, they provide auxiliary information on the physical built-up environment, extending the 2D nadir view to three dimensions for land use classification.
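The parcel-level aggregation described above can be sketched as follows. This is only one plausible reading: it assumes that "summarizing" means summing the per-image probability vectors and taking the scene with the largest total. The function name `parcel_scene` and the three-scene toy vocabulary (instead of the 365 Places2 types) are illustrative assumptions.

```python
def parcel_scene(image_probs, scene_names):
    """Pick the predominant scene of a parcel from per-image probabilities.

    image_probs : list of probability vectors, one per street-view image
    scene_names : scene labels aligned with the vector positions
    """
    if not image_probs:
        return None  # parcel without street-view coverage
    # Sum the probabilities of each scene type over all images in the parcel
    totals = [sum(column) for column in zip(*image_probs)]
    best = max(range(len(totals)), key=totals.__getitem__)
    return scene_names[best]

# Toy example with 3 scene types instead of 365
names = ["street", "park", "campus"]
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.4, 0.1]]
print(parcel_scene(probs, names))
```

In this toy case a per-image majority vote would select "street" (two of three images), whereas summing probabilities selects "park", so the choice of aggregation rule can change the parcel label.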

Extraction of Socioeconomic Features
The socioeconomic attributes associated with urban spaces provide essential evidence of the functional use of the urban land parcels. In this research, we first leverage building footprints to understand the spatiotemporal characteristics of urban facilities, including the geographical location, geometry, number of floors and year of completion information. For each building feature, we calculate the mean value and the standard deviation to reveal the spatiotemporal diversity of buildings in the land parcels and obtain a 15-dimensional feature vector. Beyond the spatiotemporal diversity, we also utilize the POI data to measure the socioeconomic diversity of urban facilities in each land parcel. Specifically, the POIs comprise 21 types, including catering, hotel, shopping, leisure and entertainment, cultural media, tourist attraction, education and training, beauty, college, enterprise, medical, automotive services, government and organization, kindergarten and primary schools, transportation facility, sports, life services, parking, finance, residence and office buildings. Thereafter, we calculate the density and the proportion of different POIs for each parcel and obtain a 46-dimensional feature vector. A detailed category scheme is described in Appendix A, Table A1.
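The two feature families above (building statistics, POI densities and proportions) could be computed along the following lines. This is a hedged sketch: the attribute names, the use of the population standard deviation, the area unit and the helper names `building_features` and `poi_features` are assumptions, and the toy vectors do not reproduce the 15- and 46-dimensional layouts reported in the text.

```python
from statistics import mean, pstdev

def building_features(buildings, attributes):
    """Mean and standard deviation of each building attribute in a parcel."""
    feats = []
    for attr in attributes:
        values = [b[attr] for b in buildings]
        feats.append(mean(values) if values else 0.0)
        # Population standard deviation; the source does not say which variant is used
        feats.append(pstdev(values) if values else 0.0)
    return feats

def poi_features(poi_types, parcel_area_km2, type_list):
    """Density (count per km^2) and proportion of each POI type in a parcel."""
    n = len(poi_types)
    density = [poi_types.count(t) / parcel_area_km2 for t in type_list]
    proportion = [poi_types.count(t) / n if n else 0.0 for t in type_list]
    return density + proportion

# Toy parcel: two buildings and four POIs on 0.5 km^2
buildings = [{"floors": 6, "year": 1998}, {"floors": 18, "year": 2010}]
b_feats = building_features(buildings, ["floors", "year"])
p_feats = poi_features(["catering", "catering", "shopping", "hotel"], 0.5,
                       ["catering", "hotel", "shopping"])
print(b_feats)
print(p_feats)
```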
Considering that different POIs often have different attractions to human activities, we further differentiate the spatiotemporal and socioeconomic characteristics of urban land parcels using social media check-in data. Each check-in record contains a user ID, check-in time and spatial location. We use a random forest-based model proposed in the literature [47] to determine the residence city of each user. The model is based on the check-in frequency in different cities and the random forest algorithm, which yields a reliable separation of visitors from local residents in the target city. Based on the inferred user profiles, we divide check-ins into local and nonlocal. In so doing, we obtain the densities and temporal variations of check-ins by local and nonlocal users in each parcel (i.e., street block) as important socioeconomic features for the proposed ensemble learning model. We finally integrate the feature vectors derived from building, POI and check-in activity as the final socioeconomic feature of the learning model (as illustrated in Figure 4).
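As an illustration of the local versus nonlocal check-in features, the sketch below computes, for one parcel, a density and an hourly share profile for each user group. The hourly binning, the area unit and the name `checkin_features` are assumptions (the text does not specify the temporal resolution), and the set of local users is taken as given rather than inferred by the random forest model of [47].

```python
def checkin_features(checkins, local_users, parcel_area_km2):
    """Density and hourly profile of check-ins by local and nonlocal users.

    checkins    : list of (user_id, hour_of_day) pairs for one parcel
    local_users : set of user ids already classified as local residents
    Returns 2 * (1 + 24) values: [density, 24 hourly shares] per user group.
    """
    feats = []
    for want_local in (True, False):
        hours = [h for uid, h in checkins if (uid in local_users) == want_local]
        hist = [0] * 24
        for h in hours:
            hist[h] += 1
        n = len(hours)
        feats.append(n / parcel_area_km2)                  # check-in density
        feats.extend(c / n if n else 0.0 for c in hist)    # temporal variation
    return feats

# Toy parcel of 0.5 km^2 with three local and one nonlocal check-in
records = [("u1", 9), ("u1", 9), ("u2", 20), ("u3", 14)]
c_feats = checkin_features(records, {"u1", "u2"}, 0.5)
print(len(c_feats), c_feats[0], c_feats[25])
```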

Land Use Taxonomy for Model Training and Validation
To train and validate the proposed land use classification model, we adopt a land use taxonomy predefined by the Land Administration Law of the People's Republic of China (GBT 21010-2017). According to the taxonomy, land use types are classified in a hierarchical manner, consisting of cultivated land, garden land, woodland, grassland, commercial land, industrial land, residential land, public management and public service land, special land, transportation land, water and water conservancy facilities and other land. We filter out land use categories that do not exist in the studied city, Beijing, and generate nine refined land use categories at both coarse- and fine-grained scales to provide the final urban land use taxonomy (see Table 1). According to the refined classification criteria, we semi-automatically label the land use type for each land parcel and construct training and testing sets. The labelling process is based on the land use planning map provided by the Beijing Municipal Commission of Planning and Natural Resources. We rescale the land use map into street block granularity by counting the area of each land use type, and assign the label based on the dominant land use type or by visual interpretation. The detailed procedure for land use labelling is shown in Algorithm 1. The two thresholds of parameter P[0] and that of parameter P[1] are determined by experiments. Firstly, 10% of the parcels are randomly selected, and different classification results are obtained by manually assigning parameters. Then, the parameter combination (i.e., 0.6, 0.4 and 0.2) which led to the highest labelling accuracy is selected through manual interpretation. Thereafter, during the training and testing procedures, we feed all the aforementioned features obtained by feature engineering (in Section 2.2) into the XGBoost classifier and use five-fold cross-validation to evaluate model performance.
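The idea of labelling a parcel by its dominant land use can be illustrated with the simplified rule below. It is not Algorithm 1, which is not reproduced here: the single threshold `min_share`, its default of 0.5 and the function name `label_parcel` are hypothetical and do not correspond to the parameters P[0] and P[1].

```python
def label_parcel(area_by_use, min_share=0.5):
    """Return the dominant land use of a parcel, or None if no use dominates.

    area_by_use : dict mapping land use type to its area within the parcel
    min_share   : hypothetical minimum area share required for automatic labelling
    """
    total = sum(area_by_use.values())
    if total <= 0:
        return None
    use, area = max(area_by_use.items(), key=lambda item: item[1])
    # Parcels without a clear dominant use are left for visual interpretation
    return use if area / total >= min_share else None

clear = label_parcel({"residential": 7000, "commercial": 2000, "civic": 1000})
mixed = label_parcel({"residential": 4000, "commercial": 3500, "civic": 2500})
print(clear, mixed)
```

Returning `None` for mixed parcels mirrors the semi-automatic workflow described above, where ambiguous cases are resolved manually.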
Several trials are made to tune the hyperparameters of the XGBoost model to improve the final classification precision during our case study. For example, the maximum depth of the tree max_depth is set to 9, the learning rate eta is set to 0.01 and the number of iterations num_round is set to 5000. There are several reasons for selecting XGBoost as the land use classifier. First, XGBoost, as an ensemble learning model, makes a decision based upon a combination of multiple tree-based classifiers, and so is more robust than a single classifier. Second, due to the imbalance in the number of samples across land use categories, the boosting sampling method adopted by XGBoost is more suitable, since it focuses more on the misclassified samples in the categories with fewer total samples. This is conducive to maximizing the accuracy of each category. Third, XGBoost performs well on sparse data [39]. It is especially suited to the classification of fine-grained parcels, where sample data are sparse in the case study.
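The evaluation protocol can be sketched as a generic five-fold loop. The hyperparameter values below are those quoted in the text, while the `objective` and `num_class` entries are assumptions for a nine-class problem. To keep the sketch runnable without third-party packages, a majority-class stub stands in for the classifier; the helper names `k_fold_indices` and `cross_validate` are illustrative.

```python
import random
from collections import Counter

# max_depth, eta and the round count are quoted in the text;
# "objective" and "num_class" are assumptions for a nine-class task.
PARAMS = {"max_depth": 9, "eta": 0.01, "objective": "multi:softmax", "num_class": 9}
NUM_ROUND = 5000

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Mean accuracy over k folds for any fit/predict pair."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Stand-in classifier: always predicts the most frequent training label
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, X_test):
    return [model] * len(X_test)

X = [[float(i)] for i in range(10)]
y = ["residential"] * 8 + ["commercial"] * 2
mean_acc = cross_validate(X, y, fit_majority, predict_majority)
print(round(mean_acc, 3))
```

With the xgboost package installed, `fit` could instead wrap `xgboost.train(PARAMS, dtrain, num_boost_round=NUM_ROUND)` on a `xgboost.DMatrix` built from NumPy arrays with integer-encoded labels. That wiring is left out here and is not confirmed as the authors' exact setup.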

Classification Accuracy
In the empirical analysis, we apply the proposed method to infer the land use types of street blocks with fine granularity within the Fifth Ring Road of Beijing. The model performances for each land use type are reported in Figure 5. The average classification accuracy over the nine predefined land use types is 74.2%. Specifically, the classification accuracies of the educational and transport land use types are about 85%, which is relatively higher than those of the other land use types. In contrast, the classification accuracies of the commercial and civic land use types are lower than the average (about 65%), because these parcels are hard to differentiate from residential and natural land parcels. Considering that most of the land parcels within the case study area are residential, our model yields a good prediction accuracy for residential land use, i.e., as high as 76.2%. It is also noteworthy that, due to the effects of data sparsity and spatial heterogeneity, the classification accuracy at the coarse-grained spatial resolution (i.e., 80.4% for traffic analysis zones as an example) is slightly higher than that at the fine-grained spatial resolution (i.e., 74.2% for street blocks). We revisit this issue and discuss the differences between the model results of the two resolutions in the next section.
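Per-class accuracies and their average can be computed as in the sketch below. It assumes that the reported average is the unweighted mean of the per-class accuracies, which the text does not state explicitly. The toy labels and the function names are illustrative.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {c: hits[c] / totals[c] for c in totals}

def average_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies (assumed definition)."""
    acc = per_class_accuracy(y_true, y_pred)
    return sum(acc.values()) / len(acc)

truth = ["res", "res", "res", "res", "com", "com"]
preds = ["res", "res", "res", "com", "com", "res"]
class_acc = per_class_accuracy(truth, preds)
avg_acc = average_accuracy(truth, preds)
print(class_acc, avg_acc)
```

Because land use categories are imbalanced, this class-averaged figure can differ noticeably from the overall share of correctly classified parcels (4 of 6 in the toy data).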
Remote Sens. 2020, 08, x 9 of 18 Based on the spatial distribution of land use classification results (see Figure 6), we notice that: (1) The majority of residential land parcels are located within the Fourth Ring Road; (2) Commercial land parcels are mainly located near the Second and the Third Ring Roads and other main roads; (3) The majority of natural land parcels (such as green space, water, etc.) are located outside the Fourth Ring Road and in parks within the Fourth Ring Road; (4) The land parcels for education and research are mainly located in northwest part of the study area (i.e., Haidian District); (5) Industrial land parcels are mainly in the southern part of Beijing, and there is no large-scale industrial zone; (6) Public facilities are scattered across all the districts within the case study area. These spatial patterns indicate that Beijing has a huge built-up area and a high proportion of residential land, which reflects the population pressure compared with other cities in China. In addition, by comparing our model predications with the ground truth data labeled by Algorithm 1 in Section 2.3, we find that the spatial distribution of land uses according to our model's predictions is more cohesive, due to the derived physical and socioeconomic features, demonstrating significant spatial autocorrelations. In particular, civic land parcels are often misclassified as natural land at the periphery area due to the fact that civic facilities and small open/natural lands are usually colocated in space with each other. Additionally, commercial land parcels are often very small and mixed with residential and civic lands, making them hard to detect. The situation is similar for industrial land parcels; most of them are located in the periphery area, surrounded by natural lands. 
On the other hand, POI and check-in data are very sparse within these spaces, which undermines the benefit of integrating physical features and socioeconomic features to differentiate among certain land use types. Based on the spatial distribution of land use classification results (see Figure 6), we notice that: (1) The majority of residential land parcels are located within the Fourth Ring Road; (2) Commercial land parcels are mainly located near the Second and the Third Ring Roads and other main roads; (3) The majority of natural land parcels (such as green space, water, etc.) are located outside the Fourth Ring Road and in parks within the Fourth Ring Road; (4) The land parcels for education and research are mainly located in northwest part of the study area (i.e., Haidian District); (5) Industrial land parcels are mainly in the southern part of Beijing, and there is no large-scale industrial zone; (6) Public facilities are scattered across all the districts within the case study area. These spatial patterns indicate that Beijing has a huge built-up area and a high proportion of residential land, which reflects the population pressure compared with other cities in China. In addition, by comparing our model predications with the ground truth data labeled by Algorithm 1 in Section 2.3, we find that the spatial distribution of land uses according to our model's predictions is more cohesive, due to the derived physical and socioeconomic features, demonstrating significant spatial autocorrelations. In particular, civic land parcels are often misclassified as natural land at the periphery area due to the fact that civic facilities and small open/natural lands are usually colocated in space with each other. Additionally, commercial land parcels are often very small and mixed with residential and civic lands, making them hard to detect. The situation is similar for industrial land parcels; most of them are located in the periphery area, surrounded by natural lands. 
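The misclassification patterns described above (for example, civic parcels predicted as natural land) can be read off a table of (reference, predicted) label pairs. The following is a minimal sketch of such a tabulation, not the evaluation code used in this study; the function names and the toy labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (reference, predicted) label pairs; off-diagonal pairs are errors."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def overall_accuracy(counts):
    """Share of parcels whose predicted label equals the reference label."""
    total = sum(counts.values())
    correct = sum(n for (ref, pred), n in counts.items() if ref == pred)
    return correct / total if total else 0.0


def top_confusions(counts, k=3):
    """Return the k most frequent misclassified (reference, predicted) pairs."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical toy labels for eight parcels (not data from the study).
reference = ["civic", "civic", "civic", "natural",
             "residential", "commercial", "industrial", "natural"]
predicted = ["natural", "natural", "civic", "natural",
             "residential", "residential", "natural", "natural"]

counts = confusion_counts(reference, predicted)
print(overall_accuracy(counts))
print(top_confusions(counts))
```

Sorting the off-diagonal pairs by frequency makes the dominant confusions (here civic versus natural) explicit, which is the kind of evidence the paragraph above relies on.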

Analysis of Contributing Features
To further understand the primary features for determining different land use types, we inspect the land cover, building, POI and check-in information in land parcels. This analysis enables us to distinguish the features associated with each land use category. As shown in Figure 7a, the land cover categories in parcels of different land uses are distinct from each other. Specifically, the built-up area in transportation facilities, natural and agricultural lands is relatively low, while roads are predominant therein. In educational and residential land types, the area of impervious surface is also low compared with other man-made land use types. Based on the statistics of building footprints illustrated in Figure 7b, we find that commercial and financial lands are associated with the highest building volume rates and floor counts, followed by residential areas. In contrast, the building volume rates in ecological and agricultural areas are relatively low. In addition, individual buildings in transportation facilities are significantly larger than those in other land use parcels. Buildings in industrial areas are closer in age to one another, a consequence of urban planning. In contrast, transportation facilities are built gradually, along with the development of urban areas. Furthermore, the shapes of buildings in commercial and educational land use parcels are much more irregular and complex than those of buildings in other land use types. Figure 7c demonstrates that the proportions of different POIs in each land use type are also distinctive. On the one hand, certain POI types such as catering services are widely distributed across several land use types. On the other hand, certain POI types are strongly concentrated within parcels of a specific land use type. For instance, the majority of sports-related POIs are located in civic facilities and in educational and research-related land use parcels. In contrast, office and financial POIs are clustered in commercial areas. Moreover, temporal fluctuations of check-in activities show different patterns in different land use types; see Figure 7d. In commercial and financial land parcels, the volume of local user check-in activities stays at a very high level between 9:00 a.m. and 7:00 p.m., while the corresponding check-in volume in residential and educational land parcels peaks after working hours.
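The two socioeconomic descriptors discussed above, the proportion of POI types in a parcel and the temporal curve of check-ins, can be computed with simple counting. The sketch below is an illustration under stated assumptions rather than the feature extraction used in this study: the function names, the POI category strings and the toy check-in hours are hypothetical, and check-in times are assumed to be given as integer hours of the day (0 to 23).

```python
from collections import Counter


def poi_proportions(poi_categories):
    """Share of each POI category among the POIs inside one parcel."""
    counts = Counter(poi_categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def hourly_checkin_profile(checkin_hours):
    """24-bin check-in curve, normalized so that the bins sum to 1."""
    bins = [0] * 24
    for hour in checkin_hours:
        if not 0 <= hour <= 23:
            raise ValueError("hour must be an integer between 0 and 23")
        bins[hour] += 1
    total = sum(bins)
    return [b / total for b in bins] if total else [0.0] * 24


def peak_hour(profile):
    """Hour of the day with the largest share of check-ins (earliest on ties)."""
    return max(range(24), key=lambda h: profile[h])


# Hypothetical toy inputs for two parcels (not data from the study).
commercial_pois = ["office", "office", "catering", "finance"]
commercial_hours = [9, 10, 12, 12, 12, 14, 18]
residential_hours = [8, 19, 20, 20, 21, 20]

commercial_profile = hourly_checkin_profile(commercial_hours)
residential_profile = hourly_checkin_profile(residential_hours)

print(poi_proportions(commercial_pois))
print(peak_hour(commercial_profile), peak_hour(residential_profile))
```

Normalizing the hourly bins makes parcels with very different check-in volumes comparable, so that the shape of the curve, rather than its magnitude, is what distinguishes daytime-dominated from evening-dominated parcels.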

We assume that richer data sources provide more decision dimensions for the machine learning model to discover subtle differences between functional categories. To verify this point, we construct three feature sets. The first set uses Google remote sensing imagery alone. The second set combines Google remote sensing imagery with building footprint data to observe the performance improvement. The third set uses the full range of Google remote sensing imagery, building footprint data, POIs, check-ins and street view images. Our experimental results confirm the above conjecture: with the addition of multisource features, the classification accuracy of the model improves. The accuracy of the XGBoost classifier is 54.7%, 70.3% and 74.2%, respectively, on these three feature sets. In addition, to find the best classifier for this task, we also test various other classifiers, such as random forest. By comparing different classifiers, we find that the XGBoost classifier achieves the highest overall accuracy on these data sets. Apart from XGBoost, the best-performing classifier, random forest, achieves accuracies of 57.4%, 68.8% and 69.8%, respectively, on the three feature sets.
Closer scrutiny of the experimental results indicates that an important reason for the performance difference among the experiments lies in the data sparsity caused by small parcels. We therefore carry out prediction experiments at different scales of research units. On the coarser-grained scale of the TAZ (traffic analysis zone), we construct three feature sets as described above. On these three feature sets, the classification accuracy of XGBoost reaches 72.6%, 78.1% and 80.4%, respectively. The overall accuracy is improved on the TAZ scale, indicating that data sparsity is indeed an important factor affecting classification accuracy. The comparison results of the different experiments on the three feature sets are shown in Figure 8.
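The effect of adding data sources and of coarsening the spatial unit can be made explicit with simple arithmetic on the XGBoost accuracies reported above. The short sketch below only restates those reported figures as percentage-point differences; the variable and function names are hypothetical.

```python
# Overall XGBoost accuracies (%) as reported in the text, ordered by feature set:
# imagery only, imagery + building footprints, all data sources.
FEATURE_SETS = ["imagery", "imagery+buildings", "all sources"]
XGB_PARCEL = [54.7, 70.3, 74.2]   # parcel scale
XGB_TAZ = [72.6, 78.1, 80.4]      # TAZ scale


def stepwise_gains(accuracies):
    """Percentage-point gain from each feature set to the next richer one."""
    return [round(b - a, 1) for a, b in zip(accuracies, accuracies[1:])]


def scale_gap(fine, coarse):
    """Percentage-point gain of the coarse scale over the fine scale, per feature set."""
    return [round(c - f, 1) for f, c in zip(fine, coarse)]


parcel_gains = stepwise_gains(XGB_PARCEL)
taz_gains = stepwise_gains(XGB_TAZ)
gaps = scale_gap(XGB_PARCEL, XGB_TAZ)

for name, gap in zip(FEATURE_SETS, gaps):
    print(f"{name}: TAZ minus parcel = {gap} points")
print("parcel-scale gains:", parcel_gains)
print("TAZ-scale gains:", taz_gains)
```

On these figures, the gap between the TAZ and parcel scales is largest for the imagery-only feature set and shrinks as more data sources are added.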

Comparison with Alternative Models
Figure 9 compares the performance of our XGBoost-based model with that of several alternative models that have been widely applied for land use mapping. We divide these baseline models into two categories: (1) Supervised models. These models feed urban features with regard to the landscape metrics (e.g., rs-RF), the proportion of POIs and the temporal variation of check-ins (e.g., rs-poi-checkin-RF), and the building characteristics (e.g., building-RF, rs-building-RF) to the random forest classifier; (2) Unsupervised models. These include K-means clustering (e.g., poi-Kmeans, checkin-Kmeans) and LDA-based topic modeling (e.g., poi-LDA). As shown in Figure 9, our model outperforms these models for land use classification in the case study area. Under closer scrutiny, we find that, due to the relatively coarse spatial resolution of social sensing data, the POI and check-in-based models yield very low classification accuracies (i.e., less than 50%). In comparison, land cover and building information, the most popular data sources for land use mapping in the existing literature, yield a much higher classification accuracy (i.e., about 60% to 70%). Promisingly, our model achieves an additional 7% to 13% performance improvement by efficiently integrating physical and socioeconomic features using the XGBoost classifier. We believe that our model can serve as an effective approach for the extraction and combination of physical and socioeconomic features from both remote sensing and social sensing data.
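To make the unsupervised baselines more concrete, the following is a minimal sketch of a poi-Kmeans-style pipeline: parcels are clustered on their POI proportion vectors and each cluster is then scored against reference labels. Several points are assumptions rather than details given in the text: the plain Lloyd's algorithm with deterministic initialization, the majority-vote mapping from clusters to land use classes, the three POI dimensions (office, catering, sports) and the toy vectors are all hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def majority_vote_accuracy(assignment, labels):
    """Label each cluster with its most common reference class, then score it."""
    correct = 0
    for c in set(assignment):
        cluster_labels = [lab for a, lab in zip(assignment, labels) if a == c]
        correct += Counter(cluster_labels).most_common(1)[0][1]
    return correct / len(labels)


# Hypothetical POI proportion vectors (office, catering, sports) for seven parcels.
points = [
    (0.8, 0.2, 0.0), (0.1, 0.3, 0.6), (0.7, 0.3, 0.0), (0.0, 0.2, 0.8),
    (0.6, 0.3, 0.1), (0.2, 0.2, 0.6), (0.7, 0.2, 0.1),
]
labels = ["commercial", "civic", "commercial", "civic",
          "commercial", "civic", "residential"]

assignment = kmeans(points, k=2)
print(assignment)
print(majority_vote_accuracy(assignment, labels))
```

In this toy example the residential parcel has a POI mix similar to the commercial parcels and is therefore absorbed into their cluster, which illustrates why POI proportions alone may not separate mixed land uses; the toy result says nothing about the accuracies reported in Figure 9.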

Conclusions
In this study, we integrate physical and socioeconomic features from Google remote sensing images, street-view images, building data, POI data and Weibo check-in data to develop an ensemble learning model that infers fine-grained urban land use distributions at the street block level. The experimental results show that the land use classification accuracy of the XGBoost-based model is greatly improved compared with those of other state-of-the-art models, which include random forest classifiers, K-means clustering and LDA-based models, indicating that the proposed framework based on multisource data is an effective strategy for urban land use recognition. Specifically, the POI characteristics, land cover characteristics, architectural features, time curves and place scene categories extracted in this study differ significantly among land use types, indicating their good distinguishing abilities for urban land use classification. Our empirical experiments indicate that the model has high classification accuracy, strong discriminating ability and good robustness, and that it can be used to automatically generate urban land use maps in practice.
There are still some shortcomings that need to be overcome in the future. First, the current classification criteria of our model are limited to a few well-refined land use types. In future work, we need to extend the model's ability to identify more comprehensive land use types. Second, there are potential data quality problems in certain datasets, such as missing or incomplete data regarding certain building attributes. In addition, the spatial distribution of POIs and check-ins within each land use parcel is not effectively utilized in the current model. We look forward to further improving the classification ability and accuracy of the proposed ensemble model by refining the model architecture and improving the quality of data in future work.

Acknowledgments: This work was supported in part by Joint Laboratory for Future Transport and Urban Computing of AutoNavi.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.
Appendix A.1. Google Remote Sensing Images
Google Earth provides remote sensing images for 2018 with a resolution of 1 m. The land cover classification is implemented using TensorFlow on a Tesla K80 GPU. During the training phase, we collect remote sensing images with a resolution of 1 m from seven different regions, and utilize an Xception network pre-trained on the ImageNet-1k data set [48] as the backbone network of the DeepLabV3+ model.