Integrating Aerial and Street View Images for Urban Land Use Classification

Urban land use is key to rational urban planning and management. Traditional land use classification methods rely heavily on domain experts, which is both expensive and inefficient. In this paper, deep neural network-based approaches are presented to label urban land use at the pixel level using high-resolution aerial images and ground-level street view images. We use a deep neural network to extract semantic features from sparsely distributed street view images and interpolate them in the spatial domain to match the spatial resolution of the aerial images; the two sources are then fused through a deep neural network to classify land use categories. Our methods are tested on a large, publicly available dataset of aerial and street view images of New York City. The results show that aerial images alone achieve relatively high classification accuracy, that ground-level street view images contain useful information for urban land use classification, and that fusing street view image features with aerial images improves classification accuracy. Moreover, we present experimental studies showing that street view images add more value when the resolution of the aerial images is lower, and case studies illustrating how street view images provide useful auxiliary information to aerial images to boost performance.


Introduction
Urban areas account for less than 2% of the Earth's land surface but accommodate more than half of the world's population, and the global urban population is still growing, estimated to reach five billion by 2030 [1]. This unprecedented urbanization leads to rapid changes of the urban surface. It is therefore of great significance to monitor urban land so as to provide essential information to decision makers to better manage our cities and support sustainable development.
Urban land use and land cover (LULC) maps are important tools for understanding and monitoring our cities, as they reflect macro properties of the urban surface. Specifically, land cover indicates the physical attributes of landscapes, such as forestry, grass, agricultural land, water bodies, and built-up areas. We also investigate the impact of aerial image resolution changes on semantic segmentation results. In Section 5, we discuss the classification results and present case studies on the improvement of accuracy from integrating street view images. Finally, we conclude in Section 6.

Land Use and Land Cover Classification
Remote sensing: Land use and land cover classification via satellite images has been extensively studied in the remote sensing community. Most related work has engaged with land cover classification [1,3,7], and the inference of specific land cover types normally relies more on spectral-based classification [2], because the spatial resolution of remote sensing images in visible bands is limited. With the development of geospatial technologies, very high resolution (VHR) satellite and aerial images have become more available, which enables the analysis of more spatial patterns in these images [2,11,12,13]. Albert et al. [14] use satellite imagery to identify urban land use patterns. Hu et al. [15] classify urban land use categories using remote sensing imagery given land parcels. Hernandez et al. [16] categorize land use by integrating spatial metrics and texture analysis. Lv et al. [17] use SAR imagery for urban land use and land cover classification. Most remote sensing-based LULC classification work focuses on spectral-based land cover classification; however, with the growing accessibility of VHR remote sensing imagery and ground-level geo-tagged proximate sensing data, there are more opportunities to infer the socio-economic properties of urban land use.
Proximate sensing: Traditional land use maps are produced by labor-intensive land surveys [1], which are time-consuming and expensive. To alleviate this, researchers have tried to infer land use from proximate sensing data. Pei et al. [18] use aggregated mobile phone data to conduct land use classification at the mesh-grid level. Zhu et al. [7] use ground-level geo-referenced images from Flickr for land use mapping based on a land parcel map. Antoniou et al. [19] investigate geo-tagged social media images as land cover input data and show that these data include useful information about land use and land cover. Torres et al. [20] show that ground-taken imagery contains more useful details than overhead imagery for fine-grained habitat classification. Zhang et al. [10] and Kang et al. [21] use street view images to classify building functions given building footprints. Tu et al. [22] and Cao et al. [23] couple mobile phone data and social media check-in data to infer urban functional zones. Yuyun et al. [24] use Twitter data to acquire dynamic land use maps. Tu et al. [25] and Liu et al. [26] demonstrate that public transport mobility data also reflect urban land use variation. These works demonstrate that ground-level geo-tagged data contain useful information for land use and land cover classification. However, because of the lack of a global view, most proximate sensing-based work relies on given land parcels as prior statistical units, which limits the application scenarios.
Multimodal data fusion: Remote and proximate sensing data carry macro overhead and micro ground-level information, respectively. Integrating them is believed to capture both kinds of information and therefore provide more insight into the distribution of urban land use than using one data source alone. Tu et al. [1] and Jia et al. [3] integrate satellite images and mobile phone positioning data to generate urban land use maps. Liu et al. [27] and Hu et al. [28] combine satellite images and POIs (points of interest) to classify urban land parcels, showing that social media data have the potential to augment LULC classification. Jendryke et al. [29] integrate SAR imagery and social media messages to acquire urban land use information. Some studies also fuse data of different views in terms of the physical appearance of the urban surface [8] to estimate geospatial functions [30,31], classify urban land parcels [10], and analyze land surface conditions [32,33]. Data of different sources and modalities carry different information about the targeted objects; however, fusing them directly is not easy because of the heterogeneity of data distributions and the varying demands of specific applications. Thus, it is of great value to develop methods that fuse data of different sources and modalities to improve urban land use classification. This paper therefore presents an effective method to extract features from street view images and fuse them with aerial images to categorize urban land use at the pixel level.
The Fully Convolutional Network (FCN) [41] is regarded as a milestone for DNN-based semantic segmentation. Since it was proposed to solve the pixel-level classification problem, more and more semantic segmentation research has focused on deep neural network methods. The network changes the architecture of ordinary deep convolutional neural networks for classification by replacing fully connected layers with convolutional layers, which enables dense pixel-level predictions; this paradigm is adopted by many DNN-based semantic segmentation methods that followed [5,42]. FCN has its drawbacks, the most significant being its pooling layers, which aggregate information and extract spatially invariant features. However, spatial information is crucial for semantic segmentation, since pixel-level predictions are to be made. To address this problem, two main architectures have been proposed: the first is the encoder-decoder architecture, such as U-Net [43], SegNet [42], and FC-DenseNet [44]; the other uses dilated convolutions, such as DeepLab [45,46,47]. Some studies also add a post-processing stage using a Conditional Random Field (CRF) [45,48].
Most breakthroughs in DNN-based semantic segmentation have happened on natural images [5]. However, remote sensing images are very different from ordinary natural images. Some progress has been made on satellite and aerial image segmentation using deep learning approaches [6]. Convolutional neural networks such as patch-based and pixel-to-pixel networks [13], the self-cascaded network [49], the hourglass-shape network [50], the gated network [51], and the dual multi-scale manifold ranking-based network [52] have been proposed for land cover mapping using very high resolution aerial images from the ISPRS 2D Semantic Labeling Challenge dataset [53]. Moreover, CNNs for multi-modal fusion of aerial images and DSM (digital surface model) data have also been investigated [11,12,54]. Most DNN-based semantic segmentation research focuses on land cover classification over a limited set of categories; the possibilities for categorizing land use have not been fully explored yet. Likewise, the integration of ground-level geo-tagged data remains to be examined. Our work enriches the research on urban land use classification with VHR remote sensing images, and further investigates methods for integrating ground-level data to improve classification results.

Methodology
To use ground-level street view images for urban land use classification, we propose an approach to construct ground feature maps from street view images and integrate them with remote sensing aerial images; the workflow is illustrated in Figure 1. Specifically, semantic features are first extracted from street view images, and ground feature maps are then constructed by interpolating those features in the spatial domain. After that, both aerial images and ground feature maps are taken as inputs to the proposed deep convolutional neural network, which fuses the two sources of data from different views. Finally, the segmentation results of coupling aerial and ground images are compared with those of using one source of images only.

Ground Feature Map Construction
In order to align ground-level street view images with overhead aerial images at the pixel level, we present a method to construct ground feature maps from street view images. There are two major steps: semantic feature extraction and spatial interpolation. The construction process is illustrated in Figure 2.

Semantic Feature Extraction
Deep neural networks have been reported to be effective at extracting useful semantic information from street view images [21,30,55]. In our study, semantic features of street view images are first extracted by Places-CNN, a deep convolutional neural network for ground-level scene recognition. The network is trained on the Places365 dataset [56], an image database of about ten million real-world scene photographs labeled with diverse scene semantic categories. The extracted semantic features reflect the semantic information of the scenes captured near the collection spots of the street view images, and they provide ground-level details for urban land use mapping.
As shown in Figure 2, locations with street view images are symbolized by blue dots, and there are four street view images facing different directions at each location, which together capture the panoramic scene of the spot. We first use a pretrained Places-CNN (without the last fully connected layer) to extract a 512-dimensional feature vector for each image, and then concatenate the four extracted feature vectors into a 2048-dimensional feature vector for each location. After that, principal component analysis (PCA) is used to compress the semantic information and reduce the dimension of the feature vector to 50, which finally produces the representational semantic features for the locations.

Spatial Interpolation
As can be seen from Figure 2, places with street view images are sparsely distributed along roads. However, street view images capture the scenes of nearby visual areas rather than single dots in space. It is thus important to project the semantic information of street view images onto the areas they cover from a top-down viewpoint. To form a dense ground-level feature map from sparsely distributed street view features, we use a spatial interpolation method based on two assumptions: (1) for a given location, nearer street view images are more important than those far away; (2) street view images only cover limited areas around their collection locations.
Based on the two assumptions, Nadaraya-Watson kernel regression is adopted to interpolate the features in the spatial domain. Nadaraya-Watson kernel regression is a locally weighted regression method that generalizes inverse distance weighting (IDW) to arbitrary weight functions [57]. The method uses a kernel function with a bandwidth, which not only satisfies the distance-decay assumption like IDW, but also accommodates the need to limit the impact of street view images to certain areas. The method is formulated as Equation (1):

$$f(x) = \frac{\sum_{i=1}^{k} w_h(x, x_i)\, f(x_i)}{\sum_{i=1}^{k} w_h(x, x_i)} \tag{1}$$

where, in our case, $f(x)$ is the value of the pixel centered at point $x$, $f(x_i)$ is the value of a nearby point $x_i$ (with street view images), the impact of $x_i$ on $x$ is measured by the weight $w_h(x, x_i)$, and $k$ is the number of nearby points.
To estimate the impact of nearby street view images on a pixel, we use a Gaussian kernel to calculate the weights. Considering the assumption of limited visual coverage of street view images, a distance threshold is set to exclude the impact of distant street view images and to reduce the introduction of possible noise. The kernel used to calculate the weights is shown in Equation (2):

$$w_h(x, x_i) = \begin{cases} \exp\!\left(-\dfrac{d(x, x_i)^2}{2h^2}\right), & d(x, x_i) \le h \\ 0, & d(x, x_i) > h \end{cases} \tag{2}$$

where $w_h(x, x_i)$ is the weight of the impact of point $x_i$ on the pixel at point $x$, $d(x, x_i)$ is the distance between them, and $h$ is the bandwidth of the Gaussian kernel, which is also used as the cutoff distance threshold.
Specifically, as Figure 2 shows, spatial interpolation is conducted on each of the 50 dimensions of the representational semantic features extracted from street view images; the densified ground feature maps thus obtained match the spatial resolution of the aerial images. Spatial interpolation smooths the semantic information while accounting for spatial dependency and the visual coverage of street view images. After the extracted semantic features are interpolated spatially, the ground feature maps are constructed and ready for fusion.
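The interpolation of Equations (1) and (2) can be sketched as follows. `interpolate_feature_map` is a hypothetical helper working in pixel coordinates; the paper measures distances in ground units with h = 30 m.

```python
import numpy as np

def interpolate_feature_map(points, feats, grid_shape, h=30.0):
    """Nadaraya-Watson interpolation of street view features onto a raster.

    points: (k, 2) array of (row, col) pixel coordinates of image locations
    feats:  (k, d) array of their d-dimensional semantic features
    grid_shape: (H, W) of the target ground feature map
    h: Gaussian bandwidth, also used as the cutoff distance
    Returns an (H, W, d) ground feature map (zeros where no point is in range).
    """
    H, W = grid_shape
    rows, cols = np.mgrid[0:H, 0:W]
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    # pairwise distances between every pixel and every street view location
    d = np.linalg.norm(pixels[:, None, :] - points[None, :, :], axis=2)
    w = np.exp(-d**2 / (2 * h**2))  # Gaussian weights, Equation (2)
    w[d > h] = 0.0                  # cutoff: distant images have no impact
    norm = w.sum(axis=1, keepdims=True)
    norm[norm == 0] = 1.0           # avoid division by zero off-road
    fmap = (w / norm) @ feats       # weighted average, Equation (1)
    return fmap.reshape(H, W, -1)

# toy example: two locations with 3-d features on an 8x8 grid
pts = np.array([[2.0, 2.0], [6.0, 6.0]])
fts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
gmap = interpolate_feature_map(pts, fts, (8, 8), h=3.0)
```

In the toy example, the pixel at (2, 2) recovers exactly the first feature vector, because the second location lies beyond the cutoff distance; pixels far from both locations remain zero.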

DNN-Based Data Fusion
After the construction of ground feature maps, we present a deep convolutional neural network-based method to couple aerial images with the produced ground feature maps. Our proposed method is based on SegNet [42] (see Figure 3), and an overview of the method is shown in Figure 4.

Semantic Segmentation Network
Because of the simplicity and effectiveness of SegNet in segmenting both natural and aerial images [11,42], we use it as the base network to implement pixel-level classification from aerial images and ground feature maps. The architecture of the network is illustrated in Figure 3. The network is composed of two major components: an encoder and a decoder. The encoder resembles the architecture of VGG-16 [35] without the fully connected layers, and is composed of five sequential convolutional blocks. Each block in the encoder performs convolution with a trainable filter bank to produce a set of feature maps. Batch normalization [58] and the element-wise rectified linear unit (ReLU), max(0, x), are then applied to the output of each convolutional layer. After the convolutions, max pooling is performed with a non-overlapping 2 by 2 window of stride 2, so the output is subsampled by a factor of 2. In the first two blocks, a max pooling layer follows two convolutional layers (each followed by batch normalization and ReLU activation), while in the remaining three blocks a max pooling layer follows three convolutional layers. The encoder extracts semantic features from the original input at the cost of location information, with the spatial resolution of the input reduced by a factor of 32 after the encoder.
The decoder also has five convolutional blocks, and its structure is symmetric to its encoder counterpart, except that max unpooling layers replace max pooling layers. Max unpooling is the reverse operation of max pooling: it upscales its input using memorized max pooling indices. Each block in the decoder performs max unpooling to upsample its input feature maps using the pooling indices memorized from the corresponding encoder block, producing sparse feature maps. Convolution is then applied to densify the feature maps, followed by batch normalization and ReLU activation. The final output feature map of the decoder is recovered to the same spatial resolution as the original input, with the number of channels equal to the number of classes to be predicted. Finally, the output feature map is fed to a Softmax layer to make pixel-level predictions. Feeding aerial images or ground feature maps to the network, we can then acquire the segmentation results, i.e., the required pixel-level land use classifications.
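The pooling-with-indices mechanics described above can be sketched with one encoder block and its mirrored decoder block. This is an illustrative SegNet-style fragment under assumed channel counts, not the exact five-block architecture.

```python
import torch
import torch.nn as nn

class EncBlock(nn.Module):
    """Conv-BN-ReLU layers followed by max pooling that memorizes indices."""
    def __init__(self, c_in, c_out, n_convs):
        super().__init__()
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        x = self.convs(x)
        x, idx = self.pool(x)  # halve resolution, remember argmax locations
        return x, idx

class DecBlock(nn.Module):
    """Max unpooling with memorized indices, then Conv-BN-ReLU to densify."""
    def __init__(self, c_in, c_out, n_convs):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)

    def forward(self, x, idx):
        return self.convs(self.unpool(x, idx))  # sparse map, then densify

# one encoder/decoder stage: the input resolution is recovered exactly
enc, dec = EncBlock(3, 64, 2), DecBlock(64, 64, 2)
x = torch.randn(1, 3, 64, 64)
y, idx = enc(x)   # downsampled feature map plus pooling indices
z = dec(y, idx)   # restored to the input resolution
```

Stacking five such encoder blocks and five mirrored decoder blocks, with channel widths following VGG-16, yields the SegNet layout of Figure 3.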

Data Fusion
To integrate aerial images and ground feature maps, we propose a deep convolutional neural network-based method to fuse them; an overview of the method is shown in Figure 4. The method is based on SegNet, composed of Encoder1 (without the fusion layer) and the Decoder (see Figure 3). However, the proposed network adds an extra encoder and a fusion strategy to fuse the two sources of data. Encoder1 is the main branch, designed to extract features from aerial images, while Encoder2 distills features from ground feature maps. Input aerial images and ground feature maps are fed into the two encoders separately; the outputs of the encoders are then stacked together at the fusion layer, and finally the fused feature maps are fed into the decoder to upscale and make the final pixel-wise predictions.
Encoder2 follows the structure of Encoder1; its depth depends on the level of fusion, comprising blocks one through five, respectively. Fusion can be implemented at five levels, each level corresponding to one convolutional block in Encoder2. The fusion strategy is to stack the corresponding levels of feature maps from Encoder1 and Encoder2. Specifically, we concatenate the feature maps produced by Encoder2 to the corresponding feature maps of Encoder1 along the channel dimension, as indicated by the dashed arrow in Figure 4.
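The concatenation fusion at one level can be sketched as follows. The channel counts are illustrative, and the 1x1 reduction convolution is our assumption of one way to restore the channel width the downstream layers expect; the paper itself only specifies the concatenation.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps produced at the same fusion level by the two
# encoders: Encoder1 processes the aerial image, Encoder2 the 50-channel
# ground feature map. Shapes are illustrative.
aerial_feats = torch.randn(1, 256, 64, 64)  # from Encoder1
ground_feats = torch.randn(1, 256, 64, 64)  # from Encoder2

# Fusion: channel-wise concatenation of the two feature maps.
fused = torch.cat([aerial_feats, ground_feats], dim=1)  # (1, 512, 64, 64)

# One possible follow-up (an assumption, not from the paper): a 1x1
# convolution mapping the stacked maps back to the original channel width.
reduce = nn.Conv2d(512, 256, kernel_size=1)
out = reduce(fused)  # (1, 256, 64, 64)
```

Because concatenation preserves both sets of activations untouched, the network can learn during training how much weight to give the aerial and ground features at the chosen fusion level.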
The encoder trades off location information against semantic information: shallow layers have more accurate location information, while deeper layers contain more comprehensive semantic information. Therefore, to find the fusion level that best balances location accuracy and semantic representational ability, we test stacking the output feature maps of the two encoders at different levels, i.e., the outputs of the five convolutional blocks of the encoders before the pooling operations.

Dataset
New York City (shown in Figure 5), located on the east coast of the US, is the most densely populated city in the country. It has a land area of 783.84 km² and a population of more than 8 million. The land use of the city is highly diversified, which poses great challenges for land use classification. New York City consists of five boroughs: Manhattan, Brooklyn, Queens, the Bronx, and Staten Island. Among them, Brooklyn is the most populous and Queens is the largest in land area. In our study, the major area of Brooklyn and a square area of Queens are selected as our study areas, highlighted in Figure 5 with different colors. In the experiments, we use a publicly available New York City dataset from [30]. The dataset consists of three types of data: high-resolution aerial images, corresponding land use maps, and sparsely sampled street view images.
(1) Aerial images. The aerial images are from Bing Maps [59], with a ground resolution of about 0.3 m (as shown in Figure 6). The aerial imagery is divided into small image tiles of 256 by 256 pixels for training and testing the deep convolutional neural networks. The dataset contains two subsets. The Brooklyn dataset covers the major area of Brooklyn borough, with 73,921 aerial image tiles in total. As we can see from Figure 6, a large portion of them are over water and are therefore discarded; 39,244 of the remainder are used as training data, and 4361 randomly selected tiles are used as validation data. The Queens dataset covers a square area in Queens borough and is used as the test set; it contains 10,044 aerial image tiles in total.
(2) Street view images. The ground-level images come from Google Street View [60], with four images in different heading directions at each place, i.e., north, east, south, and west. The field of view of each street view image is 90 degrees, so the four images capture the panoramic view of each location. The dataset we use covers Brooklyn and part of Queens borough in New York City. As we can see from Figure 7, red and green dots symbolize the locations where Google Street View images (GSVs) were sampled in the two boroughs, respectively. In Brooklyn, there are 139,327 locations with GSVs, four GSVs per location, giving a density of 790.06 points per square kilometer. In the Queens area, 154,412 street view images were collected at 38,603 places, with a spatial point density of 1167.66/km². An aerial image and its corresponding four GSVs (heading north, east, south, and west, respectively) are illustrated in Figure 7.

(3) Land use maps. In this study, we use land use maps as ground truth to train and test our method. The ground-truth segmentation labels are adapted from the GIS land use maps of the New York City Department of City Planning [61]. The original maps are categorized into 11 categories, shown in Table 1 with their OIDs (original IDs), documenting the primary land use at the tax lot level. To accommodate missing data and unlabeled areas, two extra categories are added: unknown and background. The adjusted land use types and corresponding descriptions (adapted from [10]) are listed in Table 1.

Evaluation Metrics
To evaluate the pixel-level classification results, we adopt overall pixel accuracy, the Kappa coefficient, mean IoU, and the F1 score as our evaluation metrics.

(1) Pixel accuracy:

$$PA = \frac{\sum_{i=1}^{n} x_{ii}}{N}$$

where $x_{ij}$ is the element in the $i$th row and $j$th column of the confusion matrix, $N$ is the total number of pixels, and $n$ is the number of classes.

(2) Kappa coefficient:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{\sum_{i=1}^{n} x_{ii}}{N}, \qquad p_e = \frac{1}{N^2} \sum_{i=1}^{n} \left( \sum_{j=1}^{n} x_{ij} \right) \left( \sum_{j=1}^{n} x_{ji} \right)$$

where $p_o$ is the observed agreement (equal to the pixel accuracy) and $p_e$ is the agreement expected by chance.

(3) Mean IoU:

$$mIoU = \frac{1}{n} \sum_{i=1}^{n} \frac{x_{ii}}{\sum_{j=1}^{n} x_{ij} + \sum_{j=1}^{n} x_{ji} - x_{ii}}$$

(4) F1 score:

$$F1_i = \frac{2\, p_i\, r_i}{p_i + r_i}$$

where $p_i$ and $r_i$ are the precision and recall of class $i$, respectively: $p_i = x_{ii} / \sum_{j=1}^{n} x_{ij}$, $r_i = x_{ii} / \sum_{j=1}^{n} x_{ji}$. $F1_i$ measures the segmentation result for class $i$.

The average F1 score is the mean of the per-class F1 scores over all $n$ classes and measures the overall segmentation quality.
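The four metrics can be computed directly from a confusion matrix; a minimal sketch, with the row/column convention stated in a comment (rows as predictions, matching a precision of x_ii over the row sum):

```python
import numpy as np

def metrics(cm):
    """Evaluation metrics from an n x n confusion matrix cm, where cm[i, j]
    counts pixels predicted as class i whose true class is j (so row sums
    are prediction totals; swap axes if your convention differs)."""
    cm = cm.astype(float)
    N = cm.sum()
    diag = np.diag(cm)
    pred = cm.sum(axis=1)                 # per-class prediction totals
    true = cm.sum(axis=0)                 # per-class ground-truth totals
    pa = diag.sum() / N                   # overall pixel accuracy
    pe = (pred * true).sum() / N**2       # chance agreement
    kappa = (pa - pe) / (1 - pe)          # Kappa coefficient
    iou = diag / (pred + true - diag)     # per-class IoU
    p, r = diag / pred, diag / true       # precision, recall
    f1 = 2 * p * r / (p + r)              # per-class F1
    return pa, kappa, iou.mean(), f1.mean()

# toy 2-class confusion matrix
cm = np.array([[50, 5],
               [10, 35]])
pa, kappa, miou, mf1 = metrics(cm)
```

For the toy matrix, the pixel accuracy is 0.85 and the two per-class IoU values are 50/65 and 35/50; degenerate cases (empty classes) would need guarding against division by zero.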

Study on Integrating Aerial and Street View Images
To explore the effectiveness of aerial and street view images, we conducted three groups of experiments: segmentation using aerial images only, using street view images only, and integrating aerial and street view images.
(1) Aerial images only. In this group, the input data include only aerial images, and the original SegNet is used to conduct the segmentation task.

(2) Street view images only. In this experiment, we first extract semantic features from GSVs and interpolate them in the spatial domain to acquire ground feature maps. We then use the spatially densified ground feature maps as inputs to SegNet, modifying the shape of the input filters to match the dimensions of the ground feature maps, and finally make the dense predictions.
(3) Integrating aerial and street view images. In this study, we fuse aerial images with the ground feature maps constructed from GSVs. We use the proposed method (described in Section 3) to fuse aerial images and ground feature maps and acquire the final segmentation results.

Implementation Details
In the experiments, the models are implemented in the PyTorch [62] framework. For training, we use the Stochastic Gradient Descent (SGD) optimization algorithm with an initial learning rate of 0.01, a momentum of 0.9, a weight decay of 0.0005, and a batch size of 16. The learning rate is divided by 10 after epochs 15, 25, and 35. In addition, cross entropy is used as the loss function; the encoders are initialized with VGG-16 weights pretrained on ImageNet [63], while the decoder is initialized with He initialization [64]. For street view images, we use the pretrained ResNet-18-based Places-CNN [56] to extract features and set the cutoff threshold to 30 m.
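These training settings translate directly into PyTorch; a sketch with a tiny stand-in model so the fragment is self-contained (the real model would be SegNet or the fusion network, and 13 is the class count from Table 1):

```python
import torch
import torch.nn as nn

# Stand-in for the trained network so this fragment runs on its own.
model = nn.Conv2d(3, 13, kernel_size=3, padding=1)

# SGD with the hyperparameters stated above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Divide the learning rate by 10 after epochs 15, 25, and 35.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15, 25, 35], gamma=0.1)
criterion = nn.CrossEntropyLoss()

# One fixed toy batch (batch size 16, 256x256 tiles, 13 classes).
x = torch.randn(16, 3, 256, 256)
target = torch.randint(0, 13, (16, 256, 256))

for epoch in range(2):  # training loop skeleton
    optimizer.zero_grad()
    loss = criterion(model(x), target)
    loss.backward()
    optimizer.step()
    scheduler.step()    # advance the learning-rate schedule per epoch
```

In the actual experiments the loop would iterate over the Brooklyn training tiles, and the encoder weights would first be overwritten with the ImageNet-pretrained VGG-16 parameters.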

Results
We trained and validated the networks on the Brooklyn dataset and tested them on the Queens dataset. For each group of experiments, i.e., aerial images only (aerial), street view images only (ground), and integration of the two sources (fused), we trained five instances of the same network with different input orderings, since previous experiments suggest that five instances are sufficient in most cases [65]; the average over these instances is taken as the final result. In addition, we ignore the unknown category in the final evaluation of classification results.
Furthermore, we experimented with fusion at different convolutional levels; the results show that fusing aerial and ground feature maps works best before the third pooling layer, indicating that the trade-off between semantic features and location information is best balanced in the middle of the encoder. As a result, we perform the fusion before the third pooling layer in our experiments.

(1) Overall Results
We performed three comparative experiments; the results are listed in Table 2. We can see the following. The segmentation results using overhead aerial images alone already achieve a relatively high pixel accuracy of 77.62% and 74.02%, Kappa coefficients of 72.50% and 68.01%, and average F1 scores of 61.96% and 51.86% on the Brooklyn validation and Queens test sets, respectively. Ground feature maps alone reach pixel accuracies of 54.31% and 32.94% on the two evaluation sets, which shows that the ground feature maps constructed from GSVs contain information useful for land use classification. In addition, the Kappa coefficients of 40.62% and 13.15% show that the results are fairly consistent and better than random.
Moreover, our proposed method of fusing overhead and ground-level street view images achieves an overall accuracy of 78.10%, a Kappa coefficient of 73.10%, an average F1 score of 62.73%, and a mean IoU of 48.15% on the Brooklyn validation set, all higher than those obtained using aerial images alone. Similarly, the corresponding evaluation scores (74.87%, 69.10%, 52.69%, and 39.40%) on the Queens test set also show an increase in accuracy. The improvement in the evaluation metrics implies that the integration of aerial and street view images exploits both overhead and ground-level information, which helps improve urban land use classification results.
In addition, our fused pixel accuracies (78.10% and 74.87%) are better than the results reported in [30] (77.40% and 70.55% on the Brooklyn and Queens evaluation sets, respectively). The results also improve markedly in mean IoU, at 48.15% and 39.40% compared with 45.54% and 33.48% in [30].
It is interesting to note that the standard deviations of the evaluation results are relatively small, with most values less than 1%, which indicates that the results are statistically stable. In general, the standard deviations of the overall metrics for the fused data are higher than those for aerial images only, which indicates that fusing aerial and street view images introduces more uncertainty than using aerial images alone. The standard deviations of the test results are higher than those of the validation results. This is reasonable, since the models are selected via the validation set, and the results on the validation set are therefore expected to be more stable than those on the test set.
It should also be noted that the overall average F1 score and mean IoU differ noticeably between the validation and test sets, by about 10% and 9%, respectively. This phenomenon is associated with the difference between our training-validation and test datasets, which lie in different boroughs of New York City. The training set covers the major area of Brooklyn borough and the validation set is randomly selected in the same borough, while the test set is a square area in Queens borough. The landscapes of the two adjacent boroughs are similar; however, they also vary slightly in certain land use categories as well as in building facades. This may explain why a neural network trained with data from one borough works on data from another borough, but with reduced accuracy.

(2) Per-class Results
The overall results reflect the average classification accuracy across all classes. To examine the variation across land use categories, we compare the F1 score of each class. The validation and test results on the Brooklyn and Queens datasets are shown in Tables 3 and 4, respectively. It can be seen from the tables that certain land use types, such as background, one and two family buildings, and multi-family elevator buildings, show significantly higher F1 scores than the average, which may be related to their high share of the area and their more distinguishable physical appearance. This is consistent with the fact that, in both datasets, the background class (roads, etc.) accounts for the largest portion of land, followed by easily recognizable residential areas, especially one and two family buildings, which are usually low-rise villas with big gardens. On the other hand, parking facilities and vacant land show considerably lower values than the average F1 score, which may be caused by the low share of pixels in these categories.
It should also be noted that, for some categories, the evaluation results vary significantly between Brooklyn and Queens; for example, open space and outdoor recreation achieves an accuracy more than 15% higher in Queens than in Brooklyn, which is related to the different urban landscapes of the two areas, since the area of this land use type in Queens is significantly larger than in Brooklyn. This suggests that deep neural networks are data-dependent, and their performance may vary across datasets.

Typical examples of segmentation results from the three comparative studies are shown in Figure 8. The first three rows are segmentation results on the Brooklyn validation set, and the other three rows are results on the Queens test set. As we can see, ground feature map-based segmentation is considerably distorted; however, the shape of roads is generally recovered. These results are in line with the fact that street view images are collected along roads and streets and thus contain sufficient information about roads. Using aerial images alone achieves much better results than using ground-level GSVs only, since the overall shapes of the areas are better presented in aerial images. Furthermore, fusing ground-level information with aerial images helps to refine the segmentation results in these cases: some misclassified areas are remedied and the results are more compact.

Study on the Impact of Aerial Image Resolution
Although the dataset we used contains very high resolution (about 0.3 meters per pixel) aerial images, we are interested in determining whether this high resolution really benefits the pixel-level land use classification results. We also want to investigate how ground-level street view images influence the segmentation results given aerial images of different resolutions.

Implementation Details
We decrease the resolution of the aerial images to different levels, thus acquiring auxiliary versions of the Queens test set at several image resolutions. An example of an aerial image at different resolutions is shown in Figure 9. The original aerial image tile size is 256 by 256 pixels; we first downsample the original tiles by factors of 2, 4, and 8, producing lower-resolution aerial images of 128 by 128, 64 by 64, and 32 by 32 pixels, respectively, as shown in Figure 9. The ground truth labels are also resized to the corresponding degraded sizes. We then use the degraded test sets for pixel-level classification. Specifically, we first upsample the degraded images back to the original size so that they match the input size of our proposed networks. The resized aerial images are then fed into the networks to obtain prediction results. Finally, we resize the segmentation outputs to the corresponding degraded size and evaluate the final results on those resized outputs.
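The degrade-upsample-predict-downsample procedure above can be sketched as follows. Nearest-neighbour resampling is an assumption here (the paper does not state which resampling filter was used), and `predict` stands in for the trained segmentation network:

```python
def resample(image, new_h, new_w):
    """Nearest-neighbour resampling of a 2-D grid (one image band)."""
    h, w = len(image), len(image[0])
    return [[image[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]

def degraded_prediction(tile, factor, predict, size=256):
    """Simulate testing on a lower-resolution aerial tile.

    1. Downsample the tile by `factor` (e.g., 2, 4, or 8).
    2. Upsample back to the network input size before prediction.
    3. Resize the prediction to the degraded size for evaluation.
    """
    small = resample(tile, size // factor, size // factor)   # degrade resolution
    restored = resample(small, size, size)                   # match network input size
    pred = predict(restored)                                 # network inference
    return resample(pred, size // factor, size // factor)    # evaluate at degraded size
```

Evaluating at the degraded size, against ground truth labels resized the same way, keeps the comparison fair across resolution levels.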

Results
The overall evaluation metrics of the prediction results using aerial images of different resolutions are shown in Figure 10. The horizontal axis is the level of resolution decrease, i.e., the factor by which the aerial images are downsampled from the original images. The vertical axis represents the overall accuracy, Kappa coefficient, and average F1 score (shown with different point shapes) on the Queens test set at different resolutions. In addition, the evaluation results of using aerial images alone and of integrating street view images are shown with dashed and solid lines, respectively.
As can be seen from Figure 10, the overall accuracy, Kappa coefficient, and average F1 score all display a similar decreasing pattern as the aerial image resolution declines. Furthermore, the evaluation values of using aerial and street view images together are higher than those of using aerial images alone, regardless of resolution. Specifically, for classification based on aerial images only, the overall accuracy decreases with falling resolution, and the drop is dramatic over the first two levels; this is not surprising, since the details of the aerial images change significantly at those levels, as seen in Figure 9. The decline slows thereafter. For classification based on fused data, the overall accuracy also decreases and follows a similar pattern; however, it remains higher than that of using aerial images alone. Moreover, with the help of extra ground-level information, the overall accuracy decreases more slowly; in other words, the contribution of street view images to accuracy is more significant when the resolution of the aerial images is lower. Similar patterns are observed for the Kappa coefficient. The average F1 score behaves somewhat differently: the values obtained by integrating both data sources are higher than those of using aerial images alone, but the improvement remains relatively stable as the aerial image resolution changes. These results indicate that higher-resolution aerial images yield better land use classification performance than lower-resolution images, and that ground-level street view images contain useful information for the classification regardless of the aerial image resolution.
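The overall accuracy and Kappa coefficient plotted in Figure 10 can both be computed from the same confusion matrix as the F1 scores. A minimal sketch (the 2-class matrix in the test is illustrative only):

```python
def overall_accuracy(confusion):
    """Fraction of correctly classified pixels."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def kappa(confusion):
    """Cohen's Kappa: agreement between prediction and ground truth beyond chance."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    po = sum(confusion[i][i] for i in range(n)) / total      # observed agreement
    pe = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n))
        for i in range(n)
    ) / (total * total)                                      # chance agreement
    return (po - pe) / (1 - pe)
```

Because Kappa discounts chance agreement driven by dominant classes such as background, it can drop faster than overall accuracy when minority classes degrade at low resolutions.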
In addition, it is interesting to note that the gap in overall accuracy and Kappa coefficient between using aerial images alone and integrating both data sources widens as the resolution of the aerial images decreases, which implies that, in general, street view images contribute more to prediction accuracy when the aerial image resolution is lower. This is an interesting finding, because the increasingly ubiquitous street view images may therefore be very useful for better interpreting low-resolution aerial or satellite images.

Discussion on Classification Results
It can be seen from our experimental results that using aerial images alone can achieve relatively high pixel-level classification accuracy, which suggests that deep neural networks are able to learn the mapping between different land use types and their inner spatial arrangements and patterns. Instead of relying on manually selected features, deep learning methods learn representational features from the given data automatically.
Furthermore, the ground feature maps constructed from street view images also include urban land use information, as demonstrated by the prediction results of using street view images only. We therefore expected a larger improvement in accuracy when integrating ground-level street view images with aerial images; however, the results are not dramatically improved, despite the increase in accuracy on both the validation and test sets. There are several possible reasons: (1) The coverage of ground-level information is limited, because the street view images are very sparsely distributed and only scenes near streets can be captured by the available street view images. Besides, in our study, spatial interpolation is used to project the semantic information of street view images, which incurs some loss of information. Although a cutoff distance threshold restricts the interpolation to the local visual areas of available street view images, and the weights satisfy a distance-decay assumption that limits the noise introduced by the interpolation, the operation may still introduce a certain level of noise and thus affect the final classification accuracy. In the future, better processing strategies for street view images will be explored. (2) The base neural network we used may limit the performance of the semantic segmentation results.
As the focus of the present study is to investigate methods for integrating different sources of information, specifically street view images and aerial imagery, for land use classification, we chose SegNet because of its simple and elegant architecture and its efficiency and effectiveness in both aerial and natural image segmentation, as shown in [11,42]. However, since the introduction of FCN, new DNN-based semantic segmentation architectures have emerged rapidly, and many alternatives to SegNet could be used in the context of this work. Segmentation networks with state-of-the-art performance may well improve the accuracy of our final results, and it would be interesting to compare the performance of different state-of-the-art CNN architectures in fusing the two sources of data in future work. (3) The two sources of data contain duplicated information, and the aerial images may already include much of what the street view images provide. The classification results using aerial images only achieve a relatively high accuracy, which suggests that aerial images contain most of the information needed for urban land use classification, so the addition of street view images improves the results, but not dramatically. In addition, street view images add more value when the resolution of the aerial images is lower, which also implies that the contribution of street view images to the classification results depends on the information provided by the aerial images.
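The distance-decay interpolation with a cutoff threshold discussed in reason (1) can be sketched as inverse-distance weighting. The exact weighting function, cutoff value, and feature dimensionality used in the paper are not specified here, so all of those are assumptions in this sketch:

```python
import math

def interpolate_feature(px, py, samples, cutoff=100.0, power=2.0):
    """Estimate a ground-feature vector at pixel (px, py) by inverse-distance
    weighting of nearby street view image features.

    samples: list of (x, y, feature_vector) for available street view images.
    Samples beyond `cutoff` are ignored; returns None where no street view
    image lies within the cutoff distance.
    """
    weights, feats = [], []
    for x, y, feat in samples:
        d = math.hypot(px - x, py - y)
        if d > cutoff:
            continue                      # restrict to local visual areas
        if d == 0:
            return list(feat)             # exact hit: use the feature directly
        weights.append(1.0 / d ** power)  # distance-decay weight
        feats.append(feat)
    if not weights:
        return None
    total = sum(weights)
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) / total
            for i in range(dim)]
```

The `None` return makes the sparse coverage explicit: pixels far from any street receive no ground-level feature, which is one reason the fusion gains concentrate near roads.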
Adding street view images achieves a modest improvement in average classification accuracy. This is not surprising, because these images are very sparsely distributed and only available along streets. Nevertheless, we have demonstrated that they provide useful information. In the next section, we present case studies demonstrating how street view images can significantly improve segmentation accuracy near the street areas where they were taken.

Case Study on Segmentation Refinement
As we can see from Figure 8, the integration of GSVs refines the segmentation results obtained using aerial images alone. These results indicate that ground-level street view images possess useful information for land use categorization. In addition, it is interesting to note that the refinement is concentrated near the roads. This is not surprising, since locations near roads are within the visual coverage of street view images and thus the street scenes can be captured. To examine the details and investigate the effects of street view images on segmentation results, two specific cases are studied; the street scenes and corresponding segmentation results are presented in Figures 11 and 12. Figure 11a presents a real-world scene in the Brooklyn borough corresponding to the third row of segmentation results in Figure 8. The center is the aerial view of the area, and the blue and yellow dots are locations where GSVs are available in this study area. The surrounding four images are Google street views collected at the yellow dot location, and the orientations of the GSVs are indicated by the four black hollow triangles (representing camera positions), which face north, east, south, and west, respectively. Figure 11c shows the segmentation results of the aerial image in Figure 11a. As we can see from Figure 11a, it is difficult to tell the buildings apart and determine their categories from the nadir view of the aerial image. This difficulty is also observed in the corresponding segmentation result using the aerial image alone (see Figure 11c), where the categories are misclassified. However, we are able to obtain more details from the four ground-level street view images.
It can be seen that the nearby buildings are three to four floors tall with store awnings on the ground floor (as Figure 11b shows), which indicates that the buildings are typically mixed use: commercial on the ground floor and residential on the upper stories. This finding is also in line with the land use map shown in Figure 11c, where the three major building areas are labeled as mixed residential and commercial buildings (in purple), and the segmentation result based on integrating aerial and ground images is better than that of using aerial images alone. Similarly, Figure 12a presents a scene in the Queens borough corresponding to the last row of segmentation results in Figure 8. As we can see in the aerial image, the roofs of the buildings in the overhead view are quite similar, making the buildings almost indistinguishable from the aerial image alone. However, we can observe great variation in the building facades from the ground-level street view images. To the north, south, and west, the buildings in the street view images are two- or three-story buildings, which are typical walk-up residential houses or apartments. In comparison, the building shown in the street view image to the east, as Figure 12b shows, is a one-story building with two typical store awnings (one red and the other blue), which are easily recognized from the street view image but invisible from the aerial view. Therefore, it can be inferred that the land to the east is used for commercial purposes while the land in the other directions is used for residential purposes. This is further confirmed by the segmentation result shown in Figure 12c. We can see that the bottom-right corner of the commercial and office buildings area is misclassified as residential when only aerial images are used; however, the segmentation result is corrected when ground-level street view images are integrated.
The two case studies demonstrate that street view images can make a significant contribution to improving segmentation results in the vicinity of the streets where they were taken. This suggests that the major contribution of street view images, as can be expected, lies in the visual areas that available street view images cover, rather than across a very large urban area.

Conclusions
Urban land use is of great significance to urban planning and management. Traditional urban land use mapping relies heavily on domain experts, which is labor-intensive and expensive. To alleviate this situation, we used a DNN-based method to label urban land at the pixel level, integrating aerial images and ground-level street view images. Our method has been tested on a large publicly available dataset of New York City. The results show that it is possible to predict urban land use from very high resolution overhead images with relatively high accuracy, that ground-level street view images contain useful information for land use classification, and that integrating street view images with aerial images can improve the pixel-level classification results. We have also examined the impact of aerial imagery resolution on land use classification; the results indicate that aerial imagery resolution is positively correlated with classification accuracy, and that street view images contribute more to classification accuracy when the resolution of the aerial images is lower. Furthermore, we have discussed the limitations of the study. Specific cases of the segmentation results have also been investigated, and the case studies further demonstrate that street view images can provide ground-level details that aerial images lack and help improve the results, especially in ambiguous situations near roads.
In the future, we plan to explore more sophisticated deep neural networks and other fusion strategies to further improve our segmentation results when fusing aerial and street view images. Although our methods successfully incorporate aerial and street view images, there is still room to improve the pixel-level classification accuracy. We also plan to integrate more sources of proximate sensing data, such as social media data and vehicle trajectories, to further improve the land use mapping results. Combining more sources of proximate sensing data would not only improve urban land use mapping results but also provide more insights for understanding our cities.
Author Contributions: R.C. and G.Q. conceived of the main idea; R.C., J.Z., W.T., and Q.L. developed the methodology and designed the experiments; R.C. and J.C. processed the data and conducted the experiments; R.C., B.L. and Q.Z. analyzed the results. The manuscript was written by R.C. and improved by the contributions of all the co-authors.