1. Introduction
Urban mapping techniques have been evolving rapidly through advances in computer algorithms and the integration of a wide range of satellite data. High-resolution urban mapping typically involves characterizing important features such as individual buildings, roads, open built-up areas, and urban trees and associated vegetation [1,2]. These small-scale urban features are best mapped using very high-resolution (VHR, <5 m spatial resolution) satellite data such as those from IKONOS, QuickBird, and the WorldView series of sensors [3,4,5]. For mapping tasks using VHR data, object-based image analysis (OBIA) is preferred over traditional per-pixel classification [6,7,8,9,10], because pixels within a homogeneous land-cover patch often exhibit heterogeneous spectral responses or high information content. Combined with various classification algorithms, OBIA has been routinely used to map detailed urban features with some success [4,11,12]. Previous studies have also demonstrated the advantage of fusing VHR imagery with LiDAR or synthetic aperture radar (SAR) data for urban-mapping applications [13,14,15].
Performance of VHR-based urban mapping depends on the choice of image classification algorithm. Currently, the most commonly used algorithms include support vector machines, random forests, feed-forward artificial neural networks, and radial basis function neural networks [16]. While each of these approaches shows promise for automatic urban mapping, they often require a diverse set of input data to function properly. For example, in addition to the original spectral bands of VHR data, rich sets of textural and structural features, such as gray-level co-occurrence matrix (GLCM) and wavelet textures, have been examined to improve urban mapping accuracy [2,16]. Relying on such hand-crafted features requires additional processing time and remote sensing expertise, and the features can be confounded by other landscape elements. It is also unclear whether the textural features developed in one study can be transferred to other study areas with similar results.
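As an illustration of the kind of hand-crafted texture feature referred to above, the sketch below computes two GLCM measures with scikit-image (version 0.19 or later). This is not part of the study's workflow; the gray-level count, distance, and angle are arbitrary illustrative choices.

```python
# Illustrative only: a hand-crafted GLCM texture feature of the kind discussed
# above, computed with scikit-image (not part of this study's workflow).
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_textures(band, levels=32, distance=1, angle=0.0):
    """Return GLCM contrast and homogeneity for a single quantized band."""
    # Quantize the band to a small number of gray levels to keep the matrix compact.
    quantized = np.digitize(band, np.linspace(band.min(), band.max(), levels)) - 1
    glcm = graycomatrix(quantized.astype(np.uint8), distances=[distance],
                        angles=[angle], levels=levels, symmetric=True, normed=True)
    return {"contrast": graycoprops(glcm, "contrast")[0, 0],
            "homogeneity": graycoprops(glcm, "homogeneity")[0, 0]}

# Example with a random 64 x 64 "image patch"
patch = np.random.randint(0, 255, (64, 64)).astype(float)
print(glcm_textures(patch))
```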
Deep convolutional neural networks (DCNNs) have recently shown great promise in the field of computer vision, following the landmark paper by Krizhevsky et al. [17]. DCNNs work by learning convolutions (or features) that best represent image classes through error minimization via backpropagation. For example, VGG16, a well-known DCNN designed by researchers at the Visual Geometry Group of Oxford University, achieved 92.7 percent (top-5) test accuracy on ImageNet with a 16-layer architecture [18]. This automated feature engineering is particularly appealing compared with traditional hand-crafted features that rely on domain knowledge [19]. While DCNNs were originally designed for large-scale image recognition [17,18,20], recent work has shown that fully convolutional networks (FCNs) and conditional random fields (CRFs) can be used for effective semantic segmentation, or pixel-wise classification [21,22]. The U-Net, a convolutional network with a distinctive contracting and expanding architecture, has shown great potential for biomedical image segmentation, or pixel labeling [23].
DCNNs have only recently transitioned into the field of remote sensing. Researchers are increasingly interested in using DCNNs and other deep learning models for scene classification, object detection, and land use and land cover classification [24]. For example, Zou et al. [25] applied transfer learning to VHR remote sensing imagery to classify 400 × 400 pixel samples into seven distinct scene types. Sun et al. [26] modified three DCNNs (AlexNet, VGG16, and ResNet50) to classify tree species and found that VGG16 performed best. Full scene classification has also been approached using SVMs on DCNN features [27], DCNN-CRF [28], and DCNN-FCN [29]. More recently, several studies evaluated U-Net for vegetation mapping and obtained very accurate map products [30,31,32]. The U-Net allows end-to-end training at the pixel level and is less demanding in terms of training sample size [23].
Although DCNNs have shown high potential for scene classification and object detection, their overall effectiveness in pixel-wise land use and land cover mapping is still unclear. Several recent articles suggest that applications of DCNNs to pixel-wise land cover mapping remain sparse and that their potential has not been fully explored [24,31]. For DCNNs and many other machine learning algorithms, it is also important to investigate machine perception that imitates human perception in order to improve learning and predictive performance. For a given image or scene, humans can rapidly distinguish the object of interest from the background or contextual information. It is thus potentially beneficial to enhance or synthesize observations in ways that improve computer understanding of remotely sensed images. Within DCNNs, image tiles are commonly used as input, and the object of interest (e.g., a building) is mixed with background features. We propose using contextual Gaussian blurring to reduce the influence of background features, an assisted machine perception method that has not been thoroughly examined in the literature, especially within the DCNN framework.
The main objective of this study is to investigate two DCNNs, U-Net and VGG16, for urban land cover mapping using VHR data. To support training and validation, we developed an urban land cover map product through a commonly used OBIA classification of WorldView-2 imagery. For the U-Net mapping, we were particularly interested in how classification performance varies with training sample size. For the VGG16 mapping, we designed a new object-based image classification approach that involves image segmentation, framing, and object recognition. For each image segment, we incorporated contextual Gaussian blurring to assist machine perception. The experiments were implemented for urban mapping of San Cristobal Island, one of the islands of the Galapagos archipelago of Ecuador.
2. Materials and Methods
2.1. Study Area and Data
The Galapagos Islands are a chain of islands known for their natural beauty, history, diverse wildlife, and conservation efforts. Over the last several decades, growing tourism and human migration have exerted increasing pressure on the fragile and sensitive island ecosystems [8,33]. San Cristobal is one of the four populated islands in the archipelago and has an area of 558 km². Its total population is around 7000. Most residents live in the port city, Puerto Baquerizo Moreno, while the smaller upland town of El Progreso, close to the agricultural zone, accounts for under 1000 residents. The port community is bounded by the Pacific Ocean, and the surrounding land is managed and controlled by the Galapagos National Park. As such, the community has limited space for urban development, with hard borders against the park limiting peripheral growth; hence, most new development occurs through urban in-filling and land swaps with the park. Urban structures are relatively small, although two-story residential units are common. Residential units are generally less than 30 square meters in area; larger structures, primarily hotels and associated commercial buildings, do occur but are generally close to the water's edge, where most tourist facilities are concentrated. Streets in the town are relatively narrow, and although most roads are paved, dirt roads persist.
The image used for this study was acquired by the WorldView-2 satellite on 1 December 2016. The multispectral product (2.4 m resolution) was converted to top-of-atmosphere (TOA) reflectance, and the subtractive resolution merge process was then applied to create a pan-sharpened image consisting of four spectral bands. Figure 1 shows the study area and the WorldView-2 image used for this research.
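For reference, the generic form of the radiance-to-reflectance conversion (the specific WorldView-2 calibration coefficients applied here are not reproduced) is

\[ \rho_\lambda = \frac{\pi \, L_\lambda \, d^2}{ESUN_\lambda \, \cos\theta_s} \]

where \(L_\lambda\) is the at-sensor spectral radiance, \(d\) is the Earth–Sun distance in astronomical units, \(ESUN_\lambda\) is the band-averaged solar irradiance, and \(\theta_s\) is the solar zenith angle.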
2.2. OBIA Image Classification
To support the DCNN-based urban mapping, we first implemented an OBIA image classification to derive a land use and land cover reference. This OBIA-derived map served as the primary benchmark dataset for DCNN training and validation. Using the WorldView-2 image as input, we applied the multiresolution segmentation algorithm within the eCognition Essentials software package to generate image objects. Among the adjustable parameters (scale, shape, and compactness), the scale factor is the most important in controlling image object size [34,35]. We examined a range of scale factors (75, 100, and 125) through a trial-and-error approach to determine the optimum value. The weighting between color and shape was set to 0.7/0.3 based on previous studies showing the relative importance of the color component [4,6]. Compactness and smoothness were assigned equal weights. After each segmentation, we visually assessed the image objects using the original WorldView-2 image as a reference. We found that a scale factor of 75 was adequate to separate building and road objects from surrounding areas while minimizing over-segmentation. A small building was typically represented by one image object, while a large building was divided into several homogeneous patches.
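The multiresolution segmentation used here is part of the proprietary eCognition package. As a rough, open-source stand-in (not the algorithm used in this study), the felzenszwalb segmentation in scikit-image exposes a comparable scale parameter, where larger values yield larger objects; the scale, sigma, and min_size values below are illustrative.

```python
# Rough open-source stand-in for the segmentation step; the study itself used
# eCognition's multiresolution segmentation, which is not reproduced here.
import numpy as np
from skimage.segmentation import felzenszwalb

def segment_composite(composite, scale=75):
    """Segment a (rows, cols, 3) band composite into labeled image objects."""
    # Larger 'scale' produces larger segments, loosely analogous to the
    # multiresolution scale factors explored here (75, 100, 125).
    return felzenszwalb(composite, scale=scale, sigma=0.5, min_size=20)

# Example with a synthetic three-band (e.g., NIR/red/green) composite
composite = np.random.rand(256, 256, 3)
labels = segment_composite(composite, scale=75)
print("number of image objects:", labels.max() + 1)
```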
Following the image segmentation, we conducted object-based image classification using a random forest algorithm. The mean value of each spectral band was used to represent each object in the classification. Six land cover classes were considered: building, road/open built-up, vegetation, beach, volcanic lava/soil, and water. A few locations with obvious cloud or shadow were manually masked out. Minimal post-classification manual editing was conducted to remove obvious classification errors; for example, some buildings were misclassified as roads and vice versa. Classification accuracy was assessed using 50 randomly selected image objects (polygons) per class for the three major urban classes of building, road/open built-up, and vegetation. The other three classes, beach, volcanic lava/soil, and water, were not significant components of urban land cover. We visually interpreted each polygon using the WorldView-2 image and Google Earth's very high-resolution imagery archive as references. Some reference polygons contained heterogeneous land cover types, in which case the dominant land cover was used as the label. An error matrix and accuracy statistics were generated by comparing the OBIA classification result with the visually interpreted land cover references.
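A minimal sketch of this object-based random forest step, assuming scikit-learn (the study's exact implementation and forest size are not specified), might look as follows; per-object mean band values serve as features and a confusion (error) matrix summarizes accuracy.

```python
# Minimal sketch of the object-based random forest classification, assuming
# scikit-learn; the features and labels below are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

def object_mean_bands(image, labels):
    """Mean value of each spectral band for every image object (segment)."""
    n_objects, n_bands = labels.max() + 1, image.shape[-1]
    counts = np.bincount(labels.ravel(), minlength=n_objects)
    means = np.zeros((n_objects, n_bands))
    for b in range(n_bands):
        sums = np.bincount(labels.ravel(), weights=image[..., b].ravel(),
                           minlength=n_objects)
        means[:, b] = sums / np.maximum(counts, 1)
    return means

# Placeholder per-object features (4 mean band values) and class labels
# (0 = building, 1 = road/open built-up, 2 = vegetation, ...)
X_train, y_train = np.random.rand(300, 4), np.random.randint(0, 3, 300)
X_test, y_test = np.random.rand(150, 4), np.random.randint(0, 3, 150)

rf = RandomForestClassifier(n_estimators=500, random_state=0)  # forest size is illustrative
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))            # error matrix
print("overall accuracy:", accuracy_score(y_test, y_pred))
```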
2.3. Image Classification Using U-Net
The U-Net was developed by Ronneberger et al. [23] for localized pixel classification in biomedical image segmentation. The U-Net architecture (Figure 2) features a contracting path, where convolution and max pooling operations extract image context, and an expanding path, where up-sampling and convolution provide sequential localization. Localization is enhanced by integrating the features extracted along the contracting path at each spatial scale. This design allows end-to-end training at the pixel level and shows very good performance on many image segmentation tasks [23,36]. The U-Net is particularly appealing for the remote sensing community because it provides a class label for individual pixels instead of labeling whole scenes; this localization is key for land cover mapping and change detection tasks.
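The sketch below is a compact re-implementation of this contracting/expanding design in PyTorch. The study itself used MATLAB's deep learning tools, and the depth and filter counts here are illustrative rather than the exact configuration used.

```python
# Minimal U-Net sketch in PyTorch (illustrative re-implementation of the
# contracting/expanding design; not the study's MATLAB network).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    def __init__(self, in_channels=4, n_classes=3, base=32):
        super().__init__()
        # Contracting path: convolution + max pooling extract context
        self.enc1 = conv_block(in_channels, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        # Expanding path: up-sampling + convolution recover localization,
        # with skip connections from the contracting path at each scale
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                        # (N, n_classes, H, W) logits

# A 256 x 256 four-band tile in, per-pixel class logits out
logits = UNet()(torch.zeros(1, 4, 256, 256))
print(logits.shape)   # torch.Size([1, 3, 256, 256])
```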
The four-band WorldView-2 image was divided into northern (50%) and southern (50%) sub-regions, which were used as the training and testing sets, respectively. These areas were chosen to balance the need to provide the network with sufficient training data against the need to test its overall generalization. The classification map derived from the OBIA method (Section 2.2) was used as the labeled image (or target) to support U-Net training and testing. The original classification scheme included six land cover classes: building, road/open built-up, vegetation, beach, volcanic lava/soil, and water. We masked out beach, volcanic lava/soil, and water for the U-Net classification, because these classes are either not significant components of urban land cover or have very limited spatial coverage (i.e., beach).
From the training set (the northern sub-region), we randomly extracted image tiles with sample sizes ranging from 32 to 4096 to evaluate how sample size affects classification performance. Each tile has four spectral bands of 256 by 256 pixels. For a given training set, the image tiles were further divided into batches (batch size = 32) during U-Net training. The final layer of the U-Net applies a pixel-wise softmax activation combined with the cross-entropy loss function. The U-Net was trained using stochastic gradient descent with momentum (SGDM). The initial learning rate was set to 0.05, and the gradient clipping threshold was set to 0.05 to improve the stability of network training. The maximum number of training epochs was set to 10, because the training accuracy typically saturated after 3–5 epochs. To reduce potential overfitting, we recorded cross-validation (20% hold-out) accuracy for each training epoch. The trained U-Net with the best cross-validation accuracy was then applied to the southern sub-region of the study area to generate pixel labels.
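A hedged transcription of this training configuration into PyTorch is shown below (SGDM with an initial learning rate of 0.05, gradient clipping at 0.05, pixel-wise cross-entropy, batch size 32, up to 10 epochs). The momentum value and the random placeholder tiles are assumptions, and the 20% hold-out bookkeeping is only noted in comments.

```python
# Sketch of the training configuration described above, transcribed to PyTorch;
# the study itself used MATLAB's SGDM training options.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = UNet(in_channels=4, n_classes=3)                     # from the sketch above
criterion = nn.CrossEntropyLoss()                            # pixel-wise softmax + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)  # momentum assumed

# Hypothetical stand-in for the 256 x 256 training tiles and their label rasters
tiles = torch.rand(64, 4, 256, 256)
labels = torch.randint(0, 3, (64, 256, 256))
loader = DataLoader(TensorDataset(tiles, labels), batch_size=32, shuffle=True)

for epoch in range(10):                                      # maximum of 10 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Clip gradients to stabilize training (threshold of 0.05 per the text)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)
        optimizer.step()
    # In the study, 20% hold-out cross-validation accuracy was recorded per epoch
    # and the best-performing model was kept; that bookkeeping is omitted here.
```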
2.4. Image Classification Using VGG16
Data preparation for the VGG16 urban mapping included three basic steps: segmentation, framing, and labeling. The WorldView-2 image had previously been segmented using the multiresolution segmentation algorithm within the eCognition Essentials software package (scale factor of 75, Section 2.2). Once the image was segmented, an image database was created by framing an individual scene around each object, regardless of object size. Frame dimensions were determined by considering natural image perception: smaller objects require larger frames to be placed in context, while larger objects (such as roads and large vegetation patches) require little to no framing for identification (see Table 1). Object size was computed by averaging its width and height in the spatial x and y domains. Objects smaller than 50 pixels were assigned a window size of 75 × 75 pixels, the smallest resolution at which objects can be placed in context and identified by the human eye. Objects between 50 and 500 pixels were given a frame dependent on their size; a scale factor was developed that linearly interpolates between a 50-pixel object receiving a 75 × 75 window and a 500-pixel object receiving a 500 × 500 window, as sketched below. Objects larger than 500 pixels were given framing equal to their own width and height.
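The framing rule can be summarized by a small helper, sketched below; treating the window for objects larger than 500 pixels as a square sized to the object's larger dimension is our reading of the rule rather than a detail stated explicitly.

```python
# Sketch of the framing rule described above: object "size" is the mean of its
# width and height, small objects get a fixed 75 x 75 window, mid-sized objects
# are interpolated linearly, and large objects receive no extra framing.
import numpy as np

def frame_size(width, height):
    """Return the square window size (pixels) used to chip an object's scene."""
    size = (width + height) / 2.0
    if size < 50:
        return 75                                   # smallest window still interpretable
    if size <= 500:
        # Linear interpolation: size 50 -> window 75, size 500 -> window 500
        return int(round(np.interp(size, [50, 500], [75, 500])))
    return int(round(max(width, height)))           # frame equal to the object itself

print(frame_size(20, 30), frame_size(200, 300), frame_size(600, 800))
```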
To test our hypothesis that machine perception can be assisted by highlighting the pertinent object, two separate image databases were produced: one in which each object's background was blurred using a Gaussian filter (σ = 1, 5 × 5 filter size) and one with no augmentation. Figure 3 compares an example image chip with and without the Gaussian assistance.
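A sketch of this blurring step is shown below, using OpenCV as an illustration (the study's own implementation is not specified here); the chip and object mask are hypothetical placeholders.

```python
# Sketch of the contextual Gaussian blurring: the background of each chip is
# blurred (sigma = 1, 5 x 5 kernel) while the object's own pixels stay sharp.
import cv2
import numpy as np

def blur_background(chip, object_mask, sigma=1.0, ksize=(5, 5)):
    """Blur everything outside the object mask; keep object pixels unchanged."""
    blurred = cv2.GaussianBlur(chip, ksize, sigma)
    mask = object_mask.astype(bool)
    out = blurred.copy()
    out[mask] = chip[mask]                # restore the sharp object pixels
    return out

# Hypothetical 75 x 75 three-band chip with a mask marking the object's pixels
chip = np.random.rand(75, 75, 3).astype(np.float32)
mask = np.zeros((75, 75), dtype=np.uint8)
mask[25:50, 25:50] = 1
assisted = blur_background(chip, mask)
```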
Finally, the image database was labeled according to the class labels from the OBIA classification results.
Training was performed on the VGG16 architecture using the deep learning package in MATLAB R2020a. To capitalize on the millions of images the network has already been exposed to, transfer learning was performed on the image databases using the network pre-trained for the ImageNet challenge. The network architecture was identical to the original, except that the final layer's output was reduced from 1000 classes to 3 in order to classify the three object types of interest (building, road/other built-up, and vegetation). Because the shallower layers contain basic image feature information, such as edge and color detection, that is shared between the ImageNet and remote sensing datasets, the learning rates for these earlier layers were set to zero, freezing their weights during transfer learning [37]. For the last fully connected layer, the learning rate was set to 1 × 10⁻⁵. This enabled fine-tuning of the deeper network layers while more dramatically altering the weights of the last fully connected layer, which assigns class probabilities.
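An analogous transfer-learning setup in PyTorch/torchvision is sketched below (the study used MATLAB's pre-trained VGG16); the momentum value and the random input tensor are illustrative.

```python
# Analogous transfer-learning setup in torchvision: freeze the early layers
# and retrain a 3-class output head (not the study's MATLAB implementation).
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional feature-extraction layers (learning rate of zero
# in the paper's MATLAB setup)
for p in vgg.features.parameters():
    p.requires_grad = False

# Replace the final 1000-way layer with a 3-way layer for
# building, road/other built-up, and vegetation
vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, 3)

# Small learning rate for the layers that remain trainable (1e-5 in the text);
# the momentum value is an assumption
optimizer = torch.optim.SGD(
    (p for p in vgg.parameters() if p.requires_grad), lr=1e-5, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Input chips are 3-band (near-infrared, red, green) images resized to the
# 224 x 224 input expected by VGG16
logits = vgg(torch.rand(2, 3, 224, 224))
print(logits.shape)    # torch.Size([2, 3])
```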
The original VGG16 uses 3-band RGB images as input. For our study, we selected the near-infrared, red, and green bands as the input channels, because certain land cover classes (e.g., vegetation cover) are best mapped by including the near-infrared band. As with the U-Net urban mapping, the northern portion of the WorldView-2 image was used as training data. Once trained, the VGG16 network predicted labels for the segmented test dataset from the southern portion of the image. The spatial location of each object was retained so that object labels could be mapped directly onto the georeferenced scene.
4. Discussion
Currently, a major challenge and opportunity for remote sensing researchers is to develop image analysis approaches that take full advantage of DCNNs designed for computer vision and other pattern recognition fields. Our study serves as one such experiment. The U-Net, combined with an OBIA-derived land cover map as training data, performed very well at high-resolution urban mapping: buildings, road/open built-up, and vegetation were identified with an overall accuracy of 87.8 percent. The OBIA classification itself contained errors; for example, the class-specific accuracies for building and road/open built-up objects were close to 85 percent, and building boundaries in certain residential areas were not well defined because of the combined limitations of the sensor data, the segmentation procedure, and the classification algorithm. It is interesting to note that U-Net generated accurate building and road products using an OBIA-derived reference of only reasonable accuracy, rather than reference data acquired through visual interpretation or manual digitizing [31,38]. This points to the utility of traditional, shallow machine learning methods for efficiently generating training data, particularly for the express purpose of evaluating various deep learning schemas.
With an OBIA-derived reference map, image tiles can easily be extracted for U-Net training. With a relatively small training sample size (e.g., 32–256 tiles), U-Net generated a detailed urban map with acceptable accuracy. However, multiple trials may be required to generate acceptable results, and it is difficult to know in advance which training tiles will lead to acceptable generalization performance. For this study, a sample size of 2048 image tiles produced consistent cross-validation performance across training epochs, suggesting that trial-and-error learning can be significantly reduced by increasing the number of training samples. The pixel count for 2048 image tiles (256 by 256 pixels each) is approximately 12.5 times the total number of pixels in the training image (2300 by 4649 pixels). For computational efficiency, U-Net training is best accomplished using GPUs. We used GeForce GTX 1080 GPUs from the advanced research computing facility of the University of North Carolina at Chapel Hill; with multiple GPUs, training, validation, and testing could be accomplished within tens of minutes. The availability of GPUs is clearly important when a large amount of U-Net training and testing needs to be run, and this remains a limiting factor for adopting the U-Net as a routine tool in the general remote sensing community.
For VGG16-based urban mapping, an OBIA approach using pre-training segmentation and Gaussian assistance was used as a shortcut for scene classification. Our experiment suggested that the pre-trained VGG16 with transfer learning was not as good as U-Net for detailed urban land cover mapping. For example, this paradigm produced more errors of commission for road/open built-up than either the OBIA or the U-Net classification. The spatially clustered commission errors were most obvious in a testing area dominated by vegetation cover, even though the WorldView-2 image has sufficient resolution to distinguish the two classes. There are several possible explanations. We relied on a pre-trained VGG16 and transfer learning, and some image segments in the vegetation class may have VGG16-derived features very similar to those of the open built-up class; the fine-tuning appeared insufficient to separate those segments. We conducted additional experiments to re-train the entire VGG16, but the limited training sample (several thousand image objects) did not warrant full network training. It should be noted that DCNNs such as VGG16 typically require a large amount of meticulously labeled reference imagery for training, and larger image databases could lead to improved classification accuracy.
With this paradigm of VGG16 mapping, performance depends on how well the objects derived through segmentation represent real objects in space. For this study, we used the multiresolution segmentation algorithm within the eCognition Essentials software package (scale factor of 75, Section 2.2) to generate image segments. A more representative pre-DCNN segmentation algorithm may yield better accuracies for VGG16 mapping, and the selection of segmentation algorithm and associated parameters calls for future research. The OBIA-derived land cover map has approximately 87 percent object-level accuracy, and the direct use of such a noisy product for VGG16 training may introduce high uncertainty into the predictions. The availability of high-quality and sufficiently large training datasets therefore remains a major challenge for applying DCNNs to detailed urban mapping. In our experiment, the VGG16 could not generate usable urban map products without Gaussian blurring. This indicates that Gaussian blurring was essential to DCNN performance: it assisted machine perception in much the same way that it guides the human eye, suggesting that, for the present, machine perception is limited to roughly what the human eye can visually distinguish.