Heavy rainfall can lead to widespread flooding along riverbank areas, especially in China, where it occurs nearly every summer; Chongqing City suffers from flooding along the Yangtze River that flows through it. Unmanned Aerial Vehicle (UAV)-based photogrammetry has proven to be a competitive way to collect timely data in riverbank areas, owing to its speed, convenience, and low cost in acquiring high-resolution images, high-accuracy orthoimages, and 3D maps [1], so it is well suited to riverbank monitoring in Chongqing, a mountainous city with a high population density that is also cloudy for most of the year.
However, for building detection and extraction from UAV images, building details strongly affect detection accuracy at very high spatial resolution. Although both simple and complex building patterns can easily be interpreted by visual inspection, buildings in remotely sensed imagery exhibit a wide variety of patterns and styles, which poses a serious challenge for automated extraction. Traditional remote sensing image processing approaches, such as classification methods (e.g., the Maximum Likelihood Classifier (MLC) and the Support Vector Machine (SVM)) [3], do not perform well on UAV images, because these methods rely on training samples to model the class features and achieve good classification performance only when certain criteria are met, such as a Gaussian distribution of the data. Processing time is also a problem when the training set is large. These limitations can be mitigated by using an Artificial Neural Network (ANN) [4]. Modern forms of ANN have been introduced, in particular Deep Learning (DL), which has shown strong performance in remote sensing applications such as scene classification, object detection, and semantic segmentation. In recent years, DL combined with UAV technology has been applied to many tasks, such as car detection [5], real-time scene understanding [7], and location identification and core fire area segmentation [8]. DL can exploit spectral, textural, geometrical, and contextual features to classify UAV images with much higher accuracy than object-based classification [9]. Zeggada et al. [10] proposed a Convolutional Neural Network (CNN) model combined with a multilabeling layer consisting of a customized thresholding operation to classify a grid of tiles in multi-labelled UAV images. As to semantic segmentation, Kemker et al. [11] evaluated simple classification algorithms (k-nearest neighbor (kNN), linear SVM, multi-layer perceptron (MLP), and spatial mean-pooling), spatial-spectral feature extraction methods (MICA and SCAE), and two Fully Convolutional Network (FCN) models (SharpMask and RefineNet) on a multispectral UAV dataset (RIT-18); their results showed that DL with RefineNet outperformed the other classification techniques. Liu et al. [9] compared traditional object-based classifications (SVM and Random Forest (RF)) with a Deep CNN (DCNN) on ortho-images and multi-view UAV images; the results showed that the DCNN achieved higher classification accuracy than the traditional approaches, and that the accuracy improved further when multi-view data were used. Furthermore, Nogueira et al. [12] evaluated several typical DL semantic segmentation methods, including pixel-wise (standard ConvNet), fully convolutional (Fully ConvNet), and deconvolutional (SegNet) networks, as well as their combination, called an ensemble of ConvNets, and concluded that the ConvNets reached the highest overall accuracy of 96%. As a robust and accurate technique, DL semantic segmentation performs well at extracting objects from UAV images, and multi-view images contribute significant improvement over orthoimages alone.
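The studies above are compared mainly through pixel-wise metrics such as overall accuracy. As a minimal illustration (a sketch, not code from any of the cited works), overall accuracy and per-class intersection-over-union (IoU) can be computed from a predicted and a reference label map as follows:

```python
import numpy as np

def segmentation_metrics(pred, truth, n_classes):
    """Overall accuracy and per-class IoU for two integer label maps."""
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    overall_acc = np.mean(pred == truth)
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (truth == c))
        union = np.sum((pred == c) | (truth == c))
        ious.append(inter / union if union else float("nan"))
    return overall_acc, ious

# Toy 4x4 maps with classes {0: background, 1: building}
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
pred  = np.array([[0, 0, 1, 1],
                  [0, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
acc, ious = segmentation_metrics(pred, truth, 2)
print(acc)      # 15 of 16 pixels correct -> 0.9375
print(ious[1])  # building IoU: 4 / 5 = 0.8
```

Overall accuracy alone can be misleading when one class dominates the scene (background pixels usually outnumber building pixels), which is why per-class IoU is commonly reported alongside it.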
Moreover, various network architectures have been proposed for DL semantic segmentation to improve building extraction accuracy, such as single patch-based CNN architectures [13], FCNs [15], and encoder-decoder architectures [16]. Xu et al. [18] improved building extraction accuracy using a new model, Res-U-Net, based on the deep residual network (ResNet) and combined with an object-oriented guided filter; the reported accuracy is around 5% higher than that of other standard models (SegNet, FCN). For better application of DL to UAV image processing, Zhuo et al. [19] proposed DL semantic segmentation on oblique UAV images to optimize building footprint extraction from OpenStreetMap (OSM).
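A distinguishing feature of encoder-decoder architectures such as SegNet is that each decoder upsamples its input using the max-pooling indices recorded by the corresponding encoder, rather than learning upsampling filters. The following NumPy sketch illustrates this pooling/unpooling idea on a single channel (an illustration of the mechanism only, not the actual SegNet implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """Encoder step: 2x2 max pooling that also records the argmax indices."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)  # flat index into x
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            di, dj = np.unravel_index(np.argmax(window), (2, 2))
            pooled[i // 2, j // 2] = window[di, dj]
            indices[i // 2, j // 2] = (i + di) * w + (j + dj)
    return pooled, indices

def max_unpool_2x2(pooled, indices, out_shape):
    """Decoder step: place each value back at its recorded position (SegNet-style)."""
    out = np.zeros(out_shape)
    out.ravel()[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 0., 1., 2.],
              [3., 4., 5., 6.]])
pooled, idx = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, idx, x.shape)
print(pooled)    # [[4. 8.] [9. 6.]]
print(restored)  # each maximum restored at its original position, zeros elsewhere
```

Reusing the encoder's pooling indices preserves the exact spatial locations of strong activations, such as building edges, which is one reason encoder-decoder networks recover sharper object boundaries than naive upsampling.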
DL methods require large numbers of training samples, and data complexity must be considered in the supervised learning process. Buildings along the Yangtze riverside in Chongqing exhibit a variety of patterns and styles, whereas the buildings in the few publicly available standard photogrammetry datasets, collected from different geographic locations, have different styles. Therefore, in this work, a set of building samples in the riverbank area is created and labeled for the supervised learning process, and DL semantic segmentation using the standard SegNet network architecture is evaluated. Moreover, the generality and advantages of the proposed DL procedure are further verified by applying it to two other standard datasets.