Deep Learning-Based Semantic Segmentation of Urban Features in Satellite Images: A Review and Meta-Analysis

: Availability of very high-resolution remote sensing images and advancement of deep learning methods have shifted the paradigm of image classiﬁcation from pixel-based and object-based methods to deep learning-based semantic segmentation. This shift demands a structured analysis and revision of the current status on the research domain of deep learning-based semantic segmentation. The focus of this paper is on urban remote sensing images. We review and perform a meta-analysis to juxtapose recent papers in terms of research problems, data source, data preparation methods including pre-processing and augmentation techniques, training details on architectures, backbones, frameworks, optimizers, loss functions and other hyper-parameters and performance comparison. Our detailed review and meta-analysis show that deep learning not only outperforms traditional methods in terms of accuracy, but also addresses several challenges previously faced. Further, we provide future directions of research in this domain.


Introduction
Land-use/land-cover (LULC) maps are often generated from medium resolution satellite images like Sentinel [1] and Landsat [2]. These images are useful to classify land cover classes like built area, residential area, vegetation surface, impervious surface, water, etc. However, to prepare LULC maps for urban areas, objects like cars, individual buildings and trees, etc. needs to be classified. When extracting urban features or urban land cover information from aerial images, spatial resolution is considered being more important than spectral resolution. In other words, a finer-resolution image pixel is more useful than a greater number of spectral bands or narrower interval of wavelength [3]. This is the reason why commercial satellite images and unmanned aerial vehicles (UAV) are now more popular as they aim to increase the visibility of terrestrial objects, especially urban features, by reducing per-pixel size. With the increase in spatial resolution, more urban objects are now clearly visible in satellite images, and studies shifted its paradigm from spectral image classification, pixel-based image analysis (PBIA) and object-based image analysis (OBIA) to and most recently, pixel-level semantic segmentation. In this paper, we will analyse the advancement of deep learning-based semantic segmentation for urban LULC.
In early PBIA methods, the pixel size was not fine enough to identify an object in an image. As reported by Cowen et al. [4], a 4-m wide object needs a minimum of 2 by 2 m spatial resolution (i.e., minimum of four pixels is required). A 4 m wide object however does not locate perfectly over four pixels in a 2 m spatial resolution image. As the spatial resolution increased, the spectral response from different small objects in the urban area started to show complex patterns. This was because many objects are made of the same or similar material (e.g., cemented rooftops, cemented parking lots and cemented sidewalks; grasses and shrubs; etc.), and they emit a similar spectral response [5]. This is where the traditional classification methods and per-pixel classifiers (e.g., maximum likelihood) did not perform effectively because they used pixel-level spectral information alone as a fundamental basis to classify remote sensing images, and ignored spatial and contextual information [6]. Also, PBIA methods tended to produce "salt and pepper" noise after classification. To overcome these drawbacks, OBIA methods were studied.
Unlike PBIA, OBIA starts with the generation of segmented objects at multiple levels of scale as fundamental analysis units [7][8][9]. An object (aka. segment) in this approach is a group of contiguous homogeneous pixels with similar texture, spatial and/or spectral attributes. An image is initially divided into segments, various attributes of the segments are then computed and some rules are built to classify features. These rules are based on attributes such as geometry (length, area), size and textures etc., thus helping to differentiate features in an image. e.g., differentiating lakes and rivers based on length, trees from grass based on texture, separating building and road based on area, etc. For many years, OBIA was considered the better approach because it prioritized contextual information of an object [10]. Supervised and unsupervised learning classifiers methods were produced for the classification task, and PBIA was considered not as useful as OBIA to classify very high-resolution (VHR) images, until recently, after deep learning-based methods were started to be explored for pixel-based semantic segmentation of VHR images. In recent years, the use of UAVs to collect images and open-source/non-commercial software [11] to prepare orthomosaic have opened up a promising future towards the increased use of UAVcollected images. Despite the development of a large number of PBIA and OBIA-based methods proposed in the last two decades, these frameworks had several drawbacks [3] and complexities due to classification errors and imbalance in classes, which limited the widespread application. With recent end-to-end deep deep learning-based semantic segmentation, LULC has seen rapid progress in the classification of VHR images compared to traditional PBIA and OBIA methods.
Deep learning (DL) [12,13] architectures are modern machine learning methods that have increased the performance and precision of computation by increasing the number of "layers" or "depths". DL allows fast and automatic feature extraction from an adequately large dataset, iteratively using complex models to reduce classification errors in regression [14]. In recent years, DL has become a core method in many researches in remote sensing, such as plant recognition [15], plant disease recognition [16,17], weed detection [18], crop type classification [19,20], crop counting [21], natural hazard modeling [22][23][24], land cover classification [25] and also uncertainty modeling [26]. DL architectures like convolutional neural networks (CNN) [27,28] and Fully Convolutional Network (FCN) [29] are commonly used methods for the segmentation of urban images. The most successful state-of-the-art DL architecture is FCN [30,31]. The main idea of this approach is to use a CNN as a powerful feature extractor while replacing the fully connected layers with convolution ones to output spatial maps instead of classification scores. Those maps are up-sampled to produce dense per-pixel output. This method allows training CNN in the end to end manner for segmentation with input images of arbitrary sizes. This approach achieved a notable enhancement in segmentation accuracy over common methods on standard dataset like Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) Visual Object Classes (VOC) [32]. However, the architecture of the FCN network is often changed to solve different challenges faced during semantic segmentation of urban satellite images.
A meta-analysis is performed in this paper to review the research papers that used DL-based methods to answer several research questions. A few review and meta-analysis papers have been recently published on a broader scope of the study, mostly focusing on a brief summary of recent trends on DL in remote sensing applications [33,34]. However, these studies do not provide detailed analysis in the domain of deep learning-based semantic segmentation of urban remote sensing images, which has been competitively and rigorously studied in the last 5 years as shown by our meta-analysis. Several studies improved image segmentation on urban features by designing efficient DL architectures, however there is a lack of a complete review on the technical details of the architectures experimented. In this paper, we present detailed and tabulated analysis and discuss how the recent papers have tried to address the challenges faced during the shift of paradigm to semantic segmentation such as: ignorance of spatial/contextual information by CNNs, boundary pixel classification problems, class imbalance problem, domain-shift problem, salt-and-pepper noise, structural stereotype, insufficient training and other limitations.
The rest of the paper is structured as follows: Section 2 presents literature on DL-based semantic segmentation of urban satellite images; Section 3 presents the complete metaanalysis on DL-based semantic segmentation; Section 4 provides discussions; and Section 5 concludes the paper with future directions. Furthermore, Appendices A-C are presented to display (i) the meta-analysis in structured summary, (ii) a separate performance comparison of papers that use the most abundant datasets and (iii) a list of the available dataset for DL-based semantic segmentation of urban images, respectively.

Semantic Segmentation in Remote Sensing
Semantic segmentation can be defined as the process of assigning a semantic label (aka. class) to each coherent region of an image. This coherent region can be a pixel, a subpixel, a super-pixel, or an image patch consisting of several pixels. Per-pixel segmentation classifies pixels by either assigning a single label to each pixel for high-resolution images, or assigning class membership on lower-resolution images because the resolution is not enough to contain an object [35]. Several parametric classifiers such as maximum likelihood classifiers and non-parametric classifiers such as artificial neural networks (ANN), support vector machine (SVM), decision tree classifiers, expert systems, etc. have been used in the past for per-pixel segmentation [10,[36][37][38]. While the majority of ANN-based research was conducted for per-pixel classification in earlier days [39][40][41], sub-pixel classification of impervious surfaces were also highly studied later [42][43][44][45][46]. In sub-pixel classification, pixels are further classified into a fraction of the pixel size. Soon as the discovery of ANN, various networks were studied in early 2000s: Multi-layer Perceptron (MLP) [47][48][49][50], Adaptive Resonance Theory (ARTMAP), Self-Organizing Map (SOM) [51][52][53] and Hopfield Neural Networks. Superpixel is another coherent region for semantic segmentation of urban images [54][55][56]. These regions are first segmented on images using methods like Simple Linear Iterative Clustering (SLIC) [57] or superpixelization [58] to generate coherent regions at sub-object level.
In patch-based semantic segmentation, classifiers are trained on image patches as a single label and make predictions in a similar fashion. A sliding window is used to extract patches from the input images as bounding boxes around the object, which are further forwarded to predict label [21,23]. Multi-scale inference and re-current refinements performed significant gains on their study, and is also supported by other studies for scene labeling [59,60]. Patch-based segmentation performs well for object detection, but for the tasks like LULC, per-pixel methods often outperform the former method [55,61]. Some even consider patch-based methods wasteful because of redundant operations performed on adjacent patches [31].
As the resolution of satellite images increased, problems like loss of spatial information and imbalance of class distribution due to many small objects visible in an image resulted in many studies on pixel-based semantic-segmentation. Most studies in recent years have used DL-based semantic-segmentation to accurately label each pixel to a specific class like buildings, roads, etc. [62], which is further detailed in the next subsection.
The use of blocks consisting of separate architecture on its own and ensemble of the outputs using softmax layer is a general framework for CNN. Steps parametric rectified linear unit (PReLU) [71], dropout [72], batch normalization (BN) [73], and network optimizers like stochastic gradient descent (SGD) [74] and Adam Optimizer [75] are widely used to accelerate CNN training process. A standard structure of CNN consists of an input layer, convolutional layers, pooling layers, a fully connected layer and last the output soft-max layer. The output of each filter for a convolutional layer is calculated as where w l m,n and b l m are the weights and bias term of mth filter of lth layer and f (x) denotes nonlinear activation function. A most commonly used activation function in CNN is rectified linear unit (ReLU) [71]. The weights and bias of each filter are passed to every location of the input feature map such that a model learns from the regrets of having applied the filter overlaid over any location of feature maps. A pooling layer like averagepooling and max-pooling are commonly used to provide shift-invariance by reducing the resolution of feature maps. The consecutive use of multiple convolutional and pooling layers produces smaller and abstract 2D feature maps, which are then connected by a sequence of fully connected layers that transform them into 1D features. The last softmax layer makes the final predictions.
CNNs are effective for object detection, scene-wise classification and feature extraction. However for pixel-based semantic segmentation, the use of pooling layers diminishes many features and the resulting feature maps and predictions cannot achieve the required PBIA. In the next section, we will talk about how FCN has improved the drawbacks of CNN.

Fully Convolutional Network (FCN)
When Long et al. [29] first proposed FCN in 2015, it achieved state-of-the-art semantic segmentation. What made it more efficient than CNN is that fully connected layers in a network for aerial image classification purposes can be considered as convolutions with kernels that cover their entire input region. Mou et al. [76] have considered this as equivalent to evaluating classification network on patches with overlapping regions. As the computation runs across images by overlapping the regions, FCN achieves better efficiency. And due to the presence of a pooling layer, feature maps obtained from FCN are then upsampled. CNN's pooling layer aggregates information and extracts spatial-invariant features that are crucial for pixel-level semantic segmentation. A general framework of FCN therefore consists of two parts: encoders and decoders. Encoders are inspired from [66] and are similar to CNNs that extracts feature maps, and decoders transform these features into dense label maps, whose size is the same as the input image.
The first proposed FCNs were FCN-8s, FCN-16s and FCN-32s [29]. Instead of using simple bilinear interpolation, they used transpose convolutional layers to upsample the deep feature maps into labeling results. This improved the performance of classification resulting in finer predictions, but new challenges were born: (i) transpose convolutional layers were computationally expensive because of its hunger towards memory [77], (ii) they were difficult to train and (iii) the resulted classification was poor around the object's boundary [76]. Several studies were thereafter performed to overcome these drawbacks.   [78] introduced atrous convolutions in FCN, removing the max pooling layers, to expand the field of view with fewer parameters. To improve the classification errors around the boundary of label maps, they used conditional random fields (CRF) after FCNs. Ji et al. [79] introduced atrous convolutions on the first two layers of decoding steps to enlarge sight-of-view and integrate semantic information of large buildings, which are aggregated with multi-scale aggregation strategy. The last feature maps of each layer make predictions that are concatenated in the final prediction. Chen et al. (2019) [80] used FCN called DeepLabv3 [81] with Resnet backbone and atrous convolution layers, augmented Atrous Spatial Pyramid Pool (AASPP) with 4 dilated convolutional layers for multi-scale information and fusion layers to concatenate and merge the feature maps.
A mean-field approximate inference was used for CRF by Zheng et al. [82] with Gaussian pairwise potentials as recurrent neural fields (RNNs) to train an end-to-end CRF-as-RNN with unary energy of CRF. Lin et al. [83] later predicted both unary and pairwise energy using a deep structured model, which achieved significant performance on PASCAL VOC 2012 dataset. Most recently,   [84] combined two pixellevel predictions-a pretrained FCN-8s fine-tuned and re-trained on 3-band image and probabilistic classifier called multinomial logistic regression (MLR) trained on LiDAR data-as unary potential modeled as CRF. They also used segments obtained from a gradient-based segmentation algorithm (GSEG) into a higher-order CRF called P N Potts' model to resolve ambiguities by exploiting spatial-contextual information, and finally use graph-cut to efficiently infer the final semantic labeling for their proposed higher-order CRF (HCRF) framework. Apart from integrating CRF with FCNs, other network architectures were also designed for better classification, such as ResNet [85] based FCN [86,87].
Pyramid pooling module was proposed by Zhao et al. [88] in 2017, which was applied onto ResNet-based architecture to obtain clues on semantic categories distribution by parsing global information using large kernel pooling layers. Yang et al. [62] proposed end-to-end DL architecture to perform pixel-level understanding of high spatial resolution RS images, based on both local and global contextual information. They used the pyramid pooling module to collect multi-level global information. Low-dimension context representations were then upsampled by bilinear interpolation, thus obtaining representations with the same size as the original feature map. Finally, they concatenated multi-level global features and the last-layer convolutional feature map for the pixel-wise prediction. Yu et al. [89] proposed a pyramid scene parsing network (PSPNet) by incorporating cheaper network building blocks to extract multi-scale features for semantic segmentation. PSPNet comprised of two parts: (i) a CNN based on ResNet101-v2 to extract features by encoding input images into feature maps and (ii) a pyramid pooling module inspired from SPPNet [90] to extract features at multiple scales and upsample the feature maps to learn global contextual information by concatenating the multi-scale features. The final segmentation result is achieved by convolution operation on the concatenated feature maps. Later,   [91] compared the use of ResNet-101 as the base structure for three baseline methods applied on roof segmentation: (i) feature pyramid network (FPN) (ii) FPN with multi-scale feature fusion (MSFF) and (iii) PSPNet; to observe the highest performance from PSPNet. Different approaches of data fusion such as multi-modal, multi-scale and multikernel data fusion have also been practiced to improve the performance of FCN on RS images [56,80,[91][92][93][94][95][96][97][98]. In a distinctive approach, some studies focused on the symmetry of the encoder-decoder structure, which is discussed in the next subsection.

Symmetrical FCNs with Skip Connections
Semantic segmentation requires both contextual information (object-level information) as well as low-level pixel data. Besides the pixel-level information, how to utilize contextual information is a key to formulate better semantic labeling. Contextual relationships therefore provide valuable information from neighborhood objects. In a symmetrical encoder-decoder FCN architecture, upper layers of encoders encode object-level information and lower layers capture rich spatial information. This information can become invariant to factors such as pose and appearance. And with skip connections, such networks can use both lower and higher-level information for finer predictions. This concept also helped the original U-Net architecture of Ronneberger et al. (2015) [99], in which each layer produces independent predictions. Multiple layers of convolutions, ReLU and max pooling combine the outputs of lower layers and higher layers to generate the final output. BN and ReLU are heavily applied to accelerate training and avoid vanishing gradient problems. These layers when trained using model optimizers such as SGD and back propagation (BP) of error, U-Net learns patterns. Finally, a softmax function generates the segmentation results. The model soon became widely used for medical image segmentation and also some studies can be found in urban images segmentation [100][101][102]. Yi et al. [103] proposed DeepResUnet for effective urban building segmentation at pixel-scale from VHR imagery. Their modification to the original U-Net included ResBlocks of two 3 × 3 and one 1 × 1 convolutional layers and ReLU activation functions. Compared to the original U-Net, DeepResUnet achieved better performance with fewer parameters, but with longer inference time. Yue et al. [104] proposed TreeUNet with DeepUNet [105] and Tree-CNN block to improve the differentiation of easily confused multi-classes. Tree-based CNN was later also used by Robinson et al. [106] to improve segmentation results with decision trees. Liu et al. (2020) [107] used two branches of modified U-Net to align feature distributions in the image domain and wavelet domain simultaneously, which were later merged to predict the final classification results. Diakogiannis et al. [108] proposed ResUnet-a, which uses U-Net's encoder-decoder backbone, in combination with residual connections, atrous convolutions, pyramid scene parsing pooling and multi-tasking inference. SiameseDenseU-Net of Dong et al. (2020) [109] used two similar parallel DenseU-Nets [110] to alleviate boundary detection and class imbalance problems.
Another widely used symmetrical FCN is SegNet [111], which improved boundary delineation with minimum parameters by reusing pooling indices. Segnet is widely used in the domain of urban feature extraction, which can be seen later in Section 3.2.4. Many modified SegNet or used it as a backbone architecture [92,[112][113][114][115] and some composed a new network using multiple FCNs like SegNet and U-Net [116,117]. Some used a network with multiple encoders of SegNet [113].
Several other symmetrical FCN networks have been proposed for urban image segmentation: DeconvNet [118], gated semantic segmentation network (GSN) [87], a network with boundaries obtained from edge detection [112] and SharpMask [119]. To boost the speed of segmentation compared to their original network called DeepMask [120], Sharp-Mask produced features instead of independent predictions at all layers. Kemker et al. [121] adapted SharpMask and compared the results to RefineNet with ResNet-50 architecture, to observe similar performance results on urban feature classification. RefineNet [122] is similar to U-Net, but introduces several residual convolutional blocks inside the encoder and decoder. DeepLab networks-DeepLabv3 [81] and DeepLabv3+ [123] are other commonly used FCNs [104,124].   [125] introduced two semantic segmentation frameworks: Symmetrical normal shortcut fully convolutional networks (SNFCN) and Symmetrical dense-shortcut fully convolutional networks (SDFCN), both of which contain deep FCN with shortcut blocks. In order to properly fuse multi-level convolutional feature maps for semantic segmentation, Mou et al. [76] proposed a novel bidirectional network called "recurrent network in fully convolutional network" (RiFCN). Li et al. (2019) [126] used two separate sub-networks to make a Y-shaped network, whose final predicted feature maps are concatenated and merged using convolutional layers. The first sub-network is an FCN with skip-connections and the second sub-network is a network of 13 convolutional layers without downsampling layers to avoid loss of information. Sun et al. [115] proposed a novel residual architecture called ResegNet to solve the problem of "structural stereotype" and "insufficient learnings" that encoder-decoder architectures face.

Generative Adversarial Networks (GAN)
Most CNN based methods perform well when they are tested on the images that are similar to the train images, but fail when the domain shifts to a new domain. To address this problem and reduce the domain shift impact caused by imaging sensors, resolution and class representation, Benjdira et al. [127] modified the original GAN [128]. GAN consists of two models: generator and discriminator. A generator is trained to generate fake data to fool the discriminator, whereas discriminator is trained to differentiate between fake and real data. In Benjdira's GAN, the generator follows a symmetrical encoder-decoder architecture and encoder consists of 4 convolutional layers for down-sampling and Leaky ReLU as an activation function. The output features extracted are passed to the decoder to rebuild the original feature vector. Decoder consists of four convolutional layers for up-sampling, standard ReLU as activation function and 'dropout' to reduce over-fitting. Instead of batch normalization, they used instance normalization. On the other hand, the discriminator consists of 5 convolutional layers that encode the generated image into a feature vector of a size of 256 × 256. Then, the sigmoid activation function in the last layer converts this feature vector into a binary output. Similar to the generator, Leaky ReLU activation function and instance normalization are used. The overall method can be summarized into the following steps: (i) Train segmentation model on source dataset. (ii) Train GAN to efficiently translate images from the source domain to the target domain. (iii) Convert the source dataset to the target domain using GAN, producing a new dataset with conserved structures of the source dataset but mimicking the global characteristics of the target dataset. (iv) Fine-tune the already trained segmentation model of step 1 with translated dataset associated with the source labels. Lin et al. [129] proposed an unsupervised model called multiple-layer feature-matching GAN (MARTA GANs) to learn a representation using only unlabeled data to increase the label dataset and Zhan et al. [130] designed a semi-supervised framework for hyper-spectral image data based on a 1-D GAN (HSGAN). Liu et al. (2020) [107] proposed a bispace alignment network called BSANet for domain adaptation (DA) and for automated labeling. BSANet uses two discriminators of modified U-Nets to minimize the discrepancy between the source and target domains. The use of GAN to solve the domain shift problems is limited and is expected to be studied more in the upcoming research works.

Transfer Learning
In order to increase the performance of a DL-model with fewer training samples and less computational power, some studies have used transfer learning [131]. Transfer learning allows the transfer of knowledge gained from solving one problem to be used for similar problems. The method has been quite popular in the studies that lack enough training images and labels [14,132,133]. Panboonyuen et al. [96] proposed to segment urban features in RS images using a global convolutional network (GCN) with channel attention blocks and domain specific transfer learning to transfer between learning obtained by training on VHR image to medium resolution images. Du et al. (2019) [134] performed semantic segmentation of crop (vegetation) area on RGB aerial images of Worldview-2. A DeepLabv3+ model pre-trained on ImageNet dataset was retrained on their dataset with image-GT (ground truth) label pairs. Compared to modern methods like U-Net, PSPNet, SegNet and DeepLabv2, and traditional machine learning methods like maximum likelihood (ML), SVM, and random forest (RF), their re-trained DeepLabv3+ model obtained the highest performance. Wurm et al. [135] segmented slum areas on RS images, in which they use transfer learning between models trained on different resolution images. Transfer Learning was done in two groups: (i) FCN based on VGG19 architecture that was pre-trained on ImageNet dataset transfers learnings and weights to three FCNs, trained on images collected from QuickBird (FCN-QB), Sentinel-2 (FCN-S2) and TerraSAR-X (FCN-TX). (ii) The learning of FCN-QB of the first group of the experiment was again transferred to FCN-S2 and FCN-TX. Some transfers produced better performance than others. Some studies use transfer learning to test their model in different ablations [79].

Methods and Data for Review
Our methodology for bibliographical analysis in the domain under study follows three steps: (a) collection of related works (b) thorough study and (c) detailed metaanalysis. For the collection of related works, title search was performed in IEEE Xplore, ScienceDirect and Google Scholar using the search query [["semantic segmentation"] OR ["pixel-level classification"] AND ["urban feature classification"] AND ["satellite imagery"]] on 21 December 2020. Also, more papers were downloaded from International Society for Photogrammetry and Remote Sensing (ISPRS)'s 2D Semantic Labeling Contest's leaderboard for Vaihingen (Click the http://www2.isprs.org/commissions/ comm2/wg4/vaihingen-2d-semantic-labeling-contest.html (accessed on 21 December 2020) to go to ISPRS's leaderboard for Vaihingen dataset.) and Potsdam (Click the http: //www2.isprs.org/commissions/comm2/wg4/potsdam-2d-semantic-labeling.html (accessed on 21 December 2020) to go to ISPRS's leaderboard for Potsdam dataset). A total of 122 papers related to pixel-based classification of urban features were first downloaded.
Secondly, with thorough study, the papers that adapted object-based methods, traditional machine learning methods, performed vehicle segmentation and the studies that used terrestrial scene images, multi-view stereo (MVS) and Lidar for data collection were filtered out. Also, the technical reports that were not peer-reviewed papers were filtered out to select 71 papers that studied semantic segmentation of VHR satellite images for urban feature classification using DL-based methods. The 71 papers come from different countries. The countries of the first author's affiliation are shown in Figure 1. In the third step, the 71 papers filtered in the previous step were studied carefully to look for the trends and contributions of the 71 papers on the following five major research questions: 1.
What are the study targets? 2.
What are the data sources and datasets used? 3.
How is the training and testing data prepared for deep learning? This question looks for pre-processing, preparations and augmentation methods if used.

4.
What are the training details? This question looks for architecture, backbone, framework, optimizer, loss function and hyper-parameters that are mentioned by the papers?

5.
What is the overall performance of the method? This question looks for the performance metrics used, methods used for comparison, highest performance in each area of study and performance gains over previous methods.

The Study Targets
The 71 papers can be divided into 4 study targets as building/roof segmentation, road segmentation, urban feature segmentation and urban land cover classification. Some papers intend multiple targets. The distribution is shown in Figure 2, which shows fewer studies on urban land cover [96,100,134,[136][137][138][139]. With the availability of VHR images, smaller urban features have been segmented in the majority of papers. The papers that used dataset such as ISPRS Vaihingen 2D Semantic Labeling dataset, ISPRS Potsdam 2D Semantic Labeling dataset and IEEE GRSS (Geoscience and Remote Sensing Society) Data Fusion Contest (of Zeebrugge, Belgium) are dedicated to the improvement of semantic segmentation of urban features from traditionally interesting "urban area" class to Impervious surfaces, Building, Low vegetation, Tree, Car, Water, Boats and Clutter/background. Some have segmented urban features as small as sidewalk, motorcycles, traffic signs, pedestrians, picnic table, orange pad, buoy, rocks, sports courts, etc. [106,121,140,141]. Some classified buildings to their utilities [102,113], smaller features inside roads [142] and slum area [135]. Besides the segmentation of features, the studies use different approaches to improve semantic segmentation of urban features, which are shown in Table 1. The coherent regions used for segmentation are pixel, patch and super-pixel for 66, 4 and 2 papers respectively, including [143], which uses super-pixels to enhance pixel-based segmentation.

Data Sources
Among the 71 papers, most of them used a publicly available dataset consisting of a large number of images and labels, and some tested their model on multiple of these datasets. The most commonly used datasets are ISPRS Vaihingen 2D Semantic Labeling dataset [166] (38 papers) and ISPRS Potsdam 2D Semantic Labeling dataset [167] (27 papers) and IEEE GRSS Data Fusion Contest [168] (5 papers) collected their images using UAV. Other than UAV-based images, satellite images were obtained from RADARSAT-2 [136], Worldview-2 [102,134,156], Worldview-3 [149], Landsat-8 [96], SPOT [137,155], Gaofen [138,169], Quickbird, Sentinel-1 and 2 [147], Sentinel-2 and TerraSAR-X [135], Plan-etScope (Dove constellations) [164]. A particular research collected data from a plane [140]. Some also used web-services such as Google Maps/Earth [140,161], World [101,140], Bing Maps [113] and Linz Data Service [103] often along with satellite images to improve segmentation results. Terrestrial data was also used for similar purposes [113]. For the comparison purpose or to assist the DL model, some used Lidar data. Lidar data were mostly included in the widely used public dataset mentioned before, and some used different ones. For VHR images, some prepared their own UAV-based dataset [121]. Some used Inria Aerial Image Labeling Dataset [76,114,117] for urban LULC. Other than urban LULC, some specific dataset dedicated for building/roof segmentation are Aerial Imagery for Roof Segmentation [91], SpaceNet building dataset [101], UK-based building dataset [160], WHU Building dataset [79]; for roads are AerialLanes18 dataset [142]; and for both roads and buildings is Massachusetts Building and Road Dataset (Mnih, 2013 dataset) [116,126,[151][152][153]158,159]. Other dataset for building and roads are also used in [152,153]. More datasets that were found along with the details on resolution and websource are provided in Appendix C. The spatial coverage of the 71 papers includes 46 local domain studies and 25 global domain studies. The local domain study means the dataset was collected within a single country and global domain study means the dataset was collected in more than one country.

Data Preparation
Here, we summarize the distribution of pre-processing, preparation and augmentation methods used. Pre-processing methods change the characteristics of images on pixel or spectral level. Some of these methods include image processing methods, image normalization by mean value subtraction on images [31,96,158,161], k-means clustering [134], relative radiometric calibration strategy [79], super-pixel segmentation [93], Lidar to digital surface model (DSM) [163], satellite image correction and pan-sharpening [102,164,169], normalized DSM (nDSM) and calculation of vegetation index such as normalized difference vegetation index (NDVI). Twenty-two papers performed such pre-processing methods. The most commonly performed methods calculation of DSM, nDSM, LiDAR data and NDVI [76,84,92,94,146,150]. Some used a stack of multiple channels of images [95,101,144,160]. Some used filters like unsharp mask filter, median filter, linear contrast filter [106] and Wiener filter [139]. Most papers that used datasets like ISPRS or IEEE did not perform pre-processing on images, because the images were already in a ready-to-use condition, and consisted of annotated labels too.
Data preparation includes methods used to prepare labels and images for train, test and validation dataset. The ratio of train-test/validation data is commonly 80-20 or 90-10. Image tiling is the most commonly used method and tile size was mostly 256 × 256 or 512 × 512 pixels. 18 papers overlapped the training tiles to their neighbors, among which most of them used 50% overlap. Yue et al. [104] used 2D Gaussian function to calculate the overlap. The distribution of these methods is shown in Figure 3. For the papers that used datasets like ISPRS and IEEE, the labels were already prepared. Others prepared labels manually [76,103,134,142] or used some traditional image segmenting methods to prepare labels [135,160]. The majority of papers (42 papers) have mentioned the use of image augmentation techniques to increase the number of training dataset or to increase the performance of overall learning procedure. The distribution of these methods is shown in Figure 3. Some also evaluated the increment in performance due to augmentation [95]. More details on this section can be found in Appendix A.

Training Details
Training details include network architectures, frameworks, optimizers, loss function, hyper-parameters like learning rate, weight decay, momentum, epochs and iterations, use of dropout to handle over-fitting, and the hardware used by the papers. The architectures used are shown in Figure 4. The most commonly used architectures are CNN, SegNet, FCN, U-Net and FPN. More details can be found in Appendix A. Among the architectures used, SegNet, few FCNs, FPN, U-Net, DeepLabv3+, FSN, ResegNet, DeepUNet, ResUNet-a, DenseU-Net, FuseNet and ReuseNet are encoder-decoder structured architectures, mostly using skip connections (aka. skip branches). The convolutional backbones employed by the papers are shown in Figure 5. Among the papers that mentioned the name of the backbone used, the most commonly used ones are ResNet and Visual Geometry Group (VGG). The most commonly used ResNet backbone is ResNet-101, which helps in the reduction of vanishing gradient problem of deep learning. Other than convolutional backbones, some papers used DL architecture like SegNet and U-Net as a part of their model, making them the backbone of their new architecture.
Talking about frameworks that are used to run or wrap the network architecture, the most commonly used were Caffe, Tensorflow, Keras and Torch. The distribution is shown in Figure 6 and details in Appendix A. Most commonly used network optimizers that optimize the network and keep off from model over-fitting are SGD including SGD with momentum or other hyper-parameters like learning rate. Other algorithms were Adam, Nadam (aka. Nesterov Adam) [76,121] and Adadelta [125]. Some used dropout functions to stop the over-fitting with common values of 0.5. To initialize the weights in the network, some used a special algorithm called Glorot Uniform Initializer [76,103] and Xavier algorithm [107,147]. The use of optimizers, dropout values and initializers are shown in Figure 7.    The loss functions that are used to evaluate how well specific algorithm models the given data include commonly used cross-entropy (CE) loss function (35 papers). The variety of CE loss used includes binary CE with semantic encoding loss [154], CE with median frequency balancing (MFB) [61,94,109,110,162], normalized CE [151], categorical CE function called logloss [125] and sparse softmax CE [135,148]. Some modified the CE loss [142,163] and compared the use of CE with MFB and CE with focal loss function weighted by MFB [110]. Other loss functions include auxiliary loss [62,89], adversarial loss [127], regularized logistic regression [144], multinomial logistic loss [146,147,161], validation loss from Keras [121], softmax [104,106,134,138], bidirectional loss [76], moving average [169], negative log likelihood loss (NLLLoss) [164] and Dice coefficient and Tanimoto coefficient [108]. The distribution of loss function is shown in Figure 8. Hyper-parameters like learning rate, weight decay, momentum, epoch, batch size, number of iterations and steps are set up carefully as they depend upon the architecture, number of layers used and computer hardware specifications. The learning rate is either fixed or is decreasing with some decay rate of weight or momentum after a certain number of steps, iterations or epochs; ranging from 10 −1 to 10 −10 . Most studies used graphics processing unit (GPU) units to train models faster (Figure 9). Most recent and powerful GPUs are used by the most recent papers, and all GPUs are from Nvidia.

Performance Comparison
A wide range of metrics are used to evaluate the methods/models as seen in Figure 10 For simplicity, we refer to "DL performance" as the DL model's performance score obtained from the metrics used in the papers. Most papers use multiple metrics to compare different combinations of same the model or to compare to other base models. The use of different experimental dataset and performance metrics makes performance comparison between different papers extremely difficult and often carries conviction. Therefore, we summarize the findings in the most meaningful ways by only comparing overall accuracy (OA) and F1 score among the papers that segmented multiple classes, and do not compare other metrics if these already exist OA. And some used k-fold cross validation in case the train and test dataset was small [135,137,146]. The papers that used common dataset have used similar metrics. Among the 32 papers that used ISPRS Vaihingen dataset and OA as the metric, OA ranged from 0.86 to 0.92 for the study target of "urban feature segmentation" with the highest value of 0.92 by superpixel-enhanced Deep Neural Forest (SDNF) [143]. For the ones that used F1 score, the highest of 0.92 was obtained by the ScasNet model (CASIA-2 in ISPRS leaderboard) [151]. Among the papers that used pixel accuracy, the highest of 0.87 was obtained. Similarly for the ISPRS Potsdam dataset, among the 23 papers that used the dataset for "urban feature segmentation", OA ranged from 0.85 to 0.93 with the highest of 0.93 from 2 papers [143,147]. For the ones that used F1 score, the highest was 0.93 [98,108,151] and for the papers that used pixel accuracy, 0.87 was obtained. Potsdam data was also used for other study targets. The highest OA of 0.96 was obtained for "building/roof segmentation" [160].
For the dataset of IEEE GRSS, the highest OA of 0.90 was for "urban feature segmentation" [155] and for the Massachusetts Building dataset and Road Dataset, the highest was 0.97 [159] and 0.94 [153] respectively. For the other datasets used, the highest OA for each study targets are: 0.99 for "building/roof detection" [160], 0.96 for "urban feature segmentation" [76,156] and 0.99 for "urban land cover classification" [100]. For "Road segmentation", 90% of roads were correctly segmented [159] and the highest F1 score of 0.94 was achieved [152].
It has to be noted that the objective of some papers [107,127] was not focused on producing the highest metric in segmentation, but was centered to improve the segmentation when the train and test datasets are from different domains. Benjdira's method of GAN [127] improved OA from 35% to 52% when passing from Potsdam (source domain) to Vaihingen (target domain). Also, it improved the average segmentation accuracy of classes inverted due to sensor variation from 14% to 61%.
The trend shows that the more recent the research, the less are they being compared to traditional methods that do not use DL. Out of the 54 papers that compared their method to the modern DL-based methods, 38 papers used ISPRS's Vaihingen or Potsdam data for comparison and 16 papers used other datasets. Details can be found out about the 38 papers that used the ISPRS dataset in Section 4.3 and Appendix B. Among the papers that compare their results to other datasets, some of the modern base models used for comparison are Cascade CNN [137]; Saito et al. (2015) [158]'s CNN [152,153]; DeepLab, Unet, FCN-8s, DeepLabv3 and DeepLabv3+ [142]; U-Net, PSPNet, SegNet and DeepLabv2 [134]; FCN-8s, SegNet, DeconvNet, U-Net, ResUNet and DeepUNet [103]; C-Unet, U-net, FCN-8s and 2-scale FCN [145]'s 2-scale FCN [79]. The recent papers tend to compare their method to recent other DL-based methods for state-of-the-art comparison.
Out of all, most papers (65 papers) obtained less than 20% DL performance, including 55 papers less than 10%, 49 papers less than 6%, 36 papers less than 3% and 25 papers less than 1%. Among the 54 papers that compared their method to modern DL-based methods, majority (41 papers) show improvement in DL performance by less than 6%, and 8 papers shows more than 6% improvement [62,84,101,114,163]. Similarly, among the 14 papers that compared to traditional methods, 7 papers observed less than 6% DL performance and the remaining observed more than 6%. [55,92,136,137,140,156] [121,134]. These numbers are the DL performance of the overall method, and comparison between each class is not shown. However, they can help us understand how significant DL-based methods are when compared to traditional methods.
The readers are suggested to go through Appendix A, a tabulated analysis that is ordered by the year of publication. Now in the next section, we will discuss how traditional methods helped DL-based methods, improvements shown by DL, improvement boosted by dataset challenges and research problems addressed by DL on semantic segmentation.

Discussion
Our meta-analysis shows that DL-based methods have shown significant performance gains in the majority of literature reviewed. Among all papers, 89% papers compared to some traditional, modern or both type of base model and 93% compared to a different version/ablations or combination of their own model including some comparison among different datasets. Some recent papers of 2019 and 2020 include an ablation study to discuss the performance of their method. Most works are motivated to improve segmentation with methods for better fusion, contextual information, auxiliary data, skip connections and transfer learning. Some wanted to minimize pre-processing and post-processing, improve labeling dataset, solve the class imbalance problem and improve the boundary prediction. Compared to the traditional methods, all related works have shown improvement regarding these problems. Thanks to labeling contests, 71% of papers have used a common dataset of ISPRS 2D semantic labeling dataset including 38 that used Vaihingen and 27 that used Potsdam. Also other data sources were mentioned in Section 3.2.2. If the papers did not use these common datasets, it would have been nearly impossible to compare their work. Further, we will now discuss the improvements achieved with the help of traditional methods (Section 4.1), improvements shown by DL (Section 4.2) and dataset challenges (Section 4.3) and research problems addressed by DL (Section 4.4).

Helping Hands for DL Models
DL models offered superior performance compared to SVM, conventional NNs, RF, CRF, HOG, ML, SLIC and other supervised and unsupervised classification methods. However, some DL-based methods have also included a few of these traditional methods to support their model either on segmentation before training, or as pre-processing or as a post-processing method. CRF has been commonly used with CNN and FCN to exploit contextual information. Many papers applied CRF as a post-processing on their model [31,55,144,169]. Some compared their model to the methods using CRF [92,145]. In 2015, Paisitkriangkrai et al. [144] used CRF on the combined probabilities of a CNN and an RF classifier. Saito et al. (2015) [158] compared the conventional NN's of Mnih et al. [170], out of which, NN with CRF had shown better segmentation of building. Saito et al. (2016) [159] in their next study omitted the use of data augmentation that Mnih et al.'s networks with CRF needed. Sherrah et al. [31] observed a slight improvement in results when they used RF and CRF on their FCN without downsampling. Marmanis et al. (2016) [157] observed a slight gain in OA while using CRF with his FCN. To quantify the influence of post-processing, they used class likelihoods predicted by their ensemble of FCN, as input to a fully connected CRF (FCRF). Later, Zhao et al. [156] used CRF to capture contextual information about semantic segments and refine classification maps. When comparing pixel-based CRF vs pixel-based CNN, they pointed out that CRF was better than CNN because CRF overcomes the "salt-and-pepper" noise effects. However, CRF generally requires a substantial amount of calculation time and overlooks the contextual information between different objects. Therefore they combined CNN with CRF to observe up to 3% better OA. Liu et al. (2017) [150] used higher-order CRF to combine two predictions from a FCN trained on infrared-red-green (IR-R-G) images of Potsdam and a linear classifier trained on LiDAR. Their fusion using higher order CRF helped them resolve fusion ambiguities and observed 1 to 3% better OA.
As most methods combined CNN with strategies for spatial regularization such as CRF, Volpi et al. (2018) [163] proposed a method to learn evidence in the form of semantic class likelihoods, semantic boundaries across classes and shallow-to-deep visual features, each one modeled by a multi-task CNN. They used CRF to extract boundaries with base parameters of sensitivity to color contrast, flat graph and segmentation tree, producing better OA. Pan et al. (2018) [148] used fully connected CRFs as unary potential on the outputs of softmax layer (heat maps) and as pairwise potential on CIR images, to observe a slight improvement in results on their encoder-decoder architecture called Fine segmentation Network (FSN). Liu et al. (2019) [84] proposed a decision-level multisensor fusion technique for semantic labeling of the VHR RGB imagery and LiDAR data. In their study, they fused segmented outputs obtained from multiple classifiers such as FCN, probabilistic classifier and unary potential modeled as CRF, each trained on a different multi-modal data such as 3-band images, LiDAR and NDVI, using HCRF. When tested on Zeebruges dataset, their method produced 3 to 19% better OA than methods like SVM, AlexNet and FCN-8s. Du et al. (2020) [124] obtained two initial probabilistic labeling predictions using a DeepLabv3+ network on spectral image and an RF classifier on hand-crafted features, which they integrated by Dempster-Shafer (D-S) evidence theory to be fed into an object-constrained higher-order CRF framework to estimate the final semantic labeling results with the consideration of the spatial contextual information. Some other post-processing methods also include denoise and smoothing [100] and Otsu's thresholding [139]. Li

Improvements from Deep Learning
Besides the improvement from the use of traditional methods on DL-based methods, various improvements have been produced from the rise in the use of CNNs and FCNs. Among the various architectures used (Figure 4), encoder-decoder architectures like SegNet, few FCNs, FPN, U-Net, Hourglass and DeepLabv3+ mostly have a symmetrical architecture with skip connections. These architectures minimized the problem of boundary pixel classification by using both lower-level and higher-level information coming from skip connections and previous down-sampling layers. Unlike CNN, these architectures do not rely only on the feature maps produced by the pooling layers. Azimi et al. [142] have shown the impact of symmetrical FCN over non-symmetrical FCN like FCN-8s in terms of DL performance. Also from Table 1, 8

papers exclusively exploit contextual information to improve the accuracy of pixel-based semantic segmentation.
Talking about pixel-based and patch-based semantic segmentation, 68 papers performed pixel-based and 4 performed patch-based [55,61,158,159]. Volpi et al. (2016) [55] compared between patch-based, sub-patch-based and full-patch-based CNNs, where the third one performed better. Kampffmeyer et al. [61] compared pixel-based vs patch-based FCN while improving segmentation of small objects like cars. Pixel-based FCN performed better than patch-based methods in their comparison. Other papers performed pixel-based segmentation. Sherrah et al. have also argued that patch-based methods produce output in lower resolution, do not fulfill the task of semantic segmentation and also perform redundant operations on adjacent patches.
As most of the papers used the dataset from some semantic labeling contest, in the next section, we compare those papers in more detail and point out how the DL performance was improved by the contests.

Improvement Boosted by Dataset Challenges
In 2014, Gerke et al. [171]  For the study target of "urban feature segmentation", 32 papers that used Vaihingen dataset achieved OA from 0.86 to the highest of 0.92 [143], 23 papers that used the Potsdam dataset obtained OA of 0.85 to the highest 0.93 [143,147], among the 6 papers that used IEEE GRSS the highest OA was 0.90 [155] and for Massachusetts Building dataset and Road Dataset, the highest of 0.97 [159] and 0.94 [153] was achieved. Also for the other datasets, the study target of "building/roof detection", "urban feature segmentation", "urban land cover classification" and "road segmentation" achieved up to 0.99, 0.96, 0.99 and 0.90 DL performance respectively. It can be seen that the performance metric of over 90% has been achieved in all four study targets using high-resolution images, which was not possible using traditional methods.

Problems Addressed by Deep Learning
Sections 3.2.5, 4.2 and 4.3 have shown that DL-based semantic segmentation shows contrasting improvement compared to traditional methods. All of the four study targets have achieved over 90% DL performance, even in challenging datasets like ISPRS 2D labeling dataset and IEEE GRSS dataset. Some of the previously faced challenges of pixel-level semantic segmentation have been addressed by several of the recent DL-based methods, as listed below.

1.
Complete ignorance of spatial information: In most of the traditional methods for pixel-level segmentation, spatial information was completely ignored. To solve this problem, several DL-based studies figured out the use of contextual information from lower and higher layers/levels of encoder (or downsampling) block, using skip connection and symmetrical networks. The features maps obtained from these lower to higher levels of encoders are concatenated to the feature maps of decoder (or upsampling) layers in symmetrical networks. As this problem also entails incorrect segmentation caused by similar features of similar categories, concatenation/aggregation or better fusion techniques are sought to merge feature maps of different levels. Table 1 shows 18 papers were motivated to use better fusion techniques. The details of these papers are already presented in Sections 2 and 3.2.1.

2.
Boundary pixel classification (aka boundary blur) problem: Several studies were performed to address this problem.   [78] removed max pooling layers and used CRF for post-processing, Badrinarayanan et al. [111] reused pooling indices, and Marmanis et al. (2018) [112] combined semantic segmentation with semantically informed edge detection to make boundary explicit. Other papers [55,163] were also motivated to address this problem. Specifically, encoder-decoder architectures with symmetrical network and skip connections have minimized this problem significantly. Mi et al. [143] alleviated the problem by using superpixel segmentation with region loss to emphasize on homogeneity within and the heterogeneity between superpixels.  [109] proposed DenseU-Net and SiameseDenseU-Net architecture inspired from [172] to solve this problem. Some used MFB with focal loss to minimize this problem [61,94,109,110,162].

4.
Salt-and-pepper noise: To minimize the noises produced by pixel-based semantic segmentation, Guo et al. [100] denoised and smoothened the segmentation results by implementing kernel-based morphological methods, and Zhao et al. [156] used CRF to capture contextual information of the semantic segments and refine the classification map.

5.
Structural stereotype and insufficient training: As highlighted by Sun et al. [115] in 2019, the problem of structural stereotype causes unfair learning and inhomogeneous reasoning in encoder-decoder architectures. They alleviate this problem by random sampling and ensemble inference strategy. They also proposed a novel encoderdecoder architecture called ResegNet to solve insufficient training. 6.
Domain-shift problem: Another drawback of DL-based method is that the DL performance decreases when the study domains are shifted. To address this problem, and reduce the domain-shift impact caused by imaging sensors, resolution and class representation, Benjdira et al. [127] used GAN consisting of generator and discriminant as explained in Section 2.5. Liu et al. (2020) [107] used two discriminators to minimize the discrepancy between the source and target domains. Also, many others performed multi-modal, multi-scale and multi-resolution training of DL-models [56,80,[91][92][93][94][95][96][97][98], for sufficient training of DL model.
The major drawback of DL-based methods is considered the lack of training dataset. Depending on the complexity of the problem and number of classes, any DL-model would require a large set of training images. Also with the added complexity of remote sensing-based data collection and expense, the use of DL is even more challenging. Several augmentation techniques (Figure 3) are often used to increase the number and variation of the dataset. However, ISPRS's 2D labeling dataset and IEEE's GRRS dataset have tried to address the insufficiency of data by providing VHR images collected from UAVs, of up to 5 cm resolution. Also, several studies used transfer learning to transfer the learning from a model trained on a different domain to the model to be trained. Therefore, with the availability of public datasets and the use of transfer learning, several studies have therefore also conducted multi-modal and multi-resolution training of DL-models.
Another disadvantage of DL-based semantic segmentation is the requirement of a high amount of label dataset that often needs manual annotation. This problem has also been addressed to some extent by the public datasets by providing annotations as well, but is still persistent if we want to our own datasets. Some studies used traditional methods to produce the annotations [135,160]. In a similar way, one can also take advantage of existing pre-trained models to create their label dataset. Some have also tried to improve the labeling dataset [160,161]. Also, the labels can be collected from web services like open street maps (OSM) [164] for the segmentation of some urban features.
The other cons of DL-based methods are the necessity of GPU-based computation and heavier size of models in terms of storage volume. This challenge is addressed by a few cloud computing services like Google Colaboratory [173] recently, providing free usage until several hours. From Figure 9, 62% of papers mentioned the GPU specification they used, and it has to be noted that most of them were able to use costly GPU's like Nvidia's Titan-X. This could also be because of the rapid introduction of the latest powerful GPUs with higher capabilities and price reduction of the older ones. In a dissimilar note, some have also attempted to produce a smaller DL-model of less than a megabyte [165] and some reduced the training parameters [155] to create a faster model. Although DL-models take longer time to train compared to traditional methods, the time to test the model is significantly faster.

Conclusions and Future Directions
The results of the meta-analysis found that DL offers superior performance and outperforms traditional methods. Several drawbacks of DL-based methods have been addressed and minimized in recent years (Section 4.4), further increasing the performance. However, more challenges are to be expected with the recent trend to classify smaller features such as type of vehicle. Similar to the ISPRS and IEEE's dataset for 2D labeling challenge, availability of more public datasets including VHR imagery as well as coarser resolution imagery could help in the improvement of urban feature classification, by allowing multi-modal and multi-resolution training of DL-model. Also, more training labels are expected in the future. Auxiliary data sources like OSM are needed to be updated for the most recent labels. More digital image processing methods and vegetation indices can be explored to make the labeling/annotation task easier. GANs can help increase the number of training datasets by translating images, and also provide synthetically produced ultra-high resolution images. Moreover, we propose the surge of studies that focus on minimization of domain-shift problem, the number of training dataset, the training parameters, size and the time required to train the DL models with optimized architectures, GANs and transfer learning. Acknowledgments: This work was partially supported by Center of Excellence in Intelligent Informatics, Speech and Language Technology and Service Innovation (CILS), Thammasat University and Advance Geospatial Technology Research Unit, SIIT, Thammasat University. We would like to thank the supporters and also the reviewers, whose valuable suggestions and comments increased the overall quality of this review paper.

Conflicts of Interest:
The authors declare no conflict of interest. Multi-spectral images were orthorectified using DEM, and converted to Pauli RGB Image using a neighboring window of some size.