Dual and Single Polarized SAR Image Classiﬁcation Using Compact Convolutional Neural Networks

: Accurate land use / land cover classiﬁcation of synthetic aperture radar (SAR) images plays an important role in environmental, economic, and nature related research areas and applications. When fully polarimetric SAR data is not available, single-or dual-polarization SAR data can also be used whilst posing certain di ﬃ culties. For instance, traditional Machine Learning (ML) methods generally focus on ﬁnding more discriminative features to overcome the lack of information due to single-or dual-polarimetry. Beside conventional ML approaches, studies proposing deep convolutional neural networks (CNNs) come with limitations and drawbacks such as requirements of massive amounts of data for training and special hardware for implementing complex deep networks. In this study, we propose a systematic approach based on sliding-window classiﬁcation with compact and adaptive CNNs that can overcome such drawbacks whilst achieving state-of-the-art performance levels for land use / land cover classiﬁcation. The proposed approach voids the need for feature extraction and selection processes entirely, and perform classiﬁcation directly over SAR intensity data. Furthermore, unlike deep CNNs, the proposed approach requires neither a dedicated hardware nor a large amount of data with ground-truth labels. The proposed systematic approach is designed to achieve maximum classiﬁcation accuracy on single and dual-polarized intensity data with minimum human interaction. Moreover, due to its compact conﬁguration, the proposed approach can process such small patches which is not possible with deep learning solutions. This ability signiﬁcantly improves the details in segmentation masks. An extensive set of experiments over two benchmark SAR datasets conﬁrms the superior classiﬁcation performance and e ﬃ cient computational complexity of the proposed approach compared to the competing methods.


Introduction
Synthetic Aperture Radar (SAR), consisting of air-borne and space-borne systems, has been actively used in remote sensing in many fields such as geology, agriculture, forestry, and oceanography.SAR systems can operate in many conditions where optic systems often fail, e.g., night time or severe weather conditions.Hence, they have been extensively used in various applications such as tsunami-induced building damage analysis with TerraSAR-X [1], ocean wind retrieval using RADARSAT-2 [2], oil spill detection using RADARSAT-1, ENVISAT [3], land use/land cover (LU/LC) classification with RADARSAT-2 [4], vegetation monitoring [5] using Sentinel-1, and soil moisture retrieval with Sentinel-1 [6], TerraSAR-X, and COSMO-Skymed [7].A comprehensive list of fields and applications of SAR is available in [8].
Ecological and socioeconomic applications greatly benefit from LU/LC classification, making SAR image classification the primary task.For example, forest biomass analysis investigated in [9] provides vegetation ecosystem analysis in Mediterranean areas.Further studies [10,11] focus on the relation between vegetation type and urban climate by questioning how vegetation types affect the temperature.Moreover, Mennis [12] analyzes the relationship between socioeconomic status and vegetation intensity and reveals that higher vegetation intensity is associated with socioeconomic advantage.However, accurate LU/LC classification is a challenging task especially for conventional machine learning methods due to several reasons: (1) existing speckle noise in SAR data, (2) requirement of pre-processing, i.e., feature extraction is especially needed for single-and dual-polarimetric cases to compensate for the lack of full polarization information, and finally, (3) the large-scale nature of SAR data.
Nevertheless, there have been many existing studies using supervised and unsupervised methods [13][14][15][16][17][18][19][20] for LU/LC classification of SAR images.On the one hand, several clustering methods are proposed [19,20] as the second group and the underlying task is challenging especially for high-resolution SAR images mainly due to the heterogonous regions in the data.Superpixel segmentation approaches that group similar pixels based on color and other low-level properties have been proposed, see for instance the comprehensive study in [21].The work in [22] proposes to use the mean shift algorithm for SAR image segmentation.In particular, an extension of the mean shift algorithm with adaptive asymmetric bandwidth is proposed to deal with speckle noise and the large dynamic range of SAR images in [22].Superpixel based watershed approaches [23] are used with average contrast maximization in [24] for river channel segmentation.On the other hand, recent studies [15,16,18] have shown that supervised methods have significantly better performance compared to unsupervised ones.
Traditional supervised approaches for classification consist of two distinct stages: feature extraction and feature classification [15,16,18,[25][26][27][28][29][30][31][32], and may be further categorized based on how they describe multidimensional SAR data.For the cases of multiple polarizations, different target decompositions are used as high-level electromagnetic features, whereas only a single (intensity) channel exists for the single polarization, hence limiting the use of the rich set of electromagnetic features for classification.These studies further reveal that using secondary features such as color and texture [15,18,27,31] can significantly improve the classification performance with an inevitable cost of computational complexity increase.
The state-of-the-art classification performance over single-and dual-polarized SAR intensity data has been achieved by a recent study [18] which uses a large ensemble of classifiers over a composite feature vector in high dimensions (e.g., >200-D) with several electromagnetic (primary) and image processing (secondary) features.As a conventional approach, this method also has certain limitations.First, it cannot be applied directly over the intensity SAR data which makes its performance dependent on the selected features.This is the reason for using a large set of features in the studies [16,32,33], whose extraction process results in a massive computational complexity.Moreover, the classification accuracy of certain terrain types may still suffer from suboptimal performance of such fixed set of handcrafted features.
In recent years, Convolutional Neural Networks (CNNs) have become the de-facto standard for many visual recognition applications (e.g., object recognition, segmentation, and tracking) as they achieve the state-of-the-art performance [34][35][36][37] with a significant performance gap.In remote sensing [38], Deep Learning methods reside in the following areas: hyperspectral image analysis, interpretation of SAR images and high-resolution satellite images, multimodal data fusion, and 3-D reconstruction.On the other hand, such deep learners require training datasets with massive sizes, e.g., in the "Big Data" scale to achieve such performance levels.Furthermore, they require a special hardware setup for both training and classification.Such drawbacks can be observed in the recent deep learning approaches for SAR image classification [39,40].In these studies, a large partition of SAR data (i.e., 75% or even higher) is used just to train the network in order to achieve an acceptable performance level.For example, in the study [39], the authors propose a SAR image classification system which used 78-80% of SAR data for training, more specifically, 28,404 training samples are selected while 8000 samples are used for the evaluation of San Francisco fully-polarized L-band image.Similarly, in the same study, 10,817 samples of a total of 13,598 samples are used to train the model over Flevoland fully-polarized L-band image.Another similar classification system in [40] uses 75% of all available data for training which corresponds to 111,520 samples out of 148,520 samples in Flevoland L-band SAR image.One can argue that in practice, the availability of such an amount of labeled SAR data may not be feasible due to the cost and difficulty of ground-truth labeling in remote sensing.Furthermore, using such proportions of ground-truth labels eliminates the main goal of the LU/LC classification task, as the classification may not be required anymore after labeling more than three-quarter of the data.Finally, deep CNNs require a special hardware setup for training and classification to cope with the massive computational complexity incurred due to the deep network structure.This requirement may prevent their use in a low-cost and or real-time applications.
In this study, in order to address the aforementioned drawbacks and limitations of conventional and Deep Learning methods, we propose a systematic approach for accurate LU/LC classification of single-polarized COSMO-SkyMed and dual-polarized TerraSAR-X intensity data, which are both space-born X-band SAR images, using compact and adaptive CNNs.Performance of the proposed approach will be evaluated against the current state-of-the-art method in SAR image classification [18] and two recently proposed deep CNNs for ImageNet -Large Scale Visual Recognition Challenge [41]: Xception and Inception-Resnet-v2 [36,37].The novel and significant contributions of the proposed approach can be listed as follows: first, unlike conventional methods, the proposed approach can directly be performed over SAR intensity data without requiring any prior feature extraction or pre-processing steps.This is the sole advantage of CNNs which can fuse and simultaneously optimize the feature extraction and classification in a single learning body.Second, we shall show that unlike deep CNNs, the proposed compact CNNs can achieve the state-of-the-art classification performance with an insignificant amount of training data (e.g., <0.1% of the entire SAR data).Third, the proposed compact CNNs achieve a superior computational complexity for both training and classification, making them suitable for real-time processing.Finally, contrary to deep learning techniques, we show that small (e.g., 7 × 7 up to 19 × 19 pixels) patches can be used to achieve a more detailed segmentation mask, thanks to the compact nature of the proposed CNN configuration.
The rest of the paper is organized as follows: a brief discussion of the related work is given in Section 2, followed by a detailed explanation of the proposed methodology in Section 3. The data processing phase is presented along with the experimental results and the computational complexity analysis of the network in Section 4, where the main findings are analyzed and discussed.Finally, in Section 5, concluding remarks are drawn with potential future research directions.

Related Work
The data acquisition of a polarimetric SAR (PolSAR) system measures the complex backscattering [S] matrix.For the full polarization case, [S] can be expressed as: where S hv = S vh holds for monostatic system configurations using reciprocity theorem [42].Consequently, each pixel in a PolSAR image can be represented by five parameters: the three absolutes: |S hh |, |S vv | as co-polarized intensities, |S hv |/|S vh | as cross-polarized intensity, and the two relative phases: φ hv−hh , φ vv−hh .The advantage of PolSAR data is that it can characterize scattering mechanisms of numerous terrain covers.Lee et al. [43] investigated such characteristics of terrain types.
For instance, open areas typically have surface scattering, trees and bushes show volume scattering, while man-made objects such as buildings and vehicles have double bounce and specular scattering.
In SAR image classification, using these backscattering parameters directly is the most common method, where it is preferred to have fully polarimetric SAR data since this will help to acquire more information on the observed target.However, such data may not be fully polarimetric in practice and information regarding the observed target is decreased due to the single or dual polarized data.This negative effect on classification performance is demonstrated by studies over different SAR sensors; AIRSAR [44,45], ALOS PALSAR [46,47], and EMISAR AgriSAR [48].The current state-of-the-art method in SAR image classification with single and dual polarized intensity is Uhlmann et al. [18].To the best of our knowledge, no other method has ever achieved better classification performance than [18] using less than 0.1% of the entire SAR image for the training.Previous studies are based on using only pixel-wise information from each target, which assumes that there is no correlation within small neighborhoods, whereas the method proposed in [18] brings pixel correlation but still lacks region information.They combine electromagnetic features (backscattering coefficients) with image processing features.Hence, in [18], the following image processing features are utilized: (1) texture features: local binary pattern (LBP) [49], the edge histogram descriptor (EHD) [50], Gabor wavelets [51] and gray-level co-occurrence matrix (GLCM) [52]; (2) color features: hue-saturation-value color histogram [53], MPEG-7 dominant color descriptor (DCD) [50], and MPEG-7 color structure descriptor (CSD) [53].More specifically, in [18], they perform classification over dual-and single-polarized SAR intensity data using different techniques to produce pseudo colored RGB image and intensity images to make color and texture feature extraction possible.For color feature extraction, pseudo colored RGB images are produced by assigning the magnitude of backscattering coefficients [S] in Equation ( 1), (VH, VH-VV, VV) and/or (VV, VV-VH, VH), to R, G, and B channels, respectively, and then color features are extracted from these two images for dual-polarized SAR intensity data where magnitudes of the two backscattering coefficients are available.On the other hand, producing pseudo colored images for a single-polarized intensity is still possible by assigning intensity values to HSI (Hue, Saturation, and Intensity) color space by [54].Lastly, the extraction of texture features is performed using total scattering power span (commonly used in SAR image processing as another target descriptor) as an intensity image for dual-polarized intensity data and directly using the available intensity for single-polarized SAR intensity data.Finally, an ensemble of conventional classifiers can then be used to learn all these features simultaneously to maximize the classification accuracy.

Methodology
The proposed systematic approach for terrain classification is illustrated in Figure 1.For illustration purposes, the pseudo-color image in the figure is created from the benchmark Po Delta (Italy, X band) SAR data by transforming available intensity to HSI color space and assigning each component to RGB channels, respectively, using the approach of [54].

Methodology
The proposed systematic approach for terrain classification is illustrated in Figure 1.For illustration purposes, the pseudo-color image in the figure is created from the benchmark Po Delta (Italy, X band) SAR data by transforming available intensity to HSI color space and assigning each component to RGB channels, respectively, using the approach of [54].In order to obtain the final segmentation mask in Figure 1, an NxN window of each individual electromagnetic (EM) channel data around each pixel has been fed as the input to an adaptive 2D CNN, and the corresponding output of the CNN determines its center pixel's label.The CNN configuration used in the proposed classification system is given in Figure 2. Accordingly, the used number of EM channels determines the size of the input layer of the CNN.We have tested 1 to 4 EM channels, and the results will be discussed in Section 4. One hyper-parameter in this model is the size (N) of the NxN sliding window.In deep-learning approaches, N has to be kept high due to numerous convolution and pooling layers in the deep network structures.However, the proposed compact network enables the user to set N as low as 5, and we will discuss the effect of the window size over the classification performance in Section 4. In the following sub-sections, we will present the proposed adaptive CNN topology, more detailed description of the network structure and the formulation of the back-propagation training algorithm for the SAR data are given in Appendix A.
electromagnetic (EM) channel data around each pixel has been fed as the input to an adaptive 2D CNN, and the corresponding output of the CNN determines its center pixel's label.The CNN configuration used in the proposed classification system is given in Figure 2. Accordingly, the used number of EM channels determines the size of the input layer of the CNN.We have tested 1 to 4 EM channels, and the results will be discussed in Section 4. One hyper-parameter in this model is the size (N) of the NxN sliding window.In deep-learning approaches, N has to be kept high due to numerous convolution and pooling layers in the deep network structures.However, the proposed compact network enables the user to set N as low as 5, and we will discuss the effect of the window size over the classification performance in Section 4. In the following sub-sections, we will present the proposed adaptive CNN topology, more detailed description of the network structure and the formulation of the back-propagation training algorithm for the SAR data are given in Appendix A.

Adaptive CNN Implementation
In the proposed adaptive CNN implementation, in order to simplify the network and achieve an adaptive configuration, several novel modifications are proposed as compared to conventional deep CNNs.First of all, the network encapsulates only two distinct hidden layer types: 1) "CNN" layers into which conventional "convolutional" and "subsampling-pooling" layers are merged, and, 2) fully-connected (or "MLP") layers.By this way, each neuron within CNN layers has the ability to perform convolution and down-sampling.The intermediate output of each neuron is sub-sampled to obtain the final output of that particular neuron.The final output maps are then convolved with their individual kernels and further cumulated to form the input of the next layer neuron.In Appendix A.1, the simplified CNN analogy is given where the image dimension of its input layer is made independent from CNN parameters.
The number of hidden CNN layers can be arbitrarily, regardless of the input patch size.The proposed implementation makes this possible by adjusting the sub-sampling factor of the intermediate outputs of the last hidden convolutional layer to produce scalar values as the input of the first MLP layer.For example, if the feature maps of the last hidden convolutional layer are 8x8 as in the figure at layer l+1, then, they are sub-sampled by a factor of 8.Besides sub-sampling, note that the dimension of the input maps is gradually decreasing due to the convolution without zero padding.As a result, after each convolution operation, the dimension of the input maps is reduced by (Kx-1, Ky-1) where Kx and Ky are the width and height of the convolution kernels, respectively.Each input neuron in the input layer is fed with the patch of the particular channel.As discussed earlier, in this study the number of channels is varied from 1 to 4. In general, it is determined by the data; for example, the available single intensity is directly used with one-channel CNN setup for single-polarized SAR data.In addition, the HSI channels are added to the input with four-channel setup, and it is revealed in Section 4.3 that adding HSI channels improves the accuracy obtained using single channel.Next, for the dual-polarized intensity data, two available channels are used as the input of the CNN.

Adaptive CNN Implementation
In the proposed adaptive CNN implementation, in order to simplify the network and achieve an adaptive configuration, several novel modifications are proposed as compared to conventional deep CNNs.First of all, the network encapsulates only two distinct hidden layer types: (1) "CNN" layers into which conventional "convolutional" and "subsampling-pooling" layers are merged, and, (2) fully-connected (or "MLP") layers.By this way, each neuron within CNN layers has the ability to perform convolution and down-sampling.The intermediate output of each neuron is sub-sampled to obtain the final output of that particular neuron.The final output maps are then convolved with their individual kernels and further cumulated to form the input of the next layer neuron.In Appendix A.1, the simplified CNN analogy is given where the image dimension of its input layer is made independent from CNN parameters.
The number of hidden CNN layers can be arbitrarily, regardless of the input patch size.The proposed implementation makes this possible by adjusting the sub-sampling factor of the intermediate outputs of the last hidden convolutional layer to produce scalar values as the input of the first MLP layer.For example, if the feature maps of the last hidden convolutional layer are 8 × 8 as in the figure at layer l+1, then, they are sub-sampled by a factor of 8.Besides sub-sampling, note that the dimension of the input maps is gradually decreasing due to the convolution without zero padding.As a result, after each convolution operation, the dimension of the input maps is reduced by (K x −1, K y −1) where K x and K y are the width and height of the convolution kernels, respectively.Each input neuron in the input layer is fed with the patch of the particular channel.As discussed earlier, in this study the number of channels is varied from 1 to 4. In general, it is determined by the data; for example, the available single intensity is directly used with one-channel CNN setup for single-polarized SAR data.In addition, the HSI channels are added to the input with four-channel setup, and it is revealed in Section 4.3 that adding HSI channels improves the accuracy obtained using single channel.Next, for the dual-polarized intensity data, two available channels are used as the input of the CNN.

Back-Propagation for Adaptive CNNs
The illustration of the BP training of the adaptive CNNs is shown in Figure 3.For an N L -class problem, the class labels are first converted to the target class vectors using 1-of-N L encoding scheme.By this way, for each window, with its corresponding target and output class vectors, t 1 , . . ., t N L and y L l , . . ., y L N L , respectively, the MSE error in the last layer is expressed as in Equation ( 2).Next, derivative of this error with respect to individual weights and biases is computed.The BP formulation of the MLP layers is identical to the traditional BP for MLPs and hence it is skipped in this paper.On the other hand, the BP training of the CNN layers, composed of four distinct operations is detailed in Appendix A.2.

Back-Propagation for Adaptive CNNs
The illustration of the BP training of the adaptive CNNs is shown in Figure 3.For an NL-class problem, the class labels are first converted to the target class vectors using 1-of-NL encoding scheme.By this way, for each window, with its corresponding target and output class vectors,  , … .,  and  , … .,  , respectively, the MSE error in the last layer is expressed as in Equation ( 2).Next, derivative of this error with respect to individual weights and biases is computed.The BP formulation of the MLP layers is identical to the traditional BP for MLPs and hence it is skipped in this paper.On the other hand, the BP training of the CNN layers, composed of four distinct operations is detailed in Appendix A.2.

Experimental Results
In this section, we will first introduce our benchmark dataset used in the experiments and continue with the experimental setup.Next, the proposed compact adaptive CNNs will be analyzed against the state-of-the-art method in [18] with a comprehensive set of experiments.Performance evaluations will be presented in terms of visual inspection of the final obtained segmentation masks and quantitative analysis by comparing the overall classification accuracies and individual class accuracies of the proposed approach versus the competing method in [18].Furthermore, precision, recall, and F1 Score of each class are calculated for the multi-class case by the following.The precision

Experimental Results
In this section, we will first introduce our benchmark dataset used in the experiments and continue with the experimental setup.Next, the proposed compact adaptive CNNs will be analyzed against the state-of-the-art method in [18] with a comprehensive set of experiments.Performance evaluations will be presented in terms of visual inspection of the final obtained segmentation masks and quantitative analysis by comparing the overall classification accuracies and individual class accuracies of the proposed approach versus the competing method in [18].Furthermore, precision, recall, and F1 Score of each class are calculated for the multi-class case by the following.The precision of class c is the proportion of correctly classified samples of class c among all samples that are classified by the classifier as c, where recall is the proportion of correctly classified samples of c among true samples of class c.Consequently, F1 Score = 2 × Precision × Recall/(Precision + Recall).Furthermore, the Cohen's Kappa coefficient [55] is used as another performance measurement metric to analyze the reliability of the proposed system against deep CNN methods.Moreover, the sensitivity of the proposed approach with respect to the two hyper-parameters, the window size (N) and number of input channels will be investigated.Finally, we will conclude this section by demonstrating the performance gain of such compact configuration against deep network structures through sensitivity analysis with respect to the number of neurons and layers.

Benchmark SAR Data
In this study, two benchmark SAR data are used for our testing and comparative evaluations.The details of these benchmark SAR data are presented in Table 1.The first set of SAR data is for the Po Delta area located in the Northeast of Italy, and acquired at X-band and single-polarization mode.The second is the Dresden area in the Southeast of Germany at X-band and dual-polarization mode.The Po Delta area mainly consists of urban and natural zones, and the Dresden area has vegetation fields with man-made terrain types.The total number of samples in the whole ground truth (GTD) and the train data are presented in Table 2 for each SAR data.As used GTD in this study is hand-labeled, it is almost impossible to provide 100% accuracy on the ground truth labels.However, this is also true with the other (competing) methods.Therefore, if there is a labelling error (which is the most probably case), this will affect all the methods equally.On the other hand, no ML method will tolerate on the high labelling errors since they are all "supervised" methods and if the supervision is erroneous at large then this will deteriorate the performance of the method, including the proposed method in this study.

Po Delta, COSMO-SkyMed, and X-Band
This benchmark single polarized SAR data covers the Po Delta area which mainly provides natural class information with different types of water classes for our experiments.It has only one polarization (HH) in Strip Map HImage mode with original size of 16,716 × 18,308 and a 3-meter resolution.Due to computational reasons in [18], the data is downscaled by 3.6 × 5.8.The same procedure is followed in this study as well to make comparison possible with the same GTD is used.The ground truth of this data is constructed by visually inspecting optical image data with the help of [56].This data consists of mainly natural terrain types, such as several water-based terrain and soil-vegetation classes, some man-made structures which are grouped in one class.Consequently, we have determined six-classes which are urban fabric, arable land, forest, inland waters, maritime wetlands, and marine waters, and our constructed ground truth corresponds to the same GTD used for the previous state-of-the-art study in [18].A pseudo colored image is generated by assigning HSI channels (obtained by [54]) to RGB channels, as shown in Figure 4 with its corresponding ground truth.For a fair comparison with [18], in this study we also used the same samples for training, which are randomly chosen from the ground truth (2000 pixels per class) as 1%-2% of the whole ground truth, corresponding to 0.08% of the entire data.
This benchmark single polarized SAR data covers the Po Delta area which mainly provides natural class information with different types of water classes for our experiments.It has only one polarization (HH) in Strip Map HImage mode with original size of 16,716 × 18,308 and a 3-meter resolution.Due to computational reasons in [18], the data is downscaled by 3.6 × 5.8.The same procedure is followed in this study as well to make comparison possible with the same GTD is used.The ground truth of this data is constructed by visually inspecting optical image data with the help of [56].This data consists of mainly natural terrain types, such as several water-based terrain and soil-vegetation classes, some man-made structures which are grouped in one class.Consequently, we have determined six-classes which are urban fabric, arable land, forest, inland waters, maritime wetlands, and marine waters, and our constructed ground truth corresponds to the same GTD used for the previous state-of-the-art study in [18].A pseudo colored image is generated by assigning HSI channels (obtained by [54]) to RGB channels, as shown in Figure 4 with its corresponding ground truth.For a fair comparison with [18], in this study we also used the same samples for training, which are randomly chosen from the ground truth (2000 pixels per class) as 1% − 2 % of the whole ground truth, corresponding to 0.08% of the entire data.In MGD mode, coordinates are projected to the ground range, and each pixel is represented with its magnitude only, where the phase information is lost.However, MGD and RE provide speckle noise reduction.Because of the aforementioned reason for Po Delta data, this data is also downscaled by 2 × 2. Ground truth of this data is also manually constructed as explained before by using [56] as a reference and optical image data.It is the same GTD used also in [18] and consists of six classes which are urban fabric and industrial as man-made terrain types, arable land, pastures, forest, and inland waters as natural terrain types.The GTD is shown in Figure 5 by assigning distinct RGB values to each terrain class.In our experimental setups, we have used randomly chosen 1000 pixels and 100,000 pixels per class for training and testing (train/test ratio: 0.01), respectively, which was followed by the competing method [18].by 2 × 2. Ground truth of this data is also manually constructed as explained before by using [56] as a reference and optical image data.It is the same GTD used also in [18] and consists of six classes which are urban fabric and industrial as man-made terrain types, arable land, pastures, forest, and inland waters as natural terrain types.The GTD is shown in Figure 5 by assigning distinct RGB values to each terrain class.In our experimental setups, we have used randomly chosen 1000 pixels and 100,000 pixels per class for training and testing (train/test ratio: 0.01), respectively, which was followed by the competing method [18].

Experimental Setup
Due to radiometrically enhanced multi-look ground range processing of the dual polarized TerraSAR-X image, speckle filtering is not performed for this dataset, whereas the Po Delta COSMO-SkyMed single polarized image is filtered for speckle noise removal.Hyper-parameters of the proposed adaptive CNN are selected by using 50% of the training data as the validation set.
Implementation of the proposed 2D CNN is done using C++ with MS Visual Studio 2015 in 64bit.Although this is not a GPU based implementation, multithreading is possible with Intel ® OpenMP API with a shared memory.Overall, all experiments in this work performed by an I7-4790 CPU at 3.6GHz (4 real, 8 logical cores) with 16 GB memory.Experiments with Xception and Inception-ResNet-v2 are performed using Keras [57] with Tensorflow [58] backend in Python.We use a workstation with four Nvidia® TITAN-X GPU cards, 128 GB system memory, and Intel ® Xeon(R) CPU E5-2637 v4 at 3.50GHz.

Experimental Setup
Due to radiometrically enhanced multi-look ground range processing of the dual polarized TerraSAR-X image, speckle filtering is not performed for this dataset, whereas the Po Delta COSMO-SkyMed single polarized image is filtered for speckle noise removal.Hyper-parameters of the proposed adaptive CNN are selected by using 50% of the training data as the validation set.
Implementation of the proposed 2D CNN is done using C++ with MS Visual Studio 2015 in 64-bit.Although this is not a GPU based implementation, multithreading is possible with Intel ®OpenMP API with a shared memory.Overall, all experiments in this work performed by an I7-4790 CPU at 3.6GHz (4 real, 8 logical cores) with 16 GB memory.Experiments with Xception and Inception-ResNet-v2 are performed using Keras [57] with Tensorflow [58] backend in Python.We use a workstation with four Nvidia®TITAN-X GPU cards, 128 GB system memory, and Intel ®Xeon(R) CPU E5-2637 v4 at 3.50 GHz.
The CNN network is configured with a single hidden CNN and MLP layers with 3 × 3 convolution filters.The subsampling factor is two for the CNN layer.Due to its compactness, even with a limited training set, over-fitting does not pose any threat during training, and, therefore, we have only used the maximum number of training iterations as the sole early stopping criterion, which is 200 for both datasets.The convergence curve of the network with the proposed configuration is given in Figure 6.In the figure, half of the training data is used for the validation for both datasets.Since the training data is limited (<0.08 % and <0.076 % of the Po Delta and Dresden data), it is hard to draw a conclusion regarding the convergence of the network.However, the figure demonstrates that the proposed compact configuration is able to converge within 200 iterations, and over-fitting does not occur during the training process.One fact that, because of dynamic adaption of the learning rate ε, its initial value is not a game changer during the BP process, though, we set initially as 0.05.For instance, MSE is watched during the training process, and if it drops in the current iteration, then ε increases by 5%.On the other hand, it is decreased by 30% for the contrary case in the next iteration.
is 200 for both datasets.The convergence curve of the network with the proposed configuration is given in Figure 6.In the figure, half of the training data is used for the validation for both datasets.Since the training data is limited (<0.08 % and <0.076 % of the Po Delta and Dresden data), it is hard to draw a conclusion regarding the convergence of the network.However, the figure demonstrates that the proposed compact configuration is able to converge within 200 iterations, and over-fitting does not occur during the training process.One fact that, because of dynamic adaption of the learning rate ε, its initial value is not a game changer during the BP process, though, we set initially as 0.05.For instance, MSE is watched during the training process, and if it drops in the current iteration, then ε increases by 5%.On the other hand, it is decreased by 30% for the contrary case in the next iteration.

Results and Performance Evaluations
The test and performance evaluation of the proposed systematic approach for classification of SAR data are performed over each benchmark dataset.The comparative evaluations against the stateof-the-art method in [18] are performed in terms of overall classification accuracy, and in particular, we report each individual performance improvement per terrain class.As discussed earlier, the improvements are analyzed both quantitatively by classification accuracy and qualitatively by visual inspection.Lastly, to compare the proposed approach with deep CNNs, two recent state-of-the-art deep learners, Xception and Inception-ResNet-v2 [36,37], will be used.

Performance Evaluations over Po Delta Data
The Po Delta data consists of six classes and has an emphasis on natural classes.We have varied the sliding window size N from 5 × 5 to 27 × 27 to investigate its effect on the classification accuracy and the ability to produce finer details in segmentation masks.Hence, the overall classification accuracies are presented in Table 3 with different settings of N and different number of channels.The results clearly indicate that using only HH backscattering coefficient, the proposed approach with the adaptive 2D CNN outperforms the best performance of the state-of-the-art method [18] with a significant gap (>10%), despite the fact that the competing method uses higher dimensional (>200-D)

Results and Performance Evaluations
The test and performance evaluation of the proposed systematic approach for classification of SAR data are performed over each benchmark dataset.The comparative evaluations against the state-of-the-art method in [18] are performed in terms of overall classification accuracy, and in particular, we report each individual performance improvement per terrain class.As discussed earlier, the improvements are analyzed both quantitatively by classification accuracy and qualitatively by visual inspection.Lastly, to compare the proposed approach with deep CNNs, two recent state-of-the-art deep learners, Xception and Inception-ResNet-v2 [36,37], will be used.

Performance Evaluations over Po Delta Data
The Po Delta data consists of six classes and has an emphasis on natural classes.We have varied the sliding window size N from 5 × 5 to 27 × 27 to investigate its effect on the classification accuracy and the ability to produce finer details in segmentation masks.Hence, the overall classification accuracies are presented in Table 3 with different settings of N and different number of channels.The results clearly indicate that using only HH backscattering coefficient, the proposed approach with the adaptive 2D CNN outperforms the best performance of the state-of-the-art method [18] with a significant gap (>10%), despite the fact that the competing method uses higher dimensional (>200-D) composite feature vector (color + texture + HH).For a more fair comparison, if both methods use the same information (i.e., with only HH channel), the performance gap between the proposed approach and [18] exceeds 40%.When additional input information is used in the proposed approach, further performance improvements can be achieved.For instance, when HSI components are used as distinct input channels, around 1% improvement in the accuracy is obtained with parameter N = 25, which is the optimal window size.However, with any N setting which is higher than 9, the proposed approach can achieve >75% accuracy.One can also observe the fact that the classification performance is not improving with four-channel setup after N = 25.
Furthermore, confusion matrix is given in Table 4.It can be observed from the confusion matrix that the most confused terrain types are maritime wetlands and marine waters.This is expected, since they are not even distinguishable with the human eye, and have similar characteristics.Additionally, for a detailed comparison, the classification accuracy for each terrain type is presented in Figure 7.In the figure, the blue two bar plots on the right display the best results obtained by the state-of-the-art method using HH channel and 208-D features in [18], whereas the bar plots on the left represent the results for the proposed method with one-channel and four-channel setups.While the classification performance of each terrain type is improved, a significant performance gap occurs e.g., for inland waters, maritime wetlands, marine waters, and arable land.Notably, the classification performance of some terrain types such as wetland is improved by >20%, which justifies the earlier argument that manually selected features cannot provide the same discrimination power for all classes, whilst the proposed adaptive CNN can "learn to extract" such features.Since, in the competing method, a significant performance gap occurs among the terrain types, the reliability eventually becomes a serious issue in [18] whereas the proposed method can always achieve >80% accuracy for any terrain type.
Remote Sens. 2019, 11, x FOR PEER REVIEW 11 of 27 [18].Moreover, it can be said that arable land is misclassified as urban fabric and forest in general by [18].Comparison of Figure 9 with Figure 8 further reveals that the classification performance for each terrain type is highly improved by the proposed method.This is also confirmed by a detailed visual evaluation over the zoomed section shown in Figure 10.Accordingly, the classification performance of each class (especially urban fabric, forest and arable land) is improved and the segmentation noise (error) has been removed almost entirely.For visual evaluation, the final segmentation masks for Po Delta SAR data are given in Figure 8 with their corresponding overlaid regions with the ground truth.The previous quantitative analysis made based on the overall accuracies in Table 3 has shown that using larger window sizes generally increases the overall accuracy.However, an important observation from the final segmentation masks in Figure 8 is that there is a trade-off between choosing the quantitatively and qualitatively good results.Consider that the optimal window size is 25 × 25 in Table 3, whereas the segmentation mask suffers from the sliding window artifacts for this case in Figure 8.On the other hand, the setup with 11 × 11 pixels window can achieve finer details in the final mask.The overlaid regions in the figure can be directly compared with overlaid regions of the competing method in Figure 9. Hence, Figure 9 shows that forest class is mostly confused with urban fabric by the competing method in [18].Moreover, it can be said that arable land is misclassified as urban fabric and forest in general by [18].Comparison of Figure 9 with Figure 8 further reveals that the classification performance for each terrain type is highly improved by the proposed method.This is also confirmed by a detailed visual evaluation over the zoomed section shown in Figure 10.Accordingly, the classification performance of each class (especially urban fabric, forest and arable land) is improved and the segmentation noise (error) has been removed almost entirely.

Performance Evaluations over Dresden Data
The Dresden data has also six classes, but it has more man-made structures compared to the Po Delta.For the performance evaluation, we have again varied the window size N from 5 to 27 to

Performance Evaluations over Dresden Data
The Dresden data has also six classes, but it has more man-made structures compared to the Po Delta.For the performance evaluation, we have again varied the window size N from 5 to 27 to investigate its effect on the classification performance.The overall classification accuracies are presented in Table 5.Note that on this dataset, the best accuracy achieved is 81.33% with the window size, 21 × 21 pixels using only VH/VV channels as two-channel input, and the accuracy starts to decrease after N = 21 This reveals the advantage of using such small window sizes in the classification performance.The competing method [18] can also achieve the top performance around 81%-82% using 209-D features (VH/VV + color + texture).In a more fair comparison, when both methods use the same SAR information (VH/VV channels as two-channel input) the proposed approach achieves a significant performance gap greater than 30%.The confusion matrix of six classes given in Table 6 shows that urban fabric is confused mostly by industrial terrain type, while pastures confused with arable land.This is also expected because of similarities between those terrain types, and this may reveal the fact that the multi-label classification would be possible for these terrains in this dataset.For a detailed comparison, the classification accuracy for each terrain type is plotted in Figure 11.The performance of the proposed approach with two-channel input is compared against the best results obtained by the competing method using the composite 209-D features.The proposed approach achieves similar or better classification accuracy except for inland water terrain type.Again, for a more fair comparison where each method uses the same information (i.e., only VH/VV channels), the performance gap becomes substantial and exceeds 50% for urban and industrial classes.The confusion matrix of six classes given in Table 6 shows that urban fabric is confused mostly by industrial terrain type, while pastures confused with arable land.This is also expected because of similarities between those terrain types, and this may reveal the fact that the multi-label classification would be possible for these terrains in this dataset.For a detailed comparison, the classification accuracy for each terrain type is plotted in Figure 11.The performance of the proposed approach with two-channel input is compared against the best results obtained by the competing method using the composite 209-D features.The proposed approach achieves similar or better classification accuracy except for inland water terrain type.Again, for a more fair comparison where each method uses the same information (i.e., only VH/VV channels), the performance gap becomes substantial and exceeds 50% for urban and industrial classes.Segmentation masks of the proposed approach and their corresponding overlaid regions with the ground truth for the Dresden data are given in Figure 12.As before, overlaid regions can be compared with the results of the competing method in Figure 13.Similar observations can be made, i.e., considering the best classification results of both methods using overlaid regions over the corresponding ground truth as illustrated in the figures.In Figure 14, it is clear that the competing method suffers from a high classification noise, whereas the proposed approach has significantly reduced the noise level especially for forest, pastures, and arable land classes.It is worth to mention that the best window size N is 21 for the Dresden data, and the classification accuracy is not increasing with larger values of N.
i.e., considering the best classification results of both methods using overlaid regions over the corresponding ground truth as illustrated in the figures.In Figure 14, it is clear that the competing method suffers from a high classification noise, whereas the proposed approach has significantly reduced the noise level especially for forest, pastures, and arable land classes.It is worth to mention that the best window size N is 21 for the Dresden data, and the classification accuracy is not increasing with larger values of N.

Deep versus Compact CNNs
Two recent deep CNNs, Xception and Inception-ResNet-v2 [36,37], are compared against the proposed method.As mentioned before, such deep CNNs are not suitable for small dimensional window sizes as we use in this study.The input dimensions should be at least 71 × 71 and 75 × 75 for Xception and Inception-ResNet-v2 models because of their depths.To address this limitation, we have to up-sample each input patch by bilinear interpolation.
For the training, both networks are trained from scratch using stochastic gradient descent as optimizer with the following training parameters: batch size: 32; momentum: 0.9; initial learning rate: 0.045; and the learning rate is decayed with the rate of 0.94 every two epochs.Overall, Xception has been able to converge after 400 epochs, while Inception-Resnet-v2 has needed 160 epochs.Furthermore, we apply transfer learning to investigate if the pre-trained versions of the networks by ImageNet dataset [41] will increase the classification performance.During the training process of transfer learning approach, we first freeze those ImageNet layers and train randomly initialized first convolutional layers and MLP layers of Xception and Image-Resnet-v2 networks for 25 epochs using the above training parameters.After this process accomplished, we fine-tune all layers of the networks with a smaller learning rate as 0.001, with 0.94 decaying factor every two epochs for 75 epochs.
Finally, the classification accuracies of Xception, Inception-Resnet-v2, and the proposed compact CNN approach are given in Table 7 and Table 8 for the Po Delta and Dresden datasets, respectively.

Deep versus Compact CNNs
Two recent deep CNNs, Xception and Inception-ResNet-v2 [36,37], are compared against the proposed method.As mentioned before, such deep CNNs are not suitable for small dimensional window sizes as we use in this study.The input dimensions should be at least 71 × 71 and 75 × 75 for Xception and Inception-ResNet-v2 models because of their depths.To address this limitation, we have to up-sample each input patch by bilinear interpolation.
For the training, both networks are trained from scratch using stochastic gradient descent as optimizer with the following training parameters: batch size: 32; momentum: 0.9; initial learning rate: 0.045; and the learning rate is decayed with the rate of 0.94 every two epochs.Overall, Xception has been able to converge after 400 epochs, while Inception-Resnet-v2 has needed 160 epochs.Furthermore, we apply transfer learning to investigate if the pre-trained versions of the networks by ImageNet dataset [41] will increase the classification performance.During the training process of transfer learning approach, we first freeze those ImageNet layers and train randomly initialized first convolutional layers and MLP layers of Xception and Image-Resnet-v2 networks for 25 epochs using the above training parameters.After this process accomplished, we fine-tune all layers of the networks with a smaller learning rate as 0.001, with 0.94 decaying factor every two epochs for 75 epochs.
Finally, the classification accuracies of Xception, Inception-Resnet-v2, and the proposed compact CNN approach are given in Tables 7 and 8 for the Po Delta and Dresden datasets, respectively.It is clear that deep CNNs cannot outperform such compact model with limited training data.Additionally, the ability to work with small window sizes are clearly shown in Table 8 for the Dresden data.There is >20% performance gap between the deep CNNs and the proposed approach in the classification accuracy for 5 × 5 window size.On the one hand, the proposed approach cannot outperform Inception-ResNet-v2 with 27 × 27 window size over Dresden data.However, such large window sizes do not produce high-detailed segmentation masks as shown in the previous sub-sections.Moreover, the performance comparison between deep and the proposed compact CNNs is performed over each class by calculating class specific precision, recall and F1 scores for the Po Delta and Dresden data as presented in Tables 9 and 10, respectively.As the tables demonstrate, the proposed approach is able to produce at least 0.77 and 0.72 F1 Scores for Po Delta and Dresden, respectively, where the deep CNNs fail for the certain classes.This means that the reliability of the proposed system is achieved for all classes by the proposed approach, and the overall system reliability is further motivated in Table 11.As given in Table 11, the kappa coefficient of the proposed approach is higher than the deep CNN methods.

Sensitivity Analysis on Hyper-Parameters
Besides the major parameter, N, the proposed adaptive CNN topology has two additional hyper-parameters: the number of hidden layers and hidden neurons in each layer.In the aforementioned experiments, we use a CNN configuration that has one hidden CNN and one MLP layer, with 20 and 10 hidden neurons, respectively.Note that the input and output layer sizes are determined by the underlying classification problem.In this section, we will analyze the network hyper-parameter sensitivity by varying them significantly.For this purpose, we define m as the multiplier for the number of neurons in the hidden layers and n as the multiplier for the number of hidden CNN layers, respectively.Hence, we vary the network hyper-parameters by keeping the best window size for each SAR data unchanged.For example, m = n = 1 corresponds to the default setup, and if m = 4 and n = 2, the network would have two hidden CNN layers with 4 × 20 = 80 neurons each, and 4 × 10 = 40 neurons for hidden MLP layer (i.e., [In-80-80-40-Out]).The classification accuracies provided in Table 12 demonstrate that when the network configuration varies, the accuracy varies slightly in a margin around ±3%.This shows that the proposed approach is robust to variations of the network hyper parameters in general.Moreover, increasing the number of hidden layers slightly degrades the performance whilst the best classification performance is achieved with a single hidden CNN/MLP The second convolution of a BP iteration given in Equation (A14) is performed between the current layer output, s l k , and next layer delta error, ∆ l+1 i where wl l = xl l+1 − sl l .The number of additions and multiplications of each connection will be, wl l and wl l xl l+1 2 , respectively.Similarly, the total number of multiplications and additions coming from the second convolution will be: Finally, the total number of operation during a BP iteration will be T FP mul + T BP1 mul + T BP2 mul and T FP add1 + T FP add2 + T BP1 add1 + T BP1 add2 + T BP2 add1 + T BP2 add2 , respectively.Apparently, the overall complexity of the network consists of mostly the first part rather than the additions, and contribution from the MLP side of the network is insignificant.Because, only a scalar multiplication and addition are performed for each connection and the computational complexity of MLPs is well-studied [59].
Considering inference complexity of the network, T  4)- (6).Consequently, the total number of operations would equal to 900K which requires only 0.0009 GFlops computing sources and shows the suitability of such compact networks for real-time applications.This is substantially smaller than typical deep CNNs, compared that Xception has 8380 M and Inception-ResNet-v2 has 13,200 M number of operations [60].Examples can be extended: GoogleNet's [61] every inception modules has greater than 50M operations, in total it has 1501 M operations, and Residual nets [62] proposed originally to have lower complexity using short-cuts even though it has more layers than typical deep nets.However, their network still needs 1.8 GFlops to 11.3 GFlops for 18-layers to 152-layers.

Conclusions and Future Works
In this study, we propose a novel approach for fast and accurate classification of single-and dual-polarized SAR intensity data especially when the labeled data is scarce (e.g., <0.1% of the SAR data).For this purpose, the proposed system is designed using adaptive and compact CNNs which are capable of learning from such limited training data with small-size patches.The proposed system exhibits crucial advantages over conventional and Deep Learning (DL) methods: first, the computational complexity is greatly reduced, making the proposed algorithm more efficient and more suitable for real-time deployment.Second, the accuracy and the details in the segmentation mask are both improved since the correlation between pixels in high-resolution SAR images tends to decrease when the patch size increases.Finally, as opposed to existing DL methods which require the use of a large partition of SAR data for training, the proposed approach only uses a small amount of (labelled) SAR data for training (e.g., <0.1%) and hence minimizes the labor and cost intensive labelling process for accurate classification.The latter deficiencies also prevent direct comparison of the proposed approach against the DL methods.The recent state-of-the-art method, [18], which can also function with minimum training and small patches, is proposed to extract color and texture features to have a very high dimensional composite feature vector that is learned by a large ensemble of classifiers.The proposed approach was shown to achieve a superior classification performance using only one compact classifier with one to four input EM channels.Therefore, such an approach voids entirely the pre-processing step of feature extraction.Finally, the comparative evaluations further reveal that the proposed approach has an improved inter-class classification reliability and significantly less or no classification (segmentation) noise when tested over two X-band datasets.In the proposed method, the patch size N is an important hyper-parameter, where, N > 19 achieves the highest classification accuracy in both datasets.The sensitivity analysis over the hyper-parameters of the network also demonstrates that a good level of robustness is achieved against their variations.As for future work, further investigations will be performed over other datasets.Moreover, we believe that certain visual features, if properly adapted to the input layer of the CNN, can be used to further improve the classification accuracy.For this purpose, we will investigate the use of color and texture features as well as EM channels together.According to the basic rule of BP, if the kth neuron at the current layer l is connected to the next layer to obtain its ith input with weight w l ki , then the next layer's delta ∆ l+1 l and the same weight will together form ∆ l k of the neuron in the current layer l.This means: where, E is the mean-square-error (MSE).Specifically: where: Let us focus on a single pixel's contribution of the output s l k (m, n), to pixels x l+1 i (m, n) (assuming a 3 × 3 kernel for simplicity), since obtaining the derivative from the convolution is obviously not straightforward. x In Figure A2, the contribution of an output pixel, s l k (m, n), over two pixels of the input map at the next layer's ith neuron, x l+1 i (m − 1, n − 1) and x l+1 i (m + 1, n + 1) is indicated.The delta of s l k (m, n) can be now written by treating the pixel as a MLP neuron and has a connection to the next layer, therefore, the rule of BP can be expressed as follows: If we generalize it for all pixels of ∆s l k : After performing the BP from the next layer, l+1, to the current layer, l, we should continue to back-propagate it to the input delta.If the up-sampled version of the map with zero order is represented as us = up , (s ), then the input delta is obtained as follows: where  = (, ) since each pixel of the output  was obtained by averaging ssx.ssy number of pixels of the intermediate output  .If "maximum pooling" is used as down-sampling, then Equation (A8) should be adjusted accordingly.After performing the BP from the next layer, l+1, to the current layer, l, we should continue to back-propagate it to the input delta.If the up-sampled version of the map with zero order is represented as us l k = up ssx,ssy s l k , then the input delta is obtained as follows: where β = (ssx, ssy) −1 since each pixel of the output s l k was obtained by averaging ssx.ssy number of pixels of the intermediate output y l k .If "maximum pooling" is used as down-sampling, then Equation (A8) should be adjusted accordingly.Appendix A.2.3.BP from the First MLP Layer to the Last Convolutional Layer As illustrated in Figure A3, the last hidden convolutional layer is connected to the first MLP layer, and hence it produces scalar output, s l k .In other words, this particular layer's sub-sampling factors are adjusted depending on the input map dimensions, i.e., (ssx = 8, ssy = 8) is chosen as in the figure, to always produce scalars.Correspondingly, each weight, w l ki , of this layer is also scalarly the same as a regular MLP and performing multiplication.Thus, the BP as the scalar case is demonstrated as follows in Equation (A9) from MLP layer to the CNN layer.SS (8,8) US (8,8) kth neuron The rule of thumb update rule of the BP states that the sensitivity of the weight connecting the k th neuron in the current layer to the i th neuron in the next layer depends on the output of the current layer neuron and the delta of the next layer neuron.For hidden neurons in a convolutional layer, a similar approach is followed to compute the weight and bias sensitivities.The rule of thumb update rule of the BP states that the sensitivity of the weight connecting the kth neuron in the current layer to the ith neuron in the next layer depends on the output of the current layer neuron and the delta of the next layer neuron.For hidden neurons in a convolutional layer, a similar approach is followed to compute the weight and bias sensitivities. (0,0) = . .+  (0,0) (0,0) +  (0,1) (0,1) +  (1,0) (1,0)+..  (0,1) = . .+  (0,0) (0,1) +  (0,1) (0,2) +  (1,0) (1,1)+..  (1,0) = . .+  (0,0) (1,0) +  (0,1) (1,1) +  (1,0) (2,0)+.. illustrates the evaluation of the input of the ith neuron, x l+1 i , at the next layer l + 1 with the output, s l k , and weight, w l ki , of the current layer.One can write that each weight element's contribution over the output:

(A13)
Due to the fact that each weight contributes to each neuron input, x l+1 i (m, n) as seen in Equation (A13), the derivative is obtained by the summation of delta-output product for all pixels, as the following: As expected, the bias of kth neuron at layer l, b l k , also contributes to every pixel (all pixels in the map use the shared bias).Hence, the bias sensitivity will be the summation of pixel sensitivities in the map as expressed in Equation (A15).

∂E ∂b
Overall, for each patch in the training set in Figure 3, BP flow is given as follows: (1) Initialize weights (kernels) and biases (e.g., randomly, U(−0.5, 0.5)) of the CNN.
Update: Cumulate the sensitivities in iii and scale with the learning factor, ε, and update the weights and biases as follows: )

Figure 1 .
Figure 1.The proposed classification system for single-and dual-polarized Synthetic Aperture Radar (SAR) intensity data.

Figure 1 .
Figure 1.The proposed classification system for single-and dual-polarized Synthetic Aperture Radar (SAR) intensity data.

Figure 3 .
Figure 3. Training process of the adaptive 2D CNN by the SAR data.

Figure 3 .
Figure 3. Training process of the adaptive 2D CNN by the SAR data.

Figure 4 .
Figure 4. Pseudo color image of Po Delta SAR image (X-band) is obtained from to HSI (Hue, Saturation, and Intensity) channels and given (left) with its corresponding ground truth set (right) with class labels.

Figure 4 .
Figure 4. Pseudo color image of Po Delta SAR image (X-band) is obtained from to HSI (Hue, Saturation, and Intensity) channels and given (left) with its corresponding ground truth set (right) with class labels.4.1.2.Dresden, TerraSAR-X, and X-Band Dresden SAR intensity data has 4419 × 7154 pixels with approximately 4 × 4 meters square pixel resolution.It was acquired in Strip Map mode with dual-polarization (VH/VV), and it is radiometrically enhanced (RE) Multi-Look ground range detected (MGD) with effective number of looks 6.6.In MGD mode, coordinates are projected to the ground range, and each pixel is represented with its magnitude only, where the phase information is lost.However, MGD and RE provide speckle noise reduction.Because of the aforementioned reason for Po Delta data, this data is also downscaled by 2 × 2. Ground truth of this data is also manually constructed as explained before by using[56] as a reference and optical image data.It is the same GTD used also in[18] and consists of six classes which are urban fabric and industrial as man-made terrain types, arable land, pastures, forest, and inland waters as natural terrain types.The GTD is shown in Figure5by assigning distinct RGB values to each terrain class.In our experimental setups, we have used randomly chosen 1000 pixels and 100,000 pixels per class for training and testing (train/test ratio: 0.01), respectively, which was followed by the competing method[18].

Figure 5 .
Figure 5. Pseudo color image of Dresden SAR image (X-band) is constructed by assigning backscattering coefficients VV, VV-VH, VV to R, G, and B channels (left) and its corresponding ground truth set is given (right) with class labels.

Figure 5 .
Figure 5. Pseudo color image of Dresden SAR image (X-band) is constructed by assigning backscattering coefficients VV, VV-VH, VV to R, G, and B channels (left) and its corresponding ground truth set is given (right) with class labels.

Figure 6 .
Figure 6.Learning curve of the proposed Compact CNNs over Po Delta and Dresden SAR data.

Figure 6 .
Figure 6.Learning curve of the proposed Compact CNNs over Po Delta and Dresden SAR data.

Figure 7 .
Figure 7. Classification performance (recall rates per class) of the proposed and the competing methods for Po Delta data.11 × 11 and 25 × 25 window sizes are used in the proposed approach with single HH channel, where the competing method in [18] uses HH intensity image and a 208-D composite feature vector with HH, color and texture features (HH + CT).

Figure 7 .
Figure 7. Classification performance (recall rates per class) of the proposed and the competing methods for Po Delta data.11 × 11 and 25 × 25 window sizes are used in the proposed approach with single HH channel, where the competing method in [18] uses HH intensity image and a 208-D composite feature vector with HH, color and texture features (HH + CT).

Figure 8 .Figure 9 .
Figure 8.For Po Delta data, segmentation masks with 11 × 11 window size of one-channel, and 25 × 25 window sizes and of one-, and four-channel(s) are shown in (a), (b), and (c), respectively, using the proposed approach.Their corresponding overlaid regions on the ground truth are shown in (d), (e), and (f), respectively.

Figure 8 .Figure 8 .Figure 9 .
Figure 8.For Po Delta data, segmentation masks with 11 × 11 window size of one-channel, and 25 × 25 window sizes and of one-, and four-channel(s) are shown in (a-c), respectively, using the proposed approach.Their corresponding overlaid regions on the ground truth are shown in (d-f), respectively.

Figure 9 .Figure 9 .Figure 10 .
Figure 9. Segmentation masks over ground truth of the competing method in [18] using only 72-D Color features in (a), 207-D Color and Texture in (b), and 208-D, HH and Color + Texture in (c) for Po Delta data.

Figure 10 .
Figure 10.Enlarged regions of ground-truth of Po Delta (a), and corresponding segmentation masks obtained by the competing (b) and the proposed methods (c).

Figure 11 .
Figure 11.Classification performance (recall rates per class) of the proposed and competing state-ofthe-art (soa) methods in [18] for Dresden data.11 × 11, 15 × 15, and 21 × 21 window sizes are used in the proposed approach with dual (VH/VV) channel, where the competing method uses VH/VV, and a 209-D composite feature vector with VH/VV, and color and texture features (VH/VV + CT).

Figure 11 .
Figure 11.Classification performance (recall rates per class) of the proposed and competing state-of-the-art (soa) methods in [18] for Dresden data.11 × 11, 15 × 15, and 21 × 21 window sizes are used in the proposed approach with dual (VH/VV) channel, where the competing method uses VH/VV, and a 209-D composite feature vector with VH/VV, and color and texture features (VH/VV + CT).

Figure 13 .Figure 14 .
Figure 13.Segmentation masks of the competing method in [18] using only 72-D Color features in (a), 207-D Color and Texture in (b), and 209-D, VH/VV and Color + Texture in (c) for Dresden data.Remote Sens. 2019, 11, x FOR PEER REVIEW 16 of 27

Figure 14 .
Figure 14.Enlarged regions of ground-truth of Dresden (a), and corresponding segmentation masks obtained by the competing (b) and the proposed methods (c).

Figure A2 .
Figure A2.Contribution of single output pixel s l k (m, n) to the two pixels at the next layer's x l+1 i with a 3 × 3 kernel.Appendix A.2.2.Intra-BP within a CNN Neuron: ∆ l k ← ∆s l k Equation (A9) from MLP layer to the CNN layer.and intra BP to get: ∆ ∆ is identical to a regular BP for MLPs.

Figure A3 . 4 .
Figure A3.The output of the last hidden convolutional layer-first MLP layer.

Figure A3 . 4 .
Figure A3.The output of the last hidden convolutional layer-first MLP layer.Finally, the sensitivities for the weight and bias, too, are identical to a regular MLPs:

Figure A4 .
Figure A4.Obtaining the input of the ith neuron,     , at the next layer, l + 1 by the convolution of the output of the kth neuron at the current layer,    , and weight,

Figure A4 .
Figure A4.Obtaining the input of the ith neuron, x l+1 i , at the next layer, l + 1 by the convolution of the output of the kth neuron at the current layer, s l k , and weight, w l ki a.For each patch, p, in the train set, DO: i. FP: Forward propagate from the input layer to the output layer to find output of each neuron at each layer, y l i , ∀i ∈ [1, N l ] and ∀l ∈ [1,L].ii.BP: Compute delta error at the output (MLP) layer and back-propagate it to first hidden CNN layer to compute the delta errors, ∆ l k , ∀k ∈ [1, N l ]and ∀l ∈ [2, L − 1].iii.

Table 1 .
SAR Images used in this work.

Table 2 .
Number of classes, number of samples in training and ground truth (GTD).

Table 3 .
Classification accuracy of the proposed approach for Po Delta data with different window sizes and number of channels.The obtained highest accuracies are highlighted in bold.

Table 4 .
Confusion matrix over Po Delta data obtained by the proposed approach using the best setup with window size 25 and four channels (HH -Hue-Sat.-Int.).The number of correctly classified samples per class and in total are highlighted in bold.

Table 4 .
Confusion matrix over Po Delta data obtained by the proposed approach using the best setup with window size 25 and four channels (HH -Hue-Sat.-Int.).The number of correctly classified samples per class and in total are highlighted in bold.

Table 5 .
Classification accuracies of the proposed approach for Dresden data with different window sizes.The highest accuracy (highlighted in bold) is obtained with 21 × 21 window size.

Table 6 .
Confusion matrix over Dresden data obtained by the proposed approach using the best setup with window size 21 and two channels (VH/VV).The number of correctly classified samples per class and in total are highlighted in bold.

Table 5 .
Classification accuracies of the proposed approach for Dresden data with different window sizes.The highest accuracy (highlighted in bold) is obtained with 21x21 window size.

Table 7 .
Classification accuracies of Xception, Inception-Resnet-v2, and the proposed compact CNN over Po Delta data using four-channels (HH, Hue, Sat., Int.) with different window sizes.Layers of Xception* and Inception-ResNet-v2* (except MLP part and the first convolution layer) are initialized with ImageNet weights.The obtained highest classification accuracies are highlighted in bold for different window sizes.

Table 8 .
Classification accuracies of Xception, Inception-Resnet-v2, and the proposed compact CNN over Dresden data using 2-channels (VH − VV) with different window sizes.Layers of Xception* and Inception-ResNet-v2* (except MLP part and the first convolution layer) are initialized with ImageNet weights.The obtained highest classification accuracies are highlighted in bold for different window sizes.

Table 9 .
Performances of each class in terms of Precision, Recall and F1 Score of Xception, Inception-Resnet-v2, and the proposed approach over Po Delta data using the best setup with window size 25 and using four channels (HH -Hue-Sat.-Int.).The obtained highest classification performances in terms of each metric are highlighted in bold for each class.

Table 10 .
Performances of each class in terms of Precision, Recall and F1 Score of Xception, Inception-Resnet-v2, and the proposed approach over Dresden data using the best setup with window size 21 and using two channels (VH/VV).The obtained highest classification performances in terms of each metric are highlighted in bold for each class.

Table 11 .
The classification performance comparison of the proposed approach with Xception and Inception-Resnet-v2 based on Cohen's kappa coefficient (κ) over Po Delta and Dresden data, using window size 25 and 27 with four channels (HH -Hue-Sat.-Int.)and two channels (VH/VV), respectively.The obtained largest kappa coefficients are highlighted in bold for Po Delta and Dresden data.