A Multi-Scale Deep Neural Network for Water Detection from SAR Images in Mountainous Areas

Abstract: Water detection from Synthetic Aperture Radar (SAR) images has been widely utilized in various applications. However, it remains an open challenge due to the high similarity between water and shadow in SAR images. To address this challenge, a new end-to-end framework based on deep learning is proposed to automatically classify water and shadow areas in SAR images. The framework is composed of three parts, namely, Multi-scale Spatial Feature (MSF) extraction, the Multi-Level Selective Attention Network (MLSAN) and the Improvement Strategy (IS). First, the dataset is input to MSF for multi-scale low-level feature extraction via three different methods. Then, these low-level features are fed into MLSAN, which contains an Encoder and a Decoder. The Encoder generates features at different levels using a 101-layer residual network. The Decoder extracts geospatial contextual information and fuses the multi-level features to generate high-level features that are further optimized by the IS. Finally, classification is implemented with the Softmax function. We name the proposed framework MSF-MLSAN; it is trained and tested using millimeter wave SAR datasets. The classification accuracy reaches 0.8382 and 0.9278 for water and shadow, respectively, while the overall Intersection over Union (IoU) is 0.9076. MSF-MLSAN demonstrates the success of integrating SAR domain knowledge with state-of-the-art deep learning techniques.


Introduction
Detecting water bodies from Synthetic Aperture Radar (SAR) images has been a very active research field [1]. Detection results are widely applied to reduce errors in SAR phase unwrapping, monitor flooding and assess damage, track water storage changes over various periods, investigate the growth and shrinkage of wetlands and delineate shoreline movement [2][3][4]. Due to the speckle noise in SAR images and the confusing characteristics of other objects (e.g., shadows), automatic water body detection from SAR images with high precision remains an open challenge. Karvonen et al. [5] proposed a semi-automated classification algorithm for water areas using Radarsat-1 SAR imagery, in which the average accuracy was merely around 70%. Moreover, Martinis et al. [6] compared the advantages and disadvantages of four semi-automatic/automatic water detection algorithms, which achieved similar overall accuracy (i.e., ~85%) in their evaluation experiments. Their study emphasized the importance of auxiliary datasets (e.g., digital elevation models) and parameter tuning in reaching decent accuracy. Other semi-automatic/automatic water detection approaches have also been explored but their performance is unsatisfactory [7][8][9].
Shadow poses one of the most important challenges in water detection, especially in mountainous regions [10]. Both water and shadow appear as dark areas in SAR images (i.e., low gray values), which often results in false detection of water bodies. To achieve higher accuracy in automatic water detection, simultaneous SAR image classification of water and shadow is employed in this paper. SAR image classification aims to identify different targets in SAR images and is being ameliorated considerably by deep learning techniques [11]. Deep learning mimics the workflow of the human brain using artificial neural networks that combine low-level features to form higher-level features for object detection, language processing and decision-making [12]. Krizhevsky et al. [13] adopted a convolutional neural network (CNN) to achieve a 10% accuracy improvement in the ImageNet contest. This was the first time deep learning outperformed traditional handcrafted features with shallow models in object detection. From then on, deep learning has bloomed in computer vision research, such as object detection [14] and semantic segmentation [15]. The success of deep learning has inspired us to design and develop a special neural network to address the challenge of simultaneous water and shadow classification in this paper.
The study of SAR image classification using deep learning techniques is developing rapidly. Zhu et al. [16] summarized the progress of deep learning in remote sensing and highlighted research challenges of employing deep learning in the analytics and interpretation of SAR images. Hänsch et al. [17] proposed the first application of the Complex-valued CNN (C-CNN) for object recognition in Polarimetric SAR (PolSAR) data. Though the architecture had only a single convolutional layer, its results surpassed those of a standard CNN. Song et al. [18] proposed an efficient and accurate water area classification method, which required precise Digital Elevation Model (DEM) and Digital Surface Model (DSM) datasets. Huang et al. [9] investigated a tree-based automatic classification approach to analyze surface water using Sentinel-1 SAR data. However, this method was problematic in distinguishing shadow and water areas. Therefore, how to employ deep learning techniques without auxiliary datasets for water detection becomes a major challenge to solve.
We seek solutions from SAR domain knowledge and geospatial analytics. Although there is great advancement in deep learning to perform image classification, domain knowledge in SAR image analytics has not been integrated into the design and development of deep neural networks. Because most deep learning techniques (e.g., CNN) are designed for the classification of daily objects (e.g., cats and dogs), image features extracted by generic neural networks cannot provide satisfactory performance in water and shadow classification due to the high similarity between the two [11]. Meanwhile, the high computational cost of deep neural network training has also been noticed [16]. To address these problems, we embed prior SAR image feature extraction algorithms into the architecture of the deep neural network to achieve higher classification accuracy and lower training costs.
Meanwhile, geospatial contextual information should also be considered in bridging deep learning and SAR image analytics. Notti et al. [7] assessed the potential and limitations of flood mapping using various satellite imagery datasets, including MODIS, Proba-V, Landsat, Sentinel-2 and Sentinel-1. They emphasized that geospatial contextual information is essential in the automatic detection of inundated areas but had not been given enough attention. Contextual image features usually encode spatial structures of geographic objects, which play pivotal roles in classifying similar objects [19]. Although multi-level contextual image features can be extracted by the spatial pyramid pooling technique within an image tile [20], there is no guarantee that the given image tile covers the right geospatial extent of the target objects. This further echoes the necessity of merging geospatial analytics with deep learning.
In this paper, we propose a new end-to-end framework called the Multi-scale Spatial Feature Multi-Level Selective Attention Network (MSF-MLSAN), which builds on state-of-the-art deep learning techniques, SAR domain knowledge and geospatial context analytics to implement water and shadow classification. MSF-MLSAN contains three parts. The first part is Multi-scale Spatial Feature (MSF) extraction, which extracts multi-scale, low-level SAR image features. The second part is the Multi-Level Selective Attention Network (MLSAN), which extracts and refines middle-level and high-level SAR image features; it contains an Encoder module and a Decoder module. The third part is the Improvement Strategy (IS), comprising score map weighting and a splicing strategy to optimize the simultaneous classification results. The contribution of this paper is two-fold. On the one hand, we propose MSF-MLSAN, which integrates existing domain knowledge in SAR image analytics with deep learning techniques, as a successful approach to achieve simultaneous classification of water and shadow areas. On the other hand, the pivotal role of geospatial context is recognized in the design of SAR-specific neural networks to implement the automatic detection of geographic objects with high similarities in SAR images.
The rest of this paper is organized as follows: the methodology of our proposed neural network, MSF-MLSAN, for simultaneous classification of water and shadow is presented in Section 2. Section 3 introduces the datasets, performance assessment and result interpretation. The improvement and extension of MSF-MLSAN are discussed in Section 4. Finally, the conclusion is given in Section 5.

Foundation
Great progress has been achieved in classifying daily objects with deep learning techniques. For example, RefineNet, which combines image features from different resolutions using multi-path refinement, has improved the intersection-over-union (IoU) score on the PASCAL VOC 2012 dataset to 83.4 [21]. However, the lack of geospatial contextual information remains an open challenge in processing remote sensing images, since geographic objects present much larger scale variance than daily objects and routinely require contextual information spanning a number of image tiles to detect [22]. In this paper, we employ the attention mechanism and spatial pyramid pooling to handle geospatial contextual information.
(1) Attention mechanism
The attention mechanism is inspired by the human visual system [23]. It scans the whole image to detect important regions, to which more attention needs to be assigned; more attention means that features from these regions receive larger weights in classification. In recent years, the attention mechanism has been successfully applied in many fields, such as natural language processing [24] and image classification [25].
Based on this mechanism, the Squeeze-and-Excitation Network (SENet) was proposed and won first place in the ILSVRC 2017 classification competition, achieving a ~25% relative improvement over the winner of 2016 [26]. It first performed a Squeeze operation on the feature map obtained by convolution to extract channel-level features, then employed an Excitation operation on these features to learn the relationships among channels as assigned weights, which were used to calculate the final features for classification.
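The squeeze-excitation-reweight pipeline can be sketched in a few lines (a minimal numpy sketch; the `se_block` helper and the layer shapes are our illustrative assumptions, not SENet's published implementation):

```python
import numpy as np

def se_block(feature_map, w1, b1, w2, b2):
    """Sketch of a Squeeze-and-Excitation block.

    feature_map: (H, W, C) array; w1/b1 and w2/b2 are the weights of the
    two fully connected layers of the Excitation step (shapes here are
    illustrative, not trained values).
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = feature_map.mean(axis=(0, 1))            # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gives channel weights.
    s = np.maximum(z @ w1 + b1, 0.0)             # (C/r,)
    s = 1.0 / (1.0 + np.exp(-(s @ w2 + b2)))     # (C,) weights in (0, 1)
    # Reweight: scale every channel by its learned importance.
    return feature_map * s[None, None, :]
```

With all-zero excitation weights, every channel weight is sigmoid(0) = 0.5, so the block simply halves the input; trained weights instead emphasize informative channels and suppress redundant ones.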
(2) Pyramid pooling
While SENet can extract channel-wise geospatial contextual information from feature maps, pyramid pooling [20] techniques have also been successful in extracting contextual information from the spatial structure of feature maps. Generally, the size of a receptive field determines how much contextual information can be extracted using pyramid pooling. Since geographic objects usually cover larger spatial extents than the size of receptive fields, we combine pyramid pooling with Global Attention Pooling (GAP) [27] to capture contextual information with high scale variance.

The Framework of MSF-MLSAN
MSF-MLSAN, the proposed framework for water classification from SAR images, is shown in Figure 1. MSF-MLSAN mainly contains three parts: Multi-scale Spatial Feature (MSF) extraction, the Multi-Level Selective Attention Network (MLSAN) and the Improvement Strategy (IS).
First, SAR imagery datasets are input to the MSF module. MSF extracts three prior SAR image features from the input images, namely, the Gabor feature, the Gray-Level Gradient Co-occurrence Matrix (GLGCM) feature and the Multi-scale Omnidirectional Gaussian Derivative Filter (MSOGDF) feature. Each of these three types of low-level features is further fused into a group of three-channel feature maps, so three groups of feature maps are obtained. In addition, a fourth group of three-channel feature maps is generated from the SAR images themselves. The four groups of three-channel feature maps are then input to MLSAN to extract corresponding high-level features, which produces four Score Maps (SMs). The SMs are processed by the Improvement Strategy to generate the final maps for classification. Details of each part of MSF-MLSAN are delineated in the following discussion.

Multi-Scale Spatial Features (MSF)
MSF aims to acquire multi-scale spatial features from SAR images by integrating prior SAR image feature extraction algorithms with deep neural networks. Different features of objects can be enhanced by transformations of the original SAR images, which are utilized to achieve better classification accuracy. In this paper, we employ prior SAR feature extraction methods, namely GLGCM, the Gabor transformation and MSOGDF, to extract low-level features, which are then fused into higher-level features.

GLGCM Extraction
GLGCM can effectively exploit the grayscale and the gradient of adjacent pixels in SAR images and generate various statistical characteristics, of which 15 are commonly used. We select the grayscale mean, gradient mean and grayscale mean square error in this paper, following Reference [11].

Gabor Transformation
The Gabor transformation is a kind of windowed Fourier transformation. It can extract relevant features at different scales and directions in the frequency domain. The Gabor transformation resembles the biological function of human eyes and is often used to recognize the texture of targets. Four characteristics are usually extracted from four directions; we select the two characteristics at 45° and 135° following our previous work [11].
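As a concrete illustration, a real-valued Gabor kernel at a given orientation can be built directly from its definition (a sketch; the wavelength, bandwidth and aspect-ratio values below are our assumptions, since the paper does not publish its filter parameters):

```python
import numpy as np

def gabor_kernel(size, theta_deg, lam=8.0, sigma=4.0, gamma=0.5, psi=0.0):
    """Build a real-valued Gabor kernel oriented at theta_deg degrees.

    lam: carrier wavelength, sigma: Gaussian envelope width,
    gamma: spatial aspect ratio, psi: phase offset (all illustrative).
    """
    theta = np.deg2rad(theta_deg)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates into the filter's orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope modulated by a cosine carrier.
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * x_t / lam + psi)
```

Convolving the SAR image with `gabor_kernel(15, 45)` and `gabor_kernel(15, 135)` would yield the two directional texture responses used as MSF channels.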

MSOGDF
The spatial structure of an image contains important information about its visual characteristics. To acquire such features, the Multi-Scale Omnidirectional Gaussian Derivative Filter (MSOGDF) is introduced, which can generate features in different directions from the image. We heuristically select the features in two directions at a specific scale, namely 45° and 135° [11].

Multi-Scale Space Statistical Features Fusion
Features extracted by GLGCM, Gabor and MSOGDF are concatenated as the input of MLSAN. In this paper, four groups of three-channel fusion feature maps are generated. The first group is produced by concatenating the gray mean, gradient mean and gray mean square error characteristics of the SAR images from GLGCM. The second group is generated by concatenating the 45° feature and the 135° feature from the Gabor transformation with the SAR image. The third group is generated by merging the 45° feature and the 135° feature from MSOGDF with the SAR image. The fourth group is the original SAR images. All four groups of fusion feature maps are ground-referenced.
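The assembly of one such group can be sketched in a few lines (the min-max normalization and the `fuse_three_channel` helper are our assumptions for illustration; the paper does not state how channels are scaled before concatenation):

```python
import numpy as np

def fuse_three_channel(f1, f2, f3):
    """Stack three single-channel feature maps into one three-channel map,
    mirroring how each MSF group is assembled (e.g. the Gabor 45° feature,
    the Gabor 135° feature and the SAR image itself)."""
    def scale(f):
        # Min-max scale each map so no channel dominates by raw magnitude.
        rng = f.max() - f.min()
        return (f - f.min()) / rng if rng > 0 else np.zeros_like(f)
    return np.stack([scale(f1), scale(f2), scale(f3)], axis=-1)
```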

MLSAN
Several classical and widely used networks for image classification have been proposed in recent years, such as FCN [28], SegNet [15], DeepLab [29], RefineNet [21] and PSPNet [20]. These networks have been successfully applied to image classification. However, problems remain, such as poor classification results due to information loss in the pooling operation and inter-class overlapping caused by the lack of multi-scale contextual information [16]. The MLSAN network is proposed to address the geospatial context challenge in SAR image analysis, which is quite different from the classification of daily objects.
MLSAN contains two parts, the Encoder part and the Decoder part, as shown in Figure 1.

The Encoder
The Encoder part is based on residual networks [30] and extracts the middle-level and high-level features of the input images. In this paper, resnet_v2_101 is adopted. The forward operation of the whole network is a process of continuously solving residuals, accompanied by decreasing feature-map resolution and increasing dimensionality. Supposing the size of the input image is 512×512×3, the feature map F0 of size 128×128×64 is first generated by a convolution and a pooling operation. Second, the feature map F1 of size 128×128×256 is generated by Res-1, which includes three residual units [30]. Third, the feature map F2 of size 64×64×512 is output from Res-2 using four residual units. Fourth, the feature map F3 of size 32×32×1024 is generated from Res-3 via 23 residual units. Finally, the feature map F4 of size 16×16×2048 is output from Res-4 through three residual units.
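These stage-by-stage sizes can be checked with a few lines of arithmetic (a sketch; the stride placement follows the sizes quoted above for a standard ResNet-101 backbone):

```python
def encoder_shapes(input_size=512):
    """Trace spatial size and channel count through the ResNet-101 encoder:
    a stride-2 stem convolution plus a stride-2 pooling (total /4), then
    Res-1..Res-4 with spatial strides 1, 2, 2, 2."""
    shapes = []
    size = input_size // 4          # 7x7 stride-2 conv, then 3x3 stride-2 pool
    shapes.append(("F0", size, 64))
    strides = {"F1": 1, "F2": 2, "F3": 2, "F4": 2}
    channels = {"F1": 256, "F2": 512, "F3": 1024, "F4": 2048}
    for name in ("F1", "F2", "F3", "F4"):
        size //= strides[name]
        shapes.append((name, size, channels[name]))
    return shapes
```

For a 512×512×3 input this reproduces exactly the F0 through F4 sizes listed above.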
The residual network is built by stacking residual units. The residual unit is based on the idea of identity mapping, which guarantees the propagation of gradient information to lower layers and enables parameter training throughout the whole network. The resnet_v2_101 used in this paper makes use of a model pre-trained on ImageNet; we follow [31] and use feature migration instead of random initialization of parameters.

The Decoder
The high-level features are usually used to distinguish different categories, while low-level features often offer too many details of the objects. Therefore, we carefully design the decoder to merge and enhance high-level and low-level features.
Four feature maps are generated by the Encoder network, namely, the high-level feature F4 (16×16×2048) from Res-4 and the middle-level features F3 (32×32×1024), F2 (64×64×512) and F1 (128×128×256) from Res-3, Res-2 and Res-1, respectively. To reduce the computational complexity of the Decoder part and the number of redundant features, a convolution is inserted between the encoding and decoding operations to perform dimension reduction. The final dimensions of the feature maps input to the Decoding network are 16×16×512, 32×32×256, 64×64×256 and 128×128×256. The Decoding network mirrors the Encoding network. It uses a bottom-up approach in which the high-level and sub-high-level features output from the encoder network are stepwise merged into new feature maps; meanwhile, the resolution of the feature maps increases while their dimensionality decreases.
All four feature maps (F1, F2, F3 and F4) are input to the decoder modules M1, M2, M3 and M4, respectively (Figure 1). Three modules (M1, M3 and M4) share the same architecture, namely, Feature Map Adaption and Fusion (FMAF) followed by Residual and Attention Pooling (RAP), as shown in Figures 2 and 3. The M2 module contains two parts, namely, FMAF and Pyramid and Attention Pooling (PAP), as depicted in Figure 4.
The FMAF is designed to adjust the size of feature maps and fuse features at different levels. The architecture is shown in Figure 2. The inputs are feature maps at different levels. They are processed by two residual convolutional units (RCUs), one convolution and one up-sampling step for scaling. Finally, they are summed to produce the fusion map.

RAP module
The Residual and Attention Pooling (RAP) module aims to capture geospatial contextual information from a large area and pool the input features through multiple pooling layers. The architecture of RAP is shown in Figure 3. It is composed of a series of pooling modules, each containing one convolution layer and one maximum pooling layer. The output of the previous pooling layer is used as the input to the next pooling layer; therefore, later pooling layers can handle image features over larger areas with small pooling windows. After features are fused through the pooling layers, the attention module weights the fused features, which highlights useful features and suppresses redundant ones. Finally, the weighted features are fused with the initial input features and the result is input to an RCU. In the RAP, four pooling modules are employed. The window size of each pooling layer is 5 × 5 and the stride is 1 × 1.

PAP module
The design of the Pyramid and Attention Pooling (PAP) module is shown in Figure 4. Processed by the pooling layer, the input feature map (i.e., 64×64×256) becomes 1×1×256, 2×2×256, 3×3×256 and 6×6×256. Their dimensions are then reduced by the convolution layer to 1×1×64, 2×2×64, 3×3×64 and 6×6×64. Furthermore, they are all up-sampled to 64×64×64 by bilinear interpolation and concatenated to form one feature map with the initial dimension (64×64×256), which is also weighted by the attention module. It is then concatenated with the input feature map to generate a feature map (64×64×512) and recovered to the initial dimension (64×64×256) by a convolution layer. Finally, the feature map is processed by an RCU to generate features for further processing. The pyramid pooling fuses features at four different scales, including global pooling and local features of different sub-regions, and the attention mechanism is employed to enhance the extraction of contextual information.
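The pyramid pooling step of PAP can be sketched as adaptive average pooling over 1×1, 2×2, 3×3 and 6×6 grids (a minimal numpy sketch; the `adaptive_avg_pool` helper is our illustration, not the authors' implementation):

```python
import numpy as np

def adaptive_avg_pool(fm, bins):
    """Average-pool an (H, W, C) map into (bins, bins, C) by splitting the
    spatial extent into a bins x bins grid of sub-regions, as in the
    pyramid pooling step of the PAP module."""
    h, w, c = fm.shape
    out = np.zeros((bins, bins, c))
    ys = np.linspace(0, h, bins + 1).astype(int)   # row boundaries
    xs = np.linspace(0, w, bins + 1).astype(int)   # column boundaries
    for i in range(bins):
        for j in range(bins):
            out[i, j] = fm[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(0, 1))
    return out

# The four pyramid levels used by PAP on a 64x64x256 input:
# pyramid = [adaptive_avg_pool(fm, b) for b in (1, 2, 3, 6)]
```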
In this paper, the combination of RAP and PAP enhances the network's ability to extract contextual information significantly and also increases the variety of extracted information. By this means, the context of geographic objects has been taken into account in the architecture of deep neural networks. Meanwhile, all modules of the network are built on top of the residual connection, which ensures the efficiency of the network training.
The detailed decoding process is as follows. According to Figure 1, the feature map F4 output from Res-4 enters the M4 module, in which F4 is processed by the RAP module to extract contextual information; the new feature map F4_2 is then input to the module M3. There are therefore two inputs to the M3 module, namely, F4_2 and the middle-level feature map F3 output from Res-3. First, these two feature maps at different levels are processed by the FMAF module for feature fusion. Then, the fused feature map is processed by the RAP module to extract contextual information at different scales. Finally, the new feature map F3_2 is generated. The M2 and M1 modules operate similarly to M3; the difference is that PAP is used to extract contextual information in the M2 module, while RAP is used in the other three modules (M1, M3 and M4). Once the F1_2 feature map is generated, it is processed by two RCUs to increase non-linearity. The size of the feature map is 128×128×256 at this stage. Afterwards, one up-sampling layer and one convolution layer are used to recover the dimension of the feature map to 512×512×K (K is the number of classes). Finally, a dense score map is generated by the Softmax layer.

Improvement Strategy (IS)
Although MSF-MLSAN can be employed to classify large-scale images, the challenge of context remains: a given image tile may contain very limited contextual information or even cover the target object only partially. To better handle the geographic context of classification and improve overall accuracy, two strategies are introduced, namely, splicing and weighting. Inspired by Reference [32], we splice adjacent image slices with the method shown in Figure 5. During testing, a sliding window of size 512 × 512 cuts the large-scale images with a step of 256, yielding the sliding windows S11, S12, S21, …, Smn shown in Figure 5. To resolve the discontinuity of classification results at the junction of adjacent windows produced by direct splicing, the sliding step is only 256, half of the window size, so that two adjacent windows overlap (such as S11 and S12, or S11 and S21). After both windows are tested, the final classification result of the overlapping area is generated by averaging the results of the two adjacent windows. This splicing yields a score map with minimal splitting-border errors [22]. In this paper, four trained models are generated eventually, and four score maps are obtained via the splicing method. These four score maps are then weighted according to their respective classification performance to generate the final score map (shown in Figure 1).
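The overlap-and-average splicing can be sketched as follows (a numpy sketch; `predict` stands in for a trained model and the accumulate-then-divide bookkeeping is our illustration):

```python
import numpy as np

def splice_score_maps(image_h, image_w, predict, win=512, step=256):
    """Slide a win x win window with a half-window step, accumulate the
    per-pixel scores and visit counts, then average where windows overlap.
    `predict(y, x, win)` returns the (win, win) score patch for the window
    whose top-left corner is (y, x)."""
    scores = np.zeros((image_h, image_w))
    counts = np.zeros((image_h, image_w))
    for y in range(0, image_h - win + 1, step):
        for x in range(0, image_w - win + 1, step):
            scores[y:y+win, x:x+win] += predict(y, x, win)
            counts[y:y+win, x:x+win] += 1
    # Overlapping pixels are averaged; the guard avoids division by zero.
    return scores / np.maximum(counts, 1)
```

Interior pixels are covered by up to four windows, and the division by the visit count implements exactly the averaging of adjacent windows described above.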
The whole network is organized by long-range residual connections between the blocks in the encoder and decoder. Hence, it consists of many stacked "Residual Units." One Residual Unit performs the following computation:

y_l = h(x_l) + F(x_l, W_l),  x_{l+1} = f(y_l),

where x_l and W_l denote the input feature and the set of weights (and biases) associated with the l-th residual unit and F is the residual function. The functions h and f are set as identity mappings: h(x_l) = x_l and f(y_l) = y_l. Applying this recursion several times, the output of a deeper residual unit L admits a representation of the form

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i).

The implementation details of training the whole network are summarized in Algorithm 1.

Datasets
The experiments use imagery data from a millimeter wave SAR system to evaluate MSF-MLSAN. The central frequency of the system is 35 GHz and the resolutions are 0.13 m in range and 0.14 m in azimuth. Because of the relatively low flight height (2000 m to 4000 m), the SAR images contain considerable water and shadow areas in the mountainous scenes. Nine large-scale SAR images are used in the experiments, each of size 10240 × 13050 pixels. Eight of these images are used for training and validation, while the remaining one is used for model testing. We choose three labels, namely, water, shadow and background. The ground truth labels are generated by manual marking and confirmed by SAR experts from the Beijing Institute of Radio Measurement (the data provider). The eight large-scale images are cut into SAR tiles of size 720 × 720 for training and validation purposes, yielding 1288 SAR image tiles. We randomly select 80% of the samples as the training dataset and keep the rest as the validation dataset. Figure 6 shows examples of the datasets, in which (a) and (b) denote the SAR image and the ground truth, respectively. MSF-MLSAN has been developed using TensorFlow 1.10, CUDA 9.0 and Python 3.6 on a server with a Titan Xp 12 GB GPU, an i7-6800K CPU, 16 GB memory and a 2 TB hard disk. The detailed testing steps can be found in Appendix A (Algorithm A1). In the experiment, 100 epochs are trained, where one epoch trains on all the images in the training set once. The total training time of MSF-MLSAN on the dataset is about 20 h.

Performance Indices
To assess the classification accuracy, two important indices are used in the paper, namely, Overall Accuracy (OA) and Intersection over Union (IoU).


Overall Accuracy
OA is a vital index to evaluate the classification performance of a given algorithm. For type A, it is computed by

OA_A = |G_A ∩ D_A| / |G_A|, (4)

where G_A denotes all the pixels of ground truth for type A and D_A denotes all the pixels detected as type A; G_A ∩ D_A denotes the pixels in the intersection of G_A and D_A. As shown in Figure 7, the red rectangles contain all the pixels of ground truth for type A (B or C) and the blue rectangles contain all pixels detected as type A (B or C), while the overlapping areas denote successfully detected results. Therefore, OA is the ratio of correctly classified pixels (the overlapping pixels) to the pixels of ground truth. The larger the OA value, the higher the classification accuracy.

Intersection over Union
IoU is an effective metric for evaluating classification accuracy and an important supplement to OA. According to Figure 7, the number of ground-truth pixels for type A is the same as for type C but the number of pixels detected as type C is much larger than the number detected as type A. So the value of OA for type C is higher than that of type A, yet many of the pixels detected as type C belong to other types. Since OA cannot fully delineate classification performance, IoU is incurred in our experiment.
For type A, IoU is computed by

IoU_A = |G_A ∩ D_A| / |G_A ∪ D_A|, (5)

that is, IoU is the ratio of the intersection of the ground truth G_A and the detection D_A to their union. Then, for Figure 7A,C, IoU for type C is lower than that of type A because many pixels have been falsely detected as type C.
Hence, for a classification experiment, the performance is better with higher values of both OA and IoU.
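Both indices follow directly from the set definitions above and can be computed per class in a few lines (a numpy sketch; the `oa_iou` helper name is ours):

```python
import numpy as np

def oa_iou(ground_truth, prediction, label):
    """Per-class Overall Accuracy and IoU:
    OA = |G ∩ D| / |G|, IoU = |G ∩ D| / |G ∪ D|."""
    g = (ground_truth == label)          # ground-truth mask for the class
    d = (prediction == label)            # detection mask for the class
    inter = np.logical_and(g, d).sum()
    union = np.logical_or(g, d).sum()
    oa = inter / g.sum() if g.sum() else 0.0
    iou = inter / union if union else 0.0
    return oa, iou
```

On a toy 2×2 example where one true pixel is hit and one false alarm occurs, OA is 0.5 while IoU drops to 1/3, illustrating how false positives lower IoU but not OA, as in the type-C discussion above.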

MSF Results
The MSF integrates prior SAR image features with deep learning techniques for multi-level feature extraction. Figure 8 presents the Gabor features of the SAR image. Figure 8a shows the SAR image, which covers some water and shadow areas. Figure 8b,c indicate the Gabor features in the 45° and 135° directions, respectively. From these two figures we can see that they effectively embody the local texture features in different directions. Figure 8d gives the fused feature map concatenated from Figure 8a-c, which clearly reflects the texture of local areas.

Figure 9 depicts the GLGCM features of the SAR image. Figure 9a illustrates the gray mean, which is useful for reducing speckle noise. Figure 9b presents the gradient mean map, which highlights the water and shadows to improve classification accuracy. Figure 9c demonstrates the gray mean square error, which blurs all objects but the water and shadow areas. Figure 9d is the fusion map concatenated from Figure 9a-c, in which the water and shadow edges are enhanced. Therefore, GLGCM offers strong low-level features for water and shadow classification.

Figure 10 delineates the MSOGDF features of the SAR image. Figure 10a shows the original SAR image. Figure 10b,c are the MSOGDF features in the 45° and 135° directions, respectively. Figure 10d is the fusion result of Figure 10a-c, which serves as an enhanced feature for simultaneous classification of water and shadow.

Classification Results and Analysis
In order to test MSF-MLSAN, we compare it with six widely applied classification methods: RefineNet [21], DeepLabV3 [33], MLSAN, GLGCM-MLSAN, Gabor-MLSAN and MSOGDF-MLSAN. In the experiment, two SAR images are tested: one covers more water but fewer shadow areas and the other contains fewer water areas but more shadow areas. These two experiments are designed to test whether MSF-MLSAN can simultaneously classify water and shadow given heterogeneous coverage. Figure 11 shows the data of the first experiment, which includes the SAR image and the ground truth. The size of the SAR image in the first experiment is 4608 × 4096 pixels, covering considerable water and shadow areas. In this experiment, we mainly test the performance of MSF-MLSAN in distinguishing water from shadow. Figure 12 shows the fusion map with the SAR image and the classification results of the different methods. Figure 12a depicts the ground truth of the SAR image, which is generated by SAR experts; green and blue denote shadow and water, respectively. Figure 12b gives the classification result generated by RefineNet. Compared with the ground truth, there are several obvious errors, such as the areas in the red rectangles, which are incorrectly classified between water and shadow; in particular, many shadow areas are classified as water. Figure 13a shows the confusion matrix of RefineNet, in which W, S and B denote water, shadow and background, respectively. The overall accuracies for water, shadow and background are 0.865, 0.891 and 0.8512, and the corresponding IoUs are 0.7166, 0.7524 and 0.8174. Figure 12c illustrates the classification result of DeepLabV3. Its performance in distinguishing water and shadow areas is slightly worse than RefineNet: many shadow areas are classified as water in the red boxes and many shadow areas are not detected in the pink boxes.
Therefore, the accuracy and IoU for water and shadow both decrease considerably according to Figure 13b, because of false classification and missed detection. Figure 12d presents the classification result of MLSAN. Compared with RefineNet, many false alarm areas for water have disappeared, which displays the satisfying performance of MLSAN. Figure 13c shows the confusion matrix of MLSAN. There is a small increase in water IoU due to the reduction of false alarms, despite a small decrease in water OA. In addition, the OA and IoU for shadow and background are both improved. Figure 12e indicates the classification results of GLGCM-MLSAN. Compared with Figure 12d, some missed water areas are detected, which results in the increase in OA and IoU of water shown in Figure 13d. However, there is a slight decline in the OA and IoU of shadow and background due to misclassification. Figure 12f gives the classification result of Gabor-MLSAN. Compared with Figure 12e, Figure 12f reduces a few false alarms for water though it still has some missed detections. Therefore, Gabor-MLSAN produces a better IoU though its OA is slightly lower than GLGCM-MLSAN (shown in Figure 13e). In addition, Gabor-MLSAN brings higher OA and IoU for shadow than GLGCM-MLSAN, due to the reduction of misclassified shadow areas. The two methods have basically similar performance in background classification. Figure 12g presents the classification results of MSOGDF-MLSAN. Compared with GLGCM-MLSAN and Gabor-MLSAN, it has the worst performance in water classification, with the lowest OA and IoU shown in Figure 13f. Its shadow classification is slightly worse than Gabor-MLSAN but slightly better than GLGCM-MLSAN. However, it has the best background classification performance among these six methods.

The Experiment with More Water but Fewer Shadow Areas
Through analyzing the classification performance of the methods above, we find that GLGCM-MLSAN and Gabor-MLSAN achieve better classification results for water and shadow, while MLSAN and MSOGDF-MLSAN slightly outperform them in background classification. Therefore, weights of 0.1, 0.35, 0.45 and 0.1 are selected heuristically for the score maps of MLSAN, GLGCM-MLSAN, Gabor-MLSAN and MSOGDF-MLSAN, respectively, to generate the MSF-MLSAN score map. The classification results of MSF-MLSAN are given in Figure 12h; they are better than those of MSOGDF-MLSAN or Gabor-MLSAN. Figure 13g shows the OA and IoU of MSF-MLSAN, from which we can see that the IoU for water is the highest. As far as shadow classification is concerned, the OA is slightly lower than that of Gabor-MLSAN but the IoU is the highest. For the background, MSF-MLSAN reaches the highest OA and IoU among all seven methods. This evaluation demonstrates the prominent performance of MSF-MLSAN for the simultaneous classification of water and shadow. To further investigate the classification accuracy, Figure 14 shows three maps for the three types of targets classified by the different methods. Figure 14a shows the classification accuracy for water: DeepLabV3 achieves the lowest OA and IoU, and MSF-MLSAN generates the best result. For shadow classification, illustrated in Figure 14b, DeepLabV3 also achieves relatively poor accuracy and MSF-MLSAN offers the highest OA and IoU. Figure 14c indicates that RefineNet attains the lowest accuracy in background classification and MSF-MLSAN again achieves the best performance. Figure 15 shows the data of the second experiment, including the SAR image and the ground truth. The SAR image is 4608 × 3548 pixels, containing few water areas and considerable shadow areas. The mountainous terrain in this test scenario produces many shadow areas in the SAR image, which provides a good test of MSF-MLSAN in classifying water and shadow.
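The weighted fusion of the four branch score maps can be sketched as below. This is a hypothetical illustration, not the exact implementation: `fuse_score_maps` and the array layout (H × W × classes softmax scores per branch) are our assumptions; only the weight values come from the paper.

```python
import numpy as np

# Heuristic weights from the paper, in branch order:
# MLSAN, GLGCM-MLSAN, Gabor-MLSAN, MSOGDF-MLSAN
WEIGHTS = (0.1, 0.35, 0.45, 0.1)

def fuse_score_maps(score_maps, weights=WEIGHTS):
    """Fuse per-branch score maps into one label map.

    score_maps: list of (H, W, C) arrays of class scores, one per branch.
    Returns an (H, W) array of class indices from the weighted sum.
    """
    fused = sum(w * np.asarray(s, dtype=float)
                for w, s in zip(weights, score_maps))
    return fused.argmax(axis=-1)
```

Because the weights sum to one, the fused map stays a convex combination of the branch scores, so a class wins only if the branches that favor it carry enough total weight.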
The results of this experiment are shown in Figure 16. Figure 16a is the ground truth of the scene, generated by SAR experts. Figure 16b presents the classification results of RefineNet, which contain many false alarms for water detection, highlighted by the red rectangles; this is why the IoU of water classification is low (as shown in Figure 17a) even though its OA is high. Its classification performance for shadow and background is good, with IoUs of 0.7935 and 0.783, respectively. Figure 16c illustrates the classification results of DeepLabV3. Compared with RefineNet, the classification performance for water and shadow is even worse: many shadow areas are classified as water in the red boxes and many shadow areas are missed in the pink boxes, which reduces the classification accuracy of the water and shadow areas, as depicted in Figure 17b. Figure 16d gives the classification results of MLSAN. Compared with Figure 16b, most of the false alarms for water have been resolved, apart from some red box areas. Hence, the water detection performance is significantly improved with MLSAN, as depicted in Figure 17c, achieving an IoU of 0.6383. In the classification of shadow, there are still false alarm areas, such as the left red box region, and some missed detections; however, the shadow classification accuracy is also improved, to 0.925 in OA and 0.8121 in IoU. Because of the reduction of false alarms in the background, its IoU is improved to 0.8181. Figure 16e delineates the classification result of GLGCM-MLSAN. Compared with Figure 16d, it reduces the false alarm areas for water classification, though small areas remain, shown in the two little red boxes. The OA and IoU for water classification are greatly improved to 0.8762 and 0.7869, respectively, as shown in Figure 17d.
For shadow classification, the OA and IoU also increase, to 0.9485 and 0.8189, though there are still some falsely detected areas, highlighted by the large red box on the left, and missed detections in the pink box. However, the OA and IoU for background classification decline slightly, by 1.84% and 0.55%, compared with Figure 16d. Figure 16f shows the classification result of Gabor-MLSAN. Compared with GLGCM-MLSAN, there are slightly more false alarms for water, such as in the red boxes, so the OA and IoU for water are reduced by 0.94% and 4.61% (refer to Figure 17e). For shadow classification, the IoU increases a little, though there is a slight reduction in OA compared with GLGCM-MLSAN. However, the classification performance for the background is worse than that of GLGCM-MLSAN, with reductions of 1.57% in OA and 1.32% in IoU (Figure 17g). For the background, it reaches 0.8575 in OA and 0.8154 in IoU, slightly lower than MLSAN but higher than the other methods. Figure 19 shows the loss curve and IoU diagram of the MSF-MLSAN network when training on this dataset. In this project, 100 epochs are trained, and the corresponding loss and IoU values are obtained through validation every five epochs. From these two graphs, it can be seen that the loss converges steadily and the validation accuracy stabilizes during training. According to the classification results of the two experiments, the mean OA and IoU for each type of target, as well as the overall IoU of the proposed framework (MSF-MLSAN), are computed and shown in Table 1; the overall IoU is 0.9076. The high OA and IoU of MSF-MLSAN prove its success in the simultaneous classification of water and shadow. The complexity of the context is addressed by the combination of the decoder (attention mechanism and pyramid pooling operation) and splicing.

Discussion
MSF-MLSAN, a new end-to-end framework based on deep learning, is presented in this paper, and prominent results are obtained for the simultaneous classification of water and shadow in SAR images. However, two extensions would accommodate broader applications of MSF-MLSAN in the future: weight optimization for feature fusion and the generalization of MSF-MLSAN.

Weights Optimization for Feature Fusion
In MSF-MLSAN, the weights for the four score maps are selected to build the final fused map, from which the classification results are generated. The weights are determined heuristically according to their separate classification performance, following [11]. However, this approach requires high-level expertise and is difficult to transfer to other fields. Therefore, an automatic method to determine the optimal weight combination needs to be developed in the future; we plan to employ the Ho-Kashyap algorithm described in Reference [34] to realize it.
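As a simple stand-in for such automatic weight selection (the Ho-Kashyap algorithm itself is not shown), one can brute-force a grid over weight vectors that sum to one and keep the combination with the best score on a held-out validation tile. The function names and the use of pixel accuracy as the selection criterion are our assumptions for this sketch.

```python
import itertools
import numpy as np

def fuse(score_maps, weights):
    """Weighted sum of (H, W, C) score maps followed by argmax."""
    fused = sum(w * s for w, s in zip(weights, score_maps))
    return fused.argmax(axis=-1)

def search_weights(score_maps, labels, step=0.25):
    """Try all weight vectors on a grid over the probability simplex and
    keep the one with the highest pixel accuracy on a validation tile."""
    n = len(score_maps)
    ticks = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_acc = None, -1.0
    for combo in itertools.product(ticks, repeat=n):
        if abs(sum(combo) - 1.0) > 1e-9:   # keep only convex combinations
            continue
        acc = (fuse(score_maps, combo) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = combo, acc
    return best_w, best_acc
```

The grid grows exponentially with the number of branches, which is acceptable for four score maps but motivates a proper optimizer for larger ensembles.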

Generalization
In this paper, the dataset comes from the millimeter wave SAR system. The center frequency is 35 GHz and the resolutions are 0.13 m in the range and 0.14 m in the azimuth. With the following two extensions, MSF-MLSAN can be applied to analyze SAR imagery datasets with different resolutions or different bands.

Resolution
Resolution is a key parameter of SAR systems. It measures the ability of an imaging system to distinguish levels of detail and is a key indicator of the performance of a SAR imaging system. The characteristics of the same target can differ across SAR images with different resolutions: the higher the resolution, the richer the texture and detail the image offers; the lower the resolution, the more homogeneous targets appear. In addition, the spatial extent of the same target may vary significantly under different image resolutions, which has a large impact on classification. For example, if we test SAR images from TerraSAR (3 m resolution) directly with the model trained on the millimeter wave SAR dataset used in this paper, the classification results may not be as good as those reported here, due to the large difference in resolution.
To solve this problem, two methods could be employed. One is to sample a small portion of the SAR images at the different resolutions and then re-train MSF-MLSAN on the sampled dataset to generate a new model; SAR images of different resolutions can then be analyzed with these new models, respectively. The accuracy depends on the quality of the samples, and the re-training process is time-consuming.
The other way is to use interpolation or down-sampling techniques [35]. The accuracy of MSF-MLSAN is then subject to the quality of the re-scaling process, and we also note the risk of incurring additional errors through interpolation or potential information loss through down-sampling [36].
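A minimal sketch of the down-sampling route is block averaging, which coarsens a high-resolution amplitude image toward a lower target resolution (a multilook-style operation). This is an illustration, not the paper's implementation; interpolation routines such as `scipy.ndimage.zoom` could be used instead for non-integer scale factors.

```python
import numpy as np

def downsample_mean(img, factor):
    """Down-sample a 2-D SAR amplitude image by averaging factor x factor
    blocks. The image is cropped so both sides are multiples of factor."""
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor
    blocks = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))
```

Averaging (rather than decimation) also suppresses speckle somewhat, at the cost of the texture detail that the MSF features rely on.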

Frequency Band
It is also an important problem to use the model trained on the millimeter wave system to test SAR images acquired in different bands (e.g., Sentinel-1 and TerraSAR). Different SAR bands are employed for different applications: the longer the wavelength of the microwave (i.e., the lower the frequency), the stronger the penetration, which is usually used to detect underlying targets (e.g., forest soil conditions); on the other hand, more detailed information about the targets can be acquired if the wavelength is shorter.
Transfer learning aims to apply knowledge or patterns learned in one field or task to different but related domains, and could serve as an appealing approach to address this band heterogeneity [37]. In short, transfer learning would allow an MSF-MLSAN model trained on one band to be applied to SAR data of different frequency bands. Transfer learning within MSF-MLSAN will be an important part of our future work.
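A common fine-tuning recipe for such band transfer is to freeze the pretrained feature extractor and update only the classification head on a few labelled tiles from the new band. The pure-numpy toy below illustrates that pattern only; it is not MSF-MLSAN, and all names (`W_enc`, `W_dec`, `finetune_step`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" encoder weights (frozen) and a small decoder to fine-tune.
W_enc = rng.normal(size=(8, 4))          # stays fixed during fine-tuning
W_dec = rng.normal(size=(4, 3)) * 0.01   # updated on the new-band samples

def forward(x):
    h = np.tanh(x @ W_enc)               # frozen feature extraction
    return h @ W_dec                     # class scores from the decoder

def finetune_step(x, y_onehot, lr=0.1):
    """One softmax cross-entropy gradient step on the decoder only."""
    global W_dec
    h = np.tanh(x @ W_enc)
    z = h @ W_dec
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = h.T @ (p - y_onehot) / len(x)  # gradient w.r.t. W_dec alone
    W_dec = W_dec - lr * grad
```

Because the frozen features are fixed, the fine-tuning objective is convex in the decoder weights, so a handful of labelled tiles from the new band can already be enough to adapt the classifier.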

Conclusions
In this paper, a new end-to-end deep learning framework has been proposed to classify water and shadow in SAR imagery synthetically, which demonstrates the success of integrating SAR domain knowledge and geospatial contextual information into deep learning for SAR image analytics. MSF-MLSAN can automatically classify these two highly similar objects simultaneously, avoiding the error-prone coordination of their respective classification results. MSF-MLSAN mainly contains three parts, namely the MSF, the MLSAN network and the IS. The MSF extracts the low-level features of SAR images with GLGCM, Gabor and MSOGDF, integrating SAR domain knowledge into the deep neural network. The MLSAN network contains two parts, namely, the encoder and the decoder. Finally, within the IS, the score maps are weighted to obtain the fused score map that generates the final classification result.
In this paper, two experiments on millimeter wave SAR images are performed: one scene with more water areas and the other with more shadow areas. The classification results are compared with those of six other methods: RefineNet, DeepLabV3, MLSAN, GLGCM-MLSAN, Gabor-MLSAN and MSOGDF-MLSAN. The proposed MSF-MLSAN method achieves the best classification results for water and shadow, with mean OAs over the two experiments of 0.8382 and 0.9278 and mean IoUs of 0.7727 and 0.806, respectively. The mean overall IoU of MSF-MLSAN is 0.9076. Therefore, the proposed framework achieves outstanding accuracy for the simultaneous classification of water and shadow in millimeter wave SAR images.
The success of MSF-MLSAN proves the necessity of integrating SAR domain knowledge and geospatial analytics into the design of deep neural networks. Without such integration, MSF could not capture appropriate SAR image features. On the other hand, the geospatial contextual information of geographic objects is handled via selective attention pooling (the RAP and PAP modules) and splicing in the IS: the former encodes multi-scale spatial structures within a given image tile, while the latter guarantees the correct spatial coverage of the given image tiles. In future work, we will explore new deep learning techniques to close the scale gaps between SAR image analytics and everyday object classification.
The proposed framework can also be applied to spaceborne SAR and InSAR systems for classification and object detection studies, such as the TanDEM-X and TerraSAR systems. The weight optimization and generalization strategies will play pivotal roles in extending MSF-MLSAN to these data analytics, as a focus of our future research. MSF-MLSAN paves the way for further integration of SAR domain knowledge and state-of-the-art deep learning techniques. We hope this paper will inspire more innovative work in this direction.