ME-Net: A Deep Convolutional Neural Network for Extracting Mangrove Using Sentinel-2A Data

Abstract: Mangroves play an important role in many aspects of ecosystem services. Mangroves should be accurately extracted from remote sensing imagery to dynamically map and monitor the mangrove distribution area. However, popular mangrove extraction methods, such as the object-oriented method, still have defects when applied to remote sensing imagery: they are poorly automated, time-consuming, and laborious. A pixel classification model inspired by deep learning technology was proposed to solve these problems. Three modules in the proposed model were designed to improve the model performance. A multiscale context embedding module was designed to extract multiscale context information. Location information was restored by the global attention module, and the boundary of the feature map was optimized by the boundary fitting unit. Remote sensing imagery and mangrove distribution ground truth labels obtained through visual interpretation were applied to build the dataset. Then, the dataset was used to train a deep convolutional neural network (CNN) for extracting mangroves. Finally, comparative experiments were conducted to prove the potential for mangrove extraction. We selected the Sentinel-2A remote sensing data acquired on 13 April 2018 in Hainan Dongzhaigang National Nature Reserve in China to conduct a group of experiments. After processing, the data comprised 2093 × 2214 pixels, and a mangrove extraction dataset was generated. The dataset was built from Sentinel-2A data and includes five original bands, namely R, G, B, NIR, and SWIR-1, and six multispectral indices, namely the normalized difference vegetation index (NDVI), modified normalized difference water index (MNDWI), forest discrimination index (FDI), wetland forest index (WFI), mangrove discrimination index (MDI), and the first principal component (PCA1). The dataset has a total of 6400 images.
Experimental results based on the datasets show that the overall accuracy of the trained mangrove extraction network reaches 97.48%. Our method benefits from the CNN and achieves a higher intersection over union (IoU) than other machine learning and pixel classification methods. The designed model global attention module


Introduction
Mangrove is a salt-tolerant evergreen woody plant, which is distributed in intertidal zones of tropical and subtropical areas [1]. Mangroves provide breeding and nursing places for marine and pelagic species and play an important role in wind prevention, coastal stability, carbon sequestration, and some other applications [2]. In the past 50-60 years, all mangrove areas are detected. Mangrove extraction is difficult because of the complexity of the spectral and spatial features.
Extracting mangroves from remote sensing imagery can be regarded as a pixel classification task, which can be solved by a semantic segmentation network. However, semantic segmentation networks often cannot accurately detect object boundaries and cannot remove "salt and pepper noise" because of how convolution is computed [18]. In the postprocessing stage, boundary alignment is used to slightly adjust the semantic segmentation results and improve the prediction. An improved pixel classification network based on ResNet [33] was designed to solve the above-mentioned problems in extracting mangroves. The proposed neural network contains a global attention module (GAM), a multiscale context embedding (MCE) module, and a boundary fitting unit (BFU). ResNet overcomes the vanishing gradient problem, is simple to train, and extracts feature information effectively. The GAM provides classification guidance for the low-stage feature map by learning high-stage feature map information, which improves the classification accuracy. The MCE module extracts multiscale context information through convolutions of different scales and addresses the intraclass consistency issue. The BFU was designed to handle object position inconsistency and the feature map aliasing effect; it optimizes the boundary of the mangrove distribution and eliminates some "salt and pepper noise". This work aims to design a new mangrove extraction method based on deep learning.
The main contribution of this study is a pixel classification model for extracting mangroves from remote sensing imagery. This work does not attempt to examine the full capability of Sentinel-2 data for mangrove extraction. Instead, it shows that the proposed GAM, MCE, and BFU are effective for mangrove extraction and that the approach is repeatable at other sites when similarly implemented. Moreover, we assess the performance of different learning approaches for mangrove extraction, in particular to demonstrate the capability of our new deep convolutional neural network. We aim to address three problems in mangrove extraction: inaccurate boundaries of the mangrove distribution, "salt and pepper noise", and insufficient use of high-stage feature map information. Hence, this work designed a model that exploits the attention mechanism and global context information to improve feature learning from remote sensing imagery. The experimental results show that the proposed network structure can effectively extract mangroves from remote sensing imagery.

Materials and Methods
A pixel classification model is proposed to extract mangroves from remote sensing imagery. We preprocessed the original remote sensing imagery to prepare datasets as follows.
(1) A radiometric correction of the Sentinel-2 spectral data was conducted.

(2) Multispectral indices of the image are required for mangrove extraction. Given that band selection is not our research focus, six multispectral indices were used in this study based on the vegetation indices commonly used in remote sensing and on existing results in mangrove extraction research [34][35][36][37]. According to previous experiments and research results [35,38], the red-edge bands from Sentinel-2 and the SAR data from Sentinel-1 are also useful for differentiating vegetation types. R, G, and B are the common spectral bands for object extraction. Accordingly, five original bands, namely R, G, B, NIR, and SWIR-1, were selected for the experiments. Mangroves are a type of vegetation that always grows near water; therefore, the normalized difference vegetation index (NDVI) and the modified normalized difference water index (MNDWI) were introduced. Since mangroves are a kind of forest, the forest discrimination index (FDI), wetland forest index (WFI), and mangrove discrimination index (MDI) were also used in the experiment to improve the extraction accuracy. The first principal component (PCA1), a common method for enhancing information, was used as a multispectral index.

(3) To prepare the datasets for training the mangrove extraction model, all the remote sensing images and the corresponding ground truth labels were clipped with a fixed-size sliding window, and the datasets were expanded by data augmentation. Each data sample has five original bands (R, G, B, NIR, and SWIR-1) and six multispectral indices (NDVI, MNDWI, FDI, WFI, MDI, and PCA1). The data sample and the corresponding ground truth were treated as the input of the proposed deep neural network for training the mangrove extraction network (ME-Net).

The output of the deep neural network is a binary grey-scale image, where 0 marks a pixel classified as non-mangrove and 1 marks a pixel classified as mangrove forest. An overview of the proposed framework is shown in Figure 1.

Remote Sens. 2021, 13, 1292

Study Area
The study area is located in the northeast of Hainan Island, including Dongzhaigang National Nature Reserve (DNNR) and its surrounding area of approximately 5 km (Figure 2). DNNR is the first National Nature Reserve for mangroves in China. The Dongzhaigang mangrove forest is the largest coastal beach forest in China, with a total length of 28 km. It is the most well-preserved, most concentrated, continuous, and mature mangrove forest. DNNR is the most resource-rich area of all mangrove types. It is a typical mangrove wetland composed of major mangrove species in southern China and is becoming a major area for mangrove classification research [35]. There are five families and eight genera of mangroves in DNNR. These mangroves contain eleven species, including Bruguiera gymnorhiza Lamk, Bruguiera sexangular Poir, Bruguiera sexangular rhynchopetala, Ceriops tagal, Kandelia candel, Rhizophora stylosa Griff, Sonneratia apetala Buch, Sonneratia cylindria Engler, Aegiceras corniculatum Blanco, Acanthus ilicifolius, and Derris trifoliata. Figure 2 shows that some mangroves are located in intertidal wetlands, such as estuaries, coasts, and islands. Therefore, combining water and vegetation characteristics is important for distinguishing land vegetation from mangrove vegetation. The MNDWI is closely related to the characteristics of water bodies and was therefore introduced for extracting the mangroves in this study.

Remote Sensing Data and Preprocessing
The data characteristics of Sentinel-2A MSI (S2) images are shown in Table 1, including basic information such as the wavelength range and spatial resolution of the 13 bands. S2 satellite images (Level-1C) were downloaded from the Sentinel Scientific Data Hub (https://scihub.copernicus.eu/dhus/#/home, accessed on 1 March 2020) of the European Space Agency. These images are top-of-atmosphere apparent reflectance products that have already undergone orthorectification and subpixel geometric correction; thus, no further geometric correction was applied. We used sen2cor to atmospherically correct the Level-1C image and obtain the bottom-of-atmosphere Level-2A products. The sen2cor atmospheric correction processor (version 2.8.0) runs as a built-in algorithm within SNAP (Sentinel Application Platform) version 6.0. Sentinel-2 data cannot be directly opened with ENVI 5.3.1. To ensure that all the data products had the same pixel size for deep learning, we read the Level-2A image through SNAP, resampled the required bands to a 10 m pixel size, and converted them to a format readable by ENVI to facilitate subsequent data processing.

According to previous experiments and research results [35,38], the red-edge bands from Sentinel-2 and the SAR data from Sentinel-1 are also useful for differentiating vegetation types. However, our preliminary research found that these data contribute much redundant and even noisy information [39] when only mangroves are extracted using deep learning. Therefore, five original bands, namely R, G, B, NIR, and SWIR-1, were selected for the experiments to improve the accuracy of mangrove extraction. In addition, five multispectral indices were obtained by band calculation for mangrove extraction, and their detailed calculation process is shown in Table 2. The PCA1 [11] was computed from six original bands (R, G, B, NIR, SWIR-1, and SWIR-2) as the sixth multispectral index.
A series of experiments was conducted in Section 3.5, starting with the five spectral bands. Additional data were then incorporated step by step to verify the effectiveness of each index.

Table 2. Calculation method of multispectral indices (columns: multispectral index, calculation method, calculation details in Sentinel-2).
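The band-ratio indices follow a common normalized-difference pattern. As a minimal sketch, the snippet below computes only the two indices whose standard definitions are widely established, NDVI = (NIR − R)/(NIR + R) and MNDWI = (G − SWIR1)/(G + SWIR1), for single pixels. The function names and the sample reflectance values are hypothetical, and the formulas are the textbook definitions rather than a transcription of Table 2; FDI, WFI, and MDI are deliberately left out because their exact band combinations should be taken from Table 2.

```python
def normalized_difference(a, b, eps=1e-12):
    """Return (a - b) / (a + b); fall back to 0.0 when the denominator is ~0."""
    denom = a + b
    if abs(denom) < eps:
        return 0.0
    return (a - b) / denom


def ndvi(nir, red):
    """Normalized difference vegetation index (standard definition)."""
    return normalized_difference(nir, red)


def mndwi(green, swir1):
    """Modified normalized difference water index (standard definition)."""
    return normalized_difference(green, swir1)


# Hypothetical surface reflectances for two pixels (not taken from the paper).
vegetation = {"green": 0.08, "red": 0.05, "nir": 0.40, "swir1": 0.20}
water = {"green": 0.06, "red": 0.04, "nir": 0.02, "swir1": 0.01}

for name, px in (("vegetation", vegetation), ("water", water)):
    print(name,
          round(ndvi(px["nir"], px["red"]), 3),
          round(mndwi(px["green"], px["swir1"]), 3))
```

In this toy case the vegetation pixel has a high NDVI and a negative MNDWI, while the water pixel shows the opposite signs, which is the contrast the two indices are meant to supply for mangroves growing at the land-water boundary.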

Deep CNN Structure
The designed architecture is named ME-Net. Inspired by the performance of the fully convolutional network (FCN) structure in pixel classification, ME-Net is designed with two parts (Figure 3). The first part (top of Figure 3) uses ResNet-101 to extract features, whose kernel is arithmetic mean; the second part (bottom of Figure 3) aims to extract multiscale information and context information in different stages and generate a binary classification map to obtain good mangrove extraction performance.

In accordance with the size of the feature map, the ResNet-101 network is divided into six stages, namely {Stage 0, Stage 1, ..., Stage 5}. We refer to the stage with a larger feature size as the low stage and the stage with a smaller feature size as the high stage. According to our observation, different stages have varying recognition abilities. In the lower stage, the network encodes detailed spatial information. Thus, the low-stage feature map has accurate location information. However, the semantic consistency is poor due to its small receptive field and insufficient spatial context information guidance. At a higher stage, the map has strong semantic consistency because of its large receptive field; however, the location information is relatively inaccurate.
In summary, the lower stage provides more accurate spatial prediction, and the higher stage offers more accurate semantic prediction. On the basis of this observation, we propose a GAM, which guides the lower stage using the context information of the higher stage. In addition, we propose an MCE module and a BFU. The former extracts the multiscale information of mangroves in remote sensing imagery, and the latter refines the boundaries of features in the feature map to eliminate some "salt and pepper noise" and "grid artifacts" [40]. These modules are introduced in the following sections.

GAM Module
GAM (Figure 4) performs global average pooling (GAP) to provide global context information that guides the low-stage feature map. Global context information provides strong location consistency constraints for the low-stage feature map to correct the offset and dislocation of feature locations. The structure integrates position consistency guidance information from high-stage feature maps and detailed information from low-stage feature maps. GAM has two branches, namely the global attention information weighting branch and the upsampling branch.

Obtaining sufficient information to extract the relationship between channels is difficult because convolution only operates in a local space. To encode the entire spatial feature on a channel as a global feature, GAP is exploited to address the problem, as follows:

$$\mathrm{GAP}_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{i,j}, \tag{1}$$

where $x_{i,j}$ is the value of each feature pixel in channel $k$, and $k \in \{1, 2, \ldots, C\}$; $H$ is the height; $W$ is the width; and $C$ is the number of channels.

On the weighting branch of the global attention information, we performed GAP on the feature map in the higher stage to generate global context information and conducted a nonlinear 1 × 1 convolution and batch normalization [41]. The nonlinear 1 × 1 convolution is activated by a rectified linear unit (ReLU) or softmax function. The calculation process is shown as follows:

$$w_k = \frac{e^{z_k}}{\sum_{c=1}^{C} e^{z_c}}, \tag{2}$$

where $w_k$ is the prediction probability of each channel and $z_k$ is the output of each channel. Finally, the result calculated by softmax is multiplied by each pixel in the low-stage feature map, so that the high-stage feature map information provides context guidance for the channels of the low-stage feature map. The calculation process is shown in Formula (3), as follows:

$$S_{out} = W \otimes F, \qquad S_{out}(i, j, c) = w_c \, f_c(i, j), \tag{3}$$

where $S_{out}$ is a 3D matrix of $H \times W \times C$, $W$ is a column vector of $1 \times C$, $F$ is the feature map, $w_c$ is the weight of channel $c$, and $f_c(i, j)$ is any pixel value in the feature map of channel $c$.
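The weighting branch can be illustrated with a small, framework-free sketch: pool each high-stage channel to a single value, turn the pooled values into channel weights with softmax, and scale the matching low-stage channels. This is a simplification under stated assumptions: the learned 1 × 1 convolution and batch normalization between pooling and softmax are omitted, the two feature maps are assumed to have the same number of channels, and all function names and toy values are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """feature_map: list of C channels, each an H x W nested list.
    Returns one spatial mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_reweight(low_stage, weights):
    """Multiply every pixel of channel c by its weight w_c."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(low_stage, weights)]


# Toy inputs: a 2-channel 2x2 high-stage map and a 2-channel 4x4 low-stage map.
high_stage = [[[1.0, 3.0], [1.0, 3.0]],
              [[0.0, 0.0], [0.0, 0.0]]]
low_stage = [[[1.0] * 4 for _ in range(4)],
             [[2.0] * 4 for _ in range(4)]]

z = global_average_pool(high_stage)   # learned 1x1 conv + BN skipped (assumption)
w = softmax(z)
s_out = channel_reweight(low_stage, w)
print(z, [round(x, 4) for x in w])
```

Because the weights depend only on the pooled high-stage statistics, the low-stage map keeps its spatial resolution while channels that the high stage considers informative are emphasized.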
In the upsampling branch, the high-stage feature map is produced by complex encoder blocks, which cost considerable computing resources. When upsampling from the high-stage feature map to the low-stage feature map, a 1 × 1 convolution is performed to reduce the number of channels of the high-stage feature map and integrate the channel information, which increases the nonlinearity of the decoding layer and reduces the computing load. This module can effectively handle feature maps of different scales and uses a simple method to allow the high-stage feature map to provide consistent constraint information for the low-stage feature map.

MCE Module
Inspired by the inception architecture in GoogLeNet [42][43][44] and the atrous spatial pyramid pooling (ASPP) module in DeepLab [23], we propose an MCE module (Figure 5). This module extracts multiscale context information with four convolution kernels of different sizes and compresses the number of channels to reduce the computational load.
The designed MCE combines the feature maps of context information with the global information of the high-stage feature map in the GAM. We used 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutions in MCE to effectively extract context information from feature maps at different stages, where the 1 × 1 convolution is used to reduce the dimension of the feature map. The 1 × k + k × 1 and k × 1 + 1 × k convolutions were combined in MCE instead of a k × k convolution to avoid a large convolution kernel or global convolution. After concatenating the multiscale information along the channel dimension, a 3 × 3 convolution was used to roughly integrate the multiscale information and adjust the number of channels.
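A short calculation shows why replacing a k × k kernel with a 1 × k convolution followed by a k × 1 convolution saves parameters. The sketch assumes that the intermediate and output layers have the same channel count and ignores biases; the channel count of 256 is a hypothetical value chosen for illustration, not one reported for ME-Net.

```python
def full_kernel_params(k, c_in, c_out):
    """Weights in a single k x k convolution."""
    return k * k * c_in * c_out


def factorized_params(k, c_in, c_out):
    """Weights in a 1 x k convolution (c_in -> c_out)
    followed by a k x 1 convolution (c_out -> c_out)."""
    return k * c_in * c_out + k * c_out * c_out


CHANNELS = 256  # hypothetical channel count

for k in (3, 5, 7):
    full = full_kernel_params(k, CHANNELS, CHANNELS)
    fact = factorized_params(k, CHANNELS, CHANNELS)
    print(f"k={k}: full={full}, factorized={fact}, ratio={full / fact:.1f}")
```

With equal input and output channels, the saving grows linearly as k/2, so the 7 × 7 branch benefits most. Running both orderings (1 × k + k × 1 and k × 1 + 1 × k) in parallel doubles the factorized cost, which still stays below the full kernel for k = 5 and k = 7.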

Figure 5. MCE architecture. The green box is the input feature map, the yellow box is the output feature map, the rectangle is the convolution operation of different sizes of convolution kernel, and the colored box is the multiscale feature map in series according to the channel.

BFU Module
Inspired by Mask R-CNN (mask regions with CNN) [45], the BFU (Figure 6) is designed to correct the boundary position of the features with two consecutive 3 × 3 convolution kernels. On the one hand, the BFU eliminates the "grid artifacts" caused by the context embedding of the feature maps in different stages. On the other hand, the BFU mitigates the aliasing effect caused by the convolution and pooling operations. In addition, a skip connection is added to supervise the semantics of the feature map after boundary modification. This approach speeds up the flow of information in the network and optimizes the performance of boundary fitting.
The proposed BFU can be understood as a 1 × 1 convolution followed by a residual module. In the BFU, the 1 × 1 convolution is used to learn the information of different channels while reducing the number of channels of the original feature map, which lowers the amount of computation. The residual module is used to smooth the feature map and eliminate "grid artifacts" and aliasing effects.
Figure 6. BFU architecture. The green box is the feature map, the yellow rectangle is the convolution operation, the brown rectangle is the batch normalization (BN) operation, the orange ellipse is ReLU, "+" represents that a feature map with the same size is added in accordance with the pixel position, the blue arrow represents the skip connection, and the red box represents the feature mapping.

Loss Function Optimization
Mangrove extraction is a binary classification problem; thus, the binary cross-entropy loss $L_{BCE}$ and the Dice loss $L_{DC}$ are computed over the pixels, where $P_{gt}$ represents the set of pixel ground truth labels and $P_m$ denotes the set of pixel prediction results.

The strategy of deep supervision was applied in the training of the ME-Net model. A supervision function was added to the hidden layer to reduce the effects of vanishing gradients and improve the speed of model convergence. As shown in Figure 7, we upsampled the output of the third-stage feature map of the ME-Net to the original image size. A binary cross-entropy loss function Loss2 was added as the supervision of the middle hidden layer to optimize the learning process. Loss1 was used to optimize the overall network. We also added a weight to balance the two loss functions.
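The following sketch shows one common way to combine these terms, using the textbook forms of binary cross-entropy and Dice loss on flattened pixel lists. Several details are assumptions and not taken from the paper: Loss1 is modeled as the unweighted sum of $L_{BCE}$ and $L_{DC}$, the Dice term uses a smoothing constant, and the auxiliary weight `aux_weight` is a hypothetical value because the weight the authors used is not stated.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over pixels; predictions are clipped for stability."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient, with a smoothing term (assumption) to avoid 0/0."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def deep_supervised_loss(y_true, main_pred, aux_pred, aux_weight=0.5):
    """Loss1 on the final output plus a weighted auxiliary BCE (Loss2)
    on the upsampled intermediate output. aux_weight is hypothetical."""
    loss1 = bce_loss(y_true, main_pred) + dice_loss(y_true, main_pred)
    loss2 = bce_loss(y_true, aux_pred)
    return loss1 + aux_weight * loss2


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.4, 0.6, 0.4, 0.6]
print(deep_supervised_loss(labels, confident, confident))
print(deep_supervised_loss(labels, uncertain, uncertain))
```

The Dice term responds to region overlap and is less sensitive to class imbalance than cross-entropy, which is a common motivation for pairing the two when the target class covers only a small part of each tile.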

Preprocessing of Experimental Data
In combination with field sampling and visual interpretation of Google Earth satellite images, we manually marked the original remote sensing imagery in ArcGIS 10.2 to obtain the ground truth labels. These training samples were labeled under the supervision and by consensus of several experts, who are professionals in mapping mangrove extent and species, to ensure that the marked samples are correct. We resampled all the bands to the same size and performed band calculations in ENVI 5.3 to compute the multispectral indices. The spatial resolution of the Sentinel-2 remote sensing imagery used in the experiments is 10 m. We selected the Sentinel-2A remote sensing data acquired on 13 April 2018 in Hainan Dongzhaigang National Nature Reserve in China. The data comprised 2093 × 2214 pixels after preprocessing, such as cropping.

The remote sensing imagery was clipped by a 256 × 256 sliding window with a 32-pixel step. We applied random left-right and up-down flips and added "salt and pepper noise" to some samples to increase the size of the datasets and avoid filling null values. Furthermore, we randomly rotated the clipped samples by 90°, 180°, and 270° and randomly scaled the sample data at five scales. The dataset had 5120 original images of 256 × 256 pixels, of which 20% (1024 images) were used as the test set for validating the proposed model, and 80% (4096 images), together with 1280 augmented images, were used as the training set. During training, 85% of the training set was used to train the ME-Net model, and 15% was used to validate it.
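The clipping and geometric augmentation steps can be sketched on nested lists as follows. This is an illustration under stated assumptions, not the authors' pipeline: windows that would extend past the image border are simply skipped, the helper names are hypothetical, and the demonstration uses a tiny 6 × 6 image with a 4 × 4 window instead of the 256 × 256 window with a 32-pixel step.

```python
def window_origins(height, width, window=256, step=32):
    """Top-left corners of all windows that fit entirely inside the image."""
    rows = range(0, height - window + 1, step)
    cols = range(0, width - window + 1, step)
    return [(r, c) for r in rows for c in cols]


def crop(image, r, c, window):
    """Cut a window x window tile whose top-left corner is (r, c)."""
    return [row[c:c + window] for row in image[r:r + window]]


def flip_left_right(tile):
    return [row[::-1] for row in tile]


def flip_up_down(tile):
    return tile[::-1]


def rotate_90(tile):
    """Rotate a tile clockwise by 90 degrees."""
    return [list(col) for col in zip(*tile[::-1])]


# Tiny demonstration image: pixel value encodes its (row, col) position.
image = [[10 * r + c for c in range(6)] for r in range(6)]
origins = window_origins(6, 6, window=4, step=2)
tiles = [crop(image, r, c, 4) for r, c in origins]
print(origins)
print(rotate_90([[1, 2], [3, 4]]))
```

The same transform must be applied to a tile and to its ground truth label so that the pixel-wise correspondence is preserved; rotations by 180° and 270° follow from applying `rotate_90` two or three times.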

Input Data
All the prepared sample data, including five original bands, namely, R, G, B, NIR, and SWIR-1; six multispectral indices, namely NDVI, MNDWI, FDI, WFI, MDI, and PCA1; and the corresponding ground truth labels were used as inputs to the ME-Net model. The input data used by the deep neural network are shown in Figure 8.

Set of Hyperparameters
During the training of the ME-Net model, transfer learning was used to improve the generalization ability of the model. ME-Net was designed on the basis of ResNet, which was pretrained before it was inserted into the whole model. Moreover, minibatch stochastic gradient descent (SGD) [29] was used to minimize the loss function and update the weight parameters in backpropagation. In the experiment, the batch size was 8, the momentum was 0.9, and the weight decay was 0.0001. SGD optimization is strongly affected by the initial learning rate. Thus, the initial learning rate of the ME-Net model was set to 0.01 to obtain better performance and speed up the processing. We used the "poly" learning rate strategy, in which the initial rate is multiplied by $\left(1 - \frac{iter}{max\_iter}\right)^{power}$. The number of training epochs was 100, the number of iterations in each epoch was 200, and 32 samples were used in each iteration.
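The "poly" schedule is easy to reproduce from the stated settings. In the sketch below, the initial rate of 0.01 and the total of 100 × 200 iterations come from the text, whereas `power=0.9` is an assumption (a value often used with this schedule) because the exponent is not reported here; the function name is hypothetical.

```python
def poly_learning_rate(base_lr, iteration, max_iter, power=0.9):
    """Learning rate at a given iteration under the "poly" policy."""
    return base_lr * (1.0 - iteration / max_iter) ** power


BASE_LR = 0.01          # initial learning rate from the text
MAX_ITER = 100 * 200    # epochs x iterations per epoch from the text

for it in (0, 5000, 10000, 15000, 19999):
    print(it, poly_learning_rate(BASE_LR, it, MAX_ITER))
```

The rate starts at the initial value and decays smoothly to zero at the last iteration, so large steps early in training give way to fine adjustments near convergence.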

Experimental Results
The proposed ME-Net model was implemented in Python using the open-source TensorFlow and Keras frameworks provided by Google. The code of the pixel classification model was executed on the Windows 10 platform with four NVIDIA GTX 1080Ti GPUs (12 GB RAM per GPU). After 100 epochs, the ME-Net model achieved state-of-the-art results on the datasets (Figure 9).

We used pixel Intersection over Union (IoU) as the accuracy measure to quantitatively evaluate the performance of the ME-Net model in extracting mangroves from remote sensing images. IoU is defined as:

$$IoU = \frac{|P_{gt} \cap P_m|}{|P_{gt} \cup P_m|},$$

where $P_{gt}$ represents the set of pixel ground truth labels; $P_m$ represents the set of pixel prediction results; "∩" and "∪" represent the intersection and union operations, respectively; and $|\cdot|$ represents the number of pixels in the set.
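For binary masks, IoU can be computed from confusion counts, together with the overall accuracy and F1 score that are reported for ME-Net. The sketch below uses the standard definitions on flattened 0/1 pixel lists; the function names and the eight-pixel example are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for flattened binary masks."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def iou(y_true, y_pred):
    """|intersection| / |union| of the mangrove pixel sets."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    union = tp + fp + fn
    return tp / union if union else 1.0


def overall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


truth = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(iou(truth, pred), overall_accuracy(truth, pred), f1_score(truth, pred))
```

IoU ignores true negatives, so it is stricter than overall accuracy when non-mangrove pixels dominate the scene, which is why it is a useful complement to accuracy for this task.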

Experimental Results
The overall accuracy of the trained ME-Net reached 97.49%, and the F1 score reached 96.56% (Table 3), which shows that the proposed model performs well in extracting mangroves from remote sensing imagery. To show that the method is general, we verified it on data from roadside areas, estuaries, bays, shoals, and islands. In the ablation study, we successively added GAM, MCE, and BFU and compared the results in various experimental scenarios (Figure 10) to explore the impact of each module on the performance of mangrove extraction. Although most mangroves in the remote sensing imagery can be accurately extracted by using the GAM alone, the blue area is greatly reduced after the MCE is added, which indicates that multiscale information improves the accuracy of mangrove extraction. However, the prediction results in the fourth row show that problems such as "salt and pepper noise", blurred boundaries, and misclassified or missed pixels still exist. BFU was introduced into the model to solve these problems. With BFU, the red and blue areas were reduced to different degrees. This finding shows that the BFU further constrains the pixel classification information of mangroves and further optimizes the predicted results.
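The scores reported above can all be derived from the four pixel counts of a binary confusion matrix. The sketch below uses the standard definitions of overall accuracy, precision, recall, F1, and IoU; the function name and the toy counts are illustrative only and are not taken from the experiments.

```python
def classification_scores(tp, fp, fn, tn):
    """Standard binary scores from pixel counts (TP, FP, FN, TN)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "overall_accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 1.0,
    }


# Toy counts for a 100-pixel image.
scores = classification_scores(tp=8, fp=2, fn=2, tn=88)
```

Because true negatives dominate in scenes where mangroves cover a small share of the image, overall accuracy is usually higher than F1 or IoU for the same prediction.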

Evaluating the Model by a New Dataset
The overall accuracy and F1 score both reached at least 96.56% on the DNNR dataset. To show that the designed model generalizes to other remote sensing imagery, we built a new dataset based on the study area of the He'anpian Mangrove Nature Reserve in southeastern Zhanjiang City, Guangdong Province. The geographical coordinates of the study area are 110°17′49″-110°27′40″ E and 20°34′11″-20°43′48″ N. The mangroves labeled by experts in the remote sensing images were treated as the ground truth. The precision of the trained ME-Net on the new dataset reached 96.00%, and the F1 score reached 95.55% (Table 4). A series of experiments was conducted to qualitatively show that our model generalizes to the new dataset (Figure 11). When the trained models were applied to the new dataset, the red and blue areas were greatly reduced by using the GAM, MCE, and BFU. The experimental results of mangrove extraction with the different modules of ME-Net showed that the proposed method can effectively extract mangroves and has good generalization ability.

Effects of Sample Data on the Results
Mangrove areas can be identified in Sentinel-2 remote sensing imagery through the false-color image composed of the SWIR, G, and B bands. However, the phenomena of "same object with different spectra" and "different objects with the same spectra" are observed because of the different living environments and distributions of mangroves. Accordingly, the mangroves in some areas are missed or misclassified. To improve the accuracy of mangrove classification, the multiband information of the remote sensing imagery should be fully used and multispectral indices should be mined. In this research, five original bands were selected from the sample data, namely, B, G, R, NIR, and SWIR-1. In addition, six multispectral indices (NDVI, MNDWI, FDI, WFI, MDI, and PCA1) were computed to mine the spectral, textural, and shape information that separates mangrove from non-mangrove features.
We used ResNet-101 weights pretrained on the ImageNet dataset as the initial weights of our basic feature extraction network and upsampled the output in accordance with the structure of the FCN; this baseline is referred to as the ResNet-based FCN. Under the ResNet-based FCN structure, we took the true-color images of the B, G, and R bands as the initial input data of the experiment and progressively added new input data (Table 5). Adding original band information and multispectral indices effectively improved the results of mangrove prediction: the IoU of the network increased from 86.64% to 92.13%. Table 5 shows that the IoU increased by 0.69% with the addition of NIR and SWIR-1 and by a further 4.80% after the six multispectral indices were added. NDVI, MNDWI, and MDI had a particularly marked effect on the results.
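As an illustration of how spectral indices can be appended to the original bands, the sketch below computes NDVI and MNDWI with their standard normalized-difference definitions and stacks them with the five bands. It covers only these two indices; the exact formulas and band assignments used for FDI, WFI, MDI, and PCA1 are not given in this section and are therefore omitted. The function names, the `eps` guard, and the reflectance values are assumptions for this example.

```python
import numpy as np


def normalized_difference(a, b, eps=1e-6):
    """(a - b) / (a + b), with a small eps (assumed) to avoid division by zero."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return (a - b) / (a + b + eps)


def build_input_stack(blue, green, red, nir, swir1):
    """Stack five bands plus NDVI and MNDWI into an (H, W, 7) array."""
    ndvi = normalized_difference(nir, red)       # NDVI = (NIR - R) / (NIR + R)
    mndwi = normalized_difference(green, swir1)  # MNDWI = (G - SWIR) / (G + SWIR)
    return np.stack([blue, green, red, nir, swir1, ndvi, mndwi], axis=-1)


# Toy 2 x 2 reflectance tiles (hypothetical values).
shape = (2, 2)
stack = build_input_stack(
    blue=np.full(shape, 0.1),
    green=np.full(shape, 0.1),
    red=np.full(shape, 0.1),
    nir=np.full(shape, 0.5),
    swir1=np.full(shape, 0.2),
)
```

Each index becomes one additional input channel, so the network receives the derived water and vegetation contrasts directly instead of having to learn them from the raw bands.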
The controlling variable method was used to analyze the effect of each multispectral index on the mangrove extraction results (Table 6). Table 6 shows that when NDVI, MNDWI, and MDI were excluded, the IoU was reduced by 0.81%, 1.2%, and 1.37%, respectively. When only the RGB bands were used as inputs, the IoU was reduced by 5.49%. The experimental data showed that the MNDWI and MDI can significantly improve the performance of mangrove extraction. The MNDWI is closely related to the characteristics of the water body, and mangroves are located in intertidal wetlands, such as estuaries, coasts, and islands, which coincides with the difference between Figure 12e,c. Therefore, the integration of water and vegetation characteristics is an important guide for distinguishing land vegetation from mangrove vegetation. In addition, the comparative analysis of the spectral characteristics showed that the spectral reflectance of mangroves in SWIR-2 is lower than that of terrestrial vegetation, which may explain why MDI significantly improves the classification results. Figure 12 shows that some land vegetation distributed near shallow surface water, such as lakes and wetlands, can be further distinguished from mangroves by MDI.
In different experimental scenarios, we compared the results (Figure 12) to explore the impact of adding different sample data to the ME-Net model on the performance of mangrove extraction (the IoUs are shown in Table 7). The experimental results showed that some FP pixels appeared in the prediction results of the model when only the three bands R, G, and B were used. The true-color remote sensing imagery indicated that this is a classical case of "different objects with the same spectra". Most features represented by these FP pixels were wetlands on the land surface or dense woodland growing in shallow water areas, whose spectral characteristics are similar to those of mangrove areas, which resulted in a large number of misclassifications. As more input data were added, the red and blue regions decreased to varying degrees, with a more significant reduction in the red areas. The results showed that rich multiband data and multispectral indices were conducive to a more detailed pixel-level classification of mangroves. The prediction results with and without the MDI index are shown in the fourth and fifth columns of Figure 12, respectively. The comparison of these columns clearly shows that the classification of marginal areas, such as the river and forest edges of the mangrove area, was greatly improved after the MDI index was added. This finding indicated that the MDI index contains the structural and textural information required for mangrove classification.
Table 7. The IoU for data in Figure 12 from rows 1 to 5.

Influence of Network Structure and Training Skills
In the pixel classification of remote sensing imagery, the classification and location of mangroves must be completed simultaneously. However, classification and location are contradictory goals in deep learning algorithms. The high-stage feature map of a CNN is excellent at solving the classification problem, but reconstructing a binary prediction at the original resolution is difficult because convolution and downsampling lose a large amount of location information. Therefore, we proposed GAM, which uses the classification information learned by the high-stage feature map as a weight to guide the location reconstruction of the low-stage feature map. Prior to the reconstruction of location information by GAM, MCE was used to extract features from the low-stage feature maps and fuse the multiscale information. Subsequently, BFU was used to eliminate problems such as aliasing and "grid artifacts" caused by the convolution and pooling operations and "salt and pepper noise" in image classification. We used the controlling variable method to analyze the effects of GAM, MCE, and BFU on the mangrove extraction results (Table 8).
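The idea of using globally pooled high-stage information as a weight on a low-stage feature map can be sketched in a few lines of NumPy. This is only an illustration of the principle and not the GAM implementation: the sigmoid gate, the requirement that both maps have the same number of channels, and the function name are assumptions, and the learned convolution layers of the real module are omitted. The `pooling` switch mirrors the comparison between global average pooling (GAP) and global max pooling (GMP).

```python
import numpy as np


def global_attention(low, high, pooling="gap"):
    """Reweight a low-stage map (H, W, C) with a pooled high-stage map (h, w, C)."""
    if low.shape[-1] != high.shape[-1]:
        raise ValueError("both feature maps must have the same channel count")
    if pooling == "gap":
        context = high.mean(axis=(0, 1))  # global average pooling, shape (C,)
    elif pooling == "gmp":
        context = high.max(axis=(0, 1))   # global max pooling, shape (C,)
    else:
        raise ValueError("pooling must be 'gap' or 'gmp'")
    weights = 1.0 / (1.0 + np.exp(-context))  # sigmoid gate (assumed)
    return low * weights                      # broadcast over H and W


# Toy feature maps: a zero high-stage map gives a weight of 0.5 per channel.
low_features = np.ones((4, 4, 2))
high_features = np.zeros((2, 2, 2))
guided = global_attention(low_features, high_features, pooling="gap")
```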
The experimental data in Table 8 showed that GAM can effectively extract global contextual attention information and significantly improves the performance of mangrove extraction from 92.13% to 95.55% compared with the ResNet-based FCN. Different global pooling methods lead to different results: GAP improved the performance of the model by 0.17% compared with GMP, so GAP was used in the final model. The performance of ME-Net improved further when the MCE module was added, and the IoU of mangrove classification increased from 95.71% to 96.16%. The results of C1355 and C1357 showed that convolution kernels of different sizes extract information at different scales and that this multiscale information improved the classification of mangroves. When we used BFU instead of MCE in the experiment, IoU increased from 95.71% to 95.89%. On the basis of MCE, IoU increased by 0.73% when BFU was added to the network. This finding showed that BFU is beneficial to mangrove classification and that combining it with MCE enables GAM to obtain more accurate and richer global mangrove information. We conducted a series of comparative experiments to show the effect of BFU on the mangrove classification results more intuitively. The results (Figure 10) clearly showed that the BFU module can simultaneously improve the boundary of pixel classification and eliminate some noise. Two training skills, namely data augmentation and deep supervision, were also used to improve network performance. Comparative experiments were conducted to explore the influence of these training skills on the results of pixel classification. The results (the last three rows in Table 8) showed that both approaches improve the model performance.
Table 8. Effects of GAM, MCE, and BFU on the extraction results of mangroves.

To ease the optimization of the deep neural network, we added a final loss function at the end of the main branch of ME-Net and a second loss function at the end of the ResNet-101 network. The first loss function optimized the pixel classification performance of the entire network, whereas the second optimized the feature extraction process of ResNet-101. We added a balance weight a to the second loss function. We tested five values of a (0, 0.25, 0.5, 0.75, and 1) to approximately determine its value and further analyze the effect of deep supervision on the network performance. As shown in Figure 13, under the same conditions, the model performed best, with an accuracy of 97.22%, when the balance weight was equal to 0.25. Finally, with all methods and techniques combined, the performance of the ME-Net model improved to 97.48%.
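The deep supervision scheme can be written as a weighted sum of the two losses. The sketch below assumes that the total loss is the main loss plus a times the auxiliary loss and that both terms are binary cross-entropy; the text specifies neither the loss type nor the exact combination, so both are assumptions, as are the function names and the toy probabilities.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over a flat list of pixel probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def deeply_supervised_loss(main_probs, aux_probs, labels, a=0.25):
    """Main-branch loss plus the auxiliary ResNet-101 loss scaled by a."""
    main_loss = binary_cross_entropy(main_probs, labels)
    aux_loss = binary_cross_entropy(aux_probs, labels)
    return main_loss + a * aux_loss


# Toy predictions for two pixels (1 = mangrove, 0 = non-mangrove).
labels = [1, 0]
main_probs = [0.9, 0.2]
aux_probs = [0.6, 0.4]
losses = {a: deeply_supervised_loss(main_probs, aux_probs, labels, a)
          for a in (0, 0.25, 0.5, 0.75, 1)}
```

With a = 0 the auxiliary branch has no influence, and with a = 1 both branches contribute equally; the five values in the dictionary correspond to the settings tested in the text.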

Figure 13. Results of ME-Net with different values of balance weight a (horizontal axis: the balance parameter a of the loss; vertical axis: IoU (%)).

Model Analysis
This work was compared with several recent methods, including FCN [31], SegNet [30], DilatedNet [46], U-Net [21], PSPNet [22], the DeepLab series [23,24,47], and Mask R-CNN [48], to evaluate the effectiveness of the proposed ME-Net model in mangrove extraction from remote sensing imagery. All methods were trained, validated, and tested on the same datasets for an objective and impartial comparison. The comparative test results are shown in Table 9. Table 9 indicates that the proposed ME-Net model performed effectively in the mangrove extraction task: it achieved the highest IoU (96.97%) without data augmentation or deep supervision.
To show the impact of different methods on mangrove extraction performance more intuitively, we selected samples whose classification results are difficult to predict and compared the prediction results of the ResNet-based FCN, DeepLab v3, and ME-Net models (Figure 14). Scenes that were difficult to classify, such as nonblock areas, sporadically scattered patches, and coastal strip edges, were used in the experiments to increase the contrast of the classification results. The object-oriented method performed poorly in extracting mangroves compared with the deep learning methods (Figure 14 and Table 10).
The classification results of the different methods were compared in detail. The blue area in the prediction results of the ResNet-based FCN model was significantly larger than that of the other methods. The large number of FN pixels showed that many pixels that should belong to mangroves were not detected by the model, which indicates an under-fitting problem: a large amount of classification information was not learned by the model. The fourth and fifth columns in Figure 14 indicate that the blue area in the prediction result of DeepLab v3 was much smaller than that of the ResNet-based FCN (the IoUs are shown in Table 10). This finding indicated that the data fitting ability of the DeepLab v3 model was stronger than that of the ResNet-based FCN. The analysis of the network structure of the DeepLab v3 model showed that dilated convolution and ASPP can effectively capture multiscale information and improve the performance of mangrove extraction. Finally, the blue region was greatly reduced in the results of the proposed ME-Net model; however, the red region partially increased. This finding shows that the ME-Net model has strong data fitting ability and is effective for pixel classification. However, part of the boundary was over-fitted and overcompensated in the prediction results owing to the BFU, and some nonmangrove pixels were misjudged as mangroves. Although some over-fitting cases were found, the overall performance of the ME-Net model in mangrove extraction was still much better than that of the other pixel classification models. In addition, noise in the remote sensing imagery decreases the accuracy of ME-Net, and denoising methods will be exploited to address this problem in the future [48,49].
Figure 14. Performance of different pixel classification models in mangrove extraction. The first column (a) shows the true color of the remote sensing imagery; the second column (b) shows the corresponding ground truth; the third column (c) shows the prediction result of the object-oriented method in ENVI; the fourth column (d) shows the prediction result of the ResNet-based FCN model; the fifth column (e) shows the prediction result of the DeepLab v3 model; the sixth column (f) shows the prediction result of the ME-Net model. Green, red, blue, and black represent TP, FP, FN, and TN, respectively.
Table 10. The IoU for data in Figure 14 from rows 1 to 5.
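The color convention used in the qualitative comparison (green for TP, red for FP, blue for FN, black for TN) can be reproduced from a prediction mask and a ground truth mask. The sketch below is an illustrative helper and not the code used to render the figures; the function name and the exact RGB values are assumptions.

```python
import numpy as np

# Assumed RGB values for the four outcome classes.
COLORS = {
    "TP": (0, 255, 0),  # green: mangrove correctly detected
    "FP": (255, 0, 0),  # red: non-mangrove predicted as mangrove
    "FN": (0, 0, 255),  # blue: mangrove missed by the model
    "TN": (0, 0, 0),    # black: non-mangrove correctly rejected
}


def error_map(pred_mask, gt_mask):
    """Return an (H, W, 3) uint8 image coloring each pixel by its outcome."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # TN pixels stay black
    rgb[pred & gt] = COLORS["TP"]
    rgb[pred & ~gt] = COLORS["FP"]
    rgb[~pred & gt] = COLORS["FN"]
    return rgb


# Toy 2 x 2 example containing one pixel of each outcome.
gt = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
rgb = error_map(pred, gt)
```

Under this convention, a large blue area signals missed mangroves (under-detection), whereas a large red area signals non-mangrove pixels labeled as mangroves.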

Conclusions
Accurate extraction of mangroves from remote sensing imagery is important to dynamically map and monitor the distribution area of mangroves. However, mangroves have different geometric appearances and spectral and textural features, so their accurate extraction faces great challenges. In this work, datasets for mangrove extraction are developed, and a new pixel classification framework, ME-Net, is proposed. ME-Net is trained and tested to explore the impact of different sample data and feature learning modules on the mangrove extraction results. The controlling variable method is used to experiment on each band and multispectral index. The results show that the selected multiband data and multispectral indices are beneficial to the extraction of mangroves. In the network model, GAM is proposed to provide global context information that guides the low-stage feature map, an MCE module is proposed to extract multiscale information, and BFU is applied to optimize the classification results. In the data preprocessing, multiband remote sensing imagery and manually created multispectral indices are used to improve the performance of mangrove pixel classification. The results of the experiments on multiple remote sensing images indicate that the ME-Net model can effectively integrate a large amount of sample data, which helps to solve the problem of data redundancy and to mine abstract semantic information and location information in remote sensing imagery. The proposed approach can successfully extract mangroves in each scene. The results show that the framework is effective and feasible in improving the classification performance of mangroves in different coastal areas.
This work demonstrates the effectiveness of the proposed GAM, MCE, and BFU modules for mangrove extraction. Many groups of experiments show that deep learning methods can be exploited to extract mangroves and that our new deep convolutional neural network for extracting mangroves (ME-Net) can extract mangroves automatically and performs better than other typical deep learning methods.
An effective method is provided to improve the classification performance of remote sensing imagery. However, the model still has some unsolved problems, and future work will focus on the following aspects. First, an end-to-end pixel classification model should be implemented through residual modules and convolutions so that it can, to a certain extent, replace dense-CRF in boundary-fitting tasks while keeping the training process simple. The current model still has shortcomings compared with dense-CRF: it cannot use the input data as effectively as dense-CRF does to give pixels with similar colors and adjacent positions a more consistent classification. Second, the spatial computational intensity grid [50] will be exploited to improve the parallel performance of ME-Net.
The geophysical setting has an extremely important influence on the distribution of mangrove areas, especially naturally growing mangrove areas. Mangrove wetlands are located in the intertidal zone and are closely related to the characteristics of the water body. Hence, we will consider using time series of remote sensing imagery to extract natural mangrove regions in future research. Moreover, the results of this study indicated that selecting the right bands and indices is important for mapping mangroves. We will therefore investigate how knowledge graph methods can be used to find the right bands and indices.