Coastal Wetland Classification with Deep U-Net Convolutional Networks and Sentinel-2 Imagery: A Case Study at the Tien Yen Estuary of Vietnam

The natural wetland areas in Vietnam, which are transition areas between inland and ocean, play a crucial role in minimizing coastal hazards; however, during the last two decades, about 64% of these areas have been converted from natural to human-made wetlands. The conversion rate is anticipated to keep increasing due to economic development and urbanization. Therefore, monitoring and assessment of wetlands are essential for coastal vulnerability assessment and geo-ecosystem management. The aim of this study is to propose and verify a new deep learning approach to interpret 9 of 19 coastal wetland types classified in the RAMSAR and MONRE systems for the Tien Yen estuary of Vietnam. Herein, a ResNet framework was integrated into the U-Net to optimize the performance of the proposed deep learning model. Sentinel-2 images, ALOS-DEM, and NOAA-DEM data were used as the input data, whereas the output is the nine predefined wetland types. As a result, two ResU-Net models using the Adam and RMSprop optimizers achieve an accuracy higher than 85%, especially for forested intertidal wetlands, aquaculture ponds, and farm ponds. These models performed better than the Random Forest and Support Vector Machine methods. After optimization, the ResU-Net models were also used to map the coastal wetland areas in the northeastern part of Vietnam. The final model can potentially be updated with new wetland types in the southern parts and islands of Vietnam, towards wetland change monitoring in real time.


Introduction
Currently, about 70% of the world's population lives in coastal estuaries and around inland freshwater bodies [1][2][3]. According to [4,5], wetland ecosystems provide humankind with a large number of products worth USD 33,000 billion yearly. However, 64% of the world's wetlands have disappeared since the 1900s [6,7], and 87% since the 1700s [8]. Together with the decline in words, it is difficult to use the deep learning models developed in previous studies for further coastal wetland classification. Therefore, it is necessary to make deep learning models more applicable to the coastal wetland classification of the RAMSAR system. Accordingly, other studies can use or improve the models towards a complete model for coastal wetland classification.
Additionally, to observe wetland types over a large area, satellite images such as MODIS, Landsat, and Sentinel-2 have commonly been used [45][46][47]. Compared to the MODIS and Landsat satellite images with a low spatial resolution, Sentinel-2, as a multi-spectral imaging mission, can systematically obtain optical imagery over both inland and coastal areas at a high spatial resolution (10 to 60 m) [47]. In this research, the authors therefore propose ResU-Net models for coastal wetland cover prediction based on multi-temporal Sentinel-2 data in an estuary of Quang Ninh province, Vietnam. Three research questions relevant to wetland cover classification based on deep learning models guide this study:
• What are the advantages of integrating deep learning and multi-temporal remote sensing images for monitoring coastal wetland classification?
• How do the ResU-Net34 models for coastal wetland classification improve on the benchmark methods?
• How are wetland types distributed in the northeastern part of Vietnam?
In this study, multi-temporal 4-band Sentinel-2 images integrated with digital elevation models (DEMs) were used as input data of the ResU-Net models for coastal wetland-cover classification. Land covers in an estuary area of about 15 × 18 km were used as a mask to develop a ResU-Net model for wetland cover classification. The performance of the trained ResU-Net models is compared with results obtained from two benchmark methods, Random Forest (RF) and Support Vector Machine (SVM). After the best model is chosen, new Sentinel-2 images from other times can be added to interpret wetland cover changes in the Tien Yen estuary, as well as in the whole coastal area of Quang Ninh province, Vietnam. Notably, the authors explain in detail the wetland classification of different systems (Section 2.2) and define which coastal wetland types were improved in this study. Sample collection and model development are explained in Sections 2.3-2.5. The final models are compared with benchmark methods and discussed in Sections 3 and 4.

Study Area
The focus area of this study is the wetland area of the Tien Yen estuary, which belongs to Hai Lang, Dong Ngu, Binh Dan, and Dong Rui communes, Quang Ninh province, Vietnam (Figure 1). The tide is diurnal, with a tidal range of about 3.5-4.0 m. Days with one tidal rise and one tidal fall account for 85-95% of a month (i.e., over 25 days in the month). These tidal characteristics directly affect local aquaculture. The high tidal amplitude and good water exchange facilitate the intake of saltwater into the ponds. However, because of the high tide, the ponds must have dykes or high banks to reduce the influence of the continuous tide [48]. Accordingly, the area affected by alluvium is often used to grow two rice crops. Higher areas are often used for intercropping. Meanwhile, areas affected by seawater and tides often form saline soils, on which mangrove systems develop (for example, mangrove, black tiger, yellow and red).
In the dry season, the water level is lower, and the seaward flow is weaker than in the rainy season. The coastal soil is affected by tidal currents, creating favorable conditions for brackish-water aquaculture. The Tien Yen river is narrow, and the water flow from upstream areas in the rainy season often causes (1) flooding in many low-lying estuaries, (2) rapid freshening in shrimp farms, (3) increased erosion and leaching, and (4) the destruction of dike systems and swamp farms and the sweeping away of animals [49].
Regarding land-use conversion, before 1975, mangroves in Dong Rui commune covered about 3000 ha, mainly natural forests. Since 1992, Tien Yen district and Dong Rui commune have allocated 1500 hectares of mangrove land to local households. These landowners have made investments and converted mangrove land into shrimp farming ponds. However, this conversion has not brought the results the people expected [50]. Since 2000, the government of Dong Rui commune has adjusted its policies and called for a number of investment projects by governmental and non-governmental organizations to restore and replant the mangroves that had been destroyed. Especially since 2005, Dong Rui has promoted a model of community forest management, assigning specific forest areas to each village for planting, tending, protecting, and exploiting. As a result, people's awareness of the value of mangroves has been raised: no one cuts down the mangroves anymore, and people actively protect the forests [48]. From 2012 to date, Dong Rui commune has had over 3200 hectares of forest restored, and now only 500 hectares continue to be supported for restoration. Mangrove forests cover over 57% of the commune's total natural land area. Dong Rui is considered one of the few localities with large, good-quality mangrove areas in the northeastern part of Vietnam. However, areas outside Dong Rui are currently mostly used for aquaculture [51].
Remote Sens. 2020, 12, 3270


Selection of the Wetland Types for This Research
In Vietnam, the Government's Decree No. 66/2019/ND-CP (2019) and Decision No. 1093/QD-TCMT (2016) of the Vietnam Environment Administration, Ministry of Natural Resources and Environment (MONRE) (http://www.monre.gov.vn/English), follow the Ramsar Convention with the concept that "Wetlands are swampy areas, peatlands, areas of regular or temporary inundation, including coastal areas and island areas, with a depth not exceeding 06 m when the tide is at the lowest tide". In particular, coastal wetlands include salt and brackish lands along the coast and islands that are influenced by tides [52]. In the above definitions, the wetland is generally defined as an ecological transition zone, a transitional area between terrestrial and flooded environments, or a place where soil inundation supports the development of a typical flora.
There are two main ways to classify wetlands: landscape-based and hierarchy-based classifications [26,28,53]. A hierarchical classification system (in which attributes are used to distinguish between levels with greater differences) is superior because it allows classification at different levels of detail. Most classification systems have three to four categories, including coastal or saltwater wetlands and inland or freshwater wetlands.
Accordingly, the study separated 19 types of coastal wetlands based on MONRE's classification system [54] and the RAMSAR convention [29] (Table 1). Among them, there are 12 types of natural wetlands and seven types of human-made wetlands. This classification omits two types that do not occur in Vietnam, namely natural and man-made karst and other subterranean hydrological systems. This study focused on 10 of the 19 wetland types, those found in the northeastern coastal region of Vietnam. The irrigated and seasonally flooded agricultural lands are combined into one type because they are distributed discontinuously and heterogeneously in the fields, which makes them difficult to separate on the satellite images. The remaining eight types, which occur mostly in southern regions and island systems, are not covered in this study. In addition, canals, drainage channels, and small ditches (No. 18) are often narrow, making them difficult to identify on remote sensing images; this type was therefore not considered in this study. Each wetland type is explained in detail in Section 2.3.2.
6 | Estuarine waters | x x x
7 | Intertidal mud, sand or salt flats | x x
8 | Intertidal marshes | x x
9 | Intertidal forested wetlands | x x x
10 | Coastal brackish/saline lagoons | x x
11 | Coastal freshwater lagoons | x x
12 | Karst and other subterranean hydrological systems | x
Man-made wetland
13 | Aquaculture ponds | x x x
14 | Farm ponds | x x x
15 | Irrigated land | x x x
16 | Seasonally flooded agricultural land | x x
17 | Salt exploitation sites | x x
18 | Canals and drainage channels, ditches | x x
19 | Karst and other subterranean hydrological systems | x

Data and Sample Collection
The deep learning models are developed through three main steps: (1) zoning wetland areas; (2) input data preparation; and (3) training models. The structure of the deep learning model development for coastal wetland classification is shown in Figure 2. These steps are explained in Sections 2.3-2.5. Firstly, Section 2.3 presents the methods to collect and set up training and validation data.

Input Dataset Preparation
Based on the RAMSAR definition, the coastal wetland ecosystems can be separated from coastal inland areas based on geomorphic features. The wetland areas extend from the areas affected by tides down to an elevation of −6 m. Therefore, the essential input data in this step are digital elevation models (DEMs). In this study, the DEM was obtained from two sources: topographical data at a scale of 1:5000 and satellite data. The topographical data were used for the training process (explained in Sections 2.4 and 2.5), whereas the DEM obtained from satellite data was used for new predictions (explained in Section 2.7). The DEM data generated in this study are important not only to separate the wetland ecosystems from the inland areas but also to detect cliffs with a slope higher than 30 degrees. The wetland areas along the cliffs are commonly "rocky marine shores" as classified in the RAMSAR system. Therefore, the slope was calculated directly from the DEM data; it reflects the steepness, or degree of inclination, of the terrain surface relative to the horizontal [55]. The topographical data were collected from the Vietnam Academy of Science and Technology (VAST) only for districts surrounding the Tien Yen estuary. The data have contour lines at every 2.5 m of elevation.
For the satellite-derived DEM, 30 m inland DEMs from the Advanced Land Observing Satellite (ALOS) [56], generated by the Panchromatic Remote Sensing Instrument for Stereo Mapping (PRISM), were downloaded from the Google Earth Engine system (https://code.earthengine.google.com/). However, the ALOS data only provide heights above sea level. The lowest value of the ALOS DEM is zero; thus, the inland border of the '0' values clearly defines the sea-land boundary. The DEM under the sea, with a resolution of one arc-minute, was downloaded from the Global Relief Data collected by the NOAA National Centers for Environmental Information (NCEI) [57]. The DEM data covered the whole inland and offshore areas in the northeastern part of Vietnam and were re-projected to WGS84 / UTM zone 48N and resampled to a 30 m resolution raster. Afterward, the authors combined the inland ALOS DEM data with the NOAA DEM data along the boundary between sea and land (the coastline) to complete a full DEM from inland to offshore areas using ArcGIS software.
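The merging and slope steps described above were carried out in ArcGIS; the sketch below only illustrates the same idea with NumPy on toy grids. The function names, the rule "use ALOS where its value is above zero, NOAA elsewhere", and the finite-difference slope are assumptions made for illustration, not the authors' workflow.

```python
import numpy as np

def merge_land_sea_dem(alos_dem, noaa_dem):
    """Use ALOS heights where they are above zero, NOAA relief elsewhere."""
    alos_dem = np.asarray(alos_dem, dtype=float)
    noaa_dem = np.asarray(noaa_dem, dtype=float)
    return np.where(alos_dem > 0, alos_dem, noaa_dem)

def slope_degrees(dem, cell_size=30.0):
    """Slope of the terrain surface relative to the horizontal, in degrees."""
    dz_dy, dz_dx = np.gradient(np.asarray(dem, dtype=float), cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# Toy 3 x 3 grids in metres, assumed to be already on the same 30 m grid.
alos = np.array([[0.0, 0.0, 4.0],
                 [0.0, 2.0, 40.0],
                 [0.0, 3.0, 90.0]])
noaa = np.array([[-8.0, -3.0, 1.0],
                 [-7.0, -1.0, 5.0],
                 [-5.0, 0.5, 9.0]])

merged = merge_land_sea_dem(alos, noaa)
cliffs = slope_degrees(merged) > 30.0  # candidate cliff cells (slope above 30 degrees)
```

In this toy grid the steep rise in the right-hand column is flagged as a cliff candidate, while the gently sloping seabed cells are not.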
Regarding the multi-spectral satellite images, the Sentinel-2 images were chosen due to their spatial resolution of 10 m. Using medium-resolution satellite images from different times is useful to separate specific narrow wetlands covered by seawater or affected by tides, such as permanent and temporary wetlands and mangrove swamps [40][41][42]. Additionally, the Sentinel-2 images have been taken two to three times per year in the research area. In this study, the Sentinel-2 images taken on 07/11/2019 and 22/11/2019 were used to verify a mask for training the ResU-Net models. These images were taken when the tide was at 2.8 m. As all Sentinel-2 images of the research area in 2019 and 2020 were taken under the same tidal condition, the authors chose the clearer, cloud-free images for training the models. Interpreting satellite images from different times can represent the current situation of each wetland type. The field works were done in March 2020 to validate wetland types in the Tien Yen estuary. The authors also used the Sentinel-2 images from three periods (2016, 2018, and 2020) to assess wetland changes, as explained in detail in Section 2.7.

Wetland Classification in Sentinel-2 Imagery
In the first step (zoning wetland areas) of the wetland classification, the merged DEM data were used to separate the inland areas from the wetland areas in the estuary, which is strongly affected by tidal and river currents. The tidal level in the Tien Yen estuary fluctuates from three to four meters daily, while the coastline in the topographical maps of Vietnam is drawn at the average tidal level [49]. Therefore, the upper boundary of the wetland areas would be the two-meter contour line. In the topographical maps, however, the lowest inland contour line before the coastline is at 2.5 m, and the distance from this line to the coastline is less than 10 m. Therefore, the authors chose the 2.5 m contour line as the upper boundary of the wetland areas. Additionally, according to the RAMSAR and MONRE wetland classification systems, the offshore boundary is limited to 6 m below sea level ("-6" m), which was identified easily in both the topographical maps and the merged DEM data. The two objects separated using the topographic data are "inland areas", with elevations above 2 m, and "deep sea", with depths greater than 6 m. Because the main classified objects in this study are wetland types, both "inland areas" and "deep sea" are combined and called the "non-wetland" type. However, the research area in the Tien Yen estuary does not include the "deep sea" type. Therefore, in the following sections, the authors only refer to the "inland" type. It is the tenth type to be classified, in addition to the nine wetland types identified on the Sentinel-2 images.
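The zoning rule can be written as a simple elevation threshold on the merged DEM. The following is a minimal sketch, assuming the 2.5 m contour chosen above as the upper limit, −6 m as the lower limit, and inclusive boundaries; the constant and function names are hypothetical.

```python
import numpy as np

UPPER_LIMIT_M = 2.5   # inland boundary: the 2.5 m contour chosen in the text
LOWER_LIMIT_M = -6.0  # offshore boundary from the RAMSAR/MONRE definition

def zone_wetland(dem):
    """Return 1 for cells inside the wetland zone and 0 for "non-wetland"."""
    dem = np.asarray(dem, dtype=float)
    inside = (dem >= LOWER_LIMIT_M) & (dem <= UPPER_LIMIT_M)
    return inside.astype(np.uint8)

# Toy elevations in metres: inland (top left) to deep sea (bottom right).
dem = np.array([[12.0, 3.0, 2.5],
                [1.0, -0.5, -4.0],
                [-6.0, -6.5, -20.0]])
zone = zone_wetland(dem)
```

Cells marked 0 correspond to the combined "non-wetland" type ("inland areas" and "deep sea"); only cells marked 1 would be passed on to the wetland-type classification.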
After zoning the wetland areas, the Sentinel-2 image was integrated with the field works to identify ground control points (GCPs) of one non-wetland type and nine wetland types. Firstly, two Sentinel-2 images obtained in November 2019 were segmented into polygons using SAGA 7.6.3 software. In the segmentation result, some regions with different tones and shape structures are still included in the same category, while many small neighboring areas of the same color are assigned to different object types. Therefore, visual interpretation, combined with field interpretation samples using standard GCPs, was used to reduce the error of the automatic image partition.
The field works in March 2020 were carried out in the Tien Yen estuary, Quang Ninh province, to evaluate the indoor interpretation based on GCPs. The GCPs for image interpretation, after being analyzed and extracted from the original images, were assessed for accuracy through field surveys. The authors built circular plots with a radius of 50 m. The authors randomly selected 10 GCPs for each inland and wetland type on the Sentinel-2 images and then verified them via a field survey. The total number of standard plots for the whole study area is 10 GCPs × 10 types = 100 GCPs. Because the segmentation done before the field works is an automatic partition result, its error is more than 50% compared to the GCPs. Figure 3 shows that the "intertidal forested wetlands" and "marine subtidal aquatic beds" types are easily identified by color and distribution structure. On the true color combination, the shallow water surface among the estuary areas is easily identifiable by its light tones, while the "deep water surface" is easily identified by its darker colors and linear form. According to the coastal land use, some "intertidal forested wetlands" areas have been used for intensive aquaculture (fish farming), so this wetland type could be separated into a natural type and extensive farming in mangrove forests. However, the total area of mangrove forest is too small, which would reduce the input samples for training models. Therefore, the authors combined them into one type, as classified by the RAMSAR system.
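The sampling design (10 randomly selected GCPs for each of the 10 types) can be sketched as a stratified random draw. The class names, segment ids, and seed below are hypothetical placeholders; the source does not describe how the random selection was implemented.

```python
import random

def sample_gcps(segments_by_type, per_type=10, seed=0):
    """Pick `per_type` segment ids at random from every class."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {name: sorted(rng.sample(ids, per_type))
            for name, ids in segments_by_type.items()}

# Hypothetical candidates: 10 classes with 50 segment ids each.
classes = [f"type_{k}" for k in range(1, 11)]
candidates = {name: list(range(i * 100, i * 100 + 50))
              for i, name in enumerate(classes)}

plots = sample_gcps(candidates)
total = sum(len(ids) for ids in plots.values())  # 10 GCPs x 10 types
```

Each selected id would then be turned into a circular field plot with a 50 m radius around the segment location.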
Regarding the "farm ponds" and the "aquaculture ponds", it is difficult to distinguish them in remote sensing images with pixel-based classification. However, these wetland types are easy to access in the fieldwork. In fact, the aquaculture ponds have been used for intensive farming without high technology, whereas the farm ponds are commonly planned for shrimp farming with high technology. An aquaculture pond is commonly larger than a farm pond, but the farm ponds are distributed homogeneously next to each other over a large area (Figure 3). The "aquaculture ponds" can be identified by a bounded structure, a light blue border, and a fine pattern, while the "farm ponds" include agricultural ponds, farming ponds, and small tanks (smaller than 8 ha), easily identifiable by a small plot structure, a dark green color, and a thin surrounding bank. Therefore, the differences between these two wetland types are the area, shape, and distribution of the ponds, which require object-based instead of pixel-based classification.
Based on the standard interpretation of key samples, the authors interpreted wetland objects with the same tones, structures, and shapes on the segmentation result from SAGA 7.6.3. The image partitioning process in step 1 created 8459 regions divided into ten categories. The visual interpretation process normalized the boundaries of the objects. Segmented regions with similar tones and structures were combined into one object type, and areas of different colors were separated into other objects according to the interpretation pattern. For some objects having the same shape and color structures but different natural characteristics, we used high-resolution Google Earth images for additional interpretation. The outcome of this step is a mask for the ResU-Net development explained in the next sections.

ResU-Net Architecture for Coastal Wetland Classification
According to the universal approximation theorem, a network with a single hidden layer can represent any relation between nature and humans. However, the width of such a single-layer network could be massive [58]. Hence, the geo-informatics research community needs deeper network architectures to explain non-linear correlations in nature. Increasing the network depth, however, causes the gradients to explode or vanish [36]. Moreover, deeper networks (such as those with 50 layers) undergo convergence degradation, with precision saturating and errors staying higher than in shallower ones.
The ResU-Net (Deep Residual U-Net) is an architecture that takes advantage of deep residual neural networks with 34 layers [39,59,60] and U-Net [35,58,61]. The architecture of the proposed ResU-Net is shown in Figure 4. The ResU-Net networks integrate residual building blocks (abbreviated as ResBlock) in the encoder side of the U-Net models, whereas their decoder side remains as introduced in the former U-Net architecture [62,63]. The key idea of ResNet34 is to pass the information from the initial layers directly to the outputs of the ResBlocks (the so-called "identity shortcut connection"). The ResBlocks propagate initial information over layers without degradation, avoiding the loss of information during the encoder process and making it possible to develop a deeper neural network. This optimizes the inter-dependency between layers and reduces the computational cost by decreasing the number of parameters. The integration of ResNet34 into a U-Net, therefore, allows for training of up to hundreds or even thousands of layers, while the trained network still has a high performance. The ResNet34 networks have been used in object classification, image recognition, and non-computer vision tasks [39,59]. Based on these advantages, the ResU-Net architecture is chosen as the network backbone in this study. In this section, the authors explain in detail the architecture of the ResBlock, the encoder and decoder sides, as well as the development of ResU-Net models to classify coastal wetland ecosystems.
The ResU-Net (Deep Residual U-Net) is an architecture that takes advantage of deep residual neural networks with 34 layers [39,59,60] and U-Net [35,58,61]. The architecture of the proposed ResU-Net is shown in Figure 4. The ResU-Net networks integrate residual building blocks (abbreviated as ResBlock) in an encoder side of the U-Net models, whereas their decoder side remains as introduced in former U-Net architecture [62,63]. The key idea of ResNet34 is to skip the information from the initial layers in the outcomes of the ResBlocks (so-called "identity shortcut connection". The ResBlocks propagate initial information over layers without degradation, avoiding the loss of information during the encoder process and enabling to develop a deeper neural network. It optimizes the inter-dependency between layers and reduces the computational cost by decreasing the parameters. The integration of the Resnet34 into a U-Net, therefore, allows for training of up to hundreds or even thousands of layers, while the trained network still has a high performance. The Resnet34 networks have been used in object classification, image recognition, and non-computer vision tasks [39,59]. Based on these advantages, the ResU-Net architecture is chosen as the network backbone in this study. In this section, the authors explain in detail the architecture of the ResBlock, encoder and decoder sides, as well as the development of ResU-Net models to classify coastal wetland ecosystems.   (5) Pooling Layer (POOL). These five-layer types were arranged, as shown in Figure 4, to form a full ResU-Net architecture and described as follows: • INPUT layer is added at the beginning of the ResU-Net to insert the raw pixel values of all input images to the training model. In this study, four bands (red, green, blue, and near-infrared bands), the raw Sentinel-2 images depicted in Section 2.3.1 were merged with the DEM data. 
Then, the input data were separated into 1820 sub-images with the dimension of 128-pixel wide, 128-pixel height, and five spectral bands. • BATCH NORMALIZATION layer is used to standardize outcomes from the CONV layer to the same size, before a new measurement. This layer is used to optimize the distribution of the activation values during the model development, avoiding internal covariate shift problems [64]. Every layer of input data is standardized by using the mean (β) and variance (or standard deviation -γ) parameters representing the relation between input and output batch data in the following formula: wherex i is calculated based on the mean (µ B ) and variance σ 2 B of mini-batch M = {x1 . . . n} as in the following formula: In total, four parameters can be trained or optimized in the batch normalization layers.
• PADDING layers add rows and columns of zeros around the input images so that information at the image corners and edges is used in the calculation as fully as information in the image center.
• The POOLING layer is a sample-based discretization process that downscales the data with 2 × 2 spatial windows [58]. In the ResU-Net models, the max-pooling layer is used only once, in the eighth layer (Appendix A), before the ResBlocks. In the remaining layers, instead of using pooling layers for downsampling, the stride is increased from one to two.
• CONV layers calculate the neural outputs using a collection of filters. The filter width and height are smaller than those of the input. In this study, the chosen filter dimension is 3 × 3. The filter slides across the images, linking the input images with local regions. New pixel values are calculated from the input and passed through a ReLU activation function (more detail in Section 2.5). The ReLU function uses max(0, x), a threshold at zero, which copes with the considerable size of the input images (128 × 128 × 5) and speeds up the convergence of the ResU-Net models [62]. In this study, the authors selected 34 CONV layers for the ResU-Net construction, with 64, 128, 256, and 512 filters chosen for the CONV layers in the contracting direction to improve the training and validation performance.
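The batch normalization and ReLU steps described in the list above can be illustrated with a minimal sketch in plain Python. It operates on a mini-batch of scalar activations rather than on image tensors, and the function names, the default values of `gamma`, `beta`, and `eps`, and the sample activations are assumptions made for illustration, not values taken from the study.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one mini-batch of scalar activations, then scale and shift.

    gamma (scale), beta (shift) and eps are illustrative defaults.
    """
    n = len(batch)
    mean = sum(batch) / n                          # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / n  # biased mini-batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def relu(x):
    """ReLU activation: threshold at zero."""
    return max(0.0, x)


activations = [2.0, 4.0, 6.0, 8.0]   # hypothetical CONV outputs
normalized = batch_norm(activations)
activated = [relu(v) for v in normalized]
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in activated])
```

After normalization the mini-batch has a mean close to zero and a variance close to one, and ReLU then sets the negative half of the values to zero.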
The ResBlock integrated into the encoder side of the ResU-Net to classify the coastal wetland ecosystems is described in Figure 4. In the block diagram, a completed residual block is a combination of two batch normalization layers, two ReLU activation layers, two padding layers, and two convolution layers. The encoder in the contracting path consists of 15 completed ResBlocks with identity shortcut connections. The identity shortcut connection adds the input of a ResBlock to its output. Where the number of filters changes, the input is first passed through a convolution layer with a (1, 1) kernel to increase the number of feature channels to the required filter size. To prevent the loss of information from the initial image, this (1, 1) convolution layer was used instead of summing features across pixels with a larger kernel [65]. The output of the whole encoder is passed through a "batch normalization-activation" block, which acts as a bridge to enlarge the field of view of the filters before the decoder side, or expansive path.
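The identity shortcut connection can be sketched for a single pixel as follows. This is a simplified illustration in plain Python, not the authors' implementation: the residual branch is passed in as a function, and the names `conv1x1` and `residual_block` as well as the weights are hypothetical.

```python
def conv1x1(pixel, weights):
    """1x1 convolution at one pixel: a linear map from input to output channels."""
    return [sum(w * c for w, c in zip(row, pixel)) for row in weights]


def residual_block(pixel, residual_fn, projection=None):
    """Add the shortcut (identity or 1x1 projection) to the residual branch."""
    shortcut = pixel if projection is None else conv1x1(pixel, projection)
    residual = residual_fn(pixel)
    if len(residual) != len(shortcut):
        raise ValueError("residual and shortcut need the same channel count")
    return [r + s for r, s in zip(residual, shortcut)]


# Identity shortcut: if the residual branch outputs zeros, the input passes through.
same = residual_block([1.0, 2.0], lambda p: [0.0, 0.0])

# Projection shortcut: a 1x1 convolution lifts 2 channels to 3 before the addition.
lift = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 3x2 weights
wider = residual_block([1.0, 2.0], lambda p: [0.5, 0.5, 0.5], projection=lift)
print(same, wider)
```

The first call shows why information from earlier layers is not degraded: the block can always fall back to passing its input through unchanged.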

2. Decoder architecture
In addition to the batch normalization and convolution layers mentioned above, the expansive path uses two other layer types: concatenate and up-sampling layers. These layers can be explained as follows:
• CONCATENATE layers link information from the encoder path to the decoder path. The data standardized by the batch normalization and activation functions in the encoder path are combined with the up-sampled data, which makes the prediction more accurate.
• UP-SAMPLING layers are simple, weight-free layers that double the input dimensions and can be used in a generative model, followed by a traditional convolution layer [66]. Up-sampling with a factor of 2 is applied to recover the size of the segmentation map on the decoding path.
Five up-sampling blocks were generated to reduce the depth of the sub-images from 512 to 256, 128, 64, 32, and 16. Each up-sampling block consists of, in order, an up-sampling layer, a concatenate layer, a convolutional layer, two batch normalization layers, and a convolutional layer. During concatenation, the width and height of the sub-images from the encoder path equal those in the decoder path. The up-sampling steps convert the prediction values from the ResBlocks back to wetland-type values.
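The effect of the five up-sampling blocks on the feature-map shape can be traced with a short sketch. The starting shape of 4 × 4 × 512 is an assumption: it follows if the 128 × 128 input is halved five times in the encoder, which the text does not state explicitly. The function name `trace_decoder` is hypothetical.

```python
def trace_decoder(height, width, depth, decoder_filters=(256, 128, 64, 32, 16)):
    """Return the (height, width, depth) shape after each up-sampling block."""
    shapes = [(height, width, depth)]
    for filters in decoder_filters:
        height, width = height * 2, width * 2     # up-sampling by a factor of 2
        shapes.append((height, width, filters))   # convolutions set the new depth
    return shapes


# Assumed bottleneck shape for a 128 x 128 input halved five times in the encoder.
shapes = trace_decoder(4, 4, 512)
for shape in shapes:
    print(shape)
```

Under this assumption the last block restores the 128 × 128 size of the input sub-images with a depth of 16.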
The first convolutional layer uses a 7 × 7 filter to retain the information from the input data, whereas the remaining convolutional layers use 3 × 3 filters. The number of parameters of a convolutional layer is calculated as follows:

$$\text{Parameters}_{CONV} = (H \times W \times D) \times N_{Filter}$$

where H is the filter height, W is the filter width, D is the number of filters in the previous layer, and $N_{Filter}$ is the number of filters in the current layer. For instance, the second convolutional layer has (3 × 3 × 64) × 64 = 36,864 parameters. Because batch normalization generates four parameters for each filter of a convolutional layer, the number of parameters in a batch normalization layer is calculated as follows:

$$\text{Parameters}_{BN} = 4 \times N_{Filter}$$

These parameters can be optimized with different choices of activation and optimizer functions to improve the performance and accuracy of the ResU-Net models, as described in detail in Section 2.6. During ResU-Net development, the accuracy on both the training and validation data was tested to avoid overfitting and underfitting problems [59]. The best ResU-Net is the one whose predicted wetland types are most consistent with the labels assigned to the training and validation data. The ResU-Net model was developed with the Segmentation Models Python API in the Keras framework, an API designed for image segmentation based on TensorFlow [67]. During model development, the monitored values were the total accuracy, the per-class accuracy, and the loss on the test and validation data. ResU-Net training is limited to 200 epochs, but it can be halted earlier if the accuracy values on the testing data set converge and do not change for 20 epochs.
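The parameter counts described above can be checked with a few lines of Python. The sketch follows the text in ignoring bias terms; the function names are hypothetical, and the first-layer example assumes five input bands and 64 filters, which is an illustration rather than a figure reported in the paper.

```python
def conv_params(height, width, depth_in, n_filters):
    """Weights of a convolutional layer (bias terms ignored, as in the text)."""
    return height * width * depth_in * n_filters


def batch_norm_params(n_filters):
    """Four batch normalization parameters per filter."""
    return 4 * n_filters


print(conv_params(3, 3, 64, 64))   # 36864, the worked example in the text
print(batch_norm_params(64))       # 256
# Assumed first layer: 7 x 7 filter, five input bands, 64 filters.
print(conv_params(7, 7, 5, 64))    # 15680
```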

Alternative Options to Develop ResU-Net Models
According to the ResU-Net architecture for the wetland classification, two types of functions, the loss function and the optimizer method, can be modified to optimize the model. These functions determine the optimal parameters of the filters in the batch normalization and convolutional layers. The final loss function and optimizer method for model development are chosen based on the accuracy and loss values achieved.

Loss Functions
The loss function measures how well the trained models predict new input data. Because the number of samples for the nine wetland objects is not balanced in the training and validation datasets, two loss functions were chosen to train the ResU-Net models in this study, (1) the dice loss/F1 score and (2) the focal loss, instead of the traditional multi-class classification loss functions used by [68,69]. These losses reduce the effect of the imbalance between objects in the training datasets, especially for the inland-area type, which covers a large part of the coastal area in the input data. With the traditional cross-entropy loss, the loss from the negative samples dominates the overall loss, so the models are optimized to predict the negative samples and ignore the positive ones during training [67,68,70]. The focal loss proposed by [71] addresses this problem and optimizes the models to classify the positive samples correctly. This loss function considers the loss in a global sense rather than at the micro level. Therefore, it is more useful for image-level prediction than other cross-entropy losses [72]. Accordingly, the focal loss (FL) between the input Sentinel-2 image (S) and the respective ground truth (G) is calculated as in Formula (7). Additionally, the authors added the dice loss proposed by [73], which calculates the loss at both local and global scales with high accuracy. This function, which estimates the overlap between the input and the mask data, is calculated by Formula (8).
where B = 10 is the number of classes (nine wetland types and one non-wetland type), A is the number of observations in the whole input data, and α and γ are weighting factors in the range [0, 5].
Based on the advantages of both the focal and dice loss functions, they are merged into one loss value. In this study, two other accuracy values are calculated: the total accuracy, i.e., the share of correctly classified pixels among all pixels, and the Intersection over Union (IoU):

$$IoU = \frac{TP}{TP + FP + FN}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives between prediction and ground truth. The trained model with the lowest loss values is considered the best model for classifying new wetland regions.
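The two accuracy values can be computed from paired label lists as in the following sketch. The function names and the small label lists are hypothetical; the total accuracy is taken here as the share of correctly classified pixels, which is the usual definition.

```python
def class_counts(predicted, truth, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == label and t == label)
    fp = sum(1 for p, t in pairs if p == label and t != label)
    fn = sum(1 for p, t in pairs if p != label and t == label)
    return tp, fp, fn


def iou(predicted, truth, label):
    """Intersection over Union for one class: TP / (TP + FP + FN)."""
    tp, fp, fn = class_counts(predicted, truth, label)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def total_accuracy(predicted, truth):
    """Share of pixels whose predicted class equals the ground truth."""
    return sum(1 for p, t in zip(predicted, truth) if p == t) / len(truth)


truth = [0, 0, 1, 1, 2, 2]       # hypothetical ground-truth class per pixel
predicted = [0, 1, 1, 1, 2, 0]   # hypothetical model output
print(iou(predicted, truth, 1), total_accuracy(predicted, truth))
```

Averaging `iou` over all classes gives a mean IoU that, unlike total accuracy, is not dominated by the largest class.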

Optimizer Methods
Optimization approaches based on the stochastic gradient descent algorithm are widely used to train neural networks by reducing a cost function. Changing the weights in the negative gradient direction minimizes the loss and improves the accuracy of the trained networks. The errors of the trained models (the loss function) were calculated during the optimization cycles. One epoch is one pass of the data forward and backward through the ResU-Net models [74], and the weights must be updated after each epoch to reduce the loss value for the next evaluation. Six optimization algorithms were tested in turn during the ResU-Net development process: Adam (Adaptive Moment Estimation), Adagrad (Adaptive Gradient Algorithm), Adamax, RMSProp (Root Mean Square Propagation), SGD (Stochastic Gradient Descent), and Nadam (Nesterov-accelerated Adaptive Moment Estimation). Table 2 presents an overview of these optimization algorithms. All in all, the best optimizer is the one that produces the highest accuracy and the lowest loss function values.

Remote Sens. 2020, 12, x FOR PEER REVIEW

Notation for Table 2: θ is the parameter value; η is the learning rate (step size); t is the time step; ε = 10^-8; g_t is the gradient; E[g^2] is the moving average of squared gradients; m and v are estimates of the first and second moments; u_t is the max operation; β is the moving average parameter (good default value: 0.9).
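The update rules behind four of the optimizers named above can be sketched in plain Python for a single scalar parameter, using the notation θ, η, g, ε, and β. These are the standard textbook forms, not a transcription of Table 2; Adamax and Nadam are omitted for brevity. The learning rate of 0.01, the test function f(θ) = θ², and all function names are assumptions made for illustration.

```python
import math


def sgd_step(theta, g, state, lr=0.01):
    return theta - lr * g


def adagrad_step(theta, g, state, lr=0.01, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g * g            # accumulated squared gradients
    return theta - lr * g / (math.sqrt(state["G"]) + eps)


def rmsprop_step(theta, g, state, lr=0.01, beta=0.9, eps=1e-8):
    # E[g^2]: moving average of squared gradients
    state["E"] = beta * state.get("E", 0.0) + (1 - beta) * g * g
    return theta - lr * g / (math.sqrt(state["E"]) + eps)


def adam_step(theta, g, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g       # first moment
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g   # second moment
    m_hat = m / (1 - beta1 ** t)                                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize(step_fn, theta=1.0, steps=200):
    """Minimize f(theta) = theta**2, whose gradient is 2 * theta."""
    state = {}
    for _ in range(steps):
        theta = step_fn(theta, 2.0 * theta, state)
    return theta


results = {fn.__name__: minimize(fn)
           for fn in (sgd_step, adagrad_step, rmsprop_step, adam_step)}
for name, value in results.items():
    print(name, round(value, 4))
```

All four rules move θ toward the minimum at zero; they differ in how the step size adapts to the history of gradients, which is why the choice of optimizer changes the convergence behavior reported later.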


Model Comparison
In this section, the prediction results of six ResU-Net models using six optimization algorithms (called Adam-ResU-Net, Adamax-ResU-Net, Adagrad-ResU-Net, Nadam-ResU-Net, RMSprop-ResU-Net, and SGD-ResU-Net) are compared with the results of two benchmark models, RF and SVM. A total of 1146 random points were chosen in the Tien Yen estuary. The wetland types interpreted by the eight models and taken from the mask were assigned to these 1146 points. The interpretation results of the eight models were then compared with the original information from the mask to check the performance of each trained model. The two evaluation values chosen are the overall accuracy (ACC) and the kappa coefficient. The best model achieves the highest ACC and kappa values (presented in Section 3.2). The two benchmark models were set up in Python as follows:

2.6.1. Random Forest (RF)
RF was proposed in 2001 as a non-parametric machine learning ensemble by [77]. A forest consisting of a large number of decision trees is generated automatically and randomly, and the final decision is made by majority voting [78]. The training dataset was split: 80% of the data were assigned as bootstrap samples for the decision trees, and 20% were assigned for validation as out-of-bag samples to evaluate the RF model independently. To increase the homogeneity of the subsets, at each node RF randomly chooses a subset of variables and tests them to split the training data [32]. Therefore, the decision trees in the forest vary, which avoids overfitting problems [79]. The number of trees, the number of variables, and the number of training data are adjustable parameters. Once the forest is grown, it can be used for new prediction and classification. In this study, the number of trees and variables was tested with 10, 100, 500, and 1000. The highest accuracy was achieved with 100 trees.
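For the comparison of the eight models at the random check points, the overall accuracy and Cohen's kappa coefficient can be computed from paired labels as in the sketch below. The function names and the six sample labels are hypothetical; in the study the lists would hold the 1146 point labels from the mask and from one model.

```python
from collections import Counter


def overall_accuracy(predicted, reference):
    """Share of check points where the model agrees with the mask."""
    return sum(1 for p, r in zip(predicted, reference) if p == r) / len(reference)


def cohen_kappa(predicted, reference):
    """Agreement corrected for the agreement expected by chance."""
    n = len(reference)
    p_o = overall_accuracy(predicted, reference)
    pred_counts, ref_counts = Counter(predicted), Counter(reference)
    p_e = sum(pred_counts[c] * ref_counts[c] for c in ref_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e) if p_e != 1.0 else 1.0


reference = ["pond", "pond", "forest", "forest", "water", "water"]   # mask labels
predicted = ["pond", "forest", "forest", "forest", "water", "pond"]  # model labels
print(overall_accuracy(predicted, reference), cohen_kappa(predicted, reference))
```

Kappa is lower than the overall accuracy because part of the observed agreement would also occur if labels were assigned at random with the same class frequencies.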

Support Vector Machine (SVM)
The SVM is a supervised machine learning algorithm that has been used for both classification and regression [80]. For classification, SVM models create a hyperplane or plane that separates categories by wide gaps [78,81]. In the simplest case, the hyperplane is generated in two-dimensional space to divide the data into two categories [82]. In this study, the training data were set up as in the RF model. The data are converted to a corresponding multi-dimensional space, and the plane is generated to divide the data into categories [83]. To optimize the SVM models, two parameters were searched and optimized: "gamma", a kernel coefficient, and "C", the penalty parameter of the error term. Increasing the gamma value makes the plane follow the training dataset more closely. Even if the training error is minimized, this can create over-fitting problems. In addition, the performance of the SVM model is affected by the choice of kernel function, such as the linear, polynomial, sigmoid, and radial basis (RBF) functions [84]. The "C" value controls how strongly errors on the training data are penalized during SVM development. Hence, values of "gamma" and "C" were tested to achieve the highest overall accuracy and kappa values. In this study, the optimal values of 0.25 for "gamma" and 100 for "C" were selected.
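The role of "gamma" can be illustrated with the radial basis function kernel, K(x, y) = exp(-gamma · ||x - y||²). The text does not state which kernel was finally used, so the RBF kernel is an assumption here, as are the function name and the sample points; only the value gamma = 0.25 comes from the study.

```python
import math


def rbf_kernel(x, y, gamma=0.25):
    """RBF kernel: similarity decays with the squared distance between samples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


a, b = (0.0, 0.0), (2.0, 0.0)            # two hypothetical feature vectors
for gamma in (0.025, 0.25, 2.5):
    print(gamma, round(rbf_kernel(a, b, gamma), 5))
```

With a larger gamma, the similarity between the two samples drops toward zero, so each training sample influences only its immediate neighborhood and the decision surface can follow the training data very closely, which is the over-fitting risk described above.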

Application of Trained ResU-Net Models for New Coastal Wetland Classification
Once the final ResU-Net model was chosen, the most important function of the deep learning model is to predict the distribution of the wetland types and their changes from new Sentinel-2 images. In this study, the authors downloaded Sentinel-2 images acquired since 2015 along the coastline of the northeastern part of Vietnam. The wetland areas were prepared as explained in detail in Section 2.3. When these new images are input into the trained ResU-Net, the model uses the trained parameters in its 199 layers to convert the new input images into different spatial matrices before interpreting the final type value for each image pixel. In the FC layer, the class scores are assigned to the names of the wetland types. The wetland results of the final ResU-Net model are compared with former predictions in Vietnam to assess the wetland changes in the research areas, as explained in Section 4.
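The last step, turning per-pixel class scores into wetland-type names, can be sketched as an argmax over the ten scores of each pixel. The class names are taken from the text, but their order (the class index of each type) and the function name are assumptions, and the tiny score map is made up for illustration.

```python
# Class order is hypothetical; the names follow the wetland types used in the text.
CLASS_NAMES = [
    "inland areas",
    "rocky marine shores",
    "sand, shingle or pebble shores",
    "seasonally flooded agricultural lands",
    "shallow marine waters",
    "estuarine waters",
    "aquaculture ponds",
    "farm ponds",
    "marine subtidal aquatic beds",
    "intertidal forested wetlands",
]


def classify_pixels(score_map):
    """Map each pixel's list of class scores to the name of the highest-scoring class."""
    labels = []
    for row in score_map:
        labels.append([CLASS_NAMES[max(range(len(px)), key=px.__getitem__)]
                       for px in row])
    return labels


# A 1 x 2 score map: the first pixel scores highest for class 6, the second for class 0.
scores = [[
    [0.01, 0.02, 0.02, 0.05, 0.05, 0.05, 0.60, 0.10, 0.05, 0.05],
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02],
]]
print(classify_pixels(scores))
```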

ResU-Net Model Performance
The distribution of nine wetland types and one non-wetland type in November 2019, obtained from visual interpretation and field interpretation samples, is shown in Figure 5. It was used as the input mask for all ResU-Net, RF, and SVM models. According to Figure 6 and Table 3, the ResU-Net model using the Adam optimizer has the highest accuracy on the validation data among the six proposed models. Its ACC value is 90%, whereas its IoU value is 83% after 200 epochs. Two other models, using the RMSprop and Nadam optimizer functions, predict the validation data with an accuracy of 85%. In contrast, the Adagrad and SGD optimizer functions provide low accuracy values. The loss function values of the models using the Adagrad and SGD optimizers (Group 1) only decreased to about 0.9, whereas those of the models using the Adam, Adamax, RMSprop, and Nadam optimizers (Group 2) decreased from about 1.3 to 0.1. Therefore, we used the models in Group 2 to predict the input Sentinel-2 image and compared the results with the distribution of wetland ecosystems in the Tien Yen district, as shown in Figure 5.

Accuracy Comparison among the Trained Models
The predictions of the models in Group 2 are shown in Figure 7. The coastal wetland predictions of the RF and SVM models are shown separately as a third group (Group 3) for model comparison. In general, the four prediction results in Group 2 are similar. The inland areas, rocky marine shores, sand, shingle or pebble shores, and seasonally flooded agricultural lands were predicted correctly by all four models. Two objects, the shallow marine waters and the estuarine waters, were difficult to interpret for the three models using the Adamax, Nadam, and RMSprop optimizers because of their mixture of sand and sea/river waters. The same situation occurs for the aquaculture and farm ponds, especially in the area inside the dams of the Hai Lang district.

The performances of the eight trained models (six ResU-Net models and two benchmark models) are compared in Table 4. Because the testing samples were chosen randomly in the research area, they can also be contained in the training or validation datasets. As a result, the IoU values of the eight models are higher than the results in Table 3. As shown in Figures 6 and 7, the models using the Adagrad and SGD optimizers (Group 1) have the lowest IoU and kappa values. The interpretation results of the RF and SVM models (Group 3) have an IoU of about 50%, whereas their kappa values are only 45% on average. Overall, the accuracy of the models in Groups 1 and 3 is lower than that of the four ResU-Net models using the Adam, Adamax, Nadam, and RMSprop optimizers (Group 2). Compared with the manual interpretation mask, the ResU-Net model using the Adam optimizer provides the best prediction. According to Figure 7, except for the two ResU-Net models in Group 1, all models interpret the "inland areas" and "farm ponds" types correctly. The SVM model misses all "shallow marine waters" and "estuarine waters" samples.
Although the "shallow marine waters" and "estuarine waters" areas interpreted by the RF model are more accurate than those interpreted by the SVM model, the "rocky marine shores", "aquaculture ponds", and "seasonally flooded agricultural lands" interpreted by the SVM model are more accurate than those interpreted by the RF model. Although the four models in Group 2 are in general more accurate than the two benchmark models in Group 3, their accuracy in interpreting the "seasonally flooded agricultural lands" type is lower than that of the two benchmark models, at only 60 to 67%, even for the Adam-ResU-Net model. Nevertheless, the overall accuracy and kappa index of the ResU-Net model using the Adam optimizer reach about 90%. As a result, the ResU-Net model using the Adam optimizer is used to predict new wetland types in the next interpretation.

Wetland Cover Changes in Tien Yen Estuary
Based on the trained ResU-Net model using the Adam optimizer, the distribution of the wetland types in the northeastern part of Vietnam was mapped (Figure 8). The mapped area is bounded by the depth of minus 6 meters and the tidal area at two meters. The wetland ecosystems are distributed mainly in Cua Luc bay, the Tien Yen estuary, and the coastal area of Mong Cai city. The "marine subtidal aquatic beds" and "intertidal forested wetlands" types have expanded in the northern part, whereas the human-made wetland types such as "aquaculture ponds" and "farm ponds" cover a larger area in the southern part than in the northern part. The area on islands was merged into "inland areas". The "rocky marine shores" type forms a narrow strip around cliffs and islands such as Van Don, Cat Ba, and Tra Bau islands.

Additionally, Figure 8 shows the changes in the areal percentage of wetland types in the Tien Yen estuary area in 2016, 2018, and 2020. The areas of the "shallow marine waters" and the "estuarine waters" changed in opposite directions. The area of shallow waters narrowed from 29% of the area in 2016 to 27% in 2020, while the estuarine area expanded from 15% in 2016 to 20% in 2020. This shows that the natural transport of alluvial material by the river to the sea became stronger over the last four years. Sand and mud accumulated to form small islands, sandbanks, and tidal flats. The area of farm ponds and aquaculture ponds narrowed from 16% in 2016 to 11% in 2020. According to interviews in 2020, aquaculture production has fallen significantly because of urbanization in the wetland area of Quang Ninh province, which led to land-use conversion from wetland to new urban land. Over the four years, local economic development and an uncontrolled, increasing population in the research area led to a sharp decrease in mangrove area, of up to 50% of the area.
Therefore, the program to afforest and protect mangrove ecosystems has been interested in some coastal communes of Tien Yen River by the district committee. It is reflected through the increase in planted forest area by over 20% and aquatic ecosystems by over 50% after four years. The area of the "rocky marine shores" and "seasonally flooded agricultural land" is stable, respectively, with 210,000 m 2 and 440,000 m 2 .
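The areal percentages reported above can be derived from classified maps by counting pixels per class. The following is a minimal sketch, not the authors' code. It assumes each map is an integer label raster on a common grid. The function names, the three-class toy maps, and the 10 m pixel size (the finest Sentinel-2 band resolution) are illustrative assumptions.

```python
import numpy as np

def class_area_percent(label_map, n_classes):
    """Percentage of pixels assigned to each class in an integer label raster."""
    counts = np.bincount(np.asarray(label_map).ravel(), minlength=n_classes)
    return 100.0 * counts / counts.sum()

def class_area_m2(label_map, n_classes, pixel_size_m=10.0):
    """Class areas in square meters; the 10 m pixel size is an assumption."""
    counts = np.bincount(np.asarray(label_map).ravel(), minlength=n_classes)
    return counts * pixel_size_m ** 2

# Toy 2 x 3 maps with made-up labels (0, 1, 2); these are not data from the study.
map_2016 = np.array([[0, 0, 1],
                     [1, 2, 2]])
map_2020 = np.array([[0, 1, 1],
                     [1, 2, 2]])

pct_2016 = class_area_percent(map_2016, n_classes=3)
pct_2020 = class_area_percent(map_2020, n_classes=3)
change = pct_2020 - pct_2016          # change in percentage points per class
area_2020 = class_area_m2(map_2020, n_classes=3)
```

Comparing percentages rather than raw pixel counts keeps the years comparable even if the number of valid pixels differs slightly between scenes.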

Comparison with Formal Networks/Frameworks
Compared to the wetland classification systems of RAMSAR and MONRE, this study focuses on nine coastal wetland ecosystems in the dynamic estuary in the northeastern part of Vietnam (Figure 8). Although wetland classification models were developed in some former studies [26,40,41,43,53], classification models for inland and coastal wetland ecosystems should be separated to provide suitable tools for different land managers. Most former studies focused on the technical methods or models used to identify wetlands, rather than explaining how their outcomes meet the standard wetland classification systems or how the trained models can be applied in practice for land management [40,44]. For example, the rocky marine shores, a specific ecosystem in the RAMSAR classification system, were identified by the trained ResU-Net models in this study, although they received little attention in many former studies. This ecosystem covers a narrow, gently sloping area near cliffs and is therefore difficult to identify in Landsat or SPOT satellite images.
Additionally, the use of remote sensing data was optimized in this study, especially through the integration of Sentinel-2, ALOS, and NOAA satellite data. Following former studies, the authors used the DEM as important data for extracting wetland areas. The trained models can use a DEM derived either from topographical maps or from the ALOS and NOAA data. However, a DEM generated from topographical maps can provide more accurate data than one derived from satellite data, especially for areas below sea level. The trained model, applied to high-quality (cloud-free) Sentinel-2 images collected two to three times per year, can be used effectively to monitor wetland use/cover changes, instead of waiting for land use maps that are generated only every five years in many countries. This matters in particular because the coastal wetland ecosystems in Vietnam are commonly affected by about five storm events annually. Identifying wetland changes can provide information on quantitative changes in the beneficial values of these ecosystems to coastal people, particularly in the northeastern part of Vietnam analyzed in this study.
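One way to combine a DEM with optical bands is sketched below. The paper does not specify how the inputs were merged, so this is an assumption: the DEM is resampled to the image grid beforehand and appended as one extra channel. The bounds of -6 m and 2 m follow the mapped-area description and are read here as elevations relative to sea level. All names and toy values are hypothetical.

```python
import numpy as np

def elevation_mask(dem, low_m=-6.0, high_m=2.0):
    """True where elevation lies within [low_m, high_m] meters."""
    dem = np.asarray(dem, dtype=np.float32)
    return (dem >= low_m) & (dem <= high_m)

def stack_with_dem(bands, dem):
    """Append the DEM as an extra channel to an (H, W, C) band array."""
    bands = np.asarray(bands, dtype=np.float32)
    dem = np.asarray(dem, dtype=np.float32)
    if bands.shape[:2] != dem.shape:
        raise ValueError("bands and DEM must share the same grid")
    return np.concatenate([bands, dem[..., np.newaxis]], axis=-1)

# Toy inputs with made-up values: four spectral bands on a 2 x 2 grid.
bands = np.zeros((2, 2, 4), dtype=np.float32)
dem = np.array([[-8.0, -6.0],
                [0.5, 3.0]])

stacked = stack_with_dem(bands, dem)   # shape (2, 2, 5)
mask = elevation_mask(dem)             # pixels inside the candidate wetland zone
```

In practice the channels would usually be rescaled before training, because reflectance and elevation have very different value ranges; that step is omitted here because the source does not describe it.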

Improvement of Land Cover Classification
While traditional satellite image interpretation methods require many real samples to generate a wetland cover map for a particular time and region, the final trained ResU-Net models can be used to interpret wetland types from new satellite images in any coastal area and at any time. Eleven wetland types that can be classified quickly with the trained model and the satellite data were taken from the 19 types listed in the RAMSAR and MONRE classification systems [29,54]. Further studies can therefore add new samples from other areas where the other eight wetland ecosystems occur. Notably, they can collect more "coral reefs" samples from islands with clear seawater and warm water temperatures (20-32 °C), or samples of coastal lagoons and salt exploitation areas in the middle part of Vietnam, which are strongly affected by wave action [48]. An advantage of deep learning models is that developers can add new samples to the trained model to improve it. The new models can not only predict the wetland ecosystem type more accurately but also identify more types, provided that they learn from correct samples. However, some specific human-made wetland types mentioned in the RAMSAR and MONRE classification systems, such as canals, ditches, and drainage channels in karst regions, cannot be identified in medium-resolution satellite images, because the width of these objects is commonly less than 10 meters. In this study, we merged these types with some nearby flooded and irrigated lands to collect a sufficiently large number of samples. For these specific human-made wetland types, it is necessary to use high-resolution images combined with fieldwork to identify them correctly.
Both high and low tidal levels can affect the input samples. If the satellite images are taken at low tide, all wetland types can be identified in dry conditions. If the images are taken at high tide, the tidal flats are flooded and the prediction models might assign them to the same shallow-water wetland type. Therefore, in further studies it is important to check the tidal level at the time the satellite images were taken. More samples can be collected when the tide is low in the research area to make the interpretation models more accurate.
Developing the ResU-Net for coastal wetland classification is costly and time-consuming for scientists. In this study, the authors used an Intel(R) Xeon(R) CPU @ 2.6 GHz with 16 GB RAM and an NVIDIA GeForce GTX 1070 GPU. The average time per epoch to train a ResU-Net model is more than 22 s. Meanwhile, the average time to train the RF and SVM models is from 45 to 60 s for each model. Although the time to train a ResU-Net model is long, the trained model can be updated with new data. In future work, different optimization approaches such as evolutionary or swarm intelligence algorithms may also be used to boost the ResU-Net models, instead of the six optimizers used here. This is a possible way to train the qualified ResU-Net models on new information from new multi-spectral satellite image data. A supercomputer is an alternative option for classifying the wetland types rapidly, especially when high-resolution data are used.
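The merging of narrow human-made types into a neighboring class can be expressed as a relabeling of the sample masks. The sketch below is illustrative only: the class IDs and the helper `merge_classes` are hypothetical and do not reflect the label scheme used in the study.

```python
import numpy as np

def merge_classes(label_map, merge_into, n_classes):
    """Relabel classes with a lookup table; unlisted classes keep their ID."""
    lut = np.arange(n_classes)
    for source_id, target_id in merge_into.items():
        lut[source_id] = target_id
    return lut[np.asarray(label_map)]

# Hypothetical class IDs, chosen only for this example.
IRRIGATED_LAND = 3
CANALS = 9
DITCHES = 10

labels = np.array([[0, 9, 3],
                   [10, 3, 9]])
merged = merge_classes(
    labels,
    {CANALS: IRRIGATED_LAND, DITCHES: IRRIGATED_LAND},
    n_classes=11,
)
```

A lookup table relabels the whole raster in one indexing operation, and the original IDs stay available if higher-resolution imagery later allows the merged types to be separated again.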
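The tidal check described above can be applied as a simple filter when selecting scenes. This is a minimal sketch under stated assumptions: the tide height at acquisition time is looked up separately (for example, from a local tide table), and the scene identifiers, tide values, and 1.0 m threshold are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    scene_id: str
    tide_level_m: float  # tide height at acquisition time (assumed known)

def select_low_tide(scenes, max_tide_m):
    """Keep scenes acquired when the tide was at or below max_tide_m."""
    return [s for s in scenes if s.tide_level_m <= max_tide_m]

# Made-up scenes; identifiers and tide heights are not from the study.
scenes = [
    Scene("scene_a", 0.4),
    Scene("scene_b", 2.9),
    Scene("scene_c", 1.0),
]
low_tide = select_low_tide(scenes, max_tide_m=1.0)
```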

Conclusions
Based on the integration of a ResU-Net34 model with the U-Net models to classify wetland ecosystem types in the northeastern part of Vietnam, the individual research questions mentioned in the introduction section are answered as follows:
• What are the advantages of integrating deep learning and multi-temporal remote sensing images for monitoring wetland classification? The completed deep learning models can be used to interpret new satellite images in any coastal area and at any time, especially in hard-to-access areas among reefs and rocky marine shores. The use of deep learning models can help coastal managers monitor the dynamic wetland ecosystems annually, a task that ecologists have commonly carried out every five years.
• How do the ResU-Net34 models for coastal wetland classification improve on the benchmark methods? The geomorphological and land cover characteristics of nine wetland ecosystem types were recorded during the training of the ResU-Net models, with an accuracy of 83% and a loss function value of 1.4 when using the Adam optimizer. The best-trained ResU-Net model was used to successfully classify the wetland types in the Tien Yen estuary over four years. It can potentially be used to classify all Vietnamese coastal wetlands in the future.
• How are wetland types distributed in the northeastern part of Vietnam? Nine wetland types are distributed mainly in three regions: the Cai Lan bay, the Tien Yen estuary, and the coastal area of Mong Cai city. Due to the effect of rivers, the areas of estuary and shallow marine waters fluctuate significantly. The areas of aquaculture ponds and mangroves have narrowed, while the marine subtidal aquatic beds have expanded.