Swin Transformer and Deep Convolutional Neural Networks for Coastal Wetland Classification Using Sentinel-1, Sentinel-2, and LiDAR Data

Abstract: The use of machine learning algorithms to classify complex landscapes has been revolutionized by the introduction of deep learning techniques, particularly in remote sensing. Convolutional neural networks (CNNs) have shown great success in the classification of complex, high-dimensional remote sensing imagery, specifically in wetland classification. Meanwhile, transformers are the state-of-the-art architecture in natural language processing (NLP). Although transformers have been studied for a few remote sensing applications, the integration of deep CNNs and transformers has not been studied, particularly in wetland mapping. As such, in this study, we explore the potential, and the limitations to be overcome, of a multi-model deep learning network that integrates a modified version of the well-known deep CNN VGG-16, a 3D CNN, and the Swin transformer for complex coastal wetland classification. Moreover, we discuss the potential and limitations of the proposed multi-model technique relative to several solo models, including random forest (RF), support vector machine (SVM), VGG-16, 3D CNN, and the Swin transformer, in the pilot site of Saint John city, New Brunswick, Canada. In terms of F-1 score, the multi-model network obtained values of 0.87, 0.88, 0.89, 0.91, 0.93, 0.93, and 0.93 for the recognition of shrub wetland, fen, bog, aquatic bed, coastal marsh, forested wetland, and freshwater marsh, respectively. The results suggest that the multi-model network is superior to the other solo classifiers, with improvements in average accuracy ranging from 3.36% to 33.35%. The results achieved in this study suggest the high potential of integrating CNN networks with cutting-edge transformers for the classification of complex landscapes in remote sensing.


Introduction
Wetlands are regions flooded or saturated by water for at least a portion of the year, though this definition differs widely depending on the field of interest [1][2][3]. Wetlands are vital for biodiversity, ecological security, and humans, as they perform a variety of functions and provide various ecosystem services [4][5][6]. Wetland services include climate regulation, water filtration, flood and drought mitigation, shoreline erosion prevention, soil protection, and wildlife habitat, among others [7][8][9][10]. Wetlands have deteriorated and degraded considerably around the world in recent decades as a result of increased human activity and climatic changes [11][12][13]. Wetland degradation has resulted in significant ecological repercussions such as biodiversity loss, habitat fragmentation, floods, and droughts [14][15][16]. Around two-thirds of the world's wetlands have been lost or drastically altered since the turn of the last century [17,18], largely driven by land-use change due to settlement.

Moreover, the higher performance of deep learning approaches is also due to their ability to include feature extraction in the optimization process [54]. CNNs, deep learning models inspired by biological processes, are commonly used for remote sensing image classification and have achieved high accuracy in high-dimensional and complicated situations [55][56][57][58]. CNNs have historically dominated computer vision modeling, specifically image classification. After the introduction of AlexNet [59] and its groundbreaking performance on the ImageNet image classification task, CNN architectures have grown to become more powerful through increased size [60], more extended connections [61], and more advanced convolutions [62]. However, in natural language processing (NLP), transformers are now the most widely used architecture [63]. The transformer is renowned for its use of attention to model long-range patterns in data.
It was designed for sequence modeling and transduction activities. Its enormous success in the language domain has prompted researchers to investigate its application in computer vision. It has lately shown success on a number of tasks, including a few remote sensing image classifications [64][65][66][67].
Despite the promising results achieved by transformers in a few remote sensing studies, the capability of this cutting-edge method integrated with deep CNNs has not been investigated in wetland mapping. As such, this research aims to assess and illustrate the efficacy of the transformer integrated with the capabilities of deep CNNs in the classification of complex coastal wetlands. In particular, we develop a multi-model network with three branches: a well-known two-dimensional deep CNN (VGG-16) that uses the extracted features of the optical Sentinel-2 image, a three-dimensional CNN that utilizes the normalized backscattering coefficients of SAR Sentinel-1 imagery, and a Swin transformer that employs a digital elevation model (DEM) generated from LiDAR point clouds. To the best of our knowledge, the integration of transformers and CNNs has not been used and evaluated in remote sensing image classification, specifically for complex wetland classification. Figure 1 presents the flowchart of the proposed method. As seen, the Sentinel-1 and Sentinel-2 features were collected from the Google Earth Engine (GEE) platform, while the DEM was generated from the LiDAR data with the use of QGIS software and LAStools. Then, the Python programming language was employed to develop the proposed multi-model deep CNN classifier. The results of the proposed classifier were then compared with the Swin transformer, 3D CNN, VGG-16, RF, and SVM classifiers. Finally, the coastal wetland classification maps were produced in QGIS software.

The Proposed Multi-Model Deep Learning Classifier
The architecture of the proposed multi-model deep learning algorithm for the classification of coastal wetlands is presented in Figure 2. To efficiently integrate the capabilities of CNN networks with state-of-the-art transformers, the proposed multi-model deep learning network has three branches: a modified version of the VGG-16 CNN, a 3D CNN, and the Swin transformer. We experimentally used image patches of 4 × 4 for the Sentinel-2 features with 12 bands in the VGG-16 network, image patches of 8 × 8 for the Sentinel-1 features with four bands in the 3D CNN network, and image patches of 8 × 8 for the DEM generated from the LiDAR data with one band in the Swin transformer network. The reason behind using smaller image patch sizes is that larger image patches would significantly affect the linear objects of urban areas (i.e., they would lose their linear geometry). Moreover, urban regions would be over-classified by the utilization of large image patches.
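As a hedged illustration of the patch-based input preparation described above (the function name and sliding-window logic are our own, not the authors' code), the following sketch extracts fixed-size patches, with all bands, centered on labeled pixels of a multi-band raster:

```python
import numpy as np

def extract_patches(raster, centers, size):
    """Extract size x size patches (all bands) centered on the given
    (row, col) pixels; centers too close to the edge are skipped."""
    h, w, _ = raster.shape
    half = size // 2
    patches = []
    for r, c in centers:
        if half <= r <= h - half and half <= c <= w - half:
            patches.append(raster[r - half:r + half, c - half:c + half, :])
    return np.stack(patches)

# Toy example: a 20 x 20 Sentinel-2-like raster with 12 bands.
raster = np.random.rand(20, 20, 12)
patches = extract_patches(raster, [(5, 5), (10, 7)], size=4)
print(patches.shape)  # (2, 4, 4, 12)
```

The same routine with size=8 would produce the 8 × 8 patches used for the Sentinel-1 and DEM branches.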
The 3D CNN network has three convolutional layers (i.e., two 3D convolutional layers and one 2D convolutional layer). As seen in Figure 2, in the 3D CNN, we experimentally used 8 × 8 image patches of the four backscattering coefficients σ⁰VV, σ⁰VH, σ⁰HH, and σ⁰HV. The first two 3D convolutional layers have 64 filters (8 × 8 × 4 × 64). Then, we reshaped the 3D feature maps into 2D. The last layer is a 2D convolutional layer with 128 filters, followed by a max-pooling layer that reduces the image patches to 4 × 4. On the other hand, we experimentally used 4 × 4 image patches of 12 spectral bands and indices of the Sentinel-2 image in the VGG-16 network. There are 13 convolutional layers in the well-known CNN network of VGG-16, as presented in Figure 2. It is worth highlighting that to decrease the computation cost of the VGG-16 deep CNN network, we reduced the number of filters compared with the original VGG-16 architecture. In addition, as we used image patches of 4 × 4 as the input of the VGG-16 network, we used max-pooling layers with kernel sizes of 1 × 1. Then, the feature outputs of the 3D CNN and VGG-16 networks were concatenated to form image patches of 4 × 4 with 256 filters. Afterward, we used a flatten layer of size 512, followed by two dense layers with sizes of 100 and 50.
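The shape bookkeeping of the two CNN branches can be checked with a small sketch, under the assumption that the convolutions are 'same'-padded; the convolutions themselves are mocked as random tensors of the stated shapes, so only the reshape, pooling, and concatenation arithmetic is actually computed:

```python
import numpy as np

# 3D CNN branch: an 8 x 8 patch with 4 SAR bands; after two 'same'-padded
# 3D convolutions with 64 filters, the feature map is 8 x 8 x 4 x 64
# (mocked here by a random tensor).
x3d = np.random.rand(8, 8, 4, 64)

# Reshape the 3D feature map to 2D by merging the band and filter axes.
x2d = x3d.reshape(8, 8, 4 * 64)                            # (8, 8, 256)

# Mocked 2D convolution with 128 filters, then 2 x 2 max pooling.
conv2d = np.random.rand(8, 8, 128)
pooled = conv2d.reshape(4, 2, 4, 2, 128).max(axis=(1, 3))  # (4, 4, 128)

# Mocked VGG-16 branch output on the 4 x 4 Sentinel-2 patches.
vgg = np.random.rand(4, 4, 128)

# Concatenate the two branches along the filter axis.
fused = np.concatenate([pooled, vgg], axis=-1)
print(fused.shape)  # (4, 4, 256)
```

This matches the stated fusion of the two branches into 4 × 4 patches with 256 filters; the per-branch filter count of 128 is implied by that total.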

We used the DEM data with a patch size of 8 × 8 in the Swin transformer network. It is worth highlighting that the first two layers of the Swin transformer model (i.e., random crop and random flip) are data augmentation techniques.
Afterward, in the patch extract layer, image patches of 2 × 2 were extracted from the input images and transformed into linear features of size 4, resulting in an output feature of 16 × 4. Then, in the patch embedding layer, as we used an embedding dimension of 64, the output feature was of size 16 × 64. In the embedding layer, image patches are converted (i.e., translated) into vectors to be used in the transformers. Afterward, the output vectors are passed into the Swin transformers. Then, the output features of the Swin transformer are merged by a patch merging layer, resulting in an output feature of 4 × 128, followed by a 1D global average pooling layer of size 128. The final layer of the Swin transformer branch is a dense layer of size 50, which is concatenated with the outputs of the other two networks (3D CNN and VGG-16) into a feature of size 100. The final layer of the multi-model deep learning algorithm is a dense layer of size 11 with a softmax activation function. The Swin transformer is discussed in more detail in the next section.
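The tensor shapes of the Swin branch described above can be traced with a minimal NumPy sketch; the attention blocks are omitted (they preserve the token shape), and the projection weights are random stand-ins, so only the patch extraction, embedding, merging, and pooling arithmetic is computed:

```python
import numpy as np
rng = np.random.default_rng(0)

dem = rng.random((8, 8, 1))                       # one-band DEM patch

# Patch extraction: 2 x 2 windows -> 16 patches of 4 values each.
p = dem.reshape(4, 2, 4, 2, 1).transpose(0, 2, 1, 3, 4).reshape(16, 4)

# Patch embedding: linear projection to an embedding dimension of 64.
W_embed = rng.random((4, 64))
tokens = p @ W_embed                              # (16, 64)

# (Swin transformer blocks would act here, keeping the (16, 64) shape.)

# Patch merging: group 2 x 2 neighboring tokens on the 4 x 4 token grid
# (16 -> 4 tokens) and project the concatenated 4 * 64 = 256 features to 128.
g = tokens.reshape(4, 4, 64)
g = g.reshape(2, 2, 2, 2, 64).transpose(0, 2, 1, 3, 4).reshape(4, 256)
W_merge = rng.random((256, 128))
merged = g @ W_merge                              # (4, 128)

pooled = merged.mean(axis=0)                      # 1D global average pool
print(pooled.shape)  # (128,)
```

The shapes reproduce the 16 × 4, 16 × 64, 4 × 128, and 128-dimensional stages stated in the text.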

Study Area and Data Collection
The study area is located in Saint John city in the southcentral part of New Brunswick province, Canada (see Figure 3). Saint John city, which is on the Bay of Fundy, has an area of approximately 326 km² with a population of around 71,000. The city is divided by the south-flowing Saint John River, while the Kennebecasis River, which enters the Saint John River near Grand Bay, runs through the east side. At the confluence of the two rivers and the Bay of Fundy, Saint John harbor is a deep-water harbor that is ice-free all year. The city has a humid continental climate.

We used Sentinel-1, Sentinel-2, and LiDAR data to classify seven wetland classes: aquatic bed, bog, coastal marsh, fen, forested wetland, freshwater marsh, and shrub wetland. The wetland ground truth data was acquired from the 2021 wetland inventory of New Brunswick (http://www.snb.ca/geonb1/e/DC/catalogue-E.asp, accessed on 6 December 2021) (see Figure 3). The wetland inventory is collected and updated yearly by the New Brunswick Department of Environment and Local Government (ELG). Wetland maps are provided by the ELG to notify primary users of wetlands and potential regulatory obligations for land development. To avoid over-classification of wetlands in the pilot site, we manually extracted four additional non-wetland classes (water, urban, grass, and crop) through visual interpretation of very high-resolution Google Earth imagery. The number of training and test samples is presented in Table 1.
It is worth highlighting that we used a stratified random sampling technique to divide our ground truth data into 50% training and 50% test samples. The Sentinel-1 and Sentinel-2 image features, including normalized backscattering coefficients, spectral bands, and indices (see Table 2), were created in the Google Earth Engine code editor (https://code.earthengine.google.com/, accessed on 6 December 2021). For extracting the Sentinel-1 and Sentinel-2 features, median image values from 1 June to 1 September 2020 were created using the GEE code editor. It is worth highlighting that GEE provides Sentinel-2 level-2A data that is pre-processed by the sen2cor software [68]. Although the bands of the Sentinel-2 image have different spatial resolutions (10 m, 20 m, and 60 m), all bands are resampled to a 10 m spatial resolution in the GEE code editor. On the other hand, the GEE platform provides Sentinel-1 ground range detected (GRD) data, including σ⁰VV, σ⁰VH, σ⁰HH, and σ⁰HV, log-scaled at 10 m spatial resolution. The provided Sentinel-1 data are pre-processed by the Sentinel-1 toolbox, including thermal noise removal, radiometric calibration, and terrain correction. To improve the classification accuracy of wetlands in the pilot site of Saint John, we created a DEM from the LiDAR data using LAStools (https://rapidlasso.com/lastools/, accessed on 6 December 2021) in QGIS software. It should be noted that the DEM was resampled to 10 m to be stacked with the Sentinel-1 and Sentinel-2 image features. The LiDAR data had a point density of 6 points per square meter.

Table 2. The normalized backscattering coefficients, spectral bands, and indices used in this study (NDVI = normalized difference vegetation index, NDBI = normalized difference build-up index).
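The stratified 50/50 split mentioned above can be sketched as follows (a generic implementation; the function name and details are our own, not the authors' code):

```python
import numpy as np

def stratified_split(labels, frac=0.5, seed=0):
    """Split sample indices into train/test, drawing `frac` of each
    class independently (stratified random sampling)."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        k = int(round(frac * len(idx)))
        train.extend(idx[:k])
        test.extend(idx[k:])
    return np.array(train), np.array(test)

# Toy example with three imbalanced classes.
labels = np.array([0] * 10 + [1] * 6 + [2] * 4)
train, test = stratified_split(labels)
print(len(train), len(test))  # 10 10
```

Stratification keeps the class proportions identical in both halves, which matters here because the wetland classes are unevenly represented.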

Experimental Setting
In this study, we used the Adam optimizer with a learning rate of 0.0002 to train our proposed multi-model deep learning network as well as the other deep learning models, including the modified version of the VGG-16 network, the 3D CNN, and the Swin transformer. We set the maximum training iteration to 100 epochs with a batch size of 32. In the Swin transformer, we used a patch size of 2 × 2, a dropout rate of 0.03, 8 attention heads, an embedding dimension of 64, a multi-layer perceptron size of 256, and a shift size of 1. It is worth highlighting that for the implementation of the classifiers, we used an NVIDIA GeForce RTX 2070 graphical processing unit (GPU), an Intel i7-10750H central processing unit (CPU) at 2.60 GHz, and 16 GB of random access memory (RAM) running 64-bit Windows 11. All deep learning algorithms were developed with the Python TensorFlow library, while the RF and SVM classifiers were implemented using the sklearn Python library.
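The Adam update used to train all networks can be sketched in a few lines; this is the generic textbook Adam step with the paper's learning rate of 0.0002, while the β values and the toy objective are our own illustrative choices:

```python
import numpy as np

def adam_step(w, grad, state, lr=2e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; `state` holds the moment estimates (m, v) and step t."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Minimise f(w) = w^2 from w = 1 with the paper's learning rate.
w, state = 1.0, (0.0, 0.0, 0)
for _ in range(1000):
    w, state = adam_step(w, 2 * w, state)
print(abs(w) < 1.0)  # True (w has moved toward the minimum)
```

Note how the effective step size stays close to the learning rate regardless of gradient magnitude, one reason a small fixed rate such as 0.0002 works across all three branches.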

Evaluation Metrics
To assess the quantitative performance of the developed models, coastal wetland classification results were evaluated in terms of precision, recall, F-1 score, average accuracy, overall accuracy, and kappa index (Equations (1)-(6)):

Precision = True Positive / (True Positive + False Positive) (1)

Recall = True Positive / (True Positive + False Negative) (2)

F-1 score = (2 × Precision × Recall) / (Precision + Recall) (3)

Average Accuracy = (1/C) Σ_i (x_ii / x_i+) (4)

Overall Accuracy = ((True Positive + True Negative) / Total number of pixels) × 100 (5)

Kappa = (N Σ_i x_ii − Σ_i x_i+ x_+i) / (N² − Σ_i x_i+ x_+i) (6)

where x_i+ and x_+i are the marginal totals of row i and column i of the confusion matrix, respectively, C is the number of classes, the total number of observations is shown by N, and x_ii is the observation in row i and column i.
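As a sketch, the listed metrics can all be computed from a confusion matrix; this is a generic implementation of the standard definitions, not the authors' code:

```python
import numpy as np

def metrics(cm):
    """Per-class precision/recall/F1 plus OA, AA, and kappa from a
    square confusion matrix (rows = reference, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    diag = np.diag(cm)
    precision = diag / cm.sum(axis=0)
    recall = diag / cm.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    oa = diag.sum() / N                      # overall accuracy (fraction)
    aa = recall.mean()                       # mean per-class accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / N ** 2
    kappa = (oa - pe) / (1 - pe)             # chance-corrected agreement
    return precision, recall, f1, oa, aa, kappa

# Toy two-class confusion matrix.
cm = np.array([[50, 10],
               [ 5, 35]])
p, r, f1, oa, aa, k = metrics(cm)
print(round(oa, 2))  # 0.85
```

Note that kappa penalizes agreement expected by chance, which is why it is lower than overall accuracy for the same matrix.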

Comparison with Other Classifiers
For the evaluation of the efficiency of the proposed multi-model deep learning algorithm, the coastal wetland classification results were compared with several algorithms, including:

Swin Transformer-Differences between the two domains of language and vision, including substantial variations in the scale of visual entities and the high resolution of pixels in images compared with words in texts, pose challenges in adapting transformer models from language to vision. As such, the Swin transformer introduced a hierarchical transformer whose representation is computed with shifted windows to address these issues [69] (see Figure 4). The shifted windowing technique improves efficiency by limiting self-attention computation to non-overlapping local windows while allowing for cross-window connectivity. This hierarchical architecture can predict at multiple scales and has linear computational complexity with respect to image size. These characteristics make the Swin transformer suitable for a wide range of vision tasks, such as image classification.
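The windowing idea can be illustrated with a small NumPy sketch (the window size, shift, and grid size below are illustrative, not the values used in this study): regular partitioning yields the W-MSA windows, and a cyclic shift of the token grid before partitioning yields the SW-MSA windows.

```python
import numpy as np

x = np.arange(64).reshape(8, 8)           # an 8 x 8 grid of token ids
win = 4                                    # window size
shift = 2                                  # shift size

def partition(t, win):
    """Split a square token grid into non-overlapping win x win windows."""
    n = t.shape[0] // win
    return t.reshape(n, win, n, win).transpose(0, 2, 1, 3).reshape(-1, win, win)

regular = partition(x, win)                                           # W-MSA
shifted = partition(np.roll(x, (-shift, -shift), axis=(0, 1)), win)   # SW-MSA

print(regular.shape, shifted.shape)  # (4, 4, 4) (4, 4, 4)
```

After the cyclic shift, each window mixes tokens from what were previously separate windows, which is how the shifted configuration introduces cross-window connections at no extra window count.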
The shifted window partitioning strategy is successful in image classification, object detection, and semantic segmentation as it introduces links between neighboring non-overlapping windows in the previous layer (see Figure 5). Swin transformer blocks are computed consecutively using the shifting window partitioning method (Equations (7)-(10)):

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1) (7)

z^l = MLP(LN(ẑ^l)) + ẑ^l (8)

ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l (9)

z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1) (10)

where ẑ^l and z^l present the outputs of the (S)W-MSA and MLP modules of block l, respectively, and LN denotes layer normalization.
SW-MSA and W-MSA are multi-head self-attention modules with shifted windowing and regular settings, respectively (for more information, refer to Liu et al. [69]).

Random Forest-In remote sensing image classification, RF [70] is an extensively used ensemble learning method that has shown great success in high-dimensional and complex problems [7,[71][72][73]. Moreover, RF is also an effective feature selection technique, as it reveals the importance of each band of earth observation images. As such, RF is regarded as one of the most used methods for accurate image classification. It is worth highlighting that the use of highly important features does not guarantee the best combination of features for a specific problem. For example, high-dimensional data typically have a high number of correlated variables, which negatively impacts the feature selection process [74]. Within complex heterogeneous landscapes with low inter-class discrimination and high intra-class variability, supervised classification of remote sensing data using machine learning techniques such as RF has the power to tackle the drawbacks of using a single index or simple linear regression models [75,76].

Support Vector Machine-The SVM [40] is non-parametric, unlike conventional statistics-based parametric classification techniques. The distribution of the data set has no influence on the SVM. This is one of the benefits of SVMs over statistical methods such as maximum likelihood, which require the data distribution to be known in advance. SVMs, in particular, use the training data to generate an optimal hyperplane (a line in the simplest scenario) for dividing the dataset into a discrete number of predetermined classes [77]. It is worth mentioning that the SVM's accuracy is mostly determined by the variants and parameters used. It should be noted that while the SVM is among the most utilized non-parametric machine learning algorithms, its performance decreases with a large amount of training data [78].
Convolutional Neural Network-Deep learning algorithms based on CNNs have become a prominent remote sensing image classification topic in the last decade [55,56,58,79]. The input nodes for the classifier in CNN classification include a single pixel and local sets of adjacent pixels. The CNN learning process includes determining appropriate convolutional operations and weights for different kernels or moving windows, which enables the network to model useful spatial contextual information at various spatial scales [80,81]. Different filters and convolutional layers extract spatial, spectral, edge, and textural information, enabling a high degree of data generalization in CNNs. Because CNNs employ local connectivity and the weight-sharing principle, they have shown great robustness and effectiveness in the spatial feature extraction of remote sensing images compared with other machine learning techniques such as RF. The convolution, pooling, and fully connected layers comprise a CNN architecture. The convolution layer contains two key elements: kernels (i.e., filters) and biases. The filters are designed to extract certain information from the image of interest. By decreasing the resolution of the feature map, the pooling layer provides translation invariance. Finally, the fully connected layer uses the extracted information from all the previous layer's feature maps to create a classification map [82]. Based on the weights (W) and biases (B) of the previous layers, in each layer (l) of a CNN, low-, intermediate-, and high-level features are extracted and updated at each iteration (Equations (11) and (12)):

ΔW_(t+1) = m ΔW_t − (x/n) (∂C/∂W) − x λ W_t (11)

W_(t+1) = W_t + ΔW_(t+1) (12)

where n, x, and λ denote the total number of training samples, the learning rate, and a regularization parameter, respectively. Moreover, m, t, and C are the momentum, the updating step, and the cost function, respectively; the biases are updated analogously. According to the dataset of interest, the regularization parameter (λ), the learning rate (x), and the momentum (m) are fine-tuned to obtain an optimal result.
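A minimal numerical sketch of a momentum update consistent with the symbols defined in the text (x = learning rate, λ = regularization parameter, m = momentum); the exact update form and the hyperparameter values here are illustrative choices of ours:

```python
import numpy as np

def sgd_momentum_step(w, dw_prev, grad, x=0.01, lam=1e-4, m=0.9):
    """One momentum update with weight decay: the new step combines the
    previous step (scaled by m), the decay term, and the gradient term."""
    dw = m * dw_prev - x * lam * w - x * grad
    return w + dw, dw

# Minimise f(w) = (w - 3)^2; the gradient is 2 * (w - 3).
w, dw = 0.0, 0.0
for _ in range(500):
    w, dw = sgd_momentum_step(w, dw, 2 * (w - 3))
print(round(w, 2))  # 3.0
```

The momentum term damps oscillations and accelerates progress along consistent gradient directions, which is why it appears in classic CNN training recipes.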
VGG-16 [83]-The University of Oxford's Visual Geometry Group created this 16-layer network with approximately 138 million parameters, trained and evaluated on the ImageNet dataset. The original VGG-16 architecture consists of 3 × 3 kernel-sized filters that enable the network to learn more complex features by increasing the network's depth [84]. It is worth highlighting that to reduce the complexity of the original VGG-16 model and its computation cost, we experimentally replaced some of the 3 × 3 kernel-sized filters with 1 × 1 kernels while reducing the number of filters.
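The saving from swapping 3 × 3 kernels for 1 × 1 kernels is easy to quantify; the channel sizes below are illustrative, not the authors' exact configuration:

```python
def conv_params(k, c_in, c_out):
    """Weights + biases of a single k x k convolution layer."""
    return k * k * c_in * c_out + c_out

# Illustrative layer mapping 128 input channels to 128 output channels.
p3 = conv_params(3, 128, 128)   # 3 x 3 kernels
p1 = conv_params(1, 128, 128)   # 1 x 1 kernels
print(p3, p1, round(p3 / p1, 1))  # 147584 16512 8.9
```

A 1 × 1 layer costs roughly a ninth of the parameters (and multiply-adds) of a 3 × 3 layer with the same channel counts, at the price of no spatial context, which is acceptable here given the very small 4 × 4 input patches.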

Comparison Results on the Saint John Pilot Site
Comparison results of the complex coastal wetland classification using the developed models are shown in Table 3. It is worth highlighting that we used image patches of 8 × 8 in the solo 3D CNN, Swin transformer, and VGG-16 networks. The proposed multi-model deep learning network achieved the best results compared with the other classifiers, including the Swin transformer, VGG-16, 3D CNN, RF, and SVM, in terms of average accuracy, overall accuracy, and kappa index, with values of 92.68%, 92.30%, and 90.65%, respectively. In terms of F-1 score, the multi-model network obtained values of 0.87, 0.88, 0.89, 0.91, 0.93, 0.93, and 0.93 for the recognition of shrub wetland, fen, bog, aquatic bed, coastal marsh, forested wetland, and freshwater marsh, respectively. The multi-model network outperformed the solo 3D CNN, Swin transformer, and VGG-16 in terms of average accuracy by 8.92%, 13.93%, and 17.31%, respectively. Based on the results, in terms of average accuracy, the Swin transformer (78.75%) achieved slightly better results than the well-known deep CNN network of VGG-16 (75.37%). On the other hand, the RF classifier performed best among the solo classifiers, outperforming the 3D CNN, Swin transformer, VGG-16, and SVM in terms of average accuracy by 5.56%, 10.57%, 13.95%, and 29.99%, respectively. The results revealed the higher capability of the RF classifier over the SVM algorithm in dealing with a noisy, complex, and high-dimensional remote sensing environment. It is worth highlighting that the performance of the SVM classifier highly depends on the predefined parameters and kernels. As such, in this study, we examined different kernel types, including linear, radial basis function (RBF), polynomial, and sigmoid. The SVM with a polynomial kernel achieved the best results, as shown in Table 3.

Table 3. Results of the proposed multi-model compared with other classifiers in terms of average accuracy, precision, F1-score, and recall (AB = aquatic bed, BO = bog, CM = coastal marsh, FE = fen, FM = freshwater marsh, FW = forested wetland, SB = shrub wetland, W = water, U = urban, G = grass, C = crop, AA = average accuracy, OA = overall accuracy, and K = kappa).

Overall, the better results obtained by the multi-model algorithm over the solo classifiers showed the superiority of using a model consisting of several different networks over a single deep learning network. Each network of the multi-model algorithm extracted different useful information from the multi-source data (i.e., Sentinel-1, Sentinel-2, and DEM), resulting in significantly improved coastal wetland classification over a single-network deep learning model (see Table 3).

Confusion Matrices
The highest confusion between wetlands was obtained by the SVM classifier. The SVM algorithm had issues with the correct classification of coastal marsh, bog, and shrub wetlands (refer to Table S1 of Supplementary Materials). As discussed in the previous sections, the inherent complexity of the Saint John environment, the speckle noise of the Sentinel-1 data, and the amount of training data can be regarded as the reasons behind the underperformance of the SVM classifier relative to the other implemented methods. On the other hand, the proposed multi-model deep learning algorithm achieved the least confusion between the wetlands. Moreover, for the identification of the aquatic bed wetlands, the 3D CNN, VGG-16, and RF classifiers showed over-classification, as shown in Figure 6. It should be noted that the systematic stripes in Figure 6f,g, which are wetlands classified by the RF and SVM algorithms, are due to the nearest-neighbor resampling technique used for mosaicking several patches of the high-resolution LiDAR DEM data. It is possible to remove such noise using image smoothing techniques, such as a 3 × 3 mean filter, but we did not do so in order to investigate the effects of such noise on the performance of patch-based (i.e., CNN) and pixel-based (i.e., traditional) classifiers. From Figure 6, it is clear that patch-based classifiers, which consider both spatial and spectral information, are not affected by such noise, in contrast to pixel-based techniques that only consider spectral information.
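For illustration, a 3 × 3 mean filter of the kind mentioned above can be sketched in plain NumPy; the edge handling and implementation details are our own choices:

```python
import numpy as np

def mean_filter3(img):
    """3 x 3 mean filter via edge-replicated padding and summed shifts."""
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    return sum(p[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0

# A single bright "stripe" pixel is spread over its 3 x 3 neighborhood.
img = np.zeros((5, 5))
img[2, 2] = 9.0
print(mean_filter3(img)[2, 2])  # 1.0
```

The example shows why smoothing suppresses single-pixel stripe noise for a pixel-based classifier; a patch-based classifier effectively averages over its spatial context anyway, consistent with its observed robustness.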

Ablation Study
To better understand the contribution of each network of the proposed multi-model algorithm, including the VGG-16 that uses the extracted features of Sentinel-2 data, the 3D CNN that employs the backscattering coefficients of the Sentinel-1 image, and the Swin transformer that utilizes the DEM generated from the LiDAR data, we performed an ablation study. As seen in Table 4, the VGG-16 using the Sentinel-2 features reached an average accuracy and overall accuracy of 73.93% and 79.75%, respectively. By adding the 3D CNN network and using Sentinel-1 features, the average accuracy and overall accuracy significantly improved by 17.21% and 9.62%, respectively. In addition, by adding the Swin transformer network and the DEM data, the best visual and quantitative results were achieved. The inclusion of the Swin transformer improved the average and overall accuracies by 1.54% and 2.93%, respectively. Moreover, the inclusion of the DEM by the Swin transformer significantly improved the visual result of the proposed deep learning classifier. For instance, over-classification of the coastal wetland class and under-classification of the water class were visually improved in the multi-model algorithm, as seen in Figure 7. The inclusion of each network considerably decreased the confusion between wetland and non-wetland classes. For instance, the high confusion between bog, coastal marsh, and shrub wetlands resulting from the VGG-16 network was significantly decreased by adding the 3D CNN network and Sentinel-1 features (refer to Table S2 of Supplementary Materials).

Effect of Different Data Sources on Wetland Classification Accuracy
To better understand how different data sources, including backscattering features from Sentinel-1, spectral features of Sentinel-2, and the DEM generated from LiDAR data, contribute to coastal wetland classification, we conducted several experiments with an RF classifier, as seen in Table 5. Although the inclusion of SAR data did not considerably improve the average accuracy of wetland classification, the F-1 scores of the shrub wetland, aquatic bed, freshwater marsh, and coastal marsh classes improved by 1%, 2%, 3%, and 5%, respectively. On the other hand, the use of DEM data improved the average accuracy, overall accuracy, and kappa by 5.48%, 3.82%, and 4.69%, respectively. With the inclusion of DEM data, the F-1 scores of forested wetland, aquatic bed, fen, shrub wetland, freshwater marsh, coastal marsh, and bog improved by 2%, 3%, 6%, 6%, 8%, 8%, and 13%, respectively.

Table 5. Results of the RF classifier with different data sources in terms of average accuracy, precision, F1-score, and recall (AB = aquatic bed, BO = bog, CM = coastal marsh, FE = fen, FM = freshwater marsh, FW = forested wetland, SB = shrub wetland, W = water, U = urban, G = grass, C = crop, AA = average accuracy, OA = overall accuracy, K = kappa, S1 = Sentinel-1, S2 = Sentinel-2, DEM = DEM from LiDAR data).

Moreover, the variable importance was measured to better understand the significance and contribution of the different extracted Sentinel-1 and Sentinel-2 features. We ran the RF classifier [52] 30 times for the spectral analysis, as shown in Figure 8.
As expected, the Sentinel-2 bands and indices were more effective for classifying coastal wetlands than the Sentinel-1 backscattering features. Based on the Gini index [85], which the RF classifier typically uses as its attribute selection measure, the most influential variable for predicting the test data was the first vegetation red edge band (i.e., B5). In contrast, the least effective variable was the second vegetation red edge band (i.e., B6). Moreover, the most effective Sentinel-1 feature was the σ⁰HH band. The reason is that σ⁰HH is sensitive to the double-bounce scattering of flooded vegetation, making it a suitable Sentinel-1 feature for recognizing coastal wetlands. In addition, compared with σ⁰VV, σ⁰HH is less affected by water surface roughness, which is useful for separating water bodies from non-water regions.
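The Gini index that drives the RF attribute selection can be illustrated with a short sketch: a split is scored by how much it decreases class impurity, and a feature's importance accumulates these decreases across all splits made on it. The function names below are illustrative, not taken from the study's implementation:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_i p_i^2.
    Zero for a pure node; maximal when classes are evenly mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def gini_gain(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left`
    and `right`; RF ranks features by summing these decreases over
    every split that uses the feature."""
    n = len(parent)
    return gini_impurity(parent) \
        - (len(left) / n) * gini_impurity(left) \
        - (len(right) / n) * gini_impurity(right)
```

A split that perfectly separates two classes yields the maximum gain of 0.5, while a split that leaves both children as mixed as the parent yields a gain of zero.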


Effect of Different Spatial Resolutions on Wetland Classification Accuracy
Effects of spatial resolution on coastal wetland classification accuracy were investigated by comparing the classification results of the proposed multi-model for 10 m and 30 m spatial resolution data sources, as seen in Table 6. Based on the results, the proposed multi-model classifier performed better with the higher 10 m spatial resolution data than with the lower 30 m spatial resolution data by 7.71%, 7.68%, and 9.26% in terms of average accuracy, overall accuracy, and kappa index, respectively. This can be explained by the greater detail and richer information extracted from high-resolution images, specifically when dealing with a complex landscape such as wetlands with a high level of inter-class similarity.
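The three accuracy measures reported throughout (average accuracy, overall accuracy, and the kappa index) all derive from the classification confusion matrix. A minimal sketch of their standard definitions, with rows taken as reference labels and columns as predictions (the helper name is illustrative):

```python
import numpy as np

def accuracy_metrics(cm):
    """Average accuracy (mean per-class recall), overall accuracy,
    and Cohen's kappa from a square confusion matrix where rows are
    reference classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                       # overall accuracy
    aa = float(np.mean(np.diag(cm) / cm.sum(axis=1)))  # average accuracy
    # Chance agreement from the marginal distributions
    pe = float(np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2)
    kappa = (oa - pe) / (1.0 - pe)
    return aa, oa, kappa
```

Kappa discounts agreement expected by chance, which is why it is typically lower than overall accuracy for imbalanced class distributions.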

Computation Cost
In analyzing the computation cost in terms of time, the RF classifier, with a 2 min training time, performed best among the implemented classifiers; the SVM, VGG-16, 3D CNN, Swin transformer, and the proposed multi-model network required training times of 20, 40, 50, 60, and 90 min, respectively. Table 6. Results of the proposed multi-model for different spatial resolutions in terms of average accuracy, precision, F1-score, and recall (AB = aquatic bed, BO = bog, CM = coastal marsh, FE = fen, FM = freshwater marsh, FW = forested wetland, SB = shrub wetland, W = water, U = urban, G = grass, C = crop, AA = average accuracy, OA = overall accuracy, K = kappa, 10 = 10 m spatial resolution, and 30 = 30 m spatial resolution).
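Training-time comparisons such as the one above can be reproduced with a simple wall-clock wrapper around each model's training call; the helper below is a generic sketch (the function name and reporting format are illustrative, not from the study's code):

```python
import time

def time_fit(name, fit_fn, *args, **kwargs):
    """Run a training callable, report its wall-clock cost in minutes,
    and return both the training result and the elapsed time."""
    t0 = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    minutes = (time.perf_counter() - t0) / 60.0
    print(f"{name}: {minutes:.2f} min")
    return result, minutes
```

Wrapping each classifier's fit call this way yields directly comparable per-model training times, as long as all models are trained on the same hardware.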

Discussion
Various research teams have conducted substantial research to improve wetland mapping in Canada by utilizing various data sources and approaches [35,36,48,86]. For example, Jamali et al. [87] used Sentinel-1 and Sentinel-2 data to classify five wetland classes in Newfoundland, Canada (bog, fen, marsh, swamp, and shallow water) with a high average accuracy of 92.30% using very deep CNN networks and a generative adversarial network (GAN). According to their research, creating synthetic samples of Sentinel-1 and Sentinel-2 data considerably improved the classification accuracy of wetland mapping. Moreover, the synergistic use of multiple satellite data sources has been shown to outperform single-sensor data in various studies [7,8,87]. The results achieved in this study confirm the superiority of combining different satellite data over single-source Earth observation data for improving the classification of the complex landscape of coastal wetlands. For instance, the F-1 scores of forested wetland, aquatic bed, fen, shrub, freshwater marsh, coastal marsh, and bog improved by 2%, 3%, 6%, 6%, 8%, 8%, and 13%, respectively, with the utilization of DEM data in this research. Moreover, owing to the distinguishable backscattering coefficients of the Sentinel-1 features over wetlands and the high capability of the developed 3D CNN network to extract useful information from training image patches, the accuracy of the proposed network improved considerably.
On the other hand, most wetland mapping approaches in New Brunswick, as described by LaRocque et al. [88], depend on manual interpretation of high-resolution data, and there is little literature on the use of cutting-edge deep learning techniques in the Saint John city study area. LaRocque et al. [88] reported that, using Landsat 8 OLI, ALOS-1 PALSAR, Sentinel-1, and LiDAR-derived topographic metrics, the RF classifier achieved an overall accuracy of 97.67% in New Brunswick; however, because their classifiers, number of training samples, and satellite data differ from ours, their results cannot be precisely compared with those obtained in this study. Based on the results, the proposed multi-model classifier, with an average accuracy, overall accuracy, and kappa index of 92.68%, 92.30%, and 90.65%, respectively, is highly capable of precise coastal wetland classification. As the DEM in this study was generated from LiDAR data, it would be difficult to create such precise height data for large-scale wetland mapping. This is the main limitation of our study; however, several studies have shown the possibility of creating accurate and high-resolution DEMs from Sentinel-1 data using the SAR interferometry technique [89,90].
As discussed in the previous sections, transformers have achieved great success in solving NLP problems. Although they have shown high potential for several computer vision tasks, a challenge in classifying complex remote sensing landscapes, compared with typical computer vision image classification, is the much higher resolution of remote sensing satellite images. Considering that standard vision transformers have a complexity of O(n²) in the number of image tokens, we selected the Swin transformer, which has a much lower linear complexity of O(n) as image resolution increases. In other words, the Swin transformer is much more computationally efficient than other vision transformers. Another benefit of transformer networks is their higher generalization capability compared with CNN networks. An additional advantage of transformers over CNNs is that the relationships between different features of an image are also considered (i.e., through positional encoding and the attention mechanism). However, transformers require much more training data than CNN models to reach their full image classification capability, which is a challenge in the remote sensing field. As the positional relationship between a pixel and its neighboring features is important in a DEM layer, we used a Swin transformer, which considers this positional relationship, to extract useful information. This led to increases of 1.54% and 2.93% in the average and overall accuracies of the proposed multi-model algorithm. Our results show that integrating CNNs with transformers opens a new window for advancing new technologies and methods for complex scene classification in remote sensing.
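The complexity argument above can be made concrete by counting pairwise attention comparisons for a grid of image tokens: global self-attention compares every token with every other token, whereas Swin-style window attention restricts comparisons to fixed M × M windows. A small sketch (the function is illustrative, counting comparisons rather than exact FLOPs):

```python
def attention_cost(h, w, window=None):
    """Pairwise token comparisons for an h x w token grid.
    Global self-attention: every token attends to all n = h*w tokens,
    costing n * n comparisons (quadratic in n). Window attention
    restricts each token to an M x M window, costing n * M*M
    comparisons (linear in n for fixed M)."""
    n = h * w
    if window is None:
        return n * n          # global: O(n^2)
    return n * window * window  # windowed: O(n)
```

Doubling the image side length quadruples the token count n, so the windowed cost grows 4-fold while the global cost grows 16-fold, which is why window attention scales far better to high-resolution remote sensing imagery.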

Conclusions
Innovative methodologies and technologies for wetland mapping and monitoring are critical because of the significant benefits that wetland functions provide to humans and wildlife. Wetlands are among the most difficult ecosystems to classify because of their dynamic and complicated structure, with no clear-cut boundaries and with similar vegetation forms. As such, for the preservation and monitoring of coastal wetlands in the pilot site of Saint John city, situated in New Brunswick, Canada, we explored the potential of integrating a state-of-the-art transformer (i.e., the Swin transformer) with a modified version of the VGG-16 CNN network and a 3D CNN model. We used different data sources for each network, including the spectral bands and indices of Sentinel-2 in the VGG-16 network, the backscattering coefficients of Sentinel-1 in the 3D CNN, and a DEM generated from LiDAR data in the Swin transformer network, and compared the achieved results with several solo CNN models as well as the two conventional shallow classifiers of RF and SVM. The results suggest that, in terms of average accuracy, the multi-model network significantly improved the classification of coastal wetlands over the other solo algorithms, including the RF, SVM, Swin transformer, VGG-16, and 3D CNN, by 3.36% to 33.35%. Moreover, the utilization of multi-source data significantly increased the classification accuracy of Saint John city's complex landscape. For instance, the inclusion of the extracted features of Sentinel-1 and the DEM increased the F-1 scores of the VGG-16 CNN network for the classification of shrub wetland, aquatic bed, fen, freshwater marsh, and bog by 1%, 3%, 3%, 8%, and 11%, respectively.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs14020359/s1, Table S1: Confusion matrices of the implemented models including the proposed multi-model deep learning model, Swin Transformer, 3D CNN, VGG-16, Random Forest, and Support Vector Machine; Table S2