Comparing Solo Versus Ensemble Convolutional Neural Networks for Wetland Classiﬁcation Using Multi-Spectral Satellite Imagery

Abstract: Wetlands are important ecosystems that are linked to climate change mitigation. As 25% of global wetlands are located in Canada, accurate and up-to-date wetland classification is of high importance, both nationally and internationally. The advent of deep learning techniques has revolutionized the use of machine learning algorithms to classify complex environments, specifically in remote sensing. In this paper, we explore the potential and limitations of ensemble deep learning techniques for complex wetland classification and discuss the potential and limitations of various solo convolutional neural networks (CNNs), including DenseNet, GoogLeNet, ShuffleNet, MobileNet, Xception, Inception-ResNet, ResNet18, and ResNet101, in three different study areas located in Newfoundland and Labrador, Canada (i.e., Avalon, Gros Morne, and Grand Falls). Moreover, to improve the classification accuracies of the wetland classes of bog, fen, marsh, swamp, and shallow water, the results of the three best CNNs in each study area are fused using three supervised classifiers, namely random forest (RF), bagged tree (BTree), and Bayesian optimized tree (BOT), and one unsupervised majority voting classifier. The results suggest that the ensemble models, in particular BTree, have a valuable role to play in the classification of the wetland classes of bog, fen, marsh, swamp, and shallow water. The ensemble CNNs show an improvement of 9.63–19.04% in mean producer's accuracy over the solo CNNs in recognizing wetland classes across the three study areas. This research indicates a promising potential for integrating ensemble-based learning and deep learning for operational large-area land cover mapping, particularly complex wetland type classification.


Introduction
Wetlands cover 3% to 8% of the Earth's land surface and are amongst the most valuable ecosystems in the world [1]. Wetlands make invaluable contributions to the maintenance and quality of life for nature and humanity. Since the plants, bacteria, and animals in wetlands filter the water, trapping nutrients such as phosphorus, one of the main causes of harmful algal blooms in water bodies, wetlands are often referred to as the kidneys of the earth [2,3]. Carbon sequestration, food security, water storage, as well as flood and shoreline protection are only some of the services provided by wetlands [4,5]. They also provide critical habitat that supports plant and animal biodiversity [6,7]. This study explores the ability of ensemble CNNs for wetland classification using multi-spectral satellite imagery and thereby contributes to supporting the use of state-of-the-art deep learning models for wetland mapping with high-resolution remote sensing data.


The Study Area and Training Data
In this study, three different study areas are used, located in and around the region of Avalon, the town of Grand Falls-Windsor, and Gros Morne National Park, on the island of Newfoundland in Canada, as presented in Figure 1. Within the study areas, the dominant land cover is highly productive coniferous forests and vast peatlands [32]. Essential wetland habitat for waterfowl nesting and raising of young, along with other natural ecosystems, is found within these study areas. All wetland classes (including bog, fen, marsh, swamp, and shallow water) are located within the study areas' borders. The most dominant wetland classes are bog and fen, broadly referred to as peatlands. Ground-truth data were collected by a team of ecologists and wetland specialists in the summers of 2015, 2016, and 2017. Before field visitation, potential wetland sites were identified based on the visual interpretation of RapidEye and Google Earth imagery. Then, sites were visited in the field, where wetlands were classified as bog, fen, swamp, marsh, or shallow water based on the Canadian Wetland Classification System (CWCS), a wetland classification standard for the country. Dominant vegetation groups, the presence of certain plant species, hydrology, and landscape position were considered when assigning a class to a wetland. Global positioning system (GPS) points, along with notes and photos, were taken in the field to guide the delineation of polygons representing the wetlands visited. Refer to Figure 2 for examples of the delineated polygons. To improve the accuracy of delineation, multi-season and multi-year Google Earth imagery is used as ancillary data. See Table 1 for the number of training and test data (i.e., pixels). In this study, five bands of blue (440-510 nm), green (520-590 nm), red (630-685 nm), red edge (690-730 nm), and near-infrared (760-850 nm) of RapidEye imagery are used. In particular, two level 3A RapidEye images with a spatial resolution of five meters, collected on 18 June and 22 October 2015, were used for wetland mapping. To improve the wetland classification accuracy, three spectral indices, namely the red edge normalized difference vegetation index (RENDVI), the ratio vegetation index (RVI), and the green NDVI (GNDVI), are also utilized (Table 2). It is worth noting that, for the evaluation of the CNN model results, we used a pixel-based comparison of the ground truth and predicted classes in each study area of the Avalon, Grand Falls, and Gros Morne.
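As an illustration of how such indices are derived from the five RapidEye bands, the following Python sketch computes the three indices using their commonly published forms; the exact formulations used in this study are those given in Table 2, and the function and variable names here are illustrative only.

```python
import numpy as np

def spectral_indices(green, red, red_edge, nir):
    """Compute RENDVI, RVI, and GNDVI from RapidEye band arrays (reflectance as floats).

    These are the commonly published forms of the indices; Table 2 gives the
    formulations actually used in the study.
    """
    eps = 1e-10  # guard against division by zero over water/shadow pixels
    rendvi = (nir - red_edge) / (nir + red_edge + eps)  # red edge NDVI
    rvi = nir / (red + eps)                             # ratio vegetation index
    gndvi = (nir - green) / (nir + green + eps)         # green NDVI
    return np.stack([rendvi, rvi, gndvi])
```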
In this study, different polygons were selected for the training and test data to avoid autocorrelation between the datasets. Reference polygons, sorted by size, were alternately assigned to the testing and training datasets for each class. This was done to ensure that the training and test data had a comparable number of pixels for each class. Due to the limited amount of data and the wide variation of polygon size within each wetland class (some large, some small), random assignment of polygons to training and test groups could result in groups with highly uneven pixel numbers. This method may result in lower accuracy; however, compared to random sampling, the confidence level of the achieved results will be higher.
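A minimal sketch of this size-sorted, alternating assignment is given below; the data structure (a mapping from class name to polygons carrying a pixel count) is an assumption made for illustration, not the study's actual implementation.

```python
def split_polygons(polygons_by_class):
    """Alternately assign size-sorted reference polygons to training and test sets.

    polygons_by_class: dict mapping a class name (e.g., "bog") to a list of
    polygons, each carrying a "pixel_count" field (hypothetical structure).
    Sorting by size before alternating keeps the pixel counts of the two sets
    comparable for every class.
    """
    train, test = [], []
    for class_name, polygons in polygons_by_class.items():
        ordered = sorted(polygons, key=lambda p: p["pixel_count"], reverse=True)
        for i, polygon in enumerate(ordered):
            (train if i % 2 == 0 else test).append((class_name, polygon))
    return train, test
```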

Methods
The flowchart of the proposed ensemble modeling for complex wetland classification is shown in Figure 3. As seen, the proposed framework can be summarized in four steps: (1) evaluate the performance of each solo CNN model for wetland classification using multi-spectral RapidEye satellite data, (2) select the best three CNN models based on accuracy assessment indices, (3) apply ensemble modeling using two different strategies, namely majority voting and supervised machine learning models (i.e., RF, BTree, and BOT), and (4) evaluate the results of the solo versus ensemble CNN models for wetland classification. In this section, the processing steps are explained in more detail.

Convolutional Neural Networks (CNNs)
CNNs are the most popular deep learning techniques and have recently attracted a substantial amount of attention in the remote sensing community. These supervised nonlinear models can automatically extract important features without any human supervision. Specifically, CNNs are multi-layer interconnected neural networks that hierarchically extract powerful low-, intermediate-, and high-level features. In each layer (l), these features are extracted based on the weights (W) and biases (B) of the previous layers, which are updated in each iteration as follows (Equations (1) and (2)):

V_{t+1} = m V_t − x λ W_t − (x / n) ∂C/∂W_t (1)

W_{t+1} = W_t + V_{t+1} (2)

where λ, x, and n denote the regularization parameter, learning rate, and total number of training samples, and m, t, and C are the momentum, updating step, and cost function, respectively. Depending on the dataset, the regularization parameter (λ), learning rate (x), and momentum (m) are tuned to achieve optimum performance. In particular, the optimum λ prevents overfitting of the data, the learning rate controls the training time, and momentum helps the model converge. A typical CNN framework consists of three different layers, namely a convolutional layer, a pooling layer, and a fully connected layer, which are described in more detail below.
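The update rule of Equations (1) and (2) can be sketched in a few lines of Python; this assumes the standard stochastic gradient descent with momentum and L2 weight decay implied by the variable definitions above, and the hyper-parameter values shown are placeholders rather than the tuned values used in this study.

```python
import numpy as np

def momentum_sgd_step(W, V, grad_C, x=0.01, m=0.9, lam=1e-4):
    """One weight update following Equations (1) and (2).

    W: weights, V: previous update step (velocity),
    grad_C: dC/dW averaged over the n training samples in the batch,
    x: learning rate, m: momentum, lam: regularization parameter (lambda).
    """
    V_new = m * V - x * lam * W - x * grad_C  # Equation (1)
    W_new = W + V_new                         # Equation (2)
    return W_new, V_new
```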

Convolutional layer:
The main body of a CNN architecture is the convolutional layer, which contains several filters sliding across the image. Generally speaking, convolution is a mathematical operation that merges two different sources of information (i.e., an input image and a filter), given by:

y(j) = Σ x(i) f(j − i), i = 1, …, n (3)

where y is the feature map and x, f, and n are the input image, filter, and the number of pixels in the image, respectively.
Pooling layer: This layer is usually implemented after the convolution layer to reduce the dimensionality and number of parameters. This also helps to reduce training time and to prevent overfitting. The down-sampling layer is another term that has been used for the pooling layer because it spatially down-samples each feature map. Although several functions such as average pooling or even L2-norm pooling can be used as a pooling layer, most studies use the max-pooling operation (with filters of size 2 × 2 and a stride of 2).
Fully connected layer: Similar to typical neural networks (NNs), the neurons in this layer have full connections to all of the activations in the previous layer. The overfitting problem mostly occurs in the fully connected layer because it contains a higher number of parameters. Dropout is a regularization solution in neural networks that reduces interdependent learning amongst neurons in fully connected layers. The classification layer is the last layer of a CNN model. The SoftMax function is the most commonly used classification layer; it outputs a vector representing the probability distributions of the potential classes.
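To make the three layer types concrete, the following PyTorch sketch assembles a minimal CNN with a convolutional layer, max pooling, dropout, a fully connected layer, and a SoftMax output. The patch size, channel counts, and depth are assumptions chosen for illustration; they are not the architectures evaluated in this paper.

```python
import torch
import torch.nn as nn

class SmallWetlandCNN(nn.Module):
    """Illustrative sketch only: a minimal CNN with the three layer types described
    above. The 32 x 32 patch size and channel counts are assumptions."""
    def __init__(self, in_bands=5, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_bands, 16, kernel_size=3, padding=1),  # convolutional layer: filters slide across the image
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),               # pooling layer: 2 x 2 max pooling, stride 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                                     # dropout regularizes the fully connected layer
            nn.Linear(32 * 8 * 8, n_classes),                    # fully connected layer (assumes 32 x 32 input patches)
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)                      # SoftMax gives the class probability distribution
```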

GoogLeNet (GN)
GN was proposed by [34] for computer vision applications, such as image classification and object recognition. In this CNN algorithm, an innovative approach called the Inception module was introduced. There are nine operations of convolution and pooling layers in the structure of the GN method. In addition, to reduce the cost of computation, a one-by-one window size was suggested for the convolutional layers at the end of its structure. As a consequence of using a one-by-one window size, the input sizes of the convolution layers are decreased, resulting in faster and more efficient computation (Figure 4). The GN algorithm was proposed to solve two issues of conventional CNNs. First, there are too many parameters to be estimated in a deep CNN model; a high number of parameters would result in overfitting. Second, having too many layers in the CNN model increases the computation cost. By replacing the fully connected layer with sparse layers, the GN algorithm solved these issues.

MobileNet
Given the limited hardware and computational restrictions of mobile devices, MobileNet architectures were introduced for image recognition and are considered efficiently designed CNNs [36]. The highly efficient MobileNet-224 was proposed by [37]; it uses depth-wise separable convolutions, in which three-by-three convolution stacks are applied to each input feature map separately, making the network highly efficient (Figure 5).


Xception
Xception is considered a member of the Inception family of networks proposed by [39]. Inception models introduced complex building block structures, bottleneck design, batch normalization, as well as space and depth factorization. The Xception network implements factorization in its structure, using depth-wise separable convolutions for feature extraction (Figure 6). For each output channel, without using any non-linear activation function, Xception applies a one-by-one point-wise convolution followed by an adjacent three-by-three depth-wise convolution [37].

ShuffleNet
To decrease computation cost, point-wise group convolution and channel shuffle were utilized in ShuffleNet [40]. In particular, this model maintains the accuracy level of very deep CNN algorithms while having efficient computation costs. It is worth noting that the computation complexity and target platform, which define the computation budget, were a major consideration in the design of the ShuffleNet method. As a result, under equal settings, ShuffleNet has lower complexity than ResNet and ResNeXt [41] (Figure 7). Also, with accuracy comparable to AlexNet, ShuffleNet was almost thirteen times faster on a mobile device.


ResNet
The bottleneck structure was proposed in the ResNet method, achieving impressively high accuracy [41,42]. In ResNet, instead of learning unreferenced functions, layers are created as residual learning functions (Figure 8). By increasing the depth of the network, the residual networks of ResNet were easier to optimize, achieving higher accuracy. Moreover, the degradation problem of very deep CNNs was solved by the deep residual learning framework in the ResNet method. It is worth noting that, in conventional CNNs, degradation occurs as depth increases: the accuracy first saturates and then degrades.
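The residual learning idea can be illustrated with a short PyTorch sketch of a basic residual block, in which the stacked layers learn a residual function that is added back to the identity (skip) connection. This is the basic (non-bottleneck) variant and is only an illustrative sketch, not the exact blocks of ResNet18 or ResNet101.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of a ResNet-style residual block (assumed basic variant)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                   # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                           # layers learn the residual F(x) = H(x) - x
        return self.relu(out)
```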


Inception-ResNet
Inception-ResNet is a combined version of the Inception and ResNet modules developed by [34]. Inception-ResNet utilizes the characteristics of both the Inception and ResNet networks. This model has a similar architecture to Inception while benefiting from the bottleneck structure, batch normalization, and residual connections of ResNet [36]. It is deeper than the ResNet and Inception modules and, unlike its ancestors, does not require any auxiliary classifiers. Moreover, with fewer parameters, Inception-ResNet achieves equal or better results than the ResNet and Inception networks (Figure 9).


DenseNet
DenseNet, proposed by [43], is among the ResNet-style networks that make intensive use of residual connections. DenseNet, as its name suggests, has a densely connected building block in which each convolutional layer uses the output of previous convolutions and all inputs inside its block through several residual connections [44,45]. Layers in DenseNet are merged by the concatenation layer, which results in a very deep feature map. Like other ResNet-style networks, DenseNet uses a bottleneck design to reduce its depth [36] (Figure 10).


Ensemble CNN Models
In this study, we trained several well-known CNN models where each model assigns different labels to each region of the image. Due to classification errors resulting from insufficient or poor training, different models will sometimes assign different labels to the same image patch. This classification error can be minimized through an ensemble model, where the outputs of different trained models are ensembled to minimize error. In this section, we introduce two main ensemble techniques that we used to enhance the performance of the trained solo CNN models.

Majority Voting Algorithm
As described by Equation (4), majority voting is the simplest ensemble technique: for each image patch, the label produced by the majority of the models is assigned to that patch:

L = mode(L_1, L_2, ..., L_M) (4)

where mode(.) is the "mode" or majority function, L_m is the label produced by the m-th CNN model, M is the number of CNN models, and L is the final label assigned to the image patch. It should be noted that majority voting is an unsupervised ensemble approach, as it does not require an additional training step.
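A minimal sketch of Equation (4) is shown below; the array layout (one row of predicted labels per CNN model) is assumed for illustration.

```python
import numpy as np

def majority_vote(labels):
    """Apply Equation (4): labels is an (M, n_patches) integer array holding the
    class labels predicted by the M CNN models; the most frequent label across
    models (the mode) becomes the final label of each patch."""
    return np.array([np.bincount(column).argmax() for column in labels.T])

# Example: three CNNs agree on patches 1 and 3; patch 2 is settled by the majority.
predictions = np.array([[0, 1, 2],
                        [0, 1, 1],
                        [0, 2, 1]])
print(majority_vote(predictions))  # -> [0 1 1]
```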


Machine Learning-Based Approach
To improve the classification results of the solo CNN networks, the probabilities produced by the SoftMax layer of a CNN model can be used for another phase of training. The probabilities generated by the CNN models can be classified using well-known machine learning classification techniques, such as support vector machine (SVM), k-nearest neighbors (KNN), or decision tree-based algorithms. In this paper, we employ the supervised classifiers RF, BTree, and BOT to classify these features. In contrast to the majority voting method, this approach is supervised, as it requires an additional training step. The trained CNNs are evaluated in terms of the overall accuracy and producer's accuracy on the test data, which are derived from different sets of polygons and are unseen by the model during hyper-parameter tuning and the training phase (Equations (5) and (6)). It is worth noting that the test data are evaluated individually for each of the three study regions:

Overall Accuracy = (number of correctly classified pixels / total number of pixels) × 100 (5)

Producer's Accuracy = (number of correct pixels in one class / total number of pixels of that class as derived from reference data) × 100 (6)
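The supervised ensemble step and the two accuracy measures can be sketched as follows, assuming the SoftMax probability vectors of the three best CNNs are concatenated per pixel/patch and fed to a scikit-learn random forest; the hyper-parameters shown are illustrative and not the tuned values used in the paper (the BTree and BOT variants would swap in the corresponding tree-based classifiers).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ensemble_rf(train_probs, train_labels, test_probs):
    """Second-stage RF classification of stacked CNN softmax probabilities.

    train_probs / test_probs: lists of (n_samples, n_classes) probability arrays,
    one per selected CNN; they are concatenated into e.g. 3 x 8 = 24 features.
    """
    X_train = np.hstack(train_probs)
    X_test = np.hstack(test_probs)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, train_labels)
    return rf.predict(X_test)

def overall_accuracy(y_true, y_pred):
    """Equation (5): correctly classified pixels over all pixels, in percent."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))

def producers_accuracy(y_true, y_pred, cls):
    """Equation (6): correct pixels of one class over all reference pixels of that class."""
    mask = np.asarray(y_true) == cls
    return 100.0 * np.mean(np.asarray(y_pred)[mask] == cls)
```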


Results and Discussion
This research phase consisted of results evaluation designed to assess if solo CNNs can detect complex wetland classes to an acceptable accuracy. To do so, the overall and producer's accuracies were used to evaluate the capability of different models for identification of the wetland classes (i.e., bog, fen, marsh, swamp, and shallow water) as well as non-wetland classes (i.e., urban, upland, and deep water).
The overall accuracy values for the solo CNNs from a comparison of reference and predicted classes are summarized in Figure 11. Comparisons were made for the three study areas of Avalon, Grand Falls, and Gros Morne. The overall accuracies indicated a strong level of agreement between the reference and predicted wetland classes in the Gros Morne region (OA = 84.63-90.14%), followed by the Avalon area (OA = 79.86-85.16%) and Grand Falls (OA = 71.93-79.34%). Generally, the lower accuracy in Grand Falls can be explained by the smaller amount of training data for non-wetlands and the high level of complexity of the wetlands in this study area.

Figure 11. Results of CNNs for the test data sets of the three study areas (in percent).
To evaluate the time cost of different CNN models, their training time was assessed, as shown in Figure 12. The Inception-ResNet and DenseNet models required the longest time for training at 1392 and 1181 min, respectively. In contrast, the ResNet18 model required the least amount of time for training at 73 min. The comparison revealed the advantage of shallow CNNs compared to deep CNNs in terms of overall accuracy and time. This is because there are a higher number of parameters to be fine-tuned in the deeper CNNs, which increases the time and computational costs. Also, these CNN models with a high number of layers require larger training samples to achieve their full potential, which may result in a lower level of accuracy. It is worth highlighting that the experiments were done with an Intel processor (i.e., i5-6200U Central Processing Unit (CPU) of 2.30 GHz) and an 8 GB Random Access Memory (RAM) operating on 64-bit Windows 10.
We also evaluated the efficiency of different solo CNN models in each study area in terms of producer's accuracy, as described in more detail in the following subsections.


Avalon Study Area
A part of the Avalon study area, approximately 4.2 km by 4.9 km, was used for the classification mapping (Figure 13). As seen in Table 3, all the solo CNN models demonstrated poor performance in identifying the fen, marsh, and swamp wetland classes. There were few training data for these classes in this study area; consequently, the classification results of the CNN models in terms of producer's accuracy were less than 60%. It is worth mentioning that wetland classes do not have clear-cut boundaries (e.g., wetlands have irregular boundary shapes), and some of these classes may have similar vegetation types and structures, resulting in similar spectral reflectance values. For example, the accuracy of fen classification was low, with fen frequently misclassified as bog.
Additionally, some marsh regions were classified as fen areas. Moreover, most of the swamp areas were recognized as uplands. Generally, these can be explained as a result of the similarity shared between bog, fen, and marsh classes in terms of vegetation pattern (i.e., wet soils, some emergent, and saturated vegetation), as well as the similarity between the swamp and upland forest in terms of tree dominance, in addition to the overall low amount of training data required for training very deep learning models ( Figure 13 and Table 3).
In terms of the overall classification accuracy, the MobileNet network, with a value of 85.16%, had the best classification accuracy in the Avalon, while the lowest performance was obtained by DenseNet, with an overall accuracy of 79.86% (Figure 11).

Grand Falls Study Area
A part of the Grand Falls study area, approximately 4.2 km by 4.9 km, was used for the classification mapping (Figure 14). As in the Avalon area, the CNN models performed poorly in distinguishing the bog, fen, and marsh classes, likely due to their spectral similarity and the small amount of training data. In the Grand Falls, most of the swamp areas were incorrectly classified as bog, followed by fen, marsh, and upland classes. Also, shallow water was classified as deep water in some cases, potentially resulting from spectral similarities (Table 4). Results indicated that all CNN models achieved a high producer's accuracy in recognizing the non-wetland classes of urban and upland, which can generally be explained by their larger number of training data relative to the wetland training samples. The CNN models had a lower producer's accuracy in the Grand Falls than in the Avalon and Gros Morne regions due to the relative complexity of this study area. Fen, marsh, and swamp regions were better recognized by the CNN models in this study region than in the Avalon, which could be attributed to the larger number of training data for the fen and marsh classes in the Grand Falls relative to the Avalon and Gros Morne regions. However, the models performed relatively poorly on the shallow water class compared with the Avalon.
In addition, as the number of training data for the urban, deep water, and upland classes was smaller in the Grand Falls, the overall accuracy obtained in this area was much lower than in the Avalon and Gros Morne. The GoogLeNet and ShuffleNet networks, with overall accuracies of 79.34% and 79.07%, were superior to the other CNN models, while the lowest overall accuracy belonged to ResNet18 (Table 4 and Figures 11 and 14).
In the Grand Falls, with much less training data, CNN networks such as ShuffleNet and GoogLeNet were superior to the deeper CNN models of Inception-ResNet and ResNet101. This is because there are fewer parameters to be fine-tuned in ShuffleNet and GoogLeNet. There were more training data for the wetland classes in the Grand Falls; consequently, the classification accuracy of these classes was higher in this study area than in the Avalon and Gros Morne. On the other hand, the training data for the non-wetland classes were relatively limited in the Grand Falls, resulting in a lower overall accuracy ranging from 73.87% to 79.34% (Figures 11 and 14).

Gros Morne Study Area
A part of the Gros Morne study area, approximately 4.2 km by 4.9 km, was used for the classification mapping (Figure 15). In the Gros Morne region, the same issue exists for the incorrect classification of bog, fen, and marsh with the solo CNN models. Most of the swamp areas were recognized as the upland class, and most of the CNNs had issues correctly classifying shallow and deep waters. It is worth mentioning that the swamp and upland classes may have similar structure and vegetation types, specifically in low-water seasons. Consequently, their spectral reflectance can be similar, which leads to their misclassification. Generally, wetlands are a complex environment in which some classes have similar spectral signatures, specifically the bog, fen, and marsh wetland classes. All the solo CNNs presented a high level of accuracy for the classification of the non-wetland classes of urban, deep water, and upland (Table 5 and Figure 15). In this study area, similar to the Avalon, there were fewer training data for the wetland classes of fen, marsh, swamp, and shallow water; consequently, the performance of the solo CNNs was relatively poor compared to the Grand Falls. Moreover, as the number of training data for the non-wetland classes of deep water and upland and the wetland class of bog was higher in this region, the achieved overall accuracy was higher than in the Avalon and Grand Falls. With an overall accuracy of 90.14%, the Inception-ResNet network was superior to the other solo CNNs (Figures 11 and 14 and Table 5). Overall, there were more training data for the bog, urban, deep water, and upland classes in the Avalon and Gros Morne; as a result, CNN models with more parameters, including MobileNet and Inception-ResNet, outperformed the networks with fewer parameters, ShuffleNet and GoogLeNet.


Results of Ensemble Models
In this study, the main objective of integrating CNN models is to improve the wetland classification accuracy. As such, the probability layers extracted from the three solo CNN models with the highest accuracy for wetland classification in each study area were fused using four different approaches: RF, BTree, BOT, and majority voting. The overall accuracy from a comparison of predicted and reference classes is presented in Figure 16. Overall, the RF, BOT, and BTree models showed higher accuracy in the Gros Morne region, followed by the Avalon and Grand Falls regions. In the Avalon area, the BOT classifier improved the overall accuracy of the wetland classes by 6.43% through the ensemble of the DenseNet, ResNet18, and Xception networks (i.e., the networks with the best results for wetland classification). Overall accuracies between the reference and predicted classes in the Avalon were generally lower than in the Gros Morne study area. With the ensemble of Inception-ResNet, Xception, and MobileNet using the BTree algorithm, the overall accuracy was improved by 3.36% in the Gros Morne study area. The Grand Falls had the lowest overall accuracy compared to the Avalon and Gros Morne; there, the BTree obtained higher accuracy than the majority voting, BOT, and RF classifiers, improving the overall accuracy by about 8.16% using the ensemble of GoogLeNet, Xception, and MobileNet.

Figure 16. The overall accuracy of the ensemble algorithms in the three study areas (in percent).

It can be seen that, in the Avalon region, the BTree classifier improved the results of the best solo CNN (i.e., MobileNet) for the classification of marsh, swamp, and fen classes by 36.68%, 25.76%, and 20.01%, respectively. However, the classification accuracies of shallow water and bog were decreased by 12.5% and 3.29%, respectively (Table 6 and Figure 17).
In the Grand Falls, results obtained by the BTree classifier indicated an improvement in the shallow water, marsh, swamp, and bog classification of the best solo CNN (i.e., GoogLeNet) by 30.28%, 24.27%, 15.99%, and 5.72%, respectively. However, the classification accuracy of fen decreased by 6.27%. The results of ensemble modeling indicated a significant improvement in wetland classification compared to the solo CNNs (Table 7 and Figure 18). It is worth noting that, in the Gros Morne region, even though the overall accuracy did not increase substantially, the ensemble models achieved better classification accuracies for the wetland classes of swamp, fen, marsh, and shallow water. In more detail, the classification accuracies of these classes improved by 62.06%, 32.95%, 26.09%, and 9.79%, respectively, using the BTree classifier compared to the best solo CNN (i.e., Inception-ResNet) (Table 8 and Figure 19). However, the classification accuracy of bog decreased by 14.38%. To evaluate the efficiency and effectiveness of the solo CNNs and ensemble models for the classification of the wetland classes of bog, fen, marsh, swamp, and shallow water, their mean producer's accuracy was assessed and summarized in Figure 20.
The comparison revealed the superiority of the ensemble models compared to the solo CNN networks in terms of the mean producer's accuracy. Results indicated a strong agreement between the predicted and reference wetland classes in the Gros Morne region using the ensemble models. In more detail, the ensemble model of the RF algorithm had the highest accuracy with a mean producer's accuracy of 78%, where it improved the results of the best solo CNN model for the wetland classification (i.e., Xception with a mean producer's accuracy of 58.96%) by more than 19%. In the Grand Falls, the ensemble model of the BTree improved the accuracy of the best solo CNN model (i.e., Xception with a mean producer's accuracy of 63.51%) by 16.7%, with a mean producer's accuracy of 80.21%. Finally, the Avalon area had the least agreement between the predicted and reference wetland classes using the ensemble models.
The BTree classifier improved the results of the best solo CNN model of ResNet18 (with a mean producer's accuracy of 61.96%) by 9.63%, with a mean producer's accuracy of 71.59%. Results obtained by the solo and ensemble CNNs indicated the advantage of shallower CNN models, including ResNet18 and Xception, over very deep learning models, such as DenseNet. Besides that, classification accuracies achieved by the solo CNN models were substantially improved in all three study areas for the wetland classification of bog, fen, marsh, swamp, and shallow water (Tables 6-8).
The number of parameters that must be fine-tuned for each solo CNN is presented in Table 9. It is evident from Table 9 that the Inception-ResNet, ResNet101, and MobileNet networks, with approximately 50.2, 42.5, and 40.5 million parameters, respectively, had the highest number of parameters to be fine-tuned. On the other hand, the ShuffleNet, GoogLeNet, and ResNet18 networks, with about 1, 6, and 11.2 million parameters, respectively, had the fewest parameters.
The solo CNNs with a higher number of parameters (e.g., Inception-ResNet) require a larger amount of training data to reach their full classification potential. This contrasts with the situation in remote sensing applications, specifically in wetland classification, where, as discussed in the previous sections, creating a large amount of training data is labor-intensive and quite costly. Overall, this research demonstrated that, with a limited number of training data, CNN networks with fewer parameters (e.g., ShuffleNet) had better classification performance.
Moreover, the results demonstrated that the supervised classifiers, including BTree, BOT, and RF, were superior to the unsupervised majority voting classifier in terms of overall accuracy and mean producer's accuracy in the Avalon, Grand Falls, and Gros Morne. Their different data-fusion strategies can explain these better classification results. In the majority voting classifier, the results of the best CNNs are simply fused by taking the majority label. In contrast, in the supervised tree-based classifiers, such as the BTree algorithm, the CNN outputs are trained once more to minimize the classification error, resulting in much better classification accuracy.

Conclusions
Due to the valuable benefits that wetland functions provide to humans and nature, new techniques and technologies for wetland mapping and monitoring are of great importance. Wetlands are considered among the most complex ecosystems to classify due to their dynamic and complex structure, their lack of clear-cut boundaries, and the similar vegetation structures of their classes. In this regard, for high-resolution complex wetland classification, the results of various solo CNN models, including DenseNet, GoogLeNet, ShuffleNet, MobileNet, Xception, Inception-ResNet, ResNet18, and ResNet101, were compared and evaluated against several proposed ensemble-based approaches. Regarding the solo CNNs, due to the different amounts of training data in each study area, the obtained results were relatively inconsistent. For example, in the Grand Falls, the number of training data for wetland classes was higher than in the other two study regions, resulting in better producer's accuracies in this region. However, the overall accuracy of the solo CNNs was low in the Grand Falls because the number of training data for the non-wetland classes was lower than in the Avalon and Gros Morne (overall accuracy ranged from 73.87% to 79.34%). In addition, in both the Avalon and Gros Morne, the producer's accuracy for the classification of wetlands was low due to the limited number of wetland training samples in these regions.
In contrast, in the Avalon and Gros Morne, the overall accuracy was better, resulting from the larger number of non-wetland training data. It was concluded that the classification performance of the solo CNNs highly depends on the available training data, especially for deeper CNNs with a larger number of parameters, such as Inception-ResNet and DenseNet (Tables 3-5). Overall, CNNs with fewer parameters to be fine-tuned (i.e., ShuffleNet) were more successful in recognizing wetlands in terms of classification accuracy (Figure 11). On the other hand, the proposed ensemble of solo CNNs, using the results of the best three CNNs in each study area, significantly improved the classification accuracy of wetlands (Tables 6-8). The ensemble models were superior to the solo CNNs because they include an additional classification step that minimizes the classification error of the solo CNNs, specifically for wetland classification. The supervised classifiers of BTree, BOT, and RF and the unsupervised majority voting algorithm improved the classification results of the solo CNNs in terms of mean producer's accuracy by 9.63%, 16.7%, and 19.04% in the Avalon, Grand Falls, and Gros Morne, respectively.