Article

Ensemble Learning for Urban Flood Segmentation Through the Fusion of Multi-Spectral Satellite Data with Water Spectral Indices Using Row-Wise Cross Attention

by Han Xu and Alan Woodley *

School of Computer Science, Queensland University of Technology, Brisbane 4000, Australia

* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(1), 90; https://doi.org/10.3390/rs17010090
Submission received: 31 October 2024 / Revised: 15 December 2024 / Accepted: 17 December 2024 / Published: 29 December 2024

Abstract:
In post-flood disaster analysis, accurate flood mapping in complex riverine urban areas is critical for effective flood risk management. Recent studies have explored the use of water-related spectral indices derived from satellite imagery combined with machine learning (ML) models to achieve this purpose. However, relying solely on spectral indices can lead these models to overlook crucial urban contextual features, making it difficult to distinguish inundated areas from other similar features like shadows or wet roads. To address this, our research explores a novel approach to improve flood segmentation by integrating a row-wise cross attention (CA) module with ML ensemble learning. We apply this method to the analysis of the Brisbane Floods of 2022, utilizing 4-band satellite imagery from PlanetScope and derived spectral indices. Applied as a pre-processing step, the CA module fuses a spectral band index into each band of a peak-flood satellite image using a row-wise operation. This process amplifies subtle differences between floodwater and other urban characteristics while preserving complete landscape information. The CA-fused datasets are then fed into our proposed ensemble model, which is constructed using four classic ML models. A soft voting strategy averages their binary predictions to determine the final classification for each pixel. Our research demonstrates that CA datasets can enhance the sensitivity of individual ML models to floodwater in complex riverine urban areas, generally improving flood mapping accuracy. The experimental results reveal that the ensemble model achieves high accuracy (approaching 100%) on each CA dataset. However, this may be affected by overfitting, which indicates that evaluating the model on additional datasets may lead to reduced accuracy. This study encourages further research to optimize the model and validate its generalizability in various urban contexts.

1. Introduction

Australia has a long history of floods that have posed significant threats to urban communities, particularly cities vulnerable to tsunamis and storms that can exacerbate flooding [1]. Beyond the immediate threat to life, floods severely impact critical infrastructure like transportation networks, power grids, and water treatment facilities, causing widespread disruptions and economic losses. The 2022 floods in eastern Australia caused an estimated $4.7 billion in economic damage [2]. The true cost, however, extends beyond immediate economic losses, burdening communities with displacement, health risks from contaminated water, and the psychological impact of trauma [3,4,5]. Accurate and timely flood mapping is crucial for post-flood disaster analysis, improving response efforts, recovery planning, and mitigation strategies [6]. Such mapping also enables efficient resource allocation for rebuilding and identifying vulnerable populations [7].
Satellite remote sensing, with its capacity to provide high-resolution and time-series images, acts as a powerful tool for flood mapping. For instance, Sentinel-1, equipped with synthetic aperture radar, has been used in studies to capture dynamic flood images in urban areas, due to its ability to penetrate cloud cover even during inclement weather [8,9,10]. Meanwhile, Sentinel-2, incorporating multi-spectral imagery with detailed textures about land cover and vegetation health [11], has been utilized to assess urban flood extent, monitor water body changes, and identify vulnerable areas prone to inundation [12,13,14]. Landsat-8 and Landsat-9, with their extensive historical archive of moderate spatial resolution images dating back to the 1970s [15], have enabled the analysis of long-term flood patterns and the identification of areas susceptible to recurrent flooding within urban environments [16]. Recently, PlanetScope has gained favor in urban flood segmentation due to its unique advantages over other satellite platforms [17]. Its constellation of small satellites offers high spatial resolution (3–5 m per pixel) and frequent revisit times (daily or near-daily) [18], enabling the capture of fine-scale details of urban landscapes and the rapid monitoring of floodwater dynamics [17,19].
To further enhance the accuracy of flood mapping using satellite imagery, research has explored water-related spectral indices, derived from different bands of this imagery, to distinguish inundated regions. Some mainstream indices are utilized. For instance, the normalized difference water index (NDWI) enhances the contrast between water bodies and other land cover characteristics by leveraging the strong absorption of near-infrared light by water [20,21]. Furthermore, the normalized difference inundation index (NDII) is sensitive to changes in moisture content in urban environments [22]. The differential normalized difference water index (dNDWI) calculates the difference between the NDWI values of pre-flood and peak-flood images, thus amplifying the changes in water bodies [23]. In addition, the concatenated normalized difference water index (cNDWI) combines NDWI values from two different periods (typically pre-flood and peak-flood) into a single bi-band image, providing a richer representation of water dynamics [24,25].
While spectral indices aid the detection of water features, flood mapping in urban environments presents various challenges. Shadows from tall structures or bridges can mimic water features [24,25], and spectral confusion arises from diverse building materials and vegetation [25]. These challenges are compounded in river-adjacent cities, where the dynamic behavior of rivers introduces mixed spectral signatures [26]. This dynamic nature further confounds traditional water indices, leading to inaccurate flood extent delineation.
Machine learning (ML) models show promise in their ability to discern subtle differences between floodwater and other urban features after learning the spectral indices. Some non-parametric models, such as k-nearest neighbors (KNN) [24,27], Gaussian Naive Bayes (GNB) [24,27,28], decision trees (DT) [24,27,29], and random forests (RF) [24,27,28,29,30], have been explored in combination with these indices in riverbank cities. However, relying solely on spectral indices—which primarily focus on water reflectance—can lead to the omission of key urban features such as shape, texture, and spatial relationships [31]. These contextual urban characteristics play a critical role in helping ML models distinguish floodwater from visually similar elements, such as shadows, wet roads, or rooftops [31]. Therefore, incorporating these contextual features alongside spectral data enhances the models’ capacity for more accurate flood detection [32,33].
Attention mechanisms in computer vision enable deep learning (DL) models to selectively focus on the most salient regions of an image [34]. These mechanisms calculate the importance of different image features based on contextual cues, such as spatial, channel, or temporal information. The calculated importance is then translated into weights assigned to the features, reflecting their contribution to the final output. Applying this concept to flood detection, studies have demonstrated that DL models can effectively leverage these weighted features to improve their sensitivity to flood-like areas in satellite imagery, further enhancing detection accuracy [35,36,37]. Nevertheless, integrating attention mechanisms into classic ML models remains an underexplored area. This research gap highlights a promising direction: the fusion of a water-based spectral index into multi-spectral satellite imagery by attention mechanisms, potentially enabling these models to capture spectral and urban contextual information for improved flood segmentation.
Aside from exploring contextual data, several studies demonstrate the potential of ensemble learning with satellite imagery [38,39] or water indices [40,41]. One of the core concepts of this approach is to aggregate the multiple outputs from constituent models to make a final decision [42]. In a river-adjacent urban environment, individual models within the ensemble can capture non-linear correlations between water features and contextual data, potentially yielding more accurate segmentation results. It is therefore worth investigating the feasibility of combining ensemble learning with the abovementioned direction for urban flood mapping.
To examine this idea, our research proposes a novel approach that integrates a row-wise cross attention (CA) algorithm with ensemble learning, tested on a region in Brisbane affected by the February 2022 floods. Since this study focuses on a single area, we do not claim generalizability beyond this specific context. Building upon the traditional attention mechanisms [43], the CA algorithm, applied in the pre-processing step, fuses water spectral indices (dNDWI, cNDWI, NDII, NDWI) into each band of a peak-flood satellite image (4-band) at the row level. This local spatial attention is expected to amplify the nuanced differences between floodwater and other city characteristics while preserving overall landscape information. Subsequently, four CA datasets (CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI) are generated, alongside a 4-band peak-flood image, for model training and testing. The ensemble model (EM), which combines the strengths of proximity-based (KNN), probabilistic-based (GNB), and tree-based (DT and RF) classifiers, employs a soft voting strategy to finalize predictions. This strategy averages the four binary outputs from the sub-models, and if the average is equal to or exceeds 0.5, the pixel is classified as flooded; otherwise, it is categorized as non-flooded.
In this research, we state two hypotheses for validation. The first hypothesis (H1) is that CA datasets can improve component models’ sensitivity to floodwater in the current study city, generating a better-segmented result than using the peak-flood image. The second hypothesis (H2) is that the ensemble approach (constituent model selection and pre-defined learning strategy) can more accurately delineate flood extent (as measured by F1 score [44] on positive pixels) than any of its individual constituent models. Beyond validating the two hypotheses, we also explore the efficiency of each model on all datasets (including the peak-flood image and CA datasets) to inform model-dataset selection for timely flood response and resource-constrained environments. The objectives of this research include:
  • Evaluate the effectiveness of individual models on each CA dataset to verify H1.
  • Evaluate the effectiveness of EM on each CA dataset to verify H2.
  • Evaluate the efficiency of EM and its sub-models across all datasets.

2. Materials and Methods

This study aims to further improve flood mapping accuracy by integrating a row-wise CA module with ML-based ensemble learning. Building upon our previous research [24], we employed the same study case (the 2022 Brisbane Floods), water-related spectral indices (NDWI, dNDWI, cNDWI, NDII) and ML models (KNN, GNB, DT, and RF) to evaluate our proposed approach.

2.1. Study Case

Brisbane, the capital of Queensland and Australia’s third-largest city, is home to approximately 2.6 million people [45]. Its development along the riverbanks, coupled with impervious soil, reduces natural water absorption, increasing runoff into the river and its tributaries [45]. This increased runoff significantly contributes to the potential risk of flooding events. Historically, the city has endured five major floods: 1841, 1893, 1974, 2011, and 2022 [22]. Our research focuses on the 2022 Brisbane floods, specifically within the densely populated urban areas located between 152.979°E, 27.605°S and 153.034°E, 27.460°S (EPSG:4326-WGS 84 [46]), as shown in Figure 1. This area includes the Brisbane River, surrounding suburbs, houses, apartments, business complexes, and three universities [22]. Two images were used: a pre-flood image captured on 9 February 2022 and a peak-flood image captured on 28 February 2022.

2.2. Dataset and Pre-Process

2.2.1. PlanetScope Satellite Imagery

For this research, we utilized PlanetScope imagery due to its high spatial resolution and frequent revisit times [18], enabling us to capture the fine-scale dynamics of the February 2022 floods in Brisbane. We downloaded two satellite images from PlanetScope using the application programming interface (API) [49]: a pre-flood image captured on 9 February and a peak-flood image captured on 28 February. Both images were subjected to cloud masking to exclude areas with more than 30% cloud cover. Each image comprises four spectral bands, as detailed in Table 1. Figure 2 compares the natural and false color composites of both images to highlight water patterns.
The labeled image was derived from the Brisbane City Council (BCC) Flood Map [47] and World Water Bodies shapefiles [48]. The flood map provided information on river and creek flooding within the BCC’s jurisdiction. The World Water Bodies shapefile included detailed information on the Brisbane River. To minimize the potential impact of the river on the segmentation task, a ‘river-only’ image was created from the shapefile and overlaid onto the flood map, effectively removing the river from the labeled image. Figure 3 illustrates this river removal process.
Following the river removal, subsequent pre-processing steps involved removing no-data values and resizing images for each batch using QGIS [50]. Pixel-based values were then converted to double-precision format (float64) and normalized into a range of −1 to 1 using min-max scaling. Table 2 provides details of the final pre-processed labeled image.
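For illustration, a minimal NumPy sketch of this conversion and scaling step is shown below; the function name and the (H, W, C) array layout are our own assumptions rather than the paper’s actual pipeline.

```python
import numpy as np

def normalize_bands(image: np.ndarray) -> np.ndarray:
    """Min-max scale each band of an (H, W, C) image to [-1, 1] in float64."""
    image = image.astype(np.float64)
    out = np.empty_like(image)
    for c in range(image.shape[2]):
        band = image[:, :, c]
        lo, hi = band.min(), band.max()
        scale = (hi - lo) if hi > lo else 1.0  # guard against constant bands
        out[:, :, c] = 2.0 * (band - lo) / scale - 1.0
    return out
```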

2.2.2. Spectral Index Calculation

Four spectral indices (NDWI [21], dNDWI [23], cNDWI [24], NDII [22]) were derived from two satellite images: a pre-flood image and a peak-flood image. NDWI, calculated from the peak-flood image and thus referred to as NDWIpeak-flood, measures water features by contrasting green and NIR reflectance. dNDWI quantifies changes in water bodies between the two time periods. cNDWI integrates both NDWIs into a single multi-channel image. NDII leverages the NIR bands from both images. Equations (1)–(4) present the formulas used to compute these indices.
$$\mathrm{NDWI}_{\text{peak-flood}} = \frac{\mathrm{Green}_{\text{peak-flood}} - \mathrm{NIR}_{\text{peak-flood}}}{\mathrm{Green}_{\text{peak-flood}} + \mathrm{NIR}_{\text{peak-flood}}} \tag{1}$$

$$\mathrm{dNDWI} = \mathrm{NDWI}_{\text{peak-flood}} - \mathrm{NDWI}_{\text{pre-flood}} \tag{2}$$

$$\mathrm{cNDWI} = \left[ \mathrm{NDWI}_{\text{pre-flood}},\ \mathrm{NDWI}_{\text{peak-flood}} \right] \tag{3}$$

$$\mathrm{NDII} = \frac{\mathrm{NIR}_{\text{pre-flood}} - \mathrm{NIR}_{\text{peak-flood}}}{\mathrm{NIR}_{\text{pre-flood}} + \mathrm{NIR}_{\text{peak-flood}}} \tag{4}$$
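The indices in Equations (1)–(4) can be computed directly in NumPy, as sketched below; the band arrays (`green_pre`, `nir_pre`, `green_peak`, `nir_peak`) are hypothetical names for the extracted PlanetScope bands, and the epsilon guard against zero denominators is our addition.

```python
import numpy as np

def nd_ratio(num, den, eps=1e-12):
    """Normalized-difference ratio with a guard against zero denominators."""
    return num / np.where(np.abs(den) < eps, eps, den)

# green_pre, nir_pre, green_peak, nir_peak: 2D reflectance arrays
ndwi_peak = nd_ratio(green_peak - nir_peak, green_peak + nir_peak)  # Equation (1)
ndwi_pre = nd_ratio(green_pre - nir_pre, green_pre + nir_pre)
dndwi = ndwi_peak - ndwi_pre                                        # Equation (2)
cndwi = np.stack([ndwi_pre, ndwi_peak], axis=-1)                    # Equation (3)
ndii = nd_ratio(nir_pre - nir_peak, nir_pre + nir_peak)             # Equation (4)
```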
We established five baselines as visual references using the NDWIpeak-flood, NDII, cNDWI, and dNDWI indices. A specific threshold is defined for each index to generate a binary map indicating flooded and non-flooded areas. These baselines provide a reference for analyzing the potential impact of fusing each index with the peak-flood image via the CA operation.
We applied thresholds of 0.2 and 0.3 to NDWIpeak-flood. The threshold of 0.3 is used in urban environments to detect water bodies, while 0.2 is used in the context of land cover [51]. These two thresholds help assess the capabilities of NDWIpeak-flood for floodwater detection in the riverine city environment.
For NDII, we adopted a modified threshold of 0.1. This threshold was adapted from [22], with the modification of including the ‘equal to’ condition. This modification aims to increase the floodwater sensitivity for subtle variations in our study area.
For cNDWI, a bi-band spectral index, we averaged the band values and then applied a threshold of 0.1. Similarly, a threshold of 0.2 was applied to the single band dNDWI. Both thresholds were determined empirically through a trial-and-error process.
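Continuing the sketch above, the five threshold-based baseline maps could be generated as follows; the ‘equal to or exceeds’ rule matches the convention stated in Section 3.

```python
# Binary baseline flood maps: True = flooded, False = non-flooded.
flood_ndwi_02 = ndwi_peak >= 0.2
flood_ndwi_03 = ndwi_peak >= 0.3
flood_ndii = ndii >= 0.1
flood_cndwi = cndwi.mean(axis=-1) >= 0.1  # average the two bands first
flood_dndwi = dndwi >= 0.2
```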

2.2.3. Cross Attention Operation

Drawing from classic attention mechanisms [43], the proposed CA operation for each feature involved four steps: (a) similarity score calculation; (b) attention weight calculation; (c) weighted summation; and (d) rescaling and reshaping.
The similarity score quantifies the relevance between each channel of the query image (the 4-band peak-flood image) and the corresponding channel of the key (a spectral index). This is achieved through element-wise multiplication, also known as the Hadamard product [52]. The result is a set of feature maps S, where each map encapsulates the interaction between a specific query channel and its corresponding key channel. This approach offers superior computational efficiency compared to traditional attention mechanisms that rely on matrix transpositions and dot-product operations, especially when dealing with large matrices. Additionally, the Hadamard product excels at capturing local contextual relevance through its pixel-wise interactions. Because the spectral indices (NDII and the NDWI variants) are primarily designed to identify soil or vegetation moisture content and water presence, this step amplifies these crucial features within each query band, thereby enhancing the model’s sensitivity to these characteristics.
$$S_{i,j,c,k} = Q_{i,j,c} \odot K_{i,j,k} \tag{5}$$
The tensor $S_{i,j,c,k}$ represents the similarity score at pixel location (i, j) for the given query channel ‘c’ and key channel ‘k’. Let i ∈ {1, …, H}, j ∈ {1, …, W}, c ∈ {1, …, Cq}, and k ∈ {1, …, Ck}, where H represents the height of the image, W represents its width, Cq is the number of channels in the query, and Ck is the number of channels in the key. The Hadamard product is denoted by ⊙. The output S is a four-dimensional (4D) tensor with dimensions (H, W, Cq, Ck).
The second step involves computing attention weights, representing the relative importance of each query channel and its interaction with the spectral index as probabilities. Using the similarity scores S from the previous step, we iterate over each query channel ‘c’, extracting the corresponding 2D similarity map for its associated key channel ‘k’. We then apply the SoftMax function [43] row-wise to each similarity map across the width (W) dimension, generating attention weights for each pixel within that row. This row-wise approach captures fine-grained spatial variations within each channel, potentially highlighting subtle distinctions like water body boundaries, vegetation moisture, or other urban features. The SoftMax operation ensures that the sum of attention weights within each row equals 1, effectively converting them into probabilities. These attention weights are utilized in the subsequent weighted summation step to selectively emphasize key features for each query channel. To ensure numerical stability, the row-wise maximum is subtracted from each similarity score before exponentiation, as shown in Equation (6).
$$A_{i,j,c,k} = \frac{\exp\left(S_{i,j,c,k} - \max_{n} S_{i,n,c,k}\right)}{\sum_{m=1}^{W} \exp\left(S_{i,m,c,k} - \max_{n} S_{i,n,c,k}\right)} \tag{6}$$
where the tensor $A_{i,j,c,k}$ represents the attention weight at pixel location (i, j) for the given query channel ‘c’ and key channel ‘k’, and $\max_{n}(S_{i,n,c,k})$ denotes the maximum value within the i-th row of the similarity scores S for that query-key channel pair, with ‘n’ indexing the columns of the row. The summation in the denominator iterates over all columns (‘m’ from 1 to W) within the i-th row, for the same query channel ‘c’ and key channel ‘k’. The output A is a 4D tensor with dimensions (H, W, Cq, Ck).
The third step, weighted summation, refines the feature representations at the pixel level by incorporating the attention weights (probabilistic representations) from tensor A with the key features. This allows the model to focus on the most salient patterns. For each query–key channel pair (c, k), we perform element-wise multiplication between the corresponding 2D attention weight matrix (extracted from A) and the associated key feature map.
$$O_{i,j,c,k} = A_{i,j,c,k} \odot K_{i,j,k} \tag{7}$$
where $O_{i,j,c,k}$ is the output at pixel location (i, j) for the c-th query channel and the k-th key channel. The output O is a 4D tensor with dimensions (H, W, Cq, Ck).
The fourth step involves rescaling and reshaping the output tensor O. First, we apply min-max scaling to normalize its pixel values to the range [0, 1]. Subsequently, we perform a linear transformation to shift the range of these normalized values to [−1, 1]. Finally, we reshape O by concatenating the information from all query–key channel interactions along the channel dimension, resulting in a 3D tensor with dimensions (H, W, C), where C = Cq × Ck.
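The NumPy sketch below assembles the four steps as we have described them. It is an illustrative reading of the CA operation, not the authors’ released code; in particular, applying the step (d) min-max scaling globally rather than per channel is an assumption.

```python
import numpy as np

def row_wise_cross_attention(query: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Fuse a spectral index (key) into a multi-band image (query).

    query: (H, W, Cq) peak-flood image; key: (H, W, Ck) spectral index.
    Returns an (H, W, Cq * Ck) fused dataset rescaled to [-1, 1].
    """
    H, W, Cq = query.shape
    Ck = key.shape[2]

    # (a) Similarity scores via the Hadamard product, Equation (5).
    S = query[:, :, :, None] * key[:, :, None, :]        # (H, W, Cq, Ck)

    # (b) Row-wise SoftMax along the width, Equation (6); the row maximum
    # is subtracted before exponentiation for numerical stability.
    S = S - S.max(axis=1, keepdims=True)
    expS = np.exp(S)
    A = expS / expS.sum(axis=1, keepdims=True)           # each row sums to 1

    # (c) Weighted summation: attention weights scale the key features,
    # Equation (7).
    O = A * key[:, :, None, :]

    # (d) Rescale to [0, 1], shift to [-1, 1], then concatenate all
    # query-key channel interactions along the channel dimension.
    O = (O - O.min()) / (O.max() - O.min() + 1e-12)
    O = 2.0 * O - 1.0
    return O.reshape(H, W, Cq * Ck)
```

For a 4-band peak-flood image `peak_image` and a single-band index, the fused dataset would be obtained as `row_wise_cross_attention(peak_image, ndwi_peak[:, :, None])`, yielding the 4-band CA-NDWI dataset (or 8 bands for the bi-band cNDWI).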

2.3. Construct and Train: Ensemble Model

This section explains the four ML algorithms used to produce the experimental EM. While each model has its advantages and disadvantages for flood identification, the chosen models span linear and non-linear classification categories and have varying training and testing complexities. These differences, and the resulting costs and benefits, are also explored in EM, which combines the individual models.

2.3.1. K-Nearest Neighbors

KNN provides a distance-based perspective to our proposed ensemble model [53,54]. In this model, the class of a pixel is determined by the majority vote of its ‘k’ nearest neighbors within the input data, where proximity is measured using Euclidean distance [54]. In this study, each dataset comprises 3,647,904 pixels multiplied by the number of channels. As a lazy learner, KNN stores the training pixels rather than fitting an explicit model. During inference, the distance between a new pixel and all training pixels is calculated, and its class (flooded or non-flooded) is determined by the majority vote of its ‘k’ nearest neighbors. This approach assumes that unknown flooded pixels are likely to resemble labeled flooded pixels, and vice versa.

2.3.2. Gaussian Naive Bayes

GNB, as illustrated by [55], offers a probabilistic perspective to the ensemble model for satellite data analysis. As a variant of Naive Bayes, it leverages Bayes’ theorem under the assumptions of pixel independence and normal distribution of pixel values. This approach reduces over-reliance on specific patterns, mitigating issues such as overfitting and the influence of noisy pixels (for example, cloudy areas shown in Figure 2c).
In the context of binary classification, the training process involves estimating the mean and standard deviation parameters for positive (flooded) and negative (non-flooded) pixels independently using maximum likelihood estimation [56], treating each pixel as an individual instance. For classification, the model utilizes a feature vector composed of spectral values from the same spatial location across different channels. The final prediction is determined by selecting the class (flooded or non-flooded) with the higher posterior probability.

2.3.3. Decision Tree

DT, as studied by [57], is a tree-like structural model that employs recursive binary splits to learn complex, non-linear patterns. This is achieved by partitioning satellite pixels into distinct content regions based on their spectral signatures, such as building shadows, wet roads, or water bodies. This partitioning facilitates the creation of granular, rule-based trees capable of accurate predictions. Moreover, DT handles multidimensional satellite data well, capturing the intricate patterns associated with flooded areas by analyzing pixel values across all spectral bands.
Specifically, the model consists of three types of nodes that are connected by branches: the root node, internal nodes, and leaf nodes. The root node serves as the entry point for input data and defines the target task (binary segmentation). Internal nodes, governed by rule-based conditions, guide the input data through the branches until they reach the leaf nodes, which represent the final positive or negative labels. These conditions are automatically configured based on data characteristics and a predefined splitting criterion.
During training, DT learns to fit the training data for constructing the tree by optimizing the rule-based conditions at each internal node. Subsequently, every new pixel is treated as an independent instance, traversing the tree from the root node to the corresponding leaf node to obtain its final prediction.

2.3.4. Random Forest

RF is an ensemble of decision trees that leverages the bootstrap aggregation (bagging) technique to train each tree on a different subset of the input data, sampled with replacement [58]. Its advantages are similar to those of DT; in addition, aggregating an ensemble of trees helps decrease errors.
During tree construction, three factors influence the model: inherent data characteristics, the chosen splitting criterion (for example, Gini impurity), and random feature selection. Unlike single decision trees that consider all features at each split, RFs select a random subset of features (typically the square root of the total number of channels) to prevent overfitting to dominant features and potentially accelerate training. The final prediction is determined by a majority vote from all trees.

2.3.5. Stratified Cross Validation

Given the large class imbalance (74.28% negative pixels and 25.72% positive pixels), models are prone to bias towards the dominant class during training. To mitigate this, we employ stratified k-fold cross-validation [59]. This technique partitions the data into ‘k’ folds while preserving the original class distribution in each fold. The model is trained and evaluated ‘k’ times, using each fold as the testing set once. This ensures that both the training and testing sets retain the original proportion of positive and negative pixels. The final model performance is determined by averaging the performance across all ‘k’ folds.
In our case, with 3,647,904 pixels (3081 × 1181) multiplied by the number of channels, each fold used 1/k of the pixels for testing and the remaining (k − 1)/k for training. This partitioning maintained the 74.28% to 25.72% class distribution in both the training and testing sets for each fold.
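A brief sketch of this procedure follows, assuming a scikit-learn implementation and hypothetical `dataset` (H, W, C) and `labels` (H, W) arrays; the decision tree here is only a placeholder for any of the sub-models.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X = dataset.reshape(-1, dataset.shape[2])  # (H * W, C) pixel features
y = labels.reshape(-1)                     # 0 = non-flooded, 1 = flooded

model = DecisionTreeClassifier(criterion="gini")
skf = StratifiedKFold(n_splits=10)  # each fold keeps the 74.28%/25.72% ratio
scores = []
for train_idx, test_idx in skf.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores), np.std(scores))
```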

2.3.6. Ensemble Model

The proposed EM integrates four non-parametric supervised classifiers: KNN, GNB, DT, and RF. Before ensemble learning, each dataset was reshaped into (H × W, C) dimensions to ensure compatibility with the models. This learning pipeline entailed that each sub-model independently performed k-fold stratified cross-validation, data fitting, and prediction on each dataset.
A soft voting strategy [60] was then employed to generate the final prediction. This strategy averaged the binary predictive results from the four sub-models. If this average was greater than or equal to a threshold of 0.5, the pixel was classified as positive (flooded); otherwise, it was classified as negative (non-flooded).
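A compact sketch of this pipeline is given below, assuming scikit-learn implementations of the four sub-models with the hyperparameters reported in Section 3; cross-validation is omitted for brevity, and the library choice is our assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters follow Section 3: k = 3 with uniform weights, Gini splits,
# 100 trees, no depth limit, and parallel processing for KNN and RF.
sub_models = [
    KNeighborsClassifier(n_neighbors=3, weights="uniform", n_jobs=-1),
    GaussianNB(),
    DecisionTreeClassifier(criterion="gini"),
    RandomForestClassifier(n_estimators=100, criterion="gini", n_jobs=-1),
]

def ensemble_predict(X_train, y_train, X_test):
    """Soft voting: average the four binary predictions; >= 0.5 -> flooded."""
    preds = []
    for model in sub_models:
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return (np.mean(preds, axis=0) >= 0.5).astype(int)
```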
The proposed EM provides versatile perspectives, including distance-based, probabilistic-based, tree-based, and nested-ensemble-based, to analyze intricate urban floods simultaneously. This combination serves as a foundation to explore how these four contributors and the learning strategy affect the predictive results, thereby providing insights for improvements to address complex urban flood challenges. However, the EM can potentially inherit strengths and weaknesses from its sub-models.

2.4. Evaluation

The research evaluated EM and its component models regarding efficiency and effectiveness. Effectiveness was measured using seven metrics: accuracy, recall (flooded/non-flooded), precision (flooded/non-flooded) and F1 (flooded/non-flooded) [44]. Efficiency was assessed based on time and memory usage during training and testing. Means, standard deviations and significant differences were calculated across each of the metrics. Boxplots [61] were used to visualize the distribution of each evaluated metric. One-way analysis of variance (ANOVA) [62] was conducted to determine whether there were statistically significant differences among the means of groups (models or datasets) for each metric. Following a significant ANOVA result, Tukey’s honestly significant difference (HSD) test [63] was employed to identify specific pairs of groups exhibiting statistically significant differences for each metric. Paired two-tailed t-tests were conducted to compare the peak-flood image with each of the CA datasets and to determine if the differences were statistically significant.

2.4.1. Effectiveness Metrics

This section describes the effectiveness metrics used. Most metrics include calculations for both flooded and non-flooded areas. This provides a holistic view of performance by identifying where the models under- or over-estimate inundation compared to the ground truth dataset.

Accuracy

The metric measures model’s performance in the proportion of correctly classified positive and negative pixels [44]. The formula is:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{8}$$
TP stands for true positive, indicating the number of correctly predicted positive pixels (flooded). TN indicates true negative, reflecting the number of correctly identified negative pixels (non-flooded). False positive (FP) describes the number of negative pixels mistakenly recognized as positive, known as type I errors. False negative (FN) represents the number of positive pixels that were incorrectly predicted as negative.

Recall

This metric evaluates the model’s ability to identify positive or negative pixels [44]. We used two equations, for recall (flooded) and recall (non-flooded), to explore this performance separately, with the formulas defined below:
$$\text{Recall}_{\text{Flooded}} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{Recall}_{\text{Non-Flooded}} = \frac{TN}{TN + FP} \tag{10}$$

Precision

This metric examines the proportion of predicted positive or negative pixels that are correct [44]. Two equations, for precision (flooded) and precision (non-flooded), were utilized in the experiments:
$$\text{Precision}_{\text{Flooded}} = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Precision}_{\text{Non-Flooded}} = \frac{TN}{TN + FN} \tag{12}$$

F1 Score

This metric is the harmonic mean of recall and precision, evaluating a model’s ability to identify target pixels [44]. On an imbalanced dataset, it avoids the bias caused by a large difference between recall and precision. Similarly, we employ F1 (flooded) and F1 (non-flooded) to quantify this ability for positive and negative pixels, respectively. Their formulas are specified below:
$$\text{F1}_{\text{Flooded}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{13}$$

$$\text{F1}_{\text{Non-Flooded}} = \frac{2 \cdot TN}{2 \cdot TN + FP + FN} \tag{14}$$
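For reference, Equations (8)–(14) can be computed directly from the confusion-matrix counts, as in the sketch below (assuming flat binary NumPy arrays in which both classes are present):

```python
import numpy as np

def effectiveness_metrics(y_true, y_pred):
    """Compute the seven metrics of Section 2.4.1 from binary label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall_flooded": tp / (tp + fn),
        "recall_non_flooded": tn / (tn + fp),
        "precision_flooded": tp / (tp + fp),
        "precision_non_flooded": tn / (tn + fn),
        "f1_flooded": 2 * tp / (2 * tp + fp + fn),
        "f1_non_flooded": 2 * tn / (2 * tn + fp + fn),
    }
```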

2.4.2. Efficiency Metrics

The experiment was conducted on an Apple Mac Pro with an M2 chip (3.2 GHz 16-core CPU and 10-core GPU) and 128 GB RAM. We examined the efficiency of each model in terms of its time and memory requirements during both training and prediction processes. As the ensemble model, composed of these four algorithms, learns and predicts data independently, the total time cost for training or prediction is calculated as the sum of the time required by each algorithm. Also, the memory requirement is determined by the maximum memory usage observed among all the algorithms during each process. Time cost is measured in seconds, while memory usage is quantified in megabytes (MB).
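The paper does not specify its measurement tooling; one possible sketch using only the Python standard library is shown below. Note that `tracemalloc` traces Python-level allocations, so memory allocated inside compiled extensions may be undercounted.

```python
import time
import tracemalloc

def measure(fn, *args):
    """Return (result, elapsed seconds, peak traced memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024**2
```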

2.4.3. Analysis of Variance

We employed boxplots, one-way ANOVA, and Tukey’s HSD test to analyze the influence of five models (EM and its sub-models) and five datasets (the peak-flood image and four CA datasets) on the metrics (effectiveness and efficiency). These methods can identify outliers, median lines, variability, and significant/non-significant differences for these results, informing model/dataset selection and future improvements.

Model-Based Analysis

We evaluated five models across the metrics utilizing five datasets. This comparative analysis aimed to identify the strengths or weaknesses of these models for informed model selection and provide evidence-based cues for model optimizations.

Dataset-Based Analysis

Conversely, we compared five datasets across the metrics, evaluating their performance with each of the five models. This analysis aims to identify specific data features that positively or negatively affect these metrics, providing informed insights for dataset selection and dimensionality reduction.

Data Variability

Boxplots visually represented the distribution and variability of the metrics for each group (five models or five datasets), providing an overview of the performance variations.

Statistical Analysis

One-way ANOVA was conducted to determine if there were statistically significant differences among the means of the metrics for the five models or five datasets. The null hypothesis stated that there were no differences among the groups being tested. We set a significance level (α) of 0.05. If the p-value was greater than α, we failed to reject the null hypothesis. Conversely, a p-value less than or equal to α indicated significant differences among the group means.

Post-Hoc Analysis

Following a significant one-way ANOVA result, Tukey’s HSD test was used to identify specific pairs of groups with significantly different means. Tukey’s HSD, which controls the family-wise error rate at 0.05, was employed for pairwise comparisons. Significant differences were determined based on the adjusted p-value (p-adj). The p-adj value is a modified p-value used in multiple comparisons to reduce the likelihood of making a type I error (falsely rejecting a true null hypothesis).

Paired T-Test

The difference made by the addition of the CA datasets was evaluated via a paired two-tailed t-test across each of the effectiveness and efficiency metrics. Results were grouped by their model, with one list consisting of the values from the CA datasets, while the second list consisted of the values from the peak-flood image. The threshold was set at 0.025 to test both the upper and lower bounds. This provided an alternative view to the analysis by highlighting significant differences.
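The statistical workflow of this section (one-way ANOVA, Tukey’s HSD, and the paired t-test) could be reproduced with SciPy and statsmodels along the lines below; `groups`, `ca_values`, and `peak_flood_values` are hypothetical containers standing in for the tabulated results.

```python
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# groups: {model_name: [metric values, one per dataset]} for a single metric
f_stat, p_value = stats.f_oneway(*groups.values())  # one-way ANOVA, alpha 0.05

if p_value <= 0.05:  # significant: locate the differing pairs
    values = [v for vals in groups.values() for v in vals]
    labels = [name for name, vals in groups.items() for _ in vals]
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))  # Tukey's HSD

# Paired two-tailed t-test (threshold 0.025, per Section 2.4.3); the two
# lists must be paired element-wise, e.g. per-model CA-dataset values
# against the corresponding peak-flood values.
t_stat, p_paired = stats.ttest_rel(ca_values, peak_flood_values)
```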

3. Results

In our experiments, four spectral indices (dNDWI, cNDWI, NDII, NDWIpeak-flood) were fused into the peak-flood image through the CA module. This process generated four CA datasets: CA-dNDWI, CA-cNDWI, CA-NDII, and CA-NDWI. CA-dNDWI, CA-NDII, and CA-NDWI were 4-band datasets, while CA-cNDWI was an 8-band dataset. The distribution of values within these CA datasets exhibited varying degrees of skew and deviated from a standard Gaussian distribution, as evidenced by Figure 4.
We used the same hyperparameters for KNN, GNB, DT, and RF as in our previous research [24]. KNN was configured with k = 3 and uniform weighting for class prediction. GNB was employed with its default settings, which assumed feature independence and a Gaussian distribution. DT utilized the Gini impurity splitting criterion for identifying informative splits, while the RF was set to construct 100 trees for majority voting prediction. DT and RF were allowed to grow fully to fit the data without depth limitations to capture complex relationships in the data. Furthermore, we enabled parallel processing for KNN and RF to leverage the capacity of our multi-core processors. Drawing on [64], we set k = 10 for the k-fold stratified cross-validation.
After configuration, we conducted 30 experiments. These included 25 experiments applying five models (KNN, GNB, DT, RF, EM) to five datasets (peak-flood image, CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI). We also conducted five experiments utilizing threshold-based baseline spectral indices: NDWIpeak-flood (0.2 and 0.3), NDII (0.1), cNDWI (0.1), and dNDWI (0.2). Pixel values that equaled or exceeded the threshold were considered positive/flooded; otherwise, they were considered negative/non-flooded.

3.1. Flood Area Segmentation and Effectiveness Metrics

3.1.1. Threshold-Based Spectral Index Baselines

This section presents the results of five baseline indices across seven metrics. Figure 5 provides visual representations and Table 3 provides numerical results.
Comparing the Truth Flood Map in Figure 3c with Figure 5, the baseline flood maps differ in three main spatial areas: (1) identifying the Brisbane River as non-flooded water; (2) identifying the urban area in the northeast; and (3) distinguishing between the flooded creek and the urban structures in the southern half of the image.
Table 3 presents the effectiveness metrics for the five baseline indices. Each of the indices has strengths and weaknesses when identifying flooded and non-flooded pixels.
NDWIpeak-flood (0.2) misclassifies the Brisbane River as flooded, lowering its precision (flooded), and misclassifies areas in the south and northeast of the image, lowering the recall and precision for both flooded and non-flooded areas. Together, these errors limit its accuracy to 0.692.
NDWIpeak-flood (0.3) struggles to find inundated regions. This results in high scores across the non-flooded metrics, particularly recall (non-flooded) (0.976). However, it performs far worse on the flooded metrics, particularly recall (flooded) and F1 (flooded) (0.149 and 0.244). This is despite the threshold of 0.3 having previously been reported to perform better in urban areas [51]. The accuracy of 0.763 is very similar to the proportion of non-flooded pixels in the image (0.743).
NDII has the highest accuracy (0.833). It can detect the inundation in the southeast of the image, but does not misclassify the river, the northeast region, or the southwest region of the image. This means that it performed strongly across the non-flooded metrics. However, it was not able to identify the flooded area stemming from the river in the north of the image.
cNDWI has the lowest accuracy (0.587). It misclassified the river and the region in the southwest of the image. Visually, it is similar to the NDWI (0.2) image but had poorer results in the flooded metrics.
dNDWI performed similarly to NDII. It had the second-highest accuracy (0.748) and performed strongly in the non-flooded metrics. Like NDII, it correctly classified the Brisbane River and the region in the southwest of the image. dNDWI and NDII are similar in that both are computed as differences between the pre-flood and peak-flood images. However, visually, dNDWI appears to contain more noisy pixels.

3.1.2. Model-Based Predictive Results

This section describes the predictive ability of five models when the peak-flood image and the CA datasets are used as inputs. Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 display the visual representations. Table 4 presents the mean and standard deviation of each metric across 25 experiments. Table 5 provides detailed effectiveness metrics for each model and dataset.
In Table 4, the mean for each metric is close to 1.000, indicating that most experiments performed very well. This is further emphasized in Table 5, where some experiments achieve a perfect score (1.000). However, this high performance may indicate that the models overfit the dataset.
Table 5 and Figure 6 demonstrate that KNN consistently predicts flood areas across all datasets. The model correctly classified the river and outperformed the baselines in classifying areas in the northwest, southeast, and southwest. Furthermore, KNN achieved the highest accuracy, recall (flooded), precision (non-flooded), F1 (flooded), and F1 (non-flooded) scores when using the peak-flood image.
Table 5 shows that GNB was the poorest-performing model, exhibiting the highest standard deviation in most metrics compared to the other models. This is also shown in Figure 7, where GNB’s peak-flood and CA-NDWI images misclassify the Brisbane River. Overall, GNB underestimates the flooded areas, particularly with the CA-NDWI image, resulting in lower scores for flood-related metrics.
In Table 5, the DT and RF models performed very strongly. Notably, the CA versions achieved near-perfect results (to three decimal places) for some metrics, likely due to overfitting on the dataset. Visually, Figure 8 and Figure 9 are almost identical to Figure 3c, which helps to confirm this idea.
Due to the dominant influence of DT and RF in the ensemble, EM suffers from overfitting across all datasets. Figure 10 is visually similar to Figure 3c, which supports this observation. Notably, as shown in Table 5, CA-cNDWI enables this model to achieve the highest values across all metrics.
The overarching results reveal that (a) KNN performs better with the peak-flood image than with the CA datasets; (b) GNB performs better with CA-dNDWI, CA-cNDWI, and CA-NDII than with the other two datasets; and (c) DT, RF, and EM encounter overfitting across all datasets, achieving approximately 1 for each metric.

3.2. Efficiency Results

As introduced in Section 2.4.2, our experiments were conducted on an Apple Mac Pro (M2 chip, 3.2 GHz 16-core CPU, 10-core GPU, 128 GB RAM). While the Mac Pro has a 10-core GPU, it was not utilized for parallel processing. The multi-core CPU was used for parallel processing in the KNN and RF models, while the GNB and DT models were run without it. Table 6 presents the mean and standard deviation of the efficiency metrics.
Table 7 illustrates efficiency metrics for five models across five datasets. In terms of training time, the models rank from fastest to slowest as follows: GNB, KNN, DT, RF, and then EM. The RF and EM models had larger standard deviations than the other models.
For testing time, the rank is GNB, DT, RF, KNN, and then EM. The sub-linear inference of GNB, DT, and RF contributed to their faster testing times compared to the linear search over training pixels performed by KNN.
In terms of memory consumption during training, the rank was DT, GNB, RF, KNN, and then EM. DT had a smaller standard deviation than the other models, while KNN and EM consumed the most memory.
The rank was similar for testing memory usage: DT, GNB, KNN, RF, and then EM. The memory usage of KNN, RF, and EM was very similar. EM’s total time demand is the sum of the time costs of its constituent models for each corresponding dataset.

3.3. Variance Comparisons

3.3.1. Model-Based Effectiveness and Efficiency Metrics

Figure 11 presents boxplots that visualize the distribution of effectiveness metrics for each model across the five datasets.
KNN exhibits relatively consistent performance in all metrics, as evidenced by narrow boxes. The median lines for accuracy, recall (flooded), precision (non-flooded), F1 (flooded), and F1 (non-flooded) are generally close to the lower quartile, indicating value distributions concentrated at the lower level. These lower values are observed when the model is trained on CA-dNDWI, CA-NDII, and CA-NDWI. Notably, CA-NDWI results in outliers in recall (non-flooded) and precision (flooded).
GNB exhibits varying variances across all metrics, indicating its sensitivity to the choice of dataset. The median lines for accuracy, recall (flooded), F1 (flooded), and F1 (non-flooded) are close to the upper quartile, suggesting distributions concentrated at higher values when trained on CA-dNDWI, CA-cNDWI, and CA-NDII. Notably, CA-NDWI leads to two outliers in recall (non-flooded) and F1 (non-flooded) for GNB.
DT, RF, and EM display extremely narrow boxes, very close to the upper threshold of 1, indicative of overfitting for all metrics. EM generates outliers in accuracy, recall (non-flooded), precision (flooded), F1 (flooded), and F1 (non-flooded), with upper outliers associated with CA-cNDWI and lower outliers associated with CA-NDWI.
Figure 12 presents boxplots that visualize the distribution of efficiency metrics for each model across the five datasets.
KNN and GNB exhibit similar training times. However, their computational demands increase with the number of channels, as demonstrated by the outliers for CA-cNDWI (72.040 and 5.059 s, respectively), whose 8-band structure carries twice as many feature values per pixel. DT maintains slight variability in training times, with a median line close to the upper quartile. It is influenced by CA-dNDWI, CA-cNDWI, and CA-NDWI, indicating that DT requires more time to fit these complex datasets than the simpler peak-flood image (which lacks fused spectral-index features). RF displays higher variability, as evidenced by a longer lower whisker extending to a minimum value of 1101.444 s for the peak-flood image. EM shows an outlier for the peak-flood image, with a value of 1361.073 s.
Testing times differed significantly across the models but were relatively consistent within each model. GNB and DT have the shortest testing times, while RF, KNN, and EM require an order of magnitude more time. EM requires the highest training and testing times, attributable to the ensemble setting.
For memory usage, all models demonstrate varying degrees of variability. Notably, KNN shows wide-spanning variability, indicating its sensitivity to the choice of datasets. EM’s variability in memory usage stems from KNN during training and RF during testing.
In summary, optimizing KNN and GNB for efficiency requires careful consideration of pixel count. Also, the feature complexity integrated from certain spectral indices can impact model efficiency.
The one-way ANOVA analysis (Table 8) reveals statistically significant differences across the effectiveness metrics. All p-values are well below the 0.05 significance level, indicating strong evidence to reject the null hypothesis of no difference between the models. The smallest difference between the models is observed in recall (non-flooded) (p-value = 4.8080 × 10⁻³), while the largest difference is in recall (flooded) (p-value = 2.8140 × 10⁻¹⁶).
Table 9, Table 10, Table 11 and Table 12 provide additional insights into pairwise model comparisons. Table 9 displays accuracy p-values. KNN and GNB are significantly different from each other and from DT, RF, and EM. In contrast, DT, RF, and EM all perform very strongly in their accuracy metric and are not significantly different from each other. This supports the analysis in Section 3.1.2.
In Table 10, the smaller difference in recall (non-flooded) is primarily due to the significant differences observed only when comparing GNB to DT, RF, and EM (p-adj ≤ 0.05). The non-significant differences are observed in pairwise comparisons involving KNN with any of the other models (GNB, DT, RF and EM), and in comparisons between DT, RF, and EM. The largest difference in recall (flooded) occurs in comparisons involving either KNN or GNB with any of the other models, with the exception of the non-significant differences between the pairs of DT, RF, and EM. This extends the observations from Table 5, where all the recall (non-flooded) experiments returned values > 0.9, except for the experiments of GNB using the peak-flood image and CA-cNDWI as inputs. For recall (flooded), DT, RF, and EM also performed strongly (>0.98), but KNN and GNB performed much worse.
Further analysis of the pairwise comparisons in Table 11 and Table 12 reveals additional insights. Firstly, pairwise comparisons between DT, RF, and EM show no statistically significant differences (p-adj > 0.9500) in precision (flooded), precision (non-flooded), F1 (flooded), and F1 (non-flooded). Secondly, for precision (flooded), the pairs KNN-DT, KNN-RF, and KNN-EM also exhibit no significant difference. Aside from these observations, all other pairwise comparisons between the models show statistically significant differences.
Overall, DT, RF, and EM are tightly coupled; in this case, all three models are overfitted. This also implies a limitation of our implementation of EM, which does not demonstrate a diverse ability to detect flood extent, particularly with the current sample size.
In terms of efficiency metrics (Table 13), the five models exhibit the least variation in training time (p-value: 2.0320 × 10⁻⁷), yet the most variation in testing time (p-value: 1.0000 × 10⁻²⁵).
Table 14 reveals no significant difference in training time among these pairs: KNN and GNB, KNN and DT, GNB and DT, RF and EM. For testing time, there is no significant difference between GNB and DT, while all other pairs show significant differences (p-adj < 0.0001).
Regarding memory usage, Table 15 demonstrates no statistical difference between EM and KNN during training, as they utilize the same amount of memory. Similarly, GNB and RF exhibit no significant difference in training memory usage. In terms of testing memory, EM and RF show no significant difference, while all other model pairs exhibit significant differences in memory usage.

3.3.2. Dataset-Based Effectiveness and Efficiency Metrics

Figure 13 presents boxplots that visualize the distribution of effectiveness metrics for each dataset across the five models. Similar variances are observed across all datasets for accuracy, recall (non-flooded), precision (flooded), F1 (flooded), and F1 (non-flooded). In contrast, recall (flooded) and precision (non-flooded) exhibit varying variances across the datasets.
The median lines for all metrics in each dataset are generally close to the upper quartile, due to the influence of the overfitting models: DT, RF, and EM. Across all metrics, GNB consistently produces the lowest values, often resulting in lower outliers, particularly for the peak-flood image and CA-NDWI.
Figure 14 presents boxplots that visualize the distribution of efficiency metrics for each dataset across the five models. The peak-flood image exhibits the smallest variance in training time, in stark contrast to the high variance observed in CA-dNDWI and CA-NDWI. The five datasets show similar variances in testing time. Regarding memory usage, CA-cNDWI consistently exhibits the lowest variance in both training and testing.
The one-way ANOVA analysis (Table 16) reveals that the choice of dataset has little impact on most effectiveness metrics, as their p-values (close to 1.000) suggest no statistically significant differences. However, subtle variations exist for recall (non-flooded) and precision (flooded) as indicated by their relatively lower p-values (0.6190, 0.8420).
Fine-grained analysis of the effectiveness pairwise comparisons (Table 17, Table 18, Table 19 and Table 20) supports this observation. Table 17 presents the accuracy pairwise comparisons. Table 18, Table 19 and Table 20 present the pairwise comparisons for recall, precision, and F1 for both flooded and non-flooded areas, respectively. Most p-adj values were greater than 0.99; while some were smaller, none fell below the significance threshold.
Similar results were observed for the efficiency metrics. Table 21 presents the one-way ANOVA analysis of the efficiency metrics, showing that none of the p-values fall below the threshold for significant differences.
Table 22 presents the testing and training time comparisons. While the p-adj values for the testing times show some non-significant variation, all p-adj values for the training times equal 1.0000. Table 23 shows the memory usage for training and testing. Although some differences are observed, none of them are statistically significant.

3.4. Impact of CA Datasets on Model Effectiveness (Partial Support for H1)

This section analyses the differences between the CA datasets and the peak-flood image. Metrics were applied to each model and dataset. Table 24 presents the mean and standard deviation for each metric. Table 25 shows the differences between each model and index, including standard deviations.
KNN shows poorer performance across most of the metrics, except for recall (non-flooded) and precision (flooded). For the recall (flooded), precision (non-flooded), and F1 (flooded) metrics, some of the CA datasets fall below the lower standard-deviation threshold. This indicates that the peak-flood dataset yielded more true positives than the CA datasets.
GNB exhibits the largest variation compared to the other models, with most metrics on CA-dNDWI, CA-cNDWI, and CA-NDII showing values at least one standard deviation above the mean. CA-dNDWI enhances sensitivity to flooded areas (true positives), while CA-cNDWI shifts the model’s focus toward identifying non-flooded areas (true negatives). Notably, despite the weak sensitivity of the threshold-based cNDWI baseline to floodwater, GNB effectively identifies intricate relationships between pre- and peak-flood dynamics captured by this index, further improving its performance. It must be noted that GNB started from a lower baseline than the other models.
However, CA-NDWI falls below the lower standard-deviation threshold. As discussed, NDWI has limitations in accurately delineating flood extent. These limitations are inherited by CA-NDWI, resulting in negative differences for GNB, KNN, and EM. This suggests that NDWI may not be a suitable spectral index for fusion with 4-band satellite imagery (the peak-flood image) when mapping floods in this study.
The CA datasets slightly improved the metric values for DT and RF, though the changes remained within the upper and lower thresholds. It is important to note that these models began with a high baseline performance. EM also shows some small positive and negative changes.
Overall, this analysis suggests that the benefits of CA datasets are influenced by model-specific factors, such as the model type or its configuration.
Table 26 presents the results of the paired two-tailed t-test, with one list containing the values of the peak-flood image and the other containing the values from the CA datasets. Instead of comparing the entire set of experiments, as in Table 5, the results were grouped by model. The significance level was set at 0.025. The small sample size may influence the interpretation of the results.
At times, metrics with improved standard deviations also showed significant improvement. For example, the significant differences observed in the KNN values aligned with the consistent negative differences between the two datasets, even when some differences did not exceed one standard deviation.
However, the GNB model exhibited different behaviour. Although CA datasets most significantly impacted GNB, a significant difference was observed only for recall (flooded), where all results improved with the CA datasets. This occurred because, for most metrics, the CA-dNDWI, CA-cNDWI, and CA-NDII datasets exceeded the upper standard deviation threshold, while CA-NDWI fell below the lower threshold. This effectively resulted in the values offsetting each other.
The DT, RF, and EM models had some values that were significantly different, even though the actual differences between them were very small (less than 0.01) and did not exceed a standard deviation. This was partially due to overfitting. Because all the values in DT’s recall (flooded) and precision (non-flooded) were identical (1.00), they had no variance, which would have resulted in a divide-by-zero error when calculating the significant difference.
Overall, this shows that the combination of model and dataset influences the effectiveness results.

3.5. Impact of CA Datasets on Model Efficiency

We consider memory use and time costs as key indicators of efficiency. This analysis focuses on the changes in testing time, training time, training memory, and test memory between the CA datasets and the peak-flood image.
Table 27 and Table 28 reveal that different CA datasets can affect training time, testing time, and memory usage, increasing or decreasing costs. Table 27 outlines the mean and standard deviation across the datasets, while Table 28 outlines the differences between the CA datasets and the peak-flood image, including the number of significant differences observed.
In terms of training time, only the RF and EM models show a larger standard deviation compared to the peak-flood image. The difference for DT is evident across all CA datasets but does not exceed the standard-deviation threshold. However, the number of items is relatively small compared to the value range, so a wider dataset could yield different results. The impact of the DT model influences the RF model to the extent that it passes the standard-deviation threshold. Since EM’s efficiency is determined by its constituent models, it inherently has higher time demands. There are some changes for the KNN and GNB models, but they do not pass a threshold.
The models differ in testing time: KNN falls below the lower standard deviation threshold, while GNB exceeds the upper threshold for most indices. However, the actual differences are fractions of a second and are based on a small sample. The DT, RF, and EM models show increases but do not pass a threshold.
Most training memory usage decreases with the CA datasets. For KNN’s CA-cNDWI and CA-NDII, the decrease exceeds a threshold. The decreases are larger for RF and EM; however, they are not uniform across the indices. For RF, this could be due to similarities in the decision trees. DT also shows a decrease, but it does not exceed the thresholds. GNB is the only model that shows an increase in memory usage, but only CA-NDWI exceeds the threshold. EM benefits from the decreases in most other models.
Testing memory usage follows a similar pattern to training memory: the KNN, DT, RF, and EM models mostly report lower usage, most notably KNN on CA-cNDWI and CA-NDWI. Only GNB reports higher testing memory usage.
Table 29 presents the results of the two-tailed paired t-test for the efficiency metrics. This analysis complements the standard deviation results by comparing the CA datasets with the peak-flood image.
The training times for the RF and EM models were significantly different from the peak-flood image and exceeded the upper standard deviation threshold. The DT model also showed a significant difference but did not exceed the threshold, owing to the orders-of-magnitude spread between the models on this metric.
All of the testing times with the CA datasets were significantly different from the peak-flood image, even though the actual differences were fractions of a second. For most models, the testing time increased; for KNN, it decreased.
For training memory usage, GNB showed a significant increase while RF showed a significant decrease. For testing memory usage, only DT showed a significant decrease.

4. Discussion

This section summarizes the essential findings for informed model-dataset selection, focusing on effectiveness and efficiency within our case study. We also discuss limitations, potential generalizability factors, and avenues for future improvement.

4.1. Comparing the Effectiveness of Each Model-Dataset

Regarding effectiveness, the main issue was the overfitting of DT, RF, and EM: regardless of the metric or dataset, every score exceeded 0.96. This was likely due to the small dataset used in the research, so recommendations involving these three models should be treated with caution, given the potential generalizability issues and the lack of statistically meaningful differences.
KNN and GNB were not overfitted. Of the two, KNN performed better across the different datasets and metrics. For some metrics, the scores were similar; however, significant differences between GNB and the other models were observed for recall (flooded) and F1 (flooded).

4.2. Comparing the Efficiency of Each Model-Dataset

GNB was the fastest model in terms of training and testing. If it scales well, it would be suitable for situations where timely results are necessary.
KNN took an order of magnitude longer than GNB for training and two orders of magnitude longer for testing. The higher testing time reflects KNN's need to compare each test pixel against the stored training pixels at prediction time. It also consumed more memory than the other models, except for EM.
DT took an order of magnitude longer than KNN for training; however, it tested roughly an order of magnitude faster, reflecting DT's O(log n) complexity for traversing a trained tree. Its memory consumption was markedly smaller than that of the other models, suggesting its suitability for situations with strict memory constraints.
RF took roughly an order of magnitude longer than DT in training time, testing time, and memory usage because it builds multiple DTs. Despite training up to two orders of magnitude longer than KNN, it is still faster than KNN when testing.
EM took the longest time and used more memory than the other models. Its training time was heavily influenced by RF, and its testing time was heavily influenced by KNN. EM’s training memory and testing memory usage were influenced by all the other models.

4.3. Challenges: Model Performance and Generalizability

4.3.1. KNN: Local Variance and False Positives

Table 24, Table 25, Table 26, Table 27, Table 28 and Table 29 reveal that KNN does not benefit from the CA datasets, with the exception of testing time. For almost all effectiveness metrics, it performs worse, with significant declines in some metrics. This indicates that KNN was misclassifying flooded pixels as non-flooded pixels. This can be attributed to two main factors: (a) KNN’s sensitivity to local variance; and (b) potential noise amplification in CA datasets.
CA datasets utilize spectral indices to amplify differences between urban features. However, this process may inadvertently amplify noise in individual pixels. With k = 3, KNN's classification relies on majority votes from only the three nearest neighbors. This setting makes KNN susceptible to local noise, especially at water body edges or in mixed urban zones, where pixel-based features can be ambiguous; it also increases the risk of local overfitting. Figure 6 supports this inference: the false positives are unevenly distributed across the northern half of the image, where river overflow occurred, highlighting the model's difficulty in generalizing to unseen pixels.
Furthermore, the local focus imposed by a small k value can lead to unstable classifications. The incorporation of spectral indices in CA datasets increases the complexity of the feature space. The limited local focus (k = 3) makes it difficult for KNN to learn global relationships between urban features (for example, reflectance from roof shadows or water bodies in different areas) from this altered feature space.
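The contrast discussed above can be made concrete with a short sketch, assuming scikit-learn as the underlying library; the study used k = 3, while the alternative settings are illustrative directions rather than tested configurations.

```python
# k = 3 as used in this study versus two noise-robust variants
# (scikit-learn assumed; the alternative settings are illustrative).
from sklearn.neighbors import KNeighborsClassifier

knn_used = KNeighborsClassifier(n_neighbors=3)  # majority vote over only 3 neighbors
knn_larger_k = KNeighborsClassifier(n_neighbors=15)  # a larger k averages out isolated noisy pixels
knn_weighted = KNeighborsClassifier(n_neighbors=15, weights="distance")  # nearer pixels weigh more
```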

4.3.2. GNB: Skewed Feature Distribution and False Positives

As observed in Figure 7, GNB exhibits varying degrees of false positives across all datasets, including the peak-flood image. Notably, CA-cNDWI reduces the false positives. In Table 25, across all the metrics, GNB is the only model whose values pass the upper standard deviation threshold compared with the peak-flood image. In this dataset, each pixel's feature vector comprises eight values, one per band at that spatial location, capturing the pre- and post-flood dynamics of the four bands (B, G, R, NIR); these rich features allow GNB to learn their relationships and thus potentially mitigate false positives.
However, the issue of false positives still persists, likely due to class imbalance. In such scenarios, GNB’s assumption of a Gaussian distribution [55] for each class (positive or negative pixels) is violated, as the minority class (positive pixels) may not adequately represent the true underlying distribution. This leads to GNB’s decision boundaries being biased towards the majority class (negative pixels), making it more likely to misclassify pixels from the minority class as belonging to the majority class.
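One illustrative way to probe this imbalance effect, assuming scikit-learn's GaussianNB (the study's configuration is not detailed beyond the Gaussian assumption), is to override the empirical class priors:

```python
# Hedged sketch: overriding GaussianNB's empirical class priors to probe
# the majority-class bias (scikit-learn assumed; not the study's setup).
from sklearn.naive_bayes import GaussianNB

gnb_default = GaussianNB()  # priors estimated from the data (~74% non-flooded vs. ~26% flooded)
gnb_equal = GaussianNB(priors=[0.5, 0.5])  # equal priors shift the boundary toward the minority class
```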

4.3.3. Overfitting in DT and RF: High Model Complexity

Despite employing 10-fold stratified cross-validation, DT and RF overfit each dataset due to the sheer volume of the data (3,647,904 pixels multiplied by the number of channels). This overfitting is attributable to the default settings of both models, which permit unrestricted tree growth. As a result, complex trees are constructed that memorize the training data, including noise and outliers.
Our results on the CA datasets show slight improvements, but they are not significant, potentially because the subtle urban complexity, combined with the added spectral signatures, produces even more complex trees for distinguishing floodwater. It must be stressed that neither model has demonstrated generalizability to unseen data; constraining tree growth, sketched below, is one way to address this.
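A hedged sketch of such restricted configurations follows (scikit-learn assumed; the depth and pruning values are illustrative, not tuned on our data):

```python
# Hedged sketch of restricted tree growth versus the defaults used here
# (scikit-learn assumed; the hyperparameter values are illustrative only).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dt_default = DecisionTreeClassifier()  # default: leaves split until pure, memorizing noise
dt_pruned = DecisionTreeClassifier(max_depth=10, ccp_alpha=1e-4)  # depth cap plus cost-complexity pruning
rf_pruned = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_leaf=50)
```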

4.3.4. EM’s Soft Voting: Inherited Overfitting and False Positives (Refuting H2)

The EM employs a soft voting strategy to determine each pixel's prediction: if two of the four models assign a positive label to a pixel, the average binary value reaches the pre-defined threshold of 0.5, and the pixel is classified as positive (see the sketch below). This mechanism contributes to EM's overfitting, which arises from the overlapping predictions of DT and RF; its false positives, in turn, are driven by the overlapping predictions of GNB and KNN in non-flooded areas.
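A minimal sketch of the described rule, averaging the four binary predictions and thresholding at 0.5 (NumPy assumed):

```python
# Minimal sketch of the soft-voting rule as described: average the four
# models' binary (0/1) predictions per pixel and threshold at 0.5.
import numpy as np

def ensemble_predict(pred_knn, pred_gnb, pred_dt, pred_rf, threshold=0.5):
    """Each argument is a 0/1 prediction array over the test pixels."""
    avg = np.mean(np.stack([pred_knn, pred_gnb, pred_dt, pred_rf]), axis=0)
    return (avg >= threshold).astype(int)  # >= 0.5 means at least two of the four vote positive
```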
As illustrated in Table 5 and Figure 10, EM’s performance, as measured by F1 (flooded), falls below that of DT and RF but exceeds that of KNN and GNB across all datasets. These findings refute H2, which hypothesized that the ensemble model would achieve superior performance in flood mapping compared to the individual constituent models. This suggests that the soft voting strategy, in this case, did not lead to the anticipated improvements, particularly since DT and RF are tightly coupled.

4.3.5. Generalizability Issue: Using a Single Dataset

Due to data accessibility and time constraints, we could not test our approach in additional urban environments. Our case study, Brisbane, primarily has moderately dense commercial areas in the center and more sparsely populated suburbs [45], a layout that can produce small, localized water pooling in the urban core and broader water dispersal in the suburbs.
However, generalizing these findings to other urban environments may be challenging due to variations in key features. Urban layout (for example, extensive impervious surfaces in high-density cities) [65,66], green infrastructure (runoff reduction) [67], and drainage systems (affecting floodwater extent and depth) [68] all influence floodwater interaction with urban surfaces. These factors could lead to diverse water pooling patterns and alter spectral reflectance, potentially hindering inundation delineation in our model.

4.4. Future Directions

Future work will focus on refining our flood mapping approach in two respects. First, we will perform model-specific hyperparameter tuning. For KNN, we will explore the optimal ‘k’ value using particle swarm optimization [69,70] and distance-based voting; a simple cross-validated search is sketched below as a starting point. For DT and RF, we will employ pruning techniques, such as cost-complexity pruning [71], to optimize tree depth. Additionally, for EM, majority voting strategies will be investigated. Second, to assess generalizability, we will expand validation by incorporating diverse geographical characteristics, such as comparing performance in flat versus mountainous terrain, and by considering additional contexts, including urban infrastructure data and socio-economic factors.
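This sketch assumes scikit-learn, and the candidate values are illustrative; it is a simple stand-in for the PSO-based tuning cited above.

```python
# Hedged sketch of a cross-validated search over 'k' and the voting scheme
# (scikit-learn assumed; candidate values are illustrative, not tuned).
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 9, 15, 25], "weights": ["uniform", "distance"]},
    cv=10,         # mirrors the 10-fold protocol used in the study
    scoring="f1",  # F1 (flooded) was a key effectiveness metric
)
# search.fit(X_train, y_train); search.best_params_ then gives the chosen setting.
```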

5. Conclusions

In this study, we investigated the integration of a row-wise CA algorithm with ML-based ensemble learning to enhance flood mapping in a riverine city. Our findings indicate that CA datasets can have both positive and negative effects on the effectiveness and efficiency of individual models, which in turn influences the performance of the EM. While EM can achieve higher accuracy in flood segmentation, its reliance on a predefined soft voting strategy leads to the inheritance of drawbacks (overfitting and false positives) from the individual models, ultimately hindering overall effectiveness. Furthermore, the high computational demands of KNN and RF restrict EM’s capability for real-time flood monitoring.
Additionally, our study identifies two other limitations: the learning settings of the individual models and the potential generalizability issue. Despite these limitations, our research provides valuable insights into post-flood analysis and the integration of attention mechanisms with ML-based ensemble learning, serving as a reference for future studies. Future research will focus on refining the models and expanding validation to a broader range of geographical conditions.

Author Contributions

Conceptualization, A.W. and H.X.; data curation A.W. and H.X.; formal analysis, H.X.; investigation, H.X.; methodology, H.X.; project administration, A.W.; resources, H.X. and A.W.; software, H.X. and A.W.; supervision, A.W.; validation, H.X.; visualization, H.X.; writing—original draft, H.X.; writing—review and editing, H.X. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon request.

Acknowledgments

We extend our gratitude to N. Levin and S. Phinn for providing valuable insights and access to their PlanetScope materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Floods. Available online: https://www.acs.gov.au/pages/floods (accessed on 21 September 2024).
  2. Eastern Australian Floods Generated a Whopping $4.7 Billion in Industry Loss. Available online: https://www.dsh.gov.au/news-and-media/eastern-australian-floods-generated-whopping-47-billion-industry-loss (accessed on 21 September 2024).
  3. Zhong, S.; Yang, L.; Toloo, S.; Wang, Z.; Tong, S.; Sun, X.; Crompton, D.; FitzGerald, G.; Huang, C. The long-term physical and psychological health impacts of flooding: A systematic mapping. Sci. Total Environ. 2018, 626, 165–194. [Google Scholar] [CrossRef] [PubMed]
  4. Ahern, M.; Kovats, S. The health impacts of floods. In Flood Hazards and Health; Routledge: Oxfordshire, UK, 2013; pp. 28–53. [Google Scholar]
  5. Alderman, K.; Turner, L.R.; Tong, S. Floods and human health: A systematic review. Environ. Int. 2012, 47, 37–47. [Google Scholar] [CrossRef] [PubMed]
  6. Raadgever, G.; Booister, N.; Steenstra, M.K. Flood risk management strategies. In Flood Risk Management Strategies and Governance; Springer: Cham, Switzerland, 2018; pp. 93–100. [Google Scholar]
  7. Moatty, A. Post-flood recovery: An opportunity for disaster risk reduction? In Floods; Elsevier: Amsterdam, The Netherlands, 2017; pp. 349–363. [Google Scholar]
  8. Islam, M.T.; Meng, Q. An exploratory study of Sentinel-1 SAR for rapid urban flood mapping on Google Earth Engine. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 103002. [Google Scholar]
  9. Mason, D.C.; Dance, S.L.; Cloke, H.L. Floodwater detection in urban areas using Sentinel-1 and WorldDEM data. J. Appl. Remote Sens. 2021, 15, 032003. [Google Scholar] [CrossRef]
  10. Sy, H.M.; Luu, C.; Bui, Q.D.; Ha, H.; Nguyen, D.Q. Urban flood risk assessment using Sentinel-1 on the google earth engine: A case study in Thai Nguyen city, Vietnam. Remote Sens. Appl. Soc. Environ. 2023, 31, 100987. [Google Scholar]
  11. Plant Health. Available online: https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2/Plant_health (accessed on 22 September 2024).
  12. Sadek, M.; Li, X. Low-Cost Solution for Assessment of Urban Flash Flood Impacts Using Sentinel-2 Satellite Images and Fuzzy Analytic Hierarchy Process: A Case Study of Ras Ghareb City, Egypt. Adv. Civ. Eng. 2019, 2019, 2561215. [Google Scholar] [CrossRef]
  13. Kuc, G.; Chormański, J. Sentinel-2 imagery for mapping and monitoring imperviousness in urban areas. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, 42, 43–47. [Google Scholar] [CrossRef]
  14. Caballero, I.; Ruiz, J.; Navarro, G. Sentinel-2 satellites provide near-real time evaluation of catastrophic floods in the west mediterranean. Water 2019, 11, 2499. [Google Scholar] [CrossRef]
  15. Hemati, M.; Hasanlou, M.; Mahdianpari, M.; Mohammadimanesh, F. A systematic review of landsat data for change detection applications: 50 years of monitoring the earth. Remote Sens. 2021, 13, 2869. [Google Scholar] [CrossRef]
  16. Demarquet, Q.; Rapinel, S.; Dufour, S.; Hubert-Moy, L. Long-term wetland monitoring using the landsat archive: A review. Remote Sens. 2023, 15, 820. [Google Scholar] [CrossRef]
  17. Mukherjee, R.; Zhang, Z.; Giezendanner, J.; Melancon, A.; Tellman, B.; Gurung, I.; Molthan, A. FloodPlanet: A Multi-Sourced Hand-Labeled Dataset of PlanetScope, Sentinel-1, Sentinel-2, and Landsat 8 Imagery for Enhanced Global Flood Monitoring. In Proceedings of the AGU Fall Meeting Abstracts, San Francisco, CA, USA, 11–15 December 2023; p. IN51C-0442. [Google Scholar]
  18. PlanetScope Overview. Available online: https://developers.planet.com/docs/data/planetscope/ (accessed on 22 September 2024).
  19. Theocharidis, C.; Argyriou, A.V.; Tsouni, A.; Kaskara, M.; Kontoes, C. Comparative analysis of Sentinel-1 and PlanetScope imagery for flood mapping of Evros River, Greece. In Proceedings of the Ninth International Conference on Remote Sensing and Geoinformation of the Environment (RSCy2023), Ayia Napa, Cyprus, 3–5 April 2023; pp. 465–474. [Google Scholar]
  20. Gao, B.-C. NDWI—A normalized difference water index for remote sensing of vegetation liquid water from space. Remote Sens. Environ. 1996, 58, 257–266. [Google Scholar] [CrossRef]
  21. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  22. Levin, N.; Phinn, S. Assessing the 2022 flood impacts in Queensland combining daytime and nighttime optical and imaging radar data. Remote Sens. 2022, 14, 5009. [Google Scholar] [CrossRef]
  23. Mallinis, G.; Gitas, I.Z.; Giannakopoulos, V.; Maris, F.; Tsakiri-Strati, M. An object-based approach for flood area delineation in a transboundary area using ENVISAT ASAR and LANDSAT TM data. Int. J. Digit. Earth 2013, 6, 124–136. [Google Scholar] [CrossRef]
  24. Li, L.; Woodley, A.; Chappell, T. Mapping Urban Floods via Spectral Indices and Machine Learning Algorithms. Sustainability 2024, 16, 2493. [Google Scholar] [CrossRef]
  25. Wu, W.; Li, Q.; Zhang, Y.; Du, X.; Wang, H. Two-step urban water index (TSUWI): A new technique for high-resolution mapping of urban surface water. Remote Sens. 2018, 10, 1704. [Google Scholar] [CrossRef]
  26. Farhadi, H.; Ebadi, H.; Kiani, A.; Asgary, A. A novel flood/water extraction index (FWEI) for identifying water and flooded areas using sentinel-2 visible and near-infrared spectral bands. Stoch. Environ. Res. Risk Assess. 2024, 38, 1873–1895. [Google Scholar] [CrossRef]
  27. Hosni, M.; Boushaba, F.; Chourak, M. A Systematic Literature Review on Classification Machine Learning for Urban Flood Hazard Mapping. Water Resour. Manag. 2024, 38, 5823–5864. [Google Scholar]
  28. Lamovec, P.; Veljanovski, T.; Mikoš, M.; Oštir, K. Detecting flooded areas with machine learning techniques: Case study of the Selška Sora river flash flood in September 2007. J. Appl. Remote Sens. 2013, 7, 073564. [Google Scholar] [CrossRef]
  29. Seydi, S.T.; Kanani-Sadat, Y.; Hasanlou, M.; Sahraei, R.; Chanussot, J.; Amani, M. Comparison of machine learning algorithms for flood susceptibility mapping. Remote Sens. 2022, 15, 192. [Google Scholar] [CrossRef]
  30. Lee, S.; Kim, J.-C.; Jung, H.-S.; Lee, M.J.; Lee, S. Spatial prediction of flood susceptibility using random-forest and boosted-tree models in Seoul metropolitan city, Korea. Geomat. Nat. Hazards Risk 2017, 8, 1185–1203. [Google Scholar] [CrossRef]
  31. Kulk, G.; Sathyendranath, S.; Platt, T.; George, G.; Suresan, A.K.; Menon, N.; Evers-King, H.; Abdulaziz, A. Using Multi-Spectral Remote Sensing for Flood Mapping: A Case Study in Lake Vembanad, India. Remote Sens. 2023, 15, 5139. [Google Scholar] [CrossRef]
  32. Lombana, L.; Martínez-Graña, A. A flood mapping method for land use management in small-size water bodies: Validation of spectral indexes and a machine learning technique. Agronomy 2022, 12, 1280. [Google Scholar] [CrossRef]
  33. Gašparović, M.; Klobučar, D. Mapping floods in lowland forest using sentinel-1 and sentinel-2 data and an object-based approach. Forests 2021, 12, 553. [Google Scholar] [CrossRef]
  34. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  35. Li, W.; Wu, J.; Chen, H.; Wang, Y.; Jia, Y.; Gui, G. Unet combined with attention mechanism method for extracting flood submerged range. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6588–6597. [Google Scholar] [CrossRef]
  36. Elghouat, A.; Algouti, A.; Algouti, A.; Baid, S. Attention is all you need for an improved CNN-based flash flood susceptibility modeling. The case of the ungauged Rheraya watershed, Morocco. arXiv 2024, arXiv:2408.02692. [Google Scholar]
  37. Dixit, P.; Roy, B.N.; Rout, D. Deep Learning Approach for Flood Mapping Using Satellite Images Dataset. In Proceedings of the International Conference on Intelligent Systems Design and Applications, Olten, Switzerland, 11–13 December 2023; pp. 12–20. [Google Scholar]
  38. Ahmad, M.N.; Shao, Z.; Xiao, X.; Fu, P.; Javed, A.; Ara, I. A novel ensemble learning approach to extract urban impervious surface based on machine learning algorithms using SAR and optical data. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104013. [Google Scholar] [CrossRef]
  39. Islam, A.R.M.T.; Bappi, M.M.R.; Alqadhi, S.; Bindajam, A.A.; Mallick, J.; Talukdar, S. Improvement of flood susceptibility mapping by introducing hybrid ensemble learning algorithms and high-resolution satellite imageries. Nat. Hazards 2023, 119, 1–37. [Google Scholar] [CrossRef]
  40. Jony, R.I.; Woodley, A.; Raj, A.; Perrin, D. Ensemble classification technique for water detection in satellite images. In Proceedings of the 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; pp. 1–8. [Google Scholar]
  41. Shahabi, H.; Shirzadi, A.; Ghaderi, K.; Omidvar, E.; Al-Ansari, N.; Clague, J.J.; Geertsema, M.; Khosravi, K.; Amini, A.; Bahrami, S. Flood detection and susceptibility mapping using sentinel-1 remote sensing data and a machine learning approach: Hybrid intelligence of bagging ensemble based on k-nearest neighbor classifier. Remote Sens. 2020, 12, 266. [Google Scholar] [CrossRef]
  42. Zhang, Y.; Liu, J.; Shen, W. A review of ensemble learning algorithms used in remote sensing applications. Appl. Sci. 2022, 12, 8654. [Google Scholar] [CrossRef]
  43. Vaswani, A. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  44. Berrar, D. Performance Measures for Binary Classification. In Reference Module in Life Sciences; Elsevier: Amsterdam, The Netherlands, 2018. [Google Scholar]
  45. Hou, Y.; Wei, Y.; Wu, S.; Li, J. Mapping the Social, Economic, and Ecological Impact of Floods in Brisbane. Water 2023, 15, 3842. [Google Scholar] [CrossRef]
  46. World Geodetic System 1984 (WGS84). Available online: https://www.linz.govt.nz/guidance/geodetic-system/coordinate-systems-used-new-zealand/geodetic-datums/world-geodetic-system-1984-wgs84 (accessed on 30 September 2024).
  47. Open Data Brisbane City Council. Flood—Awareness—Historic—Brisbane River and Creek Floods—Feb–2022. Available online: https://www.spatial-data.brisbane.qld.gov.au/datasets/26f2d6ad138043c69326ce0e2259dfd8_0/about (accessed on 24 September 2024).
  48. Esri Data and Maps. World Water Bodies. Available online: https://hub.arcgis.com/content/esri::world-water-bodies/about (accessed on 24 September 2024).
  49. Developers. Available online: https://developers.planet.com/docs/apis/ (accessed on 22 September 2024).
  50. QGIS. Available online: https://www.qgis.org/ (accessed on 24 September 2024).
  51. McFeeters, S.K. Using the normalized difference water index (NDWI) within a geographic information system to detect swimming pools for mosquito abatement: A practical approach. Remote Sens. 2013, 5, 3544–3561. [Google Scholar] [CrossRef]
  52. Horn, R.A. The Hadamard product. In Proceedings of the Symposia in Applied Mathematics, Louisville, Kentucky, 16–17 January 1990; pp. 87–169. [Google Scholar]
  53. Steinbach, M.; Tan, P.-N. kNN: K-nearest neighbors. In The Top Ten Algorithms in Data Mining; Chapman and Hall: London, UK; CRC: Boca Raton, FL, USA, 2009; pp. 165–176. [Google Scholar]
  54. Gauhar, N.; Das, S.; Moury, K.S. Prediction of flood in Bangladesh using K-nearest neighbors algorithm. In Proceedings of the 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 5–7 January 2021; pp. 357–361. [Google Scholar]
  55. Vikramkumar, B.; Vijaykumar, T. Bayes and naive bayes classifier. arXiv 2014, arXiv:1404.0933. [Google Scholar]
  56. Pan, J.-X.; Fang, K.-T. Maximum likelihood estimation. In Growth Curve Models and Statistical Diagnostics; Springer: New York, NY, USA, 2002; pp. 77–158. [Google Scholar]
  57. Song, Y.-Y.; Ying, L. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130. [Google Scholar] [PubMed]
  58. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  59. Berrar, D. Cross-validation. In Reference Module in Life Sciences; Elsevier: Amsterdam, The Netherlands, 2019. [Google Scholar]
  60. Akshitha, K.; Anand, M.V. Flood Prediction System with Voting Classifier. In Proceedings of the 2024 2nd International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 15–16 March 2024; pp. 306–311. [Google Scholar]
  61. McGill, R.; Tukey, J.W.; Larsen, W.A. Variations of box plots. Am. Stat. 1978, 32, 12–16. [Google Scholar] [CrossRef]
  62. Quirk, T.J.; Quirk, T.J. One-way analysis of variance (ANOVA). In Excel 2007 for Educational and Psychological Statistics: A Guide to Solving Practical Problems; Springer: New York, NY, USA, 2012; pp. 163–179. [Google Scholar]
  63. Abdi, H.; Williams, L.J. Tukey’s honestly significant difference (HSD) test. Encycl. Res. Des. 2010, 3, 1–5. [Google Scholar]
  64. Segretier, W.; Collard, M.; Clergue, M. Evolutionary predictive modelling for flash floods. In Proceedings of the 2013 IEEE Congress on Evolutionary Computation, Cancun, Mexico, 20–23 June 2013; pp. 844–851. [Google Scholar]
  65. Herold, M.; Roberts, D.A.; Gardner, M.E.; Dennison, P.E. Spectrometry for urban area remote sensing—Development and analysis of a spectral library from 350 to 2400 nm. Remote Sens. Environ. 2004, 91, 304–319. [Google Scholar] [CrossRef]
  66. Su, S.; Tian, J.; Dong, X.; Tian, Q.; Wang, N.; Xi, Y. An impervious surface spectral index on multispectral imagery using visible and near-infrared bands. Remote Sens. 2022, 14, 3391. [Google Scholar] [CrossRef]
  67. Green, D.; O’Donnell, E.; Johnson, M.; Slater, L.; Thorne, C.; Zheng, S.; Stirling, R.; Chan, F.K.; Li, L.; Boothroyd, R.J. Green infrastructure: The future of urban flood risk management? Wiley Interdiscip. Rev. Water 2021, 8, e1560. [Google Scholar] [CrossRef]
  68. Liang, C.; Guan, M. Effects of urban drainage inlet layout on surface flood dynamics and discharge. J. Hydrol. 2024, 632, 130890. [Google Scholar] [CrossRef]
  69. Razavi-Termeh, S.V.; Sadeghi-Niaraki, A.; Razavi, S.; Choi, S.-M. Enhancing flood-prone area mapping: Fine-tuning the K-nearest neighbors (KNN) algorithm for spatial modelling. Int. J. Digit. Earth 2024, 17, 2311325. [Google Scholar] [CrossRef]
  70. Sarwar, J.; Khan, S.A.; Azmat, M.; Khan, F. A comparative analysis of feature selection models for spatial analysis of floods using hybrid metaheuristic and machine learning models. Environ. Sci. Pollut. Res. 2024, 31, 33495–33514. [Google Scholar] [CrossRef]
  71. Ravi, K.B.; Serra, J. Cost-complexity pruning of random forests. arXiv 2017, arXiv:1703.05430. [Google Scholar]
Figure 1. The extent of flooding in Brisbane during February 2022. The figure, based on [24], was created by combining Brisbane City Council (BCC) Flood Map [47], and World Water Bodies shapefiles [48].
Figure 2. Comparison of pre-flood (a) and peak-flood (c) images in true color and false color (NIR, R, G) composites (b,d). In the false color images (b,d), red indicates vegetation, dark olive green represents water bodies, and other colors depict buildings and urban areas.
Figure 3. Illustration of the creation of the labeled dataset image (c). It combines a river mask (a) with a flood extent map (b) to isolate flooded areas (white) while excluding the river (blue).
Figure 4. Comparisons of the distribution of pixel values from all bands of each CA dataset, with (a–d) referring to CA-dNDWI, CA-cNDWI, CA-NDII, and CA-NDWI, respectively.
Figure 5. Baseline flood maps. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas. (a,b) depict the flood maps obtained by applying 0.2 and 0.3 thresholds to NDWIpeak-flood. (c) shows the flood map of NDII using a 0.1 threshold. (d) depicts the flood map of cNDWI using a 0.1 threshold. (e) delineates the flood map of dNDWI using a threshold of 0.2.
Figure 6. KNN flood prediction maps for five datasets: (a) peak-flood image, (b) CA-dNDWI, (c) CA-cNDWI, (d) CA-NDII, and (e) CA-NDWI. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas.
Figure 7. GNB flood prediction maps for five datasets: (a) peak-flood image, (b) CA-dNDWI, (c) CA-cNDWI, (d) CA-NDII, and (e) CA-NDWI. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas.
Figure 8. DT flood prediction maps for five datasets: (a) peak-flood image, (b) CA-dNDWI, (c) CA-cNDWI, (d) CA-NDII, and (e) CA-NDWI. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas.
Figure 9. RF flood prediction maps for five datasets: (a) peak-flood image, (b) CA-dNDWI, (c) CA-cNDWI, (d) CA-NDII, and (e) CA-NDWI. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas.
Figure 10. EM flood prediction maps for five datasets: (a) peak-flood image, (b) CA-dNDWI, (c) CA-cNDWI, (d) CA-NDII, and (e) CA-NDWI. Each sub-figure delineates inundated water (white) from non-inundated areas (black), which include both permanent water and non-water areas.
Figure 11. Boxplots of effectiveness metrics for each model (KNN, GNB, DT, RF, EM) across five datasets (the peak-flood image, CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI): (a) accuracy; (b) recall (flooded); (c) recall (non-flooded); (d) precision (flooded); (e) precision (non-flooded); (f) F1 (flooded); and (g) F1 (non-flooded). The sky-blue line represents the median value across the five datasets for each model, while blank circles indicate outliers.
Figure 12. Boxplots of efficiency metrics for each model (KNN, GNB, DT, RF, EM) across five datasets (the peak-flood image, CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI): (a) training time; (b) testing time; (c) training memory usage; and (d) testing memory usage. The sky-blue line represents the median value across the five datasets for each model, while blank circles indicate outliers.
Figure 13. Boxplots of effectiveness metrics for each dataset (the peak-flood image, CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI) across five models (KNN, GNB, DT, RF, EM): (a) accuracy; (b) recall (flooded); (c) recall (non-flooded); (d) precision (flooded); (e) precision (non-flooded); (f) F1 (flooded); and (g) F1 (non-flooded). The sky-blue line represents the median value across the five models for each dataset, while blank circles indicate outliers.
Figure 14. Boxplots of efficiency metrics for each dataset (the peak-flood image, CA-dNDWI, CA-cNDWI, CA-NDII, CA-NDWI) across five models (KNN, GNB, DT, RF, EM): (a) training time; (b) testing time; (c) training memory usage; and (d) testing memory usage. The sky-blue line represents the median value across the five models for each dataset, while blank circles indicate outliers.
Table 1. Spectral Bands of PlanetScope Imagery.

| Order | Band | Type | Wavelength (nm) |
|---|---|---|---|
| 1 | Blue (B) | Visible | 455–515 |
| 2 | Green (G) | Visible | 500–590 |
| 3 | Red (R) | Visible | 590–670 |
| 4 | Near-infrared (NIR) | Invisible | 780–860 |
Table 2. Characteristics of the pre-processed labeled image.

| Parameter | Value |
|---|---|
| Width | 1184 pixels |
| Height | 3081 pixels |
| Inundated Regions | 25.71% (938,033 pixels) |
| Non-Inundated Regions | 74.29% (2,709,871 pixels) |
| Spatial Pixel Size (Width × Height) | 4.65 × 5.25 m |
| Image Size (Width × Height) | 5.51 × 16.18 km |
Table 3. Summary of effectiveness metrics for five baselines (spectral indices with pre-defined thresholds). Bold values highlight the maximum score for each metric.

| Baseline | Threshold | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|---|
| NDWIpeak-flood | 0.2 | 0.692 | 0.429 | 0.784 | 0.407 | 0.798 | 0.418 | 0.791 |
| NDWIpeak-flood | 0.3 | 0.763 | 0.149 | **0.976** | 0.686 | 0.768 | 0.244 | 0.860 |
| NDII | 0.1 | **0.833** | 0.506 | 0.946 | **0.766** | 0.847 | 0.609 | **0.894** |
| cNDWI | 0.1 | 0.587 | 0.243 | 0.706 | 0.222 | 0.729 | 0.232 | 0.717 |
| dNDWI | 0.2 | 0.794 | **0.639** | 0.848 | 0.593 | **0.871** | **0.615** | 0.860 |
Table 4. The mean and standard deviation for the effectiveness metrics. The mean value for each of the metrics is close to 1, indicating that most of the approaches performed strongly.

| Metric | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|
| Mean | 0.934 | 0.842 | 0.965 | 0.894 | 0.948 | 0.862 | 0.956 |
| Standard Deviation | 0.093 | 0.210 | 0.644 | 0.160 | 0.069 | 0.184 | 0.064 |
Table 5. Summary of effectiveness metrics for model-dataset combinations. Values significantly lower than the mean are marked with daggers (†). The number of symbols corresponds to the number of standard deviations below the mean. No values are significantly higher than the mean.

| Model | Dataset | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|---|
| KNN | Peak-Flood Image | 0.918 | 0.798 | 0.959 | 0.871 | 0.932 | 0.833 | 0.945 |
| | CA-dNDWI | 0.897 | 0.701 | 0.965 | 0.873 | 0.903 | 0.778 | 0.933 |
| | CA-cNDWI | 0.909 | 0.745 | 0.966 | 0.885 | 0.916 | 0.809 | 0.941 |
| | CA-NDII | 0.893 | 0.694 | 0.962 | 0.864 | 0.901 | 0.770 | 0.930 |
| | CA-NDWI | 0.887 | 0.709 | 0.949 | 0.828 | 0.904 | 0.764 | 0.926 |
| GNB | Peak-Flood Image | 0.752 † | 0.402 ††† | 0.873 † | 0.523 ††† | 0.808 ††† | 0.455 ††† | 0.839 † |
| | CA-dNDWI | 0.825 † | 0.555 † | 0.918 | 0.702 † | 0.856 † | 0.619 † | 0.886 † |
| | CA-cNDWI | 0.841 | 0.455 † | 0.975 | 0.862 | 0.838 † | 0.596 | 0.901 |
| | CA-NDII | 0.831 † | 0.519 † | 0.939 | 0.747 | 0.849 † | 0.613 † | 0.892 † |
| | CA-NDWI | 0.652 ††† | 0.531 † | 0.694 ††† | 0.376 ††† | 0.810 † | 0.440 ††† | 0.748 ††† |
| DT | Peak-Flood Image | 0.997 | 0.989 | 1.000 | 1.000 | 0.996 | 0.995 | 0.998 |
| | CA-dNDWI | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-cNDWI | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-NDII | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-NDWI | 0.998 | 0.993 | 1.000 | 1.000 | 0.998 | 0.996 | 0.999 |
| RF | Peak-Flood Image | 0.997 | 0.991 | 0.999 | 0.998 | 0.997 | 0.995 | 0.998 |
| | CA-dNDWI | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-cNDWI | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-NDII | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | CA-NDWI | 0.998 | 0.994 | 0.999 | 0.998 | 0.998 | 0.996 | 0.999 |
| EM | Peak-Flood Image | 0.989 | 0.991 | 0.989 | 0.969 | 0.997 | 0.980 | 0.993 |
| | CA-dNDWI | 0.990 | 0.999 | 0.987 | 0.964 | 1.000 | 0.982 | 0.993 |
| | CA-cNDWI | 0.994 | 1.000 | 0.992 | 0.977 | 1.000 | 0.988 | 0.996 |
| | CA-NDII | 0.990 | 0.999 | 0.987 | 0.964 | 1.000 | 0.981 | 0.993 |
| | CA-NDWI | 0.983 | 0.994 | 0.979 | 0.943 | 0.998 | 0.968 | 0.988 |
Table 6. The mean and standard deviation for the efficiency metrics. The mean training time is larger than the mean testing time.

| Metric | Training Time (s) | Testing Time (s) | Training Memory Usage (MB) | Testing Memory Usage (MB) |
|---|---|---|---|---|
| Mean | 1337.935 | 27.655 | 3186.419 | 4475.821 |
| Standard Deviation | 1575.478 | 55.105 | 5212.443 | 7212.900 |
Table 7. Summary of efficiency metrics for model-dataset combinations. Values significantly different from the mean are marked with two symbols: asterisks (*) for higher values and daggers (†) for lower values. The number of symbols corresponds to the number of standard deviations above or below the mean.

| Model | Dataset | Training Time (s) | Testing Time (s) | Training Memory (MB) | Testing Memory Usage (MB) |
|---|---|---|---|---|---|
| KNN | Peak-Flood Image | 64.837 | 48.614 | 5955.547 * | 5948.859 |
| | CA-dNDWI | 61.188 | 46.350 | 6079.125 * | 5869.203 |
| | CA-cNDWI | 72.040 | 50.509 | 4697.828 | 4896.906 |
| | CA-NDII | 59.614 | 44.955 | 5212.016 | 5080.719 |
| | CA-NDWI | 60.689 | 45.508 | 5405.656 * | 4333.297 |
| GNB | Peak-Flood Image | 4.183 | 0.113 † | 1652.484 | 1749.891 |
| | CA-dNDWI | 4.176 | 0.230 | 1864.734 | 1864.734 |
| | CA-cNDWI | 5.059 | 0.278 | 1965.203 | 1972.578 |
| | CA-NDII | 4.215 | 0.227 | 2099.375 | 2099.375 |
| | CA-NDWI | 4.246 | 0.258 | 2220.703 | 2221.359 |
| DT | Peak-Flood Image | 190.609 | 1.013 | 779.844 † | 875.516 † |
| | CA-dNDWI | 676.742 | 1.338 | 597.297 † | 700.250 † |
| | CA-cNDWI | 832.863 | 0.961 | 730.516 † | 826.188 † |
| | CA-NDII | 477.909 | 1.076 | 555.609 † | 691.375 † |
| | CA-NDWI | 696.648 | 1.254 | 634.875 † | 741.938 † |
| RF | Peak-Flood Image | 1101.444 | 18.546 | 2985.891 | 7391.078 * |
| | CA-dNDWI | 3601.179 * | 22.033 | 2215.188 | 7103.000 |
| | CA-cNDWI | 2363.282 | 20.116 | 2158.156 | 6716.516 |
| | CA-NDII | 2949.289 * | 20.711 | 2674.641 | 7469.406 * |
| | CA-NDWI | 3493.978 * | 21.585 | 1825.625 | 7331.672 * |
| EM | Peak-Flood Image | 1361.073 | 68.287 * | 5955.547 * | 7391.078 * |
| | CA-dNDWI | 4343.285 * | 69.958 * | 6079.125 * | 7103.000 |
| | CA-cNDWI | 3273.243 * | 71.864 * | 4697.828 | 6716.516 |
| | CA-NDII | 3491.026 * | 66.968 * | 5212.016 | 7469.406 |
| | CA-NDWI | 4255.560 * | 68.606 * | 5405.656 * | 7331.672 * |
Table 8. Summary of p-values from one-way ANOVA for five models across five datasets, using the effectiveness metrics.

| Effectiveness Metrics | p-Value |
|---|---|
| Accuracy | 9.7880 × 10⁻⁹ |
| Recall (Flooded) | 2.8140 × 10⁻¹⁶ |
| Recall (Non-Flooded) | 4.8080 × 10⁻³ |
| Precision (Flooded) | 7.3880 × 10⁻⁶ |
| Precision (Non-Flooded) | 4.5810 × 10⁻¹⁶ |
| F1 (Flooded) | 2.5020 × 10⁻¹³ |
| F1 (Non-Flooded) | 1.8440 × 10⁻⁷ |
Table 9. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for the accuracy. Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Blank cells are placeholders.

| Models | KNN | GNB | DT | RF |
|---|---|---|---|---|
| GNB | **0.0003** | | | |
| DT | **0.0029** | **<0.0001** | | |
| RF | **0.0029** | **<0.0001** | 1.0000 | |
| EM | **0.0075** | **<0.0001** | 0.9928 | 0.9929 |
Table 10. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for recall (flooded/non-flooded). Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Values below the diagonal are for recall (flooded); values above the diagonal are for recall (non-flooded).

| Models | KNN | GNB | DT | RF | EM |
|---|---|---|---|---|---|
| KNN | – | 0.1138 | 0.7097 | 0.7149 | 0.9108 |
| GNB | **<0.0001** | – | **0.0079** | **0.0081** | **0.0201** |
| DT | **<0.0001** | **<0.0001** | – | 1.0000 | 0.9929 |
| RF | **<0.0001** | **<0.0001** | 1.0000 | – | 0.9934 |
| EM | **<0.0001** | **<0.0001** | 1.0000 | 1.0000 | – |
Table 11. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for precision (flooded/non-flooded). Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Values below the diagonal are for precision (flooded); values above the diagonal are for precision (non-flooded).

| Models | KNN | GNB | DT | RF | EM |
|---|---|---|---|---|---|
| KNN | – | **<0.0001** | **<0.0001** | **<0.0001** | **<0.0001** |
| GNB | **0.0051** | – | **<0.0001** | **<0.0001** | **<0.0001** |
| DT | 0.1361 | **<0.0001** | – | 1.0000 | 1.0000 |
| RF | 0.1398 | **<0.0001** | 1.0000 | – | 1.0000 |
| EM | 0.3958 | **0.0001** | 0.9612 | 0.9642 | – |
Table 12. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for F1 (flooded/non-flooded). Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Values below the diagonal are for F1 (flooded); values above the diagonal are for F1 (non-flooded).

| Models | KNN | GNB | DT | RF | EM |
|---|---|---|---|---|---|
| KNN | – | **0.0018** | **0.0154** | **0.0155** | **0.0335** |
| GNB | **0.0001** | – | **<0.0001** | **<0.0001** | **<0.0001** |
| DT | **<0.0001** | **<0.0001** | – | 1.0000 | 0.996 |
| RF | **<0.0001** | **<0.0001** | 1.0000 | – | 0.9961 |
| EM | **<0.0001** | **<0.0001** | 0.9564 | 0.9570 | – |
Table 13. Summary of p-values from one-way ANOVA for five models across five datasets, using the efficiency metrics.

| Efficiency Metrics | p-Value |
|---|---|
| Training Time | 2.0320 × 10⁻⁷ |
| Testing Time | 1.0000 × 10⁻²⁵ |
| Training Memory Usage | 4.6500 × 10⁻¹⁴ |
| Testing Memory Usage | 8.0690 × 10⁻¹⁸ |
Table 14. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for training/testing time. Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Values below the diagonal are for training time; values above the diagonal are for testing time.

| Models | KNN | GNB | DT | RF | EM |
|---|---|---|---|---|---|
| KNN | – | **<0.0001** | **<0.0001** | **<0.0001** | **<0.0001** |
| GNB | 0.9999 | – | 0.8624 | **<0.0001** | **<0.0001** |
| DT | 0.7883 | 0.7159 | – | **<0.0001** | **<0.0001** |
| RF | **0.0001** | **0.0001** | **0.0011** | – | **<0.0001** |
| EM | **<0.0001** | **<0.0001** | **<0.0001** | 0.6208 | – |
Table 15. Pairwise model comparisons of p-adj values calculated by Tukey’s HSD for training/testing memory usage. Bold values indicate a significantly different pair (p-adj value ≤ 0.05). Values below the diagonal are for training memory usage; values above the diagonal are for testing memory usage.

| Models | KNN | GNB | DT | RF | EM |
|---|---|---|---|---|---|
| KNN | – | **<0.0001** | **<0.0001** | **<0.0001** | **<0.0001** |
| GNB | **<0.0001** | – | **0.0004** | **<0.0001** | **<0.0001** |
| DT | **<0.0001** | **0.0008** | – | **<0.0001** | **<0.0001** |
| RF | **<0.0001** | 0.5555 | **<0.0001** | – | 1.0000 |
| EM | 1.0000 | **<0.0001** | **<0.0001** | **<0.0001** | – |
Table 16. Summary of p-values from one-way ANOVA for five datasets evaluated by five models separately, using effectiveness metrics.

| Effectiveness Metrics | p-Value |
|---|---|
| Accuracy | 0.9550 |
| Recall (Flooded) | 0.9990 |
| Recall (Non-Flooded) | 0.6190 |
| Precision (Flooded) | 0.8420 |
| Precision (Non-Flooded) | 0.9990 |
| F1 (Flooded) | 0.9950 |
| F1 (Non-Flooded) | 0.9260 |
Table 17. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for accuracy. Blank cells are placeholders.

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII |
|---|---|---|---|---|
| CA-dNDWI | 0.9997 | | | |
| CA-cNDWI | 0.9984 | 1.0000 | | |
| CA-NDII | 0.9997 | 1.0000 | 1.0000 | |
| CA-NDWI | 0.9927 | 0.9721 | 0.9516 | 0.9709 |
Table 18. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for recall (flooded/non-flooded). Values below the diagonal are for recall (flooded); values above the diagonal are for recall (non-flooded).

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII | CA-NDWI |
|---|---|---|---|---|---|
| Peak-Flood Image | – | 0.9992 | 0.9821 | 0.9974 | 0.8753 |
| CA-dNDWI | 1.0000 | – | 0.9981 | 1.0000 | 0.7599 |
| CA-cNDWI | 1.0000 | 1.0000 | – | 0.9995 | 0.5830 |
| CA-NDII | 1.0000 | 1.0000 | 1.0000 | – | 0.7113 |
| CA-NDWI | 1.0000 | 1.0000 | 1.0000 | 1.0000 | – |
Table 19. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for precision (flooded/non-flooded). Values below the diagonal are for precision (flooded); values above the diagonal are for precision (non-flooded).

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII | CA-NDWI |
|---|---|---|---|---|---|
| Peak-Flood Image | – | 0.9999 | 1.0000 | 1.0000 | 1.0000 |
| CA-dNDWI | 0.9971 | – | 1.0000 | 1.0000 | 0.9995 |
| CA-cNDWI | 0.9582 | 0.9966 | – | 1.0000 | 0.9997 |
| CA-NDII | 0.9941 | 1.0000 | 0.9985 | – | 0.9998 |
| CA-NDWI | 0.9939 | 0.9448 | 0.8115 | 0.9258 | – |
Table 20. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for F1 (flooded/non-flooded). Values below the diagonal are for F1 (flooded); values above the diagonal are for F1 (non-flooded).

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII | CA-NDWI |
|---|---|---|---|---|---|
| Peak-Flood Image | – | 0.9997 | 0.9982 | 0.9997 | 0.9833 |
| CA-dNDWI | 0.9997 | – | 1.0000 | 1.0000 | 0.9523 |
| CA-cNDWI | 0.9995 | 1.0000 | – | 1.0000 | 0.9199 |
| CA-NDII | 0.9998 | 1.0000 | 1.0000 | – | 0.9487 |
| CA-NDWI | 0.9999 | 0.9970 | 0.9962 | 0.9978 | – |
Table 21. Summary of p-values from one-way ANOVA for five datasets evaluated by five models separately, using the efficiency metrics.

| Efficiency Metrics | p-Value |
|---|---|
| Training Time | 0.7910 |
| Testing Time | 0.9990 |
| Training Memory Usage | 0.9920 |
| Testing Memory Usage | 0.9990 |
Table 22. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for training/testing time. Values below the diagonal are for training time; values above the diagonal are for testing time.

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII | CA-NDWI |
|---|---|---|---|---|---|
| Peak-Flood Image | – | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| CA-dNDWI | 0.7849 | – | 1.0000 | 1.0000 | 1.0000 |
| CA-cNDWI | 0.9470 | 0.9937 | – | 1.0000 | 1.0000 |
| CA-NDII | 0.9236 | 0.9974 | 1.0000 | – | 1.0000 |
| CA-NDWI | 0.8022 | 1.0000 | 0.9954 | 0.9983 | – |
Table 23. Pairwise dataset comparisons of p-adj values calculated by Tukey’s HSD for training/testing memory usage. Values below the diagonal are for training memory usage; values above the diagonal are for testing memory usage.

| Datasets | Peak-Flood Image | CA-dNDWI | CA-cNDWI | CA-NDII | CA-NDWI |
|---|---|---|---|---|---|
| Peak-Flood Image | – | 1.0000 | 0.9993 | 1.0000 | 0.9999 |
| CA-dNDWI | 1.0000 | – | 0.9998 | 1.0000 | 1.0000 |
| CA-cNDWI | 0.9915 | 0.9956 | – | 0.9998 | 1.0000 |
| CA-NDII | 0.9994 | 0.9999 | 0.9995 | – | 1.0000 |
| CA-NDWI | 0.9988 | 0.9997 | 0.9998 | 1.0000 | – |
Table 24. The mean and standard deviation of the differences in the effectiveness metrics between the CA datasets and the peak-flood image. The mean value for each of the metrics is small, indicating that there is not much difference between the datasets.

| Metric | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|
| Mean | 0.004 | 0.010 | 0.002 | 0.027 | 0.003 | 0.003 | 0.013 |
| Standard Deviation | 0.041 | 0.674 | 0.051 | 0.104 | 0.020 | 0.020 | 0.065 |
Table 25. Summary of differences in the effectiveness metrics of the comparison of the CA datasets with the peak-flood image for all model-dataset combinations. For each metric, the mean and standard deviation were calculated. Values significantly different from the mean are marked with two symbols: asterisks (*) for higher values and daggers (†) for lower values. The number of symbols corresponds to the number of standard deviations above or below the mean.

| Model | Dataset | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|---|
| KNN | CA-dNDWI | −0.021 | −0.097 † | 0.006 | 0.002 | −0.029 † | −0.055 | −0.012 |
| | CA-cNDWI | −0.009 | −0.053 | 0.007 | 0.014 | −0.016 | −0.024 | −0.004 |
| | CA-NDII | −0.025 | −0.104 † | 0.003 | −0.007 | −0.031 † | −0.063 | −0.015 |
| | CA-NDWI | −0.031 | −0.089 † | −0.010 | −0.043 | −0.028 † | −0.069 † | −0.019 |
| GNB | CA-dNDWI | 0.073 * | 0.153 ** | 0.045 | 0.179 * | 0.048 ** | 0.164 ** | 0.047 * |
| | CA-cNDWI | 0.089 ** | 0.053 | 0.102 ** | 0.339 ** | 0.030 * | 0.141 ** | 0.062 ** |
| | CA-NDII | 0.079 * | 0.117 * | 0.066 * | 0.224 * | 0.041 ** | 0.158 ** | 0.053 * |
| | CA-NDWI | −0.100 † | 0.129 * | −0.179 † | −0.147 † | 0.002 | −0.015 | −0.091 †† |
| DT | CA-dNDWI | 0.003 | 0.011 | 0.000 | 0.000 | 0.004 | 0.005 | 0.002 |
| | CA-cNDWI | 0.003 | 0.011 | 0.000 | 0.000 | 0.004 | 0.005 | 0.002 |
| | CA-NDII | 0.003 | 0.010 | 0.000 | 0.000 | 0.004 | 0.005 | 0.002 |
| | CA-NDWI | 0.001 | 0.004 | 0.000 | 0.000 | 0.002 | 0.001 | 0.001 |
| RF | CA-dNDWI | 0.003 | 0.008 | 0.001 | 0.002 | 0.003 | 0.005 | 0.002 |
| | CA-cNDWI | 0.003 | 0.009 | 0.001 | 0.002 | 0.003 | 0.005 | 0.002 |
| | CA-NDII | 0.003 | 0.008 | 0.001 | 0.002 | 0.003 | 0.005 | 0.002 |
| | CA-NDWI | 0.001 | 0.003 | 0.000 | 0.000 | 0.001 | 0.001 | 0.001 |
| EM | CA-dNDWI | 0.001 | 0.008 | −0.002 | −0.005 | 0.003 | 0.002 | 0.000 |
| | CA-cNDWI | 0.005 | 0.009 | 0.003 | 0.008 | 0.003 | 0.008 | 0.003 |
| | CA-NDII | 0.001 | 0.008 | −0.002 | −0.005 | 0.003 | 0.001 | 0.000 |
| | CA-NDWI | −0.006 | 0.003 | −0.010 | −0.026 | 0.001 | −0.012 | −0.005 |
Table 26. Summary of the paired two-tailed t-test results comparing the peak-flood image with the CA datasets across the effectiveness metrics. Each evaluation was conducted separately for each model. Values with a significant difference are in bold.

| Model | Accuracy | Recall (Flooded) | Recall (Non-Flooded) | Precision (Flooded) | Precision (Non-Flooded) | F1 (Flooded) | F1 (Non-Flooded) |
|---|---|---|---|---|---|---|---|
| KNN | **0.019** | **0.005** | 0.728 | 0.539 | **0.005** | **0.013** | 0.029 |
| GNB | 0.492 | **0.013** | 0.902 | 0.249 | 0.058 | 0.078 | 0.659 |
| DT | **0.015** | **0.013** | N/A | N/A | **0.006** | 0.028 | **0.006** |
| RF | **0.015** | **0.014** | 0.058 | 0.058 | **0.015** | 0.028 | **0.006** |
| EM | 0.920 | **0.014** | 0.382 | 0.393 | **0.015** | 0.956 | 0.783 |
Table 27. The mean and standard deviation of the efficiency metrics. There is a wide range in this table since the metrics use different measures and are on a wide scale.

| Metric | Training Time (s) | Testing Time (s) | Training Memory (MB) | Testing Memory (MB) |
|---|---|---|---|---|
| Mean | 991.882 | 0.010 | −249.304 | −244.329 |
| Standard Deviation | 1118.160 | 0.067 | 560.034 | 506.287 |
Table 28. Summary of differences in the efficiency metrics compared with the peak-flood image for all model-dataset combinations. For each metric, the mean and standard deviation were calculated. Values significantly different from the mean are marked with two symbols: asterisks (*) for higher values and daggers (†) for lower values. The number of symbols corresponds to the number of standard deviations above or below the mean.

| Model | Dataset | Training Time (s) | Testing Time (s) | Training Memory Usage (MB) | Testing Memory Usage (MB) |
|---|---|---|---|---|---|
| KNN | CA-dNDWI | −3.649 | −0.097 † | 123.578 | −79.656 |
| | CA-cNDWI | 7.203 | −0.053 | −1257.719 † | −1051.953 †† |
| | CA-NDII | −5.223 | −0.104 † | −743.531 † | −868.14 † |
| | CA-NDWI | −4.148 | −0.089 † | −549.891 | −1615.562 ††† |
| GNB | CA-dNDWI | −0.007 | 0.153 ** | 212.250 | 114.843 |
| | CA-cNDWI | 0.876 | 0.053 | 312.719 | 222.687 |
| | CA-NDII | 0.032 | 0.117 * | 446.891 | 349.484 |
| | CA-NDWI | 0.063 | 0.129 * | 568.219 * | 471.468 |
| DT | CA-dNDWI | 486.133 | 0.011 | −182.547 | −175.266 |
| | CA-cNDWI | 642.254 | 0.011 | −49.328 | −49.328 |
| | CA-NDII | 287.300 | 0.010 | −224.235 | −184.141 |
| | CA-NDWI | 506.039 | 0.004 | −144.969 | −133.578 |
| RF | CA-dNDWI | 2499.735 ** | 0.008 | −770.703 † | −288.078 |
| | CA-cNDWI | 1261.838 * | 0.009 | −827.735 † | −674.562 † |
| | CA-NDII | 1847.845 * | 0.008 | −311.250 | 78.328 |
| | CA-NDWI | 2392.534 ** | 0.003 | −1160.266 †† | −59.406 |
| EM | CA-dNDWI | 2982.212 ** | 0.008 | 123.578 | −288.078 |
| | CA-cNDWI | 1912.17 * | 0.009 | −1257.719 †† | −674.562 † |
| | CA-NDII | 2129.953 * | 0.008 | −743.531 † | 78.328 |
| | CA-NDWI | 2894.487 ** | 0.003 | −549.891 | −59.406 |
Table 29. Summary of the paired two-tailed t-test results comparing the peak-flood images with the CA datasets across the efficiency metrics. Each evaluation was conducted separately for each model. Values with a significant difference are in bold.

| Model | Training Time (s) | Testing Time (s) | Training Memory Usage (MB) | Testing Memory Usage (MB) |
|---|---|---|---|---|
| KNN | 0.651 | **0.005** | 0.124 | 0.065 |
| GNB | 0.339 | **0.013** | **0.016** | 0.033 |
| DT | **0.007** | **0.013** | 0.028 | **0.022** |
| RF | **0.006** | **0.014** | **0.022** | 0.247 |
| EM | **0.003** | **0.014** | 0.124 | 0.247 |