1. Introduction
The increasing frequency and severity of extreme weather events, particularly hurricanes, necessitate the development of rapid and accurate methods for post-disaster assessment [1,2]. Remote sensing technologies, augmented by advances in deep learning, provide a scalable and efficient means to analyze post-event damage across large geographic areas [3,4]. The performance of these data-driven models is fundamentally dependent on the availability of high-quality annotated datasets that reflect the complexities of disaster-stricken environments [5,6,7,8].
While numerous remote sensing datasets exist for general scene understanding, their annotations are often not tailored to the specific requirements of damage assessment [9,10,11]. Conversely, datasets created for disaster response frequently rely on satellite imagery, which, despite its broad coverage, may not capture the detailed structural information required for a comprehensive evaluation [12]. Even specialized flood detection datasets tend to focus on delineating the extent of inundation, rather than assessing the integrity of built infrastructure [13,14]. In particular, a significant research gap persists in assessing hurricane-induced damage to critical waterfront infrastructure, such as piers and docks, which requires high spatial resolution for accurate analysis [15,16].
To address this limitation, we propose a new dataset, the Flood and Waterfront Infrastructure Segmentation Dataset (FWISD). It is constructed from high-resolution Unmanned Aerial Vehicle (UAV) imagery captured after Hurricane Francine (2024) and provides pixel-level semantic labels for 12 classes, including intact and damaged buildings and waterfront structures. The primary contribution of this study is the release of this specialized dataset, designed to facilitate the development of advanced segmentation models for post-disaster response. We also provide comprehensive benchmarks and regression analyses; their purpose is not to propose novel architectures but to characterize the dataset’s intrinsic complexity and guide future research toward robust solutions for infrastructure assessment.
The rest of this paper is organized as follows. Section 2 reviews the related literature. Section 3 presents the construction and statistical characteristics of the FWISD. Section 4 details the experimental setup and evaluates the performance of representative segmentation models. Section 5 investigates the factors influencing segmentation success through regression analysis and compares these findings with those on an existing dataset. Finally, Section 6 presents the conclusions and future work.
2. Related Works
The advancement of deep learning has fundamentally transformed post-disaster assessment, enabling rapid and automated analysis of vast quantities of remote sensing imagery. The efficacy of these data-driven models, however, depends on the availability of large-scale, high-quality annotated datasets that reflect the complexities of disaster-stricken environments. In recent years, the remote sensing community has developed numerous benchmark datasets to facilitate research in scene understanding. Yet, specific resources for hurricane-induced damage, particularly concerning waterfront infrastructure, remain limited. This review situates our work within the broader context of existing datasets and semantic segmentation methodologies, emphasizing the research gap our proposed dataset aims to address.
2.1. General-Purpose Benchmarks
A substantial body of work has focused on creating benchmark datasets for general-purpose aerial and satellite image analysis. Foundational efforts such as the ISPRS Vaihingen dataset provide high-resolution imagery for urban land cover classification, establishing a standard for evaluating semantic segmentation algorithms in complex urban settings [9,17,18]. As the field matures, demand for larger and more diverse datasets grows. The DOTA dataset, for example, significantly advances object detection in aerial images by introducing a large collection of imagery with oriented bounding box annotations for various object categories, addressing challenges related to scale and orientation variance [10,19,20,21,22]. For segmentation tasks, the iSAID dataset offers a large-scale benchmark for instance-level segmentation, providing detailed annotations for numerous object instances and pushing the boundaries of fine-grained scene analysis [23,24,25,26]. Similarly, datasets like LoveDA are developed to address specific challenges such as domain adaptation between urban and rural scenes, which is crucial for building generalizable models [11,27,28,29,30]. While these datasets are invaluable for general scene understanding, their annotations are not tailored to the specific requirements of post-disaster damage assessment.
2.2. Disaster Assessment and Specialized Datasets
Recognizing this limitation, researchers have developed datasets specifically focused on disaster management. The xBD dataset represents a landmark contribution, providing a large-scale benchmark for building damage assessment across multiple disaster types, including earthquakes, floods, and wildfires, using pre- and post-event satellite imagery [12,31]. Another notable resource is the AIDER dataset, which focuses on classifying disaster events from aerial images into categories such as fire, flood, and collapsed buildings [32,33,34]. These datasets are instrumental in advancing automated damage assessment. However, they either rely on satellite imagery, which offers broad coverage but may lack the detail needed to assess specific types of infrastructure damage, or focus on image-level classification rather than the pixel-level segmentation required for precise infrastructure evaluation.
Within the domain of flood-related disasters, several specialized datasets have emerged. The SEN12-FLOOD dataset provides global coverage by leveraging Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 multispectral imagery, making it suitable for developing models that can operate under various weather conditions and across different geographic regions [14,35,36]. Similarly, GF-FloodNet is a multi-source dataset for flood area extraction using high-resolution satellite data [37,38,39]. While powerful for mapping the extent of inundation over large areas, these satellite-based datasets have spatial resolutions that remain coarse relative to aerial or UAV imagery, which often precludes the detailed analysis of individual structures. The DeepFlood dataset narrows this gap by providing high-resolution aerial RGB imagery focused on inundated vegetation, enabling the study of flooding in vegetated environments [40,41,42,43]. Nevertheless, these datasets focus primarily on delineating water boundaries, with less emphasis on the structural integrity of built infrastructure, which is a critical component of comprehensive post-hurricane assessment [44].
The increasing availability of UAVs has opened new frontiers for data collection, offering centimeter-level spatial resolution. Datasets derived from UAV imagery capture details invisible in satellite or traditional aerial data. The UAVid dataset, for instance, provides a high-resolution benchmark for semantic segmentation in urban traffic scenes, demonstrating the potential of UAVs for detailed infrastructure monitoring [45,46,47,48]. The unique characteristics of UAV data, such as variable viewing angles and high detail, present both challenges and opportunities for segmentation models. However, few public datasets leverage this high-resolution capability for the specific and complex task of post-hurricane damage assessment, particularly for the vital and vulnerable category of waterfront infrastructure. This gap is significant, as structures such as piers, docks, and seawalls are often the first to be impacted by storm surges and are critical for economic activity. For a systematic overview, a comparison of these existing datasets is provided in Table A1 of Appendix A.
2.3. Deep Learning Architectures for Semantic Segmentation
Parallel to the evolution of datasets, semantic segmentation methodologies have progressed significantly, moving from traditional machine learning approaches to deep neural networks. The development of Fully Convolutional Networks (FCNs) marks a turning point, enabling end-to-end, pixel-level prediction for the first time [49]. This is followed by the introduction of encoder–decoder architectures, exemplified by U-Net, which uses skip connections to fuse low-level feature maps with high-level ones, thereby preserving spatial details crucial for precise boundary delineation in both medical and remote sensing imagery [50]. To better handle objects at multiple scales, subsequent architectures incorporate mechanisms for multi-scale context aggregation. PSPNet introduces a pyramid pooling module to capture contextual information at various scales [51], while the DeepLab family of models employs dilated (atrous) convolutions to enlarge the receptive field without sacrificing spatial resolution [52,53,54,55].
More recently, the success of Transformers in natural language processing has inspired their application in computer vision. Vision Transformer (ViT) demonstrates that a pure Transformer architecture can achieve state-of-the-art results on image classification tasks [56,57,58]. This paradigm has since been adapted for dense prediction tasks such as semantic segmentation. Models such as SegFormer and Swin Transformer are designed to be more efficient and effective for segmentation by using hierarchical multi-scale feature representations and more efficient attention mechanisms (e.g., the shifted-window attention of Swin Transformer), enabling them to capture both long-range dependencies and local features [59,60]. Hybrid models that combine convolutional and Transformer-based components have also emerged, aiming to leverage the spatial inductive bias of CNNs and the global context modeling capabilities of Transformers. The performance of these diverse architectures on the challenging task of post-disaster assessment, especially with high-resolution UAV data, remains an active area of investigation.
In summary, while a rich ecosystem of remote sensing datasets and segmentation models exists, a critical void persists. Existing datasets for general scene understanding are not disaster-specific. Datasets for disaster management often rely on moderate-resolution satellite imagery or focus on broad classification. Flood-specific datasets primarily target inundation mapping rather than infrastructure integrity. Finally, high-resolution UAV datasets have yet to be extensively applied to the nuanced challenge of post-hurricane waterfront damage assessment. Our work addresses this gap by introducing a high-resolution UAV dataset for the semantic segmentation of flood and waterfront infrastructure damage. This resource facilitates the development and rigorous evaluation of advanced segmentation models capable of performing the analysis required for effective and timely post-disaster response and recovery efforts.
3. Flood and Waterfront Infrastructure Segmentation Dataset—FWISD
This section describes the construction of the Flood and Waterfront Infrastructure Segmentation Dataset (FWISD) for post-disaster assessment. It covers study area selection, original data acquisition, the definition of the classes, and the annotation procedure and quality control methods. A statistical and feature analysis of the final dataset is also presented.
3.1. Data Collection
The data collection for this study centers on Hurricane Francine of the 2024 Atlantic hurricane season. On 11 September 2024, Francine struck the southern Louisiana coast as a Category 2 hurricane. The event cut power to over 163,000 residents and triggered widespread flooding. The hurricane’s 3 m storm surge and 304 mm of rainfall severely threatened coastal infrastructure. Because Louisiana is a vital trade hub at the mouth of the Mississippi River, a swift and precise assessment of the region is important.
This study uses UAV imagery released by the U.S. National Oceanic and Atmospheric Administration (NOAA) in the aftermath of the disaster to assess the impact of this hurricane. The images were collected on 16 and 17 September 2024 and cover multiple severely affected areas in southern Louisiana. To construct a high-quality segmentation dataset, we design a standardized pipeline. It first defines 12 target categories (see Section 3.2) and provides, for each category, a textual definition, typical visual examples, and criteria for resolving ambiguous cases. We then provide professional training to the annotation team to ensure that every annotator fully understands the annotation specifications and is proficient in using the LabelMe annotation software v5.9.1. This dataset can be downloaded at https://huggingface.co/datasets/kevinxue112/FWISD (accessed 10 January 2026).
3.2. Class and Annotation
Exhaustively labeling every type of object in a UAV scene exceeding 8K resolution is very expensive and unnecessary. Consequently, only the most common and representative object types are labeled in the FWISD. In total, the dataset contains 12 classes: 11 foreground classes selected for semantic segmentation and 1 background class. They are ordered from natural environmental elements to man-made infrastructure, movable objects, and finally the disaster-specific element. Example instances from different classes are shown in Figure 1, and their definitions are as follows.
- (1) Natural Water: Pre-existing, permanent water bodies within the scene, such as rivers, lakes, and other natural reservoirs.
- (2) Tree: Trees (arbor vegetation) and taller shrubs.
- (3) Road-Passable: Road segments, including highways, streets, and bridges, where the road surface is clearly visible and not submerged by floodwater.
- (4) Road-Flooded: Road segments that are partially or entirely covered by floodwater.
- (5) Building-Intact: Buildings that retain their structural integrity or exhibit only minor damage, with no obvious collapse or significant breaches in major load-bearing elements (e.g., roof, walls).
- (6) Building-Damaged: Buildings exhibiting evident structural failure, characterized by partial or total roof loss, wall collapse, or significant structural deformation.
- (7) Waterfront Structure-Intact: Infrastructure that interfaces with water bodies (e.g., piers, jetties, docks) and remains structurally sound and undamaged.
- (8) Waterfront Structure-Damaged: Waterfront infrastructure exhibiting structural failure, such as breakage, collapse, or severe degradation due to flooding or water damage.
- (9) Vehicle-Land: Conveyances situated on terrestrial surfaces, including roads, parking areas, or dry ground.
- (10) Vehicle-Water: Conveyances (e.g., boats) located on natural water bodies.
- (11) Floodwater: Transient accumulation of water over land areas (e.g., roads, vegetated areas, building perimeters) resulting from hurricanes or heavy rainfall.
- (12) Background: Regions that do not belong to any of the 11 defined classes, such as unidentifiable fragmented debris or general clutter.
To ensure pixel-level labeling accuracy, we implement a multi-round iterative quality control mechanism. First, after the initial round of annotation, annotators perform self-inspection and preliminary correction. Next, a manager conducts a second, comprehensive review focused on identifying errors such as misclassification, omitted objects, and imprecise object boundaries. Samples found to contain errors are returned to the annotators with detailed revision feedback for correction. Finally, a third-round inspection ensures that all issues have been thoroughly addressed. This closed-loop, iterative process of annotation, review, and correction improves labeling quality, yielding pixel-level ground truth labels with sharp boundaries and accurate class assignments.
3.3. Statistical Analysis
First, Figure 2 shows the pixel proportion of each class in the FWISD, which we use to evaluate the class balance of the dataset. A clear class imbalance is observed: classes such as Natural Water and Tree account for a large proportion of the pixels, whereas disaster-related classes, including Building-Damaged and Vehicle-Water, have low proportions and constitute minority classes. Such a distribution is representative of real-world post-disaster scenarios but challenges the learning capability of segmentation models. Specifically, severe class imbalance may bias models toward majority classes, resulting in poor segmentation performance on critical minority classes such as Waterfront Structure-Damaged. This motivates strategies such as weighted loss functions during training, as detailed in Section 4.1.
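The class-balance check behind Figure 2, and one way to turn it into loss weights, can be sketched as follows. This is a minimal sketch assuming label maps stored as integer arrays with class indices 0–11 (the index assignment and function names are hypothetical). Inverse-frequency weighting is one common choice; the exact weights used in the FWISD experiments are not stated here.

```python
import numpy as np

NUM_CLASSES = 12  # 11 foreground classes + Background (index order is hypothetical)


def class_pixel_proportions(label_maps, num_classes=NUM_CLASSES):
    """Share of labeled pixels per class over a collection of integer label maps."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for lab in label_maps:
        # bincount over the flattened map; slicing drops any ignore index >= num_classes (e.g. 255)
        counts += np.bincount(lab.ravel(), minlength=num_classes)[:num_classes]
    return counts / counts.sum()


def inverse_frequency_weights(proportions):
    """Per-class loss weights proportional to 1/frequency, normalised to mean 1 over present classes.

    Classes with zero pixels receive weight 0 instead of an unbounded value.
    """
    present = proportions > 0
    weights = np.zeros_like(proportions, dtype=float)
    weights[present] = 1.0 / proportions[present]
    weights[present] /= weights[present].mean()
    return weights


if __name__ == "__main__":
    toy = [np.array([[0, 0, 0, 1]])]  # 75% class 0, 25% class 1
    p = class_pixel_proportions(toy)
    w = inverse_frequency_weights(p)
    print(p[:2], w[:2])  # [0.75 0.25] [0.5 1.5]
```

The rarer class receives the larger weight (here three times that of the majority class), which is the behavior a weighted Cross Entropy term needs to counteract the majority-class bias described above.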
To investigate the differential vulnerability of various asset types after the disaster, we conduct a statistical analysis on a subset of the annotated images. Before being tiled into 1024 × 1024 patches for model training, the dataset consists of over 200 large annotated images derived from the original UAV captures. From this collection, we select all images containing both ordinary buildings and waterfront infrastructure, which yields the 65 images used in this analysis. This selection ensures that damage rates are compared within the same localized disaster contexts.
For this subset, the mean damage rate of waterfront infrastructure (M = 0.3496, SD = 0.3404) is markedly higher than that of ordinary buildings (M = 0.1321, SD = 0.1616), suggesting that waterfront infrastructure is a more vulnerable asset class in post-disaster scenarios. Next, an ordinal logistic regression model is employed to investigate the determinants of damage severity. This approach identifies macro-level patterns by estimating the average effects of variables, assuming their influence is spatially uniform across the study area. The dependent variable is the severity of Waterfront Infrastructure Damage (WID) within each area. Based on the sample distribution of the waterfront infrastructure damage rate, this variable is categorized into three ordered levels, ‘minor’, ‘moderate’, and ‘severe’, using the 25th percentile (a damage rate of 6%) and the 75th percentile (a damage rate of 65%) as thresholds. The utility function is:

$$U = \sum_{k=1}^{K} \beta_k x_k \qquad (1)$$

where each $\beta_k$ represents the estimated coefficient for the corresponding explanatory variable $x_k$. The definitions of these variables are given in Table 1.
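The construction of the three-level dependent variable can be sketched as below. This is a hypothetical sketch: the damage rate is assumed to be the damaged share of all waterfront-structure pixels in an image, and values that fall exactly on a percentile are assigned to the lower level; neither convention is stated in the text.

```python
import numpy as np


def damage_rate(intact_pixels, damaged_pixels):
    """Damaged share of one asset type in an image (hypothetical definition).

    Returns NaN when the asset type is absent from the image.
    """
    total = intact_pixels + damaged_pixels
    return damaged_pixels / total if total > 0 else float("nan")


def severity_levels(rates, percentiles=(25, 75)):
    """Map damage rates to ordered levels 0 = minor, 1 = moderate, 2 = severe.

    Cut points are the given sample percentiles of the rates themselves.
    """
    rates = np.asarray(rates, dtype=float)
    cuts = np.percentile(rates, percentiles)
    # right=True: rate <= q25 -> 0, q25 < rate <= q75 -> 1, rate > q75 -> 2
    return np.digitize(rates, cuts, right=True), cuts
```

Applied to the 65 selected images, the returned cut points would correspond to the reported thresholds of about 6% and 65%.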
The regression results are detailed in Table 2. In this analysis, all independent variables are rescaled to the range −1 to 1, a linear transformation that does not affect the statistical significance of their coefficients. Ordinary building density, waterfront infrastructure density, road network density, and the boat ratio all exhibit a significant negative association with the damage level. One possible explanation is that in areas with denser infrastructure, developers tend to adhere to more stringent site selection criteria and adopt more robust building codes [61]. As construction becomes concentrated in these high-standard areas, individual structures may demonstrate greater disaster resilience. In contrast, an opposing effect emerges at the macro-regional scale: overall development saturation is positively associated with damage, suggesting that the region as a whole becomes more vulnerable. This paradox may be explained by ‘urban entropy’, whereby increased systemic interdependence makes the area more susceptible to cascading failures [62].
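The statement that rescaling leaves significance unchanged follows because min-max scaling to [−1, 1] is an affine map with a positive slope. A short derivation for a single predictor x with coefficient β, assuming the cumulative-logit parameterization P(Y ≤ j) = σ(θ_j − βx) with thresholds θ_j (an assumed convention; the opposite sign convention gives the same result):

$$x' = \frac{2\,(x - x_{\min})}{x_{\max} - x_{\min}} - 1 \;\Longleftrightarrow\; x = c + a\,x', \qquad a = \frac{x_{\max} - x_{\min}}{2} > 0,\quad c = x_{\min} + a$$

$$\theta_j - \beta x = (\theta_j - \beta c) - (a\beta)\,x' = \theta_j' - \beta' x', \qquad \beta' = a\beta,\quad \theta_j' = \theta_j - \beta c$$

$$z' = \frac{\hat\beta'}{\operatorname{SE}(\hat\beta')} = \frac{a\,\hat\beta}{a\,\operatorname{SE}(\hat\beta)} = z$$

The shift c is absorbed into the thresholds, and because maximum-likelihood estimates are equivariant under this reparameterization, the estimate and its standard error scale by the same factor a. The z-statistic and p-value are therefore unchanged; only the coefficient's magnitude changes.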
4. Experimental Design and Evaluation
4.1. Experimental Settings
To evaluate the performance of various semantic segmentation models on our proposed dataset, we select 9 representative models, which are first trained on the training set and subsequently evaluated on the validation set. Specifically, our selection includes two classes of architectures: convolution-based and Transformer-based. The convolution-based models include the original U-Net [50] with its encoder–decoder structure, alongside PSPNet [51] and DeepLabv3+ [52], which employ multi-scale pyramid context extraction. The Transformer-based selection includes SegFormer [60], K-Net [63], and two variants of Mask2Former [64] using ResNet [65] and Swin-Transformer [59] backbones, respectively. Furthermore, to provide a direct comparison, we also assess a modified U-Net that incorporates a Swin-Transformer encoder. This model serves as a representative hybrid baseline rather than a novel architectural proposal.
In the experiments, the original large-scale images are first allocated to the training and validation sets so that the pixel counts of each class are split at an approximate 70:30 ratio. After this allocation, the images are tiled into patches, yielding a total of 2770 training patches and 980 validation patches. Because allocation happens at the image level, no patches from the same original image appear in both sets, which ensures a fair evaluation. All experiments are conducted in PyTorch 2.4.0 and Python 3.10 on a server equipped with one NVIDIA RTX 4090D GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. The implementation is based on the MMSegmentation open-source toolbox v1.2.0. During training, input images are uniformly cropped to a resolution of 1024 × 1024 pixels. To enhance model robustness, we employ a set of data augmentation techniques, including random horizontal flipping, random rotation (up to 45 degrees), and photometric distortions that randomly alter brightness, contrast, and saturation. The total number of training iterations is set to 160,000, and the batch size is set to 2 due to GPU memory constraints. We use the AdamW optimizer with betas of (0.9, 0.999) and a weight decay of 0.05. The initial learning rate is set to 1.2 × 10⁻⁴ and adjusted using a schedule composed of a linear warmup phase followed by a polynomial decay policy. To address the class imbalance problem, we use a composite loss function, a weighted sum of Cross Entropy Loss and Dice Loss, and assign higher loss weights to minority classes. All models are initialized with weights pre-trained on large-scale natural image datasets and subsequently fine-tuned on our dataset.
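The image-level allocation described above can be approximated with a simple greedy procedure. This is a hypothetical sketch: the text states only the target ratio (about 70:30 per class) and that whole images are assigned before tiling, so the greedy rule, the squared-deviation criterion, and the function name are assumptions.

```python
import numpy as np


def split_images(class_counts, train_ratio=0.7):
    """Greedily assign whole images to train/val so each class's pixels split near train_ratio.

    class_counts: array of shape (n_images, n_classes) with labeled pixels per class per image.
    Returns a boolean array: True = training image, False = validation image.
    """
    counts = np.asarray(class_counts, dtype=float)
    train_px = np.zeros(counts.shape[1])
    val_px = np.zeros(counts.shape[1])
    is_train = np.zeros(len(counts), dtype=bool)

    def deviation(t, v):
        # Sum of squared gaps between each class's training share and the target ratio.
        total = t + v
        seen = total > 0
        return float(np.sum((t[seen] / total[seen] - train_ratio) ** 2))

    # Visit large images first; smaller images are then used to fine-tune the balance.
    for i in np.argsort(-counts.sum(axis=1), kind="stable"):
        if deviation(train_px + counts[i], val_px) <= deviation(train_px, val_px + counts[i]):
            train_px += counts[i]
            is_train[i] = True
        else:
            val_px += counts[i]
    return is_train
```

Tiling into 1024 × 1024 patches happens only after this mask is fixed, so every patch inherits its parent image's split and no image contributes to both sets.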
In the following sections, we use Overall Accuracy (OA) and mean Intersection over Union (mIoU) for the performance evaluation, and the Intersection over Union (IoU) of each class for class-wise performance evaluation. This allows for a detailed analysis of model performance on specific categories, such as distinguishing between intact and damaged structures.
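These metrics can all be computed from one confusion matrix accumulated over the validation set. A minimal sketch, assuming integer label and prediction maps and an ignore index of 255 (an assumption, not stated in the text). Classes absent from both prediction and ground truth are skipped when averaging, which is one common convention.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a (num_classes x num_classes) matrix; rows = ground truth, columns = prediction."""
    valid = gt != ignore_index
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def segmentation_metrics(cm):
    """Overall Accuracy, per-class IoU, and mIoU from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c but labeled as another class
    fn = cm.sum(axis=1) - tp  # labeled as class c but predicted as another class
    oa = tp.sum() / cm.sum()
    denom = tp + fp + fn
    iou = np.divide(tp, denom, out=np.full_like(tp, np.nan), where=denom > 0)
    return oa, iou, np.nanmean(iou)
```

In practice, `confusion_matrix` would be summed over all validation patches before calling `segmentation_metrics`, so that large and small patches contribute in proportion to their pixel counts.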
4.2. Experimental Results
Table 3 presents the quantitative performance of the selected models, and Figure 3 provides corresponding visual segmentation examples. Among the unmodified models, Swin-Mask2Former, which uses a Swin Transformer backbone, achieves the highest metrics. Notably, the Modified UNet, a hybrid architecture that combines a Swin-Transformer encoder with a classic U-Net decoder, obtains even higher metrics than Swin-Mask2Former. This suggests that while purely Transformer-based methods perform well at overall pixel classification, combining a Transformer encoder with a convolutional decoder is particularly effective for achieving high segmentation quality across all classes. Other high-performing models, including SegFormer-B4 and Mask2Former with a ResNet backbone, also achieve mIoU scores exceeding 60%, indicating the robustness of advanced architectures for the post-disaster assessment task.
A comparison across architectural classes highlights their respective advantages. The leading mIoU of the Modified UNet suggests that this architecture benefits from the Transformer encoder’s capacity to capture long-range contextual information. This combined approach achieves the best segmentation quality across most classes, attaining the highest scores on categories such as the large, contiguous ‘Natural Water’ (95.12%) and the well-defined ‘Building-Intact’ (84.16%). In contrast, purely Transformer-based models like Swin-Mask2Former show particular strength in complex object recognition. For instance, Swin-Mask2Former obtains the highest IoU on the ‘Vehicle-Water’ class (82.65%), suggesting that its mask classification framework is adept at identifying smaller objects that require sophisticated scene understanding.
Further class-wise analysis reveals significant discrepancies in challenging categories critical for disaster assessment. The ‘Road-Flooded’ class is consistently difficult for all models, with the highest IoU being only 35.67%, from SegFormer-B4. This indicates that delineating the boundary between passable and inundated road surfaces is a complex task. Similarly, identifying damaged infrastructure proves more challenging than identifying its intact counterpart. For example, Swin-Mask2Former achieves an IoU of 79.57% for ‘Building-Intact’ but only 40.92% for ‘Building-Damaged’. This gap suggests that the visual cues for structural damage are often subtle and easily confused with other features. Performance on the ‘Waterfront Structure-Damaged’ class also remains relatively low across all models. This limitation highlights the difficulty of distinguishing complex structural debris from general background clutter in high-resolution imagery and points to a need for further research into feature enhancement for minority damage classes.
Next, the visual results shown in Figure 3 further complement these quantitative findings. High-performing models such as the Modified UNet and Swin-Mask2Former produce segmentation maps with sharper boundaries and fewer misclassified pixels, particularly for large structures and water bodies. In contrast, models with lower mIoU values, such as the baseline UNet, yield more fragmented predictions and exhibit greater confusion between visually similar classes.
5. Analysis of Segmentation Behavior
5.1. Analysis of Segmentation Performance on the FWISD
To investigate the factors influencing model segmentation performance, we fit binary logit regression models. The dependent variable is the success of detecting a target object, defined under three conditions according to whether the object’s IoU exceeds one of three typical thresholds: 0.1, 0.25, and 0.5. The corresponding utility function is:

$$U = \mathrm{ASC} + \beta_D D + \beta_S S \qquad (2)$$

where ASC represents the intercept, i.e., the log-odds of success for a target of zero size located at zero distance from a water body. Since previous studies have shown that the distance to a water body (D) and the size of the target object (S) are the factors most significantly correlated with segmentation outcomes [13], we use both as the main independent variables. Specifically, D is defined as the Euclidean distance from the geometric centroid of a target object to the nearest natural water body, and S is the area of the target, measured as the number of pixels it occupies. $\beta_D$ and $\beta_S$ are the estimated coefficients for the corresponding explanatory variables.
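The two explanatory variables and the three success labels can be derived per object roughly as follows. This is a hypothetical sketch: distances are measured in pixels, and how a predicted mask is matched to each ground-truth object is not specified in the text, so `pred_mask` simply stands for whatever prediction region is associated with the object. For large images, `scipy.ndimage.distance_transform_edt` is a faster way to obtain distances to the nearest water pixel than the brute-force search used here.

```python
import numpy as np


def object_size(mask):
    """S: number of pixels occupied by a target object (boolean mask)."""
    return int(mask.sum())


def distance_to_water(mask, water_mask):
    """D: Euclidean distance (pixels) from the object's centroid to the nearest water pixel."""
    rows, cols = np.nonzero(mask)
    centroid = np.array([rows.mean(), cols.mean()])
    water = np.argwhere(water_mask)  # (k, 2) array of water pixel coordinates
    if len(water) == 0:
        return np.inf
    return float(np.sqrt(((water - centroid) ** 2).sum(axis=1)).min())


def instance_iou(pred_mask, gt_mask):
    """IoU between an object's ground-truth mask and its associated prediction."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0


def success_labels(iou, thresholds=(0.1, 0.25, 0.5)):
    """Binary dependent variable for each threshold: 1 if the IoU exceeds it, else 0."""
    return {t: int(iou > t) for t in thresholds}
```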
The summarized regression results are shown in Table 4. The table reports only the sign (+/−) of each coefficient and the adjusted pseudo-R² value, with statistically significant variables marked in red (detailed results can be found in Table A2 of Appendix A). The key insights are as follows:
- (1) In all results, the distance factor (D) exhibits a negative correlation with detection success, and it is statistically significant in most cases. This indicates that the probability of correctly identifying a damaged waterfront structure decreases as its distance from a water body increases, suggesting that the models rely strongly on contextual information.
- (2) By contrast, the target size (S) is not statistically significant in many cases (except for DeepLabv3+), and its sign varies across models. This discrepancy likely arises because different models have varying aptitudes for recognizing objects of different scales: some are more adept at identifying small targets, while others are better at preserving the integrity of larger segmentations. Such architecture-dependent behavior can lead to the variation in signs.
- (3) The intercept term (ASC) decreases consistently as the IoU threshold for success increases from 0.1 to 0.5, and at higher thresholds it is also statistically significant in most cases. This trend suggests that as the segmentation task becomes more stringent, successful detection becomes less dependent on favorable contextual positioning and more reliant on the instance’s intrinsic features.
- (4) The pseudo-R² values show that, for most models, the explanatory power of the regression model increases with the IoU threshold. In addition, some of the best-performing architectures, such as Swin-Mask2Former, consistently yield lower pseudo-R² values than other architectures such as K-Net. This suggests that the reasoning process of advanced models is more complex than this regression model can capture, particularly for models with Transformer-based decoders.
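Given per-object values of D and S and a success label, the binary logit model of Section 5.1 can be estimated by maximum likelihood. The sketch below uses plain Newton-Raphson in NumPy so that it is self-contained; the function names are hypothetical, and the adjusted McFadden formula is an assumption, since the text does not specify which pseudo-R² variant underlies the reported values.

```python
import numpy as np


def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood; eps guards against log(0)."""
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))


def fit_logit(X, y, max_iter=50, tol=1e-10):
    """Binary logit by Newton-Raphson. X holds the explanatory variables (e.g. D, S) without a constant.

    Returns coefficients [ASC, beta_1, ...], their standard errors, and the maximised log-likelihood.
    """
    y = np.asarray(y, dtype=float)
    Xc = np.column_stack([np.ones(len(y)), X])  # first column estimates the ASC
    beta = np.zeros(Xc.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-Xc @ beta))
        info = Xc.T @ (Xc * (p * (1.0 - p))[:, None])  # Fisher information (negative Hessian)
        step = np.linalg.solve(info, Xc.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-Xc @ beta))
    info = Xc.T @ (Xc * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se, log_likelihood(y, p)


def adjusted_mcfadden_r2(ll_model, y, n_params):
    """Adjusted McFadden pseudo-R^2: 1 - (LL_model - K) / LL_null, with an intercept-only null model."""
    y = np.asarray(y, dtype=float)
    ll_null = log_likelihood(y, np.full(len(y), y.mean()))
    return 1.0 - (ll_model - n_params) / ll_null
```

Fitting this once per model and IoU threshold reproduces the layout of Table 4. In practice, statsmodels' `Logit` fits the same model and reports the unadjusted McFadden value as `prsquared`.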
5.2. Analysis of Segmentation Performance on the DeepFlood Dataset
To further reveal the main characteristics of our new dataset, we conduct a comparative study by applying the same analytical framework to another public dataset, DeepFlood (DF), a four-class, multi-modal collection focused on inundated vegetation [9]. Because the original study on this dataset reported that multiple modalities increased mIoU by only 1.6%, we restrict the analysis to RGB images to facilitate a fair comparison. We analyze the factors influencing the detection of inundated vegetation in a sample of nearly 5000 instances, after filtering out instances smaller than 100 pixels. The utility function is:

$$U = \mathrm{ASC} + \beta_D D + \beta_S S + \beta_W W + \beta_B B + \beta_{DE} DE \qquad (3)$$

where ASC, D, and S are defined as in Equation (2). The three new variables are Water area (W), the total pixel area of natural water; Brightness (B), the image luminance computed according to the ITU-R BT.601 standard, ranging from 0 to 255; and Density (DE), the total building count.
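Two of the new covariates reduce to simple pixel operations. A minimal sketch, assuming 8-bit RGB arrays of shape (H, W, 3); whether B is averaged over the whole image or over a region around each instance is not stated, so the optional `mask` argument is a hypothetical parameter. Counting buildings for DE would additionally require a connected-component step, which is omitted here.

```python
import numpy as np

# ITU-R BT.601 luma coefficients for the R, G, B channels
BT601 = np.array([0.299, 0.587, 0.114])


def brightness(rgb, mask=None):
    """B: mean BT.601 luma of an 8-bit RGB image (range 0-255), optionally within a boolean mask."""
    luma = rgb[..., :3].astype(float) @ BT601
    return float(luma[mask].mean()) if mask is not None else float(luma.mean())


def water_area(water_mask):
    """W: total number of natural-water pixels in the image."""
    return int(np.count_nonzero(water_mask))
```

Because the three coefficients sum to 1, a pure white image maps to the top of the 0 to 255 range, which matches the stated range of B.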
Similar to Table 4, only the summarized regression results are shown in Table 5, while the detailed results can be found in Table A3 of Appendix A. The findings are as follows:
- (1)
Unlike the results on the FWISD, when IoU = 0.25 or 0.5, nearly all the independent variables in the DF dataset are significantly correlated with the detection success rate. On the contrary, W, B and DE are not considered in
Table 4, since they are not statistically significant in most cases of the FWISD. The signs of the two common variables for the two datasets are also quite different, e.g., some of
in the DF dataset are even positive.
- (2)
The analysis of the DF dataset also reveals a weaker relationship between pseudo-R2 values and the IoU threshold. Specifically, regardless of the size of IoU, the R2 results are roughly consistent. Furthermore, the R2 values are nearly consistent across different model architectures.
- (3)
The disparities in (1) and (2) stem from fundamental differences in the segmentation tasks. The DF dataset focuses on detecting a single class, inundated vegetation, which has uniform color and shape characteristics. Its detection is therefore highly predictable using the regression model’s basic variables, which also provide the basis for distinguishing this class from others and explaining why the variables are significant. In contrast, waterfront structures in the FWISD present a wide diversity of structural forms and color features, and they often exhibit irregular shapes at the complex land-water interface. Such categorical complexity also explains the differing behavior of the pseudo-R2 values.
- (4)
Finally, we note that our Logit models are not intended to support causal claims about segmentation performance or to invalidate correlation-based research [12]. Rather, they offer a complementary, data-centric perspective. A correlation-based analysis suits this goal, because our focus is on characterizing the data rather than competing on benchmarks.
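The detection-success models discussed above are binary Logit models in which an instance counts as correctly detected when its IoU reaches the threshold (0.1, 0.25 or 0.5). The following is a minimal sketch under stated assumptions, not the authors' estimation code: it assumes that success means IoU ≥ threshold and that the utility is linear in the explanatory variables plus the ASC, and it estimates coefficients and t-statistics by Newton-Raphson maximum likelihood. The names `detection_labels` and `fit_binary_logit` are hypothetical.

```python
import numpy as np


def detection_labels(iou, threshold):
    """1.0 if an instance's IoU reaches the threshold (assumed >=), else 0.0."""
    return (np.asarray(iou, dtype=float) >= threshold).astype(float)


def fit_binary_logit(X, y, max_iter=50, tol=1e-10):
    """Maximum-likelihood binary Logit fitted by Newton-Raphson.

    Utility: V = ASC + X @ beta; P(detected) = 1 / (1 + exp(-V)).
    Returns (coefficients, t_statistics, log_likelihood), where
    coefficients[0] is the ASC.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), X])  # constant column for the ASC
    b = np.zeros(Z.shape[1])

    def info_matrix(p):
        # Fisher information Z' W Z with W = diag(p * (1 - p)).
        return Z.T @ (Z * (p * (1.0 - p))[:, None])

    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ b)))
        # Newton step: solve I(b) * step = score, where score = Z' (y - p).
        step = np.linalg.solve(info_matrix(p), Z.T @ (y - p))
        b = b + step
        if np.max(np.abs(step)) < tol:
            break

    p = 1.0 / (1.0 + np.exp(-(Z @ b)))
    std_err = np.sqrt(np.diag(np.linalg.inv(info_matrix(p))))
    log_lik = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    return b, b / std_err, log_lik
```

In use, one would assemble `X` from the per-instance variables (for the DF dataset, D, S, W, B and DE), call `detection_labels` once per IoU threshold, and refit. The returned t-statistics have the same form as the "Value (t-test)" entries in Appendix A. Newton-Raphson is chosen only because it is short and standard for logit likelihoods; any maximum-likelihood routine would serve.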
6. Conclusions
This paper introduces the Flood and Waterfront Infrastructure Segmentation Dataset (FWISD) to address the scarcity of high-resolution datasets for post-hurricane damage assessment. Constructed from UAV imagery, the dataset provides the detailed semantic labels needed to evaluate segmentation models on the specific task of identifying damage to waterfront infrastructure. Experimental results demonstrate the effectiveness of hybrid architectures: the Modified UNet achieved the best performance, with an mIoU of 65.41%, which supports the strategy of combining Transformer encoders with convolutional decoders. Our binary Logit analysis further identifies spatial context as an important factor. For example, for the Modified UNet, a significant negative coefficient indicates that the probability of correct detection decreases as the distance to water increases. The experiments also highlight the complexity of waterfront infrastructure assessment. The regression models reach a pseudo-R² of up to 0.63 on the more homogeneous DeepFlood dataset, whereas their explanatory power on the FWISD is generally lower and varies much more across IoU thresholds and models, quantitatively indicating the dataset's intrinsic complexity. By providing this new dataset and a thorough analysis of model behavior, this work lays a foundation for developing automated systems for rapid post-disaster response.
Based on these studies, we argue that FWISD is broadly representative. First, the dataset captures physical characteristics of hurricane damage that recur across coastal disasters, such as the fragmentation of waterfront infrastructure and the spectral signature of inundation. Second, the functional design of waterfront infrastructure, such as piers and docks, is highly similar across regions, so the visual features learned by the models are likely to transfer to other coastal areas. Third, the study area features a highly complex mixture of land and water. This provides a rigorous testing ground, suggesting that models capable of handling this challenging environment are likely to be robust in simpler scenarios. Consequently, FWISD can serve as a valuable resource for pre-training deep learning models: having learned these complex semantic features, models can be fine-tuned on smaller datasets from future events. This capability could significantly accelerate the deployment of automated assessment tools in new geographic regions.
Despite these contributions, the study has limitations that warrant further investigation. The current FWISD relies on RGB imagery collected from a single hurricane event in one geographic region, which may restrict the generalization of trained models to other disaster scenarios or architectural styles. Furthermore, the exclusive use of visible-spectrum data provides no depth information, which is critical for distinguishing complex debris from intact structures. The performance analysis also indicates that current state-of-the-art models still struggle with ambiguous boundaries between intact and damaged waterfront structures.
Future research should expand the dataset's diversity by incorporating imagery from a wider range of disaster events and integrating multi-modal data sources such as digital elevation models. We also plan to explore domain adaptation techniques to improve model robustness across varying environmental conditions. In addition, subsequent work should optimize the computationally intensive models for deployment on edge devices to support real-time damage assessment during emergency response missions.
Author Contributions
Conceptualization, K.X.; Methodology, C.-J.J.; Software, K.X.; Validation, C.-J.J.; Formal analysis, K.X.; Data curation, K.X.; Writing—original draft, K.X.; Writing—review & editing, C.-J.J.; Visualization, K.X.; Funding acquisition, C.-J.J. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded by the National Natural Science Foundation of China (No.71801036).
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Table A1.
Comparison of existing annotated datasets in the remote sensing community.
| Dataset | Primary Task | Resolution | Data Source | Category | Application Scenario |
|---|---|---|---|---|---|
| xBD [12] | Object Detection & Instance-level Classification | 0.5–0.8 m | Satellite (RGB) | 4 | Various Disasters (Buildings) |
| AIDER [32] | Classification | 0.3 m | Satellite (RGB) | 8 | Various Disasters (Buildings) |
| GF-FloodNet [37] | Semantic Segmentation | 1.5–4 m | Satellite (MS, SAR) | 2 | General Flood Scenarios |
| SEN12-FLOOD [14] | Semantic Segmentation | 10 m | Satellite (SAR, MS) | 2 | Global Flood Events |
| DeepFlood [40] | Semantic Segmentation | 1 m | Satellite (RGB) | 4 | Flooded Vegetated Areas |
| UAVid [45] | Semantic Segmentation | ~0.05–0.1 m | UAV (RGB) | 8 | Urban Traffic Environments |
| ISPRS Vaihingen [9] | Semantic Segmentation | 0.09 m | Aerial (RGB + IR) | 6 | Small German Town |
| DOTA [10] | Object Detection | 0.1–1 m | Satellite/Aerial (RGB) | 15 | Diverse Geographic Scenes |
| DeepGlobe [66] | Segmentation & Detection & Classification | 0.5 m | Satellite (RGB) | 7 | Global Diverse Terrains |
| LoveDA [11] | Semantic Segmentation | 0.3 m | Satellite (RGB) | 7 | Urban-Rural Interfaces |
| iSAID [23] | Instance Segmentation | 0.5–0.8 m | Satellite/Aerial (RGB) | 15 | Complex Aerial Scenes |
Table A2.
Detailed Logit Regression Results for FWISD.
Values are coefficient estimates with t-statistics in parentheses.

| Variables | (a) Basic UNet, IoU = 0.1 | (a) Basic UNet, IoU = 0.25 | (a) Basic UNet, IoU = 0.5 | (b) Modified UNet, IoU = 0.1 | (b) Modified UNet, IoU = 0.25 | (b) Modified UNet, IoU = 0.5 | (c) DeepLabV3+, IoU = 0.1 | (c) DeepLabV3+, IoU = 0.25 | (c) DeepLabV3+, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | −0.001 (−0.00) | −1.080 (−3.71) | −2.280 (−4.77) | 0.524 (1.98) | −0.118 (−0.47) | −1.520 (−5.28) | −0.093 (−0.38) | −0.291 (−0.94) | −2.120 (−4.05) |
|  | −0.871 (−2.98) | −0.926 (−1.90) | −0.967 (−3.55) | −0.111 (−3.28) | −0.978 (−2.89) | −0.620 (−1.92) | −0.839 (−2.85) | −0.358 (−4.00) | −0.348 (−2.07) |
|  | 0.816 (0.99) | 0.106 (1.51) | −0.369 (−1.56) | 0.146 (1.39) | 0.110 (1.28) | 0.624 (0.74) | 0.134 (1.47) | 0.285 (2.32) | 0.243 (1.24) |
| LL(b) | −88.967 | −66.047 | −23.540 | −88.714 | −87.066 | −56.022 | −88.950 | −60.144 | −22.476 |
| R2 | 0.063 | 0.404 | 0.753 | 0.066 | 0.083 | 0.410 | 0.063 | 0.367 | 0.763 |
| Adj. R2 | 0.032 | 0.273 | 0.721 | 0.034 | 0.052 | 0.378 | 0.032 | 0.335 | 0.732 |

| Variables | (d) PSPNet, IoU = 0.1 | (d) PSPNet, IoU = 0.25 | (d) PSPNet, IoU = 0.5 | (e) SegFormer_b0, IoU = 0.1 | (e) SegFormer_b0, IoU = 0.25 | (e) SegFormer_b0, IoU = 0.5 | (f) SegFormer_b4, IoU = 0.1 | (f) SegFormer_b4, IoU = 0.25 | (f) SegFormer_b4, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | 0.157 (0.53) | −0.111 (−0.38) | −0.503 (−1.46) | 0.413 (1.46) | −0.199 (−0.68) | −1.090 (−3.25) | 0.072 (0.19) | −0.117 (−0.44) | −0.828 (−2.67) |
|  | −0.187 (−3.27) | −0.220 (−3.44) | −0.257 (−3.04) | −0.152 (−3.10) | −0.127 (−2.29) | −0.741 (−2.29) | −0.197 (−3.19) | −0.111 (−2.67) | −0.790 (−2.00) |
|  | 0.070 (0.75) | 0.177 (1.90) | −0.551 (−0.36) | 0.115 (1.31) | −0.459 (−0.52) | −0.162 (−1.10) | 0.755 (1.75) | 0.333 (0.33) | −0.299 (−1.62) |
| LL(b) | −80.822 | −75.766 | −55.421 | −87.268 | −77.780 | −58.178 | −81.431 | −83.807 | −60.276 |
| R2 | 0.149 | 0.202 | 0.416 | 0.081 | 0.181 | 0.387 | 0.142 | 0.117 | 0.365 |
| Adj. R2 | 0.117 | 0.171 | 0.385 | 0.049 | 0.149 | 0.356 | 0.111 | 0.086 | 0.334 |

| Variables | (g) KNet, IoU = 0.1 | (g) KNet, IoU = 0.25 | (g) KNet, IoU = 0.5 | (h) Res-Mask2Former, IoU = 0.1 | (h) Res-Mask2Former, IoU = 0.25 | (h) Res-Mask2Former, IoU = 0.5 | (i) Swin-Mask2Former, IoU = 0.1 | (i) Swin-Mask2Former, IoU = 0.25 | (i) Swin-Mask2Former, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | −0.521 (−1.85) | −1.040 (−3.15) | −1.910 (−4.61) | 0.409 (1.63) | −0.178 (−0.77) | −1.020 (−3.26) | 1.150 (3.31) | 0.788 (2.55) | 0.230 (0.72) |
|  | −0.129 (−2.34) | −0.188 (−2.46) | −0.121 (−1.79) | −0.809 (−2.51) | −0.586 (−2.22) | −0.795 (−1.09) | −0.147 (−2.14) | −0.153 (−3.13) | −0.180 (−3.14) |
|  | 0.148 (1.62) | 1.270 (1.37) | −0.754 (−0.53) | −0.106 (−0.14) | 0.403 (0.52) | 0.534 (0.48) | −0.775 (−0.76) | −0.633 (−0.68) | −0.180 (−1.21) |
| LL(b) | −77.119 | −55.360 | −34.476 | −90.368 | −89.184 | −67.643 | −83.780 | −85.513 | −75.619 |
| R2 | 0.188 | 0.417 | 0.637 | 0.048 | 0.061 | 0.288 | 0.118 | 0.100 | 0.204 |
| Adj. R2 | 0.156 | 0.385 | 0.605 | 0.017 | 0.029 | 0.256 | 0.086 | 0.068 | 0.172 |
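
The LL(b), R2 and Adj. R2 rows can be related through the usual discrete-choice definitions, which we assume here: ρ² = 1 − LL(b)/LL(0) and adjusted ρ̄² = 1 − (LL(b) − K)/LL(0), where LL(0) is the log-likelihood of a reference model and K is the number of estimated parameters (3 in Table A2: ASC, D and S). The table values appear consistent with this reading. For KNet at IoU = 0.5, backing out LL(0) from LL(b) = −34.476 and R2 = 0.637 gives about −94.98, and the adjusted formula with K = 3 then reproduces the reported 0.605. That LL(0) would correspond to an equal-probability reference model, LL(0) = N ln 0.5, with N ≈ 137; this is an inference, not a figure stated in the text. The function names below are hypothetical.

```python
import math


def rho_squared(ll_b: float, ll_0: float, k: int) -> tuple[float, float]:
    """McFadden rho^2 and adjusted rho^2 (assumed definitions).

    ll_b: log-likelihood at the estimated coefficients.
    ll_0: log-likelihood of the reference model.
    k:    number of estimated parameters, including the ASC.
    """
    rho2 = 1.0 - ll_b / ll_0
    adjusted = 1.0 - (ll_b - k) / ll_0
    return rho2, adjusted


def implied_ll0(ll_b: float, rho2: float) -> float:
    """Back out LL(0) from a reported LL(b) and rho^2."""
    return ll_b / (1.0 - rho2)


def equal_share_ll0(n: int) -> float:
    """LL(0) of a binary model that assigns probability 0.5 to every outcome."""
    return n * math.log(0.5)
```

Because the reported R2 values are rounded to three decimals, the back-calculated LL(0) and the recomputed adjusted values can differ from the table in the last digit.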
Table A3.
Detailed Logit Regression Results for DF Dataset.
Values are coefficient estimates with t-statistics in parentheses.

| Variables | (a) Basic UNet, IoU = 0.1 | (a) Basic UNet, IoU = 0.25 | (a) Basic UNet, IoU = 0.5 | (b) Modified UNet, IoU = 0.1 | (b) Modified UNet, IoU = 0.25 | (b) Modified UNet, IoU = 0.5 | (c) DeepLabV3+, IoU = 0.1 | (c) DeepLabV3+, IoU = 0.25 | (c) DeepLabV3+, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | 0.085 (0.36) | −0.445 (−1.66) | −1.060 (−3.09) | −1.070 (−4.27) | −1.140 (−4.05) | −1.340 (−3.82) | −1.170 (−4.92) | −1.610 (−5.78) | −1.690 (−4.84) |
|  | 0.560 (0.60) | −0.247 (−1.97) | −0.834 (−4.28) | −0.284 (−2.47) | −0.410 (−2.77) | −0.608 (−3.18) | 0.240 (0.27) | −0.204 (−1.54) | −0.435 (−2.32) |
|  | 0.887 (27.00) | 0.994 (26.80) | 0.132 (18.60) | 0.873 (27.10) | 0.109 (25.60) | 0.152 (21.10) | 0.887 (26.80) | 0.114 (27.70) | 0.155 (24.90) |
|  | −0.255 (−11.50) | −0.241 (−9.95) | −0.241 (−8.58) | −0.127 (−5.98) | −0.160 (−7.13) | −0.196 (−7.09) | 0.127 (−6.16) | −0.149 (−6.38) | −0.193 (−6.58) |
|  | 0.340 (10.90) | 0.334 (9.43) | 0.371 (8.06) | 0.377 (11.20) | 0.371 (9.74) | 0.400 (8.37) | 0.352 (10.80) | 0.380 (10.20) | 0.375 (8.21) |
|  | −0.665 (−2.71) | −0.114 (−3.90) | −0.245 (−5.58) | −0.428 (−1.58) | −0.140 (−4.43) | −0.377 (−7.76) | −0.122 (−0.49) | −0.756 (−2.50) | −0.300 (−7.22) |
| LL(b) | −1607.87 | −1554.35 | −1432.77 | −1730.58 | −1517.90 | −1254.74 | −1618.84 | −1349.27 | −1052.16 |
| R2 | 0.44 | 0.46 | 0.50 | 0.40 | 0.47 | 0.56 | 0.44 | 0.53 | 0.63 |
| Adj. R2 | 0.44 | 0.46 | 0.50 | 0.40 | 0.47 | 0.56 | 0.44 | 0.53 | 0.63 |

| Variables | (d) PSPNet, IoU = 0.1 | (d) PSPNet, IoU = 0.25 | (d) PSPNet, IoU = 0.5 | (e) SegFormer_b0, IoU = 0.1 | (e) SegFormer_b0, IoU = 0.25 | (e) SegFormer_b0, IoU = 0.5 | (f) SegFormer_b4, IoU = 0.1 | (f) SegFormer_b4, IoU = 0.25 | (f) SegFormer_b4, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | −0.677 (−2.83) | −0.921 (−3.28) | −1.240 (−3.46) | −0.601 (−2.49) | −0.994 (−3.55) | −0.894 (−2.61) | −0.792 (−3.18) | −1.090 (−3.86) | −1.380 (−4.09) |
|  | −0.300 (−2.48) | −0.472 (−3.09) | −0.775 (−3.89) | 0.260 (0.26) | −0.157 (−1.21) | −0.274 (−1.60) | −0.859 (−5.23) | −0.114 (−5.87) | −0.130 (−6.41) |
|  | 0.889 (26.30) | 0.109 (24.10) | 0.152 (19.90) | 0.892 (28.70) | −0.110 (28.50) | 0.152 (24.30) | 0.925 (20.80) | 0.117 (20.30) | 0.157 (20.10) |
|  | −0.164 (−7.89) | −0.194 (−8.09) | −0.208 (−6.86) | −0.182 (−8.43) | −0.206 (−8.56) | −0.254 (−8.54) | −0.149 (−6.72) | −0.147 (−6.72) | −0.150 (−5.70) |
|  | 0.370 (11.60) | 0.385 (10.70) | 0.378 (8.48) | 0.345 (10.50) | 0.377 (10.10) | 0.369 (8.14) | 0.359 (10.50) | 0.346 (9.06) | 0.334 (7.38) |
|  | −0.849 (−3.24) | −0.157 (−4.94) | −0.365 (−7.54) | −0.825 (−3.19) | −0.144 (−4.67) | −0.400 (−8.81) | −0.798 (−2.83) | −0.171 (−5.03) | −0.395 (−7.97) |
| LL(b) | −1740.88 | −1546.78 | −1288.50 | −1637.93 | −1422.96 | −1168.69 | −1873.41 | −1686.95 | −1437.83 |
| R2 | 0.40 | 0.46 | 0.55 | 0.43 | 0.51 | 0.60 | 0.35 | 0.42 | 0.50 |
| Adj. R2 | 0.40 | 0.46 | 0.55 | 0.43 | 0.50 | 0.60 | 0.35 | 0.41 | 0.50 |

| Variables | (g) KNet, IoU = 0.1 | (g) KNet, IoU = 0.25 | (g) KNet, IoU = 0.5 | (h) Res-Mask2Former, IoU = 0.1 | (h) Res-Mask2Former, IoU = 0.25 | (h) Res-Mask2Former, IoU = 0.5 | (i) Swin-Mask2Former, IoU = 0.1 | (i) Swin-Mask2Former, IoU = 0.25 | (i) Swin-Mask2Former, IoU = 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ASC | −1.080 (−4.08) | −1.390 (−4.55) | −1.550 (−3.96) | −0.945 (−4.12) | −1.180 (−4.83) | −1.370 (−4.81) | 0.158 (0.75) | −0.243 (−1.08) | −0.667 (−2.60) |
|  | −0.491 (−3.52) | −0.840 (−4.81) | −0.101 (−5.09) | 0.152 (1.92) | 0.174 (0.17) | −0.472 (−2.99) | 0.115 (1.49) | 0.116 (1.30) | −0.206 (−1.55) |
|  | 0.928 (24.00) | 0.121 (23.30) | 0.163 (22.80) | 0.818 (25.50) | 0.930 (28.20) | 0.115 (23.00) | 0.738 (25.70) | 0.809 (28.10) | 0.910 (21.90) |
|  | −0.186 (−8.85) | −0.189 (−7.85) | −0.234 (−7.73) | −0.755 (−3.61) | −0.939 (−4.28) | −0.109 (−4.35) | −0.994 (−5.22) | −0.111 (−5.59) | −0.113 (−5.08) |
|  | 0.486 (12.90) | 0.450 (10.70) | 0.448 (8.37) | 0.303 (10.40) | 0.281 (8.98) | 0.237 (6.67) | 0.172 (6.43) | 0.158 (5.56) | 0.127 (3.96) |
|  | −0.547 (−1.81) | −0.140 (−3.92) | −0.346 (−6.60) | −0.612 (−2.72) | −0.117 (−4.68) | −0.221 (−6.89) | −0.132 (−6.34) | −0.166 (−7.22) | −0.229 (−8.09) |
| LL(b) | −1739.62 | −1504.32 | −1223.92 | −1719.17 | −1616.14 | −1472.66 | −1849.39 | −1790.62 | −1763.08 |
| R2 | 0.40 | 0.48 | 0.58 | 0.40 | 0.44 | 0.50 | 0.36 | 0.38 | 0.40 |
| Adj. R2 | 0.40 | 0.48 | 0.57 | 0.40 | 0.44 | 0.50 | 0.36 | 0.38 | 0.40 |
References
- Reed, K.A.; Wehner, M.F.; Zarzycki, C.M. Attribution of 2020 hurricane season extreme rainfall to human-induced climate change. Nat. Commun. 2022, 13, 1905.
- Patricola, C.M.; Hansen, G.E.; Sena, A.C.T. The influence of climate variability and future climate change on atlantic hurricane season length. Geophys. Res. Lett. 2024, 51, e2023GL107881.
- Akhyar, A.; Zulkifley, M.A.; Lee, J.; Song, T.; Han, J.; Cho, C.; Hyun, S.; Son, Y.; Hong, B.W. Deep artificial intelligence applications for natural disaster management systems: A methodological review. Ecol. Indic. 2024, 163, 112067.
- Braik, A.M.; Koliou, M. Automated building damage assessment and large-scale mapping by integrating satellite imagery, GIS, and deep learning. Comput. Aided Civ. Infrastruct. Eng. 2024, 39, 2389–2404.
- Mayo, D.; Cummings, J.; Lin, X.; Gutfreund, D.; Katz, B.; Barbu, A. How hard are computer vision datasets? calibrating dataset difficulty to viewing time. Adv. Neural Inf. Process. Syst. 2023, 36, 11008–11036.
- Ammirato, P.; Poirson, P.; Park, E.; Košecká, J.; Berg, A.C. A dataset for developing and benchmarking active vision. In 2017 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2017.
- Paul, M.; Ganguli, S.; Dziugaite, G.K. Deep learning on a data diet: Finding important examples early in training. Adv. Neural Inf. Process. Syst. 2021, 34, 20596–20607.
- Zhu, H.; Akrout, M.; Zheng, B.; Pelegris, A.; Jayarajan, A.; Phanishayee, A. Benchmarking and analyzing deep neural network training. In 2018 IEEE International Symposium on Workload Characterization (IISWC); IEEE: New York, NY, USA, 2018.
- International Society for Photogrammetry and Remote Sensing 2D Semantic Labeling Challenge. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx (accessed on 10 January 2026).
- Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
- Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733.
- Gupta, R.; Goodman, B.; Patel, N.; Hosfelt, R.; Sajeev, S.; Heim, E.; Doshi, J.; Lucas, K.; Choset, H.; Gaston, M. Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019.
- Zhao, J.; Li, M.; Li, Y.; Matgen, P.; Chini, M. Urban flood mapping using satellite synthetic aperture radar data: A review of characteristics, approaches, and datasets. IEEE Geosci. Remote Sens. Mag. 2024, 13, 237–268.
- Rambour, C.; Audebert, N.; Koeniguer, E.; Le Saux, B.; Crucianu, M.; Datcu, M. SEN12-FLOOD: A SAR and Multispectral Dataset for Flood Detection; IEEE: Piscataway, NJ, USA, 2020.
- Tansel, B. Multi-Hazard Vulnerability and Impact Intensification: Interactive Hazards and Impact Compounding in Coastal Areas. Int. J. Disaster Risk Reduct. 2025, 131, 105883.
- Alam, M.S.; Kim, K.; Horner, M.W.; Alisan, O.; Antwi, R.; Ozguven, E.E. Large-scale modeling of hurricane flooding and disrupted infrastructure impacts on accessibility to critical facilities. J. Transp. Geogr. 2024, 116, 103852.
- Kamann, C.; Rother, C. Benchmarking the robustness of semantic segmentation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Kerssies, T.; De Geus, D.; Dubbelman, G. How to Benchmark Vision Foundation Models for Semantic Segmentation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.
- Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A comprehensive survey of oriented object detection in remote sensing images. Expert Syst. Appl. 2023, 224, 119960.
- Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021.
- Yang, X.; Yan, J. On the arbitrary-oriented object detection: Classification based approaches revisited. Int. J. Comput. Vis. 2022, 130, 1340–1365.
- Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796.
- Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.-S.; Bai, X. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019.
- Mou, L.; Zhu, X.X. Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6699–6711.
- Zeng, X.; Wei, S.; Shi, J.; Zhang, X. A lightweight adaptive RoI extraction network for precise aerial image instance segmentation. IEEE Trans. Instrum. Meas. 2021, 70, 5018617.
- Cao, J.; Cholakkal, H.; Anwer, R.M.; Khan, F.S.; Pang, Y.; Shao, L. D2det: Towards high quality object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Zou, Y.; Yu, Z.; Kumar, B.V.K.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Liu, Y.; Zhang, W.; Wang, J. Source-free domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021.
- Li, Y.; Yuan, L.; Vasconcelos, N. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
- Zhao, S.; Li, B.; Yue, X.; Gu, Y.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. Multi-source Domain Adaptation for Semantic Segmentation. arXiv 2019, arXiv:1910.12181.
- Qing, Y.; Ming, D.; Wen, Q.; Weng, Q.; Xu, L.; Chen, Y.; Zhang, Y.; Zeng, B. Operational earthquake-induced building damage assessment using CNN-based direct remote sensing change detection on superpixel level. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102899.
- Kyrkou, C. AIDER (Aerial Image Dataset for Emergency Response Applications); Zenodo: Geneva, Switzerland, 2020.
- Kyrkou, C.; Theocharides, T. Deep-Learning-Based Aerial Image Classification for Emergency Response Applications Using Unmanned Aerial Vehicles. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019.
- Kyrkou, C.; Theocharides, T. EmergencyNet: Efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1687–1699.
- Yadav, R.; Nascetti, A.; Azizpour, H.; Ban, Y. Unsupervised flood detection on SAR time series using variational autoencoder. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103635.
- Peng, B.; Huang, Q.; Vongkusolkit, J.; Gao, S.; Wright, D.B.; Fang, Z.N.; Qiang, Y. Urban flood mapping with bitemporal multispectral imagery via a self-supervised learning framework. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 2001–2016.
- Zhang, Y.; Liu, P.; Chen, L.; Xu, M.; Guo, X.; Zhao, L. A new multi-source remote sensing image sample dataset with high resolution for flood area extraction: GF-FloodNet. Int. J. Digit. Earth 2023, 16, 2522–2554.
- Zhao, J.; Xiong, Z.; Zhu, X.X. UrbanSARFloods: Sentinel-1 SLC-based benchmark dataset for urban and open-area flood mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.
- Zheng, X.; Maidment, D.R.; Tarboton, D.G.; Liu, Y.Y.; Passalacqua, P. GeoFlood: Large-scale flood inundation mapping based on high-resolution terrain analysis. Water Resour. Res. 2018, 54, 10013–10033.
- Fawakherji, M.; Blay, J.; Anokye, M.; Hashemi-Beni, L.; Dorton, J. DeepFlood for inundated vegetation high-resolution dataset for accurate flood mapping and segmentation. Sci. Data 2025, 12, 271.
- Akiva, P.; Purri, M.; Dana, K.; Tellman, B.; Anderson, T. H2O-Net: Self-supervised flood segmentation via adversarial domain adaptation and label refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021.
- Rudner, T.G.J.; Rußwurm, M.; Fil, J.; Pelich, R.; Bischke, B.; Kopačková, V.; Biliński, P. Multi3net: Segmenting flooded buildings via fusion of multiresolution, multisensor, and multitemporal satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33.
- Jia, Y.; Gao, J.; Huang, W.; Yuan, Y.; Wang, Q. Holistic mutual representation enhancement for few-shot remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622613.
- Maha Arachchige, S.; Pradhan, B. AI meets the eye of the storm: Machine learning-driven insights for hurricane damage risk assessment in Florida. Earth Syst. Environ. 2025, 9, 2143–2163.
- Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119.
- Xu, H.; Wang, L.; Han, W.; Yang, Y.; Li, J.; Lu, Y.; Li, J. A survey on UAV applications in smart city management: Challenges, advances, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8982–9010.
- Ji, M.; Xu, Y.; Zhu, S.; Zhang, Y.; Xin, Y.; Mo, Y. Exploring the potential of UAV-based thermal imagery for monitoring diurnal variations in the microscale urban thermal environment. Energy Build. 2025, 347, 116375.
- Yi, S.; Liu, X.; Li, J.; Chen, L. UAVformer: A composite transformer network for urban scene segmentation of UAV images. Pattern Recognit. 2023, 133, 109019.
- Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Lian, X.; Pang, Y.; Han, J.; Pan, J. Cascaded hierarchical atrous spatial pyramid pooling module for semantic segmentation. Pattern Recognit. 2021, 110, 107622.
- Qiu, Y.; Liu, Y.; Chen, Y.; Zhang, J.; Zhu, J.; Xu, J. A2SPPNet: Attentive atrous spatial pyramid pooling network for salient object detection. IEEE Trans. Multimed. 2022, 25, 1991–2006.
- Zhang, Z.; Wang, X.; Jung, C. DCSR: Dilated convolutions for single image super-resolution. IEEE Trans. Image Process. 2018, 28, 1625–1635.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. CSUR 2022, 54, 1–41.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Nwadike, A.; Wilkinson, S.; Clifton, C. Improving disaster resilience through effective building code compliance. In Proceedings of the 9th International i-Rec (Information and Research for Reconstruction) Conference, Gainesville, FL, USA, 5–7 June 2019.
- Fistola, R.; Gargiulo, C.; La Rocca, R.A. Rethinking vulnerability in city-systems: A methodological proposal to assess “urban entropy”. Environ. Impact Assess. Rev. 2020, 85, 106464.
- Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.