Article

CSDNet: Context-Aware Segmentation of Disaster Aerial Imagery Using Detection-Guided Features and Lightweight Transformers

by
Ahcene Zetout
and
Mohand Saïd Allili
*
Department of Computer Science and Engineering, University of Quebec in Outaouais, Gatineau, QC J8Y 3G5, Canada
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2337; https://doi.org/10.3390/rs17142337
Submission received: 11 May 2025 / Revised: 27 June 2025 / Accepted: 4 July 2025 / Published: 8 July 2025

Abstract

Accurate multi-class semantic segmentation of disaster-affected areas is essential for rapid response and effective recovery planning. We present CSDNet, a context-aware segmentation model tailored to disaster scene scenarios, designed to improve segmentation of both large-scale disaster zones and small, underrepresented classes. The architecture combines a lightweight transformer module for global context modeling with depthwise separable convolutions (DWSCs) to enhance efficiency without compromising representational capacity. Additionally, we introduce a detection-guided feature fusion mechanism that integrates outputs from auxiliary detection tasks to mitigate class imbalance and improve discrimination of visually similar categories. Extensive experiments on several public datasets demonstrate that our model significantly improves segmentation of both man-made infrastructure and natural damage-related features, offering a robust and efficient solution for post-disaster analysis.

1. Introduction

Natural disasters such as hurricanes and floods pose growing threats to critical infrastructure and human life. Intensified by climate change, these events are increasing in frequency, scale, and severity, causing widespread destruction, displacement, economic losses, and environmental damage [1,2]. In such rapidly evolving crises, timely and accurate situational data is essential, along with the ability to coordinate resources and leverage AI-powered analytics to support effective disaster preparedness, emergency response, and recovery planning [3,4].
In this context, the use of unmanned aerial vehicles (UAVs), commonly referred to as drones, has emerged as a transformative solution [3,5]. Drones enable rapid and cost-efficient acquisition of high-resolution aerial imagery across vast and often inaccessible or hazardous terrain. However, the true value of this imagery lies in its interpretation. This is where aerial image segmentation becomes indispensable [6]. As a core computer vision technique, aerial semantic segmentation can provide pixel-level classification of disaster scenes, enabling detailed mapping of affected areas, including flooded roads, collapsed structures, blocked evacuation routes, and intact infrastructure [7]. By converting raw aerial imagery into structured, actionable insights, segmentation plays a vital role in situational awareness, damage assessment, and strategic decision making [1].
Recently, segmentation of aerial imagery captured by unmanned aerial vehicles (UAVs) has emerged as a crucial task in numerous vision-based applications, including urban planning, precision agriculture, environmental monitoring, and disaster response [7]. UAVs offer the unique advantage of low-altitude, high-resolution, and flexible image acquisition, making them ideal platforms for detailed scene understanding. Early progress in aerial image segmentation was mainly propelled by convolutional neural networks (CNNs), especially fully convolutional networks (FCNs) [8], which enabled end-to-end pixel-level classification, allowing models to directly map input images to segmentation masks without requiring handcrafted features. The U-Net architecture [9], originally designed for biomedical images, quickly gained popularity in the aerial domain due to its symmetric encoder–decoder structure and skip connections, which help preserve fine spatial features. Variants such as U-Net++ [10], which incorporates nested and dense skip pathways, and DeepLabv3+ [11], with its Atrous Spatial Pyramid Pooling (ASPP) module, have further improved the ability to capture multi-scale contextual information, a crucial factor for accurately segmenting objects of varying sizes in aerial scenes. Recent architectures like HRNet [12] have improved segmentation of small or thin structures common in UAV imagery, such as roads, power lines, and fences. Models like BiSeNet [13] and Fast-SCNN [14] addressed computational efficiency by proposing lightweight designs suitable for real-time onboard UAV processing. The introduction of vision transformers such as SegFormer [15] and Swin-Unet [16] enabled the modeling of long-range dependencies, capturing both local and global context. However, they impose significant computational demands, limiting real-time deployment on resource-constrained UAV platforms.
Despite recent advances, applying existing segmentation models to disaster zones using aerial imagery remains highly challenging. First, severe class imbalance persists, with critical but rare classes, such as damaged buildings, debris, or flooded infrastructure, occupying only a small fraction of the scene, leading to poor detection of these vital regions [17]. Second, small objects and fine structures, like vehicles or partially submerged buildings, are frequently overlooked. Moreover, the dynamic, heterogeneous nature of disaster environments introduces complex materials, occlusions, and visual ambiguities that conventional segmentation pipelines struggle to manage [7,18,19]. While high-performing models rely on large, clean datasets, they poorly generalize to the noisy, unpredictable conditions of real-world UAV disaster imagery. Domain shifts across regions and disaster types, along with temporal inconsistencies in UAV video streams, further compromise generalization. Overcoming these limitations calls for efficient, context-aware segmentation approaches that integrate global, local, and task-specific information while remaining robust under disaster-specific constraints and lightweight for deployment.
To tackle key challenges in UAV-based aerial image segmentation, such as class imbalance, poor small-object detection, and limited global context, we propose a novel framework inspired by DeepLabV3+ [11]. DeepLabV3+ offers strong performance through its boundary-refining decoder and multi-scale context aggregation via the Atrous Spatial Pyramid Pooling (ASPP) mechanism [20]. However, it falls short in modeling long-range dependencies and building adaptive contexts for detecting objects of varying sizes and distinguishing visually similar classes in complex aerial scenes such as disaster zones. Our enhanced architecture, CSDNet, builds upon DeepLabV3+ by introducing several improvements to boost contextual reasoning and semantic clarity in high-resolution UAV imagery. Our main contributions can be summarized as follows:
  • We propose CSDNet, a novel context-aware segmentation model specifically designed for natural disaster-affected areas. CSDNet advances semantic segmentation by integrating three key innovations: (i) a lightweight transformer module to enhance global context understanding, (ii) depthwise separable convolutions (DWSC) to improve computational efficiency without compromising performance, and (iii) a detection-guided feature fusion mechanism that injects auxiliary detection cues into the segmentation pipeline. This unique combination addresses critical challenges in disaster scene analysis, including class imbalance, poor small-object segmentation, and the difficulty of distinguishing visually similar categories.
  • We introduce a multi-scale feature fusion strategy that hierarchically combines low-level spatial details with mid-level semantic features within the decoder. This enhances boundary precision and improves the detection of small, underrepresented structures such as debris, vehicles, narrow roads, and damaged infrastructure, which are often missed by conventional approaches. To further strengthen class separation, especially for visually similar categories like “intact roof” and “collapsed roof,” we integrate detection-aware features from an auxiliary object detection branch. These semantic cues act as class-specific attention signals within the decoder, guiding the network toward more discriminative regions and improving segmentation robustness in complex, cluttered UAV imagery.
  • We evaluate CSDNet on the FloodNet [21] and RescueNet [7] datasets, two challenging UAV-based disaster datasets that include diverse environmental conditions, complex scene structures, and significant class imbalance (see image examples in Figure 1). Extensive experiments demonstrate that our approach consistently outperforms baseline models in accurately segmenting both large-scale disaster zones and small critical objects. The results confirm the effectiveness of our architectural enhancements in improving segmentation robustness, boundary precision, and semantic clarity, making CSDNet a reliable tool for rapid situational assessment in real-world disaster response scenarios.
This paper is organized as follows: Section 2 discusses the related works. Section 3 presents the proposed methodology. Section 4 presents experimental results for validation. Section 5 presents a discussion of our contributions and some limitations of the proposed work. Finally, Section 6 presents a conclusion and some future work perspectives.

2. Related Work

Semantic segmentation of aerial images is essential for environmental monitoring and disaster management, providing reliable spatial information for critical decision making. Advances in remote sensing now enable the precise detection of objects and complex regions, such as vehicles, humans, and buildings [22]. Moreover, attention mechanisms have become a key component for improving segmentation performance on small or subtle features by guiding the model to focus on relevant pixels [7,23]. Complementary techniques like separable convolutions further accelerate learning while reducing computational costs, reinforcing the growing role of aerial imagery in large-scale scene analysis [6].

2.1. Disaster Scene Segmentation from Aerial Imagery

Building on the aforementioned developments, disaster scene analysis through segmentation has become critical in providing essential data for damage mapping and assessment [1,7]. Approaches have been developed for change detection from satellite imagery [24], UAV-based flood mapping [25], and flooded building segmentation through multi-source fusion models like Multi3Net [26]. Chen et al. [27] proposed an attention-based model for building damage assessment after natural disasters such as hurricanes and earthquakes; Gupta et al. [28] combined multimodal satellite data with UNet for urban flood detection. Zhang et al. [29] used transformer-based segmentation to improve the segmentation of small objects such as vehicles and swimming pools.
While the increasing availability of high-resolution aerial datasets offers new opportunities for deep learning models, aerial disaster imagery presents unique challenges for segmentation [30,31]. Most existing models rely on global representations and large, homogeneous datasets, limiting their ability to generalize to the cluttered, ambiguous conditions of real-world disaster zones [32]. Small or critical objects like vehicles, debris, or flooded infrastructure remain challenging to detect, especially under severe class imbalance. Overcoming these challenges requires lightweight, context-aware segmentation frameworks that effectively integrate global, local, and class-specific cues while maintaining efficiency for practical disaster response.

2.2. Attention in Semantic Segmentation

Recent studies have emphasized the role of attention mechanisms in addressing class imbalance and enhancing segmentation accuracy [30,33]. Attention modules help the network to selectively focus on informative regions, improving the separation of overlapping or visually similar classes. For instance, the Squeeze-and-Excitation (SE) block [34] enhances channel-wise feature recalibration, while criss-cross attention [35] captures long-range dependencies efficiently by aggregating contextual information across spatial dimensions. Other works integrate hybrid attention modules to boost contextual awareness, particularly for small or underrepresented objects [36]. Although these approaches provide notable improvements, many models still struggle when dealing with high intra-class variability, small object instances, or severe class imbalance, underscoring the need for more effective attention-driven segmentation strategies.
A growing trend in recent research is the cross-domain integration of object detection and semantic segmentation to improve the segmentation of small, ambiguous, or underrepresented classes. For instance, Zou et al. [37] applied YOLO-based detection to enhance building damage assessment in UAV imagery for post-earthquake disaster management. Similarly, Zhu et al. [38] proposed a multilevel instance segmentation approach using Mask R-CNN for natural disaster damage analysis in aerial videos. However, these methods primarily focus on detecting specific damaged objects rather than providing dense, pixel-wise segmentation across all classes in the scene. In contrast, our proposed CSDNet introduces a novel cross-task integration that combines object detection cues with full-scene semantic segmentation.
Instead of explicit object detection, CSDNet leverages intermediate features from a detection model [39] to inject class-specific semantic cues into the segmentation pipeline. This enhances the model’s ability to distinguish visually similar, compact, or underrepresented regions. By unifying detection-aware features with dense segmentation, CSDNet provides more reliable scene understanding, which is essential for disaster scenarios where small but critical objects influence situational assessment.

2.3. Context-Aware Image Segmentation

In image segmentation, local information (e.g., color/texture) alone is insufficient for pixel-level object identification, making context crucial to accurately distinguish similar objects. Early convolutional neural networks (CNNs) addressed this challenge by expanding receptive fields through multi-scale feature extraction, as demonstrated by models like DeepLab [11] and PSPNet [40]. These architectures introduced mechanisms such as atrous convolutions and pyramid pooling to aggregate global and local information, thus improving segmentation performance for large regions as well as small complex structures. More recently, attention mechanisms have further improved contextual understanding by allowing models to dynamically focus on the most informative parts of the image [41]. Particularly in aerial imagery and disaster analysis, context integration helps to disambiguate visually similar areas (e.g., distinguishing flooded roads from rivers) and improves the detection of small objects often missed by local feature extraction alone [42]. Therefore, modern segmentation models increasingly combine multi-scale modules and transformers to maximize the integration of spatial and semantic context [43]. However, these approaches are computationally intensive and require huge datasets for their training.

2.4. Class Imbalance in Semantic Segmentation

Class imbalance is a well-recognized challenge in semantic segmentation, particularly in real-world scenarios such as aerial imagery and disaster scene analysis, where certain classes occupy only a small fraction of the total image area [7,25]. In such settings, dominant classes (e.g., background, vegetation, roads) heavily outweigh rare or critical categories like vehicles, debris, or damaged infrastructure, causing standard deep learning models to bias toward frequent classes while neglecting minority ones [44]. This imbalance not only affects model performance on underrepresented classes but also results in misleading evaluation metrics, where high overall accuracy masks poor segmentation of rare but operationally important categories [31]. For example, in disaster response tasks, failure to detect small, infrequent objects like vehicles or flooded structures can severely compromise situational awareness and decision making [7].
In the past, several strategies have been proposed to mitigate the class imbalance problem. These include loss functions such as focal loss [39,45] and dice loss [46], which place greater emphasis on hard-to-classify or minority regions. Other strategies involve over-sampling and synthetic augmentation to increase the representation of rare classes [44]. However, effectively addressing class imbalance remains an open problem, especially in complex environments like UAV-based disaster imagery, where small objects, visual ambiguity, and severe occlusions further complicate minority class detection.

2.5. Summary

Recent advances in semantic segmentation have significantly improved UAV-based disaster scene understanding. Early methods focused on convolutional encoder–decoder architectures, which demonstrated strong capabilities in preserving spatial detail through skip connections. Despite these advances, significant challenges remain for disaster scene segmentation. These include severe class imbalance, small object detection, and visual ambiguity, which limit model performance, especially for rare but critical categories like flooded structures or debris [25,30]. To address these limitations, attention mechanisms, class balancing losses, and hybrid models combining multi-scale and global context cues have been proposed. However, many existing models rely heavily on large, homogeneous datasets and global scene representations, which often fail to generalize to the complex, cluttered, and dynamic conditions typical of real-world disaster zones.
Through its lightweight, context-aware segmentation framework, the proposed CSDNet is an innovative method able to address these limitations. Unlike conventional models that rely mainly on global scene representations, CSDNet integrates detection-guided semantic cues and self-attention mechanisms to enhance the segmentation of small, ambiguous, and underrepresented classes. By combining multi-scale context, global attention, and class-specific features, CSDNet improves fine-grained segmentation while maintaining computational efficiency. This makes it well suited for real-time UAV operations, enabling more reliable scene understanding in complex disaster environments.

3. Methodology

In disaster scene segmentation, accurately detecting small objects and underrepresented classes remains a critical yet often overlooked challenge. Essential categories such as vehicles, debris, pools, and flooded structures occupy only a small portion of the image (see Figure 2 for illustration), but carry significant operational value for emergency response [5]. In post-disaster settings, these objects are often occluded, visually degraded, or dispersed within cluttered scenes, making them harder to identify. Additionally, the dominance of large background regions like roads, vegetation, or buildings biases standard segmentation models toward frequent classes, prioritizing overall pixel accuracy at the expense of rare but crucial elements [47]. This class imbalance and scale disparity limit the practical effectiveness of segmentation systems for real-world disaster response.
CSDNet is designed for multiclass segmentation with a specific focus on underrepresented classes, such as small objects or rare categories within the dataset. These critical classes are often neglected, as dominant classes inflate overall performance metrics, concealing segmentation errors in less frequent but operationally important regions [48]. To address this, CSDNet integrates multiple specialized modules that work in synergy to improve segmentation accuracy for rare and small-scale features. The overall architecture, illustrated in Figure 3, consists of three key components: (1) a targeted class enhancement module, (2) a deep contextual attention module, and (3) a multi-scale feature fusion module. Together, these three modules form a synergistic architecture tailored for segmenting complex, imbalanced, and visually ambiguous disaster scenes from UAV imagery. In the next subsections, we separately detail each module’s implementation.

3.1. Targeted Classes Enhancement Module

In high-resolution aerial imagery, certain semantic classes (e.g., flooded building vs. non-flooded building) exhibit high visual similarity and often suffer from underrepresentation in the training dataset. These factors contribute to poor segmentation performance for these critical classes. To address this challenge, we propose a targeted class enhancement strategy that incorporates features from a class-specific detection model into the semantic segmentation pipeline. For this purpose, we leverage the RetinaNet model [39], originally designed for object detection, as a source of semantically rich, class-specific features that complement the segmentation pipeline. Detection can provide global and local cues for underrepresented objects, which are often missed by standard encoder–decoder segmentation models. Rather than training RetinaNet on the entire set of segmentation labels, we constrain its training to a targeted subset of classes, denoted $\mathcal{T}$, focused on two critical classification tasks: distinguishing between flooded and non-flooded roads and between flooded and non-flooded buildings.
Let the input aerial image be $X \in \mathbb{R}^{H \times W \times 3}$ and the corresponding ground-truth segmentation mask $Y \in \{1, \ldots, C\}^{H \times W}$, where $C$ is the total number of classes. A detection model $D_{\theta_D}$ is trained only on classes in $\mathcal{T}$ using bounding-box annotations derived from the segmentation masks. After training, the parameters $\theta_D$ of $D_{\theta_D}$ are frozen (put in evaluation mode) during segmentation model training. It acts solely as a class-aware feature extractor, providing high-level semantic representations:
$$F_D = D_{\theta_D}^{\mathrm{eval}}(X) \in \mathbb{R}^{H' \times W' \times d_D},\tag{1}$$
where $H' \times W'$ is the spatial resolution of the feature map and $d_D$ is its channel depth. This allows stable, semantically rich features to be reused without updating the detection model.
Parallel to the detection module, we process the input image through a standard semantic segmentation encoder $B$. We employ EfficientNet-B5, a high-capacity CNN backbone pretrained on ImageNet, to extract low-level to mid-level features. These features are denoted $F_1 = B_1(X) \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times d_1}$, $F_2 = B_2(F_1) \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times d_2}$, $F_3 = B_3(F_2) \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times d_3}$, and $F_4 = B_4(F_3) \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d_4}$, representing increasingly abstract representations. Here, $B_l$ designates the model up to the $l$-th layer, and $d_l$ is its number of output features.
To improve contextual understanding, we integrate the detection-derived features $F_D$ into the segmentation pathway by first aligning their channel dimensions using a transformation $\phi$. The resulting features are then fused via element-wise multiplication with the encoder feature map $F_1$, after applying Atrous Spatial Pyramid Pooling (ASPP) [11] to the latter (see Figure 4 for illustration):
$$F_{det} = \phi(F_D) \odot \mathrm{ASPP}(F_1),\tag{2}$$
where $\odot$ denotes element-wise multiplication. This operation functions as a gating mechanism, amplifying class-specific activations and guiding the model to better capture subtle distinctions, particularly for rare or ambiguous classes.
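To make the gating concrete, the following is a minimal PyTorch sketch of Equation (2), assuming $\phi$ is a 1 × 1 convolution and that an ASPP module is supplied externally; module names, channel sizes, and the bilinear resizing step are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionGuidedFusion(nn.Module):
    """Sketch of the detection-guided gating of Eq. (2): F_det = phi(F_D) * ASPP(F_1)."""

    def __init__(self, det_channels, out_channels, aspp: nn.Module):
        super().__init__()
        # phi: 1x1 conv aligning the frozen detection features to the fused channel depth
        self.phi = nn.Conv2d(det_channels, out_channels, kernel_size=1)
        self.aspp = aspp  # any module mapping F_1 to `out_channels` feature maps

    def forward(self, f_det, f1):
        ctx = self.aspp(f1)                                   # ASPP(F_1)
        gate = self.phi(f_det)                                # phi(F_D)
        # resize the detection features to the encoder resolution before gating
        gate = F.interpolate(gate, size=ctx.shape[-2:],
                             mode="bilinear", align_corners=False)
        return gate * ctx                                     # element-wise multiplication


# toy usage with random tensors (shapes and channel counts are illustrative)
aspp = nn.Sequential(nn.Conv2d(64, 256, 3, padding=2, dilation=2), nn.ReLU())
fusion = DetectionGuidedFusion(det_channels=256, out_channels=256, aspp=aspp)
f_d = torch.randn(1, 256, 32, 32)       # frozen RetinaNet features F_D
f_1 = torch.randn(1, 64, 256, 256)      # encoder features F_1 (H/2 x W/2)
f_det = fusion(f_d, f_1)                # fused features of Eq. (2)
```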

3.2. Deep Contextual Attention Module

In multi-class semantic segmentation, a major challenge arises from class imbalance, especially in high-resolution aerial datasets where certain classes occupy minimal spatial regions and appear rarely across the dataset [49]. This leads to two compounded issues: (1) the small object problem, where relevant classes cover few pixels within individual images, and (2) dataset-level underrepresentation, where only a limited number of training samples contain instances of these classes. In our case, the dataset exhibits both of these difficulties, making it crucial to enhance global context understanding and reinforce feature representations corresponding to rare categories.
To address this challenge, we integrate a lightweight transformer-based attention mechanism inspired by MobileViT [50], represented by the green module in Figure 3. This module is applied to high-level feature maps extracted from the segmentation backbone to better capture relationships between distant regions of the image that may belong to the same object or class. By integrating these global contextual interactions, the network gains a more comprehensive understanding of the scene, which is especially important for accurately segmenting large structures or spatially disconnected but semantically related regions. We first apply a depthwise separable convolution (DWSC) block to extract localized features from the last convolutional layer of the backbone:
$$F_{loc} = \mathrm{DWSC}(F_4) \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d_4}.\tag{3}$$
Next, the resulting tensor is flattened into non-overlapping patches and passed through a transformer encoder (TE) consisting of multi-head self-attention and feedforward blocks:
$$F_{attn} = \mathrm{TE}\big(\mathrm{flatten}(F_{loc})\big) \in \mathbb{R}^{N \times d_4},\tag{4}$$
where $N = \frac{H}{16} \times \frac{W}{16}$. The attention-enhanced output is then reshaped back into spatial dimensions and passed through ASPP to capture diverse receptive fields:
$$F_{global} = \mathrm{ASPP}\big(\mathrm{unflatten}(F_{attn})\big).\tag{5}$$
The output F g l o b a l provides a multi-scale, attention-enhanced encoding that is fused with decoder features to guide the segmentation network in focusing on underrepresented and small classes. This integration of transformer-based attention and multi-scale contextual refinement improves the model’s capacity to identify fine-grained and rare categories, which would otherwise be lost due to spatial sparsity or dataset imbalance.
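A condensed PyTorch sketch of this module is given below. It treats every spatial position of $F_4$ as a token and uses torch.nn.TransformerEncoder as a stand-in for the MobileViT-style block; the patching scheme, head count, layer depth, and channel sizes are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DWSC: depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class DeepContextualAttention(nn.Module):
    """DWSC -> flatten -> transformer encoder -> unflatten -> ASPP (Eqs. (3)-(5))."""
    def __init__(self, channels, aspp: nn.Module, num_heads=4, num_layers=2):
        super().__init__()
        self.dwsc = DepthwiseSeparableConv(channels)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.aspp = aspp

    def forward(self, f4):
        loc = self.dwsc(f4)                                  # F_loc (Eq. 3)
        b, c, h, w = loc.shape
        tokens = loc.flatten(2).transpose(1, 2)              # (B, N, d_4), N = h*w
        attn = self.encoder(tokens)                          # F_attn (Eq. 4)
        glob = attn.transpose(1, 2).reshape(b, c, h, w)      # unflatten to (B, d_4, h, w)
        return self.aspp(glob)                               # F_global (Eq. 5)


# toy usage: F4 at 1/16 of the input resolution (channel count illustrative)
aspp = nn.Sequential(nn.Conv2d(512, 256, 3, padding=2, dilation=2), nn.ReLU())
attn_module = DeepContextualAttention(channels=512, aspp=aspp)
f_global = attn_module(torch.randn(1, 512, 32, 32))
```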

3.3. Multi-Scale Feature Fusion Module

In addition to our specialized modules, one focused on the semantic distinction of challenging classes and the other addressing class imbalance and small object enhancement, we introduce a third component that serves as a general-purpose convolutional module. This module is designed to process the remaining, well-represented classes by leveraging mid-level features extracted from the backbone. Its role is to ensure broad coverage and semantic consistency across common object categories, while also acting as a bridge that connects the high-level and low-level representations extracted by the other modules.
Let $F_2, F_3 \in \mathbb{R}^{H_i \times W_i \times d_i}$, $i = 2, 3$, represent two mid-level feature maps from intermediate layers of the EfficientNet-B5 backbone. These features contain rich semantic cues while retaining sufficient spatial detail. To ensure compatibility and allow fusion, we first reduce their dimensionality using point-wise convolutions (i.e., $1 \times 1$ convolutions), denoted by transformations $\psi_i: \mathbb{R}^{H_i \times W_i \times d_i} \to \mathbb{R}^{H_i \times W_i \times d}$. The reduced features are
$$\hat{F}_2 = \psi_2(F_2), \quad \hat{F}_3 = \psi_3(F_3), \quad \hat{F}_2, \hat{F}_3 \in \mathbb{R}^{H_i \times W_i \times d}.\tag{6}$$
These transformed feature maps are then upsampled and combined, for instance via concatenation or addition followed by a convolutional block, to produce a refined intermediate representation:
$$F_{gen} = \mathcal{C}\big(\uparrow(\hat{F}_2), \uparrow(\hat{F}_3)\big) \in \mathbb{R}^{H_i \times W_i \times d},\tag{7}$$
where $d$ is the unified feature depth, $\uparrow$ is the upsampling operation, and $\mathcal{C}$ denotes channel-wise concatenation.
This feature map acts as a general-purpose semantic representation, complementing the outputs of the specialized modules, while maintaining cross-scale coherence. Finally, the outputs of all modules (targeted detection features, attention-enhanced features, and generalized CNN features) are fused to form the final decoder input:
$$F_{final} = \mathcal{C}\big(F_{det}, F_{global}, F_{gen}\big),\tag{8}$$
where $\mathcal{C}$ denotes the fusion operation, implemented as channel-wise concatenation. This unified feature map integrates complementary information from detection cues, long-range global dependencies, and local CNN representations, providing the decoder with a rich, multi-scale, and semantically diverse context. The fused representation is then passed to the segmentation decoder to produce the final dense prediction $\hat{Y}$, ensuring that both small objects and complex, visually ambiguous regions are effectively segmented within the scene.
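The two fusion steps (Equations (7) and (8)) can be sketched in PyTorch as follows; the channel depths, the bilinear upsampling mode, and the choice of fusing everything at $F_2$'s resolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Reduces and fuses mid-level features F2, F3 (Eqs. (6)-(7)), then concatenates
    F_det, F_global, and F_gen into the decoder input (Eq. (8)).

    Channel sizes are placeholders; actual depths depend on EfficientNet-B5.
    """
    def __init__(self, d2, d3, d):
        super().__init__()
        self.psi2 = nn.Conv2d(d2, d, kernel_size=1)   # point-wise reduction psi_2
        self.psi3 = nn.Conv2d(d3, d, kernel_size=1)   # point-wise reduction psi_3
        self.conv = nn.Sequential(                    # convolutional refinement block
            nn.Conv2d(2 * d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU(inplace=True))

    def forward(self, f2, f3, f_det, f_global):
        h, w = f2.shape[-2:]                          # fuse at F2's spatial resolution
        f2r = self.psi2(f2)
        f3r = F.interpolate(self.psi3(f3), size=(h, w),
                            mode="bilinear", align_corners=False)     # upsample F3
        f_gen = self.conv(torch.cat([f2r, f3r], dim=1))               # F_gen (Eq. 7)
        f_det = F.interpolate(f_det, size=(h, w), mode="bilinear", align_corners=False)
        f_global = F.interpolate(f_global, size=(h, w), mode="bilinear", align_corners=False)
        return torch.cat([f_det, f_global, f_gen], dim=1)             # F_final (Eq. 8)
```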

3.4. Model Training, Implementation, and Loss Function

3.4.1. Training Strategy and Implementation

The proposed segmentation framework consists of three complementary modules: (i) a detection-based class-specific enhancement module, (ii) an attention-guided refinement module for small and underrepresented classes, and (iii) a generalized CNN module for broad multi-class semantic understanding. To ensure stability and modularity, the detection module is trained independently on a curated subset of binary class labels (e.g., flooded vs. non-flooded roads/buildings). It is then kept frozen (evaluation mode) during the training of the segmentation model, serving solely as a high-level feature extractor for targeted class distinction.
The remaining components, including the backbone encoder (EfficientNet-B5), attention modules, feature refinement layers, and decoder, are trained end to end using full multi-class segmentation labels. The model is implemented in PyTorch (version 2.6.0), with the EfficientNet-B5 backbone initialized with pretrained ImageNet weights to accelerate convergence and leverage transferable low-level features. To improve generalization and robustness, especially in complex disaster scenarios, we apply standard data augmentations, including random horizontal and vertical flips, rotations, brightness and contrast adjustments, and scale jittering. These augmentations help simulate variations in UAV imaging conditions, such as changes in viewpoint, illumination, and object scale. Additionally, we balance the training batches to ensure adequate representation of underrepresented classes, further mitigating class imbalance.
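As a concrete example, an augmentation pipeline along these lines could be configured with the Albumentations library as sketched below; the probabilities and parameter ranges are assumed values, not the exact settings used in the paper.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Augmentations applied jointly to the image and its mask; values are illustrative.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=30, p=0.5),                 # random rotations
    A.RandomBrightnessContrast(p=0.3),         # illumination changes
    A.RandomScale(scale_limit=0.2, p=0.3),     # scale jittering
    A.Resize(512, 512),                        # back to the training resolution
    A.Normalize(),                             # ImageNet statistics by default
    ToTensorV2(),
])

# usage: augmented = train_transform(image=image, mask=mask)
```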

3.4.2. Loss Function

To optimize our segmentation model and effectively handle the challenges posed by class imbalance and spatial misalignment, we adopt a composite loss function that combines focal loss and Jaccard loss (1 − IoU). This dual-objective formulation allows the model to emphasize difficult examples while improving region-level accuracy and boundary alignment.
The focal loss addresses class imbalance by down-weighting easy examples and focusing learning on hard or misclassified instances, making it effective for highly imbalanced datasets. It is defined as follows:
$$\mathcal{L}_{\mathrm{Focal}}(p_t) = -\alpha_t \cdot (1 - p_t)^{\gamma} \cdot \log(p_t),\tag{9}$$
where $p_t \in [0, 1]$ is the predicted probability for the ground-truth class, $\gamma$ is the focusing parameter (set to 2.0 in our experiments), and $\alpha_t$ is a class-specific weighting factor to further balance the loss contribution across categories.
The Jaccard loss (also known as intersection over union or IoU loss) is designed to improve spatial overlap and region alignment between predicted and ground truth segmentation masks. It penalizes discrepancies at the region level, rather than individual pixel-wise errors, making it especially suitable for tasks involving imbalanced foreground–background distributions. The Jaccard loss is computed as follows:
$$\mathcal{L}_{\mathrm{Jaccard}} = 1 - \frac{|\hat{Y} \cap Y| + \varepsilon}{|\hat{Y} \cup Y| + \varepsilon},\tag{10}$$
where $\hat{Y} \in \{0, 1\}^{H \times W}$ is the predicted binary mask and $Y \in \{0, 1\}^{H \times W}$ is the ground-truth binary mask. $|\hat{Y} \cap Y|$ denotes the number of overlapping pixels (i.e., the intersection) between $\hat{Y}$ and $Y$, and $|\hat{Y} \cup Y|$ denotes the total number of pixels in either mask (i.e., the union). Finally, $\varepsilon$ is a small constant (e.g., $10^{-6}$) added to prevent division by zero. $\mathcal{L}_{\mathrm{Jaccard}} \in [0, 1]$, where lower values indicate better segmentation overlap. The overall training objective is the expectation over the sum of both loss components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Focal}} + \mathcal{L}_{\mathrm{Jaccard}}.\tag{11}$$
This combined loss encourages both class-wise discrimination and pixel-level spatial consistency, improving segmentation quality for both dominant and underrepresented classes, especially in disaster-affected scenes with visually similar and fragmented objects. The script in Algorithm 1 outlines the steps of the method. The detection model is first pretrained and frozen, and then the rest of the CSDNet model is trained for $N_e$ epochs.
Algorithm 1 CSDNet training script
Input: Set of input images $\mathcal{D}$ and targeted classes to enhance $\mathcal{T}$;
Output: Trained CSDNet model
Step 1: Pre-train the detection module on $\mathcal{T}$ and freeze it;
Step 2: For epoch = 1 to $N_e$:
        Compute $F_{det}$ using Equation (2);
        Compute $F_{global}$ using Equation (5);
        Compute $F_{gen}$ using Equation (7);
        Compute $F_{final}$ using Equation (8);
        Predict $\hat{Y}$;
        Backpropagate using the loss function $\mathcal{L}_{\mathrm{total}}$;
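A minimal PyTorch rendering of Algorithm 1 with the composite loss of Equations (9)-(11) is sketched below. The per-pixel focal loss and the soft (probability-based) Jaccard term are common formulations assumed here, and the detector/model call signatures are placeholders rather than the actual CSDNet interfaces; in particular, `detector` stands in for a frozen feature extractor built from RetinaNet's intermediate layers.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=1.0, gamma=2.0):
    """Per-pixel multi-class focal loss (Eq. (9)); alpha_t simplified to a scalar here."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

def soft_jaccard_loss(logits, target, num_classes, eps=1e-6):
    """Soft (probability-based) Jaccard/IoU loss averaged over classes (Eq. (10))."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def train_csdnet(model, detector, loader, optimizer, num_classes, epochs):
    """Skeleton of Algorithm 1; `model(images, f_det)` is a hypothetical interface."""
    detector.eval()                               # Step 1: detection module frozen
    for p in detector.parameters():
        p.requires_grad_(False)
    for epoch in range(epochs):                   # Step 2: train the rest of CSDNet
        for images, masks in loader:
            with torch.no_grad():
                f_det = detector(images)          # frozen detection features F_D
            logits = model(images, f_det)         # dense class logits (B, C, H, W)
            loss = focal_loss(logits, masks) + \
                   soft_jaccard_loss(logits, masks, num_classes)   # Eq. (11)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```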

4. Experiments

To assess the effectiveness of our segmentation model, we conducted extensive experiments with both quantitative and qualitative analyses, comparing against state-of-the-art models. Evaluation focused on mean Intersection over Union (mIoU) and per-class IoU for both frequent and underrepresented classes. Qualitative results further demonstrate the model's ability to segment small objects, ambiguous regions, and complex boundaries in disaster scenes. Overall, the results highlight the advantages of CSDNet in improving segmentation for critical classes.

4.1. Datasets

For our tests, we used FloodNet [21,25] and RescueNet [7], which are two standard annotated datasets for disaster scene segmentation.
FloodNet is a high-resolution UAV dataset collected after Hurricane Harvey in the Houston area, designed for post-disaster scene understanding. It provides pixel-level annotations for multiple classes, including non-flooded and flooded roads, buildings, vehicles, trees, pools, and background. The dataset covers diverse urban, suburban, and semi-rural areas affected by varying flood levels. A key challenge is class imbalance and the visual similarity between overlapping classes, such as flooded versus non-flooded structures. It also includes small objects like vehicles and pools, making it suitable for evaluating class-specific enhancement and feature fusion methods. FloodNet contains 2345 images, divided into training (∼60%), validation (∼20%), and test (∼20%) sets.
RescueNet is a high-resolution UAV imagery benchmark collected from areas affected by Hurricane Michael. It comprises ten thematic classes: water, building-no-damage, building-minor-damage, building-major-damage, building-total-destruction, vehicle, road-clear, road-blocked, tree, and pool. The whole dataset has 4494 images, divided into training (∼80%), validation (∼10%), and test (∼10%) sets.
The segmentation task for both datasets involves classes spanning structural (e.g., flooded/non-flooded buildings and roads) and environmental elements (e.g., water, trees, grass, vehicles, pools), which are essential for flood impact analysis. As is common in real-world disaster datasets, FloodNet and RescueNet exhibit class imbalance, with minority classes such as vehicle and pool appearing far less frequently than dominant classes like buildings or trees. This imbalance, illustrated in Figure 2, highlights the challenge of accurately segmenting underrepresented categories.

4.2. Training Setup

All models were implemented in PyTorch and trained for 100 epochs on a single NVIDIA RTX A6000 GPU (NVIDIA, Santa Clara, CA, USA). Input images were resized from their original 4000 × 3000 resolution to 512 × 512 pixels to reduce memory consumption and accelerate training. This resizing significantly decreases the computational load, enabling larger batch sizes and faster iterations, which is critical for efficient model development. However, this reduction in resolution introduces a trade-off, as fine-grained spatial details, particularly small or thin structures, may be lost, potentially affecting segmentation accuracy for subtle or small-scale objects. To mitigate this problem, RetinaNet is trained on higher-resolution images (1024 × 1024), using bounding boxes extracted from segmentation masks of targeted class pairs (e.g., flooded vs. non-flooded regions). The resulting feature maps are then resized to 512 × 512 and integrated into the segmentation pathway.
In the segmentation architecture, four hierarchical features ($F_1$ to $F_4$) were extracted from EfficientNet; $F_2$ and $F_3$ were processed by the generalized CNN module, while $F_4$ was passed through an ASPP block for global context extraction. This stage was trained with the Adam optimizer (batch size = 8, learning rate $1 \times 10^{-4}$) using a cosine annealing schedule. The EfficientNet-B5 backbone was initialized with ImageNet-pretrained weights and fine-tuned end to end. For the main segmentation model, we used an SGD optimizer with momentum 0.9, weight decay $1 \times 10^{-4}$, and an initial learning rate of 0.01, adjusted via a ReduceLROnPlateau scheduler (patience = 10, factor = 0.1, min LR = $1 \times 10^{-4}$). A weighted sampling strategy using a WeightedRandomSampler was employed to address class imbalance, with weights derived from pixel-level class distributions. Data augmentation includes random horizontal flips and rotations.
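For illustration, the optimizer, scheduler, and weighted sampling described above could be wired up roughly as follows; the toy model and dataset, and the way per-image weights are derived from pixel-level class frequencies, are stand-ins rather than the exact implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# --- toy stand-ins for the real CSDNet model and UAV dataset ---
num_classes = 10
model = nn.Conv2d(3, num_classes, kernel_size=1)          # placeholder segmentation head
images = torch.randn(16, 3, 128, 128)
masks = torch.randint(0, num_classes, (16, 128, 128))
train_dataset = TensorDataset(images, masks)

# SGD with momentum and ReduceLROnPlateau, as described for the main model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10, min_lr=1e-4)

# Weighted sampling: each image is weighted by the inverse pixel frequency of its
# rarest class (an assumed recipe for deriving weights from class distributions).
class_freq = np.bincount(masks.flatten().numpy(), minlength=num_classes) / masks.numel()
class_weights = 1.0 / (class_freq + 1e-6)
sample_weights = [class_weights[np.unique(m.numpy())].max() for _, m in train_dataset]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))

train_loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)
# after each validation epoch: scheduler.step(val_miou)
```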

4.3. Evaluation Metrics

To assess model performance, we employed standard semantic segmentation metrics, focusing primarily on class-wise intersection over union (IoU) and mean IoU (mIoU). The IoU is defined as the ratio of the intersection to the union between the predicted and ground-truth masks. The mean IoU is calculated as the average IoU across all classes, offering a holistic view of the model's performance on the multi-class segmentation task. The IoU per class $i$ is defined as follows:
$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i},\tag{12}$$
where $TP_i$, $FP_i$, and $FN_i$ denote true positives, false positives, and false negatives for class $i$, respectively. The mean IoU is computed across all classes:
$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i,\tag{13}$$
where $N$ is the total number of classes. Contrary to the overall pixel accuracy metric, which tends to favor dominant classes, mIoU measures the overlap between the predicted segmentation and the ground truth for each class, providing an intuitive understanding of how well each class is segmented.
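The following small sketch shows one standard way to compute per-class IoU and mIoU (Equations (12) and (13)) from a confusion matrix accumulated over the test set; it is a generic implementation, not the evaluation code used in the paper.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    valid = (target >= 0) & (target < num_classes)
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_per_class(conf):
    """IoU_i = TP_i / (TP_i + FP_i + FN_i) from the confusion matrix (Eq. (12))."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)   # guards against division by zero

# usage: accumulate `conf += confusion_matrix(pred, gt, C)` over the test set,
# then report iou_per_class(conf) and its mean as mIoU (Eq. (13)).
```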

4.4. Ablation Study

To assess the individual contribution of each architectural component in our proposed framework, we performed a series of ablation experiments using the FloodNet dataset by systematically removing the transformer block and the detection module, evaluating their respective impact on segmentation performance. The results are summarized in Table 1, where we also present the results of DeepLabV3+ as a baseline model.

4.4.1. Effect of Removing the Transformer Block

Excluding the transformer block led to a nearly 8% decrease in overall mIoU, with the most severe impact observed on small and underrepresented classes. For example, the IoU for the “Vehicle” class dropped by 20%, while the “Pool” class saw an 18% reduction. These results highlight the crucial role of global context modeling provided by the transformer. The self-attention of the transformer architecture enables the network to capture long-range dependencies, which are often missed by conventional convolutional encoders that primarily focus on local spatial information. This global perspective is particularly beneficial for segmenting small-scale structures, which often appear fragmented or visually ambiguous in cluttered aerial scenes. In the absence of the transformer, the model’s ability to integrate information across distant regions deteriorates, leading to misclassification or omission of small but operationally critical objects.

4.4.2. Effect of Removing Detection Features

The removal of the detection-based targeted class enhancement module resulted in a 4% overall mIoU decrease, with a pronounced effect on visually similar and flood-specific classes. Specifically, the “Road Flooded” and “Building Flooded” categories experienced IoU drops of approximately 10%, while their non-flooded counterparts saw reductions of around 8%. This confirms the effectiveness of integrating detection-derived features, which act as class-specific semantic priors. These features guide the segmentation network to focus on subtle visual cues that distinguish between classes with high spectral and spatial similarity, such as flooded versus non-flooded regions. In real-world disaster scenarios, these distinctions are critical, as flooded infrastructure, debris, and water bodies often share overlapping visual properties.

4.4.3. Effect of Removing Both Transformer and Detection Features

Excluding both the transformer block and the detection-based targeted class enhancement module resulted in a substantial overall mIoU decline to 63.52%. The combined removal severely impaired the segmentation of critical classes, particularly small objects, visually ambiguous regions, and those with subtle appearance differences. This highlights the complementary role of both mechanisms: the transformer provides global context and long-range dependencies for accurate segmentation of small or fragmented objects, while detection-guided features inject class-specific cues that enhance discrimination in complex, overlapping regions. These results underscore the importance of integrating both global attention and detection-aware priors to achieve robust, high-resolution segmentation for real-world UAV-based disaster analysis.
To illustrate the impact of removing the detection and transformer modules, Figure 5 presents three representative examples comparing different ablation configurations described in Table 1. These visual comparisons highlight the contribution of each module to the overall segmentation performance, particularly in challenging scenarios involving small objects and fine structural details. The absence of either module leads to visible degradation in boundary delineation and object localization, confirming their complementary roles in enhancing model robustness.

4.5. Quantitative Results and Analysis

To rigorously assess the effectiveness of our proposed segmentation framework, we conducted comparative experiments against seven state-of-the-art semantic segmentation models, selected for their architectural diversity, proven performance, and relevance to both general and domain-specific tasks. All models were fine-tuned on the FloodNet and RescueNet datasets, respectively, under the same experimental conditions, and all were evaluated using the mean intersection over union (mIoU) and per-class IoU, providing both global and class-level insights into segmentation performance.
The models included in the benchmark are as follows: (1) SegFormer-B0 [15], a lightweight transformer-based architecture using hierarchical encoding, known for its strong generalization across vision tasks; (2) DeepLabV3+ [11], a robust CNN-based model leveraging ASPP to capture multi-scale context, particularly effective at capturing large object structures; (3) U-Net++ [10], a deeply nested encoder–decoder network that enhances the classical U-Net with densely connected skip paths; (4) PSPNet [40], which employs pyramid pooling to collect contextual features from multiple receptive fields; (5) ENet [51], a real-time segmentation model offering a good baseline for lightweight deployments; (6) TransUNet [52], which combines CNN and transformer encoders, enabling it to capture both local texture and global dependencies; and (7) CMTFNet [53], which fuses CNN features with multi-scale transformer-based context for remote sensing segmentation.
All models were retrained from scratch or fine-tuned using ImageNet-pretrained weights and were uniformly configured in terms of data pre-processing, input resolution ( 512 × 512 ), and augmentation strategies. The evaluation was carried out on the test sets, which include diverse flood-affected urban and semi-urban environments, presenting a realistic challenge due to high class imbalance and fine-grained object distinctions. The quantitative results of our model, along with competitive baselines, are presented in Table 2 and Table 3 for FloodNet and RescueNet, respectively.
For both datasets, CSDNet outperformed baseline architectures in both overall segmentation and challenging class performance. For FloodNet, our proposed model achieved an mIoU of 73.03%, significantly outperforming strong baselines such as SegFormer (B0), which achieved 67.15%, representing an absolute improvement of nearly 6%. Compared to other widely adopted models, our method shows consistent gains, outperforming DeepLabV3+ (67.24%) by approximately 5.8%, and UNet++ (62.74%) by over 10.2%. The same comparative trend is observed on the RescueNet dataset, where our model achieved an mIoU of 69.05%, consistently outperforming established baselines. Specifically, it delivered a 5.9% improvement over DeepLabV3+ (63.12%), a 7.5% gain compared to SegFormer (B0) (61.55%), and outperformed UNet++ (63.41%) by over 5.6%. The closest competing model, PSPNet, reached an mIoU of 68.19%, thanks to its Pyramid Pooling Module (PPM), which effectively captures multi-scale contextual information. Yet, it fell short of our approach by nearly 1%. These results underscore the effectiveness of our architecture in handling both global context and class-level ambiguity in complex disaster scenes.
Focusing on the targeted classes, the improvements are even more noteworthy. On FloodNet, for the flooded-road class, our model achieved an IoU of 56.81%, surpassing the second-best score of 53.88% (SegFormer) by almost 3%. The advantage became more pronounced for flooded buildings, where our method reached 75.77%, a substantial improvement of over 18% compared to the next best model, UNet++ (57.31%). The same trend can be observed for RescueNet, where our model achieved the best IoU for hard classes such as building-no-damage and building-total-destruction. These gains directly reflect the strength of our targeted class enhancement strategy and the detection-guided feature integration mechanism.
Our model also demonstrated notable robustness on small and underrepresented classes, which are particularly challenging in disaster scenarios. Notably, smaller and infrequent classes like ‘vehicle’ (58.62% IoU) and ‘pool’ (66.12% IoU) saw significant gains, attributed to the integration of detection-derived features that guide class-specific segmentation. The self-attention block and ASPP further enhanced global context modeling, improving boundary distinction in complex cases, such as ‘road flooded’ achieving 56.81% IoU—an improvement over baselines. For instance, in the vehicle class, models like ENet failed entirely (0.00%) for both datasets. Even stronger models like SegFormer and DeepLabV3+ underperformed in comparison with our model for both datasets.
Importantly, our model maintained competitive or superior performance on frequent and well-represented classes as well. For example, on FloodNet, our model achieved 81.94% IoU on non-flooded roads and 84.53% on tree regions, ensuring that improvements on rare classes did not come at the expense of performance on dominant ones. The same remark can be made for RescueNet, where our model achieved good performance on dominant classes such as the water and tree classes. Overall, these results demonstrate that our model offers well-balanced and superior segmentation performance, excelling across both frequent and rare categories; it is particularly effective in handling class imbalance, small objects, and visually ambiguous regions, which are prevalent in real-world disaster imagery.

4.6. Computational Analysis

Table 4 compares our model with existing architectures in terms of parameter count and inference time. Despite having significantly fewer parameters than DeepLabV3+ and PSPNet, our model offers the best trade-off between accuracy and efficiency, achieving the highest mIoU with competitive inference speed. Reported inference times reflect the average processing time (ms) per 512 × 512 image on an NVIDIA Tesla A100 GPU.
As illustrated in Figure 6, our proposed model exhibits a notably faster and more consistent increase in the IoU metric for underrepresented classes, such as vehicle and pool classes, compared to baseline architectures. This early performance gain, observable within the first few training epochs, underscores the effectiveness of our model in addressing two critical challenges: (i) the reliable segmentation of small, sparsely distributed objects, and (ii) the mitigation of class imbalance. By leveraging detection features, dynamic attention mechanisms and long-range feature interactions, our model gained the ability to capture fine-grained object boundaries and semantic cues even in cluttered or low-contrast regions. Furthermore, the accelerated convergence observed for minority classes suggests that our architecture not only improves final performance but also facilitates more efficient learning during early optimization stages—an essential property for time-constrained or resource-limited deployment scenarios in disaster response applications.

4.7. Qualitative Results

Visual comparisons of the predicted segmentation masks are presented in Figure 7 and Figure 8, respectively. These figures clearly demonstrate that our proposed model achieves more refined, coherent, and spatially accurate boundaries, especially in regions containing small or structurally complex objects such as vehicles, pools, and flooded buildings. In contrast to baseline models that often produce fragmented or over-smoothed predictions in these areas, our method successfully captures fine-grained details and preserves object shapes. These improvements stem from the synergistic design of our architecture modules. While the attention module facilitates better localization of small-scale objects, the detection-based features improve the model’s ability to distinguish visually similar or overlapping classes—an essential capability in disaster scene analysis where ambiguous visual cues are common.
Note finally that both FloodNet and RescueNet contain annotation flaws that can mislead segmentation models and affect evaluation reliability. In many cases, we observe that the quality of ground truth annotations can significantly impact evaluation. Figure 9 shows examples where ground truths are either missing key regions such as small objects like vehicles or pools (and in some cases even larger structures such as buildings) or mislabel certain classes, for example by inaccurately representing the severity of damage to buildings.
These imperfections introduce noise during training and can misguide models, particularly for rare or visually similar classes. These annotation challenges may lead to underestimation of model performance in the corresponding categories when correct predictions deviate from flawed ground truth. This highlights the need for more reliable annotation protocols and model designs, like CSDNet, that integrate global context and class-specific cues to compensate for label imperfections and improve segmentation in complex disaster scenarios.

5. Discussion

CSDNet introduces a hybrid segmentation framework designed to address key challenges in UAV-based disaster scene analysis, notably small object detection, class imbalance, and visual ambiguity. By integrating detection-guided features with global context modeling through lightweight transformers, CSDNet improves both fine-grained localization and semantic discrimination, particularly for rare or visually similar classes such as flooded infrastructure, vehicles, and pools. The inclusion of detection-aware cues allows the model to incorporate high-level semantic priors, while self-attention mechanisms enhance the understanding of global spatial relationships, which is critical for segmenting fragmented or spatially compact objects in cluttered aerial scenes.
Despite these advancements, certain limitations persist. First, while CSDNet improves the segmentation of rare and ambiguous classes, its reliance on detection-derived features requires a targeted subset of classes to be pre-defined for the detection module, limiting flexibility when new, unseen object categories emerge during deployment. Second, although lightweight compared to conventional transformer architectures, the inclusion of self-attention blocks still introduces additional computational overhead, which may constrain real-time performance on resource-limited UAV platforms, especially in large-scale disaster zones. Another challenge lies in handling semantically overlapping regions with inherent class ambiguity, such as distinguishing between flooded roads, flooded buildings, and open water, where boundaries are visually subtle or structurally interconnected. While CSDNet partially mitigates this through multi-scale fusion and class-specific cues, segmentation errors persist in highly complex or occluded regions.
Lastly, the model's generalization may be affected by severe domain shifts, such as variations in sensor modalities, image resolution, geographic regions, or environmental conditions. Future work will explore multi-modal data fusion (e.g., combining RGB with depth or infrared) to disambiguate complex, interdependent classes in post-disaster scenes. We can also incorporate multi-resolution analysis through image patches as well as semantic consistency analysis across frames of UAV video streams. Another promising avenue may involve using domain adaptation and relational modeling (e.g., inter-object relations) to further enhance CSDNet's robustness and scalability across diverse disaster environments [54]. Finally, using semi-supervised learning and/or self-supervised pretraining can further improve generalization and robustness in dynamically evolving disaster environments.

6. Conclusions

In this work, we introduced a novel hybrid semantic segmentation framework designed specifically for the challenges posed by disaster scenarios. Our architecture integrates three complementary components: a lightweight attention mechanism for enhancing the representation of small and underrepresented objects, a detection-guided feature extraction module based on RetinaNet for improving class distinction in visually similar regions, and a context-aware module leveraging multi-scale features and ASPP to capture global scene structure. Together, these components enable more accurate and balanced segmentation, particularly in the presence of class imbalance, small object scales, and semantic overlap, which are common issues in disaster scene understanding. We evaluated our approach on two well-known datasets, and our results demonstrate that our model consistently outperforms established baseline architectures with significant gains, particularly observed on small and underrepresented classes such as vehicles and pools.

Author Contributions

Conceptualization, A.Z. and M.S.A.; methodology, A.Z. and M.S.A.; software, A.Z.; validation, A.Z. and M.S.A.; formal analysis, A.Z. and M.S.A.; investigation, A.Z. and M.S.A.; resources, A.Z. and M.S.A.; data curation, A.Z. and M.S.A.; writing—original draft preparation, A.Z.; writing—review and editing, M.S.A.; visualization, A.Z.; supervision, M.S.A.; project administration, M.S.A.; funding acquisition, M.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant number RGPIN-2020-05653.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krichen, M.; Abdalzaher, M.S.; Elwekeil, M.; Fouda, M.M. Managing natural disasters: An analysis of technological advancements, opportunities, and challenges. Internet Things Cyber-Phys. Syst. 2024, 4, 99–109. [Google Scholar] [CrossRef]
  2. Summers, J.K.; Lamper, A.; McMillion, C.; Harwell, L.C. Observed Changes in the Frequency, Intensity, and Spatial Patterns of Nine Natural Hazards in the United States from 2000 to 2019. Sustainability 2022, 14, 4158. [Google Scholar] [CrossRef]
  3. Bashir, M.H.; Ahmad, M.; Rizvi, D.R.; Abd El-Latif, A.A. Efficient CNN-based disaster events classification using UAV-aided images for emergency response application. Neural Comput. Appl. 2024, 36, 10599–10612. [Google Scholar] [CrossRef]
  4. Lapointe, J.F.; Molyneaux, H.; Allili, M.S. A literature review of AR-based remote guidance tasks with user studies. In Virtual, Augmented and Mixed Reality, Industrial and Everyday Life Applications, Proceedings of the 12th International Conference, VAMR 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, 19–24 July 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 111–120. [Google Scholar]
  5. Khan, A.; Gupta, S.; Gupta, S.K. Emerging UAV technology for disaster detection, mitigation, response, and preparedness. J. Field Robot. 2022, 39, 905–955. [Google Scholar] [CrossRef]
  6. Ijaz, H.; Ahmad, R.; Ahmed, R.; Ahmed, W.; Kai, Y.; Jun, W. A UAV-Assisted Edge Framework for Real-Time Disaster Management. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1001013. [Google Scholar] [CrossRef]
  7. Rahnemoonfar, M.; Chowdhury, T.; Murphy, R. RescueNet: A high resolution UAV semantic segmentation dataset for natural disaster damage assessment. Sci. Data 2023, 10, 913. [Google Scholar] [CrossRef]
  8. Osco, L.P.; Junior, J.M.; Ramos, A.P.M.; de Castro Jorge, L.A.; Fatholahi, S.N.; de Andrade Silva, J.; Matsubara, E.T.; Pistori, H.; Gonçalves, W.N.; Li, J. A review on deep learning in UAV remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102456. [Google Scholar] [CrossRef]
  9. Dimitrovski, I.; Spasev, V.; Loshkovska, S.; Kitanovski, I. U-Net ensemble for enhanced semantic segmentation in remote sensing imagery. Remote Sens. 2024, 16, 2077. [Google Scholar] [CrossRef]
  10. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  11. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  12. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
  13. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 334–349. [Google Scholar]
  14. Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. In Proceedings of the British Machine Vision Conference, Cardiff, UK, 9–12 September 2019; p. 285. [Google Scholar]
  15. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 12077–12090. [Google Scholar]
  16. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: U-Net-like pure transformer for medical image segmentation. In Proceedings of the ECCV Workshops, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  17. Gupta, A.; Watson, S.; Yin, H. Deep learning-based aerial image segmentation with open data for disaster impact assessment. Neurocomputing 2021, 439, 22–33. [Google Scholar] [CrossRef]
  18. Almarzouqi, H.; Saad Saoud, L. Semantic labeling of high-resolution images using EfficientUNets and transformers. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4402913. [Google Scholar] [CrossRef]
  19. Lee, H.; Kim, G.; Ha, S.; Kim, H. Lightweight disaster semantic segmentation for UAV on-device intelligence. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 8821–8825. [Google Scholar]
  20. Li, Z.; Liu, Z.; Yang, Z.; Peng, Z.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef]
  21. Rahnemoonfar, M.; Chowdhury, T.; Sarkar, A.; Varshney, D.; Yari, M.; Murphy, R.R. FloodNet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 2021, 9, 89644–89654. [Google Scholar] [CrossRef]
  22. Guo, D.; Weeks, A.; Klee, H. Robust approach for suburban road segmentation in high-resolution aerial images. Int. J. Remote Sens. 2007, 28, 307–318. [Google Scholar] [CrossRef]
  23. Sang, S.; Zhou, Y.; Islam, M.T.; Xing, L. Small-object sensitive segmentation using across feature map attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6289–6306. [Google Scholar] [CrossRef]
  24. Doshi, J. Residual inception skip network for binary segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 206–2063. [Google Scholar]
  25. Rahnemoonfar, M.; Murphy, R.; Vicens Miquel, M.; Dobbs, D.; Adams, A. Flooded area detection from UAV images based on densely connected recurrent neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1788–1791. [Google Scholar]
  26. Rudner, T.G.J.; Rußwurm, M.; Fil, J.; Pelich, R.; Bischke, B.; Kopačková, V.; Biliński, P. Multi3Net: Segmenting flooded buildings via fusion of multiresolution, multisensor, and multitemporal satellite imagery. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 702–709. [Google Scholar]
  27. Shen, Y.; Zhu, S.; Yang, T.; Chen, C.; Pan, D.; Chen, J. BDANet: Multiscale Convolutional Neural Network With Cross-Directional Attention for Building Damage Assessment From Satellite Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5402114. [Google Scholar] [CrossRef]
  28. Gupta, R.; Hosfelt, R.; Sajeev, S.; Patel, N.; Goodman, B.; Doshi, J.; Heim, E.; Choset, H.; Gaston, M. Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 10–17. [Google Scholar]
  29. Zhang, Y.; Gao, X.; Duan, Q.; Yuan, L.; Gao, X. DHT: Deformable hybrid transformer for aerial image segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6518805. [Google Scholar] [CrossRef]
  30. Khan, B.A.; Jung, J.-W. Semantic segmentation of aerial imagery using U-Net with self-attention and separable convolutions. Appl. Sci. 2024, 14, 3712. [Google Scholar] [CrossRef]
  31. Soleimani, R.; Soleimani-Babakamali, M.H.; Meng, S.; Avci, O.; Taciroglu, E. Computer vision tools for early post-disaster assessment: Enhancing generalizability. Eng. Appl. Artif. Intell. 2024, 136 Pt A, 108855. [Google Scholar] [CrossRef]
  32. Sundaresan, A.A.; Solomon, A.A. Post-disaster flooded region segmentation using DeepLabv3+ and unmanned aerial system imagery. Nat. Hazards Res. 2025, 5, 363–371. [Google Scholar] [CrossRef]
  33. Zhao, Q.; Liu, J.; Li, Y.; Zhang, H. Semantic segmentation with attention mechanism for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5403913. [Google Scholar] [CrossRef]
  34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  35. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  36. Chen, Y.; Dong, Q.; Wang, X.; Zhang, Q.; Kang, M.; Jiang, W.; Wang, M.; Xu, L.; Zhang, C. Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4421–4435. [Google Scholar] [CrossRef]
  37. Liu, R.Z.J.; Pan, H.; Tang, D.; Zhou, R. An Improved Instance Segmentation Method for Fast Assessment of Damaged Buildings Based on Post-Earthquake UAV Images. Sensors 2024, 24, 4371. [Google Scholar] [CrossRef] [PubMed]
  38. Zhu, X.; Liang, J.; Hauptmann, A. MSNet: A Multilevel Instance Segmentation Network for Natural Disaster Damage Assessment in Aerial Videos. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2022–2031. [Google Scholar]
  39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  40. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  41. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
  42. Xu, Y.; Liu, H.; Yang, R.; Chen, Z. Remote Sensing Image Semantic Segmentation Sample Generation Using a Decoupled Latent Diffusion Framework. Remote Sens. 2025, 17, 2143. [Google Scholar] [CrossRef]
  43. Zhang, R.; Zhang, Q.; Zhang, G. LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610713. [Google Scholar] [CrossRef]
  44. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar]
  45. Hebbache, L.; Amirkhani, D.; Allili, M.S.; Hammouche, N.; Lapointe, J.F. Leveraging saliency in single-stage multi-label concrete defect detection using unmanned aerial vehicle imagery. Remote Sens. 2023, 15, 1218. [Google Scholar] [CrossRef]
  46. Azad, R.; Heidary, M.; Yilmaz, K.; Hüttemann, M.; Karimijafarbigloo, S.; Wu, Y.; Schmeink, A.; Merhof, D. Loss Functions in the Era of Semantic Segmentation: A Survey and Outlook. arXiv 2023, arXiv:2312.05391. [Google Scholar] [CrossRef]
  47. Chu, S.; Kim, D.; Han, B. Learning debiased and disentangled representations for semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 8355–8366. [Google Scholar]
  48. Chen, P.; Liu, Y.; Ren, Y.; Zhang, B.; Zhao, Y. A Deep Learning-Based Solution to the Class Imbalance Problem in High-Resolution Land Cover Classification. Remote Sens. 2025, 17, 1845. [Google Scholar] [CrossRef]
  49. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7151–7160. [Google Scholar]
  50. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the International Conference on Learning Representations, Virtual, 22–29 April 2022. [Google Scholar]
  51. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  52. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef] [PubMed]
  53. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
  54. Bouafia, Y.; Allili, M.S.; Hebbache, L.; Guezouli, L. SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions. Signal Process. Image Commun. 2025, 130, 117223. [Google Scholar] [CrossRef]
Figure 1. Sample images from the FloodNet (a) and RescueNet (b) datasets, with their corresponding segmentation masks. For each dataset, a legend indicates the class color mapping.
Figure 2. Class distributions of the FloodNet (a) and RescueNet (b) datasets used in this work. They highlight the severe imbalance across categories, with certain classes like vehicles and pools being significantly underrepresented.
Figure 3. Overall architecture of the proposed model: (1) Targeted class enhancement module, (2) deep contextual attention module, and (3) multi-scale feature fusion module.
Figure 4. Illustration of the Atrous Spatial Pyramid Pooling (ASPP) module.
Figure 5. Qualitative comparison of our model predictions across ablation configurations. Each column corresponds to a different test sample from the FloodNet dataset.
Figure 6. Progression of validation-set IoU over 100 training epochs for the “vehicle” and “pool” classes on the FloodNet dataset.
Figure 7. Segmentation examples using the compared methods. Columns (A–C) correspond to different test samples from the FloodNet dataset. Different colors represent semantic classes, and black circles highlight challenging or misclassified regions.
Figure 8. Segmentation examples using the compared methods. Columns (A–C) correspond to different test samples from the RescueNet dataset. Different colors represent semantic classes, and red circles highlight challenging or misclassified regions.
Figure 9. Annotation challenges. (A) is a sample from the FloodNet dataset; (B,C) are from the RescueNet dataset. Red circles indicate annotation errors such as missing objects or misclassified regions.
Table 1. Ablation study: impact of removing key components on class-level IoU (%) and overall mIoU (%).
Class | Full Model | w/o Transformer | w/o RetinaNet | w/o Transformer & RetinaNet | DeepLabV3+
Vehicle | 58.62 | 34.62 | 47.91 | 31.32 | 44.25
Pool | 66.12 | 48.12 | 60.22 | 47.67 | 51.80
Road Flooded | 56.81 | 53.21 | 45.61 | 43.28 | 52.91
Building Flooded | 75.77 | 73.14 | 65.87 | 48.42 | 47.24
mIoU | 73.03 | 65.74 | 68.36 | 63.52 | 67.24
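
For reference, the per-class IoU and mIoU values reported in Tables 1–3 are assumed to follow the standard pixel-wise definitions:

```latex
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},
\qquad
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c,
```

where TP_c, FP_c, and FN_c denote the pixel-level true positives, false positives, and false negatives for class c, and C is the number of classes.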
Table 2. Per-class IoU (%) comparison across different models and thematic classes of the FloodNet test set (BF: flooded building, BNF: non-flooded building, RF: flooded road, and RNF: non-flooded road).
Model | BF | BNF | RF | RNF | Water | Tree | Vehicle | Pool | Grass | mIoU
SegFormer (B0) [15] | 48.42 | 78.90 | 53.88 | 82.95 | 76.46 | 82.05 | 45.38 | 46.95 | 89.39 | 67.15
PSPNet [40] | 44.18 | 74.53 | 49.60 | 80.14 | 75.05 | 78.55 | 28.40 | 39.89 | 87.54 | 61.99
ENet [51] | 45.39 | 71.30 | 46.24 | 77.68 | 75.18 | 77.81 | 0.00 | 0.00 | 86.64 | 53.36
DeepLabV3+ [11] | 47.24 | 78.66 | 52.91 | 82.81 | 76.39 | 81.97 | 44.25 | 51.80 | 89.14 | 67.24
UNet++ [10] | 57.31 | 76.18 | 45.73 | 80.13 | 65.58 | 79.28 | 26.45 | 47.37 | 86.59 | 62.74
TransUNet [52] | 46.96 | 72.52 | 48.86 | 79.49 | 71.77 | 78.50 | 35.84 | 43.90 | 86.81 | 62.74
CMTFNet [53] | 48.04 | 78.64 | 50.44 | 82.15 | 75.21 | 81.10 | 41.91 | 45.46 | 88.85 | 65.76
Our Model (CSDNet) | 75.77 | 84.99 | 56.81 | 81.94 | 62.26 | 84.53 | 58.62 | 66.12 | 86.26 | 73.03
Table 3. Per-class IoU (%) comparison across different models and thematic classes of the RescueNet test set (BG: background, BND: building no damage, BMND: building minor damage, BMJD: building major damage, BTD: building total destruction, CR: clear road, and BR: blocked road).
Model | BG | Water | BND | BMND | BMJD | BTD | Vehicle | CR | BR | Tree | Pool | mIoU
SegFormer (B0) [15] | 82.76 | 83.86 | 62.95 | 53.33 | 51.48 | 53.41 | 50.65 | 74.50 | 41.22 | 81.12 | 63.02 | 61.55
PSPNet [40] | 84.07 | 84.01 | 67.80 | 60.59 | 59.50 | 61.04 | 57.65 | 75.77 | 46.53 | 81.64 | 71.55 | 68.19
ENet [51] | 76.35 | 74.84 | 45.12 | 36.02 | 31.10 | 41.67 | 0.00 | 52.43 | 16.98 | 74.12 | 0.00 | 40.78
DeepLabV3+ [11] | 82.91 | 81.45 | 65.27 | 53.19 | 50.76 | 52.79 | 59.86 | 72.29 | 41.29 | 81.57 | 62.72 | 64.01
UNet++ [10] | 83.38 | 82.19 | 67.01 | 56.13 | 53.53 | 60.53 | 57.94 | 74.85 | 40.87 | 81.01 | 60.05 | 63.41
TransUNet [52] | 78.81 | 77.93 | 47.68 | 38.11 | 30.12 | 40.59 | 41.40 | 67.22 | 19.02 | 76.71 | 31.91 | 49.95
CMTFNet [53] | 83.84 | 82.70 | 66.00 | 56.69 | 55.74 | 60.40 | 54.49 | 74.29 | 40.50 | 81.96 | 56.70 | 64.84
Our Model (CSDNet) | 84.60 | 84.71 | 67.98 | 59.17 | 58.61 | 62.04 | 60.35 | 76.77 | 48.73 | 82.57 | 74.07 | 69.05
Table 4. Comparison of model efficiency: number of parameters and inference time per image (on a single NVIDIA A100 GPU).
Model | Params (M) | Inference Time (ms) | mIoU (%)
DeepLabV3+ | 39.64 | 9.14 | 67.24
SegFormer-B0 | 3.72 | 10.57 | 67.15
UNet++ | 48.99 | 12.46 | 62.74
CMTFNet | 30.07 | 19.69 | 65.76
ENet | 0.35 | 16.32 | 53.36
TransUNet | 107.68 | 36.47 | 62.74
PSPNet | 53.58 | 12.25 | 61.99
Our Model (CSDNet) | 33.52 | 22.84 | 73.03
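
The inference times in Table 4 correspond to a single image on an NVIDIA A100 GPU. The snippet below sketches one common way to measure such per-image latency; the input resolution, warm-up count, and number of timed runs are illustrative assumptions rather than the exact protocol used here.

```python
# Hypothetical latency-measurement sketch (assumed protocol, for illustration):
# warm up the GPU, then average the forward-pass time over many runs.
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_size=(1, 3, 512, 512), warmup=10, runs=100):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):      # warm-up passes, excluded from the timing
        model(x)
    torch.cuda.synchronize()     # wait for all queued GPU kernels to finish
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0  # ms per image
```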
