DAS-Net: A Lightweight Dynamic Convolution Network with Attention Gates and Deep Supervision for UAV Semantic Segmentation

Kim, Young Jae; Kim, Sang-Chul

doi:10.3390/app16115688

by Young Jae Kim ¹,* and Sang-Chul Kim ²,*

¹ Department of AI·SW, Kookmin University, 77 Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea
² School of Computer Science, Kookmin University, 77 Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea
* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5688; https://doi.org/10.3390/app16115688

Submission received: 20 April 2026 / Revised: 26 May 2026 / Accepted: 26 May 2026 / Published: 5 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Anti-UAV surveillance demands real-time pixel-level UAV localization on resource-constrained gimbal-mounted platforms, yet existing lightweight segmentation models suffer from low recall that propagates to downstream tracking failure. Building on our prior dataset of 605,045 paired visible-light and infrared images, we extend the lightweight ThinDyUNet baseline with three architectural improvements: (1) symmetric dynamic convolution applied to both the encoder and decoder, (2) attention gates filtering skip connections, and (3) deep supervision with auxiliary loss heads. The resulting DAS-Net is evaluated under a three-seed Monte Carlo cross-validation protocol on the full 174,008-image test set. DAS-Net achieves a mean test mIoU of 0.6780 and Dice coefficient of 0.7509 across three independent seeds, outperforming the ThinDyUNet baseline by +6.65 percentage points (pp) in mIoU with statistical significance (one-sided paired t-test, p = 0.045, Cohen’s d = 1.74; full variance and significance analysis in the experimental section). DAS-Net matches the best-performing external baseline (UNet) and exceeds the others (MobileUNet, PAN, PSPNet) while using approximately 14.7× fewer parameters than ResNet-34-based variants. DAS-Net runs at 8.83 ms per image on an NVIDIA A6000 GPU (113 FPS) and 38.44 ms on an NVIDIA Jetson AGX Orin (26 FPS at FP16), demonstrating real-time deployability across server-class and embedded edge platforms.

Keywords: anti-UAV; semantic segmentation; lightweight network; dynamic convolution; attention gate; deep supervision; U-Net; DAS-Net

1. Introduction

The rapid proliferation of unmanned aerial vehicles (UAVs) has raised significant security and privacy concerns, driving the development of anti-UAV technologies [1,2,3]. Among various detection approaches, including radar [4], radio frequency [5], and acoustic methods [6], image-based techniques have gained considerable attention due to advances in deep learning [7,8]. In particular, semantic segmentation, which performs pixel-level classification, offers superior localization accuracy for small UAVs compared to bounding box-based object detection methods [9]. However, the real-time deployment of segmentation models on the resource-constrained platforms typical of anti-UAV systems (e.g., gimbal-mounted cameras, edge AI boxes) remains an open challenge that motivates lightweight architectural design.

Our prior work [10] introduced a large-scale UAV semantic segmentation dataset comprising 605,045 visible-light and infrared images paired with corresponding segmentation masks. Along with the dataset, we proposed ThinDyUNet, a lightweight U-Net variant that incorporates Dynamic Convolution Blocks (DyConvBlocks) in the encoder to improve feature representation while maintaining a small parameter footprint. ThinDyUNet demonstrated competitive segmentation performance with real-time inference speed, establishing a strong baseline for UAV semantic segmentation.

However, ThinDyUNet employs an asymmetric architecture in which only the encoder utilizes dynamic convolution, while the decoder relies on conventional MultiCNNBlocks. This design limits the decoder’s ability to reconstruct fine-grained segmentation maps. Furthermore, the skip connections directly transfer encoder features to the decoder without any filtering mechanism, potentially introducing irrelevant or noisy features that hinder segmentation quality.

To address these limitations, we propose DAS-Net (Dynamic Attention-Supervised Network), which extends ThinDyUNet with three key improvements: (1) symmetric dynamic convolution in both the encoder and decoder to enhance feature representation throughout the network, (2) attention gates on skip connections to selectively suppress irrelevant features before they reach the decoder, and (3) deep supervision with auxiliary loss heads at intermediate decoder stages to improve gradient flow and encourage learning of multi-scale features.

We conducted experiments on the UAV semantic segmentation dataset proposed in [10] under a multi-seed Monte Carlo cross-validation protocol (Section 4.1). DAS-Net achieves a mean test mIoU of 0.6780, outperforming ThinDyUNet (0.6115) by +6.65 pp with statistical significance (one-sided paired t-test, p = 0.045, Cohen’s d = 1.74; Section 4.6) and matching UNet while exceeding the other external baselines (MobileUNet, PAN, PSPNet) with 14.7× fewer parameters (vs. UNet’s ResNet-34 backbone). On the edge, DAS-Net achieves 26.0 FPS on an NVIDIA Jetson AGX Orin (FP16, Section 4.7), confirming its real-time deployability for anti-UAV surveillance. The main contributions of this study are summarized as follows:

Symmetric Dynamic Convolution for UAV Segmentation: To our knowledge, this is the first application of symmetric dynamic convolution in the decoder of a lightweight U-Net for UAV semantic segmentation. Prior dynamic-convolution works (CondConv [11], ODConv [12]) focus on encoder backbones for ImageNet-scale classification. The symmetric extension contributes to a cumulative +0.0665 mIoU improvement (relative +10.9%) of full DAS-Net over ThinDyUNet, at the cost of only 0.32 M additional parameters across all three components (1.34 M → 1.66 M).
Component-Wise Validation with Statistical Rigor: We integrate three lightweight architectural mechanisms—symmetric dynamic convolution, attention gates on skip connections, and three-stage deep supervision (λ = 0.4)—into a unified backbone. Each component is independently ablated under a three-seed protocol (Section 4.1), and the cumulative DAS-Net improvement over ThinDyUNet is verified to be statistically significant (one-sided paired t-test, p = 0.045, Cohen’s d = 1.74; per-component breakdown in Section 4.6).
Deployment-Oriented Evaluation: We benchmark DAS-Net across two platforms: an NVIDIA A6000 GPU (workstation-class) and an NVIDIA Jetson AGX Orin embedded edge device. DAS-Net achieves real-time inference (113 FPS on A6000, 26 FPS on Jetson at FP16) on both platforms, demonstrating practical deployability across the server-to-edge spectrum for anti-UAV surveillance.

2. Related Works

2.1. Lightweight Semantic Segmentation for UAV

U-Net [13] is a widely adopted encoder–decoder architecture for semantic segmentation, originally proposed for biomedical image analysis. Several variants have been developed to improve its performance, including ResUNet++ [14], UNet++ [15], and Attention U-Net [16]. For UAV-specific applications, our prior work [10] proposed a large-scale UAV semantic segmentation dataset and introduced ThinDyUNet, a lightweight model that incorporates dynamic convolution in the encoder. Do et al. [17] proposed ITE-U-Net, which replaces the convolutional encoder with an Identity Transformer Encoder and applies Spatial Pyramid Pooling and Swish activation to further improve segmentation accuracy on the same dataset.

Recent advances continue to push the U-Net family toward small-target and remote-sensing applications. CN-UNet [18], a ConvNeXt-based U-Net augmented with slicing-aided hyper-segmentation, targets infrared small-target detection. KECS-Net [19] combines a Knowledge-Embedded CSwin Transformer U-Net with slicing-aided hyper-segmentation in the same domain. For aerial scenes specifically, Qureshi et al. [20] combine semantic segmentation and YOLO detection of aerial vehicle images to refine small-object boundaries. In a related vein, PKNet [21] explores a parallel interactive Kolmogorov–Arnold Network for infrared small-target detection. While these works share our motivation to retain U-Net’s encoder–decoder efficiency for small-target tasks, they do not address the asymmetric encoder–decoder representation gap that arises in lightweight dynamic-convolution models—the central concern of DAS-Net.

2.2. Dynamic Convolution

Dynamic convolution [22] generates input-dependent convolution kernels by computing attention weights over multiple parallel kernels using a softmax-based mechanism. Unlike standard convolution with fixed weights, dynamic convolution adapts its kernel parameters based on the input features, enhancing representational capacity with minimal additional computational cost. Subsequent works extended this idea along several axes: CondConv [11] introduces routing-based kernel mixtures, while WeightNet [23] unifies weight generation through a single hyper-network. More recent variants such as ODConv [12] further factorize the dynamic kernel along spatial, channel, kernel-size, and filter-number dimensions, achieving stronger accuracy with comparable cost. ThinDyUNet used dynamic convolution exclusively in the encoder, leaving the decoder with conventional convolutional blocks. Most existing dynamic-convolution applications focus on the encoder or feature-extraction backbone, leaving the decoder’s role in dynamic kernel allocation comparatively unexplored. In this work, we extend dynamic convolution to both the encoder and decoder paths to achieve a symmetric architecture tailored to the small-target reconstruction demands of UAV segmentation, where the decoder’s representational capacity is the bottleneck.

2.3. Attention Mechanisms in Segmentation

Attention mechanisms have been widely adopted in semantic segmentation to improve feature selection. Channel- and spatial-attention modules such as SE [24] and CBAM [25] recalibrate intermediate feature maps to emphasize informative regions, and have been integrated into both classification and segmentation backbones with consistent gains. Oktay et al. [16] proposed Attention U-Net, which introduces attention gates at skip connections to suppress irrelevant features from the encoder before concatenation with decoder features. Unlike module-level attention applied within blocks, the attention gate operates on the encoder-to-decoder pathway, gating features based on the decoder’s evolving semantic context—a property well-suited to suppressing background clutter in aerial imagery where UAVs occupy only a small fraction of the image (typically <1% of pixels in long-range surveillance). This mechanism enables the decoder to focus on salient regions, improving segmentation accuracy with <1% added parameters (see Section 3.3 for the exact parameter cost).

2.4. Deep Supervision

Deep supervision [26] introduces auxiliary loss functions at intermediate layers to provide additional gradient signals during training. Zhou et al. [15] incorporated deep supervision into a nested U-Net architecture, demonstrating improved convergence and segmentation accuracy. Beyond medical imaging, deep supervision has also been explored in remote-sensing applications to stabilize the training of deep encoder–decoder networks, where vanishing gradients can disproportionately affect the deepest layers far from the supervisory signal. By applying supervision at multiple decoder stages, the network learns multi-scale feature representations, which is particularly beneficial for segmenting objects of varying sizes, such as UAVs at different distances.

3. Proposed Method

3.1. Overview

The proposed DAS-Net builds upon ThinDyUNet [10] by introducing three architectural improvements: (1) symmetric dynamic convolution applied to both encoder and decoder, (2) attention gates on skip connections, and (3) deep supervision at intermediate decoder stages. The network follows an encoder–decoder structure with seven stages, retaining a lightweight footprint (1.66 M parameters, up from ThinDyUNet’s 1.34 M) suitable for real-time deployment. Each component is described in detail in the following subsections.

3.2. Dynamic Convolution Block

The Dynamic Convolution Block (DyConvBlock) was originally introduced in our prior work [10] for the encoder path. As shown in Figure 1, a DyConvBlock applies K = 2 parallel 3 × 3 convolutional kernels with stride = 1 and padding = 1, whose outputs are aggregated using input-dependent attention weights. All DyConvBlocks throughout DAS-Net operate at a uniform channel width of 64, matching the lightweight design of ThinDyUNet [10].

Given an input feature map x, the output is computed as:

y = \sum_{k=1}^{K} α_k(x) \cdot W_k(x) \qquad (1)

where W_k denotes the k-th convolutional kernel, W_k(x) is its response to the input x, and α_k(x) is the corresponding attention weight generated by a softmax function with temperature scaling (τ = 30). This mechanism allows the network to dynamically adapt its convolutional kernels based on the input, improving representational capacity with minimal parameter overhead.

Attention Head: The mixing coefficients α ∈ R^{B × K} are produced by a lightweight attention branch:

α = \mathrm{softmax}\left( W_2 \cdot \frac{\mathrm{ReLU}(W_1 \cdot \mathrm{GAP}(x))}{τ} \right) \qquad (2)

where GAP(·) denotes global average pooling, W₁ ∈ R^{Hid × C} and W₂ ∈ R^{K × Hid} are 1 × 1 convolutions, and the bottleneck dimension is Hid = max(K, ⌈C/4⌉ + 1). The softmax temperature τ = 30 encourages well-distributed mixing weights during early training while remaining sufficiently concentrated for selective kernel routing at convergence. The value τ = 30 was selected via grid search over {1, 10, 30, 50}, with τ = 30 yielding the best validation mIoU (see prior work [10] for details).
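The effect of the temperature in Equation (2) can be illustrated with a minimal pure-Python sketch. The logit values and the function name are hypothetical, chosen only for illustration; they are not taken from the authors' implementation.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits / tau; a larger tau gives flatter weights."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention-branch outputs for K = 2 kernels (illustrative values).
logits = [3.0, -3.0]

sharp = softmax_with_temperature(logits, tau=1.0)   # nearly one-hot routing
flat = softmax_with_temperature(logits, tau=30.0)   # both kernels stay active
print(sharp)
print(flat)
```

With τ = 30 the two kernels receive similar weights even when the raw logits differ strongly, so both kernels keep receiving gradient, which is the "well-distributed mixing weights" behaviour described above.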

Sample-Wise Kernel Synthesis: Given a kernel bank W ∈ R^{K × C_out × C_in × 3 × 3} shared across the batch, each sample b ∈ {1, …, B} forms its own dynamic kernel θ_b by a weighted sum θ_b = Σ_k α_{b,k} · W_k. We implement this efficiently with a single grouped convolution, setting groups = batch size, so that all B sample-conditional kernels are applied in a single forward call without explicit per-sample loops, preserving GPU parallelism.
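Because convolution is linear in its kernel, mixing the kernels first (θ_b) gives the same result as mixing the per-kernel outputs (Equation (1)). The sketch below checks this on a toy single-channel 1D signal in pure Python. It is an illustration under simplifying assumptions (1D, one channel, made-up values) and does not reproduce the grouped-convolution batching trick.

```python
def correlate1d(signal, kernel):
    """Valid 1D cross-correlation (the operation deep-learning 'conv' layers compute)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]

def synthesize_kernel(alphas, kernel_bank):
    """theta = sum_k alpha_k * W_k, computed tap by tap."""
    return [
        sum(a * kernel[j] for a, kernel in zip(alphas, kernel_bank))
        for j in range(len(kernel_bank[0]))
    ]

# Illustrative values: K = 2 kernels of width 3 and one sample's mixing weights.
kernel_bank = [[1.0, 0.0, -1.0], [0.25, 0.5, 0.25]]
alphas = [0.55, 0.45]
signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

# Route A: mix the kernels first, then convolve once.
theta = synthesize_kernel(alphas, kernel_bank)
mixed_kernel_out = correlate1d(signal, theta)

# Route B: convolve with every kernel, then mix the outputs.
per_kernel = [correlate1d(signal, kernel) for kernel in kernel_bank]
mixed_output = [
    sum(a * out[i] for a, out in zip(alphas, per_kernel))
    for i in range(len(per_kernel[0]))
]

print(mixed_kernel_out)
print(mixed_output)
```

Route A needs a single convolution per sample regardless of K, which is why synthesizing the kernel before convolving keeps the inference cost close to that of a static convolution.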

Normalization and Activation: The dynamic convolution output is normalized with GroupNorm (eight groups) and activated by LeakyReLU. Because GroupNorm statistics are computed per sample rather than across the batch, this choice is robust to the varying effective batch sizes encountered when auxiliary heads are computed during deep supervision, and it suits resource-constrained deployment.

While ThinDyUNet applies DyConvBlocks only in the encoder, DAS-Net adopts a symmetric configuration with DyConvBlocks in both paths. Compared to prior dynamic convolution works (e.g., CondConv [11], ODConv [12]) that typically adopt K ∈ {4, 8} or {8, 16}, we deliberately fix K = 2 for two reasons. First, the parameter overhead of dynamic convolution scales linearly with K, and our binary UAV-vs-background segmentation task does not benefit from the kernel diversity required by ImageNet-scale (1000-class) classification tasks for which CondConv and ODConv were originally designed. Second, preliminary experiments showed that increasing K from 2 to 4 yielded marginal accuracy gain (<0.3 pp mIoU) at the cost of approximately 30% additional parameters and substantially increased per-batch kernel-synthesis overhead. The K = 2 choice therefore preserves the lightweight footprint (1.66 M parameters, Table 1) without sacrificing segmentation quality.

3.3. Attention Gate

Skip connections in U-Net directly concatenate encoder features with decoder features at corresponding resolution levels. However, not all encoder features are equally relevant for segmentation, and irrelevant features may introduce noise that degrades decoder performance [16].

DAS-Net incorporates attention gates at each skip connection to selectively filter encoder features before concatenation, as illustrated in Figure 2. The attention gate computes an attention coefficient α for each spatial location by combining the encoder feature map x_e and the decoder gating signal x_d:

α = σ\left( ψ \cdot \mathrm{ReLU}(W_x \cdot x_e + W_d \cdot x_d + b) \right) \qquad (3)

where σ denotes the sigmoid activation function. The filtered encoder feature x_e' = α · x_e retains only task-relevant information, allowing the decoder to focus on salient regions (UAV-containing spatial locations identified by the gate) for segmentation.

Learned Selection Mechanism: The attention coefficient α is learned end-to-end through the segmentation loss, assigning higher weights where encoder features align with the decoder’s evolving semantic context (UAV regions) and suppressing background clutter. Each gate uses three 1 × 1 convolutions with F_int = F_l / 2 intermediate channels (F_l = 64), adding <1% of the total model parameters.
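As a rough check of the gate's parameter cost, the following sketch counts the weights of the three 1 × 1 convolutions in Equation (3). It assumes that the gating signal also has 64 channels (the uniform channel width), that a single shared bias b is used as written in Equation (3), and that ψ carries its own scalar bias; the last point and the helper names are assumptions, not details confirmed by the text.

```python
def conv1x1_params(in_channels, out_channels, bias=False):
    """Parameter count of a 1 x 1 convolution."""
    return in_channels * out_channels + (out_channels if bias else 0)

def attention_gate_params(f_l=64, f_g=64, f_int=None):
    """Parameters of one gate: W_x, W_d, the shared bias b, and psi."""
    f_int = f_l // 2 if f_int is None else f_int
    w_x = conv1x1_params(f_l, f_int)            # encoder projection W_x
    w_d = conv1x1_params(f_g, f_int)            # decoder (gating) projection W_d
    shared_bias = f_int                         # the single bias b in Equation (3)
    psi = conv1x1_params(f_int, 1, bias=True)   # psi: F_int -> 1 (bias is an assumption)
    return w_x + w_d + shared_bias + psi

TOTAL_PARAMS = 1_660_000  # DAS-Net size reported in the text (1.66 M)

per_gate = attention_gate_params()
# Assumption: one gate per decoder stage, i.e., seven gates in total.
all_gates = 7 * per_gate
print(per_gate, f"{100 * per_gate / TOTAL_PARAMS:.2f}% of the model per gate")
print(all_gates)
```

Under these assumptions a single gate costs a few thousand parameters, well below 1% of the model, and seven such gates add up to roughly 0.03 M parameters.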

3.4. Deep Supervision

To improve gradient flow and encourage multi-scale feature learning, DAS-Net applies deep supervision at the last three decoder stages. Auxiliary segmentation heads are attached to intermediate decoder outputs, each producing a prediction map that is upsampled to the original input resolution and compared against the ground truth mask.

The total training loss is computed as a weighted sum of the main output loss and auxiliary losses:

L_{\mathrm{total}} = L_{\mathrm{main}} + λ \sum_{i=1}^{3} L_{\mathrm{aux}_i} \qquad (4)

where λ = 0.4 controls the contribution of auxiliary losses. During inference, only the main output head is used, so deep supervision adds no computational overhead at test time.

Significance of the Auxiliary Loss: The three auxiliary heads provide direct gradient signals to the last three decoder stages. This mitigates the vanishing-gradient problem, in which gradients propagate weakly through deep encoder–decoder networks and disproportionately affect layers far from the supervisory signal at the network output. The added supervision stabilizes training and encourages multi-scale feature learning, which is particularly beneficial for UAV segmentation, where targets span a wide range of pixel sizes (from a few pixels for distant UAVs to several thousand pixels for close ones) and require representations that single end-of-network supervision tends to under-train. The value λ = 0.4 follows established deep-supervision practice [15].
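The weighted sum in Equation (4) can be sketched in a few lines of pure Python. The soft Dice formulation with a smoothing term eps is a common choice that we assume here, since the exact Dice variant is not spelled out in this section. The toy predictions stand in for auxiliary outputs that have already been upsampled to the input resolution.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flattened probability and binary-mask lists (assumed form)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def deep_supervision_loss(main_pred, aux_preds, target, lam=0.4):
    """L_total = L_main + lam * sum_i L_aux_i, as in Equation (4)."""
    main = dice_loss(main_pred, target)
    aux = sum(dice_loss(pred, target) for pred in aux_preds)
    return main + lam * aux

# Toy 4-pixel example with three auxiliary heads (illustrative values).
target = [0.0, 0.0, 1.0, 1.0]
main_pred = [0.1, 0.2, 0.9, 0.8]
aux_preds = [
    [0.2, 0.3, 0.7, 0.6],
    [0.3, 0.3, 0.6, 0.6],
    [0.4, 0.4, 0.5, 0.5],
]

total = deep_supervision_loss(main_pred, aux_preds, target)
inference_only = deep_supervision_loss(main_pred, [], target)  # aux heads dropped
print(round(total, 4), round(inference_only, 4))
```

Passing an empty list of auxiliary predictions reduces the expression to the main loss alone, mirroring the fact that the auxiliary heads play no role at inference.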

3.5. Network Architecture

The overall architecture of DAS-Net follows a symmetric encoder–decoder structure with DyConvBlocks in both paths (unlike ThinDyUNet’s encoder-only configuration), with a fixed channel width of 64 across all stages. The network consists of seven encoder stages and seven decoder stages. Each encoder stage contains a DyConvBlock followed by a 2 × 2 max pooling layer for downsampling, with an additional DyConvBlock at the bottleneck. Figure 3 illustrates the overall architecture.

Each decoder stage performs bilinear upsampling (scale factor 2), applies an attention gate to filter the corresponding skip connection, concatenates the filtered encoder features, and processes them through a DyConvBlock. The final output is produced by a 3 × 3 convolution followed by a 1 × 1 convolution to generate the segmentation mask.
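To make the stage structure concrete, the sketch below traces the spatial resolution through the network. It assumes a 512 × 512 input (the training resolution given in Section 4.2) and that each of the seven encoder stages halves the resolution exactly once; the function name is ours.

```python
def stage_resolutions(input_size=512, num_stages=7):
    """Side length of the feature map after each encoder and decoder stage."""
    encoder = [input_size]
    for _ in range(num_stages):
        encoder.append(encoder[-1] // 2)  # 2 x 2 max pooling halves each side
    bottleneck = encoder[-1]
    decoder = [bottleneck]
    for _ in range(num_stages):
        decoder.append(decoder[-1] * 2)   # bilinear upsampling, scale factor 2
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = stage_resolutions()
print(encoder)   # resolutions seen on the way down
print(decoder)   # resolutions seen on the way up
```

Under these assumptions the bottleneck operates on a 4 × 4 map, and every decoder stage meets a skip connection of matching resolution, which is what allows the gated encoder features to be concatenated directly.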

The internal designs of the DyConvBlock, attention gate, and deep supervision are detailed in Section 3.2, Section 3.3 and Section 3.4, respectively. Table 1 summarizes the architectural differences and parameter counts of the ablation variants studied in Section 4.5.

The incremental parameter cost of each component is as follows: the symmetric DyConv decoder adds +0.29 M (1.34 M → 1.63 M), the attention gates contribute +0.03 M (1.63 M → 1.66 M), and deep supervision introduces zero inference-time parameters as the auxiliary heads are discarded at test time. The total parameter increase from ThinDyUNet to DAS-Net is therefore only +0.32 M (a 24% increase), while computational complexity (FLOPs) at 512 × 512 input remains within 5% of the baseline.

4. Experiments

4.1. Dataset

We use the UAV semantic segmentation dataset proposed in [10]. The dataset comprises 605,045 paired visible-light and infrared images at 1920 × 1080 resolution, split into 304,677/126,360/174,008 (train/val/test) images with binary segmentation masks of UAV targets. Figure 4 shows representative samples from the dataset, including both visible-light and infrared images, along with their corresponding ground truth masks.

We adopt Monte Carlo cross-validation with three independent trials using random seeds {42, 2024, 8888}; each trial draws a stratified random subset of 20,000 training images and 1000 validation images from the source dataset, while the evaluation is conducted on the complete test set of 174,008 images (roughly 8.7× larger than the training subset), providing a rigorous out-of-sample generalization assessment. The 20K/1K subset sizes match those of ITE-U-Net [17], which keeps the comparison with the most recent state-of-the-art (SOTA) method on this dataset fair, and our results (Section 4.4) confirm that this training budget is sufficient for DAS-Net to attain competitive segmentation accuracy across the full test set. Independent subsets per seed allow the reported across-trial standard deviation (Section 4.6) to capture both model-initialization and data-sampling variance, while stratified sampling preserves the source-set class distribution. Full-dataset training is identified as a future direction in Section 5.
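A seed-controlled stratified subset draw of this kind can be sketched as follows. The stratification attribute (a hypothetical "modality" field), the toy pool of 1000 items, and the function name are illustrative assumptions; the text does not state which attribute the strata are built from.

```python
import random
from collections import defaultdict

def stratified_subset(items, stratum_of, subset_size, seed):
    """Draw subset_size items while preserving each stratum's share of the pool."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    # Proportional allocation with largest-remainder rounding.
    total = len(items)
    quotas = {key: subset_size * len(group) / total for key, group in strata.items()}
    counts = {key: int(quota) for key, quota in quotas.items()}
    leftover = subset_size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda key: quotas[key] - counts[key], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1

    subset = []
    for key in sorted(strata):
        subset.extend(rng.sample(strata[key], counts[key]))
    return subset

# Toy pool: 700 visible-light and 300 infrared items (hypothetical proportions).
pool = [{"id": i, "modality": "visible" if i < 700 else "infrared"} for i in range(1000)]

def modality(item):
    return item["modality"]

subsets = {seed: stratified_subset(pool, modality, 100, seed) for seed in (42, 2024, 8888)}
for seed, subset in subsets.items():
    visible = sum(1 for item in subset if item["modality"] == "visible")
    print(seed, len(subset), visible)
```

Each seed yields a different subset with the same stratum proportions, which is the property that lets the across-trial standard deviation reflect data-sampling variance without shifting the class distribution.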

4.2. Implementation Details

All models are trained using the same setup for fair comparison. Input images and masks are resized to 512 × 512 via bilinear interpolation, and no data augmentation is applied to isolate architectural contributions in the ablation. We use the AdamW optimizer (β₁ = 0.9; β₂ = 0.999; weight decay = 1 × 10⁻²) with a learning rate of 1 × 10⁻⁴ and Dice loss as the objective function. The batch size is 24, and training runs for up to 50 epochs. A validation-loss-based learning rate scheduler with reduction factor of 0.5, patience of 15 epochs, and cooldown of 5 epochs is applied, along with early stopping (patience 30 epochs). For DAS-Net and DeepSupDyUNet, the auxiliary loss weight λ = 0.4 follows established deep-supervision practice [15]. All decoder upsampling operations use bilinear interpolation without learnable parameters. All training runs were conducted on an NVIDIA A6000 GPU (48 GB). For each independent trial, the random seed governs subset sampling, model initialization, and DataLoader shuffling, ensuring reproducibility of all stochastic operations. A threshold of 0.5 is applied to binarize the predicted segmentation masks.
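The binarization step and the overlap metrics can be illustrated on a toy example. This sketch computes the foreground IoU and Dice for a single mask; how the reported mIoU is averaged (over classes or over images) is not specified in this section, and whether the threshold comparison is inclusive is also an assumption.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-pixel probabilities into a 0/1 mask (inclusive threshold assumed)."""
    return [1 if p >= threshold else 0 for p in probabilities]

def iou_and_dice(pred_mask, true_mask):
    """Foreground IoU and Dice from true-positive, false-positive, false-negative counts."""
    tp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred_mask, true_mask) if p == 0 and t == 1)
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (a convention)
    return tp / union, 2 * tp / (2 * tp + fp + fn)

# Toy 6-pixel example (illustrative values).
probs = [0.9, 0.6, 0.4, 0.1, 0.7, 0.2]
truth = [1, 1, 1, 0, 0, 0]

pred = binarize(probs)
iou, dice = iou_and_dice(pred, truth)
print(pred, iou, round(dice, 4))
```

For any non-empty mask pair, Dice is at least as large as IoU, which is consistent with the reported Dice values exceeding the corresponding mIoU values.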

4.3. Baseline Models

We compare DAS-Net against seven models. Four are external baselines: UNet [13] with ResNet-34 backbone and MobileUNet with MobileNetV2 backbone (both ImageNet-pretrained, implemented via [27]), PAN [28], and PSPNet [29] (both with ResNet-34 backbone, ImageNet-pretrained, implemented via [27]). Three are DyConv-family ablation variants: ThinDyUNet [10], FullDyUNet, and DeepSupDyUNet. All models are trained from scratch except UNet and MobileUNet, which use ImageNet-pretrained encoders.

4.4. Results

Table 2 reports test set performance. As shown in Table 2, DAS-Net achieves the highest mean test mIoU (0.6780) and test Dice (0.7509) among the four DyConv-family ablation models, while maintaining only 1.66 M parameters and 8.83 ms inference time per image on an NVIDIA A6000 GPU (workstation-class). Compared to single-seed external baselines, DAS-Net’s test mIoU matches UNet (0.6780 vs. 0.6760), with 14.7× fewer parameters and 1.1× faster inference, and exceeds MobileUNet (0.6185) by +5.95 pp with 3.6× fewer parameters. Among the lightweight (≤2 M parameter) DyConv family, DAS-Net outperforms ThinDyUNet by +6.65 pp (relative +10.9%) in test mIoU. The statistical significance of these improvements under the multi-seed protocol is analyzed in Section 4.6.

Figure 5 presents qualitative segmentation results for three test samples. ThinDyUNet fails to detect the UAV in several cases due to its low recall, while DAS-Net consistently produces masks that closely match the ground truth. Figure 6 shows the training curves for all compared models. DAS-Net converges smoothly and achieves the highest validation mIoU by the end of training.

Figure 7 illustrates the trade-off between model size and segmentation accuracy. DAS-Net achieves the highest mIoU among all lightweight models (<2 M parameters), comparable to UNet (24.4 M), while using approximately 14.7× fewer parameters.

4.5. Ablation Study

To evaluate the contribution of each proposed component, we conduct an ablation study using ThinDyUNet as the baseline. We define two intermediate variants—FullDyUNet, which extends dynamic convolution to the decoder, and DeepSupDyUNet, which adds deep supervision to ThinDyUNet—and the full DAS-Net combining all three improvements. Table 3 presents the ablation results on the test set.

Under the multi-seed protocol, deep supervision provides the largest individual improvement among the three components (DeepSupDyUNet mean test mIoU = 0.6627 vs. ThinDyUNet baseline 0.6115; Δ = +0.0512). The auxiliary heads—discarded during inference—stabilize training by providing direct gradient signals at intermediate decoder stages without any inference-time computational cost. Symmetric dynamic convolution in isolation (FullDyUNet = 0.6451) provides a smaller standalone gain (Δ = +0.0336), partly attributable to the training instability of the asymmetric ThinDyUNet baseline reflected in its multi-seed dispersion (visible as oscillations in Figure 6). The attention gate, when added on top of the symmetric dynamic convolution (DAS-Net vs. FullDyUNet), contributes a further Δ = +0.0329 mIoU.

The cumulative DAS-Net improvement over ThinDyUNet is Δ mIoU = +0.0665—a 10.9% relative gain—combining the three components in a synergistic manner. Deep supervision provides multi-scale gradient signal, the symmetric DyConv decoder provides the representational capacity to act on those signals, and the attention gate filters noise before skip concatenation. This synergistic pattern—with each component providing only modest gain in isolation but a substantially stronger gain in combination—is consistent with prior observations on nested U-Net architectures [15]. The statistical significance of each component contribution under the multi-seed protocol is analyzed in Section 4.6.
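The deltas quoted in this subsection follow directly from the mean test mIoU values of the four variants. The short sketch below recomputes them from the numbers reported in the text; the variable and function names are ours.

```python
# Mean test mIoU per variant, as reported in the ablation discussion.
mean_miou = {
    "ThinDyUNet": 0.6115,
    "FullDyUNet": 0.6451,
    "DeepSupDyUNet": 0.6627,
    "DAS-Net": 0.6780,
}

def delta(model, reference):
    """Absolute mIoU difference, rounded to four decimals."""
    return round(mean_miou[model] - mean_miou[reference], 4)

deep_sup_gain = delta("DeepSupDyUNet", "ThinDyUNet")
sym_dyconv_gain = delta("FullDyUNet", "ThinDyUNet")
attention_gain = delta("DAS-Net", "FullDyUNet")
cumulative_gain = delta("DAS-Net", "ThinDyUNet")
relative_gain_pct = round(100 * cumulative_gain / mean_miou["ThinDyUNet"], 1)

print(deep_sup_gain, sym_dyconv_gain, attention_gain)
print(cumulative_gain, relative_gain_pct)
```

Note that the attention-gate delta is measured against FullDyUNet rather than against the ThinDyUNet baseline, so the three per-component deltas are not taken from a common reference point.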

Hypothesized Regimes of Effectiveness: Based on the component design, we expect symmetric dynamic convolution to contribute most when fine-grained boundary reconstruction is required, attention gates to be most effective on cluttered scenes where skip-connection noise dominates, and deep supervision to benefit small or distant UAV instances. A subset-stratified evaluation to empirically verify these regimes is left as future work.

4.6. Statistical Significance Analysis

To address the concern that single-seed evaluations may conflate architectural improvements with stochastic training noise, we retrained the four DyConv-family ablation models under the multi-seed protocol defined in Section 4.1. The four models—ThinDyUNet, FullDyUNet, DeepSupDyUNet, and DAS-Net—cover the baseline as well as all single-component and combined-component configurations evaluated in the ablation (Section 4.5). External baselines (UNet, MobileUNet, PAN, PSPNet) outside the lightweight target footprint were trained with a single seed for cross-architecture reference; multi-seed analysis was prioritized for the ablation models because the primary statistical question concerns the within-family contribution of each component rather than across-family ranking.

Variance Estimation: Across the three independent seeds, the four DyConv-family models exhibit the variance characteristics summarized in Table 4. DAS-Net exhibits a 2–4× lower seed-to-seed standard deviation than the other three variants (σ_mIoU = 0.0101 vs. 0.0224–0.0435), indicating that combining all three components yields both higher accuracy and more reproducible training. Notably, DeepSupDyUNet alone shows substantial seed-to-seed dispersion (σ = 0.0410), which is reduced roughly 4× when combined with the attention gate in DAS-Net (σ = 0.0101). This suggests that attention gating contributes most to training stability, complementing the multi-scale gradient signal provided by deep supervision.

Component-Wise Significance (Paired t-test): Table 5 reports per-component paired t-test results. We adopt one-sided tests under the directional hypothesis that each architectural component improves segmentation, supplemented with Cohen’s d effect sizes for paired samples. The choice of one-sided test reflects the pre-specified hypothesis direction; the n = 3 protocol limits the power of strict two-sided testing.

The cumulative DAS-Net improvement over ThinDyUNet is statistically significant (p = 0.045, Cohen’s d = 1.74, very large effect). Deep supervision alone reaches significance (p = 0.033, d = 2.16). The attention gate (DAS-Net vs. FullDyUNet) is marginal (p = 0.065) but with a clear large effect size (d = 1.46). Symmetric dynamic convolution evaluated in isolation does not reach significance with three seeds (p = 0.22) primarily due to the training instability of the asymmetric ThinDyUNet baseline (σ = 0.0435, the largest among the four variants). All four comparisons exhibit large to very large Cohen’s d, supporting the architectural contributions; the limited statistical power of n = 3 is acknowledged as a methodological limitation (Section 5).
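The reported p-values and effect sizes can be cross-checked against each other. The sketch below assumes that Cohen's d for paired samples is the mean difference divided by the standard deviation of the differences, in which case t = d·√n, and it uses the closed-form CDF of Student's t distribution with two degrees of freedom (n = 3 seeds). Since the inputs are effect sizes rounded to two decimals, only approximate agreement should be expected.

```python
import math

def paired_t_from_cohens_d(d, n):
    """t statistic of a paired t-test, given the paired-samples effect size d."""
    return d * math.sqrt(n)

def one_sided_p_df2(t):
    """Upper-tail p-value of Student's t with 2 degrees of freedom (closed form)."""
    return 0.5 - t / (2.0 * math.sqrt(t * t + 2.0))

# (Cohen's d, reported one-sided p) for the comparisons quoted in the text.
reported = {
    "DAS-Net vs. ThinDyUNet": (1.74, 0.045),
    "DeepSupDyUNet vs. ThinDyUNet": (2.16, 0.033),
    "DAS-Net vs. FullDyUNet": (1.46, 0.065),
}

recomputed = {}
for name, (d, p_reported) in reported.items():
    t = paired_t_from_cohens_d(d, n=3)
    recomputed[name] = one_sided_p_df2(t)
    print(f"{name}: t = {t:.3f}, p = {recomputed[name]:.4f} (reported {p_reported})")
```

Under these assumptions the recomputed p-values land within about 0.003 of the reported ones, so the quoted effect sizes and significance levels are mutually consistent.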

Reproducibility: For each seed, the 20,000-image training subset is obtained via stratified random sampling, producing distinct but distributionally equivalent subsets across the three trials. Full per-seed metric trajectories and training logs are released with the source code (Data Availability).

4.7. Edge Inference Evaluation

We evaluate the DAS-Net family’s edge deployment feasibility on an NVIDIA Jetson AGX Orin (32 GB, JetPack 5.1.1, CUDA 11.4, cuDNN 8.6.0, PyTorch 2.1.0, MAXN power mode with jetson_clocks enabled). Inference latency is measured at FP16 precision, with batch size 1 and 512 × 512 input resolution, averaged over 100 iterations after 10 warm-up runs. Identical model weights from the A6000 evaluation (Table 2) are used; segmentation accuracy is unchanged.

Table 6 reports per-image inference latency on the Jetson AGX Orin. All four DAS-Net family variants achieve over 25 FPS at 512 × 512 input, exceeding typical real-time anti-UAV surveillance requirements. FullDyUNet (1.63 M, 32.91 ms) demonstrates that the symmetric DyConv decoder itself contributes negligible overhead. DAS-Net’s higher latency (38.44 ms vs. ~32 ms for simpler variants) instead stems from the attention-gate computation at each of the four skip connections (≈1.5 ms per gate on Jetson); this overhead is offset by the segmentation-accuracy improvement reported in Table 2 and Table 3.

Compared to the A6000 desktop GPU (8.83 ms, Table 2), Jetson AGX Orin inference is 4.4× slower for DAS-Net, reflecting the embedded platform’s reduced compute throughput. Nonetheless, 26 FPS at 512 × 512 exceeds the typical 20–25 FPS UAV camera capture rate, confirming its real-time feasibility on current-generation embedded UAV platforms.
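The throughput figures follow from the per-image latencies at batch size 1. A minimal conversion, using only the latencies reported in the text, is sketched below; the variable names are ours.

```python
def fps(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms

a6000_ms = 8.83    # NVIDIA A6000 latency reported in the text
orin_ms = 38.44    # Jetson AGX Orin latency (FP16) reported in the text

a6000_fps = fps(a6000_ms)
orin_fps = fps(orin_ms)
slowdown = orin_ms / a6000_ms

print(round(a6000_fps), round(orin_fps), round(slowdown, 1))
```

At 38.44 ms per frame the Jetson leaves little headroom above a 25 FPS camera stream, so any additional per-frame processing in a deployed pipeline would need to fit within roughly 1.5 ms to keep pace with such a stream.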

Cross-architecture Jetson benchmarking against the external baselines reported in Table 2 (UNet, MobileUNet, PAN, PSPNet) is deferred to follow-up work, as fair edge-platform comparison requires TensorRT-optimized custom kernel development for the dynamic-convolution operator, which is not yet available in standard inference runtimes. We provide cross-architecture latency on the A6000 desktop GPU in Table 2 as a reference point.

5. Discussion

The experimental results highlight three key observations regarding DAS-Net’s performance: (i) accuracy comparable to heavyweight baselines at a fraction of their parameter count, (ii) a precision–recall asymmetry favoring deployment safety, and (iii) cross-platform real-time capability. We discuss each in turn.

Comparison with Baseline Models: Among the external baselines, UNet with a ResNet-34 backbone achieves competitive segmentation accuracy but requires 24.4 M parameters—approximately 14.7× more than DAS-Net. PAN (21.4 M) and PSPNet (21.4 M), despite parameter counts ~13× larger than DAS-Net’s, perform worse than DAS-Net in both mIoU and Dice. These results suggest that lightweight architectures with task-specific design choices can match or surpass large pretrained models in domain-specific segmentation tasks.

Comparison with ThinDyUNet: DAS-Net outperforms ThinDyUNet by +0.0665 mIoU while adding only 0.32 M parameters (1.34 M → 1.66 M). Under the multi-seed protocol, deep supervision provides the largest individual component contribution (+0.0512 mIoU), with symmetric dynamic convolution and the attention gate providing complementary gains (+0.0336 and +0.0329, respectively). These component gains are measured against different reference models (Table 5), so they do not sum to the cumulative +0.0665 mIoU improvement; the contributions overlap rather than add, as detailed in Section 4.5 and Section 4.6. The collective effect enhances the network’s ability to generate accurate segmentation masks for UAVs of varying sizes and distances.

Precision–Recall Trade-Off: The precision–recall profile of DAS-Net aligns with the operational priorities of anti-UAV surveillance: false negatives are unacceptable, while false positives are recoverable. DAS-Net exhibits lower precision (0.8422 vs. 0.9308) but substantially higher recall (0.7643 vs. 0.6449) than ThinDyUNet, indicating fewer false negatives at the cost of more false positives. For anti-UAV deployment, this trade-off favors safety: a missed UAV detection propagates to downstream tracking failure, while a false positive incurs only a bounded operational cost (e.g., a brief gimbal slew or a confirmation step) and can be further mitigated by temporal consistency filtering or a downstream classifier. ThinDyUNet’s 0.6449 recall implies that roughly 35.5% of UAV pixels are undetected, which is operationally unacceptable for surveillance. DAS-Net’s recall improvement (+0.1194) substantially exceeds its precision reduction (−0.0886), as reflected in the higher Dice (0.7509 vs. 0.6766, +0.0743) and mIoU (0.6780 vs. 0.6115, +0.0665).
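For reference, the standard pixel-level definitions behind these figures can be written out from confusion counts. The sketch below uses hypothetical counts for a single image; how the paper aggregates scores across images and seeds is not restated here, so the reported mean Dice and mIoU need not equal values recomputed from the mean precision and recall.

```python
def binary_metrics(tp, fp, fn):
    """Pixel-level scores for the foreground (UAV) class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "dice": dice, "iou": iou,
            "miss_rate": 1.0 - recall}

# Hypothetical counts for one image: 80 UAV pixels found, 10 false alarms, 20 missed.
m = binary_metrics(tp=80, fp=10, fn=20)

# Recall figures reported in the text; the miss rate is the undetected share.
thin_recall, das_recall = 0.6449, 0.7643
print(f"ThinDyUNet miss rate: {1 - thin_recall:.1%}")
print(f"DAS-Net miss rate:    {1 - das_recall:.1%}")
```

Under these definitions, a recall of 0.6449 corresponds to the roughly 35.5% of UAV pixels described as undetected.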

Inference Speed: DAS-Net maintains real-time inference capability with 8.83 ms per image (113 FPS) on an NVIDIA A6000 GPU, comparable to ThinDyUNet (8.29 ms). On the embedded NVIDIA Jetson AGX Orin, DAS-Net achieves 38.44 ms (26 FPS) at FP16 precision (Section 4.7). The attention gates add a modest overhead on Jetson (≈6 ms in total across four skip connections, roughly 15% of the per-image latency), so DAS-Net remains suitable for real-time anti-UAV surveillance.

Limitations and Future Work: This study points to five follow-up directions:

Full-Dataset Training: Our Monte Carlo cross-validation protocol uses 20,000 training samples per trial; extending to the full 304,677-image source split may yield further refinement.
Cross-Dataset Validation: All experiments were conducted on a single UAV semantic segmentation benchmark. Evaluation on additional aerial datasets would strengthen the generalizability claims.
Wider/Deeper-Decoder Ablation: A controlled comparison against a non-dynamic but wider or deeper decoder would further isolate the contribution of dynamic kernel synthesis from general decoder capacity.
Edge-Platform Benchmarking: We report Jetson AGX Orin latency for the DAS-Net family (Section 4.7). Two follow-up directions remain: (i) cross-architecture benchmarking against external baselines (UNet, MobileUNet, PAN, PSPNet); (ii) extension to additional embedded platforms (Jetson Orin Nano, Raspberry Pi 4/5). Both require an optimized implementation of the dynamic-convolution operator, which on Jetson-class devices means TensorRT-optimized custom CUDA kernels. Additionally, input-dependent kernel synthesis may interact with INT8 quantization differently from static convolutions; characterizing this interaction is part of the follow-up work.
Statistical Power: Our three-seed protocol provides limited statistical power (df = 2) for paired comparisons. While the cumulative DAS-Net improvement and the deep-supervision component reach one-sided significance (p < 0.05), the symmetric-dynamic-convolution component in isolation (FullDyUNet vs. ThinDyUNet) shows a medium effect size (d = 0.56) but does not reach significance (p = 0.22), partly due to the training instability of the asymmetric ThinDyUNet baseline. Extension to n ≥ 5 seeds is identified as a methodological refinement for follow-up work, particularly to strengthen attribution for components with high seed-to-seed variance.

6. Conclusions

In this paper, we proposed DAS-Net, a lightweight semantic segmentation model that extends ThinDyUNet through three architectural improvements: symmetric dynamic convolution applied to both the encoder and decoder, attention gates filtering skip connections, and deep supervision at intermediate decoder stages. Under a three-seed Monte Carlo cross-validation protocol on a 174,008-image held-out test set, DAS-Net attains a mean test mIoU of 0.6780 and Dice coefficient of 0.7509, outperforming the ThinDyUNet baseline by +6.65 pp in mIoU (one-sided paired t-test, p = 0.045, Cohen’s d = 1.74). The model has a parameter footprint of 1.66 M, approximately 14.7× smaller than the ResNet-34-based UNet baseline, and runs at 8.83 ms per image on workstation-class hardware (113 FPS) and 38.44 ms per image on an embedded NVIDIA Jetson AGX Orin (26 FPS at FP16). These results position DAS-Net as a practical segmentation model for anti-UAV surveillance pipelines that must balance segmentation fidelity with on-device deployment constraints.

For future work, we plan to (i) train DAS-Net on the full 304,677-image source split to assess potential refinement beyond the multi-seed-validated 20,000-image Monte Carlo training budget, (ii) extend evaluation to additional aerial UAV segmentation benchmarks for broader generalizability assessment, (iii) conduct a controlled wider/deeper-decoder ablation to further isolate the contribution of dynamic kernel synthesis, (iv) extend edge benchmarking to additional embedded platforms (Jetson Orin Nano, Raspberry Pi 4/5) and cross-architecture baselines with TensorRT/INT8 optimization for full deployment validation, and (v) extend the multi-seed protocol to n ≥ 5 seeds to strengthen statistical power for component-wise paired comparisons.

Author Contributions

Conceptualization, Y.J.K. and S.-C.K.; methodology, Y.J.K.; software, Y.J.K.; validation, Y.J.K. and S.-C.K.; formal analysis, Y.J.K.; investigation, Y.J.K.; resources, S.-C.K.; data curation, Y.J.K.; writing—original draft preparation, Y.J.K.; writing—review and editing, S.-C.K.; visualization, Y.J.K.; supervision, S.-C.K.; project administration, S.-C.K.; funding acquisition, S.-C.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Korea Research Institute for Defense Technology Planning and Advancement (KRIT) grant funded by the Korea government [DAPA (Defense Acquisition Program Administration)] (KRIT-CT-23-041, LiDAR/RADAR Supported Edge AI-based Highly Reliable IR/UV FSO/OCC Specialized Research Laboratory, 2024).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The UAV semantic segmentation dataset is publicly available at https://github.com/SCKIMOSU/uav (accessed on 17 June 2025). The source code of DAS-Net is available at https://github.com/niceyoungjae/DAS-net (accessed on 25 May 2026).

Acknowledgments

The authors acknowledge the use of GPU computing resources at the Department of AI·SW, Kookmin University, and thank Subproject 1 of the project for the loan of the NVIDIA Jetson AGX Orin device used in the edge inference experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
IR	Infrared
VL	Visible Light
mIoU	Mean Intersection over Union
FPS	Frames Per Second
DAS-Net	Dynamic Attention-Supervised Network
AG	Attention Gate
GN	Group Normalization
GAP	Global Average Pooling
PAN	Pyramid Attention Network
PSPNet	Pyramid Scene Parsing Network
DyConvBlock	Dynamic Convolution Block
CNN	Convolutional Neural Network

References

Wang, B.; Li, Q.; Mao, Q.; Wang, J.; Chen, C.L.P.; Shangguan, A.; Zhang, H. A Survey on Vision-Based Anti Unmanned Aerial Vehicles Methods. Drones 2024, 8, 518.
Mohsan, S.A.H.; Othman, N.Q.H.; Li, Y.; Alsharif, M.H.; Khan, M.A. Unmanned Aerial Vehicles (UAVs): Practical Aspects, Applications, Open Challenges, Security Issues, and Future Trends. Intell. Serv. Robot. 2023, 16, 109–137.
Aggarwal, V.; Kaushik, A.R.; Jutla, C.; Ratha, N. Enhancing Privacy and Security of Autonomous UAV Navigation. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, 25–27 June 2024; pp. 518–523.
Li, Y.; Fu, M.; Sun, H.; Deng, Z.; Zhang, Y. Radar-Based UAV Swarm Surveillance Based on a Two-Stage Wave Path Difference Estimation Method. IEEE Sens. J. 2022, 22, 4268–4280.
Vasant Ahirrao, Y.; Yadav, R.P.; Kumar, S. RF-Based UAV Detection and Identification Enhanced by Machine Learning Approach. IEEE Access 2024, 12, 177735–177745.
Sun, Y.; Li, J.; Wang, L.; Xv, J.; Liu, Y. Deep Learning-Based Drone Acoustic Event Detection System for Microphone Arrays. Multimed. Tools Appl. 2024, 83, 47865–47887.
Marmanis, D.; Wegner, J.D.; Galliani, S.; Schindler, K.; Datcu, M.; Stilla, U. Semantic Segmentation of Aerial Images with an Ensemble of CNNs. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 473–480.
Benjdira, B.; Bazi, Y.; Koubaa, A.; Ouni, K. Unsupervised Domain Adaptation Using Generative Adversarial Networks for Semantic Segmentation of Aerial Images. Remote Sens. 2019, 11, 1369.
Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-Based Anti-UAV Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334.
Kim, S.-C.; Jang, Y.M. A Semantic Segmentation Dataset and Real-Time Localization Model for Anti-UAV Applications. Appl. Sci. 2025, 15, 7183.
Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. CondConv: Conditionally Parameterized Convolutions for Efficient Inference. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 1305–1316.
Li, C.; Zhou, A.; Yao, A. Omni-Dimensional Dynamic Convolution. In Proceedings of the International Conference on Learning Representations (ICLR 2022, Spotlight), Online, 25–29 April 2022; Available online: https://api.semanticscholar.org/CorpusID:251647798 (accessed on 25 May 2026).
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; De Lange, T.; Halvorsen, P.; Johansen, H.D. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–230.
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867.
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
Do, T.-D.; Tran, L.-A.; Lee, J.; Hong, S.K. Improved U-Net with Identity Transformer Encoder for Efficient UAV Semantic Segmentation. IEEE Access 2025, 13, 181034–181045.
Li, L.; Liu, L.; Cheng, F.; He, Y.; Zhong, Z. CN-UNet: ConvNeXt UNet with Slicing-Aided Hyper Segmentation for Infrared Small Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 84–98.
Li, L.; Liu, L.; Huang, D.; Wang, S.; Wang, X.; He, Y. KECS-Net: Knowledge-Embedded CSwin-UNet with Slicing-Aided Hypersegmentation for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2025, 23, 7000105.
Qureshi, A.M.; Butt, A.H.; Alazeb, A.; Mudawi, N.A.; Alonazi, M.; Almujally, N.A.; Jalal, A.; Liu, H. Semantic Segmentation and YOLO Detector over Aerial Vehicle Images. Comput. Mater. Contin. 2024, 80, 3315–3332.
Yan, X.; Ye, W.; Wang, C.; Xia, C.; Xu, J.; Wang, Z. PKNet: Infrared Small Target Detection via Parallel Interactive Kolmogorov–Arnold Network. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5010114.
Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic Convolution: Attention over Convolution Kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039.
Ma, N.; Zhang, X.; Huang, J.; Sun, J. WeightNet: Revisiting the Design Space of Weight Networks. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12360, pp. 776–792.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 3–19.
Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-Supervised Nets. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), San Diego, CA, USA, 9–12 May 2015; pp. 562–570.
Iakubovskii, P. Segmentation Models Pytorch. Available online: https://github.com/qubvel-org/segmentation_models.pytorch (accessed on 30 March 2026).
Mei, Y.; Fan, Y.; Zhang, Y.; Yu, J.; Zhou, Y.; Liu, D.; Fu, Y.; Huang, T.S.; Shi, H. Pyramid Attention Networks for Image Restoration. arXiv 2020, arXiv:2004.13824.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.

Figure 1. Structure of the Dynamic Convolution Block (DyConvBlock) used in both the encoder and decoder of DAS-Net. The block generates K = 2 input-dependent mixing coefficients via global average pooling and softmax with temperature τ = 30, then computes a per-sample weighted combination of two 3 × 3 convolutional kernels, followed by GroupNorm (eight groups) and LeakyReLU activation.
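The kernel-mixing step in this caption can be illustrated with a small pure-Python sketch. The routing logits and the toy kernels below are hypothetical, and a single 3 × 3 slice stands in for a full convolution weight tensor; the layers that map the pooled features to logits are not specified in the caption and are omitted.

```python
import math

def softmax_with_temperature(logits, tau=30.0):
    """Softmax over logits / tau; a large tau pushes the weights toward uniform."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mix_kernels(kernels, weights):
    """Weighted sum of K same-shaped 3 x 3 kernels (nested lists)."""
    return [[sum(w * k[r][c] for k, w in zip(kernels, weights)) for c in range(3)]
            for r in range(3)]

# Hypothetical routing logits for one sample (K = 2); in the block these would be
# derived from the globally average-pooled input.
logits = [3.0, 0.0]
pi_sharp = softmax_with_temperature(logits, tau=1.0)
pi_soft = softmax_with_temperature(logits, tau=30.0)

k0 = [[1.0] * 3 for _ in range(3)]           # toy kernels, not learned weights
k1 = [[0.0] * 3 for _ in range(3)]
mixed = mix_kernels([k0, k1], pi_soft)
```

With τ = 30 the same logits give nearly equal mixing coefficients, whereas τ = 1 would let one kernel dominate; a high temperature therefore keeps both kernels active in the per-sample combination.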

Figure 2. Structure of the attention gate used at each skip connection. The encoder feature $x_{e}$ and decoder gating signal $x_{d}$ are transformed by 1 × 1 convolutions, summed, and passed through ReLU, a 1 × 1 convolution, and sigmoid to produce the spatial attention coefficient α. The filtered output $x_{e}^{'} = \alpha \cdot x_{e}$ retains only task-relevant features.
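Written as equations, the gate described in the Figure 2 caption takes the following form. The symbols $W_{e}$, $W_{d}$, and $\psi$ for the three 1 × 1 convolutions are notation introduced here for illustration, bias terms are omitted, and any resampling needed to align $x_{d}$ with the resolution of $x_{e}$ is not shown because the caption does not specify it.

```latex
\begin{aligned}
q &= \psi\bigl(\mathrm{ReLU}(W_{e}\, x_{e} + W_{d}\, x_{d})\bigr), \\
\alpha &= \sigma(q), \qquad \alpha \in (0, 1), \\
x_{e}^{'} &= \alpha \cdot x_{e} .
\end{aligned}
```

Here $\sigma$ is the sigmoid, so α acts as a per-pixel weight that scales the encoder feature before it is passed across the skip connection.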

Figure 3. Symmetric encoder–decoder architecture of DAS-Net. The encoder (left) consists of seven DyConvBlock stages with MaxPooling, and the decoder (right) mirrors the structure with upsampling and DyConvBlocks. Attention gates filter skip connections at each stage. Deep supervision auxiliary heads are attached to the last three decoder stages during training.

Figure 4. Sample images from the UAV semantic segmentation dataset. (a) VL input. (b) IR input (clear). (c) IR input (foggy). (d–f) Corresponding ground truth segmentation masks (Note: Chinese OSD labels in panels (a–c)—前端可见光/红外 [Front-end VL/IR], 方位 [Azimuth], 俯仰 [Elevation]—are inherent to the source dataset and preserved unchanged for traceability).

Figure 5. Qualitative segmentation comparison on three test samples. Columns: input image; ground-truth mask; ThinDyUNet; DAS-Net (ours, bold column); MobileUNet; UNet; PAN; PSPNet. Rows: (top) clear VL scene; (middle) IR scene with cluttered background; (bottom) IR scene with small distant UAV. DAS-Net produces masks consistently closer to ground truth, particularly for small and partially occluded targets (Note: Chinese OSD labels in the input columns are inherent to the source dataset, as detailed in the Figure 4 caption).

Figure 6. Training curves for all compared models. (a) Validation mIoU over 50 epochs. (b) Validation loss over 50 epochs.

Figure 7. Trade-off between model size (parameters) and segmentation accuracy (mIoU). DAS-Net achieves the highest mIoU among lightweight models with only 1.66 M parameters.

Table 1. Architectural comparison of model variants.

Model	Encoder	Decoder	Attention Gate	Deep Supervision	Params (M)
ThinDyUNet	DyConvBlock	MultiCNNBlock	-	-	1.34
FullDyUNet	DyConvBlock	DyConvBlock	-	-	1.63
DeepSupDyUNet	DyConvBlock	MultiCNNBlock	-	✓ (λ = 0.4)	1.34
DAS-Net (Ours)	DyConvBlock	DyConvBlock	✓	✓ (λ = 0.4)	1.66

Notes: ✓ indicates the component is enabled (this convention applies to all tables in this paper).

Table 2. Comparison of all evaluated models on the test set. Values are mean across n = 3 seeds for DyConv-family ablation models (ThinDyUNet, FullDyUNet, DeepSupDyUNet, DAS-Net); single-seed values for external baselines (UNet, MobileUNet, PAN, PSPNet). Per-seed variance and statistical significance analysis are reported in Section 4.6. Bold = best.

Model	Params (M)	Precision	Recall	Dice	mIoU	ms	FPS
ThinDyUNet	1.34	0.9308	0.6449	0.6766	0.6115	8.29	120.6
PSPNet	21.4	0.8554	0.6685	0.6768	0.6094	11.46	87.3
MobileUNet	6.0	0.9248	0.6593	0.6790	0.6185	10.23	97.7
PAN	21.4	0.9055	0.7047	0.7045	0.6438	11.16	89.6
FullDyUNet	1.63	0.8816	0.6994	0.7159	0.6451	9.04	110.7
DeepSupDyUNet	1.34	0.9268	0.7063	0.7222	0.6627	8.26	121.0
UNet	24.4	0.9004	0.7333	0.7413	0.6760	9.93	100.7
DAS-Net (Ours)	1.66	0.8422	0.7643	0.7509	0.6780	8.83	113.2

Table 3. Component-wise ablation on the test set. Values are mean across n = 3 seeds. Per-component statistical significance (paired t-test, Cohen’s d) is reported in Section 4.6. Bold = best.

Model	Sym. Decoder	Attn Gate	Deep Sup	Precision	Recall	Dice	mIoU
ThinDyUNet (baseline)	-	-	-	0.9308	0.6449	0.6766	0.6115
FullDyUNet	✓	-	-	0.8816	0.6994	0.7159	0.6451
DeepSupDyUNet	-	-	✓	0.9268	0.7063	0.7222	0.6627
DAS-Net (Ours)	✓	✓	✓	0.8422	0.7643	0.7509	0.6780

Notes: ✓ indicates the component is enabled.
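The mIoU differences between variants can be recomputed from the Table 3 means, as in the check below. Variable names are chosen for illustration. Note that each difference is taken against a specific reference model, so the single-component gains over ThinDyUNet are not expected to sum to the cumulative gain.

```python
# Mean test mIoU over three seeds, copied from Table 3.
miou = {"ThinDyUNet": 0.6115, "FullDyUNet": 0.6451,
        "DeepSupDyUNet": 0.6627, "DAS-Net": 0.6780}

def delta(model, reference):
    """mIoU gain of model over reference, rounded to table precision."""
    return round(miou[model] - miou[reference], 4)

sym_dyconv = delta("FullDyUNet", "ThinDyUNet")      # symmetric decoder alone
deep_sup = delta("DeepSupDyUNet", "ThinDyUNet")     # deep supervision alone
das_vs_full = delta("DAS-Net", "FullDyUNet")        # remaining components on FullDyUNet
cumulative = delta("DAS-Net", "ThinDyUNet")

print(sym_dyconv, deep_sup, das_vs_full, cumulative)
print("chained:", round(sym_dyconv + das_vs_full, 4))
```

The chained path ThinDyUNet → FullDyUNet → DAS-Net reproduces the cumulative figure, while adding the deep-supervision gain measured on ThinDyUNet on top of it would overstate the total.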

Table 4. Multi-seed variance (n = 3) on the test set. CI = 95% confidence interval based on Student’s t (df = 2).

Model	Mean mIoU	σ mIoU	95% CI mIoU	Mean Dice	σ Dice	95% CI Dice
ThinDyUNet	0.6115	0.0435	[0.5034, 0.7196]	0.6766	0.0442	[0.5667, 0.7864]
FullDyUNet	0.6451	0.0224	[0.5895, 0.7007]	0.7159	0.0212	[0.6633, 0.7685]
DeepSupDyUNet	0.6627	0.0410	[0.5609, 0.7645]	0.7222	0.0415	[0.6192, 0.8253]
DAS-Net (Ours)	0.6780	0.0101	[0.6529, 0.7031]	0.7509	0.0087	[0.7293, 0.7725]
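The confidence intervals in Table 4 follow the usual Student’s t construction, mean ± t · σ / √n. The sketch below recomputes them from the tabulated means and standard deviations, using the two-sided 95% quantile for df = 2 (approximately 4.3027) as a hard-coded constant. Because the inputs are rounded to four decimals, the recomputed bounds can differ from the table in the last digit.

```python
import math

T_CRIT_DF2 = 4.3027   # two-sided 95% Student's t quantile for df = 2

def ci95(mean, sd, n=3, t_crit=T_CRIT_DF2):
    """Two-sided 95% confidence interval for a mean from n seed results."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Mean and standard deviation of mIoU, copied from Table 4.
rows = {"ThinDyUNet": (0.6115, 0.0435), "FullDyUNet": (0.6451, 0.0224),
        "DeepSupDyUNet": (0.6627, 0.0410), "DAS-Net": (0.6780, 0.0101)}
for name, (mean, sd) in rows.items():
    lo, hi = ci95(mean, sd)
    print(f"{name:14s} [{lo:.4f}, {hi:.4f}]")
```

With only three seeds the t quantile is large, which is why a standard deviation of about 0.04 already produces an interval roughly 0.2 wide.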

Table 5. Component-wise paired t-test results (n = 3; df = 2). p-values are one-sided under the directional hypothesis that each component improves segmentation. Cohen’s d interpretation: |d| ≥ 0.2 (small); ≥0.5 (medium); ≥0.8 (large); ≥1.2 (very large). Significance markers: * p < 0.05; marginal = 0.05 ≤ p < 0.10.

Comparison (Effect)	Δ mIoU	t	p (1-Sided)	Cohen’s d	Verdict
FullDyUNet vs. ThinDyUNet (Sym DyConv)	+0.0336	0.96	0.22	0.56	medium, n.s.
DeepSupDyUNet vs. ThinDyUNet (Deep Sup)	+0.0512	3.74	0.033 *	2.16	very large
DAS-Net vs. FullDyUNet (Attn Gate)	+0.0329	2.52	0.065	1.46	large, marginal
DAS-Net vs. ThinDyUNet (cumulative)	+0.0665	3.02	0.045 *	1.74	very large

Notes: n.s. = not statistically significant (p > 0.05).
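The p-values and effect sizes in Table 5 can be cross-checked from the t statistics alone. For df = 2 the Student’s t distribution has a closed-form CDF, F(t) = 1/2 + t / (2√(2 + t²)), and for a paired design the effect size based on the standard deviation of the differences satisfies d = t / √n. The function names below are illustrative, and since the tabulated t values are rounded to two decimals, the recomputed p-values agree with the table only to within a few thousandths.

```python
import math

def one_sided_p_df2(t):
    """Upper-tail p-value of Student's t with df = 2 (closed-form CDF)."""
    return 0.5 - t / (2.0 * math.sqrt(2.0 + t * t))

def cohens_d_paired(t, n=3):
    """Paired-samples effect size recovered from the t statistic: d = t / sqrt(n)."""
    return t / math.sqrt(n)

# t statistics copied from Table 5 (rounded to two decimals there).
t_stats = {"Sym DyConv": 0.96, "Deep Sup": 3.74, "Attn Gate": 2.52, "cumulative": 3.02}
for name, t in t_stats.items():
    print(f"{name:11s} p = {one_sided_p_df2(t):.3f}  d = {cohens_d_paired(t):.2f}")
```

This also makes the limited power of the three-seed design concrete: with df = 2, a t statistic of about 2.92 is needed for one-sided significance at the 0.05 level.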

Table 6. Jetson AGX Orin inference latency (FP16, batch = 1, 512 × 512).

Model	Params (M)	Latency (ms)	FPS
ThinDyUNet	1.34	32.16	31.1
FullDyUNet	1.63	32.91	30.4
DeepSupDyUNet	1.34	32.23	31.0
DAS-Net (Ours)	1.66	38.44	26.0

Share and Cite

MDPI and ACS Style

Kim, Y.J.; Kim, S.-C. DAS-Net: A Lightweight Dynamic Convolution Network with Attention Gates and Deep Supervision for UAV Semantic Segmentation. Appl. Sci. 2026, 16, 5688. https://doi.org/10.3390/app16115688
