DCA-DeepLab: Dual-Coordinate Attention DeepLab with Adaptive Focal Loss for Cotton Growth Semantic Segmentation from UAV Remote Sensing Images

Jia, Liruizhi; Gao, Jiazhan; Li, Zuolong; Shi, Heng; Zhu, Jihong

doi:10.3390/drones10060456


by Liruizhi Jia ^1,†, Jiazhan Gao ^1,†, Zuolong Li ^2, Heng Shi ^2,* and Jihong Zhu ^1,2,*

^1 School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

^2 Department of Precision Instrument, Tsinghua University, Beijing 100084, China

^* Authors to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Drones 2026, 10(6), 456; https://doi.org/10.3390/drones10060456

Submission received: 5 May 2026 / Revised: 5 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

(This article belongs to the Section Drones in Agriculture and Forestry)


Highlights

What are the main findings?

A UAV cotton growth dataset covering four classes (three foreground growth levels and the background) was constructed for fine-grained segmentation evaluation.
DCA-DeepLab improves cotton growth segmentation by modelling row and column field structure, adaptively fusing features and addressing class imbalance.

What are the implications of the main findings?

Encoding directional crop row priors benefits fine-grained UAV agricultural segmentation.
The proposed framework provides a practical basis for field-level cotton growth monitoring and precision management.

Abstract

UAV remote sensing provides centimetre-level imagery for fine-grained cotton growth monitoring, yet existing segmentation models face three challenges: cotton fields exhibit a pronounced row and column structure that standard convolutions struggle to capture; conventional decoders fuse features statically, suppressing fine boundary cues; and the pixel-level class distribution is severely imbalanced. We present DCA-DeepLab, built on DeepLabv3+ with three task-specific components: a Dual-Coordinate Attention Gating (DCAG) module that decouples horizontal and vertical dependencies to encode row and column structures; a Multi-Scale Attention-Guided Modulated Feature Merging (MSAM-MFM) module that reweights semantic and detail features at each location; and an adaptive pixel-level modulated focal loss (APMFL), which focuses training on hard, minority-class pixels. We construct a cotton growth dataset of 11,745 UAV patches with four semantic classes. On this dataset and the public LoveDA benchmark, DCA-DeepLab attained the highest mIoU among the compared methods (51.74% and 51.71%), exceeding the strongest cotton baseline by 1.10 percentage points. Relative to DeepLabv3+, the Vigorous and Sparse minority-class IoUs improved by 3.51 and 1.91 percentage points, respectively, and Vigorous recall rose from 51.85% to 60.04%, with only 3.9% more parameters. These results show that encoding directional structure and adaptively balancing class contributions benefits fine-grained UAV crop segmentation.

Keywords:

crop monitoring; class imbalance; directional feature encoding; precision agriculture

1. Introduction

Cotton is one of the most important natural fibre crops worldwide and underpins the textile, garment and international trade industries [1,2]. According to U.S. Department of Agriculture statistics for the 2024–2025 marketing year, global cotton production reached approximately 26 million tonnes, with China being one of the largest producers [3]. Xinjiang has become the dominant Chinese production region, supported by abundant solar radiation, a suitable climate and large-scale mechanised cultivation [4]; according to China’s National Bureau of Statistics, national cotton output reached approximately 6.64 million tonnes in 2025, of which Xinjiang produced 6.165 million tonnes, representing 92.8% of the national total [5]. The high concentration of cotton production in this single region makes timely and reliable growth monitoring particularly valuable for stable supply, yield estimation and sustainable resource use.

The growth status of cotton is a comprehensive indicator that reflects plant vigour, phenological progress and potential yield, providing critical information for irrigation scheduling, fertilisation and pest control [6,7]. Among all developmental stages, the flowering and boll-forming stage is especially decisive; cotton plants transition from vegetative to reproductive growth during this period, and the dry matter accumulation, boll number and fibre development achieved here largely determine the final yield and quality [8,9,10]. Plant status at this stage is jointly influenced by climate, soil, water and nutrient management, as well as pests and diseases, producing large within-field variability in growth conditions. Recognising the spatial pattern of cotton growth at this critical stage is therefore essential for precision management of mechanised plantations.

Conventional growth monitoring relies on field inspection and expert judgement, which is subjective, labour-intensive and difficult to scale to thousands of hectares [11]. Satellite platforms such as Sentinel-2 and Landsat-8 have been widely adopted for crop growth assessment, but their spatial resolution (typically 10–30 m) is too coarse to resolve the fine canopy textures characteristic of cotton, and their data quality is often degraded by clouds, atmospheric effects and revisit constraints [12,13]. Unmanned aerial vehicle (UAV) remote sensing, in contrast, offers centimetre-level ground sampling distance, flexible flight scheduling, on-demand revisit and the ability to carry visible, multispectral and thermal payloads, and it has therefore become a preferred data source for fine-grained crop monitoring [14,15,16]. Equally important for cotton, UAV imagery captures the row and column planting structure of mechanised fields with high fidelity. This structural cue is largely lost at satellite resolution, yet it proves to be highly informative for downstream growth segmentation. Realising this potential depends on the analysis method used to translate such high-resolution imagery into reliable growth status maps.

Traditional approaches to crop growth recognition rely on hand-crafted spectral indices, textural descriptors and classical machine learning models such as support vector machines and random forests. Representative studies model leaf chlorophyll, SPAD and LAI from UAV multispectral imagery [17,18], fuse multiple physiological and structural indicators into composite growth indices for cotton and other crops [19,20], or augment observations with multi-temporal sequences to better track phenological dynamics [21]. These approaches work well when the targets and conditions are constrained, but their dependence on engineered features limits robustness under cluttered backgrounds and fine-grained class differences, especially in large-area UAV imagery with strong illumination and viewpoint variability [22]. These limitations have driven a shift towards deep learning, which learns hierarchical representations from raw imagery in an end-to-end manner.

Convolutional neural networks (CNNs) in particular have largely replaced hand-crafted pipelines for crop analysis. Recent UAV-based studies used deep models for growth stage identification [23,24], anomaly detection in growing crops [25,26], regression of structural growth parameters [27,28] and pixel-level segmentation of growth distribution [29,30]. Among these formulations, semantic segmentation is particularly attractive for precision agriculture because it directly produces a spatial map that can be overlaid on the field for actionable decisions. Within this paradigm, encoder-decoder architectures with multi-scale context aggregation are widely adopted; the DeepLab family [31] in particular combines atrous spatial pyramid pooling (ASPP) for multi-scale context with an encoder-decoder structure for boundary recovery, and it has become a common and computationally efficient backbone for remote sensing and crop segmentation. Among UAV-based crop segmentation works, however, rather few address the specific combination of a directional row structure, fine-grained adjacent growth class boundaries and long-tailed pixel distribution that characterises mechanised cotton imagery. As a consequence, off-the-shelf segmentation networks still leave considerable room for improvement on fine-grained cotton growth segmentation.

Vision transformer architectures and foundation models are increasingly being applied to remote sensing and UAV image analysis. SegFormer [32] and Swin Transformer [33] model long-range dependencies through self-attention, and the Segment Anything Model (SAM) [34] provides zero-shot segmentation that has been adapted to remote-sensing tasks via prior knowledge integration and domain-specific fine-tuning [35]. At a higher level, large language and vision-language foundation models support UAV scene description and language-guided interpretation; a recent survey [36] reviewed transformer-, CNN-transformer hybrid- and LLM-based UAV applications and identified on-board efficiency and task-specific adaptation as the two open challenges. For agricultural and UAV segmentation specifically, transformer-based methods have been applied to multi-source crop time series [37] and to high-resolution UAV vegetation segmentation with hierarchical transformers [38]. Two properties of these architectures informed our design choice. First, their computational budgets are typically several times larger than those of CNN counterparts of comparable accuracy. TransUNet, for example, required 104.7 M parameters and 281.6 G FLOPs at $512 \times 512$ in our experiments. Second, neither global self-attention nor window-based self-attention encodes the directional row-column structure characteristic of mechanised cotton fields; this structural prior must be added explicitly, either through tokenisation or through a directional module placed elsewhere in the network. These two considerations motivated our use of a CNN backbone augmented with task-specific structural priors, with explicit directional encoding placed in the encoder rather than relying on self-attention to recover it implicitly.

In this work, we identify three key challenges in UAV cotton imagery segmentation that current models do not adequately handle. First, mechanised cotton fields exhibit pronounced striped patterns, yet standard convolutions and existing attention mechanisms fail to exploit this strong structural anisotropy, leading to broken predictions and cross-row leakage. Second, static fusion of cross-layer features ignores spatial variations in importance, often suppressing fine-grained details precisely where different growth states interleave. Finally, these structural and fusion limitations are compounded by severe class imbalance; the categories that are most informative for agricultural management (e.g., vigorously and sparsely grown cotton) occupy a small fraction of pixels and are easily overwhelmed by the dominant classes under standard loss formulations.

To tackle these issues, we propose DCA-DeepLab. We adopt DeepLabv3+ [31] as the base architecture because its ASPP aggregates context across a range of receptive fields without enlarging the spatial footprint, which suits the varying row widths found across cotton fields; its encoder-decoder design with explicit low-level feature fusion provides a natural insertion point for our adaptive merging module, and it offers a favourable accuracy-efficiency trade-off compatible with operational UAV workflows (59.4 M parameters and 177.9 G FLOPs at $512 \times 512$). Built upon this proven backbone, the proposed method integrates targeted architectural redesigns and an adaptive loss. The objectives of this study are (1) to better exploit the directional row-column structure of mechanised cotton fields, a structural prior that standard convolutions and generic attention mechanisms do not explicitly capture; (2) to address the potential loss of fine-grained detail at transitions between adjacent growth classes caused by static cross-layer fusion; and (3) to mitigate the severe pixel-level class imbalance so that minority growth classes are not overwhelmed during training. The main contributions are as follows:

1. We introduce a Dual-Coordinate Attention Gating (DCAG) module that decouples horizontal and vertical encoding to inject the row-and-column planting prior, addressing the structural anisotropy of mechanised cotton fields.
2. We design a Multi-Scale Attention-Guided Modulated Feature Merging (MSAM-MFM) module that uses multi-scale spatial attention to adaptively fuse semantic and detail features, preserving the boundaries between adjacent growth classes.
3. We develop an adaptive pixel-level modulated focal loss (APMFL) that pairs an adaptive class weight with a per-pixel confidence modulator to focus learning on minority and hard pixels.
4. We construct a high-resolution UAV cotton dataset acquired during the flowering and boll-forming stage at the Manas Hui’er Farm and the Changji Huaxing Farm in Xinjiang. On the cotton growth dataset, DCA-DeepLab achieved the highest overall mIoU among eleven representative CNN-, transformer-, and Mamba-based baselines. On the public LoveDA benchmark, it also achieved the highest overall mIoU among the compared methods. The method maintains a computational budget comparable to the DeepLabv3+ baseline.

2. Materials and Methods

2.1. Study Area and UAV Data Acquisition

Field campaigns were conducted at the Manas Hui’er Farm in Manas County and the Changji Huaxing Farm in the adjacent Changji Hui Autonomous Prefecture of the Xinjiang Uygur Autonomous Region, which are representative of large-scale mechanised cotton cultivation in northwest China. These two sites are characterised by long mechanised rows aligned in regular grid layouts (Figure 1) and therefore provide a typical setting in which the row-and-column directional structure becomes a salient visual cue at the UAV resolution. Data acquisition was concentrated on the flowering and boll-forming stage between June and September 2024, when cotton plants undergo the transition from vegetative to reproductive growth and exhibit the most pronounced phenotypic variation that can be used for growth status discrimination [39,40]. In this region, cotton is cultivated as a single-season crop, and the June–September window therefore spans the complete in-season growth dynamics relevant to canopy-based growth status analysis.

A DJI Mavic 3 Multispectral platform (SZ DJI Technology Co., Ltd., Shenzhen, China) was used as the airborne sensor. The platform was equipped with a 20-megapixel RGB camera that produced high-resolution visible-light orthomosaics and a multispectral imaging system composed of four 5-megapixel cameras covering the green ($560 \pm 16$ nm), red ($650 \pm 16$ nm), red-edge ($730 \pm 16$ nm) and near-infrared ($860 \pm 26$ nm) bands. The RGB stream is used as the input modality of DCA-DeepLab in this paper, while the multispectral stream provided a normalised difference vegetation index (NDVI) map consulted as an agronomic reference during dataset construction (Section 2.2.1). Restricting the segmentation model to the RGB stream kept the pipeline aligned with the most widely available and lowest-cost UAV imaging set-up; incorporating the multispectral channels as an additional modality is left to future multimodal extensions. Flights followed regular grid lines that were pre-planned over each plot and were conducted primarily at a nominal altitude of 40 m above ground level to balance spatial resolution and coverage [41,42]. Approximately 1 TB of raw imagery was collected, together with ground reference measurements of the plant height, daily main stem growth and number of fruit-bearing branches, which were later used to calibrate the growth-class definition.
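For reference, NDVI is conventionally computed per pixel from the red and near-infrared reflectances as (NIR − Red) / (NIR + Red). The minimal sketch below assumes two co-registered reflectance arrays; the function name, the toy values and the small `eps` guard are illustrative and are not taken from the authors' pipeline.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red).

    eps is an illustrative guard against 0/0 on dark pixels (an assumption).
    """
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 reflectance patches (hypothetical values).
nir_band = np.array([[0.60, 0.50], [0.30, 0.10]])
red_band = np.array([[0.10, 0.10], [0.20, 0.10]])
ndvi_map = ndvi(nir_band, red_band)
```

Dense canopy gives values closer to 1, while bare soil and other non-vegetated surfaces give values near 0.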

2.2. Cotton Growth Dataset Construction

2.2.1. Annotation Protocol

To reduce the subjectivity of purely visual annotation and ensure that the resulting labels were agronomically meaningful, the annotation workflow combined high-resolution RGB imagery, ground-truth agronomic measurements and agronomist review. The NDVI maps described below served exclusively as an annotation aid to help agronomists calibrate growth-level boundaries; they were never used as model input during training or inference, and the segmentation network operated on RGB imagery only. The raw UAV imagery was first orthorectified and stitched into digital orthophoto maps (DOMs) using DJI Terra (SZ DJI Technology Co., Ltd., Shenzhen, China), a standard photogrammetric preprocessing step that produces geometrically corrected, seamless image mosaics suitable for accurate annotation and downstream analysis; corresponding NDVI maps were also generated from the multispectral channels and consulted as an auxiliary agronomic reference [43].

A four-class growth taxonomy was defined with reference to recent agronomic studies on flowering and boll cotton phenology [44,45] and calibrated against field measurements collected at the Manas Hui’er Farm and the Changji Huaxing Farm. The four classes were the background (Background), vigorous growth (Vigorous), moderate growth (Moderate) and sparse growth (Sparse). The criteria fuse canopy coverage, plant structural metrics (height, main stem nodes, daily growth and number and position of fruit-bearing branches), boll counts, indicative canopy coverage percentages and NDVI ranges; the detailed thresholds are summarised in Table 1. The canopy coverage percentages and NDVI ranges in Table 1 are intended as indicative anchors used by annotators when calibrating the class boundaries rather than as automatic decision rules. Pixel-level masks were finally drawn on the DOM with LabelMe and cross-checked by a second annotator. In these masks, bare soil, inter-row gaps and other non-canopy surfaces were always assigned to the Background class; the Vigorous, Moderate and Sparse labels were applied exclusively to cotton canopy pixels, whose growth level was determined by the regional agronomic criteria in Table 1. Figure 2 presents representative ground-level examples of the three cotton growth classes.

2.2.2. Image Partitioning and Augmentation

To balance computational cost and contextual completeness, the annotated DOMs were tiled into $512 \times 512$ patches. To prevent spatial information leakage, the dataset was partitioned at the orthomosaic level rather than by random patch sampling: the training, validation and test sets were drawn from disjoint cotton fields at an approximate 8:1:1 ratio so that no spatially adjacent or overlapping regions appeared across splits. Because the splits came from physically separate fields and orthomosaics across the two farms rather than from a single shared scene, the test set included inter-field and inter-farm variation, such as differences in canopy structure, row orientation and acquisition illumination, beyond mere patch-level variation within a common scene. Data augmentation was then applied to the training set only, yielding 11,745 patches in total across the three splits.
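The leakage-free partitioning described above can be sketched as grouping patches by their source orthomosaic and assigning whole orthomosaics, never individual patches, to a split. This is only an illustration: the identifiers, the fixed seed and the choice to apply the 8:1:1 ratio to the orthomosaic count are assumptions, not details reported by the authors.

```python
import random
from collections import defaultdict

def split_by_orthomosaic(patches, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole orthomosaics (not individual patches) to train/val/test.

    `patches` is a list of (orthomosaic_id, patch_id) pairs; both identifiers
    are hypothetical placeholders for whatever naming the dataset uses.
    """
    groups = defaultdict(list)
    for mosaic_id, patch_id in patches:
        groups[mosaic_id].append(patch_id)

    mosaic_ids = sorted(groups)
    random.Random(seed).shuffle(mosaic_ids)

    n = len(mosaic_ids)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    assignment = {
        "train": mosaic_ids[:n_train],
        "val": mosaic_ids[n_train:n_train + n_val],
        "test": mosaic_ids[n_train + n_val:],
    }
    # Every patch inherits the split of its parent orthomosaic.
    return {
        name: [(m, p) for m in ids for p in groups[m]]
        for name, ids in assignment.items()
    }

# Toy example: 10 orthomosaics with 4 patches each.
toy = [(f"field_{i}", f"patch_{j}") for i in range(10) for j in range(4)]
splits = split_by_orthomosaic(toy)
```

Because the grouping key is the orthomosaic, two patches cut from neighbouring or overlapping regions of the same field can never end up on opposite sides of the train/test boundary.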

A composite augmentation pipeline (Figure 3) was applied to the training set to improve generalisation under the variable conditions encountered during UAV operation. Geometric transformations included random horizontal and vertical flips and random rotations, which simulated changes in flight orientation and crop row direction. Photometric augmentations included HSV-space colour jittering, brightness scaling, additive Gaussian noise and motion blur, which simulated complex illumination, sensor noise and platform vibration.
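One practical detail of such a pipeline is that geometric transforms must be applied identically to the image and its label mask, whereas photometric transforms touch the image only. The sketch below illustrates this with a subset of the listed operations. The probabilities and magnitudes are assumptions, rotations are restricted to multiples of 90° for brevity, and HSV jittering and motion blur are omitted.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one random geometric + photometric augmentation to an (image, mask) pair.

    image: H x W x 3 float array in [0, 1]; mask: H x W integer class map.
    The probabilities and magnitudes below are illustrative assumptions.
    """
    # Geometric transforms are applied identically to image and mask.
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]      # horizontal flip
    if rng.random() < 0.5:
        image, mask = image[::-1], mask[::-1]            # vertical flip
    k = int(rng.integers(0, 4))                          # rotate by k * 90 degrees
    image, mask = np.rot90(image, k, axes=(0, 1)), np.rot90(mask, k)

    # Photometric transforms change the image only; labels stay untouched.
    image = image * rng.uniform(0.8, 1.2)                # brightness scaling
    image = image + rng.normal(0.0, 0.02, image.shape)   # additive Gaussian noise
    return np.clip(image, 0.0, 1.0), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
lbl = rng.integers(0, 4, size=(8, 8))
aug_img, aug_lbl = augment_pair(img, lbl, rng)
```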

The pixel-level class distribution of the resulting dataset, shown in Figure 4, was dominated by the Moderate and Background categories, while the Vigorous and Sparse categories were markedly under-represented. This long-tailed distribution directly motivated the design of the adaptive pixel-level loss introduced in Section 2.3.4.

2.3. DCA-DeepLab

2.3.1. Overall Architecture

DCA-DeepLab follows the encoder-decoder paradigm of DeepLabv3+ and consists of three task-specific components: the Dual-Coordinate Attention Gating (DCAG) module, the Multi-Scale Attention-Guided Modulated Feature Merging (MSAM-MFM) module, and the adaptive pixel-level modulated focal loss (APMFL). The overall architecture is illustrated in Figure 5. A high-resolution UAV image was first encoded by a ResNet-101 [46] backbone pretrained on ImageNet [47], which is the same configuration as the original DeepLabv3+ baseline [31], so that any improvement reported later could be attributed to the proposed components rather than to a stronger backbone. A DCAG module was inserted after each of the first three residual stages (ResBlock1–3). These early-to-middle stages still retained enough spatial resolution for the row-and-column layout to remain discernible, whereas the deepest stage was dominated by abstract semantics in which the directional prior was largely washed out, and thus directional gating was injected progressively across the first three stages. At each insertion point, the module decoupled the horizontal and vertical spatial dependencies and re-weighted features along these two axes accordingly. In the decoder, the low-level detail features were taken from the output of the first DCAG-augmented stage (after ResBlock1) and projected by a $1 \times 1$ convolution before fusion, while the high-level semantic features came from the ASPP output. MSAM-MFM replaces the default concatenation of these two streams with a spatially adaptive fusion that suppresses semantic conflict and preserves the boundaries between adjacent growth classes. During training, the network is supervised by the APMFL, which jointly incorporates running class-frequency statistics and per-pixel prediction confidence into the focal loss formulation, dynamically focusing optimisation on minority and hard pixels.

2.3.2. Dual-Coordinate Attention Gating

UAV cotton imagery exhibits a striking anisotropy: features are continuous along the planted rows but discontinuous across rows. Standard convolutions perform isotropic aggregation over local neighbourhoods and cannot directly encode this orientation prior. To explicitly inject a row-and-column structure into the feature extraction stream, we introduce the DCAG module, illustrated in Figure 6. DCAG decouples horizontal and vertical spatial encoding and adaptively gates the two pathways before re-weighting the input features. We indexed the two parallel pathways by their pooling operators; $A_{avg}$ and $A_{\max}$ denote the final directional attention maps produced by the average-pooled branch and the max-pooled branch, respectively.

Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, DCAG first aggregates features along the horizontal and vertical directions using both average and max pooling to capture the global directional distribution and the salient local responses simultaneously:

$$Z_{avg}^{h} = \mathrm{AvgPool}_{1 \times W}(F), \quad Z_{avg}^{w} = \mathrm{AvgPool}_{H \times 1}(F), \tag{1}$$

$$Z_{\max}^{h} = \mathrm{MaxPool}_{1 \times W}(F), \quad Z_{\max}^{w} = \mathrm{MaxPool}_{H \times 1}(F), \tag{2}$$

where $\mathrm{AvgPool}_{1 \times W}$ and $\mathrm{MaxPool}_{1 \times W}$ denote pooling along the horizontal direction and $\mathrm{AvgPool}_{H \times 1}$ and $\mathrm{MaxPool}_{H \times 1}$ denote pooling along the vertical direction.

The horizontal and vertical descriptors of each pooling branch are concatenated along the spatial dimension and passed through a shared $1 \times 1$ convolution with batch normalisation and a ReLU activation, allowing cross-direction information exchange. The fused tensor is then split back into the two directional components, each of which is projected by an independent $1 \times 1$ convolution and squashed by a sigmoid $\sigma(\cdot)$ to produce direction-specific attention maps:

$$Y = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}([Z^{h}; Z^{w}]))), \tag{3}$$

$$Y_{h}, Y_{w} = \mathrm{split}(Y), \tag{4}$$

$$A_{h} = \sigma(\mathrm{Conv}_{1 \times 1}(Y_{h})), \quad A_{w} = \sigma(\mathrm{Conv}_{1 \times 1}(Y_{w})), \tag{5}$$

$$A_{avg} = A_{h} \odot A_{w}, \tag{6}$$

where $[Z^{h}; Z^{w}]$ is shorthand for either $[Z_{avg}^{h}; Z_{avg}^{w}]$ or $[Z_{\max}^{h}; Z_{\max}^{w}]$, depending on the branch, $\mathrm{split}(\cdot)$ denotes the inverse split along the spatial dimension and $\odot$ is the element-wise product. Equation (6) is written for the average-pooled branch; the same procedure is applied to the max-pooled descriptors to produce a complementary directional map $A_{\max}$.
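To make the tensor shapes in Eqs. (1) to (6) concrete, the following NumPy sketch runs one pooling branch on a single feature map. It is a simplified illustration under stated assumptions: the $1 \times 1$ convolutions are written as plain matrices, batch normalisation is omitted, the intermediate width `C_mid` is hypothetical, and reusing the same random weights for both branches is only a convenience of the toy example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def directional_attention(F, W_shared, W_h, W_w, pool="avg"):
    """One pooling branch of the directional attention, Eqs. (1)-(6).

    F: (C, H, W) feature map. W_shared: (C_mid, C); W_h and W_w: (C, C_mid)
    are 1x1 convolution weights written as matrices. Batch normalisation is
    omitted for brevity (an assumption of this sketch).
    """
    C, H, W = F.shape
    pool_fn = np.mean if pool == "avg" else np.max
    z_h = pool_fn(F, axis=2)                   # pool over width  -> (C, H)
    z_w = pool_fn(F, axis=1)                   # pool over height -> (C, W)

    z = np.concatenate([z_h, z_w], axis=1)     # (C, H + W), input of Eq. (3)
    y = np.maximum(W_shared @ z, 0.0)          # shared 1x1 conv + ReLU
    y_h, y_w = y[:, :H], y[:, H:]              # Eq. (4): split back

    a_h = sigmoid(W_h @ y_h)[:, :, None]       # (C, H, 1), Eq. (5)
    a_w = sigmoid(W_w @ y_w)[:, None, :]       # (C, 1, W), Eq. (5)
    return a_h * a_w                           # (C, H, W), Eq. (6) via broadcasting

rng = np.random.default_rng(0)
C, C_mid, H, W = 8, 4, 6, 10
F = rng.normal(size=(C, H, W))
W_shared = rng.normal(size=(C_mid, C))
W_h, W_w = rng.normal(size=(C, C_mid)), rng.normal(size=(C, C_mid))
A_avg = directional_attention(F, W_shared, W_h, W_w, pool="avg")
A_max = directional_attention(F, W_shared, W_h, W_w, pool="max")
```

Each channel of the resulting map is the outer product of a per-row vector and a per-column vector, which is how the module expresses attention along the two field axes separately.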

To adaptively balance the contributions of the two pooling branches, DCAG inserts a lightweight gate that operates directly on the input feature map; $F$ is projected through two $1 \times 1$ convolutions and a softmax along the branch dimension, yielding spatial fusion weights $\omega_{avg}, \omega_{\max} \in \mathbb{R}^{1 \times H \times W}$ that sum to one at every location. The two directional attention maps are then merged into a single map

$$A_{dir} = \omega_{avg} \cdot A_{avg} + \omega_{\max} \cdot A_{\max}, \tag{7}$$

and applied back to the input features via element-wise multiplication $F' = A_{dir} \odot F$. In Figure 6, the labels $F_{A}$ and $F_{M}$ on the gating row are the inputs to the two element-wise multiplications and correspond to $A_{avg}$ and $A_{\max}$, respectively; they are then re-weighted by $\omega_{avg}$ and $\omega_{\max}$ from the central softmax branch and summed to form $A_{dir}$. The gated feature $F'$ is then passed through a Squeeze-and-Excitation (SE) block [48] to enhance cross-channel responses, and a residual connection adds $F$ back to stabilise gradient flow and preserve the original semantic content. By design, DCAG only adds lightweight $1 \times 1$ convolutions on top of pooled descriptors, and thus its computational cost is small relative to the encoder backbone.
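The gated merge of Eq. (7) and the subsequent re-weighting can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the gate's $1 \times 1$ convolutions are collapsed into a single hypothetical matrix `W_gate` that produces two logit maps, the attention maps are random stand-ins, and the SE block is left out.

```python
import numpy as np

def gated_directional_reweighting(F, A_avg, A_max, W_gate):
    """Merge two directional attention maps with a per-pixel softmax gate, Eq. (7).

    F, A_avg, A_max: (C, H, W). W_gate: (2, C) stands in for the gate's 1x1
    convolutions, collapsed into one linear map (an assumption of this sketch).
    The SE block that follows in the paper is omitted.
    """
    logits = np.einsum("bc,chw->bhw", W_gate, F)          # (2, H, W) branch logits
    logits = logits - logits.max(axis=0, keepdims=True)   # numerically stable softmax
    omega = np.exp(logits)
    omega = omega / omega.sum(axis=0, keepdims=True)      # sums to 1 over the 2 branches

    A_dir = omega[0:1] * A_avg + omega[1:2] * A_max       # Eq. (7), broadcast over C
    F_gated = A_dir * F                                   # F' = A_dir (element-wise) F
    return F_gated + F, omega                             # residual connection adds F back

rng = np.random.default_rng(1)
C, H, W = 8, 6, 10
F = rng.normal(size=(C, H, W))
A_avg = rng.uniform(0.0, 1.0, size=(C, H, W))   # stand-ins for the attention maps
A_max = rng.uniform(0.0, 1.0, size=(C, H, W))
out, omega = gated_directional_reweighting(F, A_avg, A_max, rng.normal(size=(2, C)))
```

Because $A_{dir}$ is a convex combination of two maps in $[0, 1]$, the output of this simplified path equals $(1 + A_{dir}) \odot F$, so the residual keeps every input activation while amplifying those on attended rows and columns.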

2.3.3. Multi-Scale Attention-Guided Modulated Feature Merging

In a semantic segmentation decoder, high-level features carry strong semantic discriminability, while low-level features carry detailed spatial cues. Plain concatenation or addition of the two often masks fine boundaries with high-level semantic responses or amplifies background noise where adjacent growth levels are interleaved. We argue that the relative importance of semantic vs. detail information is in fact spatially variant; the inner part of a homogeneous growth region prefers semantic features, whereas pixels on class boundaries prefer detail features. We therefore designed the MSAM-MFM module, shown in Figure 7, to explicitly model this spatial preference. Different from existing cross-scale fusion methods that operate purely on the spatial or channel dimension, MSAM-MFM introduces a dedicated feature-source dimension that distinguishes semantic and detail features and uses multi-scale pooling to infer per-pixel fusion weights along this dimension.

Let the high-level semantic feature and low-level detail feature after spatial alignment be $F_{1}, F_{2} \in \mathbb{R}^{C \times H \times W}$. They are stacked along a newly introduced feature-source dimension to form a tensor of shape $2 \times C \times H \times W$, the sum of which along this dimension yields an initial fused representation $F_{sum} \in \mathbb{R}^{C \times H \times W}$. Multi-scale adaptive average pooling is then applied to $F_{sum}$ at three resolutions:

$$S^{(1)} = \mathrm{GAP}(F_{sum}), \quad S^{(2)} = \mathrm{AAP}_{2 \times 2}(F_{sum}), \quad S^{(3)} = \mathrm{AAP}_{4 \times 4}(F_{sum}), \tag{8}$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling (equivalent to $\mathrm{AAP}_{1 \times 1}$) and $\mathrm{AAP}_{k \times k}$ is an adaptive average pool with an output size of $k \times k$. The three scales capture the global ($k = 1$), mid-level regional ($k = 2$) and finer regional ($k = 4$) context. Each pooled descriptor is fed into an independent multi-layer perceptron whose output channel dimension is $2C$ and bilinearly upsampled back to the original spatial resolution. The three resulting tensors are summed element-wise to yield a multi-scale attention representation $M \in \mathbb{R}^{2C \times H \times W}$.

To expose the feature-source dimension explicitly, $M$ is reshaped to $\mathbb{R}^{2 \times C \times H \times W}$, and a softmax is applied along the feature-source axis, producing fusion weights $\alpha \in \mathbb{R}^{2 \times C \times H \times W}$ that sum to one across the two sources at every $(c, h, w)$ location. Writing $\alpha_{1}, \alpha_{2} \in \mathbb{R}^{C \times H \times W}$ for the two slices of $\alpha$ along this axis, the fused output is

$$F_{out} = \alpha_{1} \odot F_{1} + \alpha_{2} \odot F_{2}, \tag{9}$$

where $\odot$ denotes element-wise multiplication. Each $\alpha_{i}$ is therefore a per-pixel, per-channel weight that adaptively routes information between the high-level semantic stream and the low-level detail stream. Because the weights are normalised over the feature-source dimension, MSAM-MFM behaves as a soft routing between semantic and detail features at every spatial location; it is designed to favour semantic information in homogeneous regions and detail features at class boundaries, yielding cleaner predictions and sharper transitions between growth classes.
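The soft routing of Eqs. (8) and (9) can be illustrated with a compact NumPy sketch. Several simplifications are assumed and should not be read as the authors' design: each per-scale MLP is replaced by a single linear layer with hypothetical random weights, nearest-neighbour upsampling is swapped in for the bilinear upsampling described in the text, and the spatial size is assumed divisible by each pooling size.

```python
import numpy as np

def adaptive_avg_pool(x, k):
    """Average-pool (C, H, W) to (C, k, k); assumes H and W are divisible by k."""
    C, H, W = x.shape
    return x.reshape(C, k, H // k, k, W // k).mean(axis=(2, 4))

def msam_mfm(F1, F2, weights):
    """Soft routing between semantic (F1) and detail (F2) features, Eqs. (8)-(9).

    weights: dict {k: (2C, C) matrix}, one linear layer standing in for each
    per-scale MLP (assumption). Nearest-neighbour upsampling replaces the
    bilinear upsampling of the text, purely to keep the sketch short.
    """
    C, H, W = F1.shape
    F_sum = F1 + F2
    M = np.zeros((2 * C, H, W))
    for k, Wk in weights.items():
        S = adaptive_avg_pool(F_sum, k)                        # (C, k, k), Eq. (8)
        D = np.einsum("oc,cij->oij", Wk, S)                    # (2C, k, k)
        M += np.repeat(np.repeat(D, H // k, axis=1), W // k, axis=2)

    M = M.reshape(2, C, H, W)                                  # expose the source axis
    M = M - M.max(axis=0, keepdims=True)                       # stable softmax
    alpha = np.exp(M) / np.exp(M).sum(axis=0, keepdims=True)   # softmax over sources
    return alpha[0] * F1 + alpha[1] * F2, alpha                # Eq. (9)

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
F1, F2 = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
weights = {k: rng.normal(size=(2 * C, C)) for k in (1, 2, 4)}
F_out, alpha = msam_mfm(F1, F2, weights)
```

Since the two weights sum to one at every $(c, h, w)$ location, each output value is a convex combination of the corresponding semantic and detail activations.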

2.3.4. Adaptive Pixel-Level Modulated Focal Loss

The cotton growth dataset is severely class-imbalanced; the most informative Vigorous and Sparse categories occupy only a small fraction of pixels, while the Background and Moderate classes dominate. Standard cross-entropy is therefore overwhelmed by the frequent classes. The vanilla focal loss alleviates this with a difficulty modulator and a class weight, but its class weight is typically pre-set and static and therefore cannot follow the evolving class distribution and difficulty during training.

The APMFL addresses this gap by augmenting the focal loss with two ingredients: a running class frequency estimator and a pixel-level confidence modulator. Let $\hat{f}_{j}^{t}$ be the frequency of class $j$ in the current mini-batch at iteration $t$, and let $f_{j}^{t-1}$ be the historical estimate. The running frequency is updated by

$$f_{j}^{t} = \frac{\hat{f}_{j}^{t} + (t - 1) f_{j}^{t-1}}{t}, \tag{10}$$

which yields a smooth and stable estimate of the global pixel distribution as training progresses. The recurrence is self-initialising at $t = 1$, where the factor $(t - 1)$ vanishes and $f_{j}^{1} = \hat{f}_{j}^{1}$. Thus, no separate initial value is required. A median-frequency class weight is then computed as a base balancing factor:

$$w_{j}^{t} = \frac{\mathrm{median}(\{f_{k}^{t} \mid k \in \mathcal{C}\})}{f_{j}^{t} + \epsilon_{f}}, \tag{11}$$

where $\mathcal{C}$ is the set of classes. A small constant $\epsilon_{f} = 10^{-6}$ is added in the denominators to avoid division by zero when a class is rare or absent in early mini-batches. To further increase attention to hard pixels, the class weight is extended into a pixel-wise dynamic weight $\tilde{w}_{ij}$ that incorporates the prediction confidence:

$$\tilde{w}_{ij} = \frac{w_{j}^{t}}{\sum_{k \in \mathcal{C}} w_{k}^{t} + \epsilon_{f}} \cdot (1 + y_{ij} - \hat{y}_{ij}), \tag{12}$$

where $\hat{y}_{ij} \in (0, 1)$ is the predicted probability of pixel $i$ for class $j$ and $y_{ij} \in \{0, 1\}$ is the corresponding ground-truth label. For the ground-truth class ($y_{ij} = 1$), the factor reduces to $2 - \hat{y}_{ij} \in [1, 2]$, smoothly amplifying the loss as model confidence decreases, namely from 1 for high-confidence correct predictions to 2 for fully wrong predictions.
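A small worked example may help to see how Eqs. (10) to (12) interact. The sketch below follows the three formulas directly; the function names and the mini-batch frequencies are invented for illustration and do not come from the cotton dataset.

```python
import numpy as np

EPS_F = 1e-6  # epsilon_f from the text

def update_running_freq(f_prev, f_batch, t):
    """Eq. (10): cumulative mean of per-batch class frequencies (t starts at 1)."""
    return (f_batch + (t - 1) * f_prev) / t

def class_weights(f_running):
    """Eq. (11): median-frequency balancing."""
    return np.median(f_running) / (f_running + EPS_F)

def pixel_weights(w, y_onehot, y_prob):
    """Eq. (12): normalised class weight times the confidence factor (1 + y - y_hat).

    y_onehot, y_prob: (N, K) arrays for N pixels and K classes.
    """
    return w / (w.sum() + EPS_F) * (1.0 + y_onehot - y_prob)

# Two hypothetical mini-batches with class frequencies for
# (Background, Vigorous, Moderate, Sparse); the numbers are invented.
f = np.zeros(4)
for t, f_batch in enumerate([np.array([0.40, 0.05, 0.45, 0.10]),
                             np.array([0.30, 0.15, 0.45, 0.10])], start=1):
    f = update_running_freq(f, f_batch, t)

w = class_weights(f)   # rarer classes receive larger weights

# One pixel whose true class is Vigorous, predicted with probability 0.3.
y_onehot = np.array([[0.0, 1.0, 0.0, 0.0]])
y_prob = np.array([[0.2, 0.3, 0.4, 0.1]])
w_tilde = pixel_weights(w, y_onehot, y_prob)
```

In this toy run the running frequencies settle at (0.35, 0.10, 0.45, 0.10), so the two rare classes receive the largest class weights, and the confidence factor for the true class is $2 - 0.3 = 1.7$.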

Combining the running-frequency reweighting and the per-pixel confidence modulation under the focal loss skeleton yields the final APMFL:

L_{APMFL} = \frac{1}{| Ω |} \sum_{i \in Ω} [(\sum_{j \in C} {\tilde{w}}_{i j} y_{i j}) {(1 - p_{t, i})}^{γ} (- log (p_{t, i} + ϵ))],

(13)

where

Ω

is the set of pixels in the current batch,

p_{t, i}

is the predicted probability of pixel i for its ground-truth class,

γ

controls the down-weighting of easy examples and

ϵ

is a small constant ensuring numerical stability. Because

{\tilde{w}}_{i j}

is updated continuously during training, the APMFL keeps focusing on classes and pixels that are still hard, which empirically improves segmentation of minority growth categories without materially degrading the dominant classes.
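For illustration, Equations (10)–(13) can be read as the minimal sketch below. Unrolling Equation (10) shows that the running frequency is simply the arithmetic mean of the per-batch frequencies observed so far. The sketch is written in plain Python over a flat list of pixels and is not the authors' implementation; the class name `APMFLSketch`, its method names and the list-based inputs are hypothetical, and the median is taken with `statistics.median`, which averages the two middle values when the number of classes is even (the text does not specify this detail). A practical version would be vectorised in PyTorch.

```python
import math
from statistics import median


class APMFLSketch:
    """Illustrative reading of Eqs. (10)-(13); hypothetical, not the authors' code."""

    def __init__(self, num_classes, gamma=1.5, eps=1e-6, eps_f=1e-6):
        self.num_classes = num_classes
        self.gamma = gamma          # focal factor (1.5 in Section 2.4)
        self.eps = eps              # stabiliser inside the logarithm
        self.eps_f = eps_f          # stabiliser in the weight denominators
        self.t = 0                  # iteration counter
        self.freq = [0.0] * num_classes  # running class frequencies

    def update_frequency(self, labels):
        # Eq. (10): f^t = (f_hat^t + (t - 1) * f^{t-1}) / t
        self.t += 1
        n = len(labels)
        for j in range(self.num_classes):
            batch_freq = sum(1 for y in labels if y == j) / n
            self.freq[j] = (batch_freq + (self.t - 1) * self.freq[j]) / self.t

    def class_weights(self):
        # Eq. (11): median-frequency balancing on the running estimate
        med = median(self.freq)
        return [med / (f + self.eps_f) for f in self.freq]

    def loss(self, probs, labels):
        """probs[i][j]: predicted probability of pixel i for class j;
        labels[i]: ground-truth class index of pixel i."""
        self.update_frequency(labels)
        w = self.class_weights()
        w_sum = sum(w) + self.eps_f
        total = 0.0
        for p_i, y in zip(probs, labels):
            p_t = p_i[y]
            # Eq. (12) for the ground-truth class (y_ij = 1): factor 2 - p_t
            w_tilde = (w[y] / w_sum) * (2.0 - p_t)
            # Eq. (13): weighted focal term for this pixel
            total += w_tilde * (1.0 - p_t) ** self.gamma * (-math.log(p_t + self.eps))
        return total / len(labels)
```

With two classes and a first batch labelled `[0, 0, 0, 1]`, the running frequencies become 0.75 and 0.25, so the rarer class receives the larger weight. Note that a class that has not yet been observed has a running frequency of zero and therefore receives a very large weight under Equation (11); this follows directly from the formula as written.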

2.4. Experimental Set-Up

All experiments were implemented with PyTorch on a workstation equipped with four NVIDIA GeForce RTX 4090 GPUs (24 GB each), CUDA 11.8, Python 3.9.20 and PyTorch 2.5.1. The encoder of DCA-DeepLab was initialised with ImageNet-pretrained ResNet-101. All models shared the same optimiser (stochastic gradient descent, momentum of

0.9

, and weight decay of

1 \times 10^{- 4}

), learning rate schedule (poly with initial rate of

0.01

and power of

0.9

), batch size (8), number of epochs (100) and data augmentation pipeline. Unless otherwise specified, baseline methods were trained with their default cross-entropy objective, whereas DCA-DeepLab used the proposed APMFL. The contribution of the APMFL is examined separately in a controlled loss-only comparison in Section 3.4.3. The focal factor in the APMFL was set to

γ = 1.5

unless otherwise stated, and

ϵ = 10^{- 6}

. All experiments used a fixed random seed for reproducibility, and the checkpoint with the highest validation mIoU was selected for testing.
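As a point of reference, the poly learning rate schedule mentioned above is commonly written as the base rate multiplied by a polynomial decay factor. The sketch below assumes this standard form; the function name `poly_lr` is hypothetical, and whether the rate was updated per iteration or per epoch, or combined with a warm-up phase, is not stated in the text.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Standard poly decay: base_lr * (1 - step / max_steps) ** power.

    Assumed form; `step` may count iterations or epochs.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


# Example with the reported settings (initial rate 0.01, power 0.9, 100 epochs)
schedule = [poly_lr(0.01, epoch, 100) for epoch in range(101)]
```

Under this form, the rate starts at 0.01, decreases monotonically and reaches zero at the final step.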

We evaluated DCA-DeepLab on two datasets. The first was the cotton growth dataset described in Section 2.2, which is an in-house UAV benchmark with four classes (Background, Vigorous, Moderate and Sparse). The second was the public LoveDA remote-sensing benchmark [49], which contains 5987 high-resolution aerial images at

0.3

m GSD covering urban and rural scenes from Nanjing, Changzhou and Wuhan, with seven semantic classes (Background, Building, Road, Water, Barren, Forest and Agriculture). On LoveDA, we followed the official train/val/test partition released with the dataset, training on the official train split and reporting all metrics on the validation split, since the test labels are not publicly available. LoveDA served as an auxiliary cross-domain benchmark: evaluating DCA-DeepLab on heterogeneous urban and rural scenes provided a robustness check and indicated whether the proposed inductive biases remained beneficial outside the cotton field setting for which they were designed. An example of LoveDA imagery is shown in Figure 8.

Following common practice in semantic segmentation, we report the pixel accuracy (Acc), mean intersection over union (mIoU) and per-class IoU. Acc reflects overall correctness but is dominated by frequent classes; the mIoU and per-class IoU are more sensitive to minority classes and are therefore the primary metrics for evaluating long-tailed UAV cotton growth segmentation.
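The three metrics can all be derived from a single confusion matrix, as in the minimal sketch below. The function name `segmentation_metrics` and the convention of assigning an IoU of zero to a class that never appears are assumptions made here for illustration.

```python
def segmentation_metrics(conf):
    """Compute Acc, mIoU and per-class IoU from a confusion matrix.

    conf[g][p] is the number of pixels with ground truth g predicted as p.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n))
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # class-c pixels missed
        fp = sum(conf[r][c] for r in range(n)) - tp  # other pixels predicted as c
        denom = tp + fp + fn
        # Convention chosen here: a class with no pixels and no predictions scores 0
        ious.append(tp / denom if denom else 0.0)
    acc = correct / total
    miou = sum(ious) / n
    return acc, miou, ious
```

On a toy two-class matrix `[[90, 10], [5, 5]]`, the accuracy is about 0.86 while the IoU of the rare class is only 0.25, which illustrates why Acc alone is a weak indicator under class imbalance.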

3. Results

3.1. Sensitivity to the Initial Learning Rate

We first studied the sensitivity of DCA-DeepLab to the initial learning rate by sweeping

\{0.001, 0.005, 0.01, 0.02, 0.05\}

while keeping all other hyperparameters fixed. The results on the cotton growth dataset are shown in Figure 9. The mIoU rose sharply from

0.001

to

0.005

, peaked at

0.01

(mIoU =

51.74 %

, Acc =

72.13 %

), and then declined as the learning rate was further increased. A learning rate that was too small could not push the network out of the basin of attraction set by the ImageNet-pretrained backbone in the limited number of epochs that we used, while a learning rate above

0.01

destabilised optimisation, especially on minority classes. We adopted

0.01

as the default initial learning rate in all subsequent experiments because it provided the best balance between convergence speed and stability.

3.2. Comparison with State-of-the-Art Methods

3.2.1. Cotton Growth Dataset

We compared DCA-DeepLab with 11 representative semantic segmentation methods on the cotton growth dataset, covering classical encoder-decoder models (U-Net [50] and DeepLabv3+ [31]), context aggregation networks (CENet [51] and DenseASPP [52]), transformer-based models (Segmenter [53] and TransUNet [54]), a Mamba-based model (RS3Mamba [55]) and recent remote-sensing or agricultural networks (MCCANet [56], MSGCNet [57], DBBANet [58] and MSEONet [59]). All methods followed the same training protocol described in Section 2.4; encoder backbones were initialised with ImageNet-pretrained weights when available, while segmentation heads were trained from scratch on our dataset. The hyperparameters of each baseline were kept at the default values reported in their original papers when those values were available; otherwise, they were tuned on the validation split with the same budget as DCA-DeepLab. The quantitative results are summarised in Table 2.

DCA-DeepLab attained the best Acc and mIoU among all compared methods (

72.13 %

and

51.74 %

, respectively). Compared with the DeepLabv3+ baseline, its mIoU improved by

1.90

percentage points (pp), and compared with the most competitive recent baseline, MSEONet, its mIoU improved by

1.10

pp. Looking at the per-class IoU profile, the gain was concentrated on the operationally informative growth classes; the IoU of the Vigorous class rose from

36.47

(DeepLabv3+) to

39.98

(

+ 3.51

pp), the IoU of the Sparse class rose from

30.45

to

32.36

(

+ 1.91

pp), and the IoU of the Moderate class rose from

59.48

to

61.85

(

+ 2.37

pp). The IoU of the dominant background class was essentially unchanged (

72.78

compared with

72.96

), which was the desired behaviour; gains on minority classes should not come from sacrificing the background. Generic context aggregation methods (DenseASPP and DeepLabv3+) and global attention methods (Segmenter and TransUNet) remained well below DCA-DeepLab on the minority classes. As these are end-to-end results in which each baseline used its default cross-entropy objective while DCA-DeepLab used the APMFL, this margin reflects the joint effect of the directional inductive bias motivated in Section 1 and the adaptive loss; the architecture- and loss-level contributions are decomposed separately in Section 3.4.3.

3.2.2. Computational Complexity

To complement the accuracy comparison, we report the parameter count, the number of floating-point operations (FLOPs) and the inference speed (frames per second (FPS)) at a

512 \times 512

input resolution in Table 3 and visualise the accuracy/complexity trade-off in Figure 10. DCA-DeepLab had

61.7

M parameters and

184.6

G FLOPs and reached

68.6

FPS, which is comparable in parameter count and FLOPs to the DeepLabv3+ backbone (

59.4

M,

177.9

G and

104.6

FPS) and substantially smaller than TransUNet (

104.7

M,

281.6

G and

62.5

FPS). DCA-DeepLab was faster than its strongest accuracy competitor MSEONet (

64.3

FPS) and the transformer and Mamba baselines TransUNet (

62.5

FPS) and RS3Mamba (

44.9

FPS) under our single-GPU benchmarking set-up. MCCANet (

85.4

FPS) and MSGCNet (

83.6

FPS) achieved a higher throughput at a lower mIoU. We note that DCA-DeepLab was nonetheless slower than the plain DeepLabv3+ backbone (

68.6

vs.

104.6

FPS). Although its parameter and FLOP overheads over DeepLabv3+ were small (about

+ 3.9 %

and

+ 3.8 %

, respectively), the directional pooling, gating and adaptive fusion operations introduced by DCAG and MSAM-MFM are less hardware-fused on current GPUs than plain convolutions and therefore incur a larger wall clock cost than the FLOP count alone suggests. Despite this modest budget, DCA-DeepLab attained the highest mIoU and Acc among all methods, sitting in the most favourable region of the accuracy/complexity trade-off plot. The result indicates that the gain of DCA-DeepLab does not come from brute force capacity scaling but from the joint effect of the directional inductive bias, the spatially adaptive fusion and the adaptive loss formulation, which together account for the improvement at a near-baseline computational budget.

3.2.3. Cross-Domain Evaluation on LoveDA

To probe whether the inductive biases introduced by DCA-DeepLab generalised beyond cotton fields, we compared it with 10 baselines on the LoveDA benchmark; the results are listed in Table 4. For this cross-domain setting, we used baselines that are widely reported on LoveDA, including general-purpose segmentation and remote sensing networks (e.g., BANet and PSPNet), rather than the agriculture-specific networks tailored to the cotton comparison. DCA-DeepLab reached an mIoU of

51.71 %

, which was the best result among all methods compared. Relative to DeepLabv3+ (

47.62 %

), the gain was

4.09

pp; relative to recent strong baselines such as MSEONet (

50.66 %

) and DBBANet (

49.95 %

), the gain was

1.05

and

1.76

pp, respectively. As on the cotton dataset, the largest gains fell on the directionally organised, structurally heterogeneous classes, most notably Barren, where DCA-DeepLab increased the IoU from

10.40

to

31.34

(approximately a threefold improvement over DeepLabv3+), as well as Road and Building. The improvement was, however, not uniform across all classes; Water was the one class on which DCA-DeepLab fell below DeepLabv3+ (

71.96

vs.

74.36

,

- 2.40

pp), which we attribute to water bodies being spectrally and texturally homogeneous and lacking the directional structure that the DCAG prior is designed to exploit. The only method that surpassed ours on Barren was RS3Mamba (

37.24

), but this single-class advantage came at the price of a much lower overall mIoU (

46.90

compared with

51.71

) and a sharp drop on Agriculture (

33.98

), indicating an unbalanced rather than uniform improvement. Overall, DCA-DeepLab attained the highest mIoU among all compared methods, with the largest gains on the structurally heterogeneous classes (Building, Road and Barren), which, like cotton rows, are directionally organised at the metre scale.

3.3. Qualitative Analysis

3.3.1. Segmentation Visualisation

We provide qualitative comparisons in Figure 11, where DeepLabv3+, TransUNet and MSEONet are placed alongside DCA-DeepLab on representative tiles drawn from the test set. The baseline methods exhibited two common failure modes. First, predictions were often broken across planted rows even when the canopy was continuous, producing thin, striped artefacts that disagreed with the agronomic structure of the field. Second, transitions between adjacent growth classes were blurred, with patches of Vigorous bleeding into Moderate and vice versa, especially on plots where the canopy texture changed gradually. DCA-DeepLab visibly mitigated both issues. On these tiles, the directional decoupling in DCAG better preserved the along-row continuity and reduced cross-row leakage, while the spatially adaptive routing in MSAM-MFM yielded sharper transitions between adjacent growth classes. The visual improvement was most pronounced on minority-class regions, which is consistent with the per-class IoU gains in Table 2.

Some residual transition smoothing remained visible in DCA-DeepLab’s predictions, particularly where adjacent growth levels were highly interleaved. Cotton growth status forms a continuous biological gradient, so the boundaries between adjacent levels (e.g., Vigorous and Moderate) are gradual, unlike, for example, the building edges in urban scenes that the segmentation literature typically targets. The confusion matrix analysis in Figure 12 quantifies the improvement that DCA-DeepLab achieved despite this ambiguity; the Vigorous-to-Moderate leakage dropped from

44.82 %

to

35.96 %

(

- 8.86

pp), indicating that the directional and adaptive fusion priors reduced the dominant inter-class leakage even when the underlying signal was gradual.

3.3.2. Confusion Matrix

To complement the per-class IoU analysis with a class-conditional view, Figure 12 reports the confusion matrices of DeepLabv3+ and DCA-DeepLab on the cotton growth dataset, with cells showing row-normalised proportions. The diagonal entries give the per-class recall (the proportion of pixels belonging to class c that were also predicted as c), and the off-diagonal entries give the inter-class leakage. Recall is a different quantity from the IoU values reported in Table 2; recall captures how much of each class was recovered, while the IoU additionally penalises false positives. DeepLabv3+ confused adjacent growth levels heavily; only

51.85 %

of the Vigorous pixels were classified correctly, with

44.82 %

leaking into Moderate; for Sparse,

51.12 %

of the pixels leaked into Moderate; and for Moderate,

15.57 %

of the pixels leaked into Sparse. DCA-DeepLab moved diagonal mass towards the correct class on the Vigorous and Moderate rows; correct Vigorous predictions rose to

60.04 %

(

+ 8.19

pp), correct Moderate predictions rose from

70.43 %

to

74.01 %

(

+ 3.58

pp), the leakage from Vigorous to Moderate dropped by

8.86

pp, and the leakage from Moderate to Sparse dropped by

2.08

pp. The diagonal entry for the Sparse class moved slightly from

48.18 %

to

47.07 %

(

- 1.11

pp), but its IoU rose from

30.45

to

32.36

in Table 2. This combination of marginally lower recall and a higher IoU reflects a favourable precision–recall trade-off in which DCA-DeepLab predicted Sparse more selectively, removing more false positives than the few additional false negatives it introduced. The leakages that dominated the baseline’s errors both fell; Vigorous pixels leaked less into Moderate, and Moderate pixels leaked less into Sparse. The Vigorous and Moderate gains therefore reflect a shift in probability mass towards the correct class on the dominant confusion pairs, with one exception: a slight rise in Sparse-to-Moderate leakage (from

51.12 %

to

52.13 %

,

+ 1.01

pp) that was consistent with the marginally lower Sparse recall noted above. The remaining absolute margin on Sparse is a known limitation that we discuss further in Section 4.
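The combination of slightly lower recall and higher IoU discussed above can be made concrete with a small numerical sketch. The pixel counts below are hypothetical values chosen only to mimic the direction of the Sparse-class change; they are not taken from the dataset, and the function names are illustrative.

```python
def row_normalise(conf):
    """Divide each row of a confusion matrix by its row sum (empty rows stay zero)."""
    out = []
    for row in conf:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


def recall_precision_iou(tp, fp, fn):
    """Recall ignores false positives; the IoU penalises them."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    iou = tp / (tp + fp + fn)
    return recall, precision, iou


# Hypothetical counts: one fewer true positive, but far fewer false positives
before = recall_precision_iou(tp=48, fp=60, fn=52)
after = recall_precision_iou(tp=47, fp=45, fn=53)
```

In this toy case, recall falls from 0.48 to 0.47 while the IoU rises from 0.30 to about 0.32, because the reduction in false positives outweighs the single additional false negative.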

3.4. Ablation Study

3.4.1. Module-Level Ablation

We dissect the contribution of each architectural component of DCA-DeepLab on the cotton growth dataset. Model 1 is the architectural baseline, a DeepLabv3+ encoder-decoder, trained with APMFL under our protocol; Model 2 adds MSAM-MFM, Model 3 adds DCAG, and Model 4 enables both modules. The APMFL was enabled in all four configurations so that any reported difference was attributable to the architectural change rather than to the loss redesign; the contribution of the APMFL is examined separately in Section 3.4.3. The numbers of Model 1 therefore differed slightly from those of DeepLabv3+ in Table 2, where each baseline was trained with its own default cross-entropy objective. The results are reported in Table 5.

Each module brought a consistent gain in isolation; adding MSAM-MFM alone raised the mIoU from

50.19

to

50.81

(

+ 0.62

pp), and adding DCAG alone raised it to

51.13

(

+ 0.94

pp). When the two modules were combined, the mIoU rose to

51.74

(

+ 1.55

pp over the baseline), close to the arithmetic sum of the individual contributions (

0.62 + 0.94 = 1.56

pp). Within this APMFL-trained setting, the combined gain showed no sign of strong negative interference between the two modules. Given the single-seed setting, we do not interpret this close agreement as a precise additivity measurement, but it is at least consistent with the two modules acting on largely non-overlapping aspects of the network; DCAG operates on the spatial direction of feature aggregation in the encoder, while MSAM-MFM operates on the relative weighting of semantic and detail features in the decoder. The contribution of the APMFL itself and its interaction with the two modules are examined separately in Section 3.4.3.

3.4.2. Comparison of Attention Mechanisms

To verify that the gain of DCAG came from directional decoupling rather than from generic re-weighting, we replaced DCAG with a sequence of representative attention mechanisms, with the MSAM-MFM module disabled so that the comparison isolated the encoder attention module: SE [48], CBAM [62], coordinate attention [63], axial attention [64], and criss-cross attention [65]. Each attention block was inserted at the same three positions where DCAG resided in the encoder (after ResBlock1, ResBlock2 and ResBlock3) and trained with the same schedule (SGD, 100 epochs, initial learning rate

0.01

and APMFL with

γ = 1.5

); only the attention module itself was replaced. Transformer and Mamba blocks were deliberately excluded from this position-controlled substitution; they are whole-architecture paradigms rather than drop-in attention modules, so inserting one at a single encoder position would simultaneously change the backbone capacity, tokenisation or state-space design and optimisation behaviour, which would no longer constitute a controlled ablation of DCAG. The full-architecture transformer (Segmenter and TransUNet) and Mamba (RS3Mamba) baselines are compared separately in Table 2. The results are reported in Table 6.

The mIoU followed a consistent upward ordering across progressively more elaborate forms of directional encoding (single seed; the sub-percentage-point gaps should be read as indicative rather than as statistically separated). Generic channel-only or channel and spatial re-weighting (SE and CBAM) yielded the smallest gains (

+ 0.25

and

+ 0.39

mIoU over the no-attention baseline, respectively; Model 1 in Table 5). Methods that introduced direction implicitly through coordinate encoding (CA), through axial attention (Axial), or through criss-cross attention (CCA) closed most of the remaining gap. DCAG, which combines explicit directional decoupling with a learnable max/avg gate, achieved the best mIoU of

51.13 %

. The gain of DCAG was observed on every class, including the dominant background, indicating that the improvement was not concentrated in the rare classes alone.

3.4.3. Comparison of Loss Functions

We further compare the APMFL against four widely used losses for class-imbalanced semantic segmentation: cross-entropy (CE), focal loss [66], Tversky loss [67] and Lovász softmax loss [68]. The architecture was fixed to the full DCA-DeepLab, and only the supervisory signal was changed. The results are reported in Table 7.

CE left the optimisation at the mercy of the dominant background class (mIoU

48.98 %

). The standard focal loss only marginally improved the result (mIoU of

49.43 %

) because its class weights were static and could not adapt to the evolving class statistics in mid-to-late training. The Tversky and Lovász losses, which directly approximate region overlap metrics, performed better (mIoU of

50.84 %

and

50.66 %

, respectively) but neglected pixel-level prediction confidence. The APMFL outperformed all four alternatives, raising the mIoU to

51.74 %

(

+ 2.76

,

+ 2.31

,

+ 0.90

,

+ 1.08

pp over CE and focal, Tversky and Lovász losses, respectively). The largest absolute gain was in the minority Vigorous class, where the APMFL exceeded the Tversky and Lovász losses by

+ 2.01

and

+ 2.45

pp, respectively. We attribute this consistent improvement to the two synergistic ingredients of the APMFL: the running class frequency estimator, which tracks the actual training distribution, and the per-pixel confidence modulator, which amplifies the loss on hard pixels without permanently down-weighting the dominant class. Read together with the module-level ablation, these results indicate that the architectural components and the adaptive loss acted jointly rather than independently. The

+ 1.90

pp improvement over DeepLabv3+ in Table 2 is an end-to-end effect obtained when the baselines used their default cross-entropy objective; under a matched APMFL objective, the architectural components alone contributed

+ 1.55

pp (Table 5), whereas under cross-entropy supervision, the full proposed architecture did not outperform and in fact fell slightly below (

48.98

vs.

49.84

mIoU, a difference of

0.86

pp) the plain DeepLabv3+ baseline. The directional and adaptive fusion priors were therefore most effective when paired with the adaptive loss, and we accordingly report the main comparison as a combined architecture-and-loss result rather than as an isolated architectural gain.

3.4.4. Sensitivity to the Focal Factor γ

Finally, we examine how the focal factor

γ

in the APMFL affected the segmentation accuracy. Following the convention of the original focal loss paper [66], where

γ = 2

is recommended for object detection, we swept around this default value with a step of

0.5

to identify the optimum for our segmentation setting:

γ \in \{0.5, 1.0, 1.5, 2.0, 2.5\}

. Figure 13 reports the corresponding mIoU and per-class IoU. The mIoU rose from

50.62 %

at

γ = 0.5

to a peak of

51.74 %

at

γ = 1.5

and decreased to

51.30 %

at

γ = 2.0

and

50.87 %

at

γ = 2.5

. At small

γ

values, easy background pixels dominated the gradient signal, and the network spent little capacity on minority and ambiguous regions. As

γ

grew, the loss landscape shifted focus towards low-confidence predictions; the Vigorous and Sparse IoUs climbed noticeably between

γ = 0.5

and

γ = 1.5

, with no detriment to the dominant classes. Beyond

γ = 1.5

, however, the modulator suppressed easy examples too aggressively, depriving the network of a stable supervisory signal in the Background and Moderate classes, both of which declined. We adopted

γ = 1.5

as the default value, which delivered the best balance between hard example mining and training stability.

The fact that the optimal

γ

for semantic segmentation was lower than the

γ = 2

recommended for object detection [66] can plausibly be explained by the difference in easy-example ratios between the two tasks. In object detection, the focal loss is applied to a large set of candidate anchor boxes that is overwhelmingly dominated by easy negatives (trivially classified background anchors); a high

γ

value is needed to suppress this dominant easy negative signal. In dense semantic segmentation, by contrast, every pixel is an independent prediction, and the proportion of easy pixels is substantially lower. In our cotton growth dataset, the Background and Moderate classes together accounted for approximately

77 %

of the pixels (Background (

33.62 %

) + Moderate (

43.13 %

); see Figure 4), but a non-trivial fraction of these lay near class boundaries and remained uncertain throughout training. A lower

γ

therefore suffices to shift gradient mass towards hard pixels without over-suppressing the informative easy pixels that anchor the learning of dominant-class decision boundaries. The shallow optimisation surface around

γ = 1.5

(mIoU varied by less than

0.9

pp across

γ \in [1.0, 2.0]

) indicates that performance was stable across this range of

γ

.
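To give a sense of scale for the focal factor, the sketch below evaluates the standard focal modulating term for an easy pixel and a hard pixel across the swept values. It covers only the focal term of Equation (13), not the class weight or the confidence factor; the function names and the example probabilities 0.9 and 0.5 are illustrative assumptions.

```python
def focal_modulator(p_t, gamma):
    """Focal modulating term (1 - p_t) ** gamma for true-class probability p_t."""
    return (1.0 - p_t) ** gamma


def hard_to_easy_ratio(easy_p, hard_p, gamma):
    """How many times more weight a hard pixel receives than an easy one."""
    return focal_modulator(hard_p, gamma) / focal_modulator(easy_p, gamma)


# Easy pixel predicted at 0.9, hard pixel at 0.5, for the gamma values swept above
ratios = {g: hard_to_easy_ratio(0.9, 0.5, g) for g in (0.5, 1.0, 1.5, 2.0, 2.5)}
```

For these two probabilities, the ratio equals 5 raised to the power γ, so it grows from about 2 at γ = 0.5 to about 11 at γ = 1.5 and about 56 at γ = 2.5. This is consistent with the observation that large γ values leave easy pixels with very little influence on the gradient.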

4. Discussion

4.1. Comparison with Existing Approaches

To contextualise the contributions of DCA-DeepLab within the broader literature, we compared our design choices and results with recent UAV-based crop segmentation studies, attention-based methods, and transformer/Mamba architectures.

Most existing UAV crop monitoring studies frame growth assessment as plot- or plant-level regression or classification, leaving dense pixel-level segmentation comparatively underexplored. For example, Cai et al. [23] recognised rice growth stages from UAV imagery with a YOLOv8-based detector, and Xia et al. [24] proposed a lightweight network for cotton defoliation and boll-opening rate assessment, operating on image-level metrics. Zeng et al. [29] applied semantic segmentation to multi-stage rapeseed seedling monitoring from UAV imagery on a different crop with a distinct growth stage taxonomy. Our cotton growth dataset targets dense pixel-level segmentation of four classes with a severely long-tailed distribution (the Vigorous and Sparse classes together occupied around

23 %

of the pixels) and visually ambiguous inter-class boundaries. DCA-DeepLab addresses this challenge through task-specific inductive biases (directional structure, adaptive fusion and adaptive loss) rather than through generic capacity scaling, achieving a

51.74 %

mIoU on a problem where even strong baselines plateaued around

50 %

.

The limited benefit of generic mechanisms is most evident at the level of attention modules. Channel and spatial attention modules such as SE [48] and CBAM [62] have been widely adopted in remote sensing segmentation to recalibrate feature responses; Gao et al. [69], for example, integrated CBAM into an encoder-decoder network for crop mapping from Sentinel-2 imagery. Coordinate attention [63] extends this idea by encoding positional information along one spatial dimension at a time, which partially captures the directional structure. Our ablation (Table 6) shows that these attention mechanisms yielded only modest gains on UAV cotton imagery (

+ 0.25

to

+ 0.52

mIoU for SE, CBAM and coordinate attention), whereas DCAG’s explicit directional decoupling with a learnable max/avg gate achieved the best mIoU of

51.13 %

(

+ 0.94

pp over the no-attention baseline of Model 1 in Table 5). The key difference is that DCAG separated the along-row and across-row information pathways through its directional pooling and gating decomposition, which aligns with the physical structure of mechanised cotton fields. MCCANet [56] and MSGCNet [57] incorporate boundary-aware or multiscale global-context aggregation for remote sensing, yet their gains on our dataset remained modest (

50.37 %

and

50.49 %

mIoU, respectively) because they lack an explicit encoding of the dominant planting direction.

A comparable gap was observed for global-context architectures. Vision transformers such as Segmenter [53] and TransUNet [54] capture long-range dependencies via global self-attention, which in principle can model directional structure implicitly. However, on our dataset Segmenter achieved only a

49.90 %

mIoU, and TransUNet’s was

50.12 %

, both below that of DCA-DeepLab (

51.74 %

). We attribute this to two factors: (1) global attention treats all spatial positions equally and does not encode the row-column prior that is specific to mechanised agriculture, and (2) the quadratic complexity of self-attention limits the resolution at which these models can operate within a practical compute budget. RS3Mamba [55], which uses selective state-space modelling, achieved a

49.13 %

mIoU with fewer parameters (

41.7

M) but similarly lacks a directional inductive bias. A recent survey [36] noted that transformer-based methods, while strong on large-scale benchmarks with diverse scene categories, carry a substantial computational burden in UAV settings. Our results indicate that on a domain-specific task with strong structural regularities, a well-designed CNN with task-appropriate priors can match or exceed a transformer’s accuracy at a substantially lower computational cost. DCA-DeepLab achieved the highest mIoU among all compared methods while requiring only

61.7

M parameters and

184.6

G FLOPs, which is

41 %

fewer parameters and

34 %

fewer FLOPs than TransUNet.

Finally, on the optimisation side, the APMFL can be positioned against existing loss-level remedies for class imbalance, which is a pervasive challenge in remote sensing segmentation. Wang et al. [70], for instance, adopted focal loss to counter the uneven class proportions of land cover datasets. The standard focal loss [66] was designed for object detection where the foreground-to-background ratio is extreme (approximately 1:1000); in our pixel-dense setting, the imbalance was less extreme but more nuanced, with confusion concentrated between adjacent growth levels rather than between the foreground and background. Region-based losses such as the Tversky [67] and Lovász losses [68] directly optimise overlap metrics but cannot modulate individual pixel contributions based on prediction confidence. The APMFL combines the strengths of both approaches: a running class frequency estimator that adapts to the evolving training distribution and a per-pixel confidence modulator that focuses learning on uncertain pixels. The result was a

+ 0.90

mIoU gain over the Tversky loss and

+ 1.08

mIoU over the Lovász loss (Table 7), with the largest improvement in the minority Vigorous class (

+ 2.01

and

+ 2.45

pp, respectively).

4.2. Why DCA-DeepLab Works on UAV Cotton Imagery

The three properties of UAV cotton imagery identified in Section 1, namely a strong row-and-column directional structure, a misalignment between high-level semantics and low-level detail in the decoder and severe pixel-level class imbalance, are addressed by DCAG, MSAM-MFM and the APMFL, respectively. The quantitative gains reported in Section 3 were not the result of a single dominant component but emerged from the way the three proposed modules interacted with the specific properties of UAV cotton imagery. The DCAG module provides the network with an explicit prior about the row-and-column planting structure, which is a defining visual signature of mechanised cotton fields and was clearly captured by UAV imaging at the 40-m flight altitude used in this study. The crop-row structure also serves as a primary spatial cue in other vision-based agricultural systems [71], which supports encoding it explicitly in the network. By decoupling horizontal and vertical pooling and gating the two pathways, DCAG is designed to amplify the along-row continuity that any agronomic interpretation must respect and to suppress feature responses that drift across rows. The qualitative comparison in Figure 11 illustrates this effect; the predictions of DeepLabv3+, TransUNet and MSEONet often broke the row continuity or merged adjacent rows, while DCA-DeepLab kept the striped layout intact even where adjacent growth classes were interleaved. The attention-mechanism ablation in Table 6 is further consistent with this benefit arising from directional decoupling rather than from generic re-weighting; SE, CBAM and coordinate attention delivered modest gains (

+ 0.25

,

+ 0.39

and

+ 0.52

mIoU, respectively), while methods that more strongly encode the direction (Axial and CCA) closed most of the remaining gap, and DCAG, which combines directional decoupling with a learnable max/avg gate, sat at the top.

The MSAM-MFM module addresses a different but complementary failure mode. In a cotton field, homogeneous regions of the Vigorous or Moderate class are best described by high-level semantic features, whereas pixels on the boundary between adjacent growth classes are best described by low-level details such as canopy edges and inter-row gaps. Plain concatenation, which is the de facto default in DeepLabv3+, applies the same fusion ratio everywhere and consequently smooths the very boundaries that determine the per-class IoU on the long-tailed Vigorous and Sparse classes. By introducing an explicit feature-source dimension and routing the two streams through a soft, multi-scale attention, MSAM-MFM lets the network choose, at every location, whether semantics or detail should dominate. The module-level ablation in Table 5 shows that this single change brought

+ 2.21

pp on the Moderate-class IoU on top of the baseline, and its combined gain with DCAG was close to the sum of their individual contributions (Table 5), consistent with the two modules acting on complementary aspects of the network: the spatial structure in the encoder and inter-layer routing in the decoder.

The third ingredient, the APMFL, targets the long-tailed pixel distribution that is intrinsic to growth-class data. Unlike static class weights, the running median-frequency estimator continuously tracks the empirical class distribution as training progresses; this is particularly relevant under heavy data augmentation, which reshapes the per-batch frequency from one iteration to the next. The pixel-level confidence modulator $1 + y_{ij} - \hat{y}_{ij}$ adds a second axis of adaptivity by amplifying the loss only on pixels that the model is currently uncertain about. The loss ablation in Table 7 shows that the APMFL improved the IoU of the minority Vigorous class by +2.01 pp over the Tversky loss and +2.45 pp over the Lovász loss, while keeping the background IoU essentially unchanged, which is the desired behaviour: gains on under-represented classes should not come from sacrificing the dominant ones.
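A rough sketch of how these two adaptive terms could combine is given below. Several parts are assumptions rather than the paper's definition: the running estimator is implemented as an exponential moving average with a hypothetical `momentum`, the focal factor defaults to `gamma=2.0`, $y_{ij}$ is taken as the one-hot target (so it equals 1 for the true class), and the terms are simply multiplied onto the cross-entropy.

```python
import math
from statistics import median

class RunningMedianFrequency:
    """Tracks class frequencies with an exponential moving average (assumed)."""

    def __init__(self, num_classes, momentum=0.9):
        self.freq = [1.0 / num_classes] * num_classes
        self.momentum = momentum

    def update(self, labels):
        counts = [0] * len(self.freq)
        for c in labels:
            counts[c] += 1
        total = len(labels)
        for c, n in enumerate(counts):
            self.freq[c] = (self.momentum * self.freq[c]
                            + (1.0 - self.momentum) * n / total)

    def weights(self, eps=1e-8):
        # Median-frequency balancing: rare classes get weights above 1.
        med = median(self.freq)
        return [med / (f + eps) for f in self.freq]

def modulated_focal_loss(p_true, label, class_weights, gamma=2.0):
    """Loss for one pixel given the probability assigned to its true class."""
    modulator = 1.0 + 1.0 - p_true   # 1 + y - y_hat with y = 1 for the true class
    focal = (1.0 - p_true) ** gamma  # standard focal down-weighting of easy pixels
    ce = -math.log(p_true)
    return class_weights[label] * modulator * focal * ce

tracker = RunningMedianFrequency(num_classes=4)
# Hypothetical batch: class 1 dominates, class 3 is rare.
tracker.update([0] * 30 + [1] * 50 + [2] * 15 + [3] * 5)
w = tracker.weights()
easy = modulated_focal_loss(0.9, 1, w)  # confident pixel of a frequent class
hard = modulated_focal_loss(0.3, 3, w)  # uncertain pixel of a rare class
```

Under these assumptions the uncertain minority-class pixel contributes far more to the loss than the confident majority-class pixel, which is the qualitative behaviour the paragraph describes.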

4.3. Implications for UAV-Based Precision Agriculture

From an applied perspective, the most actionable result is the per-class IoU profile rather than the headline mIoU. In a precision agriculture pipeline, the segmentation map is a means to an end; managers need to know where the under-performing patches are so that irrigation, fertilisation or pest control decisions can be retargeted. In UAV-based precision agriculture, spatially explicit crop status maps of this kind are used to drive variable-rate fertilisation and irrigation decisions [72,73]. DCA-DeepLab raised the IoU of the operationally most relevant minority classes (Vigorous from 36.47% to 39.98% and Sparse from 30.45% to 32.36% relative to DeepLabv3+) without materially degrading the bulk classes, which means the resulting maps localised the regions worth visiting in the field more reliably. The headline mIoU advance of 1.10 pp over the strongest competing baseline (MSEONet) and 1.90 pp over the DeepLabv3+ baseline aggregates four classes whose individual IoUs differ by tens of percentage points; once this aggregate is decomposed, the advance is concentrated on the operationally relevant growth classes. Relative to MSEONet, DCA-DeepLab gained +2.08 pp on Vigorous, +0.73 pp on Moderate and +1.25 pp on Sparse. Relative to DeepLabv3+, the corresponding gains were +3.51, +2.37 and +1.91 pp, respectively. The per-class gains were therefore consistent across both reference points, which helps in interpreting the practical relevance of the aggregate mIoU gain, because these growth classes drive precision agriculture decisions. These gains were obtained at a near-baseline computational cost (+3.9% parameters and +3.8% FLOPs over DeepLabv3+), so the added complexity of DCAG and MSAM-MFM remained small relative to the accuracy benefit. The cross-domain evaluation on LoveDA, where DCA-DeepLab increased the IoU of the difficult Barren class from 10.40% to 31.34% (approximately a threefold improvement over DeepLabv3+) while keeping the highest overall mIoU of 51.71%, provides preliminary evidence that the design may transfer beyond the cotton flowering setting to other UAV-based crop health and land cover applications that share the long-tailed and structurally directional nature of cotton fields, although further evaluation on additional datasets is needed to draw stronger conclusions.
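Because the argument above rests on decomposing mIoU into per-class values, a short sketch of that computation may help. The confusion-matrix counts are hypothetical and not taken from the paper; only the two subtractions at the end use reported IoU values. The example also shows why recall (the diagonal of a row-normalised confusion matrix) and IoU can diverge: two classes with the same recall can have very different IoUs once false positives are counted.

```python
def per_class_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious, recalls = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp    # other pixels labelled c
        ious.append(tp / (tp + fp + fn))
        recalls.append(tp / (tp + fn))
    return ious, recalls

# Hypothetical 3-class pixel counts, for illustration only.
conf = [[80, 15, 5],
        [10, 60, 30],
        [0, 10, 40]]
ious, recalls = per_class_metrics(conf)
miou = sum(ious) / len(ious)

# Per-class gains over DeepLabv3+ recomputed from the reported IoUs.
vigorous_gain = round(39.98 - 36.47, 2)
sparse_gain = round(32.36 - 30.45, 2)
```

In the toy matrix, classes 0 and 2 share a recall of 0.8, yet class 2 has a much lower IoU because many pixels of class 1 are wrongly assigned to it.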

Beyond the headline accuracy, the modest computational footprint of DCA-DeepLab (61.7 M parameters and 184.6 G FLOPs at a $512 \times 512$ input) bears on three deployment scenarios in operational UAV workflows. First, the network can be run offline on a workstation co-located with the UAV ground station, which is the typical configuration of agricultural cooperatives in Xinjiang and requires only a single consumer-grade 24 GB GPU. Second, in larger demonstration bases, the model can be served from a small on-premise GPU cluster and consumed by mobile clients via a thin REST interface, which is the configuration we found most practical during dataset collection. Third, on-board real-time inference at flight speed remains an open challenge: it would require pruning or distilling DCA-DeepLab into a lightweight variant, which is one of the future work directions outlined in Section 5.

4.4. Limitations and Future Work

Several limitations remain and motivate ongoing work. First, although the APMFL substantially boosted minority-class performance and yielded a favourable precision–recall trade-off on the Sparse class (the higher IoU at slightly lower recall reported in Section 3.3), the absolute IoU of Sparse still sat below 33%, and its diagonal entry in the confusion matrix (Figure 12) changed only marginally relative to DeepLabv3+. We attribute this residual gap to the extremely low pixel share of Sparse and the visual similarity between Sparse-class canopies and exposed soil, both of which make the class difficult even with adaptive re-weighting. In remote sensing classification more broadly, sparse vegetation and bare soil categories are recognised as prone to omission under class imbalance and spectral confusion in mixed soil–canopy pixels [74]. A targeted remedy would be to combine the APMFL with hard-example synthesis or with patch-level oversampling that preserves the long-row context on which DCAG depends. In a downstream yield estimation pipeline, the residual confusion of Sparse with bare soil is a known concern that can be partially mitigated by post hoc filtering using the multispectral NDVI channel. Second, the cotton dataset was acquired during a single 2024 season at the Manas Hui’er Farm and the Changji Huaxing Farm. Although the training, validation and test sets were spatially disjoint at the orthomosaic level, the dataset cannot probe robustness to multi-year phenological variation, different cultivars or extreme weather events; collecting longitudinal data, evaluating the method on additional agricultural UAV benchmarks and adopting a domain generalisation evaluation protocol are therefore important next steps. Recent agricultural segmentation studies explicitly target such cross-domain and cross-crop transfer through domain adaptation techniques [75], precisely because segmentation models tend to lose accuracy under distribution shift across crops, sites or sensing platforms. Third, the current pipeline relies exclusively on the RGB stream of the Mavic 3 Multispectral platform; the multispectral channels are used only as an agronomic reference during dataset construction. Integrating NDVI, red-edge or thermal signals into the segmentation network in a principled way while preserving the directional structure prior introduced by DCAG is a natural extension that we plan to pursue. Multispectral and vegetation index features from UAV platforms can provide cues for crop vigour and nutrient status that are unavailable in the RGB stream alone [76], which motivates this extension. Finally, the directional decoupling embodied by DCAG assumes a regular row and column layout; on plots with curved planting, severe canopy occlusion or partial crop rotation, the prior may become weaker. Vision-based crop-row methods are themselves known to degrade under varying field conditions such as curved rows, irregular spacing and occlusion [71], a limitation that similarly affects the directional prior exploited by DCAG. Quantifying this degradation and designing a layout-aware variant of DCAG is a worthwhile direction for fully unstructured fields. As a methodological caveat, all reported results were obtained with a single fixed random seed; the sub-percentage-point differences between individual ablation configurations should therefore be read as indicative rather than as statistically separated effects, and multi-seed evaluation with variance reporting is left for future work.
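The post hoc NDVI filtering mentioned above for the Sparse versus bare soil confusion could look like the following sketch. It is an illustration only: the label ids and the function name are hypothetical, the NDVI raster is assumed to be co-registered with the prediction map, and the cut-off of 0.30 is borrowed from the indicative Background range in Table 1 rather than from a tuned procedure.

```python
BACKGROUND, VIGOROUS, MODERATE, SPARSE = 0, 1, 2, 3  # hypothetical label ids
NDVI_SOIL_MAX = 0.30  # indicative Background threshold from Table 1

def filter_sparse_with_ndvi(pred, ndvi, threshold=NDVI_SOIL_MAX):
    """Relabel Sparse pixels whose NDVI looks like bare soil as Background.

    pred and ndvi are same-sized 2-D lists; other classes are left untouched.
    """
    return [[BACKGROUND if (p == SPARSE and v < threshold) else p
             for p, v in zip(pred_row, ndvi_row)]
            for pred_row, ndvi_row in zip(pred, ndvi)]

# Toy 2x3 prediction with matching NDVI values (made up for illustration).
pred = [[SPARSE, SPARSE, MODERATE],
        [BACKGROUND, SPARSE, VIGOROUS]]
ndvi = [[0.12, 0.41, 0.60],
        [0.10, 0.29, 0.78]]
cleaned = filter_sparse_with_ndvi(pred, ndvi)
```

Only Sparse predictions are revisited, so the filter cannot degrade the Vigorous or Moderate classes; whether it helps in practice would depend on the NDVI noise level near the threshold.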

5. Conclusions

This paper proposed DCA-DeepLab, an extension of DeepLabv3+ for semantic segmentation of cotton growth status from UAV remote sensing imagery. It addressed three task-specific challenges (directional row-column structure, semantic-detail misalignment in the decoder and severe pixel-level class imbalance) through the DCAG, MSAM-MFM and APMFL components. On our UAV cotton growth dataset, DCA-DeepLab achieved the highest overall mIoU (51.74%) among 11 representative CNN-, transformer- and Mamba-based baselines, improving the IoUs of the Vigorous and Sparse classes by 3.51 and 1.91 percentage points over DeepLabv3+, respectively.

On the public LoveDA benchmark, the same architecture, retrained on LoveDA, reached the highest mIoU among the compared methods (51.71%). Comprehensive ablations indicated that each component contributed a complementary part of the overall gain. Future work will extend validation across years, cultivars and sites, integrate multispectral and thermal channels for multimodal fusion and develop lightweight variants for on-board real-time inference.

Author Contributions

Conceptualisation, L.J. and J.G.; data curation, Z.L.; formal analysis, L.J.; funding acquisition, H.S.; investigation, J.G.; methodology, L.J.; project administration, H.S.; resources, H.S.; software, J.G. and Z.L.; supervision, J.Z.; visualisation, Z.L.; writing—original draft, L.J. and J.G.; writing—review and editing, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Open Foundation of the State Key Laboratory of Precision Space-time Information Sensing Technology under Grant No. STSL2025-A-06(C), and in part by the Innovation Program for Doctoral Students of Xinjiang University under Grant XJU2024BS091.

Data Availability Statement

The data presented in this study are available from the corresponding authors on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yu, Z.; Yang, Y. Carbon footprint of global cotton production. Resour. Environ. Sustain. 2025, 20, 100214.
2. Food and Agriculture Organization of the United Nations. Cotton Market Review. 2025. Available online: https://www.fao.org/markets-and-trade/food-and-agricultural-markets-analysis-FAMA/cotton/en (accessed on 22 February 2026).
3. U.S. Department of Agriculture, Foreign Agricultural Service. Production, Supply and Distribution (PSD) Database: Cotton. Database, 2025. In Data Release Synchronized with May 2026 Cotton Report; U.S. Department of Agriculture, Foreign Agricultural Service: Washington, DC, USA, 2026.
4. Mao, S.; Li, P.; Cheng, S.; Zhang, Z.; Li, X.; Tian, L.; Gao, L.; Wang, Z. Achievements, contributions, experience and prospects of the development of Xinjiang oasis cotton: Written on the 70th anniversary of the founding of the Xinjiang Uygur Autonomous Region. China Cotton 2025, 52, 1–17. (In Chinese)
5. National Bureau of Statistics of China. Announcement of the National Bureau of Statistics on Cotton Production in 2025; National Bureau of Statistics of China: Beijing, China, 2025. (In Chinese)
6. Abdullah, H.M.; Islam, M.; Islam, M.S.; Sen, S.; Tuhin, A.K.; Arman, S.E.; Hasan, M.M. Cotton seedling monitoring and growth stage classification integrating deep learning and feature engineering. Smart Agric. Technol. 2025, 12, 101619.
7. Quddus, M.A.; Chowdhury, S.; Jasper, W.; Das, L. Enhancing cotton yield prediction with robust deep neural network-based framework. Eur. J. Agron. 2025, 170, 127750.
8. Guo, S.; Liu, T.; Han, Y.; Wang, G.; Du, W.; Wu, F.; Li, Y.; Feng, L. Changes in within-boll yield components explain cotton yield and quality variation across planting dates under a double cropping system of cotton-wheat. Field Crop. Res. 2023, 293, 108853.
9. Bista, M.K.; Kodadinne Narayana, N.; Chakravaram, A.; Pieralisi, B.; Dhillon, J.; Reddy, K.R.; Bheemanahalli, R. Intensifying heat stress impacts cotton flowering and boll development efficiency. BMC Plant Biol. 2025, 25, 984.
10. Bai, Z.; Dong, B.; Fan, J.; Kefauver, S.C.; Araus, J.L.; Zhang, F.; Yin, F. Cotton yield estimation based on vegetation indices from UAV RGB imagery. Trans. Chin. Soc. Agric. Mach. 2025, 56, 182–192. (In Chinese)
11. Zhang, R.; Wu, X.; Li, J.; Zhao, P.; Zhang, Q.; Wuri, L.; Zhang, D.; Zhang, Z.; Yang, L. A bibliometric review of deep learning in crop monitoring: Trends, challenges, and future perspectives. Front. Artif. Intell. 2025, 8, 1636898.
12. Vaz, C.M.P.; Ferreira, E.J.; Speranza, E.A.; Franchini, J.C.; Naime, J.d.M.; Inamasu, R.Y.; Lopes, I.d.O.N.; das Chagas, S.; Schelp, M.X.; Vecchi, L.; et al. Cotton yield map prediction using Sentinel-2 satellite imagery in the Brazilian Cerrado production system. AgriEngineering 2025, 7, 390.
13. Xue, S.; Shi, H.; Li, X.; Yan, J.; Wang, W.; Miao, Q.; Yan, Y.; Hou, C.; Zhao, Y.; Li, X. Crop classification mapping in multi-cloud regions: An integrated approach using GF-SG time series reconstruction and 3D deep convolutional networks. Sci. Remote Sens. 2026, 13, 100361.
14. Zhang, S.; Wang, X.; Lin, H.; Dong, Y.; Qiang, Z. A review of the application of UAV multispectral remote sensing technology in precision agriculture. Smart Agric. Technol. 2025, 12, 101406.
15. Yao, W.; Liu, C.; Liu, Y.; Zheng, Q.; Wang, J.; Yu, H.; Chen, C.; Guo, S. Unmanned aerial vehicle payload technology applications in agriculture and other low-altitude scenarios: A review. Front. Plant Sci. 2025, 16, 1721484.
16. Bazrafkan, A.; Igathinathane, C.; Bandillo, N.; Flores, P. Optimizing integration techniques for UAS and satellite image data in precision agriculture: A review. Front. Remote Sens. 2025, 6, 1622884.
17. Zhang, L.; Wang, A.; Zhang, H.; Zhu, Q.; Zhang, H.; Sun, W.; Niu, Y. Estimating leaf chlorophyll content of winter wheat from UAV multispectral images using machine learning algorithms under different species, growth stages, and nitrogen stress conditions. Agriculture 2024, 14, 1064.
18. Ji, J.; Li, N.; Cui, H.; Li, Y.; Zhao, X.; Zhang, H.; Ma, H. Study on monitoring SPAD values for multispatial spatial vertical scales of summer maize based on UAV multispectral remote sensing. Agriculture 2023, 13, 1004.
19. Tang, Y.; Zhou, Y.; Cheng, M.; Sun, C. Comprehensive growth index (CGI): A comprehensive indicator from UAV-observed data for winter wheat growth status monitoring. Agronomy 2023, 13, 2883.
20. Gu, H.; Xue, C.; Wang, G.; Lan, Y.; Wang, H.; Song, C. UAV-based multispectral inversion of integrated cotton growth. Agronomy 2024, 14, 2903.
21. Zhang, D.; Qi, H.; Guo, X.; Sun, H.; Min, J.; Li, S.; Hou, L.; Lv, L. Integration of UAV multispectral remote sensing and random forest for full-growth stage monitoring of wheat dynamics. Agriculture 2025, 15, 353.
22. Zarbakhsh, S.; Fakhrzad, F.; Rajkovic, D.; Niedbała, G.; Piekutowska, M. Approaches and challenges in machine learning for monitoring agricultural products and predicting plant physiological responses to biotic and abiotic stresses. Curr. Plant Biol. 2025, 43, 100535.
23. Cai, W.; Lu, K.; Fan, M.; Liu, C.; Huang, W.; Chen, J.; Wu, Z.; Xu, C.; Ma, X.; Tan, S. Rice growth-stage recognition based on improved YOLOv8 with UAV imagery. Agronomy 2024, 14, 2751.
24. Xia, M.; Chen, X.; Tian, X.; Wen, H.; Zhao, Y.; Liu, H.; Liu, W.; Zheng, Y. Lightweight deep learning for real-time cotton monitoring: UAV-based defoliation and boll-opening rate assessment. Agriculture 2025, 15, 2095.
25. Li, Y.; Yang, W.; Lu, Z.; Shi, H. YH-RTYO: An end-to-end object detection method for crop growth anomaly detection in UAV scenarios. PeerJ Comput. Sci. 2024, 10, e2477.
26. Rana, A.; Vaidya, P. YOLO-based deep learning framework for real-time multi-class plant health monitoring in precision agriculture. Sci. Rep. 2026, 16, 197.
27. Reji, J.; Nidamanuri, R.R. Deep learning-based prediction of plant height and crown area of vegetable crops using LiDAR point cloud. Sci. Rep. 2024, 14, 14903.
28. Zhao, D.; Yang, G.; Xu, T.; Yu, F.; Zhang, C.; Cheng, Z.; Ren, L.; Yang, H. Dynamic maize true leaf area index retrieval with KGCNN and TL and integrated 3D radiative transfer modeling for crop phenotyping. Plant Phenomics 2025, 7, 100004.
29. Zeng, F.; Wang, R.; Jiang, Y.; Liu, Z.; Ding, Y.; Dong, W.; Xu, C.; Zhang, D.; Wang, J. Growth monitoring of rapeseed seedlings in multiple growth stages based on low-altitude remote sensing and semantic segmentation. Comput. Electron. Agric. 2025, 232, 110135.
30. Zhan, Y.; Zhou, Y.; Bai, G.; Ge, Y. Bagging improves the performance of deep learning-based semantic segmentation with limited labeled images: A case study of crop segmentation for high-throughput plant phenotyping. Sensors 2024, 24, 3420.
31. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; Volume 34, pp. 12077–12090.
33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
34. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003.
35. Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-based model with prior knowledge integration for remote sensing image semantic segmentation. Remote Sens. 2025, 17, 590.
36. Kheddar, H.; Habchi, Y.; Ghanem, M.C.; Hemis, M.; Niyato, D. Recent Advances in Transformer and Large Language Models for UAV Applications. arXiv 2025, arXiv:2508.11834.
37. Guan, X.; Liu, M.; Cao, S.; Jiang, J. Phenology-aware transformer for semantic segmentation of non-food crops from multi-source remote sensing time series. Remote Sens. 2025, 17, 2346.
38. Valicharla, S.K.; Karimzadeh, R.; Li, X.; Park, Y.L. Transformer-based semantic segmentation of Japanese knotweed in high-resolution UAV imagery using Twins-SVT. Information 2025, 16, 741.
39. Liu, H.; Meng, L.; Zhang, X.; Ustin, S.; Ning, D.; Sun, S. Cotton yield estimation model based on time-series Landsat imagery. Trans. Chin. Soc. Agric. Eng. 2015, 31, 215–220. (In Chinese)
40. Feng, M.; Su, Y.; Lin, T.; Yu, X.; Song, Y.; Jin, X. High-throughput cotton yield estimation based on UAV multi-source remote sensing data and machine learning. Trans. Chin. Soc. Agric. Mach. 2025, 56, 169–179. (In Chinese)
41. Li, Y.; Guo, S.; Jia, S.; Yan, Y.; Jia, H.; Zhang, W. Quantifying the effects of UAV flight altitude on the multispectral monitoring accuracy of soil moisture and maize phenotypic parameters. Agronomy 2025, 15, 2137.
42. Stamenković, Z.; Kešelj, K.; Kostić, M.; Aćin, V.; Tekić, D.; Ivanišević, M.; Novaković, T. Assessing the impact of UAV flight altitudes on the accuracy of multispectral indices. Contemp. Agric. 2024, 73, 157–164.
43. Hu, Z.; Fan, S.; Li, Y.; Tang, Q.; Bao, L.; Zhang, S.; Sarsen, G.; Guo, R.; Wang, L.; Zhang, N.; et al. Estimating stratified biomass in cotton fields using UAV multispectral remote sensing and machine learning. Drones 2025, 9, 186.
44. Ding, J.; Ji, M.; Alkahtani, J.; Li, H.; Liu, Y.; Zhou, F.; Zhao, Z.; Dong, S.; Chen, Y.; Zhang, X.; et al. Influence of population biomass accumulation during different growth periods on agronomic traits and cotton yield. Agronomy 2024, 14, 2625.
45. Hand, C.; Snider, J.L.; Roberts, P.M. Cotton Growth Monitoring and PGR Management; Circular 1244; University of Georgia Cooperative Extension: Athens, GA, USA, 2022.
46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
47. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
48. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
49. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 23968–23980.
50. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
51. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context encoder network for 2D medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292.
52. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
53. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272.
54. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280.
55. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
56. Zheng, J.; Shao, A.; Yan, Y.; Wu, J.; Zhang, M. Remote sensing semantic segmentation via boundary supervision-aided multiscale channelwise cross attention network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4405814.
57. Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale Global Context Network for semantic segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622913.
58. Li, J.; Wei, Y.; Wei, T.; He, W. A comprehensive deep-learning framework for fine-grained farmland mapping from high-resolution images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5601215.
59. Huang, W.; Deng, F.; Liu, H.; Ding, M.; Yao, Q. Multiscale semantic segmentation of remote sensing images based on edge optimization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5616813.
60. Zhou, Q.; Qiang, Y.; Mo, Y.; Wu, X.; Latecki, L.J. BANet: Boundary-assistant encoder-decoder network for semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25259–25270.
61. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
62. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
63. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
64. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial Attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180.
65. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
66. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
67. Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. In Proceedings of the International Workshop on Machine Learning in Medical Imaging (MLMI), Quebec City, QC, Canada, 10 September 2017; pp. 379–387.
68. Berman, M.; Triki, A.R.; Blaschko, M.B. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4413–4421.
69. Gao, M.; Lu, T.; Wang, L. Crop mapping based on Sentinel-2 images using semantic segmentation model of attention mechanism. Sensors 2023, 23, 7008.
70. Wang, G.; Chen, J.; Mo, L.; Wu, P.; Yi, X. Lightweight land cover classification via semantic segmentation of remote sensing imagery and analysis of influencing factors. Front. Environ. Sci. 2024, 12, 1329517.
71. de Silva, R.; Cielniak, G.; Gao, J. Vision based crop row navigation under varying field conditions in arable fields. Comput. Electron. Agric. 2024, 217, 108581.
72. Chen, X.; Zhang, H.; Wong, C.U.I. Dynamic Monitoring and Precision Fertilization Decision System for Agricultural Soil Nutrients Using UAV Remote Sensing and GIS. Agriculture 2025, 15, 1627.
73. Yang, X.; Chen, J.; Lu, X.; Liu, H.; Liu, Y.; Bai, X.; Qian, L.; Zhang, Z. Advances in UAV Remote Sensing for Monitoring Crop Water and Nutrient Status: Modeling Methods, Influencing Factors, and Challenges. Plants 2025, 14, 2544.
74. Deng, Y.; Chen, G.; Tang, B.; Duan, X.; Zuo, L.; Zhao, H. Study on Class Imbalance in Land Use Classification for Soil Erosion in Dry–Hot Valley Regions. Remote Sens. 2025, 17, 1628.
75. Nadeem, N.; Asad, M.H.; Bais, A. CCUDA-SS: Cross-Crop Unsupervised Domain Adaptation for Semantic Segmentation. IEEE Trans. AgriFood Electron. 2026.
76. Blekanov, I.; Molin, A.; Zhang, D.; Mitrofanov, E.; Mitrofanova, O.; Li, Y. Monitoring of grain crops nitrogen status from UAV multispectral images coupled with deep learning approaches. Comput. Electron. Agric. 2023, 212, 108047.

Figure 1. UAV orthomosaics from the Manas Hui’er Farm (a–c) and the Changji Huaxing Farm (d–f), illustrating row-and-column visibility across three nominal flight altitudes (20, 40 and 60 m above ground level).

Figure 2. Representative ground-level examples of the three foreground cotton growth classes. The corresponding top-down UAV view of the cotton canopy is shown in Figure 1. The background class corresponds to bare soil and inter-row gaps where no cotton canopy is present (see Table 1 for its quantitative definition).

Figure 3. Examples of the data augmentation pipeline applied to the training set, covering geometric transformations and photometric perturbations.

Figure 4. Pixel-level class distribution of the constructed cotton growth dataset. The distribution is long-tailed and dominated by the Moderate and Background classes.

Figure 5. Overall architecture of the proposed DCA-DeepLab. DCAG is embedded in the encoder to capture row and column directional dependencies; MSAM-MFM operates in the decoder to perform spatially adaptive cross-layer fusion; and APMFL replaces standard cross-entropy as the supervision signal. Abbreviations: RGB = red–green–blue input image; ResBlock = residual block; ASPP = Atrous Spatial Pyramid Pooling; Conv = convolution; DCAG = Dual-Coordinate Attention Gating; MSAM-MFM = Multi-Scale Attention-Guided Modulated Feature Merging.

Figure 6. Architecture of the Dual-Coordinate Attention Gating (DCAG) module. Average- and max-pooled directional descriptors are jointly modelled and fused under a learnable gate. Abbreviations: Conv2D = two-dimensional convolution; BN = batch normalization; ReLU = rectified linear unit; SE = Squeeze-and-Excitation block; $F_{A}$ and $F_{M}$ denote the average- and max-pooled directional attention maps ($A_{avg}$ and $A_{\max}$ in the text, respectively).

Figure 7. Architecture of the Multi-Scale Attention-Guided Modulated Feature Merging (MSAM-MFM) module. Multi-scale pooling produces spatially adaptive fusion weights along an explicit feature-source dimension. Abbreviations: GAP = global average pooling; AAP = adaptive average pooling; MLP = multi-layer perceptron. $F_{1}$ and $F_{2}$ denote the high-level semantic and low-level detail input feature maps, respectively.

Figure 8. Representative images and ground-truth labels from the LoveDA benchmark, used here as a cross-domain evaluation setting.

Figure 9. Sensitivity of DCA-DeepLab to the initial learning rate on the cotton growth dataset. The horizontal axis is the initial learning rate; the left and right vertical axes are Acc and mIoU, respectively.

Figure 10. Accuracy and model complexity on the cotton growth dataset. In both panels, the marker size is proportional to FLOPs. DCA-DeepLab is highlighted.

Figure 11. Qualitative comparison on the cotton growth dataset. From left to right: (a) input image, (b) ground truth, (c) DeepLabv3+, (d) TransUNet, (e) MSEONet and (f) DCA-DeepLab. DCA-DeepLab preserved the row-column structure more faithfully than the baselines. This is most evident in the third row: the regularly planted cotton rows visible in the ground truth (b) were best retained in (f), largely collapsed into uniform colour blocks in (c) and remained more fragmented in (d,e) than in (f). Residual fragmentation in (f) occurred where adjacent growth levels were highly interleaved and reflects the inherent softness of growth class boundaries.

Figure 12. Row-normalised confusion matrices on the cotton growth dataset. Diagonal values are per-class recalls; off-diagonal cells show inter-class leakages.
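The row normalisation described in the Figure 12 caption can be sketched as follows. This is a minimal illustration, assuming rows index the ground-truth class and columns the predicted class; the counts in `cm` are hypothetical and are not taken from the paper, and the helper names are introduced here only for illustration.

```python
def row_normalise(cm):
    """Divide each row of a square confusion matrix by its row sum.

    After normalisation the diagonal holds per-class recall and the
    off-diagonal cells hold the share of each true class that leaked
    into another predicted class.
    """
    normalised = []
    for row in cm:
        total = sum(row)
        normalised.append([v / total if total else 0.0 for v in row])
    return normalised


def per_class_iou(cm):
    """IoU per class from raw counts: TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted as another class
        fp = sum(cm[r][k] for r in range(n)) - tp   # another class, predicted as k
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious


# Hypothetical pixel counts (rows: ground truth, columns: prediction),
# ordered Background, Vigorous, Moderate, Sparse.
cm = [
    [80, 5, 10, 5],
    [4, 12, 3, 1],
    [10, 6, 60, 4],
    [6, 1, 8, 10],
]

recalls = [row_normalise(cm)[k][k] for k in range(len(cm))]
ious = per_class_iou(cm)
```

Recall and IoU are computed from the same counts but can differ substantially for a minority class, because IoU also penalises false positives drawn from the larger classes.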

Figure 13. Sensitivity of APMFL to the focal factor $γ$.

Table 1. Definition of cotton growth classes during the flowering and boll-forming stage. Thresholds combine canopy structure, agronomic measurements and indicative NDVI ranges. The canopy coverage percentages reported here were empirically calibrated against on-site canopy coverage statistics collected at the Manas Hui’er Farm and the Changji Huaxing Farm, and they are intended as indicative anchors rather than universal classification standards.

Class	Criteria
Background	Bare soil, inter-row gaps or non-vegetated areas with no cotton canopy present; canopy coverage $< 5\%$; NDVI typically $< 0.30$.
Vigorous	Dense canopy with continuous rows and columns; bare soil between rows almost fully covered; canopy coverage $\geq 70\%$; plant height $\geq 85$ cm (typically 85–95); main stem nodes $\geq 18$ (typically 18–20); daily growth 1.2–2.0 cm/d; $\geq 9$ (typically 9–11) fruit branches per plant; first fruit branch at $\leq$ 7th node (typically 6–7); $\geq 8$ (typically 8–10) bolls per plant; NDVI $\geq 0.70$.
Moderate	Canopy coverage incomplete but rows clearly visible; some bare soil between rows; $40\% \leq$ canopy coverage $< 70\%$; plant height 70–85 cm; main stem nodes 15–18; daily growth 0.6–1.2 cm/d; 6–9 fruit branches; first fruit branch at the 7–8th node; 5–8 bolls per plant; NDVI 0.55–0.70.
Sparse	Sparse canopy with broken rows and large bare soil exposure; $5\% \leq$ canopy coverage $< 40\%$; plant height $\leq 70$ cm; main stem nodes $\leq 15$; daily growth $\leq 0.6$ cm/d; $\leq 6$ fruit branches; first fruit branch at the $\geq$ 8th node (typically 8–9); $\leq 5$ bolls per plant; NDVI $\leq 0.55$.
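The canopy coverage intervals in Table 1 are the only criterion stated as non-overlapping half-open ranges, so they can be written as a simple lookup. The sketch below is an illustration only: the function name is hypothetical, it uses canopy coverage alone, and it does not reproduce the annotation protocol, which combines coverage with agronomic measurements and NDVI.

```python
def growth_class_from_coverage(coverage_pct):
    """Map canopy coverage (percent) to the Table 1 class by coverage alone.

    Assumption: only the coverage thresholds are applied; plant height,
    node counts, boll counts and NDVI from Table 1 are ignored here.
    """
    if not 0.0 <= coverage_pct <= 100.0:
        raise ValueError("coverage must be a percentage in [0, 100]")
    if coverage_pct < 5.0:
        return "Background"
    if coverage_pct < 40.0:
        return "Sparse"
    if coverage_pct < 70.0:
        return "Moderate"
    return "Vigorous"


# Boundary values fall into the upper class, matching the <= / < signs in Table 1.
examples = {c: growth_class_from_coverage(c) for c in (4.9, 5.0, 39.9, 40.0, 70.0)}
```

As the caption notes, these percentages were calibrated at two farms and serve as indicative anchors, so a rule of this kind should not be read as a general classification standard.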

Table 2. Comparison with state-of-the-art methods on the cotton growth dataset. The best result in each column is marked in bold; the DCA-DeepLab row is shown in bold for emphasis. DCA-DeepLab attained the highest mIoU (51.74%) and the largest gains for the under-represented Vigorous and Sparse classes.

Model	Year	Acc (%)	mIoU (%)	Background	Vigorous	Moderate	Sparse
U-Net [50]	2015	68.28	47.20	69.54	28.01	61.57	29.66
CENet [51]	2019	67.26	46.43	69.61	29.64	58.29	28.17
DenseASPP [52]	2018	70.64	48.85	70.95	32.85	61.09	30.52
Segmenter [53]	2021	70.73	49.90	72.17	35.61	61.22	30.58
RS3Mamba [55]	2024	70.31	49.13	71.93	34.05	60.54	29.99
MCCANet [56]	2023	71.12	50.37	72.38	37.03	61.02	31.07
MSGCNet [57]	2024	71.25	50.49	72.52	37.18	61.05	31.20
DBBANet [58]	2025	70.87	49.95	72.36	36.17	60.93	30.34
MSEONet [59]	2025	71.49	50.64	72.43	37.90	61.12	31.11
TransUNet [54]	2024	70.98	50.12	71.93	36.84	60.76	30.95
DeepLabv3+ [31]	2018	70.36	49.84	72.96	36.47	59.48	30.45
DCA-DeepLab (Ours)	–	72.13	51.74	72.78	39.98	61.85	32.36
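The mIoU column in Table 2 appears consistent with the unweighted mean of the four per-class IoUs, which can be checked directly from the table rows. The short sketch below copies two rows from Table 2 and recomputes the mIoU and the per-class gains over DeepLabv3+; the variable and function names are illustrative.

```python
# Per-class IoU (%) copied from Table 2, ordered
# Background, Vigorous, Moderate, Sparse.
CLASSES = ["Background", "Vigorous", "Moderate", "Sparse"]
IOUS = {
    "DeepLabv3+": [72.96, 36.47, 59.48, 30.45],
    "DCA-DeepLab": [72.78, 39.98, 61.85, 32.36],
}


def mean_iou(per_class):
    """Unweighted mean over classes (every class counts equally)."""
    return sum(per_class) / len(per_class)


miou = {name: round(mean_iou(v), 2) for name, v in IOUS.items()}

# Gain of DCA-DeepLab over DeepLabv3+ in percentage points, per class.
gains = {
    cls: round(ours - base, 2)
    for cls, ours, base in zip(CLASSES, IOUS["DCA-DeepLab"], IOUS["DeepLabv3+"])
}
```

Because the mean is unweighted, the Vigorous and Sparse gains contribute to mIoU as much as any change in the majority classes; the Background IoU is in fact slightly lower for DCA-DeepLab than for DeepLabv3+.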

Table 3. Computational complexity and inference speed of representative segmentation networks at $512 \times 512$ input resolution. FPS was measured on a single NVIDIA RTX 4090 GPU with a batch size of 1, averaged over 1000 forward passes after 100 warm-up iterations under torch.no_grad() with torch.cuda.synchronize(). The best Acc and mIoU are marked in bold; the DCA-DeepLab row is shown in bold for emphasis. DCA-DeepLab achieved the highest accuracy at a parameter and FLOP budget comparable to DeepLabv3+ and substantially smaller than TransUNet.

Model	Acc (%)	mIoU (%)	Params (M)	FLOPs (G)	FPS
U-Net	68.28	47.20	31.2	94.7	147.9
CENet	67.26	46.43	29.6	89.3	187.3
DenseASPP	70.64	48.85	57.9	176.4	95.7
Segmenter	70.73	49.90	86.3	212.8	116.1
RS3Mamba	70.31	49.13	41.7	126.1	44.9
MCCANet	71.12	50.37	48.6	149.2	85.4
MSGCNet	71.25	50.49	50.8	154.6	83.6
DBBANet	70.87	49.95	47.3	146.8	123.8
MSEONet	71.49	50.64	52.9	159.4	64.3
TransUNet	70.98	50.12	104.7	281.6	62.5
DeepLabv3+	70.36	49.84	59.4	177.9	104.6
DCA-DeepLab (Ours)	72.13	51.74	61.7	184.6	68.6
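The FPS protocol in the Table 3 caption (warm-up passes, then timed passes with device synchronisation) can be sketched as a small framework-agnostic harness. This is an illustration under stated assumptions rather than the authors' benchmarking script: `measure_fps` and its parameters are hypothetical names, and the exact placement of the synchronisation calls is assumed, since the caption does not specify it.

```python
import time


def measure_fps(forward, warmup=100, iters=1000, sync=None):
    """Average throughput of `forward` in calls per second.

    forward: zero-argument callable running one batch-size-1 forward pass.
    sync:    optional zero-argument callable that blocks until queued device
             work has finished (assumed placement: before and after timing).
    """
    for _ in range(warmup):      # warm-up passes are excluded from timing
        forward()
    if sync is not None:
        sync()                   # make sure warm-up work has completed
    start = time.perf_counter()
    for _ in range(iters):
        forward()
    if sync is not None:
        sync()                   # wait for asynchronous kernels before stopping the clock
    elapsed = time.perf_counter() - start
    return iters / elapsed if elapsed > 0 else float("inf")


# Assumed PyTorch usage, shown as comments so the sketch runs without a GPU:
#   with torch.no_grad():
#       fps = measure_fps(lambda: model(x), sync=torch.cuda.synchronize)
```

Synchronising before reading the clock matters because CUDA kernels are launched asynchronously; without it, the timer would mostly measure kernel launch overhead rather than execution time.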

Table 4. Comparison with state-of-the-art methods on the LoveDA benchmark. The best mIoU is marked in bold. DCA-DeepLab obtained the best overall mIoU (51.71%) and particularly large gains on Barren relative to standard CNN baselines.

Model	Year	mIoU (%)	Background	Building	Road	Water	Barren	Forest	Agriculture
BANet [60]	2022	50.15	53.94	62.14	51.33	64.59	27.07	43.86	48.12
PSPNet [61]	2017	48.31	44.40	52.13	53.52	76.50	9.73	44.07	57.85
U-Net [50]	2015	47.84	43.06	52.74	52.78	73.08	10.33	43.05	59.87
Segmenter [53]	2021	47.11	37.99	50.68	48.72	77.41	13.32	43.47	58.21
RS3Mamba [55]	2024	46.90	39.72	58.75	57.92	61.00	37.24	39.67	33.98
TransUNet [54]	2024	48.87	43.05	56.12	53.71	78.04	9.35	44.92	56.91
MSEONet [59]	2025	50.66	45.22	55.22	53.72	78.45	15.65	46.50	59.84
DBBANet [58]	2025	49.95	45.21	55.40	53.04	77.81	15.17	45.26	57.75
MCCANet [56]	2023	48.09	40.80	52.92	52.98	77.09	16.81	41.32	54.72
DeepLabv3+ [31]	2018	47.62	42.97	50.88	52.02	74.36	10.40	44.21	58.53
DCA-DeepLab (Ours)	–	51.71	44.92	54.18	56.32	71.96	31.34	44.58	58.70

Table 5. Module-level ablation of DCA-DeepLab on the cotton growth dataset. ✔ and × denote that the module is enabled and disabled, respectively; the APMFL was enabled in all four configurations. The best result in each column is marked in bold; the full DCA-DeepLab configuration (Model 4) is shown in bold for emphasis. Under APMFL supervision, DCAG and MSAM-MFM each contributed positively in isolation, and their combined gain was close to the arithmetic sum of the individual contributions. This near-additivity was observed within the APMFL-trained setting used throughout this ablation.

Model	MSAM-MFM	DCAG	Acc (%)	mIoU (%)	Background	Vigorous	Moderate	Sparse
1	×	×	70.52	50.19	72.26	38.10	59.36	31.02
2	✔	×	71.78	50.81	71.47	38.48	61.57	31.71
3	×	✔	71.65	51.13	72.43	38.96	61.26	31.88
4	✔	✔	72.13	51.74	72.78	39.98	61.85	32.36
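The near-additivity noted in the Table 5 caption can be checked with simple arithmetic on the mIoU column. The sketch below copies the four mIoU values from Table 5 and computes each module's gain over Model 1 together with the residual interaction term; the variable names are illustrative.

```python
# mIoU (%) from Table 5, keyed by (MSAM-MFM enabled, DCAG enabled).
MIOU = {
    (False, False): 50.19,  # Model 1: APMFL only
    (True, False): 50.81,   # Model 2: + MSAM-MFM
    (False, True): 51.13,   # Model 3: + DCAG
    (True, True): 51.74,    # Model 4: full DCA-DeepLab
}

base = MIOU[(False, False)]
gain_msam = MIOU[(True, False)] - base   # MSAM-MFM alone
gain_dcag = MIOU[(False, True)] - base   # DCAG alone
gain_both = MIOU[(True, True)] - base    # both modules together

# Interaction term: how far the joint gain departs from the sum of the
# individual gains (zero would mean exactly additive).
interaction = gain_both - (gain_msam + gain_dcag)

summary = {
    "MSAM-MFM": round(gain_msam, 2),
    "DCAG": round(gain_dcag, 2),
    "both": round(gain_both, 2),
    "interaction": round(interaction, 2),
}
```

The individual gains of 0.62 and 0.94 percentage points sum to 1.56, against a combined gain of 1.55, so the interaction term is about -0.01 percentage points in this APMFL-trained setting.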

Table 6. Comparison of attention mechanisms when plugged into the DCA-DeepLab encoder on the cotton growth dataset. Our DCAG is marked in bold. Within this APMFL-trained setting, the mIoU and overall accuracy increased consistently with progressively more elaborate forms of directional encoding (SE/CBAM → CA/Axial/CCA → DCAG), a pattern in line with directional decoupling, rather than generic re-weighting, being the more effective encoder inductive bias for UAV cotton imagery.

Model	Acc (%)	mIoU (%)	Background	Vigorous	Moderate	Sparse
SE	70.89	50.44	72.34	38.28	59.89	31.24
CBAM	71.07	50.58	72.35	38.44	60.12	31.39
CA	71.18	50.71	72.38	38.52	60.43	31.51
Axial	71.38	50.89	72.41	38.70	60.78	31.65
CCA	71.42	50.95	72.37	38.76	60.94	31.72
DCAG (Ours)	71.65	51.13	72.43	38.96	61.26	31.88

Table 7. Comparison of loss functions for DCA-DeepLab on the cotton growth dataset. Best per-column results are marked in bold; the APMFL row is shown in bold for emphasis. The APMFL outperformed cross-entropy and focal, Tversky and Lovász softmax losses in terms of mIoU and in the Vigorous, Moderate and Sparse foreground classes while remaining competitive in the background class.

Loss	mIoU (%)	Background	Vigorous	Moderate	Sparse
CE	48.98	72.70	37.19	57.02	28.99
Focal	49.43	72.11	37.89	57.68	30.05
Tversky	50.84	72.96	37.97	60.98	31.45
Lovász	50.66	72.85	37.53	60.68	31.57
APMFL (Ours)	51.74	72.78	39.98	61.85	32.36
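To make the CE and Focal rows of Table 7 concrete, the sketch below implements the standard focal loss for a single pixel, where the focal factor $γ$ down-weights pixels that are already classified with high confidence. This is the generic formulation, not APMFL: the adaptive pixel-level modulation is defined elsewhere in the paper and is not reproduced here, and the class-balancing weight is omitted for brevity.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for one pixel, given the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Standard focal loss: (1 - p_t)^gamma * CE(p_t).

    gamma = 0 recovers plain cross-entropy; larger gamma shrinks the loss
    of easy pixels (p_t close to 1) much more than that of hard pixels.
    """
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


# Relative weight kept by an easy and a hard pixel at gamma = 2.
easy_ratio = focal_loss(0.9, gamma=2.0) / cross_entropy(0.9)   # (1 - 0.9)^2
hard_ratio = focal_loss(0.1, gamma=2.0) / cross_entropy(0.1)   # (1 - 0.1)^2
```

At $γ = 2$ the easy pixel keeps about 1% of its cross-entropy loss while the hard pixel keeps about 81%, which is the mechanism by which focal-type losses shift training towards hard pixels.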

Share and Cite

MDPI and ACS Style

Jia, L.; Gao, J.; Li, Z.; Shi, H.; Zhu, J. DCA-DeepLab: Dual-Coordinate Attention DeepLab with Adaptive Focal Loss for Cotton Growth Semantic Segmentation from UAV Remote Sensing Images. Drones 2026, 10, 456. https://doi.org/10.3390/drones10060456

3.4.4. Sensitivity to the Focal Factor $γ$