Article

Interpretable Cotton Mapping Across Phenological Stages: Receptive-Field Enhancement and Cross-Domain Stability

1 College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi 830017, China
2 Xinjiang Uygur Autonomous Region Farmland Quality Monitoring and Protection Center, Urumqi 830017, China
3 Xinjiang Institute of Technology, Aksu 843000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(7), 980; https://doi.org/10.3390/rs18070980
Submission received: 18 February 2026 / Revised: 13 March 2026 / Accepted: 21 March 2026 / Published: 25 March 2026

Highlights

What are the main findings?
  • The study systematically reveals stage-dependent differences in the contributions of multi-source remote sensing features for cotton mapping and characterizes phenology-driven feature response dynamics through feature-importance ranking.
  • Cross-year and cross-region experiments confirm stable model performance under varying environmental conditions, indicating strong adaptability to spatiotemporal variability.
What are the implications of the main findings?
  • The results provide quantitative evidence for understanding how phenological stages influence feature contributions in crop mapping, supporting optimized data-source selection and acquisition timing.
  • Cross-domain transfer evaluation offers methodological insights for improving the robustness and operational reliability of agricultural remote sensing mapping models.

Abstract

Accurate and timely cotton-field mapping is essential for irrigation management, water resource allocation, and regional yield assessment in arid irrigated agroecosystems. However, existing deep-learning-based crop mapping approaches generally lack interpretability and often exhibit performance variability across phenological stages, thereby limiting their reliability for operational deployment. To address these limitations, we developed an interpretable semantic segmentation framework for cotton mapping in the Wei-Ku Oasis, Xinjiang, China, under multi-source remote sensing conditions. The proposed model integrates Sentinel-2 surface reflectance, Sentinel-1 VV/VH backscatter, DEM, vegetation indices, and GLCM texture features. By incorporating a receptive-field enhancement mechanism together with an embedded feature-attribution module, the framework enables importance estimation of multi-source predictors within the network architecture, thereby providing intrinsic model interpretability. Under a unified training and evaluation protocol, the proposed model achieved an mIoU of 85.62% and an F1-score of 92.96% on the test set, outperforming U-Net, DeepLabV3+, and SegFormer baselines. Monthly classification results indicated that August provided the most discriminative acquisition window (mIoU = 85.54%, F1 = 92.83%), while June–July also maintained high recognition accuracy. Feature attribution results indicate that the importance of different predictors varies across phenological stages: Sentinel-2 red-edge bands remained highly influential throughout the growing season, NDVI/EVI exhibited increased contributions during June–August, SAR VH showed relatively higher importance during peak canopy development, and DEM maintained stable information contribution across all stages. Cross-year and cross-region experiments further demonstrated the model’s generalization capability, achieving an mIoU of 82.81% in same-region cross-year evaluation and 74.56% under cross-region transfer. Overall, the proposed segmentation framework improves classification accuracy while explicitly modeling and quantifying feature importance, providing a methodological reference for cotton-field mapping and acquisition timing selection in arid irrigated regions.

1. Introduction

Cotton is an important strategic cash crop, and the structure of its cultivation and the stability of its production are directly related to regional economic security and sustainable development [1,2,3,4]. Against the backdrop of intensified climate variability, expanding soil salinization, and ongoing adjustments in agricultural planting structure, cotton production is facing increasing uncertainty. The timely and accurate acquisition of spatial distribution information for cotton planting areas and key phenological-stage information is not only a crucial foundation for precision agricultural management, but also a key prerequisite for agricultural risk early warning and regional production decision making [5,6].
Remote sensing technology has been widely applied to cotton planting area identification because of its broad spatial coverage, high observational efficiency, and strong timeliness [7,8,9]. In particular, the integration of optical and radar data with auxiliary features such as texture and topography enables simultaneous characterization of pigment content [10,11], moisture conditions [12,13], and spatial structural attributes, thereby significantly enhancing the accuracy and robustness of cotton field identification. In recent years, deep learning models such as U-Net and DeepLab have achieved significant improvements in semantic segmentation accuracy [14,15,16,17,18].
Despite the high accuracy achieved by existing methods in cotton identification, two key challenges remain. First, the decision-making process of deep learning models is often insufficiently transparent, making it difficult to determine which key variables truly drive classification decisions across different phenological stages [19,20]. Second, the transferability of these models across time and regions remains limited, and their performance is prone to degradation due to differences in imaging conditions, planting structure, and background environments [21,22]. These challenges are particularly pronounced during critical phenological windows or under constrained observation conditions, where multi-source feature redundancy and background heterogeneity further compromise the stability of discriminative features. Meanwhile, most existing attention mechanisms and structural optimization strategies primarily emphasize feature representation enhancement, with limited attention to structurally embedded modeling of stage-specific driving features and systematic analysis of cross-domain failure mechanisms [23]. Moreover, most studies rely on complete time-series data for model construction [24,25], which constrains their practical applicability in data-limited or rapid-response scenarios. Therefore, an identification framework that jointly incorporates explanation guidance and spatial context modeling is needed to improve discriminative stability, interpretability, and cross-domain robustness under limited observation conditions.
To address the above issues, this study takes the Wei-Ku Oasis in Xinjiang as the study area and proposes a cotton identification framework that integrates an explanation-guided mechanism with a receptive-field enhancement strategy. In this framework, key feature channels are dynamically reweighted at the input stage. Unlike conventional post hoc visualization-based interpretation methods, this mechanism directly participates in feature construction during the forward propagation process, thereby enabling internal guidance and constraint of multi-source variable contributions within the model. Meanwhile, wavelet-based frequency-domain modeling is introduced during the encoding stage to enhance spatial context representation and reduce the influence of redundant features and stage-sensitive variables on feature representation stability. The main contributions of this study are as follows:
(1) A cotton identification model integrating explanation guidance and receptive-field enhancement is developed to enable dynamic reweighting of multi-source features and effective spatial context modeling.
(2) The contribution characteristics of key variables in cotton identification across different phenological stages are revealed, and the optimal temporal window is identified.
(3) Through cross-year and cross-region transfer experiments, the stability of the model under inter-domain differences is evaluated, and, combined with ablation experiments, the mechanisms by which each structural module affects cross-domain performance are analyzed.

2. Materials and Methods

2.1. Study Area

The Wei-Ku Oasis, the focal region of this study, lies along the northern margin of the Tarim Basin in southern Xinjiang, China (41°01′N–41°43′N, 82°09′E–83°25′E). It encompasses three counties in Aksu Prefecture: Kuqa, Xinhe, and Shaya. The area is characterized by a temperate continental climate, transitioning from arid to semi-arid zones, with a mean annual temperature of roughly 11.5 °C, around 51.66 mm of annual precipitation, and nearly 13 h of daily sunshine on average. The large diurnal temperature range creates favorable conditions for cotton cultivation. As one of China's major cotton-producing regions, the Wei-Ku Oasis provides a representative environment for studying cotton mapping using remote sensing and deep learning techniques (Figure 1) [26]. In this region, cotton is typically sown in April and harvested from late September to early October, resulting in a growing season of approximately 183 days. This period encompasses six key phenological stages: sowing, seedling emergence, squaring, flowering, boll opening, and harvesting (Figure 2). Manasi, the region used for cross-region transfer evaluation, is located on the northern slopes of the Tianshan Mountains (43°30′N–45°36′N, 85°42′E–86°42′E). It has a temperate continental climate, characterized by abundant sunshine, low precipitation, and fertile soils, making it a high-yield area for high-quality cotton in Xinjiang.
As shown in Figure 2, the study period spans several key phenological stages of the cotton growing season. The experimental design was based on monthly imagery, with monthly composites from April to October used to characterize temporal variations throughout the growing season. Because a given month may encompass one or more adjacent phenological stages, the analysis was not further stratified by discrete developmental stages; instead, month was adopted as the basic unit for experimental analysis and result comparison. Accordingly, Figure 2 is intended primarily to provide the phenological calendar context of the study period rather than to serve as an independent basis for experimental grouping. This figure also facilitates interpretation of the temporal characteristics captured by imagery from different months and their potential influence on classification performance.

2.2. Data Sources

This study utilized the Google Earth Engine platform [27] to integrate Sentinel-1 C-band SAR, Sentinel-2 Level-2A surface reflectance, and SRTM DEM data from 2020–2021. All datasets were harmonized to a spatial resolution of 10 m to construct a multi-source time-series dataset.
The Sentinel-1 data underwent preprocessing steps including orbit parameter correction, thermal noise removal, radiometric calibration, and terrain correction. To handle the inherent speckle noise in SAR imagery, no single-scene despeckling filter was applied; instead, given the high temporal sampling density, a multi-temporal mean compositing strategy was adopted to suppress noise along the temporal dimension [28,29].
For Sentinel-2 data, images with cloud cover below 10% were first selected. Cloud and cloud-shadow masks were then generated by integrating the QA60 band, the scene classification layer, and s2cloudless cloud probability (>60%), followed by morphological dilation (150–300 m) to reduce edge effects. To ensure temporal continuity and comparability across different study areas while reducing observational noise, a fixed monthly compositing scheme was adopted. Specifically, a 28-day window centered on the 15th day of each month (±14 days) was defined, and all valid pixels within the window were averaged to generate a representative monthly image. Previous studies have shown that stage-scale or seasonal-scale compositing can reduce random observational noise and improve temporal consistency [30]. In agricultural remote sensing, crop phenological characteristics are generally stable at the monthly scale; therefore, using an approximately one-month window helps preserve intra-stage spectral consistency while reducing excessive mixing of features from different phenological stages [31]. Ultimately, seven monthly composite images per year were obtained from April to October, forming a continuous optical time-series dataset. The number of valid images within each monthly window for all study areas is summarized in Appendix A Table A1, Table A2 and Table A3, confirming data availability.
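For concreteness, the monthly compositing scheme can be expressed as a short Google Earth Engine script. The sketch below is a minimal illustration, not the authors' actual code: the collection ID, the AOI rectangle, and the QA60-only cloud mask are simplifying assumptions (the full pipeline described above additionally uses the scene classification layer, s2cloudless probabilities, and morphological dilation).

```python
import ee

ee.Initialize()

# Approximate Wei-Ku Oasis extent (illustrative only).
aoi = ee.Geometry.Rectangle([82.15, 41.02, 83.42, 41.72])

def mask_clouds(img):
    # QA60 bits 10 and 11 flag opaque clouds and cirrus, respectively.
    qa = img.select('QA60')
    clear = qa.bitwiseAnd(1 << 10).eq(0).And(qa.bitwiseAnd(1 << 11).eq(0))
    return img.updateMask(clear)

def monthly_composite(year, month):
    """Mean composite over a 28-day window centered on the 15th of the month."""
    center = ee.Date.fromYMD(year, month, 15)
    col = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
           .filterBounds(aoi)
           .filterDate(center.advance(-14, 'day'), center.advance(14, 'day'))
           .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
           .map(mask_clouds))
    return col.mean().clip(aoi)

# Seven composites per year, April through October.
composites = [monthly_composite(2021, m) for m in range(4, 11)]
```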
The original spatial resolution of the SRTM DEM is 30 m, and it was resampled to 10 m in this study to ensure spatial consistency. Although the resampling procedure does not introduce additional true topographic information [32], the DEM and its derivatives (slope and aspect) were incorporated solely as large-scale terrain constraint factors to provide macro-topographic context for the model, rather than to characterize fine-scale textures, thereby avoiding potential interference from “pseudo-high resolution.”
In addition, gray-level co-occurrence matrix (GLCM) texture features were calculated based on the Sentinel-2 near-infrared band (B8). To balance the preservation of spectral detail with computational complexity control, the band values were linearly quantized into 256 gray levels. Texture measures were then calculated using a 3 × 3 moving window with four directions (0°, 45°, 90°, and 135°) and a pixel distance of 1 to obtain stable and comparable texture representations. Ultimately, six texture features—SAVG, contrast, entropy, angular second moment, correlation, and variance—were extracted to quantify the local spatial heterogeneity of land-cover objects.
The final model input comprised 27 channels integrating spectral, SAR, topographic, and texture information. The overall workflow of multi-source data preprocessing, feature extraction, and model input construction is shown in Figure 3, while the formulas for the related indices and texture features are provided in Table 1.
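The auxiliary channels can be assembled in the same environment. The following sketch illustrates the DEM resampling, GLCM texture computation, and channel stacking described above; the UTM projection, the `size=1` neighborhood (interpreted here as a 3 × 3 window), and the reflectance scaling are assumptions made for illustration, and only NDVI is shown among the vegetation indices.

```python
# SRTM DEM resampled from 30 m to 10 m, plus slope and aspect derivatives.
dem10 = (ee.Image('USGS/SRTMGL1_003')
         .resample('bilinear')
         .reproject(crs='EPSG:32644', scale=10))  # UTM 44N covers the study area
terrain = dem10.addBands(ee.Terrain.slope(dem10)).addBands(ee.Terrain.aspect(dem10))

def build_input_stack(optical, sar):
    # Quantize the NIR band (B8) to 256 gray levels before texture extraction.
    nir_8bit = optical.select('B8').unitScale(0, 10000).multiply(255).toInt()
    glcm = nir_8bit.glcmTexture(size=1)  # averaged over the four directions
    textures = glcm.select(['B8_savg', 'B8_contrast', 'B8_ent',
                            'B8_asm', 'B8_corr', 'B8_var'])
    ndvi = optical.normalizedDifference(['B8', 'B4']).rename('NDVI')
    # Stack spectral, SAR, topographic, texture, and index channels.
    return ee.Image.cat([optical, sar, terrain, textures, ndvi])
```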

2.3. Sample Data

The cotton samples used for model training and evaluation were collected from two representative planting regions in Xinjiang, China: the Wei-Ku Oasis and Manasi. The reference cotton labels for both regions were obtained from the 10 m Xinjiang Cotton Dataset (2020–2021) [39]. To enhance spatial accuracy, particularly for field boundaries, these labels were further refined through manual visual interpretation and boundary correction, supported by field survey records and high-resolution Google Earth imagery. The corrected labels were then cropped into 256 × 256-pixel image patches, with cotton and non-cotton pixels assigned IDs of 2 and 1, respectively. This yielded a total of 8626 and 1855 sample patches for the Wei-Ku Oasis and Manasi regions, respectively.
To enhance the model's generalization capability and robustness, data augmentation was applied to the training samples, including horizontal or vertical flipping with a probability of 0.5 [40], random rotation within the range of [−30°, 30°] [41], and random Gaussian blurring (kernel size ranging from 3 × 3 to 7 × 7, with a standard deviation σ in the range of [0.1, 2.0]) [42]. During dataset partitioning, to rigorously prevent "spatial leakage" caused by spatial autocorrelation, a geographically block-based splitting strategy was adopted [43]. First, spatially contiguous cotton fields within administrative boundaries were treated as basic units, and all samples were grouped into several independent geographic blocks. Subsequently, at the block level (rather than the individual sample level), these blocks were randomly assigned to the training and testing sets in an approximate ratio of 7:3. This strategy ensured that all samples from the same geographic block and its spatial neighborhood appeared exclusively in either the training or the testing set, thereby enabling a truly independent assessment of spatial generalization capability.
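The augmentation settings translate directly into a per-sample transform. The sketch below is a minimal PyTorch implementation under stated assumptions: the tensor layout, the nearest-neighbor label rotation, and the discrete kernel-size choice are ours, not the authors' code.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, label):
    """image: (C, H, W) float tensor; label: (H, W) integer tensor."""
    if random.random() < 0.5:                       # horizontal flip, p = 0.5
        image, label = TF.hflip(image), TF.hflip(label.unsqueeze(0)).squeeze(0)
    if random.random() < 0.5:                       # vertical flip, p = 0.5
        image, label = TF.vflip(image), TF.vflip(label.unsqueeze(0)).squeeze(0)
    angle = random.uniform(-30.0, 30.0)             # rotation in [-30, 30] degrees
    image = TF.rotate(image, angle)
    # Rotate labels with the default nearest-neighbor interpolation.
    label = TF.rotate(label.unsqueeze(0).float(), angle).squeeze(0).long()
    k = random.choice([3, 5, 7])                    # kernel from 3x3 to 7x7
    image = TF.gaussian_blur(image, kernel_size=k,
                             sigma=random.uniform(0.1, 2.0))
    return image, label
```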

2.4. Model Training Settings

For a comprehensive comparison, several representative semantic segmentation models were included in this study, covering both convolutional neural network (CNN)-based and Transformer-based architectures. Among them, U-Net is a classical encoder–decoder model widely used in remote sensing segmentation; DeepLabV3+ enhances multiscale context modeling and boundary recovery through atrous spatial pyramid pooling and an encoder–decoder design; VM-Net was included as a segmentation model with enhanced feature representation capability; and SegFormer and Swin Transformer represent Transformer-based models with strong global context modeling ability. These models were used as benchmark methods and were trained and evaluated under the same input features, data partitioning, and evaluation protocol to ensure fair comparison.
All models were implemented using PyTorch 2.0.0, Python 3.8, and CUDA 11.8, and were trained and evaluated on a workstation running Linux (Ubuntu 20.04) with an NVIDIA RTX 3090 GPU (24 GB memory). Limited yet reasonable adjustments were made to the training strategies according to model architecture. Specifically, U-Net, DeepLabV3+, and the proposed model used Adam, whereas VM-Net, SegFormer, and Swin Transformer used AdamW. The main training settings are summarized in Table 2. The initial learning rate was primarily selected from {1 × 10⁻⁴, 6 × 10⁻⁵, 3 × 10⁻⁵}, with 1 × 10⁻⁵ additionally tested for models with unstable convergence. All models were trained with a batch size of 4 for 200 epochs, and a learning rate decay strategy was applied during training.
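These settings correspond to a straightforward optimizer factory, sketched below. The step-decay schedule is an assumption, since the text notes only that a learning rate decay strategy was applied.

```python
import torch

def build_optimizer(model, transformer_style=False, lr=1e-4):
    # Adam for U-Net, DeepLabV3+, and the proposed model;
    # AdamW for VM-Net, SegFormer, and Swin Transformer.
    cls = torch.optim.AdamW if transformer_style else torch.optim.Adam
    optimizer = cls(model.parameters(), lr=lr)
    # Decay schedule assumed for illustration (the paper does not specify one).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
    return optimizer, scheduler

# Shared protocol: batch size 4, 200 epochs, lr from {1e-4, 6e-5, 3e-5},
# with 1e-5 additionally tested for models with unstable convergence.
```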

3. Methodology

3.1. Network Architecture

U-Net, as a classical U-shaped fully convolutional neural network, consists of a symmetric encoder–decoder architecture, where the encoder extracts semantic information through multiple layers of convolution and downsampling, while the decoder progressively restores fine spatial details through a series of upsampling operations. U-Net employs a skip-connection mechanism, which fuses low-level spatial features extracted by the encoder with high-level semantic information in the decoder, thereby preserving global context while maintaining fine-grained local details, and enabling high-accuracy pixel-level classification.
This study aims to explore the key influencing factors of cotton identification across different phenological stages; however, the standard U-Net model lacks the ability to interpret the relative importance of its input channels, making it difficult to explicitly determine which feature variables contribute most to the classification task. In addition, the limited receptive field of U-Net hinders its ability to effectively model complex spatial structures inherent in remote sensing imagery, thereby constraining its performance in scenarios involving multi-source and multi-scale data fusion. To improve the model’s suitability for cotton field recognition while addressing the “black box” nature of deep learning, the following improvements were made to the U-Net architecture in this study:
First, an interpretability module was embedded at the input stage of the model, which performs both channel-wise and spatial attention modeling on each input channel, to generate weighted responses that highlight the most informative channels. Second, the weighted feature maps are fed into the backbone network (ConvNeXt) for continued multi-scale feature extraction, ensuring that the encoder focuses on the most discriminative input dimensions from the outset, thereby enhancing both the overall segmentation performance and the interpretability of the model. Finally, a WTConv-based wavelet convolution module was introduced between the interpretability module and the backbone network, to decompose and integrate multi-scale contextual information, thereby expanding the model’s receptive field. The proposed model retains the efficient feature fusion capabilities of the original U-Net architecture, while enhancing recognition accuracy and model transparency through the integration of channel-guided attention and wavelet convolution. The improved network architecture is illustrated in Figure 4.

3.2. Backbone Network

To evaluate the performance and adaptability of the improved model under different backbone architectures, ResNet18 and ConvNeXt were selected as backbone networks for comparative experiments. As a lightweight and parameter-efficient representative of residual networks, ResNet18 employs residual connections that effectively mitigate gradient vanishing in deep network training, offering high computational efficiency and strong feature extraction capabilities, making it suitable for crop classification tasks in resource-constrained environments. In contrast, ConvNeXt integrates the residual learning concept of ResNet with modern CNN design principles, such as large convolutional kernels and layer normalization, demonstrating superior capabilities in semantic representation and feature modeling. By comparing the performance of these two backbone networks within the improved model framework, this study aims to comprehensively evaluate performance variations under different levels of feature extraction depth and representational capacity, and to further investigate how backbone design influences classification accuracy and cotton mapping tasks, thereby providing a reference for backbone network selection in future applications.

3.3. Channel Importance Interpreter

To enhance the model's sensitivity to critical features, an embedded channel-wise interpretability module is introduced, enabling dynamic learning and adjustment of channel importance during training and feature selection, thereby achieving channel selection and redundancy suppression. As illustrated in Figure 5, the module employs a Squeeze-and-Excitation (SE) mechanism to obtain channel attention weights: global average pooling is applied to each channel, followed by a multilayer perceptron (MLP) composed of two 1 × 1 convolutional layers with ReLU and Sigmoid activations, producing a channel attention map $W_c \in \mathbb{R}^{B \times C \times 1 \times 1}$. Meanwhile, the spatial attention branch applies channel-wise average pooling and max pooling to the input feature map $X$, concatenates the results, and feeds them into a 7 × 7 convolutional layer to generate a spatial attention map $W_s \in \mathbb{R}^{B \times 1 \times H \times W}$. The outputs of the two branches are fused via element-wise broadcast multiplication, yielding a joint attention map $A = W_c \odot W_s \in \mathbb{R}^{B \times C \times H \times W}$, and global average pooling is applied to $A$ to obtain the channel response vector $s \in \mathbb{R}^{B \times C}$. To prevent excessive amplification in the softmax distribution caused by scale differences in channel responses, min–max normalization is applied prior to softmax to mitigate the dominance of extreme values, yielding a channel importance distribution $p \in (0, 1)^C$ satisfying $\sum_{c=1}^{C} p_c = 1$. A sparse gating mechanism (Equation (1)) is then introduced, followed by the fusion of soft weights and hard gating (Equation (2)); with a fusion coefficient $\alpha = 0.5$, discrete suppression is incorporated while preserving the continuous ranking of channel importance, thereby balancing flexibility and interpretability. The final channel-weighted output is given by Equation (3).
$G_c = \mathbb{1}[p_c > \theta]$ (1)
$W_{\mathrm{gated}} = (1 - \alpha) \cdot p + \alpha \cdot G$ (2)
$\tilde{X} = W_{\mathrm{gated}} \odot X$ (3)
Here, $\odot$ denotes element-wise multiplication with channel-wise broadcasting; $\mathbb{1}[\cdot]$ represents the indicator function, and $\theta$ is the threshold; $p$ refers to the channel weights after softmax. Since the hard gating function is non-differentiable, the continuous soft branch $p$ enables gradient backpropagation during training, while the hard gate directly participates in feature reweighting in the forward pass. Given that $\sum_{c=1}^{C} p_c = 1$, the uniform prior baseline $1/C$ (with $C = 27$ in this study, $1/C \approx 0.037$) is used as a reference; a grid search over candidate thresholds {0.03, 0.05, 0.07} is conducted on the validation set, and the optimal value $\theta = 0.05$ is selected based on mIoU and F1-score. The resulting channel importance distribution $p$ is derived from the model's forward propagation and directly participates in feature reweighting, thereby providing transparent channel importance for further interpretability analysis.
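The complete forward pass of the interpreter can be summarized in a compact PyTorch module. This is a minimal sketch consistent with Equations (1)–(3) and the description above; the MLP reduction ratio and the epsilon in the min–max normalization are illustrative choices not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelInterpreter(nn.Module):
    def __init__(self, channels=27, reduction=4, theta=0.05, alpha=0.5):
        super().__init__()
        self.theta, self.alpha = theta, alpha
        hidden = max(channels // reduction, 4)
        self.mlp = nn.Sequential(                 # SE branch: two 1x1 convs
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid())
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        w_c = self.mlp(F.adaptive_avg_pool2d(x, 1))          # (B, C, 1, 1)
        s_in = torch.cat([x.mean(1, keepdim=True),
                          x.amax(1, keepdim=True)], dim=1)   # avg & max pooling
        w_s = torch.sigmoid(self.spatial(s_in))              # (B, 1, H, W)
        a = w_c * w_s                                        # joint attention map A
        s = a.mean(dim=(2, 3))                               # channel response (B, C)
        # Min-max normalization before softmax to damp extreme values.
        s = (s - s.min(1, keepdim=True).values) / \
            (s.max(1, keepdim=True).values - s.min(1, keepdim=True).values + 1e-6)
        p = F.softmax(s, dim=1)                              # importances, sum to 1
        g = (p > self.theta).float()                         # hard gate, Eq. (1)
        w = (1 - self.alpha) * p + self.alpha * g            # fused weights, Eq. (2)
        return x * w.unsqueeze(-1).unsqueeze(-1), p          # Eq. (3) + importances
```

Note that only the soft term $(1 - \alpha)p$ carries gradients, matching the description above: the hard gate shapes the forward pass while the soft branch enables backpropagation.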

3.4. Wavelet Convolution

The wavelet convolution (WTConv) module introduces a multi-scale frequency-domain modeling mechanism, significantly enlarging the receptive field and enhancing low-frequency representation capacity while maintaining a relatively stable parameter count, thereby improving the model’s ability to extract heterogeneous terrain features of cotton fields under complex textured backgrounds [44].
The WTConv module first applies a single-level two-dimensional discrete wavelet transform (DWT) using the Haar wavelet basis to the input feature map, achieving a balanced trade-off between frequency modeling accuracy and computational complexity [45]. The transformation produces one low-frequency sub-band (LL) and three high-frequency sub-bands (LH, HL, HH) representing different directional details, enabling a structured representation of multi-scale frequency responses and facilitating the extraction of more discriminative texture and edge features [46].
During sub-band processing, the low-frequency component (LL) is passed through a standard convolutional layer to preserve structural and semantic information, whereas the three high-frequency components (LH, HL, HH) are processed using structurally identical lightweight 3 × 3 convolutions (stride = 1, padding = 1), each followed by Batch Normalization and ReLU activation to ensure statistical and scale consistency across directional high-frequency features. Prior to feature recombination, all high-frequency sub-bands are adjusted through cropping or padding to match the spatial resolution of the low-frequency sub-band.
Subsequently, the sub-band features are fused and reconstructed into the original feature space via the inverse wavelet transform (IWT). Owing to the inherent invertibility and shift invariance of DWT/IWT, sub-bands can be precisely aligned during reconstruction without interpolation or upsampling, thereby preventing feature misalignment caused by multi-scale transformations [47].
The reconstructed features are further processed by a 1 × 1 convolution for channel compression and information reorganization, enhancing cross-frequency feature fusion. The module independently applies the above wavelet transform and convolution operations to each input channel, ensuring channel consistency while enabling efficient modeling and fusion of multi-scale frequency-domain information. Therefore, embedding the WTConv module into the backbone encoder significantly enhances the model’s perception of complex terrain and texture structures in cotton fields, improves low-frequency semantic information propagation, and ultimately increases overall classification accuracy and robustness.
$[X_{LL}, X_{LH}, X_{HL}, X_{HH}] = \mathrm{Conv}([f_{LL}, f_{LH}, f_{HL}, f_{HH}], X)$ (4)
$X = \mathrm{Conv\text{-}transposed}([f_{LL}, f_{LH}, f_{HL}, f_{HH}], [X_{LL}, X_{LH}, X_{HL}, X_{HH}])$ (5)
$[X_{LL}^{(i)}, X_{LH}^{(i)}, X_{HL}^{(i)}, X_{HH}^{(i)}] = \mathrm{WT}(X_{LL}^{(i-1)})$ (6)
Here, $\mathrm{Conv}$ denotes the convolution operation; $f_{LL}$ is the low-pass filter, while $f_{LH}$, $f_{HL}$, and $f_{HH}$ represent a set of high-pass filters. $X$ denotes the input image, where $X_{LL}$ is its low-frequency component, and $X_{LH}$, $X_{HL}$, $X_{HH}$ correspond to the horizontal, vertical, and diagonal high-frequency components, respectively. These four filters constitute an orthogonal basis, and the inverse wavelet transform is applied accordingly, as shown in Equation (5). Subsequently, the low-frequency component is recursively decomposed to perform multilevel wavelet decomposition (Equation (6)), with each level applying the single-level transform of Equation (4). Here, $X_{LL}^{(0)} = X$, and $i$ denotes the current decomposition level.
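A single-level version of this module can be sketched in PyTorch as follows. With orthonormal Haar filters, the transposed convolution in Equation (5) is an exact inverse of the stride-2 analysis convolution in Equation (4) when the sub-bands are unmodified; the depthwise high-frequency convolutions and the final 1 × 1 fusion layer below are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters():
    # Orthonormal 2x2 Haar analysis filters: LL, LH, HL, HH.
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    return torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)

class WTConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.register_buffer('f', haar_filters())
        self.low = nn.Conv2d(channels, channels, 3, padding=1)   # LL path
        self.high = nn.Sequential(                               # LH/HL/HH path
            nn.Conv2d(3 * channels, 3 * channels, 3, padding=1,
                      groups=3 * channels),
            nn.BatchNorm2d(3 * channels), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(channels, channels, 1)             # channel reorg

    def forward(self, x):                      # x: (B, C, H, W); H, W even
        b, c, h, w = x.shape
        f = self.f.repeat(c, 1, 1, 1)          # depthwise filters per channel
        sub = F.conv2d(x, f, stride=2, groups=c)        # Eq. (4): (B, 4C, H/2, W/2)
        sub = sub.view(b, c, 4, h // 2, w // 2)
        ll = self.low(sub[:, :, 0])                     # low-frequency sub-band
        hi = self.high(sub[:, :, 1:].reshape(b, 3 * c, h // 2, w // 2))
        sub = torch.cat([ll.unsqueeze(2),
                         hi.view(b, c, 3, h // 2, w // 2)], dim=2)
        rec = F.conv_transpose2d(sub.view(b, 4 * c, h // 2, w // 2),
                                 f, stride=2, groups=c)  # Eq. (5): IWT
        return self.fuse(rec)
```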

3.5. Loss Function

To mitigate the class imbalance commonly encountered in remote sensing image classification, Focal Loss was employed as the objective function. By down-weighting well-classified samples and emphasizing hard examples, Focal Loss adaptively adjusts the contribution of each sample, thereby enhancing model robustness and generalization. The mathematical formulation of Focal Loss is given as follows:
$L_{\mathrm{focal}} = -\varphi_t (1 - P_t)^{\gamma} \log(P_t)$ (7)
where $P_t$ denotes the predicted likelihood corresponding to the actual class label. The term $\varphi_t$ serves as a balancing coefficient between positive and negative instances, while $\gamma$ controls the degree of focus applied to easily classified samples by adjusting the modulation strength.
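For the binary cotton/non-cotton formulation used here, Focal Loss can be implemented in a few lines. The values $\varphi_t = 0.25$ and $\gamma = 2.0$ below are common defaults assumed for illustration; the paper does not report the exact settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, phi=0.25, gamma=2.0):
    """logits: (B, 2, H, W) raw class scores;
    target: (B, H, W) long tensor with values in {0, 1}."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log P_t per pixel
    pt = log_pt.exp()                                          # P_t
    t = target.float()
    phi_t = phi * t + (1.0 - phi) * (1.0 - t)                  # class-balance term
    return (-phi_t * (1.0 - pt) ** gamma * log_pt).mean()      # Equation (7)
```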

3.6. Evaluation Metrics

To comprehensively evaluate the classification performance of the proposed method, several accuracy metrics derived from the confusion matrix were employed, including mean Intersection over Union (mIoU), Precision, Recall, and F1-score. The definitions of these metrics are provided as follows:
$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$ (8)
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (9)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (10)
$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (11)
$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c$ (12)
Here, TP (True Positive) denotes the number of pixels correctly predicted as positive, FP (False Positive) represents the number of pixels incorrectly predicted as positive while actually negative, and FN (False Negative) refers to the number of pixels incorrectly predicted as negative while actually being positive.
For the mIoU metric, two calculation schemes with different aggregation granularities were adopted according to the objectives of different experimental tasks, with the only distinction lying in the level of aggregation (Equation (12)). Here, $C$ denotes the total number of classes ($C = 2$ in this study). The per-dataset mIoU is computed by deriving $\mathrm{IoU}_c$ from the accumulated confusion matrix of all test samples, whereas the per-image mIoU is obtained by first calculating $\mathrm{IoU}_c$ for each image individually and then averaging the results.
Per-dataset mIoU: The confusion matrices of all test samples are first globally aggregated, and the IoU for each class is computed before averaging across classes. This metric follows common evaluation practices in remote sensing image segmentation and offers strong statistical stability and ease of cross-model comparison [48,49]. In model comparison and ablation experiments, since all models are evaluated on the identical test set, the area weighting inherent in per-dataset mIoU is consistent across models, thereby not affecting their relative ranking nor favoring any specific method. Therefore, this metric is adopted in model comparison and ablation studies to ensure comparability with existing literature and fair evaluation under controlled experimental conditions.
Per-image mIoU: The IoU for each class is first calculated independently for each image, and then the IoU values are averaged across all images. This metric assigns equal weight to each test image, effectively mitigating evaluation bias caused by unequal cotton area distribution among images [50]. In this study, it is employed for transfer experiments and month-by-month temporal analysis for the following reasons:
Monthly time-series analysis: Due to the extended phenological cycle of cotton, the proportion of cotton pixels varies substantially across months (e.g., canopy cover is sparse in April, with cotton typically accounting for less than 5% of pixels, whereas in August during peak flowering, cotton fields form contiguous patches, comprising 30–40% of pixels). If per-dataset mIoU is adopted, model performance in months with low cotton proportions would be inherently diluted, whereas performance in high-proportion months would be disproportionately amplified. In contrast, per-image mIoU employs equal weighting, enabling fair comparison of model performance across different months on a consistent scale.
Transfer experiments: The transfer experiments conducted in this study, including temporal and spatial transfer, aim to evaluate the model’s generalization capability in unseen target domains. Unlike model comparison experiments that focus on performance ranking over a fixed test set, transfer experiments primarily assess the model’s stability and effectiveness on each individual image within the target domain. Because per-dataset mIoU assigns greater weight to images with large contiguous cotton areas, strong performance on a few extensive fields may obscure generalization deficiencies in fragmented scenes, thereby overestimating the model’s true cross-domain adaptability. By assigning equal weight to each image, per-image mIoU more directly reflects the model’s scene-level generalization performance, aligning closely with the objectives of transfer experiments.
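The distinction between the two schemes reduces to where the averaging happens, as the following NumPy sketch illustrates. This is a minimal illustration; the epsilon guard for classes absent from an image is our simplification.

```python
import numpy as np

def confusion(pred, label, num_classes=2):
    # Rows index the reference label, columns the prediction.
    idx = num_classes * label.ravel() + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def iou_per_class(cm):
    tp = np.diag(cm).astype(float)
    return tp / (cm.sum(0) + cm.sum(1) - tp + 1e-9)   # TP / (TP + FP + FN)

def per_dataset_miou(preds, labels):
    # Accumulate one global confusion matrix, then average class IoUs.
    cm = sum(confusion(p, l) for p, l in zip(preds, labels))
    return iou_per_class(cm).mean()

def per_image_miou(preds, labels):
    # Score each image independently, then weight all images equally.
    scores = [iou_per_class(confusion(p, l)).mean()
              for p, l in zip(preds, labels)]
    return float(np.mean(scores))
```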

4. Results

In the experimental section, this study investigated the cotton recognition performance of the proposed approach on a monthly basis throughout the growing season, along with an analysis of channel-wise importance weights. Subsequently, performance evaluation, ablation studies, and comparative experiments were conducted to validate the effectiveness of the proposed model. As mIoU and F1-score are comprehensive indicators of classification performance, they are emphasized in the following analysis as the primary evaluation metrics.

4.1. Performance and Ablation Study of the Interpretable U-Net

To evaluate the impact of each proposed module on model performance and complexity, ablation experiments were conducted with different combinations of the modules. The results are presented in Table 3. Introducing the interpretability module increased mIoU by 1.37% and the F1-score by 1.02%, indicating that the module effectively enhanced the model's ability to perceive and emphasize key feature channels. Removing the wavelet convolution (WTConv) module reduced the parameter count by only 0.13 M but led to a 0.50% drop in mIoU and a 1.52% decrease in F1-score, with GFLOPs unchanged. These results indicate that the WTConv module plays a critical role in expanding the receptive field and enhancing multi-scale contextual representation. Replacing the ConvNeXt backbone with ResNet18 significantly reduced the number of parameters to 21.59 M, but GFLOPs unexpectedly increased by 0.31 G, and mIoU and F1-score dropped by 0.61% and 0.82%, respectively. These results indicate that although simplifying the backbone reduces model size, it also compromises the model's feature extraction capability.
Therefore, considering both model performance and complexity, it is recommended to adopt a ConvNeXt-based backbone integrated with the interpretability module and WTConv for cotton field mapping tasks, as this configuration significantly improves segmentation performance with only a modest increase in model complexity.

4.2. Comparison with Other Models

To comprehensively evaluate the performance of different model architectures in cotton field recognition tasks (Figure 6), we selected several representative networks for comparative experiments: the classical CNN-based models U-Net [16] and DeepLabV3+ [17], the Transformer-based models SegFormer [51] and Swin Transformer [52], and the Mamba-based architecture VM-Net [53]. As shown in Table 4, U-Net demonstrates strong suitability for cotton field classification in the Wei-Ku Oasis, primarily due to the relatively regular texture and geometric patterns of cotton fields under high-resolution satellite imagery in this region. The skip-connection mechanism in U-Net enhances the extraction of low-level texture features, and compared with models without skip connections or with fewer connection stages, U-Net shows superior capability in segmenting regularly textured cotton parcels.
Traditional convolutional networks in small-sample remote sensing scenarios often suffer from feature redundancy and low sensitivity to critical information, leading to blurred object boundaries and suboptimal fine-grained feature extraction. To address these limitations, the proposed interpretable U-Net introduces both a channel-wise interpretability module and a WTConv-based wavelet convolution module, effectively combining channel-wise feature selection with multi-scale frequency-domain modeling. The improved model achieved the best performance in identifying cotton planting areas in the Wei-Ku Oasis, with a mean IoU of 85.62% and an F1-score of 92.96%, outperforming the other models, including the Transformer-based architectures and the Visual Mamba variant. Among the baselines, the Swin Transformer, which uses a window-based self-attention (W-MSA) mechanism, achieved the highest accuracy, with an F1-score of 91.59% (1.37% lower than the proposed model) and a Precision of 90.66% (1.39% lower). DeepLabV3+ employs the Atrous Spatial Pyramid Pooling (ASPP) module to extract multi-scale features and incorporates an encoder–decoder structure to enhance fine-grained segmentation along object boundaries; however, its F1-score is 1.71% lower than that of the proposed model. This gap may be attributed to its limited ability to model the relative importance of input feature channels, resulting in insufficient attention to semantically critical channels, and to the lack of a multi-scale frequency-domain modeling mechanism, which hinders fine-grained feature extraction and spatial detail preservation in complex textured regions. The VM-Net model achieved the lowest accuracy in cotton field recognition, with an F1-score of 83.68%, 9.28% lower than that of the proposed model. Combined with the local visual comparison in Figure 6, this outcome may reflect the limited adaptability of VM-Net's architectural characteristics to agricultural remote sensing scenes: VM-Net is more prone to fragmented boundaries, local discontinuities, and spatial confusion in complex boundary regions, resulting in lower consistency between its segmentation results and the reference labels. Its design places greater emphasis on high-level semantic features; when applied to agricultural scenes characterized by fine-scale boundaries, strong spatial heterogeneity, and complex phenological variation, its limited sensitivity to local texture variation and boundary complexity may lead to weaker overall performance.
As illustrated by the local regions in Figure 6, the proposed model demonstrates superior preservation of spatial details, accurately delineating the boundaries between ridges and cotton plots, and producing clearer and more coherent edge predictions compared with other baseline models. These results validate the effectiveness of integrating both the interpretability module and the WTConv module into the U-Net framework: the former enhances the model’s attention to critical features and improves feature discriminability, while the latter enlarges the receptive field, facilitating the capture of multi-scale contextual information. Together, these modules improve fine-grained object representation and boundary localization accuracy.

4.3. Channel-Wise Interpretability Across Cotton Growth Stages

To systematically analyze the impact of different remote sensing features on cotton field classification across the growing season, monthly surface imagery from April to October was collected, covering 27 feature variables including vegetation indices, SAVG, texture descriptors, spectral bands, and topographic factors. The importance weights of these features were computed using the embedded interpretability module of the model. For each month, three independent experiments were conducted, and the averaged importance scores were used as the final feature importance for that time step to ensure robustness (Figure 7).
As illustrated in Figure 7, the importance of different features exhibits distinct stage-specific variations over time. Overall, spectral reflectance features contributed most significantly to the classification performance, with red-edge bands B6, B7, B8A, B5, and near-infrared (NIR) band B8 consistently ranking among the most important across multiple time points. This indicates that these bands are highly sensitive to cotton-specific spectral signatures, consistent with prior findings that red-edge regions are responsive to chlorophyll concentration and canopy structural variation [54]. Among vegetation indices, NDVI exhibited the highest average importance across all features, followed by EVI. Both indices showed elevated importance from May to August, which corresponds to the physiological characteristics of cotton, where canopy density and vegetation index values increase rapidly during the peak growth period [55,56].
SAR data are highly sensitive to the geometric structure of observed targets, making them useful for distinguishing between different land cover types [57]. Cotton undergoes significant phenological changes across growth stages—such as seedling, squaring, flowering, and boll opening—including variations in canopy coverage, leaf size, water content, and plant height. These changes affect the electromagnetic scattering mechanism and backscatter intensity captured by SAR sensors [58]. The results indicate that during the flowering and boll-opening stages, increased canopy height and leaf density result in near-maximum vegetation coverage, which reduces the radar wave’s ability to penetrate the canopy. Consequently, the importance of SAR bands shows an overall decreasing trend during these stages. In comparison, the average importance of the VH-polarized backscatter coefficient is significantly higher than that of the VV polarization, indicating that VH polarization is more sensitive to structural and moisture-related differences between cotton and non-cotton fields. Therefore, SAR-based cotton classification during the seedling and squaring stages shows strong potential, offering a feasible pathway for multi-source remote sensing fusion in phenology-specific crop classification.
Among texture features, Contrast exhibited a consistently high contribution across multiple cotton growth stages, particularly during May and June, which is likely related to heterogeneous vegetation cover, surface texture variability, and gray-level variations during the early growth stages [59]. In contrast, the topographic factor Slope maintained the lowest importance across all growth stages, suggesting that terrain slope has a relatively minor influence on cotton distribution in the study area.
Overall, the period from May to August represents the peak window for feature importance, during which numerous variables exhibited significantly elevated importance scores. This indicates that this period corresponds to the rapid growth and critical developmental stages of cotton. During this phase, spectral reflectance, texture features, and SAR backscatter undergo substantial changes in remote sensing imagery, providing rich and discriminative information for classification tasks [37]. As shown in Figure 7, the importance of B4, B8, NDVI, and EVI increased significantly during July and August, which aligns with the spectral response characteristics of peak vegetation growth. In contrast, SAVG and Contrast exhibited higher importance in May, reflecting the distinct texture variation and spectral heterogeneity typical of the seedling stage. These results suggest that selecting feature variables based on phenological stages can effectively improve both the accuracy and efficiency of cotton field classification. Focusing on key growth phases not only simplifies the feature extraction process, but also enhances the model’s discriminative power for the target crop.

4.4. Optimal Temporal Window Analysis

Satellite observations during key crop growth periods are often affected by cloud cover; this study therefore synthesized monthly imagery to address cloud-induced data gaps and to identify the optimal time window for accurate cotton extraction, enabling high-precision cotton mapping while reducing the complexity of phenological analysis. Sentinel-based imagery from April to October 2021 was composited into monthly mean reflectance using the GEE platform, and the proposed interpretable U-Net model was used to extract cotton planting areas and assess classification accuracy (Table 5). The performance trends of the model across different months were subsequently analyzed. The results indicate a clear seasonal fluctuation in cotton classification accuracy across months. August was identified as the optimal time window, with a mean F1-score of 92.83% and a mean mIoU of 85.54%; recognition performance in September was only slightly lower, with an F1-score of 91.59% and an mIoU of 84.37%. This strong late-season performance can be attributed to the boll-opening stage, which occurs from late August to early September and substantially alters the spectral characteristics of cotton, creating greater separability from other crops; the enhanced contrast produced the peak F1-score and mIoU, indicating fewer misclassifications and omissions in the segmentation results. The F1-scores in June and July differed by only 2.07%, with both months exceeding 88% on average, indicating that these periods also offer substantial advantages for cotton extraction; satisfactory results can thus be achieved during the flowering and boll-setting stages. In contrast, classification accuracy in April was notably lower, as this period corresponds to the sowing stage, when the surface is dominated by bare soil, plastic mulch, and sparse seedlings, making target recognition more challenging; accordingly, the mean mIoU fell to 73.48% and the mean F1-score to 83.39%. October corresponds to the end of the cotton harvest, during which residual stalks, weeds, and other interfering factors increase, reducing classification accuracy to an mIoU of 83.16% and an F1-score of 89.78%.
To further validate the variation in segmentation performance across different phenological stages, Figure A1 presents representative ROI-level comparisons from April, June, August, and October in the primary study area, including the original Sentinel imagery, reference labels, and corresponding model predictions. The visual results are consistent with the quantitative performance trends. In April, evident boundary discontinuities and localized omission errors are observed. In June, connectivity improves markedly, with more coherent field structures. August exhibits the highest boundary fidelity and the most complete internal filling of cotton parcels. In contrast, October shows slight edge fragmentation and a minor reduction in segmentation stability. Overall, this pattern clearly reflects a seasonal performance trajectory in which recognition accuracy peaks during the vigorous growth stage, while relatively lower performance is observed during the sowing and post-harvest periods.
Considering both classification accuracy and temporal stability, August is recommended as the optimal image acquisition period for cotton recognition tasks. When early-season monthly composites from April or May are used, temporal enhancement strategies—such as multi-date data fusion—should be incorporated to compensate for the limited spectral and structural information available in a single monthly composite.

5. Discussion

5.1. Model Transferability

The transferability of deep learning models in remote sensing applications remains a significant challenge [60]. The success of model transfer is largely dependent on the model’s ability to adapt to different environmental conditions, particularly in regions with distinct climatic conditions, topographic structures, and cropping systems [61]. Such environmental variability can cause shifts in crop spectral responses and spatial distribution patterns across regions and years, which may adversely affect model performance when applied to unseen domains [62].
To evaluate the model’s generalization ability across temporal and spatial dimensions, cross-year and cross-region transfer experiments were designed in this study. Based on the comparative analysis of monthly classification performance (Table 6), August exhibited relatively stable and superior identification performance in the main study area. Moreover, cotton during this period is in the middle-to-late growth stage, during which the spectral differences among classes are relatively distinct, thereby providing favorable conditions for stable discrimination. Therefore, August imagery was selected as the transfer evaluation window in this study.
In the within-region cross-year experiment, the overall performance remained stable (mIoU = 82.81%, F1 = 89.41%), demonstrating good temporal generalization. In the cross-region transfer experiment, although the performance decreased slightly compared with the within-region cross-year setting (mIoU = 74.56%, F1 = 86.12%), the overall identification capability remained at a relatively high level, indicating that the model maintained relatively strong transfer capability under inter-domain differences.
The ablation experiments further revealed the functions of different modules in transfer scenarios. In the within-region cross-year ablation experiment, removing the Explainer module led to decreases of 1.44% in mIoU, 0.95% in Recall, and 0.69% in F1-score, suggesting that this module contributes to more complete detection of cotton field areas. After removing the WTConv module, mIoU decreased by 0.80% and Recall by 0.73%, while Precision showed a marked decline of 0.85%, indicating that WTConv has advantages in enhancing spatial detail representation and boundary discrimination.
In the cross-region transfer experiment, this difference became more pronounced. After removing the Explainer module, mIoU and F1-score decreased by 1.25% and 1.02%, respectively, while Recall declined by 1.36%, suggesting that under inter-domain distribution shifts, the channel selection and redundancy suppression mechanism may help alleviate the instability caused by feature shifts. In contrast, after removing the WTConv module, Precision declined significantly by 1.85%, whereas Recall decreased by only 0.52%, suggesting that this module plays a more important role in controlling cross-region misclassification and confusion from complex backgrounds.
Overall, the two modules play distinct roles under transfer scenarios. The Explainer module dynamically reweights the input channels during forward propagation, which may help suppress redundant features and improve the stability of discriminative representations under domain shifts. In contrast, the WTConv module enhances boundary and detail representation through frequency-domain spatial modeling. Together, these two modules form a complementary mechanism that helps the model maintain a more consistent discriminative feature structure in cross-year and cross-region transfer tasks.
To further validate the effectiveness of the proposed method, we compared the performance of the proposed model with that of the baseline U-Net model in transfer tasks. In the within-region cross-year experiment, the proposed model (mIoU = 82.81%, F1-score = 89.41%) outperformed U-Net (mIoU = 81.17%, F1-score = 88.02%). In the cross-region experiment, the proposed model (mIoU = 74.56%, F1-score = 86.12%) likewise outperformed the U-Net model (mIoU = 71.75%, F1-score = 84.39%). These experiments further support the robustness of the proposed method in cross-domain application scenarios.
A further point worth noting is that, in the cross-region transfer experiment, the Precision and Recall metrics exhibited a certain degree of imbalance. This phenomenon provides important clues for understanding the error structure of the model under inter-domain distribution shifts. We conducted a systematic comparative analysis of the model predictions against the original remote sensing imagery and reference labels, and diagnosed model behavior using representative qualitative cases (Figure 8). The results of Case 1 show that the model can stably and completely identify the core areas of cotton fields, and the predictions are highly consistent with the reference labels within the field interiors, thereby maintaining a high recall in cross-region applications. Cases 3 and 5 show that the predicted cotton field extent was generally smaller than or close to the actually identifiable cotton field extent, and no obvious outward expansion of cotton field area was observed. Cases 2, 4, and 6 show that model errors were mainly concentrated at field boundaries, crop transition zones, and areas with strong background heterogeneity, manifesting as local pixel-level confusion. This phenomenon may be related to differences in planting structure and spatial patterns between the Wei-Ku Oasis and Manasi County. During training in the source domain, which is characterized by complex cropping structure and fragmented fields, the model gradually learned spectral–phenological combination features with strong robustness for cotton discrimination, so as to reduce missed detections under mixed backgrounds. When transferred to the target domain dominated by large-scale and intensive cotton cultivation, this strategy helps stably identify the main body of cotton fields, but introduces a small amount of pixel-level uncertainty at field boundaries and in heterogeneous background areas, thereby resulting in a relative decline in precision.
Overall, the precision–recall imbalance in cross-region transfer is mainly manifested as local pixel-level confusion under complex boundary conditions, rather than overprediction at the overall spatial scale.
In summary, the transfer performance of the model depends not only on feature representation capability, but also on the stability of feature selection under inter-domain distribution shifts. By dynamically reweighting input channels during forward propagation, the Explainer module helps reduce the interference of stage-sensitive variables with discrimination results. In cross-region transfer scenarios, when shifts occur in the distributions of spectral and background structural characteristics, this mechanism helps maintain a relatively consistent feature response pattern. Therefore, the performance gains achieved in this study are reflected not only in improved accuracy metrics, but also in enhanced stability of discriminative features under inter-domain distribution shifts.

5.2. Phenology-Driven Analysis of Optimal Remote Sensing Time Windows for Cotton Identification

Phenological phenomena objectively reflect plants' responses and adaptability to environmental conditions during growth and development [63], mainly manifested as advances or delays in crop phenology and changes in the duration of developmental stages. These variations are of great significance for monitoring and evaluating crop growth status [64]. The "optimal timing" for crop identification generally depends on the phenological cycle and seasonal rhythm of crops in the study area [65], thereby determining the optimal temporal acquisition window for remote-sensing imagery. As shown in Table 5, August and September are critical periods for remote-sensing identification of cotton planting areas in southern Xinjiang; during this time, wheat and maize in the region have largely been harvested [52], background interference is relatively low, and the salience of cotton targets is enhanced. Based on the results of the improved interpretable deep-learning model developed in this study, August is identified as the optimal acquisition time for cotton extraction in the study area, which is highly consistent with existing findings [66,67]. However, at the national scale, some studies argue that September [68] or June [69] is a more optimal identification time. Climatic differences among geographic regions are an important cause of this discrepancy, leading to pronounced temporal mismatches in cotton sowing dates, growth progression, and phenological characteristics [70]. In addition, the choice of remote-sensing data sources and feature types can substantially influence the determination of the optimal time; for example, optical vegetation indices such as NDVI and EVI are more sensitive to changes in vegetation greenness from June to August [56], whereas SAR imagery, especially the Sentinel-1 VH polarization channel, responds more strongly to changes in canopy spatial structure during the cotton boll-opening stage, effectively enhancing discrimination from bare land and other crops [71]. Therefore, the optimal timing for remote-sensing identification of cotton planting areas is not fixed, but is jointly determined by multiple factors, including regional phenological rhythms, sensor types, feature sensitivity, and the design of classification strategies.

5.3. Evaluation of Multi-Source Remote Sensing Channels and Their Discriminative Value in Cotton Classification

Across the growing season, the Sentinel-2 red-edge and near-infrared channels, particularly B6, followed by B8 and B7, show the highest and most persistent contributions (Figure 7). This pattern is consistent with the close association of red-edge reflectance with chlorophyll absorption-edge behavior and canopy biochemical status, which strengthens the cotton versus non-cotton contrast during key growth stages. The consistently high ranking of red-edge features, together with the importance of NIR- and SWIR-related information, aligns with prior evidence that the red-edge [72], near-infrared [73], and shortwave-infrared bands [74] are among the most informative spectral domains for crop classification.
The results also suggest a stage-dependent, cross-sensor sensing mechanism. Vegetation indices (NDVI/EVI) become more influential during the mid-season, consistent with strengthened sensitivity to canopy greenness and biomass around peak growth. For SAR predictors, the relative contributions of VV and VH vary with canopy development. During April–May, VV exhibits higher importance, which is consistent with the study area’s widespread drip irrigation under plastic mulch that enhances soil/background backscatter and thus improves early-season separability [75,76,77]. As the canopy becomes denser in later vegetative and reproductive stages, VH becomes more discriminative than VV, consistent with stronger crop sensitivity in cross-polarized channels as canopy structure develops [78]. These findings indicate a seasonal shift from background-dominated to canopy-structure-dominated SAR information under single-period composites.
Topography also plays a persistent role throughout the season. The consistently non-negligible contribution of DEM suggests that terrain-related spatial organization, such as irrigation layouts and cultivation patterns constrained by relief, provides a robust spatial prior in arid irrigated landscapes, complementing spectral and SAR observations. By contrast, although GLCM contrast receives relatively high weight in some periods, its overall contribution remains limited, likely because the 10 m Sentinel-2 resolution constrains the ability of texture descriptors to capture within-field heterogeneity and fine field-boundary patterns [39,79].
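For reference, the GLCM descriptors discussed above (see Table 1) can be reproduced with standard tools. The sketch below computes the contrast measure with scikit-image on a random stand-in patch; the quantization level, offsets, and patch size are chosen purely for illustration and are not the settings used in this study.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Stand-in for a quantized 10 m image patch (32 grey levels); in practice
# the GLCM would be computed per band over a sliding window.
rng = np.random.default_rng(0)
band = rng.integers(0, 32, size=(64, 64), dtype=np.uint8)

# Grey-level co-occurrence matrix at distance 1, two orientations.
glcm = graycomatrix(band, distances=[1], angles=[0, np.pi / 2],
                    levels=32, symmetric=True, normed=True)

# Haralick contrast: sum_ij (i - j)^2 * P(i, j), averaged over orientations.
contrast = graycoprops(glcm, "contrast").mean()
print(f"GLCM contrast: {contrast:.2f}")
```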
These findings also help explain regional differences reported in the literature. Forkuor et al. found that VV outperformed VH for discriminating multiple crop types in northwestern Benin, likely because rice cultivation and frequent irrigation produced smoother, wetter surfaces that enhanced VV contrast [80]. In our arid irrigated cotton system, by contrast, VV dominance is largely confined to the early season, when background and soil effects are stronger, whereas VH becomes increasingly informative as canopy structure develops. Overall, the optimal predictor set is both region- and season-dependent, underscoring the importance of stage-aware quantitative feature evaluation for robust cotton mapping.

5.4. Limitations and Future Directions

Nevertheless, several limitations of this study should be acknowledged. First, although we conducted a systematic monthly diagnostic of cotton classification performance across the full growing season (April–October), classification accuracy in the early growth stages (April–May) remains substantially lower than that achieved in the mid-to-late stages. This is primarily attributed to the high spectral similarity between cotton, bare soil, and other vegetation during the emergence and seedling stages, which inherently limits the information capacity of a single monthly composite derived from medium-resolution optical imagery.
Second, cross-region transfer experiments revealed that model performance declines moderately in heterogeneous transition zones and along field boundaries, manifested as pixel-level false positives and boundary uncertainties. While the proposed method maintained high recall and stable identification of core cotton areas, this precision loss reflects the inherent trade-off of learning robust spectral–phenological features in complex source domains.
Future research may address these limitations by (1) integrating multi-temporal or hyperspectral data to improve early-stage separability, and (2) incorporating boundary-aware constraints or cross-region adaptation strategies to reduce pixel-level uncertainty in complex transitional zones, thereby further improving cross-region transfer accuracy.

6. Conclusions

To address the challenges of discriminative stability and cross-domain generalization in cotton mapping in arid oasis agricultural regions, this study developed a multi-source remote sensing framework integrating an embedded explainability mechanism with a receptive-field enhancement strategy. In addition to improving classification accuracy, the framework enabled analysis of key feature contribution patterns across phenological stages and assessment of model generalization across temporal and spatial dimensions. The main conclusions are as follows:
(1)
This study developed a cotton identification model that integrates an explainability module, a WTConv-based multiscale feature enhancement mechanism, and a ConvNeXt backbone (a simplified sketch of the wavelet mechanism is provided after these conclusions). Experimental results show that the proposed model achieved an mIoU of 85.62% and an F1-score of 92.96% in remote sensing-based cotton mapping, outperforming several mainstream models overall. Beyond this gain in classification accuracy, the embedded feature-weighting mechanism makes the contribution of each input variable explicit, providing intrinsic model interpretability.
(2)
Based on the embedded explainability module, this study systematically evaluated the temporal variation in the importance of multi-source remote sensing features across different cotton growth stages. The results indicate that channels such as B6, B8, B7, DEM, NDVI, EVI, VH, and SAVG make strong discriminative contributions during key growth stages, highlighting the complementary roles of spectral, structural, and topographic information in cotton identification.
(3)
Comparative analysis across different phenological windows showed that August was the optimal period for cotton extraction, yielding the highest accuracy and stability (mIoU = 85.54%, F1-score = 92.83%). June and July also exhibited strong identification performance, with F1-scores both exceeding 88%. In contrast, April showed relatively lower accuracy due to stronger surface heterogeneity during the early sowing stage. These results indicate that stable discriminative feature representation within key phenological windows is critical for improving identification performance.
(4)
The transfer experiments demonstrate that the proposed model has strong spatiotemporal generalization capability. In the within-region cross-year scenario, the model achieved an mIoU of 82.81% and an F1-score of 89.41%; in the cross-region transfer experiment, it still attained an mIoU of 74.56% and an F1-score of 86.12%, outperforming the baseline U-Net overall. The ablation results further confirm that the explainability-guided channel selection mechanism and the multiscale wavelet-based feature enhancement mechanism form a complementary structure that helps alleviate distribution shifts caused by changes in spectral responses and spatial structures, thereby improving the model’s stability and reliability in cross-temporal and cross-regional applications.
Furthermore, in monitoring scenarios characterized by limited data availability or the need for rapid response, an identification framework constructed around key phenological windows demonstrates strong practical feasibility. The embedded feature-importance modeling mechanism also shows potential for extension to other agricultural remote sensing segmentation tasks, particularly in regional environments with pronounced multi-source feature redundancy and background heterogeneity. This study provides a useful technical reference for crop mapping and monitoring in arid or irrigation-dominated agricultural regions.
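To complement conclusion (1), the sketch below illustrates, in simplified single-level form, the wavelet-based receptive-field enhancement idea behind the WTConv module: a Haar decomposition lets a small convolution act on half-resolution subbands, effectively enlarging its receptive field in the original image, before the inverse transform restores resolution. The single decomposition level, depthwise subband convolution, and fixed Haar filters are simplifying assumptions; the actual module may differ in depth and detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WTConvSketch(nn.Module):
    """Minimal single-level sketch of wavelet-based receptive-field enhancement."""

    def __init__(self, channels: int):
        super().__init__()
        # Orthonormal 2x2 Haar analysis filters: LL, LH, HL, HH.
        haar = 0.5 * torch.tensor([[[1.0, 1.0], [1.0, 1.0]],
                                   [[1.0, 1.0], [-1.0, -1.0]],
                                   [[1.0, -1.0], [1.0, -1.0]],
                                   [[1.0, -1.0], [-1.0, 1.0]]])
        self.register_buffer("haar", haar.unsqueeze(1))            # (4, 1, 2, 2)
        # Small depthwise convolution on the half-resolution subbands;
        # at half resolution, a 3x3 kernel covers a 6x6 area of the input.
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, 3,
                                      padding=1, groups=4 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = x.shape[1]
        filt = self.haar.repeat(c, 1, 1, 1)                        # (4c, 1, 2, 2)
        sub = F.conv2d(x, filt, stride=2, groups=c)                # analysis (DWT)
        sub = self.subband_conv(sub)                               # filter subbands
        # Synthesis (inverse DWT): orthonormal filters give exact
        # reconstruction when subband_conv is the identity.
        return F.conv_transpose2d(sub, filt, stride=2, groups=c)


# Usage: enhance a 27-channel input while preserving its spatial size.
feats = WTConvSketch(27)(torch.randn(1, 27, 128, 128))             # -> (1, 27, 128, 128)
```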

Author Contributions

Conceptualization, L.L. and K.J.; Methodology, L.L., X.G. and J.W.; Software, L.L., Z.Z. and H.X.; Validation, L.L.; Visualization, L.L. and K.J.; Writing—original draft preparation, L.L. and X.G.; Writing—review and editing, J.W., J.D. and Z.L.; Supervision, J.W.; Funding acquisition, J.D. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. U41961059), the Youth Top Talent Project of Xinjiang Uygur Autonomous Region (No. 2024TSYCCX0024), and the Excellent Graduate Innovation Project of Xinjiang University (No. XJDX2025YJS074).

Data Availability Statement

The data are available on request to the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Acquisition dates and number of Sentinel-1 and Sentinel-2 images used in the Wei-Ku Oasis in 2021.

| Data Source | April | May | June | July | August | September | October |
|---|---|---|---|---|---|---|---|
| Sentinel-1 | 10 | 10 | 10 | 10 | 12 | 10 | 8 |
| Sentinel-2 | 45 | 35 | 48 | 39 | 47 | 49 | 51 |
Table A2. Acquisition dates and number of Sentinel-1 and Sentinel-2 images used in Manasi County in 2020.

| Data Source | April | May | June | July | August | September | October |
|---|---|---|---|---|---|---|---|
| Sentinel-1 | 15 | 20 | 20 | 20 | 23 | 22 | 27 |
| Sentinel-2 | 76 | 82 | 71 | 62 | 87 | 64 | 72 |
Table A3. Acquisition dates and number of Sentinel-1 and Sentinel-2 images used in the Wei-Ku Oasis in 2020.

| Data Source | April | May | June | July | August | September | October |
|---|---|---|---|---|---|---|---|
| Sentinel-1 | 12 | 20 | 20 | 20 | 20 | 20 | 16 |
| Sentinel-2 | 45 | 49 | 61 | 44 | 58 | 55 | 55 |
Figure A1. Visual comparison of cotton classification results across representative fields at different phenological stages (April, June, August, and October) using Sentinel imagery.

References

  1. Amrouk, E.; Palmeri, F. Recent Trends and Prospects in the World Cotton Market and Policy Developments; Food and Agriculture Organization of the United Nations: Rome, Italy, 2021. [Google Scholar] [CrossRef]
  2. Xun, L.; Zhang, J.; Cao, D.; Yang, S.; Yao, F. A novel cotton mapping index combining Sentinel-1 SAR and Sentinel-2 multispectral imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 148–166. [Google Scholar] [CrossRef]
  3. Riaz, T.; Iqbal, M.W.; Mahmood, S.; Yasmin, I.; Leghari, A.A.; Rehman, A.; Mushtaq, A.; Ali, K.; Azam, M.; Bilal, M. Cottonseed oil: A review of extraction techniques, physicochemical, functional, and nutritional properties. Crit. Rev. Food Sci. Nutr. 2023, 63, 1219–1237. [Google Scholar] [CrossRef]
  4. OECD; FAO. OECD-FAO Agricultural Outlook 2021–2030; Organization for Economic Co-Operation and Development: Paris, France; Food and Agriculture Organization of the United Nations: Rome, Italy, 2021. [Google Scholar] [CrossRef]
  5. Wang, C.; Chen, Q.; Fan, H.; Yao, C.; Sun, X.; Chan, J.; Deng, J. Evaluating satellite hyperspectral (Orbita) and multispectral (Landsat 8 and Sentinel-2) imagery for identifying cotton acreage. Int. J. Remote Sens. 2021, 42, 4042–4063. [Google Scholar] [CrossRef]
  6. Liu, X.; Tian, M.; Liang, J. Prediction of cotton yield densely planted in Xinjiang of China using RCH-UNet model. Trans. Chin. Soc. Agric. Eng. 2024, 40, 230–239. [Google Scholar] [CrossRef]
  7. Johnson, D.M.; Mueller, R. Pre-and within-season crop type classification trained with archival land cover information. Remote Sens. Environ. 2021, 264, 112576. [Google Scholar] [CrossRef]
  8. Tang, K.; Zhu, W.; Zhan, P.; Ding, S. An identification method for spring maize in Northeast China based on spectral and phenological features. Remote Sens. 2018, 10, 193. [Google Scholar] [CrossRef]
  9. Ma, C.; Liu, M.; Ding, F.; Li, C.; Cui, Y.; Chen, W.; Wang, Y. Wheat growth monitoring and yield estimation based on remote sensing data assimilation into the SAFY crop growth model. Sci. Rep. 2022, 12, 5473. [Google Scholar] [CrossRef]
  10. Wang, J.; Xiao, X.; Liu, L.; Wu, X.; Qin, Y.; Steiner, J.L.; Dong, J. Mapping sugarcane plantation dynamics in Guangxi, China, by time series Sentinel-1, Sentinel-2 and Landsat images. Remote Sens. Environ. 2020, 247, 111951. [Google Scholar] [CrossRef]
  11. Adrian, J.; Sagan, V.; Maimaitijiang, M. Sentinel SAR-optical fusion for crop type mapping using deep learning and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 2021, 175, 215–235. [Google Scholar] [CrossRef]
  12. Thenkabail, P.S.; Biradar, C.M.; Noojipady, P.; Dheeravath, V.; Li, Y.; Velpuri, M.; Gumma, M.; Gangalakunta, O.R.P.; Turral, H.; Cai, X. Global irrigated area map (GIAM), derived from remote sensing, for the end of the last millennium. Int. J. Remote Sens. 2009, 30, 3679–3733. [Google Scholar] [CrossRef]
  13. Cao, B.; Yu, L.; Naipal, V.; Ciais, P.; Li, W.; Zhao, Y.; Wei, W.; Chen, D.; Liu, Z.; Gong, P. A 30-meter terrace mapping in China using Landsat 8 imagery and digital elevation model based on the Google Earth Engine. Earth Syst. Sci. Data Discuss. 2020, 2020, 1–35. [Google Scholar] [CrossRef]
  14. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Shawkat Ali, A.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617. [Google Scholar] [CrossRef]
  15. Madaeni, F.; Chokmani, K.; Lhissou, R.; Gauthier, Y.; Tolszczuk-Leclerc, S. Convolutional neural network and long short-term memory models for ice-jam predictions. Cryosphere 2022, 16, 1447–1468. [Google Scholar] [CrossRef]
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  17. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  18. Zhan, W.; Chen, J. SADNet: Space-aware DeepLab network for Urban-Scale point clouds semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103827. [Google Scholar] [CrossRef]
  19. Han, T.; Chen, Y.; Ma, J.; Liu, X.; Zhang, W.; Zhang, X.; Wang, H. Point cloud semantic segmentation with adaptive spatial structure graph transformer. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104105. [Google Scholar] [CrossRef]
  20. Zwijnen, A.-W.; Watzema, L.; Ridwan, Y.; van Der Pluijm, I.; Smal, I.; Essers, J. Self-adaptive deep learning-based segmentation for universal and functional clinical and preclinical CT image analysis. Comput. Biol. Med. 2024, 179, 108853. [Google Scholar] [CrossRef] [PubMed]
  21. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic segmentation of agricultural images: A survey. Inf. Process. Agric. 2024, 11, 172–186. [Google Scholar] [CrossRef]
  22. Picon, A.; San-Emeterio, M.G.; Bereciartua-Perez, A.; Klukas, C.; Eggers, T.; Navarra-Mestre, R. Deep learning-based segmentation of multiple species of weeds and corn crop using synthetic and real image datasets. Comput. Electron. Agric. 2022, 194, 106719. [Google Scholar] [CrossRef]
  23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  24. Ashapure, A.; Jung, J.; Chang, A.; Oh, S.; Yeom, J.; Maeda, M.; Maeda, A.; Dube, N.; Landivar, J.; Hague, S.; et al. Developing a machine learning based cotton yield estimation framework using multi-temporal UAS data. ISPRS J. Photogramm. Remote Sens. 2020, 169, 180–194. [Google Scholar] [CrossRef]
  25. Xu, J.; Yang, J.; Xiong, X.; Li, H.; Huang, J.; Ting, K.C.; Ying, Y.; Lin, T. Towards interpreting multi-temporal deep learning models in crop mapping. Remote Sens. Environ. 2021, 264, 112599. [Google Scholar] [CrossRef]
  26. Zhang, J.; Ding, J.; Tan, J.; Wang, J.; Zhang, Z.; Wang, Z.; Ge, X. Monitoring soil salinization in Arid cotton fields using Unmanned Aerial Vehicle hyperspectral imagery. Int. J. Appl. Earth Obs. Geoinf. 2025, 140, 104584. [Google Scholar] [CrossRef]
  27. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 2017, 202, 18–27. [Google Scholar] [CrossRef]
  28. Dalsasso, E.; Meraoumia, I.; Denis, L.; Tupin, F. Exploiting multi-temporal information for improved speckle reduction of Sentinel-1 SAR images by deep learning. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021. [Google Scholar]
  29. Singh, P.; Shankar, A.; Diwakar, M. Review on nontraditional perspectives of synthetic aperture radar image despeckling. J. Electron. Imaging 2023, 32, 021609. [Google Scholar] [CrossRef]
  30. Flood, N. Seasonal Composite Landsat TM/ETM+ Images Using the Medoid (a Multi-Dimensional Median). Remote Sens. 2013, 5, 6481–6500. [Google Scholar] [CrossRef]
  31. Inglada, J.; Vincent, A.; Arias, M.; Tardy, B.; Morin, D.; Rodes, I. Operational High Resolution Land Cover Map Production at the Country Scale Using Satellite Image Time Series. Remote Sens. 2017, 9, 95. [Google Scholar] [CrossRef]
  32. Thompson, J.A.; Bell, J.C.; Butler, C.A. Digital elevation model resolution: Effects on terrain attribute calculation and quantitative soil-landscape modeling. Geoderma 2001, 100, 67–89. [Google Scholar] [CrossRef]
  33. Zhao, W.; Zhou, C.; Zhou, C.; Ma, H.; Wang, Z. Soil salinity inversion model of oasis in arid area based on UAV multispectral remote sensing. Remote Sens. 2022, 14, 1804. [Google Scholar] [CrossRef]
  34. Rouse, J.; Haas, R.; Schell, J.; Deering, D. Monitoring vegetation systems in the great plains with ERTS proceeding. In Proceedings of the Third Earth Reserves Technology Satellite Symposium, NASA SP-351, Greenbelt, MD, USA, 10–14 December 1974; p. 317. [Google Scholar]
  35. Li, Y.; Chang, C.; Wang, Z.; Zhao, G. Upscaling remote sensing inversion and dynamic monitoring of soil salinization in the Yellow River Delta, China. Ecol. Indic. 2023, 148, 110087. [Google Scholar] [CrossRef]
  36. Torres, R.; Snoeij, P.; Geudtner, D.; Bibby, D.; Davidson, M.; Attema, E.; Potin, P.; Rommen, B.; Floury, N.; Brown, M. GMES Sentinel-1 mission. Remote Sens. Environ. 2012, 120, 9–24. [Google Scholar] [CrossRef]
  37. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621. [Google Scholar] [CrossRef]
  38. Farr, T.G.; Rosen, P.A.; Caro, E.; Crippen, R.; Duren, R.; Hensley, S.; Kobrick, M.; Paller, M.; Rodriguez, E.; Roth, L. The shuttle radar topography mission. Rev. Geophys. 2007, 45, RG2004. [Google Scholar] [CrossRef]
  39. Kang, X.; Huang, C.; Chen, J.M.; Lv, X.; Wang, J.; Zhong, T.; Wang, H.; Fan, X.; Ma, Y.; Yi, X. The 10-m cotton maps in Xinjiang, China during 2018–2021. Sci. Data 2023, 10, 688. [Google Scholar] [CrossRef] [PubMed]
  40. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  41. Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep learning for remote sensing image classification: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1264. [Google Scholar] [CrossRef]
  42. Perez, L.; Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv 2017, arXiv:1712.04621. [Google Scholar] [CrossRef]
  43. Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929. [Google Scholar] [CrossRef]
  44. Li, Q.; Shen, L.; Guo, S.; Lai, Z. Wavelet integrated CNNs for noise-robust image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7245–7254. [Google Scholar]
  45. Haar, A. Zur Theorie der Orthogonalen Funktionensysteme; Georg-August-Universitat: Gottingen, Germany, 1909. [Google Scholar]
  46. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693. [Google Scholar] [CrossRef]
  47. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  48. Wang, Z.; Berman, M.; Rannen-Triki, A.; Torr, P.; Tuia, D.; Tuytelaars, T.; Gool, L.V.; Yu, J.; Blaschko, M. Revisiting evaluation metrics for semantic segmentation: Optimization and evaluation of fine-grained intersection over union. Adv. Neural Inf. Process. Syst. 2023, 36, 60144–60225. [Google Scholar]
  49. Xu, H.; Song, J.; Zhu, Y. Evaluation and Comparison of Semantic Segmentation Networks for Rice Identification Based on Sentinel-2 Imagery. Remote Sens. 2023, 15, 1499. [Google Scholar] [CrossRef]
  50. Mohammadi, M.; Mollazade, K.; Behroozi-Khazaei, N. Under-and Over-Segmentation: New Metrics for Image Segmentation Accuracy Measurement. Array 2025, 28, 100624. [Google Scholar] [CrossRef]
  51. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  53. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
  54. Zarco-Tejada, P.J.; González-Dugo, V.; Berni, J.A. Fluorescence, temperature and narrow-band indices acquired from a UAV platform for water stress detection using a micro-hyperspectral imager and a thermal camera. Remote Sens. Environ. 2012, 117, 322–337. [Google Scholar] [CrossRef]
  55. Tian, F.; Cai, Z.; Jin, H.; Hufkens, K.; Scheifinger, H.; Tagesson, T.; Smets, B.; Van Hoolst, R.; Bonte, K.; Ivits, E. Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe. Remote Sens. Environ. 2021, 260, 112456. [Google Scholar] [CrossRef]
  56. Shammi, S.A.; Meng, Q. Use time series NDVI and EVI to develop dynamic crop growth metrics for yield modeling. Ecol. Indic. 2021, 121, 107124. [Google Scholar] [CrossRef]
  57. Gašparović, M.; Klobučar, D. Mapping Floods in Lowland Forest Using Sentinel-1 and Sentinel-2 Data and an Object-Based Approach. Forests 2021, 12, 553. [Google Scholar] [CrossRef]
  58. Gella, G.W.; Bijker, W.; Belgiu, M. Mapping crop types in complex farming areas using SAR imagery with dynamic time warping. ISPRS J. Photogramm. Remote Sens. 2021, 175, 171–183. [Google Scholar] [CrossRef]
  59. Song, Q.; Hu, Q.; Zhou, Q.; Hovis, C.; Xiang, M.; Tang, H.; Wu, W. In-Season Crop Mapping with GF-1/WFV Data by Combining Object-Based Image Analysis and Random Forest. Remote Sens. 2017, 9, 1184. [Google Scholar] [CrossRef]
  60. Zhang, Y.; Hui, J.; Qin, Q.; Sun, Y.; Zhang, T.; Sun, H.; Li, M. Transfer-learning-based approach for leaf chlorophyll content estimation of winter wheat from hyperspectral data. Remote Sens. Environ. 2021, 267, 112724. [Google Scholar] [CrossRef]
  61. Wang, Z.; Zhang, H.; He, W.; Zhang, L. Cross-phenological-region crop mapping framework using Sentinel-2 time series Imagery: A new perspective for winter crops in China. ISPRS J. Photogramm. Remote Sens. 2022, 193, 200–215. [Google Scholar] [CrossRef]
  62. Siriwardana, A.N.; Kume, A. Introducing the spectral characteristics index: A novel method for clustering solar radiation fluctuations from a plant-ecophysiological perspective. Ecol. Inform. 2025, 85, 102940. [Google Scholar] [CrossRef]
  63. Tang, J.; Körner, C.; Muraoka, H.; Piao, S.; Shen, M.; Thackeray, S.J.; Yang, X. Emerging opportunities and challenges in phenology: A review. Ecosphere 2016, 7, e01436. [Google Scholar] [CrossRef]
  64. Zhao, Y.; Xiao, D.; Bai, H.; Tao, F. Research progress on the response and adaptation of crop phenology to climate change in China. Prog. Geogr. 2019, 38, 224–235. [Google Scholar] [CrossRef]
  65. Dong, Q.; Chen, X.; Chen, J.; Zhang, C.; Liu, L.; Cao, X.; Zang, Y.; Zhu, X.; Cui, X. Mapping winter wheat in North China using Sentinel 2A/B data: A method based on phenology-time weighted dynamic time warping. Remote Sens. 2020, 12, 1274. [Google Scholar] [CrossRef]
  66. Tan, Z.; Tan, Z.; Luo, J.; Duan, H. Mapping 30-m cotton areas based on an automatic sample selection and machine learning method using Landsat and MODIS images. Geo Spat. Inf. Sci. 2024, 27, 1767–1784. [Google Scholar] [CrossRef]
  67. Xiong, J.; Ge, X.; Ding, J.; Wang, J.; Zhang, Z.; Zhu, C.; Han, L.; Wang, J. Optimal Time-Window for Assessing Soil Salinity via Sentinel-2 Multitemporal Synthetic Data in the Arid Agricultural Regions of China. Ecol. Indic. 2025, 176, 113642. [Google Scholar] [CrossRef]
  68. Sun, Z.; Wang, D.; Zhou, Q. Dryland crop recognition based on multi-temporal polarization SAR data. In Proceedings of the 2019 8th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Istanbul, Turkey, 16–19 July 2019; pp. 1–5. [Google Scholar]
  69. Si, K.; Wang, C.; Zhao, Q. Cotton extraction method based on optimal time phase combination of Sentinel-2 remote sensing images. J. Shihezi Univ. Nat. Sci. 2022, 40, 639–647. [Google Scholar]
  70. Zhu, Y.; Sun, L.; Luo, Q.; Chen, H.; Yang, Y. Spatial optimization of cotton cultivation in Xinjiang: A climate change perspective. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103523. [Google Scholar] [CrossRef]
  71. Wang, L.; Jin, H.; Wang, C.; Sun, R. Backscattering characteristics and texture information analysis of typical crops based on synthetic aperture radar: A case study of Nong’an County, Jilin Province. Chin. J. Eco Agric. 2019, 27, 1385–1393. [Google Scholar] [CrossRef]
  72. Immitzer, M.; Vuolo, F.; Atzberger, C. First experience with Sentinel-2 data for crop and tree species classifications in central Europe. Remote Sens. 2016, 8, 166. [Google Scholar] [CrossRef]
  73. Sonobe, R.; Yamaya, Y.; Tani, H.; Wang, X.; Kobayashi, N.; Mochizuki, K.-i. Crop classification from Sentinel-2-derived vegetation indices using ensemble learning. J. Appl. Remote Sens. 2018, 12, 026019. [Google Scholar] [CrossRef]
  74. Hunt, E.R., Jr.; Yilmaz, M.T. Remote sensing of vegetation water content using shortwave infrared reflectances. In Proceedings of the Remote Sensing and Modeling of Ecosystems for Sustainability IV, San Diego, CA, USA, 28–29 August 2007; pp. 15–22. [Google Scholar]
  75. Toscani, P.; Immitzer, M.; Atzberger, C. Texturanalyse mittels diskreter Wavelet Transformation für die objektbasierte Klassifikation von Orthophotos. Photogramm. Fernerkund. Geoinf 2013, 2, 105–121. [Google Scholar] [CrossRef]
  76. Immitzer, M.; Toscani, P.; Atzberger, C. The Utility of Wavelet-based Texture Measures to Improve Object-based Classification of Aerial Images. South.-East. Eur. J. Earth Obs. Geomat 2014, 3, 79–84. [Google Scholar]
  77. Liu, X.; Bo, Y. Object-based crop species classification based on the combination of airborne hyperspectral images and LiDAR data. Remote Sens. 2015, 7, 922–950. [Google Scholar] [CrossRef]
  78. Zhang, J.; Ding, J.; Zhang, Z.; Wang, J.; Zeng, X.; Ge, X. Study on the inversion and spatiotemporal variation mechanism of soil salinization at multiple depths in typical oases in arid areas: A case study of Wei-Ku Oasis. Agric. Water Manag. 2025, 315, 109542. [Google Scholar] [CrossRef]
  79. Chong, L.; Liu, H.-J.; Lu, L.-p.; Liu, Z.-R.; Kong, F.-C.; Zhang, X.-L. Monthly composites from Sentinel-1 and Sentinel-2 images for regional major crop mapping with Google Earth Engine. J. Integr. Agric. 2021, 20, 1944–1957. [Google Scholar] [CrossRef]
  80. Forkuor, G.; Conrad, C.; Thiel, M.; Ullmann, T.; Zoungrana, E. Integration of optical and Synthetic Aperture Radar imagery for improving crop mapping in Northwestern Benin, West Africa. Remote Sens. 2014, 6, 6472–6499. [Google Scholar] [CrossRef]
Figure 1. Overview of the study area and phenological stages: (a) Xinjiang Province location; (b) composite imagery of the Wei-Ku Oasis; (c) composite imagery of Manasi County; (d–i) sequential representations of cotton growth stages, including sowing, emergence, seedling, budding, boll formation, and boll opening.
Figure 2. The growth and development stages of cotton.
Figure 3. Workflow of multi-source remote sensing data preprocessing, feature extraction, and model input construction.
Figure 4. Overall structure of the model. The model follows an encoder–decoder architecture with skip connections and consists of a ConvNeXt backbone, an explainability-guided module, and a WTConv-based multiscale enhancement module. The input is a 27-channel temporal feature set, and the output is the predicted cotton classification map.
Figure 5. Structure of the explainability-guided module (Explainer). The module takes the 27-channel temporal feature set as input and combines channel attention and spatial attention for feature reweighting. Normalization and sparse gating are further introduced to emphasize informative features. The resulting channel importance scores facilitate interpretation of the model predictions.
Figure 6. Visual comparison of cotton classification results from different models for representative fields using August Sentinel imagery.
Figure 7. Feature importance for cotton classification across the growing season. The top bar chart shows the mean feature importance (%) of each variable from April to October. The bottom heatmap shows the monthly feature importance (%) of each variable, with values normalized within each month.
Figure 8. Case examples of the precision–recall trade-off under domain shift.
Table 1. Selection of Environmental Covariates.

| Environmental Covariates | Formula | References |
|---|---|---|
| Original bands | Coastal Blue, Blue, Green, Red, RedEdge1, RedEdge2, RedEdge3, NIR, RedEdge4, WaterVapor, SWIR1, SWIR2 | [33] |
| Normalized Difference Vegetation Index (NDVI) | $(NIR - R)/(NIR + R)$ | [34] |
| Enhanced Vegetation Index (EVI) | $2.5 \times (NIR - R)/(NIR + 6R - 7.5B + 1)$ | [33] |
| Difference Vegetation Index (DVI) | $NIR - R$ | [33] |
| Ratio Vegetation Index (RVI) | $NIR/R$ | [35] |
| Backscatter Coefficient (σ0) | VV, VH | [36] |
| SAVG | $\sum_{k=2}^{2N_g} k \sum_{i+j=k} p(i,j)$ | [37] |
| Contrast | $\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} (i-j)^2 P(i,j)$ | [37] |
| Ent | $-\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} P(i,j)\log P(i,j)$ | [37] |
| Asm | $\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} P(i,j)^2$ | [37] |
| Corr | $\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} \frac{(i-\mathrm{Mean})(j-\mathrm{Mean})\,P(i,j)}{\mathrm{Var}}$ | [37] |
| Var | $\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} (i-\mu)^2 P(i,j)$ | [37] |
| DEM | SRTM | [38] |
| Slope | SRTM | [38] |
| Aspect | SRTM | [38] |
Table 2. Hyperparameter tuning ranges and selected optimal settings for different models.

| Model | Optimizer | Weight Decay | Learning Rate Search Range | Final Learning Rate | Batch Size | Epochs |
|---|---|---|---|---|---|---|
| U-Net | Adam | 1 × 10−4 | {1 × 10−4, 6 × 10−5, 3 × 10−5} | 6 × 10−5 | 4 | 200 |
| DeepLabV3+ | Adam | 1 × 10−4 | {1 × 10−4, 6 × 10−5, 3 × 10−5} | 6 × 10−5 | 4 | 200 |
| VM-Net | AdamW | 1 × 10−2 | {1 × 10−4, 6 × 10−5, 3 × 10−5, 1 × 10−5} | 6 × 10−5 | 4 | 200 |
| SegFormer | AdamW | 1 × 10−2 | {1 × 10−4, 6 × 10−5, 3 × 10−5, 1 × 10−5} | 3 × 10−5 | 4 | 200 |
| Swin Transformer | AdamW | 1 × 10−2 | {1 × 10−4, 6 × 10−5, 3 × 10−5, 1 × 10−5} | 1 × 10−5 | 4 | 200 |
| Ours | Adam | 1 × 10−4 | {1 × 10−4, 6 × 10−5, 3 × 10−5} | 6 × 10−5 | 4 | 200 |
Table 3. Results of ablation studies.

| No. | Explainer | WTConv | ConvNeXt | ResNet18 | mIoU (%) | Precision | Recall | F1-Score | Param (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | ✓ | ✓ | × | 85.62 | 92.05 | 93.88 | 92.96 | 36.85 | 10.87 |
| 2 | ✓ | ✓ | × | ✓ | 85.01 | 92.23 | 92.05 | 92.14 | 21.59 | 11.18 |
| 3 | × | ✓ | ✓ | × | 84.25 | 91.63 | 92.06 | 91.94 | 36.84 | 10.42 |
| 4 | ✓ | × | ✓ | × | 85.12 | 90.26 | 92.55 | 91.44 | 36.72 | 10.87 |
Table 4. Comparison of test accuracy of different models on August datasets.

| Model | mIoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| DeepLabV3+ | 84.44 | 89.84 | 92.71 | 91.25 |
| SegFormer | 84.12 | 90.32 | 91.71 | 91.01 |
| Swin-Transformer | 85.04 | 90.66 | 92.58 | 91.59 |
| U-Net | 84.83 | 91.73 | 91.81 | 91.77 |
| VM-Net | 71.94 | 79.68 | 88.09 | 83.68 |
| Ours | 85.62 | 92.05 | 93.88 | 92.96 |
Table 5. Classification Accuracy Comparison of Cotton During the Growing Season.

| Month | mIoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| 4 | 73.48 | 81.89 | 84.92 | 83.39 |
| 5 | 78.42 | 86.64 | 87.95 | 86.47 |
| 6 | 80.17 | 87.25 | 89.66 | 88.43 |
| 7 | 83.27 | 89.37 | 91.67 | 90.50 |
| 8 | 85.54 | 91.96 | 93.72 | 92.83 |
| 9 | 84.37 | 91.69 | 91.77 | 91.59 |
| 10 | 83.16 | 89.67 | 89.89 | 89.78 |
Table 6. Generalizability of model.

| Model | Year | Region | mIoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| U-Net | 2020 | Wei-Ku Oasis | 81.17 | 87.09 | 88.97 | 88.02 |
| Ours | 2020 | Wei-Ku Oasis | 82.81 | 88.12 | 90.74 | 89.41 |
| w/o Explainer | 2020 | Wei-Ku Oasis | 81.37 | 87.68 | 89.79 | 88.72 |
| w/o WTConv | 2020 | Wei-Ku Oasis | 82.01 | 87.27 | 90.01 | 88.62 |
| U-Net | 2020 | Manasi | 71.75 | 79.23 | 90.28 | 84.39 |
| Ours | 2020 | Manasi | 74.56 | 80.43 | 92.68 | 86.12 |
| w/o Explainer | 2020 | Manasi | 73.31 | 79.67 | 91.32 | 85.10 |
| w/o WTConv | 2020 | Manasi | 73.89 | 78.58 | 92.16 | 84.83 |