1. Introduction
Soybean is a pivotal food crop and a foundational raw material for the food industry. Historically, China has maintained a substantial reliance on international soybean imports. The import volume of soybeans exhibited a marked increase from 10.4 million tons in 2000 to 100.3 million tons in 2020, with the import dependency rate rising from 46.2% to 83.6% [
1]. Concurrently, the United States maintains a predominant presence in the international soybean trade, while advancements in crop breeding technologies are occurring at an accelerated pace [
2]. The rapid and accurate acquisition of soybean planting area information in North America is of significant importance. It can provide data support for the development of soybean futures trading and agricultural insurance businesses in China. Furthermore, it can serve as a scientific basis for formulating policies to ensure soybean food security.
In recent years, conventional machine learning algorithms, including Random Forest (RF), Support Vector Machine (SVM), and decision trees, have been widely adopted for crop mapping and type identification. As validated by Sheykhmousa et al. [
3] via meta-analysis of 251 studies, RF and SVM excel in modeling high-dimensional spectral data and separating spectrally similar vegetation. For example, Ibrahim et al. [
4] used a hierarchical RF classifier combined with Sentinel-2-derived spectral-temporal metrics (STMs) to map crops and intercropping systems in smallholder agricultural regions of Nigeria, achieving 72% overall classification accuracy. Jiang et al. [
5] developed a cropping-system-adapted decision tree model using Sentinel-2 imagery to conduct large-scale, high-resolution mapping of staple crops in major Chinese grain-producing plains, with an average Overall Accuracy (OA) of 94%. While conventional machine learning methods, such as pixel-based classifiers, have been effectively applied in crop mapping, they primarily rely on handcrafted spectral and spatial features. This approach can be susceptible to image noise and may not fully capture the complex, high-level semantic patterns present in heterogeneous landscapes. It is important to note that advanced machine learning techniques, including object-based image analysis (OBIA) and the integration of multi-source data (e.g., multi-temporal, multi-sensor fusion), have been developed to improve classification accuracy in fragmented agricultural regions. However, the increasing demand for automated, high-resolution, and large-scale crop mapping continues to drive the need for more robust and scalable algorithms, such as those offered by deep learning approaches.
Deep learning has demonstrated considerable potential in the domain of remote sensing image classification [
6]. Its core advantage over traditional machine learning lies in its ability to automatically identify and extract complex hierarchical features from data [
7], a capability that is particularly critical for the analysis of remote sensing images. Deep learning has emerged as a significant methodology in domains such as image classification, target detection, and recognition, particularly within the agricultural sector [
8,
9].
Thanks to the rapid development of Earth observation technology, remote sensing data have become an indispensable support for large-scale agricultural monitoring. A series of medium- and high-resolution satellites, including Sentinel, GF and Landsat, as well as commercial high-resolution images (e.g., WorldView), have been extensively utilized in crop monitoring, classification, and mapping tasks. Deep learning and high-resolution remote sensing images support each other and complement one another’s strengths, providing a new technical approach for high-precision crop remote sensing interpretation. For instance, Liu et al. [
10] proposed an advanced cultivated land extraction model for complex terrain, which adopts Gaofen-2 (GF-2) imagery and an improved U-Net architecture to realize 1 m resolution mapping, and fuses spectral features with vegetation index features to optimize the model input. Wang et al. [
11] developed a novel attention-based convolutional neural network (CNN) method (Geo-CBAM-CNN) for crop classification with time-series Sentinel-2 images. Du et al. [
12] adopted the U-Net semantic segmentation model to classify large-scale paddy fields in Arkansas, USA. This classification was achieved using multi-temporal Landsat data, and the resultant classification accuracy was favorable. Chen et al. [
13] conducted a comparative analysis of hydrothermal alteration mineral identification in the Duobuza porphyry copper mining area, Tibet, using high-spatial-resolution WorldView-3 data and high-spectral-resolution GF-5 data, and verified that combining these two types of datasets can yield more accurate mapping results.
Nevertheless, contemporary deep learning-based methodologies continue to encounter significant challenges in the realm of remote sensing-based extraction of crop planting areas. First, training samples are characterized by high cost and long acquisition cycles; second, remote sensing annotation of crops relies on professional agronomic knowledge [
14,
15]. In soybean planting area extraction, soybeans exhibit high similarity in spectral and texture features to corn, grassland, and other vegetation. This similarity makes accurate differentiation a difficult task [
16]. Therefore, under conditions of limited data volume, the model amplifies errors in the limited labeled samples, resulting in a significant decrease in feature learning efficiency and segmentation accuracy [
17]. This in turn severely limits the large-scale, high-precision application of remote sensing-based soybean planting area extraction.
Despite the continuous emergence of new high-performance deep learning architectures, obvious contradictions remain between existing models and the practical demands of agricultural remote sensing. For example, the Transformer model [
18] possesses the capability of processing multimodal data, including images, videos, text, and speech, and exhibits excellent scalability on ultra-large-scale networks and massive datasets. However, its excessive parameter count and high computational cost make it difficult to adapt to the actual deployment requirements of agricultural remote sensing. Recently, visual state space (VSS) models represented by Mamba have achieved long-range dependency modeling with linear computational complexity, providing a new approach to overcoming the limitations of traditional architectures. The dual-branch network RS3Mamba proposed by Ma et al. [
19] effectively improves the accuracy of remote sensing image segmentation by using VSS blocks to construct an auxiliary branch and adopting the feature fusion mechanism of the collaborative completion module (CCM). Nevertheless, the performance verification and optimization of this model still rely on large-scale labeled datasets and lack adaptive designs specialized for data-limited scenarios, so it cannot meet the demand for soybean planting area extraction under limited image availability and high labeling costs. The monitoring of soybean planting areas in North America is constrained by precisely these limitations: image acquisition is limited, and sample labeling remains expensive, forcing relevant research to be conducted as data-limited remote sensing image segmentation. Meanwhile, an excessively large number of model parameters tends to cause overfitting, thereby severely limiting the generalization performance of the model [
20].
Against this background, lightweight design focusing on reducing the number of parameters has become a key research direction for improving model performance in data-limited scenarios. Relevant methodological advances have mainly focused on optimizing existing segmentation models to adapt them to data-limited and low-computational-resource application scenarios. For instance, Wang et al. [
21] proposed a novel remote sensing image classification model, which they named MST-DeepLabv3+. This model is based on DeepLabv3+ and can achieve superior performance with fewer training parameters. Wang [
22] developed a convolutional neural network called Adaptive Feature Fusion UNet (AFF-UNet) to optimize semantic segmentation performance. To address the aforementioned challenges in data-limited remote sensing and the limitations of existing models, this study adopts the U-shaped encoder–decoder architecture (represented by U-Net) as the baseline network due to its strong feature fusion ability, moderate parameter size, and good adaptability to data-limited remote sensing segmentation tasks. On this basis, we propose a lightweight and efficient semantic segmentation model for soybean planting area extraction, named SimAM-integrated Statistical Dynamic Reconstruction Unit Network (SimSDRU-Net). Experiments are conducted using Sentinel-2 satellite images, with soybean plots labeled accurately using the USDA Cropland Data Layer (CDL) to construct the semantic segmentation dataset. Built on the PyTorch framework, the proposed model is applied to pixel-level classification for high-precision extraction of soybean planting areas.
2. Materials and Methods
2.1. Study Area
Peoria County and Menard County, located in the midwestern region of the United States, were selected as the study areas (
Figure 1). Specifically, Peoria County is located between 40°30′–40°58′ N and 89°26′–89°59′ W, while Menard County is situated between 39°54′–40°09′ N and 89°34′–89°59′ W. Both counties are located in the central part of Illinois (
Figure 1). Illinois is the second-largest agricultural state in the United States and is recognized as one of the core global regions for soybean production, occupying a crucial strategic position in terms of soybean yield and export volume. The study area has a temperate continental climate with abundant precipitation (an annual average of about 950 mm) and a mean annual temperature of 12.0 °C. The region's distinct four seasons provide favorable natural conditions for the cultivation of crops such as soybeans and corn.
2.2. Data Sources
2.2.1. Sentinel-2 Remote Sensing Data
This study primarily used Sentinel-2 satellite remote sensing data obtained from the Copernicus Open Access Hub of the European Space Agency (ESA). The Sentinel-2 satellite is equipped with the Multi-Spectral Instrument (MSI), which captures Earth’s surface images across 13 spectral bands. The spatial resolution of these bands ranges from 10 to 60 m, as detailed in
Table 1. Specifically, bands 2, 3, 4, and 8 offer a 10 m resolution. The satellite has an imaging swath of 290 km.
2.2.2. Auxiliary Geospatial Data
- (1) Cropland Data Layer (CDL)
The Cropland Data Layer (CDL) is an annual, crop-specific land cover raster dataset for the contiguous United States, produced and distributed by the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA). It is widely recognized as a high-precision benchmark dataset for crop classification in the global agricultural remote sensing community. Since 2008, the CDL has provided data at a 30 m spatial resolution. Its production integrates multi-source, medium- to high-resolution satellite time-series imagery with extensive agricultural ground truth data, such as the Farm Service Agency’s Common Land Unit (CLU) records. Crop identification is achieved through supervised classification algorithms, which differentiate dozens of major crops (e.g., corn, soybean, wheat, cotton) and non-cropland types. The reported classification accuracy for key field crops like soybean and corn exceeds 95% in major U.S. production regions.
The dataset supports diverse applications, including crop area estimation, land use monitoring, remote sensing model validation, and agricultural policy analysis. Beginning with partial coverage in 1997, the CDL has achieved annual, nationwide updates since 2008. Its data are openly shared in standardized, georeferenced raster formats, facilitating direct overlay and spatial analysis with other remote sensing products and GIS data. In this study, CDL land cover data from 2022 and 2023—covering classes such as soybean, corn, developed land, water, and grassland—were downloaded for Peoria and Menard Counties. These data served as authoritative auxiliary references for the manual annotation of soybean planting areas and for subsequent accuracy verification.
- (2) Administrative Boundary Vector Data
The administrative boundary vector data of Illinois State, Peoria County, and Menard County in the US adopted in this study were obtained from the geospatial information resource system. Having undergone georegistration and accuracy verification, the data are based on the WGS84 coordinate system (EPSG:4326), consistent with that of the Sentinel-2 remote sensing imagery. They can be directly used for image clipping and spatial matching, serving as the geographic benchmark for the spatial registration of multi-source data and ensuring spatial consistency between remote sensing imagery and auxiliary datasets such as the CDL.
- (3) HYBRID Basemap Data
In the processes of data preprocessing and spatial analysis, the HYBRID (Google LLC, Mountain View, CA, USA) basemap provided by the Google Earth Engine (GEE) platform (GEE, Google LLC, Mountain View, CA, USA;
https://earthengine.google.com/, accessed on 7 May 2025) was used as the geographic reference background in this study. Composed of the superposition of high-resolution real satellite imagery and standardized geographic vector annotations, the basemap shares the same source of geographic data with Google Earth (Google LLC, Mountain View, CA, USA) and Google Maps (Google LLC, Mountain View, CA, USA), featuring clear spatial positioning and authentic surface information. It can intuitively reflect the surface landscape, plot distribution and geographic location characteristics of the study area, offering visual references for confirming the study area boundary and checking the image coverage. Meanwhile, it provides a reliable geographic background support for the production of labels in the dataset.
2.2.3. Key Phase Selection and Remote Sensing Image Preprocessing
Soybean production in Illinois is characterized by a single-cropping system, with sowing concentrated from March to May each year. The key growth stages extend from May to September, encompassing seedling emergence, flowering, and pod setting; harvesting occurs in September and October. To enhance the generalization ability and classification fairness of the soybean planting area extraction model, this study uniformly deployed sample points across the Illinois study area based on the U.S. Department of Agriculture (USDA) Cropland Data Layer (CDL) (
Figure 2). We then proceeded to analyze the temporal variation characteristics and monthly correlation of the Normalized Difference Vegetation Index (NDVI) between soybeans and other land cover types (e.g., corn, bare soil) (
Figure 3). The primary objective of this study is to ascertain the period during which spectral discrimination between soybeans and other land cover types is at its zenith. This will provide a quantitative foundation for the selection of optimal remote sensing image acquisition time phases, thereby enhancing the extraction accuracy.
Sentinel-2 remote sensing images were acquired from the Copernicus Data Space Ecosystem (
https://dataspace.copernicus.eu/ (accessed on 10 May 2025)), with the preprocessed L2A-level surface reflectance products prioritized for use. These products have undergone radiometric calibration, atmospheric correction, topographic correction, and cloud-snow masking, providing a high-precision reflectance foundation for subsequent quantitative remote sensing analyses. Based on the analysis results of NDVI time-series characteristics in
Figure 3, soybeans exhibited a significant spectral difference from other concurrent crops (e.g., corn) in early September, which was identified as the optimal time phase for high-precision segmentation and extraction of soybean planting areas. Therefore, we screened all available Sentinel-2 images covering the study areas (Peoria County and Menard County, Illinois, USA) from 1 to 15 September 2023. Cloud-contaminated images were excluded through quantitative cloud cover assessment to ensure that the cloud cover rate in the study areas was below 10%, and a comprehensive evaluation of image quality was conducted via visual interpretation. Finally, the images captured on 10 September 2023 were selected as the core data source: these images featured low cloud cover and optimal quality, and their true-color characteristics showed no significant difference from those in the period of 1–9 September, fully matching the optimal spectral identification window in early September. Meanwhile, soybeans in the study areas were at the seed-filling and maturation stage during this period.
To ensure the complete spatial coverage of the study areas and the precise matching between image boundaries and geographical ranges, multiple scenes of images covering Peoria County and Menard County on the aforementioned date were downloaded. Image mosaic, spatial registration, and precise clipping based on administrative boundaries were completed on the ArcGIS 10.2 platform. The true-color images, synthesized from three original 10 m resolution bands (B2 for blue light, B3 for green light, and B4 for red light), were adopted as the core data for the study. Ultimately, a high-quality image dataset with a unified spatial datum, complete coverage, and spectral fidelity was generated, which provided reliable data support for the subsequent precise extraction of soybean planting areas and the construction of classification models.
2.3. Construction of the Labeled Dataset
This study used Sentinel-2 remote sensing images covering Peoria County and Menard County in Illinois, USA, as the core data source. Specifically, one preprocessed Sentinel-2 L2A product was selected for each of the two study counties, totaling two images; these were the final products generated through the image screening and preprocessing procedures detailed in
Section 2.2.3, acquired on 10 September 2023, and downloaded from the Copernicus Open Access Hub operated by the European Space Agency (ESA). The specific process of image processing and label production is as follows:
First, the Sentinel-2 images covering Peoria County and Menard County in the study area were loaded separately with the help of the ArcGIS platform. A vector polygon layer consistent with the scope and size of the study area was created within the platform to ensure that the number of grid pixels of the labeled grid was consistent with that of the original image.
Relying on the polygon drawing tool integrated in the Editor toolbar, systematic manual visual interpretation was carried out based on the original remote sensing images. During the interpretation process, the Cropland Data Layer (CDL) of the United States Department of Agriculture (USDA) and high-resolution Google Earth images were introduced as auxiliary reference materials to accurately outline the vector boundaries of soybean planting areas and complete the preliminary vectorization labeling of soybean planting areas.
The label data adopted a binarization labeling criterion, where the pixel value of the target soybean area was set to 1, and the pixel value of the non-soybean background area was set to 0. After the vectorization labeling was completed, the vector data of all soybean blocks were converted into raster data format (TIFF format) and exported through relevant ArcGIS tools.
Subsequently, the exported TIFF format label images were converted into PNG format, and the original Sentinel-2 remote sensing images (TIF format) were converted into JPG format for storage to ensure complete spatial registration between the original images and the corresponding label data, effectively avoiding potential spatial misalignment problems that may occur in the subsequent block processing and model training processes.
Finally, a sliding window cropping method was employed to perform block segmentation on the registered original images and label data, with the window overlap rate set to 30%. All data were uniformly cropped into 256-pixel × 256-pixel image patches, ensuring that the cropped patches could fully retain the spectral and spatial characteristics of soybean plots.
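The sliding-window cropping step can be sketched as follows. This is a minimal illustration, not the actual processing script; note that a 30% overlap corresponds to a stride of about 179 pixels for 256-pixel windows:

```python
import numpy as np

def sliding_window_patches(image, label, patch=256, overlap=0.3):
    """Crop co-registered image/label rasters into overlapping square patches."""
    stride = max(1, int(round(patch * (1.0 - overlap))))  # 256 * 0.7 -> 179 px
    h, w = image.shape[:2]
    pairs = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            pairs.append((image[top:top + patch, left:left + patch],
                          label[top:top + patch, left:left + patch]))
    return pairs

img = np.zeros((512, 512, 3), dtype=np.uint8)   # toy stand-in for an image tile
msk = np.zeros((512, 512), dtype=np.uint8)      # toy binary soybean mask
pairs = sliding_window_patches(img, msk)
print(len(pairs), pairs[0][0].shape)            # 4 patches of shape (256, 256, 3)
```

Cropping the image and its label with identical window coordinates preserves the pixel-level correspondence established during registration.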
In order to expand the scale of the dataset, enhance the model’s ability to generalize, and avoid overfitting [
23], operations of data augmentation were performed on the original images and their corresponding labeled images. Specifically, five transformation methods were adopted, including rotations by 90°, 180°, and 270°, as well as horizontal flipping and vertical flipping (
Figure 4).
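The five augmentation transforms, applied to each image patch and its label in lockstep, can be sketched as below (an illustrative version; the 6x expansion it produces is consistent with the dataset growth from 783 to 4698 pairs reported next):

```python
import numpy as np

def augment_pair(image: np.ndarray, label: np.ndarray):
    """Return the original pair plus five transformed variants:
    rotations by 90/180/270 degrees, horizontal flip, and vertical flip."""
    variants = [(image, label)]
    for k in (1, 2, 3):  # 90, 180, 270 degree rotations
        variants.append((np.rot90(image, k), np.rot90(label, k)))
    variants.append((np.fliplr(image), np.fliplr(label)))  # horizontal flip
    variants.append((np.flipud(image), np.flipud(label)))  # vertical flip
    return variants

patch = np.arange(16).reshape(4, 4)          # toy 4x4 image patch
mask = (patch > 7).astype(np.uint8)          # toy binary label
out = augment_pair(patch, mask)
print(len(out))  # 6 image-label pairs (original + 5 augmented)
```

Applying the same geometric transform to image and label together keeps every augmented pair spatially consistent.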
The cropped image patches from Peoria County were utilized as training data, yielding a total of 783 image-label pairs with a size of 256 × 256 pixels. Following the implementation of data augmentation, the dataset was augmented, resulting in a total of 4698 pairs. These pairs were then randomly allocated to a training set and a validation set, with a ratio of 9:1. The cropped image patches from Menard County were utilized as independent test data, culminating in 375 image patches of 256 × 256 pixels.
2.4. The VGG-16-Based SimSDRU-Net
Semantic segmentation models, particularly for high-resolution remote sensing imagery, must balance representation capacity and computational efficiency. Conventional convolutional layers often generate redundant feature channels, which not only increases computational burden but may also dilute salient information. Conversely, over-aggressive channel reduction can degrade boundary details and texture fidelity, compromising segmentation accuracy. Additionally, many attention mechanisms depend on parameterized modules, limiting their applicability on resource-limited devices. To mitigate these issues, a novel lightweight module named the SimAM-integrated Statistical Dynamic Reconstruction Unit (SimSDRU)—which integrates a parameter-free Statistical Dynamic Reconstruction Unit (S-DRU) with a parameter-free attention mechanism (SimAM)—is proposed. Based on this module, SimSDRU-Net, which employs VGG-16 as its backbone, is further introduced.
2.4.1. Overall Architecture
SimSDRU-Net adopts a classic symmetric encoder–decoder (U-Net) architecture, with its overall structure illustrated in
Figure 5. The encoder is responsible for extracting multi-level, deep semantic features from the input image, while the decoder progressively restores spatial details and generates the precise segmentation map.
Encoder: VGG-16 Backbone with Transfer Learning. This study employs the classic VGG-16 network [
24] as the backbone encoder. Recognized for its regular architecture—characterized by consecutive stacks of 3 × 3 convolutional kernels and max-pooling layers—VGG-16 (
Figure 6) demonstrates strong feature learning capability, straightforward trainability, and favorable transferability. To adapt it for the semantic segmentation task, the original classifier components, including the terminal adaptive average pooling layer and fully connected layers, are removed. Only the convolutional and pooling layers that constitute the feature extraction module are retained, thereby eliminating redundant parameters associated with classification and ensuring focus on the soybean segmentation objective. For an input image of size 256 × 256 × 3, the encoder processes the data through a sequence of operations: two 3 × 3 convolutions (64 channels) followed by pooling, two 3 × 3 convolutions (128 channels) followed by pooling, three 3 × 3 convolutions (256 channels) followed by pooling, and two stages of three 3 × 3 convolutions (512 channels), each followed by pooling. This pipeline finally yields a deep feature map of size 8 × 8 × 512. By loading weights pre-trained on the ImageNet dataset, the encoder transfers general visual knowledge, which effectively mitigates overfitting commonly encountered when training on small-scale remote sensing datasets. During the initial training phase, the backbone parameters are frozen to stabilize feature extraction, followed by gradual unfreezing for fine-tuning.
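The freeze-then-unfreeze schedule described above can be sketched in PyTorch as follows. This is a toy illustration with placeholder encoder/decoder modules, not the actual SimSDRU-Net code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the VGG-16 encoder and the SimSDRU decoder.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.MaxPool2d(2))
decoder = nn.Sequential(nn.Upsample(scale_factor=2),
                        nn.Conv2d(64, 1, 3, padding=1))

def set_backbone_frozen(frozen: bool) -> None:
    """Phase 1 (early epochs): frozen=True, only the decoder is trained.
    Phase 2 (later epochs): frozen=False, the whole network is fine-tuned."""
    for p in encoder.parameters():
        p.requires_grad = not frozen

set_backbone_frozen(True)  # stabilize pre-trained features first
trainable = [p for p in list(encoder.parameters()) + list(decoder.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, betas=(0.9, 0.999))
print(len(trainable))  # only decoder parameters remain trainable
```

After the frozen phase, calling `set_backbone_frozen(False)` and rebuilding the optimizer over all parameters switches to joint fine-tuning.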
Decoder: A feature refinement and reconstruction module with SimSDRU as its core. The core of the decoder is the proposed SimSDRU module in this paper. It is embedded in the key upsampling and feature fusion paths of the decoder, and is used to efficiently process deep features from the encoder and shallow features transmitted through skip connections. This module works through its two internal collaborative components: (1) Statistical Dynamic Reconstruction Unit, which realizes lightweight spatial feature reconstruction based on the statistical distribution of feature maps, suppresses redundant channels, reduces computational load, and retains key semantic information; (2) SimAM attention mechanism, which adaptively assigns spatial weights through neuron importance statistics, guiding the model to focus on key regions such as soybean planting areas, thereby improving the robustness and discriminative power of feature representation. The decoder gradually restores the details of target boundaries through multiple upsampling and fusion with the features of the corresponding layers of the encoder (processed by SimSDRU), and finally outputs a semantic segmentation map with the same resolution as the input.
2.4.2. S-DRU
In the process of extracting soybeans from remote sensing images, the skip connections of the decoder must first concatenate and fuse the multi-scale features captured by the encoder with the upsampled features. These fused features are then fed into subsequent modules for processing. The primary objective of this operation is to restore the fine spatial details of soybean plots based on the fused features while improving the segmentation accuracy of complex planting areas.
However, prevailing methods have clear limitations. In concatenated feature maps, multiple channels often respond repetitively to similar features, leading to significant information redundancy. Furthermore, soybean planting exhibits two distinct patterns: large-scale contiguous distribution and small-area scattered growth. Traditional convolutional operations can extract information from concatenated feature maps, but their parameter redundancy makes them ill-suited to the practical requirements of lightweight extraction.
To address these issues, this study proposes a parameter-free statistical dynamic reconstruction unit (S-DRU) with an extremely lightweight design. Following the SRU unit [25], it automatically identifies and selects key channels related to soybeans from statistics computed on the concatenated feature maps, and accomplishes a complementary fusion of features by cross-recombining feature components with different weights. The specific structure of S-DRU is illustrated in Figure 7.
In the S-DRU, an input feature map $X \in \mathbb{R}^{C \times H \times W}$ is initially processed, where $C$, $H$, and $W$ represent the number of input channels, height, and width, respectively. A grouping operation is then applied: the channels of the input feature map are divided into $g$ groups, with the number of groups determined adaptively. Each group contains $C_g$ channels, as illustrated in Formula (1):

$$C_g = \frac{C}{g} \tag{1}$$

Subsequently, independent normalization is applied to each group of feature maps, as expressed in Formula (2):

$$\hat{X} = \frac{X - \mu_g}{\sqrt{\sigma_g^2 + \varepsilon}} \tag{2}$$

where $\mu_g$ represents the group mean, $\sigma_g$ denotes the group standard deviation, and $\varepsilon$ is a tiny positive constant added to ensure the stability of the division operation. The groups are subsequently merged back into the initial configuration, yielding $\hat{X} \in \mathbb{R}^{C \times H \times W}$. The absolute mean value of each channel $c$ is calculated as delineated in Formula (3), yielding the channel statistic $s_c$:

$$s_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| \hat{X}_{c,i,j} \right| \tag{3}$$

Next, the weights of each channel are normalized, as illustrated in Formula (4), where the final output is $w_c$:

$$w_c = \frac{s_c}{\sum_{k=1}^{C} s_k} \tag{4}$$

The normalized input feature $\hat{X}$ is multiplied by its corresponding channel weight $w_c$, and the resulting product is compressed to the range of 0 to 1 using the Sigmoid function. This process generates the reweighting map $R$.
The representation of effective features is then enhanced through dynamic weight allocation and cross-branch feature reconstruction, which suppresses redundant information, as outlined in Formula (5). In this context, $R$ denotes the reweights, $\otimes$ indicates element-wise multiplication, $\oplus$ signifies element-wise addition, $\mathrm{Split}$ refers to equal channel splitting, and $\mathrm{Concat}$ represents channel concatenation. The following steps yield the final output feature map $Y$:

$$
\begin{aligned}
X^{1} &= R \otimes \hat{X}, \qquad X^{2} = (1 - R) \otimes \hat{X}, \\
X^{11}, X^{12} &= \mathrm{Split}\left(X^{1}\right), \qquad X^{21}, X^{22} = \mathrm{Split}\left(X^{2}\right), \\
Y &= \mathrm{Concat}\left(X^{11} \oplus X^{22},\; X^{21} \oplus X^{12}\right)
\end{aligned}
\tag{5}
$$
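A parameter-free numpy sketch of the S-DRU computation described above might look like the following. This is an illustrative reading of the statistics-based reweighting and cross-reconstruction, not the authors' released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_dru(x: np.ndarray, groups: int = 4, eps: float = 1e-5) -> np.ndarray:
    """Parameter-free statistical dynamic reconstruction of a (C, H, W) map."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)            # Formula (1): Cg = C/g
    mu = g.mean(axis=(1, 2, 3), keepdims=True)          # group mean
    sd = g.std(axis=(1, 2, 3), keepdims=True)           # group standard deviation
    xn = ((g - mu) / np.sqrt(sd**2 + eps)).reshape(c, h, w)  # Formula (2)

    s = np.abs(xn).mean(axis=(1, 2))                    # Formula (3): |mean| per channel
    wgt = s / s.sum()                                   # Formula (4): normalized weights
    r = sigmoid(wgt[:, None, None] * xn)                # reweighting map in (0, 1)

    x1, x2 = r * xn, (1.0 - r) * xn                     # Formula (5): two branches
    a, b = np.split(x1, 2, axis=0)                      # equal channel split
    d, e = np.split(x2, 2, axis=0)
    return np.concatenate([a + e, d + b], axis=0)       # cross-add, then concat

feat = np.random.default_rng(0).normal(size=(8, 4, 4))
print(s_dru(feat).shape)  # shape is preserved: (8, 4, 4)
```

Because every operation is a statistic of the input itself, the unit adds no learnable parameters and its output keeps the input's channel count and spatial size.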
2.4.3. SimAM
SimAM [
26] is a parameter-free attention mechanism based on an energy function. It can adaptively generate spatial attention weights for feature maps without introducing additional learnable parameters. Its premise is rooted in saliency detection theory from neuroscience, which holds that image regions differing markedly from their immediate surroundings tend to carry more information. SimAM quantifies the importance of each position in the feature map by computing its statistical difference from the global features, enhancing key regions and suppressing non-critical ones. The structural configuration of SimAM is delineated in
Figure 8.
First, for a given input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively, the spatial mean value of each channel is calculated to characterize the overall response level of that channel, as shown in Formula (6):

$$\mu_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j} \tag{6}$$

Subsequently, the energy value $e_{c,i,j}$ of each position is calculated relative to the mean value of its corresponding channel. This value reflects the local saliency of the position, as expressed in Formula (7):

$$e_{c,i,j} = \left( X_{c,i,j} - \mu_c \right)^2 \tag{7}$$

Next, the spatial variance of each channel is computed, as illustrated in Formula (8):

$$\sigma_c^2 = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( X_{c,i,j} - \mu_c \right)^2 \tag{8}$$

Attention weights are derived through a process of energy normalization and offset, as delineated in Formula (9). In this equation, the numerical stability constant $\varepsilon$ is defined as $5 \times 10^{-4}$, and the offset term 0.5 is incorporated to guarantee that the output falls within a reasonable range:

$$E_{c,i,j} = \frac{e_{c,i,j}}{4 \left( \sigma_c^2 + \varepsilon \right)} + 0.5 \tag{9}$$

The Sigmoid function is employed to map the energy values to the range (0, 1), and Dropout regularization applied during the training phase yields the final attention weights $a$. The original features are multiplied by the attention weights in an element-wise manner to generate the enhanced features, as depicted in Formula (10):

$$\tilde{X} = X \otimes a, \qquad a = \mathrm{Dropout}\left( \mathrm{Sigmoid}(E) \right) \tag{10}$$
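Formulas (6)-(10) can be sketched in numpy as follows. This is an inference-time illustration (so Dropout is omitted, and the population form of the variance is used for brevity), not the authors' code:

```python
import numpy as np

def simam(x: np.ndarray, eps: float = 5e-4) -> np.ndarray:
    """Parameter-free SimAM attention over a (C, H, W) feature map."""
    mu = x.mean(axis=(1, 2), keepdims=True)       # Formula (6): channel mean
    e = (x - mu) ** 2                             # Formula (7): positional energy
    var = e.mean(axis=(1, 2), keepdims=True)      # Formula (8): channel variance
    energy = e / (4.0 * (var + eps)) + 0.5        # Formula (9): normalize + offset
    a = 1.0 / (1.0 + np.exp(-energy))             # Sigmoid -> attention in (0, 1)
    return x * a                                  # Formula (10): element-wise scaling

feat = np.random.default_rng(1).normal(size=(4, 8, 8))
out = simam(feat)
print(out.shape)  # (4, 8, 8)
```

Positions whose responses deviate strongly from their channel mean receive energies above 0.5 and therefore larger attention weights, which is how salient regions such as soybean plots are emphasized.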
2.5. Evaluation Metrics
In the context of semantic segmentation tasks, OA, Mean Pixel Accuracy (MPA), and Mean Intersection over Union (MIoU) are adopted as the evaluation metrics for soybean segmentation performance. The OA is defined as the proportion of correctly classified pixels among the total pixels in an image, as delineated in Formula (11). The MPA is calculated by first determining the pixel accuracy for each class and then averaging the results across all classes, as expressed in Formula (12). The MIoU is calculated by determining, for each class, the ratio of the intersection to the union of the predicted and ground-truth regions, and then averaging this ratio across all classes, as illustrated in Formula (13):

OA = Σ_{i=1}^{k} p_{ii} / Σ_{i=1}^{k} Σ_{j=1}^{k} p_{ij}  (11)

MPA = (1/k) Σ_{i=1}^{k} [ p_{ii} / Σ_{j=1}^{k} p_{ij} ]  (12)

MIoU = (1/k) Σ_{i=1}^{k} [ p_{ii} / (Σ_{j=1}^{k} p_{ij} + Σ_{j=1}^{k} p_{ji} − p_{ii}) ]  (13)

where k denotes the total number of classes in the segmentation task, with class indices i, j ∈ {1, 2, …, k}; p_{ij} represents the number of pixels belonging to class i that are predicted as class j (where i = j indicates correct classification and i ≠ j indicates misclassification); p_{ii} refers to the number of correctly predicted pixels in class i; and p_{ji} denotes the number of pixels misclassified from class j to class i.
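These three metrics can all be read off a single confusion matrix, as the following sketch shows (the function name and layout are illustrative, not from the paper):

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute OA, MPA, and MIoU from a k-by-k confusion matrix whose
    entry conf[i, j] counts pixels of class i predicted as class j,
    following Formulas (11)-(13)."""
    tp = np.diag(conf).astype(float)       # correctly classified pixels p_ii
    per_class_gt = conf.sum(axis=1)        # all pixels labelled class i
    per_class_pred = conf.sum(axis=0)      # all pixels predicted as class i
    oa = tp.sum() / conf.sum()                                   # Formula (11)
    mpa = np.mean(tp / per_class_gt)                             # Formula (12)
    miou = np.mean(tp / (per_class_gt + per_class_pred - tp))    # Formula (13)
    return oa, mpa, miou
```

For example, the binary confusion matrix [[3, 1], [0, 4]] yields OA = 0.875, MPA = 0.875, and MIoU = 0.775.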
2.6. Experimental Environment Setup
All experiments were conducted on a workstation equipped with an AMD Ryzen 9 7940HX with Radeon Graphics (2.50 GHz) CPU (Advanced Micro Devices, Inc., Sunnyvale, CA, USA) and an NVIDIA GeForce RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA), running the Windows 10 operating system (Microsoft Corporation, Redmond, WA, USA). The software environment comprised CUDA 10.4 (NVIDIA Corporation), Python 3.8.20 (Python Software Foundation, Wilmington, DE, USA), and the deep learning framework PyTorch 1.8.0 (Meta Platforms, Inc., Menlo Park, CA, USA).

To address the issue of positive-negative sample imbalance in soybean segmentation, a hybrid loss function combining Dice Loss and Focal Loss was adopted. During model training, a cosine annealing learning rate scheduling strategy was implemented, with an initial learning rate of 1 × 10⁻⁴ and a minimum learning rate of 1 × 10⁻⁶. The Adam optimizer was utilized with β₁ = 0.9, the total number of training epochs was set to 100, and the batch size was configured as 4. Specifically, the backbone network was frozen for the first 50 epochs, with only the decoder and segmentation output layer being trained; in the subsequent 50 epochs, the backbone was unfrozen to perform joint fine-tuning of the entire network.
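The loss and optimization settings above can be sketched in PyTorch as follows. Note the assumptions: the paper does not specify the weighting between the Dice and Focal terms (they are summed with equal weight here), the Focal hyperparameters α and γ are common defaults rather than reported values, and the 1 × 1 convolution stands in for the actual segmentation network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_focal_loss(logits, target, alpha=0.25, gamma=2.0, smooth=1.0):
    """Hybrid Dice + Focal loss for binary soybean/background masks.
    A sketch: the two terms are summed with equal weight, which is an
    assumption, as the paper does not state its weighting."""
    prob = torch.sigmoid(logits)
    # Dice term: penalizes low overlap between prediction and ground truth
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + smooth) / (prob.sum() + target.sum() + smooth)
    # Focal term: down-weights easy pixels to counter class imbalance
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pt = torch.exp(-bce)  # model confidence in the true class
    focal = (alpha * (1.0 - pt) ** gamma * bce).mean()
    return dice + focal

# Optimizer and cosine-annealing schedule matching the stated settings;
# the 1x1 conv is a placeholder for the segmentation model.
model = nn.Conv2d(3, 1, kernel_size=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)
```

The freeze-then-fine-tune strategy would then toggle `requires_grad` on the backbone's parameters at epoch 50 before resuming training.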
4. Conclusions
In this study, a lightweight U-shaped semantic segmentation network, SimSDRU-Net, is proposed to accurately extract soybean planting areas from remote sensing images with limited data. Based on NDVI time-series variation in spectral signatures, early September is determined as the optimal phenological window for soybean identification, and Sentinel-2 images of Menard County, Illinois, USA are used as the independent test area. The model adopts a classic U-Net encoder–decoder structure with an ImageNet pre-trained VGG-16 encoder, a decoder enhanced by the SimAM attention mechanism and the S-DRU feature reconstruction unit, and a freeze-then-fine-tune training strategy. Experimental results show that, compared with random initialization, the ImageNet pre-trained VGG-16 significantly improves the performance of baseline models including U-Net, U-Net++, and DeepLabv3+. The integrated SimSDRU-Net achieves superior performance on the independent test set (MIoU 89.03%, MPA 93.81%, OA 95.96%) and outperforms all benchmark models. Ablation studies verify that both the SimAM and S-DRU modules effectively improve segmentation performance, providing an effective and efficient solution for high-precision soybean mapping under limited data conditions.
Given the current limitations in terms of computational cost and data dependency, future research can focus on two key directions. First, to better exploit the multispectral capability of Sentinel-2, studies can explore optimal band combinations for segmentation and integrate vegetation indices (e.g., NDVI, EVI, NDRE, NDWI) for in-depth feature mining. On this basis, efficient band selection and dimensionality reduction methods can be developed to improve spectral utilization and reduce computational costs. Second, to improve cross-region robustness, the proposed lightweight SimSDRU-Net segmentation method can be combined with unsupervised domain adaptation (UDA) techniques. UDA can leverage knowledge from annotated “source regions” to adaptively learn and transfer features to unannotated or sparsely annotated “new target regions”, thereby optimizing the model to adapt to diverse agricultural landscapes worldwide.