1. Introduction
Soybean is a pivotal food crop and a foundational raw material for the food industry. Historically, China has maintained a substantial reliance on international soybean imports. The import volume of soybeans exhibited a marked increase from 10.4 million tons in 2000 to 100.3 million tons in 2020, with the import dependency rate rising from 46.2% to 83.6% [
1]. Concurrently, the United States maintains a predominant presence in the international soybean trade, while advancements in crop breeding technologies are occurring at an accelerated pace [
2]. The rapid and accurate acquisition of soybean planting area information in North America is of significant importance. It can provide data support for the development of soybean futures trading and agricultural insurance businesses in China. Furthermore, it can serve as a scientific basis for formulating policies to ensure soybean food security.
In recent years, conventional machine learning algorithms, including Random Forest (RF), Support Vector Machine (SVM), and decision trees, have been widely adopted for crop mapping and type identification. As validated by Sheykhmousa et al. [
3] via meta-analysis of 251 studies, RF and SVM excel in modeling high-dimensional spectral data and separating spectrally similar vegetation. For example, Ibrahim et al. [
4] used a hierarchical RF classifier combined with Sentinel-2-derived spectral-temporal metrics (STMs) to map crops and intercropping systems in smallholder agricultural regions of Nigeria, achieving 72% overall classification accuracy. Jiang et al. [
5] developed a cropping-system-adapted decision tree model using Sentinel-2 imagery to conduct large-scale, high-resolution mapping of staple crops in major Chinese grain-producing plains, with an average Overall Accuracy (OA) of 94%. While conventional machine learning methods, such as pixel-based classifiers, have been effectively applied in crop mapping, they primarily rely on handcrafted spectral and spatial features. This approach can be susceptible to image noise and may not fully capture the complex, high-level semantic patterns present in heterogeneous landscapes. It is important to note that advanced machine learning techniques, including object-based image analysis (OBIA) and the integration of multi-source data (e.g., multi-temporal, multi-sensor fusion), have been developed to improve classification accuracy in fragmented agricultural regions. However, the increasing demand for automated, high-resolution, and large-scale crop mapping continues to drive the need for more robust and scalable algorithms, such as those offered by deep learning approaches.
Deep learning has demonstrated considerable potential in the domain of remote sensing image classification [
6]. Its core advantage over traditional machine learning lies in its ability to automatically identify and extract complex hierarchical features from data [
7], a capability that is particularly critical for the analysis of remote sensing images. Deep learning has emerged as a significant methodology in domains such as image classification, target detection, and recognition, particularly within the agricultural sector [
8,
9].
Thanks to the rapid development of Earth observation technology, remote sensing data have become an indispensable support for large-scale agricultural monitoring. A series of medium- and high-resolution satellites, including Sentinel, GF and Landsat, as well as commercial high-resolution images (e.g., WorldView), have been extensively utilized in crop monitoring, classification, and mapping tasks. Deep learning and high-resolution remote sensing images support each other and complement one another’s strengths, providing a new technical approach for high-precision crop remote sensing interpretation. For instance, Liu et al. [
10] proposed an advanced cultivated land extraction model for complex terrain, which adopts Gaofen-2 (GF-2) imagery and an improved U-Net architecture to realize 1 m resolution mapping, and fuses spectral features with vegetation index features to optimize the model input. Wang et al. [
11] developed a novel attention-based convolutional neural network (CNN) method (Geo-CBAM-CNN) for crop classification with time-series Sentinel-2 images. Du et al. [
12] adopted the U-Net semantic segmentation model to classify large-scale paddy fields in Arkansas, USA. This classification was achieved using multi-temporal Landsat data, and the resultant classification accuracy was favorable. Chen et al. [
13] conducted a comparative analysis of hydrothermal alteration mineral identification in the Duobuza porphyry copper mining area, Tibet, using high-spatial-resolution WorldView-3 data and high-spectral-resolution GF-5 data, and verified that combining these two types of datasets can yield more accurate mapping results.
Nevertheless, contemporary deep learning-based methodologies continue to encounter significant challenges in the realm of remote sensing-based extraction of crop planting areas. First, training samples are characterized by high cost and long acquisition cycles; second, remote sensing annotation of crops relies on professional agronomic knowledge [
14,
15]. In soybean planting area extraction, soybeans exhibit high similarity in spectral and texture features to corn, grassland, and other vegetation. This similarity makes accurate differentiation a difficult task [
16]. Therefore, under conditions of limited data volume, the model amplifies errors in the limited labeled samples, resulting in a significant decrease in feature learning efficiency and segmentation accuracy [
17]. This in turn severely limits the large-scale, high-precision application of remote sensing-based soybean planting area extraction.
Despite the continuous emergence of new high-performance deep learning architectures, obvious contradictions remain between existing models and the practical demands of agricultural remote sensing. For example, the Transformer model [
18] possesses the capability of processing multimodal data, including images, videos, text, and speech, and exhibits excellent scalability on ultra-large-scale networks and massive datasets. However, its excessive parameter count and high computational cost make it difficult to adapt to the actual deployment requirements of agricultural remote sensing. Recently, visual state space (VSS) models represented by Mamba have achieved long-range dependency modeling with linear computational complexity, providing a new approach to overcoming the limitations of traditional architectures. The dual-branch network RS3Mamba proposed by Ma et al. [
19] effectively improves the accuracy of remote sensing image segmentation by using VSS blocks to construct an auxiliary branch and adopting the feature fusion mechanism of the collaborative completion module (CCM). Nevertheless, the performance verification and optimization of this model still rely on large-scale labeled datasets and lack adaptive designs specialized for data-limited scenarios, so it cannot meet the demand for soybean planting area extraction under limited image availability and high labeling costs. The monitoring of soybean planting areas in North America is constrained by precisely these limitations: image acquisition is limited, and sample labeling remains expensive, forcing relevant research to be conducted as data-limited remote sensing image segmentation. Meanwhile, an excessively large number of model parameters tends to cause overfitting, thereby severely limiting the generalization performance of the model [
20].
Against this background, lightweight design focusing on reducing the number of parameters has become a key research direction for improving model performance in data-limited scenarios. Relevant methodological advances have mainly focused on optimizing existing segmentation models to adapt them to data-limited and low-computational-resource application scenarios. For instance, Wang et al. [
21] proposed a novel remote sensing image classification model, which they named MST-DeepLabv3+. This model is based on DeepLabv3+ and can achieve superior performance with fewer training parameters. Wang [
22] developed a convolutional neural network called Adaptive Feature Fusion UNet (AFF-UNet) to optimize semantic segmentation performance. To address the aforementioned challenges in data-limited remote sensing and the limitations of existing models, this study adopts the U-shaped encoder–decoder architecture (represented by U-Net) as the baseline network due to its strong feature fusion ability, moderate parameter size, and good adaptability to data-limited remote sensing segmentation tasks. On this basis, we propose a lightweight and efficient semantic segmentation model for soybean planting area extraction, named SimAM-integrated Statistical Dynamic Reconstruction Unit Network (SimSDRU-Net). Experiments are conducted using Sentinel-2 satellite images, with soybean plots labeled accurately using the USDA Cropland Data Layer (CDL) to construct the semantic segmentation dataset. Built on the PyTorch framework, the proposed model is applied to pixel-level classification for high-precision extraction of soybean planting areas.
2. Materials and Methods
2.1. Study Area
Peoria County and Menard County, located in the midwestern region of the United States, were selected as the study areas (
Figure 1). Specifically, Peoria County is located between 40°30′–40°58′ N and 89°26′–89°59′ W, while Menard County is situated between 39°54′–40°09′ N and 89°34′–89°59′ W. Both counties are located in the central part of Illinois (
Figure 1). Illinois is the second-largest agricultural state in the United States and is recognized as one of the core global regions for soybean production, occupying a crucial strategic position in terms of soybean yield and export volume. The study area has a temperate continental climate with abundant precipitation (an annual average of about 950 mm) and a mean annual temperature of 12.0 °C. The region's distinct four seasons provide favorable natural conditions for the cultivation of crops such as soybeans and corn.
2.2. Data Sources
2.2.1. Sentinel-2 Remote Sensing Data
This study primarily used Sentinel-2 satellite remote sensing data obtained from the Copernicus Open Access Hub of the European Space Agency (ESA). The Sentinel-2 satellite is equipped with the Multi-Spectral Instrument (MSI), which captures Earth’s surface images across 13 spectral bands. The spatial resolution of these bands ranges from 10 to 60 m, as detailed in
Table 1. Specifically, bands 2, 3, 4, and 8 offer a 10 m resolution. The satellite has an imaging swath of 290 km.
2.2.2. Auxiliary Geospatial Data
- (1) Cropland Data Layer (CDL)
The Cropland Data Layer (CDL) is an annual, crop-specific land cover raster dataset for the contiguous United States, produced and distributed by the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA). It is widely recognized as a high-precision benchmark dataset for crop classification in the global agricultural remote sensing community. Since 2008, the CDL has provided data at a 30 m spatial resolution. Its production integrates multi-source, medium- to high-resolution satellite time-series imagery with extensive agricultural ground truth data, such as the Farm Service Agency’s Common Land Unit (CLU) records. Crop identification is achieved through supervised classification algorithms, which differentiate dozens of major crops (e.g., corn, soybean, wheat, cotton) and non-cropland types. The reported classification accuracy for key field crops like soybean and corn exceeds 95% in major U.S. production regions.
The dataset supports diverse applications, including crop area estimation, land use monitoring, remote sensing model validation, and agricultural policy analysis. Beginning with partial coverage in 1997, the CDL has achieved annual, nationwide updates since 2008. Its data are openly shared in standardized, georeferenced raster formats, facilitating direct overlay and spatial analysis with other remote sensing products and GIS data. In this study, CDL land cover data from 2022 and 2023—covering classes such as soybean, corn, developed land, water, and grassland—were downloaded for Peoria and Menard Counties. These data served as authoritative auxiliary references for the manual annotation of soybean planting areas and for subsequent accuracy verification.
- (2) Administrative Boundary Vector Data
The administrative boundary vector data of Illinois State, Peoria County, and Menard County in the US adopted in this study were obtained from the geospatial information resource system. Having undergone georegistration and accuracy verification, the data are based on the WGS84 coordinate system (EPSG:4326), consistent with that of the Sentinel-2 remote sensing imagery. They can be directly used for image clipping and spatial matching, serving as the geographic benchmark for the spatial registration of multi-source data and ensuring spatial consistency between remote sensing imagery and auxiliary datasets such as the CDL.
- (3) HYBRID Basemap Data
In the processes of data preprocessing and spatial analysis, the HYBRID (Google LLC, Mountain View, CA, USA) basemap provided by the Google Earth Engine (GEE) platform (GEE, Google LLC, Mountain View, CA, USA;
https://earthengine.google.com/, accessed on 7 May 2025) was used as the geographic reference background in this study. Composed of the superposition of high-resolution real satellite imagery and standardized geographic vector annotations, the basemap shares the same source of geographic data with Google Earth (Google LLC, Mountain View, CA, USA) and Google Maps (Google LLC, Mountain View, CA, USA), featuring clear spatial positioning and authentic surface information. It can intuitively reflect the surface landscape, plot distribution and geographic location characteristics of the study area, offering visual references for confirming the study area boundary and checking the image coverage. Meanwhile, it provides a reliable geographic background support for the production of labels in the dataset.
2.2.3. Key Phase Selection and Remote Sensing Image Preprocessing
Soybean production in Illinois is characterized by a single-cropping system, with sowing concentrated from March to May each year. The key growth stages extend from May to September, encompassing seedling emergence, flowering, and pod setting; harvesting occurs in September and October. To enhance the generalization ability and classification fairness of the soybean planting area extraction model, this study uniformly deployed sample points across the Illinois study area based on the U.S. Department of Agriculture (USDA) Cropland Data Layer (CDL) (
Figure 2). We then proceeded to analyze the temporal variation characteristics and monthly correlation of the Normalized Difference Vegetation Index (NDVI) between soybeans and other land cover types (e.g., corn, bare soil) (
Figure 3). The primary objective of this study is to ascertain the period during which spectral discrimination between soybeans and other land cover types is at its zenith. This will provide a quantitative foundation for the selection of optimal remote sensing image acquisition time phases, thereby enhancing the extraction accuracy.
Sentinel-2 remote sensing images were acquired from the Copernicus Data Space Ecosystem (
https://dataspace.copernicus.eu/ (accessed on 10 May 2025)), with the preprocessed L2A-level surface reflectance products prioritized for use. These products have undergone radiometric calibration, atmospheric correction, topographic correction, and cloud-snow masking, providing a high-precision reflectance foundation for subsequent quantitative remote sensing analyses. Based on the analysis results of NDVI time-series characteristics in
Figure 3, soybeans exhibited a significant spectral difference from other concurrent crops (e.g., corn) in early September, which was identified as the optimal time phase for high-precision segmentation and extraction of soybean planting areas. Therefore, we screened all available Sentinel-2 images covering the study areas (Peoria County and Menard County, Illinois, USA) from 1 to 15 September 2023. Cloud-contaminated images were excluded through quantitative cloud cover assessment to ensure that the cloud cover rate in the study areas was below 10%, and a comprehensive evaluation of image quality was conducted via visual interpretation. Finally, the images captured on 10 September 2023 were selected as the core data source: these images featured low cloud cover and optimal quality, and their true-color characteristics showed no significant difference from those in the period of 1–9 September, fully matching the optimal spectral identification window in early September. Meanwhile, soybeans in the study areas were at the seed-filling and maturation stage during this period.
To ensure the complete spatial coverage of the study areas and the precise matching between image boundaries and geographical ranges, multiple scenes of images covering Peoria County and Menard County on the aforementioned date were downloaded. Image mosaic, spatial registration, and precise clipping based on administrative boundaries were completed on the ArcGIS 10.2 platform. The true-color images, synthesized from three original 10 m resolution bands (B2 for blue light, B3 for green light, and B4 for red light), were adopted as the core data for the study. Ultimately, a high-quality image dataset with a unified spatial datum, complete coverage, and spectral fidelity was generated, which provided reliable data support for the subsequent precise extraction of soybean planting areas and the construction of classification models.
2.3. Construction of the Labeled Dataset
This study used Sentinel-2 remote sensing images covering Peoria County and Menard County in Illinois, USA, as the core data source. Specifically, one preprocessed Sentinel-2 L2A product was selected for each of the two study counties, totaling two images; these were the final products generated through the image screening and preprocessing procedures detailed in
Section 2.2.3, acquired on 10 September 2023, and downloaded from the Copernicus Open Access Hub operated by the European Space Agency (ESA). The specific process of image processing and label production is as follows:
First, the Sentinel-2 images covering Peoria County and Menard County in the study area were loaded separately with the help of the ArcGIS platform. A vector polygon layer consistent with the scope and size of the study area was created within the platform to ensure that the number of grid pixels of the labeled grid was consistent with that of the original image.
Relying on the polygon drawing tool integrated in the Editor toolbar, systematic manual visual interpretation was carried out based on the original remote sensing images. During the interpretation process, the Cropland Data Layer (CDL) of the United States Department of Agriculture (USDA) and high-resolution Google Earth images were introduced as auxiliary reference materials to accurately outline the vector boundaries of soybean planting areas and complete the preliminary vectorization labeling of soybean planting areas.
The label data adopted a binarization labeling criterion, where the pixel value of the target soybean area was set to 1, and the pixel value of the non-soybean background area was set to 0. After the vectorization labeling was completed, the vector data of all soybean blocks were converted into raster data format (TIFF format) and exported through relevant ArcGIS tools.
Subsequently, the exported TIFF format label images were converted into PNG format, and the original Sentinel-2 remote sensing images (TIF format) were converted into JPG format for storage to ensure complete spatial registration between the original images and the corresponding label data, effectively avoiding potential spatial misalignment problems that may occur in the subsequent block processing and model training processes.
Finally, a sliding window cropping method was employed to perform block segmentation on the registered original images and label data, with the window overlap rate set to 30%. All data were uniformly cropped into 256-pixel × 256-pixel image patches, ensuring that the cropped patches could fully retain the spectral and spatial characteristics of soybean plots.
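The sliding-window cropping step can be sketched as follows. This is a minimal illustration, not the actual processing script; note that a 30% overlap corresponds to a stride of about 179 pixels for 256-pixel windows:

```python
import numpy as np

def sliding_window_patches(image, label, patch=256, overlap=0.3):
    """Crop co-registered image/label rasters into overlapping square patches."""
    stride = max(1, int(round(patch * (1.0 - overlap))))  # 256 * 0.7 -> 179 px
    h, w = image.shape[:2]
    pairs = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            pairs.append((image[top:top + patch, left:left + patch],
                          label[top:top + patch, left:left + patch]))
    return pairs

img = np.zeros((512, 512, 3), dtype=np.uint8)   # toy stand-in for an image tile
msk = np.zeros((512, 512), dtype=np.uint8)      # toy binary soybean mask
pairs = sliding_window_patches(img, msk)
print(len(pairs), pairs[0][0].shape)            # 4 patches of shape (256, 256, 3)
```

Cropping the image and its label with identical window coordinates preserves the pixel-level correspondence established during registration.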
In order to expand the scale of the dataset, enhance the model’s ability to generalize, and avoid overfitting [
23], operations of data augmentation were performed on the original images and their corresponding labeled images. Specifically, five transformation methods were adopted, including rotations by 90°, 180°, and 270°, as well as horizontal flipping and vertical flipping (
Figure 4).
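The five augmentation transforms, applied to each image patch and its label in lockstep, can be sketched as below (an illustrative version; the 6x expansion it produces is consistent with the dataset growth from 783 to 4698 pairs reported next):

```python
import numpy as np

def augment_pair(image: np.ndarray, label: np.ndarray):
    """Return the original pair plus five transformed variants:
    rotations by 90/180/270 degrees, horizontal flip, and vertical flip."""
    variants = [(image, label)]
    for k in (1, 2, 3):  # 90, 180, 270 degree rotations
        variants.append((np.rot90(image, k), np.rot90(label, k)))
    variants.append((np.fliplr(image), np.fliplr(label)))  # horizontal flip
    variants.append((np.flipud(image), np.flipud(label)))  # vertical flip
    return variants

patch = np.arange(16).reshape(4, 4)          # toy 4x4 image patch
mask = (patch > 7).astype(np.uint8)          # toy binary label
out = augment_pair(patch, mask)
print(len(out))  # 6 image-label pairs (original + 5 augmented)
```

Applying the same geometric transform to image and label together keeps every augmented pair spatially consistent.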
The cropped image patches from Peoria County were utilized as training data, yielding a total of 783 image-label pairs with a size of 256 × 256 pixels. Following the implementation of data augmentation, the dataset was augmented, resulting in a total of 4698 pairs. These pairs were then randomly allocated to a training set and a validation set, with a ratio of 9:1. The cropped image patches from Menard County were utilized as independent test data, culminating in 375 image patches of 256 × 256 pixels.
2.4. The VGG-16-Based SimSDRU-Net
Semantic segmentation models, particularly for high-resolution remote sensing imagery, must balance representation capacity and computational efficiency. Conventional convolutional layers often generate redundant feature channels, which not only increases computational burden but may also dilute salient information. Conversely, over-aggressive channel reduction can degrade boundary details and texture fidelity, compromising segmentation accuracy. Additionally, many attention mechanisms depend on parameterized modules, limiting their applicability on resource-limited devices. To mitigate these issues, a novel lightweight module named the SimAM-integrated Statistical Dynamic Reconstruction Unit (SimSDRU)—which integrates a parameter-free Statistical Dynamic Reconstruction Unit (S-DRU) with a parameter-free attention mechanism (SimAM)—is proposed. Based on this module, SimSDRU-Net, which employs VGG-16 as its backbone, is further introduced.
2.4.1. Overall Architecture
SimSDRU-Net adopts a classic symmetric encoder–decoder (U-Net) architecture, with its overall structure illustrated in
Figure 5. The encoder is responsible for extracting multi-level, deep semantic features from the input image, while the decoder progressively restores spatial details and generates the precise segmentation map.
Encoder: VGG-16 Backbone with Transfer Learning. This study employs the classic VGG-16 network [
24] as the backbone encoder. Recognized for its regular architecture—characterized by consecutive stacks of 3 × 3 convolutional kernels and max-pooling layers—VGG-16 (
Figure 6) demonstrates strong feature learning capability, straightforward trainability, and favorable transferability. To adapt it for the semantic segmentation task, the original classifier components, including the terminal adaptive average pooling layer and fully connected layers, are removed. Only the convolutional and pooling layers that constitute the feature extraction module are retained, thereby eliminating redundant parameters associated with classification and ensuring focus on the soybean segmentation objective. For an input image of size 256 × 256 × 3, the encoder processes the data through a sequence of operations: two 3 × 3 convolutions (64 channels) followed by pooling, two 3 × 3 convolutions (128 channels) followed by pooling, three 3 × 3 convolutions (256 channels) followed by pooling, and two stages of three 3 × 3 convolutions (512 channels), each followed by pooling. This pipeline finally yields a deep feature map of size 8 × 8 × 512. By loading weights pre-trained on the ImageNet dataset, the encoder transfers general visual knowledge, which effectively mitigates overfitting commonly encountered when training on small-scale remote sensing datasets. During the initial training phase, the backbone parameters are frozen to stabilize feature extraction, followed by gradual unfreezing for fine-tuning.
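The freeze-then-unfreeze schedule described above can be sketched in PyTorch as follows. This is a toy illustration with placeholder encoder/decoder modules, not the actual SimSDRU-Net code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the VGG-16 encoder and the SimSDRU decoder.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.MaxPool2d(2))
decoder = nn.Sequential(nn.Upsample(scale_factor=2),
                        nn.Conv2d(64, 1, 3, padding=1))

def set_backbone_frozen(frozen: bool) -> None:
    """Phase 1 (early epochs): frozen=True, only the decoder is trained.
    Phase 2 (later epochs): frozen=False, the whole network is fine-tuned."""
    for p in encoder.parameters():
        p.requires_grad = not frozen

set_backbone_frozen(True)  # stabilize pre-trained features first
trainable = [p for p in list(encoder.parameters()) + list(decoder.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, betas=(0.9, 0.999))
print(len(trainable))  # only decoder parameters remain trainable
```

After the frozen phase, calling `set_backbone_frozen(False)` and rebuilding the optimizer over all parameters switches to joint fine-tuning.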
Decoder: A feature refinement and reconstruction module with SimSDRU as its core. The core of the decoder is the proposed SimSDRU module in this paper. It is embedded in the key upsampling and feature fusion paths of the decoder, and is used to efficiently process deep features from the encoder and shallow features transmitted through skip connections. This module works through its two internal collaborative components: (1) Statistical Dynamic Reconstruction Unit, which realizes lightweight spatial feature reconstruction based on the statistical distribution of feature maps, suppresses redundant channels, reduces computational load, and retains key semantic information; (2) SimAM attention mechanism, which adaptively assigns spatial weights through neuron importance statistics, guiding the model to focus on key regions such as soybean planting areas, thereby improving the robustness and discriminative power of feature representation. The decoder gradually restores the details of target boundaries through multiple upsampling and fusion with the features of the corresponding layers of the encoder (processed by SimSDRU), and finally outputs a semantic segmentation map with the same resolution as the input.
2.4.2. S-DRU
In the process of extracting soybeans from remote sensing images, the skip connections of the decoder must first concatenate and fuse the multi-scale features captured by the encoder with the upsampled features. These fused features are then fed into subsequent modules for processing. The primary objective of this operation is to restore the fine spatial details of soybean plots based on the fused features while improving the segmentation accuracy of complex planting areas.
However, prevailing methods have clear limitations. In concatenated feature maps, multiple channels often respond repetitively to similar features, leading to significant information redundancy. Furthermore, soybean planting exhibits two distinct patterns: large-scale contiguous distribution and small-area scattered growth. Traditional convolutional operations can extract information from concatenated feature maps, but their parameter redundancy makes them ill-suited to the practical requirements of lightweight extraction.
To address these issues, this study proposes a parameter-free statistical dynamic reconstruction unit (S-DRU) with an extremely lightweight design. Following the SRU unit [25], it automatically identifies and selects key channels related to soybeans from statistics computed on the concatenated feature maps, and accomplishes a complementary fusion of features by cross-recombining feature components with different weights. The specific structure of S-DRU is illustrated in Figure 7.
In the S-DRU, an input feature map $X \in \mathbb{R}^{C \times H \times W}$ is initially processed, where $C$, $H$, and $W$ represent the number of input channels, height, and width, respectively. A grouping operation is then applied: the channels of the input feature map are divided into $g$ groups, with the number of groups determined adaptively. Each group contains $C_g$ channels, as illustrated in Formula (1):

$$C_g = \frac{C}{g} \tag{1}$$

Subsequently, independent normalization is applied to each group of feature maps, as expressed in Formula (2):

$$\hat{X} = \frac{X - \mu_g}{\sqrt{\sigma_g^2 + \varepsilon}} \tag{2}$$

where $\mu_g$ represents the group mean, $\sigma_g$ denotes the group standard deviation, and $\varepsilon$ is a tiny positive constant added to ensure the stability of the division operation. The groups are subsequently merged back into the initial configuration, yielding $\hat{X} \in \mathbb{R}^{C \times H \times W}$. The absolute mean value of each channel $c$ is calculated as delineated in Formula (3), yielding the channel statistic $s_c$:

$$s_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| \hat{X}_{c,i,j} \right| \tag{3}$$

Next, the weights of each channel are normalized, as illustrated in Formula (4), where the final output is $w_c$:

$$w_c = \frac{s_c}{\sum_{k=1}^{C} s_k} \tag{4}$$

The normalized input feature $\hat{X}$ is multiplied by its corresponding channel weight $w_c$, and the resulting product is compressed to the range of 0 to 1 using the Sigmoid function. This process generates the reweighting map $R$.
The representation of effective features is then enhanced through dynamic weight allocation and cross-branch feature reconstruction, which suppresses redundant information, as outlined in Formula (5). In this context, $R$ denotes the reweights, $\otimes$ indicates element-wise multiplication, $\oplus$ signifies element-wise addition, $\mathrm{Split}$ refers to equal channel splitting, and $\mathrm{Concat}$ represents channel concatenation. The following steps yield the final output feature map $Y$:

$$
\begin{aligned}
X^{1} &= R \otimes \hat{X}, \qquad X^{2} = (1 - R) \otimes \hat{X}, \\
X^{11}, X^{12} &= \mathrm{Split}\left(X^{1}\right), \qquad X^{21}, X^{22} = \mathrm{Split}\left(X^{2}\right), \\
Y &= \mathrm{Concat}\left(X^{11} \oplus X^{22},\; X^{21} \oplus X^{12}\right)
\end{aligned}
\tag{5}
$$
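A parameter-free numpy sketch of the S-DRU computation described above might look like the following. This is an illustrative reading of the statistics-based reweighting and cross-reconstruction, not the authors' released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_dru(x: np.ndarray, groups: int = 4, eps: float = 1e-5) -> np.ndarray:
    """Parameter-free statistical dynamic reconstruction of a (C, H, W) map."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)            # Formula (1): Cg = C/g
    mu = g.mean(axis=(1, 2, 3), keepdims=True)          # group mean
    sd = g.std(axis=(1, 2, 3), keepdims=True)           # group standard deviation
    xn = ((g - mu) / np.sqrt(sd**2 + eps)).reshape(c, h, w)  # Formula (2)

    s = np.abs(xn).mean(axis=(1, 2))                    # Formula (3): |mean| per channel
    wgt = s / s.sum()                                   # Formula (4): normalized weights
    r = sigmoid(wgt[:, None, None] * xn)                # reweighting map in (0, 1)

    x1, x2 = r * xn, (1.0 - r) * xn                     # Formula (5): two branches
    a, b = np.split(x1, 2, axis=0)                      # equal channel split
    d, e = np.split(x2, 2, axis=0)
    return np.concatenate([a + e, d + b], axis=0)       # cross-add, then concat

feat = np.random.default_rng(0).normal(size=(8, 4, 4))
print(s_dru(feat).shape)  # shape is preserved: (8, 4, 4)
```

Because every operation is a statistic of the input itself, the unit adds no learnable parameters and its output keeps the input's channel count and spatial size.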
2.4.3. SimAM
SimAM [
26] is a parameter-free attention mechanism based on an energy function. It can adaptively generate spatial attention weights for feature maps without introducing additional learnable parameters. Its premise is rooted in saliency detection theory from neuroscience, which holds that image regions differing markedly from their immediate surroundings tend to carry more information. SimAM quantifies the importance of each position in the feature map by computing its statistical difference from the global features, enhancing key regions and suppressing non-critical ones. The structural configuration of SimAM is delineated in
Figure 8.
First, for a given input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively, the spatial mean value of each channel is calculated to characterize the overall response level of that channel, as shown in Formula (6):

$$\mu_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j} \tag{6}$$

Subsequently, the energy value $e_{c,i,j}$ of each position is calculated relative to the mean value of its corresponding channel. This value reflects the local saliency of the position, as expressed in Formula (7):

$$e_{c,i,j} = \left( X_{c,i,j} - \mu_c \right)^2 \tag{7}$$

Next, the spatial variance of each channel is computed, as illustrated in Formula (8):

$$\sigma_c^2 = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( X_{c,i,j} - \mu_c \right)^2 \tag{8}$$

Attention weights are derived through a process of energy normalization and offset, as delineated in Formula (9). In this equation, the numerical stability constant $\varepsilon$ is defined as $5 \times 10^{-4}$, and the offset term 0.5 is incorporated to guarantee that the output falls within a reasonable range:

$$E_{c,i,j} = \frac{e_{c,i,j}}{4 \left( \sigma_c^2 + \varepsilon \right)} + 0.5 \tag{9}$$

The Sigmoid function is employed to map the energy values to the range (0, 1), and Dropout regularization applied during the training phase yields the final attention weights $a$. The original features are multiplied by the attention weights in an element-wise manner to generate the enhanced features, as depicted in Formula (10):

$$\tilde{X} = X \otimes a, \qquad a = \mathrm{Dropout}\left( \mathrm{Sigmoid}(E) \right) \tag{10}$$
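Formulas (6)-(10) can be sketched in numpy as follows. This is an inference-time illustration (so Dropout is omitted, and the population form of the variance is used for brevity), not the authors' code:

```python
import numpy as np

def simam(x: np.ndarray, eps: float = 5e-4) -> np.ndarray:
    """Parameter-free SimAM attention over a (C, H, W) feature map."""
    mu = x.mean(axis=(1, 2), keepdims=True)       # Formula (6): channel mean
    e = (x - mu) ** 2                             # Formula (7): positional energy
    var = e.mean(axis=(1, 2), keepdims=True)      # Formula (8): channel variance
    energy = e / (4.0 * (var + eps)) + 0.5        # Formula (9): normalize + offset
    a = 1.0 / (1.0 + np.exp(-energy))             # Sigmoid -> attention in (0, 1)
    return x * a                                  # Formula (10): element-wise scaling

feat = np.random.default_rng(1).normal(size=(4, 8, 8))
out = simam(feat)
print(out.shape)  # (4, 8, 8)
```

Positions whose responses deviate strongly from their channel mean receive energies above 0.5 and therefore larger attention weights, which is how salient regions such as soybean plots are emphasized.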
2.5. Evaluation Metrics
In the context of semantic segmentation tasks, OA, Mean Pixel Accuracy (MPA), and Mean Intersection over Union (MIoU) are adopted as the evaluation metrics for soybean segmentation performance. The OA is defined as the proportion of correctly classified pixels among the total pixels in an image, as delineated in Formula (11). The MPA is calculated by first determining the pixel accuracy for each class and then averaging the results across all classes, as expressed in Formula (12). The MIoU is calculated by determining, for each class, the ratio of the intersection to the union of the predicted and ground-truth regions, and then averaging this ratio across all classes, as illustrated in Formula (13):

OA = Σ_{i=1}^{k} p_{ii} / Σ_{i=1}^{k} Σ_{j=1}^{k} p_{ij}  (11)

MPA = (1/k) Σ_{i=1}^{k} [ p_{ii} / Σ_{j=1}^{k} p_{ij} ]  (12)

MIoU = (1/k) Σ_{i=1}^{k} [ p_{ii} / (Σ_{j=1}^{k} p_{ij} + Σ_{j=1}^{k} p_{ji} − p_{ii}) ]  (13)

where k denotes the total number of classes in the segmentation task, with class indices i, j ∈ {1, 2, …, k}; p_{ij} represents the number of pixels belonging to class i that are predicted as class j (where i = j indicates correct classification and i ≠ j indicates misclassification); p_{ii} refers to the number of correctly predicted pixels in class i; and p_{ji} denotes the number of pixels misclassified from class j to class i.
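These three metrics can all be read off a single confusion matrix, as the following sketch shows (the function name and layout are illustrative, not from the paper):

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute OA, MPA, and MIoU from a k-by-k confusion matrix whose
    entry conf[i, j] counts pixels of class i predicted as class j,
    following Formulas (11)-(13)."""
    tp = np.diag(conf).astype(float)       # correctly classified pixels p_ii
    per_class_gt = conf.sum(axis=1)        # all pixels labelled class i
    per_class_pred = conf.sum(axis=0)      # all pixels predicted as class i
    oa = tp.sum() / conf.sum()                                   # Formula (11)
    mpa = np.mean(tp / per_class_gt)                             # Formula (12)
    miou = np.mean(tp / (per_class_gt + per_class_pred - tp))    # Formula (13)
    return oa, mpa, miou
```

For example, the binary confusion matrix [[3, 1], [0, 4]] yields OA = 0.875, MPA = 0.875, and MIoU = 0.775.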
2.6. Experimental Environment Setup
All experiments were conducted on a workstation equipped with an AMD Ryzen 9 7940HX with Radeon Graphics (2.50 GHz) CPU (Advanced Micro Devices, Inc., Sunnyvale, CA, USA) and an NVIDIA GeForce RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA), running the Windows 10 operating system (Microsoft Corporation, Redmond, WA, USA). The software environment comprised CUDA 10.4 (NVIDIA Corporation), Python 3.8.20 (Python Software Foundation, Wilmington, DE, USA), and the deep learning framework PyTorch 1.8.0 (Meta Platforms, Inc., Menlo Park, CA, USA).

To address the issue of positive-negative sample imbalance in soybean segmentation, a hybrid loss function combining Dice Loss and Focal Loss was adopted. During model training, a cosine annealing learning rate scheduling strategy was implemented, with an initial learning rate of 1 × 10⁻⁴ and a minimum learning rate of 1 × 10⁻⁶. The Adam optimizer was utilized with β₁ = 0.9, the total number of training epochs was set to 100, and the batch size was configured as 4. Specifically, the backbone network was frozen for the first 50 epochs, with only the decoder and segmentation output layer being trained; in the subsequent 50 epochs, the backbone was unfrozen to perform joint fine-tuning of the entire network.
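The loss and optimization settings above can be sketched in PyTorch as follows. Note the assumptions: the paper does not specify the weighting between the Dice and Focal terms (they are summed with equal weight here), the Focal hyperparameters α and γ are common defaults rather than reported values, and the 1 × 1 convolution stands in for the actual segmentation network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_focal_loss(logits, target, alpha=0.25, gamma=2.0, smooth=1.0):
    """Hybrid Dice + Focal loss for binary soybean/background masks.
    A sketch: the two terms are summed with equal weight, which is an
    assumption, as the paper does not state its weighting."""
    prob = torch.sigmoid(logits)
    # Dice term: penalizes low overlap between prediction and ground truth
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + smooth) / (prob.sum() + target.sum() + smooth)
    # Focal term: down-weights easy pixels to counter class imbalance
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pt = torch.exp(-bce)  # model confidence in the true class
    focal = (alpha * (1.0 - pt) ** gamma * bce).mean()
    return dice + focal

# Optimizer and cosine-annealing schedule matching the stated settings;
# the 1x1 conv is a placeholder for the segmentation model.
model = nn.Conv2d(3, 1, kernel_size=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)
```

The freeze-then-fine-tune strategy would then toggle `requires_grad` on the backbone's parameters at epoch 50 before resuming training.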
4. Conclusions
In this study, a lightweight U-shaped semantic segmentation network, SimSDRU-Net, is proposed to accurately extract soybean planting areas from remote sensing images with limited data. Based on NDVI time-series variation in spectral signatures, early September is determined as the optimal phenological window for soybean identification, and Sentinel-2 images of Menard County, Illinois, USA are used as the independent test area. The model adopts a classic U-Net encoder–decoder structure with an ImageNet pre-trained VGG-16 encoder, a decoder enhanced by the SimAM attention mechanism and the S-DRU feature reconstruction unit, and a freeze-then-fine-tune training strategy. Experimental results show that, compared with random initialization, the ImageNet pre-trained VGG-16 significantly improves the performance of baseline models including U-Net, U-Net++, and DeepLabv3+. The integrated SimSDRU-Net achieves superior performance on the independent test set (MIoU 89.03%, MPA 93.81%, OA 95.96%) and outperforms all benchmark models. Ablation studies verify that both the SimAM and S-DRU modules effectively improve segmentation performance, providing an effective and efficient solution for high-precision soybean mapping under limited data conditions.
Given the current limitations in terms of computational cost and data dependency, future research can focus on two key directions. First, to better exploit the multispectral capability of Sentinel-2, studies can explore optimal band combinations for segmentation and integrate vegetation indices (e.g., NDVI, EVI, NDRE, NDWI) for in-depth feature mining. On this basis, efficient band selection and dimensionality reduction methods can be developed to improve spectral utilization and reduce computational costs. Second, to improve cross-region robustness, the proposed lightweight SimSDRU-Net segmentation method can be combined with unsupervised domain adaptation (UDA) techniques. UDA can leverage knowledge from annotated “source regions” to adaptively learn and transfer features to unannotated or sparsely annotated “new target regions”, thereby optimizing the model to adapt to diverse agricultural landscapes worldwide.