1. Introduction
Efficient agricultural practices are a foundational pillar of national food security [1] and long-term ecological sustainability [2]. As the essential asset of agricultural production, cropland is vital not only for ensuring grain production [3] but also for supporting ecosystem services [4]. Therefore, obtaining timely and accurate information on the coverage and distribution of cropland is crucial for agricultural resource assessment and food security [5,6,7]. Remote sensing (RS) plays a key role in cropland mapping due to its capability of providing large-scale spatial coverage, high temporal resolution, and objective data acquisition [8,9]. Recent advances in high-resolution remote sensing (HRRS) imaging have enabled the creation of important data sources for cropland mapping with a high level of thematic detail. However, despite the richness of information present in HRRS images, the high intraclass variability and low interclass diversity often encountered [10] make high-resolution cropland extraction challenging.
Traditional high-resolution cropland extraction methods rely heavily on image segmentation and hand-crafted features. They can be broadly categorized into object-based and region-based image analysis methods [11,12,13]. The object-based random forest method initially employs traditional image segmentation algorithms to isolate homogeneous objects, from which spectral, texture, and geometric features are then extracted. Finally, predefined classification rules based on these features are applied to identify cropland [11,12]. Region-based methods adopt an iterative aggregation strategy to group homogeneous pixels into objects, thereby producing highly continuous cropland boundaries [13]. Although the above methods are interpretable and adaptable through rule design, they are sensitive to image quality and scene complexity, which often leads to limited accuracy and generalization, particularly in high-resolution and heterogeneous cropland landscapes [14,15]. Moreover, their dependence on manual feature engineering and parameter tuning hinders scalability for large-scale, high-precision extraction tasks [16,17].
Due to its robust automatic feature extraction capabilities, deep learning has been widely applied in high-resolution cropland extraction [5,9,18]. Early deep learning models primarily adopted convolutional neural networks (CNNs), such as fully convolutional networks (FCNs), to achieve end-to-end pixel-level classification. The subsequent introduction of encoder–decoder architectures, incorporating dilated convolution modules and pyramid pooling modules, facilitated the development of classic semantic segmentation models such as UNet and DeepLabV3+ [19,20]. These architectures significantly enhanced the models' multi-scale feature representation and contextual reasoning capabilities. Through global context modeling, Transformers leveraging the self-attention mechanism have overcome the limitations of local receptive fields in traditional convolution operations [21]. This allows them to capture long-range pixel dependencies effectively and enhances the discrimination of complex backgrounds and boundary details [22], further advancing the development of semantic segmentation models. Additionally, researchers have combined the local feature extraction advantages of CNNs with the global relationship modeling capabilities of Transformers to develop remote sensing semantic segmentation models based on hybrid CNN–Transformer architectures. For example, Wang et al. [23] proposed UNetFormer, which combines a Transformer-based decoder with a lightweight CNN-based encoder for the efficient semantic segmentation of urban RS images. Wu et al. [24] proposed CMTFNet, which integrates CNNs with multiscale Transformers to derive and unify local details and global-scale contextual information from HRRS imagery. Despite their strong performance in semantic segmentation tasks, these models face challenges in precise boundary delineation, particularly when capturing the irregular and intricate geometric features of cropland boundaries.
The Segment Anything Model (SAM), proposed by Meta AI in 2023 [25], represents a significant breakthrough in universal segmentation models. Trained on millions of annotated images, SAM is exceptional in that it enables zero-shot segmentation of new visual objects without prior exposure. Although originally developed for natural images, SAM is being increasingly applied in the RS domain. However, since SAM's segmentation results do not include class information, researchers typically use its raw mask predictions directly as priors for downstream semantic segmentation or boundary refinement, without altering SAM's base architecture. For example, Sun et al. [26] employed SAM-generated masks combined with phenological feature recognition of crops to delineate crop boundaries. Ma et al. [27] developed a SAM-assisted auxiliary semantic segmentation strategy, which employs an object consistency loss and a boundary preservation loss based on SAM's segmentation outputs, thus enhancing semantic segmentation performance. Nevertheless, these methods rely heavily on SAM's original outputs and lack adaptability for RS-specific tasks.
Recent studies have thus shifted toward adapting SAM for domain-specific tasks. For example, in the medical field, Lin et al. [28] introduced a CNN branch with cross-attention mechanisms in parallel with a Vision Transformer (ViT), along with feature and positional adapters, enabling the modified SAM to adapt to structured anatomical features such as organs and lesions. However, HRRS imagery differs significantly from medical data in terms of object morphology and scene variability, as it typically exhibits large-scale variation, class heterogeneity, and irregular boundaries. To address these challenges, Chen et al. proposed RSPrompter [29], which learns to generate prompt inputs for SAM, thereby enabling it to autonomously obtain semantic instance-level masks. Luo et al. [30] introduced an adapter to adapt the pretrained ViT backbone of SAM to RS images and fine-tuned the mask decoder by integrating bounding box information with multiscale features from the adapter, mitigating the significant domain shift from natural images. Sultan et al. [31] fine-tuned SAM using dense visual prompts from zero-shot learning and sparse visual prompts from a pre-trained CNN segmentation model, thus enabling the segmentation of mobility infrastructure, including roads, sidewalks, and crosswalks. Wang et al. introduced SAMPolyBuild [32], a model that integrates an Auto Bbox Prompter for automatic object localization and extends the SAM decoder with multi-task learning capabilities to simultaneously predict segmentation masks, vertex maps, and boundary maps. Chen et al. presented ESAM [33], an edge-enhanced SAM variant specifically designed for photovoltaic (PV) power plant extraction; it incorporates an edge detection module and a learning-based fusion strategy to combine semantic and edge features effectively, substantially improving segmentation accuracy. Ma et al. proposed a unified multimodal fine-tuning framework that leverages SAM's generalizable image encoder and enhances it with Adapter and LoRA modules to process multimodal remote sensing data for semantic segmentation effectively [34]. Additionally, it is noteworthy that SAM's fixed input size (1024 × 1024 pixels) leads to excessive computational resource consumption during training and low inference efficiency. To address this issue, Kato et al. proposed Generalized SAM (GSAM) [35], which supports variable input image sizes during fine-tuning through a positional encoding generator and introduces a spatial multiscale AdaptFormer to enhance spatial feature modeling, reducing computational costs while maintaining segmentation performance. Specifically, GSAM leverages a learnable positional encoding generator composed of a convolution-based absolute positional embedding module and optional relative positional embeddings integrated into the attention blocks. These mechanisms enable GSAM to support variable input sizes during fine-tuning, but they introduce additional trainable parameters and rely on task-specific adaptation. Despite advancements in prompt design and architectural adaptation, research on SAM-based high-resolution cropland extraction remains limited. The aforementioned methods often focus on semantic or instance modeling, rarely addressing issues related to shape preservation and boundary continuity.
In summary, although current semantic segmentation models for high-resolution cropland extraction have achieved notable progress, they still suffer from insufficient accuracy in boundary localization. SAM is limited to generating segmentation results without class information, and directly applying the model to HRRS imagery semantic segmentation tasks may lead to poor adaptability. Therefore, in this study a dual-branch framework integrating SAM and a semantic-aware network (SAM-SANet) is proposed for high-resolution cropland extraction. The network utilizes a semantically aware branch based on semantic segmentation networks to recognize cropland regions, while a SAM-based branch incorporating boundary constraints is utilized to enhance the performance of cropland extraction. To tailor SAM for high-resolution cropland extraction, several modifications were made to its architecture. Specifically, a position embedding adapter (PEA) module is designed for the SAM-based branch, so that the network can accommodate smaller image inputs (256 × 256 pixels), alleviating the computational overhead caused by SAM's fixed input size. Additionally, a boundary-aware feature fusion module (BFFM) and a prompt generation and selection module (PGSM) are introduced into the SAM-based branch to enhance boundary accuracy and segmentation robustness. Furthermore, the images of a large-scale land-cover classification dataset, namely the Gaofen Image Dataset (GID) [36], are processed and reclassified into cropland and non-cropland to construct the GID Cropland Dataset (GID-CD). Two study areas with distinct agricultural landscapes, namely Juye County and Qixia City in Shandong Province, were selected to create two additional cropland datasets, namely JY-CD and QX-CD. The effectiveness of the proposed SAM-SANet is evaluated comprehensively on these three datasets.
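To make the dual-branch design concrete, the following minimal PyTorch sketch shows how a semantically aware branch and a boundary-constrained branch can be wired in parallel. Both branch definitions and the class name DualBranchNet are illustrative placeholders under stated assumptions, not the actual SAM-SANet implementation (the real SAB is a full semantic segmentation network, and the real boundary branch wraps SAM together with the PEA, BFFM, and PGSM).

```python
# Minimal sketch of a dual-branch segmentation framework in PyTorch.
# Both branches are illustrative stand-ins, NOT the actual SAM-SANet modules.
import torch
import torch.nn as nn

class DualBranchNet(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Stand-in for the semantically aware branch (SAB): region-level
        # class logits for background / cropland / non-cropland.
        self.sab = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )
        # Stand-in for the boundary-constrained SAM branch: one channel
        # of boundary logits used for edge supervision.
        self.bcb = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, x: torch.Tensor):
        return self.sab(x), self.bcb(x)

model = DualBranchNet()
sem_logits, boundary_logits = model(torch.randn(2, 3, 256, 256))
print(sem_logits.shape, boundary_logits.shape)
# torch.Size([2, 3, 256, 256]) torch.Size([2, 1, 256, 256])
```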
2. Datasets
This study employs three cropland datasets to assess the performance of the proposed approach in extracting high-resolution cropland regions. The first dataset is the GID Cropland Dataset (GID-CD), which is a reclassified version of the high-resolution land cover Gaofen Image Dataset (GID), released by Wuhan University. The other two datasets were constructed from study areas with distinct agricultural landscapes in Shandong Province, namely Juye County and Qixia City, and are referred to as the Juye Cropland Dataset (JY-CD) and the Qixia Cropland Dataset (QX-CD), respectively. These datasets provide a comprehensive basis for assessing the robustness of the proposed method across varying real-world scenarios. All datasets consist of RS imagery and pixel-level cropland annotations, supporting training and evaluation for semantic segmentation tasks.
2.1. GID-CD
The original GID consists of 150 Gaofen-2 satellite images, each with a spatial resolution of 4 m and a size of 6800 × 7200 pixels, collectively covering more than 50,000 km² across over 60 cities in China. Each image covers approximately 506 km², and the geographic distribution of these scenes is detailed in the original GID publication [36]. The dataset includes pixel-level annotations for five types of land cover, namely built-up, cropland, forest, grassland, and water, with areas not belonging to these five classes, as well as clutter regions, labeled as background. To facilitate cropland extraction, in this study the categories of the original dataset were redefined and merged to construct the GID Cropland Dataset (GID-CD). Specifically, the "cropland" and "background" categories from the original dataset were retained, while the other categories (i.e., "built-up", "forest", "grassland", and "water") were merged into a single "non-cropland" category. After processing, a three-class dataset was formed, consisting of background (class 0), cropland (class 1), and non-cropland (class 2). To ensure the dataset contained valid cropland information, only satellite images with cropland area proportions greater than 15% were selected for the final GID-CD. Subsequently, the GID-CD images were split into non-overlapping patches of size 256 × 256 pixels, and patches with background proportions exceeding 50% were excluded. Ultimately, 46,994 samples were obtained in total and divided into training, validation, and test sets at a ratio of 8:1:1. These samples were drawn from the 150 Gaofen-2 scenes distributed over more than 60 cities in China, thus preserving the spatial diversity of the original GID and covering a wide range of geographic and cropland patterns. The GID-CD constructed in this study not only retains the spatial details and texture characteristics of the original high-resolution images but also simplifies the classification scheme, enabling models to focus on distinguishing cropland from non-cropland. It is therefore well suited for high-resolution cropland extraction tasks.
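For illustration, the relabeling and patch-filtering steps described above can be expressed as a short NumPy sketch. The original GID class indices assumed in the comments are hypothetical; only the three output classes (0/1/2), the 15% cropland threshold, the 50% background threshold, and the 256 × 256 patch size are taken from the text.

```python
# Sketch of the GID-CD construction pipeline (relabel, filter, tile).
# Assumed original GID indices (hypothetical): 0 = background/clutter,
# 1 = built-up, 2 = cropland, 3 = forest, 4 = grassland, 5 = water.
import numpy as np

def relabel_gid(label: np.ndarray) -> np.ndarray:
    out = np.full_like(label, 2)   # default: non-cropland (class 2)
    out[label == 0] = 0            # keep background (class 0)
    out[label == 2] = 1            # keep cropland  (class 1)
    return out

def scene_is_valid(label3: np.ndarray, min_cropland: float = 0.15) -> bool:
    # Keep only scenes whose cropland proportion exceeds 15%.
    return (label3 == 1).mean() > min_cropland

def tile_and_filter(image: np.ndarray, label3: np.ndarray,
                    patch: int = 256, max_background: float = 0.5):
    # Non-overlapping 256 x 256 patches; drop patches with > 50% background.
    h, w = label3.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            lab = label3[y:y + patch, x:x + patch]
            if (lab == 0).mean() <= max_background:
                yield image[:, y:y + patch, x:x + patch], lab
```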
2.2. JY-CD and QX-CD
In addition to the GID-CD, which was constructed using a publicly available high-resolution land cover dataset, in this study two additional cropland datasets were constructed to validate the proposed method. Two study areas with distinct agricultural landscape characteristics were selected, namely Juye County (115°46′13″E–116°16′59″E, 35°6′13″N–35°29′38″N) and Qixia City (120°32′45″E–121°16′8″E, 37°5′5″N–37°32′57″N) in Shandong Province. The geographic locations of the study areas are shown in Figure 1.
Covering 1302 km², Juye County lies on the southwestern plain of Shandong Province and is characterized by high agricultural productivity. It has a temperate continental monsoon climate with an average annual temperature of 13.8 °C and annual precipitation of 658 mm, concentrated from late June to September. Most of the land is dedicated to agriculture, while residential, grassland, forested, and water areas are also present to a lesser extent. The cropland parcels in the study area are relatively large and compact, making them suitable for intensive agricultural production. The crop calendar runs from early October to early or mid-June of the following year for winter wheat, and from April to October for spring and summer crops, which include maize, rice, soybean, millet, and cotton.
Qixia City is located in northeastern Shandong Province and covers an area of 1793 km². Its climate is similar to that of Juye, with an average annual temperature of 11.7 °C and annual precipitation of 693 mm. The major land use/land cover types are forestry, agriculture, water, and residential. Hilly and mountainous terrain dominates the landscape, with irregularly distributed and fragmented agricultural fields. The typical crop rotation is winter wheat followed by spring and summer crops, which include maize, soybean, and peanut. Winter wheat is sown in early October and harvested in early or mid-June of the following year. Spring and summer crops are sown in late April and harvested from mid-September to early October.
This study used Gaofen-1 (GF-1) satellite images provided by the Land Satellite Remote Sensing Application Center (LASAC), Ministry of Natural Resources of the People's Republic of China, as the remote sensing data source for cropland dataset construction. The spatial resolution of the imagery is 2 m for the panchromatic (PAN) band and 8 m for the multispectral (MS) bands. The images were processed into red–green–blue composites at 2 m resolution through the fusion of the MS and corresponding PAN images using the Gram–Schmidt transformation, followed by reprojection to the Albers Conical Equal Area projection. For Juye, GF-1 images acquired on 1 April 2016 were used, corresponding to the jointing stage of winter wheat, a period characterized by vigorous growth and significantly high vegetation coverage in the cropland areas (Figure 1(a1)). For Qixia, GF-1 images acquired on 22 August 2016 were used, corresponding to the silking stage of summer maize, a vigorous growth period during which both forested and cropland areas exhibit high vegetation coverage (Figure 1(b1)).
The cropland labels were constructed through visual interpretation of the GF-1 composite data by professionals with RS experience. In Juye, non-cropland encompasses built-up, forest, water, road, meadow, greenhouse, and bare land, with built-up areas constituting the predominant non-cropland cover. In Qixia, non-cropland includes the same categories, with forest being the primary type. A total of 19,147 (Juye) and 19,549 (Qixia) cropland parcels were delineated in the two study areas as ground truth data. The cropland parcel data were converted into raster format, setting "cropland" pixels to 1, "non-cropland" pixels to 2, and pixels without RS imagery coverage to 0 ("background"). The GF-1 composite images and labels constituted the final cropland sample datasets: the Juye Cropland Dataset (JY-CD) and the Qixia Cropland Dataset (QX-CD). Compared with GID-CD, which was derived from a widely used public dataset, JY-CD and QX-CD were specifically constructed to represent two distinct and typical agricultural landscapes within Shandong Province. JY-CD corresponds to a flat plain region with concentrated and geometrically regular cropland parcels. In contrast, QX-CD is situated in a hilly and mountainous area, where cropland is fragmented, irregularly distributed, and surrounded by dense forest and complex terrain. These distinctions contribute to a broader representation of real-world scenarios and provide a more rigorous basis for evaluating the robustness of cropland extraction models under diverse geographic and agricultural conditions. A macro-grid-based splitting strategy was adopted based on the 16-sub-region grid illustrated in Figure 1. Entire grid cells were reserved as independent test regions (highlighted by red squares in Figure 1), thereby ensuring that no spatial overlap occurred between the training/validation regions and the test regions. Within the training and validation regions, a sliding-window approach with 50% overlap and a patch size of 256 × 256 pixels was applied to generate a sufficient number of samples, as sketched below. Ultimately, 17,364 and 26,704 training sample blocks and 1780 and 2659 validation sample blocks were obtained for JY-CD and QX-CD, respectively.
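The patch sampling within the training and validation regions reduces to a standard sliding-window crop. The sketch below, with hypothetical function names, implements the 50% overlap (stride 128) and 256 × 256 window stated above.

```python
# Sliding-window sampling with 50% overlap for CHW imagery and HW labels.
import numpy as np

def sliding_window_patches(image: np.ndarray, label: np.ndarray,
                           patch: int = 256, overlap: float = 0.5):
    # Stride = patch * (1 - overlap), i.e. 128 px for 50% overlap.
    stride = int(patch * (1.0 - overlap))
    h, w = label.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            yield (image[:, y:y + patch, x:x + patch],
                   label[y:y + patch, x:x + patch])

# Example: a 1024 x 1024 region yields a 7 x 7 grid of patches.
img = np.zeros((3, 1024, 1024), dtype=np.float32)
lab = np.zeros((1024, 1024), dtype=np.uint8)
print(sum(1 for _ in sliding_window_patches(img, lab)))  # 49
```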
Figure 2 shows typical samples from the three high-resolution cropland datasets: GID-CD, JY-CD, and QX-CD.
5. Discussion
5.1. Sensitivity Analysis of the Boundary Coefficient
The sensitivity analysis primarily examined the hyperparameter λ, which denotes the weight coefficient of the boundary loss and controls the contribution strength of boundary-guided supervision. To assess the impact of the boundary loss weight λ on segmentation performance, we conducted sensitivity experiments on three datasets: GID-CD, JY-CD, and QX-CD. Additionally, we introduced a quantitative complexity indicator, boundary tortuosity (Equation (19) in Section 3.7), to explore the underlying reason for the variation in optimal λ values across different terrains.
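Since Equation (19) is defined in Section 3.7 and not reproduced here, the sketch below uses a common shape-complexity proxy for boundary tortuosity: the object perimeter relative to the perimeter of an equal-area circle, which equals 1.0 for a perfect disc. This is an assumption for illustration and may differ from the paper's exact formulation.

```python
# Hedged sketch of a per-object boundary-tortuosity indicator.
# Proxy definition (assumed): perimeter / (2 * sqrt(pi * area)).
import numpy as np
from skimage import measure

def boundary_tortuosity(cropland_mask: np.ndarray) -> list[float]:
    values = []
    labeled = measure.label(cropland_mask.astype(np.uint8), connectivity=1)
    for region in measure.regionprops(labeled):
        if region.area < 16:       # skip tiny speckles
            continue
        equal_area_circle_perimeter = 2.0 * np.sqrt(np.pi * region.area)
        values.append(region.perimeter / equal_area_circle_perimeter)
    return values
```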
Figure 8 illustrates the histogram-based distributions of boundary tortuosity values for cropland objects in the three datasets, overlaid with kernel density estimation (KDE) curves. These results reveal that the GID-CD and JY-CD datasets exhibit tortuosity values predominantly concentrated within the range of 1.2 to 1.8, suggesting that cropland boundaries in these regions are relatively regular and compact. In contrast, the QX-CD dataset presents a broader distribution with generally higher tortuosity values. Specifically, most samples fall within the range of 1.4 to 2.0, with a significant portion extending into the high-tortuosity interval of 2.2 to 2.8, reflecting more fragmented and complex boundary structures.
Figure 9 provides violin plots that further illustrate both the distribution and dispersion of tortuosity values. Compared to the other datasets, QX-CD exhibits not only a higher median tortuosity but also a greater upper quartile and a wider overall range. This quantitatively confirms that QX-CD has the highest boundary complexity in terms of geometric morphology among the three datasets.
To investigate the influence of different λ values on model performance, each λ experiment was independently repeated three times, and the mean and standard deviation of mIoU, mF1, Kappa, and OA were reported.
The variations in these metrics with respect to λ are summarized in Figure 10, where the error bars denote the standard deviation across trials, providing a measure of performance stability. Figure 10a–c indicate that the optimal λ values are 0.4, 0.3, and 0.2 for the GID-CD, JY-CD, and QX-CD datasets, respectively. By synthesizing the boundary complexity distributions in Figure 8 with the segmentation performance variations under different λ values in Figure 10, a clear negative correlation can be observed: datasets with higher boundary tortuosity tend to achieve optimal performance at smaller λ values. For example, QX-CD exhibits the highest mean tortuosity, reflecting more fragmented and geometrically complex boundaries, and correspondingly achieves optimal performance at λ = 0.2. In contrast, GID-CD and JY-CD have relatively lower boundary complexity and perform best at larger λ values (0.4 and 0.3, respectively).
In summary, the optimal value of the boundary weight λ varies across different terrain types. For datasets with clear and regular boundaries, such as GID-CD and JY-CD, the model exhibits relatively stable performance when λ ranges from 0.2 to 0.5. However, in the mountainous QX-CD dataset, which is characterized by blurry boundaries and complex geometries, performance metrics fluctuate significantly with changes in λ, indicating that fragmented boundaries are more sensitive to boundary supervision. In this case, a smaller λ achieves better results. Overall, all three datasets exhibit a common trend where accuracy initially improves with increasing λ, reaches a peak, and then gradually declines. This suggests that overly strong boundary supervision may interfere with region-level semantic representation in the segmentation branch, particularly in terrain with ambiguous or inconsistently labeled boundaries. In such cases, the model benefits more from semantic features directly extracted from the image, rather than from boundary priors that may reflect human annotation preferences. Therefore, boundary weighting should be carefully balanced to harmonize semantic region representation and edge guidance.
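These trends follow directly from how λ enters the training objective. Below is a minimal sketch of the weighted loss, assuming a cross-entropy segmentation term and a binary cross-entropy boundary term as stand-ins for the exact losses defined in Section 3.

```python
# Sketch of the lambda-weighted objective behind the sensitivity study.
import torch
import torch.nn.functional as F

def total_loss(sem_logits: torch.Tensor,      # (B, C, H, W) class scores
               boundary_logits: torch.Tensor, # (B, 1, H, W) edge scores
               sem_gt: torch.Tensor,          # (B, H, W) integer labels
               boundary_gt: torch.Tensor,     # (B, 1, H, W) in {0, 1}
               lam: float) -> torch.Tensor:
    seg = F.cross_entropy(sem_logits, sem_gt)
    bnd = F.binary_cross_entropy_with_logits(boundary_logits, boundary_gt)
    # Small lam favors region-level semantics (fragmented terrain, QX-CD);
    # larger lam strengthens edge supervision (regular parcels, GID-CD/JY-CD).
    return seg + lam * bnd
```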
5.2. Ablation Study
To verify the effectiveness of SAM-SANet, ablation experiments were conducted on the three datasets with the aim of assessing the specific impact of each module on the overall segmentation performance. The ablation mainly focused on the SAB, the SAM branch that perceives cropland boundaries (BCB_SAM), and the BFFM and PGSM incorporated to enhance cropland boundary representation. For the BFFM and PGSM in particular, a more fine-grained ablation was performed. In addition to the cumulative ablation setting, the parameters of either the BFFM or the PGSM were frozen during training, while the remaining network components were kept trainable. With the BFFM frozen, the PGSM remains learnable, and the resulting performance difference reflects the independent contribution of the BFFM; with the PGSM frozen, the converse holds. This strategy preserves the original network architecture, avoids side effects caused by physically removing modules, and isolates the effect of each module by quantifying the performance change when its learning ability is disabled.
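In code, this freezing protocol amounts to disabling gradients for one module while the rest of the network trains normally; the module names below are hypothetical stand-ins.

```python
# Sketch of the freezing-based ablation: disable a module's learning
# without removing it, so the architecture stays identical.
import torch
import torch.nn as nn

# Hypothetical stand-ins for the modules under ablation.
model = nn.ModuleDict({"bffm": nn.Conv2d(32, 32, 3, padding=1),
                       "pgsm": nn.Conv2d(32, 32, 3, padding=1)})

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

freeze(model["bffm"])   # BFFM frozen -> remaining gain isolates the PGSM
# freeze(model["pgsm"]) # conversely, isolates the BFFM contribution

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```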
All ablation settings adopt the optimal λ values identified in the sensitivity analysis of Section 5.1 to ensure consistency across comparisons. The effect of each component on the segmentation performance is shown in Table 4, Table 5 and Table 6. The values in parentheses indicate the performance improvement over the SAB baseline.
5.2.1. Synergistic Effect of SAB and BCB
The introduction of the BCB_SAM branch results in consistent performance gains across all three datasets, suggesting the beneficial effect of boundary supervision on the model's structural representation. Compared to the model with only the SAB, on GID-CD the mIoU increased from 86.79% to 86.94% (+0.15 pp), with mF1 and Kappa improving by +0.08 and +0.07 pp, respectively. These results indicate that BCB_SAM improved the model's ability to preserve the spatial coherence and shape consistency of large-scale cropland boundaries. On JY-CD, the addition of BCB_SAM yielded gains of +0.01 pp in mIoU and +0.03 pp in mF1, with Kappa and OA increasing by +0.11 pp and +0.29 pp, respectively. Although the performance gains were modest on JY-CD, they are still meaningful given the regular and contiguous nature of cropland boundaries in flat terrain. This indicates that BCB_SAM primarily serves to enhance the SAB by refining fine-grained edge details and correcting minor boundary localization errors. On the mountainous QX-CD dataset, the improvements are more substantial: mIoU increased by +0.82 pp, mF1 by +0.64 pp, Kappa by +1.30 pp, and OA by +0.37 pp, demonstrating BCB_SAM's improved capability in capturing irregular and fragmented cropland boundaries. This enhancement effectively mitigates common issues such as broken or blurred boundaries in terraced landscapes.
Overall, incorporating BCB_SAM in parallel with the SAB substantially improves edge localization and compensates for the SAB's limitations in capturing fine-grained boundary details, particularly in rugged or low-contrast environments.
5.2.2. Effectiveness of BFFM and PGSM
To further isolate the independent contributions of the BFFM and PGSM, complementary experiments were conducted by freezing the parameters of one module at a time while keeping the other trainable. The results reveal that both the BFFM-only and PGSM-only settings outperform the configuration without either module, yet remain inferior to the fully trainable combination. By further integrating the BFFM and PGSM into the BCB_SAM branch, the model achieves substantial performance gains across all datasets, validating the importance of boundary-aware feature fusion and prompt-driven guidance in boundary modeling. On the GID-CD dataset, compared to the BCB_SAM branch alone, mIoU increased by +0.64 pp, mF1 by +0.54 pp, Kappa by +0.91 pp, and OA by +0.63 pp. This indicates that the BFFM strengthens the aggregation of multi-level boundary features in large-scale scenarios, while the PGSM guides the model to focus on critical boundary regions through appropriate prompts, thereby achieving more accurate object localization and refined boundary reconstruction. On the JY-CD dataset, mIoU improved from 90.61% to 91.17% (+0.56 pp), mF1 by +0.34 pp, Kappa by +0.62 pp, and OA by +0.27 pp. The BFFM compensated for the ViT encoder's limited capability in extracting local boundary details by aggregating multi-scale edge features, while the PGSM reinforced boundary localization by concentrating on uncertain boundary zones, enhancing boundary closure and continuity. On the mountainous QX-CD dataset, the effect was even more pronounced: mIoU, mF1, Kappa, and OA increased by +0.56 pp, +0.42 pp, +1.53 pp, and +0.35 pp, respectively. These results indicate that the BFFM provides robust structural continuity under complex terrain, and that the PGSM generates effective sparse prompts that guide the model in correcting broken or curved edges, leading to refined segmentation along irregular boundaries.
Visualizing the intermediate feature maps of the BFFM and PGSM (as shown in Figure 11) reveals distinct activation patterns across different stages of feature processing. The shallow feature maps (F_shallow), derived from early layers of the backbone, primarily capture low-level spatial structures such as textures and color variations, often exhibiting noisy and disordered responses. As the features deepen (F_deep), the representations become more semantically enriched, revealing clearer object contours and structural patterns. The fused feature map from the BFFM (F_BFFM) exhibits enhanced activation along object boundaries, highlighting boundary regions more prominently. In the PGSM, these boundary-focused fused features are further refined to generate the final prompt-related feature map (F_PGSM), where high-response regions are densely concentrated in the foreground. These salient activations are subsequently transformed into sparse prompt boxes, which are fed to the SAM mask decoder.
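One plausible realization of this activation-to-prompt step, thresholding the F_PGSM map and converting connected high-response components into XYXY box prompts, is sketched below; the PGSM's actual selection logic may differ.

```python
# Hedged sketch: turn a prompt-related activation map into sparse box
# prompts in the (x_min, y_min, x_max, y_max) format used by SAM.
import numpy as np
from skimage import measure

def activation_to_boxes(f_pgsm: np.ndarray, thresh: float = 0.5,
                        min_area: int = 64) -> np.ndarray:
    binary = f_pgsm > thresh
    labeled = measure.label(binary, connectivity=2)
    boxes = []
    for region in measure.regionprops(labeled):
        if region.area < min_area:   # drop spurious activations
            continue
        y0, x0, y1, x1 = region.bbox # skimage bbox: (min_r, min_c, max_r, max_c)
        boxes.append([x0, y0, x1, y1])
    return np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
```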
In conclusion, the joint use of BFFM and PGSM guided the model to focus on critical boundaries effectively, improving both structural awareness and prompt responsiveness. This synergy significantly enhanced boundary representation quality and overall robustness in diverse and challenging cropland landscapes.
5.2.3. Effectiveness of the PEA
To further evaluate the practical effectiveness of the proposed PEA module, a comparative experiment was conducted on the JY-CD dataset. Specifically, the full SAM-SANet model (SAB+BCB_SAM+PEA+BFFM+PGSM) was compared with its counterpart without the PEA module (SAB+BCB_SAM+BFFM+PGSM). As presented in Table 7, the PEA-enabled SAM-SANet achieves comparable segmentation accuracy while substantially reducing computational cost (in terms of GFLOPs), without introducing additional model parameters. These results highlight the critical role of the PEA module in enabling support for smaller input sizes (256 × 256 pixels), thereby facilitating practical deployment.
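For context, SAM's ViT encoder stores a 64 × 64 grid of absolute positional embeddings matched to its 1024 × 1024 input (16 × 16 patches), so a 256 × 256 input requires a 16 × 16 grid. A common baseline for bridging this gap, shown below purely as an assumption rather than the PEA's actual design, is to resize the embedding grid to the new token resolution.

```python
# Hedged sketch: resize SAM's absolute positional embeddings so the
# encoder accepts smaller inputs. Bicubic resizing is a common baseline;
# the PEA proposed in this paper is a dedicated adapter and may differ.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: int) -> torch.Tensor:
    """pos_embed: (1, H, W, C), the layout used by SAM's image encoder."""
    x = pos_embed.permute(0, 3, 1, 2)                  # -> (1, C, H, W)
    x = F.interpolate(x, size=(new_hw, new_hw),
                      mode="bicubic", align_corners=False)
    return x.permute(0, 2, 3, 1)                       # -> (1, H', W', C)

pos = torch.randn(1, 64, 64, 768)      # ViT-B grid for 1024 x 1024 inputs
pos_small = resize_pos_embed(pos, 16)  # matches 256 x 256 inputs
print(pos_small.shape)                 # torch.Size([1, 16, 16, 768])
```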
5.3. Running Efficiency
To further evaluate the computational efficiency of the proposed method, we compared different model configurations in terms of computational cost and model size. Specifically, we analyzed the impact of each module, including the semantically aware branch (SAB), the boundary-constrained branch based on SAM (BCB), the position embedding adapter (PEA), and the BFFM and PGSM, on the overall resource consumption.
As shown in Table 8, the SAB is lightweight, with a low parameter count and computational cost. In contrast, the original SAM branch (BCB) incurs a high computational cost due to its ViT backbone and large input size. Incorporating the PEA module enables SAM to process 256 × 256 inputs, which substantially reduces the computational cost. Adding the BFFM and PGSM introduces only slight overhead while providing moderate improvements in segmentation accuracy. Overall, the full model maintains a good balance between accuracy and efficiency, making it suitable for practical deployment.
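As a practical note, the parameter counts and GFLOPs reported in such comparisons can be obtained with a few lines of PyTorch; fvcore is one commonly used profiler and is assumed available here.

```python
# Sketch: measure model size (M params) and compute (GFLOPs) for a
# given input size, e.g. (1, 3, 256, 256) after the PEA is applied.
import torch
from fvcore.nn import FlopCountAnalysis

def profile(model: torch.nn.Module, input_size=(1, 3, 256, 256)):
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    flops_g = FlopCountAnalysis(model, torch.randn(input_size)).total() / 1e9
    return params_m, flops_g

net = torch.nn.Conv2d(3, 16, 3, padding=1)   # placeholder model
p, f = profile(net)
print(f"{p:.3f} M params, {f:.3f} GFLOPs")
```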
6. Conclusions
In this study, the challenges of cropland extraction accuracy and boundary preservation in high-resolution remote sensing imagery were addressed through a novel dual-branch network architecture named SAM-SANet. The objective is to jointly enhance semantic perception and boundary refinement for cropland extraction under complex terrain conditions, where field shapes may be irregular, boundaries are often blurred, and class confusion is prevalent. Specifically, SAM-SANet integrates a semantically aware branch (SAB) and a boundary-constrained SAM branch (BCB) to handle region-level semantic recognition and contour-level boundary modeling separately. These two branches are jointly optimized through a combination of segmentation and boundary loss functions to achieve collaborative learning between regional semantics and edge precision. To improve SAM's adaptability and representational power in RS scenarios, the boundary branch incorporates three key modules. The PEA addresses the incompatibility between SAM's fixed input size and the dimensions of cropland images, thereby enhancing spatial alignment and the expressiveness of positional embeddings. The BFFM and the PGSM jointly enhance the accuracy of prompt-driven segmentation for complex cropland boundary representation by aggregating multi-scale edge features and generating prompt embeddings associated with cropland boundaries. In extensive experiments and ablation studies conducted on three RS cropland datasets, each with distinct topographical characteristics, SAM-SANet achieved superior performance in preserving cropland integrity, ensuring boundary clarity, and maintaining spatial consistency. Overall, the proposed model outperformed existing mainstream methods across multiple evaluation metrics.
Despite its promising performance, SAM-SANet still faces certain limitations in effectively integrating semantic and boundary information in complex agricultural landscapes. In particular, scenarios involving fragmented plots or small irregular fields reveal localized inconsistencies between the two branches, highlighting the need for improved joint modeling strategies. Future research will focus on two key directions: (1) exploring deep collaborative optimization between the prompt mechanism and semantic–boundary feature fusion to improve the model's ability to adapt to diverse cropland morphologies, and (2) incorporating multi-temporal RS data to better capture seasonal variations and crop rotation patterns, thereby improving the model's capability and practical applicability in high-resolution cropland extraction.