1. Introduction
Soybeans are a vital grain and oil crop, and their yield is influenced by key structural traits such as pod number, lowest pod position, and spatial distribution. These traits not only reflect the plant’s yield potential but also serve as essential indicators for assessing harvest losses and optimizing operational parameters [1,2]. However, mature soybean plants typically exhibit a uniform yellowish-brown coloration, resulting in a highly homogeneous background. Pods, senescent leaves, and straw show pronounced similarities in texture and brightness, and densely clustered instances frequently obscure one another. These characteristics hinder manual surveys from efficiently and accurately capturing structural traits, thus failing to meet the high-throughput requirements of breeding phenotyping and field operations [3]. Therefore, there is an urgent need for a visual model capable of robustly identifying and separating individual pod instances in complex field environments, thereby enabling the automated and efficient acquisition of key structural traits in mature soybeans.
Existing studies on automated phenotypic extraction of soybean pods can be broadly classified, based on scene complexity, into three categories: laboratory-controlled environments, potted-plant or single-plant close-up scenarios, and real-world field settings.
In laboratory-controlled environments, researchers commonly employ artificial backgrounds (e.g., black light-absorbing cloth) to simplify visual processing and primarily focus on improving recognition accuracy. For high-throughput counting of isolated pods, Yu et al. [4] developed the lightweight PodNet, which performs efficient pod counting and localization via an enhanced decoder design. Yang et al. [5] introduced the RefinePod instance segmentation network for estimating seed counts per pod, alleviating fine-grained classification challenges through synthetic data training. To address structural phenotyping and occlusion issues in whole plants, Zhou et al. [6] integrated an attention-enhanced YOLOv5 with a search algorithm to extract traits such as plant height and branch length under black backdrops. Wu et al. [7] proposed the GenPoD framework, incorporating generative data augmentation and multi-stage transfer learning to reduce the impacts of class imbalance and foliage occlusion in on-branch pod detection. However, the stable lighting and uniform backgrounds in laboratory conditions substantially limit the transferability of these methods to complex field scenarios, making them unsuitable for direct application to mature soybean pod analysis in natural environments.
Under potted-plant or close-up single-plant conditions, existing studies primarily aim to improve pod detection and counting performance under relatively simplified backgrounds. Jia et al. [8] proposed YOLOv8n-POD, an enhanced version of YOLOv8n that incorporates a dense-block backbone to better accommodate diverse pod morphologies. Liu et al. [9] introduced SmartPod, integrating a Transformer backbone with an efficient attention mechanism to achieve stable pod detection and counting in single-plant ground-level settings. He et al. [10] embedded coordinate attention into YOLOv5 to enable accurate pod identification across multiple growth stages of potted plants. Building on this, He et al. [11] reformulated pod phenotyping as a human pose estimation task and proposed DEKR-SPrior for seed-level detection and separation of densely occluded pods under white backdrops. Xu et al. [12] developed DARFP-SD for detecting dense pods against black velvet backgrounds, improving counting accuracy in crowded scenes through deformable attention and a recursive feature pyramid. Although these studies enhance plant-structure recognition to some extent, their reliance on controlled backgrounds and relatively low-density plant conditions limits their applicability to real field environments, where mature pods exhibit high-density adhesion and substantial multi-scale variability.
In real-world field environments, existing research has primarily concentrated on object detection and density estimation. Some studies have achieved dynamic field pod detection using multi-scale attention or improved YOLO variants [13,14,15], while others have addressed high-density counting challenges through point supervision and density-map regression [16,17,18]. However, research on instance-level segmentation remains limited. Although Zhou et al. [19] investigated lightweight segmentation networks, they reported that mature soybean plants exhibit homogeneous yellowish-brown textures, weak boundaries, and high-density clustering, all of which severely compromise traditional models, leading to instance coalescence and segmentation omissions. Existing approaches therefore remain centered on detection or counting tasks and exhibit notable limitations in weak-boundary segmentation, instance decoupling in dense regions, and precise mask generation under complex field backgrounds.
Instance segmentation of mature soybean pods still lacks a unified and robust solution capable of operating reliably in complex field environments. The key challenges include (1) weak boundaries arising from the high homogeneity in texture and color among pods, straw, and senescent leaves; (2) pronounced scale and density discrepancies between upper and lower canopy regions; and (3) mask adhesion and instance overlap resulting from natural lighting variations, reflections, and occlusions. These challenges collectively constrain the robustness and reliability of existing methods in real-world field applications.
To address these challenges, this paper proposes PodFormer, an instance segmentation framework specifically designed for real-world field conditions. The main contributions of this work are summarized as follows:
A high-quality field dataset of mature soybean pods is constructed, covering diverse cultivation patterns and lighting conditions. It provides abundant samples featuring homogeneous backgrounds and densely clustered pods, thereby mitigating the scarcity of data in complex scenarios.
An Adaptive Wavelet Detail Enhancement (AWDE) module is proposed, which employs attention-weighted wavelet transformations to amplify high-frequency boundary cues and alleviate weak-boundary issues caused by homogeneous textures.
A Density-Guided Query Initialization (DGQI) module is designed, leveraging multi-scale density priors to modulate query initialization, explicitly model scale and density variations, and improve instance-awareness under non-uniform distributions.
A Mask Feedback Gated Refinement (MFGR) layer is introduced, which incorporates preceding-layer mask quality as feedback and utilizes a dual-path gating mechanism to adaptively refine queries, enabling precise instance separation under severe occlusion and adhesion.
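To make the wavelet-enhancement idea behind AWDE concrete, the following is a minimal, hypothetical sketch: a single-level 2D Haar decomposition whose high-frequency sub-bands (which carry boundary cues) are amplified by a fixed scalar gain before reconstruction. The actual AWDE module learns attention weights per sub-band end-to-end; the fixed gain here is a stand-in for those learned weights, and the single-channel NumPy formulation is purely illustrative.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar decomposition (even-sized single-channel input)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row-pair average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row-pair detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse content
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d (exact reconstruction when sub-bands are unmodified)."""
    a = np.zeros((ll.shape[0], ll.shape[1] * 2))
    d = np.zeros_like(a)
    a[:, 0::2] = ll + lh; a[:, 1::2] = ll - lh
    d[:, 0::2] = hl + hh; d[:, 1::2] = hl - hh
    x = np.zeros((a.shape[0] * 2, a.shape[1]))
    x[0::2, :] = a + d; x[1::2, :] = a - d
    return x

def enhance_details(x, gain=1.5):
    """Amplify high-frequency (boundary) sub-bands, keep low-frequency content."""
    ll, lh, hl, hh = haar2d(x)
    return ihaar2d(ll, gain * lh, gain * hl, gain * hh)
```

With `gain = 1.0` the transform reconstructs the input exactly; gains above 1 sharpen edges while leaving uniform regions untouched, which is why such a scheme helps separate pods from texturally similar backgrounds.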
3. Experiments
This section aims to systematically evaluate the performance of the proposed PodFormer on the task of instance segmentation for mature soybean pods in real-world field settings. We first describe the experimental setup, including the training environment, implementation details, and evaluation metrics. Subsequently, we validate the independent contributions of each module through a series of ablation studies. We then conduct a comprehensive comparison with representative mainstream instance segmentation methods. Finally, we assess the model’s generalization capabilities in real-world field scenarios and across different crop domains.
3.1. Experimental Setup
3.1.1. Experimental Environment and Parameter Settings
The experimental hardware environment for this study comprises an AMD EPYC 7282 processor with 250 GB of memory, running the Ubuntu 18.04.6 LTS operating system. Training utilized an NVIDIA A100 GPU (80 GB, PCIe) with CUDA version 11.3. The model was implemented using Python 3.8 and the PyTorch 2.0 deep learning framework, with all input images uniformly resized to a fixed resolution. Additional training parameters are detailed in Table 1.
To ensure a rigorous and reliable evaluation, the experiments were conducted using the self-constructed dataset described in Section 2.1.3. This dataset comprises 1510 images, which were randomly partitioned into training, validation, and test sets using a 7:2:1 split. In total, it contains approximately 55,000 annotated instances, providing a diverse and representative basis for comprehensive model evaluation.
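A 7:2:1 partition of this kind can be reproduced with a simple seeded shuffle. The sketch below is illustrative only (function name and seed are hypothetical, not the authors' splitting code):

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly partition items into train/val/test sets by the given ratios."""
    items = list(items)
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1510))
print(len(train), len(val), len(test))  # 1057 302 151
```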
3.1.2. Evaluation Metrics
In the instance segmentation task for mature soybean pods in the field, this study employed a series of evaluation metrics, including Precision, Recall, and localization accuracy (mAP50 and mAP50–95). In image-segmentation tasks, accurately assessing model performance through visual inspection alone is often challenging due to the subtle differences between segmentation results. Consequently, these metrics provide a clearer and more objective quantitative assessment of model performance. The formulas for computing these metrics are as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP denotes the number of true positive predictions, FP denotes the number of false positive predictions, and FN denotes the number of false negative predictions.

mAP (mean Average Precision) represents the mean of the Average Precision (AP) values computed for all categories, providing an overall measure of detection accuracy. mAP50 denotes the AP computed at a fixed Intersection over Union (IoU) threshold of 0.5. mAP50–95 computes the mean AP across IoU thresholds ranging from 0.5 to 0.95 at increments of 0.05, enabling a more fine-grained evaluation of model performance under varying degrees of overlap. The formula for computing mAP is as follows:

mAP = (1/N) Σ_{i=1}^{N} AP_i

where AP_i denotes the area under the precision–recall curve for category i and N represents the total number of categories.
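These definitions can be expressed as a short, self-contained sketch. Note that this is illustrative only: standard toolkits (e.g., COCO-style evaluation) compute AP by integrating the precision–recall curve over ranked predictions, whereas here the per-class AP values are taken as given inputs.

```python
def precision(tp, fp):
    """Fraction of predicted instances that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of ground-truth instances that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def mean_ap(ap_per_class):
    """mAP: mean of per-class AP values (areas under the PR curve)."""
    return sum(ap_per_class) / len(ap_per_class)

def map50_95(ap_by_threshold):
    """Average mAP over IoU thresholds 0.50, 0.55, ..., 0.95.

    ap_by_threshold maps each IoU threshold (rounded to 2 decimals)
    to the mAP computed at that threshold.
    """
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)
```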
3.2. Ablation Study
To quantitatively evaluate the effectiveness of each innovative module in PodFormer, this section presents a series of ablation experiments. Mask2Former is adopted as the baseline model, and we systematically analyze the impact of different module combinations on field pod instance segmentation performance by progressively incorporating the Adaptive Wavelet Detail Enhancement (AWDE) module, the Density-Guided Query Initialization (DGQI) module, and the Mask Feedback Gated Refinement (MFGR) layer. All ablation experiments were conducted on the self-built field-mature soybean dataset using the same training parameters and evaluation metrics as those employed in the comparative experiments. The results are summarized in Table 2, where a “✓” indicates that the corresponding module is enabled in the current configuration.
As shown in Table 2, the baseline model Mask2Former (Group A) provides basic segmentation capability but achieves the lowest performance across all metrics (mAP50 = 0.728, mAP50–95 = 0.342). This result indicates that standard models struggle to handle homogeneous backgrounds, uneven density distributions, and severe occlusions in real field environments.
After introducing each module individually (Groups B, C, and D), the model exhibits performance improvements to varying degrees. AWDE (Group B) increases mAP50 to 0.740 and mAP50–95 to 0.355, demonstrating that enhancing high-frequency boundary cues through wavelet transformation and attention mechanisms effectively suppresses false detections arising from homogeneous backgrounds such as fallen leaves and straw. DGQI (Group C) further improves mAP50 to 0.745 and substantially increases recall from 0.646 to 0.670, indicating that incorporating density priors enables better detection of pods in both sparse and densely clustered regions. MFGR (Group D) yields the highest gain among individual modules, raising mAP50 to 0.750 and mAP50–95 to 0.360, as mask-quality feedback and gated refinement mitigate feature contamination caused by adhesion and occlusion.
When modules are combined in pairs (Groups E, F, and G), performance improves across all metrics. Notably, DGQI + MFGR (Group G) achieves strong recall (0.700) and improved fine-grained segmentation performance (mAP50–95 = 0.380), demonstrating that jointly modeling density cues and occlusion refinement enhances recognition in highly clustered regions.
Finally, when all three modules are integrated (Group H, i.e., PodFormer), the model achieves the best overall performance (mAP50 = 0.795, mAP50–95 = 0.396), with improvements of 6.7% and 5.4% over the baseline, respectively. In summary, the ablation experiments validate the effectiveness and complementary roles of the three modules: AWDE addresses homogeneous backgrounds and weak boundaries, DGQI models scale and density variations, and MFGR mitigates occlusion and adhesion effects. Their synergistic interaction substantially enhances the model’s robustness and segmentation accuracy in complex field environments.
3.3. Comparative Experiments
To comprehensively validate PodFormer’s performance advantages in mature soybean pod instance segmentation, this study compares it with several representative instance segmentation methods, including YOLOv8-Seg [27], YOLOv11-Seg [28], Mask R-CNN [29], SegFormer [30], and Mask2Former [22] as the baseline model. All models were trained and evaluated under identical experimental settings on the self-built dataset, and their performance was measured using four metrics: Precision, Recall, mAP50, and mAP50–95. The detailed performance results for each method are summarized in Table 3.
As shown in Table 3, the traditional instance segmentation model Mask R-CNN exhibits moderate performance, achieving an mAP50 of 0.722. Its mAP50–95 remains low at 0.331, indicating limited capability for fine-grained segmentation at higher IoU thresholds. The Transformer-based SegFormer performs even worse, attaining an mAP50 of only 0.715. YOLOv8-Seg provides a slight improvement (mAP50 = 0.726), while its enhanced variant YOLOv11-Seg reaches 0.784; however, it still underperforms in high-precision segmentation, with an mAP50–95 of 0.380. Mask2Former, adopted as the baseline model, achieves mAP50 and mAP50–95 scores of 0.728 and 0.342, respectively, but its recall (0.646) remains relatively low, and its mAP50–95 reflects persistent limitations in high-IoU segmentation quality.
Building upon this baseline, PodFormer integrates three key modules—AWDE, DGQI, and MFGR. AWDE strengthens high-frequency boundary cues, DGQI injects scale and density priors to enhance instance detection, and MFGR improves instance separation under occlusion and adhesion. Benefiting from the synergistic interaction of these modules, PodFormer achieves a Precision of 0.837, a Recall of 0.718, an mAP50 of 0.795, and an mAP50–95 of 0.396, outperforming all comparison models. Compared with Mask2Former, PodFormer improves mAP50 and mAP50–95 by 6.7% and 5.4%, respectively, demonstrating substantial performance gains.
To visually illustrate the segmentation differences among models, Figure 6 presents representative prediction results on typical field images. Under two major challenges—homogeneous background interference and instance occlusion/adhesion—PodFormer consistently achieves more complete boundary reconstruction and clearer instance separation, demonstrating superior segmentation accuracy and robustness.
In homogeneous background scenarios (Figure 6b,d), mature pods share high similarity in color and texture with dead leaves, straw, and soil, leading to widespread false detections in contrast-dependent models. For example, YOLOv8-Seg and Mask R-CNN frequently misidentify color-similar background textures as pods due to their limited boundary modeling capability. In contrast, PodFormer (Figure 6), equipped with AWDE to adaptively enhance high-frequency edge details, achieves more accurate discrimination between pods and homogeneous backgrounds, producing cleaner mask contours and substantially reducing false detections.
In dense occlusion scenarios (Figure 6a,c), pods frequently overlap and form tightly adhered clusters, presenting highly challenging segmentation cases. All comparison models exhibit missed detections and instance mergers, particularly in heavily occluded or densely clustered regions. For instance, in Figure 6a, only PodFormer successfully identifies and separates the occluded pod, and in Figure 6c the comparison models suffer from severe missed detections. Although PodFormer does not achieve perfect segmentation under extreme density, it separates more instances with higher accuracy and substantially outperforms other methods. This advantage primarily arises from the density priors introduced by DGQI and the mask-feedback refinement of MFGR, enabling stronger instance discrimination under extreme adhesion.
In summary, both quantitative and qualitative results demonstrate that PodFormer exhibits clear advantages in feature representation and instance-level consistency modeling, particularly in challenging field scenarios such as weak boundaries, high-density clusters, and severe occlusions. These strengths provide a reliable technical foundation for precise phenotyping analysis and yield estimation based on pod-level instance segmentation.
3.4. Generalization Ability Evaluation
To further evaluate the adaptability and generalization capability of the proposed PodFormer architecture across different scenarios, two sets of generalization experiments were conducted: validation in real field settings and cross-domain evaluation on a novel wheat-ear dataset.
3.4.1. Real-World Field Scenario Generalization Testing
To evaluate the model’s robustness under real-world field conditions with variable illumination and complex backgrounds, this experiment extends beyond the controlled training environment that utilized a white background board. Multiple sets of unseen images were selected for evaluation (Figure 7), including self-collected field images without background boards (Figure 7A–D) and samples from two public datasets (Figure 7E,F). In these scenes, pods exhibit strong similarity in color and texture to withered branches, soil, weeds, and adjacent crop rows. The degree of background interference and occlusion far exceeds that of the training set, posing a rigorous challenge to the model’s robustness and domain-adaptation capability.
To ensure fair evaluation, all comparison models (YOLOv8-Seg [27], YOLOv11-Seg [28], Mask R-CNN [29], SegFormer [30], and Mask2Former [22]) were directly applied using weights trained on the white-background dataset. Qualitative results for each method are presented in Figure 7.
As shown in Figure 7, the performance of all comparison models deteriorates markedly in highly challenging real-world field scenarios. Mask R-CNN (Figure 7(4)) and SegFormer (Figure 7(5)) fail almost entirely, producing numerous misclassifications caused by complex background textures. Although YOLOv8-Seg (Figure 7(2)), YOLOv11-Seg (Figure 7(3)), and Mask2Former (Figure 7(6)) can identify some pod instances, they still generate substantial false positives, false negatives, and severe instance merging in homogeneous or densely clustered regions (Figure 7C,E,F).
In contrast, PodFormer (Figure 7(7)) demonstrates the strongest generalization capability and stability across all tested images. AWDE enhances high-frequency boundary cues and effectively suppresses complex background noise; DGQI incorporates density priors that support robust localization in extremely dense regions (Figure 7A–C,F); and MFGR mitigates feature contamination caused by occlusions, enabling clearer separation of adhered pods.
To complement the qualitative comparison, Table 4 presents the quantitative performance of all models on real-world field images without background panels. As shown in the table, all methods exhibit varying degrees of performance degradation relative to the white-background evaluation, underscoring the substantial challenges posed by heterogeneous textures, illumination variations, and severe occlusions in natural field scenes. YOLOv8-Seg and YOLOv11-Seg maintain the highest inference speeds, but their mAP50–95 scores decline markedly due to their limited boundary modeling capability. Mask R-CNN, SegFormer, and Mask2Former likewise exhibit lower Recall and reduced fine-grained segmentation accuracy, indicating that conventional models struggle to maintain robustness under complex background interference.
In contrast, PodFormer achieves the highest overall accuracy, improving mAP50 and mAP50–95 by 6.4% and 3.9%, respectively, over Mask2Former, while maintaining a competitive inference speed of 72.158 FPS. These results confirm that AWDE, DGQI, and MFGR effectively enhance boundary discrimination, density-aware instance detection, and occlusion resolution. Therefore, PodFormer demonstrates strong generalization capability and stable segmentation performance, even in challenging real-world field environments.
In summary, the generalization results demonstrate that PodFormer does not overfit to the white-background training conditions but instead exhibits strong cross-domain adaptability. By leveraging AWDE, DGQI, and MFGR, PodFormer maintains stable and accurate pod-level instance segmentation performance in real field environments, laying a solid foundation for practical deployment in agricultural applications.
3.4.2. Cross-Domain Generalization Testing
This experiment aims to evaluate whether PodFormer possesses cross-crop universality, specifically its ability to transfer to targets with completely different morphologies and background characteristics. To this end, a publicly available wheat-ear image dataset [31] was selected for cross-domain instance segmentation evaluation, comprising 760 images (523 for training, 160 for validation, and 77 for testing). This dataset imposes higher demands on the model. Wheat ears exhibit elongated, densely clustered strip-like structures that differ markedly from the clustered morphology of soybean pods. The background—comprising green leaves and dark soil—introduces substantial color and texture interference, and occlusion and adhesion between wheat ears and leaves are similarly severe. These factors jointly challenge the model’s capability in detail modeling, texture representation, and occlusion handling.
To ensure consistent evaluation, all comparison models (YOLOv8-Seg [27], YOLOv11-Seg [28], Mask R-CNN [29], SegFormer [30], and Mask2Former [22]) and the proposed PodFormer were trained and evaluated independently on the wheat-ear dataset using the same training hyperparameters described in Section 3.1. Precision, Recall, mAP50, and mAP50–95 were used to assess performance, and the results are summarized in Table 5.
As shown in Table 5, the clearer boundaries and more uniform structural characteristics of wheat ears lead all models to exhibit substantially better performance on this dataset compared with their results on the self-built soybean dataset. In this relatively easier cross-domain task, PodFormer continues to achieve the best performance, demonstrating stable generalization capability.
Among the comparison methods, the Transformer-based SegFormer performs the weakest (mAP50–95 = 0.645), indicating limited capability in modeling instance-level local details. YOLOv8-Seg, YOLOv11-Seg, and Mask R-CNN achieve comparable accuracy but still underperform in boundary refinement. Mask2Former, as the baseline model, leverages its query-driven mechanism to achieve an mAP50–95 of 0.682 on the wheat-ear dataset, outperforming the other comparison methods and illustrating the strong foundational capability of Transformer-based frameworks.
PodFormer achieves the best performance across all key metrics, attaining an mAP50 of 0.971 and an mAP50–95 of 0.706, corresponding to a 2.4% improvement over Mask2Former in mAP50–95. These results indicate that the AWDE, DGQI, and MFGR modules—originally designed for soybean field segmentation—exhibit strong task independence. Their capabilities in boundary enhancement, density modeling, and occlusion separation remain effective for wheat-ear instance segmentation, enabling PodFormer to achieve optimal cross-domain performance.
To visually illustrate the performance differences among models in the wheat-ear segmentation task, Figure 8 presents representative test samples along with the corresponding segmentation outputs of each method. The visual comparison reveals pronounced differences when handling occlusions and complex background textures. In regions obscured by leaves or containing complex background patterns (red dashed circles), YOLOv8-Seg and SegFormer exhibit noticeable false negatives. In regions affected by wheat-awn interference or densely clustered ears (yellow dashed circles), Mask R-CNN and YOLOv11-Seg frequently produce inaccurate boundaries or merge multiple instances. The baseline Mask2Former (cyan dashed circles) detects most wheat ears but still suffers from missed detections and boundary coalescence in high-density areas.
In contrast, PodFormer achieves the most accurate and consistent segmentation results across all challenging scenarios. By leveraging AWDE’s boundary enhancement capability, it effectively distinguishes wheat ears from leaves and soil backgrounds. Meanwhile, DGQI and MFGR further improve instance separation in dense and occluded regions, enabling PodFormer to generate the most complete masks with the clearest boundaries, closely aligning with ground-truth annotations.
In summary, PodFormer demonstrates clear performance advantages in cross-domain wheat-ear segmentation. These results not only validate its strong performance in soybean pod segmentation but also confirm that its core modules (AWDE, DGQI, and MFGR) possess strong generalizability and transferability. This architecture holds substantial practical value for addressing common challenges in agricultural vision tasks, including homogeneous backgrounds, uneven density, and severe occlusions.
4. Discussion
4.1. Effectiveness of the Proposed Modules
The integration of the AWDE, DGQI, and MFGR modules into the Mask2Former framework effectively addresses the primary challenges of mature soybean pod segmentation. AWDE amplifies high-frequency boundary cues while suppressing background-induced texture noise, enabling more accurate delineation of pods with weak contrast. DGQI incorporates multi-scale features and density priors into the query initialization process, thereby stabilizing instance detection across substantial scale variations and densely clustered regions. MFGR further refines mask predictions via confidence-guided feedback, enhancing instance separation under severe occlusion. Results from ablation and generalization experiments consistently validate the effectiveness of these modules. The improvements remain stable across controlled-background, real-world field, and cross-domain wheat-ear settings, highlighting the robustness and task-agnostic transferability of the proposed design.
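The gated, feedback-driven refinement described above can be sketched in miniature. The toy function below is a hypothetical simplification: scalar gate parameters `w` and `b` stand in for the learned projections of the actual MFGR layer, and queries are plain Python lists rather than transformer feature vectors. The idea it illustrates is that a confidence gate, driven by the previous layer's mask quality, decides how much of the refined representation to accept.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_query_update(query, refined, mask_quality, w=4.0, b=-2.0):
    """Blend a query with its refined version, gated by mask quality.

    High mask quality (near 1) opens the gate and lets the refined
    features through; low quality falls back on the previous query,
    protecting it from contamination by unreliable refinements.
    """
    g = sigmoid(w * mask_quality + b)  # gate value in (0, 1)
    return [g * r + (1.0 - g) * q for q, r in zip(query, refined)]
```

For example, a mask quality of 0.5 yields a gate of exactly 0.5 with these toy parameters, averaging the two paths, while higher quality shifts the output toward the refined features.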
4.2. Analysis of Model Performance
Compared with both CNN-based and Transformer-based baselines, PodFormer demonstrates significantly superior performance in accuracy, boundary preservation, and robustness to complex backgrounds. Quantitative metrics (Table 4) and qualitative visualizations (Figure 7 and Figure 8) show that the model produces more complete and consistent instance masks, reduces false merging in dense regions, and maintains stable performance under heterogeneous field textures.
A practical consideration concerns the use of high-resolution images. While such high-resolution inputs preserve fine structural details essential for annotating mature pods, they inevitably increase memory consumption and computational cost. PodFormer maintains competitive inference speed despite this burden, indicating that the proposed modules enhance feature-level efficiency and reduce representational redundancy. These characteristics are particularly valuable for downstream deployment in precision agricultural machinery.
Importantly, real-field generalization results demonstrate that the model does not overfit to the artificial white-background acquisition setting used during training. Instead, it learns transferable structural and boundary cues that remain effective under natural field conditions. This validates the experimental strategy and confirms that controlled-background acquisition does not compromise—and may even enhance—real-world applicability.
4.3. Limitations and Future Work
Several limitations warrant further investigation. First, although the use of a white background board is effective for ensuring annotation consistency and mitigating extreme pod–background homogeneity, it inevitably introduces a degree of environmental control. Expanding the training dataset with a greater volume of natural-background field images will help narrow potential domain gaps and further enhance real-world robustness.
Second, although PodFormer achieves acceptable inference speed, the computational burden introduced by high-resolution inputs and the multi-branch decoding design may restrict real-time deployment on resource-constrained agricultural machinery. Future research will therefore explore lightweight variants of the proposed modules, model compression strategies, and hardware-efficient architectures to improve deployment feasibility.
Third, despite the improvements introduced by MFGR, scenarios involving extremely dense clusters or fully overlapped pods remain challenging. Incorporating richer structural priors—such as morphological constraints, hierarchical refinement, or topological instance reasoning—may further improve segmentation performance under such extreme occlusions.
Future research will focus on (1) developing lightweight architectures suitable for embedded agricultural devices; (2) expanding real-field training datasets without background boards; (3) exploring data-efficient and weakly supervised learning paradigms to reduce annotation costs; and (4) integrating PodFormer into intelligent harvesting systems for real-time phenotyping and autonomous control.
5. Conclusions
This paper addresses the challenge of instance segmentation for mature soybean pods under field conditions characterized by homogeneous backgrounds, scale and density inhomogeneity, and severe occlusion. We propose PodFormer, an enhanced architecture built upon Mask2Former, incorporating three key modules: Adaptive Wavelet Detail Enhancement (AWDE), Density-Guided Query Initialization (DGQI), and Mask Feedback Gated Refinement (MFGR). These modules respectively tackle weak boundaries, scale and density variations, and adhesion-induced occlusions, thereby substantially strengthening feature representation and instance separation.
On our self-built field-mature soybean dataset, PodFormer achieves state-of-the-art performance across multiple evaluation metrics, including mAP50 and mAP50–95. It also demonstrates strong robustness and generalization capability on real-world field images without background boards and on a cross-domain wheat-ear dataset, validating both the universality of the proposed modules and the effectiveness of the overall architecture.
Despite these improvements, the model still depends on white-background images during training and exhibits relatively high computational complexity. Future work will focus on model lightweighting and the development of more complex, background-free field datasets to enhance its end-to-end applicability in real agricultural environments.
Overall, PodFormer provides a robust, efficient, and generalizable solution for precise instance segmentation in complex agricultural scenarios, laying a solid foundation for intelligent soybean harvesting and agronomic trait extraction.