1. Introduction
Accurate fruit segmentation is crucial for automated field management tasks in fruit cultivation, including condition monitoring [1,2], yield prediction [3], and automated harvesting [4,5,6]. Current segmentation methods are effective when there is a significant color contrast between the fruit and its background. However, these methods often falter in complex occlusion environments where leaves, branches, and fruits share similar colors, leading to misclassification of leaves as fruits, missing green fruits, or producing unclear fruit boundaries.
One of the fundamental causes lies in the significant degradation of discriminative visual features available to instance segmentation models in green fruit scenarios. Green fruits exhibit high similarity to background elements such as leaves and branches in terms of color and texture, which causes severe foreground–background mixing in the RGB feature space, particularly in fruit boundary regions. As a result, models struggle to accurately distinguish fruit targets from the background, thereby increasing the difficulty of precise fruit localization and instance segmentation.
Zhao et al. also reported similar observations in their study on instance segmentation of fruits at different maturity stages, showing that segmentation models typically achieve higher accuracy on mature fruits with clear color contrast against the background, whereas performance drops markedly on green, immature fruits [7]. In addition, complex occlusion constitutes another important factor that constrains the segmentation performance of green fruits, including occlusion by leaves, interlaced branches, and mutual occlusion among fruits. In such scenarios, the visible regions of fruits are often incomplete, leading to missing contour information for individual fruit instances. In some cases, the ground-truth fruit masks even exhibit fragmented spatial distributions, which further weakens the model’s ability to effectively capture the overall structural characteristics of fruit instances [8]. Notably, many fruits remain green throughout their prolonged immature stages, and some high-value varieties even retain green skins after ripening, such as Granny Smith apples, Green Zebra tomatoes, green-skinned figs in the Weihai region, and Keitt mangoes. Consequently, green fruit instance segmentation continues to face persistent challenges in real-world orchard environments [9].
Methods for segmenting green fruits primarily encompass traditional machine learning and deep learning techniques. In situations where the color of green fruits closely matches the background, traditional machine learning methods often augment segmentation accuracy by integrating supplementary features such as texture and shape or employing more sophisticated algorithms. For example, Lv et al. utilized Contrast Limited Adaptive Histogram Equalization (CLAHE) to obtain R-B color difference images of bagged green apples. They proposed an OTSU algorithm integrated with varying illumination regions to ensure precise segmentation [10]. Sun’s team integrated visual attention mechanisms with an improved GrabCut model to extract fruit regions, subsequently segmenting overlapping fruits using the Ncut algorithm and reconstructing fruit contours through three-point circle fitting [11]. Sun et al. accurately located fruit centers using gradient fields from depth images and combined this approach with an optimized density peak clustering algorithm for segmentation and contour fitting of green apple images [12]. Although these methods exhibit commendable performance in uncomplicated scenarios, their dependence on manually curated features or bespoke rules limits their adaptability in authentic orchard settings, thereby diminishing their practical utility.
With the continuous development of deep learning technology, many studies have begun to combine green fruit segmentation with state-of-the-art deep learning methods [13]. Zu proposed a mature green tomato segmentation method based on Mask R-CNN, combined with an automatic image acquisition technology designed to realistically simulate greenhouse robot harvesting scenarios [14]. Jia et al. optimized the FCOS detection model by adding a boundary attention module (BAM) and introduced a segmentation module to achieve green apple segmentation in orchards [15]. El Akrouchi et al. decomposed high-resolution images of green citrus trees into smaller segments using image slicing techniques, employing Multi-scale Vision Transformer version 2 (MViTv2) combined with cascade Mask R-CNN to segment the original high-resolution images [16]. However, single-modal RGB instance segmentation models lack effective spatial structural information, limiting their feature discrimination capability in scenarios with high fruit–background color similarity and severe occlusion.
To tackle the challenge of background similarity, agricultural researchers have incorporated multi-modal information, particularly depth data, which provides crucial spatial structure cues. These cues significantly improve the differentiation of targets that share similar colors with their surroundings in complex environments. Rong et al. proposed an improved YOLOv5 algorithm based on RGB-D fusion, which substantially reduced missed detections of immature tomatoes due to their color resemblance to the background [17]. Similarly, Kang et al. developed a single-stage instance segmentation model named OccluInst, which is based on RGB-D and CNN–Transformer. This model enhances the perception and localization abilities for mature broccoli through spatial structure clues provided by depth information [18].
The introduction of depth information not only effectively alleviates the limitations of single-modal RGB representations in low-contrast scenarios, but also provides intelligent agricultural machinery with spatial positional information of fruits and their surrounding environments, which is beneficial for obstacle avoidance and more precise path planning [19,20]. However, research on the application of depth information specifically for green fruit instance segmentation remains relatively limited. One important reason for this limitation lies in the scarcity of specialized datasets. Traditional depth acquisition approaches typically rely on dedicated depth sensors such as LiDAR or RGB-D cameras. Due to their high cost, complex deployment procedures, and strict calibration requirements, these sensors not only increase the difficulty of data collection but also, to some extent, hinder their large-scale adoption in agricultural production scenarios [21].
In recent years, with the continuous advancement of monocular depth estimation techniques, foundation models trained through the joint use of synthetic data and pseudo-labels generated from real-world scenes have gradually matured. These developments make it possible to predict stable and discriminative relative depth information from a single RGB image without introducing dedicated depth sensors, thereby providing a feasible technical pathway for low-cost depth information acquisition. Building upon this progress, monocular depth estimation has achieved notable success across a wide range of visual perception tasks, with its feasibility and practical value being extensively validated, particularly in autonomous driving, where it has demonstrated strong performance and application potential [22,23,24,25]. In the agricultural domain, several studies have begun to explore monocular vision-based depth perception methods for fruit localization, three-dimensional crop structure understanding, and harvesting-related tasks, and have reported promising preliminary results [26,27,28].
For typical agricultural application scenarios such as orchards, the design of perception systems must consider not only algorithmic performance metrics, but also factors including cost control, ease of deployment, and environmental adaptability. Characteristics such as dense foliage, severe occlusion, and constrained operating spaces pose significant challenges to the practical deployment of complex and high-cost multi-sensor systems [29,30]. Under this context, introducing monocular depth estimation as a “virtual depth sensor” into green fruit instance segmentation, and leveraging the relative geometric structural information it provides as a spatial prior, can enhance the model’s ability to distinguish fruits from the background under low-contrast and severe occlusion conditions, while maintaining low system cost and high deployability. Moreover, purely vision-based perception methods offer information representations that are closer to human visual perception, enabling a better simulation of human observation and decision-making processes. This characteristic holds the potential to allow intelligent agricultural machinery to reach, or even surpass, human-level performance in tasks such as growth status monitoring, yield estimation, and automated harvesting. Therefore, this study proposes a novel and low-cost depth information acquisition scheme that employs the latest large-scale monocular depth pre-trained model, Depth Anything V2 [31], to reliably estimate depth information directly from RGB images, thereby facilitating green fruit instance segmentation.
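As a minimal illustrative sketch (not the authors' exact pipeline), the relative depth map produced by a monocular model such as Depth Anything V2 can be min-max normalized per image before being consumed as a second input stream; the replication to three channels for an ImageNet-pretrained backbone is a common dual-stream choice and is an assumption here, not a detail stated in the paper:

```python
import numpy as np

def normalize_depth(depth: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Min-max normalize a relative depth map to [0, 1] per image.

    Monocular models predict relative (scale-ambiguous) depth, so
    per-image normalization keeps values comparable across frames.
    """
    d_min, d_max = float(depth.min()), float(depth.max())
    return (depth - d_min) / (d_max - d_min + eps)

def to_depth_stream(depth: np.ndarray) -> np.ndarray:
    """Replicate the normalized single-channel depth to 3 channels so it
    can feed a standard RGB-pretrained backbone (an assumption for
    illustration, not the paper's specification)."""
    d = normalize_depth(depth)
    return np.stack([d, d, d], axis=-1)

# Toy example with a synthetic 2x2 "depth map".
depth = np.array([[0.0, 1.0], [2.0, 4.0]])
stream = to_depth_stream(depth)
```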
To achieve high-precision segmentation of green fruits against complex backgrounds in agricultural scenarios, this study proposes a novel multi-modal instance segmentation framework, DepthCL-Seg, which integrates monocular depth estimation technology into the classical Mask R-CNN architecture. Specifically, the monocular depth estimation technique Depth Anything V2 is applied to the task of green fruit instance segmentation. We propose a dual-stream feature extraction structure that effectively leverages complementary features from RGB images and depth information. Additionally, a Cross-modal Complementary Fusion (CCF) module is developed to enhance feature representation in low-contrast target regions through multi-scale interactive fusion. Furthermore, a Low-contrast Adaptive Refinement (LAR) module is proposed, which employs dynamic adaptive contrastive learning to improve the model’s ability to distinguish ambiguous boundary pixels. The experimental results validate the effectiveness of DepthCL-Seg, achieving mAP scores of 74.2% on our self-constructed green fig dataset and 86.0% on the green peach dataset derived from the cleaned NinePeach public dataset, significantly surpassing current mainstream methods. The primary contributions of this study are threefold:
- (1) We propose a novel low-cost, high-performance multi-modal green fruit instance segmentation framework, termed DepthCL-Seg, which leverages monocular depth estimation technology.
- (2) A Cross-modal Complementary Fusion (CCF) module is designed to enhance feature complementarity between the RGB and depth streams via channel attention, spatial attention, and pixel-level fusion, with particular emphasis on low-contrast regions.
- (3) A Low-contrast Adaptive Refinement (LAR) module is developed, which employs dynamic adaptive contrastive learning to improve segmentation performance in ambiguous boundary regions.
3. Results
3.1. Experimental Setup
This study was conducted on the AutoDL cloud platform using a single-GPU node equipped with an NVIDIA GeForce RTX 4090 D (24 GB VRAM) and an AMD EPYC 9754 (18 vCPUs). The server ran the Ubuntu 20.04 operating system with the PyTorch 2.0.0 framework and CUDA version 11.8. We implemented our method with MMDetection v3.1.0 in Python 3.8.10 and conducted experiments on the target datasets. The batch size was set to 1, and the model was trained for a fixed number of 24 epochs using the SGD optimizer with a learning rate of 0.02 and a momentum of 0.9. The 24-epoch training schedule follows the common practice of the MMDetection framework and was further validated through empirical convergence analysis, as shown in Figure 9. Training was terminated once the predefined maximum number of epochs was reached. During training, all images were resized to 1333 × 800 while preserving their original aspect ratios, and random horizontal flip augmentation was applied. To mitigate overfitting, L2 regularization (weight decay) was employed, and training convergence was monitored based on the training loss and validation mAP.
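The 1333 × 800 keep-ratio resizing used here follows the usual MMDetection convention: the image is scaled by the largest factor that keeps the long side within 1333 px and the short side within 800 px. A minimal sketch of that size computation (our reading of the convention, not the framework's source code):

```python
def keep_ratio_size(w: int, h: int, max_long: int = 1333, max_short: int = 800):
    """Return the resized (w, h) such that the longer side <= max_long
    and the shorter side <= max_short, preserving aspect ratio."""
    scale = min(max_long / max(w, h), max_short / min(w, h))
    return round(w * scale), round(h * scale)

# A 1920x1080 frame is bounded by its long side: scale = 1333/1920,
# giving a resized shape of 1333 x 750.
print(keep_ratio_size(1920, 1080))
```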
Figure 9 illustrates the training loss curves and periodic evaluation results of the proposed DepthCL-Seg model across different datasets, with a total of 24 evaluation checkpoints during training. The results indicate that the optimization process proceeds smoothly, and the evaluation performance becomes stable in the later training stages, demonstrating satisfactory convergence behavior.
3.2. Evaluation Metrics
Instance segmentation is commonly evaluated by the precision–recall (P–R) curve, which consists of precision and recall. Precision indicates the proportion of true positive instances among predicted positives, while recall represents the proportion of actual positive instances correctly detected. Precision and recall are calculated as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN),
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
For the green fig and green peach datasets, the average precision (AP) metric was used to evaluate the model’s performance on segmenting green fruit instances. The AP can be calculated from the P–R curve using the following equation:
AP = ∫₀¹ P(R) dR,
where P(R) denotes precision at recall R. Mean average precision (mAP) for all classes is defined as the arithmetic mean of AP over all categories:
mAP = (1/N) Σ_{i=1}^{N} AP_i,
where N is the number of categories.
AP50 and AP75 denote the AP at Intersection over Union (IoU) thresholds of 0.5 and 0.75, respectively. In this experiment, since both the green fig and green peach datasets contain only one category, mAP is equal to AP. Additionally, to measure segmentation accuracy for small, medium, and large objects, the metrics AP_S, AP_M, and AP_L were also employed.
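As a concrete illustration, the integral over the P–R curve is approximated numerically in practice; the COCO-style 101-point interpolation sketched below is one standard choice (whether it exactly matches the evaluation backend used here is an assumption on our part):

```python
import numpy as np

def average_precision(recalls, precisions):
    """COCO-style AP: evaluate the interpolated precision
    (the max precision at recall >= r) at 101 recall points."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 101.0
    return ap

# A perfect detector keeps precision 1.0 at every recall level,
# so its AP is 1.0.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```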
3.3. Quantitative Comparison with Other Methods
To verify the effectiveness of the proposed DepthCL-Seg model, experiments were conducted on the fig and peach datasets, and comprehensive comparisons were made with current mainstream instance segmentation methods including Cascade R-CNN [41], HTC [42], Mask R-CNN [34], MS-RCNN [43], PointRend [44], Mask Transfiner [45], and DI-MaskDINO [46]. All experiments uniformly used ResNet-50 as the backbone network, and the training cycle was set to 24 epochs; DI-MaskDINO was trained for 50 epochs based on an analysis of its training dynamics, in order to reach a stable convergence state. All hyperparameters and data augmentation methods were kept at their default values, and the results are shown in Table 2 and Table 3.
On the fig dataset (Table 2), DepthCL-Seg achieved the best performance on all metrics. The overall AP reached 74.2%, which is 7.5 percentage points higher than Mask R-CNN’s 66.7% and 4.2 percentage points higher than the second-ranked PointRend’s 70.0%. On the large-object metric AP_L, DepthCL-Seg reached 96.8%, outperforming the second-ranked DI-MaskDINO’s 93.2% by 3.6 percentage points. Additionally, DepthCL-Seg performed particularly well in small-object segmentation (AP_S), reaching 29.9% and significantly surpassing the other models, fully reflecting the advantages of the proposed depth fusion method across targets of different scales, especially its stronger segmentation capability for challenging small-sized fruits.
To verify the segmentation performance of the model on different varieties of green fruits, experiments were also conducted on a publicly available peach dataset. The results shown in Table 3 further confirm the effectiveness of DepthCL-Seg: the model also achieved significant improvements in the green peach segmentation task. The overall AP reached 86.0%, exceeding the second-ranked PointRend’s 83.1% by 2.9 percentage points and improving over the classical Mask R-CNN’s 81.6% by 4.4 percentage points. The small-object AP_S increased from 16.1% to 22.4%, an improvement of 6.3 percentage points. The medium-scale and large-scale metrics reached 58.7% (AP_M) and 94.8% (AP_L), respectively, surpassing all comparative models. The stable performance of DepthCL-Seg across targets of different sizes demonstrates its excellent generalization ability and adaptability to complex scenes, providing robust support for precise segmentation of green fruits in real orchard environments.
The quantitative results on both datasets clearly show that DepthCL-Seg has significant advantages in green fruit instance segmentation, which is attributed to the effective utilization of monocular depth information by the CCF and LAR modules. Both modules enhance feature representation, allowing the model to achieve more accurate segmentation in complex orchard environments.
3.4. Qualitative Comparison with Other Methods
Figure 10 and Figure 11 compare the actual segmentation performance of DepthCL-Seg, existing state-of-the-art methods, and the baseline Mask R-CNN on the green fig and green peach datasets, respectively.
In the fig segmentation results shown in Figure 10, DepthCL-Seg outperformed the other models in both stability and accuracy. DI-MaskDINO clearly missed occluded fig instances located at the upper-middle position of b1, in the b3 region, and behind b4. PointRend and Mask R-CNN produced ambiguous boundaries in overlapping regions such as a2, a3, c2, c3, and c4, where masks overlapped or interweaved, causing duplicated segmentation. In contrast, DepthCL-Seg (d1–d4) accurately segmented overlapping fig regions with clear and smooth boundaries, without obvious omissions or misclassifications.
Similarly, DepthCL-Seg demonstrated significant advantages in peach instance segmentation, as shown in Figure 11. HTC and Mask R-CNN produced relatively blurred segmentation boundaries against the low-contrast background of green leaves and branches, and failed to refine occluded peach regions under severe occlusion, such as b4 and c4. Moreover, HTC exhibited duplicated segmentation in areas where fruits were occluded by branches, for example, above b3. PointRend also showed severe confusion in boundary segmentation for overlapping peach regions, including duplicated segmentation and blurred boundaries in the upper-right corner of a1. In comparison, DepthCL-Seg output smooth and complete peach masks in occluded regions, clearly and accurately delineating instance boundaries, as shown in d1–d4.
Overall, DepthCL-Seg significantly outperformed traditional models in green fruit segmentation accuracy under complex environments, improving fruit mask quality, reducing omission rates, and substantially reducing the misclassification of background elements such as leaves as fruits. This makes DepthCL-Seg highly suitable for intelligent orchard management tasks that demand precise green fruit segmentation.
3.5. Ablation Experiments
To verify the effectiveness of each module, ablation experiments were conducted on both the green fig and green peach datasets, as shown in Table 4 and Table 5 (bold indicates the highest performance). Starting from the single-branch Mask R-CNN baseline, the RefineMask head, the proposed CCF module, and the LAR module were sequentially introduced. All experiments used the same hyperparameters and data augmentation strategies to ensure a fair comparison.
As shown in Table 4, adding only the RefineMask head improved the overall AP by 4.1 percentage points, indicating that the multi-stage mask refinement module effectively improves mask accuracy. When the CCF module was further added, the AP increased by another 1.0 percentage point, with a particularly clear gain on the small-object AP_S metric, demonstrating that cross-modal fusion helps enhance feature representation in low-contrast areas. Adding only the LAR module improved the AP by 0.6 percentage points, indicating that the contrastive learning strategy effectively distinguishes foreground from background in fuzzy boundary regions. When all modules were combined, the overall AP reached 74.2%, an improvement of 7.5 percentage points over the baseline: 2.4 points higher than using only the CCF module and 2.8 points higher than using only the LAR module. On the small-object AP_S metric, the full model improved by 3.3 points over using only the CCF module and by 2.9 points over using only the LAR module.
Similar trends can be observed on the green peach dataset, as shown in Table 5. The progressive introduction of the RefineMask head, CCF, and LAR modules consistently improves segmentation performance across different object scales, with the full model achieving the highest AP of 86.0%. These results strongly demonstrate that the CCF module for fusing RGB and depth information and the contrastive learning strategy of LAR are essential for performance enhancement.
To further investigate the roles of individual modules under complex visual conditions, more fine-grained ablation analyses are conducted exclusively on the green fig dataset. This dataset exhibits more severe occlusion and more complex background interference, and therefore provides a more representative benchmark for performance analysis.
Under this setting, we further compare the impact of different monocular depth estimation models on the performance of DepthCL-Seg, as detailed in Table 6.
As shown in Table 6, Depth Anything V2 achieves the best performance across all metrics, with an overall AP 2.0 percentage points higher than Depth Pro and 2.1 percentage points higher than Marigold, with especially clear gains on the multi-scale metrics. This indicates that Depth Anything V2 provides more precise and reliable depth information, which effectively helps the model complete accurate segmentation in complex scenarios.
To investigate the impact of multimodal fusion strategies on the performance of DepthCL-Seg, we conducted a comparison of several common dual-stream feature fusion methods, including Weighted Sum, Element-wise Multiplication, Gated Fusion, and CBAM [47], with the results presented in Table 7.
As shown in Table 7, the proposed CCF module exhibits the best performance across all metrics and clearly outperforms the other fusion methods. Both Weighted Sum and Gated Fusion achieve an overall AP of 73.6%, indicating that simple weighted and gated strategies are also effective for dual-stream feature fusion. Element-wise Multiplication is slightly worse, with an AP of 73.2%, and CBAM achieves the lowest AP of 72.7%, suggesting that generic attention alone struggles to realize cross-modal complementarity. This further confirms that the CCF module can more precisely exploit the complementarity between RGB and depth features, thereby improving segmentation performance in complex scenarios.
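To make the comparison concrete, a parameter-free sketch of the channel-attention, spatial-attention, and pixel-level fusion steps is given below; the real CCF module uses learned layers, so this is only a schematic stand-in, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ccf_fuse(rgb_feat, depth_feat):
    """Schematic cross-modal complementary fusion for (C, H, W) maps.

    Channel attention reweights depth channels by their global response,
    spatial attention highlights locations where depth is informative,
    and a fixed additive combination stands in for learned pixel-level
    fusion (all choices here are illustrative assumptions).
    """
    chan = sigmoid(depth_feat.mean(axis=(1, 2)))           # (C,)
    depth_att = depth_feat * chan[:, None, None]
    spat = sigmoid(depth_att.mean(axis=0, keepdims=True))  # (1, H, W)
    rgb_att = rgb_feat * spat
    return rgb_att + depth_att

rng = np.random.default_rng(0)
rgb = rng.random((8, 4, 4))
dep = rng.random((8, 4, 4))
fused = ccf_fuse(rgb, dep)
```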
For the LAR module, this paper analyzes its effect on segmentation performance from the perspectives of sampling strategy and contrastive loss by comparing two sampling strategies, Random Sampling and Confidence-based Sampling, as shown in Table 8, and three contrastive loss functions, α-CL-direct [48], InfoNCE [49], and ADNCE, as shown in Table 9.
As indicated in Table 8, the confidence-based sampling strategy improves the overall AP from 73.8% to 74.2% compared with random sampling, with the largest gains on medium- and small-sized targets. This suggests that selecting ambiguous foreground pixels based on their confidence in the LAR module effectively improves mask quality and segmentation accuracy in complex areas.
Table 9 shows that the ADNCE contrastive loss achieves the best performance, with an overall AP of 74.2%, surpassing α-CL-direct at 72.7% and InfoNCE at 73.0%. This difference is attributed to ADNCE’s Gaussian weighting strategy for difficult negative samples, which assigns different weights according to the distance between negative samples and a preset center, significantly enhancing the model’s feature discrimination ability in low-contrast areas, thus achieving higher-quality segmentation in complex scenarios.
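The weighting idea can be sketched as an InfoNCE-style loss whose negative terms are reweighted by a Gaussian centered at μ; this is our schematic reading of an ADNCE-style objective, with all details (weight normalization, weighting form) treated as assumptions rather than the paper's exact formulation:

```python
import numpy as np

def adnce_style_loss(pos_sim, neg_sims, mu=0.7, sigma=1.0, tau=0.07):
    """InfoNCE-style loss with Gaussian-weighted negatives (sketch).

    pos_sim: similarity of the anchor to its positive (scalar).
    neg_sims: similarities of the anchor to negatives, in [-1, 1].
    Negatives near the center mu receive the largest weights,
    emphasizing hard but not extreme negatives.
    """
    neg_sims = np.asarray(neg_sims, dtype=float)
    w = np.exp(-((neg_sims - mu) ** 2) / (2.0 * sigma ** 2))
    w = w / (w.mean() + 1e-12)  # normalize weights to mean 1
    pos = np.exp(pos_sim / tau)
    neg = np.sum(w * np.exp(neg_sims / tau))
    return -np.log(pos / (pos + neg))

loss = adnce_style_loss(0.9, [0.8, 0.5, -0.2])
```

Because the negative term is strictly positive, the loss is always positive; the Gaussian weights merely shift how much each negative contributes.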
After validating the effectiveness of the sampling strategy and the contrastive loss formulation, the influence of key hyperparameters in the LAR module on segmentation performance is further examined. Sensitivity experiments are conducted for the Gaussian weighting parameters μ and σ, the temperature parameter τ in contrastive learning, and the confidence threshold used for anchor selection, with the results summarized in Table 10, Table 11, Table 12 and Table 13. In these experiments, only one hyperparameter is varied at a time, while the others are kept fixed at their default values.
The experimental results indicate that different hyperparameters have varying influences on segmentation performance. Variations in the Gaussian mean μ and the contrastive temperature τ result in relatively small changes in overall AP, with performance differences remaining within approximately 1 mAP. In contrast, adjusting the Gaussian standard deviation σ and the confidence threshold for anchor selection leads to more noticeable performance variations, in some cases exceeding 1 mAP, particularly for overall AP as well as medium and small object metrics. Nevertheless, no abrupt performance degradation is observed within the evaluated parameter ranges. The best overall performance is achieved with μ = 0.7, σ = 1.0, τ = 0.07, and a confidence threshold of 0.97, and this configuration is therefore adopted as the default setting in all experiments.
The ablation results further validate the effectiveness of the proposed CCF and LAR modules in improving the performance of DepthCL-Seg. In addition, the comparison of different monocular depth models shows that Depth Anything V2 provides the most accurate and effective depth information, further improving the instance segmentation ability of purely vision-based models in complex scenarios.
4. Discussion
The errors of the baseline single-modal RGB model on both datasets mainly stem from the strong color similarity between fruits and surrounding leaves, which leads to blurred fruit boundaries and frequent misclassification. By incorporating monocular depth estimation, the proposed Cross-modal Complementary Fusion (CCF) module effectively aligns and integrates texture information from RGB features with spatial structural cues derived from depth features. In addition, the Low-contrast Adaptive Refinement (LAR) module introduces dynamic contrastive constraints in low-confidence boundary regions. The synergy of these two modules substantially alleviates the aforementioned misclassification issues, particularly in scenarios involving dense fruit distributions, severe occlusions, and overlapping instances, resulting in more complete and smoother instance masks.
Compared with RGB-only methods that mainly rely on color and texture cues, as well as hardware-driven multi-sensor fusion approaches that depend on explicit depth sensors [14,15,16,18], the proposed DepthCL-Seg framework introduces monocular depth priors as a low-cost and flexible source of spatial structural information for instance segmentation. This design enables the model to effectively address key challenges in orchard environments, such as high color similarity between fruits and background, dense fruit clustering, and ambiguous or occluded boundaries, without introducing additional hardware burdens [29,30]. Orchard scenes are typically characterized by dense foliage and constrained operating spaces, which considerably limit the practical deployability of complex and costly multi-sensor systems. In contrast, DepthCL-Seg adopts a purely vision-based solution using a monocular RGB camera, offering clear advantages in terms of cost efficiency and environmental adaptability. As a result, it achieves a favorable balance between segmentation accuracy and practical deployability, making it more suitable for large-scale, real-world orchard perception applications [8].
Nevertheless, the proposed method also has certain limitations. Due to the introduction of an additional monocular depth estimation branch and a dual-stream backbone architecture, DepthCL-Seg incurs higher computational complexity than the RGB-only Mask R-CNN baseline. In a fully end-to-end online inference setting, where monocular depth maps are generated on the fly by Depth Anything V2 during inference, the overall processing speed reaches approximately 6.88 FPS on a single NVIDIA RTX 4090D GPU. The model parameter count increases from 43.97 M to 77.92 M, while the proposed CCF and LAR modules account for only a small portion of the total parameters (6.97 M and 0.18 M, respectively), as shown in Table 14 and Table 15.
It should be noted, however, that DepthCL-Seg is not primarily designed for ultra-low-latency robotic control tasks with stringent real-time requirements. Instead, it is better suited for orchard perception applications such as growth condition monitoring, yield estimation, and periodic field inspection, where segmentation accuracy and result stability are typically prioritized over frame-level real-time responsiveness. Under such application scenarios, the associated computational overhead is considered reasonable and acceptable.
Based on these observations, future research will further explore the potential of monocular depth information in orchard perception tasks. On the one hand, we plan to investigate more lightweight monocular depth modeling strategies tailored to agricultural scenarios, combined with architectural optimizations such as shared backbone designs and knowledge distillation, in order to further reduce computational cost while maintaining segmentation performance. On the other hand, we aim to incorporate depth calibration mechanisms to strengthen the correspondence between estimated depth values and real-world physical scales. This would enable the construction of a mapping from estimated depth to pixel area and ultimately to actual fruit volume, thereby providing more reliable spatial priors for downstream quantitative tasks such as fruit volume estimation and yield prediction.