1. Introduction
Accurate fruit segmentation is crucial for automated field management tasks in fruit cultivation, including condition monitoring [1,2], yield prediction [3], and automated harvesting [4,5,6]. Current segmentation methods are effective when there is a significant color contrast between the fruit and its background. However, these methods often falter in complex occlusion environments where leaves, branches, and fruits share similar colors, leading to misclassification of leaves as fruits, missing green fruits, or producing unclear fruit boundaries.
One of the fundamental causes lies in the significant degradation of discriminative visual features available to instance segmentation models in green fruit scenarios. Green fruits exhibit high similarity to background elements such as leaves and branches in terms of color and texture, which causes severe foreground–background mixing in the RGB feature space, particularly in fruit boundary regions. As a result, models struggle to accurately distinguish fruit targets from the background, thereby increasing the difficulty of precise fruit localization and instance segmentation.
Zhao et al. also reported similar observations in their study on instance segmentation of fruits at different maturity stages, showing that segmentation models typically achieve higher accuracy on mature fruits with clear color contrast against the background, whereas performance drops markedly on green, immature fruits [7]. In addition, complex occlusion constitutes another important factor that constrains the segmentation performance of green fruits, including occlusion by leaves, interlaced branches, and mutual occlusion among fruits. In such scenarios, the visible regions of fruits are often incomplete, leading to missing contour information for individual fruit instances. In some cases, the ground-truth fruit masks even exhibit fragmented spatial distributions, which further weakens the model’s ability to effectively capture the overall structural characteristics of fruit instances [8]. Notably, many fruits remain green throughout their prolonged immature stages, and some high-value varieties even retain green skins after ripening, such as Granny Smith apples, Green Zebra tomatoes, green-skinned figs in the Weihai region, and Keitt mangoes. Consequently, green fruit instance segmentation continues to face persistent challenges in real-world orchard environments [9].
Methods for segmenting green fruits primarily encompass traditional machine learning and deep learning techniques. In situations where the color of green fruits closely matches the background, traditional machine learning methods often augment segmentation accuracy by integrating supplementary features such as texture and shape or employing more sophisticated algorithms. For example, Lv et al. utilized Contrast Limited Adaptive Histogram Equalization (CLAHE) to obtain R-B color difference images of bagged green apples. They proposed an OTSU algorithm integrated with varying illumination regions to ensure precise segmentation [10]. Sun’s team integrated visual attention mechanisms with an improved GrabCut model to extract fruit regions, subsequently segmenting overlapping fruits using the Ncut algorithm and reconstructing fruit contours through three-point circle fitting [11]. Sun et al. accurately located fruit centers using gradient fields from depth images and combined this approach with an optimized density peak clustering algorithm for segmentation and contour fitting of green apple images [12]. Although these methods exhibit commendable performance in uncomplicated scenarios, their dependence on manually curated features or bespoke rules limits their adaptability in authentic orchard settings, thereby diminishing their practical utility.
With the continuous development of deep learning technology, many studies have begun to combine green fruit segmentation with state-of-the-art deep learning methods [13]. Zu proposed a mature green tomato segmentation method based on Mask R-CNN, combined with an automatic image acquisition technology designed to realistically simulate greenhouse robot harvesting scenarios [14]. Jia et al. optimized the FCOS detection model by adding a boundary attention module (BAM) and introduced a segmentation module to achieve green apple segmentation in orchards [15]. El Akrouchi et al. decomposed high-resolution images of green citrus trees into smaller segments using image slicing techniques, employing Multi-scale Vision Transformer version 2 (MViTv2) combined with cascade Mask R-CNN to segment the original high-resolution images [16]. However, single-modal RGB instance segmentation models lack effective spatial structural information, limiting their feature discrimination capability in scenarios with high fruit–background color similarity and severe occlusion.
To tackle the challenge of background similarity, agricultural researchers have incorporated multi-modal information, particularly depth data, which provides crucial spatial structure cues. These cues significantly improve the differentiation of targets that share similar colors with their surroundings in complex environments. Rong et al. proposed an improved YOLOv5 algorithm based on RGB-D fusion, which substantially reduced missed detections of immature tomatoes due to their color resemblance to the background [17]. Similarly, Kang et al. developed a single-stage instance segmentation model named OccluInst, which is based on RGB-D and CNN–Transformer. This model enhances the perception and localization abilities for mature broccoli through spatial structure clues provided by depth information [18].
The introduction of depth information not only effectively alleviates the limitations of single-modal RGB representations in low-contrast scenarios, but also provides intelligent agricultural machinery with spatial positional information of fruits and their surrounding environments, which is beneficial for obstacle avoidance and more precise path planning [19,20]. However, research on the application of depth information specifically for green fruit instance segmentation remains relatively limited. One important reason for this limitation lies in the scarcity of specialized datasets. Traditional depth acquisition approaches typically rely on dedicated depth sensors such as LiDAR or RGB-D cameras. Due to their high cost, complex deployment procedures, and strict calibration requirements, these sensors not only increase the difficulty of data collection but also, to some extent, hinder their large-scale adoption in agricultural production scenarios [21].
In recent years, with the continuous advancement of monocular depth estimation techniques, foundation models trained through the joint use of synthetic data and pseudo-labels generated from real-world scenes have gradually matured. These developments make it possible to predict stable and discriminative relative depth information from a single RGB image without introducing dedicated depth sensors, thereby providing a feasible technical pathway for low-cost depth information acquisition. Building upon this progress, monocular depth estimation has achieved notable success across a wide range of visual perception tasks, with its feasibility and practical value being extensively validated, particularly in autonomous driving, where it has demonstrated strong performance and application potential [22,23,24,25]. In the agricultural domain, several studies have begun to explore monocular vision-based depth perception methods for fruit localization, three-dimensional crop structure understanding, and harvesting-related tasks, and have reported promising preliminary results [26,27,28].
For typical agricultural application scenarios such as orchards, the design of perception systems must consider not only algorithmic performance metrics, but also factors including cost control, ease of deployment, and environmental adaptability. Characteristics such as dense foliage, severe occlusion, and constrained operating spaces pose significant challenges to the practical deployment of complex and high-cost multi-sensor systems [29,30]. Under this context, introducing monocular depth estimation as a “virtual depth sensor” into green fruit instance segmentation, and leveraging the relative geometric structural information it provides as a spatial prior, can enhance the model’s ability to distinguish fruits from the background under low-contrast and severe occlusion conditions, while maintaining low system cost and high deployability. Moreover, purely vision-based perception methods offer information representations that are closer to human visual perception, enabling a better simulation of human observation and decision-making processes. This characteristic holds the potential to allow intelligent agricultural machinery to reach, or even surpass, human-level performance in tasks such as growth status monitoring, yield estimation, and automated harvesting. Therefore, this study proposes a novel and low-cost depth information acquisition scheme that employs the latest large-scale monocular depth pre-trained model, Depth Anything V2 [31], to reliably estimate depth information directly from RGB images, thereby facilitating green fruit instance segmentation.
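As a minimal illustrative sketch (not the authors' exact pipeline), the relative depth map produced by a monocular model such as Depth Anything V2 can be min-max normalized per image before being consumed as a second input stream; the replication to three channels for an ImageNet-pretrained backbone is a common dual-stream choice and is an assumption here, not a detail stated in the paper:

```python
import numpy as np

def normalize_depth(depth: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Min-max normalize a relative depth map to [0, 1] per image.

    Monocular models predict relative (scale-ambiguous) depth, so
    per-image normalization keeps values comparable across frames.
    """
    d_min, d_max = float(depth.min()), float(depth.max())
    return (depth - d_min) / (d_max - d_min + eps)

def to_depth_stream(depth: np.ndarray) -> np.ndarray:
    """Replicate the normalized single-channel depth to 3 channels so it
    can feed a standard RGB-pretrained backbone (an assumption for
    illustration, not the paper's specification)."""
    d = normalize_depth(depth)
    return np.stack([d, d, d], axis=-1)

# Toy example with a synthetic 2x2 "depth map".
depth = np.array([[0.0, 1.0], [2.0, 4.0]])
stream = to_depth_stream(depth)
```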
To achieve high-precision segmentation of green fruits against complex backgrounds in agricultural scenarios, this study proposes a novel multi-modal instance segmentation framework, DepthCL-Seg, which integrates monocular depth estimation technology into the classical Mask R-CNN architecture. Specifically, the monocular depth estimation technique Depth Anything V2 is applied to the task of green fruit instance segmentation. We propose a dual-stream feature extraction structure that effectively leverages complementary features from RGB images and depth information. Additionally, a Cross-modal Complementary Fusion (CCF) module is developed to enhance feature representation in low-contrast target regions through multi-scale interactive fusion. Furthermore, a Low-contrast Adaptive Refinement (LAR) module is proposed, which employs dynamic adaptive contrastive learning to improve the model’s ability to distinguish ambiguous boundary pixels. The experimental results validate the effectiveness of DepthCL-Seg, achieving mAP scores of 74.2% on our self-constructed green fig dataset and 86.0% on the green peach dataset derived from the cleaned NinePeach public dataset, significantly surpassing current mainstream methods. The primary contributions of this study are threefold:
- (1) We propose a novel low-cost, high-performance multi-modal green fruit instance segmentation framework, termed DepthCL-Seg, which leverages monocular depth estimation technology.
- (2) A Cross-modal Complementary Fusion (CCF) module is designed to enhance feature complementarity between the RGB and depth streams via channel attention, spatial attention, and pixel-level fusion, with particular emphasis on low-contrast regions.
- (3) A Low-contrast Adaptive Refinement (LAR) module is developed, which employs dynamic adaptive contrastive learning to improve segmentation performance in ambiguous boundary regions.
3. Results
3.1. Experimental Setup
This study was conducted on the AutoDL cloud platform using a single-GPU node equipped with an NVIDIA GeForce RTX 4090 D (24 GB VRAM) and an AMD EPYC 9754 (18 vCPUs). The server ran the Ubuntu 20.04 operating system with the PyTorch 2.0.0 framework and CUDA version 11.8. We implemented our method with MMDetection v3.1.0 in Python 3.8.10 and conducted experiments on the target datasets. The batch size was set to 1, and the model was trained for a fixed number of 24 epochs using the SGD optimizer with a learning rate of 0.02 and a momentum of 0.9. The 24-epoch training schedule follows the common practice of the MMDetection framework and was further validated through empirical convergence analysis, as shown in Figure 9. Training was terminated once the predefined maximum number of epochs was reached. During training, all images were resized to 1333 × 800 while preserving their original aspect ratios, and random horizontal flip augmentation was applied. To mitigate overfitting, L2 regularization (weight decay) was employed, and training convergence was monitored based on the training loss and validation mAP.
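The 1333 × 800 keep-ratio resizing used here follows the usual MMDetection convention: the image is scaled by the largest factor that keeps the long side within 1333 px and the short side within 800 px. A minimal sketch of that size computation (our reading of the convention, not the framework's source code):

```python
def keep_ratio_size(w: int, h: int, max_long: int = 1333, max_short: int = 800):
    """Return the resized (w, h) such that the longer side <= max_long
    and the shorter side <= max_short, preserving aspect ratio."""
    scale = min(max_long / max(w, h), max_short / min(w, h))
    return round(w * scale), round(h * scale)

# A 1920x1080 frame is bounded by its long side: scale = 1333/1920,
# giving a resized shape of 1333 x 750.
print(keep_ratio_size(1920, 1080))
```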
Figure 9 illustrates the training loss curves and periodic evaluation results of the proposed DepthCL-Seg model across different datasets, with a total of 24 evaluation checkpoints during training. The results indicate that the optimization process proceeds smoothly, and the evaluation performance becomes stable in the later training stages, demonstrating satisfactory convergence behavior.
3.2. Evaluation Metrics
Instance segmentation is commonly evaluated by the precision–recall (P–R) curve, which consists of precision and recall. Precision indicates the proportion of true positive instances among predicted positives, while recall represents the proportion of actual positive instances correctly detected. Precision and recall are calculated as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN),
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
For the green fig and green peach datasets, the average precision (AP) metric was used to evaluate the model’s performance on segmenting green fruit instances. The AP can be calculated from the P–R curve using the following equation:
AP = ∫₀¹ P(R) dR,
where P(R) denotes precision at recall R. Mean average precision (mAP) for all classes is defined as the arithmetic mean of AP over all categories:
mAP = (1/N) Σ_{i=1}^{N} AP_i,
where N is the number of categories.
AP50 and AP75 denote the AP at Intersection over Union (IoU) thresholds of 0.5 and 0.75, respectively. In this experiment, since both the green fig and green peach datasets contain only one category, mAP is equal to AP. Additionally, to measure segmentation accuracy for small, medium, and large objects, the metrics AP_S, AP_M, and AP_L were also employed.
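As a concrete illustration, the integral over the P–R curve is approximated numerically in practice; the COCO-style 101-point interpolation sketched below is one standard choice (whether it exactly matches the evaluation backend used here is an assumption on our part):

```python
import numpy as np

def average_precision(recalls, precisions):
    """COCO-style AP: evaluate the interpolated precision
    (the max precision at recall >= r) at 101 recall points."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 101.0
    return ap

# A perfect detector keeps precision 1.0 at every recall level,
# so its AP is 1.0.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```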
3.3. Quantitative Comparison with Other Methods
To verify the effectiveness of the proposed DepthCL-Seg model, experiments were conducted on the fig and peach datasets, and comprehensive comparisons were made with current mainstream instance segmentation methods including Cascade R-CNN [41], HTC [42], Mask R-CNN [34], MS-RCNN [43], PointRend [44], Mask Transfiner [45], and DI-MaskDINO [46]. All experiments uniformly used ResNet-50 as the backbone network, and the training cycle was set to 24 epochs; DI-MaskDINO was trained for 50 epochs based on an analysis of its training dynamics, in order to reach a stable convergence state. All hyperparameters and data augmentation methods were kept at their default values, and the results are shown in Table 2 and Table 3.
On the fig dataset (Table 2), DepthCL-Seg achieved the best performance on all metrics. The overall AP reached 74.2%, which is 7.5 percentage points higher than Mask R-CNN’s 66.7% and 4.2 percentage points higher than the second-ranked PointRend’s 70.0%. On the large-object metric AP_L, DepthCL-Seg reached 96.8%, outperforming the second-ranked DI-MaskDINO’s 93.2% by 3.6 percentage points. Additionally, DepthCL-Seg performed particularly well in small-object segmentation (AP_S), reaching 29.9% and significantly surpassing the other models, fully reflecting the advantages of the proposed depth fusion method across targets of different scales, especially its stronger segmentation capability for challenging small-sized fruits.
To verify the segmentation performance of the model on different varieties of green fruits, experiments were also conducted on a publicly available peach dataset. The results shown in Table 3 further confirm the effectiveness of DepthCL-Seg: the model also achieved significant improvements in the green peach segmentation task. The overall AP reached 86.0%, exceeding the second-ranked PointRend’s 83.1% by 2.9 percentage points and improving over the classical Mask R-CNN’s 81.6% by 4.4 percentage points. The small-object AP_S increased from 16.1% to 22.4%, an improvement of 6.3 percentage points. The medium-scale and large-scale metrics reached 58.7% (AP_M) and 94.8% (AP_L), respectively, surpassing all comparative models. The stable performance of DepthCL-Seg across targets of different sizes demonstrates its excellent generalization ability and adaptability to complex scenes, providing robust support for precise segmentation of green fruits in real orchard environments.
The quantitative results on both datasets clearly show that DepthCL-Seg has significant advantages in green fruit instance segmentation, which is attributed to the effective utilization of monocular depth information by the CCF and LAR modules. Both modules enhance feature representation, allowing the model to achieve more accurate segmentation in complex orchard environments.
3.4. Qualitative Comparison with Other Methods
Figure 10 and Figure 11 compare the actual segmentation performance of DepthCL-Seg, existing state-of-the-art methods, and the baseline Mask R-CNN on the green fig and green peach datasets, respectively.
In the fig segmentation results shown in Figure 10, DepthCL-Seg outperformed the other models in both stability and accuracy. DI-MaskDINO clearly missed occluded fig instances located at the upper-middle position of b1, in the b3 region, and behind b4. PointRend and Mask R-CNN produced ambiguous boundaries in overlapping regions such as a2, a3, c2, c3, and c4, where masks overlapped or interweaved, causing duplicated segmentation. In contrast, DepthCL-Seg (d1–d4) accurately segmented overlapping fig regions with clear and smooth boundaries, without obvious omissions or misclassifications.
Similarly, DepthCL-Seg demonstrated significant advantages in peach instance segmentation, as shown in Figure 11. HTC and Mask R-CNN produced relatively blurred segmentation boundaries against the low-contrast background of green leaves and branches, and failed to refine occluded peach regions under severe occlusion, such as b4 and c4. Moreover, HTC exhibited duplicated segmentation in areas where fruits were occluded by branches, for example, above b3. PointRend also showed severe confusion in boundary segmentation for overlapping peach regions, including duplicated segmentation and blurred boundaries in the upper-right corner of a1. In comparison, DepthCL-Seg output smooth and complete peach masks in occluded regions, clearly and accurately delineating instance boundaries, as shown in d1–d4.
Overall, DepthCL-Seg significantly outperformed traditional models in green fruit segmentation accuracy under complex environments, improving fruit mask quality, reducing omission rates, and substantially reducing the misclassification of background elements such as leaves as fruits. This makes DepthCL-Seg highly suitable for intelligent orchard management tasks that demand precise green fruit segmentation.
3.5. Ablation Experiments
To verify the effectiveness of each module, ablation experiments were conducted on both the green fig and green peach datasets, as shown in Table 4 and Table 5 (bold indicates the highest performance). Starting from the single-branch Mask R-CNN baseline, the RefineMask head, the proposed CCF module, and the LAR module were sequentially introduced. All experiments used the same hyperparameters and data augmentation strategies to ensure a fair comparison.
As shown in Table 4, adding only the RefineMask head improved the overall AP by 4.1 percentage points, indicating that the multi-stage mask refinement module effectively improves mask accuracy. When the CCF module was further added, the AP increased by another 1.0 percentage point, with a particularly clear gain on the small-object AP_S metric, demonstrating that cross-modal fusion helps enhance feature representation in low-contrast areas. Adding only the LAR module improved the AP by 0.6 percentage points, indicating that the contrastive learning strategy effectively distinguishes foreground from background in fuzzy boundary regions. When all modules were combined, the overall AP reached 74.2%, an improvement of 7.5 percentage points over the baseline: 2.4 points higher than using only the CCF module and 2.8 points higher than using only the LAR module. On the small-object AP_S metric, the full model improved by 3.3 points over using only the CCF module and by 2.9 points over using only the LAR module.
Similar trends can be observed on the green peach dataset, as shown in Table 5. The progressive introduction of the RefineMask head, CCF, and LAR modules consistently improves segmentation performance across different object scales, with the full model achieving the highest AP of 86.0%. These results strongly demonstrate that the CCF module for fusing RGB and depth information and the contrastive learning strategy of LAR are essential for performance enhancement.
To further investigate the roles of individual modules under complex visual conditions, more fine-grained ablation analyses are conducted exclusively on the green fig dataset. This dataset exhibits more severe occlusion and more complex background interference, and therefore provides a more representative benchmark for performance analysis.
Under this setting, we further compare the impact of different monocular depth estimation models on the performance of DepthCL-Seg, as detailed in Table 6.
As shown in Table 6, Depth Anything V2 achieves the best performance across all metrics, with an overall AP 2.0 percentage points higher than Depth Pro and 2.1 percentage points higher than Marigold, with especially clear gains on the multi-scale metrics. This indicates that Depth Anything V2 provides more precise and reliable depth information, which effectively helps the model complete accurate segmentation in complex scenarios.
To investigate the impact of multimodal fusion strategies on the performance of DepthCL-Seg, we conducted a comparison of several common dual-stream feature fusion methods, including Weighted Sum, Element-wise Multiplication, Gated Fusion, and CBAM [47], with the results presented in Table 7.
As shown in Table 7, the proposed CCF module exhibits the best performance across all metrics and clearly outperforms the other fusion methods. Both Weighted Sum and Gated Fusion achieve an overall AP of 73.6%, indicating that simple weighted and gated strategies are also effective for dual-stream feature fusion. Element-wise Multiplication is slightly worse, with an AP of 73.2%, and CBAM achieves the lowest AP of 72.7%, suggesting that generic attention alone struggles to realize cross-modal complementarity. This further confirms that the CCF module can more precisely exploit the complementarity between RGB and depth features, thereby improving segmentation performance in complex scenarios.
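To make the comparison concrete, a parameter-free sketch of the channel-attention, spatial-attention, and pixel-level fusion steps is given below; the real CCF module uses learned layers, so this is only a schematic stand-in, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ccf_fuse(rgb_feat, depth_feat):
    """Schematic cross-modal complementary fusion for (C, H, W) maps.

    Channel attention reweights depth channels by their global response,
    spatial attention highlights locations where depth is informative,
    and a fixed additive combination stands in for learned pixel-level
    fusion (all choices here are illustrative assumptions).
    """
    chan = sigmoid(depth_feat.mean(axis=(1, 2)))           # (C,)
    depth_att = depth_feat * chan[:, None, None]
    spat = sigmoid(depth_att.mean(axis=0, keepdims=True))  # (1, H, W)
    rgb_att = rgb_feat * spat
    return rgb_att + depth_att

rng = np.random.default_rng(0)
rgb = rng.random((8, 4, 4))
dep = rng.random((8, 4, 4))
fused = ccf_fuse(rgb, dep)
```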
For the LAR module, this paper analyzes its effect on segmentation performance from the perspectives of sampling strategy and contrastive loss by comparing two sampling strategies, Random Sampling and Confidence-based Sampling, as shown in Table 8, and three contrastive loss functions, α-CL-direct [48], InfoNCE [49], and ADNCE, as shown in Table 9.
As indicated in Table 8, the confidence-based sampling strategy improves the overall AP from 73.8% to 74.2% compared with random sampling, with the largest gains on medium- and small-sized targets. This suggests that selecting ambiguous foreground pixels based on their confidence in the LAR module effectively improves mask quality and segmentation accuracy in complex areas.
Table 9 shows that the ADNCE contrastive loss achieves the best performance, with an overall AP of 74.2%, surpassing α-CL-direct at 72.7% and InfoNCE at 73.0%. This difference is attributed to ADNCE’s Gaussian weighting strategy for difficult negative samples, which assigns different weights according to the distance between negative samples and a preset center, significantly enhancing the model’s feature discrimination ability in low-contrast areas, thus achieving higher-quality segmentation in complex scenarios.
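The weighting idea can be sketched as an InfoNCE-style loss whose negative terms are reweighted by a Gaussian centered at μ; this is our schematic reading of an ADNCE-style objective, with all details (weight normalization, weighting form) treated as assumptions rather than the paper's exact formulation:

```python
import numpy as np

def adnce_style_loss(pos_sim, neg_sims, mu=0.7, sigma=1.0, tau=0.07):
    """InfoNCE-style loss with Gaussian-weighted negatives (sketch).

    pos_sim: similarity of the anchor to its positive (scalar).
    neg_sims: similarities of the anchor to negatives, in [-1, 1].
    Negatives near the center mu receive the largest weights,
    emphasizing hard but not extreme negatives.
    """
    neg_sims = np.asarray(neg_sims, dtype=float)
    w = np.exp(-((neg_sims - mu) ** 2) / (2.0 * sigma ** 2))
    w = w / (w.mean() + 1e-12)  # normalize weights to mean 1
    pos = np.exp(pos_sim / tau)
    neg = np.sum(w * np.exp(neg_sims / tau))
    return -np.log(pos / (pos + neg))

loss = adnce_style_loss(0.9, [0.8, 0.5, -0.2])
```

Because the negative term is strictly positive, the loss is always positive; the Gaussian weights merely shift how much each negative contributes.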
After validating the effectiveness of the sampling strategy and the contrastive loss formulation, the influence of key hyperparameters in the LAR module on segmentation performance is further examined. Sensitivity experiments are conducted for the Gaussian weighting parameters μ and σ, the temperature parameter τ in contrastive learning, and the confidence threshold used for anchor selection, with the results summarized in Table 10, Table 11, Table 12 and Table 13. In these experiments, only one hyperparameter is varied at a time, while the others are kept fixed at their default values.
The experimental results indicate that different hyperparameters have varying influences on segmentation performance. Variations in the Gaussian mean μ and the contrastive temperature τ result in relatively small changes in overall AP, with performance differences remaining within approximately 1 mAP. In contrast, adjusting the Gaussian standard deviation σ and the confidence threshold for anchor selection leads to more noticeable performance variations, in some cases exceeding 1 mAP, particularly for overall AP as well as medium and small object metrics. Nevertheless, no abrupt performance degradation is observed within the evaluated parameter ranges. The best overall performance is achieved with μ = 0.7, σ = 1.0, τ = 0.07, and a confidence threshold of 0.97, and this configuration is therefore adopted as the default setting in all experiments.
The ablation results further validate the effectiveness of the proposed CCF and LAR modules in improving the performance of DepthCL-Seg. In addition, the comparison of different monocular depth models shows that Depth Anything V2 provides the most accurate and effective depth information, further improving the instance segmentation ability of purely vision-based models in complex scenarios.
4. Discussion
The errors of the baseline single-modal RGB model on both datasets mainly stem from the strong color similarity between fruits and surrounding leaves, which leads to blurred fruit boundaries and frequent misclassification. By incorporating monocular depth estimation, the proposed Cross-modal Complementary Fusion (CCF) module effectively aligns and integrates texture information from RGB features with spatial structural cues derived from depth features. In addition, the Low-contrast Adaptive Refinement (LAR) module introduces dynamic contrastive constraints in low-confidence boundary regions. The synergy of these two modules substantially alleviates the aforementioned misclassification issues, particularly in scenarios involving dense fruit distributions, severe occlusions, and overlapping instances, resulting in more complete and smoother instance masks.
Compared with RGB-only methods that mainly rely on color and texture cues, as well as hardware-driven multi-sensor fusion approaches that depend on explicit depth sensors [14,15,16,18], the proposed DepthCL-Seg framework introduces monocular depth priors as a low-cost and flexible source of spatial structural information for instance segmentation. This design enables the model to effectively address key challenges in orchard environments, such as high color similarity between fruits and background, dense fruit clustering, and ambiguous or occluded boundaries, without introducing additional hardware burdens [29,30]. Orchard scenes are typically characterized by dense foliage and constrained operating spaces, which considerably limit the practical deployability of complex and costly multi-sensor systems. In contrast, DepthCL-Seg adopts a purely vision-based solution using a monocular RGB camera, offering clear advantages in terms of cost efficiency and environmental adaptability. As a result, it achieves a favorable balance between segmentation accuracy and practical deployability, making it more suitable for large-scale, real-world orchard perception applications [8].
Nevertheless, the proposed method also has certain limitations. Due to the introduction of an additional monocular depth estimation branch and a dual-stream backbone architecture, DepthCL-Seg incurs higher computational complexity than the RGB-only Mask R-CNN baseline. In a fully end-to-end online inference setting, where monocular depth maps are generated on the fly by Depth Anything V2 during inference, the overall processing speed reaches approximately 6.88 FPS on a single NVIDIA RTX 4090D GPU. The model parameter count increases from 43.97 M to 77.92 M, while the proposed CCF and LAR modules account for only a small portion of the total parameters (6.97 M and 0.18 M, respectively), as shown in Table 14 and Table 15.
It should be noted, however, that DepthCL-Seg is not primarily designed for ultra-low-latency robotic control tasks with stringent real-time requirements. Instead, it is better suited for orchard perception applications such as growth condition monitoring, yield estimation, and periodic field inspection, where segmentation accuracy and result stability are typically prioritized over frame-level real-time responsiveness. Under such application scenarios, the associated computational overhead is considered reasonable and acceptable.
Based on these observations, future research will further explore the potential of monocular depth information in orchard perception tasks. On the one hand, we plan to investigate more lightweight monocular depth modeling strategies tailored to agricultural scenarios, combined with architectural optimizations such as shared backbone designs and knowledge distillation, in order to further reduce computational cost while maintaining segmentation performance. On the other hand, we aim to incorporate depth calibration mechanisms to strengthen the correspondence between estimated depth values and real-world physical scales. This would enable the construction of a mapping from estimated depth to pixel area and ultimately to actual fruit volume, thereby providing more reliable spatial priors for downstream quantitative tasks such as fruit volume estimation and yield prediction.