Article

Research on a Method for Identifying and Localizing Goji Berries Based on Binocular Stereo Vision Technology

Juntao Shi, Changyong Li, Zehui Zhao and Shunchun Zhang
College of Mechanical Engineering, Xinjiang University, Urumqi 830047, China
* Author to whom correspondence should be addressed.
AgriEngineering 2026, 8(1), 6; https://doi.org/10.3390/agriengineering8010006
Submission received: 6 November 2025 / Revised: 11 December 2025 / Accepted: 14 December 2025 / Published: 1 January 2026

Abstract

To address the issue of low depth estimation accuracy in complex goji berry orchards, this paper proposes a method for identifying and locating goji berries that combines the YOLO-VitBiS object detection network with stereo vision technology. Based on the YOLO11n backbone network, the C3K2 module in the backbone is first improved using the AdditiveBlock module to enhance its detail-capturing capability in complex environments. The AdditiveBlock introduces lightweight long-range interactions via residual additive operations, thereby strengthening global context modeling without significantly increasing computation. Subsequently, a weighted bidirectional feature pyramid network is introduced into the Neck to enable more flexible and efficient feature fusion. Finally, a lightweight shared detail-enhanced detection head is proposed to further reduce the network’s computational complexity and parameter count. The enhanced model is integrated with binocular stereo vision technology, employing the CREStereo depth estimation algorithm for disparity calculation during binocular stereo matching to derive the three-dimensional spatial coordinates of the goji berry target. This approach enables efficient and precise positioning. Experimental results demonstrate that the YOLO-VitBiS model achieves a detection accuracy of 96.6%, with a model size of 4.3 MB and only 1.856 M parameters. Compared to the traditional SGBM method and other deep learning approaches such as UniMatch, the CREStereo algorithm generates superior depth maps under complex conditions. Within a distance range of 400 mm to 1000 mm, the average relative error between the estimated and actual depth measurements is 2.42%, meeting the detection and ranging accuracy requirements for field operations and providing reliable recognition and localization support for subsequent goji berry harvesting robots.

1. Introduction

Goji berries are a specialized cash crop in northwestern China, with Ningxia, Gansu, and Xinjiang being particularly renowned for their cultivation. In recent years, driven by rapidly expanding market demand, the area devoted to goji berry cultivation has continued to grow [1]. However, harvesting remains reliant on traditional manual labor, which is both physically demanding and inefficient. This is particularly acute during the brief ripening window, when delayed picking adversely affects both quality and yield [2,3,4].
With the rapid advancement of computer vision in agricultural harvesting, traditional two-dimensional object detection cannot meet the demands of contemporary intelligent agriculture. Stereo vision technology, capable of acquiring three-dimensional information about objects, offers a novel technical pathway for intelligent agricultural harvesting [5,6,7]. In the realm of object detection, the YOLO series of algorithms has garnered significant attention due to its outstanding real-time performance and detection accuracy [8]. In the field of stereo matching, traditional stereo matching methods, such as SGM [9], which optimizes disparity through global energy minimization, and ORB [10], which combines FAST keypoints and BRIEF descriptors for efficient feature matching, have shown limited effectiveness in complex agricultural environments due to their sensitivity to occlusion, scale variation, and dense object clustering. Deep learning-based stereo matching approaches, such as GC-Net [11], which aggregates contextual information by constructing a three-dimensional cost volume, and PSMNet [12], employing a pyramid structure to process multi-scale features, enhance matching accuracy in complex environments. Transformer-based methods [13] improve long-range feature matching capabilities but incur high computational costs, rendering them unsuitable for edge deployment. RAFT-Stereo [14] and CREStereo [15] achieve superior performance in complex scenes by leveraging cascaded recurrent networks and adaptive correlation layers to enhance depth estimation accuracy and handle occlusions effectively.
Currently, some studies have combined object detection with stereo vision for target localization. Tang et al. [16] also noted in their review the need for integrated systems combining detection and localization capabilities. Li J. et al. [17] proposed a non-invasive grading method for Chinese mitten crabs based on machine vision and deep learning, constructing a vision-based carapace segmentation and fatness estimation model and integrating multi-view (including binocular) images for quality estimation. Li S. et al. [18] studied a binocular-vision and improved YOLOv8s based system for dynamic object detection and localization on cattle farms, constructing a lightweight binocular detection + 3D positioning model for livestock without wearable devices. Wang et al. [19] employed YOLOv4-Tiny with a ZED 2 stereo camera for 3D reconstruction to obtain the 3D coordinates of pixels in the current scene. This approach also computes the distance between each potted plant’s centre and the stereo camera’s optical centre, enabling flower-variety identification. However, most existing research focuses on the recognition and localization of single objects. Goji plants typically grow densely, with fruits and foliage heavily occluding one another. Variations in natural lighting significantly impair the performance of conventional stereo-matching algorithms, increasing depth-estimation errors and hindering individual fruit recognition. This is precisely the challenge widely acknowledged within the field of berry detection. Furthermore, the small size and dense distribution of goji berries make per-fruit path planning inefficient, failing to meet the demands of large-scale field operations. Automated recognition and three-dimensional localization of goji berries are not only a technical problem, but also directly support harvest management and production decision-making.
Field research has revealed that although goji berries grow densely, their overall distribution conforms to a cluster-like arrangement. Therefore, the identification of individual goji berries can be transformed into the recognition of branches bearing dense clusters of fruit. Based on this idea, this paper proposes shifting the focus from identifying individual goji berries to recognizing clusters or branches. Cluster recognition proves more suitable for subsequent harvesting requirements. Consequently, a goji berry identification and localization system combining the YOLOv11n object detection algorithm with stereo vision technology is proposed. By detecting fruit clusters on branches and estimating their three-dimensional positions, the system guides harvesting robots to plan collision-free, efficient picking trajectories. It prioritizes branches with higher fruit density and minimizes redundant movements of the end effector.

2. Data Acquisition

2.1. Data Set Capture

The goji berry data utilized in this study were collected from the goji berry plantation in the Eighth Brigade of Toli Town, Jinghe County, Bortala Mongol Autonomous Prefecture, Xinjiang Uygur Autonomous Region. Sampling occurred at 9:30 am, 12:30 pm, 2:30 pm, 5:30 pm, and 7:30 pm. The images were captured using a Realme GT Neo5 device, with the lens positioned 0.4 m to 1.2 m from the goji berries. Photographs were captured under varying light conditions, plant postures, and levels of shading. Given the short growth cycle of goji berries and the fact that data collection primarily occurs during the harvesting period, and in accordance with local orchard management protocols, only berries within the harvestable maturity range are considered valid targets for testing. A total of 1190 goji berry images at 3072 × 3072 pixels were collected. Images blurred by shooting conditions or weather were discarded, yielding 1100 clear goji berry images. Representative examples are shown in Figure 1. Subsequently, the LabelImg tool was employed for image annotation. Harvestable goji berry instances were annotated with axis-aligned bounding boxes in the Pascal VOC format. Each bounding box was drawn to tightly enclose the visible ripe berry or compact cluster of ripe berries that can be removed in a single picking action. Unlike robotic harvesting systems that rely on precisely cutting the stem or pedicel, commercial goji berry harvesting in the target orchards is typically performed using vibration-based mechanisms that detach ripe berries while preserving the branches for subsequent flowering and fruiting. Consequently, the perception task in this work focuses on localizing the harvestable fruit clusters rather than explicitly delineating an individual cutting point or pedicel.

2.2. Data Set Augmentation

Following manual annotation, a final dataset of 1100 annotated images was obtained. These annotated images were randomly allocated in an 8:1:1 ratio to form the training set (880 images), validation set (110 images), and test set (110 images). To ensure sample diversity and prevent model overfitting, Gaussian noise and salt-and-pepper noise were introduced alongside conventional data augmentation techniques [20,21]. These noise types simulate the image blurring and detail loss caused by airborne particulate interference during the sandstorm conditions prevalent in the arid, wind-prone climate of northwestern regions. In summary, the training set was randomly augmented using the operations illustrated in Figure 2: horizontal and vertical mirroring, rotation, cropping, translation, Gaussian and salt-and-pepper noise, and brightness adjustment. This yielded a total of 2600 augmented images.
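To make the noise-based augmentation concrete, the short NumPy/OpenCV sketch below adds Gaussian and salt-and-pepper noise to an image. The noise standard deviation and corruption ratio are illustrative assumptions, not the exact settings used for the dataset in this study.

```python
import cv2
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `sigma` (assumed value)."""
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(img: np.ndarray, ratio: float = 0.02) -> np.ndarray:
    """Corrupt a fraction `ratio` of pixels with salt (255) or pepper (0) values."""
    noisy = img.copy()
    h, w = img.shape[:2]
    n = int(ratio * h * w)
    ys, xs = np.random.randint(0, h, n), np.random.randint(0, w, n)
    noisy[ys[: n // 2], xs[: n // 2]] = 255   # salt pixels
    noisy[ys[n // 2:], xs[n // 2:]] = 0       # pepper pixels
    return noisy

if __name__ == "__main__":
    # Synthetic gray image as a stand-in; in practice a dataset image would be loaded here.
    image = np.full((256, 256, 3), 127, np.uint8)
    cv2.imwrite("noisy_gauss.png", add_gaussian_noise(image))
    cv2.imwrite("noisy_sp.png", add_salt_pepper_noise(image))
```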

3. Method

3.1. Testing Procedure

The overall testing process is illustrated in Figure 3. First, using the Realme GT Neo5 smartphone, raw images of goji berries in various states were captured. Following the acquisition, the goji berry targets were annotated using the LabelImg tool. Upon completion of all image annotation, the dataset was partitioned into training, validation, and test sets. Data augmentation was applied to the training set, which was then fed into an enhanced YOLO-VitBiS object detection network for training to generate weight files for goji berry detection. Concurrently, the stereo camera calibration tool in MATLAB R2023b was employed to calibrate the binocular camera system, yielding parameters including the camera’s intrinsic parameter matrix, distortion coefficients, rotation matrix, and translation vector. Following calibration, the OpenCV epipolar rectification algorithm was applied to rectify the binocular images, ensuring precise alignment of the left and right image pairs on the same plane. This established a high-precision foundation for subsequent depth estimation.
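For readers reproducing the rectification step in Python, the sketch below shows how calibration parameters of the kind listed in Table 1 can be turned into rectification maps with OpenCV. The numerical values and image buffers are placeholders rather than the actual calibration results or field images.

```python
import cv2
import numpy as np

# Placeholder intrinsics/extrinsics in the format produced by stereo calibration
# (values are illustrative, not the exact calibration results reported in Table 1).
K_left = np.array([[710.6, 0, 631.8], [0, 711.0, 372.9], [0, 0, 1]], dtype=np.float64)
K_right = np.array([[707.5, 0, 649.6], [0, 708.2, 361.9], [0, 0, 1]], dtype=np.float64)
dist_left = np.zeros(5)                    # distortion coefficients (k1, k2, p1, p2, k3)
dist_right = np.zeros(5)
R = np.eye(3)                              # rotation from left to right camera
T = np.array([[-63.16], [0.13], [0.21]])   # translation (baseline) in mm
image_size = (1280, 720)

# Compute rectification transforms so that epipolar lines become horizontal.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
    K_left, dist_left, K_right, dist_right, image_size, R, T)

# Build per-pixel remapping tables and rectify one stereo pair (synthetic images here).
map1x, map1y = cv2.initUndistortRectifyMap(K_left, dist_left, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K_right, dist_right, R2, P2, image_size, cv2.CV_32FC1)
left_img = np.zeros((720, 1280, 3), np.uint8)
right_img = np.zeros((720, 1280, 3), np.uint8)
left_rect = cv2.remap(left_img, map1x, map1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_img, map2x, map2y, cv2.INTER_LINEAR)
```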
Once both the stereo camera and object detection model are operational, the real-time stereo images are fed into the detection and localization system. The enhanced YOLO-VitBiS detection network identifies and localizes the target regions of goji berries within the left image, outputting the bounding box coordinates for each object. Subsequently, the detection results are matched against the corresponding right image. By integrating the CREStereo depth estimation algorithm, the matched regions undergo disparity calculation to precisely estimate the depth information of goji berry plants within complex backgrounds, generating a high-resolution depth map. Finally, based on the depth map and the geometric parameters of the stereo camera, the three-dimensional spatial coordinates and depth information of the goji berry targets are computed.
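The final localization step can be summarized by a small back-projection routine: given the centre of a detected bounding box in the left rectified image and the aligned metric depth map, the pinhole model recovers the target’s camera-frame coordinates. The function name, the placeholder depth map, and the intrinsic values (chosen to be of the same order as those in Table 1) are illustrative assumptions.

```python
import numpy as np

def pixel_to_camera_xyz(u, v, depth_map, fx, fy, cx, cy):
    """Back-project an image point (u, v) with depth Z into left-camera coordinates.

    `depth_map` is assumed to store metric depth (e.g., in mm) aligned with the
    left rectified image, as obtained by converting the CREStereo disparity map.
    """
    Z = float(depth_map[int(v), int(u)])
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return X, Y, Z

# Usage with a hypothetical detection: bounding box (x1, y1, x2, y2) from YOLO-VitBiS.
x1, y1, x2, y2 = 520, 300, 580, 360
u_c, v_c = (x1 + x2) / 2, (y1 + y2) / 2          # box centre in the left image
depth_map = np.full((720, 1280), 650.0)          # placeholder depth map (mm)
print(pixel_to_camera_xyz(u_c, v_c, depth_map, fx=710.6, fy=711.0, cx=631.8, cy=372.9))
```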

3.2. YOLO-VitBiS Object Detection Network

Traditional object detection models such as VGG [22] and ResNet [23] rely on substantial parameter counts and complex computational architectures, leading to significant challenges when deployed on mobile devices. Consequently, the single-stage detection algorithm YOLO11n was selected as the baseline model to design a lightweight goji berry detection model, YOLO-VitBiS, whose architecture is illustrated in Figure 4. The input image is first processed by the backbone, which consists of Conv, AdditiveBlock, SPPF and C2PSA modules to extract multi-scale feature maps. These features are then fed into the neck, where a series of BiFPN and upsampling operations perform bidirectional feature fusion. Finally, the fused features at three scales are sent to the shared detail-enhanced detection heads (SDDH) to generate prediction maps for goji berry clusters.
(1) C3k2-AdditiveBlock module. The original C3K2 module employs the CSP method to reduce redundant computations, yet its convolutional operations lack the capacity to model long-range dependencies. The finite receptive field of convolutional layers and their inability to capture distant dependencies result in diminished detection performance within complex visual scenes. To address this, we propose enhancing the C3k2 module by adopting the AdditiveBlock architecture from CAS-ViT [24]. Through element-wise addition operations and residual connections that integrate features from the preceding layer with the current layer, this approach effectively preserves model performance while reducing computational complexity. This design significantly improves the model’s ability to capture fine details in complex scenes while maintaining a lightweight overall structure, making it suitable for efficient deployment requirements. The architecture of the AdditiveBlock module is illustrated in Figure 5.
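The sketch below illustrates the additive-interaction idea in a deliberately simplified PyTorch module: query and key features interact through element-wise addition and a lightweight gate instead of quadratic dot-product attention, and a residual connection preserves the input. It is only a conceptual approximation, not the actual CAS-ViT AdditiveBlock or the C3k2-AdditiveBlock module used in YOLO-VitBiS.

```python
import torch
import torch.nn as nn

class SimpleAdditiveBlock(nn.Module):
    """Highly simplified additive-interaction block (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        # Depthwise 3x3 + sigmoid acts as a cheap spatial context gate.
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Sigmoid())
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        context = self.gate(self.q(x) + self.k(x))   # additive interaction, linear cost
        out = self.proj(context * self.v(x))          # modulate values by the context
        return x + out                                # residual connection

x = torch.randn(1, 64, 80, 80)
print(SimpleAdditiveBlock(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```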
(2) BiFPN Module. The Weighted Bi-directional Feature Pyramid Network (BiFPN) [25] replaces the PAFPN [26] network in YOLOv11. By performing bidirectional feature fusion during upsampling and downsampling processes and assigning weights to different feature layers, BiFPN enables more flexible and efficient feature integration. This mechanism adaptively assigns the contribution of each input feature based on the data, as shown in Equation (1):
$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i$   (1)
In the formula, O denotes the output feature after weighted fusion, w_i represents the learnable weight corresponding to input feature I_i, and a ReLU activation ensures w_i ≥ 0. ε is a small constant (e.g., 10^-6) employed to prevent division by zero and guarantee numerical stability. Comparisons of feature network architectures are illustrated in Figure 6, where P3 to P7 denote feature maps at different scales, with arrows indicating the direction of feature fusion.
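A minimal PyTorch implementation of the fast normalized fusion in Equation (1) is shown below; the module fuses any number of equally shaped feature maps with learnable, ReLU-clamped weights. The surrounding convolutions and resampling that a full BiFPN node would apply are omitted, so this is a sketch of the fusion rule only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion used at a BiFPN node (Equation (1))."""
    def __init__(self, num_inputs: int, eps: float = 1e-6):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one scalar weight per input
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)             # enforce w_i >= 0
        w = w / (self.eps + w.sum())         # normalized contribution of each input
        return sum(w_i * x for w_i, x in zip(w, inputs))

# Fuse two feature maps of the same shape (e.g., lateral and top-down paths).
p_lateral = torch.randn(1, 128, 40, 40)
p_topdown = torch.randn(1, 128, 40, 40)
fused = WeightedFusion(num_inputs=2)([p_lateral, p_topdown])
print(fused.shape)   # torch.Size([1, 128, 40, 40])
```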
(3) SDDH lightweight detection head. This paper proposes a lightweight shared detail-enhanced detection head, SDDH, which optimizes the detection efficiency and accuracy of YOLOv11 by sharing convolutional and detail-enhancement modules. The SDDH module processes feature maps with resolutions of 80 × 80, 40 × 40, and 20 × 20 to accommodate targets of varying sizes. Three 1 × 1 Conv_GN convolutional modules first adjust the channels of the input feature maps. Cross-layer feature extraction is then performed via a shared dual-layer detail-enhanced convolution module, DEConv_GN. The DEConv_GN module fuses weights from centre-difference convolution (Conv2d_cd), horizontal/vertical-difference convolution (Conv2d_hd/vd), angular-difference convolution (Conv2d_ad), and standard convolution. Each convolution type captures specific directional or contextual information: the centre-difference convolution focuses on local details, the horizontal and vertical difference convolutions emphasize texture changes along specific axes, and the angular-difference convolution captures rotational variations, with the weight fusion formula shown in Equation (2).
$Y_{\mathrm{train}} = \sum_{i=1}^{5} \left( \mathrm{Conv}(X, W_i) + B_i \right)$   (2)
In the equation, X denotes the input feature map, while W_i and B_i represent the weights and biases of each branch, respectively.
Finally, to address the potential weakening of feature extraction capabilities resulting from the lightweight shared-convolution design, the Batch Normalization (BN) layers within all Conv modules are replaced with Group Normalization (GN) layers. This compensates for the loss of accuracy incurred by the lightweight design. The structure of the SDDH module is illustrated in Figure 7.
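The following sketch captures the shared-head idea under simplifying assumptions: per-scale 1 × 1 Conv_GN adapters unify the channel counts, after which a single convolution stack with Group Normalization is reused across the three scales. A plain 3 × 3 convolution stands in for the detail-enhanced DEConv_GN module, and the activation, group count, and output layout are illustrative choices rather than the exact SDDH configuration.

```python
import torch
import torch.nn as nn

def conv_gn(c_in: int, c_out: int, k: int = 1) -> nn.Sequential:
    """Convolution followed by Group Normalization (GN) and SiLU activation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.GroupNorm(num_groups=16, num_channels=c_out),
        nn.SiLU(),
    )

class SharedHeadSketch(nn.Module):
    """Simplified shared detection head: per-scale 1x1 Conv_GN adapters followed by a
    single convolution stack whose weights are shared across the three scales.
    (A plain 3x3 Conv_GN stands in for the detail-enhanced DEConv_GN module.)"""
    def __init__(self, in_channels=(64, 128, 256), hidden=64, num_outputs=5):
        super().__init__()
        self.adapters = nn.ModuleList(conv_gn(c, hidden, 1) for c in in_channels)
        self.shared = nn.Sequential(conv_gn(hidden, hidden, 3), conv_gn(hidden, hidden, 3))
        self.pred = nn.Conv2d(hidden, num_outputs, 1)   # 4 box offsets + 1 class score

    def forward(self, feats):
        # feats: [P3 (80x80), P4 (40x40), P5 (20x20)] feature maps
        return [self.pred(self.shared(adapt(f))) for adapt, f in zip(self.adapters, feats)]

feats = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)]]
print([o.shape for o in SharedHeadSketch()(feats)])
```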

3.3. Binocular Stereoscopic Vision Technology

Stereoscopic vision technology constitutes a depth estimation method grounded in geometric imaging principles [27]. By emulating the visual mechanism of human binocular vision, it acquires left and right view images of a target scene from two cameras positioned at a spatial distance. The depth value of the target is then computed using the disparity information between these images. Each camera captures a left or right view of the same scene, projecting points from the three-dimensional scene onto a two-dimensional image plane. This projection relationship is jointly determined by the camera’s internal and external parameters [28].
Figure 8 illustrates the stereo vision positioning model [29]. Here, O_l and O_r denote the optical centres of the left and right cameras, respectively, and the distance between them is termed the baseline distance B. A reference coordinate system O_l-X_lY_lZ_l is established with the left optical centre O_l as its origin, while the right camera coordinate system is denoted O_r-X_rY_rZ_r. A point P(x_1, y_1, z_1) in space represents the three-dimensional spatial coordinates within the binocular camera coordinate system; its projected coordinates within the left and right camera imaging coordinate systems are denoted P_l(x_1, y_1) and P_r(x_2, y_2), respectively.
In a stereo camera system, the optical axes of the two cameras are typically parallel. For any point within the scene, its projected points in the left and right views exhibit a certain horizontal displacement, termed the disparity, expressed as d = x_l - x_r, where x_l and x_r denote the horizontal coordinates of point P within the left and right image pixel coordinate systems, respectively. As illustrated in Figure 8, the three-dimensional coordinates of a point in space can be calculated based on the principle of similar triangles. The formulae are:
$\frac{x_l}{f} = \frac{X}{Z}, \qquad \frac{x_r}{f} = \frac{X - B}{Z}$   (3)
$X = \frac{B \cdot x_l}{d}, \qquad Y = \frac{B \cdot y_l}{d}, \qquad Z = \frac{B \cdot f}{d}$   (4)
where B denotes the baseline distance between the stereo cameras, f represents the focal length of the camera, and d signifies the disparity. Equation (4) demonstrates that depth is inversely proportional to disparity and directly proportional to the camera baseline length and focal length. Greater disparity indicates the target is closer to the camera, while lesser disparity signifies the target is farther away. The key task in stereo vision is to locate corresponding matching points in the left and right images [30], thereby calculating the disparity to obtain a disparity map. Once the disparity map is acquired, the depth value for each pixel in the scene can be inferred through the relationship between disparity and depth, thus generating a depth map. By further integrating the camera calibration parameters, target points in the two-dimensional image can be back-projected into three-dimensional space, thereby achieving three-dimensional reconstruction of the scene.
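Equations (3) and (4) reduce to a few lines of code once the disparity of a matched point is known. In the sketch below, the pixel coordinates are assumed to be measured relative to the principal point (c_x, c_y); the numerical values are illustrative, chosen to be of the same order as the baseline and focal length in Table 1.

```python
def disparity_to_xyz(x_l, y_l, d, B, f, cx, cy):
    """Triangulate a point from its left-image pixel (x_l, y_l) and disparity d.

    Implements Equation (4): depth Z = B*f/d, with X and Y recovered from the
    pinhole model relative to the principal point (cx, cy).
    """
    if d <= 0:
        raise ValueError("disparity must be positive for a valid match")
    Z = B * f / d
    X = (x_l - cx) * Z / f
    Y = (y_l - cy) * Z / f
    return X, Y, Z

# Example with illustrative values: ~63 mm baseline, ~710-pixel focal length.
print(disparity_to_xyz(x_l=700, y_l=400, d=70.0, B=63.16, f=710.6, cx=631.8, cy=372.9))
```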

3.3.1. Stereo Camera Calibration

The stereo camera was calibrated with the MATLAB R2023b stereo camera calibration tool, whose core algorithm is based on the Zhang Zhengyou calibration method [31]. An 8 × 6 monochrome checkerboard pattern was selected as the calibration target, providing 35 effective corner points; each square in the pattern measured 22 mm along its side. Initially, 30 checkerboard images were captured using the stereo camera at varying distances and angles, each image measuring 1280 × 720 pixels, as illustrated in Figure 9. The finalized checkerboard images were subsequently transferred to MATLAB. During camera calibration, the reprojection error serves as a crucial metric for assessing calibration accuracy; it should typically be less than 0.5 pixels, and a smaller error indicates higher calibration precision. To enhance calibration accuracy, images with significant capture errors were discarded, ultimately retaining 25 valid image sets. Figure 10 presents the reprojection error results for the calibrated images, providing a visual representation of the overall calibration accuracy. The calibration yielded an average reprojection error of 0.06 pixels, with the derived camera parameters detailed in Table 1.
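Calibration in this work was performed with the MATLAB toolbox; for completeness, the sketch below outlines an equivalent Zhang-style stereo calibration in OpenCV using the same 8 × 6 board (7 × 5 interior corners, 22 mm squares). The directory names are placeholders, and corner sub-pixel refinement is omitted for brevity.

```python
import glob
import cv2
import numpy as np

PATTERN = (7, 5)          # interior corners of the 8 x 6 square checkerboard
SQUARE_MM = 22.0
IMG_SIZE = (1280, 720)

# 3D coordinates of the checkerboard corners in the board frame (Z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.cvtColor(cv2.imread(lf), cv2.COLOR_BGR2GRAY)
    gr = cv2.cvtColor(cv2.imread(rf), cv2.COLOR_BGR2GRAY)
    okl, cl = cv2.findChessboardCorners(gl, PATTERN)
    okr, cr = cv2.findChessboardCorners(gr, PATTERN)
    if okl and okr:                  # keep only pairs where both views detect the board
        obj_pts.append(objp); left_pts.append(cl); right_pts.append(cr)

# Per-camera intrinsic calibration (Zhang's method), then joint stereo calibration.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, IMG_SIZE, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, IMG_SIZE, None, None)
ret, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, IMG_SIZE,
    flags=cv2.CALIB_FIX_INTRINSIC)
print(f"RMS reprojection error: {ret:.3f} px")   # should typically be below 0.5 px
```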

3.3.2. CRE Stereo Matching Algorithm

The core method employed for stereo matching is the Cascaded Recursive Stereo Matching Network (CREStereo). As illustrated in Figure 11, the core modules of the CREStereo framework comprise the Recursive Update Module (RUM), the Adaptive Group Correlation Layer (AGCL), and the cascaded architecture. The RUM serves as CREStereo’s central component, iteratively optimizing features and disparity predictions across multiple cascaded layers. The RUM primarily comprises a Gated Recurrent Unit (GRU) and AGCL. Feature map correlations are computed independently at each cascade level, progressively refining the disparity prediction. The initial input utilizes low-resolution feature maps. The final disparity result from each cascade level serves as the initialization input for the subsequent level. This is combined with higher-resolution feature maps for further refinement, progressively enhancing prediction resolution and accuracy. This achieves stepwise disparity refinement from low to high resolution, ensuring robustness in large-scale matching while preserving the accuracy of fine-grained structural details. AGCL is a module within CREStereo specifically designed to address complex scene matching challenges. For non-ideal calibration issues, it employs local correlation computation and deformable search windows to dynamically adjust the search range based on scene content. This reduces matching ambiguity, enhances accuracy in complex scenes, and mitigates excessive memory consumption and computational costs associated with high-resolution image stereo matching.

3.4. Experimental Setup and Evaluation Metrics

The operating system employed in the experiments was Windows 11, with an NVIDIA GeForce RTX 4060 Laptop GPU, 16 GB of RAM, and a 13th Gen Intel® Core™ i9-13900HX 2.20 GHz CPU. The deep learning framework utilized was PyTorch, version 2.2.2, running on Python version 3.11 and CUDA version 12.1. To ensure fairness and reliability, all experiments were conducted under identical conditions, with specific training parameters detailed in Table 2.
In this study, the quality of the goji berry identification task is evaluated using three commonly adopted metrics in object detection: precision, recall, and mAP50. Precision reflects the proportion of correctly detected targets among all detections, which is important for avoiding false positives. Recall measures the ability of the detector to avoid missed detections, and mAP50 is widely used as a standard benchmark for evaluating object detection performance in practical applications. The relevant formulae are as follows:
$P = \frac{TP}{TP + FP}$   (5)
$R = \frac{TP}{TP + FN}$   (6)
$AP = \int_{0}^{1} P(R)\, \mathrm{d}R$   (7)
$mAP = \frac{1}{n} \sum_{i=1}^{n} AP(i)$   (8)
where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. AP denotes the average precision, n is the total number of object categories, and AP(i) indicates the average precision of category i.
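As a worked example of Equations (5)-(7), the snippet below computes precision and recall from detection counts and approximates AP by numerically integrating sampled points of the precision-recall curve. The input values are illustrative, and practical evaluators apply additional interpolation rules that are omitted here.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts (Equations (5) and (6))."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Approximate AP = ∫ P(R) dR (Equation (7)) by trapezoidal integration over
    points sampled along the precision-recall curve."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

# Illustrative values only.
print(precision_recall(tp=90, fp=4, fn=8))
print(average_precision(np.array([0.0, 0.5, 0.9, 1.0]),
                        np.array([1.0, 0.98, 0.95, 0.6])))
```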
Additionally, two key metrics, the number of parameters and the model size, were employed for comparison to more thoroughly evaluate the lightweight capabilities of the proposed YOLO variant. The number of parameters in a convolutional layer is expressed by Equation (9):
$\mathrm{Params} = (k_h \times k_w \times C_{in}) \times C_{out}$   (9)
where k_h and k_w represent the height and width of the convolution kernel, respectively, and C_in and C_out denote the numbers of input and output channels. The parameters refer to the learnable variables of the model, which are continuously optimized during training to minimize the loss. The model size reflects its complexity and the amount of storage space required. An increase in the number of parameters and model size results in higher hardware resource consumption.
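Equation (9) can be checked with a one-line helper; the example below counts the weights of a hypothetical 3 × 3 convolution with 64 input and 128 output channels.

```python
def conv_params(k_h: int, k_w: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Learnable parameters of one convolution layer, following Equation (9);
    set bias=True to also count the c_out bias terms (not included in Equation (9))."""
    return (k_h * k_w * c_in) * c_out + (c_out if bias else 0)

# Example: a 3x3 convolution mapping 64 channels to 128 channels.
print(conv_params(3, 3, 64, 128))        # 73728 weights
print(conv_params(3, 3, 64, 128) / 1e6)  # ≈ 0.074 M parameters
```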

4. Experimental Results and Analysis

4.1. Training Results for the YOLO-VitBiS Object Detection Network

4.1.1. Ablation Experiment

To validate the effectiveness of the three improvement methods on the goji berry dataset, four sets of ablation experiments were designed. As shown in Table 3, compared with the original YOLO11n model, when using AdditiveBlock to enhance C3K2, the model’s accuracy improved from 0.965 to 0.978, though recall decreased from 0.898 to 0.877, while mAP50 improved from 0.95 to 0.966. The parameter count increased by merely 0.077 M. This indicates that AdditiveBlock significantly enhances the model’s detection accuracy and detail capture capability in complex scenes, though recall may have slightly decreased due to changes in feature extraction.
Subsequently, replacing the original PAFPN with BIFPN elevated recall from 0.877 to 0.921, while mAP50 increased from 0.966 to 0.972. Parameter count was reduced by 0.554 M. Although accuracy decreased, BIFPN improved recall and demonstrated superior performance in multi-scale feature fusion, while simultaneously reducing the overall model parameter count.
Finally, the lightweight detection head SDDH was introduced, raising the model’s precision back to 0.958 with a recall of 0.915, while mAP50 stabilized at 0.966. Concurrently, the model size was reduced to 4.3 MB, with parameters further decreased to 1.856 million. In summary, when combining these three optimizations, mAP50 improved by 1.6% compared to YOLO11n, while the total model size decreased by 21.82%, parameters by 28.11%, and computational load by 3.2%. Ablation studies demonstrate that concurrently applying all three improvement strategies achieves optimal recognition performance alongside model lightweighting.
In summary, the ablation experiments demonstrate that the three proposed components jointly provide the best trade-off between accuracy and model complexity. AdditiveBlock mainly improves detection precision and fine-detail representation, BiFPN enhances recall through more effective multi-scale feature fusion, and SDDH reduces model size and parameter count while maintaining competitive detection performance. These properties are particularly important for deployment on resource-constrained goji berry harvesting robots operating in complex field environments.

4.1.2. Comparative Experiments

To better validate the effectiveness of the proposed algorithm, comparative experiments were conducted using several mainstream object detection models. The selected models included Faster R-CNN, YOLOv7n, YOLOv8n, YOLOv9n, YOLOv10n, and YOLOv11n. All comparative experiments employed identical datasets, experimental environments, and training parameters. Table 4 presents the comparative results across different object detection models.
As shown in Table 4, the proposed YOLO-VitBiS model achieves significantly higher FPS (155) compared to traditional two-stage detectors such as Faster R-CNN (25 FPS), demonstrating its real-time processing capability. Recent work on fruit detection for harvesting robots typically reports detection speeds in the range of about 30–80 FPS on embedded or edge devices and still regards these speeds as sufficient for real-time operation in orchard environments [32,33,34]. In this context, the 155 FPS achieved by YOLO-VitBiS provides a large temporal margin for subsequent stereo matching, three-dimensional localization, and motion planning in dense goji orchards. Compared with lightweight one-stage detectors such as YOLOv8n and YOLOv10n, YOLO-VitBiS not only attains higher mAP50 and recall but also does so while using fewer parameters and a smaller memory footprint. These characteristics are essential for optimizing both detection performance and real-time execution. Our model is specifically designed to handle the challenges of densely clustered goji berries with severe occlusions, and therefore, it prioritizes a balance between a lightweight design and robust performance in complex field scenarios.
The visualization results are shown in Figure 12. The YOLO11n and YOLO-VitBiS models were compared on the test set across four scenarios: single object, multiple objects, overlapping, and occlusion. As evident from the detection maps, the YOLO-VitBiS model performs comparably to or better than the original model across all detection scenarios. The original model exhibits missed detections in the multi-object, overlapping, and occlusion scenarios, whereas the YOLO-VitBiS model achieves precise detection results while reducing the number of parameters by 28.11%. Overall, the comparative study shows that the proposed YOLO-VitBiS detector offers a favorable balance between accuracy and efficiency among current mainstream detection models.

4.2. Analysis of Spatial Positioning Results

Figure 13 presents the visualization of depth information for the goji berry plant reconstructed using the CREStereo stereo matching algorithm. The raw depth data constitutes a single-channel matrix, with the intuitive representation appearing as a greyscale image, as shown in Figure 13b. Within the greyscale image, each pixel’s brightness value is inversely proportional to the physical distance of that point from the camera. Thus, higher brightness levels approaching white represent shallower depth distances, while lower brightness levels approaching black denote greater depth distances. The distribution of greyscale values within the image reveals the camera’s depth distance relative to the goji berry plant. However, the human eye possesses limited resolution for distinguishing subtle brightness variations in greyscale. To enhance the depth map’s legibility and contrast, thereby facilitating clearer identification of relative spatial hierarchies among scene objects, the single-channel greyscale depth image is mapped into a color image, as illustrated in Figure 13c.
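The grayscale-to-color mapping described above can be reproduced with a short OpenCV routine: depth is normalized, inverted so that nearer points appear brighter (matching the grayscale convention in Figure 13b), and passed through a colormap. The JET colormap and the placeholder depth values are illustrative assumptions.

```python
import cv2
import numpy as np

def colorize_depth(depth_mm: np.ndarray) -> np.ndarray:
    """Map a single-channel metric depth image to a color image for visualization."""
    d = depth_mm.astype(np.float32)
    d_norm = (d - d.min()) / max(d.max() - d.min(), 1e-6)   # scale depth to [0, 1]
    gray = ((1.0 - d_norm) * 255).astype(np.uint8)          # near -> bright, far -> dark
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)        # illustrative colormap choice

depth = np.random.uniform(400, 1000, (720, 1280)).astype(np.float32)  # placeholder depth (mm)
color = colorize_depth(depth)
```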
To validate CREStereo’s depth acquisition performance in complex goji berry scenarios, this paper conducts comparative experiments against the traditional semi-global matching algorithm SGBM and the deep learning stereo matching algorithm UniMatch [35]. Figure 14 illustrates the depth map acquisition results of the three methods for goji plants at different angles. The comparison reveals that due to the intricate interlacing of branches and extensive occlusion by foliage, the traditional SGBM method generates numerous mismatches during depth map acquisition. This results in significant noise within the depth map, with fine leaves and branch segments exhibiting pronounced discontinuities and blurring. The deep learning method UniMatch partially improved detail and reduced noise, yet still exhibited considerable blurring and discontinuities along the edges of slender goji branches; CREStereo, through its cascaded recurrent network and adaptive group correlation layer design, more accurately restores depth information for minute structures. The generated depth map exhibits clear details and sharp edges, with no blurring or discontinuities observed along the edges of goji branches or fruits. Comparative images demonstrate that CREStereo exhibits superior depth estimation capabilities in complex agricultural environments compared to both the traditional SGBM method and the deep learning approach UniMatch.
Based on the binocular stereo vision ranging principle illustrated in Figure 8 and the geometric model established using Equations (3) and (4), the depth value Z can be directly expressed as the actual physical distance of the goji berry target relative to the plane of the binocular camera. To quantitatively evaluate the accuracy of this ranging model, practical ranging experiments were conducted within the distance range of 400 mm to 1000 mm. The depth measurement results output by the system were compared with measurement data collected by a high-precision laser rangefinder. The comparison data are presented in Table 5.
The depth estimation accuracy of the proposed CREStereo-based stereo vision system was evaluated using two metrics: mean relative error and millimeter-level absolute error. As shown in Table 5, within the measurement range of 400 mm to 1000 mm, the average relative error was 2.42%, with a maximum relative error of 2.9%, and the average absolute error was 16.1 mm. As the measurement distance increased, the absolute error exhibited a gradual upward trend. In practical harvesting scenarios, however, the camera is mounted on the end-effector of the robotic arm and operates at relatively short working distances from the fruit clusters, which is expected to further reduce the absolute depth error compared with the static evaluation range.
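The error statistics can be reproduced directly from the per-row data in Table 5, as in the snippet below; small differences from the reported averages are expected because the published relative errors are rounded to one decimal place.

```python
import numpy as np

# Actual and measured distances from Table 5 (mm).
actual = np.array([400, 452, 498, 545, 598, 654, 694, 832, 917, 1024], dtype=float)
tested = np.array([409, 463, 509, 558, 612, 667, 677, 810, 943, 999], dtype=float)

abs_err = np.abs(tested - actual)          # absolute error in mm
rel_err = abs_err / actual * 100           # relative error in percent

print(f"mean absolute error: {abs_err.mean():.1f} mm")   # ≈ 16.1 mm
print(f"mean relative error: {rel_err.mean():.2f} %")
print(f"max relative error:  {rel_err.max():.2f} %")
```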
Although there is currently no universally accepted threshold for the maximum allowable 3D localization error in harvesting robots, existing studies on fruit-harvesting systems generally report millimeter-level errors in the range of approximately 10–20 mm as being sufficient for successful operation [36,37,38]. Combined with the fact that real field deployment typically uses shorter camera-to-target distances than those in the static evaluation, the proposed CREStereo-based binocular stereo vision system can be considered to provide sufficiently accurate three-dimensional localization of clustered goji berries, offering a reliable perception basis for path planning and end-effector control in goji berry harvesting robots.

5. Conclusions

To address the issue of low depth estimation accuracy for goji berries in complex agricultural settings, this paper proposes a method for identifying and localizing goji berries that combines the YOLO-VitBiS object detection network with the CREStereo depth estimation algorithm.
(1) A lightweight goji berry detection model, YOLO-VitBiS, is proposed. Firstly, the C3k2 module is enhanced using AdditiveBlock. Secondly, the PAFPN network in the Neck is replaced with a weighted bidirectional feature pyramid network. Finally, a lightweight detection head named SDDH is designed to reduce computational complexity and parameter count. Compared with the original model, the model size is reduced by 21.82%, the parameter count by 28.11%, and the computational load by 3.2%, while the mAP50 metric improves by 1.6%.
(2) A stereo vision positioning system based on the CREStereo depth estimation algorithm was proposed. Experimental results demonstrated that CREStereo could more clearly reconstruct the three-dimensional details of goji berry branches and fruits while effectively reducing noise and mismatches. The relative error of the depth measurements remained below 3%, with an average relative error of 2.42%. This performance meets the practical requirements for field operations and provides a technical foundation for subsequent intelligent goji berry harvesting.
In subsequent research, we plan to expand the dataset to encompass multiple orchards and camera configurations to further evaluate the proposed system’s generalization capabilities under diverse environmental conditions. Furthermore, the 3D localization experiments conducted in this study were performed under quasi-static conditions, where both the cameras and the goji berry plants experienced only minimal natural disturbances. Future investigations will prioritize the impacts arising from dynamic interference.

Author Contributions

J.S.: Conceptualization; formal analysis; software; visualization; methodology; data curation; writing—original draft. C.L.: Conceptualization; project administration; supervision; funding acquisition; writing—review and editing. S.Z. and Z.Z.: Methodology; validation; writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Xinjiang Uygur Autonomous Region Central Guided Local Science and Technology Development Fund Project (ZYYD2025QY17).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ren, X.; Lin, G.; Hui, J.; Zhang, J.; Zhang, J. Current Situation and Sustainable Development Suggestions for the Small Berry Industry in Xinjiang. North. Hortic. 2023, 3, 127–132. (In Chinese) [Google Scholar]
  2. Li, Y.; Hu, Z.; Zhang, Y.; Wei, L. Research Progress on Mechanized Harvesting Technology and Equipment for Goji Berries. J. Chin. Agric. Mech. 2024, 45, 16–21+35. (In Chinese) [Google Scholar] [CrossRef]
  3. Chen, Q.; Zhang, S.; Wei, N.; Fan, Y.; Zhang, W.; Wang, Z.; Chen, J.; Chen, Y. Simulation and Experiment of Goji Berry Vibrational Harvesting under Different Excitation Modes. Trans. Chin. Soc. Agric. Eng. 2025, 41, 32–42. (In Chinese) [Google Scholar]
  4. Liu, Y.; Liu, J.; Zhao, J.; Zhao, D.; Zhang, H.; Su, X.; Feng, Y.; Cheng, Y.; Li, Z. Research Progress on Theories and Equipment for Mechanized Goji Berry Harvesting. Sci. Silvae Sin. 2025, 61, 222–232. (In Chinese) [Google Scholar]
  5. Min, X.; Ye, Y.; Xiong, S.; Chen, X. Computer Vision Meets Generative Models in Agriculture: Technological Advances, Challenges and Opportunities. Appl. Sci. 2025, 15, 7663. [Google Scholar] [CrossRef]
  6. Yang, X.; Zhong, J.; Lin, K.; Wu, J.; Chen, J.; Si, H. Research Progress on Binocular Stereo Vision Technology and Its Applications in Smart Agriculture. Trans. Chin. Soc. Agric. Eng. 2025, 41, 27–39. [Google Scholar] [CrossRef]
  7. Hou, C.; Xu, J.; Tang, Y.; Zhuang, J.; Tan, Z.; Chen, W.; Wei, S.; Huang, H.; Fang, M. Detection and Localization of Citrus Picking Points Based on Binocular Vision. Precis. Agric. 2024, 25, 2321–2355. [Google Scholar] [CrossRef]
  8. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  9. Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef]
  10. Bay, H.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  11. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar] [CrossRef]
  12. Chang, J.-R.; Chen, Y.-S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar] [CrossRef]
  13. Li, Z.; Liu, X.; Drenkow, N.; Ding, A.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting Stereo Depth Estimation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6177–6186. [Google Scholar] [CrossRef]
  14. Lipson, L.; Teed, Z.; Deng, J. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. In Proceedings of the IEEE International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 218–227. [Google Scholar] [CrossRef]
  15. Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, S.; Fan, H.; Liu, S. CREStereo: Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5485–5494. [Google Scholar] [CrossRef]
  16. Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and Localization Methods for Vision-Based Fruit Picking Robots: A Review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef]
  17. Li, J.; Zhang, H.; Xu, Y.; Chen, Z.; Wang, Q. Efficient and Non-Invasive Grading of Chinese Mitten Crab Based on Fatness Estimated by Machine Vision and Deep Learning. Foods 2025, 14, 1989. [Google Scholar] [CrossRef]
  18. Li, S.; Zhao, L.; Zhou, X.; Wang, J.; Xu, C. Dynamic Object Detection and Non-Contact Localization in Lightweight Cattle Farms Based on Binocular Vision and Improved YOLOv8s. Agriculture 2025, 15, 1766. [Google Scholar] [CrossRef]
  19. Wang, J.; Gao, Z.; Zhang, Y.; Zhou, J.; Wu, J.; Li, P. Real-Time Detection and Location of Potted Flowers Based on a ZED Camera and a YOLO V4-Tiny Deep Learning Algorithm. Horticulturae 2022, 8, 21. [Google Scholar] [CrossRef]
  20. Wu, C.; Shen, J.; Yang, W.; Ren, W.; Liu, Y.; Wei, Z.; Ma, L.; Cao, X.; Zhao, Y.; Kwong, S. Synthetic Weathered Image Generation for Robust Vision Models. IEEE Trans. Image Process. 2021, 30, 5359–5373. [Google Scholar]
  21. Goyal, B.; Dogra, A.; Agrawal, S.; Sohi, B.S.; Sharma, A. Image Denoising Review: From Classical to State-of-the-Art Approaches. Inf. Fusion 2020, 55, 220–244. [Google Scholar] [CrossRef]
  22. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  23. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Ji, X. CAS-ViT: Convolutional Additive Self-Attention Vision Transformers for Efficient Mobile Applications. arXiv 2024, arXiv:2408.03703. Available online: https://arxiv.org/abs/2408.03703 (accessed on 27 October 2024).
  25. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  26. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  27. Zhang, H.; Lee, S. Robot Bionic Vision Technologies: A Review. Appl. Sci. 2022, 12, 7970. [Google Scholar] [CrossRef]
  28. Zhu, Y.; Zhang, D.; Zhou, Y.; Jin, W.; Wu, G.; Li, Y. A Binocular Stereo-Imaging-Perception System with a Wide Field-of-View and Infrared-and-Visible Light Dual-Band Fusion. Sensors 2024, 24, 676. [Google Scholar] [CrossRef]
  29. Ding, J.; Yan, Z.; We, X. High-Accuracy Recognition and Localization of Moving Targets in an Indoor Environment Using Binocular Stereo Vision. ISPRS Int. J. Geo-Inf. 2021, 10, 234. [Google Scholar] [CrossRef]
  30. Lin, X.; Wang, J.; Lin, C. Research on 3D Reconstruction in Binocular Stereo Vision Based on Feature Point Matching Method. In Proceedings of the IEEE 3rd International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 25–27 September 2020; pp. 551–556. [Google Scholar]
  31. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  32. Xu, Z.; Luo, T.; Lai, Y.; Liu, Y.; Kang, W. EdgeFormer-YOLO: A Lightweight Multi-Attention Framework for Real-Time Red-Fruit Detection in Complex Orchard Environments. Mathematics 2025, 13, 3790. [Google Scholar] [CrossRef]
  33. Zhang, W.; Liu, Y.; Chen, K.; Li, H.; Duan, Y.; Wu, W.; Shi, Y.; Guo, W. Lightweight Fruit-Detection Algorithm for Edge Computing Applications. Front. Plant Sci. 2021, 12, 740936. [Google Scholar] [CrossRef]
  34. Sun, H.; Wang, B.; Xue, J. YOLO-P: An Efficient Method for Pear Fast Detection in Complex Orchard Picking Environment. Front. Plant Sci. 2023, 13, 1089454. [Google Scholar] [CrossRef] [PubMed]
  35. Xu, H.F.; Zhang, J.; Cai, J.F.; Rezatofighi, H.; Yu, F.; Tao, D.C.; Geiger, A. UniMatch: A Unified Transformer for 3D Vision Tasks via 2D Dense Correspondence Learning. arXiv 2023, arXiv:2309.11754. [Google Scholar]
  36. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based Extraction of Spatial Information in Grape Clusters for Harvesting Robots. Biosyst. Eng. 2016, 151, 90–104. [Google Scholar] [CrossRef]
  37. Huang, W.; Miao, Z.; Wu, T.; Guo, Z.; Han, W.; Li, T. Design of and Experiment with a Dual-Arm Apple Harvesting Robot System. Horticulturae 2024, 10, 1268. [Google Scholar] [CrossRef]
  38. Yu, Y.; Zhang, K.; Liu, H.; Yang, L.; Zhang, D. Real-Time Visual Localization of the Picking Points for a Ridge-Planting Strawberry Harvesting Robot. IEEE Access 2020, 8, 116556–116568. [Google Scholar] [CrossRef]
Figure 1. Representative examples from the goji berry dataset: (a) single target; (b) multiple targets; (c) overlapping targets; (d) foliage occlusion.
Figure 2. Image augmentation methods: (a) original image; (b) horizontal mirroring; (c) vertical mirroring; (d) rotation by 90°; (e) image cropping; (f) image translation; (g) Gaussian noise; (h) salt-and-pepper noise; (i) brightness adjustment (darker); (j) brightness adjustment (brighter).
Figure 3. Overall workflow of the detection and localization system.
Figure 4. YOLO-VitBiS Network Architecture Diagram.
Figure 5. AdditiveBlock Module Architecture Diagram.
Figure 6. Comparison of different feature network structures: (a) FPN; (b) PAFPN; (c) BiFPN.
Figure 7. SDDH Module Network Structure Diagram.
Figure 8. Schematic Diagram of Stereoscopic Vision Ranging Principle.
Figure 9. Images of the calibration plate captured by the left and right cameras.
Figure 10. System Calibration Results. The solid red line in the figure represents the maximum reprojection error, while the dashed yellow line denotes the average reprojection error.
Figure 11. CREStereo network architecture.
Figure 12. Comparison of model detection effects.
Figure 13. Depth images of the goji berry (Lycium barbarum) plant: (a) original image; (b) grayscale depth map; (c) color depth map.
Figure 14. Comparison of depth maps across different models.
Table 1. Binocular Camera Calibration Parameters.

Parameter Name | Parameter Value
Translation Vector | [63.157781, 0.133512, 0.208277]
Rotation Matrix | [0.999999, 0.000054, 0.000519; 0.000053, 0.999998, 0.001540; 0.000519, 0.001540, 0.999998]
Left Camera Intrinsic Matrix | [710.618385, 0, 631.778856; 0, 710.953810, 372.914059; 0, 0, 1]
Left Camera Radial Distortion | [0.166755, 0.003064, 0.017226]
Left Camera Tangential Distortion | [0, 0]
Right Camera Intrinsic Matrix | [707.546973, 0, 649.634785; 0, 708.194484, 361.949633; 0, 0, 1]
Right Camera Radial Distortion | [0.163392, 0.008428, 0.029543]
Right Camera Tangential Distortion | [0, 0]
Fundamental Matrix | [1.160 × 10^-10, 4.136 × 10^-7, 0.000343; 3.486 × 10^-7, 1.932 × 10^-10, 0.089329; 0.000309, 0.089173, 0.999578]
Essential Matrix | [0.000058, 0.208071, 0.133832; 0.175453, 0.097283, 63.157805; 0.130158, 63.157713, 0.097204]
Table 2. Training Parameters.

Parameter Name | Parameter Value
Model Input Size | 640 × 640 pixels
Training Batch Size | 16
Initial Learning Rate | 0.01
Momentum Setting | 0.937
Weight Decay Coefficient | 0.0005
Optimizer | SGD
Training Epochs | 180
Table 3. Comparison of Ablation Experiment Results.

Basic Model | AdditiveBlock | BiFPN | SDDH | Precision | Recall | mAP50 | Model Size (MB) | Parameters (M)
YOLO11n | - | - | - | 0.965 | 0.898 | 0.950 | 5.5 | 2.582
YOLO11n | ✓ | - | - | 0.978 | 0.877 | 0.966 | 5.8 | 2.659
YOLO11n | ✓ | ✓ | - | 0.882 | 0.921 | 0.972 | 4.6 | 2.028
YOLO11n | ✓ | ✓ | ✓ | 0.958 | 0.915 | 0.966 | 4.3 | 1.856
Table 4. Performance Comparison of Different Detection Models.

Model | Precision | Recall | mAP50 | Model Size (MB) | Parameters (M) | FPS
Faster R-CNN | 0.846 | 0.836 | 0.914 | 90.8 | 40.651 | 25
YOLOv7n | 0.947 | 0.919 | 0.92 | 74.8 | 37.212 | 61
YOLOv8n | 0.941 | 0.883 | 0.957 | 6.2 | 3.128 | 124
YOLOv9n | 0.8412 | 0.882 | 0.9213 | 90 | 51.145 | 21
YOLOv10n | 0.928 | 0.88 | 0.951 | 5.8 | 2.713 | 115
YOLO11n | 0.965 | 0.898 | 0.950 | 5.5 | 2.622 | 133
YOLO-VitBiS (OURS) | 0.958 | 0.915 | 0.966 | 4.3 | 1.856 | 155
Table 5. Target Ranging Results.

Serial Number | Actual Distance (mm) | Test Distance (mm) | Absolute Error (mm) | Relative Error (%)
1 | 400 | 409 | 9 | 2.2
2 | 452 | 463 | 11 | 2.4
3 | 498 | 509 | 11 | 2.2
4 | 545 | 558 | 13 | 2.3
5 | 598 | 612 | 14 | 2.3
6 | 654 | 667 | 13 | 2.0
7 | 694 | 677 | 17 | 2.4
8 | 832 | 810 | 22 | 2.6
9 | 917 | 943 | 26 | 2.8
10 | 1024 | 999 | 25 | 2.4