Article

Intelligent Fruit Localization and Grasping Method Based on YOLO VX Model and 3D Vision

1 School of Intelligent Manufacturing, Wuchang Institute of Technology, Wuhan 430065, China
2 College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
3 School of Power and Mechanical Engineering, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(14), 1508; https://doi.org/10.3390/agriculture15141508
Submission received: 3 June 2025 / Revised: 5 July 2025 / Accepted: 11 July 2025 / Published: 13 July 2025

Abstract

Recent years have seen significant interest among agricultural researchers in using robotics and machine vision to enhance intelligent orchard harvesting efficiency. This study proposes an improved hybrid framework integrating YOLO VX deep learning, 3D object recognition, and SLAM-based navigation for harvesting ripe fruits in greenhouse environments, achieving servo control of robotic arms with flexible end-effectors. The method comprises three key components: First, a fruit sample database containing varying maturity levels and morphological features is established and interfaced with an optimized YOLO VX model for target fruit identification. Second, a 3D camera acquires the target fruit's spatial position and orientation data in real time, and these data are stored in the collaborative robot's microcontroller. Finally, employing binocular calibration and triangulation, the SLAM navigation module guides the robotic arm to the designated picking location via unobstructed target positioning. Comprehensive comparative experiments between the improved YOLO v12n model and earlier versions were conducted to validate its performance. The results demonstrate that the optimized model surpasses traditional recognition and harvesting methods, offering a faster target fruit identification response (a minimum of 30.9 ms) and significantly higher accuracy (91.14%).

1. Introduction

In recent years, the combination of intelligent robots and agriculture has attracted growing attention; in particular, automated, intelligent target-positioning and picking under indoor constant-temperature conditions is increasingly favored by users. However, current research methods still have many shortcomings, especially in real-time control, occlusion robustness, and handling of lighting conditions. For target detection in dynamic environments in particular, considering YOLO VX, 3D vision, or Arduino control alone is not sufficient to improve the efficiency and overall performance of fruit recognition; the three must be integrated effectively. To improve fruit-picking efficiency in smart orchards, the main technical problem to be solved in smart agriculture is how to combine automatic recognition and positioning, mobile-manipulator navigation, and flexible grasping to realize real-time intelligent picking of fruit. To this end, it is necessary to find efficient deep learning models, collect image sets to train a positive sample library, design a reasonable ripe-fruit matching algorithm, use 3D vision technology to obtain the spatial pose of the target's center point and its external dimensions, transmit this information to the robotic arm, and control the flexible end-effector to achieve accurate grasping.
Scholars have carried out related research on target detection, 3D positioning, dynamic environment adaptation, grasp planning, and system integration.
Firstly, in terms of target detection and fruit recognition, YOLO-series models are central to fruit localization. References [1,2,3] achieved an average precision (AP) of 99.5% in kiwifruit localization using YOLOv5x combined with calyx detection, significantly reducing positioning errors (X/Y/Z-axis deviations ≤7.9 mm). Reference [4] improved YOLOv8s by introducing a dynamic sparse attention mechanism (BiFormer) and a small-target detection layer, increasing the mAP for citrus detection to 88.45% and optimizing depth estimation accuracy (MAE = 0.53). Additionally, Reference [5] proposed the FasterNet-YOLOv7 model, which enhances feature extraction and attention mechanisms (CoTAttention) for underwater complex environments, achieving an mAP of 91.8% and real-time detection performance at 83.21 FPS. However, existing studies still face challenges such as insufficient accuracy in small-target detection (e.g., the fragility of berry skins in Reference [3]) and dynamic occlusions (e.g., dynamic interference in SLAM systems in Reference [6]).
Secondly, in terms of 3D positioning and hand–eye calibration, 3D positioning technologies primarily include monocular, binocular, and depth-camera-based methods. Reference [1] optimized 3D coordinates by matching calyx feature points in binocular images combined with depth maps (DIDM), reducing localization errors by 53.8% compared to traditional methods. Reference [7] proposed a low-cost monocular vision solution that fuses multi-resolution depth estimation, achieving a depth estimation MAE of 0.49 within a 30–60 cm range and a grasping success rate exceeding 80%. For hand–eye calibration, Reference [8] established an "I = AXB" transformation matrix using robot base point clouds and learning algorithms, achieving a translation deviation of 0.93 mm and a rotation deviation of 0.265, with calibration completed in 1 s, significantly outperforming traditional marker-dependent methods.
Thirdly, in terms of dynamic environment adaptation and SLAM technology, robustness in dynamic environments is critical for fruit grasping systems. References [9] and [6] proposed dynamic SLAM systems based on YOLOv5s and Darknet19-YOLOv3, respectively, which remove dynamic feature points through semantic segmentation and geometric constraints (RANSAC), reducing trajectory errors (RMSE) by 51.28–98.13% on the TUM dataset. Reference [10] further introduced double polyline constraints and Gaussian models to segment dynamic targets, enhancing localization stability. However, challenges remain, such as complex lighting (Reference [4]), disordered scenes (Reference [11]), and high-speed motion (Reference [12]).
Fourthly, in terms of grasp planning and system integration, grasp planning requires integrating target pose estimation and robotic arm control. Reference [12] proposed a YOLO-GGCNN cascaded model that constructs environmental maps through multi-robot collaborative SLAM, achieving a grasping accuracy of 86%. Reference [2] emphasized visual servo control and continuous behavior coordination, proposing an integrated framework for orchard-wide motion and grasping to enhance commercial feasibility.
Additionally, Reference [13] reviewed three key tasks in vision-based grasping (target localization, pose estimation, and grasp estimation), highlighting the potential of end-to-end learning methods (e.g., 6D force estimation using the 9DTact tactile sensor in Reference [14]) in complex contact scenarios. In summary, current research has made significant progress in fruit detection accuracy, 3D positioning efficiency, and robustness in dynamic environments.
The above studies provide useful methods for fruit picking, but there are still deficiencies in fruit maturity matching, random-location capture, and picking methods. In terms of application scenarios, this study is limited to greenhouse or strongly light-controlled environments: the illuminance along the robot's picking path can reach about 40 klux~80 klux, while indoor light intensity remains below this value, so direct sunlight should be avoided as much as possible. To this end, this method focuses on using the latest YOLO VX deep learning model under indoor constant-temperature conditions; it synchronously calculates the target's spatial pose and external dimensions, sends the position information to the robotic arm in real time through visual calibration and servo control, and then relies on the three-finger flexible gripper to complete the picking. The principal contributions of this study are summarized as follows:
(1)
We developed a hybrid multimodal autonomous fruit ripeness recognition algorithm based on C3K2 and SPPF, reducing real-time control latency to 30.9 ms, which is more efficient than earlier models. By implementing cascaded pooling to cover larger image regions, the system achieves enhanced robustness in detecting large-scale targets. Compared to conventional SPP methods, the serial design preserves richer edge information, resulting in finer feature extraction through small-kernel pooling.
(2)
We implemented a spatial coordination framework that integrates 3D-vision-derived centroid coordinates and dimensional parameters into a collaborative robotic control system, enforcing synchronized locomotion between the mobile chassis and manipulator via triaxial coordinate system alignment.
(3)
A hybrid communication architecture combining a TX/RX serial protocol and an Ethernet connection is designed to establish a unified address scheme for 3D spatial data and Arduino-based motion control, which optimizes the path planning of autonomous mobile robot (AMR) operation compared with traditional methods.
(4)
Experimental validation demonstrated that the proposed machine vision–Arduino integrated framework achieves 91.14% target recognition accuracy with ±1.5 mm positioning precision in agricultural harvesting scenarios.

2. System Scheme Design

The system design comprises three primary components: a mobile aerial delivery platform, an image processing unit, and a target harvesting unit, which collectively integrate a mobile robot, a collaborative robotic arm, a 3D camera system, and fruit tree detection modules. To optimize computational efficiency and data transmission, each subsystem processes and transmits collected data through dedicated microcontroller-based systems [9,15,16].
Figure 1 presents the system layout and structural configuration: (1) mobile chassis, (2) collection platform, (3) feeding mechanism, (4) flexible gripper assembly, (5) vision support frame, (6) 3D vision sensor, (7) adaptive lighting system, (8) target fruit tree, and (9) SLAM (Simultaneous Localization and Mapping) module. Correspondingly, Figure 2 illustrates the visual servo control architecture and SLAM implementation scheme developed for this methodology. The end of the robot arm adopts a flexible three-finger gripper controlled by pneumatic contraction and expansion; its maximum working frequency is 300 cycles/min, its accuracy range is ±0.05 mm, its service life is 3 million cycles, and its safe working pressure is 120 kPa.
According to the system design flow chart in Figure 2, this study aims to develop an intelligent fruit positioning and grasping system based on the YOLO VX model and three-dimensional vision technology, realizing efficient automatic operation through the collaboration of a mobile robot, a cooperative robotic arm, and multimodal sensing technology [17]. The technical process is divided into a perception layer and a control layer: the perception layer first obtains fruit information through the image acquisition module and uses the YOLO VX model to accurately identify the spatial coordinates, size, and attitude of the fruit. Relying on the Arduino system, the control layer regulates the motors through TB6612FNG and A4988 drivers, realizes multi-degree-of-freedom motion planning of the robotic arm's end-effector, and synchronously coordinates the positioning of the wheeled mobile platform and the precise operation of the screw capture mechanism. Through the closed-loop architecture of "image recognition–location–control", the system integrates the deep learning algorithm with three-dimensional visual depth information, addresses the problems of complex environments and target occlusion in traditional picking, and combines high-precision recognition with dynamic adaptability [18,19,20,21]. This study can provide a technical paradigm for automated picking in agriculture and promote the intelligent application of robot technology in unstructured scenarios.
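The closed-loop architecture described above can be summarized in a short sketch. All object and method names here (camera, detector, robot_arm, gripper and their calls) are hypothetical placeholders used only to show the data flow from the perception layer to the control layer; they are not the authors' code.

```python
import time

def harvesting_cycle(camera, detector, robot_arm, gripper):
    """One perception-to-control iteration of the closed loop in Figure 2."""
    rgb, depth = camera.capture()                        # perception layer: RGB-D acquisition
    for det in detector.detect_fruit(rgb):               # YOLO-based ripeness and position detection
        x, y, z, size = camera.solve_pose_3d(det, depth) # 3D centroid and external size
        robot_arm.move_to(x, y, z)                       # control layer: servo the arm to the fruit
        gripper.close(width=size)                        # flexible three-finger grasp
        robot_arm.move_to_drop_point()                   # place the fruit on the collection platform
        gripper.open()
    time.sleep(0.03)                                     # short idle before the next perception cycle
```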

3. Establishing a YOLO X Network Model for Fruit Target Recognition

The proposed framework was implemented in a Python 3.10.0 virtual environment under Linux, where the network model was trained and validated with stratified dataset partitioning [12,22]. Our core innovation lies in the Improved YOLOX Target-Following Algorithm (IYTFA), specifically designed for cooperative robotic applications. As depicted in Figure 3, the system architecture comprises three functionally integrated modules: (a) multimodal target detection, (b) adaptive tracking, and (c) dynamic following control. The technical workflow unfolds as follows: First, we architecturally enhanced YOLOX by substituting its Darknet53 backbone with MobileNet V2X, achieving a 37% reduction in computational load while maintaining 92% feature extraction fidelity. RGB video streams are processed through this optimized network to generate hierarchical feature maps [23,24,25]. Subsequently, a dual-branch learning paradigm was implemented, where dedicated re-identification and detection loss functions simultaneously optimize the recognition and localization branches, respectively. For motion tracking, we developed an Enhanced Kalman Filter (EKF) incorporating (i) trajectory affinity-based target association, (ii) occlusion-aware discrimination through depth probability constraints, and (iii) adaptive state prediction during visual occlusion scenarios. Furthermore, a vision-guided active pursuit strategy was devised, enabling real-time pose adjustment of the mobile robot via visual servo control mechanisms.
The algorithm network is divided into three parts: the backbone network, the network (neck) layer, and the prediction layer. The backbone uses a Darknet53 feature extraction network, the neck layer adopts a feature pyramid network, and the prediction layer uses three decoupled heads. Input images from the fruit set undergo shallow feature extraction in the backbone, deep feature extraction across three feature layers, and target detection through the three decoupled heads. However, the standard YOLOX backbone is a Darknet53 network, which suffers from a large model size and slow inference speed. Therefore, to realize real-time target detection on the mobile robot, this method proposes the YOLOX M2X network, whose backbone uses the lightweight C3K2 feature extraction module; the convolutional core of this network is a depthwise separable convolution layer, in which the output feature-map channels are halved and then merged with the features extracted by the original convolution layer. Compared with early YOLO models, the C3K2 module offers faster detection and further reduces computational cost by removing redundant convolution paths. The parameter tuning process comprises five distinct stages: Stage 1 (Rounds 0–100): perform basic training while analyzing real-time mAP@50-95 responses. Stage 2 (Rounds 101–150): lower the learning rate to mitigate potential oscillations and continue training. Stage 3 (Rounds 151–200): further reduce the learning rate. Stage 4 (Rounds 201–250): analyze the training and validation loss curves; if significant oscillations are detected, implement corrective measures: further reduce the learning rate, decrease the bounding box loss weight, and increase the segmentation loss weight. Stage 5 (Rounds 251–300): fine-tune and refine model parameters to identify optimal hyperparameters.
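The five-stage tuning schedule can be expressed as a simple epoch-to-learning-rate mapping. The paper specifies only the stage boundaries and the direction of each adjustment, so the concrete multipliers in the sketch below are illustrative assumptions.

```python
def staged_learning_rate(epoch, base_lr=0.01, oscillating=False):
    """Return the learning rate for a training round according to the five stages."""
    if epoch <= 100:                          # Stage 1: basic training, watch mAP@50-95
        return base_lr
    if epoch <= 150:                          # Stage 2: lower LR to damp oscillations
        return base_lr * 0.5
    if epoch <= 200:                          # Stage 3: further reduction
        return base_lr * 0.1
    if epoch <= 250:                          # Stage 4: corrective reduction if losses oscillate
        return base_lr * (0.01 if oscillating else 0.05)
    return base_lr * 0.01                     # Stage 5: fine-tuning
```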
The proposed system integrates the AT-S1000-06C-S 3D binocular structured-light depth camera, a multi-sensor fusion device equipped with four synergistic sensing elements: (i) dual infrared stereo vision modules, (ii) an active IR pattern projector, and (iii) an RGB chromatic imaging sensor [26,27,28]. As demonstrated in Figure 4 and Figure 5, this configuration enables millimeter-level depth resolution through synchronized operation of its optical subsystems—the IR projector emits encoded light patterns while paired IR cameras perform phase-shift analysis, with the RGB sensor concurrently capturing 1280 × 720 texture information at 30 FPS.
To ensure the accuracy of fruit picking, the equal-error method was used to discretize the contour curves. According to the diameter of the fruit captured by the camera, the coordinates of the grasping points at both ends were determined and defined as $F_0$ and $F_N$, respectively. The robot then has N + 1 actual grasping points, and the next node can be calculated from Equation (1):
$$t_{i+1} = t_i + \Delta t_i = t_i + \frac{(8\varepsilon R_i)^{1/2}}{\left| F'(t_i) \right|} \tag{1}$$
In the formula, $F(t_i)$ indicates the current node, $F(t_{i+1})$ represents the next discrete node determined by the equal-error method, $t_i$ is the parameter value at node $F(t_i)$, $\varepsilon$ represents the discrete accuracy, and $R_i$ indicates the radius of curvature at $F(t_i)$.
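As a worked illustration of Equation (1), the sketch below steps along a parametric contour F(t) using the equal-error increment. It assumes the reconstructed form with the contour derivative F'(t) in the denominator and estimates the curvature radius numerically; the function names, tolerances, and the circle example are illustrative only.

```python
import numpy as np

def equal_error_nodes(F, dF, t0, tN, eps):
    """Step along a parametric contour F(t) with the equal-error increment of Equation (1).

    F returns a 2D point, dF returns the derivative F'(t), eps is the discrete accuracy.
    """
    def radius_of_curvature(t, h=1e-4):
        # curvature estimated numerically from first and second central differences
        d1 = (F(t + h) - F(t - h)) / (2 * h)
        d2 = (F(t + h) - 2 * F(t) + F(t - h)) / h ** 2
        cross = d1[0] * d2[1] - d1[1] * d2[0]
        kappa = abs(cross) / (np.linalg.norm(d1) ** 3 + 1e-12)
        return 1.0 / (kappa + 1e-12)

    nodes, t = [t0], t0
    while t < tN:
        Ri = radius_of_curvature(t)
        t = t + np.sqrt(8 * eps * Ri) / (np.linalg.norm(dF(t)) + 1e-12)  # Equation (1)
        nodes.append(min(t, tN))
    return nodes

# Example: a circular contour of radius 40 mm discretized with 0.05 mm allowed error
circle = lambda t: np.array([40 * np.cos(t), 40 * np.sin(t)])
d_circle = lambda t: np.array([-40 * np.sin(t), 40 * np.cos(t)])
ts = equal_error_nodes(circle, d_circle, 0.0, np.pi, eps=0.05)
```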

4. Visual Recognition System Design

4.1. Collaborative Robotic Arm and Binocular Camera Calibration

The stereo calibration process serves to establish a precise geometric mapping between pixel coordinates and real-world spatial dimensions. This paper adopts the eye-to-hand calibration method, as shown in Figure 6. The calibration process is performed to obtain the internal and external parameters of the camera, so as to calculate the relationship between the pixel coordinates and the actual coordinates through the transformation formula [29,30]. In the figure, A represents the coordinate transformation relationship between the camera and the robot base, and B represents the coordinate transformation relationship between the camera and the ground. The mathematical relationship between the actual coordinates $(X_W, Y_W, Z_W)$ and the pixel coordinates $(u, v)$ is shown in Equation (2):
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \dfrac{f}{d_x} & 0 & u_0 & 0 \\ 0 & \dfrac{f}{d_y} & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0^{T} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = M_1 M_2 X \tag{2}$$
In the formula, $M_1$ refers to the camera's internal parameters, including the focal length $f$, the width of a single pixel element $d_x$, the height of a single pixel element $d_y$, and the pixel coordinates of the image center, $u_0$ and $v_0$. In addition to these five parameters, $M_2$ refers to the external parameters of the camera, consisting of the rotation matrix {R} and the translation matrix {T}.
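For clarity, the following sketch evaluates Equation (2) directly: it builds the intrinsic matrix M1 from f, d_x, d_y, u_0, v_0 and the extrinsic matrix M2 from {R} and {T}, then projects a world point to pixel coordinates. It is a minimal NumPy illustration, not the calibration software used in the study; the example pose and world point are illustrative.

```python
import numpy as np

def world_to_pixel(Pw, f, dx, dy, u0, v0, R, T):
    """Project a world point (Xw, Yw, Zw) to pixel coordinates (u, v) via Equation (2)."""
    M1 = np.array([[f / dx, 0.0,    u0,  0.0],
                   [0.0,    f / dy, v0,  0.0],
                   [0.0,    0.0,    1.0, 0.0]])           # intrinsic parameters M1
    M2 = np.vstack([np.hstack([R, T.reshape(3, 1)]),       # extrinsic parameters M2 = [R T; 0 1]
                    [0.0, 0.0, 0.0, 1.0]])
    uvw = M1 @ M2 @ np.append(Pw, 1.0)                     # equals Zc * [u, v, 1]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Example with the Table 2 intrinsics and an identity extrinsic pose (illustrative only)
u, v = world_to_pixel(np.array([0.05, 0.02, 0.60]),
                      f=0.0120, dx=4.399e-6, dy=4.400e-6, u0=813.183, v0=618.186,
                      R=np.eye(3), T=np.zeros(3))
```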
In this method, the calibration board description file and the calibration board image file can be generated using the "gen_caltab" operator, as shown in Figure 7. The first and second parameters of the operator give the number of marker points in each row and each column, the third parameter is the distance between two adjacent MARK points (in meters), the fourth parameter describes the diameter of the MARK points, and the fifth and sixth parameters give the storage locations of the calibration board file and of the image file, respectively.
The calibration process begins with preparing a calibration board and capturing its images. To ensure accuracy, the images must meet strict criteria: high clarity, the absence of occlusions or stains on the calibration plate, and clear visibility of all marker points. Since image quality directly impacts calibration results, multiple captures are required, with systematic rotation and translation of the calibration plate to cover diverse spatial orientations. This iterative approach enhances the robustness and precision of the calibration outcome [31,32]. Figure 7 shows some of the calibration plate images captured by the camera in this study.
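The multi-view capture-and-calibrate procedure can be sketched with OpenCV as a stand-in for the HALCON workflow used here; the circle-grid pattern size, mark spacing, and image file names below are assumptions made for illustration.

```python
import cv2
import numpy as np

pattern = (7, 7)            # marks per row / column (assumed)
mark_dist = 0.0125          # spacing between marks in meters (assumed)
obj_grid = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj_grid[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * mark_dist

obj_points, img_points = [], []
for path in ["plate_01.png", "plate_02.png", "plate_03.png"]:    # rotated/translated views
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, centers = cv2.findCirclesGrid(gray, pattern)           # dot-type calibration plate
    if found:
        obj_points.append(obj_grid)
        img_points.append(centers)

# Solve the intrinsic matrix, distortion coefficients, and per-view extrinsic poses
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```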

4.2. Obtaining Fruit Features

The external dimensions of the target fruit were quantified using binocular camera measurements, as summarized in Table 1. The workflow proceeded as follows: (1) Image Preprocessing: The acquired images were filtered using median and Gaussian filters to reduce noise. (2) Contour Extraction: Feature matching, adaptive binarization, area thresholding, and morphological operations were applied to isolate the fruit region. (3) Dimensional Analysis: Subpixel-accurate contours were obtained through the eXtended Line Descriptor (XLD) method, from which the pixel-based length and width measurements were derived. The process is shown in Figure 8.
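A minimal OpenCV sketch of steps (1)–(3) follows; it uses a pixel-level rotated bounding box in place of HALCON's subpixel XLD contours, and the filter and kernel sizes are illustrative assumptions rather than the values used in the study.

```python
import cv2

def measure_fruit(image_bgr):
    """Preprocess, segment, and return the pixel-level length and width of the fruit region."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)                            # median filtering
    gray = cv2.GaussianBlur(gray, (5, 5), 0)                  # Gaussian filtering
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)   # adaptive binarization
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # morphological cleanup
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)              # keep the largest region (area threshold)
    (cx, cy), (w, h), angle = cv2.minAreaRect(largest)        # rotated box: pixel length and width
    return max(w, h), min(w, h)
```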
To ensure the accuracy of the fruit shape measurements, the fruit edge images were then processed to extract XLD contours (subpixel edges, extended line descriptions) from the bounding-box edge image. The purpose of contour extraction is to determine the spatial direction of gray-value changes in the image by computing the derivatives of the gray values. For edge and line features with large gradients, the gradient of a continuous image function f(x, y) is expressed in Equation (3):
$$\nabla f(x, y) = G(x, y) = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{bmatrix}^{T} \tag{3}$$
The basic idea of obtaining the image edge is as follows: after the image is smoothed and filtered by a Gaussian filter, it is processed by non-extremum suppression technology to obtain the image edge. The steps are as follows:
First, smooth the image f(x, y) with a Gaussian filter G(x, y) to produce a smoothed image, as shown in Equations (4) and (5):
$$G(x, y) = e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{4}$$
$$f_s(x, y) = G(x, y) * f(x, y) \tag{5}$$
Second, calculate the gradient magnitude M(x, y) and direction α(x, y) at each pixel of the smoothed image, as shown in Equations (6) and (7):
$$M(x, y) = \sqrt{\left(\frac{\partial f_s}{\partial x}\right)^{2} + \left(\frac{\partial f_s}{\partial y}\right)^{2}} \tag{6}$$
$$\alpha(x, y) = \arctan\!\left(\frac{\partial f_s/\partial y}{\partial f_s/\partial x}\right) \tag{7}$$
Third, apply non-maximum suppression: points that are not local maxima along the gradient direction are set to 0 to thin the edges.
Fourth, set high and low thresholds $T_1$ and $T_2$ to detect and connect edges.
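These four steps correspond to the classical Canny edge-detection pipeline, which OpenCV exposes as a single call. The short sketch below assumes `gray` is a grayscale fruit image (for example, the filtered image from the previous sketch); the threshold values standing in for T1 and T2 are illustrative.

```python
import cv2

blurred = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.4)   # step 1: Gaussian smoothing
edges = cv2.Canny(blurred, threshold1=50,              # steps 2-4: gradient magnitude/direction,
                  threshold2=150)                      # non-maximum suppression, hysteresis linking
```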
After the pixel values of the fruit length and width were obtained, the previously obtained calibration data were loaded into the program, and the world coordinates were solved using the pixel-coordinate conversion method. The first parameter of the operator is the camera parameters, the second parameter is the camera pose, and the fifth parameter is the unit of measurement, with mm generally being chosen; that is, the actual length of 100 pixels is X mm. Equation (8) is then used to convert pixel coordinates to actual coordinates:
$$SJ_c := XS_c \times X / 50 + 0.36 \tag{8}$$
Here, $SJ_c$ is the actual coordinate, and $XS_c$ is the pixel coordinate.

4.3. Fruit Grasping Position and Force Mixing Control

To achieve precise force regulation during fruit grasping, an intelligent control algorithm was developed to enhance the mechanical arm’s force control. This algorithm compensates for PID control limitations by incorporating a neural network (NN), leveraging its nonlinear approximation capability, adaptive learning, and self-optimization to improve control stability, speed, and robustness.
Position–force mixing control transforms the combined system and computational model of the actual operating arm into a series of independent, decoupled unit-mass systems by using the dynamic model of Cartesian space. The basic block diagram of the position–force mixing control is shown in Figure 9. The force control part consists of two channels, neural network PI and force feed-forward. The neural network PI channel takes the expected Cartesian-space generalized force $F_d$ at the end of the robot as the given condition, and the force feedback is obtained from the force sensor measurements. The control system introduces the selection matrices $S$ and $\bar{S}$ to determine which control task each degree of freedom should take according to the task. $S$ is a diagonal matrix whose diagonal elements are 0 or 1; $S$ and $\bar{S}$ act as an interlocking switch used to select the control mode of each degree of freedom, with $\bar{S} = I - S$. The traditional PD control algorithm is selected as the position control strategy because position control requires a fast response, while the force control strategy uses the neural network PI control algorithm because force control demands small error. The control moment of the robotic arm in the Cartesian coordinate system is then calculated as shown in Equation (9):
$$\tau = K_p \bar{S}\,\Delta x + K_v \bar{S}\,\Delta\dot{x} + K_{Fp} S\,\Delta F + K_{FI} S \int_0^t \Delta F \,\mathrm{d}t + F_d \tag{9}$$
According to the Jacobian matrix of the arm, the driving moment corresponding to each joint of the arm can be calculated as shown in Equation (10):
$$\tau_q = J^{T}\tau = K_p J^{-1}\bar{S}J\,\Delta q + K_v J^{-1}\bar{S}J\,\Delta\dot{q} + J^{T}\!\left(K_{Fp} S\,\Delta F + K_{FI} S \int_0^t \Delta F\,\mathrm{d}t + F_d\right) \tag{10}$$
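To make the hybrid law concrete, the sketch below evaluates Equations (9) and (10) numerically for a given selection vector and error terms. The gain values, the error-vector inputs, and the assumption that the force-error integral is supplied precomputed are all illustrative choices, not the authors' controller implementation.

```python
import numpy as np

def hybrid_control_torque(s_diag, dx, dxd, dF, intF, Fd, J,
                          Kp=200.0, Kv=20.0, KFp=5.0, KFI=1.0):
    """Evaluate the hybrid position/force law of Equations (9) and (10).

    s_diag : 0/1 entries selecting the force-controlled Cartesian directions
    dx, dxd: position and velocity errors; dF, intF: force error and its integral
    Fd     : feed-forward desired force; J: manipulator Jacobian
    """
    S = np.diag(s_diag)                      # force-selection matrix
    S_bar = np.eye(S.shape[0]) - S           # complementary position-selection matrix
    tau = (Kp * S_bar @ dx                   # PD position channel
           + Kv * S_bar @ dxd
           + KFp * S @ dF                    # PI force channel
           + KFI * S @ intF
           + Fd)                             # force feed-forward
    return J.T @ tau                         # joint driving torques, tau_q = J^T tau

# Example: force control along z only, position control elsewhere (values illustrative)
tau_q = hybrid_control_torque(np.array([0, 0, 1, 0, 0, 0]),
                              dx=np.zeros(6), dxd=np.zeros(6),
                              dF=np.array([0, 0, 2.0, 0, 0, 0]),
                              intF=np.zeros(6), Fd=np.array([0, 0, 5.0, 0, 0, 0]),
                              J=np.eye(6))
```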

4.4. Motion Control System Design

The developed fruit-picking mobile manipulator integrates an Arduino-based control system that communicates with the machine vision module and coordinates structural actuation [33,34,35], as illustrated in Figure 10. To ensure operational safety and stability, the system implements an incremental discrete PID closed-loop control strategy for motor speed regulation [18,19,36,37].
According to the principle of positional PID control, the control amount of the robot at time n − 1 is given by Equations (11)–(13):
$$u(n-1) = K_p\left\{e(n-1) + \frac{T}{T_i}\sum_{i=0}^{n-1} e(i) + \frac{T_d}{T}\left[e(n-1) - e(n-2)\right]\right\} \tag{11}$$
With the definition
$$\Delta u(n) = u(n) - u(n-1) \tag{12}$$
we obtain
$$\Delta u(n) = K_p\left[e(n) - e(n-1)\right] + K_p\frac{T}{T_i}e(n) + K_p\frac{T_d}{T}\left[e(n) - 2e(n-1) + e(n-2)\right] \tag{13}$$
Let $K_i = K_p\frac{T}{T_i}$ be the integral coefficient and $K_d = K_p\frac{T_d}{T}$ the differential coefficient; the above formula can then be simplified as Equation (14):
$$\Delta u(n) = K_p\left[e(n) - e(n-1)\right] + K_i e(n) + K_d\left[e(n) - 2e(n-1) + e(n-2)\right] \tag{14}$$
Here, incremental discrete PID control is used to form a speed closed loop, that is, to calculate the incremental output Pwm, as shown in Equation (15):
$$P_{wm} = K_p\left[e(n) - e(n-1)\right] + K_i e(n) + K_d\left[e(n) - 2e(n-1) + e(n-2)\right] \tag{15}$$
Only PI control is used in the speed closed-loop control system, so the PID formula can be simplified as Equation (16):
$$P_{wm} = K_p\left[e(n) - e(n-1)\right] + K_i e(n) \tag{16}$$
where e(n) is the current deviation, e(n − 1) is the previous deviation, e(n − 2) is the deviation two samples earlier, $P_{wm}$ is the incremental output, T is the sampling period, and n is the discrete time index.
Based on the Arduino microcontroller, the motor speed control is programmed, and closed-loop speed feedback adjustment keeps the speed of each motor consistent, thereby ensuring the safety and stability of the robot during movement [20,21,38,39,40,41,42].
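A behavioral sketch of the incremental PI speed loop of Equation (16) is given below. The Arduino firmware itself is not published with the paper, so the gains, the PWM limit, and the Python form are illustrative assumptions used only to show how the increment is accumulated into a motor command.

```python
class IncrementalPI:
    """Behavioral model of the incremental PI speed loop in Equation (16)."""
    def __init__(self, kp=1.2, ki=0.4, pwm_limit=255):
        self.kp, self.ki, self.pwm_limit = kp, ki, pwm_limit
        self.prev_error = 0.0
        self.pwm = 0.0                                          # accumulated PWM command

    def update(self, target_speed, measured_speed):
        e = target_speed - measured_speed                       # e(n)
        delta = self.kp * (e - self.prev_error) + self.ki * e   # Equation (16)
        self.prev_error = e
        self.pwm = max(-self.pwm_limit, min(self.pwm_limit, self.pwm + delta))
        return self.pwm

# One controller per wheel keeps the wheel speeds consistent
left_wheel = IncrementalPI()
pwm_cmd = left_wheel.update(target_speed=120.0, measured_speed=112.5)
```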

5. Test

5.1. Dimensions and Location

Firstly, the calibration test of the 3D camera was conducted to solve the calibration values of the internal and external parameters. The internal parameters obtained are shown in Table 2, and the external parameters are shown in Table 3:
Calibrating the internal and external parameters of the camera shows that the internal parameters characterize the imaging geometry, mainly including the focal length, distortion coefficient, pixel dimensions, and center-point coordinates, while the external parameters characterize the pose, consisting of three translation components and three rotation components for a total of six degrees of freedom. Combining the internal and external parameters allows the position and posture of the target object in space to be solved, facilitating accurate grasping by the robotic arm's end-effector.
Next, 2505 training images, 850 validation images, and 523 test images were selected for the experiment, as shown in Table 4. The positive sample set was built from videos of fruits of different maturities and forms shot under the same conditions with a high-resolution camera. The videos were then imported into the LabelImg tool, and frames were extracted as individual images. After these images were imported into the software, they were annotated using the feature extraction method. As shown in Figure 11, ripe and unripe fruits were distinguished by RGB color (an R value greater than 128 was treated as ripe), and occluded and non-occluded fruits were distinguished by roundness and fit degree (greater than 50% was treated as unoccluded). To reduce error, the fruit edges were annotated as tightly as possible. In this process, an upsampling data enhancement module was used, whose core operation is increasing the spatial resolution of the feature map: transposed convolution inserts zeros between the input elements according to the stride and then applies the convolution.
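The ripeness rule used during annotation can be illustrated with a short sketch; the bounding-box format and the use of the mean red-channel value are assumptions made here for illustration, with only the 128 threshold taken from the description above.

```python
import numpy as np

def is_ripe(image_rgb, box):
    """Apply the annotation rule: a mean R value above 128 inside the box means 'ripe'."""
    x1, y1, x2, y2 = box                             # hypothetical (x1, y1, x2, y2) box format
    region = image_rgb[y1:y2, x1:x2]
    return float(np.mean(region[:, :, 0])) > 128     # channel 0 = R, assuming RGB channel order
```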
As shown in Table 5, the training set serves as the learning material for the model, used to update the weights and learn the patterns. The validation set is used to evaluate the model's performance during training, fine-tune hyperparameters, and prevent overfitting; it provides data not directly used for training the model's weights, which helps assess the model's generalization ability, that is, its capability to make accurate predictions on unseen data. Objectively comparing the model's performance under different combinations of hyperparameters shows that the model performs well with the settings in Table 5 (learning rate 0.01, training batch size 16, 300 epochs). The test set is used to further evaluate the model's generalization ability. Training produced three loss curves, namely (a) a box loss curve, (b) a class loss curve, and (c) a confidence loss curve, over 300 training cycles.
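For reference, training with the Table 5 hyperparameters might look like the sketch below using the Ultralytics API; the paper does not publish its training script, and the dataset YAML name and starting-weights file are assumptions.

```python
from ultralytics import YOLO

model = YOLO("yolo12n.pt")              # assumed pretrained YOLOv12n weights file
results = model.train(
    data="apple_dataset.yaml",          # hypothetical dataset config listing the train/val/test splits
    imgsz=640,                          # input image size 640 x 640 (Table 5)
    epochs=300,                         # 300 training cycles (Table 5)
    batch=16,                           # training batch size (Table 5)
    lr0=0.01,                           # initial learning rate (Table 5)
)
```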

5.2. Outline Dimension Test

After the vision system was calibrated, the pixel values obtained by preprocessing the acquired images and the corresponding converted actual values were obtained, as shown in Table 6. The position identification and characteristic parameter output are shown in Figure 11.
To comprehensively assess the performance of the optimized YOLO model in comparison to its predecessors, we conducted extensive testing across multiple YOLO series models using datasets with robust generalization capabilities [43,44,45,46,47,48]. Our evaluation framework incorporated four key performance metrics: accuracy, recall rate, F1 score (excluding HALCON due to unavailability), and mean precision, as detailed in Table 7. The comparative analysis clearly demonstrates that the YOLOv12n model achieves superior overall performance [49,50,51,52,53].
As shown in Table 7, the model performance evaluation presents comprehensive testing results on the benchmark dataset through rigorous offline evaluation, demonstrating that the YOLOv12n model achieves the most significant accuracy improvement among all evaluated versions. The assessment incorporates four key metrics: precision (the ratio of correctly predicted positives to all predicted positives), recall (the ratio of correctly predicted positives to actual positives), F1 score (harmonic mean of precision and recall), and mean accuracy (area under the precision–recall curve). This analysis reveals critical performance trade-offs—high-precision models tend to generate false negatives, while high-recall models produce false positives, with the F1 score ensuring balanced performance. These results confirm the model’s strong generalization capability, further validated by Table 8, the sample detection results, which showcases consistent prediction accuracy across individual test cases, indicating robust readiness for real-world deployment.
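The metric definitions above reduce to a few lines of arithmetic; the sketch below computes precision, recall, and F1 from detection counts, with example counts chosen to roughly reproduce the YOLOv12n row of Table 7.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative counts: 91 correct detections, 9 false alarms, 17 missed fruits
p, r, f1 = detection_metrics(91, 9, 17)   # ~0.91, ~0.84, ~0.875, in line with Table 7
```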
Figure 12 comprehensively demonstrates the model’s robust generalization capability in accurately predicting diverse apple features across various orientations. The subfigures systematically present three key training metrics: (a) The bounding box loss curve quantifies localization accuracy, using Intersection over Union (IoU) to measure the discrepancy between the predicted and ground-truth bounding boxes. (b) The classification loss curve evaluates categorical prediction errors. (c) The confidence loss curve assesses the reliability of object detection predictions. The consistent and smooth convergence of all three loss functions clearly indicates stable learning progression and effective optimization of the object detection task. Figure 13 further validates the method’s practical effectiveness by presenting detailed identification results under various abnormal conditions, confirming the model’s operational reliability in real-world scenarios.
To rigorously validate the experimental outcomes, we conducted systematic performance comparisons by randomly sampling 100 test images from an open-source database. These images were processed sequentially using three YOLO variants (YOLOv8n, YOLOv10n, YOLOv12n) alongside the conventional HALCON image processing system [54,55,56,57,58,59,60]. For a comprehensive evaluation, we measured and compared the average execution times across all processing stages, with unavailable stages marked as ‘not applicable’ in Table 8. The comparative analysis clearly demonstrates the computational efficiency advantages of the advanced YOLOv12n architecture over its predecessors and traditional alternatives (Figure 14).
To further validate the effectiveness of the structural improvements, we conducted ablation studies using YOLOv12n as the baseline model. Each enhanced module was sequentially integrated, and its contribution to model performance was quantitatively assessed (Table 9). In the table, / means that this module is not available and √ means that this module is available. The results demonstrate that while the improved YOLOv12n maintains comparable mAP50seg performance, it achieves significant gains in mAP50-95seg. The SPPF module utilizes cascaded pooling to expand receptive fields, enhancing robustness for large target detection. Compared to standard SPP, its serial design preserves finer edge details through smaller initial pooling kernels [61,62,63,64,65]. The C2PSA module integrates channel attention with position-sensitive mechanisms, substantially improving small target detection by refining feature representation. Integrating upsample operations within these modules increased feature map resolution, collectively contributing to a 4-percentage-point improvement in detection accuracy. This cumulative enhancement validates the synergistic efficacy of each structural modification.
In this study, the scene focuses on fruits growing in an indoor greenhouse environment. Under constant lighting, the influence of illumination on the fruits was ignored, but samples were collected and analyzed for branch variations and fruits of different sizes. During training, fruits with leaf occlusion, incomplete features, and different sizes could be identified. Because fruits differ in size and center-point coordinates, occlusion errors arise between the sample library and the training library. This method establishes a result-search-oriented matching method under the YOLO VX framework. The results showed that the abnormality rate of the identified length and width values was below 1%, and the difference rate of the calculated length and width was below 8%, demonstrating a sound test method. However, results differ greatly across growth periods and fruit tree heights; in future studies, classification will be used to propose a new model for mature fruit tree height and a new training method. Admittedly, when leaf occlusion exceeds 35%, the difference is also large; in the future, the camera image acquisition method will be improved to search for mature fruits over a larger area.
Admittedly, the testing process revealed several limitations. Primarily, (1) when the obstruction ratio of fruits by leaves or stems exceeds 30%, recognition performance degrades significantly, necessitating supplementary mechanical designs to clear obstructions such as leaves or stems from the end-effector structure. (2) Furthermore, variability in tree heights poses challenges, as fruits located at the upper canopy often exceed the robotic arm’s operational range. To address this, the stationary mobile chassis requires optimization into a liftable mechanical platform capable of vertical adjustment.

6. Conclusions

In this paper, an intelligent orchard fruit positioning and grasping method based on the YOLO VX model and 3D vision is proposed. YOLO VX deep learning, 3D visual target recognition, and microcontroller unit design were adopted: a sample library of mature fruits was established, the fruits to be identified were used as the training library, and the output results were matched against the sample library. At the same time, the 3D camera solved the position and size information of the target fruit and transmitted it to the flexible gripper on the robotic arm to complete the picking. Tests show that the YOLO VX model used in this method is more efficient than traditional methods and can be further applied to the intelligent picking of other orchard fruits at different locations and of different sizes.
In the future, we will pay more attention to different kinds of fruit trees in the same orchard and to methods or models for picking different fruits on the same tree. For example, when different parts of a fruit are occluded, the YOLO VX method can be further optimized to fit the complete fruit outline through fuzzy statistics and analysis; when different types of fruits are planted in the same indoor orchard, the classification standard of the sample set can be improved and a parallel-solving hybrid-matching model explored to make the method more intelligent. For different scenarios, the problem of environmental interference can be addressed by introducing an optimization module algorithm, and for the picking of different fruit crops, different flexible end-effectors can be used.

Author Contributions

Conceptualization, Z.M. and Y.L.; methodology, Z.M., R.Z. and S.W.; software, Y.L.; validation, Z.M. and Y.L.; formal analysis, Z.M.; investigation, Y.L.; resources, R.Z. and S.W.; data curation, S.W.; writing—original draft preparation, Z.M.; writing—review and editing, S.W.; visualization, Y.L.; supervision, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was fully supported by the Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education (ERCITA-KF004); Fei Guobiao from Hubei Engineering Research Center for BDS-Cloud High-Precision Deformation Monitoring (HBBDGJ202514Q); the Advantageous Discipline Group of Hubei Province: Intelligent Manufacturing (202207); Excellent Young and Middle-Aged Science and Technology Innovation Teams in Hubei Province’s Universities (T2024043); and the Scientific Research Project of Wuchang Institute of Technology (2025KYZ01).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, C.; Jiang, H.; Liu, X.; Li, H.; Wu, Z.; Sun, X.; He, L.; Mao, W.; Majeed, Y.; Li, R.; et al. Improved binocular localization of kiwifruit in orchard based on fruit and calyx detection using YOLOv5x for robotic picking. Comput. Electron. Agric. 2024, 217, 108621. [Google Scholar] [CrossRef]
  2. Chen, M.; Chen, Z.; Luo, L.; Tang, Y.; Cheng, J.; Wei, H.; Wang, J. Dynamic visual servo control methods for continuous operation of a fruit harvesting robot working throughout an orchard. Comput. Electron. Agric. 2024, 219, 108774. [Google Scholar] [CrossRef]
  3. Wang, C.; Pan, W.; Zou, T.; Li, C.; Han, Q.; Wang, H.; Yang, J.; Zou, X. A Review of Perception Technologies for Berry Fruit-Picking Robots: Advantages, Disadvantages, Challenges, and Prospects. Agriculture 2024, 14, 1346. [Google Scholar] [CrossRef]
  4. Kong, D.; Wang, J.; Zhang, Q.; Li, J.; Rong, J. Research on fruit spatial coordinate positioning by combining improved YOLOv8s and adaptive multi-resolution model. Agronomy 2023, 13, 2122. [Google Scholar] [CrossRef]
  5. He, J.; Wang, L.; Liu, H.; Sun, B. Recent advances in molecularly imprinted polymers (MIPs) for visual recognition and inhibition of α-dicarbonyl compound-mediated Maillard reaction products. Food Chem. 2024, 446, 138839. [Google Scholar] [CrossRef]
  6. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Yan, Y.; Zhang, H.; Wang, J.; Qiu, J. Improved Apple Fruit Target Recognition Method Based on YOLOv7 Model. Agriculture 2023, 13, 1278. [Google Scholar] [CrossRef]
  7. Li, L.; Yang, X.; Wang, R.; Zhang, X. Automatic Robot Hand-Eye Calibration Enabled by Learning-Based 3D Vision. J. Intell. Robot. Syst. 2024, 110, 130. [Google Scholar] [CrossRef]
  8. Liang, J.; Ye, Y.; Wu, D.; Chen, S.; Song, Z. High-efficiency automated triaxial robot grasping system for motor rotors using 3D structured light sensor. Mach. Vis. Appl. 2024, 35, 132. [Google Scholar] [CrossRef]
  9. Hou, R.; Yin, J.; Liu, Y.; Lu, H. Research on Multi-Hole Localization Tracking Based on a Combination of Machine Vision and Deep Learning. Sensors 2024, 24, 984. [Google Scholar] [CrossRef]
  10. Yang, Q.; Meng, H.; Gao, Y.; Gao, D. A real-time object detection method for underwater complex environments based on FasterNet-YOLOv7. J. Real-Time Image Process. 2024, 21, 8. [Google Scholar] [CrossRef]
  11. Valero, S.; Martinez, J.C.; Montes, A.M.; Marín, C.; Bolaños, R.; Álvarez, D. Machine Vision-Assisted Design of End Effector Pose in Robotic Mixed Depalletizing of Heterogeneous Cargo. Sensors 2025, 25, 1137. [Google Scholar] [CrossRef] [PubMed]
  12. Choutri, K.; Lagha, M.; Meshoul, S.; Batouche, M.; Bouzidi, F.; Charef, W. Fire detection and geo-localization using uav’s aerial images and YOLO-based models. Appl. Sci. 2023, 13, 11548. [Google Scholar] [CrossRef]
  13. Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734. [Google Scholar] [CrossRef]
  14. Lin, C.; Zhang, H.; Xu, J.; Wu, L.; Xu, H. 9dtact: A compact vision-based tactile sensor for accurate 3d shape reconstruction and generalizable 6d force estimation. IEEE Robot. Autom. Lett. 2023, 9, 923–930. [Google Scholar] [CrossRef]
  15. Sun, R.; Wu, C.; Zhao, X.; Zhao, B.; Jiang, Y. Object Recognition and Grasping for Collaborative Robots Based on Vision. Sensors 2024, 24, 195. [Google Scholar] [CrossRef] [PubMed]
  16. Xiao, X.; Wang, Y.; Zhou, B.; Jiang, Y. Flexible Hand Claw Picking Method for Citrus-Picking Robot Based on Target Fruit Recognition. Agriculture 2024, 14, 1227. [Google Scholar] [CrossRef]
  17. Gao, R.; Li, Y.; Liu, Z.; Zhang, S. Target Localization and Grasping of Parallel Robots with Multi-Vision Based on Improved RANSAC Algorithm. Appl. Sci. 2023, 13, 11302. [Google Scholar] [CrossRef]
  18. Jin, X.; Dai, S.L.; Liang, J. Adaptive constrained formation-tracking control for a tractor-trailer mobile robot team with multiple constraints. IEEE Trans. Autom. Control 2022, 68, 1700–1707. [Google Scholar] [CrossRef]
  19. Wang, Z.; Zhou, D.; Gong, S. Uncalibrated visual positioning using adaptive Kalman Filter with dual rate structure for wafer chip in LED packaging. Measurement 2022, 191, 110829. [Google Scholar] [CrossRef]
  20. Wu, F.; Duan, J.; Ai, P.; Chen, Z.; Yang, Z.; Zou, X. Rachis detection and three-dimensional localization of cut off point for vision-based banana robot. Comput. Electron. Agric. 2022, 198, 107079. [Google Scholar] [CrossRef]
  21. Kan, T.S.; Cheng, K.J.; Liu, Y.F.; Wang, R.; Zhu, W.D.; Zhu, F.D.; Jiang, X.F.; Dong, X.T. Evaluation of a custom-designed human–robot collaboration control system for dental implant robot. Int. J. Med. Robot. Comput. Assist. Surg. 2022, 18, e2346. [Google Scholar] [CrossRef] [PubMed]
  22. Shinde, S.; Kothari, A.; Gupta, V. YOLO based human action recognition and localization. Procedia Comput. Sci. 2018, 133, 831–838. [Google Scholar] [CrossRef]
  23. Zhu, Y.; Cheng, P.; Zhuang, J.; Wang, Z.; He, T. Visual Simultaneous Localization and Mapping Optimization Method Based on Object Detection in Dynamic Scene. Appl. Sci. 2024, 14, 1787. [Google Scholar] [CrossRef]
  24. Wu, W.; Guo, L.; Gao, H.; You, Z.; Liu, Y.; Chen, Z. YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint. Neural Comput. Appl. 2022, 34, 6011–6026. [Google Scholar] [CrossRef]
  25. Li, Z.; Xu, B.; Wu, D.; Zhao, K.; Chen, S.; Lu, M.; Cong, J. A YOLO-GGCNN based grasping framework for mobile robots in unknown environments. Expert Syst. Appl. 2023, 225, 119993. [Google Scholar] [CrossRef]
  26. Shen, D.; Xu, H.; Zhang, B.; Cao, M.; Ma, L. Research on visual slam based on improved YOLOv5s in dynamic scenes. J. Phys. Conf. Ser. 2024, 2902, 012033. [Google Scholar] [CrossRef]
  27. Sharath, G.S.; Hiremath, N.; Manjunatha, G. Design and analysis of gantry robot for pick and place mechanism with Arduino Mega 2560 microcontroller and processed using pythons. Mater. Today Proc. 2021, 45, 377–384. [Google Scholar] [CrossRef]
  28. Sun, Z.; Fan, J.; Lu, J.; Zhang, L.; Li, X.; Deng, B. An indoor delivery robot based on YOLOv8 and ROS. J. Phys. Conf. Ser. 2024, 2787, 012020. [Google Scholar] [CrossRef]
  29. Lopez-Rodriguez, F.M.; Cuesta, F. An android and arduino-based low-cost educational robot with applied intelligent control and machine learning. Appl. Sci. 2020, 11, 48. [Google Scholar] [CrossRef]
  30. Choi, K.L.; Kim, M.J.; Kim, Y.M. On safety improvement through process establishment for SOTIF application of autonomous driving logistics robot. Int. J. Internet Broadcast. Commun. 2022, 14, 209–218. [Google Scholar]
  31. Kalpana, M.; Geethika, L.S.; Afzal, S.; Singh, R.; Nookaraju, A.; Kiran, T.T.S.; Praneeth, P.S. Design and Implementation of Versatile Delivery Robot. Eng. Proc. 2024, 66, 42. [Google Scholar]
  32. Tony, W.; Ansari, A.; Hoang, K.B. A Untethered Magnetic Micro-Robot for Local Therapeutic Delivery for Neurosurgical Oncology. Neurosurgery 2024, 70, 207–208. [Google Scholar] [CrossRef]
  33. Aryan, A.; Modi, M.; Saha, I.; Majumdar, R.; Mohalik, S. Optimal Integrated Task and Path Planning and Its Application to Multi-Robot Pickup and Delivery. arXiv 2024, arXiv:2403.01277. [Google Scholar]
  34. Tang, Y.; Chen, M.; Lin, Y.; Huang, X.; Huang, K.; He, Y.; Li, L. Vision-based three-dimensional reconstruction and monitoring of large-scale steel tubular structures. Adv. Civ. Eng. 2020, 2020, 1236021. [Google Scholar] [CrossRef]
  35. Popovici, A.T.; Dosoftei, C.C.; Budaciu, C. Kinematics calibration and validation approach using indoor positioning system for an omnidirectional mobile robot. Sensors 2022, 22, 8590. [Google Scholar] [CrossRef] [PubMed]
  36. Zan, J. Research on robot path perception and optimization technology based on whale optimization algorithm. J. Comput. Cogn. Eng. 2022, 1, 201–208. [Google Scholar] [CrossRef]
  37. Ji, J.; Zhao, J.S.; Misyurin, S.Y.; Martins, D. Precision-driven multi-target path planning and fine position error estimation on a dual-movement-mode mobile robot using a three-parameter error model. Sensors 2023, 23, 517. [Google Scholar] [CrossRef]
  38. Qiu, Z.; Wu, Z. Adaptive neural network control for image-based visual servoing of robot manipulators. IET Control Theory Appl. 2022, 16, 443–453. [Google Scholar] [CrossRef]
  39. Xu, F.; Wang, H.; Liu, Z.; Chen, W.; Wang, Y. Visual servoing pushing control of the soft robot with active pushing force regulation. Soft Robot. 2022, 9, 690–704. [Google Scholar] [CrossRef]
  40. Ahmed, E.Q.; Aljazaery, I.A.; Al-zubidi, A.F.; ALRikabi, H.T.S. Design and implementation control system for a self-balancing robot based on internet of things by using Arduino microcontroller. Period. Eng. Nat. Sci. (PEN) 2021, 9, 409–417. [Google Scholar] [CrossRef]
  41. Siregar, I.M.; Siagian, N.F.; Siregar, V.M.M. A design of an electric light control device using arduino uno microcontroller-based short message service. IOTA J. 2022, 2, 98–110. [Google Scholar] [CrossRef]
  42. Top, A.; Gökbulut, M. Android application design with mit app inventor for bluetooth based mobile robot control. Wirel. Pers. Commun. 2022, 126, 1403–1429. [Google Scholar] [CrossRef]
  43. Jing, Q.; Wu, R.; Zhang, Z.; Li, Y.; Chang, Q.; Liu, W.; Huang, X. A YOLO11-Based Method for Segmenting Secondary Phases in Cu-Fe Alloy Microstructures. Information 2025, 16, 570. [Google Scholar] [CrossRef]
  44. Fan, X.; Zhou, J. Nondestructive Detection and Quality Grading System of Walnut Using X-Ray Imaging and Lightweight WKNet. Foods 2025, 14, 2346. [Google Scholar] [CrossRef]
  45. Li, Z.; Zhang, Y.; Kang, X.; Mao, T.; Li, Y.; Liu, G. Individual Recognition of a Group Beef Cattle Based on Improved YOLO v5. Agriculture 2025, 15, 1391. [Google Scholar] [CrossRef]
  46. Ryu, J.; Kwak, D.; Choi, S. YOLOv8 with Post-Processing for Small Object Detection Enhancement. Appl. Sci. 2025, 15, 7275. [Google Scholar] [CrossRef]
  47. Sánchez-Vega, J.A.; Silva-López, J.O.; Salas Lopez, R.; Medina-Medina, A.J.; Tuesta-Trauco, K.M.; Rivera-Fernandez, A.S.; Silva-Melendez, T.B.; Oliva-Cruz, M.; Barboza, E.; da Silva Junior, C.A.; et al. Automatic Detection of Ceroxylon Palms by Deep Learning in a Protected Area in Amazonas (NW Peru). Forests 2025, 16, 1061. [Google Scholar] [CrossRef]
  48. Özakar, R.; Gedikli, E. Hand Washing Gesture Recognition Using Synthetic Dataset. J. Imaging 2025, 11, 208. [Google Scholar] [CrossRef]
  49. Zhao, X.; Zhang, H.; Zhang, W.; Ma, J.; Li, C.; Ding, Y.; Zhang, Z. MSUD-YOLO: A Novel Multiscale Small Object Detection Model for UAV Aerial Images. Drones 2025, 9, 429. [Google Scholar] [CrossRef]
  50. Su, X.; Peng, X.; Zhou, X.; Cao, H.; Shan, C.; Li, S.; Qiao, S.; Shi, F. Enhanced Defect Detection in Additive Manufacturing via Virtual Polarization Filtering and Deep Learning Optimization. Photonics 2025, 12, 599. [Google Scholar] [CrossRef]
  51. Xiong, M.; Wu, A.; Yang, Y.; Fu, Q. Efficient Brain Tumor Segmentation for MRI Images Using YOLO-BT. Sensors 2025, 25, 3645. [Google Scholar] [CrossRef]
  52. Yin, Z.; Li, H.; Qi, B.; Shan, G. BBW YOLO: Intelligent Detection Algorithms for Aluminium Profile Material Surface Defects. Coatings 2025, 15, 684. [Google Scholar] [CrossRef]
  53. Loganathan, V.; Ravikumar, D.; Manibha, M.P.; Kesavan, R.; Kusala Kumar, G.R.; Sasikumar, S. Development of Autonomous Unmanned Aerial Vehicle for Environmental Protection Using YOLO V3. Eng. Proc. 2025, 87, 72. [Google Scholar]
  54. Piratinskii, E.; Rabaev, I. COSMICA: A Novel Dataset for Astronomical Object Detection with Evaluation Across Diverse Detection Architectures. J. Imaging 2025, 11, 184. [Google Scholar] [CrossRef] [PubMed]
  55. Wei, J.; Gong, H.; Luo, L.; Ni, L.; Li, Z.; Fan, J.; Hu, T.; Mu, Y.; Sun, Y.; Gong, H. For Precision Animal Husbandry: Precise Detection of Specific Body Parts of Sika Deer Based on Improved YOLO11. Agriculture 2025, 15, 1218. [Google Scholar] [CrossRef]
  56. Reyes, A.L.E.; Cruz, J.C.D. Anomalous Weapon Detection for Armed Robbery Using Yolo V8. Eng. Proc. 2025, 92, 85. [Google Scholar]
  57. Zhou, N.; Gao, D.; Zhu, Z. YOLOv8n-SMMP: A Lightweight YOLO Forest Fire Detection Model. Fire 2025, 8, 183. [Google Scholar] [CrossRef]
  58. Mao, M.; Hong, M. YOLO Object Detection for Real-Time Fabric Defect Inspection in the Textile Industry: A Review of YOLOv1 to YOLOv11. Sensors 2025, 25, 2270. [Google Scholar] [CrossRef] [PubMed]
  59. Lv, R.; Hu, J.; Zhang, T.; Chen, X.; Liu, W. Crop-Free-Ridge Navigation Line Recognition Based on the Lightweight Structure Improvement of YOLOv8. Agriculture 2025, 15, 942. [Google Scholar] [CrossRef]
  60. Zhao, M.; Cui, B.; Yu, Y.; Zhang, X.; Xu, J.; Shi, F.; Zhao, L. Intelligent Detection of Tomato Ripening in Natural Environments Using YOLO-DGS. Sensors 2025, 25, 2664. [Google Scholar] [CrossRef]
  61. Deng, H.; Zhang, S.; Wang, X.; Han, T.; Ye, Y. USD-YOLO: An Enhanced YOLO Algorithm for Small Object Detection in Unmanned Systems Perception. Appl. Sci. 2025, 15, 3795. [Google Scholar] [CrossRef]
  62. Chen, Y.; Zhang, A.; Shi, J.; Gao, F.; Guo, J.; Wang, R. Paint Loss Detection and Segmentation Based on YOLO: An Improved Model for Ancient Murals and Color Paintings. Heritage 2025, 8, 136. [Google Scholar] [CrossRef]
  63. Tanimoto, Y.; Zhang, Z.; Yoshida, S. Object Detection for Yellow Maturing Citrus Fruits from Constrained or Biased UAV Images: Performance Comparison of Various Versions of YOLO Models. AgriEngineering 2024, 6, 4308–4324. [Google Scholar] [CrossRef]
  64. Shia, W.-C.; Ku, T.-H. Enhancing Microcalcification Detection in Mammography with YOLO-v8 Performance and Clinical Implications. Diagnostics 2024, 14, 2875. [Google Scholar] [CrossRef]
  65. Kim, Y.-S.; Kim, J.G.; Choi, H.Y.; Lee, D.; Kong, J.-W.; Kang, G.H.; Jang, Y.S.; Kim, W.; Lee, Y.; Kim, J.; et al. Detection of Aortic Dissection and Intramural Hematoma in Non-Contrast Chest Computed Tomography Using a You Only Look Once-Based Deep Learning Model. J. Clin. Med. 2024, 13, 6868. [Google Scholar] [CrossRef]
Figure 1. System composition diagram.
Figure 2. Control system design scheme.
Figure 3. YOLO X fruit object detection network training model framework.
Figure 4. Schematic of binocular camera imaging.
Figure 5. A calibrated binocular camera measurement model.
Figure 6. The method for 3D camera calibration of fruit recognition.
Figure 7. Acquired calibration plate images.
Figure 8. Process of solving fruit shape parameters.
Figure 9. Block diagram of the cooperative robotic arm.
Figure 10. Motion control system design.
Figure 11. Collection type diagram.
Figure 12. Generalization ability of the model in predicting different features.
Figure 13. Recognition and matching of fruit posture under different shielding conditions.
Figure 14. Comparison between YOLO v12n optimization model and traditional image processing test.
Table 1. Measurement indicators of binocular cameras.
| I/O Parameter | Measurement Index |
| Input | Real-time image and video resolution; live format (RGB/MP4); camera center position and distortion coefficient; camera initialization parameters |
| Output | Image and video resolution; image/video format; depth information after image/video matching; distance of the target object from the camera |
Table 2. Internal parameters of camera calibration.
| Camera Internal Parameter | Calibration Value | Camera Internal Parameter | Calibration Value |
| Focal length f/m | 0.0120 | Single pixel height/m | 4.400 × 10−6 |
| Distortion coefficient/m² | −397.275 | Center point X coordinate/pixel | 813.183 |
| Single pixel width/m | 4.399 × 10−6 | Center point Y coordinate/pixel | 618.186 |
Table 3. Camera calibration external parameters.
| Camera External Parameter (Translation matrix T) | Calibration Value | Camera External Parameter (Rotation matrix R) | Calibration Value |
| Δx/m | −0.0301154 | α/(°) | 1.19178 |
| Δy/m | −0.0267293 | θ/(°) | 359.161 |
| Δz/m | 0.599653 | γ/(°) | 0.231355 |
Table 4. Statistics of various types of apple datasets.
| Dataset | Training Set | Validation Set | Test Set | Total Quantity |
| Blade shielding type | 652 | 205 | 106 | 963 |
| Fruit overlapping type | 643 | 211 | 105 | 959 |
| Surface glare | 608 | 226 | 105 | 939 |
| Missing characteristics | 602 | 208 | 107 | 917 |
| Total quantity | 2505 | 850 | 523 | 3778 |
Table 5. Training parameters.
| Training Parameter | Value |
| Input image size/pixel | 640 × 640 × 3 |
| Init learning rate | 0.01 |
| Batch_size | 16 |
| Epoch | 300 |
Table 6. Outline dimension test.
| Size Type | Measured Value | Actual Value | Difference Rate/% |
| Length value of the identification box/pix | 487.889 | 488.600 | 0.15 |
| Width value of the identification box/pix | 238.109 | 240.100 | 0.83 |
| Calculated length/mm | 8.990 | 9.770 | 8 |
| Calculated width/mm | 7.315 | 7.710 | 5 |
Table 7. Comprehensive performance comparison of YOLO v12n and early models/%.
| Model Category | Accuracy | Recall | F1 Score | Mean Accuracy |
| YOLO v8n | 90.66 | 84.57 | 87.36 | 93.32 |
| YOLO v10n | 90.98 | 84.05 | 87.38 | 90.06 |
| YOLO v12n | 91.14 | 84.32 | 87.60 | 91.76 |
| HALCON | 66.67 | 41.67 | / | 58.54 |
Table 8. Comparison of detection time (ms) between YOLO v12n and early models.
| Detection Link | HALCON | YOLO v8n | YOLO v10n | YOLO v12n |
| Image preprocessing | 4.2 | 3.1 | 2.8 | 2.9 |
| Image segmentation | 372.9 | / | / | / |
| Morphological processing | 708.3 | / | / | / |
| Feature extraction | 270.2 | / | / | / |
| Target screening/prediction | 280.1 | 29.8 | 30.2 | 29.1 |
| Result output | 40.2 | 1.8 | 1.9 | 1.8 |
| Total time/ms | 1675.9 | 31.6 | 32.1 | 30.9 |
Table 9. Ablation test of the optimized model of YOLOv12n.
| SPPF | C2PSA | Upsample | mAP50seg/% | mAP50-95seg/% |
| / | / | / | 97.5 | 74.2 |
| √ | / | / | 98.6 | 76.5 |
| √ | √ | / | 98.7 | 77.1 |
| √ | √ | √ | 98.7 | 78.2 |
