3.2.1. Single-Frame 3D Positioning Accuracy Test
To evaluate the effectiveness of the method proposed in this paper, we recorded 100 sets of strawberry images and depth maps under different occlusion levels in Yunchuang Farm, Panlong District, Kunming City, Yunnan Province. Subsequently, we used the improved YOLOv8n for inference to obtain the bounding boxes of strawberries in each image. In field tests, the occluders of strawberries can be roughly divided into five categories: fruit overlap occlusion, unripe fruit occlusion, calyx occlusion, leaf occlusion, and stem occlusion. According to the degree of occlusion, they are further divided into slight occlusion, moderate occlusion, and severe occlusion. In actual harvesting scenarios, slight and moderate occlusion under different occluder conditions are more common.
Figure 13 shows examples of 3D centroid reconstruction for strawberries in real-world scenarios. The white boxes are the bounding boxes predicted by the model, the white points are the center points of the bounding boxes, and the red points are the coordinates after centroid reconstruction. Under slight occlusion, the reconstructed centroids almost coincide with the bounding-box centers, as shown in Figure 13 for leaf occlusion and stem occlusion. This indicates that the two methods perform similarly when the depth inside the bounding box is continuous and free of obvious interference. When the occluder lies exactly at the center of the bounding box, the reconstructed positioning point still falls on the strawberry.
The 100 sets of photos collected on-site contained a total of 216 strawberries, of which 209 were successfully reconstructed in 3D. Of the remaining 7 strawberries, 5 failed to complete reconstruction due to the lack of valid depth information within the bounding box, and 1 was offset onto the occluder due to large-scale occlusion, as shown in Figure 14. The former failure is mainly caused by strong natural light in the strawberry greenhouse interfering with the structured-light projection. The depth camera then cannot correctly measure the distance to the target surface, which produces large “depth voids” (depth value of 0) in the depth map and leaves no usable depth data within the bounding box. The latter deviation occurs under extremely severe occlusion. When the occluder occupies most of the bounding box and lies near its geometric center, the weighted centroid algorithm tends to assign higher spatial distribution weight and depth continuity weight to the occluder pixels, so the reconstruction point falls on the occluder rather than on the fruit surface.
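The first failure mode can be detected before reconstruction by counting the valid depth pixels inside the bounding box. The following is a minimal sketch, assuming the depth map is a row-major 2D list in which 0 marks a depth void; the function names and the `min_valid_ratio` threshold are hypothetical and are not taken from the paper.

```python
def valid_depth_ratio(depth, box):
    """Return the fraction of pixels inside box = (x1, y1, x2, y2) with non-zero depth.

    x2 and y2 are exclusive, as in Python slicing.
    """
    x1, y1, x2, y2 = box
    total = 0
    valid = 0
    for row in depth[y1:y2]:
        for d in row[x1:x2]:
            total += 1
            if d > 0:  # 0 is treated as a depth void
                valid += 1
    return valid / total if total else 0.0


def has_usable_depth(depth, box, min_valid_ratio=0.1):
    """Hypothetical gate: skip reconstruction when too few pixels carry depth."""
    return valid_depth_ratio(depth, box) >= min_valid_ratio
```

A detection that fails this gate could be skipped or re-measured in a later frame instead of being passed to the centroid reconstruction.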
To further evaluate the performance of the proposed weighted geometric centroid reconstruction method, comparative experiments are conducted against two baseline approaches: the traditional center-point depth method and a region-of-interest (ROI) refinement method with median depth extraction. The effectiveness of the proposed method in improving the accuracy of fruit 3D localization is validated under varying occlusion conditions.
As illustrated in Figure 15, the experimental setup consists of a depth camera fixed on a stable platform, a strawberry holder with an adjustable position, and a calibrated reference base. A precision ruler is used to obtain the ground truth coordinates in the camera coordinate system. By adjusting the relative positions of the strawberry and the camera, a total of nine experimental scenarios are constructed, covering four representative occlusion conditions: no occlusion (fully visible fruit), slight occlusion (partially covered by leaves), moderate occlusion (partial edge occlusion), and severe occlusion (most of the fruit boundary occluded).
For each scenario, 50 repeated measurements are performed, and the averaged results are used for evaluation. The performance of different methods is quantitatively assessed using the root mean square error (RMSE) as the primary metric, which reflects both accuracy and sensitivity to outliers. In addition, the standard deviation (STD) is reported to evaluate the stability of the reconstruction results.
The three methods compared in this experiment include (1) the conventional method based on the bounding box center point with corresponding depth value, (2) an ROI-based method that reduces the detection region and extracts the median depth to suppress noise, and (3) the proposed weighted geometric centroid reconstruction method, which integrates spatial weighting and depth consistency to achieve more accurate and robust 3D position estimation.
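The two baselines can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the experiments: the depth map is a row-major 2D list with 0 marking invalid pixels, boxes are `(x1, y1, x2, y2)` with exclusive upper bounds, and the `shrink` ratio of the ROI is a hypothetical value.

```python
from statistics import median


def center_point_depth(depth, box):
    """Baseline (1): depth at the bounding-box center pixel."""
    x1, y1, x2, y2 = box
    u, v = (x1 + x2) // 2, (y1 + y2) // 2
    return u, v, depth[v][u]


def roi_median_depth(depth, box, shrink=0.5):
    """Baseline (2): median of the valid depths inside a shrunken ROI.

    shrink is the side-length ratio of the ROI to the box (assumed value).
    """
    x1, y1, x2, y2 = box
    margin_x = int((x2 - x1) * (1 - shrink) / 2)
    margin_y = int((y2 - y1) * (1 - shrink) / 2)
    values = [
        d
        for row in depth[y1 + margin_y:y2 - margin_y]
        for d in row[x1 + margin_x:x2 - margin_x]
        if d > 0  # ignore depth voids
    ]
    u, v = (x1 + x2) // 2, (y1 + y2) // 2
    return u, v, (median(values) if values else 0)
```

With a single occluder pixel at the box center, the center-point baseline returns the occluder depth, while the ROI median still returns the fruit depth, which matches the qualitative behaviour described above.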
The experimental results in Table 4 show that the proposed weighted centroid method consistently outperforms the traditional center-point depth method across the occlusion conditions, and that it has clear advantages over the ROI-based median depth method, particularly under moderate and severe occlusion. To provide a more intuitive evaluation, the errors are also normalized by the average strawberry diameter (4.5 cm) and reported as relative errors.
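The metrics and the normalization can be written out as follows. This is a sketch under assumptions: the per-measurement error is taken to be the Euclidean distance between the estimated and ground truth 3D points, and the STD is the population standard deviation of those errors. The exact definitions behind Table 4 are not restated in this section, so the function below is illustrative only.

```python
import math


def localization_metrics(estimates, ground_truth, fruit_diameter_cm=4.5):
    """Compute MAE, RMSE, STD (cm) and relative errors (% of fruit diameter)."""
    errors = [math.dist(p, ground_truth) for p in estimates]
    n = len(errors)
    mae = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    std = math.sqrt(sum((e - mae) ** 2 for e in errors) / n)
    return {
        "mae": mae,
        "rmse": rmse,
        "std": std,
        "mae_rel_pct": 100 * mae / fruit_diameter_cm,
        "rmse_rel_pct": 100 * rmse / fruit_diameter_cm,
    }
```

For example, an MAE of 0.19 cm corresponds to 0.19 / 4.5 ≈ 4.22% of the average fruit diameter.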
Under no-occlusion conditions, the depth distribution within the bounding box remains continuous and stable. As a result, all three methods achieve comparable performance, with only marginal differences. Specifically, the mean absolute error (MAE) of the weighted centroid method is 0.19 cm (4.22%), compared to 0.21 cm (4.67%) for the center-point method and 0.18 cm (4.00%) for the ROI-based method, indicating that the benefit of more elaborate strategies is limited when depth information is reliable.
However, as occlusion increases, the performance gap between methods becomes more pronounced. The traditional center-point method exhibits significant degradation due to its reliance on a single pixel location. For instance, under slight occlusion, its MAE increases to 0.65 cm (14.44%), and further rises to 0.89 cm (19.78%) under moderate occlusion. In contrast, the ROI-based method improves robustness by suppressing background noise through regional filtering, achieving MAEs of 0.42 cm (9.33%) and 0.66 cm (14.67%) under slight and moderate occlusion, respectively.
The proposed weighted centroid method demonstrates superior performance by incorporating spatial weighting and depth consistency. It achieves MAEs of 0.33 cm (7.33%), 0.54 cm (12.00%), and 0.73 cm (16.22%) under slight, moderate, and severe occlusion conditions, respectively, with corresponding RMSE values of 0.40 cm, 0.62 cm, and 0.81 cm. Compared to the baseline methods, it maintains lower absolute and relative errors, as well as reduced variance, indicating improved robustness and stability under complex occlusion scenarios.
To determine the optimal parameter combination of the weighted 3D centroid reconstruction algorithm and validate its stability, a full grid search was conducted over the spatial scaling factor and the depth bias adjustment coefficient. A total of 50 strawberry samples with varying levels of occlusion and fruit sizes were selected as the test set. The geometric center of each strawberry was manually measured in the camera coordinate system with a high-precision ruler and used as the ground truth (GT).
During the experiments, the spatial scaling factor was searched with a step size of 0.05, while the depth bias adjustment coefficient was varied with a step size of 5 mm. The root mean square error (RMSE) was adopted as the evaluation metric to assess both localization accuracy and sensitivity to outliers.
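The search procedure itself is an exhaustive loop over both parameters. The sketch below is a generic illustration: the helper names, the parameter ranges, and the `toy_rmse` surface are hypothetical stand-ins. In the experiment, the scoring function would be the RMSE of the reconstructed centroids over the 50 annotated samples.

```python
def frange(start, stop, step):
    """Inclusive float range that avoids accumulated rounding error."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(error_fn, scale_values, depth_values):
    """Return (best_error, best_scale, best_depth) over the full grid."""
    best = None
    for s in scale_values:
        for d in depth_values:
            err = error_fn(s, d)
            if best is None or err < best[0]:
                best = (err, s, d)
    return best


def toy_rmse(scale, depth_mm):
    """Synthetic bowl-shaped surface for demonstration only, not measured data."""
    return (scale - 0.3) ** 2 + ((depth_mm - 15) / 100) ** 2


# Hypothetical ranges; only the step sizes (0.05 and 5 mm) come from the text.
best = grid_search(toy_rmse, frange(0.1, 0.5, 0.05), range(5, 31, 5))
```

Because every grid point is evaluated, the full error surface is also available for plotting, which is what makes a basin-shaped visualization possible.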
As illustrated in Figure 16, the RMSE distribution forms a smooth and continuous basin-shaped surface in the two-dimensional parameter space, with a clearly identifiable global minimum (marked by the blue dot). The error surface decreases gradually from the outer regions toward the central valley and then increases again, indicating a well-defined optimal region rather than isolated local minima. Specifically, with one parameter held fixed, the RMSE first decreases and then increases as the other grows, forming a convex trend along that direction.
The lowest error is observed when the spatial scaling factor corresponds to approximately one-quarter of the bounding box width. This suggests that the selected spatial weighting effectively balances the contribution of central pixels against the suppression of noisy depth measurements from occluded boundaries. When the scaling factor is too small, the weighting becomes overly concentrated, making the estimate sensitive to local depth noise. In contrast, when it is too large, the inclusion of peripheral pixels introduces interference from surrounding leaves and stems, resulting in centroid bias.
Along the depth bias dimension, the RMSE surface shows a clear minimum, where the global minimum RMSE (approximately 0.32 cm) is achieved. When the coefficient is too small, the depth filtering window is too narrow, leading to the loss of valid surface points and insufficient geometric representation. Conversely, when it is too large, the window becomes excessively wide, so background depth values and neighboring structures are incorporated, which significantly degrades localization accuracy.
Furthermore, the relatively flat and wide valley around the optimal point indicates low sensitivity to parameter perturbations. This characteristic demonstrates that the proposed method maintains stable performance even under moderate variations in fruit size and occlusion conditions. Such a smooth error landscape supports the robustness and practical applicability of the weighted centroid reconstruction method in unstructured orchard environments.
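One possible reading of how the two parameters act is sketched below. It is an illustration under stated assumptions, not the paper's formulation: the Gaussian spatial weight, the hard depth window around the median depth, the function name `weighted_centroid`, and the default values `alpha=0.25` and `delta_mm=20` are all hypothetical.

```python
import math
from statistics import median


def weighted_centroid(depth, box, alpha=0.25, delta_mm=20):
    """Return a weighted (u, v, z) centroid inside box, or None if no pixel is usable.

    alpha scales the spatial Gaussian relative to the box width (assumed form).
    delta_mm is the half-width of the depth window around the median depth (assumed form).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2 - 1) / 2, (y1 + y2 - 1) / 2  # box center in pixel coordinates
    sigma = alpha * (x2 - x1)
    valid = [d for row in depth[y1:y2] for d in row[x1:x2] if d > 0]
    if not valid:
        return None
    z_ref = median(valid)
    sw = su = sv = sz = 0.0
    for v in range(y1, y2):
        for u in range(x1, x2):
            d = depth[v][u]
            if d <= 0 or abs(d - z_ref) > delta_mm:
                continue  # depth void, or outside the depth window
            w = math.exp(-((u - cx) ** 2 + (v - cy) ** 2) / (2 * sigma ** 2))
            sw += w
            su += w * u
            sv += w * v
            sz += w * d
    if sw == 0:
        return None
    return su / sw, sv / sw, sz / sw
```

In this form, a small `alpha` concentrates the weight near the box center, and a small `delta_mm` discards more pixels, which mirrors the two trade-offs described above. The result is in pixel coordinates plus depth; back-projection to camera coordinates with the camera intrinsics is omitted.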
Nevertheless, the accuracy of all methods degrades under severe occlusion. Although the proposed method still achieves the best performance (MAE = 0.73 cm, STD = ±0.23 cm), some increase in error is unavoidable. This is because, in such scenarios, the proportion of valid depth pixels within the detection region is significantly reduced, and the remaining depth information may be heavily contaminated by occluders. As a result, the reliability of the weighted centroid estimation is affected, leading to deviations from the true fruit position. Representative examples of such cases are illustrated in Figure 17.
Figure 17 presents qualitative comparisons of different reconstruction strategies under varying occlusion conditions. The first row (Figure 17a–d) shows the original RGB images under four conditions: no occlusion, slight occlusion, moderate occlusion, and severe occlusion. The second row (Figure 17e–h) presents the corresponding detection results obtained by the improved YOLOv8n model. Under occlusion, the predicted bounding boxes inevitably include part of the occluders, which makes accurate depth estimation more difficult. The third row (Figure 17i–l) illustrates the results of the proposed weighted centroid reconstruction method in the form of spatial weight distributions. Pixels located on the visible fruit surface are assigned higher weights, forming a concentrated high-response region. The final centroid is obtained through a weighted aggregation over the entire distribution rather than by selecting a single maximum point, which enables the method to better capture the true geometric center of the fruit. Even under slight and moderate occlusion (Figure 17j,k), the high-weight region remains primarily distributed over the exposed fruit area, demonstrating robustness against occlusion interference. The fourth row (Figure 17m–p) shows the results of ROI shrinking combined with median depth extraction, where the blue boxes are the prediction boxes and the green boxes are the ROIs obtained by contracting them. By reducing the effective region and taking the median depth, this method alleviates part of the background noise and improves robustness compared to simple center-point estimation. However, because it lacks spatial weighting, the reconstructed results are still influenced by the distribution of the remaining pixels. In particular, under severe occlusion (Figure 17p), the estimated position tends to deviate from the actual fruit center.
Overall, the comparison indicates that the proposed weighted centroid method can effectively suppress the influence of occluders by emphasizing reliable pixels and reducing the contribution of noisy or irrelevant regions. Nevertheless, under extreme conditions where the visible portion of the fruit is severely limited, the available valid depth information becomes insufficient, which may still lead to deviations in the reconstructed centroid.
In addition, it should be noted that the performance of the system is also affected by the characteristics of the depth sensor. For example, structured-light depth cameras may experience reduced measurement reliability under strong natural illumination, potentially introducing additional noise into depth estimation. Furthermore, the current weighting strategy is primarily based on empirical design and lacks full adaptability. Future work could explore multi-sensor fusion approaches, such as integrating depth cameras with LiDAR, and incorporating lightweight semantic segmentation techniques to further improve the discrimination between fruits and occluding objects.
3.2.2. IMU Compensation Experiment Under Dynamic Interference
To verify the effectiveness of the proposed IMU-assisted dynamic compensation and multi-frame fusion method under dynamic interference, we designed and built a controllable vibration simulation test platform, as shown in Figure 18. The platform is installed 200 mm from the strawberry experimental bench and generates multi-directional random disturbances, thereby simulating dynamic interference such as the vibration of robotic arms during orchard operations, platform displacement, or uneven ground. The vibration device consists of two servo motors and an ESP32 control board. The servo motors generate disturbances in the horizontal and pitch directions, while the ESP32 drives the servos via PWM signals to execute pseudo-random control sequences, causing the camera to exhibit irregular attitude jitter in multiple directions. By adjusting the amplitude and frequency of servo deflection, disturbances of different intensities can be set flexibly, which gives controllable simulation of the experimental environment.
To make the simulated vibration closer to real conditions, we recorded the raw IMU data during actual operation of the harvesting robot, including tri-axial acceleration and tri-axial angular velocity, and defined a quantitative index of disturbance intensity δt according to Formula (20) in Section 2.6.
Based on the statistical distribution of this index, the disturbances were divided into slight jitter, moderate jitter, and severe jitter. Experiments were conducted on the simulation platform for each disturbance level. Throughout the experimental process, the camera captured RGB images and depth images at 30 fps, and the IMU recorded acceleration and angular velocity at 200 Hz. All data were timestamped and aligned through interpolation to ensure consistency between visual and inertial information. Under stable conditions, the true coordinates of the strawberries were taken as (5, 5, 20). Multiple independent tests were executed under each disturbance level, with 100 positioning points taken each time to ensure the statistical reliability of the results.
To address the mismatch between the 30 fps camera rate and the 200 Hz IMU rate, we implemented linear interpolation for timestamp alignment. Specifically, for each camera frame, we identified the two IMU readings closest to the frame timestamp, one before and one after it, and calculated the synchronized inertial state by linear interpolation between them. Since the IMU sampling interval is 5 ms, a frame timestamp is at most 2.5 ms from the nearest IMU reading, and interpolation further reduces the residual synchronization error.
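The timestamp alignment described above can be sketched as follows. The function name and argument layout are assumptions made for illustration; the IMU state is treated as a plain vector (for example, three acceleration and three angular velocity components), and timestamps are assumed to be sorted and expressed in the same unit.

```python
from bisect import bisect_right


def interpolate_imu(imu_times, imu_samples, t_cam):
    """Linearly interpolate the IMU state vector at camera timestamp t_cam.

    imu_times   : sorted list of IMU timestamps
    imu_samples : list of state vectors, one per timestamp
    """
    if not imu_times[0] <= t_cam <= imu_times[-1]:
        raise ValueError("camera timestamp outside the IMU recording")
    k = bisect_right(imu_times, t_cam) - 1  # last IMU reading at or before t_cam
    if k == len(imu_times) - 1:
        return list(imu_samples[k])  # t_cam coincides with the final reading
    t0, t1 = imu_times[k], imu_times[k + 1]
    lam = (t_cam - t0) / (t1 - t0)  # 0 at t0, 1 at t1
    return [a + lam * (b - a) for a, b in zip(imu_samples[k], imu_samples[k + 1])]
```

With a 5 ms IMU interval, each camera frame is bracketed by two readings, and the interpolation weight `lam` expresses where the frame falls between them.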
In terms of method comparison, four schemes are designed for evaluation. The first is the uncompensated method, which directly uses single-frame 3D coordinates obtained from detection and depth reconstruction. The second is the baseline method, where a sliding window with equal weights is applied to average the 3D coordinates of consecutive frames, aiming to suppress short-term jitter. The third method introduces IMU-assisted compensation, in which the rotation matrix estimated from IMU measurements is used to correct single-frame coordinates, followed by a dynamic adaptive-weight sliding window fusion to improve robustness under motion disturbances. The fourth method corresponds to the proposed approach, where an Extended Kalman Filter (EKF) is further applied to the fused coordinates to model temporal dynamics and refine the estimation results. This hierarchical experimental design enables a comprehensive evaluation of the contributions of sliding window smoothing, IMU-based compensation, and EKF-based temporal refinement to the overall localization performance.
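The difference between the equal-weight baseline and a weighted sliding window can be illustrated with one small class. This is a sketch only: the class name, the window size, and the idea of passing a per-frame weight are assumptions, and the paper's adaptive weighting rule, IMU rotation correction, and EKF stage are not reproduced here.

```python
from collections import deque


class SlidingWindowFusion:
    """Fuse the most recent 3D points with per-frame weights."""

    def __init__(self, size=5):
        self.points = deque(maxlen=size)   # oldest entries drop out automatically
        self.weights = deque(maxlen=size)

    def update(self, point, weight=1.0):
        """Add one (x, y, z) point and return the weighted mean of the window."""
        self.points.append(point)
        self.weights.append(weight)
        total = sum(self.weights)
        return tuple(
            sum(w * p[i] for w, p in zip(self.weights, self.points)) / total
            for i in range(3)
        )
```

Calling `update` with the default weight reproduces the equal-weight baseline. Passing a smaller weight for frames recorded under stronger disturbance gives a simple form of adaptive weighting, so that unreliable frames contribute less to the fused coordinate.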
To comprehensively evaluate the methods, the experiment used the mean absolute error (MAE), standard deviation (STD), and root mean square error (RMSE) of the 3D positioning results as metrics. The experimental results are shown in Table 5.
Table 5 presents the error and stability performance of four methods under different vibration levels. To provide a scale-invariant evaluation, the errors are additionally normalized with respect to the average strawberry diameter (4.5 cm) and reported as relative errors. The results demonstrate that the proposed IMU-assisted strategies significantly improve localization accuracy and robustness compared with the baseline methods, and that the integration of EKF further enhances performance.
Under slight vibration conditions, all methods maintain relatively good performance due to limited dynamic disturbance, although noticeable differences can still be observed. The uncompensated method exhibits an MAE of 1.21 cm (26.89%) and an RMSE of 1.32 cm (29.33%), indicating the presence of inherent jitter even under low disturbance. Sliding window smoothing reduces short-term fluctuations, lowering the MAE to 0.68 cm (15.11%) and the RMSE to 0.76 cm (16.89%). The IMU-assisted weighted sliding window method further improves the results, achieving an MAE of 0.46 cm (10.22%) and an RMSE of 0.52 cm (11.56%), demonstrating its ability to compensate for minor motion disturbances. With the introduction of EKF, the error is further reduced to an MAE of 0.28 cm (6.22%) and an RMSE of 0.32 cm (7.11%), indicating improved temporal consistency and filtering effectiveness.
Under moderate vibration conditions, the differences among the methods become more pronounced. The uncompensated method shows significant degradation, with an MAE of 2.23 cm (49.56%) and an RMSE of 2.34 cm (52.00%), reflecting poor robustness under dynamic disturbance. Sliding window smoothing provides limited improvement, reducing the MAE to 1.35 cm (30.00%) and the RMSE to 1.48 cm (32.89%), but still suffers from accumulated motion error. In contrast, the IMU-assisted method maintains strong robustness, achieving an MAE of 0.68 cm (15.11%) and an RMSE of 0.77 cm (17.11%), effectively suppressing motion-induced deviations. The EKF-enhanced method further improves accuracy, reducing the MAE to 0.40 cm (8.89%) and the RMSE to 0.45 cm (10.00%), demonstrating its advantage in modeling dynamic system behavior and reducing cumulative errors.
Under severe vibration conditions, the performance gap becomes even more significant. The uncompensated method exhibits large errors (MAE = 2.45 cm (54.44%), RMSE = 2.55 cm (56.67%)), indicating poor reliability in highly disturbed environments. Although sliding window smoothing alleviates fluctuations to some extent (MAE = 1.58 cm (35.11%), RMSE = 1.85 cm (41.11%)), its compensation capability remains limited due to the lack of motion awareness. The IMU-assisted weighted sliding window method significantly improves performance, achieving an MAE of 0.92 cm (20.44%) and an RMSE of 1.05 cm (23.33%), demonstrating its effectiveness in compensating for strong disturbances. Furthermore, the proposed IMU + EKF method achieves the best results, with the MAE reduced to 0.51 cm (11.33%) and the RMSE to 0.57 cm (12.67%), highlighting its superior robustness and stability under severe vibration conditions.
Overall, the results indicate that while sliding window smoothing can effectively reduce high-frequency noise, it cannot fundamentally eliminate motion-induced errors. The IMU-assisted method addresses this limitation by incorporating motion compensation, and the integration of EKF further enhances temporal consistency by explicitly modeling system dynamics. Consequently, the combined IMU + EKF framework achieves the most accurate and stable localization performance across all vibration conditions.
As illustrated by the 3D point cloud distributions in Figure 19, the localization performance of the four methods under different vibration levels (severe, moderate, and slight jitter) can be compared intuitively. The green points denote the ground truth position (5, 5, 20), while the red points represent the uncompensated results (Figure 19a,e,i), which exhibit significant dispersion, especially under severe jitter conditions. After applying sliding window smoothing (Figure 19b,f,j), the spatial distribution of the points becomes more concentrated, indicating that short-term fluctuations are effectively suppressed. However, noticeable deviations from the ground truth remain due to uncorrected motion-induced errors. With the introduction of IMU-assisted compensation and adaptive weighted sliding window fusion (Figure 19c,g,k), the point cloud clusters significantly more tightly around the true position. This demonstrates that IMU-based pose correction effectively reduces systematic errors caused by platform motion, while the adaptive weighting mechanism further suppresses unreliable observations. Finally, as shown in Figure 19d,h,l, the proposed method incorporating EKF-based refinement achieves the most compact and stable distribution. By explicitly modeling temporal dynamics, the EKF further reduces residual fluctuations and enhances the continuity of the localization results across frames. A horizontal comparison across the vibration levels reveals that the performance gap between methods becomes more pronounced as motion intensity increases. Although the IMU-assisted adaptive fusion method significantly improves positioning stability, occasional outliers may still occur under severe vibration conditions (Figure 19c). This phenomenon is mainly attributed to the amplification of high-frequency noise and the accumulation of integration drift in IMU measurements under rapid motion, as well as transient inconsistencies between visual and inertial observations. In contrast, the proposed EKF-enhanced method effectively mitigates these residual anomalies by incorporating temporal consistency constraints, resulting in superior robustness and accuracy under all tested conditions.