Development of a General-Purpose AI-Powered Robotic Platform for Strawberry Harvesting

Tufail, Muhammad; Iqbal, Jamshed; Ahmad, Rafiq

doi:10.3390/agriculture16070769

Open Access Article

by Muhammad Tufail ^1,2, Jamshed Iqbal ^3,* and Rafiq Ahmad ^2

^1 Lacombe Research and Development Centre, Agriculture and Agri-Food Canada, Lacombe, AB T4L 1W1, Canada
^2 Smart and Sustainable Manufacturing Systems Laboratory, Department of Mechanical Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada
^3 College of Engineering and Energy, Abdullah Al Salem University, Khaldiya 72303, Kuwait
^* Author to whom correspondence should be addressed.

Agriculture 2026, 16(7), 769; https://doi.org/10.3390/agriculture16070769

Submission received: 18 February 2026 / Revised: 20 March 2026 / Accepted: 30 March 2026 / Published: 31 March 2026

(This article belongs to the Special Issue Advances in Robotic Systems for Precision Orchard Operations)

Abstract

The integration of emerging technologies such as robotics and artificial intelligence (AI) has the potential to transform agricultural harvesting by improving efficiency, reducing waste, lowering labor dependency, and enhancing produce quality. This paper presents the development of an intelligent robotic berry harvesting system that combines deep learning–based perception with autonomous robotic manipulation for real-time strawberry harvesting. A computer vision pipeline based on the YOLOv11 segmentation model was developed and integrated into a Smart Mobile Manipulator (SMM) equipped with autonomous navigation, a 6-degree-of-freedom (6-DoF) xArm 6 robotic arm, and ROS middleware to enable real-time operation. Using a publicly available strawberry dataset comprising 2,800 images collected under ridge-planted cultivation conditions, the proposed YOLOv11-small segmentation model achieved 84.41% mAP@0.5, outperforming YOLOv11 object detection, Faster R-CNN, and RT-DETR in segmentation quality while maintaining real-time performance at 10 FPS on an NVIDIA Jetson Orin Nano edge GPU. A PCA-based fruit orientation and geometric analysis method achieved 86.5% localization accuracy on 200 test images. Controlled indoor harvesting experiments using synthetic strawberries demonstrated an overall harvesting success rate of 72% across 50 trials. The proposed system provides a general-purpose platform for berry harvesting in controlled environments, offering a scalable and efficient solution for autonomous harvesting.

Keywords:

robotic berry harvesting; smart mobile manipulator; ROS; precision agriculture; YOLO; Faster R-CNN; DETR

1. Introduction

According to the United Nations report “World Population Prospects 2024”, the world’s population is projected to grow from 8.2 billion to 9.7 billion by 2050. The arable land available for growing crops is also expected to shrink as rapid urbanization occupies more land. This rise in population of nearly 18%, combined with the loss of arable land, points to an impending upsurge in global demand for food and, consequently, to a growing risk of food insecurity.

Several countries and cities are facing challenges in the food and agriculture sectors. California, the largest US producer of fresh fruits, vegetables, and nuts, is grappling with on-farm labor shortages and with climate-change effects such as drought and water scarcity. As a result, berry production has moved north. According to Agriculture and Agri-Food Canada [1], cranberries were the second-most significant fruit crop in Canada in 2022, accounting for 209,205 metric tons, or 21.9%, of the total fruit harvest. The production volume of blueberries (lowbush and highbush combined) was recorded as 180,116 metric tons (19.9%) and that of strawberries as 25,072 metric tons (2.6%). Unfortunately, Canada is not immune to the effects of a changing climate. Meeting the demands of the processing sector, exports, and local markets calls for a switch to large-scale farm mechanization and greenhouse agriculture.

Many cultures around the world have ritualized the gathering and processing of berries. Due to the delicate nature of this practice, proper picking etiquette must be observed. Picking berries that are either under- or over-ripe causes major issues and financial losses. Size, firmness (soft vs. dry), shape, and color (e.g., deep purple in the case of huckleberries) are typical indicators of ripeness, which only skilled labor can assess. Pickers spend considerable time moving around a patch to find berries. Good picking requires the capacity to pick berries continuously for a lengthy period (between 20 min and an hour) [2]. In addition to taking up time, this could reduce harvest quality and farm productivity. Furthermore, picking practices are frequently inconsistent. There is substantial variation in productivity across different field sections and cultivation patches from one season to another. Harvesting speed is a major issue, as is the labor shortage before or after harvest in developed nations like Canada, especially in the wake of the COVID-19 outbreak. Crop protection, irrigation, weeding, pruning, and thinning are additional procedures that require labor and are carried out all year long, not only during the harvesting season. The cost of labor in the strawberry industry alone is currently around USD 1 billion per year [3].

Traditional and manual methods of harvesting, however, cannot scale up to satisfy the needs of the rising world population. Robots can effectively automate all or most of the processes in diverse applications [4,5], which may be sufficient to compensate for the lack of labor in this sector and to remove the possibility of human error in these highly repetitive and precise tasks.

The solution to boost agricultural productivity and generate future food security lies in the application of agri-tech and agri-robotics, providing more efficient and reliable processes through Industry 5.0 and automation. Artificial Intelligence (AI) and robotic systems for agri-tech and agri-robotics have started gaining widespread popularity globally by automating and augmenting various farming tasks [6,7]. Robotic harvesting of soft fruits and vegetables is now achievable owing to advancements in deep learning-based computer vision algorithms, availability of the necessary computing power, and the falling costs of robotic/mechatronic systems. The objective is to lower costs, compensate for the lack of manpower, and speed up production. Due to their autonomous/semi-autonomous operation and dexterous manipulation capabilities supported by Industry 5.0 technologies, leading global agribusinesses have begun expressing interest in developing autonomous farming robots and their on-the-farm deployment to improve harvest quality, speed, and accuracy. It has been shown in [3] that robots can harvest a 25-acre field in three days, which would otherwise require 30 farm workers.

Automated crop harvesting is currently in limited use worldwide, in terms of both the methods available and the levels of automation achieved. The ongoing transition in agriculture toward automation has created new opportunities. According to the report published by Growers and Berger [7], the current focus in agricultural automation is on the application of autonomous robots for weeding and for vegetable and fruit harvesting. The ongoing attempts are in the proof-of-concept or early semi-commercial prototyping stages [7]. The report also cites results from a survey conducted in 2022 by Western Growers, the US Department of Agriculture, and Roland Berger, according to which the 2021 average harvesting cost of strawberries was USD 44,000 per acre. This is significantly more than for apples, blueberries, lettuce, and broccoli, by at least ten times, three times, and nine times, respectively. The harvesting cost includes labor, equipment, fuel, and other overhead expenses. Across pre-harvest, harvest, and harvest-assist activities, strawberry fields see the least harvest automation. This may be partially explained by the fact that, unlike apples and other fruits, strawberries are only harvested when they are fully ripe. Their flavor and quality do not improve after picking. While this makes harvest automation challenging, it also presents an opportunity for AI and robotics, which can offer the following improvements over human labor:

Speed: Robots can be programmed and replicated easily to pick the fruit as soon as it is ripe.
Accuracy: Over the last decade, robots have surpassed humans in terms of precision and accuracy, thanks to the latest developments in hardware and force/motion control. Meanwhile, AI and computer vision have reached human-level accuracy in detecting objects and fruits. Strawberries are cut at the stem (ideally by twisting them off) to avoid damage to both the plant and the fruit. This has become possible with robots equipped with AI and customized grippers. The human success rate in gathering the strawberries on a plant without causing any damage is 80%. Pickers usually spend 10 s per plant finding the ripe strawberries among the leaves, cutting them, and placing them into a plastic clamshell. Many research groups across the globe are targeting these numbers with accurately programmed and suitably equipped robots. The preliminary results are encouraging; see, e.g., the work of researchers at the National Agriculture and Food Research Organization (NARO), Japan [8], where a successful harvesting rate of 41.3% and an execution time (to successfully harvest a single fruit) of 11.5 s have been achieved.
Enhanced presence in the field: Robots can be present in the field 24/7. This is helpful during peak times of the season when berries must be harvested every two or three days.
Assist with other operations: Robots can be reprogrammed and equipped with a range of tools to assist in a variety of operations in addition to harvesting (which takes 23% of working hours), e.g., sorting/packaging (taking 27% of working hours) and cultivation management (which takes 28%).

Berries are typically grown inside greenhouses where elevated cultivation is used. Elevated cultivation offers several advantages over conventional farming, including efficient soil and space management, water and ventilation conservation, and robot-friendly field management and fruit picking. Furthermore, strawberries grown in elevated systems are less vulnerable to attacks from ground insects. According to experiments conducted by Traptic (Watsonville, CA, USA), robotic strawberry pickers can gather 100,000 strawberries every day. In these experiments, the robot is mounted atop an automated vehicle and has eight robotic arms. Additionally, strawberry flower pollination and leaf thinning or trimming have been planned as tasks that can be easily performed by the robot. The robotic harvester is outfitted with vision sensors and algorithms that can safely detect and locate ripe berries.

The problem statement of this study is that strawberry harvesting remains difficult to automate reliably due to challenges such as partial occlusion, ridge versus table-top cultivation, non-uniform lighting, and varying ripeness conditions. All of these factors make the machine vision problem highly challenging. Although recent progress in deep learning and agricultural robotics has improved fruit detection capability, practical deployment still faces integration challenges under greenhouse or field conditions. This work addresses these limitations to provide a foundation for future innovations.

Strawberries are cultivated using various methods depending on the environment and desired outcomes. Common cultivation techniques include table-top, bench-type, elevated-substrate, and ridge-planting systems. While table-top cultivation is commonly adopted in controlled environments, ridge planting remains widely used in open-field production. Octinion, a Belgium-based agricultural research and development company, has developed an autonomous picking robot for greenhouse strawberry harvesting under table-top cultivation. The algorithm and procedures developed in the present work can also be applied to ridge-planting systems as well as elevated cultivation methods such as bench-type and elevated-substrate with minimal modification and customization.

The key challenge in autonomous robotic berry harvesting is to identify the fruits and obstacles surrounding the robot using perception models such as convolutional neural networks. The proposed approach uses the You Only Look Once (YOLO) technique for computer vision-based real-time object detection. This is one of the most widely used model architectures and object detection algorithms, combining high accuracy with fast overall processing speed.

Object detection is a widely used computer vision approach for identifying, locating, and labeling individual items in an image or video. An important decision in designing an object recognition system using machine learning techniques is whether to use traditional machine learning algorithms (e.g., Support Vector Machines, Naive Bayes, k-means clustering, or decision trees) or modern deep learning algorithms. Classic hand-crafted key point detector and descriptor methods (e.g., a combination of Good Features to Track (GFTT) [8] with BRISK [9]) are reported to provide convincing results in object recognition and structure from motion, and are still considered an alternative to modern deep learning-based methods in many applications (see Figure 1). These algorithms extract key points or features (e.g., GFTT detects corners) in one image and match them across images using user-chosen similarity measures such as the L2-norm or Hamming distance. They are designed to provide both geometric and photometric invariance. However, the traditional approach entails a feature engineering step that does not generate adequate outcomes (in terms of either accuracy or speed) for the current application. We previously reported results on combining different features and algorithms to obtain the best performance in [10]. None of these algorithms performed well in real time when detecting multiple strawberries in a cluttered environment. More importantly, they are mostly binary classifiers and cannot be used for maturity detection of strawberries.
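To make the descriptor-matching step concrete, the following minimal sketch performs brute-force matching of binary descriptors (the kind produced by BRISK) using the Hamming distance. It is an illustration only: the function names, the toy 2-byte descriptors, and the distance cut-off are assumptions and are not taken from the implementation reported in [10].

```python
from typing import List, Optional, Tuple


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(
    query: List[bytes], train: List[bytes], max_distance: int
) -> List[Tuple[int, int, int]]:
    """Brute-force nearest-neighbour matching with a distance cut-off.

    Returns (query_index, train_index, distance) for every query descriptor
    whose closest train descriptor is within max_distance bits.
    """
    matches = []
    for qi, q in enumerate(query):
        best: Optional[Tuple[int, int]] = None  # (train_index, distance)
        for ti, t in enumerate(train):
            d = hamming_distance(q, t)
            if best is None or d < best[1]:
                best = (ti, d)
        if best is not None and best[1] <= max_distance:
            matches.append((qi, best[0], best[1]))
    return matches


if __name__ == "__main__":
    # Toy 16-bit descriptors (hypothetical values) from two images.
    image_a = [bytes([0b10110010, 0b00001111]), bytes([0b11111111, 0b00000000])]
    image_b = [bytes([0b11111110, 0b00000000]), bytes([0b10110011, 0b00001111])]
    print(match_descriptors(image_a, image_b, max_distance=4))
```

The nested loop is quadratic in the number of key points, which is one reason such pipelines can struggle to run in real time on cluttered scenes.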

One of the main challenges encountered with traditional machine learning approaches is locating multiple objects in the same image. As an alternative, Convolutional Neural Network (CNN) algorithms such as You Only Look Once (YOLO) have been found to be robust due to their high accuracy and higher frame rate in frames per second (FPS). Since they are single-step object detection models, these advantages make them well suited to real-time deployment [11].

YOLO, unlike most of the previously developed object detection algorithms, addresses object detection as a single regression problem, circumventing the region proposal, classification, and duplicate elimination pipeline [12]. In YOLO algorithms, images are resized to a reduced resolution and a single CNN is run on them, yielding detections that are filtered by a threshold on the model’s confidence. The first version of YOLO was trained by minimizing a sum-of-squared-errors loss. This design improves detection speed but reduces accuracy compared to current object detection models [12]. Online data augmentation is used in YOLO to improve model robustness in object detection in various contexts by increasing the variability of the input data. YOLO models have been used in a variety of applications where quick detection is required, including pedestrian detection [12], license plate detection [13], and fruit detection [14,15,16,17]. Since the release of YOLO v7 in 2022 [18], multiple versions of YOLO have been released, demonstrating continuous improvement in the algorithm. Furthermore, each primary version was offered as a full model as well as miniature variants, which had fewer layers and were faster than the full version. YOLOv11 [19] offers several architectural improvements. The architecture incorporates advanced modules such as Cross-Stage Partial blocks (e.g., C3k2), Spatial Pyramid Pooling–Fast (SPPF), and attention-enhanced feature fusion mechanisms, which improve multi-scale feature extraction and detection accuracy, particularly for small and partially occluded objects. It offers a favorable trade-off between efficiency and accuracy, can run on mid-range GPUs, and is easy to integrate and deploy, which makes it well suited to the current real-time segmentation task.
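The confidence-threshold step described above can be sketched as follows. The example also includes a greedy non-maximum suppression pass based on intersection over union (IoU), which YOLO-style detectors commonly apply after thresholding; its inclusion here is an assumption for illustration, as are the threshold values and the box format, and the code is not the pipeline used in this work.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    detections: List[Detection],
    conf_threshold: float = 0.5,  # assumed value
    iou_threshold: float = 0.5,   # assumed value
) -> List[Detection]:
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept: List[Detection] = []
    for box, conf in candidates:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, conf))
    return kept
```

Raising `conf_threshold` trades recall for precision, which matters in harvesting because a false positive triggers a wasted picking attempt.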

Detecting the peduncle rather than the strawberry itself is crucial for robotic harvesting, as it ensures a precise cutting point, preserving fruit quality and minimizing damage to the plant. By targeting the peduncle, robotic systems can improve harvesting efficiency, reduce the risk of bruising or mishandling, and align with natural practices, where the fruit remains intact for better market value. One of the notable works is mentioned in [20] by Parsa et al., which visually detected the key points and picking points for selective (ripe) strawberries independently. Picking points are found on the peduncles, which often have an uneven pattern and are asymmetrical. The authors have added three RGB cameras to overcome limitations associated with a single depth camera such as not covering depth distances less than 15 cm. Gaussian process regression was used to estimate picking point errors. This has resulted in 95% accuracy of detection and isolation of strawberries in a cluttered environment. Overall, the harvesting operation was 83% accurate. This relatively higher accuracy in real-field conditions can be partly attributed to peduncle gripping that restricts the gripping force to 10 N (15 N when a blade with a wedge angle of 16.6 degrees at a 30-degree orientation is used) for an average 50-g strawberry [21]. It is only possible to cut peduncles adaptively while maintaining the same holding resultant force when the picking motion is decoupled from the cutting one and when the position and orientation of the peduncle are known. Quaglia et al. in [22] have used this method to harvest grapevines by mounting an underactuated tool resembling a pair of 3D-printed scissors on a commercial gripper. The peduncle’s pose was assumed to be known.

Recently, object detection has evolved into open-vocabulary (OV) object detection using vision-language models (VLMs), such as OWL-ViT [23], EfficientViT-SAM [24], YOLO-World [25], and Grounding DINO [26]. These models leverage large-scale, unlabeled, or noisy image-text datasets to enable zero- or few-shot inference for recognizing objects beyond predefined categories. VLMs, employing a prompt-then-detect paradigm, are pivotal for advancing open-world intelligent robots with capabilities for generalized object and scene understanding. They represent a promising future for generalized AI, open-world robotics, and autonomous systems. However, their current limitations, including challenges with performance, reliability, and customizability; high computational and power requirements; lack of precision and accuracy; limited optimization for integration into strict real-time control loops; interoperability issues; ethical and safety concerns; and deployment complexities, hinder their practical adoption in robotic systems. Table 1 summarizes the key related robotic harvesting studies and their limitations, and how the proposed framework addresses them.

In the present work, the application of agricultural robotics and artificial intelligence is explored to: (i) perform AI-based berry detection and classification for harvesting, (ii) develop an autonomous or semi-autonomous harvesting robot, and (iii) support deployment in greenhouse or real agricultural field conditions. The main goal is to lay the groundwork for a comprehensive framework for developing a fruit-picking robot while highlighting the associated technical challenges. The novelty of this work lies in bridging perception and deployable robotic harvesting under realistic strawberry cultivation variability. Unlike prior studies that focus either on fruit detection or isolated manipulator trials, the present work demonstrates a deployable perception-to-harvesting pipeline integrating segmentation-based target selection, geometric grasp estimation, ROS coordination, and edge-GPU deployment. The key contributions of this research are summarized below:

Development of a deep learning-based vision model for strawberry detection and classification.
Performance comparison of selected algorithms from the literature using both real and synthetic datasets.
Integration of the learning-based vision model into a Robot Operating System framework deployed on GPU hardware to enable real-time autonomous strawberry harvesting using a Smart Mobile Manipulator (SMM).

The remainder of the paper is organized into four sections. Section 2 presents materials and methods. In particular, it describes the development of the mobile robotic manipulator platform used in this research, including both the hardware and software architecture. Section 3 presents the experimental results, including the developed perception and robotic harvesting framework. Section 4 discusses the limitations and challenges in robotic berry harvesting. Finally, Section 5 concludes the paper and offers insights into directions for future research.

2. Materials and Methods

SMMs are increasingly deployed in a variety of applications such as warehouses, greenhouses, agriculture, factory floors, transportation, and the service industry. Designing an SMM involves more than putting together a robotic manipulator, a mobile robot base, and computer vision-based navigation and detection sensors. A key design criterion is hardware abstraction: the developed software architecture must not depend on equipment from a specific vendor. It should be reconfigurable and adaptable to varying environmental conditions. From both the hardware and software points of view, the architecture should be plug-and-play, so that it can be used by technicians and not only by robotics experts. The Human–Machine Interface (HMI) plays a key role in the choice of middleware. ROS has been the first choice of developers in this regard because it provides tools and frameworks to achieve hardware abstraction at the lower level and integrate the HMI at the top. In addition, it supports high-level supervisory control and low-level actuator control through its node-based architecture and messaging system. It can handle subsystems with varying sensing and control loop frequencies, making it suitable for systems with diverse sensors and actuators. The real-time autonomous strawberry harvesting framework presented in this paper is based on an SMM initially developed in 2023 and subsequently upgraded in 2026.

Three important tasks enable autonomous or semi-autonomous navigation of a mobile robot: mapping, localization, and path planning. The mobile robot platform uses sensory data, such as wheel encoders, GPS, and an Inertial Measurement Unit (IMU), to navigate to the fruit-picking point autonomously or semi-autonomously. The first step in navigation is creating a map of the environment. A mobile robot represents its operational environment with the help of a map. Map representations include metric (e.g., grid-based and feature-based), topological, and hybrid. To follow the desired path, the robot must estimate its current pose from the measurements available up to the current time instant. This is called localization. Given a traversable map of the environment and a localized robot, a collision-free path (if one exists) needs to be determined from a home location to the desired goal location. This is called path planning or trajectory planning. The latter takes into consideration both speed and path information.
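As an illustration of path planning on a grid-based metric map, the sketch below runs a breadth-first search on a small occupancy grid to find a shortest collision-free path between two cells. This is a simplified stand-in, not the planner used on the robot; the 4-connected motion model and the convention that 0 is free and 1 is occupied are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)


def plan_path(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Breadth-first search on a 4-connected occupancy grid.

    grid[r][c] == 0 means free, 1 means occupied (assumed convention).
    Returns the list of cells from start to goal, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])

    def is_free(cell: Cell) -> bool:
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0

    if not (is_free(start) and is_free(goal)):
        return None

    parent: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path: List[Cell] = []
            node: Optional[Cell] = current
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = current
        for neighbour in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if is_free(neighbour) and neighbour not in parent:
                parent[neighbour] = current
                queue.append(neighbour)
    return None  # goal unreachable
```

Breadth-first search returns a path with the fewest cells but ignores speed and robot kinematics, which is the gap that trajectory planning fills.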

In ideal scenarios, navigation assumes that the environment is static, that the map truly and completely reflects the environment, and that there are no errors in sensor measurements. In practice, these assumptions rarely hold: the environment may be dynamic, and the sensor readings contain a considerable amount of noise. To solve this problem, a redundant set of heterogeneous sensors is proposed, which, along with sophisticated sensor fusion and probabilistic algorithms for navigation, can give satisfactory results in an outdoor environment. In addition, an obstacle detection and avoidance module runs continuously in parallel. Real scenarios also require Simultaneous Localization and Mapping (SLAM), which uses onboard sensors (odometry, IMU, GPS, and/or laser scanners) to estimate the vehicle’s position while building a map of the environment from the same sensor data.
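The idea behind probabilistic sensor fusion can be illustrated with a scalar Kalman filter that combines a relative odometry displacement with an absolute, GPS-like position fix along one axis. The text does not specify which fusion algorithm is used, so this is a minimal sketch under assumed noise variances, not the navigation stack's estimator.

```python
from typing import List, Optional, Sequence, Tuple


def predict(x: float, p: float, u: float, q: float) -> Tuple[float, float]:
    """Propagate the estimate with an odometry displacement u.

    x, p: position estimate and its variance; q: odometry noise variance.
    """
    return x + u, p + q


def update(x: float, p: float, z: float, r: float) -> Tuple[float, float]:
    """Correct the estimate with an absolute position measurement z.

    r: measurement noise variance. The gain k weights the measurement
    more heavily when the prediction is uncertain.
    """
    k = p / (p + r)
    return x + k * (z - x), (1.0 - k) * p


def fuse(
    odometry_steps: Sequence[float],
    position_fixes: Sequence[Optional[float]],
    x0: float = 0.0,
    p0: float = 1.0,
    q: float = 0.05,  # assumed odometry noise variance
    r: float = 0.5,   # assumed GPS noise variance
) -> List[Tuple[float, float]]:
    """Run predict/update over paired inputs; None means no fix this step."""
    x, p = x0, p0
    history = []
    for u, z in zip(odometry_steps, position_fixes):
        x, p = predict(x, p, u, q)
        if z is not None:
            x, p = update(x, p, z, r)
        history.append((x, p))
    return history
```

Each prediction increases the variance and each fix reduces it, so the estimate degrades gracefully when one sensor drops out instead of failing outright.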

Most industrial mobile robots use the open-source implementation of autonomous navigation in ROS, called the ROS Navigation Stack [30]. This stack is a set of software tools and libraries for autonomous navigation of mobile robots, especially for holonomic and differential drive robots such as the one proposed in the present research work. The navigation stack in ROS makes use of different sensors, such as the robot odometry from wheel encoders, IMU, GPS, laser scanners, and depth cameras.

2.1. Hardware Architecture

Any effort to design a berry-picking robot must consider the following key requirements: (i) spotting the ripe berry in a cluttered area; (ii) approaching and grasping the berry without damaging the fruit; and (iii) removing the fruit carefully without harming the plant. The last two requirements can be addressed by using a mobile manipulator, which is required to navigate and operate safely and autonomously in direct proximity to, and even in contact with, the fruit plants/trees and other objects.

The primary elements of the developed robotic system for strawberry harvesting in this research consist of the SMART robot mobile platform, the manipulator arm installed on the base, and sensors for navigation, obstacle avoidance, and strawberry detection as well as the electronics, control, and power systems for the robots. For use in a berry harvesting application, these parts/devices must be appropriately integrated into a single system.

At the core of the hardware design is a 6-wheeled differential-drive mobile robotic platform, SMART (manufactured by EAI, Shenzhen, Guangdong, China), as shown in Figure 2. The platform is equipped with sensors such as a laser range finder, an array of ultrasonic distance sensors, a wheel odometer, and an IMU for localization and obstacle avoidance. With dual LiDAR units, the robot achieves centimeter-level positioning accuracy in indoor environments. Table 2 summarizes key specifications of the SMART mobile robot.

The application (fruit picking) requires the robot system to have manipulation capabilities. A 6-DoF manipulator, xArm 6 manufactured by UFACTORY (Shenzhen, Guangdong, China), illustrated in Figure 3, has been used for this purpose. A few of the desired specifications of the arm include light weight, adequate payload (around one kilogram), good dexterity, simple kinematics structure and long reach, option to install a customized gripper and other tools for fruit picking and collection, availability of mechanical and electrical interface, ease in integration with the mobile base, open-source and customizable software (and ROS interface), and reasonable cost.

The manipulator is visualized in RViz in Figure 3, showing the frames of reference on various joints. The Denavit–Hartenberg (DH) parameters of the manipulator are given in Table 3. The arm was simulated in ROS’s Gazebo simulator.
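To show how DH parameters are used, the sketch below chains per-link homogeneous transforms to obtain the end-effector pose from joint angles. It assumes the standard (distal) DH convention; Table 3 may follow the modified convention, in which case the link transform differs. The parameter rows in the usage note are those of a hypothetical two-link planar arm, not the xArm 6 values.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
DHRow = Tuple[float, float, float, float]  # (theta_offset, d, a, alpha)


def dh_transform(theta: float, d: float, a: float, alpha: float) -> Matrix:
    """Link transform in the standard DH convention:
    Rot_z(theta) * Trans_z(d) * Trans_x(a) * Rot_x(alpha)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """4x4 matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def forward_kinematics(joint_angles: Sequence[float], dh_rows: Sequence[DHRow]) -> Matrix:
    """Pose of the last frame in the base frame for revolute joints."""
    if len(joint_angles) != len(dh_rows):
        raise ValueError("one joint angle is required per DH row")
    pose: Matrix = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for q, (offset, d, a, alpha) in zip(joint_angles, dh_rows):
        pose = matmul(pose, dh_transform(q + offset, d, a, alpha))
    return pose
```

For example, with two unit-length planar links, `forward_kinematics([math.pi / 2, 0.0], [(0, 0, 1, 0), (0, 0, 1, 0)])` places the end-effector two units along the base y-axis; the translation is read from the last column of the returned matrix.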

The SMM used in this research work comprises an xArm 6-DoF robot manipulator mounted on the SMART mobile platform as shown in Figure 4. Figure 5 illustrates the SMM’s modular hardware design, emphasizing the interfaces between the onboard processors, controllers, sensors, and actuators. With the help of the onboard LiDAR, ultrasonic sensor array, and depth camera, the surrounding environment of the robot, including the berries, is mapped. The sensory data becomes input to the intelligent algorithms, such as object identification and obstacle avoidance, robot localization and navigation, signal processing, data fusion, feature extraction, berry maturity detection and classification, decision-making, and so on.

Table 4 summarizes the hardware modules along with their descriptions that characterize the SMM used in this study.

2.2. Software Architecture

The aforementioned modules are executed to provide the functionality needed. The robot contains middleware to manage all the hardware and software resources and to provide the required services to all other software components. A few basic requirements for such an operating system include simplicity, low overhead, scalability, a mechanism for hardware abstraction, and tool support for software development. ROS, being the industry standard software framework, has been selected as middleware in this work owing to the following advantages: (i) Supports a variety of programming languages, including Python and C++. (ii) Integration with simulation and visualization tools like Gazebo, RViz, etc. (iii) Free software for planning, perception, control, and navigation. (iv) Hardware drivers for widely used mobile platforms, sensors, and actuators. The system is implemented using ROS Melodic (ROS 1) on Ubuntu 18.04.

In the present work, a layered framework has been proposed, as shown in Figure 6. The top layer provides tools and utilities for interaction between the human operator and the robot. The second layer contains a ‘Task Planner’ where each operation of the robot, as commanded by the operator or decided by the robot itself (autonomously), is divided into a set of predefined tasks, e.g., fruit picking, parking, recharging, navigation, etc. More than one task may run in parallel. The task planner receives task requests from the user application and plans coordinated action sequences accordingly. Each task activates a combination of underlying ROS nodes. The task planner is followed by the ‘Task Execution and Coordination’ layer, which contains the ROS middleware and implements different application-specific functions and responsibilities. To properly carry out the fruit picking task, the robot must first detect the object (fruit) and compute its position and orientation relative to the base coordinate system of the robot. Once the base is positioned, the arm must be aware of the location, dimensions, and geometry of the fruit as well as its surroundings. This means that perception of the environment is needed. Vision is also important for the mobile robot to position itself relative to the fruit trees before starting picking. This has been achieved through detection of permanent visual markers around each fruit plant. The important packages used in the present work include PyTorch, Python, and OpenCV for deep learning-based visual object detection; MoveIt 1 for task and trajectory planning of the robotic arm; and RViz (v1.13) and Gazebo (v9) for visualization and simulation of the entire system. The initial system was based on YOLOv7 and was implemented using ROS Melodic (ROS 1) on Ubuntu 18.04, with Python 3.7–3.8, PyTorch 1.x, and OpenCV 4.x. The updated system, incorporating YOLOv11, was developed within the same ROS Melodic framework, using Python ≥ 3.8, PyTorch 2.5.1, and OpenCV 4.x, while maintaining compatibility with the existing ROS-based architecture.
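Expressing a detected fruit position in the robot base frame amounts to applying a homogeneous transform from the camera frame to the base frame. The sketch below shows this for a single 3D point. The camera pose used in the test of this idea (a pure yaw rotation plus a translation) is a hypothetical placeholder; in practice the transform would come from a hand-eye calibration and would be a full 3D rotation.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float, float]


def make_transform(yaw: float, translation: Sequence[float]) -> Matrix:
    """Camera pose in the base frame, simplified to a rotation about z
    (yaw, in radians) followed by a translation (metres)."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(base_from_camera: Matrix, point_in_camera: Point) -> Point:
    """Map a point from camera coordinates to base coordinates."""
    x, y, z = point_in_camera
    rows = base_from_camera[:3]
    bx, by, bz = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in rows)
    return (bx, by, bz)


if __name__ == "__main__":
    # Hypothetical camera pose: rotated 90 degrees about z, offset from base.
    base_from_camera = make_transform(math.pi / 2, (1.0, 0.0, 0.5))
    fruit_in_camera = (1.0, 0.0, 0.0)
    print(transform_point(base_from_camera, fruit_in_camera))
```

Keeping the transform as a single matrix makes it straightforward to compose with the arm's forward kinematics when the camera is mounted on the end-effector rather than on the base.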

In addition to the core packages, several open-source application libraries have been used, including aruco-ros, easy-handeye, find-object-2d, realsense-ros, and vision-visp.

Figure 7 shows the software nodes developed in this work and the way they communicate with each other. The oval-shaped nodes communicate with each other by publishing and subscribing to messages on topics. The rectangular boxes indicate high-level packages. Because ROS lets nodes exchange data in a standard format over a distributed physical network, the application software can run across different computational devices such as GPUs or laptops, HMI screens, and robot control devices.

The low-level libraries communicate with the microcontroller firmware using a simple packet-based protocol via an RS-232 serial connection between the robot and an onboard computer. These libraries also provide interfaces to many accessories, including the robotic arm, the laser rangefinders, the robot’s built-in sonar and bumper sensors, pan/tilt cameras and pan/tilt units, GPS receivers, and more. The bottom layers in Figure 7 include the device drivers for different sensors and actuators for both the mobile platform and the arm. In the event of a node failure or system restart, the ROS architecture reinitializes all nodes, after which the manipulator returns to a safe home position before task execution resumes.

2.3. Programming Mobile Robot and xArm Manipulator

Before realizing the proposed deep learning-based fruit detection algorithm (discussed in Section 3.1) on the SMM, the mobile robotic platform and the robotic arm were programmed and individually tested for navigation and pick-and-place scenarios, respectively.

In the navigation scenario, the first time the navigation software is run, the robot is moved manually throughout the field to generate a 2D occupancy grid map. All subsequent executions of navigation use this map, which is permanently saved as a 2D image along with a configuration file that gives meta information about the map, such as origin coordinates and resolution. The localization module then tracks the pose of the robot against the known map using a particle filter. The same 2D map helps the robot return to its charging station when the battery falls below a certain threshold. The navigation module was developed and tested separately to enable future autonomous row-to-row movement, map-based localization, and return-to-charge functionality.
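As a minimal illustration of how the saved map metadata can be used, the following sketch converts a world-frame position into occupancy-grid cell indices from the map origin and resolution. The function name, the example origin, and the 0.25 m resolution are hypothetical values chosen for illustration and are not the parameters of the map used in this work.

```python
import math


def world_to_cell(x, y, origin_xy, resolution):
    """Convert a world-frame point (m) to (col, row) grid indices.

    origin_xy is the world position of the map's lower-left cell corner and
    resolution is the cell edge length in metres (both hypothetical here).
    Indices are counted from the lower-left corner of the grid.
    """
    col = math.floor((x - origin_xy[0]) / resolution)
    row = math.floor((y - origin_xy[1]) / resolution)
    return col, row


# Hypothetical map metadata: origin at (-5 m, -5 m), 0.25 m per cell.
cell = world_to_cell(1.1, 2.3, origin_xy=(-5.0, -5.0), resolution=0.25)
print(cell)
```

The inverse mapping (cell centre back to world coordinates) follows by multiplying the index plus one half by the resolution and adding the origin.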

3. Results

3.1. Deep Learning-Based Strawberry Detection and Segmentation

Detection and segmentation of strawberries come with several challenges. A major challenge for the vision system is to detect strawberries from clusters or deal with occluded (hidden) fruit. Once strawberries are recognized using a deep learning approach, the next challenge is to estimate the grasp pose for robotic harvesting and, finally, to implement the developed algorithm in real-time on a robotic manipulator to reach and grasp the detected fruit. Moreover, picking the fruit safely without damaging the plant is another challenge. Therefore, the overall process entails both implementational and design challenges that can only be met by a group of multidisciplinary researchers and experts [31].

In the present work, the YOLOv11s-seg model (here “11” represents the version of the YOLO architecture, “s” denotes the small model variant, and “seg” indicates that the model is designed for instance segmentation) is used to segment and recognize strawberries. A public dataset, called Strawberry Digital Images (StrawDI) [32], with sample images shown in Figure 8, has been used to train and validate the performance of the YOLOv11s-seg model as well as other widely used architectures. The dataset offers sufficient visual diversity, including occlusion, varying lighting, and ripeness levels across 2800 images. Instead of assigning classes based on ripeness levels, all strawberries were learned as one class. Ripeness evaluation was performed using post-detection color features such as RGB color. The augmentation techniques employed include horizontal and vertical flipping, brightness adjustment, multi-angle rotation, Gaussian noise addition, and CutMix mosaicing.

The default Ultralytics model was pre-trained on the COCO dataset, which has 80 classes but does not include strawberries. Transfer learning is useful whenever a new object (outside of the COCO 80 classes) must be detected, because it leverages a model already trained to detect general visual features such as edges, texture, color, and shape. Although strawberry is not present in these 80 classes, similar objects such as apple, orange, and banana share visual cues such as roundness, curvature, and surface texture. The backbone of the deep learning model automatically extracts low- and mid-level features and provides pretrained weights, which can be fine-tuned and updated in the neck and head layers of the architecture to enhance the detection of new objects. For strawberry detection and segmentation, the weights are initialized with the existing pre-trained weights.

Figure 9a highlights the training and validation curves for box loss and segmentation loss. Both losses exhibit a steady reduction and achieve a stable behavior towards the end, with the box loss (the agreement between the predicted bounding box and the strawberry position) reaching about 0.46 and the segmentation loss (how accurately the strawberry is delineated at the pixel level) reaching 0.64. The corresponding validation losses are approximately 0.52 for box loss and 0.77 for segmentation loss. No clear evidence of overfitting was observed within the reported training epochs.

The following performance metrics are shown in Figure 9b.

Precision: True Positive (TP)/(TP + False Positive (FP)), representing the proportion of predicted detections that are correct out of all detections produced by the model.
Recall: True Positive (TP)/(TP + False Negative (FN)), representing the proportion of correctly detected objects relative to all objects present in the ground truth.
Average Precision: calculated as the area under the precision–recall curve.
Intersection over Union (IoU): area of overlap divided by area of union, measuring the overlap between the predicted bounding box and the ground truth.
mAP@0.5: the mean average precision across all classes in the model at an IoU threshold of 0.50.
mAP@0.5:0.95 (AP): the mean average precision averaged over IoU thresholds from 0.50 to 0.95, providing a stricter evaluation criterion than mAP@0.5.
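The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example counts are hypothetical, and the F1-score (the harmonic mean of precision and recall) is included because it is also used when comparing the architectures.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when the model made no predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed fruits.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)
f1 = f1_score(p, r)
# Two 2 x 2 boxes overlapping in a 1 x 1 region: IoU = 1 / 7.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(p, r, f1, iou)
```

A detection counts as a true positive at mAP@0.5 only if its IoU with a ground-truth object is at least 0.5, so the example pair above (IoU of about 0.14) would not match.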

The YOLOv11s-seg model attains a segmentation precision of approximately 79.8% and recall of approximately 84.6% in the final training stage. The segmentation mAP@0.5 reaches approximately 84.4%, while mAP@0.5:0.95 reaches approximately 66.8%. These results indicate stable segmentation performance and reliable strawberry detection over a range of intersection over union (IoU) thresholds.

For comparison, three object detection architectures, namely, YOLOv11 object detection, Faster R-CNN [33], and DEtection TRansformers (DETR) [34], were trained on the same dataset. The objective of the comparison was to evaluate practical detection frameworks for robotic strawberry harvesting under a unified dataset and deployment-oriented setting. Experiments were conducted on a workstation equipped with an NVIDIA RTX A5500 laptop GPU (16 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA), using CUDA 12.1 and PyTorch 2.5.1 for model training. The trained YOLOv11-small segmentation model (19.6 MB checkpoint size) was deployed on an NVIDIA Jetson Orin Nano (8 GB), using FP16-optimized TensorRT inference. An inference rate of approximately 10 FPS was achieved, which is considered sufficient for the intended real-time application.

Faster R-CNN has been widely used as a two-stage object detection approach for accurate object localization and classification in complex scenes with partially occluded objects. The architecture first generates candidate object regions through a Region Proposal Network (RPN) and then extracts features to classify each proposed region. Due to its high computational cost, slower inference speed, and increased memory demand, Faster R-CNN has limited suitability for real-time robotic applications.

In the present work, Faster R-CNN was implemented using the ResNet-50 backbone with a Feature Pyramid Network (FPN) for feature extraction and multi-scale object representation. For strawberry detection, the original classification head was replaced with a new predictor configured for two classes (background and strawberry). Only detections with a confidence score greater than 0.5 were retained.

Since 2020, DETR has emerged as an important approach for end-to-end object detection. Unlike conventional object detectors that involve non-maximum suppression, region proposal mechanisms, and anchor generation, DETR formulates detection as a set prediction problem. A transformer is used to capture global relationships within the image to directly predict a small fixed-size set of objects. The disadvantages of using DETR in applications such as the present one include slow convergence, high memory requirements, lower average precision for small objects, reduced real-time performance, and the need for large datasets.

Among several variants of DETR, RT-DETR [35] is specifically designed for real-time deployment and provides inference speed comparable to YOLO-based detectors. In the current work, RT-DETR was trained using Ultralytics’ pretrained “rtdetr-l.pt” configuration for 100 epochs, with an input image size of 640 × 640 pixels and a batch size of 8. The training process employed the AdamW optimizer with an initial learning rate of 0.002 and momentum of 0.9, while applying differentiated weight decay across parameter groups. To improve robustness and generalization, light data augmentation was applied, including blur, median blur, grayscale conversion, and CLAHE contrast enhancement, each with low probability. Under these settings, the model required approximately 12 h of total training time for 100 epochs on the employed GPU platform.

The experimental configuration is shown in Table 5, while the performance comparison of all the architectures tested is shown in Table 6.

As shown in Table 6, RT-DETR achieved the best overall detection performance, with the highest AP (0.7338) and mAP@0.5 (0.8447). The precision (0.7692) and recall (0.8674) also showed a balanced performance. In comparison, YOLOv11 box detection produced the second-best results, reaching AP = 0.7104 and mAP@0.5 = 0.8314. If robotic grasping had not been involved, this would have been the preferred framework for real-time strawberry detection. Faster R-CNN showed a similar precision (0.7669) but a lower recall (0.7611), which indicates generally accurate detections when fruits were clearly separated but reduced sensitivity when strawberries overlapped or appeared in clusters. The segmentation-based YOLOv11 model produced the highest recall (0.8681) together with the strongest F1-score (0.8155), while achieving mAP@0.5 = 0.8441. These statistics indicate that the model showed strong sensitivity to strawberry presence under partially overlapping field conditions. Despite the lower AP value (0.6766), the model was selected because of its suitability for robotic harvesting, where the instance-level masks provide geometric information necessary for peduncle estimation and grasp planning while maintaining inference time close to YOLO-based object detection.

RT-DETR, owing to its strong recall and high sensitivity, produces a larger number of detections (Figure 10a) because partially visible immature fruits and adjacent strawberry-like texture regions (e.g., flowers and leaves) activate object queries. For robotic picking, this yields too many false candidates that must be filtered out. YOLO-based detectors suppress many such weak candidates at an early stage (see Figure 10b,c). The segmentation model preserves the visible fruit boundaries more naturally, and it responds strongly to local fruit texture and shape. Some low-confidence detections are also present near highly occluded or immature fruits, which can be easily filtered during post-detection processing. Finally, the results of Faster R-CNN (Figure 10d) indicate 15 detections (against the 9 ground truths), suggesting that the model generated several additional high-confidence proposals. Most clearly visible strawberries were correctly detected, with confidence values as high as 0.95. During training, the model showed signs of overfitting and reached its best validation performance very early (at epoch 3, where the validation loss reached 0.2267), which explains the high confidence assigned to familiar local features at the cost of reduced generalization capability.

Once the strawberries are detected using the YOLOv11 segmentation model, the next step is to select the most suitable harvest target. The criterion is to select a fruit that is sufficiently red (determined based on the average values of the R and G color channels) and is large enough to be practically grasped. To determine the segmented strawberry’s dominant orientation, the Principal Component Analysis (PCA) technique was used. Once the main geometric axis is identified, the two extreme ends are calculated. The wider end is assumed to correspond to the leaf crown side, since strawberries are generally broader near the calyx and narrower toward the tip. From the wider end, a point slightly outside on the peduncle side is then defined as the robot approach point (shown in Figure 11). Using 200 manually inspected test images, the grasp-point estimation method correctly identified the crown-side peduncle region in 86.5% of cases.
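The orientation and approach-point steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors’ implementation: the mask is given as a list of (x, y) pixel coordinates, the principal axis is obtained from the closed-form eigenvector direction of the 2 × 2 covariance matrix, and the parameters `end_fraction` and `offset` as well as the synthetic mask are hypothetical.

```python
import math


def principal_axis(points):
    """Return the centroid and unit principal axis of 2D mask pixels (PCA)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    # Closed-form direction of the dominant eigenvector of the covariance.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))


def approach_point(points, end_fraction=0.25, offset=5.0):
    """Estimate the crown-side end and an approach point just beyond it.

    end_fraction (share of the fruit length inspected at each end) and
    offset (pixels) are hypothetical tuning values.
    """
    (cx, cy), (ax, ay) = principal_axis(points)
    # Coordinates along the axis (t) and perpendicular to it (w).
    proj = [((x - cx) * ax + (y - cy) * ay, -(x - cx) * ay + (y - cy) * ax)
            for x, y in points]
    t_min = min(t for t, _ in proj)
    t_max = max(t for t, _ in proj)
    band = end_fraction * (t_max - t_min)

    def width(selected):
        ws = [w for _, w in selected]
        return max(ws) - min(ws)

    low_width = width([q for q in proj if q[0] <= t_min + band])
    high_width = width([q for q in proj if q[0] >= t_max - band])
    # The wider end is taken as the crown (calyx) side.
    t_crown, sign = (t_min, -1.0) if low_width >= high_width else (t_max, 1.0)
    crown = (cx + t_crown * ax, cy + t_crown * ay)
    # Step outward along the axis, away from the fruit body.
    approach = (crown[0] + sign * offset * ax, crown[1] + sign * offset * ay)
    return crown, approach


# Synthetic mask: wide at y = 0 (crown side) and tapering to a tip at y = 40.
mask = [(x, y) for y in range(41)
        for x in range(-(8 - y // 5), (8 - y // 5) + 1)]
crown, approach = approach_point(mask)
print(crown, approach)
```

Because the sign of a PCA axis is arbitrary, the sketch decides the outward direction from the measured end widths rather than from the axis orientation itself.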

3.2. Robot Manipulation Control

To detect and approach strawberries for robotic harvesting, the 3D camera must be calibrated. The eye-on-hand calibration approach is used in the present work since the camera has been mounted on the end effector (see Figure 12).

The transformation matrix from the target to the base coordinate frame, T_{t}^{b} (see Figure 13), is given by Equation (1):

T_{t}^{b} = T_{e}^{b} T_{c}^{e} T_{t}^{c}

(1)

where T_{t}^{c} is the transformation matrix from the target to the camera coordinate frame, which is obtained from the depth camera (Intel RealSense, Intel Corporation, Santa Clara, CA, USA); T_{c}^{e} is the transformation matrix from the camera to the end-effector coordinate frame, which is obtained from camera calibration; and T_{e}^{b} is the transformation matrix from the end-effector to the base coordinate frame, which is obtained from the arm’s forward kinematic model.
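A minimal numerical sketch of the transformation chain in Equation (1) is given below, using 4 × 4 homogeneous matrices stored as nested lists. The rotation and translation values are hypothetical placeholders; in the real system they would come from the forward kinematics, the hand-eye calibration, and the depth camera, respectively.

```python
def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_transform(rotation, translation):
    """Assemble a 4x4 transform from a 3x3 rotation and a 3-vector."""
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical values (metres).
T_e_b = make_transform(ROT_Z_90, [0.30, 0.00, 0.40])  # end effector in base
T_c_e = make_transform(IDENTITY, [0.00, 0.00, 0.05])  # camera in end effector
T_t_c = make_transform(IDENTITY, [0.10, 0.00, 0.20])  # target in camera

# Equation (1): chain the three transforms from right to left.
T_t_b = matmul4(matmul4(T_e_b, T_c_e), T_t_c)
target_in_base = [row[3] for row in T_t_b[:3]]
print(target_in_base)
```

With these placeholder values the target lies at roughly (0.30, 0.10, 0.65) m in the base frame, since the 90° rotation about z maps the camera-frame x offset onto the base-frame y axis.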

Table 7 and Table 8 present camera calibration parameters. Despite careful calibration, indoor experiments showed end-effector repeatability within ±2 mm, fruit localization within ±3–5 mm, and grasp-point estimation error of 5 ± 2 mm.

To approach the strawberry for harvesting, the pose error between the desired state T_{t}^{c*} and the current state \hat{T}_{t}^{c} determines the required motion of the camera mounted on the arm, as shown in Figure 14:

ΔT = \hat{T}_{t}^{c} (T_{t}^{c*})^{-1}

(2)

where ΔT includes the position error and orientation error between the origins of the two frames (i.e., current and desired).
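The pose error of Equation (2) can be evaluated numerically as sketched below. The sketch assumes rigid transforms, so the inverse is formed from the transposed rotation, and it summarizes the orientation error as a single rotation angle taken from the trace of the rotation block. The helper names and the example poses are hypothetical.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T):
    """Invert a rigid transform: rotation R^T, translation -R^T p."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    p = [T[i][3] for i in range(3)]
    p_inv = [-sum(R_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [p_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_error(T_current, T_desired):
    """Return Delta T, its translation part, and its rotation angle (rad)."""
    delta = matmul4(T_current, invert_rigid(T_desired))
    translation = [delta[i][3] for i in range(3)]
    trace = delta[0][0] + delta[1][1] + delta[2][2]
    angle = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    return delta, translation, angle


# Hypothetical poses: current state rotated 90 degrees about z, desired aligned.
T_hat = [[0.0, -1.0, 0.0, 0.10],
         [1.0, 0.0, 0.0, 0.00],
         [0.0, 0.0, 1.0, 0.50],
         [0.0, 0.0, 0.0, 1.0]]
T_star = [[1.0, 0.0, 0.0, 0.00],
          [0.0, 1.0, 0.0, 0.00],
          [0.0, 0.0, 1.0, 0.20],
          [0.0, 0.0, 0.0, 1.0]]

delta_T, delta_translation, delta_angle = pose_error(T_hat, T_star)
print(delta_translation, delta_angle)
```

When the current and desired states coincide, ΔT reduces to the identity matrix, so both the translation part and the rotation angle vanish.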

Alternatively, the desired end-effector pose T_{t}^{b*} (in which the arm can harvest the strawberry) can be achieved using position control, as shown in Figure 15. To reach the desired pose, the trajectory planner block computes the path Γ, which is a set of values of the end-effector pose T_{t}^{b}(t_{k}) at each time instant t_{k}. A smooth and continuous motion is achieved through an interpolating function such as a cubic or quintic polynomial. This function regulates the timing of the manipulator’s motion along the prescribed path Γ.

Consider a generic point, p = [x_{e}, y_{e}, z_{e}, ϕ, θ, ψ]^{T}, on the path Γ, represented as a function of the arc length s(t), as in [27]:

p = f(s)

(3)

The vectors given in Equations (4)–(6), namely the unit tangent, the unit normal, and the binormal at each point of the path, describe the arm trajectory in space.

t = \frac{d p}{d s}

(4)

n = \frac{1}{‖\frac{d^{2} p}{d s^{2}}‖} \frac{d^{2} p}{d s^{2}}

(5)

b = t \times n

(6)

The timing law can be defined as a cubic polynomial, given in Equation (7) as

s(t) = a_{3} t^{3} + a_{2} t^{2} + a_{1} t + a_{0}

(7)

where the coefficients a_{3}, a_{2}, a_{1}, a_{0} are determined so as to satisfy any given constraints on velocity and acceleration while transitioning from one path segment to another. For the given application (i.e., strawberry harvesting), the path segments include a combination of linear and circular segments with parabolic blends. The velocities and accelerations at the terminal and via points needed to achieve the pick-and-place operation can be selected by imitating human motion while picking strawberries. A sample path is shown in Figure 16.
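As an illustration of how the coefficients of the cubic timing law in Equation (7) can be fixed, the sketch below solves for them from boundary conditions on arc length and speed at the start and end of one segment. The function names, the 0.5 m segment length, and the 2 s duration are hypothetical; acceleration constraints would require a higher-order (e.g., quintic) polynomial and are not handled here.

```python
def cubic_coefficients(s0, sf, v0, vf, duration):
    """Return (a3, a2, a1, a0) so that s(0)=s0, s(T)=sf, s'(0)=v0, s'(T)=vf."""
    d = sf - s0
    T = duration
    a0 = s0
    a1 = v0
    a2 = (3.0 * d - (2.0 * v0 + vf) * T) / T ** 2
    a3 = (-2.0 * d + (v0 + vf) * T) / T ** 3
    return a3, a2, a1, a0


def arc_length(coeffs, t):
    """Evaluate s(t) with Horner's scheme."""
    a3, a2, a1, a0 = coeffs
    return ((a3 * t + a2) * t + a1) * t + a0


def arc_speed(coeffs, t):
    """Evaluate the time derivative of s(t)."""
    a3, a2, a1, _ = coeffs
    return (3.0 * a3 * t + 2.0 * a2) * t + a1


# Hypothetical rest-to-rest segment: 0.5 m of path covered in 2 s.
coeffs = cubic_coefficients(s0=0.0, sf=0.5, v0=0.0, vf=0.0, duration=2.0)
samples = [(t / 4.0, arc_length(coeffs, t / 4.0)) for t in range(9)]
print(coeffs)
print(samples)
```

For a rest-to-rest segment the profile is symmetric, so half of the arc length is covered at half of the segment duration, and the speed is zero at both ends.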

A customized trajectory planner was developed, instead of relying on the ROS motion planning tool, following the manufacturer’s recommendation, which states that the arm exhibits poor behavior when controlled in ROS (see Figure 17). As shown in the figure, the arm was about to hit its own body while trying to reach the desired pose. To overcome this, the path planning approach described by Equations (3)–(7) was used, representing the planned trajectory as a function of the arc length s(t) and employing a quintic polynomial to regulate timing. Unlike the straight-line motion with arc blending at via points (move_lineb), this method provided smoother transitions between motion segments, ensuring continuous velocity and acceleration profiles.

After trajectory planning, the position controller employs a proportional control law to command the robotic arm to execute the desired motion. The position control law is given by Equation (8) as:

\dot{q} = J^{T} K e

(8)

where J^{T} is the transpose of the manipulator Jacobian matrix, K is the proportional gain matrix, and e is the error term (in Cartesian space) between the desired pose X^{*} (position x, y, z and orientation ϕ, θ, ψ) and the actual pose X, as given in Equation (9):

e = X^{*} - X

(9)
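The control law of Equations (8) and (9) can be illustrated on a much simpler system than the 6-DoF arm used in this work. The sketch below applies it to a hypothetical planar two-link arm with position-only error, a diagonal gain matrix, and explicit Euler integration of the joint velocities; the link lengths, gains, time step, and target are all assumed values chosen for illustration.

```python
import math

L1, L2 = 1.0, 1.0  # hypothetical link lengths (m)


def forward_kinematics(q):
    """End-effector position (x, y) of a planar two-link arm."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    """2x2 Jacobian of the end-effector position w.r.t. the joint angles."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def control_step(q, target, gain, dt):
    """One Euler step of q_dot = J^T K e with a diagonal gain matrix K."""
    x, y = forward_kinematics(q)
    e = (target[0] - x, target[1] - y)          # Equation (9)
    J = jacobian(q)
    ke = (gain[0] * e[0], gain[1] * e[1])
    q_dot = (J[0][0] * ke[0] + J[1][0] * ke[1],  # Equation (8)
             J[0][1] * ke[0] + J[1][1] * ke[1])
    return [q[0] + dt * q_dot[0], q[1] + dt * q_dot[1]]


q = [0.3, 0.6]          # initial joint angles (rad)
target = (1.2, 0.8)     # desired end-effector position (m)
for _ in range(2000):
    q = control_step(q, target, gain=(1.0, 1.0), dt=0.05)

x, y = forward_kinematics(q)
final_error = math.hypot(target[0] - x, target[1] - y)
print(q, final_error)
```

The Jacobian-transpose form avoids inverting the Jacobian, at the cost of slower convergence near singular configurations, where the error can lie almost in the null space of J^{T}.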

Figure 18, Figure 19 and Figure 20 illustrate the simulated setup of the manipulator and strawberry environment, the real-time results of strawberry detection using an Intel RealSense camera mounted on the robotic arm, and the desired trajectory path from the fruit to the bin, respectively.

Figure 21 and Figure 22 show experimental results of an exemplary robotic strawberry harvesting task. Plots in Figure 21 represent the corresponding Cartesian motion (x, y, and z positions) of the robot end effector. The three waveform segments correspond to the sequential harvesting of three strawberries, as illustrated in Figure 20. The robot starts from its home position (#1), moves to the park position (#2), proceeds to (#3) to harvest the first strawberry, returns through (#2), and then moves to the collection bin (#4) mounted on the mobile platform. The same sequence is repeated for the second and third strawberries at positions (#5) and (#6), respectively, with the overall path summarized as: 1 → 2 → 3 → 2 → 4 → 2 → 5 → 2 → 4 → 2 → 6 → 2 → 4. The segments (#1 → #2) and (#4 → #1) represent the initial and final stages of the harvesting task and are not repeated for each fruit. The brief pause at the park position (#2) before and after harvesting helps prevent collision with neighboring strawberries.

The robot achieved the level of accuracy required to approach and grasp strawberries in 3D space. Based on the demonstrated trajectory, the path-efficiency metric, defined as the shortest path (888.6 mm) divided by the actual distance (1029.7 mm) traveled by the arm, was 86%. The extra 14% of motion is due to the safe return path used to avoid contact with neighboring fruit. The corresponding angular displacements of the joints are shown in Figure 22. As the picking and placing task is a repetitive point-to-point motion in which the path does not matter, position control mode in joint space has been used.

Fifty harvesting trials were performed in a controlled environment, using simulated (synthetic) strawberries to evaluate the performance of the developed system. The robot successfully harvested 36 strawberries, resulting in an overall success rate of 72%. The fourteen unsuccessful attempts were largely due to the vision module’s failure to detect strawberries (6 cases) and grasp instability due to the oversized gripper design (8 cases). Considering the complexity of the problem, varying lighting conditions, and fruit clustering in real farm environments, the harvesting success rate is expected to fall within the 60–80% range.

4. Discussion

Robotic berry harvesting is a challenging task, and future research will build upon the progress made in this work. Some of the important challenges and limitations of strawberry harvesting robots are presented below:

Collaborative robots (cobots) are designed for safe operation in human-occupied environments. The xArm manipulator used in this study includes built-in 3D collision detection during motion planning, automatically switching from Mode 1 (external trajectory planning) to Mode 0 (internal position control) upon detecting a collision. Additionally, an emergency stop mechanism enhances safety. However, at this stage, the system lacks a dedicated human-robot interaction strategy. Future work will focus on implementing safe distance monitoring, real-time tracking using fixed vision cameras, and active compliance control to improve human–robot interaction and ensure safer operation.
This work currently relies on the xArm manipulator’s collision detection features, which reset the arm to its home position upon detecting a collision. However, experiments revealed that the arm is not sensitive enough to detect strawberries within dense foliage. To address this, an additional vision module should be implemented to analyze the arrangement of individual strawberries in a cluster. The current approach focuses only on ripe strawberries that are closest to the end effector at the time of detection. Furthermore, the path planning algorithm should account for obstacles and adjust its trajectory accordingly.
There are several limitations with the current end effector. First, it is not compliant and was not designed to grip soft objects. During the grasping process, the end effector does not measure the distance between the gripper’s fingers and the object. Instead, it continues to squeeze the object, which is not suitable for harvesting strawberries, where gentle handling is required. Some researchers suggest grasping the strawberry by the peduncle. While this may work for strawberries with sufficiently long peduncles, it does not eliminate the need for a customized soft gripper that handles the fruit more effectively.
Currently, if a picking attempt fails, the robot restarts the entire detection process. A more efficient approach would be to skip detection and retry only the approach, grasp, and picking sequence. Using a dual-arm manipulator or attaching the bin to the end effector could further optimize performance. Additionally, the robot should dynamically adjust its path instead of following a fixed trajectory. Once the bin is attached, prioritizing nearby strawberries before harvesting farther ones would further enhance the path efficiency.
To the best of the authors’ knowledge, the literature focusing on a comprehensive cost-benefit assessment of robotic harvesting systems is limited. Technical studies often overlook economic feasibility, and vice versa. Potential reasons include the lack of collaboration among key stakeholders, including technology developers, system integrators, robot manufacturers, end-users, and researchers. A rigorous assessment should account for all relevant costs, including capital costs (hardware and software), operating costs (maintenance, upgrades, and vendor lock-in), lifecycle costs (energy, repairs, and commissioning), and the infrastructure and personnel required to support deployment. As in many other industries, the strongest economic case for robots in agriculture lies in their ability to reduce reliance on manual labor. In this study, the time taken to detect and harvest a strawberry is approximately 10 s. Other similar studies have achieved harvesting times of 10 s [36], 9 s [37], 4 s [38], and 1.2 s [39] per fruit, depending on the robot’s complexity and the number of manipulators. Trained human workers spend about 1 to 3 s on harvesting one fruit. To close the performance gap, two robotic arms must be deployed in parallel to harvest approximately 600 strawberries per hour. While the upfront investment for two robots would be much higher, the proposed system is still considered a stable alternative to human harvesters due to its round-the-clock reliability, consistency, and ease of replication. In the future, experiments on dual-arm setups will be carried out to target better economic feasibility.
Strawberry harvesting platforms such as the one developed in this study will need to navigate autonomously within cultivation rows, operate near workers, and ensure control system reliability. For practical commercial deployment, the system should be assessed under ISO 18497 (for automated agricultural machinery) [40] and ISO 25119 (for safety-related electrical and electronic control systems) [41], together with collaborative robot safety standards such as ISO 10218 (for industrial robots) [42] and ISO/TS 15066:2016 (for human–robot collaborative operation) [43].

5. Conclusions

The integration of robotics and AI has made significant advancements in enabling automated harvesting solutions, particularly for delicate crops like strawberries. In this research, an intelligent, computer-vision-based system for berry detection, classification, and localization has been successfully integrated with an SMM. Experimental results validate the proposed solution, which achieved a segmentation mAP@0.5 of approximately 84.4% in real-time strawberry detection and segmentation across both natural and structured environments, leveraging the YOLOv11 deep learning architecture.

While the proposed system is well-suited for controlled environments such as greenhouses, additional sensors, depth cameras, and a dual-arm robotic manipulator may be required to adapt the system to specific layouts and operational conditions. Furthermore, its ability to detect not only strawberries but also surrounding objects and human operators highlights its potential for safe and collaborative operation.

Future research will focus on optimizing the system’s performance by learning from human harvesting through imitation learning. Expanding the solution to outdoor environments and training a large language model on diverse datasets (including fruits, plants, robots, and surrounding elements) are key directions for enhancing adaptability. Moreover, the future development of a customized gripper for cutting the peduncle will enable harvesting of real strawberries without causing bruising or damage. Although current perception accuracy is promising, practical commercial deployment will ultimately depend on achieving end-to-end harvesting efficiency, minimizing fruit damage, and maintaining robust operation under variable field conditions. In addition, a thorough ablation study is planned for the near future.

Author Contributions

Conceptualization, M.T.; methodology, M.T.; software, M.T. and J.I.; validation, M.T.; formal analysis, M.T. and R.A.; investigation, M.T. and J.I.; resources, R.A.; data curation, J.I.; writing—original draft preparation, M.T.; writing—review and editing, J.I.; visualization, J.I.; supervision, R.A.; project administration, R.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

All figures and tables presented in this paper are based on the authors’ original work.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge the support of the Smart and Sustainable Manufacturing Systems Laboratory (SMART Lab) at the University of Alberta, where the experimental setup and validation for this research were conducted. The authors thank the SMART Lab research team and technical staff for their assistance in developing and operating the experimental platform, as well as the Department of Mechanical Engineering at the University of Alberta for providing the facilities and infrastructure that supported this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
CNN	Convolutional Neural Network
DH	Denavit–Hartenberg
GFTT	Good Features to Track
HMI	Human–Machine Interface
IMU	Inertial Measurement Unit
IoU	Intersection over Union
OV	Open-Vocabulary
PCA	Principal Component Analysis
R-CNN	Region-based Convolutional Neural Network
ROS	Robot Operating System
RT-DETR	Real-Time DEtection TRansformer
SLAM	Simultaneous Localization and Mapping
SMM	Smart Mobile Manipulator
VLM	Vision Language Models
YOLO	You Only Look Once

References

Agriculture and Agri-Food Canada. Statistical Overview of the Canadian Fruit Industry, 2022; Agriculture and Agri-Food Canada: Ottawa, ON, Canada, 2023. [Google Scholar]
Forney, A. Patterns of Harvest: Investigating the Social-Ecological Relationship Between Huckleberry Pickers and Black Huckleberry (Vaccinium membranaceum Dougl. ex Torr.; Ericaceae) in Southeastern British Columbia; University of Victoria: Victoria, BC, Canada, 2016. [Google Scholar]
Daniels, J. Wave of Agriculture Robotics Holds Potential to Ease Farm Labor Crunch; CNBC: Englewood Cliffs, NJ, USA, 2018; Available online: https://www.cnbc.com/2018/03/08/wave-of-agriculture-robotics-holds-potential-to-ease-farm-labor-crunch.html (accessed on 1 February 2026).
Iqbal, J.; Tsagarakis, N.G.; Caldwell, D. Four-fingered light-weight exoskeleton robotic device accommodating different hand sizes. IET Electron. Lett. 2015, 51, 888–890. [Google Scholar] [CrossRef]
Iqbal, J.; Tsagarakis, N.G.; Caldwell, D.A. Human hand compatible underactuated exoskeleton robotic system. IET Electron. Lett. 2014, 50, 494–496. [Google Scholar] [CrossRef]
Hassan, M.U.; Ullah, M.; Iqbal, J. Towards autonomy in agriculture: Design and prototyping of a robotic vehicle with seed selector. In Proceedings of the 2nd International Conference on Robotics and Artificial Intelligence (ICRAI), Rawalpindi, Pakistan, 1–2 November 2016; pp. 37–44. [Google Scholar]
Growers, W.; Berger, R. 2021 Global Harvest Automation Report; Western Growers Center for Innovation & Technology: Salinas, CA, USA, 2022; Available online: https://wga.s3.us-west-1.amazonaws.com/2022/wgcit_2021_harvest_automation_report_2022-02-07.pdf (accessed on 1 February 2026).
Shi, J.; Tomasi, C. Good features to track. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600.
Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555.
Alam, M.; Alam, M.S.; Roman, M.; Tufail, M.; Khan, M.U.; Khan, M.T. Real-time machine-learning based crop/weed detection and classification for variable-rate spraying in precision agriculture. In Proceedings of the 7th International Conference on Electrical and Electronics Engineering (ICEEE), Antalya, Turkey, 14–16 April 2020; pp. 273–280.
Nasir, F.E.; Tufail, M.; Haris, M.; Iqbal, J.; Khan, S.; Khan, M.T. Precision agricultural robotic sprayer with real-time tobacco recognition and spraying system based on deep learning. PLoS ONE 2023, 18, e0283801.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Laroca, R.; Severo, E.; Zanlorensi, L.A.; Oliveira, L.S.; Gonçalves, G.R.; Schwartz, W.R.; Menotti, D. A robust real-time automatic license plate recognition based on the YOLO detector. Image Vis. Comput. 2018, 78, 33–45.
Bresilla, K.; Perulli, G.D.; Boini, A.; Morandi, B.; Corelli Grappadelli, L.; Manfrini, L. Single-shot convolution neural networks for real-time fruit detection within the tree. Front. Plant Sci. 2019, 10, 611.
Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426.
Li, X.; Qin, Y.; Wang, F.; Guo, F.; Yeow, J.T.W. Pitaya detection in orchards using the MobileNet-YOLO model. In Proceedings of the 39th Chinese Control Conference, Shenyang, China, 27–29 July 2020; pp. 662–667.
Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Automatic bunch detection in white grape varieties using YOLOv3, YOLOv4, and YOLOv5 deep learning algorithms. Agronomy 2022, 12, 319.
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
Jocher, G.; Qiu, J. Ultralytics YOLO11, version 11.0.0. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 19 March 2026).
Parsa, S.; Debnath, B.; Khan, M.A.; Esfahani, A.G. Autonomous strawberry picking robotic system (Robofruit). arXiv 2023, arXiv:2301.03947.
V.R., S.; Parsa, S.; Parsons, S.; Esfahani, A.G. Peduncle gripping and cutting force for strawberry harvesting robotic end-effector design. arXiv 2022, arXiv:2207.12552.
Quaglia, G.; Tagliavini, L.; Colucci, G.; Vorfi, A.; Botta, A.; Baglieri, L. Design and prototyping of an interchangeable and underactuated tool for automatic harvesting. Robotics 2022, 11, 145.
Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection with vision transformers. arXiv 2022, arXiv:2205.06230.
Zhang, Z.; Cai, H.; Han, S. EfficientViT-SAM: Accelerated segment anything model without performance loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7859–7863.
Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911.
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2025; pp. 38–55.
Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846.
Xiong, Y.; Ge, Y.; Grimstad, L.; From, P.J. An autonomous strawberry-harvesting robot: Design, development, integration, and field evaluation. J. Field Robot. 2020, 37, 202–224.
Tituaña, L.; Gholami, A.; He, Z.; Xu, Y.; Karkee, M.; Ehsani, R. A small autonomous robot for selective strawberry harvesting in open fields. Smart Agric. Technol. 2024, 8, 100454.
Sather, J. Harvester-Sim: Virtual Strawberry Harvesting Environment in ROS/Gazebo. GitHub Repository. 2019. Available online: https://github.com/jsather/harvester-sim (accessed on 1 February 2026).
Siciliano, B.; Sciavicco, L.; Villani, L.; Oriolo, G. Robotics: Modelling, Planning and Control; Springer: London, UK, 2009.
Pérez-Borrero, I.; Marín-Santos, D.; Gegúndez-Arias, M.E.; Cortés-Ancos, E. A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric. 2020, 178, 105736.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 251–268.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2023, arXiv:2304.08069.
Ackerman, E. Greedy Robot Picks Only the Ripest Strawberries; IEEE Spectrum: New York, NY, USA, 2019; Available online: https://spectrum.ieee.org/greedy-robot-picks-only-the-ripest-strawberries (accessed on 1 February 2026).
Geer, L. Navigation and Control for an Autonomous Robotic Fruit Harvesting System. Ph.D. Thesis, University of Essex, Colchester, UK, 2023.
De Preter, A.; Anthonis, J.; De Baerdemaeker, J. Development of a robot for harvesting strawberries. IFAC-Pap. 2018, 51, 14–19.
Durmuş, H.; Güneş, E.O.; Kırcı, M.; Üstündağ, B.B. The design of general purpose autonomous agricultural mobile robot: AGROBOT. In Proceedings of the Fourth International Conference on Agro-Geoinformatics, Istanbul, Turkey, 20–24 July 2015; pp. 49–53.
ISO 18497:2024; Agricultural Machinery and Tractors—Safety of Highly Automated Agricultural Machines. International Organization for Standardization (ISO): Geneva, Switzerland, 2024.
ISO 25119:2018; Tractors and Machinery for Agriculture and Forestry—Safety-Related Parts of Control Systems. International Organization for Standardization (ISO): Geneva, Switzerland, 2018.
ISO 10218-1:2025; Robotics—Safety Requirements—Part 1: Industrial Robots. International Organization for Standardization (ISO): Geneva, Switzerland, 2025.
ISO/TS 15066:2016; Robots and Robotic Devices—Collaborative Robots. International Organization for Standardization (ISO): Geneva, Switzerland, 2016.

Figure 1. Strawberry detection using a traditional feature-based approach (KAZE/AKAZE), where local keypoints (circles) are extracted and matched to identify the object, with the detected region highlighted by a bounding box.

Figure 2. SMART robot set to explore and map the indoor environment.

Figure 3. RViz visualization of xArm showing frames on various joints. Red, green, and blue colors correspond to x-, y-, and z-axes.

Figure 4. Developed Smart Mobile Manipulator (SMM) for berry harvesting applications.

Figure 5. Hardware architecture of the Smart Mobile Manipulator. Dashed lines indicate auxiliary or user-interface connections (e.g., keyboard, mouse, and display) used for configuration and monitoring, while solid lines represent primary power, control, and communication pathways. Numbered markers (1–6) denote key system components: (1) depth camera, (2) embedded GPU, (3) robotic arm control unit, (4) power supply, (5) DC-DC converter, and (6) optional display unit.

Figure 6. The developed software architecture for the Smart Mobile Manipulator.

Figure 7. ROS computation graph.

Figure 8. Strawberry dataset [26] used to train the YOLO object detection and segmentation model.

Figure 9. Training and validation results of the YOLOv11s seg model. (a) Box and segmentation losses converge over training epochs. (b) Metrics (precision, recall, and mAP) show the progress of model performance during training.

Figure 10. Strawberry detection results obtained using four deep learning models. Ground truth: 9 strawberries.

Figure 11. Harvest target selection after YOLOv11-based segmentation. The yellow line marks the principal axis, which indicates fruit orientation. The red arrow points to the peduncle/crown side, and the red dot shows the estimated grasp point.

Figure 12. Camera calibration using ROS calibration package (easy_handeye).

Figure 13. Coordinate frames and transformation matrices used in the camera calibration procedure.

Figure 14. Position-based visual servoing used to approach the detected strawberry.

Figure 15. Position control of xArm 6 for strawberry harvesting.

Figure 16. Sample parameterized path during pick and place harvesting activity.

Figure 17. A potentially harmful xArm configuration, as determined by the ROS motion planning tool MoveIt 1.

Figure 18. Robotic harvesting demonstration.

Figure 19. Strawberry segmentation using YOLOv11 during the final robotic harvesting demonstration.

Figure 20. Path of the robot during strawberry harvesting (demonstration).

Figure 21. End-effector positions recorded during the strawberry harvesting demonstration.

Figure 22. Joint positions recorded during the strawberry harvesting demonstration.
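
The principal axis drawn in Figure 11 can be illustrated with a two-dimensional PCA over the pixel coordinates of one segmented fruit. The sketch below is a minimal illustration, not the authors' implementation: the function name `principal_axis` and the input format (a list of `(x, y)` mask pixels) are assumptions, and it uses the closed-form orientation of a 2 × 2 covariance matrix instead of a general eigen-solver.

```python
import math


def principal_axis(points):
    """Return ((cx, cy), angle) for the dominant axis of 2-D points.

    points: iterable of (x, y) pixel coordinates of one fruit mask
            (hypothetical input format, chosen for illustration).
    angle:  radians from the image x-axis, in (-pi/2, pi/2].
    """
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two mask pixels")

    # Centroid of the mask.
    cx = sum(p[0] for p in pts) / n
    cy = sum(p[1] for p in pts) / n

    # Entries of the 2x2 covariance matrix.
    sxx = sum((p[0] - cx) ** 2 for p in pts) / n
    syy = sum((p[1] - cy) ** 2 for p in pts) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in pts) / n

    # Closed-form direction of the leading eigenvector.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), angle


# Synthetic "mask": pixels spread along the diagonal y = x.
centroid, angle = principal_axis([(i, i) for i in range(10)])
```

PCA yields an axis without a sign, so it cannot by itself tell which end of the fruit carries the peduncle; the arrow direction and grasp point in Figure 11 require an additional geometric cue that this sketch does not attempt to reproduce.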

Table 1. Comparative summary of related robotic harvesting studies.

Ref.	Short Title	Major Contribution(s)	Limitation	How the Proposed Approach Addresses It
Yu et al. [27]	Fruit detection for strawberry harvesting based on Mask R-CNN	Developed a Mask R-CNN-based strawberry detection algorithm focused on universality and robustness in unstructured environments.	Focuses only on the perception model; deployment on a real robot and on edge devices is not addressed. The ripe/unripe categorization of strawberries is subjective and may not be consistent across real farms.	The proposed method was developed with deployment on real robots in mind. Using a single strawberry class allows ripeness criteria to be adjusted after deployment through post-detection evaluation.
Xiong et al. [28]	Design and development of autonomous strawberry-harvesting robot	Demonstrated a complete autonomous robotic solution for harvesting strawberries in table-top farms. Strawberry detection relies on HSV-based adaptive color thresholding combined with RGB-D localization.	Because modern AI-based models are not used, the system may produce incomplete segmentation in clustered environments with occluded and connected fruits.	Segmentation techniques such as YOLO and RT-DETR, as used in this study, can improve semantic separation and enhance robustness under occlusion and connected-fruit conditions, particularly when supported by high-quality datasets and well-optimized training.
Parsa et al. [20]	Modular autonomous strawberry picking robotic system	Demonstrated a field-tested modular mobile robot with a Panda robot arm, a customized gripper, and an RGB-D camera in a commercial glasshouse. The perception module uses Mask R-CNN to determine the picking point.	Although the paper provides a strong benchmark for selective harvesting, pluckability is defined as a robot-specific assumption rather than an absolute visual property. In addition, key-point detection assumes that the fruit is not occluded. The dataset of pluckable and unpluckable classes is imbalanced and may result in biased learning. The overall approach requires substantial GPU resources.	The proposed study adopts a lighter perception-manipulation pipeline composed of three transferable modules: generic deep segmentation, ripeness screening, and geometric grasp inference.
Tituaña et al. [29]	Small autonomous robot for selective strawberry harvesting in open fields	Addresses open-field application, which is an important contribution. The perception module uses YOLOv4 to detect strawberries and classify them into five maturity levels.	Using five discrete maturity classes in the YOLO model can create label ambiguity because of slight variations in redness and ripeness, especially outdoors and when annotation is subjective. In addition, top-view perception may offer limited visibility of berries hidden beneath foliage.	The proposed method separates fruit detection from ripeness assessment, so a limitation of one stage does not affect the other.

Table 2. Specifications of the SMART mobile robot.

Specification	Value
Measuring frequency	9000 times/s
Positioning accuracy	Centimeter level (5 cm); can build a map of an area up to 5000 m² (~70.7 m × 70.7 m)
Size	525 × 525 × 268 mm
Weight	40 kg (approx.)
Payload	30–70 kg
Battery and charging	30 Ah/24 V, 10 h (no load), automatic charging, charging time 4.5 h
Sensors	2 Lidar (YDLIDAR G4, 360 degrees omnidirectional scanning and 5–12 Hz frequency), Ultrasonic (×5), IR (×4), IMU (Gyro and 3-DoF accelerometer), Encoder (4096 res.)
Motors	250 W 12/24 V DC brushless hub
Software	Ubuntu 18.04 and ROS Melodic; a ROS 2 driver (version 1.0.1) also exists
Algorithms running onboard	SLAM navigation
Ports	LAN, USB
Obstacle height, width, and angle (all max)	20 mm, 40 mm, 10 degrees
Operating speed	1.0 m/s (max)

Table 3. DH table of xArm 6.

Axis	Joint Variable $θ_{i}$ [rad]	Link Length $a_{i}$ [mm]	Link Offset $d_{i}$ [mm]	Link Twist $α_{i}$ [rad]
1	$θ_{1}$	0	267	0
2	$θ_{2}$ − 1.3849	0	0	$- π / 2$
3	$θ_{3}$ + 1.3849	289.49	0	0
4	$θ_{4}$	77.5	342.5	$- π$
5	$θ_{5}$	0	0	$π / 2$
6	$θ_{6}$	76	97	$- π / 2$
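
To show how a DH table such as Table 3 turns into an end-effector pose, the following sketch chains one homogeneous transform per joint. It is an illustration under stated assumptions, not the authors' code. The table does not say whether the classic or the modified (Craig) DH convention is used, so the helper accepts a `modified` flag and the choice should be checked against the manufacturer's documentation. The rows in `XARM6_DH` are copied verbatim from Table 3, with each constant joint offset stored separately from the joint variable. All function and variable names are hypothetical.

```python
import math


def dh_transform(theta, a, d, alpha, modified=False):
    """4x4 homogeneous transform for one DH row (angles in rad)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    if modified:  # Craig: Rot_x(alpha) Trans_x(a) Rot_z(theta) Trans_z(d)
        return [[ct, -st, 0.0, a],
                [st * ca, ct * ca, -sa, -d * sa],
                [st * sa, ct * sa, ca, d * ca],
                [0.0, 0.0, 0.0, 1.0]]
    # Classic: Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha)
    return [[ct, -st * ca, st * sa, a * ct],
            [st, ct * ca, -ct * sa, a * st],
            [0.0, sa, ca, d],
            [0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(joint_angles, dh_rows, modified=False):
    """Chain the per-joint transforms; returns the base-to-tool 4x4 matrix."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for q, (offset, a, d, alpha) in zip(joint_angles, dh_rows):
        T = matmul4(T, dh_transform(q + offset, a, d, alpha, modified))
    return T


# (theta offset [rad], a [mm], d [mm], alpha [rad]) as listed in Table 3.
XARM6_DH = [
    (0.0, 0.0, 267.0, 0.0),
    (-1.3849, 0.0, 0.0, -math.pi / 2),
    (1.3849, 289.49, 0.0, 0.0),
    (0.0, 77.5, 342.5, -math.pi),
    (0.0, 0.0, 0.0, math.pi / 2),
    (0.0, 76.0, 97.0, -math.pi / 2),
]

# Pose at the zero joint configuration; the last column holds x, y, z in mm.
T_zero = forward_kinematics([0.0] * 6, XARM6_DH)
```

Because the lengths in the table are in millimetres, the translation column of the result is also in millimetres. In the reported system, planning was handled through ROS and MoveIt, so a routine like this mainly serves as a sanity check on the kinematic model.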

Table 4. Hardware components of the SMM.

Task Category	Hardware Component	Application and Related Planned Hardware/Software Activity
Vision-based fruit detection and manipulation (Focus of major software development)	Intel RealSense depth camera D435i	Captures still images and video of the fruit plant. Unlike traditional RGB sensors, the camera captures a depth map that stores, for each pixel in the image, the distance from the camera to the scene objects along the z-axis. It can also return a 3-D point cloud to visualize the depth map in 3D. Computer vision and deep learning-based fruit and tree detection are the main challenges to address here. It is mounted as part of the fixture on the arm’s wrist (i.e., eye-in-hand configuration).
	Robotic manipulator	To reach out to the fruit and pick it once it has been detected by the vision module. This involves solving the arm’s kinematic/inverse kinematic problems and controlling the arm position/velocity while considering the system dynamics.
Autonomous operation of the mobile platform	Mobile platform	A smart mobile platform with enough payload capacity and power to accommodate a 6-DoF robotic arm.
	Wheel encoders, IMU, and two 2D laser scanners	These modules help solve the localization and mapping problem for autonomous navigation of the mobile robot. This involves sensor fusion based on techniques such as Kalman filters.
	Stereo vision system	To measure visual odometry of the mobile robotic platform for localization. This involves detecting/recognizing objects and creating a depth map for obstacle avoidance.
	Safety bumpers and ultrasonic sensor array	To detect collision with objects/obstacles by the mobile robotic platform.

Table 5. Experimental configuration.

Model	Backbone	Input Size	Batch Size	Optimizer	LR	Epochs
YOLOv11 Seg	Cross Stage Partial (CSP) architecture	640	8	AdamW	auto/0.002	100
YOLOv11 Box	CSP	768	8	AdamW	0.002	100
RT-DETR	CNN + Transformer	640	8	default	default	100
Faster R-CNN	ResNet-50 + FPN	original/resized by loader	4	AdamW	0.0001	20

Table 6. Performance comparison.

Model	AP	mAP@0.5	Precision	Recall	F1-Score
RT-DETR	0.7338	0.8447	0.7692	0.8674	0.8154
YOLOv11 Box	0.7104	0.8314	0.7553	0.8449	0.7976
YOLOv11 Seg	0.6766	0.8441	0.7690	0.8681	0.8155
Faster R-CNN	0.6311	0.8114	0.7669	0.7611	0.7640
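
The F1-Score column of Table 6 follows from the Precision and Recall columns through the harmonic mean, F1 = 2PR/(P + R). The short check below recomputes it from the tabulated values. The dictionary name and layout are illustrative, and because the inputs are already rounded to four decimals, agreement is only expected to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 6.
TABLE6 = {
    "RT-DETR": (0.7692, 0.8674, 0.8154),
    "YOLOv11 Box": (0.7553, 0.8449, 0.7976),
    "YOLOv11 Seg": (0.7690, 0.8681, 0.8155),
    "Faster R-CNN": (0.7669, 0.7611, 0.7640),
}


def max_f1_deviation(rows):
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())


deviation = max_f1_deviation(TABLE6)
```

A deviation well below 0.001 indicates that the reported F1 values are consistent with the listed precision and recall.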

Table 7. Camera intrinsic parameters (at 640 × 480 resolution, color).

Parameter	Description	Value
$f = {[\begin{matrix} f_{x} & f_{y} \end{matrix}]}^{T}$	Focal length, in pixels, for left, right, and RGB cameras	$f = {[\begin{matrix} 608.29 & 607.96 \end{matrix}]}^{T}$
$p = {[\begin{matrix} p_{x} & p_{y} \end{matrix}]}^{T}$	Principal point, in pixels, for left, right, and RGB cameras	$p = {[\begin{matrix} 325.33 & 249.07 \end{matrix}]}^{T}$
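
Intrinsic parameters of the kind listed in Table 7 are what allow a pixel and its depth reading to be converted into a 3-D point in the camera frame. The sketch below uses the ideal pinhole model and ignores lens distortion. The function name is hypothetical, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not the calibrated values, which should be substituted from Table 7.

```python
def deproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth along the optical axis.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Returns (X, Y, Z) in the camera frame, in the same unit as depth.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Placeholder intrinsics for a 640 x 480 image (illustrative only).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A grasp-point pixel 300 px right of the image centre, 1 m away.
point_cam = deproject_pixel(620.0, 240.0, 1.0, FX, FY, CX, CY)
```

With the placeholder values, a pixel at the principal point maps onto the optical axis, and horizontal pixel offsets scale linearly with depth.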

Table 8. Camera extrinsic parameters obtained as a result of camera calibration.

Transformation	Values
Translation	x: −0.05298594969415579
	y: −0.036995891328604535
	z: −0.019036862460579607
Rotation (quaternion representation)	x: 0.03504656738631116
	y: 0.03173155364147595
	z: 0.6961654236502247
	w: 0.7163229366227484
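
The translation and quaternion in Table 8 together define a rigid transform between the camera and the arm. The sketch below converts them into a 4 × 4 homogeneous matrix and applies it to a point. Two things are assumptions rather than statements from the table: that the translation is in metres (the usual ROS convention) and that the transform maps camera-frame points into the end-effector frame. All names are hypothetical.

```python
import math


def quat_to_rot(x, y, z, w):
    """3x3 rotation matrix from a quaternion (normalised internally)."""
    n = math.sqrt(x * x + y * y + z * z + w * w)
    if n == 0.0:
        raise ValueError("zero-norm quaternion")
    x, y, z, w = x / n, y / n, z / n, w / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def make_transform(translation, quaternion):
    """Assemble a 4x4 homogeneous transform from (tx, ty, tz), (x, y, z, w)."""
    R = quat_to_rot(*quaternion)
    return [R[i] + [translation[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply_transform(T, p):
    """Map a 3-D point through the homogeneous transform T."""
    return tuple(T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3]
                 for i in range(3))


# Values copied from Table 8.
HAND_EYE_T = (-0.05298594969415579, -0.036995891328604535,
              -0.019036862460579607)
HAND_EYE_Q = (0.03504656738631116, 0.03173155364147595,
              0.6961654236502247, 0.7163229366227484)

T_hand_eye = make_transform(HAND_EYE_T, HAND_EYE_Q)

# A point 0.3 m in front of the camera, expressed in the assumed target frame.
p_target = apply_transform(T_hand_eye, (0.0, 0.0, 0.3))
```

The dominant z and w components of the quaternion correspond to a rotation of roughly 90° about the z-axis, which is plausible for a wrist-mounted camera. The frame direction should still be confirmed against the calibration output before the transform is used for grasping.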
