Recognition and Positioning of Strawberries Based on Improved YOLOv7 and RGB-D Sensing

Abstract: To improve the speed and accuracy of the methods used for the recognition and positioning of strawberry plants, this paper is concerned with the detection of elevated-substrate strawberries and their picking points by a strawberry picking robot, based on the You Only Look Once version 7 (YOLOv7) object detection algorithm and Red Green Blue-Depth (RGB-D) sensing. Modifications to the YOLOv7 model include the integration of more efficient modules, incorporation of attention mechanisms, elimination of superfluous feature layers, and the addition of layers dedicated to the detection of smaller targets. These modifications have culminated in a lightweight and improved YOLOv7 network model. The number of parameters is only 40.3% of that of the original model, the computation is reduced by 41.8%, and the model size is reduced by 59.2%. The recognition speed and accuracy are also both improved: the frame rate of model recognition is increased by 19.3%, the accuracy of model recognition reaches 98.8%, and mAP@0.95 reaches 96.8%. In addition, we have developed a method for locating strawberry picking points based on strawberry geometry. The test results demonstrated that the average positioning success rate and average positioning time were 90.8% and 76 ms, respectively. The picking robot in the laboratory utilized the recognition and positioning method proposed in this paper. The error of hand-eye calibration is less than 5.5 mm on the X-axis, less than 1.6 mm on the Y-axis, and less than 2.7 mm on the Z-axis, which meets the requirements of picking accuracy. The success rate of the picking experiment was about 90.8%, and the average execution time for picking each strawberry was 7.5 s. In summary, the recognition and positioning method proposed in this paper provides a more effective approach to automatically picking elevated-substrate strawberries.


Introduction
A fruit with high nutritional value, the strawberry has a sweet and sour taste that is very popular with consumers [1,2]. Its economic value is also high; due to variations in quality, the price of strawberries per kilogram ranges from tens to hundreds of yuan. Strawberry quality is related not only to the variety in question but also to the planting method [3]. Currently, the two main domestic cultivation methods are elevated facility cultivation and ridge planting on the ground. Because of its cost advantages, domestic farmers primarily employ ridge planting on the ground. However, because the fruit is close to the ground, problems such as rotten fruit and uneven coloring easily occur, which affect quality. Strawberries planted in elevated facilities are suspended in the air, and no other objects are in contact with them during ripening. The light is relatively uniform, making the strawberries less susceptible to decay, and their color is relatively uniform. Owing to these advantages of elevated facilities in terms of quality and standardization, strawberry planting in China is trending toward this method. In the process of strawberry cultivation, the cost of picking has always plagued growers. Strawberries must be picked at 80-90% maturity.

Strawberry Scene
In order to replicate the actual conditions in which elevated-substrate strawberries grow, we built a simulated elevated-substrate strawberry scene in the laboratory, according to the actual dimensions of elevated-substrate strawberry cultivation (Figure 1). A single strawberry stand is 2000 mm long, 350 mm wide, and 1000 mm high. The distance between the two strawberry racks is 900 mm. Plastic strawberry plants were used to simulate the elevated planting arrangement. Each strawberry stand has two rows of strawberries, with rows approximately 150 mm apart and plants approximately 200 mm apart. Within this arrangement, we hung ripe and unripe strawberries at random on both sides of the elevated shelves, with occlusion and overlapping, to simulate the randomness of real strawberry growth.

Image Acquisition
This study used an RGB-D camera (RealSense D435i, Intel, Santa Clara, CA, USA) to collect the dataset in the elevated-substrate strawberry scene. This camera has two infrared stereo sensors and an infrared emitter, which are primarily used to detect depth. It also has an RGB image sensor that can obtain RGB images. In addition, the D435i has an inertial measurement unit (IMU) sensor that measures its current acceleration and angular velocity to calculate its attitude. Several studies on fruit picking [5,9,11] have chosen RealSense depth cameras because they can measure the depths of objects more accurately than other models, which is necessary for picking. The D435i has a working range of approximately 0.1-10 m. Table 1 lists this camera's parameters.

Before using the D435i, it was necessary to dynamically calibrate it to prevent depth measurement inaccuracies that might result from wear and daily usage. After installing the RealSense camera Dynamic Calibration Tool on the computer, the calibration paper was printed, the calibration program was initiated, and rectification and scale calibration were completed (Figure 2). Post-calibration, the Depth Quality Tool was employed to verify the calibration results. The calibration was deemed successful if the depth value error was within the acceptable range of 2 mm.

We recorded a 10 min video with a resolution of 1280 × 720 pixels at 30 frames per second. From this video, we extracted an image every five frames. We selected one in every six of these images based on clarity, forming a dataset of 600 images. These dataset images underwent rotation (Figure 3a), salt-and-pepper noise addition (Figure 3b), sharpening (Figure 3c), and brightness adjustment (Figure 3d), resulting in a total of 3000 experimental dataset images.
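Of these augmentations, salt-and-pepper noise is the simplest to reproduce. A minimal sketch (our own illustration, not the paper's code, operating on a grayscale image stored as nested lists) is:

```python
import random

def add_salt_pepper(img, ratio, seed=None):
    """Return a copy of a grayscale image with roughly a given fraction of
    pixels forced to white (255, 'salt') or black (0, 'pepper')."""
    rng = random.Random(seed)
    out = [row[:] for row in img]            # leave the original untouched
    h, w = len(img), len(img[0])
    for _ in range(int(h * w * ratio)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y][x] = 255 if rng.random() < 0.5 else 0
    return out

# a uniform gray 4x4 test image
img = [[128] * 4 for _ in range(4)]
noisy = add_salt_pepper(img, 0.25, seed=42)
```

In practice, such augmentations would be applied with an image library; this sketch only shows the underlying operation.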

Training Environment
Strawberries have a fragile surface, and picking them directly will shorten their storage time; therefore, the best way to pick them is to cut the stem. Detecting the stem of a strawberry using a deep learning network is a complex challenge, so a series of image processing and geometric algorithms determined the position of the fruit stem. When labeling, LabelImg software (version 1.8.6) was used to frame the area of each ripe strawberry with a rectangle (Figure 4).

Object detection algorithms encompass both single-stage and two-stage algorithms. Yu et al. used Mask-RCNN and other two-stage algorithms to detect strawberries and fruit stems, but the detection speed could not meet the real-time requirement [23]. As a single-stage algorithm, YOLO has become a popular choice for real-time object detection: it not only detects quickly but also has a high detection accuracy. YOLOv7 performs well in real time, demonstrates high precision, and has improved small-target detection compared with the previous version; therefore, the YOLOv7 model was selected for this study and further improved to better adapt it to strawberry detection.
YOLOv7 is a single-stage object detection algorithm capable of achieving real-time performance and high accuracy. The YOLOv7 network model structure has three main parts: the input layer, backbone network, and head network [31].

The input layer mainly pre-processes incoming images through data augmentation and adaptive scaling. The backbone network includes the Convolution, Batch normalization, SiLU (CBS) module, the Efficient Layer Aggregation Network (ELAN) module, and the Max Pooling (MP) module, facilitating feature extraction. The head network comprises the Spatial Pyramid Pooling, Cross-Stage Partial Channel (SPPCSPC) module, the UpSample module, the ELAN-H module, and the RepVGG block (REP) module, enabling object detection on feature maps. YOLOv7 offers the following advantages: enhanced training and prediction efficiency through the ELAN architecture, and better control of the gradient path. It combines the advantages of the YOLOv5 cross-grid search with the matching strategy of YOLOX. Additionally, YOLOv7 employs a training method that utilizes auxiliary heads, enabling enhanced detection accuracy without extending prediction times.


Improved YOLOv7 Network
To further improve the detection speed and accuracy of the model, this study improved the baseline YOLOv7 network. Figure 5 illustrates the improved YOLOv7 model structure. Firstly, the GhostConv module replaced the Conv module in the original CBS module, in order to improve detection speed. The conventional feature extraction method applies multiple convolution kernels to all channels of the input feature map [32-34]. Stacking convolutional layers in deep networks requires many parameters and significant computation, producing many rich and even redundant feature maps. GhostConv uses a smaller number of convolution kernels for feature extraction on the input feature map and then performs cheaper linear transformation operations on this part of the feature map. This reduces the cost of learning non-critical features, effectively reducing the need for computing resources without affecting the model's performance.

Next, a Convolutional Block Attention Module (CBAM) was added after the last CBS module in the backbone network. This multiplies the output features of the channel attention module and the spatial attention module, element by element, to obtain the final attention-enhanced features [35,36]. These enhanced features are then fed into subsequent network layers, suppressing noise and irrelevant information while preserving critical information.
Finally, the last MP-2 and ELAN modules from the head network were removed; that is, the network removed the unnecessary 20 × 20 × 1024 feature layer and introduced the 160 × 160 × 256 feature layer to pay more attention to the detection of small targets [37].

Performance Evaluation Index
In order to accurately evaluate the performance of the improved YOLOv7 model, we compared three commonly used performance indicators between the baseline YOLOv7 model and the improved YOLOv7 model: precision, recall, and mean average precision (mAP). The model's predictions fall into four states: true positive (TP), a ripe strawberry predicted as ripe (correct); false positive (FP), an unripe strawberry predicted as ripe (incorrect); true negative (TN), an unripe strawberry predicted as unripe (correct); and false negative (FN), a ripe strawberry predicted as unripe (incorrect). Precision and recall are defined as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The AP value is the area under the precision-recall (PR) curve, which simultaneously measures the precision and recall performance indicators. The mAP value is the mean of the average precision; that is, the average precision of all detection categories is totaled and then averaged. The mAP value is defined as follows, where AP_n is the average precision when detecting category n out of N categories:

mAP = (1/N) × Σ AP_n, n = 1, ..., N
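As a minimal illustration, the two indicators can be computed directly from the confusion counts (a sketch with our own function names, not the paper's code):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): fraction of predicted ripe strawberries that are truly ripe."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN): fraction of truly ripe strawberries that are detected."""
    return tp / (tp + fn)

def mean_ap(ap_values):
    """mAP: the mean of the per-category average precision values."""
    return sum(ap_values) / len(ap_values)

# Counts from the laboratory detection experiment reported later in the paper
print(round(precision(95, 1), 3))   # 0.99
print(round(recall(95, 3), 3))      # 0.969
```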

Position of Picking Points

Positioning Method
After identifying the mature strawberries in the images using the deep learning model, the strawberries within each detection frame were cropped into smaller images (Figure 6a). To obtain the binary images and the edge contours of the strawberry fruits, as depicted in Figure 6b and Figure 6c, respectively, we performed threshold segmentation and edge extraction. Because the surface of a strawberry is prone to rotting after squeezing, which affects its quality, it is advisable to avoid touching the fruit during picking. Strawberry stems are located near the fruit axis. With the aid of geometric calculations that find the fruit axis, the picking point can be precisely positioned above the fruit for subsequent harvesting. Point P on the stem represents the ideal picking point (Figure 6d). The process began with the strawberry's binary image and edge contour, in order to identify an optimal picking point close to P. First, the connected-region centroid algorithm was used to calculate the strawberry's centroid O, yielding its pixel coordinates (x_o, y_o). Then, the distance from each point (x_b, y_b) on the strawberry contour below the line MN to the centroid was calculated using the distance formula:

d = sqrt((x_b - x_o)^2 + (y_b - y_o)^2) (5)

The decision not to calculate the distance from the centroid to each point on the whole contour of the strawberry was made because the tips of elevated-substrate strawberries point downward; calculating only the distance from the points below the line MN to the centroid therefore reduces the computational load and improves the practical results. Point A (x_a, y_a), usually near the fruit tip and at the maximum distance d from centroid O, was then identified. Connecting OA and extending it defines the fruit axis. If x_o ≠ x_a, the slope k of the fruit axis is given by

k = (y_a - y_o) / (x_a - x_o)

The distance between the highest and lowest points of the strawberry contour is h. Picking point S (x_s, y_s) was then chosen on the extended axis of the fruit at a vertical distance above point O, typically located near the strawberry stem, such that the remaining stem would be approximately 1-2 cm. Since S lies on the extension of OA, its coordinates satisfy

x_s = x_o + (y_s - y_o) / k
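The geometric steps above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions (the contour is a list of pixel coordinates, MN is taken as the horizontal line through the centroid O, and dy is the chosen vertical offset toward the stem in image coordinates, where y grows downward):

```python
import math

def picking_point(contour, centroid, dy):
    """Locate picking point S on the extended fruit axis OA, dy pixels above centroid O."""
    xo, yo = centroid
    # keep only contour points below the horizontal line MN through O (y grows downward)
    below = [(x, y) for (x, y) in contour if y > yo]
    # point A: the contour point below MN farthest from the centroid (near the fruit tip)
    xa, ya = max(below, key=lambda p: math.hypot(p[0] - xo, p[1] - yo))
    ys = yo - dy                      # move dy pixels up, toward the stem
    if xa == xo:                      # vertical fruit axis: x is unchanged
        return (xo, ys)
    k = (ya - yo) / (xa - xo)         # slope of the fruit axis OA
    xs = xo + (ys - yo) / k           # S lies on the extension of OA through O
    return (xs, ys)

# toy contour whose tip hangs down and to the right of the centroid (5, 5)
print(picking_point([(4, 2), (6, 2), (4, 8), (7, 9)], (5, 5), dy=4))  # (3.0, 1)
```

A real implementation would obtain the contour and centroid from the binary image (e.g. with OpenCV); the sketch isolates only the axis geometry.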

Picking Point Positioning Evaluation
In the previous section, we introduced the positioning method for the strawberry picking point used in this study: the picking point is positioned near the stem. We aimed to determine whether the calculated picking points were valid for ripe strawberries, so we designed the following evaluation method. Under its own gravity, a strawberry hangs downward, and the general direction of the fruit stem is also vertical. When cutting strawberries, we found it best to ensure that the scissor plane of the end effector was perpendicular to the stem; therefore, when the robot picked a strawberry, the end-effector scissor plane approached the stem horizontally. This proved to be simple and effective in keeping the stem between the shear blades. The opening range of the cutting hand was 20 mm. In this study, the picking point was deemed valid if the calculated anchor point was less than 10 mm from the strawberry stem.
The horizontal camera field of view is denoted α, and the horizontal pixel width of the captured image is u. When the picking robot performed successfully, the distance between the camera and strawberry is L, and the actual width of the camera's field of view at that distance is U. The 10 mm physical distance was then converted to a pixel distance in the image, u_0. The following relationships were established:

U = 2L × tan(α/2)

u_0 / u = 10 / U

These two expressions combine to give

u_0 = 10u / (2L × tan(α/2))

A horizontal line segment was drawn, extending u_0 pixels to the left and right of the picking point in the image. If the line segment intersected the stem of the strawberry, the picking point was deemed effective.
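The conversion can be sketched as a small helper. The numbers below are purely illustrative (the camera's actual horizontal RGB field of view would be substituted in practice):

```python
import math

def mm_to_pixels(d_mm, hfov_deg, image_width_px, range_mm):
    """Convert a physical distance d at a given range into an image pixel distance."""
    # width of the scene covered by the image at that range: U = 2L * tan(alpha/2)
    field_width_mm = 2 * range_mm * math.tan(math.radians(hfov_deg) / 2)
    return d_mm * image_width_px / field_width_mm

# e.g. the 10 mm tolerance seen from 400 mm away with a hypothetical 60 degree field of view
u0 = mm_to_pixels(10, 60, 1280, 400)
print(round(u0, 1))  # 27.7 pixels
```

Note that the pixel tolerance shrinks as the camera moves farther from the fruit, which is why the conversion must use the measured range L.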

Calibration Method
The strawberry recognition and positioning methods introduced earlier rely on the RGB-D sensor. The strawberry picking robot needs to establish a connection between the robotic arm and the RGB-D sensor in order to form a complete visual picking robot system. Currently, there are two main ways to mount a camera relative to a robotic arm: eye-in-hand and eye-to-hand. The strawberry picking robot in this study utilized the eye-in-hand mode: the camera was fixed at the end of the robot arm and followed the movement of the robotic arm. When the camera is on the robotic arm, it is necessary to establish the transformation between it and the robot arm; this is called hand-eye calibration.
The robotic arm used in this paper was an RM65-B, a 6-DOF robot from REALMAN ROBOT Co., Ltd., Beijing, China. The RM65-B is a lightweight series robot arm powered by a 24 V lithium battery. Its controller, for controlling the manipulator and external communication, is in the base of the manipulator. The total mass of the robot arm is only 7.2 kg, and it can bear a load of up to 5 kg. The robotic arm length is 850.5 mm, its working space is a sphere with a radius of 610 mm, and the cylindrical space directly above and below the base is the singularity region. The repeated positioning accuracy is ±0.05 mm, and the maximum joint speed is 225° per second.
As can be seen in the eye-in-hand calibration diagram (Figure 7), this system includes the manipulator base coordinate system, the manipulator end coordinate system, the camera coordinate system, and the calibration board coordinate system. Hand-eye calibration obtains the transformation matrix between the camera coordinate system and the robotic arm end coordinate system by calculating the transformations among these coordinate systems. The transformation from the manipulator end coordinate system to the manipulator base coordinate system is T_end^base, denoted A; A is known, as it is obtained from the robot system during hand-eye calibration. The transformation from the camera coordinate system to the manipulator end coordinate system is T_cam^end, denoted X; X is unknown and needs to be solved. The transformation from the camera coordinate system to the calibration board coordinate system is T_cam^cal, denoted B; B is known, as it is obtained by camera calibration. The transformation from the calibration board coordinate system to the manipulator base coordinate system is T_cal^base. The relative position of the manipulator base and the calibration board does not change during the calibration process, so this transformation matrix does not change.
Eye-in-hand calibration involves mounting the camera and end effector on the robotic arm, fixing the checkerboard calibration plate, and moving the robotic arm so that the camera photographs the checkerboard from different directions. Because the relationship between the robot arm base and the calibration plate is fixed, a relationship is established when the robot arm reaches poses 1 and 2:

A_1 X B_1 = A_2 X B_2

This can be converted to

(A_2^-1 A_1) X = X (B_2 B_1^-1)

which is the classic AX = XB problem, with A = A_2^-1 A_1 and B = B_2 B_1^-1. Solving it yields the homogeneous transform

X = [R T; 0 1]

where R is a 3 × 3 matrix associated with rotation and T is a 3 × 1 vector associated with translation.
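The fixed-chain argument can be checked numerically. The sketch below (our own illustration using NumPy, not the paper's code) builds two consistent poses from a known X and confirms that A = A2^-1 A1 and B = B2 B1^-1 satisfy AX = XB; a real calibration would instead recover X from many such pose pairs, for example with OpenCV's calibrateHandEye.

```python
import numpy as np

def rt(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# ground-truth hand-eye transform X (camera -> arm end), chosen arbitrarily
X = rt(rot_z(0.3), [0.02, -0.01, 0.05])
# the board -> base transform is fixed throughout calibration
T_cal_base = rt(rot_z(1.0), [0.5, 0.2, 0.1])

A_poses, B_poses = [], []
for ang, t in [(0.1, [0.4, 0.0, 0.3]), (0.7, [0.3, 0.1, 0.35])]:
    A = rt(rot_z(ang), t)                    # end -> base, read from the robot
    # camera-frame board pose follows from the fixed chain T_cal_base = A X B
    B = np.linalg.inv(A @ X) @ T_cal_base    # what camera calibration would return
    A_poses.append(A)
    B_poses.append(B)

A1, A2 = A_poses
B1, B2 = B_poses
# the invariance A1 X B1 = A2 X B2 rearranges to the classic AX = XB form
A_rel = np.linalg.inv(A2) @ A1
B_rel = B2 @ np.linalg.inv(B1)
print(np.allclose(A_rel @ X, X @ B_rel))  # True
```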

Calibration Error
This study conducted an error analysis experiment to apply the deep learning model and RGB-D sensor to the picking robot (Figure 8). A 16:9 error-experiment target, matching the pixel aspect ratio of the depth camera's RGB module, was designed in-house. The target consisted of 5 × 5 equidistant red circles with black center points, printed in color on A3 paper and mounted on a well-lit wall. The depth camera was mounted at the end of the robotic arm. The arm was adjusted so that the optical center of the depth camera's RGB module aligned with the center point of the error-experiment target. Concurrently, the distance between the camera and the target was adjusted to ensure that the target was entirely within the camera's field of view; this distance was recorded, and the camera's IMU module was checked to adjust the camera plane so that it was parallel with the target plane. The coordinate system at the end of the mechanical arm is shown in Figure 8: the X and Y axes are parallel to the error-experiment target, and the Z axis is perpendicular to it. We enabled the camera to recognize each red circle and to calculate its centroid. The centroid coordinates were then translated to the coordinate system at the end of the robot arm and recorded; these coordinates represent the actual picking point. Then, we used the robot arm's teaching device to control the end of the arm so that it touched the black center point of each red circle in the same posture, and we recorded the coordinates; these coordinates represent the theoretical picking point.


Strawberry Detection
The dataset was randomly divided into a training set (2100 images) and a validation set (900 images) in a 7:3 ratio. The input images were resized to 640 × 640 pixels. During model training, we found that if the learning rate was too small, convergence was too slow, while if the learning rate was too large, the loss fluctuated significantly; so, finally, we chose an initial learning rate of 0.001. The number of training epochs was 300, the batch size was eight, and Adam was selected as the training optimizer. We trained the baseline YOLOv7 model and the improved YOLOv7 model to develop an elevated-substrate strawberry recognition model. The YOLOv7 model and the improved YOLOv7 model trained for 518 min and 373 min, respectively.

We compared the performance of the baseline YOLOv7 model to that of the improved YOLOv7. As illustrated by the trends in Figure 9, the prediction accuracy of the baseline YOLOv7 model was 98.0%, and its recall rate was 99.4%; its mAP@0.5 was 99.6%, and its mAP@0.95 was 95.4%. The prediction accuracy of the improved YOLOv7 model was 98.8%, and its recall rate was 99.2%; its mAP@0.5 was 99.8%, and its mAP@0.95 was 96.8%. We also compared the parameters of the two models (Table 3). The baseline model had 37.2 million parameters, 105.1 G floating point operations (GFLOPs), and a model size of 74.8 MB. The improved YOLOv7 model had 15.0 million parameters, 61.2 GFLOPs, and a model size of 30.5 MB. The improved YOLOv7 model is significantly streamlined in terms of model size and complexity: the number of parameters was only 40.3% of that of the original model, the computation was reduced by 41.8%, and the model size was reduced by 59.2%. Reducing the model size can also lower memory usage during inference. The processing times of the two models were compared according to the frame rate (frames per second, FPS) at inference time to confirm that the reduced model size also reduced inference time. The original model operated at 18.7 FPS; the improved model operated 3.6 FPS higher, reaching 22.3 FPS. According to the models' performances and their various parameters, the improved YOLOv7 model proved superior to the baseline YOLOv7 model in detecting elevated-substrate strawberries.
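As a quick sanity check, the reported percentages follow directly from the two models' figures (a simple calculation on the numbers above, not from the paper's code):

```python
base_params, new_params = 37.2e6, 15.0e6   # parameter counts
base_flops, new_flops = 105.1, 61.2        # GFLOPs
base_size, new_size = 74.8, 30.5           # model size, MB
base_fps, new_fps = 18.7, 22.3             # inference frame rate

print(round(100 * new_params / base_params, 1))         # 40.3  (% of original parameters)
print(round(100 * (1 - new_flops / base_flops), 1))     # 41.8  (% computation reduction)
print(round(100 * (1 - new_size / base_size), 1))       # 59.2  (% size reduction)
print(round(100 * (new_fps - base_fps) / base_fps, 1))  # 19.3  (% frame-rate increase)
```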
In this study, research was conducted in a laboratory that simulated an elevated-substrate strawberry environment, using the improved YOLOv7 model. As can be seen in the confusion matrix in Figure 10, a total of 98 ripe strawberries grew in this environment: 95 were successfully detected, three were missed, and one false detection was registered. Thus, TP = 95, FP = 1, TN = 0, and FN = 3, giving a detection precision of 99.0% and a recall of 96.9%. Figure 11 shows the detection results of the model, which accurately identified ripe strawberries both without occlusion and when occluded by shading and hay. These results provide a sound basis for the subsequent localization of strawberry picking points and robotic picking. Compared to the convolutional neural network method used by Habaragamuwa et al. [23] and the improved convolutional neural network method used by Lamb et al. [30], the improved YOLOv7 model increased accuracy by 11% and 14.8%, respectively. We conclude that the improved YOLOv7 method for identifying ripe strawberries proposed in this paper is highly accurate and is well suited to identifying and detecting strawberries.
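The precision and recall figures above follow from the standard definitions applied to the confusion-matrix counts; a minimal sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the Figure 10 confusion matrix: TP = 95, FP = 1, FN = 3.
p, r = precision_recall(tp=95, fp=1, fn=3)
# p ≈ 0.990 (99.0%), r ≈ 0.969 (96.9%)
```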

Position of Picking Points
After detecting ripe strawberries with the improved YOLOv7 model, each detected strawberry was cropped into a small image using the boundary coordinates of its bounding box. After threshold segmentation and edge extraction of each small image, the binary image and the edge contour of the strawberry fruit were obtained. Then, the fruit axis and picking point of each strawberry were obtained by the strawberry picking point positioning method.
According to the camera parameters, the horizontal field of view α = 69° and the horizontal resolution u = 1280 pixels. When the picking robot performed successfully, the distance between the camera and the strawberry was 30-40 cm. It can then be calculated that an actual distance of 10 mm corresponds to a pixel distance u0 in the image of 23 to 31 pixels. A line segment 23 pixels long was drawn around the positioned picking point; if the line segment intersected the fruit stem of the strawberry, the picking point was considered to be successfully positioned.
Figure 12 shows the fruit axes and picking points of strawberries obtained by the positioning method, together with the line segments used to judge the effectiveness of the picking points. Under the conditions of no occlusion and slight occlusion, the picking point positioning method demonstrated its effectiveness: the successful positioning rate was 90.8%, and the average positioning time was 76 ms.
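The millimetre-to-pixel conversion above can be reproduced with a simple pinhole-camera calculation; this is a sketch under the stated camera parameters (69° horizontal field of view, 1280-pixel horizontal resolution), and the function name `pixel_distance` is ours:

```python
import math

def pixel_distance(real_mm, depth_mm, hfov_deg=69.0, h_res=1280):
    """Convert a real-world horizontal length to pixels at a given depth."""
    # Width of the scene covered by the image at this depth (pinhole model).
    scene_width_mm = 2.0 * depth_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return real_mm * h_res / scene_width_mm

# 10 mm at the far (400 mm) and near (300 mm) working distances:
far = pixel_distance(10, 400)   # ≈ 23 pixels
near = pixel_distance(10, 300)  # ≈ 31 pixels
```

This matches the 23-31-pixel range quoted in the text, and explains why the conservative 23-pixel segment length is used for the intersection test.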
Compared to the instance segmentation methods proposed by Yu et al. [24] and Perez-Borrero et al. [27], this model has advantages in terms of its speed in identifying and locating picking points. Compared to the YOLOv5 method of direct fruit stem detection adopted by Lemsalu et al. [26], and the semantic segmentation model adopted by Kim et al. [28], the positioning accuracy of the picking points is improved by this model. The examples of location failure in this study were all strawberries with severe occlusion. When the fruit was severely occluded, neither our picking point positioning method nor the other methods could locate the picking point accurately. Picking point positioning for fruit under severe occlusion requires further research.

Calibration Error
In each hand-eye calibration error test, the camera and the robot arm were adjusted according to this method, the center of each red circle was identified, and the end of the robot arm was controlled so that it touched the center of the red circle; in this way, the coordinates of 25 actual and theoretical picking points within the camera's field of view were obtained. Table 4 shows the coordinates of the actual picking points and the theoretical picking points in each area within the camera's field of view. Comparing the two coordinate values of each group shows little difference between them, indicating that the actual and theoretical coordinates are relatively close. By taking the absolute value of the difference of each group of coordinates, the errors for each group of experiments on the X-, Y-, and Z-axes were obtained; the error values are visualized in Figure 13. Within the camera's field of view, the maximum error between the actual and theoretical picking point coordinates in the X-axis direction is 5.5 mm, with an average error of 3.6 mm; the maximum error in the Y-axis direction is 1.6 mm, with an average error of 0.7 mm; and the maximum error in the Z-axis direction is 2.9 mm, with an average error of 1.5 mm. The assembly error in the X-axis direction is significant when installing the end effector, whereas the Y-axis direction is close to the flange and its assembly error is small, resulting in a larger error on the X-axis than on the Y-axis. In addition, during hand-eye calibration, the calibration plate is not perfectly flat, which can produce different errors on different axes. However, considering the maximum travel distance of the end effector of 20 mm, the error range is much smaller than the opening distance of the end effector; it can therefore meet the picking accuracy requirements of the robot and achieve effective strawberry picking.
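The per-axis error analysis above reduces to taking the absolute difference between paired actual and theoretical coordinates and reporting the maximum and mean. A minimal sketch, with a hypothetical `axis_errors` helper and illustrative values standing in for the 25 measured points of Table 4 (not the paper's actual data):

```python
def axis_errors(actual, theoretical):
    """Maximum and mean absolute error between paired coordinates on one axis."""
    errs = [abs(a - t) for a, t in zip(actual, theoretical)]
    return max(errs), sum(errs) / len(errs)

# Hypothetical X-axis samples (mm), for illustration only:
actual_x = [102.1, 97.5, 110.3, 95.0]
theory_x = [100.0, 99.0, 105.5, 96.2]
max_err, mean_err = axis_errors(actual_x, theory_x)
```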

Robot Picking
To test the strawberry recognition and positioning method proposed in this study, based on the improved YOLOv7 and RGB-D sensing, picking experiments were carried out in a simulated elevated-substrate strawberry scene built in a laboratory. In these experiments, we fixed the posture of the end effector of the robotic arm, and each time the end effector approached the strawberry stem with the same posture and grasped it. Because the simulated strawberries were plastic, the end effector could not actually cut the stems; therefore, a strawberry was considered successfully picked when the two fingers of the end effector gripped its stem (Figure 14). We carried out four picking experiments, and the results are shown in Table 5. According to the tallied statistics, there were 124 strawberries in the experiments, of which 98 were ripe, 25 were immature, and 14 were occluded. In total, 89 strawberries were successfully picked, giving a picking success rate of 90.82%, and the average execution time for picking each strawberry was 7.5 s. The picking failures were due to the occlusion and stacking of strawberries; subsequent recognition and positioning algorithms need further study to address these challenges. The picking success rate of Feng et al.'s picking robot was 84%, with an average picking time of 10.7 s [8]. The success rate of the strawberry picking robot developed by Parsa et al. was 83% [11]. Cui et al.'s picking robot took 16.6 s to pick a single strawberry, with an accuracy rate of 70.8% [22]. Compared to the strawberry picking robots in these previous studies, the strawberry recognition and positioning method proposed in this study, based on the improved YOLOv7 model and RGB-D sensing, can effectively improve the success rate and reduce the picking time when applied to actual picking. The picking robot in this study was designed to mount two robotic arms and pick strawberries on both sides simultaneously; the picking speed will be further improved in subsequent research.

Figure 1. The model of the strawberry scene.


Figure 4. Front view of LabelImg. Experiments in model development were conducted on a Windows 11 laptop using PyTorch and PyCharm with Python 3.8 and an NVIDIA GeForce RTX 3070 GPU. Table 2 summarizes the hardware and software configurations for model development.


Figure 6. Positioning process of the strawberry picking point. (a) Small image; (b) binary image; (c) edge contour; (d) schematic diagram of picking point location.

Figure 11. Detection results with no and slight occlusion.


Figure 12. Positioning result with no and slight occlusion.

Table 1. Parameters of the Intel RealSense D435i camera.



Table 2. Hardware and software configurations for model development.


Table 3. Performance of two detection models.


Table 4. Theoretical and actual coordinate data of hand-eye calibration.


Table 5. Results of the picking experiment.