Real-Time Detection and Location of Potted Flowers Based on a ZED Camera and a YOLO V4-Tiny Deep Learning Algorithm

In order to realize the real-time and accurate detection of potted flowers on benches, in this paper we propose a method based on the ZED 2 stereo camera and the YOLO V4-Tiny deep learning algorithm for potted flower detection and location. First, an automatic detection model of flowers was established based on the YOLO V4-Tiny convolutional neural network (CNN) model, and the center points of the flowers on the pixel plane were obtained from the prediction box. Then, the real-time 3D point cloud information obtained by the ZED 2 camera was used to calculate the actual position of the flowers. The test results showed that the mean average precision (mAP) and recall rate of the trained model were 89.72% and 80%, respectively, and the real-time average detection frame rate of the model deployed on a Jetson TX2 was 16 FPS. The occlusion experiment showed that when the canopy overlap ratio between two flowers is more than 10%, the recognition accuracy is affected. The mean absolute error of the flower center location based on the 3D point cloud information of the ZED 2 camera was 18.1 mm, and the maximum locating error of the flower center was 25.8 mm under different light radiation conditions. The method in this paper establishes the relationship between the detected flower targets and their actual spatial locations, which is of reference significance for the mechanized and automatic management of potted flowers on benches.


Introduction
Floriculture refers to practices involving the growing of cut flowers, potted flowering plants, foliage plants, and bedding plants in greenhouses and fields. Floriculture products have the highest profit per unit area among agricultural products [1]. As the population ages, the future of floriculture production relies on a balance between labor and technology [2], so mechanization and automation of floriculture management are important for solving the problem of labor shortage. With the development of robots and automatic manipulators, precision management equipment can be developed to realize automatic transplantation, grading, and harvesting in floriculture [3]. For these tasks, target detection and location are essential, so the real-time and accurate detection of potted flowers is necessary [4].
Machine vision technology has been used to achieve the precise management of flowers. Through machine vision, researchers can realize flower detection [5,6] and classification [7] and count the number of flowers [8,9] in a specific area. Compared with manual methods, this approach can improve economic efficiency [10]. Therefore, the counting and handling of flowers can be performed by machine vision, which can optimize the management process of flowers to a large extent. For the effective detection and tracking of flower targets, Zhuang et al. (2018) proposed a robust detection method for citrus fruits based on a monocular vision system and adaptive enhancement of the red and green chromatic information.
The YOLO model divides the input image into grids and, in a single pass, predicts the bounding box, positioning confidence, and probability vectors of all categories of targets contained in all grids, so it has good real-time performance. To meet real-time and accuracy requirements on mobile terminals and edge computing devices, YOLO V4-Tiny was obtained by simplifying YOLO V4 [22]. It is a single-stage target detection model that achieves a good balance between accuracy and speed, and its trained weight file is small enough to be transferred to a mobile terminal [30]. The structure of the model is shown in Figure 1a. The convolution layers of the CSP (Cross Stage Partial connections) module were compressed, and the number of FPN (feature pyramid network) output scales is set to 2.
The ZED 2 stereo camera is an integrated binocular camera that adopts advanced sensing technology based on stereo vision and provides video acquisition, depth information acquisition, and real-time position information. It has been applied to object reconstruction [31], position acquisition [32][33][34], etc.
The processes for potted flower detection based on the ZED 2 stereo camera and YOLO V4-Tiny are shown in Figure 1b. In this study, a manually labeled flower dataset was constructed to train the YOLO V4-Tiny CNN model. The model detects the flowers and marks the prediction box, and the flower center point on the pixel plane is obtained from the prediction box. By matching the RGB image and the depth point cloud image collected by the ZED 2 stereo camera, the location coordinates of the flowers can be obtained.

Data Collection for YOLO V4-Tiny Model Training
The images were collected from Changshu Jiasheng Agricultural Co. LTD (120°38′47″, 31°33′44.24″) on 26 March 2019 and from Qiyi Flower Co. LTD (119°23′38″, 32°0′18″) on 9 January 2021, respectively. The pictures of the flowers were collected by a digital camera (Canon EOS M3, Canon Co. LTD, Tokyo, Japan) placed about 100 cm above the flower canopy, so the flower canopy is mainly located in the center of the image. Figure 2 shows images of poinsettias and cyclamens. Considering the influences of different light intensities in natural environments and of the camera parameters, 1000 images of poinsettias and cyclamens under different light intensities, camera parameters, and shooting angles were selected during data collection. In order to increase the number of training images and to prevent over-fitting during training, the selected images were mirrored, shrunk, enlarged, rotated, and affine-transformed to enhance the dataset, as shown in Figure 3. The images were then screened again, and flower images that were too repetitive or in which flowers were lost due to the data augmentation were deleted. The number of final images for model training was 3000.
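As a concrete illustration of the mirror/rotate/scale augmentations described above, a minimal sketch using NumPy array operations (the paper does not specify its augmentation tooling; the function name and the 2x scale factor are illustrative, and affine transforms would need e.g. OpenCV):

```python
import numpy as np

def augment(image):
    """Generate simple augmented variants of an image array of shape (H, W, 3):
    a horizontal mirror, a 90-degree rotation, and a 2x nearest-neighbour
    enlargement."""
    mirrored = np.fliplr(image)                                   # left-right mirror
    rotated = np.rot90(image)                                     # 90-degree rotation
    enlarged = np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)  # 2x upscale
    return mirrored, rotated, enlarged
```

Each variant keeps the flower canopy visible while changing its pixel layout, which is what makes these operations useful against over-fitting.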

Model Training
In order to train the CNN model, an annotated XML dataset based on the PASCAL VOC data format was constructed by labeling the images with the LabelImg software. Two categories, cyclamen and poinsettia, were defined in LabelImg, and the outline of each flower was enclosed with a rectangular box to label the target, as shown in Figure 4. The application generated the coordinates of the four vertices of the rectangular box. The labeled dataset was then imported into YOLO for model training. In addition, in order to accelerate model convergence and reduce training time, the YOLO V4-Tiny weight file (yolov4-tiny.conv.29) without a fully connected layer was used for transfer learning. This weight file was obtained by training on the 80 categories of the COCO dataset. Shared-parameter-based transfer learning consists of training the weights of the feature extraction part of the YOLO layers in advance to find common parameters or prior distributions between the spatial models of the source data and the target data (for the purpose of transfer).
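The VOC-format annotations produced by LabelImg are plain XML, so the box coordinates can be read back with the standard library. A minimal sketch (the tag layout follows the standard PASCAL VOC format; the example values are made up):

```python
import xml.etree.ElementTree as ET

def parse_voc(xml_text):
    """Parse a PASCAL VOC annotation string and return a list of
    (label, xmin, ymin, xmax, ymax) tuples, one per labeled object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

example = """<annotation>
  <object><name>cyclamen</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>360</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""
print(parse_voc(example))  # [('cyclamen', 120, 80, 360, 300)]
```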
CNN model training was carried out on a computer with an Intel Core i5-9300H CPU, an NVIDIA GeForce GTX 1650 GPU, and 16 GB of memory; CUDA and cuDNN were used to accelerate the calculations during training.


Real-Time Detection Based on the ZED 2 Camera and the Jetson TX2
In order to obtain the real-time location of flowers, a detection system was constructed with the ZED 2 stereo camera (Stereolabs Inc., San Francisco, CA, USA) and the Jetson TX2 (NVIDIA Co., Santa Clara, CA, USA) AI computing module. The ZED 2 camera was used to obtain an RGB image and a depth point cloud of the flowers in real-time, and the Jetson TX2 computing module processed the RGB image of the flowers and obtained the plane position of the flowers with the trained CNN model transferred from the computer. The spatial location of the flowers was obtained by matching the RGB image and the depth point cloud obtained by the ZED 2 camera. Because the RGB and depth images can have different dimensions and resolutions, both were set to 1080p at 30 FPS with the right-hand coordinate system. In this mode, the ZED 2 balances depth accuracy and processing time [32].

Plane Location Based on the YOLO V4-Tiny Detection Result
Because the canopies of the cyclamen and poinsettia are approximately round or elliptic, the prediction box of the flowers obtained by YOLO V4-Tiny is close to the outline of the flowers. Therefore, the center point of the prediction box obtained by YOLO V4-Tiny can be used as the central coordinate of the flower in the plane coordinate system. Figure 5 shows the plane relation of the flower central coordinate. The point (µ0, v0) is the vertex coordinate of the prediction box detected by the YOLO V4-Tiny model, and W and H are the width and height of the prediction box, respectively. Accordingly, the center point coordinate p (µ1, v1) can be calculated as

µ1 = µ0 + W/2, v1 = v0 + H/2. (1)
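The center-point relation of Equation (1) can be expressed directly in code. A minimal sketch (the function name is illustrative):

```python
def box_center(u0, v0, w, h):
    """Center of a YOLO prediction box given its top-left vertex (u0, v0)
    and its width w and height h, per Equation (1):
    u1 = u0 + W/2, v1 = v0 + H/2."""
    return u0 + w / 2.0, v0 + h / 2.0
```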

Spatial Location Based on the ZED 2 Stereo Camera
The depth sensor of the ZED 2 camera was used to obtain the 3D point cloud of the flowers. The spatial correspondence between the pixel plane of the flowers and the camera is shown in Figure 6.
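Once the flower center (u, v) is known, the 3D position can be read from the point cloud at that pixel. A minimal, hedged sketch assuming the point cloud has been exported to an H x W x 3 NumPy array of XYZ values with NaN where depth is invalid (the ZED SDK can provide such a measure; the NaN-aware median window is our own robustness choice, not described in the paper):

```python
import numpy as np

def locate(point_cloud, u, v, win=2):
    """Look up the 3D coordinate of pixel (u, v) in an H x W x 3 point cloud,
    taking a NaN-aware median over a (2*win+1)-pixel window so that isolated
    invalid depth returns do not break the lookup. Returns None if no valid
    point exists in the window."""
    patch = point_cloud[max(v - win, 0):v + win + 1,
                        max(u - win, 0):u + win + 1].reshape(-1, 3)
    patch = patch[~np.isnan(patch).any(axis=1)]  # drop invalid returns
    return None if patch.size == 0 else np.median(patch, axis=0)
```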

Detection Accuracy Affected by a Different Overlap Ratio
With the growth of the flowers, the plant canopies overlap each other, and the overlap ratio affects the accuracy of the detection result. Figure 7 shows the overlapped flowers. The widths and heights of the two flowers' canopies are w1, w2, h1, and h2, respectively, and w3 and h3 are the total width and height of the two flowers. The overlap ratio (s) is defined in Equation (2). In order to determine the minimum distance between the two flowers, the distance between them was gradually adjusted; when the two potted flowers were detected as one potted flower, the distance between them was defined as the minimal distance between the two potted flowers, and the s calculated by Equation (2) is the maximal overlap ratio.
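Equation (2) itself did not survive extraction here. One plausible reconstruction, treating the overlap of the two canopy bounding rectangles as a rectangle of width w1 + w2 - w3 and height h1 + h2 - h3 and normalizing by the combined extent (the normalization is an assumption and should be checked against the original):

```latex
s = \frac{(w_1 + w_2 - w_3)\,(h_1 + h_2 - h_3)}{w_3\, h_3}
```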


Detection Accuracy Affected by Natural Light
The detection of the flowers with the ZED 2 camera was implemented in a greenhouse, where the radiation of the natural environment affects the detection and location accuracy. In order to quantify the effect of radiation, the detection and location results for 15 pots of flowers were collected with the ZED 2 camera and Jetson TX2 at 9:00, 13:00, 15:00, and 17:00 in the greenhouse, while the radiation density was measured at the same time with a LightScout quantum light sensor 3668I (Spectrum Tech., Inc., Haltom, TX, USA).

Training Results of the YOLO V4-Tiny Model
The results of the CNN model training over 4000 iteration steps are shown in Figure 8. When the number of training iterations reached about 3200, the loss fluctuated slightly around 0.2. Performance statistics were computed for the final trained weights; the results are shown in Table 1 and mainly include the average precision (AP), recall, intersection over union (IoU), and mean average precision (mAP) at the threshold mAP@IoU = 0.50. The training results meet the detection accuracy requirements.
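For reference, the IoU underlying the mAP@IoU = 0.50 threshold measures how well a prediction box overlaps a ground-truth box. A minimal sketch for axis-aligned boxes (the coordinate convention (xmin, ymin, xmax, ymax) is assumed):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as
    (xmin, ymin, xmax, ymax). A prediction counts as correct at
    mAP@IoU = 0.50 when iou(pred, truth) >= 0.5."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333...
```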


By transferring the trained CNN model to the Jetson TX2 computing module and collecting the RGB flower images with the ZED 2 camera, the average detection frame rate reached 16 FPS (frames per second). This is greater than 12 FPS [35] and thus meets the real-time requirements.

Spatial Location Results
In order to evaluate the spatial location results of the CNN-based method, the SORT (simple online and real-time tracking) model [36] was used as a baseline for location accuracy. Figure 9 shows the y-direction location results of the two models for y-direction distances from 150 mm to 300 mm and z-direction distances from 115 to 130 mm. The average location errors of the YOLO model at the different z-direction distances are 23.41 mm, 31.34 mm, 40.8 mm, and 38.47 mm, respectively, and those of the SORT model are 43.27 mm, 47.91 mm, 53.00 mm, and 58.58 mm, respectively. The results show that the location accuracy of the YOLO model is clearly higher than that of the SORT model. The variation of the YOLO model's error with the z-direction distance indicates that the crop height affects the location accuracy.
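The mean absolute error used to compare the two models can be computed as follows (a generic sketch; the paper does not show its error formula explicitly):

```python
def mean_abs_error(predicted, actual):
    """Mean absolute error (in mm) between predicted and ground-truth
    positions, the metric used to compare the YOLO- and SORT-based
    location results."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```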


Detection Results of Different Overlap Ratio
The minimum distance between flowers and the canopy overlap ratio (s) are shown in Figure 10. The minimum distance between flowers for detection was 18-27 cm, with an average value of 23 cm. The maximum overlap ratio of the flower canopies ranged from 10.7% to 28.92%, with an average value of 17.42%. Therefore, for poinsettias and cyclamens, false detection can occur when the canopy overlap ratio exceeds 10%. For mature cyclamens and poinsettias, the minimum distance between two pots of flowers should be 27 cm to prevent false detection.

Table 2 shows the detection results with varying levels of radiation. In the greenhouse test, the radiation was 102, 408, 211, and 27 W/m² at 9:00, 13:00, 15:00, and 17:00 on 31 March 2021, respectively. When the radiation was 27 W/m², two pots of flowers were not detected; when the radiation was 102 W/m², one pot of flowers was not detected; when the radiation was higher than 200 W/m², all flowers were detected.
Figure 11 shows the location results with varying levels of radiation. The average location errors are 25.8, 13.1, 11.7, and 4.6 mm, respectively. The results show that the detection and location accuracy gradually increases with radiation and that detections are missed at low radiation. Therefore, the working time for flower detection and location should be from 9:00 to 16:00, when the radiation is higher. Table 2. Detection numbers of the flowers with varying levels of radiation at different times on a sunny day in the greenhouse.

Discussion
In a floriculture greenhouse, the detection and location of the potted flowers are affected by the natural environment and the density of the flower. Meanwhile, real-time detection and location are important for machinery and automatic management. In this study, we established a method for the real-time detection and location of potted flowers based on the ZED 2 camera and the YOLO V4-Tiny deep learning algorithm.
At present, real-time methods based on machine vision mainly focus on the detection of crops [8,12,37], with less attention paid to real-time location [34,38]. In this study, the real-time detection and location of flowers were preliminarily established using the YOLO V4-Tiny algorithm as the detection algorithm. After training, the CNN model was transferred to the edge computing module Jetson TX2 to realize flower detection and location. Compared with other CNN models, YOLO V4-Tiny has a fast detection speed and a high detection accuracy [25]. It also keeps the model simple, making it easy to transfer for real-time computing on a mobile terminal [39]. Secondly, to improve the real-time performance, the Jetson TX2 AI computing module was adopted as the running environment [40]. Finally, to obtain real-time RGB and depth information about the flowers, we used a ZED 2 camera to collect 3D point clouds of the flowers in real-time, and the ZED 2 matched the RGB images and 3D point clouds automatically [32]. With these methods and devices, the average detection frame rate reached 16 FPS, greater than the 12 FPS threshold for real-time camera tracking [35]. Therefore, the accuracy and real-time performance of the system can meet the requirements of detection and location for precise management.
In the process of flower detection and location, in addition to real-time performance, the accuracy of detection and location is also important, since positioning accuracy determines the quality of subsequent management. Lee et al. (2017) proposed a method for identifying and retrieving flower species in a natural environment based on multi-layer technology, identifying different types of flowers through color, texture, and shape characteristics; after testing, the image recognition rate was 91.26% [41]. Tian et al. (2019) detected apple flowers with a Single Shot MultiBox Detector (SSD) algorithm, with an average detection accuracy of 87.40% [42]. Yamamoto et al. (2014) proposed a method that combines RGB digital cameras and machine learning to identify tomato fruits; the identification accuracy was 80% after testing [43]. The above research shows that the accuracy of current algorithms for flower detection is mostly between 80% and 95%. The experimental results of our study show an accuracy of 89.72%, an improvement in detection accuracy. With the detection results of the CNN model, by matching the RGB images and 3D point clouds for the spatial location, the average location error was less than 40 mm at different crop heights, and the comparison with the SORT method shows that the YOLO model has clearly higher accuracy. Using the CNN detection model, the flower canopy width and the number of flowers on each plant can be obtained, and with the depth image, the morphology of the flowers can be analyzed [44]. These results can be used to realize phenotyping analysis of the flowers for the detection of growth status, quality, and grading in floriculture, and a robot or manipulator could be developed with the detection, location, and grading results to realize precise and unmanned management [45].
To improve the detection and location accuracy, the influencing factors of the natural environment were determined. The results show that the accuracy of detection and location was reduced when the light was weak. Thus, the operation should be implemented under sufficient light conditions or with supplementary light [46]. Regarding the distance and canopy overlap between two flowers, false detection occurred when the overlap ratio exceeded 10%, and the minimum distance required between the two flowers to avoid false detection was 27 cm. Moreover, in poinsettia and cyclamen culture, the distance between flowers on the bench is 30 cm [47][48][49], so this method meets the requirements for potted flower detection.

Conclusions
In this study, a potted flower detection and location method based on the ZED 2 stereo camera and a CNN deep learning model was proposed. The CNN model based on YOLO V4-Tiny was used to detect the flowers. With the detection results, and using the RGB images and 3D point clouds collected by the ZED 2 stereo camera, the spatial location of the flowers was implemented. With the Jetson TX2 computing module, real-time and accurate detection and location were achieved, and the results show that the system has sufficient detection and location accuracy, an acceptable real-time tracking frame rate, and an appropriate distance and canopy overlap ratio for natural environments with potted flower cultures. It provides a detection and location method for the machinery and automatic management of floriculture.
Although the CNN model realized flower detection and location, we considered only two types of flowers with similar canopy styles. It is necessary to train the model with other flowers, different growth backgrounds, and different lighting conditions to improve its performance. Moreover, with the Jetson TX2 AI computing device, deep learning is achieved by transferring a CNN model trained on a computer, so the on-device learning ability is limited. In the future, by combining cloud computing with edge computing, a cloud-edge collaborative framework could achieve real-time and automatic learning for flower detection and location [39,50].