1. Introduction
With the growth of online shopping, an increasing number of people buy fresh-cut flowers online, which places higher demands on their preservation period. However, flower sorting is still carried out mainly by hand, which is subjective and tiring and can shorten the preservation period of the flowers. According to statistics, the post-harvest processing loss of fresh-cut flowers can reach 31.88%, with sorting loss accounting for 21.74% [1]. Therefore, a more automatic and accurate fresh-cut flower sorting method is required. Computer vision is widely used for object detection because it can recognize objects from images. By collecting fresh-cut flower images and applying intelligent algorithms, computer vision can achieve accurate and efficient sorting, greatly reducing sorting loss and improving economic benefits.
Early research used machine learning methods in computer vision to extract features and recognize flowers. These methods included K-nearest neighbor [2], the random forest (RF) algorithm [3], stochastic gradient descent (SGD) [4], and support vector machines (SVMs) [5], all of which demonstrated satisfactory results in flower recognition. Some researchers combined methods to extract more features for recognition. For instance, Soleimanipour A. et al. [6] used principal component analysis (PCA), linear discriminant analysis (LDA), and SVM to classify flowers, achieving an accuracy of 99.50%. Patel I et al. [7] explored new morphological feature extraction methods and classified flowers using multiple kernel learning SVM, reaching an accuracy of 76.92%. Although good results have been achieved, these methods require manual feature extraction.
In recent years, deep learning technology based on convolutional neural networks (CNNs) has gained significant attention because it extracts features automatically. Tian M et al. [8] used a CNN classification model and a softmax classifier to classify 17 types of flowers, achieving a precision of 92%. Anjani I A et al. [9] employed a CNN with dropout to streamline an automatic rose sales system, obtaining an accuracy of 96.33% on their test data. Cibuk M et al. [10] applied a hybrid classification method to flower classification, achieving an accuracy of 96.39%. These studies used CNNs and achieved good results, but the methods did not run in real time.
Currently, the anchor-based framework has become a research hotspot for object detection. Depending on whether classification and localization are performed in a single pass, anchor-based detectors can be divided into two-stage and one-stage algorithms. Two-stage algorithms mainly include the region-convolutional neural network (R-CNN) series [11, 12, 13], while one-stage algorithms mainly include the You Only Look Once (YOLO) series [14, 15, 16] and the single-shot detection (SSD) series [17, 18]. Building on the YOLO series network, Krishna K P et al. [19] proposed a panoramic driving perception system, which can simultaneously perform traffic target detection, drivable area segmentation, and lane detection. Gao Y L et al. [20] used the YOLOv8MS network to explore automatic cultivation of corn, achieving a mean average precision (mAP) of 89.6% and a multiple object tracking accuracy (MOTA) of 92.5%. One-stage object detection methods are simple and stable, and their fast detection speed makes them suitable for flower sorting systems. However, they are less accurate for difficult-to-distinguish maturity levels in flower grading.
Adding depth information through red–green–blue depth (RGBD) images enables better differentiation of difficult-to-distinguish maturity levels. Sun X et al. [21] proposed a flower quality grading method based on deep learning and depth information. For diana rose, an improved InceptionV3 network with RGBD input was used for grading, achieving a grading accuracy of 98%. Fei Y et al. [22] classified the maturity of flame roses using depth information. Traditional image segmentation was first conducted to obtain edge information of the fresh-cut flowers, followed by bract segmentation. The lightweight ShuffleNetV2 network was then used for recognition, achieving a classification precision of 98% on the RGB flower dataset and 99% on the RGBD flower dataset. To enhance the efficiency and accuracy of the flower sorting system, this paper presents a real-time, high-precision end-to-end method. The main contributions of this paper are as follows:
Firstly, an RGBD flower sorting dataset was produced from RGBD images to improve accuracy on difficult-to-distinguish maturity levels.
Secondly, the multi-task and multi-dimension You Only Look Once (MTMD-YOLO) network was constructed to realize an end-to-end flower sorting system: the feature fusion was simplified to increase training speed, the detection head and non-maximum suppression (NMS) [23] were improved to suit the flower sorting dataset, and a loss function for the maturity task was added so that each task is trained separately.
Lastly, the MTMD-YOLO network was compared with the YOLO series networks, and its real-time performance on hardware and its accuracy on difficult-to-distinguish maturity levels were validated.
  3. Experimental Results and Analysis
  3.1. Optimization Experiment
  3.1.1. Feature Fusion Optimization
To compare the effects of the different feature layers selected by PAN, three combinations, P3~P4, P4~P5, and P3~P5, were compared in terms of mAP and Params. An ↑ after an index indicates that larger values are better, and a ↓ indicates that smaller values are better. The comparison results are shown in Table 1. The mAP of the P4~P5 combination was the highest, 7.67% and 3.49% higher than that of P3~P4 and P3~P5, respectively. The P3~P4 combination had the fewest parameters, 4.56M and 5.13M fewer than P4~P5 and P3~P5, respectively, but its accuracy was lower than that of the other two. The P4~P5 combination had 0.57M fewer parameters than P3~P5, and its mAP was 3.49% higher. This was because the targets in this dataset were about 96–340 pixels, which was much larger than the anchor boxes in the P3 layer, which were 10–33 pixels. In addition, the resolution of P3 was greater than that of P4 and P5, which increased the computational cost of the network. The P3 layer was therefore not suitable for this dataset and increased the false detection rate. Choosing the P4~P5 layers for feature fusion can thus not only increase the mAP of the model but also reduce the number of parameters relative to P3~P5.
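The layer-selection argument can be checked with a small sketch. It assumes the default YOLOv5 anchor sizes: the P3 range of 10–33 pixels matches the text, but the P4 and P5 values are an assumption and are not stated here. A pyramid level is treated as useful when its anchor size range overlaps the 96–340 pixel target range. The names `ANCHORS`, `anchor_range`, and `levels_matching_targets` are illustrative only.

```python
# Default YOLOv5 anchors (width, height) per pyramid level. This is an
# assumption: the text only states the 10-33 pixel range for P3.
ANCHORS = {
    "P3": [(10, 13), (16, 30), (33, 23)],
    "P4": [(30, 61), (62, 45), (59, 119)],
    "P5": [(116, 90), (156, 198), (373, 326)],
}


def anchor_range(anchors):
    """Return the smallest and largest side length among a level's anchors."""
    sides = [side for wh in anchors for side in wh]
    return min(sides), max(sides)


def levels_matching_targets(target_min, target_max, anchors=ANCHORS):
    """Keep the levels whose anchor side range overlaps the target size range."""
    kept = []
    for level, boxes in anchors.items():
        lo, hi = anchor_range(boxes)
        if lo <= target_max and hi >= target_min:
            kept.append(level)
    return kept


# Targets of about 96-340 pixels, as reported for this dataset.
print(levels_matching_targets(96, 340))  # prints ['P4', 'P5']
```

Under these assumed anchors, the P3 range does not overlap the target sizes at all, which is consistent with dropping P3 from the feature fusion.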
  3.1.2. Weight Optimization of the Loss Function
The weight of each task in the loss function represents the level of attention given to that task. Weight optimization experiments were conducted to balance the attention given to the maturity task: five weights for the maturity loss function, 0.5, 0.8, 1.0, 1.2, and 1.5, were compared in terms of mAP, as shown in Figure 10. The horizontal axis represents the number of training epochs, ranging from 100 to 200; the vertical axis represents mAP, and the curves of different colors represent the results for different weights. As the number of training epochs increased, mAP showed an increasing trend. At 200 epochs, the mAP with a weight of 1.0 was the highest, followed by 0.8, then 1.5 and 0.5, and the mAP with a weight of 1.2 was the lowest. Therefore, an appropriately chosen weight, here 1.0, leads to higher accuracy in the maturity task.
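The role of the maturity weight can be sketched as a weighted sum of per-task losses. This is a minimal illustration, assuming YOLOv5-style box, objectness, and classification terms plus one maturity term; the function name `total_loss`, its parameters, and the numeric loss values are hypothetical and are not taken from the paper.

```python
# Candidate weights for the maturity loss, as listed in the text.
CANDIDATE_WEIGHTS = (0.5, 0.8, 1.0, 1.2, 1.5)


def total_loss(box_loss, obj_loss, cls_loss, maturity_loss, maturity_weight=1.0):
    """Sum the task losses, scaling only the maturity term by its weight."""
    return box_loss + obj_loss + cls_loss + maturity_weight * maturity_loss


# Placeholder loss values, used only to show how the weight shifts the total.
for weight in CANDIDATE_WEIGHTS:
    print(weight, total_loss(0.05, 0.02, 0.01, 0.04, maturity_weight=weight))
```

A larger weight makes the maturity term dominate the gradient, while a smaller one lets the other tasks dominate, which is why an intermediate value can give the best mAP.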
  3.2. Experiments Contrast
  3.2.1. Ablation Experiments
In order to verify the contribution of each module to the performance of the model, four improvements (feature fusion, RGBD, multi-task, and loss function optimization) were evaluated on the two tasks of classification and grading. RGBD denotes the addition of depth information, and multi-task denotes the joint completion of the three flower sorting tasks: localization, classification, and grading. AP, AR, mAP, and Speed were compared against the basic network, YOLOv5. The results of the ablation experiment for the fresh-cut flower classification task are shown in Table 2. In the classification task, the basic network already performed well, with the mAP reaching 97.17% and the speed reaching 76.49 FPS. The improved modules still increased the mAP, although the final speed after all improvements was slightly lower than that of the basic network. Compared to the basic network, after simplifying the feature fusion layer, mAP increased by 0.7%, AP reached 100%, and speed increased by 14.2 FPS. After adding RGBD, mAP increased by 0.3%, and the speed was still 4.73 FPS faster than the basic network. After adding multi-task, mAP increased by 0.1%, and the speed decreased by 3.6 FPS. After loss function optimization, the mAP remained unchanged and the speed was 3.42 FPS lower than that of the basic network.
The results of the ablation experiment for the fresh-cut flower grading task are shown in Table 3. In the grading task, because different maturity levels are highly similar, the basic network performed worse than in the classification task: AP was only 76.8%, AR only 85.1%, and mAP only 85.5%. After the four improvements were added, the final network achieved good performance. Compared to the basic network, after simplifying the feature fusion layer, AP increased by 7.2%, AR increased by 6.3%, and mAP increased by 6.1%, a substantial improvement, and the simplified feature layer increased the speed by 14.34 FPS. After adding RGBD, AP increased by 7.2%, AR increased by 2%, and mAP increased by 3.7%. This showed that adding depth information to the dataset provided more bud detail, thereby improving the accuracy of the grading task. After adding multi-task, AP increased by 7.1%, AR increased by 5.1%, and mAP increased by 1.9%; the detection accuracy improved because of the change in the NMS screening method. After loss function optimization, AP increased by 1.3%, AR increased by 0.7%, and mAP increased by 0.6%. The detection speed was reduced by 9 FPS, but real-time performance was still achieved. This showed that loss function optimization strengthened the learning of the grading task and further improved the accuracy. From the results of the two tasks, it was concluded that each module used in this paper improved the detection accuracy of the network. The speed gradually decreased as new structures were added, but it still reached 73.07 FPS, which was 3.38 FPS lower than the basic network.
To verify the necessity of depth data, the results for P, R, mAP, and F1 using RGB and RGBD images were compared, as shown in Table 4. In the classification task using RGB, P was 99.98%, R was 100%, and the F1 score was 100%, and these high scores were maintained using RGBD. The mAP using RGB was 97.75%, and the mAP using RGBD was 98.13%, an increase of 0.38%. This indicated that the depth information had a modest enhancement effect on the classification task. For the grading task, when RGB was used, P was 83.95%, R was 91.38%, mAP was 91.61%, and F1 was 87.03%. When RGBD was used, all the indicators improved: P increased by 7.26%, R increased by 2.01%, mAP increased by 3.63%, and F1 increased by 5.12%. This showed that the addition of depth information played an important role in the grading task, verifying the necessity of depth information.
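For reference, F1 is the harmonic mean of P and R. The sketch below applies that standard definition to the RGB classification values quoted above. The function name `f1_score` is illustrative, and the reported F1 values may be averaged per class, so they need not equal the result of applying the formula to the averaged P and R.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Classification task with RGB images: P = 99.98%, R = 100%.
print(round(f1_score(99.98, 100.0), 2))  # prints 99.99
```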
In order to verify the overall impact of multi-task on the model, F1, mAP, Params, and Speed were compared between the single-task and multi-task settings, as shown in Table 5. In the single classification task, F1 was 100% and mAP was 98.13%, which was already a high level. In multi-task, F1 remained the same and mAP improved by 0.07%, indicating that multi-task brought a small gain for classification. In the single grading task, F1 was 92.15% and mAP was 95.24%. In multi-task, F1 improved by 6.86% and mAP improved by 1.87%, indicating that multi-task strengthened the grading task; this was due to the double screening of NMS, which made the final retained samples more accurate. In a single task, the number of parameters was 6.45M and the speed was 81.22 FPS. The number of parameters increased by only 0.01M in multi-task, so the additional computational load was small. The speed dropped by 8.21 FPS because of the additional screening in the improved NMS. Overall, although multi-task was slower than single task, it added little computational load and achieved higher accuracy. The overall capability of multi-task was thus verified.
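The double screening of NMS can be illustrated with a minimal sketch. It assumes that each candidate box carries one species score and one maturity score, that standard greedy NMS is run once per task, and that the two passes are merged by keeping only boxes that survive both. The merge rule, the thresholds, and all function names are assumptions made for illustration; they are not the paper's exact procedure.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5, score_thr=0.25):
    """Standard greedy NMS; returns the indices of the kept boxes."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


def double_label_nms(boxes, species_scores, maturity_scores):
    """Screen once per task, then keep boxes that survive both passes."""
    keep_species = set(nms(boxes, species_scores))
    keep_maturity = set(nms(boxes, maturity_scores))
    return sorted(keep_species & keep_maturity)


# Toy example: boxes 0 and 1 overlap heavily; box 2 has a weak maturity score.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
species_scores = [0.90, 0.80, 0.85]
maturity_scores = [0.70, 0.60, 0.10]
print(double_label_nms(boxes, species_scores, maturity_scores))  # prints [0]
```

In this toy case the second pass removes a box that the species pass alone would have kept, which is one way a second screening can make the retained samples more reliable at the cost of extra computation.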
  3.2.2. Contrast Experiments
In order to verify the effectiveness of the MTMD-YOLO network, SSD, RetinaNet, and four YOLO series detection networks (YOLOv5, YOLOv6, YOLOv7, and YOLOv8) were reproduced on the RGB flower species dataset. The five indicators AP, AR, mAP, Params, and Speed were compared, and the comparison results are shown in Table 6. The mAP of MTMD-YOLO on the species dataset was 98.19%, which was 1.02%, 1.37%, and 0.97% higher than that of YOLOv5, YOLOv6, and YOLOv7, and was 0.76%, 1.51%, and 0.7% lower than that of SSD, RetinaNet, and YOLOv8, respectively. The YOLOv8 network had 4.67M more parameters than MTMD-YOLO, which increases accuracy at the cost of speed. MTMD-YOLO had the second smallest number of parameters after RetinaNet, 6.46M, and its speed reached 73.07 FPS, which was 1.86 FPS slower than that of the YOLOv5 network, while its mAP was higher than that of YOLOv5. In the classification task, the RetinaNet model had fewer parameters, which greatly improved the detection speed, and it had the best overall performance, followed by the MTMD-YOLO network.
Figure 11 shows the histogram comparison of MTMD-YOLO and the other networks on the RGB flower species dataset. It can be seen that the MTMD-YOLO network achieved the best overall balance in the flower classification task, with the second highest accuracy after YOLOv8, the smallest number of parameters, and the second highest speed after YOLOv5.
In order to verify the effectiveness of the MTMD-YOLO network on maturity detection, SSD, RetinaNet, and four YOLO series detection networks (YOLOv5, YOLOv6, YOLOv7, and YOLOv8) were reproduced on the RGB flower maturity dataset. The five indicators AP, AR, mAP, Params, and Speed were compared, and the comparison results of maturity detection are shown in Table 7. The mAP of MTMD-YOLO on the maturity dataset was 97.81%, which was 19.09%, 14.07%, 12.36%, 15.46%, 14.30%, and 0.68% higher than that of SSD, RetinaNet, YOLOv5, YOLOv6, YOLOv7, and YOLOv8, respectively. The RetinaNet model had the smallest number of parameters, 4.02M, which greatly improved the detection speed. MTMD-YOLO had the second smallest number of parameters after RetinaNet, 6.46M, and its speed reached 73.07 FPS, which was 16.78 FPS and 1.86 FPS slower than the RetinaNet and YOLOv5 networks, respectively. Although RetinaNet performed well in the classification task, its simple model was unable to cope with the more complex maturity task, resulting in lower test results. The detailed analysis showed that the MTMD-YOLO model achieved the best trade-off between speed and accuracy across both tasks. It also showed that the MTMD-YOLO network can maintain high accuracy and speed while completing the flower sorting task end to end.
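The baseline mAP values implied by the stated gaps can be recovered by simple subtraction, as in the sketch below. The values it prints are derived from the figures in the text rather than read from Table 7, so they may differ slightly from the table because of rounding. The names `GAPS` and `implied_baseline_map` are illustrative.

```python
# Grading-task mAP of MTMD-YOLO reported in the text (%).
MTMD_YOLO_MAP = 97.81

# Amount by which MTMD-YOLO exceeds each baseline, as stated in the text (%).
GAPS = {
    "SSD": 19.09,
    "RetinaNet": 14.07,
    "YOLOv5": 12.36,
    "YOLOv6": 15.46,
    "YOLOv7": 14.30,
    "YOLOv8": 0.68,
}


def implied_baseline_map(reference, gaps):
    """Subtract each stated gap from the reference mAP."""
    return {name: round(reference - gap, 2) for name, gap in gaps.items()}


for name, value in implied_baseline_map(MTMD_YOLO_MAP, GAPS).items():
    print(f"{name}: {value:.2f}%")
```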
Figure 12 shows the histogram comparison of MTMD-YOLO and the other networks on the RGB flower maturity dataset. It can be seen that the MTMD-YOLO network achieved the best overall balance in the flower grading task, with the highest accuracy, the smallest number of parameters, and the second highest speed after YOLOv5.
In order to verify the performance of the model on the embedded Jetson Orin NX, the MTMD-YOLO network was compared with RetinaNet, YOLOv5, and YOLOv8, as shown in Table 8. RetinaNet had the fastest detection speed, reaching 45 FPS. Its mAP reached 97.70% in the classification task and 83.74% in the grading task, the latter being 14.06% lower than that of the MTMD-YOLO network. The MTMD-YOLO network was the second fastest, reaching 37 FPS; its mAP in the classification task was 98.15%, second only to YOLOv8, and its mAP in the grading task was the highest, reaching 97.80%. The mAP of YOLOv8 was high, but its speed was only 29 FPS, which did not reach real-time speed. YOLOv5 reached only 86.12% mAP in the grading task and was 5 FPS slower than MTMD-YOLO. In general, MTMD-YOLO performed excellently on the embedded hardware.
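The reported frame rates can be converted to per-frame latency to make the real-time comparison concrete. The sketch below assumes a real-time threshold of 30 FPS, which is a common convention but is not stated in the text, and derives the YOLOv5 rate as 5 FPS below MTMD-YOLO. All names are illustrative.

```python
# Assumed real-time threshold; the text does not state the exact value used.
REAL_TIME_FPS = 30.0

# Jetson Orin NX frame rates from the text; YOLOv5 is derived as 37 - 5.
JETSON_FPS = {"RetinaNet": 45, "MTMD-YOLO": 37, "YOLOv5": 32, "YOLOv8": 29}


def frame_latency_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def is_real_time(fps, threshold=REAL_TIME_FPS):
    """Check whether a frame rate meets the assumed real-time threshold."""
    return fps >= threshold


for name, fps in JETSON_FPS.items():
    print(f"{name}: {frame_latency_ms(fps):.1f} ms/frame, real time: {is_real_time(fps)}")
```

Under this assumed threshold, MTMD-YOLO at 37 FPS has a budget of about 27 ms per frame, while YOLOv8 at 29 FPS falls just short.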
  3.3. Detection Results in Challenging Conditions
  3.3.1. Experiments on Difficult-to-Distinguish Maturity of Flower
In visible light (RGB) images, it can be difficult to tell whether the petals are open, which can lower the accuracy of maturity judgment. After adding depth information (RGBD), the accuracy on difficult-to-distinguish maturity levels improved significantly. The comparison of visible light (RGB) and RGBD results is shown in Figure 13. With RGB alone, anna roses of grade 3 were categorized as grade 2, weiguang roses of grade 3 were categorized as grade 2, zhenai roses of grade 5 were categorized as grade 4, and jinzhi roses of grade 1 were categorized as grade 2. This was because visible light could not capture how upright the petals were, whereas RGBD provided depth information about the petals, making it possible to judge how many petals were open and thus to grade the maturity of the flowers accurately.
  3.3.2. Detection Effects in Real-World Environments
Because of the limitations of the depth camera acquisition device in a real environment, this work used an experimental environment dataset to simulate the flower sorting effect. The experimental environment dataset used a variety of lighting conditions and backgrounds to simulate those of the real world. To verify the robustness of the model trained on this dataset, ten samples were randomly selected from online sources and the real world, and Figure 14 shows the sorting results in the natural environment. When there was a single flower in an image, the system accurately identified the background and correctly distinguished species and maturity. This showed that the model can generalize to real environments. When there were multiple flowers in an image, the species and maturity of some flowers, but not all, were accurately identified. This indicated that the system's ability to recognize flowers needs to be improved when flowers are dense in the image. When flowers had indistinct cores, inaccurate positioning and incorrect maturity predictions occurred. This was because flower maturity was mainly judged from the flower core, which places higher requirements on the acquisition angle. In order to verify the detection effect on other species, two further varieties of flowers were selected for verification. When the flower species were not included in the dataset, the system still detected the flowers and their maturity but identified them as species present in the dataset. Among the ten samples, the true positives (TP) of the classification task numbered 8 and the false positives (FP) numbered 2; the TP of the grading task was 8 and the FP was 2. In summary, the system had a certain generalization ability for real-environment backgrounds, but it was limited by the acquisition angle and by the number and species of flowers.
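The TP and FP counts above correspond to a precision of 80% on the ten real-world samples for each task, which the following minimal sketch computes from the standard definition. The function name `precision` is illustrative.

```python
def precision(tp, fp):
    """Fraction of predictions that were correct: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0


# Ten real-world samples: TP = 8 and FP = 2 for both tasks.
print(precision(8, 2))  # prints 0.8
```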
  3.4. Innovations, Limitations, and Future Work
Previous studies have shown that deep learning technology has made progress in flower species detection, but its application to flower maturity detection remains relatively limited. Existing methods have problems such as complex models, datasets covering a single variety, and complex operation. For instance, Sun X et al. [21] proposed a flower quality grading method based on deep learning and depth information. Four convolutional models, VGG16, ResNet18, MobileNetV2, and InceptionV3, were used to classify RGBD images, which showed that depth information can effectively reflect the characteristics of flower buds. Also using depth information, Fei Y et al. [22] realized a lightweight flower grading system based on the ShuffleNetV2 network. The overall prediction speed reached 0.020 s/flower, giving the system a considerable speed advantage over fresh-cut flower classifiers on the market. Although these models achieved good results in terms of speed and accuracy, each can only detect the maturity of a single variety of flower, and joint species and maturity detection across multiple varieties of fresh-cut flowers had not been achieved.
To address these issues, this study proposed an end-to-end flower sorting method based on depth information, which can simultaneously complete the flower localization, classification, and grading tasks. To improve accuracy on difficult-to-distinguish maturity levels, an RGBD flower sorting dataset with RGBD images and double labels was produced. To realize an end-to-end flower sorting system, the MTMD-YOLO network was constructed. In the MTMD-YOLO network, the high-resolution P3 layer was removed to increase training speed, and a maturity tensor was added to the detection head to predict maturity information. The NMS filtered the prediction boxes of the classification and grading tasks separately and then merged them to obtain the predicted location, species, and maturity. A loss function for the maturity task was added so that each task is trained separately. Compared with SSD, RetinaNet, and four YOLO series detection networks (YOLOv5, YOLOv6, YOLOv7, and YOLOv8), this model had the best performance when both accuracy and speed were considered. F1 reached 100% in the classification task and 99.01% in the grading task, and mAP reached 98.19% in the classification task and 97.81% in the grading task. On the Jetson Orin NX hardware, a speed of 37 FPS was achieved with nearly the same accuracy. This method effectively improved the efficiency of flower sorting: it requires only images collected by a depth camera as input and directly outputs the location, species, and maturity of the flowers. The end-to-end operation reduces the learning cost for users and is of great significance for smart agriculture and practical deployment.
This study had several limitations. It verified the feasibility of sorting four common varieties of fresh-cut flowers, but the model has not yet been validated on other rose varieties. Because the class information learned by the model is built from the training set, its detection accuracy on other flower species may be reduced, since those species were not present in the training set. In addition, the actual flower growing environment is dynamic and influenced by numerous unpredictable factors, such as leaf shading, severely insufficient or excessive light, acquisition angle, and complex backgrounds. In such a complex real-world environment, further evaluation, optimization, and enhancement of the robustness of the model are necessary.
This experiment provided an approach to sorting fresh-cut flowers and demonstrated the feasibility of multi-task network processing. The sorting task was preliminarily completed. In the future, this work will continue to expand the dataset, optimize the algorithm model, and enhance the ability of the system under overload conditions to better serve the flower sorting task. The robotic arm will be upgraded to collect more species of flowers in real-world environments, which will then be used to validate the MTMD-YOLO flower sorting model. These steps aim to enhance the robustness and generalizability of the model. In addition, this work will explore combining MTMD-YOLO with federated learning technology and the Internet of Things (IoT) in real-world flower sorting, as well as integrating the system with a client so that self-service flower sorting can be carried out conveniently and quickly. This work will also explore the application of MTMD-YOLO in large-scale flower sorting, thereby driving the development of smart agriculture.
  4. Conclusions
This paper proposed a real-time, high-precision end-to-end system for flower sorting, addressing three key challenges: real-time operation on embedded devices, high precision in distinguishing difficult-to-determine maturity stages, and end-to-end processing for flower localization, classification, and grading tasks. The MTMD-YOLO network was developed for an end-to-end flower sorting system. To improve accuracy on difficult-to-distinguish maturity levels, an RGBD flower sorting dataset was constructed, and real-time capability was achieved by simplifying the feature fusion layer. For the final prediction and post-processing of flower sorting, improvements included a double-label detection head, double-label NMS, and a loss function to predict information across the three tasks. Experiments showed that the mAP of the MTMD-YOLO network reached 98.19% in the fresh-cut flower classification task and 97.81% in the fresh-cut flower grading task. The method achieved real-time speed (37 FPS) on a portable embedded Jetson Orin NX. Furthermore, the flower sorting system can be integrated with mobile carts to make full use of depth information and carry out automatic robotic picking tasks efficiently.