Fruit Detection and Recognition Based on Deep Learning for Automatic Harvesting: An Overview and Review

Abstract: Continuing progress in machine learning (ML) has led to significant advancements in agricultural tasks. Due to its strong ability to extract high-dimensional features from fruit images, deep learning (DL) is widely used in fruit detection and automatic harvesting. Convolutional neural networks (CNN) in particular have demonstrated the ability to attain accuracy and speed levels comparable to those of humans in some fruit detection and automatic harvesting fields. This paper presents a comprehensive overview and review of fruit detection and recognition based on DL for automatic harvesting from 2018 up to now. We focus on the current challenges affecting fruit detection performance for automatic harvesting: the scarcity of high-quality fruit datasets, fruit detection of small targets, fruit detection in occluded and dense scenarios, fruit detection of multiple scales and multiple species, and lightweight fruit detection models. In response to these challenges, we propose feasible solutions and prospective future development trends. Future research should prioritize addressing these current challenges and improving the accuracy, speed, robustness, and generalization of fruit vision detection systems, while reducing the overall complexity and cost. This paper hopes to provide a reference for follow-up research in the field of fruit detection and recognition based on DL for automatic harvesting.


Introduction
In recent years, the application of artificial intelligence (AI) techniques and robotic systems to automate agricultural processes has garnered significant interest (as shown in Figure 1). Fruits usually grow in complex environments with many uncertainties. Powerful fruit vision detection systems are necessary for intelligent agriculture and automatic harvesting. Fruit vision detection systems are characterized mainly by their imaging sensors and the visual information they capture about fruits. They generally operate through five stages (as shown in Figure 2): fruit image acquisition, fruit image preprocessing, fruit feature extraction, fruit image segmentation, and fruit image recognition. Black and white cameras, red-green-blue (RGB) cameras, spectral cameras, thermal cameras, and RGB-depth map (RGB-D) cameras (as shown in Figure 3) are commonly used for fruit vision detection systems to obtain color, shape, texture, and size information of fruits in specific operational areas. A comparison of different types of imaging sensors is shown in Table 1. Fruit images acquired through different imaging methods are shown in Figure 4. The main research processes of fruit detection and recognition methods are shown in Figure 5. Since DL has a strong ability to extract high-dimensional features from fruit images, researchers have proposed many fruit detection and recognition methods based on DL (you only look once (YOLO), single shot multibox detector (SSD), Alex Krizhevsky networks (AlexNet), visual geometry group networks (VGGNet), residual networks (ResNet), faster region-convolutional neural networks (Faster R-CNN), fully convolutional networks (FCN), SegNet, and mask region-convolutional neural networks (Mask R-CNN)) for automatic harvesting (as shown in Table 2). Despite much research, many challenges need to be overcome to build an effective fruit vision detection and harvesting system.
Agronomy 2023, 13, x FOR PEER REVIEW

Figure 2. Different processes of fruit detection and recognition based on DL (image reprinted with permission from ref. [16]. 2023, Xiao F.).
RGB, depth, and infrared images (photos reprinted with permission from ref. [17]. 2020, Fu L.); (c) spectral image (photo reprinted with permission from ref. [18]. 2009, Okamoto H.); (d) color and thermal-registered image (photo reprinted with permission from ref. [19]. 2010, Wachs J.P.).
Some review articles have been published encompassing diverse agricultural applications, such as crop recognition, fruit counting, weed discrimination, and plant disease detection, with or without a robotic system, by considering AI/computer vision (CV)/other advanced vision control techniques. For example, Rehman, T.U. et al. (including researchers based in America and Canada) (2019) [20] provided a comprehensive summary of ML algorithms that have been utilized in diverse agricultural operations. Brazilian researchers Patrício, D.I. and Rieder, R. (2018) [21] investigated potential applications of machine vision (MV) for diverse agricultural tasks, such as crop disease/pest detection, grain quality evaluation, and automatic plant phenotyping. Narvaez, F.Y. et al. (including researchers based in Chile, Italy, and America) (2017) [22] summarized various sensing techniques, along with their limitations, to categorize fruits/plants. Indian researchers Jha, K. et al. (2019) [23] outlined the latest smart methodologies, such as the Internet of Things (IoT), for agricultural purposes. Dutch researchers Wolfert, S. et al. (2017) [24] reviewed the application of big data in agriculture.
Other review articles cover only a particular type of agricultural application or scenario. For example, we reviewed fruit detection and recognition techniques based on digital image processing and traditional ML for fruit harvesters in [16]. New Zealand researchers Saleem, M.H. et al. (2019) [25] summarized and explained DL models for the identification and classification of plant diseases, along with the application of DL with advanced imaging techniques, including hyperspectral/multispectral imaging. Wang, D. et al. (including American researchers and a researcher based in Israel) (2019) [26] and Chinese researchers Wang, A. et al. (2019) [27] reviewed procedures for weed detection using various classification methods, including ML and DL. The review literature on AI/ML/DL/MV/CV/other advanced vision control techniques for intelligent agriculture and automatic harvesting also includes [28][29][30][31][32][33][34][35][36][37][38][39][40][41].
However, unlike the articles mentioned above, our work focuses on providing an overview and review of the use of DL applied to fruit image recognition (mainly in the areas of detection and classification) for automatic harvesting. To delimit the scope of our paper, we treat fruit detection and classification as tasks that determine the class of each fruit based on its specific type. The contributions of this work are as follows: (1) we systematically summarize and explain the various fruit detection and recognition methods based on DL for automatic harvesting from 2018 up to now; (2) we systematically compare and analyze the advantages, disadvantages, and applicability of various fruit detection and recognition methods based on DL for automatic harvesting; (3) we systematically present the current challenges affecting fruit detection performance for automatic harvesting and propose feasible solutions and prospective future developments. Through this clearer and more comprehensive overview and review, we aim to provide a reference for follow-up research in the field of fruit detection and recognition based on DL for automatic harvesting.
According to Martín-Martín, A. et al. (including Spanish researchers and a researcher based in the UK) (2018) [42], Google Scholar citation data encompass a larger set of publications than Web of Science and Scopus. In order to comprehensively survey the literature relevant to the scope of this article, the Google Scholar database was selected as the source. In the first step, combinations of keywords such as "fruit detection", "fruit recognition", "deep learning", "computer vision", and "fruit harvesting" were utilized in the initial search process. All retrieved papers were subsequently evaluated for their relevance to the subject matter. The second step included the examination of the references from step one for a more thorough review. In the final step, to ensure that our study focuses on the most current research, all papers published before 2018 were excluded. Only the recent literature from 2018 to the present was considered. The final set of papers regarding fruit detection and recognition based on DL for automatic harvesting included 53 research articles. Figure 6 displays the distribution of articles per year, network models used, and crops detected.
As shown in Figure 6, in recent years, the application of DL techniques and robotic systems to automate agricultural processes has garnered significant interest. Improvement and application research based on Faster R-CNN (21%) is currently a hotspot. The recognition accuracy of fruit detection methods based on Faster R-CNN is high, but recognition speed is limited by complex anchor box mechanisms. When mobile deployment and high recognition speed are required, fruit detection methods based on YOLO (17%) are used most frequently. Their recognition speed is fast, but they perform less well on small target fruits. In addition, ResNet (11%) is the most popular backbone network, followed by AlexNet (7%).
Most of the research focuses on apples (32.14%), followed by tomatoes (8.93%), and citrus (7.14%). These three kinds of fruits are in high demand and yield globally. There are some reasons that make them ideal candidates for automatic harvesting. Firstly, they individually hang from plants, making them easily detectable based on their distinctive features. Secondly, they have no extreme variations in size or weight. Lastly, they are relatively hard and not easily damaged in mechanical operations. However, in terms of fruit dimensions and peduncle length, different cultivars may exhibit different characteristics, which can affect fruit detection and recognition performance. This poses challenges for adapting fruit detection and recognition methods for different cultivars. Future work could aim to identify cultivars that are more suitable for automatic harvesting.
The outline of this article is shown in Figure 7. The organization of the rest of the paper is as follows: Section 2 summarizes and explains previous research articles about DL applied to fruit detection and recognition for automatic harvesting. We compare and analyze the advantages, disadvantages, and applicability of various fruit detection and recognition methods based on DL (YOLO, SSD, AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN) for automatic harvesting; Section 3 discusses the current challenges affecting fruit detection and recognition performance for automatic harvesting (scarcity of high-quality fruit datasets, fruit detection of small targets, fruit detection in occluded and dense scenarios, fruit detection of multiple scales and multiple species, and lightweight fruit detection models) and proposes feasible solutions and prospective future development trends; Section 4 concludes this article.

Fruit Detection and Recognition Based on DL
The concept of DL originated from research on artificial neural networks (ANN), proposed by Canadian researchers Hinton, G.E. and Salakhutdinov, R.R. in 2006 [43]. Since DL has a strong ability to extract high-dimensional features from fruit images, many researchers have conducted extensive and in-depth research on fruit detection and recognition based on DL for automatic harvesting. The basic architecture of DL-based ANN for fruit detection and recognition is shown in Figure 8.

CNNs were proposed by American researchers LeCun, Y. et al. in the 1980s [44,45]. They can efficiently capture patterns in multidimensional space. A typical CNN framework for fruit detection and recognition is shown in Figure 9. It includes the convolutional layer (Conv), pooling layer (Pool), nonlinear activation function, and fully connected layer (FC). The convolutional layer is the core of the CNN for fruit feature extraction. Depending on the designed convolution kernel, convolution operations capture fruit image contours and generate corresponding fruit feature maps. In order to reduce the spatial size of the fruit feature maps, the pooling layer performs down-sampling operations by taking the maximum or average value in a neighborhood range. The nonlinear activation function applies an element-wise nonlinearity to its input. Neurons in the fully connected layer are connected to all activated neurons in the preceding layer. When training the CNN, the model scores the candidate categories for each input image, calculates the training loss using the selected loss function, and updates the weights through backpropagation and gradient descent. The cross-entropy loss function is one of the most widely used loss functions, and stochastic gradient descent is the most popular optimization method.
Compared with digital image processing and traditional ML techniques, fruit detection and recognition methods based on CNN have great advantages in terms of accuracy. Jahanbakhshi, A. et al. (including Iranian researchers and a researcher based in the UK) (2020) [46] proposed an improved CNN (15, 16, and 18 layers) to detect apparent defects in sour lemons. In comparison to traditional feature extraction and classification methods, such as histogram of oriented gradient (HOG), local binary pattern (LBP), support vector machine (SVM), k-nearest neighbor (KNN), decision tree, and fuzzy classification, the improved CNN was found to outperform these methods, achieving an accuracy of 100%. Bangladeshi researchers Sakib, S. et al. (2019) [47] proposed a fruit detection system using CNN. The Fruits-360 dataset was utilized to evaluate the proposed system. The training accuracy and testing accuracy are 99.79% and 100%, respectively. In general, fruit detection and recognition methods based on CNN have achieved state-of-the-art (SOTA) accuracy across a wide range of fruit types and backgrounds.
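As a rough illustration of the CNN building blocks described above (convolution, nonlinear activation, pooling, a fully connected layer, cross-entropy loss, and a stochastic gradient descent update), the following standard-library Python sketch runs a toy forward pass and two weight updates. It is not taken from any of the reviewed papers: the function names, the 5 x 5 toy image, the edge-detecting kernel, the two-class setup, and the learning rate are all assumptions chosen for readability, and only the fully connected layer is trained.

```python
import math

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (the operation DL frameworks call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise nonlinear activation."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(weights, bias, features, label, lr=0.1):
    """One SGD update of a fully connected layer on one sample; returns the loss
    computed before the update."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    loss = -math.log(probs[label])       # cross-entropy for the true class
    for k in range(len(weights)):
        grad = probs[k] - (1.0 if k == label else 0.0)   # dLoss/dscore_k
        for i, x in enumerate(features):
            weights[k][i] -= lr * grad * x
        bias[k] -= lr * grad
    return loss

# Toy 5x5 "image" with a vertical edge, and a kernel that responds to that edge.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
kernel = [[-1.0, 1.0], [-1.0, 1.0]]

pooled = max_pool(relu(conv2d(image, kernel)))     # Conv -> activation -> Pool
features = [v for row in pooled for v in row]      # flatten for the FC layer

weights = [[0.0] * len(features) for _ in range(2)]  # two hypothetical classes
bias = [0.0, 0.0]
loss1 = sgd_step(weights, bias, features, label=1)
loss2 = sgd_step(weights, bias, features, label=1)
print(loss1, loss2)   # the second loss is lower than the first
```

In a real detector the convolution kernels are learned by backpropagating the same loss through every layer; here the kernel is fixed so that the gradient computation stays short enough to read.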
Current fruit detection and recognition methods based on DL for automatic harvesting can be classified into two categories: single-stage fruit detection and recognition methods (such as YOLO and SSD) based on regression, and two-stage fruit detection and recognition methods (AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN) based on candidate regions. Single-stage methods define fruit detection tasks as regression problems of class confidence and bounding box locations (as shown in Figure 10). They divide input fruit images into a grid of cells, extract fruit feature information through the convolutional layer, and predict object class probabilities and bounding box coordinates for each cell. In contrast, as shown in Figure 11, two-stage methods first generate a set of target fruit proposals with a region proposal network (RPN) that operates on the fruit feature maps produced by the convolutional layer. To generate region of interest (RoI) proposals, the RPN applies sliding windows of different scales and aspect ratios at each location on the fruit feature maps. Each proposal consists of a fixed-size bounding box and a probability score of containing a target fruit. Based on the scores assigned to these proposals, the top N highest-scoring regions are selected as final RoI proposals. In the second stage, each final RoI proposal is cropped into a fixed-size feature map using RoI pooling. The maps are then fed into a separate CNN for fruit classification and bounding box regression.
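The two mechanisms contrasted above can be sketched in a few lines of standard-library Python. The sketch is illustrative only and rests on several assumptions: the function names are hypothetical, the anchor parameterization (area equal to the squared scale, width-to-height ratio equal to the aspect ratio) is one common convention rather than the one used by any specific paper reviewed here, the proposal scores are invented, and the non-maximum suppression step that practical region proposal networks apply before keeping the top N proposals is omitted.

```python
import math

def responsible_cell(cx, cy, img_w, img_h, grid):
    """Single-stage view: (row, col) of the grid cell containing an object's center."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col

def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Two-stage view: candidate boxes (x1, y1, x2, y2) centered on one
    feature-map location, one per (scale, aspect ratio) pair."""
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            w = s * math.sqrt(r)     # width / height == r
            h = s / math.sqrt(r)     # width * height == s ** 2
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return anchors

def top_n_proposals(boxes, scores, n):
    """Keep the n boxes with the highest objectness scores."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:n]]

# A fruit centered at (300, 100) in a 416 x 416 image with a 13 x 13 grid.
cell = responsible_cell(300, 100, 416, 416, 13)

# Six anchors at one location, with made-up objectness scores.
anchors = make_anchors(50, 50, scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0])
scores = [0.2, 0.9, 0.5, 0.1, 0.7, 0.3]
proposals = top_n_proposals(anchors, scores, n=2)
print(cell, proposals)
```

The single-stage detector makes one prediction per grid cell in a single pass, whereas the two-stage detector first ranks many candidate boxes and only then classifies the survivors, which is one way to see why the latter tends to be slower.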
Agronomy 2023, 13, x FOR PEER REVIEW 9 of 32 accuracy and testing accuracy are 99.79% and 100%, respectively.In general, fruit detection and recognition methods based on CNN can achieve state-of-the-art (SOTA) accuracy for detecting and recognizing any type of fruit on any background.Current fruit detection and recognition methods based on DL for automatic harvesting can be classified into two categories: single-stage fruit detection and recognition methods (such as YOLO and SSD) based on regression, and two-stage fruit detection and recognition methods (AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN) based on candidate regions.Single-stage methods define fruit detection tasks as regression problems of class confidence and bounding box locations (as shown in Figure 10).They divide input fruit images into a grid of cells, extract fruit feature information through the convolutional layer, and predict object class probabilities and bounding box coordinates for each cell.In contrast, as shown in Figure 11, for two-stage methods, in the first stage, a set of target fruit proposals is generated by the RPN on fruit feature maps produced by the convolutional layer.The RPN generates region of interest (RoI) proposals for each location on the fruit feature maps.Each proposal consists of a fixed-size bounding box and a probability score of containing a target fruit.Based on the scores assigned to these proposals, the top N highest-scoring regions are selected as final RoI proposals.To generate RoI proposals, the RPN applies sliding windows of different scales and aspect ratios to fruit feature maps.In the second stage, each final RoI proposal is cropped into a fixed-size feature map using RoI pooling.The maps are then fed into a separate CNN for fruit classification and bounding box regression.Table 3 compares and analyzes different fruit detection and recognition methods used by various researchers.In the section on "crop, description, and merit", we 
explain the innovation.In the section on "improvement", we identify the weaknesses and potential improvements.In general, two-stage fruit detection and recognition methods have been shown to achieve higher accuracy than single-stage fruit detection and recognition methods due to their ability to propose more accurate fruit locations.However, they are slower and computationally more intensive than single-stage fruit detection and recognition methods.On the other hand, while single-stage fruit detection and recognition methods are faster and simpler than two-stage fruit detection and recognition methods, they may be less accurate, especially for small target fruits.10).They divide input fruit images into a grid of cells, extract fruit feature information through the convolutional layer, and predict object class probabilities and bounding box coordinates for each cell.In contrast, as shown in Figure 11, for two-stage methods, in the first stage, a set of target fruit proposals is generated by the RPN on fruit feature maps produced by the convolutional layer.The RPN generates region of interest (RoI) proposals for each location on the fruit feature maps.Each proposal consists of a fixed-size bounding box and a probability score of containing a target fruit.Based on the scores assigned to these proposals, the top N highest-scoring regions are selected as final RoI proposals.To generate RoI proposals, the RPN applies sliding windows of different scales and aspect ratios to fruit feature maps.In the second stage, each final RoI proposal is cropped into a fixed-size feature map using RoI pooling.The maps are then fed into a separate CNN for fruit classification and bounding box regression.Table 3 compares and analyzes different fruit detection and recognition methods used by various researchers.In the section on "crop, description, and merit", we explain the innovation.In the section on "improvement", we identify the weaknesses and potential improvements.In general, 
two-stage fruit detection and recognition methods have been shown to achieve higher accuracy than single-stage fruit detection and recognition methods due to their ability to propose more accurate fruit locations. However, they are slower and computationally more intensive than single-stage methods. On the other hand, while single-stage methods are faster and simpler than two-stage methods, they may be less accurate, especially for small target fruits. Table 3 compares and analyzes different fruit detection and recognition methods used by various researchers. In the section on "crop, description, and merit", we explain the innovation. In the section on "improvement", we identify the weaknesses and potential improvements.

YOLO-v2, which improved on the structure of YOLO-v1 [66], was proposed by American researchers Redmon, J. and Farhadi, A. in 2017 [67]. K-means clustering was used to select the number and dimensions of anchor boxes, and the relationship between recognition accuracy and speed was analyzed. They then proposed YOLO-v3 [68], which featured improvements such as the Darknet-53 backbone network and multi-scale prediction. Bochkovskiy, A. et al. (2020) [69] systematically analyzed the processes of data preprocessing and the design of detection and prediction networks. Based on the analysis, they designed an efficient target detector (YOLO-v4) suitable for a single graphics card. YOLO-v5 [70] provided four different sizes of target detectors to meet the needs of different applications. Later variants include YOLOR [71], YOLOX [72], YOLO-v6 [73], YOLO-v7 [74], and YOLO-v8 [75].

Fruit detection and recognition methods based on YOLO are widely used, by virtue of their advantages. Chinese researchers Xiong, J. et al. (2020) [76] proposed a method based on YOLO-v2 to detect and count mangoes in fruit images taken by a UAV. The processing time is 80 ms, and the average detection accuracy is 96.1%. British researchers Birrell, S. et al. (2020) [77] proposed a method based on YOLO-v3 to detect and classify cabbages in four growth stages, achieving a total detection accuracy of 91% and a classification accuracy of 82%. In order to create an even more lightweight fruit detection model, Chinese researchers Li, C. et al. (2022) [58] proposed an improved YOLO-v3-tiny fruit detection model based on K-means 3D clustering partitioning for small and densely packed lychee fruits, and compared it with other fruit detection networks (YOLO-v3-tiny, YOLO-v4, YOLO-v5, and Faster R-CNN). The improved YOLO-v3-tiny can recognize lychee fruits more accurately. The recall (check-all rate), precision (check-accuracy rate), and F1 score are 78.99%, 87.43%, and 0.83, respectively.

However, fruit detection and recognition methods based on YOLO do not use prior information when predicting fruit positions. This results in a loss of fruit location accuracy. In addition, when YOLO predicts the detection results corresponding to each bounding box, it requires the target fruit's center point to fall inside the grid cell responsible for that prediction. This imposes a strong spatial constraint on the prediction process of YOLO and makes fruit detection and recognition methods based on YOLO less effective at detecting small target fruits that appear in groups. In the future, semantic information (such as fruit scene and context-related information) could be input and fused into fruit detection algorithms to improve fruit detection accuracy. For example, Chinese researchers Miao, Z. et al. (2022) [59] integrated classic image processing methods with YOLO-v5 to increase fruit detection accuracy and robustness. A tomato-harvesting robot can be guided to efficiently harvest truss tomatoes, with an average operating time of 9 s per cluster.
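The F1 score reported for the improved YOLO-v3-tiny model [58] can be checked from the other two figures. The short sketch below assumes that the "check-all rate" corresponds to recall and the "check-accuracy rate" to precision; the function name `f1_score` is illustrative and not taken from the cited work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in [58]; the mapping to precision/recall is an assumption.
precision = 0.8743  # check-accuracy rate
recall = 0.7899     # check-all rate

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}")
```

Under that assumption the computed value rounds to the reported 0.83, which suggests the three figures are mutually consistent.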
The former achieves 4.55 FPS, whereas the latter achieves a significantly higher rate of approximately 16.67 FPS. However, it is worth noting that fruit detection and recognition methods based on SSD preprocess input fruit images, which may lead to lower fruit detection accuracy for relatively small target fruits when passing through deeper convolutional layers. Chinese researchers Liang, Q. et al. (2018) [80] proposed a real-time detection method for on-tree mangoes based on SSD. New sampling strategies were designed to optimize data augmentation techniques. With optimized data augmentation techniques and default box proposals, SSD outperforms Faster R-CNN in mango detection. Detection results for an almond dataset further confirm the effectiveness of the proposed method. However, the proposed method has deeper layers and a larger number of parameters, which results in slower operation and longer computation time.
In general, fruit detection and recognition methods based on SSD also have certain disadvantages. They independently input fruit image features, extracted by different convolutional layers, into corresponding network detection branches. This means that the same fruit in a detected image may be identified by bounding boxes of different sizes simultaneously, which can easily lead to repeated detection. Additionally, each detection branch only operates on target fruits in its respective field, making it difficult to consider the relationship between target fruits of different layers and scales. Therefore, fruit detection and recognition methods based on SSD perform poorly on small target fruits. Further research could improve SSD in detector frameworks, prediction mechanisms, matching mechanisms, and loss functions.

A highly effective method for vegetable classification based on AlexNet was proposed in 2018 [82]. Its accuracy on the testing set was 92.1%, a clear improvement over the BP neural network (78%) and the SVM classifier (80.5%). Indian researchers Rangarajan, A.K. et al. (2018) [83] reported a classification accuracy of 97.49% for AlexNet on 13,262 fruit images. Fruit detection and recognition methods based on AlexNet have gained widespread acceptance due to their advantages. By modifying the size of the convolutional kernel and convolutional layer, fruit detection accuracy can be effectively improved. For example, Chinese researchers Ni, J. et al. (2021) [52] improved AlexNet by proposing a new architecture, E-AlexNet. The new architecture enhanced the convolutional layer, reduced the kernel size, and used L2 regularization and a batch normalization (BN) layer instead of the local response normalization (LRN) layer. E-AlexNet was compared with the original AlexNet by classifying five strawberry varieties with different qualities. The average recognition accuracy of E-AlexNet was 90.70%, while that of the original AlexNet was 84.50%.

VGGNet was proposed by British researchers Simonyan, K. and Zisserman, A. in 2014 [84]. It has high accuracy in fruit detection and recognition. The biggest improvement of VGGNet is the depth of the network, which was increased from the 8 layers of AlexNet to 16 or 19 layers. Additionally, VGGNet uses stacked 3 × 3 convolution kernels to replace the large convolution kernels used in earlier networks (such as the 11 × 11 and 5 × 5 kernels in AlexNet). For the same receptive field, a stack of small convolution kernels needs fewer parameters and applies more nonlinearities than a single large convolution kernel. For example, Indian researchers Mahmood, A. et al. (2022) [85] assessed the effectiveness of two CNN paradigms (AlexNet and VGG-16) in classifying jujube fruits based on their maturity level (unripe, ripe, and overripe). The best accuracy achieved by VGG-16 was 97.65%. Indian researchers Begum, N. and Hazarika, M.K. (2022) [86]  VGG-19, with the Adam optimizer, is the one that reported the best accuracy (99.32%).
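The receptive-field argument for small kernels can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels in every layer; the channel width of 64 and the helper names are hypothetical choices for illustration only.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (biases ignored) with `channels` inputs and outputs per layer."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # hypothetical channel width

# Two 3x3 layers see the same 5x5 window as one 5x5 layer.
rf_two_3x3 = stacked_receptive_field(3, 2)
# Three 3x3 layers see the same 7x7 window as one 7x7 layer.
rf_three_3x3 = stacked_receptive_field(3, 3)

weights_three_3x3 = conv_weights(3, C, num_layers=3)  # 27 * C^2
weights_one_7x7 = conv_weights(7, C)                  # 49 * C^2

print(rf_two_3x3, rf_three_3x3, weights_three_3x3, weights_one_7x7)
```

For equal receptive fields, the stacked 3 × 3 design needs roughly 45% fewer weights than a single 7 × 7 layer under these assumptions, while inserting an extra nonlinearity after each layer.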

Two-Stage Fruit Detection and Recognition Methods Based on Candidate Regions
In order to further improve the accuracy and speed of fruit detection and recognition, Chinese researchers Li, Z. et al. (2020) [88] proposed a fruit recognition and classification method based on VGG-M and VGG-M-BN. On the basis of the original VGG, VGG-M combined the output features of the first two fully connected layers, and VGG-M-BN additionally included a BN layer. The convergence rate of VGG-M-BN is nearly three times faster. The quality of datasets, batch size, and different activation functions also influence fruit recognition and classification accuracy. Firstly, they used VGG-M-BN to train different numbers of vegetable datasets. Recognition accuracy decreases as the quality of datasets decreases. Secondly, by contrasting activation functions, they verified that the rectified linear unit (ReLU) activation function is better than the traditional Sigmoid and Tanh functions in VGG-M-BN. Finally, they verified that the fruit recognition and classification accuracy of VGG-M-BN increases as the batch size increases.
ResNet was proposed by American researchers He, K. et al. in 2015 [89]. It has a high pattern recognition capability. According to the number of backbone layers, ResNet can be further subdivided into ResNet-18, ResNet-50, ResNet-101, and ResNet-152. Fruit detection and recognition methods based on ResNet are widely used, by virtue of their advantages. Helwan, A. et al. (including Lebanese researchers and researchers based in Turkey) (2019) [90] performed automatic segmentation of bananas based on ResNet. Wang, D. et al. (including Chinese researchers and a researcher based in America) (2020) [55] developed a remote apple horizontal diameter detection system based on ResNet to achieve automatic measurement of apples throughout the entire growth period.
Capturing fruit feature information on multiple scales is one way to address the problem of target fruits being overlapped and occluded by branches and leaves. American researchers Rahnemoonfar, M. and Sheppard, C. (2017) [91] optimized the structure of Inception-ResNet. The Improved-Inception-ResNet can count efficiently, even if fruits are in shadow, overlapped, or occluded by leaves. However, although the above fruit detection and recognition methods have high accuracy, they are slow. To address this problem, Australian researchers Kang, H. and Chen, C. (2020) [92] introduced an enhanced deep neural network, DaSNet-v2, with ResNet. It can carry out both detection and instance segmentation of fruits, alongside semantic segmentation of branches. To further improve the speed of fruit detection and meet the real-time requirements of harvesters, Australian researchers Kang, H. and Chen, C. (2019) [93] constructed a multifunctional network for the real-time detection and semantic segmentation of apples and branches. They combined it with the lightweight backbone of ResNet-101 to improve the real-time computational performance of the fruit detection model.

Fruit Detection and Recognition Methods Based on R-CNN, Fast R-CNN, and Faster R-CNN
Typical R-CNN, Fast R-CNN, and Faster R-CNN frameworks for fruit detection and recognition are shown in Figure 16. R-CNN was proposed by American researchers Girshick, R. et al. in 2014 [94]. It is the first algorithm to successfully apply DL to object detection and recognition. Fast R-CNN was proposed by American researcher Girshick, R., one of the creators of R-CNN, in 2015 [95]. It solves some problems of its predecessor, such as slow speed and a large overlap of proposal boxes. One of the key innovations of Fast R-CNN is the "RoI pooling layer", which takes CNN feature maps and regions of interest as inputs and provides the corresponding features for each region. This allows Fast R-CNN to extract fruit features from all regions of interest in fruit images in a single pass, whereas R-CNN processes each region separately. It significantly improves the speed of fruit detection and recognition. However, Fast R-CNN still requires regions of fruit images to be extracted and provided as inputs to fruit detection models.
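The idea behind the RoI pooling layer can be illustrated with a simplified NumPy sketch. It is not the Fast R-CNN implementation: it assumes a single-channel feature map, regions already expressed in feature-map cells (end-exclusive), and regions at least as large as the output grid. The function name `roi_max_pool` is hypothetical.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2-D feature map to a fixed output size.

    roi is (y0, x0, y1, x1) in feature-map cells, end-exclusive.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out_h, out_w = output_size
    assert h >= out_h and w >= out_w, "region must cover the output grid"

    # Integer bin edges that split the region as evenly as possible.
    y_edges = [(i * h) // out_h for i in range(out_h + 1)]
    x_edges = [(j * w) // out_w for j in range(out_w + 1)]

    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            bin_ = region[y_edges[i]:y_edges[i + 1], x_edges[j]:x_edges[j + 1]]
            pooled[i, j] = bin_.max()
    return pooled


# One shared feature map, computed once for the whole image.
fmap = np.arange(16).reshape(4, 4)

# Two regions of different sizes both yield a fixed 2x2 descriptor.
pooled_full = roi_max_pool(fmap, (0, 0, 4, 4))
pooled_crop = roi_max_pool(fmap, (1, 1, 4, 4))
```

Because every region is read from the same feature map, the expensive convolutional pass runs only once per image, which is the source of the speed-up over R-CNN described above.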
Faster R-CNN was proposed by American researchers Ren, S. et al. in 2016 [96]. It takes images of fruits as inputs and returns a list of fruit classes, along with their corresponding bounding boxes. Its main innovation is the region proposal network (RPN). By integrating region proposal generation into the main neural network structure, Faster R-CNN achieves near real-time detection speed with high accuracy and generalization capability.
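An RPN scores a fixed set of reference boxes (anchors) at every feature-map location. The following sketch generates such a set for one location. The three scales and three aspect ratios are the defaults commonly cited for the original Faster R-CNN and should be treated as illustrative; for small fruits, smaller scales would likely be needed. The function name `make_anchors` is hypothetical.

```python
import math

import numpy as np


def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x0, y0, x1, y1) rows centred on one location.

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)


anchors = make_anchors((0.0, 0.0))
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
print(anchors.shape)  # 3 scales x 3 ratios = 9 anchors per location
```

In a full detector the same set is replicated at every feature-map cell, and the network predicts an objectness score and box offsets for each anchor.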
The fruit detection performance obtained by Faster R-CNN may outperform other networks (YOLOv3, SSD, and ReFCN) [97]. Therefore, fruit detection and recognition methods based on Faster R-CNN are widely used. Chinese researcher Wan, S. and Greek researcher Goudos, S. (2020) [98] proposed a multi-class fruit (apple, mango, and orange) detection method based on Faster R-CNN. The average detection accuracy was 90.72%, and the image processing time was 58 ms. Fu, L. et al. (including Chinese researchers and researchers based in America) (2018) [99] proposed a kiwifruit detection method based on Faster R-CNN and evaluated it on kiwifruit images collected in field environments. Zhang, J. et al. (including Chinese researchers and researchers based in America) (2020) [100] used Faster R-CNN to improve a multi-class fruit detection method. They aimed to automatically detect apples, branches, and tree trunks in natural environments and estimate the shaking locations for shake-and-catch apple harvesting.
Under changing lighting conditions, with low resolution, and with severe occlusion by adjacent fruits and leaves, fruit detection and recognition are very challenging tasks. To solve this problem, Chinese researchers Wang, P. et al. (2021) [57] proposed an improved Faster R-CNN with an attention mechanism based on a near-color background for young tomato detection and recognition. Small target fruit detection and recognition are also very challenging tasks. To solve this problem, in the localization phase, Chinese researchers Cao, C. et al. (2019) [101] proposed an improved loss function based on intersection over union (IoU) for bounding box regression. Additionally, in the recognition phase, bilinear interpolation is used to improve the pooling operation on regions of interest.
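Bilinear interpolation lets region features be read at fractional coordinates instead of being rounded to the nearest feature-map cell, which matters most when a fruit spans only a few cells. The sketch below shows the generic sampling step only; it is not the implementation from [101], and the function name `bilinear_sample` is hypothetical.

```python
import numpy as np


def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location."""
    h, w = feature_map.shape
    # Clamp the query point to the valid range of cell centres.
    y = min(max(float(y), 0.0), h - 1.0)
    x = min(max(float(x), 0.0), w - 1.0)

    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0

    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom


fmap = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
centre_value = bilinear_sample(fmap, 0.5, 0.5)  # average of the four cells
```

Sampling at the exact fractional position avoids the quantization error of plain RoI pooling, where a one-cell rounding error can shift a small target by a large fraction of its size.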

Fruit Detection and Recognition Methods Based on FCN, SegNet, and Mask R-CNN
FCN was proposed by American researchers Long, J. et al. in 2015 [102]. A typical FCN framework for fruit detection and recognition is shown in Figure 17. FCN classifies fruit images at the pixel level and solves the problem of semantic image segmentation. FCN replaces the fully connected layer of the original CNN with a convolutional layer so that the output is a heatmap instead of a category. Meanwhile, to solve the problem of smaller image size due to convolution and pooling, up-sampling is used to recover the image size. Chinese researchers Lin, G. et al. (2019) [61], German researchers Zabawa, L. et al. (2019) [103], and Li, Y. et al. (including Chinese researchers and a researcher based in Germany) (2017) [104] used FCN for the semantic segmentation of guava, grape, and cotton, respectively. Although guava can be segmented easily, the branch is a little difficult to segment. They also compared FCN with SegNet and a classification and regression tree (CART) classifier. FCN outperforms the other two methods. However, FCN makes some false predictions due to the effects of overlaps and changing lighting conditions. American researchers Chen, S.W. et al. (2017) [105] proposed a method based on FCN for accurate fruit counting in complex natural environments. The method works well even under highly shaded conditions. Furthermore, American researchers Liu, X. et al. (2018) [106] used deep convolutional segmentation to accurately count visible fruits in image sequences.

In general, fruit detection and recognition methods based on FCN can accept fruit image inputs of arbitrary size, and the recognition efficiency is higher. They avoid the repeated storage and redundant convolution computation caused by the use of pixel blocks, which reduces the computational effort of the whole fruit detection operation. However, the recognition accuracy is not high because they are insensitive to the details in fruit images, and the classification does not consider inter-pixel relationships.
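The replacement of a fully connected classifier by a convolution can be sketched in NumPy. Applying the same dense layer at every spatial position is equivalent to a 1 × 1 convolution and yields a per-class score map rather than a single category. All sizes and the random weights below are hypothetical, and nearest-neighbour repetition stands in for the learned up-sampling that FCN actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C, K = 4, 6, 8, 3  # hypothetical feature-map size, channels, classes
features = rng.normal(size=(H, W, C))
weights = rng.normal(size=(C, K))  # weights of a dense classification layer
bias = rng.normal(size=K)

# 1x1 convolution: the same dense layer applied at every spatial location.
score_map = np.einsum("hwc,ck->hwk", features, weights) + bias  # (H, W, K)

# Simplified up-sampling by a factor of 2 (nearest neighbour, not learned).
upsampled = score_map.repeat(2, axis=0).repeat(2, axis=1)  # (2H, 2W, K)

# Pixel-level prediction: the highest-scoring class at each position.
label_map = upsampled.argmax(axis=-1)
print(score_map.shape, label_map.shape)
```

Because the layer no longer depends on a fixed flattened input length, the same weights work for feature maps of any spatial size, which is why FCN accepts inputs of arbitrary size.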
SegNet was proposed by British researchers Badrinarayanan, V. et al. in 2017 [107]. A typical SegNet framework for fruit detection and recognition is shown in Figure 18. It follows the segmentation idea of FCN and is a symmetric network model with a supervised encoder-decoder structure. SegNet can handle fruit image inputs of arbitrary sizes. The encoder reduces the size of input fruit images and the number of parameters stage by stage through max pooling, and records the pooling index positions in the fruit images. In order to ensure consistency in resolution between input and output fruit images, the decoder recovers fruit image information through up-sampling. Finally, it outputs semantic segmentation results through the SoftMax classifier. The major difference between SegNet and FCN is the method used for up-sampling low-resolution feature maps to high-resolution feature maps.

Harvesting robots usually operate in complex natural environments, and the random growth of trunks and branches poses a challenge for fruit detection and recognition. Majeed, Y. et al. (including American researchers and a researcher based in China) (2018) [109] developed a trunk and branch segmentation method using a Kinect V2 sensor. Harvesting robots need to optimize the position of the end effector based on the position and angle between fruits and robot components before approaching, grasping, and cutting target fruits. For this purpose, Dutch researchers Barth, R. et al. (2019) [110] proposed inferring the position of fruits and stems through sparse semantic segmentation in the image plane. In addition, to improve the efficiency of fruit detection and enhance real-time performance, Australian researchers Kang, H. and Chen, C. (2019) [93] used a semantic segmentation network to detect and segment apples and branches in an orchard in real time. Meanwhile, in order to enable harvesting robots to simultaneously recognize and locate multiple target fruit clusters, Chinese researchers Li, J. et al. (2020) [62] proposed a semantic segmentation method to segment fruit RGB images into three categories: background, fruit, and branch. The method achieved accurate and automatic detection of fruits and branches of multiple lychee clusters in complex natural environments and guided robots to complete continuous harvesting tasks.
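The pooling-index mechanism that distinguishes SegNet's up-sampling can be sketched as follows. This is a minimal single-channel NumPy illustration with non-overlapping 2 × 2 windows, not the SegNet implementation; the function names are hypothetical.

```python
import numpy as np


def max_pool_with_indices(x, k=2):
    """Non-overlapping k x k max pooling that also records the argmax positions."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0
    # Rearrange so that each k x k window becomes one flat vector of length k*k.
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    return windows.max(axis=-1), windows.argmax(axis=-1)


def max_unpool(pooled, indices, k=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, indices] = pooled
    return (windows.reshape(ph, pw, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * k, pw * k))


x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])

pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices)
```

The decoder thus recovers the spatial positions of the strongest activations without learning the up-sampling itself; in SegNet the sparse result is then densified by further convolutions.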
Mask R-CNN was proposed by American researchers He, K. et al. in 2017 [111]. A typical Mask R-CNN framework for fruit detection and recognition is shown in Figure 19. It consists of three parts. Firstly, the backbone network extracts fruit feature maps from input fruit images. Secondly, the fruit feature maps output by the backbone network are sent to the RPN to generate proposals. Finally, the proposals output by the RPN are mapped onto the shared feature maps, and the corresponding target fruit features are extracted. These features are passed to the fully connected (FC) layers and the FCN for fruit classification and instance segmentation, respectively. The process generates classification confidence, bounding boxes, and mask images.
Mask R-CNN combines semantic segmentation with object detection by outputting mask images. This improves the localization accuracy of small target fruits, as well as the prediction accuracy of mask images. Fruit detection and recognition methods based on Mask R-CNN have better robustness and generality, especially in situations of clustered fruit growth. Chinese researchers Yu, Y. et al. (2019) [65] and Jia, W. et al. (2020) [112] used a Mask R-CNN instance segmentation network model to recognize overlapping strawberries and apples, respectively. These models can determine not only categories but also individuals. Since some ripe green tomatoes are similar in color to branches and leaves, shaded by branches and leaves, or overlapped by other tomatoes, accurate detection and localization of these tomatoes is difficult. Chinese researchers Zu, L. et al. (2021) [113] proposed using Mask R-CNN for the detection and segmentation of ripe green tomatoes. The research results showed the effectiveness of the method. The best model performance was achieved when the IoU was 0.5, and the F1 score of both the testing set bounding box and the masked region reached 92%. Chinese researchers Xu, P. et al. (2022) [64] proposed an improved Mask R-CNN network model for the recognition of cherry tomatoes, considering the prior neighborhood constraint between fruits and stalks.
In the future, improvements in fruit detection and recognition methods based on Mask R-CNN should focus on integrating more convolutions to improve performance, and reducing the computational complexity of multi-head attention in the transformer. In addition, multimodal fruit detection methods could be adopted to design Mask R-CNN fruit detection models based on vision, LiDAR, millimeter-wave radar, and other multisensor fusion technologies.

Discussion
Currently, there are many factors leading to low accuracy, slow speed, and poor robustness of fruit detection and recognition. They can be summarized in the following aspects: scarcity of high-quality fruit datasets, detection of small target fruits, fruit detection in occluded and dense scenarios, detection of multi-scale and multi-species fruits, and lightweight fruit detection models.
(1) Scarcity of high-quality fruit datasets. Fruit datasets, as signal sources to guide fruit detection algorithms based on DL for information understanding [41], largely determine the final performance of trained fruit detection models. Fruit detection and recognition methods based on DL have two requirements for datasets. One is the sufficiency of data, and the other is the richness of data categories. Fruit datasets are mainly collected in real field environments and through internet channels. A comparison of the advantages and shortcomings of the two collection methods is shown in Table 4. In order to objectively compare the performance of fruit detection and recognition methods, international communities provide some public benchmark datasets, as shown in Table 5. Different fruit datasets differ significantly in the number, quality, and category of images. Researchers can choose suitable fruit datasets for experiments according to their needs. The Fruits-360 dataset is the most commonly used public benchmark dataset. The total number of categories in this dataset is as high as 131, and the total number of images is considerable. However, it suffers from uniform image backgrounds, insufficient data diversity, and category imbalance. When public benchmark fruit detection datasets cannot meet practical needs, some scholars create their own fruit datasets to train a fruit detection model for fruit detection and recognition in specific environments. In particular, most of the existing public benchmark fruit detection datasets, such as the fruit images of MS COCO and ImageNet, are collected through internet channels. Many of these images differ greatly from actual fruit recognition and harvesting situations. They consist of data from simple scenes, mainly for large and medium-sized fruits. Additionally, datasets for small target fruit detection in complex scenes are especially scarce. International communities might consider continually providing and updating quality public benchmark fruit detection datasets, for example, by establishing a unified standard fruit data-sharing platform. The public can upload their fruit images to the platform, and the platform organizes personnel to identify and annotate them.
Given the scarcity of high-quality fruit datasets, there are two potential directions for future development: (1) Fruit detection and recognition methods based on small-sample learning may be a key breakthrough. For certain fruit categories for which it is difficult to obtain a large number of samples, this method allows a small number of fruit samples to be selected as representative of new fruit categories. Then, the inherent connection between the base fruit class and the new fruit class is used to realize effective knowledge transfer. (2) Fruit detection and recognition methods based on unsupervised or semi-supervised learning may be another key breakthrough. Current methods are mainly based on supervised learning, in which performance relies on a large amount of labeled fruit data. In unsupervised or semi-supervised learning, the model is pre-trained on data with no or little labeled information.
(2) Detection of small target fruits. Fruits always grow in complex environments with many uncertainties (as shown in Figure 20). A small target fruit is usually defined in terms of absolute size, as one occupying fewer than 32 × 32 pixels in the image. The difficulties of small target fruit detection are as follows: (1) Limited features that can be extracted. Small target fruits have a small area share and low resolution in images, and they contain limited features themselves. (2) Convolution operations can cause loss of small target features. Fruit detection and recognition methods based on DL extract information of interest about fruits by performing convolution operations on fruit images containing a large amount of redundant information [50]. Fruit feature maps keep shrinking as the number of convolutions increases. If the down-sampling rate is too high, a lot of the detailed information needed for small target fruit detection will be lost. (3) Requirements for the positioning accuracy of small target fruit bounding boxes are higher. Compared with large target fruits, small target fruits are more sensitive to the offset of prediction boxes and less tolerant of errors. (4) The scale of anchor boxes is not designed properly. When the scale of anchor boxes is too large, a small target fruit occupies only a small share of the anchor area. Therefore, even if small target fruits are within anchor boxes, the IoU may not reach the threshold value, resulting in missed detection. In addition, when the receptive field is too large, the fruit detection results are easily disturbed by a large number of other features. When the preset scales of anchor boxes are too close to one another, the spatial difference after down-sampling cannot be guaranteed, resulting in small target fruits being ignored. (5) Sample imbalance. In IoU-based assignment of positive and negative samples, a sample is considered negative if its IoU is smaller than the threshold. This may lead to small target fruits being ignored in the process of model learning due to the small number of positive samples. Small target fruits usually grow in clusters, which may further cause occlusion and dense detection problems. When small target fruits appear together with fruits of other scales, this gives rise to multi-scale detection problems.
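Difficulties (3) and (4) above can be made concrete with a small calculation. The following is a minimal sketch, not taken from any of the reviewed papers; the box sizes, the 4-pixel offset, and the 0.5 IoU threshold are illustrative assumptions. It shows that the same localization error costs a small box far more IoU than a large box, and that a small fruit lying entirely inside an oversized anchor still yields a very low IoU.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx, dy):
    """Return the box translated by (dx, dy) pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Assumed example sizes: a 16 x 16 small fruit and a 128 x 128 large fruit.
small_fruit = (0, 0, 16, 16)
large_fruit = (0, 0, 128, 128)
offset = 4  # the same 4-pixel prediction error in x and y for both boxes

iou_small = iou(small_fruit, shifted(small_fruit, offset, offset))
iou_large = iou(large_fruit, shifted(large_fruit, offset, offset))

# A 64 x 64 anchor that fully contains a centered 16 x 16 fruit.
anchor = (0, 0, 64, 64)
fruit_in_anchor = (24, 24, 40, 40)
iou_anchor = iou(anchor, fruit_in_anchor)

print(f"IoU after 4 px shift, small box: {iou_small:.3f}")
print(f"IoU after 4 px shift, large box: {iou_large:.3f}")
print(f"IoU of small fruit inside large anchor: {iou_anchor:.4f}")
```

With an assumed positive-sample threshold of 0.5, the shifted small box would no longer count as a match while the shifted large box would, and the oversized anchor would never be assigned to the small fruit.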
Current solutions for small target fruit detection mainly include: (1) Increasing the number of small target fruit samples through data preprocessing and enhancement, such as in the research presented in [46,49,55,56,60,62–64]; (2) Generating higher-quality small target fruit candidate regions by improving the RPN, such as in the research presented in [56,57,65]; (3) Ensuring that the receptive field matches small target fruits by optimizing anchor boxes, such as in the research presented in [58,61,62]. Combining traditional methods for small target fruit image detection may be a trend for future development. Some small target fruit images contain little information, and they lack the necessary semantic information. Fruit detection and recognition methods based on DL have limited feature extraction ability for small target fruits at the pixel level. Therefore, traditional feature extractors can be introduced to make them more capable of representing features of fruit images. In addition, depth features extracted using CNN can also be combined with traditional methods, such as saliency detection and superpixel segmentation, to obtain a more effective fruit feature representation.
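One widely used way to optimize anchor boxes, as in solution (3) above, is to cluster the widths and heights of the labeled boxes with k-means using an IoU-based similarity, an approach popularized by YOLOv2. The sketch below is an illustration under stated assumptions and is not necessarily the procedure used in [58,61,62]: the function names, the deterministic initialization, and the box sizes in `boxes_wh` are all hypothetical.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes aligned at the same corner, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def kmeans_anchors(boxes_wh, k, iterations=50):
    """Cluster (w, h) pairs into k anchors, assigning each box to the
    center with which it has the highest IoU."""
    # Deterministic initialization (an assumption made for reproducibility):
    # pick k boxes evenly spaced in the ranking by area.
    ordered = sorted(boxes_wh, key=lambda wh: wh[0] * wh[1])
    step = (len(ordered) - 1) / max(k - 1, 1)
    centers = [ordered[round(i * step)] for i in range(k)]

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for wh in boxes_wh:
            best = max(range(k), key=lambda i: wh_iou(wh, centers[i]))
            clusters[best].append(wh)

        new_centers = []
        for i, members in enumerate(clusters):
            if not members:  # keep the old center if a cluster is empty
                new_centers.append(centers[i])
                continue
            mean_w = sum(m[0] for m in members) / len(members)
            mean_h = sum(m[1] for m in members) / len(members)
            new_centers.append((mean_w, mean_h))

        if new_centers == centers:  # assignments have stabilized
            break
        centers = new_centers

    return sorted(centers, key=lambda wh: wh[0] * wh[1])


# Hypothetical labeled box sizes in pixels: small, medium, and large fruits.
boxes_wh = [(10, 12), (12, 11), (14, 15), (40, 42), (44, 38), (90, 100), (110, 95)]
anchors = kmeans_anchors(boxes_wh, k=3)
print(anchors)
```

Because the smallest anchor is fitted to the small fruit boxes themselves, small targets are more likely to reach the IoU threshold during sample assignment than with anchors chosen for generic object sizes.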
(3) Fruit detection in occluded and dense scenarios. Fruit growth environments are usually complex. There are cases of inter-fruit occlusion and occlusion by shadows or other distractions, such as branches and leaves. The difficulty of detecting fruits in occluded and dense scenarios lies in improving the recall rate of the occluded target fruits [29]. In overlapping cases, the main reasons for missed detections of occluded target fruits are: (1) Fruits are incomplete, and extractable fruit features are sharply reduced. (2) Overlapping target fruits usually have highly similar features, and it is difficult for fruit detection models to determine whether they belong to different individuals. (3) The NMS post-processing method directly discards objects with lower scores in overlapping regions. For fruit detection in occluded and dense scenarios, the main method of improvement is to enhance fruit feature extraction.
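Reason (3) above can be illustrated with a short sketch of greedy NMS on two overlapping fruits. The boxes, scores, and thresholds are made-up values. For contrast, the sketch also includes Gaussian Soft-NMS, a known alternative that decays scores instead of discarding boxes; it is shown only as an illustration and is not a method attributed to the reviewed papers.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order; drop any box whose IoU with an
    already kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    removing them outright."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break
        keep.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep, scores


# Two adjacent fruits that overlap heavily, plus one isolated fruit.
boxes = [(0, 0, 100, 100), (30, 0, 130, 100), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]

nms_keep = greedy_nms(boxes, scores)
soft_keep, soft_scores = soft_nms(boxes, scores)
print("Greedy NMS keeps:", nms_keep)
print("Soft-NMS keeps:", soft_keep, "with scores", [round(s, 3) for s in soft_scores])
```

In this example the second fruit overlaps the first with an IoU of about 0.54, so greedy NMS discards it even though it is a distinct fruit, whereas Soft-NMS retains it with a reduced score.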
Commonly used methods to enhance the feature extraction capability of fruit detection models are: (1) Increasing the width or depth of networks, such as in the research presented in [52,53,57,59]. However, this method will increase the computational load of models. This requires us to strike a balance between performance improvement and computational cost increase. (2) Adding attention mechanisms, such as in the research presented in [55–57]. The introduction of attention mechanisms can help fruit detection models fully consider the connection between each position of target fruits, effectively enhancing the ability of fruit detection models to learn fruit features. Researchers currently divide attention mechanisms into channel attention and spatial attention, according to the way the attention acts on feature maps. In fruit detection models, common implementations of attention mechanisms include squeeze-and-excitation networks (SENet) and the convolutional block attention module (CBAM). However, adding an attention mechanism will make fruit detection models more complex and increase convergence time. At the same time, adding an attention mechanism requires careful consideration of whether the design principle of attention, as well as the position and method of action, are suitable for current tasks. Otherwise, it may have a negative impact on fruit detection models. How to reasonably design and implement attention mechanisms, and efficiently use a wide range of environmental features, are important research directions for the future.
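The channel attention idea behind SENet mentioned above can be sketched in a few lines. This is a simplified, framework-free illustration: the weights `w_reduce` and `w_expand` are hypothetical fixed values rather than learned parameters, bias terms are omitted, and the tiny 2-channel feature map is made up for the example.

```python
import math


def se_block(feature_map, w_reduce, w_expand):
    """Squeeze-and-excitation style channel attention.

    feature_map: list of C channels, each an H x W list of lists.
    w_reduce:    (C / r) x C weight matrix of the first fully connected layer.
    w_expand:    C x (C / r) weight matrix of the second fully connected layer.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: reweight every channel by its gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Hypothetical 2-channel, 2 x 2 feature map and fixed weights (reduction ratio 2).
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_reduce = [[0.5, 0.5]]
w_expand = [[1.0], [-1.0]]

scaled, gates = se_block(feature_map, w_reduce, w_expand)
print("Channel gates:", [round(g, 4) for g in gates])
```

The extra parameters are only the two small fully connected layers, which is why such modules add model complexity and training cost without changing the shape of the feature map.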
(4) Detection of multi-scale and multi-species fruit. Most current fruit detection models are solutions for specific crops. When the detected fruits appear on multiple scales or in multiple types, it is difficult to guarantee the model's generalization ability. For the multi-scale fruit detection problem, the multi-scale training method may be a key breakthrough. It can enable fruit detection models to process fruit information at different scales and improve their ability to capture cross-scale fruit information. Overall, the use of multi-scale fruit prediction networks can make full use of receptive fields, which can effectively alleviate the lack of scale invariance in convolutional neural networks. However, this also increases the number of calculations, resulting in higher demand for hardware facilities. For multi-category fruit detection problems, a common solution is to use transfer learning technology to fine-tune existing models. However, this may result in a loss of detection accuracy for the original fruit categories. Adding a large amount of new category data to the original dataset and retraining a new model can ensure the detection effectiveness of the original fruit categories. However, every time a new category appears, the model needs to be trained from scratch. This not only consumes time and resources, but also cannot satisfy complex, dynamic environments such as farmland orchards. For this problem, we believe that introducing the idea of incremental learning can improve the generalization ability and adaptive learning ability of fruit detection models.
(5) Lightweight fruit detection models. With the continuous development of the field of fruit detection and recognition, researchers are committed to improving the accuracy of fruit detection models, and the models are gradually becoming more complex. For example, some researchers add a super-resolution module to the localization part of fruit detection networks. This may increase the computational load, which, in turn, makes the fruit detection model more dependent on high-performance computing resources. How to further optimize network structures, reduce the number of model parameters, decrease computational complexity, improve running speed, and deploy models on mobile devices are currently hot research topics. Model pruning, quantization, knowledge distillation, and matrix decomposition are effective ways to achieve lightweight and efficient models. Another approach is to replace the original backbone feature extraction network, for example with the lightweight MobileNet [11] or with ResNet-101 [23], for fruit feature extraction. At the same time, optimizing operators within frameworks and using AI chips in hardware can greatly accelerate the running speed and parallelism of fruit detection models.
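To give a sense of why a MobileNet-style backbone is lighter, the following sketch compares the weight count of a standard convolution with that of a depthwise separable convolution, the building block MobileNet is based on. The layer size (3 × 3 kernel, 256 input and output channels) is an assumed example, and bias terms are ignored.

```python
def standard_conv_params(kernel_size, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel_size * kernel_size * c_in * c_out


def depthwise_separable_params(kernel_size, c_in, c_out):
    """Weights in a depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    depthwise = kernel_size * kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed example layer: 3 x 3 kernel, 256 input channels, 256 output channels.
k, c_in, c_out = 3, 256, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
reduction = standard / separable

print(f"Standard convolution:  {standard} weights")
print(f"Depthwise separable:   {separable} weights")
print(f"Reduction factor:      {reduction:.2f}x")
```

The ratio of separable to standard weights equals 1/c_out + 1/k², so for 3 × 3 kernels the saving approaches a factor of nine as the number of output channels grows.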

Conclusions
Fruit detection and recognition methods based on DL are the mainstream methods for accurate, fast, and robust fruit detection and recognition. These methods are also an important development trend. They are relatively less affected by environments. Our work focuses on providing an overview and review of DL applied to fruit image recognition, mainly in the areas of detection and classification. In order to further define the study areas of this paper, we define fruit detection and classification tasks as determining the class of a fruit based on its specific type. In general, current fruit detection and recognition methods based on DL can be divided into the following areas: methods based on YOLO, SSD, AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN. These methods can also be classified into two categories: single-stage fruit detection and recognition methods (YOLO, SSD) based on regression, and two-stage fruit detection and recognition methods (AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN) based on candidate regions.
Most of the current research work is based on two-stage fruit detection and recognition methods. Improvement and application research based on Faster R-CNN (21%) is currently a hotspot. The recognition accuracy of fruit detection and recognition methods based on Faster R-CNN is high, but the recognition speed is limited by complex anchor box mechanisms. When there are mobile deployment and high recognition speed requirements, fruit detection and recognition methods based on YOLO (17%) are used most frequently. Their recognition speed is fast, but the recognition effect on small target fruits is not very good. In addition, ResNet (11%) is the most popular backbone network, followed by AlexNet (7%). Most of the research focuses on apples (32.14%), followed by tomatoes (8.93%) and citrus (7.14%). These three kinds of fruits are in high demand and yield globally. Several reasons make them ideal candidates for automatic harvesting. Firstly, they hang from plants individually, making them easily detectable based on their distinctive features. Secondly, they have no extreme variations in size or weight. Lastly, they are relatively hard and not easily damaged in mechanical operations. However, in terms of fruit dimensions and peduncle length, different cultivars may exhibit different characteristics that can affect fruit detection and recognition performance. This poses challenges for adapting fruit detection and recognition methods to different cultivars. Future work could aim to identify cultivars that are more suitable for automatic harvesting.
The scarcity of high-quality fruit datasets, detection of small target fruits, fruit detection in occluded and dense scenarios, detection of multiple scales and multiple species of fruits, and lightweight fruit detection models are the current challenges of fruit detection and recognition based on DL for automatic harvesting. The quality and scale of fruit datasets, appropriate improvement strategies, and underlying model architectures all have a significant impact on the detection and recognition performance. For example, fruit data preprocessing can standardize data by cleaning and adjusting them. Fruit data augmentation can effectively expand data and increase data diversity, thereby reducing the dependence on specific factors and improving model robustness. Fruit feature fusion is conducive to alleviating the problem of fruit feature disappearance and improving the detection effect of small target fruits and multi-scale fruits. Building a multi-task learning model on top of the original fruit detection framework is beneficial for obtaining more fruit information by combining other learning tasks. Moreover, establishing a parameter-sharing mechanism through multi-task learning can significantly improve the performance of fruit detection

Figure 2. Different processes of fruit detection and recognition based on DL (image reprinted with permission from ref. [16]. 2023, Xiao F.).

Figure 3. Different types of imaging sensors commonly used for fruit vision detection systems (Accessed on 5 January 2023).

Figure 5. Main research processes of fruit detection and recognition methods based on DL.
(a) Distribution of articles per year. (b) Distribution of network models used. (c) Distribution of crops detected and recognized.
Agronomy 2023, 13, x FOR PEER REVIEW

Figure 6. Summary of literature search.

Figure 7. Outline of the article.

Figure 8. Basic architecture of DL-based ANN for fruit detection and recognition.

Figure 9. Typical CNN framework for fruit detection and recognition.

Compared with digital image processing and traditional ML techniques, fruit detection and recognition methods based on CNN have great advantages in terms of accuracy. Jahanbakhshi, A. et al. (including Iranian researchers and a researcher based in the UK) (2020) [46] proposed an improved CNN (15, 16, and 18 layers) to detect apparent defects in sour lemons. Compared with traditional feature extraction and classification methods, such as histogram of oriented gradient (HOG), local binary pattern (LBP), support vector machine (SVM), k-nearest neighbor (KNN), decision tree, and fuzzy classification, the improved CNN performed better, achieving an accuracy of 100%. Bangladeshi researchers Sakib, S. et al. (2019) [47] proposed a fruit detection system using CNN. The Fruits-360 dataset was utilized to evaluate the proposed system. The training

Figure 11. Comparison of one-stage and two-stage fruit detection and recognition methods.

2.1. Single-Stage Fruit Detection and Recognition Methods Based on Regression
2.1.1. Fruit Detection and Recognition Methods Based on YOLO
YOLO is one of the most classic and advanced fruit detection algorithms. It can detect and classify target fruits simultaneously in a single image. As shown in Figure 12, YOLO-v1 was the beginning. YOLO-v1 was proposed by American researchers Redmon, J. et al. in 2015 … also appeared one after another. YOLO-v8 is a SOTA model. It was open-sourced on January 10, 2023. The framework is shown in Figure 13. Specific innovations include a new backbone network, a new anchor-free detection head, and a new loss function, and the model can run on various hardware platforms from CPU to GPU.

Figure 12. Main research processes of YOLO.

Figure 13. YOLO-v8 framework (image reprinted with permission from ref. [78]. 2023, Lou, H.).

2.1.2. Fruit Detection and Recognition Methods Based on SSD
SSD was proposed by American researchers Liu, W. et al. in 2016 [79]. A typical SSD framework for fruit detection and recognition is shown in Figure 14. It consists of a base network (such as VGG-16) and an additional set of convolutional and pooling layers for fruit feature extraction and detection. It also includes an NMS layer for filtering and selecting the detection results. It borrows the idea of multi-scale fruit detection. Fruit detection tasks are accomplished by generating multiple fruit feature maps of different scales during the fruit detection process. The network model calculates confidence scores for each category in predicted boxes and ground truth boxes, respectively. Then, an NMS operation is performed on the calculated scores of each prediction box. Finally, top-ranked prediction boxes are outputted as the final result of fruit detection. Validated on multiple fruit datasets, fruit detection and recognition methods based on SSD have high accuracy and speed. Vasconez, J.P. et al. (including Chilean researchers and a researcher based in America) (2020) [60] evaluated two of the most widely used architectures (Faster R-CNN with Inception V2 and SSD with MobileNet) for fruit detection. The former achieves 4.55 FPS, whereas the latter achieves a significantly higher speed of approximately 16.67 FPS. However, it is worth noting that fruit detection and recognition methods based on SSD preprocess input fruit images, which may lead to lower fruit detection accuracy for relatively small target fruits when passing through deeper convolutional layers. Chinese researchers Liang, Q. et al. (2018) [80] proposed a real-time detection method for on-tree mangoes based on SSD. New sampling strategies

Figure 14. Typical SSD framework for fruit detection and recognition.


2.2. Two-Stage Fruit Detection and Recognition Methods Based on Candidate Regions
2.2.1. Fruit Detection and Recognition Methods Based on AlexNet, VGGNet, and ResNet
Typical AlexNet, VGGNet, and ResNet frameworks for fruit detection and recognition are shown in Figure 15. AlexNet was proposed by American researchers Krizhevsky, A. et al. in 2012 [81]. It is the first DL framework that extends CNN to the field of CV. Compared with techniques based on digital image processing and traditional ML, fruit detection and recognition methods based on AlexNet have great advantages in terms of accuracy. Chinese researchers Zhu, L. et al. (2018)

Figure 16. Typical R-CNN, Fast R-CNN, and Faster R-CNN frameworks for fruit detection and recognition.

… difference between SegNet and FCN is the method used for up-sampling low-resolution feature maps to high-resolution feature maps.
(a) Single fruit (b) Multiple fruits (c) High degree of occlusion (d) Different background (e) Low light (f) Dense fruit aggregation

Table 1. Comparison of different types of imaging sensors commonly used in fruit vision detection systems.
(Lift, Splat, Shoot) Active RGB and depth images Complete fruit scene characteristics Lack of feature descriptors

Table 2. Fruit detection and recognition methods based on DL.
…s | Accuracy | Applied Crops | Advantages | Disadvantages
… | 84-98% | cabbage, citrus, lychee, mango, tomato | High fruit detection speed; it can meet real-time requirements well for automatic harvesting | Fruit detection accuracy under severe occlusion, low resolution, and changing lighting conditions is low
… | … | … | … | Insensitive to the details of fruits in fruit images; fruit classification does not consider inter-pixel relationships
…t | 83-95% | apple, tomato | Obtaining edge contours and maintaining the integrity of high-frequency details in segmentation | Neighboring information may be ignored when fruit feature maps with low resolution are unpooled
… | 80-94% | apple, strawberry, tomato | Combining semantic segmentation with fruit detection by outputting mask images | Fruit detection speed is slow, and it cannot meet real-time requirements well

Table 3. Comparison of different fruit detection and recognition methods based on DL.

Table 4. Comparison of fruit image collection methods.

Table 5. Some frequently used fruit image databases.