A Review of Convolutional Neural Network Applied to Fruit Image Processing

Abstract: Agriculture has always been an important economic and social sector for humans. Fruit production is especially essential, with great demand from all households. Therefore, the use of innovative technologies is of vital importance for the agri-food sector. Currently, artificial intelligence is a very important technological tool widely used in modern society. In particular, Deep Learning (DL) has several applications due to its ability to learn robust representations from images. The Convolutional Neural Network (CNN) is the main DL architecture for image classification. Based on the great attention that CNNs have received in recent years, we present a review of the use of CNNs applied to different automatic processing tasks of fruit images: classification, quality control


Introduction
Agriculture is essential for humans because they depend on it directly for food production. Fruits in particular are bought by nearly every household and are rich in nutrients; thus, a continuous supply is required to satisfy the demand of the growing world population [1][2][3]. For this reason, the entire agri-food chain faces increasing challenges, which require innovative technologies to improve its productivity. Among other applications, computational technologies have been adopted for fruit recognition and for the effective detection of specific fruit defects, in both wholesale and retail markets [4][5][6].
Computer vision is one of the most widely used technological tools in the agro-industrial field, in automatic fruit harvesting, fruit sorting machines, and fruit scanning in supermarkets [7,8]. Vision systems typically work with different types of data generated by sensors or cameras. This data can be RGB images, RGB-depth (RGB-D) images, or hyperspectral images, among many other types. Computational methods and algorithms then extract and process the features required for the task at hand in the fruit industry. For example, supermarkets require a fruit recognition process, whereas harvesting in an orchard requires accurate fruit detection.
Nowadays, artificial intelligence (AI) is a field with several practical applications in a wide range of industries and with many active research topics. The main challenge for AI is to solve tasks that people solve intuitively but that are hard to implement computationally [9]. Therefore, AI systems must be able to acquire their own knowledge by extracting patterns from raw data, which is known as machine learning [9][10][11][12]. Thus, AI-based techniques are very useful for solving complex problems where traditional methods would not be efficient.
Machine learning (ML) allows researchers and developers to computationally address problems that involve knowledge of the real world. ML endows computers with the ability to act without being explicitly programmed, building algorithms that recognize patterns in data and make predictions based on them [13][14][15]. ML-based systems are applied in several areas, such as information analysis, agriculture, ecology, mining, urban planning, defense, and space exploration, among others [9][10][11][12][16][17][18].
Currently, deep learning (DL) is one of the most used ML-based methods. An important characteristic of DL is that it works at high levels of abstraction and can automatically learn patterns present in images. In particular, the Convolutional Neural Network (CNN) [19] is the main DL architecture used for image processing [9,12,20,21]. CNNs are a kind of artificial neural network (ANN) that uses convolution operations in at least one of its layers [9,19]. Since 2012, when Krizhevsky et al. [22] won the ImageNet competition (ILSVRC) [23], CNNs have gained great popularity as an efficient method for image classification in many fields. Specifically in agriculture, CNN-based approaches have been used for fruit classification [24][25][26] and fruit detection [27,28].
To define the study areas of our review, we identify fruit classification as determining the class of a fruit according to its specific type. Fruit quality control focuses on determining internal and external damage, as well as the degree of maturity. Finally, fruit detection is oriented toward automatic harvesting.
Based on the great attention that CNNs have received in recent years, and unlike the existing surveys, we present a comprehensive review of the use of CNNs applied to fruit image processing, mainly in the areas of classification, quality control, and detection. Additionally, to give researchers in agriculture a better understanding of CNNs, we introduce a practical theoretical framework that illustrates their operation and apply it in different examples. Compared to previous reviews in the literature, the main contributions of this paper are as follows:

• To the best of our knowledge, this paper is the first study that extensively reviews the application of CNN-based models to fruit image processing.

• Our study covers very recent literature, from 2015 to the present, due to the novelty of the use of CNNs in the studied area.

• We summarize the main aspects, properties, and results of the collected works in three main areas of the agri-food industry: fruit classification, fruit quality control, and fruit detection.

• Aiming to give a better understanding of how CNN models are implemented, we present a theoretical background on CNNs and also provide two practical examples of CNN models for fruit classification.
This paper is organized as follows. Section 2 summarizes previous reviews of ML and computer vision methods applied to fruit studies and describes the article search strategy. Section 3 introduces the principles and basic concepts of CNNs. Sections 4-6 present the review of the state-of-the-art CNN-based approaches for fruit image processing. In Section 7, we discuss the main aspects related to the studied works. Section 8 presents different frameworks to develop CNNs and two practical examples. Finally, we give the conclusions in Section 9.

Preliminaries
Fruits have great relevance for humans because of their nutritional value. Consequently, research on fruit processing is very important for several economic sectors, both for the wholesale and retail markets, as well as for the processing industries. Hence, different methods have been developed to automatically process fruits, either to classify them or to efficiently estimate their quality.
In 2017, Liu et al. [29] presented a literature review of the latest methods for fruit identification. They searched the recently published literature on the topic and selected eleven publications that they considered relevant. Of these, only four works correspond to DL or other traditional ML techniques [6,30,31,32], showing that at that time, despite the great interest generated by CNNs, their use had not yet extended to the study and analysis of fruits. They conclude that the support vector machine (SVM) and its variants are an excellent method for these studies. Besides, they suggest that DL models, and especially CNNs, should be applied more frequently because of their success in computer vision in other application areas.
Similarly, Naik and Patel [33] give a basic overview of the fruit classification process based on computer vision technology. They study feature extraction methods such as Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Speeded Up Robust Features (SURF). Besides, the authors analyze ML-based approaches such as Support Vector Machine (SVM), K-nearest neighbor (KNN), and CNN. Their work emphasizes that DL-based algorithms, and especially CNNs, were becoming very popular (in 2017) because these algorithms automatically learn the characteristics of the images and reduce the error in the image recognition process. However, they do not present a summary or review of the published articles that apply these algorithms.
It is worth mentioning the work of Zhu et al. [34], which argues that researchers in agriculture do not pay attention to the mechanisms behind software frameworks and simply use them. In their paper, they provide a summary of DL-based algorithms and examine the main concepts, limitations, implementation, and training processes. Their work is relevant because it helps researchers in agriculture to better understand the major DL techniques. In the final section, to show the broad spectrum of DL applications in agriculture, they perform a meta-analysis (i.e., a bibliometric analysis) of DL-based applications in smart agriculture. The authors find that most of the recent works in agricultural innovation are closely related to production or to other tasks that improve crop productivity, reduce plant diseases, and automate agriculture or agro-industry.
The review by Bhargava and Bansal [35] analyzes the use of computer vision and image processing techniques in the agri-food industry. They state that the most relevant quality properties of agricultural products are color, size, texture, shape, and defects. Hence, the authors present an overview of different methods for preprocessing, segmentation, feature extraction, and classification using these features. They study several approaches for classification in food quality evaluation, including KNN, SVM, ANN, and CNN. According to them, DL-based approaches such as convolutional neural networks are very efficient for fruit classification and recognition, reducing the classification error remarkably. Nevertheless, although CNNs had recently received much more attention than any other ML algorithm, they had not been widely used in fruit research: the review reports a single article using a CNN [36] as relevant at that date.
In Reference [37], Hameed et al. compare different computer vision methods to classify fruits and vegetables, based on SVM, KNN, decision trees, ANN, CNN, and other feature extraction methods. Moreover, they highlight the fact that several classification approaches have been proposed for quality assessment and automatic harvesting, but these techniques are limited to a few classes and small datasets. Besides, their paper identifies three main groups of classification applications for fruits and vegetables: quality assessment, automatic harvesting, and supermarket inventory. Like the review mentioned above, it identifies only two articles in which CNNs are applied.
On the other hand, Li et al. [38] recently reviewed non-destructive optical techniques applied to berry quality control (e.g., strawberry and blueberry). They analyze different data acquisition techniques, such as computer vision systems, Vis-NIR spectroscopy, laser-induced methods, and thermal, multispectral, and hyperspectral devices. Besides, the authors examine the latest analysis techniques, such as photoacoustics, odor images, X-rays, micro-destructive tests, terahertz spectroscopy, and intelligent analyzers based on mobile terminals. However, they do not analyze the algorithms or methods used to process the obtained datasets. They present a table that summarizes more than 45 papers (from 2011 to 2019), of which only two articles are based on CNNs.
Notwithstanding the considerable attention that CNNs have gained, and although there is a great diversity of fruit study areas, fruit studies applying CNNs had not become widespread until the end of 2018 [29,33,34,35,37,38]. For the present review, the search strategy was designed as follows. First, we limited the year range of the search from 2015 to the present, due to the novelty of the use of CNNs in the studied area. Second, the main search keywords were "Fruits" and "Convolutional Neural Network", as well as their possible combinations with auxiliary keywords such as "deep learning", "classification", "quality", and "detection". The search was performed on the well-known scientific databases Web of Science and SCOPUS, as well as through direct searches on major publisher and repository platforms such as MDPI, IEEE Xplore Digital Library, and arXiv. Finally, the results were separated into the three studied groups of applications: classification, quality control, and detection for automatic harvesting. These three groups cover almost the entire fruit treatment chain from harvest to final consumer.
We observed that, by the end of 2019, this type of study dedicated to fruits followed the general trend of CNN applications. Figure 1 shows the number of articles (by group and in total) that use a CNN for fruit image processing.

Background on Convolutional Neural Networks
Multilayer networks can learn complex and high-dimensional patterns from large datasets, making them obvious candidates for image recognition tasks [19]. In particular, convolutional neural networks are a special kind of multilayer neural network, which was first proposed by LeCun et al. in 1998 [19] and has several practical applications [9,39,40,41]. Figure 2 shows the original architecture of the first CNN model, called LeNet-5 [19]. CNNs gained great popularity when the AlexNet model [22] won the ImageNet (ILSVRC) competition [23].

CNN Architecture
Contrary to traditional neural networks, CNNs use convolution operations in at least one of their layers [9,19]. The CNN architecture includes multiple stages or blocks composed of four main components: a filter bank (the kernels), a convolution layer, a nonlinear activation function, and a pooling or subsampling layer. Each stage aims to represent features as sets of arrays called feature maps (see Figure 2) [12,19,42,43]. We depict a typical CNN architecture in Figure 3, comprising a stack of several convolutional stages and one or more fully connected layers, which act as a classification module and give the final output. In the following, we introduce the main components of a typical CNN architecture.
Filter bank or kernels: each filter or kernel aims to detect a particular feature at each input location; therefore, a spatial translation of the input to a feature detection layer is carried over to the output, which is otherwise unchanged [43]. As defined by LeCun [43], there is a bank of $m_1$ filters in each convolutional layer, and the output of the $l$th layer consists of $m_1^{(l)}$ feature maps $Y_i^{(l)}$ of size $m_2^{(l)} \times m_3^{(l)}$. The $i$th feature map is computed as follows:

$$Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{ij}^{(l)} * Y_j^{(l-1)} \tag{1}$$

where $B_i^{(l)}$ denotes the trainable bias parameters matrix, $K_{ij}^{(l)}$ is the filter with dimensions $(2h_1^{(l)} + 1) \times (2h_2^{(l)} + 1)$ that connects the $j$th feature map of layer $(l-1)$ with the $i$th feature map of layer $(l)$, and $(*)$ is the 2D discrete convolution operator.
Convolution layer: the convolution operation is widely used in digital image processing, where the 2D matrix representing the image ($I$) is convolved with a smaller 2D kernel matrix ($K$). With zero padding (i.e., $I$ is taken as zero outside its borders), the mathematical formulation is given by [9]:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) \tag{2}$$

In the convolution process, a small sliding filter moves across the image from left to right and from top to bottom. Figure 4 shows an example of the convolution operation with an input image (4 × 4) and a convolution kernel (3 × 3), obtaining an output convolved image. At each location, the sum of the products between each kernel element and the corresponding input element is computed. This process is repeated using different kernels to form as many output feature maps as desired [44]. The output feature map is smaller than the input. Alternatively, a padding technique can be applied to keep the same in-plane dimensions by adding zeros around the input, so that the center of the kernel can be placed on the outermost elements [44,45]. Besides, the stride denotes the step between two successive positions of the kernel. Generally, a stride equal to 1 is chosen, but sometimes a stride greater than 1 is used to reduce the resolution of the feature maps, which amounts to subsampling.
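The sliding-window computation, padding, and stride described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the reviewed works: the function name `conv2d`, the sample image, and the kernel values are hypothetical. As most deep learning frameworks do, the sketch slides the kernel without flipping it (cross-correlation); flipping the kernel in both directions first would give the strict mathematical convolution.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (lists of lists) and return the output map.

    Assumption: the kernel is not flipped (cross-correlation), as in most
    deep learning frameworks. `padding` adds that many rings of zeros.
    """
    if padding:
        width = len(image[0]) + 2 * padding
        top = [[0] * width for _ in range(padding)]
        bottom = [[0] * width for _ in range(padding)]
        middle = [[0] * padding + list(row) + [0] * padding for row in image]
        image = top + middle + bottom

    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1

    output = []
    for r in range(out_h):
        row = []
        for s in range(out_w):
            y, x = r * stride, s * stride
            # Sum of products between kernel elements and the covered patch.
            row.append(sum(kernel[u][v] * image[y + u][x + v]
                           for u in range(kh) for v in range(kw)))
        output.append(row)
    return output


# Hypothetical 4 x 4 input and 3 x 3 vertical-edge kernel.
image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

valid = conv2d(image, kernel)                         # no padding: 2 x 2 output
same = conv2d(image, kernel, padding=1)               # zero padding keeps 4 x 4
strided = conv2d(image, kernel, stride=2, padding=1)  # stride 2 subsamples to 2 x 2
```

Without padding, the 4 × 4 input and the 3 × 3 kernel give a 2 × 2 output, since the output size along each dimension is (input + 2 · padding − kernel) / stride + 1, rounded down.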
Nonlinear activation function: after the filter bank produces its output (Equation (1)), a nonlinear activation function is applied to produce the activation maps, where only the activated features are carried forward to the next layer. This function determines the behavior of the neuron output. The operation of the activation function $f(\cdot)$ is as follows:

$$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right) \tag{3}$$

where $Y_i^{(l-1)}$ is the $i$th feature map delivered by the preceding layer and $f$ is applied element-wise. There are different types of activation functions. Currently, the most widely used in CNNs are:

• Rectified Linear Unit function (ReLU): ReLU is the most used activation function for convolution layers. It is a half-rectified function [19,46] (see Figure 5a). It is mathematically defined as:

$$f(x) = \max(0, x) \tag{4}$$

• Sigmoid function: its curve is S-shaped, as shown in Figure 5b [9,10]. The function takes values in the range [0, 1]; therefore, it is used to predict a probability as an output. Mathematically, it has the form:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{5}$$

• Hyperbolic tangent (tanh) function: the tanh function has a similar form to the sigmoid function [9,10], as depicted in Figure 5c, but its range is [−1, 1]. The advantage is that zero inputs are mapped near zero and negative inputs are mapped to strongly negative values. Its mathematical definition is:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{6}$$

Pooling layer: it reduces the number of parameters of the network by reducing the spatial size of the convolutional outputs. Additionally, pooling operations contribute to obtaining a representation that is invariant to small translations of the input [9,11,47]. The two main pooling operations are explained below, and Figure 6 depicts an example of pooling operations using a 2 × 2 filter.

• Max pooling: it calculates the maximum value for each patch of the input [48,49]. The max-pooling layer preserves the maximum value of each patch by sliding the filter over the feature map. Mathematically, it has the form:

$$y_{i,j} = \max_{(p,q) \in R_{i,j}} x_{p,q} \tag{7}$$

where $x$ is the input feature map and $R_{i,j}$ is the pooling region (patch) that corresponds to the output element $y_{i,j}$. Commonly, a max-pooling layer applies 2 × 2 filters with a stride of 2. It downsamples the input by a factor of 2 along each spatial dimension and discards 75% of the convolutional outputs.

• Average pooling: it computes the average value for each patch of the input [48,49]. The average pooling layer downsamples the convolutional activations by dividing the input into pooling regions and computing their average values. It is mathematically defined as follows:

$$y_{i,j} = \frac{1}{|R_{i,j}|} \sum_{(p,q) \in R_{i,j}} x_{p,q} \tag{8}$$

where $x$ is the input feature map, $R_{i,j}$ is the pooling region that corresponds to the output element $y_{i,j}$, and $|R_{i,j}|$ is the number of elements in that region.

Dropout layer: it is a regularization layer that randomly drops neuron units of the network during training, preventing the units from co-adapting too much. The dropout technique helps to address the overfitting problem and, at the same time, improves the performance of the network. It can be applied to any layer in the network.
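The activation functions and the two pooling operations described above can be illustrated with a short sketch in plain Python. It is only an illustration under stated assumptions: the helper names (`apply_activation`, `pool2d`) and the sample feature map are hypothetical, and real frameworks operate on multi-channel tensors rather than nested lists.

```python
import math


def relu(x):
    """Half-rectified function: negative inputs become zero."""
    return max(0.0, x)


def sigmoid(x):
    """S-shaped function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent with outputs between -1 and 1."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def apply_activation(feature_map, f):
    """Apply the activation function f element-wise to a 2D feature map."""
    return [[f(v) for v in row] for row in feature_map]


def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size patches."""
    reduce_patch = max if mode == "max" else (lambda p: sum(p) / len(p))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values of the pooling region for output (i, j).
            patch = [feature_map[i * stride + p][j * stride + q]
                     for p in range(size) for q in range(size)]
            row.append(reduce_patch(patch))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 convolutional output.
feature_map = [[1.0, -2.0, 3.0, 0.0],
               [4.0, 5.0, -6.0, 2.0],
               [-1.0, 0.0, 2.0, 2.0],
               [3.0, -3.0, 1.0, 7.0]]

activated = apply_activation(feature_map, relu)
max_pooled = pool2d(activated, mode="max")       # [[5.0, 3.0], [3.0, 7.0]]
avg_pooled = pool2d(activated, mode="average")   # [[2.5, 1.25], [0.75, 3.0]]
```

With a 2 × 2 filter and a stride of 2, the 16 activations are reduced to 4 values, which is the 75% reduction mentioned for max pooling.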
Fully connected (FC) layer: the final output of the convolutional stages is flattened to a 1D array and connected to a fully connected layer. FC layers take the results of the convolution/pooling process and use them to classify the image into a label (i.e., class), like a traditional neural network. The activation function of the last layer (i.e., the output layer) computes the final probabilities for each class and is selected according to the task. Typically, a multi-class classification task uses the Softmax function, where each class probability value lies in the range [0, 1] and their total sum is equal to 1. Finally, each output neuron corresponds to one of the labels, and the greatest output value gives the classification decision.
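As a small illustration of the output layer, the following sketch applies the Softmax function to a vector of raw scores and selects the class with the greatest probability. The class names and score values are hypothetical and only serve the example.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["apple", "banana", "orange"]  # hypothetical labels
scores = [2.0, 0.5, 1.0]                     # hypothetical last-layer outputs

probabilities = softmax(scores)
# The classification decision is the class with the greatest probability.
predicted = class_names[probabilities.index(max(probabilities))]
```

Subtracting the maximum score before exponentiation does not change the result but keeps the computation numerically stable for large scores.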

Training Process of CNN
The training process optimizes the layer parameters of a neural network to minimize the differences between the given labels of a training dataset and the output predictions. The backpropagation algorithm is the most commonly used method for training neural networks. The training process with backpropagation is as follows:

1. Select a training dataset of images, usually taken in batches of smaller size.

2. Pass each batch through the network and obtain the output.

3. Compute the error between the given labels and the output predictions by using a loss function L.

4. Propagate the error throughout the network by the backpropagation algorithm.

5. Update the weights W to minimize the error.

6. Repeat until convergence or until a limit of iterations is reached.
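The six steps above can be sketched on a deliberately tiny model. The example below is an assumption-laden illustration, not a CNN: it fits a single linear neuron y = w · x + b with a mean squared error loss and a plain gradient descent update, so that backpropagation reduces to the chain rule on one layer. The function name `train`, the hyper-parameter values, and the synthetic data are all hypothetical.

```python
import random


def train(samples, learning_rate=0.05, batch_size=4, max_epochs=500, tolerance=1e-6):
    """Fit y = w * x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(0)  # fixed seed so that the run is reproducible
    samples = list(samples)
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(max_epochs):                           # step 6: iteration limit
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]     # step 1: take a batch
            outputs = [w * x + b for x, _ in batch]       # step 2: forward pass
            errors = [out - y for out, (_, y) in zip(outputs, batch)]
            batch_loss = sum(e * e for e in errors) / len(batch)  # step 3: loss L
            # Step 4: gradients of batch_loss with respect to w and b (chain rule).
            grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= learning_rate * grad_w                   # step 5: update weights
            b -= learning_rate * grad_b
        loss = sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)
        if loss < tolerance:                              # step 6: convergence
            break
    return w, b, loss


# Synthetic, noise-free data generated from y = 3x - 1.
data = [(x / 10, 3.0 * (x / 10) - 1.0) for x in range(-10, 11)]
w, b, final_loss = train(data)
```

In a real CNN the same loop applies, but the forward pass runs through all layers and the gradients of every filter and bias are obtained by backpropagating the error layer by layer.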
To carry out the previous steps and train a CNN, we must consider the following aspects:

• Define the CNN architecture: it consists of establishing the number of layers of each type, as well as the size and number of filters for each layer. The architecture design always depends on the objective of the CNN.

• Loss function: it measures the difference between the given ground-truth labels and the outputs of the network. Typically, the Mean Squared Error function is applied, which is given by:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \tag{9}$$

where $N$ is the number of samples, $y_i$ is the ground-truth label, and $\hat{y}_i$ is the network output for the $i$th sample. Hence, L must be minimized, which requires finding the contribution of each weight to the error and optimizing the weights accordingly.
The gradient descent algorithm is widely adopted for the minimization procedure; it relies on the partial derivatives of the loss function with respect to the weights. The parameter update process is formulated as follows [19]:

$$W \leftarrow W - \alpha \frac{\partial L}{\partial W} \tag{10}$$

where α denotes the learning rate. Thus, the learning rate is a very important hyper-parameter and must be established before starting the training process. It should be noted that a lower learning rate can give a more accurate result, but the network may take longer to train.

• Training dataset: the available data is generally divided into three subsets: a training set to train the network, a validation set to evaluate the model during the training process, and a testing set to evaluate the final trained model. Most CNN frameworks require that all training data have the same shape (i.e., dimensions). Therefore, pre-processing to normalize the data is the first step before the training process.
Another important point is that the dataset should be balanced, which means having the same number of images for each class. If the dataset does not have a sufficient number of images, it is recommended to apply a data augmentation technique. It consists of increasing the amount of training data by applying a series of transformations, such as rotations, translations, and mirroring, among others.
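The dataset split and the data augmentation idea described above can be sketched as follows. This is a minimal illustration in plain Python: the 70/15/15 proportions, the function names, and the tiny 2 × 2 "image" are assumptions chosen for the example, not values taken from the reviewed works.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle and split items into training, validation, and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def mirror(image):
    """Horizontal mirroring: reverse every row."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [image, mirror(image), rotate90(image), rotate90(rotate90(image))]


# Hypothetical dataset of 100 image identifiers.
train_set, val_set, test_set = split_dataset(range(100))

# Hypothetical 2 x 2 grayscale image.
image = [[1, 2],
         [3, 4]]
augmented = augment(image)
```

Augmentation is normally applied to the training subset only, so that the validation and testing subsets still measure performance on unmodified images.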

Transfer Learning with CNN
Transfer learning consists of taking a previously trained deep learning network and adjusting it to learn a new task. Creating a new CNN for a given task, which involves defining the network architecture and training the network from scratch, can be time-consuming before an optimal configuration is reached. For this reason, we can take advantage of a pre-trained network to learn new patterns from new data. Besides, it is useful when we do not have enough data to train the network. Thus, we use a model pre-trained on a dataset that is appropriate for the task at hand.
The key idea is to freeze some layers of a pre-trained network and typically adjust the input and output layers. There are several pre-trained models that we can adopt. Among them, the most used are well-known architectures such as LeNet-5 [19], AlexNet [22], VGG [39], GoogLeNet [50], and ResNet [51]. Additionally, researchers and engineers share many CNN models at the Caffe Model Zoo [52], which have been trained on several tasks, from simple regression to large-scale visual classification, image similarity, and speech, among other applications.
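The freezing idea can be illustrated with a toy sketch in plain Python. It is not a real pre-trained CNN: the fixed matrix `FROZEN_WEIGHTS` is a hypothetical stand-in for layers learned on an earlier task, and only a new linear output layer is trained on the new data. In practice, a framework would load one of the architectures listed above and mark its layers as non-trainable.

```python
def relu(x):
    return max(0.0, x)


# Hypothetical weights standing in for layers learned on an earlier task.
FROZEN_WEIGHTS = [[1.0, -1.0],
                  [-1.0, 1.0]]


def frozen_features(x):
    """Frozen layers: the weights are used but never updated."""
    return [relu(sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def fine_tune_head(samples, learning_rate=0.1, epochs=500):
    """Train only a new linear output layer on top of the frozen features."""
    head, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            feats = frozen_features(x)
            error = sum(h * f for h, f in zip(head, feats)) + bias - target
            # Gradient descent step on the squared error, output layer only.
            head = [h - learning_rate * 2 * error * f for h, f in zip(head, feats)]
            bias -= learning_rate * 2 * error
    return head, bias


# Hypothetical new task: the target is |x0 - x1|.
samples = [([1.0, 0.0], 1.0),
           ([0.0, 1.0], 1.0),
           ([1.0, 1.0], 0.0),
           ([0.5, 0.0], 0.5)]

head, bias = fine_tune_head(samples)
prediction = sum(h * f for h, f in zip(head, frozen_features([0.2, 0.9]))) + bias
```

Because only the output layer is trained, far fewer parameters have to be estimated, which is why this approach is attractive when little data is available for the new task.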

CNN-Based Approaches for Fruit Classification Tasks
We present a summary of the most recent articles in which CNNs are applied to the accurate classification of fruits according to their specific type. Table 1 shows a summary of these articles. In this case, classification is understood as identifying the specific type of fruit observed in an image containing a single type or several types of fruits.
Lu [24] applied CNNs with data augmentation techniques to a selection of 5822 color images of ten classes of food items from ImageNet [64], comparing the method against bag-of-features (BoF) and support vector machine (SVM) models. BoF with SVM reached an accuracy of 56%, while the CNN model achieved accuracies of 74% and 90% without and with data augmentation, respectively.
The study of fruit classification performed by Zhang et al. [25] compared the effect of three types of data augmentation methods and of max-pooling techniques on the accuracy, using both GPU and CPU hardware platforms. They used the same dataset and image pre-processing procedure as Wang and Chen [60], from whose article this work is derived. They obtained an accuracy of 94.94%, which is higher than that of the other machine learning methods they applied: 88.20% for PCA + kSVM [65], 89.11% for PCA + FSCABC [30], 89.47% for WE + BBO [32], 88.99% for FRFE + BPNN [66], and 89.59% for FRFE + IHGA [67]. Max-pooling performed slightly better than average-pooling.
Wang and Chen [60] created an 8-layer CNN using a parametric ReLU and placing a dropout layer before each FC layer. The fruit dataset contained 3600 images of 18 types of fruits, which were collected on site with a digital camera and also downloaded from the Internet. In a pre-processing procedure, the fruit was moved to the center of the image, which was then cropped and resized. Then, the background was removed and, finally, each image was labeled. The overall accuracy of the 8-layer CNN was 95.67%, better than the 6-layer CNN [58] with 94.94%, HWE + GA [68] with 81.11%, HIGA [67] with 89.06%, BBO [31] with 87.94%, SANN [69] with 88.22%, and IABC [30] with 89.11%.
Steimbrener et al. [26] used a modified GoogLeNet [50] for fruit and vegetable classification. The dataset consisted of 2700 images of 13 kinds of fruits and vegetables taken with a 16-band hyperspectral camera, covering the visible range of 470-630 nm. The images were reorganized into 3D matrices with 2D cuts for each spectral band. Three models of network architectures were used to adjust the intermediate NN model: Pseudo-RGB, linear combination, and convolutional kernel. The CNN average accuracy was 88.15% with Pseudo-RGB images, 85.93% with linear combinations, and 92.23% with convolutional kernels.
Katarzyna and Pawel [53] evaluated two 9-layer CNNs with the same architecture, aiming to carry out fruit classification for application in retail. The two networks had different weights: the first one classified fruit images with background, and the second used images containing a single fruit. The number of original images was 6161, which were segmented using a recognition algorithm to identify each single apple in the original image. Each apple object was saved as a separate image, and all apples in each original image were identified and recorded, producing a dataset of 23,662 images of six apple varieties. The evaluation results on this dataset showed an overall accuracy of 99.78%.
Mureşan and Oltean [55] introduce a new dataset of fruit images called Fruits-360 [70]. They also present the results of the evaluation of a basic CNN model, as well as of the AlexNet and GoogLeNet models, for fruit recognition. They conclude that the CNN-based techniques achieve very good accuracy results and that the basic CNN is the most efficient in terms of processing time.
Zhu et al. [56] adopted an AlexNet network model for vegetable image classification using the Caffe framework [52]. The dataset was obtained from ImageNet. The authors trained their CNN model on different datasets by varying the number of vegetable images. The classification results showed that accuracy decreases as the number of images decreases. Besides, they compared the accuracy rate of the CNN-based method (92.1%) against a BP neural network (78%) and an SVM classifier (80.5%).
Sakib et al. [54] designed and evaluated several CNN architectures for fruit classification using the Fruits-360 dataset [70]. They used various combinations of hidden layers and epochs for different cases and made a comparison between them. The initial CNN model is comprised of two convolutional layers, each one followed by a pooling layer, and two FC layers. The input RGB images have a size of 100 × 100 pixels. They achieved an accuracy of 100% and a training accuracy of 99.79%.
Hussain et al. [57] proposed a fruit recognition algorithm based on a Deep Convolutional Neural Network (DCNN). They used a fruit image database with 15 different categories comprising 44,406 images. The network has three convolutional-pooling layers, one FC layer, and finally a dense output layer. The input shape is 150 × 150 × 3. The experimental results of the proposed approach showed a high accuracy of 99%.
Lu et al. [58] designed a 6-layer CNN for fruit classification. The fruit dataset contained 1800 images of 9 types of fruits, which were obtained using a digital camera. They compared the proposed CNN model against voting-based-SVM (VB-SVM), wavelet entropy (WE), and genetic algorithm (GA). The accuracy results were 86.56% for VB-SVM, 89.78% for WE, 82.33% for GA, and 91.44% for the 6-layer CNN.
Patino-Saucedo et al. [59] adapted the AlexNet model to create Fruit-AlexNet, aiming to classify tropical fruits. They used the Supermarket Produce dataset [71], which contains 2633 images of 15 categories, including fruits inside a bag. The images of each category were collected at several times and on several days. They reduced the complexity of the AlexNet model. Their proposed network comprises five convolutional layers, with max-pooling after the first, second, and fifth layers. Finally, they stacked three fully connected layers, with dropout layers after the first and second ones. Experimental results outperform previous works [5] on the same dataset, achieving a classification accuracy of 99.56% and of 100% using statistical color and texture descriptors, respectively.
Zeng [61] used a modified VGG model for fruit and vegetable classification. The images were downloaded from picture websites, and the database has 26 categories. First, a bottom-up graph-based visual saliency (GBVS) model is used to segment the fruit region. Then, a CNN model learns image features to perform the classification task, obtaining an accuracy rate of 95.6%.
Zhang et al. [63] designed a 5-layer CNN model for fruit classification on the UEC-FOOD100 dataset [72] and a self-established fruit dataset. The UEC-FOOD100 dataset comprises about 15,000 images from 100 food classes, and the second dataset has more than 40,000 fruit images. For evaluation purposes, they elaborated two groups of controlled tests: single fruit vs. multi-food images and gray vs. RGB images. The experimental results show that features based on color information are not always useful to improve the classification accuracy. The best achieved accuracy was 80.8% on the fruit dataset and 60.9% on the multi-food dataset.

CNN-Based Approaches for Fruit Quality Control Tasks
The accurate detection and classification of fruit quality is one of the critical tasks to avoid losing added value in the markets. For this reason, continuous efforts are made to improve the methods for detecting damage, diseases, and the maturity level of the fruit. This quality control is carried out before, during, and after fruit harvesting. We analyze the recent use of CNNs in this area, understanding fruit quality control as the process of determining internal and external damage, degrees of maturity, and diseases of the fruit. Table 2 shows an overview of recent articles where CNNs are adopted for fruit quality control.
In a recently published study, Wu et al. [73] used a modified AlexNet model with an 11-layer structure to identify and detect defects in apples. For comparison, they used three known algorithms: backpropagation neural networks (BP), particle swarm optimization (PSO), and support vector machine (SVM). The dataset consists of laser-induced backscatter images (i.e., speckle images of 5472 × 3648 pixels), acquired with a laser system with a beam expander, a complementary metal-oxide-semiconductor (CMOS) color camera with a zoom lens, and a polarizer. The dataset has a total of 500 apple samples of similar size (equatorial diameter 80-100 mm). The proposed CNN model for apple defect detection achieves a recognition rate of 92.50%, which is higher than that of the other commonly used algorithms (BP, SVM, and PSO).
Jahanbakhshi et al. [74] designed and used three CNN models with 15, 16, and 18 layers to detect and grade apparent defects of sour lemon fruit. In total, 341 samples of healthy and unhealthy sour lemons were used to take RGB images of 4320 × 3240 pixels. The images were preprocessed by removing the background and resizing them. The CNN models were compared against KNN, a fuzzy method, an artificial neural network, a decision tree, and SVM, for which the features were extracted with local binary patterns (LBP) and histograms of oriented gradients (HOG). The results showed that, on average, the accuracy of the proposed CNN models was 100%.
Barré et al. [75] used an LSL (Light Separation Lab) to capture illumination-separated images of grapevine berries, building a dataset of 270 images for phenotyping the distribution of epicuticular waxes (berry bloom). They used a CNN model for image analysis. The validation over six grapevine cultivars showed accuracies of up to 97.3%. Besides, the electrical impedance of the cuticle and its epicuticular waxes (an indicator of the thickness of the berry skin and its permeability) was correlated with the wax proportion, with r = 0.76.
A CNN model for the identification of papaya diseases is presented by Munasingha et al. [76]. They collected images of diseased fruit using a digital camera under normal conditions on papaya farms. Some of the images were taken from publicly available sources on the Internet. The network can classify images into five main papaya diseases. The model achieved ∼92% classification accuracy on new images, and the results were compared against those of a Support Vector Machine.
Ranjit et al. [77] applied a 6-layer CNN model together with the quadtree method, which checks the homogeneity of the pixels in each sub-tree image of the quadtree. They aimed to detect the diseased region of the fruit to facilitate effective classification. Compared with SVM and kNN classifiers, the CNN gives better results, with an accuracy of 93% after segmentation.
Autoencoder models and Inception-ResNet v2 were employed by Tran et al. [78] to recognize, classify, and predict nutritional deficiencies in tomato plants. They used 571 images captured during the fruiting and leafing phases. Moreover, they applied an ensemble technique called Ensemble Averaging to the two aforementioned predictive models to improve the accuracy in the predictive validation. The accuracy rates were 79.09% and 87.27% for Autoencoder and Inception-ResNet v2, respectively, and 91% using Ensemble Averaging.
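To make the idea of ensemble averaging concrete, the following minimal Python sketch averages the class probabilities produced by two models and picks the class with the highest mean. The probability values and the three-class setup are made up for illustration and are not taken from [78]; the exact averaging scheme used in that work is not described here, so this should be read as one plausible form of the technique.

```python
def ensemble_average(prob_sets):
    """Average the class-probability vectors predicted by several models."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict_class(probs):
    """Return the index of the class with the highest probability."""
    return max(range(len(probs)), key=lambda c: probs[c])


# Hypothetical softmax outputs of two models for the same image
# (three illustrative classes; the numbers are invented).
autoencoder_probs = [0.50, 0.30, 0.20]
inception_probs = [0.20, 0.70, 0.10]

averaged = ensemble_average([autoencoder_probs, inception_probs])
predicted = predict_class(averaged)
```

In this toy case the two models disagree (class 0 versus class 1), and the averaged probabilities decide in favor of the more confident model.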
Sustika et al. [79] evaluated the AlexNet, MobileNet, GoogLeNet, VGGNet, and Xception architectures against a 2-layer CNN architecture used as a baseline. They built a strawberry classification system for quality inspection and evaluated it on two datasets, one with two strawberry classes and the other with four classes. The results show that VGGNet achieves the best accuracy on both datasets (96.49% and 89.12%), and GoogLeNet was the most computationally efficient architecture, requiring much less training time and memory.
Wang et al. [80] applied Residual Network (ResNet) and ResNeXt to detect internal mechanical damage in blueberries using hyperspectral transmittance data. Traditional ML algorithms were used in the comparison experiments, including Sequential Minimal Optimization (SMO), Random Forest (RF), Linear Regression (LR), and Multilayer Perceptron (MLP). They plotted the Precision-Recall and ROC curves to assess the performance of the classifiers. The two CNN models reach better classification performance than the traditional ML methods. The fine-tuned ResNet/ResNeXt achieve an F1-score and average accuracy of 0.8952/0.8905 and 0.8844/0.8784, respectively, and the classifiers SMO/RF/LR/Bagging/MLP obtained 0.8082/0.7314/0.7606/0.7113/0.7827 and 0.8268/0.7796/0.7529/0.7339/0.7971, respectively.
Zhang [81] proposes a novel CNN architecture designed for the fine-grained classification of banana ripening stages. The data comprised 17,312 images of bananas at different stages of ripening, taken as standard RGB images of 3200 × 2400 pixels and stored in PNG format. The overall accuracy of the proposed CNN is 95.6%, which is better than that of the Gabor + SVM (85.2%), Wavelet + SVM (86.5%), Wavelet + Gabor + SVM (88.2%), and Combined features + SVM (89.2%) methods.
Cen et al. [82] introduce a framework that combines a stacked sparse auto-encoder (SSAE) with a CNN, called the CNN-SSAE system, for detecting surface and internal defects of cucumbers with a hyperspectral imaging (HSI) system. The CNN is used for self-learning of local image features, which are then used for classification by the SSAE. Images were obtained from the hyperspectral imaging system at two conveyor speeds, 85 and 165 mm/s. Their testing results showed that, using spectral features, they achieve average accuracies of 85.6% and 78.3% at the two conveyor speeds. They also showed that, by combining spectral and spatial features, the accuracies improved to 91.1% and 88.6% at the two speeds.
Tan et al. [83] used a 5-layer CNN for the recognition of lesions on the skin of melon and fruits. The dataset is acquired in real time using an infrared video sensor. They use image transformation methods on apple skin lesion images to simulate the orientation and light variations found in orchards. The results show an accuracy and a recall rate of up to 97.5% and 98.5%, respectively. The proposed method is compared with LeNet-5, k-Nearest-Neighbor (kNN), Boosted-LeNet-4 (B-LeNet-4), and a multi-layer neural network with 3 layers (MNN).

CNN-Based Approaches for Fruit Detection
Another process to consider in the study of fruits is fruit detection for automatic harvesting and counting in greenhouses or orchards, because it is a determining component for crop automation in agriculture. In this section, we present a review of automatic fruit detection using convolutional neural networks. Table 3 summarizes recent articles based on CNNs for fruit detection.
An example of automatic fruit detection is the work presented by Williams et al. [84], who present the design and performance evaluation of a multi-arm kiwifruit harvesting robot. The robot is designed to operate autonomously in pergola-style orchards. Their work is based on a modified version of the VGG-16 CNN architecture called FCN-8S [39]. To detect the kiwifruit in the canopy, each robotic arm has a pair of centered color cameras. When testing the harvesting robot in commercial orchards, the results show a successful harvest of 51% of the total kiwifruit, with an average time of 5.5 s/fruit.
Santos et al. [85] showed that grape clusters can be successfully recognized, segmented, and tracked using CNNs. They adopted a procedure similar to that proposed by Yu et al. [86], based on Mask R-CNN, where ResNet and a feature pyramid network (FPN) form the feature extractor. The evaluation was carried out on the Embrapa Wine Grape Instance Segmentation Dataset (WGISD), which is composed of 300 RGB images showing 4432 grape clusters from five different grape varieties. The results reached an F1-score of up to 0.91 for segmentation, as well as an appropriate separation of each cluster from other structures in the image, which allowed a more accurate assessment of fruit size and shape.
Yu et al. [86] applied convolutional neural networks through an implementation of Mask R-CNN, aiming to improve the performance of computer vision in fruit detection for robotic strawberry harvesting. Their work is based on the ResNet-50 model combined with the Feature Pyramid Network (FPN) architecture. The data were 2000 JPEG images of 2352 × 1568 pixels, obtained from several strawberry orchards with a portable digital camera under different conditions. The fruit detection results over 100 test images showed an average detection accuracy rate of 95.78% and a recall rate of 95.41%, and the prediction of 573 ripe fruit picking points had an average error of ±1.2 mm.
In order to detect individual fruits and also to obtain a pixel-wise mask for each detected fruit in an image, Ganesh et al. [87] presented a deep learning approach, named Deep Orange, based on a segmentation framework that implements Mask R-CNN with ResNet-101. They use multi-modal input data comprising HSV and RGB images retrieved from an orange grove in Citra, Florida, under natural lighting conditions. The algorithm performance is tested using RGB and RGB + HSV images. Their preliminary results showed that the inclusion of HSV data improves the precision from 0.8 to 0.9753.
In 2019, Liu et al. [88] fused aligned RGB and Near-Infrared (NIR) images from RGB-D sensors in a deep CNN for fruit detection, aiming to develop a fruit detection system for yield estimation and automatic harvesting. They adopted a modified VGG-16 for kiwifruit detection using images from two modalities, NIR and RGB, obtained with the Kinect v2 device. The modified VGG-16 was compared with the original VGG-16. They used two fusion methods to extract features: fusion of the NIR and RGB images at the input layer (Image-Fusion) and fusion of the feature maps of two VGG-16 networks that take the NIR and RGB images as input (Feature-Fusion). The results showed that the average precision of the original VGG-16 with NIR and RGB images was 89.2% and 88.4%, respectively. Besides, the 6-channel VGG-16 using the Feature-Fusion method achieved 0.5%, and using the Image-Fusion method achieved the highest average precision of 90.7% and the fastest detection speed of 0.134 s/image.
Ge et al. [89] carried out a study of instance segmentation and localization of strawberries in farm conditions for automatic harvesting based on a deep CNN (Mask R-CNN). They used four classes of strawberry conditions: three ripeness levels and one shape class. The image dataset contained 310 images, of which two-thirds were captured by an iPhone 6s camera during the harvesting season, and the rest were captured using RGB-D cameras (D415 and D435 by Intel). The results showed that ripe strawberries are the easiest to identify. They proposed a bounding box refinement method to improve the localization accuracy for occluded fruits. The experimental comparison showed that the overlap between the refined bounding box and the ground truth was 0.87, whereas that between the raw detected bounding box and the ground truth was 0.68.
In 2019, noting the lack of computer vision research on date fruit detection in orchard environments, Altheri et al. [90] proposed a machine vision framework for date fruit harvesting robots. Their approach consisted of three classification models that classify date fruit images in real time according to maturity, type, and harvesting decision. The dataset contained 8072 images of five date types at different pre-maturity and maturity stages, taken from more than 350 date bunches belonging to 29 date palms in an orchard. Three CNN models were used for classification: AlexNet, VGG-16, and a modified VGG-16. The classification models achieved accuracies of 99.01%, 97.25%, and 98.59% with classification times of 20.7, 20.6, and 35.9 ms for the maturity, type, and harvesting decision classification tasks, respectively.
Zapotezny-Anderson and Lehnert [91] proposed a multi-perspective (multi-camera) visual servoing method for unstructured and occluded environments, called 3D Move To See (3DMTS), for robotic crop harvesting. They created a Deep-3DMTS model, which combines 3DMTS with a CNN. The proposed approach, which guides the end-effector of a robotic arm to improve the view of occluded sweet peppers, was shown to perform equivalently to the standard 3DMTS baseline. The results showed that the final end-effector position was within 11.4 mm of the baseline, and that the fruit size in the image increased by a factor of 17.8, compared with 16.8 for the baseline, on average.
Lin et al. [92] introduced a sensing system composed of an RGB-D sensor (Microsoft Kinect V2) and a CNN-based sensing algorithm to enable automatic collision-free picking of guava fruit. A dataset was acquired in an outdoor orchard. They used an FCNN model, built by modifying the VGG-16 and GoogLeNet models, to segment guava fruits and branches, and then applied Euclidean clustering to obtain all individual fruits. They then estimate the pose of the fruit relative to its mother branch. The results showed that the precision and recall for guava fruit detection were 0.983 and 0.948, respectively, and the 3D pose error was 23.43%-14.18%, with an execution time of 0.565 s/fruit.
Tu et al. [97] developed a machine vision algorithm to detect passion fruits and identify their maturity, using RGB-D images from a Kinect sensor. First, passion fruits were detected using Faster R-CNN (VGG-16 model) with color and RGB-D data. Then, fruit maturity features were extracted using the dense scale-invariant feature transform (DSIFT) algorithm together with locality-constrained linear coding (LLC). The output of this process was the input to a linear SVM classifier that identifies the maturity of the fruits. The method achieved 92.71% detection accuracy and 91.52% maturity classification accuracy.
A CNN (with five convolutional and three fully connected layers) was used by Habaragamuwa et al. [93] to recognize the mature and immature stages of strawberries. Greenhouse images were taken under natural lighting conditions, and the results were evaluated with BBOL (bounding box overlap, which measures localization accuracy) and AP (average precision). The developed deep learning model achieved BBOLs of 0.7394% and 0.7045% and APs of 88.03% and 77.21% for the mature and immature classes, respectively.
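Bounding box overlap measures such as those reported in [89,93] are commonly computed as intersection over union (IoU). The sketch below shows this standard computation in Python. Whether the cited works define their overlap measure exactly in this way is an assumption, and the box coordinates are invented for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical ground-truth and detected boxes, in pixels.
ground_truth = (10, 10, 50, 50)
detection = (20, 20, 60, 60)
overlap = box_iou(ground_truth, detection)
```

A value of 1 means that the detected box coincides with the ground truth, and a value of 0 means that the boxes do not intersect.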
Rahnemoonfar and Sheppard [94] presented a deep CNN trained on simulated data for automatic yield estimation in robotic agriculture, to help farmers make decisions on cultivation practices, plant disease prevention, and the size of the harvest labor force. They generated 24,000 synthetic tomato images ( images) to train the network and tested it on real data. They used a modified Inception-ResNet architecture. Experimental results showed a 91% average test accuracy on real images and 93% on synthetic images, counting efficiently even when fruits are under shadow, occluded by foliage or branches, or overlapping to some degree.
Bargoti and Underwood [28,95] presented a framework for apple and mango detection and counting using orchard image data. They used a general-purpose image segmentation approach with two feature learning algorithms: CNN and multiscale multilayered perceptrons (MLP). Their approaches were designed to include contextual information about how the image data were captured. The robotic vehicle carries a Point Grey Ladybug3 Spherical Digital Video Camera, equipped with six 2MP cameras oriented to capture a complete 360° panoramic view. They used the circular Hough transform (CHT) and watershed segmentation (WS) algorithms to detect and count individual fruits from the pixel-wise fruit segmentation. The CNN achieved a pixel-wise F1-score of 0.791. For count estimates, the combination of CNN and WS showed the best performance on this dataset, with a squared correlation coefficient of r² = 0.826.
Chen et al. [27] adopted deep learning to map from input images to total fruit counts. They used a blob detector based on a fully convolutional network (FCN) to extract candidate regions in the images. A counting algorithm based on a second FCN estimates the number of fruits in each region. They used two different datasets, composed of oranges in daylight and green apples at night, with human-generated labels as ground truth. Then, a linear regression model maps the fruit count estimate to a final fruit count.
Accurate, fast, and reliable fruit detection is essential for estimating fruit yield and for automated harvesting in orchards. Based on that, Sa et al. [36] adapted a Faster R-CNN (VGG-16) object detector through transfer learning, with images obtained from two modalities: color (RGB) and Near-Infrared (NIR). Additionally, they explored early and late fusion methods to combine the multi-modal information (RGB and NIR). The resulting multi-modal Faster R-CNN model improved the F1-score from 0.807 to 0.838 for sweet pepper detection.
Stein et al. [96] presented a multi-sensor framework to identify, track, localize, and map every fruit in a mango orchard. Data were collected using a vehicle equipped with color cameras and strobes, a global positioning inertial navigation system (GPS/INS), and a 3D LiDAR. The fruit was detected using a Faster R-CNN detector (a modified VGG-16), and pair-wise correspondences between images were calculated using trajectory data provided by the GPS. They automatically generated image masks for each canopy with the LiDAR component, linking each fruit with the corresponding tree. The results show that single-, dual-, and multi-view methods can all achieve precise yield estimates. However, the multi-view approach does not require calibration and achieves an error rate of only 1.36% for individual trees.

Discussion on the Review of CNN-Based Approaches for Fruit Image Processing
In this paper, we present a comprehensive review of state-of-the-art CNN-based approaches for fruit image processing. We have analyzed the main recent contributions in three important application areas: fruit classification, fruit quality control, and fruit detection.
Regarding the CNN architectures used in the collected studies, we can group them into two general categories: "own" and "pre-trained" networks. In the first group, the authors build the network from scratch, and in the second they use well-known CNN models such as AlexNet and GoogLeNet, among others. Additionally, this second group is subdivided into two subgroups: one where the pre-trained networks are used in a transfer learning process, and another where the pre-trained networks are modified to adapt them to the objective of the study. Figure 7 shows the overall distribution of the CNN models used by the authors. It is observed that 70% of the cases correspond to pre-trained models. Generally, this is because the primary aim in the agro-industry is to use CNNs as a support tool in the development of applications. It is also worth noticing that the number of layers depends on the kind of task to be performed. If the task is more complex and a greater number of characteristics must be extracted, then the number of layers and filters must be increased. Besides, Figure 8 shows the distribution of CNN models for each application area. For classification and quality control, we observe that the proportions of "own" and "pre-trained" models are quite similar. However, in the case of fruit detection, the use of "own" models is reduced to less than 6%. From Figure 8a, we note that the use of "own" models is common in fruit classification tasks. In Table 1, it is noticeable that CNNs with fewer than 13 layers ([24,25,53,58]) obtain excellent results of about 99%. We also show this behavior with the examples presented in Section 8.2.1. Moreover, it should be highlighted that CNN models achieve a greater improvement in fruit classification than other traditional methods (i.e., ML and computer vision).
From the review and as seen in Table 1, the datasets contain RGB images without any general pre-processing. In some rare cases, the images were processed with different techniques before being fed to the CNN. The most common process was resizing the images. Furthermore, no particular characteristic needs to be highlighted, because all these studies aim at the control, selection, and classification of fruits for the wholesale and retail markets, in order to develop practical applications that facilitate this process.
In this same context, researchers commonly develop CNN models adapted to their own needs for fruit quality control, as shown in Figure 8b. From a practical point of view, the problem is to classify images according to whether the fruit is diseased or healthy, whether it is damaged or not, and other similar decisions. Therefore, a complicated CNN architecture is not required to achieve the stated objectives. Generally, the studies implement a wide variety of CNN models, varying the dimensions of the convolutional filters and the pooling types to improve the results as much as possible. It should be noted that more complex approaches are required to classify particular defects of the fruits. In that regard, we observe that, since these studies require a great amount of information about the fruit, most of them use datasets composed of hyperspectral images and laser backscattering spectroscopic images, among other types. Besides, when RGB images have been used, fairly extensive pre-processing is performed to guarantee the successful extraction of characteristics.
Unlike the previous two cases, the main objective of fruit detection tasks is to segment the fruits in the orchard to efficiently perform an automatic harvest. Therefore, the task is more complex than the previous two, and it requires CNN models that can efficiently perform semantic segmentation of in-the-wild images. For this reason, in the distribution of CNN models shown in Figure 8c, the pre-trained networks prevail as the basis of all applications. The most used CNN architectures are the ResNet [51] and VGGNet [39] models. In particular, the VGG-16 model, as well as some of its variants and modifications, is considered an excellent basis for Faster Region-based CNNs. Another difference from the previous two cases is that the evaluation of the studies was not based on comparing their results against other ML models, as observed in the Results column of Table 3. Instead, the authors directly evaluate the robotic systems in the orchards, because their objective is to put the robotic system into operation in an integrated way.
Moreover, as in the case of quality control, the datasets contain different characteristics that are not provided by individual RGB images alone. The datasets can be grouped as follows: (i) those that use images captured with depth sensors (RGB-D), which allow the distance to the fruit to be estimated accurately so that the harvest can be performed robotically; (ii) those that use RGB images captured with a multi-vision system (i.e., a multi-camera system) to obtain a wider view and measure distances; and (iii) those built with multi-sensor systems, combining NIR and LiDAR sensors with RGB cameras.
We observed that the datasets are split in two ways: into training-testing sets (Tr-Ts), or into training-validation-testing sets (Tr-V-Ts). The overall distribution of approaches using each kind of split is shown in Figure 9. For the Tr-Ts splits, the most used proportion is 80%-20%, which represents 40% of all works that use this split. The rest of the approaches that adopt Tr-Ts splits vary in proportions, including studies that performed the tests in ranges of 10%-90% of the dataset. For Tr-V-Ts splits, the proportions are very diverse, ranging from 2/3-1/6-1/6 up to 4/5-1/10-1/10 of the dataset size, respectively. In this regard, the most important conclusion is that the dataset split should be chosen by the researcher according to the size of the dataset.
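As an illustration of the two kinds of splits, the following Python sketch shuffles the sample indices and cuts them into consecutive parts according to the chosen proportions. The function name `split_indices`, the dataset sizes, and the fixed random seed are assumptions made for this example only.

```python
import random


def split_indices(n_samples, fractions, seed=0):
    """Shuffle sample indices and cut them into consecutive parts.

    fractions: e.g. (0.8, 0.2) for a Tr-Ts split or
    (2/3, 1/6, 1/6) for a Tr-V-Ts split.
    The last part receives whatever remains after rounding.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    parts, start = [], 0
    for fraction in fractions[:-1]:
        size = int(round(fraction * n_samples))
        parts.append(indices[start:start + size])
        start += size
    parts.append(indices[start:])
    return parts


# Tr-Ts split with the most common 80%-20% proportion (hypothetical dataset size).
train, test = split_indices(1000, (0.8, 0.2))

# Tr-V-Ts split with the 2/3-1/6-1/6 proportion.
tr, val, ts = split_indices(600, (2 / 3, 1 / 6, 1 / 6))
```

Shuffling before cutting avoids an ordered dataset (for example, one sorted by class) producing parts with different class distributions. For small or imbalanced datasets, a stratified split may be preferable.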

Challenges and Future Research Directions
In the present study, we have found that DL-based models, especially CNNs, are very efficient approaches to address important tasks in fruit image processing for the agro-food industry. However, CNN-based approaches still face important challenges before they can be applied in real-world scenarios. In our opinion, these limitations indicate the main future research directions and are the following:
1. Size of the datasets: the dataset must be sufficiently large and well labeled to train a CNN, avoid overfitting problems, and perform the assigned task efficiently. Therefore, preparing the dataset is one of the activities that requires the most time and effort in the application of CNNs. Although the authors have proposed a wide variety of databases, not all of them are available; for this reason, the reproducibility of all studies is not entirely guaranteed. In addition, in many cases, the databases are collected depending on the task at hand.
2. Search of CNN parameters: choosing the number of layers and filters when proposing a CNN architecture for a specific problem, as well as determining the parameters and hyperparameters of the model, remains a relevant problem that is commonly solved by trial-and-error tuning until the best settings are found, which is very time-consuming for very deep models. At this point, pre-trained CNN models are a great help, since they can be taken as the basic design of other CNNs. Besides, other recent approaches, such as the Multi-layer Extreme Learning Machine [98], could be evaluated with the aim of reducing the computation time for tuning network parameters and the amount of data needed for training.
3. Multi-fruit classification: in fruit classification studies, we found that no evaluation has been carried out with multiple types of fruit in the same image; the studies are limited to images with a single kind of fruit, either individually or grouped. Thus, the challenge is to design a CNN model for the simultaneous detection and classification of different kinds of fruit.
4. Pre-processing of fruit images for quality control: almost all the quality control works were carried out under laboratory conditions using sensors that are not ready for real conditions. Hence, extensive pre-processing procedures are required in all cases, making them very hard to implement efficiently in real-world scenarios.
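Regarding the second challenge, trial-and-error tuning is often organized as a coarse-to-fine grid search: a coarse grid is evaluated first, and a finer grid is then placed around the best coarse setting. The Python sketch below illustrates the idea. The function `toy_validation_score` is a synthetic stand-in for "train the network for a few epochs and return the validation accuracy", and all parameter values are assumptions for illustration rather than settings taken from any reviewed study.

```python
import itertools
import math


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def toy_validation_score(params):
    """Synthetic score that peaks at learning_rate=0.01 and momentum=0.9.

    In practice this function would train the CNN briefly and return
    the accuracy measured on the validation set.
    """
    lr_penalty = 0.1 * abs(math.log10(params["learning_rate"]) + 2)
    momentum_penalty = abs(params["momentum"] - 0.9)
    return 1.0 - lr_penalty - momentum_penalty


# Coarse stage: widely spaced values.
coarse_grid = {"learning_rate": [1e-1, 1e-2, 1e-3, 1e-4], "momentum": [0.0, 0.5, 0.9]}
coarse_score, coarse_best = grid_search(toy_validation_score, coarse_grid)

# Fine stage: a narrower grid centered on the best coarse setting.
fine_grid = {
    "learning_rate": [coarse_best["learning_rate"] * f for f in (0.5, 1.0, 2.0)],
    "momentum": [coarse_best["momentum"] + d for d in (-0.05, 0.0, 0.05)],
}
fine_score, fine_best = grid_search(toy_validation_score, fine_grid)
```

Because every grid point requires a training run, the cost grows multiplicatively with the number of tuned parameters, which is why such searches are usually run on a small part of the training set and for few epochs.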

Deep Learning Frameworks and CNN-Based Examples
In the following, we present different frameworks to develop CNNs and two practical examples of fruit classification and fruit quality control tasks. This section mainly aims to provide these basic tools to researchers who are new to CNNs and to researchers in the agriculture area, who do not necessarily have computer science skills. In this way, people from different research areas can gain a better understanding of how CNNs apply to their research fields.

CNN Frameworks
It is known that CNNs are a model or method of DL and that, in turn, DL is a component of ML. Therefore, when speaking of a framework for CNNs, we have to speak of a Machine Learning framework as a whole. A Machine Learning framework is an interface, library, or tool that allows us to easily create ML models. There is a variety of ML frameworks, each different from the others. In the following, we introduce some of the best-known frameworks for ML:
• TensorFlow [99]: an open source ML library developed by Google, which provides a collection of workflows to develop and train models using Python, C++, JavaScript, or Java.
Additionally, besides these frameworks, the platforms for hosting and running ML developments should also be considered, since they can be used to deploy a trained model in external environments. The most popular platforms are Google Cloud (https://cloud.google.com), Amazon Web Services (https://aws.amazon.com), and Microsoft Azure (https://azure.microsoft.com). All these platforms allow free access that is limited in time and space.

CNN-Based Examples for Fruit Classification
In order to illustrate the use of some available tools to develop a CNN, we show the implementation of examples for fruit classification and quality control. Additionally, the same examples were implemented using well-known pre-trained models in order to illustrate another solution perspective based on transfer learning. It is important to remember that the objective of these examples is only to show, in the simplest way, how to implement CNN models for a specific task. For this reason, the proposed examples were not optimized, and very simple solutions are proposed so that they can be easily understood. Finally, the results of all the solutions were compared and analyzed.
The implementations were coded in Python and MATLAB, using TensorFlow [99] and the Deep Learning Toolbox [102], respectively. The source code is commented with descriptive information and is available online at: http://www.litrp.cl/repository.html.

Example of Fruit Classification
For classification, we used the Fruit-360 dataset [70], which contains 82,213 images (100 × 100 pixels) of fruits and vegetables from 120 classes, already subdivided into training and test sets. We selected six categories (three kinds of apples and three kinds of pears) from the dataset: Apple Golden 1, Apple Pink Lady, Apple Red 1, Pear Red, Pear Williams, and Pear Monster.
Once the categories have been selected for the classification study, the next step is to design the CNN architecture. Figure 10 depicts the CNN architecture adopted in this case. We designed a CNN with a quite simple architecture comprising the input layer, three convolutional layers, a flattening layer, one fully connected layer, and the output layer. We train our model for 10 epochs using the Stochastic Gradient Descent (SGD) optimizer with momentum. Initially, all filters are randomly initialized from a normal distribution and all biases are set to zero. Some of the network adjustment parameters are indicated in Table 4; they were fixed by applying a coarse-to-fine grid strategy on a small part of the training set. First, a set of parameters was defined, and their combinations were evaluated over a small number of training epochs (3-5). According to the preliminary results, the parameters were refined, a greater number of training epochs was executed, and the values with the best training performance were selected.
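A network of the kind described above could be sketched in TensorFlow/Keras as follows. This is only an illustrative sketch and not the released source code: the number of filters, kernel sizes, strides, number of units in the fully connected layer, standard deviation of the initializer, learning rate, and momentum are all assumed values, since the settings of Table 4 are not reproduced here. The helper `conv_output_size` is included only to show how the spatial size of the feature maps shrinks through the assumed layers.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Side length of a convolution output along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def build_fruit_cnn(num_classes=6, input_shape=(100, 100, 3)):
    """Input, three convolutional layers, flatten, one FC layer, and softmax output."""
    import tensorflow as tf  # imported here so the helper above runs without TensorFlow

    layers = tf.keras.layers
    # Filters drawn from a normal distribution, biases set to zero.
    # The standard deviation is an assumed value.
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05)
    common = dict(activation="relu", kernel_initializer=init, bias_initializer="zeros")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(16, 5, strides=2, **common),  # assumed filter counts and sizes
        layers.Conv2D(32, 5, strides=2, **common),
        layers.Conv2D(64, 3, strides=2, **common),
        layers.Flatten(),
        layers.Dense(128, **common),                # assumed number of units
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # assumed values
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Feature-map side length after the three assumed convolutions
# (no padding, stride 2), starting from 100 x 100 input images.
side = 100
for kernel in (5, 5, 3):
    side = conv_output_size(side, kernel, stride=2)
```

With these assumed settings, training for 10 epochs would amount to a call such as `build_fruit_cnn().fit(train_data, validation_data=val_data, epochs=10)`, where `train_data` and `val_data` are hypothetical dataset objects.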
We perform 10-fold cross-validation on the database, aiming to obtain unbiased results in our experiments. Besides, we set aside 10% of the training data for validation purposes. Thus, our training set comprised 2779 images and the validation set 307 images. After the training process, we evaluate our model on 1034 testing images. In Figure 12, we show the activation maps obtained by the 3rd convolutional layer for an Apple Pink Lady and a Pear Williams. The differences between both fruits are noticeable in their activation maps, which are later used by the fully connected layers to make the class predictions. Figure 13 depicts the confusion matrix for the classification results on the testing image set, where the misclassifications between the Pear Red, Apple Red 1, and Apple Pink Lady classes can be noted. The final average accuracy was 95.45%, with an average F1-score of 0.96.
In addition to the previous example, we implemented deeper CNN solutions based on well-known models pre-trained on the Imagenet dataset [64] to compare their performance. For this purpose, we used architectures with different depths, including the most used models in the literature for fruit classification (see Figure 8a). Although the dataset used is different from Imagenet, in order to keep the example as simple as possible, we avoided optimizing the pre-trained model for it. Hence, we loaded each pre-trained model with the weights learned on Imagenet and froze its convolutional base. Then, we replaced the top FC layers by adding one FC layer and setting the number of outputs to six classes.
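The transfer learning strategy described above (frozen convolutional base, one new FC layer, and a six-class output) can be sketched in Keras as shown below, taking VGG16 as an example of a pre-trained model. The number of hidden units, the dropout rate, the global average pooling, and the optimizer values are assumptions for illustration. The helper `head_parameter_count` only shows how few parameters remain trainable once the base is frozen.

```python
def head_parameter_count(feature_dim, hidden_units, num_classes):
    """Trainable weights and biases of the new head: one FC layer plus the output layer."""
    fc_layer = feature_dim * hidden_units + hidden_units
    output_layer = hidden_units * num_classes + num_classes
    return fc_layer + output_layer


def build_transfer_model(num_classes=6, input_shape=(100, 100, 3), hidden_units=256):
    """VGG16 base with Imagenet weights, frozen, followed by a small trainable head."""
    import tensorflow as tf  # imported here so the helper above runs without TensorFlow

    base = tf.keras.applications.VGG16(
        weights="imagenet",
        include_top=False,      # drop the original FC layers
        input_shape=input_shape,
        pooling="avg",          # assumed: global average pooling gives a 512-d feature vector
    )
    base.trainable = False      # freeze the convolutional base

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(hidden_units, activation="relu"),  # the added FC layer
        tf.keras.layers.Dropout(0.5),                             # assumed dropout rate
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # assumed values
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Parameters actually trained with the assumed head on top of a 512-d VGG16 feature vector.
trainable_parameters = head_parameter_count(512, 256, 6)
```

Input images would normally be passed through the preprocessing function that matches the chosen base model (for VGG16, `tf.keras.applications.vgg16.preprocess_input`) before training or prediction.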
Table 5 summarizes the results for each model. We perform the training process using the same settings as in the initial example to obtain comparable results, which means the same input shape and batch size, 10-fold cross-validation, the same training-validation-testing split, 10 training epochs, and the SGD optimizer with the same learning rate and momentum. It should be noted that as the complexity (i.e., depth) of the model increases, overfitting also increases, even when applying data augmentation and adding dropout. This behavior occurs because the dataset used is small and lacks sufficient variability. Hence, the results show that the complexity of a CNN model should correspond to the complexity of the classification task and the amount of data available. For this reason, simpler models like the proposed example and AlexNet obtain such good results. Besides, that explains why 66.7% of CNN approaches in the literature for fruit classification are based on own-created models and the AlexNet architecture, as shown in Figure 8a.
For fruit quality classification, we used the Apple-NDDA dataset [104], which consists of 1110 apple images from the defective and non-defective categories. As the objective of the proposed examples is to show the use of CNNs in the simplest way, in this case we use the same CNN architecture shown in Figure 10, modifying the output layer and verifying that the input layer has adequate dimensions. Unlike in the classification example, because the dataset is smaller, we adopted data augmentation to increase the number of training samples and reduce overfitting. Thus, in order to generate image batches in real time during the training process, we randomly applied the following variations to the training images:

• Rotation in the range of ±10 degrees.
• Width and/or height shifting of ±0.1 of the image dimensions.
• Zoom in the range of ±0.1x.
• Horizontal and/or vertical flipping.
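The random variations listed above can be sketched as a per-image parameter draw. The code below is only an illustration under stated assumptions: the names `sample_augmentation` and `flip` are hypothetical, the zoom of ±0.1x is interpreted as a scale factor in [0.9, 1.1], and each flip is assumed to be applied with probability 0.5. Only the flips are actually applied here (to an image stored as a list of pixel rows); rotation, shifting, and zoom would be delegated to an image library.

```python
import random


def sample_augmentation(rng, rotation_deg=10.0, shift_frac=0.1, zoom_frac=0.1):
    """Draw one random set of augmentation parameters for a training image."""
    return {
        # Rotation angle in degrees, within +/- rotation_deg.
        "rotation_deg": rng.uniform(-rotation_deg, rotation_deg),
        # Shifts expressed as a fraction of the image width / height.
        "width_shift": rng.uniform(-shift_frac, shift_frac),
        "height_shift": rng.uniform(-shift_frac, shift_frac),
        # Zoom expressed as a scale factor around 1.0 (assumed reading).
        "zoom": rng.uniform(1.0 - zoom_frac, 1.0 + zoom_frac),
        # Each flip is applied with probability 0.5 (assumption).
        "horizontal_flip": rng.random() < 0.5,
        "vertical_flip": rng.random() < 0.5,
    }


def flip(image, horizontal, vertical):
    """Return a flipped copy of an image given as a list of pixel rows."""
    # A vertical flip reverses the row order (top <-> bottom).
    rows = image[::-1] if vertical else image
    # A horizontal flip mirrors every row (left <-> right).
    return [list(row[::-1]) if horizontal else list(row) for row in rows]
```

Drawing a fresh parameter set for every image in every batch is what makes the augmentation happen "in real time": the stored dataset is unchanged, but the network rarely sees exactly the same image twice.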
We train the network for 20 epochs with the same optimizer settings and parameters as in the first example, which are described in Table 4. As the Apple-NDDA dataset [104] does not have separate training and testing subsets, we randomly divided the dataset into 80% for training and 20% for testing. In addition, we select 10% of the training images for validation. The training behavior is illustrated in Figure 14. We evaluate our model on 222 testing images. Figure 15 shows the activation maps obtained after the 3rd convolution layer for Non-defect and Defect sample images, where it is noticeable how the defects are highlighted in the second case. After evaluating the model on the testing images, we obtain an average accuracy of 81.25% with an average F1-score of 0.87. The confusion matrix for the classification results is shown in Figure 16. It is worth highlighting that the accuracy for the Defect class is 90%. These results show that our model misclassifies apples without defects to a greater extent than apples with defects, which is desirable when the goal is to reject poor-quality fruit. However, the final average accuracy is not good enough for a real fruit quality control system. This is mainly due to the quality and size of the database.
Aiming to compare the performance of the proposed example with that of other pre-trained models, we implemented the same solutions as in Section 8.2.1 based on transfer learning with the Imagenet dataset [64]. We include the most used models in the literature for fruit quality control (see Figure 8b), with the exception of the LeNet architecture. In order to provide a simpler solution, we adopt the same transfer learning strategy by freezing the convolutional base, adding one FC layer, and modifying the number of outputs. Moreover, the parameters of the training process were the same as in the initial example.
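The relation between a confusion matrix, the average accuracy, the per-class accuracy, and the F1-score can be sketched as follows. This is an illustrative helper, not the evaluation code behind the reported numbers: the name `summarize_confusion` is hypothetical, the F1-score is averaged as an unweighted (macro) mean because the averaging scheme is not stated in the text, and the per-class accuracy is taken to be the recall of each class.

```python
def summarize_confusion(confusion):
    """Compute summary metrics from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Overall accuracy: correctly classified samples over all samples.
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    recalls, f1_scores = [], []
    for c in range(n):
        tp = confusion[c][c]
        actual = sum(confusion[c])                          # samples of class c
        predicted = sum(confusion[r][c] for r in range(n))  # samples predicted as c
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return {
        "accuracy": accuracy,
        "per_class_recall": recalls,
        "macro_f1": sum(f1_scores) / n,
    }


# Hypothetical two-class matrix for illustration only; these are NOT
# the values of the confusion matrix reported for the Apple-NDDA test set.
example = summarize_confusion([[7, 3], [2, 8]])
```

With a helper of this kind, a statement such as "the accuracy for the Defect class is 90%" corresponds to one entry of `per_class_recall`, while the average accuracy and F1-score summarize the whole matrix.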
In Table 6, we compare the results obtained by the proposed example and the pre-trained models. It should be noted that the classification results are slightly improved by AlexNet, VGG16, and MobileNet. However, for deeper models such as InceptionV3 and ResNet50, the accuracy decreases due to the overfitting caused by a lack of data. The results show that more complex models are more appropriate for more complex tasks and data than shallow architectures such as the proposed one.

Conclusions
In this review, we studied different works on the use of CNN-based approaches for fruit image processing. In previous reviews of computer vision applications for fruit analysis, we observed that CNN was not identified as a relevant approach. Besides, it should be noted that most of these studies collected information before 2019.
We were able to identify three basic application areas in our study. The first is fruit classification, which is directly applied to sorting fruits by type in applications for markets, supermarkets, wholesalers, and retailers. The second is fruit quality control, which is used in applications to identify internal and external damage on fruits and their degree of maturity, and also to detect a lack of nutrients or diseases. The third is an area that we called fruit detection, which is applied to the harvesting of fruits in orchards and also to estimating their location for automatic harvesting.
We noted that the architectures of the CNN-based approaches vary across applications, works, and authors. Besides, we cannot establish one CNN architecture as superior to the rest. Hence, it is possible to use a pre-trained CNN, modifying some layers and parameters to design a new CNN model, as well as to start from scratch. Evaluation results show that CNN-based approaches achieved excellent results, up to 100% in some cases. Moreover, we observed that the greatest growth of CNN applications is in the robotic harvesting sector.

Figure 1. Relationship of the articles based on Convolutional Neural Network (CNN) for fruit image processing, by group and total.

Figure 3. General architecture of a convolutional neural network.

Figure 4. Example of the convolution operation with an input image (4 × 4) and a 3 × 3 kernel.

Figure 6. Examples of pooling operations using 2 × 2 filters applied with a stride of 2.

Figure 7. General distribution of CNN architectures used for fruit image processing.

Figure 8. Distribution of CNN architectures used for: (a) fruit classification, (b) fruit quality control, and (c) fruit detection for automatic harvest.

Figure 9. Overall distribution of training-testing (Tr-Ts) and training-validation-test splits (Tr-V-Ts) applied in CNN models.

Figure 10. Designed CNN architecture for fruit classification.
Figure 11 shows the average loss and accuracy curves during the model training process. The x-axis represents the training epoch number. It should be noted that from the 5th epoch onward, the loss value is close to 0 and the accuracy is close to 1 for both training and validation. These training and validation results seem to be perfect due to the low variability of the dataset used.

Table 4. Parameters of the CNN model for fruit classification.

Figure 11. Curves of (a) average loss and (b) average accuracy during the model training for fruit classification.

Figure 12. Examples of activation maps of the 3rd convolution layer for two images from the Fruit-360 dataset [70].

Figure 13. Confusion matrix for the classification results on the testing image set.
Figure 14 depicts the average loss and accuracy curves during the model training process for 10-fold cross-validation. In both curves, after the 12th epoch, the loss values remain stable and close to 0, while the accuracy stays above 80% for training and fluctuates for validation, reaching 81.57% at the 20th epoch.

Figure 14. Curves of (a) average loss and (b) average accuracy during the model training for fruit quality classification.

Figure 15. Examples of activation maps of the 3rd convolution layer for two images from the Apple-NDDA dataset [104].

Figure 16. Confusion matrix for the classification results.

Table 1. Summary of state-of-the-art CNN-based approaches applied for fruit classification tasks.

Table 2. Summary of state-of-the-art CNN-based approaches applied for fruit quality control tasks.

Table 3. Summary of state-of-the-art Convolutional Neural Network (CNN)-based approaches applied to the detection of fruits for automatic harvest.

Table 5. Comparison between the proposed example and pre-trained models for fruit classification on the Fruit-360 dataset [70].

Table 6. Comparison between the proposed example and pre-trained models for fruit quality classification on the Apple-NDDA dataset [104].