A Vision-Based Method Utilizing Deep Convolutional Neural Networks for Fruit Variety Classification in Uncertainty Conditions of Retail Sales

Abstract: This study proposes a double-track method for the classification of fruit varieties for application in retail sales. The method uses two nine-layer Convolutional Neural Networks (CNNs) with the same architecture, but different weight matrices. The first network classifies fruits according to images of fruits with a background, and the second network classifies based on images with the ROI (Region Of Interest, a single fruit). The results are aggregated with the proposed values of weights (importance). Consequently, the method returns the predicted class membership with the Certainty Factor (CF). The use of the certainty factor associated with prediction results from the original images and cropped ROIs is the main contribution of this paper. It has been shown that CFs indicate the correctness of the classification result and represent a more reliable measure compared to the probabilities on the CNN outputs. The method is tested with a dataset containing images of six apple varieties. The overall image classification accuracy for this testing dataset is excellent (99.78%). In conclusion, the proposed method is highly successful at recognizing unambiguous, ambiguous, and uncertain classifications, and it can be used in vision-based sales systems in uncertain conditions and unplanned situations.


Introduction
Recognizing different kinds of fruits and vegetables is perhaps the most difficult task in supermarkets and fruit shops [1]. Retail sales systems based on bar code identification require the seller (cashier) to enter the unique code of the given fruit or vegetable because they are individually sold by weight. This procedure often leads to mistakes because the seller must correctly recognize every type of vegetable and fruit, a significant challenge even for highly trained employees. A partial solution to this problem is the introduction of an inventory with photos and codes. Unfortunately, this requires the cashier to browse the catalog during check-out, extending the time of the transaction. In the case of self-service sales, the species (types) and varieties of fruits must be specified by the buyer. Unsurprisingly, this can often result in the misidentification of fruits by buyers (e.g., Conference pear instead of Bartlett pear). Leaving product identification to the buyer invites both honest and deliberate mistakes (purposeful indication of a less expensive species or variety of fruit or vegetable), which can lead to business losses. The likelihood of an incorrect assessment increases when different fresh products are mixed up.
One potential solution to this challenge is the automatic recognition of fruits and vegetables. The notion of recognition (identification, classification) can also be understood in different ways: additional extraction of color chromaticity was presented in [7]. A number of other studies [8] indicated that fruit recognition can also be performed with other classification methods such as fuzzy support vector machine, linear regression classifier, twin support vector machine, sparse autoencoder, classification tree, logistic regression, etc.
The proposal to use neural networks as the fruit classifier was presented among others by Zhang et al. [9], who used a feedforward neural network. The authors first removed the image background with the split-and-merge algorithm, then the color, texture, and shape information was extracted to compose feature data. The authors analyzed numerous learning algorithms, and the FSCABC algorithm (Fitness-Scaled Chaotic Artificial Bee Colony algorithm) was reported to have the best classification accuracy (89.1%). Other applications of neural networks for fruit classification can be found in [10,11].
In recent years, a number of articles have shown considerable modeling success with deep learning applications for image recognition. In [12], the authors applied deep learning with the Convolutional Neural Network (CNN) to vegetable object recognition with the results of learning rate being 99.14% and the recognition rate being 97.58%. In [13], the authors evaluated two CNN architectures (Inception and MobileNet) as classifiers of 10 different kinds of fruits or vegetables. They reported that MobileNet propagated images significantly faster with almost the same accuracy (top three accuracy of 97%). However, there were difficulties in predicting clementines and kiwis. This may be due to the choice of the training and testing of a variety of images, which were captured with a video camera attached to the proposed retail market systems and at the same time extracted from ImageNet.
The article [14] presented a comparative study between Bag Of Features (BOF), Conventional Convolutional Neural Network (CNN), and AlexNet for fruit recognition. The results indicated that all three techniques had excellent recognition accuracy, but the CNN technique was the fastest at presenting a recognition prediction. In turn, in the article [15], two deep neural networks were proposed and tested for using simple and more demanding datasets, with very good results for fruit classification accuracy in both bases. Many numerical experiments for training various architectures of CNN to detect fruits were presented in [16]. A 13-layer CNN was proposed for a similar purpose in [17].
The literature review reported above was used to inform the use of computer vision techniques in an automated sales stand or self-checkout. The literature indicates that machine learning methods (especially CNN methods) perform well at classification of fruits and vegetables in the case of pre-prepared datasets. However, pre-trained (tuned) methods depend on the data, and the availability of large collections of images of fruits and vegetables is limited [2]. For CNN methods used in automated sales stands or self-checkout, this suggests that the system should not rely on a single result: combining several results increases the certainty of the obtained class and makes the use of computer vision more effective.
For this purpose, we combine several methods: a CNN method for the fruit classification from a whole image, a YOLO (You Only Look Once) V3 method [18] for the fruit detection from a whole image, and then, a CNN method for the fruit classification from images with a single object (apple). This double-track approach to the fruit classification allows determining the Certainty Factor (CF) of the results, the use of which is the main novelty of this paper.
The problem of fruit detection is also widely analyzed in the literature, especially during the detection of fruits in orchards [19] and damage detection [20]. The YOLO V3 model [18], the Faster R-CNN model [21], and their modifications are the state-of-the-art fruit detection approaches [19,20]. The use of object detection and recognition techniques for multi-class fruit classification was presented in [22]. This approach is also effective, but does not calculate an objective certainty factor for the results, which are independent of one classification method.

Problem Statement
The traditional grocery store has been evolving in recent decades to a supermarket and discount store concept, carrying all the goods shoppers often desire. These stores offer a very large number of products, both processed and partially processed, as well as fresh produce such as fruits and vegetables. Fresh produce is typically sold per piece and by weight. As discussed earlier, the sale of produce may be burdensome for cashiers, because they must remember (or search for) the identification code of each item. In the case of self-service checkouts, the sale of fruits and vegetables depends on the buyers identifying the species and varieties of the products. Thus, the sales process in current use leads to longer customer service time, often causing errors (payments for the wrong products) and business losses.
The published literature suggests that machine vision systems and machine learning methods allow for the construction of systems for automatic fruit and vegetable classification. In particular, deep learning methods have high classification accuracy for both training and testing images, mainly in the case of recognizing species of fruits and vegetables. Recognizing varieties of fruits and vegetables is more difficult due to highly similar color, structure, and shapes in the same class. In fact, the image of the identified object may differ from the learned pattern, resulting in classification errors.
The primary problem addressed in this study is the following: Is it possible to build a machine vision system that can quickly classify the variety of fruits and vegetables together with providing the result certainty factor and, in the case of uncertainty, will notify about the set of the most probable classes?
To tackle this question, a double-track method for fruit variety classification is proposed. It applies image classification to whole images with a background and uses object detection to extract individual fruit objects, which are then classified as well. Comparing the classification results for the different objects of the same image, using the weights of the results, allows the calculation of the Certainty Factor (CF) for the proposed classification result.

CNN for Fruit Classification
In the proposed fruit classification method, the inference procedure based on CNN is used several times for classification of one variety of fruit. Therefore, the CNN architecture should be as simple as possible, with the goal of handling the task of classification with the highest possible prediction accuracy. By advancing previous research [1,23], we present a simplified CNN architecture in Table 1.
This CNN model has been tested for the variety classification of apples. Here, we propose a deep neural network model architecture with 9 layers of neurons. The first layer is an input layer that contains 150 × 150 × 3 neurons (an RGB image of 150 × 150 × 3 pixels, resized from 320 × 258 × 3 pixels). The next 4 layers constitute two tracks of convolution and pooling layers that use receptive fields (convolutional kernels) of size 3 × 3 with no stride and no padding. The layers give 32 and 64 feature maps, respectively. The convolution layers use the nonlinear ReLU (Rectifier Linear Unit) activation function [24], $f(x) = \max(0, x)$. This function sets negative activations to zero, which makes the activations sparse and results in faster learning. To reduce dimensionality and simultaneously capture the features contained in the binned sub-regions, the max pooling strategy [25] is used in the 3rd and 5th layers. The convolutional and max-pooling layers extract features from the image. Then, in order to classify the fruits, the fully-connected layers are applied to the output of the preceding dropout layer. Dropout is applied to each element within the feature maps (with a 50% chance of setting inputs to zero), thus allowing for randomly dropping units (along with their connections) from the neural network during training and helping prevent overfitting by adding noise to its hidden units [26].
The 8th layer provides 64 ReLU fully-connected neurons. The last layer as a final classifier has the 6 Softmax neurons, which correspond to the six varieties of apples.
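The layer sizes described above can be traced with a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes that "no stride" means a stride of 1 and that max pooling uses a 2 × 2 window with stride 2 (the pooling size is not stated in the text, and Table 1 is not reproduced here). The helper names `conv_out`, `pool_out`, and `feature_map_shapes` are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial size after a convolution (3 x 3 kernel, no padding, stride 1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial size after max pooling (2 x 2 window with stride 2 is an assumption)."""
    return size // pool


def feature_map_shapes(input_size=150, filters=(32, 64)):
    """List the (height, width, channels) shape after each convolution and pooling layer."""
    shapes = [("input", (input_size, input_size, 3))]
    size = input_size
    for index, n_filters in enumerate(filters, start=1):
        size = conv_out(size)
        shapes.append((f"conv{index}", (size, size, n_filters)))
        size = pool_out(size)
        shapes.append((f"pool{index}", (size, size, n_filters)))
    return shapes


shapes = feature_map_shapes()
for name, shape in shapes:
    print(name, shape)

# Number of values passed on to the dropout and fully-connected layers after flattening.
height, width, channels = shapes[-1][1]
flattened = height * width * channels
print("flattened:", flattened)
```

Under these assumptions the spatial size shrinks as 150, 148, 74, 72, 36, so the 64-neuron fully-connected layer would receive 36 × 36 × 64 = 82,944 inputs. A different pooling size would change these numbers.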
To train the CNN model (optimize its weights and biases), the Adam (Adaptive moment estimation) algorithm [27] was employed with cross entropy as the loss function. The Adam algorithm is a computationally-efficient extension of the stochastic gradient descent method.
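To make the training objective concrete, the sketch below shows how a six-way Softmax output and the cross-entropy loss for one sample can be computed. It is a plain-Python illustration under the standard definitions of both functions, not the Keras code used in the study; the function names and the example scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Cross-entropy loss for one sample whose correct class is true_index."""
    return -math.log(probabilities[true_index])


# Hypothetical raw scores for the six apple varieties A-F.
scores = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3]
probs = softmax(scores)
loss_if_a = cross_entropy(probs, 0)  # the true variety is A (highest score)
loss_if_d = cross_entropy(probs, 3)  # the true variety is D (lowest score)
print(probs, loss_if_a, loss_if_d)
```

The loss is small when the network assigns a high probability to the correct variety and grows as that probability approaches zero, which is the quantity the Adam optimizer minimizes over the training set.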
The presented architecture of the CNN model was tested for fruit classification in three different ways. First, the network was trained (and validated) with images of apple objects with a background (original images). Second, the network was trained (and validated) with images of a single apple object (called the image with the apple, or ROI as the Region Of Interest). Finally, the network was trained (and validated) with both sets of training data. In all cases, different network weights were obtained for the same CNN model. All trained CNN models were tested using the same testing data.

You Only Look Once for Fruit Detection
The YOLO V3 [18] architecture was used to generate the apple ROIs from the original images of [28]. The YOLO (You Only Look Once) family of models is a series of end-to-end deep learning models designed for fast object detection. Version 3 used in this research has 53 convolutional layers. The main difference from the previous version of this architecture is that it makes detections at three different scales, thus making it suitable for the smaller objects. Object features are extracted from these scales like the feature pyramid network.
In the first step, YOLO divides the input image into an S × S grid where S depends on the scale. For each cell, it predicts only one object using bounding boxes. The network predicts an objectness score for each bounding box using logistic regression. The score parameter was used to filter out weak predictions. The resulting prediction is a box described by its top left and bottom right corners.
The original dataset consisted of folders for each fruit class, such as apple, banana, etc. Within each class folder, additional classification was done for specific species. Apples in the images were located on a silver shiny plate that generated many false predictions. We used the weights pre-trained on a COCO dataset, containing 80 classes where one of them was the apple class. The COCO apple class consists of many different apple species, a desirable attribute in this case. We could run the object detection using YOLO and filter out just the apple class. To maximize predictive performance, we first set the minimum score parameter to 0.8. The predictions were good, but many apples were not detected. For this reason, we set the minimum score to 0.3, which allowed almost all objects to be included regardless of the species.
Model predictions were saved as separate files named according to the source sample to allow for later verification. Generated predictions could be used for ground truthing during the training process.
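The filtering and cropping steps described above can be sketched as follows. This is an illustration only: the detection format (a dictionary with `label`, `score`, and a `box` given as top left and bottom right corners), the nested-list image, and the file naming scheme are all assumptions made for the example, not the output format of any particular YOLO implementation.

```python
def filter_detections(detections, target_label="apple", min_score=0.3):
    """Keep only detections of the target class whose score reaches the threshold."""
    return [
        det for det in detections
        if det["label"] == target_label and det["score"] >= min_score
    ]


def crop_roi(image, box):
    """Cut the region (x1, y1, x2, y2) out of an image stored as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def roi_filename(source_name, index):
    """Name each ROI after its source sample so it can be verified later (hypothetical scheme)."""
    return f"{source_name}_roi{index}.png"


# Hypothetical detector output for one 6 x 4 pixel image.
image = [[(x, y) for x in range(6)] for y in range(4)]
detections = [
    {"label": "apple", "score": 0.85, "box": (0, 0, 3, 2)},
    {"label": "apple", "score": 0.35, "box": (3, 1, 6, 4)},
    {"label": "bowl", "score": 0.90, "box": (0, 0, 6, 4)},
    {"label": "apple", "score": 0.10, "box": (1, 1, 2, 2)},
]

kept_strict = filter_detections(detections, min_score=0.8)
kept_loose = filter_detections(detections, min_score=0.3)
rois = [crop_roi(image, det["box"]) for det in kept_loose]
names = [roi_filename("sample_0001", i) for i, _ in enumerate(kept_loose, start=1)]
```

In this toy example the threshold of 0.8 keeps one apple, while 0.3 keeps two, mirroring the trade-off reported in the text: a lower threshold recovers more apples at the risk of admitting weaker predictions.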

Proposed Fruit Classification Method Using the Certainty Factor
This study proposes a fruit classification method for a retail sales system. The method uses a machine vision system together with machine learning methods (shown in Figure 1). The first stage of the method involves creating an image with all fruit objects. The image includes one or many fruits (intended to be of one variety and species) with the background, and it is called the original image.
Appl. Sci. 2019, 9, 3971
In order to be sure of the obtained fruit classification result, the proposed method has two separate pathways for fruit classification. The first pathway identified the fruit variety based on the entire original image. For this purpose, the previously described nine-layer CNN was used, which was trained with the original images. The result of classification $A$ was determined with a certainty factor $CF$ with the following value:
$$CF = 0.5 + d,$$
where $d$ is a small positive value (here, $d = 10^{-4}$). In order not to introduce errors in the interpretation of the results, it is recommended that the value of $d$ be less than $0.5/(o + 1)$, where $o$ is the maximum number of objects (apples) in a single image in the dataset. Studies have shown slightly higher accuracy for the CNN trained with original images than with ROI objects. Therefore, the small positive value $d$ for the certainty factor gave slightly more importance to the result obtained in the first pathway of the fruit classification algorithm compared to the second pathway described below.
The second pathway of fruit classification consisted of recognizing the fruit variety based on single fruits from the original image. The first step was to use the object detection method to identify single apple objects in the amount of N (N = 0, 1, 2, . . .). The recorded objects of single apples were images with ROIs.
The You Only Look Once (YOLO) method was used to extract images of individual objects from the original images. The fruit variety classification was done based on each nth ROI image (n = 1, . . . , N). For this purpose, the previously described nine-layer CNN was used, trained with the ROI images. Each result of the classification $A_n$ (each CNN inference) was provided with the appropriate value of certainty factor $CF_n$:
$$CF_n = \frac{0.5 - d}{N},$$
where $N$ is the number of objects (apples) detected in the original image and $d$ is a small positive value (in this research, $d = 10^{-4}$). The results obtained from both pathways were grouped, and the factor $CF_k$ for each kth variety was calculated as follows:
$$CF_k = \sum_{A = A_k} CF + \sum_{n:\, A_n = A_k} CF_n,$$
where $CF_k$ is the certainty factor for the kth variety of fruit $A_k$ (here, a variety of apples), and each sum runs only over the results equal to $A_k$. The indirect result of the classification can be determined in the form of the following model:
$$\{A_k \text{ with } CF_k : k = 1, \ldots, K\},$$
where $K$ is the number of fruit varieties that were detected. Based on
$$CF_{\max} = \max_{k = 1, \ldots, K} CF_k,$$
the final result of the classification was provided. If the value of $CF_{\max}$ was higher than the limit value of the certainty factor ($CF_{limit}$) and was not equal to $0.5 + d$, then an unambiguous classification was obtained: Fruits (for example, apples) are variety $A_k$ with $CF_k$, where $CF_k = CF_{\max}$. If the value of $CF_{\max}$ did not exceed the limit value of the certainty factor ($CF_{limit}$) and was not equal to $0.5 + d$, then an ambiguous classification was obtained. The user can only be informed about the set of possible classification results: Possible varieties of fruits (apples) are $\{A_k \text{ with } CF_k \mid CF_k > 0\}$. If the value of $CF_{\max}$ equaled $0.5 + d$, then an uncertain classification was the result: Fruits (for example, apples) are variety $A_k$ with $CF_k = 0.5 + d$, where $d$ is a small positive value ($d = 10^{-4}$). This result was obtained using only one pathway of classification, which may give rise to uncertainty about the model results.
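The aggregation and decision rule can be sketched in a few lines. This is a hedged illustration, not the authors' code: it assumes the whole-image result carries the weight 0.5 + d and each of the N ROI results carries (0.5 - d)/N, which is consistent with the boundary value 0.5 + d quoted in the text and with a fully agreeing result summing to one. The function names and the default `cf_limit` of 0.75 are hypothetical, since the limit value is not given in this section.

```python
import math
from collections import defaultdict

D = 1e-4  # small positive value d from the text


def aggregate_certainty(whole_image_label, roi_labels, d=D):
    """Sum the certainty factors per variety over both classification pathways."""
    cf = defaultdict(float)
    cf[whole_image_label] += 0.5 + d      # first pathway (assumed weight)
    n_rois = len(roi_labels)
    for label in roi_labels:              # second pathway, skipped when no ROI was found
        cf[label] += (0.5 - d) / n_rois   # assumed weight of a single ROI result
    return dict(cf)


def decide(cf_by_variety, cf_limit=0.75, d=D):
    """Return the kind of classification and the varieties reported to the user."""
    best = max(cf_by_variety, key=cf_by_variety.get)
    cf_max = cf_by_variety[best]
    if math.isclose(cf_max, 0.5 + d, abs_tol=1e-12):
        return "uncertain", {best: cf_max}
    if cf_max > cf_limit:                 # cf_limit is a hypothetical threshold
        return "unambiguous", {best: cf_max}
    return "ambiguous", {k: v for k, v in cf_by_variety.items() if v > 0}


# Whole image and all three ROIs agree on variety A.
all_agree = decide(aggregate_certainty("A", ["A", "A", "A"]))
# No ROI was detected, so only the first pathway contributes.
no_rois = decide(aggregate_certainty("B", []))
# The whole image says A, but two of three ROIs say B.
mixed = decide(aggregate_certainty("A", ["A", "B", "B"]))
print(all_agree, no_rois, mixed)
```

With these weights the three example calls fall into the unambiguous, uncertain, and ambiguous cases, respectively. Note that the uncertain case also arises when every ROI disagrees with the whole-image result, because the largest factor then again equals 0.5 + d.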

Datasets
We employed six different apple varieties (named A-F), and the number of images for each class is provided in Table 2. The images (320 × 258 × 3 pixels) of apples came from the datasets presented in [28]. The images were obtained using an HD Logitech web camera with five-megapixel snapshots and present objects (different numbers of apples) placed in the shop scenery. Various poses and different lighting conditions (e.g., fluorescent or natural light, with or without sunshine) were preserved.
More information on the analyzed dataset was reported in [1,23]. For simplicity, the images of fruits were taken without being placed in plastic bags. The data were divided into three sets (training data, validation data, testing data) in the ratio of 70% (4311 original images), 15% (924 original images), and 15% (926 original images), respectively. The recognition algorithm was used to identify a single apple in the original image. Each apple object was saved as a separate image (named as image with apple, or ROI image). In addition, all apples in each original image were identified and recorded. The detailed structure of the analyzed dataset is presented in Table 3.
All tests and analyses were carried out using Python (v. 3.6.3) with Keras as a high-level neural network API, capable of operating on TensorFlow.
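As a quick check of the reported split, the snippet below recomputes the share of each subset from the image counts given above. It only uses the three counts stated in the text; the variable names are illustrative.

```python
# Image counts for the original images, as reported in the text.
split_sizes = {"training": 4311, "validation": 924, "testing": 926}

total_images = sum(split_sizes.values())
fractions = {name: size / total_images for name, size in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} images, {fraction:.2%} of {total_images}")
```

The three counts add up to 6161 original images, and each share is within a fraction of a percentage point of the nominal 70/15/15 ratio.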

Original versus Region of Interest Images
The CNN model architecture (presented in Section 3.2.1) was trained with various training and validation images (only original images, only ROI images, and both). Our goal was to determine the image types necessary to estimate appropriate values of weights in the CNN model to classify the varieties of the fruit correctly. Despite different training and validation files, the model was tested using the same testing dataset. The test results are given in Table 4. According to the accuracy values presented in Table 4, the CNN should be trained only with the image types that will be recognized by this network. Training the network with additional images of different scales (many objects or one object) did not improve the accuracy of the classification. The results may also indicate that the size of the ROI in the image is highly significant.
It can also be concluded that the best classification performance was obtained by the proposed CNN trained and tested with the original images; this model produced only two incorrect classifications (accuracy: 99.78%). However, it should be noted that the number of testing images in this case was much smaller (about 35%) than in the case of testing with ROI images. The misclassifications referring to these cases are described in Table 5. In these cases, the CNN assigned high probabilities to the predicted varieties, yet the predictions were incorrect. Thus, the CNN output was not considered a fully reliable classifier, although its performance was close to perfect. As a result, additional methods were explored.
The proposed CNN, which was trained and validated with ROI images, demonstrated slightly lower accuracy (97.56%). This accuracy was characteristic for the network, which was trained and validated with the same type of images.

Proposed Fruit Classification Method Using the Certainty Factor
The proposed fruit classification method was tested on the same dataset with six different apple varieties. In this method, the classification of one photo was associated with the classifications of several images (one original image and its ROI images). The amount of testing data according to the number of images per photo classification is presented in Figure 2. Most cases (364 photos) concerned the classification of an apple based on one original image and one ROI image (a photo with one apple, or a photo in which the YOLO method detected only one apple object). In the dataset, there were five photos for which the YOLO method failed to identify any fruit. The largest number of fruits detected by the YOLO method was 11, in four cases.
It can be also concluded that the best classification possibilities are for the proposed CNN trained and tested with the original images; these models only produced two incorrect classifications (accuracy: 99.78%). However, it should be noted that the number of testing images in this case was much smaller (about 35%) than in the case of testing with ROI images. The misclassifications referring to these cases are described in Table 5. Unfortunately, despite the high values of probabilities that samples belonged to the varieties obtained, the results were incorrect. Thus, the CNN output was not considered a fully reliable classifier, although its performance was close to perfect. As a result, additional methods were explored. The proposed CNN, which was trained and validated with ROI images, demonstrated slightly less accuracy (97.56%). This accuracy was characteristic for the network, which was trained and validated with the same type of images.

Proposed Fruit Classification Method Using the Certainty Factor
The proposed fruit classification method was tested on the same dataset with six different apple varieties. In this method, the classification of one photo was associated with the classifications of several images (one original image and its ROI images). The amount of testing data according to the number of images per photo classification is presented in Figure 2. Most cases (364 photos) concerned the classification of an apple based on one original image and one ROI image (a photo with one apple, or a photo in which the YOLO method detected only one apple object). The dataset also contained photos in which the YOLO method failed to identify any fruit (five photos). The largest number of fruits detected by the YOLO method in one photo was 11, which occurred in four cases.
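The idea of combining one full-frame prediction with several ROI predictions can be illustrated with a minimal weighted-vote sketch. This is not the aggregation formula of the proposed method, which is defined elsewhere in the paper. The function names and the weight values `W_ORIGINAL` and `W_ROIS` are assumptions, chosen only so that a photo with no detected ROI yields a score of 0.5001, the value reported in this section for such photos.

```python
from collections import defaultdict

# Assumed weights (hypothetical): the full-frame prediction gets slightly more
# than half of the total importance, and all ROI predictions share the rest.
W_ORIGINAL = 0.5001
W_ROIS = 1.0 - W_ORIGINAL


def aggregate_cf(original_label, roi_labels):
    """Combine one full-frame label and any ROI labels into per-class scores."""
    scores = defaultdict(float)
    scores[original_label] += W_ORIGINAL
    for label in roi_labels:
        # Each ROI receives an equal share of the total ROI weight.
        scores[label] += W_ROIS / len(roi_labels)
    return dict(scores)


def best_class(scores):
    """Return the class with the highest score together with that score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


if __name__ == "__main__":
    print(best_class(aggregate_cf("A", [])))          # no ROI detected
    print(best_class(aggregate_cf("A", ["A", "A"])))  # all images agree
    print(best_class(aggregate_cf("A", ["A", "B"])))  # one ROI disagrees
```

Under these assumed weights, the score reaches one only when every image of a photo points to the same variety, and it drops as soon as any ROI disagrees with the full-frame prediction.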
Table 5 (partially recovered row): variety of E (the probability of the sample belonging to this variety: 0.9999); variety of C.
As a result, the proposed method returned the recognized fruit varieties together with their certainty factors (CFs). Because the classification was based on two CNNs with different weights and on many different fruit objects, it can be assumed that the approach was relatively objective and that the certainty factors are reliable indicators of the correctness of the classification result. Consequently, the maximal value of CF ($CF_{max}$) may indicate the result of classification (i.e., the fruit variety with $CF_{max}$ can be taken as the correct class). When $CF_{max}$ equals one, the classification result can be treated as certain (this occurred in 875 of all 926 cases, 94.49%). In 97.94% of cases, $CF_{max}$ exceeded 0.7501, and in all of these cases the variety with $CF_{max}$ was the correct variety. Only two misclassifications were detected, for varieties with $CF_{max} \in \{0.6700, 0.7501\}$ (Table 6). Therefore, in the large majority of cases (99.78%), the fruit variety with $CF_{max}$ was the correct variety. All correct and incorrect classifications, together with their $CF_{max}$ values, are presented in Table 7. The actual varieties in relation to the predicted varieties of apples are shown in Table 8.
Table 6. Incorrect classification based on the $CF_{max}$ value in the proposed method.
Table 7. Results of apple variety classification using the proposed method.
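The percentages quoted above can be checked against the reported counts with a short calculation. The sketch below assumes that the 97.94% figure refers to the 926 − 19 = 907 photos that remain after excluding the 19 images with $CF_{max} \le 0.7501$ mentioned in this section; under that assumption, the reported value matches truncation to two decimal places (rounding would give 97.95). The helper name `truncated_percent` is hypothetical.

```python
TOTAL = 926  # number of test photos reported in the paper


def truncated_percent(count, total=TOTAL):
    """Percentage of count in total, truncated (not rounded) to two decimals."""
    return (count * 10000 // total) / 100


certain = 875             # photos with CF_max equal to one
above_limit = TOTAL - 19  # assumed: photos with CF_max above 0.7501
correct = TOTAL - 2       # all photos minus the two misclassifications

print(truncated_percent(certain))      # 94.49
print(truncated_percent(above_limit))  # 97.94
print(truncated_percent(correct))      # 99.78
```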

According to the model results, it was possible to identify a limit value of CF ($CF_{limit}$) such that a fruit variety with $CF_{max} \le CF_{limit}$ can be treated as an ambiguous classification. In the analyzed case, $CF_{limit}$ can be set to 0.7501. There were 19 original images for which the classification results had $CF_{max} \le 0.7501$, including five original images with $CF_{max} = 0.5001$ (ROIs were not detected). Thus, the analyzed classification cases can be divided into three types: unambiguous, ambiguous, and uncertain classifications.
To complete the analysis, the execution time (prediction time) of the proposed method is presented in Table 9. As can be seen, the execution time depended on the number of objects detected in the original image.
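The use of the limit value described in this section can be sketched as a simple decision rule. The function `triage` and its three labels ("certain", "accepted", "ambiguous") are hypothetical names introduced for illustration; they follow only the two statements made in the text (a result with $CF_{max}$ equal to one is treated as certain, and a result with $CF_{max} \le CF_{limit}$ is treated as ambiguous) and do not reproduce the paper's own definition of its three classification types.

```python
import math

CF_LIMIT = 0.7501  # limit value reported for the analyzed dataset


def triage(cf_max, cf_limit=CF_LIMIT):
    """Map a maximal certainty factor to a hypothetical handling label."""
    if not 0.0 <= cf_max <= 1.0:
        raise ValueError("cf_max must lie in the interval [0, 1]")
    if cf_max <= cf_limit:
        # At or below the limit: the result should not be trusted on its own.
        return "ambiguous"
    if math.isclose(cf_max, 1.0):
        # All images of the photo agreed on one variety.
        return "certain"
    return "accepted"


if __name__ == "__main__":
    for value in (0.5001, 0.6700, 0.7501, 0.9, 1.0):
        print(value, triage(value))
```

In a sales system, a result labeled "ambiguous" could be passed to the cashier or buyer for confirmation, whereas the other results could be accepted automatically; this routing is a design suggestion rather than part of the evaluated method.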

Comparison of the Results
The research was focused on the synergy of two approaches, the object detection method (in our case, YOLO V3) and the classifier of the full frame and ROIs. Therefore, the comparisons can relate to each method separately or the whole proposed method, which calculated the CFs of object classes.
First, a comparison of YOLO V3's performance in relation to other tested methods is presented in Table 10. The YOLO accuracy did not directly influence the result of the system's end inference. The YOLO V3 method affected the relation between the size of the object in the image and the image size itself in the training set and testing set, which in turn affected the accuracy of the fruit identification method using ROIs (in our case, CNN). In addition, YOLO accuracy also affected the number of classified objects (number of ROIs), which in turn affected the accuracy of the certainty factor.
In the research, all the methods were tested with the same training data, which consisted of 926 files with multiple objects. As can be seen, YOLO V3 generated the highest number of apple class detections, which could be used as the learning ROI images for the classification network. The best average processing times were obtained using the MobileNetV2 + SSDLite and SSD Inception v2 configurations, but their number of detections was much lower compared to the other architectures.

Conclusions
This study proposed a double-track method for the classification of fruit varieties, using two CNNs with the same architecture but different weight matrices. The first network classified fruits according to images of fruits with a background, and the second network classified based on images with the ROI (a single fruit). The results were aggregated with proposed values of neuron weights (importance). Consequently, the method returned the predicted class or classes (fruit variety or varieties) together with their Certainty Factor (CF). The presented method combined the detection and classification methods and determined the certainty factor associated with the prediction results from the original images and the cropped ROI images, which was the contribution of this paper. The CFs had the advantage that the correctness of the classification result could be assessed, resulting in more reliable predictions compared to the probabilities from the CNNs' outputs. This suggests that the proposed vision-based method can be used in the uncertain conditions and unplanned situations commonly encountered in sales systems (such as the accidental mixture of fresh products, placement of another object in the frame, unusual packaging of fruit, different lighting conditions, etc.). The test using 926 images of six apple varieties indicated that the classification accuracy of this method (based on the maximal value of CF) was excellent (99.78%). In addition, the method was 100% successful at recognizing unambiguous, ambiguous, and uncertain classifications.
It is important to recognize that the proposed method also had limitations. First, the method performed the classification process several times (for the whole image and for each detected object), which could lengthen the time needed to obtain the result. However, the uncomplicated structure of the CNN and the suitability of the YOLO V3 method for real-time processing [18] imply that the method can still be used in online sales systems. Second, the use of two different types of training images complicated the learning process of the system. Therefore, the learning process, together with the determination of $CF_{limit}$ values in the proposed method, is recommended for further research.
In addition, a future direction is to test the method using a larger dataset containing a greater number of fruit and vegetable varieties of different species. It is also preferable to build a fruit and vegetable dataset with more demanding images, which will ultimately be the true test of the system. An interesting research direction will be testing the system with a dataset containing images of fruits and vegetables wrapped in a transparent plastic bag, a situation that may introduce uncertainty into the obtained result.