This section describes the experimental results and discusses the main findings. First, the partial results of the intermediate steps are presented, that is, the selection of the most effective color features and the configuration of the ANNs with the ICA and HS metaheuristic algorithms. Then, the accuracy of the basic classifiers and of the ensemble method is analyzed using the evaluation parameters previously described. Finally, the results obtained are compared with other state-of-the-art methods available in the literature.
3.1. Selection of the Color Features and Configuration of ANN-ICA and ANN-HS
As described in Section 2.2, the first step of the process is to select the most effective color features for the problem of interest among the 38 available color channels. This is done with a hybrid approach, ANN-CA, which tests different combinations of channels. The features automatically selected as most effective were channel b* of the L*a*b* color space and the color purity index, C*, of the L*C*h space, which is itself derived from the a* and b* channels (see Table 1). This predominance of the L*a*b* color space has also been reported in other applications of computer vision in agriculture [38]. However, it should be noted that these results may depend on the specific application domain.
Another interesting result is that the process ends with the selection of only two color features. This number is not fixed a priori: the ANN-CA process tests combinations of between one and six channels, yet only two channels are selected as the optimal configuration. This could indicate that a greater number of channels is prone to overfitting, leading to poorer classification results. This conclusion is consistent with [38], where all the combinations of one, two, and three channels in nine standard color spaces were tested for a problem of plant/soil segmentation; the reported results indicate that the optimal selection consists of the channels L* and a* in L*a*b*. Therefore, a reduced number of features is enough for an effective use of color in segmentation problems. Using so few features also helps to achieve good computational efficiency in the algorithms.
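As an illustration of how the two selected features relate to each other, the following minimal sketch computes C* from a* and b* using the standard CIE definition, C* = sqrt(a*^2 + b*^2). The function names are hypothetical and are not taken from the implementation used in the experiments; the conversion from RGB to L*a*b* is assumed to have been done beforehand.

```python
import math


def chroma_and_hue(a_star, b_star):
    """Return (C*, h) of the L*C*h space from the CIELAB a* and b* values.

    C* is the Euclidean norm of (a*, b*); h is the hue angle in degrees,
    wrapped to the range [0, 360).
    """
    chroma = math.hypot(a_star, b_star)
    hue = math.degrees(math.atan2(b_star, a_star)) % 360.0
    return chroma, hue


def two_feature_vector(a_star, b_star):
    """Build the (b*, C*) pair described in the text for one pixel.

    Hypothetical helper: the pixel is assumed to be already in L*a*b*.
    """
    chroma, _ = chroma_and_hue(a_star, b_star)
    return (b_star, chroma)
```

Because C* depends on b*, the two features are correlated by construction, which is in line with the observation that a small number of channels already carries most of the useful color information.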
The hybrid approaches are also applied to configure the hyperparameters of the ANNs used for classification, ANN-ICA and ANN-HS, using the two features mentioned. The optimal configurations for ANN-ICA and ANN-HS are shown in Table 4 and Table 5, respectively.
In both cases, the optimal configuration of the ANNs consists of only two hidden layers with a similar number of neurons, although the process also tests larger sizes. This result may be related to the fact that the input tuples consist of only two values, so the necessary decision boundaries can be created with only two layers of between 10 and 24 neurons each.
3.2. Classification Results of the Ensemble Method and the Basic Classifiers
In order to evaluate the reliability of the classifiers, 275 repetitions were performed for each method, that is, 275 independent executions of the training/testing process. The proposed ensemble method originally consisted of the five classifiers presented: ANN-ICA, ANN-HS, SVM, kNN, and LDA. However, it was observed that the poor results of LDA (as presented below) seriously affected the global performance of the ensemble. Therefore, we also evaluated a second majority voting method in which LDA is removed from the ensemble.
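The majority voting rule can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the label strings and the function name are hypothetical, and the tie-breaking rule is an assumption, since with four voters (the ensemble without LDA) a 2-2 tie is possible and the text does not state how it is resolved.

```python
from collections import Counter


def majority_vote(predictions, tie_label="background"):
    """Combine the labels predicted by several classifiers for one sample.

    predictions: list with one label per classifier.
    tie_label:   label returned when the two most voted classes tie
                 (ASSUMPTION: the tie-breaking rule is not given in the text).
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Hypothetical votes of ANN-ICA, ANN-HS, SVM and kNN for a single pixel.
votes = ["fruit", "fruit", "background", "fruit"]
label = majority_vote(votes)
```

With five voters and two classes a tie cannot occur, so the tie-breaking rule only matters for the four-classifier variant.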
Figure 3 shows boxplots of the correct classification rates (CCR), or overall accuracy, achieved by all the classifiers in the fruit/background segmentation for the 275 executions. The red crosses indicate exceptionally low or high execution results.
In all cases, the accuracy achieved was above 96%, with the best average results above 98%. Some methods were more consistent in these good results (i.e., the variance between executions is very small), such as SVM, kNN and especially the ensemble method, while the results of LDA were also consistent but significantly lower, below 97%. The compactness of the boxplots indicates the close proximity of the values in different executions, and consequently the high reliability of the classification. The two hybrid methods based on ANN (i.e., ANN-ICA and ANN-HS) also produced good results, but with a larger variance between executions. On the other hand, the original ensemble classifier including LDA was clearly affected by the errors of the poorest methods; it had an average accuracy of only 97.68%, and its variance was larger than that of any of the basic classifiers. However, once LDA is removed from the ensemble, the method improves on the results of the constituent classifiers, achieving both a good average accuracy of 98.59% and a very low variance.
Table 6 presents the confusion matrices and the error rates per class for the test data in the 275 repetitions for all the classifiers. Since there are 14,804 test samples and 275 iterations performed, the total accumulated is equivalent to 4,071,100 samples (32.6% of the fruit class, and 67.4% background).
These matrices allow a deeper insight into the results. Since the number of background samples is larger than the number of fruit samples, all the methods tend to over-classify new inputs into the background class, producing higher error rates for the fruit class, i.e., the numbers of FP and FN samples are not balanced. This is especially prominent in the LDA method, where the error in the fruit class is 31 times larger. Only the SVM classifier exhibits a balanced accuracy, with 1.99% error in the fruit class and 1.28% in the background. The CCR of SVM (98.49%) is lower than that of kNN (98.50%), but the difference is not significant. The imbalance in the results of the ensemble without LDA is relatively small (as compared, for example, with the two ANN methods), and it also outperforms the CCR of the constituent methods.
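The per-class error rates and the CCR discussed here can be derived from an accumulated confusion matrix as in the following sketch. The row/column convention (index 0 for fruit, index 1 for background; rows are true classes, columns are predicted classes) and the per-run counts are assumptions made for illustration only; they are not the values of Table 6.

```python
def accumulate(matrices):
    """Sum a sequence of 2x2 confusion matrices, one per repetition."""
    total = [[0, 0], [0, 0]]
    for m in matrices:
        for i in range(2):
            for j in range(2):
                total[i][j] += m[i][j]
    return total


def per_class_error(matrix):
    """Fraction of the samples of each true class that were misclassified."""
    return [1.0 - matrix[i][i] / sum(matrix[i]) for i in range(2)]


def ccr(matrix):
    """Correct classification rate (overall accuracy)."""
    correct = matrix[0][0] + matrix[1][1]
    return correct / sum(sum(row) for row in matrix)


# Accumulated sample count stated in the text: 14,804 test samples x 275 runs.
total_samples = 14_804 * 275

# Illustrative (made-up) confusion matrices for two repetitions.
runs = [[[45, 5], [2, 98]], [[47, 3], [1, 99]]]
accumulated = accumulate(runs)
fruit_error, background_error = per_class_error(accumulated)
overall = ccr(accumulated)
```

Comparing `fruit_error` with `background_error` gives the kind of imbalance ratio quoted for LDA, while `overall` corresponds to the CCR.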
The imbalance observed in the accuracies for the different classes is most probably due to the imbalance in the dataset between the fruit and background samples. In our case study, the plums represent only about 2% of the whole images. Thus, the plum class has been oversampled, resulting in 33% and 67% of samples for the fruit and background classes, respectively. Although the imbalance of the samples has been reduced, there are still twice as many samples of the background class. It would be interesting to perform an additional subsampling of the background class, using for example half of the samples. However, while this could reduce the classification bias, the overall classification accuracy on the images would be lower, since they contain about 98% of background.
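The additional subsampling of the background class suggested above could look like the following sketch. It is not part of the reported experiments; the function name, the label strings, the fixed random seed and the default `keep_fraction` of one half are assumptions chosen to mirror the example in the text.

```python
import random


def undersample(samples, labels, majority_label, keep_fraction=0.5, seed=0):
    """Randomly keep only a fraction of the samples of the majority class.

    Samples of all other classes are kept unchanged, and the original
    order of the retained samples is preserved.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_keep = int(len(majority_idx) * keep_fraction)
    kept = set(rng.sample(majority_idx, n_keep))
    idx = [i for i, y in enumerate(labels) if y != majority_label or i in kept]
    return [samples[i] for i in idx], [labels[i] for i in idx]


# Toy data with the 33% / 67% class proportions mentioned in the text.
labels = ["fruit"] * 33 + ["background"] * 67
samples = list(range(len(labels)))
new_samples, new_labels = undersample(samples, labels, "background")
```

After this step the toy set contains the same number of fruit and background samples, which is the balanced situation the paragraph describes.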
The ROC curves obtained for all the methods are depicted in Figure 4. In these curves, the closer the curve lies to the left-hand vertical axis and the top-left corner, the higher the performance. In general, all the ROCs exhibit good results, but the accuracy of SVM is again the best of all the basic methods. As mentioned above, the curves of SVM, kNN and the ensemble classifiers are piecewise linear, since these techniques cannot be adjusted to be more or less restrictive. As a consequence, even though these methods achieve good accuracy, the AUC parameter is degenerate, since it loses its original meaning of summarizing the behavior of the classifier across different configurations.
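To make the degenerate case concrete: when a classifier yields a single operating point (FPR, TPR), its ROC "curve" consists of the two segments joining (0, 0), (FPR, TPR) and (1, 1), and the area under it reduces to (1 + TPR - FPR) / 2, i.e., the mean of the true positive and true negative rates. The sketch below checks this with the trapezoidal rule; the operating point is an illustrative value, not one taken from Figure 4, and the function names are hypothetical.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def single_point_auc(tpr, fpr):
    """Closed form of the AUC for a classifier with one operating point."""
    return (1.0 + tpr - fpr) / 2.0


# Illustrative operating point of a hard (non-adjustable) classifier.
tpr, fpr = 0.98, 0.02
area = trapezoid_auc([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
closed_form = single_point_auc(tpr, fpr)
```

In this situation the AUC carries no information beyond the single pair of rates, which is why it should be interpreted with care for SVM, kNN and the ensemble.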
Finally, Table 7 contains the six performance evaluation criteria obtained in the experiments. As can be seen, except for the LDA classifier and the original voting method, all the values were above 96%, indicating high performance in the fruit/background classification in general. The high recall value of LDA is due to the fact that it tends to overclassify into the background class, so it cannot be considered a good result in view of its low specificity. Since LDA performs a simple linear separation of the samples, its poor results indicate that the samples cannot be classified with a linear decision boundary. This fact seriously affects the results of the voting method with LDA. The two ANN methods also present a certain tendency to overclassify into the background class, which is evident from their low precision. However, the majority voting method without LDA is able to reduce this bias; it achieves high values of recall and precision, which translate into an F-measure of 98.95%.
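The relationship between recall, specificity, precision and the F-measure discussed here can be made explicit with a short sketch. It uses the standard definitions from the four cells of a binary confusion matrix; which class is treated as positive is left to the caller, and the counts used below are illustrative, not those of Table 7.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from the cells of a binary confusion matrix."""
    recall = tp / (tp + fn)            # sensitivity, true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2.0 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Illustrative counts only.
metrics = binary_metrics(tp=90, fn=10, fp=5, tn=95)
```

A classifier that over-assigns samples to the positive class raises `recall` while lowering `specificity` and `precision`, which is the pattern described for LDA; the F-measure is high only when recall and precision are both high.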
As a concluding remark, the ensemble method without LDA is able to improve the accuracy of the constituent methods. It exhibits a better overall accuracy and greatly reduces the over-classification tendency of the ANN methods. The standard deviation of the accuracy also indicates a great consistency in these good results, as previously mentioned. LDA is the worst of all the methods, with an accuracy of only 96.57%. This justifies removing LDA from the ensemble, since it only worsens the final result, which then also has a standard deviation of nearly ±2.
Another aspect to consider is that the application of the proposed ensemble method without LDA requires the execution of all four classifiers that compose it. It is still unclear whether the added computational cost of the ensemble and the majority voting rule is justified, given the small increase in accuracy of only about 0.1%. Alternatively, the SVM classifier also achieves good results, with a low standard deviation, and is computationally more efficient. This is another advantage with respect to kNN, which requires storing and processing all the training samples during classification, while the SVM only stores the selected support vectors. In any case, the bottleneck of the process is most probably the color space conversion; since all the methods rely on L*a*b* and therefore share this cost, the application of the ensemble method could be justified. More experiments concerning the computational efficiency, which are outside the scope of this paper, would be needed.