Fruit Classification by Wavelet-Entropy and Feedforward Neural Network Trained by Fitness-Scaled Chaotic ABC and Biogeography-Based Optimization

Fruit classification is difficult because fruits come in many categories and often share similar shapes and features. In this work, we propose two novel machine-learning-based classification methods. The developed system consists of wavelet entropy (WE), principal component analysis (PCA), and a feedforward neural network (FNN) trained by fitness-scaled chaotic artificial bee colony (FSCABC) and biogeography-based optimization (BBO), respectively. K-fold stratified cross validation (SCV) was utilized for statistical analysis. The classification performance for 1653 fruit images from 18 categories showed that the proposed "WE + PCA + FSCABC-FNN" and "WE + PCA + BBO-FNN" methods achieve the same accuracy of 89.5%, higher than state-of-the-art approaches: "(CH + MP + US) + PCA + GA-FNN" at 84.8%, "(CH + MP + US) + PCA + PSO-FNN" at 87.9%, "(CH + MP + US) + PCA + ABC-FNN" at 85.4%, "(CH + MP + US) + PCA + kSVM" at 88.2%, and "(CH + MP + US) + PCA + FSCABC-FNN" at 89.1%. Besides, our methods used only 12 features, fewer than the other methods. Therefore, the proposed methods are effective for fruit classification.


Introduction
Fruit classification remains a hot topic in academic research. It can help supermarket cashiers identify the class of an individual fruit, with the goal of determining the price quickly [1]. Additionally, it is needed for providing dietary guidance to help people select suitable types of foods to fulfill their health and nutrient needs [2,3]. Furthermore, food factories also rely on fruit classification techniques for automatic packaging.
Manual fruit classification is still a challenging task. The fruit categories and subcategories vary from area to area, because the focus is not only on the necessary ingredients within fruits, but also on the area-dependent and population-dependent availability of fruits [3].
In this study, we use wavelet-entropy (WE), a relatively novel feature descriptor, to extract efficient features from color fruit images. WE combines wavelet transform and entropy, with the aim of estimating the degree of order/disorder of the fruit image with a high time-frequency resolution. Meanwhile, machine-learning-based methods are employed to create classifiers. We used a feed-forward neural network (FNN) because of its outstanding performance, which has been reported in the literature [16,17,18].
The remainder of this paper is organized as follows: Section 2 contains the literature review. Section 3 describes the methodology used in this study. Section 4 presents the experimental results. Section 5 discusses the results and gives the reasons for these results. Finally, Section 6 concludes the paper. For ease of reading, the acronyms that appear are listed at the end of this paper.

State-of-the-Art
Recently, scholars have proposed numerous automatic fruit classification methods. Pennington and Fisher (2009) [3] were the first to utilize a clustering algorithm to classify vegetables and fruits. Pholpho, Pathaveerat and Sirisomboon (2011) [4] used visible spectroscopy to classify bruised and non-bruised longan fruits. Yang, Lee and Williamson (2012) [5] used multispectral imaging analysis in a blueberry yield estimation system. Wu (2012) [6] selected the support vector machine (SVM) with a radial basis function (RBF) kernel to classify different fruit types, with an overall accuracy of 88.2%. Their multiclass strategy was max-wins-voting. Feng, Zhang and Zhu (2013) [7] employed Raman spectroscopy as a rapid and non-destructive tool, and adopted a polynomial fitting for baseline correction. Afterwards, principal component analysis (PCA) and hierarchical cluster analysis (HCA) were selected to recognize eight different citrus fruits. Cano Marchal et al. (2013) [8] created an expert system based on machine learning and computer vision, with the aim of estimating the content of impurities in a particular olive oil sample. Breijo et al. (2013) [9] used an odor sampling system (electronic nose) to classify the aroma of Diospyros kaki; its working parameters can be configured, which makes the system flexible. Fan et al. (2013) [10] used a two-hidden-layer artificial neural network (ANN) trained by back-propagation (BP) to predict texture characteristics from food-surface images. Omid et al. (2013) [11] used defects and size as features, and then presented an expert system that combines fuzzy logic and machine vision techniques. Zhang et al. (2014) [12] proposed a fitness-scaled chaotic artificial bee colony (FSCABC) algorithm, with the aim of developing an automatic fruit classification system. Khanmohammadi et al. (2014) [13] used Fourier transform near infrared (FT-NIR) spectrometry to authenticate the origin of persimmon fruits cultivated in different regions of Spain. Chaivivatrakul and Dailey (2014) [14] proposed a texture-based technique to detect green fruits on plants. Their method involved interest point feature extraction and descriptor computation. Muhammad (2015) [15] classified date fruits using both local binary pattern (LBP) and Weber local descriptor (WLD). They used the Fisher discrimination ratio (FDR) for feature selection and SVM as the classifier.
The above techniques suffer from the following shortcomings. (1) They require expensive sensors: an invisible light sensor, a gas-sensitive sensor, a chemical sensor, a heat sensor, a dew sensor, or a weight sensor; (2) The classifiers work for limited categories of fruits; (3) The recognition systems perform poorly on fruits with nearly identical shape, color, and texture features; (4) The classification accuracy does not reach the standard needed for practical use.
Note the number in parentheses denotes the number of images for each class. Figure 1 depicts the samples of different categories of fruits.

Four-Step Preprocessing
We followed the four-step preprocessing method in [12]. (i) We captured fruit images with a digital camera or obtained fruit images from search engines, and labelled them manually; (ii) The split-and-merge algorithm [19] was employed to remove the background; (iii) A square window was used to capture the area of the fruit while centering it; (iv) The square images were downsampled to 256 × 256, since high resolution does not improve the classification performance.

Discrete Wavelet Transform
The discrete wavelet transform (DWT) is an effective implementation of the wavelet transform (WT) that uses dyadic positions and scales. Let x(t) be a square-integrable function. The continuous wavelet transform of the signal x(t) relative to a particular wavelet u(t) is defined according to [20]

$$C(a_s, a_t) = \int_{-\infty}^{\infty} x(t)\, u(t \mid a_s, a_t)\, \mathrm{d}t \qquad (1)$$

where

$$u(t \mid a_s, a_t) = \frac{1}{\sqrt{a_s}}\, u\!\left(\frac{t - a_t}{a_s}\right) \qquad (2)$$

Here, the wavelet $u(t \mid a_s, a_t)$ is obtained from the mother wavelet u(t) by two types of operations: dilation and translation. $a_s$ represents the scale factor and $a_t$ the translation factor. They are both real positive numbers. C represents the coefficients of the WT.
Discretization of Equation (1) is undertaken by restraining $a_s$ and $a_t$ to a discrete lattice ($a_s = 2^{a_t}$ and $a_s > 0$). This generates the so-called discrete wavelet transform (DWT):

$$L_{j,k} = \mathrm{DS}\Big[\sum_{t} x(t)\, l_j^{*}(t - 2^{j}k)\Big], \qquad H_{j,k} = \mathrm{DS}\Big[\sum_{t} x(t)\, h_j^{*}(t - 2^{j}k)\Big] \qquad (3)$$

where L and H represent the coefficients of the approximation and the detail subbands, respectively. The l and h denote the low-pass and high-pass filters, respectively. Parameters k and j represent the discretized values of the translation and scale factors, respectively. DS denotes the downsampling operation. Equation (3) iterates with approximations being decomposed successively, such that the signal is broken down to the required level, which meets the expected resolution [21]. The whole process is termed a wavelet decomposition tree.
This technique can be generalized to a 2D (fruit) image, i.e., the 1D-DWT is applied to each dimension separately and independently. Consequently, four subbands (LL, HH, HL and LH) occur at each level. The subband LL corresponds to the approximation coefficients, and is passed on to the next level of decomposition. As the decomposition level increases, more compact yet coarser approximation components are obtained. Thus, wavelets provide a simple hierarchical framework for interpreting the fruit image information.
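The separable 2D decomposition described above can be sketched in a few lines. The following is a minimal illustration using the orthonormal Haar filters and plain Python lists; the function names are hypothetical, and the LH/HL naming follows one common convention (others swap the two).

```python
def haar_1d(signal):
    """One level of the orthonormal Haar DWT: filter, then downsample by 2."""
    s = 2 ** 0.5
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(matrix):
    """Apply haar_1d to every column and return (low, high) as row-major matrices."""
    lows, highs = [], []
    for col in transpose(matrix):
        lo, hi = haar_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def haar_2d(image):
    """One decomposition level of a 2D image (list of equal-length rows, even sizes).

    Returns the four subbands (LL, LH, HL, HH), each half the size per dimension.
    """
    row_low, row_high = [], []
    for row in image:               # 1D-DWT along the rows
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = filter_columns(row_low)    # then 1D-DWT along the columns
    hl, hh = filter_columns(row_high)
    return ll, lh, hl, hh
```

A higher-level decomposition is obtained by calling `haar_2d` again on the returned LL subband.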

Wavelet-Entropy
The traditional Boltzmann/Gibbs concept of entropy was redefined by Shannon as a measure of uncertainty for the information content of a system, known as Shannon entropy (SE) [22]

$$S = -\sum_{v=1}^{G} p_v \log_2 (p_v)$$

where S represents the value of entropy, v the grey-level of the decomposition coefficient, $p_v$ the probability of v, and G the total number of grey-levels.
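As a rough illustration of the entropy step, the sketch below computes the Shannon entropy of one subband. Quantizing the coefficients into `levels` equal-width grey-levels (256 by default) is an assumption made here for illustration; the source does not state how G is chosen.

```python
import math
from collections import Counter


def shannon_entropy(coefficients, levels=256):
    """S = -sum(p_v * log2(p_v)) over the quantized values of a 2D subband."""
    flat = [c for row in coefficients for c in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return 0.0  # a constant subband carries no uncertainty
    # Assumed quantization: map each coefficient to an integer level 0..levels-1.
    bins = [min(levels - 1, int((c - lo) / (hi - lo) * levels)) for c in flat]
    counts = Counter(bins)
    n = len(flat)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

Applying this function to every subband of every color channel would give one entropy value per subband, which is the kind of compact descriptor the WE feature vector is built from.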

Principal Component Analysis
Those 30 features may slow the computation, consume memory, complicate the classification process, and even worsen the classification performance. Principal component analysis (PCA) was utilized to reduce the number of features (30 at this step) further, with the criterion that the reduced features should explain more than 95% of the variance of the original features [23].
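The 95% criterion amounts to picking the smallest number of leading principal components whose cumulative variance share passes the threshold. A minimal sketch, assuming the eigenvalues of the feature covariance matrix are already available (the function name is hypothetical):

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative explained
    variance fraction exceeds `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return k
    return len(ordered)
```

For example, with eigenvalues `[6, 3, 0.6, 0.4]` the cumulative shares are 60%, 90%, and 96%, so three components are retained.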

Feed-Forward Neural Network
After extracting and reducing the features from the fruit pictures, we feed them into the feedforward neural network (FNN), which can classify nonlinearly separable patterns and approximate an arbitrary continuous function [24]. We chose FNN because (1) it has been widely used in pattern classification; (2) it does not need any a priori information about the probability distribution [25]. The common model of a one-hidden-layer FNN is shown in Figure 3. There are three layers within this model: an input layer (IL), an output layer (OL), and a hidden layer (HL) in between. Nodes of adjacent layers are fully connected. Each link is assigned a weight, corresponding to the strength of the connection. Sigmoid and linear functions are used as the activation functions for HL and OL, respectively. Training of FNN is an optimization problem that selects the optimal weights to minimize the mean-squared error (MSE).
The BP, SA, PSO, ABC, and GA algorithms all demand excessive computation. These optimizers may also be trapped easily in local optima, so they may terminate without yielding the optimal weights/biases of the network. Zhang et al. (2014) [12] showed that FSCABC was superior to BP, MBP, GA, ABC, and PSO in the training of FNN. However, FSCABC is time consuming. In this study, we propose to use both FSCABC and another relatively novel optimization method, biogeography-based optimization (BBO). In the experiments, we compare their performances.
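The forward pass and the MSE objective that the optimizers minimize can be sketched as follows. This is a minimal illustration with a sigmoid hidden layer and a linear output layer, as described above; the parameter layout (one weight row per neuron, separate bias lists) is an assumption made for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer FNN: sigmoid hidden layer, linear output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]


def mse(samples, targets, params):
    """Mean-squared error over all samples and output neurons.

    `params` is the tuple (w_hidden, b_hidden, w_out, b_out); a training
    algorithm searches for the tuple that makes this value minimal.
    """
    total, count = 0.0, 0
    for x, t in zip(samples, targets):
        y = forward(x, *params)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
        count += len(t)
    return total / count
```

A population-based optimizer would flatten `params` into one vector per candidate solution and use `mse` as the cost to rank candidates.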

Fitness-Scaled Chaotic Artificial Bee Colony
A detailed description of the fitness-scaled chaotic artificial bee colony (FSCABC) can be found in the literature [32]. Here, we only list the pseudocode of FSCABC in Algorithm 1.

Algorithm 1: FSCABC
Step 1 Initialization: Generate and evaluate the initial population.
Step 2 Produce new food sources: Produce new solutions in the neighborhood of the last solution for the employed bees. The random number is replaced with a chaotic number generator. Apply greedy selection to keep the best solutions.
Step 3 Produce new onlookers: Generate new solutions for the onlookers based on the population of the last step, selecting the best according to the probability given by the scaled fitness values.
Step 4 Produce new scouts: Discard the worst candidate solution and replace it with a new randomly generated solution. Here the random number generator is replaced with a chaotic number generator.
Step 5 If the termination criterion is met, output the final results; otherwise jump to Step 2.
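Step 2 of the pseudocode above can be illustrated with a short sketch. Two choices here are assumptions and not taken from the source: the logistic map is used as the chaotic number generator, and the neighborhood move is the standard ABC update that perturbs one coordinate relative to a partner solution. Fitness scaling for the onlooker phase (Step 3) is not shown.

```python
def logistic_map(seed=0.31):
    """Assumed chaotic number generator: x <- 4 x (1 - x), values in [0, 1]."""
    x = seed
    while True:
        x = 4.0 * x * (1.0 - x)
        yield x


def employed_bee_step(population, cost, chaos):
    """One employed-bee pass: perturb each food source, keep it only if better.

    `population` is a list of at least two solutions (lists of floats) and is
    updated in place; `cost` is the function to minimize.
    """
    n, dim = len(population), len(population[0])
    for i in range(n):
        k = int(next(chaos) * n) % n          # partner solution
        if k == i:
            k = (k + 1) % n
        j = int(next(chaos) * dim) % dim      # coordinate to perturb
        phi = 2.0 * next(chaos) - 1.0         # chaotic step in [-1, 1]
        candidate = list(population[i])
        candidate[j] += phi * (population[i][j] - population[k][j])
        if cost(candidate) < cost(population[i]):   # greedy selection
            population[i] = candidate
    return population
```

Because of the greedy selection, no food source ever gets worse during this step, which is the property the scout phase later relies on when it discards the worst candidate.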

Biogeography-Based Optimization
Biogeography-based optimization (BBO) was inspired by biogeography, which describes the speciation and migration of species between isolated habitats, and the extinction of species [33]. Habitats friendly to life are said to have a large habitat suitability index (HSI), and vice versa. Features that correlate with HSI include land area, temperature, rainfall, topographic diversity, vegetative diversity, etc. Those features are termed "suitability index variables (SIV)". As in other bio-inspired algorithms, the SIVs and HSI are regarded as the search space and objective function, respectively [34].

Habitats with high HSI have a high emigration rate and a low immigration rate, since those habitats already support many species. Species that migrate to this kind of habitat will tend to die even though it has high HSI, because there is too much competition for resources from other species. Meanwhile, habitats with low HSI have a high immigration rate and a low emigration rate; the reason is not that species want to immigrate, but that there are plenty of resources for additional species [35].
To illustrate, Figure 4 shows the relationship of immigration and emigration probabilities, where λ and μ represent the immigration and emigration probability, respectively. I and E represent the maximum immigration and emigration rate, respectively. Smax represents the maximum number of species that the habitat can support, and S0 represents the equilibrium species count. Following common convention, we assumed a linear relationship between rates and numbers of species, and defined the immigration and emigration rates of a habitat that contains S species as

$$\lambda_S = I\left(1 - \frac{S}{S_{\max}}\right), \qquad \mu_S = \frac{E\,S}{S_{\max}}$$

How can the biogeography theory be transformed into an optimization algorithm? Immigration and emigration rates of each habitat are used to share information across the ecosystem. With modification probability Pd, solutions Hi and Hj are modified in the way that we use the immigration rate of Hi and the emigration rate of Hj to determine whether some SIVs of Hj are migrated to some SIVs of Hi.
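The linear migration model can be made concrete with a small sketch. It assumes the form commonly used for BBO, λ = I(1 − S/Smax) and μ = E·S/Smax; the function and parameter names are hypothetical.

```python
def migration_rates(species_count, s_max, max_immigration=1.0, max_emigration=1.0):
    """Return (immigration rate, emigration rate) for a habitat with S species.

    Assumed linear model: immigration falls from I to 0 and emigration rises
    from 0 to E as the species count grows from 0 to s_max.
    """
    fraction = species_count / s_max
    immigration = max_immigration * (1.0 - fraction)
    emigration = max_emigration * fraction
    return immigration, emigration
```

With E = I, the two rates always sum to E, and they are equal at S = Smax/2, which is the equilibrium point of this symmetric special case.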
Mutation was simulated at the SIV level. Solutions with very large or very small HSI are equally improbable, whereas medium-HSI solutions have more chances to occur. This idea can be carried out through a mutation rate W, which is inversely proportional to the solution probability PS and is bounded by Wmax, a predefined mutation-related parameter representing the maximum mutation rate. Pmax is the maximum value of P(∞).
Elitism is also included in standard BBO, with the goal of retaining the best solutions within the ecosystem. Hence, the mutation approach will not impair the high-HSI habitats. Elitism is performed by forcing λ = 0 for the e best habitats, in which e is a predefined number of elites.
The pseudocode of BBO is listed in Algorithm 2. Both FSCABC and BBO were used to train the weights and biases of FNN, and we dubbed them FSCABC-FNN and BBO-FNN.
Algorithm 2: BBO
Step 1 Initialize BBO parameters, which include a problem-dependent method of mapping problem solutions to SIVs and habitats, the modification probability Pd, the maximum species count Smax, the maximum mutation rate Wmax, the maximum migration rates E and I, and the elite number e.
Step 2 Initialize the population by generating a random set of habitats.
Step 3 Compute HSI for each habitat.
Step 5 Modify the whole ecosystem by migration based on Pd, λ and μ.
Step 6 Mutate the ecosystem based on mutation probabilities.
Step 8 If the termination criterion is met, output the best habitat; otherwise jump to Step 3.
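The loop in Algorithm 2 can be sketched as a compact minimizer. This is a simplified illustration under several assumptions that are not from the source: species counts are assigned by rank, E = I = 1, mutation uses a fixed per-SIV probability instead of a probability-dependent rate, and elitism is implemented by copying the saved elites over the worst habitats. All names are hypothetical.

```python
import random


def bbo_minimize(cost, dim, pop_size=20, iterations=100, p_modify=1.0,
                 p_mutate=0.05, elites=2, bounds=(-1.0, 1.0), seed=0):
    """Simplified BBO: lower cost means higher HSI."""
    rng = random.Random(seed)
    low, high = bounds
    habitats = [[rng.uniform(low, high) for _ in range(dim)]
                for _ in range(pop_size)]
    for _ in range(iterations):
        habitats.sort(key=cost)                      # best habitat first
        saved = [list(h) for h in habitats[:elites]]
        # Rank-based rates: the best habitat emigrates most, immigrates least.
        mu = [(pop_size - rank) / pop_size for rank in range(pop_size)]
        lam = [1.0 - m for m in mu]
        new_habitats = [list(h) for h in habitats]
        for i in range(pop_size):
            if rng.random() >= p_modify:
                continue                             # habitat left unmodified
            for d in range(dim):
                if rng.random() < lam[i]:
                    # Roulette-wheel choice of the emigrating habitat.
                    j = rng.choices(range(pop_size), weights=mu)[0]
                    new_habitats[i][d] = habitats[j][d]
                if rng.random() < p_mutate:
                    new_habitats[i][d] = rng.uniform(low, high)
        # Elitism: overwrite the worst habitats with the saved elites.
        new_habitats.sort(key=cost)
        new_habitats[-elites:] = saved
        habitats = new_habitats
    return min(habitats, key=cost)
```

For FNN training, `cost` would be the network MSE evaluated on a flattened weight/bias vector, and `dim` the number of weights and biases.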

Statistical Analysis
A five-fold stratified cross validation (SCV) was employed. The pseudo-code is listed in Algorithm 3. SCV divides the dataset into folds, and in turn makes each fold the test set and the remaining folds the training set. SCV reports the average of the out-of-sample accuracies A1, …, A5 measured on the five test sets, i.e., the average accuracy A = (A1 + … + A5)/5.
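The fold construction and averaging can be sketched as follows. This minimal version assigns the samples of each class to folds in round-robin order so that every fold keeps roughly the class proportions of the whole dataset; shuffling is omitted for determinism, and `evaluate` is a hypothetical callback that trains on the given training indices and returns the accuracy on the test indices.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Split sample indices into k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, evaluate, k=5):
    """Average the accuracy returned by evaluate(train, test) over k folds."""
    folds = stratified_folds(labels, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k
```

Each sample appears in exactly one test fold, so every image contributes to the reported accuracy exactly once.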

Implementation
Figure 5 shows the proposed system that consists of three different stages (feature extraction, feature reduction, classification) as above. In the figure, blue and green arrows are used to represent offline learning and online prediction, respectively. The number of reduced features (12) was obtained by the following PCA experiment. The proposed system has two phases (Algorithm 4): offline learning with the aim of training, and online prediction in order to classify the query fruit image. Note that we used both BBO and the FSCABC algorithm for training the FNN.
Step 2. Preprocessing and Feature Extraction. Remove the background by the split-and-merge algorithm [19]. Crop and resize each image to 256 × 256 × 3. Center the fruits. For each image, 30 features are obtained that contain WEs of three color channels.
Step 3. Feature Reduction. The number of features is decreased by PCA, and the criterion is to cover more than 95% of the total variance. The PC coefficient matrix is generated.
Step 4. Classifier Training. The training set is fed into the feed-forward neural network. The weights and biases of the FNN are adjusted to minimize the average MSE of the FNN. BBO and FSCABC are set as the training algorithms, respectively.
Step 5. Evaluation. A K-fold SCV is employed for statistical evaluation.

Phase II: Online Prediction
Step 1. Query Image. Capture the query image with a digital camera.
Step 2. Preprocessing and Feature Extraction. The same as Phase I.
Step 3. Feature Reduction. The reduced feature vector is obtained by multiplying the extracted features with the PC coefficient matrix generated in Phase I.
Step 4. Prediction. The reduced feature vector is sent into the classifier trained in Phase I to predict the fruit category.
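The feature-reduction step of the online phase is a single matrix product. The sketch below also subtracts the training-set mean before projecting, which is the usual PCA convention but an assumption here, since the source only mentions the multiplication; the names are hypothetical.

```python
def project(features, mean, pc_matrix):
    """Project one feature vector onto the principal components.

    `mean` is the per-feature mean from the training set (assumed centering).
    `pc_matrix` has one row per original feature and one column per component.
    """
    centered = [f - m for f, m in zip(features, mean)]
    n_components = len(pc_matrix[0])
    return [sum(centered[i] * pc_matrix[i][c] for i in range(len(centered)))
            for c in range(n_components)]
```

In the system described above, `features` would hold the 30 wavelet-entropy values of the query image and `pc_matrix` would be the 30 × 12 coefficient matrix from the offline phase, giving the 12 inputs of the trained classifier.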

Experiments and Results
The experiments were implemented on a P4 IBM machine with an Intel Core i3-2330M 2.2 GHz processor and 6 GB RAM, running 64-bit Microsoft Windows 7. The proposed algorithms were developed in-house on the MATLAB 2014a (The MathWorks) platform.

DWT Results
Table 1 lists the DWT results of the RGB channels of fruit images. Three-level Haar wavelet decomposition was performed. Due to the page limit, only Black Grapes and Tangerines are shown. For each image, the size of the DWT coefficients is 256 × 256 × 3 × 3 = 589,824, which is then reduced to 30 by the entropy operation.
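The figure of 30 features is consistent with one entropy value per subband: a three-level 2D decomposition leaves three detail subbands per level plus one final approximation subband, i.e., 10 subbands per color channel. This breakdown is an inference from the reported total and not stated explicitly in the text; the short check below reproduces both numbers.

```python
def wavelet_entropy_feature_count(levels=3, channels=3):
    """One entropy value per subband: 3 detail subbands per level + 1 approximation."""
    subbands_per_channel = 3 * levels + 1
    return subbands_per_channel * channels


def coefficient_count(height=256, width=256, channels=3, levels=3):
    """Coefficient count as stated in the text: height x width x channels x levels."""
    return height * width * channels * levels


print(wavelet_entropy_feature_count())  # 30
print(coefficient_count())              # 589824
```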

Table 1. DWT results of the R, G, and B channels for Black Grapes and Tangerines (images not reproduced here).

Feature Reduction
The curve of cumulative variance against the number of selected PCs is shown in Figure 6, which shows that only 12 features (see the big red dot) are needed to explain more than 95% of the variance.

Algorithm Comparison
Recall that 12 features remain after PCA, so the structure of the FNN is set to 12-10-18. The number of input neurons NI and output neurons NO corresponds to the number of features and categories, respectively. The number of hidden neurons NH is set to 10 via the information entropy method [36].
We compared WE with other feature vectors, namely the combination of color-histogram (CH), morphology-based features (MP), and Unser's features (US). We also compared the proposed FSCABC-FNN and BBO-FNN with the latest classifiers, including GA-FNN [18], PSO-FNN [16], ABC-FNN [31], and kSVM [6]. To make the analysis statistically sound, a five-fold stratified cross validation was employed. The parameters of those training algorithms were obtained using a trial-and-error method and are shown in Table 2. The maximal number of iteration steps of all algorithms is set to 500. The population size SP of all algorithms is set to 20. To reduce the effect of randomness, each algorithm was run 20 times. Table 3 gives the classification accuracy results obtained by different algorithms. We compared the two proposed methods "WE + PCA + FSCABC-FNN" and "WE + PCA + BBO-FNN" with existing methods. For the computation time, the training of FSCABC-FNN cost 31 s on average, and the training of BBO-FNN cost only 26 s on average. (* The results of five existing algorithms were extracted from literature [12]).
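The 12-10-18 structure also fixes the dimension of the search space that the training algorithms explore. A short check, assuming one bias per hidden and output neuron (the source mentions weights and biases but does not give the count):

```python
def count_parameters(n_in, n_hidden, n_out):
    """Number of weights and biases in a one-hidden-layer FNN."""
    hidden = n_in * n_hidden + n_hidden    # weights into HL plus HL biases
    output = n_hidden * n_out + n_out      # weights into OL plus OL biases
    return hidden + output


print(count_parameters(12, 10, 18))  # 328
```

Under this assumption, each candidate solution handled by FSCABC or BBO is a vector of 328 real numbers.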

Discussions
The curve in Figure 6 indicates that PCA reduces features efficiently; the leading components contribute the most to the cumulative variance. PCA accelerates the proposed fruit recognition system by reducing the number of features. Although 30 features will not overburden current computers, the 12 reduced features speed up the subsequent classification procedure. Additionally, removing redundant features leads to an improvement in performance.
Another finding from Table 3 is that BBO-FNN yields the same result as FSCABC-FNN. However, the latter used complicated techniques such as fitness-scaling and a chaotic series generator, while the former is performed in its plain form. The computation time comparison (26 s for BBO-FNN versus 31 s for FSCABC-FNN) also highlights the simplicity of BBO-FNN. Hence, we expect that BBO is an efficient swarm-intelligence method that will have many successful applications.
Both our two proposed methods and state-of-the-art methods obtain a relatively low accuracy for fruit classification. Our methods yield 89.5%, while other methods yield at most 89.1%. The results seem low compared to other applications like face classification and medical classification, which usually achieve accuracy higher than 99%. The reasons are three-fold: (1) The fruit images are obtained in complicated conditions: the pose and position of cameras differ, and the illumination conditions vary; (2) The different categories of fruits also pose challenges. In total, 18 categories is quite a large number for multi-class classification. Some similar categories will cause incorrect classification; (3) The automatic classification of fruits is not fully investigated, and further research will be undertaken in the future.
Why is WE more efficient than the other three features and even than their combination? The reason is that WE combines wavelet decomposition and entropy to extract features from fruit images with a high time-frequency resolution. The entropy is capable of extracting relevant information from complex and high-dimensional datasets (here the wavelet coefficients). If we omit the entropy procedure, this system will not work appropriately and the classification performance will deteriorate.
For the classification, we did not consider using a convolutional neural network (CNN) in the form applied in deep learning, which tends to top almost every image classification benchmark nowadays. CNN has the advantage that it does not require carefully prepared and centered pictures. The reason why we did not use CNN is the small data size (see Section 3.1). The dataset of 1653 images is rather small for CNN learning. As reported, CNN performs better than traditional classifiers only for "big data" [37,38]. Augmenting the data size is not difficult, so we shall try to collect more data and try the CNN method in the future.
We used the sigmoid function as the activation function for the hidden layer. Nevertheless, the rectified linear unit (ReLU) is receiving more and more attention, in the form of f(x) = max(0, x) [39]. ReLUs are more biologically plausible than the widely used sigmoid function, and are reported to have superior performance to traditional activation functions [40]. We will try ReLU in future research.
In closing, the contribution of the work lies in the following three aspects. (1) We used a novel tool, WE, that combines wavelet decomposition with Shannon entropy; (2) We proposed two novel classification methods, "FSCABC-FNN" and "BBO-FNN", based on a traditional FNN classifier and two novel swarm-intelligence optimization methods; (3) We showed that the two proposed methods are superior to state-of-the-art methods in terms of accuracy and number of features.

Conclusions and Future Research
This work proposed two novel classification methods, "WE + PCA + FSCABC-FNN" and "WE + PCA + BBO-FNN", for the application of fruit classification. Their accuracies were both 89.5%, which is higher than state-of-the-art methods. Future work will concentrate on the following six areas: (1) Extending our research to fruit images obtained in severe conditions, such as dried, sliced, tinned, canned, and partially covered; (2) Including additional relevant features (such as local binary patterns, wavelet-energy [41], spider-web-plot [42], wavelet packet transform, etc.) to enhance the classification performance; (3) Using interactive data mining [43] and knowledge discovery [44] to test the proposed method; (4) Using compressed sensing techniques [45,46] to represent the image in the sparsity domain; (5) Using advanced classification methods, like evolutionary methods inspired by Lamarck and Baldwin [29]; (6) Trying other activation functions such as ReLU.
Immigration and emigration rates are utilized to communicate between different habitats. Considering the special case E = I, we have $\lambda_S + \mu_S = E$.

Figure 5. Diagram of the proposed method.

Algorithm 3: 5-Fold Stratified Cross Validation
i of test set of i-th fold.
Step 5 Let i = i + 1. If i ≤ 5, then jump to Step 2, otherwise jump to Step 6.
Step 6

Table 2. Parameter setting of training algorithms.

Table 3. Classification accuracy based on different training algorithms.