Identification of Green, Oolong and Black Teas in China via Wavelet Packet Entropy and Fuzzy Support Vector Machine

To develop an automatic tea-category identification system with a high recall rate, we proposed a computer-vision and machine-learning based system, which did not require expensive signal acquiring devices and time-consuming procedures. We captured 300 tea images using a 3-CCD digital camera, and then extracted 64 color histogram features and 16 wavelet packet entropy (WPE) features to obtain color information and texture information, respectively. Principal component analysis was used to reduce features, which were fed into a fuzzy support vector machine (FSVM). Winner-take-all (WTA) was introduced to help the classifier deal with this 3-class problem. The 10 × 10-fold stratified cross-validation results show that the proposed FSVM + WTA method yields an overall recall rate of 97.77%, higher than 5 existing methods. In addition, the number of reduced features is only five, less than or equal to existing methods. The proposed method is effective for tea identification.


Introduction
Tea is a pleasant beverage commonly brewed by pouring boiling or near-boiling water over curled leaves of Camellia sinensis, an evergreen bush indigenous to Asia. Tea is consumed extensively worldwide, and its market is expanding [1].
Tea originated in China, where it has long been recognized for its healing properties, and in recent times there has been growing evidence that specific substances in tea can help resist diseases. For example, researchers have found that antioxidants in tea may protect against Alzheimer's disease [2], Parkinson's disease [3], neurodegenerative disease [4], high blood pressure and cardiovascular disease [5], colon cancer [6], breast cancer [7], and lung cancer [8].
There are many different types of tea in the world. At least six types are produced; they are listed in Table 1. This study focuses on tea in China, where green, oolong, and black tea are the most popular classes [9].

Table 1. Tea categories.

Category            Characteristics
White tea           wilted and unoxidized
Yellow tea          unwilted and unoxidized, with sweltering
Green tea           unwilted and unoxidized
Oolong tea          wilted, bruised, and partially oxidized
Black tea           wilted and fully oxidized
Post-fermented tea  fermented green tea

The classification of tea categories is very important for controlling fermentation time and obtaining specific types of tea, which strongly influences the commercial market. In the past, human experts measured the color variation during fermentation by visual checks; however, this method has various shortcomings such as irreproducibility, high labor cost, and inconsistency. The colorimeter was then introduced for objective quality evaluation. In addition to examining color features, sieves of different pore sizes were used. However, those traditional methods are rather coarse; hence, they cannot accurately assess, predict, and control tea quality.
Two categories of automatic tea-classification methods have been proposed: (1) developing new measurement devices and (2) developing new algorithms based on computer vision. In the former category, Herrador and Gonzalez [10] selected eight metals (Al, Zn, Ca, Mn, Ba, Mg, Cu, and K) as chemical descriptors to differentiate three tea classes. They showed that a back propagation artificial neural network (BP-ANN) achieved a recall rate of almost 95%. Zhao et al. [11] utilized near-infrared (NIR) spectroscopy for fast identification of green, oolong, and black tea. They extracted five principal components (PCs), which were sent to SVM classifiers. For all categories, their identification accuracies were no less than 90%. Chen et al. [12] proposed using NIR reflectance spectroscopy to identify three types of tea. They used RBF-SVM as the classifier and found that the best identification rates for green, black, and oolong teas were 90%, 100%, and 95%, respectively. Wu et al. [13] put forward a new nondestructive method based on texture features of multispectral digital images. The sample images were obtained via a multispectral digital imager with red, NIR, and green wavebands. They combined DCT and LS-SVM as the classifier. Chen et al. [14] developed a portable electronic nose based on an odor imaging sensor array, with the aim of classifying teas of three different fermentation degrees. Liu et al. [15] used an electronic tongue technique to analyze 43 samples of green and black tea. A class of metallic oxide-modified nickel foam electrodes (SnO2, ZnO, TiO2, Bi2O3) was compared. The signals obtained by cyclic voltammetry underwent multivariate data analysis consisting of principal component analysis (PCA) and SVM.
The above methods can obtain good identification results; however, they need expensive signal-acquiring devices and time-consuming procedures. On the other hand, the latter category (i.e., computer-vision based systems) is gaining support in the food industry due to its speed, low cost, consistency, reliability, and high accuracy. For instance, Borah et al. [16] described a novel texture-feature estimation technique, with the goal of discriminating images of eight dissimilar grades of CTC tea. Chen et al. [17] used 12 color feature variables and 12 texture feature variables. PCA and linear discriminant analysis (LDA) were employed to build the identification model for five varieties of Chinese green tea. Features were reduced to 11. Jian et al. [18] used computer-vision techniques to classify and grade a particular tea sample based on color and shape parameters. Two kinds of features were obtained: (1) the color features were extracted by transforming the color tea image to the HSI model; (2) the shape features were extracted after the color images were reduced to binary images. A genetic neural network (GNN) was used as the classification method. It achieved acceptable identification performance with eight features covering both shape and color. Gill et al. [19] presented a survey of versatile computer vision techniques related to color and texture analysis, with a focus on tea grading and monitoring. They pointed out that computer vision and image analysis are non-destructive ways of sorting tea, and they predicted that computer-vision based techniques will become more and more popular in tea classification. Laddi et al. [20] acquired images of tea granules using a 3-CCD color camera under dual ring light. In all, 10 graded tea samples were obtained and analyzed. Their acquired features (energy, entropy, contrast, correlation, and homogeneity) were reduced by principal component analysis (PCA). Their experimental results showed that the best discrimination was obtained under dark-field illumination (variance = 96%). In contrast, bright-field illumination showed weak discrimination (variance = 83%). Zhang et al. [21] used a combination of different features (color, shape, and texture) and a feedforward neural network (FNN) with the aim of classifying fruits. The method can be directly used for tea identification.
Those methods used an inexpensive digital camera as the main system; however, their classification accuracy does not meet the standard of practical usage. In information theory, the Shannon entropy [22] is commonly based on compression in a quantization process, and this can be investigated by using wavelet compression. Hence, wavelet entropy (WE) was proposed and has been used in many applications. From another point of view, WE minimizes the feature space for data analysis. To improve the performance, we proposed replacing wavelet decomposition with the wavelet packet transform (WPT), so that the WE is replaced with the wavelet packet entropy (WPE).
Based on WPE and machine learning techniques, we proposed a novel tea-identification method, with the aim of developing an automatic identification system with better identification accuracy for green, black, and oolong teas. To implement it, we employed the latest developments in signal processing. The proposed method consists of three stages, following common convention: (i) Feature extraction: we combined color features (obtained by color histogram) with texture features (obtained by WPE); (ii) Feature reduction: we employed PCA to reduce the feature dimensions; (iii) Classification: we introduced a fuzzy support vector machine (FSVM).
The remainder of the paper is organized as follows: Section 2 describes the sample preparation, the image-acquiring procedure, the proposed methodology, and the statistical setting. Experiments in Section 3 compare the proposed methods with state-of-the-art methods. Finally, Section 4 concludes the paper and outlines future research directions. For ease of reading, the abbreviations used are explained in the Abbreviations section at the end of this work.

Tea Preparation
Three hundred samples from three categories were prepared, with origins in various provinces of China. Each tea category contained different brands to increase the generalization ability of the identification system. All tea samples were purchased within a period of four months. Table 2 shows the characteristics of the tea samples used.

Image Acquiring
A flowchart of the computer vision system used is illustrated in Figure 1. It consists of five basic parts: a digital camera, an illumination platform, a capture board (digitizer or frame grabber), computer software, and computer hardware [23]. The image-acquiring procedure is as follows: after the tea leaves were spread uniformly, tea images were captured by a 3-CCD digital camera, which uses three separate charge-coupled devices (CCDs). Each CCD takes a separate measurement of one primary color: red (R), green (G), or blue (B) light [24]. The optical system splits the light that enters through the lens by means of a prism assembly, and the appropriate wavelength ranges of light are directed to the corresponding CCDs. In general, a 3-CCD camera provides better image quality than a 1-CCD camera through enhanced resolution and lower noise [20].
The intensity and nature of the illumination affect the performance of the computer vision system. The lighting is arranged as either back or front lighting [25]. Back lighting produces a silhouette image that accentuates tea edges, whereas front lighting enhances tea surface features [19]. In this study, front lighting was used.

Feature Processing
Here, we present a hybrid feature set that consists of color and wavelet-based features. A tea image of size 256 × 256 has 65,536 features in total if each pixel is regarded as a feature. From it, we obtained 64 color-based features and 16 wavelet packet entropy features. Then, principal component analysis (PCA) was utilized to decrease the total number of features, such that the retained PCs explain more than 99.9% of the variance of the feature set. Figure 2 shows a diagram of the proposed feature processing method.

Color Histogram
The color histogram (CH), which is used in a wide range of applications, describes the distribution of colors in a given tea image [26]. The CH counts the number of pixels whose colors fall within each of a fixed set of ranges that together cover the predefined color space. Usually, the CH is generated in two steps: (i) color discretization into 4 × 4 × 4 = 64 bins (that is, four bins for each channel), and (ii) counting the number of pixels in each bin [21].
The CH offers a concise summary of the color distribution. It is relatively invariant to both rotation and translation about the viewing axis. Because the CH features of two images can be extracted and compared directly, the CH is especially suitable for detecting objects of unknown position and unknown rotation angle inside a scene. The CH is used as an important feature in many applications.
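The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the function name `color_histogram`, the 8-bit intensity range, the normalization of counts to proportions, and the toy pixel list are all assumptions made for the example.

```python
def color_histogram(pixels, bins_per_channel=4, levels=256):
    """Return a normalized bins_per_channel**3 histogram of (R, G, B) pixels.

    Assumes 8-bit channels; normalizing by the pixel count is an
    illustrative choice, not something stated in the text.
    """
    width = levels // bins_per_channel            # 64 intensity values per bin
    counts = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        # Step (i): discretize each channel into one of four bins.
        ri, gi, bi = r // width, g // width, b // width
        # Step (ii): count the pixel in its joint (R, G, B) bin.
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
    total = len(pixels)
    return [c / total for c in counts]


# Hypothetical toy "image": three dark-green pixels and one near-white pixel.
toy_pixels = [(20, 90, 30), (25, 100, 40), (10, 80, 20), (250, 250, 250)]
features = color_histogram(toy_pixels)
print(len(features), features[4], features[63])
```

Because the bin index depends only on pixel values and not on pixel positions, shuffling or rotating the pixel grid leaves the 64 features unchanged, which is the invariance property described above.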

Discrete Wavelet Packet Transform
In the domain of signal processing, the discrete wavelet transform (DWT) is an outstanding tool with various successful applications [27]. The discrete wavelet packet transform (DWPT) is a powerful extension of the DWT. The difference is that, in a DWPT, all nodes in the tree structure are allowed to split further at any decomposition level, which is forbidden in a conventional DWT [28].
In more detail, a conventional DWT passes only the previous approximation subband to the next decomposition step [29,30]. The DWPT, in contrast, passes both the detail and approximation subbands to the next decomposition; hence, it generates a full binary tree (see Figure 3).
From another point of view, DWPT features are based on both detail and approximation subbands at various levels, so the DWPT yields more information than the conventional DWT does. It is necessary to reconstruct the DWPT coefficients in the image domain, using zero-padding for each decomposition subband, since the subsequent entropy-based feature extraction operates in the image domain and not in the wavelet coefficient domain. The 2D-DWPT is implemented by applying the 1D-DWPT along the x- and y-axes, respectively. The 3D-DWPT is carried out in a similar way [31].

Shannon Entropy
In the past, entropy was used by statisticians to measure the randomness of systems. It was later generalized to measure the uncertainty of the information content of a given system, under the name Shannon entropy (SE) [32]:

$$S = -\sum_{n=1}^{G} h_n \log_2 (h_n)$$

where $S$ represents the Shannon entropy, $n$ the grey level of any subband, $h_n$ the probability of the $n$-th grey level, and $G$ the total number of grey levels [33].
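As a small worked example of this definition, the sketch below estimates the probability of each grey level by its relative frequency and evaluates the entropy in bits. The function name `shannon_entropy` and the two toy inputs are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(grey_levels):
    """Shannon entropy (in bits) of a flat sequence of grey levels."""
    counts = Counter(grey_levels)
    total = len(grey_levels)
    # h_n is estimated as the relative frequency of the n-th grey level.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = [0, 1, 2, 3] * 4    # four equally likely grey levels
constant = [7] * 16           # a single grey level, no uncertainty

print(shannon_entropy(uniform))
print(shannon_entropy(constant) == 0)
```

Four equally likely grey levels give log2(4) = 2 bits, whereas a constant subband carries no uncertainty and has zero entropy.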
In this study, the entropies of both approximation and detail components of a 2-level DWPT were computed and termed wavelet packet entropy (WPE). The pseudocode for calculating WPE is listed in Table 3. In the initial phase, each pixel was regarded as a feature, since pixels do provide some information. Hence, an image of 256 × 256 is considered to contain 65,536 features. After feature extraction, the 65,536 features are reduced to only 16 WPE features (see Table 3).

Table 3. Pseudocode of WPE.

Step A  Input image. Read the 2D image.
Step B  1D-DWPT. Pass the image through low-pass and high-pass filters and perform downsampling along the x-axis and y-axis in sequence. Obtain four subbands.
Step C  2D-DWPT. Apply the 1D-DWPT again to each of the four subbands obtained in Step B, and finally obtain 16 subbands.
Step D  WPE. Extract the Shannon entropy from the 16 subbands obtained by the 2D-DWPT, and output the final feature vector of 16 elements.
The literature shows that WPE is able to capture wavelet-based features (including shape and texture) in an efficient way and has many successful applications in various fields [34–36].
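Steps A to D can be sketched in plain Python as follows. This is an illustration under several simplifying assumptions that differ from the study: it uses the Haar (db1) filter pair instead of bior4.4 so that no wavelet library is needed, it computes the entropy on quantized subband coefficients rather than on subbands reconstructed in the image domain, and it expects image sides divisible by four. All function names are hypothetical.

```python
import math
from collections import Counter


def haar_split(rows):
    """One 2D Haar analysis step: return four half-size subbands."""
    s = math.sqrt(2.0)
    # Low-pass / high-pass filtering with downsampling along the x-axis.
    lo = [[(r[i] + r[i + 1]) / s for i in range(0, len(r), 2)] for r in rows]
    hi = [[(r[i] - r[i + 1]) / s for i in range(0, len(r), 2)] for r in rows]

    def along_y(band):
        # The same filtering and downsampling along the y-axis.
        cols = range(len(band[0]))
        low = [[(band[j][k] + band[j + 1][k]) / s for k in cols]
               for j in range(0, len(band), 2)]
        high = [[(band[j][k] - band[j + 1][k]) / s for k in cols]
                for j in range(0, len(band), 2)]
        return low, high

    return [*along_y(lo), *along_y(hi)]


def subband_entropy(band, n_levels=256):
    """Shannon entropy of a subband after quantizing it to n_levels."""
    flat = [v for row in band for v in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:            # constant subband: no uncertainty
        return 0.0
    scale = (n_levels - 1) / (highest - lowest)
    counts = Counter(int((v - lowest) * scale) for v in flat)
    total = len(flat)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def wavelet_packet_entropy(image):
    level1 = haar_split(image)                                   # Step B: 4 subbands
    level2 = [sb for band in level1 for sb in haar_split(band)]  # Step C: 16 subbands
    return [subband_entropy(sb) for sb in level2]                # Step D: 16 features


# Step A: a hypothetical 8 x 8 single-channel "image".
toy_image = [[(x * y) % 7 for x in range(8)] for y in range(8)]
wpe = wavelet_packet_entropy(toy_image)
print(len(wpe))
```

Note that every level-1 subband, including the three detail subbands, is split again in Step C; a conventional DWT would split only the approximation subband and would yield 7 subbands instead of 16.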

Principal Component Analysis
In total, 80 features (64 color and 16 WPE features) are extracted from a given tea image. Using all 80 features increases the computational burden and memory requirements, and may degrade the performance of the classifier. A valid strategy is to decrease the number of features by feature-reduction techniques [37].
Principal component analysis (PCA) is an effective tool that is commonly used to reduce a set of interrelated variables while retaining the most significant principal components (PCs). PCA is implemented by transforming the sample vectors into a new set of variables sorted by decreasing variance.
PCA has three benefits [38]: (i) It orthogonalizes the variables of the input vectors, so that they are decorrelated from each other. (ii) It sorts the resulting orthogonal variables, so that the principal components with the largest variation come first and those with the smallest variation come last. (iii) It removes from the dataset the variables that contribute the least variation.
Note that the input dataset should be normalized to zero mean and unit variance before principal component analysis is applied. Users can call the "pca" command on the Matlab platform to run a canonical PCA procedure.
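The criterion used in this study for choosing the number of PCs (retain the fewest components whose accumulated explained variance exceeds 99.9%) can be sketched as below. The function name `n_components_for` and the eigenvalues are hypothetical; in practice the eigenvalues would come from the covariance matrix of the normalized feature matrix.

```python
def n_components_for(eigenvalues, threshold=0.999):
    """Smallest number of PCs whose cumulative explained variance
    reaches `threshold` (a fraction of the total variance)."""
    ordered = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues (variances along each PC), not the study's data.
toy_eigenvalues = [50.0, 30.0, 15.0, 4.0, 0.95, 0.05]
print(n_components_for(toy_eigenvalues))          # 99.9% criterion
print(n_components_for(toy_eigenvalues, 0.90))    # a looser criterion
```

With these toy values the cumulative proportions are 50%, 80%, 95%, 99%, 99.95%, and 100%, so the 99.9% criterion keeps five components while a 90% criterion would keep three.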

Classification
Support vector machine (SVM) is one of the most popular learning models for analyzing data and recognizing patterns in supervised classification [39]. However, a standard SVM is sensitive to outliers and noise, i.e., its performance decreases sharply when the dataset contains outliers or is contaminated by noise. Fuzzy SVM (FSVM) is an effective variant that reduces the effect of outliers and noise. There are some other advanced variants of SVM, such as generalized eigenvalue proximal SVM [40], twin SVM, least-square SVM

Support Vector Machine
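Suppose there are $N$ training samples, each a $z$-dimensional vector, and the goal is to construct a $(z-1)$-dimensional hyperplane. Assume the dataset takes the form [41]

$$\{(\mathbf{p}_n, y_n) \mid \mathbf{p}_n \in \mathbb{R}^{z},\ y_n \in \{-1, +1\}\}, \quad n = 1, \ldots, N$$

where $\mathbf{p}_n$ denotes a training point that is a $z$-dimensional vector, and $y_n$ is the true class of $\mathbf{p}_n$, taking the value +1 or −1, which corresponds to class 2 or class 1, respectively [42]. The desired SVM is the maximum-margin hyperplane that separates the two classes. Any hyperplane can be written in the form

$$\mathbf{w}^{T}\mathbf{p} - b = 0$$

where $\mathbf{w}$ represents the weights and $b$ the bias. We need to select the values of $\mathbf{w}$ and $b$ that maximize the distance between the two parallel hyperplanes while still separating the data of the two classes.

A positive slack vector $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_n, \ldots, \xi_N)$ is added, where $\xi_n$ measures the degree of misclassification of sample $\mathbf{p}_n$. Hence, the optimal SVM is obtained by solving

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + L\,\mathbf{e}^{T}\boldsymbol{\xi} \quad \text{s.t.} \quad y_n(\mathbf{w}^{T}\mathbf{p}_n - b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1, \ldots, N$$

where $L$ represents the error penalty and $\mathbf{e}$ an $N$-dimensional vector of ones. The optimization is therefore a trade-off between a small error penalty and a large margin. The constrained optimization problem is solved using Lagrange multipliers:

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b}\ \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + L\,\mathbf{e}^{T}\boldsymbol{\xi} - \sum_{n=1}^{N} \alpha_n \left[ y_n(\mathbf{w}^{T}\mathbf{p}_n - b) - 1 + \xi_n \right] - \sum_{n=1}^{N} \beta_n \xi_n \right\}, \quad \alpha_n, \beta_n \ge 0$$

The min-max problem is not easy to solve, so the dual form is commonly used instead:

$$\max_{\boldsymbol{\alpha}}\ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \mathbf{p}_n^{T}\mathbf{p}_m \quad \text{s.t.} \quad 0 \le \alpha_n \le L,\ \ \sum_{n=1}^{N} \alpha_n y_n = 0$$

The main merit of the dual form is that the slack variables $\xi_n$ disappear, with the constant $L$ appearing only as an additional constraint on the Lagrange multipliers.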
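The step from the Lagrangian to the dual can be made explicit with the stationarity conditions. The derivation below is a sketch under stated assumptions: the hyperplane is written as $\mathbf{w}^{T}\mathbf{p} - b = 0$, the kernel is linear, and the multiplier symbols $\alpha_n$ and $\beta_n$ as well as the symbol $\mathcal{J}$ for the Lagrangian are notation chosen for this illustration.

```latex
% Assumed notation: alpha_n, beta_n >= 0 are Lagrange multipliers,
% L is the error penalty, and the hyperplane is w^T p - b = 0.
\begin{align}
\mathcal{J} &= \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + L \sum_{n=1}^{N} \xi_n
  - \sum_{n=1}^{N} \alpha_n \left[ y_n(\mathbf{w}^{T}\mathbf{p}_n - b) - 1 + \xi_n \right]
  - \sum_{n=1}^{N} \beta_n \xi_n \\
\frac{\partial \mathcal{J}}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{p}_n \\
\frac{\partial \mathcal{J}}{\partial b} = 0
  &\;\Rightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0 \\
\frac{\partial \mathcal{J}}{\partial \xi_n} = 0
  &\;\Rightarrow\; L - \alpha_n - \beta_n = 0
  \;\Rightarrow\; 0 \le \alpha_n \le L
\end{align}
```

Substituting these conditions back into $\mathcal{J}$ cancels every term in $\xi_n$, because each is multiplied by $L - \alpha_n - \beta_n = 0$. This is why the slack variables vanish from the dual and $L$ survives only as an upper bound on $\alpha_n$.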

Fuzzy SVM
Fuzzy support vector machine (FSVM) is more effective than plain SVM models, especially in predicting or classifying real-world data, because some training samples are more important than others. It makes sense to require that the meaningful training samples be recognized correctly, while meaningless points such as noise or outliers are neglected [43].
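FSVM applies a fuzzy membership function (FMF) $s$ to every training point [44], so that the training samples become fuzzy training samples, which can be expressed as

$$\{(\mathbf{p}_n, y_n, s_n)\}, \quad n = 1, \ldots, N$$

where $s_n$ denotes the attitude of the corresponding training point towards one class and $(1 - s_n)$ the attitude of meaninglessness. The optimal hyperplane problem of FSVM is defined as

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + L\,\mathbf{s}^{T}\boldsymbol{\xi} \quad \text{s.t.} \quad y_n(\mathbf{w}^{T}\mathbf{p}_n - b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1, \ldots, N$$

where $\mathbf{s} = (s_1, s_2, \ldots, s_N)$ represents the membership vector of the FMF. A smaller $s_n$ decreases the influence of the slack variable $\xi_n$, so that the corresponding sample $\mathbf{p}_n$ is regarded as less important. In a similar way, the Lagrangian is constructed as

$$\min_{\mathbf{w}, \boldsymbol{\xi}, b}\ \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + L\,\mathbf{s}^{T}\boldsymbol{\xi} - \sum_{n=1}^{N} \alpha_n \left[ y_n(\mathbf{w}^{T}\mathbf{p}_n - b) - 1 + \xi_n \right] - \sum_{n=1}^{N} \beta_n \xi_n \right\}, \quad \alpha_n, \beta_n \ge 0$$

Again, the dual form is used to transform this min-max problem into

$$\max_{\boldsymbol{\alpha}}\ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \mathbf{p}_n^{T}\mathbf{p}_m \quad \text{s.t.} \quad 0 \le \alpha_n \le s_n L,\ \ \sum_{n=1}^{N} \alpha_n y_n = 0$$

Therefore, it is clear that the task becomes merely a function of the support vectors, which are the subset of the training data lying on the margins.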

Fuzzy Membership Function
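We define the fuzzy membership function (FMF) in terms of the distance between a point and its class center. Denote the means of class +1 and class −1 by $\mathbf{p}_{+}$ and $\mathbf{p}_{-}$, respectively. Then, the radii of class +1 and class −1 are

$$r_{+} = \max_{\{\mathbf{p}_n :\, y_n = +1\}} \lVert \mathbf{p}_{+} - \mathbf{p}_n \rVert, \qquad r_{-} = \max_{\{\mathbf{p}_n :\, y_n = -1\}} \lVert \mathbf{p}_{-} - \mathbf{p}_n \rVert$$

where $r_{+}$ and $r_{-}$ represent the radii of class +1 and class −1, respectively. The fuzzy membership $s_n$ is defined as a function of the radius and the mean of each class [43]:

$$s_n = \begin{cases} 1 - \lVert \mathbf{p}_{+} - \mathbf{p}_n \rVert / (r_{+} + \delta), & y_n = +1 \\ 1 - \lVert \mathbf{p}_{-} - \mathbf{p}_n \rVert / (r_{-} + \delta), & y_n = -1 \end{cases}$$

where $\delta > 0$ is used to guarantee $s_n > 0$.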
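A distance-to-class-center membership of this kind can be sketched as follows, assuming the commonly used form in which the membership is one minus the distance to the class mean divided by the class radius plus δ. The function name `fuzzy_memberships`, the value of `delta`, and the toy points are illustrative assumptions.

```python
import math


def fuzzy_memberships(points, labels, delta=1e-3):
    """Membership s_n = 1 - ||center(y_n) - p_n|| / (radius(y_n) + delta)."""
    centers, radii = {}, {}
    for cls in (+1, -1):
        group = [p for p, y in zip(points, labels) if y == cls]
        # Class center: coordinate-wise mean of the class members.
        centers[cls] = [sum(coord) / len(group) for coord in zip(*group)]
        # Class radius: largest distance from the center to a member.
        radii[cls] = max(math.dist(p, centers[cls]) for p in group)
    return [1.0 - math.dist(p, centers[y]) / (radii[y] + delta)
            for p, y in zip(points, labels)]


# Hypothetical 2D data: class +1 contains an outlier at (9, 0).
points = [(0, 0), (1, 0), (2, 0), (9, 0), (5, 5), (5, 6), (5, 7)]
labels = [+1, +1, +1, +1, -1, -1, -1]
s = fuzzy_memberships(points, labels)
print([round(v, 3) for v in s])
```

The outlier at (9, 0) lies on the boundary of its class and receives a membership close to zero, so its slack is barely penalized during training, while δ keeps every membership strictly positive.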

Multiclass Technique
Support vector machines (SVMs) and their variants were originally developed for two-class problems. However, we needed to predict three classes: green, oolong, and black tea. Several methods have already been proposed for multi-class problems via SVMs, among which the most popular approach is to break down the multiclass task into multiple two-class tasks. We chose the Winner-Takes-All (WTA) method.
Suppose V (>2) classes exist in the task. The WTA strategy classifies new instances based on the idea of one-versus-all. At first, we train V individual binary classifiers (SVM or its variants). The n-th individual classifier aims to distinguish the data in class n from the data of all the remaining classes (1, 2, …, n − 1, n + 1, …, V). A new test sample is sent to all V individual classifiers, and the class of the individual classifier that outputs the largest value is chosen.
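The decision rule can be sketched as below. The three linear one-versus-all classifiers are hypothetical stand-ins with made-up weights (a trained SVM or FSVM would supply the decision values in practice), and the score uses the w·p − b form assumed for the hyperplane.

```python
def linear_score(weights, bias, sample):
    """Decision value w . p - b of one binary (one-versus-all) classifier."""
    return sum(w * x for w, x in zip(weights, sample)) - bias


def winner_takes_all(classifiers, sample):
    """Return the class whose one-versus-all classifier scores highest."""
    scores = {name: linear_score(w, b, sample)
              for name, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)


# Hypothetical classifiers over two features: class name -> (weights, bias).
classifiers = {
    "green": ([1.0, 0.0], 0.5),
    "oolong": ([0.0, 1.0], 0.5),
    "black": ([-1.0, -1.0], -0.5),
}

predictions = [winner_takes_all(classifiers, p)
               for p in [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1)]]
print(predictions)
```

Using the raw decision values rather than the binary outputs matters: several classifiers may claim (or reject) the same sample, and the largest value resolves such ties in favor of the most confident classifier.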

Statistical Setting
Using the whole dataset for both training and validation yields an optimistically biased assessment, which is dubbed the "in-sample estimate". Therefore, the whole dataset is usually divided into two sets, a training set and a test (or validation) set, and the performance on the test set is reported, which is dubbed the "out-of-sample estimate". However, for small datasets, a single out-of-sample estimate increases the variance of the estimated classification performance [45].
In this study, we used a more advanced technique to calculate the out-of-sample performance of the proposed identification system: the K-fold stratified cross validation (SCV) technique. The original samples were randomly partitioned into K mutually exclusive subsets of nearly equal size. Then, K − 1 subsets were used for training and the remaining one for validation.
The abovementioned procedure was repeated for K runs, such that each subset was used once and only once for validation. The K validation results from the K runs were then merged to generate an out-of-sample estimate over the whole dataset. We set K to 10, following common convention.
The 10-fold SCV was repeated 10 times to further reduce the variance of the estimate. Stratification was used so that each subset contained roughly the same proportions of the different tea classes. As shown in Figure 4, the proposed system has two parts: offline learning for training the classifier, and online prediction for predicting the category of query tea images.
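The fold construction for a 10 × 10-fold SCV can be sketched as follows. The function name `stratified_folds`, the seeds, and the assumption of 100 images per class are illustrative; the text only states that 300 samples from three categories were used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                      # random split within a class
        for position, index in enumerate(indices):
            folds[position % k].append(index)     # deal indices round-robin
    return folds


# Assumed class sizes: 100 images per tea category.
labels = ["green"] * 100 + ["oolong"] * 100 + ["black"] * 100

# Ten repetitions of 10-fold SCV, each with a different random split.
runs = [stratified_folds(labels, k=10, seed=run) for run in range(10)]

folds = runs[0]
validation = folds[0]                                  # one fold for validation
training = [i for fold in folds[1:] for i in fold]     # the other nine for training
print(len(training), len(validation))
```

Within one run, each fold serves as the validation set exactly once, and the ten validation results together cover all 300 samples.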

Results and Discussions
The experiments were carried out on an IBM machine with a 3 GHz Core i3 processor and 8 GB of random access memory (RAM), running the Windows 7 operating system. The algorithms were developed in-house in Matlab 2015a (The Mathworks ©, Natick, Massachusetts, USA).

Feature Extraction
The second row of Table 4 shows a sample of each tea category. Their colors and textures are indeed visibly distinct from each other. Next, we extracted their color information by CH. The third row of Table 4 shows the corresponding histograms, which clearly indicate that the CH distributions of the three tea categories differ; hence, the CH is an effective feature. The fourth and fifth rows of Table 4 compare the results of DWT decomposition with those of DWPT decomposition. Each image has three channels (RGB), so we performed the decomposition for each channel and combined the channels to output the final decomposition result. It is clearly observed that the DWPT decomposes the detail components that the DWT leaves undecomposed. Therefore, the DWPT gives more multi-resolution information than the DWT. Entropy was extracted from the 16 subbands of the DWPT.

Table 4. Feature extraction of three categories of tea.

PCA Result
The 80 features extracted from each tea image were arranged as a row vector, and the feature vectors of all images were stacked row-wise to generate a 2D matrix, on which PCA was performed. Figure 5 shows the curve of variance explained against the number of PCs, and the detailed results are listed in Table 5. One PC preserves 93.91% of the total variance, two PCs preserve 99.08%, three PCs preserve 99.49%, and four PCs preserve 99.78%. Finally, five PCs retain more than 99.90% of the total variance, which meets our criterion.

Figure 5. Curve of accumulated variance explained against number of principal components (PCs).

Classification Performance Comparison
We compared the two proposed classifiers (SVM + WTA and FSVM + WTA) with state-of-the-art methods (BP-ANN [10], SVM [12], LDA [17], GNN [18], and FSCABC-FNN [21]) by averaging the results of a 10 × 10-fold SCV. The classification-performance comparison is listed in Table 6. Please refer to the Abbreviations section for the full names of the abbreviations. The studies in [10] and [12] differentiated the same categories as this study: green tea, oolong tea, and black tea. Therefore, we directly listed their results on the test set, which are out-of-sample evaluations like the K-fold SCV used in this study.
Table 6 shows that the two proposed methods ("SVM + WTA" and "FSVM + WTA") achieve good recall rates. The "SVM + WTA" method obtained 95.7%, 98.1%, and 97.9% on green, oolong, and black tea, respectively. The "FSVM + WTA" method obtained 96.2%, 98.8%, and 98.3% on the same categories. The overall recall rate of the former method was 97.23%, while the latter obtained an overall recall rate of 97.77%. From these results, we can deduce that FSVM was more effective than SVM: it increased the overall recall rate slightly (by about 0.54 percentage points). This finding aligns with past publications stating that FSVM is better than SVM [46,47]. In summary, the proposed "FSVM + WTA" yields a 97.77% overall recall rate averaged over 10 runs, which exceeds not only the proposed "SVM + WTA" method with 97.23%, but also the state-of-the-art methods, including BP-ANN [10] with 95%, SVM [12] with 95%, LDA [17] with 95.8%, GNN [18] with 96.0%, and FSCABC-FNN [21] with 97.4%. The reasons are two-fold: (i) Compared to traditional shape and texture features, the WPE has a better ability to analyze transient features of non-stationary signals, which are abundant in tea leaves. (ii) The FSVM is a rather novel variant of SVM that reduces the effect of outliers and noise, hence increasing the overall recall rate.
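The overall figures can be cross-checked from the per-class recall rates. The sketch below assumes that the overall recall rate is the unweighted mean of the three class recalls, which is what equal class sizes would give; the text does not state this explicitly, but the assumption reproduces the reported values.

```python
# Per-class recall rates (green, oolong, black) as reported in the text.
recalls = {
    "SVM + WTA": [95.7, 98.1, 97.9],
    "FSVM + WTA": [96.2, 98.8, 98.3],
}

# Assumption: overall recall = unweighted mean of the class recalls.
overall = {name: round(sum(values) / len(values), 2)
           for name, values in recalls.items()}

# Gain of FSVM over SVM in percentage points, from the rounded figures.
gain = round(overall["FSVM + WTA"] - overall["SVM + WTA"], 2)

print(overall)
print(gain)
```

The means come out at 97.23 and 97.77, and the difference between the two rounded figures is 0.54 percentage points, matching the values quoted above.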
From another point of view, fewer features require less computation time. The final number of reduced features in the proposed method was only five, less than or equal to that of BP-ANN [10] with eight, SVM [12] with five, LDA [17] with 11, GNN [18] with eight, and FSCABC-FNN [21] with 14. The excellent recall rate demonstrates the effectiveness of the five features used in this study.

Optimal Wavelet
To find which wavelet performs best for this problem, we tested six different wavelets: db1, db2, db3, bior2.2, bior3.3, and bior4.4. The classification method was "FSVM + WTA". The decomposition level was set to 2. A 10 × 10-fold SCV was employed. The average overall recall rates of the different wavelets are listed in Table 7. Table 7 shows that the bior4.4 wavelet achieves the highest average overall recall rate among all wavelets. This explains why we chose the bior4.4 wavelet in our research work. In addition, the db1 wavelet performs the worst for tea-category identification. To find the reason, we show the scaling function, wavelet function, low-pass filter (LPF), and high-pass filter (HPF) of both the decomposition and reconstruction functions of bior4.4 in Figure 6 and Figure 7, respectively.
We find that the scaling functions and wavelet functions of bior4.4 are more similar to the gray-level changes along the textures and edges in tea images. This may be the reason why bior4.4 is better than the other wavelets. This result aligns with past publications: Manthalkar et al. [48] stated "Bior4.4 gives the best classification performance", and Prabhakar and Reddy [49] also found that bior4.4 performed the best among all wavelets.

Further Discussion of Methods
Here, we revisit the proposed methods. Firstly, there were three reasons why we used CH: (i) Color information is an effective tool for recognizing objects, which is the case in this study; (ii) The CH is related to the physical properties of the teas; (iii) The CH not only reflects illumination conditions and tea color, but also captures image geometry and surface roughness.
Secondly, we had two reasons for choosing WPE: (i) The wavelet packet transform yields more information related to high-frequency subbands than the wavelet transform does, which is essential in feature extraction; (ii) Entropy is effective in describing the uncertainty and complexity of 1D signals and 2D images. Although the DWPT takes more time than the DWT, it is worthwhile considering its excellent performance.
Thirdly, FSVM is employed since the influence of training samples varies. Usually, one part of the training samples is more important than the rest. Hence, it is natural to expect the important training points to be recognized correctly, while other training samples (such as noise) are given less weight by the classifier. Generally, using FSVM, each sample no longer belongs to only one class. This is the fundamental concept of FSVM.

Conclusion
The goal of this work was to develop an automatic tea-category classification system with a high sensitivity/recall rate. The contributions of this study are the following: (i) We proposed using WPE as a novel feature; (ii) We introduced FSVM, which reduces the effect of outliers and noise compared to conventional SVM; (iii) We used a WTA technique to cope with the three-class problem; (iv) The proposed method achieved a higher identification rate than five state-of-the-art methods while using the fewest features.

Figure 1. Computer vision based system to obtain the tea image database.

Figure 3. Diagram of two-level one-dimensional wavelet packet transform. Here, a and b represent the low-pass and high-pass filters, respectively. L and H represent the low-frequency and high-frequency subbands, respectively.

Figure 4. Diagram of the proposed automatic tea classification system.

Table 2. Characteristics of tea samples.

Table 5. Detailed data of accumulated explained variance.

Table 7. Average overall recall rate of different wavelets.