
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (

Principal Component Analysis (PCA) is one of the main methods used for electronic nose pattern recognition. However, regular PCA often gives poor classification and recognition performance. This paper aims to improve the classification performance of regular PCA based on the existing Wilks Λ-statistic (

Classification and recognition have been widely used in various fields [

When PCA is used, the first and second principal components (PCs) are usually chosen according to the cumulative sensor contributions. However, the first two PCs often do not produce the best recognition effect. For this purpose, the Wilks distribution [

In classifying and recognising hybrid and inbred rough rice varieties, we also found that the recognition effect of PCA did not reach the ideal state. This paper analyses the problems of the existing method that combines PCA with the Wilks distribution, determines an improved method, applies it to classify and recognise rough rice varieties, and uses the Mahalanobis Distance (MD) and Probabilistic Neural Networks (PNN) to verify the method. This paper thereby also proposes a new method for rough rice classification and recognition.

The six rough rice varieties selected in this experiment were planted on the farm (Yuejinbei) of South China Agricultural University. They included three inbred rough rice varieties (Zhongxiang 1, Xiangwan 13, Yaopingxiang) and three hybrid rough rice varieties (Wufengyou T025, Pin 36, Youyou 122). These varieties were grown under the same crop rotation, and their harvest times differed by no more than 30 days. After harvest, the rice was dried naturally by sunning on cement ground to keep the water content between 12% and 14%. The characteristic appearance of the six types of rough rice is shown in

A portable electronic nose (PEN3, Airsense Analytics GmbH) is used in this experiment. This electronic nose is mainly composed of a sensor array, sampling and cleaning channel, data processing system, _{i} is the ratio of the resistance value G (when sensors contact to sample volatiles) and the resistance value G_{0} (when sensors contact to zero gas).

The zero gas used by the PEN3 is field air filtered by an activated carbon filter. An internal flow regulator guarantees stable sampling under poor experimental conditions. The detection principle is as follows: when volatile compounds contact the active material of a sensor, a transient response is created (a series of physical and chemical changes occurs). This response is converted from a voltage signal into a digital signal via an interface circuit, recorded by a computer and sent to a signal processing unit for analysis. Afterwards, the signal is compared with a large body of volatile compound information in a database in order to identify the type of volatiles [

There were 20 samples of each rough rice variety (6 varieties of rough rice × 20 = 120 samples in total). Each sample weighed 10 g, measured using an electronic scale, and was collected in a 200-mL beaker, then sealed with plastic wrap. Before sampling, every sample was kept at room temperature environment (27 °C) for 1 h. Beakers were washed using an ultrasonic cleaning instrument and cooled in the shade, and no peculiar smell was detected. Preheating for 10 min before the measurement was performed to ensure that the sensors reach their working temperature. Zero gas was used to flush the induction trunk of the electronic nose before sampling. The working parameter settings are as follows: sampling interval is 1 s; flush time is 60 s; zero point trim time is 10 s; measurement time is 80 s; presampling time is 5 s; and injection flow is 190 mL/min.

_{ave}) as the feature value of response curve of the sensor. The test results constitute a 120 (120 samples in total) × 10 (10 sensors) matrix. The D_{ave} is defined as follows:
_{z}_{z}_{+}_{1}

PCA is a multivariate technique that analyses a data table in which the observations are described by several inter-correlated quantitative dependent variables. The goal of PCA is to extract the important information from the table, to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity of the observations and of the variables as points in maps [_{g} is the sample number of the gth class (N_{1} = N_{2} = N_{3} = N_{4} = N_{5} = N_{6} = 20), _{ig}_{i} is the average corresponding to the ith PC of the total classes.

According to the idea of the Wilks distribution, the lower the value of |D| and the higher the value of |A|, the more significant the difference between classes is and the easier it is to classify these classes. Λ is the Wilks Λ-statistic, defined as:

Λ = |D| / |A|
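As a hedged illustration of how the ratio of |D| to |A| can rank candidate pairs of PCs, the following Python sketch computes the standard Wilks Λ from the within-class and total deviation matrices. The function names `wilks_lambda` and `best_pc_pair` are hypothetical, and the sketch implements only the conventional statistic, not the improved algorithm proposed in this paper.

```python
from itertools import combinations

import numpy as np


def wilks_lambda(scores, labels):
    """Return |D| / |A| for the given PC scores (rows = samples, columns = PCs)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)

    # A: sums of squares and cross products of deviations from the overall mean.
    total_dev = scores - scores.mean(axis=0)
    A = total_dev.T @ total_dev

    # D: the same quantity, but with deviations taken from each class mean.
    D = np.zeros_like(A)
    for g in np.unique(labels):
        class_scores = scores[labels == g]
        class_dev = class_scores - class_scores.mean(axis=0)
        D += class_dev.T @ class_dev

    return np.linalg.det(D) / np.linalg.det(A)


def best_pc_pair(all_scores, labels):
    """Evaluate every pair of PCs and return the pair with the smallest ratio."""
    all_scores = np.asarray(all_scores, dtype=float)
    ratios = {
        (i, j): wilks_lambda(all_scores[:, [i, j]], labels)
        for i, j in combinations(range(all_scores.shape[1]), 2)
    }
    return min(ratios, key=ratios.get), ratios
```

The indices returned by `best_pc_pair` are zero-based, so a result of `(0, 4)` would correspond to PC1 and PC5. A smaller ratio indicates that the within-class scatter is small relative to the total scatter for that pair.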

In actual operation, the deviation values _{igk}-u_{ig}_{jgk}-u_{jg}_{igk}-u_{i}_{jgk}-u_{j}_{igk}_{ig}_{jgk}_{jg}_{igk}_{i}_{jgk}_{j}_{igk}_{ig}_{jgk}_{jg}_{igk}_{i}_{jgk}_{j}

For this situation, this paper proposes that after getting the products of the deviation values of (_{igk}_{ig}_{jgk}_{jg}_{igk}_{i}_{jgk}_{j}

A Probabilistic Neural Network (PNN) is a type of neural network with a simple construction and wide application that was developed by Specht in 1989 [_{ij}. The transfer function of the second layer is g(z_{i}) = exp((z_{i} − 1)/σ^{2}), where z_{i} is the input value of the i-th neuron, and σ is the average variance. The third layer is the summation layer, which performs linear summation. The number of neurons in the third layer is equal to the number of pattern classes to be assigned. The last layer is the output layer, which has a judgment function. The outputs of this layer are the discrete values 1 and −1 (or 0), which represent the respective classes of the input pattern [
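The layered structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class name `SimplePNN` is hypothetical, the pattern layer uses the common Gaussian kernel on Euclidean distance rather than the exact transfer function quoted above, and the `spread` parameter is not claimed to be parametrised in the same way as in any particular toolbox.

```python
import numpy as np


class SimplePNN:
    """Minimal Gaussian-kernel PNN: pattern layer, summation layer, decision layer."""

    def __init__(self, spread):
        self.spread = float(spread)

    def fit(self, X, y):
        # A PNN stores the training patterns; no iterative training is required.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        self.classes_ = np.unique(self.y_)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pattern layer: one Gaussian unit per stored training sample.
        sq_dist = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        activations = np.exp(-sq_dist / (2.0 * self.spread ** 2))
        # Summation layer: add the activations belonging to each class.
        class_sums = np.column_stack(
            [activations[:, self.y_ == c].sum(axis=1) for c in self.classes_]
        )
        # Decision layer: pick the class with the largest summed activation.
        return self.classes_[np.argmax(class_sums, axis=1)]
```

With a very small spread relative to the distances between samples, the exponentials in this simple version can underflow to zero, so a production implementation would work with log-activations instead.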

According to regular PCA, the eigenvalues of the PCs, from largest to smallest, were as follows: λ_{1} = 9.460 × 10^{−5}, λ_{2} = 2.9085 × 10^{−7}, λ_{3} = 8.0030 × 10^{−8}, λ_{4} = 3.8928 × 10^{−8}, λ_{5} = 1.9657 × 10^{−8}, λ_{6} = 3.7732 × 10^{−9}, λ_{7} = 8.2980 × 10^{−10}, λ_{8} = 4.9310 × 10^{−10}, λ_{9} = 1.8386 × 10^{−10}, and λ_{10} = 8.4961 × 10^{−11}. The eigenvectors corresponding to λ_{1} and λ_{2} were then chosen as PC1 and PC2 to perform the PCA. PC1 and PC2 together accounted for 99.85% of the variance. The classification result of PCA is shown in
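The cumulative variance figure can be checked directly from the eigenvalues listed above. The short Python sketch below does only that arithmetic; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues of the ten PCs as reported in the text, largest first.
eigenvalues = np.array([
    9.460e-5, 2.9085e-7, 8.0030e-8, 3.8928e-8, 1.9657e-8,
    3.7732e-9, 8.2980e-10, 4.9310e-10, 1.8386e-10, 8.4961e-11,
])

# Share of the total variance carried by each PC, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

pc1_pc2_share = cumulative[1]
print(f"PC1 + PC2: {100 * pc1_pc2_share:.2f}% of the variance")
```

The same array shows that PC5 on its own carries only about 0.02% of the variance, which illustrates that a PC with a small eigenvalue can still be useful for separating classes.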

The results of the dispersion ratio when any two eigenvectors comprise a lower-dimensional matrix are shown in

The Mahalanobis distance (MD) is a commonly used distance detection method that can compute data correlations [_{ave}_{ave}
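For readers who want to reproduce this kind of check, the following Python sketch computes an MD between the centre points of two classes in a two-PC score space. It is only a sketch: the function name `md_between_centres` is hypothetical, and the use of the pooled within-class covariance of the two classes is an assumption, since the exact covariance used in this study is not restated here.

```python
import numpy as np


def md_between_centres(class_a, class_b):
    """Mahalanobis distance between the mean points of two sample sets.

    Assumption: the covariance is the pooled within-class covariance
    of the two classes (rows = samples, columns = PC scores).
    """
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)

    # Pooled covariance, weighting each class by its degrees of freedom.
    pooled = (
        (len(a) - 1) * np.cov(a, rowvar=False)
        + (len(b) - 1) * np.cov(b, rowvar=False)
    ) / (len(a) + len(b) - 2)

    # Solve the linear system instead of inverting the covariance explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))
```

A larger value means the two class centres are further apart relative to the spread of the samples, which is the sense in which an enlarged MD indicates better separability.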

In this table, when PC1 and PC5 were chosen for analysis instead of PC1 and PC2, the MDs between the centre points of the following pairs of varieties were enlarged: Xiangwan 13 and Pin 36, Pin 36 and Youyou 122, Youyou 122 and Wufengyou T025, Youyou 122 and Xiangwan 13, and Zhongxiang 1 and Pin 36. These results are the same as the results of

where _{i}_{i}

To further prove the suitability of the Wilks Λ-statistic, we use PC1 and PC2 and PC1 and PC5 as the inputs of the PNN for the respective classification of the six rough rice varieties. We choose the first 15 samples of each variety as the training samples and the remaining five samples as the test set. There are 90 training samples and 30 test samples in total. There are two neurons in the input layer, 120 neurons in the hidden layer, six neurons in summation layer and six neurons in output layer.
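The split described above (the first 15 samples of each variety for training and the remaining five for testing) can be sketched as follows. The function name `split_first_n_per_class` and the assumption that samples are stored in measurement order are illustrative.

```python
import numpy as np


def split_first_n_per_class(X, y, n_train=15):
    """Use the first n_train samples of each class for training, the rest for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    train_mask = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        # Indices of this class, in the order in which the samples are stored.
        idx = np.flatnonzero(y == c)
        train_mask[idx[:n_train]] = True
    return X[train_mask], y[train_mask], X[~train_mask], y[~train_mask]
```

Applied to 6 varieties with 20 samples each, this yields 90 training samples and 30 test samples, matching the counts given in the text.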

This research used the “newpnn” function of Matlab to develop the PNN model. Both the choice of PCs and the spread value have a certain influence on the classification results. The spread value is the diffusion rate of the PNN model, which can be optimised for maximum classification accuracy; its default is 0.1 [^{−5}, 2 × 10^{−5}, 3 × 10^{−5}, 4 × 10^{−5}, 5 × 10^{−5}, 6 × 10^{−5}, 7 × 10^{−5}, 8 × 10^{−5}, 9 × 10^{−5}, 1 × 10^{−4}]. The optimal results are shown in

As can be seen from the figure, compared with PC1 and PC2, the classification accuracy of the test set based on PC1 and PC5 improved by 20% when spread = 1 × 10^{−5}, and by 13.33% when spread = 2 × 10^{−5} and 3 × 10^{−5}. We chose the best model as the one for which the classification accuracies of the training set and the test set were both highest at the same time. According to the figure, the PNN is best when spread = [4 × 10^{−5}, 5 × 10^{−5}, 6 × 10^{−5}, 7 × 10^{−5}, 8 × 10^{−5}, 9 × 10^{−5}, 1 × 10^{−4}], so we set the best PNN model by selecting spread = 4 × 10^{−5}. The best results were obtained when PC1 and PC5 were used, which increased the classification accuracy by 6.67% relative to PC1 and PC2, thereby proving the effectiveness of the Wilks Λ-statistic for improving the classification accuracy of regular PCA. The classification results are described in

The purpose of this study was to determine a better method than conventional PCA to improve the classification accuracy of various rough rice varieties. The data analyses and results mentioned above provide demonstrative evidence of the effectiveness of using a combination of PCA with the Wilks distribution method.

More than 126 chemical species have been reported in volatile compounds released by various rough rice varieties, including hexanal, enanthal, nonanal, pentanal, isobutyl aldehyde, methanol, ethanol, acetonum, 4-vinylphenol, 2-pentylfuran,

Conventional PCA is a mathematical dimensionality reduction method. In general, it replaces numerous original variables with a few aggregate variables that contain almost all the information in the original variables but are uncorrelated with each other. This is inconsistent with the use of the overlap effect of the electronic nose. Especially when the difference between odor samples is small, the overlap effect among the sensors in the array is stronger than usual, and it is often more difficult for conventional PCA to classify samples effectively.

The improved algorithm presented in this study, which combines PCA and the Wilks Distribution (Wilks Λ-statistic), selects PCs by estimating the smallest ratio of D (deviations within classes) to A (deviations for all classes) for further PCA classification. It could maximize the correlation between eigenvectors, but minimize the correlation between eigenvalues. That is why the improved algorithms in this study can obtain better classification accuracy.

The capabilities established in this study demonstrate the tentative feasibility of using electronic noses for the classification of various rough rice varieties. However, there are still a number of potential problems associated with this application. Firstly, because gas sensors are sensitive to humidity and temperature, variability of humidity and temperature in the test environment can greatly affect the outputs of electronic noses. To solve this problem, humidity and temperature compensation algorithms should be included following additional research. Secondly, the possibility of several rough rice varieties being mixed together may further complicate classification. In addition, there is a need to reduce the number of sensors in the sensor array in order to reduce the cost of electronic noses. Proper sensor design choices that make sensor sensitivity specific to the classification of rough rice varieties should help to further optimise the number of sensors used in a sensor array. Moreover, software improvements will likely resolve some of these problems.

PCA is one of the main methods for electronic nose pattern recognition. This paper aimed to understand why the classification effect of the first two PCs used for PCA is poor. To study this effect, a method that combines PCA with the Wilks distribution (Wilks Λ-statistic) was used to improve the classification accuracy of regular PCA. First, the functionality and defects of the Wilks Λ-statistic were analysed, which led to the development of improved algorithms.

Subsequently, the Wilks Λ-statistic was used for the classification of six rough rice varieties and then compared with the regular PCA classification result. The results indicated that there are three rough rice varieties that cannot be classified using the regular PCA and two rough rice varieties that cannot be classified using the Wilks Λ-statistic. A preliminary judgement was made that the classification effect of the Wilks Λ-statistic is better than that of the PCA. Next, the MDs of each of the centre points of two sample data points (_{ave}_{ave}

Finally, PC1 and PC2, and PC1 and PC5, were used as the inputs of a PNN for the respective classification of the six rough rice varieties. Compared with PC1 and PC2, the classification accuracy of the test set based on PC1 and PC5 improved by 20% when spread = 1 × 10^{−5}, and by 13.33% when spread = 2 × 10^{−5} and 3 × 10^{−5}. The PNN is best when spread = [4 × 10^{−5}, 5 × 10^{−5}, 6 × 10^{−5}, 7 × 10^{−5}, 8 × 10^{−5}, 9 × 10^{−5}, 1 × 10^{−4}]. We set the best PNN model in this research by selecting spread = 4 × 10^{−5}. The best results indicated that the use of PC1 and PC5 increased the classification accuracy by 6.67% compared with the use of PC1 and PC2, thereby proving the effectiveness of the Wilks Λ-statistic for improving the classification accuracy of regular PCA. In addition, this research provides a novel, non-destructive and rapid method for rough rice classification with an electronic nose, which has a certain guiding significance.

The authors thank the National Natural Science Foundation of China (Project No.: 31371539) and the Natural Science Foundation of Guangdong Province of China (Project No.: S2012040007613) for funding this research. The authors also thank the anonymous reviewers for their critical comments and suggestions to improve the manuscript.

Zhiyan Zhou designed the reported study, evaluated the results, and prepared and reviewed the manuscript. Sai Xu conducted the whole experiment, analysed the results, developed the improved algorithm, and prepared the manuscript. Huazhong Lu and Xiwen Luo contributed to planning the reported research, evaluating the results, and reviewing and approving the manuscript. Yubin Lan helped in preparing the experimental setup, evaluated the system, and reviewed the manuscript. All authors read and approved the manuscript.

The authors declare no conflicts of interest.

The six studied varieties of rough rice.

The structure of the portable electronic nose.

Sampling set-up using the electronic nose.

The electrical signal change in volatile detection of “Youyou 122” rice grain sample (where R(1)–R(10) are the numbers of the 10 metal-oxide sensors in the sensor array).

The flow diagram of the improved algorithm.

The diagram of the PNN.

Classification of the six rough rice varieties using PC1 and PC2.

Classification of the six rough rice varieties using PC1 and PC5.

The selection of the spread value for a PNN.

The dispersion ratio when any two eigenvectors comprise a lower-dimensional matrix.

PC | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
1 | 0.1441 | 0.1855 | 0.1687 | 0.1254 | 0.1709 | 0.1877 | 0.1552 | 0.1407 | 0.1569 |
2 | | 0.5154 | 0.4351 | 0.3127 | 0.3563 | 0.4921 | 0.4793 | 0.4642 | 0.4855 |
3 | | | 0.6940 | 0.5111 | 0.5122 | 0.7981 | 0.7470 | 0.7650 | 0.7846 |
4 | | | | 0.4195 | 0.5632 | 0.6848 | 0.7219 | 0.7435 | 0.7360 |
5 | | | | | 0.3298 | 0.4631 | 0.4728 | 0.4979 | 0.5204 |
6 | | | | | | 0.7126 | 0.6931 | 0.7189 | 0.6781 |
7 | | | | | | | 0.9207 | 0.9848 | 0.8886 |
8 | | | | | | | | 0.9332 | 0.9511 |
9 | | | | | | | | | 0.9562 |

MDs of the centre points of the sample data points for the six rough rice varieties.

Variety | PCs | Wufengyou T025 | Xiangwan 13 | Yaopingxiang | Youyou 122 | Zhongxiang 1 |
---|---|---|---|---|---|---|
Pin 36 | 1,2 | 2.6311 | 2.5955 | 3.0686 | 2.5392 | 1.7540 |
Pin 36 | 1,5 | 2.4466 | 2.6110 | 2.8300 | 2.7571 | 1.8892 |
Wufengyou T025 | 1,2 | | 0.9606 | 2.9705 | 0.6827 | 1.6154 |
Wufengyou T025 | 1,5 | | 0.6624 | 1.5311 | 2.8815 | 0.7148 |
Xiangwan 13 | 1,2 | | | 2.0368 | 0.2802 | 1.0224 |
Xiangwan 13 | 1,5 | | | 0.8843 | 2.3718 | 0.7290 |
Yaopingxiang | 1,2 | | | | 2.2980 | 1.6652 |
Yaopingxiang | 1,5 | | | | 1.6452 | 1.2571 |
Youyou 122 | 1,2 | | | | | 1.1189 |
Youyou 122 | 1,5 | | | | | 2.2791 |

Classification result of the test set (PC1 and PC2) using PNN (Spread = 4 × 10^{−5}).

Real \ Predicted | P36 | WT025 | XW13 | YPX | YY122 | ZX1 | SUM |
---|---|---|---|---|---|---|---|
P36 | 5 | 0 | 0 | 0 | 0 | 0 | 5 |
WT025 | 0 | 4 | 0 | 0 | 1 | 0 | 5 |
XW13 | 0 | 0 | 5 | 0 | 0 | 0 | 5 |
YPX | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
YY122 | 0 | 0 | 2 | 0 | 3 | 0 | 5 |
ZX1 | 0 | 0 | 0 | 0 | 0 | 5 | 5 |
SUM | 5 | 4 | 7 | 5 | 4 | 5 | 30 |

Classification accuracy = (5 + 4 + 5 + 5 + 3 + 5)/30 = 90%

Notes: P36 (Pin 36), WT025 (Wufengyou T025), XW13 (Xiangwan 13), YPX (Yaopingxiang), YY122 (Youyou 122), ZX1 (Zhongxiang 1); There are a total of 120 samples for the cross test, 30 samples were randomly selected for the independence test set, and each variety has five samples.

Classification result of the test set (PC1 and PC5) using PNN (Spread = 4 × 10^{−5}).

Real \ Predicted | P36 | WT025 | XW13 | YPX | YY122 | ZX1 | SUM |
---|---|---|---|---|---|---|---|
P36 | 5 | 0 | 0 | 0 | 0 | 0 | 5 |
WT025 | 0 | 4 | 0 | 0 | 0 | 1 | 5 |
XW13 | 0 | 0 | 5 | 0 | 0 | 0 | 5 |
YPX | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
YY122 | 0 | 0 | 0 | 0 | 5 | 0 | 5 |
ZX1 | 0 | 0 | 0 | 0 | 0 | 5 | 5 |
SUM | 5 | 4 | 5 | 5 | 5 | 6 | 30 |

Classification accuracy = (5 + 4 + 5 + 5 + 5 + 5)/30 = 96.67%

Note: P36 (Pin 36), WT025 (Wufengyou T025), XW13 (Xiangwan 13), YPX (Yaopingxiang), YY122 (Youyou 122), ZX1 (Zhongxiang 1); There are a total of 120 samples for the cross test, 30 samples were randomly selected for the independence test set, and each variety has five samples.
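The accuracies stated for the two test-set classification tables can be re-derived from the confusion matrices themselves. The Python sketch below does this, with rows taken as the real classes and columns as the predicted classes in the order P36, WT025, XW13, YPX, YY122, ZX1; the variable and function names are illustrative.

```python
import numpy as np

# Test-set confusion matrix for PC1 and PC2 (spread = 4e-5).
cm_pc1_pc2 = np.array([
    [5, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 1, 0],
    [0, 0, 5, 0, 0, 0],
    [0, 0, 0, 5, 0, 0],
    [0, 0, 2, 0, 3, 0],
    [0, 0, 0, 0, 0, 5],
])

# Test-set confusion matrix for PC1 and PC5 (spread = 4e-5).
cm_pc1_pc5 = np.array([
    [5, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 1],
    [0, 0, 5, 0, 0, 0],
    [0, 0, 0, 5, 0, 0],
    [0, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 5],
])


def accuracy(confusion):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    return np.trace(confusion) / confusion.sum()


acc_12 = accuracy(cm_pc1_pc2)
acc_15 = accuracy(cm_pc1_pc5)
print(f"PC1 and PC2: {100 * acc_12:.2f}%")
print(f"PC1 and PC5: {100 * acc_15:.2f}%")
print(f"Improvement: {100 * (acc_15 - acc_12):.2f} percentage points")
```

The difference between the two accuracies corresponds to two additional test samples out of 30 being classified correctly.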