Optimization of Electronic Nose Sensor Array for Tea Aroma Detecting Based on Correlation Coefficient and Cluster Analysis

Abstract: The electronic nose system is widely used in tea aroma detection, and the sensor array plays a fundamental role in obtaining good results. Here, a sensor array optimization (SAO) method based on correlation coefficient and cluster analysis (CA) is proposed. First, the correlation coefficient and the distinguishing performance value (DPV) are calculated to eliminate redundant sensors. Then, the sensor independence is obtained through cluster analysis and the number of sensors is confirmed. Finally, the optimized sensor array is constructed. According to the results of the proposed method, sensor arrays for green tea (LG), fried green tea (LF) and baked green tea (LB) are constructed, and validation experiments are carried out. The classification accuracy obtained with linear discriminant analysis (LDA) based on the average value (LDA-ave) combined with a nearest-neighbor classifier (NNC) lies in the range of approximately 94.44-100%. When the proposed method is used to discriminate between various grades of West Lake Longjing tea, LF shows performance comparable to that of the German PEN2 electronic nose. The electronic nose SAO method proposed in this paper can effectively eliminate redundant sensors and improve the quality of the original tea aroma data. With fewer sensors, the optimized sensor array contributes to the miniaturization and cost reduction of the electronic nose system.


Introduction
Tea is one of the most popular non-alcoholic beverages in the world. Aroma is an important attribute of tea and carries rich information about its quality and type. There are approximately 600 aromatic compounds in tea aroma [1,2]. At present, tea aroma detection depends primarily on the sensory evaluation of tea appraisers, which is time-consuming, labor-intensive and not conducive to ensuring accuracy. In recent years, electronic nose systems have played an increasingly important role in the field of gas detection. An electronic nose mimics the organization of the human olfactory system: a sensor array obtains the fingerprints of the gas signals from samples, and pattern recognition methods identify these 'fingerprints' in a given dataset [3]. The sensor array, composed of gas sensors, is highly sensitive to the main volatile components in the target aroma sample. Therefore, a mapping from aroma to semantic information such as variety and grade can be constructed. Electronic noses have been applied in many fields of food engineering, including food classification [4,5], quality assessment [6,7], freshness prediction [8] and authenticity identification [9]. Research objects include various foods such as fruit [10], vegetables [11], meat [12], beverages [13][14][15], herbs [16] and especially tea [17][18][19][20][21][22].

Tea Samples Preparation
In China, there are 4 kinds of green tea: fried green tea, baked green tea, sunburned green tea and steamed green tea. Among them, fried and baked green tea occupy the main market share. Hence, we mainly focus on fried green tea and baked green tea.
Three other kinds of fried green tea, shangwushanhuibai (swshb), laoshanlvcha (lslc) and biluochun (blc), and three other kinds of baked green tea, jingtinglvxue (jtlx), hanzhongxianhao (hzxh) and emeishanmaofeng (emsmf), are used to validate the discriminating performance of the optimized sensor array for green tea. The tea samples are given in Table 1. West Lake Longjing is one of the representative brands of fried green tea in China, and its grade classification has also received great attention. Hence, 4 grades of West Lake Longjing are also used to validate the discriminating performance of the optimized sensor array for fried green tea.
The experimental parameters of the sample preparation and aroma collection can affect the reaction speed, which subsequently affects the final detection effect. Hence, some preliminary experiments were carried out to select the optimal experimental parameters, including sealing time, gas flow rate, temperature and data acquisition time. The tea aroma sampling process is as follows. The ambient temperature is maintained at 25 °C. In a 500 mL beaker, 5 g of tea leaves is brewed with 250 mL of boiling water. After 5 min, the water is poured out, leaving the tea leaves at the bottom of the beaker. The beaker is sealed for 30 min to allow the tea aroma to volatilize. Then, the air pump draws the aroma evenly into the electronic nose system at a flow rate of 15 mL/s, where it flows through the sensor array. The response values of each sensor are read every second for approximately 60 s, and the stable response values after 35 s are used for analysis. There are 3 samples of each kind of tea, and each sample is measured once. Since a drying tube was added before the aroma entered the reaction chamber, indoor humidity had little effect on the experimental results. Hence, there are no special considerations regarding humidity in this paper.
The response curve of a typical green tea (West Lake Longjing) is shown in Figure 1. Each curve represents the variation in conductivity of each sensor with time when the tea volatiles reached the reaction chamber. It can be seen that in the 35-60 s interval, the response values of all sensors tend to be stable. In subsequent experiments, we observed that all tea aroma samples used in this study have similar characteristics, so it is reasonable to use the response value in the same interval for subsequent feature extraction and pattern recognition.

Preliminary Sensor Array
First, we conduct a preliminary screening of common gas sensors on the market to construct a candidate list. The sensors are selected according to the following three criteria: (1) sensors sensitive to aroma components; (2) sensors used in other studies regarding odor detection; and (3) sensors with stable performances and full ranges of models. Based on these 3 criteria, the gas sensors were finally selected. These sensors share the same operating voltage (5 V) and external resistance, which facilitates the integrated circuit design of the sensor array. The preliminary sensors and their basic information are shown in Table 2; more details are in Table S1 in the Supplementary Files. All sensors listed in Table 2 are metal oxide sensors (MOSs). As an example, the circuit schematic of sensor TGS826 is shown in Figure 2. It requires two voltage sources: the heating voltage ($V_H$) and the loop voltage ($V_C$). $V_H$ keeps the sensor at a certain temperature, and $V_C$ is used to monitor the voltage ($V_{RL}$) across the load resistance ($R_L$). When the sensor detects a gas it is sensitive to, the resistance of the sensor decreases, and the voltage across the load resistance increases.
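The relation between sensor resistance and load voltage can be illustrated with a small calculation. This is a minimal sketch that assumes the sensor and the load resistor form a plain series voltage divider across $V_C$; the divider model, the function names and the 10 kΩ load value are assumptions made for illustration and are not taken from the paper.

```python
def load_voltage(r_sensor, r_load, v_c=5.0):
    """Voltage across the load resistor of a series sensor/load divider (assumed model)."""
    return v_c * r_load / (r_sensor + r_load)


def sensor_resistance(v_rl, r_load, v_c=5.0):
    """Invert the divider to recover the sensor resistance from the measured V_RL."""
    return r_load * (v_c - v_rl) / v_rl


R_LOAD = 10_000.0  # hypothetical load resistance in ohms

# As the sensor resistance drops on gas exposure, V_RL rises.
for r_s in (40_000.0, 10_000.0):
    print(f"R_S = {r_s:.0f} ohm -> V_RL = {load_voltage(r_s, R_LOAD):.2f} V")
```

Under this assumed model, a drop in sensor resistance from 40 kΩ to 10 kΩ raises the load voltage from 1.0 V to 2.5 V, which matches the qualitative behavior described above.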

Electronic Nose System Set-Up
The self-made electronic nose system is primarily composed of a gas path and a signal acquisition circuit. The internal structure and airflow of the device are shown in Figure 3. An introduction to all of the components of the e-nose system can be found in Table S2 in the Supplementary Files. The workflow of the device is as follows:
(1) After the samples are ready, the aroma is drawn out by the suction pump and then flows into the drying tube. The drying tube is filled with a sufficient amount of granular silica gel desiccant (produced by Longhui Desiccant Co., Ltd., Suzhou, Jiangsu Province, China). Hence, the water vapor in the tea aroma is removed to prevent it from affecting the measurement result.
(2) The aroma enters the reaction chamber and reacts with the sensors to generate a response signal. The reaction chamber is made of acrylonitrile butadiene styrene (ABS) and is 3D printed, which has high strength and good heat insulation. The reaction chamber is especially designed to ensure that the aroma environment of each sensor in the sensor array is uniform and consistent.
(3) The signal acquisition circuit can obtain the response signal and send it to the host analysis system in a personal computer (PC) through the serial port. The post-processing and pattern recognition of the signal are completed in the PC.
The schematic and an image of the sensor array are shown in Figure 4. In the schematic, s1-s15 indicates the 15 sensors; p2 and p3 indicate the sensors' signal extraction pins. The overall appearance of the electronic nose system is shown in Figure 5. The overall size of the e-nose system is 300 mm × 200 mm × 110 mm. All function modules are integrated in the gray box. There are some function buttons on the operation panel, which provide functions such as power switch, flow adjustment, pre-heating and purging.

Data Analysis Methods
Two kinds of features, the average value and the value at the maximum-variance moment, are extracted from the original sensor response curve. The calculation of the average value is shown in Equation (1):

$$\bar{m}_n = \frac{1}{k} \sum_{T=1}^{k} m_{nT} \quad (1)$$

where $k$ is the number of sampling points, $n$ is the number of sensors and $m_{nT}$ is the response value of the $n$th sensor at time $T$. $\bar{m}_n$ is the average value of the $n$th sensor for the sample; together, these averages form a 1 × n feature vector. For the second method, the variance $S^2$ of the response values of all sensors at the same time $T$ is calculated first, as given by Equation (2), where $n$ is the number of sensors and $m_{iT}$ represents the response value of the $i$-th sensor at time $T$. Then, the response value of each sensor at the time when the variance is maximum is taken, forming a 1 × n feature vector.
Two simple and widely used pattern recognition methods, principal component analysis (PCA) and linear discriminant analysis (LDA), are introduced for data processing after feature extraction. PCA is a projection method that reduces dimensionality, allows an easy visualization of the information contained in a dataset and gives a primary evaluation of the between-class similarity. LDA is a statistical method that determines to which group a sample belongs; it maximizes the variance between categories and minimizes the variance within categories. With the dimensionality reduction and visualization provided by PCA and LDA, we can directly observe the distribution of samples of different categories.
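The two feature extraction schemes described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names and the toy response matrix are hypothetical, and the population variance (division by the number of sensors) is an assumption, since the choice of normalization does not change which time point has the largest variance.

```python
import numpy as np


def average_feature(responses):
    """Mean response of each sensor over the window; responses has shape (k, n)."""
    return np.asarray(responses, dtype=float).mean(axis=0)


def max_variance_feature(responses):
    """Sensor responses at the time point where the variance across sensors is largest."""
    responses = np.asarray(responses, dtype=float)
    variances = responses.var(axis=1)  # variance across sensors at each time (assumed population form)
    t_star = int(np.argmax(variances))
    return responses[t_star], t_star


# Toy window: 4 time samples (rows) by 3 sensors (columns), hypothetical values.
window = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [1.0, 3.0, 6.0],
    [1.0, 2.0, 3.0],
])

print(average_feature(window))       # one value per sensor
print(max_variance_feature(window))  # feature vector and the selected time index
```

In practice the window would be the stable part of the response curve (the 35-60 s interval in this study), sliced out before either function is called.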
In order to quantitatively compare the accuracy of different pattern recognition methods in discriminating unknown samples, the nearest neighbor classifier (NNC) is introduced for category discrimination. For each category participating in the discrimination, 3 additional aroma samples were collected as a test set. The original 3 samples are used as labeled data for training to build the model.
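A nearest neighbor classifier of this kind can be sketched as follows. The sketch assumes a 1-nearest-neighbor rule with Euclidean distance on the projected feature vectors; the paper does not state the exact distance or whether distances are taken to individual training samples or to class centers, so these choices, the function names and the toy data are all assumptions.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test vector with the label of its closest training vector."""
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return [train_y[i] for i in nearest]


def accuracy(predicted, truth):
    """Fraction of test samples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical 2-D features: 3 training and 3 test samples per class.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]]
train_y = ["tea_a"] * 3 + ["tea_b"] * 3
test_x = [[0.05, 0.1], [0.3, 0.3], [1.4, 1.4], [3.0, 3.1], [2.8, 2.8], [3.3, 3.0]]
test_y = ["tea_a"] * 3 + ["tea_b"] * 3

predicted = nearest_neighbor_predict(train_x, train_y, test_x)
print(predicted, accuracy(predicted, test_y))
```

The layout mirrors the protocol above, with 3 labeled samples per category for training and 3 additional samples per category for testing.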
In a previous work, the authors introduced the random forest machine learning (RFML) algorithm to analyze aroma data and achieved excellent performance [15], and it is also introduced for comparative research. Taking the random sampling of RFML into account, we repeated the same unknown sample discrimination experiment 10 times, and took the average value as the final accuracy result.

Sensor Array Optimization Methods
The two-step down-selection methodology of SAO involves the analysis of the following factors (the flow chart of the method is shown in Figure 6):
(1) Correlation analysis. The correlation coefficient between two sensors is calculated. A large correlation coefficient means that the two sensors are strongly correlated and that the obtained signals are highly similar. Thus, the two sensors can replace each other. The correlation coefficient can be calculated using Equation (3):

$$R_{xy} = \left| \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \sum_{i} (y_i - \bar{y})^2}} \right| \quad (3)$$

where $x$ and $y$ represent two different models of sensors; $\bar{x}$ and $\bar{y}$ are the mean values of the first 60 s of the two sensors, respectively; $x_i$ is the $i$-th data value of sensor $x$, $y_i$ is the $i$-th data value of sensor $y$, and $R_{xy}$ is the absolute value of the correlation coefficient between sensor $x$ and sensor $y$.
(2) Distinguishing performance value (DPV) calculation. Only one of the replaceable sensors can be kept. It is difficult to make this decision based solely on the sensitivity of the sensor. Therefore, the ability of the sensor to discriminate among different tea classes is determined by calculating the inter- and intra-class dispersion, that is, the DPV. Sensors with smaller DPVs should be eliminated. The DPV of each sensor is given by Equation (4), where $S_b$ and $S_w$ are the inter- and intra-class dispersions of the sample, respectively. A larger $F_i$ value represents better distinguishing performance. $n$ denotes the number of tea varieties, $m$ is the number of samples detected for each tea variety, $u$ denotes the average detected value of all tea samples, $u_i$ denotes the average detected value of all samples of tea variety $i$, and $x_k$ denotes the detected value of the $k$-th sample of tea variety $i$.
(3) Cluster analysis (CA). In the process of constructing the sensor array, the independence between the sensors should be considered. Through CA, the distance between different sensors can be calculated to determine the independence between the sensors.
(4) Sensor number determination. If the number of sensors N in the array is not specified, the effect of tea aroma detection is tested with sensor arrays of different sizes N, and N is determined based on the effect.
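The first two quantities in the list above can be sketched in NumPy as follows. The correlation function is the absolute Pearson coefficient. The DPV function is only an assumed form: it takes the ratio of a between-class sum of squares to a within-class sum of squares, which is one common way to combine inter- and intra-class dispersion, and the exact normalization used in Equation (4) may differ. Function names and toy values are hypothetical.

```python
import numpy as np


def abs_correlation(x, y):
    """Absolute Pearson correlation between the response series of two sensors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return abs((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def dpv(values_by_class):
    """Assumed DPV: between-class over within-class sum of squares for one sensor.

    values_by_class maps a tea variety to the detected values of that sensor.
    """
    groups = [np.asarray(v, dtype=float) for v in values_by_class.values()]
    u = np.concatenate(groups).mean()                          # overall mean
    s_b = sum(len(g) * (g.mean() - u) ** 2 for g in groups)    # inter-class dispersion
    s_w = sum(((g - g.mean()) ** 2).sum() for g in groups)     # intra-class dispersion
    return s_b / s_w


# Two perfectly anti-correlated toy series still count as fully redundant.
print(abs_correlation([1, 2, 3, 4], [7, 5, 3, 1]))

# One sensor, two hypothetical tea varieties with three samples each.
print(dpv({"tea_a": [1.0, 1.2, 0.8], "tea_b": [3.0, 3.1, 2.9]}))
```

Taking the absolute value means that strongly negatively correlated sensors are treated as replaceable as well. The DPV sketch does not guard against a zero within-class dispersion, which would need handling on real data.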

Optimization of Sensor Array LG for Green Tea
(1) Sensor array optimization based on correlation analysis and the DPV
Correlation analysis is used to calculate the correlation coefficient between two sensors. The correlation coefficient $R_{xy}$ ranges from −1 to 1, and $R_{xy}$ ≥ 0 means positive correlation and vice versa. The degree of correlation between the sensors increases as $R_{xy}$ increases. When $R_{xy}$ ≥ 0.9, we consider the two sensors to be strongly similar and able to replace each other. If no pair reaches $R_{xy}$ ≥ 0.9 for a certain tea variety, we take the three sensor pairs with the largest $R_{xy}$ as the candidates to be removed. Table 3 lists the sensor pairs with $R_{xy}$ ≥ 0.9, or the three largest values of $R_{xy}$, for the different tea aromas. Since the selected sensors are all sensitive to tea aroma, it is not easy to decide which to eliminate based on the sensitivity of the sensors. Thus, the DPV of each sensor was evaluated by calculating the inter- and intra-class dispersion. Table 4 shows the sensors' DPVs for 6 kinds of green tea, including 3 kinds of fried green tea and 3 kinds of baked green tea. Significant differences in the DPVs occur when different sensors detect tea aroma within identical varieties. When $R_{xy}$ of two sensors is high, as listed in Table 3, a redundant sensor can be eliminated. For example, TGS826/TGS822 is a highly correlated sensor pair for msqf, and their DPVs for the 6 general green teas are 4.59 and 13.74, respectively. Therefore, TGS826 is eliminated due to its smaller DPV. Similarly, for detecting msqf, TGS822, TGS2620, 2M009 and TGS2620 should be eliminated in the corresponding highly correlated sensor pairs MQK2/TGS822, TGS2620/TGS822, TGS822/2M009 and TGS2620/MQK2, respectively. Because a single detection is subject to chance, only sensors that are rejected for two or more kinds of green tea are eliminated from sensor array LG for general green tea.
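The two-or-more rejection rule can be sketched as a simple vote count. The per-tea rejection lists below are the ones reported for the six green teas in this section (Table 5); the function name and the `min_votes` parameter are hypothetical.

```python
from collections import Counter

# Sensors rejected for each tea variety, as reported for the six green teas.
rejected_by_tea = {
    "msqf": ["TGS826", "TGS822", "TGS2620", "2M009"],
    "dgdf": ["TGS826", "TGS2620", "TGS813", "TGS822"],
    "qs": ["TGS822", "MQ-6", "MQK2"],
    "lagp": ["MQ-6", "2M012"],
    "hsmf": ["TGS822", "MQ-5"],
    "tphk": ["MQ-6", "TGS832"],
}


def eliminate(rejected, min_votes=2):
    """Return the sensors rejected for at least min_votes tea varieties."""
    votes = Counter(s for sensors in rejected.values() for s in set(sensors))
    return sorted(s for s, count in votes.items() if count >= min_votes)


print(eliminate(rejected_by_tea))
# ['MQ-6', 'TGS2620', 'TGS822', 'TGS826']
```

Each tea contributes at most one vote per sensor, so a sensor that appears in several correlated pairs for the same tea is still counted once for that tea.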
Therefore, TGS826 and TGS2620 are eliminated since they are rejected in the detection of msqf and dgdf; TGS822 is eliminated since it is rejected in the detection of msqf, dgdf, qs and hsmf; MQ-6 is eliminated since it is rejected in the detection of qs, lagp and tphk. Hence, the 11 sensors 2M009, TGS813, TGS832, MQ-8, MQ-5, MQ-3, 2M012, TGS2600, TGS2610, MQK2 and TGS800 are retained. The optimization results are shown in Table 5.

| Tea | Retained sensors | Eliminated sensors |
| --- | --- | --- |
| msqf | …, MQ-3, 2M012, TGS2600, TGS2610, MQK2, TGS800 | TGS826, TGS822, TGS2620, 2M009 |
| dgdf | 2M009, TGS832, MQ-6, MQ-8, MQ-5, MQ-3, 2M012, TGS2600, TGS2610, MQK2, TGS800 | TGS826, TGS2620, TGS813, TGS822 |
| qs | 2M009, TGS813, TGS832, MQ-8, MQ-5, MQ-3, 2M012, TGS2620, TGS2600, TGS2610, TGS826, TGS800 | TGS822, MQ-6, MQK2 |
| lagp | 2M009, TGS813, TGS822, TGS832, MQ-8, MQ-5, MQ-3, TGS2620, TGS2600, TGS2610, TGS826, MQK2, TGS800 | MQ-6, 2M012 |
| hsmf | 2M009, TGS813, TGS832, MQ-6, MQ-8, MQ-3, 2M012, TGS2620, TGS2600, TGS2610, TGS826, MQK2, TGS800 | TGS822, MQ-5 |
| tphk | 2M009, TGS813, TGS822, MQ-8, MQ-5, MQ-3, 2M012, TGS2620, TGS2600, TGS2610, TGS826, MQK2, TGS800 | MQ-6, TGS832 |

(2) Sensor array optimization based on CA and DPV
The previous step eliminated redundant sensors with high correlation through correlation analysis and the DPV. However, how to select a specific number N of sensors from the remaining sensors to construct the array is still unknown. Thus, CA and the DPV are used to solve this problem.
The average value of the aroma response data of each sensor for each sample of the different tea varieties was used as input, and hierarchical (system) clustering was performed. The squared Euclidean distance was used as the distance measure and between-groups linkage was used as the clustering method; the resulting icicle plot was then analyzed. The whole process was performed in SPSS 24.0. The results are shown in Figure 7, and the clustering coefficients between the sensors are shown in Table 6.
In Figure 7, sensors whose bars are connected at a given level belong to one class. For example, to group the sensors into 8 classes, we draw a horizontal dotted line at the ordinate of 8 (shown in Figure 7). Scanning from left to right, each run of connected bars forms one class, which gives 8 classes: MQK2, MQ-8, TGS813, 2M009, MQ-3, 2M012, (TGS2610/TGS2600) and (TGS832/MQ-5/TGS800). Similarly, the remaining 11 sensors can be clustered into 2-10 classes, as shown in Table 7.
For each clustered sensor class, we selected the sensor with the maximum DPV in its class to construct the optimized sensor array. For example, to obtain an optimized sensor array with N = 8, we note that two of the 8 classes contain more than one sensor, namely (TGS2610/TGS2600) and (TGS832/MQ-5/TGS800). According to Table 5, the DPVs of these sensors for the 6 green teas are ((TGS2610, 1.28)/(TGS2600, 3.89)) and ((TGS832, 6.56)/(MQ-5, 2.06)/(TGS800, 0.17)). Hence, we chose TGS2600 and TGS832 and combined them with the single-sensor classes MQK2, MQ-8, TGS813, 2M009, MQ-3 and 2M012 to construct an optimized sensor array with N = 8. Optimized sensor arrays with N = 2-10 can be obtained by CA and DPVs in the same way, as shown in Table 7.
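The clustering and selection step can be sketched without SPSS. The following is a plain NumPy reimplementation under stated assumptions: agglomerative clustering with squared Euclidean distance and average (between-groups) linkage, merged until N classes remain, followed by picking the sensor with the largest DPV in each class. The sensor names, response profiles and DPVs are hypothetical toy values, and the result is not guaranteed to match SPSS output on real data (for example, in how ties are broken).

```python
import numpy as np


def cluster_sensors(profiles, n_classes):
    """Average-linkage clustering of sensors on squared Euclidean distance.

    profiles maps a sensor name to its vector of average responses over the samples.
    """
    names = list(profiles)
    x = np.array([profiles[s] for s in names], dtype=float)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Between-groups linkage: mean distance over all cross-cluster pairs.
                dist = d2[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
    return [[names[i] for i in c] for c in clusters]


def pick_representatives(classes, dpv):
    """Keep the sensor with the largest DPV from each class."""
    return [max(members, key=dpv.get) for members in classes]


# Hypothetical sensors: A and B respond almost identically, C and D are distinct.
profiles = {"A": [1.0, 1.1], "B": [1.05, 1.12], "C": [5.0, 5.2], "D": [9.0, 0.5]}
dpv = {"A": 2.0, "B": 3.5, "C": 1.0, "D": 0.7}

classes = cluster_sensors(profiles, n_classes=3)
print(classes)                             # [['A', 'B'], ['C'], ['D']]
print(pick_representatives(classes, dpv))  # ['B', 'C', 'D']
```

Single-sensor classes are kept as they are, and the DPV only decides between sensors that the clustering considers similar.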
(3) Sensor number determination
Table 8 shows the discriminating accuracy for six kinds of green tea (msqf, dgdf, qs, lagp, hsmf and tphk) obtained by sensor arrays with different numbers N of sensors. In general, if N is too small (N < 6), the accuracy is relatively low. When N ≥ 6, the accuracy exceeds 98% and varies little with N: the accuracy rates for N = 6, 7, 8, 9 and 10 are 98.42%, 98.88%, 98.90%, 98.80% and 98.87%, respectively. When N = 8, the number of sensors is appropriate and the accuracy is stable and relatively high. Thus, we obtain an optimized array LG with 8 sensors (MQK2, MQ-8, TGS813, 2M009, MQ-3, 2M012, TGS2600 and TGS832) for general green tea detection.
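The comparison across array sizes can be checked with a few lines of Python. The accuracies are the values quoted above for N = 6 to 10; the variable names are hypothetical, and the sketch only shows that N = 8 has the highest of the reported accuracies, not how the authors weighed accuracy against array size.

```python
# Accuracy (%) reported for each array size N with N >= 6.
accuracy_by_n = {6: 98.42, 7: 98.88, 8: 98.90, 9: 98.80, 10: 98.87}

best_n = max(accuracy_by_n, key=accuracy_by_n.get)
spread = max(accuracy_by_n.values()) - min(accuracy_by_n.values())

print(best_n)            # 8
print(round(spread, 2))  # 0.48
```

The spread of less than half a percentage point is what makes the choice among N ≥ 6 largely a matter of array size and cost.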

Optimization of Sensor Array LF for Fried Green Tea
Fried green tea accounts for approximately 70% of green tea in China. When LG is directly used to discriminate between varieties of fried green tea, the results are not quite satisfactory. Therefore, it is necessary to screen out the sensor array for fried green tea. Here, we specify that there are 8 sensors (N = 8) in sensor array LF, which is equal to that of LG. The fried green tea samples used for LF optimization are msqf, dgdf and qs.
According to the highly correlated sensors listed in Table 3, we used the DPVs for the 3 fried green teas listed in Table 4 to decide which sensors should be eliminated. For msqf, sensors TGS822, TGS2620 and 2M009 should be eliminated; for dgdf, sensors TGS826, TGS822, MQ-8 and TGS2620 should be eliminated; and for qs, sensors TGS822, MQK2 and MQ-8 should be eliminated. Thus, TGS822, TGS2620 and MQ-8 were eliminated from LF since they were rejected for two or more kinds of fried green tea. After removal, the 12 retained sensors were 2M009, TGS813, TGS832, MQ-6, MQ-5, MQ-3, 2M012, TGS2600, TGS2610, TGS826, MQK2 and TGS800. Then, we took the average value of the aroma data of each sensor for each sample of the different kinds of fried tea as input for the SPSS software, and produced an icicle figure of the clustering process, as shown in Figure 8. As described previously, we grouped the sensors into 8 classes, namely TGS813, MQK2, MQ-6, 2M009, 2M012, (MQ-5/TGS832/TGS800), (MQ-3/TGS2600) and (TGS2610/TGS826). For each clustered sensor class, we selected the sensor with the maximum DPV for the 3 fried green teas, as listed in Table 4, and constructed an optimized sensor array LF with N = 8, namely TGS813, MQK2, MQ-6, 2M009, 2M012, MQ-5, MQ-3 and TGS2610.

Optimization of Sensor Array LB for Baked Green Tea
The teas lagp, hsmf and tphk were used as baked green tea examples for the optimization of sensor array LB. Similarly, according to the highly correlated sensors listed in Table 3, we used the DPVs for the 3 kinds of baked green tea listed in Table 4 to decide which sensors should be eliminated. For lagp, sensors 2M012 and MQ-3 should be eliminated; for hsmf, sensors MQ-6 and MQ-5 should be eliminated; and for tphk, sensors MQ-3 and MQ-6 should be eliminated. Thus, MQ-3 and MQ-6 were eliminated from LB since they were rejected for two or more kinds of baked green tea. After removal, the 13 retained sensors were TGS822, TGS813, 2M009, TGS832, TGS800, MQ-5, 2M012, TGS2620, TGS826, TGS2600, MQK2, MQ-8 and TGS2610. Then, we took the average value of the aroma response data of each sensor for each sample of the different kinds of baked tea as input for the SPSS software, and produced an icicle figure of the clustering process, as shown in Figure 9. As described previously, we grouped the sensors into 8 classes, namely MQK2, TGS822, 2M009, 2M012, TGS813, MQ-8, TGS832 and (TGS2620/TGS2610/TGS2600/MQ-5/TGS800/TGS826). For each clustered sensor class, we selected the sensor with the maximum DPV for the 3 kinds of baked green tea, as listed in Table 4, and constructed an optimized sensor array LB with N = 8, namely MQK2, TGS822, 2M009, 2M012, TGS813, MQ-8, TGS832 and TGS2620.

Classification of Green Tea Varieties
Three groups of optimized sensor arrays (LG, LF and LB) obtained for green tea, fried green tea and baked green tea, respectively, were generated based on the process above. The discriminating accuracy of these 3 sensor arrays needs to be further verified. The data analysis methods used were PCA based on the average value (PCA-ave), LDA based on the average value (LDA-ave), PCA based on the maximum variance moment (PCA-var) and LDA based on the maximum variance moment (LDA-var). Similar methods are also described in [30].
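The PCA projection used for the PCA-ave and PCA-var plots can be sketched with a singular value decomposition. This is a generic PCA sketch rather than the authors' processing pipeline; the function name and the toy feature matrix are hypothetical, and no scaling of the sensor channels is applied, which is an assumption.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project feature vectors onto their first principal components.

    features has one row per sample and one column per sensor.
    Returns the scores and the explained variance ratio of the kept components.
    """
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]


# Hypothetical feature matrix: 4 samples by 3 sensors.
features = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [3.0, 6.0, 9.0], [4.0, 8.1, 12.0]]
scores, explained = pca_project(features)
print(scores.shape)  # (4, 2)
print(explained)
```

The same function can be fed either the average-value features or the maximum-variance-moment features, since both are 1 × n vectors per sample.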
The results of the 12 varieties of green tea detected by sensor array LG are shown in Figure 10 and Table 9. Area overlaps occur for 2, 2, 2 and 4 kinds of tea in LDA-ave, PCA-ave, LDA-var and PCA-var, respectively. As shown in Figure 11 and Table 9, when discriminating between 6 varieties of fried green tea using LF, area overlap occurs for 2 kinds of tea in LDA-var. For discriminating between 6 varieties of baked green tea using LB, area overlap occurs for 2 kinds of tea in PCA-var, as shown in Figure 12 and Table 9. Because of the large inter-class dispersion, the scale range of the whole graph is large, and the points of some regions appear to gather together. Therefore, local magnifications are added to the graphs to show these locally aggregated points. For ease of understanding, the incorrectly distinguished points are red-circled in Figures 10-12. In general, the discrimination accuracy for fried green tea by LF or for baked green tea by LB is higher than that for general green tea by LG. This suggests that SAO is beneficial for green tea with a specified processing technique.
From Figures 10-12, we can see how dispersed or concentrated the discrimination results are, and whether there are regional overlaps. However, in order to obtain a discrimination accuracy value, a classification algorithm must be added. Here, we used NNC to obtain the discrimination accuracy values, as shown in Table 9. When LDA-ave + NNC is used, satisfactory results are obtained. The discrimination accuracy of LG for 12 kinds of green tea, LF for 6 kinds of fried green tea and LB for 6 kinds of baked green tea lies in the range of approximately 88.89-100%.
Note that NNC may lead to misjudgments. For example, when detecting 12 kinds of green tea by LG and PCA-ave, a feature value of qs is misjudged as dgdf since it is closer to the center of dgdf, although qs and dgdf seem to be correctly separated, as shown in Figure 10. Similarly, a feature value of emsmf is misjudged as jtlx when detecting 6 baked green teas by LB and PCA-ave, as shown in Figure 12. These misjudgments reduce the accuracy of tea discrimination.

Classification of West Lake Longjing Tea Grade
Here, we further discriminated between different grades of identical fried green tea. Since West Lake Longjing is the most common representative of fried green tea, we will consider 4 grades of West Lake Longjing tea as examples.
It can be seen in Figure 13 and Table 10 that the LDA-ave, PCA-ave and LDA-var methods all have good classification effects on West Lake Longjing tea using LF. Some scholars also used the commercial electronic nose PEN2 to carry out Longjing tea quality identification research [32]. Taking into account the difference between the experimental sample and the environment, it is not scientific to directly compare the two electronic noses quantitatively. However, it can be concluded that in the identification of Longjing tea grades, our self-made electronic nose using the optimized sensor array can achieve an effect comparable to that of the PEN2 electronic nose.

Comparison of Correlation Analysis Methods and the Elimination of Sensors
In this paper, we used DPVs to eliminate the correlated sensors; the principles and characteristics of the methods considered are listed in Table 11. The coefficient of variation (COV) is another common index that reflects the degree of dispersion of the observed values of each indicator per unit mean. COV has also been introduced for the screening and elimination of sensors, and is compared here with the method proposed in this paper.

Table 11. Comparison of correlated sensor analysis and elimination methods.

| Method | Principle | Characteristics |
| --- | --- | --- |
| Correlation analysis | The correlation calculation formula is identical to Equation (3) in this paper. Calculate the sum of each sensor's correlation coefficients and eliminate the sensor with the largest sum. | Optimizes the correlation between sensors, but does not judge the discriminating ability of the sensor. |
| COV | The COV is the standard deviation of the test values of a gas sensor divided by their mean, where $x_i$ is the $i$-th test value of the gas sensor, $\bar{x}$ is the average value of the gas sensor at different times, and $n$ is the total number of tests. The larger the coefficient of variation, the greater the intra-class dispersion of the sensor when detecting the same class of tea. Therefore, sensors with large coefficients of variation are eliminated. | Does not judge the dispersion between classes and does not optimize the correlation between sensors. |
| DPV | The calculation formula for DPV is introduced in Equation (4) of this paper. The DPV considers the inter- and intra-class dispersion of sensors. | Optimizes sensors' discriminating performances but does not optimize the correlation between sensors. |
In order to verify the classification effect of the sensor arrays obtained by different methods, we discriminated the 6 training tea samples using 6 sensor groups: the preliminary 15 sensors, 3 groups of 11 randomly selected sensors, 11 sensors screened by correlation analysis and COV, and 11 sensors screened by correlation analysis and the DPV. The discriminating results are shown in Table 12. Sensors screened by correlation analysis and the DPV have a better discrimination performance than those screened by correlation analysis and COV, or selected at random. This is because the DPV considers the inter- and intra-class dispersion of sensors simultaneously and is therefore better suited to choosing which of the correlated sensors identified by correlation analysis to eliminate. The preliminary 15-sensor array combined with RFML has the highest accuracy. Sensors screened by correlation analysis and the DPV have the most balanced performance. From the perspective of data processing efficiency and equipment cost, the 11-sensor array screened by correlation analysis and the DPV is preferable to the preliminary 15-sensor array.
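For comparison with the DPV, the COV-based screening can be sketched as follows. The sketch assumes the population standard deviation (division by the number of tests); the paper's formula may use a different normalization. The sensor names and readings are hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """Standard deviation of the test values divided by their mean (population form assumed)."""
    v = np.asarray(values, dtype=float)
    return v.std() / v.mean()


# Hypothetical repeated readings of two sensors on the same class of tea.
readings = {
    "sensor_1": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "sensor_2": [4.9, 5.0, 5.1, 5.0, 5.0, 4.9, 5.1, 5.0],
}

cov = {name: coefficient_of_variation(v) for name, v in readings.items()}
worst = max(cov, key=cov.get)  # candidate for elimination under the COV rule
print(cov, worst)
```

Because the COV looks only at repeat measurements within one class, a sensor with a low COV may still fail to separate different teas, which is the limitation noted for this method in Table 11.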

Comparison of Screening Methods for Given Number of Sensors N
The second optimization step is to select N sensors from the 11 sensors remaining after the first step to construct a sensor array with good independence. For N = 8, the sensor array optimized by CA and DPV for 6 kinds of green tea is (MQK2, MQ-8, TGS813, 2M009, MQ-3, 2M012, TGS2600, TGS832), and the corresponding DPV rankings are (2,1,6,10,3,4,11,8). If the sensor array is optimized only by the ranking order of DPVs at the second step, the result is (MQ-8, MQK2, MQ-3, 2M012, TGS813, TGS832, 2M009, TGS2600), and the corresponding DPV rankings are (1,2,3,4,6,8,10,11). Note that the two methods, CA + DPV and DPV alone, happen to select the same 8 sensors at the second optimization step. Thus, in order to show the effect of SAO at the second step, we constructed 5 sensor array groups. Four of them were randomly selected from the eleven sensors remaining after the first step, and the other group was selected using CA and the DPV. The accuracy with which each of the 5 sensor groups discriminated between the 6 training tea samples is shown in Table 13. The results show that the sensor array selected using CA and DPV screening has the best or nearly the best performance in most cases, because the sensor array has better independence based on CA and better discrimination performance based on the DPV.

Conclusions
This paper proposed an optimization method for an electronic nose sensor array to detect tea aroma based on correlation coefficient and cluster analysis. A method based on correlation coefficient and DPV is proposed to eliminate redundant sensors with high correlation coefficients. Three sensor arrays are constructed for green tea (LG), fried green tea (LF) and baked green tea (LB), respectively. Based on the optimized sensor array LG, only 2 kinds of tea areas overlap when discriminating 12 green tea varieties by the LDA-ave, LDA-var and PCA-ave methods. Combined with the NNC algorithm, the accuracy can reach 83.33-94.44%. This indicates that the tea aroma data obtained by LG are of high quality. When detecting various grades of West Lake Longjing tea, LF shows discrimination accuracy comparable to that of the German PEN2 electronic nose based on the same data-processing method. Sensor arrays screened and constructed with other SAO methods were also tested in the same electronic nose system for the identification of tea types. The experimental results show that sensors screened by correlation analysis and the DPV have a better discrimination performance than those screened by correlation analysis and COV, or selected at random. Last but not least, given the number of sensors, the proposed method can filter out the optimal sensor combination from the given candidate list.
The results show that, after proper optimization, a smaller sensor array can not only maintain but even improve the performance of tea aroma detection; this is because, in our model, less noise is introduced. The electronic nose SAO method proposed in this paper can effectively eliminate redundant sensors and improve the quality of the original tea aroma data. Fewer sensors also help simplify the circuit board design, provide a higher degree of freedom in the layout of system components and facilitate the miniaturization of the electronic nose system. In addition, fewer sensors reduce the cost of the sensor array, which leaves more design freedom for other modules in the system and thus has the potential to reduce system costs.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/chemosensors9090266/s1, Table S1: Further information about candidate sensors, Table S2: The specifications of components in the e-nose system