A Convolutional Neural Network Based Auto Features Extraction Method for Tea Classification with Electronic Tongue

Abstract: Feature extraction is a key part of the electronic tongue system. Almost all existing features extraction methods are "hand-crafted", which makes features selection difficult and stability poor. The lack of automatic, efficient and accurate features extraction methods has limited the application and development of electronic tongue systems. In this work, a convolutional neural network-based auto features extraction strategy (CNN-AFE) for tea classification in an electronic tongue (e-tongue) system is proposed. First, the sensor response of the e-tongue was converted to time-frequency maps by short-time Fourier transform (STFT). Second, features were extracted by a convolutional neural network (CNN) with the time-frequency maps as input. Finally, features extraction and classification were carried out under a general shallow CNN architecture. To evaluate the performance of the proposed strategy, experiments were conducted on a tea database containing 5100 samples of five kinds of tea. Compared with other features extraction methods, including features of raw response, peak-inflection point, discrete cosine transform (DCT), discrete wavelet transform (DWT) and singular value decomposition (SVD), the proposed model showed superior performance. A classification accuracy of nearly 99.9% was obtained. The proposed method is an approximate end-to-end features extraction and pattern recognition model, which reduces manual operation and improves efficiency.


Introduction
Tea is one of the most popular beverages in the world, and drinking tea has a long history in China. Tea contains theine, cholestenone, inose, folic acid and other components, which can improve human health. In the actual tasting process, the mellow and fragrant taste of tea stimulates people's taste buds. Many external conditions affect the taste of tea, for example, the number of tea leaves added, the tea making utensils, the brewing time, the water quality and the way the tea is stored. The synergistic effects of these factors create the unique taste of tea. Organic compounds with different chemical structures and concentrations play a significant role in the quality of tea. The ingredients in tea are very complicated, the most important of which are tea polyphenols, amino acids, alkaloids and other aromatic substances.
Tea classification has a wide range of application scenarios. It plays an important role in tea quality estimation, for example, judging whether a tea is new or old, genuine or fake. Moreover, the classification and identification of tea provide a reference for healthy tea drinking. The classification and recognition of tea may also provide a basis for the design of smart teapots in the future.
Traditional tea quality assessment methods are based on different analytical instruments, such as high performance liquid chromatography [1], gas chromatography [2] and plasma atomic emission spectrometry [3]. However, these methods require a lot of technical personnel, material and financial support, which leads to low efficiency and high overhead [4]. With the development of sensor technology, the advantages of sensor-based methods have become more prominent. The typical advantages are high accuracy, simple operation and fast detection, which clearly improve the efficiency of tea quality inspection. At the same time, the electronic nose for gas analysis has also made technological breakthroughs [5,6]. Arrays of electrochemical sensors and devices, such as taste sensors and electronic tongues, have been designed for the analysis of complex liquid samples [7,8]. As a modern intelligent sensory instrument, the electronic tongue is well suited to monitoring the production cycle of beverages, and it is simple, fast and low cost, which shows huge potential in beverage quality evaluation [9]. Figure 1 shows the basic flow of the electronic tongue system for beverage detection and quality evaluation. First, the response signals from the e-tongue hardware system are collected. Then, features of the sampled raw data are extracted for pattern recognition.
Appl. Sci. 2019, 9, x FOR PEER REVIEW
Pattern recognition and features extraction are the two most important parts of the electronic tongue system for liquid classification. Many scholars have contributed to tea pattern recognition. For instance, principal component analysis (PCA) [10], artificial neural network (ANN) [11], support vector machine (SVM) [9,12], k-nearest neighbor algorithm (k-NN) [13], random forest (RF) [14] and autoregressive (AR) model [4] have been put forward for classification analysis in the e-tongue system. Features extraction is another critical step of the e-tongue, as the quality of features selection directly affects the quality of pattern recognition.
Generally speaking, feature extraction serves three main purposes: (1) reduction of random noise, (2) reduction of unwanted systematic variations, which are often due to experimental conditions, and (3) data compression in order to capture the most relevant information from signals. The implementation details of feature extraction methods differ among types of electronic tongue systems. However, as far as we know, these methods are similar in principle and can be divided into the following three categories: features with physical meaning, features acquired by mathematical transformation, and features in the frequency domain. For the voltammetry e-tongue, the early features extraction methods were relatively simple. For example, raw data collected from samples at a fixed frequency were treated as features [15]. Then, peak values and inflection points (the maximum value, the minimum value, and two inflection values in a cycle from samples) were extracted as features [10,16]. Unfortunately, these methods suffered from low accuracy and feature redundancy. Therefore, to extract features more effectively, many researchers tried to compress the sampling data by various mathematical transformation methods. For instance, Pradip Saha et al. used discrete wavelet transform (DWT) with sliding windows to extract the energy in different frequency bands as features [17]. Andrea Scozzari et al. used discrete cosine transform (DCT) to extract features, and selected some coefficients as feature values for tea classification [18]. In addition, Santanu Ghoraiand et al. transformed the sampling data into matrices, decomposed the matrices by singular value decomposition (SVD), and selected several singular values as features for tea classification [19]. SVD and DCT were also fused for feature extraction and then applied to the prediction of theaflavin and thearubigin in tea [20].
Although these mathematical transformation methods improved features selection, the features still have to be selected manually. Specifically, for DWT, both the number of selected layers in the wavelet transform and the strategy for characteristic coefficients (mean, variance, etc.) are determined according to experience. Moreover, the process of manually setting parameters in DCT is similar to that in DWT. In terms of SVD, diagonal values are selected based on experience. This manual way of selecting features limits these mathematical transformation methods. In addition, scholars have tried to fuse electronic tongue and electronic nose sensor data to enhance the accuracy of tea quality prediction. For example, Nur Zawatil Isqi Zakaria et al. adopted PCA to compress electronic tongue and electronic nose samples to assess a bio-inspired herbal tea flavor [21].
The wavelet energy feature (WEF) has been extracted from the responses of the e-nose and e-tongue for the classification of different grades of Indian black tea [22]. Furthermore, Ruicong Zhi et al. proposed a framework for a multi-level fusion strategy of electronic nose and electronic tongue, in which time-domain features (mean value and max value of the sensor response) and frequency-domain features (the energy of DWT) were fused for classification [23]. Runu Banerjee et al. combined electronic tongue data with electronic nose data and then fused DWT with Bayesian statistical analysis to evaluate the artificial flavor of black tea [24]. Mahuya Bhattacharyya Banerjee et al. proposed a cross-perception model (fusing electronic nose and electronic tongue samples by PCA and multiple linear regression) for the detection of the aroma and taste of black tea [25]. In general, methods based on manual features are difficult to adapt to changing scenes and have poor stability. The reason is that these methods adopt empirical parameters, and an empirical value is usually no longer effective when the external environment changes.
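The dependence on empirical parameters can be made concrete with a small sketch. The code below is only an illustration, not the implementation of any cited work: it computes an unnormalized DCT-II in pure Python and keeps the first few coefficients as features. The function names, the synthetic response and the parameter `num_coeffs` are hypothetical; `num_coeffs` stands for the kind of hand-tuned choice discussed above.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence (O(N^2), adequate for a demo)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_features(signal, num_coeffs):
    """Keep the first `num_coeffs` DCT coefficients as the feature vector.

    `num_coeffs` is a hypothetical, manually chosen parameter.
    """
    return dct2(signal)[:num_coeffs]


# Synthetic stand-in for a sensor response (not real e-tongue data).
response = [math.exp(-t / 20.0) for t in range(64)]

# Two different manual choices give two different feature vectors.
features_a = dct_features(response, 4)
features_b = dct_features(response, 8)
```

Changing `num_coeffs` changes the feature vector handed to the classifier, so the choice has to be revisited whenever the measurement conditions change.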
Recently, the impressive achievements of deep architectures on computer vision tasks such as object recognition [26,27], object detection [28] and action recognition [29] have shown the significance of the convolutional neural network (CNN) in the image domain. Most methods utilize deep networks as a features extraction strategy and then train the pattern recognition model. Inspired by the successful application of CNN in image processing, we propose an auto features extraction strategy based on CNN (CNN-AFE). The key idea behind our method is to transform the time series into images so as to make full use of the advantages of CNN. The main contribution of this paper is a deep learning-based auto features extraction strategy for tea classification in the e-tongue system. Furthermore, the proposed model is an approximate end-to-end features extraction and pattern recognition method, which reduces manual operation and improves efficiency. Figure 2 shows the implementation process of this work. First, the sensor response of the e-tongue was converted to time-frequency maps by STFT. Second, the CNN extracted features automatically with the time-frequency maps as input. Finally, features extraction and classification were carried out under a general shallow CNN architecture. Compared with other methods, the proposed method replaces manual features selection with the training of network parameters. The proposed method has advantages in two aspects. On the one hand, after the sampling signal is transformed into time-frequency images with STFT, the remaining features extraction steps are completed automatically during network training. On the other hand, CNN-AFE combines features extraction and pattern recognition into a whole. In traditional algorithms, features extraction and pattern recognition are two separate parts.
Traditional algorithms first apply an experience-based features extraction method to obtain features, and then choose a classifier, such as SVM, k-NN or RF, to complete pattern recognition. In contrast, CNN-AFE is an approximate end-to-end features extraction and pattern recognition method, which is conducive to improving the efficiency and accuracy of pattern recognition. Overall, the proposed method not only has high accuracy but also avoids the inconvenience of manual data selection, which makes it applicable to most data scenarios.
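The first step of this pipeline, converting a one-dimensional response into a time-frequency map, can be sketched as follows. This is a minimal pure-Python STFT for illustration only, not the authors' implementation. The window length `win_len`, the hop size `hop` and the synthetic 125 Hz test tone are assumptions; the 1 kHz rate matches the sampling rate reported for the e-tongue.

```python
import cmath
import math


def stft_magnitude(signal, win_len, hop):
    """Return a time-frequency map as a list of rows (frequency bins),
    each row holding one magnitude per time frame."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    n_bins = win_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        spectrum = [
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n in range(win_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    # Transpose from (time, frequency) to (frequency, time).
    return [list(row) for row in zip(*frames)]


# Synthetic test tone (assumption): 125 Hz sine sampled at 1 kHz.
fs = 1000
signal = [math.sin(2 * math.pi * 125 * t / fs) for t in range(512)]

tf_map = stft_magnitude(signal, win_len=64, hop=32)  # hypothetical settings
peak_bin = max(range(len(tf_map)), key=lambda k: tf_map[k][0])
peak_freq_hz = peak_bin * fs / 64
```

The resulting two-dimensional array plays the role of the image fed to the CNN. In practice a library routine such as `scipy.signal.stft` would normally replace the explicit loops.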
In this study, we adopt a tea database for classification, in which the tea samples were collected by an e-tongue system we designed. The proposed method was compared with other state-of-the-art approaches and showed superior performance, with a classification accuracy of nearly 99.9%. We infer that CNN-AFE is suitable for liquid classification.
The rest of this paper is arranged as follows: Section 2 provides a brief description of the e-tongue system and introduces the experiment settings and the details of the proposed method. In Section 3, we present the experimental results of evaluations on our own database. Finally, we summarize this study in Section 4.

Electronic Tongue System
Electronic tongue systems can be divided into several types according to the data measurement technology, such as potentiometry [30], voltammetry [31] and so on. In addition, spectrophotometry [32] is another very popular means of measurement. Each type of electronic tongue can utilize different types of working electrodes, such as bare electrodes, modified electrodes and biosensors [33]. Typically, bare electrodes are gold, silver, and palladium; modified electrodes are carbon paste processed by double phthalocyanine compounds or conductive polymers; biosensors are carbon biocomposite or graphene electrodes containing enzymes and different metal catalysts. Although different types of electronic tongue systems collect sensor responses in different ways, their features extraction and pattern recognition methods are similar. In this paper, to verify the validity of the features extraction and classification methods, the widely used voltammetry electronic tongue system is adopted to collect sensor responses.
As shown in Figure 3, we designed a voltammetry electronic tongue hardware system composed of a controller, a multi-frequency pulse voltage generation circuit, a constant potential circuit, a three-electrode module, a micro-current detection circuit and a low-pass filter. The three-electrode module, which consists of working electrodes (WE), reference electrodes (RE) and counter electrodes (CE), is the core part of the electronic tongue hardware system. Different electrode materials have different response characteristics; the details of the electrode sensor array used in the proposed model are described in Table 1. Figure 4 shows a photograph of the electronic tongue hardware system we designed.

The micro-control unit of the electronic tongue hardware system is an STM32 controller, whose functions include generating scan signals, micro-current detection, noise filtering, signal sampling and signal storage. Specifically, the controller first drives the external circuit through DA conversion to generate different voltage waveforms, and the analog voltage waveforms are applied equivalently to the RE through the constant potential circuit under the feedback of the CE. Then, driven by the potential of the RE, a Faraday current is generated between the WE and the CE, and the micro-current detection circuit connected to the WE converts the Faraday current into a voltage signal. Next, the filter circuit eliminates clutter in the converted signal. Finally, the amplified voltage signal is converted to a digital voltage signal by the AD module, and the response signals from the different WE are stored. The two images in Figure 5 show typical sensor data for different kinds of tea and different electrodes, respectively. Figure 5a shows the responses of five kinds of tea sampled with the silver WE, and Figure 5b displays the responses of Tieguanyin tea from three different WE (gold, silver and palladium). It should be noted that for the same concentration of tea from the same brew, the three different WE work sequentially. In detail, the gold electrode is used to collect samples first, then it is replaced by the silver electrode, and finally the silver electrode is replaced by the palladium electrode. All the sampling signals are stored on the SD card.

Electronic Tongue Settings
As mentioned above, an e-tongue system consisting of three working electrodes, one reference electrode and one counter electrode was adopted for the experiment. The electrodes were polished with cloth and ground powder before the first measurement. Between any two measurements, the working electrode was placed in distilled water, cleaned with an electrochemical cleaning method for 1 min and dried with cloth. The counter electrode and the reference electrode were washed with distilled water and dried with filter paper.
A multi-frequency large amplitude pulse voltammetry (MLAPV) [34] method was utilized to generate multi-frequency and multi-scale scanning signals. Specifically, we use scanning signals at three frequencies, namely 1, 2 and 2.5 Hz. For each frequency, ten pulses are generated at voltages of 1.0, 0.8, 0.6, 0.4, 0.2, −0.2, −0.4, −0.6, −0.8 and −1.0 V. To avoid interference between signals of different frequencies in the reaction process, we stop the scanning signal for 5 s after the scan at each frequency is finished. In the sampling step, we set the sampling rate to 1 kHz to capture more detailed information. Thus, we collect 10,000 points at the 1 Hz scanning frequency, 5000 points at the 2 Hz scanning frequency and 4000 points at the 2.5 Hz scanning frequency. Figure 6 illustrates the scanning voltage and the corresponding typical response. Figure 6a displays the scanning voltage at the different frequencies, and Figure 6b shows the typical response. In Figure 6a, the red, green and blue histograms represent the scanning signals of 1, 2 and 2.5 Hz, respectively. In Figure 6b, the response to each scanning signal is drawn as a line of the same color as that scanning signal. The black horizontal line in Figure 6 represents the 5 s pause of the scanning signal.
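The point counts above follow directly from the sampling rate, the scanning frequency and the number of pulses, as the following sketch shows. It is an illustration under stated assumptions rather than the authors' firmware: all names are hypothetical, and the waveform shape (each pulse period holding its level for the first half and 0 V for the second half) is an assumption, since the exact pulse shape is not specified in the text.

```python
SAMPLING_RATE_HZ = 1000                      # 1 kHz, as stated in the text
SCAN_FREQUENCIES_HZ = [1.0, 2.0, 2.5]
PULSE_LEVELS_V = [1.0, 0.8, 0.6, 0.4, 0.2, -0.2, -0.4, -0.6, -0.8, -1.0]
REST_SECONDS = 5                             # pause after each frequency


def samples_per_frequency(freq_hz):
    """Number of sampled points for ten pulse periods at `freq_hz`."""
    samples_per_pulse = round(SAMPLING_RATE_HZ / freq_hz)
    return samples_per_pulse * len(PULSE_LEVELS_V)


def build_scan_signal(freq_hz):
    """Build one scan at `freq_hz`.

    Assumption: each pulse period is at the pulse level for its first
    half and at 0 V for its second half.
    """
    samples_per_pulse = round(SAMPLING_RATE_HZ / freq_hz)
    high = samples_per_pulse // 2
    signal = []
    for level in PULSE_LEVELS_V:
        signal.extend([level] * high)
        signal.extend([0.0] * (samples_per_pulse - high))
    return signal


counts = {f: samples_per_frequency(f) for f in SCAN_FREQUENCIES_HZ}

# Concatenate the three scans, each followed by the 5 s pause.
full_scan = []
for f in SCAN_FREQUENCIES_HZ:
    full_scan.extend(build_scan_signal(f))
    full_scan.extend([0.0] * (REST_SECONDS * SAMPLING_RATE_HZ))
```

With these settings `counts` reproduces the 10,000, 5000 and 4000 points reported for the 1, 2 and 2.5 Hz scans.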

Sample Preparation
Five kinds of tea picked from different provinces were used in the study: Yuzhu tea (Chongqing), Show bud tea (Chongqing), Tieguanyin tea (Quanzhou), Biluochun tea (Suzhou) and Westlake tea (Hangzhou). It is worth mentioning that Biluochun tea, Westlake tea and Show bud tea are all green teas. These five teas all contain nutrients such as tea polyphenols, catechins, chlorophyll, caffeine, amino acids and vitamins. In comparison, Yuzhu tea contains more vitamins, and Tieguanyin tea has higher levels of tea polyphenols, catechins, and various amino acids. In terms of flavor, these teas are fragrant, refreshing and a little bitter. For each kind of tea leaf, we weighed 1 g with a high-precision electronic scale and brewed it with 100 mL of boiling water for 10 min. Then, the tea leaves were filtered out through a sieve to obtain the original tea liquid. We brewed each type of tea 34 times, and each brew was sampled 10 times. Since electrode sampling is related to many factors, there are still slight differences among the ten groups, which can be understood as the diversity of the samples. It should be emphasized that these 34 brews spanned nearly three months. During the three months, we deliberately did not seal or refrigerate the tea; that is to say, changes (such as oxidation) occurred in the tea during these months, which guarantees the diversity of the samples.
In the experiment, we obtained three kinds of tea liquid (including 100%, 50% and 25% concentrations) by mixing the original tea liquid with water. In order to ensure the reliability of the experiment, we collected 340 samples for each concentration of tea liquid and the sum of samples for each working electrode is 5100 (five kinds of tea leaves × three concentrations per tea × 340 samples per concentration). As mentioned above, we used three kinds of working electrodes (gold, silver and palladium). In other words, we got 5100 × three samples totally. To speed up the convergence of the classifier, a normalization between (0, 1) was implemented for each sample under the same working electrode.
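As a quick check of the dataset size and an illustration of the normalization step, the following sketch may help. The text does not specify the normalization formula, so per-sample min-max scaling is assumed here; the function name and the example values are hypothetical.

```python
import numpy as np

N_TEAS = 5
N_CONCENTRATIONS = 3
SAMPLES_PER_CONCENTRATION = 340    # 34 brews x 10 samplings
N_ELECTRODES = 3                   # gold, silver and palladium

samples_per_electrode = N_TEAS * N_CONCENTRATIONS * SAMPLES_PER_CONCENTRATION
total_responses = samples_per_electrode * N_ELECTRODES
print(samples_per_electrode, total_responses)


def min_max_normalize(response: np.ndarray) -> np.ndarray:
    """Rescale one response to [0, 1]; a constant response maps to zeros.

    Assumed formula: the paper only states that samples are normalized.
    """
    lo, hi = response.min(), response.max()
    if hi == lo:
        return np.zeros_like(response, dtype=float)
    return (response - lo) / (hi - lo)


example = np.array([-2.0, 0.0, 1.0, 2.0])   # made-up response values
normalized = min_max_normalize(example)
print(normalized)
```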

Software Platform
The features extraction and classification experiments of CNN-AFE are carried out on a single server with 32 GB RAM, an NVIDIA GeForce GTX Titan GPU (NVIDIA, Santa Clara, CA, USA) and a 64-bit Linux operating system (Canonical, Isle of Man, UK). The model is built with the PyTorch framework (Facebook, Menlo Park, CA, USA), and the programming language is Python.

Features Extraction and Classification Methods
In the electronic tongue system, each sensor response consists of a large number of measurements, so features extraction for each response is particularly challenging. In addition, the validity of the extracted features directly affects the accuracy of pattern recognition. It is generally believed that the features hidden in a time series include time domain and transform domain characteristics. Time domain features are relatively intuitive, such as the magnitude of the response signal and the location of mathematical feature points. Transform domain features are more complex, such as features from the frequency domain or from matrix decomposition. It should be noted that most traditional methods rely on manual features extraction, which functions similarly to a filter. Manually designed extraction methods are difficult to adapt to various scenarios and lack stability.
To overcome this bottleneck in the field of e-tongue, we propose a novel features extraction method that learns features automatically through a deep learning model. The key idea is to transform the time series into a time-frequency map with an appropriate strategy, so as to make full use of the advantages of CNN in image features extraction and pattern recognition. The structure of the proposed features extraction method is shown in Figure 7, and the algorithm consists of two steps. The first step is to represent the complex time series as a feature image. The commonly used time-frequency representations of non-stationary signals are the Wigner-Ville Distribution (WVD) [35], the short-time Fourier transform (STFT) [36] and the Gabor transform [37]. Considering the transient, abrupt character of the response signal of the e-tongue system, STFT was adopted to extract the time-frequency characteristics. The second step is to apply a CNN that automatically learns the details hidden in the time-frequency map and performs pattern recognition. Combining STFT with CNN in this way yields an automatic features extraction architecture. In other words, the proposed CNN-based method can be considered a combination of a variety of representative, or unknown but effective, features extraction methods. We also want to emphasize that the model unifies the features extraction and classification steps under a single architecture, with the aim of improving classification accuracy and the robustness of the algorithm.

Short-Time Fourier Transform (STFT)
STFT is based on the Fourier transform and has been widely adopted to analyze signals jointly in time and frequency [36]. As shown in Figure 7, to balance computational complexity against consistency between adjacent windowed signals, we use partial overlap between consecutive windows. The sliding window length is Ws and the movement distance of each window is L, where L is typically equal to Ws/2.
In detail, given a sample s, s is multiplied by a window function that is non-zero only for a short time. As the window slides along the time axis, the Fourier transform of each windowed segment is computed, resulting in a two-dimensional representation of the signal. In mathematics, this is written as:
$$X(m, \omega) = \sum_{n=-\infty}^{\infty} s(n)\, w(n - m)\, e^{-j\omega n}$$
where s(n) is the sample, w(n) is the window function, m is the time position of the window and ω is the angular frequency. Thus, the one-dimensional sample s is transformed into a two-dimensional signal X containing the time and frequency characteristics. Since X is a two-dimensional complex-valued array, we convert X into a two-dimensional features image by taking the square of the magnitude.
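The windowing, transform and squared-magnitude steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name, the decision to drop incomplete trailing frames and the synthetic 50 Hz test signal are all assumptions.

```python
import numpy as np


def stft_power(signal: np.ndarray, window: np.ndarray) -> np.ndarray:
    """Squared-magnitude STFT with a hop of half the window length.

    Returns an array of shape (n_frequency_bins, n_frames).
    """
    ws = len(window)
    hop = ws // 2                                   # L = Ws / 2
    n_frames = 1 + (len(signal) - ws) // hop        # only frames that fit entirely
    frames = np.stack([signal[i * hop: i * hop + ws] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)  # one spectrum per window position
    return (np.abs(spectrum) ** 2).T


fs = 1000                                           # 1 kHz sampling rate
t = np.arange(4000) / fs
s = np.sin(2 * np.pi * 50 * t)                      # synthetic test tone, not e-tongue data
image = stft_power(s, np.hanning(256))
print(image.shape)

# Frequency (Hz) of the strongest bin, averaged over time.
peak_hz = int(image.mean(axis=1).argmax()) * fs / 256
print(peak_hz)
```

With a window of 256 samples the frequency resolution is fs/256, about 3.9 Hz per bin, so the peak of the test tone lands on the bin closest to 50 Hz.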
In this study, different kinds of window function and window sizes have been applied to extract time-frequency features. Detailed parameter settings and corresponding experimental results are presented in Section 3.

Convolutional Neural Network (CNN)
CNN is a type of deep feedforward artificial neural network widely used to analyze visual images. Unlike a traditional artificial neural network, the neurons of one CNN layer are not fully connected to the neurons of the next layer, but only locally connected. Furthermore, the weights of each convolution kernel are shared. In general, CNN has the following advantages in image processing: (1) shared convolution kernels, which allow high-dimensional data to be processed quickly; (2) model training replaces manual features extraction and improves stability; (3) excellent classification performance.
The structure of a CNN varies with the application scenario, but several modules are common to these models. For example, typical models for classification problems are AlexNet [27], GoogLeNet [38] and ResNet [39], and all three include a features extraction stage and a features mapping and classification stage. The features extraction stage consists of multiple convolution layers, activation layers and pooling layers. The convolution operation extracts features, the activation layers increase the nonlinearity of both the decision function and the whole network, and the pooling layers progressively reduce the spatial size of the data, which helps to control over-fitting. In the features mapping and classification stage, a loss function penalizes the difference between the predicted and actual results. In addition, to improve the generalization ability of the network and prevent over-fitting, we apply a dropout layer in the structure and add a regularization constraint to the loss function.
As illustrated in Figure 7, a multi-layer filter structure is used in the proposed method. The parameters of each filter are determined during training. To summarize, three convolutional layers of different sizes, the Rectified Linear Unit (ReLU) activation function and max-pooling layers are applied in the proposed model. Moreover, a fully connected layer is used for features mapping and classification. Detailed parameter settings of the CNN and the corresponding experimental results are discussed in Section 3.

Results and Discussion
To evaluate the effectiveness and robustness of the proposed model, Section 3.1 discusses three parts: time-frequency features extraction, network structure and network parameter optimization. In Section 3.2, we compare the proposed method with the best methods known to us. Specifically, in Section 3.1, we first discuss window functions of different types and sizes when applying STFT to convert samples into time-frequency features. We then discuss the structure of the CNN. Finally, we optimize the network parameters, including regularization, batch size and loss function type. All experimental results in this section are based on a tea database collected by an e-tongue system we designed. The code of the proposed model is available at https://github.com/Shunzhange/auto-feature-extraction-method-for-e-tongue.

Time-Frequency Features Extraction
To determine the appropriate window function and window size, the CNN parameters are first set to default values: dropout is 0.5, the regularization function is L2, the batch size is 64 and the loss function is cross entropy loss [40]. In the validation of time-frequency features extraction, different types (Hann, Hamming and Gaussian windows) [41] and sizes (64, 128, 256 and 512) of window functions were tested. We chose these three window functions because they are the most widely used, and the window sizes were chosen based on the relationship between the scan signal period and the sampling frequency. Figure 8 shows the average classification accuracy of the four window sizes for each of the three window functions, with 100 training iterations and five-fold cross-validation as the evaluation approach. In Figure 8a, the best average classification accuracy reaches nearly 99.5% with the Hann window and a window size of 256. For the Hamming window (Figure 8b), the best average classification accuracy of 99.8% is obtained with a window size of 128. For the Gaussian window (Figure 8c), the best average classification accuracy is 98% with a window size of 256. In all three figures, the blue and green lines are always in the lower position, which shows that the optimal window size should be neither too large nor too small. For the blue lines, we conclude that the window is too small to cover even half of the response to a scanning pulse, which leads to ineffective features extraction. Conversely, when the window size is 512, we infer that a large amount of redundant information is collected and the effective features are difficult to exploit. In addition, large windows reduce the time resolution of the short-time Fourier transform.
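To make the relationship between window size and pulse period concrete, the sketch below builds the three window types and computes what fraction of one scanning pulse period each window size spans at the 1 kHz sampling rate. It assumes one pulse per scan period (consistent with the point counts given earlier); the Gaussian width parameter is not given in the text, so `sigma_fraction` is a purely illustrative value.

```python
import numpy as np

SAMPLING_RATE_HZ = 1000
SCAN_FREQUENCIES_HZ = [1.0, 2.0, 2.5]
WINDOW_SIZES = [64, 128, 256, 512]


def gaussian_window(size: int, sigma_fraction: float = 0.125) -> np.ndarray:
    """Gaussian window; sigma_fraction (std / size) is an assumed value."""
    n = np.arange(size) - (size - 1) / 2
    return np.exp(-0.5 * (n / (sigma_fraction * size)) ** 2)


def make_window(name: str, size: int) -> np.ndarray:
    """Return one of the three window types compared in the experiments."""
    if name == "hann":
        return np.hanning(size)
    if name == "hamming":
        return np.hamming(size)
    if name == "gaussian":
        return gaussian_window(size)
    raise ValueError(f"unknown window: {name}")


# Fraction of one pulse period (fs / f samples) spanned by each window size.
coverage = {
    ws: {f: ws / (SAMPLING_RATE_HZ / f) for f in SCAN_FREQUENCIES_HZ}
    for ws in WINDOW_SIZES
}
for ws, row in coverage.items():
    print(ws, {f: round(c, 3) for f, c in row.items()})
```

Under these assumptions a 64-point window spans at most 16% of a pulse period, while a 512-point window exceeds a full period at 2 and 2.5 Hz.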

Network Structure
In general, a network must be sufficiently deep to extract features comprehensively. However, taking the actual application scenario into account, we hold the view that a complex network structure is unnecessary. The reason is that the features in a time-frequency map are not as rich as those in natural images. The experimental results confirm this conjecture: applying a deeper network structure does not improve performance but increases the time overhead. Based on these considerations and experimental verification, we choose a network structure containing three convolutional layers. In detail, convolutional layer 1 uses a 3 × 3 kernel with 3 input channels and 32 output channels; convolutional layer 2 uses a 3 × 3 kernel with 32 input channels and 64 output channels; and convolutional layer 3 uses a 3 × 3 kernel with 64 input channels and 64 output channels. The activation function is the Rectified Linear Unit (ReLU) [42] and the pooling function is max pooling. Finally, the fully connected layer maps N × 64 inputs (N is determined by the window size) to 64 outputs for features mapping.
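A PyTorch sketch of a network with these layer sizes is given below for orientation. Several details are not stated in the text and are assumptions here: the padding of 1, a 2 × 2 max pooling after every convolution, the position of the dropout layer, the final 64-to-5 output layer and the 128 × 144 input size. The class name is hypothetical, and the released code may differ.

```python
import torch
from torch import nn


class ShallowCNN(nn.Module):
    """Three 3x3 conv layers (3->32->64->64) with ReLU and max pooling."""

    def __init__(self, input_hw, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Infer the flattened size (N x 64 in the text) from a dummy input,
        # since it depends on the time-frequency map size.
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 3, *input_hw)).flatten(1).shape[1]
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                 # dropout rate from the default settings
            nn.Linear(flat, 64), nn.ReLU(),  # N x 64 -> 64 features mapping
            nn.Linear(64, n_classes),        # assumed output layer: five teas
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = ShallowCNN(input_hw=(128, 144))      # hypothetical map size
logits = model(torch.randn(4, 3, 128, 144))  # batch of 4 random inputs
print(logits.shape)
```

Inferring the flattened size from a dummy forward pass keeps the same class usable for every window size, since the window size changes the height and width of the input map.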

Network Parameter Optimization
To optimize the network parameters, we conducted comparative experiments on batch size, regularization options and loss function. Based on the experimental results of the time-frequency features extraction section, the window function is fixed to Hann and the window size is fixed to 256. In this part, the number of training iterations is again 100 and the evaluation approach is still five-fold cross-validation.
As shown in Table 2, the batch sizes are 16, 32 and 64, and the loss functions are cross entropy loss (abbreviated as C) [40] and multi-class hinge loss (abbreviated as H) [43]. In the L2 Regularization column, 'Yes' means 'with regularization' and 'No' means 'without regularization'. The experimental results show that the test accuracy of the method with the regularization term is over 99.5% regardless of the batch size and the loss function. It is worth mentioning that the accuracy is 99.92% when the batch size is 32 and the loss function is cross entropy loss. Therefore, we believe that the regularization term is necessary. In terms of loss function, cross entropy loss performs better than multi-class hinge loss overall.
After parameter optimization, the trained model parameters can be saved. When the model is later applied to a new set of data, it can predict the tea class with the original tea sample as input.
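For readers unfamiliar with the two loss functions compared in Table 2, the NumPy sketch below shows one common form of each, together with an L2 penalty term. The exact hinge variant and the regularization strength used in the experiments are not stated, so the summed-margin form and the `weight_decay` value are assumptions, and the example scores are made up.

```python
import numpy as np


def cross_entropy_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean softmax cross entropy over a batch of raw class scores."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def multiclass_hinge_loss(logits: np.ndarray, labels: np.ndarray,
                          margin: float = 1.0) -> float:
    """Batch mean of sum over j != y of max(0, margin + s_j - s_y)."""
    rows = np.arange(len(labels))
    true_scores = logits[rows, labels][:, None]
    margins = np.maximum(0.0, margin + logits - true_scores)
    margins[rows, labels] = 0.0                            # skip the true class
    return float(margins.sum(axis=1).mean())


def l2_penalty(weights, weight_decay: float = 1e-4) -> float:
    """L2 regularization term added to the loss (weight_decay is assumed)."""
    return float(weight_decay * sum((w ** 2).sum() for w in weights))


logits = np.array([[2.0, 0.5, 0.0], [0.0, 0.0, 0.0]])     # made-up scores
labels = np.array([0, 2])
ce = cross_entropy_loss(logits, labels)
hinge = multiclass_hinge_loss(logits, labels)
print(ce, hinge)
```

The hinge loss is zero once the true class beats every other class by the margin, whereas cross entropy keeps rewarding more confident correct predictions.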

Comparison with Other Techniques
In this part, we compare the proposed method with other state-of-the-art techniques. In traditional methods, features extraction and classification are usually two separate parts, whereas CNN-AFE is an approximately end-to-end model that integrates features extraction and classification into a single architecture. The comparison methods are introduced as follows.
For features extraction, the comparison methods are features of raw response [15], peak-inflection point [10,16], discrete wavelet transform (DWT) [17], discrete cosine transform (DCT) [18], singular value decomposition (SVD) [19] and DCT fused with SVD (DCT + SVD) [20]. For pattern recognition, several classifiers, namely support vector machine (SVM) [9,12], k-nearest neighbor (k-NN) [13] and random forest (RF) [14], were compared with the proposed method. After parameter optimization, the best classification accuracy of each comparison algorithm is shown in Table 3; the experimental results (average accuracy ± standard deviation) are based on five-fold cross-validation. The SVD features extraction method performs relatively well (about 98.8%), while the raw response features perform relatively poorly (about 80%). It is noteworthy that the test accuracy of CNN-AFE with the L2 regularization term is over 99.5% regardless of the batch size and the loss function, and the highest average recognition rate reaches 99.9%, which demonstrates that CNN-AFE provides a more effective solution to the tea classification problem. Further, we believe that the proposed method can make valuable contributions in the field of liquid classification and identification.
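The evaluation protocol (five-fold cross-validation reported as average accuracy ± standard deviation) can be sketched as follows, using a small k-NN classifier as a stand-in for the comparison classifiers. The data here are synthetic, and the fold shuffling, the seed, k = 3 and the feature dimension are illustrative assumptions rather than the settings behind Table 3.

```python
import numpy as np


def five_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs from a shuffled, near-even split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, folds[i]


def knn_predict(train_x, train_y, test_x, k: int = 3):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in train_y[nearest]])


# Synthetic stand-in for extracted features: five well-separated classes.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(5), 40)
x = rng.normal(size=(200, 8)) + 6.0 * y[:, None]

accuracies = []
for train_idx, test_idx in five_fold_indices(len(y)):
    pred = knn_predict(x[train_idx], y[train_idx], x[test_idx])
    accuracies.append(float((pred == y[test_idx]).mean()))

print(f"{np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```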
The features obtained by both the proposed method and the traditional methods are high-dimensional. Therefore, we applied principal component analysis (PCA) to compress the features and inspect their distribution. As shown in Figure 9, red diamonds, blue tones, blue-green squares, black circles and green plus signs represent Yuzhu tea, Show bud tea, Biluochun tea, Westlake tea and Tieguanyin tea, respectively. Figure 9a-f show the two-dimensional features distribution for features of raw response, peak-inflection point, discrete wavelet transform (DWT), discrete cosine transform (DCT), singular value decomposition (SVD) and CNN-AFE, respectively. Compared visually with the other methods, in the PCA plot of the proposed method different types of tea are more widely separated and samples of the same type are more compactly clustered, which explains the improvement in classification accuracy.
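The two-dimensional projection used for this kind of plot can be sketched with a plain SVD-based PCA. The feature matrix below is random and only stands in for real extracted features; the function name and dimensions are hypothetical.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of `features` onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))     # synthetic stand-in for features
points_2d = pca_project(features)
print(points_2d.shape)
```

Each row of `points_2d` gives the coordinates of one sample in the plane of the first two principal components, which is what a scatter plot such as Figure 9 displays.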

Conclusions
In this study, a CNN-based auto features extraction strategy in the e-tongue system for tea classification was proposed. The proposed CNN structure integrates automatic features extraction and classification into a unified architecture, which makes the classification more robust. A classification accuracy of 99.9% was obtained on a tea database collected by an e-tongue system we designed. In conclusion, compared with the reference techniques, the proposed model has several advantages: (1) 99.9% classification accuracy; (2) an easy, automatic features extraction method instead of manual features selection; (3) an approximately end-to-end features extraction and classification structure that makes the system more robust and accurate, which will help extend the e-tongue to more applications in the future.

Conflicts of Interest:
There are no conflicts to declare.