Nondestructive Testing Model of Tea Polyphenols Based on Hyperspectral Technology Combined with Chemometric Methods

Abstract: Nondestructive detection of tea's internal quality is of great significance for the processing and storage of tea. In this study, hyperspectral imaging technology is adopted to quantitatively detect the content of tea polyphenols in Tibetan teas by analyzing the features of the tea spectrum in the wavelength range from 420 to 1010 nm. The samples are divided with the sample set partitioning based on joint x-y distances (SPXY) and Kennard-Stone (KS) algorithms, while six algorithms are used to preprocess the spectral data. Six other algorithms, Random Forest (RF), Gradient Boosting (GB), Adaptive Boosting (AdaBoost), Categorical Boosting (CatBoost), LightGBM, and XGBoost, are used to carry out feature extraction. Then, based on a stacking combination strategy, a new two-layer combination prediction model is constructed and compared with four individual regressor prediction models: RF Regressor (RFR), CatBoost Regressor (CatBoostR), LightGBM Regressor (LightGBMR) and XGBoost Regressor (XGBoostR). The experimental results show that the newly built Stacking model predicts more accurately than the individual regressor prediction models. The coefficients of determination $R^2_c$ and $R^2_p$ for the prediction of Tibetan tea polyphenols are 0.9709 and 0.9625, and the root mean square errors RMSEC and RMSEP are 0.2766 and 0.3852 for the new model, respectively, which shows that the content of Tibetan tea polyphenols can be determined with precision.


Introduction
Tea is one of the three most popular non-alcoholic beverages in the world. Tea polyphenols are an important component of tea and a vital source of bioactive chemicals, with antioxidant, anti-cancer, antibacterial, anti-inflammatory and anti-arteriosclerotic properties [1][2][3], and they play an important role in the medical and food industries. In addition, the content of tea polyphenols is correlated with the quality of tea [4]. High-quality tea is not only beneficial to human health but also sells at a much higher price in the market. The traditional detection methods for tea polyphenols are mainly physical or chemical [5][6][7]; they are costly, complicated, time-consuming and vulnerable to subjective factors [8]. Therefore, it is of great significance to develop a fast and nondestructive online detection technology for tea polyphenols. Hyperspectral imaging acquires image data in a large number of narrow spectral bands. It combines imaging technology with spectral technology to capture the two-dimensional spatial information and one-dimensional spectral information of the target, yielding continuous, narrow-band data with high spectral resolution. As a new generation of photoelectric detection technology, hyperspectral imaging can be adopted in this field for its low cost, speed and reliability, and because it leaves the samples intact during the test.
Near-infrared (NIR) spectroscopy is an optical detection method known for its speed and for requiring no direct contact with the samples [9]. It has been used in quality inspections of many agricultural products [10][11][12]. Wang et al. [13] established a pear juiciness detection model at 650-1100 nm, with an external verification determination coefficient of 0.93 and a root mean square error of 0.97%. Pennisi et al. [14] established freshness models of different species of fish based on near-infrared spectroscopy technology. Jens et al. [15] designed a potato dry matter content detection model based on NIR technology. These studies show that the use of spectroscopy technology to detect the quality of agricultural products is feasible.
With the development of spectral technology, image analysis has been added to spectroscopy [16], and hyperspectral imaging technology has emerged. Compared with multispectral images, hyperspectral images contain richer image and spectral information [17]. At present, the use of hyperspectral technology to detect agricultural product quality is still in its infancy, and there has been only a small amount of research on agricultural-product quality detection based on hyperspectral technology [18,19]. However, as a fast and nondestructive detection technology, hyperspectral imaging has great application prospects.
The selection of spectral characteristic bands is another important factor affecting the model results. Effective selection of characteristic bands can save computing resources [25] and improve model performance. In recent years, researchers have proposed many characteristic band selection methods, such as interval partial least squares (iPLS) [26,27], synergy interval partial least squares (siPLS) [28,29] and backward interval partial least squares (biPLS) [30][31][32]. These feature-selection algorithms divide all features into several intervals and then iteratively select a small number of well-performing intervals as the characteristic bands. However, because features are selected in whole intervals, this "bundling" approach is likely to miss some important individual features.
To avoid the bias introduced by manual data splitting, several computational methods can be used for sample selection, such as random selection (RS), the Kennard-Stone (KS) algorithm [33,34], or the sample set partitioning based on joint x-y distances (SPXY) algorithm [35][36][37].
The purpose of this research is to explore the feasibility of fast and nondestructive online detection of Tibetan tea polyphenol content based on hyperspectral image technology. Different data preprocessing methods are used to process the acquired hyperspectral data of Tibetan tea. This paper selects the best preprocessing method by establishing the model and analyzing the modeling results.

Samples
A total of three grades of Ya'an Tibetan tea were selected for the test, including 32 samples of the first grade, 33 of the second grade and 37 of the third grade. Each group of samples was individually packaged in a sealed plastic bag and stored in a 5 °C thermostat for the subsequent determination of spectral data and tea polyphenol physicochemical data. The measurement process of tea polyphenol content is as follows.
(a) Mother liquor: The milled tea (0.6 g) and 5 mL of 70% methanol solution were placed in a 10 mL centrifuge tube and shaken. After bathing at 70 °C for 10 min, the tube was removed, allowed to cool, and then centrifuged for 10 min at 3500 r/min, and the supernatant was collected. The precipitate was extracted again following the same procedure. The collected supernatants were combined, diluted to 10 mL with 70% aqueous methanol, and filtered through a 0.45 µm filter.
(b) Test solution: 1 mL of mother liquor (a) was added into a 100 mL volumetric flask, diluted to 100 mL with distilled water, and shaken well.
(c) Gallic acid working solution: 1.0, 2.0, 3.0, 4.0 and 5.0 mL of gallic acid standard solution (1000 µg/mL) were added into five 100 mL volumetric flasks, diluted with distilled water to 100 mL, and shaken well, yielding five working solutions with concentrations of 10, 20, 30, 40 and 50 µg/mL.

2. Determination of the content of tea polyphenols.
A total of 1.0 mL each of the gallic acid working solutions (c), distilled water and the test solution (b) was added into separate graduated test tubes. A total of 5.0 mL of Folin phenol reagent (concentration 10%) was added to each test tube. After 4 min, 4.0 mL of 7.5% sodium carbonate (Na2CO3) solution was added, and water was then added to bring the volume to the mark. The mixture was then stored at room temperature for 60 min. The absorbance ($A$, $A_0$) was measured by a spectrophotometer at a wavelength of 765 nm with a 10 mm cuvette. The standard curve was prepared from the absorbance of the gallic acid working solutions and the concentration of gallic acid in each working solution. By comparing the absorbance of the sample and the standard working solutions, the content of tea polyphenols was calculated as follows:
$$c = \frac{(A - A_0) \times V \times d \times 100}{S_{std} \times \omega \times m \times 10^{6}} \tag{1}$$
where $c$ (%) is the content of tea polyphenols (percentage of tea polyphenols in the dry matter of tea), $A$ represents the absorbance of the sample test solution, $A_0$ is the absorbance of the blank reagent solution, $V$ (mL) is the volume of sample extract, $d$ is the dilution factor (100 here), $S_{std}$ represents the slope of the gallic acid standard curve, $\omega$ is the dry matter content of the sample (the mass of the tea sample after drying as a fraction of its mass before drying), $m$ (g) is the mass of the sample, and the factor $10^{6}$ converts µg to g. The measurement results and the sample division results based on the SPXY algorithm (see Section 3.1) are shown in Table 1. The Tibetan tea polyphenol data from this test are used as the reference values in the subsequent modeling. Note: "%" is the percentage of tea polyphenols in the dry matter of tea; the same applies below.
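As a worked illustration of the content calculation described above, the following Python sketch fits the slope of the gallic acid standard curve by least squares and converts an absorbance reading into a polyphenol percentage. It assumes the usual Folin phenol relation c = (A - A0) × V × d × 100 / (S_std × ω × m × 10^6), with ω expressed as a mass fraction. The absorbance readings, the dry matter fraction and the function names are hypothetical; only V = 10 mL, d = 100, the 0.6 g sample mass and the 10-50 µg/mL standards come from the procedure.

```python
def fit_slope(concentrations, absorbances):
    """Least-squares slope of absorbance versus concentration (µg/mL)."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(absorbances) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, absorbances))
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    return sxy / sxx


def polyphenol_percent(a, a0, slope, volume_ml=10.0, dilution=100.0,
                       dry_fraction=1.0, mass_g=0.6):
    """Tea polyphenol content as a percentage of dry matter."""
    # Gallic acid equivalents in the whole extract, in micrograms.
    micrograms = (a - a0) / slope * volume_ml * dilution
    # Convert µg to g, divide by the dry sample mass, express as a percentage.
    return micrograms / 1e6 / (mass_g * dry_fraction) * 100.0


# Hypothetical absorbance readings for the 10-50 µg/mL gallic acid standards.
std_conc = [10.0, 20.0, 30.0, 40.0, 50.0]
std_abs = [0.12, 0.23, 0.34, 0.45, 0.56]
slope = fit_slope(std_conc, std_abs)

# Hypothetical sample reading, blank reading and dry matter fraction.
content = polyphenol_percent(a=0.50, a0=0.01, slope=slope, dry_fraction=0.93)
print(f"slope = {slope:.4f}, content = {content:.2f} %")
```

Keeping the slope fit separate from the conversion means one standard curve can be reused for every sample measured in the same batch.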

Hyperspectral Image Acquisition
The hyperspectral data of the Tibetan tea samples were acquired using a GaiaSorter hyperspectral sorter made by Beijing Zolix Company, which provides an effective spectral band of 387-1035 nm, a spectral resolution of 2.8 nm and 256 spectral channels. The tea leaves were spread evenly into a square in a container (about 65 cm × 65 cm). The hyperspectral acquisition system is shown in Figure 1. Due to the influence of dark currents at the beginning and end of the spectral band, only the 420-1010 nm band was retained as raw spectral data. The sample platform was set to move at a speed of 4.0 mm·s⁻¹, the distance to the imaging object was 170 mm, and the camera exposure time was set to 16 ms. The tea to be tested was placed on the stage. Under the illumination of a uniform light source, the platform moved horizontally at the set speed, and the hyperspectral camera obtained continuous hyperspectral images of the samples on the platform. The acquired images were then calibrated using Equation (2):
$$I = \frac{I_{raw} - I_b}{I_w - I_b} \tag{2}$$
where $I$ is the corrected image, $I_{raw}$ represents the raw image, $I_b$ is the standard black image and $I_w$ represents the standard white image. ENVI 5.1 software was used to calculate the average spectral value of the region of interest (151 × 151 pixels) in the hyperspectral image.
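The calibration and region-of-interest averaging can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard black/white correction I = (I_raw - I_b) / (I_w - I_b) applied band by band; the function names and the digital numbers are hypothetical, and the authors performed the averaging step in ENVI rather than in code like this.

```python
def calibrate(raw, white, dark):
    """Black/white correction of one spectrum: (raw - dark) / (white - dark)."""
    if not (len(raw) == len(white) == len(dark)):
        raise ValueError("all spectra must have the same number of bands")
    corrected = []
    for r, w, b in zip(raw, white, dark):
        span = w - b
        # Guard against dead bands where both references coincide.
        corrected.append((r - b) / span if span != 0 else 0.0)
    return corrected


def mean_spectrum(spectra):
    """Average equally long spectra band by band (e.g. all ROI pixels)."""
    n = len(spectra)
    return [sum(band) / n for band in zip(*spectra)]


# Hypothetical digital numbers: three bands, two pixels of a region of interest.
dark = [100.0, 100.0, 100.0]
white = [4100.0, 4100.0, 4100.0]
roi_pixels = [[1100.0, 2100.0, 3100.0], [1300.0, 2300.0, 3300.0]]

reflectance = mean_spectrum([calibrate(p, white, dark) for p in roi_pixels])
print(reflectance)
```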

Hyperspectral Data Preprocessing
During spectral acquisition, the external environment, instrument response and other factors unrelated to the nature of the sample often generate random noise, which appears as disorderly fluctuations in the spectral data. Therefore, this article uses six preprocessing algorithms, namely Savitzky-Golay smoothing (SG), multiplicative scatter correction (MSC), standard normal variate transformation (SNVT), first derivative (FD), second derivative (SD) and Z-score standardization (ZSS), to eliminate the noise in the raw spectrum (RAW) data. Python 3.8 (Python Software Foundation) is adopted in all data processing and modeling.

Kennard-Stone (KS) Algorithm
The KS algorithm [33] regards all samples as candidate samples of a training set and first puts the two samples with the largest Euclidean distance between them into the training set. Then, for each remaining sample, the Euclidean distance to its nearest sample already in the training set is calculated, and the sample for which this minimum distance is largest is added to the training set. This step is repeated until the number of samples reaches the set value. The formula for calculating Euclidean distance is:
$$d_x(p, q) = \sqrt{\sum_{j=1}^{N} \left[ x_p(j) - x_q(j) \right]^2} \tag{3}$$
where $x_p$ and $x_q$ represent two different samples and $N$ represents the number of spectral bands.
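The selection procedure can be sketched as follows. This is a minimal pure-Python illustration of the standard max-min formulation of Kennard-Stone (start from the two farthest samples, then repeatedly add the candidate whose nearest selected sample is farthest away); the function names and the toy data are hypothetical and are not taken from the authors' code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def kennard_stone(spectra, n_select):
    """Return indices of n_select samples chosen by the max-min rule."""
    n = len(spectra)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = [[euclidean(spectra[i], spectra[j]) for j in range(n)]
            for i in range(n)]
    # Start with the two samples that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda pair: dist[pair[0]][pair[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Distance from each candidate to its nearest selected sample;
        # the candidate with the largest such distance is added next.
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy one-band "spectra"; real inputs would be the mean spectra of the samples.
toy = [[0.0], [1.0], [2.0], [5.0], [10.0]]
calibration_idx = kennard_stone(toy, 3)
print(calibration_idx)
```

Samples that are not selected form the prediction set. Because only spectral distances are used, samples with similar spectra but different polyphenol contents can end up concentrated in one set.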

Sample Set Partitioning Based on Joint X-Y Distances (SPXY)
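The SPXY algorithm is developed on the basis of the KS algorithm. When SPXY calculates the sample distance, the sample label (Y) and the sample feature (X) are taken into account at the same time. The specific calculation is as follows [36]:
$$d_y(p, q) = \sqrt{(y_p - y_q)^2} = |y_p - y_q| \tag{4}$$
$$d_{xy}(p, q) = \frac{d_x(p, q)}{\max_{p,q} d_x(p, q)} + \frac{d_y(p, q)}{\max_{p,q} d_y(p, q)} \tag{5}$$
where $d_x(p, q)$ represents the spectral distance, $d_y(p, q)$ represents the chemical measurement value distance, and $y_p$ and $y_q$ are the measured values of samples $p$ and $q$. Each distance is divided by its maximum over the data set so that the spectral and chemical distances carry equal weight.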
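A minimal sketch of the SPXY split is given below. It assumes the commonly used joint distance, in which the spectral distance and the absolute difference of the measured values are each divided by their maximum and then summed, and it reuses the same max-min selection rule as KS. The function names and toy values are hypothetical.

```python
import math


def pairwise(values, metric):
    """Symmetric matrix of metric(values[i], values[j])."""
    n = len(values)
    return [[metric(values[i], values[j]) for j in range(n)] for i in range(n)]


def spxy_split(spectra, targets, n_calibration):
    """Split sample indices into (calibration, prediction) lists."""
    n = len(spectra)
    if len(targets) != n or not 2 <= n_calibration <= n:
        raise ValueError("inconsistent inputs")
    dx = pairwise(spectra, lambda a, b: math.sqrt(
        sum((u - v) ** 2 for u, v in zip(a, b))))
    dy = pairwise(targets, lambda a, b: abs(a - b))
    # Normalise each distance by its maximum; fall back to 1.0 if all are zero.
    max_dx = max(map(max, dx)) or 1.0
    max_dy = max(map(max, dy)) or 1.0
    dxy = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
           for i in range(n)]
    # Same max-min selection as Kennard-Stone, applied to the joint distance.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda pair: dxy[pair[0]][pair[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_calibration:
        best = max(remaining, key=lambda i: min(dxy[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Toy data: one-band spectra and hypothetical polyphenol contents (%).
toy_spectra = [[0.0], [1.0], [2.0], [3.0]]
toy_content = [5.0, 9.0, 5.5, 6.0]
cal_idx, pred_idx = spxy_split(toy_spectra, toy_content, 3)  # 3:1 split
print(cal_idx, pred_idx)
```

In the toy data, sample 1 is spectrally unremarkable but has an extreme content value, so the joint distance selects it early; a purely spectral KS split would not prioritise it.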

Feature Selection and Modeling
The acquired hyperspectral data often contains a lot of redundant information, which affects the accuracy and efficiency of the final modeling. Six methods [38][39][40][41], Gradient Boosting (GB), Adaptive Boosting (AdaBoost), Random Forest (RF), Categorical Boosting (CatBoost), LightGBM and XGBoost, are used to select hyperspectral feature bands. Random forest regression (RFR), categorical boosting regression (CatBoostR), LightGBM regression (LightGBMR), XGBoost regression (XGBoostR) and the stacking model integration strategy are used for modeling. Stacking is a combined model that trains the base-learners on the initial data set and then uses the predicted values of the base-learners as new features to train the meta-learner.
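The two-layer structure of stacking can be sketched with scikit-learn's `StackingRegressor`. This is an illustration under stated assumptions, not the authors' implementation: scikit-learn tree ensembles stand in for the LightGBM, XGBoost and CatBoost regressors so that the example needs no extra packages, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)

# Synthetic stand-in for spectra: 80 samples x 30 bands, two informative bands.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
y = 8.0 + 0.6 * X[:, 3] - 0.4 * X[:, 17] + rng.normal(scale=0.05, size=80)
X_cal, X_val, y_cal, y_val = X[:60], X[60:], y[:60], y[60:]

# Layer 1: base-learners. Their cross-validated predictions become the
# features seen by the meta-learner.
base_learners = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
]

# Layer 2: the meta-learner is trained only on the base-learner predictions
# (passthrough is False by default, so the raw bands are not passed on).
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=GradientBoostingRegressor(n_estimators=50, random_state=0),
    cv=5,
)
stack.fit(X_cal, y_cal)

r2_val = stack.score(X_val, y_val)  # coefficient of determination
print(f"R2 on held-out samples: {r2_val:.3f}")
```

Using cross-validated predictions to train the meta-learner, rather than in-sample predictions, keeps the second layer from simply learning how strongly each base-learner overfits the calibration set.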

Model Reliability
Model evaluation takes the coefficient of determination ($R^2$) [42] and root mean square error (RMSE) [43] as evaluation criteria, calculated as shown in Equations (6) and (7):
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{6}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{7}$$
where $y_i$ and $\hat{y}_i$ are the measured value and predicted value of sample $i$, respectively, $\bar{y}$ represents the average of the measured values and $n$ is the number of samples. The closer the predicted value ($\hat{y}_i$) is to the true value ($y_i$), the better the model; in other words, a good model has a small RMSE (the closer to 0, the better). Likewise, models with high $R^2$ values are better than models with low $R^2$ values (the closer to 1, the better). At the same time, the smaller the difference in the determination coefficient between the calibration set and the independent test set, the better. If the gap is too large, it indicates that the model is under-fitting or over-fitting.
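Both criteria are straightforward to compute directly from their standard definitions, as in the short sketch below. The function names and the measured and predicted values are hypothetical.

```python
import math


def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)


# Hypothetical polyphenol contents (%) and model predictions.
measured = [7.0, 8.0, 9.0, 10.0]
predicted = [7.2, 7.9, 9.1, 9.8]
r2 = r2_score(measured, predicted)
err = rmse(measured, predicted)
print(f"R2 = {r2:.4f}, RMSE = {err:.4f}")
```

Applying the same two functions to the calibration set and to the prediction set gives the paired values (calibration and prediction) whose gap is used to judge robustness.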

Spectral Preprocessing and Sample Division
During the collection of hyperspectral data, environmental factors introduce noise into the acquired spectra, which adversely affects the performance of the model. Therefore, the spectral data is preprocessed before modeling. Six methods, SG, MSC, SNVT, ZSS, FD and SD, are used to preprocess the spectral data of the tea samples. The division of the data set is also important for making the established model representative. This paper uses the KS and SPXY sample division algorithms to divide the 102 groups of samples into the calibration set and the prediction set at a ratio of 3:1.
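Three of the preprocessing transforms named above can be sketched in pure Python. The source does not state the SG window or polynomial order, so the 5-point quadratic filter used here is an assumption, as are the function names and the example spectrum. SNV is computed with the sample standard deviation, and the first derivative is a simple finite difference.

```python
import statistics

# Savitzky-Golay coefficients for a 5-point window and a quadratic fit.
SG_5PT_QUADRATIC = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def sg_smooth(spectrum, coeffs=SG_5PT_QUADRATIC):
    """Savitzky-Golay smoothing; points within half a window of an edge are kept."""
    half = len(coeffs) // 2
    out = list(spectrum)
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        out[i] = sum(c * v for c, v in zip(coeffs, window))
    return out


def snv(spectrum):
    """Standard normal variate: centre and scale a spectrum by its own statistics."""
    mu = statistics.mean(spectrum)
    sigma = statistics.stdev(spectrum)
    return [(v - mu) / sigma for v in spectrum]


def first_derivative(spectrum):
    """First-order finite difference between neighbouring bands."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


# Hypothetical noisy reflectance values for seven neighbouring bands.
spectrum = [0.52, 0.55, 0.51, 0.58, 0.60, 0.57, 0.63]
smoothed = sg_smooth(spectrum)
print([round(v, 4) for v in smoothed])
print([round(v, 4) for v in snv(spectrum)])
```

A quadratic SG filter reproduces any locally quadratic signal exactly, so it removes high-frequency noise while leaving broad absorption features largely intact.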
Gradient Boosting regression (GBR) is used to model and predict the raw data and preprocessed spectral data. The modeling results based on different preprocessing algorithms and different sample partitioning algorithms are shown in Figure 2. Models built on the data set divided by the KS algorithm are more prone to overfitting than those built on the data set divided by the SPXY algorithm, so the SPXY-GBR model is generally better than the KS-GBR model. Based on Figure 2, comparing the two data set partitioning methods and the six preprocessing algorithms, the models with the better effects are RAW-KS-GBR, SG-SPXY-GBR and SNVT-SPXY-GBR. The SG-SPXY-GBR model has the highest $R^2_p$ value of 0.9365, and its $R^2_c$ value also reaches 0.9563, with a small discrepancy between them. This shows that the model established with SG as the preprocessing algorithm and SPXY as the sample division method is not only accurate but also more robust. In summary, the SG algorithm is finally selected to preprocess the original hyperspectral data of Tibetan tea.
The original spectral characteristic curve (RAW) and the spectral characteristic curve after SG preprocessing are shown in Figure 3. The spectral curve after SG preprocessing (Figure 3b) is smoother than the raw spectral data (Figure 3a). Figure 3c,d show partial enlarged views corresponding to the red boxes in Figure 3a,b. The blue shaded part shows this clearly, indicating that the algorithm can effectively filter out noise.
The SPXY algorithm is selected to divide the calibration set and the test set. After the division, the statistical results of Tibetan tea polyphenol content are shown in Table 1. Figure 4 shows the prediction results of Tibetan tea polyphenol content by the GBR model after SG preprocessing and SPXY partitioning of the data set; the horizontal axis represents the actual measured value, and the vertical axis represents the predicted value of the established model.

Selection of Characteristic Bands of Tibetan Tea Hyperspectral Data
SG preprocessing reduces the noise in the data to a certain extent, but the data still contains a lot of information unrelated to the prediction of tea polyphenol content. If the number of spectral bands is not further reduced, the high-dimensional spectral data will affect the accuracy and robustness of the model. In this study, six algorithms, GB, AdaBoost, RF, CatBoost, LightGBM and XGBoost, have been used to select the top 30 Tibetan tea spectral characteristic bands. The final characteristic bands obtained are shown in Figure 5.
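Selecting the top-ranked bands from a set of importance scores can be sketched as follows. The importance values here are hypothetical; in practice they would come from a fitted tree ensemble, for example the `feature_importances_` attribute exposed by scikit-learn style estimators. The helper names are also assumptions made for illustration.

```python
def top_k_bands(wavelengths, importances, k=30):
    """Return the k wavelengths with the highest importance, best first."""
    if len(wavelengths) != len(importances):
        raise ValueError("one importance score is required per wavelength")
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return [wavelengths[i] for i in ranked[:k]]


def select_columns(spectra, wavelengths, chosen):
    """Keep only the columns of each spectrum that belong to chosen wavelengths."""
    chosen_set = set(chosen)
    keep = [i for i, w in enumerate(wavelengths) if w in chosen_set]
    return [[row[i] for i in keep] for row in spectra]


# Hypothetical importance scores for five bands (nm).
wavelengths = [420.0, 522.66, 564.55, 700.0, 1010.0]
importances = [0.05, 0.30, 0.40, 0.20, 0.05]

best = top_k_bands(wavelengths, importances, k=3)
reduced = select_columns([[1.0, 2.0, 3.0, 4.0, 5.0]], wavelengths, best)
print(best)
print(reduced)
```

The reduced matrix keeps the bands in their original wavelength order, which is convenient when the selected bands are later plotted against the full spectrum.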
The feature selection algorithms RF and CatBoost take the wavelength of 522.66 nm as the second most important feature, while XGBoost takes the band of 564.55 nm as the first feature, which only ranks fifth in the GB algorithm, fourth in the AdaBoost algorithm and seventh in the RF algorithm. The characteristic wavelengths extracted by the different algorithms are mostly distributed between 420 and 700 nm. The experimental results show that the characteristic wavelengths extracted by different algorithms differ but also have some bands in common. The features extracted by the above six feature extraction algorithms will be used as the input of the subsequent regression prediction algorithms.

Full-Band Modeling Results
The SG algorithm is used to preprocess the original spectral data, and the processed data is used for modeling and prediction. Table 2 shows the prediction results of the different individual models. Among them, the CatBoostR model is the most accurate, with $R^2_c$ and $R^2_p$ of 0.9578 and 0.9493, respectively. The RFR model performs worst, with a calibration-set coefficient of determination of only 0.9040.
Six groups of Tibetan tea spectral features are selected using the different feature extraction algorithms and used as inputs to the RFR, CatBoostR, LightGBMR and XGBoostR models. At the same time, based on the stacking combination strategy, RFR, LightGBMR and XGBoostR are used as three base-learners, and CatBoostR is used as the meta-learner to build a new predictive model (Stacking model). The Stacking model is shown in Figure 6. Table 3 shows the prediction results of the different models. Compared with the full-band modeling results, reducing the feature dimension does not reduce model performance; instead, both the modeling accuracy and the robustness improve to a certain extent. The prediction accuracy of the CatBoostR model is generally acceptable, with RMSEC lower than 0.35 and RMSEP lower than 0.45. The $R^2_c$ and $R^2_p$ of the XGBoost + CatBoostR model are 0.9744 and 0.9509, respectively, and the RMSEC and RMSEP are 0.2546 and 0.4084, respectively. The $R^2_c$ and $R^2_p$ of the LightGBM + CatBoostR model are 0.9753 and 0.9520, respectively, and the RMSEC and RMSEP are 0.2499 and 0.4035, respectively. The $R^2_c$ and $R^2_p$ of the CatBoost + CatBoostR model are 0.9697 and 0.9563, respectively, and the RMSEC and RMSEP are 0.2766 and 0.3852, respectively. The RMSEC and RMSEP values of the CatBoost + CatBoostR model are the closest to each other. Therefore, this model is considered the best among the four individual models.
Among the Stacking models built in this article, the one that takes the characteristic bands extracted by the CatBoost algorithm as input is the most effective, with $R^2_c$ of 0.9709, RMSEC of 0.2711, $R^2_p$ of 0.9625 and RMSEP of 0.3568. Its prediction accuracy is higher than that of the individual regressors. Figure 7a shows the prediction result of the CatBoost + Stacking model for the content of tea polyphenols in Tibetan tea. The horizontal axis represents the actual measured value, and the vertical axis represents the predicted value of the established model. Because few samples have a tea polyphenol content of about 7%, the SPXY algorithm did not allocate any test-set sample near this value. Therefore, in the data set divided by SPXY, the calibration-set sample with a tea polyphenol content of 7.2671% is moved to the test set, and the test-set sample with a content of 8.7892% is moved to the calibration set. Figure 7b shows the prediction results of the CatBoost + Stacking model on the data after this replacement: $R^2_c$ is 0.9686, RMSEC is 0.2833, $R^2_p$ is 0.9577 and RMSEP is 0.3703.

Discussions
A detection model of Tibetan tea polyphenols is established based on hyperspectral technology. The test results show that the spectral data preprocessing algorithm SG can effectively eliminate noise. Band selection can improve the prediction accuracy and robustness of the model.
The final modeling results show that the characteristic band selection method used in this study is effective. The 233 feature variables are reduced to 30, but the accuracy of the model does not decrease as a result. Generally speaking, the CatBoostR individual model performs better than the other individual models. The LightGBM + CatBoostR and XGBoost + CatBoostR models perform well on the calibration set. However, they do not perform well on the prediction set: the difference between their RMSEC and RMSEP is large, so their robustness is low. Among all the models, the CatBoost + Stacking model built in this paper is the most effective. The determination coefficients $R^2_c$ and $R^2_p$ are 0.9709 and 0.9625, respectively, and the RMSEC and RMSEP are 0.2711 and 0.3569, respectively. The data divided by the SPXY algorithm has no test sample with a tea polyphenol content of about 7%. In order to improve the credibility of the model, one sample is exchanged between the calibration set and the test set (see Section 3.3.2). The final result is slightly lower than the result before replacing the sample. This may be because there are few samples with a tea polyphenol content of about 7%, so the model has not been adequately trained in this range. Trained models and examples can be found here: https://github.com/luo-rochon/example_for_stacking_model (accessed on 25 June 2021).
Traditional detection methods for total tea polyphenols include ferrous tartrate colorimetry [44][45][46], potassium permanganate titration [45,47,48], Folin phenol colorimetry [45,49] and electrochemical methods [50], among which Folin phenol colorimetry is the most widely used. The principle of ferrous tartrate colorimetry [44] is that polyphenols react with ferrous tartrate at a certain pH value to form a blue-violet complex, which is quantified by spectrophotometry. This method has good reproducibility but requires a large sample size and a long measurement cycle. In addition, the detection result is slightly higher than the true value.
The potassium permanganate oxidation titration method [48,51] uses potassium permanganate to oxidize tea polyphenols, which fades the potassium permanganate solution. The decrease in absorbance is measured at the maximum absorption wavelength, from which the content of tea polyphenols can be determined indirectly. The method is convenient and does not require expensive equipment. However, potassium permanganate also oxidizes some non-polyphenolic substances, and the titration endpoint is difficult to judge, resulting in large measurement errors.
Folin phenol colorimetry [49,51] is a method commonly used worldwide for the determination of plant phenols. The -OH groups of tea polyphenols are easily oxidized, and the reaction mixture turns blue. The absorbance is measured at a wavelength of 765 nm. This method gives the most accurate measurement results, but the measurement process is more complicated, and the detection cost is relatively high.
Electrochemical analysis [50] is based on the electrochemical properties of substances in solution and the way they change: it exploits the relationship between electrical parameters, such as potential, conductivity, current and charge, and the concentration of the measured substance. This method is intuitive, sensitive, simple and rapid, has a wide determination range, and is not susceptible to interference from color, precipitation and other non-polyphenol organic compounds. However, the preparation and surface treatment of the electrode need further study, and a chemical buffer is also required, which increases the cost of a single measurement.
In summary, most traditional tea polyphenol detection methods rely to some degree on chemical reagents, which increases measurement costs and environmental pollution and also damages the sample. As a new detection technology, the biggest advantage of hyperspectral technology is that it can inspect agricultural products quickly, nondestructively and in real time. Its deficiency is that the test sensitivity is low, and the quantitative analysis of unknown samples must be realized by establishing a calibration model. The establishment of the calibration model is relatively complex and requires a large number of training samples. Finally, hyperspectral equipment is more expensive, but its reusability can make up for this defect.
At present, there is still a lack of tools and methods for the rapid and nondestructive determination of tea polyphenol content in the tea production process. In this study, only the total polyphenol content was predicted. In future research, we will explore the feasibility of tea polyphenol monomer detection based on this technology. The combination of hyperspectral technology and an integrated algorithm can be used for the online determination of Tibetan tea polyphenol content. At the same time, it provides a reference for the internal quality testing of other agricultural products.