This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Effects of the moisture content (MC) of tea on diffuse reflectance spectroscopy were investigated by integrated wavelet transform and multivariate analysis. A total of 738 representative samples, including fresh tea leaves, manufactured tea and partially processed tea were collected for spectral measurement in the 325–1,075 nm range with a field portable spectroradiometer. Then wavelet transform (WT) and multivariate analysis were adopted for quantitative determination of the relationship between MC and spectral data. Three feature extraction methods including WT, principal component analysis (PCA) and kernel principal component analysis (KPCA) were used to explore the internal structure of spectral data. Comparison of those three methods indicated that the variables generated by WT could efficiently discover structural information of spectral data. Calibration involving seeking the relationship between MC and spectral data was executed by using regression analysis, including partial least squares regression, multiple linear regression and least square support vector machine. Results showed that there was a significant correlation between MC and spectral data (

Tea is produced from the fresh buds of the tea plant through a series of physical and chemical reactions during processing. The processing steps are generally accompanied by large variations in moisture content (MC). For green tea there are three main steps: fixation, rolling and drying. Fixation applies high temperature to reduce enzyme activity, eliminate herbaceous odor components and evaporate some water. Drying dehydrates the tea to reduce MC and, through thermochemical reactions at high temperature, improves its smell and taste. The MC of tea therefore not only determines shelf life but also affects the physical and chemical reactions during processing, so measuring MC is an important task in producing high-quality tea [

The traditional way of accurately measuring MC is the gravimetric method, which takes several hours and cannot meet the requirements of real-time, on-line detection of MC in tea processing. Moreover, the gravimetric method reduces the quality of tea, so tea measured by this method usually has to be discarded.

Diffuse reflectance spectroscopy (DRS) measures the reflectance from the surface of study objects, but DRS does not involve exactly the surface, as most of the light is contributed by scattering centers beneath the surface. The reflectance attribute and its derivatives have been proven to be highly correlated with a number of physicochemical properties [

Spectra from modern high-throughput spectrometers often contain hundreds or thousands of data points, and Vis/NIR spectra are characterized by overlapping overtone and combination bands, so these bands may appear non-specific and poorly resolved. Multivariate analysis, such as principal component analysis (PCA), multiple linear regression (MLR), partial least squares regression (PLSR) and principal component regression (PCR), therefore plays a very important role in the analysis of spectral data. PCA, PLSR and PCR in particular are all based on orthogonal transformation, so they not only greatly reduce the complexity of modeling but also eliminate the adverse effects of multicollinearity among spectral variables. However, PCA, PLSR, PCR and MLR can only handle linear relationships between spectral data and composition concentration; nonlinear information can hardly be calibrated by these linear models [

Nowadays, nonlinear algorithms including kernel principal component analysis (KPCA), artificial neural network (ANN) and least squares support vector machine (LSSVM) are frequently used for description of nonlinear phenomena [

The objectives of this study were: (1) to investigate the response of Vis/NIR diffuse reflectance spectroscopy toward MC of fresh tea leaves, manufactured green tea and partially processed green tea; (2) to perform and compare linear and nonlinear feature extraction algorithms for discovering the latent structure of spectral data, which included PCA, KPCA and WT; (3) to acquire characteristic wavelengths for determination of MC of tea based on WT.

For sample diversity, three types of samples were collected, which included fresh tea leaves, manufactured green tea and partially processed green tea. The total number of samples was 738. The general information of samples was summarized in

In modeling, all 738 samples were divided into the calibration set and the prediction set with a ratio of 2:1. To avoid bias in subset partition, all samples were first arranged in an ascending order according to their respective MC values, then one sample was picked out from every three samples consecutively, resulting in 246 samples of prediction set, and the remaining 492 samples formed calibration set. The statistical information of
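The partition rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_sorted_mc`, the choice of which sample in each group of three goes to the prediction set, and the synthetic MC values are all assumptions made for the example.

```python
def split_by_sorted_mc(mc_values, step=3, pick=2):
    """Rank samples by MC and send one sample out of every `step` to the prediction set.

    `pick` selects which position inside each consecutive group is held out
    (an assumption; the text only says one sample is taken from every three).
    Returns (calibration_indices, prediction_indices) into `mc_values`.
    """
    order = sorted(range(len(mc_values)), key=lambda i: mc_values[i])
    prediction = [idx for rank, idx in enumerate(order) if rank % step == pick]
    calibration = [idx for rank, idx in enumerate(order) if rank % step != pick]
    return calibration, prediction


# Synthetic, distinct MC values standing in for the 738 measured samples.
mc = [((i * 37) % 738) / 10.0 for i in range(738)]
cal_idx, pred_idx = split_by_sorted_mc(mc)
print(len(cal_idx), len(pred_idx))  # 492 246
```

Because the samples are ranked by MC before being split, both subsets span almost the same MC range, which is the purpose of this kind of systematic partition.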

In this study, a Vis/NIR spectroradiometer (FieldSpec®3, Analytical Spectral Devices, Inc., Boulder, CO, USA) was used to acquire Vis/NIR spectra. The instrument has high sensitivity in the 325–1,075 nm range with a 512-element photodiode array detector; its field of view is 10°, its spectral resolution is 3.5 nm and its sampling interval is 1.5 nm. A 150 W halogen lamp provided uniform light in the visible and short-wave near-infrared range. During scanning, the spectroradiometer was fixed on a tripod at 45° between the instrument axis and the horizontal, approximately 100 mm above the samples. After each sample was scanned, it was removed to free the position for the next sample; this movement might change the measurement system. To reduce this influence, the spectroradiometer was calibrated every half hour against a 100-mm^{2} white standard panel with approximately 100% reflectance across the entire spectrum. Relative reflectance was then calculated from the measurements of both the samples and the standard panel, as shown in

The reference MC was measured by the gravimetric method according to the Chinese National Standard GB8304-87. Each sample was heated in a constant-temperature oven at 103 °C for 4 h and weighed before and after heating on an electronic balance with an accuracy of 0.0001 g. All measurements were carried out in a room at an approximately constant temperature of 25 °C and a relative humidity of 40–55%.
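As a worked illustration of the gravimetric calculation, the sketch below computes wet-basis MC from the two weighings. The wet-basis definition is taken from the "w.b., %" label used in the tables of this article; the function name and the example masses are hypothetical.

```python
def moisture_content_wb(mass_before_g, mass_after_g):
    """Wet-basis moisture content in percent from masses before and after oven drying."""
    if mass_before_g <= 0 or not (0 <= mass_after_g <= mass_before_g):
        raise ValueError("masses must satisfy 0 <= mass_after <= mass_before and mass_before > 0")
    water_lost_g = mass_before_g - mass_after_g
    return 100.0 * water_lost_g / mass_before_g


# Hypothetical weighings of one fresh-leaf sample (grams).
mc_percent = moisture_content_wb(5.0000, 1.8500)
print(round(mc_percent, 2))  # 63.0
```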

WT enables a signal (here, a spectrum) to be analyzed as a sum of functions (wavelets) with different spatial and frequency properties. The discrete WT (DWT) is the most widely used form. The generated waveforms are analyzed with wavelet multi-resolution analysis to extract sub-band information from non-stationary signals. The signal can be reconstructed accurately from a relatively small number of wavelet components [

KPCA is an extension of linear PCA using the kernel method technique, as shown by Schölkopf et al. for a set of observations $x_1, \ldots, x_n \in \mathbb{R}^{P}$
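To make the kernel extension concrete, here is a minimal NumPy sketch of KPCA with an RBF kernel. It is an illustration under stated assumptions rather than the toolbox routine used in the study: the kernel form exp(-||x - y||^2 / sig2), the function names and the random test data are all assumptions.

```python
import numpy as np


def rbf_kernel(A, B, sig2):
    """RBF kernel matrix, assuming the convention k(x, y) = exp(-||x - y||^2 / sig2)."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / sig2)


def kpca_scores(X, sig2, n_components):
    """Scores of the training samples on the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sig2)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = J @ K @ J                           # kernel matrix centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Projection of training sample i on component k equals sqrt(lambda_k) * v_ik.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))                 # 30 synthetic "spectra" with 8 variables
scores = kpca_scores(X, sig2=10.0, n_components=3)
print(scores.shape)  # (30, 3)
```

The score columns play the same role as PCA scores: they are mutually orthogonal synthetic variables that can be passed to a regression model as predictors.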

Least squares support vector machine (LSSVM) is a least squares version of support vector machine (SVM) proposed by Suykens and Vandewalle [
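A minimal sketch of LSSVM regression may help show why it is called a least squares version: training reduces to solving one linear system instead of a quadratic program. The code below follows the standard LSSVM function-estimation system with an RBF kernel; the function names, the kernel convention exp(-||x - y||^2 / sig2), the parameter values and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def rbf_kernel(A, B, sig2):
    """RBF kernel matrix, assuming k(x, y) = exp(-||x - y||^2 / sig2)."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / sig2)


def lssvm_fit(X, y, gam, sig2):
    """Solve [[0, 1^T], [1, K + I/gam]] [b; alpha] = [0; y] for b and alpha."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sig2) + np.eye(n) / gam
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]         # bias term b, support values alpha


def lssvm_predict(X_train, alpha, b, X_new, sig2):
    """Predict with f(x) = sum_i alpha_i k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sig2) @ alpha + b


rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))     # synthetic one-variable predictor
y = np.sin(X[:, 0])                          # synthetic nonlinear response
b, alpha = lssvm_fit(X, y, gam=100.0, sig2=1.0)
y_hat = lssvm_predict(X, alpha, b, X, sig2=1.0)
fit_rmse = float(np.sqrt(np.mean((y_hat - y) ** 2)))
print(fit_rmse)
```

The two hyperparameters, the regularization constant `gam` and the kernel width `sig2`, are the quantities that a grid search would tune.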

Before calibration, spectral reflectance was transformed into absorbance [log(1/R)] to establish a linear correlation between spectral data and concentration of composition. The spectral data were then processed by three feature extraction algorithms (WT, PCA and KPCA), and the synthetic variables from each algorithm were used as predictors. In this study, WT was implemented with the Daubechies 5 (db5) wavelet function at level 3. For KPCA, an RBF kernel with an optimized width parameter sig2 ($\sigma^2$) was adopted to establish the nonlinear mapping. The Unscrambler® 9.7 package (CAMO PROCESS AS, Oslo, Norway) was adopted for realization of PCA, PLSR and MLR.
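The reflectance-to-absorbance step can be sketched in a few lines. The sketch assumes a base-10 logarithm, which is the usual convention for log(1/R) but is not stated explicitly in the text; the function name and sample values are hypothetical.

```python
import math


def reflectance_to_absorbance(reflectance):
    """Convert relative reflectance values (0 < R <= 1) to absorbance log10(1/R)."""
    if any(r <= 0 for r in reflectance):
        raise ValueError("reflectance values must be positive")
    return [math.log10(1.0 / r) for r in reflectance]


# Hypothetical relative reflectance values at three wavelengths.
absorbance = reflectance_to_absorbance([1.0, 0.5, 0.1])
print([round(a, 4) for a in absorbance])
```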

The quality of the regression model was quantified by root mean squared error of calibration (RMSEC), root mean squared error of prediction (RMSEP), and the correlation coefficient (
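The evaluation statistics can be sketched as below. RMSEC and RMSEP use the same formula and differ only in whether it is applied to the calibration or the prediction set. The division by n (rather than by degrees of freedom), the bias definition as the mean residual, the function names and the example values are assumptions.

```python
import math


def rmse(reference, predicted):
    """Root mean squared error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def pearson_r(reference, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


def bias(reference, predicted):
    """Mean residual (predicted minus reference)."""
    return sum(p - r for r, p in zip(reference, predicted)) / len(reference)


# Hypothetical reference and predicted MC values expressed as fractions.
ref = [0.10, 0.20, 0.30, 0.40]
pred = [0.12, 0.18, 0.33, 0.39]
print(rmse(ref, pred), pearson_r(ref, pred), bias(ref, pred))
```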

Vis/NIR diffuse reflectance spectra of the three types of samples are shown in

Apart from the above similarities, many differences also existed in the spectra among the three types of samples. Comparing

Multi-signal wavelet decomposition was performed to expose the internal structure of the spectral data of all 738 samples. After WT, the spectrum of each sample was decomposed into four sets of wavelet coefficients: the approximation coefficients $cA_3$ and the detail coefficients $cD_1$, $cD_2$ and $cD_3$.

Through feature extraction, WT, PCA and KPCA each produced 89-dimensional synthetic variables from the original 651-dimensional spectral data. The samples can thus be represented with these new variables.
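The figure of 89 variables for WT can be checked with a short calculation. With a non-periodic boundary extension, each DWT level maps a signal of length N to floor((N + F - 1) / 2) coefficients, where F is the filter length (10 taps for db5). The sketch below assumes such a boundary mode was used; the function name is hypothetical.

```python
def dwt_coeff_lengths(signal_len, filter_len, level):
    """Coefficient count at each decomposition level for a non-periodic boundary mode."""
    lengths = []
    n = signal_len
    for _ in range(level):
        n = (n + filter_len - 1) // 2    # floor((N + F - 1) / 2)
        lengths.append(n)
    return lengths


DB5_FILTER_LEN = 10                      # a Daubechies-N wavelet has 2N filter taps
lengths = dwt_coeff_lengths(651, DB5_FILTER_LEN, level=3)
print(lengths)  # [330, 169, 89]
```

Under this assumption, a 651-point spectrum decomposed with db5 at level 3 gives 330, 169 and 89 detail coefficients at levels 1 to 3, and 89 approximation coefficients at level 3, which matches the 89-dimensional variable set reported for WT.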

To evaluate the performances of WT, PCA and KPCA, three regression models (Models 1, 2 and 3) were respectively developed with the three sets of newly synthesized variables as predictors. Moreover, the original 651-dimensional spectra were also taken as predictor to develop regression model (Model 4). PLSR was adopted to establish regression models based on the full cross-validation method. The results of the above four models are shown in

As shown above, the 89-dimensional approximation coefficients $cA_3$ were used as predictors. For LSSVM, the two parameters ($\gamma$ and $\sigma^2$) were optimized as 111,570 and 972.655 by grid search, a two-dimensional optimization procedure based on exhaustive search in a limited range [

In the MLR model, the relationship between wavelet coefficients $cA_3$

The MLR models built on different subsets of the wavelet approximation coefficients (Models 8–13) indicate that the spectra in the range of 888–1,007 nm are significantly correlated with the MC of tea. This finding corresponds to the strong, characteristic second-overtone absorption of O–H (960 nm).

The overall results indicate that Vis/NIR diffuse reflectance spectral data are significantly correlated with the MC of tea; in particular, the wavelengths of 888–1,007 nm can be taken as fingerprint indicators of tea MC. The method not only has high accuracy but is also applicable to tea leaves of different tenderness. Moreover, the model is suitable for several types of samples, including fresh tea leaves, manufactured green tea and partially processed green tea, covering MC values from 3.15% to 71.40%.

Linear and nonlinear transform algorithms (PCA, KPCA and WT) were implemented to extract characteristic information from spectral data. The results indicated that WT outperformed KPCA and PCA, so WT is a powerful tool for extracting characteristic information from spectral data. The capabilities of the PLSR, MLR and LSSVM regression algorithms were investigated for establishing determination models, and the MLR model gave the best result. Moreover, the fingerprint wavelengths (888–1,007 nm) were identified by combining MLR with wavelet reconstruction. Overall, the results indicate that the Vis/NIR diffuse reflectance spectrum of tea is strongly affected by MC, and that it is feasible to measure the MC of tea by Vis/NIR diffuse reflectance spectroscopy in conjunction with wavelet transform and multivariate analysis.

This research was supported by the National Science and Technology Support Program of China (2011BAD20B12), the National High Technology Research and Development Program (2011AA100705) and the Fundamental Research Funds for the Central Universities.

Vis/NIR diffuse reflectance spectroscopy of the samples.

Structure of discrete wavelet decomposition at level 3.

Wavelet decomposition coefficients by db5 at level 3.

Energy distribution of wavelet coefficients.

Description of tea samples in these new synthetic variable spaces, (_{3}

Scatter plot of reference

B-coefficients of the optimal determination Model 6.

Reconstruction of approximation at level 3 (

General information of the three types of samples.

Type | Date | Number of samples | Description |
---|---|---|---|
I | 2006.12.04 | 100 | Fresh tea leaves |
II | 2007.09.12 | 70 | Manufactured green tea |
III | 2008.10.12 | 568 | Partially processed green tea |

Statistical information of moisture content (w.b., %) of samples in type I.

Variety | Range | Mean | SD^{a} | Number of samples |
---|---|---|---|---|
Longjing changye | 54.662–68.421 | 62.906 | 0.038 | 20 |
Guangdong shuixian | 66.029–69.792 | 67.715 | 0.011 | 20 |
Zisun cha | 54.397–67.841 | 63.843 | 0.031 | 20 |
Maoxie | 51.773–71.388 | 62.930 | 0.037 | 20 |
Longjing 43 | 56.410–68.889 | 63.958 | 0.040 | 20 |

^{a} SD: standard deviation.

Statistical information of moisture content (w.b., %) of samples in type II.

Grade | Range | Mean | SD^{a} | Number of samples |
---|---|---|---|---|
Excellent grade | 4.237–6.901 | 6.138 | 0.008 | 10 |
1 grade | 5.075–6.644 | 5.558 | 0.005 | 10 |
2 grade | 5.014–5.991 | 5.455 | 0.003 | 10 |
3 grade | 5.312–6.050 | 5.737 | 0.002 | 10 |
4 grade | 5.277–6.429 | 6.003 | 0.003 | 10 |
5 grade | 5.521–6.286 | 5.896 | 0.003 | 10 |
6 grade | 4.237–6.901 | 6.138 | 0.008 | 10 |

^{a} SD: standard deviation.

Statistical information of moisture content (w.b., %) of samples in type III.

Processing step | Range | Mean | SD^{a} | Number of samples |
---|---|---|---|---|
Fresh leaves | 61.347–71.723 | 67.021 | 0.023 | 74 |
Fixation | 53.412–61.854 | 58.723 | 0.009 | 74 |
Rolling and cutting | 39.567–60.506 | 51.327 | 0.049 | 72 |
Drying 1 | 33.780–44.404 | 38.766 | 0.018 | 74 |
Drying 2 | 12.082–16.838 | 14.191 | 0.008 | 70 |
Drying 3 | 9.459–11.556 | 10.916 | 0.005 | 76 |
Manufactured tea | 3.148–4.638 | 3.728 | 0.002 | 58 |
Tea dust | 4.171–5.214 | 4.613 | 0.002 | 70 |

^{a} SD: standard deviation.

Statistical information of moisture content (w.b., %) of samples in three data sets.

Data set | Range | Mean | SD^{a} | Number of samples |
---|---|---|---|---|
Calibration set | 3.148–71.388 | 33.768 | 0.255 | 492 |
Prediction set | 3.485–71.722 | 34.182 | 0.257 | 246 |
Total | 3.148–71.388 | 33.906 | 0.256 | 738 |

^{a} SD: standard deviation.

Results of four PLS models corresponding to PCA, KPCA, WT and original spectral data.

SN^{a} | FEA^{b} | IV^{c} | LV^{d} | Data set | Number of samples | Cor.^{e} | RMSE^{f} | Bias |
---|---|---|---|---|---|---|---|---|
Model 1 | PCA | 89 | 10 | Calibration | 492 | 0.972 | 0.060 | −1.802e−09 |
Model 1 | PCA | 89 | 10 | Validation | 492 | 0.969 | 0.063 | −8.050e−05 |
Model 1 | PCA | 89 | 10 | Prediction | 246 | 0.961 | 0.072 | −1.14e−02 |
Model 2 | KPCA | 89 | 11 | Calibration | 492 | 0.979 | 0.051 | −4.649e−09 |
Model 2 | KPCA | 89 | 11 | Validation | 492 | 0.976 | 0.046 | −9.659e−05 |
Model 2 | KPCA | 89 | 11 | Prediction | 246 | 0.966 | 0.060 | −1.200e−02 |
Model 3 | WT | 89 | 13 | Calibration | 492 | 0.988 | 0.040 | −2.770e−07 |
Model 3 | WT | 89 | 13 | Validation | 492 | 0.985 | 0.044 | 1.634e−05 |
Model 3 | WT | 89 | 13 | Prediction | 246 | 0.986 | 0.044 | −4.800e−03 |
Model 4 | None | 651 | 13 | Calibration | 492 | 0.987 | 0.041 | −1.637e−08 |
Model 4 | None | 651 | 13 | Validation | 492 | 0.985 | 0.044 | −2.030e−07 |
Model 4 | None | 651 | 13 | Prediction | 246 | 0.980 | 0.052 | −8.600e−03 |

^{a} SN: sequence number.
^{b} FEA: feature extraction algorithm.
^{c} IV: number of input variables.
^{d} LV: number of latent variables.
^{e} Cor.: correlation coefficient.
^{f} RMSE: root mean squared error.

Results of three models corresponding to the three types of regression algorithms based on the wavelet approximation coefficients as predictors.

SN^{a} | Alg.^{b} | Number of input variables | Data set | Number of samples | Cor.^{c} | RMSE^{d} | Bias |
---|---|---|---|---|---|---|---|
Model 5 | PLS | 89 | Calibration | 492 | 0.987 | 0.041 | −1.637e−08 |
Model 5 | PLS | 89 | Prediction | 246 | 0.980 | 0.052 | −8.600e−03 |
Model 6 | MLR | 89 | Calibration | 492 | 0.996 | 0.024 | −1.462e−05 |
Model 6 | MLR | 89 | Prediction | 246 | 0.991 | 0.034 | −6.800e−03 |
Model 7 | LSSVM | 89 | Calibration | 492 | 0.999 | 0.013 | −4.514e−05 |
Model 7 | LSSVM | 89 | Prediction | 246 | 0.986 | 0.044 | −6.730e−03 |

^{a} SN: sequence number.
^{b} Alg.: regression algorithm.
^{c} Cor.: correlation coefficient.
^{d} RMSE: root mean squared error.

Results of MLR regression models with different sets of wavelet approximate coefficients as independent variables.

SN^{a} | Approximation coefficients used | Data set | Number of samples | Cor.^{b} | RMSE^{c} | Bias |
---|---|---|---|---|---|---|
Model 8 | 2-7, 51-57, 59-60, 62-63, 67, 72 | Calibration | 492 | 0.951 | 0.079 | −2.326e−05 |
Model 8 | 2-7, 51-57, 59-60, 62-63, 67, 72 | Prediction | 246 | 0.909 | 0.107 | −7.500e−03 |
Model 9 | 2-7, 46-74 | Calibration | 492 | 0.982 | 0.048 | −7.546e−06 |
Model 9 | 2-7, 46-74 | Prediction | 246 | 0.978 | 0.054 | −2.73e−03 |
Model 10 | 2-6, 58-74 | Calibration | 492 | 0.969 | 0.063 | −2.160e−06 |
Model 10 | 2-6, 58-74 | Prediction | 246 | 0.965 | 0.067 | 1.220e−04 |
Model 11 | 58-74 | Calibration | 492 | 0.966 | 0.065 | 3.633e−06 |
Model 11 | 58-74 | Prediction | 246 | 0.968 | 0.065 | −8.680e−04 |
Model 12 | 69-89 | Calibration | 492 | 0.986 | 0.043 | −8.997e−08 |
Model 12 | 69-89 | Prediction | 246 | 0.983 | 0.051 | −1.290e−02 |
Model 13 | 65-83 | Calibration | 492 | 0.992 | 0.032 | 1.103e−06 |
Model 13 | 65-83 | Prediction | 246 | 0.991 | 0.034 | 6.282e−06 |

^{a} SN: sequence number.
^{b} Cor.: correlation coefficient.
^{c} RMSE: root mean squared error.