A Study on Origin Traceability of White Tea (White Peony) Based on Near-Infrared Spectroscopy and Machine Learning Algorithms

Identifying the geographical origin of white tea is important because the quality and price of white tea from different production areas vary widely owing to differences in growing environment and climatic conditions. In this study, we used near-infrared spectroscopy (NIRS) on white tea samples (n = 579) to build models that discriminate these origins under different conditions. Continuous wavelet transform (CWT), min-max normalization (Minmax), multiplicative scattering correction (MSC) and standard normal variate (SNV) were used to preprocess the original spectra (OS). Principal component analysis (PCA), linear discriminant analysis (LDA) and the successive projection algorithm (SPA) were used for feature extraction. Subsequently, identification models of white tea from different provinces of China (DPC), different districts of Fujian Province (DDFP) and authenticity of Fuding white tea (AFWT) were established with the K-nearest neighbors (KNN), random forest (RF) and support vector machine (SVM) algorithms. Among the established models, DPC-CWT-LDA-KNN, DDFP-OS-LDA-KNN and AFWT-OS-LDA-KNN performed best, with recognition accuracies of 88.97%, 93.88% and 97.96%, respectively, and area under curve (AUC) values of 0.85, 0.93 and 0.98, respectively. The results show that NIRS combined with machine learning algorithms can be an effective tool for tracing the geographical origin of white tea.


Introduction
Tea (Camellia sinensis (L.) O. Kuntze) is the second most consumed beverage in the world after water [1]. It is rich in secondary metabolites, such as free amino acids, polyphenols and alkaloids, which are strongly associated with benefits to human health and create a complex and varied taste and an attractive aroma [2,3]. In general, according to the degree of fermentation and the processing technique, tea is classified into six categories: green tea (unfermented, enzyme inactivation), white tea (slightly fermented, withering), yellow tea (partly fermented, heaping for yellowing), oolong tea (partially fermented, fine manipulation), black tea (fully fermented, fermentation), and dark tea (post-fermented, piling) [4,5]. Unlike other kinds of tea, white tea has the simplest production process, with only two steps: withering and drying [6]. In recent years, white tea has become increasingly popular, with ever-growing international market demand and public interest because of its unique flavor and health benefits [7]. The flavor and quality of white tea are often affected by its origin, and origin is an important basis for consumers' purchasing decisions. Fuding white tea is the best-known white tea in China; as a China-Europe Geographical Indication Product, it is popular among consumers and has a higher commercial value than white tea produced in other areas. It has been reported that the amount of Fuding white tea sold on the market far exceeds the actual production [8]. It is difficult for consumers to distinguish white tea from different producing areas by appearance alone, which may affect the value assessment of white tea products. Therefore, a reliable and fast method is increasingly needed to identify and trace white tea from different origins, thereby providing strong technical support for the high-quality development of the white tea industry.
Until now, the identification of tea origins has mainly relied on sensory evaluation of tea appearance and flavor by professional experts. The results are easily influenced by the experts' physical and mental condition, which makes them subjective and poorly repeatable [9]. In recent years, proton transfer reaction time-of-flight mass spectrometry [10], inductively coupled plasma optical emission spectrometry and inductively coupled plasma mass spectrometry [11] have been used for the origin tracing of white tea, but these methods are time-consuming, costly and complex to analyze, which makes them difficult to promote and use in industrial applications. Near-infrared spectroscopy (NIRS), a green analytical technique with high efficiency, high accuracy and convenience, has proven its applicability in the fields of fuel [12], medicine [13] and wine [14], and has shown advantages in tracing the origin of other teas. Jin et al. [15] used near-infrared spectral data combined with an extreme learning machine to build an origin traceability model for Taiping Houkui green tea in a narrow region, and the accuracy of the optimized model reached 95.35%. Ren et al. [16] used a factorization method combined with NIRS data to establish a rapid identification model of black tea growing regions, and the identification accuracy for black tea from different geographical regions was 94.3%. Yan et al. [17] combined NIRS with partial least squares discriminant analysis to discriminate the authenticity of Anxi-Tieguanyin (oolong tea); the specificity and sensitivity of the best model reached 0.931 and 1.000, respectively.
In recent years, machine learning algorithms have also been gradually adopted to identify the origin and authenticity of food products. Xu et al. [18] successfully identified 16 millet origins based on Vis-NIR data combined with machine learning algorithms, with F-score values of up to 99.5% for random forest (RF) and support vector machine (SVM) models, and 99.1% for K-nearest neighbor (KNN) models. Zhang et al. [19] combined hyperspectral data with the SVM algorithm to achieve fast and nondestructive identification of salted, over-salted and sugar-treated sea cucumbers, with the best model achieving 100% accuracy. Liu et al. [20] combined hyperspectral data with the PCA and SVM algorithms to achieve fast and nondestructive identification of green tea origins and of the exact processing month; the correct recognition rate of the best origin identification model reached 97.5%, and that of the best processing-month recognition model reached 95%. These results demonstrate that NIRS combined with machine learning algorithms has the potential to achieve rapid and nondestructive identification of white tea origins, but no report on the application of NIRS to identifying white tea origins is available. Therefore, the main purpose of this paper is to investigate the potential of NIRS, combined with different preprocessing, feature extraction and machine learning algorithms, as a fast and nondestructive tool to identify and classify white tea according to geographically larger production areas (different provinces of China, DPC), a narrow range of origins (different districts of Fujian Province, DDFP) and the authenticity of a China-Europe Geographical Indication Product (authenticity of Fuding white tea, AFWT). This paper describes a systematic and comprehensive evaluation of the applicability of NIRS as a traceability tool for white tea.

Spectra Acquisition
The NIR spectra of the samples were collected using an Antaris II FT-NIR spectrophotometer (Thermo Scientific, Waltham, MA, USA). The instrument was operated at a temperature of 25 °C and a humidity of <70%; the spectral acquisition workflow parameters were set as follows: wavenumber range 4000-10,000 cm⁻¹, scan interval 3.856 cm⁻¹, 64 scans, and resolution 8.0 cm⁻¹. To ensure the reliability of the NIRS data, a background scan was taken before each acquisition and the air background spectrum was subtracted to reduce the influence of environmental factors on the spectral data. The spectrum of each sample was collected three times, and the average spectrum was taken as the original spectral data. The spectra were saved as absorbance using TQ Analyst software (Thermo Nicolet Corporation, Madison, WI, USA) for subsequent analysis.

Spectral Pretreatment
Owing to electrical noise, light scattering and other environmental factors, baseline drift and high-frequency noise are inevitable in NIRS data. To further reduce the influence of these factors on the original spectra (OS), four preprocessing algorithms were applied in MATLAB (MATLAB R2016a, MathWorks) for spectral correction: continuous wavelet transform (CWT), min-max normalization (Minmax), standard normal variate (SNV) and multiplicative scattering correction (MSC). The CWT algorithm was used to correct the baseline drift and eliminate high-frequency noise; the Minmax algorithm was chosen to normalize the absorbance values to a common range; the SNV and MSC algorithms were used to correct scattering and eliminate the effects caused by the inhomogeneity of tea powder particle size and the nonconstant optical path length [21-24]. The choice of wavelet parameters (wavelet basis and decomposition scale) in CWT is crucial and directly determines the quality of the subsequent models [25]. After trial calculation and analysis, the db4 wavelet basis of the Daubechies family was chosen in this study, and the decomposition scale was set to 64.
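The three scaling-type corrections can be illustrated with a minimal NumPy sketch. It is not the MATLAB code used in this study: the function names are hypothetical, the spectra are synthetic, and two details are assumptions, namely that the normalization is applied per spectrum and that MSC uses the mean spectrum as its reference. CWT is left out because its result depends on the wavelet implementation.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Assumption: the reference defaults to the mean spectrum of the data set.
    """
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Least-squares fit: row ~ intercept + slope * ref
        slope, intercept = np.polyfit(ref, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected

def minmax(spectra, low=-1.0, high=1.0):
    """Rescale each spectrum linearly to [low, high] (per-row scaling is assumed)."""
    mn = spectra.min(axis=1, keepdims=True)
    mx = spectra.max(axis=1, keepdims=True)
    return low + (spectra - mn) * (high - low) / (mx - mn)

# Synthetic spectra: one underlying shape with a random gain and offset per sample
rng = np.random.default_rng(0)
base = np.linspace(0.8, 0.25, 50) + 0.05 * np.sin(np.linspace(0, 6, 50))
gain = rng.uniform(0.8, 1.2, size=(5, 1))
offset = rng.uniform(-0.1, 0.1, size=(5, 1))
demo = gain * base + offset          # shape (5, 50)

demo_snv = snv(demo)
demo_msc = msc(demo)
demo_mm = minmax(demo)
```

Because the synthetic samples differ only by gain and offset, SNV and MSC each collapse them onto a single curve, which is the scatter-removal effect described above.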

Extraction of Characteristics
The continuous wavenumbers of NIRS contain a large amount of redundant information mixed with the feature information. If all the data are used to build models, computational speed and accuracy can easily suffer from the excess data. Therefore, to reduce the computational burden of the models, we applied dimensionality reduction to the spectral data, extracting feature vectors or wavenumbers that characterize the vast majority of the spectral information. In this study, principal component analysis (PCA), linear discriminant analysis (LDA) and the successive projection algorithm (SPA) were used for dimensionality reduction, and all of these algorithms were implemented in Python v3.8.5.
Among these methods, PCA is often applied to reduce the dimensionality of spectra of agricultural and livestock products and has proven to be an effective spectral dimensionality reduction method. It extracts features from a large amount of data and converts them into a data set that has a smaller dimensionality but still contains most of the valid information, so the original information is retained to the greatest extent [26]. PCA is therefore one of the most commonly used dimensionality reduction methods.
LDA is a supervised feature extraction method. It is based on the principle that all sample points are projected onto a lower-dimensional space, so that the projections of sample points of the same class are as close together as possible, while the projections of sample points of different classes are as far apart as possible [27].
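As an illustration of LDA used as a feature extractor, the following sketch projects synthetic data with seven classes onto the discriminant axes. It assumes scikit-learn, which the paper does not name, and the data are random stand-ins for spectra. With C classes, at most C − 1 discriminant axes exist.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for preprocessed spectra: 7 classes, 20 variables
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 7, 30, 20
centers = 5.0 * np.eye(n_classes, n_features)   # one well-separated centre per class
X = np.vstack([c + rng.normal(size=(per_class, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

# With n_components left at its default, scikit-learn keeps
# min(n_classes - 1, n_features) discriminant axes
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)
```

The projected matrix has one column per retained discriminant axis and can be passed directly to a classifier.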
SPA is a forward iterative variable selection method. It extracts the variables that carry effective predictive information from the original spectral matrix by successive projections and minimizes the collinearity between the spectral variables, so as to maximize the predictive ability of the selected variables [28]. At each step, the wavenumber with the largest projection vector, and therefore the least collinearity with the wavenumbers already in the feature set, is added to the feature set. The number of characteristic wavenumbers is determined by the root mean square error (RMSE) of the internal full cross-validation of the calibration set; the number of features and the characteristic wavenumbers corresponding to the minimum RMSE are taken as the best values [29].
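The projection step of SPA can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in this study: the function name `spa_select` and the fixed starting column are hypothetical, and the RMSE-based choice of the number of wavenumbers is not included.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Pick n_select column indices of X by successive orthogonal projections.

    Assumes n_select does not exceed the rank of the centred matrix.
    """
    Xp = X - X.mean(axis=0)              # mean-centred working copy
    selected = [start]
    for _ in range(n_select - 1):
        ref = Xp[:, selected[-1]]
        # Project every column onto the subspace orthogonal to the last pick
        Xp = Xp - np.outer(ref, ref @ Xp) / (ref @ ref)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0           # never re-select a column
        # The largest residual norm marks the least collinear variable
        selected.append(int(np.argmax(norms)))
    return selected

# Toy matrix: column 1 is an exact multiple of column 0, column 2 is independent
rng = np.random.default_rng(0)
a, b = rng.normal(size=20), rng.normal(size=20)
demo = np.column_stack([a, 2 * a, b])
chosen = spa_select(demo, n_select=2, start=0)
print(chosen)  # [0, 2]
```

The collinear column is skipped because its projection onto the complement of the first column has essentially zero norm.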

Establishment and Evaluation of Models
Machine learning algorithms are widely used in the analysis and utilization of NIRS data, but so far, no classifier has shown its superior advantages in all cases. Hence, using multiple classifiers for modeling is better for constructing high-quality models. In this paper, we adopted three classical machine learning algorithms, including K-nearest neighbor (KNN), random forest (RF) and support vector machine (SVM), and combined the NIRS data processed by different pre-processing and feature extraction algorithms to build models and optimize the model parameters, to systematically and comprehensively explore the optimal process for the construction of white tea origin traceability models. All model constructions were based on Python v3.8.5, and the evaluation parameter tables were made with Excel. Before the models were constructed, the data were divided into four equal parts, of which three parts were used as the training set and one part was used as the validation set. The training set was used to construct the traceability model; the validation set was used to evaluate the source prediction ability of the model for new samples.
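The 3:1 division described above can be sketched as follows. The helper `holdout_split` is hypothetical, and the random shuffling is an assumption, since the paper does not state how samples were assigned to the four parts.

```python
import numpy as np

def holdout_split(n_samples, n_parts=4, seed=0):
    """Shuffle sample indices, cut them into n_parts, and hold the first part out."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_samples), n_parts)
    val_idx = parts[0]
    train_idx = np.concatenate(parts[1:])
    return train_idx, val_idx

train_idx, val_idx = holdout_split(579)
print(len(train_idx), len(val_idx))  # 434 145
```

For 579 samples this gives 434 training and 145 validation samples, matching the set sizes reported for the DPC models.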
The KNN classification algorithm is one of the simplest machine learning algorithms, with mature theory and wide application. Its principle is to assign a new sample to a category based on the categories of its k nearest points; it is simple, fast and insensitive to outliers. The choice of k has a significant impact on the results of the algorithm. When k is small, the overall complexity of the model rises and the model is prone to overfitting. When k is large, training set instances far away from the validation sample also take part in the prediction, which leads to prediction errors [30]. In practice, k is generally initialized to a small value and cross-validation is subsequently used to select the optimal k; in this study, the initial k was set to 3 after in-depth analysis.
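Selecting k by cross-validation can be sketched as below. The example assumes scikit-learn and uses synthetic features; the candidate values of k are illustrative and are not taken from this study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for extracted spectral features: 3 classes, 6 features
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 6)
X = np.vstack([c + rng.normal(size=(40, 6)) for c in centers])
y = np.repeat(np.arange(3), 40)

# Mean four-fold cross-validated accuracy for several candidate k values
cv_accuracy = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=4).mean()
    for k in (1, 3, 5, 7, 9)
}
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(best_k, round(cv_accuracy[best_k], 3))
```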
RF is a supervised ensemble classification algorithm that emerged mainly to solve the problems of large errors and overfitting that may occur with a single decision tree. RF performs well in classification problems and has great potential to be the best-performing classifier in many cases. The model consists of many decision trees that are independent of one another. When a new sample is predicted, each decision tree in the forest judges separately which category the sample belongs to, and the category chosen by the largest number of trees is assigned to the sample. Deciding how many trees the model should contain is therefore crucial [31,32]. After trial calculation, the number of trees in this study was initially set to 20 for the subsequent comparative analysis.
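A forest with the initial setting of 20 trees could be built as in the sketch below. It assumes scikit-learn and synthetic features; the random seed and the hold-out split are illustrative. Note that scikit-learn's forest averages the class probabilities of its trees rather than counting hard votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for extracted spectral features: 3 classes, 6 features
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 6)
X = np.vstack([c + rng.normal(size=(40, 6)) for c in centers])
y = np.repeat(np.arange(3), 40)

# Hold out one quarter of the shuffled samples for validation
idx = rng.permutation(len(y))
val, train = idx[:30], idx[30:]

rf = RandomForestClassifier(n_estimators=20, random_state=0)
rf.fit(X[train], y[train])
rf_accuracy = rf.score(X[val], y[val])   # fraction of validation samples classified correctly
print(len(rf.estimators_), rf_accuracy)
```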
In recent years, SVM has become one of the most widely used and effective machine learning algorithms in tea research. It uses a kernel function to map the input n-dimensional data to a K-dimensional feature space (K > n) and performs classification in this higher-dimensional feature space. To improve model quality, all SVM models in this paper were based on the radial basis function (RBF) kernel, which reduces the computational complexity of the training process and performs well under the general smoothness assumption. The optimal values of the penalty parameter C and the gamma parameter are also crucial, because the accuracy of an SVM model depends on the combination of these two parameters [33]. After trial calculations, C = 100 and gamma = 0.1 were used as the initial modeling parameters in this study.
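The initial SVM configuration can be written down as the following sketch, assuming scikit-learn and synthetic features. Only the kernel type and the values C = 100 and gamma = 0.1 come from the text; everything else is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for extracted spectral features: 3 classes, 6 features
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 6)
X = np.vstack([c + rng.normal(size=(40, 6)) for c in centers])
y = np.repeat(np.arange(3), 40)

# Hold out one quarter of the shuffled samples for validation
idx = rng.permutation(len(y))
val, train = idx[:30], idx[30:]

# RBF kernel with the initial penalty and kernel-width parameters from the text
svm = SVC(kernel="rbf", C=100, gamma=0.1)
svm.fit(X[train], y[train])
svm_accuracy = svm.score(X[val], y[val])
print(svm_accuracy)
```

Since accuracy depends on the combination of C and gamma, the two values are usually tuned jointly, for example over a small grid evaluated by cross-validation.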
The model performance was preliminarily evaluated using recognition accuracy (RA) and area under curve (AUC). RA is often used to evaluate the predictive ability of a model, and its value ranges from 0 to 100%; the larger the value, the better the predictive ability of the model for new samples. AUC is often used to evaluate the generalization ability of a model; the better the generalization ability, the better the model classifies new samples correctly. The value of AUC ranges from 0 to 1.0 and is positively correlated with the quality of the model [34,35]. When the preliminary evaluation parameters of several models were the same, four-fold cross-validation was used to further evaluate their discriminant and generalization abilities. Four-fold cross-validation divides the original data equally into four subsets, uses each subset in turn as the validation set and the remaining data as the training set to obtain four sets of model performance parameters, and takes the average RA of these four models as the performance index of the classifier [36]. The confusion matrix is often used to evaluate the classification effect for each group; it reflects the relationship between the true categories of the samples and the prediction results and quantifies the details of classification more intuitively [37]. In this paper, confusion matrices were used to evaluate the classification details of the best models obtained.
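The three evaluation tools can be illustrated on a small hand-made binary example, in the spirit of the Fuding versus non-Fuding task. The sketch assumes scikit-learn, and the labels and scores are invented for illustration. How AUC was computed for the multi-class tasks is not stated in the text, so only the binary case is shown.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Invented example: 1 = Fuding, 0 = non-Fuding, with classifier scores for class 1
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1])
y_pred = (scores >= 0.5).astype(int)

ra = accuracy_score(y_true, y_pred)     # recognition accuracy as a fraction
auc = roc_auc_score(y_true, scores)     # area under the ROC curve
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(ra, auc)        # 0.75 0.9375
print(cm.tolist())    # [[3, 1], [1, 3]]
```

Six of the eight samples are classified correctly, and 15 of the 16 positive-negative score pairs are ranked in the right order, which gives the AUC of 0.9375.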

Data Analysis
The raw NIR spectra were saved as absorbance with TQ Analyst software (Thermo Nicolet Corporation) for subsequent analysis. MATLAB (MATLAB R2016a, MathWorks) was used to preprocess the raw spectra and draw all spectra. Python v3.8.5 was used to extract features, build models and draw 3D models. The model evaluation tables and parameter optimization diagrams were generated in Excel. Confusion matrices were generated with TBtools software (Guangdong, China).
Figure 2a shows the initial NIRS of the 579 white tea samples in the 4000-10,000 cm⁻¹ band. The absorbance trends in each band were consistent across samples, without any significant differences. As the wavenumber increased, the absorbance values showed an overall decreasing trend, ranging between 0.239 and 0.833.

Spectral Analysis
To visualize the differences in the NIRS of white tea from different origins, three sets of average spectra were plotted based on the OS data (Figure 2b-d). In each set of average spectra, the absorbance values fluctuated markedly in the range of 4000-7200 cm⁻¹ and the average spectra could largely be separated from one another. White tea samples of different geographical origins therefore show different increases and decreases in absorbance in this band, which indicates a correlation between the spectral information and the origin. The overlap among the average spectra from 7200-10,000 cm⁻¹ in Figure 2b,c indicates that this band contained less effective information related to the origin; in addition, the band was flat, without obvious peaks and valleys, which means that the characteristic information in this band was not obvious and the signal-to-noise ratio was low.

Spectral Pretreatment
The OS contained a large amount of chemical information about the samples, but they showed obvious baseline drift and overlapping spectral peaks, which made it difficult to trace the geographical origin of white tea from the OS alone. To further optimize the OS data, spectral preprocessing was performed with CWT, Minmax, MSC and SNV.
The spectrograms show clearly that all four treatments changed the spectral morphology considerably. CWT (Figure 3a) produced the largest morphological transformation among the four preprocessing methods: baseline drift, background interference and noise were eliminated, the spectral peaks became clearer and the segments carrying difference information became more obvious. The Minmax algorithm (Figure 3b) compressed the spectral absorbance values into the range −1 to 1, which eliminated the influence of data scale and value range; this can help the subsequently constructed models converge faster and improve model performance. To eliminate the influence of the uneven size of tea powder particles and the resulting scattering, SNV and MSC were used for preprocessing (Figure 3c,d); after processing, the scattering interference in the spectra was eliminated and the feature information was more prominent. Compared with the OS, the pretreatments effectively eliminated the signal interference caused by light scattering and baseline drift, but the differences among the production areas still could not be distinguished visually in the treated spectrograms, which might be due to the high similarity in the composition and content of the constituents of white tea from different producing areas. Consistent with the OS, the four pretreated spectra were still flat at 7200-10,000 cm⁻¹ and the feature information there was not obvious. To reduce the dimensionality of the data and improve the speed and quality of model calculation, this segment was discarded in the subsequent model construction [9].

Extraction of Characteristics
In this study, NIRS data were obtained in the range of 4000-10,000 cm⁻¹; after preprocessing and comparative analysis, the range of 4000-7200 cm⁻¹ was chosen for establishing the origin traceability models of white tea. Using all the data in this range to build models may still impair operation speed and accuracy because of the large amount of data. Therefore, dimensionality reduction was performed to extract lower-dimensional features that characterize the spectral information. This study used PCA, LDA and SPA for dimensionality reduction, and the optimal method was determined based on the modeling results.

PCA
The characteristic vectors in OS and preprocessed NIR spectra by CWT, Minmax, MSC and SNV were extracted by PCA, and the results are shown in Table 1. The table shows the extracted eigenvalues and cumulative contributions of the first 15 principal components, and the number of model input principal components was screened based on the principle that the eigenvalue is >1 and the cumulative contribution is >80%.
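One possible reading of this screening rule is sketched below in NumPy: count the principal components with eigenvalue > 1, then check that their cumulative contribution exceeds 80%. The function `screen_components`, the standardization of the variables before the eigen-decomposition, and the synthetic data are all assumptions; the paper does not specify these details.

```python
import numpy as np

def screen_components(X, min_eigenvalue=1.0, min_cumulative=0.80):
    """Count leading components with eigenvalue > min_eigenvalue (at least one)
    and report their cumulative contribution against min_cumulative."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order
    contribution = np.cumsum(eigvals) / eigvals.sum()
    n_keep = max(int(np.sum(eigvals > min_eigenvalue)), 1)
    cumulative = float(contribution[n_keep - 1])
    return n_keep, cumulative, cumulative > min_cumulative

# Synthetic data: 10 variables driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
loadings = np.kron(np.eye(2), np.ones(5))             # shape (2, 10)
X = factors @ loadings + 0.1 * rng.normal(size=(200, 10))

n_keep, cumulative, passes = screen_components(X)
print(n_keep, round(cumulative, 3), passes)
```

For these synthetic data the rule keeps the two components that correspond to the two latent factors.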
In the NIRS data matrix of white tea origins classified by DPC, the number of feature vectors obtained from the OS was 4, and the numbers obtained from the CWT, Minmax, MSC and SNV preprocessed spectra were 11, 7, 7 and 7, respectively. In the NIRS data matrices classified by DDFP or AFWT, the number of feature vectors obtained from the OS was 4, and the numbers obtained from the CWT, Minmax, MSC and SNV preprocessed spectra were 10, 6, 7 and 7, respectively. In all cases the cumulative contribution was >80%, consistent with the screening principle, and models were subsequently constructed based on the screened principal components.

LDA
LDA is commonly used as a classifier in the field of tea. However, the use of LDA for NIRS feature extraction, followed by building white tea recognition models from the extracted feature vectors combined with classifiers, has rarely been reported.
LDA can reduce the dimension of the data matrix to the number of categories minus 1. To reduce the dimension without losing too much of the original information, all dimensions obtained by LDA were used for subsequent modeling. Therefore, the number of feature vectors obtained using LDA for the DPC, DDFP and AFWT data matrices was 6, 5 and 1, respectively.

SPA
Figure 4 shows the number of feature wavenumbers extracted by SPA. As can be seen from Figure 4, the RMSE reached a minimum when a specific number of wavenumbers was selected. Beyond that point the RMSE still fluctuated and decreased, but the decrease was small and came at the cost of a larger number of selected wavenumbers, so there was no need to increase the number of dimensions in pursuit of a smaller RMSE. Figure 4a-e shows the iterative RMSE decline curves of the white tea NIRS data matrix of DPC obtained by SPA, from which the numbers of feature wavenumbers obtained from the final feature extraction were 15, 13, 13, 15 and 11. Figure 4f-j shows the iterative RMSE decline curves of the white tea NIRS data matrix of DDFP obtained by SPA, from which the numbers of feature wavenumbers were 13, 19, 12, 14 and 13. Figure 4k-o shows the iterative RMSE decline curves of the white tea NIRS data matrix of AFWT obtained by SPA, from which the numbers of feature wavenumbers were 11, 13, 10, 12 and 13.

Models Evaluation and Optimization
The KNN, RF and SVM algorithms were used to train models on the spectral data to achieve the following objectives: (1) white tea classified by DPC (FJ vs. GZ vs. HN vs. SC vs. YN vs. ZJ vs. GX); (2) white tea classified by DDFP (FD vs. FA vs. ZH vs. SX vs. JY vs. ZR); (3) white tea classified by AFWT (FD vs. Non-FD). In this paper, RA and AUC were evaluated jointly to assess the performance of the models preliminarily, and the parameters of the best-performing models were then optimized.
Table 2 shows the evaluation parameters of the models obtained from NIRS data combined with different preprocessing, feature extraction and machine learning algorithms for the identification of white tea from geographically larger production areas (DPC, including FJ, GZ, HN, SC, YN, ZJ and GX). For all DPC recognition models, the training set contained 434 samples and the validation set contained 145 samples. The number of modeling features was 831 before dimensionality reduction and about 10 afterwards, which greatly reduced the computational task and improved the computing speed. The recognition accuracy ranged from 66.90 to 86.90%, and the AUC values ranged from 0.50 to 0.83. The majority of the models had RA > 70% and AUC > 0.65, which indicated that the NIRS data of the samples were highly correlated with the province of production and that, combined with machine learning algorithms, they could effectively identify white tea from different provinces. Therefore, the research method proposed in this study was reasonable and effective for tracing the province of production of white tea.
In the established DPC recognition models of white tea, the RA of the KNN, RF and SVM models based on the OS were 73.10%, 75.17% and 75.86%, respectively, and the AUC values were 0.62, 0.65 and 0.63, respectively.
After pretreatment with CWT, Minmax, MSC and SNV, the RA and AUC of the initially established KNN and RF models improved significantly, and the prediction and generalization abilities of the models were further enhanced. For SVM, compared with the DPC-OS-SVM model, only the CWT and SNV algorithms improved the model.

Models Evaluation of White Tea's Origins Classified by DPC
After further combining the feature extraction algorithms, the dimensionality of the data was significantly reduced, but the performance of most models was not further improved. The numbers of models whose quality was further improved after dimensionality reduction of the NIRS data using the PCA, LDA and SPA algorithms were 11, 5 and 2, respectively. In the process of establishing geographically larger production area recognition models, the dimensionality reduction algorithm PCA performed the best: it reduced the data dimensionality for the vast majority of models, reduced the computational task and improved the model computing speed. LDA improved fewer models than PCA, but it yielded the fewest feature dimensions after dimensionality reduction; as sample collection continues and the number of samples in the validation set grows, models using LDA dimensionality reduction will gain a further advantage in computational task and recognition time. Compared with PCA and LDA, the SPA algorithm had a relatively poor ability to reduce dimensionality and improve model quality.
The overall effect of KNN and RF among the three machine learning algorithms was better: the models built with them had an average RA of up to 80% and an average AUC of up to 0.72, significantly better than the SVM models. The best recognition model for DPC was the KNN model DPC-CWT-LDA-KNN, with 6 features, RA = 86.90% and AUC = 0.83; the lowest feature number together with the highest RA and AUC value gave this model the best recognition performance and good generalization capacity for different white tea production provinces.
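One structural reason why LDA yields so few features is that, for C classes, it can produce at most C - 1 discriminant axes. The sketch below applies this bound to the class lists of the three identification tasks. It assumes the maximum number of axes was retained in each case, which is consistent with the 6 features reported for DPC-CWT-LDA-KNN and with the feature numbers reported for the best DDFP and AFWT models (5 and 1); the helper name is hypothetical.

```python
def max_lda_components(n_classes, n_features):
    """Upper bound on the number of discriminant axes LDA can produce."""
    return min(n_classes - 1, n_features)


# Class labels as listed for the three identification tasks.
tasks = {
    "DPC": ["FJ", "GZ", "HN", "SC", "YN", "ZJ", "GX"],
    "DDFP": ["FD", "FA", "ZH", "SX", "JY", "ZR"],
    "AFWT": ["FD", "Non-FD"],
}
N_FEATURES = 831  # modeling features before dimensionality reduction

lda_dims = {name: max_lda_components(len(classes), N_FEATURES)
            for name, classes in tasks.items()}
print(lda_dims)  # {'DPC': 6, 'DDFP': 5, 'AFWT': 1}
```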
Overall, NIRS has great potential for building recognition models for geographically large production areas (different provinces) of white tea. When building such models, the preprocessing algorithms CWT and SNV showed stronger general adaptability, and their combination with the three machine learning algorithms presented significant advantages in identifying white tea origins, a finding similar to the results of the study by Zhang et al. [19]. It is suggested that SNV or CWT be combined with other classification algorithms for white tea origin tracing in subsequent research. The better feature extraction algorithms were PCA and LDA, while the effect of SPA was relatively poor, presumably because LDA and PCA extract feature vectors that can represent most of the spectral information, whereas SPA extracts feature wavenumbers, which might not be as representative as feature vectors. Among the machine learning algorithms, KNN and RF performed better overall, with an average RA of up to 80% and an average AUC of up to 0.72, significantly better than the SVM models. To obtain the optimal DPC white tea classification model, the parameters of the DPC-CWT-LDA-KNN model, which had the best tracing effect on the geographically larger white-tea-producing areas, were optimized subsequently.
Tables 3 and 4 show the evaluation parameters of the models obtained from the same NIRS dataset combined with different preprocessing, feature extraction and machine learning algorithms for identifying a geographically narrow range of origin (DDFP) and the authenticity of a China-Europe Geographical Indication Product (AFWT). Since the data sets used were the same, all models in Tables 3 and 4 had 291 samples in the training set and 98 samples in the validation set.
The models in Table 3 identify white tea within geographically narrow origin ranges (DDFP, including FD, FA, ZH, SX, JY and ZR). The RA of the DDFP identification models ranged from 50.00% to 92.86%, and the AUC ranged from 0.50 to 0.92. The vast majority of DDFP models had RA > 70% and AUC > 0.70, which indicated that the NIR spectral data used were highly correlated with the origin of white tea within a geographically narrow range, and that NIRS combined with machine learning algorithms could achieve fast and nondestructive identification of white tea in a geographically narrow range of origin. The models in Table 4 identify whether a Fujian white tea is Fuding white tea or not (AFWT, including FD and Non-FD). The RA of the AFWT identification models ranged from 51.02% to 97.96%, and the AUC ranged from 0.50 to 0.98; the majority of models had RA > 80% and AUC > 0.80, indicating excellent performance. Comparing the AFWT and DDFP recognition models shows that when the same dataset was used for different recognition goals (DDFP or AFWT), the goal with fewer categories was easier to reach, and its RA and AUC values were significantly higher. This may be because it was easier to extract appropriate features when there were fewer recognition categories, which improved model quality.

Models Evaluation of White Tea Origins Classified by DDFP and AFWT
In the established DDFP recognition models, the RA of the KNN, RF and SVM models based on OS were 54.08%, 59.18% and 61.22%, respectively, and the AUC values were 0.62, 0.64 and 0.61, respectively. After OS was preprocessed by CWT, Minmax, MSC and SNV, the RA and AUC of the initially established KNN and RF models were significantly improved, and the model accuracy and generalization ability were improved. In the DDFP recognition models established by the SVM algorithm, model quality decreased after MSC preprocessing, whereas the other preprocessing algorithms improved it.
In the AFWT recognition models, the RA of the KNN, RF and SVM models based on OS were 75.51%, 76.53% and 84.69%, respectively, and the AUC values were 0.76, 0.77 and 0.85, respectively. After pretreatment with CWT, Minmax, MSC and SNV, the RA and AUC of the initial identification models were significantly improved, and the model accuracy and generalization ability were improved. In terms of RA and AUC, the preprocessing algorithms performed better in establishing the AFWT recognition models. It is speculated that when there are fewer recognition categories, the preprocessing algorithms are more widely applicable.
After further combination with the feature extraction algorithms, the data dimensions of the DDFP and AFWT recognition models were greatly reduced. In the DDFP recognition models, the performance of more than half of the models was further improved by the feature extraction algorithms: after dimensionality reduction using the PCA, LDA and SPA algorithms, the numbers of models whose quality was further improved were 4, 15 and 4, respectively. In the AFWT recognition models, the performance of most models was not further improved by the feature extraction algorithms; after dimensionality reduction using the PCA, LDA and SPA algorithms, the numbers of models whose quality was further improved were 3, 13 and 0, respectively. In general, when establishing DDFP and AFWT recognition models based on the same data, LDA performed best among the feature extraction algorithms, yielding the fewest features and the best model performance. As the number of samples increases in the future, the growth of the validation set will make the advantage of LDA dimensionality reduction more obvious in terms of computing tasks and recognition time.
In the DDFP and AFWT recognition models, the machine learning algorithm RF had the best overall effect, with the highest average RA and AUC values. The best DDFP recognition model was the KNN model DDFP-OS-LDA-KNN, with 5 features, RA = 92.86% and AUC = 0.92, indicating that the model had good prediction ability for DDFP recognition of white tea; the low number of features means fewer computing tasks and better computing speed when the number of samples in the validation set increases. There were three best AFWT recognition models, namely AFWT-OS-LDA-KNN, AFWT-OS-LDA-RF and AFWT-OS-LDA-SVM, each with 1 feature, RA = 97.96% and AUC = 0.98. To further explore their performance differences, four-fold cross-validation was introduced to evaluate these three models.
In four-fold cross-validation, the original data set is divided equally into four subsets; each subset is used in turn as the validation set while the remaining three subsets form the training set. This yields performance parameters for four models, and the average of their RA values is the four-fold cross-validation result. A higher cross-validation RA represents a stronger generalization ability of the model and a better prediction ability for new samples. Table 5 shows the four-fold cross-validation results of AFWT-OS-LDA-KNN, AFWT-OS-LDA-RF and AFWT-OS-LDA-SVM. As shown in the table, four-fold cross-validation could distinguish small differences in generalization ability among the models even though the RA values of the training and validation sets of the three models were the same. AFWT-OS-LDA-KNN had the highest four-fold cross-validation RA, 97.96%, which indicated that KNN was more suitable than the RF and SVM algorithms for the construction of authenticity models with fewer classification categories. To obtain the optimal AFWT identification model, the parameters of AFWT-OS-LDA-KNN were optimized subsequently.
In general, the NIRS dataset combined with different preprocessing, feature extraction and machine learning algorithms performed excellently in identifying the geographically narrow range of origin (DDFP) and the authenticity of the China-Europe Geographical Indication Product (AFWT). SNV performed the best among the preprocessing algorithms and improved model quality the most, in line with the findings of Zhang et al. [19]. LDA performed best among the feature extraction algorithms, yielding the fewest dimensions after dimensionality reduction, which could significantly reduce the computational task and thus improve the computing speed. Among the machine learning algorithms, RF combined with different algorithms gave good results with higher overall average performance parameters; however, the best-performing models were KNN models. It is suggested that in subsequent studies the RF algorithm be used to establish a reference standard for model evaluation parameters, and that the KNN algorithm then be used to build a higher-quality model.
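The four-fold cross-validation procedure described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the split is random and unstratified, the classifier is a small hand-written KNN on a single feature (mirroring the one-feature AFWT-OS-LDA models), and the feature values are hypothetical. All function names are illustrative.

```python
import random
from collections import Counter


def four_folds(n_samples, seed=0):
    """Shuffle sample indices and deal them into four near-equal subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random, unstratified split
    return [indices[i::4] for i in range(4)]


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training samples (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def four_fold_ra(xs, ys, k=3, seed=0):
    """Mean recognition accuracy over the four validation folds."""
    fold_ra = []
    for fold in four_folds(len(xs), seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in fold)
        fold_ra.append(hits / len(fold))
    return sum(fold_ra) / len(fold_ra)


# Hypothetical single LDA feature for 20 FD and 20 Non-FD samples.
xs = [0.1 * i for i in range(20)] + [5.0 + 0.1 * i for i in range(20)]
ys = ["FD"] * 20 + ["Non-FD"] * 20
cv_ra = four_fold_ra(xs, ys)
```

Because every sample serves as validation data exactly once, the averaged RA is less dependent on a single train/validation split than the RA of one fixed validation set.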

Models Optimization
To further improve the performance of the model, the parameters of the three models with the best comprehensive performance in the three types of identification models were optimized, including DPC-CWT-LDA-KNN, DDFP-OS-LDA-KNN and AFWT-OS-LDA-KNN.
In the KNN algorithm, the number of neighbors k plays a decisive role in the quality of the model [38]. To further optimize the models, k was varied between 1 and 100, and the established models were evaluated by their RA values. Figure 5 shows the curves of RA values for the optimization of parameter k in the above KNN models. The black circle marks the parameter value at which RA reached its maximum. The optimal parameter was k = 8, at which the RA of the DPC-CWT-LDA-KNN validation set reached a maximum value of 88.97% (Figure 5a); the optimal RA of the DDFP-OS-LDA-KNN validation set was 93.88%,
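A scan over k of the kind described above can be sketched in a few lines. The sketch assumes Euclidean distance, majority voting and tie-breaking in favour of the smallest k, none of which is specified in the text, and it uses hypothetical two-feature data; the function names are illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def recognition_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation samples assigned to their true class."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(val_x, val_y))
    return hits / len(val_y)


def scan_k(train_x, train_y, val_x, val_y, k_max=100):
    """Return (best k, RA per k) for k = 1 .. min(k_max, training set size)."""
    k_values = range(1, min(k_max, len(train_x)) + 1)
    ra_curve = {k: recognition_accuracy(train_x, train_y, val_x, val_y, k)
                for k in k_values}
    # Assumption: ties are resolved in favour of the smallest k.
    best_k = max(ra_curve, key=lambda k: (ra_curve[k], -k))
    return best_k, ra_curve


# Hypothetical two-feature data with two well-separated origins.
train_x = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9), (2.9, 3.3)]
train_y = ["FD", "FD", "FD", "Non-FD", "Non-FD", "Non-FD"]
val_x = [(0.1, 0.1), (3.1, 3.0)]
val_y = ["FD", "Non-FD"]

best_k, ra_curve = scan_k(train_x, train_y, val_x, val_y)
```

Plotting `ra_curve` against k gives a curve of the kind shown in Figure 5, with `best_k` marking its maximum.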

Performance Analysis of Optimal Models
After the models' evaluation and optimization, we obtained the best models for identifying white tea DPC, DDFP and AFWT, and the optimal model parameters are shown in Table 6. As shown in the table, the best models for identifying and classifying white tea based on a geographically larger production area (DPC), narrow range of origin (DDFP) and authenticity of China-Europe Geographical Indication Product (AFWT) all had modeling feature numbers less than 10, with the RA all close to or greater than 90% and AUC values all close to or greater than 0.90, which indicated these models possessed excellent prediction and generalization abilities. The excellent quality of the above models demonstrates the ability of NIRS for rapid and nondestructive origin tracing of white tea and provides a reference for other agricultural products in terms of technology and algorithm application in origin traceability.
To further evaluate the ability of the best models to recognize each category, confusion matrices were introduced for in-depth evaluation. The confusion matrix provides a detailed reflection of the performance of a classification model, where the rows represent the true class and the columns represent the predicted class. It enables the visualization of the number of correctly classified samples, as well as the categories and number of misclassified samples, for each white-tea-producing area. The higher the values on the diagonal of the matrix, the better the prediction ability of the model. The confusion matrices of the best DPC, DDFP and AFWT identification models are shown in Figure 6. From Figure 6a, it can be seen that in distinguishing white tea from different provinces, the predicted accuracy of DPC-CWT-LDA-KNN for YN and ZJ was 100%, and the predicted accuracy for both FJ and GZ was greater than 85.00%; misclassification occurred mostly in HN and SC samples, with HN samples often misclassified as SC and SC samples often misclassified as FJ. When identifying white tea from different production districts in Fujian Province, the predicted accuracy of DDFP-OS-LDA-KNN for FD, FA and ZH was 100%; misclassification occurred in SX, JY and ZR, and JY samples were often misclassified as ZH (Figure 6b). As shown in Figure 6c, when performing authenticity identification of Fuding white tea, AFWT-OS-LDA-KNN correctly identified 97.92% and 98.00% of FD and Non-FD samples, respectively, showing excellent prediction ability and good model performance. Overall, the models had excellent correct identification rates for each appellation, and misclassifications occurred mostly among geographically adjacent appellations. The high similarity of geographic environment, climatic factors and processing techniques may be the reason for the frequent misclassification among these appellations.
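The relationship between a confusion matrix, the per-class identification rates and the overall RA can be illustrated with the AFWT case. The counts used below are an assumption: 48 FD and 50 Non-FD validation samples with one misclassification each were back-calculated from the reported 97.92% and 98.00% and the 98 validation samples, and were not read from Figure 6c. The function names are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns the predicted class."""
    position = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[position[t]][position[p]] += 1
    return matrix


def per_class_accuracy(matrix, classes):
    """Diagonal count divided by the row total, as a percentage."""
    return {c: 100 * matrix[i][i] / sum(matrix[i]) for i, c in enumerate(classes)}


classes = ["FD", "Non-FD"]
# Assumed counts (see lead-in): 48 FD with 1 missed, 50 Non-FD with 1 missed.
true_labels = ["FD"] * 48 + ["Non-FD"] * 50
predicted = ["FD"] * 47 + ["Non-FD"] + ["Non-FD"] * 49 + ["FD"]

cm = confusion_matrix(true_labels, predicted, classes)
rates = per_class_accuracy(cm, classes)
overall_ra = 100 * sum(cm[i][i] for i in range(len(classes))) / len(true_labels)
print(cm)  # [[47, 1], [1, 49]]
```

Under these assumed counts the per-class rates round to 97.92% and 98.00% and the overall RA rounds to 97.96%, so the reported per-class and overall figures are mutually consistent.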
The optimal models were visualized to reveal the clustering trend of samples from each producing area (Figure 7). In the three-dimensional model diagram of DPC-CWT-LDA-KNN (Figure 7a), the GX and YN samples clustered very well and could be clearly distinguished from tea from other provinces. The FJ samples also clustered well and could basically be separated from other provinces. The spectral characteristics of the GZ, HN, SC and ZJ samples were very close to each other in three-dimensional space; the geographical location of these provinces and the similarity of their climatic conditions and tea processing technology may explain why these samples lie so close together.
In the three-dimensional model diagram of DDFP-OS-LDA-KNN (Figure 7b), samples from different producing areas were basically clustered separately in three-dimensional space and could easily be distinguished. The ZH and SX samples were close, which may be due to their similar geographical locations and similar processing technology. In the visualization of the AFWT-OS-LDA-KNN model (Figure 7c), the FD and Non-FD samples clustered very effectively, which may be the reason for the excellent performance of the model. In general, after visualization, the spectral characteristics of white tea samples from some producing areas were distributed very close together in three-dimensional space, which may be related to the small number of samples collected in these producing areas and the lack of obvious spectral characteristics. It may also be related to the similarity of white tea quality caused by geographical location, climatic factors and similar processing technology. To solve these problems, we will strengthen the spectral characteristics of each production area by increasing the number of samples year by year, so as to further improve the performance of the models.

Conclusions
This study demonstrated the feasibility of using NIRS data to verify the origin of white tea simply and quickly. By combining different spectral data preprocessing methods (CWT, Minmax, MSC and SNV) with different feature extraction algorithms (PCA, LDA and SPA), 180 white tea origin traceability models were established based on the KNN, RF and SVM algorithms. The modeling results show that SNV was the most effective of the preprocessing algorithms and improved model performance the most when not combined with other algorithms. LDA had the greatest advantage among the feature extraction algorithms, yielding the fewest features after dimensionality reduction. RF had the strongest general adaptability among the machine learning algorithms, but the best model quality generally appeared in KNN models. DPC-CWT-LDA-KNN, DDFP-OS-LDA-KNN and AFWT-OS-LDA-KNN proved to be the optimal models for identifying white tea origins classified by DPC, DDFP and AFWT. The RA of the optimal models was close to or greater than 90% and their AUC values were close to or greater than 0.90, so these models had excellent predictive ability and good generalization ability. Overall, this study demonstrates the possibility of achieving white tea origin traceability based on NIRS, representing a step forward in method selection for origin traceability and quality control of white tea. Based on the above results, and in order to address the problems of unbalanced model samples and closely spaced model clusters, we will increase the number of samples year by year, enrich the white tea NIRS data set to optimize model performance, and develop portable white tea origin traceability devices on this basis. In addition, we will try to build an online white tea origin identification platform using internet technology to carry out remote white tea origin traceability.
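The figure of 180 models is consistent with a full grid over the three identification tasks, five spectral inputs (OS plus four preprocessing methods), four feature options (none plus three extraction algorithms) and three classifiers. The sketch below enumerates that grid; the assumption that every combination was built, and the naming rule that omits the feature part when no extraction is used (as in DPC-OS-SVM), are inferred from the model names in the text.

```python
from itertools import product

tasks = ["DPC", "DDFP", "AFWT"]
preprocessing = ["OS", "CWT", "Minmax", "MSC", "SNV"]  # OS = original spectra
feature_extraction = [None, "PCA", "LDA", "SPA"]       # None = no extraction
classifiers = ["KNN", "RF", "SVM"]

model_names = [
    "-".join(part for part in combo if part is not None)
    for combo in product(tasks, preprocessing, feature_extraction, classifiers)
]
print(len(model_names))  # 180
```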

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest: Jiaya Chen from LiuMiao White Tea corporation (the author contributed to the resources, the company provides a large number of experimental samples); Gang Lin from Fujian Rongyuntong Ecological Technology Limited Company (the author contributed to the resources, the company provides algorithmic support); Linhai Chen from Fu'an Tea Industry Development Center (the author contributed to the resources, the company provides a large number of experimental samples).