Near-Infrared Spectroscopy Coupled Chemometric Algorithms for Rapid Origin Identification and Lipid Content Detection of Pinus Koraiensis Seeds

Lipid content is an important indicator of the edible and breeding value of Pinus koraiensis seeds. Differences in origin affect the lipid content of the kernel, and neither origin nor lipid content can be judged by appearance or morphology. Traditional chemical methods are small-scale, time-consuming, labor-intensive, costly, and laboratory-dependent. In this study, near-infrared (NIR) spectroscopy combined with chemometrics was used to identify the origin and determine the lipid content of P. koraiensis seeds. Principal component analysis (PCA), wavelet transformation (WT), Monte Carlo (MC), and uninformative variable elimination (UVE) methods were used to process the spectral data, and the prediction models were established with partial least-squares (PLS). Models were evaluated by $R^2$ for the calibration and prediction sets, the root mean square error of cross-validation (RMSECV), and the root mean square error of prediction (RMSEP). For origin identification, two dimensions of input data produced a faster and more accurate PLS model, with accuracies of 98.75% and 97.50% for the calibration and prediction sets, respectively. For lipid content, the WT–MC–UVE–PLS regression model gave the best predictions when Donoho Thresholding with the 'bior4.4' wavelet filter was selected. The $R^2$ for the calibration and prediction sets was 0.9485 and 0.9369, and the RMSECV and RMSEP were 0.0098 and 0.0390, respectively. NIR technology combined with chemometric algorithms can be used to characterize P. koraiensis seeds.


Introduction
Pinus koraiensis is a rare and valuable species and a National Class II protected wild plant in China [1]. It is found only in the area between the Changbai Mountains and the Xiaoxingan Mountains in northeast China, as well as in parts of Japan, Korea, and Russia [2]. Differences in the nutrient content of P. koraiensis seeds are often caused by differences in water quality, soil, climate, and latitude and longitude [1]. The lipid in P. koraiensis seeds directly determines the special flavor and taste of the nuts, which distinguish them from other nuts and are among the important agronomic benefits of the tree [1][2][3]. The lipid content is also significant in the production of pine nut oil. The lipid components of P. koraiensis seeds include unsaturated fatty acids such as oleic acid and linoleic acid [3]. These unsaturated fatty acids have an anti-atherosclerotic effect and also enhance brain cell metabolism while maintaining brain cell and nerve function [2]. At present, traditional detection still relies on experienced staff to distinguish the origin of the seeds [1,4]. The lipid content of pine nuts cannot be assessed by the naked eye; hence, manufacturers perform extraction tests on raw materials. Although extraction gives more accurate results, the process is destructive and time-consuming, and the chemical reagents used (such as ether) are dangerous. If the waste liquid left after the experiment is not handled properly, it will pollute the environment.

P. koraiensis seeds were shelled and the pine nuts were ground to powder using a grinder. Forty samples (5 g each) of P. koraiensis seeds from each place of origin were put into transparent, sealed bags. One side of each sealed bag was marked with the serial number. The sampling process was random. Before the NIR reading, the samples (the pine nut powder) were individually sealed, kept away from light, and stored for 24 h at 21 °C to reach a stable state.

NIR Spectrometer and Spectral Acquisition
The experimental data consisted of the NIR spectroscopy data and the lipid content analysis of the pine nuts. The NIR data were collected by a NIRQuest512 spectrometer (Ocean Optics, Inc., Dunedin, FL, USA), which was equipped with a fiber interface (SMA905) and a detector (Hamamatsu G9204-512 InGaAs linear array). The grating of the instrument covered the range from 900 to 1700 nm. During the NIR data reading, the measurement environment was maintained at a constant relative humidity and room temperature (fluctuating by no more than 1 °C), and the distance between the probe and the sample was kept at 0.5 mm (Figure 1). Each spectrum was the mean of three measurements, and the spectra were collated using the SpectraSuite software (Ocean Optics, Inc., Dunedin, FL, USA). All the ground samples were properly preserved until chemical analysis.


Lipid Content
The lipid content of P. koraiensis seeds was determined using the Soxhlet extraction method as described by the AOAC [22]. The solvent used in the Soxhlet extractor was anhydrous ether. The weighed filter paper bag was filled with 5 g of finely ground pine nut sample. The parts of the extractor were assembled, the condensate water flow was connected, and the extraction was carried out at a constant temperature (70 °C). After extraction, the filter paper bag was removed and placed in a well-ventilated place until the ether had volatilized completely; the sample was then dried and weighed.
The crude lipid content was calculated as follows (Equation (1)):

$$X = \frac{b - c}{b - a} \times 100\% \qquad (1)$$

where $X$ is the crude lipid content; $a$ is the weight of the weighing bottle and filter paper bag; $b$ is the weight of the weighing bottle with the filter paper bag and the dried sample; and $c$ is the weight of the weighing bottle with the filter paper bag and the dry residue after extraction. The lipid analysis was carried out and certified by the Academy of Quality Supervision and Inspection in Heilongjiang Province.
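The weighing definitions above can be turned into a short calculation. The following is a minimal sketch, assuming the crude lipid content is the mass lost on extraction divided by the dry sample mass, X = (b − c)/(b − a) × 100%; the function name and the example weights are hypothetical and are not taken from the study.

```python
def crude_lipid_percent(a: float, b: float, c: float) -> float:
    """Crude lipid content X (%) from three weighings (same mass unit).

    a: weighing bottle + filter paper bag
    b: weighing bottle + filter paper bag + dried sample
    c: weighing bottle + filter paper bag + dry residue after extraction
    Assumed formula: X = (b - c) / (b - a) * 100.
    """
    sample_mass = b - a
    if sample_mass <= 0:
        raise ValueError("b must be larger than a (sample mass must be positive)")
    return (b - c) / sample_mass * 100.0


# Hypothetical weights in grams: 5 g of sample, 3 g lost on extraction.
example_lipid = crude_lipid_percent(a=20.0, b=25.0, c=22.0)
```

With these illustrative weights the extracted mass is 3 g out of a 5 g sample, which is of the same order as the roughly 60% lipid content reported for the samples.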

Data Analysis
The standard normal variate (SNV) transformation aims to eliminate the effects of surface scattering, solid particle size, and light path changes on NIR diffuse reflectance spectra, and is generally used for data preprocessing in NIR spectroscopy. Uninformative variable elimination (UVE) and the successive projections algorithm (SPA) are widely used in the selection of spectral characteristic bands [15,23]. For the classification problem, the aim in this study was to maintain classification accuracy while reducing the number of input features as much as possible; for the regression problem, the focus was on optimizing data quality and preserving the integrity of the information in the data [23].
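To make the SNV step concrete, the sketch below centers each spectrum on its own mean and scales it by its own standard deviation. This is the usual definition of SNV; the use of the sample (n − 1) standard deviation and the toy absorbance values are assumptions, and the study itself used MATLAB rather than Python.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: (x - mean(x)) / std(x) for a single spectrum."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(value - m) / s for value in spectrum]


# Two toy "spectra" with the same shape; the second has a multiplicative
# and an additive offset, as a path-length change might produce.
raw = [0.2, 0.4, 0.6, 0.8]
shifted = [2.0 * value + 0.5 for value in raw]

snv_raw = snv(raw)
snv_shifted = snv(shifted)
```

After SNV the two toy spectra coincide, which is the sense in which the transformation removes scatter and path-length differences between measurements.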
Unlike traditional NIR chemometric analysis, which relies on SNV preprocessing and SPA or UVE feature selection [24,25], this study additionally introduces wavelet transformation and wavelet compression techniques, the PCA method, and the MCUVE algorithm. After SNV, WT was used to further preprocess the data, optimizing and compressing the spectra while retaining useful information. PCA can greatly reduce data size, and the MCUVE algorithm can improve the efficiency of feature selection. As a result, a PCA-PLS classification model and a better-performing WT-MCUVE-PLS regression model were obtained.
In this study, all the sample data were divided into calibration and prediction sets. Before each modeling scenario, the samples of each set were selected by random sampling, but the sizes of the calibration and prediction sets were always fixed at 80 and 40, respectively. To optimize the parameters and improve the prediction accuracy and generalization ability of the model, five-fold cross-validation was used: the data set was divided randomly and evenly into five parts, four of which were used for training in turn, with the remaining part used for validation. For all the calibration models in this study, a grid search was used to optimize the parameters. Within the specified parameter range, the parameters were adjusted in sequence according to the step length, and the adjusted parameters were used to train the learner. All the mathematical models in this study were implemented in MATLAB R2018a (The MathWorks, Natick, MA, USA).
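The sampling scheme described above (a random 80/40 split, five-fold cross-validation, and a grid search over a parameter range) can be sketched as follows. This is an illustration only: the function names, the fixed random seeds, and the `cv_error` callable are hypothetical, and the study's own implementation was in MATLAB.

```python
import random


def split_calibration_prediction(n_samples, n_calibration, seed=0):
    """Randomly split sample indices into calibration and prediction sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_calibration], indices[n_calibration:]


def k_fold(indices, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal parts
    for i in range(k):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, folds[i]


def grid_search(candidates, cv_error):
    """Return the candidate value with the lowest cross-validated error.

    cv_error is a hypothetical callable mapping a parameter value to an
    error such as RMSECV.
    """
    return min(candidates, key=cv_error)


calibration, prediction = split_calibration_prediction(120, 80)
folds = list(k_fold(calibration, k=5))
```

With 80 calibration samples, each fold trains on 64 samples and validates on 16, and every calibration sample is used for validation exactly once.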

Discrete Wavelet Transformation
WT uses a wavelet basis function as a window function to map a time-domain signal to coefficients in the wavelet domain [26]. A discrete wavelet transformation (DWT) is any wavelet transformation for which the wavelets are discretely sampled, so that a signal can be reconstructed from a discrete set of wavelet coefficients [27].
From an engineering point of view, the wavelet transformation reduces to a set of filtering operations in which the wavelet is superimposed on the signal at different time shifts [27]. The discrete wavelet transformation can be intuitively defined by Equations (2)-(4):

$$f(t) = \sum_{k} c_{Jk}\,\phi_{Jk}(t) + \sum_{j=1}^{J} \sum_{k} d_{jk}\,\psi_{jk}(t) \qquad (2)$$

$$c_{jk} = \int f(t)\,\phi_{jk}(t)\,dt \qquad (3)$$

$$d_{jk} = \int f(t)\,\psi_{jk}(t)\,dt \qquad (4)$$

where $c_{jk}$ and $d_{jk}$ represent the approximate wavelet coefficient and the detailed wavelet coefficient, respectively, $J$ indicates the number of transformation levels, $\phi_{jk}(t)$ is called the scaling function, and $\psi_{jk}(t)$ is called the wavelet function.
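As a minimal sketch of how a multi-level DWT splits a signal into approximation and detail coefficients by repeated filtering, the code below uses the Haar wavelet. Haar is chosen here only because its filters are two taps long; it is not one of the bases used in this study ('db9', 'bior4.4', 'sym8', 'coif4'). The function names are illustrative, and the signal length is assumed to be divisible by 2 to the power of the number of levels.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_step(signal):
    """One decomposition level: low-pass (approximation) and high-pass (detail)."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one level by recombining approximation and detail coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * _S, (a - d) * _S])
    return signal


def dwt(signal, levels):
    """Return (c_J, [d_1, ..., d_J]); d_1 is the finest detail level."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def idwt(approx, details):
    """Reconstruct the signal from the coarsest level back to the finest."""
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


spectrum = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
c_J, d_levels = dwt(spectrum, levels=3)
rebuilt = idwt(c_J, d_levels)
```

A three-level decomposition of an eight-point signal leaves one approximation coefficient and detail vectors of lengths 4, 2, and 1; the inverse transformation recovers the input when no coefficient is modified.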

Wavelet Threshold Denoising and Compression Method
In this study, a wavelet compression denoising algorithm was used to extract features from the original spectral data, reduce their noise, and improve the accuracy of the model. Feature selection is crucial to the quality of the PLS calibration model [27]. Thus, wavelet compression can be applied before modeling so that only specific wavelet coefficients are used as model inputs. Although researchers in the NIR field commonly use wavelet transformation to compress data sets, extract data features, and reduce noise [26,27], most do not pay attention to the compression ratio, the information loss, and the final model quality. Two indices, compression ratio and information loss, were used to evaluate the wavelet transformation.
At present, the mainstream wavelet denoising method is the wavelet threshold denoising method proposed by Donoho and Johnstone in 1994 [28]. Formally, Donoho and Johnstone expressed a one-dimensional observed signal $f(t)$ as the sum of the original signal and a noise signal (Equation (5)) [28]:

$$f(t) = f_s(t) + f_n(t) \qquad (5)$$

where $f_s(t)$ is the noise-free signal and $f_n(t)$ is the noise signal. According to the Donoho and Johnstone studies, the noise data have an i.i.d. Gaussian (normal) distribution with a mean of zero and a variance of $\sigma^2$, featuring a high frequency and small values. The wavelet transformation coefficients corresponding to the noise signal are almost all in the wavelet detail coefficients, and the coefficient values are low. Therefore, almost all the noise signals can be filtered by a suitable threshold algorithm.
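The thresholding idea can be sketched on a vector of detail coefficients. The code below uses the universal threshold of Donoho and Johnstone, σ√(2 ln n), with σ estimated from the median absolute detail coefficient divided by 0.6745. Whether the study used exactly this rule, and the definition of compression rate as the fraction of coefficients set to zero, are assumptions; the coefficient values are invented for illustration.

```python
import math
from statistics import median


def universal_threshold(detail, n):
    """Universal threshold sigma * sqrt(2 ln n), with a robust sigma estimate."""
    sigma = median([abs(d) for d in detail]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))


def hard_threshold(coeffs, t):
    """Keep coefficients whose magnitude exceeds t; set the rest to zero."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Zero small coefficients and shrink the remaining ones toward zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def compression_rate(coeffs):
    """Fraction of coefficients equal to zero (assumed definition)."""
    return sum(1 for c in coeffs if c == 0.0) / len(coeffs)


# Invented detail coefficients: mostly small noise plus one real feature.
detail = [0.01, -0.02, 0.015, 3.0, -0.01, 0.02, -0.015, 0.01]
t = universal_threshold(detail, len(detail))
kept_hard = hard_threshold(detail, t)
kept_soft = soft_threshold(detail, t)
rate = compression_rate(kept_hard)
```

Because the noise coefficients are small and concentrated in the detail band, a single threshold removes them while keeping the one large coefficient; soft thresholding additionally shrinks that coefficient by the threshold value.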

Principal Component Analysis (PCA)
PCA reduces the dimensionality of the data [31] by linear projection [29,30]. The aim of the PCA model is to find the direction $w$ along which the projected data have the maximum variance (Equation (6)):

$$\max_{\|w\| = 1} \operatorname{Var}\left(w^{T} X\right) \qquad (6)$$

where $X$ represents the data matrix. This is also equivalent to minimizing the distance between each data point and its projection point (Equation (7)):

$$\min_{\|w\| = 1} \sum_{n} \left\| x_n - \left(w^{T} x_n\right) w \right\|^{2} \qquad (7)$$

where $x_n$ denotes the data points.
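The maximum-variance direction can be found numerically in several ways. The sketch below uses power iteration on the sample covariance matrix to obtain the first principal component; this is one simple option, not necessarily the routine used in the study, and the function name and toy data are illustrative. Power iteration assumes the starting vector is not orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (w, scores): unit direction of maximum variance and projections."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    w = [1.0] * p  # starting vector
    for _ in range(iterations):
        v = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]  # keep ||w|| = 1
    scores = [sum(c[j] * w[j] for j in range(p)) for c in centered]
    return w, scores


# Toy data lying on the line y = 2x, so the first direction is (1, 2)/sqrt(5).
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
w, scores = first_principal_component(points)
```

For data lying exactly on one line, a single component captures all the variance and the distance from every point to its projection is zero, which illustrates the equivalence of the two formulations above.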

Monte Carlo (MC) Combined with Uninformative Variable Elimination (UVE)
UVE algorithms are used for feature selection before modeling [25]. MC combined with UVE randomly splits the original training data set into sub-data sets using the MC algorithm and fits the UVE-PLS model on each sub-data set [25,26]. If the coefficient matrix of the spectral data is $\beta = [\beta_1, \beta_2, \cdots, \beta_p]$, then the stability of variable $j$ can be judged by Equation (8):

$$s_j = \frac{\operatorname{mean}(\beta_j)}{\operatorname{std}(\beta_j)} \qquad (8)$$

where the mean and standard deviation of $\beta_j$ are taken over the MC sub-data sets. The values of $s_j$ are arranged in order from large to small, and the top k are retained according to the threshold. The bands corresponding to the top k values are the characteristic bands selected by the MCUVE algorithm.
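The stability ranking can be sketched as follows, assuming the coefficients from each Monte Carlo run are already available as rows of a matrix (fitting the PLS sub-models is omitted). Stability is taken as the mean of a variable's coefficient over the runs divided by its standard deviation, and variables are ranked by the magnitude of that ratio, which is the usual MC-UVE convention and an assumption here. Names and numbers are illustrative.

```python
from statistics import mean, stdev


def stability(coefficient_runs):
    """coefficient_runs[r][j] is the coefficient of variable j in MC run r.

    Returns s_j = mean(beta_j) / std(beta_j) for every variable j. A variable
    whose coefficient never varies would need special handling (std = 0).
    """
    n_vars = len(coefficient_runs[0])
    result = []
    for j in range(n_vars):
        column = [run[j] for run in coefficient_runs]
        result.append(mean(column) / stdev(column))
    return result


def select_top_k(stabilities, k):
    """Indices of the k variables with the largest |s_j|, in ascending order."""
    ranked = sorted(range(len(stabilities)),
                    key=lambda j: abs(stabilities[j]), reverse=True)
    return sorted(ranked[:k])


# Three hypothetical MC runs over three variables: variable 1 flips sign
# between runs, so it is unstable and should be eliminated.
runs = [[1.0, 0.1, -2.0],
        [1.1, -0.1, -2.1],
        [0.9, 0.0, -1.9]]
s = stability(runs)
selected = select_top_k(s, k=2)
```

A variable whose coefficient is large and consistent across runs receives a large |s_j|, whereas one whose coefficient fluctuates around zero is treated as uninformative.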

Partial Least-Square (PLS)
PLS is often used to solve regression problems (partial least-squares regression; PLSR). A classification model can be established by adding a category determination step to PLSR, which is called partial least-squares discriminant analysis (PLS-DA). The PLS algorithm is widely used in NIR spectroscopy for its stability [32][33][34][35]. In the lipid content regression model, for example, the lipid content matrix $Y = (y_{ij})_{n \times m}$ and the absorbance matrix $X = (x_{ij})_{n \times p}$ are decomposed into the form of feature vectors (Equations (9) and (10)):

$$Y = UQ + F \qquad (9)$$

$$X = TP + E \qquad (10)$$

where $U$ and $T$ are the lipid content characteristic factor matrix and the absorbance characteristic factor matrix, respectively, each with n rows and d columns (d is the number of abstract components). In addition, $Q$ (d × m) is the lipid content loading matrix, $P$ (d × p) is the absorbance loading matrix, and $F$ (n × m) and $E$ (n × p) are the lipid content residual matrix and the absorbance residual matrix, respectively. The cross-validation method was used to obtain the value of d.
The PLS method decomposes Y and X according to the correlation of the eigenvectors. The regression model was as follows (Equation (11)):

$$U = TB + E_d \qquad (11)$$

where $E_d$ is a random error matrix and $B$ is a d-dimensional diagonal regression coefficient matrix.
If the absorbance vector of the test sample is $x$ and $t$ is its score vector, the lipid content is expressed as follows (Equation (12)):

$$\hat{y} = tBQ \qquad (12)$$

Unlike PLSR, PLS-DA uses a binary matrix as the response matrix instead of a numerical matrix [33]. There were three category vectors in this study, which were set to [0,0,1], [0,1,0], and [1,0,0].
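For a single response such as lipid content (m = 1), the decomposition and prediction steps can be sketched with the NIPALS form of PLS1. This is one common way to compute a PLS model and is an assumption here; the study does not state which PLS variant its MATLAB code used. All names and data are illustrative. Scores are extracted one component at a time, and a new sample is predicted by applying the same deflation steps.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (x_mean, y_mean, [(w, p, q), ...])."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residual
    f = [v - y_mean for v in y]                                # y residual
    components = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:  # nothing left in y that X can explain
            break
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]                        # scores
        tt = _dot(t, t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = _dot(f, t) / tt                                    # y loading
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict y for one sample by repeating the deflation on its spectrum."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = _dot(e, w)
        y_hat += q * t
        e = [ev - t * lv for ev, lv in zip(e, load)]
    return y_hat


# Toy data generated from y = 2*x1 + 3*x2 + 1.
X_cal = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 7.0]]
y_cal = [9.0, 8.0, 22.0, 18.0, 32.0]
model = pls1_fit(X_cal, y_cal, n_components=2)
y_new = pls1_predict(model, [6.0, 2.0])
```

With as many components as there are independent variables, PLS1 coincides with ordinary least squares, which is why the noise-free toy relation is recovered; in practice the number of components d is kept small and chosen by cross-validation.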

Origin and Lipid Content Calibration Models
Irrelevant information in spectral data can compromise the precision and accuracy of the results [23,25,26]. Existing NIR work indicates that most regression calibration models require only about 10 bands as input [23,25], so after wavelet compression there is still room for further compression and dimension reduction. Feature selection, used to extract useful information and eliminate irrelevant variables, is a crucial step before building the calibration model. In this study, PCA and WT-MCUVE were used for dimension reduction and feature transformation to produce the PCA-PLS classification model and the WT-MCUVE-PLS regression model (Figure 2).


Model Validation
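The stability and reliability of the prediction model can be assessed by parameters such as the root mean square error of cross-validation (RMSECV) and the root mean square error of prediction (RMSEP) [35,36]. More formally, the RMSECV and RMSEP for the NIR calibration model were computed by Equation (13), on the cross-validation and prediction samples, respectively:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}} \qquad (13)$$

where $y_i$ represents the measured value, $\hat{y}_i$ represents the predicted value, and $n$ is the number of samples. Furthermore, the correlation between chemical and spectral data is denoted by the $R^2$ of the calibration and prediction sets [37]. $R^2$ is computed by Equation (14):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^{2}} \qquad (14)$$

where $\bar{y}$ denotes the average measured value. The percent root mean square difference (PRD) is used as a criterion for judging the degree of distortion of data [35], and can be expressed by Equation (15):

$$\mathrm{PRD} = \sqrt{\frac{\sum_{j} \left(x_{ori}(j) - x_{rec}(j)\right)^{2}}{\sum_{j} x_{ori}(j)^{2}}} \times 100\% \qquad (15)$$

where $x_{ori}(j)$ corresponds to the raw data, and $x_{rec}(j)$ corresponds to the reconstructed data (signal).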
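The three evaluation measures can be written as short functions. The sketch below assumes the standard definitions: root mean square error over n samples (RMSECV or RMSEP depending on which samples are supplied), R² as one minus the ratio of residual to total sum of squares, and PRD as the relative reconstruction error in percent. The numbers are invented for illustration.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(measured, predicted)) / n)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(measured) / len(measured)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(measured, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


def prd(original, reconstructed):
    """Percent root mean square difference between a signal and its reconstruction."""
    num = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    den = sum(o ** 2 for o in original)
    return math.sqrt(num / den) * 100.0


# Invented values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
error = rmse(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
distortion = prd([3.0, 4.0], [3.0, 3.0])
```

A lower RMSE and PRD and an R² closer to 1 indicate a better model or a less distorted reconstruction, respectively.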

Quantitative Analysis of Lipid Content
The chemical analysis of the lipid content of the pine nut samples collected from Yichun, Heihe, and the Changbai Mountains is reported in Table 1. The lipid content of the samples was approximately 60%, and the differences in content depend on the temperature, humidity, soil conditions, and altitude of the different growth environments [1,4]. In addition to having different environmental conditions, the three places of origin selected in this study also had different low-temperature periods in each year. [...] 0.94, and 1.52, and 60.75, 0.71, and 1.16, respectively. The absolute value of the difference between the mean and the extremes of the samples ranged from 1.7% to 2%.

Spectral Data and Preprocessing Results
Figure 3 shows the raw and SNV pretreated spectra of pine nuts with shells. The shell of pine nuts has a uniform texture and produces smooth spectra. During the measurement, it was difficult to keep the distance from the fiber probe to the pine nut sample constant. The SNV method can effectively remove the systematic differences caused by optical path changes.
Raw spectral data were collected from ground samples for regression analysis. After SNV processing, the data were compressed by wavelet threshold denoising. Figure 4 shows the raw spectra, the SNV pretreated spectra, and the spectra obtained by inverse transformation after compression with the 'db9', 'bior4.4', 'sym8', and 'coif4' wavelets. High-frequency noise was filtered out when the NIR spectral bands were reconstructed with 'db9'. Compared with the original NIR spectra, the spectra of 'bior4.4' and 'sym8' were distorted in the 1100-1200 nm and 1600-1700 nm subintervals.

Figure 4. NIR raw spectra, SNV pretreated spectra, and wavelet compression spectra of pine nut powder: (a) the raw spectra; (b) the SNV pretreated spectra; (c) the spectra after being compressed by 'db9'; (d) the spectra after being compressed by 'bior4.4'; (e) the spectra after being compressed by 'sym8'; (f) the spectra after being compressed by 'coif4'.

In this study, the four wavelet basis functions 'db9', 'bior4.4', 'sym8', and 'coif4' were used for three-level decomposition. The compression performance of four wavelet compression methods was tested: the Birge-Massart Strategy, SURE Shrink Thresholding, Donoho Thresholding, and Soft Thresholding. The best-performing wavelet compression method was selected, with the PRD used as the criterion for judging the degree of data distortion [38]. Table 2 shows the compression rate and the degree of data distortion of the four methods. Soft Thresholding was too coarse: its data integrity was the worst of all, with a PRD ranging from 0.37% to 0.39%. SURE Shrink Thresholding and the Birge-Massart Strategy focus on improving the compression rate. Donoho Thresholding emphasizes data integrity, but its compression rate was the worst of all the methods.
Because the feature selection of wavelet coefficients was performed using MCUVE, the desired effect of wavelet compression was less data loss rather than a larger compression rate. This approach filtered out a small amount of high-frequency, low-amplitude information, reducing the noise effectively and making the spectral data smoother. Based on the results of Table 2 and Figure 4, a compromise was made between data loss and compression rate. According to the actual requirements of this study, Donoho Thresholding with the 'bior4.4' wavelet filter was chosen to compress the NIR raw data, with a corresponding compression rate of 84.7487% and a PRD of 0.21%.

Results of Principal Component Analysis (PCA)
The visualization of the high-dimensional data reduced to two and three dimensions by PCA is displayed in Figure 5. When the data were reduced to two dimensions, the three groups of samples were completely separated. With three principal components, the three sets of sample data points were still not interleaved in three-dimensional space. Comparing the two cases, the distance between data points of similar samples was smaller in three-dimensional space. From the visualization, it can be concluded that reducing the dimensionality to two or three dimensions by PCA had similar effects. Therefore, there is no need to choose more principal components.

Results of Monte Carlo-Uninformative Variable Elimination (MCUVE)
MCUVE was used for feature selection of wavelet coefficients. Figure 6 shows the results of the stability of subsets of spectral bands selected by MCUVE. The top 70 sets of wavelet coefficient data with the best stability were selected and used to establish a PLS calibration model of the relationship between the pine nut fruit lipid content and the NIR spectrum.


Results of the Classification Model
Table 3 shows the output of each parameter in the calibration and prediction sets under the different classification schemes. The calibration model was evaluated with Precision, Recall, and F1 [30]. Values of the three indices are in the range of 0 to 1: the closer to 1, the better the performance of the model. The three indices of the SNV-PCA-PLS calibration model were better than those of the calibration model without PCA dimension reduction, which were 1.0, 0.94, and 0.97, and 0.97, 0.95, and 0.97, for Precision, Recall, and F1, respectively.
Combined with the results of Figure 5 and Table 3, when the input data were reduced by PCA to two and three dimensions, the accuracies of the calibration and prediction sets were the same and were the highest, at 98.75% and 97.5%, respectively. Thanks to the significant reduction in the dimensionality of the input data, the average training time and average prediction time were reduced to 2.61 s and 0.91 s, respectively. Although the choice of three principal components had a better Recall value, the distance between data points of similar samples was closer. From the perspective of improving modeling efficiency, the SNV-PCA-PLS scheme with two-dimensional input data produced the best overall performance of the classification model.
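The category determination step of PLS-DA and the indices used in Table 3 can be sketched together. The code below assumes that each sample is assigned to the class whose response column has the largest predicted value, and computes Precision, Recall, and F1 for one class in the usual one-versus-rest way; the predicted responses and labels are invented, and the mapping of columns to places of origin is hypothetical.

```python
def decode(outputs):
    """Assign each sample to the class whose predicted response is largest."""
    return [max(range(len(row)), key=row.__getitem__) for row in outputs]


def accuracy(true_labels, predicted_labels):
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


def precision_recall_f1(true_labels, predicted_labels, label):
    """One-versus-rest Precision, Recall, and F1 for a single class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented PLS-DA responses for four samples and three classes.
responses = [[0.9, 0.1, 0.0],
             [0.2, 0.7, 0.1],
             [0.1, 0.3, 0.6],
             [0.4, 0.5, 0.1]]
truth = [0, 1, 2, 0]
predicted = decode(responses)
acc = accuracy(truth, predicted)
prec0, rec0, f1_0 = precision_recall_f1(truth, predicted, label=0)
```

In this toy case the last sample is misclassified, so class 0 has perfect Precision but a Recall of 0.5, which shows why the three indices are reported alongside accuracy.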

Results of the Regression Model
The RMSECV and RMSEP of each method and the coefficient of determination $R^2$ are shown in Table 4. The relationship between the observed lipid content and the results predicted by each calibration model is shown in Figure 7.
The performance of the PLS model was worse than that of the MCUVE-PLS model, and the WT-MCUVE-PLS model performed better than the MCUVE-PLS model. The RMSECV and RMSEP of the PLS, MCUVE-PLS, and WT-MCUVE-PLS models were 0.0407 and 0.1396, 0.0449 and 0.1556, and 0.0098 and 0.0390, respectively.
The WT-MCUVE-PLS model had the highest $R^2$ in the calibration and prediction sets, 0.9485 and 0.9369, respectively. The model therefore has better accuracy and correlation when the input data undergo both preprocessing and feature selection. Comparing $R^2$, the performance of the WT-PLS model was worse than that of the UVE-PLS model. The RMSECV and RMSEP of the two models were 0.0808 and 0.1491, and 0.0159 and 0.0875, respectively. Since the input data of the UVE-PLS model were not pretreated and the number of features was reduced, the $R^2$ of its calibration and prediction sets was the second highest of all the models, which explains the distribution of data points in Figure 7. The numbers of input matrix features of the eight models were 511, 100, 70, 154, 70, 511, 80, and 50 (Table 4). After a grid search was used for global optimization, the MCUVE-PLS model had the best quality when the feature number of the MCUVE output matrix was 70. To achieve the best results, the feature number of the input matrix of WT-PLS was reduced to 154 by WT. Under the joint action of WT and MCUVE, the final feature number of the input matrix of WT-MCUVE-PLS was 70. Model performance gradually improved as the number of input matrix features decreased. Therefore, choosing a more efficient data compression scheme can significantly improve model performance. Of the above schemes, WT-MCUVE-PLS had the best predictive performance. Among the other methods, the regression performance of PCR and SPA-PLS was good, but there was still a gap compared with WT-MCUVE-PLS. PCA-PLS had poor regression performance because of the limited amount of information carried after dimension reduction.

Discussion
In this study, PCA preserved enough feature information in the input data of the classification model for classification. The reduced computation greatly shortened the time required for modeling and prediction while improving prediction accuracy. In establishing the regression model, wavelet transformation was applied to the spectral data and the MCUVE algorithm was used to select features. The WT-MCUVE-PLS model was superior to the traditional PLS, WT-PLS, and UVE-PLS models in terms of computation, prediction accuracy, and time consumption, and it also predicted considerably better than PCA-PLS, SPA-PLS, and PCR. The advantages were as follows. In the wavelet domain, the discrete wavelet transformation (DWT) converts discrete signals into approximation and detail wavelet coefficients quickly and conveniently. The data collected by the spectrometer are distributed in the range of 600-2400 nm in the form of discrete data points, as is common for mainstream spectrometers [39][40][41]. Therefore, the DWT has good prospects in NIR analysis, especially for noise reduction and spectral data compression. Compared with using UVE to extract features from the modeling data, MCUVE has the advantage that the MC algorithm splits the original data set into subsets, each containing far less data, which improves the efficiency of feature selection.
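The compression and denoising roles of the DWT can be sketched as follows. This is a simplified illustration that swaps in the Haar wavelet, which needs only NumPy, in place of the 'bior4.4' filter used in the study; the universal threshold (σ√(2 ln n), with σ estimated from the median absolute detail coefficient) is the standard Donoho-Johnstone rule and is assumed here rather than taken from the study. The synthetic spectrum, the function names, and the choice of two decomposition levels are hypothetical.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:                      # pad odd-length input with last point
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def compress(signal, levels):
    """Keep only the approximation coefficients after `levels` decompositions."""
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, _detail = haar_dwt(approx)
    return approx

def universal_threshold(detail, n):
    """Donoho-Johnstone universal threshold with a MAD-based noise estimate."""
    sigma = np.median(np.abs(detail)) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(n))

def soft_threshold(coeffs, thr):
    """Shrink coefficients toward zero by thr; smaller ones become zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

# Synthetic 511-point "spectrum" over 600-2400 nm (illustration only).
rng = np.random.default_rng(0)
wavelengths = np.linspace(600.0, 2400.0, 511)
spectrum = np.exp(-((wavelengths - 1700.0) / 150.0) ** 2)
spectrum += 0.01 * rng.normal(size=wavelengths.size)

approx, detail = haar_dwt(spectrum)
thr = universal_threshold(detail, spectrum.size)
detail_denoised = soft_threshold(detail, thr)
compressed = compress(spectrum, levels=2)
print("features before:", spectrum.size, "after:", compressed.size)
```

With the Haar wavelet each level roughly halves the number of coefficients; longer filters such as 'bior4.4' produce slightly different lengths because of boundary extension, so the feature counts here are not expected to match those in Table 4.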
The algorithm in this study, combined with a portable NIR spectrometer, has practical significance for the production of P. koraiensis seeds and other kinds of nuts with shells. The characteristics of the food itself and its specific organic content can be tested quickly by analyzing the spectral data of the shells and kernels in combination with the actual chemical properties. The nutshell is the hard woody covering around the kernel of a nut. Both pine nutshells and wood contain a large amount of cellulose, and their spectral curves are similar. The physical and chemical properties of pine nuts are similar to those of peanuts and other nuts. The qualitative analysis of wood by NIR technology and the quantitative analysis of nutrient content in foods, especially nuts, therefore served as important references for this study [7,36]. This technology is applicable to the classification of food and improves the related detection capabilities of spectroscopic equipment. These analysis methods could also be extended to the analysis of other foods.

Conclusions
In this study, portable NIR spectroscopy was successfully combined with effective chemometric methods to classify and identify the places of origin of P. koraiensis seeds, and the lipid content of the inner kernel was predicted by a regression model. Compared with traditional methods, the NIR detection method was more rapid, non-destructive, and environmentally friendly. A PCA-PLS classification model of places of origin was established. PCA greatly reduced the dimension of the input data, which improved the prediction accuracy and reduced the time cost. The results indicate that the best scheme was to reduce the input data to two dimensions by PCA. In the study of the lipid content prediction model, the compression effect of wavelet transformation on the input data was discussed in detail. By combining WT, MCUVE, and PLS, a WT-MCUVE-PLS prediction model of lipid content was established. This method was efficient and accurate for the preprocessing and feature selection of input data. Compared with PCA-PLS, SPA-PLS, and PCR, the WT-MCUVE-PLS model had the best prediction results. A well-chosen wavelet compression scheme combined with the MCUVE method can produce a more concise and effective model. This approach can enable the onsite analysis of P. koraiensis seeds.