Visible-Near Infrared Spectroscopy and Chemometric Methods for Wood Density Prediction and Origin/Species Identification

: This study aimed to rapidly and accurately identify geographical origin, tree species, and model wood density using visible and near infrared (Vis-NIR) spectroscopy coupled with chemometric methods. A total of 280 samples with two origins (Jilin and Heilongjiang province, China), and three species, Dahurian larch ( Larix gmelinii (Rupr.) Rupr. ), Japanese elm ( Ulmus davidiana Planch. var . japonica Nakai), and Chinese white poplar ( Populus tomentosa carriere ), were collected for classification and prediction analysis. The spectral data were de-noised using lifting wavelet transform (LWT) and linear and nonlinear models were built from the de-noised spectra using partial least squares (PLS) and particle swarm optimization (PSO)-support vector machine (SVM) methods, respectively. The response surface methodology (RSM) was applied to analyze the best combined parameters of PSO-SVM. The PSO-SVM model was employed for discrimination of origin and species. The identification accuracy for tree species using wavelet coefficients were better than models developed using raw spectra, and the accuracy of geographical origin and species was greater than 98% for the prediction dataset. The prediction accuracy of density using wavelet coefficients was better than that of constructed spectra. The PSO-SVM models optimized by RSM obtained the best results with coefficients of determination of the calibration set of 0.953, 0.974, 0.959, and 0.837 for Dahurian larch, Japanese elm , Chinese white poplar (Jilin), and Chinese white poplar (Heilongjiang), respectively. The results showed the feasibility of Vis-NIR spectroscopy coupled with chemometric methods for determining wood property and geographical origin with simple, rapid, and non-destructive advantages.


Introduction
With the rising consumption of wood and the competitiveness among export firms, timber quality traceability systems are becoming increasingly important with the ability of informing the identity or provenance of wood. Recently, many studies have been conducted in relation to the discrimination of origin and valuable species, especially for the Convention on International Trade in Endangered Species (CITES) listed species. Silva et al. indicated that partial least squares for discriminant analysis (PLS-DA) was better than soft independent modeling of class analogy (SIMCA) when the origin of true mahogany (CITES Appendix II) was identified with near infrared (NIR) spectroscopy [1]. A study from Bergo et al. [2] demonstrated that the correct classification rate of mahogany higher than 96.8% using NIR technology. Braga et al. employed NIR spectroscopy and novelty of the paper as these methods have been limited to fields other than forestry and forest products.
This study aimed to build reliable qualitative and quantitative models for the identification of geographical origin and tree species while simultaneously predicting wood property, an important skill needed for a wood traceability system. Taking density as an example, 280 wood samples from two different geographical origins belonging to three of the major commercial species (i.e., Dahurian larch, Japanese elm, Chinese white poplar) were collected for spectra collection and density estimation. The spectra were processed by lifting wavelet transform (LWT) [34], and different reconstruct means for spectra (wavelet coefficients and constructed spectra after de-noising) were also compared. The support vector machine (SVM) optimized by particle swarm optimization (PSO) was used for tracking origin and identifying species. To improve the prediction accuracy of density, the response surface methodology (RSM) was performed for the optimization of the best combined parameters of PSO-SVM models.

Sample Preparation
A total of 280 wood samples divided equally into three species were harvested from two different locations (Table 1). The same species need to be collected from multiple sites at the same age-class to determine site of origin. Therefore, Chinese white poplar were collected from Jilin and Heilongjiang, independently. Eight Dahurian larch and seven Chinese white poplar, with tree heights ranging from 17.2 to 22.6 m and 21.9 to 23.6 m, respectively, were from Heilongjiang province (131°08'-131°21'E, 45°44'-45°53'N). Eleven Japanese elm and seven Chinese white poplar were from Jilin province (126°30'-127°16'E, 42°06'-42°48'N), with tree heights ranging from 8.5 to 20.1 m and 22.4 to 23.8 m, respectively. Five-centimeter-thick (5 cm) disks were removed from each tree at 2 m intervals along the stem with a total of seventy discs prepared for each species for model calibration and spectra collection. Additionally, to reduce the roughness of sample surface, the cross-section of all disks were polished using an electric plane (Eastern sketcher hardware tools, Yong Kang, China).

Spectra Collection and Density Measurement
The Vis-NIR spectra were acquired using a LabSpec Pro FR/A114260 (Analytical Spectral Devices, Inc., Boulder, USA) from 350 to 2500 nm with a spectral resolution of 3 nm @700 nm, 10 nm @1400/2100 nm. Before spectral collection, the spectrometer was preheated for 30 minutes and calibrated with a commercial white plate made of polytetraflouroethylene (PTFE) and cintered halon. It had the nature of being nearly 100% reflective within the whole wavelength range (350-2500 nm). White references were collected every 15 minutes from the surface of the white plate. A total of 30 scans were acquired and automatically averaged into one spectrum. Each sample was scanned three times from the cross-section with a glare probe (unit 6523 h.i. contact probe) and the average spectrum was regarded as the raw spectrum. Additionally, to reduce the influence of baseline shifts, baseline offset was employed and implemented in the multivariate calibration software, The Unscrambler V10.4 (CAMO Software AS, Oslo, Norway). The density of the samples was measured according to GB/T 1933-2009.

Vis-NIR Spectral Data Processing
The procedure of lifting wavelet transform (LWT) includes split, prediction, and update ( Figure  1). The detailed procedure of LWT is shown in Zhou et al. [35]. The reason that LWT can be used for the de-noising or compression of spectra is that the time domain can be replaced by the wavelength domain. The spectral data in the original wavelength domain can be transformed into coefficients, approximation coefficients (AC) and detail coefficients (DC), in a new wavelength domain when LWT is performed. Additionally, this transformation from the original domain to a new domain is time-saving and not biased [36,37]. Therefore, the spectra with a wavelength range between 350 to 2397 nm were de-noised using a biorthogonal mother wavelet method from first to seventh level on the basis of lifting scheme to eliminate the influence of optical and electrical noise and the performance of wavelet coefficients (the AC retaining the low-frequency content), and the construct spectra with AC and DC were compared in this study. In addition, the de-noising results of different order and decomposition level were analyzed using partial least squares (PLS) regression to obtain the optimal de-noising parameters. The LWT was implemented using in-house algorithms of Matlab R2010b (MathWorks, Natick, USA).

Optimization of Support Vector Machine Models
After the establishment of a linear model from PLS using raw spectra and the wavelet coefficients, the support vector machine (SVM) procedure was performed to analyze the nonlinear relationship between spectra and wood properties. SVM has been widely applied to classification and regression in various sectors with satisfactory model accuracy [38][39][40][41], but to our knowledge few studies [42,43] have been tested in forest-based materials. The mathematical background of SVM is described in Wang et al. [44]. However, the selection of appropriate parameter values is a key factor affecting model performance. The particle swarm optimization (PSO) procedure has been proven to optimize SVM parameter values on the basis of the social and cooperative behavior of species such as birds that achieves their needs in a multi-dimensional space [45,46]. It was adopted to track geographical origin/tree species and determine density. The SVM and PSO were also implemented in Matlab.
For the preferable optimization results of PSO, the number of initial individuals (population size) and maximum generation should be determined before optimization, the cross-validation number is also critical for the optimization process. Furthermore, these parameters are selected by trial and error or through searching many experiments across a large number of studies [47][48][49][50]. To investigate the relationship between parameters (cross-validation number, maximum generation, and population size) and PSO-SVM model accuracy, the response surface methodology (RSM) with minimum data and resources was employed to design the best combined parameters, and Box-Behnken design with a three-level (lower, equal, and high levels) and a three factor (cross-validation number, maximum generation, and population size) was applied in this study. The variables were coded at lower (−1), equal to (0), and high levels (1), respectively ( Table 2). The analysis of the variance (ANOVA) was used to analyze the results. RSM was implemented in Design-Expert Software Version 11 (Stat-Ease, Minneapolis, Minnesota, USA). More information about RSM can be found in He et al. [51].

Overview of Spectral Data Analysis
The spectral data were analyzed in quantitative and qualitative analyses. For each species, the sample with highest and lowest density value in the calibration and then spectral data were randomly divided into a calibration set (N = 52) and prediction set (N = 18) using simple random sample method so that range of the physical properties was greatest for the calibration set. This was implemented using Matlab R2010b (MathWorks, Natick, MA, USA). First, the spectra for the calibration set were de-noised using biorthogonal mother wavelet (biorNr.Nd, Nr and Nd are order for reconstruction and decomposition, respectively) from the first to seventh decomposition level. The performance of wavelet coefficients and reconstructed spectra were compared to PLS models implemented in Unscrambler V10.4 (CAMO Software AS, Oslo, Norway). The optimal de-noising parameters for these species were determined by the performance of PLS. The response surface methodology (RSM) was then performed to design the best combined parameters for the estimation of density using PSO-SVM model. Finally, wavelet coefficients with the best de-noising parameters were used for classification with SVM optimized by PSO. The performance of identification origin and species using wavelet coefficients was compared to PSO-SVM model developed on the raw spectra. The total treatment was as shown in Figure 2.

Vis-NIR Spectra Analysis
The raw Vis-NIR spectra for these species are shown in Figure 3. As for Dahurian larch, the spectra had similar absorbance patterns in the range of 350-2008 nm but the relative intensities of band in 353-835, 1424-1828, and 1897-2397 nm were higher than that of other species. Additionally, there were two obvious peaks around 2110 and 2272 nm, which were associated with the O-H stretching of cellulose and C-H stretching of hemicellulose, respectively. This difference may be due to the different wood properties between softwood and hardwood. In terms of Chinese white poplar harvested from Jilin and Heilongjiang, the different growing conditions led to the three main peaks around 1196, 1451, and 1929 nm associated with the lignin/cellulose, lignin, and water absorbance bands, respectively, moving to the right, which was different from that of Chinese white poplar from Heilongjiang.

Quantification Models
The statistical characteristics for modeling density for the calibration and prediction set are provided in Table 3. As for these species, although the range of density was different, the range in the prediction set was within the calibration set for each species. Additionally, the values of average and standard deviation between calibration and prediction set were similar, which demonstrated that the dataset was as representative of the species in these sites as possible.  The performance of wavelet coefficients and constructed spectra were also compared on the basis of the results of PLS models. Higher coefficient of determination (R 2 ) value indicated a good denoising result. The de-noising results of different decomposition level and biorthogonal wavelet family are shown in Figure 4. It was observed that the de-noising results were dependent on denoising parameters, which was indicative of the potential that different wavelet parameters may yield on calibration models. As for the best performance of Dahurian larch, although wavelet coefficients yielded the same results with constructed spectra, the de-noising parameters were different, which were bior2.8 under a six decomposition level and bior1.3 with a fourth level, respectively. In the models of Japanese elm, little improvement was obtained after pretreatments; the wavelet coefficients of bior2.8 with a sixth level were the most effective, being better than the optimal results of constructed spectra (bior1.5 with third level). As for Chinese white poplar harvested from Jilin, the wavelet coefficients and constructed spectra pretreated by bior2.8 under the second level and bior1.5 under the second level, respectively, improved PLS performance as they increased R 2 up to 0.840. As for Chinese white poplar from Heilongjiang, the performance of the optimal wavelet coefficients (bior2.8 with six decomposition level) was better than that of constructed spectra (bior2.8 under seven decomposition level).
As for the de-noising performance of the biorthogonal wavelet family, an obvious trend from bior1.1 to bior2.8 with the same decomposition level was not observed, which is in agreement with the results of Zhang et al. [52]. Compared to the constructed spectra processed by LWT, except for Dahurian larch, the R 2 of the best model was the same-wavelet coefficients obtained better accuracy than for constructed spectra. Additionally, 2048 spectral variables were reduced to 512 (Chinese white poplar from Jilin) and 32 (Dahurian larch, Japanese elm, and Chinese white poplar from Heilongjiang) wavelet coefficients. This was an indication that the quality of the information was perhaps improved even with the reduction in the amount of information needed for calibration or analysis. Therefore, the spectra were processed with the optimal de-noising parameters, and wavelet coefficients were obtained. These wavelet coefficients were used for further analysis. The optimal wavelet coefficients for each species are shown in Figure 5. The SVM models optimized by PSO were used for analyzing the relationship between wavelet coefficients and density. The cross-validation number, population size, and maximum generation were assumed to be 5, 10, and 5, respectively. The results of PLS regression using raw spectra and wavelet coefficients and PSO-SVM models using wavelet coefficients are shown in Table 4. It can be seen that, for these species, the performance of PSO-SVM models were better than the conventional enhanced PLS method, especially for Dahurian larch. However, the accuracy of Japanese elm improved slightly when going from the PLS to the PSO-SVM method (Table 4). This could be caused by the default variable values in the process of modeling. To investigate the effect of variables (i.e., cross-validation number, maximum generation, and population size) on the prediction accuracy of the PSO-SVM models for the estimation of density, a Box-Behnken design with three levels and three factors was used to construct a response surface trace. The results of the ANOVA for these four species are respectively shown in Tables 5-8. X1, X2, and X3 represent the cross-validation number, maximum generation, and population size, respectively. DF, SS, and MS represent the degrees of freedom, sum of squares, and mean square, respectively. As seen in Table 5, for Dahurian larch, the CVmse (mean squared error of cross-validation) of PSO-SVM models was more significantly affected by cross-validation number and population size (p < 0.01) than by maximum generation at p < 0.05. All two-way interaction terms were non-significant. The quadratic variables of and were more significant, whereas was significant at the level of p < 0.05. In the RSM model of Japanese elm and Chinese white poplar from Jilin and Heilongjiang (Tables 6 and 7), the linear terms of , , and were significant for the former two species, was significant for Chinese white poplar from Heilongjiang, and only the interaction of and were significant at p < 0.05 for Japanese elm and Chinese white poplar (Jilin), respectively. The quadratic of was more significant at the level of p < 0.01 for Dahurian larch, Japanese elm and Chinese white poplar from Jilin and Heilongjiang, and was significant for Chinese white poplar from Jilin. These four models obtained a good fit, as showed values up to 0.85 that were highly significant (p < 0.01). Figure 6 is an illustration of the changes of CVmse of PSO-SVM models in terms of parameters of cross-validation number ( ), maximum generation ( ), and population size ( ). As for Dahurian larch ( Figure 6A), it was concluded that the CVmse decreased with the cross-validation number decreasing from 15 to 5, and the lowest value of CVmse was observed at level (−1) for the crossvalidation number. The minimum value of CVmse was obtained with the cross-validation number, maximum generation, and population size at 5, 57, and 60, respectively. In terms of the Japanese elm ( Figure 6B), the CVmse increased with the decreasing cross-validation number, and its lowest value was observed when the cross-validation number, maximum generation, and population size were 15, 73, and 45, respectively. For Chinese white poplar harvested from Jilin and Heilongjiang ( Figure  6C,D), increasing the cross-validation number reduced the CVmse, and the minimum value was obtained with the cross-validation number, maximum generation, and population size at 15, 86, and 60 for Jilin and 10, 50, and 60 for Heilongjiang, respectively.
Thus, regardless of different tree species, the CVmse of PSO-SVM models was significantly affected by the cross-validation number. However, different tree species exhibited different variations, and different optimal parameters were obtained in this study, suggesting that there is not a universal optimization parameter that works for all scenarios. The reason for this difference may have been due to the differences of wood properties and growing conditions. To verify the reliability of the RSM models, PSO-SVM models were performed at the optimal parameters for these species. The results of PSO-SVM models optimized by RSM are shown in Table  9. It can be seen that of models for these species were above 0.80, indicating the accuracy of the models were improved by using RSM optimization when compared to conventional PLS models and PSO-SVM models without RSM optimization. Additionally, for the PSO-SVM model of Japanese elm, was increased from 0.844 to 0.974 after RSM optimization, which demonstrated that the feasibility of optimization of PSO-SVM model parameters on the basis of the RSM method.

Qualitative Analysis Models
The technology of rapid classification of geographical origin and tree species is needed and important for the establishment of a timber quality traceability system. Due to a good performance of wavelet coefficients decomposed by the optimal de-noising LWT parameters, PSO-SVM models using wavelet coefficients were used to identify geographical origin and tree species, and the results of raw spectra were also analyzed.
Chinese white poplar samples harvested from Jilin and Heilongjiang were used to determine geographical origin. Figure 7 and Table 10 show the clustering results of geographical origin and the detailed description of the classification results, respectively. Category labels 0-1 are the samples from Jilin and Heilongjiang, respectively. It can be seen that the training dataset accuracy of the PSO-SVM model based on wavelet coefficients had similar results with the modeling of raw spectra with a correctness of approximately 100% (data no shown), indicating the effectiveness of the SVM model optimized by PSO for differentiating the origin. Additionally, for the prediction set, regardless of raw spectra and LWT coefficients, the accuracy of geographical origin was 100%.   Figure 8 and Table 11 show the clustering results of species and the detailed description of the classification results, respectively.   As seen in Figure 8 and Table 11, the PSO-SVM model using wavelet coefficients demonstrated that these tree species were well-separated with a correctness of 100% and 98.61% for the training (data no shown) and prediction set, respectively. This method was more effective than models built with the raw spectra due to the higher accuracy. As for the performance of prediction set, the accuracy of wavelet coefficients was better than that of raw spectra. The difference of prediction results was due to Dahurian larch and Japanese elm. In the model of LWT wavelet coefficients, Dahurian larch samples were correctly classified, but one Japanese elm sample was misjudged to Dahurian larch. Regardless of geographical origin, Chinese white poplar from Jilin and Heilongjiang were wellseparated without misjudgment for raw spectra and wavelet coefficients. This is an indication that the PSO-SVM combined with LWT may be a powerful method not only in simplifying the spectral data dimension, but also in classification of the geographical origin and tree species at the same time with high discrimination accuracy.

Discussion
LWT was used for de-noising and the optimal de-noising parameters were discussed in the prediction of wood tracheid length in our previous study [34]; the results demonstrated that LWT, as a second-generation wavelet, has more power to suppress noise and improve model performance when compared to other de-noising methods such as moving average, loess, Savitzky-Golay, and lowess. According to the advantages of LWT, the performance of models using wavelet coefficients of LWT and constructed spectra were first compared in this study. The ability to effectively predict wood density was due to a significant correlation between specific absorption bands of chemical properties (lignin, cellulose, and hemicellulose), and their C-H, O-H, and other N-H groups had obvious absorption in the Vis-NIR region [53,54]. Therefore, the improvement of spectral quality is important to decrease the error in the modeling. LWT was employed to remove noise and other irrelevant information for improving the accuracy in this study. The results show that, except for Dahurian larch, the results were similar, and the best performance of wavelet coefficients were better than that of constructed spectra when raw spectra were decomposed from the first to seventh decomposition level using biorthogonal family (Figure 4). This was mainly because some features of wood are inconspicuous in the original Vis-NIR spectral domain, but are magnified and become obvious in a new domain (wavelet coefficients of LWT) [55]. In addition, the transformation of these wavelet coefficients is not biased. This fills in the shortcomings that some useful information was removed and the accuracy of the model was decreased for constructed spectra [56,57].
In terms of the optimal de-noising parameters of LWT, it can be found that different species may obtain better calibration models with different de-noising parameters, demonstrating that there is no universal parameter that works in all cases. The reason that the optimal decomposition level between Chinese white poplar from Jilin (level = 2) and other species (level = 6) is different may be due to a small amount of noise for the former; namely, the noise was removed while retaining the useful information when spectra were decomposed into two levels. However, for other species, more decomposition levels are needed to remove some unobvious noises of spectra. Additionally, there was no obvious trend with the increasing order of the biorthogonal family. This result was in line with the finding reported by Zhang et al. [52]. Despite different effects of various combination of LWT parameters on wood spectral de-noising, good performance for different species was obtained and spectral dimensions were also reduced in this study. This not only simplified the complexity of modeling, but also saved time and memory.
As for the performance of linear PLS models (Table 4), regardless of raw spectra and wavelet coefficients, the RMSE values for the Chinese white poplar (Jilin) were better than other species. Therefore, the loadings of the spectral wavelength variables for these species were compared ( Figure  9). The loading spectra indicate which wavelength variables were associated with the highest variation. Frequencies of high variation reflect contributions of spectra correlated with wood properties. As seen in Figure 9, the wavelengths with the highest variation for Chinese white poplar were mainly at 354, 1451, 1939, and 2395 nm, which were higher than that of other species, suggesting the highest contribution of spectra for Chinese white poplar (Jilin) density. SVM optimized by PSO has been widely applied to predict the complicated sample properties. Zhang et al. [58] applied PSO and genetic algorithm (GA) to optimize the parameters of SVM, and they found that PSO showed a better learning ability and generalization in wood drying process modeling. A study presented by He et al. [59] indicated that PSO-SVM model is more accurate than the routine linear regression model for the water quality retrievals of Weihe River in Shanxi province. Their results demonstrated the improvement of SVM models through the optimization of PSO. The results in our study support their findings (Table 4). However, the parameters of PSO including population size, maximum generation, and number of cross-validation influenced the performance of PSO-SVM models; these parameters were selected by randomized trial or searching many experiments across many studies [47][48][49][50]. Thus, RSM was employed to design the best combined parameters. It was found that the cross-validation number had a significant effect on PSO-SVM models, regardless of different species (Tables 5-8). For the optimization of RSM, the prediction accuracy ( 2 ) of Japanese elm was lower than the other three species, with this possibly being due to a narrow range for modeling density values or small-sample prediction set (N = 18). In a follow-up study, a wide range of representative large samples are needed for good prediction. However, the performance of calibration models was improved and the of models up to 0.80 for these species (Table 9). This demonstrated that the feasibility of the optimization was based on RSM.
Identification of geographical origin and tree species is critical for the establishment of a timber quality traceability system. Sandak et al. [60] applied Fourier transform near infrared (FT-NIR) to track the provenance of spruce, and they found that FT-NIR is sensitive enough to detect wood differences of different provenance. Yang et al. demonstrated correctness of species up to 90% when NIR spectroscopy and PLS-DA were employed to identify wood species from different locations [61]. It was observed that PSO-SVM models using wavelet coefficients obtained superior performance relative to that of the raw spectra, regardless of geographical origin and tree species (Figures 7 and  8). The reason behind this is that wavelet coefficients extract feature information of spectral signals, resulting in the simplification of the model structure and improvement of the model robustness [62]. Different provenances have influence on wood properties [63]. A total of 280 wood samples among three species were harvested from two different origins, that is, Heilongjiang and Jilin provinces, China, in this study (Table 1). There are big differences in climate and soil type. The former is mainly an East Asian monsoon climate with humus soil. The latter is a temperate continental monsoon climate with marsh soil. Additionally, precipitation and frost-free period are also different. These growing conditions result in wide a variation in wood properties and spectral data, and these large differences between species corresponded to the changes of Vis-NIR spectra, achieving the rapid discrimination of geographical origin and tree species on the basis of Vis-NIR spectroscopy. Therefore, due to the difference of species, what is important and required in practice is the discrimination of origin of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES)-listed species, such as mahogany and Dalbergia nigra, which will provide a powerful technical tool for detecting logs harvested illegally from protected areas.

Conclusions
The estimation of wood property and identification of the origin and species simultaneously are important for wood traceability, yet are challenging when using traditional methods. This study demonstrated that the wavelet coefficients based on LWT not only simplify the dimension of the spectral data, but also improve model accuracy. Vis-NIR spectroscopy coupled with chemometric analysis discriminates the geographical origin/tree species and predicts wood properties with high accuracy concurrently, which provides rapid and non-destructive methods to obtain information on the wood property for the establishment of a wood traceability system.