Multivariate Discriminant Analysis of Single Seed Near Infrared Spectra for Sorting Dead-Filled and Viable Seeds of Three Pine Species: Does One Model Fit All Species?

Abstract: Seed lots of pine species are composed of viable, dead-filled and empty seeds, and complete sorting of dead-filled seeds using the conventional method (Incubation, Drying and Separation in water) is difficult to achieve, leaving considerable scope for upgrading the sorting efficiency. The objective of this study was to evaluate the prospect of sorting viable and dead-filled seeds of pine species using Near Infrared (NIR) spectroscopy. To demonstrate this, dead-filled and viable seeds of Mason's pine, slash pine and loblolly pine were incubated in moist medium for three days, dried for six hours and scanned with an XDS Rapid Content Analyzer from 780–2500 nm. Orthogonal Projection to Latent Structure-Discriminant Analysis was used to develop discriminant models for each species separately and for all species combined. The results showed that the sensitivity (the model's ability to correctly classify members of a given class) and the specificity (the model's ability to reject non-members of a given class) were 100% for each species model and 98%–99% for the combined species model. The overall classification accuracy was 100% and 99% for the individual species and combined species models, respectively. The absorption band in the 1870–1950 nm region, with a major peak at 1930 nm that is related to water, was responsible for the discrimination, as dead-filled seeds dried quicker than viable seeds during the drying process. Our study is the first attempt to simultaneously discriminate dead-filled and viable seeds of pines by NIR spectroscopy. The results demonstrate that a global calibration model of seed lots of several pine species can be as effective as the individual species models in discriminating viable and dead-filled seeds by NIR spectroscopy, thereby ensuring precision sowing (also known as single seed sowing) in nurseries.


Introduction
The increasing demand for wood, fiber and pulp, coupled with risks associated with global climate change, has put immense importance on the development of forest plantations. The global planted forest area increased from 167.5 million ha to 277.9 million ha during 1990–2015, with the increase varying by region and climate domain [1]. Several species in the genus Pinus L., which are [...] fit to separate viable and dead-filled seeds of three pine species? The classification performance of the individual species models and the combined species model was compared so that a global calibration model that works for several pine species can be developed. To do this, NIR reflectance spectra (log 1/R) were collected from single seeds and multivariate discriminant models were developed by Orthogonal Projection to Latent Structure-Discriminant Analysis (OPLS-DA). Unlike the classic Partial Least Squares-Discriminant Analysis (PLS-DA), OPLS-DA incorporates an Orthogonal Signal Correction (OSC) filter to remove unwanted spectral variation that has no correlation with the response variable prior to model building [23], thereby resulting in a parsimonious model. Our study is the first attempt to simultaneously discriminate viable and dead-filled seeds of several pine species and provides valuable insight into the future development of on-line sorting systems by seed technologists.

Sample Preparation
Two seed lots each of Mason's pine, slash pine and loblolly pine, which differed in origin and year of collection (2015 and 2016), were bought from a commercial tree seed company in China and stored at 5 °C and ca. 6% moisture content in cloth bags for one month until the study was carried out. To create distinct classes of dead-filled and viable seeds, two sub-samples of 200 seeds each were taken from each species, and one of the sub-samples was killed in a drying oven set at 95 °C for 24 hours. Thereafter, both sub-samples (the killed and non-killed seeds) were placed separately between two moistened germination papers (Munktell filter paper, Ø 125 mm) in an incubation cabinet (Inventum Denmark 11) for 3 days at 5 °C and ca. 95% relative humidity. Such an arrangement permitted free imbibition of water by the seeds. After incubation, seeds were evenly distributed on a piece of blotting paper and dried with fan ventilation at 20 °C and ca. 40% relative humidity for six hours. The drying duration was selected based on our preliminary drying experiment. Immediately after drying, killed and non-killed seeds were placed separately in bowls containing Millipore-filtered water and stirred to assist the separation process. The floated and sunken fractions were collected separately after five minutes. The floated seeds represented the dead-filled seeds while the sunken seeds represented the viable seeds [8]. In total, 900 seeds (n = 150 dead-filled and 150 viable seeds per species) were used for NIR analysis. The relative water content of dead-filled and viable seeds is given in Table 1. The relative water content was determined as the difference between the weight after incubation and the weight after six hours of drying, divided by the initial weight and multiplied by 100.
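The relative water content calculation described above can be written compactly. The symbols are assumptions introduced here for illustration: W_inc is the seed weight after incubation, W_dry is the weight after six hours of drying, and W_0 is the "initial weight", which the text does not define further.

```latex
\mathrm{RWC}\;(\%) = \frac{W_{\mathrm{inc}} - W_{\mathrm{dry}}}{W_{0}} \times 100
```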

Measurement of NIR Spectra
Absorbance values (log (1/Reflectance)) of individual seeds were measured using an XDS Rapid Content Analyzer (FOSS NIRSystems, Inc., Hilleroed, Denmark) from 780 to 2498 nm at a wavelength resolution of 0.5 nm. Before scanning individual seeds, a reference measurement was recorded on the instrument's standard built-in reference material. Thereafter, individual seeds were scanned by placing them, in a stationary position, at the center of the instrument's white quartz glass scanning window (9 mm aperture) and covering them with the instrument's lid, which had a black background. For each seed, the average of 32 monochromatic scans was recorded. The viability status of the scanned seeds (viable or dead-filled) was further verified by germination and subsequent cutting tests. The germination test was carried out on a germination table at a constant temperature of 20 ± 1 °C day and night with an illumination of ca. 20 µE m⁻² s⁻¹ for 30 days. At the end of the germination test, non-germinated seeds were cut individually and examined for their viability. Seeds were considered viable when they had a firm white embryo, and dead-filled when they were covered with fungi, collapsed when pinched and had grey, yellow, or brownish embryos [24]. The cutting test enabled us to confirm that sorting by flotation in water had resulted in 100% separation of dead-filled and viable seeds, so that the classes were clearly set before developing the calibration models.
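To give a sense of the size of each spectrum, the following minimal sketch builds the wavelength axis implied by the scan settings above (780 to 2498 nm in 0.5 nm steps) and counts the variables in the 1870–1950 nm band that is analysed separately later. The variable names are illustrative assumptions, not part of the instrument software.

```python
import numpy as np

step_nm = 0.5
# Wavelength axis from 780 to 2498 nm inclusive, as stated for the scans.
wavelengths = np.arange(780.0, 2498.0 + step_nm, step_nm)

# Boolean mask selecting the truncated 1870-1950 nm region.
band = (wavelengths >= 1870.0) & (wavelengths <= 1950.0)

n_full = wavelengths.size   # number of absorbance values per seed
n_band = int(band.sum())    # number of values in the truncated region

print(f"full spectrum: {n_full} variables, truncated band: {n_band} variables")
```

Each seed is therefore described by several thousand highly correlated absorbance values, which is the setting in which latent-variable methods are normally applied.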

Model Development and Validation
At first, Principal Component Analysis (PCA) was performed on the entire data set to detect outliers and data anomalies. The PCA score plot showed that a few samples from each species fell outside the 95% confidence limit (data not shown), but they were not serious outliers and were hence kept in the final data set. The data sets were then divided into calibration sets to fit the models and validation sets to evaluate the prediction performance of the fitted models. The calibration data set was composed of 540 seeds from one seed lot (three replicates of 30 seeds × 2 classes × 3 species) and the validation data set was composed of 360 seeds from another seed lot (three replicates of 20 seeds × 2 classes × 3 species). Orthogonal Projection to Latent Structures-Discriminant Analysis (OPLS-DA) was used to develop discriminant models for each species separately and for all species combined. Basically, the OPLS-DA modelling approach first filters general types of interference in the spectra by removing components orthogonal to the calibrated response variable [25]. The filtered spectra were then computed by subtracting the components orthogonal to the response variable from the original spectral data. The final discriminant models were developed using the filtered absorbance values as the regressor and a Y-matrix of dummy variables as the regressand (1.0 for a member of a given class, 0.0 otherwise). It should be noted that NIR spectroscopic data are often pre-processed using different spectral pretreatment techniques to remove spectral noise arising from light scattering, baseline shift, and path length differences [26], which in turn are induced by differences in individual seed size and moisture content [15,17]. The model was also fitted on truncated spectra, the 1870–1950 nm wavelength region, where a major absorption peak was observed. The number of significant components (factors) to be included in the model was selected based on a seven-segment cross validation. A component was considered significant when the ratio of the prediction error sum of squares to the residual sum of squares of the previous dimension was statistically smaller than 1.0.
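The filtering-then-regression idea behind OPLS-DA can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Simca-P+ implementation used in the study: it follows the commonly described single-response OPLS steps, fixes the number of orthogonal components by hand instead of using cross validation, and runs on synthetic spectra. The function names `fit_opls_da` and `predict_opls_da` and all data below are hypothetical.

```python
import numpy as np

def fit_opls_da(X, y, n_ortho=1):
    """Fit a one-predictive-component OPLS-DA model on a dummy response y (1/0)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # mean-centre spectra and response
    W_o, P_o, T_o = [], [], []
    for _ in range(n_ortho):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)                   # predictive weight vector
        t = Xc @ w
        p = Xc.T @ t / (t @ t)                   # X loading of the predictive score
        w_o = p - (w @ p) * w                    # part of the loading orthogonal to w
        w_o /= np.linalg.norm(w_o)
        t_o = Xc @ w_o                           # Y-orthogonal scores
        p_o = Xc.T @ t_o / (t_o @ t_o)
        Xc = Xc - np.outer(t_o, p_o)             # subtract the orthogonal component
        W_o.append(w_o); P_o.append(p_o); T_o.append(t_o)
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w
    c = (yc @ t) / (t @ t)                       # regression of y on the predictive score
    return {"x_mean": x_mean, "y_mean": y_mean, "W_o": W_o, "P_o": P_o,
            "T_o": np.array(T_o).T, "w": w, "c": c, "X_filtered": Xc}

def predict_opls_da(model, X, threshold=0.5):
    """Filter new spectra with the stored orthogonal components, then classify."""
    Xc = X - model["x_mean"]
    for w_o, p_o in zip(model["W_o"], model["P_o"]):
        Xc = Xc - np.outer(Xc @ w_o, p_o)
    y_pred = (Xc @ model["w"]) * model["c"] + model["y_mean"]
    return y_pred, (y_pred >= threshold).astype(int)

# Synthetic example: a class-related band near 1930 nm plus a class-unrelated baseline.
rng = np.random.default_rng(0)
wl = np.arange(1870.0, 1950.5, 0.5)
y = np.repeat([1.0, 0.0], 30)                    # 1 = viable, 0 = dead-filled
peak = np.exp(-0.5 * ((wl - 1930.0) / 15.0) ** 2)
baseline = np.linspace(0.5, 1.5, wl.size)
X = (np.outer(0.2 * y, peak)
     + np.outer(rng.normal(size=y.size), baseline)
     + 0.001 * rng.normal(size=(y.size, wl.size)))

model = fit_opls_da(X, y, n_ortho=1)
y_pred, labels = predict_opls_da(model, X)
accuracy = (labels == y).mean()
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

In this sketch the orthogonal scores are uncorrelated with the class vector by construction, so removing them discards baseline variation without touching the class-related signal. The 0.5 threshold mirrors the classification rule applied to the validation set.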
Finally, the fitted models were applied to classify samples in the validation set; seeds with predicted values at or above the classification threshold (Ypred ≥ 0.5) were considered viable, and all others were considered dead-filled. To evaluate the classification performance of the fitted models, the following parameters were used: sensitivity (the ability of the model to correctly recognize samples belonging to a given class), specificity (the ability of the model to reject samples of all other classes), classification accuracy (the proportion of correctly classified samples), and classification error rate (the proportion of misclassified samples). The classification parameters were computed from the following quantities, where Sn, Sp, CA and ER stand for sensitivity, specificity, classification accuracy and error rate, respectively. TP (True Positive) is the number of viable seeds of a given species correctly classified as viable seeds. FN (False Negative) is the number of viable seeds of a given species incorrectly classified as dead-filled seeds. TN (True Negative) is the number of dead-filled seeds of a given species correctly classified as dead-filled seeds, FP (False Positive) is the number of dead-filled seeds of a given species incorrectly classified as viable seeds, and n is the number of classes [27].
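The classification parameters can be illustrated with a minimal sketch. It assumes the standard confusion-matrix forms that match the verbal definitions above, namely Sn = TP/(TP + FN), Sp = TN/(TN + FP), CA = (TP + TN)/(TP + FN + TN + FP) and ER = 1 − CA, all expressed in percent. The exact form of the equations used in the study, including how the number of classes n enters, is not shown in this text, so these forms and the function name `classification_metrics` are assumptions. The counts are those reported for the combined species model on the validation set (three viable and two dead-filled seeds misclassified, with 60 seeds per class for each of the three species).

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and error rate in percent."""
    total = tp + fn + tn + fp
    sn = 100.0 * tp / (tp + fn)        # viable seeds recognised as viable
    sp = 100.0 * tn / (tn + fp)        # dead-filled seeds rejected from the viable class
    ca = 100.0 * (tp + tn) / total     # proportion of correctly classified seeds
    er = 100.0 - ca                    # proportion of misclassified seeds
    return {"Sn": sn, "Sp": sp, "CA": ca, "ER": er}

# Combined species model, validation set: 60 seeds per class x 3 species.
n_per_class = 60 * 3
metrics = classification_metrics(tp=n_per_class - 3, fn=3,
                                 tn=n_per_class - 2, fp=2)

for name, value in metrics.items():
    print(f"{name} = {value:.1f}%")
```

Under these assumptions the values fall in the 98%–99% range for sensitivity and specificity, with an accuracy that rounds to 99%, in line with the figures reported for the combined species model.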
To obtain insight into the absorption bands that were relevant for discriminating viable and dead-filled seeds, a parameter called Variable Influence on Projection (VIP) was computed. In the formula used to calculate the VIP for predictive components (PRED_VIPOPLS), Kp denotes the total number of variables in the model; P is the normalized loadings; a is the index of the predictive component; Ap is the total number of predictive components; SSXcomp and SSYcomp denote the explained sum of squares of the a-th component for the X and Y data matrices, respectively; and SSXcum and SSYcum represent the cumulative explained sum of squares by all A components in the model for the X and Y data matrices, respectively [28]. Predictors with a VIP value greater than 1.0 are highly relevant for the discriminant model, but VIP values around 0.7–0.8 have been recommended as a cut-off to distinguish between relevant and irrelevant predictor variables [29]. All model computations were made on mean-centered data sets using Simca-P+ software (Version 14, Umetrics AB, Umeå, Sweden).

Mean Absorbance Values and Model Overview
The spectral profiles of viable and dead-filled seeds of Mason's pine, loblolly pine and slash pine were similar, with absorption maxima appearing at 1450 nm and 1936 nm (Figure 1). The mean absorbance values were slightly larger for viable than for dead-filled seeds of Mason's pine and slash pine across almost the entire NIR region, whereas for loblolly pine the absorbance values of viable seeds were larger than those of dead-filled seeds in the 1850–1940 nm region. Overall, there was ample spectral information that could be used for distinguishing viable from dead-filled seeds of pine species.

The model statistics reveal that the fitted models for discriminating viable and dead-filled seeds of each pine species and of all species combined had 1 predictive and 5–12 Y-orthogonal components (Table 2; e.g., A = 1 + 9 for Mason's pine). The explained predictive spectral variation (R²Xp) accounted for 2.6%, 2.4% and 2.1% of the total explained variation for Mason's pine, loblolly pine and slash pine, respectively, whereas the explained orthogonal spectral variation that had no correlation with the classes (R²Xo) constituted about 97% of the total spectral variation (Table 2). The predictive spectral variation explained 94%, 90% and 91% of the variation between the viable and dead-filled seed classes (R²Y) of Mason's pine, loblolly pine and slash pine, respectively, with 86%–92% predictive ability (Q²cv) based on cross validation. For the combined species model, the predictive spectral variation (R²Xp = 0.3%), the explained variation between the viable and dead-filled seed classes (R²Y = 80%) and the predictive power based on cross validation (Q²cv = 80%) were slightly lower than for the individual species models. Similarly, the model fitted on truncated spectra (1870–1950 nm) of all species together had nearly the same explained class variance and prediction power, but slightly higher explained spectral variance, than the model fitted on the full NIR spectral region.

Table 2. Summary of model statistics for discriminant models developed to distinguish viable and dead-filled seeds of each pine species separately and all species combined using the entire NIR region (780–2500 nm) and truncated spectra (1870–1950 nm).
Where A stands for the number of significant predictive (first value) and orthogonal (second value) components used to build the model; R²Xp = the explained predictive spectral variation; R²Xo = the explained Y-orthogonal variation that has no correlation with class discrimination; R²Y = the class variation explained by the model; and Q²cv = the predictive power of the model based on cross validation.

The score plots of the fitted models disclosed symmetrical grouping of viable and dead-filled seeds of pine species (Figure 2; X-axis), while the orthogonal scores depicted within-species variation (Y-axis) for the calibration data sets. The model fitted on data from all species combined also resulted in nearly symmetrical separation of viable and dead-filled seeds. A few samples appeared outside the 95% confidence ellipse according to Hotelling's T² test. However, these samples were moderate outliers and were thus kept in the final calibration data set. The loading plots for the predictive component show that the absorption band in 1894–1948 nm, with an absorption maximum at 1930 nm, was mainly responsible for separating viable and dead-filled seeds of pine species (Figure 3a). The orthogonal loading plot (Figure 3b) showed several small peaks across the entire NIR region, except for loblolly pine, which were attributed to systematic spectral noise.

Classification Performance of Fitted Models
The individual species models correctly recognized all viable and dead-filled seeds in the validation set, whereas the all-species model misclassified three viable and two dead-filled seeds (Figure 4). The ability of the models fitted on full NIR spectra to assign viable and dead-filled seeds in the validation set to their respective classes (sensitivity) and their ability to reject seeds of other classes (specificity) were 100% when the model was fitted for each species separately (Table 3). The combined species models fitted on the whole NIR spectral region and on truncated spectra also had very high sensitivity (98%–99%) and specificity (98%–99%). The mean classification accuracy was 100% for Mason's pine, loblolly pine and slash pine, while it was 99% for all species combined.

Table 3. Classification performance of models fitted on whole NIR spectral region (780–2500 nm) and truncated NIR spectra (1870–1950 nm) for classifying viable and dead-filled seeds of three pine species in the validation set (n = 60 seeds per class and species). Where Sn, Sp, CA and ER denote class sensitivity, class specificity, classification accuracy and error rate, respectively.

Absorption Bands Relevant for Discriminating Viable and Dead-Filled Pine Seeds
For all three pine seed lots, the plots of VIP values show that the absorption band in 1870–2000 nm, with an absorption maximum at 1930 nm, strongly influenced the separation of dead-filled and viable seeds (Figure 5; VIP > 1). For Mason's pine and slash pine, the absorption band in 1400–1500 nm, with an absorption maximum at 1460 nm, also appeared to be relevant for discriminating viable and dead-filled seeds (VIP = 0.9), whereas the absorption band in 890–1348 nm, with a broad peak at 1080 nm, was highly relevant for discriminating viable and dead-filled seeds of loblolly pine. Other NIR regions of interest for discriminating viable and dead-filled seeds by the combined species model appeared in 1098–1300 nm and 1400–1550 nm, with absorption peaks centered at 1118 nm and 1450 nm (VIP = 0.8–1.0).
Forests 2019, 10, x FOR PEER REVIEW


Discussion
The NIR spectra of pine seed lots contain sufficient information with subtle differences in absorbance values, particularly for loblolly pine (Figure 1). This might be due to the slight variation in moisture content between viable and dead-filled seeds of loblolly pine compared to the other two pine species (Table 1). However, the computed OPLS-DA models effectively used this subtle spectral variation to explain a considerable part of the class variation with excellent predictive power, as multivariate analysis is powerful in extracting such subtle differences among samples [30]. One significant predictive component was sufficient to condense the large spectral data set by removing the 97% of orthogonal spectral variation that had no correlation with class variation (Table 2). This large proportion of orthogonal spectral variation might arise from spectral redundancy, as the XDS Rapid Content Analyzer used in this study measured the reflectance at an interval of 0.5 nm. Previous studies have attributed large orthogonal variations in the spectra to spectral redundancy [20,21,31].
Furthermore, the large uncorrelated spectral variation could be attributed to path length differences and light scattering as a result of variations in individual seed size, moisture content and chemical composition [15,18]. This can be further seen from the score plot, where dead-filled seeds were slightly more dispersed than viable seeds along the orthogonal component (Figure 2). The corresponding orthogonal loading plot (Figure 3) shows smaller absorption peaks at 1113 nm and 1304 nm for Mason's pine, slash pine and all species combined that correlate with the dispersion pattern of samples observed in the score plot. The absorption band in 1100–1400 nm, with absorption maxima at 1113 nm and 1312 nm, has been ascribed to the binding of two water molecules to the hydroxide ion [12]. Apparently, differences in the water content of individual seeds could be the major source of Y-orthogonal spectral variation. For loblolly pine seeds, no major absorption peak was observed, suggesting that the Y-orthogonal spectral variation could be related to baseline shift.
The computed OPLS-DA models substantially describe the class variation with one significant predictive component (Table 2). Parsimonious models with few predictive components are often desired because dimensional complexity is an important aspect in the interpretation of multivariate analysis [32]. For both the individual species models and the global model, the sensitivity (the model's ability to recognize members of a given class) and specificity (the model's ability to reject non-members of a given class) were 98%–100% for the validation sets (Table 3). This indicates the robustness of both kinds of model and the prospect of NIR spectroscopy as a sorting system for improving seed lot quality by removing the dead-filled seeds of pine species. Compared with the current sorting system (IDS technique), which has a 10%–15% sorting error, NIR appears to be superior as the misclassification rate is just 1%. The success of discriminating viable and dead-filled seeds using truncated spectra accentuates the prospect of developing sorting systems using filters or diode arrays, which are less expensive.
For all three pine seed lots, the VIP plot shows that the absorption band in 1900–2000 nm, with an absorption maximum at 1930 nm, strongly influenced the separation of dead-filled and viable seeds (Figure 5). The 1900–2000 nm region is characterized mainly by the O-H stretch and HOH deformation combination and the O-H bend second overtone, and pure water has an absorption maximum at 1940 nm [10,31]. The shift in the absorption maximum in our study could be related to variations in hydrogen bonding and temperature when water is in solute admixture and a solvent [10]. As a whole, this region was found useful to discriminate sound and insect-attacked seeds by NIR spectroscopy based on differences in relative water content [15].
In addition, the absorption band in 1400–1500 nm, with a peak at 1450 nm, appeared to be relevant for discriminating viable and dead-filled seeds of pine species. The region is characterized by the O-H and N-H stretch first overtone and C-H combination bands due to absorption by ROH, protein moieties, starch and H2O; water has an absorption maximum at 1450 nm. For loblolly pine seeds, additional absorption bands in 860–1370 nm, with absorption maxima at 1074 nm and 1215 nm, had an influence on the discrimination of dead-filled and viable seeds. While the absorption band in 780–1100 nm is attributed to the O-H stretching second overtone due to absorption by aliphatic and aromatic hydroxyl groups, the absorption band in the 1100–1300 nm region is characterized by the second overtone of the C-H stretching vibration due to absorption by methyl and methylene [12,33]. Thus, the difference in relative water content between dead-filled and viable seeds was the basis for discriminating the two seed lot fractions by NIR spectroscopy. It should be noted that after three days of incubation on moist medium and subsequent drying for 6 h, the dead-filled seeds dried faster than the viable seeds, which had metabolically fixed the absorbed water (Table 1). This eventually created a sufficient difference in relative water content between dead-filled and viable seeds for detection by NIR spectroscopy.

Conclusions
The results demonstrated that a global calibration model of seed lots of several pine species can be as effective as individual species models in discriminating viable and dead-filled seeds by NIR spectroscopy, thereby enhancing seed lot quality. Improved seed lot quality, in turn, enables precision sowing in the nursery and reduces the time, energy and resources needed for replacement planting of empty planting pots or for thinning of young seedlings where several seeds are sown per pot. Sowing such high quality seeds also ensures the production of uniform-sized seedlings in nurseries. Further evaluation of outplanting performance will enable defining target seedlings for planting in a reforestation site, resulting in better survival and growth of outplanted seedlings. The technique is extremely fast, as it takes a fraction of a minute to scan a single seed. In addition, acquiring the spectral data is a simple task, with the possibility of automating the process. As sorting by NIR spectroscopy is based on a universal phenomenon, namely that dead-filled seeds dry faster than viable seeds, it can be applied to several economically important pine species. The success of classifying viable and dead-filled seeds using truncated spectra sheds light on the prospect of developing less expensive sorting systems using filters and diode arrays. Thus, efforts need to be made to construct an automated sorting system based on NIR technology.