Comparison of Pyrolysis Mass Spectrometry and Near Infrared Spectroscopy for Genetic Analysis of Lignocellulose Chemical Composition in Populus

Genetic analysis of wood chemical composition is often limited by the cost and throughput of direct analytical methods. The speed and low cost of Fourier transform near infrared (FT-NIR) spectroscopy overcome many of these limitations, but it is an indirect method relying on calibration models that are typically developed and validated with small sample sets. In this study, we used >1500 young greenhouse-grown trees from a clonally propagated single Populus family, grown at low and high nitrogen, and compared the effects of FT-NIR calibration sample sizes of 150, 250, 500 and 750 on calibration and prediction model statistics and on heritability estimates, with models developed from pyrolysis molecular beam mass spectrometry (pyMBMS) wood chemical composition data. As calibration sample size increased from 150 to 750, predictive model statistics improved slightly. Overall, stronger calibration and prediction statistics were obtained with lignin, S-lignin, S/G ratio, and m/z 144 (an ion from cellulose) than with C5 and C6 carbohydrates and m/z 114 (an ion from xylan). Although only small differences in model statistics were observed between the 250 and 500 sample calibration sets, when predicted values were used for calculating genetic control, the 500 sample set gave results substantially more similar to those obtained with the pyMBMS data. With the 500 sample calibration models, genetic correlations obtained with FT-NIR and


Introduction
Forest trees capture greenhouse gases [1] and provide a renewable supply of wood for pulp, paper, construction and bioenergy. Genetic improvement of forest tree species, such as Eucalyptus, Populus, and Pinus species, has increased growth and improved stem form and disease resistance [2], leading to significant gains in yield at shorter rotations [3,4]. However, despite the importance of wood chemical composition for yields of chemical pulp and biofuel, and despite our knowledge of genes that code for enzymes involved in synthesis of cellulose, hemicellulose and lignin from model herbaceous and tree species, forest tree breeders have only recently begun dissecting the genetic architecture of these traits [5-11].
A variety of methods for measuring the chemical composition of lignocellulosic biomass have been developed and some have been applied to understand genetic control and environmental effects [12,13]. Although classical wet chemical methods have been used widely, their application for genetic analyses is limited by their high cost and low throughput. Composition data from miniaturized wet chemical methods were used to calculate genetic parameters for wood chemical composition in loblolly pine [14] and Eucalyptus [13,15]. Pyrolysis molecular beam mass spectrometry (pyMBMS) is faster than wet chemical methods and has been used to characterize lignocellulosic components in herbaceous and woody biomass [16-18], analyze genetic trials aimed at characterizing trait genetic architecture [19], map quantitative trait loci (QTL) [10,20] and identify genes that affect cellulose, hemicellulose and lignin content in pine and poplar [19,21]. Despite these advances in our knowledge of the genetic mechanisms controlling wood lignocellulose composition, analysis of the very large number of samples needed for traditional and advanced molecular marker based breeding in commercially important species requires faster and lower cost methods [22].
One such method is near infrared (NIR) spectroscopy, an indirect method that relates absorbance differences in NIR wavelengths to differences in anatomical, chemical and mechanical properties. NIR has been used for qualitative and quantitative chemical analyses in many fields and industrial applications, including agricultural products [23], food [24], and forestry [25-27]. NIR relies on multivariate models to predict properties of new samples, and has been used to estimate genetic parameters for wood chemical content in different tree species. For example, NIR was utilized in Eucalyptus globulus [12,28-31] for prediction of cellulose, pulp yield, lignin and extractives, in Eucalyptus nitens [32,33] for cellulose, and in Pinus pinaster Ait [34,35] for lignin, extractives, cellulose and monosaccharides. Nevertheless, no direct comparison of genetic parameter estimates of wood chemical composition obtained with NIR and a direct method has been reported. Nor has the utility of NIR predictions for quantitative trait loci (QTL) mapping, association genetics, and genomic selection been investigated.
The goals of this research were to assess the importance of calibration sample size on FT-NIR model predictions and compare genetic parameter estimates and significant QTLs from the indirect FT-NIR predictions to pyMBMS within a single Populus family.

Samples and FT-NIR
For this research, wood samples from a single pseudo-backcross family of Populus were used. The experimental design, sample processing and collection of pyMBMS data are described in detail by Novaes et al. [10]. Briefly, ground samples of wood were available from 2376 plants from 396 genotypes grown under two different nitrogen treatments (with ~3 clonal replicates) and harvested after 10 weeks. The basal 5 cm section of each debarked stem was dried and ground to 40 mesh. Pyrolysis MBMS data were available for wood samples from two replicates (1515), and 1505 of these samples were scanned with FT-NIR. The spectra were obtained with a Perkin-Elmer Spectrum 400 FTIR/FTNIR (PerkinElmer Ltd., Beaconsfield, UK) equipped with an X-Y stage autosampler to increase the efficiency of scanning. About 4 mg of powdered sample was loaded into each well of a 96-well X-Y plate diffuse reflectance autosampler from Pike Technologies (Pike Tech., Madison, WI, USA). Aliquots from each wood sample were loaded into three wells and each well was scanned 32 times at a resolution of 8 cm⁻¹ (10,000-4000 cm⁻¹). Spectra from the three wells were averaged prior to analysis. The Spectrum Quant+ software (PerkinElmer Ltd., Beaconsfield, UK) was used for calibration and prediction. Second derivatives of the NIR spectra were used to correct the baseline. In all calibrations with the software package, a partial least squares (PLS) algorithm was used to develop the calibration and prediction models from the FT-NIR spectral data. The 1505 samples were randomly separated into calibration and prediction sets (150 vs. 1355, 250 vs. 1255, 500 vs. 1005, and 750 vs. 755). Calibration models were also developed using full cross validation.
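The random partitioning into calibration and prediction sets can be illustrated with a minimal sketch. It is not the procedure implemented in Spectrum Quant+; the function name, the use of NumPy, and the seeds are assumptions made only for illustration.

```python
import numpy as np

def split_calibration_prediction(n_samples, n_calibration, seed=0):
    """Randomly partition sample indices into calibration and prediction sets."""
    if not 0 < n_calibration < n_samples:
        raise ValueError("n_calibration must be between 1 and n_samples - 1")
    rng = np.random.default_rng(seed)      # hypothetical seed, for repeatability
    shuffled = rng.permutation(n_samples)  # random order of all sample indices
    calibration = np.sort(shuffled[:n_calibration])
    prediction = np.sort(shuffled[n_calibration:])
    return calibration, prediction

N_SCANNED = 1505  # samples scanned with FT-NIR
splits = {
    size: split_calibration_prediction(N_SCANNED, size, seed=size)
    for size in (150, 250, 500, 750)
}
for size, (calibration, prediction) in splits.items():
    print(size, len(calibration), len(prediction))
```

Each calibration set is drawn independently here; whether the original sets were nested within one another is not stated, so that choice is also an assumption.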

Calibration Statistics
The standard error of estimate (SEE) was used to assess how well the calibration models fit the calibration data, and the standard error of prediction (SEP) was used to assess how well those models predicted the samples in the prediction set [27], as described below:

$$\mathrm{SEE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n - 1}} \qquad (1)$$

where $\hat{y}_i$ is the value of sample i estimated by the calibration model, $y_i$ is the known value (from pyMBMS) of sample i, and n is the number of samples in the calibration model.

$$\mathrm{SEP} = \sqrt{\frac{\sum_{i=1}^{m} (y_{\mathrm{pred},i} - y_{\mathrm{ref},i})^2}{m - 1}} \qquad (2)$$

where $y_{\mathrm{pred},i}$ is the value of sample i predicted by the calibration model, $y_{\mathrm{ref},i}$ is the known value (from pyMBMS) of sample i, and m is the number of samples in the prediction set.
The coefficient of determination (R²) was utilized to evaluate the calibration and prediction performances.
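As a minimal sketch, the three statistics can be computed as follows. The function names are hypothetical, the denominator correction (`ddof`) is an assumption exposed as a parameter, and R² is taken here as the squared Pearson correlation between estimated and reference values, which is also an assumption rather than a documented property of Spectrum Quant+.

```python
import numpy as np

def standard_error(estimated, reference, ddof=1):
    """Root of the summed squared differences divided by (count - ddof).

    With calibration samples this plays the role of SEE; with prediction
    samples it plays the role of SEP. ddof=1 is an assumed denominator.
    """
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if estimated.shape != reference.shape:
        raise ValueError("estimated and reference must have the same shape")
    residuals = estimated - reference
    return float(np.sqrt(np.sum(residuals ** 2) / (residuals.size - ddof)))

def r_squared(estimated, reference):
    """Squared Pearson correlation between estimated and reference values."""
    r = np.corrcoef(np.asarray(estimated, float), np.asarray(reference, float))[0, 1]
    return float(r ** 2)

# Hypothetical lignin values (arbitrary units), not data from this study.
reference = [22.0, 24.5, 23.1, 25.8, 21.4]
estimated = [22.4, 24.0, 23.5, 25.1, 21.9]
print(standard_error(estimated, reference), r_squared(estimated, reference))
```

Setting `ddof=0` gives the root mean squared error instead, so the two conventions can be compared directly on the same data.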

Statistical Analysis of Phenotypic Data
The distribution of residuals was checked by PROC INSIGHT (SAS Institute Inc. 9.2 ® 2002-2008, Cary, NC, USA). Data cleaning was conducted by removing outliers and recording errors as reported by Novaes et al. [10]. The mixed model used in this paper was the same as before [10]:

$$y_{ijklmno} = \mu + r_i + T_k + rt_{ik} + b_{j(i)} + c_l + rc_{il} + tc_{kl} + p_{m(i)} + q_{n(i)} + e_{ijklmno} \qquad (3)$$

where $y_{ijklmno}$ is the response of the oth ramet of the lth clone in the kth treatment of the jth bench within the ith replication; $\mu$ is the population mean; $r_i$ is the random effect of replication, which is normally and independently distributed (NID) as $N(0, \sigma^2_r)$; $T_k$ is the treatment effect of nitrogen; $rt_{ik}$ is the interaction of replication by treatment, ~NID(0, $\sigma^2_{rt}$); $b_{j(i)}$ is the random effect of bench (incomplete block) within replication, ~NID(0, $\sigma^2_{rbc}$); $c_l$ is the random effect of clone, ~NID(0, $\sigma^2_c$); $rc_{il}$ is the random effect of replication by clone interaction, ~NID(0, $\sigma^2_{rc}$); $tc_{kl}$ is the random effect of treatment by clone interaction, ~NID(0, $\sigma^2_{tc}$); $p_{m(i)}$ is the random effect of row within replication, ~NID(0, $\sigma^2_p$); $q_{n(i)}$ is the random effect of column within replication, ~NID(0, $\sigma^2_q$); and $e_{ijklmno}$ is the random error effect within the experiment, ~NID(0, $\sigma^2_e$).
The treatment effect for each trait was obtained using the SAS ® System for mixed models. Restricted maximum likelihood with ASReml was utilized to obtain the variance components and genetic parameters. For least-square means in QTL analysis, we took both clone effect and its interaction with treatment as fixed effects in the model.
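ASReml was used to calculate the clonal repeatability for each trait in the univariate analysis with estimates of variance components as follows:

$$H^2 = \frac{\sigma^2_c}{\sigma^2_c + \sigma^2_{rc} + \sigma^2_{tc} + \sigma^2_e} \qquad (5)$$

where $\sigma^2_c$, $\sigma^2_{rc}$ and $\sigma^2_{tc}$ were defined previously and $\sigma^2_e$ is the variance of the random error. Pair-wise genetic correlations between wood chemical traits were estimated as described before [10].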
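As an illustration of how a clonal repeatability follows from estimated variance components, the sketch below divides the clonal variance by the sum of the clonal, replication by clone, treatment by clone, and error variances. That denominator is an assumption about the form of the ratio, the function name is hypothetical, and the numbers are invented for the example rather than taken from the ASReml output.

```python
def clonal_repeatability(var_clone, var_rep_clone, var_trt_clone, var_error):
    """Ratio of clonal variance to the assumed phenotypic variance."""
    components = (var_clone, var_rep_clone, var_trt_clone, var_error)
    if any(v < 0 for v in components):
        raise ValueError("variance components must be non-negative")
    phenotypic = sum(components)
    if phenotypic == 0:
        raise ValueError("total variance is zero; repeatability is undefined")
    return var_clone / phenotypic

# Hypothetical variance components for one trait measured by two methods.
h2_direct = clonal_repeatability(0.40, 0.05, 0.05, 0.50)
h2_indirect = clonal_repeatability(0.30, 0.05, 0.05, 0.60)
print(h2_direct, h2_indirect, 1 - h2_indirect / h2_direct)
```

The last printed value is the relative reduction in repeatability between the two hypothetical methods, the kind of comparison made between pyMBMS and FT-NIR estimates.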

QTL Analysis
A previously published genetic map [10,36] was utilized to test whether the FT-NIR wood chemical predictions could be used to detect genomic loci (QTL) controlling the quantitative traits. Briefly, the genetic map contains 163 microsatellite and 18 microarray-based markers covering all 19 linkage groups of poplar, at an average density of one marker for every 16 cM (Kosambi's map function). The genetic map was constructed with MapMaker 3.0 [37] and the QTL analysis was performed with composite interval mapping based on maximum likelihood estimation [38] using Windows QTL Cartographer v.2.5 [39]. The presence of QTLs along the linkage groups was tested with a likelihood ratio (LR) test. The LR compares the likelihood of having a QTL (full model) at any single position of the map against the likelihood of not having it (reduced model). The LR threshold for definition of a QTL was α = 0.05, determined with genome-wide analysis of 1000 permutations [40]. For this article we only tested the performance of the FT-NIR prediction with the 500 sample set on the high N fertilization treatment.
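The logic of a genome-wide permutation threshold can be sketched as follows. This is not composite interval mapping as implemented in Windows QTL Cartographer: for brevity it swaps in a single-marker regression LR statistic, and the function names and simulated data are hypothetical.

```python
import numpy as np

def single_marker_lr(phenotype, genotypes):
    """LR per marker: n * ln(RSS_reduced / RSS_full).

    Reduced model: phenotype mean only. Full model: mean plus one marker
    as a linear predictor (an assumed stand-in for interval mapping).
    """
    y = phenotype - phenotype.mean()
    g = genotypes - genotypes.mean(axis=0)
    rss_reduced = y @ y
    explained = (g.T @ y) ** 2 / np.sum(g * g, axis=0)
    rss_full = rss_reduced - explained
    return y.size * np.log(rss_reduced / rss_full)

def permutation_threshold(phenotype, genotypes, n_permutations=1000,
                          alpha=0.05, seed=0):
    """(1 - alpha) quantile of the genome-wide maximum LR under permutation."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_permutations)
    for k in range(n_permutations):
        shuffled = rng.permutation(phenotype)  # breaks marker-trait association
        maxima[k] = single_marker_lr(shuffled, genotypes).max()
    return float(np.quantile(maxima, 1.0 - alpha))

# Simulated data: 200 individuals, 30 markers, one true QTL at marker index 5.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 2, size=(200, 30)).astype(float)
phenotype = 2.0 * genotypes[:, 5] + rng.normal(size=200)

lr = single_marker_lr(phenotype, genotypes)
threshold = permutation_threshold(phenotype, genotypes, n_permutations=1000)
print(int(np.argmax(lr)), bool(lr.max() > threshold))
```

Taking the maximum LR across all markers in each permutation is what makes the threshold genome-wide rather than per-marker.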

Effects of Sample Size on Calibration and Prediction
To investigate the impact of the number of samples needed for good calibration and prediction using pyMBMS data, we compared the correlation coefficients when using 150, 250, 500 and 750 samples in the calibration set (Table 1). All sets of samples were chosen randomly, and the remaining samples were used as a prediction set. The four calibration sets had similar means and ranges of chemical components (data not shown). In general, similar calibration R² were obtained with all sample sets for lignin, G-lignin, and S-lignin (Table 1); however, the calibration R² for C5, C6 and m/z 114, an ion from xylan, were overall lower than for lignin and more variable across different sample sizes (Table 1). The prediction correlation coefficients strengthened slightly for all chemical components with increasing sample size. For example, lignin prediction increased from R² = 75.50% (150 samples) to R² = 82.39% (750 samples) (Table 1). The lignin prediction results were better than those for the sugar (C5, C6) components, except for m/z 144, an ion arising from cellulose, which had a prediction R² of 70.55% with the 500 sample calibration set.
We previously reported that when grown at high nitrogen, wood lignin content is significantly lower than when grown at low nitrogen [10]. The FT-NIR predicted lignin contents also differed significantly between low and high nitrogen environments, for calibration models developed with all sample sizes (Table 2). For the 500 sample calibration set, the FT-NIR prediction has a slope of 0.813 relative to pyMBMS; the predicted FT-NIR values were slightly higher at high nitrogen, and slightly lower at low nitrogen, than the pyMBMS data (Figure 1) [41]. With both FT-NIR and pyMBMS, none of the carbohydrate components differed between nitrogen levels. These results validate the ability of calibration models to detect relatively large differences in poplar wood lignin content induced by nitrogen availability.
An important standard measure of genetic control of a phenotype is heritability, the ratio of genetic to phenotypic variation. The best measure of genetic control for clonally propagated populations is clonal repeatability, a measure of broad sense heritability or total genetic control. With the FT-NIR predicted values, the clonal repeatabilities were all lower than those obtained with the pyMBMS data (Figure 2). For C5, C6, m/z 144, and G-lignin, the FT-NIR heritability estimates increased with larger calibration sample size (Figure 2). For m/z 114, an ion from xylan, the heritability was zero for every calibration set, and this trait was therefore dropped from the genetic analyses described below. For this population, a random set of 500 samples gave strong calibration models with the smallest standard errors of prediction for all chemical components and the highest estimates of genetic control for most traits. Thus, the 500 sample calibration model was used to predict the chemical composition of the 1505 wood samples for the genetic correlation and QTL analyses described below.

Comparison of Heritability and Genetic Correlations among Traits
Compared with pyMBMS, the clonal repeatability estimates with FT-NIR predictions were about 25% lower for total lignin and S-lignin and 60% lower for G-lignin. For both pyMBMS and FT-NIR, the genetic control of lignin was higher than for carbohydrates [10]. However, the clonal repeatability was very similar for C5 and cellulose m/z 144 ions, and slightly higher for C6, when estimated with FT-NIR compared with pyMBMS (Table 2). Interestingly, with FT-NIR the genetic control of m/z 144, a cellulose ion, was almost as high as that of the sum of the C6 sugar ions [16].
Pair-wise genetic correlations between traits are important for understanding whether common genetic pathways are involved in the control of two or more traits and for applying the appropriate breeding and selection strategies. Genetic correlations of chemical traits within and across pyMBMS and FT-NIR were quite similar, with most of the pair-wise correlation estimates being stronger than 0.60 (absolute value) (Table 3). The strongest genetic correlations between pyMBMS and FT-NIR predictions were for lignin (1.00), S-lignin (0.99), G-lignin (0.74), S/G (0.92), C6 (0.87), C5 (0.73), C6/C5 (1.00), m/z 144 (0.87), and C6/lignin (0.87). Genetic correlations among chemical components were similar between pyMBMS and FT-NIR data for most of the traits, except for 9 pairs where correlations differed by more than 0.2.
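The comparison of two sets of pair-wise genetic correlations can be made explicit with a short sketch that lists the trait pairs whose estimates differ by more than a tolerance of 0.2. The function name is hypothetical and the two matrices hold invented values, not the estimates in Table 3.

```python
import numpy as np

def discordant_pairs(corr_a, corr_b, traits, tolerance=0.2):
    """Trait pairs whose correlations differ by more than `tolerance`."""
    a = np.asarray(corr_a, dtype=float)
    b = np.asarray(corr_b, dtype=float)
    n = len(traits)
    if a.shape != (n, n) or b.shape != (n, n):
        raise ValueError("both matrices must be n_traits x n_traits")
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each pair counted once
            if abs(a[i, j] - b[i, j]) > tolerance:
                pairs.append((traits[i], traits[j]))
    return pairs

traits = ["lignin", "C6", "C5"]
# Hypothetical genetic correlation matrices for the two methods.
direct = [[1.00, -0.90, -0.85],
          [-0.90, 1.00, 0.50],
          [-0.85, 0.50, 1.00]]
indirect = [[1.00, -1.00, -1.00],
            [-1.00, 1.00, 0.90],
            [-1.00, 0.90, 1.00]]
print(discordant_pairs(direct, indirect, traits))
```

Only the upper triangle is scanned, so a symmetric matrix of k traits contributes k(k − 1)/2 candidate pairs.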

QTL Analysis
For all six wood chemistry traits under high nitrogen treatment, QTL analyses were performed using values from all 1505 samples predicted with the 500 sample calibration model. The objective was to compare the number of QTLs and to determine whether the QTLs detected with FT-NIR co-localize with those detected by pyMBMS. Even though the levels of genetic control estimated with FT-NIR data were considerably lower than those estimated with pyMBMS, seven QTLs were identified with the FT-NIR predictions and eight with pyMBMS (Table 5). However, only three of these QTLs mapped to the same intervals in both pyMBMS and FT-NIR (Table 5, Figure 3). These coincident QTLs were detected for m/z 144, lignin and G-lignin, and were all located in the same region of LGXIII. We previously identified this region as having a major effect on wood chemical and growth traits of this family [10]. More specifically, this region explains 56% of the heritable variation for cellulose to lignin ratio, as well as 20%-25% of the heritable variation for biomass. As expected, the QTL profiles of pyMBMS and FT-NIR tend to be more similar for traits that have a stronger genotypic correlation between both estimates. For example, for lignin (r = 1.0), C6 (r = 0.87) and m/z 144 (r = 0.87) the QTL profiles with pyMBMS and FT-NIR tend to co-vary (Figure 3). Generally for these traits, when a QTL is detected with one of the two techniques there is a peak on the QTL profile of the other that may or may not be above the significance threshold. Conversely, for C5 sugars, which have a weaker correlation between estimates obtained with pyMBMS and FT-NIR (r = 0.73), the QTL profiles are quite different, even though the one significant QTL was detected with both methods.
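Counting coincident QTLs amounts to checking, trait by trait, whether intervals detected by the two methods fall in the same place. The sketch below treats two QTLs as coincident when they are for the same trait, lie on the same linkage group, and have overlapping intervals; that overlap rule, the class and function names, and the positions are all assumptions made for illustration.

```python
from typing import NamedTuple

class QTL(NamedTuple):
    trait: str
    linkage_group: str
    start_cm: float  # left boundary of the QTL interval, in cM
    end_cm: float    # right boundary of the QTL interval, in cM

def colocalized(qtls_a, qtls_b):
    """Pairs of QTLs for the same trait whose intervals overlap."""
    matches = []
    for a in qtls_a:
        for b in qtls_b:
            same_place = (a.trait == b.trait
                          and a.linkage_group == b.linkage_group)
            overlap = a.start_cm <= b.end_cm and b.start_cm <= a.end_cm
            if same_place and overlap:
                matches.append((a, b))
    return matches

# Hypothetical intervals, not the positions reported in Table 5.
direct = [QTL("lignin", "LGXIII", 10.0, 26.0), QTL("S-lignin", "LGVI", 40.0, 55.0)]
indirect = [QTL("lignin", "LGXIII", 18.0, 30.0), QTL("C6", "LGII", 5.0, 20.0)]
shared = colocalized(direct, indirect)
print(len(shared), [a.trait for a, _ in shared])
```

A stricter rule, such as requiring the same flanking marker pair, would be a drop-in replacement for the overlap test.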

Discussion
The low cost and high throughput of FT-NIR offer the opportunity for efficiently phenotyping the thousands of wood samples needed for dissecting genetic and environmental control of wood lignocellulose composition. However, because FT-NIR is an indirect method, robust calibration models with high prediction precision need to be developed and broadly validated. In this study, pyMBMS chemical composition data and FT-NIR spectra were used to calibrate and predict the composition of 1505 poplar samples from a single family, grown under high and low nitrogen [10].

Calibration, Prediction, and Sample Size
FT-NIR calibration models to estimate wood chemical composition have been reported for a number of species; however, most of these models were calibrated with wood chemistry data obtained from wet chemical methods using modest sample sizes, and moderate calibration and prediction results were obtained [27,41,43-46]. Recently, global NIR models were developed with multiple pine species from multiple sites with high calibration and validation precision for lignin (R² of calibration and validation 0.97 and 0.95, respectively) and cellulose (R² of calibration and validation 0.84 and 0.72, respectively) content [47,48]. For Eucalyptus, more than 40 species across Australia (720 samples) were used to calibrate NIR models to predict Kraft pulp yield with R² = 0.91 and a low standard error of cross validation (1.36%) [49]. However, the utility of these global calibrations for genetic analyses has not been reported.
For indirect methods requiring calibration, an important question is how many samples are needed in the calibration set to develop strong predictive models that can be used for environmental and genetic analyses of wood chemistry. Although the R² of calibration models were slightly weaker with 150 and 250 compared with the 500 and 750 sample sets (Table 1), the R² of prediction models increased for all components with increasing sample size, with the strongest coefficients being obtained with the 500 or 750 sample sets. Random sets of 500 samples yielded calibration models with R² that ranged from 0.56 to 0.87 (Table 1) and were sufficient for developing good predictive models for all components. Overall, stronger calibration and prediction statistics were obtained with lignin than with carbohydrates (Table 1), likely reflecting the quality of the estimation of lignin compared with the carbohydrate data from pyMBMS [11,50,51]. Reported calibration and prediction statistics using wet chemical methods were stronger than predictions obtained with pyMBMS [47]. One reason could be that the sum of peak intensities arising from the breakdown of molecules during pyrolysis might be a mixture of several chemical components.

Genetic Parameters and Environmental Control
A few reports have demonstrated that NIR predictions can be used to estimate genetic parameters. For Pinus pinaster Ait and Pinus taeda, Isik et al. [52], Perez et al. [34] and Gaspar et al. [35] showed NIR to be a potential tool for estimating heritabilities and genetic correlations of physical and chemical properties for tree selection. For Eucalyptus globulus and Eucalyptus nitens, the application of NIR focused on predicting the genetic parameters for cellulose [28,33], pulp yield [12,30,31], and lignin [29]. Schimleck et al. [15,32] investigated the cellulose content predicted with an NIR calibration model as an alternative for estimating Kraft pulp yield in Eucalyptus nitens. NIR analysis provided estimates of genetic parameters that were as good as direct cellulose assessment, demonstrating that the heritability and genetic gain could be estimated by NIR. However, these studies suffer from the lack of validation of the accuracy and reliability of the estimates from NIR predictions.
No comparison of direct and indirect estimates of wood chemical composition and of genetic parameter estimates has been reported previously. Heritability estimates with FT-NIR were a little lower than the estimates with pyMBMS except for C6 (Table 2). Figure 2 demonstrates that estimates of genetic control were more similar between FT-NIR predictions and pyMBMS data when the larger 500 and 750 sample sizes were used for NIR calibration. This suggests that for genetic analyses, using more samples in the calibration is important, even though the statistics of calibration and prediction were similar between the 250 and 500 sample sets. Consequently, with pyMBMS data, sample sizes of 500 or more for NIR calibration should be used when the goal is to partition variances and estimate heritability.
Pair-wise genetic correlations obtained with FT-NIR predicted chemical traits were close to those obtained with the direct pyMBMS method. For example, predicted lignin was highly correlated with predicted S-lignin (0.99), G-lignin (0.79), S/G (0.90), C6 (−1.00), C5 (−1.00), C6/C5 (−0.98), m/z 144 (−0.97) and C6/lignin (−1.00) (Table 3), and pyMBMS lignin behaved in a similar way. We also investigated the genetic correlation for chemical components between pyMBMS data and FT-NIR predictions, and they had similar correlation estimates for most of the chemical traits, except for 17 pairs where correlations differed by more than 0.2. Consequently, the results indicate that FT-NIR can provide a robust model for genetic correlations [13].
Previously, Novaes et al. [10] observed that nitrogen application during early growth of Populus significantly increased the content of C5 (hemicellulose) and C6 (cellulose), and decreased lignin content. With FT-NIR predictions, nitrogen fertilization was also highly significant for all chemical traits, with the C6 increase being larger than the C5 increase, and the decrease in total lignin being much larger than the decreases in S-lignin and G-lignin (Table 2). This shows that the nitrogen effect could be of major importance for enhancing cellulose content in wood.

QTL Mapping
In total, 7 QTLs were identified with FT-NIR and 8 with pyMBMS, but only three co-localized to the same interval. With the pyMBMS data, one QTL was identified for S-lignin but none were identified with FT-NIR. This relatively low coincidence of QTLs between the two methods requires additional investigation to understand whether the QTLs detected only with FT-NIR are real, even though the stringent QTL threshold used (α = 0.05, with 1000 genome-wide permutation tests) gives statistical strength to the hypothesis that the novel FT-NIR QTLs are real. However, it is important to note the low prediction R² of C6 (49%) and C5 (52%), for which two QTLs were identified only with FT-NIR predictions. Because coincident QTLs for lignin and m/z 144 were detected with FT-NIR predictions and pyMBMS data, this suggests that FT-NIR can be used when calibration models yield good predictions (R² > 0.80). Once a good calibration model is obtained, FT-NIR is less costly and faster than direct methods, such as wet chemistry. FT-NIR may be especially interesting for genetic analyses of wood chemistry, given that these studies, such as QTL and association studies, require the phenotyping of thousands of individuals.

Conclusions
These results show that FT-NIR, coupled with pyMBMS for model calibration, is an appropriate high-throughput method for wood chemical calibration. Our results show good calibration and prediction results for lignin and moderate results for carbohydrates. Our study demonstrates the similarity of the direct and indirect estimates of wood chemical composition and genetic parameter estimates. Strong genetic correlations are obtained between FT-NIR and pyMBMS estimates. However, the QTLs detected by FT-NIR and pyMBMS are quite different.

Figure 2. Clonal repeatability and standard errors for FT-NIR calibration sets of different sizes compared with pyMBMS. Error bars correspond to standard errors of the mean.

Figure 3. QTL profiles for pyMBMS (blue) and FT-NIR (green) predictions for six wood chemistry traits. The FT-NIR predictions were conducted with the 500 sample calibration set. The likelihood ratio (LR) threshold is indicated in each figure (red line), the y-axis is the likelihood ratio, and the black and yellow bars on the x-axis delimit each linkage group.

Table 1. Calibration and prediction results for different calibration sample sizes.

Table 2. Estimates of clonal repeatability and average trait value for seven wood chemistry phenotypes in two nitrogen treatments with pyMBMS and FT-NIR (500 sample calibration set). Clonal repeatability (H²), Standard error (SE), N (L): Low nitrogen treatment, N (H): High nitrogen treatment. The last column gives the p-values for testing the nitrogen treatment effect for each phenotype.

Table 3. Pair-wise comparisons of genotypic correlations between chemical traits for both NIR and pyMBMS.

Table 4. Pair-wise estimates of genotypic correlations between pyMBMS and FT-NIR.

Table 5. Number of quantitative trait loci (QTLs) detected with pyMBMS and FT-NIR for six wood chemical traits. Coincident QTLs map to the same intervals in the genome.