Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data
Abstract
1. Introduction
2. Theory
2.1. Notations
2.2. Prediction Models
2.2.1. Partial Least Squares Regression
2.2.2. Averaging PLSR
- = dr/dup if dr < dup
- = 1 if dr ≥ dup (this case implies a null weight).
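The two cases above can be read as a clipping rule applied to each error rate dr relative to the upper bound dup. The following is a minimal Python sketch of that rule only; the function name `scaled_errors` and the numeric values are illustrative assumptions, and the function that maps a scaled error to a weight is deliberately not shown because it is not defined in this section.

```python
def scaled_errors(d, d_up):
    """Scale each error rate d_r by the upper bound d_up, clipping at 1.

    A scaled value of 1 marks a model that receives a null weight.
    (Hypothetical helper name, used here for illustration only.)
    """
    if d_up <= 0:
        raise ValueError("d_up must be positive")
    return [d_r / d_up if d_r < d_up else 1.0 for d_r in d]


# Made-up error rates for four models (not results from the paper).
d = [2.0, 1.0, 0.5, 4.0]
scaled = scaled_errors(d, d_up=2.0)

# Models whose error reaches d_up are flagged as having a null weight.
null_weight = [s == 1.0 for s in scaled]
```

In this toy example, the first and last models reach or exceed dup and would therefore be excluded from the average.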
2.2.3. Weights for Methods AVG-CV, -AIC, and -BIC
- AVG-CV: For the PLSR model with r LVs, the error rate dr is the root mean squared error of prediction (RMSEP) estimated on the training data {X, y} by a random K-fold (K = 5) CV (RMSEPCV). The K-fold CV was repeated ten times, and dr was computed as the average of the ten RMSEPCV estimates;
- AVG-AIC: dr is the Akaike information criterion [15,16]: AIC = log(SSR) + 2 df, where SSR is the sum of the squared residuals computed on the training data {X, y} and df is the complexity (or “effective” dimension, or number of degrees of freedom) of the model. The AIC penalty “2 df” increases with the complexity of the model (whereas SSR decreases) and counterbalances the optimism of SSR as a measure of how well the model predicts new observations. When several models are compared (in this paper, the PLSR models with different numbers r of LVs), the models with the lowest AICs are considered the best performing, as with RMSEPCV in CV. The complexity df is known to be difficult to estimate for PLSR [17,18,19,20] because the response variable y is involved in the computation of the LVs, which is not the case, for instance, for PCR models. Nevertheless, approximations are available; in particular, several methods are detailed and compared in Lesnoff et al. [21]. In this paper, df was computed from the conjugate gradient least squares algorithm [22,23]. Since CV and AIC estimate approximately the same type of prediction error [21,24], both methods are expected to yield similar weight patterns w and therefore similar averaging results in Equation (2);
- AVG-BIC: here, dr is another common parsimony criterion, the Bayesian information criterion (BIC) [25]. In BIC, the AIC penalty constant “2” is replaced by log(n), where n is the number of training observations: BIC = log(SSR) + log(n) df. Since the penalty added to SSR is larger than in AIC (as soon as log(n) > 2, i.e., n ≥ 8), BIC is more conservative and selects (by minimal error rate) models with lower dimensions.
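The AIC and BIC formulas above can be illustrated with a short Python sketch. It uses the two criteria exactly as they are written in this section; the SSR and df values are made up for illustration (in the paper, df comes from an approximation method that is not reproduced here), and the helper names `aic`, `bic`, and `argmin` are assumptions.

```python
import math


def aic(ssr, df):
    """AIC as written in this section: log(SSR) + 2 * df."""
    return math.log(ssr) + 2.0 * df


def bic(ssr, df, n):
    """BIC as written in this section: log(SSR) + log(n) * df."""
    return math.log(ssr) + math.log(n) * df


def argmin(values):
    """Index of the smallest value (the first one in case of ties)."""
    return min(range(len(values)), key=values.__getitem__)


# Made-up SSR and df values for four candidate models of increasing complexity.
ssr = [1e6, 1e3, 50.0, 45.0]
df = [1.0, 2.0, 3.0, 4.0]
n = 200  # number of training observations (illustrative)

aic_values = [aic(s, k) for s, k in zip(ssr, df)]
bic_values = [bic(s, k, n) for s, k in zip(ssr, df)]

# Position of the model with the lowest criterion value under each rule.
best_aic = argmin(aic_values)
best_bic = argmin(bic_values)
```

With these toy numbers, the BIC minimum falls on a less complex model than the AIC minimum, which is the conservative behavior described for AVG-BIC.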
2.2.4. Stacking
3. Materials and Methods
3.1. Datasets and Software
Abbreviation | Unit | Description |
---|---|---|
ADF | %DM 1 | Acid detergent fiber [29] |
ADL | %DM | Acid detergent lignin [29] |
ASH | %DM | Ashes |
CF | %DM | Crude fiber [30] |
CP | %DM | Crude protein [30] |
DM | % | Dry matter, 103 °C, 24 h |
DMDCELL | %DM | Pepsin–cellulase dry matter digestibility [31] |
NDF | %DM | Neutral detergent fiber [32] |
OMDCELL | %OM 2 | Pepsin–cellulase organic matter digestibility [31] |
Response variable (y) | TROP1 | TROP2 | LUS1 | LUS2 | THEIX | WAL |
---|---|---|---|---|---|---|
ADF | 1530 (8.8, 66.9) | 1126 (12.4, 61.1) | 1310 (10.3, 36.5) | 1355 (17.4, 50.8) | 1507 (15.0, 46.5) | – |
ADL | 1423 (0.7, 43.1) | 1126 (0.4, 13.6) | – | 1139 (3.0, 10.9) | 1620 (2.7, 27.1) | – |
ASH | 1597 (1.5, 66.4) | 1476 (0.4, 57.4) | 3526 (4.5, 15.8) | 1242 (5.8, 17.7) | – | – |
CF | – | 1302 (7.5, 57.3) | – | – | – | 797 (12.0, 42.1) |
CP | 1564 (1.6, 32.3) | 1389 (0.7, 28.5) | 4029 (3.1, 24.9) | 1612 (2.4, 39.2) | 1564 (3.9, 37.8) | 797 (4.0, 34.2) |
DM | 1607 (72.2, 97.7) | 1481 (84.7, 98.8) | – | – | – | 797 (89.3, 98.4) |
DMDCELL | 1459 (9.9, 95.0) | 1137 (14.6, 93.3) | 5194 (41.0, 95.0) | 1584 (38.7, 87.3) | 1386 (20.7, 91.4) | – |
NDF | 1529 (16.0, 85.7) | 1119 (26.3, 88.0) | 3948 (20.6, 68.4) | 1386 (26.0, 67.8) | 1672 (27.6, 76.9) | – |
OMDCELL | 1459 (8.6, 94.3) | 1137 (10.9, 90.0) | – | – | – | – |
3.2. Overall Approach to Evaluate the Models
- A number of ntrain observations {Xtrain, ytrain} are used as a training set to calibrate a given model, say f. This learning step is detailed in Section 3.3;
- A number of ntest observations {Xtest, ytest} (with n = ntrain + ntest) are used to compute the performance of model f learned on {Xtrain, ytrain}. The model performance was defined by the RMSEP computed on the ntest predictions (RMSEPtest).
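The performance measure in the second step can be sketched in a few lines of Python. This is only an illustration of the RMSEP computation on the test predictions; the function name `rmsep` and the numeric values are assumptions, not data from the paper.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up reference values and predictions for three test observations.
y_test = [10.0, 12.0, 14.0]
y_hat = [11.0, 12.0, 12.0]

rmsep_test = rmsep(y_test, y_hat)
```

Here the predictions would come from a model f calibrated on {Xtrain, ytrain} only, so that RMSEPtest is computed on observations the model has not seen.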
3.3. Learning Step for Models f
3.3.1. Usual PLSR
3.3.2. Parsimonious PLSR (PLSR-P)
3.3.3. PLSR Averaging and Stacking
- α = 0, i.e., q = max{d0, d1, …, da}, which means that only the least performant model among r = 0, …, 50 LVs is removed from the average;
- α = 0.3, which means that the 30% least performant models among r = 0, …, 50 LVs are removed.
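The effect of α can be sketched as a threshold on the error rates. The Python sketch below assumes that q is the empirical quantile of order 1 − α of the errors (computed here with linear interpolation) and that every model whose error reaches q is removed; the exact quantile definition used in the paper is not given in this section, and the names `quantile` and `removed_models` are hypothetical.

```python
import math


def quantile(values, p):
    """Empirical quantile of order p (0 <= p <= 1), linear interpolation."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def removed_models(d, alpha):
    """Indices r of the models whose error d_r reaches the threshold q."""
    q = quantile(d, 1.0 - alpha)
    return [r for r, d_r in enumerate(d) if d_r >= q]


# Made-up error rates for ten models (not results from the paper).
d = [float(v) for v in range(1, 11)]

removed_alpha_0 = removed_models(d, 0.0)    # threshold q is the maximum error
removed_alpha_03 = removed_models(d, 0.3)   # threshold q is the 70% quantile
```

In this toy example, α = 0 removes only the model with the largest error, while α = 0.3 removes the three worst of the ten models. If several models tie at the maximum error, this sketch removes all of them.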
Abbreviation | Method |
---|---|
PLSR 1 | Dimensionality is selected by minimal RMSEPCV. |
PLSR-P | Parsimonious dimensionality (Wold criterion on RMSEPCV). |
“Omnibus” methods | |
AVG | Averaging with uniform weights. |
AVG-CV | Averaging with weights computed from CV errors. |
AVG-AIC | Averaging with weights computed from AIC errors. |
AVG-BIC | Averaging with weights computed from BIC errors. |
AVG-SHENK | Averaging with the LOCAL weights [38]. |
STACK | Stacking with MLR as “top” model. |
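The averaging methods in the table differ only in how the weights are obtained. A minimal Python sketch of the combination step is given below, assuming that the final prediction is a weighted mean of the predictions of the individual PLSR models and that the weights are normalized to sum to 1. Both points are assumptions made for illustration, as are the function name `average_predictions` and the numeric values.

```python
def average_predictions(preds, weights=None):
    """Combine per-model predictions into one prediction per observation.

    preds[r][i] is the prediction of model r for observation i.
    With weights=None, uniform weights are used (as in method AVG).
    """
    n_models = len(preds)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    w = [v / total for v in weights]  # normalize so the weights sum to 1
    n_obs = len(preds[0])
    return [sum(w[r] * preds[r][i] for r in range(n_models)) for i in range(n_obs)]


# Made-up predictions of three models for two observations.
preds = [[10.0, 20.0], [12.0, 22.0], [14.0, 18.0]]

avg_uniform = average_predictions(preds)
avg_weighted = average_predictions(preds, weights=[0.0, 1.0, 3.0])
```

A null weight, as in the second call, drops the corresponding model from the average entirely. Stacking differs in that the combination is learned by a "top" model (MLR in the table) rather than fixed by an error-based weighting rule.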
4. Results
5. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Höskuldsson, A. PLS Regression Methods. J. Chemom. 1988, 2, 211–228.
- Wold, H. Nonlinear Iterative Partial Least Squares (NIPALS) Modeling: Some Current Developments. In Multivariate Analysis II; Krishnaiah, P.R., Ed.; Academic Press: Cambridge, MA, USA, 1973; pp. 383–407.
- Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
- Dardenne, P.; Sinnaeve, G.; Baeten, V. Multivariate Calibration and Chemometrics for near Infrared Spectroscopy: Which Method? J. Near Infrared Spectrosc. JNIRS 2000, 8, 229–237.
- Wang, F.; Zhao, C.; Yang, H.; Jiang, H.; Li, L.; Yang, G. Non-Destructive and in-Site Estimation of Apple Quality and Maturity by Hyperspectral Imaging. Comput. Electron. Agric. 2022, 195, 106843.
- Chu, X.; Li, R.; Wei, H.; Liu, H.; Mu, Y.; Jiang, H.; Ma, Z. Determination of Total Flavonoid and Polysaccharide Content in Anoectochilus Formosanus in Response to Different Light Qualities Using Hyperspectral Imaging. Infrared Phys. Technol. 2022, 122, 104098.
- Gowen, A.A.; Downey, G.; Esquerre, C.; O’Donnell, C.P. Preventing Over-Fitting in PLS Calibration Models of near-Infrared (NIR) Spectroscopy Data Using Regression Coefficients. J. Chemom. 2011, 25, 375–381.
- Kalivas, J.H. Multivariate Calibration, an Overview. Anal. Lett. 2005, 38, 2259–2279.
- Westad, F.; Marini, F. Validation of Chemometric Models—A Tutorial. Anal. Chim. Acta 2015, 893, 14–24.
- Silalahi, D.D.; Midi, H.; Arasan, J.; Mustafa, M.S.; Caliman, J.-P. Automated Fitting Process Using Robust Reliable Weighted Average on Near Infrared Spectral Data Analysis. Symmetry 2020, 12, 2099.
- Zhang, M.H.; Xu, Q.S.; Massart, D.L. Averaged and Weighted Average Partial Least Squares. Anal. Chim. Acta 2004, 504, 279–289.
- Andersson, M. A Comparison of Nine PLS1 Algorithms. J. Chemom. 2009, 23, 518–529.
- Cleveland, W.S.; Grosse, E. Computational Methods for Local Regression. Stat. Comput. 1991, 1, 47–62.
- Shenk, J.S.; Westerhaus, M.O. Population Definition, Sample Selection, and Calibration Procedures for Near Infrared Reflectance Spectroscopy. Crop Sci. 1991, 31, 469.
- Hurvich, C.M.; Tsai, C.-L. Bias of the Corrected AIC Criterion for Underfitted Regression and Time Series Models. Biometrika 1991, 78, 499–509.
- Hurvich, C.M.; Tsai, C.-L. Regression and Time Series Model Selection in Small Samples. Biometrika 1989, 76, 297–307.
- Ildiko, F.E.; Friedman, J.H. A Statistical View of Some Chemometrics Regression Tools. Technometrics 1993, 35, 109–135.
- Krämer, N.; Sugiyama, M. The Degrees of Freedom of Partial Least Squares Regression. J. Am. Stat. Assoc. 2011, 106, 697–705.
- Seipel, H.A.; Kalivas, J.H. Effective Rank for Multivariate Calibration Methods. J. Chemom. 2004, 18, 306–311.
- van der Voet, H. Pseudo-Degrees of Freedom for Complex Predictive Models: The Example of Partial Least Squares. J. Chemom. 1999, 13, 195–208.
- Lesnoff, M.; Roger, J.-M.; Rutledge, D.N. Monte Carlo Methods for Estimating Mallows’s Cp and AIC Criteria for PLSR Models. Illustration on Agronomic Spectroscopic NIR Data. J. Chemom. 2021, 35, e3369.
- Björck, Å. Numerical Methods for Least Squares Problems; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1996; ISBN 978-0-89871-360-2.
- Hansen, P.C. Rank-Deficient and Discrete Ill-Posed Problems; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1998; ISBN 978-0-89871-403-6.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
- Schwarz, G. Estimating the Dimension of a Model. Ann. Statist. 1978, 6, 461–464.
- Lesnoff, M.; Metz, M.; Roger, J.-M. Comparison of Locally Weighted PLS Strategies for Regression and Discrimination on Agronomic NIR Data. J. Chemom. 2020, 10, e3209.
- Lesnoff, M. Jchemo: A Julia Package for Dimension Reduction, Regression and Discrimination for Chemometrics; CIRAD, UMR SELMET: Montpellier, France, 2021.
- Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V.B. Julia: A Fresh Approach to Numerical Computing. SIAM Rev. 2017, 59, 65–98.
- Van Soest, P.J.; Robertson, J.B. Systems of Analysis for Evaluating Fibrous Feeds. In IDRC No 134; IDRC: Ottawa, ON, Canada, 1980; pp. 49–60.
- AOAC. Official Methods of Analysis of the Association of Official Analytical Chemists; AOAC International Publishing: Gaithersburg, MD, USA, 2005.
- Aufrère, J.; Michalet-Doreau, B. In Vivo Digestibility and Prediction of Digestibility of Some By-Products. In Feeding Value of by-Products and Their Use by Beef Cattle; Boucqué, C.V., Fiems, L.O., Cottyn, B.G., Eds.; Commission of the European Communities Publishing: Brussels, Belgium; Luxembourg, 1983; pp. 25–33.
- Van Soest, P.J.; Robertson, J.B.; Lewis, B.A. Methods for Dietary Fiber, Neutral Detergent Fiber, and Nonstarch Polysaccharides in Relation to Animal Nutrition. J. Dairy Sci. 1991, 74, 3583–3597.
- Filzmoser, P.; Liebmann, B.; Varmuza, K. Repeated Double Cross Validation. J. Chemom. 2009, 23, 160–171.
- Krstajic, D.; Buturovic, L.J.; Leahy, D.E.; Thomas, S. Cross-Validation Pitfalls When Selecting and Assessing Regression and Classification Models. J. Cheminform. 2014, 6, 10.
- Andries, J.P.M.; Vander Heyden, Y.; Buydens, L.M.C. Improved Variable Reduction in Partial Least Squares Modelling Based on Predictive-Property-Ranked Variables and Adaptation of Partial Least Squares Complexity. Anal. Chim. Acta 2011, 705, 292–305.
- Schaal, S.; Atkeson, C.G.; Vijayakumar, S. Scalable Techniques from Nonparametric Statistics for Real Time Robot Learning. Appl. Intell. 2002, 17, 49–60.
- Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20, 397–405.
- Shenk, J.; Westerhaus, M.; Berzaghi, P. Investigation of a LOCAL Calibration Procedure for near Infrared Instruments. J. Near Infrared Spectrosc. 1997, 5, 223.
- Kim, S.; Okajima, R.; Kano, M.; Hasebe, S. Development of Soft-Sensor Using Locally Weighted PLS with Adaptive Similarity Measure. Chemom. Intell. Lab. Syst. 2013, 124, 43–49.
- Shen, G.; Lesnoff, M.; Baeten, V.; Dardenne, P.; Davrieux, F.; Ceballos, H.; Belalcazar, J.; Dufour, D.; Yang, Z.; Han, L.; et al. Local Partial Least Squares Based on Global PLS Scores. J. Chemom. 2019, 33, e3117.
- Allegrini, F.; Fernández Pierna, J.A.; Fragoso, W.D.; Olivieri, A.C.; Baeten, V.; Dardenne, P. Regression Models Based on New Local Strategies for near Infrared Spectroscopic Data. Anal. Chim. Acta 2016, 933, 50–58.
- Minet, O.; Baeten, V.; Lecler, B.; Dardenne, P.; Fernández Pierna, J.A. Local vs. Global Methods Applied to Large near Infrared Databases Covering High Variability. In Proceedings of the 18th International Conference on Near Infrared Spectroscopy; IM Publications Open LLP: Copenhagen, Denmark, 2019; pp. 45–49. ISBN 978-1-906715-27-4.
Dataset | n | Type of Material | Source |
---|---|---|---|
TROP1 | 1608 | Mixtures of plants collected mainly from the Mediterranean, Reunion Island, and Sahelian areas (e.g., Burkina Faso, Chad, Mali, and Senegal): grasses, herbs, legumes, shrubs, etc. | CIRAD, France |
TROP2 | 1483 | Tropical sorghum forage | CIRAD, France |
LUS1 | 5626 | Grass forage species (Lusignan, France) | INRAE, France |
LUS2 | 1827 | Legume forages with mainly alfalfa (Lusignan, France) | INRAE, France |
THEIX | 1894 | Forages of diversified permanent grasslands collected mainly from the Massif Central (France) | INRAE, France |
WAL | 797 | Grass forages from different areas in Wallonia (Belgium) | CRA-W, Belgium |
Dataset | PLSR dimensionality (Mean) | PLSR dimensionality (Min.) | PLSR dimensionality (Max.) | PLSR-P dimensionality (Mean) | PLSR-P dimensionality (Min.) | PLSR-P dimensionality (Max.) |
---|---|---|---|---|---|---|
TROP1 | 20.8 | 12 | 43 | 13.5 | 6 | 22 |
TROP2 | 17.9 | 8 | 44 | 13.9 | 8 | 19 |
LUS1 | 26.1 | 11 | 50 | 14.9 | 9 | 19 |
LUS2 | 21.0 | 11 | 47 | 15.2 | 11 | 19 |
THEIX | 18.8 | 13 | 46 | 13.8 | 9 | 19 |
WAL | 10.3 | 1 | 18 | 8.3 | 1 | 14 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lesnoff, M.; Andueza, D.; Barotin, C.; Barre, P.; Bonnal, L.; Fernández Pierna, J.A.; Picard, F.; Vermeulen, P.; Roger, J.-M. Averaging and Stacking Partial Least Squares Regression Models to Predict the Chemical Compositions and the Nutritive Values of Forages from Spectral Near Infrared Data. Appl. Sci. 2022, 12, 7850. https://doi.org/10.3390/app12157850