High-Throughput Analysis of Lignocellulosic Components in Miscanthus spp. Utilizing Near-Infrared Spectroscopy Integrated with Feature Selection Algorithms

Liu, Bin; Huang, Yu; Gu, Lan; Wang, Sheng; Xue, Shuai; Fu, Tongcheng; Yi, Zili; Li, Jie; Wang, Xiaoyu; Tang, Chaochen; Li, Meng

doi:10.3390/agronomy15112659


by Bin Liu ^1,2; Yu Huang ^1,2; Lan Gu ^1,2; Sheng Wang ^1,2,3; Shuai Xue ^1,2,3; Tongcheng Fu ^1,2,3; Zili Yi ^1,2; Jie Li ^3,4; Xiaoyu Wang ^3,4; Chaochen Tang ^5,*; and Meng Li ^1,2,3,*

¹ Hunan Engineering Laboratory of Miscanthus Ecological Applications, College of Bioscience & Biotechnology, Hunan Agricultural University, Changsha 410128, China

² Hunan Branch, National Energy R&D Center for Non-Food Biomass, Hunan Agricultural University, Changsha 410128, China

³ Yuelushan Laboratory, Changsha 410128, China

⁴ Hunan Agricultural Equipment Research Institute, Hunan Academy of Agricultural Sciences, Hunan Intelligent Agriculture Engineering Technology Research Center, Changsha 410125, China

⁵ Crops Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Crop Genetic Improvement of Guangdong Province, Guangzhou 510640, China

* Authors to whom correspondence should be addressed.

Agronomy 2025, 15(11), 2659; https://doi.org/10.3390/agronomy15112659

Submission received: 19 October 2025 / Revised: 11 November 2025 / Accepted: 13 November 2025 / Published: 20 November 2025

(This article belongs to the Special Issue Sustainable Cropping Systems and Biomasses for Energy and Biorefinery Applications)


Abstract

Rapid, non-destructive assessment of biomass composition is essential for advancing Miscanthus spp. breeding and bioenergy production. This study aimed to develop and validate high-throughput near-infrared spectroscopy (NIRS) models for key chemical components in Miscanthus biomass. A robust calibration set was constructed from 107 diverse samples by combining two key species, Miscanthus sacchariflorus and M. lutarioriparius, to enhance chemical variability and create broadly applicable models. Partial Least Squares (PLS) regression models were developed using this dataset, comparing full-spectrum performance against models optimized with three feature selection algorithms: CARS, VCPA-GA, and VCPA-IRIV. All feature selection methods significantly enhanced predictive accuracy. Notably, the CARS-PLS models yielded excellent performance for cellulose (R²v = 0.98, RPD = 7.38), hemicellulose (R²v = 0.95, RPD = 4.35), lignin (R²v = 0.96, RPD = 5.40), and moisture (R²v = 0.98, RPD = 7.18), while the VCPA-IRIV-PLS model was superior for ash content (R²v = 0.96, RPD = 5.13). Overall, NIRS coupled with advanced feature selection provides a powerful, rapid protocol for Miscanthus biomass analysis, poised to accelerate germplasm evaluation and industrial quality control in the bioenergy sector.

Keywords: Miscanthus sacchariflorus; Miscanthus lutarioriparius; lignocellulose; chemometrics; feature selection; high-throughput phenotyping

1. Introduction

The escalating global demand for renewable energy, driven by concerns over fossil fuel depletion and climate change, has spurred intensive research into lignocellulosic biomass as a sustainable feedstock [1,2]. Among various energy crops, Miscanthus spp., a genus of perennial C4 grasses, has emerged as a frontrunner due to its high biomass yield, broad adaptability, low input requirements, and excellent calorific value [3,4]. Two prominent species native to China, Miscanthus sacchariflorus (MS) and Miscanthus lutarioriparius (ML), are particularly valued for their potential for producing bioethanol, bio-based materials, and chemicals [5].

The conversion efficiency and economic viability of these bio-products are intrinsically linked to the chemical composition of the feedstock, primarily the relative content of cellulose, hemicellulose, and lignin, as well as ash and moisture content [6]. However, these properties can vary significantly due to genetic diversity, cultivation practices, and post-harvest storage conditions [7]. Consequently, large-scale breeding programs and industrial processing require a rapid, accurate, and high-throughput method to screen vast numbers of samples for optimal chemical profiles [5]. This need drives the development of high-throughput phenotyping technologies, which involve the rapid and non-destructive assessment of complex plant traits on a large scale.

Traditional wet-chemistry methods for lignocellulose analysis, while accurate, are laborious, time-consuming, destructive, and generate hazardous chemical waste, making them impractical for large-scale applications [8]. This bottleneck severely limits the pace of germplasm improvement and biomass quality control. Near-infrared spectroscopy (NIRS) has been widely recognized as a promising alternative. As a non-destructive analytical technique, NIRS measures the absorption of light by molecules containing C–H, O–H, and N–H bonds, providing a chemical fingerprint of the sample [9]. Coupled with chemometric modeling, NIRS enables the rapid and simultaneous prediction of multiple chemical components from a single spectrum [10].

Previous studies have demonstrated the feasibility of using NIRS to predict lignocellulosic content in various biomass feedstocks, including Miscanthus spp. [11,12]. For instance, Li et al. [13] and Chen et al. [14] developed NIRS models for Miscanthus components, confirming the potential of this technology. However, the predictive power of NIRS models can be compromised by the high dimensionality and multicollinearity of spectral data, where many features are redundant or irrelevant [15]. To address this, feature (variable or wavelength) selection algorithms are employed to isolate the most informative features, thereby simplifying the model, reducing overfitting, and improving predictive accuracy and robustness [16].

Competitive Adaptive Reweighted Sampling (CARS) is a well-established algorithm that mimics Darwin’s “survival of the fittest” principle to select optimal variable subsets [17]. More recently, hybrid strategies have been proposed, such as combining Variable Combination Population Analysis (VCPA) with a Genetic Algorithm (GA) or with Iteratively Retaining Informative Variables (IRIV). These hybrids have shown promise in other fields but remain less explored for biomass analysis [18].

This study addresses the critical need for high-throughput phenotyping in Miscanthus breeding and bio-utilization. To this end, our specific objectives were to: (1) establish robust NIRS calibration models for the quantification of cellulose, hemicellulose, lignin, ash, and moisture in a diverse collection of Miscanthus spp. samples; (2) systematically compare the performance of PLS models built with the full spectrum versus those optimized by three distinct feature selection algorithms (CARS, VCPA-GA, and VCPA-IRIV), representing different selection approaches, to identify the most effective modeling strategy; and (3) validate the resulting optimized method as a practical high-throughput phenotyping tool to accelerate germplasm screening and quality assessment for the Miscanthus-based bioeconomy.

2. Materials and Methods

2.1. Plant Materials and Sample Preparation

This study utilized a total of 107 accessions, comprising 25 MS and 82 ML accessions. The two species were intentionally combined to create a single, diverse population, thereby enhancing the chemical and spectral variability of the calibration set and improving the robustness and general applicability of the resulting NIRS models. The samples were identified by Prof. Yi and harvested during senescence in November 2021 from the Miscanthus Resource Nursery at Hunan Agricultural University, Changsha, China (28°18′ N, 113°07′ E). The collected aerial biomass was dried in an oven (DHG-9070A, Shanghai Jinghong Experimental Equipment Co., Ltd., Shanghai, China) at 65 °C for approximately 48 h until a constant weight was achieved. The dried samples were coarsely chopped, then ground into a fine powder using a plant grinder (DLF-55S water-cooled pulverizer, Wenzhou Dingli Technology Equipment Co., Ltd., Wenzhou, China). The powder was sieved through a 65-mesh screen (approx. 230 µm), and each sample was stored separately in a sealed bag at 4 °C in a refrigerator (BCD-618WGHSSEDBL, Haier Group, Haier Refrigerator Co., Ltd., Qingdao, China) prior to analysis.

2.2. Chemical Composition Analysis

The moisture, ash, cellulose, hemicellulose, and lignin contents were determined according to the National Energy Industry Standards of China for lignocellulosic biomass. Moisture content was determined by the standardized gravimetric loss-on-drying method (NB/T 34057.3-2017 [19]) using an automatic infrared moisture analyzer (HE53/02, Mettler Toledo Instruments (Shanghai) Co., Ltd., Shanghai, China). Ash content was determined by incineration in a muffle furnace (MFLC-36/12D, Tianjin Tesi Instrument Co., Ltd., Tianjin, China) at 575 ± 25 °C (NB/T 34057.6-2017 [20]). The contents of cellulose, hemicellulose, and acid-insoluble lignin were determined using the two-step sulfuric acid hydrolysis method (NB/T 34057.5-2017 [21]). All chemical analyses were performed in triplicate for each sample.

2.3. NIR Spectroscopy

Diffuse-reflectance NIR spectra were collected from 900 to 2500 nm using a bench-top spectrometer equipped with a rotating sample stage and integrating sphere (G3000 spectrometer, Sichuan Ways-Spec Technology Co., Ltd., Chengdu, China). The spectral sampling interval was 1 nm (final 1601 features), and 16 scans/sample were averaged after white-reference correction (Spectralon, Sichuan Ways-Spec Technology Co., Ltd., Chengdu, China). Environmental conditions were controlled at 22 ± 2 °C and 40–60% RH. The filled sample cup depth exceeded 10 mm to minimize background. All wavelength units are in nm; spectral resolution and interval follow manufacturer specifications.

2.4. Spectral Data Preprocessing

To minimize physical effects such as baseline shifts and particle size variation, the raw spectra were preprocessed. Various methods were tested, and the second derivative (SD) using the Savitzky–Golay algorithm (11-point window, 2nd-order polynomial) was found to provide the best results for enhancing spectral features and was applied to all subsequent analyses.
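As an illustration of this preprocessing step, the following minimal sketch applies a Savitzky–Golay second derivative with the stated 11-point window and 2nd-order polynomial using `scipy.signal.savgol_filter`. The synthetic spectra, the helper name `second_derivative`, and the choice of `delta=1.0` (matching the 1 nm sampling interval) are assumptions made for demonstration; they are not the authors' scripts or data.

```python
import numpy as np
from scipy.signal import savgol_filter


def second_derivative(spectra, window_length=11, polyorder=2, delta=1.0):
    """Savitzky-Golay second derivative along the wavelength axis.

    spectra: array of shape (n_samples, n_wavelengths).
    delta:   wavelength step in nm (assumed 1 nm, as in Section 2.3).
    """
    spectra = np.asarray(spectra, dtype=float)
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=2, delta=delta, axis=-1)


# Synthetic stand-in for measured spectra (hypothetical data):
# one shared absorption band plus a different linear baseline per sample.
wavelengths = np.arange(900, 2501)                      # 1601 points, 1 nm step
rng = np.random.default_rng(0)
offset = rng.normal(size=(5, 1))
slope = 1e-4 * rng.normal(size=(5, 1))
baseline = offset + slope * wavelengths                 # shape (5, 1601)
band = np.exp(-0.5 * ((wavelengths - 1940) / 30.0) ** 2)
raw = baseline + band

sd = second_derivative(raw)
# Offsets and linear slopes vanish under the second derivative, so the five
# transformed spectra coincide even though the raw baselines differ.
print(sd.shape, np.allclose(sd, sd[0], atol=1e-8))
```

In this toy case the transformed spectra differ only by floating-point error, which is the baseline-removal behaviour the second derivative is chosen for; real spectra would also carry noise that the smoothing window has to control.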

2.5. Feature Selection

Three algorithms were used to select informative features from the preprocessed spectra. CARS selects feature subsets based on their regression coefficients in PLS models, progressively eliminating uninformative features [17]. VCPA-GA first uses an exponential decay function (EDF) to shrink the variable space and then employs a Genetic Algorithm (GA) to refine the selection of the most predictive feature combinations [18]. Like VCPA-GA, the VCPA-IRIV method uses the EDF for initial screening, followed by the Iteratively Retaining Informative Variables (IRIV) procedure, which classifies features and retains only the strongly and weakly informative ones [18]. The key parameters for VCPA-GA and VCPA-IRIV were set as described by Yun et al. [18] (Table 1).
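To make the role of an exponential decay function concrete, the sketch below computes how many variables survive each sampling run under the EDF form published for CARS, r_i = a·exp(−k·i), with a = (p/2)^(1/(N−1)) and k = ln(p/2)/(N−1), so that all p variables are kept in the first run and two in the last. The number of runs (50) and the function name `edf_keep_counts` are illustrative assumptions; the EDF parameterisation used inside the VCPA hybrids follows Yun et al. [18] and is not reproduced here.

```python
import numpy as np


def edf_keep_counts(n_features, n_runs):
    """Variables retained at each run under the CARS-style EDF.

    The ratio r_i = a * exp(-k * i) equals 1 at run 1 and 2 / n_features
    at the final run, so the schedule shrinks from all variables to two.
    """
    a = (n_features / 2.0) ** (1.0 / (n_runs - 1))
    k = np.log(n_features / 2.0) / (n_runs - 1)
    runs = np.arange(1, n_runs + 1)
    ratio = a * np.exp(-k * runs)
    return np.round(ratio * n_features).astype(int)


# 1601 spectral points as in this study; 50 runs is an assumed setting.
counts = edf_keep_counts(n_features=1601, n_runs=50)
print(counts[:5], counts[-5:])
```

The schedule removes many variables in the early runs and only a few in the late runs, which is why EDF-based methods can cut a 1601-point spectrum down quickly before the finer selection stage takes over.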

2.6. Sample Set Partitioning

The 107 samples were divided into a calibration set (for model training) and a prediction set (for external validation) at a 2:1 ratio. The Sample set Partitioning based on joint x–y distances (SPXY) algorithm was employed [22]. This method considers the Euclidean distances in both the spectral space (X-variables) and the chemical composition space (Y-variables), ensuring that both sets are representative of the overall sample diversity.
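The partitioning logic can be sketched as follows: pairwise Euclidean distances are computed separately in X and y, each is divided by its maximum, the two are summed, and samples are then picked one at a time by the Kennard–Stone max-min rule. This is a minimal NumPy illustration of the SPXY idea on random stand-in data; the function names, the 50-feature dummy matrix, and the single-response y are assumptions, not the authors' implementation.

```python
import numpy as np


def _pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.clip(d2, 0.0, None))


def spxy_split(X, y, n_cal):
    """Return (calibration indices, prediction indices) chosen by SPXY."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = _pairwise_dist(X)
    dy = _pairwise_dist(y)
    d = dx / dx.max() + dy / dy.max()        # joint, normalised x-y distance

    # Start from the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]

    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_cal:
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)


# Hypothetical data with the study's sample count and 2:1 split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(107, 50))
y_demo = rng.normal(size=107)
cal_idx, pred_idx = spxy_split(X_demo, y_demo, n_cal=71)
print(len(cal_idx), len(pred_idx))
```

Because the max-min rule favours samples far from those already chosen, extreme samples tend to end up in the calibration set, which is the behaviour a calibration set needs if it is to span the prediction set.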

2.7. Model Construction and Evaluation

Partial Least Squares (PLS) regression was used to build quantitative models correlating the spectral data (full range or selected features) with the measured chemical reference values. To prevent overfitting, the optimal number of latent variables (LVs) for each model was determined using 10-fold cross-validation, selecting the number of LVs that minimized the root mean square error of cross-validation (RMSECV).

The performance of the models was evaluated using several statistical metrics for both the calibration and prediction sets:

Coefficient of Determination (R²c, R²cv, R²v): Indicates the proportion of variance explained by the model for calibration, cross-validation, and prediction, respectively.

Root Mean Square Error (RMSEC, RMSECV, RMSEP): Measures the average error of the model.

Residual Predictive Deviation (RPD): Calculated as the standard deviation (SD) of the reference values in the prediction set divided by the RMSEP.

RPD values are used to assess the practical predictive ability of the model, where RPD > 3 indicates a good model, and RPD > 5 suggests excellent predictive power.
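The metrics listed above can be computed as in the following sketch, which also applies the RPD > 3 and RPD > 5 thresholds stated in the text. The use of the sample standard deviation (`ddof=1`), the function names, and the five-point toy data are assumptions for illustration; the source does not say which standard deviation convention was used.

```python
import numpy as np


def prediction_metrics(y_true, y_pred):
    """R2, RMSEP, and RPD for a prediction set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    rmsep = float(np.sqrt(np.mean(resid ** 2)))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - float(np.sum(resid ** 2)) / ss_tot
    sd = float(np.std(y_true, ddof=1))       # assumed: sample SD
    return {"R2": r2, "RMSEP": rmsep, "RPD": sd / rmsep}


def rpd_grade(rpd):
    """Label a model using the thresholds given in the text."""
    if rpd > 5:
        return "excellent"
    if rpd > 3:
        return "good"
    return "below the RPD > 3 threshold"


# Toy reference and predicted values (hypothetical, not study data).
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = prediction_metrics(measured, predicted)
print(metrics, rpd_grade(metrics["RPD"]))
```

Because RPD scales the error by the spread of the reference values, the same RMSEP gives a higher RPD for a component with a wide range than for one with a narrow range, which matters when comparing components such as ash and hemicellulose.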

All data processing and modeling were performed using Python (version 3.10) with custom-written scripts and relevant libraries, including scikit-learn (version 1.2.2) for PLS regression, NumPy (version 1.24.3) for numerical operations, SciPy (version 1.10.1) for spectral preprocessing, and Matplotlib (version 3.7.1) for figure generation.

3. Results and Discussion

3.1. Chemical Characterization of Miscanthus Samples

In this study, 107 samples of Miscanthus spp. were selected, and their lignocellulosic components were quantified. As shown in the box plot in Figure 1A, the compositional data were evenly and approximately normally distributed, indicating that the selected samples adequately represented the natural variation in chemical composition within the broader population.

The measured component contents spanned the following ranges: cellulose (30.61–41.28%), hemicellulose (19.97–27.44%), lignin (15.74–22.93%), ash (3.05–11.37%), and moisture (2.03–8.94%). Ash and moisture exhibited the highest coefficients of variation (CVs), at 33.85% and 40.68%, respectively, whereas lignin, cellulose, and hemicellulose varied far less, with CVs of 7.32%, 6.91%, and 5.55%, respectively (Figure 1A). The relatively wide distribution ranges of ash and moisture suggest that these components may be more reliably predicted by subsequent NIRS models than cellulose, hemicellulose, and lignin, which exhibit narrower variability.

Overall, the chemical compositions of the Miscanthus spp. samples were sufficiently diverse and representative to support robust NIRS modeling. Cellulose, hemicellulose, and lignin, the primary components for bio-based product development, collectively accounted for an average of 80% of the dry matter in the 107 samples. Notably, the cellulose content in MS and ML exceeds that of corn straw [23,24], suggesting a superior potential for applications such as nanocellulose and bioethanol production [25]. Additionally, when compared to the straw of Helianthus tuberosus L., another potential biomass feedstock, the hemicellulose levels in our Miscanthus samples are notably higher, indicating greater economic promise for high-value applications such as xylooligosaccharide production [26,27]. Given that lignin is a key determinant of calorific value, the elevated lignin content in Miscanthus spp. relative to Helianthus tuberosus L. straw further underscores their strong potential for use in biomass-based combustion and power generation [27].

3.2. Spectral Analysis and Pretreatment

The NIR spectra of the 107 Miscanthus spp. samples are presented in Figure 1B. The spectra span the wavelength range of 900–2500 nm, with absorbance as the ordinate and 1601 data points per spectrum. The spectra exhibit strong and distinct absorption features, reflecting the high and variable contents of key lignocellulosic components, particularly cellulose, hemicellulose, and lignin.

The majority of characteristic absorption bands are concentrated in the 1400–2500 nm region [28], which corresponds to the first and second overtones as well as combination bands of fundamental vibrations involving C–H, O–H, N–H, C=O, and C=C bonds. Specific assignments include:

the O–H asymmetric stretching vibration of cellulose at ~1500 nm;
the C=C aromatic overtone vibration of lignin at ~1660 nm;
the C=O stretching vibration of acetyl groups in hemicellulose at ~1730 nm;
the combined O–H asymmetric stretching and bending vibrations of water at ~1941 nm;
overlapping O–H and C–H bending and stretching vibrations from cellulose and xylan at ~2100 nm;
O–H and C–O stretching vibrations associated with lignin at ~2250 nm;
C–H bending and stretching vibrations of xylan at ~2350 nm;
the combination bands arising from weak N–H (e.g., in residual proteins), C=O, and C=C vibrations, appearing broadly between 2050 and 2400 nm.

To enhance spectral interpretability and model performance, SD preprocessing was applied, and the resulting transformed spectra are shown in Figure 1C. This technique effectively corrects baseline drift and offsets caused by light scattering or instrumental variation, suppresses non-chemical spectral artifacts, and improves the resolution of overlapping absorption bands. Because differentiation on its own amplifies high-frequency noise, the smoothing built into the Savitzky–Golay window (Section 2.4) is what keeps random noise in check, thereby facilitating the development of more accurate and robust chemometric models. As demonstrated here, appropriate spectral pretreatment is essential for extracting meaningful chemical information and ensuring reliable predictions in NIR-based quantitative analysis.

3.3. Optimization of Near-Infrared Spectral Features

Feature selection is a critical step in NIRS modeling, as it enhances computational efficiency, reduces model complexity, and often improves prediction accuracy by retaining only the most informative spectral features [29]. In this study, three feature selection algorithms with distinct theoretical frameworks, CARS, VCPA-GA, and VCPA-IRIV, were employed to identify optimal feature subsets [17,18,30]. The selected features are summarized in Figure 2 and Table 2.

For most components, CARS retained a larger number of spectral features compared to VCPA-GA and VCPA-IRIV, with the exception of cellulose and hemicellulose, where the three methods yielded comparable selections. This difference likely stems from CARS’s more conservative elimination strategy, which gradually removes uninformative features while preserving moderately relevant ones, whereas the hybrid VCPA-based approaches tend to discard weakly informative features more aggressively during optimization. Notably, all three algorithms achieved substantial dimensionality reduction: the maximum number of selected features was 120 (7.5% of the original 1601 spectral points), significantly simplifying subsequent modeling without compromising predictive power (Table 2). As shown in Figure 2, the retained features were distributed across key spectral regions associated with lignocellulosic components, confirming that they captured chemically relevant information and thereby enabled simpler yet more accurate PLS models.

A strong absorption band observed at 1925–1942 nm—attributed to the O–H asymmetric stretch and deformation vibrations of water—is known to interfere with the quantification of other lignocellulosic components [31]. Intriguingly, the CARS algorithm largely avoided this region, whereas both VCPA-GA and VCPA-IRIV included features within it, potentially introducing noise into their models. In contrast, all three methods successfully retained established key spectral regions associated with lignocellulose:

Cellulose and hemicellulose: 1471–1563, 1586–1597, 1725–1731, 2092–2101, 2328–2336, and 2486–2491 nm [28,32];

Lignin: 1811, 2328–2332, 2375, and 2488 nm [28,32].

CARS selected a greater number of features within these informative regions, which likely contributes to the consistently higher prediction accuracy of CARS-PLS models across all target components.

Interestingly, all three algorithms commonly selected features in the 900–1400 nm range, a region not traditionally considered critical for lignocellulose prediction. Previous studies have linked specific sub-bands in this region to protein (e.g., 906–942, 1152, 1159, and 1187 nm) and fat (e.g., 1167–1210 and 1240–1270 nm) [33]. However, CARS has also been reported to utilize 1250–1400 nm for cellulose, 1111–1250 nm for hemicellulose, and 1176–1282 nm for lignin modeling [34], suggesting that these shorter wavelengths may still contain latent, component-specific information. Critically, because samples were collected during winter senescence (the “yellowing” phase), endogenous protein and lipid reserves were largely remobilized, minimizing interference from these non-target components. Consequently, the inclusion of 900–1400 nm features did not degrade model performance; rather, it enhanced robustness by leveraging additional spectral diversity.

Indeed, spectral diversity is essential for developing reliable NIRS calibrations. As shown in Figure 1B, the 900–1400 nm region exhibits considerable variation across samples, supporting its utility in multivariate modeling. Taken together, these findings indicate that harvesting during the winter yellowing stage not only reduces confounding effects from soluble nutrients but also enriches spectral contrast, ultimately facilitating more accurate prediction of lignocellulosic composition.

3.4. Sample Set Division

The selection of a representative calibration set is critical for developing robust and reliable multivariate calibration models, as it directly influences the predictive performance and stability of the model [35]. In this study, the sample set was partitioned into calibration and prediction subsets using the SPXY algorithm, which simultaneously considers both spectral (X) and reference property (Y) information to ensure comprehensive coverage of the data space. To balance model training and independent validation, approximately one-third of the samples (36 out of 107) were assigned to the prediction set, whereas the remaining two-thirds (71 samples) formed the calibration set.

Figure 3A presents the distribution of samples partitioned using the full spectral range (1601 features), whereas Figure 3B–D illustrate the partitions obtained after applying SPXY to the reduced variable sets selected by CARS, VCPA-GA, and VCPA-IRIV. In all cases, the SPXY algorithm yielded calibration and prediction sets with broad and representative distributions across the range of lignocellulosic component contents studied. Notably, the calibration sets encompass a higher proportion of extreme values and largely envelop the prediction sets in property space, confirming the effective coverage of sample variability.

This distribution pattern demonstrates that SPXY successfully optimizes the representativeness of both subsets, thereby supporting reliable model development and validation. However, when using the full spectrum, an imbalance was observed in the moisture content: the prediction set contained a slightly higher concentration of samples at certain moisture levels than the calibration set. This potential bias was effectively mitigated when SPXY was applied to the feature-reduced datasets generated by the three feature selection algorithms, underscoring the benefit of combining feature selection with intelligent sample partitioning.

An exception was noted for lignin in the VCPA-IRIV-based partition: the calibration set contained fewer samples than the prediction set within a specific lignin-content range. This underrepresentation may limit the model’s ability to accurately predict lignin in that interval, potentially explaining the relatively lower performance of the VCPA-IRIV-PLS model for this component. In contrast, the CARS- and VCPA-GA-based partitions exhibited more balanced distributions across all target components, contributing to the superior accuracy and robustness of the respective PLS models.

3.5. Quantitative Analysis of NIRS for Lignocellulosic Components

In this study, 15 PLS regression models were developed to predict the contents of cellulose, hemicellulose, lignin, ash, and moisture in Miscanthus samples. These models were built using spectral feature subsets optimized by three algorithms (CARS, VCPA-GA, and VCPA-IRIV) in combination with sample partitioning via the SPXY algorithm. To ensure robustness and minimize overfitting, all PLS models were validated using 10-fold cross-validation [36,37]: the calibration set was divided into 10 subsets, with one subset used as validation data in each iteration while the remaining nine were used for training. The optimal number of latent variables (LVs) for each model was determined by jointly considering the RMSEC and RMSECV, with the RMSECV minimum as the primary criterion (Section 2.7).

For comparative purposes, five additional PLS models based on the full spectrum (1601 features) were also constructed (Figure 4). The results demonstrate that all 15 optimized models significantly outperformed their full-spectrum counterparts, exhibiting lower RMSEC (0.18–0.44) and RMSECV (0.43–0.69), higher coefficients of determination for calibration (R²c = 0.92–0.99) and cross-validation (R²cv = 0.80–0.96), and improved stability. These findings corroborate our earlier observation—based on component distribution analysis—that modeling performance is closely tied to the representativeness and variability of the sample set. Notably, although ash is typically challenging to predict via NIRS due to its inorganic nature and low concentration in plant biomass [27,31], the feature-selection algorithms employed here markedly enhanced ash prediction accuracy, highlighting their effectiveness in extracting subtle but relevant spectral information.

Recent guidelines suggest that a successful multivariate calibration model should achieve an RPD > 3 and an RMSEP/RMSECV ratio between 0.8 and 1.2 [12,38]. In this study, the optimized PLS models met or approached these benchmarks, indicating strong predictive capability. To facilitate visual comparison of model performance, radar plots were generated for the full-spectrum PLS, CARS-PLS, VCPA-GA-PLS, and VCPA-IRIV-PLS models (Figure 4). Among the three feature selection strategies, CARS consistently delivered the best overall performance.

As summarized in Figure 4, the CARS-PLS models achieved the highest predictive accuracy for all components except ash. Specifically, for these four components, R²v values ranged from 0.95 to 0.98, RMSECV from 0.43 to 0.69, RMSEC from 0.18 to 0.31, RMSEP from 0.22 to 0.33, and RPD from 4.35 to 7.38, demonstrating excellent calibration and prediction capabilities.

For ash, however, the VCPA-IRIV-PLS model outperformed CARS-PLS despite using only 73 spectral features (4.56% of the original 1601). This indicates that VCPA-IRIV effectively compresses the spectral space while retaining features most relevant to ash prediction. This result aligns with prior findings in black tea ash modeling, where hybrid feature selection methods also proved superior for inorganic components [38], suggesting that VCPA-IRIV may be particularly well-suited for predicting low-abundance, non-organic components.

Figure 5 presents the scatter plots of predicted versus measured values, offering a direct visual assessment of the models’ predictive performance. The plots for the CARS-PLS models demonstrate a remarkable level of accuracy, with data points tightly and evenly clustered around the ideal y = x line for all five components. This low dispersion visually confirms the high R² and low RMSEP values reported in Figure 4.

In contrast, while the VCPA-GA and VCPA-IRIV models also show strong correlations, their scatter plots reveal slightly greater dispersion and a few noticeable outliers, particularly for lignin and ash. For instance, the VCPA-IRIV plot for lignin shows several points deviating from the trend line, consistent with its relatively lower RPD value. This visual evidence reinforces that CARS’s feature selection strategy, which retains a robust and highly informative set of features, translates directly into more reliable and accurate predictions across the entire concentration range. The superior fit demonstrated by the CARS-based models in Figure 5 underscores their robustness and suitability for high-throughput analysis.

In summary, all three variable selection algorithms—CARS, VCPA-GA, and VCPA-IRIV—significantly enhanced the accuracy, robustness, and interpretability of PLS models for lignocellulosic component prediction. The resulting CARS-PLS models (with the exception of ash) provide a reliable, rapid, and non-destructive tool for assessing key biomass traits in Miscanthus samples, supporting their potential use in bioenergy and biorefinery applications.

4. Conclusions

This study successfully demonstrates the potential of NIRS, combined with advanced feature selection algorithms, for the rapid and non-destructive quantitative prediction of five key components of Miscanthus biomass: cellulose, hemicellulose, lignin, ash, and moisture. Spectral preprocessing using the Savitzky–Golay second derivative (SD) effectively corrected baseline drift and scattering effects, thereby enhancing spectral reproducibility and model robustness.

Among the three feature selection strategies evaluated—CARS, VCPA-GA, and VCPA-IRIV—all yielded highly accurate and stable PLS models, with coefficients of determination (R²v) exceeding 0.80 for all target components. Notably, the CARS-PLS model exhibited superior predictive performance for cellulose, hemicellulose, lignin, and moisture, achieving high R²v (>0.95) and residual predictive deviation (RPD > 4.3) values, confirming that the features selected by CARS are highly informative for lignocellulose quantification. For ash prediction, the VCPA-IRIV-PLS model performed best, underscoring the importance of algorithm selection depending on the chemical nature of the target component.

Overall, this work establishes a reliable, rapid, and non-destructive NIRS-based analytical framework for assessing lignocellulosic composition in Miscanthus biomass. The developed models offer a practical tool for high-throughput phenotyping, enabling efficient screening of superior germplasm and accelerating breeding programs aimed at optimizing biomass quality for bioenergy and biorefinery applications.

Author Contributions

Conceptualization, B.L., Y.H. and M.L.; methodology, B.L., L.G. and S.W.; resources, Z.Y.; data curation, B.L., J.L. and X.W.; writing (original draft preparation), B.L. and M.L.; writing (review and editing), C.T. and S.X.; supervision, T.F. and M.L.; project administration, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported financially by the Yuelushan Laboratory Breeding Project (YLS-2025-ZY04075), the Project by Hunan Agricultural University for Supporting Young Interdisciplinary Scholars (2024XKJC05), the Hunan Province Major Scientific and Technological Breakthrough Project under the “Open Competition for Leadership” System (No. 2025AQ2034-4), the Scientific Research Project of Hunan Provincial Department of Education (23B0230), and Research funding of Hunan Agricultural University (25KJ040).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

We express our deepest gratitude to Hunan Engineering Laboratory of Miscanthus Ecological Applications of Hunan province for providing support and permission to conduct this research. We are thankful to Liang Xiao, Sai Yang, Zhiyong Chen, Weiming Liu, Ran Yuan, Yinghong Liang, Lanqing You, Weihong Du and other colleagues for their assistance in data collection.

Conflicts of Interest

The authors declare that they have no competing interests.

Figure 1. Descriptive statistics of chemical components and near-infrared spectral profiles of Miscanthus samples. (A) Box plots showing the distribution of measured values for key chemical components (cv: coefficient of variation). (B) Raw near-infrared spectra of the 107 samples. (C) Spectra after second derivative (SD) preprocessing.

Figure 2. Visualization of optimal features selected by CARS, VCPA-GA, and VCPA-IRIV algorithms for each component. The background curve is the mean preprocessed spectrum, shown for context. The vertical lines/markers indicate the specific features selected by each algorithm for modeling (A) cellulose, (B) hemicellulose, (C) lignin, (D) ash, and (E) moisture. Different colors/styles represent different algorithms.

Figure 3. Distribution of samples in the calibration and prediction sets. (A) Full spectrum; (B) CARS algorithm; (C) VCPA-GA; (D) VCPA-IRIV algorithm.

Figure 4. Radar diagram of chemical component PLS models of Miscanthus samples. (a) cellulose, (b) hemicellulose, (c) lignin, (d) ash, (e) moisture.

Figure 5. Scatter plots of predicted versus measured reference values of key chemical components. (A) Full spectrum; (B) CARS algorithm; (C) VCPA-GA algorithm; (D) VCPA-IRIV algorithm. The triangles represent individual predicted versus measured data points, and the solid line represents the 1:1 reference line. R²: coefficient of determination for validation.

Table 1. Parameter Settings for VCPA-GA and VCPA-IRIV Algorithms.

Parameter Name	Value
The exponential decay function	50
Binary matrix sampling	1000
Number of retained features in the last round of EDF	100
Average number of binary matrix sampling	103
The ratio of the best model to the worst model for the k submodel	0.1

Table 2. Common features and their specific locations for the three feature selection algorithms.

Component	Feature Selection Algorithm	Number of Selected Features	Common Features	Specific Location
Cellulose	CARS	90	20	906, 919, 1072, 1356, 1405, 1692, 1719, 1721, 1728, 1749, 1978, 1980, 2050, 2054, 2097, 2194, 2232, 2246, 2247, 2298
	VCPA-GA	98
	VCPA-IRIV	71
Hemicellulose	CARS	91	25	905, 918, 948, 1071, 1189, 1355, 1404, 1541, 1689, 1727, 1748, 1896, 1977, 1979, 2045, 2053, 2054, 2071, 2096, 2155, 2157, 2245, 2297, 2468, 2495
	VCPA-GA	96
	VCPA-IRIV	76
Lignin	CARS	120	18	1002, 1055, 1067, 1092, 1093, 1157, 1188, 1827, 1927, 1979, 1980, 2111, 2199, 2233, 2242, 2247, 2417, 2435
	VCPA-GA	88
	VCPA-IRIV	59
Ash	CARS	104	14	1035, 1133, 1309, 1334, 1620, 1772, 1793, 1974, 2080, 2153, 2188, 2219, 2339, 2415
	VCPA-GA	99
	VCPA-IRIV	73
Moisture	CARS	120	16	900, 992, 1002, 1024, 1088, 1368, 1636, 1645, 1861, 1896, 1920, 2008, 2320, 2367, 2370, 2398
	VCPA-GA	96
	VCPA-IRIV	67
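
The "Common Features" column can be read as the intersection of the feature sets chosen by the three algorithms for a given component. The sketch below illustrates that reading, assuming that "common" means selected by all three algorithms. The function name `common_features` and the short feature lists are hypothetical and far smaller than the real selections, since the table reports only the shared locations and not each algorithm's full list.

```python
def common_features(selections):
    """Return the sorted features that appear in every algorithm's selection.

    `selections` maps an algorithm name to the list of feature locations
    it selected for one chemical component.
    """
    sets = [set(features) for features in selections.values()]
    return sorted(set.intersection(*sets))


# Hypothetical, shortened selections for one component (NOT the study's lists)
selections = {
    "CARS": [906, 919, 1072, 1356, 1500],
    "VCPA-GA": [906, 1072, 1356, 1405, 1800],
    "VCPA-IRIV": [906, 1072, 1200, 1356],
}

shared = common_features(selections)
print(len(shared), shared)
```

Applied to the full selections, the length of the returned list would correspond to the "Common Features" count and its contents to the "Specific Location" entries.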

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
