Quantitative Determination of Nitrogen Content in Cucumber Leaves Using Raman Spectroscopy and Multidimensional Feature Selection

Zhaolong Hou; Feng Tan; Manshu Li; Jiaxin Gao; Chunjie Su; Feng Jiao; Yaxuan Wang; Xin Zheng

doi:10.3390/agronomy15081884

¹ College of Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China

² College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China

³ College of Horticulture and Landscape Architecture, Heilongjiang Bayi Agricultural University, Daqing 163319, China

⁴ College of Agriculture, Heilongjiang Bayi Agricultural University, Daqing 163319, China

Agronomy 2025, 15(8), 1884; https://doi.org/10.3390/agronomy15081884

This article belongs to the Section Precision and Digital Agriculture


Abstract

Cucumber, a high-yielding crop commonly grown in facility environments, is particularly susceptible to nitrogen (N) deficiency due to its rapid growth and high nutrient demand. This study used cucumber as its experimental subject and established a spectral dataset of leaves under four nutritional conditions (normal supply, nitrogen deficiency, phosphorus deficiency, and potassium deficiency), aiming to develop an efficient and robust method for quantifying N in cucumber leaves using Raman spectroscopy (RS). Spectral data were preprocessed using three baseline correction methods, namely BaselineWavelet (BW), Iteratively Improve the Moving Average (IIMA), and Iterative Polynomial Fitting (IPF), and key spectral variables were selected using 4-Dimensional Feature Extraction (4DFE) and Competitive Adaptive Reweighted Sampling (CARS). These selected features were then used to develop a N content prediction model based on Partial Least Squares Regression (PLSR). The results indicated that baseline correction significantly enhanced model performance, with all three methods outperforming unprocessed spectra. Further analysis showed that the combination of IPF, 4DFE, and CARS yielded the best PLSR model, with determination coefficients (R²) of 0.947 and 0.847 for the calibration and prediction sets, respectively. The corresponding root mean square errors (RMSEC and RMSEP) were 0.250 and 0.368, while the residual predictive deviation (RPDC and RPDP) values reached 4.335 and 2.555. These findings confirm the feasibility of integrating RS with advanced data processing for rapid, non-destructive nitrogen assessment in cucumber leaves, offering a valuable tool for nutrient monitoring in precision agriculture.

Keywords:

Raman spectroscopy; cucumber leaves; nitrogen quantification; non-destructive detection; spectral preprocessing; chemometric modeling; feature selection

1. Introduction

Nitrogen (N) is a fundamental component of cellular proteins and a key constituent of chlorophyll, enzymes, and phytohormones, playing an essential role in plant physiological processes [1]. An insufficient N supply limits crop growth, while excessive N application reduces fertilizer use efficiency and contributes to soil degradation and environmental pollution [2,3]. Precision N management is thus critical for enhancing agricultural productivity and sustainability. As a critical organ in plants, leaves serve as indicators of N status, reflecting overall nutritional conditions [4]. The accurate determination of leaf N content provides a scientific basis for the precise application of N fertilizers, facilitating optimized fertilization strategies and improved N use efficiency [5].

In traditional nutrient management, plant nutrient status has been primarily assessed through visual inspection and laboratory chemical analysis. However, visual assessment is subjective and prone to expert bias, making accurate diagnosis difficult [6]. Chemical methods, such as Kjeldahl N determination [7] and the Dumas combustion method [8], provide high accuracy but are cumbersome and time-consuming, and they require destructive sampling, which damages crops [9]. With the advancement of precision and intelligent agriculture, there is an urgent need for efficient, rapid, and non-destructive N detection technology to enable real-time monitoring and the precise management of crop N content.

In recent years, spectroscopic techniques have advanced rapidly in agricultural and plant sciences, providing crucial tools for non-destructive analysis [10]. Among these techniques, Raman spectroscopy (RS) offers distinct advantages for crop nutrient detection, as it probes molecular vibrational characteristics with minimal interference from water content and enables the in situ detection of biochemical components in plant tissues [11,12]. Previous studies have demonstrated the promising potential of Raman spectroscopy for monitoring nutrient stress in crops [13,14]. For instance, Huang et al. quantified nitrate content in plants using Raman spectroscopy [15]. Other research has also applied Raman spectroscopy for the rapid detection of available nitrogen in soil, achieving similarly high detection accuracy [16,17]. Additionally, Partial Least Squares Regression (PLSR) has been a fundamental and widely used method in Raman spectral modeling. For example, Khongkaew et al. [18] demonstrated that PLSR models could effectively quantify lead content in turmeric, while Pattamapan et al. [19] developed a rapid method for analyzing gamma-oryzanol content in rice bran oil using RS and PLSR. Experimental findings confirm that the laser excitation used in RS does not induce thermal or photochemical degradation in plant tissues, further highlighting its applicability for non-destructive nutrient analysis [20]. However, the quantitative analysis of leaf N content using RS remains underexplored, and its potential in precision agriculture requires further investigation.

During spectral acquisition, RS not only captures the chemical composition of samples but is also influenced by instrument noise, experimental conditions, and fluorescence background interference [21]. To enhance spectral quality, baseline correction is an essential preprocessing step that effectively eliminates background offsets and restores true feature peaks [22]. For example, Liu et al. [23] achieved a rapid quantitative analysis of chlorophyll content in citrus leaves by combining Raman spectroscopy with background subtraction methods. Additionally, large spectral datasets can pose challenges to modeling, potentially reducing efficiency and increasing the risk of overfitting. Feature extraction methods help mitigate these issues by reducing variable dimensionality, decreasing modeling complexity, and improving model stability and predictive accuracy [24]. Yu et al. [25] successfully established a predictive model for total nitrogen content in Korla fragrant pear leaves using near-infrared spectroscopy combined with the CARS algorithm to extract characteristic wavelengths. Similarly, Liu et al. [26] employed the CARS method to select spectral variables highly correlated with baicalin, and the results indicated that the simplified model after variable selection outperformed the full-spectrum model in terms of both simplicity and predictive performance. Overall, implementing appropriate data processing strategies is crucial for minimizing interference, extracting key information, and enhancing the reliability of quantitative models. However, the effects of different baseline correction and feature extraction method combinations on RS-based quantitative analysis require further study for specific detection objectives.

Cucumber is a widely cultivated economic crop in facility agriculture, characterized by a short growth cycle, high yield, and substantial nutrient demand [27]. In intensive farming systems, the precise regulation of the N supply is crucial for optimizing cucumber growth, biomass accumulation, and yield formation. However, phosphorus and potassium, as essential macronutrients for plants, play important roles in influencing nitrogen absorption and utilization efficiency, which in turn affect the nitrogen content in leaves [28] and may lead to alterations in their spectral response characteristics. Therefore, to construct a N prediction model with broad adaptability, cucumber leaf samples covering a representative range of nitrogen levels—under typical nitrogen, phosphorus, and potassium deficiencies, as well as normal growth conditions—were obtained through controlled nutrient regulation. On this basis, RS was integrated with chemometric modeling, incorporating multidimensional feature extraction and dimensionality reduction strategies, to develop a quantitative analytical method for N detection in cucumber leaves, with the aim of exploring its feasibility as an alternative to traditional chemical analysis.

2. Materials and Methods

2.1. Plant Materials

The cucumber cultivar ‘Jinyou 401’, provided by the Tianjin Kerun Cucumber Research Institute, was used in this study. During the early growth stage, the plants were irrigated with standard Hoagland nutrient solution [29], and perlite was used as the cultivation substrate. On the 15th day after transplanting, 48 uniformly growing healthy plants were selected and randomly divided into four treatment groups (normal supply, nitrogen deficiency, phosphorus deficiency, and potassium deficiency), with 12 plants in each group. For the normal supply group, the nutrient solution followed the complete Hoagland formula, including Ca(NO₃)₂·4H₂O at 945 mg/L, KNO₃ at 607 mg/L, (NH₄)H₂PO₄ at 115 mg/L, MgSO₄·7H₂O at 493 mg/L, and trace elements such as Fe-EDTA at 2.5 mg/L, H₃BO₃ at 2.86 mg/L, MnCl₂·4H₂O at 2.13 mg/L, ZnSO₄·7H₂O at 0.22 mg/L, CuSO₄·5H₂O at 0.08 mg/L, and Na₂MoO₄·2H₂O at 0.02 mg/L, with the pH adjusted to 6.0. In the nitrogen deficiency group, Ca(NO₃)₂·4H₂O and KNO₃ were omitted and replaced with 520 mg/L of CaCl₂ and 450 mg/L of KCl, respectively [15]. In the phosphorus deficiency group, (NH₄)H₂PO₄ was omitted and replaced with 53 mg/L of NH₄Cl to maintain the nitrogen supply. In the potassium deficiency group, KNO₃ was replaced with 510 mg/L of NaNO₃.

Considering the variation in nitrogen content across different leaf positions [30,31] and referring to previous studies on nutrient stress sampling time points [14,32], sampling was conducted at 24, 48, 72, 96, 120, 144, 168, and 192 h after treatment initiation. To ensure sufficient representativeness while minimizing potential physiological changes caused by excessive sampling, three plants were selected as biological replicates in each group during each round, and one functional leaf was collected from each plant. Each round therefore yielded 12 leaves, and the eight rounds together yielded 96 fresh leaf samples. After harvesting, leaf surfaces were rinsed with deionized water, air-dried, and sealed in bags for rapid transfer to the laboratory for RS measurements and chemical N analysis. Although RS enables in situ and non-destructive detection, in this study, excised leaves were measured under controlled laboratory conditions to minimize potential interferences from temperature, light, and other environmental factors in the greenhouse. This approach ensured consistent measurement conditions and angles, thereby maximizing data quality and comparability, and it provides a foundation for future in vivo non-destructive diagnostics.

2.2. Spectral Data Acquisition

A miniature Raman spectrometer (ATP3000P, OPTOSKY, Xiamen, China) was used for spectral data acquisition in this study. The instrument was equipped with a 785 nm fiber-optic laser source, covering a spectral measurement range of 200–3400 cm⁻¹, with a resolution of 1 cm⁻¹ and a wavenumber accuracy within ±4 cm⁻¹. According to the manufacturer’s instructions, Raman shift calibration was performed using an acetonitrile standard. During leaf spectral acquisition, the instrument parameters were set as follows: laser power of 400 mW, an integration time of 3.5 s, and three consecutive measurements at each sampling point. The average of these three measurements was taken as the raw spectral data to minimize random noise.

Due to spatial heterogeneity in N distribution within the leaf, variations in N content at different locations could affect spectral detection accuracy [33]. To mitigate this effect, the leaf vein region was avoided during spectral acquisition, and a grid scanning mode was employed to acquire spectra from multiple measurement points at 1.0 cm intervals on one side of the main vein (Figure 1). The average spectrum from multiple sampling points was used as the representative spectrum for each leaf sample. This strategy effectively reduced measurement noise caused by local metabolic variations in the leaves, thereby improving the consistency and accuracy of spectral signals used for modeling.

Figure 1. Grid scanning mode for Raman spectral acquisition.

2.3. N Content Determination

After RS data acquisition, cucumber leaves were immediately placed in an oven at 105 °C for 20 min to deactivate the enzymes. They were then dried at 80 °C until a constant weight was achieved and were stored in a desiccator for subsequent chemical analysis. The dried leaf samples were weighed, ground into a fine powder, and digested with concentrated H₂SO₄. The N content was determined using the Kjeldahl method [7] and expressed as the mass fraction in g/100 g, calculated as follows:

N (%) = \frac{(V_{2} - V_{0}) \times c \times 0.014}{m \times (V_{1} / V)} \times 100

(1)

where c is the concentration of the sulfuric acid standard titration solution (1/2 H₂SO₄, 0.01 mol/L), V₂ is the volume of the standard acid solution consumed by the sample (mL), V₀ is the volume consumed by the blank (mL), V₁ is the volume of the aliquot of liquid A taken for distillation (mL), V is the total volume of liquid A (mL), m is the sample mass (g), and 0.014 is the mass of N (g) equivalent to 1 mL of 1 mol/L sulfuric acid standard titration solution (1/2 H₂SO₄).
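As a worked illustration of Equation (1), the short Python sketch below evaluates the formula for one sample. The function name, argument names, and titration readings are hypothetical and were chosen only to show the arithmetic; they are not measurements from this study.

```python
def kjeldahl_n_percent(v_sample_ml, v_blank_ml, acid_conc, mass_g,
                       aliquot_ml, total_ml):
    """Leaf N content in g/100 g, following Equation (1)."""
    if mass_g <= 0 or aliquot_ml <= 0 or total_ml <= 0:
        raise ValueError("mass and volumes must be positive")
    # Grams of N in the distilled aliquot:
    # net titrant volume (V2 - V0, mL) x c (mol/L) x 0.014 g N per mL per mol/L
    n_in_aliquot_g = (v_sample_ml - v_blank_ml) * acid_conc * 0.014
    # Sample mass represented by the aliquot: m x (V1 / V)
    mass_in_aliquot_g = mass_g * (aliquot_ml / total_ml)
    return n_in_aliquot_g / mass_in_aliquot_g * 100.0


# Hypothetical readings, used only to illustrate the calculation
example = kjeldahl_n_percent(v_sample_ml=5.5, v_blank_ml=0.2, acid_conc=0.01,
                             mass_g=0.2, aliquot_ml=10.0, total_ml=100.0)
print(round(example, 2))  # 3.71
```

With these illustrative inputs, a net titrant volume of 5.3 mL corresponds to 0.000742 g of N in an aliquot that represents 0.02 g of dried leaf, that is, 3.71 g/100 g.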

2.4. Preprocessing Methods

To eliminate background offsets, enhance feature peaks, and improve spectral reliability, three baseline correction methods were applied separately to optimize the signal-to-noise ratio and reduce non-target interference. The specific methods are as follows:

(i): BaselineWavelet (BW): As reported by Zhang et al. [34], this approach is based on wavelet decomposition. The baseline drift is estimated from the low-frequency approximation coefficients and removed from the spectrum, thereby improving spectral reliability. In this study, the wavelet basis function chosen was db8, with six levels of decomposition.
(ii): Iteratively Improve the Moving Average (IIMA): Following the method described by Wang et al. [35], this method is based on a moving average algorithm that iteratively adjusts the baseline by comparing the mean and central values within a moving window, enabling more accurate background fitting. The parameters were set as follows: window size = 31; iteration count = 5.
(iii): Iterative Polynomial Fitting (IPF): According to the approach proposed by Gan et al. [36], this method uses polynomial regression for baseline correction by iteratively removing spectral peaks and progressively optimizing the baseline fitting to enhance correction accuracy. In this study, the polynomial fitting order was set to 15, and the residual threshold for stopping iterations was set to 5%.
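To make the idea behind method (iii) concrete, the following is a minimal sketch of a generic iterative polynomial baseline fit. It is not the algorithm of Gan et al. [36]: the clipping rule and the reading of the 5% threshold as a relative change in the residual standard deviation are assumptions, and the synthetic spectrum is purely illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial


def iterative_polyfit_baseline(shift, intensity, order=15, tol=0.05, max_iter=100):
    """Estimate a baseline by repeated polynomial fitting with peak clipping.

    Assumed stopping rule: the residual standard deviation changes by less
    than `tol` (relative) between two consecutive iterations.
    """
    x = np.asarray(shift, dtype=float)
    y = np.asarray(intensity, dtype=float)
    work = y.copy()
    prev_dev = None
    baseline = np.zeros_like(y)
    for _ in range(max_iter):
        # Polynomial.fit rescales x internally, which keeps order 15 well conditioned
        baseline = Polynomial.fit(x, work, deg=order)(x)
        dev = float(np.std(work - baseline))
        if prev_dev is not None and abs(dev - prev_dev) <= tol * max(dev, 1e-12):
            break
        # Clip everything that rises more than one deviation above the fit (the peaks)
        work = np.minimum(work, baseline + dev)
        prev_dev = dev
    return baseline, y - baseline


# Synthetic spectrum: sloping background plus two Gaussian peaks and mild noise
rng = np.random.default_rng(0)
shift = np.arange(700.0, 1801.0)
true_baseline = 500.0 + 0.2 * (shift - 700.0)
peaks = (300.0 * np.exp(-0.5 * ((shift - 1003.0) / 6.0) ** 2)
         + 200.0 * np.exp(-0.5 * ((shift - 1527.0) / 8.0) ** 2))
raw = true_baseline + peaks + rng.normal(0.0, 2.0, shift.size)
baseline, corrected = iterative_polyfit_baseline(shift, raw)
```

Because the peaks are clipped before each refit, the polynomial is pulled progressively toward the background underneath them, which is the behaviour the method description relies on.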

2.5. Feature Extraction

To enhance model stability and reliability, this study employed 4-Dimensional Feature Extraction (4DFE), Competitive Adaptive Reweighted Sampling (CARS), and their combined strategy for feature extraction from baseline-corrected spectral data.

4DFE was implemented using the scipy.signal.find_peaks method in Python to automatically detect spectral peaks. To maximize useful spectral information, intensity values were extracted within ±25 cm⁻¹ of each detected peak, and the peak height (h) was included as a feature, resulting in 51 variables per peak. Additionally, peak_prominences and peak_widths functions were used to calculate peak prominence (p) and width (w), while the Simpson method was employed to compute the peak area (S). In total, 54 feature variables were extracted per peak for subsequent modeling and analysis.
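The per-peak feature vector described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes a 1 cm⁻¹ sampling step (so that ±25 cm⁻¹ equals ±25 samples), passes the 0.8 factor as `rel_height` to `peak_widths`, and takes the area above the width line; the function name `four_d_features` and the synthetic spectrum are hypothetical.

```python
import numpy as np
from scipy.integrate import simpson
from scipy.signal import find_peaks, peak_prominences, peak_widths


def four_d_features(intensity, peak_idx, half_window=25, rel_height=0.8):
    """Return one row of 54 features per peak: 51 intensities, then p, w, S."""
    y = np.asarray(intensity, dtype=float)
    peak_idx = np.asarray(peak_idx, dtype=np.intp)
    if peak_idx.min() < half_window or peak_idx.max() + half_window >= y.size:
        raise ValueError("every peak needs a full window inside the spectrum")
    prom_data = peak_prominences(y, peak_idx)
    widths, width_heights, left_ips, right_ips = peak_widths(
        y, peak_idx, rel_height=rel_height, prominence_data=prom_data)
    rows = []
    for i, pk in enumerate(peak_idx):
        # 51 intensities centred on the peak; the middle value is the height h
        window = y[pk - half_window: pk + half_window + 1]
        # Area S between the width line and the peak contour (Simpson's rule)
        lo, hi = int(np.ceil(left_ips[i])), int(np.floor(right_ips[i]))
        segment = y[lo:hi + 1] - width_heights[i]
        area = simpson(segment, x=np.arange(lo, hi + 1, dtype=float)) if segment.size >= 3 else 0.0
        rows.append(np.concatenate([window, [prom_data[0][i], widths[i], area]]))
    return np.vstack(rows)


# Synthetic two-peak spectrum on a 700-1800 cm^-1 axis with a 1 cm^-1 step
shift = np.arange(700.0, 1801.0)
spectrum = (300.0 * np.exp(-0.5 * ((shift - 1003.0) / 6.0) ** 2)
            + 200.0 * np.exp(-0.5 * ((shift - 1527.0) / 8.0) ** 2))
peak_idx, _ = find_peaks(spectrum, prominence=50.0)
features = four_d_features(spectrum, peak_idx)  # shape: (number of peaks, 54)
```

Stacking the rows for 15 peaks and flattening them would give the 810 variables per spectrum referred to later in the text.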

CARS is a feature selection method that integrates Monte Carlo sampling with PLSR model coefficients. It mimics the “survival of the fittest” principle in Darwinian evolution, retaining variables with the highest absolute regression coefficients to optimize model performance [37]. The CARS algorithm proceeded as follows:

(i): Monte Carlo Sampling: In each iteration, 80% of calibration samples were randomly selected for modeling, while the remaining 20% were used as a validation set. The absolute values of regression coefficients in the PLSR model were recorded:

w_{i} = |b_{i}| / \sum_{i = 1}^{m} |b_{i}|

(2)

where w_i represents the weight of the absolute regression coefficient for the ith variable, |b_i| is its absolute regression coefficient, and m is the number of remaining variables in the iteration.

(ii): Exponential decay function: An exponential decay function was applied to eliminate variables with small absolute regression coefficients. The proportion of retained spectral points in the jth iteration is given as follows:

R_{j} = μ e^{- k j}

(3)

where μ and k are constants, calculated based on initial and final retention conditions.

(iii): Adaptive reweighted sampling: In each iteration, a subset of variables was selected using adaptive reweighted sampling for PLSR modeling, and the Root Mean Square Error of Cross-Validation (RMSECV) was calculated.
(iv): Selection of optimal feature subset: After n iterations, CARS generated n candidate feature subsets and their corresponding RMSECV values. The subset with the lowest RMSECV was selected as the final feature set.
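The variable-retention mechanics of steps (i) to (iii) can be sketched as below. This covers only the weighting of Equation (2), the decay schedule of Equation (3), and the reweighted draw; the PLSR fit and the RMSECV calculation are left to the caller. The boundary conditions (all variables retained in the first iteration, two in the last) are the ones commonly used for CARS and are an assumption here, as are the function names.

```python
import numpy as np


def retention_ratios(n_vars, n_iter):
    """Equation (3): R_j = mu * exp(-k * j) for j = 1..n_iter.

    Assumed boundary conditions: R_1 = 1 and R_N = 2 / n_vars.
    """
    k = np.log(n_vars / 2.0) / (n_iter - 1)
    mu = (n_vars / 2.0) ** (1.0 / (n_iter - 1))
    j = np.arange(1, n_iter + 1)
    return mu * np.exp(-k * j)


def cars_step(coefs, active, ratio, n_total, rng):
    """One shrink step: forced removal by the decay ratio, then a weighted draw."""
    abs_coefs = np.abs(np.asarray(coefs, dtype=float))
    if abs_coefs.sum() == 0.0:
        raise ValueError("all regression coefficients are zero")
    weights = abs_coefs / abs_coefs.sum()               # Equation (2)
    n_keep = min(max(2, int(round(ratio * n_total))), active.size)
    order = np.argsort(weights)[::-1][:n_keep]          # keep the largest weights
    kept, w = active[order], weights[order]
    # Adaptive reweighted sampling: draw with replacement, probability ~ weight
    drawn = rng.choice(kept.size, size=kept.size, replace=True, p=w / w.sum())
    return np.sort(kept[np.unique(drawn)])


# Schedule for 1101 spectral variables and 50 sampling iterations
ratios = retention_ratios(1101, 50)

# One illustrative step with random stand-ins for PLSR regression coefficients
rng = np.random.default_rng(1)
active = np.arange(1101)
subset = cars_step(rng.normal(size=active.size), active, ratios[9], 1101, rng)
```

In a full run, each subset would be passed to a cross-validated PLSR model, and the subset with the lowest RMSECV across all iterations would be kept, as described in step (iv).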

2.6. Model Construction and Evaluation

Partial Least Squares Regression (PLSR) is a multivariate statistical method widely used in chemometric modeling [38,39]. Compared to traditional multiple regression methods, PLSR effectively handles datasets with strong correlations and multicollinearity, making it particularly suitable when the number of samples is smaller than the number of variables. In this study, PLSR models were constructed based on the aforementioned preprocessing and feature extraction methods.

The optimal number of latent variables (LVs) was determined via K-fold cross-validation, and the LV count was kept below 10 to prevent overfitting [40]. The model’s predictive performance was evaluated using the coefficient of determination (R²), root mean square error (RMSE), and residual predictive deviation (RPD) [41]. R² quantifies the goodness of fit between observed and predicted values. RMSE measures the deviation between predicted and actual values. RPD assesses predictive capability, with higher values indicating better performance. To distinguish the performance on different datasets, the metrics corresponding to the calibration set were denoted as R²_c, RMSEC, and RPDC, while those for the prediction set were denoted as R²_p, RMSEP, and RPDP.
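For reference, the three metrics can be computed as in the sketch below, which uses the common definitions R² = 1 − SSE/SST, RMSE = sqrt(mean squared residual), and RPD = SD/RMSE. The source does not state which standard deviation convention was used for RPD, so the `ddof` argument is an assumption exposed as a parameter; the sample values are invented.

```python
import numpy as np


def regression_metrics(y_true, y_pred, ddof=1):
    """Return R2, RMSE, and RPD for measured versus predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    sse = float(np.sum(residuals ** 2))
    sst = float(np.sum((y_true - y_true.mean()) ** 2))
    rmse = float(np.sqrt(sse / y_true.size))
    return {
        "r2": 1.0 - sse / sst,
        "rmse": rmse,
        # RPD: spread of the reference values relative to the prediction error
        "rpd": float(np.std(y_true, ddof=ddof)) / rmse,
    }


# Invented values, for illustration only
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = regression_metrics(measured, predicted, ddof=0)
print(round(metrics["r2"], 3), round(metrics["rmse"], 3), round(metrics["rpd"], 3))
# 0.99 0.141 10.0
```

With the population standard deviation (`ddof=0`), these definitions imply R² = 1 − 1/RPD², which is a convenient consistency check between reported values.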

All data processing was performed using Python 3.10.4, while graphical visualizations were generated using Origin 2021.

3. Results

3.1. N Content Statistics and Dataset Partitioning

The N content measurements of cucumber leaves are presented as a box-and-whisker plot (Figure 2), with values ranging from 1.214% to 5.396%, a mean of 3.713%, and a standard deviation of 1.051%. The inter-sample variability in N content provides a broader data distribution, serving as a robust basis for model calibration.

Figure 2. Box-and-whisker plot of sample N content. The boxes represent the interquartile range, the lines inside the boxes represent the medians, and the whiskers denote the lowest and highest values within 1.5 times the interquartile range. Each point represents the N content of an individual cucumber leaf.

The 96 samples were divided into a calibration set and a prediction set in a 3:1 ratio [25]. To ensure that the calibration set encompassed the full N content range, particularly extreme values, a stratified sampling strategy was employed with reference to the literature [42]. First, all samples were ranked in ascending order based on N content and divided into 12 groups. Within each group, the second and seventh samples were allocated to the prediction set, while the remaining samples were assigned to the calibration set. Consequently, the calibration set comprised 72 samples, and the prediction set contained 24 samples (Table 1). This sampling strategy ensured that the N content range in the calibration set fully covered that of the prediction set, thereby enhancing the model’s generalization ability.
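The partitioning rule can be sketched as follows, assuming that the 12 groups are consecutive blocks of eight samples in the ranked list (the text does not say how the groups were formed). The function name and the random N values are hypothetical.

```python
import numpy as np


def rank_based_split(n_content, n_groups=12, picks=(1, 6)):
    """Split sample indices into calibration and prediction sets.

    Samples are ranked by N content and cut into consecutive groups; the
    zero-based positions in `picks` (second and seventh) go to prediction.
    """
    y = np.asarray(n_content, dtype=float)
    order = np.argsort(y, kind="stable")
    groups = np.array_split(order, n_groups)
    prediction = np.concatenate([group[list(picks)] for group in groups])
    calibration = np.setdiff1d(order, prediction)
    return np.sort(calibration), np.sort(prediction)


# Random stand-in values spanning the reported N range (1.214% to 5.396%)
rng = np.random.default_rng(42)
n_values = rng.uniform(1.214, 5.396, size=96)
cal_idx, pred_idx = rank_based_split(n_values)
print(cal_idx.size, pred_idx.size)  # 72 24
```

Because the lowest-ranked and highest-ranked samples of each group stay in the calibration set, the calibration range always brackets the prediction range under this rule.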

Table 1. Summary statistics of N content and dataset partitioning results.

3.2. Comparison of Different Baseline Correction Methods

Figure 3 illustrates the preprocessing of raw spectra using the three baseline correction methods—BW, IIMA, and IPF—along with their corresponding corrected spectra. As shown in Figure 3a,c,e, the baselines fitted via these methods exhibited slight variations, leading to noticeable differences in the corrected spectral data. However, when combined with Figure 3b,d,f, it is evident that the Raman spectra of the three corrected datasets within the 700–1800 cm⁻¹ range consistently display a series of prominent feature peaks. In contrast, spectral regions outside this range were dominated by noise interference or weak signals, which failed to provide stable spectral information. To minimize the impact of invalid signals while preserving key spectral features, this study selected the 700–1800 cm⁻¹ range for subsequent quantitative analysis. Each extracted spectrum contained 1101 data points.

Figure 3. Comparison of preprocessing and correction effects of three baseline correction methods. (a,b) BW; (c,d) IIMA; (e,f) IPF.

To evaluate the impact of baseline correction on quantitative analysis, PLSR models were constructed using the corrected spectra, and the number of LVs was determined via cross-validation. Table 2 summarizes the modeling results for the cropped raw spectrum (CRS) and spectra processed with different baseline correction methods. The results indicated that baseline correction significantly enhanced model performance. Compared to modeling with raw spectra, all three baseline correction methods yielded better fitting results on both the calibration and prediction sets. Among them, the IIMA method achieved the best results on the calibration set, with an R²_c of 0.933 and an RMSEC of 0.267. In contrast, the IPF method performed best on the prediction set, with an R²_p of 0.799, an RMSEP of 0.466, and an RPDP of 2.231. Although the BW method outperformed the raw spectra on the prediction set, its performance on the calibration set (R²_c = 0.915) did not show a marked improvement over the raw spectra (R²_c = 0.913). These differences suggest that various baseline correction methods affect the signal-to-noise ratio and the retention of feature information, influencing the final N content prediction performance. While IIMA and IPF showed promising calibration results, further feature extraction and modeling analyses were conducted to determine the optimal preprocessing approach.

Table 2. PLSR modeling results for the CRS and different baseline correction methods.

3.3. Spectral Analysis and Feature Extraction

To further comprehensively extract effective N-related information from the Raman spectra of cucumber leaves, fifteen representative characteristic peaks within the 700–1800 cm⁻¹ range were identified and summarized, as shown in Table 3, along with their corresponding vibrational modes and chemical assignments in leaf tissues. Among them, peaks directly related to N include the in-plane N–H bending at 1286 cm⁻¹ and the NH₂ scissoring vibration at 1612 cm⁻¹. However, relying solely on these N-specific signals is insufficient to accurately reflect the overall nitrogen status of the leaves. In fact, several peaks that indirectly reflect N-related metabolic status were also detected in the spectra. Chlorophyll-related peaks were observed at 1155, 1185, 1225, and 1527 cm⁻¹, while protein-related peaks were found at 1003 and 1674 cm⁻¹. Although these peaks do not originate from typical N-containing functional group vibrations, their spectral intensities were found to be significantly correlated with leaf nitrogen content. Therefore, comprehensive extraction and integrated analysis of multidimensional spectral features is essential for improving the accuracy of nitrogen quantification.

Table 3. Vibrational modes and assignments of Raman peaks in cucumber leaves.

Based on the spectral data preprocessed with the three baseline correction methods (BW, IIMA, and IPF) and the preceding spectral feature analysis, a total of 15 prominent Raman peak positions were selected at 747, 917, 1003, 1048, 1115, 1155, 1185, 1225, 1286, 1327, 1387, 1443, 1527, 1612, and 1674 cm⁻¹. The 4DFE method was then applied to extract 54 parameters from each peak, resulting in 810 feature variables per spectrum (15 peaks × 54 parameters). Figure 4 presents the extracted peak features. Here, h denotes the intensities within ±25 cm⁻¹ of the peak position, including the peak height itself; p is the peak prominence, shown as the yellow vertical line in Figure 4; w is the width at 0.8 times the prominence height, shown as the green horizontal line; and S is the area enclosed between the green horizontal line and the peak contour. As shown in Figure 4a–c, the four-dimensional information of all 15 feature peaks was successfully extracted from the spectra corrected by each of the three baseline correction methods, with slight variations in the h, p, w, and S values across methods. Although 4DFE expands the feature dimensions, it may introduce irrelevant or redundant variables, potentially interfering with the quantitative analysis. Therefore, additional feature selection and dimensionality reduction steps were necessary.

Figure 4. Extracted peak features from baseline-corrected spectra using 4DFE. (a) BW + 4DFE; (b) IIMA + 4DFE; (c) IPF + 4DFE. The peak features include h (intensity at the peak position within ±25 cm⁻¹), p (yellow vertical line indicating peak prominence), w (width at 0.8 times the prominence height, shown as a green horizontal line), and S (area enclosed between the green horizontal line and the peak contour).

To improve model stability and computational efficiency, CARS was employed for feature selection, with 50 Monte Carlo iterations. Figure 5 illustrates the relationship between the number of selected features and model performance during CARS selection. Figure 5a shows that, as iterations progressed, the number of retained features gradually decreased before stabilizing. This suggests that the initial screening focused on removing feature variables unrelated to N content. Figure 5b depicts the RMSECV trend during CARS selection, which initially decreased, reaching a minimum at the 14th iteration (RMSECV = 0.310), before rising due to excessive feature reduction. This trend indicates that the initial screening effectively reduced the interference of irrelevant variables, thereby lowering prediction error. However, as component-related variables were over-screened, information loss led to an increase in RMSECV. Figure 5c further highlights the trends of regression coefficients corresponding to different sampling iterations, with vertical lines indicating the number of variables corresponding to the minimum RMSECV—the optimal number of features after screening. Ultimately, 267, 234, and 206 features were retained from BW, IIMA, and IPF-corrected spectra, accounting for 24.25%, 21.25%, and 18.71% of the original 1101 features, respectively. When combined with 4DFE, the selected feature subsets were further reduced to 166, 188, and 188 variables, compressing feature dimensions to 20.49%, 23.21%, and 23.21%, respectively, significantly lowering model complexity (Table 4).

Figure 5. CARS feature selection process. (a) Relationship between the number of variables and the number of sampling iterations; (b) relationship between the number of sampling iterations and RMSECV; (c) trend of regression coefficient changes.

Table 4. PLSR modeling results using different baseline correction algorithms combined with feature extraction methods.

3.4. Evaluation of the PLSR Model for N Content in Cucumber Leaves

Table 4 presents the performance of PLSR models constructed using different feature extraction methods. The results demonstrated that the combined 4DFE + CARS method outperformed individual feature extraction techniques. Under BW preprocessing, the PLSR model based on the 4DFE + CARS combination achieved an R²_c of 0.940 and an RMSEC of 0.252 on the calibration set. This result was superior to that obtained using CARS alone (R²_c = 0.924, RMSEC = 0.284) or 4DFE alone (R²_c = 0.928, RMSEC = 0.277). A similar trend was observed under both IIMA and IPF preprocessing, further confirming the generalizability and robustness of the combined strategy. In addition, all models required fewer than 10 latent variables (ranging from 4 to 6), indicating that model complexity was well controlled and reducing the risk of overfitting. Overall, our findings indicate that incorporating multidimensional peak shape features (e.g., peak height, width, and area), combined with feature selection, improves the model’s ability to quantify N content. Such models tend to exhibit superior predictive accuracy and stability compared to conventional approaches relying solely on spectral intensity.

The radar charts and scatter plots for evaluating the performance of the 4DFE + CARS + PLSR models are presented in Figure 6. Figure 6a illustrates the modeling performance on both the calibration and prediction sets for the three baseline correction methods combined with the 4DFE + CARS feature extraction strategy. As shown, the BW + 4DFE + CARS scheme yielded the weakest prediction performance, with an R²_p of 0.722, an RMSEP of 0.547, and an RPDP of 1.897. The IIMA + 4DFE + CARS scheme performed slightly better, achieving an R²_p of 0.747, an RMSEP of 0.522, and an RPDP of 1.988 on the prediction set. In contrast, the IPF + 4DFE + CARS scheme demonstrated the best predictive performance, with an R²_p of 0.847, the lowest RMSEP of 0.368, and the highest RPDP of 2.555. Figure 6b presents the fitting results of measured versus predicted values for the PLSR model constructed using the optimal scheme (IPF + 4DFE + CARS). In the calibration set, the model achieved an R²_c of 0.947, an RMSEC of 0.250, and an RPDC of 4.335; in the prediction set, it reached an R²_p of 0.847, an RMSEP of 0.368, and an RPDP of 2.555. Moreover, the deviation of the fit line between the calibration set and the prediction set was small, and the scatter points were distributed near the regression line, which indicated that the PLSR model constructed using this scheme achieved good prediction stability.

Figure 6. Radar and scatter plots for 4DFE + CARS + PLSR model performance evaluation. (a) Comparison of evaluation results for the calibration and prediction sets across the three combination schemes; (b) scatter plot of measured versus predicted values for the IPF + 4DFE + CARS + PLSR model.

4. Discussion

The ability to rapidly, non-destructively, and accurately detect nutrient content in plant leaves is crucial for plant physiology research and precision agriculture management. This study explored a method for predicting nitrogen content in cucumber leaves by integrating Raman spectroscopy with PLSR. A comparative analysis demonstrated that all three baseline correction methods (BW, IIMA, and IPF) significantly improved the model’s fitting performance on both the calibration and prediction sets, confirming the critical role of baseline correction in suppressing fluorescence background interference and enhancing modeling accuracy. A further evaluation revealed that the IPF + 4DFE + CARS scheme outperformed other approaches across all metrics, representing the best-performing modeling scheme in this study. This strategy, when combined with automated sensor networks in the future, could enable the real-time and rapid monitoring of crop nutrient status, thereby providing technical support for precision fertilization and intelligent crop management.

From the perspective of feature engineering, 4DFE enables a more comprehensive extraction of useful information from spectral data. However, its high-dimensional nature may introduce redundant variables, thereby increasing computational complexity and reducing analytical efficiency [6,52]. In contrast, CARS effectively eliminates redundant features while retaining spectral variables that are highly correlated with nitrogen content, thereby enhancing the accuracy and stability of the model. The combination of 4DFE and CARS achieves a balanced integration of multidimensional feature extraction and dimensionality reduction, leading to a substantial improvement in model predictive performance. Additionally, a study has shown that RS, when integrated with multidimensional feature extraction methods, demonstrates excellent classification performance in the qualitative detection of crop seeds [53], further underscoring the effectiveness of this strategy in enhancing model robustness and generalizability.
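The four peak descriptors named in the Figure 4 caption (h, p, w, and S) can be approximated with standard peak-analysis routines. The sketch below uses SciPy's `find_peaks`, `peak_prominences`, and `peak_widths`. It is an assumption-laden illustration rather than the authors' 4DFE code. The function name `four_feature_peaks`, the prominence threshold, the uniform wavenumber grid, and the synthetic spectrum are hypothetical. Mapping "width at 0.8 times the prominence height" onto SciPy's `rel_height=0.8` is one possible reading of the caption, and h is taken at the detected peak rather than within a ±25 cm⁻¹ window.

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences, peak_widths


def four_feature_peaks(wavenumbers, intensities, min_prominence=1.0,
                       rel_height=0.8):
    """Return one (position, h, p, w, S) tuple per detected peak."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(intensities, dtype=float)
    step = x[1] - x[0]  # assumes a uniform wavenumber grid
    peaks, _ = find_peaks(y, prominence=min_prominence)
    prominence_data = peak_prominences(y, peaks)
    widths, width_heights, left_ips, right_ips = peak_widths(
        y, peaks, rel_height=rel_height, prominence_data=prominence_data)
    features = []
    for i, idx in enumerate(peaks):
        lo = int(np.ceil(left_ips[i]))
        hi = int(np.floor(right_ips[i]))
        # Area between the peak contour and the horizontal width line,
        # approximated with the rectangle rule.
        above = np.clip(y[lo:hi + 1] - width_heights[i], 0.0, None)
        area = float(above.sum() * step)
        features.append((float(x[idx]),                 # peak position (cm^-1)
                         float(y[idx]),                 # h: peak intensity
                         float(prominence_data[0][i]),  # p: prominence
                         float(widths[i] * step),       # w: width (cm^-1)
                         area))                         # S: enclosed area
    return features


# Synthetic, baseline-free spectrum with a single Gaussian peak (illustrative only).
x = np.arange(700.0, 1701.0, 1.0)
y = 30.0 * np.exp(-0.5 * ((x - 1155.0) / 8.0) ** 2)
features = four_feature_peaks(x, y)
```

Each detected peak contributes four values, which illustrates why the 4DFE feature set is considerably larger than the number of assigned Raman bands and why a subsequent selection step such as CARS is useful.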

The model performance metrics differed noticeably between the calibration and prediction sets (for example, an R² of 0.947 versus 0.847 for the IPF + 4DFE + CARS model), which was primarily due to the introduction of phosphorus (P) and potassium (K) deficiency treatments in this study. These treatments increased sample complexity, contributing to enhanced model generalizability and adaptability, but they also posed greater challenges to predictive performance. The RPD value is an important indicator for determining whether a model is suitable for quantitative analysis [41,54]. It is generally accepted that an RPD value greater than 3.0 indicates a highly satisfactory model, whereas 2.0 < RPD < 3.0 denotes a model with good predictive ability. An RPD between 1.4 and 2.0 suggests an intermediate model requiring improvement, while RPD < 1.4 indicates poor predictive ability [55,56]. In this study, the PLSR model constructed using IPF + 4DFE + CARS achieved an RPD of 2.555 on the prediction set. According to the RPD evaluation criteria, this suggests that the model exhibits good predictive ability and can be used to predict the N content of cucumber leaves, which further illustrates its potential for practical applications.
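The RPD criteria quoted above can be written as a small lookup, shown here as a sketch applied to the prediction-set RPD values of the three 4DFE + CARS schemes in Table 4. The function name `classify_rpd` is hypothetical. The text does not say how values exactly equal to 2.0 or 3.0 should be treated, so assigning them to the lower category is an assumption.

```python
def classify_rpd(rpd):
    """Map an RPD value to the qualitative categories quoted in the text."""
    if rpd > 3.0:
        return "highly satisfactory"
    if rpd > 2.0:
        return "good"
    if rpd >= 1.4:
        return "intermediate"
    return "poor"


# Prediction-set RPD values reported in Table 4 for the 4DFE + CARS schemes.
reported_rpdp = {"BW": 1.897, "IIMA": 1.988, "IPF": 2.555}
labels = {name: classify_rpd(value) for name, value in reported_rpdp.items()}
print(labels)
```

By these thresholds only the IPF-based scheme reaches the "good" category on the prediction set, while the BW and IIMA schemes fall just below 2.0.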

It is worth noting that, in previous studies, Huang et al. [15] employed RS to monitor N stress in Arabidopsis thaliana, Pak Choi, and Choy Sum, and they identified a characteristic peak at 1046 cm⁻¹ as a Raman response associated with nitrate in plant tissues. In contrast, the 1048 cm⁻¹ peak detected in this study was attributed to Raman-active regions related to cellulose or phenylpropanoid structures. This is because cellulose is closely associated with plant cell walls, which are important structures for nitrogen distribution in leaves [57]. The observed shift in peak position may be due to interspecies differences in cellular structure and chemical composition, leading to subtle variations in Raman response or signal intensity. Although the 1048 cm⁻¹ peak in this study was assigned to structural component vibrations, the possibility that it partially reflects nitrate-related Raman activity cannot be entirely excluded. In addition, several of the peak positions identified in this study closely resemble those reported by Sanchez et al. [14] in rice leaves, suggesting a degree of spectral commonality across crop species. This supports the potential applicability of RS for cross-species quantitative analysis of N content.

Although the constructed models demonstrated good predictive performance, there remains room for further optimization. Due to the limited sample size and the exclusive use of greenhouse-grown samples under soilless cultivation conditions, the stability and generalizability of the models need to be further validated. Future studies could incorporate regularization-based feature selection methods such as LASSO, include samples from different cultivars and growth stages, and introduce independent external validation sets to continuously enhance model performance. Additionally, it is recommended to conduct validation under field conditions, fully considering environmental and soil variability, to further evaluate the stability and practical applicability of the models in actual production environments. These efforts will provide a solid foundation for the broader application of the proposed method in precision agriculture.

5. Conclusions

This study demonstrated the potential of RS for the non-destructive and rapid quantification of N content in cucumber leaves. The experimental samples encompassed a wide range of nutritional conditions, including normal nutrition, as well as nitrogen, phosphorus, and potassium deficiencies, thereby enhancing the model’s generalization ability across diverse nutritional backgrounds. By integrating RS with PLSR, we successfully established predictive models for N content in cucumber leaves. Importantly, to improve model stability and generalizability, we incorporated multidimensional feature selection strategies. Among the evaluated preprocessing and feature extraction combinations, the IPF + 4DFE + CARS scheme yielded the best performance. Specifically, this model achieved an R²_c of 0.947, an RMSEC of 0.250, and an RPDC of 4.335 on the calibration set, and an R²_p of 0.847, an RMSEP of 0.368, and an RPDP of 2.555 on the prediction set, demonstrating strong predictive capability. In summary, this study provides robust technical support for the rapid, non-destructive detection of N in cucumber leaves using RS, and it serves as a reference for predicting leaf N content in other plant species.

Author Contributions

Z.H.: data curation, formal analysis, methodology, and writing—original draft. F.T.: conceptualization, funding acquisition, and writing—review and editing. M.L.: data curation, and writing—original draft. J.G.: investigation and software. C.S.: investigation and methodology. F.J.: data curation and validation. Y.W.: funding acquisition, resources, and investigation. X.Z.: funding acquisition, project administration, resources, supervision, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key Research and Development Program (2023YFD2301605), Heilongjiang Provincial Natural Science Foundation of China (PL2024E024), Heilongjiang Provincial Natural Science Foundation Joint Guidance Program (LH2019E072), the Key Research and Development Program of Heilongjiang Province (GZ20220020), the Heilongjiang Natural Science Foundation (LH2023F043), the Natural Science Talent Support Program of Heilongjiang Bayi Agricultural University (ZRCPY202015), the Higher Education Teaching Reform Research Project of Heilongjiang Province (SJGY20210622), the San-Zong Scientific Research Support Program of Heilongjiang Bayi Agricultural University (ZRCPY202120), and the 2022 Doctoral Startup Fund of Heilongjiang Bayi Agricultural University (XDB202211).

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to Z.H.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ohyama, T. Nitrogen as a major essential element of plants. Nitrogen Assim. Plants 2010, 37, 1–17.
2. Amer, K.H.; Midan, S.A.; Hatfield, J.L. Effect of Deficit Irrigation and Fertilization on Cucumber. Agron. J. 2009, 101, 1556–1564.
3. Dai, J.; Liu, S.; Zhang, W.; Xu, R.; Luo, W.; Zhang, S.; Yin, X.; Han, L.; Chen, W. Quantifying the effects of nitrogen on fruit growth and yield of cucumber crop in greenhouses. Sci. Hortic. 2011, 130, 551–561.
4. de Mello Prado, R.; Rozane, D.E. Leaf analysis as diagnostic tool for balanced fertilization in tropical fruits. In Fruit Crops; Srivastava, A.K., Hu, C., Eds.; Elsevier: Amsterdam, The Netherlands, 2020.
5. Kant, S.; Bi, Y.-M.; Rothstein, S.J. Understanding plant response to nitrogen limitation for the improvement of crop nitrogen use efficiency. J. Exp. Bot. 2011, 62, 1499–1509.
6. Zhang, X.; Duan, C.; Wang, Y.; Gao, H.; Hu, L.; Wang, X. Research on a nondestructive model for the detection of the nitrogen content of tomato. Front. Plant Sci. 2023, 13, 1093671.
7. Kjeldahl, C. A new method for the determination of nitrogen in organic matter. Z. Anal. Chem. 1883, 22, 366.
8. Dumas, J. Procédés de l’analyse organique. Ann. Chim. Phys. 1831, 47, 198–205.
9. Li, D.; Zhang, P.; Chen, T.; Qin, W. Recent Development and Challenges in Spectroscopy and Machine Vision Technologies for Crop Nitrogen Diagnosis: A Review. Remote Sens. 2020, 12, 2578.
10. Cavaco, A.M.; Utkin, A.B.; Marques da Silva, J.; Guerra, R. Making Sense of Light: The Use of Optical Spectroscopy Techniques in Plant Sciences and Agriculture. Appl. Sci. 2022, 12, 997.
11. Juárez, I.D.; Kurouski, D. Contemporary applications of vibrational spectroscopy in plant stresses and phenotyping. Front. Plant Sci. 2024, 15, 1411859.
12. Farber, C.; Kurouski, D. Raman Spectroscopy and Machine Learning for Agricultural Applications: Chemometric Assessment of Spectroscopic Signatures of Plants as the Essential Step Toward Digital Farming. Front. Plant Sci. 2022, 13, 887511.
13. Zhao, X.; Cai, L. Early detection of zinc deficit with confocal Raman spectroscopy. J. Raman Spectrosc. 2018, 49, 1706–1712.
14. Sanchez, L.; Ermolenkov, A.; Biswas, S.; Septiningsih, E.M.; Kurouski, D. Raman Spectroscopy Enables Non-invasive and Confirmatory Diagnostics of Salinity Stresses, Nitrogen, Phosphorus, and Potassium Deficiencies in Rice. Front. Plant Sci. 2020, 11, 573321.
15. Huang, C.H.; Singh, G.P.; Park, S.H.; Chua, N.-H.; Ram, R.J.; Park, B.S. Early Diagnosis and Management of Nitrogen Deficiency in Plants Utilizing Raman Spectroscopy. Front. Plant Sci. 2020, 11, 663.
16. Dong, T.; Xiao, S.; He, Y.; Tang, Y.; Nie, P.; Lin, L.; Qu, F.; Luo, S. Rapid and Quantitative Determination of Soil Water-Soluble Nitrogen Based on Surface-Enhanced Raman Spectroscopy Analysis. Appl. Sci. 2018, 8, 701.
17. Qin, R.; Zhang, Y.; Ren, S.; Nie, P. Rapid Detection of Available Nitrogen in Soil by Surface-Enhanced Raman Spectroscopy. Int. J. Mol. Sci. 2022, 23, 10404.
18. Khongkaew, P.; Phechkrajang, C.; Cruz, J.; Cárdenas, V.; Rojsanga, P. Quantitative models for detecting the presence of lead in turmeric using Raman spectroscopy. Chemom. Intell. Lab. Syst. 2020, 200, 103994.
19. Pattamapan, L.; Chutima, P.; Pawida, S.; Natthinee, A. Raman spectroscopy coupled with the PLSR model: A rapid method for analyzing gamma-oryzanol content in rice bran oil. Food Chem. X 2024, 30, 101923.
20. Butler, H.J.; McAinsh, M.R.; Adams, S.; Martin, F.L. Application of vibrational spectroscopy techniques to non-destructively monitor plant health and development. Anal. Methods 2015, 7, 4059–4070.
21. McCreery, R.L. Raman Spectroscopy for Chemical Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2005.
22. Mostafapour, S.; Dörfer, T.; Heinke, R.; Rösch, P.; Popp, J.; Bocklitz, T.J. Investigating the effect of different pre-treatment methods on Raman spectra recorded with different excitation wavelengths. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2023, 302, 123100.
23. Liu, Y.; Cheng, M.; Hao, Y.; Zhang, Y.; Hou, Z. Quantitative analysis of chlorophyll content in citrus leaves by Raman Spectroscopy. Spectrosc. Spectr. Anal. 2019, 39, 1768–1772.
24. Zhao, X.; Xu, M.; Zhang, W.; Liu, G.; Tong, L.J. Identification of zinc pollution in rice plants based on two characteristic variables. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 261, 120043.
25. Yu, M.; Bai, X.; Bao, J.; Wang, Z.; Tang, Z.; Zheng, Q.; Zhi, J. The Prediction Model of Total Nitrogen Content in Leaves of Korla Fragrant Pear Was Established Based on Near Infrared Spectroscopy. Agronomy 2024, 14, 1284.
26. Liu, X.; Zhang, S.; Ni, H.; Xiao, W.; Wang, J.; Li, Y.; Wu, Y. Near infrared system coupled chemometric algorithms for the variable selection and prediction of baicalin in three different processes. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2019, 218, 33–39.
27. Pan, F.; Pan, S.; Tang, J.; Yuan, J.; Zhang, H.; Chen, B. Fertilization Practices: Optimization in Greenhouse Vegetable Cultivation with Different Planting Years. Sustainability 2022, 14, 7543.
28. Shi, Z.; Tang, J.; Cheng, R.; Luo, D.; Liu, S. A review of nitrogen allocation in leaves and factors in its effects. Acta Ecol. Sin. 2015, 35, 5909–5919.
29. Hoagland, D.R.; Arnon, D.I. The water-culture method for growing plants without soil. Circ. Calif. Agric. Exp. Stn. 1938, 347, 39.
30. Röll, G.; Hartung, J.; Graeff-Hönninger, S. Determination of Plant Nitrogen Content in Wheat Plants via Spectral Reflectance Measurements: Impact of Leaf Number and Leaf Position. Remote Sens. 2019, 11, 2794.
31. Hikosaka, K. Optimality of nitrogen distribution among leaves in plant canopies. J. Plant Res. 2016, 129, 299–311.
32. Antoszewski, G.; Guenther, J.F.; Roberts, J.K.; Adler, M.; Dalle Molle, M.; Kaczmar, N.S.; Miller, W.B.; Mattson, N.S.; Grab, H. Non-Invasive Detection of Nitrogen Deficiency in Cannabis sativa Using Hand-Held Raman Spectroscopy. Agronomy 2024, 14, 2390.
33. Yuan, Z.; Cao, Q.; Zhang, K.; Ata-Ul-Karim, S.T.; Tian, Y.; Zhu, Y.; Cao, W.; Liu, X.J. Optimal leaf positions for SPAD meter measurement in rice. Front. Plant Sci. 2016, 7, 719.
34. Zhang, Z.-M.; Chen, S.; Liang, Y.-Z.; Liu, Z.-X.; Zhang, Q.-M.; Ding, L.-X.; Ye, F.; Zhou, H. An intelligent background-correction algorithm for highly fluorescent samples in Raman spectroscopy. J. Raman Spectrosc. 2009, 41, 659–669.
35. Wang, Y.; Tan, F. Extraction and classification of origin characteristic peaks from rice Raman spectra by principal component analysis. Vib. Spectrosc. 2021, 114, 103249.
36. Gan, F.; Ruan, G.; Mo, J. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemom. Intell. Lab. Syst. 2006, 82, 59–65.
37. Li, H.; Liang, Y.; Xu, Q.; Cao, D. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 2009, 648, 77–84.
38. Huang, X.; Li, H.; Ruan, Y.; Li, Z.; Yang, H.; Xie, G.; Yang, Y.; Du, Q.; Ji, K.; Yang, M.J. An integrated approach utilizing raman spectroscopy and chemometrics for authentication and detection of adulteration of agarwood essential oils. Front. Chem. 2022, 10, 1036082.
39. Wong, K.H.; Razmovski-Naumovski, V.; Li, K.M.; Li, G.Q.; Chan, K.J. Differentiation of Pueraria lobata and Pueraria thomsonii using partial least square discriminant analysis (PLS-DA). J. Pharm. Biomed. Anal. 2013, 84, 5–13.
40. Guo, P.; Li, T.; Gao, H.; Chen, X.; Cui, Y.; Huang, Y. Evaluating calibration and spectral variable selection methods for predicting three soil nutrients using Vis-NIR spectroscopy. Remote Sens. 2021, 13, 4000.
41. Gao, X.; Fan, D.; Li, W.; Zhang, X.; Ye, Z.; Meng, Y.; Cheng-Yi Liu, T. Rapid quantification of the adulteration of pomegranate juices by Raman spectroscopy and chemometrics. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2023, 302, 123014.
42. Zeng, J.; Ping, W.; Sanaeifar, A.; Xu, X.; Luo, W.; Sha, J.; Huang, Z.; Huang, Y.; Liu, X.; Zhan, B.; et al. Quantitative visualization of photosynthetic pigments in tea leaves based on Raman spectroscopy and calibration model transfer. Plant Methods 2021, 17, 4.
43. Synytsya, A.; Čopíková, J.; Matějka, P.; Machovič, V.J. Fourier transform Raman and infrared spectroscopy of pectins. Carbohydr. Polym. 2003, 54, 97–106.
44. Edwards, H.; Farwell, D.; Webster, D.J. FT Raman microscopy of untreated natural plant fibres. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 1997, 53, 2383–2392.
45. Schulz, H.; Baranska, M.; Baranski, R.J. Potential of NIR-FT-Raman spectroscopy in natural carotenoid analysis. Biopolymers 2005, 77, 212–221.
46. De Gelder, J.; De Gussem, K.; Vandenabeele, P.; Moens, L. Reference database of Raman spectra of biological molecules. J. Raman Spectrosc. 2007, 38, 1133–1147.
47. Picaud, T.; Le Moigne, C.; Gomez de Gracia, A.; Desbois, A. Soret-Excited Raman Spectroscopy of the Spinach Cytochrome b₆f Complex. Structures of the b-and c-Type Hemes, Chlorophyll a, and β-Carotene. Biochemistry 2001, 40, 7309–7317.
48. Vítek, P.; Novotná, K.; Hodaňová, P.; Rapantová, B.; Klem, K. Detection of herbicide effects on pigment composition and PSII photochemistry in Helianthus annuus by Raman spectroscopy and chlorophyll a fluorescence. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2017, 170, 234–241.
49. Zhiying, Z.; Renao, G.; Tianhong, L. Raman Spectroscopy Application in Chemistry; Northeastern University Press: Boston, MA, USA, 1998.
50. Yu, M.M.; Schulze, H.G.; Jetter, R.; Blades, M.W.; Turner, R.F.J. Raman microspectroscopic analysis of triterpenoids found in plant cuticles. Appl. Spectrosc. 2007, 61, 32–37.
51. Devitt, G.; Howard, K.; Mudher, A.; Mahajan, S.J. Raman spectroscopy: An emerging tool in neurodegenerative disease research and diagnosis. ACS Chem. Neurosci. 2018, 9, 404–420.
52. Ma, Y.; Zhu, L. A Review on Dimension Reduction. Int. Stat. Rev. 2013, 81, 134–150.
53. Liu, R.; Tan, F.; Wang, Y.; Ma, B.; Yuan, M.; Wang, L.; Zhao, X. Machine learning identification of Saline-Alkali-Tolerant Japonica rice varieties based on Raman spectroscopy and Python visual analysis. Agriculture 2022, 12, 1048.
54. Jin, X.; Li, S.; Zhang, W.; Zhu, J.; Sun, J. Prediction of soil-available potassium content with visible near-infrared ray spectroscopy of different pretreatment transformations by the boosting algorithms. Appl. Sci. 2020, 10, 1520.
55. Hssaini, L.; Razouk, R.; Bouslihim, Y. Rapid prediction of fig phenolic acids and flavonoids using mid-infrared spectroscopy combined with partial least square regression. Front. Plant Sci. 2022, 13, 782159.
56. Congli, M.; Ziyu, W.; Hui, J. Determination of aflatoxin B1 in wheat using Raman spectroscopy combined with chemometrics. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 327, 125384.
57. Onoda, Y.; Hikosaka, K.; Hirose, T. Allocation of nitrogen to cell walls decreases photosynthetic nitrogen-use efficiency. Funct. Ecol. 2004, 18, 419–425.

Figure 1. Grid scanning mode for Raman spectral acquisition.

Figure 2. Box-and-whisker plot of sample N content. The boxes represent the interquartile range, the lines inside the boxes represent the medians, and the whiskers denote the lowest and highest values within 1.5 times the interquartile range. Each point represents the N content of an individual cucumber leaf.

Figure 3. Comparison of preprocessing and correction effects of three baseline correction methods. (a,b) BW; (c,d) IIMA; (e,f) IPF.

Figure 4. Extracted peak features from baseline-corrected spectra using 4DFE. (a) BW + 4DFE; (b) IIMA + 4DFE; (c) IPF + 4DFE. The peak features include h (intensity at the peak position within ±25 cm⁻¹), p (yellow vertical line indicating peak prominence), w (width at 0.8 times the prominence height, shown as a green horizontal line), and S (area enclosed between the green horizontal line and the peak contour).

Figure 5. CARS feature selection process. (a) Relationship between the number of variables and the number of sampling iterations; (b) relationship between the number of sampling iterations and RMSECV; (c) trend of regression coefficient changes.
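The decline in the number of retained variables over sampling runs, as plotted in Figure 5a, is typically governed in CARS by an exponentially decreasing function. The sketch below follows the form given in the CARS paper by Li et al. listed in the references, r_i = a·exp(−k·i) with a = (p/2)^(1/(N−1)) and k = ln(p/2)/(N−1), so that all p variables are kept in the first run and two in the last. The function name `cars_retained_counts` and the choice of 50 runs are assumptions; 810 is the number of 4DFE features reported in Table 4. Only this enforced reduction step is shown. The adaptive reweighted sampling and the RMSECV-based choice of the final subset are omitted.

```python
import math


def cars_retained_counts(n_variables, n_runs):
    """Number of variables kept after each of n_runs sampling runs under the
    exponentially decreasing function r_i = a * exp(-k * i)."""
    k = math.log(n_variables / 2.0) / (n_runs - 1)
    a = (n_variables / 2.0) ** (1.0 / (n_runs - 1))
    return [max(2, round(n_variables * a * math.exp(-k * i)))
            for i in range(1, n_runs + 1)]


# 810 candidate features (Table 4); 50 sampling runs is an illustrative choice.
counts = cars_retained_counts(n_variables=810, n_runs=50)
print(counts[:5], counts[-5:])
```

The subset actually reported is the one with the lowest RMSECV along this path, which is why the final feature counts in Table 4 (166 to 188) lie between the two extremes.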

Table 1. Summary statistics of N content and dataset partitioning results.

Dataset	Sample Size	Maximum (%)	Minimum (%)	Mean (%)	SD (%)
Calibration Set	72	5.396	1.214	3.715	1.031
Prediction Set	24	5.366	1.384	3.709	1.109
Total	96	5.396	1.214	3.713	1.051

SD represents the standard deviation.

Table 2. PLSR modeling results for the CRS and different baseline correction methods.

Preprocessing Method	LVs	Calibration R²_c	Calibration RMSEC	Prediction R²_p	Prediction RMSEP	Prediction RPDP
CRS	7	0.913	0.304	0.729	0.541	1.922
BW	6	0.915	0.300	0.738	0.532	1.954
IIMA	6	0.933	0.267	0.743	0.527	1.974
IPF	6	0.919	0.294	0.799	0.466	2.231

Table 3. Vibrational modes and assignments of Raman peaks in cucumber leaves.

No.	Wavenumber (cm⁻¹)	Vibrational Mode	Assignment
1	747	γ(C–O–H) of COOH	Pectin [43]
2	917	Symmetric in-plane ν(C–O–C)	Cellulose, phenylpropanoids [44]
3	1000–1005	In-plane CH₃ rocking of polyene; aromatic ring vibration of phenylalanine	Carotenoids [45], proteins [46]
4	1048–1068	ν(C–O) + ν(C–C) + δ(C–O–H)	Cellulose, phenylpropanoids [44]
5	1115	δ(C–O–H)	Cellulose [44]
6	1155	ν(C–O–C), ν(C–C) in glycosidic linkages, asymmetric ring breathing	Chlorophyll [23], carotenoids [45]
7	1185	ν(CmC₁₀) + δ(CbH); ν(C–O–H) next to aromatic ring + σ(CH)	Chlorophyll [47], carotenoids [45]
8	1225	δ(CH) + δ(CH₂)	Chlorophyll [48]
9	1286	δ(N–H), amide III	Proteins [49]
10	1327	δ(CH₂)	Cellulose, lignin [44]
11	1387	δ(CH₂)	Aliphatics [50]
12	1443–1446	δ(CH₂) + δ(CH₃)	Aliphatics [50]
13	1527–1545	In-plane –C=C– stretching	Chlorophyll [23]
14	1612	NH₂ scissoring vibration	Primary amines [49]
15	1674	ν(C=O), amide I	Proteins [51]

Table 4. PLSR modeling results using different baseline correction algorithms combined with feature extraction methods.

Preprocessing Method	Feature Extraction Method	Number of Features	LVs	Calibration R²_c	Calibration RMSEC	Prediction R²_p	Prediction RMSEP	Prediction RPDP
BW	CARS	267	4	0.924	0.284	0.714	0.502	1.871
BW	4DFE	810	6	0.928	0.277	0.719	0.551	1.887
BW	4DFE + CARS	166	5	0.940	0.252	0.722	0.547	1.897
IIMA	CARS	234	4	0.933	0.266	0.738	0.481	1.954
IIMA	4DFE	810	6	0.937	0.259	0.742	0.528	1.969
IIMA	4DFE + CARS	188	5	0.954	0.221	0.747	0.522	1.988
IPF	CARS	206	4	0.921	0.290	0.812	0.451	2.304
IPF	4DFE	810	6	0.936	0.260	0.839	0.417	2.493
IPF	4DFE + CARS	188	5	0.947	0.250	0.847	0.368	2.555

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
