1. Introduction
With increasing global industrialization, cadmium (Cd) pollution in agricultural production has become a severe environmental challenge worldwide, driven primarily by anthropogenic activities such as industrial emissions, mining operations, the application of phosphate fertilizers and sewage sludge, and improper disposal of electronic waste [1,2]. The most severe cadmium pollution in agricultural soils has been reported in northern and central India, Pakistan, Bangladesh, southern China, and southern Thailand [3]. In China, the average cadmium concentration in agricultural soils is 0.19 mg/kg, approximately twice the natural background level of 0.097 mg/kg. Alarmingly, cadmium pollution affects 33.54% of farmland and 44.65% of urban soils [4].
Cadmium is a Group 1 carcinogen [5]. Previous research has indicated that Cd generally impairs plant physiological and molecular processes, either directly or indirectly, affecting tissue growth, nutrient uptake and balance, photosynthesis, antioxidant enzyme activity, reactive oxygen species (ROS) accumulation, biomass, and molecular pathways. The toxic effects of Cd include growth inhibition, root damage, leaf curling and yellowing, and even leaf abscission. Furthermore, excessive accumulation of Cd in plants can induce massive ROS production, causing cell membrane lipid peroxidation, chloroplast degradation, severe damage to photosynthetic reaction centres, and inhibition of plant growth and development [6,7].
Pak choi (Brassica chinensis L.) is a high-value vegetable in Asia and is important for ensuring global food security because of its widespread consumption. This vegetable is widely cultivated in Europe, the Mediterranean, and East Asia (particularly China, South Korea, and Japan) [8,9]. Pak choi, a subspecies of B. campestris L. in the genus Brassica of the family Brassicaceae, has a very high capacity for cadmium uptake and accumulation. Cadmium, a nonessential element, is easily absorbed and accumulates in the edible leaves, subsequently inducing a biomagnification effect through the food chain. This effect not only threatens crop production itself but also causes irreversible harm to human health [10]. In both the General Standard for Contaminants and Toxins in Food and Feed (Codex Stan 193-1995) [11] of the Codex Alimentarius Commission (CAC) and China’s current National Food Safety Standard: Maximum Levels of Contaminants in Food (GB 2762-2022) [12], the maximum limit of cadmium (Cd) in leafy vegetables is 0.2 mg/kg. However, due to factors such as soil cadmium contamination and varietal characteristics, cadmium levels in some pak choi crops may exceed this limit during actual production, posing a direct threat to food safety and human health.
Therefore, assessing cadmium levels in pak choi can mitigate the risk of human exposure via the food chain. Early detection of cadmium accumulation can increase crop safety, reduce losses, prevent cadmium from entering the human body, and guide pollution prevention and control; it is therefore important for food security, public health, and sustainable agricultural development. Traditional methods for detecting heavy metals in crops rely on laboratory chemical analysis of a large number of leaf samples, which is inefficient and time-consuming. With the development of photoelectric non-destructive prediction technology, visible–near-infrared reflectance spectroscopy has become an efficient alternative technique for detecting heavy metal pollution [13]. Only leaf spectral data are needed to establish Cd content models based on sensitive bands or spectral indices [14]. Some studies have reported that spectral-based methods can be used to determine the type of pollution or stress suffered by plants, including pest and disease stress, salinity stress, water stress, and heavy metal pollution [15,16,17,18]. Wang et al. [19] analysed the relationship between the spectral reflectance of pepper leaves at four growth stages under different cadmium stress levels and the Cd content in mature pepper fruits, and estimated the fruit Cd content using multiple regression. Yi et al. [20] used hyperspectral remote sensing data combined with support vector machine regression (SVMR) to estimate the cadmium content in field pepper and eggplant leaves, with prediction-set determination coefficients ($R^2$) of 0.897 and 0.726, respectively. Sun et al. [21] used deep belief networks combined with hyperspectral imaging to estimate the cadmium content in lettuce. Shen et al. [22] used hyperspectral imaging and chemometrics to estimate the free proline content in rice leaves under cadmium stress. The results indicated that the ELM model based on 27 feature wavelengths selected by CARS performed best, with an $R^2$ value of 0.9426, and could be used to explore changes in free amino acids in rice leaves under Cd stress. The above studies indicate that hyperspectral technology is feasible for detecting and analysing crop element contents.
Research on the spectral inversion of cadmium has focused on soil, rice, tomatoes, lettuce, and other crops [23,24,25,26], with few studies on pak choi, which has a strong enrichment capacity. More critically, existing technologies often struggle to achieve precise detection during the sub-visual stage of cadmium stress, when conventional physiological indicators such as chlorophyll fluorescence and pigment indices show no significant changes, which limits their practical value in early warning applications. This study uses hyperspectral imaging and machine learning tools to construct a prediction model for cadmium content in pak choi. The objectives are as follows: (1) to determine the effect of different cadmium stress levels on pak choi using phenotypic data such as 2D images, physiological indicators, cadmium content accumulation, and chlorophyll fluorescence parameters; (2) to evaluate the effectiveness of preprocessing and feature band selection tools in improving model performance; and (3) to apply chemometrics and regression techniques to develop prediction models. This study aims to fill the technical gap in early sub-visual detection of cadmium stress in pak choi, providing a novel technical pathway for non-destructive, rapid early warning of cadmium content in pak choi.
2. Materials and Methods
2.1. Experimental Design
The tested variety was ‘Huaguan Qinggengcai’, bred by Musashino Seed Co., Ltd., Tokyo, Japan. The seedling substrate was imported Pindstrup peat (Denmark). Anhydrous cadmium chloride (CdCl2) was obtained from Shanghai Macklin Biochemical Technology Co., Ltd., Shanghai, China.
The experiments were conducted from 14 February to 20 April 2025, in the phenotyping experimental greenhouse and physiological and biochemical laboratory of the Digital Agriculture Research Institute, Fujian Academy of Agricultural Sciences. Pak choi was sown on 14 February in plug trays covered with peat. After sowing, the seedlings were placed on a tidal seedling bed within a small greenhouse. The nutrient solution formula was as follows: to 100 L of water, A (11.2 kg Ca(NO3)2, 12 kg KNO3) and B (2.6 kg KH2PO4, 3 kg MgSO4, and 750 g EDTA-Fe) were added. During the seedling stage, the electrical conductivity (EC) of the nutrient solution was 1.0 mS/cm. The tidal bed was flooded daily at 08:00 for 5 min, held for 10 min, and then allowed to drain for 10 min. To reduce interference from other factors in the soil, a nutrient film technique (NFT) was used for cultivation to ensure that the experiment was affected only by the cadmium concentration. The transplanting date was 2 April, when the seedlings were moved from the plug trays to custom-made NFT cultivation troughs. The nutrient solution EC was 1.5 mS/cm. After precultivation with a normal nutrient solution for 7 days, cadmium was added to the nutrient solution on 9 April. In this experiment, the cadmium concentrations were 25, 50, and 100 mg/L, and a cadmium-free treatment (0 mg/L) was used as the control (CK). Each treatment consisted of 300 plants, with three replicates per treatment. There were 4 NFT experimental areas, each with an independent water and fertilizer supply system, using a timed irrigation mode (each irrigation for 10 min and an interval of 15 min). The nutrient solution in the reservoir was replaced once every Monday. After 7 days of treatment, pak choi samples were collected to determine their chlorophyll content, cadmium content, and chlorophyll fluorescence parameters, as well as their hyperspectral reflectance. For each treatment, 40 samples were collected, totalling 160 samples.
2.2. Determination of Chlorophyll and Cadmium Contents
Chlorophyll was extracted using the acetone–ethanol mixed solution method [27]. The cadmium content was determined by digestion with a nitric–perchloric acid mixture (4:1, v/v) and measured using an atomic absorption spectrophotometer (AAS, PinAAcle 900F, PerkinElmer, Waltham, MA, USA). Each sample was tested in triplicate, and the final result was taken as the average of the three replicates.
2.3. Measurement of Chlorophyll Fluorescence Data
A chlorophyll fluorescence imager (FC800-D, Photon Systems Instruments PSI, Drásov, Czech Republic) was used. Data were acquired in an indoor dark chamber, with key parameters set as follows: the saturating pulse value was 5250 μmol/(m²·s), the actinic light 2 value was 422 μmol/(m²·s), and the object distance was 30 cm. After dark adaptation of the sample for 30 min, image data acquisition was performed. Upon completion, the following parameters were obtained: initial fluorescence value (F0), maximum PSII quantum yield (Fv/Fm), effective PSII quantum yield (Fv′/Fm′), actual PSII quantum efficiency (ΦPSII), nonphotochemical quenching coefficient (NPQ), and photochemical quenching coefficient (qP). Each treatment was performed in triplicate.
2.4. Hyperspectral Imaging Data Acquisition
In this study, a visible–near-infrared hyperspectral imaging system (FX10, SPECIM, Oulu, Finland) was used. The system consisted of a hyperspectral imager, an imaging lens, a 500 W halogen lamp, a 99% reflectivity white reference panel, an industrial touchscreen tablet computer, and an electric control translation stage (as shown in Figure 1). The spectral range collected was 397.66–1003.81 nm. The key parameters were as follows: the moving speed of the electric control translation stage was 2 mm/s, the exposure time was 10 ms, and the object distance was 40 cm. The technical roadmap is presented in Figure 2. Before data acquisition, the hyperspectral system was preheated for 30 min to achieve a stable operating state. Subsequently, a white calibration panel with 99% reflectivity and the sample to be tested were fixed side by side at the same sample acquisition station, ensuring that both fully covered the system’s field of view under uniform lighting conditions. After the sample acquisition program was initiated, the system automatically captured and stored dark current images according to the preset workflow and synchronously acquired an integrated spectral image containing both the white calibration panel and the sample area. Upon completion of the acquisition, reflectance correction of the sample spectral image was performed in ENVI 4.8 software: the white panel area was extracted from the integrated image as the white reference, and the stored dark reference data were invoked. The correction formula is given in Equation (1).
$$R = \frac{I_{\mathrm{raw}} - I_{\mathrm{dark}}}{I_{\mathrm{white}} - I_{\mathrm{dark}}} \tag{1}$$
where $R$ represents the corrected sample spectral image, $I_{\mathrm{raw}}$ denotes the original sample hyperspectral image, $I_{\mathrm{white}}$ indicates the white calibration image, and $I_{\mathrm{dark}}$ represents the dark reference image.
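As a minimal sketch of this black-and-white correction, the snippet below subtracts the dark reference from both the raw image and the white reference and takes their ratio, which is the usual form of the calculation. The function name `calibrate_reflectance`, the `eps` guard, and the toy numbers are illustrative assumptions; the study itself performed this step in ENVI.

```python
import numpy as np


def calibrate_reflectance(raw, white, dark, eps=1e-9):
    """Black-and-white correction: (raw - dark) / (white - dark).

    raw   : array (rows, cols, bands) of raw sample intensities
    white : array broadcastable to `raw`, white-panel reference
    dark  : array broadcastable to `raw`, dark-current reference
    eps   : hypothetical guard against division by zero
    """
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    span = np.maximum(white - dark, eps)  # avoid dividing by zero
    return (raw - dark) / span


# Toy example: a 1 x 2 pixel image with 3 bands (made-up numbers).
dark_ref = np.array([10.0, 10.0, 10.0])
white_ref = np.array([110.0, 210.0, 410.0])
raw_cube = np.array([[[60.0, 110.0, 210.0],
                      [110.0, 210.0, 410.0]]])
reflectance = calibrate_reflectance(raw_cube, white_ref, dark_ref)
```

In this toy case the first pixel sits halfway between the dark and white references in every band, and the second pixel equals the white reference.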
2.5. Hyperspectral Data Analysis and Model Evaluation
2.5.1. Hyperspectral Data Preprocessing
The region of interest (ROI) corresponding to the canopy of each sample was delineated using ENVI 5.3 software to extract spectral information. The average spectral data of all pixel points within the ROI was adopted as the representative spectral feature of the sample. Given the significant noise interference in the spectral range of 803.1–1003.81 nm, only the reflectance spectral data of 151 bands within the range of 397.66–800.34 nm for each sample were selected for subsequent modelling and analysis.
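The ROI averaging and band truncation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `roi_mean_spectrum`, the boolean-mask representation of the ROI, and the toy wavelength grid are hypothetical, and the study delineated its ROIs interactively in ENVI.

```python
import numpy as np


def roi_mean_spectrum(cube, roi_mask, wavelengths, max_nm=800.34):
    """Average the spectra of all ROI pixels and drop bands above `max_nm`.

    cube        : array (rows, cols, bands) of reflectance values
    roi_mask    : boolean array (rows, cols), True for canopy pixels
    wavelengths : array (bands,) of band centres in nm
    """
    cube = np.asarray(cube, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    wavelengths = np.asarray(wavelengths, dtype=float)
    mean_spectrum = cube[roi_mask].mean(axis=0)  # average over ROI pixels
    keep = wavelengths <= max_nm                 # discard the noisy long-wave bands
    return wavelengths[keep], mean_spectrum[keep]


# Toy cube: 2 x 2 pixels, 5 bands on a made-up wavelength grid.
wl = np.array([400.0, 600.0, 800.0, 900.0, 1000.0])
cube = np.arange(20, dtype=float).reshape(2, 2, 5)
mask = np.array([[True, False],
                 [False, True]])
kept_wl, spectrum = roi_mean_spectrum(cube, mask, wl)
```

Applied to the real sensor grid, the same cut-off would retain the 151 bands between 397.66 and 800.34 nm reported above.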
To reduce the effect of noise and other interference factors, this study used four methods for preprocessing the original spectral data: first derivative (FD), second derivative (SD), multiplicative scatter correction (MSC), and normalization (Nor). FD and SD are important mathematical tools for analysing the rate of change of spectral signals. FD is mainly used to eliminate baseline drift and background interference, enhance the parts of the spectrum that change rapidly, separate overlapping absorption peaks, and increase spectral resolution. SD can further increase the resolution of spectral details, allowing clearer identification of the position and shape of weak absorption peaks [28]. MSC is primarily used to eliminate scattering interference caused by uneven particle size, surface roughness, or differences in optical path length [29]. Nor rescales the spectral data to a comparable scale or distribution, which eliminates differences in units across feature dimensions, balances the weight of each feature during analysis, and improves the performance of subsequent models [30].
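The four preprocessing steps can be sketched in NumPy as below. The paper does not specify its exact implementations, so several choices here are assumptions: derivatives are taken by finite differences with `np.gradient` (a Savitzky-Golay filter is a common alternative), MSC uses the mean spectrum as the reference, and normalization is min-max scaling of each spectrum. All function names are illustrative.

```python
import numpy as np


def first_derivative(spectra, wavelengths):
    """FD: finite-difference derivative along the band axis."""
    return np.gradient(np.asarray(spectra, dtype=float), wavelengths, axis=1)


def second_derivative(spectra, wavelengths):
    """SD: derivative of the first derivative."""
    return np.gradient(first_derivative(spectra, wavelengths), wavelengths, axis=1)


def msc(spectra, reference=None):
    """MSC: regress each spectrum on a reference, then remove offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # s ~ slope * ref + intercept
        corrected[i] = (s - intercept) / slope
    return corrected


def minmax_normalize(spectra):
    """Nor (assumed min-max): scale each spectrum to the range [0, 1]."""
    spectra = np.asarray(spectra, dtype=float)
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    return (spectra - lo) / (hi - lo)


# Toy data: three spectra that differ only by a scale and an offset.
wl = np.linspace(400.0, 800.0, 5)
base = np.array([0.1, 0.4, 0.3, 0.8, 0.5])
toy = np.vstack([base, 2.0 * base + 0.1, 0.5 * base - 0.05])
toy_msc = msc(toy)
toy_nor = minmax_normalize(toy)
```

Because the toy spectra are affine copies of one another, MSC collapses them onto the mean spectrum, which illustrates how it removes multiplicative and additive scatter effects.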
2.5.2. Feature Wavelength Selection Algorithms
Hyperspectral full-band data often suffer from information redundancy and multicollinearity. Feature wavelength selection can effectively reduce data dimensionality, eliminate redundant information, and improve the generalizability of the model. This study employed three algorithms for feature wavelength screening: competitive adaptive reweighted sampling (CARS), the successive projections algorithm (SPA), and random frog (RF). CARS simulates a “biological evolution” process, adaptively reweighting and selecting spectral bands and gradually eliminating redundant and unimportant ones. Its principle is to use Monte Carlo sampling and an exponential decay function to adaptively adjust the selection probability of each band, ultimately selecting the band combination that contributes the most to the modelling performance [31]. SPA is a forward variable selection method that identifies the most informative variables through iterative projection, retaining the main information and eliminating redundancy [32]. RF is an efficient feature selection method for high-dimensional data. It is based on sequential random sampling and probability statistics: it generates different feature subsets over multiple iterations and uses the frequency with which each band is selected as a measure of its importance [33].
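To make the projection step of SPA concrete, the following is a simplified sketch of its core loop: starting from one band, every band is repeatedly projected onto the subspace orthogonal to the last selected band, and the band with the largest remaining norm is chosen next. The function name `spa_select`, the fixed starting band, and the fixed number of bands are assumptions for illustration; a full SPA implementation also evaluates candidate starting bands and subset sizes by validation error, which is omitted here.

```python
import numpy as np


def spa_select(X, n_select, start=0):
    """Return indices of `n_select` bands chosen by successive projections.

    X        : array (samples, bands) of spectra
    n_select : number of bands to keep (hypothetical fixed value)
    start    : index of the initial band (hypothetical fixed value)
    """
    work = np.asarray(X, dtype=float)
    work = work - work.mean(axis=0)  # centre each band
    selected = [start]
    for _ in range(n_select - 1):
        ref = work[:, selected[-1]]
        denom = ref @ ref
        if denom == 0.0:
            break  # remaining bands carry no independent information
        # Remove from every band its component along the last selected band.
        work = work - np.outer(ref, ref @ work) / denom
        norms = np.linalg.norm(work, axis=0)
        norms[selected] = -1.0  # never re-select a band
        selected.append(int(np.argmax(norms)))
    return selected


# Toy data: band 1 is an exact multiple of band 0, so it is redundant.
rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 50))
X_toy = np.column_stack([a, 2.0 * a, b, c])
chosen = spa_select(X_toy, n_select=3, start=0)
```

In the toy example the collinear band is skipped, because its projection onto the complement of band 0 has essentially zero norm.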
2.5.3. Modelling Algorithms
Partial least squares regression (PLSR), random forest regression (RFR), the Elman neural network (Elman NN), and the bidirectional long short-term memory network (BiLSTM) were used to predict the cadmium content. PLSR is a common statistical method for regression analysis and modelling. When data exhibit multicollinearity, PLSR extracts latent variables (also referred to as components or principal components) and projects the independent and dependent variables into a new low-dimensional space. These latent variables are the components most correlated between the independent and dependent variables. The main goal of PLSR is to maximize the covariance between independent and dependent variables, establishing a regression relationship between them [34]. In this study, the optimal parameters of the PLSR model were determined via grid search combined with ten-fold cross-validation, with the search range of the number of principal components specified as 1 to 20. RFR is an ensemble learning algorithm that builds multiple decision trees and combines their predictions to perform regression tasks. The random forest is a powerful regression model with strong generalizability and robustness [35]. The core hyperparameter settings for the RFR model in this study were as follows: the number of decision trees was set to 100; the minimum number of samples per leaf node was set to 3; and the mean squared error (MSE) was adopted as the splitting criterion. The Elman NN is an improved recurrent neural network based on the BP neural network and is a powerful and widely used neural network model [36]. The core configurations of the Elman NN were as follows: (1) Network architecture: the number of neurons in the input layer was equal to the dimension of the input features in the dataset; the hidden layer was set to 10 neurons (the number of neurons in the context layer was the same as in the hidden layer, following the default feedback structure of the Elman NN); and the output layer contained 1 neuron. (2) Activation functions: the hidden layer adopted tanh (the hyperbolic tangent function), and the output layer used purelin (a linear function). LSTM is an improvement over recurrent neural networks (RNNs), and BiLSTM is an improvement over LSTM that includes both forward and backward LSTM layers. BiLSTM supports bidirectional feature learning, which can better identify correlations among multivariate regression data features [37]. For the BiLSTM model, the core parameters were configured as follows: 4 hidden layer units, a maximum of 1000 training epochs, and a batch size of 128. The Adam algorithm was selected as the optimizer, with an initial learning rate of 0.01. With respect to dataset division, the 160 samples were divided at a ratio of 80%:20% into a training set of 128 samples and a prediction set of 32 samples.
2.5.4. Model Evaluation Methods
To validate the reliability of the established models, the determination coefficient ($R^2$) and root mean square error (RMSE) were used to evaluate model accuracy. Both were calculated separately for the training set ($R_c^2$ and RMSEC) and for the prediction set ($R_p^2$ and RMSEP). $R^2$ generally ranges from 0 to 1. When both $R_c^2$ and $R_p^2$ are high with a small difference between them, the constructed model has both good fitting ability and good generalization performance. If $R_c^2$ is relatively high while $R_p^2$ is significantly lower, the model suffers from overfitting. If both $R_c^2$ and $R_p^2$ are low, the model is underfitting. The RMSE measures the magnitude of the prediction error of the model; a smaller value indicates higher prediction accuracy [38].
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $n$ represents the number of samples in the dataset, $y_i$ denotes the measured value of the sample, $\hat{y}_i$ denotes the predicted value of the sample, and $\bar{y}$ denotes the average of all measured values in the dataset.
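As a small worked illustration, the two metrics can be computed as below using their standard definitions: the determination coefficient as one minus the ratio of the residual to the total sum of squares, and the RMSE as the square root of the mean squared residual. The function names and the toy values are illustrative and are not taken from the study's data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Determination coefficient: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy measured and predicted values (made-up numbers).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
toy_r2 = r_squared(measured, predicted)   # 1 - 0.10 / 5.0 = 0.98
toy_rmse = rmse(measured, predicted)      # sqrt(0.10 / 4), about 0.158
```

Evaluating the same functions on the training and prediction sets separately gives the calibration and prediction variants of each metric.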
4. Discussion
In this study, pak choi was somewhat tolerant and could still grow and develop under low cadmium stress, whereas high cadmium stress significantly inhibited its growth, and the degree of inhibition intensified with increasing stress concentration. The cadmium content in the shoots of pak choi increased significantly with increasing cadmium concentration, which is consistent with the results of Chang Pengyan et al. [39]. After 7 days of cadmium stress, the total chlorophyll content of pak choi decreased significantly under the cadmium treatments, but the differences among the cadmium treatments were not significant. In addition to NPQ, the other chlorophyll fluorescence parameters did not significantly change. The response of the plant to cadmium stress has a time lag; 7 days may not be sufficient for cadmium to accumulate in the plant to a level that significantly affects the photosynthetic system, or the defence mechanisms of the plant may offset the adverse effects of cadmium during the initial stage, so that a longer stress duration is required to reveal significant parameter changes. Alternatively, pak choi may have a specific tolerance to cadmium; under short-term cadmium stress, its photosynthetic system can remain relatively stable, which is similar to the results of Wang Tao et al. [40]. As the cadmium treatment time and concentration increased, the photosynthesis of pak choi was more severely inhibited. Given these physiological response characteristics, hyperspectral modelling is needed to predict cadmium content. Detection methods based on traditional physiological indicators struggle to distinguish among treatments under early cadmium stress, whereas hyperspectral technology can capture subtle biochemical and structural changes that cannot be identified by conventional methods, forming characteristic spectral responses. According to previous studies on pak choi, rapeseed, flue-cured tobacco, and other crops, the spectral characteristics of leaves are strongly correlated with the leaf cadmium content [41,42,43]. Therefore, in this study, regression models were constructed using the selected feature bands as independent variables and the leaf cadmium content as the dependent variable. By comparing different models, the optimal modelling method for determining the pak choi leaf cadmium content was identified.
In this study, hyperspectral imaging was used to obtain spectral data from pak choi, and the cadmium content in the leaves was measured. FD, SD, MSC, and Nor algorithms were used to preprocess the collected original hyperspectral data. CARS, SPA, and RF algorithms were used to screen feature bands from the preprocessed hyperspectral data. The PLSR, RFR, Elman, and BiLSTM algorithms were used to construct prediction models for the pak choi leaf cadmium content. The results indicated that when FD and SD preprocessing were used, and feature wavelengths were screened through the CARS and RF algorithms, the accuracy of the constructed models exceeded 0.9. FD-RF screened 24.5% of the feature bands, SD-RF selected 15.23% of the feature bands, and SD-CARS retained 19.21% of the original bands. The FD–RF–BiLSTM model had the highest accuracy, indicating that FD preprocessing combined with the RF algorithm is the optimal method for feature wavelength extraction for pak choi leaf cadmium content and that BiLSTM is the optimal prediction model for pak choi leaf cadmium content. This method breaks through the reliance of traditional hyperspectral detection on distinct changes in physiological indicators. Even at the early exposure stage, where chlorophyll fluorescence and pigment indices only exhibit minimal changes, it can still successfully achieve accurate prediction of cadmium content in pak choi, thus demonstrating genuine sub-visual detection capability.
It should be noted that the total sample size of 160 in this study was relatively limited. Although interfering variables were strictly controlled using the NFT hydroponic system, measurement errors were minimized by triplicate determinations, and the core model exhibited excellent stability ($R_p^2$ = 0.913, RMSEP = 0.032), the small sample size may still restrict the generalization ability of the model to pak choi of different genotypes and to complex field environments. It also hinders the adequate capture of subtle heterogeneity in spectral responses under cadmium stress, which is one of the main limitations of this study. In addition, all models in this study were constructed and validated under controlled conditions (a single variety, an NFT hydroponic system, and a stable temperature, humidity, and light environment). Although they exhibit high prediction accuracy, systematic external validation is still required before they are applied in actual production environments. First, this study used only ‘Huaguan Qinggengcai’ as the test variety. However, different genotypes of pak choi vary in leaf structure and cadmium enrichment capacity, which may lead to spectral response heterogeneity. Thus, the range of varieties needs to be expanded to verify the generality of the models. Second, various cultivation modes, such as soil cultivation and substrate cultivation, exist in actual production. The chemical form of cadmium and the crop absorption environment in these systems differ significantly from those under hydroponic conditions, so the adaptability of the models to different production systems needs to be verified. Third, environmental factors such as light, temperature, and humidity fluctuate dynamically in open-field or greenhouse cultivation, which may interfere with leaf spectral characteristics. Therefore, the stability of the models needs to be tested under multiple environmental conditions. Finally, cadmium pollution in actual farmland mostly occurs at low concentrations or in combination with other heavy metals, which differs from the medium-to-high concentration, single-cadmium stress scenario set in this study. Additional samples are needed to verify the detection sensitivity of the models in complex pollution scenarios.
External validation across different cultivars, production systems, environmental conditions, and contamination scenarios, combined with an expanded sample size, could clarify the applicable boundaries of the model. This will provide a solid basis for subsequent optimization measures such as the introduction of cultivar correction coefficients and environmental factor compensation modules, thereby improving the practical application value of the technology and facilitating the large-scale application of hyperspectral non-destructive detection technology in the screening of heavy metal contamination in vegetables.