Predicting Models for Plant Metabolites Based on PLSR, AdaBoost, XGBoost, and LightGBM Algorithms Using Hyperspectral Imaging of Brassica juncea

Abstract: The integration of hyperspectral imaging with machine learning algorithms is a promising strategy for the non-invasive and rapid detection of plant metabolites. In this study, we developed prediction models using partial least squares regression (PLSR) and boosting algorithms (AdaBoost, XGBoost, and LightGBM) for five metabolites in Brassica juncea leaves: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins. To enhance model performance, we employed several spectral data preprocessing methods and feature-selection algorithms. The boosting algorithms generally outperformed the PLSR models in terms of prediction accuracy. In particular, the LightGBM model for chlorophyll and the AdaBoost model for flavonoids improved the prediction performance, with R²p = 0.71–0.74, compared to the PLSR models (R²p = 0.53–0.58). The final models for the glucosinolates and anthocyanins performed sufficiently for practical uses such as screening, with R²p = 0.82–0.85 and RPD = 2.4–2.6. Our findings indicate that applying a single preprocessing method is more effective than combining multiple techniques. Additionally, the boosting algorithms with feature selection exhibited superior performance compared to the PLSR models in the majority of cases. These results highlight the potential of hyperspectral imaging and machine learning algorithms for the non-destructive and rapid detection of plant metabolites, which could have significant implications for the field of smart agriculture.


Introduction
Mustard (Brassica juncea), also commonly known as Chinese mustard, brown mustard, leaf mustard, vegetable mustard, and oriental mustard, is an annual plant that belongs to the Brassicaceae family [1]. Mustard contains bioactive components such as glucosinolates and their degradation products; polyphenols (flavonoids and anthocyanins); and large amounts of dietary fiber, chlorophyll, β-carotene, ascorbic acid, minerals, and volatile components [2]. Mustard is used as a spice because of its pungent taste. It also has important uses in medicine; its leaves are used as a diuretic, stimulant, and expectorant in folk medicine. Previous studies have found that B. juncea has bactericidal properties, can reduce the risk of atherosclerosis, and has antioxidant and peroxynitrite-scavenging effects [1,3]. Additionally, B. juncea has exhibited antibacterial and antitumor properties and has been shown to improve various metabolic disorders [4]. Despite its excellent bioactivity, its industrial use as a raw material for medicine is limited by traditional analytical techniques, which are time-consuming and destructive. Therefore, a non-destructive method of determining bioactive compound contents should be developed for quality control in the production stage.
Recently, hyperspectral imaging has been used for the assessment of the biophysical traits of plants. Spectral information from hyperspectral images can be combined with various data processing and mining tools to enable fast, non-destructive, and highly accurate detection of functional component contents [5]. Preprocessing of spectral data is an important step for suppressing the undesired effects of measurement conditions and enhancing relevant features; common methods include normalization, derivatives, and smoothing [6]. Partial least squares regression (PLSR) is a widely used method to analyze large amounts of hyperspectral data and predict functional components in plants, such as chlorophyll and carotenoids in spinach [7] or total polyphenols in cocoa beans [8]. In our previous study, we aimed to develop a predictive model for the functional components of mustard plants using a PLSR prediction model based on hyperspectral images and preprocessing techniques [9]. In that study, we found that combining the SNV transformation with the first derivative (1st Der) resulted in high-performance prediction models for the total chlorophyll, carotenoid, and glucosinolate contents, while combining the Savitzky-Golay (SG) filter with the SNV transformation gave the best prediction performance for the total phenolics. However, the accuracy of this model was limited because the amount of data was relatively small and it was only applied in an indoor environment.
Machine learning techniques, combined with hyperspectral imaging, have been extensively used for the determination of food quality [10], such as identifying contaminants in food [11]. Among them, boosting methods in ensemble learning are attracting attention for their outstanding performance in data analysis. Boosting algorithms, such as adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and the light gradient-boosting machine (LightGBM), have performed well in hyperspectral imaging-based data classification tasks [12,13]. Effective training of machine learning models usually requires abundant data for a more accurate predictive model [14]. To train the model and improve on the accuracy of the PLSR prediction model for functional components such as chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins in mustard plants, we first acquired more hyperspectral imaging data of plant leaves. For this study, we aimed to develop a model with excellent predictive performance by adding enough training data to apply boosting algorithms and by applying a combination of data processing methods. This analysis has expanded upon the previous study by including the prediction of the total phenolic components, which was not previously considered. However, the prediction of the total carotenoids was excluded from the current study due to its poor performance in the previous study. To apply the developed model and predict functional components in the growing environment, hyperspectral images were measured from various angles.

Training Data Acquisition
The plant growth conditions and analysis methods used for this study were the same as those described in detail in the previous study [9]. Briefly, mustard plants (B. juncea L. Czern.) were cultivated in three different environments. Plants in an indoor farm were hydroponically grown under mixed LEDs and with Hoagland nutrient solution. Plants in a greenhouse and an open field were grown in pots filled with commercial soil and fertilizer. Fifteen plants from each cultivation environment were harvested over the 4 weeks after transplanting to ensure variation in growth stage and leaf color. A total of 122 fully expanded leaves were collected for analysis.
As with the experimental setup in the previous study [9], the hyperspectral imaging system consisted of a hyperspectral imaging camera (MicroHSI 410 SHARK; Corning Inc., Corning, NY, USA) and eight 15 W halogen lamps. A total of 112 hyperspectral images were acquired, with 1408 spatial pixels and 150 spectral bands in the range of 400-1000 nm. After the hyperspectral imaging data were obtained, the leaves were freeze-dried for 4 days and powdered for component analysis [9,15]. The freeze-dried powder was divided into three 20 mg replicates and used for the analysis of the five functional components. Previously reported methods were used for the determination of the total chlorophyll content [16], total phenolic content [17], total flavonoid content [18], total glucosinolate content [19], and total anthocyanin content [20]. Replicate values with a high degree of variation were excluded, and the mean of the remaining values was used as the component value for model development.

Data Processing and Prediction Models
The average spectrum was extracted from each hyperspectral image within predefined regions of interest, as in the previous method [9], yielding average spectral data of 150 bands for each of the 112 hyperspectral images. The preprocessing methods, used alone or in combination (Table S1), included normalization, logarithmic transformation, a Savitzky-Golay (SG) filter, the 1st and 2nd derivatives after SG filtering, multiplicative scatter correction (MSC), and standard normal variate (SNV) transformation. The SG filter was applied with a third-order polynomial fit over a five-point window, using the SciPy package in Python 3.9. A total of 36 preprocessing combinations were used to prepare the spectral data for the development of the predictive models.
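As a minimal sketch of the preprocessing step described above, the following Python snippet applies a Savitzky-Golay filter (third-order polynomial, five-point window), its first and second derivatives, and an SNV transformation using SciPy and NumPy. The function names and the synthetic `spectra` array are illustrative assumptions; only the filter settings and the 112 × 150 data shape come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def sg_filter(spectra, window=5, polyorder=3, deriv=0):
    """Savitzky-Golay filter along the band axis (rows = samples, columns = bands)."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)


def snv(spectra):
    """Standard normal variate: center and scale each spectrum by its own mean and SD."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


# Synthetic stand-in for the 112 x 150 matrix of mean leaf spectra (assumed values).
rng = np.random.default_rng(0)
spectra = rng.uniform(0.05, 0.6, size=(112, 150))

smoothed = sg_filter(spectra)              # SG smoothing only
first_der = sg_filter(spectra, deriv=1)    # 1st derivative after SG filtering
second_der = sg_filter(spectra, deriv=2)   # 2nd derivative after SG filtering
snv_spectra = snv(spectra)                 # SNV only
```

Combinations of methods can be expressed by chaining these calls, for example `sg_filter(snv(spectra), deriv=1)`; the order of operations within each of the 36 combinations is not specified here and would need to follow Table S1.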
Partial least squares regression (PLSR), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and light gradient-boosting machine (LightGBM) algorithms were applied to predict the content of each metabolite in the plants. PLSR is a method that is commonly used to predict metabolite content from hyperspectral data. It works by extracting latent variables (LVs), which are linear combinations of the original predictor variables that capture the maximum covariance between the predictors and the response. The number of LVs is chosen based on the optimal performance of the relevant model, which is typically determined through cross-validation.
AdaBoost, XGBoost, and LightGBM are boosting algorithms that are also commonly used for regression tasks [21][22][23]. Boosting algorithms combine multiple weak learners (e.g., decision trees) into a strong learner, which improves the accuracy of predictions. In this study, boosting algorithms were used for both feature selection and regression.
To reduce redundant information in the hyperspectral data, feature selection based on the feature importance from the boosting algorithms was used. Only bands with a feature importance value greater than 1.25 times the average value were selected. Model development was implemented using the Scikit-learn, XGBoost, and LightGBM packages in Python 3.9.
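The band-selection rule above (importance greater than 1.25 times the mean importance) can be illustrated with a short sketch. It uses scikit-learn's `AdaBoostRegressor` as the selecting algorithm; the helper name `select_bands`, the number of estimators, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor


def select_bands(X, y, factor=1.25, seed=0):
    """Keep bands whose importance exceeds `factor` times the mean importance."""
    model = AdaBoostRegressor(n_estimators=50, random_state=seed).fit(X, y)
    importance = model.feature_importances_
    mask = importance > factor * importance.mean()
    return np.flatnonzero(mask), importance


# Synthetic example: only bands 5 and 20 carry signal (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 40))
y = 3.0 * X[:, 5] - 2.0 * X[:, 20] + rng.normal(scale=0.1, size=90)

selected, importance = select_bands(X, y)
X_reduced = X[:, selected]  # reduced band set passed on to the regression model
```

The same thresholding applies to any model that exposes a `feature_importances_` attribute, so the selecting algorithm and the final regression algorithm can differ.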
The preprocessing and feature-selection methods were selected, and the hyperparameters of all of the algorithms were optimized, using tenfold cross-validation on the training dataset, which corresponded to 80% of the data. After hyperparameter tuning, the performance of the final model was tested with an independent validation dataset that corresponded to 20% of the data. The model performance was evaluated based on the coefficient of determination (R²) and the root mean square error (RMSE), as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (1)$$

where $y_i$ is the measured value of the component analysis; $\hat{y}_i$ is the value predicted by the model; $\bar{y}$ is the mean value of the component analysis; and $n$ is the number of samples.
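For concreteness, the two evaluation metrics named above can be computed with a few lines of NumPy. The sketch also includes the ratio of performance to deviation (RPD) reported in the abstract; its definition here (sample standard deviation of the measured values divided by the RMSE) is a commonly used convention and an assumption, since the text does not define it. The function names and example values are illustrative.

```python
import numpy as np


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean square error between measured and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def rpd(y, y_hat):
    """Assumed definition: sample SD of the measured values divided by the RMSE."""
    return float(np.std(np.asarray(y, float), ddof=1) / rmse(y, y_hat))


# Small made-up example (values are not from the study).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
scores = {"R2": r2(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "RPD": rpd(y_true, y_pred)}
```

Applied to the calibration, cross-validation, and held-out predictions in turn, `rmse` gives the RMSEC, RMSECV, and RMSEP values, respectively.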

Development of a Prediction Model Based on Hyperspectral Imaging with the PLSR, AdaBoost, XGBoost, and LightGBM Algorithms
The total chlorophyll, phenolic, flavonoid, glucosinolate, and anthocyanin contents in the B. juncea plants are summarized in Table 1. The reflectance spectra of 112 leaves were obtained by averaging the hyperspectral data, followed by preprocessing, as shown in Figure 1. In the spectra of the B. juncea leaves, reflectance in the green and red regions was relatively low and high, respectively, compared to that of a typical green leaf. These differences could be caused by the absolute contents and ratio of chlorophyll and anthocyanin in the leaf [24]. The plant used in this study was a red mustard cultivar with purple-green leaves and a high anthocyanin content (Table 1).
PLSR models for five metabolites were developed using 36 preprocessing methods (Table S1). The optimal combination of preprocessing methods for each of the five PLSR models, as well as the optimal number of latent variables (LVs) for each component, was determined based on the lowest root mean square error of cross-validation (RMSECV), as shown in Table 2.
Spectral preprocessing is an essential step for avoiding undesirable scattering effects and revealing signals that correspond to chemical components [25]. The appropriate preprocessing method depends on various factors, including the wavelength range and interval, the prediction model, the target compound, and the plant organs used, such as leaves and fruits. In previous studies, the performance of the PLSR model in detecting total phenolic content using a VIS-NIR hyperspectral imaging system was improved with the normalization method in apple fruits [26] and with the SG filter and derivative transformation in Arabidopsis leaves [27]. Derivative transforms emphasize spectral features but also amplify noise in the data. The first and second derivatives remove an additive and a linear baseline, respectively. Logarithmic transformation can be employed to address a non-linear problem. In a previous study using a SWIR hyperspectral imaging system, the logarithmic transformation Log (1/R) improved the performance of the PLSR model for the ABA content in zucchini leaves [28]. MSC and SNV transformation are useful in reducing spectral variability due to scattering and baseline shifts. To further improve model performance, spectral preprocessing methods can be used in combination [9]. In this study, the prediction performance of the PLSR model was higher with the single preprocessing methods than with combinations of multiple methods (Table 2).
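The logarithmic and scatter-correction transforms discussed above can be sketched in NumPy as follows. The base-10 logarithm, the clipping constant `eps`, the use of the mean spectrum as the MSC reference, and the synthetic spectra are all assumptions for illustration; the text specifies only the form Log (1/R).

```python
import numpy as np


def log_inverse_reflectance(reflectance, eps=1e-6):
    """Log (1/R) transform; base 10 and clipping at `eps` are assumed choices."""
    return np.log10(1.0 / np.clip(reflectance, eps, None))


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum ~ slope * ref + intercept, then remove both effects.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected


# Synthetic spectra that differ only by a scale factor and an offset (assumed values).
rng = np.random.default_rng(1)
base = np.linspace(0.1, 0.6, 150) ** 2
scale = rng.uniform(0.8, 1.2, size=(10, 1))
offset = rng.uniform(-0.02, 0.02, size=(10, 1))
spectra = scale * base + offset

corrected = msc(spectra)
absorbance = log_inverse_reflectance(np.array([0.1, 0.5, 1.0]))
```

In this idealized example the spectra differ only by multiplicative and additive effects, so MSC collapses them onto the reference spectrum; real leaf spectra retain the chemical variation that the regression model then uses.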
The AdaBoost, XGBoost, and LightGBM prediction models were also developed using 36 preprocessing methods. The best preprocessing method for each algorithm and metabolite was determined based on the lowest RMSECV values (Table S2). After that, the prediction models were compared according to feature (band) selection based on the feature importance from the three boosting algorithms (Table 3). For several models, the reduced set of spectral bands gave better performance than the full set of bands. Hyperspectral data require band selection due to the large amount of highly correlated and redundant information. Reducing the number of features, even to less than 20% of the total bands, can enhance the performance of a regression algorithm [28,29]. Combinations of different feature-selection and regression algorithms can improve model accuracy [30].
The importance values for selecting the features of each best-performing model are given in Figure 2. For chlorophyll prediction, the highest importance was at 480.57 nm, followed by 916.83 nm, among 17 bands selected based on the XGBoost algorithm with 1st Der processing data. For phenolic prediction, the feature importance was the highest at 904.82 nm, followed by 760.73 nm, among 28 bands selected based on the LightGBM algorithm with Norm processing data. The selected features were distributed in the ranges of 488.57-544.61 nm and 672.68-992.87 nm, respectively. For flavonoid prediction, the feature importance was the highest at 692.69 nm, followed by 608.64 nm, among 33 bands selected based on the AdaBoost algorithm with 2nd Der processing data. The feature importance for glucosinolate prediction was concentrated in the range of 870-900 nm. The highest importance values were, in order, at 872.80, 896.81, 880.8, and 628.66 nm among 28 bands selected based on the AdaBoost algorithm with SNV processing data. For anthocyanin prediction, the feature importance was the highest at 924.83 nm among 37 bands selected with the LightGBM algorithm with Log (1/R), 1st Der, and MSC processing data. The features were selected depending on the spectral data preprocessing method as well as the selection algorithm.
Overall, the boosting algorithms showed better prediction performances compared to the PLSR models (Tables 2 and 3). Specifically, the LightGBM model was found to be the best for predicting chlorophyll, while the AdaBoost model was the best for predicting phenolics, flavonoids, glucosinolates, and anthocyanins. The boosting models performed better than the best PLSR models, except for anthocyanins, where the PLSR models showed better performances. The performances of the best prediction models for five metabolites are given in Figure 3. The best model for chlorophyll was the 1st Der processing-XGBoost selection-LightGBM prediction model with 17 bands selected (R²P = 0.737, RMSEP = 1.052). The best model for phenolics was the Norm processing-LightGBM selection-AdaBoost prediction model with 28 bands selected (R²P = 0.594, RMSEP = 1.426). For flavonoids, the 2nd Der processing-AdaBoost selection-AdaBoost prediction model with 33 bands selected performed the best (R²P = 0.709, RMSEP = 1.417). For glucosinolates, the SNV processing-AdaBoost selection-AdaBoost prediction model with 28 bands selected was best (R²P = 0.816, RMSEP = 4.744). The best boosting model for anthocyanins was the Log (1/R), 1st Der, and MSC processing-LightGBM selection-AdaBoost prediction model with 37 bands selected.

Application of the Functional Component Prediction Model with Visualization
Prediction models based on hyperspectral imaging can be used to predict content at the single-pixel level and generate compound distribution maps. The prediction models developed here were applied to plants in a growing environment, using the spectrum of every pixel. The spatial distribution of five metabolites was found to be uneven across the leaf area (Figure 4). Yuan et al. (2021) visualized the distribution of SPAD values, which indicate chlorophyll content in pepper leaves [29]. The distribution of the total phenolics has been visualized using hyperspectral imaging and modeling in Arabidopsis plants [27] and shelled cocoa beans [8]. Hence, by employing a hyperspectral imaging system and the necessary software to run the algorithm, we could non-destructively and continuously monitor the compound distribution. This phytochemical monitoring will aid in making cultivation decisions to effectively control the quality of functional plants.
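Pixel-level prediction of this kind amounts to reshaping the hyperspectral cube into a table of per-pixel spectra, predicting each row, and reshaping the result back into an image. The sketch below illustrates this with NumPy; `LinearBandModel` is a hypothetical stand-in for a trained regressor with a `predict()` method, and the cube dimensions, leaf mask, and weights are assumed values.

```python
import numpy as np


class LinearBandModel:
    """Hypothetical stand-in for any trained regressor exposing predict()."""

    def __init__(self, weights, intercept=0.0):
        self.weights = np.asarray(weights, dtype=float)
        self.intercept = float(intercept)

    def predict(self, X):
        return X @ self.weights + self.intercept


def predict_map(cube, model, leaf_mask=None):
    """Predict one value per pixel from a (rows, cols, bands) hyperspectral cube."""
    rows, cols, bands = cube.shape
    pixels = cube.reshape(rows * cols, bands)           # one spectrum per row
    values = np.asarray(model.predict(pixels), dtype=float).reshape(rows, cols)
    if leaf_mask is not None:
        values = np.where(leaf_mask, values, np.nan)    # blank out background pixels
    return values


# Tiny synthetic cube: 4 x 5 pixels, 150 bands (assumed values).
rng = np.random.default_rng(0)
cube = rng.uniform(0.05, 0.6, size=(4, 5, 150))
leaf_mask = np.ones((4, 5), dtype=bool)
leaf_mask[0, 0] = False                                  # pretend one background pixel
model = LinearBandModel(weights=np.full(150, 1.0 / 150))

content_map = predict_map(cube, model, leaf_mask)
```

In practice, each pixel spectrum would need the same preprocessing and band selection that were applied to the training spectra before it is passed to the model.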

Conclusions
A prediction model using hyperspectral imaging was developed based on PLSR and boosting algorithms (AdaBoost, XGBoost, and LightGBM) to predict five metabolites in B. juncea: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins. To improve the model performance, various spectral data preprocessing methods and feature-selection algorithms were adopted. For both the PLSR and boosting models, the prediction performance was higher with the single preprocessing methods than with combinations of multiple methods. Feature selection based on boosting algorithms could improve prediction performance. The cross-validation and prediction performances were better in the boosting algorithms than in the PLSR models, except regarding anthocyanin prediction. In particular, the final models for glucosinolates and anthocyanins performed sufficiently for practical uses such as screening, with R²p = 0.82-0.85 and RPD = 2.4-2.6. This research presents a promising approach for the rapid and accurate prediction of metabolites in plants using hyperspectral imaging, which can contribute to the development of precision agriculture and plant breeding.
Overall, our results showed that boosting algorithms can be applied to predict the functional components of medicinal plants. Many studies have compared spectral data preprocessing methods and tried to improve prediction performance. We have confirmed that prediction performance can be improved by reducing spectral bands with a feature-selection algorithm. To develop faster and more accurate prediction techniques, it is necessary to continuously introduce the latest algorithms and data processing methods. Based on hyperspectral images, non-destructive monitoring techniques of functional components can be used as tools for quality control in the field of smart agriculture, including in the medicinal plant industry.

Figure 1. Single preprocessing methods for hyperspectral data of B. juncea plants: raw reflectance (A), normalization (B), logarithmic transformation (C), Savitzky-Golay filter (D), first and second derivative after SG filtering (E,F), multiplicative scatter correction (G), and standard normal variate transformation (H). Colored lines represent different leaf samples. For the combinations of preprocessing methods, refer to Table S1.

Figure 2. Feature importance values used to determine the best prediction models for total chlorophyll (A), phenolics (B), flavonoids (C), glucosinolates (D), and anthocyanins (E) in B. juncea plants. Orange bars represent selected features, and light blue bars represent unselected features, i.e., those not used in the prediction model. The best prediction models are documented in Table 3.

Figure 3. The optimal models for predicting the concentrations of the total chlorophyll (A), phenolics (B), flavonoids (C), glucosinolates (D), and anthocyanins (E,F) in B. juncea plants, as presented in Table 3. R²P and RMSEP indicate the coefficient of determination and root mean square error of prediction, respectively.

Figure 4. Distribution maps of five metabolites, obtained by applying hyperspectral image-based prediction in a growing environment: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins in B. juncea plants.

Funding:
This work was supported by the Korean Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) and by the Korean Smart Farm R&D Foundation (KosFarm) through the Smart Farm Innovation Technology Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA), the Ministry of Science and ICT (MSIT), and the Rural Development Administration (RDA) (421034-04).

Table 1. Statistical summary of the five components in B. juncea plants.

Table 2. Performances of five PLSR models based on best preprocessing methods and optimal latent variables (LVs) for each component of B. juncea plants.
R²: coefficient of determination; RMSEC, RMSECV, and RMSEP: root mean square errors of calibration, cross-validation, and prediction, respectively. Bold indicates the best performance based on the RMSEP for each component.

Table 3. Performance of AdaBoost, XGBoost, and LightGBM prediction models for five metabolites in B. juncea plants according to feature-selection algorithms after determination of preprocessing and hyperparameter tuning.
R²: coefficient of determination; RMSEC, RMSECV, and RMSEP: root mean square errors of calibration, cross-validation, and prediction, respectively. Bold indicates the best performance based on the RMSEP for each component.

Supplementary Materials:
Table S1: List of preprocessing methods for hyperspectral data of B. juncea plants; Table S2: Determination of preprocessing methods for AdaBoost, XGBoost, and LightGBM prediction algorithms for five metabolites in B. juncea plants.

Author Contributions:
Conceptualization, S.H.P. and S.M.K.; funding acquisition, S.M.K.; investigation, S.H.P. and H.I.Y.; methodology, H.I.Y., J.-H.C., D.-H.J., J.-E.P. and Y.J.P.; project administration, S.H.P.; software, H.I.Y. and H.L.; validation, H.I.Y. and S.H.P.; writing-original draft, H.I.Y. and S.H.P.; review and editing, H.I.Y., J.-S.Y. and S.H.P. All authors have read and agreed to the published version of the manuscript.