Using a Two-Stage Hybrid Dimensionality Reduction Method on Hyperspectral Data to Predict Chlorophyll Content of Camellia oleifera

Jiang, Xinyue; Song, Yongzhong; Sun, Zhibin; Kuang, Fan; Tang, Xuehai

doi:10.3390/f15111937


by Xinyue Jiang ¹, Yongzhong Song ¹,², Zhibin Sun ¹,²,*, Fan Kuang ³,⁴ and Xuehai Tang ³,⁴,*

¹ School of Mathematical Sciences, Nanjing Normal University, Nanjing 210023, China
² Ministry of Education Key Laboratory of NSLSCS, Nanjing Normal University, Nanjing 210023, China
³ School of Forestry and Landscape Architecture, Anhui Agricultural University, No. 130 Changjiang West Road, Hefei 230036, China
⁴ Anhui Province Key Laboratory of Forest Resources and Silviculture, No. 130 Changjiang West Road, Hefei 230036, China
* Authors to whom correspondence should be addressed.

Forests 2024, 15(11), 1937; https://doi.org/10.3390/f15111937

Submission received: 26 September 2024 / Revised: 1 November 2024 / Accepted: 2 November 2024 / Published: 4 November 2024

(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)


Abstract

Camellia oleifera is an oilseed crop that holds significant economic, ecological, and social value. In the realm of Camellia oleifera cultivation, utilizing hyperspectral analysis techniques to estimate chlorophyll content can enhance our understanding of its physiological parameters and response characteristics. However, hyperspectral datasets contain information from many wavelengths, resulting in high-dimensional data. Therefore, selecting effective wavelengths is crucial for processing hyperspectral data and modeling in retrieval studies. In this study, by using hyperspectral data and chlorophyll content from Camellia oleifera samples, three different dimensionality reduction methods (Taylor-CC, NCC, and PCC) are used in the first round of dimensionality reduction. Combined with these methods, various thresholds and dimensionality reduction methods (with/without further dimensionality reduction) are used in the second round of dimensionality reduction; different sets of core wavelengths with equal size are identified respectively. Using hyperspectral reflectance data at different sets of core wavelengths, multiple machine learning models (Lasso, ANN, and RF) are constructed to predict the chlorophyll content of Camellia oleifera. The purpose of this study is to compare the performance of various dimensionality reduction methods in conjunction with machine learning models for predicting the chlorophyll content of Camellia oleifera. Results show that (1) the Taylor-CC method can effectively select core wavelengths with high sensitivity to chlorophyll variation; (2) the two-stage hybrid dimensionality reduction methods demonstrate superiority in three models; (3) the Taylor-CC + NCC method combined with an ANN achieves the best predictive performance of chlorophyll content. 
The new two-stage dimensionality reduction method proposed in this study not only improves both the efficiency of hyperspectral data processing and the predictive accuracy of models, but can also serve as a complement to the study of Camellia oleifera properties using the Taylor-CC method.

Keywords:

Camellia oleifera; chlorophyll content; hyperspectral; dimensionality reduction; Taylor expansion; ANN

1. Introduction

Photosynthesis is the biochemical process through which most vegetation utilizes light energy to convert carbon dioxide and water into organic matter (i.e., glucose), while releasing oxygen as a byproduct [1]. In photosynthesis, chlorophyll serves as a key pigment responsible for absorbing light energy [2]. By measuring the chlorophyll content in the leaves of vegetation, it is possible to estimate the photosynthetic capacity and rate of this vegetation [3,4]. However, the chlorophyll content of vegetation may be influenced by environmental factors, such as climate change, light variations, and temperature fluctuations, making it difficult to assess chlorophyll content accurately [5,6]. In practice, decreased chlorophyll levels are often a sign of pests, diseases, nutritional deficiencies, or environmental stress [7]. Therefore, by measuring and monitoring changes in the chlorophyll content of vegetation, early interventions can be implemented to help maintain its health. Conventional methods for measuring chlorophyll content primarily involve acetone–ethanol extraction methods [8], spectrophotometry, and high-performance liquid chromatography [9]. These destructive methods, based on laboratory procedures, are time-consuming and costly. By overcoming the limitations of these methods, hyperspectral analysis technology can rapidly and non-destructively acquire spectral data from leaves/canopies of large areas of vegetation, facilitating real-time analysis [10,11]. Most chlorophyll a and chlorophyll b molecules exhibit absorption wavelengths/bands between 645 nm and 670 nm, while some special types of chlorophyll absorb at wavelengths longer than 700 nm [12]. Hyperspectral imaging technology can capture these absorption and reflection characteristics of chlorophyll across various spectral wavelengths, generating hyperspectral reflectance data [13].

At the same time, hyperspectral analysis technology not only provides significant spectral information but also involves a large number of wavelengths. Feature extraction can help reduce dimensionality and redundant information in hyperspectral data, thus compressing datasets and improving the efficiency of data processing. However, when processing large-scale hyperspectral images and reflectance data, traditional dimensionality reduction methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), may encounter challenges associated with high computational complexity [14,15]. This leads to extensive consumption of computational time and resources, hindering the efficiency of the analysis process. With the advancement of computational science, machine learning methods have been widely used for hyperspectral feature extraction, including support vector machines [16], random forests [17], and k-nearest neighbors [18]. These models are prone to overfitting during training, particularly when the number of training samples is limited and the feature dimension is high. The emergence of deep learning has advanced the research in hyperspectral feature extraction. Artificial neural networks (ANNs) [19], convolutional neural networks (CNNs) [20], and recurrent neural networks (RNNs) [21] have been applied to extract features from hyperspectral data. However, deep learning models are often regarded as ‘black box’ models, making it difficult to intuitively understand the importance of the features and wavelengths upon which the models are based. Furthermore, when identifying the core wavelengths of hyperspectral reflectance data, relying on only one method/model could result in the loss of some advantages that other methods/models could preserve. Moreover, arbitrarily combining multiple methods/models to reduce the dimensionality of hyperspectral reflectance data may result in conflicts regarding the selection of dimensions and parameters, ultimately preventing optimal dimensionality reduction results from being achieved.

Like other green leafy vegetation, Camellia oleifera contains a large amount of chlorophyll in its leaves, allowing it to carry out photosynthesis. In practice, Camellia oleifera aids in preventing soil erosion and the reversion of farmland to natural land [22], thus holding considerable economic, ecological, and social value [23]. In the cultivation of Camellia oleifera, analyzing its hyperspectral reflectance data can help understand its physiological parameters and response characteristics, thereby advancing the growth of the Camellia oleifera industry. In a recent study on the hyperspectral data of Camellia oleifera [24], based on the spectral characteristics of its leaves at the same growth period, Sun et al. developed a new algorithm using Taylor expansion to identify core wavelengths. However, their study focused solely on the correlation among wavelengths, ignoring the relationship between wavelengths and target variables.

On the other hand, predicting chlorophyll content in leaves remains a challenge. In recent years, many scholars have conducted in-depth research on estimating/retrieving the chlorophyll content of vegetation using hyperspectral analysis. For example, using a hybrid strategy of variable combination population analysis (VCPA) and genetic algorithm (GA) [25], Hasan et al. studied the performance of five machine learning algorithms to estimate the chlorophyll content of litchi based on the hyperspectral fractional-derivative reflections from 298 litchi leaves. By extracting color features, spectral indices, and chlorophyll fluorescence intensities from these various types of images [26], Zhang et al. developed a multivariate linear regression model and a partial least squares regression model to predict chlorophyll content. However, the methods (VCPA and GA) lack mathematical representations of their regression function, which makes it difficult to ensure the stability and reliability of the prediction models in practical applications, particularly when addressing complex problems. Meanwhile, empirical or semi-empirical models based on statistical methods (i.e., multivariate regression and linear regression) often fail to address the universality problem related to the physical mechanism. More importantly, there are very limited studies focusing on the chlorophyll content of Camellia oleifera using hyperspectral analysis.

Based on these above-mentioned problems, the aim of this study is to compare the performance of various dimensionality reduction methods in combination with machine learning models to achieve the most accurate prediction of chlorophyll content in Camellia oleifera. By comprehensively considering the correlations between wavelengths and target variables (i.e., chlorophyll content), three dimensionality reduction methods (Taylor-CC, NCC, and PCC) were used in the first round of dimensionality reduction. Then, various thresholds and dimensionality reduction methods (with/without further dimensionality reduction) were applied in the second round of dimensionality reduction to identify different sets of core wavelengths. Thus, a series of chlorophyll content estimation models for Camellia oleifera were developed and compared by using three machine learning models based on these proposed methods. The specific objectives of this study are as follows:

(1): to screen core wavelengths through the utilization of various single/two-stage dimensionality reduction methods;
(2): to identify the optimal dimensionality reduction method and the best model to serve as a final chlorophyll content retrieval model;
(3): to act as a supplement to monitoring the growth of Camellia oleifera by hyperspectral analysis technology.

2. Data and Methods

2.1. Data and Collecting

2.1.1. Overview of the Study Area and Design of Site Experiments

The data for this experimental study were collected in Hepeng Town, Shucheng County, Lu’an City, Anhui Province, China. It borders Wanfohu Town to the north, Gaofeng Township to the west and south, and Tangchi Town to the east. Hepeng Town has a typical subtropical monsoon climate, which is characterized by a cold winter, a hot summer, four distinct seasons, abundant rainfall, and sufficient sunlight. Additionally, Hepeng Town has frequent fog and minimal frost, creating favorable conditions for the growth of agricultural crops.

The survey sites are situated in the Camellia oleifera plantation of Anhui Dechang Seedling Co., Ltd., located in Zhanchong Village at Hepeng Town (see Figure 1). From 14 December 2022 to 17 December 2022, a total of 48 square plots, each measuring 20 m × 20 m, were divided into four parts (A, B, C, and D) in the study area. Since the growth conditions of the sample trees in each plot, such as planting year, variety, and cultivation measures (i.e., fertilization, watering, pruning), were consistent, the sample size was ample. Thus, five trees were selected from each plot, and a total of 240 Camellia oleifera trees were numbered and surveyed. All selected trees belonged to the Changlin series of Camellia oleifera, which were in a stable full blooming phase, ensuring consistent and reliable data for the study.

2.1.2. Hyperspectral Data Acquisition

In this study, surveyors employed an advanced all-band terrain spectrometer (Fieldspec4 Wide-Res, Analytical Spectrum Devices Inc., Boulder, CO, USA) to capture the canopy spectra of the 240 trees (see Figure 2). This instrument operates across a broad wavelength range from 350 to 2500 nm, and at a distance of four meters or more from the ground. The fiber optic cable of the spectrometer was raised above the center of the plant canopy using a fiber jumper, a fiber adapter, a carbon fiber telescopic rod, and a pipeline clamp. The fiber optic probe was positioned perpendicular to the top of the canopy, creating a circular observation area with a diameter $D = 2 h \tan 12.5^{\circ}$, where $h$ represents the vertical distance between the fiber optic probe and the center of the canopy. This setup allows for precise spectral measurements and data collection within the defined observation area. Moreover, measurements were typically conducted between 10 a.m. and 2 p.m. local time, ensuring that data collection was conducted during peak sunlight hours. To minimize the impact of shadows and human interference on the collection of canopy spectra, the surveyors dressed in dark clothing, positioned themselves facing the sun during measurements, and maintained a specific distance from the edge of the plant canopy. The instrument was preheated for a minimum of 20 min prior to taking measurements. Furthermore, each sample was measured 10 times in succession, and the average value was used as the original canopy spectrum for the sample.
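As a quick check of the footprint formula above, the following minimal sketch evaluates the diameter for a given probe height and inverts it for a target diameter. The function names are hypothetical, and the 0.9 m value is the sampling diameter quoted for the SPAD measurements; the probe height actually used in the field is not stated in the text.

```python
import math

HALF_ANGLE_DEG = 12.5  # half field-of-view angle taken from the formula in the text


def footprint_diameter(h_m: float) -> float:
    """Diameter D = 2 * h * tan(12.5 deg) of the circular observation area, in metres."""
    return 2.0 * h_m * math.tan(math.radians(HALF_ANGLE_DEG))


def probe_height_for(diameter_m: float) -> float:
    """Probe height h above the canopy centre that yields a given footprint diameter."""
    return diameter_m / (2.0 * math.tan(math.radians(HALF_ANGLE_DEG)))


print(round(footprint_diameter(1.0), 3))  # footprint for a probe 1 m above the canopy
print(round(probe_height_for(0.9), 2))    # height that would give a 0.9 m footprint
```

Under this formula, a footprint of 0.9 m corresponds to a probe roughly 2 m above the canopy centre.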

2.1.3. Soil and Plant Analyzer Development (SPAD) Data Acquisition

An SPAD-502 Plus chlorophyll meter (SPAD-502 Plus, Konica Minolta, Inc., Osaka, Japan) was utilized to measure the relative chlorophyll content of plant leaves in this study. The meter compares the reflectance of leaves at red (680 nm) and near-infrared (940 nm) wavelengths and uses the ratio to calculate the SPAD measurement, which serves as an index of chlorophyll content. Therefore, SPAD measurements were directly used in place of chlorophyll content in this study. To ensure alignment with the hyperspectral observation scale, the measured leaves had to be located within a circular area with a diameter of 0.9 m, centered on the upper surface of the canopy. When collecting leaves, it is important to select those with their upper surfaces facing the sky, because those leaves usually receive more sunlight and have higher chlorophyll content, more accurately reflecting the growth state of the plant. At least 16 leaves were collected from each plant, with three SPAD measurements taken from each leaf, resulting in at least 48 measurements per plant. After eliminating any outliers, these values were averaged to obtain the SPAD measurement for the leaves on the upper surface of a single plant’s canopy.

2.1.4. Measurements

In this study, 240 hyperspectral reflectance measurement samples were preprocessed to remove invalid data outside the range of [0, 1]. The cleaned dataset, which comprises 2020 wavelengths, is shown in Figure 3a. Blank/gap areas in the figure indicate wavelengths that were removed due to their susceptibility to interference from water vapor during measurement. The histogram of the SPAD data corresponding to the 240 measurement samples is shown in Figure 3b.

2.2. Methods

In order to reduce the redundant information in hyperspectral reflectance data and utilize models to predict the chlorophyll content in the leaves of Camellia oleifera, the following feature extraction methods and machine learning models were employed.

2.2.1. Feature Extraction

Given that the hyperspectral reflectance data in this study exhibited characteristics of multi-band information and high resolution, they inherently involved a substantial volume of data. Therefore, it is crucial to perform correlation analysis among wavelengths and target variables, followed by the elimination of redundant wavelengths to optimize the utilization of computing resources effectively. Based on the methods introduced below, several two-stage hybrid dimensionality reduction methods were developed.

(1): Nonlinear Correlation Coefficient (NCC)

Mutual information entropy is a fundamental concept used to quantify the correlation between two random variables [27]. Wang et al. proposed a modified nonlinear correlation information entropy for multivariate analysis [28]. Their experimental findings demonstrated that this entropy can effectively analyze not only linear relationships between variables but also nonlinear ones. Their proposed nonlinear correlation coefficient $NCC(X; Y)$ is defined as

$$NCC(X; Y) = H^{r}(X) + H^{r}(Y) - H^{r}(X, Y). \tag{1}$$

Specific definitions of $H^{r}(X)$ and $H^{r}(X, Y)$ are provided in Appendix B. Based on the matrix $NCC^{K \times K} = \{NCC_{ij}\}$ $(1 \leq i \leq K, 1 \leq j \leq K)$, the relationship between $K$ variables can be analyzed. Calculated via Equation (1), $NCC_{ij}$ represents the nonlinear correlation coefficient between the i-th random variable and the j-th one. Since a random variable is identical to itself, $NCC_{ii} = 1$. For the other elements of $NCC^{K \times K}$, $0 \leq NCC_{ij} \leq 1$ $(i \neq j, i \leq K, j \leq K)$. When the nonlinear correlation between two random variables is stronger, the value of their corresponding $NCC_{ij}$ is closer to 1. Conversely, when the correlation is weaker, the value of $NCC_{ij}$ tends to be closer to 0.
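The behaviour described above can be illustrated with a minimal sketch. Because Appendix B is not reproduced here, the code assumes the commonly described construction: samples are ranked and split into `bins` equal-frequency bins, and the revised entropies use a logarithm of base `bins`. It also uses the form $H^{r}(X) + H^{r}(Y) - H^{r}(X, Y)$, which is the form consistent with $NCC_{ii} = 1$. The function names and the choice of 10 bins are assumptions for illustration only.

```python
import numpy as np


def revised_entropy(prob, base):
    """Entropy of a probability table, using a logarithm of base `base`."""
    p = prob[prob > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))


def ncc(x, y, bins=10):
    """Nonlinear correlation coefficient sketch with rank-based binning (assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Rank each sample (0..n-1), then map the ranks to equal-frequency bins.
    bx = np.argsort(np.argsort(x, kind="stable"), kind="stable") * bins // n
    by = np.argsort(np.argsort(y, kind="stable"), kind="stable") * bins // n
    joint = np.zeros((bins, bins))
    np.add.at(joint, (bx, by), 1.0)  # 2-D histogram of bin pairs
    joint /= n
    return (revised_entropy(joint.sum(axis=1), bins)
            + revised_entropy(joint.sum(axis=0), bins)
            - revised_entropy(joint, bins))


rng = np.random.default_rng(0)
x = rng.normal(size=240)
print(ncc(x, np.exp(x)))              # monotone but nonlinear relationship
print(ncc(x, rng.normal(size=240)))   # independent draw
```

With this construction, a strictly monotone nonlinear relationship gives a value of 1 because the ranks coincide, while an independent variable gives a value near 0 (not exactly 0, since the histogram estimate is biased for finite samples).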

(2): Pearson Correlation Coefficient (PCC)

The Pearson correlation coefficient, also known as the product–moment correlation coefficient, is a statistical measure that quantifies the degree of linear correlation between two random variables (or two real-valued vectors). It is the first formal measure of correlation and remains one of the most commonly utilized measures of correlation [29]. The Pearson correlation coefficient for two random variables $x$ and $y$ is defined in Appendix B.

The Pearson correlation coefficient $r_{xy}$ ranges from −1 to 1, and its magnitude is invariant to linear transformations of either variable. If $r_{xy}$ is close to −1 or 1, there is a strong linear relationship between the two random variables; if $r_{xy}$ is close to 0, then there is almost no linear relationship between the two variables [30].
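The properties listed above can be checked numerically. The sketch below uses the standard sample formula for the Pearson coefficient (the article's own definition is in Appendix B); the helper name `pearson` and the test signals are illustrative assumptions.

```python
import numpy as np


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))


x = np.linspace(-1.0, 1.0, 101)
print(pearson(x, 3 * x + 2))    # increasing linear map
print(pearson(x, -3 * x + 2))   # decreasing linear map: same magnitude, opposite sign
print(pearson(x, x ** 2))       # purely nonlinear, symmetric relationship
```

The last case shows why a linear measure alone can miss a strong dependence: the quadratic relationship is deterministic, yet its Pearson coefficient is essentially zero.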

(3): Taylor Correlation Coefficient (Taylor-CC)

In a study related to hyperspectral signals, Sun et al. [24] used the Taylor expansion of smooth functions to introduce a novel method for estimating the correlation between hyperspectral signals at two wavelengths. Assuming that a continuous (reflectance) function $f$ has at least second-order derivatives at two nearby wavelengths ($x$ and $y$), the reflectance at $y$ can be described as

$$f(y) = f(x) + f'(x)(y - x) + \frac{1}{2} f''(x)(y - x)^{2} + o((y - x)^{2}). \tag{2}$$

Thus, the estimated reflectance at $y$, using the derivative information up to the second order at $x$, becomes

$$\hat{f}(y) = f(x) + f'(x)(y - x) + \frac{1}{2} f''(x)(y - x)^{2}. \tag{3}$$

Similarly, the estimate $\hat{f}(x)$ of $f(x)$ can be obtained using the local information at $y$.

Therefore, the following $c(x, y)$ can be defined as a correlation measure between $x$ and $y$:

$$c(x, y) = \begin{cases} 1 - \frac{1}{2}\left(|\hat{f}(x) - f(x)| + |\hat{f}(y) - f(y)|\right), & |\hat{f}(x) - f(x)| + |\hat{f}(y) - f(y)| \leq 2, \\ 0, & \text{else}. \end{cases} \tag{4}$$

Based on Equation (4), a correlation matrix $C(f, x) = [c(x_{i}, x_{j})] \in R^{K \times K}$ on $K$ wavelengths can be obtained. It should be noted that $C(f, x)$ is a symmetric matrix. Since a wavelength variable is identical to itself, $c(x_{i}, x_{i}) = 1$ $(1 \leq i \leq K)$. The other elements of $C(f, x)$, $0 \leq c(x_{i}, x_{j}) \leq 1$ $(i \neq j, i \leq K, j \leq K)$, represent the linear/nonlinear relationship between the i-th wavelength variable and the j-th one. In this hyperspectral analysis, $C(f, x)$ shows blocks of high values along its diagonal belt, because the closer $x$ is to $y$, the stronger the correlation $c(x, y)$.
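Equations (3) and (4) can be sketched for a single spectrum as follows. This is only an illustration under stated assumptions: the derivatives are approximated with finite differences (`np.gradient`), which the text does not specify, and how Sun et al. [24] aggregate over many samples is not covered. The function name and the synthetic spectrum are hypothetical.

```python
import numpy as np


def taylor_cc_matrix(wavelengths, reflectance):
    """Taylor-based correlation matrix for one spectrum (finite-difference sketch)."""
    wl = np.asarray(wavelengths, dtype=float)
    f = np.asarray(reflectance, dtype=float)
    d1 = np.gradient(f, wl)    # approximate first derivative f'
    d2 = np.gradient(d1, wl)   # approximate second derivative f''
    dx = wl[None, :] - wl[:, None]  # dx[i, j] = wl[j] - wl[i]
    # est[i, j]: reflectance at wl[j] predicted from local information at wl[i], Eq. (3)
    est = f[:, None] + d1[:, None] * dx + 0.5 * d2[:, None] * dx ** 2
    err = np.abs(est - f[None, :])
    total = err + err.T        # both prediction directions, as in Eq. (4)
    return np.where(total <= 2.0, 1.0 - 0.5 * total, 0.0)


wl = np.linspace(400.0, 1000.0, 61)       # synthetic wavelengths in nm
refl = 0.3 + 0.2 * np.sin(wl / 100.0)     # synthetic smooth reflectance curve
C = taylor_cc_matrix(wl, refl)
print(C.shape, C[30, 31], C[30, 60])
```

On this synthetic curve, neighbouring wavelengths receive values very close to 1, while distant pairs receive lower values, which mirrors the diagonal-belt structure described above.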

(4): Setting Threshold

In a correlation analysis of feature extraction, a common practice is to establish a criterion by setting a threshold to screen features that exhibit no obvious correlation [31]. If the correlation coefficient between two features exceeds a specified threshold, it indicates a strong correlation between them, and only one of the features needs to be kept; if the correlation falls below this threshold, it is deemed weak, and neither of the two features can be filtered out.
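The screening rule can be sketched as a greedy pass over a correlation matrix. The processing order (ascending index, keeping the first member of each highly correlated pair) is an assumption; the text does not say which of two correlated features is retained. The function name is hypothetical.

```python
def select_by_threshold(corr, threshold):
    """Greedily keep features whose correlation with every kept feature is <= threshold."""
    kept = []
    for j in range(len(corr)):
        # Feature j is redundant if it is strongly correlated with any kept feature.
        if all(corr[j][k] <= threshold for k in kept):
            kept.append(j)
    return kept


corr = [
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.30],
    [0.20, 0.30, 1.00],
]
print(select_by_threshold(corr, 0.90))
print(select_by_threshold(corr, 0.99))
```

Raising the threshold generally retains more features, which is how a threshold can be tuned toward a target number of core wavelengths.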

2.2.2. Machine Learning Models

Based on the above-described methods and threshold setting, this study focused on selecting the core/key wavelengths that meet the desired criteria. Then, three machine learning models were introduced to develop the inversion model of SPAD based on selected core wavelengths.

(1): Least Absolute Shrinkage and Selection Operator

In 1996, Tibshirani [32] introduced the least absolute shrinkage and selection operator (Lasso) for parameter estimation and variable selection in regression analysis. Lasso regression is a specific case of penalized least squares regression using the L1-penalty function (see Figure A1). In contrast to ridge regression, which penalizes the squared values of the coefficients, Lasso regression seeks to balance model complexity and predictive accuracy by penalizing the absolute values of its coefficients. By shifting each coefficient toward zero by a constant amount and truncating at zero, Lasso regression minimizes the residual sum of squares while constraining the sum of the absolute values of the coefficients.

In this study, Lasso regression was used to model SPAD, since it is effective for high-dimensional datasets and problems with a large number of features [33]. The objective function of Lasso regression is

$$\arg\min_{\beta} \left\{ \left\| Y - \sum_{j = 1}^{N} X_{j} \beta_{j} \right\|^{2} + \lambda \sum_{j = 1}^{N} |\beta_{j}| \right\}, \tag{5}$$

where $Y$ is the response variable, $X_{j}$ is the j-th predictive variable, and $\beta_{j}$ is its coefficient. $\lambda$ is the regularization parameter that controls the strength of the penalty on the coefficients. By manually setting a range of possible $\lambda$ values (e.g., $\lambda$ = 0.01, 0.1, 0.2, etc.), cross-validation is used for each $\lambda$ value to evaluate the performance of the model, thus selecting the $\lambda$ value that makes the model perform the best. Notably, to ensure that results are comparable, a specific value of $\lambda$ = 0.1 was utilized in this study, as most experiments showed improved performance with this setting.
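To make the role of the penalty concrete, the following minimal sketch minimizes the objective of Equation (5) by coordinate descent with soft-thresholding. This is a generic textbook solver, not necessarily the one used in the study; it assumes no intercept and no all-zero columns, and the names `soft_threshold` and `lasso_cd` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Shift z toward zero by t and truncate at zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * sum(|b_j|) by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j, assumed non-zero
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual with the contribution of feature j added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta


X = np.eye(3)
y = np.array([3.0, 0.2, -2.0])
print(lasso_cd(X, y, lam=1.0))  # small coefficient is set exactly to zero
```

In this orthonormal toy design each coefficient is simply the least squares value shifted toward zero by `lam / 2`, so the weak second predictor is removed entirely, which is the variable selection effect described above.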

(2): Artificial Neural Network

An artificial neural network (ANN) is modeled based on biological neural networks [34]. Every neural network consists of three essential components: node features, network topology, and learning rules. Figure A2 illustrates how the ANN works. Initially, the input data (

x

) is transmitted from the input layer to the hidden layer in an adaptive network through forward propagation. Subsequently, the weights are adjusted in reverse based on the gradient of error measure produced by the loss function during the process of backpropagation. This adjustment is performed using a gradient descent algorithm, which iteratively optimizes network parameters to minimize the error measure. By continuously updating the weights based on the above rules, the network fine-tunes its parameters to learn and adapt to the input data, improving its predictive capabilities and overall performance. This iterative process of adjusting weights through both backpropagation and gradient descent enables the network to efficiently learn and optimize its internal representations for better accuracy and effectiveness [35].

In this study, given that the dataset consisted of only 240 samples, 15 hidden nodes were used to identify an ANN structure with the best performance. The hidden layer structures include all possible one to three hidden layers with pyramid style, i.e., {[15], [13,2], [12,3], [11,4], [10,5], [9,6], [8,7], [7,6,2], [6,5,4], [7,5,3], [8,5,2], [8,4,3], [9,4,2], [10,3,2], [14,1]}. Since the structure of [13,2] and the methods used in subsequent experiments achieved the best performance, [13,2] was chosen as the optimal structure for predicting chlorophyll content (see Table A1).
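The 15 candidate structures listed above can be reproduced by a short enumeration. The rule used here is inferred from the list rather than stated in the text: layer widths are strictly decreasing and sum to 15, and three-layer networks have at least two nodes in the last layer. The function name is hypothetical.

```python
def pyramid_structures(total=15):
    """Enumerate 1- to 3-layer hidden structures with strictly decreasing widths."""
    structures = [[total]]
    # Two hidden layers: first > second >= 1.
    for second in range(1, total):
        first = total - second
        if first > second:
            structures.append([first, second])
    # Three hidden layers: first > second > third >= 2 (inferred constraint).
    for third in range(2, total):
        for second in range(third + 1, total):
            first = total - second - third
            if first > second:
                structures.append([first, second, third])
    return structures


candidates = pyramid_structures()
print(len(candidates))
print(candidates)
```

Each candidate would then be trained and compared on the same data split to pick the best-performing structure.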

(3): Random Forest

A random forest (RF) is a versatile machine learning method that was proposed and developed by Breiman in 2001 [36]. RF is known for its proficiency in both classification and regression tasks, showcasing exceptional performance in minimizing prediction errors across various benchmark datasets. When constructing an RF, two key aspects are important to eliminate overfitting: the random selection of data and the random selection of features.

The modeling process of RF mainly includes the following steps, as shown in Figure A3. First, the input data are drawn from the original datasets using a self-service sampling method (i.e., bootstrap sampling) to form multiple sub-datasets. Subsequently, for each sub-dataset, a decision tree is created. Decision trees use features to split nodes, and each node represents a specific feature. Finally, the integrator makes a decision (majority voting/averaging according to different tasks, i.e., classification/regression) [37]. In this study, 200 trees were established, each allowed to have a maximum of five leaves.

3. Experimental Design and Data Processing

3.1. Data Preprocessing and Core Wavelengths Selection Schemes

In practice, most initial datasets may contain missing values, duplicate values, and other discrepancies that need to be addressed during the data preprocessing stage. Hence, data preprocessing is necessary before utilizing the data in models to ensure their quality and reliability in the subsequent analysis. There is no unique method for data preprocessing, and it varies based on specific tasks and dataset properties. In this study, to eliminate scale differences among different features, all input/output data (i.e., reflectance and SPAD) were scaled linearly to the range of [−1, 1]. Moreover, due to the impact of water vapor [34], some measured reflectance values at certain wavelengths may be considered outliers. Therefore, it is also important to deal with the outliers in the reflectance data.
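The linear scaling to [−1, 1] can be sketched as a column-wise min-max transform. Whether the study scaled each wavelength separately or used another convention is not stated, so the per-column choice, the handling of constant columns, and the function name are assumptions.

```python
import numpy as np


def scale_to_unit_range(data):
    """Linearly map each column to [-1, 1] using its own minimum and maximum."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to -1
    return 2.0 * (data - lo) / span - 1.0


sample = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
print(scale_to_unit_range(sample))
```

In practice the minimum and maximum would be taken from the training samples and reused for the validation and testing samples, so that all splits share one scale.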

More importantly, to reduce the training complexity of prediction models and optimize computational resources, it is necessary to reduce the dimensionality of input data by selecting core wavelengths from the entire spectrum of reflectance data. In addition, the order in which different dimensionality reduction methods are combined needs to be investigated to evaluate their performance. The dimensionality reduction methods were divided into three schemes based on the method (Taylor-CC, NCC, or PCC) selected for the first round of dimensionality reduction. Then, in each of these three schemes, a decision was made on whether to proceed with the second round of dimensionality reduction. On this basis, both single dimensionality reduction methods and two-stage hybrid dimensionality reduction methods were set up to identify which method achieved the best performance. The specific schemes, listed in Table 1, were developed to compare the performance of different core wavelength selection schemes.

The following are specific explanations of the dimensionality reduction methods used in each scheme.

Scheme I: Method (1): Taylor-CC: The dimensionality of the input data is reduced using the Taylor-CC method based on a threshold and the information among different wavelengths. Method (2): Taylor-CC + PCC: Firstly, the dimensionality of the input data is reduced based on the information among different wavelengths and a threshold using the Taylor-CC method. Subsequently, the reduced input data resulting from the first step are integrated with SPAD measurement information and a threshold for further dimensionality reduction through the PCC method. Method (3): Taylor-CC + NCC: Firstly, the dimensionality of the input data is reduced according to the information among different wavelengths by the Taylor-CC method. Subsequently, the reduced input data resulting from the first step are integrated with SPAD measurements for further dimensionality reduction through the NCC method.

Scheme II: Method (1): NCC; Method (2): NCC + PCC; Method (3): NCC + NCC.

Scheme III: Method (1): PCC; Method (2): PCC + PCC; Method (3): PCC + NCC.

Based on the results of the three schemes, it is also important to identify the most effective machine learning model for predicting chlorophyll content. To achieve this goal, using different sets of core wavelengths identified by various methods in different schemes, Lasso and RF models were trained and tested with a ratio of 70% and 30%, respectively; while an ANN model went through training, validation, and testing with a ratio of 70%, 15%, and 15%, respectively. Given that the data used in this study is relatively abundant and outliers have been removed in advance, a random division of training, validation, and testing samples would sufficiently meet the requirements for model training. Therefore, no cross-validation experiments were conducted to further verify the experimental results.

It should be noted that in the three schemes and different methods, different numbers of core wavelengths may be obtained by adjusting the thresholds. To ensure the validity of comparisons, the final number of core wavelengths used in experiments must be consistent across the dimensionality reduction schemes. Moreover, to compare results among different schemes, the training samples in the three models should also be consistent, and the testing samples in RF and Lasso must be consistent as well. Thus, the validation samples and testing samples in the ANN, taken together, were the same as the testing samples in RF and Lasso.
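The sample consistency described above can be sketched as a single random partition shared by all models. The seed, the function name, and the choice to split the held-out 30% into two equal halves for the ANN are assumptions that follow the 70/30 and 70/15/15 ratios given in the text.

```python
import random


def split_indices(n_samples=240, seed=0):
    """One shared partition: 70% training, 30% held out (halved for ANN val/test)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(0.70 * n_samples)
    train, holdout = idx[:n_train], idx[n_train:]
    half = len(holdout) // 2
    return {
        "train": train,             # used by Lasso, RF, and the ANN
        "test_lasso_rf": holdout,   # 30% test set for Lasso and RF
        "ann_val": holdout[:half],  # 15% validation set for the ANN
        "ann_test": holdout[half:], # 15% test set for the ANN
    }


parts = split_indices()
print({name: len(ids) for name, ids in parts.items()})
```

Reusing one partition across every scheme and model means that differences in the metrics reflect the wavelength selection and model choice rather than the particular samples drawn.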

In addition, it is important to perform uncertainty analysis to evaluate the potential impact of changes in core wavelengths on the prediction of chlorophyll content. Based on the best-performing methods and models identified from the previous experiments, each core wavelength was removed in turn to assess its impact on the prediction of chlorophyll content. Furthermore, a random forest importance ranking [38] was employed to assess the significance of each core wavelength by determining its contribution to the improvement of the model’s accuracy.

3.2. Evaluation Metrics

In this study, the root mean square error (RMSE), relative mean absolute error (RMAE), relative absolute error (RAE), determination coefficient (R²), mean square error (MSE), mean bias error (MBE), and mean absolute error (MAE) were utilized to assess the performance of the above schemes and models. They are defined as

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - x_{i})}^{2}},

\mathrm{RMAE} = \frac{1}{N} \frac{\sum_{i = 1}^{N} |y_{i} - x_{i}|}{\bar{x}},

\mathrm{RAE} = \frac{\sum_{i = 1}^{N} |y_{i} - x_{i}|}{\sum_{i = 1}^{N} |x_{i} - \bar{x}|},

R^{2} = 1 - \frac{\sum_{i = 1}^{N} {(x_{i} - y_{i})}^{2}}{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2}},

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - x_{i})}^{2},

\mathrm{MBE} = \frac{1}{N} \sum_{i = 1}^{N} (x_{i} - y_{i}),

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} |y_{i} - x_{i}|,

where $x_{i}$ represents the i-th observed value, $y_{i}$ represents the i-th model predicted value, $N$ is the number of observed values, which equals the number of model predicted values, $\bar{x}$ is the mean of observed values, and $\bar{y}$ is the mean of model predicted values. The performance of these schemes and models is better with lower RMSE, lower RMAE, lower RAE, higher R², lower MSE, MBE close to zero, and lower MAE.
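As a worked illustration of these definitions, the seven metrics can be computed as in the sketch below. The function name `regression_metrics` and the sample values are hypothetical. MBE is computed here as a signed mean difference (observed minus predicted, following the order of terms in the formula), on the assumption that a bias metric reported with negative values in the result tables is not meant to be an absolute value.

```python
import math

def regression_metrics(observed, predicted):
    """Compute RMSE, RMAE, RAE, R2, MSE, MBE and MAE for paired values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    mean_obs = sum(observed) / n
    abs_err_sum = sum(abs(y - x) for x, y in zip(observed, predicted))
    sq_err_sum = sum((y - x) ** 2 for x, y in zip(observed, predicted))
    mse = sq_err_sum / n
    mae = abs_err_sum / n

    return {
        "RMSE": math.sqrt(mse),
        "RMAE": mae / mean_obs,
        "RAE": abs_err_sum / sum(abs(x - mean_obs) for x in observed),
        "R2": 1 - sq_err_sum / sum((x - mean_obs) ** 2 for x in observed),
        "MSE": mse,
        # Assumption: signed bias, observed minus predicted.
        "MBE": sum(x - y for x, y in zip(observed, predicted)) / n,
        "MAE": mae,
    }

# Hypothetical SPAD-like values, for illustration only.
observed = [40.0, 50.0, 60.0]
predicted = [42.0, 49.0, 57.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RMSE is the square root of MSE, the two columns in the result tables can be cross-checked against each other.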

4. Results

4.1. Correlation Matrices from Taylor-CC, NCC, and PCC Methods

Using the hyperspectral reflectance measurements at 2151 wavelengths obtained from 240 samples, a correlation analysis of all wavelengths was performed with the Taylor-CC, NCC, and PCC methods, yielding one correlation matrix per method (see Figure 4, Figure 5 and Figure 6). All three matrices presented a clear diagonal pattern. Away from the diagonal, the correlation values based on the Taylor-CC and NCC methods approached zero, except that the NCC values approached one at a few specific locations. In contrast, the correlation matrix of the PCC method still exhibited numerous high values far from the diagonal, which was inconsistent with the general characteristics of reflectance data in practice.

4.2. Core Wavelengths Selection Using Different Schemes and Methods

As examples, the final number of core wavelengths selected in the different experimental schemes was set to 50, 69, and 70. Because the conclusions based on these three numbers were identical, only the results for 70 core wavelengths are shown below.

When single dimensionality reduction methods were used, thresholds of 0.926, 0.396, and 0.94608 were set for the Taylor-CC, NCC, and PCC methods, respectively, so that each method yielded 70 core wavelengths. When two-stage hybrid dimensionality reduction methods were used, several first-round set sizes were tested to illustrate the generality of the experimental conclusions. Accordingly, for the Taylor-CC, NCC, and PCC methods, specific thresholds were chosen in the first round of dimensionality reduction so that the number of identified core wavelengths for each method was 90, 94, 97, 122, 123, or 125. In the second round of dimensionality reduction, a threshold was set for either the NCC method or the PCC method so that the final number of core wavelengths was reduced to 70. The specific thresholds for the three schemes are presented in Table 2, Table A2 and Table A3. Take the entry ‘90 + 70’ and its corresponding thresholds ‘0.9540 + 0.0580’ in Table 2 as an example. In the first round of dimensionality reduction, the threshold was set to 0.9540, so that 90 core wavelengths were identified. The threshold for the second round was then set to 0.0580, so that the final number of core wavelengths was reduced to 70.
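The two-round thresholding described above can be illustrated with the sketch below. It is not the authors' implementation: the Taylor-CC and NCC measures are not defined in this section, so plain Pearson correlation stands in for both rounds, and the two filtering rules (a greedy redundancy filter among wavelengths in the first round, a relevance filter against the SPAD target in the second) are assumptions chosen only to mimic how the thresholds behave, with a higher first-round threshold keeping more wavelengths and a higher second-round threshold keeping fewer. All function names and data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def first_round(columns, threshold):
    """Assumed rule: keep a wavelength only if its |correlation| with every
    wavelength kept so far stays below the threshold."""
    kept = []
    for j, column in enumerate(columns):
        if all(abs(pearson(column, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

def second_round(columns, target, candidates, threshold):
    """Assumed rule: keep candidates whose |correlation| with the target
    reaches the threshold."""
    return [j for j in candidates if abs(pearson(columns[j], target)) >= threshold]

# Toy values (one list per wavelength, one entry per sample); not real reflectance data.
toy_columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.5],   # nearly a multiple of the first column, hence redundant
    [1, -1, 0, -1, 1],    # uncorrelated with the toy target
    [2, 1, 0, 1, 2],
]
toy_spad = [10, 12, 13, 15, 17]

stage_one = first_round(toy_columns, threshold=0.95)
stage_two = second_round(toy_columns, toy_spad, stage_one, threshold=0.05)
print(stage_one, stage_two)
```

In practice, each threshold would be adjusted until the desired number of wavelengths remains, as in the ‘90 + 70’ example.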

By comparing Figure 7, Figure 8 and Figure 9, it is evident that the core wavelengths obtained from Scheme I were more evenly distributed than those obtained from Scheme II and III. In addition, the core wavelength sets obtained by the different dimensionality reduction methods were different in each scheme. This indicated that different dimensionality reduction methods/schemes analyzed the correlations among the dataset variables in different ways.

On the other hand, for Schemes II and III, most core wavelengths were concentrated in the ranges of 1800 nm to 2000 nm and 2300 nm to 2400 nm, while Scheme I selected many core wavelengths in the range of 650 nm to 750 nm. Furthermore, in Scheme I, the two-stage hybrid dimensionality reduction methods produced a more even distribution of core wavelengths.

It is worth noting that, to facilitate the comparisons among different dimensionality reduction methods, the average level of performance of the six groups of experiments (i.e., the six rows in Table 2, Table A2 and Table A3) in a two-stage hybrid dimensionality reduction method was used as a representative measure of the overall performance. Those average performance metrics were used in the following comparisons. Moreover, the detailed results of six groups of experiments in the different two-stage methods are shown in Appendix A.
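The averaging step can be illustrated with the MAE and RMSE rows of Table A5 (Method (3) with an ANN in Scheme I), where each column corresponds to one first-round set size. The dictionary layout and variable names below are only an illustrative choice.

```python
from statistics import mean

# MAE and RMSE rows of Table A5; keys are the first-round numbers of core wavelengths.
table_a5 = {
    "MAE": {90: 2.7411, 94: 2.6583, 97: 2.8749, 122: 2.9786, 123: 2.8934, 125: 2.9842},
    "RMSE": {90: 3.4842, 94: 3.2577, 97: 3.5065, 122: 4.0652, 123: 3.7249, 125: 3.6520},
}

# Average over the six groups of experiments to get one representative value per metric.
averages = {metric: mean(values.values()) for metric, values in table_a5.items()}
for metric, value in averages.items():
    print(f"{metric}: {value:.4f}")
```

The same computation applies to the remaining metrics and to the other appendix tables.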

4.3. Comparing the Performance of Different Methods in Scheme I

The results obtained from Methods (1), (2), and (3) using different models are listed in Table 3, Table 4 and Table 5, respectively. In both Lasso and ANN, it is evident that all seven metrics in Method (2) and Method (3) were superior to those in Method (1), with low MAE, MBE close to zero, low MSE, low RAE, low RMAE, high R², and low RMSE. In RF, except for the MBE of both Method (2) and Method (3), which were inferior to that of Method (1), the other metrics in Method (2) and Method (3) showed better performance. Therefore, a conclusion can be drawn that the performance of Method (2) and Method (3) was better than that of Method (1) in all three models.

With the conclusion that hybrid dimensionality reduction methods (Methods (2) and (3)) outperformed the single dimensionality reduction method (Method (1)) in different models, focus was shifted to identifying the best hybrid dimensionality reduction method between Method (2) and Method (3). By comparing the performance of Methods (2) and (3), it is clear that all metrics in Method (3) performed better than those in Method (2) in all three models. Moreover, by comparing seven metrics of the three models in Method (3), it is obvious that the ANN performed better than both Lasso and RF in all aspects except for MBE. Therefore, the above discussion indicated that Method (3) combined with an ANN achieved the best performance in Scheme I.

4.4. Comparing the Performance of Different Methods in Scheme II

The results of Method (1) using different models are listed in Table 6, and the results of Method (2) are listed in Table 7. Using Lasso, ANN, and RF, it is clear that the performance of the seven metrics in Method (2) was superior to that in Method (1) when using the same model, which indicated that Method (2) was superior to Method (1) in all three models.

The results of Method (3) using Lasso, ANN, and RF are listed in Table 8. From the seven metrics obtained in these three models, it is verified that all metrics in Method (3) were better than those in Method (1), indicating that Method (3) worked better than Method (1).

The above discussion revealed that both Method (2) and Method (3) performed better than Method (1). That is, two-stage hybrid dimensionality reduction outperformed single dimensionality reduction in Scheme II. The next comparison focused on identifying which hybrid dimensionality reduction method achieved the best performance. For the same model, the MBE of Method (3) was slightly inferior to that of Method (2), while the other metrics of Method (3) outperformed those of Method (2), with lower MAE, lower MSE, lower RAE, lower RMAE, higher R², and lower RMSE. Therefore, it can be concluded that Method (3) was superior to Method (2) in all three models. Consequently, Lasso, ANN, and RF each achieved their best predictive performance with Method (3), and Lasso with Method (3) showed the best results in all seven metrics when compared with the other two models.

4.5. Comparing the Performance of Different Methods in Scheme III

The performance of Methods (1), (2), and (3) using different models is displayed in Table 9, Table 10 and Table 11, respectively. To assess the performance of single dimensionality reduction methods (Method (1)) and two-stage hybrid dimensionality reduction methods (Methods (2) and (3)), this study compared all three methods. It can be seen that both Method (2) and Method (3) performed better than Method (1), with low MAE, MBE close to zero, low MSE, low RAE, low RMAE, high R², and low RMSE.

To sum up, two-stage hybrid dimensionality reduction outperformed single dimensionality reduction. Next, the two two-stage hybrid dimensionality reduction methods (Methods (2) and (3)) were compared with each other. The comparison indicated that Method (3) outperformed Method (2) in terms of MAE, MSE, RAE, RMAE, R², and RMSE in all three models. Moreover, among the three models in Method (3), Lasso performed the best. This implied that, in Scheme III, Method (3) combined with Lasso achieved the best prediction performance.

4.6. Comparing the Performance of Scheme I, II, and III

In Scheme I, the dimensionality reduction method used in the first round was the Taylor-CC method, with its results displayed in Table 3, Table 4 and Table 5. In Scheme II, the dimensionality reduction method used in the first round was the NCC method, with its results displayed in Table 6, Table 7 and Table 8. In Scheme III, the dimensionality reduction method used in the first round was the PCC method, with its results displayed in Table 9, Table 10 and Table 11. The performance of the single dimensionality reduction method in these three schemes was compared based on the seven metrics calculated from each model. For all three models (Lasso, ANN, and RF), comparisons showed that the performance of the single dimensionality reduction method in Scheme I was superior to that in both Scheme II and Scheme III. Subsequently, the performance of two-stage hybrid dimensionality reduction methods in these three schemes was also compared. By comparing the performance of Method (2) in Scheme I with that in Schemes II and III, it is evident that Method (2) in Scheme I exhibited the best performance among those methods, except that its MBE was poorer in Lasso and RF. Similarly, the same conclusion can be reached for Method (3) across the schemes. Therefore, Scheme I generally performed better than both Scheme II and Scheme III when using the same model and the same method (PCC/NCC) for the second round of dimensionality reduction. It can be concluded that, in the first round of dimensionality reduction, the core wavelengths obtained by the Taylor-CC method can well preserve the information of the original reflectance data.

4.7. Comparing the Performance of the Method (3) in Scheme I Under Different Models

Based on the discussions in Section 4.3, Section 4.4 and Section 4.5, the two-stage hybrid dimensionality reduction method denoted as Method (3) performed the best in each of Schemes I, II, and III. Furthermore, the Taylor-CC + NCC method in Scheme I performed better than the other dimensionality reduction methods in Schemes I, II, and III for each of the ANN, RF, and Lasso models. Finally, the three models were compared under the Taylor-CC + NCC method, and the ANN performed the best among them. Moreover, among the experiments using an ANN, selecting 94 core wavelengths in the first round of dimensionality reduction achieved the best performance.

4.8. Uncertainty Analysis of Input Variables

Based on the best-performing method and model identified above, an uncertainty analysis was used to assess the potential impact of changes in the 70 identified core wavelengths on chlorophyll content prediction. The results are presented in Figure 10. Comparing the 71 groups of experiments (the full set of 70 core wavelengths plus 70 sets with one wavelength removed) shows that retaining all the original core wavelengths is beneficial for achieving higher-precision prediction of chlorophyll content, with low RMSE, low MAE, high R², low MSE, MBE close to zero, low RMAE, and low RAE. Removing each of the 70 core wavelengths in turn shows that almost every core wavelength affected the prediction of chlorophyll content. Although the removal of a few core wavelengths significantly affected the prediction (e.g., the removal of wavelengths #9, #32, #35, and #37 in Figure 10), the removal of most core wavelengths had only a small impact, which implied that the ANN is robust in predicting chlorophyll content.
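The remove-one-wavelength procedure can be sketched as follows. The callback `evaluate` is a hypothetical placeholder for retraining the model on a given wavelength set and returning its test RMSE; the toy scoring function and wavelength values are illustrative assumptions, not results from this study.

```python
def leave_one_out_impact(wavelengths, evaluate):
    """Score the full set once, then once per removed wavelength.

    Returns the baseline score and, for each wavelength, the change in
    score caused by removing it (positive means a larger error)."""
    baseline = evaluate(list(wavelengths))
    impact = {}
    for removed in wavelengths:
        reduced = [w for w in wavelengths if w != removed]
        impact[removed] = evaluate(reduced) - baseline
    return baseline, impact

def toy_rmse(selected):
    """Hypothetical stand-in for 'retrain the ANN and return its test RMSE'."""
    penalties = {680: 0.5, 700: 0.2}   # assumed cost of losing each wavelength
    return 3.0 + sum(p for w, p in penalties.items() if w not in selected)

core = [550, 680, 700, 1900]           # illustrative wavelengths in nm
baseline, impact = leave_one_out_impact(core, toy_rmse)
ranked = sorted(impact, key=impact.get, reverse=True)
print(baseline, ranked)
```

For k core wavelengths this procedure runs k + 1 evaluations, which gives 71 for the 70 core wavelengths considered here.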

In addition, based on the hyperspectral reflectance data used in this study, random forest importance was used to evaluate the importance of each wavelength (see Figure 11). It was observed that the wavelength region of [650 nm, 700 nm] exhibited relatively high importance. Considering the spectral characteristics of vegetation, the importance analysis was deemed reasonable, indicating that more attention should be paid to the wavelengths within this region in this study.

5. Discussion

Hyperspectral imaging technology can capture the chemical and physiological state of vegetation across various spectral wavelengths, typically ranging from hundreds to thousands of wavelengths [39,40]. In this study, a two-stage hybrid dimensionality reduction method was innovatively used to select core wavelengths in hyperspectral reflectance data. Compared with empirical inference, machine learning models, and deep learning models, this method provided a clear mathematical foundation, making it easier for researchers to understand and explain its underlying principles. Moreover, unlike other studies that typically use a single dimensionality reduction method or arbitrarily combine different dimensionality reduction methods, this study focused on sequentially identifying core wavelengths through two-stage hybrid dimensionality reduction methods, thereby preserving the advantages of various approaches.

This study found that both two-stage hybrid dimensionality reduction methods outperformed single dimensionality reduction methods in all schemes. This demonstrated that using two-stage hybrid dimensionality reduction methods can provide improved insights into correlation analysis and capture important patterns in hyperspectral data, thereby enhancing the comprehensiveness and accuracy of feature selection processes.

Further analysis indicated that using the Taylor-CC method for core wavelength selection achieved higher predictive accuracy for chlorophyll content in Camellia oleifera leaves. Compared with the PCC and NCC methods, the Taylor-CC method was developed based on the property of hyperspectral reflectance data resembling continuous functions. It approached the problem from a mathematical perspective, taking into full account the correlation between different wavelengths and representing it through specific numerical values. Furthermore, the spectral characteristics of vegetation showed that most chlorophyll a and chlorophyll b molecules absorb wavelengths between 645 nm and 670 nm, while some special types of chlorophyll absorb wavelengths longer than 700 nm [10]. Therefore, it is evident that the wavelength range of 645 nm to 700 nm is highly sensitive to variations in chlorophyll. Compared with the PCC and NCC methods, the core wavelengths selected based on the Taylor-CC method were evidently more concentrated in this range (see Figure 11). This strongly indicated that the Taylor-CC method was more effective for analyzing the spectral characteristics of Camellia oleifera and for identifying core wavelengths relative to the spectral properties of vegetation.

Additionally, using the Taylor-CC method to predict chlorophyll content with an ANN achieved the best performance. An ANN can effectively handle high-dimensional data and complex nonlinear relationships [41]. Through multiple layers of neurons and nonlinear activation functions, an ANN can approximate many complex functions, making it suitable for most real-world problems.

When the final number of core wavelengths was 70, the Taylor-CC + NCC method combined with an ANN achieved the best performance in chlorophyll content prediction (see Table 12). To demonstrate the generalizability and applicability of these conclusions, additional experiments of the same kind were conducted to compare the different dimensionality reduction methods with 50 or 69 core wavelengths in the final selection. As expected, the conclusions reached with 50 or 69 core wavelengths were the same as those reached with 70.

Although the research object of this study focused on a specific plant sample (Camellia oleifera), the adopted methods and models have significant universality for many applications with hyperspectral signal/data. In the analysis of hyperspectral reflectance data, both the PCC and Taylor-CC methods can be used to extract core wavelengths [24,42,43]. The NCC method was also applied to the selection of core remote sensing factors in similar ecological environments [44]. Moreover, the three machine learning models used in this study have been successfully applied to various prediction scenarios many times, demonstrating their wide applicability. In fact, many machine or non-machine learning models can use identified core wavelengths to study user-interested physical quantities in different applications (e.g., concentrations in chemical solutions), and they might achieve better performance than the three models used in this study. Future research will further validate the generalizability of these conclusions in predicting plant physiological states using remote sensing data, so as to expand their applicability.

6. Conclusions

In this study, a two-stage hybrid dimensionality reduction method was innovatively used to identify core wavelengths by analyzing the correlations among wavelengths and SPAD measurements. By using hyperspectral reflectance data and SPAD measurements from 240 Camellia oleifera samples, different dimensionality reduction methods/schemes were proposed and compared. Combined with the methods/schemes, three different machine learning models were used to predict chlorophyll content, and it led to the following four conclusions.

(1): In Schemes I, II, and III, the performance of the two-stage hybrid dimensionality reduction methods was superior to that of the single dimensionality reduction method. This showed that the two-stage approach made better use of the advantages of different dimensionality reduction methods, thereby effectively improving the overall reduction effect and obtaining a more comprehensive feature representation.
(2): Compared with the PCC and NCC methods, the core wavelengths identified by the dimensionality reduction method based on Taylor-CC were more concentrated in the range associated with chlorophyll variation, while still preserving a significant amount of the original information from the wavelengths.
(3): The Taylor-CC + NCC method performed the best among different two-stage hybrid dimensionality reduction methods in different schemes. Meanwhile, in the Taylor-CC + NCC method, using 94 core wavelengths selected in the first round of dimensionality reduction, the ANN exhibited superior predictive performance against the other two models (MAE = 2.6583, MBE = −0.1371, MSE = 10.6127, RAE = 0.4221, RMAE = 0.0358, R² = 0.8210, RMSE = 3.2577).
(4): In the original Taylor-CC study from [24], which utilized hyperspectral analysis techniques to predict chlorophyll content in Camellia oleifera, using only the Taylor-CC method for identifying core wavelengths overlooked the relationship between wavelengths and target variables. This study addressed this deficiency and made improvements based on the Taylor-CC method.

The main limitations of this study are the limited geographic scope, the short time scale, and the constraints on sample size and spatial coverage in the Camellia oleifera survey data. Together, these factors could hinder the applicability and extensibility of the proposed method. In particular, the study area has relatively uniform vegetation types and environmental conditions, which could enhance the accuracy of the prediction model within this region. However, as the size of the region increases, the diversity of vegetation types and environmental conditions also grows, potentially diminishing the model’s applicability in various sub-regions unless more environmental factors can be introduced into the models. Additionally, the data is limited to a single growth cycle of Camellia oleifera, and the model’s applicability will need to be re-verified and adjusted when applied to different ecological environments and climatic conditions. Nevertheless, the methods proposed in this study have the potential to be applied to other research on hyperspectral retrieval modeling, whose mathematical description is to find an inverse function $f^{-1}(\vec{s}(\lambda))$ of the reflectance vector $\vec{s}(\lambda)$ to predict/recover a physical quantity d based on the undiscovered relationship $f(\vec{s}(\lambda)) = d$. Here, redundant information in $\vec{s}(\lambda)$ should be excluded as much as possible to achieve the best prediction, and d could represent different quantities in different fields, e.g., pollution concentration in lakes, water vapor levels in the atmosphere, and protein content in meat.

Meanwhile, this study also achieved the following innovations and improvements:

(1): In feature selection, a method based on a classical mathematical formula (Taylor expansion) was used for correlation analysis. Compared with empirical inference and machine learning models, this method provided a clear theoretical foundation in Mathematics, making it easier for researchers to understand and explain its underlying principles.
(2): This study focused on sequentially identifying core wavelengths using two-stage hybrid dimensionality reduction methods rather than combining them arbitrarily.
(3): This study enhanced and extended the latest Taylor-CC method for hyperspectral analysis in studying the chlorophyll content of Camellia oleifera.

Furthermore, there are still many challenges in selecting suitable wavelengths and models for chlorophyll content prediction when using hyperspectral analysis techniques. Based on the findings of this study, we expect to further explore hyperspectral analysis techniques for predicting the chlorophyll content of Camellia oleifera in future research. Some potential ideas are presented below:

(1): Future research directions may focus on exploring integration methods for multi-source remote sensing data in complex natural environments (e.g., forests, urban areas, and mountainous regions) to overcome different challenges faced by existing hyperspectral reflectance data, providing comprehensive spectral/temporal/spatial information.
(2): Subsequent research may focus on predicting chlorophyll content in vegetation by integrating the nutrient elements of vegetation (e.g., nitrogen, phosphorus, and potassium) and the properties of soil to explore their effects in chlorophyll modeling.
(3): Based on the findings of this study, future research may continue to explore the environmental factors that affect the growth of Camellia oleifera. Specifically, we plan to collect long-term growth data of this plant under different growing conditions, which aims to construct a comprehensive model by combining spectral measurements, environmental factors, climate factors, and growth variables of Camellia oleifera.

Author Contributions

Conceptualization, Z.S.; Formal analysis, X.J. and Z.S.; Funding acquisition, Z.S., X.J. and X.T.; Investigation, X.T. and F.K.; Methodology, Z.S. and X.J.; Project administration, Z.S.; Resources, X.T. and F.K.; Software, X.J.; Supervision, Z.S. and Y.S.; Validation, X.J.; Visualization, X.J. and F.K.; Writing—original draft, X.J. and Z.S.; Writing—review and editing, Z.S., Y.S. and X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Nanjing Normal University (Grant No. 184080H202B371), Postgraduate Research and Practice Innovation Program of Jiangsu Province (Grant No. 181200003024158), and the National Natural Science Foundation of China (Grant No. 32171783).

Data Availability Statement

The data that support the findings of this study are available upon reasonable request from the authors.

Acknowledgments

The authors are thankful to Genshen Fu, Lipeng Yan, and Weijing Song for surveying and data processing in this study. The authors would like to thank the anonymous reviewers for their valuable comments to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Performance of 14 hidden layer structures based on the Taylor-CC + NCC method (94 core wavelengths are selected in the first round of dimensionality reduction). The best-performing value of each metric is highlighted in bold. The same applies to Table A4 through Table A21.

	MAE	MBE	MSE	RAE	RMAE	R²	RMSE
[15]	3.7708	−0.2526	19.9566	0.5988	0.0507	0.6874	4.4673
[13,2]	2.6583	−0.1371	10.6127	0.4221	0.0358	0.8210	3.2577
[12,3]	3.2246	0.3208	15.2419	0.5210	0.0434	0.7424	3.9041
[11,4]	3.6778	0.9790	21.8276	0.5840	0.0495	0.6794	4.6720
[10,5]	3.6527	−1.4034	20.2538	0.5800	0.0492	0.6792	4.5004
[9,6]	2.9543	−0.3106	12.1366	0.4691	0.0398	0.8105	3.4838
[8,7]	3.1627	0.8284	16.7866	0.5022	0.0426	0.7178	4.0972
[7,6,2]	3.4661	−1.7263	17.1493	0.5504	0.0466	0.7607	4.1412
[6,5,4]	3.4521	−0.9052	17.0028	0.5482	0.0465	0.7246	4.1234
[7,5,3]	3.3024	0.7774	17.6491	0.5244	0.0444	0.7201	4.2011
[8,5,2]	3.0139	−0.1584	14.2534	0.4786	0.0406	0.7458	3.7754
[8,4,3]	3.0821	−0.4558	16.0011	0.4894	0.0415	0.7177	4.0001
[9,4,2]	3.0596	−0.1076	13.1356	0.4858	0.0412	0.7680	3.6243
[10,3,2]	3.5769	−1.0661	19.6833	0.5680	0.0481	0.6681	4.4366
[14,1]	2.8233	0.4945	12.4522	0.4483	0.0380	0.7824	3.5288

Table A2. Thresholds of different methods in Scheme II.

Method (1)		Method (2)		Method (3)
NCC	Threshold	NCC + PCC	Threshold	NCC + NCC	Threshold
70	0.3960	90 + 70	0.4258 + 0.0595	90 + 70	0.4258 + 0.2730
/	/	94 + 70	0.4300 + 0.0620	94 + 70	0.4300 + 0.2725
/	/	97 + 70	0.4400 + 0.0660	97 + 70	0.4400 + 0.2755
/	/	122 + 70	0.4710 + 0.0850	122 + 70	0.4710 + 0.2812
/	/	123 + 70	0.4720 + 0.0850	123 + 70	0.4720 + 0.2815
/	/	125 + 70	0.4730 + 0.0870	125 + 70	0.4730 + 0.2817

Table A3. Thresholds of different methods in Scheme III.

Method (1)		Method (2)		Method (3)
PCC	Threshold	PCC + PCC	Threshold	PCC + NCC	Threshold
70	0.94608	90 + 70	0.96485 + 0.0425	90 + 70	0.96485 + 0.074
/	/	94 + 70	0.9675 + 0.0472	94 + 70	0.9675 + 0.077
/	/	97 + 70	0.9687 + 0.052	97 + 70	0.9687 + 0.077
/	/	122 + 70	0.979 + 0.08308	122 + 70	0.979 + 0.0835
/	/	123 + 70	0.9791 + 0.0831	123 + 70	0.9791 + 0.0843
/	/	125 + 70	0.9793 + 0.087	125 + 70	0.9793 + 0.0841

Table A4. Performance assessments of Method (2) using ANN in Scheme I.

	90	94	97	122	123	125
MAE	2.8304	2.9575	2.8861	3.0331	3.0585	2.9941
MBE	0.4859	0.0946	0.0633	0.7990	0.6612	0.1360
MSE	12.0694	12.7972	12.8480	14.6065	14.0664	13.6245
RAE	0.4494	0.4696	0.4583	0.4816	0.4857	0.4754
RMAE	0.0381	0.0398	0.0388	0.0408	0.0412	0.0403
R²	0.7884	0.7720	0.7735	0.7500	0.7586	0.7568
RMSE	3.4741	3.5773	3.5844	3.8218	3.7505	3.6911

Table A5. Performance assessments of Method (3) using ANN in Scheme I.

	90	94	97	122	123	125
MAE	2.7411	2.6583	2.8749	2.9786	2.8934	2.9842
MBE	0.1464	−0.1371	0.5895	0.9668	0.6379	−0.1286
MSE	12.1395	10.6127	12.2958	16.5260	13.8749	13.3374
RAE	0.4353	0.4221	0.4565	0.4730	0.4594	0.4739
RMAE	0.0369	0.0358	0.0387	0.0401	0.0389	0.0402
R²	0.7837	0.8210	0.7869	0.7212	0.7668	0.7693
RMSE	3.4842	3.2577	3.5065	4.0652	3.7249	3.6520

Table A6. Performance assessments of Method (2) using RF in Scheme I.

	90	94	97	122	123	125
MAE	4.5043	4.6132	4.5684	4.6757	4.7602	4.7558
MBE	0.2622	0.4540	0.4140	0.5535	0.5488	0.3747
MSE	29.0830	30.3222	29.5027	31.2288	31.6060	31.4554
RAE	0.7388	0.7566	0.7493	0.7669	0.7807	0.7800
RMAE	0.0605	0.0620	0.0614	0.0628	0.0640	0.0639
R²	0.5393	0.4765	0.5408	0.4713	0.4761	0.4644
RMSE	5.3929	5.5066	5.4316	5.5883	5.6219	5.6085

Table A7. Performance assessments of Method (3) using RF in Scheme I.

	90	94	97	122	123	125
MAE	4.4853	4.5775	4.5452	4.6084	4.6075	4.6600
MBE	0.2644	0.2582	0.3658	0.4246	0.4151	0.4043
MSE	27.9287	29.5275	28.7763	30.2533	30.1835	30.5749
RAE	0.7356	0.7508	0.7455	0.7558	0.7557	0.7643
RMAE	0.0603	0.0615	0.0611	0.0619	0.0619	0.0626
R²	0.5491	0.5044	0.5090	0.4835	0.4680	0.4680
RMSE	5.2848	5.4339	5.3644	5.5003	5.4939	5.5295

Table A8. Performance assessments of Method (2) using Lasso in Scheme I.

	90	94	97	122	123	125
MAE	4.1370	4.1572	4.1231	4.0452	3.9607	3.8469
MBE	0.0947	0.1566	0.0852	0.3300	0.3386	0.0702
MSE	23.7400	23.7950	23.1769	23.1418	22.3023	20.9090
RAE	0.6785	0.6818	0.6762	0.6635	0.6496	0.6309
RMAE	0.0556	0.0559	0.0554	0.0554	0.0532	0.0517
R²	0.6466	0.6479	0.6617	0.6156	0.6272	0.6357
RMSE	4.8724	4.8780	4.8142	4.8106	4.7225	4.5726

Table A9. Performance assessments of Method (3) using Lasso in Scheme I.

	90	94	97	122	123	125
MAE	3.4974	3.5622	3.7377	4.0243	3.7576	3.6432
MBE	−0.2291	−0.1748	0.0598	0.2129	0.1176	0.0410
MSE	17.4854	17.3740	19.0103	21.9084	19.5814	19.1927
RAE	0.5736	0.5842	0.6130	0.6600	0.6163	0.5975
RMAE	0.0470	0.0479	0.0502	0.0541	0.0505	0.0490
R²	0.6787	0.6773	0.6403	0.5872	0.6308	0.6378
RMSE	4.1816	4.1682	4.3601	4.6806	4.4251	4.3810

Table A10. Performance assessments of Method (2) using ANN in Scheme II.

	90	94	97	122	123	125
MAE	6.5007	6.3981	5.7167	6.1842	5.7642	6.2352
MBE	−2.4224	0.3667	1.2222	0.1568	−0.9968	−1.2088
MSE	60.1897	56.2463	54.5279	54.3345	48.0286	58.3583
RAE	1.0323	1.0160	0.9078	0.9820	0.9153	0.9901
RMAE	0.0875	0.0861	0.0769	0.0832	0.0776	0.0839
R²	0.0907	0.0095	0.2051	0.0386	0.1584	0.0137
RMSE	7.7582	7.4998	7.3843	7.3712	6.9303	7.6393

Table A11. Performance assessments of Method (3) using ANN in Scheme II.

	90	94	97	122	123	125
MAE	6.2490	5.3059	4.9227	4.7905	5.3558	5.4055
MBE	0.1389	−2.1216	0.7123	−2.2126	−0.9207	−2.1351
MSE	55.7386	41.4520	33.7940	39.1563	36.9707	41.7702
RAE	0.9923	0.8425	0.7817	0.7607	0.8505	0.8583
RMAE	0.0841	0.0714	0.0663	0.0645	0.0721	0.0727
R²	0.0197	0.3761	0.4097	0.3887	0.3742	0.3408
RMSE	7.4658	6.4383	5.8133	6.2575	6.0804	6.4630

Table A12. Performance assessments of Method (2) using RF in Scheme II.

	90	94	97	122	123	125
MAE	5.8091	5.6850	5.6266	5.6106	5.7176	5.6085
MBE	0.6431	0.6438	0.6545	0.5101	0.6029	0.6378
MSE	48.9689	46.6168	45.7112	45.7996	47.5164	45.6624
RAE	0.9528	0.9324	0.9228	0.9202	0.9378	0.9199
RMAE	0.0781	0.0764	0.0756	0.0754	0.0768	0.0754
R²	0.0782	0.1285	0.1513	0.1501	0.1098	0.1545
RMSE	6.9978	6.8277	6.7610	6.7675	6.8932	6.7574

Table A13. Performance assessments of Method (3) using RF in Scheme II.

	90	94	97	122	123	125
MAE	5.7611	5.6769	5.5507	5.5807	5.6096	5.5703
MBE	0.6890	0.5685	0.6500	0.7177	0.7729	0.7399
MSE	48.2218	48.5792	44.7551	45.7633	46.3984	45.2280
RAE	0.9449	0.9311	0.9104	0.9153	0.9201	0.9136
RMAE	0.0774	0.0763	0.0746	0.0750	0.0754	0.0749
R²	0.0936	0.0854	0.1849	0.1544	0.1430	0.1703
RMSE	6.9442	6.9699	6.6899	6.7649	6.8116	6.7252

Table A14. Performance assessments of Method (2) using Lasso in Scheme II.

	90	94	97	122	123	125
MAE	5.5635	4.6024	4.5371	5.4963	5.6427	5.4473
MBE	0.5004	−0.1300	−0.1041	0.0767	0.3120	0.1021
MSE	44.0369	30.1984	28.8545	45.5965	47.5823	44.7454
RAE	0.9125	0.7549	0.7441	0.9015	0.9255	0.8934
RMAE	0.0748	0.0619	0.0610	0.0739	0.0758	0.0732
R²	0.2106	0.4647	0.5017	0.1355	0.0992	0.1508
RMSE	6.6360	5.4953	5.3716	6.7525	6.8980	6.6892

Table A15. Performance assessments of Method (3) using Lasso in Scheme II.

	90	94	97	122	123	125
MAE	5.2652	4.5062	4.4865	4.5264	4.4594	4.4920
MBE	−0.3712	−0.0407	0.0162	−0.1798	−0.2095	−0.1651
MSE	44.2580	29.6771	28.3256	27.6141	26.9206	27.5556
RAE	0.8636	0.7391	0.7358	0.7424	0.7314	0.7367
RMAE	0.0708	0.0606	0.0603	0.0608	0.0599	0.0604
R²	0.2522	0.4938	0.5363	0.5690	0.5815	0.5601
RMSE	6.6527	5.4477	5.3222	5.2549	5.1885	5.2493

Table A16. Performance assessments of Method (2) using ANN in Scheme III.

	90	94	97	122	123	125
MAE	6.2994	5.2302	6.0595	6.1559	6.1398	6.6527
MBE	−0.1112	0.4508	−2.1549	−1.1404	0.1502	0.2486
MSE	55.1701	40.0847	49.7168	55.5155	53.7158	60.4465
RAE	1.0003	0.8305	1.0845	0.9775	0.9749	1.0564
RMAE	0.0848	0.0704	0.0837	0.0828	0.0826	0.0895
R²	0.0212	0.3524	0.1883	0.0490	0.0396	0.0176
RMSE	7.4277	6.3312	7.0510	7.4509	7.3291	7.7747

Table A17. Performance assessments of Method (3) using ANN in Scheme III.

	90	94	97	122	123	125
MAE	5.0361	4.3926	5.2298	4.9333	4.4255	6.2637
MBE	−0.2201	−0.3118	−2.7404	3.0589	0.8591	1.3165
MSE	36.4218	28.5701	40.8884	39.8640	30.9662	53.8179
RAE	0.7997	0.6975	0.9360	0.7834	0.7027	0.9946
RMAE	0.0678	0.0591	0.0722	0.0664	0.0596	0.0843
R²	0.3890	0.5228	0.3711	0.4563	0.4615	0.2986
RMSE	6.0350	5.3451	6.3944	6.3138	5.5647	7.3361

Table A18. Performance assessments of Method (2) using RF in Scheme III.

	90	94	97	122	123	125
MAE	5.7536	5.7735	5.8097	5.3915	5.4676	5.7674
MBE	0.7434	0.9106	0.7177	0.5475	0.6068	0.8337
MSE	47.7782	47.9977	49.3269	42.4011	44.3290	49.2858
RAE	0.9437	0.9469	0.9529	0.8843	0.8968	0.9459
RMAE	0.0773	0.0776	0.0781	0.0725	0.0735	0.0775
R²	0.1139	0.1123	0.0736	0.2239	0.1876	0.0781
RMSE	6.9122	6.9280	7.0233	6.5116	6.6580	7.0204

Table A19. Performance assessments of Method (3) using RF in Scheme III.

	90	94	97	122	123	125
MAE	5.7275	5.7637	5.7773	5.3846	5.4090	5.6304
MBE	0.6181	0.6133	0.6858	0.5928	0.4608	0.6073
MSE	47.9028	48.4530	48.2440	41.9772	42.3528	45.1113
RAE	0.9394	0.9453	0.9476	0.8831	0.8871	0.9235
RMAE	0.0770	0.0775	0.0777	0.0724	0.0727	0.0757
R²	0.1010	0.0917	0.0963	0.2287	0.2391	0.1713
RMSE	6.9212	6.9608	6.9458	6.4790	6.5079	6.7165

Table A20. Performance assessments of Method (2) using Lasso in Scheme III.

	90	94	97	122	123	125
MAE	4.7103	4.6984	4.7834	5.1858	5.3948	4.7562
MBE	−0.0416	−0.0791	0.1825	−0.1542	−0.0218	−0.0998
MSE	31.1865	30.7809	31.8769	39.6046	43.5253	32.5621
RAE	0.7725	0.7706	0.7845	0.8505	0.8848	0.7801
RMAE	0.0633	0.0632	0.0643	0.0697	0.0725	0.0639
R²	0.4491	0.4485	0.4697	0.2527	0.1981	0.4262
RMSE	5.5845	5.5481	5.6460	6.2932	6.5974	5.7063

Table A21. Performance assessments of Method (3) using Lasso in Scheme III.

	90	94	97	122	123	125
MAE	4.6009	4.5179	4.6147	4.6292	4.7545	4.1585
MBE	−0.1027	−0.1101	−0.0105	−0.1860	−0.0864	−0.6701
MSE	31.1756	29.9233	29.9498	32.8672	34.4832	25.9009
RAE	0.7546	0.7410	0.7569	0.7593	0.7798	0.6820
RMAE	0.0618	0.0607	0.0620	0.0622	0.0639	0.0551
R²	0.4774	0.5093	0.5157	0.3820	0.3479	0.5179
RMSE	5.5835	5.4702	5.4726	5.7330	5.8722	5.0893

Table A22. The 70 core wavelengths identified by the Taylor-CC + NCC method when 94 core wavelengths are selected in the first round (unit: nm).

450	501	543	594	650	684	695	702	708	714
720	726	733	741	753	766	832	869	904	966
988	1000	1020	1050	1078	1105	1128	1150	1177	1212
1251	1287	1316	1338	1355	1365	1371	1377	1384	1441
1596	1634	1671	1737	1765	1787	1803	1816	1836	1872
1900	1907	1930	1946	2030	2103	2131	2156	2180	2261
2279	2294	2308	2321	2333	2345	2355	2365	2374	2390

Appendix B

The specific definitions of $H^{r}(X, Y)$ and $H^{r}(X)$ are shown in Equations (A1) and (A2), respectively,

H^{r} (X, Y) = - \sum_{i = 1}^{b} \sum_{j = 1}^{b} \frac{n_{i j}}{N} \log_{b} \frac{n_{i j}}{N},

(A1)

H^{r} (X) = - \sum_{i = 1}^{b} \frac{n_{i}}{N} \log_{b} \frac{n_{i}}{N},

(A2)

where $N$ is the size of the dataset, $n_{i j}$ is the number of sample pairs distributed in the $ij$-th grid, and $n_{i}$ is the number of samples distributed in the $i$-th grid.
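As an illustration only, the two entropies can be sketched in plain Python, taking both with a leading minus sign so that they are non-negative. The equal-count rank binning, the tie handling, the function names, and the final combination NCC = H^r(X) + H^r(Y) - H^r(X, Y) are assumptions made for this sketch; they are not taken from this appendix and may differ from the authors' implementation.

```python
import math

def rank_bins(values, b):
    """Assign each sample to one of b equal-count rank bins (0..b-1).
    Assumption: ties are broken by original order (stable sort)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    bins = [0] * n
    for rank, k in enumerate(order):
        bins[k] = rank * b // n
    return bins

def marginal_entropy(bins, b):
    """H^r(X) as in Equation (A2), with logarithm base b."""
    n = len(bins)
    counts = [0] * b
    for i in bins:
        counts[i] += 1
    # Empty bins contribute nothing (0 * log 0 is treated as 0).
    return -sum(c / n * math.log(c / n, b) for c in counts if c > 0)

def joint_entropy(bins_x, bins_y, b):
    """H^r(X, Y) as in Equation (A1), over the b x b rank grid."""
    n = len(bins_x)
    counts = {}
    for cell in zip(bins_x, bins_y):
        counts[cell] = counts.get(cell, 0) + 1
    return -sum(c / n * math.log(c / n, b) for c in counts.values())

def ncc(x, y, b):
    """Assumed combination: H^r(X) + H^r(Y) - H^r(X, Y)."""
    bx, by = rank_bins(x, b), rank_bins(y, b)
    return (marginal_entropy(bx, b) + marginal_entropy(by, b)
            - joint_entropy(bx, by, b))
```

Under these assumptions, a monotone relationship such as y = x² on non-negative x fills only the diagonal cells of the grid and gives a value close to 1, while samples spread evenly over all b × b cells give a value close to 0.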

The definition of the Pearson correlation coefficient for two random variables $x$ and $y$ is shown in Equation (A3),

r_{x y} = \frac{\sum_{i = 1}^{N} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{N} {(y_{i} - \bar{y})}^{2}}},

(A3)

where $x_{i}$ and $y_{i}$ represent the $i$-th observed values of the two random variables, respectively, and $\bar{x} = \frac{1}{N} \sum_{i = 1}^{N} x_{i}$ and $\bar{y} = \frac{1}{N} \sum_{i = 1}^{N} y_{i}$ denote their sample means.
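A minimal sketch of Equation (A3) in plain Python may make the computation concrete. The function name `pearson` and the example inputs are illustrative assumptions, not part of the original study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)
```

For example, `pearson([1, 2, 3, 4], [2, 4, 6, 8])` is close to 1 because the second sequence is an exact positive linear function of the first. The function is undefined (division by zero) when either sequence is constant.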

Appendix C

Figure A1. Lasso regression and ridge regression examples in two-dimensional space. The Lasso regression diagram is on the left, and the ridge regression diagram is on the right. The elliptical contours represent values of the objective function, and the black region indicates the constraint region.

Figure A2. Example of an ANN. The adaptive network includes an input layer, a hidden layer, and an output layer, which are connected by weights.

Figure A3. Flow chart of random forest, which includes bootstrap sampling, decision trees whose features are split at each node, and an integrator that aggregates the tree outputs.

References

Sekar, N.; Ramasamy, R.P. Photosynthetic energy conversion: Recent advances and future perspective. Electrochem. Soc. Interface 2015, 24, 67.
Kume, A.; Akitsu, T.; Nasahara, K.N. Why is chlorophyll b only used in light-harvesting systems? J. Plant Res. 2018, 131, 961–972.
Croft, H.; Chen, J.M.; Luo, X.; Bartlett, P.; Chen, B.; Staebler, R.M. Leaf chlorophyll content as a proxy for leaf photosynthetic capacity. Glob. Chang. Biol. 2017, 23, 3513–3524.
Wang, S.; Li, Y.; Ju, W.; Chen, B.; Chen, J.; Croft, H.; Mickler, R.A.; Yang, F. Estimation of leaf photosynthetic capacity from leaf chlorophyll content and leaf age in a subtropical evergreen coniferous plantation. J. Geophys. Res. Biogeosci. 2020, 125, e2019JG005020.
Henson, S.A.; Sarmiento, J.L.; Dunne, J.P.; Bopp, L.; Lima, I.; Doney, S.C.; John, J.; Beaulieu, C. Detection of anthropogenic climate change in satellite records of ocean chlorophyll and productivity. Biogeosciences 2010, 7, 621–640.
Hodges, B.A.; Rudnick, D.L. Horizontal variability in chlorophyll fluorescence and potential temperature. Deep Sea Res. Part I Oceanogr. Res. Pap. 2006, 53, 1460–1482.
Pérez-Bueno, M.L.; Pineda, M.; Barón, M. Phenotyping plant responses to biotic stress by chlorophyll fluorescence imaging. Front. Plant Sci. 2019, 10, 1135.
Ritchie, R.J.; Sma-Air, S. Lability of chlorophylls in solvent. J. Appl. Phycol. 2022, 34, 1577–1586.
Milenković, S.M.; Zvezdanović, J.B.; Anđelković, T.D.; Marković, D.Z. The identification of chlorophyll and its derivatives in the pigment mixtures: HPLC-chromatography, visible and mass spectroscopy studies. Adv. Technol. 2012, 1, 16–24.
Yue, X.; Quan, D.; Hong, T.; Wang, J.; Qu, X.; Gan, H. Non-destructive hyperspectral measurement model of chlorophyll content for citrus leaves. Trans. Chin. Soc. Agric. Eng. 2015, 31, 294–302.
Li, D.; Hu, Q.; Ruan, S.; Liu, J.; Zhang, J.; Hu, C.; Liu, Y.; Dian, Y.; Zhou, J. Utilizing Hyperspectral Reflectance and Machine Learning Algorithms for Non-Destructive Estimation of Chlorophyll Content in Citrus Leaves. Remote Sens. 2023, 15, 4934.
Schmid, V.H.; Thomé, P.; Rühle, W.; Paulsen, H.; Kühlbrandt, W.; Rogl, H. Chlorophyll b is involved in long-wavelength spectral properties of light-harvesting complexes LHC I and LHC II. FEBS Lett. 2001, 499, 27–31.
Falcioni, R.; Antunes, W.C.; Oliveira, R.B.D.; Chicati, M.L.; Demattê, J.A.M.; Nanni, M.R. Assessment of Combined Reflectance, Transmittance, and Absorbance Hyperspectral Sensors for Prediction of Chlorophyll a Fluorescence Parameters. Remote Sens. 2023, 15, 5067.
Martel, E.; Lazcano, R.; López, J.; Madroñal, D.; Salvador, R.; López, S.; Juarez, E.; Guerra, R.; Sanz, C.; Sarmiento, R. Implementation of the principal component analysis onto high-performance computer facilities for hyperspectral dimensionality reduction: Results and comparisons. Remote Sens. 2018, 10, 864.
Li, W.; Feng, F.; Li, H.; Du, Q. Discriminant analysis-based dimension reduction for hyperspectral image classification: A survey of the most recent advances and an experimental comparison of different techniques. IEEE Geosci. Remote Sens. Mag. 2018, 6, 15–34.
Yuan, H.; Yang, G.; Li, C.; Wang, Y.; Liu, J.; Yu, H.; Feng, H.; Xu, B.; Zhao, X.; Yang, X. Retrieving soybean leaf area index from unmanned aerial vehicle hyperspectral remote sensing: Analysis of RF, ANN, and SVM regression models. Remote Sens. 2017, 9, 309.
Zhou, W.; Yang, H.; Xie, L.; Li, H.; Huang, L.; Zhao, Y.; Yue, T. Hyperspectral inversion of soil heavy metals in Three-River Source Region based on random forest model. Catena 2021, 202, 105222.
Pal, M.; Charan, T.B.; Poriya, A. K-nearest neighbour-based feature selection using hyperspectral data. Remote Sens. Lett. 2021, 12, 132–141.
Guo, A.J.; Zhu, F. Spectral-spatial feature extraction and classification by ANN supervised with center loss in hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1755–1767.
Kanthi, M.; Sarma, T.H.; Bindu, C.S. A 3D-deep CNN based feature extraction and hyperspectral image classification. In Proceedings of the 2020 IEEE India Geoscience and Remote Sensing Symposium (InGARSS), Ahmedabad, India, 1–4 December 2020.
Hu, W.S.; Li, H.C.; Pan, L.; Li, W.; Tao, R.; Du, Q. Spatial–spectral feature extraction via deep ConvLSTM neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4237–4250.
Duanyuan, H.; Zhou, T.; He, Z.; Peng, Y.; Lei, J.; Dong, J.; Wu, X.; Wang, J.; Yan, W. Effects of Straw Mulching on Soil Properties and Enzyme Activities of Camellia oleifera–Cassia Intercropping Agroforestry Systems. Plants 2023, 12, 3046.
Zhang, F.; Zhu, F.; Chen, B.; Su, E.; Chen, Y.; Cao, F. Composition, bioactive substances, extraction technologies and the influences on characteristics of Camellia oleifera oil: A review. Food Res. Int. 2022, 156, 111159.
Sun, Z.; Jiang, X.; Tang, X.; Yan, L.; Kuang, F.; Li, X.; Dou, M.; Wang, B.; Gao, X. Identifying core wavelengths of oil tree’s hyperspectral data by Taylor expansion. Remote Sens. 2023, 15, 3137.
Hasan, U.; Jia, K.; Wang, L.; Wang, C.; Shen, Z.; Yu, W.; Sun, Y.; Jiang, H.; Zhang, Z.; Guo, J.; et al. Retrieval of leaf chlorophyll contents (LCCs) in litchi based on fractional order derivatives and VCPA-GA-ML algorithms. Plants 2023, 12, 501.
Zhang, H.; Ge, Y.; Xie, X.; Atefi, A.; Wijewardane, N.K.; Thapa, S. High throughput analysis of leaf chlorophyll content in sorghum using RGB, hyperspectral, and fluorescence imaging and sensor fusion. Plant Methods 2022, 18, 60.
Wang, Q.; Shen, Y.; Zhang, J.Q. A nonlinear correlation measure for multivariable data set. Phys. D Nonlinear Phenom. 2005, 200, 287–295.
Wang, H.; Yao, X. Objective reduction based on nonlinear correlation information entropy. Soft Comput. 2016, 20, 2393–2407.
Zhou, H.; Deng, Z.; Xia, Y.; Fu, M. A new sampling method in particle filter based on Pearson correlation coefficient. Neurocomputing 2016, 216, 208–215.
Sedgwick, P. Pearson’s correlation coefficient. BMJ 2012, 345, e4483.
Johnstone, I.M.; Silverman, B.W. Wavelet threshold estimators for data with correlated noise. J. R. Stat. Soc. Ser. B Stat. Methodol. 1997, 59, 319–351.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
Ghosh, P.; Azam, S.; Jonkman, M.; Karim, A.; Shamrat, F.J.M.; Ignatious, E.; Shultana, S.; Reddy Beeravolu, A.; De Boer, F. Efficient prediction of cardiovascular disease using machine learning algorithms with relief and Lasso feature selection techniques. IEEE Access 2021, 9, 19304–19326.
Agatonovic-Kustrin, S.; Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 2000, 22, 717–727.
Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 1994, 41, 327.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Rodriguez-Galiano, V.; Sanchez-Castillo, M.; Chica-Olmo, M.; Chica-Rivas, M. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geol. Rev. 2015, 71, 804–818.
Li, D.; Hu, Q.; Zhang, J.; Dian, Y.; Hu, C.; Zhou, J. Leaf Nitrogen and Phosphorus Variation and Estimation of Citrus Tree under Two Labor-Saving Cultivation Modes Using Hyperspectral Data. Remote Sens. 2024, 16, 3261.
Lu, B.; Dao, P.D.; Liu, J.; He, Y.; Shang, J. Recent advances of hyperspectral imaging technology and applications in agriculture. Remote Sens. 2020, 12, 2659.
Boldrini, B.; Kessler, W.; Rebner, K.; Kessler, R.W. Hyperspectral imaging: A review of best practice, performance and pitfalls for in-line and on-line applications. J. Near Infrared Spectrosc. 2012, 20, 483–508.
Aziz, R.; Verma, C.K.; Srivastava, N. Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction. Ann. Data Sci. 2018, 5, 615–635.
Bhadra, S.; Sagan, V.; Maimaitijiang, M.; Maimaitiyiming, M.; Newcomb, M.; Shakoor, N.; Mockler, T.C. Quantifying leaf chlorophyll concentration of sorghum from hyperspectral data using derivative calculus and machine learning. Remote Sens. 2020, 12, 2082.
Singh, K.D.; Ramakrishnan, D.; Mansinha, L. Relevance of transformation techniques in rapid endmember identification and spectral unmixing: A hypespectral remote sensing perspective. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012.
Sun, Z.; Qian, W.; Huang, Q.; Lv, H.; Yu, D.; Ou, Q.; Lu, H.; Tang, X. Use remote sensing and machine learning to study the changes of broad-leaved forest biomass and their climate driving forces in nature reserves of northern subtropics. Remote Sens. 2022, 14, 1066.

Figure 1. Overview of the study area and its geographic location in China. Boxes represent the locations of square plots, and points represent the location of the selected tree in each plot. The entire study area is divided into four parts (A, B, C, and D).

Figure 2. Schematic diagram of the instrument setup. The red dashed circle represents the circular observation region that is perpendicular to the top of the canopy.

Figure 3. (a) 240 hyperspectral measurement samples after removing invalid reflectance (Note: color is only for distinguishing different lines and has no specific meaning), and (b) the distribution of their SPAD measurements.

Figure 4. The correlation matrix generated by the Taylor-CC method.

Figure 5. The correlation matrix generated by the NCC method.

Figure 6. The correlation matrix generated by the PCC method.

Figure 7. A total of 70 core wavelengths were selected in Scheme I. The core wavelengths selected using Method (1) are marked in red, those using Method (2) are marked in black, and those using Method (3) are marked in blue. The specific numbers of wavelengths obtained after the first round of dimensionality reduction are provided on the right.

Figure 8. A total of 70 core wavelengths were selected in Scheme II. The core wavelengths selected using Method (1) are marked in red, those using Method (2) are marked in black, and those using Method (3) are marked in blue. The specific numbers of wavelengths obtained after the first round of dimensionality reduction are provided on the right.

Figure 9. A total of 70 core wavelengths were selected in Scheme III. The core wavelengths selected using Method (1) are marked in red, those using Method (2) are marked in black, and those using Method (3) are marked in blue. The specific numbers of wavelengths obtained after the first round of dimensionality reduction are provided on the right.

Figure 10. Uncertainty analysis diagram showing changes in seven metrics (RMSE, MAE, R², MSE, MBE, RMAE, and RAE). The x-axis represents 71 groups of experiments, where #1 indicates that no core wavelength has been eliminated, while #2 to #71 correspond to eliminating each of the 70 core wavelengths in turn. The 70 core wavelengths are listed in Table A22.

Figure 11. Importance analysis diagram for wavelengths from 350 nm to 2500 nm. A higher importance value indicates a greater influence of the wavelength on the model’s prediction accuracy.

Table 1. Dimensionality reduction methods used in each scheme.

Scheme	Method	First Round	Second Round
Scheme I	Method (1)	Taylor-CC	/
	Method (2)	Taylor-CC	PCC
	Method (3)	Taylor-CC	NCC
Scheme II	Method (1)	NCC	/
	Method (2)	NCC	PCC
	Method (3)	NCC	NCC
Scheme III	Method (1)	PCC	/
	Method (2)	PCC	PCC
	Method (3)	PCC	NCC

Table 2. Thresholds of different methods in Scheme I.

Method (1)		Method (2)		Method (3)
Taylor-CC	Threshold	Taylor-CC + PCC	Threshold	Taylor-CC + NCC	Threshold
70	0.9260	90 + 70	0.9540 + 0.0580	90 + 70	0.9540 + 0.2775
/	/	94 + 70	0.9570 + 0.0597	94 + 70	0.9570 + 0.2779
/	/	97 + 70	0.9595 + 0.0604	97 + 70	0.9595 + 0.0770
/	/	122 + 70	0.9730 + 0.0907	122 + 70	0.9730 + 0.0829
/	/	123 + 70	0.9731 + 0.0906	123 + 70	0.9731 + 0.0826
/	/	125 + 70	0.9745 + 0.08647	125 + 70	0.9745 + 0.08425

Table 3. Performance assessments of Method (1) using Lasso, ANN, and RF in Scheme I. The best-performing metric is highlighted in bold. The same applies to Tables 4–12.

	Lasso	ANN	RF
MAE	4.3030	3.0276	4.7546
MBE	0.1930	0.9531	0.1612
MSE	26.0906	14.6829	32.2617
RAE	0.7057	0.4808	0.7798
RMAE	0.0578	0.0407	0.0639
R²	0.5817	0.7536	0.4555
RMSE	5.1079	3.8318	5.6799
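As a quick consistency check on the reported metrics, RMSE should equal the square root of MSE up to rounding. The sketch below uses the MSE and RMSE values copied from Table 3; the dictionary and function names are illustrative assumptions.

```python
import math

# (MSE, RMSE) pairs copied from Table 3 (Scheme I, Method (1)).
table3 = {
    "Lasso": (26.0906, 5.1079),
    "ANN": (14.6829, 3.8318),
    "RF": (32.2617, 5.6799),
}

def rmse_from_mse(mse):
    """RMSE is the square root of MSE."""
    return math.sqrt(mse)

def max_rmse_gap(table):
    """Largest absolute difference between sqrt(MSE) and the listed RMSE."""
    return max(abs(rmse_from_mse(mse) - rmse) for mse, rmse in table.values())
```

For the three models in Table 3, the largest gap stays below 1e-4, which is consistent with values rounded to four decimal places.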

Table 4. Performance assessments of Method (2) using Lasso, ANN, and RF in Scheme I.

	Lasso	ANN	RF
MAE	4.0450	2.9600	4.6463
MBE	0.1792	0.3733	1.4345
MSE	22.8442	13.3353	30.5330
RAE	0.6634	0.4700	0.7621
RMAE	0.0545	0.0398	0.0624
R²	0.6391	0.7666	0.4947
RMSE	4.7784	3.6499	5.5250

Table 5. Performance assessments of Method (3) using Lasso, ANN, and RF in Scheme I.

	Lasso	ANN	RF
MAE	3.7037	2.8551	4.5807
MBE	0.0046	0.3458	0.3554
MSE	19.0920	13.1311	29.5407
RAE	0.6074	0.4534	0.7513
RMAE	0.0498	0.0384	0.0616
R²	0.6420	0.7748	0.4970
RMSE	4.3661	3.6151	5.4345

Table 6. Performance assessments of Method (1) using Lasso, ANN, and RF in Scheme II.

	Lasso	ANN	RF
MAE	5.7082	7.1377	6.0075
MBE	0.3068	2.0395	1.0005
MSE	47.7933	73.1636	52.5231
RAE	0.9362	1.1334	0.9853
RMAE	0.0767	0.0961	0.0807
R²	0.1018	0.0048	0.0311
RMSE	6.9133	8.5536	7.2473

Table 7. Performance assessments of Method (2) using Lasso, ANN, and RF in Scheme II.

	Lasso	ANN	RF
MAE	5.2149	6.1332	5.6762
MBE	0.1262	−0.4804	0.6154
MSE	40.1690	55.2809	46.7126
RAE	0.8553	0.9739	0.9310
RMAE	0.0701	0.0825	0.0763
R²	0.2604	0.0860	0.1287
RMSE	6.3071	7.4305	6.8341

Table 8. Performance assessments of Method (3) using Lasso, ANN, and RF in Scheme II.

	Lasso	ANN	RF
MAE	4.6226	5.3382	5.6249
MBE	−0.1584	−1.0898	0.6897
MSE	30.7252	41.4803	46.4910
RAE	0.7582	0.8477	0.9226
RMAE	0.0621	0.0719	0.0756
R²	0.4988	0.3182	0.1386
RMSE	5.5192	6.4197	6.8176

Table 9. Performance assessments of Method (1) using Lasso, ANN, and RF in Scheme III.

	Lasso	ANN	RF
MAE	6.0460	6.9590	5.9653
MBE	0.3413	2.5127	0.8276
MSE	55.9365	67.0371	51.9642
RAE	0.9916	1.1050	0.9784
RMAE	0.0813	0.0937	0.0802
R²	0.0182	0.0036	0.0277
RMSE	7.4791	8.1876	7.2086

Table 10. Performance assessments of Method (2) using Lasso, ANN, and RF in Scheme III.

	Lasso	ANN	RF
MAE	4.9215	6.0896	5.6606
MBE	−0.0357	−0.4262	0.7266
MSE	34.9227	52.4416	46.8531
RAE	0.8072	0.9874	0.9284
RMAE	0.0662	0.0823	0.0761
R²	0.3741	0.1114	0.1316
RMSE	5.8959	7.2274	6.8423

Table 11. Performance assessments of Method (3) using Lasso, ANN, and RF in Scheme III.

	Lasso	ANN	RF
MAE	4.5460	5.0468	5.6154
MBE	−0.1943	0.3270	0.5964
MSE	30.7167	38.4214	45.6735
RAE	0.7456	0.8190	0.9210
RMAE	0.0610	0.0682	0.0755
R²	0.4584	0.4166	0.1547
RMSE	5.5368	6.1649	6.7552

Table 12. Best performance of the different dimensionality reduction methods with 70 core wavelengths across the three machine learning models.

		MAE	MBE	MSE	RAE	RMAE	R²	RMSE
Scheme I	Method (1)	3.0276	0.9531	14.6829	0.4808	0.0407	0.7536	3.8318
	Method (2)	2.8304	0.4859	12.0694	0.4494	0.0381	0.7884	3.4741
	Method (3)	2.6583	−0.1371	10.6127	0.4221	0.0358	0.8210	3.2577
Scheme II	Method (1)	5.7082	0.3068	47.7933	0.9362	0.0767	0.1018	6.9133
	Method (2)	4.5371	−0.1041	28.8545	0.7441	0.0610	0.5017	5.3716
	Method (3)	4.4594	−0.2095	26.9206	0.7314	0.0599	0.5815	5.1885
Scheme III	Method (1)	5.9653	0.8276	51.9642	0.9784	0.0802	0.0277	7.2086
	Method (2)	4.6984	−0.0791	30.7809	0.7706	0.0632	0.4485	5.5481
	Method (3)	4.1585	−0.6701	25.9009	0.6820	0.0551	0.5179	5.0893

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
