1. Introduction
Photosynthesis is the biochemical process through which most vegetation utilizes light energy to convert carbon dioxide and water into organic matter (i.e., glucose), while releasing oxygen as a byproduct [
1]. In photosynthesis, chlorophyll serves as a key pigment responsible for absorbing light energy [
2]. By measuring the chlorophyll content in the leaves of vegetation, it is possible to estimate the photosynthetic capacity and rate of this vegetation [
3,
4]. However, the chlorophyll content of vegetation may be influenced by environmental factors, such as climate change, light variations, and temperature fluctuations, making it difficult to assess chlorophyll content accurately [
5,
6]. In practice, decreased chlorophyll levels are often a sign of pests, diseases, nutritional deficiencies, or environmental stress [
7]. Therefore, by measuring and monitoring changes in the chlorophyll content of vegetation, early interventions can be implemented to help maintain its health. Conventional methods for measuring chlorophyll content primarily involve acetone–ethanol extraction methods [
8], spectrophotometry, and high-performance liquid chromatography [
9]. These destructive methods, based on laboratory procedures, are time-consuming and costly. By overcoming the limitations of these methods, hyperspectral analysis technology can rapidly and non-destructively acquire spectral data from leaves/canopies of large areas of vegetation, facilitating real-time analysis [
10,
11]. Most chlorophyll a and chlorophyll b molecules exhibit absorption bands between 645 nm and 670 nm, while some special types of chlorophyll absorb at wavelengths longer than 700 nm [12]. Hyperspectral imaging technology can capture these absorption and reflection characteristics of chlorophyll across various spectral wavelengths, generating hyperspectral reflectance data [13].
At the same time, hyperspectral analysis technology not only provides rich spectral information but also involves a large number of wavelengths. Feature extraction can help reduce dimensionality and redundant information in hyperspectral data, thus compressing datasets and improving the efficiency of data processing. However, when processing large-scale hyperspectral images and reflectance data, traditional dimensionality reduction methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), may encounter challenges associated with high computational complexity [14,15]. This leads to extensive consumption of computational time and resources, hindering the efficiency of the analysis process. With the advancement of computational science, machine learning methods have been widely used for hyperspectral feature extraction, including support vector machines [16], random forests [17], and k-nearest neighbors [18]. These models are prone to overfitting during training, particularly when the number of training samples is limited and the feature dimension is high. The emergence of deep learning has advanced research in hyperspectral feature extraction. Artificial neural networks (ANNs) [19], convolutional neural networks (CNNs) [20], and recurrent neural networks (RNNs) [21] have been applied to extracting features from hyperspectral data. However, deep learning models are often regarded as ‘black box’ models, making it difficult to intuitively understand the importance of the features and wavelengths upon which the models are based. Furthermore, when identifying the core wavelengths of hyperspectral reflectance data, relying on only one method/model could result in the loss of some advantages that other methods/models could preserve. Moreover, arbitrarily combining multiple methods/models to reduce the dimensionality of hyperspectral reflectance data may result in conflicts regarding the selection of dimensions and parameters, ultimately hindering optimal dimensionality reduction.
Like other green leafy vegetation, Camellia oleifera contains a large amount of chlorophyll in its leaves, allowing it to carry out photosynthesis. In practice, Camellia oleifera aids in preventing soil erosion and the reversion of farmland to natural land [22], thus holding considerable economic, ecological, and social value [23]. In the cultivation of Camellia oleifera, analyzing its hyperspectral reflectance data can help understand its physiological parameters and response characteristics, thereby advancing the growth of the Camellia oleifera industry. In a recent study on the hyperspectral data of Camellia oleifera [24], based on the spectral characteristics of its leaves at the same growth period, Sun et al. developed a new algorithm using Taylor expansion to identify core wavelengths. However, their study focused solely on the correlation among wavelengths, ignoring the relationship between wavelengths and target variables.
On the other hand, predicting chlorophyll content in leaves remains a challenge. In recent years, many scholars have conducted in-depth research on estimating/retrieving the chlorophyll content of vegetation using hyperspectral analysis. For example, using a hybrid strategy of variable combination population analysis (VCPA) and genetic algorithm (GA) [25], Hasan et al. studied the performance of five machine learning algorithms to estimate the chlorophyll content of litchi based on the hyperspectral fractional-derivative reflections from 298 litchi leaves. By extracting color features, spectral indices, and chlorophyll fluorescence intensities from various types of images [26], Zhang et al. developed a multivariate linear regression model and a partial least squares regression model to predict chlorophyll content. However, the methods (VCPA and GA) lack mathematical representations of their regression function, which makes it difficult to ensure the stability and reliability of the prediction models in practical applications, particularly when addressing complex problems. Meanwhile, empirical or semi-empirical models based on statistical methods (e.g., multivariate regression and linear regression) often fail to address the universality problem related to the physical mechanism. More importantly, very few studies have focused on the chlorophyll content of Camellia oleifera using hyperspectral analysis.
Given the problems mentioned above, the aim of this study is to compare the performance of various dimensionality reduction methods in combination with machine learning models to achieve the most accurate prediction of chlorophyll content in Camellia oleifera. By comprehensively considering the correlations between wavelengths and target variables (i.e., chlorophyll content), three dimensionality reduction methods (Taylor-CC, NCC, and PCC) were used in the first round of dimensionality reduction. Then, various thresholds and dimensionality reduction methods (with/without further dimensionality reduction) were applied in the second round of dimensionality reduction to identify different sets of core wavelengths. Based on these proposed methods, a series of chlorophyll content estimation models for Camellia oleifera were developed and compared using three machine learning models. The specific objectives of this study are as follows:
- (1) to screen core wavelengths through the utilization of various single/two-stage dimensionality reduction methods;
- (2) to identify the optimal dimensionality reduction method and the best model to serve as a final chlorophyll content retrieval model;
- (3) to act as a supplement to monitoring the growth of Camellia oleifera by hyperspectral analysis technology.
3. Experimental Design and Data Processing
3.1. Data Preprocessing and Core Wavelengths Selection Schemes
In practice, most initial datasets may contain missing values, duplicate values, and other discrepancies that need to be addressed during the data preprocessing stage. Hence, data preprocessing is necessary before utilizing the data in models to ensure their quality and reliability in the subsequent analysis. There is no unique method for data preprocessing, and it varies based on specific tasks and dataset properties. In this study, to eliminate scale differences among different features, all input/output data (i.e., reflectance and SPAD) were scaled linearly to the range of [−1, 1]. Moreover, due to the impact of water vapor [
34], some measured reflectance values at certain wavelengths may be considered outliers. Therefore, it is also important to deal with the outliers in the reflectance data.
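As an illustration of the linear scaling described above, the following is a minimal sketch under stated assumptions: it assumes that the scaling is applied per feature (per wavelength column), which the text implies but does not state explicitly, and that a constant column is mapped to 0. The function name `scale_to_unit_range` and the sample values are hypothetical.

```python
import numpy as np

def scale_to_unit_range(values):
    """Linearly map each column onto [-1, 1] (column min -> -1, column max -> 1)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    span = col_max - col_min
    # Constant columns have zero span; avoid dividing by zero (assumed handling).
    safe_span = np.where(span == 0, 1.0, span)
    scaled = 2.0 * (values - col_min) / safe_span - 1.0
    # Map constant columns to 0 rather than to an arbitrary end of the range.
    return np.where(span == 0, 0.0, scaled)

# Hypothetical reflectance values: 3 samples x 2 wavelengths.
reflectance = np.array([[0.10, 0.40], [0.20, 0.50], [0.30, 0.80]])
scaled_reflectance = scale_to_unit_range(reflectance)
```

The same function can be applied to the SPAD column by passing it as a single-column array.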
More importantly, to reduce the training complexity of prediction models and optimize computational resources, it is necessary to reduce the dimensionality of input data by selecting core wavelengths from the entire spectrum of reflectance data. In addition, the order in which different dimensionality reduction methods are combined needs to be investigated to evaluate their performance. The dimensionality reduction methods were divided into three schemes according to the method (Taylor-CC, NCC, or PCC) selected for the first round of dimensionality reduction. Then, in each of these three schemes, a decision was made on whether to proceed with the second round of dimensionality reduction. On this basis, both single dimensionality reduction methods and two-stage hybrid dimensionality reduction methods were set up to identify which method achieved the best performance. The specific schemes, listed in
Table 1, were developed to compare the performance of different core wavelength selection schemes.
The dimensionality reduction methods utilized in the different schemes are explained as follows.
Scheme I: Method (1): Taylor-CC: The dimensionality of the input data is reduced using the Taylor-CC method based on a threshold and the information among different wavelengths. Method (2): Taylor-CC + PCC: Firstly, the dimensionality of the input data is reduced based on the information among different wavelengths and a threshold using the Taylor-CC method. Subsequently, the reduced input data resulting from the first step are integrated with SPAD measurement information and a threshold for further dimensionality reduction through the PCC method. Method (3): Taylor-CC + NCC: Firstly, the dimensionality of the input data is reduced based on the information among different wavelengths and a threshold using the Taylor-CC method. Subsequently, the reduced input data resulting from the first step are integrated with SPAD measurement information and a threshold for further dimensionality reduction through the NCC method.
Scheme II: Method (1): NCC; Method (2): NCC + PCC; Method (3): NCC + NCC.
Scheme III: Method (1): PCC; Method (2): PCC + PCC; Method (3): PCC + NCC.
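The two-stage idea behind these schemes can be sketched as follows. This is only an illustrative sketch in the spirit of "PCC + PCC": the exact selection rules of the Taylor-CC, NCC, and PCC methods are not reproduced here, so the greedy pruning rule in `stage_one_prune`, the filter in `stage_two_filter`, both function names, the thresholds, and the synthetic data are all assumptions.

```python
import numpy as np

def stage_one_prune(X, threshold):
    """First round (assumed rule): scan wavelengths in order and keep one only if
    its |PCC| with every already-kept wavelength is below `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

def stage_two_filter(X, y, candidates, threshold):
    """Second round (assumed rule): keep candidates whose |PCC| with the
    target (e.g., SPAD) is at least `threshold`."""
    selected = []
    for j in candidates:
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            selected.append(j)
    return selected

# Synthetic stand-in data: 240 samples, 4 "wavelengths".
rng = np.random.default_rng(0)
base = rng.normal(size=(240, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.01 * rng.normal(size=240),  # nearly duplicates column 0
    base[:, 1],
    base[:, 2],
])
y = 2.0 * base[:, 0] + 0.1 * rng.normal(size=240)  # target driven by column 0

kept = stage_one_prune(X, threshold=0.95)          # removes the redundant column
selected = stage_two_filter(X, y, kept, threshold=0.5)
```

The first round uses only the relationships among wavelengths, while the second round brings in the target variable, which is the distinction the schemes are designed to test.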
Based on the results of the three schemes, it is also important to identify the most effective machine learning model for predicting chlorophyll content. To achieve this goal, using the different sets of core wavelengths identified by the various methods in the different schemes, the Lasso and RF models were trained and tested on 70% and 30% of the samples, respectively, while the ANN model was trained, validated, and tested on 70%, 15%, and 15% of the samples, respectively. Given that the data used in this study are relatively abundant and outliers were removed in advance, a random division into training, validation, and testing samples was considered sufficient for model training. Therefore, no cross-validation experiments were conducted to further verify the experimental results.
It should be noted that in the three schemes and different methods, different numbers of core wavelengths may be obtained by adjusting different thresholds. To ensure the validity of comparisons, the final number of core wavelengths used in experiments must be consistent in different dimensionality reduction schemes. Moreover, note that, for comparing results among different schemes, training samples in the three models should also be consistent, and testing samples in RF and Lasso must be consistent as well. Thus, the validation samples and testing samples in the ANN were the same as the testing samples in RF and Lasso.
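The sample division described above can be sketched as follows, assuming a simple random permutation with a fixed seed. The function name `split_indices` and the seed are hypothetical; the key point is that the 30% hold-out used as the test set for Lasso and RF is exactly the union of the ANN validation and test sets.

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Return (train, holdout, ann_val, ann_test) index arrays for a 70/15/15 split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(0.70 * n_samples))
    n_val = (n_samples - n_train) // 2
    train = order[:n_train]      # 70%: shared by Lasso, RF, and ANN
    holdout = order[n_train:]    # 30%: test set for Lasso and RF
    ann_val = holdout[:n_val]    # 15%: ANN validation set
    ann_test = holdout[n_val:]   # 15%: ANN test set
    return train, holdout, ann_val, ann_test

train_idx, holdout_idx, ann_val_idx, ann_test_idx = split_indices(240)
```

With 240 samples this gives 168 training samples and a 72-sample hold-out that is split into 36 validation and 36 test samples for the ANN.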
In addition, it is important to perform uncertainty analysis to evaluate the potential impact of changes in core wavelengths on the prediction of chlorophyll content. Based on the best-performing methods and models identified from the previous experiments, each core wavelength was removed in turn to assess its impact on the prediction of chlorophyll content. Furthermore, a random forest importance ranking [
38] was employed to assess the significance of each core wavelength by determining its contribution to the improvement of the model’s accuracy.
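The leave-one-wavelength-out procedure can be sketched as follows. To keep the example self-contained, an ordinary least-squares model is swapped in for the ANN, so this is a stand-in and not the model used in the study; the function names, the synthetic data, and the use of RMSE as the only metric are assumptions.

```python
import numpy as np

def fit_predict_ols(X_train, y_train, X_test):
    """Stand-in model (not the ANN): ordinary least squares with an intercept."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    return A_test @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def leave_one_feature_out(X_train, y_train, X_test, y_test):
    """Return the baseline RMSE and the RMSE after removing each column in turn."""
    baseline = rmse(y_test, fit_predict_ols(X_train, y_train, X_test))
    ablated = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        pred = fit_predict_ols(X_train[:, keep], y_train, X_test[:, keep])
        ablated.append(rmse(y_test, pred))
    return baseline, ablated

# Synthetic stand-in data: 5 "core wavelengths", target dominated by column 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(240, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=240)
baseline, ablated = leave_one_feature_out(X[:168], y[:168], X[168:], y[168:])
```

With k core wavelengths this yields k + 1 experiments: one with the full set and one for each removed wavelength. A large increase in error after removing a wavelength marks it as influential.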
3.2. Evaluation Metrics
In this study, the root mean square error (RMSE), relative mean absolute error (RMAE), relative absolute error (RAE), determination coefficient (R²), mean square error (MSE), mean bias error (MBE), and mean absolute error (MAE) were utilized to assess the performance of the above schemes and models. In their definitions, $y_i$ represents the $i$-th observed value, $\hat{y}_i$ represents the $i$-th model predicted value, $n$ refers to the common sample size of the observed values and model predicted values, $\bar{y}$ is the mean of observed values, and $\bar{\hat{y}}$ is the mean of model predicted values. The performance of these schemes and models is better with lower RMSE, lower RMAE, lower RAE, higher R², lower MSE, MBE close to zero, and lower MAE.
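For four of these metrics (MSE, RMSE, MAE, and MBE), the widely used textbook forms can be computed as in the sketch below. This is an assumption about the forms rather than a reproduction of the equations in this study: in particular, the sign convention of MBE (predicted minus observed) is assumed, and RMAE, RAE, and R² are left out because several variants of each exist. The function name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_obs, y_pred):
    """Textbook MSE, RMSE, MAE, and MBE for paired observed/predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_obs  # assumed sign convention: predicted minus observed
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
        "MBE": float(np.mean(err)),
    }

# Hypothetical SPAD-like values; the errors are [1, -2, 0, 2].
metrics = regression_metrics([70.0, 72.0, 75.0, 68.0], [71.0, 70.0, 75.0, 70.0])
```

MBE keeps the sign of the errors, so positive and negative errors can cancel; this is why a value close to zero, rather than a low value, indicates good performance.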
4. Results
4.1. Correlation Matrices from Taylor-CC, NCC, and PCC Methods
Using the hyperspectral reflectance measurements with 2151 wavelengths obtained from 240 samples, a correlation analysis of all the wavelengths was performed using the Taylor-CC, NCC, and PCC methods, respectively. In this way, the correlation matrices of the three methods were obtained (see Figure 4, Figure 5 and Figure 6). These correlation matrices presented a clear diagonal pattern. At locations away from the diagonal of the matrices, the correlation values based on the Taylor-CC and NCC methods approached zero, except that the values based on the NCC method approached one at a few specific locations. On the other hand, the correlation matrix of the PCC method still exhibited numerous high values at locations far from the matrix’s diagonal, which clearly indicated an inconsistency with the general characteristics of the reflectivity data in practice.
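For the PCC case, the wavelength-by-wavelength correlation matrix can be obtained as in the sketch below, assuming that PCC denotes the standard Pearson correlation coefficient. The random array is a synthetic stand-in for the measured spectra, and the Taylor-CC and NCC matrices are not reproduced here.

```python
import numpy as np

n_samples, n_wavelengths = 240, 2151

# Synthetic stand-in for the measured reflectance (rows: samples, columns: wavelengths).
rng = np.random.default_rng(0)
reflectance = rng.random((n_samples, n_wavelengths))

# rowvar=False treats each column (wavelength) as a variable, so the result is a
# 2151 x 2151 matrix of pairwise Pearson correlations between wavelengths.
pcc_matrix = np.corrcoef(reflectance, rowvar=False)
```

The matrix is symmetric with ones on the diagonal, so only the off-diagonal entries carry information about redundancy between wavelengths.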
4.2. Core Wavelengths Selection Using Different Schemes and Methods
In this study, the final number of core wavelengths selected in the different experimental schemes was set to 50, 69, and 70. Because the final conclusions based on 50, 69, and 70 core wavelengths were identical, only the results from one experiment (i.e., 70) are shown below.
When single dimensionality reduction methods were utilized, thresholds of 0.926, 0.396, and 0.94608 were set in the Taylor-CC method, NCC method, and PCC method, respectively, so that the final number of core wavelengths was 70 in each case. When two-stage hybrid dimensionality reduction methods were utilized, multiple sets of core wavelengths with different numbers were used during the first round of dimensionality reduction to illustrate the universality of the experimental conclusions. Therefore, in the Taylor-CC, NCC, and PCC methods, specific thresholds were established during the first round of dimensionality reduction to ensure that the number of identified core wavelengths for each method was 90, 94, 97, 122, 123, and 125. In the second round of dimensionality reduction, specific thresholds were set in either the NCC method or the PCC method to ensure that the final number of core wavelengths was reduced to 70. The specific thresholds from the three schemes are presented in
Table 2,
Table A2 and
Table A3. Taking the example of ‘90 + 70’ and its corresponding thresholds ‘0.9540 + 0.0580’ in
Table 2, the explanation is as follows. In the first round of dimensionality reduction, a threshold was set to 0.9540, so that the number of identified core wavelengths was 90. Based on this, the threshold for the second round of dimensionality reduction was then adjusted to 0.0580, so that the final number of core wavelengths was reduced to 70.
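Choosing a threshold that yields a prescribed number of core wavelengths can be sketched as follows. The sketch assumes that each candidate wavelength has a single score and that a wavelength is retained when its score is at least the threshold; the actual retention criteria of the Taylor-CC, NCC, and PCC methods may differ. The function name `threshold_for_count` and the random scores are hypothetical.

```python
import numpy as np

def threshold_for_count(scores, n_keep):
    """Return a threshold t such that exactly `n_keep` scores satisfy score >= t.
    Assumes the n_keep-th and (n_keep + 1)-th largest scores are not tied."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending order
    if not 0 < n_keep < len(ordered):
        raise ValueError("n_keep must be between 1 and len(scores) - 1")
    if ordered[n_keep - 1] == ordered[n_keep]:
        raise ValueError("tie at the cut-off; exactly n_keep cannot be retained")
    return float(ordered[n_keep - 1])

# Hypothetical scores for 90 first-round wavelengths, reduced to 70 in round two.
rng = np.random.default_rng(0)
scores = rng.random(90)
t = threshold_for_count(scores, 70)
n_retained = int(np.sum(scores >= t))
```

This mirrors the "90 + 70" example: the first-round threshold fixes 90 candidates, and the second-round threshold is then tuned so that 70 remain.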
By comparing
Figure 7,
Figure 8 and
Figure 9, it is evident that the core wavelengths obtained from Scheme I were more evenly distributed than those obtained from Schemes II and III. In addition, the core wavelength sets obtained by the different dimensionality reduction methods were different in each scheme. This indicated that different dimensionality reduction methods/schemes analyzed the correlations among the dataset variables in different ways.
On the other hand, for Schemes II and III, most core wavelengths were concentrated in the ranges of 1800 nm to 2000 nm and 2300 nm to 2400 nm, while Scheme I selected many core wavelengths in the range of 650 nm to 750 nm. Furthermore, Scheme I showed that two-stage hybrid dimensionality reduction methods can yield a more even distribution of core wavelengths.
It is worth noting that, to facilitate the comparisons among different dimensionality reduction methods, the average level of performance of the six groups of experiments (i.e., the six rows in
Table 2,
Table A2 and
Table A3) in a two-stage hybrid dimensionality reduction method was used as a representative measure of the overall performance. Those average performance metrics were used in the following comparisons. Moreover, the detailed results of six groups of experiments in the different two-stage methods are shown in
Appendix A.
4.3. Comparing the Performance of Different Methods in Scheme I
The results obtained from Methods (1), (2), and (3) using different models are listed in
Table 3,
Table 4 and
Table 5, respectively. In both Lasso and ANN, it is evident that all seven metrics in Method (2) and Method (3) were superior to those in Method (1), with low MAE, MBE close to zero, low MSE, low RAE, low RMAE, high R², and low RMSE. In RF, except for the MBE of both Method (2) and Method (3), which were inferior to that of Method (1), the other metrics in Method (2) and Method (3) showed better performance. Therefore, a conclusion can be drawn that the performance of Method (2) and Method (3) was better than that of Method (1) in all three models.
With the conclusion that hybrid dimensionality reduction methods (Methods (2) and (3)) outperformed the single dimensionality reduction method (Method (1)) in different models, focus was shifted to identifying the best hybrid dimensionality reduction method between Method (2) and Method (3). By comparing the performance of Methods (2) and (3), it is clear that all metrics in Method (3) performed better than those in Method (2) in all three models. Moreover, by comparing seven metrics of the three models in Method (3), it is obvious that the ANN performed better than both Lasso and RF in all aspects except for MBE. Therefore, the above discussion indicated that Method (3) combined with an ANN achieved the best performance in Scheme I.
4.4. Comparing the Performance of Different Methods in Scheme II
The results of Method (1) using different models are listed in
Table 6, and the results of Method (2) are listed in
Table 7. Using Lasso, ANN, and RF, it is clear that the performance of the seven metrics in Method (2) was superior to that in Method (1) when using the same model, which indicated that Method (2) was superior to Method (1) in all three models.
The results of Method (3) using Lasso, ANN, and RF are listed in
Table 8. From the seven metrics obtained in these three models, it is verified that all metrics in Method (3) were better than those in Method (1), indicating that Method (3) worked better than Method (1).
The above discussions revealed that the performance of both Method (2) and Method (3) was better than that of Method (1). That is to say, two-stage hybrid dimensionality reduction outperformed single dimensionality reduction in Scheme II. The next comparison focused on identifying which hybrid dimensionality reduction method achieved the best performance. When the same model was used, MBE in Method (3) was slightly inferior to that in Method (2), while the other metrics in Method (3) outperformed those in Method (2), with low MAE, low MSE, low RAE, low RMAE, high R², and low RMSE. Therefore, a conclusion can be drawn that the performance of Method (3) was superior to that of Method (2) in all three models. Consequently, Lasso, ANN, and RF in Method (3) can each achieve optimal predictive performance among all three methods, and Lasso in Method (3) demonstrated the best results in all seven metrics when compared with the other two models.
4.5. Comparing the Performance of Different Methods in Scheme III
The performance of Methods (1), (2), and (3) using different models is displayed in
Table 9,
Table 10 and
Table 11, respectively. To assess the performance of single dimensionality reduction methods (Method (1)) and two-stage hybrid dimensionality reduction methods (Methods (2) and (3)), this study compared all three methods. It can be seen that both Method (2) and Method (3) performed better than Method (1), with low MAE, MBE close to zero, low MSE, low RAE, low RMAE, high R², and low RMSE.
To sum up, two-stage hybrid dimensionality reduction outperformed single dimensionality reduction. Next, a further comparison was carried out between the two-stage hybrid dimensionality reduction methods (Methods (2) and (3)). The comparison indicated that Method (3) outperformed Method (2) in terms of MAE, MSE, RAE, RMAE, R², and RMSE in all three models. On the other hand, when comparing the performance of all three models in Method (3), it can be concluded that Lasso worked the best. This implied that, in Scheme III, Method (3) combined with Lasso achieved the best prediction performance.
4.6. Comparing the Performance of Schemes I, II, and III
In Scheme I, the dimensionality reduction method used in the first round was the Taylor-CC method, with its results displayed in
Table 3,
Table 4 and
Table 5. In Scheme II, the dimensionality reduction method used in the first round was the NCC method, with its results displayed in
Table 6,
Table 7 and
Table 8. In Scheme III, the dimensionality reduction method used in the first round was the PCC method, with its results displayed in
Table 9,
Table 10 and
Table 11. The performance of the single dimensionality reduction method in these three schemes was compared based on the seven metrics calculated from each model. For all three models (Lasso, ANN, and RF), comparisons showed that the performance of the single dimensionality reduction method in Scheme I was superior to that in both Scheme II and Scheme III. Subsequently, the performance of two-stage hybrid dimensionality reduction methods in these three schemes was also compared. By comparing the performance of Method (2) in Scheme I with that in Schemes II and III, it is evident that Method (2) in Scheme I exhibited the best performance among those methods, except that its MBE was poorer in Lasso and RF. Similarly, the same conclusion can be reached for Method (3) in the different schemes. Therefore, Scheme I generally performed better than both Schemes II and III when using the same model and the same method (PCC/NCC) for the second round of dimensionality reduction. It can be concluded that, in the first round of dimensionality reduction, the core wavelengths obtained by the Taylor-CC method can well preserve the information of the original reflectance data.
4.7. Comparing the Performance of Method (3) in Scheme I Under Different Models
In Schemes I, II, and III, based on the above discussions in Section 4.3, Section 4.4 and Section 4.5, it was apparent that the two-stage hybrid dimensionality reduction method, i.e., Method (3), performed the best in each scheme. Furthermore, it can be observed that the Taylor-CC + NCC method in Scheme I performed better than the other dimensionality reduction methods in Schemes I, II, and III for the ANN, RF, and Lasso models. Finally, attention was paid to evaluating the performance of the three models using the Taylor-CC + NCC method. By comparison, we can conclude that the ANN performed the best among the three models. Moreover, in the different experiments using an ANN, selecting 94 core wavelengths in the first round of dimensionality reduction achieved the best performance.
4.8. Uncertainty Analysis of Input Variables
Based on the best-performing methods and models mentioned above, an uncertainty analysis was used to assess the potential impact of changes from the 70 identified core wavelengths on chlorophyll content prediction. Analysis results are presented in
Figure 10. By comparing the results of 71 groups of experiments (i.e., the full set of 70 core wavelengths and the 70 sets with one core wavelength removed), it can be observed that retaining the original core wavelengths is beneficial for the model to achieve higher precision in predicting chlorophyll content, with low RMSE, low MAE, high R², low MSE, MBE close to zero, low RMAE, and low RAE. By sequentially eliminating each of the 70 core wavelengths and comparing the results, it is evident that the removal of almost every core wavelength has an impact on the prediction of chlorophyll content. Although the removal of a few core wavelengths significantly affected the prediction of chlorophyll content (e.g., the removal of wavelengths #9, #32, #35, and #37 in Figure 10), the removal of most core wavelengths had only a small impact, which implied that the ANN is robust in predicting chlorophyll content.
In addition, based on the hyperspectral reflectance data used in this study, random forest importance was used to evaluate the importance of each wavelength (see
Figure 11). It was observed that the wavelength region of [650 nm, 700 nm] exhibited relatively high importance. Considering the spectral characteristics of vegetation, the importance analysis was deemed reasonable, indicating that more attention should be paid to the wavelengths within this region in this study.
5. Discussion
Hyperspectral imaging technology can capture the chemical and physiological state of vegetation across various spectral wavelengths, typically ranging from hundreds to thousands of wavelengths [
39,
40]. In this study, a two-stage hybrid dimensionality reduction method was used to select core wavelengths in hyperspectral reflectance data. Compared with empirical inference, machine learning models, and deep learning models, this method provided a clear mathematical foundation, making it easier for researchers to understand and explain its underlying principles. Moreover, unlike other studies that typically use a single dimensionality reduction method or arbitrarily combine different dimensionality reduction methods, this study focused on sequentially identifying core wavelengths through two-stage hybrid dimensionality reduction methods, thereby preserving the advantages of various approaches.
This study found that both two-stage hybrid dimensionality reduction methods outperformed single dimensionality reduction methods in all schemes. This demonstrated that using two-stage hybrid dimensionality reduction methods can provide improved insights into correlation analysis and capture important patterns in hyperspectral data, thereby enhancing the comprehensiveness and accuracy of feature selection processes.
Further analysis indicated that using the Taylor-CC method for core wavelengths selection achieved higher predictive accuracy for chlorophyll content in
Camellia oleifera leaves. Unlike the PCC and NCC methods, the Taylor-CC method was developed based on the property that hyperspectral reflectance data resemble continuous functions. It approached the problem from a mathematical perspective, taking full account of the correlation between different wavelengths and representing it through specific numerical values. Furthermore, the spectral characteristics of vegetation showed that most chlorophyll a and chlorophyll b molecules absorb wavelengths between 645 nm and 670 nm, while some special types of chlorophyll absorb wavelengths longer than 700 nm [
10]. Therefore, it is evident that the wavelength range of 645 nm to 700 nm is highly sensitive to the variations in chlorophyll. Compared with the PCC and NCC methods, the core wavelengths selected based on the Taylor-CC method were evidently more concentrated in this range (see
Figure 11). This strongly indicated that the Taylor-CC method was more effective for analyzing the spectral characteristics of
Camellia oleifera and for identifying core wavelengths relative to the spectral properties of vegetation.
Additionally, using the Taylor-CC method to predict chlorophyll content with an ANN achieved the best performance. An ANN can effectively handle high-dimensional data and complex nonlinear relationships [
41]. Through multiple layers of neurons and nonlinear activation functions, an ANN can approximate many complex functions, making it suitable for most real-world problems.
When the final number of core wavelengths selected was 70, it can be shown that the Taylor-CC + NCC method combined with an ANN achieved the best performance on chlorophyll content prediction (see
Table 12). To demonstrate the generalizability and applicability of the conclusions in this study, additional similar experiments were conducted to compare the performance of different dimensionality reduction methods by using 50 or 69 core wavelengths in the final selection. As expected, the same conclusions as the above 70 core wavelengths were also reached when the final number of selected core wavelengths was 50 or 69.
Although the research object of this study focused on a specific plant sample (
Camellia oleifera), the adopted methods and models have significant universality for many applications with hyperspectral signal/data. In the analysis of hyperspectral reflectance data, both the PCC and Taylor-CC methods can be used to extract core wavelengths [
24,
42,
43]. The NCC method was also applied to the selection of core remote sensing factors in similar ecological environments [
44]. Moreover, the three machine learning models used in this study have been successfully applied to various prediction scenarios many times, demonstrating their wide applicability. In fact, many machine learning or non-machine learning models can use the identified core wavelengths to study physical quantities of interest in different applications (e.g., concentrations in chemical solutions), and they might achieve better performance than the three models used in this study. Future research will further validate the generalizability of these conclusions in predicting plant physiological states using remote sensing data, so as to expand their applicability.
6. Conclusions
In this study, a two-stage hybrid dimensionality reduction method was used to identify core wavelengths by analyzing the correlations among wavelengths and SPAD measurements. Using hyperspectral reflectance data and SPAD measurements from 240 Camellia oleifera samples, different dimensionality reduction methods/schemes were proposed and compared. In combination with these methods/schemes, three different machine learning models were used to predict chlorophyll content, which led to the following four conclusions.
- (1)
In Schemes I, II, and III, the performance of the two-stage hybrid dimensionality reduction methods was superior to that of the single dimensionality reduction method. This showed that the two-stage approach made better use of the advantages of different dimensionality reduction methods, thereby effectively improving the overall reduction effect and obtaining a more comprehensive feature representation.
- (2)
Compared with the PCC and NCC methods, the core wavelengths identified by the dimensionality reduction method based on Taylor-CC were more concentrated in the range associated with chlorophyll variation, while still preserving a significant amount of the original information from the wavelengths.
- (3)
The Taylor-CC + NCC method performed best among the two-stage hybrid dimensionality reduction methods across the schemes. Moreover, within the Taylor-CC + NCC method, using the 94 core wavelengths selected in the first round of dimensionality reduction, the ANN exhibited superior predictive performance over the other two models (MAE = 2.6583, MBE = 0.1371, MSE = 10.6127, RAE = 0.4221, RMAE = 0.0358, R² = 0.8210, RMSE = 3.2517).
- (4)
In the original Taylor-CC study [24], which utilized hyperspectral analysis techniques to predict chlorophyll content in Camellia oleifera, using only the Taylor-CC method to identify core wavelengths overlooked the relationship between wavelengths and the target variable. This study addressed this deficiency and improved on the Taylor-CC method.
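The error measures quoted in conclusion (3) can be illustrated with a minimal sketch. It covers only the measures whose standard definitions are unambiguous (MAE, MBE, MSE, RMSE, and R²); RAE and RMAE are omitted because their exact definitions are not restated in this section. The function name, the sign convention for MBE (predicted minus observed), and the SPAD values are hypothetical and serve only as an illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return common regression error measures for observed vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                      # assumed sign convention: predicted minus observed
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "MAE": np.mean(np.abs(err)),
        "MBE": np.mean(err),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical SPAD readings, not data from this study.
observed = [70.0, 72.0, 74.0, 76.0]
predicted = [71.0, 71.0, 75.0, 77.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

A positive MBE under this convention indicates that the model overestimates on average, while R² measures the share of the variance in the observed values that the predictions explain.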
The main limitations of this study are the limited geographic scope, the short time scale, and the constraints on sample size and spatial coverage in the Camellia oleifera survey data. Together, these factors could hinder the applicability and extensibility of the proposed method. In particular, the study area has relatively uniform vegetation types and environmental conditions, which could enhance the accuracy of the prediction model within this region. However, as the size of the region increases, the diversity of vegetation types and environmental conditions also grows, potentially diminishing the model’s applicability in various sub-regions unless more environmental factors are introduced into the models. Additionally, the data are limited to a single growth cycle of Camellia oleifera, and the model’s applicability will need to be re-verified and adjusted when it is applied to different ecological environments and climatic conditions. Nevertheless, the proposed methods should have the potential to be applied to other research on hyperspectral retrieval modeling, whose mathematical description is to find an inverse function of the reflectance vector that predicts/recovers a physical quantity d based on the undiscovered relationship between the two. Here, redundant information in the reflectance vector should be excluded as much as possible to achieve the best prediction, and d could represent different quantities in different fields, e.g., pollution concentration in lakes, water vapor levels in the atmosphere, and protein content in meat.
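One possible way to write down the retrieval problem described above is sketched below. Apart from d, every symbol is an assumption introduced here for illustration: the reflectance vector r over n wavelengths, the unknown forward relationship f, the noise term ε, the index set S of core wavelengths, the learned inverse map g, and the number of samples m.

```latex
\mathbf{r} = f(d) + \boldsymbol{\varepsilon}, \quad \mathbf{r} \in \mathbb{R}^{n};
\qquad
\hat{d} = g(\mathbf{r}_{S}), \quad S \subset \{1, \dots, n\};
\qquad
g^{*} = \arg\min_{g} \frac{1}{m} \sum_{i=1}^{m} \bigl( d_{i} - g(\mathbf{r}_{S,i}) \bigr)^{2}
```

In this reading, dimensionality reduction amounts to choosing S so that the restricted vector r_S drops redundant wavelengths while retaining the information needed to recover d.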
Meanwhile, this study also achieved the following innovations and improvements:
- (1)
In feature selection, a method based on a classical mathematical formula (Taylor expansion) was used for correlation analysis. Compared with empirical inference and machine learning models, this method has a clear theoretical foundation in mathematics, making it easier for researchers to understand and explain its underlying principles.
- (2)
This study focused on sequentially identifying core wavelengths using two-stage hybrid dimensionality reduction methods rather than combining them arbitrarily.
- (3)
This study enhanced and extended the latest Taylor-CC method for hyperspectral analysis in studying the chlorophyll content of Camellia oleifera.
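The sequential, two-stage idea summarized above can be sketched as follows. This is not the Taylor-CC + NCC procedure itself: plain Pearson correlation is swapped in for both stages, because the definitions of Taylor-CC and NCC are not restated in this section. The function names, the redundancy threshold, and the synthetic data are all hypothetical.

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])

def stage_one(X, redundancy_threshold=0.95):
    """Stage 1: greedily keep a wavelength only if it is not highly
    correlated with any wavelength that has already been kept."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs_corr(X[:, j], X[:, k]) < redundancy_threshold for k in kept):
            kept.append(j)
    return kept

def stage_two(X, y, candidates, top_k):
    """Stage 2: rank the surviving wavelengths by absolute correlation
    with the target and keep the top_k of them."""
    ranked = sorted(candidates, key=lambda j: abs_corr(X[:, j], y), reverse=True)
    return sorted(ranked[:top_k])

# Synthetic data: 240 samples, 6 "wavelengths" forming 3 near-duplicate pairs.
rng = np.random.default_rng(0)
base = rng.normal(size=(240, 3))
X = np.empty((240, 6))
X[:, 0::2] = base                                        # columns 0, 2, 4
X[:, 1::2] = base + 0.01 * rng.normal(size=(240, 3))     # near-duplicates 1, 3, 5
y = 3.0 * base[:, 0] + 2.0 * base[:, 1] + 0.1 * rng.normal(size=240)

kept = stage_one(X)                       # drops the redundant duplicate columns
selected = stage_two(X, y, kept, top_k=2) # keeps the columns most related to y
print(kept, selected)
```

Running the stages in this order means the second stage only has to rank a reduced set of mutually non-redundant candidates against the target, which is the motivation for applying the methods sequentially rather than in an arbitrary combination.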
Furthermore, there are still many challenges in selecting suitable wavelengths and models for chlorophyll content prediction when using hyperspectral analysis techniques. Based on the findings of this study, we expect to further explore hyperspectral analysis techniques for predicting the chlorophyll content of Camellia oleifera in future research. Some potential ideas are presented below:
- (1)
Future research may explore methods for integrating multi-source remote sensing data in complex natural environments (e.g., forests, urban areas, and mountainous regions) to overcome the challenges faced by existing hyperspectral reflectance data and to provide comprehensive spectral/temporal/spatial information.
- (2)
Subsequent research may focus on predicting chlorophyll content in vegetation by integrating the nutrient elements of vegetation (e.g., nitrogen, phosphorus, and potassium) and the properties of soil to explore their effects on chlorophyll modeling.
- (3)
Based on the findings of this study, future research may continue to explore the environmental factors that affect the growth of Camellia oleifera. Specifically, we plan to collect long-term growth data for this plant under different growing conditions, with the aim of constructing a comprehensive model that combines spectral measurements, environmental factors, climate factors, and growth variables of Camellia oleifera.