# Unbiasing the Estimation of Chlorophyll from Hyperspectral Images: A Benchmark Dataset, Validation Procedure and Baseline Results

^{1}

^{2}

^{3}

^{4}

^{*}

## Abstract

**:**

**Data Set:**DOI:10.1016/j.dib.2022.108087.

**Data Set License:**The license under which the data set is made available is (CC-BY).

## 1. Introduction

#### 1.1. Contribution

- We introduce a publicly available set of (i) chlorophyll content measurements with some complementary information, including the soil moisture, weather parameters collected during the measurements or the relative water content, being the amount of water in a leaf at the time of sampling relative to the maximal water a leaf can hold and (ii) the corresponding high-resolution hyperspectral imagery (2.2 cm GSD). The dataset encompasses the orthophotomaps with the marked plots where the chlorophyll sampling has been completed, as well as the extracted images of separate plots. We performed the on-the-ground chlorophyll measurements, which resulted in four ground-truth parameters (Section 3.1):
- The SPAD index [22];
- The maximum quantum yield of the PSII photochemistry (Fv/Fm) [24];
- The performance index for energy conservation from photons absorbed by PSII to the final PSI electron acceptors (PI) [26];
- Relative water content (RWC) measurements for the sampled canopy, for capturing additional derivative information on the nutrition of the plants.

- We introduce a procedure for the unbiased validation of machine learning algorithms for estimating the chlorophyll-related parameters from HSI, and we ensure the full reproducibility of the experiments over our dataset (Section 3.2).
- We deliver the baseline results obtained for the introduced dataset (for four ground-truth parameters) using 15 machine learning techniques which can constitute the reference for any future studies emerging from our work (Section 4). Additionally, we show that the performance of a selected model can be further improved through regularization.

#### 1.2. Structure of the Paper

## 2. Related Literature

## 3. Materials and Methods

#### 3.1. Chlorophyll Estimation Dataset (CHESS)

**CH**lorophyll

**ES**timation Data

**S**et (CHESS). The data was collected in 2020 in the Plant Breeding and Acclimatization Institute —National Research Institute (IHAR-PIB) facility located in the Central Poland region (Jadwisin, Masovian Voivodeship). For the selected 24 outdoor plots of two different soil profiles (12 plots for each soil profile, without any repetitions or overlaps), two popular (in the central Europe) potato varieties were planted: Lady Claire and Markies (split evenly). The acquisition was carried out in June and July 2020 (3 rounds of acquisition, 4 weeks apart: 3 flights over 2 sets of 12 plots resulting in 72 HSIs acquired in total) when the leaves were fully developed. The images were captured using an unmanned aerial vehicle with the push-broom imaging spectrometer that registers 150 continuous spectral bands (460–902 nm with the 2.2 cm GSD). The orthorectification procedure was executed using the collected image material of each spectral band—it was possible thanks to using four location targets (see the far left image in Figure 2) whose geographical positions have been collected with a precise GPS device. The spectral correction of those maps using four calibration targets of different spectral characteristics (selected to cover the spectrum range in which plants are perceived best) was performed. This allowed us to finalize the image acquisition process with low location error (less than 1 cm), high image resolution (2.2 cm GSD), and consistent spectral characteristics.

#### 3.2. Unbiased Validation of Chlorophyll Estimation

## 4. Experimental Results

**boldfaced in black**, whereas the Ridge regression model with additional L2 regularization is rendered in green (we discuss the process of improving a selected baseline model to enhance its capabilities later in this section). Finally, if the Ridge regression model with regularization led to obtaining the globally-best metric value (when compared with other techniques), we

**boldfaced and underlined**the corresponding entry. The results indeed show that it is possible to enhance the generalization capabilities of the “default” machine learning models.

## 5. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

AGB | Above-ground biomass |

ALOS | Advanced Land Observing Satellite |

CHESS | CHlorophyll EStimation DataSet |

CHRIS | Compact High Resolution Imaging Spectrometer |

CWT | Continuous wavelet transform |

DNN | Deep neural network |

EFFIS | The European Forest Fire Information System |

EVI | Enhanced vegetation index |

FvFm | The maximum quantum yield of the PSII photochemistry |

GBDT | Gradient boosting decision tree |

GBVI | Green brown vegetation index |

GSD | Ground sample distance |

GPR | Gaussian process regression |

HSI | Hyperspectral image |

k-NN | k-nearest neighbor |

LAI | Leaf area index |

LSWI | Land surface water index |

MAE | Mean absolute error |

MAPE | Mean absolute percentage error |

MLR | Multiple linear regression |

MSE | Mean squared error |

MSI | Multispectral image |

NBSI | Non-binary snow index |

NDVI | Normalized difference vegetation index |

NIR | Near-infrared spectra |

NIRv | Near-infrared reflectance of vegetation |

PALSAR | Phased Array L-band Synthetic Aperture Radar |

PCA | Principal component analysis |

PI | The index for energy conservation from photons absorbed by PSII to PSI electron acceptors |

${R}^{2}$ | Coefficient of determination |

RDP | Ratio of the performance to deviation |

RF | Random forest |

RFE | Recursive-feature-elimination |

RMSE | Root mean squared error |

RRMSE | Relative root mean squared error |

RWC | Relative water content measurements for the sampled canopy |

SAVI | Soil-adjusted vegetation index |

SAR | Synthetic-aperture radar |

SOC | Soil organic carbon |

SPAD | The actual level of chlorophyll utilizing the soil-plant analysis development |

SVD | Singular value decomposition |

SVM | Support vector machine |

VHGPR | Variational heteroscedastic GPR |

VIS-NIR | Visible–near-infrared spectra |

VH | Vertical transmitted and horizontal received |

VV | Vertical transmitted and vertical received polarization |

XGB | Extreme gradient boosting |

## References

- Shen, Q.; Lin, J.; Yang, J.; Zhao, W.; Wu, J. Exploring the Potential of Spatially Downscaled Solar-Induced Chlorophyll Fluorescence to Monitor Drought Effects on Gross Primary Production in Winter Wheat. IEEE J-STARS
**2022**, 15, 2012–2022. [Google Scholar] [CrossRef] - Long, Y.; Ma, M. Recognition of Drought Stress State of Tomato Seedling Based on Chlorophyll Fluorescence Imaging. IEEE Access
**2022**, 10, 48633–48642. [Google Scholar] [CrossRef] - Oláh, V.; Hepp, A.; Irfan, M.; Mészáros, I. Chlorophyll Fluorescence Imaging-Based Duckweed Phenotyping to Assess Acute Phytotoxic Effects. Plants
**2021**, 10, 2763. [Google Scholar] [CrossRef] [PubMed] - Lazzeri, G.; Frodella, W.; Rossi, G.; Moretti, S. Multitemporal Mapping of Post-Fire Land Cover Using Multiplatform PRISMA Hyperspectral and Sentinel-UAV Multispectral Data: Insights from Case Studies in Portugal and Italy. Sensors
**2021**, 21. [Google Scholar] [CrossRef] [PubMed] - Pyo, J.; Duan, H.; Baek, S.; Kim, M.S.; Jeon, T.; Kwon, Y.S.; Lee, H.; Cho, K.H. A Convolutional Neural Network Regression for Quantifying Cyanobacteria Using Hyperspectral Imagery. Remote Sens. Environ.
**2019**, 233, 111350. [Google Scholar] [CrossRef] - Hill, P.R.; Kumar, A.; Temimi, M.; Bull, D.R. HABNet: Machine Learning, Remote Sensing-Based Detection of Harmful Algal Blooms. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
**2020**, 13, 3229–3239. [Google Scholar] [CrossRef] - Torres Palenzuela, J.M.; Vilas, L.G.; Bellas Aláez, F.M.; Pazos, Y. Potential Application of the New Sentinel Satellites for Monitoring of Harmful Algal Blooms in the Galician Aquaculture. Thalass. Int. J. Mar. Sci.
**2020**, 36, 85–93. [Google Scholar] [CrossRef] - Nalepa, J.; Myller, M.; Cwiek, M.; Zak, L.; Lakota, T.; Tulczyjew, L.; Kawulok, M. Towards On-Board Hyperspectral Satellite Image Segmentation: Understanding Robustness of Deep Learning through Simulating Acquisition Conditions. Remote Sens.
**2021**, 13, 1532. [Google Scholar] [CrossRef] - Liu, N.; Liu, G.; Sun, H. Real-Time Detection on SPAD Value of Potato Plant Using an In-Field Spectral Imaging Sensor System. Sensors
**2020**, 20, 3430. [Google Scholar] [CrossRef] - Haboudane, D.; Miller, J.R.; Tremblay, N.; Zarco-Tejada, P.J.; Dextraze, L. Integrated Narrow-Band Vegetation Indices for Prediction of Crop Chlorophyll Content for Application to Precision Agriculture. Remote Sens. Environ.
**2002**, 81, 416–426. [Google Scholar] [CrossRef] - Hai-ling, J.; Li-fu, Z.; Hang, Y.; Xiao-ping, C.; Shu-dong, W.; Xue-ke, L.; Kai, L. Comparison of Accuracy and Stability of Estimating Winter Wheat Chlorophyll Content Based on Spectral Indices. In Proceedings of the 2014 IEEE Geoscience and Remote Sensing Symposium, Quebec City, QC, Canada, 13–18 July 2014; pp. 2985–2988. [Google Scholar] [CrossRef]
- Raya-Sereno, M.D.; Alonso-Ayuso, M.; Pancorbo, J.L.; Gabriel, J.L.; Camino, C.; Zarco-Tejada, P.J.; Quemada, M. Residual Effect and N Fertilizer Rate Detection by High-Resolution VNIR-SWIR Hyperspectral Imagery and Solar-Induced Chlorophyll Fluorescence in Wheat. IEEE Trans. Geosci. Remote Sens.
**2022**, 60, 1–17. [Google Scholar] [CrossRef] - Wang, J.; Zhou, Q.; Shang, J.; Liu, C.; Zhuang, T.; Ding, J.; Xian, Y.; Zhao, L.; Wang, W.; Zhou, G.; et al. UAV- and Machine Learning-Based Retrieval of Wheat SPAD Values at the Overwintering Stage for Variety Screening. Remote Sens.
**2021**, 13, 5166. [Google Scholar] [CrossRef] - Yuan, Z.; Ye, Y.; Wei, L.; Yang, X.; Huang, C. Study on the Optimization of Hyperspectral Characteristic Bands Combined with Monitoring and Visualization of Pepper Leaf SPAD Value. Sensors
**2021**, 22, 183. [Google Scholar] [CrossRef] [PubMed] - Ye, H.; Tang, S.; Yang, C. Deep Learning for Chlorophyll-a Concentration Retrieval: A Case Study for the Pearl River Estuary. Remote Sens.
**2021**, 13, 3717. [Google Scholar] [CrossRef] - Tulczyjew, L.; Kawulok, M.; Longépé, N.; Le Saux, B.; Nalepa, J. A Multibranch Convolutional Neural Network for Hyperspectral Unmixing. IEEE Geosci. Remote Sens. Lett.
**2022**, 19, 1–5. [Google Scholar] [CrossRef] - Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in ML-based Science. arXiv
**2022**, arXiv:2207.07048. [Google Scholar] [CrossRef] - Inoue, Y.; Guérif, M.; Baret, F.; Skidmore, A.; Gitelson, A.; Schlerf, M.; Darvishzadeh, R.; Olioso, A. Simple and Robust Methods for Remote Sensing of Canopy Chlorophyll Content: A Comparative Analysis of Hyperspectral Data for Different Types of Vegetation. Plant Cell Environ.
**2016**, 39, 2609–2623. [Google Scholar] [CrossRef] [Green Version] - Mayranti, F.P.; Saputro, A.H.; Handayani, W. Chlorophyll A and B Content Measurement System of Velvet Apple Leaf in Hyperspectral Imaging. In Proceedings of the ICICOS, Semarang, Indonesia, 29–30 October 2019; pp. 1–5. [Google Scholar] [CrossRef]
- Tomaszewski, M.; Gasz, R.; Smykała, K. Monitoring Vegetation Changes Using Satellite Imaging—NDVI and RVI4S1 Indicators. In Proceedings of the Control, Computer Engineering and Neuroscience, Opole, Poland, 21 September 2021; Paszkiel, S., Ed.; Springer: Berlin/Heidelberg, Germany, 2021; pp. 268–278. [Google Scholar] [CrossRef]
- Bannari, A.; Khurshid, K.S.; Staenz, K.; Schwarz, J.W. A Comparison of Hyperspectral Chlorophyll Indices for Wheat Crop Chlorophyll Content Estimation Using Laboratory Reflectance Measurements. IEEE Trans. Geosci. Remote Sens.
**2007**, 45, 3063–3074. [Google Scholar] [CrossRef] - El-Hendawy, S.; Dewir, Y.H.; Elsayed, S.; Schmidhalter, U.; Al-Gaadi, K.; Tola, E.; Refay, Y.; Tahir, M.U.; Hassan, W.M. Combining Hyperspectral Reflectance Indices and Multivariate Analysis to Estimate Different Units of Chlorophyll Content of Spring Wheat under Salinity Conditions. Plants
**2022**, 11, 456. [Google Scholar] [CrossRef] [PubMed] - Middleton, E.M.; Julitta, T.; Campbell, P.E.; Huemmrich, K.F.; Schickling, A.; Rossini, M.; Cogliati, S.; Landis, D.R.; Alonso, L. Novel Leaf-Level Measurements of Chlorophyll Fluorescence for Photosynthetic Efficiency. In Proceedings of the IGARSS, Milan, Italy, 26–31 July 2015; pp. 3878–3881. [Google Scholar] [CrossRef]
- Jia, M.; Zhou, C.; Cheng, T.; Tian, Y.; Zhu, Y.; Cao, W.; Yao, X. Inversion of Chlorophyll Fluorescence Parameters on Vegetation Indices at Leaf Scale. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 4359–4362. [Google Scholar] [CrossRef]
- Nalepa, J.; Myller, M.; Kawulok, M. Validating Hyperspectral Image Segmentation. IEEE Geosci. Remote Sens. Lett.
**2019**, 16, 1264–1268. [Google Scholar] [CrossRef] [Green Version] - Singh, H.; Kumar, D.; Soni, V. Performance of Chlorophyll a Fluorescence Parameters in Lemna Minor under Heavy Metal Stress Induced by Various Concentration of Copper. Sci. Rep.
**2022**, 12, 10620. [Google Scholar] [CrossRef] [PubMed] - Yue, J.; Zhou, C.; Guo, W.; Feng, H.; Xu, K. Estimation of Winter-Wheat Above-Ground Biomass Using the Wavelet Analysis of Unmanned Aerial Vehicle-Based Digital Images and Hyperspectral Crop Canopy iImages. Int. J. Remote Sens.
**2021**, 42, 1602–1622. [Google Scholar] [CrossRef] - Jin, X.; Li, Z.; Feng, H.; Ren, Z.; Li, S. Deep Neural Network Algorithm for Estimating Maize Biomass Based on Simulated Sentinel 2A Vegetation Indices and Leaf Area Index. Crop J.
**2020**, 8, 87–97. [Google Scholar] [CrossRef] - Lu, B.; He, Y. Evaluating Empirical Regression, Machine Learning, and Radiative Transfer Modelling for Estimating Vegetation Chlorophyll Content Using Bi-Seasonal Hyperspectral Images. Remote Sens.
**2019**, 11, 1979. [Google Scholar] [CrossRef] [Green Version] - Adão, T.; Hruška, J.; Pádua, L.; Bessa, J.; Peres, E.; Morais, R.; Sousa, J.J. Hyperspectral Imaging: A Review on UAV-Based Sensors, Data Processing and Applications for Agriculture and Forestry. Remote Sens.
**2017**, 9, 1110. [Google Scholar] [CrossRef] - Brewer, K.; Clulow, A.; Sibanda, M.; Gokool, S.; Naiken, V.; Mabhaudhi, T. Predicting the Chlorophyll Content of Maize over Phenotyping as a Proxy for Crop Health in Smallholder Farming Systems. Remote Sens.
**2022**, 14, 518. [Google Scholar] [CrossRef] - Lu, B.; Dao, P.D.; Liu, J.; He, Y.; Shang, J. Recent Advances of Hyperspectral Imaging Technology and Applications in Agriculture. Remote Sens.
**2020**, 12, 2659. [Google Scholar] [CrossRef] - Meng, X.; Bao, Y.; Liu, J.; Liu, H.; Zhang, X.; Zhang, Y.; Wang, P.; Tang, H.; Kong, F. Regional Soil Organic Carbon Prediction Model Based on a Discrete Wavelet Analysis of Hyperspectral Satellite Data. Int. J. Appl. Earth Obs. Geoinf.
**2020**, 89, 102111. [Google Scholar] [CrossRef] - Hong, Y.; Chen, S.; Chen, Y.; Linderman, M.; Mouazen, A.M.; Liu, Y.; Guo, L.; Yu, L.; Liu, Y.; Cheng, H.; et al. Comparing Laboratory and Airborne Hyperspectral Data for the Estimation and Mapping of Topsoil Organic Carbon: Feature Selection Coupled with Random Forest. Soil Tillage Res.
**2020**, 199, 104589. [Google Scholar] [CrossRef] - Zhang, Y.; Xia, C.; Zhang, X.; Cheng, X.; Feng, G.; Wang, Y.; Gao, Q. Estimating the Maize Biomass by Crop Height and Narrowband Vegetation Indices Derived from UAV-based Hyperspectral Images. Ecol. Indic.
**2021**, 129, 107985. [Google Scholar] [CrossRef] - Han, L.; Yang, G.; Dai, H.; Xu, B.; Yang, H.; Feng, H.; Li, Z.; Yang, X. Modeling Maize Above-Ground Biomass Based on Machine Learning Approaches Using UAV Remote-Sensing Data. Plant Methods
**2019**, 15, 1746–4811. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Ji, S.; Zhang, C.; Xu, A.; Shi, Y.; Duan, Y. 3D Convolutional Neural Networks for Crop Classification with Multi-Temporal Remote Sensing Images. Remote Sens.
**2018**, 10, 75. [Google Scholar] [CrossRef] [Green Version] - Cui, Z.; Kerekes, J.P. Potential of Red Edge Spectral Bands in Future Landsat Satellites on Agroecosystem Canopy Green Leaf Area Index Retrieval. Remote Sens.
**2018**, 10, 1458. [Google Scholar] [CrossRef] [Green Version] - Zhang, F.; Zhou, G. Estimation of Vegetation Water Content using Hyperspectral Vegetation Indices: A Comparison of Crop Water Indicators in Response to Water Stress Treatments for Summer Maize. BMC Ecol.
**2019**, 19, 1–12. [Google Scholar] [CrossRef] [Green Version] - Mansaray, L.R.; Zhang, K.; Kanu, A.S. Dry Biomass Estimation of Paddy Rice With Sentinel-1A Satellite Data Using Machine Learning Regression Algorithms. Comput. Electron. Agric.
**2020**, 176, 105674. [Google Scholar] [CrossRef] - Wang, J.; Xiao, X.; Bajgain, R.; Starks, P.; Steiner, J.; Doughty, R.B.; Chang, Q. Estimating Leaf Area Index and Aboveground Biomass of Grazing Pastures Using Sentinel-1, Sentinel-2 and Landsat Images. ISPRS J. Photogramm. Remote Sens.
**2019**, 154, 189–201. [Google Scholar] [CrossRef] [Green Version] - Guo, H.; Liu, J.; Xiao, Z.; Xiao, L. Deep CNN-based Hyperspectral Image Classification Using Discriminative Multiple Spatial-spectral Feature Fusion. Remote Sens. Lett.
**2020**, 11, 827–836. [Google Scholar] [CrossRef] - Marshall, M.; Thenkabail, P. Advantage of Hyperspectral EO-1 Hyperion over Multispectral IKOMOS, Geoeye-1, Worldview-2, Landsat ETM+, and MODIS Vegetation Indices in Crop Biomass Estimation. ISPRS J. Photogramm. Remote Sens.
**2015**, 108, 205–218. [Google Scholar] [CrossRef] [Green Version] - Ribalta Lorenzo, P.; Tulczyjew, L.; Marcinkiewicz, M.; Nalepa, J. Hyperspectral Band Selection Using Attention-Based Convolutional Neural Networks. IEEE Access
**2020**, 8, 42384–42403. [Google Scholar] [CrossRef] - Zheng, Q.; Ye, H.; Huang, W.; Dong, Y.; Jiang, H.; Wang, C.; Li, D.; Wang, L.; Chen, S. Integrating Spectral Information and Meteorological Data to Monitor Wheat Yellow Rust at a Regional Scale: A Case Study. Remote Sens.
**2021**, 13, 278. [Google Scholar] [CrossRef] - Rao, K.; Williams, A.P.; Flefil, J.F.; Konings, A.G. SAR-enhanced Mapping of Live Fuel Moisture Content. Remote Sens. Environ.
**2020**, 245, 111797. [Google Scholar] [CrossRef] - Estévez, J.; Vicent, J.; Rivera-Caicedo, J.P.; Morcillo-Pallarés, P.; Vuolo, F.; Sabater, N.; Camps-Valls, G.; Moreno, J.; Verrelst, J. Gaussian Processes Retrieval of LAI from Sentinel-2 Top-of-atmosphere Radiance Data. ISPRS J. Photogramm. Remote Sens.
**2020**, 167, 289–304. [Google Scholar] [CrossRef] [PubMed] - Wang, X.; Zhang, Y.; Atkinson, P.M.; Yao, H. Predicting Soil Organic Carbon Content in Spain by Combining Landsat TM and ALOS PALSAR Images. Int. J. Appl. Earth Obs. Geoinf.
**2020**, 92, 102182. [Google Scholar] [CrossRef] - Battude, M.; Al Bitar, A.; Morin, D.; Cros, J.; Huc, M.; Marais Sicre, C.; Le Dantec, V.; Demarez, V. Estimating Maize Biomass and Yield over Large Areas Using High Spatial and Temporal Resolution Sentinel-2 like Remote Sensing Data. Remote Sens. Environ.
**2016**, 184, 668–681. [Google Scholar] [CrossRef] - Kalaji, H.M.; Račková, L.; Paganová, V.; Swoczyna, T.; Rusinowski, S.; Sitko, K. Can Chlorophyll-a Fluorescence Parameters Be Used as Bio-indicators to Distinguish Between Drought and Salinity Stress in Tilia Cordata Mill? Environ. Exp. Bot.
**2018**, 152, 149–157. [Google Scholar] [CrossRef] - Li, C.; Chen, P.; Ma, C.; Feng, H.; Wei, F.; Wang, Y.; Shi, J.; Cui, Y. Estimation of potato chlorophyll content using composite hyperspectral index parameters collected by an unmanned aerial vehicle. Int. J. Remote Sens.
**2020**, 41, 8176–8197. [Google Scholar] [CrossRef] - Liu, N.; Xing, Z.; Zhao, R.; Qiao, L.; Li, M.; Liu, G.; Sun, H. Analysis of Chlorophyll Concentration in Potato Crop by Coupling Continuous Wavelet Transform and Spectral Variable Optimization. Remote Sens.
**2020**, 12, 2826. [Google Scholar] [CrossRef] - Yang, H.; Hu, Y.; Zheng, Z.; Qiao, Y.; Zhang, K.; Guo, T.; Chen, J. Estimation of Potato Chlorophyll Content from UAV Multispectral Images with Stacking Ensemble Algorithm. Agronomy
**2022**, 12, 2318. [Google Scholar] [CrossRef] - Ruszczak, B.; Boguszewska-Mańkowska, D. Deep potato—The Hyperspectral Imagery of Potato Cultivation with Reference Agronomic Measurements Dataset: Towards Potato Physiological Features Modeling. Data Brief
**2022**, 42, 108087. [Google Scholar] [CrossRef] - Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci.
**2021**, 7, e623. [Google Scholar] [CrossRef] - Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res.
**2011**, 12, 2825–2830. [Google Scholar] - Nalepa, J.; Myller, M.; Tulczyjew, L.; Kawulok, M. Deep Ensembles for Hyperspectral Image Data Classification and Unmixing. Remote Sens.
**2021**, 13, 4133. [Google Scholar] [CrossRef] - Lin, C.Y.; Lin, C. Using Ridge Regression Method to Reduce Estimation Uncertainty in Chlorophyll Models Based on Worldview Multispectral Data. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1777–1780. [Google Scholar] [CrossRef]
- Ziaja, M.; Bosowski, P.; Myller, M.; Gajoch, G.; Gumiela, M.; Protich, J.; Borda, K.; Jayaraman, D.; Dividino, R.; Nalepa, J. Benchmarking Deep Learning for On-Board Space Applications. Remote Sens.
**2021**, 13, 3981. [Google Scholar] [CrossRef]

**Figure 1.**The popularity of topics related to the HSI analysis in various agricultural applications, including the chlorophyll estimation, quantified as the number of papers on such topics published between 2012 and 2022 (this analysis is based on https://app.dimensions.ai/discover/publication (accessed on 1 November 2022), the analysis was performed on 8 September 2022; in the legend, we present the keyphrase which was used). We can appreciate that the number of articles tackling the automated chlorophyll determination from hyperspectral imagery is increasing at a steady pace.

**Figure 2.**The dataset preparation procedure: 6 hyperspectral orthophotomaps, one for each flight for each set of 12 plots (

**left**) were used to extract 72 hyperspectral 150-band images (

**middle left**). The ground measurements of four parameters were performed for each plot (

**middle right**). We extracted the spectral curves, individually for each pixel and aggregated across all pixels—see, e.g., the median spectral curve in the far

**right**image.

**Figure 3.**Empirical cumulative distribution (ECDF) of all parameters (SPAD, FvFm, PI and RWC) in the training and test sets.

**Figure 4.**Spectral characteristics of all HSIs in the training and test sets. The mean spectral curves are rendered as blue and orange (dashed) lines for the training and test set, respectively.

**Figure 5.**The Ridge regression results (${R}^{2}$ over the test sets for each parameter) obtained using a range of the $\alpha $ hyperparameter values.

**Figure 6.**The results elaborated using a Support Vector Machine over the test sets for each parameter, and obtained using a range of the C and $\gamma $ hyperparameter values.

**Table 1.**Selected works focusing on the machine learning-powered analysis in agricultural applications, together with the additional feature extraction step performed in the corresponding method. The papers focusing on the chlorophyll estimation are in green.

Ref. | Year | Goal | Feature Extraction | Algorithm |
---|---|---|---|---|

[31] | 2022 | prediction of chlorophyll content in maize | selected bands; vegetation indices | random forest |

[35] | 2021 | monitoring of above-ground biomass of maize | vegetation indices | stepwise regression; random forest; extreme gradient boosting |

[45] | 2021 | monitoring of wheat yellow rust | vegetation indices; meteorological features | linear discriminant analysis; support vector machine; artificial neural network |

[46] | 2020 | mapping of live fuel moisture content | NDVI; NDWI; NIRv | recurrent neural network |

[28] | 2020 | estimation of biomass | vegetation indices | deep neural network |

[27] | 2020 | monitoring of crop growth | high-frequency IWD information; continuous wavelet transform (CWT) | multiple linear stepwise regression |

[40] | 2020 | dry biomass estimation of paddy rice | VH; VV | random forest; support vector machine; k-nearest neighbor; gradient boosting decision tree |

[47] | 2020 | LAI detection | LAI | Gaussian process regression (GPR); variational heteroscedastic GPR |

[33] | 2020 | estimation of soil organic carbon | principal component analysis; NDI; RI; DI | discrete wavelet transform at different scales; random forest; support vector machine; back-propagation neural network |

[48] | 2020 | estimation of soil organic carbon | vegetation indices: NDVI, SAVI, NBSI, NDWI, NDBI, FI | random forest |

[36] | 2019 | monitoring of above-ground biomass of maize | recursive-feature elimination | multiple linear regression; support vector machine; artificial neural network; random forest |

[41] | 2019 | pasture conditions, seasonal dynamics of LAI and AGB | NDVI; EVI; LSWI | multiple linear regression; support vector machine; random forest |

[38] | 2018 | canopy green leaf area | GBVI; NDVI; CI | empirical vegetation index regression (NDVIa-b and CIa-b); physically-based inversion, support vector regression |

[49] | 2016 | maize biomass estimation, the seasonal variation | specific leaf area | simple algorithm for yield estimates |

[43] | 2015 | detection of favorable wavelengths | singular value decomposition | stepwise regression |

**Table 2.**The results reported in the selected works on machine learning-powered analysis in various agricultural applications. The papers focusing on the chlorophyll estimation are in green.

Ref. | Date Source | Type | Wavelength | Amount of Data | Measure | Value |
---|---|---|---|---|---|---|

[31] | DJI S1000 UAV; MicaSense Altum; Downwelling Light Sensor 2 (DLS-2) | MSI | 465, 532, 630 nm, 680–730 nm, 1200–1600 nm | 3576 | ${R}^{2}$ RMSE RRMSE | results reported for the seasons |

[35] | DJI S1000 UAV; Cubert UHD 185 | HSI | 454–950 nm | 1809 | ${R}^{2}$ RMSE RRMSE | 0.85 0.27$\frac{t}{ha}$ 0.84% |

[45] | Sentinel-2; National Meteorological Information Center | MSI | 490, 560, 665, 842, 705 nm | 58 | accuracy | 84.20% |

[46] | Sentinel-1; Landsat-8 | SAR MSI | C-band; 450–510 nm, 530–590 nm, 640–670 nm, 850–880 nm, 1570–1650 nm, 2110–2290 nm | not specified | ${R}^{2}$ RMSE bias | 0.63 25.00% 1.90% |

[28] | Sentinel-2 | MSI | 443–2190 nm | 209 | ${R}^{2}$ RMSE RRMSE | 0.87 1.84$\frac{t}{ha}$ 24.76% |

[27] | DJI S1000 UAV; Cubert UHD 185; Sony DSC QX100 | HSI | 454–882 nm | 144 | ${R}^{2}$ RMSE MAE | 0.85 0.79 $\frac{t}{ha}$ 1.01 $\frac{t}{ha}$ |

[40] | Sentinel-1A | SAR | C-band | 175 | ${R}^{2}$ RMSE | 0.72 362.40 $\frac{g}{{m}^{2}}$ |

[47] | Sentinel-2 | MSI | 400–2400 nm | 114 | ${R}^{2}$ RMSE ${R}^{2}$ RMSE | 0.78 (GPR) 0.70 (GPR) 0.80 (VHGPR) 0.63 (VHGPR) |

[33] | Gaofen-5 | HSI | 433–1342 nm, 1460–1763 nm, 1990–2445 nm | 14 | ${R}^{2}$ RMSE | 0.83 2.89$\frac{g}{kg}$ |

[48] | ALOS PALSAR; Landsat TM | SAR MSI | L-band; 530–590 nm, 640–670 nm, 850–880 nm, 1570–1650 nm, 2110–2290 nm | not specified | ${R}^{2}$ RMSE RDP | 0.59 9.27$\frac{g}{kg}$ 1.98 |

[36] | DJI S1000 UAV; 1.2 megapixel Parrot Sequoia camera | MSI | 550–790 nm | 120 185 | ${R}^{2}$ RMSE MAE | 0.94 0.50 0.36 |

[41] | Sentinel-1A; Landsat-8; Sentinel-2 | SAR MSI | C-band; 452–512 nm, 636–673 nm, 851–879 nm, 1566–1651 nm | not specified | ${R}^{2}$ RMSE | 0.78 119.40$\frac{g}{{m}^{2}}$ |

[38] | HyMap; CHRIS/PROBA | HSI MSI | 677–707 nm | 118 | ${R}^{2}$ | 0.79 |

[49] | Formosat-2; SPOT4-Take5; Landsat-8; Deimos-1 | MSI | specific to the sensor | 195 | ${R}^{2}$ RRMSE | 0.96 4.6% |

[43] | Landsat ETM+; KONOS; GeoEye-1; WorldView-2; Hyperion | HSI MSI | 772, 539, 758, 914, 1130, 1320 nm | 9, 23, 23, 24, 10 | ${R}^{2}$ RMSE | 0.12–0.97 1.15–2.47 $\frac{g}{{m}^{2}}$ |

**Table 3.**Inter-parameter correlation between the measured ground-truth parameters (the Pearson’s correlation coefficient values are reported over the main diagonal, whereas the Spearman’s correlation coefficient values are given below the main diagonal; the corresponding p-values for the presented correlation coefficients are shown in brackets).

Parameter | SPAD | FvFm | PI | RWC |
---|---|---|---|---|

SPAD | 1.000 (1.000) | 0.668 (0.000) | 0.675 (0.000) | 0.109 (0.363) |

FvFm | 0.574 (0.000) | 1.000 (1.000) | 0.743 (0.000) | 0.352 (0.002) |

PI | 0.674 (0.000) | 0.877 (0.000) | 1.000 (1.000) | $-0.094$ (0.434) |

RWC | 0.697 (0.550) | 0.253 (0.161) | $-0.047$ (0.692) | 1.000 (1.000) |

**Table 4.**The most important hyperparameter values, as suggested by Pedregosa et al. [56], of all parameterized machine learning algorithms investigated in this study.

Algorithm | Hyperparameters |
---|---|

Ada Boost | Maximum number of estimators: 50,
learning rate: 1, linear loss. |

Decision Tree | Loss function: squared error,
maximum depth of the tree: not set, minimum number of samples required to split an internal node: 2,
minimum number of samples required to be at a leaf node: 1,
allowing all features to be considered for the best split with unlimited number of leaf nodes. |

Extra Trees | Maximum number of estimators: ${10}^{2}$,
loss function: squared error, maximum depth of the tree: not set, minimum number of samples required to split an internal node: 2, minimum number of samples required to be at a leaf node: 1, allowing all features to be considered for the best split with samples’ bootstrapping while building trees, unlimited number of leaf nodes. |

Extreme Gradient Boosting | Learning rate: 0.3, maximum depth of an individual estimator: 6,
minimum sum of instance weight: 1,
maximum delta step: 0, regularization terms$\lambda $: 1 and $\alpha $: 0. |

Gradient Boosting | Learning rate: 0.1, maximum number of estimators: ${10}^{2}$, maximum depth of an individual estimator: 3, loss function: squared error,
percentage of samples required to split an internal node: 10%. |

Kernel Ridge | $\alpha $: 1, the degree of the linear polynomial kernel: 3, zero coefficient for polynomial and sigmoid kernels: 1. |

Lasso | $\alpha $: 1, maximum number of iterations: ${10}^{3}$,
tolerance stopping criteria: ${10}^{-4}$. |

Light Gradient Boosting Machine | Maximum tree leaves for base learners: 31 without any limit for the tree depth for base learners,
boosting learning rate: ${10}^{-1}$, number of boosted trees to fit: ${10}^{2}$, number of samples for constructing bins: $2\xb7{10}^{4}$, no minimum loss reduction required to make a further partition on a leaf node, minimum sum of instance weight in a leaf: ${10}^{-3}$, minimum samples in a child: 20. |

Linear Support Vector Machine | Regularization parameterC: 1, L1 loss, maximum number of iterations: ${10}^{3}$,
the tolerance stopping criteria: ${10}^{-4}$. |

Nu Support Vector Machine | Kernel: Radial Basis Function,
upper bound on the fraction of training errors and a lower bound of the fraction of support vectors: 0.5, regularization parameterC: 1,
$\gamma $: 1 / $\left(\left|\mathcal{F}\right|\xb7{\sigma}^{2}\left(\mathit{T}\right)\right)$, where $\mathcal{F}$ is the number of features, and ${\sigma}^{2}\left(\mathit{T}\right)$ is the variance of the training set. |

Random Forest | Maximum number of estimators: ${10}^{2}$,
function measuring the quality of split: squared error,
minimum number of samples required to split an internal node: 2, minimum number of samples required to be at a leaf: 1, allowing all features to be considered for the best split with unlimited number of leaf nodes, no maximum number of samples for bootstrapping. |

Ridge | $\alpha $: 1, tolerance stopping criteria: ${10}^{-3}$. |

Stochastic Gradient Descent | Loss function: squared error with L2 regularization, $\alpha ={10}^{-4}$, L1 ratio: 0.15, maximum number of passes over the training data: ${10}^{3}$,
the stopping criterion for loss: ${10}^{-3}$, data shuffling after each epoch, the initial learning rate: ${10}^{-2}$, the exponent for inverse scaling learning rate: 0.25,
10% of training data is the validation set for early stopping
with 5 iterations with no improvement termination. |

Support Vector Machine | Kernel: Radial Basis Function,
$\gamma $: 1 / $\left(\left|\mathcal{F}\right|\xb7{\sigma}^{2}\left(\mathit{T}\right)\right)$, where $\mathcal{F}$ denotes the number of features, and ${\sigma}^{2}\left(\mathit{T}\right)$ is the variance of the training set, tolerance for stopping criterion: ${10}^{-3}$, regularization parameter C: 1, $\u03f5$: ${10}^{-1}$, where $\u03f5$ specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance $\u03f5$ from the actual value. |

**Table 5.**Three best machine learning models (according to ${R}^{2}$) with default parameterization for each ground-truth parameter. We indicate if the metric should be minimized (↓) or maximized (↑).

Param. | Model | ${\mathit{R}}^{2}$↑ | MAPE ↓ | MSE ↓ | MAE ↓ |
---|---|---|---|---|---|

Linear Regression | 0.818 | 0.072 | 9.583 | 1.569 | |

SPAD | Extreme Gradient Boosting | 0.719 | 0.092 | 14.808 | 2.784 |

AdaBoost | 0.698 | 0.080 | 15.935 | 1.814 | |

Linear Regression | 0.718 | 0.037 | 0.001 | 0.021 | |

FvFm | Gradient Boosting | 0.608 | 0.037 | 0.001 | 0.017 |

AdaBoost | 0.600 | 0.037 | 0.001 | 0.016 | |

Linear Regression | 0.667 | 0.532 | 0.169 | 0.280 | |

PI | Extra Trees | 0.379 | 1.251 | 0.315 | 0.368 |

Random Forest | 0.213 | 1.594 | 0.400 | 0.470 | |

Ridge | 0.817 | 0.014 | 2.249 | 1.207 | |

RWC | Extreme Gradient Boosting | 0.793 | 0.014 | 2.541 | 1.021 |

Support Vector Machine | 0.745 | 0.017 | 3.127 | 1.541 |

Param. | $\mathit{\alpha}$ | ${\mathit{R}}^{2}$↑ | MAPE ↓ | MSE ↓ | MAE ↓ |
---|---|---|---|---|---|

SPAD | $2.5\times {10}^{-5}$ | 0.827 | 0.072 | 9.095 | 1.625 |

FvFm | $5\times {10}^{-5}$ | 0.727 | 0.036 | 0.001 | 0.021 |

PI | ${10}^{-11}$ | 0.667 | 0.532 | 0.169 | 0.280 |

RWC | ${10}^{-3}$ | 0.859 | 0.013 | 1.731 | 0.941 |

**Table 7.**Performance of the investigated machine learning models for each chlorophyll-related parameter of interest. The best results obtained using the baseline machine learning models are

**boldfaced in black**, whereas the model with further regularization is indicated in green (if it resulted in the best metric value across all models, the values are boldfaced and underlined). We indicate if the metric should be minimized (↓) or maximized (↑).

Parameter | Model | ${\mathit{R}}^{2}$↑ | MAPE ↓ | MSE ↓ | MAE ↓ |
---|---|---|---|---|---|

SPAD | AdaBoost | 0.698 | 0.080 | 15.935 | 1.814 |

Decision Tree | 0.178 | 0.132 | 43.324 | 2.640 | |

ExtraTrees | 0.649 | 0.099 | 18.494 | 2.307 | |

Extreme Gradient Boosting | 0.719 | 0.092 | 14.808 | 2.784 | |

Gradient Boosting | 0.639 | 0.095 | 19.025 | 2.059 | |

Kernel Ridge | 0.132 | 0.179 | 45.768 | 4.787 | |

Lasso | −0.091 | 0.213 | 57.509 | 6.045 | |

Light Gradient Boosting Machine | 0.180 | 0.178 | 43.255 | 5.016 | |

Linear Regression | 0.818 | 0.072 | 9.583 | 1.569 | |

Linear Support Vector Machine | 0.216 | 0.131 | 41.321 | 3.242 | |

Nu Support Vector Machine | −0.142 | 0.219 | 60.211 | 6.065 | |

Random Forest | 0.587 | 0.123 | 21.760 | 3.029 | |

Ridge | 0.037 | 0.202 | 50.789 | 6.153 | |

Ridge with L2 | 0.827 | 0.072 | 9.095 | 1.625 | |

Stochastic Gradient Descent | 0.119 | 0.187 | 46.465 | 5.794 | |

Support Vector Machine | −0.289 | 0.231 | 67.984 | 6.831 | |

FvFm | AdaBoost | 0.600 | 0.037 | 0.001 | 0.016 |

Decision Tree | −0.005 | 0.055 | 0.003 | 0.020 | |

ExtraTrees | 0.136 | 0.049 | 0.003 | 0.016 | |

Extreme Gradient Boosting | 0.407 | 0.044 | 0.002 | 0.019 | |

Gradient Boosting | 0.608 | 0.037 | 0.001 | 0.017 | |

Kernel Ridge | −1.609 | 0.100 | 0.009 | 0.054 | |

Lasso | −0.002 | 0.061 | 0.003 | 0.028 | |

Light Gradient Boosting Machine | −0.002 | 0.061 | 0.003 | 0.028 | |

Linear Regression | 0.718 | 0.037 | 0.001 | 0.021 | |

Linear Support Vector Machine | 0.363 | 0.048 | 0.002 | 0.022 | |

Nu Support Vector Machine | 0.316 | 0.047 | 0.002 | 0.022 | |

Random Forest | 0.510 | 0.043 | 0.002 | 0.019 | |

Ridge | 0.272 | 0.051 | 0.003 | 0.025 | |

Ridge with L2 | 0.727 | 0.036 | 0.001 | 0.021 | |

Stochastic Gradient Descent | −4.424 | 0.155 | 0.019 | 0.102 | |

Support Vector Machine | −0.112 | 0.070 | 0.004 | 0.034 | |

PI | AdaBoost | −0.067 | 1.825 | 0.542 | 0.262 |

Decision Tree | −0.150 | 0.955 | 0.584 | 0.350 | |

Extra Tree | 0.379 | 1.251 | 0.315 | 0.368 | |

Extreme Gradient Boosting | 0.168 | 1.407 | 0.423 | 0.388 | |

Gradient Boosting | 0.114 | 1.508 | 0.450 | 0.296 | |

Kernel Ridge | −0.147 | 1.834 | 0.583 | 0.588 | |

Lasso | −0.020 | 1.941 | 0.518 | 0.502 | |

Light Gradient Boosting Machine | −0.010 | 1.623 | 0.513 | 0.456 | |

Linear Regression | 0.667 | 0.532 | 0.169 | 0.280 | |

Linear Support Vector Machine | −0.133 | 1.334 | 0.576 | 0.465 | |

Nu Support Vector Machine | −0.010 | 1.529 | 0.513 | 0.634 | |

Random Forest | 0.213 | 1.594 | 0.400 | 0.470 | |

Ridge | −0.036 | 1.993 | 0.526 | 0.442 | |

Ridge with L2 | 0.667 | 0.532 | 0.169 | 0.280 | |

Stochastic Gradient Descent | −0.108 | 1.550 | 0.563 | 0.601 | |

Support Vector Machine | 0.019 | 1.818 | 0.499 | 0.506 | |

RWC | AdaBoost | 0.657 | 0.018 | 4.213 | 1.233 |

Decision Tree | 0.432 | 0.025 | 6.977 | 1.900 | |

Extra Tree | 0.704 | 0.018 | 3.641 | 1.090 | |

Extreme Gradient Boosting | 0.793 | 0.014 | 2.541 | 1.021 | |

Gradient Boosting | 0.646 | 0.020 | 4.350 | 1.250 | |

Kernel Ridge | −4.435 | 0.074 | 66.776 | 4.045 | |

Lasso | 0.000 | 0.036 | 12.288 | 3.549 | |

Light Gradient Boosting Machine | 0.000 | 0.036 | 12.288 | 3.549 | |

Linear Regression | 0.720 | 0.018 | 3.440 | 1.385 | |

Linear Support Vector Machine | −14.497 | 0.125 | 190.396 | 7.235 | |

Nu Support Vector Machine | 0.695 | 0.019 | 3.745 | 1.610 | |

Random Forest | 0.709 | 0.018 | 3.581 | 1.499 | |

Ridge | 0.817 | 0.014 | 2.249 | 1.207 | |

Ridge with L2 | 0.859 | 0.013 | 1.731 | 0.941 | |

Stochastic Gradient Descent | −1.491 | 0.051 | 30.598 | 3.334 | |

Support Vector Machine | 0.745 | 0.017 | 3.127 | 1.541 |

**Table 8.**The results elaborated using a Support Vector Machine with the hyperparameters optimized for each parameter separately. We indicate if the metric should be minimized (↓) or maximized (↑).

Param. | $\mathit{\gamma}$ | C | ${\mathit{R}}^{2}$↑ | MAPE ↓ | MSE ↓ | MAE ↓ |
---|---|---|---|---|---|---|

SPAD | ${10}^{6}$ | ${10}^{-1}$ | 0.808 | 0.073 | 10.136 | 2.401 |

FvFm | ${10}^{0}$ | ${10}^{-1}$ | −0.169 | 0.076 | 0.004 | 0.052 |

PI | ${10}^{1}$ | ${10}^{0}$ | 0.652 | 1.157 | 0.177 | 0.372 |

RWC | ${10}^{3}$ | ${10}^{-1}$ | 0.845 | 0.016 | 1.904 | 1.006 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ruszczak, B.; Wijata, A.M.; Nalepa, J.
Unbiasing the Estimation of Chlorophyll from Hyperspectral Images: A Benchmark Dataset, Validation Procedure and Baseline Results. *Remote Sens.* **2022**, *14*, 5526.
https://doi.org/10.3390/rs14215526

**AMA Style**

Ruszczak B, Wijata AM, Nalepa J.
Unbiasing the Estimation of Chlorophyll from Hyperspectral Images: A Benchmark Dataset, Validation Procedure and Baseline Results. *Remote Sensing*. 2022; 14(21):5526.
https://doi.org/10.3390/rs14215526

**Chicago/Turabian Style**

Ruszczak, Bogdan, Agata M. Wijata, and Jakub Nalepa.
2022. "Unbiasing the Estimation of Chlorophyll from Hyperspectral Images: A Benchmark Dataset, Validation Procedure and Baseline Results" *Remote Sensing* 14, no. 21: 5526.
https://doi.org/10.3390/rs14215526