This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Aboveground biomass (AGB) is one of the strategic biophysical variables of interest in vegetation studies. The main objective of this study was to evaluate Support Vector Machine (SVM) and Partial Least Squares Regression (PLSR) for estimating the AGB of grasslands from field spectrometer data and to identify the most suitable data preprocessing approach. The most accurate model for predicting the total AGB involved PLSR and the Maximum Band Depth index derived from the continuum removed reflectance in the absorption features between 916–1,120 nm and 1,079–1,297 nm (R^{2} = 0.800, RMSE = 7.120 g/m^{2}). Regarding the green fraction of the AGB, the Area Over the Minimum index derived from the continuum removed spectra provided the most accurate model overall (R^{2} = 0.939, RMSE = 3.172 g/m^{2}). Identifying the appropriate absorption features proved crucial for improving the performance of PLSR in estimating the total and green aboveground biomass from the indices derived from those spectral regions. Ordinary Least Squares Regression could be used as a surrogate for the PLSR approach with the Area Over the Minimum index as the independent variable, although the resulting model would not be as accurate.
Biomass is one of the strategic biophysical variables of interest in vegetation studies, whether in cultivated or natural areas [
Measuring biomass directly is a destructive and expensive procedure [
Hyperspectral measurements of vegetation canopies obtained from handheld spectroradiometers [
The strong multicollinearity caused by a number of samples much smaller than the number of spectral bands considered as independent variables results in high correlation among the predictors and unreliable models [
In order to improve the signal-to-noise ratio of these data and enhance the information related to the biophysical variables, different preprocessing transformations have been applied to spectral data to prepare them for modelling. Preprocessing transformations of spectral data have been proven to improve the accuracy of prediction models [
The main objective of this study was to evaluate the performance of two advanced statistical techniques (PLSR and SVM) for estimating the aboveground biomass from field spectrometer data and to find out which data preprocessing approach was the most suitable. The total dry aboveground biomass (TAGB) was considered as the target variable, as well as the green fraction of the dry aboveground biomass (as an absolute value (GAGB) and as a percentage of the total dry aboveground biomass (%GAGB)). In addition, several data preprocessing techniques were tested in order to reduce the noise in the data and to boost the accuracy of the statistical methods. Thus, the following approaches were compared: (i) PLSR applied to different parts of the spectrum (not transformed and transformed by the continuum removal and other transformation methods), (ii) PLSR applied to indices derived from the continuum removal transformation, (iii) SVM regression applied to different parts of the spectrum, and (iv) OLSR applied to indices derived from the continuum removal transformation (as a reference).
This study was carried out in two adjacent grassy areas located in the municipality of Villanueva de La Cañada (Madrid, Spain), defined by their central coordinates ETRS89 UTM30 (416381, 4478513) and ETRS89 UTM30 (416463, 4478505) (in metres). Both test areas were covered by commercial grass/clover (
For each 50 cm × 50 cm subplot, the top-of-canopy reflectance was measured. Spectral data were gathered over the range 350–2,500 nm using an ASD FieldSpec^{®}4 spectroradiometer. Handheld measurements were made with a 1.5 m fiber optic (25° field of view) from a height of about 1.5 m above the ground, under clear sky conditions and around solar noon. Spectral readings were recorded at 1 nm intervals with a spectral resolution of 3 nm in the visible and near infrared spectra (VNIR detector: 350–1,000 nm) and 8 nm in the near and shortwave infrared (SWIR1 detector: 1,000–1,800 nm and SWIR2 detector: 1,800–2,500 nm). For each subplot, 15 reflectance readings were recorded, each one representing the average of 25 individual measurements of 100 ms, which increases the signal-to-noise ratio of the resulting measurement [
All of the aboveground biomass in each 50 cm × 50 cm subplot located in the NE corner of each plot was harvested right after the spectral measurements were taken. To avoid water loss, the samples were put individually into hermetic plastic bags and immediately taken to the laboratory in portable fridges. The samples were weighed in the laboratory using a digital precision scale, thereby obtaining the total biomass weight. Afterwards, each sample was split into dry material and green material, in order to distinguish the green and dry fractions of the aboveground biomass. The green and dry fractions of each sample were dried separately in the oven for 48 h at 65 °C. After drying, the samples were weighed again to determine the dry matter weight of both fractions. This workflow allowed the total dry aboveground biomass weight (TAGB) to be obtained, as well as the green fraction of the dry aboveground biomass weight (as an absolute value (GAGB) and as a percentage of the total dry aboveground biomass (%GAGB)). The total dry aboveground biomass weight (TAGB) was used as a surrogate for the aboveground dry biomass (AGB) in each plot [
The methodology involved two main steps: spectral data processing and statistical analysis (
The spectral data (absolute surface reflectance) were preprocessed to diminish the sensor noise. This step consisted of two tasks: averaging the 15 spectra measured for each subplot and identifying the noisiest wavelengths. Firstly, the radiometry of each subplot was characterised by the median and the mean spectrum of the 15 original measurements and averaged for the 1 × 1 m plot. Secondly, the wavelengths were grouped into three spectral subsets, taking into account the three different sensors which make up the spectroradiometer (VNIR, SWIR 1, SWIR 2). The wavelengths from 1,360 nm to 1,385 nm, from 1,800 nm to 1,930 nm and above 2,400 nm were eliminated due to high amounts of noise [
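The two preprocessing tasks described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy readings are hypothetical, and the noisy ranges are taken as inclusive, which is an assumption consistent with the subset limits reported in the spectral subsets table.

```python
from statistics import mean, median

# Noisy ranges quoted in the text (nm), treated as inclusive (assumption).
# 2500 nm is the upper limit of the instrument range.
NOISY_RANGES = [(1360, 1385), (1800, 1930), (2400, 2500)]

def average_spectra(readings, how="mean"):
    """Collapse replicate readings (equal-length lists) band by band."""
    agg = mean if how == "mean" else median
    return [agg(band) for band in zip(*readings)]

def drop_noisy_bands(wavelengths, spectrum, noisy=NOISY_RANGES):
    """Keep only the bands whose wavelength lies outside every noisy range."""
    kept = [(w, v) for w, v in zip(wavelengths, spectrum)
            if not any(lo <= w <= hi for lo, hi in noisy)]
    return [w for w, _ in kept], [v for _, v in kept]

# Toy data: three replicate readings over six bands (hypothetical values).
wavelengths = [1359, 1360, 1385, 1386, 2399, 2400]
readings = [[1.0] * 6, [2.0] * 6, [6.0] * 6]

mean_spectrum = average_spectra(readings, how="mean")
median_spectrum = average_spectra(readings, how="median")
kept_w, kept_v = drop_noisy_bands(wavelengths, mean_spectrum)
```

Keeping both the mean and the median spectrum mirrors the text, where the averaging method is itself one of the factors compared across models.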
In this comparative study, two groups of spectral transformation methods were applied: derivatives/transformations and continuum removal. This section covers the different types of preprocessing transformations which have been widely used by researchers to preprocess hyperspectral data (
The Standard Normal Variate (SNV) is applied to spectroscopy data to remove the scattering effects, and it minimises the multiplicative interferences of the scattering caused by particles of different sizes [
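As an illustration of the SNV idea, the sketch below centres each spectrum on its own mean and scales it by its own standard deviation. The function name and sample values are hypothetical, and the use of the sample standard deviation (n − 1) is an assumption; the software used in the study may differ in this detail.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own statistics."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1), an assumption
    return [(v - m) / s for v in spectrum]

spectrum = [0.10, 0.20, 0.40, 0.30]          # hypothetical reflectance values
scaled = snv(spectrum)
# A purely multiplicative change (e.g. a scattering gain) leaves the result unchanged.
scaled_gain = snv([2.0 * v for v in spectrum])
```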
Another option to model and correct the background interference is the Detrending method, especially when a constant, linear, or curved offset is present [
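A first-order detrending step can be sketched as a least-squares straight line fitted to each spectrum against wavelength and then subtracted. This is a simplified illustration with hypothetical names and data; the 2nd- and 3rd-order variants listed in the transformations table would fit higher-order polynomials instead.

```python
def detrend_linear(wavelengths, spectrum):
    """Subtract the least-squares straight line fitted to spectrum vs wavelength."""
    n = len(wavelengths)
    mw = sum(wavelengths) / n
    ms = sum(spectrum) / n
    sww = sum((w - mw) ** 2 for w in wavelengths)
    slope = sum((w - mw) * (s - ms) for w, s in zip(wavelengths, spectrum)) / sww
    intercept = ms - slope * mw
    return [s - (intercept + slope * w) for w, s in zip(wavelengths, spectrum)]

wavelengths = [400, 401, 402, 403, 404]
# Hypothetical spectrum consisting only of a linear baseline.
sloped = [0.10 + 0.01 * (w - 400) for w in wavelengths]
residual = detrend_linear(wavelengths, sloped)  # the baseline is removed entirely
```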
Derivative-based transformations enhance the differences among overlapping and wide bands of the spectra and also correct baseline effects [
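The effect of a gap-based first derivative can be illustrated with a simple finite difference between points a fixed number of bands apart. This is only a simplified sketch: the exact Norris gap algorithm in the software used by the authors may include smoothing or a different gap convention, and the function name and data here are hypothetical.

```python
def gap_derivative(wavelengths, spectrum, gap=3):
    """First derivative estimated from pairs of points `gap` bands apart."""
    return [
        (spectrum[i + gap] - spectrum[i]) / (wavelengths[i + gap] - wavelengths[i])
        for i in range(len(spectrum) - gap)
    ]

wavelengths = list(range(350, 360))           # 1 nm sampling, as in the study
spectrum = [0.02 * w for w in wavelengths]    # hypothetical linear spectrum
slopes = gap_derivative(wavelengths, spectrum, gap=3)
```

A constant offset added to the spectrum cancels in the difference, which is why derivatives remove baseline effects.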
On the other hand, normalisation methods try to correct the effect of multiplicative factors on the original values of a variable. These methods identify a characteristic in a sample which should remain constant regardless of the considered sample and correct the scale of all the variables using that characteristic. In this study, the variables were normalised by the maximum value, the mean, the range, the area and the unit vector [
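The five normalisations named above share one pattern: compute a characteristic of the spectrum and divide every band by it. The sketch below is a hypothetical illustration; in particular, approximating the area by the sum of absolute values at unit band spacing is an assumption.

```python
def normalise(spectrum, by="max"):
    """Divide a spectrum by one of its own characteristics."""
    if by == "max":
        scale = max(spectrum)
    elif by == "mean":
        scale = sum(spectrum) / len(spectrum)
    elif by == "range":
        scale = max(spectrum) - min(spectrum)
    elif by == "area":
        # Area under the curve at unit band spacing (assumption).
        scale = sum(abs(v) for v in spectrum)
    elif by == "unit_vector":
        scale = sum(v * v for v in spectrum) ** 0.5
    else:
        raise ValueError(f"unknown normalisation: {by}")
    return [v / scale for v in spectrum]

spectrum = [3.0, 4.0]                          # hypothetical two-band spectrum
by_max = normalise(spectrum, "max")
by_area = normalise(spectrum, "area")
by_unit = normalise(spectrum, "unit_vector")
```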
In addition to the transformations described in the previous section, the Continuum removal transformation (CR) of the spectra was tested (
In addition to the continuous spectra derived from the CR transformation for each zone (continuum removed reflectance (CRR)), the absorption features were characterised by two indices: the maximum band depth (MBD) and the area over the minimum (AOM) [
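A minimal sketch of continuum removal and the maximum band depth follows, assuming the continuum is the upper convex hull of the spectrum within the zone and that band depth is one minus the continuum removed reflectance. The function names and the toy absorption feature are hypothetical. The AOM index is not computed because its formula is not stated in this text.

```python
def upper_hull(points):
    """Upper convex hull of (x, y) points already sorted by x."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the last hull point if it lies on or below the chord to p.
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def continuum_removed(wavelengths, reflectance):
    """Divide each reflectance by the hull (continuum) value at its wavelength."""
    hull = upper_hull(list(zip(wavelengths, reflectance)))
    crr, k = [], 0
    for w, r in zip(wavelengths, reflectance):
        while hull[k + 1][0] < w:
            k += 1
        (x1, y1), (x2, y2) = hull[k], hull[k + 1]
        continuum = y1 + (y2 - y1) * (w - x1) / (x2 - x1)
        crr.append(r / continuum)
    return crr

def max_band_depth(crr):
    """Maximum band depth: largest value of 1 - CRR within the feature."""
    return max(1.0 - c for c in crr)

wavelengths = [1000, 1050, 1100, 1150, 1200]
reflectance = [0.5, 0.4, 0.25, 0.4, 0.5]      # hypothetical absorption feature
crr = continuum_removed(wavelengths, reflectance)
mbd = max_band_depth(crr)
```

In this toy feature the continuum is the flat line joining the two shoulders at 0.5, so the deepest point (0.25) has a continuum removed reflectance of 0.5.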
This study tested three statistical methods for developing models to estimate biomass from the grass/clover spectra: partial least squares regression (PLSR), support vector machine (SVM) and ordinary least squares regression (OLSR). Due to its simplicity, the latter was included in the analysis as a reference and as a baseline to compare the results achieved by using PLSR and SVM.
PLSR is a generalisation of multiple linear regression which is able to reduce the large number of measured collinear spectral variables to a few non-correlated latent variables or factors [
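To make the idea of latent factors concrete, the sketch below implements single-response PLS (PLS1) with the NIPALS algorithm in plain Python. It is an illustrative sketch, not the implementation used in the study (the analyses were run in commercial software); the function names and the toy data are hypothetical.

```python
def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS. X is a list of rows (samples), y a list of floats."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centred X
    f = [v - y_mean for v in y]                                 # centred y
    comps = []
    for _ in range(n_components):
        # Weights: direction in band space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y so the next factor is uncorrelated with this one.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict one sample by replaying the deflation steps factor by factor."""
    x_mean, y_mean, comps = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat

# Toy data generated from y = x1 + 2*x2 (hypothetical, noise-free).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
y = [5.0, 4.0, 13.0, 10.0]
model = pls1_fit(X, y, n_components=2)
prediction = pls1_predict(model, [2.0, 2.0])
```

With as many factors as independent predictors, PLS1 reproduces the ordinary least squares fit; the benefit with spectra comes from stopping at far fewer factors than bands.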
As independent variables, the following data sets were considered: (i) preprocessed but not transformed data (spectral subsets defined in
The selection of the most suitable model for each variable took into consideration the strategies to build a solid model [
Lately, the use of support vector machines (SVMs) on various classification and regression problems has become increasingly popular, and they have been successfully used in the estimation of grassland biomass [
Initially, SVM was developed to solve classification problems but it was later extended to also handle regression [
The ε-SVR method was applied in this study to estimate aboveground biomass, using Vapnik's ε-insensitive loss function to minimise the training errors, which were not penalised as long as they were smaller than ε. As part of the process, a kernel function was applied in order to map the data into a new space, followed by finding the support vectors that gave the best performance for the type of model. The kernel type considered in this study was the linear kernel, since it is the one which requires the fewest parameters to be defined and because it is not as susceptible to overfitting as the radial or polynomial kernels [
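The ε-insensitive loss mentioned above can be written in a few lines: residuals inside the ε tube cost nothing, and larger residuals are penalised only by the amount they exceed ε. The function name and the sample values are hypothetical.

```python
def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample ε-insensitive loss: max(0, |y - y_hat| - ε)."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical biomass values (g/m^2) and predictions, with ε = 1.
losses = eps_insensitive_loss([10.0, 10.0, 10.0], [10.5, 12.0, 7.0], epsilon=1.0)
```

Here the first prediction falls inside the tube and is not penalised, while the other two are penalised by 1 and 2 respectively.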
A general methodology consisting of the following steps was applied [
In order to find the simplest model with an acceptable error and to maintain model parsimony, an additional support vector was added to the model only if it reduced the root mean square error of cross-validation (RMSE) by at least 2%. The RMSE was determined from the residuals of each cross-validation phase. To avoid overfitting, it was also checked that the RMSE values from the calibration and cross-validation stages differed by less than 2%. The performance of the SVM models was compared using the number of support vectors, the RMSE (absolute and as a percentage of the mean/median value of the variable) and the coefficient of determination (R^{2}) for the cross-validation. The analyses were carried out using the Unscrambler^{®} X 10.2 software (CAMO Software Inc.).
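The 2% parsimony criterion can be sketched as a stopping rule over a sequence of cross-validated RMSE values, one per level of model complexity. The sketch assumes that the 2% refers to a relative reduction with respect to the current model; the function name and the RMSE values are hypothetical.

```python
def select_complexity(rmse_cv, min_gain=0.02):
    """Return the complexity level (1-based) chosen by the parsimony rule.

    rmse_cv[k] is the cross-validated RMSE of the model with k + 1 terms.
    A further term is accepted only if it lowers the RMSE by at least
    `min_gain` relative to the currently accepted model (assumption).
    """
    best = 0
    for k in range(1, len(rmse_cv)):
        if rmse_cv[k] <= rmse_cv[best] * (1.0 - min_gain):
            best = k
        else:
            break  # stop at the first term that does not pay for itself
    return best + 1

# Hypothetical RMSE sequence: the third term improves the RMSE by only about 1%.
chosen = select_complexity([10.0, 9.0, 8.9, 7.0])
```

The rule stops at the first insufficient gain, so the later drop to 7.0 is never reached; this favours the simpler model, in line with the stated aim of parsimony.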
Ordinary Least Squares Regression (OLSR) was carried out using the biomass measurements (TAGB, GAGB, %GAGB) as dependent variables, and as independent variables the derived indices from the continuum removed reflectance (i) maximum band depth (MBD) for Z1–Z5 (
Overall, the results of the statistical models tested in this study were assessed in terms of the coefficient of determination of the cross-validation (R^{2}), the RMSE of the cross-validation (absolute value and percentage of the mean/median value of the variable) and the agreement between the wavelengths/regions identified as important by statistical analysis and known water/biomass absorption features [
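The assessment metrics can be illustrated with a single-predictor OLSR model evaluated by cross-validation. The sketch assumes leave-one-out cross-validation and computes R^{2} as 1 − SS_res/SS_tot; both are assumptions, since the text does not state the cross-validation scheme or the exact R^{2} formula. All names and the toy data are hypothetical.

```python
from statistics import mean

def ols_fit(x, y):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def loo_predictions(x, y):
    """Leave-one-out predictions: each sample is predicted by a model fitted without it."""
    preds = []
    for i in range(len(x)):
        slope, intercept = ols_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds

def cv_metrics(y, preds):
    """Return (R^2, RMSE, RMSE as a percentage of the mean of y)."""
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - mean(y)) ** 2 for a in y)
    rmse = (ss_res / len(y)) ** 0.5
    return 1.0 - ss_res / ss_tot, rmse, 100.0 * rmse / mean(y)

# Hypothetical index values (x) and biomass (y) lying exactly on a line.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
r2, rmse, rmse_pct = cv_metrics(y, loo_predictions(x, y))

# Relative RMSE for the best TAGB model, from values reported in the tables.
tagb_rmse_pct = 100.0 * 7.120 / 45.05
```

The last line reproduces, to within rounding, the relative RMSE of about 15.8% reported for the best TAGB model.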
On the whole, 140 models were tested for each of the three dependent variables (total, green and percentage of green grass/clover biomass), 12 of them without transformations of the spectral data and using PLSR and SVM, 124 involving PLSR and transformations/indices and four considering indices from the continuum removed spectra and OLSR. Thus, 420 models were explored in order to find suitable combinations among the regression method, the transformation type, the spectral subset/zone/index and the averaging method of the spectra for the estimation of biomass in grasslands. The results of these approaches are presented in the next section and discussed later on.
As a result of the comprehensive analysis of the relationships between total, green and percentage of green grass/clover biomass and the spectral data (transformed and non-transformed),
As shown in
The comparative analysis of the performance of PLSR models and SVM models showed higher R^{2} and smaller RMSE for the PLSR models, regardless of the non-transformed spectral subset used as input data. In that case, the most accurate models were obtained using the VNIR + SWIR1 subset, for both PLSR (R^{2} = 0.756, RMSE = 7.866 g/m^{2}) and SVM (R^{2} = 0.751, RMSE = 7.684 g/m^{2}) (
OLSR provided the best results when using the AOM index derived from the continuum removed reflectance in the absorption feature between 1,079 and 1,297 nm (Z4) (RMSE = 8.150 g/m^{2}, 18.09% of the mean value) (
According to
The PLSR and SVM models used to estimate green aboveground biomass achieved the smallest RMSE when the largest subset of the spectra (VNIR + SWIR1 + SWIR2) was used, reaching RMSE values smaller than 11% of the average value of the variable (
The comparative analysis of the performance of PLSR models and SVM models showed higher R^{2} and smaller RMSE for the PLSR models only when the non-transformed spectral subsets used were different from VNIR + SWIR1 + SWIR2, in which case SVM was the most accurate (
The PLSR/CRR and OLSR models with the highest R^{2} and lowest RMSE involved the use of data from the region Z4 (
The models that produced lower RMSE and higher R^{2} when estimating the percentage of green aboveground biomass were characterised by using the VNIR reflectance as input data (VNIR or the absorption feature Z1) and PLSR and SVM regressions (
The combination of PLSR and the VNIR spectra transformed by NGD3 or RAB produced lower ranges of error in the cross-validation analyses (RMSE = 6.919 and 7.500%, respectively) compared to PLSR applied to non-transformed VNIR data (RMSE = 7.502%) (
The OLSR yielded less accurate models than PLSR and SVM, as shown by the fact that the RMSE corresponding to the best OLSR model (AOM in the absorption feature Z5) was 24.08% larger than the RMSE obtained by the best PLSR (
This study has shown the suitability of PLSR and spectral data/indices derived from the CR transformation to estimate the total dry aboveground biomass (TAGB), the green fraction of the dry aboveground biomass (GAGB), and the green fraction of the dry aboveground biomass expressed as a percentage (%GAGB). The results found in our study agree with [
Transformed data always yielded more accurate models than non-transformed spectral data when PLSR was applied. However, the transformation that improved the results was not the same for every variable. For instance, NMX and MSCO were more suitable for modelling TAGB (RMSE = 7.443 g/m^{2} and RMSE = 7.457 g/m^{2}, compared with an RMSE of 7.866 g/m^{2} when no transformation was applied), while BLO led to better estimations of GAGB (RMSE = 3.417 g/m^{2}
The application of the CR transformation to certain regions of the spectra, such as Z3 (916–1,120 nm) and Z4 (1,079–1,297 nm), simplified the TAGB model in comparison with the use of the full non-transformed spectrum, as shown by the decrease in the number of latent factors from 3 to 2 in the PLSR model (
Regarding the use of the two indices derived from the CR transformation (MBD and AOM), the combination of their values in the spectral regions mentioned previously yielded the most accurate models for TAGB and GAGB. [
When no transformations were applied to the reflectance data, SVM outperformed PLSR in terms of RMSE for all three variables. For instance, GAGB was estimated with an RMSE of 3.229 g/m^{2} (10.18%) using SVM, while PLSR led to an RMSE of 3.467 g/m^{2} (10.93%). The better performance of SVM in comparison to PLSR was also noted by [
In this paper, it has been demonstrated that the total dry aboveground biomass, as well as the green fraction of the dry aboveground biomass (as an absolute value and as a percentage of the total dry aboveground biomass) can be accurately predicted from spectrometer data by using PLSR and indices derived from the continuum removal transformation of certain regions of the spectra.
The models to estimate the green fraction of the dry aboveground biomass (as an absolute value) yielded smaller errors than the ones predicting the total dry aboveground biomass. Splitting the biomass sample into dry and green fractions allowed the development of more accurate models (for green fraction of the dry aboveground biomass) and it is therefore recommended in case the models need to be recalibrated.
The SVM models provided more accurate estimations of the three variables when no transformations were applied to the reflectance data, which encourages further work to test whether the accuracy of SVM increases when the input data are transformed beforehand.
Applying transformations to the data led to more accurate models than non-transformed spectral data when using PLSR. However, unless the continuum removal transformation is chosen, the optimal transformation to apply to the data needs to be identified by taking into account the dependent variable which is being estimated.
Identifying the appropriate absorption features was proven to be crucial in order to improve the performance of PLSR to estimate the total and green aboveground biomass, by using the indices (MBD and AOM) as input data, which are derived from the continuum removed reflectance from those regions. OLSR could be used as a surrogate for the PLSR approach with AOM (1,079–1,297 nm) as the independent variable, although the resulting model would not be as accurate.
This research has been partially funded by the Junta de Castilla y León through the project “Calibración radiométrica de cámaras aéreas digitales. Aplicación a la clasificación automática de cubiertas del suelo y estimación de biomasa” (LE001B08). The authors would like to thank the two anonymous reviewers who helped improve the manuscript with their comments and suggestions.
The authors declare no conflict of interest.
Distribution of frequencies for TAGB (total aboveground biomass), GAGB (green portion of the AGB) and %GAGB (percentage of the green fraction of the AGB).
Methodology flowchart.
A grass reflectance spectrum and the representation of its continuum and absorption features (Zi: Zone I, as defined in
Cross-calibration results for TAGB using (
Cross-calibration results for predicting GAGB using (
Cross-calibration results for predicting %GAGB using (
Examples of statistical techniques for estimating vegetation biophysical variables from hyperspectral data.
| Abbreviation | Technique | References |
| --- | --- | --- |
| PLSR | Partial least squares regression | [ |
| SVM | Support vector machine | [ |
| OLSR | Ordinary least squares regression | [ |
Descriptive statistics of the sample (
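| Statistic | TAGB (g/m^{2}) | GAGB (g/m^{2}) | %GAGB (%) |
| --- | --- | --- | --- |
| Mean | 45.05 | 31.71 | 68.34 |
| Median | 49.10 | 34.75 | 69.77 |
| Standard deviation | 15.40 | 12.63 | 13.57 |
| Maximum | 75.60 | 50.50 | 90.04 |
| Minimum | 9.52 | 4.40 | 29.76 |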
Wavelengths which define the three spectral subsets considered in this research.
| Spectral subset | Wavelengths (nm) |
| --- | --- |
| VNIR | [350–1,000] |
| VNIR + SWIR1 | [350–1,359], [1,386–1,799] |
| VNIR + SWIR1 + SWIR2 | [350–1,359], [1,386–1,799], [1,931–2,399] |
Preprocessing transformations compared in this study.
| Abbreviation | Transformation | Reference |
| --- | --- | --- |
| BLO | Baseline offset | [ |
| CR | Continuum removal | [ |
| DETREN1 | Detrending using a 1st-order polynomial | [ |
| DETREN2 | Detrending using a 2nd-order polynomial | |
| DETREN3 | Detrending using a 3rd-order polynomial | |
| MSCA | Multiplicative Scatter Correction, common amplification: f(X) = X/b | [ |
| MSCF | Multiplicative Scatter Correction, full MSC: f(X) = (X − a)/b | |
| MSCO | Multiplicative Scatter Correction, common offset: f(X) = X − a | |
| NAR | Normalise by the area | [ |
| NMX | Normalise by the maximum value | |
| NME | Normalise by the mean | |
| NRA | Normalise by the range | |
| NUV | Normalise by the unit vector | |
| NGD3 | Norris gap derivative, 1st derivative, gap size = 3 | [ |
| NGD5 | Norris gap derivative, 1st derivative, gap size = 5 | |
| NGD7 | Norris gap derivative, 1st derivative, gap size = 7 | |
| NGD9 | Norris gap derivative, 1st derivative, gap size = 9 | |
| RAB | Reflectance to absorbance | [ |
| SNV | Standard normal variate transformation | [ |
Continuum removal zones considered in this study.
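| Zone | Wavelengths (nm) | Spectral region |
| --- | --- | --- |
| Z1 | [440–567] | VNIR |
| Z2 | [554–762] | VNIR |
| Z3 | [916–1,120] | VNIR + SWIR1 |
| Z4 | [1,079–1,297] | SWIR1 |
| Z5 | [1,265–1,676] | SWIR1 |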
Performance of PLSR, SVM and OLSR and spectral transformations for predicting total (TAGB), green (GAGB) and percentage of green (%GAGB) grass/clover biomass.
| Variable | Method/Transformation | Input data | Averaging | Factors (PLSR, OLSR) or value reported for SVM | R^{2} | RMSE | RMSE (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TAGB | PLSR/MBD | Z3Z4 (MBD) | Mean | 2 | 0.800 | 7.120 | 15.81 |
| | PLSR/CRR | Z4 | Mean | 5 | 0.799 | 7.136 | 15.84 |
| | PLSR/NMX | VNIR | Mean | 6 | 0.782 | 7.443 | 16.52 |
| | PLSR/MSCO | VNIR + SWIR1 | Mean | 3 | 0.781 | 7.457 | 16.55 |
| | PLSR/MSCO | VNIR + SWIR1 + SWIR2 | Mean | 3 | 0.770 | 7.640 | 16.96 |
| | PLSR/none | VNIR + SWIR1 | Mean | 3 | 0.756 | 7.866 | 17.46 |
| | SVM/none | VNIR + SWIR1 | Mean | 0.04 | 0.751 | 7.684 | 17.06 |
| | PLSR/none | VNIR + SWIR1 + SWIR2 | Mean | 3 | 0.751 | 7.950 | 17.65 |
| | SVM/none | VNIR + SWIR1 + SWIR2 | Mean | 0.03 | 0.745 | 7.780 | 17.27 |
| | OLSR/AOM | Z4 (AOM) | Mean | 1 | 0.720 | 8.150 | 18.09 |
| | PLSR/none | VNIR | Median | 3 | 0.689 | 8.888 | 19.73 |
| | SVM/none | VNIR | Median | 0.11 | 0.683 | 8.690 | 19.29 |
| GAGB | PLSR/AOM | Z1Z3Z4 (AOM) | Mean | 3 | 0.939 | 3.172 | 10.00 |
| | SVM/none | VNIR + SWIR1 + SWIR2 | Mean | 0.1 | 0.933 | 3.229 | 10.18 |
| | PLSR/BLO | VNIR + SWIR1 + SWIR2 | Mean | 6 | 0.929 | 3.417 | 10.78 |
| | PLSR/none | VNIR + SWIR1 + SWIR2 | Mean | 6 | 0.927 | 3.467 | 10.93 |
| | PLSR/CRR | Z4 | Median | 1 | 0.921 | 3.622 | 11.42 |
| | OLSR/AOM | Z4 (AOM) | Mean | 1 | 0.914 | 3.646 | 11.50 |
| | PLSR/none | VNIR + SWIR1 | Mean | 5 | 0.913 | 3.789 | 11.95 |
| | SVM/none | VNIR + SWIR1 | Mean | 0.14 | 0.909 | 3.759 | 11.85 |
| | PLSR/DETREN3 | VNIR | Mean | 4 | 0.901 | 4.035 | 12.72 |
| | PLSR/MSCO | VNIR + SWIR1 | Mean | 3 | 0.901 | 4.036 | 12.73 |
| | PLSR/none | VNIR | Median | 6 | 0.875 | 4.546 | 14.34 |
| | SVM/none | VNIR | Median | 0.16 | 0.846 | 4.895 | 15.44 |
| %GAGB | PLSR/CRR | Z1 | Median | 7 | 0.762 | 6.852 | 9.82 |
| | PLSR/NGD3 | VNIR | Mean | 4 | 0.757 | 6.919 | 10.12 |
| | SVM/none | VNIR | Mean | 0.07 | 0.724 | 7.134 | 10.44 |
| | PLSR/RAB | VNIR + SWIR1 | Mean | 5 | 0.715 | 7.500 | 10.97 |
| | PLSR/none | VNIR | Median | 3 | 0.714 | 7.502 | 10.75 |
| | PLSR/NAR | VNIR + SWIR1 + SWIR2 | Mean | 3 | 0.705 | 7.628 | 11.16 |
| | PLSR/AOM | Z2Z3Z5 (AOM) | Median | 3 | 0.684 | 7.897 | 11.32 |
| | PLSR/none | VNIR + SWIR1 + SWIR2 | Median | 4 | 0.682 | 7.913 | 11.34 |
| | PLSR/none | VNIR + SWIR1 | Median | 3 | 0.678 | 7.947 | 11.39 |
| | SVM/none | VNIR + SWIR1 | Mean | 0.02 | 0.650 | 8.047 | 11.53 |
| | SVM/none | VNIR + SWIR1 + SWIR2 | Median | 0.02 | 0.655 | 7.991 | 11.69 |
| | OLSR/MBD | Z5 | Median | 1 | 0.608 | 8.502 | 12.19 |
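Performance of Ordinary Least Squares Regression (OLSR) for predicting total (TAGB), green (GAGB) and percentage of green (%GAGB) grass/clover biomass using indices derived from the continuum removed spectra. In bold: most accurate models.

| Variable | Zone (MBD) | Averaging (MBD) | R^{2} (MBD) | RMSE (MBD) | Zone (AOM) | Averaging (AOM) | R^{2} (AOM) | RMSE (AOM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TAGB | Z1 | Median | 0.582 | 9.950 | Z1 | Mean | 0.594 | 9.810 |
| | Z2 | Mean | 0.537 | 10.476 | Z2 | Median | 0.577 | 10.008 |
| | Z3 | Mean | 0.650 | 9.110 | Z3 | Mean | 0.641 | 9.226 |
| | Z5 | Mean | 0.599 | 9.748 | Z5 | Mean | 0.642 | 9.216 |
| GAGB | Z1 | Median | 0.728 | 6.483 | Z1 | Mean | 0.722 | 6.546 |
| | Z2 | Median | 0.669 | 7.146 | Z2 | Median | 0.719 | 6.587 |
| | Z3 | Mean | 0.870 | 4.470 | Z3 | Mean | 0.866 | 4.550 |
| | Z5 | Mean | 0.743 | 6.293 | Z5 | Median | 0.797 | 5.593 |
| %GAGB | Z1 | Median | 0.567 | 8.931 | Z1 | Median | 0.554 | 9.064 |
| | Z2 | Median | 0.603 | 8.551 | Z2 | Median | 0.591 | 8.674 |
| | Z3 | Median | 0.557 | 9.034 | Z3 | Mean | 0.552 | 9.080 |
| | Z4 | Median | 0.524 | 9.359 | Z4 | Mean | 0.523 | 9.370 |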
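Performance of PLSR for predicting total (TAGB) grass/clover biomass using indices which consider the absorption feature Z4 derived from the continuum removed spectra. In bold: most accurate models.

| Zones | Factors (MBD) | Averaging (MBD) | R^{2} (MBD) | RMSE (MBD) | Factors (AOM) | Averaging (AOM) | R^{2} (AOM) | RMSE (AOM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Z1Z4 | 2 | Median | 0.709 | 8.595 | 2 | Median | 0.708 | 8.612 |
| Z2Z4 | 2 | Median | 0.724 | 8.369 | 2 | Median | 0.721 | 8.406 |
| Z4Z5 | 2 | Median | 0.725 | 8.347 | 2 | Median | 0.723 | 8.386 |
| Z1Z2Z4 | 2 | Median | 0.683 | 8.971 | 2 | Median | 0.679 | 9.023 |
| Z1Z3Z4 | 3 | Median | 0.786 | 7.375 | 3 | Median | 0.727 | 8.317 |
| Z1Z4Z5 | 2 | Median | 0.685 | 8.934 | 2 | Median | 0.692 | 8.838 |
| Z2Z4Z5 | 2 | Mean | 0.710 | 8.579 | 2 | Mean | 0.705 | 8.646 |
| Z3Z4Z5 | 3 | Median | 0.794 | 7.237 | 3 | Median | 0.730 | 8.274 |
| Z1Z2Z3Z4 | 4 | Median | 0.772 | 7.599 | 3 | Median | 0.739 | 8.144 |
| Z1Z2Z4Z5 | 2 | Mean | 0.679 | 9.025 | 2 | Median | 0.672 | 9.121 |
| Z1Z3Z4Z5 | 4 | Median | 0.772 | 7.611 | 3 | Median | 0.716 | 8.489 |
| Z2Z3Z4Z5 | 4 | Median | 0.780 | 7.473 | 2 | Median | 0.725 | 8.347 |
| Z1Z2Z3Z4Z5 | 5 | Median | 0.758 | 7.832 | 2 | Mean | 0.724 | 8.360 |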
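Performance of PLSR for predicting green (GAGB) grass/clover biomass using indices which consider the absorption feature Z4 derived from the continuum removed spectra. In bold: most accurate models.

| Zones | Factors (MBD) | Averaging (MBD) | R^{2} (MBD) | RMSE (MBD) | Factors (AOM) | Averaging (AOM) | R^{2} (AOM) | RMSE (AOM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Z1Z4 | 2 | Mean | 0.914 | 3.762 | 2 | Mean | 0.924 | 3.539 |
| Z2Z4 | 2 | Median | 0.913 | 3.783 | 2 | Mean | 0.918 | 3.675 |
| Z4Z5 | 2 | Median | 0.913 | 3.786 | 2 | Median | 0.921 | 3.607 |
| Z1Z2Z4 | 3 | Median | 0.915 | 3.745 | 3 | Median | 0.925 | 3.512 |
| Z1Z4Z5 | 2 | Mean | 0.918 | 3.689 | 3 | Mean | 0.922 | 3.599 |
| Z2Z3Z4 | 3 | Mean | 0.915 | 3.738 | 3 | Mean | 0.929 | 3.432 |
| Z2Z4Z5 | 2 | Mean | 0.900 | 4.060 | 3 | Median | 0.921 | 3.607 |
| Z3Z4Z5 | 3 | Mean | 0.915 | 3.755 | 3 | Mean | 0.931 | 3.377 |
| Z1Z2Z4Z5 | 3 | Mean | 0.913 | 3.782 | 4 | Median | 0.921 | 3.618 |
| Z1Z3Z4Z5 | 3 | Median | 0.917 | 3.703 | 4 | Mean | 0.936 | 3.249 |
| Z2Z3Z4Z5 | 2 | Mean | 0.898 | 4.098 | 4 | Mean | 0.929 | 3.427 |
| Z1Z2Z3Z4Z5 | 3 | Median | 0.917 | 3.702 | 5 | Mean | 0.931 | 3.367 |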