2.2. Laboratory and NIRS Analysis
All leaf, stem, pod, and whole-plant samples were oven-dried at 60 °C to a constant weight. Dried samples were ground to pass a 2-mm screen using a Wiley mill. Total nitrogen concentration in each sample was determined by the flash combustion method (Model Vario Macro, Elementar Americas, Inc., Mt. Laurel, NJ, USA) and converted to CP by multiplying by a factor of 6.25 (Table 1). The IVTD of each sample was obtained following the Daisy Digester procedures (ANKOM Technology, Macedon, NY, USA). The NDF and ADF concentrations were determined only in samples of guar and tepary bean, using the batch fiber analyzer techniques (ANKOM Technology, Macedon, NY, USA).
Aliquots of the ground samples were packed into ring cups so as to eliminate voids. The spectral reflectance (R) of monochromatic light, averaged over 10 spectra per sample, was collected with a scanning spectrophotometer (Model SpectraStar 2600 XT-R, Unity Scientific, Columbia, MD, USA). Spectral data were recorded as the logarithm of the inverse of reflectance [log(1/R)] at 1-nm intervals over the range of 680–2600 nm.
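As a minimal illustration of this transformation (not part of the instrument software), the averaged reflectance spectra could be converted to log(1/R) as follows, assuming they are stored as a NumPy array; all values below are placeholders:

```python
import numpy as np

# Hypothetical array of averaged reflectance spectra: one row per sample,
# one column per wavelength (680-2600 nm at 1-nm intervals -> 1921 bands).
wavelengths = np.arange(680, 2601)                 # nm
reflectance = np.random.uniform(0.1, 0.9,          # placeholder data
                                size=(90, wavelengths.size))

# Convert reflectance R to apparent absorbance log(1/R), as used for NIRS.
absorbance = np.log10(1.0 / reflectance)
```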
2.3. Calibration Techniques
Partial least squares (PLS) is an extensively used class of statistical methods that includes regression, classification, and dimension-reduction techniques. It uses latent variables, also called score vectors, to model the relationship between input and response variables. For regression problems, PLS first generates the latent variables from the given data and then uses them as new predictor variables. There are different types of PLS, distinguished by the technique employed to extract the latent variables. Two approaches are used to extend PLS to model non-linear relations among data. The first approach is to reformulate the linear relationship between the score vectors $t$ and $u$ by a non-linear model:

$$u = g(t, c) + e$$

where $g(\cdot)$ is a continuous function that models the existing non-linear relation. Generally, $g$ is modeled using artificial neural networks, smoothing splines, polynomials, or radial basis functions. The remaining variables $e$ and $c$ denote a residual vector and a weight vector, respectively.
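To illustrate this first approach, the following sketch extracts the score vectors from a linear PLS fit and then models the inner relation between them with a quadratic polynomial, one of the function classes mentioned above. The scikit-learn estimator, placeholder data, and the choice of a second-degree polynomial are assumptions for illustration only, not the implementation used in this study:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 1921))               # placeholder spectra
y = rng.normal(size=70)                       # placeholder reference values

# Fit a linear PLS model and extract the first pair of score vectors.
pls = PLSRegression(n_components=5).fit(X, y)
t = pls.x_scores_[:, 0]                       # input-side score vector t
u = pls.y_scores_[:, 0]                       # response-side score vector u

# Replace the linear inner relation between t and u with a non-linear model
# u = g(t, c) + e, here a quadratic polynomial with weight vector c.
c = np.polyfit(t, u, deg=2)
u_hat = np.polyval(c, t)
residual = u - u_hat                          # residual vector e
```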
The second approach extends PLS by applying kernel-based learning. The kernel PLS method transforms the input data into a higher-dimensional feature space and estimates a linear PLS model in that space. To avoid explicitly evaluating the mapping function $\phi$ that projects the data into feature space, kernel PLS applies the kernel trick, which exploits the fact that the value of the inner product of two vectors $x_i$ and $x_j$ in feature space can be calculated with a kernel function $K$ [20]:

$$K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$$

By using the kernel function, the score vectors $t$ and $u$ can be identified and used to define the non-linear relationship. The kernel PLS approach can model complex non-linear relations while remaining simple in terms of implementation and computation.
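One simple way to realize kernel PLS, sketched below with placeholder data, is to approximate the feature-space mapping $\phi$ explicitly (here via a Nyström approximation of an RBF kernel) and then fit an ordinary linear PLS model in that space; a full kernel PLS implementation would instead operate directly on the kernel matrix. The library, kernel choice, and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(70, 1921))               # placeholder spectra
y = rng.normal(size=70)                       # placeholder reference values

# Map inputs into an (approximate) RBF feature space, then fit linear PLS
# there: linear PLS in feature space = non-linear (kernel) PLS in input space.
kpls = make_pipeline(
    Nystroem(kernel="rbf", gamma=1e-4, n_components=50, random_state=1),
    PLSRegression(n_components=5),
)
kpls.fit(X, y)
y_pred = kpls.predict(X)
```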
Gaussian processes (GP) are kernel-based, probabilistic, non-parametric regression models. A Gaussian process is a set of random variables such that every finite subset of those variables has a joint Gaussian distribution. A Gaussian process $f(x)$ can be described by a mean function $m(x)$ and a covariance function $k(x, x')$. The covariance function defines the smoothness of the responses, and a basis function $h(x)$ projects the input-space vector $x$ into a higher-dimensional feature-space vector. A Gaussian process regression (GPR) model describes the response by using latent variables from a Gaussian process. A GPR model is represented as:

$$y = h(x)^{T} \beta + f(x)$$

where the latent variables $f(x_i)$, $i = 1, \ldots, n$, are from a zero-mean GP having a covariance function $k(x, x')$ [21]. The covariance is specified by kernel parameters, which are also known as hyperparameters. GPR is a probabilistic model, and an instance of response $y_i$ is:

$$P\left(y_i \mid f(x_i), x_i\right) \sim N\left(y_i \mid h(x_i)^{T} \beta + f(x_i), \sigma^2\right)$$

GPR is non-parametric in that there is a latent variable $f(x_i)$ for each observation $x_i$. The noise variance $\sigma^2$, the basis-function coefficients $\beta$, and the hyperparameters of the kernel can be estimated from the data while training the GPR model.
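A minimal GPR sketch is given below, assuming a squared-exponential covariance with an additive white-noise term and placeholder data; the kernel hyperparameters and the noise variance $\sigma^2$ are estimated from the data during fitting, as described above. Note that scikit-learn's implementation omits the explicit basis term $h(x)^{T}\beta$ apart from optional response normalization:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(2)
X = rng.normal(size=(70, 1921))               # placeholder spectra
y = rng.normal(size=70)                       # placeholder reference values

# Squared-exponential covariance plus a white-noise term; the length scale,
# signal variance, and noise variance (the hyperparameters) are estimated
# from the data by maximizing the marginal likelihood during fitting.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Being probabilistic, GPR returns a predictive mean and standard deviation.
mean, std = gpr.predict(X, return_std=True)
```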
Support vector machine (SVM) is a popular machine-learning algorithm used for identifying linear as well as non-linear dependencies between input vectors and outputs. SVMs are non-parametric models, which means that parameters are selected, estimated, and tuned in such a way that the model capacity matches the data complexity [21]. Generally, an SVM starts by observing the multivariate inputs $x_i$ and outputs $y_i$, estimates its parameters $w$, and then learns the mapping function $f(x)$ that approximates the underlying dependency between inputs and responses. The obtained function, also known as a hyperplane, must have a maximal margin (for classification) or a minimal error of approximation (for regression) when predicting new data. In the case of SVM regression, Vapnik's error (loss) function with $\varepsilon$-insensitivity is used. It finds a regression function $f(x)$ that deviates from the actual responses $y$ by values no greater than $\varepsilon$ and is, at the same time, as flat as possible.
For non-linear regression problems, the SVM maps the input space into a higher-dimensional feature space using a mapping $\phi(x)$ and finds a linear regression hyperplane in that space. However, there is no need to know the mapping $\phi$ explicitly, as the kernel function $K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$, which is the inner product of the vectors $\phi(x_i)$ and $\phi(x_j)$, can be used to find the optimal regression hyperplane in the extended space. Many kernel functions are available for describing non-linear regressions, such as the polynomial kernel, the RBF kernel, the Gaussian kernel, and the normalized polynomial kernel. The learning problem, in classification as well as in regression, leads to solving a quadratic programming (QP) problem. Sequential minimal optimization (SMO) is considered the most popular optimizer for solving SVM problems [22]. It divides the large QP problem into a set of small QP problems and solves them analytically.
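The following sketch shows an $\varepsilon$-insensitive SVM regression with an RBF kernel; the underlying libsvm solver is an SMO-type decomposition method. All parameter values and data are placeholders rather than the settings used in this study:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(70, 1921))               # placeholder spectra
y = rng.normal(size=70)                       # placeholder reference values

# epsilon-insensitive SVR with an RBF kernel: deviations smaller than epsilon
# are not penalized, and C trades flatness of f(x) against larger deviations.
# The underlying libsvm solver decomposes the QP problem SMO-style.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale").fit(X, y)
y_pred = svr.predict(X)
```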
2.4. Performance Evaluation
In addition to calibration statistics, 10-fold cross-validation and external validation were conducted to assess the performance of the calibration techniques. Ten-fold cross-validation is a statistical procedure for evaluating machine-learning models in which ten repeated hold-out runs are performed and their results averaged. In each run, the model is trained on 90% of the data points and tested on the remaining 10%, so that every data point is used nine times for training and once for testing. For each species-based model, the original dataset of 90 samples per species was split into two subsets (Figure 2). A subset of 70 samples was used for calibration and 10-fold cross-validation. The remaining subset of 20 samples was used only for external validation and was included in neither calibration nor cross-validation of any model. For the global model, the original dataset consisted of 250 samples, comprising 50 samples each of guar, tepary bean, soybean, pigeon pea, and mothbean. These samples were divided into six subsets (Figure 2). One random subset of 150 samples (30 samples per species) was employed for calibration as well as 10-fold cross-validation. Each of the remaining five subsets, comprising 20 samples of an individual species, was used for external validation.
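The splitting and cross-validation scheme for a species-based model could be expressed as in the sketch below; the estimator, random seed, and data are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 1921))   # 90 placeholder samples of one species
y = rng.normal(size=90)

# Hold out 20 samples for external validation; the other 70 are used for
# calibration and 10-fold cross-validation only.
X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=20, random_state=4)

model = PLSRegression(n_components=5)

# 10-fold cross-validation on the 70 calibration samples: each sample is
# used nine times for training and once for testing.
cv_scores = cross_val_score(model, X_cal, y_cal, cv=10, scoring="r2")

# Final calibration on all 70 samples, then external validation on the 20.
model.fit(X_cal, y_cal)
r2_v = model.score(X_val, y_val)
```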
Coefficients of determination, being upper-bounded by 1.0, allow meaningful comparisons across different models and were therefore used here as estimates of prediction accuracy. Specifically, the coefficient of determination in calibration (R2c), in cross-validation (R2cv), and in external validation (R2v) quantified the proportion of variance in the data captured by each model at the calibration, cross-validation, and external validation stages, respectively. Additionally, root mean squared errors, termed RMSEc, RMSEcv, and RMSEv for calibration, cross-validation, and external validation, respectively, were presented for comparing the models.
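Assuming predicted and reference values are available as arrays (the numbers below are placeholders), these statistics could be computed as follows:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Placeholder reference values and model predictions for one validation set.
y_true = np.array([18.2, 21.5, 16.9, 24.1, 19.8])
y_pred = np.array([17.9, 22.0, 17.4, 23.3, 20.1])

# Coefficient of determination (e.g., R2v) and root mean squared error
# (e.g., RMSEv); the same formulas apply at calibration (R2c, RMSEc) and
# cross-validation (R2cv, RMSEcv).
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```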