2.3. Data Wrangling and Feature Engineering
The data wrangling process comprised outlier treatment, data integrity analysis, and the imputation of missing values; rows in which more than 80% of the features were null were removed from the database. Several biogeochemical and physical variables were selected for the analysis, including chlorophyll-a (Chl), total nitrogen (N), total phosphorus (P), silica (Si), chemical oxygen demand (DQO), dissolved oxygen (O_D), oxygen saturation percentage (O_D_sat), pH, temperature (Temp), relative humidity (Hum), wind velocity (Wind), conductivity (Conduct), and Secchi-depth transparency (Trans).
Additionally, location variables (latitude and longitude) and time variables (year, month, day, and hour) were selected. Finally, dummy variables were created to associate each measurement with its sampling station. In total, 26 covariates (independent variables) were used for the prediction of chlorophyll (the dependent variable).
Data cleaning was carried out according to each sampling station (Ensenada, Frutillar, Puerto Octay, Puerto Varas, Frutillar 2, Puerto Varas 2, Puerto Octay 2, and Zmax).
The cleaning steps were as follows:
Remove non-numerical values from each of the selected variables and replace them with null values.
Extract the year, month, and day for each measurement, verifying consistency and integrity.
Apply sensible imputation for the null values of each column using a robust central tendency measurement, the median.
Split the data into training and test sets. For each sampling station, the first 80% of measurements collected over time were selected for training, and the remaining 20% were used for testing (Table 1).
Standardize the numerical variables (N, P, Si, DQO, O_D, O_D_sat, pH, Temp, Wind, Hum, Conduct, Trans, and Chl) using the PowerTransformer method, which applies a power transform to make the probability distribution of a variable more Gaussian [48].
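The cleaning, splitting, and standardization steps above can be sketched as follows; the data and sizes are illustrative stand-ins, not the study's actual measurements.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Toy stand-in for one station's time-ordered record: a skewed
# variable (like Chl) with some missing values.
chl = rng.lognormal(mean=1.0, sigma=0.8, size=200)
chl[rng.choice(200, size=20, replace=False)] = np.nan

# 1) Impute nulls with the median, a robust central tendency measure.
median = np.nanmedian(chl)
chl_imputed = np.where(np.isnan(chl), median, chl)

# 2) Chronological 80/20 split: the first 80% of measurements in time
#    go to training, the remaining 20% to testing (no shuffling).
cut = int(0.8 * len(chl_imputed))
train, test = chl_imputed[:cut], chl_imputed[cut:]

# 3) PowerTransformer (Yeo-Johnson by default) fitted on the training
#    data only, then applied to the test data.
pt = PowerTransformer()
train_t = pt.fit_transform(train.reshape(-1, 1))
test_t = pt.transform(test.reshape(-1, 1))
```

Fitting the transformer on the training portion only avoids leaking test-set statistics into the standardization.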
2.4. Machine and Deep Learning Algorithms
This section describes the different modeling methodologies used for chlorophyll prediction. It is important to bear in mind that there is no perfect model, but considering different modeling perspectives provides a better idea of how feasible learning is for a given task, which is why several models are compared to identify which ones perform better at forecasting chlorophyll values. The analysis covers ensemble methods, including bagging (i.e., random forest) and boosting (XGBoost, AdaBoost, GradientBoosting, and LightGBM) strategies, support vector machines (SVMs), and neural networks (i.e., MLP and ANN). For each algorithm, a brief description is provided.
Random Forest
Random forest is a bagging algorithm introduced by [49] as an adaptation of the algorithm proposed by [50]. The mathematical foundations of random forest were described by Breiman at the end of the 20th century [51], and it has been among the most innovative machine learning techniques. Like boosting, random forests can be used for either classification (categorical response) or regression (continuous response) problems in supervised learning [49].
The random forest regressor alternative was selected to predict chlorophyll-a using a different number of decision trees in various subsamples from bootstrapped datasets constructed from the original dataset to improve the predictive accuracy by avoiding overfitting [49,51].
The idea in bagging is to reduce variance by constructing many noisy, approximate, unbiased models (i.e., decision trees). Trees are ideal candidates for bagging since they are designed to provide an understanding of complex interaction structures from data, and if they are grown with enough depth, bias can be reduced. The algorithm is described in the following steps:
Random Forest Algorithm
1. For b = 1 to B:
   (a) Draw a bootstrap sample $Z^{*}$ of size N from the training data.
   (b) Grow a random forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree until the minimum node size is reached:
       i. Select m variables at random from the p variables;
       ii. Pick the best variable/split point among the m;
       iii. Split the node into two daughter nodes.
2. Output the ensemble of trees $\{T_b\}_{1}^{B}$.
3. To predict at a new point x:
   Regression: $\hat{f}_{rf}^{B}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$.
   Classification: let $\hat{C}_b(x)$ be the class prediction of the bth random forest tree; then $\hat{C}_{rf}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_{1}^{B}$.
Multiple parameters were evaluated to identify the best configuration [52], including the maximum depth (15, 20, 25, 30), the number of trees (100, 120, 140, 150), the number of features considered for the best split (square root, log2), and the minimum number of samples needed to split an internal node (2, 3, 4, 5). In addition, default parameters such as the function to measure the quality of a split (squared error) were selected.
AdaBoost
AdaBoost is a machine learning algorithm built on the boosting technique, combining several weak estimators to reduce bias. The idea was proposed by Freund and Schapire and is one of the most common algorithms, with applications in numerous fields [53].
The AdaBoost regressor alternative (Ying et al., 2013) was selected to predict chlorophyll-a. The algorithm is described by the following steps:
AdaBoost Algorithm
1. Initialization: given training data $\{(x_i, y_i)\}_{i=1}^{m}$ from the instance space, where $x_i \in X$ and $y_i \in \{-1, +1\}$.
2. Initialize the distribution $D_1(i) = 1/m$.
3. For $t = 1, \dots, T$:
   (a) Train a weak learner $h_t$ using the distribution $D_t$;
   (b) Determine the weight $\alpha_t$ of $h_t$;
   (c) Update the distribution over the training set:
       $D_{t+1}(i) = \dfrac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$,
       where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a distribution.
4. Calculate the final score: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Several parameters were assessed to identify the best configuration [53,54], including the number of estimators (120, 140, 160, 180), the learning rate (0.01, 0.1, 0.5), which controls overfitting, with a higher learning rate increasing the contribution of each regressor, and the loss function used to update the weights (linear, square, exponential).
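The AdaBoost search can be sketched the same way; the data are synthetic stand-ins and the grid is a reduced subset of the ranges above.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                 # stand-in covariates
y = X[:, 0] + 0.1 * rng.normal(size=100)      # stand-in target

# Reduced grid mirroring the text: estimators, learning rate, and the
# loss used to reweight examples after each boosting round.
param_grid = {
    "n_estimators": [120, 140],
    "learning_rate": [0.1, 0.5],
    "loss": ["linear", "square"],
}
search = GridSearchCV(AdaBoostRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
```

The `loss` options here are the three accepted by scikit-learn's `AdaBoostRegressor` and match those listed in the text.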
Gradient Boosting
The gradient boosting regressor alternative was selected to predict chlorophyll-a. This is an ensemble learning technique constructed from the boosting methodology [55,56]. The algorithm was conceived to learn a functional mapping y = F(x; B), where B is the set of parameters of F, such that some cost function C is minimized [57].
Gradient Boosting Algorithm
1. Input: dataset D, loss function L, base learner h, number of iterations M, and learning rate $\eta$.
2. Initialize $f_0(x) = \arg\min_{\theta} \sum_{i=1}^{N} L(y_i, \theta)$.
3. For $m = 1, \dots, M$:
   (a) Compute the negative gradients $g_m(x_i) = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$;
   (b) Fit the base learner to the negative gradients: $h_m = \arg\min_{h} \sum_{i=1}^{N} \left(g_m(x_i) - h(x_i)\right)^2$;
   (c) Find the step size $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + \rho\, h_m(x_i)\right)$;
   (d) Update $f_m(x) = f_{m-1}(x) + \eta\, \rho_m h_m(x)$.
4. Return $f_M(x)$.
Several parameters were assessed to identify the best configuration [53,54], including the learning rate (0.01, 0.1), maximum depth (25, 30, 35), number of trees (70, 100, 120, 140), and the number of features for the best split (square root and log2). Furthermore, other hyperparameters were set, such as the loss function to be optimized (squared error) and the function to measure the quality of a split (friedman_mse).
XGBoost
XGBoost is a scalable machine learning algorithm based on the boosting methodology. In recent decades, it has been widely recognized as a highly proficient approach in several machine learning and data mining challenges, with advantages such as shorter training execution time, scalability, error reduction, simplified calculations, and lower computational cost [58].
The XGBoost regression alternative was selected for the regression task. XGBoost minimizes an objective function using regularization (L1 and L2) to penalize unnecessary complexity in the model. The training task is an iterative process, with new trees added and the error reduced in a serial process to improve the final prediction. It is similar to the gradient boosting algorithm, since gradient descent is used to minimize the loss as the complexity increases [58,59]. A description of the algorithm is presented in Figure 2.
For the XGBoost algorithm, four types of parameters were tuned. General parameters are related to the booster type, which is gbtree (gradient boosting). Booster parameters such as maximum depth (15, 20, 25, 30), learning rate (0.01, 0.1), and an L1 regularization term on weights (0, 0.3, 0.5) were considered. Finally, the learning task parameters and the command line parameters were left at their defaults.
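The tuned search space can be written down as a plain configuration. The dictionary below mirrors the ranges in the text; the parameter names follow the xgboost scikit-learn API (an assumption), and the xgboost package itself is not imported here.

```python
# Hypothetical spelling of the XGBoost search space described above;
# it would typically be passed to xgboost.XGBRegressor via a grid search.
xgb_param_grid = {
    "booster": ["gbtree"],           # general parameter: tree-based boosting
    "max_depth": [15, 20, 25, 30],   # booster parameter
    "learning_rate": [0.01, 0.1],    # booster parameter (eta)
    "reg_alpha": [0, 0.3, 0.5],      # L1 regularization term on weights
}

# Number of candidate configurations an exhaustive search would explore.
n_candidates = 1
for values in xgb_param_grid.values():
    n_candidates *= len(values)
```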
LightGBM
A LightGBM regression alternative was selected for predicting chlorophyll-a. This alternative belongs to the gradient boosting decision tree (GBDT) family, which includes popular machine learning algorithms such as XGBoost. It is a novel technique that addresses the multidimensionality problem with high efficiency and scalability using two techniques: gradient-based one-sided sampling (GOSS), which excludes a significant proportion of data instances using an information-gain criterion, and exclusive feature bundling (EFB), an effective method to reduce the number of features.
LightGBM has been shown to speed up the training process of conventional gradient boosting decision trees by more than 20 times while achieving almost the same accuracy, and it can be considered an alternative similar to the XGBoost algorithm [60]. The algorithm is described in the following steps:
LightGBM Algorithm
1. Input: dataset D, loss function L, base learner h, number of iterations M, the sampling ratio of large-gradient data (a), and the sampling ratio of small-gradient data (b).
2. Merge mutually exclusive features using the exclusive feature bundling (EFB) method.
3. Initialize $f_0(x)$.
4. For $m = 1, \dots, M$:
   (a) Compute the absolute values of the gradients: $g_i = \left|\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}$;
   (b) Resample the dataset using the gradient-based one-side sampling (GOSS) method;
   (c) Compute the information gains on the sampled data;
   (d) Get a new decision tree $h_m$ and update $f_m(x)$.
5. Return $f_M(x)$.
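The GOSS resampling step can be sketched in isolation: keep the instances with the largest gradients, sample a fraction of the rest, and reweight the sampled small-gradient instances to keep the data distribution approximately unchanged. The gradients and ratios below are illustrative.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Gradient-based one-side sampling: keep the top a-fraction of
    instances by |gradient|, sample a b-fraction of the rest, and
    amplify the small-gradient weights by (1 - a) / b."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # large gradients first
    n_top = int(a * n)
    top = order[:n_top]                      # always kept
    rest = order[n_top:]
    small = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, small])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b            # compensate for under-sampling
    return idx, weights

g = np.random.default_rng(1).normal(size=100)
idx, w = goss_sample(g, a=0.2, b=0.1)
```

The reweighting is what lets the information-gain computation in the next step behave as if it saw the full dataset.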
Several parameters were assessed to identify the best configuration [60], including the boosting type (gbdt = traditional gradient boosting decision tree, dart = dropouts meet multiple additive regression trees, goss = gradient-based one-sided sampling), the learning rate (0.01, 0.1), the maximum depth (20, 30, 35), and the number of estimators (70, 100, 120, 140).
Support Vector Machine (SVM)
The support vector regression (SVR) algorithm was developed by [61]. SVR finds a function that estimates the relationship between the input and output variables [62] using the following equation:
$f(x) = w\,\varphi(x) + b$,
where $f(x)$ is the network output, $x$ is the input data, which is mapped into a higher-dimensional feature space using the nonlinear mapping function $\varphi$, and $w$ and $b$ are coefficients determined by minimizing the regularized risk function based on the network output and the real value [63].
We evaluated several kernel functions to select the optimum performance: the linear kernel, given by $\langle x, x'\rangle$; the polynomial kernel (degrees 3 and 4), which represents the similarity of the vectors in the training dataset in a feature space over polynomials of the original variables, defined by $(\gamma\langle x, x'\rangle + r)^d$, where $d$ denotes the degree and $r$ a constant; and the RBF (radial basis function) kernel, which applies a radial basis transformation given by $\exp(-\gamma\,\|x - x'\|^2)$, where $\gamma$ is a parameter that sets the width of the kernel, usually under a scaling pattern.
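The three kernel families above can be compared side by side; the data are a synthetic stand-in, and `gamma="scale"` is the scaling pattern mentioned for the RBF width.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(80, 2))              # stand-in covariates
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)   # nonlinear stand-in target

# One SVR per kernel family described above.
models = {
    "linear": SVR(kernel="linear"),
    "poly3": SVR(kernel="poly", degree=3),
    "rbf": SVR(kernel="rbf", gamma="scale"),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

On a nonlinear target like this, the RBF kernel would typically fit better than the linear one, which is the kind of contrast the kernel comparison is meant to expose.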
Multilayer Perceptron (MLP)
The multilayer perceptron (MLP) is an artificial neural network that builds on the perceptron introduced by Frank Rosenblatt in 1957; it generates a set of outputs from the inputs using multiple hidden layers of connected nodes, arranged as a directed graph between the input and output layers, and is trained with the backpropagation algorithm.
A few combinations of hidden-layer sizes, (32, 16, 8), (32, 16), and (16, 8, 4), were considered. Furthermore, we evaluated several activation functions for the hidden layers (tanh and relu), different solvers (sgd and Adam) for weight optimization, an L2 regularization term (0.001, 0.05), and two learning-rate schedules (constant and adaptive).
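One configuration from this search space can be sketched as follows; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 6))                       # stand-in covariates
y = X @ rng.normal(size=6) + 0.1 * rng.normal(150)  # stand-in target

# One point in the search space described above: three hidden layers
# (32, 16, 8), relu activation, Adam solver, L2 term alpha = 0.001,
# and a constant learning rate.
mlp = MLPRegressor(
    hidden_layer_sizes=(32, 16, 8),
    activation="relu",
    solver="adam",
    alpha=0.001,
    learning_rate="constant",
    max_iter=500,
    random_state=0,
)
mlp.fit(X, y)
```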
Artificial Neural Network (ANN)
This is a deep learning algorithm for classification and regression tasks from multiple inputs, with the ability to handle complex environmental interactions between variables [64,65,66]. ANNs are usually designed with an input layer, several hidden layers, and an output layer [67]. The general formula is given by $Y = f(X, W)$, where Y is the vector of model outputs, X is the vector of inputs, W represents the weights, and the function f represents the relationship between the outputs, inputs, and parameters of the model [68]. In this case, the relu activation function was selected, and two hidden layers, the first with 32 neurons and the second with 16 neurons, define the geometric configuration to avoid bottlenecks in the learning task. The Adam optimizer and 100 epochs were selected for the training process.