Modeling and Prediction of Environmental Factors and Chlorophyll a Abundance by Machine Learning Based on Tara Oceans Data

Cui, Zhendong; Du, Depeng; Zhang, Xiaoling; Yang, Qiao

doi:10.3390/jmse10111749


by Zhendong Cui ¹, Depeng Du ¹, Xiaoling Zhang ² and Qiao Yang ²,³,*

¹ School of Computer and Control Engineering, Yantai University, Yantai 264005, China

² ABI Group, Zhejiang Ocean University, Zhoushan 316022, China

³ Donghai Laboratory, Zhoushan 316021, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2022, 10(11), 1749; https://doi.org/10.3390/jmse10111749

Submission received: 19 October 2022 / Revised: 8 November 2022 / Accepted: 10 November 2022 / Published: 14 November 2022

(This article belongs to the Section Marine Environmental Science)


Abstract

Understanding the inherent relationships and evolution patterns among environmental factors in the oceans is of great theoretical and practical significance. In this study, we used scientific data obtained by the Tara Oceans Project (TOP) to conduct a comprehensive correlation analysis of marine environmental factors. After processing the raw data, we evaluated different machine learning methods for modeling and predicting chlorophyll a (Chl-a) concentrations in the surface water layer of selected Tara Oceans data. A Pearson correlation and characteristic importance analysis between marine environmental factors and Chl-a concentrations was then conducted, and a comprehensive correlation model for environmental factors was established. With these data, we developed a new prediction model for Chl-a abundance based on the eXtreme Gradient Boosting (XGBoost) algorithm with an intelligent parameter optimization strategy. The proposed model was used to analyze and predict Chl-a abundance in the TOP data. The predicted results were also compared with those of three other widely used machine learning methods: random forest (RF), support vector regression (SVR) and linear regression (LR). Our results show that the proposed comprehensive correlation evaluation model can identify the effective features closely related to Chl-a abundance, and that the prediction model can reveal the potential relationship between environmental factors and Chl-a concentrations in the oceans.

Keywords:

comprehensive correlation model; multi-environmental factors prediction model; chlorophyll a; machine learning; Tara Oceans data; XGBoost

1. Introduction

In recent years, human activities have had a significant impact on the marine environment [1]. To better understand the changes in the ocean environment and reduce the effects of ocean disasters, researchers worldwide have conducted numerous observations, experiments and analyses, and have accumulated a great deal of important data for subsequent in-depth research [2,3,4,5,6,7,8]. With the rapid development of artificial intelligence (AI), big data, the Internet of Things (IoT) and other advanced technologies, it is increasingly important to apply these tools to the mysteries of the ocean [9]. Machine learning is primarily concerned with finding patterns in empirical data and building intelligent models that build on traditional observation, detection and numerical analysis. The combination of big data and machine learning is expected to become a new paradigm for studying the evolution of complex ocean phenomena [10].

From 2009 to 2013, the Tara Oceans Project (TOP) conducted a global voyage and collected numerous environmental measurements and a wide range of plankton communities, composed mainly of the larger and more conspicuous diatoms and dinoflagellates and the minuscule picocyanobacteria Prochlorococcus and Synechococcus [4,11]. These photosynthetic plankton live in the sunlit upper layer, down to the depths that light can still reach, and play extraordinary biogeochemical roles, including oxygen generation, recycling of elemental nutrients, and removal of CO₂ from the atmosphere to generate organic biomass through primary production [4]. The project provided valuable data for subsequent analysis using modern sequencing and imaging techniques [2,12]. The accumulated scientific data and research results laid a solid foundation for subsequent in-depth research by global scientific communities [5,13,14]. Given the urgent need to understand how marine phenomena occur, develop, and change, research on data analysis and prediction modeling based on intelligent algorithms has attracted increasing attention [5,10,15]. The pigment chlorophyll a (Chl-a) is regarded as an important component of various phytoplankton, which account for 1–2% of the dry weight of organic matter in the oceans [4,11,16]. However, the rapid prediction of Chl-a through modeling based on the physical and chemical indexes of the TOP data has not been carried out effectively. Depending on the sample characteristics, eXtreme Gradient Boosting (XGBoost), artificial neural networks (ANNs), random forest (RF), support vector machine (SVM), linear regression (LR), and other technologies can be applied for the modeling and prediction of marine phenomena [17,18,19,20].

Previously, Kisi and Parmar [21] predicted chemical oxygen contents based on the abundance of free ammonia, total ammonia nitrogen, water temperature and E. coli. Their results showed that the least-squares support vector machine (LS-SVM) and the M5 model tree performed better than the multivariate adaptive regression spline method. Yajima and Derot [22] predicted trends in the time series of Chl-a in water bodies with the random forest (RF) model, and identified the most influential parameters in those water bodies. Sun et al. [23] proposed a method for identifying and removing redundant data to solve the maintenance problems associated with database changes, and mined valuable information from a massive amount of data on marine water quality. Zeng and Tang [24] demonstrated that the support vector machine (SVM) is superior to other models for reconstructing CO₂ in the global ocean surface when the sample size is small. Misra et al. [25] used SVM to measure the reflected light through echo sounding to obtain shallow water depth data around St. Martin Island and in Aramian Bay in the Netherlands, and found that SVM provided comparable or better performance for shallow depths. Ling et al. [26] established a K-fold cross-validation SVM model to predict and evaluate the degradation of concrete strength in a complex marine environment, and the model performed well. Franklin et al. developed a mathematical model using multiple linear regression (MLR) and principal component analysis (PCA) to predict Chl-a concentrations with a data-driven modeling approach, and found that PCA and MLR helped to identify the relationships among the dependent and predictor variables and eliminated collinearity problems [27]. An algorithm combining artificial neural networks (ANN) and SVM was implemented to forecast algal growth and eutrophication, and demonstrated good applicability and accuracy [18]. A hybrid algorithm based on optically fuzzy clustering was used for Chl-a estimation and performed better than any single algorithm [28]. A framework based on the Hilbert–Huang transformation and convolutional neural networks (CNN) was used to analyze the Chl-a content of the ocean from satellite remote sensing observations. The results showed that the spatial mode of Chl-a mainly depends on the distribution of phytoplankton [29].

The XGBoost algorithm builds on the gradient boosting decision tree algorithm and performs a second-order Taylor expansion of the loss function. It helps to avoid overfitting, increases the convergence speed, adapts well to marine science problems, and has received increasing attention [17]. Li et al. proposed a prediction method for water quality parameters based on the XGBoost model, which effectively improved the prediction accuracy of dissolved oxygen [29]. The XGBoost algorithm was also used in the optimization of pollutant concentration, and experimental results showed that it can better capture the spatial and temporal variation patterns of pollutants [29]. Shapley Additive Explanations (SHAP) was used to interpret XGBoost and to demonstrate how spatial effects can be extracted from machine learning models. Simulations showed that XGBoost estimates spatial effects in a manner similar to the simple linear model and mixed geographically weighted regression models. Nasir et al. (2022) used AI to classify water quality, and found that the boosting algorithm is a reliable approach for water quality classification. A hybrid model combining XGBoost, four generalized autoregressive conditional heteroscedasticity models, and a multi-layer perceptron (MLP) model was proposed to predict PM2.5 concentrations and volatility [30], and it performed well in long-term forecasting. Recently, Wang et al. built a hybrid model based on XGBoost to predict strain in historical timber buildings, and found that its predictive performance was better than that of other models [31].

In the oceans, many environmental factors influence Chl-a content [32,33,34,35], and the relationship between Chl-a distribution and environmental factors is complex and nonlinear. The TOP conducted water sampling at numerous sites in the oceans [4]. For some of these sites, the microbial metagenome and metatranscriptome were analyzed, and this biological and genetic analysis has produced a substantial body of research results [5,11]. It revealed the eukaryotic plankton diversity in the sunlit ocean: most eukaryotic plankton biodiversity belonged to heterotrophic protistan groups, particularly those known to be parasites or symbiotic hosts, such as those closely related to the marine phycosphere microbiota [36,37,38,39,40,41,42,43,44], which support global biological and geochemical processes. However, research on marine primary productivity factors and their correlation with the physical and chemical factors of water using Tara Oceans data remains very limited, especially for Chl-a. The Tara Oceans expedition covered a long time period and a wide oceanic area with a small-tonnage sailing boat, and both the sampling equipment and the human and material resources were expensive. Thus, it is vital to use the scientific data obtained to investigate the potential correlations among marine environment indicators and to reveal the nature of the global ocean system [5,10,45], especially the abundance and distribution of Chl-a and its correlation with other environmental factors [46,47,48].

This study had two main research foci. The first was to find an effective correlation model that identifies the factors most strongly correlated with Chl-a, and the second was to build an effective Chl-a prediction model based on these correlated marine environmental factors. To achieve these purposes, we first used marine water quality indicators and a machine learning method to construct a data cleaning procedure for the original data. Then, a comprehensive attribute correlation evaluation model of seawater Chl-a was established using Pearson correlation and feature importance collaborative evaluation techniques. A prediction model was then created by combining the most strongly correlated factors with XGBoost regression, an intelligent parameter optimization strategy and recurrent leave-one-out cross-validation (LOOCV) techniques [32]. The proposed comprehensive correlation evaluation model proved effective, and the prediction model revealed the potential relationship between ecological environmental factors and Chl-a.

2. Materials and Methods

2.1. Data Sources

The TOP collected water samples from 210 sites, indicated by red dots in Figure 1, which was adopted and modified from Pesant et al. [3]. There were 102 sites with relatively complete data in the DCM layer. Three samples with abnormal value comparisons are also shown in Figure 1. The raw data were obtained from http://ocean-microbiome.embl.de/data/OM.CompanionTables.xlsx (accessed on 1 October 2022).

2.2. Data Cleaning

The number of samples was limited by the difficulty of ocean investigation and the financial investment it requires. The samples contained detection errors and other inconsistencies, and some of them were partial or incomplete. Anomalous data may be caused by specific conditions such as location, bioturbation, and the influence of ocean currents. Extremely abnormal data were excluded, and data that were not suitable for modeling according to marine environmental indicators were also excluded from further analysis.

For a sample matrix $S_{m \times n}$,

$$S_{m \times n} = \begin{bmatrix} S_{0} \\ \vdots \\ S_{i} \\ \vdots \\ S_{m-1} \end{bmatrix} = \left[\begin{array}{ccc|c} S_{00} & \dots & S_{0(n-2)} & S_{0(n-1)} \\ \vdots & \ddots & \vdots & \vdots \\ S_{(m-1)0} & \dots & S_{(m-1)(n-2)} & S_{(m-1)(n-1)} \end{array}\right]$$

(1)

where the row vector $S_{i}$ is the $i$th original sample and the element $S_{ij}$ is the $j$th factor of the $i$th original sample.

The sample point evaluation vector is $E_{1 \times m}$,

$$E_{1 \times m} = [E_{0}, \dots, E_{m-1}]$$

(2)

where $E_{i}$ is the evaluation weight of the $i$th sample. The value of $E_{i}$ was set according to the characteristics of the sample, such as its location, its distance from other points and the density of nearby sample points, subject to $\sum E_{i} = 1$ and $E_{i} \in [0, 1]$.

We calculated $E \times S$ to obtain the sample reference vector $B_{1 \times n}$,

$$B_{1 \times n} = (E \times S) / m = [B_{0}, \dots, B_{n-1}]$$

(3)

and built the sample evaluation matrix $V_{m \times n}$,

$$V_{m \times n} = \left[\begin{array}{ccc|c} V_{00} & \dots & V_{0(n-2)} & V_{0(n-1)} \\ \vdots & \ddots & \vdots & \vdots \\ V_{(m-1)0} & \dots & V_{(m-1)(n-2)} & V_{(m-1)(n-1)} \end{array}\right]$$

(4)

Each element $V_{ij}$ is obtained from Equations (3) and (4),

$$V_{ij} = \begin{cases} 1, & S_{ij} < α \times B_{j} \text{ or } S_{ij} > β \times B_{j} \\ 0, & \text{else} \end{cases}$$

(5)

where $α$ and $β$ are the lower and upper limit coefficients of evaluation, respectively.

Next, we built the sample attribute selection matrix,

$$δ_{m \times 1} = \begin{bmatrix} δ_{0} \\ \vdots \\ δ_{i} \\ \vdots \\ δ_{m-1} \end{bmatrix}$$

(6)

where

$$δ_{i} = \begin{cases} 0, & η V_{i(n-1)} + \sum_{j=0}^{n-2} V_{ij} / ξ \geq 1 \\ 1, & \text{else.} \end{cases}$$

The coefficients $η$ and $ξ$ are the weight coefficient of the target and the weight coefficient of the other attributes, respectively. For a single-label predicted value, $η = 1$ and $ξ \geq 1$.

Finally, the samples were cleaned according to the value of $δ_{i}$ in the sample attribute selection matrix $δ$: $δ_{i} = 1$ indicates that sample $S_{i}$ was retained; otherwise, the sample was discarded and not used in the subsequent modeling analysis.
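The cleaning procedure of Equations (1)–(6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `clean_samples`, the toy matrix, and the values chosen for α, β, η and ξ are hypothetical and are not taken from the study.

```python
def clean_samples(S, E, alpha, beta, eta=1.0, xi=1.0):
    """Return (kept_rows, delta) for a sample matrix S.

    S: list of m rows with n values each; the last column is the target.
    E: list of m evaluation weights that sum to 1.
    alpha, beta: lower and upper limit coefficients (hypothetical values).
    """
    m, n = len(S), len(S[0])
    # Equation (3): reference vector B = (E x S) / m
    B = [sum(E[i] * S[i][j] for i in range(m)) / m for j in range(n)]
    delta = []
    for row in S:
        # Equation (5): flag values below alpha * B_j or above beta * B_j
        V = [1 if (x < alpha * b or x > beta * b) else 0
             for x, b in zip(row, B)]
        # Selection rule below Equation (6): weighted count of flagged values
        score = eta * V[-1] + sum(V[:-1]) / xi
        delta.append(0 if score >= 1 else 1)
    kept = [row for row, d in zip(S, delta) if d == 1]
    return kept, delta


# Toy data (not from Tara Oceans): one attribute column and one target column.
S = [[1.0, 10.0], [1.0, 11.0], [1.0, 12.0], [1.0, 100.0]]
E = [0.25, 0.25, 0.25, 0.25]
kept, delta = clean_samples(S, E, alpha=0.5, beta=8.0)
print(delta)
```

In this toy case only the last row, whose target value lies far above the others, is flagged and dropped. With $η = 1$, a flagged target value alone is enough to discard a sample, while a larger $ξ$ requires more flagged attributes before a sample is discarded.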

3. Comprehensive Correlation Model

3.1. Character Importance

RF is a decision tree-based bagging ensemble learning method that combines the advantages of bootstrap aggregation and random decision forests to improve decision-making performance. It tolerates noise and outliers well and remains stable when dealing with high-dimensional data. In this study, the Gini index was used to calculate the purity of the nodes and to measure the characteristic importance of marine environmental factors to Chl-a.

The Gini importance score ${VIM}_{j}$ was obtained from the change in the Gini index before and after a node of a decision tree in the RF is split. The Gini index can be calculated by Equation (7),

$$Gini(p) = \sum_{k=1}^{K} p_{k} (1 - p_{k}) = 1 - \sum_{k=1}^{K} p_{k}^{2},$$

(7)

where $K$ is the number of categories of feature samples and $p_{k}$ is the sample weight (proportion) of the $k$th category at the node. The importance of feature $X_{j}$ at node $m$ is the change in the Gini index before and after node $m$ branches,

$${VIM}_{jm}^{(gini)} = {GI}_{m} - {GI}_{l} - {GI}_{r},$$

(8)

where ${GI}_{m}$ is the Gini index of node $m$ before splitting, and ${GI}_{l}$ and ${GI}_{r}$ are the Gini indexes of the two new nodes after splitting. The importance of feature $X_{j}$ in the $i$th tree is

$${VIM}_{ij}^{(gini)} = \sum_{m \in M} {VIM}_{jm}^{(gini)},$$

(9)

where $M$ is the set of nodes in the $i$th tree at which feature $X_{j}$ is used for splitting. If there are $n$ trees in the RF, the importance of feature $X_{j}$ is obtained by summing over the trees,

$${VIM}_{j}^{(gini)} = \sum_{i=1}^{n} {VIM}_{ij}^{(gini)}.$$

(10)

Finally, the characteristic importance index is obtained by normalizing the importance scores over all $c$ features:

$${VIM}_{j} = \frac{{VIM}_{j}^{(gini)}}{\sum_{i=1}^{c} {VIM}_{i}^{(gini)}}$$

(11)
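As a rough illustration of Equations (7), (8) and (11), the following sketch computes the Gini index of a node, the importance contributed by one split, and the normalized scores. The function names and the toy labels are hypothetical; the split importance follows Equation (8) as written, without weighting the child nodes by their sample counts.

```python
from collections import Counter


def gini(labels):
    """Equation (7): 1 - sum_k p_k^2 for the class labels at one node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_importance(parent, left, right):
    """Equation (8): Gini index before the split minus both child indexes."""
    return gini(parent) - gini(left) - gini(right)


def normalize(scores):
    """Equation (11): divide each feature score by the sum over features."""
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


# Toy node with two classes that one split separates perfectly.
parent = ["low", "low", "high", "high"]
print(gini(parent))
print(split_importance(parent, ["low", "low"], ["high", "high"]))
print(normalize({"Den": 3.0, "Temp": 1.0}))
```

Many random forest implementations weight each child node by its share of the samples before subtracting, so scores from a library may differ from this literal reading of Equation (8).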

3.2. Pearson Correlation

The Pearson correlation coefficient was used to measure the degree of linear correlation between two variables. It is defined as the quotient of the covariance of the two variables and the product of their standard deviations,

$$ρ_{x, y} = \frac{cov(X, Y)}{σ_{X} σ_{Y}},$$

(12)

where $cov(X, Y)$ is the covariance of $X$ and $Y$, $σ_{X}$ and $σ_{Y}$ are the standard deviations of $X$ and $Y$, and $ρ_{x, y}$ is the degree of correlation between $X$ and $Y$, with the value range $-1 \leq ρ_{x, y} \leq 1$. When $ρ_{x, y} = -1$, the two variables are completely negatively correlated, and when $ρ_{x, y} = 1$, the two variables are completely positively correlated.

By estimating the covariance and standard deviations from the sample, the Pearson correlation coefficient $r$ can be obtained,

$$r = \frac{\sum_{i=1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_{i} - \bar{X})^{2}} \sqrt{\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}}}$$

(13)

where $\bar{X}$ and $\bar{Y}$ are the sample means.
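Equation (13) translates directly into code. The sketch below is a minimal standard-library version; the function name `pearson_r` and the toy vectors are illustrative only.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient, Equation (13)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


# Toy vectors: a perfectly increasing and a perfectly decreasing relationship.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The function is undefined when either variable is constant, because the corresponding sum of squared deviations is zero.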

3.3. Comprehensive Correlation Evaluation Model

There are complex implicit correlations among the various marine environmental factors. A comprehensive correlation evaluation model for the marine environment should therefore consider not only the direct correlation between a single index and the label element, but also the complex interactions among multiple indexes and their implicit relationship with the label element. This study attempted to consider the combined impact of feature importance and the Pearson correlation. We established a comprehensive correlation coefficient evaluation model between marine environmental factors and Chl-a, with the aim of enhancing the accuracy of Chl-a prediction. The established comprehensive correlation evaluation model is shown in Equation (14),

$$M_{evaluate} = ζ_{1} \times {VIM}_{j} + ζ_{2} \times ρ(x, y),$$

(14)

where $ζ_{1}$ and $ζ_{2}$ are the weight coefficients of feature importance and the Pearson correlation, respectively. Their values were assigned according to the specific circumstances of the environmental factors and the label factor.
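A minimal sketch of Equation (14) is given here. The text does not state how the Pearson coefficients are scaled before they are combined with the importance scores, so this example assumes that each absolute coefficient is divided by the sum of all absolute coefficients; that choice, the function name and the numbers are assumptions for illustration.

```python
def comprehensive_index(importance, pearson, zeta1=0.5, zeta2=0.5):
    """Equation (14) for every factor.

    importance: normalized feature importance per factor (sums to 1).
    pearson: signed Pearson coefficient per factor.
    Assumption: |rho| is rescaled so that the rescaled values sum to 1.
    """
    total = sum(abs(r) for r in pearson.values())
    return {
        name: zeta1 * importance[name] + zeta2 * abs(pearson[name]) / total
        for name in importance
    }


# Hypothetical values for three factors (not the values of Table 3).
importance = {"Temp": 0.5, "Oxy": 0.3, "Sal": 0.2}
pearson = {"Temp": -0.6, "Oxy": 0.3, "Sal": -0.1}
index = comprehensive_index(importance, pearson)
ranked = sorted(index, key=index.get, reverse=True)
print(ranked)
```

Taking the absolute value treats strong negative and strong positive correlations as equally informative, which is one plausible reading when the index is used only to rank factors.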

4. Integrated Prediction Model

The relationships between Chl-a and the environmental factors are strongly nonlinear and the number of samples is limited, which makes Chl-a difficult to predict. The prediction model for marine environmental factors therefore needs to integrate XGBoost regression, LOOCV validation, MRE optimization and other strategies to achieve better performance.

4.1. Extreme Gradient Boosting Regression

XGBoost is a scalable end-to-end tree boosting system that is widely used by data scientists to achieve state-of-the-art results in addressing many machine learning challenges (Chen, et al., 2016). In this study, XGBoost was employed for regression modeling.

The XGBoost model can be expressed in additive form, as shown in Equation (15),

$$\hat{y}_{i} = \sum_{k=1}^{K} f_{k}(x_{i}), \quad f_{k} \in F,$$

(15)

where $\hat{y}_{i}$ is the predicted value of the model, $K$ is the number of decision trees, $F$ is the set of all regression trees, and $f_{k}$ is one of the $K$ trees.

The goal of the XGBoost algorithm is to bring the predicted value $\hat{y}_{i}$ of the tree ensemble close to the real value $y_{i}$ while ensuring that the method generalizes as well as possible. During XGBoost learning, a function $f_{k}$ is added at each step to optimize the objective function and reduce the error between the predicted results and the actual values. The objective function is defined as

$$Obj = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{i=1}^{t} Ω(f_{i}).$$

(16)

The loss term $\sum_{i=1}^{n} l(y_{i}, \hat{y}_{i})$ is the error between the actual values and the results predicted by XGBoost, summed over the $n$ samples. The regular term is

$$Ω(f_{i}) = γ T + \frac{1}{2} λ \| ω \|^{2},$$

(17)

where $T$ is the number of leaf nodes, $ω$ is the vector of leaf node scores, and $γ$ and $λ$ are regularization coefficients that prevent the decision tree from becoming too complicated.

The second-order Taylor expansion of a function $f$ is

$$f(x + Δx) ≅ f(x) + f'(x) Δx + \frac{1}{2} f''(x) Δx^{2}.$$

(18)

At the $t$th iteration, the objective function can be written as

$$Obj^{(t)} = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}^{(t-1)} + f_{t}(x_{i})) + Ω(f_{t}) + ε.$$

(19)

In Equation (19), $l$ is the error function, $Ω(f_{t})$ is the regular term, and $ε$ is a constant that accounts for the complexity of the first $t-1$ trees.

Applying the Taylor expansion of Equation (18) to the objective function of Equation (19) gives

$$Obj^{(t)} ≅ \sum_{i=1}^{n} \left[ l(y_{i}, \hat{y}_{i}^{(t-1)}) + g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + Ω(f_{t}) + ε,$$

(20)

where

$$g_{i} = \frac{\partial l(y_{i}, \hat{y}_{i}^{(t-1)})}{\partial \hat{y}_{i}^{(t-1)}}, \quad h_{i} = \frac{\partial^{2} l(y_{i}, \hat{y}_{i}^{(t-1)})}{\partial (\hat{y}_{i}^{(t-1)})^{2}}$$

(21)

Combining Equations (19)–(21) and dropping the terms that do not depend on $f_{t}$, the objective function becomes

$$\begin{aligned} Obj^{(t)} &≅ \sum_{i=1}^{n} \left[ g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + Ω(f_{t}) \\ &= \sum_{i=1}^{n} \left[ g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + γ T + \frac{1}{2} λ \sum_{j=1}^{T} ω_{j}^{2} \\ &= \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_{j}} g_{i} \right) ω_{j} + \frac{1}{2} \left( \sum_{i \in I_{j}} h_{i} + λ \right) ω_{j}^{2} \right] + γ T \end{aligned}$$

(22)

where $I_{j} = \{ i \mid q(x_{i}) = j \}$ is the set of indices of the samples that the tree structure $q$ assigns to leaf node $j$, $g_{i}$ is the first derivative, and $h_{i}$ is the second derivative. Defining $G_{j} = \sum_{i \in I_{j}} g_{i}$ and $H_{j} = \sum_{i \in I_{j}} h_{i}$, Equation (22) can be simplified as

$$\begin{aligned} Obj^{(t)} &= \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_{j}} g_{i} \right) ω_{j} + \frac{1}{2} \left( \sum_{i \in I_{j}} h_{i} + λ \right) ω_{j}^{2} \right] + γ T \\ &= \sum_{j=1}^{T} \left[ G_{j} ω_{j} + \frac{1}{2} (H_{j} + λ) ω_{j}^{2} \right] + γ T. \end{aligned}$$

(23)

Equation (23) shows that the objective function $Obj^{(t)}$ is a convex (quadratic) function of each $ω_{j}$. The optimal solution is obtained by setting the derivative with respect to $ω_{j}$ to zero,

$$ω_{j}^{*} = - \frac{G_{j}}{H_{j} + λ},$$

(24)

$$Obj^{(t)} = - \frac{1}{2} \sum_{j=1}^{T} \frac{G_{j}^{2}}{H_{j} + λ} + γ T.$$

(25)

According to the results for $ω_{j}^{*}$ and $Obj^{(t)}$, Equation (25) can be used to evaluate the quality of the tree model: the smaller the value of $Obj^{(t)}$, the better the tree model. The scoring formula for splitting a node follows as

$$Gain = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_{L}} g_{i} \right)^{2}}{\sum_{i \in I_{L}} h_{i} + λ} + \frac{\left( \sum_{i \in I_{R}} g_{i} \right)^{2}}{\sum_{i \in I_{R}} h_{i} + λ} - \frac{\left( \sum_{i \in I} g_{i} \right)^{2}}{\sum_{i \in I} h_{i} + λ} \right] - γ.$$

(26)

Equation (26) was used to evaluate candidate split nodes of the tree model, where $I_{L}$ and $I_{R}$ are the instance sets of the left and right nodes after the split and $I = I_{L} \cup I_{R}$ is the whole instance set.
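The leaf weight and split gain can be checked numerically with a small sketch. It assumes the squared-error loss $l = \frac{1}{2}(y - \hat{y})^{2}$, for which $g_{i} = \hat{y}_{i} - y_{i}$ and $h_{i} = 1$, and it uses the standard form of the gain in which the score of the unsplit node is subtracted. The function names and the toy data are hypothetical.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf score: -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting one node into a left and a right child."""
    GL = sum(g[i] for i in left_idx)
    HL = sum(h[i] for i in left_idx)
    GR = sum(g[i] for i in right_idx)
    HR = sum(h[i] for i in right_idx)
    parent = leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam) - parent) - gamma


# Toy targets with an initial prediction of 0 for every sample.
y = [1.0, 1.0, 3.0, 3.0]
y_hat = [0.0, 0.0, 0.0, 0.0]
g = [p - t for p, t in zip(y_hat, y)]  # first derivatives of the squared loss
h = [1.0] * len(y)                     # second derivatives of the squared loss

print(split_gain(g, h, [0, 1], [2, 3], lam=0.0))
print(leaf_weight(sum(g[:2]), sum(h[:2]), lam=0.0))
print(leaf_weight(sum(g[2:]), sum(h[2:]), lam=0.0))
```

With $λ = 0$ the optimal leaf weights reduce to the mean residual of each leaf, which is a convenient sanity check; a positive $λ$ shrinks the weights toward zero, and a positive $γ$ lowers the gain of every candidate split.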

4.2. LOOCV Validation

Because the number of valid samples in this study was limited, it was critical to select an appropriate validation method to ensure the accuracy and generalization ability of the Chl-a prediction in the multi-environmental factor prediction model. LOOCV is a special case of cross-validation in which the number of folds equals the number of instances in the dataset. The learning algorithm is thus applied once for each sample, using all other samples as the training set and the selected sample as a single-item test set (Webb et al. 2011). LOOCV was adopted for the validation of Chl-a in the multi-environmental factor prediction model. Because the sample size is small, the computational cost that makes LOOCV impractical for large samples is not a concern here, and the model's prediction accuracy is not affected.
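The LOOCV loop described above can be sketched as follows. To keep the example free of dependencies, a mean predictor stands in for the regression model; `loocv_predictions`, `mean_predictor` and the toy data are hypothetical names and values, and any model with the same call shape could be substituted.

```python
def loocv_predictions(X, y, fit_predict):
    """Hold out each sample once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predictions.append(fit_predict(X_train, y_train, X[i]))
    return predictions


def mean_predictor(X_train, y_train, x_test):
    """Stand-in model (not XGBoost): predicts the mean training target."""
    return sum(y_train) / len(y_train)


# Toy data: three samples with a single feature each.
X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 3.0]
print(loocv_predictions(X, y, mean_predictor))
```

For $m$ samples the model is trained $m$ times, which stays cheap for a few dozen samples but grows quickly with larger datasets.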

4.3. Objective for Prediction

To evaluate the training and prediction of Chl-a in the subsequent modeling, the mean relative error (MRE) was defined as

$$MRE = \frac{1}{M} \sum_{i=1}^{M} \frac{| z_{i} - \hat{z}_{i} |}{| z_{i} |}$$

(27)

where $M$ is the number of predicted samples, $z_{i}$ is the real value of the Chl-a label of the $i$th sample, and $\hat{z}_{i}$ is the predicted value of Chl-a. The objective function of the Chl-a prediction model was to obtain the minimum MRE:

$$Min \left( \frac{1}{M} \sum_{i=1}^{M} \frac{| z_{i} - \hat{z}_{i} |}{| z_{i} |} \right)$$

(28)
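Equations (27) and (28) can be illustrated with a short sketch that computes the MRE and picks, among several sets of predictions, the one with the smallest MRE. The candidate names and numbers are hypothetical, and selecting among precomputed candidates is only a simple stand-in for the parameter optimization strategy used in the study.

```python
def mean_relative_error(z, z_hat):
    """Equation (27): mean of |z_i - z_hat_i| / |z_i| over all samples."""
    return sum(abs(a - b) / abs(a) for a, b in zip(z, z_hat)) / len(z)


def best_candidate(z, candidates):
    """Equation (28): name of the candidate whose predictions minimize the MRE."""
    return min(candidates, key=lambda name: mean_relative_error(z, candidates[name]))


# Toy Chl-a values and two hypothetical sets of predictions.
z = [1.0, 2.0, 4.0]
candidates = {
    "setting_a": [1.1, 1.8, 4.0],
    "setting_b": [2.0, 4.0, 8.0],
}
print(mean_relative_error(z, candidates["setting_a"]))
print(best_candidate(z, candidates))
```

Because each error is divided by the true value, the MRE is undefined for samples with $z_{i} = 0$ and gives more weight to errors on samples with small Chl-a values.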

5. Results and Discussion

5.1. Raw Data Processing

Figure 2 shows the distribution of Chl-a content in the 102 original samples as a box plot, which illustrates the existence of outliers in the original samples. The abscissa is the sample size, and the ordinate is the Chl-a content of the samples. It can be seen that some Chl-a values are abnormally large or small.

Fifteen environmental factors were available for modeling and prediction: temperature (Temp), oxygen (Oxy), density (Den), carbonate (CO₃), ammonium (Amm), Brunt Väisälä frequency (Bru), salinity (Sal), PO₄, iron (Iron), latitude (Lat), NO₂, gradient surface temperature (Gra), beta470 (Beta), Okubo (Oku), and longitude (Long). The target was set to Chl-a. Several samples had outlier Chl-a values; these samples and the samples adjacent to them are presented in Table 1. Some environmental indicators in one sample could differ considerably from those of a nearby sample, even within the same area. The wide-ranging differences in Chl-a might be caused by an accidental event, a systematic error, or other causes that have not yet been discovered. Such a distribution pattern might also result from unknown topography, ocean currents, seasonal changes, human activity and other factors. A detailed and accurate description of the distribution of global marine environmental indicators cannot be developed from the samples acquired to date. Accurate interpretation and in-depth study of these abnormal indicators require more comprehensive samples, targeted field tests, in-depth theoretical analysis, and artificial intelligence learning. Based on the data cleaning model used in this study, a total of 79 samples from the SRF layer were selected for the subsequent analysis.

5.2. Evaluation of Characteristic Importance of Environmental Factors

Based on the selected data, modeling analysis and prediction were then conducted for the relationship between Chl-a and the other environmental factors in the SRF layer. The characteristic importance of each marine environmental factor to Chl-a was obtained through characteristic importance analysis. The distribution of the characteristic importance indexes of the environmental factors is presented in Figure 3. The major parameters for the evaluation are listed in Table 2, where n_estimators is the number of trees in the forest, min_samples_split is the minimum number of samples required to split an internal node, min_samples_leaf is the minimum number of samples required to be at a leaf node, min_weight_fraction_leaf is the minimum weighted fraction of the sum total of weights required to be at a leaf node, and max_depth is the maximum depth of the tree. As shown in Figure 3, the characteristic importance analysis indicates that seawater density was the most important factor for Chl-a abundance in the samples from the SRF layer. Seven factors, namely temperature, latitude, beta470, oxygen, ammonium, CO₃, and Brunt Väisälä frequency, were also important for Chl-a abundance. However, the remaining seven factors (Okubo, PO₄, salinity, NO₂, gradient surface temperature, iron and longitude) showed no obvious importance for Chl-a abundance.

The Pearson correlation method was used to further study the effects of environmental factors on Chl-a abundance. Figure 4 presents the Pearson correlation coefficient between each marine environmental factor and Chl-a. As shown in Figure 4, oxygen content had the strongest positive correlation with Chl-a, and water temperature had the strongest negative correlation. The correlation coefficients of CO₃, density, and PO₄ were 0.397, 0.339 and 0.226, respectively, and those of the Brunt Väisälä frequency and salinity were −0.270 and −0.264. The correlations between the other factors and Chl-a were not obvious.

A comparison of Figure 3 and Figure 4 shows that neither Pearson's correlation coefficient, which indicates only the direct correlation between two environmental indicators, nor the characteristic importance index can by itself fully reflect the impact on Chl-a abundance. Given the potential correlation of multiple environmental factors in the global ocean and the limited number of samples available, Equation (14) was used to calculate the Chl-a comprehensive correlation index. In this study, the weight coefficients $ζ_{1}$ and $ζ_{2}$ of feature importance and Pearson correlation were both set to 0.5.

Since the characteristic importance of the environmental factors was normalized, the Pearson correlation coefficients were also normalized. The Pearson correlation coefficient, characteristic importance index, and comprehensive correlation index are presented in Table 3. The histogram of the comprehensive correlation indexes is shown in Figure 5. As shown in Figure 5 and Table 3, four environmental factors (temperature, density, oxygen and CO₃) had the greatest influence on the abundance of Chl-a, while the comprehensive correlation indexes of the other factors were relatively low.

5.3. Environmental Factors Modeling

The environmental characteristic importance model can be used to identify easily-detected and easily-measured factors that can be used to predict other factors that were not directly measured in previous explorations. Thus, rapid and low-cost data acquisition in marine development, exploration, and environmental protection may be possible. This study used XGBoost, RF, SVR and LR to implement multi-factor modeling, analysis, and prediction.

As shown in Figure 5, density, temperature, oxygen, and CO₃ had the highest comprehensive correlation values. In the modeling prediction study, the modeling began with the four most important indicators. Then, a new environmental factor was added successively according to the comprehensive correlation indexes from high to low, until all the factors were used. The modeling and prediction process was constructed based on several algorithms, including the XGBoost algorithm. The objective function was to minimize the MRE with Equation (28).
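The stepwise procedure described above, starting from the four highest-ranked factors and adding one factor at a time, can be sketched as follows. The index values and the function name `nested_feature_sets` are hypothetical; in practice each subset would be passed to the model and scored by its MRE.

```python
def nested_feature_sets(index, start=4):
    """Rank factors by comprehensive index and return growing feature subsets."""
    ranked = sorted(index, key=index.get, reverse=True)
    return [ranked[:k] for k in range(start, len(ranked) + 1)]


# Hypothetical comprehensive correlation indexes for six factors.
index = {"Temp": 0.30, "Den": 0.25, "Oxy": 0.20, "CO3": 0.10, "Sal": 0.08, "Lat": 0.07}
for features in nested_feature_sets(index):
    print(len(features), features)
```

Each subset contains the previous one plus the next factor in the ranking, so differences in error between consecutive subsets can be attributed to the single added factor.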

Table 4, Table 5, Table 6 and Table 7 show the MRE, mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²) obtained with the different machine learning algorithms. Whether the number of environmental factors is four, five, six, or eight, the MRE obtained with XGBoost regression is lower than that of the other methods. The results in Table 4, Table 5 and Table 7 show that, for models based on four, five or eight environmental factors, the MRE, MAE, RMSE, and R² of the XGBoost algorithm are better than those obtained with the other machine learning algorithms. In Table 6, the MAE and RMSE of XGBoost were slightly higher than those of RF, but lower than those of SVR and LR. Given the large differences among the Chl-a values in the raw samples, the MRE results indicate that the predictions based on the XGBoost model were the most reasonable.
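For readers who want to reproduce these four metrics, a minimal sketch in plain Python follows. Equation (28) is not reproduced in this section, so the MRE here is assumed to be the mean of |prediction − observation| / |observation|; the sample values are made up purely to show the call pattern and are not Tara Oceans data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MRE, MAE, RMSE and R^2 for paired observations and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    # ASSUMPTION: MRE is the mean absolute error relative to the observed value.
    mre = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 = 1 - SS_res / SS_tot (coefficient of determination).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MRE": mre, "MAE": mae, "RMSE": rmse, "R2": r2}


# Illustrative values only (not measured Chl-a concentrations).
y_true = [0.2, 0.4, 1.0, 2.0]
y_pred = [0.25, 0.3, 1.1, 1.6]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because the MRE divides each error by the observed value, an error of a given size counts more for a low-Chl-a sample than for a high one, which is one reason it can rank methods differently from the MAE or RMSE when the target values span a wide range.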

5.4. Chl-a Abundance Prediction

Predictions were developed and the objective function model was established using Equation (28). Figure 6 shows the relative error of each prediction. In Figure 6, the modeling and prediction results obtained with the XGBoost algorithm are sorted from small to large. In general, the more environmental factors used for modeling, the smaller the relative error of the trained model's predictions. In the prediction of Chl-a abundance, the number of sample points with a large relative error was greatest with the LR algorithm, lower with the SVR and RF algorithms, and lowest with the XGBoost algorithm. Too few modeling parameters (e.g., four or five) led to many high relative error values, which indicates that modeling with too few parameters does not result in reliable Chl-a predictions. With six to eight modeling parameters, the number of outlier relative error values was close to that obtained using more parameters. This indicates that it is feasible to predict Chl-a by modeling with an appropriate number of parameters. Modeling with more parameters (e.g., 13 to 15) did not reduce the relative errors, but instead led to an increase in outlier values.

As Figure 6 shows, the XGBoost, RF, SVR, and LR methods can all be used for modeling and predicting the value of Chl-a, although more large relative errors appeared in the results when RF and SVR were used than when XGBoost was used.

For further analysis of the modeling prediction results, the prediction tolerance M_s was introduced. It represents the mean relative error of the s% of samples with the smallest relative errors. Thus, M_100 is the mean relative error of all samples, and M_60 is the mean relative error of the 60% of samples with the smallest relative errors. Figure 7 presents M_s for the different machine learning methods. A comparison of subgraphs (a)–(d) in Figure 7 shows that, for all M_s, the modeling prediction results based on XGBoost were generally superior to those of the RF model and showed obvious advantages over SVR and LR. For the XGBoost and RF modeling methods, the prediction accuracy of the target values at first improves gradually as the number of indices used for modeling increases. It is important to note that when the number of environmental factors is greater than five, the increase in prediction accuracy for XGBoost and RF is not obvious. For the SVR and LR methods, the prediction accuracy initially decreases as the number of environmental factors increases. When more than eight modeling factors were used, adding environmental factors had no obvious effect on the accuracy of the prediction results.
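The prediction tolerance M_s can be computed directly from a list of per-sample relative errors, as in the sketch below. The text does not say how the sample count is rounded when s% of the samples is not a whole number, so rounding up is an assumption here, and the error values are illustrative only.

```python
import math


def prediction_tolerance(relative_errors, s):
    """Mean of the s% smallest relative errors (M_s), for 0 < s <= 100."""
    if not 0 < s <= 100:
        raise ValueError("s must be in the interval (0, 100]")
    ordered = sorted(relative_errors)
    # ASSUMPTION: round the number of retained samples up, keeping at least one.
    k = max(1, math.ceil(len(ordered) * s / 100))
    return sum(ordered[:k]) / k


# Illustrative relative errors; the last one plays the role of an outlier.
errors = [0.05, 0.10, 0.20, 0.40, 1.25]
m_100 = prediction_tolerance(errors, 100)  # mean over all samples
m_60 = prediction_tolerance(errors, 60)    # mean over the best 60% of samples
print(m_100, m_60)
```

Comparing M_60 with M_100 shows how much of the overall MRE is driven by a small number of poorly predicted samples.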

Figure 7 presents the MRE of samples using different machine learning methods. It can be seen that the modeling prediction results based on XGBoost and RF were superior to those of SVR and LR.

A learning curve and grid search techniques were used in the model training to minimize the objective function (Equation (28)). The XGBoost regression modeling process depends on several key parameters, such as the maximum tree depth (max_depth), number of trees (n_estimators), regularization term of the weight (reg_alpha), minimum loss reduction (gamma), learning rate (learning_rate), subsample ratio (subsample), minimum node weight (min_child_weight), and the proportion of sampled columns (colsample_bytree). The optimized parameters of the training model were obtained by parameter estimation and cross-validation. Table 8 presents the combinations of optimized parameters obtained by grid search for models with different numbers of environmental factors.

5.5. Modeling Optimization and Comparison

A grid search method was used for the parameter optimization of max_depth and n_estimators, and the results are shown in Figure 8, which also shows how the MRE varies as n_estimators increases under different max_depth values. When n_estimators exceeds 70, the MRE reaches a stable value in every case. The MRE results based on four, five, and six factors were very similar, and were higher than the results based on more factors. In general, the more factors included in the modeling and prediction analysis, the higher the accuracy obtained.
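The grid search over max_depth and n_estimators can be sketched as an exhaustive loop over all parameter combinations. In the sketch below, `toy_mre` is a hypothetical stand-in for "train an XGBoost model with these parameters and return its cross-validated MRE"; it is not derived from the study's data. The max_depth values follow the panels of Figure 8, while the n_estimators range is an assumption.

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate every parameter combination and return the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {
    "max_depth": [1, 2, 3, 4, 5],              # values shown in Figure 8
    "n_estimators": list(range(10, 101, 10)),  # ASSUMED search range
}


def toy_mre(params):
    # HYPOTHETICAL objective used only to make the sketch runnable.
    # A real objective would fit a model and return its validation MRE.
    return abs(params["max_depth"] - 3) * 0.01 + 1.0 / params["n_estimators"]


best_params, best_score = grid_search(toy_mre, grid)
print(best_params, best_score)
```

An exhaustive search like this is practical here because the grid is small (5 × 10 combinations); the cost grows multiplicatively with every additional parameter that is tuned.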

6. Conclusions

It is of great importance to integrate and analyze multi-source, heterogeneous, and dispersed marine environment data, and to reveal complex internal mechanisms and evolution trends through machine learning methods. Based on selected Tara Oceans data, the present study conducted in-depth data cleaning, multi-factor comprehensive correlation analysis, and Chl-a abundance modeling and prediction. A comprehensive correlation model for marine environmental factors and a Chl-a abundance prediction model were then established based on the proposed machine learning analysis.

The conclusions of this study are as follows:

(1): The comprehensive correlation model of marine environmental factors considers both the direct correlation between any two environmental factors and the potential correlation among multiple factors. It is helpful for the selection of factors for Chl-a modeling and prediction.
(2): A multi-environmental factor prediction model can accurately predict the amount of Chl-a, which makes it possible to estimate Chl-a abundance from other indicators of the marine environment.
(3): In general, the more environmental factors used for Chl-a prediction, the more accurate the results. As more marine environmental data become available, machine learning technologies will be used for additional relevant studies. Revealing small-scale, refined and systematic laws of the marine environment requires additional marine observation, monitoring and measurement data. This approach will help to develop a deeper understanding of the underlying mechanisms behind the dynamic changes in the marine environment, and thus to develop more efficient ways to protect it.

Author Contributions

Conceptualization, Q.Y.; methodology, Z.C. and D.D.; software, Z.C. and D.D.; validation, X.Z. and Q.Y.; formal analysis, Z.C. and X.Z.; investigation, X.Z.; resources, Q.Y.; data curation, Z.C. and Q.Y.; writing—original draft preparation, Z.C. and Q.Y.; writing—review and editing, Z.C. and Q.Y.; visualization, Z.C. and D.D.; supervision, Q.Y.; project administration, Q.Y.; funding acquisition, Z.C., X.Z. and Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (2018YFC1503204), the Specific Project of Municipal Science and Technology Bureau of Zhoushan (2018C21007 and 20210198), the National Natural Science Foundation of China (41876114), and the Science Foundation of Donghai Laboratory (DH-2022KF0218).

Data Availability Statement

All relevant data are included in the manuscript.

Acknowledgments

The study used public datasets from the Tara Oceans Project.

Conflicts of Interest

The authors declare no conflict of interest.

References

Madin, E.M.; Dill, L.M.; Ridlon, A.D.; Heithaus, M.R.; Warner, R.R. Human activities change marine ecosystems by altering predation risk. Glob. Chang. Biol. 2016, 22, 44–60.
Biard, T.; Bigeard, E.; Audic, S.; Poulain, J.; Gutierrez-Rodriguez, A.; Pesant, S.; Stemmann, L.; Not, F. Biogeography and diversity of Collodaria (Radiolaria) in the global ocean. ISME J. 2017, 11, 1331–1344.
Pesant, S.; Tara Oceans Consortium Coordinators; Not, F.; Picheral, M.; Kandels-Lewis, S.; Le Bescot, N.; Gorsky, G.; Iudicone, D.; Karsenti, E.; Speich, S.; et al. Open science resources for the discovery and analysis of Tara Oceans data. Sci. Data 2015, 2, 150023.
Sunagawa, S.; Coelho, L.P.; Chaffron, S.; Kultima, J.R.; Labadie, K.; Salazar, G.; Djahanschiri, B.; Zeller, G.; Mende, D.R.; Alberti, A.; et al. Structure and function of the global ocean microbiome. Science 2015, 348, 6237.
Karlusich, J.; Ibarbalz, F.M.; Bowler, C. Phytoplankton in the Tara Ocean. Annu. Rev. Mar. Sci. 2020, 12, 233–265.
Zhang, X.L.; Yang, X.; Wang, S.J.; Jiang, Z.W.; Xie, Z.X.; Zhang, L. Draft Genome Sequences of Nine Cultivable Heterotrophic Proteobacteria Isolated from Phycosphere Microbiota of Toxic Alexandrium catenella LZT09. Microbiol. Resour. Announc. 2020, 9, e00281-20.
Zhang, X.L.; Tian, X.Q.; Ma, L.Y.; Feng, B.; Liu, Q.H.; Yuan, L.D. Biodiversity of the symbiotic bacteria associated with toxic marine dinoflagellate Alexandrium tamarense. J. Biosci. Med. 2015, 3, 23–28.
Zhang, X.L.; Ma, L.Y.; Tian, X.Q.; Huang, H.L.; Yang, Q. Biodiversity study of intracellular bacteria closely associated with paralytic shellfish poisoning dinoflagellates Alexandrium tamarense and A. minutum. Int. J. Environ. Resour. 2015, 4, 23–27.
Xu, G.; Shi, Y.; Sun, X.; Shen, W. Internet of Things in Marine Environment Monitoring: A Review. Sensors 2019, 19, 1711.
Bonnefon, J.F.; Rahwan, I. Machine Thinking, Fast and Slow. Trends Cogn. Sci. 2020, 24, 1019–1027.
Sunagawa, S.; Karsenti, E.; Bowler, C.; Bork, P. Computational eco-systems biology in Tara Oceans: Translating data into knowledge. Mol. Syst. Biol. 2015, 11, 809.
Bork, P.; Bowler, C.; Vargas, C.D.; Gorsky, G. Tara oceans-studies plankton at planetary scale. Science 2015, 348, 873.
Yang, Q.; Jiang, Z.; Zhou, X.; Zhang, R.; Xie, Z.; Zhang, S.; Wu, Y.; Ge, Y.; Zhang, X. Haliea alexandrii sp. nov., isolated from phycosphere microbiota of the toxin-producing dinoflagellate Alexandrium catenella. Int. J. Syst. Evol. Microbiol. 2020, 70, 1133–1138.
Yang, X.; Jiang, Z.W.; Zhang, J.; Zhou, X.; Zhang, X.L.; Wang, L.; Yu, T.; Wang, Z.; Bei, J.; Dong, B. Mesorhizobium alexandrii sp. nov., isolated from phycosphere microbiota of PSTs-producing marine dinoflagellate Alexandrium minutum amtk4. Antonie Van Leeuwenhoek 2020, 113, 907–917.
Duan, Y.; Jiang, Z.; Wu, Z.; Sheng, Z.; Yang, X.; Sun, J.; Zhang, X.; Yang, Q.; Yu, X.; Yan, J. Limnobacter alexandrii sp. nov., a thiosulfate-oxidizing, heterotrophic and EPS-bearing Burkholderiaceae isolated from cultivable phycosphere microbiota of toxic Alexandrium catenella LZT09. Antonie Van Leeuwenhoek 2020, 113, 1689–1698.
Brivio, P.A.; Giardino, C.; Zilioli, E. Determination of chlorophyll concentration changes in lake garda using an image-based radiative transfer code for landsat TM images. Int. J. Remote Sens. 2010, 22, 487–502.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
Deng, T.; Chau, K.W.; Duan, H.F. Machine learning based marine water quality prediction for coastal hydro-environment management. J. Environ. Manag. 2021, 284, 112051.
Ding, W.; Zhang, C.; Shang, S.; Li, X. Optimization of deep learning model for coastal chlorophyll a dynamic forecast. Ecol. Model. 2022, 467, 109913.
Villar, E.; Farrant, G.K.; Follows, M.; Garczarek, L.; Speich, S.; Audic, S.; Bittner, L.; Blanke, B.; Brum, J.R.; Brunet, C.; et al. Environmental characteristics of Agulhas rings affect interocean plankton transport. Science 2015, 348, 1261447.
Kisi, O.; Parmar, K.S. Application of least square support vector machine and multivariate adaptive regression spline models in long term prediction of river water pollution. J. Hydrol. 2016, 534, 104–112.
Yajima, H.; Derot, J. Application of the Random Forest model for chlorophyll-a forecasts in fresh and brackish water bodies in Japan, using multivariate long-term databases. J. Hydroinform. 2018, 20, 206–220.
Sun, Q.; Zhang, J.; Xu, X. Research and application of rule updating mining algorithm for marine water quality monitoring data. Pol. Marit. Res. 2018, 25, 136–140.
Zeng, J.; Tang, Z. Evaluate machine learning models used for upscaling surface ocean CO2 measurements. Ocean Sci. 2017, 13, 303–313.
Misra, A.; Vojinovic, Z.; Ramakrishnan, B.; Luijendijk, A.; Ranasinghe, R. Shallow water bathymetry mapping using support vector machine technique and multispectral imagery. Int. J. Remote Sens. 2018, 39, 4431–4450.
Ling, H.; Qian, C.X.; Kang, W.C.; Liang, C.Y.; Chen, H.C. Combination of support vector machine and k-fold cross validation to predict compressive strength of concrete in marine environment. Constr. Build. Mater. 2019, 206, 355–363.
Franklin, J.B.; Sathish, T.; Vinithkumar, N.; Kirubagaran, R. A novel approach to predict chlorophyll-a in coastal-marine ecosystems using multiple linear regression and principal component scores. Mar. Pollut. Bull. 2020, 152, 110902.
Bi, S.; Li, Y.; Liu, G.; Song, K.; Xu, J.; Dong, X.; Cai, X.; Mu, M.; Miao, S.; Lyu, H. Assessment of Algorithms for Estimating Chlorophyll-a Concentration in Inland Waters: A Round-Robin Scoring Method Based on the Optically Fuzzy Clustering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4200717.
Li, J.; An, X.; Li, Q. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 2022, 276, 106238.
Hu, H.; Westhuysen, A.; Chu, C.; Fujisaki-Manome, A. Predicting Lake Erie wave heights and periods using XGBoost and LSTM. Ocean Model. 2021, 164, 101832.
Wang, J.; Du, X.; Qi, X. Strain prediction for historical timber buildings with a hybrid Prophet-XGBoost model. Mech. Syst. Signal Process. 2022, 179, 109316.
Albaradei, S.; Thafar, M.; Alsaedi, A.; Van Neste, C.; Gojobori, T.; Essack, M.; Gao, X. Machine learning and deep learning methods that use omics data for metastasis prediction. Comput. Struct. Biotechnol. J. 2021, 19, 5008–5018.
Yang, Q.; Feng, Q.; Zhang, B.P.; Gao, J.J.; Sheng, Z.; Xue, Q.P.; Zhang, X.L. Marinobacter alexandrii sp. nov., a novel yellow-pigmented and algae growth-promoting bacterium isolated from marine phycosphere microbiota. Antonie Van Leeuwenhoek 2021, 114, 709–718.
Jiang, Z.; Duan, Y.; Yang, X.; Yao, B.; Zeng, T.; Wang, X.; Feng, Q.; Qi, M.; Yang, Q.; Zhang, X.L. Nitratireductor alexandrii sp. nov., from phycosphere microbiota of toxic marine dinoflagellate Alexandrium tamarense. Int. J. Syst. Evol. Microbiol. 2020, 70, 4390–4397.
Yang, Q.; Jiang, Z.W.; Huang, C.H.; Zhang, R.N.; Li, L.Z.; Yang, G.; Feng, L.J.; Yang, G.F.; Zhang, H.; Zhang, X.L. Hoeflea prorocentri sp. nov., isolated from a culture of the marine dinoflagellate Prorocentrum mexicanum PM01. Antonie Van Leeuwenhoek 2018, 111, 1845–1853.
Zhang, X.L.; Li, G.X.; Ge, Y.M.; Iqbal, N.M.; Yang, X.; Cui, Z.D.; Yang, Q. Sphingopyxis microcysteis sp. nov., a novel bioactive exopolysaccharides-bearing Sphingomonadaceae isolated from the Microcystis phycosphere. Antonie Van Leeuwenhoek 2021, 114, 845–857.
Yang, Q.; Ge, Y.M.; Iqbal, N.M.; Yang, X.; Zhang, X.L. Sulfitobacter alexandrii sp. nov., a new microalgae growth-promoting bacterium with exopolysaccharides bioflocculanting potential isolated from marine phycosphere. Antonie Van Leeuwenhoek 2021, 114, 1091–1106.
Zhang, X.L.; Qi, M.; Li, Q.H.; Cui, Z.D.; Yang, Q. Maricaulis alexandrii sp. nov., a novel active bioflocculants-bearing and dimorphic prosthecate bacterium isolated from marine phycosphere. Antonie Van Leeuwenhoek 2021, 114, 1195–1203.
Yang, Q.; Jiang, Z.; Zhou, X.; Zhang, R.; Wu, Y.; Lou, L.; Ma, Z.; Wang, D.; Ge, Y.; Zhang, X.; et al. Nioella ostreopsis sp. nov., isolated from toxic dinoflagellate, Ostreopsis lenticularis. Int. J. Syst. Evol. Microbiol. 2020, 70, 759–765.
Ren, C.Z.; Gao, H.M.; Dai, J.; Zhu, W.Z.; Xu, F.F.; Ye, Y.; Zhang, X.L.; Yang, Q. Taxonomic and Bioactivity Characterizations of Mameliella alba Strain LZ-28 Isolated from Highly Toxic Marine Dinoflagellate Alexandrium catenella LZT09. Mar. Drugs 2022, 20, 321.
Zhang, G.; Yang, Y.; Wang, S.; Sun, Z.; Jiao, K. Alkalimicrobium pacificum gen. nov., sp. nov., a marine bacterium in the family Rhodobacteraceae. Int. J. Syst. Evol. Microbiol. 2015, 65, 2453–2458.
Yang, Q.; Jiang, Z.; Zhou, X.; Xie, Z.; Wang, Y.; Wang, D.; Feng, L.; Yang, G.; Ge, Y.; Zhang, X. Saccharospirillum alexandrii sp. nov., isolated from the toxigenic marine dinoflagellate Alexandrium catenella LZT09. Int. J. Syst. Evol. Microbiol. 2020, 70, 820–826.
Wang, X.; Ye, Y.; Xu, F.F.; Duan, Y.H.; Xie, P.F.; Yang, Q.; Zhang, X. Maritimibacter alexandrii sp. nov.; a New Member of Rhodobacteraceae Isolated from Marine Phycosphere. Curr. Microbiol. 2021, 78, 3996–4003.
Zhou, X.; Zhang, X.; Jiang, Z.; Yang, X.; Zhang, X.; Yang, Q. Combined characterization of a new member of Marivita cryptomonadis, strain LZ-15-2 isolated from cultivable phycosphere microbiota of toxic HAB dinoflagellate Alexandrium catenella LZT09. Braz. J. Microbiol. 2021, 52, 739–748.
Landry, Z.C.; Vergin, K.; Mannenbach, C.; Block, S.; Yang, Q.; Blainey, P.; Carlson, C.; Giovannoni, S. Optofluidic Single-Cell Genome Amplification of Sub-micron Bacteria in the Ocean Subsurface. Front. Microbiol. 2018, 9, 1152.
Sommeria-Klein, G.; Watteaux, R.; Ibarbalz, F.M.; Pierella Karlusich, J.J.; Iudicone, D.; Bowler, C.; Morlon, H. Global drivers of eukaryotic plankton biogeography in the sunlit ocean. Science 2021, 374, 594–599.
Sunagawa, S.; Acinas, S.G.; Bork, P.; Bowler, C.; Tara Oceans Coordinators; Eveillard, D.; Gorsky, G.; Guidi, L.; Iudicone, D.; Karsenti, E.; et al. Tara Oceans: Towards global ocean ecosystems biology. Nat. Rev. Microbiol. 2020, 18, 428–445.
Pierella Karlusich, J.J.; Bowler, C.; Biswas, H. Carbon Dioxide Concentration Mechanisms in Natural Populations of Marine Diatoms: Insights From Tara Oceans. Front. Plant Sci. 2021, 12, 657821.

Figure 1. Sample distribution of Tara Oceans data. Adapted and modified from Pesant et al. (2015). The TOP water samples were obtained from 210 sites, marked with red dots. A total of 102 sites had complete datasets for the DCM layer.

Figure 2. Box diagram of the distribution of Chl-a values in selected analyzed samples.

Figure 3. Characteristic importance of marine environmental factors.

Figure 4. Pearson correlations of the environmental factors and Chl-a abundance.

Figure 5. Comprehensive importance of the 15 environmental factors of the analyzed samples.

Figure 6. Relative error of the Chl-a for each sample point. (a): n = 4; (b1,b2): n = 5; (c): n = 6; (d): n = 7; (e): n = 8; (f): n = 9; (g): n = 10; (h): n = 11; (i): n = 12; (j): n = 13; (k): n = 14; (l): n = 15.

Figure 7. MRE based on different methods.

Figure 8. Variation in MRE with different max_depth and n_estimators values. Panels (a) to (e) show the variation in the MRE with increasing n_estimators under different max_depth values. (a) max_depth = 1, (b) max_depth = 2, (c) max_depth = 3, (d) max_depth = 4, (e) max_depth = 5.

Table 1. Samples with outlier Chl-a values and samples adjacent to them.

Station	Chl-a	Long	Lat	Temp	Den	Oxygen	CO₃	Amm
TARA_091	0.828	−73.100	−34.161	17.609	24.893	229.394	0.013	0.048
TARA_092	2.258	−71.998	−33.690	15.912	25.302	245.324	0.013	0.073
TARA_086	0.220	−53.006	−64.360	−0.494	26.718	398.312	0.000	0.007
TARA_088	1.129	−56.794	−63.402	−0.763	27.684	335.243	0.013	0.118
TARA_173	0.417	79.420	78.956	−0.160	26.994	385.281	0.001	0.004
TARA_188	1.417	91.856	78.252	−1.650	26.511	388.304	0.000	0.002

Table 2. Five parameters used for the calculation of the characteristic importance.

Parameters	N_Estimators	Min_Samples_Split	Min_Samples_Leaf	Min_Weight_Fraction_Leaf	Max_Depth
Values	100	2	1	0	2

Table 3. Comprehensive correlation indexes.

Parameter	Characteristic Importance Index	Pearson Correlation Coefficient	Normalized Pearson Correlation Coefficient	Comprehensive Correlation Index
Density	0.151	0.339	0.091	0.121
Temperature	0.103	−0.575	0.154	0.129
Latitude	0.098	0.106	0.028	0.063
Beta470	0.089	0.050	0.013	0.051
Oxygen	0.080	0.600	0.161	0.121
Ammonium	0.080	0.290	0.078	0.079
CO₃	0.070	0.397	0.107	0.089
Bru	0.067	−0.270	0.073	0.070
Okubo	0.054	0.068	0.018	0.036
PO₄	0.041	0.226	0.061	0.051
Salinity	0.040	−0.264	0.071	0.055
NO₂	0.037	0.158	0.042	0.040
Gra	0.037	0.120	0.032	0.035
Iron	0.027	−0.187	0.050	0.039
Longitude	0.027	0.072	0.019	0.023

Table 4. Four environmental factors used for modeling analysis.

	XGBoost	RF	SVR	Linear Regression
MRE	0.398	0.413	0.461	0.532
MAE	0.096	0.104	0.110	0.118
RMSE	0.144	0.152	0.155	0.166
R²	0.550	0.496	0.470	0.390

Table 5. Five environmental factors used for modeling.

	XGBoost	RF	SVR	Linear Regression
MRE	0.408	0.421	0.459	0.549
MAE	0.096	0.103	0.109	0.114
RMSE	0.142	0.148	0.154	0.157
R²	0.560	0.522	0.488	0.462

Table 6. Six environmental factors used for modeling.

	XGBoost	RF	SVR	Linear Regression
MRE	0.419	0.425	0.446	0.572
MAE	0.107	0.104	0.112	0.120
RMSE	0.152	0.151	0.155	0.160
R²	0.501	0.508	0.470	0.430

Table 7. Eight environmental factors used for modeling.

	XGBoost	RF	SVR	Linear Regression
MRE	0.344	0.378	0.447	0.562
MAE	0.083	0.086	0.112	0.125
RMSE	0.125	0.127	0.158	0.169
R²	0.660	0.649	0.460	0.370

Table 8. Parameters of XGBoost.

Number of Factors	Max_Depth	N_Estimators	Reg_Alpha	Gamma	Learning_Rate	Subsample	Min_Child_Weight	Colsample_Bytree
4	1	70	0	0	1	0.8	1	1
5	1	70	0	0	1	0.8	1	1
6	3	70	0.01	0	0.1	0.75	4	0.8
7	3	70	0.01	0	0.1	0.75	6	0.8
8	4	70	0.01	0	0.1	0.75	7	0.8
9	3	70	0.001	0.1	0.1	0.8	5	0.8
10	3	70	0.1	0	0.1	0.75	7	0.8
11	3	70	0.1	0	0.1	0.85	5	0.75
12	3	70	0.001	0	0.1	0.85	7	0.85
13	3	70	0.1	0	0.1	0.8	3	0.75
14	3	70	0.01	0	0.1	0.85	3	0.75
15	3	70	0.1	0	0.1	0.85	3	0.8

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
