Machine Learning Applied to the Oxygen-18 Isotopic Composition, Salinity and Temperature/Potential Temperature in the Mediterranean Sea

Abstract: This study proposes different techniques to estimate the isotope composition (δ¹⁸O), salinity and temperature/potential temperature in the Mediterranean Sea using five variables: (i–ii) geographic coordinates (longitude, latitude), (iii) year, (iv) month and (v) depth. Three kinds of models, based on artificial neural networks (ANN), random forest (RF) and support vector machines (SVM), were developed. According to the results, the random forest models present the best prediction accuracy for the querying phase and can be used to predict the isotope composition (mean absolute percentage error (MAPE) around 4.98%), salinity (MAPE below 0.20%) and temperature (MAPE around 2.44%). These models could be useful for research that requires past data for these variables.


Introduction
The semi-enclosed Mediterranean Sea [1–3] is characterised by dry and warm summers and temperate and wet winters [3]. The Mediterranean Sea is considered a continentally influenced ocean basin [4] and occupies an area of around 2.5 million km² between Africa and Europe [5]. It is divided into two basins, western and eastern, and the Strait of Sicily is considered the point of division [5,6].
The Mediterranean Sea has different gradients or distributions of oxygen isotope composition [7], temperature [1,3,8] and salinity [1,3,8], and it is considered an oligotrophic sea that presents moderate levels of primary production and low nutrient concentrations [9,10]. In the Mediterranean Sea, precipitation is lower than mean evaporation, which has important implications for its biogeochemistry and circulation [8]. The thermohaline circulation introduces warm and fresh surface waters through the Strait of Gibraltar, and the Mediterranean Sea returns cooler and saltier deep waters to the North Atlantic [3]. The thermohaline circulation is driven by seasonal variations in surface water temperature and salinity [7]. The thermohaline circulation through the Strait of Gibraltar keeps the depths of the Mediterranean Sea oxygenated [3] and causes lower salinity in the western basin than in the eastern Mediterranean Sea [8]. Different studies consider that the renewal time of eastern Mediterranean deep water is longer than that of western Mediterranean deep water [8,11,12].
The Mediterranean Sea represents a complex marine environment; for this reason, a large number of researchers have studied its modern biogeochemical and physical processes (including their interactions) [3]. In this sense, the stable

• Artificial neural networks are a computational method inspired by the cells of the nervous system (neurons) [19] that tries to analyse and reproduce the learning mechanism of the more highly evolved animal species [20]. These models can find the relationships between input and output variables [21]. When the relationships are complex and highly non-linear, this kind of model needs a relatively large training data set [22]. ANNs are used as an alternative to statistical methods for purposes such as estimation and classification, among others [23]. ANN approaches are popular due to their flexibility to fit random data and their reasonably uncomplicated development [23,24]. As previously stated, the ANN models developed in this research are based on an MLP neural network, a popular ANN architecture [25]. ANNs are applied in different fields such as chemistry [26], medicine [27] and food authenticity [28], among others [29,30]. This type of model can be part of more complex systems, such as a smart healthcare monitoring system that uses ensemble deep learning to predict heart disease [31] or a skin disease classifier that uses deep learning neural networks based on MobileNet V2 and long short-term memory [32].
Within the research field of this article, the capacity of artificial neural networks to detect trends and patterns in sea surface temperature (SST) has been validated by the oceanographic community [23], as demonstrated by the different researchers who have used this approach to predict SST at different spatial and temporal scales around the world [23]. One example is the work carried out by Aparna et al. (2018) to determine the SST and delineate SST fronts. Similarly, Patil and Deo (2017) developed wavelet neural networks to predict daily SST values at different locations in the Indian Ocean [24]. Neural networks can also be used to determine sea surface salinity (SSS), in addition to temperature. In this case, Buongiorno Nardelli (2020) developed an innovative deep learning algorithm based on a stacked long short-term memory neural network and applied it to North Atlantic Ocean data [33]. ANNs (back-propagation and radial basis function) can also be applied to predict salinity variations in a tidal estuary, where they were compared with an Eulerian-Lagrangian circulation model (ELCIRC) [34]. According to the authors, the predictions from the artificial neural network models were better than those of the physically based hydrodynamic model. Finally, this kind of approach can also be used to predict the isotope composition of oxygen (δ¹⁸O) in shallow groundwater, which can be used to study the water cycle [35]. In this case, Cerar et al. (2018) compared different models, such as ordinary kriging and others, built on three variables (average annual precipitation, elevation and distance from the sea), and concluded, based on validation data sets, that the ANN model was the most suitable approach to predict δ¹⁸O in groundwater [35].

• The second kind of model used is a random forest model. RF is a computational method for regression and/or classification [36] proposed by Breiman (2001) [36,37]. A random forest model is formed by decision trees, where each tree uses a sample subset of the available data [38], and the random forest's prediction is the average of the values predicted by all the trees [38,39]. Random forest is one of the most capable machine learning approaches for forecasting [40] and can be used in different fields such as environmental science [38] and chemistry [41], among others [42,43].
Within the research field of this article, RFs can be used to estimate the ocean's interior salinity using surface remote sensing data [44]. In this sense, Su et al. (2019) used two different methods (one of them random forest) to predict the subsurface salinity anomaly in the upper 2000 m, which can help to understand the response of the subsurface and deeper ocean environment to global warming [44]. Another example of the use of models based on random forest was developed by Lui et al. (2015) to predict sea surface salinity in the Hong Kong Sea [45]. The random forest model was compared with three models (back-propagation ANN, classification and regression trees, and multiple linear regression) and showed a lower estimation error and a good correlation coefficient, thereby demonstrating its capability to estimate sea surface salinity in coastal waters [45]. RF is also used to estimate the error dispersion and the central tendency in satellite-derived SST retrievals [46].

• Finally, the last model developed is a support vector machine. The SVM is a method proposed by Boser et al. in 1992 [47,48]. Originally, SVM models were developed for pattern recognition; nowadays they can also be used to solve nonlinear regression or time series prediction problems [49,50], and, owing to their mathematical simplicity, they have received much attention lately [51]. An SVM model creates a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space [52]. The hyperplane separates the dataset into a number of classes consistent with the training examples [53]. The principal advantage of SVM (compared to other classification techniques such as partial least squares discriminant analysis) is its flexibility to model non-linear classification problems [54]. SVM models can be used in different areas such as engineering [55,56] and medicine [57,58], among others [59,60]. Related to this research field, SVM models can be used to estimate the SST in the tropical Atlantic [61] or to forecast tropical Pacific SST anomalies [62]. In the latter case, Aguilar-Martinez used support vector regression, which was compared with Bayesian neural network and linear regression models.
Finally, these three types of models (ANN, RF and SVM) can also be compared with each other. An example is the article by Sunder et al. (2020), which estimates the daily cloud-free sea surface temperature from a single sensor (MODIS Aqua) [53].
Taking the above into account, all these studies used different machine learning models to predict one or more variables of interest (isotope composition (δ¹⁸O), salinity and temperature) ahead in time. Given the good results reported in these investigations, it was thought possible to use such models to determine these variables at a given time in the past. These models could be used to complete databases and to study the evolution of the Mediterranean Sea.
In this study, the use of artificial neural network, random forest and support vector machine models to determine these variables in the past was analysed. For this purpose, five input variables were used (geographic coordinates (longitude, latitude), year, month and depth), and an attempt was made to relate them to the isotope composition (δ¹⁸O), salinity and temperature/potential temperature.

Database Used
In this study, a large database collected by Schmidt et al. (1999) [63], partially collected in previous publications of Schmidt (1999) and Bigg and Rohling (2000) [64,65], was used. The data were downloaded for longitudes between −4.73° E and 36.00° E and latitudes between 31.30° N and 46.00° N. Nevertheless, this database presents missing values for many variables (the isotope composition, salinity or temperature/potential temperature; the temperature determinations can be in situ or potential temperature [63]). For this reason, cases with missing values, and one case with an anomalous temperature, were deleted, which reduced the database to 470 experimental cases. The data used in this research come from different original studies [7,13,66,67] and are distributed as follows: (i) from Pierre et al. (1986), 92 samples (collected in 1986); (ii) from Pierre (1999), 267 samples (collected between 1988 and 1990); (iii) from Gat et al. (1996), 109 samples (collected between 1988 and 1989); and (iv) from the original research of Stahl and Rinow (1973), 2 samples (collected in 1971). Together, these data bring a total of 470 experimental cases (Table 1).
The 470 experimental cases collected from the original database of Schmidt et al. (1999) [63] were used to establish three different groups: (i) a training group (60% of the total cases) to develop the different models, (ii) a validation group (20% of the total cases) to validate the models developed and (iii) a querying group (the last 20%) to check the chosen prediction model. The cases were assigned to the different groups at random.
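The selection described above (a geographic window plus removal of incomplete cases) can be sketched as follows. This is only an illustration: the field names and the three records are invented for the example and do not come from the database, and the removal of the case with an anomalous temperature is not shown because its criterion is not stated.

```python
# Geographic window quoted in the text (degrees E and degrees N).
LON_RANGE = (-4.73, 36.00)
LAT_RANGE = (31.30, 46.00)

# Hypothetical field names; "temperature" may be in situ or potential.
REQUIRED = ("d18o", "salinity", "temperature")

def keep_case(row):
    """True if the record lies inside the window and has no missing value."""
    in_box = (LON_RANGE[0] <= row["lon"] <= LON_RANGE[1]
              and LAT_RANGE[0] <= row["lat"] <= LAT_RANGE[1])
    complete = all(row.get(name) is not None for name in REQUIRED)
    return in_box and complete

# Invented records, for illustration only.
rows = [
    {"lon": 5.0, "lat": 40.0, "d18o": 1.3, "salinity": 38.2, "temperature": 13.1},
    {"lon": 5.0, "lat": 40.0, "d18o": 1.3, "salinity": None, "temperature": 13.1},
    {"lon": -20.0, "lat": 40.0, "d18o": 0.9, "salinity": 36.0, "temperature": 15.0},
]
kept = [row for row in rows if keep_case(row)]
print(len(kept))  # only the first record is complete and inside the window
```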
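A random 60/20/20 partition of this kind can be sketched as below. The function name and the fixed seed are assumptions made for reproducibility of the example; the text does not say how the random assignment was actually implemented.

```python
import random

def split_cases(cases, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the cases and cut them into training, validation and querying groups."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable example
    n_train = round(train_frac * len(shuffled))
    n_valid = round(valid_frac * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    query = shuffled[n_train + n_valid:]    # whatever remains (the last 20%)
    return train, valid, query

# Case indices stand in for the 470 experimental cases.
train, valid, query = split_cases(range(470))
print(len(train), len(valid), len(query))  # 282 94 94
```

Keeping the querying group untouched until the end is what allows it to serve as a final check on the model chosen with the validation group.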

Methodologies
It is possible to find in the literature different models applied in fields related to the purposes of this paper; for example, Cerar et al. applied artificial neural networks to predict the oxygen-18 isotope composition in Slovenia's groundwater [35], and such networks have even been applied to palaeoceanographic data analysis [68]. Neural network models were introduced for the first time in 1943, when McCulloch and Pitts [69] reported the ability of simple neural networks to calculate just about any logic or arithmetic function [70,71]. A neural network is formed by interconnected neurons that work as independent computational units [23]. Normally, neurons are grouped in layers (input, intermediate and output layers), and signals move from the input layer to the output layer, going through the hidden layers located between them [23]. An MLP is formed by different layers of neurons (input, intermediate and output layers), where each layer is connected to the next layer [72].
In this research, two different ANN models were developed: (i) a neural model (ANN1) with the sigmoidal function implemented in the hidden neurons and the linear function implemented in the output neuron and (ii) a second neural model (ANN2) with the sigmoidal function implemented in all the hidden neurons and the output neuron. To obtain good neural network models, it is necessary to develop models with different topologies (different numbers of neurons in the hidden layers), different numbers of training cycles, and so on. This procedure, known as the trial and error method, was used to find the best model based on the statistics of the validation phase.
A disadvantage of ANN models is that they are time-consuming to develop; for this reason, and taking into account the literature reviewed in the introduction and the experience of the research group, two other techniques, random forest and support vector machine models, were also developed in this research.
The random forest regression model is a computational learning method formed by simple decision trees, where the prediction is the average of the individual trees' predictions [38,39]. In the same way as the ANN models, these models were developed by trial and error to find the best model for the validation phase. In this case, the parameters analysed were the number of trees, the maximal depth and the use of prepruning.
Finally, the support vector machine is a strong technique for classification and regression [52] that in this research was used in regression mode, using the epsilon-SVR and nu-SVR SVM types. To develop the different SVM models, the LIBSVM learner by Chang and Lin [52,73,74] was used. The SVM models were developed using the RBF kernel, and the gamma and C parameters were studied according to the updated guide provided by Hsu et al. [75]. The support vector machine models were built both with normalized and with non-normalized input variables; however, only the models developed with the non-normalized variables are shown in this research because, in general, they gave the best fits.
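The difference between the two layouts can be illustrated with a forward pass through a small network. This is a minimal sketch that assumes a single hidden layer and uses hypothetical, unfitted weights; the topologies and software actually used are not given in this section.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b, sigmoid_output=False):
    """One-hidden-layer MLP with a single output neuron.

    sigmoid_output=False mimics the ANN1 layout (linear output);
    sigmoid_output=True mimics the ANN2 layout (sigmoidal output).
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    z = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    return sigmoid(z) if sigmoid_output else z

# Toy weights (hypothetical, not fitted): 5 inputs, 2 hidden neurons.
x = [0.1, 0.5, 0.3, 0.7, 0.2]   # scaled longitude, latitude, year, month, depth
hidden_w = [[0.2, -0.1, 0.4, 0.0, 0.3], [-0.3, 0.5, 0.1, 0.2, -0.2]]
hidden_b = [0.0, 0.1]
out_w = [1.5, -0.7]
out_b = 0.2

ann1_out = mlp_forward(x, hidden_w, hidden_b, out_w, out_b)
ann2_out = mlp_forward(x, hidden_w, hidden_b, out_w, out_b, sigmoid_output=True)
print(ann1_out, ann2_out)
```

Because a sigmoidal output neuron can only return values in (0, 1), a layout like ANN2 needs the target variable to be rescaled to that interval, whereas a linear output is unbounded.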
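The averaging idea can be sketched with a deliberately tiny forest. The sketch makes several simplifying assumptions: each tree is a one-split "stump" on a single feature, each tree is trained on a bootstrap sample, and the depth and temperature values are invented. The maximal depth and prepruning settings studied in this research are not modelled here.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean(left), mean(right))
    if best is None:                         # all x identical: predict the mean
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_forest(trees, x):
    """The forest prediction is the average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)

# Invented depth (m) and temperature (deg C) pairs, for illustration only.
xs = [0.0, 10.0, 50.0, 100.0, 500.0, 1000.0]
ys = [24.0, 22.0, 16.0, 14.5, 13.5, 13.2]
trees = fit_forest(xs, ys)
print(predict_forest(trees, 75.0))
```

Since every tree returns an average of training targets, the forest cannot predict outside the range of the values it was trained on.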
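For reference, the RBF kernel and a parameter grid of the kind suggested in the guide by Hsu et al. can be sketched as follows. The exponentially spaced ranges below are the ones given in that guide; whether the same ranges were explored in this research is an assumption, and the two example cases are invented.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Exponentially growing sequences of C and gamma (coarse grid from the guide).
C_grid = [2.0 ** k for k in range(-5, 16, 2)]       # 2^-5, 2^-3, ..., 2^15
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3
grid = [(C, gamma) for C in C_grid for gamma in gamma_grid]
print(len(grid))  # 110 candidate (C, gamma) pairs

# Invented cases (longitude, latitude, year, month, depth), not normalized.
a = [5.0, 40.0, 1988.0, 6.0, 100.0]
b = [5.0, 40.0, 1988.0, 6.0, 110.0]
print(rbf_kernel(a, b, gamma_grid[0]))
```

Each candidate pair would then be used to train a model and be scored on the validation cases. With non-normalized inputs the squared distance is dominated by the variables with the largest numeric range, so the useful values of gamma depend strongly on the scaling chosen.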

Fitting of Data and Modelling
As stated above, the database was split randomly into three groups: (i) training group (60% of the cases), (ii) validation group (20% of the cases) and (iii) querying group (20% of the cases). To determine the prediction power of the different models developed, different statistical parameters were used. For this purpose, the squared correlation coefficient (r²), to evaluate the correlation between predicted and real values, the root mean square error (RMSE; Equation (1)) and the mean absolute percentage error (MAPE; Equation (2)) were calculated. The best models were selected using the RMSE for the validation phase and were then checked with the querying cases.
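Equations (1) and (2) are not reproduced in this text. The sketch below uses the standard definitions of the three statistics, which is an assumption about the exact form used here; the toy values are not taken from the paper.

```python
import math

def rmse(real, pred):
    """Root mean square error, in the units of the predicted variable."""
    n = len(real)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(real, pred)) / n)

def mape(real, pred):
    """Mean absolute percentage error (%); undefined if any real value is 0."""
    n = len(real)
    return 100.0 / n * sum(abs((r - p) / r) for r, p in zip(real, pred))

def r_squared(real, pred):
    """Squared Pearson correlation between real and predicted values."""
    n = len(real)
    mean_r = sum(real) / n
    mean_p = sum(pred) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(real, pred))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_r * var_p)

# Toy values (not from the paper): every prediction is 1 unit too high.
real = [10.0, 20.0, 40.0]
pred = [11.0, 21.0, 41.0]
print(rmse(real, pred), mape(real, pred), r_squared(real, pred))
```

In this toy case the squared correlation is 1 even though every prediction is biased, which is one reason to read r² together with RMSE and MAPE rather than on its own.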

Computational Resources
The research group has several servers to carry out these tasks; in this case, a computer equipped with an AMD Ryzen 7 1800X processor (Advanced Micro Devices, Inc., Sunnyvale, CA, USA) and 16 GB of random access memory was used. The models ANN1

Results and Discussion
To find the best prediction model (artificial neural network, random forest or support vector machine), it was necessary to develop a large number of models using the trial and error method. The best models (Table 2) were chosen according to the results obtained for the validation phase. In the following paragraphs, the best models for each variable are analysed.

δ¹⁸O Model
Stable isotope composition can provide, along with other variables, information about the origin and mixing pattern of water masses [13]. Table 2 shows the squared correlation coefficient for the training, validation and querying phases for the best ANN1, ANN2, RF and SVM models selected. According to Figure 1, the ANN1, ANN2 and SVM models present a large dispersion in the training phase. This is especially clear in the SVM model, which presents the worst fit, with a root mean square error of 0.167‰ for the training phase and the lowest squared correlation coefficient (0.554); this may be due to the flat area located on the right side of the figure. Accordingly, the results for the validation phase show a low squared correlation coefficient (0.520) and a high root mean square error (0.132‰); once again, a flat area can be seen on the right. The other models, ANN1 and ANN2, show slightly better results for the validation phase, with squared correlation coefficients of 0.614 and 0.641, respectively, in these cases without a flat area on the right. The best model, according to the results shown in Table 2, is the random forest model. This model is characterized by the absence of the flat prediction zone in both the training and validation phases. This can be observed in the r² statistic, which is 0.889 for the training phase and 0.682 for the validation phase. In the same way, the other statistics, RMSE and MAPE, present the minimum value for each phase (due to the low dispersion of the model).

As stated above, the models are chosen based on the statistics of the validation phase. These fits are used as estimators of how the model would perform in the real world. To ensure good performance, the best models are applied to the querying-phase data group (Table 2 and Figure 1). In this phase, behaviour similar to that observed in the training and validation phases can be seen. Once again, the worst model is the SVM model, which shows the worst fit for the querying phase in terms of r² and root mean square error (0.454 and 0.142‰, respectively), with a mean absolute percentage error of 7.38%. The fits provided by the SVM model are similar to those obtained for the training and validation phases. The two models based on artificial neural networks behave as in the training and validation phases, that is, with better squared correlations and lower prediction errors than the SVM model. Finally, the model based on random forest shows the best results, with an r²_Q of 0.739 and a MAPE_Q of 4.98%.
Regarding the flat zone observed in the training phase, it is unusual that the flat prediction zone occurs only at high values of δ¹⁸O. At low values of δ¹⁸O, this flat zone is only slightly detected in the case of the model based on a support vector machine. This suggests that the models based on neural networks and support vector machines do not work as well as they should when δ¹⁸O exceeds values of around 1.7‰. This behaviour was clearly reduced in the validation phase, probably due to the small number of cases with values higher than the limit described above. The flat prediction area is not observed in any of the three phases of the RF model; in fact, this model presents the best fits in all phases in terms of r² and also in terms of the measures of dispersion (the root mean square error and the mean absolute percentage error), that is, the data fit well to the line with slope one (black line).
Given the results obtained by the RF model, it can be concluded that the model is useful for predicting δ¹⁸O in the Mediterranean Sea.

Salinity Model
The other variable of interest predicted using the proposed models is salinity. Table 2 shows the fits for the best models developed. The models show, in general, better fits for all phases than the previous models (δ¹⁸O models). This is clearly visible in the training phase, where the squared correlations are higher (between 0.891 and 0.978) than those of the models presented in the previous section (between 0.554 and 0.889). In terms of mean absolute percentage error, the improvement in this same phase (training) is notable, going from the range 3.84–7.13% (δ¹⁸O models) to the range 0.12–0.27% (salinity models). This improvement can be seen in Figure 2, where only a few points are away from the line with slope one; this occurs for the ANN1, ANN2 and SVM models. In the worst model in the training phase, the ANN1 model, there is a point with a notable error (predicted value 39.01‰ vs. real value 37.90‰ (Figure 2)), with an individual percentage error (IPE) of 2.94% (the real value is overestimated). Given this low IPE value, it can be concluded that the points away from the line with slope one are not really outliers, owing to their low relative error.
For the validation phase, the fits present good values of r²_V, between 0.870 and 0.914 for the ANN2 and RF models, respectively (Table 2). The root mean square errors increase slightly, although they remain low (under 0.30%). As in the training phase, there are some points away from the line with slope one. All models presented some of these points, even the RF model, which presented a case with an IPE of 2.66% (37.40‰ vs. 38.39‰, overestimating the real value) (see Figure 2). Once again, given the low IPE value, this point cannot be considered an outlier.
Once the prediction power of the models had been verified, the chosen models were applied to the querying cases. The models still worked accurately; that is, they could predict the experimental salinity values with small errors, with RMSE_V under 0.210‰, corresponding to small mean absolute percentage error values (MAPE_V) of approximately 0.29%. The models in this phase presented squared correlation coefficients between 0.864 and 0.942. The ANN2 model presented three points away from the line with slope one: one case with an IPE of −2.65% (38.47‰ vs. 37.45‰) and two cases with the same magnitude but different sign, −2.45% (38.35‰ vs. 37.41‰) and 2.45% (37.55‰ vs. 38.47‰); that is, two cases were underestimated and one was overestimated (see Figure 2). Once again, the low IPE values indicate that these three points are not outliers.
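The IPE is not defined explicitly in this section. Assuming it is the signed relative error (predicted minus real, divided by real, in percent) and that the pairs quoted for the querying phase are listed as real vs. predicted value, the three ANN2 cases can be reproduced as follows.

```python
def ipe(real, pred):
    """Individual percentage error (%); positive values mean overestimation."""
    return 100.0 * (pred - real) / real

# (real, predicted) salinity pairs quoted for the ANN2 querying cases.
cases = [(38.47, 37.45), (38.35, 37.41), (37.55, 38.47)]
for real, pred in cases:
    print(f"{ipe(real, pred):+.2f}%")
# -2.65%
# -2.45%
# +2.45%
```

Under this convention the two negative values are underestimates and the positive one is an overestimate, in agreement with the description of these three cases.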
Given the results obtained by the RF model, it can be concluded that the random forest model can predict with accuracy the salinity in the Mediterranean Sea.

Temperature/Potential Temperature Model
Finally, a new group of models was developed to predict the temperature/potential temperature of Mediterranean seawater. Table 2 shows the results obtained for the best prediction models selected. The ANN1 model is the worst model because it presents the worst result in the validation phase.
The ANN1 model presents a good squared correlation coefficient for the training phase (0.937), with an RMSE_T value of 0.745 °C that corresponds to a MAPE_T value of 3.95%. The two ANN models behave similarly; that is, ANN1 and ANN2 present good fits for the training phase, with r² values of 0.937 and 0.934 and similar root mean square errors (0.745 °C and 0.717 °C, with MAPE_T values of 3.95% and 3.07%), respectively. The SVM model presents fits similar to those of the ANN models (although with a slight improvement in the RMSE and MAPE values). Once again, the RF model presented the best fit for the training phase, with an r²_T of 0.972 and an RMSE_T of 0.467 °C that corresponds to a MAPE_T of 1.99%. In Figure 3, it can be seen that the ANN2 and SVM models presented, for the training phase, two points away from the line with slope one (top right of the figure). For the ANN2 model, these two points (28.
In the validation phase, all models present good results according to the squared correlation coefficient, with values between 0.926 and 0.972 and RMSE_V values in the range 0.452–0.757 °C (Table 2). An error under one degree may be considered acceptable. In the SVM model (Figure 3), three points can be seen away from the line with slope one, with IPE values of −14.13%, 11.31% and 19.60%. The same three points can also be seen away from the line with slope one in the ANN1 model (IPE values between −12.94% and 15.13%).
For the querying phase, the ANN models present the worst results. This can be clearly seen for the ANN2 model, where the RMSE increased to 0.777 °C, which corresponds to a MAPE of 3.34%. The prediction is slightly improved by the ANN1 model (0.699 °C). Once again, the random forest model presents the best fits for the querying phase (with similar values for the SVM model). The RF model showed the best squared correlation coefficient (0.953), the lowest root mean square error (0.513 °C) and a MAPE value of 2.44%. Given these results, the RF model can be used to predict the temperature in the Mediterranean Sea.
All the models developed in this research to determine δ¹⁸O, salinity and temperature/potential temperature worked quite well, showing acceptable errors below 8.00%. The low percentage errors and the good squared correlation coefficients shown by the models to predict salinity and temperature/potential temperature seem to indicate a high correlation between the input variables and the variables to be predicted. This is less marked in the models to predict δ¹⁸O, where, despite the low percentage errors, a low squared correlation coefficient is observed for the different models in all phases, except in the training phase of the RF model, where a value of 0.889 is reached. This low correlation, not only in the random forest models but also in the rest of the models to predict δ¹⁸O, might suggest that the variables selected to determine this parameter should be complemented with other input variables to improve the squared correlation coefficients and the percentage errors (by decreasing the RMSE).
The models developed in this research can be used to determine, with reasonable confidence, the levels of δ¹⁸O, salinity and temperature/potential temperature of the waters of the Mediterranean Sea from the geographical coordinates, year, month and depth.
These models have the disadvantage of requiring a longer processing time and computational cost than other types of more traditional models, such as models based on simple multiple linear regressions (models that are practically instantaneous compared to machine learning models such as those presented in this research). However, this inconvenience is overcome by the great capacity of these models (ANN, RF and SVM) to find the necessary relationships between the independent and dependent variables and achieve a good result.
Our models could be useful for all those research works that require, or need, the use of past data for these variables. These models work well between the dates analysed in this research. Outside of these dates, the model could lose predictive power due to the possible temporal evolution of the Mediterranean Sea caused by different factors that could influence it such as climate change, pollution phenomena, among others.
These models are far from perfect because they present points distant from the line with slope one, as well as points that, although close to it, can present high IPE values (points located in the lower areas of the line with slope one). The models should be optimized by including more sampling data from different locations, depths and measurement dates, and by studying different combinations of model parameters (widening their study range or analysing more parameters), among other measures. Another possible way to improve the models is to establish independent databases for each variable under study (avoiding the elimination of cases that have only one missing value). In addition to these possible improvements, a more exhaustive treatment of the data is necessary to discriminate and better choose the input variables, avoiding possible noise such as that caused by the joint inclusion of temperature and potential temperature values.

Conclusions
In this study, different models were developed to predict the isotope composition (δ¹⁸O), salinity and temperature/potential temperature in the Mediterranean Sea using five variables: (i–ii) geographic coordinates (longitude, latitude), (iii) year, (iv) month and (v) depth. The δ¹⁸O models present moderate predictive power (MAPE_Q between 4.98% and 7.38%). The salinity models can predict the salinity value accurately (MAPE_Q under 0.30%). The models to predict water temperature/potential temperature presented good predictive power, with MAPE_Q values between 2.44% and 3.99%.
Taking into account the different models implemented in this research and the results obtained, the authors conclude that random forest models proved to be a valid prediction tool to determine with accuracy the oxygen-18 isotope composition, the salinity and the temperature/potential temperature of the Mediterranean Sea.
The authors suggest that new models trained with a larger number of samplings, and a more detailed study of the data, could improve the accuracy of the developed models in this research.

Data Availability Statement:
The data used in this research to develop the different models were collected by Schmidt et al. (1999) [63] from different sources and are available at https://data.giss.nasa.gov/o18data/. Please see "2.1. Database Used" for more information.