Predictive Innovative Methods for Aquatic Heavy Metals Pollution Based on Bioindicators in Support of Blue Economy in the Danube River Basin

: Heavy metal pollution is still present in the Danube River basin, due to intensive naval and agricultural activities conducted in the area. Therefore, continuous monitoring of this pivotal aquatic macro-system is necessary, through the development and optimization of monitoring methodologies. The main objective of the present study was to develop a prediction model for heavy metals accumulation in biological tissues, based on ﬁeld gathered data which uses bioindicators (ﬁsh) and oxidative stress (OS) biomarkers. Samples of water and ﬁsh were collected from the lower sector of Danube River (DR), Danube Delta (DD) and Black Sea (BS). The following indicators were analyzed in samples: cadmium (Cd), lead (Pb), iron (Fe), zinc (Zn), copper (Cu) (in water and ﬁsh tissues), respec-tively, catalase (CAT), superoxide dismutase (SOD), glutathione peroxidase (GPx), malondialdehyde (MDA) (in ﬁsh tissues). The pollution index (PI) was calculated to identify the most polluted studied ecosystem, which revealed that Danube River is seriously affected by the presence of Fe (IP = 4887) and strongly affected by the presence of Zn (IP = 4.49). The concentration of Cd in ﬁsh muscle tissue was above the maximum permitted level (0.05 µ g/g) by the EU regulation. From all analyzed OS biomarkers, MDA registered the highest median values in ﬁsh muscle (145.7 nmol/mg protein in DR, 201.03 nmol/mg protein in DD, 148.58 nmol/mg protein in BS) and ﬁsh liver (200.28 nmol/mg protein in DR, 163.67 nmol/mg protein, 158.51 nmol/mg protein), compared to CAT, SOD and GPx. The prediction of Cd, Pb, Zn, Fe and Cu in ﬁsh hepatic and muscle tissue was determined based on CAT, SOD, GPx and MDA, by using non-linear tree-based RF prediction models. The analysis emphasizes that MDA in hepatic tissue is the most important independent variable for predicting heavy metals in ﬁsh muscle and tissues at BS coast, followed by GPx in both hepatic and muscle tissues. The RF analytical framework revealed that CAT in muscle tissue, respectively, MDA and GPx in hepatic tissues are most common predictors for determining the heavy metals concentration in both muscle and hepatic tissues in DD area. For DR, the MDA in muscle, followed by MDA in hepatic tissue are the main predictors in RF analysis. GPX both hepatic and muscle tissues. The prediction of Fe in hepatic tissues MDA and GPx from hepatic tissue, while Fe in muscle tissue on GPx in hepatic tissues and CAT muscle tissues. For Zn prediction in hepatic tissue, SOD hepatic and MDA muscle and hepatic are most signiﬁcant independent parameters that must be considered, while for Zn prediction in muscle tissues, CAT hepatic and both GPX, hepatic and muscle, must be considered as most important. CAT hepatic For hepatic while for Zn prediction in muscle tissues, both GPX and CAT in hepatic tissues must be considered as most important. The Cu prediction in both muscle and hepatic tissues MDA and in hepatic tissues, respectively, CAT in muscle and GPX in hepatic tissues as


Introduction
The Romanian Danube River basin, especially the lower sector, represents the basis for developing the country's blue growth potential, to align to the European Union average performance registered by the blue economy sector. Therefore, the living resources subsector must be developed, while assuring both the safety and security of aquatic food products. However, this objective can pose a real challenge to achieve since in the lower sector of the Romanian Danube River (LRDR) basin several important economic/anthropogenic activities related to shipyard industry (in the area of Brăila, Galat , i and Tulcea city), steel industry (Galat , i county), alumina industry (Tulcea county) and intensive agricultural activity (especially in Brăila and Tulcea county) are present. Thus, all the activities generate significant concentrations of heavy metals released in the nearby environment, creating a risk of heavy metal accumulation in the aquatic ecosystem, situation which requires periodic pollution monitoring activities. Additionally, the LRDR connects to the Black Sea through the Danube Delta and if pollution occurs upstream of the river, the negative effects of this phenomenon can extend on the Romanian Black Sea and Danube Delta coastal areas as well.
According to [1], the coastal area accumulates large quantities of metals from agricultural and industrial activities. Fish are often used as bioindicators for water pollution caused by heavy metals [2][3][4], through fish liver analysis. Furthermore, the Regional Organization for the Protection of the Marine Environment recommends the use of fish as environmental monitoring tools [5]. According to [3], machine learning based models for the detection of heavy metals in aquatic environments are increasing the economic efficiency of monitoring campaigns conducted to identify the ecological status of surface waters and, additionally, the health risk that is associated with the ingestion of heavy metals after fish consumption.
Fish stocks are key components in the development of a blue economy based on marine living resources. Thus, fish meat analysis must be performed to verify food safety in terms of heavy metal concentration. However, heavy metal analysis involves a high financial cost, complex logistics and the development of complex working protocols [3]. In addition, since heavy metals are not equally distributed in the aquatic environment and the identifications of hot spots imply the collection of many samples for analysis, the monitoring campaigns are cost-demanding. Therefore, even though ecotoxicological mathematical models were performed to predict internal concentrations of organic chemicals in fish [6], more recent studies recommend the continual development and upgrade of these models [7].
The oxidative stress (OS) biomarkers analysis is less expensive and offers information related to fish welfare status.
Manifestation of the OS within an organism occurs due to the disparity between the production of reactive oxygen species (ROS) and the organism's capacity to convert ROS into less harmful oxygen species or to repair the oxidative damage [8]. Enzyme activity is a well-known inhibition mechanism for oxidative cytotoxicity and antioxidant enzymes can catalyze the reaction of free radicals' reduction [9,10]. Antioxidant enzymes such as superoxide dismutase (SOD), glutathione peroxidase (GPx) and catalase (CAT) are the main components of the first line antioxidant defense mechanism and can prevent ROS production [10,11].
The main free radicals which can cause the oxidation of cellular components such as lipids, proteins and DNA, and which eventually lead to mutagenesis or cell death, are the superoxide radical (O 2 − ), the hydroxyl radical (OH) and hydrogen peroxide (H 2 O 2 ) [12,13]. The defense mechanism of SOD acts by inhibiting the activity of superoxide anion, through the transformation of O 2 -in H 2 O 2 [13]. Furthermore, the CAT and GPx antioxidant enzymes act as catalyzers for the decomposition of H 2 O 2 to H 2 O and O 2 , and, therefore, provide a detox mechanism in living cells [9,14]. In addition, ROS cytotoxic effects includes the peroxidation of membrane lipids (oxidation of polyunsaturated fatty acids) which causes high levels of malondialdehyde (MDA) [15]. The exposure of biota to heavy metals is known to induce OS [16][17][18][19]. OS biomarkers are sensitive to environmental contaminants and respond fast in their presence, including heavy metals [20,21]. OS biomarkers have been previously used to highlight the effects induced by heavy metal pollution in the aquatic environments, by using different model organisms such as crabs [22], mussels [15,23], frogs [24], jellyfish [25] and fish [14,26,27]. Fish are the vertebrates which manifest the fastest biological response to aquatic environmental pollution [9,28]. Metals with redox potential (Cu, Fe, Ni) are involved in the Fenton reaction as ROS catalyzers, while metals with no redox potential (Cd, Pb, Zn) compromise the antioxidant defense system by depletion of antioxidants [12,14,16]. Fish species inhabiting metal contaminated aquatic environments manifest high level of antioxidant enzymes such as SOD, GR, GST, GPx, CAT and free antioxidants such as GSH [16].
However, mathematical modelling of oxidative stress induced by heavy metals is a new concept and few research have been undertaken regarding this subject. For instance, Ning with co-authors [40] applied the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), using the software Matlab, to analyze the degree of toxicity to the earthworms after various stress concentrations. Gao et al. [41] fitted a toxicodynamic (TD) and toxicokinetic (TK) model for oxidative stress depended on the presence of heavy metals in zebrafish gills. However, the authors concluded that the TD model needs to be improved. Sahlo et al. [42] trained a neural network using sine-cosine algorithm to predict CAT, SOD and GPx in fish exposed to selenium nanoparticles.
The main goal of the European Common Fisheries Policy is to ensure that fishing and aquaculture activities provide a source of healthy food for EU citizens.
The precocious identification of heavy metal hot spot occurrence can be vital to assuring a sustainable blue economy in LRDR since this phenomenon may not only affect wild fish stocks, but also aquaculture facilities supplied with water from the Danube River. Therefore, heavy metal pollution can affect the sustainability of a blue economy both through the impact on wild fish stocks and on the aquaculture productions, situation which could create pressure on fish stocks from other fishing areas, which are not affected by pollution. Recent papers [43] highlighted that strategic and tactical decision-making within the blue economy concept can be achieved through the exploitation of aquatic numerical modelling and observations, which can represent common monitoring practices among coastal areas [44].
The present paper develops an innovative analytical framework which can be used to monitor the heavy metal pollution status in LRDR, based on fish OS biomarkers and the use of random decision forests (RF) algorithms. The analytical framework targets the identification of proper, high precision mathematical models and methodology, which can predict heavy metal concentrations in the tissues of different economically valuable fish species from the Danube River-Black Sea Basin, based on the enzymatic and nonenzymatic activity in muscle and liver tissues. The results of the present research can provide a cost-effective tool, which can support the Danube River Basin Management Plan (DRBMP), which is the most important task in the implementation of the Water Framework Directive (WFD). As well, the proposed tool can support the blue economy through the development of standardized, machine-learning based, monitoring methods of heavy metal pollution in the macro-system Danube River-Danube Delta-Black Sea, by using OS biomarkers as predictors.
In order to determine a reliable predictive model framework for identifying the heavy metals concentration in fish muscle and liver tissue, a non-linear methodology based on random forest algorithms was used. Random forest algorithms were successfully used in different contexts requiring heavy metals concentration prediction. For example, in [45], the authors are proposing a random forest approach for a rapid evaluation of heavy metals concentration (Cd, Cu, Pb, Zn) in Tegillarca granosa (a species of clam known as the blood clam because of the red hemoglobin liquid inside the soft tissues), to assess the magnitude of contamination. In [46], various machine learning algorithms were used for differentiating between organic and non-organic cattle production based on the analysis of the concentration of 14 mineral elements (As, Cd, Co, Cr, Cu, Fe, Hg, I, Mn, Mo, Ni, Pb, Se and Zn) in 522 serum samples from cows (341 from organic farms and 181 from non-organic farms), the best results being obtained with the decision tree C5.0, Random Forest and AdaBoost neural networks. Random forests were also used for determining heavy metals soil pollution. In their study, Zhang et al. [47], determined the concentrations of As, Cr, Cu, Ni, Pb, Zn, Cd, Hg, organic carbon, nitrogen, phosphorus, pH, Cl, clay, silt, sand, CaO, Fe 2 O 3 , Al 2 O 3 and SiO 2 in samples of soil and sediment collected in the region of Eastern China and the potential sources/concentration of heavy metals were identified using random forest (RF) and positive matrix factorization (PMF) models. Similarly, [48] emphasizes on the fact that heavy metals lead to food contamination endangering the human health. The objective of their study was to estimate the concentrations of heavy metals, such as zinc, chromium, arsenic by using random forest (RF) algorithm on collected data from the study area of Xuzhou, China. The obtained results were conclusive with an R-Sq around 0.9 for all their models and also very low RMSEs. Random forest methodology was also used in establish heavy metals impact on human health. For example, in [49], the authors show that lead, mercury and cadmium have a significant impact on hypercholesterolemia (HC). They used various machine learning models (k-nearest neighbors, decision trees, random forests and support vector) with data collected from 10,089 participants to predict the prevalence of HC associated with exposure to lead, mercury and cadmium. In addition to predicting heavy metals concentrations in organic tissues or soil sediments, random forests were also used, as a prevention method, to assess the water heavy metal pollution. Thus, [50] shows how it is possible, based on a random forest approach, to predict the pollution level in Marilao-Meycauayan-Obando River System, located in the province of Bulacan, Philippines, where the inhabitants are continuously exposed to pollution coming from informal industries such as used lead-acid battery or open dumpsites metal refining.

Study Area
The study area is represented by the macro-system Danube River (DR)-Danube Delta (DD)-Black Sea (BS), on the LRDR territory. It is well documented that the LRDR tends to concentrate the highest levels of different heavy metals accumulated from the Upper and Middle sectors [4]. Furthermore, according to the last DRBMP, Romania discharges the highest rates of Pb (7477 kg/year) and Cd (1605 kg/year) compared to other riparian countries. The LRDR could be considered the basis for improving Romanian blue growth in the context of European Union intention for developing a more sustainable blue economy, due to large diversity of fish stocks and aquaculture fish farm facilities connected to the Danube River for assuring their inlet water. Thus, this region must be subject to a strict operational management which should prevent possible pollution situations. Since heavy metals are pollutants difficult to be controlled or removed and they tend to accumulate in various trophic levels within an ecosystem, a rigorous monitoring management must be implemented especially in areas which are considered hot spots due to their specific heavy industrial activity. Therefore, the present study considers the Danube section between the city of Galati (45 • 25 50 N 28 • 3 33 E) and the city of Tulcea (45 • 10 51 N 28 • 47 43 E) ( Figure 1, orange spots), as monitoring heavy metal pollution in this area is imperative due to the anthropogenic pressure exercised on Danube River water column through the naval (the largest naval shipyard on the Danube is in Galat , i city and shipyard industry is presented also in the city of Tulcea) and metallurgical (the largest steel mill in Romania is located in the city of Galat , i and the largest alumina refinery from Romania is located in the city of Tulcea) activities conducted in this region. In the Danube Delta, sampling  (Figure 1, purple spots) since in these lakes significant fishing activities are reported and their geographical position, surrounded by intensive agriculture fields, is representative of most of the lakes presented in this study region. The practice of intensive agriculture near lakes, rivers and ponds is a common practice within Danube Delta and can be a threat to the sustainability of marine living resources blue economy subsector. In the Black Sea area, sampling was undertaken between the flowing mouth of the Danube River in the Black Sea, in the Saint George locality (44 •  Romania is located in the city of Galați and the largest alumina refinery from Romania is located in the city of Tulcea) activities conducted in this region. In the Danube Delta, sampling was performed in 3 representative lakes located on the Saint George branch (45°3′9" N 29°7′40" E, 44°58′38″ N 29°12′46″ E, 44°59′4″ N 29°10′29″ E) ( Figure 1, purple spots) since in these lakes significant fishing activities are reported and their geographical position, surrounded by intensive agriculture fields, is representative of most of the lakes presented in this study region. The practice of intensive agriculture near lakes, rivers and ponds is a common practice within Danube Delta and can be a threat to the sustainability of marine living resources blue economy subsector. In the Black Sea area, sampling was undertaken between the flowing mouth of the Danube River in the Black Sea, in the Saint George locality (44°52′14.8″ N 29°37′15.4″ E), and the Perisor Beach (44°47′43″ N 29°16′8″ E), since this is considered an important fishing area and is situated immediately after Danube Delta buffer zone ( Figure 1, blue spots).

Pollution Index
There are several water quality indices that can describe the ecological status of a water body. For instance, water quality index (WQI), Canadian Council of Ministers of the Environment Water Quality Index (CCME-WQI) and heavy metal pollution index (HPI) use in their specific equations more than 1 variable. In addition to Fe, Zn and Cr, WQI takes into consideration other water quality parameters such as phosphorous, nitrates, chlorides, alkalinity, total dissolved solids, total hardness, pH, electrical conductivity, dissolved oxygen, turbidity, biological oxygen demand, chemical oxygen demand [51,52]. In addition, HPI takes into consideration the concentrations of chromium, cobalt, nickel, copper, zinc, cadmium, mercury, arsenic, manganese and iron [52]. The pollution index (PI) is calculated based on the concentration of the individual metal; thus, each metal has its specific PI value [53]. The PI was chosen in this present study because it was targeted to highlight the impact of each studied metal on the investigated areas and to identify possible ecological risks generated by a specific contaminant, and not by the mixture of analyzed metals. To better characterize the study area in terms ecological status, the pollution index was calculated. According to [54], this index facilitates the identification of the impact of different heavy metals presence in a certain area, to identify the pollution degree and possible hotspot. Therefore, the following formula (1) was applied:

Pollution Index
There are several water quality indices that can describe the ecological status of a water body. For instance, water quality index (WQI), Canadian Council of Ministers of the Environment Water Quality Index (CCME-WQI) and heavy metal pollution index (HPI) use in their specific equations more than 1 variable. In addition to Fe, Zn and Cr, WQI takes into consideration other water quality parameters such as phosphorous, nitrates, chlorides, alkalinity, total dissolved solids, total hardness, pH, electrical conductivity, dissolved oxygen, turbidity, biological oxygen demand, chemical oxygen demand [51,52]. In addition, HPI takes into consideration the concentrations of chromium, cobalt, nickel, copper, zinc, cadmium, mercury, arsenic, manganese and iron [52]. The pollution index (PI) is calculated based on the concentration of the individual metal; thus, each metal has its specific PI value [53]. The PI was chosen in this present study because it was targeted to highlight the impact of each studied metal on the investigated areas and to identify possible ecological risks generated by a specific contaminant, and not by the mixture of analyzed metals. To better characterize the study area in terms ecological status, the pollution index was calculated. According to [54], this index facilitates the identification of the impact of different heavy metals presence in a certain area, to identify the pollution degree and possible hotspot. Therefore, the following Formula (1) was applied: where: For this study, the metal standard value in surface waters was taking into consideration according to the Romanian Regulation Order no. 161/2006.

Sample Collection and Preparation
Samples of water (n = 50) were collected from the Danube River, Danube Delta and Black Sea in 50 mL polyethylene (PE) Falcon tubes, previously washed with ultrapure water. Each sample was mineralized with 65% Suprapure HNO 3 to ensure metals dissolution and kept in the freezer until analysis.
A total number of 158 fish specimens were caught by commercial fishermen using specialized fishing nets (approved by the National Authority for Fishing and Aquaculture) and taxonomically identified as it follows: in the Danube River, Abramis brama (Linnaeus, 1758) (n = 9), Alosa immaculata (Bennett, 1835) (n = 10), Leuciscus aspius (Linnaeus, 1758) The biological material was placed in PE bags, stored in the icebox and transported to the laboratory, where it was kept in the freezer until analysis. Samples of muscle and hepatic tissue were collected from each fish and each sample was submitted to the digestion procedure, using certified Merck reagents (65% Suprapure HNO 3 and 30% H 2 O 2 ) and the microwave digestion system Berghof Speedwave MWS-2. The muscle tissue was chosen for sampling because all fish species analyzed in the present study are economically important and are consumed by the local population. Thus, a possible fish contamination with heavy metals would pose health risks to the human population. In addition, the liver is an active detoxification mechanism of different pollutants and heavy metals are known to degrade the lipid mechanism [34]. Therefore, liver tissue from each fish was sampled.
The protocol for sample preparation was conducted according to [2,55]. A total number of 316 samples were analyzed.

Heavy Metals Analysis
For metal quantification in samples the high-resolution continuum source atomic absorption spectrometry (HR-CS-AAS) technique was used. Cd, Pb, Cu, Fe and Zn were determined due to their high toxicity potential to exposed biota. All samples were analyzed using the ContrAA ® 700 equipment, produced by Analytik Jena, Germany. The graphite furnace method was applied for Cd, Pb and Cu analysis; and for Fe and Zn, the flame method was applied, following the optimization program previously described by [55,56].
To verify the accuracy of the results registered in case of the samples laboratory analysis, a certified reference material (CRM) for fish muscle was used (for heavy metal analysis), according to [2].

Enzymatic and Non-Enzymatic Biomarkers Determination
The CAT, SOD, GPx and MDA were analyzed from fish muscle and hepatic tissues. Each sample was homogenized in 10 volumes of ice-cold saline solution (0.90% NaCl) and centrifuged at 5500 rpm, at 4 • C. The resulting supernatant was discharged in clean tubes and divided in aliquots for biomarker analysis. Merck certified assay kits were used to determine CAT (CAT100-1KT), SOD (19190-1KT-F), GPx (CGP1-1KT) and MDA (MAK085-1KT). The working protocol was previously described by [57]. All samples were measured using Specord 210 Plus, AnalytikJena equipment. For OS biomarkers, the enzyme activities were determined as outlined in the detailed protocol provided by Merck for each assay kit.

Model Development
In order to identify a predictive modelling framework for determining heavy metals concentration in fish hepatic and muscle tissues from LRDR, based on OS biomarkers, a series of procedures were performed, starting from the raw data and ending to final prediction models ( Figure 2).

Enzymatic and Non-Enzymatic Biomarkers Determination
The CAT, SOD, GPx and MDA were analyzed from fish muscle and hepatic tissues. Each sample was homogenized in 10 volumes of ice-cold saline solution (0.90% NaCl) and centrifuged at 5500 rpm, at 4 °C. The resulting supernatant was discharged in clean tubes and divided in aliquots for biomarker analysis. Merck certified assay kits were used to determine CAT (CAT100-1KT), SOD (19190-1KT-F), GPx (CGP1-1KT) and MDA (MAK085-1KT). The working protocol was previously described by [57]. All samples were measured using Specord 210 Plus, AnalytikJena equipment. For OS biomarkers, the enzyme activities were determined as outlined in the detailed protocol provided by Merck for each assay kit.

Model Development
In order to identify a predictive modelling framework for determining heavy metals concentration in fish hepatic and muscle tissues from LRDR, based on OS biomarkers, a series of procedures were performed, starting from the raw data and ending to final prediction models ( Figure 2). According to Figure 2, after collecting the raw data parameters from the three regions presented above, in order to understand if there are significant data differences between the regions, due to the non-normal distribution of the data, a Kruskal-Wallis analysis of variance was performed. The results indicated significant differences in regions data distribution, therefore each region was independently analyzed. A region analysis involved three main steps: (a) calculate the correlation matrix for acquiring the perspective over the existing linear relations in the dataset-with possible impact over the predictive techniques such as linear regression, retaining only the relevant values, respectively, above 0.6 and below −0.6; (b) develop linear regression models on normalized datasets (scaled regressors), using stepwise backward and forward selection for obtaining the most relevant predictors [58] and comparable coefficients that would help to distinguish between predictors relevancy; (c) develop predictive random forest models for the targeted features.
For choosing the final predictive models, the model selection procedure took into consideration two main metrics, respectively, R-Sq and Root Mean Square Error (RMSE). The R-sq has an intuitive scale as it ranges from zero to one, where zero indicates no improvement over the mean model, and one indicating perfect prediction having the entire According to Figure 2, after collecting the raw data parameters from the three regions presented above, in order to understand if there are significant data differences between the regions, due to the non-normal distribution of the data, a Kruskal-Wallis analysis of variance was performed. The results indicated significant differences in regions data distribution, therefore each region was independently analyzed. A region analysis involved three main steps: (a) calculate the correlation matrix for acquiring the perspective over the existing linear relations in the dataset-with possible impact over the predictive techniques such as linear regression, retaining only the relevant values, respectively, above 0.6 and below −0.6; (b) develop linear regression models on normalized datasets (scaled regressors), using stepwise backward and forward selection for obtaining the most relevant predictors [58] and comparable coefficients that would help to distinguish between predictors relevancy; (c) develop predictive random forest models for the targeted features.
For choosing the final predictive models, the model selection procedure took into consideration two main metrics, respectively, R-Sq and Root Mean Square Error (RMSE). The R-sq has an intuitive scale as it ranges from zero to one, where zero indicates no improvement over the mean model, and one indicating perfect prediction having the entire dependent variable variance described by the predictors. The RMSE, the square root of the variance of the residuals, indicates the absolute fit of the model to the data. If R-sq is a relative measure of fit, RMSE is an absolute measure. RMSE is interpreted as the standard deviation of the unexplained variance, being in the same units as the response variable.
RMSE is a good measure for describing how accurately a model predicts the response, being one of the most important fit criteria for predictive models based on OLS (ordinary least square) estimations [59].
Last step of the research procedure was to identify for each reliable predictive model the order of the most relevant predictors, being able this way to assess strong prediction influences of the various dependent variables. Regarding the predictive modelling procedure to determine the heavy metals concentration in fish liver and muscle tissues, the current research started with the multiple linear regressions, a technique that uses a number of explanatory variables to predict the outcome of the dependent variable, with the goal of modelling the linear relationship between the independent variables and the response variable. The following assumptions were tested: there is a linear relationship between the dependent variables and the independent variables, the predictors are not highly correlated with each other; yi observations are selected independently and randomly from the population; residuals should be normally distributed.
Due to the fact that for the existing context, all linear models poorly described the relations between predictors and dependent variables (low Adj-Rsq values), the current research needed to use a non-linear approach for obtaining reliable predictive models. Even if the MLR models explained a low to medium percentage of the dependent variable variance, the current research presents in Table A1 the MLR models for heavy metals prediction in both muscle and hepatic tissues as a reference for the subsequently RF models. There are many ways to handle non-linearity problems by using the current machine learning methodologies (e.g., polynomial regressions, artificial neural networks, tree-based regressors, ensemble learning, etc.).
When choosing one of these approaches, several factors should be considered: (a) volume of available data, (b) avoiding overfitting, (c) feature importance determination, (d) prediction accuracy, (e) data distribution Measuring heavy metals concentration in muscle or liver tissue of the aquatic organic organisms is an expensive task in terms of time and financial resources. Thus, the volume of data cannot be expected to be large, therefore, any chosen prediction algorithm, should consider this aspect.
Random Forests (RF), the methodology used for the current research, handles well low volumes of data, performing optimal when a larger number of features are available.
RF represents an ensemble learning method [60]. That involves generating many classifiers/regressors, subsequently aggregating their results by using a boosting [61] or a bagging approach [62]. Boosting involves giving extra weight to points that were incorrectly predicted by earlier predictors, while for bagging, successive trees do not depend on earlier trees, being completely independent. The random forest approach offers an estimator that fits a number of decision trees on different subsamples of the dataset and uses averaging to control overfitting and to improve the predictive accuracy.
During the past six years, the random forest usage has steadily increased. By using Google Trends service, it was possible to collect historic data that was used to visualize the search interest for five prediction related terms: (a) linear regression-one of the most used prediction algorithms, (b) random forest, (c) XGBoost-a regularizing gradient boosting algorithm enforced by a strong community of data scientists [63], (d) CatBoost-one of the newest gradient boosting prediction algorithms [64], (e) prediction model-a generic term used to describe any of the previous mentioned algorithms ( Figure 3).  As it can be observed in Figure 3, the search interest value for the 'linear regression' term is the highest (which is to be expected due to its age and worldwide popularity). However, both 'random forest' and 'xgboost' terms presents a consistent ascending trend in search popularity. On the other hand, from all the above algorithms, CatBoost is the newest one (released in 2017), hence it is not yet as popular as its counterparts.
In terms of comparison, it was possible, by using the collected data through the Google Trends service, to highlight country level differences for the above search terms. Thus, considering for each country all the above terms, their sum of search percentage value would be 100. For a visualization purpose, the search percentage value of each term for every country allowed several classes definition as follows: (a) six classes for linear regression term-class A with search percentages below 40%, class B with values between 40 and 50%, class C between 50-60%, class D between 60 and 70%, class E between 70 and 80%, class F above 80%, (b) four classes for random forest term-class A with search percentages below 10%, class B with values between 10 and 20%, class C between 20-30%, class D above 40%, (c) four classes for XGBoost term-class A with search percentages below 5%, class B with values between 5 and 10%, class C between 10-15%, class D above 15%, (d) four classes for CatBoost term-class A with search percentages below 2%, class B with values between 2 and 3%, class C between 3%-4%, class D above 4%. For the linear regression term, the number of classes needed to be higher because the percentage values (of each country) were spread out on a larger interval.
Thus, clear differences related to the search interest in the above-mentioned search terms can be observed from the Figures A1-A4 between countries. If referring to linear regression term it can be noticed that the highest search interest percentage can be found in the USA, Canada, Australia, Norway or Finland and the lowest values in countries such as Russia, China, France or Spain. Switching to the remaining terms involving non-linear and more recent methodologies, it can be noticed that Europe and Russia display higher search interest percentage values. As it can be observed in Figure 3, the search interest value for the 'linear regression' term is the highest (which is to be expected due to its age and worldwide popularity). However, both 'random forest' and 'xgboost' terms presents a consistent ascending trend in search popularity. On the other hand, from all the above algorithms, CatBoost is the newest one (released in 2017), hence it is not yet as popular as its counterparts.
In terms of comparison, it was possible, by using the collected data through the Google Trends service, to highlight country level differences for the above search terms. Thus, considering for each country all the above terms, their sum of search percentage value would be 100. For a visualization purpose, the search percentage value of each term for every country allowed several classes definition as follows: (a) six classes for linear regression term-class A with search percentages below 40%, class B with values between 40% and 50%, class C between 50-60%, class D between 60% and 70%, class E between 70% and 80%, class F above 80%, (b) four classes for random forest term-class A with search percentages below 10%, class B with values between 10% and 20%, class C between 20-30%, class D above 40%, (c) four classes for XGBoost term-class A with search percentages below 5%, class B with values between 5% and 10%, class C between 10-15%, class D above 15%, (d) four classes for CatBoost term-class A with search percentages below 2%, class B with values between 2% and 3%, class C between 3-4%, class D above 4%. For the linear regression term, the number of classes needed to be higher because the percentage values (of each country) were spread out on a larger interval.
Thus, clear differences related to the search interest in the above-mentioned search terms can be observed from the Figures A1-A4 between countries. If referring to linear regression term it can be noticed that the highest search interest percentage can be found in the USA, Canada, Australia, Norway or Finland and the lowest values in countries such as Russia, China, France or Spain. Switching to the remaining terms involving non-linear and more recent methodologies, it can be noticed that Europe and Russia display higher search interest percentage values.
Several papers [65][66][67][68] described the advantages of using random forest: (a) data normalization is not required, (b) reduced overfitting results, (c) works well with both continuous and categorical variables, (d) works well with a reduced number of samples; (e) data distribution is not relevant.
The RF predictions could be improved (if required) by adjusting several hyperparameters: the maximum depth of the trees), the number of trees in the forest of the model, the minimum number of samples required to be at a leaf node or the minimum number of samples required to split an internal leaf node.
In term of model validation, the prediction methodology used both the root mean square error (RMSE) and the R-square metrics for assessing model's performance. All RF models were fitted using 80% of the data as training dataset. Thus, 20% of the data remained unseen to the models and could be used further on in the testing phase.
In addition to determining all reliable prediction models, another important aspect of the current research was to estimate the independent variables' importance when predicting the dependent one. For linear regression model, it is quite easy to assess the predictors feature importance. For example, after performing feature scaling that would bring all values into the range (0,1), parameter coefficients could be compared to see which one has a higher impact for the overall prediction. Furthermore, for a linear model, the p-value metric could be used to determine the predictors' importance in describing dependent parameter variance, being able to choose this way which one remains in the model. The random forest methodology offers also the possibility of assessing the importance of each predictor in predicting the dependent variable. For this, the 'eli5' Python library was used.
This library computes the parameter feature importance by measuring how the overall score decreases when a feature is not available. Thus, it is possible to rank the predictor variables in terms of relative significance, by calculating the decrease in node impurity weighted by the probability of reaching that node, probability that is calculated by the number of samples that reach the node, divided by the total number of samples [67,[69][70][71]. When the value is high, the feature is more important.
where: n ij represents the importance of node j; C j = the impurity value of node j; left( j ) = child node from left split on node j; right( j ) = child node from right split on node j; w j = weighted number of samples reaching node j.
The importance for each feature on a decision tree is calculated according to the following formula: where: ni j = the importance of node j; fi i is the importance of feature i. The values obtained from the above equation should be normalized (normfi j ) to values between 0 and 1 by dividing by the sum of all feature importance values: The feature importance is the average over all the trees. The sum of the feature's importance value on each tree is calculated and divided by the total number of trees: where: T = total number of trees; normfi ij = normalized feature importance for i in tree j; RFfi i = the feature i importance, calculated from all the trees in the RF model.

Heavy Metals Data Analysis
Heavy metal concentration in the water samples collected from the Danube Delta and the Black Sea registered the following descendent trend: Fe > Cu > Zn > Pb > Cd, whereas in the Danube River the trend was Fe > Zn > Cu > Pb > Cd ( Table 1). The highest values were recorded in the analyzed Danube River sector, followed by Black Sea Coast and Danube Delta (Table 1). Several papers [24,72] emphasize the high exposure of the Danube River to pollution since it passes through many industrial zones and receives urban and industrial wastes, as well as agriculture leachate. Therefore, this can explain the high concentrations of heavy metals recorded in the Danube River sampling sector. In addition, the findings emphasize the capacity of Danube Delta to act as a buffer zone for pollutants transported by the Danube.
According to other studies [4,73], the Danube Delta is the largest wetland of the complex system pertaining to the Danube River and forms a complex network of river channels, bays, lakes, sandy banks and numerous marshes that contribute to improving the quality of the Black Sea water, acting as a buffer zone for pollutants in the Danube. Therefore, the analysis of heavy metals in samples collected from the studied aquatic macro-system confirms the existing connections between its components. Fe registers the highest concentration among all analyzed metals, situation which correlates with both technological processes applied in the Galat , i steel factory and Tulcea alumina refinery (Table 1). In addition, Zn records considerable high concentrations in all the sampling area, which can be due to shipyard activity, namely galvanization processes. The natural presence of metals in the lower sector of Danube River, generated by the bedrock erosion should be taken into consideration also. Depth erosion and vertical erosion is manifested especially in the upper sector of the river due to the presence of stones and gravel. In the middle sector, the riverbed is represented by grit and gravel and horizontal erosion is manifested. The lower sector of the river is characterized by sand and very fine gravel and horizontal erosion occurs also. In this sector, the river substrate is sand, frequently interspersed with mud, organic sludge, silt, loam and clay. In small percentages gravel is present in fine to medium size.
According to [74] the erosion in this area is moderate. The grain size along the Sf. Gheorghe branch has been previously characterized as fine to median sands and silts [75]. The authors concluded in their study that the highest levels of heavy metals in the Danube River sediments, between Bazias and flowing mouth in the Black Sea, are associated in the areas where mining activities are conducted (km 1049-km 995). As well, cluster and PCA analysis highlighted the fact that the concentrations of Cd, Cu, Pb and Zn in the branches of Danube River (Chilia, Sulina and Sf. Gheorghe) are influenced by anthropogenic activities, whereas Cr and Ni concentrations are influenced mainly by the presence of rocks naturally rich in these metals in the Romanian sector of Danube River [76].
The median iron concentration in rivers has been reported to be 700 µg/L [77]. However, the concentrations reported within the present study are significantly much higher.
Hence, it can be stated that the bedrock of Danube River in the lower sector generates moderate concentrations of heavy metals and that abnormal high values are related to human activities. Similar results regarding Fe concentration in Danube River related to anthropogenic activities was previously reported in [54].
The data related to heavy metal concentration in water matrix are not considered proper to be included in the analytical framework prediction models since they could be highly variable among the Danube River basin during a short period of time, depending on a series of environmental factors such as water flow, water temperature, water level, occasional waste discharges.
Therefore, the data presented in Table 1 has the purpose of characterizing the environmental background of the analytical framework.
However, use of fish as bioindicators elaborating an analytical framework based on prediction models is more reliable since fish accumulate the metal concentrations during a longer time-period and due to their mobility, they can characterize a larger surface area.
The accumulation trend of heavy metals in the muscle tissue registered the following descending order: Fe > Zn > Cu > Cd > Pb in fish from Danube River, Danube Delta, respectively, and Fe > Zn > Cu > Pb > Cd in fish from the Black Sea (Table 2). In the case of the hepatic tissue, the accumulation tendency of heavy metals caught in the Danube River registered the same trend as for the muscle tissue (Table 2). Regarding the fish caught from the Danube Delta and the Black Sea, the accumulation trend of metals in the hepatic tissue was Fe > Zn > Cu > Pb > Cd, Fe > Zn > Cu > Cd > Pb (Table 2). Fe was the most abundant metal in both muscle and hepatic tissue. Nevertheless, the values registered for this element in the liver were at least 10 times higher compared to the values registered in the muscle. This fact is expected due to the liver's implication in the synthesis of red blood cells [41]. The least abundant metals detected in the muscle and hepatic tissues were Cd and Pb. However, these elements can induce toxic effects on fish even in trace amounts. To avoid human intoxication with heavy metals due to fish consumption, the European legislation [78] sets a maximum permitted level of Cd (0.05 µg/g) and Pb (0.3 µg/g) concentration in fish muscle. In the entire studied macro-system, 25 of the total analyzed fish specimens registered values of Cd concentration in the muscle tissue above the maximum permitted level. In natural aquatic ecosystems, metals are not equally distributed, instead they accumulate in certain "hot spots", influenced by hydraulics and water physico-chemical parameters. Thus, continuous monitoring of the water quality is needed to identify possible polluted sites.
By analyzing the differences between fish muscle tissue and fish hepatic tissue (Table 3), it can be observed that significant differences (p < 0.05) are recorded between DR and DD for heavy metal concentrations considered within this study. In addition, by comparing DR versus BS, it can be observed that Cd is the only metal which registered a statistically significant difference (p < 0.05). Furthermore, by comparing BS and DD, it was observed that Pb, Fe, Zn and Cu concentrations registered statistically significant differences (p < 0.05) ( Table 3). In terms of heavy metals from the hepatic matrix, statistically significant differences (p < 0.05) were recorded for Pb, Cu and Fe between all the analyzed areas and, for Cd between DR and BS, respectively, DD and Zn between DD and DR, respectively, BS (Table 3).

Enzymatic and Non-Enzymatic Biomarkers Determination
All OS biomarkers register higher values in fish hepatic tissue matrix, compared to muscle tissue, except MDA in DD sampling area (Table 4). However, CAT, SOD and GPx registered the highest values in fish muscle tissues at DR, followed by DD and BS coast, while MDA records maximum values at DD, followed by BS coast and DR (Table 4). According to other studies [79] it can be stated that increased MDA levels are correlated to decreased SOD activity, under the context of heavy metals exposure. The SOD and MDA are the most sensitive OS biomarkers to metal exposure, followed by CAT and GPx. The hepatic matrix is first affected by pollution; thus, it can be used as a main indicator of environmental pollution occurrence. The highest values of SOD and GPx in hepatic matrix are recorded at DR, followed by DD and BS, while CAT registers the highest values in DR, followed by BS and DD (Table 4). This can be due to the fact that in the DR water, the highest values in case of the studied metals were registered, compared to DD and BS. The MDA hepatic values were in accordance with Cd values (Table 2) and revealed the highest concentrations in DD and DR (Table 4).  Significant differences (p < 0.05) are recorded between DD and DR, respectively, BS, when analyzing both fish muscle and fish hepatic tissue matrix in terms of all OS biomarkers considered in the present study (Table 5). However, not significant differences were recorded between DR and BS (p > 0.05) in terms of CAT, SOD, GPx and MDA, in both analyzed sample matrix (Table 5).

Pollution Index in the Analysed Area
The pollution index was calculated in order to better establish the experimental backgrounds for each study area. However, in order to calculate the pollution index, based on Romanian Regulation Order no. 161/2006 (Table 6), the study area was associated to a certain water quality class. According to the Romanian Regulation Order no. 161/2006, the quality class I is the best ranking, while quality class V is the worse. Thus, in terms of Cd concentration in water, both BS and DD are associated to I quality class, while DR is within the second class. Considering Pb, Zn and Cu, it can be stated that all analyzed area are associated to first water quality class. However, the Fe concentration places all analyzed areas in the fifth water quality class.  The pollution index points out a seriously affected aquatic environment by the presence of Fe in the entire macro-system Danube River-Danube Delta-Black Sea ( Table 7). As well, Fe concentration in the water column is highly above the concentration specific to the quality class V for surface waters, mentioned in the Order 161/2006 by the Romanian Government (Table 6). In case of Zn accumulation in the water, a strongly affected aquatic environment was identified in the Danube River ( Table 7). The presence of Fe and Zn in high concentrations is due to the intense naval activities conducted in the Lower sector of the river. The values for the pollution index registered for the rest of the analyzed metals manifest no effect on the entire studied macro-system.

The Correlation Matrix
According to previous studies [3], the correlation matrix can be used as a tool to summarize the linear relationships existent in a dataset, as well as for identifying strong and relevant relationships that could be further modelled. In present study, all data related to heavy metals and OS biomarkers in both muscle tissue and hepatic tissue matrix were processed using Python NumPy library. According to other studies [80], weak positive and negative correlations would be considered in the range of 0.1 ÷ 0.3/−0.1 ÷ −0.3, moderate positive and negative correlation would range within 0.3 ÷ 0.5/0.3 ÷ 0.5, and strong positive and negative correlation between 0.5 ÷ 1.0/0.5 ÷ 1.0. However other studies which performed matrix correlation for heavy metals in different fish tissues considered as strong positive/negative all linear correlation with coefficient values over +0.7/−0.7. In addition, the correlation matrix is useful in analytical framework elaboration since in order to generate the prediction models, independent variables are supposed to be independent. According to similar studies [3], if the degree of correlation between variables is high enough, problems can arise when fitting the model and interpreting the results.
Positive significant correlations at BS are recorded within all OS biomarkers from both hepatic and muscle tissues (between 0.8-1). In addition, related heavy metals, Fe in muscle tissues is positively correlated with Cu (0.6) and Cd (0.6) in muscle, Fe (0.6), Cu (0.6) and Cd (0.7) in hepatic tissues. The Zn in muscle is positively correlated with Cd in both muscle and hepatic tissues (0.6) and negatively correlated with SOD (−0.6) in muscle and both CAT (−0.7) and GPx (−0.6) in hepatic tissue. In addition, Cd in muscle was found to be positively correlated with Cu (0.9) and Cd (0.9) in hepatic tissues, while Cu in muscle is strongly correlated with Cu (0.9) and Pb (0.7) in hepatic tissues.
In case of DD, positive significant correlations are recorded within all OS biomarkers from both hepatic and muscle tissues (between 0.7-0.9). In addition, related heavy metals concentration, the only strong positive correlation is observed between Zn in muscle and hepatic tissues (0.8). However, Pb in hepatic tissues is strongly correlated with CAT (0.7), SOD (0.8), GPx (0.7) and MDA (0.7) in hepatic tissues.
The correlation matrix revealed positive correlations between Cu in muscle and Cu (0.9), Pb (0.7) and Cd (0.9) in hepatic tissues, as well as Cd in muscle tissues (0.9). Some studies have emphasized that Cu enters the fish body mainly via food [81,82]. At the level of the intestine epithelium, Cu is transferred in the blood stream and then transported to the liver, due to its implication in organism detoxification. The presence of Cu in the liver is also related to the natural protein binding such as metallothioneins [83]. Even more, besides its storage purpose, the liver acts as a redistributor of metals to other organs, such as the muscle [83]. This phenomenon could stand as a possible explanation for the positive correlation registered between the Cu concentration in fish liver and fish muscle. This fact was confirmed in [84], where the authors highlighted a positive correlation in the concentration of Pb between the liver and muscle tissues. As well, the positive correlation between Cu and Cd in fish tissue (gonads) was previously observed in [85] in Alosa braschinkowi. Nevertheless, the correlation between metals in the fish body is highly dependent on the physiological processes within the organism, the presence of other metals and interactions between them [3].
In addition, Cd in muscle tissues is positively correlated with Cu (0.9), Pb (0.6) and Cd (0.9) in hepatic tissues, while Cd in hepatic tissue correlates positively with Cu (0.9) and Pb (0.6) determined from the same hepatic tissues. However, the Fe in hepatic tissue is the only elements which strongly correlate, positively, with OS biomarkers, more precisely, with MDA from hepatic tissues (0.8).

The Multi Linear Regression (MLR) Models
In order to generate predictive models between heavy metals and OS biomarkers from both hepatic and muscle tissues matrix, MLR was used. Therefore, the resulted equations are presented in Table A1. However, from a total of 29 MLR models, only 2 models record good performance. Both models help predicting the Fe in hepatic tissues, at BS and DR area, and explain 88.00%, respectively, 88.80% of the variance of this parameter. Therefore, the most important independent variable for the prediction of Fe concentration in hepatic fish tissues at both BS and DR are MDA, followed by CAT and SOD from hepatic tissue (Table A1).
It can be observed that MLR methods do not offer a substantial feed-back in order to elaborate an analytical framework aimed to predict the heavy metals concentration in fish tissues, based on OS biomarkers. In the literature, MLR were found useful for predicting micro-elements, based on macro-elements in fish tissues [3]. In addition, other authors manage to apply MLR for predicting elements in soil and crops [86], different water bodies [87] or different aquatic animals as mollusks [88].
Therefore, since MLR did not perform as expected, in order to accomplish the aim of this present study, RF based models were applied.

Non-Linear Models, Based on Random Forest (RF) Algorithm
According to other studies [3], non-linear models are tested for determining possible models among the parameters that did not register linear correlation between each other. Therefore, in the present study, after applying the RF technique on all the dataset, a number of 30 RF models were identified (Table 8), 10 models for each of the studied region (BS coast, DD and DR), all with high prediction accuracy ( Figures A5-A34). The weight of all independent variables was identified after running the feature importance algorithm in order to determine the importance of a specific OS biomarker parameter in determining the heavy metal dependent variable value ( Table 8).
The RF models for BS coast revealed that the prediction of Cd concentration in hepatic tissues is mostly based on MDA and CAT in hepatic tissues, while Cd in muscle tissues is influenced by GPX in both muscle and hepatic tissues and MDA in hepatic tissues. The prediction of Pb in hepatic tissues is significantly depended by MDA and GPx from hepatic tissues and SOD from muscle tissues, respectively, while Pb prediction in muscle tissues is mostly dependent by GPX in both hepatic and muscle tissues. The prediction of Fe in hepatic tissues depends on MDA and GPx from hepatic tissue, while Fe in muscle tissue is mostly dependent on GPx in hepatic tissues and CAT in muscle tissues. For Zn prediction in hepatic tissue, SOD hepatic and MDA muscle and hepatic are most significant independent parameters that must be considered, while for Zn prediction in muscle tissues, CAT hepatic and both GPX, hepatic and muscle, must be considered as most important. The Cu prediction in both muscle and hepatic tissues emphasizes MDA hepatic and SOD muscle, respectively, CAT muscle as most important parameters.
The analysis of RF models registered based on DD database revealed that Cd concentration in hepatic tissues is mostly based on CAT and SOD in muscle tissues, while Cd in muscle tissues is influenced by CAT and GPX in muscle, respectively, MDA in hepatic tissues. The prediction of Pb in hepatic tissues is significantly depended by MDA and CAT from hepatic tissues, while Pb prediction in muscle tissues is mostly dependent by GPX and MDA in muscle tissues. The prediction of Fe in hepatic tissues depends on CAT and GPx from muscle tissue, while Fe in muscle tissue is mostly dependent on SOD and GPx in hepatic tissues. For Zn prediction in hepatic tissue, CAT and SOD in muscle are most significant independent parameters that must be considered, while for Zn prediction in muscle tissues, both GPX and CAT in hepatic tissues must be considered as most important. The Cu prediction in both muscle and hepatic tissues emphasizes MDA and CAT in hepatic tissues, respectively, CAT in muscle and GPX in hepatic tissues as most important parameters. Table 8. Random forest models for predicting heavy metals in both muscle and hepatic tissues.

Dependent
Variable Predictors and Weights RMSE R-sq.

BS Coast
Zn muscle MDA hepatic (0.08); MDA muscle (0.05); CAT hepatic (0.31); CAT muscle (0.06); SOD hepatic (0.10); SOD muscle (0.10); GPx hepatic (0.12); GPx muscle (0.13). 0.14 84.00%   The analyzed database from DR region reveals that Cd concentration in hepatic tissues is mostly based on MDA in muscle tissues, followed by GPx and SOD in hepatic tissues, while Cd in muscle tissues is influenced by MDA in both analyzed tissues. The prediction of Pb in hepatic tissues is significantly depended by MDA and GPx from muscle tissues, while Pb prediction in muscle tissues is mostly dependent by GPX in muscle tissues and CAT in hepatic tissues. The prediction of Fe in hepatic tissues depends on MDA from both analyzed tissues, while Fe in muscle tissue is mostly dependent MDA and CAT in hepatic tissues. For both Zn predictions in hepatic and muscle tissues, the MDA in both in muscle and hepatic tissues are most significant independent parameters that must be considered. The Cu prediction in hepatic tissues emphasizes MDA, followed by SOD, CAT and GPx in hepatic tissues as most important parameters, while Cu in muscle can be mostly predicted by considering MDA in both analyzed tissues.
Random forest represents a good alternative to linear regression when the observed phenomenon is not linear. It produces reliable results, works well on both small and large datasets and it also creates estimates in case of missing data. Still, one of the major challenges comes from the fact that, at times, random forests are not able to extrapolate well over previously unseen data. The value in the leaves is calculated as an average of n observations. This approach improves the algorithm's accuracy, reduce the over-fitting, but it can lead to the extrapolation issue, where the predicted values (for previously unseen data) will not be outside the training set values for the target variable. This issue can be minimized by increasing the number of samples in the training dataset, allowing the algorithm to train on a wider data interval. For the current research context, as the number of samples in the dataset is not high, the extrapolation problem could lower the accuracy of the RF models. However, by adding new data to the dataset, after performing new analysis, would mitigate this issue. The Danube River Basin Management Plan (DRBMP) targets the monitoring of the following metals: silver, bismuth, arsenic, cadmium, cobalt, chromium, copper, manganese, nickel, lead, rubidium, antimony, selenium, tin, thallium, uranium, vanadium, tungsten and zinc. The present study can support DRBMP through the effective prediction of metals in the muscle tissue, based on the prediction metrics performance. Therefore, 3 categories of heavy metals can be identified as it follows: 1st category (Rsq between 70.00-80.00%) includes only Cu, 2nd category (R-sq between 80.00-90.00%) includes Zn and Pb, 3rd category (R-sq between 90.00-100.00%) includes Cd and Zn. Therefore, in order to quantify the importance of OS biomarkers for predicting the heavy metals concentration in both fish muscle and hepatic tissues, weight level of each predictor was evaluated and presented in Table 9.  The analysis revealed in Table 9 emphasizes that MDA in hepatic tissue is the most important independent variable for predicting heavy metals in fish muscle and tissues at BS coast, followed by GPx in both hepatic and muscle tissues. The RF analytical framework revealed that CAT in muscle tissue, respectively, MDA and GPx in hepatic tissues are most common predictors for determining the heavy metals concentration in both muscle and hepatic tissues, in DD area (Table 9). For DR, MDA in muscle, followed by MDA in hepatic tissue are the main predictors in RF analysis ( Table 9).
The presence of environmental pollutants can generate increased OS in fish and the main enzymes that mitigate the negative effects of this process are SOD, CAT and GPx. The SOD triggers ROS destruction by catalyzing the dismutation of O 2 − into H 2 O 2 , which CAT and GPx further reduces into H 2 O and O 2 [21]. The GPx is the most important peroxidase and holds the essential role for erythrocyte protection against H 2 O 2 . The antioxidant defense system can be more intense in the liver, due to the occurrence of several oxidative reactions and ROS generation in this organ. Ratn et al. [89] revealed in their study a positive correlation between SOD and CAT activity in the liver of Channa punctatus and Zn bioaccumulation in the same organ. According to the pollution index calculated in this study, the concentration of Zn strongly affects the aquatic ecosystems of Danube River and, thus, induced OS to the resident biota. Table 10 presents centralized data related to OS biomarkers in different fish species and heavy metals in water/liver samples, according to different studies. Mohanty with co-authors [21] reported in their study a more intense activity of SOD and CAT in the muscle, specifically hepatic (SOD between 2-3 U/mg protein, respectively, 12 U/mg protein; CAT between 100-150 nkat mg protein, respectively, between 10,000-15,000 nkat/mg protein) tissues of Notopterus notopterus specimen collected from a polluted site of the Mahanadi river, in India, compared to a specimen collected from a non-polluted site. This phenomenon is also confirmed in the present study through higher enzymatic activity in the tissues of fish collected from Danube River, which is a seriously affected aquatic environment, according to the pollution index. In addition, [35] pointed out in their study that the exposure of the common carp to Cd significantly increased the levels of MDA in the liver tissue (Table 10). Hajirezaee et al. [38] registered increased MDA level in the plasma collected from Mugil cephalus individuals exposed to Pb compared to the ones in the control group. This is similar to our results, since the sampled tissues from fish collected from Danube Delta registered the highest MDA levels and Cd concentration in the muscle of these fish was above the maximum permitted level by the EU regulation. The MDA represents the end-product of lipid peroxidation degradation and high levels indicate oxidative cell damage after toxicant exposure. In the study by [33], Cyprinus carpio individuals were exposed, for one week, to a mixture of metal pollutants (Cu: 4.8 µg/L; Cd: 2.9 µg/L and Zn: 206.8 µg/L). The authors registered no sign of OS in the liver and the gill tissues during that period. However, the decreased enzymatic activity of CAT and GPx under long-term metal exposure can lead to the manifestation of OS in the fish organism. It is difficult to define a narrow range for OS biomarkers in fish from the natural environment, because the concentrations of heavy metals are dynamic and the organism adapts its metabolism accordingly, by either increasing or decreasing the involved enzymes and antioxidants.
It is well known that aquatic organisms can be used as sentinels for an assessment of environmental contamination assessment by toxic chemicals, especially through the importance of ROS in physiological processes and toxicity mechanisms [79]. It is well known that heavy metals will accumulate in the fish liver tissue, compared to the muscle, due to the liver involvement in organism detoxification [4]. Increased metal levels in the liver generate a decreased SOD activity and increased lipid peroxidation [72]. Fish inhabiting water environments rich in Cd show increased generation of ROS, because Cd competes with essential metals for protein-binding sites, and thus, triggers the release of Fe 2+ and Cu 2+ ions [91]. In addition, in their study, [92] highlighted that oxidative stress caused by cadmium lead to cell death in trout hepatocytes. One mechanism of cellular injury in animals and indicator of oxidative stress in cells is lipid peroxidation [79]. An indicator of lipid peroxidation is malondialdehyde (MDA) and increased levels of MDA are associated to a variety of chronic diseases in vertebrates [79]. In their study, [79] highlighted a positive correlation (R 2 f = 0.966 and 0.987) between Cd and Pb concentrations in fish plasma and muscle, and the MDA levels in the same matrix. In addition, the author noted that increased MDA levels are correlated to decreased SOD activity, within the context of heavy metal exposure. [93] registered in their study elevated MDA levels in the liver of catfish exposed to Zn, Cu, Pb, Cd and As, which is attributed to the livers affinity to accumulate the aforementioned metals. Thus, metals are catalyzers in ROS formation. Further, high levels of SOD in the liver are a response to oxidative stress caused by the presence of heavy metals, which generated the production of superoxide anions which activated SOD to convert the superoxide radical to H 2 O 2 [93]. In case of CAT, a decreased activity compared to the rest of the antioxidant enzymes is expected due to inhibition by the wave of superoxide radicals [94][95][96]. The decreased CAT activity may be due to the flux of superoxide radicals, which have been shown to inhibit CAT activity [93]. Li with co-authors [97] demonstrated in their study that lipid peroxidation is enhanced in animals exposed to Cd compared to animals exposed to Zn.
Similar to most of the studies, the design of the current study is subject to limitations. Therefore, the analytical framework from present study was entirely based on fish muscle and hepatic tissues heavy metals and OS biomarkers. However, in future studies, water quality parameters data are recommended to be integrated in order to create a more precise and complex analytical framework, since it is well known that water hardness, alkalinity and pH influence the uptake process of potentially toxic metals by fish [55] and metals as Cd and Pb are bioavailable for fish to absorb if the aforementioned factors are low. Data from a wider study area should be collected, for several scenarios, in order for testing different approaches requiring larger data volumes such as neural network/deep learning or support vector regression. Furthermore, the ensemble modelling techniques function similar to a black box, meaning it can provide excellent predictive models, but difficult to interpret in terms of existing relation among predictors or between predictors and the dependent variable. Thus, more interpretable models could be used, such as for example the General Additive Models (GAM) which are generalized linear models, having the response variable depending on various smooth functions applied to predictors [98].

Conclusions
The present study points out the pollution of DR with Fe and Zn, due to intensive naval activities conducted in the lower sector of the river. In DD, Cd is present in the fish muscles in concentrations above the maximum level permitted by the EU regulations, due to the intensive agricultural activities conducted in the area.
The resident fish from DR manifest OS, through the highest activity of biomarkers CAT, SOD and GPx in muscle and liver tissues, compared to fish from DD and BS. The fish from DD, registered the highest values for MDA level, compared to the rest of the studied aquatic ecosystems.
The use of machine learning, through random forest algorithms, for the prediction of Cd, Pb, Zn, Fe, Cu in fish muscle and liver has proven to be an effective, fast and costeffective tool for monitoring pollution in DR, DD and BS.
This study revealed, for each analyzed region, the most important parameters used for the heavy metals' prediction. Thus, for the Black Sea coast, the MDA in hepatic tissue was identified as the most important independent variable for predicting heavy metals in fish muscle and tissues, followed by GPx in both hepatic and muscle tissues. For the Danube Delta, the random forest models revealed that CAT in muscle tissue, respectively, MDA and GPx in hepatic tissues were the most important predictors and for the remaining Danube regions, the MDA in muscle, followed by the MDA in hepatic tissue were the main predictors for the RF analysis.
Using the biomarker MDA from fish tissue to predict heavy metal concentrations in fish muscle can replace time-consuming and expensive field and laboratory analysis. In addition, it can help detect areas susceptible to pollution and it can identify contaminated fish with heavy metals in the Danube River Basin.