Artificial Intelligence Meets Marine Ecotoxicology: Applying Deep Learning to Bio-Optical Data from Marine Diatoms Exposed to Legacy and Emerging Contaminants

Simple Summary

Our work is motivated by the increasing production of chemicals with harmful effects on aquatic ecosystems. We show that it is possible to detect and distinguish the presence of several different emerging contaminants using the photochemical responses of a microalgal species belonging to the most abundant phytoplankton group in the oceans. We apply several machine learning and deep learning models to chlorophyll fluorescence induction curves, composed of fluorescence values taken at different time steps during the microalgae exposure trials, achieving up to 97.65% accuracy when predicting the type of contaminant, and up to 100% in several cases when predicting the exposure concentration. Our results show that combining these models with the fluorescence induction curves creates a powerful tool for ecotoxicity assessment, capable of classifying model organisms by their contaminant exposure, both in terms of type and concentration, opening new doors for toxicophenomics developments.

Abstract

Over recent decades, the world has experienced the adverse consequences of the uncontrolled development of multiple human activities. In recent years, a substantial share of total chemical production has consisted of environmentally harmful compounds with significant environmental impacts. These emerging contaminants (ECs) include a wide range of man-made chemicals of worldwide use, such as pesticides, cosmetics, personal and household care products, and pharmaceuticals. Among these, several ECs raise concerns regarding their ecotoxicological effects and how to assess them efficiently. This is of particular interest if marine diatoms are considered as potential target species, due to their widespread distribution, their being the most abundant phytoplankton group in the oceans, and their key ecological roles.
Bio-optical ecotoxicity methods appear as reliable, fast, high-throughput screening (HTS) techniques, providing large datasets with biological relevance regarding the mode of action of these ECs in phototrophic organisms such as diatoms. However, from the large datasets produced, only a small fraction of the data is normally extracted for physiological evaluation, leaving out a large amount of information on EC exposure. In the present paper, we use all the available information and evaluate the application of several machine learning and deep learning algorithms to predict the exposure of a model marine diatom (Phaeodactylum tricornutum) to different ECs at different doses. The results show that 2D convolutional neural networks are the best method for predicting the type of EC to which the cultures were exposed, achieving a median accuracy of 97.65%, while Rocket is the best at predicting the concentration to which the cultures were subjected, achieving a median accuracy of 100%.


Introduction
In recent decades, there has been an evident increase in uncontrolled human activity [1]. The relationship between increased human population density and environmental changes in coastal regions is well known: coastal waters are the ultimate sink of sewage and other by-products of human activities. The EU's Task Group for the implementation of the Marine Strategy Framework Directive (MSFD) recommended that monitoring programs covering chemical contaminant concentrations should also integrate biological measurements of the effects of the contaminants on marine organisms [2]. The combination of conventional and newer, effect-based methodologies with the assessment of environmental contaminant concentrations provides a powerful and comprehensive approach [2]. It is also striking how rapidly the pace of chemical discovery is growing: the Chemical Abstracts Service (CAS REGISTRY) reported the registration of the 60 millionth chemical substance in May 2011, only two years after the 50 millionth in 2009, highlighting the continued acceleration of synthetic chemical innovation [3]. Among these, several contaminants have raised concerns due to their ecotoxicological effects and the difficulty of assessing them efficiently. For example, pesticides continue to be detected in surface and groundwater [4]; pharmaceuticals, concentrated in wastewaters discharged from households and medical facilities, are also present in coastal areas [5]; and personal care products are being detected in phytoplankton cells even at remote locations [4], highlighting their environmental persistence.
At the basis of most marine systems are phototrophs, which cycle solar energy and sequester carbon, fueling the trophic web. Any disturbance at this level has inevitable impacts on the whole marine ecosystem. The toxic effects of contaminants are known to have severe and specific influence on these organisms, especially by impairing their photosynthetic metabolism [6][7][8][9][10]. Having both biochemical and biophysical components allows the photosynthetic machinery to be probed remotely [11]. Pulse amplitude modulated (PAM) fluorometry emerges as a potential non-invasive high-throughput screening (HTS) tool [9,10]. These techniques use optical signatures as a proxy of phototroph physiology, allowing disturbances at the primary productivity level to be detected [12] and contaminant effects to be efficiently evaluated with a dose-related response [13], and they have been integrated into numerical indexes easily used by stakeholders [12,14]. Another key benefit of this approach is that measurements can be repeated over time without sacrificing the organism. Phytoplankton is the first compartment to be affected by contaminants. As small (0.2-200 µm), single or chain-forming cells suspended in water with very high surface:volume ratios, they respond quickly to suspended toxicants, with high uptake rates, and are used as pollution indicator species [15].
These facts are the cornerstone of the pioneering research field of toxicophenomics, which merges plant phenomics, measuring traits such as plant growth and performance using non-invasive technologies, with ecotoxicology, repurposing these phenotyping tools to address contaminant-induced stress in autotrophs. Optical techniques for discriminating groups of samples exposed to different degrees of contamination have been applied to marine phototrophs with a high degree of efficiency, both in ecotoxicological trials and under field conditions [6,8,16], and using communities as meta-organisms instead of single species [17]. The application of these bio-optical technologies produces large data packages of physiological interest, regarding the effects of a given compound on an autotrophic organism, and also a high volume of data points that can be used by machine learning techniques to produce classifiers of the degree of toxicity to which organisms are exposed [18]. Consequently, the increase in big data from HTS systems requires efficient data processing and analytical pipelines. These bio-optical data have been successfully used in the past as biomarkers of contaminant exposure with a high degree of efficiency, although using less than 10% of the information provided by these optical techniques. These optical biomarkers have been used from field to experimental trials, with a wide array of organisms from microalgae to higher plants [9,10,14]. Conventional data analysis pipelines typically involve computer vision tasks (e.g., wheat head counting using object detection), which are addressed through the development of signal processing or machine learning (ML) algorithms [19]. When exposed to anthropogenic contaminants, the typical chlorophyll a induction curves suffer alterations not only in their intensity values but also in their shape [9].
These curves comprise more than 400 data points corresponding to the chlorophyll a fluorescence signal of the organism, derived from its photochemical fitness, but only a few key points are effectively used for photochemical variable calculation, discarding a large number of data points that could potentially be used for high-throughput screening of the test organisms [9]. Contaminant exposure affects not only the specific time points used for photochemical variable calculation but also the remaining curve data points [9]. Such a large dataset is, therefore, overlooked, despite its potential to be analyzed by machine learning techniques in order to disclose, at fine-scale resolution, the changes provoked by contaminant exposure across all data points of the chlorophyll a induction curves, with a sensitivity that would escape manual, traditional human analysis.
In this context, the aim of this paper is to evaluate the performance of several machine learning and deep learning methods in predicting the exposure of model organisms to different emerging contaminants (ECs) at different exogenous concentrations, based on the complete set of fluorescence data as a potential set of biomarkers of contaminant exposure. This combination of machine learning with non-conventional biomarkers intends to disentangle the potential of whole fluorescence profiles as descriptors (biomarkers) of contaminant exposure, and also to address the power of machine learning in detecting slight and otherwise imperceptible features that can later be used for automatic classification of marine diatom cultures exposed to a wide array of contaminants in ecotoxicological trials. For this purpose, marine diatom cultures of the model ecotoxicological organism Phaeodactylum tricornutum were subjected to thirteen emerging contaminants at various exposure concentrations. The resulting chlorophyll fluorescence induction curves, treated as time series data along with pre-defined endpoints, were used to address two tasks: (i) identifying which of the thirteen contaminants considered was applied, and (ii) at which exposure concentration. To address both tasks, we used five different machine learning and deep learning algorithms:
• Random Forests: an ensemble algorithm that uses the majority vote of a set of decision trees to make predictions;
• XGBoost: a state-of-the-art ensemble algorithm that uses a set of decision trees optimized with gradient boosting in order to minimize errors;
• ROCKET: a recent state-of-the-art time series classification algorithm that uses random convolution kernels to increase the dimensionality of a dataset, coupled with a simple linear classifier (commonly ridge regression);
• Neural Networks: deep learning models based on the original multilayer perceptron. These networks comprise a large number and variety of layers and are trained by means of gradient descent. Whereas for regular artificial neural networks (ANNs) the feature engineering is done a priori, convolutional neural networks (CNNs) perform this step using convolutional and pooling layers. CNNs can process multi-dimensional data, handling 1D, 2D, or 3D inputs.
In this work, the 2D CNNs have the particularity of using as input the images of the plotted time series, instead of the numeric values.
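Of the five algorithms above, ROCKET is the least conventional, so a minimal sketch of its idea may help: convolve each series with many random kernels and keep two summary statistics per kernel, then feed the features to a ridge classifier. This is a simplified illustration only (real implementations, e.g. in sktime, also draw random dilations and paddings), and the toy data below merely stands in for the fluorescence curves:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)

def rocket_features(X, n_kernels=100):
    """ROCKET-style transform: convolve each series with random kernels and
    keep two summary statistics per kernel (the global maximum and the
    proportion of positive values, as in the original algorithm). In real
    use the kernels must be generated once and reused for train and test."""
    n_samples, _ = X.shape
    feats = np.empty((n_samples, 2 * n_kernels))
    for k in range(n_kernels):
        weights = rng.normal(size=rng.choice([7, 9, 11]))
        weights -= weights.mean()                    # zero-centred weights
        bias = rng.uniform(-1, 1)
        for i in range(n_samples):
            conv = np.convolve(X[i], weights, mode="valid") + bias
            feats[i, 2 * k] = conv.max()             # global max pooling
            feats[i, 2 * k + 1] = (conv > 0).mean()  # proportion of positives
    return feats

# Toy stand-in for the induction curves: 20 series of 458 points, 2 classes
X = rng.normal(size=(20, 458)).cumsum(axis=1)
y = np.repeat([0, 1], 10)
X[y == 1] += np.linspace(0, 3, 458)                  # make the classes differ

features = rocket_features(X)
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(features, y)
train_acc = clf.score(features, y)
```

The appeal of this design is that all the expensive work is a single untrained transform, leaving only a cheap linear model to fit.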

Ecotoxicological Assays
Phaeodactylum tricornutum Bohlin (Bacillariophyceae; strain IO 108-01, Instituto Português do Mar e da Atmosfera (IPMA)) axenic cell cultures (maintained under asexual reproduction conditions) were grown in f/2 medium [20] under constant aeration in a phytoclimatic chamber at 18 °C, programmed with a 14/10 h day/night photoperiod (RGB 1:1:1, maximum PAR 80 µmol photons m⁻² s⁻¹) and a sinusoidal function to mimic sunrise, sunset, and light intensity at noon, set to replicate a natural light environment [21]. Cultures were visually inspected periodically under the microscope to ensure their axenic state. Exposure trials were conducted according to the Organization for Economic Cooperation and Development (OECD) recommendations for algae assays (OECD, 2011), with minor adaptations, using the suggested initial cell density for microalgae cells of comparable dimensions (initial cell density = 2.7 × 10⁵ cells mL⁻¹). According to OECD guidelines, carbon was provided to cultures through aeration with ambient air. Exposure time was reduced to 48 h, since previous studies observed that after 72 h the cultures enter the stationary phase and thus exhibit ageing effects that can mask the exposure trial [21]. Exposure conditions in terms of nutrient composition, light, and temperature were kept as those for culture maintenance. Forty-eight hours after inoculation, cells were exposed to the target concentrations of the selected emerging contaminants (diclofenac, ibuprofen, propanol, fluoxetine, glyphosate, sodium dodecyl sulphate (SDS), triclosan, copper engineered nanoparticles (CuO), zinc engineered nanoparticles (ZnO), and titanium engineered nanoparticles (TiO)) and legacy contaminants (dissolved ionic copper (CuSO4), dissolved ionic zinc (ZnSO4), and dissolved ionic titanium (TiO2)). Exposure occurred for 48 h to ensure it covered the exponential growth phase [21][22][23].
Target concentrations were selected to cover a concentration gradient reflecting not only the environmental concentrations reported in the literature but also concentrations known to have significant biological effects on P. tricornutum. As observed in previous works [21][22][23], this is a fast-growing strain; hence the exposure period was reduced from 72 h to 48 h to avoid the cell ageing processes that could occur in the stationary phase beyond the 48 h time point. All manipulations were executed within a laminar flow hood, ensuring standard aseptic conditions. At the end of the exposure period, 1 mL of culture (per replicate) was used for bio-optical assessment via chlorophyll a pulse amplitude modulated (PAM) fluorometry (FluorPen FP100, Photon Systems Instruments, Brno, Czech Republic). Cell subsamples for bio-optical assessment were dark-acclimated for 15 min, and chlorophyll transient light curves (Kautsky plots) were generated using the preprogrammed OJIP protocol, according to [23].
Each sample corresponds to a chlorophyll fluorescence induction curve, containing the fluorescence values taken at different time steps from a model diatom used in ecotoxicology (Phaeodactylum tricornutum) exposed to 13 different emerging contaminants at different concentrations, following the international standards for ecotoxicological assays. Each curve is composed of 458 measurements, taken at intervals of 10 milliseconds. For each contaminant and concentration, 30 independent replicates were obtained, totaling 1950 samples. Table 1 details this dataset, including the identification of the contaminants and the concentrations used, and the distribution of samples between contaminants and concentrations. Further details on the techniques and protocols used for obtaining the data can be found in [6][7][8]16,24]. Cell density was evaluated in all experimental reactors at the end of the exposure trials, and the percentage of growth inhibition relative to the respective control condition was calculated, as well as the half-maximal inhibitory concentration (IC50), according to the OECD (2011) guidelines.
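The resulting data layout can be sketched as follows. The EC abbreviations and the uniform five concentrations per EC are illustrative assumptions (the real data has 3-5 concentrations per EC; see Table 1), and random numbers stand in for the actual measurements:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical EC labels (several are guesses; the paper's Table 1 has the
# real identifiers): 13 ECs x 5 concentrations x 30 replicates = 1950 curves.
contaminants = ["DIC", "IBU", "PROP", "FLX", "GLIPH", "SDS", "TRI",
                "Cu_np", "Zn_np", "Ti_np", "Cu_d", "Zn_d", "Ti_d"]

X_parts, y_ec, y_conc = [], [], []
for ec in contaminants:
    for conc_level in range(5):                # concentration index per EC
        curves = rng.normal(size=(30, 458))    # stand-in for 458-point curves
        X_parts.append(curves)
        y_ec += [ec] * 30                      # labels for task (i): which EC
        y_conc += [conc_level] * 30            # labels for task (ii): which dose
X = np.concatenate(X_parts)                    # one row per replicate curve
```

The same matrix X then serves both tasks, paired either with y_ec (one 13-class problem) or with y_conc (13 separate 3-5-class subproblems).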
The same dataset is used differently for the two tasks we address. For the task of identifying the contaminant, we consider 13 classes, each corresponding to one EC regardless of its concentration, therefore 120/150/180 samples per class, depending on the EC (see Table 1). Additionally, control test units (absence of tested compound) were also performed so that it was possible to disclose the effects of the EC and its concentrations in the physiology of the cultures under the same conditions as the test bioreactors. The task of identifying the EC concentration is subdivided into 13 subtasks, each dealing with only one EC. Each of these subtasks considers 3-5 classes, depending on the EC, where each class contains exactly 30 samples, each corresponding to a specific concentration (Table 1).
In order to disentangle with a higher resolution the effects of each EC at each dose, an extended analysis of the fluorescence transients was performed by calculating the difference in relative variable fluorescence towards the respective control units. Each transient was normalized between F0 (initial fluorescence) and FM (maximal fluorescence) as the relative variable fluorescence Vt = (Ft − F0)/(FM − F0), and the variation towards the control was represented as ∆V curves (expressed as V = f(t)), i.e., subtracting the normalized fluorescence values of the controls from those of the exposed samples:

∆Vt = Vt(sample) − Vt(control). (1)
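A minimal sketch of this computation, assuming F0 is the first point of the transient and FM its maximum:

```python
import numpy as np

def relative_variable_fluorescence(curve):
    """V_t = (F_t - F_0) / (F_M - F_0): normalises a transient between its
    initial (F_0) and maximal (F_M) fluorescence values."""
    f0, fm = curve[0], curve.max()
    return (curve - f0) / (fm - f0)

def delta_v(sample, control):
    """Delta-V curve (Equation (1)): point-by-point difference in relative
    variable fluorescence between an exposed culture and its control."""
    return (relative_variable_fluorescence(sample)
            - relative_variable_fluorescence(control))
```

By construction, V is 0 at the origin and 1 at the peak of each transient, so ∆V isolates shape changes rather than absolute intensity differences.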
Regarding the CNNs, we test three different approaches: 1D, where the input is the one-dimensional induction curves; 2D, where we use image representations of the induction curves; and 2D Log, like 2D but with a logarithmic time axis. Regarding the 2D CNNs, for the remainder of this work, they will be designated simply as CNNs, and the methodology used to transform the induction curves into images is described in more detail in [18].
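The conversion of a curve into an image for the 2D CNNs can be sketched as below. This is a simplified stand-in for the methodology of [18]: the figure size, line width, and 10 ms sampling step are assumptions for illustration, and `log_time` switches between the linear (CNN) and logarithmic (CNN Log) time axis:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # render off-screen, no display needed
import matplotlib.pyplot as plt

def curve_to_image(curve, log_time=False, size_px=64):
    """Render an induction curve as a small RGB array suitable as 2D CNN
    input. The rendering details here are illustrative only."""
    t = np.arange(1, len(curve) + 1) * 10.0   # assumed 10 ms sampling steps
    dpi = 100
    fig, ax = plt.subplots(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax.plot(np.log10(t) if log_time else t, curve, linewidth=0.5)
    ax.axis("off")                            # the CNN only needs the trace
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]  # drop alpha channel
    plt.close(fig)
    return img
```

Because the two axis scales stretch different parts of the OJIP transient, the same curve yields visibly different images under the two settings, which is precisely what distinguishes CNN from CNN Log.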

Hardware and Software
All experiments were performed on a Windows 10 machine with one NVIDIA RTX 2080 Ti GPU with 11 GB of memory. All models were implemented in Python 3.6 using the following packages: Scikit-learn (0.23.1), XGBoost (1.1.1), and TensorFlow (2.3) [30,31].

Architecture and Parameters
This section provides technical details regarding the implementation and parameters of the used methods and is intended to aid in the reproducibility of the experiments.
For both RF and XGB trees, we perform an independent grid search per run for the optimal values of their parameters. Regarding RF, we grid search over the following parameters: criterion {gini, entropy} and number of estimators {5, 10, …}.

For both ANNs and CNNs, we use a simple and shallow architecture. The ANNs (in essence, multilayer perceptrons (MLPs) with normalization and dropout) are composed of four batch normalization layers, six fully connected layers, and five dropout layers (for the dropout layers we performed several initial tests with different rates and, since there was no significant difference, decided to use a rate of 0.7). The 1D CNNs are composed of four batch normalization layers, three interspersed convolutional and max-pooling layers, and two fully connected layers. Lastly, the 2D CNNs are composed of five interspersed convolutional and max-pooling layers and three fully connected layers with two dropouts in between. These architectures, as well as the hyperparameters, can be seen in Figure 1. No form of grid search or Bayesian hyperparameter and topology optimization was performed for these models, all values being derived from a small set of manual tests. Regarding weight initialization, we use Xavier/Glorot initialization to avoid vanishing or exploding gradients, and every batch is normalized to have a mean of zero and a standard deviation of one (BatchNormalization layers) to speed up training and help improve generalization. For the optimizer, we use ADAM with a learning rate of 0.001.

Figure 1. Architectures of the different deep learning models used in this work. The first network corresponds to the ANN; it is composed of a set of BatchNormalization, Dense (fully connected), and Dropout layers. The second corresponds to the 1D CNN, composed of a set of BatchNormalization, 1D Convolution, and 1D MaxPooling layers, followed by a flatten operation that reduces the dimensionality of the data before passing it to the two dense layers that serve as a classifier. The last network corresponds to our 2D CNN, composed of 2D Convolution, 2D MaxPooling, a flatten operation, and both Dense and Dropout layers. Each layer's parameters are detailed inside the corresponding box.
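The layer counts described above can be sketched in Keras as follows for the 1D CNN (four batch normalization layers, three interspersed convolution/max-pooling blocks, two dense layers). The filter counts, kernel sizes, and dense widths below are placeholders; the values actually used are those in Figure 1. Glorot initialization is the Keras default for dense and convolutional layers:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1d_cnn(series_len=458, n_classes=13):
    """Sketch of the 1D CNN: BatchNormalization layers interspersed with
    Conv1D/MaxPooling1D blocks, then a two-layer dense classifier.
    Hyperparameter values are illustrative, not the paper's exact ones."""
    model = models.Sequential([
        layers.Input(shape=(series_len, 1)),
        layers.BatchNormalization(),
        layers.Conv1D(32, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.BatchNormalization(),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.BatchNormalization(),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # ADAM with learning rate 0.001, as stated in the text
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The 13-way softmax output corresponds to the first task; for a concentration subtask, `n_classes` would be set to 3-5 instead.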

Experimental Setup
For both tasks, we follow the same methodology regarding the splitting of the data. In each trial, we randomly select 20% of the data to be our test set, to validate the models, and the remaining 80% is used for training. We perform 30 independent runs per method, so for each task/subtask we obtain a sample of 30 results that can be statistically analyzed and compared.
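The evaluation protocol above can be sketched as a loop of independent random 80/20 splits, each yielding one test accuracy; stratifying the split by class is an assumption of this sketch rather than something stated in the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def repeated_evaluation(X, y, make_model, n_runs=30, test_size=0.2):
    """Run `n_runs` independent random splits (80% train / 20% test),
    training a fresh model each time and collecting the test accuracies,
    so the resulting samples can be compared statistically."""
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=run, stratify=y)
        model = make_model()                  # fresh, untrained model per run
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return np.array(scores)
```

A usage example: `np.median(repeated_evaluation(X, y, lambda: RandomForestClassifier(n_estimators=50)))` gives the median accuracy reported in the results tables.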
The choice of the number of epochs for the deep learning methods was based on a set of trials. Increasing the number of epochs in increments of 50, we analyzed the convergence of the training loss of the networks and stopped adding epochs once the loss appeared to have stabilized. Based on these trials, we established that both 2D CNNs run for 300 and 100 epochs on the first and second tasks, respectively, while the ANN and the 1D CNN run for 500 epochs on both tasks.

Results
In the present work, we aim to study the applicability of the obtained fluorescence data in automatic machine-learning-based systems for the ecotoxicological classification of marine diatom exposure trials; the physiological responses of the cells are therefore out of scope here and are discussed elsewhere [6][7][8]16,23,24]. Nevertheless, in order to better understand the magnitude of the ecotoxicological effects observed in the cultures, relative growth inhibition and half-maximal inhibitory concentrations were also determined as proxy measures of the toxicity effects, which are unequivocally connected with the photochemical results (Table 2). Although the octanol-water partition coefficients have similar values among the tested substances, the exposure of P. tricornutum to these xenobiotics led to very different responses in terms of growth inhibition, a process intrinsically connected with the photochemical processes (Table 2).

Table 2. Octanol-water partition coefficients (log KOW; NA, not applicable) and applied concentrations of the tested compounds, with the respective growth inhibition (%, negative values indicate growth inhibition) and half-maximal inhibitory concentration (IC50) assessed for each exposure trial set.

We report our results in tables, boxplots, spider plots, and heatmaps. Accuracy values are reported as proportions (0-1) in the tables and plots but, for convenience, as percentages (0-100) in the text. When presenting median results of the 30 runs, we determine the statistical significance of the observed differences with the non-parametric Kruskal-Wallis test. Whenever we refer to better/best or worse/worst results, the p-value of the statistical test was <0.01. All p-values are presented in tables in Appendix A.
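The significance testing step can be sketched with SciPy's Kruskal-Wallis implementation; the accuracy samples below are synthetic stand-ins for the per-method results of the 30 runs:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(2)

# Hypothetical accuracy samples (one value per run) for three methods
acc_cnn_log = rng.normal(0.97, 0.01, size=30).clip(0, 1)
acc_rocket  = rng.normal(0.95, 0.02, size=30).clip(0, 1)
acc_rf      = rng.normal(0.80, 0.03, size=30).clip(0, 1)

# Kruskal-Wallis: non-parametric test that the samples come from
# distributions with the same median
stat, p = kruskal(acc_cnn_log, acc_rocket, acc_rf)
significant = p < 0.01    # the significance threshold used in the paper
```

Because the test is non-parametric, it makes no normality assumption about the accuracy distributions, which is appropriate for bounded, often skewed accuracy samples.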
For the task of predicting the type of EC, the best results are obtained by CNN Log, with a median accuracy of 97.65% on the test set, as detailed in Table 3. Observing the results of the best CNN Log model on the heatmap of Figure 2, we see that the vast majority of ECs are correctly classified with accuracy values of 100%, with Cu_d and Zn_np having the lowest accuracy values of 94.44%. The best models of the other deep learning methods also achieve very good results, as does Rocket, but not so consistently across all classes. For example, 1D CNN achieves 100% accuracy in even more classes than CNN Log, but reaches only 84.38% in one of the classes (TRI). Looking at the boxplot of Figure 3, we see that 1D CNN indeed shows a very high dispersion of results, revealing it to be a less reliable method despite its high median accuracy. ANN also reveals a higher dispersion than Rocket or the CNNs. Regarding RF and XGB trees, their overall accuracy is much lower than the others in terms of both median and best results. The spider plot of Figure 4 contains the same data as the heatmap of Figure 2, represented in a different way and omitting the RF and XGB results, since these two clearly cannot compete with the other methods. The advantage of having these two plots for the same data is that, while the spider plots make it easier to view the large differences between the models, the heatmaps make it easier to compare the concrete accuracy values per class, given that the color scheme was chosen to emphasize the small yet relevant differences that cannot be seen in the spider plots. Here we can easily verify which ECs are easier or harder to predict and observe the consistency or disparity of the results obtained by the different models for each EC. We observe that ECs such as GLIPH, IBU, Cu_np, and Ti_d are always very well classified, whereas PROP, TRI, Cu_d, and Ti_np represent a challenge for most models.
Regarding the task of predicting the concentrations of the different ECs, the scenario changes. Table 4 shows that Rocket is the superior method, achieving the best median test accuracy for all but one EC, followed by ANN, which surpasses Rocket in one EC and shares the best results in 7 other ECs, and then by 1D CNN, which shares 5 of the best results with Rocket and ANN. CNN and CNN Log achieve only one best result each. Regarding RF and XGB trees, although they do not claim any best results, on this task they do not perform much worse than the other methods. SDS is the EC whose concentration is hardest to predict, probably because even a small amount of it simply kills the microalgae. Zn_np is another EC for which predicting the concentration is not easy.
Looking at the boxplots of Figure 5, for most ECs and methods we observe a higher dispersion of accuracy values than in the first task. Whereas in the first task the median accuracy was very close to the best result, here that is seldom the case. For this reason, the heatmap and spider plots that follow report the results of the models that achieve median, not best, results. The heatmap in Figure 6 and the spider plots in Figure 7 show the accuracy per concentration for all 13 ECs; the large heatmap is a concatenation of 13 smaller parts (see Table 1). Looking at both types of plots, we observe a large disparity of behaviors. Some classifiers are generally superior to others, but the performance of each method depends not only on the EC but also on its concentration, with no clear relationship between concentration and accuracy.

Discussion
The results reported above reveal that from the first task (predicting the EC) to the second task (predicting the concentration of the EC) the ranking of best methods changes. This seems to have been caused by a relative failure of the CNNs on the second task, after being so successful on the first one.
Several hypotheses may be advanced regarding this change: the normalization of the data, which is performed only for the second task (Section 2.5); the strong reduction in the number of classes (from 13 on the first task to 3-5 on the second); the strong reduction in the number of samples available in each class (from 120-180 on the first task to only 30 on the second). In order to test whether the culprit is the normalization of the data, we have performed experiments with non-normalized data for the second task. However, with this setting the CNNs obtained worse results than with normalization, thus discarding it as the cause for their failure.
Regarding the reduction in the number of classes, although fewer classes normally mean easier problems, the labels used for the second task are not a subset of those of the first but rather a refinement of each of the original labels. This defines a different problem, one that may be more difficult than the first, even after splitting it into 13 independent subproblems. To check whether the set of 13 subproblems represents an easier or harder challenge than the single problem of the first task, we compare the results of predicting the EC (Table 3) with the median results of predicting the concentrations for each of the ECs (last row of Table 4). On the training set, all the methods do better on the median of the 13 concentration-prediction problems than on the single EC-prediction problem. In fact, not only is the median better, but so are the individual results of each of the 13 problems, most of them a perfect accuracy of 100%. Rocket is the only method reporting a single case where the training accuracy is higher when predicting the EC (99.91%) than when predicting the concentration of one of the ECs (IBU, 97.08%). Additionally, the CNNs do better on all the problems of the second task than on the single problem of the first task. Therefore, learning the training set is clearly easier on the second task. As for generalization on the test set, some methods are better on the second task (RF, XGB, Rocket, ANN) while others are worse (1D CNN, CNN, CNN Log), and this is what changes the ranking of the methods.
We investigated what causes the relative lack of generalization ability in the 1D CNN and both 2D CNNs, with a clear hypothesis in mind. One cause might be that there are not enough data to further split the training set into a validation set, which would allow us to implement an early stopping criterion for the models. This is aggravated by the fact that these methods learn the training data very easily, and indeed we confirmed that there is a moderate amount of over-fitting. Had the amount of data been larger than 30 samples per class, we could possibly have made better choices regarding the number of epochs to run, and the convolutional DL approaches would be expected to perform much better than they did on the second task. DL is indeed known to require large amounts of data to achieve good performance, so this finding is not surprising, and it strongly suggests the need for data augmentation.
One interesting result we obtained is related to the performance obtained by the CNNs that use logarithmic time data (CNN Log) and the CNNs that use time steps (CNN). It is common to use logarithmic scales of the temporal data for visualizing the Kautsky curves, but, of course, most ML methods only use the values of the time series for performing the classification, regardless of how they would be visualized. However, both CNN and CNN Log rely on plots of the data, and, therefore, they "see" the data differently depending on the x-axis, and we have observed that using a logarithmic scale is not always the best option. Although it proved to be significantly better on the first task (p-values of the statistical test in Appendix A), on the second task we have the opposite scenario, where the regular CNN is more often significantly better than CNN Log. This suggests that, when performing the necessary data augmentation for improving the performance of the CNNs, one easy way to implement it is to use both regular and logarithmic data.

Conclusions and Future Work
We evaluated the performance of several machine learning and deep learning methods on two different tasks of predicting the exposure of a model marine diatom (Phaeodactylum tricornutum) to different emerging contaminants at different doses, using only the chlorophyll fluorescence induction curves as data. The first task was to identify which of the 13 contaminants was present, while the second task was to identify the concentration of the contaminant. We used five different machine learning and deep learning algorithms on both tasks: Random Forests, XGBoost trees, Rocket, and several deep neural architectures, including 1D and 2D convolutional neural networks (CNNs). The 2D CNNs had the particularity of using as input the plots of the time series data instead of the numeric values. One of these methods (CNN Log) was given the plots with a logarithmic x-axis, as the curves are normally visualized, while the other used a linear scale with equally spaced time points (CNN). On the first task, the best results were obtained by CNN Log, with a median test accuracy of 97.65% over 30 independent runs, closely followed by the other deep learning approaches and the Rocket method. On the second task, the best methods were Rocket and the other deep learning approaches, while the 2D CNNs performed comparatively badly. Whereas Rocket proved to be a reliable method for both tasks, we analyzed the reasons for the relative failure of the 2D CNNs on the second task and concluded that it was caused by moderate over-fitting, probably due to an insufficient amount of training data.
Thus, the combination of machine learning algorithms with high-throughput screening (HTS) data from fluorescence induction curves appears to be a powerful tool for fast, large-scale ecotoxicity assessment, capable of classifying model organisms, such as diatoms, by their contaminant exposure in terms of both type and concentration, opening new doors for toxicophenomics developments.
Notably, future work should also test contaminant mixtures, as well as complex phytoplankton communities, in order to produce models with broader ecological meaning and a better assessment of contaminant exposure in natural environments. Current international ecotoxicological and risk assessment guidelines largely focus on individual chemicals of concern, which are simpler to address but unlikely to occur in isolation in natural environments.
In the future, we also plan to perform data augmentation by training the models on both linear and logarithmic data, as both provided good results without either proving to be better overall. We also intend to test and develop models that detect the chemical form of an EC, either ionic solution or engineered nanoparticles, by comparing the induction curves produced by both at the same concentration. Beyond the bioinformatics point of view, the present work also highlights the potential of these machine learning algorithms to assist the ecotoxicological evaluation of model diatom cultures exposed to a myriad of contaminants. The proposed analysis pipelines can be integrated into an automatic bio-optical assessment system fed by fluorescence sensors, producing accurate classifications of the degree of toxicity to which the cells are subjected and thus reducing operator-associated artifacts and errors. This opens a new door for ecotoxicology, not only reinforcing the potential of bio-optical features as biomarkers of contaminant exposure but also introducing machine learning methodologies as tools to assist the ecotoxicological classifications that underlie key assessments, such as Ecological Risk Assessment (ERA), crucial for impact and biomonitoring studies. Moreover, the data analysis pipeline proposed here can easily be transposed to other autotrophic organisms subjected to different stress/xenobiotic conditions, once calibrated and validated against known biochemical or morphological descriptors of stress.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to being part of an ongoing project.

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Table A10. Kruskal-Wallis p-values comparing all methods on the Cu_np concentration prediction task. Above the diagonal, we have the results for the test set, and below for the training set. Significant results (p < 0.01) are represented in green or red when the method on the left is significantly better or worse, respectively.

Table A11. Kruskal-Wallis p-values comparing all methods on the Ti_d concentration prediction task. Above the diagonal, we have the results for the test set, and below for the training set. Significant results (p < 0.01) are represented in green or red when the method on the left is significantly better or worse, respectively.

Table A12. Kruskal-Wallis p-values comparing all methods on the Ti_np concentration prediction task. Above the diagonal, we have the results for the test set, and below for the training set. Significant results (p < 0.01) are represented in green or red when the method on the left is significantly better or worse, respectively.

Table A14. Kruskal-Wallis p-values comparing all methods on the zn_np concentration prediction task. Above the diagonal, we have the results for the test set, and below for the training set. Significant results (p < 0.01) are represented in green or red when the method on the left is significantly better or worse, respectively.