Novel Approach to Freshwater Diatom Proﬁling and Identiﬁcation Using Raman Spectroscopy and Chemometric Analysis

: (1) An approach with great potential for fast and cost-effective proﬁling and identiﬁcation of diatoms in lake ecosystems is presented herein. This approach takes advantage of Raman spectroscopy. (2) The study was based on the analysis of 790 Raman spectra from 29 species, belonging to 15 genera, 12 families, 9 orders and 4 subclasses, which were analysed using chemometric methods. The Raman data were ﬁrst analysed by a partial least squares regression discriminant analysis (PLS-DA) to characterise the diatom species. Furthermore, a method was developed to streamline the integrated interpretation of PLS-DA when a high number of signiﬁcant components is extracted. Subsequently, an artiﬁcial neural network (ANN) was used for taxa identiﬁcation from Raman data. (3) The PLS interpretation produced a Raman proﬁle for each species reﬂecting its biochemical composition. The ANN models were useful to identify various taxa with high accuracy. (4) Compared to studies in the literature, involving huge datasets one to four orders of magnitude larger than ours, high sensitivity was found for the identiﬁcation of Achnanthidium exiguum (67%), Fragilaria pararumpens (67%), Amphora pediculus (71%), Achnanthidium minutissimum (80%) and Melosira varians (82%).


Introduction
Diatoms are widely employed to assess water quality around the world. These microscopic algae are abundant in practically all kinds of aquatic systems [1], exhibit fast and differential responses to changes in environmental factors [2][3][4][5], and are easy to sample [6] and preserve [7], hence their widespread use in the assessment of aquatic ecosystems [6][7][8]. Such water quality assessments rely on the thorough and accurate taxonomic identification of the diatoms present in the water samples collected [8][9][10]. The prevailing identification method is based on the microscopic examination of diatom frustules, and their counting, which requires highly specific expertise and is time-consuming and expensive [9,11,12]. For this reason, a great effort has recently been directed to developing faster and less cumbersome identification methods and metrics. These are mainly based either on DNA metabarcoding or a combination of diatom imaging acquisition and deep learning methods [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. While these alternatives show promising results and take advantage of state-of-the-art sequencing and imaging methods, they are still laborious and quite expensive, which limits their application to routine monitoring of water quality.

Diatom Sample Collection and Taxonomic Identification
Collection of diatom biofilm samples took place in three lakes within the Oporto City Park (Northern Portugal) [38,39] as described previously [40]. This is an urban park of about 83 ha with an extensive forested area composed of tree and shrub species. The fauna of the park is mainly composed of native and non-native birds and fish. Some species of cyanobacteria were also detected in these lakes: Microcystis sp., M. aeruginosa and Planktothrix sp. Cylindrospermopsis racirborskii, Planktothrix agardhi [38,39]. The lakes are similar in hydrogeomorphic characteristics and environmental conditions; they also have an interlinked water flow. The water physico-chemical parameters are summarised in Table S1 of the Support Information. For sample collection, a toothbrush was used to scrape natural and artificial substrates over an area of 100 cm 2 for each lake. The substrates were rocks, wood, sediment, bricks, or underwater plastic tubes. When toothbrush sampling was not feasible, a similar area of biofilm was pipetted from the substrate surface. The biofilm collected was then resuspended in lake water contained in a laboratory tray. All biofilm samples were collected in the same day under similar conditions. The sampled biofilm was subsequently transferred into ten flasks for each lake: one 40 mL capacity dark glass flask containing the biofilm preserved in 33% formaldehyde and nine 120 mL capacity plastic flasks. The biofilm samples were temporarily stored in a thermal box for transport into the laboratory where the plastic flasks were stored at −80 • C until further analysis. Diatom identification was done following the conventional microscopy method [10]. For this, samples preserved in formaldehyde were oxidized for 24 h in 10 mL nitric acid with potassium dichromate crystals. Oxidants were then removed by successive centrifugation, followed by supernatant discharge and ensuing resuspension in distilled water. Centrifugations were done at 1500 rpm, at room temperature, in a Kubota 2420 Centrifuge (Kubota Corporation, Osaka, Japan). After the cleaning process, the turbidity and cell density in the samples was decreased by dilution in water. Permanent slides were then obtained by mounting with Naphrax ® (Brunel Microscopes, Ltd., Chimpenham, UK). Diatom valves were counted (400 per sample) and used for identification. Diatom identification was done in a light microscope (Zeiss Primo Star, 100×, N.A. = 1.25) with diatom floras [41]. Databases such as AlgaeBase [42] and Diatoms of North America [42] were checked to update species nomenclature.

Raman Spectroscopy
Biofilm samples were defrosted at 4 • C and dropped onto microscope slides that were dried at room temperature to prevent valve movement during the Raman spectroscopy acquisition. The Raman recordings were immediately done using an InVia TM Qontor ® confocal Raman spectrometer (Renishaw, Kingswood, UK) assembled with a Leica DM2700 microscope (Ernst Leitz GmbH, Wetzlar, Germany) and a 50× objective. A Cobolt 04-01 Series Samba TM (Hübner Photonics, Kassel, Germany) incident laser was employed. The laser was set to 532 nm and 0.1 mW on the sample surface. The spectra acquisition time was 10 s, and 3 accumulations were done to improve the signal/noise ratio. Eighteen spectra with a spectral range of 860 to 1660 cm −1 were recorded for each diatom species identified, except for two species for which only two specimens could be found. The readings were done in the cell region located between the central area and the apex, including the chloroplast; the raphe area was excluded from the readings, as well as empty valves and frustules. The software WiRE TM 5.2. (Renishaw Inc., Wotton-under-Edge, UK) was used to acquire the Raman spectra. Raman spectra were deconvoluted by fitting a sum of damped oscillator functions using a harmonical Igor Pro TM (Wavemetrics Inc., Portland, USA, 1998) routine. For the fitting procedure the area (A), width (W), and frequency (F) of each band were determined.

Data Analysis
Normalisation of the Raman band areas was done using the area of band located at 1526 cm −1 , to correct for intensity fluctuations in the obtained spectra. A first correlation analysis of the raw data confirmed this band as appropriate for the normalisation process [31,43]. A PLS-DA was then performed to describe the taxa and identify the band components (profile) contributing to this discrimination. This is a chemometrics method useful to model multiple variables that may be related. In the PLS-DA, the Y i variables were the diatom species and the X i variables, i.e., the regressors or descriptors characterising the species, were the Raman parameters, hereafter designated by Raman variables. To help interpreting these results, an integrated measure of the relationship (covariation) between the descriptors (Raman variables) and the species was developed from the significant components extracted by the PLS-DA. This was done by calculating the scalar projections of the loadings of the species (Y i ) over the loadings of the descriptors (X i ) in the Cartesian hyperspace formed by the significant PLS-DA components. For simplicity, the scalar projection loadings are herein referred to as scalar projections. The scalar projections indicate the weight of the relationship between the descriptors and the species. To assist the interpretation, the length of the sum of the scalar projections, herein called Raman module, was also calculated for each species. To further characterise the Raman profiles of diatom species and infer about relationships in the dataset, a cluster analysis was done on the Y loadings obtained from the PLS-DA (i.e., loading vectors associated to the Y data set).
For taxa identification an ANN analysis with supervised learning was performed. The network architecture used was a Multilayer Perceptron (MLP) as previously done by Oliva-Teles et al. (2015) [44]. In this procedure, each neuron performs a weighted sum of its inputs and passes it through a transfer function to produce an output. For the ANN analysis, the data was randomly subdivided into three series: a training series; a testing series; and a validation series. Also, ANN models were developed for different taxonomic levels, i.e., using either the species, genus, family, order, or subclass as categorical output (target). Raman variables were the continuous input. The ANN models developed were evaluated for their classification performance using common measures employed in diagnostic tests, accuracy, and sensitivity rates. All statistical analysis was done with the software Statsoft Statistica TM 64 (Statsoft Software Inc., Tulsa, UK, 2014).

Diatom Species Identification
In total, 45 species were identified in all the three sampled lakes. A list of the species found, and respective valve counts is presented in Table S2 (Support information). Of these, 29 species belonging to 15 genera, 12 families, nine orders, and four subclasses showed >1% abundance in at least one lake. The most abundant species were Gomphonema parvulum (Kützing) Kützing, 1849, Melosira varians C. Agardh, 1827, Tabularia tabulata (C.Agardh) Snoeijs, 1992, Achnanthidium minutissimum s.l. (Kützing) Czarnecki, 1994, and Amphora pediculus (Kützing) Grunow ex A. Schmidt, 1875 ( Figure 1). analysis, the data was randomly subdivided into three series: a training series; a testing series; and a validation series. Also, ANN models were developed for different taxonomic levels, i.e., using either the species, genus, family, order, or subclass as categorical output (target). Raman variables were the continuous input. The ANN models developed were evaluated for their classification performance using common measures employed in diagnostic tests, accuracy, and sensitivity rates. All statistical analysis was done with the software Statsoft Statistica TM 64 (Statsoft Software Inc., Tulsa, UK, 2014).

Raman Spectra
A total of 790 Raman spectra were recorded from all species identified; their number per taxonomic level is indicated in Table 1. All the individuals analysed in this study showed Raman spectra composed by fourteen bands in the 800 to 1660 cm −1 spectral range ( Figure 2): located at 867 cm −1 , 920 cm −1 , 963 cm −1 , 1013 cm −1 , 1160 cm −1 , 1180 cm −1 , 1198 cm −1 , 1270 cm −1 , 1315 cm −1 , 1390 cm −1 , 1445 cm −1 , 1526 cm −1 , 1606 cm −1 , and 1656 cm −1 , respectively. Similar spectra were reported for Thalassiosira pseudonana and Ditylum brightwellii using Ti: sapphire and multimode diode 750 nm lasers as excitation wavelength, 30 mW power, and 30 and 2 s of acquisition time, respectively [45,46]. The Raman signature of the species Cylindrotheca closterium exhibits a similar profile using the 532 nm laser excitation line, a power of 0.1 mW, and an acquisition time of one second [47]. The recorded data was analysed by fitting the sum of damped oscillator functions, and the frequency, band width, and area were obtained for each spectral component. The frequency, band width, and area were the data used in the chemometrics methods.

Raman Spectra
A total of 790 Raman spectra were recorded from all species identified; their number per taxonomic level is indicated in Table 1. All the individuals analysed in this study showed Raman spectra composed by fourteen bands in the 800 to 1660 cm −1 spectral range ( Figure 2): located at 867 cm −1 , 920 cm −1 , 963 cm −1 , 1013 cm −1 , 1160 cm −1 , 1180 cm −1 , 1198 cm −1 , 1270 cm −1 , 1315 cm −1 , 1390 cm −1 , 1445 cm −1 , 1526 cm −1 , 1606 cm −1 , and 1656 cm −1 , respectively. Similar spectra were reported for Thalassiosira pseudonana and Ditylum brightwellii using Ti: sapphire and multimode diode 750 nm lasers as excitation wavelength, 30 mW power, and 30 and 2 s of acquisition time, respectively [45,46]. The Raman signature of the species Cylindrotheca closterium exhibits a similar profile using the 532 nm laser excitation line, a power of 0.1 mW, and an acquisition time of one second [47]. The recorded data was analysed by fitting the sum of damped oscillator functions, and the frequency, band width, and area were obtained for each spectral component. The frequency, band width, and area were the data used in the chemometrics methods.
A PLS-DA regression was done with the data obtained for the species showing >1% abundance to depict their Raman profiles. Six significant components were identified by the PLS-DA. Eleven variables were found to have important contribution to these components ( Figure S1, Support Information), which were associated to six different bands: width band assignments, already available in the literature about Raman applications to diatoms, the bands at 1013, 1160, 1180, and 1526 cm −1 are assigned to C-CH 3 in plane rocking modes, as well as C-C, C-H and C=C stretching modes from carotenoids, respectively [28], whereas bands 1198 and 1270 cm −1 can be assigned to N-C and C-N stretching modes of chlorophyll a [28]. It is known that though pigment composition is similar among diatom species, the ratio between the pigments [48], as well as the concentrations of these molecules in different cell compartments, is highly variable [29]. This may explain the differences found among species in the pigment-related bands. Table 1. Number of Raman spectra collected from each diatom taxonomic level. In total 790 Raman spectra were acquired.

Genus
Family Melosiraceae (54) Melosirales (54) Melosirophycidae (54) Cyclotella (18) Stephanodiscaceae (18) Stephanodiscales (18) Thalassiosirophycidae (18) A PLS-DA regression was done with the data obtained for the species showing >1% abundance to depict their Raman profiles. Six significant components were identified by the PLS-DA. Eleven variables were found to have important contribution to these components ( Figure S1, Support Information), which were associated to six different bands:  between the diatom species (Y i ) and the Raman variables (X i ) (Figure 3). High positive (red) and High negative (green) scalar projections indicate species showing high positive and negative correlations with the Raman variables, respectively. The Raman module, reflecting the global importance of the descriptors over a given species, was calculated and presented; darker blues representing the highest importance and lighter yellows representing the lowest importance ( Figure 3). This integrated analysis allowed us to clearly identify the species best discriminated by the model. In Figure 3, the similarity of species profiles is shown by the cluster analysis. Globally, two main groups of species were easily identified, one with Raman module values ranging from 0.43 to 0.92, representing the species better characterised by the Raman variables, and a second group with notably lower Raman modules. The first group is composed by the species Cyclotella stelligera Variation in the band area assigned to pigments might be related to the amount of these compounds in the cell [47]. For example, the area of band 1160 cm −1 was low for A. minutissimum, a smaller and pioneer taxon capable of colonizing baring substrates and resisting environmental adversities [6]. This indicates low amounts of carotenoids in this species. Among the most important carotenoids in diatoms are fucoxanthin, diadinoxanthin, and diatoxanthin. Fucoxanthin is involved in light-harvesting [48], and diadinoxanthin and diatoxanthin are involved in photoprotection [31,48]. Different band widths might reflect pigment diversity [43]. In contrast to the area, the width of band at 1526 cm −1 showed higher values in A. minutissimum than in the remaining species (Figure 4). This may indicate the presence of a higher variety of carotenoids, which is consistent with the fact that A. minutissimum sensu lato is a species complex encompassing multiple similar species [49]. The band at 1526 cm −1 is a marker of the length of the polyene chain, which vary among different carotenoids [50]. The width of the band 1160 cm −1 was lower in Cyclotella stelligera than in the remaining species (Figure 4), probably reflecting the presence of a lower variety of carotenoids. The frequency of some bands was also relevant to discriminate the species analysed. In particular, the frequency of bands at 1526 cm −1 , 1198 cm −1 and 1180 cm −1 . In diatom studies, a frequency variation is related to resonance phenomena caused by changes in the wavelength of the incident laser [28,43,51]. Resonance phenomena occur when the energy of the incident laser is similar to the energy of the transition between electrons of a determined compound causing the enhancement of the Raman band corresponding to that compound. Band frequency differences of pigment in solution can also be derived from conformational changes due to the polarity of the solvent [43,51]. In this study, pigments were not extracted, so no solvent was used, and the incident laser frequency was kept constant during the acquisitions. Hence, changes in frequency are most probably due to the presence of different molecular conformations in the measured cells.     Categories are as follows: dark green, 0-5%; green, 5-10%; light green, 10-20%; pink, 80-90%; red, 90-95%; dark red, 95-100%. The darker the red (positive) or green (negative) tone the greater the effect of the Raman variables over the species. Species legend as in Figure 3.
Globally, from the results obtained, C. stelligera stood out in profile from the other species. This is a non-motile and planktonic species, contrarily to the other species described [6]. Metabolic and molecular adaptations can occur in this species in response to challenging environmental conditions, which would be reflected in the recorded Raman bands. Overall, interpretation of the PLS by calculation of the scalar projections and Raman modules provided a clear Raman profile characterising each species, also bringing information about their biochemical composition. Future studies should focus on elucidating the differences in molecular conformations that could underlying the frequency shifts recorded in diatom species and the components involved. For example, variation in the components (area, width and frequency) of the bands at 1160, 1180, and 1198 cm −1 can also be assigned to C=S modes of the frustule [52] with previous authors having reported differences among genus in bands related to the frustule components, which may reflect differences in frustule silicification [37].

Artificial Neural Network Models
The ANN analysis was carried out with the whole dataset. The best ANN models obtained for the prediction of diatom taxa from Raman data are shown in Table 2. From these results, it is clear that within each subclass, order, and genus, some taxa were predicted with higher performance than others. The ANN methodology showed higher performance in predicting the diatom subclass, returning a prediction with good validation accuracy of 76.0%. The second best model was the one predicting the order ( Table 2). It is interesting to note that the lower the number of groups within a given taxonomic level, the higher the classification accuracy obtained. Another possible explanation is that the abundance of taxa could be interfering with the performance of the ANN model [53,54]. Indeed, other authors have found that when a taxon is rare, models tend to learn that the taxon is always absent. Conversely, when a taxon is common, models tend to learn that the taxon is always present [55]. In this work, each species was equally represented in the dataset, independent of their abundance, since the same number of spectra were obtained per species. However, the number of species within higher taxa was not evenly distributed; some taxa contained many species and others only a few. This may be a source of bias in the analysis. Further studies using a more even distribution of species can help clarify this effect and minimize such interferences. A very relevant aspect of our results is that of the amount of data with which the models were built. Compared to the available literature using ANN for automatic identification of diatoms, the use of Raman data required a remarkably smaller number of samples (spectra in our case). A previous study achieved an excellent accuracy (99.5%) using a total of 160,000 image samples processed by ANN to identify 80 diatom species (2000 samples per species) using a base dataset of 11,000 diatom samples [20]. Libreros and colleagues employed 16,000 segments of 365 images, combined with ANN, to identify diatom genera, achieving a classification accuracy of 74% [57]. A more recent study also using ANN and based on virtual slides obtained through high resolution focus-enhanced light microscopic slide scanning and a series of imaging processing steps, achieved a 95% identification accuracy of four diatom species (Fragilariopsis kerguelensis, Fragilariopsis rhombica, Thalassiosira gracilis, Thalassiosira lentiginosa) and five diatom genera (Asteromphalus, Chaetoceros, Nitzschia, Pseudonitzschia, Rhizosolenia) using a total of 2977 specimens [17]. According to these authors, around 100 specimens per taxon are required for this excellent identification. In another approach, Lambert & Green [58] employed 7092 labelled images processed with ANN to identify ten diatom morphological categories (Centric, Araphid, Symmetrical, Biraphid, Monoraphid, Nitzschioid, Asymmetrical Biraphid, Epithemioid, Surirelloid, Eunotioid) obtained from 1639 species and 112 genus; their accuracy rate was 94%. Finally, a study based on holographic reconstructions from a commercial glass slide containing 50 diatom species achieved an accuracy rate greater than 80% [59]. The authors used an augmented dataset with 174,636 images per class, with a total size of 8,731,800 elements. Overall, these studies showed useful results in the identification of both limited subsets of taxa or larger numbers of genera or species, but always requiring a huge amount of data. They were done with very large imaging datasets involving photographing or scanning and cumbersome pre/pots-processing techniques. Furthermore, most datasets were artificially augmented by imaging processing or segmentation. To the best of our knowledge, this is the first study concerning the prediction of diatom taxa from Raman spectral data. The accuracy rates obtained with a comparatively much lower amount of data, requiring no special processing or preprocessing treatments or artificial augmentation, are very promising, indicating the potential of Raman spectroscopy diatom identification. A more interesting characteristic of these Raman identification models is the high accuracy and sensitivity obtained, relative to the dataset size, when considering that the Raman spectra acquired reflect the biochemical composition of the diatoms rather than their morphological characteristics. Future studies, including species from different geographical locations living under a diverse range of environmental conditions, will provide a sound dataset for ANN and characterisation of Raman profiles for each species, improving species identification.

Species Identification
The model targeting the species was globally the less accurate in the identification (33.7% validation accuracy). However, within this elementary level of taxonomic identification, some species were identified with good sensitivity (Table 3), namely A. minutissimum (80%) and M. varians (82%). Also, the subclass Bacillariophycidae, comprising more than two thirds of the individuals studied (with 556 spectra acquired, see Table 1), was predicted with an excellent sensitivity (89%) by the ANN species model (Table 3). Interestingly, the subclass Thalassiosirophycidae (with only 18 Raman spectra acquired) was predicted with good sensitivity (75%) by this same model (Table 3). Table 3. Validation sensitivity (%) for the taxa best identified by the Artificial Neuronal Network (ANN) using Raman variables as continuous input variables and the species as categorical target. Diatom species, orders, and subclasses with a sensitivity >65% are indicated in bold.

Achnanthidium minutissimum 80%
Thalassiophysales 42% Amphora pediculus 71% Fragilariophycidae 44% Fragilariales 47% Fragilaria pararumpens 67% Melosirophycidae 45% Melosirales 64% Melosira varians 82% Thalassiosirophycidae 75% Stephanodiscales 25% Cyclotella stelligera 50% On the whole, the species model is the most important because the species is the basic taxonomic unit. While the accuracy for species determination was lower than expected, the identification of a given organism by an iteration process is one related to hierarchical error. In other words, failing in the identification of a species but not in the identification of its genus is less inaccurate than failing in the identification of both the species and the genus. It is in fact a matter of narrowly failing or failing too close to the target. The same principle is successively applicable up to the subclass level or above. Therefore, decreases in the identification error (1-sensitivity) between taxonomic levels, for example, between the species and its genus, indicate the model is using Raman characteristics, i.e., biochemical characteristics, which are common to the species of that genus, possibly reflecting evolutionarily conserved mechanisms. The concept is particularly interesting as it provides an indication to look for which characteristics are shared by a given taxonomic group that could be established as taxonomic characteristics. Although at the moment the approach is especially useful to complement microscope observations, the fact that taxa identification was still possible over some local variation in conditions points out its promising potential. Raman spectroscopy is very sensitive and able to detect structural molecules useful to distinguish among taxa. This study was a first investigation of the usefulness of this approach in a small area. The next step will be to enlarge the number of sites and ecosystems to refine its use under different environmental and growth conditions and select the most useful Raman spectra to generalise the application.

Conclusions
In conclusion, most Raman bands observed in the 800-1660 cm −1 spectral range were found to differ among species and revealed to be useful for their profiling. The integrated interpretation tool derived from the PLS-DA results allowed us to depict a Raman profile for each species that can be used in the characterisation and identification of the different species. The Artificial Neural Network models could better predict the diatom subclasses and order than the species, with accuracy varying from sufficient to excellent (67-89%) using a small dataset of 790 Raman spectra obtained from 29 species, requiring no artificial augmentation. Compared to imaging-based methods, Raman spectroscopy shows high costeffectiveness in sample measurement and fast acquisition of a great number of variables reflecting the molecular composition of diatoms, with great potential for profiling and detection of characteristics of high taxonomic value.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/w14132116/s1, Table S1: Mean values (± Standard deviation) of the physical-chemical parameters measured in the three lakes sampled; Table S2: Diatom valve counts and valve percentage found in the three lakes of Oporto's City Park. Species with the abundance superior to 1% in at least one lake are highlighted; Figure  Funding: This research is a result of the projects REWATER (Water JPI) and BioReset (DivRestore/0004/2020) funded by the BiodivRestore COFUND Action (a joint programme of, BiodivERsA and Water JPI). This research was also supported by national funds through FCT (Portuguese Foundation for the Science and Technology) within the scope of UIDB/04423/2020 and UIDP/04423/2020.

Institutional Review Board Statement: Not applicaple.
Informed Consent Statement: Not applicable. Data Availability Statement: Not applicable.