Metagenomic Analysis of Upwelling-Affected Brazilian Coastal Seawater Reveals Sequence Domains of Type I PKS and Modular NRPS

Marine environments harbor a wide range of microorganisms from the three domains of life. These microorganisms have great potential to enable discovery of new enzymes and bioactive compounds for industrial use. However, only ~1% of microorganisms from the environment can currently be identified through cultured isolates, limiting the discovery of new compounds. To overcome this limitation, a metagenomics approach has been widely adopted for biodiversity studies on samples from marine environments. In this study, we screened metagenomes in order to estimate the potential for new natural compound synthesis mediated by diversity in the Polyketide Synthase (PKS) and Nonribosomal Peptide Synthetase (NRPS) genes. The samples were collected from the Praia dos Anjos (Angel’s Beach) surface water—Arraial do Cabo (Rio de Janeiro state, Brazil), an environment affected by upwelling. In order to evaluate the potential for screening natural products in Arraial do Cabo samples, we used KS (keto-synthase) and C (condensation) domains (from PKS and NRPS, respectively) to build Hidden Markov Models (HMM) models. From both samples, a total of 84 KS and 46 C novel domain sequences were obtained, showing the potential of this environment for the discovery of new genes of biotechnological interest. These domains were classified by phylogenetic analysis and this was the first study conducted to screen PKS and NRPS genes in an upwelling affected sample


Introduction
Metagenomics has been largely used to unlock the huge biodiversity and biochemical potential from uncultivable organisms [1,2]. Beside the large number of metagenomics studies carried in marine environments, only few studies have explored the marine waters of the Brazilian Coast using metagenomic approaches. Among them, Gregoracci et al. [3] studied the bacterioplankton of Guanabara Bay (the second largest bay of Brazil), in the Rio de Janeiro state. Trindade-Silva and colleagues [4] characterized the microbial diversity associated to the marine sponge Arenosclera brasiliensis from water of the João Fernandinho beach (Rio de Janeiro State) [4]. Cury and colleagues studied the taxonomic diversity of coastal seawater from Arraial do Cabo (Rio de Janeiro State, Cabo Frioregion) [5], an important fishing and tourism region influenced by an upwelling system and CAP3 with the very stringent default parameters, we tried to minimize the problem of chimeras, but the low coverage only can be outlined with a more deep sequencing effort. We obtained a total of 29,074 contigs and 269,587 singlets for sample P. The mean size of the contigs was 1368.4 bp (the largest contig had 17,926 bp).
For sample E, we obtained 20,792 contigs and 396,371 singlets. The mean size of the contigs was 1018.2 bp (the largest contig had 9281 bp). From these sequences (singlet + contigs), were also obtained 409,111 and 451,722 ORFs for sample P and E, respectively, using METAGENMARK [36].

Screening of NRPS and PKS Domains
Many studies have been conducted for functional or PCR-based screening of PKS and NRPS in diverse environments [22,23,37,38]. However, despite the growth of metagenomes databases, to date, only few studies were performed using computational approaches to screen PKS in whole shotgun sequenced metagenomes. One of these was conducted by Foerstner et al. [39] where six natural environment datasets were screened using a HMM profile approach. Moreover, no study has been performed to screen secondary metabolites genes in water from upwelling-affected coastal environment.
Due to the high abundance of Roseobacter clade organisms and Cyanobacteria in the samples from Arraial do Cabo (data not shown), we decided to conduct an in silico screening of PKS and NRPS genes. The aim of this study was to evaluate the potential of the studied environment to provide new secondary metabolites, since many studies have shown the presence of these genes in the genomes of these taxonomic groups (Roseobacter clade organisms and Cyanobacteria) [24,[26][27][28].

KS Domain
In this study, two KS pHMM were built: from sequences of modular KS and iterative KS. These pHHM were used to retrieve putative KS domains in metagenomes from Arraial de Cabo and the NapDos was used because this system has the capacity to classify KS and C sequences from poorly assembled genomes and metagenomes [40]. Using both pHMMs, a total of 102 KS sequences were obtained. Using the KS modular pHMM, we found a total of 28 hits in sample P and 37 in sample E. These sequences were submitted to BLASTP against RefSeq protein database (E-value cutoff e´5). Only seven sequences returned no hits, one from sample P and six from sample E. Using the annotation of the five best hits from RefSeq database it was possible to confirm the PKS annotation of 13 sequences (46.42%) and six (72.97%) of the KS sequences of P and E samples, respectively.
All KS sequences were submitted to NapDoS [40]. For sample P, it was possible to classify 23 (78.57%) sequences, whereas for sample E, it was possible to classify 34 (91.89%) sequences.
Using the KS iterative, we obtained 21 and 16 sequences from sample P and sample E, respectively. These sequences were submitted to BLASTP against RefSeq Protein (E-value cutoff e´5). For sample P, all the sequences showed hits on BLASTP, and for sample E, only two sequences did not show hits. Using the annotation of the five best hits from RefSeq database, it was possible to confirm the PKS annotation of 28.57% (6/21) and 56.25% (9/16) of the sequences from samples P and E, respectively. These sequences were also submitted to NapDoS. For sample P, it was possible to classify 71.42% (15/21) of the sequences, whereas for sample E, it was possible to classify 75% (12/16) of the sequences. From the total KS domains obtained by both pHMM (102), 38 sequences from sample P and 46 sequences from sample E (totalizing 84 sequences-82.35%) were confirmed in silico as KS (by blastP and/or NapDos results), respectively. The false positives found (17.65%) were expected, because the HMM approach is very sensitive for the detection of distant homologs [41] and the Fatty Acid Synthase (FAS) is homologous to PKS [42]. The advantage of the pHMM usage to screen type I PKS in metagenomic shotgun data and the possible recovery of false positives was discussed on a study performed by Foerstner et al. [39]. Table 1 summarizes the results of these analyses.
From the 84 type I KS (confirmed by Blast and/or NapDos), four were similar to Rhodobacteraceae organisms in the BLASTP results (against RefSeq protein). From these four KS sequences, two were classified in the phylogenetic tree as hybrid KS/PKS enzymes, agreeing with results obtained by Martens et al. [25], showing many hybrid enzymes in isolates from the Roseobacter clade.
The relative abundance of KS domains present in water was 0.0092% (38/409,111) from sample P and 0.0101% (46/451,722) from sample E. In the study of Foerstner et al. [39], the environment with the higher KS domain abundance was Minnesota farm soil [43], where 52 type I KS were found in 183,536 ORFs (0.0283%), 2.8 times more abundant than sample E of the present study. Soil environments commonly possess a high diversity of secondary metabolites because the microorganisms compete intensely with each other [44]. In addition, in the same study, the samples from an open ocean oligotrophic region (Sargasso Sea) [45] (ranging from 0.1 to 0.8 µm like sample P), were screened, showing 69 type I KS sequences in 1,214,207 ORFs (0.0056%), a relative smaller abundance than our sample P. These results confirm the potential of the costal upwelling-affected metagenome for the screening of secondary metabolites. Figure 1 shows the classification of KS domains obtained from both samples using both pHMMs. intensely with each other [44]. In addition, in the same study, the samples from an open ocean oligotrophic region (Sargasso Sea) [45] (ranging from 0.1 to 0.8 μm like sample P), were screened, showing 69 type I KS sequences in 1,214,207 ORFs (0.0056%), a relative smaller abundance than our sample P. These results confirm the potential of the costal upwelling-affected metagenome for the screening of secondary metabolites. Figure 1 shows the classification of KS domains obtained from both samples using both pHMMs.     Most of the sequences retrieved with modular KS pHMM were classified as modular by NapDoS (46% from both samples) and Hybrid KS (18% from sample E and 24% from sample P). These hybrid domains can be modular or iterative and are present on hybrid PKS/NRPS enzymes [46]. The Trans-AT domain was also present (11% on sample E and 14% on sample P) and, unexpectedly, using this pHMM it proved possible to retrieve some iterative KS domains. On the other hand, the iterative KS pHMM was able to retrieve only a few iterative sequences (only the iterative Enediyne sequences were retrieved). Most of the sequences retrieved with this pHMM on sample E were Trans-AT domains (31%). This result may be due to a bias in the iterative KS pHMM or due to the high abundance of hybrid PKS in this sample because trans-AT domains are common in hybrid systems. However, on sample P, the most abundant type of KS domains retrieved with the iterative KS pHMM was modular. All the sequences classified as polyunsaturated fatty acids (PUFA) by NapDos were manually verified and the best blast hits (against RefSeq) show similarity with PKS domain (with more significant E-value than the NapDos result). The high abundance of the KS module can be explained by the fact that in modular PKS the number of copies of each domain is much higher than in iterative PKS [42].
Additionally, aiming for a most accurate classification of the KS sequences, a phylogenetic analysis was performed using the 10 KS sequences from sample P and 15 from sample E (larger than 200 amino acids) to generate a phylogenetic tree on NapDos with the reference sequences (similar to the environmental sequences) (Figure 2).
The topology of the tree corroborates the results from many phylogenetic studies of the KS domain [37,40]. It was possible to separate the homologous Fab (from Fatty acid synthase), the type II KS alpha and beta domains, the polyunsaturated fatty acids (PUFA) domains (with one sequence from sample E in this clade), two clades from the Modular Trans-AT domains (where seven sequences from sample P and two from sample E were classified), the iterative KS (with two sequences from sample E), the hybrid KS (PKS/NRPS hybrid enzymes), with one sequence from sample P and six from sample E, the KS1 domains (typical starter KS KSQ) domains and KS domains in Curacin and Salinisporamide biosynthesis pathways, with one sequence from sample E, and finally the Modular KS domains (where one sequence from sample P and three from sample E were classified). As the results display in Figure 1, the phylogenetic tree shows most of sequences close to modular (Cis, Trans and Hybrid) KS sequences.
It is important to note that only sequences from Sample E were clustered in the clade of iterative KS domains. These results can be explained by the fact that iterative PKS are common in fungi genomes, and sample E is rich in eukaryotes due to the filtration membrane used (0.8 µm).
The discovery of these new KS domains in Arraial do Cabo can lead us to design new primers for PCR screening in a fosmid library and consequently can also lead to the discovery of new active compounds. The relative abundance of KS domains found in the current study confirms that the tropical coastal upwelling-affected environments, normally neglected in these studies, should be also screened for new polyketide compounds.

C Domain
Using the C domain pHMM, a total of 50 hits were obtained, 14 from sample P and 36 from sample E. These sequences were submitted to BLASTP against RefSeq protein (E-value cutoff e´5). Sample P yielded 11 hits (78.57%) with confirmed annotations. Sample E yielded 32 hits (88.88%) and it was possible to confirm the annotation from 31 of these. All 43 annotated sequences were submitted to NapDos and classified. It was not possible to confirm the annotation of one sequence (7.14%) from sample P and three sequences (8.33%) from sample E by these methods. These results show that our C pHMM model is specific, with low false positive detection.
The environmental C domain sequences larger than 200 amino acids were also submitted to phylogenetic analysis using the NapDoS pipeline, with the reference sequences (similar to the environmental sequences). Figure 3 shows the result of this analysis. The topology of the inferred tree was expected, separating the types of C domains as shown in previous studies [40].Most of the sequences were from LCL type (four from sample E and onefrom sample P) and Epimerization (four from sample E). The LCL domain catalyzes formation of a peptide bond between two L-amino acids [40]. The abundance of LCL domains was expected as the LCL domain was proposed to be exclusively from gram-negative bacteria (for example, proteobacteria), which are very abundant in the studied environment (data not showed).
As in the PKS screening, sample E showed a higher abundance of the NRPS C domain than sample P. This may be due to the presence of marine snow in sample E, with competition for space and nutrients in the particle-associated bacteria as a selective force [47].

Dataset
The sequences were generated by a previous study from our group (not published) and was submitted to MG-RAST (MG-RAST ID: 4539290.3 for sample P; 4539291.3 for sample E). The samples were collected in Arraial do Cabo (Rio de Janeiro State)-Brazil, at Praia dos Anjos Bay (−22°58′31.33′′, −42°0′46.84′′). No specific permits were required for the described field studies. The water was collected from surface (<2 m) and filtered first through 0.8 μm membranes (aiming to hold eukaryotes and particle-associated prokaryotes-named Sample E) and then through 0.22 μm membranes (aiming to hold free living prokaryotes-named sample P) using a vacuum filtration system. The DNA from both samples were extractedusing the kit DNA isolation kit for water (EPICENTRE) and sequenced by 454 FLX+ (ROCHE).

Metagenomic Reads Assembly
The metagenomic reads from both samples were assembled using CAP3 [29] with default parameters. The contigs and singlets were concatenated and the METAGENMARK [36] was used to extract the metagenomic Open Reading Frames (ORFs) and to translate those ORFs to protein sequences using the Transeq program from the EMBOSS package [48]. The topology of the inferred tree was expected, separating the types of C domains as shown in previous studies [40]. Most of the sequences were from LCL type (four from sample E and onefrom sample P) and Epimerization (four from sample E). The LCL domain catalyzes formation of a peptide bond between two L-amino acids [40]. The abundance of LCL domains was expected as the LCL domain was proposed to be exclusively from gram-negative bacteria (for example, proteobacteria), which are very abundant in the studied environment (data not showed).
As in the PKS screening, sample E showed a higher abundance of the NRPS C domain than sample P. This may be due to the presence of marine snow in sample E, with competition for space and nutrients in the particle-associated bacteria as a selective force [47].

Dataset
The sequences were generated by a previous study from our group (not published) and was submitted to MG-RAST (MG-RAST ID: 4539290.3 for sample P; 4539291.3 for sample E). The samples were collected in Arraial do Cabo (Rio de Janeiro State)-Brazil, at Praia dos Anjos Bay (´22˝58 1 31.33 11 ,´42˝0 1 46.84 11 ). No specific permits were required for the described field studies. The water was collected from surface (<2 m) and filtered first through 0.8 µm membranes (aiming to hold eukaryotes and particle-associated prokaryotes-named Sample E) and then through 0.22 µm membranes (aiming to hold free living prokaryotes-named sample P) using a vacuum filtration system. The DNA from both samples were extractedusing the kit DNA isolation kit for water (EPICENTRE) and sequenced by 454 FLX+ (ROCHE).

Metagenomic Reads Assembly
The metagenomic reads from both samples were assembled using CAP3 [29] with default parameters. The contigs and singlets were concatenated and the METAGENMARK [36] was used to extract the metagenomic Open Reading Frames (ORFs) and to translate those ORFs to protein sequences using the Transeq program from the EMBOSS package [48].

Screening for Genes of NRPS C Domain and Type I PKS KS Domains
A HMM profile (pHMM) approach was used to screen the genes of type I PKS (KS domains) and NRPS (C domains). The pipeline used in this work was adapted from a previous work developed by Dumaresq et al. [49]. The KS domains of type I PKS (modular and iterative) were obtained from MAPSIDB [50]. The NRPS C domains were obtained from NRPSDB [51,52]. For each domain, a multiple alignment was generated by MAFFT [53] and a HMM profile was built using the hmmbuild software from the HMMER v3.0 [41] package. Those profiles (PKS-pHMM and NRPS-pHMM) were then used to screen the translated ORFs of both samples from Arraial do Cabo using hmmsearch with E-value cutoff 0.1.
The translated ORFs that showed hits with PKS-pHMM and NRPS-pHMM were extracted from our metagenome dataset using FASTACMD (from BLAST 2.2.21 package [54]). A reverse search of the extracted translated ORFs against the PKS-pHMM and NRPS-pHMM was conducted using hmmscan (E-value cutoff 0.1), in order to classify the type of PKS from metagenomic sequences. In order to confirm and validate the results, all the extracted translated ORFs similar to profiles were submitted to BLASTP 2.2.21 (E-value cutoff e´5) against RefSeq protein database release 61. The annotation of the best five hits of each environmental sequence was used to annotate them. Finally, the confirmed KS and C domain sequences were submitted to Natural Product Domain Seeker-NapDoS [40,55] in order to classify the domains and to carry out phylogenetic analysis.

Conclusions
In this work, two fractions of the seawater metagenome from Praia dos Anjos (Angel's Beach) a coastal environment, affected by upwelling, were screened for PKS and NRPS domains.
Screening of the metagenome revealed 84 KS domain ORFs and 46 C domain ORFs (from PKS and NRPS). These sequences were manually verified and classified by NapDoS system and BLAST, showing a close enough similarity to curated sequences to infer annotation for these sequences. However, the degree of divergence suggests that they are probably new alleles. Based on these results we will now prepare an environmental DNA (fosmid) library from which to clone full sequences of PKS and NRPS, in order to conduct functional and sequence screening, using the sequences generated in this study as probes. In addition, a time-series study may be conducted in the future, to better understand the main differences between the microbial communities from each season in this region.