Metabolomic Tools for Secondary Metabolite Discovery from Marine Microbial Symbionts

Marine invertebrate-associated symbiotic bacteria produce a plethora of novel secondary metabolites which may be structurally unique with interesting pharmacological properties. Selection of strains usually relies on literature searching, genetic screening and bioactivity results, often without considering the chemical novelty and abundance of secondary metabolites being produced by the microorganism until the time-consuming bioassay-guided isolation stages. To fast track the selection process, metabolomic tools were used to aid strain selection by investigating differences in the chemical profiles of 77 bacterial extracts isolated from cold water marine invertebrates from Orkney, Scotland using liquid chromatography-high resolution mass spectrometry (LC-HRMS) and nuclear magnetic resonance (NMR) spectroscopy. Following mass spectrometric analysis and dereplication using an Excel macro developed in-house, principal component analysis (PCA) was employed to differentiate the bacterial strains based on their chemical profiles. NMR 1H and correlation spectroscopy (COSY) were also employed to obtain a chemical fingerprint of each bacterial strain and to confirm the presence of functional groups and spin systems. These results were then combined with taxonomic identification and bioassay screening data to identify three bacterial strains, namely Bacillus sp. 4117, Rhodococcus sp. ZS402 and Vibrio splendidus strain LGP32, to prioritize for scale-up based on their chemically interesting secondary metabolomes, established through dereplication and interesting bioactivities, determined from bioassay screening.


Introduction
Marine invertebrates such as sponges are a rich source of novel metabolites that are of medicinal interest due to their anti-cancer, anti-tumor, anti-viral and antibacterial properties [1][2][3][4]. However, there is a bottleneck when developing drugs from marine invertebrates. They are largely uncultivable and it is unsustainable to collect large quantities from marine habitats to facilitate the extraction of enough novel marine natural products for the supply chain, making pharmacological development difficult. Sponge-associated endosymbiotic bacteria are highly concentrated within the sponge matrix making up to 50%-60% of the dry weight of the sponge [5]. They are hypothesized to stabilize the sponge skeleton, process metabolic waste and provide chemical defense against environmental stresses such as predators and overgrowth of fouling organisms, by producing a plethora of novel secondary metabolites that may be structurally unique with interesting pharmacological properties [5][6][7], e.g., as antimicrobials [8] or anti-cancer drugs [9].
There is evidence to suggest that these microbes, which live symbiotically with the host organism, are the true source of many bioactive compounds discovered from associated marine invertebrates [5,10,11,12,13,14,15]. Some of these compounds can be produced in large quantities on a biotechnological scale using bacterial fermentation processes without the need to harvest the host organism and are therefore an economically viable and sustainable source of commercial quantities of metabolites of interest [16]. For example, the anti-tumor drug bryostatin 1, isolated from the marine bryozoan Bugula neritina and synthesized by the symbiotic bacterium Candidatus Endobugula sertula [9], is now produced using a large-scale fermentation process to ensure supply [17].
Key to the exploitation of marine bacteria as sources of novel marine natural products has been the implementation of 16S rRNA-based phylogenetic analysis which has been used extensively to provide an insight into sponge-specific microbial communities [18,19]. The development of new analytical technologies and instrumentation has made it possible to rapidly obtain a chemical fingerprint of bacterial extracts to potentially discover new natural products from only a few milligrams of material. Historically, selection of bacterial strains has relied on literature searching, genetic screening and bioactivity results [20]. However, cultivated bacterial strains from the same genus may appear morphologically identical, but may produce different, structurally diverse secondary metabolites [21,22]. In contrast, strains that appear different by morphology and 16S rRNA sequencing often produce the same secondary metabolites, making it difficult to pinpoint interesting bacterial strains before the time-consuming bioassay-guided fractionation and purification stages.
Dereplication is the rapid identification of known metabolites in a sample mixture [23][24][25]. Dereplication uses chromatographic and spectroscopic methods and database searching, for example using the MarinLit [26] and AntiBase [27] databases, to screen samples for known natural products, which saves time and reduces the possibility of redundancy during natural product discovery programs. Common dereplication methods involve using liquid chromatography coupled to a photo diode array (LC-PDA) system or LC-PDA with mass spectrometry (MS) using electrospray ionization (ESI) [28,29] or atmospheric pressure chemical ionization (APCI) as soft-ionization sources. High resolution liquid chromatography mass spectrometry (LC-MS) instruments such as Quadrupole Time-of-Flight (QTOF) or Orbitrap provide accurate mass data (0.5-5 ppm) with elemental composition output for a given ion [30]. This enables natural products databases to be queried in a high throughput manner, with fewer candidate metabolite IDs being observed for each feature. With a quadrupole or an ion trap, data-dependent MS/MS and MSⁿ can also be carried out to provide additional structural information (e.g., using a Q-TOF or LTQ-Orbitrap). TOF-based mass spectrometers enable a higher degree of certainty for identification of elemental compositions on the basis of both mass accuracy and isotope fit [28,31,32,33]. These instruments offer high sensitivity and accuracy in the ng or pg range and, on several newer-generation instruments, spectra can be obtained in positive and negative ionization modes during a single experiment.
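To make the accurate-mass figures above concrete, the following minimal sketch shows how a mass error in ppm and the corresponding m/z search window can be computed. The function names and the example values are illustrative assumptions and are not taken from any vendor software.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error of a measured m/z relative to its theoretical value, in ppm."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def mz_window(theoretical_mz, tol_ppm):
    """Lower and upper m/z bounds of a +/- tol_ppm search window."""
    delta = theoretical_mz * tol_ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


# Illustrative values only: an ion expected at m/z 400.0000 and observed at 400.0012.
error = ppm_error(400.0012, 400.0000)   # about 3 ppm
low, high = mz_window(400.0000, 5.0)    # about 399.998 to 400.002
print(round(error, 2), round(low, 4), round(high, 4))
```

At m/z 400 a 5 ppm tolerance corresponds to only about ±0.002 m/z units, which is why far fewer candidate formulae survive a database query than with unit-resolution data.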
Metabolomics is defined as the comprehensive analysis of the small molecules (MW < 1000) in a biological system under a given set of conditions [34]. At the biochemical level, the metabolome is most closely related to the phenotype, providing insight into biological function [35]. Mass spectrometry and nuclear magnetic resonance (NMR)-based metabolomics are readily applicable to natural products research, offering the ability to deal with complex mixtures in a highly efficient manner [36][37][38][39]. Metabolomics methods are combined with chemoinformatics approaches, e.g., unsupervised multivariate analysis methods, to uncover interesting variation amongst groups of samples (e.g., in terms of their m/z values for mass spectrometry data or chemical shifts for NMR data) [40]. Microbial metabolomics is readily applicable to investigate the physical state of cells [41], to investigate intracellular metabolites [40,41] and for the optimization of experimental conditions for the production of pharmacologically active compounds [23,25].
The aims of the study were to utilize metabolomics tools to investigate differences in secondary metabolite production in marine symbiotic bacteria to fast track the strain selection and dereplication processes for natural product drug discovery. LC-HRMS and principal component analysis (PCA) were used to pinpoint strains that were chemically diverse in a high throughput and untargeted manner. LC-HRMS results were then correlated with bioassay screening results to prioritize strains for drug discovery efforts. The study was designed to monitor secondary metabolite production, using extraction methodology optimized for the recovery of secondary metabolites. In comparison with other studies that compared strains from the same species [22,42], we were able to compare chemically diverse, non-related strains from four different phyla, cultured on a variety of growth media. Additionally, an Excel macro, developed in-house, was used to sort and remove features (pairs of m/z ratios and retention times) associated with the different culture media used. This reduced the difficulties in spectral interpretation that are often encountered when comparing bacterial strains grown on different culture media.
It was predicted that bacterial extracts containing the same secondary metabolites would cluster together whilst those extracts with chemically distinct metabolites would be observed as outliers using unsupervised multivariate analysis [23,39], providing a means to focus on chemically diverse extracts during dereplication. Therefore we used a combinatorial approach for strain selection, utilizing a data analysis workflow that encompassed features of dereplication and metabolomics to establish the chemical profiles of bacterial extracts in a high throughput manner. By incorporating metabolomics approaches, dereplication could be focused on chemically diverse bacterial extracts.
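The expectation that chemically distinct extracts appear as outliers can be illustrated with a small principal component calculation. This is a minimal sketch in pure Python that extracts only the first principal component by power iteration; it is not the SIMCA-P procedure used in the study, and the toy intensity matrix (rows are extracts, columns are features) is invented for illustration.

```python
import math


def first_pc_scores(matrix, iterations=100):
    """Scores of each sample on the first principal component (power iteration)."""
    n, p = len(matrix), len(matrix[0])
    # Mean-center every feature (column).
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Start from a uniform unit vector and repeatedly apply X^T X.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in centered]


# Hypothetical data: three similar extracts and one extract with a unique feature.
toy = [
    [10.0, 1.0, 0.0],
    [11.0, 1.0, 0.0],
    [9.0, 1.0, 0.0],
    [10.0, 1.0, 50.0],
]
scores = first_pc_scores(toy)
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(outlier)  # index of the extract lying furthest from the origin on PC1
```

In this toy case the three similar extracts receive the same score and the fourth lies far from them, which is the pattern used to flag candidate strains in a score plot.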

Diversity of Invertebrate-Associated Bacteria
Several species of cold water marine invertebrates found in Scottish coastal waters (Orkney Islands, Scotland, UK) were swabbed for microbial symbionts. Specimens were then inoculated onto various types of agar media, which yielded a total of 77 isolates (Figure 1 and Table S1 in Supplementary Information). Suberites ficus (sponge) yielded the highest number of isolates (22) followed by sponges Mycale (Carmia) similaris (14), Grantia compressa (12) and an unidentified hydroid (12), followed by sponges Leucosolenia sp. (8) and Sycon ciliatum (4), the soft coral Alcyonium digitatum (4) and sea urchin Diadema (1) (Figure 1a). A variety of isolation media were utilized in this study to maximize the diversity of the isolates obtained. M1 gave the highest recovery (36 isolates) whilst marine agar recovered only one isolate (Figure 1b). In terms of the diversity of isolates, M1 produced isolates belonging to 15 different genera followed by ISP2 and Luria (seven genera each). Oligo (oligotrophic) media produced isolates from four genera, R2A yielded two genera and marine agar only one genus (Table S1 in Supplementary Information). This variation is consistent with the results of previous studies [43,44]. By 16S rRNA sequencing, the phylogenetic affiliations of 75 of the isolates were determined whilst a further two isolates remained unidentified (Figure 1c). The isolates were grouped into four different phyla representing 23 different identified genera (Figure 1c,d). The most abundant phylum was the Proteobacteria, of which 42 were Gammaproteobacteria whilst four were Alphaproteobacteria, followed by the Actinobacteria (23), Bacteroidetes (4) and Firmicutes (2). This is consistent with the observation that it is more successful to culture Gammaproteobacteria than Alphaproteobacteria [45].
The highest numbers of isolates were affiliated to the genus Vibrio (21) followed by uncultured Gammaproteobacteria (12), Psychrobacter (6), Micrococcus (6) and Microbacterium (4) (Figure 1d). High numbers of Vibrio sp. are consistent with previous studies, as they are ubiquitous in the marine environment and are associated with various algae and animals such as sponges and corals [46].

Data Processing and Data Clean-Up
Following culturing and chemical extraction, the crude extracts from the 77 bacterial isolates were subjected to metabolomic analysis according to our pre-defined metabolomics workflow (Figure 2). To maximize secondary metabolite detection in this diverse bacterial population (with a range of phylogenetic affiliations and culture media), an Exactive benchtop Orbitrap mass spectrometer (Thermo Scientific, Bremen, Germany) that permitted fast polarity switching was used for untargeted dereplication. The Exactive allows positive and negative mode switching with a maximum scan time of 0.25 s and routinely delivers a mass accuracy of <3 ppm. The average chromatographic base peak width is about 30 s; therefore, there is adequate time to acquire sufficient scans through the peak in switching mode. Key features of the metabolomics workflow include data processing in MZmine 2.10 for peak detection, deconvolution, deisotoping, filtering (to narrow the retention time search window to 5-40 min), alignment and gap filling to make multiple data files comparable (Experimental Section 3.5 and Supplementary Information). In order to differentiate between structural isomers that eluted at similar retention times, chromatographic deconvolution using the local minimum search algorithm was applied in MZmine. The isomers were separated into individual LC-HRMS features if their chromatographic resolution was sufficient. However, it is a real challenge to find a setting for this function that works correctly every time. The raw data were therefore manually validated to confirm the output, for outlier and bioactive strains only. The adduct and complex search tools were used for the identification of non-proton adducts and complexes, respectively. This minimized mis-assignment of features such as solvent or salt adducts and complexes such as dimers. The formula prediction tool enabled the possible molecular formulae for each feature to be predicted.
The elemental composition output was supported by the heuristic isotopic pattern filter in MZmine [47]. The isotope fit scores were calculated for each isotope ion, then combined with the individual fit scores, which were weighted by their expected intensities. For each ion peak, the m/z and intensity differences between the expected and the measured patterns were obtained. Those differences were then normalized (normalized deviation values) to the maximum allowed mass and intensity deviation of 0.01%. The relative intensities for the expected and measured values were derived from the isotopic pattern spectra. Each value is a percentage of the isotope's intensity relative to that of A0. The normalized differences were summed by vector addition of intensity (I) and mass (M) deviations for m/z ions A0 [X], A1 [X + 1], A2 [X + 2] and A3 [X + 3] [48].
Positive and negative data were then exported as a CSV file for further clean-up. One limitation of MZmine 2.10 is that data obtained in positive and negative ionization modes cannot be combined; therefore, it is not possible to assign the ionization mode for each feature. Thus, a macro was written in Excel that enabled positive and negative ionization mode data files to be processed together. This enabled the features that were observed in either or both positive and negative modes to be merged for further statistical analysis. Hence, this minimized the risk of missing poorly ionizing compounds only detectable in one mode. For example, phenolic and anthraquinone compounds poorly ionize in positive mode but ionize very well in negative mode [25], therefore such compounds were not deleted from the surveyed peaks.
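One possible way to merge the two polarity tables is sketched below. This is not the in-house Excel macro, whose logic is not described in detail here; it assumes, for illustration only, that features are simple protonated or deprotonated ions, and it pairs a positive and a negative feature when their inferred neutral masses and retention times agree. All names, tolerances and values are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in Da


def neutral_mass(mz, mode):
    """Neutral mass assuming [M+H]+ in positive mode and [M-H]- in negative mode."""
    return mz - PROTON if mode == "pos" else mz + PROTON


def merge_modes(pos, neg, tol_ppm=3.0, rt_tol=0.2):
    """Merge (mz, rt, intensity) lists from both modes, labelling each feature's mode."""
    merged, used_neg = [], set()
    for mz_p, rt_p, int_p in pos:
        mass_p = neutral_mass(mz_p, "pos")
        partner = None
        for k, (mz_n, rt_n, _) in enumerate(neg):
            if k in used_neg:
                continue
            mass_n = neutral_mass(mz_n, "neg")
            same_mass = abs(mass_p - mass_n) / mass_p * 1e6 <= tol_ppm
            if same_mass and abs(rt_p - rt_n) <= rt_tol:
                partner = k
                break
        if partner is None:
            merged.append({"mode": "pos", "mass": mass_p, "rt": rt_p, "intensity": int_p})
        else:
            used_neg.add(partner)
            int_n = neg[partner][2]
            merged.append({"mode": "both", "mass": mass_p, "rt": rt_p,
                           "intensity": max(int_p, int_n)})
    # Negative-only features are kept so that poorly ionizing compounds are not lost.
    for k, (mz_n, rt_n, int_n) in enumerate(neg):
        if k not in used_neg:
            merged.append({"mode": "neg", "mass": neutral_mass(mz_n, "neg"),
                           "rt": rt_n, "intensity": int_n})
    return merged


# Hypothetical features: (m/z, retention time in min, peak intensity).
positive = [(301.1410, 12.00, 1.0e5), (450.2000, 20.00, 3.0e4)]
negative = [(299.1264, 12.05, 2.0e5), (227.0350, 8.00, 9.0e4)]
for feature in merge_modes(positive, negative):
    print(feature["mode"], round(feature["mass"], 4))
```

Keeping a mode label on every merged feature preserves the information about how each compound ionizes while still allowing a single table to be passed to the statistical analysis.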
Another complication when analyzing bacterial extracts is that they are cultured on complex growth media which generates multiple peaks in mass spectrometry and NMR datasets. The culture medium is a complex mixture of constituents and unutilized components that could cause interference in the detection of true bacterial secondary metabolites during dereplication. Therefore, a medium blank was analyzed together with the bacterial extracts in LC-HRMS and NMR experiments during data processing. The obtained features from the blank were regarded as interference and subtracted. A threshold intensity ratio of 1/20 was used if ion peaks (MS) were found in both the medium blank and the sample. The Excel macro was then utilized to extract and remove peaks originating from the culture medium by applying an algorithm to calculate the intensity of each m/z in both bacterial and medium extracts. This removed features thought to originate from the medium by only keeping those features with peak intensities 20 times greater in the bacterial samples than in the medium. Bacterial extracts were grouped according to their culture medium and this data clean-up step was carried out for each of the six types of culture medium used.
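The medium clean-up rule described above can be sketched as follows. This is a minimal illustration of the 20-fold intensity criterion, not the in-house macro itself; it assumes that features have already been aligned so that the same identifier refers to the same m/z and retention time in the sample and in the medium blank, and the identifiers and intensities are hypothetical.

```python
def subtract_medium(sample, medium, ratio=20.0):
    """Keep features whose sample intensity is at least `ratio` times the medium intensity.

    `sample` and `medium` map an aligned feature identifier to a peak intensity.
    Features absent from the medium blank are always kept.
    """
    kept = {}
    for feature_id, intensity in sample.items():
        blank = medium.get(feature_id, 0.0)
        if intensity >= ratio * blank:
            kept[feature_id] = intensity
    return kept


# Hypothetical aligned features for one extract and its culture-medium blank.
extract = {"f1": 2.0e6, "f2": 5.0e5, "f3": 1.0e5}
blank = {"f1": 5.0e4, "f2": 1.0e5}
print(sorted(subtract_medium(extract, blank)))  # f2 is only 5x the blank and is dropped
```

Because extracts were grouped by culture medium, a function of this kind would be applied once per medium, each time with the matching blank.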
Using the Excel macro, the data were then recombined into CSV files that were utilized for statistical analysis in SIMCA-P V13.0 (Umetrics, Umeå, Sweden) as well as for dendrogram and heat map generation in the R program (version 2.15.2, x64) (R Foundation for Statistical Computing, Vienna, Austria). The Excel macro was also utilized to dereplicate the samples, matching each m/z found in each bacterial extract with compounds in the AntiMarin database (using an m/z threshold of ±3 ppm) to provide details on the putative identities of all metabolites and to calculate the number of remaining unidentified features for each extract. This macro contains a function to identify the top 20 features (ranked by peak intensity) and corresponding putative identities in each sample by creating individual CSV files for each extract. Hits from the database were accessed using ChemBioFinder version 13 (PerkinElmer Informatics, Cambridge, UK) and structures were confirmed by tandem mass spectrometry (MS/MS) and two-dimensional 1H-1H correlation NMR spectroscopy (1H-1H COSY). For COSY analysis, spectra from the bacterial extracts were overlaid with spectra from the culture medium to determine signals and cross peaks originating from the culture medium. Three outlier strains were chosen for detailed discussion in this paper to demonstrate the application of different methodologies in dealing with a chemically diverse set of samples showing variation in terms of secondary metabolite production.
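The database-matching and ranking steps can be illustrated with a short sketch. It is not the in-house macro, and the two reference entries are placeholders rather than AntiMarin records; only the ±3 ppm tolerance and the ranking by peak intensity follow the description above.

```python
def match_database(features, database, tol_ppm=3.0):
    """Map each feature id to the reference names within +/- tol_ppm of its m/z."""
    hits = {}
    for feature_id, mz, _ in features:
        hits[feature_id] = [
            name for name, ref_mz in database
            if abs(mz - ref_mz) / ref_mz * 1e6 <= tol_ppm
        ]
    return hits


def top_features(features, n=20):
    """The n most intense features, highest intensity first."""
    return sorted(features, key=lambda f: f[2], reverse=True)[:n]


# Placeholder reference list: (name, m/z of the ion). Not real database entries.
reference = [("compound A", 300.0000), ("compound B", 500.0000)]
# Hypothetical features: (id, m/z, peak intensity).
observed = [(1, 300.0006, 1.0e5), (2, 500.0030, 5.0e6), (3, 150.0000, 2.0e4)]

hits = match_database(observed, reference)
unidentified = sum(1 for names in hits.values() if not names)
print(hits[1], unidentified)                       # feature 1 matches; two remain unidentified
print([f[0] for f in top_features(observed, 2)])   # ids of the two most intense features
```

Feature 2 lies 6 ppm from its nearest reference entry and is therefore counted as unidentified, which shows how strongly the tolerance controls the number of putative hits.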

Multivariate Analysis for Strain Selection
Processed data were analyzed in SIMCA-P V13.0 (Umetrics, Umeå, Sweden) using the unsupervised statistical analysis method, principal component analysis (PCA). PCA was used to identify differing features found in the outlying bacterial strains to aid prioritization of the strains with interesting secondary metabolomes. Four predominant outliers, Bacillus sp. 4117, Rhodococcus sp. ZS402, Vibrio splendidus strain LGP32 and Psychrobacter sp., were observed, indicating that there was variance in the secondary metabolites produced by these bacterial extracts as they lay furthest from the main group of samples in the score plot (Figure 3a). From the primary general screen, the extracts from these isolates were also found to have interesting bioactivities against Trypanosoma brucei brucei and Enterococcus faecalis, as well as in target-based functional assays, which included TRPV1, TRPA1 and TRPM8 (pain and cancer) as well as PTP1B and PPARα (inflammation, diabetes, metabolic disorders and atherosclerosis) (data not shown). Rhodococcus sp. ZS402 was also identified as NRPS positive from genetic screening, containing the non-ribosomal peptide genes. The PCA loading plot (Figure 3b) illustrates the features (m/z ratios, displayed as green dots) that are responsible for the separation shown in the score plot (Figure 3a), indicating the production of unique secondary metabolites particularly by two bacterial strains, Rhodococcus sp. ZS402 and Vibrio splendidus strain LGP32. Using SIMCA-P it is possible to select any point in the loadings plot to highlight the putative identity of any metabolite and to investigate the peak intensity of this metabolite across the sample set (Figure 3c). Two metabolites (m/z 265.1476 and 279.1631) were found to be abundant in the Rhodococcus sp. compared to the other extracts (Figure 3c); however, they were also observed in the Psychrobacter sp. (which was also an outlier in the PCA score plot, lying adjacent to the Rhodococcus sp.) with lower peak intensities, suggesting they both produce some similar metabolites not found in the other extracts. However, further study with Psychrobacter sp. was halted due to observed instability in the production of the secondary metabolites, leading to disappearance of bioactivity after freezing and thawing of the bacterial isolate. It is also worth mentioning that the outlier strains were repeatedly subjected to MS and NMR analysis every three months prior to scale-up work to evaluate their stability. Heat map analysis, which gives a visual representation of the metabolite diversity in the extracts, was utilized to look at the secondary metabolomes in the 77 bacterial extracts (Figure 4a,b). Multiple blue bands indicate a rich secondary metabolome with a high diversity of metabolites whilst fewer blue bands indicate that a more limited set of secondary metabolites is being produced. Heat maps were overlaid with dendrograms to relate the chemical profiles to the results of the multivariate analysis (Figure 4a). The heat maps were also arranged by species (Figure 4b) to investigate the chemical diversity amongst strains from the same species, exemplified by the 21 strains of Vibrio sp. that have very different heat map profiles (Figure 4b). It can be seen that several species from different genera have rich chemical profiles whereas other strains do not (Figure 4a,b). Heat map analysis can also be used to gain an overview of the molecular weight range of metabolites as the features were sorted by m/z ratios (Figure 4a

Chemical Diversity of Natural Products in Outlying Bacterial Extracts
The limitation of a dereplication study for secondary metabolites, particularly from marine sources, is the difficulty of obtaining a reference standard for every "hit" from the database. To ensure the correctness of the identification of the basic structure of the identified peaks, UV, MS/MS data and NMR spectral data were used to support the results. However, dereplication through the UV data set was limited to analyzing chromophore-containing metabolites. In this study, the mass resolution was 50,000 (at m/z 400), which is high enough to distinguish isobaric compounds with medium molecular weights (<800 Da). The total number of features identified in three of the outlying bacterial extracts by LC-HRMS is documented in Table 1. The highest number of features was detected in the Rhodococcus sp. ZS402, where 1198 features were detected in positive ionization mode and 2361 features were detected in negative ionization mode. By removing features from the ISP2 culture medium, 45% of these features were removed in positive ionization mode, whilst in negative ionization mode 27.4% of these features were removed, leaving 659 in positive mode and 1715 in negative mode. Following dereplication, 28% of features were putatively identified (positive and negative modes combined) whilst 72% were unidentified, indicating that this isolate may contain novel compounds (Table 1). The second highest number of features was detected in the Vibrio splendidus strain LGP32, where 2767 were detected in positive ionization mode and 654 features were detected in negative ionization mode. By removing features from the M1 medium, 61.2% of these features were removed in positive ionization mode, whereas in negative ionization mode 5.7% of these features were removed, leaving 1102 in positive mode and 617 in negative mode. Following dereplication, 40.7% of features were putatively identified (positive and negative modes combined), whilst 59.3% were unidentified (Table 1). In the Bacillus sp. 4115, 1220 features were detected in positive ionization mode and 1037 features were detected in negative ionization mode. By removing features from the M1 medium, 71.6% of these features were removed in positive ionization mode, whilst in negative ionization mode 57.8% of these features were removed, leaving 359 in positive mode and 438 in negative mode. Following dereplication, 51.3% of features were putatively identified (positive and negative modes combined), whilst 48.7% were unidentified (Table 1). Base peak chromatograms and tables listing selected interesting secondary metabolites from three of the outlying bacterial species are shown below (Figures 5-12 and Tables 1-4). The putative identities of metabolites, based on hits from the AntiMarin database, are only given if these metabolites have previously been identified from marine bacteria or sponges.

Dereplication of Bacillus sp. 4115
The crude ethyl acetate extract of the Bacillus sp. 4115 isolate was active on the initial screen in the target-based functional assay on TRPV1 against pain. Metabolites from the Bacillus sp. 4115 extract were putatively assigned as peptides through dereplication (Table 2). They eluted within the retention time range of 16-38 min, when the percentage of organic mobile phase (acetonitrile) was greater than 50% (Figures 5-7), and could be detected in both positive and negative ionization modes. Several of these were dereplicated using the AntiMarin 2013 natural products database as pumilacidin peptides (surfactins), which have already been described from the marine bacterium Bacillus pumilus. Pumilacidins have been described to exhibit antiviral activity [49]. Other plausible congeners that could not be found in AntiMarin were structural analogs of pumilacidins with varying numbers or lengths of alkyl or peptide side chains that could be targeted for isolation work. It has previously been reported that members of the Bacillus genus produce antibiotic peptides as part of their defence mechanism [50]. The fragmentation data reveal the presence of cyclic and linear moieties in the peptides, compatible with those of the pumilacidins. The presence of peptides can be further observed by 2D-COSY correlation (Figure 8) of NH signals between 8 and 9 ppm with the alpha protons resonating between 3 and 5 ppm, which gave additional cross peaks upfield from 1 to 2 ppm, representing the beta protons in amino acid units.
Figure 5. Positive and negative mode base peak chromatograms from outlying bacterial sample, Bacillus sp. 4115, annotated to indicate metabolites identified in Table 2. NB: several of the metabolites were detected in both positive and negative modes. Positive and negative mode base peak chromatograms from M1 agar medium are shown to indicate that the annotated metabolites are being produced by the bacteria and are not from the medium.

Dereplication of Vibrio splendidus Strain LGP32
The ethyl acetate extract of the Vibrio splendidus strain LGP32 exhibited biological activity against Trypanosoma brucei brucei (marker assay system for trypanosomiasis) and PTP1B. Protein-tyrosine phosphatase 1B (PTP1B) is a novel therapeutic target for type 2 diabetes mellitus, obesity and related states of insulin resistance [51]. Vibrio splendidus strain LGP32 contains many semi-polar metabolites, as indicated by the retention times of the major peaks, which ranged from 8 to 25 min (Figure 9). Its LC-HRMS and MS/MS data (Table 3) depicted a highly oxygenated set of metabolites, with the number of oxygen atoms varying from 4 to 13. The ratio of RDBs (ring-plus-double-bond equivalents) to the number of oxygen atoms ranged from 1:2 to 3:5, thus indicating the aromatic nature of the metabolites. COSY correlations between 3 and 5 ppm and between 6 and 8 ppm, shown in Figure 10, indicate the presence of a glycosidic-like moiety and an aromatic ring system, respectively. The aromatic signals between 6 and 7 ppm signify a phenolic or aniline system. The specified substructures can be found in oxyplicacetin, first detected in the mass spectral dereplication analysis of the isolate (Table 3). Oxyplicacetin, an anticoccidial agent, was previously isolated from Streptomyces ramulosus [52,53].
Table 3. Positive and negative mode base peak chromatograms from M1 agar medium are shown to indicate that the annotated metabolites are being produced by the bacteria and are not from the medium.
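For readers unfamiliar with ring-plus-double-bond equivalents, the sketch below computes the RDB of a neutral molecular formula using the standard expression RDB = C − H/2 + N/2 + 1 (with halogens counted as hydrogen). The parser and the two example formulae are illustrative assumptions and are not taken from Table 3.

```python
import re


def parse_formula(formula):
    """Element counts from a simple formula string such as 'C8H10N4O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def rdb(formula):
    """Ring-plus-double-bond equivalents of a neutral C/H/N/O/S/halogen formula."""
    c = parse_formula(formula)
    monovalent = c.get("H", 0) + sum(c.get(x, 0) for x in ("F", "Cl", "Br", "I"))
    return c.get("C", 0) - monovalent / 2 + c.get("N", 0) / 2 + 1


# Illustrative formulae only: phenol and caffeine.
for name, formula in (("phenol", "C6H6O"), ("caffeine", "C8H10N4O2")):
    oxygens = parse_formula(formula).get("O", 0)
    print(name, rdb(formula), oxygens)
```

Comparing the RDB with the oxygen count in this way gives a quick indication of whether unsaturation is likely to reside in aromatic rings or in carbonyl groups.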

Dereplication of Rhodococcus sp. ZS402
The ethyl acetate extract from the Rhodococcus sp. ZS402 isolate was found to be active in several of the target-based functional assays, which included TRPA1 and TRPM8 against pain, PPARα in inflammation, diabetes, or other metabolic disorders, as well as against Enterococcus faecalis. The chromatographic trace from Rhodococcus sp. ZS402 indicated that several interesting features were observed in both positive and negative ionization modes within the retention time range of 16-23 min (Figure 11). Only one of these features was identified using the AntiMarin natural products database during dereplication as xestoaminol C, an unsaturated acyl compound previously described from the sponge Xestospongia sp. Several others could not be dereplicated using AntiMarin but were indicated to be structural derivatives with additional C2H4 on their side chains. Undereplicated features observed in the negative ionization mode specified the presence of sulfated metabolites from the molecular formula identification searches in Xcalibur and MZmine [30], as well as the occurrence of the sulfate fragment ion [HSO4]− at m/z 96.9590 in the MS/MS data (Table 4).
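Derivatives that differ by additional C2H4 units can be flagged by searching for features separated by multiples of the C2H4 monoisotopic mass (28.0313 Da). The sketch below is one simple way to do this; the m/z list, the tolerance and the maximum number of units are illustrative assumptions rather than values from Table 4.

```python
C2H4 = 2 * 12.0 + 4 * 1.00782503  # monoisotopic mass of C2H4, about 28.0313 Da


def find_homologues(mzs, unit=C2H4, tol_ppm=3.0, max_units=3):
    """Return (lower m/z, higher m/z, n) for pairs separated by n repeat units."""
    pairs = []
    ordered = sorted(mzs)
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            for n in range(1, max_units + 1):
                expected = low + n * unit
                if abs(high - expected) / expected * 1e6 <= tol_ppm:
                    pairs.append((low, high, n))
    return pairs


# Illustrative m/z values: three members of a series plus one unrelated ion.
example = [230.2478, 258.2791, 286.3104, 300.1000]
for low, high, n in find_homologues(example):
    print(low, high, n)
```

A search of this kind only suggests a relationship; retention time trends and MS/MS data are still needed to confirm that the features belong to one compound family.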
The 2D-COSY spectrum (Figure 12) illustrates that this extract has a rich secondary metabolome. Signals can be seen which correspond to aromatic compounds (6-9 ppm), sugars (4-6 ppm) and sulfated aliphatics (0-4 ppm) and/or olefinics (2-5 ppm). The presence of peptides was also observed by cross peaks from the NH to the alpha and beta protons, typical for an amino acid. Table 5 summarizes the peptide metabolites that were detected in the positive ionization mode. This supports the presence of NRPS genes in the Rhodococcus sp. ZS402 bacterium. However, MS/MS data were only obtained for one detected metabolite. In Table 5, the presence of peptides can be determined within a range of double-bond equivalences [47] or alternatively calculated, where the RDB is equivalent to (#O − #N)/2 + #N, with corrections of (−1) if linear, (+1) if cyclic, (+4) for each additional Phe/Tyr and (+6) for each additional Trp; extra nitrogen(s) must be accounted for when (#O − #N)/2 + #N is less than the found RDB, which is also encountered with Arg. Besides following the Nitrogen Rule, approximately every 100 Da represents one amino acid with 1 RDB, except for Phe, Tyr and Trp. However, it was not possible to obtain the fragmentation data for most of the detected ion peaks due to the low intensities of parent ion peaks and/or the conceivable cyclic nature of some of the peptides. Devoid of a chain moiety, in comparison with the pumilacidins found in the Bacillus sp. 4115 isolate, cyclic peptides would need a hydrolysis step to cleave the ring prior to further fragmentation.
Table 4. Positive and negative mode base peak chromatograms from ISP2 agar medium are shown to indicate that the annotated metabolites are being produced by the bacteria and are not from the ISP2 agar medium.
Figure 12. 2D-NMR COSY spectrum of Rhodococcus sp. ZS402 overlaid with medium. Signals in orange are from the sample and signals in grey are from the medium.
Plates were incubated at 12 °C which led to the growth of visually diverse colonies of bacteria after 1-4 weeks. Distinct colony morphotypes were picked and bacterial streaking was utilized until pure bacterial colonies were isolated. Bacteria were then maintained on agar plates for short-term storage or archived for long-term storage. To archive the isolates, 3 mL of sterile artificial seawater was added to each plate and mixed before 400 μL of bacteria and artificial seawater solution was transferred into a 2 mL cryovial that contained 400 μL of 30% glycerol (Fisher Scientific, Hemel Hempstead, UK) using a pipette with sterile filter tips.

Bacterial Culture and Extraction
Seventy-seven fast-growing bacteria were selected from the archive. When required, bacteria in glycerol from archived cryovials were reinoculated onto agar plates and cultured for seven days in a dark incubator at 12 °C. Bacteria were then reinoculated to fresh agar plates by streaking, using disposable sterile loops and cultured as described above for seven days. This step was carried out to get rid of the glycerol in which the bacteria had been stored. Bacteria and agar from three replica plates were then collected into conical flasks using a sterile scalpel to cut the agar into small pieces. Culture growth was terminated with 200 mL HPLC grade ethyl acetate (Sigma Aldrich, Dorset, UK). After 24 h, samples were individually homogenized with an Ultra-turrax T 18 basic homogenizer (IKA, Staufen, Germany), filtered using a Buchner funnel with 110 mm Fisherbrand filters (Fisher Scientific, Hemel Hempstead, UK), transferred to a 500 mL separating funnel and subjected to liquid-liquid extraction and separation. This procedure involved initially separating the aqueous and ethyl acetate phases and washing the aqueous phase twice more with ethyl acetate. Ethyl acetate fractions were then collected, concentrated, weighed and reconstituted for mass spectrometry (1 mg/mL), NMR (5 mg/600 μL solvent) and bioassay screening (10 mg/mL), respectively. The mentioned fixed concentrations were strictly followed for MS and NMR analysis to normalize the weight of biomass used for each of the individual strains.

Mass Spectrometry
Methanol (MeOH), dichloromethane (DCM), acetonitrile (MeCN) and formic acid were purchased from Fisher Scientific (Hemel Hempstead, UK). All reagents were of analytical grade. HPLC grade water was obtained in-house from a direct Q-3 water purification system (Millipore, Watford, UK). Samples and medium control samples were prepared at a concentration of 1 mg/mL in 80:20 MeOH:DCM. A solvent blank was also included. Experiments were carried out using an Exactive mass spectrometer with an electrospray ionization source attached to an Accela 600 HPLC pump with Accela autosampler and UV/Vis detector (Thermo Scientific, Bremen, Germany). The mass accuracy was set to less than 3.0 ppm, a mass error limit that the Orbitrap mass analyzer is able to hold (within ±3.0 ppm). The instrument was calibrated to maintain a mass accuracy of ±1.0 ppm by applying the lock mass function. It was externally calibrated according to the manufacturer's instructions before the run and internally calibrated during the run using lock masses. In positive ion mode, the lock masses were m/z 83.06037 (acetonitrile dimer) and m/z 195.08625 (caffeine), and in negative ion mode the lock mass was m/z 91.00368 (formic acid dimer). Mass spectrometry was carried out over a mass range of m/z 100-2000 in positive and negative ionization modes with a spray voltage of 4.5 kV and a capillary temperature of 270 °C. Ten μL was injected from each vial, at a flow rate of 300 μL/min. The column used was an ACE5 C18 column (5 μm × 75 mm × 3 mm) (Hichrom Limited, Reading, UK). A binary gradient method was utilized. The two solvents were A (water and 0.1% formic acid) and B (MeCN and 0.1% formic acid). The gradient ran for 45 min with the following program: at 0 min, A = 90% and B = 10%; at 30 min, A = 0% and B = 100%; at 36 min, A = 90% and B = 10%, held until the end of the run at 45 min. The UV absorption wavelength was set at 254 nm, the sample tray temperature was maintained at 4 °C and the column was maintained at 20 °C.
The samples were run sequentially, with solvent and media blanks analyzed first. LC-MS data was acquired using Xcalibur version 2.2 (Thermo Scientific, Bremen, Germany).
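The gradient program above can be written down as a small lookup, which is convenient when relating a retention time to the solvent composition at elution. The following is a minimal sketch, not part of the original method: it assumes linear ramps between the stated breakpoints (the text gives only the breakpoints), and the names `GRADIENT`, `percent_b` and `percent_a` are hypothetical.

```python
# Breakpoints (time in min, %B) taken from the gradient program in the text.
# ASSUMPTION: the composition changes linearly between breakpoints.
GRADIENT = [(0.0, 10.0), (30.0, 100.0), (36.0, 10.0), (45.0, 10.0)]


def percent_b(t_min, program=GRADIENT):
    """Return %B (MeCN + 0.1% formic acid) at time t_min by linear interpolation."""
    if t_min <= program[0][0]:
        return program[0][1]
    for (t0, b0), (t1, b1) in zip(program, program[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    # Past the last breakpoint: hold the final composition.
    return program[-1][1]


def percent_a(t_min, program=GRADIENT):
    """Return %A (water + 0.1% formic acid); A and B always sum to 100%."""
    return 100.0 - percent_b(t_min, program)


if __name__ == "__main__":
    for t in (0, 15, 30, 33, 36, 45):
        print(f"{t:>2} min: A = {percent_a(t):5.1f}%  B = {percent_b(t):5.1f}%")
```

If the instrument method instead held 100% B from 30 to 36 min and stepped back to 10% B, only the `GRADIENT` table would need an extra breakpoint.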
Data-dependent MS2 and MS3 experiments were carried out using a Finnigan LTQ Orbitrap coupled to a Surveyor Plus HPLC pump (Thermo Scientific, Bremen, Germany) and autosampler (Thermo Fisher, Bremen, Germany) in positive and negative ionization modes using a mass range of m/z 100-2000 and 30,000 resolution. The capillary temperature was 270 °C, the ion spray voltage was 4.5 kV, the capillary voltage 35 V, the tube lens voltage 110 V and the sheath and auxiliary gas flow rates were 50 and 15, respectively (units not specified by manufacturer). Multi-fragmentation (MSn) experiments were accomplished on an Orbitrap analyzer. Collision-induced dissociation (CID) was utilized with a normalized collision energy of 35%, an activation Q of 0.250 and an activation time of 30.000 ms, applied to the most intense, second most intense and third most intense ions for MS2 and MS3, at an isolation width of 3 m/z units with 5 microscans. Resolution was 15,000 m/Δm50%, while the minimum ion signal threshold was set to 500. Fragment mass tolerance for molecular formula detection was set at ±5 ppm.
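The ppm tolerances quoted in this section (±1.0 ppm with lock mass, ±3.0 ppm for the analyzer, ±5 ppm for fragments) all rest on the same relative mass error calculation. A minimal sketch follows; the function names are hypothetical, and the measured value in the usage lines is invented purely for illustration.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(measured_mz, theoretical_mz, tol_ppm=3.0):
    """True if the absolute mass error does not exceed tol_ppm."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm


def mz_window(theoretical_mz, tol_ppm=3.0):
    """Lower and upper m/z bounds corresponding to +/- tol_ppm."""
    delta = theoretical_mz * tol_ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


if __name__ == "__main__":
    lock_mass = 83.06037   # acetonitrile dimer lock mass quoted in the text
    observed = 83.06045    # illustrative reading, not a measured value
    print(f"error = {ppm_error(observed, lock_mass):.2f} ppm")
    print("within 3 ppm:", within_tolerance(observed, lock_mass))
    print("3 ppm window:", mz_window(lock_mass))
```

Because the tolerance is relative, the absolute window widens with m/z: ±3 ppm is about ±0.0003 at m/z 100 and about ±0.006 at m/z 2000.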

NMR Spectroscopy
Samples were prepared by dissolving 5 mg of bacterial extract (or culture medium extracts as controls) in 600 μL DMSO-d6 (Sigma-Aldrich, Dorset, UK). These were transferred to 5 mm 7″ NMR tubes (Sigma-Aldrich, Dorset, UK). NMR was carried out on a 400 MHz Jeol-LA400 FT-NMR spectrometer system equipped with a 40TH5AT/FG probe (JEOL, Tokyo, Japan). A presaturation sequence was included to suppress the DMSO solvent signal. For presaturation and proton experiments, sixteen scans were recorded, while eight scans were recorded for 2D 1H-1H correlation spectroscopy (COSY) analysis. Presaturation and COSY spectra were processed using MestReNova (Mnova 8.1.0) software (Mestrelab Research, Santiago de Compostela, Spain). Normalization, baseline correction with Whittaker Smoother, apodization with Gaussian 1 and smoothing with Savitzky-Golay were carried out in MestReNova. For COSY analysis, spectra from the bacterial extracts were overlaid with the corresponding medium spectrum (control) to distinguish correlations arising from metabolites produced by the bacteria from those of the culture medium.

Data Analysis Tools for Mass Spectrometry Data
Raw data were initially sliced into two data sets based on the ionization mode (positive and negative modes) using the MSConvert tool from ProteoWizard [59]. The sliced data sets were imported and processed in MZmine 2.10 [30] using predefined settings to extract features from the raw data. The following data processing steps were carried out using MZmine: peak detection (mass detection and chromatogram builder), deconvolution, deisotoping, filtering, alignment and gap filling. Identification of adducts and complexes and formula prediction steps were carried out to predict possible molecular formulae for each feature and to minimize mis-assignment of features by eliminating adducts and complexes (see Supplementary Information for full details of all settings and procedures utilized to process data in MZmine). Data were then exported as a CSV file for further clean-up.
An algorithm was applied to the molecular formula data sets from Antibase® (February 2013) and Marinlit® (September 2013). These versions are manually curated databases and the given molecular weights do not differentiate between monoisotopic, average, and most abundant masses. The monoisotopic exact mass of each metabolite was therefore calculated from its molecular formula for use in the customized library. The processed data from MZmine were incorporated into the customized library through the built-in Excel macro for peak identification and dereplication. “Hits” and unidentified peaks were double checked against the MS raw data in Xcalibur 2.2.
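Recalculating a monoisotopic exact mass from a molecular formula, as described above, amounts to summing the masses of the most abundant isotope of each element. The sketch below is an illustration in Python rather than the in-house Excel algorithm; the function name is hypothetical, only a handful of elements are tabulated, and the parser assumes simple formulae without brackets, charges or isotope labels.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
# Only a few elements common in natural products are tabulated here.
MONOISOTOPIC_MASS = {
    "C": 12.0000000,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
    "P": 30.9737620,
    "Cl": 34.9688527,
    "Br": 78.9183376,
}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral molecule from a formula such as 'C6H12O6'.

    ASSUMPTION: plain element/count formulae only (no brackets or charges).
    """
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the pattern could not account for completely.
    if "".join(el + n for el, n in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    mass = 0.0
    for el, n in tokens:
        if el not in MONOISOTOPIC_MASS:
            raise ValueError(f"no mass tabulated for element {el!r}")
        mass += MONOISOTOPIC_MASS[el] * (int(n) if n else 1)
    return mass


if __name__ == "__main__":
    for f in ("H2O", "C6H12O6"):
        print(f, round(monoisotopic_mass(f), 5))
```

The resulting neutral masses still need the appropriate adduct correction (for example, adding or removing a proton) before they can be compared with observed m/z values.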
Excel macros were written to enable the subtraction of background peaks and to combine positive and negative ionization mode data files generated by MZmine. Peaks originating from the culture medium were extracted. By applying an algorithm to calculate the intensity of each m/z in both bacterial extracts and medium extracts, ion peaks originating from the medium were subtracted while features with peak intensity 20 times greater in the samples than in the medium were retained. Bacterial extracts were grouped according to their culture media and this data clean-up step was carried out for each culture medium used. The positive and negative ionization mode data sets from each of the respective bacterial extracts were combined by the macro, enabling ion peaks that were observed in either or both positive and negative modes to be overlaid for further statistical analysis. The Excel macro was used to dereplicate each m/z ion peak against compounds in the customized database (using RT and an m/z threshold of ±3 ppm), which provided details on the putative identities of all metabolites in each bacterial extract and sorted the number of remaining unknowns for each extract. The macro was then utilized to identify the top 20 features (ranked by peak intensity) and corresponding putative identities in each sample by creating a list for each extract. Hits from the database were accessed using ChemBioFinder version 13 (PerkinElmer Informatics, Cambridge, UK). The data were then converted into a CSV file and exported to SIMCA-P V 13.0 (Umetrics, Umeå, Sweden), providing a feature ID number, ionization mode, m/z, retention time, possible molecular formulae and peak intensity for each feature in all 77 samples. The CSV file was also used to generate a heat map. Heat maps were plotted using the programming software R (version 2.15.2, x64) (R Foundation for Statistical Computing, Vienna, Austria) using a script utilizing the gplots package.
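The three clean-up steps performed by the macro (medium subtraction with a 20-fold intensity criterion, dereplication within ±3 ppm, and ranking of the top 20 features) can be sketched as follows. This is a Python illustration under stated assumptions, not the in-house Excel macro: the feature layout, the function names, the retention-time window `rt_tol` used to pair sample and medium peaks, and the choice to keep features absent from the medium are all hypothetical. Library entries are assumed to hold ion m/z values already corrected for the adduct, and matching here uses m/z only.

```python
def ppm_diff(mz_a, mz_b):
    """Absolute difference between two m/z values in ppm (relative to mz_b)."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def subtract_medium(sample, medium, fold=20.0, tol_ppm=3.0, rt_tol=0.5):
    """Keep sample features that are absent from the medium extract or more
    than `fold` times as intense as the matching medium feature."""
    kept = []
    for feat in sample:
        background = max(
            (m["intensity"] for m in medium
             if ppm_diff(feat["mz"], m["mz"]) <= tol_ppm
             and abs(feat["rt"] - m["rt"]) <= rt_tol),
            default=0.0,
        )
        if background == 0.0 or feat["intensity"] > fold * background:
            kept.append(feat)
    return kept


def dereplicate(features, library, tol_ppm=3.0):
    """Attach the names of all library entries within tol_ppm to each feature.
    `library` is a list of (name, ion_mz) pairs; an empty hit list marks an unknown."""
    return [
        {**feat, "hits": [name for name, mz in library
                          if ppm_diff(feat["mz"], mz) <= tol_ppm]}
        for feat in features
    ]


def top_features(features, n=20):
    """Return the n most intense features, most intense first."""
    return sorted(features, key=lambda f: f["intensity"], reverse=True)[:n]


if __name__ == "__main__":
    # Invented values for illustration only.
    sample = [
        {"mz": 300.1000, "rt": 10.0, "intensity": 1.0e6},
        {"mz": 250.2000, "rt": 5.0, "intensity": 5.0e4},
        {"mz": 410.3000, "rt": 12.0, "intensity": 2.0e5},
    ]
    medium = [
        {"mz": 300.1001, "rt": 10.1, "intensity": 1.0e4},
        {"mz": 250.2000, "rt": 5.0, "intensity": 4.0e4},
    ]
    library = [("hypothetical compound X", 410.3005)]
    cleaned = subtract_medium(sample, medium)
    for feat in top_features(dereplicate(cleaned, library)):
        print(feat)
```

Counting the features whose `hits` list is empty gives the number of unknowns per extract, which is the quantity the text uses to compare strains.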
The data set was further analyzed in SIMCA-P V 13.0 using the unsupervised statistical analysis method, principal component analysis (PCA). Dendrograms were also created using the SIMCA-P V 13.0 package (Umetrics, Umeå, Sweden).
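For readers without access to SIMCA-P, the idea behind the PCA step can be illustrated with a dependency-free sketch. The code below extracts principal components with the NIPALS iteration on a samples × features intensity matrix. It is not SIMCA-P's implementation; the unit-variance scaling is an assumption (the scaling actually used is not stated in this section), and the function names are hypothetical.

```python
import math


def autoscale(matrix):
    """Mean-center each column and scale it to unit variance.
    Constant columns are only centered."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


def nipals_pca(matrix, n_components=2, tol=1e-12, max_iter=1000):
    """Return (scores, loadings), one list per component, via NIPALS.
    `matrix` must already be centered (and scaled, if desired)."""
    x = [row[:] for row in matrix]
    n_cols = len(x[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        cols = list(zip(*x))
        start = max(range(n_cols), key=lambda j: sum(v * v for v in cols[j]))
        t = list(cols[start])
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: project the data onto the current score vector.
            p = [sum(row[j] * ti for row, ti in zip(x, t)) / tt
                 for j in range(n_cols)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Score: project the data onto the normalized loading.
            t_new = [sum(a * b for a, b in zip(row, p)) for row in x]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < tol:
                break
        # Deflate: remove the variation explained by this component.
        x = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(x, t)]
        scores.append(t)
        loadings.append(p)
    return scores, loadings


if __name__ == "__main__":
    # Invented intensities: rows are extracts, columns are features.
    data = [[1.0, 2.0, 1.00], [1.1, 2.1, 0.90], [0.9, 1.9, 1.10],
            [5.0, 6.0, 1.00], [5.1, 6.2, 0.95], [4.9, 5.9, 1.05]]
    scores, _ = nipals_pca(autoscale(data))
    for pc1, pc2 in zip(*scores):
        print(f"PC1 = {pc1:6.2f}  PC2 = {pc2:6.2f}")
```

In a scores plot of PC1 against PC2, extracts with similar chemical profiles cluster together, while chemically distinct extracts appear as outliers, which is the property used here for strain prioritization.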

Molecular Identification
The genomic DNA of each strain was extracted by scraping bacterial biomass, suspending it in 100 μL of sterile water and heating at 95 °C for 10 min, before cooling the lysate on ice and centrifuging at 13,000 rpm for 10 min. The supernatant containing genomic DNA was transferred into a new Eppendorf tube for 16S rRNA gene amplification. For some strains, the genomic DNA could not be extracted using the method described above. For these strains, the FastDNA spin kit (MP Biomedicals, Eschwege, Germany) was used to obtain the genomic DNA according to the manufacturer's protocol.
Nearly full-length 16S rRNA genes (1542 nucleotide bases) were amplified by polymerase chain reaction (PCR) using primers 27F and 1492R [60]. The reaction mixture consisted of 5 μL of 10× FastDigest green buffer including 20 mM MgCl2 (Fermentas, Vilnius, Lithuania), 1 μL of 10 mM dNTPs mixture (Fermentas, Vilnius, Lithuania), 1 μL of 25 mM of each primer (Sigma, München, Germany), 0.19 μL of 5 U/μL DreamTaq DNA polymerase (Thermo Scientific, Bremen, Germany), 1 μL of template DNA and 41.81 μL sterile water to make a final volume of 50 μL. The PCR was performed on a thermal cycler (Biometra, Goettingen, Germany) using the following thermal cycling protocol: initial denaturation at 95 °C for 2 min, followed by 34 cycles of denaturation at 95 °C for 0.5 min, primer annealing at 56 °C for 0.5 min and primer extension at 72 °C for 1.5 min, and a final primer extension at 72 °C for 10 min. The reaction was stopped by chilling at 16 °C to limit the polymerase activity. Five μL of PCR product was examined by agarose gel electrophoresis at 300 V for 20 min. An equal volume of 0.5 μg/μL GeneRuler 1 kb DNA ladder (Fermentas, Vilnius, Lithuania) was used as the reference. The successfully amplified 16S rRNA genes, which presented a clear single band around 1500 bases compared to the ladder under a Molecular Imager® Gel Doc™ XR System (Bio-Rad Laboratories, Berkeley, CA, USA), were purified using the NucleoSpin Gel and PCR Clean-up kit (MACHEREY-NAGEL, Düren, Germany) following the manufacturer's protocol. Amplification products showing more than one band were purified by excising the correct band and extracting it from the agarose gel using the NucleoSpin Gel and PCR Clean-up kit according to the manufacturer's protocol. The concentration of the purified 16S rRNA genes was determined using a NanoDrop 2000C Spectrophotometer (Thermo Scientific, Bremen, Germany) and adjusted to 30 μg/μL.
High quality 16S rRNA genes were sent to LGC Genomics GmbH (Berlin, Germany) for initial sequencing using forward primer 27F. Sequences of good quality (usually between 150 and 900 bases) were chosen and compared against the GenBank database [61] using the BLASTn tool to identify the nearest neighbour to the amplified sequence. Sequence identities of 98% and 95% were used as the thresholds for assigning sequences to the same species and the same genus, respectively.
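The two identity thresholds translate into a simple decision rule on the percent identity reported for the best BLASTn hit. The sketch below is illustrative only; treating the thresholds as inclusive is an assumption, and the function names and the helper for ungapped, pre-aligned sequences are hypothetical.

```python
SPECIES_THRESHOLD = 98.0  # % identity, from the text
GENUS_THRESHOLD = 95.0    # % identity, from the text


def percent_identity(aligned_query, aligned_subject):
    """Percent identity over two pre-aligned, equal-length sequences."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    matches = sum(a == b for a, b in zip(aligned_query, aligned_subject))
    return 100.0 * matches / len(aligned_query)


def taxonomic_level(identity_pct):
    """Level at which a query can be assigned to its nearest neighbour.
    ASSUMPTION: thresholds are inclusive."""
    if identity_pct >= SPECIES_THRESHOLD:
        return "species"
    if identity_pct >= GENUS_THRESHOLD:
        return "genus"
    return "unresolved"


if __name__ == "__main__":
    for pct in (99.4, 96.7, 91.0):
        print(pct, "->", taxonomic_level(pct))
```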

Bioassay Screening
Extracts were prepared as 10 mg/mL solutions in DMSO in 96-well plate format and delivered for bioassays on dry ice. Extracts were stored at −20 °C until use and then appropriately diluted for testing. Assay-specific thresholds were set to identify putative active hits; any sample that met the threshold was retested, and the activity of the primary hit was assessed over a concentration range. When possible, quantitative measurements of activity (e.g., IC50, Ki, MIC) were determined. A full description of assays is presented only for bioactivities observed in extracts from outlier strains.

Anti-Infectives
Bioassays against Trypanosoma brucei brucei (model assay system for trypanosomiasis) were carried out as previously described [62]. Bacterial extracts were dissolved in DMSO to prepare 10 mg/mL stock solutions. DMSO was used as the negative control at a concentration range of 1% to 0.002% and suramin was used as the positive control at a concentration range of 1 to 0.008 μM. The results were calculated as percentages of control values.
The in vitro antimicrobial testing against Enterococcus faecalis strain JH212 was carried out using the standard disk diffusion assay [63]. Sterile filter disks were impregnated with the bacterial extracts and placed on agar plates that had been inoculated with the pathogen. After incubation for 24 h, the antimicrobial potential was quantitatively assessed from the diameter of the inhibition zone.

Metabolic Disease and Inflammation
In the search for potential drugs against metabolic disorders involving regulation of glucose metabolism, particularly diabetes mellitus and obesity, samples were tested in a protein-tyrosine phosphatase 1B (PTP1B) assay. Samples were tested at 30 μg/mL in duplicate.

Cell-based Functional Assays
Samples were tested at 30 μg/mL in quadruplicate in 384-well plate format. A Z factor value >0.4 was used to establish primary hits in the initial screen. Cell-based functional assays were carried out on ion channels encoded by the TRPA1 and TRPV1 (pain) and TRPM8 (pain, cancer) genes, whereas the PPARα (inflammation, diabetes, metabolic disorders and atherosclerosis) assay targets a nuclear hormone receptor. The fluorescence readouts for TRPA1, TRPV1 and TRPM8 were measured with a Ca2+-sensitive dye, based on Molecular Devices™ [64], while the activity on PPARα was measured as luminescence from GAL4-UAS luciferase.
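The Z factor mentioned above is commonly computed from the means and standard deviations of positive and negative control wells as 1 − 3(σp + σn)/|μp − μn|. A minimal sketch of that standard calculation is given below; how exactly the >0.4 criterion was applied in these assays is not detailed in the text, so the plate-level use shown here is an assumption, and the control readouts are invented for illustration.

```python
import statistics


def z_factor(positive, negative):
    """Z factor from replicate control readouts:
    1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    mu_p, mu_n = statistics.mean(positive), statistics.mean(negative)
    sd_p, sd_n = statistics.stdev(positive), statistics.stdev(negative)
    if mu_p == mu_n:
        raise ValueError("control means are identical; Z factor is undefined")
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)


if __name__ == "__main__":
    positive_wells = [100.0, 102.0, 98.0, 100.0]  # invented readouts
    negative_wells = [10.0, 12.0, 8.0, 10.0]      # invented readouts
    z = z_factor(positive_wells, negative_wells)
    print(f"Z = {z:.3f}; passes 0.4 criterion: {z > 0.4}")
```

Values approaching 1 indicate a wide separation between controls relative to their variability, so a cut-off of 0.4 accepts only assays with a reasonably clear signal window.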

Conclusions
LC-HRMS and multivariate analysis by principal component analysis (PCA) were used to successfully compare the secondary metabolite profiles of crude extracts from 77 marine invertebrate-associated bacterial symbionts. PCA was shown to be an effective tool to differentiate bacterial strains based on their chemical diversity and novelty of metabolites, providing a means to select bacterial isolates with diverse chemistry without having to carry out full isolation work on each extract. In the PCA, bacterial species producing similar chemical groups of metabolites grouped together, whilst those producing distinct secondary metabolomes were observed as outliers. By using an Exactive mass spectrometer, which enabled fast polarity switching, it was possible to obtain greater metabolite coverage efficiently in a single experiment, greatly shortening analysis times. The development of a comprehensive metabolomics workflow including an in-house developed Excel macro embedded with the AntiMarin database made it possible to rapidly dereplicate the 77 strains, providing putative identities of known metabolites in each extract. It was also possible to calculate the number of unknowns in each extract and to produce data files ranking the “top 20 metabolite hits” (ranked by peak intensity) from each strain. This Excel macro also removed peaks associated with the culture medium, making it possible to compare bacterial strains cultured on different types of growth medium, and provided data output for statistical analysis. 1H NMR and 2D-COSY data were also utilized to confirm the dereplication results obtained from the LC-HRMS data. Additionally, we have shown through PCA and heat map analysis that strains with nearly identical 16S rRNA sequences do not necessarily produce the same secondary metabolites.
It is also shown that the dereplication results can be correlated with bioassay screening results to support drug discovery efforts, with the objective of finding a bacterial isolate that both has unique, diverse chemistry and is biologically active. Our approach is to use high resolution MS and NMR in parallel to efficiently obtain and confirm the dereplication results. Overall, this shows that metabolomics approaches are worthwhile for the selection of strains for the isolation of novel natural products and that this methodology has the potential to reduce redundancy in drug discovery programs.