Integrated Omics Strategy Reveals Cyclic Lipopeptides Empedopeptins from Massilia sp. YMA4 and Their Biosynthetic Pathway

Empedopeptins—eight amino acid cyclic lipopeptides—are calcium-dependent antibiotics that act against Gram-positive bacteria such as Staphylococcus aureus by inhibiting cell wall biosynthesis. However, to date, the biosynthetic mechanism of the empedopeptins has not been well identified. Through comparative genomics and metabolomics analysis, we identified empedopeptin and its new analogs from a marine bacterium, Massilia sp. YMA4. We then unveiled the empedopeptin biosynthetic gene cluster. The core nonribosomal peptide gene null-mutant strains (ΔempC, ΔempD, and ΔempE) could not produce empedopeptin, while dioxygenase gene null-mutant strains (ΔempA and ΔempB) produced several unique empedopeptin analogs. However, the antibiotic activity of ΔempA and ΔempB was significantly reduced compared with the wild-type, demonstrating that the hydroxylated amino acid residues of empedopeptin and its analogs are important to their antibiotic activity. Furthermore, we found seven bacterial strains that could produce empedopeptin-like cyclic lipopeptides using a genome mining approach. In summary, this study demonstrated that an integrated omics strategy can facilitate the discovery of potential bioactive metabolites from microbial sources without further isolation and purification.


Introduction
The production of secondary metabolites is controlled by several genes that are responsible for modifying their chemical structure, transporting substrates and products, and performing specific regulatory functions. These genes, termed secondary metabolite biosynthesis gene clusters (BGCs), are arranged contiguously in the genome [1]. Advances in computational science and bioinformatic analysis now make it possible to dissect secondary metabolite biosynthetic pathways in microorganisms [2]. Comparative genomics analysis and BGC mining have been applied in natural product studies to discover new sources of bioactive natural products [3]. In addition, several mass spectrometry-based metabolomics approaches, such as Global Natural Products Social Molecular Networking (GNPS), have been used to monitor the production of specific molecules [4]. Molecular networking groups similar molecules into clusters by comparing vector correlations between the tandem mass fragment ions of molecules [5]. Thus, integrated BGC mining and molecular networking analysis is well suited to the efficient dereplication and identification of specific metabolites, and can even be used to propose new analog structures.
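As an illustration of the spectral comparison that underlies molecular networking, the following minimal Python sketch scores two tandem mass spectra by cosine similarity. It is a simplification and not the GNPS implementation: the function name, the 0.02 Da matching tolerance, and the example peak lists are hypothetical choices made here for illustration.

```python
import math


def cosine_score(spec_a, spec_b, tol=0.02):
    """Cosine similarity between two spectra given as lists of (m/z, intensity).

    Peaks are matched greedily (highest intensity product first) when their
    m/z values differ by at most `tol`; each peak is used at most once.
    """
    def normalize(spec):
        norm = math.sqrt(sum(i * i for _, i in spec))
        if norm == 0:
            return []
        return [(mz, i / norm) for mz, i in spec]

    a, b = normalize(spec_a), normalize(spec_b)

    # All candidate peak pairs within the m/z tolerance.
    candidates = [
        (ia * ib, x, y)
        for x, (mza, ia) in enumerate(a)
        for y, (mzb, ib) in enumerate(b)
        if abs(mza - mzb) <= tol
    ]
    candidates.sort(reverse=True)

    used_a, used_b, score = set(), set(), 0.0
    for product, x, y in candidates:
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            score += product
    return score


# Hypothetical fragment spectra, for illustration only.
spectrum_1 = [(155.08, 40.0), (252.13, 100.0), (349.18, 65.0)]
spectrum_2 = [(155.08, 35.0), (252.13, 90.0), (365.18, 70.0)]
print(round(cosine_score(spectrum_1, spectrum_2), 3))
```

In a molecular network, each spectrum is a node, and an edge is drawn between two nodes when a score of this kind exceeds a chosen threshold.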
The biodiversity of marine microorganisms is exceptionally high and contributes to the vast chemical diversity seen among their related metabolites [6]. The chemical diversity of metabolites might play vital roles in their bioactivities. One of the most significant applications of natural products is as antibiotics. As reported previously, approximately 80 natural product-related antibiotics were approved by the FDA from 1981 to 2014 [7]. However, as is now well known, existing antibiotics cannot effectively treat infections caused by resistant bacterial strains. Therefore, seeking new antibiotics from natural products is a priority. Staphylococcus aureus is a commensal bacterium that colonizes approximately 30% of the human population. It is also a major human pathogen and the leading cause of bacteremia, infective endocarditis, and other infections in the clinic [8]. Therefore, screening natural resources for effective agents against S. aureus infection is a potentially worthwhile challenge.
In this study, we first applied comparative genomics analysis and GNPS to reveal the biosynthetic gene cluster and new analogs of empedopeptin from the marine-derived bacterium Massilia sp. YMA4. We further confirmed that empedopeptins are the main antibiotics used by Massilia sp. YMA4 against S. aureus by insertional mutation of the biosynthetic genes. We then used genome mining to explore new bacterial sources of empedopeptins and related cyclic lipopeptides. Such an integrated approach holds promise for discovering new natural sources of antibiotics and their production.

Antibiotic Activity of Massilia sp. YMA4
Massilia, a Gram-negative bacterial genus, was first found in clinical blood samples in 1998 [9] and was subsequently isolated from various environmental samples [10]. Some strains of the genus, such as Massilia sp. BS-1 and Massilia sp. NR 4-1, were reported to be antibiotic producers [11,12]. However, knowledge about potential natural antibiotics related to nonribosomal peptides in Massilia remains scarce. To screen for new antibiotics, a strain named Massilia sp. YMA4 was isolated from a sediment sample collected in the open ocean off Lamay Island in Taiwan. The antagonist assay revealed that Massilia sp. YMA4 effectively inhibited the growth of S. aureus ATCC 29213 on yeast malt agar (YMA), but not on tryptone soy agar (TSA) (Figure 1). To further understand the phylogenetic relationships of Massilia sp. YMA4 to other sequenced Massilia strains, the whole-genome sequences of 69 available Massilia strains deposited in the NCBI database (2020/02) were collected and analyzed using the maximum likelihood method. Massilia sp. YMA4 clustered with M. armeniaca ZMN-3 (Figure S1). Interestingly, Massilia sp. YMA4 and M. armeniaca ZMN-3 were close to Empedobacter haloabium ATCC 31962 according to BLAST analysis using the partial 16S rRNA gene [13]. E. haloabium ATCC 31962 has also been reported to have antimicrobial potential against Gram-positive bacteria [14].

The Biosynthetic Potential of Secondary Metabolites of the Massilia Strains and the Discovery of the Empedopeptin Biosynthesis Gene Cluster
A computational analysis of the secondary metabolite biosynthetic capacity of 72 Massilia strains was conducted using antiSMASH 5.1, MIBiG, and BiG-SCAPE [15,16]. The results revealed a high diversity of BGCs in Massilia (Figure 2a). In the 72 Massilia strains, 490 BGCs were identified, and the homologous BGCs were further grouped into 233 gene cluster families (GCFs). Interestingly, 75 nonribosomal peptide synthetase (NRPS) BGCs were predicted. However, only four Massilia strains, Massilia sp. NR 4-1 (6 NRPS BGCs), M. violaceinigra B2 (5 NRPS BGCs), Massilia sp. YMA4 (4 NRPS BGCs), and M. albidiflava DSM17472 (3 NRPS BGCs), contained three or more NRPS BGCs (Figure 2b). Twelve putative BGCs were predicted in the Massilia sp. YMA4 genome, including four NRPS BGCs (Table S1). Among them, BGC 6 comprised three NRPS-related genes, empC (two modules, 8.2 kb), empD (one module, 3.3 kb), and empE (five modules, 15.9 kb), and two additional genes, empA (dioxygenase, 0.9 kb) and empB (dioxygenase, 0.9 kb) (Figure 3). Each module of the empC, empD, and empE genes comprised one condensation (C), one adenylation (A), and one peptidyl carrier protein (PCP) domain. In addition, two thioesterase (TE) domains were located in empC (Figure 3). The A domain in each module was further analyzed using antiSMASH 5.1 and NRPSPredictor2 [15,17], which suggested that the amino acid sequence of the BGC 6 product is Pro1-Ser2-Pro3-Arg4-Asp5-Ser6-Pro7-Asp8. Additionally, a lipo-initiation (CStarter) domain in module 1 of empE was revealed by antiSMASH 5.1 and NaPDoS [18]. This domain is similar to the CStarter of syringomycin, indicating that it can catalyze the acylation of the first amino acid [19]. The combination of a polar peptide and a fatty acid tail is a key feature of cyclic lipopeptides and is responsible for their amphiphilic properties [20].
The amphiphilic properties of lipopeptides give them great potential for pharmaceutical use, including drug delivery by formulating active pharmaceutical ingredients in liposomes [21]. The remaining C domains were further analyzed. Among these C domains, those of modules 2, 3, 6, and 7 were predicted to be combined C/E (epimerization) domains, while the other C domains are conventional LCL domains. The E domain contributes to the stereochemistry of amino acid residues in the NRPS assembly line by converting the L-form amino acid to the D-form amino acid (LCD) [22]. Therefore, Pro1, Ser2, Asp5, and Ser6 are in the D configuration, while Pro3, Arg4, Pro7, and Asp8 are in the L configuration. Interestingly, most NRPS BGCs have one TE domain in the last module, but our results revealed two TE domains in BGC 6. Two TE domains in the last module have also been found in some cyclic lipopeptide BGCs, such as those of orfamide A, viscosin, and massetolide produced by Pseudomonas fluorescens strains [23]. There are two types of TE domains, type I and type II, which have different functions. The type I TE domain can catalyze intramolecular cyclization, while the type II TE domain plays a supporting role in the NRPS assembly line [24]. Taken together, the results of the bioinformatic analysis indicated that the structure produced by BGC 6 is 3OH-FA-D-Pro1/D-Ser2/L-Pro3/L-Arg4/D-Asp5/D-Ser6/L-Pro7/L-Asp8, which is the same as the core structure of empedopeptin (Figure 3).
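The stereochemical assignment above can be reproduced with a short Python sketch. It encodes one working assumption, consistent with the assignment reported here: a combined C/E domain in module n epimerizes the residue loaded by the preceding module (n - 1). The function names are hypothetical; the residue order and the C/E-containing modules are taken from the text.

```python
# Residue order predicted from the A domains (modules 1 to 8).
RESIDUES = ["Pro", "Ser", "Pro", "Arg", "Asp", "Ser", "Pro", "Asp"]

# Modules whose C domain was predicted as a combined C/E domain.
DUAL_CE_MODULES = {2, 3, 6, 7}


def predict_configurations(residues, dual_ce_modules):
    """Return 'D' or 'L' for each residue position (1-based order preserved).

    Assumption: a C/E domain in module n acts on the residue from module n - 1,
    so residue p is D when module p + 1 carries a C/E domain.
    """
    return [
        "D" if (position + 1) in dual_ce_modules else "L"
        for position in range(1, len(residues) + 1)
    ]


def format_peptide(residues, configurations):
    """Format the peptide as e.g. 'D-Pro1/D-Ser2/...'."""
    return "/".join(
        f"{config}-{residue}{index}"
        for index, (residue, config) in enumerate(zip(residues, configurations), start=1)
    )


configs = predict_configurations(RESIDUES, DUAL_CE_MODULES)
print("3OH-FA-" + format_peptide(RESIDUES, configs))
```

Under this assumption the sketch prints the same sequence as stated in the text, with D-configured residues at positions 1, 2, 5, and 6.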
To further understand the biosynthetic pathway of empedopeptin, we constructed empA, empB, empC, empD, and empE null-mutant strains. The ∆empC, ∆empD, and ∆empE mutant strains could not inhibit the growth of S. aureus ATCC 29213, and the ∆empA and ∆empB strains showed significantly reduced inhibition compared to the wild-type (Figure 5). We then conducted a comparative metabolomic analysis (molecular networking) of the wild-type and null-mutant strains to elucidate the structures and biosynthetic pathways of empedopeptin and its analogs. We identified 44 empedopeptin analogs, including 17 cyclic lipopeptides and 27 linear lipopeptides (Figure 6). As shown in Figure 6, we observed empedopeptin analogs only from the wild-type, ∆empA, and ∆empB strains (dioxygenase genes), suggesting that the ∆empC, ∆empD, and ∆empE strains (core NRPS genes) were unable to produce empedopeptins. Additionally, we found several unique analogs that appear only in the molecular networking analysis of the ∆empA and ∆empB strains. Among the lipopeptides containing eight amino acids, m/z 1094.56 and 1122.6 were found in both the ∆empA and ∆empB strains, while m/z 1078.57, 1104.59, 1106.6, 1120.58, and 1136.57 were observed only in the ∆empB strain. Comparing the structures of m/z 1094.56 and 1122.6 from the ∆empA and ∆empB strains showed that they share the same amino acid sequence as empedopeptin but differ in which amino acids are hydroxylated and in the chain length of the 3-hydroxy fatty acid. Asp5, Pro7, and Asp8 residues were all hydroxylated in empedopeptin. We found cyclic lipopeptides containing OH-Asp5 together with unmodified Pro7 and Asp8 in the ∆empA strain, while cyclic lipopeptides with OH-Asp8, or even with no hydroxylated residues, were identified only in the ∆empB strain (Table S2).
These results demonstrated that the dioxygenase genes, empA and empB, located at the end of the empedopeptin biosynthetic assembly line, functioned in hydroxylation and played an important role in the bioactivity of cyclic lipopeptides. In addition to the cyclic lipopeptides, we also identified several linear lipopeptides, which might be the biosynthetic intermediates of empedopeptins (Figure 6). The structure information of these empedopeptin analogs is summarized in Tables S2 and S3.
Figure 6 (caption fragment). [...] and null-mutant strains. 6 AAs: 6 amino acids lipopeptides; 7 AAs: 7 amino acids lipopeptides; and 8 AAs: 8 amino acids lipopeptides. Each node represents one lipopeptide tandem mass spectrum, and the width of the edge indicates the tandem mass spectral similarity of neighboring nodes. The shape of the node represents linear or cyclic lipopeptide. The color of the node represents the source of the lipopeptide. Multiple represents the analog was detected from more than one strain.
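The relationships among the eight-residue analogs can be checked arithmetically from the reported m/z values. The Python sketch below labels pairwise m/z differences with common mass shifts. It rests on assumptions that are not stated in the text: that the ions are singly charged, that a 0.02 tolerance is appropriate for values reported to two decimals, and that the short list of candidate shifts is sufficient. The labels are arithmetic matches only, not structural assignments.

```python
from itertools import combinations

# m/z values of the eight-residue lipopeptides reported for the mutant strains.
MZ_VALUES = [1078.57, 1094.56, 1104.59, 1106.6, 1120.58, 1122.6, 1136.57]

# Monoisotopic mass shifts (Da) considered in this sketch.
SHIFTS = {
    "O": 15.9949,      # one oxygen, e.g. a hydroxylation
    "CH2": 14.0157,    # one methylene unit
    "C2H4": 28.0313,   # two methylene units, e.g. a longer fatty acid chain
    "H2": 2.0157,      # one degree of unsaturation
}


def annotate_delta(mz_low, mz_high, tol=0.02):
    """Return the names of all shifts matching the m/z difference within tol."""
    delta = mz_high - mz_low
    return [name for name, mass in SHIFTS.items() if abs(delta - mass) <= tol]


def annotate_all(mz_values, tol=0.02):
    """Yield (low, high, labels) for every pair with at least one matching shift."""
    for low, high in combinations(sorted(mz_values), 2):
        labels = annotate_delta(low, high, tol)
        if labels:
            yield low, high, labels


for low, high, labels in annotate_all(MZ_VALUES):
    print(f"{low} -> {high}: {', '.join(labels)}")
```

For example, m/z 1078.57 and 1094.56 differ by the mass of one oxygen, and m/z 1094.56 and 1122.6 differ by C2H4, in line with the differences in hydroxylation and fatty acid chain length described above.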

Genome Mining for Discovering Other Empedopeptin-Like Compound Producing Bacteria
Through MultiGeneBlast analysis, we found that several microorganisms may produce cyclic lipopeptides similar to empedopeptins because similar NRPS BGCs were observed. The adenylation domain selects and recruits specific amino acids in the NRPS biosynthetic pathway [25]. Therefore, to propose the structures of these cyclic lipopeptides, all the adenylation domains in the empedopeptin-related BGCs were analyzed phylogenetically (Figure 7a). Five types of amino acids, arginine, aspartic acid, proline, serine, and threonine, were involved in those BGCs. As shown in Figure 7b, Duganella sacchari has the same NRPS BGC for producing empedopeptin-like compounds. In contrast, Collimonas and Variovorax strains were proposed to be producers of tripropeptin-like compounds [26].
In the present study, we aimed to explore the antibiotic substances from the marine bacterium Massilia sp. YMA4 by integrating comparative genomics and metabolomics analysis. Our results revealed that the primary active substance of Massilia sp. YMA4 is a nonribosomal peptide, empedopeptin. We found that Massilia sp. YMA4 inhibited S. aureus ATCC 29213 only under certain culture conditions, suggesting that the empedopeptin biosynthetic gene cluster might be a silent gene cluster that is activated by changing culture conditions. This result is consistent with a previous study demonstrating that microbial secondary metabolite synthesis is influenced by the type and concentration of nutrients in the culture media [27].
Empedopeptin is a calcium-dependent antibiotic that acts against Gram-positive bacteria by inhibiting cell wall biosynthesis [28]. Certain calcium-dependent antibiotics contain a conserved Asp-X-Asp-Gly motif that is thought to facilitate calcium binding [29]. This study confirmed that two dioxygenase genes, empA and empB, contributed to the hydroxylation of the amino acid residues in empedopeptin. The ∆empA and ∆empB strains showed significantly reduced inhibition compared to the wild-type, which implies that the hydroxylated amino acid residues are important to antibiotic activity. We speculate that the two dioxygenases work synergistically to hydroxylate the amino acid residues of empedopeptin. However, it is challenging to validate the hydroxylation selectivity of these two dioxygenases for specific amino acid residues or peptide sequences. Some bacteria were found to possess cyclic lipopeptide BGCs similar to that of empedopeptin, although the amino acid residues of their predicted cyclic lipopeptides differ. As shown in Figure 7b, the first and second residues of the cyclic lipopeptides of Collimonas and Variovorax strains are threonine and proline, respectively, while the same positions in Massilia sp. YMA4 and D. sacchari are proline and serine, respectively. Surprisingly, we did not find an empedopeptin-related BGC in the empedopeptin-producing bacterium E. haloabium ATCC 31962 through MultiGeneBlast analysis [13,30]. Therefore, we speculate that the whole genome of E. haloabium ATCC 31962 might not be well sequenced.

DNA Sample Preparation, Whole-Genome Sequencing, Assembly, and Annotation
Genomic DNA of Massilia sp. YMA4 was extracted from cultures grown at 30 °C in YMB medium using a genomic DNA purification kit (Qiagen, Hilden, Germany) following the manufacturer's instructions. The genomic DNA (total 20 µg) was sequenced by the PacBio RS II system (Pacific Biosciences, Menlo Park, CA, USA). A 10-kb SMRTbell library was generated using a DNA Template Prep Kit 2.0 (10-20 kb; Pacific Biosciences), following the manufacturer's instructions. Sequencing was performed with a PacBio RS II sequencer using one SMRT cell and P6-C4 chemistry at 360 min movie length (Pacific Biosciences, Menlo Park, CA, USA). Single-molecule real-time reads (159,840 filtered subreads with a mean length of 11,850 bp) were de novo assembled using the Hierarchical Genome Assembly Process (HGAP) workflow in the SMRT analysis software version 3 (Pacific Biosciences Inc., Menlo Park, CA, USA) [31]. Genome annotation was performed via the Rapid Annotations using Subsystems Technology (RAST) server version 2.0 and Blast2GO [32,33].
The whole-genome sequence of Massilia sp. YMA4 was uploaded to the GenBank database (GenBank assembly accession: GCA_003293715.1).

Genome Mining and Bioinformatic Analysis
The whole-genome sequence of Massilia sp. YMA4 was analyzed by antiSMASH 5.1 to mine possible secondary metabolite biosynthetic gene clusters [15]. A phylogenetic tree based on assembled genome sequencing data of the genus Massilia from NCBI, including Massilia sp. YMA4, was constructed using the reference alignment-based phylogenetic builder (REALPHY) via bowtie2 [34]. The phylogeny was estimated from the "polymorphisms_move.phy" file produced by REALPHY via RAxML with the general time reversible (GTR) nucleotide substitution model and the GAMMA model of rate heterogeneity [35]. Finally, RAxML was carried out with 1000 alternative runs on distinct starting trees (N) and a search for the best-scoring ML tree with 1000 replications. The phylogenetic tree was visualized using iTOL [36]. All BGCs associated with the genus Massilia, a total of 490 BGCs in 233 gene cluster families, were downloaded and used for the sequence similarity network. The BiG-SCAPE-CORASON pipeline was run locally to analyze the 490 BGCs downloaded from the antiSMASH database (March 2019) [15,16]. The singleton parameter in the sequence similarity network was the BGCs with distances lower than the default cutoff distance of 0.3. The generated sequence similarity network files, separated by BiG-SCAPE class, were combined for visualization using Cytoscape version 3.7.1 [37].
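To illustrate how a distance cutoff turns pairwise BGC distances into a sequence similarity network, the following Python sketch links BGCs whose distance is below 0.3 and reports the resulting groups, with unlinked BGCs appearing as singletons. This is a deliberately simplified stand-in (plain connected components) and not the BiG-SCAPE algorithm, whose family assignment is more involved; the BGC names and distances are invented for illustration.

```python
def group_by_cutoff(names, distances, cutoff=0.3):
    """Group items into connected components of the graph whose edges are
    the pairs with distance below `cutoff`.

    names:     list of item identifiers
    distances: dict mapping (name_a, name_b) -> distance in [0, 1]
    Returns a list of groups, largest first; singletons have length 1.
    """
    parent = {name: name for name in names}

    def find(item):
        # Follow parent links to the root, compressing the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), distance in distances.items():
        if distance < cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=lambda group: (-len(group), group))


# Hypothetical BGC identifiers and pairwise distances.
bgcs = ["bgc1", "bgc2", "bgc3", "bgc4"]
pairwise = {
    ("bgc1", "bgc2"): 0.10,
    ("bgc2", "bgc3"): 0.25,
    ("bgc3", "bgc4"): 0.80,
}
for group in group_by_cutoff(bgcs, pairwise):
    label = "singleton" if len(group) == 1 else "group"
    print(label, group)
```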

Construction of Empedopeptin Biosynthetic Gene-Null Mutant Strains
To generate insertion mutations in the empedopeptin biosynthetic genes empA, empB, empC, empD, and empE of Massilia sp. YMA4, the plasmids pCM184-∆empA, pCM184-∆empB, pCM184-∆empC, pCM184-∆empD, and pCM184-∆empE were constructed. A schematic diagram of the insertion mutants is shown in Figure S3. pCM184 was purchased from Addgene (MA, USA). The regions for homologous recombination in Massilia sp. YMA4 were amplified with the primer pairs listed in Tables S5 and S6. The PCR products were digested with KpnI and SacI and cloned into the KpnI/SacI site of the digested pCM184, followed by transformation into E. coli S17-1 λpir. E. coli S17-1 was used as the donor to transfer pCM184-∆empA, pCM184-∆empB, pCM184-∆empC, pCM184-∆empD, and pCM184-∆empE individually into Massilia sp. YMA4. Massilia sp. YMA4 and E. coli S17-1 were cultured overnight in YMB at 30 °C and in LB at 37 °C, respectively. Both cultures were then washed twice with YMB, and the pellets were resuspended in YMB. The OD600 values of the donor (E. coli S17-1) and recipient (Massilia sp. YMA4) cells were 2.0, and the donor and recipient cells were then mixed at a 1:1 (v:v) ratio. The mixtures were spotted on YMA and incubated overnight. The colonies were scraped up and spread on YMA with tetracycline (2 µg/mL) and kanamycin (50 µg/mL), and the transconjugants were picked and checked by PCR after 48 h. The transconjugants could grow only if the plasmid was integrated into the Massilia sp. YMA4 genomic DNA. The insertion sites were confirmed by PCR (Figure S4).

Quantitative Reverse Transcription PCR (qRT-PCR)
To evaluate the expression of the emp biosynthetic genes of Massilia sp. YMA4 under TSA and YMA culture conditions, empA, empB, empC, empD, and empE were analyzed by qRT-PCR. The primer sets used for qRT-PCR are listed in Table S5. The qPCR reaction consisted of 10 µL of SYBR Green Master Mix (BioTools, New Taipei City, Taiwan), 0.4 µL each of the forward and reverse primers (100 µM), 2 µL of cDNA template, and 7.2 µL of ddH2O for a total volume of 20 µL. The reactions were amplified with the following thermocycling steps: 95 °C for 5 min; 39 cycles of 95 °C for 10 s (denaturation), 60 °C for 5 s (annealing), and 72 °C for 30 s (extension); followed by melting curve analysis (95 °C for 15 s, 60 °C for 5 s, then an increase of 0.5 °C/s up to 95 °C). The qRT-PCR analysis was performed in a CFX96™ Real-Time PCR Detection System with a C1000™ thermal cycler (Bio-Rad, Hercules, CA, USA) with five replicates. The results were analyzed using the CFX Manager software, and the delta CT (threshold cycle) was used to determine the relative expression level of the genes. All target genes were normalized to the bacterial 16S gene (internal control).
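The delta CT normalization described above can be sketched in a few lines of Python. The sketch assumes an amplification efficiency of 100% (a doubling per cycle), which the text does not state, and uses invented CT values for five replicates; the function names are hypothetical, and the fold-change step (often called the 2^-ΔΔCT method) is shown only as one common way to compare two conditions.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference):
    """Relative expression of a target gene normalized to a reference gene.

    delta CT = mean CT(target) - mean CT(reference); assuming one doubling
    per cycle, the relative expression is 2 ** (-delta CT).
    """
    delta_ct = mean(ct_target) - mean(ct_reference)
    return 2 ** (-delta_ct)


def fold_change(expression_condition, expression_control):
    """Ratio of normalized expression between two culture conditions."""
    return expression_condition / expression_control


# Hypothetical CT values (five replicates each), for illustration only.
emp_gene_yma = [22.1, 22.3, 21.9, 22.0, 22.2]
emp_gene_tsa = [25.0, 25.2, 24.9, 25.1, 24.8]
ref_16s_yma = [12.0, 12.1, 11.9, 12.0, 12.0]
ref_16s_tsa = [12.1, 12.0, 12.0, 11.9, 12.0]

expr_yma = relative_expression(emp_gene_yma, ref_16s_yma)
expr_tsa = relative_expression(emp_gene_tsa, ref_16s_tsa)
print(f"YMA vs TSA fold change: {fold_change(expr_yma, expr_tsa):.1f}")
```

A lower CT means more template, so a smaller delta CT under one condition corresponds to higher normalized expression under that condition.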

Comparative Metabolomics and Molecular Networking Analysis
For comparative metabolomics, Massilia sp. YMA4 and the mutants were cultured on YMA until colony formation. A single colony of Massilia sp. YMA4 was picked and transferred to 3 mL YMB for overnight incubation. For the mutants, a single colony was transferred to 3 mL YMB with 0.1% tetracycline. After overnight incubation, 0.5 mL of the Massilia sp. YMA4 or mutant cultures was transferred to 500 mL YMB with or without 500 µg tetracycline and cultured for three days. The resulting culture broth was extracted with EtOAc. The remaining broth was subsequently extracted with BuOH to obtain BuOH extracts. The BuOH extracts of each strain were dissolved in MeOH (10 mg/mL). Then, 10 µL of each extract was injected and separated on a C18 column (ACQUITY UPLC BEH C18, 1. [...] The flow rate was set at 0.4 mL/min. The mass data were acquired using UPLC-ESIMS (Thermo Scientific Orbitrap Elite Mass Spectrometer), and the mass range was set to m/z 100-1500. For molecular networking analysis, the mass data were acquired in profile mode with positive-mode ion detection between m/z 100 and 1500 at 30,000 resolution. The four most intense ions from each full mass scan were selected for collision-induced dissociation (CID) fragmentation. For CID, the isolation width was 2 Da, and the selected ions were fragmented with a normalized collision energy of 30.0, activation Q of 0.250, activation time of 10.0, and 15,000 resolution. The mass data (.RAW files) from Xcalibur were converted to mzML file format using ProteoWizard software and submitted to GNPS (https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp, accessed on 20 March 2019) to generate the molecular network, and the data were visualized in Cytoscape. The LC-MS/MS data of the BuOH extracts of Massilia sp. YMA4 and its mutant strains (MSV000086803) are publicly available via MassIVE (https://massive.ucsd.edu, accessed on 3 February 2021).

Conclusions
In summary, we explored the cyclic lipopeptide, empedopeptin, and its analogs from Massilia sp. YMA4 using an integrated omics approach. To elucidate the biosynthetic mechanism of empedopeptin, we constructed five empedopeptin biosynthetic gene null-mutants of Massilia sp. YMA4. The antibiotic activities of Massilia sp. YMA4 mutants were decreased in comparison with the wild-type, demonstrating that empedopeptins were the primary antibiotic substances of Massilia sp. YMA4. In the comparative metabolomics analysis of wild-type and null-mutant strains, we successfully identified 44 empedopeptin analogs, including 17 cyclic lipopeptides and 27 linear lipopeptides. Through MultiGeneBlast analysis, we were also able to survey the similar BGCs of empedopeptin from other microbes, including D. sacchari, three Collimonas, and three Variovorax strains. Our findings illustrated that this integrated omics strategy is suitable to identify and develop potential bioactive metabolites from microbial sources without further isolation and purification.