Genome Mining and Screening for Secondary Metabolite Production in the Endophytic Fungus Dactylonectria alcacerensis CT-6

Endophytic fungi are a treasure trove of natural products with great chemical diversity that is largely unexploited. As an alternative to the traditional bioactivity-guided screening approach, genome mining provides a new methodology for obtaining novel natural products from endophytes. In our study, the whole genome of an endophyte, Dactylonectria alcacerensis CT-6, was obtained for the first time. Genomic analysis indicated that D. alcacerensis CT-6 has a 61.8 Mb genome with a G+C content of 49.86%. Gene annotation was carried out extensively using various BLAST databases. Genome collinearity analysis revealed that D. alcacerensis CT-6 has high homology with three other strains of the Dactylonectria genus. AntiSMASH analysis identified 45 secondary metabolite biosynthetic gene clusters (BGCs) in D. alcacerensis CT-6, most of which were unknown and have yet to be characterized. Furthermore, only six known substances were isolated from the fermented products of D. alcacerensis CT-6, suggesting that a great number of cryptic BGCs in D. alcacerensis CT-6 are silent and/or expressed at low levels under conventional conditions. Therefore, our study provides an important basis for further chemical study of D. alcacerensis CT-6 using the genome-mining strategy to awaken these cryptic BGCs for the production of bioactive secondary metabolites.


Introduction
Endophytic fungi are commonly defined as fungi that colonize healthy plant tissues inter- and/or intracellularly without causing apparent disease symptoms in the host plants [1]. It has been widely reported that endophytic fungi can aid in the defense of their host plants [2]. More importantly, endophytic fungi are able to biosynthesize a variety of novel secondary metabolites, which have outstanding potential as lead structures for new drug discovery [3]. These metabolites belong to different structural classes, such as alkaloids, terpenoids, steroids, peptides, polyketides, lignans, phenols and lactones [4,5], and have shown different pharmacological activities, such as anticancer [6], antimicrobial [7], antioxidant [8], antidiabetic [9], anti-inflammatory [10], anti-Alzheimer's disease [11] and immunosuppressive [12] activities. Therefore, endophytic fungi represent a treasure trove of new bioactive natural products with great chemical diversity that has largely been unexploited.
Traditionally, the discovery of novel bioactive natural products with potent bioactivities from endophytic fungi occurs through a process known as bioactivity-guided screening, which is termed a "top-down" approach [13]. Unfortunately, this "top-down" approach has suffered from several pitfalls, such as high frequency of rediscovery of known compounds

Strain Source and Culture Medium
Strain CT-6 is an endophytic fungus that was isolated and purified from the medicinal plant C. tomentella, which was collected in September 2014 from Jinfo Mountain (N 29.10460, E 107.20736), Chongqing, China. The strain was deposited in the China General Microbiological Culture Collection Center (CGMCC) in Beijing, China, with collection number 23290.

Phylogenetic Analysis
For phylogenetic analysis, strain CT-6 was grown on PDA medium for 5 days at room temperature. The mycelium of strain CT-6, scraped directly from the surface of the agar culture, was used to extract the genomic DNA through the traditional cetyltrimethylammonium bromide (CTAB) method [17]. The universal ITS primers ITS4 (5′-GGA AGT AAA AGT CGT AAG G-3′) and ITS5 (5′-TCC TCC GCT TAT TGA TAT GC-3′) were used for amplicon sequencing of the ITS region and the intervening 5.8S rRNA gene region [18]. The polymerase chain reaction (PCR) product was sent to Sangon Biotech (Shanghai, China) Co., Ltd. for sequencing. The ITS sequences of strain CT-6 were deposited in GenBank and matched against the nucleotide database of the National Center for Biotechnology Information (NCBI) to compare the sequence homology with closely related organisms. Then, the sequences from closely related organisms were downloaded to conduct the phylogenetic analysis using the neighbor-joining (NJ) method; Clonostachys chloroleuca (ON495792) was used as an outgroup. Bootstrap analysis was carried out using 1000 replications with MEGA 7.0 software [19].
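The two ideas behind the tree-building step, a pairwise distance between aligned sequences and bootstrap resampling of alignment columns, can be illustrated with a minimal Python sketch. This is not MEGA's implementation: the p-distance is assumed here only as a simple example of a distance measure, and the sequence names and toy alignment are hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap pseudo-replicate by resampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Toy alignment (hypothetical names and sequences, not real ITS data).
toy = {
    "CT-6": "ACGTACGTAC",
    "ref_A": "ACGTACGTAC",
    "outgroup": "ACGTTCGAAC",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicate = bootstrap_alignment(toy, rng)

print(p_distance(toy["CT-6"], toy["ref_A"]))     # identical sequences
print(p_distance(toy["CT-6"], toy["outgroup"]))  # 2 of 10 sites differ
```

In a full bootstrap analysis, a tree would be rebuilt from the distance matrix of each pseudo-replicate, and the support for a clade would be the fraction of replicates in which that clade appears.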

Whole Genome Sequencing
Strain CT-6 was cultivated in PDB medium at 28 °C for 3 days on rotary shakers at 180 rpm. The mycelia were collected by centrifugation, and genomic DNA was extracted using the Wizard® Genomic DNA Purification Kit (Promega, MD, USA) according to the manufacturer's instructions. The genomic DNA of strain CT-6 was sequenced using the Illumina HiSeq X Ten platform and the PacBio Sequel II platform. For Illumina sequencing, at least 10 µg of genomic DNA was sheared into fragments of about 400 bp with a Covaris M220 Focused Ultrasonicator (Covaris Inc., Woburn, MA, USA) for sequencing library construction. The sequencing library was constructed according to the NEXTflex™ Rapid DNA-Seq Kit (Illumina, San Diego, CA, USA) method and sequenced on the Illumina HiSeq X Ten platform. For PacBio sequencing, an aliquot of 8 µg of genomic DNA was sheared to 10 kb using a Covaris g-TUBE (Covaris, MA, USA) at 6000 rpm for 60 s in an Eppendorf 5424 centrifuge (Eppendorf, NY, USA). DNA fragments were then end-repaired and ligated with SMRTbell sequencing adapters (Pacific Biosciences, Menlo Park, CA, USA) following the manufacturer's recommendations. Next, an ~10 kb insert library was prepared and sequenced on one SMRT cell using standard methods.

Genome Assembly
All bioinformatics analyses of the data generated from the Illumina and PacBio platforms were performed using the Majorbio Cloud Platform (https://cloud.majorbio.com, accessed on 12 April 2022), a free online platform of Shanghai Majorbio Bio-pharm Technology Co., Ltd. (Shanghai, China). The genome sequence was assembled using both the Illumina reads and the PacBio reads. For the Illumina sequence data, the raw data, saved as FASTQ files, were obtained by converting the original image data into sequence data via base calling. High-quality clean data were obtained by removing adapters and filtering out low-quality reads according to their quality statistics. The PacBio reads were assembled into contigs using Canu (version 1.7). Finally, error correction of the PacBio assembly was performed using the Illumina clean reads.
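The quality-filtering step can be sketched in Python as follows. This is only an illustration of the general idea, not the pipeline run on the Majorbio platform: the thresholds (`min_mean_q`, `min_len`) and the read names are assumptions chosen for the example, and Phred+33 quality encoding is assumed.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_q=20.0, min_len=50):
    """Yield (header, sequence, quality) records that pass both thresholds.

    The default thresholds are illustrative assumptions, not values
    taken from the study.
    """
    for header, seq, qual in records:
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


# Toy records (hypothetical): 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
records = [
    ("read_ok", "ACGT" * 20, "I" * 80),
    ("read_low_quality", "ACGT" * 20, "#" * 80),
    ("read_too_short", "ACGT", "IIII"),
]

kept = list(filter_reads(records))
print([header for header, _, _ in kept])
```

Adapter removal is a separate step that is normally handled by a dedicated trimming tool and is not shown here.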

Gene Annotation
Coding genes were predicted using Maker2 software (version 2.31.9) [20], tRNAs were predicted using tRNAscan-SE software (version 2.0) [20] and rRNAs were predicted using Barrnap software (version 0.8) [21]. The predicted coding genes in the whole genome of strain CT-6 were annotated against the NR, Pfam, Swiss-Prot, COG, GO and KEGG databases using sequence alignment tools such as BLAST software (version 2.3.0), Diamond software (version 0.8.35) and HMMER software (version 3.1b2) [22]. Briefly, each set of query proteins was aligned with the databases, and the annotations of the best-matched subjects (E-value < 10⁻⁵) were used for gene annotation. Furthermore, Blast2GO software (version 2.5) [23] was used to obtain the GO annotation information, and WEGO software (version 2.0) [24] was used to perform the GO functional classification statistics. Sibelia software (version 3.0.6) [25] was employed for pairwise genome collinearity analysis between strain CT-6 and each of Dactylonectria estremocensis, D. macrodidyma and D. torresensis.
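The best-hit selection with an E-value cutoff can be sketched in Python. The example assumes the standard 12-column tabular output that BLAST and Diamond can produce (query ID, subject ID, identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). Taking the highest bit score as the "best match" is an assumption made for this sketch, and the gene and subject identifiers are hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Hits are kept only when their E-value is strictly below max_evalue.
    'Best' is taken to mean the highest bit score (an assumption).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy alignment table (hypothetical identifiers and scores).
example = "\n".join([
    "gene0001\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "gene0001\tsubjB\t60.0\t180\t72\t2\t5\t184\t10\t189\t1e-20\t95.5",
    "gene0002\tsubjC\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25.0",
])

hits = best_hits(example)
print(hits)
```

Here `gene0002` receives no annotation because its only hit has an E-value above the cutoff.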

Additional Annotation
To predict and annotate cytochrome P450-related genes and CAZy-related genes in strain CT-6, Diamond software (version 0.8.35) was used to align the amino acid sequences of the target species with the cytochrome P450 database [26] and the CAZy database [27], respectively (E-value < 10⁻⁵). Antibiotic-resistance gene prediction and pathogen-host interaction phenotype classification were also performed using Diamond software (version 0.8.35) against the CARD and PHI databases (E-value < 10⁻⁵) [28].

Fermentation, Extraction, Isolation and Identification of Secondary Metabolites
Strain CT-6 was activated on PDA medium at room temperature for 10 days, and then the activated fungal hyphae were added to a sterilized 250 mL Erlenmeyer flask containing 100 mL PDB medium and cultured for 5 days on a shaker (180 rpm) at 28 °C. After that, 10 mL of the PDB fungal culture was inoculated into a sterilized 500 mL Erlenmeyer flask containing sterilized rice medium (150 g rice, 1.5 g peptone, 150 mL of tap water) and cultured at room temperature for 30 days. The fermented products of 30 flasks were extracted exhaustively with methanol (MeOH) and concentrated under reduced pressure to give a brown extract (310 g). The brown extract was then suspended in a 50% methanol-water solution. After defatting with petroleum ether (PE), the suspension was extracted with ethyl acetate (EtOAc) to afford 56.1 g of EtOAc fraction.
The structures of the six pure metabolites were identified by mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy. ¹H and ¹³C NMR spectra were recorded with a Bruker Avance III 400 MHz NMR spectrometer at 25 °C. Low-resolution MS spectra were obtained with a Shimadzu LC-MS-2020 triple quadrupole tandem mass spectrometer equipped with an electrospray ionization (ESI) probe operating in positive or negative ionization mode.

Identification of Strain CT-6
The hyaline sterile hyphae of strain CT-6 grew from PDA medium after incubation for 4 d at 28 °C and gradually became yellowish-brown (Figure 1a). Phylogenetic analysis (Figure 1b) showed that the ITS regions and the intervening 5.8S rRNA sequence of strain CT-6 (GenBank accession No. OP890611) had 100% similarity to the reference strain Dactylonectria alcacerensis CBS 129087 (GenBank accession No. NR121498), suggesting that this isolate belongs unambiguously to the species D. alcacerensis.
In this study, the endophyte CT-6 was identified as D. alcacerensis through ITS region and 5.8S rRNA sequence analyses. It has been reported that species belonging to Dactylonectria are phytopathogens causing grapevine root diseases [30], and D. alcacerensis has been associated with black foot disease of grapevines in Argentina [31]. Interestingly, in our research, D. alcacerensis CT-6 was an endophyte isolated from the medicinal plant C. tomentella, suggesting that a fungus that is a phytopathogen in one plant may live as an endophyte in another.

Genome Sequencing and Assembly
The genome sequence of D. alcacerensis CT-6 was assembled and deposited in the NCBI GenBank database (BioProject accession No. PRJNA905207). As shown in Table 1, the genome sequencing of D. alcacerensis CT-6 yielded an assembly with a total length of 61,760,550 bp, a maximal scaffold length of 8,864,146 bp and a G+C content of 49.86%. The genome consisted of 22 scaffolds with an N50 of 4,367,436 bp and an N90 of 2,702,333 bp. A total of 16,963 protein-coding genes were predicted. The total length of the genes was 36,278,934 bp (accounting for 58.74% of the genome), the average gene length was 2138.71 bp and the G+C content of the gene regions was 51.30%. For non-coding RNA, we predicted 244 tRNAs of 22 types and 67 rRNAs (56 5S rRNAs, 5 5.8S rRNAs and 6 28S rRNAs). The genome diagram of D. alcacerensis CT-6 in Figure 2 contains four circles, representing (from outside to inside) scaffolds, GC content (per 200 kb), gene density (per 200 kb) and genome duplication. Regions sharing more than 95% sequence similarity over 5 kb are connected by grey lines, and those with more than 95% similarity over 10 kb are connected by orange lines.
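For readers less familiar with the assembly statistics quoted above, the following Python sketch shows how N50/N90 and G+C content are conventionally computed, and re-derives the two gene-level figures from the reported totals. The scaffold lengths and the short sequence are toy values, since the individual scaffold lengths are not listed in the text; the function names are hypothetical.

```python
def nxx(lengths, percent):
    """Length L such that sequences of length >= L cover at least
    `percent` % of the total assembly (N50 for percent=50, N90 for 90)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length


def gc_percent(sequence):
    """G+C content as a percentage of unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100 * (seq.count("G") + seq.count("C")) / acgt


# Toy scaffold lengths (hypothetical, total = 300).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(nxx(toy_scaffolds, 50))    # 70
print(nxx(toy_scaffolds, 90))    # 30
print(gc_percent("GGCCAATT"))    # 50.0

# Re-deriving two reported figures from the totals given in the text.
gene_total_bp, genome_bp, gene_count = 36_278_934, 61_760_550, 16_963
gene_fraction = f"{100 * gene_total_bp / genome_bp:.2f}"
mean_gene_len = f"{gene_total_bp / gene_count:.2f}"
print(gene_fraction, mean_gene_len)
```

The last two values agree with the 58.74% gene fraction and the 2138.71 bp average gene length reported in the text.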

Genome Annotation
To functionally annotate the putative coding sequences in D. alcacerensis CT-6, the 16,963 non-redundant genes were subjected to BLAST searches against the NR, Pfam, Swiss-Prot, COG, GO and KEGG databases. A total of 15,558 genes were annotated in at least one of the six public databases. The NR database annotated the largest number of genes (15,552 genes/91.68%), followed by COG (14,424 genes/85.03%), Pfam (11,674 genes/68.82%), GO (10,975 genes/64.70%), Swiss-Prot (10,880 genes/64.14%) and KEGG (4567 genes/26.92%) (Table 1). According to the COG analysis, "function unknown" (7990) was associated with the most genes, followed by "carbohydrate transport and metabolism" (1112), "amino acid transport and metabolism" (636) and "energy production and conversion" (595) as the most gene-rich classes in the COG groupings (Figure 3). Based on the GO assignment, 10,975 genes were categorized into 3 main GO categories and 42 subcategories (Figure 4). In the biological process category, genes were mainly involved in metabolic processes (3217) and cellular processes (3021). In the cellular component category, genes were mainly assigned to membrane part (3743), cell part (2599) and organelle (1577). In the molecular function category, 5650 genes were involved in catalytic activity, followed by binding (4766) and transporter activity (1200). According to the KEGG analysis, 4567 genes were annotated and assigned to 46 second-level KEGG categories, which could be classified into six main KEGG categories: Metabolism, Human Diseases, Organismal Systems, Genetic Information Processing, Environmental Information Processing and Cellular Processes (Figure 5). "Global and overview maps" (1564) was the most enriched pathway, followed by "carbohydrate metabolism" (580) and "amino acid metabolism" (526).
Furthermore, the collinearity between the D. alcacerensis CT-6 genome and the reference genome sequences of three other strains of the Dactylonectria genus whose whole genomes have been sequenced and submitted to the GenBank database (D. estremocensis, PRJNA370196; D. macrodidyma, PRJNA500112; D. torresensis, PRJNA566152) was analyzed pairwise using Sibelia (version 3.0.6) and Circos (version 0.69-6) software. As shown in Figure 6a-c, the genome collinearity analysis revealed that the D. alcacerensis CT-6 genome shows high homology with the three reference genomes, and large-scale gene rearrangements were observed between D. alcacerensis CT-6 and each of the other three strains of the Dactylonectria genus.
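The per-database annotation rates reported in this section can be re-derived from the gene counts with a few lines of Python. The counts and the total of 16,963 predicted genes are taken from the text; the variable and function names are hypothetical.

```python
TOTAL_GENES = 16_963  # predicted protein-coding genes (from the text)

# Number of genes annotated in each database (from the text).
annotated = {
    "NR": 15_552,
    "Pfam": 11_674,
    "Swiss-Prot": 10_880,
    "COG": 14_424,
    "GO": 10_975,
    "KEGG": 4_567,
}


def percent(count, total):
    """Share of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


# Rank the databases by number of annotated genes, highest first.
ranking = sorted(annotated.items(), key=lambda item: item[1], reverse=True)
for database, count in ranking:
    print(f"{database:<10} {count:>6} {percent(count, TOTAL_GENES):6.2f}%")
```

Sorting by count makes the ranking explicit: NR and COG annotate the most genes, and KEGG the fewest.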

Analysis of Secondary Metabolite Biosynthetic Gene Clusters
The whole genome sequence of D. alcacerensis CT-6 was submitted to the antiSMASH database for analysis of secondary metabolite BGCs. AntiSMASH analysis demonstrated that D. alcacerensis CT-6 possessed 45 secondary metabolite BGCs, including 15 T1PKS, 8
As far as we know, organisms that are prolific producers of antibiotic metabolites must also be resistant to the antibiotics they produce [34]. Therefore, analyzing the antibiotic resistance genes in microbial genomes can help to indicate what kinds of antibiotics the microorganisms produce. According to the CARD database, a total of 390 genes in the whole genome of D. alcacerensis CT-6 were annotated as antibiotic resistance genes (Figure 7c). "Tetracycline antibiotic" (67) was associated with the most genes, followed by "penam" (48), "cephalosporin" (45) and "peptide antibiotic" (35). Notably, the large number of antibiotic resistance genes in D. alcacerensis CT-6 suggests a robust ability to produce different kinds of antibiotics. Based on the PHI analysis, 4373 genes were annotated as PHI genes and classified into nine groups, of which the largest was "unaffected pathogenicity" with 2130 genes, followed by "reduced virulence" (1934), "loss of pathogenicity" (338), "mixed outcome" (209) and "lethal" (118) (Figure 7d).
Among the secondary metabolites isolated from D. alcacerensis CT-6, brefeldin A (1) was the predominant product, obtained in relatively abundant yield. Attempts to isolate the products biosynthesized by the BGCs with 100% similarity to those of other reference strains were unsuccessful, probably because of low product accumulation, the limitations of the separation methods, or because the BGCs were silent or expressed at low levels under conventional conditions. It should be noted that cultivation under laboratory conditions may not provide all the environmental stimuli required for every BGC to produce its corresponding secondary metabolites. Therefore, in order to obtain more bioactive secondary metabolites from D. alcacerensis CT-6, further investigation is necessary to understand the expression and regulation mechanisms of these BGCs.

Conclusions
In this study, we assembled and annotated the first high-quality whole genome of the endophytic D. alcacerensis CT-6. Although the genome of D. alcacerensis CT-6 had high homology with three other strains of the Dactylonectria genus, large-scale gene rearrangements were also observed. AntiSMASH analysis identified 45 secondary metabolite BGCs in D. alcacerensis CT-6, most of which were unknown and have yet to be characterized. Only six known substances have been isolated from the fermented products of D. alcacerensis CT-6, which suggests that a great number of cryptic BGCs in D. alcacerensis CT-6 are silent and/or expressed at low levels under conventional conditions. Our study provides an important basis for the further chemical study of D. alcacerensis CT-6 using a genome-mining strategy to unveil these cryptic BGCs for the production of more bioactive secondary metabolites through various approaches, including induction of mutations, gene knockout, regulation of promoters, transcriptional regulation and heterologous gene expression.