Whole Genome Sequencing of the Giant Grouper (Epinephelus lanceolatus) and High-Throughput Screening of Putative Antimicrobial Peptide Genes

Giant groupers, the largest grouper type in the world, are of economic importance in marine aquaculture for their rapid growth. At the same time, bacterial and viral diseases have become the main threats to the grouper industry. Here, we report a high-quality genome of a giant grouper sequenced by an Illumina HiSeq X-Ten and PacBio Bioscience Sequel platform. A total of 254 putative antimicrobial peptide (AMP) genes were identified, which can be divided into 34 classes according to the annotation of the Antimicrobial Peptides Database (APD3). Their locations in pseudochromosomes were also determined. Thrombin-, lectin-, and scolopendin-derived putative AMPs were the three largest parts. In addition, expressions of putative AMPs were measured by our transcriptome data. Two putative AMP genes (gapdh1 and gapdh2) were involved in glycolysis, which had extremely high expression levels in giant grouper muscle. As it has been reported that AMPs inhibit the growth of a broad spectrum of microbes and participate in regulating innate and adaptive immune responses, genome sequencing of this study provides a comprehensive cataloging of putative AMPs of groupers, supporting antimicrobial research and aquaculture therapy. These genomic resources will be beneficial to further molecular breeding of this economically important fish.


Introduction
Groupers are coral reef fishes in the subfamily Epinephelinae of the family Serranidae (order Perciformes), which are known for their delicious taste, tender flesh, and rich nutrition [1]. As economically important fish species in marine aquaculture, groupers reached a worldwide production of 155,000 tons in 2015, with a total value of USD 630 million [2]. More specifically, mainland China is responsible for an estimated 65% of the total production [2]. There are at least

De Novo Genome Assembly and Annotation
A total of 3,077,169 contigs were generated, and the length of contig N50 was determined to be 76,419 bp by using clean Illumina sequencing data. Subsequently, hybrid assembly involving PacBio reads that were error corrected by Illumina reads and the Illumina contigs was performed by DBG2OLC [19]. Finally, the achieved total contigs were reduced to 3207 (Table 1), which is much smaller than the assembly from Illumina data. The contig N50 and scaffold N50 of the final genome assembly were 1,469,414 and 1,505,601 bp, respectively (Table 1), reflecting the high quality of assembly. The total genome size reached up to 1.128 Gb, and the GC content was 41.4%. Our assembled genome is of high quality, as a genome completeness assessment by benchmarking universal single-copy orthologs (BUSCO) [20] proved that the assembly contained 93.1% complete gene models. Annotation of repeat sequences was performed by de novo and homology predictions based on the RepBase database [21]. The giant grouper genome comprised 45.1% repetitive sequences with a length of 508,638,319 bp, in which 42.4% were transposable elements (TEs) with a length of 478,176,524 bp. DNA transposons (24.6% of the genome) were the most abundant type of TE, followed by long interspersed elements (LINEs, 15.8% of the genome) and long terminal repeats (LTRs, 7.4% of the genome) (Table S1).

De Novo Genome Assembly and Annotation
A total of 3,077,169 contigs were generated, and the length of contig N50 was determined to be 76,419 bp by using clean Illumina sequencing data. Subsequently, hybrid assembly involving PacBio reads that were error corrected by Illumina reads and the Illumina contigs was performed by DBG2OLC [19]. Finally, the achieved total contigs were reduced to 3207 (Table 1), which is much smaller than the assembly from Illumina data. The contig N50 and scaffold N50 of the final genome assembly were 1,469,414 and 1,505,601 bp, respectively (Table 1), reflecting the high quality of assembly. The total genome size reached up to 1.128 Gb, and the GC content was 41.4%. Our assembled genome is of high quality, as a genome completeness assessment by benchmarking universal single-copy orthologs (BUSCO) [20] proved that the assembly contained 93.1% complete gene models. Annotation of repeat sequences was performed by de novo and homology predictions based on the RepBase database [21]. The giant grouper genome comprised 45.1% repetitive sequences with a length of 508,638,319 bp, in which 42.4% were transposable elements (TEs) with a length of 478,176,524 bp. DNA transposons (24.6% of the genome) were the most abundant type of TE, followed by long interspersed elements (LINEs, 15.8% of the genome) and long terminal repeats (LTRs, 7.4% of the genome) (Table S1).
De novo predictions based on the repeat-masked genome, RNA-seq predictions based on transcriptomic data from our previous work [6,7], and homolog predictions generated a comprehensive and nonredundant protein-coding gene set containing 24,794 genes. Annotation completeness, assessed by BUSCO analysis, indicated that gene sequences covered 85% complete single-copy orthologs. The gene set was annotated by InterPro, KEGG, Swiss-Prot, and TrEMBL databases, and approximately 93.37% (23,149 genes) of the gene set were supported by the above-mentioned databases.

Pseudochromosome Construction
Based on the genetic linkage map of the orange-spotted grouper, a total of 1256 scaffolds were anchored into 24 pseudochromosomes (Chr) of the giant grouper, and a total length of 999.69 Mb was assembled, which comprised 88.62% of the assembled genome sequences and 22,206 genes (from a total of 24,794 genes). The largest pseudochromosome was Chr13 with 54.06 Mb containing 56 scaffolds, and the smallest pseudochromosome was Chr3 with 20.61 Mb containing 18 scaffolds. The average pseudochromosome length was 41.65 Mb with 52 scaffolds (Table 2 and Figure 2). Figure 2 summarizes the distribution of genes, GC content in genomic intervals of 100 kb, and interchromosomal relationships of our assembled giant grouper pseudochromosomes.

Identification, Transcriptomic Quantification, and Annotation of Putative Antimicrobial Peptides (AMPs)
A total of 2927 AMP sequences were collected from the online Antimicrobial Peptides Database (APD3, http://aps.unmc.edu/AP/main.php) (Table S2), which were employed as query sequences for putative AMP identification by BLAST. A total of 254 putative AMP genes were obtained (Table S3), which can be divided into 34 classes according to annotation of AMPs in the APD3 (Figure 4a). Each

Identification, Transcriptomic Quantification, and Annotation of Putative Antimicrobial Peptides (AMPs)
A total of 2927 AMP sequences were collected from the online Antimicrobial Peptides Database (APD3, http://aps.unmc.edu/AP/main.php) (Table S2), which were employed as query sequences for putative AMP identification by BLAST. A total of 254 putative AMP genes were obtained (Table S3), which can be divided into 34 classes according to annotation of AMPs in the APD3 (Figure 4a). Each putative AMP gene was renamed by class followed by a serial number. In addition, thrombin-derived C-terminal peptides (TCPs, 64) [22,23], lectin-derived (29), and scolopendin-derived (23) were the top three classes among them, which is consistent with our previous study [24].
We also downloaded another recently published gene set of a giant grouper (PRJNA516312) from the National Center for Biotechnology Information (NCBI) and identified AMPs with the same method. We obtained 326 putative AMPs that were classified into 36 groups (Table S4). TCPs (75), lectin-derived (46), and histone-derived (41) were the top three classes. Comparison between gene set from PRJNA516312 and present study revealed differences in the case of predicted putative AMPs. We speculated that it may be associated with differences in the annotation strategy, which resulted in divergence of the two gene sets.
Based on the transcriptomic data of brain, liver, and muscle from our previous work [6,7], transcripts per million (TPM) values of each putative AMP gene were calculated (Tables S5-S7). TPM values reflected the transcription level of putative AMP genes. Among 254 putative AMP genes, 209, 193, and 177 putative AMP genes with TPM values were detected in brain, liver, and muscle tissues, respectively. The top 20 putative AMPs with high TPM values in each tissue are presented in Table 3.

Location of Putative AMP Genes and Growth-Related Genes
Out of 254 putative AMPs, 228 were mapped to the 24 assembled pseudochromosomes, with an average number of 9 per chromosome ( Figure 3). Chr10 and Chr15 had the highest hits of 22 and 19 genes, respectively. The chromosomes with the lowest counts were Chr20 and Chr3, both with 3 genes. Subsequently, putative AMPs were also confirmed by KEGG enrichment analysis, in which 167 genes were clustered into 35 KEGG items. The representative entries included the immune system (53 genes), signaling molecules and interactions (48 genes), signal transduction (46 genes), and cancers: overview (37 genes) (Figure 4b). Among them, glycogen synthesis may be related to the superior growth of groupers [6]. Thus, in this study, we found that 24 glycolytic-and Ca 2+ -regulating putative AMP genes were located in 18 chromosomes. We found that gapdh2, eno2, and tpi1a were located in Chr22 (Table 4 and Figure 3).
Interestingly, we found that 2 genes (gapdh1 and gapdh2) involved in glycolysis were precursors of predicted AMPs (Figure 3). The expression level of gapdh1 in the giant grouper muscle was extremely high (Table 5), which has a high identity (96.88%) and query alignment ratio (100%) with the skipjack tuna GAPDH-related antimicrobial peptide (SJGAP) [25]. While gapdh2 matched to yellowfin tuna glyceraldehyde-3-phosphate dehydrogenase-related antimicrobial peptide (YFGAP) [25], with 87.50% identity and 100.00% query alignment ratio. SJGAP and YFGAP are AMPs from the skin of skipjack tuna (Katsuwonus pelamis) and yellowfin tuna (Thunnus albacares), respectively, and both have potent antimicrobial activities [25,26]. To investigate the alignments of SJGAP and YFGAP in gapdh, we performed multiple sequence alignments of gapdh1 ( Figure 5a) and gapdh2 (Figure 5b) from zebrafish, yellowfin tuna, and giant groupers. As shown in Figure 5, gapdh1 of zebrafish, yellowfin tuna, and giant groupers showed higher similarity with YFGAP and SJGAP than gapdh2, suggesting that gapdh1 are more likely to play a role in the antimicrobial process in these fishes. Gene structures of gapdh1 and gapdh2 are also exhibited in Figure 5.
Most of the genomes of economic crops or animals have focused on growth traits. While some fish genome projects found immune genes that were lost (elephant shark [49], Atlantic cod [40]) or expanded (large yellow croaker [50]), we performed this project to screen putative AMPs with an attempt to explore immune resources for bacterial and viral disease therapy. Especially, some AMPs have been widely used in agriculture as potential alternatives to antibiotics [51]. This work may help those who make efforts to develop drugs for groupers and reduce the usage of antibiotics and other chemically poisoning drugs.
A total of 254 putative AMP genes of the giant grouper were classified into 34 groups in the present study. Among them, thrombin (64 AMPs), lectin (29 AMPs) and scolopendin (23 AMPs) were the top three groups. Thrombin was also the largest grouper in our previous works, including the blue tilapia (Oreochromis aureus), Nile tilapia (Oreochromis niloticus) [52], blue-spotted mudskipper (Boleophthalmus pectinirostris), and giant-fin mudskipper (Periophthalmus magnuspinnatus) [24]. Lectin also accounted for a major part in the above-mentioned four fishes. It seems that thrombin may play a vital role in fish.
It has been reported that several kinds of AMPs, including epinecidin [11,12], hepcidin [13], defensin [14], and piscidin [15], have been cloned and studied from groupers. We identified EC-hepcidin1 (query ID 1701 in Table S2), a hepcidin AMP derived from the liver and stomach of orange-spotted grouper [13], from the giant grouper gene set. Enap-1 (query ID 294 in Table S2), a defensin from horse (Equus caballus) [53], was also identified in giant groupers. However, the other two AMPs that have been reported in groupers were not found in our annotated gene set.
Cumulative evidence is showing that AMPs not only inhibit the growth of a broad spectrum of microbes through membrane disruption [10], but they also participate in regulating innate and adaptive immune responses [54,55]. High expression levels of the gapdh1 gene in the liver and muscle of giant groupers may imply its active glycolysis activity along with antimicrobial activity. Previous studies have reported that strong antibacterial activities against both Gram-positive and Gram-negative bacteria of N-terminal segments of this protein were shown in tuna fish [25,26]. Antifungal efficacies of GAPDH-derived peptide were also demonstrated in many studies [56,57]. However, GAPDH has not been shown to be cleaved in groupers, and it is possible that cleavage might be tissue specific. Even if it has a high level in muscle, it would be predicted to be at full length and enzymatically active for glycolysis only, whereas it might be cleaved to AMPs in skin or other tissues. Antimicrobial functions of this peptide (if produced) are worthy of further investigation in giant groupers.
Moreover, the great majority of giant groupers are born to be female, barely male, and some, but mostly one of them in the population, would change its gender to male after the first or second maturation (all females have the ability to change sex). This kind of protogynous hermaphrodite is mostly shared by groupers [58]. Materials in this work could provide opportunities to explore the mechanisms of sex change in groupers.

Sample Preparation and Sequencing
Genomic DNA was extracted from the muscle of a wild giant grouper cultured in the Guangdong Marine Fishery Experimental Center, Huizhou, China. Six libraries, including three short-insert libraries (270, 500, and 800 bp) and three long-insert libraries (2, 5, and 10 kb), were constructed for sequencing by an Illumina HiSeq X-Ten platform (Illumina, San Diego, CA, USA), except for the 800 bp insert-size library, which was sequenced by an Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). For high-quality assembly, we also constructed a 20 kb insert library for sequencing with the PacBio Bioscience Sequel platform (Pacific Biosciences, Menlo Park, CA, USA). All experiments were carried out according to the guidelines of the Animal Ethics Committee and were approved by the Institutional Review Board on Bioethics and Biosafety of BGI (No. FT14015).
All sequenced data from E. lanceolatus are available in the NCBI database at BioProject ID, PRJNA533524. All Illumina reads are available under accession numbers SRR8926032, SRR8926031, SRR8926030, SRR8926029, SRR8926035, SRR8926034, and SRR8926033 in the NCBI database.

Pseudochromosome Construction
SNP-containing reads in the genetic linkage map from the orange-spotted grouper [16] were mapped to the giant grouper assembled genome sequence, and only the matching reads were selected. Linkage groups (LGs) of the giant grouper were assigned using JoinMap4.1 software (Kyazma, Wageningen, Netherlands) [62]. Subsequently, a genetic linkage map was constructed. Single nucleotide polymorphisms in the genetic linkage map of the giant grouper were used for assembling the pseudochromosomes. To increase the accuracy of pseudochromosome assembly, we chose at least two SNPs in each scaffold [63]. Based on genetic distances between SNP markers, we determined the position and orientation of each scaffold and anchored these scaffolds to construct pseudochromosomes.

Repeat Annotation
Repeat elements were predicted by de novo and homology methods. De novo predictions were performed by LTR_FINDER (version 1.0.6, Fudan University, Shanghai, China) [64] and RepeatModeler (version 1.08, http://www.repeatmasker.org/RepeatModeler/) [65]. The merged repeat library was aligned to the assembled genome sequence by RepeatMasker (version 4.06, Institute for Systems Biology, Seattle, WA, USA) to produce repeat elements [65]. The homology prediction based on RepBase (version 21.01, Genetic Information Research Institute, Sunnyvale, CA, USA) was performed by RepeatMasker and RepeatProteinMask (version 4.06, Institute for Systems Biology, Seattle, WA, USA) [65]. Subsequently, nonredundant repeat elements were obtained by integrating de novo and homology data.

Identification and Transcriptomic Quantification of AMPs
A total of 2927 AMP sequences that have been reported to exhibit antimicrobial activity were collected from the APD3 database as a query sequence (Table S2). An index database of annotated gene sets was built for alignment by makeblastdb command. Collected active AMPs sequences were aligned to gene set sequences to identify potential AMPs based on sequence similarity by TBLASTN (e-value: 1e−5). Alignment hits were dealt with by in-house scripts. Those hits with a query alignment ratio less than 0.5 were filtered out, and redundant data were also removed. To calculate the TPM value of putative AMP genes, we performed referring sequence-based transcript quantification. Raw reads were filtered by SOAPnuke filter tools (version 1.5.6, BGI-Shenzhen, Shenzhen, China) [76]. Clean reads were mapped to the assembled genome by HISAT2 (version 2.0.4, https://github.com/DaehwanKimLab/hisat2) [77]. Subsequently, TPM values of each transcriptome were calculated by RSEM (version 1.2.12, https://deweylab.github.io/RSEM/) [78].

Conclusions
We report a high-quality genome of the giant grouper. The assembly reached up to 1.128 Gb, accounting for 96.8% of the estimated genome size. A total of 24,794 protein-coding genes were annotated through de novo prediction, transcriptomic data, and homolog prediction. Then, 254 putative AMP genes were identified, located in pseudochromosomes, and expressions were measured. Two putative AMP genes were connected to glycolysis, of which gapdh1 was highly expressed in muscle. Genome sequencing let us identify AMPs systematically in groupers so as to support antimicrobial research and possibly provide suggestions for therapy. These genomics resources will be beneficial for further molecular breeding of this economically important fish. This work shall aid in the effort against infectious diseases in the giant grouper industry.
Supplementary Materials: The following are available online at http://www.mdpi.com/1660-3397/17/9/503/s1, Table S1: Annotation of repeat sequences of giant grouper, Table S2: AMP sequences collected from APD3, Table  S3: putative AMP genes identified in the giant grouper genome, Table S4: Summary of the 326 identified AMPs from another giant grouper genome (PRJNA516312) and statistics of its classification, Table S5: TPM of AMPs in the brain, Table S6: TPM of AMPs in the liver, and Table S7: TPM of AMPs in the muscle.