First Draft Genome Assembly of the Malaysian Stingless Bee, Heterotrigona itama (Apidae, Meliponinae)

The Malaysian stingless bee industry is hugely dependent on wild colonies. Nevertheless, the availability of new queens to establish new colonies is insufficient to meet the growing demand for hives in the industry. Heterotrigona itama is primarily utilized for honey production in the region and the major source of stingless bee colonies comes from the wild. To propagate new colonies domestically, a fundamental understanding of the biology of queen development, especially from the genomics aspect, is necessary. The whole genome was sequenced using a paired-end 150 strategy on the Illumina HiSeq X platform. The shotgun sequencing generated approximately 89 million raw pair-end reads with a total output of 13.37 Gb and a GC content of 37.31%. The genome size of the species was estimated to be approximately 272 Mb. Phylogenetic analysis showed H. itama are much more closely related to the bumble bee (Bombus spp.) than they are to the modern honey bee (Apis spp.). The genome data provided here are expected to contribute to a better understanding of the genetic aspect of queen differentiation as well as of important molecular pathways which are crucial for stingless bee biology, management and conservation.


Summary
Malaysia is home to about 50 different species of Meliponines (stingless bees) previously belonging to 13 genera [1]. However, they recently have undergone taxonomic revisions and the 13 genera are now reduced to only seven-with some of the genera having been revised into the subgenus level [2,3].
While some taxa have limited distributions, Heterotrigona itama populations are scattered in Malesia (the Malay Archipelago) spanning over Peninsular Malaysia, Borneo's East Malaysia, Southern Thailand, Singapore and Indonesia (Kalimantan, Java and Sumatra) [1,4]. Based on our recent species diversity survey, Meliponines found in Malaysia produce edible, fragrant kinds of honey, the taste of which is generally sweet-tangy and whose color varies depending on the season. Among these, H. itama has become the most common species reared or kept for its highly prized medicinal honey; this species has also been regarded as an economically-important insect in agriculture due to its high adaptability to a wide range of habitats and floristic attributes [5][6][7][8][9].
H. itama is a member of the Apidae family within the order of Hymenoptera, which displays eusociality behavior and a caste polyphenism in females. Female larvae develop into two interdependent adult castes, the queen and the worker, during caste differentiation. Chen et al. [10] stated that caste differentiation depends on the differential expression of entire sets of genes involved in the larval fate of queens and workers. The substantial physiological and morphological differences are due to the differential expression of these genes [10][11][12][13]. To date, the genetic aspect of queen differentiation in Malaysian stingless bees has not been fully addressed, yet this area is crucial not only to enhance our understanding of caste differentiation but also to complement the ongoing research on producing queens using an in vitro platform. Tamizi et al. [14] conducted a related study focusing on the transcriptomics of a queen larva which discovered that H. itama most highly follows a conserved caste differentiation pathway based on the detection of a few sets of annotated genes related to queen differentiation. Other studies related to H. itama queens were on oviposition behavior and the presence of virgin queen eggs after colony splitting [6,8]; however, these two did not address the caste differentiation during the developmental stage. Here, we report the genome sequencing, de novo assembly and annotation data of the Malaysian stingless bee, H. itama, which can help researchers to better access genes and pathways related to queen differentiation from a much larger nucleotide reference-the genome. In addition, the data can be useful as a genomic reference to facilitate future studies on sustainable cultivation and conservation of stingless bees. The sequenced H. itama genome is the first report for the Indo-Malayan/Australasian stingless bee group and the fourth in the world after two Neotropical stingless bees, Melipona quadrifasciata and Frieseomelitta varia [15,16], and an Asian species from Taiwan, Lepidotrigona ventralis hoosana (GenBank accession: PRJNA387986), which had been revised as Lepidotrigona hoozana (Strand) Rasmussen [1].

Data Description
The whole genome sequencing of a queen of H. itama ( Figure 1A) generated approximately 89 million raw pair-end reads with a total output of 13.37 Gb sequencing data. A total of 90.83% of the sequencing data were retained after pre-processing; more than 82 million high-quality reads with approximate 12.15 Gb total bases were generated (Table 1A).

Genome Size Estimation
Analysis of k-mer distribution used the high-quality reads to estimate the genome size, heterozygosity and repeat content of the stingless bee genome. Based on the k-mer spectrum ( Figure 1B), a simple Poisson profile denotes the low heterozygosity of the organism (0.275%) [17]. In addition, the genome was found to be slightly repetitive, and the estimated genome size of the Malaysian stingless bee (H. itama) was discovered to be approximately 272 Mb. This is similar to the size of previously reported genomes of the honey bee (Apis mellifera-236 Mb) [18], bumble bee (Bombus terrestris-249 Mb) [19] and two Neotropical stingless bee species (M. quadrifasciata-256 Mb, F. varia-275 Mb) [15,16]. Based on the draft genome size estimated, subsequent de novo assembly and genome annotation were performed with the sequencing depth of approximately 49X coverage.

Genome Assembly and Structural Annotation
The draft genome with a total assembly size of 262.45 Mb has 13,733 contigs (≥1000 bp) with GC content of 37.31%, the longest contig of 438,094 bp and contig N50 sizes of 43,263 bp (Table 1B). The BUSCO integrity assessment detected a 94.4% completeness score of complete single-copy orthologs (53-mer) in the genome assembly (Figure 2A), indicating a high level of completeness of the draft genome. In addition, the BUSCO assessment reported a completeness score of 94.6% on the genome assembly based on hymenoptera_odb9 profiles, indicating a comparable level of completeness to that of other bee genomes ( Figure 2B). Subsequently, a total of 12,496 protein-coding genes (>99 bp), 398 tRNA and 14 rRNA genes were predicted from structural annotation ( Table 1C). The mean exon per gene annotated is 6.57. The predicted genes achieved 84.6% completeness based on hymenoptera_odb9 BUSCO profiles. A total of 2.29 Mb, equivalent to 0.87% repeats belonging to different classes of interspersed repeats were masked for the stingless bee draft genome. The repetitive elements identified in the draft genome comprising 0.19% of long terminal repeats (LTRs), 0.18% of DNA transposons, and 0.06% of long interspersed nuclear elements (LINEs) ( Figure 2C). The low amount of repetitive sequences detected was due to the sequencing technology used and the relatively low sequencing coverage. Short reads generated could have been collapsed into contiguous contig or resulted in a fragmented assembly. In addition, the relatively low sequencing coverage of~49× might not be enough to sequence every genomic region and thus some repetitive regions might be missed out from the sequencing.

Functional Annotation
From the predicted protein-coding genes (>99 bp) obtained, the peptide sequences (≥33 amino acid) were found to be 83.22% and 65.81% functionally annotated against RefSeq [20] and Swiss-Prot [21] databases, respectively (Table 1D). The RefSeq top-hit species annotation revealed that most of the stingless bees' proteins hit to those of the bumble bee, B. terrestris dataset. Further, protein structure characterization successfully discovered a total of 7999 genes from Gene Ontology (GO) [22] and categorized to biological function, cellular component and molecular function. A total of 142 KEGG pathways were mapped with the most enzymes identified in the biosynthesis of the antibiotics pathway, the purine metabolism pathway and the cysteine and methionine metabolism pathway ( Figure 3). Subsequently, genes of interest related to caste differentiation and insect hormone were data-mined from the annotated GO terms (Table 1E). Several key genes related to insect hormone, which were reported to play an important role in caste development [11,23], were mapped to the insect hormone biosynthesis pathway (KEGG pathway: map 00981).

Orthologous and Phylogenetic Analysis
Ortholog analysis showed that 10,067 orthologous clusters were formed based on the protein data set from the four Hymenoptera bee species. The numbers in the Venn diagram represent the number of orthologous clusters that H. itama shares with the three other species. In total, 7573 orthologous clusters were found to be common in all four bee species, suggesting their conservation in the lineage after speciation. In addition, the diagram shows that there are 70 clusters exclusive to the Malaysian stingless bee, H. itama ( Figure 4A). These 70 clusters contain unannotated genes whose function have not been studied at this stage (Table S1). As for the phylogenetic analysis, a total of 39 single-copy orthologs that were found to be conserved across the ten bees and one wasp species were adopted and analyzed. The eleven species that were analyzed were made up of different tribes, including the stingless bee, honey bee, bumble bee and euglossine bee, with the wasp as the outgroup. Phylogeny of these concatenated genes showed that the H. itama evolved closely with two other stingless bees, F. varia and the M. quadrifasciata, which share a common ancestor with the bumble bees ( Figure 4B). The phylogenetic inference also indicated that stingless bees are much closely related to bumble bees than they are to the modern honey bee group (Apis spp.). This is further supported by a study done by Tamizi et al. [14], which found that the H. itama transcripts shared higher similarity with B. impatiens and B. terrestris (30%-39%) than with A. mellifera and A. florea (<8.1%).
In summary, we report the first draft genome of H. itama, the fourth Meliponine genome to be sequenced. Since there are limited genomic sequence resources for this intriguing group of social insects, our study hopes to provide more insights into and understanding of the biological processes which could contribute to the conservation and sustainable management of stingless bees in Malaysia.

Sample Preparation and Sequencing
A single queen of H. itama (pupal stage) was collected from an active colony placed at MARDI, Serdang, Selangor. The main reason a single individual (particularly the queen) was used is the size advantage of the queen compared to a drone (male). A single queen pupa can provide just enough DNA material needed for genome sequencing while three-four workers or drones are needed to produce an equal amount of DNA. The biology and behavior of H. itama are slightly different than those of Apis spp. While drones from a single healthy Apini colony are most likely produced by a single queen, drones from Meliponini could come from different lineages as some species may house multiple queens at a time; even the workers could actively participate in laying haploid eggs that turn into drones. Genomic DNA was extracted from the whole pupa following a modified cetyl trimethyl ammonium bromide (CTAB) protocol [24]. Next generation sequencing library preparations were constructed following the manufacturer's protocol (VAHTS Universal DNA Library Prep Kit for Illumina). Prior to the library construction, the quality assessment of the genomic DNA sample was performed using Qubit 2.0 fluorometer (Life Technologies Corporation) for concentration determination and agarose gel electrophoresis for integrity testing. The prepared library was then loaded onto an Illumina HiSeq X Ten instrument according to the manufacturer's instructions (Illumina, San Diego, CA, USA). Sequencing was carried out using a 2 × 150 paired-end (PE) configuration; image analysis and base calling were conducted by the HiSeq Control Software (HCS) + OLB + GAPipeline-1.6 (Illumina) on the HiSeq instrument. The data quality of the raw sequencing reads was assessed using FastQC version 0.11.3 [25]. Pre-processing was then carried out using Trimmomatic version 0.38 [26] with the following parameters: ILLUMINACLIP: TruSeq3-PE-2. fa: 2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:8:20 MINLEN:50. Reads that were at least QV20, read length of ≥50 bp and adapters-free paired sequences were collected as clean reads.

Genome Size Estimation and Genome Assembly
The high-quality reads were used for k-mer distribution analysis to estimate the genome size, heterozygosity and repeat content of the stingless bee genome. Jellyfish v2.1.1 (University of Maryland, College Park, US) [27] was used to calculate the k-mer occurrences in DNA with a k-mer size of 21. Following that, GenomeScope v1.0.0 (Cold Spring Harbor Laboratory, Laurel Hollow, US) [28], a k-mer analysis software that uses the k-mer frequencies output from jellyfish, was employed to predict the genome size along with several other metrics for genome profiling.
The draft genome was de novo assembled using Velvet v1.2.10 (KBase, US) [29]. The assembly was carried out using different k-mer lengths (ranging from 51 to 59) with the parameters as follows: -shortPaired -separate -fastq for velveth; exp_cov = auto, read_trkg = yes for velvetg. Among the assemblies built with different k-mer sizes, the best assembly was chosen based on the number of contigs, assembly size, N50 and BUSCO completeness score. Therefore, QUAST v3.1 (St. Petersburg Academic University, St. Petersburg, Russia) [30] was used to access the quality of the draft genome in terms of contiguity through three different contig thresholds that were set at 300 bp, 500 bp and 1000 bp respectively. Benchmarking Universal Single-Copy Orthologs (BUSCO) v2.0 (Université de Genève, Geneva, Switzerland) [31] was used to assess the completeness of the draft genome. The hymenoptera_odb9 profile was chosen as the reference profile for this sample. After both assessments, filtering was applied to remove contigs shorter than 1000 bp in order to improve the overall contiguity without losing much genetic information. In other words, all contigs longer than 1000 bp were retained for downstream analyses.

Gene Structural Annotation and Functional Annotation
The draft genome was annotated using the MAKER v3.0 (University of Utah, Salt Lake City, US) [32] genome annotation pipeline which combines repeat masking, different prediction tools with evidence-based quality control and gene-model editing. The custom repeat library was used by RepeatMasker within the MAKER pipeline to mask repetitive elements. Transcript assembly from the same species [14] was adopted as EST evidence. In addition, four different sets of protein sequences from Apis spp., Bombus spp., Melipona spp. and Scaptotrigona spp. were downloaded from a public database and used as evidences to aid gene predictions. As MAKER was run iteratively for three times, repeat masking and evidence alignment were first performed with the following parameters: est = Trinity.fasta, protein = apis.fasta,bombus.fasta,melipona.fasta,scaptotrigona.fasta, model_org = hymenoptera, est2genome = 1 and protein2genome = 1. The resulting general feature format (GFF3) file was used as input for all subsequent MAKER runs. Gene predictions were performed within MAKER using SNAP. For the second and third rounds of MAKER, the following parameters were used: est2genome = 0 and protein2genome = 0 in order to specify ab initio gene prediction. tRNA was predicted using tRNAscan-SE v1.3.1 [33] with default parameters. This was followed by rRNA prediction using RNammer v1.2 [34] by including these parameters: -S euk and -multi. After gene prediction, the full repertoire of peptide sequences (≥33 amino acid) was assessed for completeness using BUSCO v2.0 [31]. The hymenoptera_odb9 profile was chosen as the reference profile. Transposable elements and repetitive regions in the genome were identified using RepeatMasker v4.0.5 (Institute for Systems Biology, Seattle, US) [35] with the following parameters: Hymenoptera as the source species of query DNA, NCBI search engine, without masking the small RNA (pseudo) genes or low complexity DNA or simple repeats but masking only the interspersed repeats using a sensitive slow-search mode.
The Protein homology BLAST search was performed against NCBI Reference Sequence protein (RefSeq) [20] and Swiss-Prot [21] protein databases. Diamond v0.9.22 (The University of Tübingen, Tübingen, Germany) [36] was used to blast the peptide sequences to the RefSeq database, while ncbi-blast v2.7.1+ was used to blast the same peptide set against Swiss-Prot. The cut-off for both BLAST searches was set at a maximum expect value (E-value) of 1e-5. Subsequently, the BLAST outputs from both databases were used for Gene Ontology (GO) [22] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [37] analysis using standalone Blast2GO v2.5 (BioBam, Valencia, Spain) [38]. In addition, protein structure characterization was carried out with local InterProScan lookup service v5.4-47.0 (EMBL-EBI, Cambridgeshire, UK) [39].

Orthologous and Phylogenetic Analysis
Ortholog analysis was carried out using OrthoMCL v2.0.9 (University of Pennsylvania, Philadelphia, US) [40] with default parameters. Three protein data sets from A. mellifera, B. terresteris and E. mexicana together with the protein data of the Malaysian stingless bee were used for orthologous group clustering.
Genome sequences from a total of eleven species (namely the H. itama  Table 2) that were annotated as complete and present in all eleven species were aligned using MAFFT v7.471 (AIST, Tsukuba, JP) [41]. Manual inspection and trimming were carried out for single-copy orthologs with gaps or poor alignments in ≥50% of the sequences. A phylogeny based on the refined concatenated multiple sequence alignments of 39 single-copy orthologs was generated using MrBayes v3.2.7 [42,43] utilizing the Bayesian Markov Chain Monte Carlo (MCMC) algorithm. The analysis was conducted by sampling a mixture of models: fixed rate matrices as well as gamma-distributed variable-and invariable matrices [44]. Input alignments were carried out with sampling frequency for every 500 generations. A burn-in of 25% from the beginning of the cold chain was discarded. An average standard deviation of split frequencies <0.01 was achieved. The plot of generation versus log probability of the data did not show a noticeable trend, and potential scale reduction factor (PSRF) close to 1.0 was set for all parameters. The Bayesian posterior probability of tree was based on a total of 50,000 generations. The phylogenetic tree was visualized using FigTree v1.4.4 (The University of Edinburgh, Edinburg, UK) [45] and rooted using V. germanica.