Full-Length SMRT Transcriptome Sequencing and SSR Analysis of Bactrocera dorsalis (Hendel)

Simple Summary In this study, a full-length transcriptome was analyzed with single-molecule real-time (SMRT) sequencing, which was first used to discover simple sequence repeat (SSR) genetic markers from B. dorsalis. Moreover, SSR markers from isoforms were screened for the identification of species diversity. These results could provide molecular biology methods for further population research. Abstract Bactrocera dorsalis (Hendel), as one of the most notorious and destructive invasive agricultural pests in the world, causes damage to over 250 different types of fruits and vegetables throughout tropical and subtropical areas. PacBio single-molecule real-time (SMRT) sequencing was used to generate the full-length transcriptome data of B. dorsalis. A total of 40,319,890 subreads (76.6 Gb, clean reads) were generated, including 535,241 circular consensus sequences (CCSs) and 386,916 full-length non-concatemer reads (FLNCs). Transcript cluster analysis of the FLNC reads revealed 22,780 high-quality reads (HQs). In total, 12,274 transcripts were functionally annotated based on four different databases. A total of 1978 SSR loci were distributed throughout 1714 HQ transcripts, of which 1926 were complete SSRs and 52 were complex SSRs. Among the total SSR loci, 2–3 nucleotide repeats were dominant, occupying 83.62%, of which di- and tri- nucleotide repeats were 39.38% and 44.24%, respectively. We detected 105 repeat motifs, of which AT/AT (50.19%), AC/GT (39.15%), CAA/TTG (32.46%), and ACA/TGT (10.86%) were the most common in di- and tri-nucleotide repeats. The repeat SSR motifs were 12–190 bp in length, and 1638 (88.02%) were shorter than 20 bp. According to the randomly selected microsatellite sequence, 80 pairs of primers were designed, and 174 individuals were randomly amplified by PCR using primers. The number of primers that had amplification products with clear bands and showed good polymorphism came to 41, indicating that this was a feasible way to explore SSR markers from the transcriptomic data of B. dorsalis. These results lay a foundation for developing highly polymorphic microsatellites for researching the functional genomics, population genetic structure, and genetic diversity of B. dorsalis.


Introduction
Bactrocera dorsalis (Hendel) (Diptera: Tephritidae), known as the oriental fruit fly, is one of the most devastating and highly invasive agricultural pests in the world, causing severe damage to over 250 species of commercial fruits and vegetables [1][2][3][4][5]. The broad range of distribution, the large number of host plants, and the complex interactions between B. dorsalis and diverse environments may be due to its high genetic variation, which makes it challenging to manage this pest. China's Guangxi Zhuang Autonomous Region is an excellent vegetable-and fruit-growing region because of its superior natural geographical conditions. Moreover, Guangxi is located in the south of China, which makes B. dorsalis a serious pest with invasive problems and domestic outbreaks.
Simple sequence repeats (SSRs), also known as microsatellites or short tandem repeats (STRs), are tandem repeats of 1-6 nucleotides. SSR markers commonly present high levels

RNA Extraction and SMRT Sequencing
Total RNA was extracted by grinding mixed samples of B. dorsalis in TRIzol reagent (Life Technologies, Carlsbad, CA, USA) on ice and processed according to the protocol provided by the manufacturer. The integrity of the RNA was determined with an Agilent 2100 bioanalyzer through agarose gel electrophoresis. The purity and concentration of the RNA were determined with a Nanodrop micro-spectrophotometer (Thermo Fisher, Waltham, MA, USA). The mRNA enriched using Oligo (dT) magnetic beads was reversetranscribed into cDNA using a Clontech SMARTer PCR cDNA synthesis kit (NO634926). The optimal amplification cycle number for downstream large-scale PCR reactions was determined through PCR cycle optimization. The optimized cycle number was used to generate double-stranded cDNA. In addition, >5 kb size selection was performed with a BluePippin TM Size-Selection System, and this was mixed equally with the no-size-selection cDNA. Large-scale PCR was performed for SMRT bell library construction. cDNAs were DNA damage-repaired, end-repaired, and ligated to sequencing adapters. The SMRT bell template was annealed to a sequencing primer, bound to polymerase, and sequenced on a PacBio Sequel II platform by Gene De novo Biotechnology Co (Guangzhou, China). The raw sequencing reads of the cDNA libraries were classified and clustered into transcript consensus with the SMRT Link v5.0.1 pipeline [32] supported by Pacific Biosciences. Briefly, circular consensus sequence (CCS) reads were extracted from subread BAM files with a minimum complete pass of 1 and minimum read score of 0.65. The integrity of the transcripts was evaluated according to whether the CCS reads contained 5 primer, 3 primer, and poly-A structures. The sequences containing all three structures are called full-length sequences (FLs). Afterwards, primers, barcodes, and the poly-A tail trimming of complete passes were removed to produce full-length nonchimeric (FLNC) reads. Reads shorter than 50 bp were discarded. Subsequently, the entire isoform was generated by clustering the FLNC reads. A consistency sequence (rough consensus isoforms) was obtained by hierarchically clustering similar FLNC reads with minimap2. Then, the quiver algorithm was used to correct the consistency sequence. For improving the accuracy of PacBio reads, we used two strategies. First, we used non-full-length reads to polish the obtained cluster consensus isoforms with the Quiver software to produce FL, polished, high-quality consensus sequences (accuracy of ≥99%). Second, Illumina short reads were obtained from the same samples with the LoRDEC tool v0.8 [33] to further correct lowquality isoforms.

Functional Annotation and Structure Analysis
A reference genome (ncbi_GCF_000789215.1) sequence was used to measure the accuracy of the generated Iso-Seq reads. Then, we mapped the corrected high-quality consensus sequences to the reference genome with GMAP [34], and redundant transcripts were collapsed with a minimum identity of 95% and minimum coverage of 99%. We compared the finally obtained isoforms with the reference genome annotation and then classified them into three groups: known, novel, and new isoforms. The new isoforms were BLAST analyzed against the NCBI non-redundant protein (Nr) database (http:// www.ncbi.nlm.nih.gov, accessed on 1 October 2019), Swiss-Port protein database (http:// www.expasy.ch/sprot, accessed on 1 October 2019), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.jp/kegg, accessed on 1 October 2019) with the BLASTx program (http://www.ncbi.nlm.nih.gov/BLAST/, accessed on 1 October 2019) at an E-value threshold of 1e−5 for the evaluation of their sequence similarity with the genes of other species and investigation of their functions. Gene Ontology (GO) annotation was analyzed with the Blast2GO [35] software with the Nr annotation results of the isoforms. We selected the isoforms that were not shorter than 33 high-scoring segment pair hits to conduct the Blast2GO analysis and then used WEGO software to perform functional classification of isoforms [36].
Long-chain noncoding RNA (LncRNA) candidates were identified using CPC [37], a predictor of messenger RNAs based on the CNCI v2.0 [38]. LncRNA with >200 nucleotides was selected. ANGEL [39] was used to identify open reading frames (ORFs) in the fulllength transcriptome of B. dorsalis. A previously described method was used to align all non-redundant HQ transcripts. Candidate (alternative splicing) AS events were identified using the SUPPA [40].

EST-SSR Detection and Primer Design
To identify microsatellites in the functional isoforms, we used the MicroSAtellite identification tool (MISA) v1.0 (http://pgrc.ipk-gatersleben.de/misa/, accessed on 1 October 2019). The SSR loci were identified based on the minimum number of repetitions of each unit size, which were 2-6, 3-5, 4-4, 5-4, and 6-4. In an interrupted composite microsatellite, the maximum number of bases for two SSRs was 100. EST-SSR primers were designed using Primer 3.0 (release 1.1.4). Mononucleotide repeats were not selected for subsequent trial analysis due to the possibility of mismatching during sequencing. The primer was designed with the following four principles. Firstly, the optimal primer length was 20 bp and could extend to 18-27 bp if needed; secondly, the optimal temperature was 60 • C and could extend to 57-63 • C; meanwhile, the difference in Tm values with upstream and downstream primers had to be less than 5 • C. Thirdly, the optimal GC content was 50% and could extend to 30-70%; fourthly, the PCR amplicon should have had a length of 100-300 bp [41].

Amplification and Validation of EST-SSRs
In the primer polymorphism verification experiments, nondenaturing polyacrylamide gel was used. In order to screen out primers with polymorphisms, the 80 primer pairs mentioned above were synthesized by the Shanghai Biological Engineering (Shanghai) Company. Polymerase chain reaction (PCR) was performed in a 10 µL reaction volume containing 5 µL of 2×Taq PCR MasterMix (TIANGEN Co., Ltd. Shanghai, China), configuring the PCR system according to the reagent instructions. The PCR system also included 1 µL of template DNA (Table 1), 0.5 µL of each primer (10 µmol/L), and 3 µL of ddH 2 O. PCR amplification was performed using the following temperature program in PCR Amplifier (T100 TM Thermal Cycler, BIO-RAD, Hercules, CA, USA): melting at 94 • C for 3 min; 32 cycles of denaturation at 94 • C for 30 s, with an annealing temperature depending on the primer for 30 s, and extension at 72 • C for 30 s; extension at 72 • C for 5 min; preservation at 4 • C. We used 8% nondenaturing polyacrylamide gel to run the PCR amplification products on a vertical plate electrophoresis apparatus when selecting polymorphic primers.

SMRT Sequencing
The PacBio Sequel system generated a total of 40,319,890 subreads with an average read length of 1009 bp. The N50 was 1659 bp. After self-correction and merging, 535,241 circular consensus sequences (CCSs) ( Figure 1A) with a mean length of 1756 bp were formed. The subreads also generated 386,916 full-length non-chimeric sequences (FLNCs) ( Figure 1B) with an average length of 1622 bp. A total of 22,780 high-quality (HQ) isoforms with ≥99% accuracy were obtained after clustering and removing redundant sequences ( Figure 1C). These isoforms were identified and mapped to 12,274, corresponding to 5365 known isoforms and 5321 new isoforms that were not annotated previously (Table 2).   The left ordinate represents the number of sequences of this length, and the right ordinate represents the number of sequences whose length is greater than a certain value (X-axis).

Functional Annotation
The functions of the HQ transcripts were annotated; then, the clusters of the Nr, Swiss-Port, GO, and KEGG annotations were used to elucidate the functions of the nonredundant isoforms. Of the 12,274 transcripts analyzed, 1,124 could not be functionally annotated to any of the used databases, and 2,412 were shared among the four databases ( Figure 2). After that, all transcripts were aligned to the NCBI non-redundant protein database (Nr). With its comprehensive content and the inclusion of species information in the annotated results, the Nr database is a non-redundant protein database that can be used to classify homologous species. The results showed that 11,150 (90.84%) HQ tran- The left ordinate represents the number of sequences of this length, and the right ordinate represents the number of sequences whose length is greater than a certain value (X-axis).

Functional Annotation
The functions of the HQ transcripts were annotated; then, the clusters of the Nr, Swiss-Port, GO, and KEGG annotations were used to elucidate the functions of the nonredundant isoforms. Of the 12,274 transcripts analyzed, 1124 could not be functionally annotated to any of the used databases, and 2412 were shared among the four databases ( Figure 2). After that, all transcripts were aligned to the NCBI non-redundant protein database (Nr). With its comprehensive content and the inclusion of species information in the annotated results, the Nr database is a non-redundant protein database that can be used to classify homologous species. The results showed that 11,150 (90.84%) HQ transcripts were annotated using Nr and exhibited homology with known proteins of various species, including B. dorsalis (4663, 41.82%), B. cucurbitae (178, 1.60%), and C. capitata (164, 1.47%) ( Table 3). Moreover, Bactrocera was the dominant genus by far, accounting for approximately half of all Nr annotation results. In total, 6102 (49.71%) isoforms were annotated into 49 sub-categories of the three main GO categories: biological processes (BPs), cellular components (CCs), and molecular functions (MFs) (Figure 3). Among them, 20 sub-categories, 18 sub-categories, and 11 sub-categories were annotated in BPs, CCs, and MFs. The top ten sub-categories were cellular processes (3240), metabolic processes (3138), binding (2972), catalytic activity (2962), single-organism processes (2961), cells (2472), cell parts (2472), organelles (1798), macromolecular complexes (1251), and biological regulation (1118). In addition, 3169 (25.82%) isoforms were identified in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and grouped into 162 KEGG pathways that were categorized into five broad categories: metabolism (1832, 57.81%), cellular processes (364, 11.49%), organismal systems (173, 5.46%), environmental information processing (276, 8.71%), and genetic information processing (1039, 32.79%) ( Table 4). Among all of the sub-categories, 'metabolic pathways', 'biosynthesis of secondary metabolites', and 'oxidative phosphorylation' were the top three. Using Swiss-Port, 8,675 isoforms were annotated. Then, the CPC (http://cpc.cbi.pku.edu.cn/, accessed on 1 October 2019) and CNCI were used to predict the coding ability of the new genes and transcripts.

Predictive Analysis of SSRs
Among the 12,274 evaluated sequences, 1978 SSRs were identified from the 1714 SSR-containing sequences. The frequency of occurrence of SSRs was 13.96% (total number of HQ transcripts containing SSRs/total number of HQ transcripts examined). The distance of the average distribution was 11.00 kb (total length/total number of independent genes used to find SSRs), and the SSRs appeared at a frequency of 16.12% (total number of SSRs identified/total number of HQ transcripts examined). The number of repetitive SSR motifs was 105. The identification results showed that there were a total of 286 transcript sequences with more than one EST-SSR locus, and a total of 52 SSRs were presented in compound form. Mononucleotide repeats are prone to mismatch during sequencing, leading to low sequencing quality, so they could not be selected in subsequent experimental analyses.

Verification of Novel and Polymorphic EST-SSRs
The development of primers contributes to the basis for further research on the genetic structure and diversity of species. In our study, 174 samples from different geographical populations of Guangxi were subjected to PCR amplification using 80 pairs of newly developed EST-SSRs. Of these 80 EST-SSRs, although five failed to generate a product, the rest of the 75 primer pairs successfully resulted in amplification. Among the 75 primer pairs, six exhibited poor universal applicability, three produced multiple bands, and 12 were monomorphic. Of the remaining 54 primer pairs with the capacity to generate polymorphic amplification products, 13 primer pairs generated unstable and unclear amplification, and the 41 others produced stable and clear amplification products. Details of these 41 primer pairs are shown in Table 6. The detailed information of the electrophoretogram of the nondenaturing polyacrylamide gel is shown in Table S1 and Figure S3.

Discussion
The transcriptome data were compared with the reference genome of B. dorsalis. The percentage of the sequences in the reference genome was more than 76%. More than 75% of the sequences were unique-mapped against the reference genome, and no more than 2% were multiple-mapped. In addition, 90.84% of the query sequences had comments for the four databases, indicating that the quality of the transcriptome was satisfactory.
In the study of gene annotation, to obtain gene function information, we can classify lots of new transcripts. Of the Nr databases, 4663 were annotated in B. dorsalis, accounting for 41.82% of the total annotated isoforms. This result not only indicates the close relationship between B. dorsalis and Bactrocera species, but also explains the correct assembly and annotation of this transcript library. A metabolic pathway analysis of 3169 isoforms of B. dorsalis was carried out with the KEGG database, which provided essential data for the next step of the mining of functional genes and functional verification of target genes.

Conclusions
In the current study, we obtained 76.6 Gb of clean reads, and approximately 99.3% of the high-quality transcripts were almost greater than 1000 bp in length. In addition, 12,274 sequences were successfully annotated, and 1978 EST-SSRs-excluding mononucleotide repeats-were identified. Moreover, because of their high polymorphism, 41 of these EST-SSRs were proved to be reliable molecular markers in research on B. dorsalis. Overall, with these large amounts of transcription data, most genetic analyses, such as the discovery of new genes, gene function verification, and study of genetic diversity, will be facilitated. In addition, the markers mentioned above will help reveal the genetic relationships of B. dorsalis and its related species in terms of functional molecular markers.