Estimation of the Whitefly Bemisia tabaci Genome Size Based on k-mer and Flow Cytometric Analyses

Whiteflies of the Bemisia tabaci (Hemiptera: Aleyrodidae) cryptic species complex are among the most important agricultural insect pests in the world. These phloem-feeding insects can colonize over 1000 species of plants worldwide and inflict severe economic losses to crops, mainly through the transmission of pathogenic viruses. Surprisingly, there is very little genomic information about whiteflies. As a starting point to genome sequencing, we report a new estimation of the genome size of the B. tabaci B biotype or Middle East-Asia Minor 1 (MEAM1) population. Using an isogenic whitefly colony with over 6500 haploid male individuals for genomic DNA, three paired-end genomic libraries with insert sizes of ~300 bp, 500 bp and 1 Kb were constructed and sequenced on an Illumina HiSeq 2500 system. A total of ~50 billion base pairs of sequences were obtained from each library. K-mer analysis using these sequences revealed that the genome size of the whitefly was ~682.3 Mb. In addition, the flow cytometric analysis estimated the haploid genome size of the whitefly to be ~690 Mb. Considering the congruency between both estimation methods, we predict the haploid genome size of B. tabaci MEAM1 to be ~680–690 Mb. Our data provide a baseline for ongoing efforts to assemble and annotate the B. tabaci genome.


Introduction
Whiteflies are among the most important insect pests in the world, causing damage to agricultural, horticultural, and ornamental plants. Among the more than 1500 species of whiteflies that have been reported [1], Bemisia tabaci (Gennadius) stands out for its remarkable plasticity in host range (over 1000 species) [2], environmental adaptation, insecticide resistance, fecundity, and ability to disperse, and it has attracted the attention of the agricultural community and scientists worldwide [3].
B. tabaci is responsible for transmitting numerous plant pathogenic viruses, most belonging to the genus Begomovirus (Family: Geminiviridae), as well as other genera such as Carlavirus, Crinivirus, Torradovirus, and Ipomovirus [4][5][6][7][8][9][10]. Begomoviruses are transmitted by whiteflies in a persistent, circulative manner: following acquisition, the virus translocates from the gut into the hemolymph and eventually into the salivary glands, from which it is egested [11]. For some viruses, protection and circulation in the whitefly are mediated, in part, by factors secreted by the whitefly's endosymbionts [12][13][14][15][16]. Historical evidence suggests that these intimate whitefly-Begomovirus relationships are long-standing, and the partners have presumably been co-evolving for millions of years [11,16]. Beyond understanding the fundamental genetics driving whitefly-endosymbiont-virus relationships, developing genetic and genomic resources is important for managing whiteflies and the viruses they transmit.
B. tabaci (Hemiptera: Aleyrodidae) is closely related to aphids, mealybugs, psyllids, and scale insects. Individuals range from 2 to 3 mm in body length and pass through four instars before reaching the adult stage. Under optimal conditions, B. tabaci can undergo 11-15 generations per year, and a single female can lay between 100 and 300 eggs during her 3-6 week lifespan [17][18][19]. Like other members of the ancient Aleyrodidae family, B. tabaci employs a haplodiploid sex determination system, in which fertilized eggs yield diploid females and unfertilized eggs yield haploid males [20,21]. Because plant sap is nutritionally limited, whiteflies, like aphids and many other hemipterans, harbor species-specific mutualistic endosymbionts, which help compensate for deficiencies in certain amino acids [22][23][24][25] and contribute to the dynamics that have defined the plasticity of B. tabaci [14][26][27][28].
B. tabaci is a complex of at least 15 cryptic species, or "biotypes" that exhibit a wide range of genetic and phenotypic variations, including differences in virus vectoring potential, host preference and specificity, endosymbiont composition, resistance to insecticides, and reproductive incompatibility [29][30][31]. Here, using flow cytometry and k-mer analysis, we present a new estimate of the genome size for B. tabaci Middle East-Asia Minor 1 (MEAM1 [32], also referred to as B biotype), one of the most common and broadly distributed whiteflies in the world.
Flow cytometry has been widely adapted for genome size estimation [33]; although the method is fairly straightforward, the accuracy of estimation is highly dependent on the internal standard and quality of the material used for DNA content measurement [34]. With the rapid development of next generation sequencing, k-mer analysis has recently been developed as an alternative means of genome size estimation [35]. Using both flow cytometry and k-mer analysis, we estimate the B. tabaci genome size to be ~680-690 Mb, an estimation that is ~300 Mb smaller than that reported previously using the flow cytometry method alone [36]. Considering our sample material was obtained from whiteflies collected in North America from an isogenic colony, these studies complement and build upon a recent study that estimated the genome size of B. tabaci populations in China to be ~640-682 Mb [37]. This updated genome size estimate will serve as the baseline for ongoing genome assembly and annotation efforts.

Results
To estimate the genome size of B. tabaci using the k-mer approach, we first generated genome sequences using Illumina deep sequencing technology. Genomic DNA extracted from haploid male individuals derived from an isogenic colony of B. tabaci MEAM1 was used to construct three paired-end libraries with insert sizes of approximately 300 bp, 500 bp and 1 Kb, respectively. Each library was sequenced in paired-end mode with a read length of 150 bp. A total of approximately 50 billion base pairs of high-quality cleaned sequences were obtained for each of the three libraries, providing sufficient data for a robust k-mer analysis (Table 1). Different k-mer sizes (17, 25, 27 and 99) were tested and all gave nearly identical results for genome size estimation. In this study, results derived from a k-mer size of 27 (27-mer) are presented. Left (R1) and right (R2) paired-end reads from the library with an insert size of ~300 bp were analyzed separately to avoid duplicate counting of 27-mers in the overlap regions of the read pairs. Unique 27-mers were identified in each library from these sequences and the frequency of their occurrences (depths) was derived (Figure 1). The 27-mer depth distributions showed a minor peak on the left side, indicating a low level of possible heterogeneity among the male individuals collected for sequencing. Based on our k-mer analysis, the genome size of B. tabaci MEAM1 was estimated to be 666.

We subsequently determined the nuclear DNA content of both haploid male and diploid female whiteflies from our isogenic colony using flow cytometry. The analysis was performed with four replicates each for males and females. The nuclear DNA content of haploid males was estimated to be 0.73 pg, 0.70 pg, 0.71 pg and 0.69 pg, respectively, and that of diploid females was estimated to be 1.38 pg, 1.38 pg, 1.44 pg and 1.44 pg, respectively, using chicken red blood cells (CRBCs) as the internal standard for comparison (2C = 2.5 pg) (Table 2).
As expected, the nuclear DNA content of diploid females was approximately twice that of haploid males. Our results demonstrate that the estimated genome sizes of whiteflies using flow cytometry were highly reproducible. Furthermore, histograms produced distinct peaks for both male and female samples, as well as for the CRBC standards, providing confidence in our measurements (Figure 2). Together, our estimates using k-mer and flow cytometric analysis were highly consistent and suggested the haploid genome size of B. tabaci MEAM1 to be around 680-690 Mb.
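As a quick arithmetic check on the replicate values reported above, the following minimal sketch averages the four male and four female measurements, computes the female-to-male ratio, and converts picograms to megabases. The conversion factor of 978 Mb per pg is an assumption that is not stated in the text (it is a commonly used value), and the variable names are illustrative only.

```python
from statistics import mean

# Replicate nuclear DNA contents (pg) as reported in the text.
male_pg = [0.73, 0.70, 0.71, 0.69]
female_pg = [1.38, 1.38, 1.44, 1.44]

# Assumption (not stated in the text): 1 pg of DNA corresponds to ~978 Mb.
MB_PER_PG = 978.0

male_mean = mean(male_pg)          # haploid (1C) content
female_mean = mean(female_pg)      # diploid (2C) content
ratio = female_mean / male_mean    # expected to be close to 2

# Haploid genome size implied by each sex under the assumed conversion.
haploid_mb_from_males = male_mean * MB_PER_PG
haploid_mb_from_females = (female_mean / 2) * MB_PER_PG

print(f"male mean:   {male_mean:.4f} pg")
print(f"female mean: {female_mean:.4f} pg")
print(f"female/male ratio: {ratio:.2f}")
print(f"haploid size from males:   {haploid_mb_from_males:.0f} Mb")
print(f"haploid size from females: {haploid_mb_from_females:.0f} Mb")
```

Under the assumed conversion factor, both sexes give a haploid size close to 690 Mb, in line with the flow cytometric estimate quoted in the text.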

Discussion
Despite its international notoriety as an insect pest and vector of plant pathogenic viruses, little is known about the genome of the whitefly Bemisia tabaci. Accurate determination of the whitefly genome size is an essential starting point for downstream genomic and genetic analyses. Estimation of genome size by k-mer analysis has recently been applied in various genome sequencing projects [35]. Our k-mer analysis of deep genome sequencing data from three different Illumina libraries produced a genome size estimate of around 680 Mb for B. tabaci MEAM1, and flow cytometric analysis produced an estimate of approximately 690 Mb. Together, these data provide robust evidence that the haploid genome size of B. tabaci is ~680-690 Mb, a value that is ~300 Mb smaller than that reported previously [36]. We speculate that this discrepancy is largely due to improvements in the estimated DNA content of commonly used standards [33,34], as well as to differences in the materials used. In the previous report [36], the DNA content assumed for the chicken red blood cells used as a size standard was 2C = 3.0 pg. In the current flow cytometric analysis, the value assumed for the chicken red blood cell internal standard was 2C = 2.5 pg, the value used in the original flow cytometry study on genome size estimation of many plant species [33,34].
Whiteflies used in the present study were obtained from a colony that was established from a single female B. tabaci MEAM1 collected in the United States in 2013. The use of samples from this isogenic colony was important because it reduced the amount of heterogeneity among whiteflies (as shown in Figure 1), which will also contribute to the assembly and annotation process of ongoing genome sequencing efforts. Our genome size estimate for B. tabaci MEAM1 from North America was also confirmed recently by a different team using a Chinese colony of whiteflies (B and Q biotypes) [37]. Together, using whiteflies from North America and China, these two independent studies have provided strong evidence for a new estimated genome size for B. tabaci MEAM1. This new genome size is still relatively large in comparison to that of other insects, including fruit fly (Drosophila melanogaster) (180 Mb), mosquito (Anopheles gambiae) (278 Mb), pea aphid (Acyrthosiphon pisum) (464 Mb), and silkworm (Bombyx mori) (530 Mb) [38][39][40][41].
Insects that have undergone extreme reduction or expansion in genome size, such as the polar-inhabiting Antarctic midge (Belgica antarctica) (99 Mb) [42] and the migratory locust (Locusta migratoria) (6500 Mb) [43], show dramatic changes in genomic organization. In the Antarctic midge, the reduced genome was attained predominantly not through the loss of protein-encoding genes, but rather through reductions in intron length and in the number of repeat elements [42]. Similarly, in the migratory locust, although the expansion of locus-specific gene families contributed to the large genome size, most of the expansion was due to the proliferation of repeat elements and increased intron length [43]. In light of these examples of radical genomic change, although the B. tabaci genome is nearly four times larger than that of the fruit fly, B. tabaci is anticipated to possess a similar number of genes, with expansions having likely occurred in highly repetitive intergenic and intronic regions [36]. Considering the biological novelty of B. tabaci as a complex of cryptic species with a rich co-evolutionary history with begomoviruses and other whitefly-transmitted viruses, it will be interesting to uncover the factors that have shaped the size and sequence of the B. tabaci genome. Together with our ongoing efforts to sequence and annotate the whitefly genome, the data presented here will contribute to this fundamental understanding.

Insect Rearing
An isogenic whitefly colony was established from a single female B. tabaci MEAM1 (B biotype) collected from a parent colony in April 2013 at the USDA-ARS U.S. Vegetable Laboratory in Charleston, South Carolina, USA. The resulting isogenic population was maintained on caged collards (Brassica oleracea ssp. acephala de Candolle) in a greenhouse (26 °C ± 5 °C). The identity of the MEAM1 population was confirmed by PCR using established primers that amplify the mitochondrial cytochrome oxidase 1 gene [44]. The parent whitefly colony originated from a collection of adult B. tabaci from a sweetpotato (Ipomoea batatas) field at the U.S. Vegetable Laboratory in 2011. The whiteflies in the parent colony were reared on collard, squash (Cucurbita pepo) and tomato (Solanum lycopersicum).

DNA Extraction from Male Whiteflies
Whiteflies were aspirated from collards and immediately frozen at −80 °C for 24 h. The whiteflies were then transferred to a Petri dish embedded in a block of ice. Using a stereoscope, over 6500 male individuals were collected, pooled, and immediately homogenized with a micropestle in ice-cold lysis buffer (10 mM Tris-EDTA, 0.1 M EDTA pH 8.0, 0.5% w/v SDS, 1% β-mercaptoethanol). The homogenate was treated with RNase A (Qiagen, Valencia, CA, USA) for 1 h at 37 °C, followed by proteinase K (New England Biolabs, Ipswich, MA, USA) for 2 h at 50 °C. DNA was extracted using phenol:chloroform:isoamyl alcohol (25:24:1), ethanol precipitated, and rehydrated in TE (pH 8.0) overnight at 4 °C. DNA integrity was analyzed via gel electrophoresis and its quantity was determined on a NanoDrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA).

Genome Size Estimation by k-mer Analysis
Three paired-end libraries with insert sizes of approximately 300 bp, 500 bp and 1 Kb, respectively, were constructed using the Illumina TruSeq DNA sample preparation kit following manufacturer's instructions, and each library was sequenced on one lane of the Illumina HiSeq 2500 system with the paired-end 150-bp mode at the Roy J. Carver Biotechnology Center, University of Illinois. Raw Illumina reads were processed to remove duplicated read pairs and only one read pair from the duplicates was kept. Duplicated read pairs were defined as those having identical bases in the first 100 bp of both left and right reads. Illumina adaptor and low quality sequences (quality score < 20) were removed from the reads using the ShortRead package [45]. Finally, errors in the Illumina sequencing reads were further corrected using Quake [46].
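The duplicate-removal rule described above (read pairs are duplicates when the first 100 bp of both the left and right reads are identical) can be sketched as follows. This is only an illustration of the stated rule, not the pipeline actually used; the function name `dedup_read_pairs` is hypothetical, and keeping the first pair encountered is an assumption, since the text says only that one pair from each set of duplicates was kept.

```python
def dedup_read_pairs(pairs, prefix_len=100):
    """Return read pairs with duplicates removed.

    Two pairs are treated as duplicates when the first `prefix_len` bases
    of both the left and the right read are identical. The first pair seen
    for each key is kept (an assumption; the text does not say which one).
    """
    seen = set()
    kept = []
    for left, right in pairs:
        key = (left[:prefix_len], right[:prefix_len])
        if key not in seen:
            seen.add(key)
            kept.append((left, right))
    return kept


# Toy example with a 4-base prefix so the behaviour is easy to follow.
toy_pairs = [
    ("ACGTAAAA", "TTTTCCCC"),
    ("ACGTGGGG", "TTTTAAAA"),  # same 4-base prefixes as the first pair
    ("ACGTAAAA", "GGGGCCCC"),  # right-read prefix differs, so it is kept
]
toy_kept = dedup_read_pairs(toy_pairs, prefix_len=4)
print(len(toy_kept), "of", len(toy_pairs), "pairs kept")
```

Using a set of prefix tuples keeps memory proportional to the number of distinct pairs; for a full HiSeq lane a streaming or hashing approach would likely be needed instead.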
High-quality cleaned Illumina sequences from each of the three genomic libraries were subjected to k-mer counting using SOAPec (v2.01) in the SOAPdenovo package [47] with the k-mer size set to 27. Because the read length was 150 bp, for the library with insert size of 300 bp, k-mers were counted separately for left and right paired-end reads in order to avoid possible duplicate counting of k-mers in the overlap region of the read pairs. K-mer depth distribution was then derived and the peak value of the depth distribution identified. In shot-gun genome sequencing, the short reads are assumed to be randomly generated, so any k-mers in the reads also occur randomly. Their depth of coverage follows the Poisson distribution [48] and the mean of k-mer depth should be equal to the peak value of the k-mer depth distribution. Thus, genome size can be estimated using the following formula: Genome size = total number of k-mers/peak value of k-mer frequency distribution.
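The formula above can be illustrated on simulated data. The sketch below is a toy, pure-Python stand-in for the k-mer counting step (it does not use SOAPec): it draws error-free reads from a random genome, counts k-mers on the forward strand only, finds the peak of the depth distribution, and divides the total number of k-mers by that peak. The simulation parameters, the function names, and the `min_depth` cutoff for skipping low-depth k-mers are all assumptions made for illustration.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Genome size = total number of k-mers / peak depth of the distribution.

    `min_depth` (an assumption, not from the text) skips very low depths,
    where sequencing-error k-mers would dominate in real data.
    """
    # Depth histogram: depth -> number of distinct k-mers seen at that depth.
    depth_hist = Counter(kmer_counts.values())
    candidates = {d: n for d, n in depth_hist.items() if d >= min_depth}
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth


# Simulate shotgun reads from a random toy genome (hypothetical parameters).
rng = random.Random(0)
TRUE_SIZE, READ_LEN, N_READS, K = 20_000, 100, 6_000, 17
genome = "".join(rng.choice("ACGT") for _ in range(TRUE_SIZE))
starts = [rng.randrange(TRUE_SIZE - READ_LEN + 1) for _ in range(N_READS)]
reads = [genome[s:s + READ_LEN] for s in starts]

estimate, peak = estimate_genome_size(count_kmers(reads, K))
print(f"peak depth: {peak}, estimated size: {estimate:.0f} bp (true: {TRUE_SIZE})")
```

Because the peak is an integer depth, the estimate on such a small simulation is only approximate; with real data the peak is usually located on a smoothed or fitted depth distribution.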

Nuclear DNA Content Estimation by Flow Cytometry
The procedure used to analyze the nuclear DNA content of the whiteflies was modified from Arumuganathan and Earle [33]. Ten male or female whiteflies fixed in 95% ethanol were chopped vigorously and suspended in a 0.5 mL solution of 10 mM MgSO4·7H2O, 50 mM KCl, 5 mM HEPES, pH 8.0, 3 mM dithiothreitol, 0.1 mg/mL propidium iodide, 1.5 mg/mL DNase-free RNase and 0.25% Triton X-100. The suspended nuclei were withdrawn using a pipette, filtered through a 30 µm nylon mesh, and incubated at 37 °C for 30 min before flow cytometric analysis. Suspensions of sample nuclei were spiked with a suspension of standard nuclei (prepared in the above solution) and analyzed with a FACSCalibur flow cytometer (Becton-Dickinson, San Jose, CA, USA). For each measurement, the propidium iodide fluorescence area signals (FL2-A) from 1000 nuclei were collected and analyzed with the CellQuest software (Becton-Dickinson), which was also used to determine the mean positions of the G0/G1 (nuclei) peaks of the sample and the internal standard. The mean nuclear DNA content of each sample, measured in picograms, was based on 1000 scanned nuclei. Nuclei from chicken red blood cells (2.5 pg/2C) [33,34] were used as an internal standard. Each experiment was conducted with four replicates for male or female whiteflies. Whitefly nuclear DNA content was derived using the following formula: Nuclear DNA content = (Mean position of whitefly nuclei peak/Mean position of CRBC nuclei peak) × 2.5 pg.
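The formula above is a simple ratio against the internal standard, and can be sketched as below. The peak positions in the example are hypothetical values chosen for illustration, not measurements from this study, and the function name `nuclear_dna_content_pg` is likewise illustrative.

```python
CRBC_2C_PG = 2.5  # DNA content of the chicken red blood cell standard (from the text)


def nuclear_dna_content_pg(sample_peak_mean, standard_peak_mean,
                           standard_pg=CRBC_2C_PG):
    """Nuclear DNA content (pg) from G0/G1 peak positions.

    content = (sample peak mean / standard peak mean) * standard DNA content
    """
    if sample_peak_mean <= 0 or standard_peak_mean <= 0:
        raise ValueError("peak positions must be positive")
    return sample_peak_mean / standard_peak_mean * standard_pg


# Hypothetical peak positions (arbitrary fluorescence units), for illustration.
sample_peak, crbc_peak = 56.0, 200.0

content_2p5 = nuclear_dna_content_pg(sample_peak, crbc_peak)
# The same peaks scored against a 3.0 pg standard give a proportionally
# larger value, which shows how sensitive the result is to the standard.
content_3p0 = nuclear_dna_content_pg(sample_peak, crbc_peak, standard_pg=3.0)

print(f"with 2C = 2.5 pg standard: {content_2p5:.2f} pg")
print(f"with 2C = 3.0 pg standard: {content_3p0:.2f} pg")
```

Because the result scales linearly with the value assigned to the standard, the choice of that value directly affects the reported genome size.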

Conclusions
In this study, we provide a new genome size estimate for B. tabaci (MEAM1), one of the most destructive and invasive whitefly species in the world. Results presented in this study were obtained from a population of B. tabaci MEAM1 from North America. Specifically, the insects used in this study were from a population that originated from a field site where this whitefly occurs year-round and feral populations have been known at this location since 1990 (originally reported as B. argentifolii) [49]. We collected the haploid male whiteflies derived from a single isoline colony to reduce the level of heterogeneity in the population and utilized flow cytometry and k-mer analysis to conclude that the haploid genome of B. tabaci MEAM1 is ~680-690 Mb. Flow cytometry on diploid female whiteflies predicted a genome size approximately twice that of males, further validating our estimations. This new estimation is ~300 Mb smaller than previously reported [36]. Our study provides a starting point for sequencing and annotating the whitefly genome, and will ultimately contribute to our understanding of whitefly biology and the generation of novel pest management strategies.