Genome Size Estimation of Callipogon relictus Semenov (Coleoptera: Cerambycidae), an Endangered Species and a Korea Natural Monument

Simple Summary
The longhorned beetle Callipogon relictus has been designated a class I endangered species in Korea since 2012. As a step towards conservation of this beetle, we estimated its genome size at 1.8 ± 0.2 Gb, representing one of the largest cerambycid genomes. This study provides useful insight at the genome level and facilitates the development of an effective conservation strategy.

Abstract
We estimated the genome size of a relict longhorn beetle, Callipogon relictus Semenov (Cerambycidae: Prioninae), which is Korean natural monument no. 218 and a class I endangered species, using a combination of flow cytometry and k-mer analysis. The two independent methods enabled accurate estimation of the genome size in Cerambycidae for the first time. The genome size of C. relictus was 1.8 ± 0.2 Gb, representing one of the largest cerambycid genomes studied to date. An accurate estimate of the genome size of a critically endangered longhorned beetle is a major milestone in our understanding and characterization of the C. relictus genome. Ultimately, the findings provide useful insight into insect genomics and genome size evolution, particularly among beetles.


Introduction
The longhorned beetle genus Callipogon Audinet-Serville, 1832 [1] (Coleoptera: Cerambycidae) consists of five subgenera including nine species worldwide. Only a single species is found in East Asia (Korea, China, Far Eastern Russia), while the remaining species are distributed across Central and South America, including Mexico, Guatemala, and Colombia [2,3]. The relict longhorn beetle Callipogon relictus Semenov, 1898 [4] is the sole Asian representative of the genus, and thus represents one of the most intriguing insects in the Palearctic region, both in terms of its biogeographical and ecological history and its unequivocal importance for conservation. As such, C. relictus has been strictly protected in Korea as national natural monument no. 218 since 1968, and as a class I endangered species since 2012. C. relictus in East Asia has been suggested to represent a biogeographical link between the faunas of the Old World and New World, dating from when the Bering land bridge was exposed above sea level [5,6].
C. relictus requires five to six years to complete the life cycle under natural conditions [7]. The host plant records suggest that C. relictus is polyphagous, feeding on 17 different species of broadleaved trees belonging to seven families [8]. In particular, C. relictus larvae were found to feed mainly on Quercus spp. and Carpinus laxiflora in the Gwangneung Forest, Korea, based on the investigation conducted by the Korea National Arboretum, and on Ulmus davidiana var. japonica in the Ussuri Nature Reserve, Russia (Kuprin, A.V., pers. comm.). This exceptionally wide host range is plausible given that C. relictus is primarily fungivorous, deriving nutrients during larval development from fungal mycelia in decaying wood, similar to most other saproxylophagous beetles [9].
To date, a total of 138 insect genomes have been sequenced. However, only eight of them represent Coleoptera, most of which are regarded as important insect pests, including Agrilus planipennis (Buprestidae); Anoplophora glabripennis (Cerambycidae); Dendroctonus ponderosae and Hypothenemus hampei (Curculionidae); Leptinotarsa decemlineata (Chrysomelidae); and Tribolium castaneum (Tenebrionidae) [10]. Only McKenna et al. (2016) have published the entire genome sequence of a longhorn beetle, A. glabripennis, with estimated genome sizes of 981 and 970 Mb in female and male individuals, respectively. Based on comparative genomic analyses, McKenna et al. (2016) concluded that the expansion and functional differentiation of the genes associated with specialized plant feeding facilitated the adaptation of A. glabripennis to a variety of new host plants in its new habitat [11].
The two common approaches for estimating genome size are flow cytometry and k-mer analysis. Flow cytometry is a fluorescence-based technique that detects the intensity of fluorescence emitted by DNA stained with propidium iodide [12]. As a relatively quick and reliable method for accurately estimating the size of even large genomes, flow cytometry has been widely used to analyze various insect genomes, such as those of the firefly [13], the stick insect Clitarchus hookeri [14], a Neotropical mutualistic ant [15], and Helicoverpa moths [16]. Nevertheless, the application of flow cytometry is limited by the availability of intact tissue [17], and the estimate is also affected by chromatin condensation and the proportion of cells in the G0 to G1 phases. Given that insect tissues may show high levels of endoreplication, the use of appropriate tissue for the analysis and the selection of a proper standard species with a well-known genome size are critical for accurate size estimation using flow cytometry [16,18].
In contrast, k-mer analysis is a sequence-based estimation that uses high-throughput sequencing data and is therefore independent of the stage of the cell cycle and of the integrity of the tissue used. This method also facilitates measurement of genome properties, such as the rate of heterozygosity [19,20]. Nonetheless, k-mer-based estimates alone are easily affected by repetitive elements in the genome [21-23], and may result in underestimation of genome size [24]. Given this caveat, the k-mer approach has often been used in conjunction with flow cytometry, particularly in studies involving de novo assembly of arthropod genomes (e.g., the spider Dysdera silvatica [23] and caddisflies [22]).
In this study, we employed both flow cytometry and k-mer analysis to deduce the genome size of the critically endangered relict longhorn beetle, C. relictus. As the initial step towards expanding on the studies of longhorn beetle genomics, two independent approaches were concurrently used to estimate the genome size. The size estimate of the C. relictus genome was larger than that of most of the other beetle genomes assembled to date (e.g., 1.17 Gbp for the Colorado potato beetle, Leptinotarsa decemlineata [25]). We discuss its implications for studies investigating cerambycid genomics.

Sample Preparation
The Callipogon relictus specimens used in the present study include the second-generation offspring of the beetle collected from the Korea National Arboretum on 20 July 2017. The larvae were reared on a fungal diet under a 14L:10D (14 h light:10 h darkness) photoperiod at 24 ± 1 °C and a relative humidity (RH) of 60% (65% for adults). We extracted genomic DNA from the leg muscle of one unmated adult female, 2 weeks after eclosion, using the MagAttract HMW DNA kit (Qiagen, catalog no. 67563) according to the manufacturer's instructions. Final genomic DNA was eluted in 100 µL of Solution AE.

Genome Size Estimation by Flow Cytometry
Whole tissue, excluding the internal organs, was dissected from one C. relictus larva to estimate the genome size using flow cytometry. Liver tissue dissected from a ten-month-old male C57BL/6J mouse was used as a control. Dissected tissues were digested with 1 mg/mL collagenase/dispase (Sigma-Aldrich, 10269638001, St. Louis, MO, USA) at 37 °C for 1 h, followed by trypsinization and filtering with 70 µm cell strainers (SPL, Pocheon, South Korea) to isolate single cells. The cells were then fixed in cold 70% ethanol overnight, stained with 50 µg/mL of propidium iodide (Sigma-Aldrich, USA), and treated with 125 µg/mL of RNase A (iNtRON, Daejeon, South Korea). Fluorescence intensity was measured using a flow cytometer (BD Bioscience, San Jose, CA, USA), and the relative size of genomic DNA in C. relictus and mouse was analyzed with FlowJo (TreeStar, San Jose, CA, USA).
Geometric log mean values were used as the mean fluorescence intensity (MFI) to calculate the genome size of C. relictus, using the formula below based on the comparison with the MFI of mouse, Mus musculus, whose genome size is 2.67 Gb. Each MFI value was determined from three independent experiments.
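The ratio-based calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that fluorescence scales linearly with DNA content and that both MFI values come from nuclei at the same ploidy and cell-cycle stage; the function name and the MFI readings are hypothetical and are not measured values from this study.

```python
from statistics import mean

MOUSE_GENOME_GB = 2.67  # reference genome size given in the text


def genome_size_from_mfi(sample_mfi, reference_mfi,
                         reference_size_gb=MOUSE_GENOME_GB):
    """Scale the reference genome size by the sample/reference MFI ratio.

    Assumes fluorescence is proportional to DNA content and that both
    MFI values were taken from comparable (e.g., G0/G1) populations.
    """
    if sample_mfi <= 0 or reference_mfi <= 0:
        raise ValueError("MFI values must be positive")
    return reference_size_gb * sample_mfi / reference_mfi


# Hypothetical MFI readings from three independent runs (illustration only).
beetle_runs = [74.0, 75.0, 76.0]
mouse_runs = [99.0, 100.0, 101.0]

estimate_gb = genome_size_from_mfi(mean(beetle_runs), mean(mouse_runs))
print(f"Estimated genome size: {estimate_gb:.2f} Gb")
```

With these illustrative readings the sample-to-reference ratio is 0.75, so the estimate is three quarters of the 2.67 Gb reference.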

Genome Size Estimation by k-mer Analysis
A genomic DNA library was prepared with a TruSeq Nano DNA Prep Kit (Illumina, San Diego, CA, USA) by first randomly shearing 200 ng of genomic DNA into 550 bp inserts using the Covaris S2 system (Covaris, Woburn, MA, USA). Next, a single 'A' nucleotide was added to the 3′ blunt ends of the fragmented DNA and adapters were ligated to both ends of the fragments. The adapter-ligated DNA was PCR amplified to increase the concentration of the ligated DNA fragments. A Bioanalyzer (Agilent, Santa Clara, CA, USA) was used to verify the length distribution of the amplified library. In addition, qPCR was performed to quantify the final library using SYBR Green PCR Master Mix (Applied Biosystems, Foster City, CA, USA). Finally, the verified library was sequenced using paired-end 101 bp reads on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). As a result, a total of 60 Gb of raw sequence reads were generated (project accession PRJNA689978).
Prior to k-mer analysis, all raw sequence reads were pre-processed with Trimmomatic (v0.39) [26] to trim adapter sequences and eliminate low-quality reads. Using the trimmed reads, k-mer analysis was performed to estimate the genome size. The k-mer frequency distributions for values of k between 17 and 23 bp, at 2-bp intervals, were estimated using Jellyfish (v2.3.0) [27]. The final genome size was calculated by dividing the total number of k-mers by the peak value of the k-mer frequency distribution. Additionally, GenomeScope (v1.0) [28] was used to characterize the genome of C. relictus, including genome size, rate of heterozygosity, and repeat content.
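The division of total k-mers by the peak depth can be illustrated with a small Python sketch that works on a two-column depth/count histogram, the layout written by `jellyfish histo`. The function names, the `min_depth` cutoff for discarding low-depth (likely error) k-mers, and the toy histogram are assumptions made for illustration; they are not the settings or data used in this study.

```python
def parse_histo(lines):
    """Parse 'depth count' lines into a {depth: distinct_kmer_count} dict."""
    histogram = {}
    for line in lines:
        if not line.strip():
            continue
        depth, count = line.split()
        histogram[int(depth)] = int(count)
    return histogram


def find_peak(histogram, min_depth=5):
    """Return the depth holding the most distinct k-mers at or above min_depth."""
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    return max(usable, key=usable.get)


def estimate_genome_size(histogram, peak_depth=None, min_depth=5):
    """Total k-mer occurrences (depth x count) divided by the peak depth."""
    if peak_depth is None:
        peak_depth = find_peak(histogram, min_depth)
    total_kmers = sum(d * n for d, n in histogram.items() if d >= min_depth)
    return total_kmers / peak_depth


# Toy histogram (illustration only): error k-mers at depths 1-2,
# a minor peak near depth 10 and a main peak near depth 21.
toy_lines = ["1 1000", "2 200", "10 50", "20 100", "21 120", "22 100", "40 10"]
toy_hist = parse_histo(toy_lines)
print(find_peak(toy_hist), round(estimate_genome_size(toy_hist), 1))
```

In a heterozygous genome the tallest bin may belong to the heterozygous peak, so passing the homozygous peak explicitly through `peak_depth` is the safer choice in this sketch.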

Results
Callipogon relictus is shown in Figure 1. To estimate the genome size of C. relictus, we first quantified the DNA content of C. relictus larval cells and mouse liver cells using flow cytometry. Based on the difference in fluorescence intensity between the two organisms, the C. relictus genome was smaller than that of the mouse. Based on the known mouse genome size of 2.67 Gb, we inferred the size of the C. relictus genome to be about 2.00 Gb, using the MFI of mouse liver cells as a reference (Figure 2).
We generated high-throughput genomic sequence data from the leg muscle of a female adult specimen to estimate sequence-based genome size by k-mer analysis. The generated raw sequence reads and trimmed reads are presented in Table 1. Based on two different modes of k-mer analysis, we estimated the genome size of C. relictus to range from 1,517,383,829 bp to 1,882,948,731 bp, as summarized in Table 2. The distributions of k-mer coverage based on Jellyfish analysis presented double peaks, with the heterozygous peak recovered at coverage 11 (21-mer and 23-mer) and 12 (17-mer and 19-mer) and the homozygous peak at coverage 20 (23-mer), coverage 21 (21-mer), coverage 22 (19-mer), and coverage 23 (17-mer) (Figure 3).
In addition, based on the distribution of k-mer frequency, we evaluated the properties of the C. relictus genome using GenomeScope, which yielded an estimated heterozygosity rate of 1.70-1.81% and an estimated repeat length of 757-1096 Mb (Figure 4).
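In a diploid k-mer spectrum, the heterozygous peak is expected at roughly half the coverage of the homozygous peak. As a quick consistency check, the peak coverages reported above for each k can be compared against that expectation; the variable and function names in this sketch are illustrative only.

```python
# (heterozygous peak, homozygous peak) coverage for each k, as reported in the text
reported_peaks = {17: (12, 23), 19: (12, 22), 21: (11, 21), 23: (11, 20)}


def peak_ratios(peaks):
    """Return the heterozygous/homozygous coverage ratio for each k."""
    return {k: het / hom for k, (het, hom) in peaks.items()}


ratios = peak_ratios(reported_peaks)
for k in sorted(ratios):
    print(f"k={k}: het/hom coverage ratio = {ratios[k]:.2f}")
```

All four ratios fall between 0.5 and 0.56, which is in line with the two peaks corresponding to heterozygous and homozygous k-mers.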

Discussion
This study represents the first attempt to estimate genome size within the family Cerambycidae by k-mer analysis. Because of possible endoreplication in insect cells and tissues, the use of flow cytometry alone may result in inaccurate estimation of genome size depending on the type and stage of the tissue used for the analysis [18]. Therefore, it is important to complement flow cytometry-based estimates with another independent method, such as k-mer analysis. The Illumina sequence reads generated for k-mer analysis can also be used directly to assemble the whole genome sequence de novo.
The genome size of 1.8 ± 0.2 Gb represents one of the largest longhorn beetle genomes reported to date, and is more than twice the size of the Asian longhorn beetle (Anoplophora glabripennis) genome. However, this result is not surprising given the genome size variation reported previously within the family Cerambycidae, ranging from 528 Mb for the red-headed ash borer, Neoclytus acuminatus (subfamily Cerambycinae), to 1.88 Gb for the live-oak root borer, Archodontes melanopus (subfamily Prioninae), although these results were based solely on flow cytometry analysis [17]. Genome size varies even more remarkably across insects, with the largest insect genome, discovered in the mountain grasshopper Podisma pedestris (1C-value = 16.93 pg), nearly 250-fold larger than the smallest genome, that of the non-biting midge Clunio tsushimensis (1C-value = 0.07 pg) [29-31]. Given the significant variation in size across insects and even among longhorn beetles, we expect the current findings to contribute to the burgeoning amount of insect genome data. As further genomic data become available across diverse insect lineages, we may conduct comparative genomic analyses to delineate the genetic mechanisms underlying the evolution of various ecological and physiological traits of insects, such as the immune system, metabolic detoxification, parasitism, and polyphagy [11].
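Because the insect-wide extremes above are given as 1C-values in picograms while the cerambycid estimates are in base pairs, a unit conversion helps when comparing them. The sketch below assumes the widely used approximation of 1 pg of DNA ≈ 978 Mb, which is not stated in this text; the function names are illustrative.

```python
MBP_PER_PG = 978.0  # assumed conversion factor: 1 pg of DNA is about 978 Mb


def pg_to_gb(c_value_pg):
    """Convert a 1C-value in picograms to gigabases."""
    return c_value_pg * MBP_PER_PG / 1000.0


def gb_to_pg(size_gb):
    """Convert a genome size in gigabases to picograms."""
    return size_gb * 1000.0 / MBP_PER_PG


largest_pg, smallest_pg = 16.93, 0.07  # 1C-values quoted in the text
fold = largest_pg / smallest_pg

print(f"Podisma pedestris: {pg_to_gb(largest_pg):.2f} Gb")
print(f"Clunio tsushimensis: {pg_to_gb(smallest_pg):.3f} Gb")
print(f"Fold difference: {fold:.0f}")
print(f"C. relictus (1.8 Gb): {gb_to_pg(1.8):.2f} pg")
```

Under this assumed factor, the 1.8 Gb estimate for C. relictus corresponds to a 1C-value of a little over 1.8 pg.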
Finally, given the importance of conservation, the genomic study of this critically endangered longhorn beetle is expected to offer useful information for developing an effective conservation strategy. Nevertheless, only a handful of studies have reported molecular analyses of C. relictus. One study in Korea sequenced the cytochrome c oxidase subunit I (COI) barcode gene to identify a cerambycid larva collected from the Gwangneung Forest [32]. The complete mitochondrial genome sequence of Callipogon relictus has also been published [33]. Furthermore, the phylogenetic and biogeographic history of C. relictus has been inferred from multilocus sequence data obtained from multiple geographical populations of C. relictus, together with most of its congeners worldwide [5].

Conclusions
The current findings represent a pioneering effort in the study of Callipogon relictus evolutionary genomics. Additionally, comparative genomic studies in the future are expected to enable conservation efforts based on key loci that contribute to inbreeding depression and disease susceptibility, as well as on the fitness effects of potential introgression [34].