A Linkage-Based Genome Assembly for the Mosquito Aedes albopictus and Identification of Chromosomal Regions Affecting Diapause

Simple Summary A genome sequence can provide the basis for understanding a wide range of biological properties of an organism. For vector and pest species, this knowledge can be used to develop novel control strategies based on genome modifications that disrupt traits related to ecological adaptation and disease transmission. Here, we use a genetic mapping experiment to produce an improved version of a previous genome sequence for the invasive vector mosquito, Aedes albopictus. We then use the improved genome sequence to identify regions of the genome that contain candidate genes that may affect the ability of this mosquito to undergo overwintering dormancy, a crucial ecological adaptation. Abstract The Asian tiger mosquito, Aedes albopictus, is an invasive vector mosquito of substantial public health concern. The large genome size (~1.19–1.28 Gb by cytofluorometric estimates), comprised of ~68% repetitive DNA sequences, has made it difficult to produce a high-quality genome assembly for this species. We constructed a high-density linkage map for Ae. albopictus based on 111,328 informative SNPs obtained by RNAseq. We then performed a linkage-map anchored reassembly of AalbF2, the genome assembly produced by Palatini et al. (2020). Our reassembled genome sequence, AalbF3, represents several improvements relative to AalbF2. First, the size of the AalbF3 assembly is 1.45 Gb, almost half the size of AalbF2. Furthermore, relative to AalbF2, AalbF3 contains a higher proportion of complete and single-copy BUSCO genes (84.3%) and a higher proportion of aligned RNAseq reads that map concordantly to a single location of the genome (46%). We demonstrate the utility of AalbF3 by using it as a reference for a bulk-segregant-based comparative genomics analysis that identifies chromosomal regions with clusters of candidate SNPs putatively associated with photoperiodic diapause, a crucial ecological adaptation underpinning the rapid range expansion and climatic adaptation of A. albopictus.

placed into each fly vial to provide a substrate for oviposition. Two females that oviposited > 20 eggs each were snap frozen in liquid nitrogen and stored at -80°C for subsequent RNAseq genotyping. The eggs (intercross F1) of these females were maintained and hatched out under the near-optimal conditions as described above to establish two independent intercross lines. Larvae were reared as described above and adults were allowed to random mate in a separate 9.5-liter mass-swarm cage for each line with at least 100 males and 100 females each generation. Both lines were used for the BSA in the intercross F4 generation. Additionally, one line was arbitrarily chosen and maintained to the intercross F7 generation to increase recombination among marker SNPs for linkage mapping.

File S1.2: Linkage mapping: Tissue preparation, RNA extraction, and sequencing
In the intercross F7 generation, larvae were reared to adults as described above. 50 males and 50 females were collected one week after eclosion, snap frozen in liquid nitrogen and individually stored at -80 o C. Total RNA was extracted from each of the two parents and 70 individual F7 intercross mosquitoes using a modified TRI ® Reagent (Sigma Aldrich, St. Louis, MO) RNA extraction protocol described in previous publications [17,19]. Briefly, contaminant DNA was removed from each RNA sample using a Turbo-DNA free kit (Ambion, Austin, Texas). For each sample, RNA integrity and quantity were assessed using an Agilent Bioanalyzer ® RNA Chip (Agilent Technologies, Santa Clara, CA, USA). Only samples with an RNA quantity of at least 400 ng and minor or no degradation were retained for library preparation. Samples of total RNA from the two intercross F0 parents were used for library preparation and sequencing alongside those of other TROP and TEMP

File S1.3: Linkage mapping: read cleaning and alignment to the Palatini et al. (2020) assembly
In order to remove contaminant and low-quality reads, all libraries were aligned to the UniVec database of potential DNA vector contaminants [49], as well as Ae. albopictus rRNA sequences (Genbank accession L22060.1) and Ae. albopictus mitochondrial tRNA sequences (tRNA features from GenBank accession AY072044.1). These potential contaminant sequences were indexed using bowtie2 version 2.2.6 [34]. Alignment of RNAseq reads was performed using bowtie2 version 2.2.6, and any aligned reads were discarded. Scripts for all bioinformatics and analyses are available in our dryad repository at: https://datadryad.org/stash/dataset/doi:10.5061/dryad.mgqnk98z4.
We then used Trimmomatic version 0.39 [50] to trim Illumina adaptor sequences from the reads. We used the dynamictrim tool from SolexaQA++ version 3.1.7.1 [51] to trim off the ends of reads once quality fell below 15 and to remove any read pairs in which either of the pair had a length less than 50 bp. We removed any unpaired reads from each library using custom scripts.
We indexed the Palatini et al. (2020) genome assembly (AalbF2), Genbank accession GCA_006496715.1, using the STAR aligner version 2.7.1a [52]. We then performed alignment of sequencing reads from the two intercross F0 parents and 70 individual F7 intercross individuals to the AalbF2 assembly using a 2-step alignment process. First, an initial STAR alignment to identify splice junctions in the mRNA libraries, then a second STAR alignment using those splice junctions to inform the alignment. The resulting alignments for each library were further processed using Picard version 2.20.4 [53]to sort alignments, add read groups, and remove duplicates. We indexed the reference genome using samtools 0.1.19 [27,54], and created a sequence dictionary using Picard. We then used the SplitNCigarReads tool from the Genome Analysis Toolkit (GATK) version 4.1.2.0 [55] to split alignments of RNA reads at which a splice junction had been identified.

File S1.4: Bulk segregant analysis (BSA): measuring diapause phenotypes
An adult mass-swarm cage was established for each line under a short-day photoperiod; each cage contained approximately 100 females and 100 males. After 11 days under short-day conditions, females were blood fed to repletion as described in SI Section 1.1. Engorged females were then transferred into individual fly vials half-filled with water and provisioned with an organic raisin and unbleached paper towel (Genesee Scientific, San Diego, CA, USA) as described in SI Section 1.1 to stimulate oviposition. Vials were checked daily and individual females were snap frozen in liquid nitrogen and stored at -80°C after they had oviposited at least 20 eggs. Diapause incidence (DI) was measured for eggs collected from individual females maintained under unambiguous short-day photoperiod (SD; 8L:16D) as described previously [25,56]. Briefly, paper towel strips with eggs were removed from fly vials containing individual females every Monday-Wednesday-Friday (M-W-F), maintained under SD conditions for ~ 48 hr, and then gently air-dried. Egg papers were then stored at approximately 80% relative humidity under SD for at least seven days.
Eggs ranging from one to two weeks of age were then stimulated to hatch by submersion in ~75 ml water with ~ 1ml larval food. The number of hatched larvae was recorded and the egg papers were re-dried. This procedure was repeated 7 days later, after which the eggs

File S1.6: Bulk segregant analysis: SNP calling and filtering
We used HaplotypeCaller to call variants within each individual (or each bulk, adjusting the sample ploidy as appropriate), then used GenomicsDBImport for each scaffold to combine the outputs from HaplotypeCaller from all individuals (and bulks). We then used to GenotypeGVCFs to do joint genotyping on all individuals (and bulks) simultaneously. The large ploidy of the bulks caused genotyping to fail due to lack of memory in some regions of the genome, which were therefore excluded from joint genotyping. These regions represented approximately 50-60 Mb spread across 143 scaffolds. The genotype calls were then combined using GatherVcfs, and a set of high-quality SNPs output by removing SNPs with quality by depth (QD) < 10, phred-scaled pvalue of Fisher's Exact Test for strand bias (FS) > 60, root mean square of the mapping quality (MQ) < 35, mapping quality rank sum test (MQRS) < -12.5, read position rank sum test (RPRS) < -8, or strand odds ratio (SOR) > 3. This set of SNPs was then used as the basis of bootstrap base recalibration using the GATK BaseRecalibrator tool. The alignments produced above were recalibrated using the ApplyBQSR tool, and these alignments used as the basis of a second round of SNP calling, following the protocol outlined above. After base recalibration, approximately 60-70 Mb spread across 148 scaffolds was excluded due to memory failures during joint genotyping.
High-quality SNPs were filtered as described above. The AnalyzeCovariates tool showed that a second round of base recalibration based on these SNPs had little effect on the alignment, and so we retained the SNPs identified after a single round of base recalibration.
These recalibrated, high-quality SNPs were further filtered to exclude multiallelic loci using the GATK SelectVariants tool. For libraries from single mosquitoes, genotypes with a genotype quality score of 5 or less were removed from the data set.

File S1.7: Bulk segregant analysis: identifying putative diapause-associated SNPs
To calculate a diapause-associated p-value based on SNP frequency differences between the TEMP and TROP sample, we calculated the proportion of the reference allele in each sample as the average of its proportion in the TEMP sample and its proportion in the TROP sample. From this, we assigned all SNPs into five categories based on minor allele frequencies (MAF): MAF ≤ 0.1, 0.1 < MAF ≤ 0.2, 0.2 < MAF ≤ 0.3, 0.3 < MAF ≤ 0.4, and MAF > 0.4. These categories account for the fact that SNPs with higher MAF have a larger range of possible allele frequencies differences between the two parent populations. We then calculated a diapause-associated p-value as the percentile of the distribution of absolute value of allele frequency difference (|AFD|) for each SNP within each category.
We calculated diapause-associated p-values separately for each of the bulks. In most cases (87%), we had genotyped both of the two founding parents of the bulk line. The genotypes of the parents affect the likely degree of allele frequency differentiation in the bulks; e.g., SNPs for which the parents had two of each allele are more able to become strongly differentiated solely by chance than SNPs for which the parents have one reference allele and three alternate alleles. We therefore divided SNPs into three categories, depending on whether the parents collectively had 4 identical alleles, 2 copies of the reference and the alternate allele between them, or 1 copy of one allele and 3 of the other.
Within each category, we determined the percentile of each of the SNPs on the distribution of |AFD|, and assigned a p-value by subtracting that percentile from 1 (so that extreme allele frequency differences would have a low p-value). In the case of ties in |AFD|, all tied SNPs were assigned the lowest possible percentile, and thus the highest (i.e., leastsignificant) p-value. For those SNPs where the genotypes of the two parents had 4 identical alleles, the allele frequency difference between the high and low bulks was almost always 0 (98% of SNPs). In the exceptional cases in which the two parents were genotyped as having 4 identical alleles, but the bulks had a non-zero allele frequency difference, we assumed that this had arisen due to an erroneous genotype in either the parents or the bulks; since we lacked confidence in the genotyping of these SNPs, we set the p-value for these SNPs at 1. For those SNPs for which both parents had not been genotyped, we calculated a p-value as described above for the comparison of the TEMP and TROP samples. The overall diapause-associated p-value for each SNP was calculated by multiplying the p-values for the three allele frequency differences described above (i.e., TEMP vs. TROP, high-diapause vs. low-diapause bulks for both BSA lines).
We applied three false discovery thresholds to take into account testing of multiple SNPs. In the first, only those SNPs were included for which their p-value was less than 0.05 / 46,736 = 1.1*10 -6 . We used 46,736 as the denominator in this calculation because that is the number of SNPs that meet the criteria that the three allele frequency differences (AFDs) had the same sign (i.e., TEMP vs. TROP, high-diapause vs. low-diapause bulks for both BSA lines). The other two p-value cutoffs were 1/46,736 = 2.1*10 -5 and 5/46,736 = 1.1*10 -4 .
These SNP sets are likely to include approximately 0.05, 1 and 5 falsely-discovered SNPs, respectively, but since the three threshold categories included approximately 4, 77 and 260 SNPs, the likelihood an any particularly SNP being a false-positive is low Only SNPs that were found in both F0 parent and F7 offspring, both TROP and TEMP populations, or in both bulks of a line, were included, which is why the same number of SNPs were found in each of those pairs. BSA: tropical population 20.9 9.5 1 These libraries were sequenced on two different flowcells and later combined.