Novel Molecular Resources to Facilitate Future Genetics Research on Freshwater Mussels (Bivalvia: Unionidae)

Abstract: Molecular data have been an integral tool in the resolution of the evolutionary relationships and systematics of freshwater mussels, despite the limited number of nuclear markers available for Sanger sequencing. To facilitate future studies, we evaluated the phylogenetic informativeness of loci from the recently published anchored hybrid enrichment (AHE) probe set Unioverse and developed novel Sanger primer sets to amplify two protein-coding nuclear loci with high net phylogenetic informativeness scores: fem-1 homolog C (FEM1) and UbiA prenyltransferase domain-containing protein 1 (UbiA). We report the methods used for marker development, along with the primer sequences and optimized PCR and thermal cycling conditions. To demonstrate the utility of these markers, we provide haplotype networks, DNA alignments, and summary statistics regarding the sequence variation for the two protein-coding nuclear loci (FEM1 and UbiA). Additionally, we compare the DNA sequence variation of FEM1 and UbiA to three loci commonly used in freshwater mussel genetic studies: the mitochondrial genes cytochrome c oxidase subunit 1 (CO1) and NADH dehydrogenase subunit 1 (ND1), and the nuclear internal transcribed spacer 1 (ITS1). All five loci distinguish among the three focal species (Potamilus fragilis, Potamilus inflatus, and Potamilus purpuratus), and the sequence variation was highest for ND1, followed by CO1, ITS1, UbiA, and FEM1, respectively. The newly developed Sanger PCR primers and methodologies for extracting additional loci from AHE probe sets have great potential to facilitate molecular investigations targeting supraspecific relationships in freshwater mussels, but may be of limited utility at shallow taxonomic scales.

Other nuclear loci, such as histone H3 and 28S, have been utilized in freshwater mussel phylogenetic studies; however, these markers are well known to show limited diversity at shallow taxonomic scales and have primarily been used to resolve deep-level phylogeny [15–20].
In recent years, the decreasing costs of next-generation sequencing platforms have significantly increased the ability to generate molecular supermatrices in non-model taxa [21,22], including freshwater mussels [23,24]. In particular, the recently developed anchored hybrid enrichment (AHE) probe set Unioverse [23] has drastically improved the ability to resolve phylogeny in freshwater mussels. The Unioverse probe set consists of 811 protein-coding loci derived from genomic and transcriptomic resources across Bivalvia that can be captured across all freshwater mussels to resolve phylogenetic relationships. Despite the decreasing costs of next-generation sequencing, the utilization of AHE probe sets can be cost-prohibitive for small-scale projects or molecular investigations that incorporate hundreds of individuals to investigate intra- or interspecific relationships. However, AHE probe sets offer opportunities for the development of primers for the amplification and Sanger sequencing of select protein-coding loci that can be used for small-scale projects.
Here, we evaluated the phylogenetic informativeness of loci in the Unioverse probe set and report the development of novel primer pairs for the amplification of two protein-coding nuclear genes: fem-1 homolog C (FEM1) and UbiA prenyltransferase domain-containing protein 1 (UbiA). To demonstrate the utility of these markers and facilitate their use in future studies, we provide the PCR primer sequences, optimized PCR conditions and thermal cycling parameters, haplotype networks, DNA alignments, and summary statistics regarding sequence variation for the two protein-coding nuclear loci (FEM1 and UbiA) and three loci that are commonly used in studies of freshwater mussels: the mitochondrial genes cytochrome c oxidase subunit 1 (CO1) and NADH dehydrogenase subunit 1 (ND1), and the nuclear ITS1 locus. All five loci distinguish among the three focal species (Potamilus fragilis, Potamilus inflatus, and Potamilus purpuratus) and should be amplifiable across the subfamily Ambleminae. The observed sequence variation was highest for ND1, followed by CO1, ITS1, UbiA, and FEM1, respectively (Figure 1). We also provide the detailed methodology used in the marker selection to expedite the identification of additional candidate loci and primer development from available AHE data. The newly developed Sanger PCR primers and methodologies for extracting additional loci have great potential to facilitate molecular investigations targeting supraspecific relationships in freshwater mussels, but may be of limited utility at shallow taxonomic scales.
Figure 1. Haplotype networks for the five loci used in this study. Each colored circle represents a unique haplotype, the colors correspond to individual species, the black circles represent unsampled haplotypes, and the hash marks indicate nucleotide differences between haplotypes.


Specimen Details
All the metadata related to the specimens used in this study, including the collection location, GPS coordinates, and museum catalog numbers, are provided (https://doi.org/10.5066/P9Q3CFL5) [25].

Molecular Data
We present the DNA sequence data from five markers: the mitochondrial genes CO1 and ND1, the nuclear non-coding marker ITS1, and the protein-coding nuclear genes FEM1 and UbiA. Our five-locus DNA alignment consisted of 3368 bp of mitochondrial and nuclear sequence data (CO1 = 657 bp; ND1 = 900 bp; FEM1 = 501 bp; UbiA = 765 bp; ITS1 = 545 bp). The number of loci sequenced for each individual varies from two to five, with all loci available for 28 individuals (Table 1). The specific sample sizes for each locus are as follows: CO1 (n = 102); ND1 (n = 103); FEM1 (n = 29); UbiA (n = 29); and ITS1 (n = 31). A subset of individuals was chosen for the additional nDNA loci due to the high prevalence of multiple copies at ITS1 and low genetic diversity at FEM1 and UbiA. All the DNA alignment files are available in Phylip format (.phy), with the first line indicating the number of taxa and number of nucleotides and subsequent lines containing a taxon identifier, catalog number, and GenBank accession number in the first column (each separated by an underscore), and the DNA sequence in the second column. The file names are as follows: CO1.phy; ND1.phy; FEM1.phy; UbiA.phy; ITS1.phy; and 5_locus.phy (https://doi.org/10.5066/P9Q3CFL5) [25].
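The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the data release: it assumes each record sits on a single line (sequential Phylip), that the label contains no whitespace, and that the catalog number and accession number are the last two underscore-separated fields, so any underscores inside the taxon identifier are preserved. The function names are hypothetical, and the label layout of the concatenated 5_locus.phy file may differ.

```python
from typing import Dict, Tuple


def parse_phylip(text: str) -> Tuple[int, int, Dict[str, str]]:
    """Parse a sequential Phylip alignment given as a string.

    Returns (number of taxa, number of sites, {label: sequence}).
    Assumes one record per line: a whitespace-free label, then the sequence.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])
    records: Dict[str, str] = {}
    for ln in lines[1:]:
        label, seq = ln.split(None, 1)
        # Drop any internal spaces or tabs used to pad the sequence column.
        records[label] = "".join(seq.split())
    if len(records) != n_taxa:
        raise ValueError(f"header says {n_taxa} taxa, found {len(records)}")
    for label, seq in records.items():
        if len(seq) != n_sites:
            raise ValueError(f"{label}: expected {n_sites} sites, found {len(seq)}")
    return n_taxa, n_sites, records


def split_label(label: str) -> Tuple[str, str, str]:
    """Split 'taxon_catalog_accession' into its three parts.

    Assumption: catalog and accession are the final two fields, so the
    taxon identifier may itself contain underscores.
    """
    taxon, catalog, accession = label.rsplit("_", 2)
    return taxon, catalog, accession
```

To use it on a downloaded file, read the file contents first, for example `parse_phylip(open("CO1.phy").read())`, and then apply `split_label` to each key of the returned dictionary.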

Taxon Sampling and DNA Extraction
We present molecular data on 103 specimens representing Potamilus fragilis (n = 22), Potamilus inflatus (n = 14), and Potamilus purpuratus (n = 67) used in Smith and Johnson [26] (Table 1). All specimens were collected from four Gulf of Mexico river drainages in the southeastern United States: Mobile, Pascagoula, Pearl, and Pontchartrain. Genomic DNA was extracted from mantle tissue clips from vouchered individuals using the Qiagen PureGene DNA extraction kit with the standard extraction protocol (Qiagen, Hilden, Germany).

Novel Primer Design and Gene Annotation
We compiled data from a recent study [24] utilizing the AHE probe set Unioverse to develop novel primer sets for amplifying protein-coding nuclear loci for use in the freshwater mussel genus Potamilus.
To screen for loci in the dataset that were informative at shallow phylogenetic scales, we measured the net phylogenetic informativeness (PI) using an arbitrary time scale [27]. This methodology has been used in previous studies to calculate the power of individual loci in AHE datasets [28,29]. First, we reconstructed a phylogeny from a concatenated alignment of probe loci using IQ-TREE v 1.6.11 [30,31], and the consensus tree was arbitrarily dated with a molecular clock (i.e., tips = time 0, root = time 1) using the program PATHd8 [32]. A concatenated alignment partitioned by the probe and the ultrametric tree from PATHd8 were uploaded into the web application PhyDesign [33] (http://phydesign.townsend.yale.edu/) to estimate the PI using the HyPhy substitution rates algorithm with the GTR model of nucleotide evolution and empirical base frequencies [34]. We used the R script PhyDesign.r [29] to identify specific nucleotide positions in the alignment with unusually high substitution rates that could be contributing phylogenetic noise. Nucleotide positions with rate values higher than five were removed from the alignment manually and the filtered matrices were re-uploaded to PhyDesign as above for a final analysis.
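The site-filtering step can also be scripted rather than done by hand. The sketch below is an illustration under stated assumptions, not the PhyDesign.r script or the procedure used in this study: it assumes the per-site rate values have already been exported as a list aligned to the alignment columns, and the function names are hypothetical. Only the cutoff of five comes from the text.

```python
from typing import Dict, List, Sequence

RATE_CUTOFF = 5.0  # sites with a rate value above this were removed in the text


def high_rate_sites(rates: Sequence[float], cutoff: float = RATE_CUTOFF) -> List[int]:
    """Return 0-based indices of alignment columns whose rate exceeds the cutoff."""
    return [i for i, rate in enumerate(rates) if rate > cutoff]


def drop_sites(alignment: Dict[str, str], sites: Sequence[int]) -> Dict[str, str]:
    """Return a copy of the alignment with the given columns removed from every sequence."""
    drop = set(sites)
    return {
        name: "".join(base for i, base in enumerate(seq) if i not in drop)
        for name, seq in alignment.items()
    }
```

The filtered alignment would then be written back to file and re-uploaded to PhyDesign for the final analysis, as described above.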
Three nucleotide positions were removed from the dataset due to unusually high substitution rates (rate value > 5; "phantom spikes"). In the filtered dataset, the probe regions with a 100% capture efficiency across Ambleminae had an average net PI of 4.62 and ranged from 0.32 to 23.09 (Table 2). Using the results from PhyDesign, we selected two candidate loci for primer development and PCR validation: locus 156 and locus 412. Locus 156 and locus 412 exhibited a 100% capture efficiency in our dataset, had suitable candidate primers that could be cross-amplified across Potamilus, displayed high levels of average PI (9.22 and 11.61, respectively), and were able to discriminate our focal species. We were unable to develop compatible primers for the other candidate loci with high net PI scores (e.g., locus 70 and locus 413).
We used BLASTX [35] to annotate the gene and protein names for our candidate loci [36]. Briefly, the probe region sequences of both loci for P. inflatus were searched against the non-redundant protein database using BLASTX, which returned 172 and 118 BLAST hits for locus 156 and locus 412, respectively. Locus 156 was identified as UbiA prenyltransferase domain-containing protein 1, and the highest homology was to genes in the marine bivalves Crassostrea virginica (74.62%) and C. gigas (72.31%). Locus 412 was identified as a fem-1 homolog, and the highest homology was to genes in the unionid bivalves Hyriopsis schlegelii (99.44%) and H. cumingii (98.89%), and the marine bivalves Mizuhopecten yessoensis (87.22%), Pecten maximus (87.22%), C. virginica (86.11%), and C. gigas (86.11%). There were inconsistencies regarding whether the region was a fem-1 homolog A or fem-1 homolog C. All the BLAST hits except for H. cumingii and H. schlegelii indicated that the sequence was representative of fem-1 homolog C; therefore, we annotated the locus as fem-1 homolog C.

PCR and Sequencing
PCRs were conducted using a 25 µL mixture of the following: molecular grade water (9.5 µL), MyTaq™ Red Mix (12.5 µL; Bioline, London, UK), primers (1.0 µL each), and DNA template (100 ng). The primers for all loci and thermal cycling conditions for CO1, ND1, and ITS1 are reported in Table 3. The thermal cycling conditions for FEM1 and UbiA were as follows: an initial denaturation at 95 °C for 3 min, followed by 35 cycles of 95 °C for 30 s, 51/60 °C (FEM1/UbiA) for 30 s, and 72 °C for 90 s. The products were sent to Molecular Cloning Laboratories (McLAB, South San Francisco, CA, USA) for bi-directional sequencing on an ABI 3730. Geneious v 10.2.3 [37] was used to assemble the contigs and edit chromatograms, and the sequences were aligned in Mesquite v 3.61 [38] using MAFFT v 7.311 [39]. The loci were aligned independently using the L-INS-i method in MAFFT and translated into amino acids to ensure the absence of stop codons and gaps.
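The reaction recipe and cycling program above lend themselves to a quick arithmetic check when scaling up to a plate. The following is a minimal sketch: the per-reaction volumes, cycle counts, step durations, and annealing temperatures are taken from the text, while the function names, the overage parameter, and the assumption that the template volume is whatever remains of the 25 µL are illustrative only.

```python
from typing import Dict

TOTAL_UL = 25.0
# Per-reaction volumes (µL) as stated in the text; template is handled separately.
PER_REACTION_UL = {
    "water": 9.5,
    "MyTaq Red Mix": 12.5,
    "forward primer": 1.0,
    "reverse primer": 1.0,
}
ANNEAL_TEMP_C = {"FEM1": 51, "UbiA": 60}


def template_volume_ul() -> float:
    """Volume left for DNA template, assuming it makes up the remainder of 25 µL."""
    return TOTAL_UL - sum(PER_REACTION_UL.values())


def master_mix(n_reactions: int, overage: float = 0.10) -> Dict[str, float]:
    """Scale the template-free components for n reactions plus a pipetting overage."""
    factor = n_reactions * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in PER_REACTION_UL.items()}


def programmed_hold_seconds() -> int:
    """Sum of programmed hold times: 3 min at 95 °C, then 35 cycles of 30 s + 30 s + 90 s.

    Ramp times between temperatures are not included.
    """
    return 3 * 60 + 35 * (30 + 30 + 90)
```

Under the remainder assumption, the template volume works out to 1.0 µL, which would correspond to a working stock of roughly 100 ng/µL; the source states only the template mass, so the volume is an inference.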

Sequence Variation and Haplotype Analysis
We created haplotype networks (Figure 1) and calculated the nucleotide diversity, number of haplotypes, number of segregating sites, and number of parsimony-informative sites (Table 4) to compare the amounts of sequence variation across all five loci used in this study. The TCS haplotype networks and sequence variation statistics were calculated using PopART 1.7 [42]. All five loci distinguish among the three focal species (Figure 1). The sequence variation was highest for ND1, followed by CO1, ITS1, UbiA, and FEM1, respectively (Table 4). Despite selecting loci from the AHE probe set with a high net PI, the level of sequence variation remains low when compared to mtDNA and ITS1, suggesting the limited utility of the probes at intraspecific levels.
Table 4. Summary of diversity indices based on all five loci utilized in this study. Abbreviations and symbols are as follows: cytochrome c oxidase subunit 1 (CO1); fem-1 homolog C (FEM1); internal transcribed spacer 1 (ITS1); NADH dehydrogenase subunit 1 (ND1); UbiA prenyltransferase domain-containing protein 1 (UbiA); sample size (n); nucleotide diversity (π); number of haplotypes (nh); number of segregating sites (S); and number of parsimony-informative sites (P).
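For readers who want to recompute the indices in Table 4 from the alignment files, the standard definitions can be sketched in a few lines of Python. This is not the PopART implementation: it assumes equal-length sequences with no gaps, missing data, or ambiguity codes (PopART may treat such sites differently), and the function name is hypothetical. Here π is the mean number of pairwise differences per site, S counts columns with more than one state, and P counts columns in which at least two states each occur in at least two sequences.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def diversity_summary(seqs: List[str]) -> Dict[str, float]:
    """Compute n, nh, S, P, and pi for a list of aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Count the states observed in each alignment column.
    columns = [Counter(col) for col in zip(*seqs)]
    segregating = sum(1 for c in columns if len(c) > 1)
    informative = sum(
        1 for c in columns if sum(1 for count in c.values() if count >= 2) >= 2
    )

    # Mean pairwise differences, scaled by alignment length.
    pair_diffs = [
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)
    ]
    pi = sum(pair_diffs) / len(pair_diffs) / length

    return {"n": n, "nh": len(set(seqs)), "S": segregating, "P": informative, "pi": pi}
```

Because gaps and ambiguity codes are counted as ordinary states here, the values may differ from those reported by PopART for alignments such as ITS1 that contain indels.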

User Notes
All the data and metadata described in this study are at https://doi.org/10.5066/P9Q3CFL5 [25], and all the novel GenBank accessions for this study were as follows: CO1: MT662002-MT662099;