Systematics of the Arboreal Neotropical ‘thorellii’ Clade of Centruroides Bark Scorpions (Buthidae) and the Efficacy of Mini-Barcodes for Museum Specimens

Fragmented and degraded DNA is pervasive among museum specimens, hindering molecular phylogenetics and species identification. Mini-barcodes, 200–300-base-pair (bp) fragments of barcoding genes, have proven effective for species-level identification of specimens from which complete barcodes cannot be obtained in many groups, but have yet to be tested in arachnids. The present study investigated the efficacy of mini-barcodes combined with longer sequences of the Cytochrome c Oxidase Subunit I (COI) gene in the systematics of the arboreal Neotropical ‘thorellii’ clade of Centruroides Marx, 1890 bark scorpions (Buthidae, C.L. Koch 1837), the species of which have proven to be difficult to identify and delimit due to their similar morphology. The phylogeny of 53 terminals, representing all nine species of the clade and representative species belonging to related clades of Centruroides, rooted on Heteroctenus junceus (Herbst, 1800) and based on up to 1078 base pairs of COI and 112 morphological characters, is presented to test the monophyly of the clade and the limits of its component species. The results support the recognition of nine species of the ‘thorellii’ clade, in accordance with a recent taxonomic revision, and highlight the efficacy of mini-barcodes for identifying morphologically similar cryptic species using specimens of variable age and preservation.


Introduction
Although natural history collections constitute one of the largest repositories of genetic information and are paramount for biological research, successful amplification of DNA from museum specimens often proves difficult due to degradation and fragmentation [1,2]. Many museum specimens were not collected with the intention (or knowledge) of future genomic studies, resulting in varying methods of preservation both among and within taxa and varying rates of subsequent molecular degradation among specimens of similar age [2]. Several factors, including time, dehydration, environmental exposure, and the presence of bacterial or fungal contamination, markedly affect the quality of DNA in collections of vertebrate and invertebrate taxa [2][3][4]. Although DNA barcoding is cost-effective for species identification, the utility of this method for museum material is severely limited by the failure to amplify complete Cytochrome c Oxidase Subunit I (COI) sequences and the cost in finances, time, and effort to repeat amplifications for adequate coverage. Overcoming the obstacle of sequencing museum specimens is critical, as natural history collections provide the basis for studies of historical populations, the discovery of new (and often cryptic) species, and the verification of barcoded specimens to types [5][6][7][8].
Research suggests that barcodes as short as 200 nucleotide base pairs (bp), the length at which DNA from degraded museum material stabilizes [2,8], are sufficient The difficulty of collecting these canopy-dwelling scorpions appears to have resulted in their diversity and distribution being severely undersampled. Many collection records are represented by singletons. The comparative rarity of these scorpions in collections resulted in a tripling of the known diversity to a total of nine species in a recent revision [23], wherein six new species, previously identified as operational taxonomic units (OTUs) on the basis of DNA sequence data [22], were described. However, insufficient material prevented a decision as to whether the OTUs identified were new species or variable populations of more widespread species [22]. A larger sample of specimens from across the distribution permitted a more thorough assessment of species limits within the clade, resulting in the descriptions of Centruroides berstoni Goodman  The present study aimed to test the species limits and phylogenetic relationships among the species of the 'thorellii' clade recognized by Goodman et al. (2021) using COI mini-barcodes of individuals from disparate localities combined with longer COI sequences and morphological characters, and in so doing demonstrating the efficacy of mini-barcodes for species-level identification of old and/or poorly preserved arachnid specimens from natural history collections [23].

Taxon Sampling
The ingroup comprised nine terminals, representing all nine species of the 'thorellii' clade [23]. Each species was represented by three to seven individuals from one to five localities, including the type localities of all species except C. hoffmanni, C. rileyi, and C. schmidti, the identifications of which were verified based on morphology [16,19]. Exemplar species representing the North American, Central American, and Caribbean clades of Centruroides [21] were included as outgroups. Three species per clade were included to represent the genetic variability within each. Specimens of C. thorellii were also included to test its phylogenetic position. Trees were rooted on Heteroctenus junceus (Herbst, 1800), representing Heteroctenus Herbst, 1800, the sister taxon of Centruroides [21]. The complete taxon sample comprised nine ingroup species and ten outgroup species, considered satisfactory for testing the monophyly of the 'thorellii' clade.

Material Examined
Specimens and tissue samples used for molecular and morphological analyses were obtained from the American Museum of Natural History (AMNH), New York; the California Academy of Sciences (CAS), San Francisco; the Colección Nacional de Arácnidos, Instituto de Biología (CNAN), Universidad Nacional Autónoma de México, Mexico City; and the Oxford University Museum of Natural History (OUMNH), UK.
Field-collected specimens from localities in Veracruz, Mexico, and southern Guatemala were detected using ultraviolet light, euthanized by submersion in 95% ethanol, and subsequently injected with ethanol to improve internal preservation. A total of 53 tissue samples, stored at −20 • C at the CAS and the Ambrose Monell Cryo Collection at the AMNH, were used for DNA extraction (Appendix A).

DNA Sequencing
Genomic DNA was extracted from muscle tissue from the fourth leg of well-preserved specimens using the spin column extraction protocol of the Qiagen DNeasy Blood & Tissue Kit (Valencia, CA, USA). DNA extracts were stored in 50-200 µL of elution (AE) buffer depending on the DNA concentration assessed using a Nanodrop 2000C Spectrophotometer (ThermoFisher Scientific, Waltham, MA, USA).
Each PCR reaction was conducted in a 25-microliter volume, comprising 2.5 µL of 10× PCR buffer, 0.5 µL of dNTPs ( and an Epicenter thermocycler (Eppendorf, Hamburg, Germany) at the AMNH Sackler Institute for Comparative Genomics (SICG), included an initial denaturing step at 95 • C for 2 min; 40 cycles of denaturing at 95 • C for 30 s, annealing at 47 • C for 30 s, and extension at 72 • C for 60 s; and a final extension at 72 • C for 10 min. PCR products were checked using gel electrophoresis on 1% agarose gel with ethidium bromide, and 1-2 µL of ExoSAP-IT (USB Scientific, Cleveland, OH, USA) was used to clean PCR products.
Sanger dideoxy sequencing was conducted using fluorescent-labeled BigDye Termi- The labeled single-stranded DNA was precipitated by adding 2.5 µL of 125 mM di-NaEDTA to each sample, followed immediately by washing and centrifuging in 100% and 70% ethanol. Samples were then dried in a 65 • C incubator for 8 min, and 10 µL of HiDi formamide (Applied BioSystems) was added to each pellet, denatured at 94 • C for 2 min, and cooled on ice for 5 min. The denatured and fluorescent-labeled DNA pellets were sequenced on an ABI 3130x Genetic Analyzer (Applied BioSystems) at the CAS CCG and a Prism TM 3730x (Applied BioSystems) at the AMNH SICG.

Phylogenetic Analysis
Sequences were aligned using the ClustalW method [35,36] in Mesquite v. 3.51 [37] and checked by eye. The average nucleotide composition of the aligned COI sequences was 17.7% A, 13% C, 25.2% G, and 44% T, and 725 sites were identical (67.4%), with 40.7% missing data as sequences varied greatly in length (125-1078 bp). Gaps in the dataset were treated as missing data and codon positions were optimized to reduce stop codons.
Phylogenetic relationships were inferred with a combined analysis of the molecular and morphological datasets using Bayesian inference (BI) in MrBayes v. 4.3.6 [38] and only the molecular dataset using maximum likelihood (ML) in RAxML v. 8.0.0 [39] on the CIPRES supercomputing cluster [40]. PartitionFinder2 [41] was used to determine the evolutionary models for BI and the partition definitions for ML. PartitionFinder2 obtained the highest AIC and BIC scores for the HKY+G model (for codons 1, 2, and 3); hence, that model was used for all subsequent analyses. Morphological characters were incorporated into the ML analysis and optimized using the gamma model of rate of heterogeneity and corrected for ascertainment bias using a Lewis correction type [42]. A rapid bootstrap analysis was run with 1000 iterations. MrBayes was run twice on four threads for 50,000,000 generations, with Markov chains sampled every 1000 generations and the standard 25% burn-in calculated. Branches with posterior probability values ≥0.95 were considered strongly supported [43]. A 50% majority-rule consensus tree was created, and nonparametric bootstraps were estimated for the optimal nucleotide substitution model. Nodes with bootstrap values ≥70 were considered strongly supported [43].

Species Delimitation
The limits of putative species were evaluated on the majority-rule consensus tree from the Bayesian 95% posterior probability of the concatenated dataset of morphological characters and aligned DNA sequences using the species delimitation plugin in Geneious [26,44]. Clusters of sequences were evaluated based on their association with individuals collected at the type locality. Specimens which were monophyletic with those collected at the type locality or within the vicinity thereof with greater than 0.95 posterior probability were assigned to the species in question. Several metrics were investigated for assessing species delimitation in Geneious, including P (AB) for reciprocal monophyly [45]; P (RD), which measures the probability of an observed clade's degree of distinctiveness [46] and values for the probability of population identification of a hypothetical sample based on the groups being tested; P ID (Strict), and P ID (Liberal).

Results
Phylogenetic analyses with BI and ML produced almost identical tree topologies with well-supported terminal nodes but weakly supported internal nodes (Figures 2 and 3). Nine well-supported reciprocally monophyletic species-level clades were recovered in both analyses, but relationships within each clade received low support (Figures 2 and 3).

Systematics
The present study is the first analysis of scorpions in which COI mini-barcodes from museum specimens were used for species identification. Three nominal species and six OTUs, identified by Esposito (2011) and described as new species of the 'thorellii' clade by Goodman et al. (2021), were confirmed by the analyses presented here, highlighting the efficacy of mini-barcodes for identifying morphologically similar cryptic species using specimens of variable age and preservation [22,23].  (Figures 2 and 3). The species delimitation analysis identified nine distinct species, all of which received values for Rosenberg's P (AB) lower than 0.01, supporting the results of the BI and ML analyses ( Table 2).

Systematics
The present study is the first analysis of scorpions in which COI mini-barcodes from museum specimens were used for species identification. Three nominal species and six OTUs, identified by Esposito (2011) and described as new species of the 'thorellii' clade by Goodman et al. (2021), were confirmed by the analyses presented here, highlighting the efficacy of mini-barcodes for identifying morphologically similar cryptic species using specimens of variable age and preservation [22,23].  Although this study demonstrates the effectiveness of mini-barcodes in sequencing specimens preserved in suboptimal conditions, several caveats must be taken into consideration.
Mitochondrial genes such as COI are regarded as the gold standard for DNA barcoding as their high mutation rates allow the sorting of specimens to species level with relative ease, low cost, and efficiency [46,47]. However, recent studies demonstrated that short mini-barcodes (<200 bp) perform more poorly than barcodes of medium length (>200 bp), necessitating other lines of evidence to differentiate species [45]. The addition of slowmutating nuclear genes is necessary to resolve deep basal evolutionary relationships [21,22,27]. The incorporation of both mitochondrial and nuclear genes with differing mutation rates is desirable as it improves the support and stability of nodes.
The ML and BI trees presented here received low support for the internal nodes, most likely due to the absence of a nuclear gene. Successful amplification of nuclear genes proved fruitless, even when attempting to amplify shorter fragments such as the Internal Transcribed Spacer (ITS2, 350 bp) and Histone 3 (H3, 300 bp) [48,49]. The shorter, 125-bp mini-barcode primers would have provided insufficient support for species delimitation in the absence of other lines of evidence. The inclusion and analysis of a morphological character matrix with the COI data allowed an integrative approach to species delimitation in the 'thorellii' clade.  Although this study demonstrates the effectiveness of mini-barcodes in sequencing specimens preserved in suboptimal conditions, several caveats must be taken into consideration.
Mitochondrial genes such as COI are regarded as the gold standard for DNA barcoding as their high mutation rates allow the sorting of specimens to species level with relative ease, low cost, and efficiency [46,47]. However, recent studies demonstrated that short mini-barcodes (<200 bp) perform more poorly than barcodes of medium length (>200 bp), necessitating other lines of evidence to differentiate species [45]. The addition of slow-mutating nuclear genes is necessary to resolve deep basal evolutionary relationships [21,22,27]. The incorporation of both mitochondrial and nuclear genes with differing mutation rates is desirable as it improves the support and stability of nodes.
The ML and BI trees presented here received low support for the internal nodes, most likely due to the absence of a nuclear gene. Successful amplification of nuclear genes proved fruitless, even when attempting to amplify shorter fragments such as the Internal Transcribed Spacer (ITS2, 350 bp) and Histone 3 (H3, 300 bp) [48,49]. The shorter, 125-bp mini-barcode primers would have provided insufficient support for species delimitation in the absence of other lines of evidence. The inclusion and analysis of a morphological character matrix with the COI data allowed an integrative approach to species delimitation in the 'thorellii' clade.  Masters et al. (2011): Intra Dist: average pairwise tree distance among species; Inter Dist: average pairwise tree distance within species; Intra/Inter: ratio of Intra Dist to Inter Dist; P ID (Strict): mean probability, with 95% confidence interval (CI), of correct identification of unknown specimen using placement on tree; P ID (Liberal): mean probability, with 95% confidence interval (CI), of correct identification of unknown specimen using BLAST (best sequence alignment), DNA Barcoding (closest genetic distance), or placement on tree; Av (MRCA-tips): mean distance between most recent common ancestor of species and its conspecific members; P (RD): probability that clades have observed degree of distinctiveness; P (AB): probability that clades are reciprocally monophyletic. Importantly, in most cases, multiple specimens of each species were represented by both partial COI and mini-barcodes, allowing more accurate alignments. Future analyses using mini-barcodes should include additional datasets for species identification if possible [46,50].

Diversification
The complex topography of the Mexican and Central American landscapes together with the ecological specialization of these scorpions are hypothesized to have driven diversification in the 'thorellii' clade, which appears to display high levels of endemism. The arboreal habits of the clade confer a unique niche safe from predation by larger grounddwelling scorpion species as well as other ground-dwelling predators [51]. Species such as C. berstoni, C. catemacoensis, and C. hamadryas exhibit very specific temperature, humidity, and substrate preferences [52]. Evidence of population structure among habitats but genetic admixture within habitats suggests these species are not constrained by dispersal within a given habitat but have narrow ecological requirements preventing them from dispersing between habitats.
The analyses identified geographically structured intraspecific genetic variation, despite morphological conservatism. For example, C. hoffmanni displayed a genetic structure among specimens collected less than 50 km apart. This species occurs predominantly within the Chiapas Depression, which acts as a corridor for some species and a barrier for others [53]. The results of the species delimitation analyses suggested all individuals of C. hoffmanni are conspecific, as genetic differentiation is minimal. The specialized arboreal microhabitat of C. hoffmanni suggests dispersal is limited across its broad geographical range, resulting in genetic divergence among populations occupying heterogeneous habitats, from evergreen and broadleaf forests to savannahs and scrub, across the Chiapas Depression.
A similar pattern was apparent with C. schmidti, in which divergence was evident among specimens from Guatemala and Honduras. Localities within the neighboring El Progresso and Zacapa departments of Guatemala share contiguous pine savannah and thorn forests. Localities within Honduras originated from Atlántida and Islas del Bahía, tropical coastlines with offshore islands characterized by broadleaf forests which receive high rainfall.
A third example is provided by C. chanae, in which divergence was evident between specimens separated by the Balsas Depression, an arid basin covering part of southwestern Guerrero and most of northwestern Michoacán in southern Mexico, which represents a barrier for other arachnid species [54][55][56][57]. Future studies should sample the entire ranges of C. chanae, C. hoffmanni, and C. schmidti to better understand the population structure within these species and investigate the possibility of additional cryptic species.
In contrast to C. chanae, C. hoffmanni, and C. schmidti, little genetic divergence was evident among specimens of C. rileyi distributed over a large area of northeastern Mexico, in the states of San Luis Potosí and Tamaulipas, suggesting unrestricted gene flow among populations. This may be due to the absence of barriers to dispersal across the homogenous Tamaulipan mezquital, a subhumid xeric shrubland biome which encompasses most of the lowlands of both states [58].

Conclusions
Systematics has reached a critical point in the 21st century where the rate of species description must outpace the decline of biodiversity [46]. Specimens must be rapidly sorted to the species level to infer large-scale biodiversity patterns. Such baseline work is essential to determine changes in species richness caused by climate change and habitat destruction. Luckily, vast numbers of specimens, including many undescribed species, are already housed within natural history collections; the challenge comes in developing faster, cheaper, and more efficient molecular sorting methods to harness data for their delimitation and identification, regardless of the age or preservation of the specimen [47].