Highly Resolved Phylogenetic Relationships within Order Acipenseriformes According to Novel Nuclear Markers

Order Acipenseriformes contains 27 extant species distributed across the northern hemisphere, including so-called “living fossil” species of garfish and sturgeons. Previous studies have focused on their mitochondrial genetics and have rarely used nuclear genetic data, leaving questions as to their phylogenetic relationships. This study aimed to utilize a bioinformatics approach to screen for candidate single-copy nuclear genes, using transcriptomic data from sturgeon species and genomic data from the spotted gar, Lepisosteus oculatus. We utilized nested polymerase chain reaction (PCR) and degenerate primers to identify nuclear protein-coding (NPC) gene markers to determine phylogenetic relationships among the Acipenseriformes. We identified 193 nuclear single-copy genes, selected from 1850 candidate genes with at least one exon larger than 700 bp. Forty-three of these genes were used for primer design and development of 30 NPC markers, which were sequenced for at least 14 Acipenseriformes species. Twenty-seven NPC markers were found completely in 16 species. Gene trees according to Bayesian inference (BI) and maximum likelihood (ML) were calculated based on the 30 NPC markers (20,946 bp total). Both gene and species trees produced very similar topologies. A molecular clock model estimated the divergence time between sturgeon and paddlefish at 204.1 Mya, approximately 10% later than previous estimates based on cytochrome b data (184.4 Mya). The successful development and application of NPC markers provides a new perspective and insight for the phylogenetic relationships of Acipenseriformes. Furthermore, the newly developed nuclear markers may be useful in further studies on the conservation, evolution, and genomic biology of this group.


Introduction
In 2010, the International Union for Conservation of Nature (IUCN) Red List of Threatened Species concluded that "Sturgeon more critically endangered than any other group of species". Considering IUCN Red List data, twenty-seven species of sturgeon are listed with 63 percent as Critically Endangered-the Red List's highest category of threat. Four species are now possibly extinct. Although progress was made in recovery of North American species during the last decade, strategy. Four new transcriptomes of sturgeon and the L. oculatus genome were used as a database. These were analyzed by screening candidate single-copy genes to develop NPC markers and using multi-sequence alignment to find the conserved regions of exons. Two pairs of degenerate primers were designed for each marker for polymerase chain reaction (PCR) amplification. Thus, by first constructing a phylogenetic relationship and then reinvestigating the divergence times of the 16 Acipenseriformes species using the 30 newly developed NPC markers, our study provides new insight into Acipenseriformes phylogeny and improves our understanding of evolution at the nuclear gene level.

Datasets
The transcriptome sequencing data of seven species (Acipenser baerii, A. gueldenstaedtii, A. naccarii, A. schrenckii, A. sinensis, A. stellatus, and A. transmontanus) were searched and downloaded from the NCBI Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). Due to the difference in data size, here we only kept the transcriptome data of four species (A. baerii, A. schrenckii, A. sinensis, and A. transmontanus) for subsequent analysis. The accession numbers and other details are presented in Table S1. The SRA files were converted into fastq files using fastq-dump within the SRA Toolkit version 2.7.0 (Bethesda, MD, USA).

Quality Control and De Novo Transcriptome Assembly
The fastq data were first processed using fastqc version 0.11.3 software [37] (Babraham Institute, Cambridge, UK) to view the summary quality control of the reads. Trimmomatic version 0.33 [38] (RWTH Aachen University, Aachen, Germany) was used to filter the base reads for low quality and artificial errors in the sequence ends. In this step, bases with abnormal content of approximately 8 bp according to fastqc results and those showing low quality (Q ≤ 3 base) at the start and end of the sequence were removed. The de novo assembly of clean reads was performed using Trinity version 2.0.6 software [39] (Broad Institute, Cambridge, MA, USA), with the min_kmer_cov value set to 2 and all other parameters set to default. Only contigs longer than 200 bp were used for further analysis. Then, the unigenes were extracted from the transcripts and searched against the protein data set from the L. oculatus genome using blastx version 2.2.31 (NCBI, Bethesda, MD, USA) with an E-value of 1 × 10 −5 . Protein data sets were extracted from the blast results based on the Perl script. The coding sequences (CDS) and protein sequences of the unigenes without blastx hits were predicted using TransDecoder version 3.0.0 (http://transdecoder.sourceforge.net/).

Comparison of Transcripts and Extraction of Protein Sequences
The unigene sequences of four species (A. baerii, A. sinensis, A. schrenckii, and A. transmontanus) were compared with the protein sequences from the spotted gar (L. oculatus) whole genome using blastx (E-value < 10 −5 ). The protein sequences were then extracted according to the alignment results using the Perl script.

Identification of Orthologous Single-Copy Genes
We identified sturgeon orthologous genes using OrthoMCL version 2.0.9 [40] (http://www. orthomcl.org). Briefly, all protein sequences of the four sturgeons and the spotted gar were filtered according to length, with sequences >30 amino acid retained. Sequences between the start codon and the first stop codon which were less than 20% of the complete protein sequence were removed. We then used the all-vs-all BLAST method in the filtered files, and results with E-values < 10 −5 were retained. Clustering was performed using Markov cluster algorithm (MCL) to obtain gene families. Finally, the orthologous single-copy genes (1:1:1) were extracted using the Perl script.
In addition, the spotted gar genome and the corresponding annotation files were downloaded from Ensembl (http://www.ensembl.org/index.html). We used Perl scripts to determine the corresponding genes of the single-copy orthologous sequences of the four-sturgeon species, and single-copy genes containing at least one exon of >700 bp were identified for further analysis.

Exon Search and Primer Design
Nested PCR and degenerate primer strategies have been widely used in developing NPC markers because they can significantly improve the success rate of PCR [5,12,41]. In order to ensure the universality of primers, we screened 193 nuclear single-copy genes, which were selected from 1850 candidate genes containing at least one exon >700 bp. First, we downloaded specific information in fasta format for the 193 candidate genes of spotted gar from Ensembl (http://asia.ensembl.org). Then, we chose 43 nuclear single-copy genes for primer design, many having only one exon. We then integrated the information for each gene in a table and performed multiple alignments of the CDS from the five species. Conserved regions containing no termination codons were selected for designing degenerate primers ( Figure S1). The length of each primer was 21-26 bp, with approximately 172 primers designed for the development of nuclear gene markers.

Taxon Sampling, DNA Extraction, and Experimental Testing
Sixteen Acipenseriformes species were sampled in this study, consisting of four genera (Table 1). Originally, specimens were identified based on morphological characteristics [28]; but species were also identified by sequencing of mitochondrial genes [1,18,22]. All animal experiments were conducted and approved by the Animal Care and Use Committee of Southwest University (No. 2017-7). Genomic DNA was extracted from fin tissues and egg previously stored at −80 • C, using the Qiagen DNeasy Kit (Qiagen, Shanghai, China), according to manufacturer's instructions. The 43 NPC candidate loci were amplified using nested PCR. First-round PCR reactions were conducted in 25 µL volumes containing the following: 2.5 µL 10× buffer (Mg 2+ plus), 2.0 µL of 2.5 mM dNTPs, 1 U Taq DNA polymerase (rTaq, TaKaRa; Dalian, China), 1 µL of each first-round primer (10 µM), 1 µL genomic DNA (~100 ng/µL), and double-distilled water to yield a final volume of 25 µL. The following conditions were used for PCR amplification: initial denaturation for 4 min at 94 • C; 35 cycles of 45 s at 94 • C, 40 s at 45 • C, and 2 min at 72 • C; and a final extension of 10 min at 72 • C [12]. Negative controls (i.e., containing no DNA templates) were used in each PCR run to test for contamination and artifacts. Second-round PCR (25 µL) contained the following: 2.5 µL 10× buffer (Mg 2+ plus), 2.0 µL of 2.5 mM dNTPs, 1 U Taq DNA polymerase (rTaq, TaKaRa; Dalian, China), 1 µL of each second-round primer (10 µM), 0.8-1.5 µL first-round PCR product without dilution, and double-distilled water to yield a final volume of 25 µL. The following conditions were used for PCR amplification: an initial denaturation for 4 min at 94 • C; 30 cycles of 45 s at 94 • C, 40 s at 50-52 • C, and 1.5 min at 72 • C; and a final extension of 10 min at 72 • C. Samples were then cooled to 4 • C. Second-round PCR products were examined via electrophoresis through 1% agarose gels. If multiple bands were present, the correct PCR product according to size was purified by cutting the band from the gel. In order to verify that markers were effective, extraction and amplification of these markers was attempted using the relevant primer pair in three species (A. dabryanus, A. sinensis, and Polyodon spathula). When single-band amplification, sequencing, and BLAST searching against the target gene in these three species were satisfied, the marker was tested in the remaining 13 species. An NPC candidate locus was considered a feasible marker if Sanger sequencing was successful and results were achieved in at least 13 out of the 16 species ( Figure S1).

Sequence Assembly and Phylogenetic Analyses
Second-round PCR products were sequenced using Sanger sequencing. The forward and reverse sequences of each NPC maker were manually assembled using ContigExpress version 3.0.0 (Invitrogen; Carlsbad, CA, USA) in the Vector NIT software suite [42] and visually examined ( Figure S1). The sequences were aligned using the clustalW algorithm in MEGA version 7 [43]. Ambiguously aligned regions were removed using Gblocks version 0.91b [44], with all gaps allowed (−b5 = a), utilizing the 'codon' model setting and setting all other values to default. We used Sequence Matrix version 1.8 [45] to concatenate successful NPC markers. The variable sites (variability) and parsimony information (PI) sites of each alignment and concatenated data were calculated by MEGA 7. The supermatrix data were divided into four parts for saturation testing by DAMBE version 5 [46]. Then, we manually defined five partitioning strategies: two partitions (1-and 2-codon positions together as a partition and one partition for the 3-codon position), three partitions (one partition for each codon position), 30 partitions (each gene as a partition), 60 partitions (one partition for 1-and 2-codon positions together and the 3-codon position as a partition across 30 genes), and 90 partitions (codon position partitioning across 30 genes). The comparisons of the five partitioning strategies and selections of corresponding nucleotide substitution models were conducted under the corrected Akaike information criterion implemented in PartitionFinder version 2 [47].
The supermatrix dataset was separately analyzed with both maximum likelihood (ML) and Bayesian inference (BI) methods with 60 schemes. RAxML version 8.0 [48] was used to analyze the partition ML, with the GTR + Γ + I model assigned for each partition. Branch support for the resulting phylogeny was evaluated with 500 rapid bootstrapping replicates (with −f as an option) implemented in RAxML. MrBayes version 3.2 [49] was used to analyze partition BI. All model parameters were unlinked, with one cold chain and three heated chains for the Markov chain Monte Carlo (MCMC) process, which began with random starting trees. The analysis was run for 30 million generations and sampled every 1000 generations, with the first 25% representing burn-in. Posterior probabilities were obtained from the 50% majority rule consensus tree of the remaining topologies using TreeAnnotator, and the above process was run independently twice. Chain stationarity was visualized by plotting likelihoods against the generation number using the program Tracer version 1.7 [50]. Effective sample sizes greater than 200 ensured the convergence of the results.
The construction of the species tree was completed by ASTRAL version 5.5.9 [51] based on gene trees estimated from the 30 NPC markers. This Java program estimates a species tree given a set of unrooted gene trees and support is calculated using local posterior probabilities [52]. We used 30 maximum likelihood gene trees as input trees, derived using the GTR + Γ + I model in RAxML with five hundred bootstrap replicates.

Hypothesis Testing
To test the statistical significance of alternative hypotheses for the Acipenseridae family within the Acipenseriformes, we compared several previous studies [15,24,28,29] with respect to the relationships among three genera within the family (i.e., Pseudoscaphirhynchus, Huso, and Acipenser). Topology tests were conducted using the datasets of all 16 species. The site-log-likelihood values were calculated under the GTR + Γ + I model for nucleotides using TREE-PUZZLE version 5.3 [53] (command-line option: -wsl). The obtained values were used as input for the software, CONSEL [54]. Hypotheses were statistically tested by AU [55], KH [56], and SH [57].

Estimating Divergence Time
In order to increase the speed of calculation, molecular dating was estimated using the relaxed-clock based program MCMCTREE version 4.9e (University College London, London, UK) [58]. In this analysis, the best tree from the hypothesis testing was used as the reference tree. All 30 NPC markers were concatenated as a single "supermatrix" (20,946 bp), and the alignment was divided into three partitions corresponding to the 1st, 2nd, and 3rd codon sites. The approximate likelihood method (usedata = 2) [58] was used for divergence time estimation. For each locus, the root age was set at "<2.0" and the model at "HKY85 + G". The molecular clock model used independent rates (clock = 2). At present, there is only one report on the complete estimation of divergence times between species within the Acipenseriformes order [18]. Therefore, two nodes (C1 and C2) were considered as time-calibrated points with lognormal distributions and soft constraint bands (allowing violation probability of 0.025) in the present study. The C1 calibration point was estimated as the split time between sturgeon and paddlefish, based on a fossil of a Hauterivian-age polyodontid, dating from Early Cretaceous (~132-200 Mya) [17,59]. The C2 calibration point was used to estimate the node that A. oxyrinchus and A. sturio splitting with other sturgeons based on the acipenserid fossil from the Campanian age, dating from Late Cretaceous (~74 Mya) [60] (Table S2). We used 500,000 generations for the Markov chain Monte Carlo (MCMC) analysis, with the first 20% of all samples discarded as burn-in and samples collected every 200 generations thereafter up to 20,000. An independent rates model (clock = 2), which follows a lognormal distribution, was used for the MCMC search.

Quality Control and Assembly
After removing artifact erroneous bases at the ends of the sequences and low-quality reads, the high-quality reads were used for further analysis. The software, Trinity [39], was used to guarantee transcripts. In the same species, all reads from different libraries or tissues were pooled. The longest isoform of a gene served as the unigene. Finally, we obtained 13,666, 125,179, 211,209, and 234,179 unigenes with an N50 of 787, 653, 654, and 704 bp from A. sinensis, A. transmontanus, A. schrenckii, and A. baerii, respectively.

Identification of Putative Proteins
We analyzed the possible protein sequences to identify orthologous genes for comparison of the species divergence of sequences. The unigenes of each species were used to predict CDS and protein sequences. We recovered 37,751, 36,648, 32,470, and 36,490 protein sequences (N50: 224, 216, 319, and 323 bp, respectively) from A. sinensis, A. transmontanus, A. schrenckii, and A. baerii based on the search results from blastx. The remaining unigenes were predicted using TransDecoder pipeline, obtaining 9530, 7485, 12,226, and 14,040 amino acid sequences with N50 lengths of 195, 188, 177, and 191 bp, for the four species respectively (Table S3). These two groups of protein sequences were combined and used for the identification of homologous genes. Finally, these amino acid sequences from A. sinensis, A. transmontanus, A. schrenckii, and A. baerii were used for further analysis. A total of 21,820 homologous gene families were obtained in four sturgeons and spotted gar protein sequences clustered by using OrthoMCL. Among them, 1850 single-copy orthologous genes were identified, including spotted gar genes. In addition, 193 nuclear single-copy genes that contained at least one exon of >700 bp were selected from the spotted gar genome.

Characteristics of NPC Markers
According to the abovementioned screening criteria, we successfully found 30 new NPC markers for Acipenseriformes, which were amplified in at least 14 species. Fragments ranged from 467 bp to 993 bp, with an average length of 698 bp. In a total of 480 PCR reactions (30 loci × 16 species), 476 (99.1%) were successfully amplified, with 469 (97.7%) producing high-quality bands used for Sanger sequencing ( Figure S2). The PCR primer information of the 30 NPC markers, including their genome locations, is listed in Table S4 and their GenBank accession numbers are shown in Table S5. Their variability ranged from 4% to 16%, with an average of 9%, and their parsimony-informative sites ranged from 2% to 16%, with an average of 4% (Table S6). Twenty-seven NPC markers were amplified in all 16 species. The NPC marker 'fam43a' could not be amplified from H. dauricus or Pseudoscaphirhynchus kaufmanni, and 'ENSLOCG00000018097' and 'cebpa' could not be amplified from P. kaufmanni ( Figure S2).

Reconstructing Phylogenetic Relationships in the Acipenseriformes
The supermatrix combined 30 NPC markers based on 20,946 bp. The results of the saturation tests showed that the values of the saturation index (Iss) for the first codon, second codon, third codon, first and second codons, and all three sites of protein-coding regions were significantly smaller than the critical values, Iss.cSym or Iss.cAsym (Table S7).
The results of PartitionFinder 2 supported the 60 schemes as the best-fitting partitioning schemes using both ML and BI analyses. Both the ML tree and BI tree produced highly similar topologies. Ten out of thirteen internal nodes produced bootstrap support of >75% in the ML tree and all internal nodes had posterior probabilities >0.95 in the BI tree ( Figure 1). The species tree produced similar topologies with ML and BI trees in most lineages, and most internal nodes had high branch support ( Figure 2

Divergence Times
The estimated divergence times of the 16 species of Acipenseriformes are shown in Figure 3. Our results suggest that the estimated divergence of sturgeon and paddlefish occurred at 204.

Discussion
Polyploidization is a common phenomenon in animals and plants [62], which unfortunately causes problems for the use of nuclear markers in phylogenetic analyses. These problems make conventional methods of nuclear marker development very inefficient. On the contrary, searching conservative sequences based on multi-sequence alignment provides a new method to develop large scale nuclear markers [5,7,9,63]. Many Acipenseriformes species are polyploid and lack reference genome sequence data; as a result, there has been strong dependence on mitochondrial data of the phylogenetic relationships among species, with nuclear data rarely used. Nowadays, most Acipenseriformes are rare and often close to extinction, making sampling of wild specimens very difficult.
Twenty to fifty loci are often sufficient to answer phylogenetic questions [64,65]. Therefore, we used a strategy of multi-sequence alignment of transcriptomes combined with spotted gar genomic data to develop single-copy NPC markers to construct the phylogenetics of Acipenseriformes. The success rate of PCR amplification is an important parameter [41]. Our success rate was 99.1%, representing 30 NPC markers (100% for 27 markers) in 16 species of Acipenseriformes ( Figure S2); the combination of low degenerate primers and nested PCR was important for this high rate of success. Phylogenetic analysis showed that the gene and species trees have highly similar topologies and most internal nodes are highly supported, showing that our NPC markers are reliable. This strategy for developing NPC markers can be used for the development of molecular markers in other polyploid groups that also lack reference genomes but have sufficient transcriptome data. Notably, the set of 30 NPC markers can be used for the remaining 11 sturgeon species.
The results from the nuclear gene trees and species tree strongly support the classification of Acipenseriformes into two families, Acipenseridae and Polyodontidae, which is consistent with previous morphological and molecular studies [15,18,[21][22][23][24][25]27,28,51,66]. The monophyly of sturgeon and paddlefish is not disputed. Nonetheless, the phylogenetic relationships of the four genera in the family Acipenseridae have been controversial for a long time. Our results suggest the existence of three clades, where clade 1 is represented by A. sturio and A. oxyrinchus as sister species with a unified basal lineage. All other species of Acipenser, Huso, and Pseudoscaphirhynchus can be divided into Pacific species (clade 2) and Atlantic species (clade 3) [15,18]. Our data support the Pacific clade being a monophyletic group, in agreement with previous molecular studies [15,18,22,26]. However, the internal phylogenetic relationships are different from those of previous studies. For example, A. dabryanus and A. sinensis have been found to be a sister group with A. schrenckii and A. transmontanus based on many mitochondrial DNA data, but nuclear markers suggest that A. schrenckii and A. transmontanus should be a sister group with H. dauricus. Presently, A. dabryanus and A. sinensis are found in China; H. dauricus and A. schrenckii are found in the Amur River, Sea of Okhotsk, and Sea of Japan; and A. transmontanus is found in the North Eastern Pacific (Table 1, Figure 3). Therefore, our results are consistent with geographical distribution of species. Currently, the species composition of the Atlantic clade is complex and contains at least three genera, including Acipenser, Huso, and Pseudoscaphirhynchus. The genus Huso is not a monophyletic group, with H. dauricus and H. huso dispersed and nested within the genus Acipenser, consistent with previous molecular studies [15,18,22]. The genus Pseudoscaphirhynchus is also nested in the genus Acipenser. Previous studies have indicated that P. kaufmanni is closely related with A. stellatus within the Atlantic clade [15,24]. However, our data do strongly support that P. kaufmanni is located basal in the Atlantic clade, which has not been noted previously.
Prior to this study, there was only one complete report on the divergence time of 24 species of Acipenseriformes; therefore, we choose cytb results for comparison [18]. Considering the slower saturation of nuclear exon data, time estimates based on nuclear exons would seem to be more reliable [67]. Notably, our results showed that the nuclear gene data set was narrower and more accurate in its credibility intervals. We estimated the divergence time for sturgeon and paddlefish at 204.1 Mya, similar to the oldest Acipenseriformes fossil record (ca. 200 Mya) [17,59]. The Ponto-Caspian region has been very unstable over the last 150 Mya [68], which may have led to the diversification of the Acipenseridae [26]. Based on our dating results, the divergence time of Acipenseridae was estimated to be 144.9 Mya (Table S8), more recently than the 171.6 Mya estimated previously [18], but close to 150 Mya, supporting the hypothesis of Bemis and Kynard (1997) [26]. Interestingly, we determined that the Pacific-Atlantic split occurred at around 117.6 Mya, around the time that the Tethys Sea separated into two oceans (120 Mya) [68]. We suggest that the continued shrinkage of the Tethys Sea greatly changed the habitat, resulting in a significant diversification of the sturgeon species. Additionally, the divergence time of the genus Pseudoscaphirhynchus was estimated to be 103.3 Mya, less recently than estimated by all previous studies. However, due to the incomplete sampling of species, the actual divergence time may be a little earlier than our results indicate.

Conclusions
The transcriptome plus reference genome data mining strategy utilized here to identify suitable nuclear protein-coding loci for novel marker development has been proven successful in Acipenseriformes. The newly developed 30 NPC markers had a high rate of experimental success (~99.1%), with excellent performance of the concatenated dataset describing phylogenetic relationships. This dataset and markers were useful in reconstructing the relationships within the Acipenseriformes order, yielding the novel findings that P. kaufmanni is identified located in the base position of the Atlantic clade of species. As more sequences become available in the future, further studies with more comprehensive taxon sampling are essential to fully resolve the evolutionary history of Acipenseriformes. Additionally, our marker set offers a great opportunity for species identification including artificial produced hybrids. Trade control is essential for protection of last remaining wild stocks.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/10/1/38/s1, Figure S1: Flowchart for bioinformatics screening and experimental design, Figure S2: PCR amplification rendering of the 30 NPC markers in 16 Acipenseriformes species, Table S1: Transcriptome information of four sturgeons, Table S2: Maximum (U) and minimum (L) constrains (Mya) and calibration information for nodes in Figure 1 in the molecular dating analysis, Table S3: Number of predicted protein sequences (N) according to detected unigenes, Table S4: Summary information of 30 NPC markers, Table S5: List of GenBank accession numbers for the markers obtained in the present study, Table S6: Summary information of 30 NPC markers in 16 species of Acipenseriformes, Table S7: Saturation test performed by using DAMBE, Table S8: Detailed results of Bayesian molecular dating using MCMCTREE, and a comparison among shared nodes in previous study.