Genetic Diversity and Population Structure of Hemiculter leucisculus (Basilesky, 1855) in Xinjiang Tarim River

Hemiculter leucisculus is an invasive fish and widely distributed in the Xinjiang Tarim River. In this study, RAD-seq was used to explore the genetic diversity and population subgroup structure of H. leucisculus in the Tarim River and develop relevant Simple Sequence Repeat (SSR) markers. The study collected 40 samples distributed at four different sites of the Tarim River. A total of 7,291,260 single nucleotide polymorphisms (SNPs) were obtained. The genetic diversity results showed that the population genetic diversity level of H. leucisculus was low. The population pairwise FST values ranged from 0.231 to 0.258, indicating that there was moderate genetic differentiation among these populations. AMOVA showed that the genetic variation within populations accounted for 92.31% of the total variation. The principal component analysis (PCA) and neighbor joining (NJ) tree revealed that the four populations could be separated into two clusters (upper-middle and downstream populations) and the individuals from Taitema Lake (TTMH) showed differences and had a bigger geographic distance than the others. There is the probability that the H. leucisculus from Bosten Lake entered Taitema Lake to breed and then expanded into the Tarim River due to the water diversion projects in location. In addition, 147,705 SSRs loci were detected and 22,651 SSR primer pairs were developed. This study will contribute to providing valuable molecular data for the management of wild populations, marker-assisted selection and resource exploitation of H. leucisculus.


Introduction
Hemiculter leucisculus (Basilewsky, 1855) is a small cyprinid fish. It has a wide distribution in East Asia, living in the upper middle layer of water and likes to live in clusters. It is an invasive fish in the Tarim River of Xinjiang. In the past, the economic value of H. leucisculus was considered low, so its potential value has not been exploited. However, in some areas where stable populations have been established, it has become an important economic fish [1]. Xie Zongyong et al. studied H. leucisculus of Fenhe Reservoir and the growth characteristics of H. leucisculus [2]. Li Baolin et al. conducted biological research on H. leucisculus from Dalai Lake [3]. Currently, the research on H. leucisculus mainly focuses on classification, nutrition and biology. At the same time, there is less research on the genetic background of H. leucisculus, which makes it difficult to gain strong theoretical support for the population dynamics of H. leucisculus. Therefore, it is important to study the genetic diversity of H. leucisculus.
RAD-seq (Restriction site-associated DNA sequencing) is a series of sequencing technologies based on restriction endonucleases characterized by simple operation, low application cost and wide coverage. It provides an accurate and efficient technique for studying population genetic structure and diversity at the genome level and has been widely used in these fields. Microsatellites, also known as simple sequence repeats (SSR), are abundant and dispersed broadly in the coding region and non-coding region of eukaryote genome and are composed of short (2-5 bp) variable tandemly repeated arrays. SSR markers are very Genes 2022, 13, 1790 2 of 12 practical genetic markers widely used in many fields, such as paternity testing, molecular marker-assisted breeding, genetic map construction, quantitative trait loci location and population genetics [4,5].
In this study, we used the RAD-seq method to unravel the population structure of H. leucisculus in the Tarim River. The objectives of our study were to use SNP genotyping to investigate and evaluate the genetic diversity, genetic differentiation and population structure of H. leucisculus and develop SSR markers. These data are the basis for the formulation of effective management strategies for this invasive fish, providing essential information both for its control and its potential exploitation as a fishery resource.

Sampling and DNA Extraction
Along the 1321 km long main course of the Tarim River, fish were sampled in four sectors corresponding to the upper, middle and lower reaches of the river. Fish samples were taken from four locations in the Tarim River basin from May to October 2019. In total, 40 specimens were collected from four sample sites in the Tarim River basin ( Figure 1). The fin samples were obtained and stored in 95% ethyl alcohol in −20 • C refrigerators for subsequent experiments. The genomic DNA of 40 individuals was extracted using the CTAB (Cetyltrimethylammonium Bromide) method. DNA quality and concentration were measured by 1.0% agarose gel electrophoresis and NanoDrop 2000 spectrophotometers (Thermo Scientific, Wilmington, DE, USA). studying population genetic structure and diversity at the genome level and has been widely used in these fields. Microsatellites, also known as simple sequence repeats (SSR), are abundant and dispersed broadly in the coding region and non-coding region of eukaryote genome and are composed of short (2-5 bp) variable tandemly repeated arrays. SSR markers are very practical genetic markers widely used in many fields, such as paternity testing, molecular marker-assisted breeding, genetic map construction, quantitative trait loci location and population genetics [4,5].
In this study, we used the RAD-seq method to unravel the population structure of H. leucisculus in the Tarim River. The objectives of our study were to use SNP genotyping to investigate and evaluate the genetic diversity, genetic differentiation and population structure of H. leucisculus and develop SSR markers. These data are the basis for the formulation of effective management strategies for this invasive fish, providing essential information both for its control and its potential exploitation as a fishery resource.

Sampling and DNA Extraction
Along the 1321 km long main course of the Tarim River, fish were sampled in four sectors corresponding to the upper, middle and lower reaches of the river. Fish samples were taken from four locations in the Tarim River basin from May to October 2019. In total, 40 specimens were collected from four sample sites in the Tarim River basin ( Figure  1). The fin samples were obtained and stored in 95% ethyl alcohol in −20 °C refrigerators for subsequent experiments. The genomic DNA of 40 individuals was extracted using the CTAB (Cetyltrimethylammonium Bromide) method. DNA quality and concentration were measured by 1.0% agarose gel electrophoresis and NanoDrop 2000 spectrophotometers (Thermo Scientific, Wilmington, DE, USA).

RAD Library Construction and Sequencing
The restriction site-associated DNA sequencing library was constructed using the recommended method described by Willing E M et al. (2011) [6]. About 1 μg of genomic DNA per sample was digested with EcoRI and incubated at 37 °C with T4 DNA ligase and

RAD Library Construction and Sequencing
The restriction site-associated DNA sequencing library was constructed using the recommended method described by Willing E M et al. (2011) [6]. About 1 µg of genomic DNA per sample was digested with EcoRI and incubated at 37 • C with T4 DNA ligase and EcoRI adapter. Using NEBNext dsDNA fragmentase, the reaction mixtures were segmented into 50-1000 bp. The samples were purified with AMPure XP beads (Beckman Coulter, Brea, CA, USA). The fragmented DNA was used to conduct end repairing, dA-tailing and adapter ligation. Subsequently, the PCR productions were purified and pooled to obtain the lengths of 300-400 bp digested fragment sequences for library construction. PCR amplification was performed on the size-selected fragments (12 cycles of PCR; 50 µg reaction system) using NEBNext high-fidelity 2X PCR master mix. Finally, the PCR products were purified using a High Sensitivity DNA assay Kit (Agilent Technologies, Lithuania) and sequenced on the Hi-Seq 2500 platform (Illumina Inc., San Diego, CA, USA).
The raw data generated from the Illumina sequencer were filtered with the fastp program. Low-quality data filtering included removing reads with adapter, reads containing more than 10% of N and reads containing more than 50% bases with a quality value of less than 20 to obtain clean reads for subsequent bioinformatic analysis. To ensure the quality and quantity of the data, the minor allele frequency (MAF) was below 0.01. With the population RAD-tags set as a reference, the Burrows-Wheeler Aligner (BWA) was used to compare reads of each sample in the population, and GATK, a mutation detection software, was used to detect SNP in the population.

Genetic Diversity and Population Structure
SNPs obtained by sequencing were used to evaluate population genetic statistics, including average nucleotide diversity (Pi), observed heterozygosity per locus (H o ), expected heterozygosity (H e ), and Wright's inbreeding coefficient (F IS ). Pairwise F-statistics (F ST ) among the four populations were performed using the Arlequin v3.5 [7]. In addition, to detect within and among populations genetic variance components, molecular variance (AMOVA) was analyzed using the StAMPP v1.6.3 [8].
A neighbor-joining (NJ) tree was constructed to cluster the populations using TreeBest software. Based on SNP differences between individuals, principal components analysis (PCA) can cluster individuals into different subgroups according to different traits. PCA analysis was conducted by software GCTA v1.93.2 [9], and the scatter plotting was performed with the first and second components using the Ggplot2 package [10] in R. The Admixture software [11] was used to constructed population structure in order to analyze the ancestral components of the 40 individuals and visualized using R. Considering the complex history of the Tarim River, such as multiple water diversion projects, there may be more than four populations. Therefore, the default K value was set as 1-9, the bootstrap value was set as 1000 and other parameters were set as default. The optimal K value was determined by a cross-validation error of the software.

SSR Loci Detection and SSR Marker Development
Microsatellite mining (http://pgrc.ipk-gatersleben.de/misa/ (accessed on 9 November 2020)) was performed using the Microsatellite (MISA) identification tool, Primer3 V2.3.6 [12] for each SSR primer pair at default Settings design, the size of the PCR products between 100~300 bp. Locus polymorphism screening mainly included singleness and polymorphism screening. Singleness was determined by agarose gel after PCR and whether the product is single or not. The polymorphic screening was performed by agarose electrophoresis after PCR with fluorescent primers to determine whether the peak types were qualified or polymorphic.

RAD Sequencing and Data Quality
Forty H. leucisculus were sequenced using RAD-seq. The sequence results showed that 632,037,408 clean reads were obtained, with an average number of 15,800,935 samples, and the average quality of Q20 was 96.91%. The average quality of Q30 was 91.57%. The average GC content was 39%, indicating that the difference before and after filtration is very small. Q20 and Q30 were both at a high level, and GC content was low, which proved that the sequencing quality was up to standard, the base error rate was very low and the sequencing information was reliable (Table 1). Population SNP spectrum analysis ( Figure 2) showed that SNP mutations could be divided into 12 categories, with A/C (10.90%), C/A (10.58%), C/T (14.02%), G/A (13.99%) and T/C (10.65%) as the main types of SNP mutations. Transversions (Tv) was larger than transition (Ti), accounting for 50.43% of SNP, and the ratio of Ti to Tv is 0.98. Population SNP spectrum analysis ( Figure 2) showed that SNP mutations could be divided into 12 categories, with A/C (10.90%), C/A (10.58%), C/T (14.02%), G/A (13.99%) and T/C (10.65%) as the main types of SNP mutations. Transversions (Tv) was larger than transition (Ti), accounting for 50.43% of SNP, and the ratio of Ti to Tv is 0.98.

Genetic Diversity and Genetic Differentiation Analysis
The genetic diversity of 40 H. leucisculus populations was analyzed. The results showed that the population genetic diversity level of H. leucisculus was low. The nucleotide diversity (Pi) of the populations ranged from 0.000964 to 0.001144, which was highest in TTMH. The observed heterozygosity (Ho) across all populations ranged from 0.0593 to 0.0774, with an average of 0.0672. Ho was lowest in YBZ (0.0593) and highest in TTMH (0.0774). Expected heterozygosity (He) ranged from 0.1925 to 0.2055. He was lowest in YBZ

Genetic Diversity and Genetic Differentiation Analysis
The genetic diversity of 40 H. leucisculus populations was analyzed. The results showed that the population genetic diversity level of H. leucisculus was low. The nucleotide diversity (Pi) of the populations ranged from 0.000964 to 0.001144, which was highest in TTMH. The observed heterozygosity (Ho) across all populations ranged from 0.0593 to 0.0774, with an average of 0.0672. Ho was lowest in YBZ (0.0593) and highest in TTMH (0.0774). Expected heterozygosity (He) ranged from 0.1925 to 0.2055. H e was lowest in YBZ (0.1925) and highest in TTMH (0.2055). The individual inbreeding coefficients (F IS ) of the populations ranged from 0.402 to 0.7016 ( Table 2). The Ho of the four populations was lower than the He, and F IS was greater than 0, indicating that heterozygote deletion existed in H. leucisculus population. Pi was lower than 0.005, which showed that the genetic diversity of H. leucisculus population was relatively weak. TTMH had the highest Ho, He and Pi, indicating that the population had the highest genetic diversity. The population pairwise F ST values varied from 0.231 to 0.258 (Table 3). Significant genetic differentiation was observed in all population pairs. The lowest F ST values were between QL and TTMH and highest F ST values were between THY and YBZ. In addition, to detect genetic variance components within and among populations, AMOVA was conducted (Table 4). AMOVA showed that the genetic variation within populations accounted for 92.31% of total variation, and 7.69% of variation was distributed among populations.

Population Structure Analysis
According to the SNP information generated form RAD-seq, the NJ tree was performed in the 40 individuals ( Figure 3A). The NJ tree was supported by bootstrapping (>70%; Figure 3A). The results showed that the forty individuals could be grouped into two clusters (upper-middle and downstream populations). TTMH formed a cluster (cluster 1), and the other populations formed a large group (cluster 2). However, to some extent some individuals from TTMH can be clustered into the large cluster 2 (QL, THY, YBZ) together, which indicated that TTMH did not split as a whole and the four populations had a contact. The results of the PCA scatter plot, of which principal factor 1 explained 3.55% and principal factor 2 explained 3.16% of the overall variance, were consistent with NJ tree ( Figure 3B). The PCA indicated that QL, THY and YBZ were clustered closely and showed a highly mixed state. In addition, the individuals from TTMH clustered together with other populations, but they had a bigger geographic distance than the other populations. The results of PCA were consistent with the NJ tree.  Figure 3A.
To further reveal the genetic structure of 40 individuals, the cluster analysis was conducted (K = 1~9). The CV error showed the lowest value with K = 9 and indicated that the optimal cluster number was nine genetic clusters ( Figure 4B). When the K value was 3, the component of ancestry of most individuals from TTMH and YBZ was similar compared with QL and THY. When the K value was 6, the component of ancestry information of QL, THY and TTMH was similar except YBZ. Notably, TTMH formed a cluster compared with the other locations, which corresponded with the results of NJ tree and PCA plotting. When the K value was 9, several individuals had the admixture ancestry To further reveal the genetic structure of 40 individuals, the cluster analysis was conducted (K = 1~9). The CV error showed the lowest value with K = 9 and indicated that the optimal cluster number was nine genetic clusters ( Figure 4B). When the K value was 3, the component of ancestry of most individuals from TTMH and YBZ was similar compared with QL and THY. When the K value was 6, the component of ancestry information of QL, THY and TTMH was similar except YBZ. Notably, TTMH formed a cluster compared with the other locations, which corresponded with the results of NJ tree and PCA plotting. When the K value was 9, several individuals had the admixture ancestry composition, such as THY-02, TTMH-08 and TTMH-10, which indicated that there could have been four different ancestries ( Figure 4A).

Frequency of SSRs in H. leucisculus
The study detected a total of 147,705 potential simple sequence repeats in 1,533,492 fragments ( Table 5). The unit size of SSRs varied from 1 to 6, including 25,576 (17.32%)

SSR marker development and characterization
The Primer3 V2.3.6 was used to develop SSR primer pairs using 147,705 sequences containing SSRs (Table 5). Finally, 22,651 pairs of primer pairs were successfully designed, and 17,211 locus-specific primer sets were retained after strict filtering.
In order to verify the value of these primers via PCR, 10 primers pairs were randomly selected and synthesized to substantiate in forty H. leucisculus individuals. Among the 10 primers, 6 pairs of H. leucisculus genomic DNA were successfully amplified by PCR, while the other 4 pairs of primers could not obtain PCR products (Table 6). Most successful primer pairs amplified expected size bands.
This proved that the ability of high-throughput sequencing identifies genes in nonmodel organisms [13]. These results will provide valuable tools for future genetic studies at H. leucisculus. Table 6. The messages of 6 working SSR primer pairs from 10 synthesized primers.

SSR Marker Development and Characterization
The Primer3 V2.3.6 was used to develop SSR primer pairs using 147,705 sequences containing SSRs (Table 5). Finally, 22,651 pairs of primer pairs were successfully designed, and 17,211 locus-specific primer sets were retained after strict filtering.
In order to verify the value of these primers via PCR, 10 primers pairs were randomly selected and synthesized to substantiate in forty H. leucisculus individuals. Among the 10 primers, 6 pairs of H. leucisculus genomic DNA were successfully amplified by PCR, while the other 4 pairs of primers could not obtain PCR products (Table 6). Most successful primer pairs amplified expected size bands. This proved that the ability of high-throughput sequencing identifies genes in nonmodel organisms [13]. These results will provide valuable tools for future genetic studies at H. leucisculus.

Discussion
RAD-seq technology includes the lower cost acquisition of DNA sequence data from multiple sites in the genome, simpler library preparation processes, the simplified management of genomic data and improved pipelines for data processing and analysis [14]. It can be used to identify and genotype large numbers of SNPs [15]. SNP refers to the DNA sequence polymorphism formed due to nucleotide variation. SNP has the characteristics of a large number, wide distribution and high diversity in biological individuals and plays an important role in the fields of life science and molecular breeding [16]. Though RAD-seq requires considerable sequencing effort per individual compared with ddRAD, the values of commonly used summary statistic, Pi, Tajima's D and F ST of RAD-seq has lower deviation than ddRAD [17,18]. In this study, forty H. leucisculus individuals were sequenced using RAD-seq and a total of 7,291,260 high-quality SNPs were detected, which provided the basis for molecular marker development, population differentiation and the invasion analysis of the Tarim River H. leucisculus in Xinjiang.
Genetic diversity research can reveal the status quo of species germplasm resources, genetic diversity and phylogenetic evolution of species [19]. Due to the lack of the genomewide sequence of H. leucisculus, the RAD-seq method was chosen to analyze the population genetic diversity of H. leucisculus in this study. According to Ichikawa et al. the genetic diversity of H. leucisculus populations in Tarim River was low (Pi < 0.005) [20]. Genetic diversity is one of the most important factors affecting the success of invasive species [21]. It is traditionally believed that rich genetic diversity will help alien species adapt to new habitats and establish and maintain new population sizes [22]. We examined the relevant information that the Pi of H. leucisculus in the Huaihe River basin was low, while that in the Yangtze River was high (Pi = 0.03289) [23,24]. The reason could be that compared with the Yangtze River basin, the unstable water environment in the Tarim and Huaihe River basins makes fish resources vulnerable to environmental changes, leading to a decrease in genetic diversity [21]. In addition, H. leucisculus may have been affected by the founder effect during the invasion of the Tarim River, resulting in lower genetic diversity than in the Yangtze River basin [25,26]. The F IS in populations can directly indicate the loss or surplus of heterozygotes. The F IS values for the four populations were greater than 0, indicating that the H. leucisculus population had different degrees of heterozygote loss. Combined with H o < H e , there was a certain degree of inbreeding. The F ST value has three levels of different genetic differentiation: moderate differentiation was 0.05 to 0.15, larger differentiation was 0.15 to 0.25 and high differentiation was greater than 0.25 [27]. In fact, RAD-seq has the drawbacks that missing data tends to inflate estimates of F ST [17]. Therefore, the genetic differentiation of H. leucisculus may be at a moderate level. Combined with low genetic diversity, the high F ST values (F ST = 0.243) among these populations may be caused by habitat fragmentation in the Tarim River basin [28][29][30]. In addition, we noticed that the genetic differentiation of H. leucisculus was lower in the lower reaches than in other regions. This could be that the upstream terrain is more rugged than the downstream. Rugged reaches promote genetic differentiation, and gentle reaches are more conducive to gene flow [31].
Combined with the results of NJ tree and PCA, the individuals from TTMH showed differences and had a bigger geographic distance than the others. The genetic differentiation pattern of fish is usually consistent with the water system pattern of distribution [32,33]. It could be explained by the reason that due to the large-scale diversion and interception of water in the middle and upper reaches of the Tarim River, the lower reaches of Tarim River have been cut off [28,34,35]. It was not until 2000~2006 that the Lower Tarim River Emergency Water Transfer transferred water from Bosten Lake to Taitema Lake that this phenomenon was alleviated [36,37]. Therefore, geographic isolation led to a clear differentiation of TTMH populations. In addition, the result of population structure analysis showed that the degree of mixed blood among individuals was high, indicating frequent gene exchange. This could be because Taitema Lake is located at the end of the Tarim River, and the alluvial plain formed by three water systems of Tarim River, Qarqan River and Altun Rivers, with frequent water exchange [35]. There was increased gene flow between populations during floods [38][39][40].
H. leucisculus is an invasive species in the Tarim River system of Xinjiang. According to the population structure analysis, it was inferred that the origin of the four sites was the same species. Compared with the others, TTMH had some genetic differences. Dong et al. showed that human impact index (HII) was the most important factor affecting H. leucisculus invasion among the determinants of H. leucisculus invasion [41]. The water diversion project transferred water from Bosten Lake to Taitema Lake. In 1958, Bosten Lake introduced Cyprinus carpio, Carassius auratus, Hypophthalmichthys molitrix and Hypophthalmichthys nobilis directly or indirectly from the Yangtze River basin for economic benefits and H. leucisculus and Bosten male fish in the pond dominated [42,43]. High natural diffusion and human translocation are conducive to gene flow and high genetic diversity among river basins, which might help retain the invasive potential [38]. Therefore, it is speculated that H. leucisculus from Bosten Lake entered Taitema Lake to breed and then expanded into the Tarim River.
SSR markers are one of the most widely used molecular markers and play an important role in genetic research. Traditional SSR development methods often needed to construct the genome enrichment library, and the hybridization screening and sequencing processes were time-consuming and laborious. Today, the development of high-throughput sequencing technology and the sharp reduction of cost create better conditions for the development of SSR markers, which have the advantages of short development cycle, high yield and excellent flexibility [44]. In recent years, the technology of developing SSR markers using RAD to simplify genome sequencing has become increasingly mature, and there are more and more reports concerning it. Qu et al. used RAD-Seq to perform simplified genome sequencing on Indonesian tigerfish (Datnioides Pulcher), and a total of 26,359 SSR loci were detected through sequence analysis [45]. Wang et al. identified a total of 466 983 SSR loci by analyzing the RAD simplified genome sequencing data of Pelteobagrus Wallis [46]. At present, there are few reports about the development of H. leucisculus SSR markers. In this study, 22,651 SSR primers were designed to meet the primer development conditions by analyzing the simplified genome sequence of H. leucisculus. These results provide a basis for the development of H. leucisculus new primers.

Conclusions
In the study, low levels of genetic diversity were observed among four populations in the Tarim River. Major genetic variations originated from within populations variation and moderate genetic differentiation among these populations was observed. The four populations were able to be separated into two clusters (upper-middle and downstream populations), and the individuals from TTMH clustered and had a bigger geographic distance than the others. There is the probability that the H. leucisculus from Bosten Lake entered Taitema Lake to breed and then expanded into the Tarim River due to the water diversion projects in location. In addition, 147,705 SSRs loci were detected and 22,651 SSR primer pairs were developed. This study provides valuable genetic information for the management of wild populations, marker-assisted selection and resource exploitation of H. leucisculus.
Author Contributions: S.S. and Z.H. conceived the study. Z.L., L.L. and X.L. collected the samples. S.S. analyzed the data and wrote the manuscript. Q.Z., B.H., D.L. and R.T. contributed greatly to modifying the manuscript. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: The raw genome sequencing datasets generated during the current study have been submitted to the NCBI Sequence Read Archive (SRA) and are available online at http://www.ncbi.nlm.nih.gov/bioproject/848648 (accessed on 30 June 2022).

Conflicts of Interest:
The authors declare no conflict of interest.