Insights from Initial Variant Detection by Sequencing Single Sperm in Cattle

: Meiotic de novo mutation (DNM) is one of the important phenomena contributing to gamete genome diversity. However, except for humans and a few model organisms, they are not well studied in livestock, including cattle. Moreover, bulk sperm samples have been routinely utilized in experiments, which include millions of single sperm cells and only report high-frequency variants. In this study, we isolated and sequenced 143 single sperms from two Holstein bulls and identiﬁed hundreds of candidate DNM events in ten sperms with deep sequencing coverage. We estimated DNM rates ranging from 1.08 × 10 − 8 to 3.78 × 10 − 8 per nucleotide per generation. We further validated 12 out of 14 selected DNM events using Sanger sequencing. To our knowledge, this is the ﬁrst single sperm whole-genome sequencing effort in livestock, which provided useful information for future studies of point mutations and male fertility. Our preliminary results pointed out future research directions and highlighted the importance of uniform whole genome ampliﬁcation, deep sequence coverage, and dedicated software pipelines for genetic variant detection using single-cell sequencing data.


Background
Recent breakthroughs in the development and application of single-cell sequencing technologies provide an avenue for dissecting population lineages and heterogeneity and understanding cell identity, differentiation, and function [1][2][3][4][5]. Single-cell DNA-seq (scDNA-seq) technologies produce data, which allow the detection of single nucleotide variants (SNVs) and short insertion and deletion (INDEL) variants, together with structural variations or abnormal chromosome numbers (aneuploidy) on the single-cell level [6][7][8].
Although small, sperm is one of the most important cells because it delivers the entire paternal genetic materials to the offspring. As novel mutations can occur during gametogenesis and postzygotically, studying mutations in sperm is particularly important for male fertility. Traditionally, bulk sperm sequencing can only detect high-frequency variants in the presence of millions of individual sperm cells. Recent genome and exome sequencing studies of parent-offspring trios have provided the first insights into the number and distribution of the de novo mutations (DNMs) [9]. In humans, DNMs have been shown to be a major cause of severe early-onset genetic disorders such as intellectual disability and autism spectrum disorder [10]. Recent studies have also shown that DNMs are predominantly of paternal origin and that their number increases with advanced paternal age [11]. By setting up a single sperm sequencing approach, Wang et al. reported 25 to 36 DNMs per sperm in humans [1].
As in other mammals, reproductive performance in cattle is also affected by paternal fertility. Although the fertility of artificial insemination (AI) bulls is monitored routinely by major organizations using microscopic examination of sperm count, motility, abnormality, and other laboratory tests, an understanding of the mutation mechanisms involved in sperm production and their impacts on male fertility is equally important and essential for the dairy industry and other livestock industries. In fact, the occurrence of novel mutations in each generation explains why these reproductively lethal disorders continue to occur in mammalian species [9].
In this study, we manually isolated, amplified, sequenced, and analyzed 143 single sperm genomes from two Holstein bulls. We identified hundreds of candidate DNM events in ten sperms with deep sequencing coverage by comparing them to the somatic genome. After validating selected events using Sanger sequencing, we estimated rates for DNMs. To the best of our knowledge, this is the first large-scale single sperm whole-genome sequencing report in livestock, which could facilitate future studies of point mutations and male fertility.

Sample Collection and Whole Genome Amplification and Sequencing
We randomly chose two Holstein bulls: Sample1 has a DPR (daughter pregnancy rate) PTA value of 0.0, reliability of 0.99, estimated from 6528 daughters. In contrast, Sample2 has a DPR PTA value of −3.2, reliability of 0.99, estimated from 15,314 daughters. Somatic tissue (ear punch) samples of Holstein Sample1, together with its parent somatic tissues, were donated by Select Sires, Inc. (Plain City, OH, USA). Semen samples were freshly collected by Select Sires, Inc. in its routine artificial insemination semen straw production. After receiving them under liquid nitrogen in USDA-ARS Animal Genomics and Improvement Laboratory (AGIL), we manually isolated a total of 156 sperm cells from two Holstein bulls (Sample1 with 73 sperm cells and Sample2 with 83 sperm cells). Briefly, isolated sperms were thawed in 37 • C water for 30-45 s and treated with 0.25% Trypsin-EDTA, followed by dilution with PBS + 1% BSA and washing twice. The sperms were further diluted to a proper resolution using PBS + 1% BSA on a Petri dish, and active single sperms were picked up manually by pipetting into a reaction tube under a micromanipulator as described previously [12]. Whole-genome amplification was performed on single cells according to the manufacturer's protocol, using the Single-Cell Whole-Genome Amplification Kit (Yikon Genomics, Shanghai, China) developed from the Multiple Annealing and Looping-Based Amplification Cycles (MALBAC) method [6]. In brief, a single sperm was initially analyzed and pre-amplified by primers supplied in the kit with 8 cycles with multiple annealing steps. PCR generated fragments with variable lengths at random starting positions for next-generation sequencing. We also sequenced the somatic diploid genomes of the trio, including Sample1 (Sample1-diploid) and its parents (Sample1-sire and Sample1-dam). Using their somatic ear punch tissues, we isolated their diploid genomes using a QIAGEN DNA extraction kit. DNA samples extracted from the donor and his parents' ear skin samples were then used to prepare sequencing libraries using standard Illumina protocol and sequenced on Illumina HiSeq 2000/NextSeq 500 sequencing platforms.

Filtration of SNPs, INDELs, and Samples
To improve the genotyping accuracy for single sperms, we applied a stringent cut-off on the raw genotyping quality score to call genotypes. We removed low-quality variants with quality by depth (QD) < 2, Fisher strand (FS) > 30, strand odds ratio (SQR) > 3, root mean square of the mapping quality (MQ) < 40, and quality score (QUAL) < 40. Using the VariantFiltration function in GATK, we defined the window size as 35 bp to evaluate clustered SNPs and allowed three SNPs to make up a cluster. For sperm data, we kept variants with at least two allele support reads and removed heterozygous (0/1) SNPs or INDELs because it was potentially caused by sequencing errors or sperm chromosomescale genomic anomalies [8]. As a result, 12 sperm samples were removed as their read depth was lower than 0.5 × (10 sperms) or genome coverage rate was lower than 10% (2 sperms). In addition, for diploid data, we filtered those variants with allele support reads less than 1/2 genome-wide depth.

De Novo Mutation Detection
The genotypes called by GATK from ten sperms of Sample1 were used to identify DNMs. To minimize the artifacts or sequencing errors, we required a genotyping quality score (QUAL) ≥ 100 and the number of supporting reads to be more than 3/4 genome coverage of each sample. We defined a candidate DNM in sperm when a distinct sperm genotype existed from the somatic homozygous genotype. We excluded signals from repetitive regions or with low alignment confidence. The DNM rate was calculated as the alternate allele calling frequency in genome sequence length.

Amplification and Sequencing of Cattle Mutations
We designed the PCR and sequencing primers using Primer-BLAST [17] based on the bovine ARS-UCD1.2 genome. All the primers used in the present study are listed in Table S8. The PCR amplification was performed with 25 µL reaction volume according to Taq DNA polymerase manufacturer's protocol (Taq PCR Master Mix Kit, Qiagen, Hilden, Germany), and the genomic DNA was amplified on a BioRad MyIQ thermocycler. The PCR cycle was as follows: initial denaturation at 95 • C for 5 min; followed by 25 cycles of 94 • C for 60 s, annealing at 56 • C for 60 s; primer extension at 72 • C for 60 s; and final extension at 72 • C for 10 min. All the amplified products were run in 1.5% agarose gel. After purification, DNA was sequenced using PCR primers at GENEWIZ (South Plainfield, NJ, USA).

Sequencing and Genotyping of Haploid Sperms
Sequencing of sperms: We amplified and sequenced a total of 156 single sperm cells manually picked from two Holstein bulls' semen using the MALBAC method [12]. After quality control filtering and mapping with BWA, 143 sperm data (71 for Sample1 and 72 for Sample2) were kept for downstream analyses. The sequenced sperms had an average of 1.79 × genome coverage, and 16 of them were at~4 × genome coverage, achieving an overall coverage of~11.40% to~41.35% of the genome, respectively (Table S1). On average, 98.18% of sequencing reads from single sperms were mapped on the bovine ARS-UCD1.2 genome.
Genotyping of sperms: We used GATK to call the raw genotypes for single nucleotide polymorphisms (SNPs) and INDELs. Each sperm yielded raw calls for 15.5-43.0 million SNPs and 2.4-7.2 million INDELs (Table S2). As sperms are haploid cells, extensive heterozygous genotype calls are considered anomalies and thus removed. We only detected a small fraction of heterozygous raw calls with an average frequency of 2.46% (ranging from 1.03% to 7.39%) for SNPs and 2.97% (1.03% to 9.16%) for INDELs, respectively. These indicated that most of the sperms were isolated successfully with low contamination before sequencing. After QC, 0.42 to 2.68 million for SNPs and 0.23 to 1.04 million for INDELs were kept.

Sequencing and Genotyping of Sample1 Diploid Somatic Genome
For Sample1 somatic diploid genomes, we sequenced bulk DNA samples extracted from ear punches of Sample1 to approximately 40 × genome coverage, with over 99% genome mapping rate and covering 96% genome sequence (Table S3) (Table S4).

De Novo Mutations Detected in Single Sperms
We estimated the DNM rates from bulls to sperms based on genotyped SNPs. From ten Sample1 sperms with deep sequencing coverage, we detected 955 candidate DNM sites by comparing them to their somatic genome. After removing 501 sites in repeat/segmental duplication regions and 74 sites in 100 bp clusters (39 overlapped with repeat/segmental duplication region), we found 419 DNM sites. Their statistics and potential effects are summarized in Table 1. On average, each chromosome contained 1.54 (0.5 to 2.5) mutations ( Figure 1). The most frequent mutations were A-G, C-T, G-A, and T-C with the count of 7.5, 7, 6.5, and 5.5 on the median, respectively (Figure 2A). Within a chromosome, DNM distribution was generally even ( Figure 2B). These 419 DNM sites led to an average mutation rate of 1.68 × 10 −8 per nucleotide per generation in each sperm, with a range of 1.08 to 3.78 × 10 −8 per nucleotide per generation ( Figure 3A, Table 1, Tables S5 and S6). Mutations were also enriched in the AT cluster region ( Figure 3B). Interestingly, we detected 15 DNMs in more than one sperm. Three sites at chr7:66596324, chr18:12015775, and chr29:439948 occurred in 7, 5, and 4 sperms. Of those 419 high-confidence DNMs, 190 sites overlapped 188 genes and 432 transcripts, with 15 sites mapped to 15 exons, six sites mapped to ten 3 -UTRs for different transcripts, and eight sites mapped to 16 coding sequence (CDS) regions for different proteins (Table S7).

PCR Sanger Sequencing Validation of De Novo Mutations
To confirm the candidate DNMs predicted from high-throughput sequencing, we performed PCR-Sanger sequencing. Using 20 primer pairs, we were able to obtain reliable results from 14 regions using PCR-Sanger sequencing (Table S8), along with the six positive control results derived from the sequenced cow (Dt, i.e., Hereford cow named "L1 Dominette 01449") bulk DNA. The Sanger sequencing results from 12 regions confirmed our recurrent DNM calls (12/14 = 85.71%), therefore largely excluding the possibility of errors during PCR amplification, high-throughput sequencing and read mapping. Because these loci are inconsistent with the Mendelian inheritance rules, we successfully validated that they were DNM by using independent PCR Sanger sequencing.

Discussion
Ideally, scDNA-seq could provide information about all variants that occurred in a single cell, such as SNV and INDEL variants, together with structural variations. However, all existing whole-genome amplification (WGA) methods introduce errors and amplification biases and even complete dropouts of variant alleles [2,6,18]. Current single cell-specific SNV callers, like Monovar, SCcaller, and SCAN-SNV, were developed to deal with these errors and missing data [19][20][21]. They often incorporate amplification error rates and allele dropout in their models, take advantage of enhanced signals from multiple single cells and imputation of missing alleles, and integrate with different data types like bulk sequencing or RNA sequencing. However, a systematic comparison of these tools is still lacking, and most of them were developed for processing scDNA-seq generated from somatic cells in cancer research. We are not aware of any turnkey scDNA-seq variant calling pipeline available for DNM discovery in sperm. Another disadvantage of studying gamete cells like sperms compared to somatic cells is that recombination occurs during meiosis, introducing a major obstacle.
In their pioneer study of single human sperm, Wang et al. plotted the allele discordance ratio of sperm MDA products against the somatic genome. They found that a peak at 100% discordance illustrated a distinct group of loci violating the amplification errors model [1]. After excluding signals from repetitive regions or with low alignment confidence, they obtained 25-36 candidate DNMs in each sperm cell. They further validated 16 out of 18 selected DNMs using independent PCR-Sanger sequencing.
Using a similar approach, we generated 143 single sperm sequencing data and tentatively determined DNM rates ranging from 1.08 to 3.78 × 10 −8 per nucleotide per generation in ten cattle sperm genomes with deep coverage, corresponding to 27-94 candidate point mutations per sperm. We also successfully validated 12 out of 14 selected recurrent DNMs using independent PCR-Sanger sequencing. Although there is a possibility that false-positive DNMs could be created during the sperm DNA amplification's first cycle, it is rare to detect the same mutation to occur at the same position for different sperms. Therefore, our high confirmation rate (85.71%) of the recurrent DNMs argues against this possibility. For example, a C to A mutation at chr7:66596324 was validated in 5 independent sperms except in WGA33 (Table S7). Agreeing with the abovementioned human result [1], Sample1 s mutation rate (1.08 to 3.78 × 10 −8 per nucleotide per generation) is higher than that derived from pedigree sequencing data (~1 × 10 −8 per nucleotide per generation) [22]. However, it is in line with evolutionary studies, suggesting more mutations in males than in females, possibly due to the larger number of germline cell divisions in males [23]. In our ten individual sperm cells, their respective mutation levels were broadly consistent ( Figure 3A). Within each cell, most mutations reside in intergenic or intronic regions (Table 1). However, we did find one to two mutations affecting the coding sequence. The transition-to-transversion ratios of Sample 1 mutations varied from 1.08 to 3.27, compared to a population average of 2.1. The main reason for more transition than transversion is generally thought to be deamination of methylated cytosine, primarily at CpG and potentially in other sequence contexts.
When a DNM arises during the terminal differentiation of sperm, it will rarely be detected to be recurrent. If it arises in proliferating spermatogonial stem cells, it can be detected in multiple sperms [24]. A DNM can also arise early during paternal embryonic development, before primordial germ cell (PGC) specification, or late within the PGC population, causing mosaicism in sperms, and all of them be detected as recurrent events [25]. We detected 15 recurrent mutations in more than one sperm. Those recurrent DNMs might occur during paternal embryonic development, before PGC specification, or within the PGC population, as well as later within spermatogonial stem cells [25].
Limitations and future directions: As discussed above, much progress is needed before turnkey software pipelines can routinely make reliable calls for DNM from single sperm sequencing. As the sequencing coverage was critically low, DNMs reported here are probably less reliable. Furthermore, our sperm number was not large enough to draw general conclusions. In summary, we tried to generate single sperm whole-genome sequencing data and detected occurrences of DNM in cattle. In the meantime, our results also highlighted the importance of uniform whole genome amplification, deep sequence coverage, and dedicated software pipelines for variant detection. To our knowledge, this is the first single sperm sequencing attempt in livestock, which could open the door for studying point mutations and male fertility.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/dairy2040050/s1, Table S1: Statistics of sequencing data for each sperm. Table S2: Statistics of genotyping data for each sperm. Table S3: Statistics of sequencing data for Sample1's somatic genome. Table S4: Statistics of genotyping data for Sample1's somatic genome. Table S5: De novo mutation sites from Sample1 to sperms. Table S6: De novo mutation rate of ten sperms. Table S7: Gene feature annotation of de novo mutation sites from Sample1 to sperms.

Data Availability Statement:
The data that support the results of this research are available within the article and its Supplementary Information Files. All other sequence data can be tracked in Supplemental Files. The single sperm sequencing data were submitted to GEO under the accession number PRJNA691741.

Acknowledgments:
We thank Reuben Anderson, Ki-Eun Park, Adam Oswalt, Bhanu P. Telugu, and Charles G. Sattler for their technical assistances and sample donations. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the US Department of Agriculture. The USDA is an equal opportunity provider and employer.

Conflicts of Interest:
All other authors declare that they have no competing interests.
Disclaimer: Ethics approval and consent to participate. The need for ethics approval was waived as the current study did not involve whole animals.