De novo QTL-seq Identifies Loci Linked to Blanchability in Peanut (Arachis hypogaea) and Refines Previously Identified QTL with Low Coverage Sequence

Blanchability is an often overlooked, but important trait for peanut breeding. The process of blanching, removing the skin, is an important step in the processing of raw nuts for manufacturing. Under strong genetic control and requiring considerable effort to phenotype, blanchability is conducive for marker-assisted selection. We used QTL sequencing (QTL-seq) to identify two QTLs related to blanchability using previously phenotyped breeding populations. To validate the QTLs, we show that two markers can select for significantly increased blanchability in an independent recombinant inbred line (RIL) population. Two wild introgressions from Arachis cardenasii conferring strong disease resistance were segregated in the population and were found to negatively impact blanchability. Finally, we show that by utilizing highly accurate sequence analysis pipelines, low coverage sequencing can be used to genotype whole populations with increased power and precision. This study highlights the potential to mine breeding data to identify and develop useful markers for genetic improvement programs, and provide powerful tools for breeding for processing and quality traits.


Introduction
Quantitative trait locus sequencing (QTL-seq) is a genetic analysis that takes advantage of the ever-decreasing cost in whole genome sequencing (WGS). The two seminal papers describing the simple approach of bulk segregant analysis were first published in 1991 [1,2]. The markers being used were restriction fragment length polymorphism (RFLP) or random amplified polymorphic DNA (RAPD), and in the absence of genome sequences, identifying new markers was difficult within mapping populations. After an interval was initially mapped to control a trait of interest, it was not possible to fine map the interval further in the absence of polymorphic markers. The method of bulk segregant analysis (BSA) rested on the hypothesis that markers linked to the trait of interest will segregate with the phenotype in a segregating population. The individuals exhibiting the extreme tails of the phenotypic distribution will be differentiated by only those markers that were physically within the genomic region of interest. In the case of the original methods, random oligos could be screened in the bulked DNA until polymorphic markers were found. This was how new markers could be identified and placed on the genetic map to fine map the trait to a higher resolution. precludes selection at early generations. Current uniform trials in the United States do not test for blanching percentage (https://www.ars.usda.gov/southeast-area/dawson-ga/ national-peanut-research-laboratory/docs/uniform-peanut-performance-tests-uppt/, accessed on 28 October 2021). Despite not having information on the blanching percentage of different cultivars, the cost savings for manufacturers would be substantial. After blanching, any unblanched kernels are discarded as waste, sold as less valuable products, or crushed for oil. The combination of strong genetic control, small G × E effect, and laborious measurement makes blanching an excellent target for marker assisted selection (MAS) in peanut [28,29].
Wild species represent a critical source for alleles that confer strong resistance to disease and tolerance to stress, and peanut is no exception [30][31][32]. Three introgressions from Arachis cardenasii have recently been discovered to contribute strong resistance to important fungal diseases [17,33,34]. These introgressions have been so successful that they were shared all over the world by breeders as "silent" sources of strong resistance and were just recently discovered [35]. The introgressions, two on chromosome A02 and one on chromosome A03, are physically not large, but contain many genes. There is potential linkage drag associated with them that manifests in unexpected ways.
In this study, we use a novel pipeline, Khufu, to de novo analyze a new dataset focused on blanching in peanut. We identify two QTL controlling blanching percentage in peanut. We validate those QTL in an independent recombinant inbred line (RIL) population. We also investigate the effect of Arachis cardenasii introgressions on blanching and these QTL. Finally, we benchmark Khufu on previously published datasets in peanut and discuss the potential for genotyping applications in crops with complex genomes.

Blanching QTL-seq and Validation
Good and poor blanching lines were selected from data generated in Wright et al., 2018. Each line was extracted for DNA individually and quantified using a Qubit fluorometer. Equal concentrations of each were pooled into 'good blanching' and 'poor blanching' DNA pools. The pools were sequenced by Novogene using Illumina HiSeq 2 × 150 bp sequencing. After filtering, a total of 193,712,311 reads from the good blanching pool were mapped (100%; 94.9% properly paired) and 248,926,026 reads from the poor blanching pool (100%; 94.9% properly paired). Analysis of sequenced bulks was carried out using the Khufu pipeline (https://www.hudsonalpha.org/khufudata/, accessed on 28 October 2021).

Development of RIL Population and Marker Validation
A recombinant inbred line (RIL) population was developed by crossing Middleton x Sutherland cultivars, comprising of 406 individual lines (RILs) at the Peanut Company of Australia (PCA), Kingaroy, Queensland, Australia. Middleton is a reliable high yielding commercial cultivar and has been used extensively in the Australian Peanut Breeding Program (APBP). The pedigree of the resistant parent, Sutherland, is (B123 (F1) × CS22) × D45-p37-102, and is tolerant to multiple foliar diseases; late leaf spot (Phaeoisariopsis personata), peanut rust (Puccinia arachidis) and web blotch [36]. Hybridization between parental lines was performed in 2014 and the single seed descent (SSD) method was employed under field conditions from the F2 to the F6 generation. Extracted DNAs from 406 RIL lines were used for marker validation. Blanching percentage was performed as described in Wright et al., 2018 [28]. Kompetitive allele specific PCR (KASP) markers were designed from the most significant SNPs from the QTLs on B11 and A06 (Table S3). Two methods were used to assess the significance of each identified locus to blanching percentage. The blanching data were transformed into ranks and ANOVA's were run using each marker alone and the combination of both markers in R. LSMeans for each marker and combinations of them were calculated using the lsmeans package in R. An additional test using a Kruskal-Wallis non parametric test was performed using Kruskal.test in R.

Analysis of Published Datasets
Sequences from published datasets were downloaded from public databases (NCBI SRA and DDBJ Bioproject). Raw fastq files were not processed further and were subjected to the Khufu pipeline as downloaded.

Post-Khufu Filtering
We implemented a shiny app to facilitate final polishing post-Khufu processing. It can be freely accessed here: https://w-korani.shinyapps.io/khufu_var2/ (accessed on 28 October 2021). This allows filtering in real time so the user can identify trends within potential noisy groups of SNPs. There are minimum and maximum depth filters. Low depth loci estimate allele frequencies poorly and loci with depths more than twice the estimated read coverage can signal regions of paralogous read mapping. The minor allele frequency filter was first suggested by Takagi et al., (2013) [10] as a mechanism to filter erroneous base calls if a particular alternative allele was in low frequency in both bulks. We implemented an "interval" filter which filters out potential SNPs clustered closely together (<100 bp). This indicates repetitive content and reads from many loci overlapping one locus.
The shiny app takes as input R objects that are created as output from the Khufu analysis pipeline. In the supplement we have included the R objects for the full Blanching data set, and the down sampled 5×, 10×, and 20× sets.

Blanchability QTL-seq
To identify QTLs associated with blanchability in peanut, we sequenced two bulks of breeding lines from three populations with related parents ( Table 1). The parents with high blanching percentage were Walter and Redvale and the low blanching parents were Sutherland and a sister line of Sutherland [28]. Redvale is a selection from a cross with Walter as a parent. We first analyzed the data by calling parent-specific SNPs by re-sequencing Walter, Redvale, and Sutherland and identifying SNPs where Walter and Redvale shared an allele and Sutherland had an alternate allele. We used these SNPs to analyze the QTL-seq dataset and identified two QTLs on chromosomes A06 and B11 ( Figure 1A). Markers were developed from the most significant SNPs within the identified peaks on A06 and B11. We next used a recombinant inbred line (RIL) population to validate the putative QTL. The parents of the population are Sutherland (poor blancher; 74.5%) and Middleton (good blancher; 89%). Middleton is not related to Walter or Redvale and so represents a good source to test the markers in unrelated populations where blanching is segregating. Both markers could select for increased blanching percentage in an unrelated RIL population, with A06 having a stronger effect (F = 6.73; df = 176; RANK ANOVA p = 0.0002; Kruskal p = 0.0003; Effect 9.8%; Median = 73.4%; R2 = 10.3%) than B11 (F = 3.5; df = 176; RANK ANOVA p = 0.01; Kruskal p = 0.005; Effect 8.03%; Median = 74%; R2 = 5.6%). The two markers combined were additive (F = 5; df = 176; RANK ANOVA p < 0.0001; Effect 19.7%; Median = 76.9%; R2 = 14.8%). The lsmean estimate when both alleles are beneficial is 79.3%, and when both alleles are non-beneficial is 59.6%. Selecting with the two markers shifts the median blanching percentage from 62.8% to 83.9% (21% increase).
We used Khufu (https://www.hudsonalpha.org/khufudata/, accessed on 28 Ocotber 2021) to re-analyze the blanching QTL-seq dataset using de novo discovered SNPs from the bulk data ( Figure 1B). The QTL on A06 has a stronger signal overall with the smoothed average above 0.6 with Khufu compared to 0.5 using parental sequence. The QTL on B11 is much stronger with the peak smoothed average reaching 0.85 compared to 0.65 using parental data. The results support that QTL-seq can be analyzed straight from bulk sequence even in crops with complex genomes, without the need to generate parental sequence when suitable informatic pipelines are utilized.

Linkage Drag from Wild Introgressions
Sutherland is known to have introgressions on chromosomes A02 and A03 derived from Arachis cardenasii [35]. The lines that were chosen from QTL-seq should have inherited these introgressions independent of their blanching percentage. However, there were two peaks on chromosome A02 that corresponded to the introgressed regions ( Figure 2A). The RIL population used for validation was also genotyped using a 10 SNP chip that assayed the introgressions. Initially, we used only those lines without introgressions in the RIL population for validation (Figure 1; 180 lines). On their own, the introgressions on the top of A02 and the bottom of A03 significantly affect blanching percentage ( Figure 2B and Table 2). Selecting against the introgressions raises the mean and median blanching percentage by 10.7% and 11.6%, respectively (Table 2 and Figure 2C). If the introgressions are not taken into account and selection in the population is just with the A. hypogaea QTL identified in this study, the effect is lessened. Finally, the QTL on B01 and A06 only have a small effect if selecting in a population that has both introgressions fixed (Table 2 and Figure 2D).

Improving QTL-seq Analysis in Polyploid Crops
There are two current methods that are employed to analyze QTL-seq data in polyploid crops, with a special focus on peanut. One is to use a specialized pipeline to identify high quality SNPs from the parental sequence, as in Cui et al., (2020) [26] and Clevenger et al., (2018) [18]. The other is to use parental sequence to generate population-specific

Improving QTL-seq Analysis in Polyploid Crops
There are two current methods that are employed to analyze QTL-seq data in polyploid crops, with a special focus on peanut. One is to use a specialized pipeline to identify high quality SNPs from the parental sequence, as in Cui et al., (2020) [26] and Clevenger et al., (2018) [18]. The other is to use parental sequence to generate populationspecific reference genomes and then analyze the QTL-seq data using those modified genomes while filtering for high quality, uniquely mapping reads, and depth. Both methods were successful in identifying significant QTL, although the noise in the data obscured the genetic structure of the loci. Further, sequencing parents to high coverage depth incurs additional substantial cost and requires the use of biparental populations where the parents are known and available.
Cui et al., (2020) [26] used QTL-seq to identify QTLs contributing to stem rot resistance in the field in peanut. From stem rot bulks sequenced to high depth (>40 × genome coverage), there were two QTLs identified on chromosomes A05 and A01 that could explain about 21% of the variance observed. The strategy used in Cui et al., (2020) [26] was to identify putative parental SNPs from the parental sequence (using the polyploidspecific pipeline SWEEP [18]) that were then used in the bulks to calculate allele frequency. Using Khufu, we analyzed the same bulk sequence de novo and identified the peaks on chromosomes A01 and A05 ( Figure S1). Khufu identified two additional peaks on chromosomes A04 and A07 that were not identified previously ( Figure S2). We developed kompetitive allele specific PCR (KASP) markers to assay the new QTL on A04 in the validation populations used in Cui et al., (2020) [26]. These markers were able to increase the ability to select for stem rot resistance ( Figure S2). The QTL on A07 was also identified in Luo et al., (2020) [37].
To test the feasibility of using lower coverage sequence data, we re-analyzed a QTL-seq dataset focused on late leaf spot resistance in peanut where the bulks were sequenced to approximately 10× genome coverage [18]. Khufu successfully identified the three QTL de novo from bulk sequence that were validated in Clevenger et al., (2018) [18] and Chu et al., (2019) [38] ( Figure S3). The QTL on B03 (on Arachis hypogaea known as B13) has a clear peak in the Khufu analysis indicating potential to develop more tightly linked markers to resistance.
Pandey et al., (2020) [39] used QTL-seq to identify QTLs associated with fresh seed dormancy in peanut. Using Khufu, we identified those same QTLs de novo from the bulked sequence ( Figure S4). As was the case with the stem rot dataset, the Khufu peaks had a higher smoothed average on chromosome B05, reaching 0.75 in the region from which Pandey and colleagues (2020) [39] developed a marker to select for the trait. Zhao and colleagues (2020) [20] used QTL-seq to identify a candidate gene for purple testa color in peanut on chromosome A10 using the parental sequence to analyze the bulk sequence data. Without the parental sequence, Khufu identified the same QTL and the predicted peak region (107,280,934 to 110,087,506) contained the most likely candidate gene which was functionally shown to control anthocyanin accumulation in tobacco [20] (Figure S5). Bertioli et al., (2020) [40] estimated that the rate of polymorphism between two peanut accessions is, on average, 1 SNP per 10,000 bp by comparing two reference-quality genome sequences. One advantageous aspect of QTL-seq is the theoretical ability to have access to all SNP polymorphisms segregating in the population. The result is that the SNP or group of SNPs that are most closely linked to the functional variation (or are the functional variation) are available. In contrast, the probability that a SNP array with fixed markers includes the most closely linked SNPs is low and decreases with low throughput marker types. Khufu identified 244,329 SNPs in the dormancy dataset (1 SNP per 10,395 bp), 372,611 SNPs in the stem rot dataset (1 SNP per 6816 bp), and 225,029 SNPs in the purple testa dataset (1 SNP per 11,287 bp). The number of SNPs Khufu filtering identifies de novo from the bulk sequences is within the estimate of polymorphic SNPs between any two A. hypogaea accessions. The filtering procedure of Khufu should be applicable for other genotyping applications using short reads in crops with complex genomes.

Potential for Genotyping Using Skim Sequencing
For crops with large, complex genomes, genotyping large populations still relies heavily on fixed SNP arrays. These arrays are incredibly useful, yet suffer from ascertainment bias. Reduced representation methods allow de novo SNP discovery yet only access a small portion of the genome. The accuracy of profiling SNPs within bulked samples should be useful for genotyping applications using sequencing. As a reference set of SNPs, we identified variants by comparing two published peanut genomes, Tifrunner [41], and Shitouqi [42]. Using Illumina sequencing of the two genotypes that were sequenced, we called SNPs using Khufu at different depths of coverage (Table S1). Across all levels of depth, Khufu calls alleles with more than 99% accuracy (according to the allele in the reference genome sequence). Above 10× coverage, Khufu identifies more than 95% of the SNPs. At lower coverage, Khufu identifies more than 71% at 5× coverage and 46% at 3×. We tested if it were possible to identify de novo SNPs in the parents of a population sequenced at 10× coverage and then call alleles in the progeny sequenced at 1× or even 0.5× coverage (Table S2). At 1× coverage, 27% of the possible SNPs could be called at 100% allele accuracy. At 0.5× coverage, 14% of possible SNPs could be called at 100% allele accuracy. With an estimate of 1 SNP per 10 kilobases polymorphic between any two peanut accessions, we can expect to call an average 67,000 SNPs with 1× coverage and 37,500 with 0.5× coverage. This represents genome-wide distribution of markers.
Finally, we analyzed a RIL population that was genotyped using WGS re-sequencing [43]. Using 5× coverage WGS of the progeny, Agarwal and colleagues mapped 11,106 SNPs using SWEEP_GOLD (https://github.com/w-korani/SWEEP_GOLD, accessed on 28 Ocotber 2021). Using Khufu, we identified and called 86,987 SNPs in the progeny after down sampling to <3× coverage. Analysis of physical maps color-coded for parental allele illustrates the accuracy of Khufu at low coverages ( Figure S6).

Future Considerations for QTL-seq Using Low Coverage Sequence Data
Analysis of sequence data in tetraploid peanut has been difficult and relied on specialized pipelines [25,44]. Accurate analysis of low coverage sequencing data for a polyploid crop such as peanut has not been viable until now. We were intrigued by the possibility of thinking about QTL-seq in a different way. The process of sequencing a bulk of 20 individuals to 20× genome coverage is akin to computationally bulking 20 fastq files that were sequenced to 1× genome coverage. If then, in a population of 300 individuals that was segregating strongly for multiple traits, each individual was sequenced to 1× coverage, QTL-seq could be carried out for all of the traits at once. We tested this concept using the RIL population from Agarwal et al., (2019) [43]. We used bulks representing 10, 20, 30, and 40% of the population (Figure 3). We successfully identified the TSWV (tomato spotted wilt virus) QTL on chromosome A01 with high resolution. The signal was most strong with the smaller bulks, but the variance and noise were very low when bulking 40%.
Agronomy 2021, 11, x FOR PEER REVIEW 10 of 13 Figure 3. in silico bulking and QTL-seq analysis of TSWV resistance. A major QTL is identified on chromosome A01 (top). We tested bulking 10%, 20%, 30%, and 40% of the population (left to right). The bottom shows the same bulk percentages for a chromosome with no signal (A02). Notice the noise decreases significantly as bulk% increases.

Conclusions
Blanchability is an important processing trait for peanut. There is a considerable energy requirement and significant additional costs to remove seed coats from poor blanching varieties as a result of lost kernels during processing and sorting. In contrast, if the processing application requires skin, it would be beneficial to utilize varieties that are resistant to blanching. As a breeding target, blanchability presents a challenge in that it is labor intensive and requires a large seed input which precludes testing at early generations. These considerations make blanchability an excellent target for marker assisted selection. In this study, we identified two strong QTL for blanchability and validated them in an independent population. Markers selecting for these QTL are available for breeding programs. We discovered linkage drag associated with two prominent A. cardenasii introgressions that confer disease resistance and also increase blanching resistance. Although for many applications this would be deleterious, it presents a unique opportunity to develop disease resistant and blanching resistant cultivars that are uniquely suitable for confectionery. Finally, we highlight the power of low coverage sequencing of whole populations to carry out QTL-seq experiments instead of using sequenced bulks.  We tested bulking 10%, 20%, 30%, and 40% of the population (left to right). The bottom shows the same bulk percentages for a chromosome with no signal (A02). Notice the noise decreases significantly as bulk% increases.

Conclusions
Blanchability is an important processing trait for peanut. There is a considerable energy requirement and significant additional costs to remove seed coats from poor blanching varieties as a result of lost kernels during processing and sorting. In contrast, if the processing application requires skin, it would be beneficial to utilize varieties that are resistant to blanching. As a breeding target, blanchability presents a challenge in that it is labor intensive and requires a large seed input which precludes testing at early generations. These considerations make blanchability an excellent target for marker assisted selection. In this study, we identified two strong QTL for blanchability and validated them in an independent population. Markers selecting for these QTL are available for breeding programs. We discovered linkage drag associated with two prominent A. cardenasii introgressions that confer disease resistance and also increase blanching resistance. Although for many applications this would be deleterious, it presents a unique opportunity to develop disease resistant and blanching resistant cultivars that are uniquely suitable for confectionery. Finally, we highlight the power of low coverage sequencing of whole populations to carry out QTL-seq experiments instead of using sequenced bulks. The same data analyzed de novo with Khufu (right); Figure S6: Representative physical map of 133 RIL individuals. Shown is chromosome A01. Red represents parent 1 allele, yellow represents heterozygous, and blue represents parent 2 allele. White is missing data. Individuals are clustered by similarity for visualization. Alleles are unfiltered out of the Khufu pipeline other than missing data < 25%. Markers in order of physical position are rows and individuals are columns; Table S1: Accuracy of allele calls using Khufu at different depths. Using SNPs identified from comparing reference genome sequences as a validation set, Khufu was used to call alleles using Illumina data at different coverages; Table S2: Polymorphic SNPs were called between Tfrunner and Shitouqi using 10× coverage Illumina sequence. To simulate "progeny", we downsampled the sequence to 0.5× and 1× coverage and called alleles based on the "parental" SNP sites; Table S3: KASPar marker sequence for Blanching percentage.
Author Contributions: W.K. and J.C. analyzed published datasets, developed Khufu, and drafted the manuscript. W.K. designed and wrote the code for the shiny application and for all modules of Khufu. Y.C. and P.O.-A. validated QTL, and revised the manuscript. C.C. and C.B. provided technical lab support. D.O. conceived and carried out the blanching experiments. B.G. provided some resources for validation. J.C. and G.W. conceived and supervised the project, secured funding, and revised the manuscript. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: R object files of blanching data of the full set, and 20X, 10X, and 5X subsets are included in the supplemental materials. Raw data fastq sequences of high and low bulks of blanching data are deposited at the NCBI (http://www.ncbi.nlm.nih.gov/, accessed on 28 Ocotber 2021) under BioProject PRJNA714121.