Mapping-by-Sequencing Reveals Genomic Regions Associated with Seed Quality Parameters in Brassica napus

Rapeseed (Brassica napus L.) is an important oil crop and has the potential to serve as a highly productive source of protein. This protein exhibits an excellent amino acid composition and has high nutritional value for humans. Seed protein content (SPC) and seed oil content (SOC) are two complex quantitative and polygenic traits which are negatively correlated and assumed to be controlled by additive and epistatic effects. A reduction in seed glucosinolate (GSL) content is desired as GSLs cause a stringent and bitter taste. The goal here was the identification of genomic intervals relevant for seed GSL content and SPC/SOC. Mapping by sequencing (MBS) revealed 30 and 15 new and known genomic intervals associated with seed GSL content and SPC/SOC, respectively. Within these intervals, we identified known but also so far unknown putatively causal genes and sequence variants. A 4 bp insertion in the MYB28 homolog on C09 shows a significant association with a reduction in seed GSL content. This study provides insights into the genetic architecture and potential mechanisms underlying seed quality traits, which will enhance future breeding approaches in B. napus.


Introduction
Rapeseed (Brassica napus, AACC, 2n = 38) is the second-most important oil crop after soybean [1]. B. napus is a recent allopolyploid species formed by hybridization between Brassica oleracea and Brassica rapa followed by chromosome doubling approximately 7500 years ago [2]. In addition to its nutritionally beneficial seed oil composition, B. napus has the potential to serve as a high-quality protein source. Protein-rich meals are a valuable "by-product" of oil extraction and can be used as a viable source for plant protein due to their high quality and production volume [3]. The increasing demand for vegetable protein and oil requires breeding efforts in order to enhance the yield of protein and oil in B. napus. Dry mature seeds are composed of oil (45-50% w/w) and protein (20-25% w/w) [4]. Lipids are stored in the form of triacylglycerols (TAGs) in oil bodies, while seed storage proteins In this study, genomic intervals associated with SPC, SOC, and GSL content in B. napus were analyzed via MBS using a large segregating F2 population. This population was derived from a cross of the B. napus winter-type cultivars Lorenz and Janetzkis Schlesischer. Furthermore, candidate genes were identified by incorporating transcriptomic data sets. Association and gene expression studies indicated that a 4 bp insertion located in a MYB28 homolog on chromosome C09 is a major factor controlling seed GSL content. Sequence variants identified in here will facilitate the development of genetic markers for breeding programs in B. napus.

Plant Material and Trait Measurement
The phenotypically segregating F2 population, designated L-x-JS, consists of 2323 individuals and was derived from a cross between the parental lines Lorenz (P1) and Janetzkis Schlesischer (P2), both are B. napus winter-type rape varieties. Janetzkis Schlesischer (DOI: 10.25642/IPK/GBIS/288477) has a high seed GSL content of~90 µmol/g FW and contains erucic acid. Lorenz is listed as a variety for diversity with the accession number RAW 2152 (https://pgrdeu-preview.ble.de/tsorten/steckbrief/id/551533 (accessed on 11 December 2020) and displays 00-quality. It has medium-high grain yields, high oil content and low GSL content (maintaining institute: Norddeutsche Pflanzenzucht Hans-Georg Lembke KG, DE005). A total of 1373 F2 individuals were planted in Granskevitz alias growing area 2 (SPC_A2) (GE, GPS: 54.526908 • , 13.21998 • ) and 948 in Asendorf alias growing area 1 (SPC_A1) (GE, GPS: 52.7724145 • , 9.0044643 • ). The plants were grown in accordance with German legislation. In total, 1951 F2 individuals of the L-x-JS F2 population were used for the genotype and phenotype analyses for seed GSL content, while 2315 individuals were used for SPC due to the higher variance of SPC.
Seeds were collected for the GSL, SPC and SOC measurements via near-infrared reflectance spectroscopy (NIRS) and analyzed in triplicates. The NIRS measurement was carried out with intact-seed samples. The measured trays were designed for high-throughput measurement of oil seed rape. Each tray requires a seed volume of about 2 cm 3 . The samples were scanned by a Polytec PSS-2121 diode array spectrometer (Polytec GmbH, Waldenbronn, Germany) with 256 pixels. Reflectance was measured in the range from 1100 to 2100 nm with a step size of 2 nm recorded with the software PSSHOP (Polytec) using DSV internal calibration. Calibration and validation procedures were carried out with several Software packages (SensoLogic GmbH, Norderstedt, Germany). Calibration performance was verified periodical with independent validation sets.
Individuals for sequencing were selected based on phenotypic data, DNA quality, and cultivation location. For the GSL pools, individuals grown in SPC_A1 and SPC_A2 were used to build the high-and the low-GSL pools. For the SPC pools, individuals grown in SPC_A2 were used to build the high and the low pool.

DNA Extraction and Pooling
Genomic DNA was extracted from leaf disks using the CTAB method [50]. The low-GSL pool consisted of 38 genotypes (GSL_L, <30.83 µmol/g dry weight), while the high-GSL pool contained 52 genotypes (GSL_H, >70 µmol/g dry weight).For growing area 2 (SPC_A2), 22 genotypes were used for the low protein pool (SPC_L_A2 low pool, <16.0% total dry mass) and 19 genotypes for the high protein pool (SPC_H_A2 high pool, >23.1% total dry mass). Library preparation and pooling strategy was performed as described before [41]. The GSL pools were sequenced on a HiSeq1500 in high-output mode using four lanes and the 2 × 100 PE scheme, while the SPC pools were sequenced on a HiSeq1500 in rapid mode using two lanes and the 2 × 150 PE scheme. Lorenz and Janetzkis Schlesischer were sequenced on a HiSeq1500 using the 2 × 150 PE scheme.

Mapping and Variant Calling
Read quality was checked with FastQC [51]. Reads were mapped via BWA-MEM v0.7 [52] to the B. napus Darmor-bzh v4.1 genome sequence [2] and the Zheyou7 assembly [47] (File S1). Default parameters were applied and the -M flag was set to avoid spurious mappings. Mapping statistics were calculated via the flagstat function of samtools [53] prior and past the following filtering step. Mappings were cleaned with samtools view -q 30 -b -F 0 × 900 -f 0 × 2 to remove low-quality alignments and reads without a properly mapped mate. The filtered BAM files were passed to GATK v3.8 [54][55][56] for the identification of a variant set based on hard filtering. BWA-MEM and GATK were chosen due to excellent performances in previous studies [41,57].

Generation of the "Gold Standard"for SNV Filtering
The workflow starts with the generation of a gold standard for SNV filtering, which contains SNVs which are homozygous in the parental genotypes and heterozygous in the reconstituted F1 (Files S2 and S3). First the reads of the parents were mapped to the B. napus Darmor-bzh genome sequence v4.1 and variants were called as described above. Next, coverage filters based on the BAM and VCF files were applied. BAM-derived coverage files were constructed as described in Pucker et al., 2018 [58]. A minimum coverage of 10 and a maximum coverage of 60 were determined to yield high-quality SNVs which are likely not affected by copy number variations (GitHub filter_parent_variants.py). The upper limit was chosen, because it represents twice the modal value of each file. Triallelic variants and variants present in both parents were excluded from further analyses as these are not contrasting between the pools (File S2) (GitHub combine_homo_VCFs_vs_Bn41.py).
The resulting set of homozygous SNVs of the parental genotypes was then screened for heterozygosity in a reconstituted F1 population (File S3) (GitHub filter_vcf_F1.py) to generate the final gold standard which contained 903,253 SNVs (Files S2 and S4) (GitHub merge_vcfs.py). The reconstituted F1 variant set comprises variants derived from all analyzed genomic sequencing data of our study. Only "heterozygous" variants, which showed an allele frequency between 0.2-0.8 against the B. napus Darmor-bzh genome sequence were used for the down-stream filtering (File S2).

Filter Raw Variants per Pool for Delta Allele Frequency (dAF) Calculation
A sophisticated filtering approach was applied to select only highly reliable SNVs for the downstream analyses (File S2). High-quality SNVs were extracted from the raw variants of each pool (GSL_L, GSL_H, SPC_H, SPC_L) by considering only SNV positions present in the gold standard (File S2) (GitHub filter_pools_vcfs_for_gold_standard.py). Only variants with a minimum 0.75-fold the average median coverage and a maximum coverage of 1.5-fold the average median coverage per pool were kept. This final set of variants of the high-and low-GSL and SPC_A2 pools, respectively, were used to calculate delta allele frequencies (dAFs) (Files S2, S5 and S6) (GitHub combine_single_VCFs_version3.py). The dAF is defined as the absolute difference between the allele frequency (AF) values from the two pools for a given variant position. Only biallelic variants are included for the calculation of dAFs to facilitate a reliable dAF estimation.

Interval Detection
For interval detection, Fisher's exact test was applied on the raw SNVs of the pools to yield variants with a significant dAF ( Figure 1). A p-value cut-off of 0.05 was applied after correction for multiple testing (GitHub fisher_exact_test_corrects_for_multiples_testing.py). The passing SNVs are called "statistically meaningful differential Allele-specific Read Counts" (dARCs) (Files S7 and S8).
For interval detection, Fisher's exact test was applied on the raw SNVs of the pools to yield variants with a significant dAF (Figure 1). A p-value cut-off of 0.05 was applied after correction for multiple testing (GitHub fisher_exact_test_corrects_for_multiples_testing.py). The passing SNVs are called "statistically meaningful differential Allele-specific Read Counts" (dARCs) (File S7, File S8). Figure 1. Schematic illustration of interval detection and extraction of genes carrying high-impact variants. First both raw variants of each pool are combined into a merged VCF file. Interval detection was performed based on the density of dARCs (left). A set of clean variants were extracted from the merged VCF file and used for the detection of high-impact SNVs (right). Finally, the results of both Figure 1. Schematic illustration of interval detection and extraction of genes carrying high-impact variants. First both raw variants of each pool are combined into a merged VCF file. Interval detection was performed based on the density of dARCs (left). A set of clean variants were extracted from the merged VCF file and used for the detection of high-impact SNVs (right). Finally, the results of both approaches are integrated (bottom). Raw variants (grey circles); dARCs (yellow points); ZCR (purple); genomic intervals (dark blue rectangles); high-impact variants (red points); genes carrying high-impact variants (red rectangles). These dARCs were used to identify genomic intervals associated with the analyzed traits ( Figure 1). The following criteria were applied: (I) the minimum amount of dARCs in an interval was set to 4 (-min_nr_dARCs_in_reg), (II) the distance between at least 3 dARCs of one interval needs to be greater than 1 kbp (-dis_in_reg), and (III) the distance between any two adjacent dARCs must be less than 50 kbp (-dis_out_reg) (GitHub get_intervals_based_on_dARCs_Bn41_v4.py). While a certain number of dARCs is required to seed an interval, it is also important that these are equally distributed. Numerous variants originating from the same sequenced DNA fragment could be due to an artifact and are excluded by requiring a minimal distance of the seed dARCs. To avoid extremely large intervals with low dARC frequencies between dARC-rich intervals, the 50 kbp cut-off for the dARC distance is intended to split intervals without a constantly high dARC density.
Zero coverage regions (ZCRs) were identified by using the coverage information of both pools and applying a genome-wide screening with a window size of 200 bp per chromosome (GitHub PAV_finder.py). ZCRs are considered during the interval detection, as they are often responsible for splitting genomic intervals into parts ( Figure 1) (Files S9 and S10). The localization of ZCRs at the same genomic position in both pools prevents the detection of variants and hence no dARCs can be detected.
Finally, detected genomic intervals were ranked according to their amount of dARCs. Initial candidate genomic intervals were manually inspected to find a suitable cut-off. For seed GSL content and SPC genomic intervals containing at least 100 dARCs and 65 dARCs were used for downstream analysis, respectively.

Generation of dAF Plots
Noise in the genome-wide dAF plots (Files S11 and S12) was reduced through the combination of adjacent dAFs (calculated as described in 2.6, GitHub sophisticated_cov_plot.py). Variants within a sliding window of 100 variants were represented by the median dAF of all variants in the window. Each step was a 5 variant shift of the window. The genomewide distribution of "statistically meaningful differential Allele-specific Read Counts" (dARCs) (compare 2.7 for details) was visualized by the normalized density of dARCs. The normalized density of dARCs was calculated by combining the amount of dARCs in sliding windows of size 100 kbp with steps of size 30 kbp divided by the total amount of SNVs within this window. In addition, the mean mapping coverage of the pools using the same sliding window parameters were calculated and normalized to the maximum mean coverage per chromosome for visualization (Files S11 and S12).

Presence-Absence Variations (PAVs)
PAVs were identified based on the BAM-derived coverage files by comparison of coverage information of both pools in a genome-wide window approach which considers annotated genes (GitHub PAV_finder.py). Genomic regions or genes with no or at least extremely low coverage in one pool, but substantially higher coverage in the other pool were considered as PAVs. The coverage was normalized to the overall coverage of the pool. Genes located on the genetically non-anchored random scaffolds were excluded from this analysis. For the identification of PAVs based on gene regions the following parameters were used: PAV_finder.py was used in gene mode and -mincov was set to 10 (Files S13 and S14).

Functional Annotation and Candidate Genes
Genes located within or spanning over the borders of the identified genomic intervals were extracted (Figure 1). Genes were functionally annotated by transferring the Araport11 functional annotation to the v5 gene models [2]. OrthoFinder v2.3.7 [59] was applied using default parameters to identify orthogroups between representative peptides of A. thaliana Araport11 as previously defined [60], and the B. napus representative peptide sequences derived from the B. rapa, B. oleracea, B. napus Express 617, Darmor-bzh, Lorenz, and Janetzkis Schlesischer (Files S1, S15-S19). Remaining unannotated genes were functionally annotated by using reciprocal best blast hits (RBHs) and best blast hits (BBHs) as described previously [61] (Figure 1). We refer to the Bna genes that were annotated as homologs of the respective A. thaliana genes.

Variant Impact Prediction via SnpEff
Variants predicted to have an impact on the genes located within the genomic interval were extracted. First, SnpEff v4.1f [62] was applied on the merged variants of each pool (GitHub combine_single_VCFs_for_SnpEff.py), which passed GATK's quality filters ('PASS') ( Figure 1). The resulting VCF was subjected to SnpEff with default parameters using a custom database constructed from the B. napus Darmor-bzh v4.1 genome sequence and the v5 annotation, which were corrected for the used frame (File S20). SnpEff results were filtered for "high impact" variants as previously defined [61], which included predictions of loss or gain of a stop codon mutations, frameshift mutations, and splice site variants (GitHub get_intervals_based_on_sig_snps_Bn41_v4.py) ( Figure 1). Finally, genes located within +/− 5 kbp of the borders of a genomic interval were analyzed for predicted high-impact variants ( Figure 1) (Files S21 and S22).

Generation and Analysis of RNA-Seq Data
Seeds and leaves (28 days after flowering (DAF)) RNA-Seq samples of Janetzkis Schlesischer (P2) and seeds and leaves (23 and 35 DAF) RNA-Seq samples of B. napus SGDH14 (medium-high seed GSL content) [63] were prepared and sequenced according to Schilbert et al., 2021 [64]. Additionally, RNA-Seq reads derived from seeds and leaves of B. napus Express 617 [64] and public RNA-Seq data sets (File S23) were mapped to the B. napus Darmor-bzh v4.1 and Zheyou7 assemblies using STAR v.2.7.1a [65]. STAR was run in basic mode allowing maximal 5% mismatches and requiring an alignment spanning at least 90% of the read length. Mapping statistics were calculated based on STAR.log files via a customized python script (GitHub parse_STAR_log_file_create_mapping_statistic.py) (Files S24). We used featureCounts v1.5.0-p3 [66] for the generation of count tables. The mean fragments per kilobase exon per million reads (FPKM) or mean counts per million (CPM) expression values per organ were used for downstream analysis (GitHub gener-ate_figures_only_mean_expression_calc.py and map_mean_exp_to_cand_genes_in_reg.py). For example, mean CPM expression values of Janetzkis Schlesischer, as well as average coverage information per pool were assigned to the genes to infer PAVs between pools (File S25 and S26) (GitHub fetch_gene_IDs_from_gff3_file.py, map_mean_exp_to_cand_ genes_in_reg.py, map_PAVs_to_genes_in_regs.py).

Identification of MYB Homologs
MYB homologs were identified with KIPEs as described previously [67]. KIPEs was run with a minimal BLAST hit similarity of 40% to reduce the number of fragmented peptides derived from possible mis-annotations. As bait peptide sequences, all A. thaliana MYBs were used [13]. As subject species, the proteomes of several Brassica species were used (Files S1, S17, S19, S27 and S28). The alignment was constructed with MAFFT v.7 [68] and trimmed to minimal alignment column occupancy of 10%. Next, a phylogenetic tree was build (https://GitHub.com/bpucker/script_collection/tree.py with FastTree v2.1.10 [69] using 10,000 rounds of bootstrapping, including the identified MYB homologs from several Brassica species and well described MYB sequences from literature (Files S29-S33). The phylogenetic trees were visualized with FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/ figtree/ (Files S34 and S35). Additional previously identified MYB sequences derived from Darmor-bzh [70] were added manually to the phylogenetic tree to ensure completeness of MYB homologs.
By analyzing mappings of genomic sequence reads from the parental genotypes against the Darmor-bzh and Zheyou7 assembly, the copy numbers of BnaMYB genes/alleles involved in GSL biosynthesis were manually inspected via IGV [71]. Generally, the numbering of GSL MYBs is based on Seo et al., 2017 [70] with small modifications. Finally, BnaMYB28_2 alleles from the B. napus cultivars of a subset of 100 lines from the BnASSYST diversity panel [72] were validated via PCR and Sanger sequencing (File S36).

Phenotyping of the Segregating F2 Population
A large F2 population segregating for seed quality traits and consisting of over 2000 individuals derived from a cross between the B. napus cultivars Lorenz (P1) and Janetzkis Schlesischer (P2) was used for MBS ( Figure 2). The traits studied were seed GSL content and SPC/SOC. The seed GSL content ranged from 11.8 to 88.1 µmol/g, while the SPC and SOC ranged between 9.7-28.0% and 24.3-56.3%, respectively ( Figure 2). The tails relevant for building the contrasting GSL pools account for 5.28% (<30.83 µmol/g) for the GSL_L pool and 2.76% (>70 µmol/g) for the GSL_H pool of the whole F2 population. The tails from the SPC distribution used to build the pools comprise 3.93% (<16.0%) for SPC_L_A2 and 2.33% (>23.1%) for SPC_H_A2 of the whole F2 population.

MBS Predicted Candidate Genomic Intervals Controlling SPC, SOC, and Seed GSL Content-Mapping and Variant Calling
To map candidate genes, pools from the F2 population were subjected to MBS analysis. Both parental genotypes P1 and P2 and the two pools representing individuals with extreme phenotypes were sequenced, both for seed GSL content and SPC/SOC. After read mapping to the B. napus Darmor-bzh genome sequence, low-quality alignments and reads without a properly mapped mate were removed. Among these data sets at least 52% plants were harvested and phenotyped via near-infrared resonance spectroscopy (NIRS). The sample size and mean of the distribution are given by N and x, respectively. The rectangles in (A,B) mark the tails of the distributions used for pool building, e.g., the low-GSL (GSL_L), high-GSL (GSL_H), low-SPC (SPC_L_A2) and high-SPC (SPC_H_A2) pools. As SPC and SOC are negatively correlated the SPC and SOC pools are largely congruent and thus only the SPC pools were marked. The tails relevant for building the contrasting GSL pools account for 5.28% (<30.83 µmol/g) for the GSL_L pool and 2.76% (>70 µmol/g) for the GSL_H pool of the whole F2 population. The tails from the SPC distribution used to build the pools comprise 3.93% (<16.0%) for SPC_L_A2 and 2.33% (>23.1%) for SPC_H_A2 of the whole F2 population.

MBS Predicted Candidate Genomic Intervals Controlling SPC, SOC, and Seed GSL Content-Mapping and Variant Calling
To map candidate genes, pools from the F2 population were subjected to MBS analysis. Both parental genotypes P1 and P2 and the two pools representing individuals with extreme phenotypes were sequenced, both for seed GSL content and SPC/SOC. After read mapping to the B. napus Darmor-bzh genome sequence, low-quality alignments and reads without a properly mapped mate were removed. Among these data sets at least 52% to 62% of the reads per data set were confidently mapped (File S37). See Supplementary File S37 for mapped read depth values (File S37). Variant calling revealed between 3,580,759 to 5,215,492 high-quality variants (InDels and single-nucleotide variants (SNVs)) for the respective samples (Table 1). Of these, 2,632,505 (73.5%) to 3,987,788 (76.5%) variants were distributed on the 19 pseudochromosomes. The remaining variants were distributed on the genetically non-anchored random scaffolds and were excluded from further analysis. The raw variants of each pool were filtered for the gold-standard SNVs, resulting in high-quality SNVs sets of 889,280 SNVs for the SPC pools and 880,842 for the GSL pools (~1036-1053 SNVs per Mbp) ( Table 1). These SNVs were used to generate delta allele frequency (dAF) plots of the high and the corresponding low pool (Files S11 and S12). We noticed that the usage of statistically meaningful differential Allele specific Read Counts (dARCs; see M&M for details) dARCs resulted in less noisy interval detection when compared against dAF approaches (Files S11 and S12).

Genomic Intervals and Candidate Genes Associated with Seed Glucosinolate Content
Evaluation of the dARC distribution among the pseudochromosomes allowed identification of 30 genomic intervals associated with seed GSL content (see M&M). These intervals were detected on six chromosomes, namely A02, A06, A09, C02, C07, and C09. Their sizes range from 73 kbp to 1.32 Mbp (Figure 3, Table 2). Out of the 30 intervals, 18 intervals are located on A09, 5 on C09, 3 on C02, 2 on C07, and 1 on A02 and A06 (Table 2). Several intervals in close proximity on one chromosome emerged due to the lack of dARCs located between these intervals. This can be caused by, e.g., (I) regions with low numbers of SNVs, (II) low-quality variants that do not qualify as dARCs, and (III) a combination of the two mentioned causes.
In total, 1807 genes were found within the genomic intervals associated with seed GSL content (File S25). Some of these genes have a well-known function in GSL biosynthesis or breakdown. BnaC02g41790D, a homolog of A. thaliana METHYLTHIOALKYLMALATE SYN-THASE 1 (AthMAM1), is involved in GSL biosynthesis and is part of the genomic interval C02_GSL_2. Moreover, BnaA09g08410D and BnaA09g01260D, whose A. thaliana homologs are APS KINASE (AthAPK) and N-(METHYLSULFINYL)ALKYL-GLUCOSINOLATE HY-DROXYLASE (AthAOP3), involved in GSL biosynthesis were identified in the genomic in-tervals A09_GSL_1 and A09_GSL_7, respectively. BnaA09g08470D, a homolog of THIOGLU-COSIDE GLUCOHYDROLASE (AthTGG) involved in GSL breakdown, was found in the genomic interval A09_GSL_7. Moreover, key transcription factors involved in the regulation of GSL content were located within the genomic intervals. Two close homologs of AthMYB28 (AthHAG1), BnaC09g05300D and BnaC09g05290D, were located within the genomic interval C09_GSL_2 (File S25). In addition, homologs of AthMYB34 (AthATR1), BnaC09g05060D and BnaC02g41860D, were identified in the genomic intervals C09_GSL_2 and C02_GSL_2, respectively (File S25).     To determine which parental genotype brings in which of the key transcription factor genes, we set out to identify all BnaMYB genes/alleles involved in GSL biosynthesis (MYB28, MYB29, MYB34, MYB51, and MYB122; see introduction; collectively referred to here as 'B. napus GSL MYBs') in P1 and P2. Since it turned out that not all BnaMYB homologs are resolved in the B. napus Darmor-bzh genome sequence, we used in addition the long-read assembly of the B. napus cultivar Zheyou7 which covers more BnaMYB homologs. BnaMYB sequences of various genotypes including both parental genotypes were subjected to a phylogenetic analysis (Files S29 and S35). By analyzing mappings of genomic sequence reads from the parental genotypes against both assemblies, the copy numbers of B. napus GSL MYBs were identified (Table 3, Files S15 and S29). A tandem gene duplication event of BnaMYB122_2 in P1 resulted in a higher number of BnaMYB122 genes compared to P2 (File S35). The additional BnaMYB28 copy in P2, BnaMYB28_5, is likely to be derived from the loss of this copy in P1 as indicated by the fractionated and extremely low coverage of the C02 BnaMYB28_5 by genomic read mappings of P1 reads to the Zheyou7 assembly (Files S29 and S15, Tables 3 and 4). This was also supported by the analysis of GSL pools, where BnaMYB28_5 revealed a~3-fold higher genomic coverage in the high-GSL pool compared to the low-GSL pool, indicating that this locus is only inherited by the high-GSL parent P2. Large deletions were detected on A09 in both, P1 and P2. Both deletions affect the presence or absence of B. napus GSL MYBs. The~920 kbp deletion of P2 ranges from~4.06 to 4.98 Mbp, while the~50 kbp deletion of P1 ranges from 4.46 to 4.51 Mbp (pseudochromosome positions taken from the Zheyou7 assembly). The P2 deletion A09_P2_920 harbors 163 genes, while only 1 gene (BnaA09g05680D) is located inside the shared deletion of P1 (File S38). Fractionated and extremely low coverage of the A09 BnaMYB28_4 homolog was observed in genomic read mappings of P1 and P2 reads to the Zheyou7 assembly, indicating its deletion in both parental genotypes (Table 4, File S15). In addition to BnaMYB28_4, additional genes associated with GSL biosynthesis were identified in the A09_P2_920 deletion (File S38) which overlap with high ranked genes affected by PAVs (File S13). The BnaMYB34_7 homolog is located within the A09_P2_920 deletion, but outside of the one of P1 (Tables 3 and 4, File S15). This was also supported by the analysis of GSL pools, where BnaMYB34_7 revealed a~3-fold higher genomic coverage in the low-GSL pool compared to the high-GSL pool, indicating that this locus is inherited by the low-GSL parent P1. The A09_P2_920 deletion overlaps with the genomic intervals A09_GSL_4 and A09_GSL_5 and might also be the reason for additional genomic intervals detected in its proximity.
In order to identify B. napus GSL MYB genes expressed in seeds which could influence seed GSL content and thus explain the phenotypic variation in the high-and low-GSL pools, we analyzed their expression in seeds and leaves of P2. Most BnaMYB28, BnaMYB29, BnaMYB34, BnaMYB51, and BnaMYB122 homologs were not or very low expressed in leaves and seeds ( Figure 4). Only five homologs are expressed in seeds: BnaMYB28_2, BnaMYB28_5, BnaMYB34_1, BnaMYB51_2, and BnaMYB51_6. As BnaMYB28_5 is absent in the low-GSL parent P1 (Table 4) but expressed in P2, this homolog might explain the genomic intervals identified in the south of chromosome C02 ( Figure 3, Table 2). Supporting these genomic intervals, BnaMYB34_1 is also located in the south of C02 (File S29). BnaMYB51_2 and BnaMYB51_6 are located on C08 and A08, respectively, and have homologs in both parental genotypes which showed no genomic coverage differences between the high-and low-GSL pools (File S15). However, BnaMYB28_2 on C09 exceeds the expression of all B. napus GSL MYB homologs by a factor of at least 2-3 fold in leaves and seeds (Figure 4). The analysis of over 650 public available B. napus RNA-Seq data sets (File S23) supports the high expression of BnaMYB28_2 compared to all other GSL MYBs across various tissues and environmental conditions (File S39).

Variation Effects in Genes Involved in Seed Glucosinolate Biosynthesis
As sequence variants can influence the function of gene products, the impact of sequence variants on genes located within or near the genomic intervals (+/−5 kbp) was predicted. Interestingly, the highly expressed BnaMYB28_2 located on C09 is affected by a 4 bp insertion (GCTA) near the end of the annotated third exon ( Figure 5A, Files S15 and S40). The phylogeny of MYB28 homologs across several Brassicaceae species revealed that the ancestral allele did not contain this 4 bp insertion ( Figure 5A,B, File S35), i.e., the ancestral allele encodes a functional MYB transcription factor. The Darmor-bzh genome sequence contains the insertion. The 4 bp insertion results in a premature stop codon of the MYB28 homolog BnaC09g05300D, leading to a truncated protein ( Figure 5B). The second fragment of this locus is annotated as BnaC09g05290D which encodes only a MYB28 C-terminal fragment ( Figure 5A).

Variation Effects in Genes Involved in Seed Glucosinolate Biosynthesis
As sequence variants can influence the function of gene products, the impact of sequence variants on genes located within or near the genomic intervals (+/−5 kbp) was predicted. Interestingly, the highly expressed BnaMYB28_2 located on C09 is affected by a 4 bp insertion (GCTA) near the end of the annotated third exon ( Figure 5A, File S15, File S40). The phylogeny of MYB28 homologs across several Brassicaceae species revealed that the ancestral allele did not contain this 4 bp insertion ( Figure 5A,B, File S35), i.e., the ancestral allele encodes a functional MYB transcription factor. The Darmor-bzh genome sequence contains the insertion. The 4 bp insertion results in a premature stop codon of the MYB28 homolog BnaC09g05300D, leading to a truncated protein ( Figure 5B). The second fragment of this locus is annotated as BnaC09g05290D which encodes only a MYB28 C-terminal fragment ( Figure 5A). The ancestral allele is present in 73% of the GSL high pool reads and 29% of the GSL low pool reads, resulting in a dAF of 0.44 (Files S21 and S40). The genomic reads of the high-GSL parent P2 showed the ancestral allele, while those of the low-GSL parent P1 carried the insertion ( Figure 5A, File S40). RNA-Seq data from leaves and seeds of P2 support the presence of the ancestral allele on transcript level (File S40).
The BnASSYST diversity panel was screened to investigate the association of the 4 bp insertion with low seed GSL content. The two alleles of the BnaMYB28_2 C09 homolog, namely BnaMYB28_2_1* describing the ancestral allele and BnaMYB28_2_2* describing the insertion allele, were validated by this analysis and confirmed by sequencing ( Figure 6A). The insertion allele BnaMYB28_2_2* was identified to be significantly associated with low seed GSL content ( Figure 6B). Moreover, co-segregation of the insertion with the C09 homolog was detected ( Figure 6B). This finding is in accordance with our phylogenetic analysis, supporting the assumption that the allele without the 4 bp insertion is the ancestral allele.
In order to evaluate whether the 4 bp insertion might be ubiquitously associated with low seed GSL content in B. napus, we analyzed its presence across several B. napus lines (File S41). In accordance with our previous findings, all high-GSL lines contain at least one functional BnaMYB28 homolog harboring the ancestral allele as it has been observed for example for the B. napus genotype SGDH14. All low-GSL lines revealed the presence of the insertion allele of the C09 BnaMYB28 homolog, while the A09 BnaMYB28 homolog was absent (File S41).

PAVs
Additional candidate genes were identified via PAV analysis, which revealed 316 genes affected by PAVs (File S13). As seed GSL content is a polygenic trait, it is not expected to identify genes with no read coverage in one pool compared to full coverage in the other pool. In this study, genes likely to be deleted in one parent but present in the other resulted in a 1/3 read coverage ratio. Genes predicted to be present in the low-GSL parent P1, but absent in the high-GSL parent P2 are described. By analysis of the chromosomal positions of high ranked PAVs, two deletions on A09 were identified. The first major 900 kbp deletion on A09 (A09_P2_920) and its associated candidate genes have already been described above. A second~25 kbp deletion is located within A09_GSL_13 ranging from~10.028-10.053 Mbp, namely A09_P2_25 (File S42). Three of four genes located within this deletion are homologs of genes involved in abscisic acid (ABA) signaling in A. thaliana. Namely BnaA09g16660D, a homolog of CALCIUM-DEPENDENT PROTEIN KINASE 32 (AthCPK32), as well as BnaA09g16680D and BnaA09g16690D, two homologs of BURNOUT1 (AthBNT1), were identified to be absent in P2 compared to P1. For the fourth gene no functional annotation was available. The ancestral allele is present in 73% of the GSL high pool reads and 29% of the G low pool reads, resulting in a dAF of 0.44 (File S21, File S40). The genomic reads of high-GSL parent P2 showed the ancestral allele, while those of the low-GSL parent carried the insertion ( Figure 5A, File S40). RNA-Seq data from leaves and seeds of support the presence of the ancestral allele on transcript level (File S40).
The BnASSYST diversity panel was screened to investigate the association of th  In order to evaluate whether the 4 bp insertion might be ubiquitously associated low seed GSL content in B. napus, we analyzed its presence across several B. napus (File S41). In accordance with our previous findings, all high-GSL lines contain at leas functional BnaMYB28 homolog harboring the ancestral allele as it has been observe example for the B. napus genotype SGDH14. All low-GSL lines revealed the presen the insertion allele of the C09 BnaMYB28 homolog, while the A09 BnaMYB28 hom was absent (File S41).

PAVs
Additional candidate genes were identified via PAV analysis, which revealed genes affected by PAVs (File S13). As seed GSL content is a polygenic trait, it i expected to identify genes with no read coverage in one pool compared to full cove in the other pool. In this study, genes likely to be deleted in one parent but present i other resulted in a 1/3 read coverage ratio. Genes predicted to be present in the low parent P1, but absent in the high-GSL parent P2 are described. By analysis o chromosomal positions of high ranked PAVs, two deletions on A09 were identified Boxplots for seed GSL content based on the genotypes derived from the BnASSYST diversity panel (n = 100): n = 90 for BnaMYB28_2_2*; n = 2 for BnaMYB28_2_2* + BnaMYB28_2_1*; n = 8 for BnaMYB28_2_1*. Differences between genotypes were analyzed by Mann-Whitney U-test and corrected for multiple testing. n.s. represents not significant.

Genomic Intervals, Candidate Genes and Variation Effects Associated with Seed Protein and Oil Content
In total 15 genomic intervals associated with SPC and SOC were identified on chromosome A01, A06, A09, C03, C04, C08, and C09 and their sizes range from 10.5 kbp to 2.07 Mbp (Figure 7, Table 5). Out of these 15 intervals, five intervals are located on C08, three on A06, two on C04 and C09 and one on A01, A09 and C03 (Table 5). A total of 351 genes were located within the genomic intervals (File S26), of which some have a wellknown function in lipid and/or protein biosynthesis. In addition, several SNVs affecting genes associated with SPC and SOC were investigated.
BnaC03g04570D, a homolog of ALTERED SEED GERMINATION 2 (AthASG2), is located in C03_SPC_1 and affected by a frameshift mutation which is present in 86% of the low-SPC pool reads and in 21% of the high-SPC pool reads (File S22). Within C04_SPC_1 BnaC04g00490D, a homolog of ATP BINDING CASSETTE SUBFAMILY B4 (AthABCB4), was predicted to carry a frameshift mutation with a prevalence of 91% of the low-SPC pool reads and 15% of the high-SPC pool reads. In C04_SPC_2 BnaC04g01520D, a homolog of a kinase family protein with an ARM repeat domain (AthCTEXP), was predicted to gain a stop codon which is present in 93% of the low-SPC pool reads and in 21% of the high-SPC pool reads (File S22). In C08_SPC_2, two major candidate genes were detected. First, BnaC08g05680D, a homolog of the regulator of seed oil content PHOSPHOLIPASE D DELTA (AthPLDδ), is affected by a mutation of the splice site of the first to the second exon (File S22). This mutation is present in 67% of the low-SPC pool reads and in 18% of the high-SPC pool reads. Second, BnaC08g05590D, a homolog of the SERINE CARBOXYPEPTIDASE LIKE 41 (AthSCPL41), carries a frameshift mutation in 59% of the low-SPC pool reads and was not detected in the high-SPC pool reads (File S22). BnaC08g05680D and BnaC08g05590D are located next to each other in the B. napus Darmor-bzh genome sequence. In addition, a frameshift mutation in BnaC08g02490D (C08_SPC_1), which A. thaliana homolog is annotated as an amino acid transporter, was identified. Moreover, mutations located in genes, which A. thaliana homologs are involved in post-translational protein modifications, e.g., ubiquitylation (BnaA06g15510D, BnaA06g18030D, BnaA06g18370D, BnaA06g18380D) (A06_SPC_1 and A06_SPC_3) or myristoylation (BnaA06g15490D) (A06_SPC_1) were identified in genomic intervals on chromosome A06 (File S22). BnaA06g15510D (A06_SPC_1), a homolog of F-BOX PROTEIN 7 (AthFBP7), was predicted to gain a stop codon due to a SNV in 78% of the low-SPC pool reads and 14% of the high-SPC pool reads (File S22). BnaA06g18220D (A06_SPC_3), a homolog of RAB GTPASE HOMOLOG 8A (AthRABE1c), is affected by a frameshift and stop gained mutation in the high-SPC pool and are not present in the low-SPC pool (File S22). PAVs genes located near or in the genomic intervals were not associated with SPC/SOC based on the functional annotation of their corresponding A. thaliana homolog (File S14). In summary, these candidate genes are proposed to contribute to the variations of SPC and SOC in B. napus. first major ~900 kbp deletion on A09 (A09_P2_920) and its associated candidate genes have already been described above. A second ~25 kbp deletion is located within A09_GSL_13 ranging from ~10.028-10.053 Mbp, namely A09_P2_25 (File S42). Three of four genes located within this deletion are homologs of genes involved in abscisic acid (ABA) signaling in A. thaliana. Namely BnaA09g16660D, a homolog of CALCIUM-DEPENDENT PROTEIN KINASE 32 (AthCPK32), as well as BnaA09g16680D and BnaA09g16690D, two homologs of BURNOUT1 (AthBNT1), were identified to be absent in P2 compared to P1. For the fourth gene no functional annotation was available.

Genomic Intervals, Candidate Genes and Variation Effects Associated with Seed Protein and Oil Content
In total 15 genomic intervals associated with SPC and SOC were identified on chromosome A01, A06, A09, C03, C04, C08, and C09 and their sizes range from 10.5 kbp to 2.07 Mbp (Figure 7, Table 5). Out of these 15 intervals, five intervals are located on C08, three on A06, two on C04 and C09 and one on A01, A09 and C03 (Table 5). A total of 351 genes were located within the genomic intervals (File S26), of which some have a well-known function in lipid and/or protein biosynthesis. In addition, several SNVs affecting genes associated with SPC and SOC were investigated.  A heatmap ranging from white to red represents the normalized density of dARCs, where a red color represents a high amount of dARCs.

Discussion
We investigated a segregating F2 population to identify genomic intervals and candidate genes associated with protein, oil, and glucosinolate content of B. napus seeds. The genomic intervals and candidate genes identified in this study should provide deeper insights into the genetic architecture of the three complex traits. We envision that the results of this study will be used for genetic improvement of seed quality in B. napus.

Seed Oil and Protein Content
Control of the multigenic traits SPC and SOC is complex and previous studies have reported various SPC and SOC QTL with the majority being minor QTL distributed across all linkage groups [28][29][30][31][32][33][34][35]. As expected for the multigenic traits SPC and SOC, we identified several genomic intervals distributed across 7 chromosomes: A01, A06, A09, C03, C04, C08, and C09. A large proportion of the genomic intervals overlap with loci associated with SOC from previous studies, such as A01_SPC_1, A06_SPC_1-3, C08_SPC_4-5 [27,30,35] indicating the high reliability of these loci. The genomic interval A01_SPC_1 overlaps with a significant region associated with amount of eicosenoic acid [27]. All intervals on chromosome A06 are in line with significant regions associated with oleic acid and linoleic acid [27]. Chao et al. identified several QTL for SPC and SOC, of which two QTL for SOC are in proximity to the genomic interval C03_SPC_1 and one QTL for SPC overlaps with C08_SPC_3-5 [26]. Moreover, C08_SPC_4 and C03_SPC_1 are located in proximity to SNVs significantly associated with linolenic acid [27,38], while C08_SPC_5 is close to SNVs significantly associated with oleic acid, erucic acid [27,38], and eicosenoic acid [27].
Numerous candidate genes and sequence variants associated with SPC/SOC have been detected. For example, BnaC08g05680D, a homolog of PHOSPHOLIPASE D DELTA, is located in C08_SPC_2. Phospholipases are involved in lipid degradation, membrane reconstruction and signal transduction [73]. PLDδ, one of the most abundant PLDs, hydrolyses phospholipids to phosphatidic acid (PA) [73]. Devaiah et al. showed significant reduced seed germination for the pldδ A. thaliana and attenuation of PLDα1 expression might improve oil stability, seed quality and seed aging [74]. In leaves of pldδ A. thaliana mutants the suppression of PLDδ results in the attenuation of PA formation, which blocks the degradation of membrane lipids retarding ABA-promoted senescence [75]. Another candidate gene BnaC08g05590D located in C08_SPC_2 is a homolog of SERINE CARBOXYPEPTIDASE-LIKE 41 (SCPL41). SCPL41 was identified as negative regulator of membrane lipid metabolism and is proposed to be required for phospholipid metabolism or PA-dependent signaling in A. thaliana [76]. Deletion of SCPL41 increased total leaf lipid content and phosphatidylcholine, phosphatidylethanolamine, and phosphatidylglycerol contents, which are substrates of phospholipid hydrolysis via PLD [76]. Interestingly, PLDδ and SCPL41 are located next to each other in the B. napus genome sequence indicating they might be functionally related or act in the same network as it has been observed for, e.g., biosynthetic gene clusters in A. thaliana [77]. In the low-SPC pool, the B. napus PLDδ and SCPL41 homologs are affected by high-impact variants. Therefore, the most likely non-functional SCPL41 and PLDδ homologs might result in an increase in total lipid content in the low-SPC pool. Due to the negative correlation of SPC and SOC this would in turn lead to a low SPC.
On chromosome C04 CTEXP and ABCB4 homologs have been identified as candidate genes affected by nonsense and frameshift mutations, respectively, with a high prevalence in the low-SPC pool. Homologs of CTEXP are known to play a role in intracellular protein trafficking [78], while ABCB4 was identified as an auxin efflux transporter [79]. However, members of the same enzyme family are known as intracellular sterol transporters in mice [80].
A. thaliana mutants of the candidate gene ASG2 located in C03_SPC_1 show seeds with an increased oil body density, fatty acid content, and weight [81]. The authors hypothesize that ASG2 modulates the gene expression or activity of ω-6-fatty acid desaturase (FAD2) and/or ω-3-fatty acid desaturase (FAD3), which are involved in the production of unsaturated FAs [81]. Thus, ASG2 might be a novel candidate gene contributing to an increased SOC in the low-SPC pool. Moreover, the candidate genes involved in post-translational protein modifications might influence SPC/SOC content as, e.g., myristoylation enables protein-lipid interactions and controls the transport and localization of proteins [82].
The candidate gene BnaA06g15510D (A06_SPC_1), whose A. thaliana homolog is annotated as F-BOX PROTEIN 7, is affected by a nonsense mutation in the low-SPC pool, and A. thaliana fbp7 mutants display a defect in protein biosynthesis after cold and heat stress [83]. FBP7 is proposed to regulate translation through ubiquitylation and thereby inactivates a translation repressor under temperature stress [83]. The nonsense mutation in BnaA06g15510D might results in a non-functional FBP7, leading to activation of the translational repressor and thus in a reduction in protein content.
Located in A06_SPC_3 a RABE1c homolog was affected by several mutations in the high-SPC pool. Peroxisomal fatty acid-oxidation is the main pathway for seed lipids catabolism [84]. RABE1c is responsible for PEROXIN 7 (PEX7) dislocation/degradation on the peroxisome membrane and mutation of RABE1c restored peroxisomal β-oxidation activity and PEX7 expression [84]. Treatment with proteasome inhibitors also restored endogenous PEX7 protein levels in GFP-PEX7-expressing seedlings [84]. Thus, a mutated RABE1c in the high-SPC pool might decrease SOC and in parallel increase SPC by increased peroxisomal β-oxidation activity and proteasome inhibitory-like characteristics, respectively.
Of the genes located in the genomic interval on chromosome A01, no association with SPC or SOC was detected based on the functional annotation. However, Liu et al. identified a significant SNV located within BnaA01g22680D, which is in proximity to the genomic interval A01_SPC_1 [85]. The A. thaliana homolog is MILDEW RESISTANCE LOCUS O 6 (AthMLO6). In addition to the identified candidate genes in this study, unknown genes or genomic components might be involved in trans-regulatory or epistatic interactions of SPC/SOC which may be responsible for the indicated genomic intervals.
We identified several genomic intervals on chromosome A09. In this case, a large interval is subdivided into several intervals because of (I) regions with low numbers of SNVs, or (II) low-quality variants or (III) a combination of both. Regions with low numbers of SNVs can be caused by deletion in one or both parental genotypes. Therefore, dARCs cannot be detected in these regions which results in a subdivision of intervals. It is possible that such intervals are flanking loci associated with the trait of interest, e.g., in the case of large deletions. Two deletions in the high-GSL parent P2 may have a major influence on seed GSL content. The first~900 kbp deletion (A09_P2_920) causes the loss of BnaMYB28_4 and BnaMYB34_7, the A. thaliana homologs of these two genes are known positive regulators of GSL biosynthesis. However, because BnaMYB28_4 is also absent in the low-GSL parent P1 and BnaMYB34_7 is not expressed in the high-GSL parent P2 seeds, these genes are likely not responsible for the observed variation in seed GSL content between the parents. The A09_P2_920 deletion also causes the loss of BnaA09g05810D and BnaA09g05510D annotated as CALNEXIN1 (At5g61790) and COBRA (At5g60920), respectively. CALNEXIN1 and COBRA were identified as interaction partner of the aliphatic GSL pathway specific enzyme CYP83A1 [86]. A. thaliana calnexin1 and cobra homozygous T-DNA insertion mutants revealed an increased total aliphatic and indolic GSL content [86], indicating that they have a negative influence on GSL biosynthesis. Thus, the deletion of both genes in P2 might contribute to its high GSL content.
The second~25 kbp deletion of P2 (A09_P2_2) was found to cause the loss of genes, whose A. thaliana homologs are involved in abscisic acid (ABA) signaling. Plant hormones such as ABA, jasmonic acid (JA) and salicylic acid (SA) impact indolic and aliphatic GSL biosynthesis by increasing the expression of GSL transcription factors such as MYB28 and MYB29 and vice versa [18,87,88]. The deletion of the two homologs (BnaA09g16680D, BnaA09g16690D) of AthBURNOUT1 could result in increased levels of plant stress hormones, as A. thaliana burnout1 loss of function mutants overproduce stress hormones such as JA, SA, ABA, and ethylene [89]. Moreover, the A. thaliana homolog of the deleted BnaCPK32 (BnaA09g16660D) is a positive regulator of AthABF4, which positively regulates the expression of ABA-responsive genes to increase stress tolerance [90]. Thus, lacking BnaCPK32 might impair ABA signaling and stress tolerance, which might be compensated by high GSL content. Taken together, the possible increase in plant stress hormones and reduced stress tolerance might boost GSL production in P2.
In addition to the detection of trait-associated genomic intervals and large deletions, our approach enables the identification of single candidate genes and pinpoints sequence variants and domains at the base-pair level which are associated with seed GSL content. Seed GSL content is influenced by multiple genes involved in the biosynthesis of aliphatic and indolic glucosinolates, as well as GSL breakdown and transport [12,23,91]. However, seed GSLs can be largely decreased by reducing aliphatic GSLs as they represent 91-94% of total seed GSL content [19]. According to Kittipol et al., the results of some studies lead to the assumption that inhibition of GSL transport processes cause the low seed GSL trait in B. napus as no significant correlation between leaf and seed GSL could be found. However, Kittipol et al. showed that seed and leave aliphatic GSL content is most likely regulated by a master regulator affecting all plant tissues rather than long-distance transport, because no accession with high leaf and low seed GSL content was identified [92]. Thus, the positive regulator of the aliphatic GSL biosynthesis, MYB28, was proposed as master regulator [92] and we could indeed identify specifically the BnaMYB28_2 homolog on chromosome C09 as major regulator of seed GSL content. B. napus lines with a low GSL content carry a 4 bp insertion in this gene, which causes a premature stop codon, thus leading to a most likely non-functional MYB transcription factor. Consequently, structural genes in the GSL biosynthesis are no longer activated due to the lack of this central transcriptional regulator. In contrast, lines with high GSL content carry a functional BnaMYB28_2 allele. A significant association between this 4 bp insertion and low seed GSL content was observed before [93], but not explained mechanistically. The Damor-bzh genome sequence harbors the 4 bp insertion which caused the prediction of two gene models at this locus, namely BnaC09g05300D and BnaC09g05290D. While BnaC09g05300D contains a R2R3-MYB DNA-binding domain on sequence level, BnaC09g05290D does not [70]. This could explain why previous studies [92,93] have not described the molecular consequences of this insertion. Our findings on the genomic level are supported by RNA-Seq analyses which show a strong expression of BnaMYB28_2 in seeds. The A09 BnaMYB28 homolog might contribute to a high seed GSL content in some high-GSL lines, but was absent in all low-GSL lines investigated in this study (File S41) marking the 4 bp insertion of the C09 homolog as key determinant for seed GSL content phenotype. Interestingly, the 4 bp insertion is located in the middle of a QTL for seed GSL content which explained 48% of the phenotypic variation [94]. Furthermore, the importance of the MYB28 C09 homolog as positive regulator of GSL biosynthesis was demonstrated in B. oleracea varieties [95,96]. Yi et al. analyzed the expression of 81 genes involved in GSL biosynthesis in 12 genotypes of four B. oleracea subspecies across leaves, stems, and florets [95]. Interestingly, out of five aliphatic transcription factor-related genes, only the C09 MYB28 homolog (Bol036286) was expressed in all genotypes, again stressing its essential role in aliphatic GSL biosynthesis. The data also confirm that not all GSL MYB transcription factors need to be expressed to produce GSLs [95].
Additional genomic differences controlling seed GSL content are copy number variations. As the BnaMYB28 homolog on C02 (BnaMYB28_5) is absent in the low-GSL parent P1, while being expressed in the seeds of in the high-GSL parent P2, BnaMYB28_5 could explain some of the phenotypic variation in GSL content. The role of the MYB28 homologs on C02 and C09 in GSL biosynthesis was analyzed by double-knockout lines of B. oleracea [97]. The remaining functional MYB28 homolog on C07 and the two MYB29 homologs did not compensate the low-GSL phenotype, indicating that these homologs play an inferior role in GSL production [97]. In line with our expression studies showing that the C02 homolog BnaMYB28_5 is >12-fold higher expressed in leaves compared to seeds, the expression of the C02 MYB28 homolog is assumed to be of most importance in regulating aliphatic glucosinolate biosynthesis in aerial organs [92]. Finally, BnaMYB28_5 is much lower expressed in seeds compared to BnaMYB28_2 supporting our hypothesis that the BnaMYB28 C09 homolog is the major regulator of seed GSL content. Although these transcription factors appear responsible for a huge proportion of the GSL difference between B. napus lines, we have identified additional candidate genes associated with seed GSL content such as BnaC02g41790D (homolog of AthMAM1) and BnaA09g08410D (homolog of AthAPK). In accordance with our findings, BnaA09g01260D (homolog of AthAOP3) and BnaA09g08470D (homolog of AthTGG1) were recently identified as novel candidate genes for seed GSL content [93].

Conclusions
We identified and described the molecular consequences of a 4 bp insertion located in the third exon of BnaMYB28_2 on chromosome C09 as the most likely causative variant explaining the majority of the phenotypic variance in seed GSL content. B. napus lines with a low GSL content carry a 4 bp insertion in this gene, which causes a premature stop codon, leading to a most likely non-functional MYB. BnaMYB28_2 is the only GSL transcription factor highly expressed in seeds as demonstrated in the high-GSL parent P2 and other B. napus genotypes. Moreover, we identified several new candidate genes controlling SPC and SOC. The new insight into the molecular mechanisms of SPC and SOC, as well as seed GSL content can serve as useful targets for the genetic improvement of B. napus seed quality traits.
Supplementary Materials: The following files are available online at https://doi.org/10.4119/ unibi/2963492. File S1: Data sets used for phylogenetic and genomic analyses. File S2: Schematic illustration of the workflow for the generation of the gold standard and follow-up delta allele frequency calculation. File S3: Variants of the reconstituted F1. File S4: Variants of the gold standard. File S5: Seed GSL content SNVs left after filtered for the gold standard. File S6: SPC SNVs left after filtered for the gold standard. File S7: dARCs of seed GSL content pools. File S8: dARCs of SPC pools. File S9: ZCRs of seed GSL content pools. File S10: ZCRs of SPC pools. File S11: Genome-wide delta allele frequency plots of seed GSL content pools. File S12: Genome-wide delta allele frequency plots of SPC pools. File S13: PAVs of GSL pool analysis. File S14: PAVs of SPC pool analysis. File S15: IGV screenshot of GSL MYB homologs expression and genomic coverage in the GSL pools and the parental genotypes, as well as transcriptomic coverage of Janetzkis Schlesischer (JS) leaves and seeds RNA-Seq data. File S16: CDS sequences from B. napus Lorenz. File S17: Peptide sequences from B. napus Lorenz. File S18: CDS sequences from B. napus Janetzkis Schlesischer. File S19: Peptide sequences from B. napus Janetzkis Schlesischer. File S20: Frame-corrected gff3 file for SnpEff analysis. File S21: Predicted high-impact variants of genes located within +/− 5 kbp of the borders of the genomic intervals associated with seed GSL content. File S22: Predicted high-impact variants of genes located within +/− 5 kbp of the borders of the genomic intervals associated with SPC content. File S23: SRA IDs of the used public available RNA-Seq data sets from B. napus. File S24: RNA-Seq mapping statistics. File S25: Genes located in genomic intervals associated with seed GSL content.