Association Mapping of Fertility Restorer Gene for CMS PET 1 in Sunflower

The phenomenon of cytoplasmic male sterility (CMS), consisting in the inability to produce functional pollen due to mutations in mitochondrial genome, has been described in more than 150 plant species. With the discovery of nuclear fertility restorer (Rf ) genes capable of suppressing the CMS phenotype, it became possible to use the CMS-Rf genetic systems as the basis for practical utilization of heterosis effect in various crops. Seed production of sunflower hybrids all over the world is based on the extensive use of the PET1 CMS combined with the Rf1 gene. At the same time, data on Rf1 localization, sequence, and molecular basis for the CMS PET1 type restoration of fertility remain unknown. Searching for candidate genes of the Rf1 gene has great fundamental and practical value. Therefore, in this study, association mapping of fertility restorer gene for CMS PET1 in sunflower was performed. The genome-wide association study (GWAS) results made it possible to isolate a segment 7.72 Mb in length on chromosome 13, in which 21 candidates for Rf1 fertility restorer gene were identified, including 20 pentatricopeptide repeat (PPR)family genes and one Probable aldehyde dehydrogenase gene. The results will serve as a basis for further study of the genetic nature and molecular mechanisms of pollen fertility restoration in sunflower, as well as for further intensification of sunflower breeding.


Introduction
The phenomenon of cytoplasmic male sterility, consisting in the inability to produce functional (viable) pollen due to mutations in mitochondrial genome, has been described in more than 150 plant species [1][2][3].With the discovery of nuclear Rf genes capable of suppressing the cytoplasmic male sterility (CMS) phenotype, it became possible to use the CMS-Rf genetic systems as the basis for the practical utilization of heterosis effect in various crops (maize, sunflower, rice, sorghum, sugar beet, rapeseed, and others), and also as models to study the interaction mechanisms between the nuclear and mitochondrial genomes.The use of CMS lines as the female parent eliminates manual labor during crossing, and the use of fertile paternal lines carrying the gene (genes) for the restoration of pollen fertility (Rf ) allows the production of highly fertile offspring that exhibit heterosis effect.When crossing the lines of annual cultivated sunflower (Helianthus annuus L.), the heterosis effect ranges from 28 to 40% according to different authors [4].Sunflower has more than 70 sources of CMS, but in commercial hybrids breeding, the PET1 CMS obtained by P. Leclercq from the interspecific hybrid H. petiolaris Nutt.× H. annuus is used predominantly [5].The PET1 mtDNA differs from the mtDNA of fertile forms by the presence of an inversion of 11 kb and insertion of 5 kb that leads to the appearance of a new open reading frame orfH522, which is co-transcribed with the atpA gene and encodes a 16 kDa protein [6,7].Literature describes various types of inheritance of pollen fertility restoration in sunflower with CMS PET1-type and reports on the different number of genes that determine this character.Based on hybridological analysis, various authors distinguish from one to five genes responsible for the restoration of pollen fertility in CMS PET1 [8,9].Seed production of sunflower hybrids is based on the extensive use of the Rf1 and Rf2 genes, which by interacting with each other give the effect of restoring pollen fertility.It is believed that the main gene is Rf1, which is responsible for the fertility restoration and is present in the vast majority of CMS PET1 fertility restoring lines [10].Sometimes, as a result of hybridological analysis in the second generation of hybrids, monogenic segregation occurs, in such cases it is assumed that the second gene is present in both crossed forms.
The Rf1 gene was originally assigned to the sixth linkage group [11] and was subsequently reassigned to the second linkage group [12].On the integrated genetic map of sunflower, the genetic factor Rf1 responsible for the restoration of pollen fertility is localized in the linkage group 13 [13,14].To identify the Rf1 gene alleles in the genotype, the closely linked molecular markers were developed [15], the diagnostic value of which was confirmed by several researchers [4,16].However, approximately 10% of clones from the N.I.Vavilov All-Russian Research Institute of Plant Genetic Resources (VIR) collection lacked markers, although the presence of the Rf1 gene dominant allele in their genotypes was confirmed using other methods [17].
Markers designed based on the information on the primary nucleotide sequence of the Rf genes themselves are considered to be the most effective for the selection of the functional Rf alleles carriers [18].The absence of such markers in sunflower is explained by the lack of information about the nature of the Rf gene (genes).To identify the Rf1 gene, positional cloning method was used [19], and molecular markers based on polymorphic fragments of PPR genes have also been developed [20,21].However, to this day, the sequence of the Rf1 gene remains unknown.
Most of the Rf genes described so far encode PPR proteins that contain repetitive degenerate motifs of 35 amino acid residues (Pentatricopeptide Repeats, PPR), with a few exceptions-for example, the Rf2 gene, which encodes maize aldehyde dehydrogenase [22] and the rice Rf2 gene, the product of which is a protein containing a glycine-rich domain [23].RF-PPR proteins have from 15 to 20 PPR motifs in their sequences [24].In angiosperm plants, PPR genes belong to two main types, P and PLS, which differ in the structure of PPR domains [25,26].In flowering plants, the family of PPR genes, containing up to 600 members, is involved in anterograde/retrograde regulation, which ensures the coordinated work of nuclear and organelle genomes.Genes the products of which have the function of restoring fertility are included into the RFL-PPR (Restoration of Fertility Like-PPR) subfamily.Unlike the numerous families of conservative PPR genes that regulate processing, as well as participate in the splicing, editing, stabilization and translation of organelle RNAs, RFL-PPR genes are organized into clusters and are characterized by an exceptionally high level of variability [27].It is believed that allele-specific markers of RFL-PPR genes can be used for positional cloning of fertility restorer genes, as well as for efficient selection of the carriers of functional alleles of Rf loci [18,28].In addition, an exceptionally high rate of evolution of the subfamily of RFL-PPR genes [29,30], as well as the important role of CMS and restoration of fertility in the formation of new species [31] allow us to consider them as a source of molecular markers for phylogenetic research.
Thus, the Rf1 gene is a key element in obtaining heterotic sunflower hybrids based on CMS.At the same time, data on its localization in the genome remain controversial.In addition, the gene sequence and molecular basis for the CMS PET1 type restoration of fertility remains unknown.Therefore, searching for candidate genes and mapping of the Rf1 gene may have a great fundamental and practical value.Therefore, in this study, association mapping of fertility restorer gene for CMS PET1 in sunflower was performed.

Materials and Methods
134 Helianthus annuus L. elite lines from "Agroplazma" breeding and seed production company (Krasnodar, Russia) were taken into the study (Table S1, Table S2).They included 74 restorer lines carrying dominant allele of the Rf1 gene and 60 sterility maintainer lines with the recessive allele of the gene.
Genomic DNA was extracted from the etiolated seedlings after one week of germination in the dark.100 mg of tissue for each sample was grounded using the FastPrep-96™ Automated Homogenizer (MP Biomedicals, Santa Ana, CA, USA).DNA was extracted using the NucleoSpin ® Plant II plant DNA extraction kit (Macherey-Nagel, Düren, Germany) according to the manufacturer's recommendations and stored at −20 • C. The quality of the purified DNA samples and DNA concentration were assessed by gel electrophoresis and Qubit 3.0 Fluorometer (ThermoFisher Sscientific, Waltham, MA, USA).Restriction site associated DNA sequencing (RAD) libraries were prepared using HindIII and Nlalll endonucleases as previously described [32] with minor modifications and sequenced in Illumina HiSeq4000 (San Diego, CA, USA).Raw sequence data are available on NCBI SRA under project number PRJNA515598.
Preprocessing of raw reads was performed with the aid of the Trimmomatic software (version 0.30) [33].Genome variants were called in Tassel 5 GBS v2 pipeline [34] with the following command line arguments: -kmerLength 65, -minMAPQ 20, and -mnQS 20.Bowtie2 [35] was used to map tags to the HanXRQr1.0reference genome [36] with -very-sensitive-very-sensitive preset.Principal component analysis (PCA) and linkage disequilibrium (LD) analyses were accomplished in Tassel 5 software and visualized by means of ggplot2 R library (version 3.1.0)[37].Statistical analysis using the mixed linear models (MLMs) [38] implemented in the Tassel 5 software was performed for association mapping with PCA and kinship matrixes as covariates.Multiallelic variants and those with the high missing call rates, MAF below 0.01 as well as the samples with many missing calls were filtered out in PLINK 1.9 [39,40] before genome-wide association study (GWAS) analysis.Significant loci were identified based on Bonferroni and FDR adjusted q-values with 0.01 alpha significance level.
Genome-wide association study results were visualized with the aid of the qqman R package (version 0.1.4)[41].

Genotyping and GWAS Analysis
Sequencing of RAD-libraries and subsequent analysis has identified 28,153 SNP (Single nucleotide polymorphisms) in 134 sunflower accessions.Overall transitions to transversions ratio was 1.83.PCA analysis revealed significant population structure.Restorer lines and sterility maintainers form separate groups on the scatterplot (Figure 1).GWAS analysis revealed four loci associated with the ability to suppress CMS phenotype.Single significant marker was revealed at both 8 and 17 linkage groups.Most of the markers significantly associated with the trait under study, as well as the markers with the highest p-values, were located at 10 and 13 LG (Figure 2, Figure S1, Table S3).GWAS analysis revealed four loci associated with the ability to suppress CMS phenotype.Single significant marker was revealed at both 8 and 17 linkage groups.Most of the markers significantly associated with the trait under study, as well as the markers with the highest p-values, were located at 10 and 13 LG (Figure 2, Figure S1, Table S3).GWAS analysis revealed four loci associated with the ability to suppress CMS phenotype.Single significant marker was revealed at both 8 and 17 linkage groups.Most of the markers significantly associated with the trait under study, as well as the markers with the highest p-values, were located at 10 and 13 LG (Figure 2, Figure S1, Table S3).In addition to the difference in the ability to restore pollen fertility in the crosses with sterile lines with PET1-type cytoplasm, the analyzed sunflower lines differed by the presence (restorer lines) or absence (sterility maintainer lines) of plant branching.This is due to the fact that to obtain F1 hybrids, non-branched lines with a single large apical head are most often used as female parents, and lines with a recessive type of branching, with multiple small heads located on the lateral branches, are used as male parents.This approach allows an increase in the length of the flowering period of male parents due to the difference in the flowering times of the heads on the plant, and at the same time to get F1 plants with a large single head.It is known from the literature that branching locus is localized on chromosome 10 [42,43].Therefore, the associations identified on chromosome 10 seem to be linked to this trait.
At the same time, the associations identified on chromosome 13 correspond with the data obtained in the previous studies.For instance, Yu et al. combined RFLP, RFLP-SSR, and SSR maps and obtained data for localization of Rf1 in LG13 [44].One year later Kusterer et al. [45] map Rf1 based on cosegregation with SSR markers ORS388 and ORS1030 belonging to LG 13 Tang et al. [14].Further Kusterer et al. obtained saturated map of the fertility restoration region Rf1 [13].Mapping data have confirmed the location of Rf1 on LG13 near marker ORS1030.According to Yue et al.Rf1 is in the interval between markers ORS511 and ORS799 of linkage group 13 [20].Based on this, the most likely location for the candidate Rf1 genes appears to be chromosome 13.
Within chromosome 13, based on the results of the GWAS analysis, a 7.72 Mb long section (coordinates170494693-178217103), can be distinguished, in which eight significant SNPs are located with p-values ranging from 5.69 × 10 −9 to 1.53 × 10 −18 (Table 1, Figure S2).To compare localization of 7.72 Mb region identified in this study with previously reported data we blasted PCR primer sequences of the ORS511, ORS799 and ORS1030 markers against the reference genome.ORS511 and ORS1030 were mapped in close proximity to each other on LG13 in according to Tang et al. [14].Complete sequences of ORS1030 forward and reverse primers were mapped with 100% identity twice in the genome.Forward primer mapped to the positions 169535691-169535666 and 169655088-169655063 and reverse primer to the positions 169535262-169535287 and 169654659-169654684 of LG 13.For ORS511 complete sequences of forward primer have no hits on the 100% identity threshold.Reverse primer of ORS511 was mapped at 169733686-169733704 of LG13.For the ORS799 marker complete sequences of forward and reverse primers were uniquely mapped to the genome in position 186516272-186516291 and 186516418-186516399 of LG13 respectively.

Identification of Rf1 Candidate Genes
Within identified 7.72 Mb region in the HanXRQr1.0reference genome sequence [36], 11 PPR genes are located, which are the most likely candidate genes for the fertility restorer gene Rf1.Almost all Rf genes in various plant species that have been identified so far belong to this family [46][47][48][49].
PPR genes are thought to be present in all eukaryotes, but they are most common in the genomes of terrestrial plants, where they form one of the largest gene families [50].For example, in the genome of Arabidopsis thaliana L. there are about 450 genes of this family [51,52], about 500 in the maize genome [53], and more than 600 in the genome of Oryza sativa L. [25].Proteins of this family are characterized by the presence of multiple helix-turn-helix domains, forming a supercoil with a central groove [50,52].This allows the protein to bind to RNA and participate in the RNA-protein interactions.Pentatricopeptide repeat (PPR) proteins play a significant role in regulating gene expression in the organelles at the RNA level [50,54,55].
The total number of annotated PPR genes in the sunflower genome HanXRQr1.0 is 333.Therefore, the identified region of 7.72 Mb (comprising 0.214% of the genome length) contains 3.3% of all annotated PPR genes and is rich in PPR genes.In addition, within this region, 10 genes of the TPR family are annotated.It is known that the sequences of PPR proteins are similar to the sequences of the TPR-family proteins and it is assumed that the tetratricopeptide repeat (TPR)-family genes gave rise to PPR genes at the early stages of the evolution of eukaryotes [56].
Therefore, it was decided to include the gene sequence of both the PPR and TPR families in the further analysis.It should be noted that genome sections 7.72 Mb in length, flanking the region of the chromosome 13 mentioned above, did not contain any annotated sequence of the PPR family, and only a single sequence belonging to the TPR family (HanXRQChr13g0421851).
The analysis of the translated amino acid sequences of 22 genes of the PPR and TPR families located in the identified region and its flanking regions was conducted using ScanProsite tool of ExPASy SIB Bioinformatics Resource Portal (SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland).As a result, in all 11 amino acid sequences of the PPR family and in 10 of the 11 sequences of the TPR family, Pentatricopeptide (PPR) repeats were identified.Therefore, within the 7.72 Mb region and the flanking regions, 21 genes were detected, their protein products demonstrating the primary structure characteristic of the sequences of the PPR family.Meanwhile, in addition to PPR repeats, the amino acid sequence of the protein product of one of the genes revealed a region of homology with UDP-glycosyltransferases, and therefore this gene was excluded from the list of possible candidate genes for Rf1.
In addition to PPR genes, a gene annotated as Probable aldehyde dehydrogenase 5F1 was detected in the 7.72 Mb region of chromosome 13.It was previously shown that Rf2 gene of maize is the gene encoding aldehyde dehydrogenase [57].Therefore, this gene is also a possible candidate Rf gene.The list of identified candidate genes is shown in Table 2, and their arrangement within the 7.72 Mb region is shown in Figure 3.The number of PPR repeats in the sequence and the length of the protein products of the candidate PPR family genes varied from 2 to 15 and from 110 to 756 amino acids, respectively.
Genomic regions with increased LD could be recognized as signatures of strong selection pressure on the traits encoded within these regions.The results of the analysis showed the presence of an extended section of elevated LD in 13 LG (Figure 4), of which the identified 7.72 Mb region forms part.This fact is an indirect proof of the localization of candidate genes in this region of the genome.
It should be noted that the reference genome used in the analysis was obtained by sequencing the XRQ line, which is a cytoplasmic male sterility maintainer (PET1 type) [36].At the same time, it is known that the Rf locus may undergo complex evolutionary events [46] and the structure of the The number of PPR repeats in the sequence and the length of the protein products of the candidate PPR family genes varied from 2 to 15 and from 110 to 756 amino acids, respectively.
Genomic regions with increased LD could be recognized as signatures of strong selection pressure on the traits encoded within these regions.The results of the analysis showed the presence of an extended section of elevated LD in 13 LG (Figure 4), of which the identified 7.72 Mb region forms part.This fact is an indirect proof of the localization of candidate genes in this region of the genome.

Conclusions
In this work, high-throughput genotyping of sunflower lines, differing by the ability to suppress CMS phenotype was carried out and a genome-wide association study was performed.The GWAS results made it possible to isolate a segment 7.72 Mb in length on chromosome 13, in which 21 candidate Rf1 fertility restorer genes were identified, including 20 PPR-family genes and one Probable aldehyde dehydrogenase gene.The results will serve as a basis for further study of the genetic nature and molecular mechanisms for pollen fertility restoration in sunflower, as well as for the search of selection markers.

Supplementary Materials:
The following are available online at www.mdpi.com/xxx/s1,Table S1: List of the sunflower lines selected for the study, Table S2: List of the sequenced sunflower samples, corresponding flowcell lanes and barcodes, Table S3: List of SNPs, significantly associated with the ability to suppress CMS phenotype after FDR correction, Figure S1: Quantile-quantile plot of associations with the ability to suppress CMS phenotype, Figure S2  It should be noted that the reference genome used in the analysis was obtained by sequencing the XRQ line, which is a cytoplasmic male sterility maintainer (PET1 type) [36].At the same time, it is known that the Rf locus may undergo complex evolutionary events [46] and the structure of the identified site may differ in the genome of the fertility restorer lines.Therefore, to identify the Rf1 gene, to determine the sequence of the dominant alleles of the Rf1 gene, and to understand the evolution of the sunflower Rf1 locus, the additional analysis of the structure of the 7.72 Mb region in the genome of fertility restorer lines is required.

Conclusions
In this work, high-throughput genotyping of sunflower lines, differing by the ability to suppress CMS phenotype was carried out and a genome-wide association study was performed.The GWAS results made it possible to isolate a segment 7.72 Mb in length on chromosome 13, in which 21 candidate Rf1 fertility restorer genes were identified, including 20 PPR-family genes and one Probable aldehyde dehydrogenase gene.The results will serve as a basis for further study of the genetic nature and molecular mechanisms for pollen fertility restoration in sunflower, as well as for the search of selection markers.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2073-4395/9/2/49/s1,Table S1: List of the sunflower lines selected for the study, Table S2: List of the sequenced sunflower samples, corresponding flowcell lanes and barcodes, Table S3: List of SNPs, significantly associated with the ability to suppress CMS phenotype after FDR correction, Figure S1: Quantile-quantile plot of associations with the ability to suppress CMS phenotype, Figure S2

Figure 1 .
Figure 1.Principal component analysis (PCA) plot of sunflower lines based on Restriction site associated DNA (RAD) sequencing data.Pink dots-sterility maintainers, blue dots-restorer lines.

Figure 2 .
Figure 2. Manhattan plot of associations with the ability to suppress CMS (cytoplasmic male sterility) phenotype.Red line indicates the significance threshold based on the Bonferroni multiple testing correction (alpha = 0.01).

Figure 1 .
Figure 1.Principal component analysis (PCA) plot of sunflower lines based on Restriction site associated DNA (RAD) sequencing data.Pink dots-sterility maintainers, blue dots-restorer lines.

Figure 1 .
Figure 1.Principal component analysis (PCA) plot of sunflower lines based on Restriction site associated DNA (RAD) sequencing data.Pink dots-sterility maintainers, blue dots-restorer lines.

Figure 2 .
Figure 2. Manhattan plot of associations with the ability to suppress CMS (cytoplasmic male sterility) phenotype.Red line indicates the significance threshold based on the Bonferroni multiple testing correction (alpha = 0.01).

Figure 2 .
Figure 2. Manhattan plot of associations with the ability to suppress CMS (cytoplasmic male sterility) phenotype.Red line indicates the significance threshold based on the Bonferroni multiple testing correction (alpha = 0.01).

Figure 3 .
Figure 3. Schematic localization of the candidate Rf1 genes within the 7.72 Mb region.Green arrows indicate the gene sequences of the PPR family.The direction of the arrow reflects the orientation of the sequence in the genome.Red box indicates the location of the Probable aldehyde dehydrogenase 5F1gene.

Figure 3 .
Figure 3. Schematic localization of the candidate Rf1 genes within the 7.72 Mb region.Green arrows indicate the gene sequences of the PPR family.The direction of the arrow reflects the orientation of the sequence in the genome.Red box indicates the location of the Probable aldehyde dehydrogenase 5F1gene.

Figure 4 .
Figure 4. Pairwise Linkage Disequilibrium (LD) Plot of the LG13.Individual data points reflect squared allele frequency correlations (R2) for all possible pairs of polymorphic SNP markers of LG13.The x-and y-axes correspond to the coordinates within 13 LG.Location of 7.72 Mb indicated by curly bracket.

Figure 4 .
Figure 4. Pairwise Linkage Disequilibrium (LD) Plot of the LG13.Individual data points reflect squared allele frequency correlations (R2) for all possible pairs of polymorphic SNP markers of LG13.The xand y-axes correspond to the coordinates within 13 LG.Location of 7.72 Mb indicated by curly bracket.

Table 1 .
List of Single nucleotide polymorphisms at linkage group 13, significantly associated with the ability to suppress CMS phenotype after Bonferroni correction.

Table 2 .
The list of candidate Rf1 genes identified within the 7.72 Mb region.