Genomic Variation in Korean japonica Rice Varieties

Next-generation sequencing technologies have enabled the discovery of numerous sequence variations among closely related crop varieties. We analyzed genome resequencing data from 24 Korean temperate japonica rice varieties and discovered 954,233 sequence variations, including 791,121 single nucleotide polymorphisms (SNPs) and 163,112 insertions/deletions (InDels). On average, there was one variant per 391 base-pairs (bp), a variant density of 2.6 per 1 kbp. Of the InDels, 10,860 were longer than 20 bp, which enabled conversion to markers resolvable on an agarose gel. The effect of each variant on gene function was predicted using the SnpEff program. The variants were categorized into four groups according to their impact: high, moderate, low, and modifier. These groups contained 3524 (0.4%), 27,656 (2.9%), 24,875 (2.6%), and 898,178 (94.1%) variants, respectively. To test the accuracy of these data, eight InDels from a pre-harvest sprouting resistance QTL (qPHS11) target region, four highly polymorphic InDels, and four functional sequence variations in known agronomically important genes were selected and successfully developed into markers. These results will be useful to develop markers for marker-assisted selection, to select candidate genes in map-based cloning, and to produce efficient high-throughput genome-wide genotyping systems for Korean temperate japonica rice varieties.


Introduction
Rice is the world's second most important cereal crop, following only maize (Zea mays). Worldwide, nearly 504 million metric tons of milled rice was produced from about 162 million hectares of paddy fields in 2019 (http://www.fao.org/faostat, accessed on 10 August 2021). Rice (Oryza sativa) can be classified into two main subgroups: indica and japonica. Indica genotypes are grown in tropical regions, whereas japonica varieties are grown in tropical or temperate regions. Generally, the genetic diversity of japonica varieties is lower than that of indica varieties [1]. Korean japonica rice varieties belong to the temperate japonica group and thus have a low level of genetic diversity. They exhibit low levels of polymorphism with traditional molecular markers, including restriction fragment length polymorphisms (RFLPs) and simple sequence repeats (SSRs), and this has hindered gene mapping and marker-assisted selection. Korean japonica rice varieties, however, show wide phenotypic variation in many important traits, including flowering time, plant architecture, disease and pest resistance, seed size, grain quality, pre-harvest sprouting resistance, and resistance to abiotic stress. Mapping and identification of the genes responsible for this variation and the development of selective markers are therefore required to facilitate molecular breeding.
Next-generation sequencing (NGS) technologies have revealed numerous sequence variations in closely related varieties of temperate japonica rice, and have enabled the development of a sufficient number of polymorphic markers to allow genotyping of populations performed using an Illumina HiSeq2000 instrument (Illumina, San Diego, USA) with a paired-end library. Raw sequencing data from the remaining 13 japonica varieties were reported previously [16].
To develop InDel markers in the qPHS11 region (22.0-25.0 Mbp on chromosome 11), nine InDels in this region longer than 20 bp were selected, and primers were designed based on their flanking sequences using the CLC Genomics Workbench (v6.0.1) program (http://www.qiagen.com, 25 August 2021). To develop highly polymorphic InDel markers, four InDels with PIC values greater than 0.4 and without missing data were selected, and primers were designed based on their flanking sequences. In order to find sequence variations in the well-known agronomically important genes, the list of "Agronomically important genes" in RAP-DB (https://rapdb.dna.affrc.go.jp, 25 August 2021) was used. Among the found genes, four genes including Hd1, Hd6, GS3, and SD1 were selected, and the primers were designed based on the flanking sequences of functional sequence variations in those genes. The primer sequences of the markers are shown in Supplementary Table S1.
Phylogenetic analysis of the 24 varieties was conducted using the SNPhylo [30] and MEGA X programs [31]. Population structure for varieties was determined using the STRUCTURE (version 2.3.4) [32,33] program, varying the number of clusters (K) from one to fifteen, with five replications. The models, following admixture and correlated allele frequency with a 5000 burnin length and a run length of 50,000, were used for conducting model-based structure analysis. Output of STRUCTURE analysis was collected using the STRUCTURE harvester [34], and the most probable K value was determined based on the LnP(D) and Evanno's ∆K [35].

Detection of Variations in Genome Sequences
We analyzed the genome resequencing data of 24 Korean temperate japonica rice varieties (Cheongho, Dami, Dongan, Dongjin, Giho, Haechanmulgyeol, Hiami, Hwacheong, Hwayeong, Ilpum, Jinbu43, Jopyeong, Joun, Junam, Nampyeong, Odae, Saeilmi, Saenuri, Samgwang, Seogan, Seomyeong, Sindongjin, Sobi, and Unbong40). The quantity of raw genome sequence data from the different varieties ranged from 14 The distributions of sequence variations per 100 kbp interval and nucleotide diversity within a 100 kbp window over the 12 rice chromosomes are shown in Figure 1. Most intervals contained SNPs, although their density was uneven across each chromosome. Chromosomes 6, 8, 10, 11, and 12 had the widest ranges with variation density as high as 100-1000 per 100 kbp; the nucleotide diversity within a 100 kbp window was especially high over large regions of chromosomes 6, 8, and 11. By contrast, variation density and nucleotide diversity were mostly low on chromosome 5.  The distribution of InDel sizes is shown in Figure 2. InDel size ranged from 1 to 234 bp, although 1 bp InDels occurred most frequently (75,490 InDels). Further information about each InDel, including genotypes of varieties and annotation, is provided in Supplementary Table S3. InDels longer than 20 bp can be converted to markers resolvable on agarose gels, which enables their practical use in ordinary laboratories. We identified 10,860 InDels longer than 20 bp; their full details are provided in Supplementary Table S4.
The distribution of InDel sizes is shown in Figure 2. InDel size ranged from 1 to 234 bp, although 1 bp InDels occurred most frequently (75,490 InDels). Further information about each InDel, including genotypes of varieties and annotation, is provided in Supplementary Table S3. InDels longer than 20 bp can be converted to markers resolvable on agarose gels, which enables their practical use in ordinary laboratories. We identified 10,860 InDels longer than 20 bp; their full details are provided in Supplementary Table S4. To test the usefulness of the InDel data, we designed nine InDel markers in the region of qPHS11, a major QTL for pre-harvest sprouting resistance found in the recombinant inbred line (RIL) population derived from a cross between Odae and Unbong40 [19]. An analysis showed that eight markers revealed polymorphisms between the parental varieties, Odae and Unbong40, and one marker failed to amplify by PCR (Figure 3a). We also designed four highly polymorphic InDel markers with PIC values greater than 0.4. All of these revealed polymorphisms between the 24 varieties, as expected ( Figure 3b). These results confirmed that the InDels identified in this study enabled the development of accurate and useful markers. To test the usefulness of the InDel data, we designed nine InDel markers in the region of qPHS11, a major QTL for pre-harvest sprouting resistance found in the recombinant inbred line (RIL) population derived from a cross between Odae and Unbong40 [19]. An analysis showed that eight markers revealed polymorphisms between the parental varieties, Odae and Unbong40, and one marker failed to amplify by PCR (Figure 3a). We also designed four highly polymorphic InDel markers with PIC values greater than 0.4. All of these revealed polymorphisms between the 24 varieties, as expected ( Figure 3b). These results confirmed that the InDels identified in this study enabled the development of accurate and useful markers.

Prediction of the Effects of Variation on Gene Function
The effects of the sequence variations on gene function were predicted using the SnpEff program [25]. The impacts of the effects were categorized into four groups: high, moderate, low, and modifier. These groups contained 3524 (0.4%), 27,656 (2.9%), 24,875 (2.6%), and 898,178 (94.1%) variants, respectively (Table 2). Frameshift mutations were the most common variants in the high-impact group (2518), whereas missense mutations were the most common variants in the moderate-impact group (25,436) (Table 3). Synonymous mutations were the most common variants (19,629) in the low-impact group, whereas variations in upstream gene sequences were the most common variants (361,453) in the modifier group (Table 3). Additional information about variants with high and moderate impact is provided in Supplementary Table S5. The variation identified in this study will be very useful for selecting candidate genes in specific target regions during map-based gene cloning with populations derived from crosses between Korean temperate japonica varieties.     In order to identify sequence variations in the well-known agronomically important genes, we referred to the genes in the list of "Agronomically important genes" in RAP-DB (https://rapdb.dna.affrc.go.jp, 25 August 2021), which included 73 genes. We found sequence variations in 18 genes among them (Table 4, Supplementary Table S6). Especially, the sequence variations in SD1, GS3, HD6, HD3B, HD1, Hd18, Pia, Pb1, and Ptr genes were identical with those that have been reported to be functional variations. Based on these results, we designed four functional markers for Hd1, Hd6, GS3, and SD1 genes. A 43 bp deletion causing a frameshift in the first exon was found in Hd1 [36,37], and was used to design a marker for Hd1. The genome resequencing data analysis showed that Joun and Unbong40, which are early-maturing varieties, contained the deletion genotype, while other varieties contained the reference genotype; this finding was confirmed by an analysis using the marker for Hd1 (Figure 3c). An SNP causing the loss of the stop codon was found in Hd6 and used to design a marker. HD6 encodes the α-subunit of a protein kinase, CASEIN KINASE II (CK2); the Nipponbare allele contains a premature stop codon, whereas the allele found in Kasalath, an indica variety, does not [38]. The genome resequencing data analysis showed that only Odae contained the Kasalath allele, while all the other tested varieties had the Nipponbare allele. An analysis with the Hd6 marker confirmed this result (Figure 3c). GS3 regulates grain length [39]. An SNP causing a premature stop codon was found in GS3 and was used to design a marker. The genome resequencing data analysis showed that Dami, Sindongjin, and Sobi, which are largegrained varieties, contained the premature stop codon, while the other varieties possessed the reference genotype; this finding was confirmed by an analysis using the GS3 marker ( Figure 3c). Mutations in SD1 reduce culm length [40]. An SNP causing an amino acid change was found in this gene and used to design a marker. The genome resequencing data analysis showed that Ilpum, Jopyeong, Odae, Junam, and Seogan contained the variant sd1 genotype, but the other varieties contained the reference genotype. An analysis with the marker for the sd1 variant confirmed this finding (Figure 3c).

Structure and Phylogenetic Analysis
The SNPhylo program extracts SNP data which meet the criteria of ≥ MAF (Minor Allele Frequency) and ≤ missing rate threshold and are in approximate linkage equilibrium with each other from large SNP datasets produced by resequencing [30]. By using this program, 1758 representative SNPs were extracted with criteria of MAF higher than 0.1, missing rate lower than 0.1, and LD (Linkage Disequilibrium) threshold of 0.5 from all SNPs detected in 24 Korean japonica rice varieties. These SNPs were used in the following population structure and phylogeny analysis.

Discussion
In this study, we identified 954,233 sequence variants, 791,121 SNPs, and 163,112 InDels. We found 50,555 new SNPs, in addition to the 740,566 SNPs reported in our previous study [16], and performed a novel analysis of InDels. This result reveals ample sequence variation in Korean japonica rice varieties and may explain the wide phenotypic variation observed in many important traits, including flowering time, plant architecture,

Discussion
In this study, we identified 954,233 sequence variants, 791,121 SNPs, and 163,112 In-Dels. We found 50,555 new SNPs, in addition to the 740,566 SNPs reported in our previous study [16], and performed a novel analysis of InDels. This result reveals ample sequence variation in Korean japonica rice varieties and may explain the wide phenotypic variation observed in many important traits, including flowering time, plant architecture, disease and pest resistance, seed size, grain quality, pre-harvest sprouting resistance, and resistance to abiotic stress. Using the SnpEff program, we predicted the effect of each variant on gene function: 3524 variants were predicted to have high-impact effects as they involved frameshift mutations, the gain or loss of a stop codon, or changes at splice donor or acceptor sites. These types of variants are extremely likely to be associated with variation in phenotypic traits. Moreover, 27,656 variants were missense mutations, in-frame insertion/deletions, or 5_prime_UTR_truncation and exon_loss_variants, which are predicted to have moderate effects on function and are thus also likely to be related to phenotypic variation. In addition, we cannot exclude the possibility that the remaining variants, predicted to have a low or modifier impact on function, are related to variation in phenotypic traits. Our analysis provided the position, genotypes of tested varieties, and full annotation of each variant, including its predicted effect on function. These data will be very helpful for future map-based cloning of genes underlying important traits in populations derived from crosses between Korean japonica varieties. In particular, within a target region associated with a trait, candidate genes can be identified based on the presence of variants that have a high or moderate impact on gene function.
The need for high-throughput genome-wide genotyping systems using arrays is increasing rapidly as marker-assisted selection and genomic selection become more popular techniques for plant breeding [54,55]. Highly polymorphic SNPs, SNPs affecting the function of known genes controlling agronomical traits, and SNPs located within a target gene interval are commonly used to develop high-throughput genome-wide genotyping systems [10,11,14,15]. Using the information produced by this study, suitable SNPs can be easily selected to develop genotyping systems for Korean japonica rice varieties. This is the goal of our future research.
The development of markers resolvable on agarose gels is important for genetic research and breeding. InDels longer than 20 bp are easily visualized on agarose gels and can thus be used in ordinary laboratories without the need for expensive equipment. Shen et al. [56] identified InDels between the rice varieties Nipponbare and 9311, and used InDels of 25-50 bp to construct 108 InDel markers. A further 346 markers based on InDels of 30-55 bp were developed following a comparison of the sequences of two indica and one japonica rice reference genomes [57]. InDels longer than 20 bp were detected by positional multiple sequence alignments between wild rice species and four cultivated rice varieties, and enabled the development of 541 genome-wide markers that discriminated between alleles from cultivated rice and seven AA-genome wild rice species [58]. We identified 10,860 InDels longer than 20 bp (Supplementary Table S4), and used this information to develop and successfully use eight InDel markers in the qPHS11 target region, as well as four other highly polymorphic InDel markers (Figure 3a,b). These results show that the InDel data generated by this study will be very useful for developing markers for finemapping and marker-assisted selection, as well as for the construction of a genome-wide InDel marker set for Korean japonica rice varieties.
Interestingly, a large difference in the numbers of sequence variations among chromosomes was observed. Chromosome 11 contained the highest number of variants (202,697), and chromosome 5 the lowest number of variants (20,602). A similar result has been reported by resequencing a Japanese temperate japonica rice variety, Koshihikari [2]. In comparison between Koshihikari and Nipponbare, which is the reference genome, chromosome 11 showed the highest number of SNPs (12,216), and chromosome 5 the second lowest number of SNPs (1032). Moreover, in resequencing 10 Japanese temperate japonica rice varieties released in Hokkaido, chromosome 11 contained the highest number of SNPs (10,779), and chromosome 5 the lowest number of SNPs (1563-2834) in comparison with Nipponbare [6]. In the study by Arai-Kichise et al. [5], six Japanese temperate japonica rice varieties were resequenced, and chromosome 11 contained the second highest number of SNPs (12,182), while chromosome 5 had the lowest number of SNPs (1184-6119) in comparison with Nipponbare [5]. Such a large difference seems to be seen only in temperate japonica rice varieties. The Moroberekan, a tropical japonica rice variety, did not show a large difference in SNP numbers: 61,350 on chromosome 5 and 61,169 on chromosome 11 [5]. The differences in the numbers of SNPs between chromosome 5 and chromosome 11 were much smaller in two Korean Tongil-type indica varieties: 72,242 and 87,759 on chromosome 5 vs. 121,783 and 126,752 on chromosome 11 [59]. The cause of the large differences in the numbers of sequence variations among chromosomes in temperate japonica rice varieties is unclear and needs further research.
Overall, the genomic variation found in this study will facilitate the development of markers for mapping important genes and for marker-assisted selection, together with the development of a high-throughput genome-wide genotyping system for Korean japonica rice varieties.

Conclusions
Through analyzing genome resequencing data from 24 Korean temperate japonica rice varieties, we discovered 954,233 sequence variations, including 791,121 SNPs and 163,112 InDels. The effect of each variant on gene function was predicted using the SnpEff program and was included in annotation. Eight InDels from a pre-harvest sprouting resistance QTL (qPHS11) target region, four highly polymorphic InDels, and four functional sequence variations in well-known agronomically important genes were selected and successfully developed into markers. These results will be useful to develop markers for marker-assisted selection, to select candidate genes in map-based cloning, and to produce efficient high-throughput genome-wide genotyping systems for Korean temperate japonica rice varieties.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/genes12111749/s1, Figure S1: Estimation of population using LnP(D) derived ∆K for K from 1 to 15, Table S1: Primer sequences of markers developed in this study, Table S2: Summary of genome resequencing data, Table S3: Information about InDels detected in this study, Table S4: Information about InDels longer than 20 bp detected in this study, Table S5: Information about variants with high or moderate effects on gene function, Table S6: Information about variants with high or moderate effects on gene function in well-known agronomically important genes.