Genetically Distinct Rice Lines for Specific Characters as Revealed by Gene-Associated Average Pairwise Dissimilarity

Fu, Yong-Bi

doi:10.3390/crops4040044

Plant Gene Resources of Canada, Saskatoon Research and Development Centre, Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK S7N 0X2, Canada

Crops 2024, 4(4), 636-650; https://doi.org/10.3390/crops4040044

Submission received: 12 October 2024 / Revised: 6 November 2024 / Accepted: 18 November 2024 / Published: 28 November 2024

Abstract

Broadening the genetic base of an elite breeding gene pool is one important goal in a successful long-term plant breeding program. This goal is largely achieved through the search for and introgression of exotic germplasm with adaptive traits. However, little is known about the genetic backgrounds of acquired exotic germplasm, as germplasm selection is mainly based on trait information. Here, we expanded an average pairwise dissimilarity (APD) analysis to samples with SNP genotypes associated with genes for specific characters of breeding interest. Specifically, we explored a gene-associated APD analysis in a genomic characterization of 2643 rice lines based on their published FASTQ data. Published contigs for cloned genes conditioning heat tolerance, cold tolerance, fertility, and seed size were downloaded as gene reference sequences for SNP calling, alongside SNP calls based on the rice reference genome and published indels. In total, eight SNP or indel data sets were formed for each of three sample groups (All2643, Indica1789, and Japonica854). APD estimation was made for each of the 24 data sets. For each sample group, four novel sets of the 25 most genetically distinct rice lines, each for an assayed character, were identified. Further analyses of APD estimates also revealed some interesting APD properties. Four contig-based SNP data sets for four specific characters displayed similar APD frequency distributions and high positive correlations of APD estimates. Contig-based APD estimates were negatively correlated with genome-based APD estimates and nearly uncorrelated with indel-based APD estimates. These findings are significant for plant germplasm characterization and germplasm utilization in plant breeding.

Keywords: rice; gene-associated SNP; genomic SNP; indel; germplasm characterization

1. Introduction

A successful plant breeding program is highly dependent on the availability of genetically diverse germplasm in an elite breeding gene pool that is developed for effective genetic improvements of specific characters, such as yield, growth, disease resistance, and abiotic resistance [1,2]. Thus, broadening the genetic base of an elite breeding gene pool is one important goal in plant breeding [3]. Many approaches have been applied to widen the elite breeding gene pool [4], including the search for exotic germplasm such as landraces or crop wild relatives with adaptive traits such as disease resistance for introgression (e.g., [5]) and pre-breeding to explore adapted germplasm from genebanks (e.g., [6,7]). These breeding efforts can expand the elite breeding gene pool with the acquisition of valuable exotic germplasm [8,9], particularly through genomic selection based on the traits of agronomic interest [10]. However, the exotic germplasm selection is based more on the information of traits valued for agricultural adaptation but less on the germplasm genetic backgrounds. Thus, the genetic base of those selected exotic germplasm is uncertain [6], and little is known about the extent of improvement in genetic base from exotic germplasm addition for the elite breeding gene pool [11].

Marker-based average pairwise dissimilarity (APD) is a measure of genetic differences among plant samples that can allow for a relative assessment of genetic distinctness and genetic redundancy among assayed samples. This method was first developed in 2006 to facilitate plant germplasm characterization and utilization [12]. The APD estimation is based on a set of marker genotype data, calculates the pairwise genotypic dissimilarity following the simple matching coefficient of Sokal and Michener [13], and averages all the pairwise dissimilarity values of a sample against the remaining assayed samples. A sample with a higher APD value is meant to be more genetically distinct than the other samples with lower APD values. The method has been well cited in the scientific literature, but interestingly, it has not been employed as widely as expected to assess genetic distinctness and redundancy [14].
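The verbal definition above can be written compactly. The notation below is introduced here for illustration and is not taken from the cited papers: n is the number of assayed samples, L is the number of marker loci compared, and m_ij is the number of loci at which samples i and j carry the same genotype.

```latex
\[
S_{ij} = \frac{m_{ij}}{L}, \qquad
D_{ij} = 1 - S_{ij}, \qquad
\mathrm{APD}_i = \frac{1}{n-1} \sum_{j \neq i} D_{ij}
\]
```

Under this reading, a sample whose genotypes differ from most other samples at many loci accumulates large D_ij values and therefore a high APD.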

The genomic characterization of plant germplasm conserved in genebanks has become more feasible than before (e.g., see [15,16,17,18]) because of the technical advances in genomics. Many published genomic SNP data sets on conserved plant germplasm are available (e.g., [16,19]). Recently, a research effort was made following the APD method to assess the genetic distinctness and redundancy of the previously characterized germplasm accessions conserved in five international genebanks [20]. Based on 12 published genomic data sets on germplasm collections of size ranging from 661 to 55,879 accessions with up to 2.4 million SNPs, an APD value was generated for each accession in each data set. The effort not only helped to identify many sets of genetically distinct and redundant germplasm for the five genebanks but also revealed that the APD estimation was more sensitive to the number of SNPs, minor allele frequency, and missing data and less so to the sample size. An effective APD estimation required 5000 to 10,000 genome-wide SNPs. These findings are encouraging for the genetic characterization and categorization of plant germplasm for better germplasm management and utilization.

This study was conducted to expand the original APD analysis [12] to samples with SNP genotypes associated with genes for specific traits of breeding interest, as such APD estimates should theoretically carry useful information on genetic backgrounds at the functional regions of a genome conditioning the traits. Specifically, we explored a gene-associated APD (gaAPD) analysis in a genomic characterization of 2643 rice lines based on their published rice FASTQ data [21]. The gaAPD exploration carried three specific objectives: (1) generate gaAPD estimates for the rice samples based on cloned genes conditioning specific characters of rice heat tolerance, cold tolerance, fertility, and seed size; (2) assess gaAPD properties through comparative analyses to genome-based and indel-based APD estimations; and (3) identify a set of the most genetically distinct rice lines for each specific assayed character. We hope that this gaAPD exploration will demonstrate its usefulness as a supplemental tool for identifying genetically distinct germplasm to enhance the search for exotic germplasm for the widening of elite breeding gene pools.

2. Materials and Methods

2.1. Acquisition of Published Rice Genomic Data

We downloaded the published original FASTQ files of trimmed, filtered sequence reads for 2643 rice accessions [21] from the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/view/PRJEB6180; accessed on 23 September 2024). These accessions included 1789 and 854 samples representing the indica and japonica groups, respectively, and were extracted from the related inventory file of the 3K rice genome project [21]. To reduce the computational requirements of this study, the FASTQ files of only one randomly selected run per accession were used for genomic and APD analyses. The IRGSP-1.0 genome reference and annotation files were also downloaded from the Rice Annotation Project Database (https://rapdb.dna.affrc.go.jp/download/irgsp1.html; accessed on 23 September 2024). We also acquired contig FASTA files published for cloned genes associated with gene ontologies for rice heat tolerance, cold tolerance, rice fertility, and seed size from the China Rice Data Centre (https://www.ricedata.cn/; accessed on 23 September 2024). Specifically, there were 56 contig FASTA files selected for 60 (out of 158) genes for heat tolerance (TO:0000259) (Table S1); 79 contig FASTA files for 83 (out of 184) genes for cold tolerance (TO:0000303) (Table S2); 23 contig FASTA files for 24 (out of 56) genes for rice fertility (TO:0000420) (Table S3); and 44 contig FASTA files for 48 (out of 111) genes for seed size (TO:0000391) (Table S4). These selected contig FASTA files have sequence lengths of 202K base pairs or shorter and were published from many gene cloning efforts, as documented in the China Rice Data Centre. Note that these four characters were arbitrarily selected to represent traits of potential breeding interest, such as abiotic stress, growth, and yield. For a comparative analysis, the 3K RG 2.3mio biallelic indel data set in the three PLINK file formats (bed, bim, and fam) was downloaded from the Rice SNP-Seek Database (https://snp-seek.irri.org/download.zul; accessed on 26 September 2024).

2.2. Data Processing

Efforts were made to process the acquired genomic data and generate five specific SNP data sets for the whole genome and for genes associated with heat tolerance, cold tolerance, fertility, and seed size for ease of comparative APD analyses. Each SNP data set was generated using several procedures. First, SNP calling from short-read sequence data requires a reference sequence. For the whole genome analysis, the IRGSP-1.0 genome reference was used. For the other four analyses, the reference sequence was generated from assembling selected contig FASTA files and checking ambiguous base calls. For example, the reference sequence for cold tolerance consisted of 79 contig FASTA files (combined without considering the order and length of the selected contigs), followed by base call checks. Second, BAM files were created using BWA (0.7.17-r1188; [22]) and Samtools (v 1.6; [23]), followed by sorting BAM files with Samtools. Third, SNP calling from sorted.bam files was performed using Bcftools (v 1.9; [24]) with the option of excluding indels to generate a SNP genotype VCF file. Fourth, the VCF file was checked using Vcftools (v 0.1.15; [25]) on missing values and minor allele frequency (MAF). For this study, no SNPs with missing values were allowed for contig-based data sets, while up to 20% missing values were permitted for genome-based data sets to ensure comparable numbers of SNPs for comparative APD analyses. The MAF was set to be greater than 0.01 for all data sets and further restricted to be smaller than 0.495 for contig-based SNPs, which carried invariant homozygotes and heterozygotes. The SNPs in the whole genome VCF file were further divided and extracted based on genic and non-genic (or presumably neutral) regions of SNPs. Thus, two extra VCF files were generated for genome genic SNPs and neutral SNPs, respectively. The genic SNPs were obtained with multiple steps: (1) IRGSP-1.0_representative_transcript_exon_2024-07-12.gtf was converted into a bed file using convert2bed (v 2.4.36; [26]); (2) the resulting bed file was used with Bcftools intersect function to generate the overlapped SNPs of the whole genome; and (3) further separation of genic SNPs from the overlapped whole genome SNPs was conducted in Microsoft Excel, along with the extraction of non-genic SNPs. Fifth, as APD estimation was performed using the SNPRelate Bioconductor R package [27], which is capable of handling a large genomic SNP data set, each VCF file was converted into GDS format using SNPRelate functions. Extra effort was also made to convert the downloaded indel data in PLINK formats into GDS format using SNPRelate functions. These efforts generated eight data sets for comparative APD analyses: four gene-associated SNP (gaSNP) data sets (gaSNPs for heat tolerance, gaSNPs for cold tolerance, gaSNPs for fertility, gaSNPs for seed size), three genome-wide SNP data sets (genomeSNPs, neutralSNPs, genicSNPs), and one indel data set (indels). As SNP miscalls can occur from short-read genome sequence data, particularly for gaSNPs, allelic frequencies for each SNP or indel data set were evaluated using a custom R script, and the final SNP or indel data sets were revised accordingly.
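The missing-value and MAF thresholds described above can be illustrated with a minimal Python sketch. It is not the Vcftools or custom R workflow used in the study. The genotype coding (0, 1, 2 as the alternate-allele count, None for missing), the function names, and the toy SNPs are all assumptions made for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded 0/1/2 (alternate-allele count); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_snp(genotypes, max_missing, maf_min=0.01, maf_max=None):
    """Apply a missing-rate ceiling and MAF bounds (both bounds exclusive)."""
    missing_rate = sum(g is None for g in genotypes) / len(genotypes)
    if missing_rate > max_missing:
        return False
    maf = minor_allele_frequency(genotypes)
    if maf is None or maf <= maf_min:
        return False
    if maf_max is not None and maf >= maf_max:
        return False
    return True


# Hypothetical SNPs scored on ten samples.
snps = {
    "snp_variable": [0, 0, 0, 2, 2, 0, 0, 2, 0, 0],
    "snp_all_het": [1] * 10,   # invariant heterozygote, MAF = 0.5
    "snp_missing": [0, None, 2, 0, 0, 0, 2, 0, 0, 0],
    "snp_invariant": [0] * 10,  # invariant homozygote, MAF = 0
}

# Contig-based rule: no missing values, 0.01 < MAF < 0.495.
contig_kept = [name for name, g in snps.items()
               if keep_snp(g, max_missing=0.0, maf_max=0.495)]

# Genome-based rule: up to 20% missing values, MAF > 0.01.
genome_kept = [name for name, g in snps.items()
               if keep_snp(g, max_missing=0.2)]
```

In this toy case the upper MAF bound drops the SNP at which every sample is heterozygous, which is the kind of call the text attributes to contig-based data.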

2.3. APD Analysis

For each SNP or indel data set, APD and its standard deviation were obtained for each of the 2643 samples using a custom APD.r script published in Fu [20] in an R v. 4.1.2 environment [28]. The R script was specifically written following the method of Fu [12] and incorporating the SNPRelate package to address a large genomic data set. Briefly, the R script considers a typical marker-based characterization of self-fertile plant germplasm with n samples that are assayed at many SNP loci. A given sample can form n-1 pairs with the remaining assayed samples. For each of such pairs, the genotypic similarity (S) can be calculated based on the SNP genotypes following the simple matching coefficient of Sokal and Michener [13], and the pairwise dissimilarity is 1-S. The APD for the given sample can be obtained by averaging all n-1 pairwise dissimilarity values. The higher the APD value obtained for the given sample, the more genetically distinct the sample is among the assayed samples.
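The calculation described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the published APD.r script. The skipping of loci with missing genotypes, the use of the population standard deviation, and the toy genotypes are assumptions made here.

```python
from statistics import mean, pstdev


def pairwise_dissimilarity(g1, g2):
    """Return 1 - S, where S is the share of loci with identical genotypes.

    Loci where either genotype is None (missing) are skipped; this
    missing-data rule is an assumption, not taken from APD.r.
    """
    scored = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not scored:
        raise ValueError("no loci scored in both samples")
    matches = sum(1 for a, b in scored if a == b)
    return 1.0 - matches / len(scored)


def apd(genotypes):
    """Return {sample: (APD, SD)} from {sample: [genotype per locus]}."""
    names = list(genotypes)
    result = {}
    for i in names:
        # n - 1 dissimilarities of sample i against all other samples.
        d = [pairwise_dissimilarity(genotypes[i], genotypes[j])
             for j in names if j != i]
        result[i] = (mean(d), pstdev(d))
    return result


# Hypothetical homozygous genotypes (0 or 2) at four loci.
toy = {
    "line_A": [0, 0, 2, 2],
    "line_B": [0, 0, 2, 0],
    "line_C": [0, 0, 2, 2],
    "line_D": [2, 2, 0, 0],
}
estimates = apd(toy)
most_distinct_line = max(estimates, key=lambda name: estimates[name][0])
```

Here line_D differs from the other three lines at most loci and therefore receives the highest APD. The double loop is quadratic in the number of samples, which is one reason a package designed for large genomic data sets is used for thousands of accessions.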

The APD.r script was applied to each SNP or indel data set described above on Agriculture and Agri-Food Canada’s Biocluster high-performance computing platform, and the computation lasted up to several hours, particularly with the large indel data set. For each data set, two extra APD analyses were made separately for the O. sativa indica and japonica groups of 1789 and 854 samples, respectively, because of their unique genetic features in rice germplasm. This was performed by running the APD.r script modified to take a separate sample identification set for each sample group rather than the whole set of 2643 samples. These efforts generated 24 sets of APD estimates: 8 SNP or indel data sets (gaSNPs for heat tolerance, gaSNPs for cold tolerance, gaSNPs for fertility, gaSNPs for seed size, genomeSNPs, neutralSNPs, genicSNPs, and indels) × 3 sample groups (All2643, Indica1789, and Japonica854).
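Restricting the estimation to one sample group can be sketched as below. The function signature, the sample names, and the genotypes are hypothetical, and the sketch assumes complete genotype data. It is meant only to show that the same genotypes yield different APD values when the set of compared samples changes.

```python
def apd(genotypes, sample_ids=None):
    """APD per sample, computed only among the samples in sample_ids."""
    ids = list(genotypes) if sample_ids is None else list(sample_ids)
    out = {}
    for i in ids:
        total = 0.0
        for j in ids:
            if j == i:
                continue
            gi, gj = genotypes[i], genotypes[j]
            mismatches = sum(a != b for a, b in zip(gi, gj))
            total += mismatches / len(gi)
        out[i] = total / (len(ids) - 1)
    return out


# Hypothetical genotypes for two "indica" and three "japonica" lines.
toy = {
    "ind_1": [0, 0, 0, 0],
    "ind_2": [0, 0, 0, 2],
    "jap_1": [2, 2, 2, 2],
    "jap_2": [2, 2, 2, 0],
    "jap_3": [2, 2, 0, 0],
}
groups = {
    "All": list(toy),
    "Indica": ["ind_1", "ind_2"],
    "Japonica": ["jap_1", "jap_2", "jap_3"],
}
apd_by_group = {name: apd(toy, ids) for name, ids in groups.items()}
```

In this toy data, jap_1 has an APD of 0.625 among all five lines but 0.375 within the japonica group alone, so rankings obtained from different sample groups need not agree.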

The acquired APD estimates in each data set were further analyzed for their variations with basic statistics and frequency distributions using custom R scripts. An APD correlation analysis was performed using a custom R script among eight data sets in each sample group. To facilitate rice germplasm utilization and management, we identified and presented the most genetically distinct set of 25 rice samples based on the highest APD values for each gaSNP data set in each sample group. The acquired APD estimates for these 24 data sets were compiled and listed in supplemental materials to enhance rice germplasm management and utilization.
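Selecting the most distinct lines amounts to ranking samples by APD and keeping the top entries. The sketch below uses hypothetical sample names and values; the alphabetical tie-breaking rule is an assumption added so that the output is deterministic.

```python
def most_distinct(apd_estimates, k=25):
    """Return the k (sample, APD) pairs with the highest APD values.

    Ties are broken alphabetically by sample name; this tie rule is an
    assumption made here, not a rule stated in the study.
    """
    ranked = sorted(apd_estimates.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical APD estimates for five lines.
toy_apd = {
    "line_A": 0.31,
    "line_B": 0.42,
    "line_C": 0.28,
    "line_D": 0.42,
    "line_E": 0.35,
}
top_three = most_distinct(toy_apd, k=3)
```

With k left at its default of 25, the same function would return a set of the size used in this study.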

3. Results

3.1. Variability in Identified SNPs and Indels

The SNP calling identified variable numbers of SNPs among eight SNP or indel data sets for all 2643 samples (Table 1). The total number of SNPs per data set was dependent on the total length of selected contigs or 12 chromosomes. For example, gaSNPs for fertility with 23 contigs had 1,200,284 SNPs, while gaSNPs for cold tolerance with 79 contigs had 3,685,200 SNPs. Based on the genome reference, there were 74,136,931 SNPs identified for 2643 samples. The published indel data set had 2,354,934 indels across 12 chromosomes. However, there were substantial amounts of missing values for those identified SNPs across 2643 samples, particularly for those from the whole genome (Table 1). Also, up to 700 SNP genotypes were found to be invariant homozygotes and/or heterozygotes across the assayed samples in the four gaSNP data sets and thus were removed from further analyses by restricting the MAF to 0.495 or smaller. To make the analyses comparable, the four gaSNP data sets had no missing SNP values and a range of SNPs from 24,453 to 29,955, while the genome-based SNPs or indels were allowed to have up to 20% of missing values (27,556 and 445,188, respectively), as shown in Table 1 for further data analysis.

Further analyses of the eight final SNP or indel data sets revealed that these selected SNPs were widely distributed across the selected contigs or 12 chromosomes (Table S5). Interestingly, there were 3 out of the 56 contigs without SNPs of non-missing values in gaSNPs for heat tolerance and 5 out of the 79 contigs without SNPs of non-missing values in gaSNPs for cold tolerance. All the selected contigs for the four gaSNP data sets were mainly located in chromosomes 1 to 9 (with one contig AP011111.1 in chromosome 11 associated with seed size), and thus, there were no SNPs identified on chromosomes 10 and 12, both of which did not carry any of the selected gene contigs (Tables S1–S4).

An allelic frequency analysis revealed a similar pattern of MAF distributions present in 2643 samples for the four gaSNP data sets and a similar pattern of MAF distributions for the three genome SNP data sets, while the indel data set displayed an extreme L-shape of MAF distribution, shown in Figure S1A. Such patterns of MAF distribution remain the same for smaller sample groups Indica1789 (Figure S1B) and Japonica854 (Figure S1C). The major differences were observed mainly in minor alleles with frequencies approaching 0.5: more of these were found in the four gaSNP data sets, and fewer in the three genome SNP and indel data sets.

3.2. Variability of APD Estimates for Three Sample Groups

APD estimates of each sample group were generated separately for each SNP or indel data set, and all are listed in Tables S6–S8 for the three sample groups (All2643, Indica1789, and Japonica854), respectively. Their statistical summaries are given in Table 2, along with the number of SNPs or indels present in each data set. The frequency analysis revealed that the APD estimates largely followed an approximately normal distribution in each SNP data set, while a distribution skewed toward the left was observed in each indel data set (Figure 1). More specifically, the four gaSNP data sets showed similar distribution patterns of APD estimates, while the three genome SNP data sets also displayed similar distribution patterns. Such APD distribution patterns and their variations across eight SNP or indel data sets remained the same for the three sample groups (Figure 1).

The correlation analyses of APD estimates for a given sample group among eight SNP or indel data sets revealed interesting patterns of correlations (Table 3). First, the four gaSNP data sets displayed significantly high correlation coefficients of 0.99 and the three genome SNP data sets showed significantly high correlation coefficients of 0.74 to 0.98 for the three sample groups. Second, the four gaSNP data sets were significantly and negatively correlated with the three genome SNP data sets in All2643 and Indica1789. However, some variations in correlation were also observed in Japonica854, in which the four gaSNP data sets showed positive correlations of APD estimates with those in the genicSNP data set. Third, indel-based APD estimates were significantly and weakly correlated with the other seven SNP data sets in All2643 but non-significantly in Indica1789 and Japonica854. To illustrate these correlation patterns, correlation plots for APD estimates of pairwise SNP or indel data sets were made for All2643 (Figure 2), Indica1789 (Figure S2), and Japonica854 (Figure S3). Clearly, there were three marked patterns of correlations in APD estimates among the eight SNP or indel data sets for the three sample groups.
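A correlation of this kind compares two APD vectors for the same samples, one per data set. The sketch below assumes Pearson's product-moment coefficient, which the text does not specify, and uses made-up APD values chosen only to show one positive and one negative correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical APD estimates for the same four samples in three data sets.
apd_heat = [0.30, 0.32, 0.35, 0.40]
apd_cold = [0.31, 0.33, 0.36, 0.41]
apd_genome = [0.28, 0.27, 0.25, 0.20]

r_heat_cold = pearson(apd_heat, apd_cold)
r_heat_genome = pearson(apd_heat, apd_genome)
```

Because the samples must be in the same order in both vectors, any real analysis would first align the estimates by sample identifier.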

3.3. Four Sets of Most Genetically Distinct Rice Lines

To facilitate rice germplasm utilization, efforts were made to select a set of the 25 most genetically distinct rice lines for each specific character, based on the highest APD estimates across 2643 samples (Table 4). These selected lines had APD estimates larger than two standard deviations and represented both indica and japonica groups, but there were more indica than japonica lines. For example, there were 22 indica and 3 japonica lines for heat tolerance and 21 indica and 4 japonica lines for the other three characters. The selected lines originated from 13 to 14 countries, showing diverse origins. For example, the 25 selected lines for heat tolerance were from 13 countries, while the set for fertility was from 14 countries. Interestingly, most of the selected lines overlapped across the four sets for the four characters. For example, the japonica line B166 from North Korea and the indica line IRIS_313-9108 from Bangladesh were present in all four sets. In other words, these overlapping lines have genetically distinct backgrounds for all four assayed characters.

A similar effort was made to select a set of the 25 most genetically distinct rice lines for each of the four assayed characters from 1789 indica lines (Table 5) and from 854 japonica lines (Table 6). These selected indica and japonica rice lines had APD estimates larger than two standard deviations. The four indica sets represented lines originating from 11 to 12 countries. A majority of the selected lines overlapped across the four sets. For example, the indica lines IRIS_313-8466 from Thailand and IRIS_313-11968 from China were present in all four sets. Similarly, the four distinct japonica sets each consisted of 25 rice lines that originated from 12 to 16 countries or regions. Many selected lines were also present across the four distinct japonica sets. For example, the japonica lines B166 from North Korea and IRIS_313-11582 from China were consistently the top two lines across the four distinct japonica sets.
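The overlap among character-specific distinct sets can be checked with a simple set intersection. The sketch below uses hypothetical line identifiers and set contents, not the lines reported in Tables 4 to 6.

```python
def shared_lines(distinct_sets):
    """Return the sorted lines that occur in every distinct set."""
    sets = [set(lines) for lines in distinct_sets.values()]
    return sorted(set.intersection(*sets)) if sets else []


# Hypothetical distinct sets for the four assayed characters.
distinct_sets = {
    "heat_tolerance": ["line_1", "line_2", "line_3", "line_7"],
    "cold_tolerance": ["line_1", "line_2", "line_4", "line_7"],
    "fertility": ["line_1", "line_2", "line_5", "line_7"],
    "seed_size": ["line_1", "line_2", "line_6", "line_8"],
}
in_all_four = shared_lines(distinct_sets)
```

Lines returned by such a check would correspond to those described above as having distinct backgrounds for all four characters.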

4. Discussion

Our gene-associated APD exploration not only identified four novel sets of the 25 most genetically distinct rice lines for each of the four specific characters of breeding interest but also revealed several interesting APD properties. First, APD estimates displayed similar frequency distributions and high correlations among the four contig-based SNP data sets and among the three genome-based SNP data sets. Second, APD estimates were negatively correlated between the contig-based and genome-based SNP data sets. Third, indel-based APD estimates were nearly uncorrelated with those in the other seven SNP data sets. These findings are significant for plant germplasm characterization and germplasm utilization in plant breeding.

The results of APD correlations are novel and interesting. For example, the correlation coefficients of 0.99 or higher among the four gaSNP data sets were much higher than the genetic correlations generally expected among these specific characters. With such high APD correlations, one character set of gaAPD estimates for a sample group is sufficient to assess the relative genetic distinctness of the samples with respect to the other characters. However, the observations of negative and/or weak correlations of contig-based APD estimates with genome-based and indel-based APD estimates are largely unexpected, as these negative correlations imply that gaAPD estimates carried different sets of genetic backgrounds from those present across the genome. More surprising was the finding of negative or weak correlations between gaAPD estimates and those based on genome genic SNPs, as genic SNPs sampled functional regions of the whole genome, and they may overlap with those gaSNPs generated from gene contigs. It is possible that the strong linkage of alleles present in the gaSNPs, compared to the whole genome genic SNPs, may have contributed to the negative correlations. Also, gaAPD estimates are known to be confounded by mis-called SNP genotypes from gene paralogs, as contig-based SNP calls from short sequence reads cannot distinguish adequately between gene orthologs and paralogs (e.g., see [29,30]). This was evident from the abundant SNP genotypes that were found to be invariant homozygotes and/or heterozygotes across the assayed samples in the four gaSNP data sets. However, we cannot determine the extent of bias in APD estimation from existing paralogs or evaluate the degree of the impacts of biased estimation on the APD-based ranking of samples and APD correlations among SNP or indel data sets. How to improve gaSNP calls from short-read sequence data remains a topic of research interest.

Marker-based average pairwise dissimilarity is a function of marker genotypes of the assayed samples [12]. Thus, even with the same set of SNP genotypes, APD estimates per sample will vary if the APD estimation was made on a subset of the assayed samples versus on all the assayed samples. An APD estimation can also be affected by the type, size, and distribution of genetic markers such as various types of SNPs and indels, as studied here. It was previously found that the APD estimation was more sensitive to the number of SNPs, minor allele frequency, and missing data but less sensitive to the sample size and that 5000 to 10,000 genome-wide SNPs were generally required for an effective APD analysis [20]. As clearly demonstrated in the present APD analysis, APD estimates of the same sample group were different among different SNP or indel data sets (e.g., see Tables S6–S8). Such differences revealed the limitation in the informativeness of APD estimates to rank the assayed germplasm, as APD estimates are strictly informative only to the assayed samples with the given type, size, and distribution of genetic markers. An exception seems to exist for those gaAPD estimates with extremely high correlations among the four gaSNP data sets for the four assayed characters. Despite this exception, it is important to know such a limitation for proper interpretations of APD estimates with respect to germplasm selection and use.

The four sets of rice lines selected from 2643 samples (Table 4) had APD estimates larger than two standard deviations and represented the rice germplasm with the most genetically distinct backgrounds for the specific characters of heat and cold tolerance, fertility, and seed size. Similarly, the sets selected from the indica group of 1789 samples and from the japonica group of 854 samples were the rice germplasm with the most genetically distinct backgrounds for the four assayed characters. As indicated above, however, it is important to know that the selected sets of rice lines from the three sample groups can differ, as their APD estimates were based on different gaSNP data sets. For example, APD estimates for All2643 and Japonica854 for heat tolerance were made for 2643 and 854 samples based on 24,868 and 24,509 SNPs, respectively. The japonica line B166 from North Korea was present in both selected sets for heat tolerance (Table 4 and Table 6), but the second top japonica line IRIS_313-10057 from Japan in the set for All2643 (Table 4) was not present in the corresponding set for Japonica854 (Table 6). Thus, caution should be exercised when using the distinct sets generated from different APD estimations. Selecting the distinct set for a specific character should depend on research or breeding objectives. For example, if research is focused on the seed size of general rice germplasm, the distinct set from All2643 (Table 4) or APD estimates from Table S6 should be considered. Similarly, if a breeding effort is planned on the fertility of japonica rice lines, the distinct set from Japonica854 (Table 6) or APD estimates from Table S8 should be applied. Note that seeds for the distinct sets of rice lines (Table 4, Table 5 and Table 6) should be accessible upon a proper germplasm request to the International Rice Genebank (https://www.irri.org/rice-seeds; accessed on 9 October 2024) at the International Rice Research Institute, Philippines. Each distinct set can also be expanded, if needed, to include more genetically distinct lines by ranking the corresponding APD estimates listed in Tables S6–S8 and selecting the lines with the highest APD estimates.

Our APD analysis here was fully based on SNP or indel data without the need to perform the phenotypic characterization of the assayed rice lines for the associated traits. Thus, we did not have the related phenotypic data of these assayed characters to evaluate the correlations of the resulting APD estimates with their corresponding phenotypic values (for a specific character) of a sample group. However, it is useful to study such correlations by selecting a distinct set and a random set (as control) of rice lines and evaluating them agronomically in diverse environments. Such studies will not only allow for understanding how the rice lines with distinct genetic backgrounds are adapted to diverse environments but also verify the selection of exotic germplasm with the distinct genetic backgrounds and the traits of breeding interest for breeding gene pool addition. Alternatively, when a set of germplasm accessions with acquired traits of interest was identified, presumably from a pre-breeding effort (see [7]) or other germplasm search approaches (as described by Sukumaran et al. [9]), an additional APD analysis can be performed on the selected germplasm set to verify their genetic backgrounds for the re-selection of a few truly elite exotic lines. Either way, an APD analysis can serve as a supplemental genetic tool to search for genetically distinct exotic lines for the widening of elite breeding gene pools.

5. Concluding Remarks

Exploring the gene-associated APD analysis in the genomic characterization of 2643 rice lines generated four novel sets of the 25 most genetically distinct rice lines, each for a specific character (heat tolerance, cold tolerance, fertility, or seed size). It also revealed several interesting APD properties. Four contig-based SNP data sets for four specific characters displayed similar frequency distributions and high correlations of APD estimates. Contig-based APD estimates were negatively correlated with genome-based APD estimates and nearly uncorrelated with indel-based APD estimates. These findings are significant for plant germplasm characterization and utilization in plant breeding.

Supplementary Materials

The following supporting material can be downloaded at https://doi.org/10.6084/m9.figshare.27214971 (accessed 17 November 2024): Figure S1. Distributions of minor allelic frequencies in eight SNP or indel data sets for three sample groups. Figure S2. Pairwise correlations of APD estimates among eight SNP or indel data sets for the 1789 indica samples. Figure S3. Pairwise correlations of APD estimates among eight SNP or indel data sets for the 854 japonica samples. Table S1. List of 60 genes (on 56 contigs) selected from TO:0000259 with 158 genes associated with heat tolerance and their related information. Table S2. List of 83 genes (on 79 contigs) selected from TO:0000303 with 184 genes associated with cold tolerance and their related information. Table S3. List of 24 genes (on 23 contigs) selected from TO:0000420 with 56 genes associated with fertility and their related information. Table S4. List of 48 genes (on 44 contigs) selected from TO:0000391 with 111 genes associated with seed size and their related information. Table S5. SNP or indel counts per contig or chromosome (chr) in eight SNP or indel data sets for all 2643 rice samples. Table S6. List of APD estimates for eight SNP or indel data sets for all 2643 samples (in excel file). Table S7. List of APD estimates for eight SNP or indel data sets for 1789 Indica samples (in excel file). Table S8. List of APD estimates for eight SNP or indel data sets for 854 Japonica samples (in excel file).

Funding

This study was funded by AAFC research grants J-000066, J-000185 and J-003159 to Yong-Bi Fu.

Data Availability Statement

The meta data sets generated for this paper are included as Supplementary Materials to this paper.

Acknowledgments

The author thanks Jeffrey Ross-Ibarra for his helpful discussion on the inference of allelic frequency distribution; Chenyi Liu for his assistance in gene data processing and helpful reading of an early version of the manuscript; Carolee Horbach for her assistance in generating APD distribution plots and editing an early draft of the manuscript; and Bill Biligetu for his constructive comments on an early version of the manuscript.

Conflicts of Interest

The author declares no conflicts of interest.

References

Bernardo, R.N. Essentials of Plant Breeding; Stemma Press: Woodbury, MN, USA, 2014.
Allier, A.; Teyssèdre, S.; Lehermeier, C.; Moreau, L.; Charcosset, A. Optimized breeding strategies to harness genetic resources with different performance levels. BMC Genom. 2020, 21, 349.
Allard, R.W. Principles of Plant Breeding, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1999.
Bohra, A.; Kilian, B.; Sivasankar, S.; Caccamo, M.; Mba, C.; McCouch, S.R.; Varshney, R.K. Reap the crop wild relatives for breeding future crops. Trends Biotechnol. 2022, 40, 412–431.
Prohens, J.; Gramazio, P.; Plazas, M.; Dempewolf, H.; Kilian, B.; Díez, M.J.; Fita, A.; Herraiz, F.J.; Rodríguez-Burruezo, A.; Soler, S.; et al. Introgressiomics: A new approach for using crop wild relatives in breeding for adaptation to climate change. Euphytica 2017, 213, 158.
Wang, C.; Hu, S.; Gardner, C.; Lübberstedt, T. Emerging avenues for utilization of exotic germplasm. Trends Plant Sci. 2017, 22, 624–637.
Schulthess, A.W.; Kale, S.M.; Liu, F.; Zhao, Y.; Philipp, N.; Rembe, M.; Jiang, Y.; Beukert, U.; Serfling, A.; Himmelbach, A.; et al. Genomics-informed prebreeding unlocks the diversity in genebanks for wheat improvement. Nat. Genet. 2022, 54, 1544–1552.
Hernandez, J.; Meints, B.; Hayes, P. Introgression breeding in barley: Perspectives and case studies. Front. Plant Sci. 2020, 11, 761.
Sukumaran, S.; Rebetzke, G.; Mackay, I.; Bentley, A.R.; Reynolds, M.P. Pre-breeding strategies. In Wheat Improvement; Reynolds, M.P., Braun, H.J., Eds.; Springer: Cham, Switzerland, 2022.
Yu, X.; Li, X.; Guo, T.; Zhu, C.; Wu, Y.; Mitchell, S.E.; Roozeboom, K.L.; Wang, D.; Wang, M.L.; Pederson, G.A.; et al. Genomic prediction contributing to a promising global strategy to turbocharge gene banks. Nat. Plants 2016, 2, 16150.
Li, Y.; Shi, F.; Lin, Z.; Robinson, H.; Moody, D.; Rattey, A.; Godoy, J.; Mullan, D.; Keeble-Gagnere, G.; Hayden, M.J.; et al. Benefit of introgression depends on level of genetic trait variation in cereal breeding programmes. Front. Plant Sci. 2022, 13, 786452.
Fu, Y.B. Redundancy and distinctness in flax germplasm as revealed by RAPD dissimilarity. Plant Genet. Resour. 2006, 4, 117–124.
Sokal, R.R.; Michener, C.D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958, 38, 1409–1438.
Yang, M.H.; Fu, Y.B. AveDissR: An R function for assessing genetic distinctness and genetic redundancy. Appl. Plant Sci. 2017, 5, 1700018.
Peterson, G.W.; Dong, Y.; Horbach, C.; Fu, Y.B. Genotyping-by-sequencing for plant genetic diversity analysis: A lab guide for SNP genotyping. Diversity 2014, 6, 665–680.
Milner, S.G.; Jost, M.; Taketa, S.; Mazón, E.R.; Himmelbach, A.; Oppermann, M.; Weise, S.; Knüpffer, H.; Basterrechea, M.; König, P.; et al. Genebank genomics highlights the diversity of a global barley collection. Nat. Genet. 2019, 51, 319–326.
Sansaloni, C.; Franco, J.; Santos, B.; Percival-Alwyn, L.; Singh, S.; Petroli, C.; Campos, J.; Dreher, K.; Payne, T.; Marshall, D.; et al. Diversity analysis of 80,000 wheat accessions reveals consequences and opportunities of selection footprints. Nat. Commun. 2020, 11, 4572.
Varshney, R.K.; Roorkiwal, M.; Sun, S.; Bajaj, P.; Chitikineni, A.; Thudi, M.; Singh, N.P.; Du, X.; Upadhyaya, H.D.; Khan, A.W.; et al. A chickpea genetic variation map based on the sequencing of 3366 genomes. Nature 2021, 599, 622–627.
Song, Q.; Hyten, D.L.; Jia, G.; Quigley, C.V.; Fickus, E.W.; Nelson, R.L.; Cregan, P.B. Fingerprinting soybean germplasm and its utility in genomic research. G3 Genes Genomes Genet. 2015, 5, 1999–2006.
Fu, Y.B. Assessing genetic distinctness and redundancy of plant germplasm conserved ex situ based on published genomic SNP data. Plants 2023, 12, 1476.
Wang, W.; Mauleon, R.; Hu, Z.; Chebotarov, D.; Tai, S.; Wu, Z.; Li, M.; Zheng, T.; Fuentes, R.R.; Zhang, F.; et al. Genomic variation in 3010 diverse accessions of Asian cultivated rice. Nature 2018, 557, 43–49.
Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 2009, 25, 1754–1760.
Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve years of SAMtools and BCFtools. GigaScience 2021, 10, giab008.
Danecek, P.; McCarthy, S.A.; HipSci Consortium; Durbin, R. A method for checking genomic integrity in cultured cell lines from SNP genotyping data. PLoS ONE 2016, 11, e0155014.
Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T.; et al. The variant call format and VCFtools. Bioinformatics 2011, 27, 2156–2158.
Neph, S.; Kuehn, M.S.; Reynolds, A.P.; Haugen, E.; Thurman, R.E.; Johnson, A.K.; Rynes, E.; Maurano, M.T.; Vierstra, J.; Thomas, S.; et al. BEDOPS: High-performance genomic feature operations. Bioinformatics 2012, 28, 1919–1920.
Zheng, X.; Levine, D.; Shen, J.; Gogarten, S.M.; Laurie, C.; Weir, B.S. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 2012, 28, 3326–3328.
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 22 September 2024).
Field, M.A.; Burgio, G.; Chuah, A.; Shekaili, J.A.; Hassan, B.; Sukaiti, N.A.; Foote, S.J.; Cook, M.C.; Andrews, T.D. Recurrent miscalling of missense variation from short-read genome sequence data. BMC Genom. 2019, 20, 546.
Steyaert, W.; Haer-Wigman, L.; Pfundt, R.; Hellebrekers, D.; Steehouwer, M.; Hampstead, J.; de Boer, E.; Stegmann, A.; Yntema, H.; Kamsteeg, E.-J.; et al. Systematic analysis of paralogous regions in 41,755 exomes uncovers clinically relevant variation. Nat. Commun. 2023, 14, 6845.

Figure 1. Frequency distribution of APD estimates in eight SNP or indel data sets for three sample groups (All2643, Indica1789, and Japonica854). M = mean and R = range.

Figure 2. Pairwise correlations of APD estimates among eight SNP or indel data sets for all 2643 samples.

Table 1. Summary of SNP or indel identifications in eight SNP or indel data sets for all 2643 samples. The bold numbers of SNPs or indels indicate the final SNP or indel data sets used for APD analyses. SNPs or indels were counted at three levels of missing values: up to 20%, up to 10%, and no missing values.

Data Set	Chromosome or Contig Count	Sequence Length (bp)	SNP or Indel Count	Count with ≤20% Missing	Count with ≤10% Missing	Count with No Missing
gaSNPs for heat tolerance	56	8,429,490	2,672,786	183,883	183,883	24,868
gaSNPs for cold tolerance	79	11,855,409	3,685,200	283,772	218,133	26,435
gaSNPs for fertility	23	3,542,347	1,200,284	183,883	111,422	24,453
gaSNPs for seed size	44	6,528,374	2,107,785	221,124	179,903	29,955
genomeSNPs	12	217,331,824	74,136,931	27,556	6630	36
neutralSNPs	12			17,873
genicSNPs	12			9683
indels	12		2,354,934	445,188	438,466	386,361
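
The three missing-value levels in Table 1 can be pictured with a small Python sketch. It only illustrates the idea of keeping sites whose fraction of missing calls does not exceed a threshold; the tools and settings actually used for filtering in this study are not reproduced here, and the function name, genotype coding, and toy matrix are hypothetical.

```python
import numpy as np


def filter_by_missing_rate(genotypes, max_missing):
    """Keep the SNP columns whose missing-call fraction is <= max_missing.

    genotypes: (n_samples, n_snps) array with np.nan for missing calls.
    Returns the filtered array and a boolean mask of the kept columns.
    """
    g = np.asarray(genotypes, dtype=float)
    # Fraction of samples with a missing call at each site.
    missing_rate = np.isnan(g).mean(axis=0)
    keep = missing_rate <= max_missing
    return g[:, keep], keep


# Hypothetical toy matrix: five samples (rows) by four SNPs (columns).
nan = np.nan
toy = np.array([
    [0, 1, nan, 2],
    [0, nan, nan, 2],
    [1, 1, 0, 0],
    [2, 0, 1, 0],
    [0, 2, 1, 1],
])

# Count the sites retained at each of the three levels used in Table 1.
counts = {}
for threshold in (0.20, 0.10, 0.0):
    _, kept = filter_by_missing_rate(toy, threshold)
    counts[threshold] = int(kept.sum())
print(counts)
```

Stricter thresholds can only retain the same number of sites or fewer, which matches the non-increasing counts from the 20% column to the no-missing column in Table 1.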

Table 2. Statistical summary of APD estimates in eight SNP or indel data sets for three sample groups (All2643, Indica1789, and Japonica854), along with SNP counts.

Data Set	SNP or Indel Count	APD Mean	APD Standard Deviation	APD Minimum	APD Maximum
All2643 samples
gaSNPs for heat tolerance	24,868	0.2146	0.0156	0.1744	0.2689
gaSNPs for cold tolerance	26,435	0.2139	0.0142	0.1745	0.2598
gaSNPs for fertility	24,453	0.2154	0.0168	0.1804	0.2803
gaSNPs for seed size	29,955	0.2171	0.0152	0.1759	0.2687
genomeSNPs	27,556	0.2790	0.0103	0.2223	0.3063
neutralSNPs	17,873	0.2716	0.0095	0.2178	0.2997
genicSNPs	9683	0.2878	0.0105	0.2072	0.3143
indels	445,188	0.3524	0.0260	0.3116	0.5233
Indica1789 samples
gaSNPs for heat tolerance	24,556	0.2080	0.0168	0.1713	0.2635
gaSNPs for cold tolerance	26,168	0.2083	0.0153	0.1711	0.2586
gaSNPs for fertility	24,107	0.2083	0.0182	0.1762	0.2788
gaSNPs for seed size	29,578	0.2111	0.0165	0.1731	0.2675
genomeSNPs	25,286	0.2971	0.0054	0.2575	0.3160
neutralSNPs	16,231	0.2911	0.0056	0.2485	0.3116
genicSNPs	8967	0.3035	0.0059	0.2381	0.3281
indels	445,176	0.3299	0.0175	0.3087	0.5315
Japonica854 samples
gaSNPs for heat tolerance	24,509	0.1980	0.0145	0.1700	0.2503
gaSNPs for cold tolerance	26,095	0.1953	0.0135	0.1669	0.2404
gaSNPs for fertility	23,939	0.1939	0.0156	0.1655	0.2558
gaSNPs for seed size	29,453	0.1961	0.0144	0.1678	0.2461
genomeSNPs	17,444	0.2949	0.0162	0.2584	0.3919
neutralSNPs	11,103	0.2974	0.0151	0.2627	0.3879
genicSNPs	6025	0.3079	0.0164	0.2675	0.4151
indels	193,700	0.1414	0.0217	0.1095	0.3152

Table 3. Pairwise correlations of APD estimates in eight SNP or indel data sets for three sample groups (All2643, Indica1789, and Japonica854). The lower triangle shows the estimates of the Pearson correlation coefficients, and the upper triangle shows the p-values of the corresponding significance tests (0.0001 means that p is much smaller than 0.0001).

	Heat	Cold	Fertility	Seed Size	Genome	Neutral	Genic	Indel
All2643 samples
gaSNPs for heat tolerance		0.0001	0.0001	0.0001	0.0001	0.0001	0.0001	0.0001
gaSNPs for cold tolerance	0.991		0.0001	0.0001	0.0001	0.0001	0.0001	0.0001
gaSNPs for fertility	0.991	0.990		0.0001	0.0001	0.0001	0.0001	0.0001
gaSNPs for seed size	0.992	0.993	0.991		0.0001	0.0001	0.0001	0.0001
genomeSNPs	−0.718	−0.687	−0.705	−0.689		0.0001	0.0001	0.0147
neutralSNPs	−0.737	−0.707	−0.727	−0.710	0.982		0.0001	0.0120
genicSNPs	−0.583	−0.545	−0.573	−0.550	0.934	0.914		0.0525
indels	−0.078	−0.075	−0.082	−0.078	0.047	0.049	0.038
Indica1789 samples
gaSNPs for heat tolerance		0.0001	0.0001	0.0001	0.0001	0.0001	0.0001	0.5087
gaSNPs for cold tolerance	0.993		0.0001	0.0001	0.0001	0.0001	0.0001	0.4754
gaSNPs for fertility	0.993	0.991		0.0001	0.0001	0.0001	0.0001	0.5437
gaSNPs for seed size	0.995	0.994	0.992		0.0001	0.0001	0.0001	0.4414
genomeSNPs	−0.428	−0.423	−0.432	−0.430		0.0001	0.0001	0.1904
neutralSNPs	−0.441	−0.436	−0.448	−0.441	0.947		0.0001	0.2675
genicSNPs	−0.148	−0.139	−0.158	−0.147	0.776	0.736		0.6156
indels	0.016	0.017	0.014	0.018	0.031	0.029	0.014
Japonica854 samples
gaSNPs for heat tolerance		0.0001	0.0001	0.0001	0.1860	0.0584	0.0101	0.1220
gaSNPs for cold tolerance	0.992		0.0001	0.0001	0.4732	0.1989	0.0015	0.0898
gaSNPs for fertility	0.991	0.989		0.0001	0.0781	0.0171	0.0293	0.0722
gaSNPs for seed size	0.994	0.993	0.991		0.3319	0.1247	0.0044	0.1021
genomeSNPs	−0.045	−0.025	−0.060	−0.033		0.0001	0.0001	0.4486
neutralSNPs	−0.065	−0.044	−0.082	−0.053	0.984		0.0001	0.4246
genicSNPs	0.088	0.108	0.075	0.097	0.921	0.905		0.4255
indels	−0.053	−0.058	−0.062	−0.056	−0.026	−0.027	−0.027
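
To see how a correlation coefficient and a sample size in Table 3 relate to a significance level, the Python sketch below converts r and n into an approximate two-sided p-value. The exact test used in this study is not stated in this section, so the t statistic with a large-sample normal approximation is an assumption, as is the function name. Because the tabulated r values are rounded to three decimals, the result only approximates the tabulated p-values.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and, as a large-sample
    shortcut, a standard normal distribution in place of the
    t distribution with n - 2 degrees of freedom.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 3:
        raise ValueError("n must be at least 3")
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Two-sided tail probability of a standard normal variate.
    return math.erfc(abs(t) / math.sqrt(2.0))


# Heat-tolerance gaSNPs vs. genomeSNPs in the 854 japonica samples.
p_weak = correlation_p_value(-0.045, 854)
# Heat-tolerance vs. cold-tolerance gaSNPs in all 2643 samples.
p_strong = correlation_p_value(0.991, 2643)
print(p_weak, p_strong)
```

The first call gives a p-value well above 0.05, in line with the non-significant entry for that pair in the Japonica854 block, while the second is far below 0.0001, in line with the entries among the four gaSNP data sets.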

Table 4. A list of the 25 most genetically distinct rice lines identified for each of the four specific characters based on the gene-associated APD estimates among all 2643 rice lines. Origin = the country or region of sample origin and SD = standard deviation.

Sample	Group	Origin	APD	SD	Sample	Group	Origin	APD	SD
Character: Heat tolerance					Character: Cold tolerance
B166	japonica	North Korea	0.2689	0.0179	B166	japonica	North Korea	0.2598	0.0178
IRIS_313-9108	indica	Bangladesh	0.2648	0.0118	B203	indica	China	0.2594	0.0107
IRIS_313-10375	indica	Philippines	0.2645	0.0102	IRIS_313-9108	indica	Bangladesh	0.2593	0.0105
B203	indica	China	0.2644	0.0125	IRIS_313-10002	indica	Sri Lanka	0.2591	0.0123
IRIS_313-10002	indica	Sri Lanka	0.2635	0.0127	B181	indica	Australia	0.2587	0.0120
IRIS_313-9575	indica	Thailand	0.2635	0.0131	IRIS_313-8859	indica	China	0.2577	0.0120
B146	indica	China	0.2621	0.0138	IRIS_313-8383	indica	Philippines	0.2575	0.0124
IRIS_313-8859	indica	China	0.2619	0.0132	IRIS_313-10057	japonica	Japan	0.2559	0.0199
B202	indica	China	0.2607	0.0124	CX51	indica	China	0.2553	0.0118
B185	indica	Lao	0.2602	0.0131	CX9	indica	China	0.2552	0.0118
B181	indica	Australia	0.2601	0.0133	IRIS_313-10375	indica	Philippines	0.2550	0.0092
B030	indica	India	0.2598	0.0140	IRIS_313-9814	japonica	Hungary	0.2550	0.0190
IRIS_313-11733	indica	China	0.2594	0.0141	B087	indica	China	0.2545	0.0120
CX50	indica	China	0.2593	0.0147	IRIS_313-8401	indica	India	0.2545	0.0121
IRIS_313-8401	indica	India	0.2586	0.0137	IRIS_313-9575	indica	Thailand	0.2544	0.0123
CX9	indica	China	0.2585	0.0133	B030	indica	India	0.2539	0.0126
B087	indica	China	0.2581	0.0130	B146	indica	China	0.2533	0.0128
IRIS_313-10057	japonica	Japan	0.2580	0.0200	B202	indica	China	0.2533	0.0118
CX84	indica	Vietnam	0.2578	0.0121	IRIS_313-8474	indica	Thailand	0.2529	0.0127
IRIS_313-10341	indica	Bangladesh	0.2572	0.0139	CX86	indica	Vietnam	0.2528	0.0133
IRIS_313-11144	indica	Myanmar	0.2568	0.0151	CX50	indica	China	0.2525	0.0132
IRIS_313-8383	indica	Philippines	0.2566	0.0135	IRIS_313-9346	japonica	Taiwan	0.2524	0.0196
CX548	indica	China	0.2565	0.0140	CX548	indica	China	0.2523	0.0125
IRIS_313-8314	japonica	Indonesia	0.2563	0.0148	B185	indica	Lao	0.2522	0.0122
CX86	indica	Vietnam	0.2560	0.0136	IRIS_313-11733	indica	China	0.2519	0.0136
Character: Fertility					Character: Seed size
IRIS_313-10375	indica	Philippines	0.2803	0.0111	B203	indica	China	0.2687	0.0111
B166	japonica	North Korea	0.2768	0.0207	B166	japonica	North Korea	0.2685	0.0205
B203	indica	China	0.2745	0.0134	IRIS_313-9108	indica	Bangladesh	0.2668	0.0114
CX84	indica	Vietnam	0.2718	0.0136	B185	indica	Lao	0.2654	0.0133
IRIS_313-9108	indica	Bangladesh	0.2707	0.0137	IRIS_313-10002	indica	Sri Lanka	0.2644	0.0131
B087	indica	China	0.2699	0.0143	B181	indica	Australia	0.2638	0.0130
B181	indica	Australia	0.2696	0.0149	IRIS_313-8383	indica	Philippines	0.2637	0.0138
IRIS_313-9575	indica	Thailand	0.2694	0.0149	IRIS_313-8401	indica	India	0.2637	0.0137
IRIS_313-8383	indica	Philippines	0.2686	0.0150	B087	indica	China	0.2632	0.0127
CX9	indica	China	0.2683	0.0145	IRIS_313-10057	japonica	Japan	0.2630	0.0211
IRIS_313-8401	indica	India	0.2682	0.0151	B030	indica	India	0.2625	0.0137
IRIS_313-10002	indica	Sri Lanka	0.2674	0.0153	IRIS_313-10375	indica	Philippines	0.2623	0.0101
B202	indica	China	0.2672	0.0135	CX51	indica	China	0.2619	0.0126
B185	indica	Lao	0.2671	0.0150	IRIS_313-8859	indica	China	0.2616	0.0129
B146	indica	China	0.2668	0.0152	CX9	indica	China	0.2611	0.0133
IRIS_313-8859	indica	China	0.2662	0.0145	CX50	indica	China	0.2608	0.0140
B030	indica	India	0.2651	0.0154	IRIS_313-9575	indica	Thailand	0.2604	0.0131
IRIS_313-11144	indica	Myanmar	0.2641	0.0161	IRIS_313-9814	japonica	Hungary	0.2604	0.0212
IRIS_313-9814	japonica	Hungary	0.2635	0.0232	B146	indica	China	0.2601	0.0144
IRIS_313-10341	indica	Bangladesh	0.2625	0.0157	IRIS_313-10341	indica	Bangladesh	0.2593	0.0145
IRIS_313-10057	japonica	Japan	0.2619	0.0234	B202	indica	China	0.2590	0.0126
IRIS_313-8314	japonica	Indonesia	0.2617	0.0171	CX548	indica	China	0.2590	0.0138
CX50	indica	China	0.2615	0.0160	IRIS_313-9346	japonica	Taiwan	0.2587	0.0227
IRIS_313-11139	indica	Myanmar	0.2610	0.0158	B244	indica	China	0.2585	0.0136
CX51	indica	China	0.2609	0.0143	IRIS_313-11144	indica	Myanmar	0.2583	0.0145

Table 5. A list of the 25 most genetically distinct rice lines identified for each of the four specific characters based on the gene-associated APD estimates among 1789 indica rice lines. Origin = the country or region of sample origin and SD = standard deviation.

Sample	Group	Origin	APD	SD	Sample	Group	Origin	APD	SD
Character: Heat tolerance					Character: Cold tolerance
IRIS_313-8466	indica	Thailand	0.2635	0.0107	IRIS_313-11968	indica	China	0.2586	0.0099
IRIS_313-11968	indica	China	0.2632	0.0106	B203	indica	China	0.2581	0.0098
B203	indica	China	0.2618	0.0106	IRIS_313-7636	indica	Mali	0.2563	0.0102
IRIS_313-7636	indica	Mali	0.2611	0.0107	B181	indica	Australia	0.2560	0.0100
IRIS_313-12190	indica	Lao	0.2609	0.0111	IRIS_313-8466	indica	Thailand	0.2554	0.0096
IRIS_313-11894	indica	Vietnam	0.2596	0.0113	IRIS_313-11894	indica	Vietnam	0.2553	0.0102
B146	indica	China	0.2591	0.0111	IRIS_313-11763	indica	Cameroon	0.2549	0.0105
B202	indica	China	0.2590	0.0110	CX561	indica	China	0.2532	0.0103
B185	indica	Lao	0.2576	0.0111	CX370	indica	China	0.2531	0.0105
B181	indica	Australia	0.2575	0.0111	IRIS_313-11779	indica	Tanzania	0.2523	0.0107
B030	indica	India	0.2565	0.0113	IRIS_313-12190	indica	Lao	0.2517	0.0107
IRIS_313-11084	indica	Cambodia	0.2564	0.0116	B087	indica	China	0.2516	0.0105
CX413	indica	Philippines	0.2563	0.0129	B202	indica	China	0.2514	0.0106
B087	indica	China	0.2560	0.0114	B030	indica	India	0.2510	0.0106
CX561	indica	China	0.2559	0.0115	B146	indica	China	0.2504	0.0108
IRIS_313-11779	indica	Tanzania	0.2559	0.0115	IRIS_313-11799	indica	China	0.2499	0.0108
CX369	indica	Philippines	0.2555	0.0116	B185	indica	Lao	0.2498	0.0106
IRIS_313-8405	indica	China	0.2545	0.0117	CX378	indica	China	0.2498	0.0109
IRIS_313-11763	indica	Cameroon	0.2542	0.0115	CX416	indica	Philippines	0.2496	0.0110
CX378	indica	China	0.2535	0.0116	CX369	indica	Philippines	0.2493	0.0108
CX416	indica	Philippines	0.2534	0.0115	B244	indica	China	0.2485	0.0112
IRIS_313-8265	indica	India	0.2529	0.0119	IRIS_313-11896	indica	Vietnam	0.2485	0.0113
IRIS_313-10054	indica	Panama	0.2529	0.0120	IRIS_313-11084	indica	Cambodia	0.2483	0.0112
B207	indica	China	0.2522	0.0118	IRIS_313-8265	indica	India	0.2483	0.0110
CX370	indica	China	0.2522	0.0118	IRIS_313-10045	indica	Gambia	0.2483	0.0112
Character: Fertility					Character: Seed size
IRIS_313-8466	indica	Thailand	0.2788	0.0113	B203	indica	China	0.2675	0.0097
B203	indica	China	0.2716	0.0110	IRIS_313-11968	indica	China	0.2656	0.0101
CX413	indica	Philippines	0.2691	0.0145	IRIS_313-8466	indica	Thailand	0.2626	0.0106
IRIS_313-11968	indica	China	0.2684	0.0115	B185	indica	Lao	0.2626	0.0106
B087	indica	China	0.2667	0.0116	IRIS_313-7636	indica	Mali	0.2618	0.0108
B181	indica	Australia	0.2660	0.0113	B087	indica	China	0.2607	0.0103
IRIS_313-12190	indica	Lao	0.2659	0.0118	B181	indica	Australia	0.2607	0.0103
CX561	indica	China	0.2652	0.0118	IRIS_313-11763	indica	Cameroon	0.2607	0.0110
IRIS_313-11763	indica	Cameroon	0.2650	0.0116	IRIS_313-11779	indica	Tanzania	0.2606	0.0110
B202	indica	China	0.2648	0.0115	CX370	indica	China	0.2602	0.0108
IRIS_313-11779	indica	Tanzania	0.2646	0.0115	B030	indica	India	0.2596	0.0109
IRIS_313-7636	indica	Mali	0.2632	0.0118	IRIS_313-11894	indica	Vietnam	0.2588	0.0106
B185	indica	Lao	0.2632	0.0117	CX561	indica	China	0.2583	0.0108
IRIS_313-11894	indica	Vietnam	0.2629	0.0117	IRIS_313-12190	indica	Lao	0.2582	0.0112
B146	indica	China	0.2628	0.0116	CX369	indica	Philippines	0.2576	0.0108
B030	indica	India	0.2615	0.0120	B202	indica	China	0.2573	0.0109
IRIS_313-10054	indica	Panama	0.2595	0.0119	IRIS_313-11896	indica	Vietnam	0.2565	0.0114
IRIS_313-11896	indica	Vietnam	0.2593	0.0126	B146	indica	China	0.2563	0.0112
IRIS_313-8405	indica	China	0.2587	0.0125	CX413	indica	Philippines	0.2562	0.0121
CX370	indica	China	0.2580	0.0119	CX378	indica	China	0.2558	0.0110
CX369	indica	Philippines	0.2573	0.0123	IRIS_313-8405	indica	China	0.2557	0.0111
B244	indica	China	0.2572	0.0124	B244	indica	China	0.2557	0.0112
IRIS_313-10045	indica	Gambia	0.2570	0.0123	CX416	indica	Philippines	0.2551	0.0112
IRIS_313-8265	indica	India	0.2568	0.0121	IRIS_313-8265	indica	India	0.2550	0.0113
B207	indica	China	0.2568	0.0124	IRIS_313-11084	indica	Cambodia	0.2550	0.0113

Table 6. A list of the 25 most genetically distinct rice lines identified for each of the four specific characters based on the gene-associated APD estimates among 854 japonica rice lines. Origin = the country or region of sample origin and SD = standard deviation.

Sample	Group	Origin	APD	SD	Sample	Group	Origin	APD	SD
Character: Heat tolerance					Character: Cold tolerance
B166	japonica	North Korea	0.2503	0.0099	B166	japonica	North Korea	0.2404	0.0098
IRIS_313-11582	japonica	China	0.2452	0.0105	IRIS_313-11582	japonica	China	0.2377	0.0103
CX389	japonica	China	0.2425	0.0103	CX389	japonica	China	0.2367	0.0101
B144	japonica	China	0.2398	0.0107	IRIS_313-12330	japonica	Lao	0.2350	0.0106
IRIS_313-12226	japonica	Lao	0.2395	0.0112	IRIS_313-7863	japonica	Brazil	0.2346	0.0112
IRIS_313-7863	japonica	Brazil	0.2380	0.0115	IRIS_313-11540	japonica	Guinea	0.2343	0.0113
IRIS_313-7856	japonica	Thailand	0.2376	0.0112	IRIS_313-8046	japonica	Italy	0.2327	0.0113
IRIS_313-12006	japonica	Malaysia	0.2360	0.0110	IRIS_313-12063	japonica	Lao	0.2320	0.0109
IRIS_313-8046	japonica	Italy	0.2357	0.0122	IRIS_313-12226	japonica	Lao	0.2310	0.0109
IRIS_313-11540	japonica	Guinea	0.2356	0.0117	B199	japonica	China	0.2310	0.0109
IRIS_313-12330	japonica	Lao	0.2348	0.0113	B144	japonica	China	0.2305	0.0107
IRIS_313-11923	japonica	Thailand	0.2342	0.0119	IRIS_313-7856	japonica	Thailand	0.2293	0.0109
IRIS_313-9366	japonica	United States of America	0.2339	0.0116	CX352	japonica	China	0.2290	0.0109
IRIS_313-12063	japonica	Lao	0.2336	0.0119	IRIS_313-12006	japonica	Malaysia	0.2282	0.0108
B025	japonica	Indonesia	0.2334	0.0112	IRIS_313-9366	japonica	United States of America	0.2282	0.0112
B169	japonica	Japan	0.2331	0.0117	IRIS_313-11652	japonica	China	0.2281	0.0115
CX353	japonica	Vietnam	0.2331	0.0116	IRIS_313-11923	japonica	Thailand	0.2280	0.0113
IRIS_313-7850	japonica	Madagascar	0.2331	0.0118	IRIS_313-11890	japonica	Taiwan	0.2278	0.0113
B117	japonica	China	0.2330	0.0118	B037	japonica	Argentina	0.2273	0.0110
B199	japonica	China	0.2329	0.0116	IRIS_313-7850	japonica	Madagascar	0.2272	0.0117
IRIS_313-12266	japonica	Myanmar	0.2327	0.0120	B025	japonica	Indonesia	0.2268	0.0109
IRIS_313-11755	japonica	Liberia	0.2326	0.0116	IRIS_313-11755	japonica	Liberia	0.2268	0.0115
IRIS_313-11890	japonica	Taiwan	0.2323	0.0119	IRIS_313-11928	japonica	Philippines	0.2268	0.0113
IRIS_313-11652	japonica	China	0.2318	0.0118	IRIS_313-12348	japonica	Lao	0.2266	0.0115
CX352	japonica	China	0.2310	0.0113	CX307	japonica	China	0.2260	0.0112
Character: Fertility					Character: Seed size
B166	japonica	North Korea	0.2558	0.0102	B166	japonica	North Korea	0.2461	0.0104
IRIS_313-11582	japonica	China	0.2482	0.0108	IRIS_313-11582	japonica	China	0.2434	0.0101
CX389	japonica	China	0.2432	0.0103	IRIS_313-7863	japonica	Brazil	0.2407	0.0109
IRIS_313-7856	japonica	Thailand	0.2417	0.0114	CX389	japonica	China	0.2406	0.0099
IRIS_313-12006	japonica	Malaysia	0.2393	0.0110	IRIS_313-12330	japonica	Lao	0.2382	0.0108
IRIS_313-12330	japonica	Lao	0.2392	0.0112	IRIS_313-11540	japonica	Guinea	0.2368	0.0116
IRIS_313-7863	japonica	Brazil	0.2381	0.0116	B144	japonica	China	0.2353	0.0104
IRIS_313-12226	japonica	Lao	0.2375	0.0115	IRIS_313-12063	japonica	Lao	0.2347	0.0117
IRIS_313-11540	japonica	Guinea	0.2374	0.0121	IRIS_313-12006	japonica	Malaysia	0.2344	0.0105
IRIS_313-12266	japonica	Myanmar	0.2366	0.0120	IRIS_313-7856	japonica	Thailand	0.2344	0.0110
B101	japonica	China	0.2359	0.0113	IRIS_313-12266	japonica	Myanmar	0.2332	0.0121
B144	japonica	China	0.2342	0.0107	IRIS_313-12226	japonica	Lao	0.2326	0.0111
IRIS_313-11652	japonica	China	0.2338	0.0122	IRIS_313-9366	japonica	United States of America	0.2326	0.0113
CX307	japonica	China	0.2338	0.0115	IRIS_313-8046	japonica	Italy	0.2324	0.0118
IRIS_313-9366	japonica	United States of America	0.2336	0.0117	IRIS_313-11652	japonica	China	0.2324	0.0117
IRIS_313-8046	japonica	Italy	0.2335	0.0127	IRIS_313-7850	japonica	Madagascar	0.2322	0.0117
IRIS_313-12063	japonica	Lao	0.2335	0.0120	B199	japonica	China	0.2322	0.0111
B199	japonica	China	0.2331	0.0119	IRIS_313-11923	japonica	Thailand	0.2320	0.0121
B117	japonica	China	0.2328	0.0121	CX353	japonica	Vietnam	0.2314	0.0110
IRIS_313-11923	japonica	Thailand	0.2324	0.0121	B025	japonica	Indonesia	0.2312	0.0108
B025	japonica	Indonesia	0.2317	0.0115	CX352	japonica	China	0.2310	0.0111
IRIS_313-11571	japonica	China	0.2312	0.0119	B117	japonica	China	0.2302	0.0117
CX352	japonica	China	0.2311	0.0120	IRIS_313-11755	japonica	Liberia	0.2300	0.0117
IRIS_313-7850	japonica	Madagascar	0.2310	0.0123	B101	japonica	China	0.2298	0.0110
IRIS_313-11908	japonica	China	0.2310	0.0129	IRIS_313-12348	japonica	Lao	0.2295	0.0117

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Crown Copyright: © His Majesty the King in Right of Canada, 2024. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Fu, Y.-B. Genetically Distinct Rice Lines for Specific Characters as Revealed by Gene-Associated Average Pairwise Dissimilarity. Crops 2024, 4, 636-650. https://doi.org/10.3390/crops4040044

