Article

Predicting Rice Quality in Indica Rice Using Multidimensional Data and Machine Learning Strategies

1
College of Agronomy, Hunan Agricultural University, Changsha 410128, China
2
State Key Laboratory of Rice Biology and Breeding, China National Rice Research Institute, Hangzhou 310006, China
3
National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya 572025, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agriculture 2026, 16(7), 807; https://doi.org/10.3390/agriculture16070807
Submission received: 1 February 2026 / Revised: 25 March 2026 / Accepted: 1 April 2026 / Published: 4 April 2026
(This article belongs to the Section Agricultural Product Quality and Safety)

Abstract

Integrating agricultural remote sensing and phenomics for full-growth-period rice quality prediction is vital for early non-destructive screening and breeding; however, studies integrating genomic and multi-source phenotypic data across multiple environments remain limited. This study addressed this gap by integrating genomic SNP data, UAV-based spectral data, and individual multidimensional phenotypic data of 61 indica rice varieties (field and greenhouse environments). As a proof-of-concept study, feature selection methods (LASSO, MI, RFE, SPA) were used to mitigate overfitting and the “p >> n” problem, with further validation needed in larger populations. The results showed that amylose content is genetically dominated, protein content is genetically determined and influenced by gene-environment interactions, and chalkiness traits are determined by three combined factors. For amylose content, SNP data under the Random Forest model at the population level (phenomics data from field UAV remote sensing of variety populations) achieved optimal performance (R2 = 0.92; MAE = 1.1; RMSE = 1.5), while the Stacking Ensemble method enhanced accuracy at the individual level (phenomics data from greenhouse single-plant phenotyping per variety). Chalky grain rate and chalkiness degree showed SNP-comparable prediction accuracy, with Stacking significantly improving performance at the population level (R2 = 0.89 and 0.85, respectively). Protein content prediction remained relatively low (optimal R2 = 0.56) due to strong environmental sensitivity and complex interactions. This framework extends traditional single-environment/single-data-source approaches, providing an effective strategy for early, high-throughput, non-destructive rice quality screening. Further validation with larger datasets, more growing seasons, or independent populations is required for reliable application in breeding-related practices.

1. Introduction

Rice quality traits directly affect people’s quality of life [1]. Currently, acquiring quality trait data relies on destructive testing of harvested seeds, which limits real-time monitoring during the rice growth period. Timely and accurate prediction of rice quality traits is essential not only for guiding cultivation management but also for pre-harvest market pricing [2]. Researchers have developed prediction models for rice quality indicators such as protein content (PC) [3,4,5], amylose content (AC) [6], gel consistency (GC) [7], and pasting properties (RVA) [8] by combining hyperspectral data with partial least squares regression (PLSR). However, these methods are currently limited to in vitro scanning of mature seeds. Thus, it is crucial to develop efficient and non-destructive prediction models that enable dynamic assessment of living plants during the growing season, thereby accelerating the integration and application of agricultural remote sensing and phenomics [9].
Research on early prediction of crop quality traits using spectral characteristics is advancing [10,11,12]. In wheat, studies by Tan et al. and Pancorbo et al. show that vegetation indices like Normalized Difference Vegetation Index (NDVI) and Structure-Insensitive Pigment Index (SIPI), combined with hyperspectral and thermal imaging, can effectively estimate protein, chlorophyll, and nitrogen content, with Shortwave Infrared (SWIR) and Red Edge (RE) bands being particularly important for accuracy [13,14,15]. Furthermore, Fu et al. demonstrated that combining UAV multispectral imagery with texture features and ecological factors using an Artificial Neural Network (ANN) significantly improves the estimation accuracy of grain protein content in wheat [16]. However, these studies tended to focus on spectral and physiological features with limited inclusion of genetic variation or genomic information. Yan et al. proposed a Short-Wave Infrared Edge Position (SWEP) method based on Continuous Wavelet Transform (CWT), which significantly improved the estimation accuracy of wheat protein content (PC) by extracting absorption features associated with protein content [17]. While effective, this approach was mostly limited to a single environment and had not been fully validated across diverse germplasm. For rice, Xie et al. demonstrated strong correlations between canopy spectral data and protein/amylose content under temperature stress [18]. Yet the study was conducted under controlled stress conditions, with little consideration of genotype-by-environment interactions. Complementing these efforts, Wang et al. explored the integration of canopy spectral reflectance and RGB images to provide reliable estimations of leaf chlorophyll content and grain yield, offering a multi-source remote sensing approach for quality-related trait assessment [19]. Zhou et al. and Yan et al. achieved dynamic protein monitoring across multiple varieties and nitrogen treatments [20,21]. Zhang et al. conducted pot experiments and combined hyperspectral data, Principal Component Analysis (PCA), and Partial Least Squares Regression (PLSR) models to achieve early estimation of nitrogen nutrition and grain quality for a single rice variety [22]. Hou et al. integrated physiological indicators such as Leaf Area Index (LAI) and Chlorophyll Content (SPAD) with leaf reflectance data from multiple growth stages to construct prediction models for key rice quality indices [9]. Notably, these rice-focused studies either lacked sufficient genomic information integration, were restricted to single varieties or pot conditions (thus potentially limiting generalizability), or involved relatively little multi-environment validation; these are common gaps in current quality prediction research. Ata-Ul-Karim et al. confirmed robust links between late-growth-stage spectral data and key quality traits (R2 > 0.85), especially when combined with the Nitrogen Nutrition Index (NNI) (R2 > 0.90) [23]. These studies support the integration of spectral data with physiological indicators for early, non-destructive quality prediction in rice breeding. In the context of field monitoring, Chen et al. proposed a fusion of UAV spectral and deep image features for precise growth monitoring and fertilization management, highlighting the potential of deep learning in extracting complex field-based phenotypes [24]. Collectively, while previous studies achieved promising prediction accuracy, most relied predominantly on spectral or phenotypic data alone, seldom incorporated genomic variation, lacked extensive multi-environment validation, and rarely addressed genotype-by-environment interactions. These limitations highlight the critical need for integrated multi-source, multi-environment, and genomic-informed prediction models, which is the focus of the present study.
Currently, researchers are increasingly integrating multi-source data to improve crop quality prediction models. For example, Zhang et al. enhanced rice protein content prediction by combining canopy hyperspectral data with agronomic parameters such as Plant Nitrogen Accumulation (PNA) [2]. Similarly, Peng et al. improved prediction accuracy for protein and taste value by integrating multispectral data with soil indicators, including pH and nutrient content [25]. To account for broader environmental dynamics, Zhang et al. developed a hybrid framework integrating crop phenology models with machine learning to predict rice phenology across China, addressing challenges posed by extreme temperatures [26]. Similarly, Liang et al. demonstrated that integrating multi-temporal UAV multispectral imagery with meteorological data, supported by SHAP (SHapley Additive exPlanations) analysis, markedly enhances the robustness of rice yield prediction models [27]. In wheat, Longmire et al. achieved high prediction accuracy (R2 = 0.80) for protein content by fusing airborne hyperspectral, thermal, solar-induced fluorescence (SIF), and PROSAIL-derived data using a Gradient Boosting Machine (GBM) [28]. Tilse et al. used a Random Forest Regressor (RFR) model combined with multi-source spatial data to predict wheat yield and crude protein content, finding that integrating multi-source data with Two-Fold Cross-Validation (2FCV) yielded the best prediction results [29]. Yan et al. further improved wheat protein prediction by transferring multi-scale spectral features from grain to canopy [30]. Additionally, Sun et al. successfully applied Hierarchical Linear Modeling (HLM) to integrate hyperspectral and meteorological data for multi-year quality prediction in maize, outperforming traditional methods [31]. 
These studies demonstrate that multi-source data fusion significantly enhances the stability and accuracy of crop quality prediction, offering a robust technical approach for monitoring quality in complex field environments. Nevertheless, most of these studies have tended to focus on phenotypic and environmental data fusion, with relatively limited attempts to incorporate genomic information or systematically evaluate model stability across contrasting environments. As a result, the generalizability and genetic interpretability of such models still need to be further improved for practical breeding applications.
Genomic selection, as a key technology for accelerating crop breeding processes, has demonstrated significant advantages in predicting various agronomic traits in crops such as rice [32,33,34]. Studies have confirmed that genomic data based on Single Nucleotide Polymorphisms (SNPs) can effectively capture the genetic variation underlying rice traits, thereby providing robust support for predicting crop phenotypes. For instance, Cui et al. used a hybrid rice population of 1495 individuals to construct genomic prediction models for 10 agronomic traits, achieving prediction accuracies ranging from 0.35 to 0.92. This provides an efficient tool for early screening of hybrid combinations [35]. Similarly, Fialho et al. demonstrated that genomic prediction methods based on SNP data can be effectively applied to predict 22 phenotypic traits in rice, with stable accuracies between 0.60 and 0.80, and are capable of accurately identifying both the best- and worst-performing individuals [36]. In a study of 404 hybrid rice lines, Yu et al. further confirmed that genomic data can effectively predict key quality indicators such as thousand-grain weight and starch content. These studies highlight the stable and reliable application value of SNP data in rice trait prediction, offering essential genetic-level support for accurate phenotypic prediction in crops [37]. Building on these genetic foundations, Togninalli et al. showed that multi-modal deep learning, which fuses genomics and high-throughput phenomics, significantly improves grain yield prediction in breeding programs [38]. Zou et al. further established a large-scale crop dataset and a deep learning-based multi-modal fusion framework to achieve more accurate G × E genomic prediction [39]. Additionally, Qin et al. utilized a Meta-Hybrid Regression Ensemble to predict multiple traits across rice varieties and regions, employing multi-source data to quantify the specific contributions of genetic and environmental drivers [40].
Notably, most existing genomic prediction studies for rice quality traits have tended to rely primarily on genetic data, with relatively limited integration of spectral, phenotypic, or multi-environment information—an oversight that may constrain prediction stability and generalizability. Integrating SNP markers, UAV-derived data, and multi-source phenotypic observations could complement genetic and environmental insights, addressing the need for more robust early prediction models for rice quality.
In this study, 61 indica rice varieties were used to integrate multi-source data, including SNPs, drone spectral features, phenotypic platform traits, fluorescence and hyperspectral data. As a proof-of-concept study, this population size is suitable for exploring the feasibility of multi-source data fusion, and a larger population will be required in future large-scale breeding. Aiming at four key quality traits, we constructed multi-source fusion prediction models at both population and individual levels. The main objectives are: (1) to evaluate the contribution of genomic and phenomics data to quality prediction; (2) to compare the performance of different machine learning models at two levels; (3) to verify the stability and generalization ability of multi-source models across environments. The focused research problem of this study is to overcome the limitations of traditional destructive detection and the lack of multi-source, multi-scale and multi-environment integration in existing rice quality prediction. The core value is to establish an early, high-throughput and non-destructive prediction framework for rice quality, which can support efficient germplasm screening and precision breeding.

2. Materials and Methods

2.1. Experimental Materials and Cultivation Methods

In this study, a total of 61 indica rice accessions were collected (Supplementary Table S1). These accessions included conventional cultivars, hybrid restorers, and maintainer lines from major indica-growing regions in China, as well as several introduced lines from the International Rice Research Institute (IRRI). They originated from more than 20 breeding institutions across Zhejiang, Sichuan, Guangdong, Guangxi, Hunan, Hubei, Jiangxi, Fujian, Guizhou, Jiangsu, and the China National Rice Research Institute. The panel was selected for its wide genetic background and rich phenotypic variation in grain quality, ensuring adequate genetic diversity and representativeness for genomic prediction and genotype–environment interaction analysis. Detailed information on each accession is provided in Supplementary Table S1.
The materials were cultivated in two distinct environments: a field experimental station in Fuyang, Hangzhou, Zhejiang Province, and a high-throughput greenhouse phenotyping platform, both in 2024. The two environments were treated as independent experimental systems and analyzed separately in subsequent analyses. The greenhouse was a natural-light glass greenhouse, with the ambient environment stably controlled at 27–28 °C and 70% relative humidity. For the field environment, detailed meteorological conditions during the rice growing season are provided in Supplementary Table S12.
Field trials employed a plot planting pattern (Figure 1a). Each plot was an independent field plot of 1.2 m × 1.2 m, with a planting density of 6 inches between plants and 7 inches between rows. Conventional field management practices were adopted.
For field-grown materials, each accession was planted in one independent plot without biological replication for UAV-based spectral data collection, given the relatively high spatial consistency of field plots under standardized management and the integrative nature of canopy-scale data. Three technical replicates were performed for grain quality trait determination (amylose content, protein content, chalkiness degree, chalky grain percentage) to ensure accuracy.
For the greenhouse phenotyping platform, a pot cultivation method was used (Figure 1b). Plastic pots with a 4 L capacity were each filled with 3.5 kg of dry soil. One seedling was transplanted per pot, with three biological replicates for each line. Materials were first cultivated in seedling beds in the field for 15 days before being transplanted into the pots. Nitrogen fertilizer was applied in three split applications (basal, tillering, and panicle fertilizer) at a ratio of 4:3:3. Potassium fertilizer was applied in two splits (basal and panicle fertilizer), each accounting for 50%. Phosphorus fertilizer was applied entirely as a basal fertilizer.

2.2. Data Acquisition from Field and Greenhouse

Field Trials: Data were collected using high-precision visible-light and multispectral unmanned aerial vehicles (UAVs), with a total of 19 aerial surveys conducted throughout the entire growth period. The high-precision visible-light UAV was a DJI M300 RTK equipped with a P1 camera (both DJI, Shenzhen, China). The camera's sensor measured 35.9 mm × 24.0 mm (full-frame) with a resolution of 8192 × 5460 pixels. Aerial image data were captured using a fixed-interval exposure method. The multispectral UAV was a DJI Mavic 3M (DJI, Shenzhen, China), featuring one 20-megapixel RGB (Red, Green, Blue) sensor and four 5-megapixel single-band sensors (Green [G]: 560 nm ± 16 nm; Red [R]: 650 nm ± 16 nm; Red Edge [RE]: 730 nm ± 16 nm; Near-Infrared [NIR]: 860 nm ± 26 nm). Both UAVs were flown at an altitude of 12 m, which is the minimum operational height ensuring high image clarity and efficient field operation. Images were acquired at 80% forward overlap along predefined flight routes. Surveys were conducted under clear weather conditions with full sunlight and no cloud cover.
Greenhouse Trials: Relying on the PlantScreen™ high-throughput plant phenotyping platform (PSI, Drásov, Czech Republic), hyperspectral data were acquired from the seedling stage to maturity. The platform is equipped with two high-definition cameras for RGB imaging, enabling the extraction of feature values such as Area, Perimeter, Roundness, Compactness, Eccentricity, Rotational Mass Symmetry (RMS), and Slenderness of Leaves (SOL). The hyperspectral camera measures hundreds of spectral bands at nanometer-level resolution for each image pixel across the 380–1700 nm range, including the Visible and Near-Infrared (VNIR, 380–900 nm) and Shortwave Infrared (SWIR, 900–1700 nm) wavelengths. Chlorophyll fluorescence imaging employed a Tomi-2 high-resolution camera (1360 × 1024 pixels; 20 frames per second). Measured parameters included Size (effective fluorescence imaging area), Fo (minimal fluorescence), Fm (maximal fluorescence), and Fv (variable fluorescence). Based on these parameters, derived feature indicators such as QY_max (maximum quantum yield of PSII), Fo_gauss (Fo Gaussian fitting value), Fo_median (Fo median value), avg_Fv (average Fv), and avg_QY_max (average QY_max) were calculated. Images were collected weekly in the greenhouse from 1 August to 17 October 2024 (13 sampling dates: 7, 16, 23, 28 August; 3, 12, 17, 23, 27 September; 5, 11, 17 October), yielding 13 valid datasets. For each plant, 2 RGB images (1 front view, 1 top view), 6 chlorophyll fluorescence images, and 8 hyperspectral images were acquired at each sampling.
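The relationship between the measured and derived fluorescence parameters can be illustrated with a minimal sketch. It uses the standard definitions Fv = Fm − Fo and QY_max = Fv/Fm; the function name and the input values are hypothetical and do not come from the platform software.

```python
def fluorescence_parameters(fo: float, fm: float) -> dict:
    """Derive Fv and QY_max from minimal (Fo) and maximal (Fm) fluorescence.

    Uses the standard definitions Fv = Fm - Fo and QY_max = Fv / Fm.
    """
    if fm <= 0 or fo < 0 or fo > fm:
        raise ValueError("expected 0 <= Fo <= Fm and Fm > 0")
    fv = fm - fo  # variable fluorescence
    return {"Fv": fv, "QY_max": fv / fm}


# Hypothetical per-plant readings in arbitrary units.
params = fluorescence_parameters(fo=200.0, fm=1000.0)
print(params)
```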

2.3. Whole-Genome Sequencing and Analysis

Whole-genome resequencing analysis was performed on all tested varieties, with each sample achieving a sequencing depth of 10 Gb. High-quality Single Nucleotide Polymorphism (SNP) data were extracted using the Nipponbare genome as the reference, following the methods described in previous reports [41]. The genomic SNP dataset was encoded as 0 (homozygous, AA), 1 (heterozygous, Aa), and 2 (homozygous, aa).
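The 0/1/2 coding described above can be sketched as a count of non-reference alleles per genotype call. This is an illustration only: it assumes that AA denotes the reference homozygote and that calls are written as `A/G`-style strings, and the helper `encode_genotype` is hypothetical rather than part of the authors' pipeline. Missing calls are not handled here.

```python
def encode_genotype(genotype: str, ref: str) -> int:
    """Return 0 (AA), 1 (Aa) or 2 (aa) as the number of non-reference alleles."""
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid call, got {genotype!r}")
    return sum(allele != ref for allele in alleles)


# Hypothetical calls at one SNP site whose reference allele is "A".
calls = ["A/A", "A/G", "G/G"]
encoded = [encode_genotype(call, ref="A") for call in calls]
print(encoded)
```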

2.4. Phenotypic Data Collection

Upon rice maturation, measurements for various grain quality indicators were conducted. Chalkiness degree and chalky grain percentage were determined according to the GB/T 1354-2018 [42] method using a seed observation scanner (Microtek Technology (Shanghai) Co., Ltd., Shanghai, China) paired with Wanshen Seed & Rice Appearance Quality Inspection Software (Version V2.7.2.8). Total protein content was measured using the Kjeldahl method [14]. Amylose content was determined following the procedure specified in GB/T 15683-2008 [43].

2.5. Data Analysis Methods

A Two-way ANOVA was employed to analyze the effects of genotype (G), environment (E), and their interaction (G × E) on rice quality traits. Variance components and their contributions for each effect were calculated.
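One common way to express the contribution of each effect is its share of the total sum of squares in a balanced two-way layout with replication. The sketch below follows that convention as an assumption; the text does not state the exact formula used, and the function name, array layout, and toy data are all hypothetical.

```python
import numpy as np


def variance_contributions(y: np.ndarray) -> dict:
    """Percent of total sum of squares for G, E, G x E and residual error.

    y has shape (genotypes, environments, replicates) and must be balanced.
    """
    g, e, r = y.shape
    grand = y.mean()
    mean_g = y.mean(axis=(1, 2))   # genotype means
    mean_e = y.mean(axis=(0, 2))   # environment means
    cell = y.mean(axis=2)          # genotype-by-environment cell means

    ss = {
        "G": e * r * np.sum((mean_g - grand) ** 2),
        "E": g * r * np.sum((mean_e - grand) ** 2),
        "GxE": r * np.sum((cell - mean_g[:, None] - mean_e[None, :] + grand) ** 2),
        "Error": np.sum((y - cell[:, :, None]) ** 2),
    }
    total = sum(ss.values())
    return {name: float(100.0 * value / total) for name, value in ss.items()}


# Toy data: 5 genotypes, 2 environments, 3 replicates, genotype-dominated trait.
rng = np.random.default_rng(42)
genotype_effect = rng.normal(0.0, 5.0, size=(5, 1, 1))
toy = 20.0 + genotype_effect + rng.normal(0.0, 0.5, size=(5, 2, 3))
print(variance_contributions(toy))
```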

2.6. Preprocessing and Feature Selection for Individual and Population Phenotypic Data

For field-collected population phenotypic data, orthomosaic stitching was performed using DJI Terra software (version 3.9.3) after acquisition. This was followed by plot cropping and data extraction, calculating 13 key Vegetation Indices (VIs) such as the Normalized Difference Vegetation Index (NDVI) and 32 Texture Features (TFs) such as Homogeneity for each test plot according to the method of Feng et al. [44].
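As an illustration of the per-plot index calculation, the sketch below computes NDVI = (NIR − R)/(NIR + R) from the red and near-infrared reflectance of a cropped plot and averages it over the plot. The function names and pixel values are hypothetical; the actual extraction followed Feng et al. [44].

```python
import numpy as np


def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Per-pixel NDVI; pixels where NIR + Red equals zero are returned as NaN."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


def plot_mean_ndvi(nir: np.ndarray, red: np.ndarray) -> float:
    """Mean NDVI over all valid pixels of one cropped plot."""
    return float(np.nanmean(ndvi(nir, red)))


# Hypothetical 2 x 2 reflectance patches for a single plot.
nir_patch = np.array([[0.50, 0.55], [0.60, 0.45]])
red_patch = np.array([[0.10, 0.08], [0.05, 0.12]])
print(plot_mean_ndvi(nir_patch, red_patch))
```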
After acquiring images of individual plants, fisheye lens correction and color correction were completed using the phenotyping platform’s built-in analysis software (Morpho Analyzer version 1.0.14.3, PSI, Drásov, Czech Republic). Twenty morphological feature indices, including leaf projected area, were extracted from top-view and side-view RGB images following the method proposed by Pavicic et al. [45]. Additionally, nine chlorophyll fluorescence indices (e.g., maximal fluorescence, Fm) and the average reflectance within the 380–1700 nm range were extracted per plant.
Feature selection for both population and individual data utilized Recursive Feature Elimination (RFE), Mutual Information (MI), and Least Absolute Shrinkage and Selection Operator (LASSO). For individual feature selection, the Successive Projections Algorithm (SPA) was additionally applied (Figure 1c).
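The three shared selection methods can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data standing in for a training set in which features far outnumber samples; the number of retained features, the estimator inside RFE, and all other settings are assumptions rather than the study's configuration. SPA is omitted because scikit-learn does not provide it.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set with far more features than samples.
X, y = make_regression(n_samples=48, n_features=200, n_informative=5,
                       noise=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

# LASSO: keep the features whose coefficients are not shrunk to zero.
lasso = LassoCV(cv=5, max_iter=10000, random_state=42).fit(X, y)
lasso_idx = np.flatnonzero(lasso.coef_)

# MI: keep the 10 features with the highest mutual information with the target.
mi_score = partial(mutual_info_regression, random_state=42)
mi_idx = SelectKBest(score_func=mi_score, k=10).fit(X, y).get_support(indices=True)

# RFE: repeatedly drop the least important features until 10 remain.
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=42),
          n_features_to_select=10, step=0.2).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

print(len(lasso_idx), len(mi_idx), len(rfe_idx))
```

Fitting the selectors on the training portion only keeps information from the test set out of the selected feature list.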

2.7. Model Construction and Validation

At the population level, phenotypic data were acquired exclusively by UAV platforms in the field, providing canopy-scale spectral and texture features. Accordingly, three feature sets were established: a Genome-wide SNP Feature Set (SNP), a UAV-derived Multi-source Feature Set (M3M), and their fusion feature set (SNP-M3M). At the individual level, high-throughput phenotypic data were collected using a controlled-environment phenotyping platform in the greenhouse, which provided three distinct modules of phenotypic information: chlorophyll fluorescence (Flu), RGB imaging, and hyperspectral reflectance (HFS). Therefore, four feature sets were defined: Hyperspectral Feature Set (HFS), Genome-wide SNP Feature Set (SNP), Chlorophyll Fluorescence-RGB Morphological Parameter Set (Flu-RGB), and a multi-source fusion feature set (HFS-SNP-Flu-RGB) integrating all previous individual feature sets. This grouping strategy strictly reflects the separate experimental systems of field and greenhouse, and ensures that feature combinations are consistent with the actual data structure obtained from each environment.
To investigate the predictive performance of different feature sources on rice quality traits, and to analyze how genetic information and data characteristics contribute to the models in each environment, prediction models were constructed at both the population and individual levels using the feature sets defined above. The M3M set comprises features derived from both the M300 RTK and Mavic 3M UAVs, and the individual-level fusion set (HFS-SNP-Flu-RGB) was formed by concatenating the three individual feature sets.
Three commonly used machine learning algorithms, Random Forest Regressor (RFR), Support Vector Regression (SVR), and eXtreme Gradient Boosting (XGBoost), along with one Stacking Ensemble (Stacking) strategy (Figure 1d), were employed for predicting amylose content, protein content, chalky grain percentage, and chalkiness degree. The dataset was split into a training set and a test set at a ratio of 8:2, and five-fold cross-validation was also adopted. The random seed was set to 42. Key hyperparameters for each learner were optimized using grid search. RFR, SVR and XGBoost were selected as the meta-models for the Stacking Ensemble method (Supplementary Tables S2 and S3).
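A minimal sketch of this workflow (8:2 split, seed 42, five-fold grid search, stacking) is given below under several stated assumptions. It uses scikit-learn's `StackingRegressor` rather than mlxtend, substitutes `GradientBoostingRegressor` for XGBoost so that the example depends on scikit-learn alone, treats the three regressors as base learners combined by a ridge final estimator, and runs on synthetic data. The grids and settings are placeholders; the configurations actually used are those in Supplementary Tables S2 and S3.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

SEED = 42

# Synthetic stand-in for 61 accessions with 20 selected features.
X, y = make_regression(n_samples=61, n_features=20, n_informative=5,
                       noise=5.0, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

# Grid search with five-fold cross-validation on the training set only.
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=SEED),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5, scoring="r2")
rf_search.fit(X_train, y_train)

# Stack the tuned forest with an SVR and a gradient-boosting stand-in.
stack = StackingRegressor(
    estimators=[
        ("rfr", rf_search.best_estimator_),
        ("svr", make_pipeline(StandardScaler(), SVR(C=10.0))),
        ("gbr", GradientBoostingRegressor(random_state=SEED)),
    ],
    final_estimator=RidgeCV(),
    cv=5)
stack.fit(X_train, y_train)

test_r2 = stack.score(X_test, y_test)  # R2 on the held-out 20%
print(rf_search.best_params_, round(test_r2, 3))
```

With 61 samples, an 8:2 split leaves 48 accessions for training and 13 for testing, so the held-out R2 rests on very few points.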
The Coefficient of Determination (R2), Pearson Correlation Coefficient (r), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) were used as model accuracy evaluation metrics to comprehensively compare the predictive performance of different feature sources and machine learning algorithms.
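The five metrics can be written out directly from their standard definitions, as in the NumPy sketch below. The function name and example values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return R2, Pearson r, MAE, MSE and RMSE for paired observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "MAE": float(np.mean(np.abs(resid))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }


# Predictions offset by a constant 0.5: perfectly correlated but biased.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(metrics)
```

In this example r equals 1 while R2 is 0.8, which shows why the two are reported separately: r ignores a systematic offset, whereas R2 penalizes it.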

2.8. Experimental Instruments and Analytical Software

Field spectral data were collected using unmanned aerial vehicles DJI M300 RTK and DJI Mavic 3M (DJI, Shenzhen, China), including four-band multispectral imagery, vegetation indices, and texture features. In the greenhouse, multi-modal phenotypic data were acquired using instruments imported from Photon Systems Instruments (PSI, Drásov, Czech Republic), including an RGB imaging module (model: RGB), chlorophyll fluorescence imaging module (model: FC), short-wave infrared hyperspectral imaging module (model: SWIR), and visible near-infrared hyperspectral imaging module (model: VNIR). All predictive modeling and data analysis were performed in Python 3.11.9, using scikit-learn, XGBoost, and mlxtend libraries.

3. Results

3.1. Experimental Design

This study conducted an early prediction of grain quality for 61 indica rice materials at two levels: population (field) and individual (greenhouse). At the population level, M3M and SNP data were collected from the experimental field (Figure 1a,c). Following data fusion and dataset partitioning, key features were selected from the training set using LASSO, MI, and RFE to train population-level models. Model performance was validated on the test set, and the top six performing models were selected for ensemble stacking optimization (Supplementary Table S2). At the individual level, HFS, Flu-RGB, and SNP data were collected in the greenhouse (Figure 1b,c). After data fusion and dataset partitioning, key features were selected from the training set using LASSO, MI, RFE, and SPA to train individual-level models. Model performance was validated on the test set, and the top three to five performing models were selected for ensemble stacking optimization (Supplementary Table S3).

3.2. Performance of Quality Traits for 61 Indica Rice Germplasms in Two Environments

Different rice cultivation environments affect the formation of grain quality traits [33]. To clarify the influence of field and greenhouse cultivation on rice quality, a comprehensive evaluation of amylose content, protein content, chalky grain percentage, and chalkiness degree was conducted using the collected 61 representative indica rice germplasms.
The results showed no significant difference in amylose content under the two environments (Figure 2a). Protein content, chalky grain percentage and chalkiness degree were significantly higher in the field environment compared to the greenhouse (Figure 2c,e,g). Spearman correlation analysis revealed significant positive correlations between the phenotypic values of amylose content, chalky grain percentage, and chalkiness degree measured in the greenhouse and the field, with amylose content showing the strongest correlation. In contrast, protein content showed no significant correlation across environments (Figure 2b,d,f,h).
Further analysis indicated that variation in amylose content was primarily governed by genotype (G, contributing 93.62%) and genotype-environment interaction (G × E). Protein content was influenced by both genotype (G, contributing 13.12%) and genotype-environment interaction (G × E, contributing 68.94%). Conversely, chalky grain percentage and chalkiness degree were predominantly regulated by the combined effects of genotype, environment, and genotype-environment interaction (Figure 2i–l).

3.3. SNP Screening and Spectral Feature Analysis

Whole-genome resequencing of the 61 tested varieties identified a total of 6.44 million SNPs (Figure 3a). The SNPs were relatively evenly distributed across the 12 chromosomes, with localized enrichment in certain regions, indicating good overall coverage. Phylogenetic tree analysis grouped the tested materials into several distinct clusters with clear genetic relationships (Figure 3b), illustrating the genetic differentiation and evolutionary lineage among the materials. Principal Component Analysis (PCA) results showed that the first two principal components explained 65.48% of the cumulative genetic variation (Figure 3c). The tested materials exhibited a continuous distribution without forming clearly separated subpopulations.
Spectral data analysis revealed that the overall trend of the hyperspectral reflectance curves for indica rice across different growth stages was consistent (Figure 3d–j), displaying typical plant spectral characteristics. Reflectance was low in the visible light band (400–700 nm), forming a green reflectance peak at 550 nm. Reflectance increased sharply in the near-infrared band (700–900 nm), showing a distinct red-edge feature. In the shortwave infrared band (900–1700 nm), reflectance initially decreased with slight fluctuations, formed an absorption valley at around 1450 nm, and then increased again. The overall trend of the curves aligned with previous studies [46]. However, differences in reflectance existed among different lines, reflecting genetic variations in physiological states among materials and providing a phenotypic basis for predicting quality traits.

3.4. Non-Destructive Estimation of Amylose Content

At the population level, prediction analysis for amylose content was conducted based on different feature sources and algorithms (Figure 4a and Supplementary Table S4). Using SNP as the feature source, the RFR and XGBoost models demonstrated outstanding predictive capabilities, with R2 values reaching 0.92 and 0.85, respectively. However, the XGBoost model constructed using the UAV feature set (M3M) achieved an R2 of only 0.55. Feature selection methods such as LASSO, MI, and RFE improved the performance of the RFR and SVR models but not that of the XGBoost model (Figure 4a). The predictive performance of models combining SNP and M3M was similar to that of the single SNP feature model. Additionally, the Stacking Ensemble method also achieved a mean R2 of 0.92 (Supplementary Table S2).
At the individual level, using SNP as the feature source, the RFR model performed best with an R2 value of 0.84, followed by the XGBoost model with an R2 of 0.75 (Figure 4b and Supplementary Table S5). Single phenotypic feature sets (HFS or Flu-RGB) resulted in lower predictive capabilities. The predictive performance of the combined feature set (HFS-SNP-Flu-RGB) was similar to that of the single SNP feature model. Feature selection was able to optimize model performance. For example, MI selection applied to the Flu-RGB feature set improved the XGBoost model’s R2 from 0.49 to 0.65, and LASSO selection applied to the same feature set improved the RFR model’s R2 from 0.46 to 0.73. Furthermore, we found that using reflectance from only 10 wavelength points (HFS_MS_SPA) at the mature stage (Figure 4b,c,f), compared to using hyperspectral data from the entire growth period (HFS_SPA), effectively improved the predictive performance of the RFR and XGBoost models. Notably, the Stacking Ensemble method further enhanced prediction accuracy to R2 = 0.88 by efficiently integrating the strengths of multiple models (Supplementary Table S3), becoming the optimal solution for predicting amylose content at the individual level.

3.5. Non-Destructive Estimation of Protein Content

At the population level, the overall prediction accuracy for protein content was relatively low (Figure 5a and Supplementary Table S6). Predictive performance was suboptimal whether using SNP information or the M3M alone for modeling. However, combining SNP and M3M for modeling yielded relatively better performance, with the XGBoost model achieving an R2 of 0.43.
At the individual level, the overall prediction accuracy for protein content also remained low (Figure 5b and Supplementary Table S7), but predictive capability differed markedly among feature sources and algorithms, with some algorithms generally performing better than others. Prediction based on single phenotypic feature sets was generally superior to that based on SNP or combined feature sets. Specifically, the Flu-RGB feature set performed notably well under the XGBoost model, achieving an R2 of 0.53. After RFE selection, the R2 improved to 0.56, the best model performance at this level (Figure 5e). Furthermore, the Stacking Ensemble method showed no corresponding advantage in protein prediction, achieving an R2 of only 0.32 (Supplementary Table S3), which indicates that there is still considerable room for improvement in predicting protein content.

3.6. Non-Destructive Estimation of Chalky Grain Percentage

At the population level, the prediction accuracy for chalky grain percentage using different feature sources and modeling strategies was outstanding (Figure 6a and Supplementary Table S8). The Random Forest model using the SNP feature set achieved a prediction R2 of 0.79, significantly higher than models using the M3M or the combined SNP-M3M feature set. Notably, models built with the M3M feature set after LASSO, MI, and RFE selection showed generally significant improvement in prediction accuracy. Among them, the Random Forest model after RFE selection performed excellently, with an R2 of 0.80 (Figure 6d). Furthermore, the Stacking Ensemble method, which integrated the advantages of different models, achieved a high R2 of 0.89 (Supplementary Table S2), substantially enhancing predictive performance and making it the optimal model for predicting chalky grain percentage at the population level.
At the individual level, the performance of different feature source and model combinations for predicting chalky grain percentage varied significantly (Figure 6b and Supplementary Table S9). The XGBoost model using SNP features achieved an R2 of 0.70. This was higher than models built with single phenotypic features like HFS or Flu-RGB, and even exceeded the performance of models combining both phenotypic and SNP features. Meanwhile, applying RFE, MI, and LASSO selection to the Flu-RGB phenotypic feature set improved the prediction performance of SVR (R2 increased from 0.27 to 0.37~0.38), but reduced the performance of RFR and XGBoost (R2 decreased from 0.64 and 0.55 to 0.55~0.60 and 0.36~0.44, respectively). Additionally, we found that using reflectance from only 6 wavelength points (HFS_JS_SPA) at the booting stage, compared to using hyperspectral data from the entire growth period (HFS_SPA), effectively improved model prediction performance (Figure 6b,c,f). It is noteworthy that the Stacking Ensemble method showed no clear advantage at the individual level, with an R2 of only 0.70 (Supplementary Table S3).

3.7. Non-Destructive Estimation of Chalkiness Degree

At the population level, the prediction accuracy for chalkiness degree using different feature sources and model algorithms showed significant variation (Figure 7a and Supplementary Table S10). The SVR model using SNP features performed best (R2 = 0.77), followed by RFR (R2 = 0.71), indicating that genomic information has strong explanatory power for chalkiness degree. The RFR model using the M3M achieved a prediction accuracy of R2 = 0.78, outperforming the single SNP feature model, suggesting a high correlation between population canopy phenotype and chalkiness degree formation. Compared to models built with single features, the combined SNP-M3M feature model achieved significant improvement in XGBoost (R2 increased from 0.57 and 0.53 to 0.74), and only a slight improvement in SVR (R2 increased from 0.77 and 0.51 to 0.78). Additionally, we found that for the M3M feature set, the prediction performance of all three models built after RFE selection significantly improved. The Stacking Ensemble method, by integrating the advantages of various models, further enhanced prediction accuracy to an R2 of 0.87 (Supplementary Table S2), making it the optimal model for predicting chalkiness degree at the population level.
At the individual level, the prediction of chalkiness degree using different feature sources and model algorithms showed significant differentiation (Figure 7b and Supplementary Table S11). When using only SNP features, the SVR model performed best (R2 = 0.79), indicating strong explanatory power of genomic information for chalkiness degree. The performance of single phenotypic feature sets varied greatly. The Flu-RGB feature set performed excellently under the RF model (R2 = 0.81), surpassing the single SNP feature model, while the HFS feature set generally had lower predictive capability. After combining phenotypic data with SNPs, the prediction accuracy of most models improved. The SVR model achieved an R2 of 0.81, mainly attributed to the complementary nature of genetic and phenotypic information. We found that the Flu-RGB feature set after RFE selection further boosted the accuracy of the RF model to R2 = 0.84 (Figure 7e), representing the best performance among all single models. We also discovered that using reflectance from only 10 wavelength points (HFS_JS_SPA) at the booting stage (Figure 7b,c,f), compared to using hyperspectral data from the entire growth period (HFS_SPA), effectively improved the predictive performance of the RFR and SVR models. Specifically, the R2 for the RFR model increased from 0.60 to 0.78, and for the SVR model from 0.50 to 0.80. In contrast, the Stacking Ensemble method showed no clear advantage at the individual level, with an R2 of 0.73 (Supplementary Table S3).

3.8. Cross-Validation and Robustness Analysis of the Results

To verify the reliability and robustness of the prediction results obtained using the 8:2 training-test split in this study, five-fold cross-validation was performed on representative models at the population level. Specifically, models for predicting amylose content and protein content were selected, each constructed using SNP, M3M, and the combined SNP-M3M datasets. These two traits were chosen because amylose content showed the best prediction performance, while protein content exhibited the poorest prediction performance in this study. By validating these two traits with contrasting performance, the rationality of the dataset splitting strategy could be evaluated more comprehensively and objectively.
The validation procedure was as follows: The dataset was randomly divided into training and independent test sets at a ratio of 8:2. Within the training set, five-fold cross-validation was used for hyperparameter optimization, and the optimal hyperparameters for base learners and meta-learners were determined after 20 trials. The models were then retrained on the complete training set using these optimal parameters, and their actual generalization ability was evaluated on the completely independent 20% test set, with PCC and R2 as the key evaluation metrics. To ensure the stability of the machine learning prediction results, five independent repeated experiments were conducted with different random seeds (42, 123, 456, 789, 1024), which randomized the data split, model initialization, and cross-validation fold assignment. The model performance was finally reported as mean ± standard deviation.
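The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: `RandomizedSearchCV` stands in for whatever tuner produced the "20 trials", the search space `PARAM_SPACE` is hypothetical, only a single Random Forest is tuned (no meta-learner), and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

SEEDS = (42, 123, 456, 789, 1024)  # seeds listed in the text

# Hypothetical search space; the ranges actually searched are not given.
PARAM_SPACE = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 8],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 0.3, 1.0],
}


def evaluate_once(X, y, seed, n_trials=20):
    """One 8:2 split, five-fold CV tuning on the training part, test R2 and PCC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        PARAM_SPACE,
        n_iter=n_trials,
        cv=KFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="r2",
        random_state=seed,
        refit=True,  # retrain on the full training set with the best parameters
    )
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)  # the 20% test set is touched only here
    pcc = float(np.corrcoef(y_te, pred)[0, 1])
    return float(r2_score(y_te, pred)), pcc


def repeated_evaluation(X, y, seeds=SEEDS, n_trials=20):
    """Repeat the whole procedure per seed and summarise as (mean, SD)."""
    scores = np.array([evaluate_once(X, y, s, n_trials) for s in seeds])
    mean, sd = scores.mean(axis=0), scores.std(axis=0, ddof=1)
    return {"r2": (mean[0], sd[0]), "pcc": (mean[1], sd[1])}


# Synthetic stand-in data with the same sample count as the study (61).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(61, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=61)
```

Calling `repeated_evaluation(X_demo, y_demo)` runs all five seeds with 20 trials each. The seed is passed to the split, the fold assignment, and the model, which mirrors the statement that all three were randomized together.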
The comparative results (Table 1) demonstrated that the prediction performance of models optimized by five-fold cross-validation was highly consistent with the original 8:2 train-test split, despite minor differences in specific R2 values. The ranking of models (e.g., models using SNP features outperforming those using M3M features, and the combined feature set performing similarly to the SNP-only set), the prediction patterns of the target traits, and the main conclusions remained unchanged. These results confirm that the original train-test split used in this study is reasonable and robust, and the obtained prediction results and related conclusions are highly reliable and reproducible.

4. Discussion

4.1. Effects of Different Estimation Methods on Grain Quality Prediction

Non-destructive prediction technology for rice grain quality holds transformative significance for agricultural production. Its core value lies in rapidly and accurately assessing the quality of grains within living plants using technologies such as spectroscopy and imaging, without destroying the seeds. Currently, non-destructive prediction of grain quality mainly takes two forms. The first estimates the quality of mature seeds from spectra scanned in vitro. These methods primarily employ PLSR, coupled with preprocessing of the source spectral data such as first-order derivatives and smoothing, which can eliminate noise and improve prediction accuracy [3,4,5,6,7,8]. The second form predicts post-maturity grain quality from canopy spectral reflectance during the early growth stages of rice. This approach mostly uses different machine learning algorithms to build regression models [13,14,22]. Our study employed three machine learning models (RFR, XGBoost, and SVR) for analysis. Among them, the SVR model showed poor stability, with large fluctuations in the prediction results for chalkiness degree at the individual level (R2 ranging from 0.16 to 0.81). In contrast, the RFR model demonstrated the best stability. For predicting amylose content, chalky grain percentage, and chalkiness degree, the RFR model generally outperformed the others under the same conditions at both population and individual levels, making it the best predictive model for most indicators. Its highest R2 reached 0.92, suggesting that this method could be applied in breeding practice in the future.

4.2. Impact of Different Multi-Source Data Fusion Strategies on Rice Quality Prediction

Previous studies have shown that multi-source data fusion can effectively enhance the stability and accuracy of crop quality prediction models, but the selected fusion data sources vary. For instance, Peng et al. combined canopy spectral data with soil nutrient indicators like pH, organic matter, nitrogen, phosphorus, and potassium [25]; Longmire et al. integrated hyperspectral data, thermal imaging, SIF, and parameters inverted from the PROSAIL (PROSPECT + SAIL) model [28]; Tilse et al. combined spatial data from different scales, including UAV, near-ground satellite, and space station data [29]; Sun et al. fused hyperspectral data with meteorological data [31]. All these studies improved the prediction accuracy of quality traits to some extent. In our study, we found that the predictive effects of different feature sources on grain quality indicators were not consistent. Chalky grain percentage was primarily influenced by genotype and genotype-environment interaction, while chalkiness degree was mainly driven by environmental factors. For both traits, high prediction accuracy could be achieved using either individual-level phenotypic features (morphology, fluorescence, hyperspectral) or population-scale UAV canopy spectral data, providing technical support for their early high-throughput screening or precise identification. However, for amylose content, a genetically dominant trait, the highest prediction accuracy (R2 = 0.92) was achieved at the population level using SNP data with the RF model. At the individual level, the RF model with SNP data achieved an R2 of 0.84. Yet, using different phenotypic source data did not significantly improve its prediction. Therefore, for amylose content, SNP data at both population and individual levels can directly capture its genetic essence and meet the needs for early screening.

4.3. Rice Quality Prediction Could Be Improved by Stacking Ensemble Method

To overcome the traditional research limitation of “few varieties/multiple treatments”, this study selected 61 rice materials, conducted experiments under uniform optimal water and fertilizer management conditions, and introduced high-dimensional genomic data as predictive features. However, the dimensionality of genomic data is vastly higher than that of phenotypic features [34], and simply concatenating the two yielded suboptimal results: improvement was observed only for grain protein content at the population level, while prediction accuracy for other traits and scales remained unchanged. To address this issue, this study further adopted the Stacking Ensemble method [47,48], which significantly enhanced the prediction performance for amylose content at the individual level and for chalky grain percentage and chalkiness degree at the population level. This approach effectively addressed the fusion and adaptation challenge of multi-source heterogeneous data, improving prediction for genotype-dominant traits such as amylose content and environment-dominant traits such as chalkiness degree and chalky grain percentage. It provides a practical and feasible technical solution for the efficient fusion of multi-source data in rice quality prediction.
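The contrast between direct concatenation and stacking can be illustrated with scikit-learn's `StackingRegressor`. The sketch below is an assumption-laden example, not the configuration reported in Supplementary Tables S2 and S3: the base learners, the ridge meta-learner, the per-source column split, and the synthetic SNP and phenotype matrices are all hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: many SNP columns, few phenotypic columns.
rng = np.random.default_rng(0)
n, p_snp, p_phe = 61, 300, 12
snp = rng.integers(0, 3, size=(n, p_snp)).astype(float)  # 0/1/2 genotype codes
phe = rng.normal(size=(n, p_phe))
X = np.hstack([snp, phe])  # plain concatenation of both sources
y = snp[:, :5].sum(axis=1) + 2.0 * phe[:, 0] + rng.normal(scale=0.3, size=n)

snp_cols = list(range(p_snp))
phe_cols = list(range(p_snp, p_snp + p_phe))


def on_columns(cols, model):
    """Pipeline that feeds only the given columns to `model` (others are dropped)."""
    pick = ColumnTransformer([("pick", "passthrough", cols)])
    return make_pipeline(pick, model)


stack = StackingRegressor(
    estimators=[
        # Base learner trained on the SNP block only.
        ("rf_snp", on_columns(snp_cols, RandomForestRegressor(n_estimators=100, random_state=0))),
        # Base learner trained on the phenotypic block only.
        ("svr_phe", on_columns(phe_cols, make_pipeline(StandardScaler(), SVR(C=10.0)))),
        # Base learner trained on the concatenated table.
        ("rf_all", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=RidgeCV(),  # meta-learner combines out-of-fold predictions
    cv=5,
)
stack.fit(X, y)
meta_weights = stack.final_estimator_.coef_
```

Because the meta-learner sees only one prediction per base model, the few phenotypic features are not swamped by the far larger number of SNP columns, which is one plausible reason stacking can help where concatenation does not.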

4.4. Relationships Between Hyperspectral Data at Different Growth Stages and Various Quality Traits

Previous studies have shown that spectral source data from different growth stages can lead to variations in predicting quality traits. Research by Hou et al. found that spectral indices from the booting stage had higher correlations with taste value than those from the tillering and maturity stages [9]. Xie et al. discovered that canopy spectral variables from the flowering stage were more effective for predicting crude protein and amylose content than those from the filling stage, which in turn were better than those from the maturity stage [18]. Similar to these findings, we observed that among different key growth stages, hyperspectral data from the booting stage showed the highest correlation with chalky grain percentage and chalkiness degree; data from the maturity stage were most strongly associated with amylose content; and data from the heading stage had the most significant correlation with protein content. To mitigate the multicollinearity inherent in hyperspectral data, this study employed the Successive Projections Algorithm (SPA) for feature selection. The screening criteria were set as follows: no more than 20 wavelength points were extracted, and the number of wavelength points corresponding to the minimum Root Mean Squared Error of Cross-Validation (RMSECV) was determined as the optimal number of features [49,50]. Comparing the modeling results between SPA-selected features from the entire growth period and those from the optimal single growth period revealed that, except for predictions related to protein content, models based on spectral features from the optimal single growth period showed significantly improved predictive performance. Therefore, selecting hyperspectral data from a single, optimal growth stage for modeling can effectively enhance prediction accuracy while reducing the cost of data acquisition and processing.
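The SPA screening rule described above (at most 20 wavelength points, with the count chosen at the minimum RMSECV) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it builds a single projection chain from one starting band (the band with the largest centred norm) instead of trying every starting band, uses ordinary least squares inside five-fold cross-validation, and runs on synthetic spectra; all function names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict


def spa_chain(X, start, max_n):
    """Column indices chosen by successive projections, beginning at `start`."""
    Xp = X - X.mean(axis=0)  # centred working copy
    selected = [start]
    for _ in range(max_n - 1):
        v = Xp[:, selected[-1]]
        # Project all columns onto the subspace orthogonal to the last pick.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never pick the same band twice
        selected.append(int(np.argmax(norms)))
    return selected


def rmsecv(X, y, cols, cv):
    """Root mean squared error of cross-validation for a linear model on `cols`."""
    pred = cross_val_predict(LinearRegression(), X[:, cols], y, cv=cv)
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def spa_select(X, y, max_n=20, n_splits=5, seed=0):
    """Return the chain prefix with minimum RMSECV and the full error curve."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    start = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=0)))
    chain = spa_chain(X, start, max_n)
    errors = [rmsecv(X, y, chain[:k], cv) for k in range(1, max_n + 1)]
    best_k = int(np.argmin(errors)) + 1
    return chain[:best_k], errors


# Synthetic, strongly collinear "spectra": three smooth peaks plus noise.
rng = np.random.default_rng(1)
wl = np.linspace(400, 1000, 120)  # hypothetical band centres (nm)
peaks = np.stack([np.exp(-((wl - c) / 60.0) ** 2) for c in (500, 680, 850)])
amps = rng.uniform(0.5, 1.5, size=(61, 3))
spectra = amps @ peaks + rng.normal(scale=0.01, size=(61, 120))
trait = 10.0 * amps[:, 1] + rng.normal(scale=0.2, size=61)

bands, errors = spa_select(spectra, trait, max_n=20)
```

Each new band is the one least explained by the bands already chosen, which is how the projections reduce multicollinearity; the RMSECV curve then decides how many of those bands are worth keeping.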

4.5. Novelty, Technical Contributions and Comparison with Previous Studies

Compared with previous studies, the major novelties and technical contributions are as follows: (1) We simultaneously integrate genomic SNP data, UAV-based population-level canopy phenotypes and greenhouse individual-level single-plant phenotypes, forming a multi-modal and multi-scale data system rarely used in rice quality research; (2) We adopt multiple feature selection methods and the Stacking ensemble model to improve prediction accuracy and alleviate the overfitting risk caused by high-dimensional features; (3) We systematically compare prediction patterns at population and individual levels, revealing the regulatory differences in genetic and environmental effects on different quality traits.
To further show the improvements of this study relative to existing research, a detailed comparison is listed in Table 2. Previous studies are classified into five types based on data sources, with obvious limitations in prediction period, germplasm quantity and data dimension. In contrast, this study realizes the combined use of genomic data and dual-scale phenotypic data from multiple environments, providing a new strategy for accurate and robust early prediction of rice quality.

5. Conclusions

This study focused on indica rice as the research subject. Under optimal water and fertilizer management conditions, whole-genome SNP data, drone-based population canopy spectral data, and multi-dimensional individual-scale data from phenotyping platforms were integrated to systematically analyze the effects of greenhouse and field environments on amylose content, protein content, chalky grain rate, and chalkiness degree. Additionally, non-destructive prediction models for quality traits were constructed using multi-scale and multi-feature sources. At the population level, the prediction model for amylose content based on SNP data achieved a maximum coefficient of determination (R2) of 0.92 with a mean absolute error (MAE) of 1.1. For chalky grain rate, the stacking ensemble method obtained an R2 of 0.89 with an MAE of 7.6. For the chalkiness degree, the stacking ensemble approach increased the R2 to 0.87 with an MAE of 2.5. The variation in amylose content was less influenced by environmental factors and primarily governed by genetic regulation. In contrast, the phenotypic performance of chalky grain rate, chalkiness degree, and protein content differed significantly between field and greenhouse environments. The variation in chalkiness traits was mainly driven by genetic, environmental, and genotype-by-environment interaction effects, while protein content was primarily regulated by genetic factors and genotype-by-environment interactions.
Model development and prediction analysis revealed that the predictive performance varied among different traits. As a genetically dominant trait, amylose content showed significantly better prediction accuracy using SNP data compared to phenotypic data. At the individual level, the Stacking Ensemble method further optimized prediction precision. For chalky grain rate and chalkiness degree, phenotypic data after feature selection achieved comparable prediction accuracy to SNP data, and the Stacking Ensemble method at the population level significantly enhanced model performance. Due to the complex regulatory mechanisms of protein content, the prediction accuracy across all feature sources and algorithms remained relatively low and has not yet reached a practical standard.
In summary, our study systematically integrated whole-genome SNP data of rice, population-scale canopy spectral data acquired by the UAV, and individual-scale morphological features, fluorescence kinetic parameters, and hyperspectral data collected by the phenotyping platform. It successfully constructed whole-growth-stage prediction models for amylose content, protein content, chalkiness degree, and chalky grain percentage. This work demonstrates that multi-source data fusion can effectively improve rice quality prediction, and the established framework has clear scientific applicability: it supports early non-destructive evaluation, high-throughput germplasm screening, and data-driven breeding decision-making for indica rice under both greenhouse and field conditions.
However, this study has several limitations. The sample size was relatively small, and the number of test environments was limited, which may restrict the generalizability of the prediction models. For future research, in addition to expanding the population size and testing across more ecological regions, multi-source data types can be further enriched, such as incorporating image features and meteorological data into the feature set. Meanwhile, more advanced ensemble strategies and deep learning architectures can be explored to further mine the complex relationships between multi-modal data, so as to achieve higher prediction accuracy and stronger model stability. Cross-year, multi-location validation and independent population testing will also be essential to ensure the reliability and practicality of the prediction models in real breeding programs.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/agriculture16070807/s1: Table S1. Collection of 61 indica rice varieties. Table S2. Base Models Selected via Optimized Stacking Ensemble for Population-Level Grain Quality Traits. Table S3. Base Models Selected via Optimized Stacking Ensemble for Individual-Level Grain Quality Traits. Table S4. Performance of Each Model for Amylose Content Prediction—Population (Field). Table S5. Performance of Each Model for Amylose Content Prediction—Individual (Greenhouse). Table S6. Performance of Each Model for Protein Content Prediction—Population (Field). Table S7. Performance of Each Model for Protein Content Prediction—Individual (Greenhouse). Table S8. Performance of Each Model for Chalky Grain Percentage Prediction—Population (Field). Table S9. Performance of Each Model for Chalky Grain Percentage Prediction—Individual (Greenhouse). Table S10. Performance of Each Model for Chalkiness Degree Prediction—Population (Field). Table S11. Performance of Each Model for Chalkiness Degree Prediction—Individual (Greenhouse). Table S12. Meteorological Conditions in Fuyang from June to October 2024.

Author Contributions

Conceptualization, S.C., J.H. and G.S.; Data curation, X.Z., J.Y., N.C. and W.Z.; Formal analysis, X.Z. and J.W.; Funding acquisition, F.Z., Y.C. and G.S.; Investigation, R.Z.; Methodology, X.Z., S.C. and Y.L.; Project administration, G.S.; Resources, S.T.; Supervision, G.S.; Validation, N.C., W.Z., J.W. and R.Z.; Visualization, X.Z.; Writing—original draft, X.Z., Y.L. and G.S.; Writing—review and editing, F.Z., Y.C., G.S., Y.L., S.C. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (2023YFD1202600), Zhejiang Provincial Natural Science Foundation of China (LDQ24C130001) and the Nanfan Special Project of CAAS (YBXM2434).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset is available on request from the authors.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yu, J.; Zhou, W.; Shao, G.; Hu, P. Gene mining and breeding utilization of important agronomic traits in rice. Chin. Sci. Bull. 2025, 70, 3126–3148.
  2. Zhang, J.; Xu, B.; Feng, H.; Jing, X.; Wang, J.; Ming, S.; Fu, Y.; Song, X. Monitoring Nitrogen Nutrition and Grain Protein Content of Rice Based on Ensemble Learning. Spectrosc. Spectr. Anal. 2022, 42, 1956–1964.
  3. Shi, S.; Zhao, D.; Pan, K.; Ma, Y.; Zhang, G.; Li, L.; Cao, C.; Jiang, Y. Combination of Near-Infrared Spectroscopy and Key Wavelength-Based Screening Algorithm for Rapid Determination of Rice Protein Content. J. Food Compos. Anal. 2023, 118, 105216.
  4. Ma, C.; Ren, Z.; Zhang, Z.; Du, J.; Jin, C.; Yin, X. Development of Simplified Models for Nondestructive Testing of Rice (with Husk) Protein Content Using Hyperspectral Imaging Technology. Vib. Spectrosc. 2021, 114, 103230.
  5. Shao, Y.; Cen, Y.; He, Y.; Liu, F. Infrared Spectroscopy and Chemometrics for the Starch and Protein Prediction in Irradiated Rice. Food Chem. 2011, 126, 1856–1861.
  6. Wang, J.; Du, Z.; Yng, Y.; Ai, Z.; Ren, G.; Zhang, D.; Fan, H.; Song, X. Rapid Detection of Rice Protein and Amylose Contents Based on Near-Infrared Spectroscopy. J. Chin. Cereals Oils Assoc. 2025, 40, 133–143.
  7. Liu, H.; Shen, T.; Zhang, W.; Shi, X.; Dai, T.; Bai, T.; Xiao, Y. Construction and Verification of a Mathematical Model for Near-Infrared Spectroscopy Analysis of Gel Consistency in Southern Indica Rice. Spectrosc. Spectr. Anal. 2021, 41, 2432–2436.
  8. Siriphollakul, P.; Kanlayanarat, S.; Rittiron, R.; Wanitchang, J.; Suwonsichon, T.; Boonyaritthongchai, P.; Nakano, K. Pasting Properties by Near-Infrared Reflectance Analysis of Whole Grain Paddy Rice Samples. J. Innov. Opt. Health Sci. 2015, 8, 1550035.
  9. Hou, Y.; Bao, H.; Rimi, T.I.; Zhang, S.; Han, B.; Wang, Y.; Yu, Z.; Chen, J.; Gao, H.; Zhao, Z.; et al. Rice Quality and Yield Prediction Based on Multi-Source Indicators at Different Periods. Plants 2025, 14, 424.
  10. Liu, N.; Guo, J.; Liu, F.; Zha, X.; Cao, J.; Chen, Y.; Yan, H.; Du, C.; Wang, X.; Li, J.; et al. Development of a Vegetation Canopy Reflectance Sensor and Its Diurnal Applicability Under Clear Sky Conditions. Front. Plant Sci. 2025, 15, 1512660.
  11. Liu, N.; Fang, Y.; Zhao, Y.; Chen, Y.; Zheng, Y.; Zha, X.; Guo, J.; Li, X. Novel Estimation of Tomato Soluble Solids Content Using Linearly Transformed Reflectance-Based Spectral Indices. Front. Plant Sci. 2026, 17, 1729375.
  12. Yan, Y.; Bao, Z.; Shao, J. Phycocyanin Concentration Retrieval in Inland Waters: A Comparative Review of the Remote Sensing Techniques and Algorithms. J. Great Lakes Res. 2018, 44, 748–755.
  13. Pancorbo, J.L.; Alonso-Ayuso, M.; Camino, C.; Raya-Sereno, M.D.; Zarco-Tejada, P.J.; Molina, I.; Gabriel, J.L.; Quemada, M. Airborne Hyperspectral and Sentinel Imagery to Quantify Winter Wheat Traits through Ensemble Modeling Approaches. Precis. Agric. 2023, 24, 1288–1311.
  14. Pancorbo, J.L.; Quemada, M.; Raya-Sereno, M.D.; Gioli, B.; Beck, P.S.A.; Camino, C. Integrating Artificial Neural Network-PROSAIL with Sentinel-2 to Monitor Crop Traits Dynamics and Nitrogen Status. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 17770–17786.
  15. Tan, C.; Zhou, X.; Zhang, P.; Wang, Z.; Wang, D.; Guo, W.; Yun, F. Predicting Grain Protein Content of Field-Grown Winter Wheat with Satellite Images and Partial Least Square Algorithm. PLoS ONE 2020, 15, e0228500.
  16. Fu, Z.; Yu, S.; Zhang, J.; Xi, H.; Gao, Y.; Lu, R.; Zheng, H.; Zhu, Y.; Cao, W.; Liu, X. Combining UAV Multispectral Imagery and Ecological Factors to Estimate Leaf Nitrogen and Grain Protein Content of Wheat. Eur. J. Agron. 2022, 132, 126405.
  17. Yan, Y.; Zhang, X.; Li, D.; Zheng, H.; Yao, X.; Zhu, Y.; Cao, W.; Cheng, T. Laboratory Shortwave Infrared Reflectance Spectroscopy for Estimating Grain Protein Content in Rice and Wheat. Int. J. Remote Sens. 2021, 42, 4467–4492.
  18. Xie, X.; Li, B.; Zhu, H. Estimating Contents of Crude Protein and Amylose Content in Rice Grain by Hyper-Spectral Under Different High Temperature Stress. Res. Agric. Mod. 2012, 33, 481–484.
  19. Wang, Z.; Tan, X.; Ma, Y.; Liu, T.; He, L.; Yang, F.; Shu, C.; Li, L.; Fu, H.; Li, B.; et al. Combining Canopy Spectral Reflectance and RGB Images to Estimate Leaf Chlorophyll Content and Grain Yield in Rice. Comput. Electron. Agric. 2024, 221, 108975.
  20. Yan, L.; Liu, C.; Zain, M.; Cheng, M.; Huo, Z.; Sun, C. Estimation of Rice Protein Content Based on Unmanned Aerial Vehicle Hyperspectral Imaging. Agronomy 2024, 14, 2479.
  21. Zhou, D.; Zhu, Y.; Yao, X.; Tian, Y.; Cao, W. Estimating Grain Protein Content with Canopy Spectral Reflectance in Rice. Acta Agron. Sin. 2007, 38, 1219–1225.
  22. Zhang, H.; Hu, H.; Chen, Y.; Tang, X.; Wu, C.; Liu, Y.; Yang, S.; Zheng, K. Estimating Nitrogen of Rice Leaf and Protein of Rice Seed Based on Hyperspectral Data. J. Nucl. Agric. Sci. 2012, 26, 135–140.
  23. Ata-Ul-Karim, S.T.; Zhu, Y.; Cao, Q.; Rehmani, M.I.A.; Cao, W.; Tang, L. In-Season Assessment of Grain Protein and Amylose Content in Rice Using Critical Nitrogen Dilution Curve. Eur. J. Agron. 2017, 90, 139–151.
  24. Chen, B.; Su, Q.; Li, Y.; Chen, R.; Yang, W.; Huang, C. Field Rice Growth Monitoring and Fertilization Management Based on UAV Spectral and Deep Image Feature Fusion. Agronomy 2025, 15, 886.
  25. Peng, X.; Wu, W.; Do, Q.; Li, P.; Zhu, M.; Liu, D.; Liu, Z.; Yu, C. Rapid and accurate estimation of rice quality in northeast China based on UAV multispectral images. J. Plant Nutr. Fertil. 2024, 30, 12–26.
  26. Zhang, J.; Lin, X.; Jiang, C.; Hu, X.; Liu, B.; Liu, L.; Xiao, L.; Zhu, Y.; Cao, W.; Tang, L. Predicting Rice Phenology Across China by Integrating Crop Phenology Model and Machine Learning. Sci. Total Environ. 2024, 951, 175585.
  27. Liang, Z.; Fu, Z.; Kiplagat, D.; Wang, W.; Yang, J.; Li, Z.; Cao, Q.; Tian, Y.; Zhu, Y.; Cao, W.; et al. Rice Yield Prediction Base on UAV Multispectral Imagery Using Machine Learning Methods. Smart Agric. Technol. 2025, 12, 101549.
  28. Longmire, A.R.; Poblete, T.; Hunt, J.R.; Chen, D.; Zarco-Tejada, P.J. Assessment of Crop Traits Retrieved from Airborne Hyperspectral and Thermal Remote Sensing Imagery to Predict Wheat Grain Protein Content. ISPRS J. Photogramm. Remote Sens. 2022, 193, 284–298.
  29. Tilse, M.J.; Bishop, T.F.A.; Filippi, P. Predicting Within-Field Grain Protein Content at Scale Using Agronomic and Remote Sensing Variables, and Machine Learning. Precis. Agric. 2025, 26, 78.
  30. Yan, Y.; Li, D.; Yao, X.; Zhu, Y.; Cao, W.; Cheng, T. Integration of Multiscale Spectral Features as the Intermediate Variables for Improved Prediction of Grain Protein Concentration in Winter Wheat. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–18.
  31. Sun, W.; He, Q.; Liu, J.; Xiao, X.; Wu, Y.; Zhou, S.; Ma, S.; Wang, R. Dynamic Monitoring of Maize Grain Quality Based on Remote Sensing Data. Front. Plant Sci. 2023, 14, 1177477.
  32. Zhang, Y.; Zhang, M.; Ye, J.; Xu, Q.; Feng, Y.; Xu, S.; Hu, D.; Wei, X.; Hu, P.; Yang, Y. Integrating Genome-Wide Association Study into Genomic Selection for the Prediction of Agronomic Traits in Rice (Oryza sativa L.). Mol. Breed. 2023, 43, 81.
  33. Xu, Y.; Zhang, Y.; Cui, Y.; Zhou, K.; Yu, G.; Yang, W.; Wang, X.; Li, F.; Guan, X.; Zhang, X.; et al. GA-GBLUP: Leveraging the Genetic Algorithm to Improve the Predictability of Genomic Selection. Brief. Bioinform. 2024, 25, bbae385.
  34. Zhao, X.; Sun, F.; Li, J.; Zhang, D.; Zhang, Q.; Liu, Z.; Tan, C.; Ma, H.; Wang, K. VMGP: A Unified Variational Auto-Encoder Based Multi-Task Model for Multi-Phenotype, Multi-Environment, and Cross-Population Genomic Selection in Plants. Artif. Intell. Agric. 2025, 15, 829–842.
  35. Cui, Y.; Li, R.; Li, G.; Zhang, F.; Zhu, T.; Zhang, Q.; Ali, J.; Li, Z.; Xu, S. Hybrid Breeding of Rice via Genomic Selection. Plant Biotechnol. J. 2020, 18, 57–67.
  36. Fialho, I.C.; Azevedo, C.F.; Nascimento, A.C.C.; Teixeira, F.R.F.; de Resende, M.D.V.; Nascimento, M. Factor Analysis Applied in Genomic Prediction Considering Different Density Marker Panels in Rice. Euphytica 2023, 219, 88.
  37. Yu, P.; Ye, C.; Li, L.; Yin, H.; Zhao, J.; Wang, Y.; Zhang, Z.; Li, W.; Long, Y.; Hu, X.; et al. Genome-Wide Association Study and Genomic Prediction for Yield and Grain Quality Traits of Hybrid Rice. Mol. Breed. 2022, 42, 16.
  38. Togninalli, M.; Wang, X.; Kucera, T.; Shrestha, S.; Juliana, P.; Mondal, S.; Pinto, F.; Govindan, V.; Crespo-Herrera, L.; Huerta-Espino, J.; et al. Multi-Modal Deep Learning Improves Grain Yield Prediction in Wheat Breeding by Fusing Genomics and Phenomics. Bioinformatics 2023, 39, btad336.
  39. Zou, Q.; Tai, S.; Yuan, Q.; Nie, Y.; Gou, H.; Wang, L.; Li, C.; Jing, Y.; Dong, F.; Yue, Z.; et al. Large-Scale Crop Dataset and Deep Learning-Based Multi-Modal Fusion Framework for More Accurate G × E Genomic Prediction. Comput. Electron. Agric. 2025, 230, 109833.
  40. Qin, Y.; Tauqir, M.; Yu, X.; Zheng, X.; Jiang, X.; Xu, N.; Zhang, J. Predicting Multiple Traits of Rice and Cotton Across Varieties and Regions Using Multi-Source Data and a Meta-Hybrid Regression Ensemble. Sensors 2026, 26, 375.
  41. McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M.; et al. The Genome Analysis Toolkit: A MapReduce Framework for Analyzing Next-Generation DNA Sequencing Data. Genome Res. 2010, 20, 1297–1303.
  42. GB/T 1354-2018; Milled Rice. State Administration of Market Regulation and Standardization Administration of China: Beijing, China, 2018.
  43. GB/T 15683-2008; Rice—Determination of Amylose Content. General Administration of Quality Supervision, Inspection and Quarantine & Standardization Administration of China: Beijing, China, 2008.
  44. Feng, X.; Li, Z.; Yang, P.; Hong, W.; Wang, A.; Qin, J.; Zhang, H.; Kem Senou, P.D.; Zhang, Y.; Wang, D.; et al. Enhance the Accuracy of Rice Yield Prediction Through an Advanced Preprocessing Architecture for Time Series Data Obtained from a UAV Multispectral Remote Sensing Platform. Eur. J. Agron. 2025, 165, 127542. [Google Scholar] [CrossRef]
  45. Pavicic, M.; Mouhu, K.; Wang, F.; Bilicka, M.; Chovanček, E.; Himanen, K. Genomic and Phenomic Screens for Flower Related RING Type Ubiquitin E3 Ligases in Arabidopsis. Front. Plant Sci. 2017, 8, 416. [Google Scholar] [CrossRef]
  46. Liu, Q.; Liu, G.; Chu, X. Comparison of the spectral characteristics of rice soybean and reed—A case study from Liaohe River Delta. Chin. J. Eco-Agric. 2006, 14, 66–69. [Google Scholar]
  47. Yao, D.; Fei, S.; Li, L.; Jia, Y.; Wang, D.; Han, T.; Zhang, B.; Yang, M.; Xiao, Y. Yield Prediction of Wheat Breeding Plots Based on a Novel Extreme Stacked Generalization Algorithm. J. Triticeae Crops 2025, 45, 1–10. [Google Scholar]
  48. Song, M.; Ma, H.; Wu, Z.; Li, T.; Yang, M.; Huang, H.; Wu, P.; Yang, D.; Xu, D.; Lu, Q. Study on the Detection of Protein Content in Wheat Seeds Based on SWG and LSG Stacked Model Machine Learning Algorithms. J. Instrum. Anal. 2025, 44, 2087–2094. [Google Scholar] [CrossRef]
  49. Jiang, J.; Zhu, J.; Wang, X.; Cheng, T.; Tian, Y.; Zhu, Y.; Cao, W.; Yao, X. Estimating the Leaf Nitrogen Content with a New Feature Extracted from the Ultra-High Spectral and Spatial Resolution Images in Wheat. Remote Sens. 2021, 13, 739. [Google Scholar] [CrossRef]
  50. Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The Successive Projections Algorithm for Variable Selection in Spectroscopic Multicomponent Analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the study design. (a) Overview of population-level feature acquisition. Feature values such as vegetation indices and texture features were extracted directly from the field. (b) Overview of individual-level feature acquisition. Feature values were acquired from a phenotyping platform via hyperspectral imaging, chlorophyll fluorescence imaging, and visible light imaging. (c) Technical roadmap for model development. Blue text indicates steps specific to building population-level models, green text indicates steps specific to building individual-level models, and black text indicates steps common to both levels. (d) Schematic diagram of the Stacking Ensemble strategy.
Figure 2. Performance of quality traits for 61 indica rice accessions under two environments. (a,b) Performance (a) and correlation analysis (b) of amylose content in greenhouse and field environments. (c,d) Performance (c) and correlation analysis (d) of protein content in greenhouse and field environments. (e,f) Performance (e) and correlation analysis (f) of chalky grain rate in greenhouse and field environments. (g,h) Performance (g) and correlation analysis (h) of chalkiness degree in greenhouse and field environments. (i–l) Phenotypic variation contribution rates for amylose content (i), protein content (j), chalky grain rate (k), and chalkiness degree (l) based on two-way ANOVA. Genetic contribution rate represents heritability. Symbols: ns = no significant difference (p > 0.05); ** = highly significant difference (p < 0.01); **** = extremely highly significant difference (p < 0.0001).
Figure 3. Analysis of genome-wide SNP genetic diversity and spectral features at different growth stages for 61 indica rice accessions. (a) SNP density distribution within 1 Mb windows across the 12 chromosomes of 61 indica rice accessions. Color gradient indicates SNP count per window. (b) Phylogenetic tree of 61 indica rice accessions constructed using the Neighbor-Joining (NJ) method. Different colored branches represent different subpopulations within the indica group. (c) Principal component analysis (PCA) of 61 indica rice accessions based on SNP data. Each point represents one rice accession. (d–j) Whole-plant spectral reflectance curves at the tillering stage (d), jointing stage (e), booting stage (f), heading stage (g), flowering stage (h), filling stage (i), and maturity stage (j). Different colored lines represent different indica rice varieties.
Figure 4. Model development and prediction analysis for amylose content. (a,b) Coefficient of determination (R2) for amylose content at the population (a) and individual (b) level based on different feature sources and feature selection methods. (c) Feature value screening using the Successive Projections Algorithm (SPA). Distribution curve showing the relationship between the number of maturity-stage hyperspectral features and the root mean squared error of cross-validation (RMSECV) for predicting individual amylose content. (d) SHAP value of the top 10 features for predicting amylose content after RFE feature selection from M3M data. (e) SHAP value of the top 10 features for predicting amylose content after LASSO feature selection from Flu-RGB data. (f) SHAP value of the top 9 features for predicting amylose content after SPA feature selection from maturity-stage hyperspectral data. The vertical axis represents the selected key features, and the horizontal axis represents the SHAP value. Color indicates the feature value magnitude, reflecting the direction, degree of contribution, and importance ranking of each core feature.
Figure 5. Model development and prediction analysis for protein content. (a,b) Coefficient of determination (R2) for protein content at the population (a) and individual (b) level based on different feature sources and feature selection methods. (c) Feature value screening using the Successive Projections Algorithm (SPA). Distribution curve showing the relationship between the number of heading-stage hyperspectral features and the root mean squared error of cross-validation (RMSECV) for predicting individual protein content. (d) SHAP value of the top 10 features for predicting protein content after RFE feature selection from M3M data. (e) SHAP value of the top 10 features for predicting protein content after RFE feature selection from Flu-RGB data. (f) SHAP value of the top 13 features for predicting protein content after SPA feature selection from heading-stage hyperspectral data. The vertical axis represents the selected key features, and the horizontal axis represents the SHAP value. Color indicates the feature value magnitude, reflecting the direction, degree of contribution, and importance ranking of each core feature.
Figure 6. Model development and prediction analysis for chalky grain rate. (a,b) Coefficient of determination (R2) for chalky grain rate at the population (a) and individual (b) level based on different feature sources and feature selection methods. (c) Feature value screening using the Successive Projections Algorithm (SPA). Distribution curve showing the relationship between the number of jointing-stage hyperspectral features and the root mean squared error of cross-validation (RMSECV) for predicting individual chalky grain rate. (d) SHAP value of the top 10 features for predicting chalky grain rate after RFE feature selection from M3M data. (e) SHAP value of the top 10 features for predicting chalky grain rate after RFE feature selection from Flu-RGB data. (f) SHAP value of the top 6 features for predicting chalky grain rate after SPA feature selection from jointing-stage hyperspectral data. The vertical axis represents the selected key features, and the horizontal axis represents the SHAP value. Color indicates the feature value magnitude, reflecting the direction, degree of contribution, and importance ranking of each core feature.
Figure 7. Model development and prediction analysis for chalkiness degree. (a,b) Coefficient of determination (R2) for chalkiness degree at the population (a) and individual (b) level based on different feature sources and feature selection methods. (c) Feature value screening using the Successive Projections Algorithm (SPA). Distribution curve showing the relationship between the number of jointing-stage hyperspectral features and the root mean squared error of cross-validation (RMSECV) for predicting individual chalkiness degree. (d) SHAP value of the top 10 features for predicting chalkiness degree after RFE feature selection from M3M data. (e) SHAP value of the top 10 features for predicting chalkiness degree after RFE feature selection from Flu-RGB data. (f) SHAP value of the top 6 features for predicting chalkiness degree after SPA feature selection from jointing-stage hyperspectral data. The vertical axis represents the selected key features, and the horizontal axis represents the SHAP value. Color indicates the feature value magnitude, reflecting the direction, degree of contribution, and importance ranking of each core feature.
Table 1. Performance of prediction models for amylose content and protein content at the population level (five-fold cross-validation with five independent repetitions).
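| Trait | Feature | Model | R2 | Pearson r | MAE | RMSE |
|---|---|---|---|---|---|---|
| Amylose Content | SNP | RFR | 0.94 ± 0.01 | 0.97 ± 0 | 1.06 ± 0.07 | 1.27 ± 0.08 |
| Amylose Content | SNP | SVR | 0.59 ± 0 | 0.77 ± 0 | 2.67 ± 0.01 | 3.42 ± 0.01 |
| Amylose Content | SNP | XGBoost | 0.68 ± 0.14 | 0.83 ± 0.07 | 1.98 ± 0.17 | 2.97 ± 0.62 |
| Amylose Content | M3M | RFR | 0.84 ± 0.08 | 0.92 ± 0.04 | 1.48 ± 0.21 | 2.1 ± 0.5 |
| Amylose Content | M3M | SVR | 0.24 ± 0.04 | 0.54 ± 0.07 | 3.68 ± 0.27 | 4.62 ± 0.13 |
| Amylose Content | M3M | XGBoost | −0.02 ± 0.11 | 0.55 ± 0.08 | 4.1 ± 0.33 | 5.36 ± 0.28 |
| Amylose Content | SNP-M3M | RFR | 0.38 ± 0.17 | 0.63 ± 0.11 | 3.07 ± 0.49 | 4.17 ± 0.56 |
| Amylose Content | SNP-M3M | SVR | 0.38 ± 0.16 | 0.63 ± 0.1 | 3.18 ± 0.28 | 4.16 ± 0.52 |
| Amylose Content | SNP-M3M | XGBoost | 0.94 ± 0.02 | 0.97 ± 0.01 | 1.11 ± 0.14 | 1.31 ± 0.17 |
| Protein Content | SNP | RFR | 0.23 ± 0.07 | 0.71 ± 0.03 | 0.56 ± 0.04 | 0.7 ± 0.03 |
| Protein Content | SNP | SVR | 0.24 ± 0 | 0.62 ± 0 | 0.57 ± 0 | 0.69 ± 0 |
| Protein Content | SNP | XGBoost | 0.15 ± 0.12 | 0.58 ± 0.12 | 0.6 ± 0.04 | 0.74 ± 0.05 |
| Protein Content | M3M | RFR | 0.34 ± 0.02 | 0.66 ± 0.02 | 0.53 ± 0.01 | 0.65 ± 0.01 |
| Protein Content | M3M | SVR | 0.19 ± 0.04 | 0.75 ± 0.07 | 0.56 ± 0.01 | 0.72 ± 0.02 |
| Protein Content | M3M | XGBoost | −0.33 ± 0.38 | 0.54 ± 0.06 | 0.72 ± 0.05 | 0.91 ± 0.13 |
| Protein Content | SNP-M3M | RFR | 0.08 ± 0.05 | 0.6 ± 0.04 | 0.65 ± 0.01 | 0.76 ± 0.02 |
| Protein Content | SNP-M3M | SVR | 0.22 ± 0.08 | 0.66 ± 0.09 | 0.58 ± 0.03 | 0.71 ± 0.04 |
| Protein Content | SNP-M3M | XGBoost | 0.21 ± 0.2 | 0.7 ± 0.01 | 0.55 ± 0.09 | 0.71 ± 0.09 |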
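Table 1 reports each metric as a mean ± spread over five repetitions of five-fold cross-validation, and some rows pair a negative R2 with a clearly positive Pearson r (for example, XGBoost on M3M features for protein content). The sketch below illustrates how such numbers can arise. It is not the authors' pipeline: the aggregation rule (pooling out-of-fold predictions within a repetition, then taking the mean and standard deviation across repetitions), the `fit_predict` hook, the placeholder one-feature least-squares model, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
import statistics


def regression_metrics(y_true, y_pred):
    """Return (R2, Pearson r, MAE, RMSE); assumes y_true is not constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = 1.0 - ss_res / ss_tot
    # Pearson r is undefined for constant predictions; report NaN in that case.
    r = cov / math.sqrt(ss_tot * var_p) if var_p > 0 else float("nan")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, r, mae, rmse


def repeated_kfold(X, y, fit_predict, k=5, repeats=5, seed=0):
    """Mean and standard deviation of each metric across repetitions.

    fit_predict(X_train, y_train, X_test) is a hypothetical hook that
    returns one prediction per test sample.
    """
    rng = random.Random(seed)
    n = len(y)
    per_repeat = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition each repetition
        folds = [idx[i::k] for i in range(k)]
        oof = [None] * n                       # out-of-fold prediction per sample
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            preds = fit_predict([X[i] for i in train],
                                [y[i] for i in train],
                                [X[i] for i in fold])
            for i, p in zip(fold, preds):
                oof[i] = p
        per_repeat.append(regression_metrics(y, oof))
    names = ("R2", "r", "MAE", "RMSE")
    return {name: (statistics.mean(col), statistics.stdev(col))
            for name, col in zip(names, zip(*per_repeat))}


def least_squares_1d(X_train, y_train, X_test):
    """Placeholder model (not RFR/SVR/XGBoost): least squares on one feature."""
    mx = sum(X_train) / len(X_train)
    my = sum(y_train) / len(y_train)
    sxx = sum((x - mx) ** 2 for x in X_train)
    slope = sum((x - mx) * (v - my) for x, v in zip(X_train, y_train)) / sxx
    return [my + slope * (x - mx) for x in X_test]


if __name__ == "__main__":
    rng = random.Random(1)
    X = [rng.uniform(0, 10) for _ in range(61)]    # synthetic feature, 61 samples
    y = [2.0 * x + rng.gauss(0, 1) for x in X]     # synthetic trait values
    for name, (mean, sd) in repeated_kfold(X, y, least_squares_1d).items():
        print(f"{name}: {mean:.2f} +/- {sd:.2f}")

    # Predictions that follow the trend but are offset give r = 1 and R2 < 0.
    print(regression_metrics([1, 2, 3, 4], [11, 12, 13, 14]))
```

Because R2 penalizes bias and scale errors while Pearson r only measures linear association, a model whose predictions rank varieties reasonably well can still score below zero on R2 when its predictions are systematically offset or over-dispersed.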
Table 2. Comparison of typical studies on crop quality prediction.
| Type | Data Source of Features | Models | Accuracy | Main Limitations |
|---|---|---|---|---|
| Type 1 [3,4,5,6,7,8] | Post-harvest grain hyperspectral data (hundreds of varieties) | PLSR | High accuracy, R2 > 0.9 | Unable to predict at the early growth stage |
| Type 2 [18,19,20,21,22,23] | Canopy data at the early growth stage (1–3 varieties) | Various machine learning and deep learning models | Relatively high accuracy, R2 > 0.8 | Limited number of varieties, often involving stress treatments (e.g., gradient nitrogen) |
| Type 3 [27,28,29,30,31] | Canopy data fused with soil pH/nutrients, plant nitrogen accumulation, meteorological data, etc. (1–3 varieties) | Various machine learning and deep learning models | Improved compared with Type 2 | Still limited varieties, dependent on specific stress treatments |
| Type 4 [35,36,37] | Genomic data of large-scale populations (hundreds to thousands of varieties) | Various machine learning and deep learning models | Large variation among traits, 0.3 < r < 0.9 | Lack of dynamic phenotypic and environmental information |
| Type 5 [38] | Genomic data combined with canopy data | Various machine learning and deep learning models | r = 0.75 | Mostly used for wheat yield prediction; rare application in rice quality prediction |
| This study | Genomic SNP data + UAV-based population canopy phenotypes/greenhouse single-plant phenotypes (61 indica rice varieties) | RFR, SVR, XGBoost, Stacking | R2 > 0.85 for key quality traits | Relatively small population size |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation: Zhang, X., Liu, Y., Yu, J., Cao, N., Zhou, W., Wu, J., Zhao, R., Tang, S., Chen, S., Chen, Y., Zhao, F., He, J., & Shao, G. (2026). Predicting Rice Quality in Indica Rice Using Multidimensional Data and Machine Learning Strategies. Agriculture, 16(7), 807. https://doi.org/10.3390/agriculture16070807