Article

Benchmarking Feature Selection Methods and Prediction Models for Flowering Time Prediction in Maize

1 State Key Laboratory of Maize Bio-Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing 100193, China
2 Sanya Institute of China Agricultural University, China Agricultural University, Sanya 572025, China
3 Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, 70599 Stuttgart, Germany
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Int. J. Mol. Sci. 2026, 27(4), 1635; https://doi.org/10.3390/ijms27041635
Submission received: 19 December 2025 / Revised: 4 February 2026 / Accepted: 5 February 2026 / Published: 7 February 2026
(This article belongs to the Special Issue Flowers: Molecular and Genetic Regulation of Growth and Development)

Abstract

Flowering time is a fundamental trait that determines crop adaptation and yield stability. To accurately predict flowering time and identify key regulatory factors, it is necessary to extract biologically meaningful signals from high-dimensional and multi-omics datasets. Although machine learning has been increasingly applied in plant genomics, there is still limited research on how feature selection (FS) methods and genomic prediction (GP) models affect the prediction of flowering time and gene discovery, particularly regarding the combination of different FS and GP approaches and the interpretability of prediction models. To address this gap, we conducted a large-scale benchmarking study that jointly evaluated seven feature selection methods and six prediction models, resulting in 42 FS–GP combinations. By integrating SNP and transcriptomic data, we assessed predictive performance and further interpreted model outputs using SHAP (SHapley Additive exPlanations) within a random forest (RF) framework to quantify feature contributions. This strategy successfully identified known flowering time regulators in maize, including ZmMADS69 and ZmRap2.7, and revealed additional candidate genes potentially involved in the flowering regulatory network. Overall, this study offers valuable insights into the genetic regulation of flowering time in maize and provides an effective framework for discovering candidate genes from multi-omics data for crop improvement.

1. Introduction

Maize (Zea mays L.) is a globally important staple crop with considerable economic value in the feed, food, industrial, and energy sectors [1,2]. It is estimated that by 2050, the global population will reach 9.5 billion. Increasing the yield per unit area and breeding improved varieties within a short period of time have therefore become essential strategies for addressing the food security challenges posed by population growth [3]. However, achieving these breeding goals has become increasingly difficult due to the impact of extreme climatic events [4]. Among the agronomic traits targeted in maize breeding, flowering time is a typical complex quantitative trait controlled by many loci and strongly influenced by environmental conditions. Flowering time represents a critical developmental stage in the plant life cycle, playing a decisive role in reproduction and seed yield, and it is considered a key selection criterion in maize breeding because it reflects the adaptability of varieties to local environmental conditions [5,6,7]. In recent years, methods such as population analysis, comparative genomics, and mutant screening have contributed to the growing number of cloned flowering time genes in maize [8,9,10]. However, research on maize flowering time still faces numerous challenges, and the discovery of new regulatory loci remains essential.
Genomic prediction (GP) has been widely recognized as a powerful tool in plant and animal breeding. It enables the prediction of individual phenotypes or complex quantitative traits through the integration of genome-wide marker information. By integrating genome-wide markers with the phenotypic data of a training population, the breeding values of candidate lines can be accurately estimated without phenotyping, whereby selection intensity can be increased, breeding costs reduced, and the process of genetic improvement accelerated [11,12,13].
With the widespread adoption of sequencing technologies, the number of markers has increased exponentially, but high-dimensional data are confronted with the “curse of dimensionality,” high collinearity, noise accumulation, and model overfitting, which also substantially increases computational cost and model training time [14]. Therefore, the identification of key features contributing most to trait variation has become increasingly important. Feature selection (FS), by removing irrelevant, redundant, or noisy features, can significantly enhance model robustness and predictive performance. While various methods for dimensionality reduction and model improvement have been systematically evaluated, several limitations remain. Most evaluations have focused on human or animal datasets, with limited validation in crop systems [15,16]. Additionally, only a limited number of feature selection methods and prediction models have been compared, making it difficult to comprehensively evaluate different strategies [16]. Furthermore, input data were typically restricted to SNP markers, with limited incorporation of transcriptomic information, thereby failing to reflect the complexity of future breeding data environments [17]. Some studies have performed feature selection and model training on the full dataset simultaneously, neglecting the strict separation of training and test sets, which may lead to overly optimistic performance assessments and reduce reliability [15]. Lastly, many methods still face challenges regarding model interpretability, limiting their practical utility in breeding decisions.
SHAP (SHapley Additive exPlanations), a widely applied explainable AI approach, has been employed to interpret black-box machine learning models [18]. Unlike traditional feature importance analyses that only provide global rankings, SHAP allows the contribution of individual features (e.g., markers or expressed genes) to be quantified for each sample. This sample-level interpretability is particularly valuable for genetically heterogeneous plant populations, where the effects of specific loci may vary substantially across individuals. This enables the understanding of not only which genes are important but also how they influence specific phenotype predictions, enhancing both model interpretability and biological relevance [19]. Although its use in large-scale plant omics data remains limited, SHAP has been successfully applied in multiple fields, highlighting its potential for crop complex trait prediction and gene identification [20].
Flowering time is a classic quantitative trait in crops, with a genetic basis comprising multiple regulatory interactions and a complex genetic architecture. Despite the increasing application of machine learning techniques in bioinformatics, systematic research on the prediction of flowering time-related genes remains limited, particularly for the integration of feature selection and model building strategies, where significant theoretical and methodological gaps remain. This study focuses on flowering time in maize and fills this gap by providing a large-scale benchmark evaluation of feature selection strategies: seven feature selection methods and six prediction models were systematically applied to SNP and transcriptomic data, and SHAP values were further used to interpret known functional genes and to identify novel candidate genes. The findings provide guidance for selecting optimal methodologies and contribute to a deeper understanding of the genetic basis of complex traits.

2. Results

2.1. Pipeline of Feature Selection Methods and Prediction Models for Complex Trait Prediction

To systematically explore the combined effects of feature selection methods and prediction models on predictive performance, we developed a robust analytical framework that integrates data processing, feature selection, model evaluation, and model explanation (Figure 1A). Special attention was given to the stability of feature selection strategies in high-dimensional omics data, along with the potential benefits of integrating SNP and transcriptome information for improving prediction accuracy. In this high-dimensional setting, the performance of integrated data was examined in depth to establish a more robust approach for complex trait modeling. First, a ten-fold cross-validation scheme was applied to the dataset to assess model performance; that is, in each iteration of the outer loop, 90% of the data were used as the training set and 10% as the test set. Within each run of model performance assessment, feature selection was performed on the training set with nine-fold cross-validation (inner loop) to identify the most important features according to each method, and the selected features were then used to train the respective prediction models. The trained models were subsequently evaluated with the test set (Figure S1). It is important to emphasize that feature selection and model evaluation are two independent steps. Therefore, the evaluation of model performance must be based on data not used for training to ensure the objectivity and reliability of the results. In addition, the SHAP value framework was applied to interpret the models and reveal the contribution of individual features to the prediction outcome.
During this process, the predictive performance of each model was systematically assessed using ten-fold cross-validation (Figure 1B). For this, the entire dataset was partitioned into ten mutually exclusive subsets; in each iteration, nine subsets were used for model training and feature selection, while the remaining subset was used for testing. This strategy effectively reduces the randomness caused by data splitting and enhances the robustness and reliability of model performance evaluation. The results obtained from the ten-fold cross-validation provide a solid foundation for subsequent model comparison and feature importance analysis.
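As a sketch, the nested scheme above can be implemented with scikit-learn as follows. The feature selector (Lasso inside `SelectFromModel`, with internal nine-fold cross-validation) and the ridge-regression stand-in for rrBLUP, as well as the simulated data, are illustrative assumptions rather than the exact implementation used in the study.

```python
# Nested CV: feature selection is refit on each outer-loop training set only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))            # lines x features (toy data)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)

outer = KFold(n_splits=10, shuffle=True, random_state=1)
pccs = []
for train_idx, test_idx in outer.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # Inner loop: LassoCV tunes its penalty by 9-fold CV on the training fold.
    selector = SelectFromModel(LassoCV(cv=9, random_state=1)).fit(X_tr, y_tr)
    model = Ridge(alpha=1.0).fit(selector.transform(X_tr), y_tr)
    y_pred = model.predict(selector.transform(X_te))
    pccs.append(pearsonr(y_te, y_pred)[0])  # accuracy on the held-out fold

print(f"mean PCC = {np.mean(pccs):.2f} +/- {np.std(pccs):.2f}")
```

Because the selector is refit inside every outer fold, the held-out fold never informs feature selection, avoiding the overly optimistic accuracy estimates noted in the Introduction.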

2.2. The Performance of Linear and Nonlinear Feature Selection on Flowering Time Prediction

Feature selection can greatly reduce computational and resource costs. But which feature selection methods can identify the most relevant features without compromising prediction accuracy? This key question underlies the present study. To address it, we integrated multiple feature selection strategies and subsequently examined their practical contributions to prediction accuracy. Specifically, we used flowering time data from 388 maize inbred lines (Table S1) as the target phenotype and seedling-stage transcriptomic profiles (Table S2) as the feature matrix. Based on this dataset, we conducted a systematic evaluation of seven representative feature selection strategies, comprising both linear approaches (Lasso and ElasticNet) and nonlinear ones (mutual information, RF, Boruta, LightGBM, and XGBoost). After feature selection, the number of features was reduced from the original 31,237 to 56–17,061, corresponding to a compression ratio of approximately 45.0–99.8%. Among all methods, Boruta achieved the most substantial reduction, retaining only 56 features, followed by Lasso with 416 features, whereas mutual information preserved the largest number of features (17,061). These results highlight substantial differences among the feature selection approaches in their ability to remove redundancy and distill informative signals (Figure 2A).
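To make the distinction concrete, the following sketch contrasts the two selector families on simulated data; the regularization strengths and importance cutoffs are assumptions chosen for illustration, not the settings used in the study (Boruta, LightGBM, and XGBoost are omitted to keep the example dependency-free).

```python
# Linear vs. nonlinear feature selection on a toy expression matrix.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1000))           # lines x expressed genes
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

# Linear selectors: features with nonzero coefficients survive.
n_lasso = int(np.sum(Lasso(alpha=0.1).fit(X, y).coef_ != 0))
n_enet = int(np.sum(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_ != 0))

# Nonlinear selectors: keep features above an importance cutoff.
mi = mutual_info_regression(X, y, random_state=0)
n_mi = int(np.sum(mi > 0.05))
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
n_rf = int(np.sum(rf.feature_importances_ > 1.0 / X.shape[1]))

for name, n in [("Lasso", n_lasso), ("ElasticNet", n_enet),
                ("MutualInfo", n_mi), ("RandomForest", n_rf)]:
    print(f"{name}: kept {n}/{X.shape[1]} "
          f"(compression {100 * (1 - n / X.shape[1]):.1f}%)")
```

The same "number retained / compression ratio" summary used in Figure 2A falls directly out of counting which features each method keeps.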
Our results showed that compared with the full set of features (RAW), all feature selection strategies except XGBoost led to performance gains when combined with rrBLUP as prediction model (Figure 2B). Among the linear approaches, ElasticNet, which integrates both L1 and L2 regularization, demonstrated superior capability in retaining relevant features and enhancing prediction stability compared to Lasso. However, among the nonlinear methods, tree-based models such as LightGBM and XGBoost exhibited notable performance differences, as LightGBM significantly outperformed XGBoost. Interestingly, features selected by RF showed robust compatibility, suggesting that some tree-based methods may still yield variables suitable for linear models.
All feature selection methods exhibited a clear linear relationship between the predicted and observed phenotypic values under the rrBLUP model, and the overall trend of the regression lines indicated a strong linear fitting ability. Among them, the ElasticNet feature selection method performed the best, achieving the highest mean Pearson correlation coefficient (PCC = 0.63 ± 0.13), while retaining an average of 3746 features. Its predicted values aligned most closely with the fitted trend line and showed the fewest outliers, demonstrating excellent predictive accuracy and stability (Figure 2C). These results suggest that appropriate feature filtering not only enables effective dimensionality reduction but also retains the key predictive signals.

2.3. Machine Learning Models Achieve Higher Accuracy for Flowering Time Prediction

To systematically investigate the effectiveness of various combinations of feature selection method and prediction model, we conducted a comprehensive comparison to evaluate their impact on prediction performance. By assessing each combination, we aimed to uncover the interactions between feature selection strategy and prediction model and provide insights for selecting the optimal pairing. We evaluated five machine learning prediction models, RF, SVMrbf, SVMpoly, LightGBM, and XGBoost, combined with seven feature selection methods. Across the different prediction models, those trained on the full set of raw features consistently outperformed rrBLUP (except for SVMpoly), indicating that machine learning algorithms are better suited to capture complex nonlinear relationships than traditional linear models (Figure 3, Table S3). After applying feature selection for dimensionality reduction, most models exhibited improved predictive ability, most notably rrBLUP, highlighting the value of feature selection in removing irrelevant features and mitigating the “curse of dimensionality”. As a result, feature selection enables certain models to capture trait-associated signals more effectively.
Among all combinations, the RF prediction model consistently demonstrated the best overall performance, followed closely by LightGBM. Notably, the ElasticNet-rrBLUP combination achieved the highest prediction accuracy (PCC = 0.63 ± 0.13), with ElasticNet-RF ranking second (PCC = 0.62 ± 0.09). Surprisingly, although Boruta retained an average of only 56 features, more than half of the tested prediction models combined with Boruta achieved higher prediction accuracy compared with using the full set of features. This suggests that Boruta selects highly informative and remarkably stable features. In contrast, combinations involving SVMpoly as prediction model generally performed the worst, and even the raw feature set failed to enable SVMpoly to capture the key predictive signals. These results underscore the importance of choosing an appropriate genomic prediction model to optimize prediction accuracy.
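A compact way to screen FS–GP pairings, in the spirit of the 42-combination grid above, is to wrap each selector and model in a pipeline so that selection is refit within every cross-validation fold. The selectors, models, and toy data below are simplified stand-ins for those benchmarked in the study.

```python
# Screen feature-selection x prediction-model combinations via pipelines.
import numpy as np
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, ElasticNet, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=120)

selectors = {
    "Lasso": SelectFromModel(Lasso(alpha=0.05)),
    "ElasticNet": SelectFromModel(ElasticNet(alpha=0.05, l1_ratio=0.5)),
}
models = {
    "Ridge": Ridge(),                 # linear stand-in for rrBLUP
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVMrbf": SVR(kernel="rbf"),
}

results = {}
for (s_name, sel), (m_name, mdl) in product(selectors.items(), models.items()):
    pipe = Pipeline([("fs", sel), ("gp", mdl)])  # FS is refit inside each fold
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    results[(s_name, m_name)] = scores.mean()

best = max(results, key=results.get)
print("best FS-GP combination:", best, f"R2 = {results[best]:.2f}")
```

Placing the selector inside the `Pipeline` is what keeps the outer comparison honest: `cross_val_score` clones and refits the whole pipeline per fold, so no test data leaks into selection.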

2.4. Performance Evaluation of Omics Data Integration for Flowering Time Prediction

Studies have shown that machine learning algorithms have a significant impact on prediction using different omics data types [21]. Here, we selected three datasets for comparative analysis: SNP data, transcriptomic data, and their combination. The results revealed that integrative multi-omics modeling showed similar performance to single-omics approaches. With individual omics data, SNP data showed overall higher predictive power than transcriptomic data, with rrBLUP, RF, and LightGBM achieving the best performance. In contrast, models based solely on transcriptomic features produced slightly lower correlations between predicted and true values, suggesting that transcriptome signals alone capture less of the phenotypic variance. When the two data types were combined, the SNP + T integration only increased the average Pearson correlation coefficient to 0.67, representing improvements of approximately 1.52% over SNP alone (Figure 4 and Figure S2).
Among the feature selection strategies, both Lasso and ElasticNet exhibited superior overall performance. They maintained relatively high PCC values, with median PCCs of 0.58 (IQR: 0.54–0.63) and 0.62 (IQR: 0.54–0.63), respectively, and ranked among the top performers in more than 70% of model–feature selection combinations. Boruta achieved moderate overall performance (median PCC = 0.55, IQR: 0.49–0.57) but demonstrated strong compatibility with the RF model, significantly reducing the number of features while still retaining the key information for prediction. Based on these findings, we propose the following optimization strategy: prioritize Lasso, ElasticNet, or Boruta for dimensionality reduction in high-dimensional multi-omics data to greatly reduce the number of features while retaining as much informative signal as possible, and then employ RF as the primary prediction model to fully leverage its robustness and interpretability in handling high-dimensional, heterogeneous data, enabling efficient, accurate, and reliable phenotype prediction (Table S4).
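A minimal sketch of the SNP + transcriptome (SNP + T) integration evaluated above is simple column-wise concatenation after per-layer standardization; the marker coding and the scaling choice are assumptions, and real pipelines may weight the two omics blocks differently.

```python
# Combine SNP and transcriptome layers into one feature matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 100
snp = rng.choice([-1, 1], size=(n, 300))   # coded marker matrix
expr = rng.lognormal(size=(n, 200))        # expression matrix

# Standardize each omics layer so neither dominates by scale alone.
def zscore(m):
    return (m - m.mean(axis=0)) / (m.std(axis=0) + 1e-9)

X_combined = np.hstack([zscore(snp), zscore(expr)])
print(X_combined.shape)
```

The combined matrix then feeds into the same FS–GP pipeline as the single-omics inputs, which is what allows a like-for-like comparison of SNP, T, and SNP + T models.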

2.5. Potential to Predict Flowering Time Genes Through SHAP Values

The feasibility of interpreting phenotypic contributions through SHAP values has been demonstrated [22]. Previous results indicated that Boruta feature selection can significantly reduce the number of features. Consequently, we first constructed the prediction models using the transcriptome and SNP variables retained by Boruta, and SHAP values were subsequently used to rank the importance of each feature (Figure 5A). The beeswarm plot visualizes the specific influence of each feature value on the RF prediction output (Figure 5B). The top-ranked SNP locus is Chr8_156249642 (GRMZM2G085438); the point color for this SNP reflects the minor allele (−1) in blue and the major allele (1) in red. The corresponding SHAP values indicate the direction of the gene’s influence: positive SHAP values indicate a tendency to delay flowering, whereas negative values indicate a tendency toward earlier flowering (Figure 5B).
From a vertical comparison perspective, Chr8_156249642, GRMZM2G171650, GRMZM2G420684, Chr8_136012624, Chr3_160559109, and Chr8_136012820 had the strongest effects on predicting maize flowering time. In addition, SHAP dependence plots are commonly applied to interpret the relationships between significant features and phenotypes [23]. For the six most impactful features, we therefore constructed SHAP dependence plots (Figure 5C–H). Notably, in Figure 5C,F,H, allele 1 of each SNP exhibited aggregated SHAP values above the baseline, indicating that this genotype tends to delay flowering time. In contrast, allele −1 showed aggregated SHAP values below zero, suggesting a suppressive effect contributing to earlier flowering. Similarly, for GRMZM2G171650 in Figure 5D, higher expression levels promoted flowering, whereas lower expression levels were associated with delayed flowering. Interestingly, GRMZM2G171650 corresponds to ZmMADS69, a well-known transcription factor regulating maize flowering time. ZmMADS69 acts as a flowering promoter, and its overexpression leads to accelerated flowering. Furthermore, Chr8_136012624 and Chr8_136012820 map to different positions within the same gene, GRMZM2G700665 (ZmRap2.7). Previous studies reported that CRISPR/Cas9 knockout mutants of ZmRap2.7 flowered approximately 2.6 days earlier than the wild type [24], further supporting the biological reliability of our SHAP interpretations. In addition, Chr8_156249642 (GRMZM2G085438) and GRMZM2G420684 emerge as strong novel candidate genes underlying flowering time variation.
Taken together, by combining Boruta feature selection with SHAP-based interpretability, our machine learning framework successfully identified both known flowering time regulators and novel candidate genes. These findings confirm the biological interpretability of the model and provide important clues for further elucidating the genetic regulatory network underlying flowering time.

3. Discussion

3.1. Molecular Basis and Key Regulators of Flowering Time in Maize

Flowering time is regarded as a key agronomic trait that determines crop adaptation and yield, and its genetic architecture is known to be highly complex. It is influenced by photoperiod, circadian rhythms, endogenous hormonal signals, and multiple developmental signals. The elucidation of the flowering time regulation network plays an important role in advancing our understanding of plant developmental mechanisms, and it also provides valuable genetic resources for improving crop adaptability in different environments. In recent years, numerous genes involved in photoperiod responsiveness and circadian regulation have been identified. For example, it was previously shown that ZmELF3.1 and ZmELF3.2 are incorporated into protein complexes with ZmELF4.1/4.2 and ZmLUX1/2 as part of the maize circadian system. When the activity of these components was impaired, substantial delays in flowering were observed under both long-day and short-day conditions [25]. These findings indicate that circadian-clock elements play a pivotal role in regulating flowering time in maize. Notably, the 388-line panel used in this study spans diverse breeding backgrounds and environmental origins, including NSS, Flint, Popcorn, SSS, Iodent, and tropical germplasm, which likely experienced distinct photoperiod regimes during development [26]. This diversity provides an ecological and evolutionary context for our findings and highlights the relevance of clock-related genes identified in this study to environmental adaptation across diverse agro-ecological settings.
It is worth emphasizing that ZmMADS69 and ZmRap2.7 were identified in this study as key features most strongly associated with flowering time, and both genes have been independently validated in previous functional analyses (Figure 5). According to previous research, the MADS-box transcription factor ZmMADS69 has been characterized as a major promotive regulator of flowering time that operates through the ZmRap2.7–ZCN8 pathway. Rather than acting directly on the florigen gene, ZmMADS69 functions upstream of the flowering repressor ZmRap2.7, suppressing its expression and consequently releasing the inhibition imposed on ZCN8. This regulation ultimately facilitates the floral transition [24].
In addition to confirmed flowering time genes, two candidate genes, GRMZM2G085438 (Chr8_156249642) and GRMZM2G420684, were also identified in this study (Figure 5). The maize gene GRMZM2G085438 is homologous to Arabidopsis thaliana AT1G77420, which encodes a member of the MAGL (monoacylglycerol lipase) family. MAGL proteins have been reported to participate in lipid catabolism, energy mobilization, and stress responses [27,28]. As lipid metabolism is closely linked to energy status, oxidized lipid-derived signaling, and the coordination of nutrient availability with developmental transitions [29], we speculate that changes in the expression or activity of GRMZM2G085438 could modulate lipid-mediated signaling or metabolic homeostasis in maize. This regulation may in turn influence the timing of floral induction, suggesting a potential role for this gene in the regulation of flowering time in maize. GRMZM2G420684 encodes a COI (Coronatine Insensitive) F-box protein, which has been shown to mediate the degradation of DELLA proteins, key repressors in the gibberellin (GA) signaling pathway [30]. In coi maize mutants, DELLA accumulation is associated with shortened internodes, delayed development, and reduced photosynthetic performance, suggesting that COI may influence the balance between growth and developmental progression through JA–GA crosstalk and DELLA turnover. These results suggest that COI may also function in regulating flowering time. Ultimately, functional validation of these candidate genes, for example by CRISPR-based gene editing and overexpression, is required to confirm the regulatory role of these genes in flowering time control in maize.

3.2. The Potential and Challenges of Machine Learning in Plant Genomic Prediction

Genomic prediction has been widely applied in both animal and plant breeding and continues to evolve alongside advances in computational methodologies. At present, no single algorithm universally performs best across all species, traits, and data structures. This limitation becomes particularly evident with the rapid expansion of high dimensional omics data, where traditional linear models struggle to capture the complex, nonlinear relationships among genotypes, phenotypes, and environments. In contrast, machine learning methods, with their stronger representational capacity, have emerged as a promising direction in genomic prediction research [31]. Although studies in livestock and aquaculture have demonstrated the predictive advantages of machine learning over classical linear models, systematic evaluations in plant breeding remain limited [32]. By comparing multiple models, including rrBLUP, RF, SVMrbf, SVMpoly, LightGBM, and XGBoost, this study further elucidates the applicability and limitations of machine learning in plant genomic prediction, providing insights for its future adoption in crop improvement.
With respect to model stability, RF and LightGBM showed consistently stable performance across various feature selection methods, whereas rrBLUP and SVMpoly were more sensitive, with large performance fluctuations (Figure 3). This robustness of RF may stem from its ensemble nature and random selection of features during tree construction, which makes it less dependent on specific feature subsets.

3.3. The Critical Role of Feature Selection in High Dimensional Omics Prediction

Previous studies have emphasized that feature selection is an effective strategy for addressing the “high-dimensional, low-sample-size” challenge. It not only reduces the number of features and the computational burden but also mitigates overfitting and improves generalization ability [16,33]. In data structures where the number of predictors far exceeds the number of training samples, appropriately reducing dimensionality can effectively alleviate the “curse of dimensionality”. Even when feature-selected models achieve prediction accuracy comparable to full-feature models, the strategy still offers practical benefits, such as guiding the development of low-density SNP arrays and reducing genotyping costs, thereby enhancing the economic feasibility of genomic selection in breeding programs [34]. In this study, Lasso and ElasticNet not only effectively reduced dimensionality but also maintained competitive prediction accuracy across different omics inputs. Starting from an initial set of approximately 332,177 markers, Lasso and ElasticNet reduced the feature space to an average of 595–6792 markers while achieving mean PCC values of 0.65–0.66, comparable to those obtained using the full marker set with rrBLUP. Boruta further demonstrated a strong capacity for dimensionality reduction under high-dimensional settings (Figure 4 and Figure S3), retaining on average 54 informative markers, corresponding to a reduction of approximately 99.98% relative to the full feature set. From a practical perspective, reductions in marker density have direct implications for genotyping costs. When marker density is reduced to half of the original set, genotyping costs are expected to decrease to approximately two thirds of the original, whereas reducing marker density to one quarter of the original set results in estimated costs of about one half of the original [35]. Therefore, the substantial marker reductions achieved by Lasso, ElasticNet, and Boruta could translate into approximately 33–50% reductions in genotyping costs with minimal loss in prediction accuracy. Collectively, these results demonstrate that appropriate feature selection enables a favorable trade-off between predictive performance and genotyping costs, facilitating the development of economically efficient and practically applicable low-density genomic prediction pipelines.

3.4. Limited Predictive Performance of Transcriptome-Based Models

Our study also revealed that prediction accuracy based on transcriptomic data was generally lower than that obtained with SNP data and showed reduced stability across different models (Figure 4). A possible explanation for this result is the developmental-stage mismatch between the transcriptomic data and the target trait. In this study, RNA-seq data were obtained from whole-seedling tissues at the V1 developmental stage, including both shoot and root tissues. Although flowering time is phenotypically expressed at later developmental stages, its genetic regulation is initiated much earlier during plant development. Transcriptomic profiles at the seedling stage can therefore capture genotype-dependent baseline transcriptional states, including genetic effects on gene expression, developmental potential, and early signaling pathways, which may be progressively amplified during subsequent development and ultimately influence flowering time. However, flowering time is governed by complex and dynamic developmental programs, and the gene expression profiles at this early stage may not adequately capture the regulatory networks directly controlling flowering time, thereby limiting their predictive power. In addition to this developmental-stage distance, transcriptomic variation is highly sensitive to environmental conditions, leading to pronounced gene-by-environment (G × E) interactions that further reduce the robustness of transcriptome-based predictions. Together, these factors likely contribute to the limited predictive performance observed for transcriptomic models. Future studies could address these limitations by incorporating transcriptomic data from developmentally relevant stages, or by integrating longitudinal expression profiles with environmental covariates. Such approaches, combined with modeling frameworks that explicitly account for G × E interactions and nonlinear regulatory relationships, may substantially enhance the predictive utility of transcriptomic data for complex trait prediction and functional gene analysis.

3.5. Model Interpretability and Gene Mining Enhanced Through SHAP

SHAP values have been widely applied in various research fields, but their application in crop genetics remains relatively limited. In recent studies, the application of SHAP values has been explored in several crops. In maize, within the Cropformer framework, SHAP values derived from XGBoost were employed to prioritize genomic loci that may influence maize tasseling time. This approach highlighted high-ranking loci associated with functional variants, demonstrating the utility of SHAP in linking model-derived feature importance with biologically relevant candidate regions [36]. In rice, SHAP values were used to predict genes contributing to drought tolerance, identifying OsRAV11 as a potential candidate. Functional validation experiments suggested OsRAV11 plays broader roles in both drought response and flowering regulation [37]. Similarly, DPCformer utilized SHAP values to identify key SNPs affecting plant height and ear weight, further confirming the practicality of SHAP in gene discovery for complex agronomic traits [38].
In this study, we systematically evaluated multiple feature selection strategies and prediction models to extract stable and biologically meaningful features from multi-omics data, and further interpreted the prediction models using SHAP within an RF framework. SHAP not only provides global assessments of feature importance but also quantifies local contributions at the sample level, enabling the identification of core regulatory factors within complex models. In addition, the RF + SHAP framework offers transparent and interpretable feature-level attributions: SHAP values are additive and grounded in Shapley theory, ensuring that the contribution of each feature to individual predictions is explicit and stable. Compared with interpretability methods for deep-learning models, RF + SHAP is computationally efficient, requiring only standard CPU resources, and highly reproducible, as feature attributions remain consistent across runs with fixed data and random seeds. This combination of transparency, efficiency, and reproducibility provides a robust approach for extracting biologically meaningful insights and informing downstream experimental validation. Through this approach, we successfully recovered known key genes, including ZmMADS69, validating the effectiveness of SHAP in both model interpretation and gene discovery (Figure 5). Nevertheless, many as-yet-unknown genes remain to be discovered. In the future, integrating more adaptive feature selection strategies, such as automated feature selection with deep neural networks or reinforcement learning–based optimization, together with functional validation experiments such as qPCR or CRISPR, may further enhance the accuracy and biological reliability of candidate gene identification. Such approaches are expected to provide more precise targets for understanding flowering time regulation and advancing crop genetic improvement.
It should also be noted that SHAP values have inherent limitations. In high-dimensional omics data, strong correlations among features are common due to linkage disequilibrium or gene co-expression, which may lead to instability in SHAP-based feature attributions, with importance being redistributed among highly correlated predictors. Therefore, SHAP results should be interpreted with caution when identifying candidate genes and drawing biological conclusions. Nevertheless, several strategies can help alleviate this issue, including grouping highly correlated features, evaluating the robustness of SHAP attributions across different cross-validation folds or resampled datasets, and prioritizing candidate genes that consistently exhibit stable contributions across data partitions. Incorporating such robustness-oriented analyses in future studies may further enhance the reliability of SHAP-based interpretations.

4. Materials and Methods

4.1. Plant Materials

A total of 388 maize inbred lines were selected from a panel of 503 maize pan-genome lines for which flowering time [26], genotype, and transcriptome [39] data were available. Maize plants were cultivated under controlled greenhouse conditions with a 16 h light/8 h dark photoperiod and day/night temperatures of 27 °C/24 °C. Each pot (30 cm top diameter, 28 cm height; 14.5 L volume) contained six plants grown in Metro-Mix 300 substrate (Sun Gro Horticulture) without additional fertilization. Whole-seedling samples, including both shoot and root tissues, were collected at the V1 developmental stage for transcriptomic profiling. The data processing procedures followed the methodology described by Azodi et al. [40]. The final genomic dataset contained 332,177 high-quality SNP markers, after the removal of unmapped loci and markers with low minor allele frequency. SNP markers were encoded using a −1/0/1 scheme, corresponding to genotypes aa, Aa, and AA, respectively, where the major-allele homozygote (AA) was coded as 1. The transcriptomic dataset consisted of 31,237 expressed genes, following filtering steps that excluded genes with missing coordinates or near-zero variance across lines. Expression values were provided as log-transformed TPM and used directly for downstream analyses.
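The −1/0/1 encoding and minor-allele-frequency filter described above can be sketched with a small hypothetical genotype table; the marker calls, the 0.05 MAF cutoff, and the array sizes below are purely illustrative, not the study's data:

```python
import numpy as np

# Hypothetical genotype calls for 5 lines x 3 markers.
calls = np.array([
    ["AA", "Aa", "aa"],
    ["AA", "AA", "Aa"],
    ["Aa", "aa", "AA"],
    ["AA", "Aa", "AA"],
    ["aa", "AA", "AA"],
])

# -1/0/1 encoding: aa -> -1, Aa -> 0, AA (major-allele homozygote) -> 1.
code = {"aa": -1, "Aa": 0, "AA": 1}
X = np.vectorize(code.get)(calls)

# Major-allele frequency per marker from the additive codes:
# (X + 1) counts copies of the major allele per genotype (0, 1, or 2).
p_major = (X + 1).sum(axis=0) / (2 * X.shape[0])
maf = np.minimum(p_major, 1 - p_major)

# Keep markers above an illustrative MAF threshold of 0.05.
X_filtered = X[:, maf >= 0.05]
```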

4.2. Methods

An overview of the methods used in this study is shown in Figure 6.

4.3. Feature Selection Methods

A ten-fold cross-validation scheme was applied to the entire dataset to assess model performance. Specifically, the full dataset was randomly partitioned into ten approximately equal-sized folds. In each iteration, one fold was held out as the test set, while the remaining nine folds were used for feature selection and model training. Model performance was evaluated on the held-out fold in each iteration, and the final prediction accuracy was reported as the average across all ten folds. In each iteration of model performance assessment (outer loop), feature selection was performed exclusively within the training set with a nine-fold cross-validation (inner loop) (Figure S1), thereby preventing information leakage and ensuring an unbiased estimation of model generalization performance. In tree-based models such as LightGBM and XGBoost, feature importance values are computed differently and often produce extremely small nonzero values (e.g., 1 × 10⁻¹⁷). To avoid retaining features that are essentially irrelevant due to numerical noise, we applied a threshold of 1 × 10⁻⁵ for these models. For other models, features with importance greater than zero were retained.
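The outer loop of this scheme can be sketched on synthetic data as follows; for brevity, a single Lasso fit stands in for the nine-fold inner feature selection loop, and all dataset sizes and parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=200, n_informative=10, random_state=42)

outer = KFold(n_splits=10, shuffle=True, random_state=42)
accs = []
for train_idx, test_idx in outer.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Feature selection on the training fold only (no test-set leakage);
    # a single Lasso fit replaces the nine-fold inner loop here.
    coef = Lasso(alpha=1.0, max_iter=10_000).fit(X_tr, y_tr).coef_
    keep = np.abs(coef) > 0
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_tr[:, keep], y_tr)
    pred = model.predict(X[test_idx][:, keep])
    # Pearson correlation between predicted and observed values on the held-out fold.
    accs.append(np.corrcoef(pred, y[test_idx])[0, 1])

print(f"mean PCC over 10 folds: {np.mean(accs):.3f}")
```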

4.3.1. Feature Selection: Mutual Information

In probability theory and information theory, the mutual information between two random variables measures their degree of interdependence. Specifically, mutual information quantifies the reduction in uncertainty about one random variable when the other is known. Mutual information is closely related to entropy, a fundamental concept in information theory that quantifies the information content of a random variable. It is widely used in computer science [41] and has also been applied to genomic prediction in previous studies [42]. In this study, the mutual_info_regression method was used to calculate the mutual information score between each feature and the target phenotype. We retained features with mutual information scores greater than zero, considering these features to make a positive contribution to the target phenotype and potentially contain key biological information. Mutual information is defined as follows:
I(X; Y) = ∬ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy
where I(X; Y) denotes the mutual information between the features X and the target representation Y. p(x, y) denotes the joint distribution of X and Y, and p(x) and p(y) denote their respective marginal distributions.
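A minimal sketch of this selection step on synthetic data (the dataset, its dimensions, and the random seed are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

X, y = make_regression(n_samples=150, n_features=50, n_informative=5, random_state=0)

# Mutual information score between each feature and the phenotype;
# random_state fixes the nearest-neighbour estimator for reproducibility.
mi = mutual_info_regression(X, y, random_state=0)

# Retain features with a strictly positive score, as described in the text.
selected = np.where(mi > 0)[0]
print(f"retained {selected.size} of {X.shape[1]} features")
```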

4.3.2. Feature Selection: Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear modeling approach that integrates variable selection and regularization, making it particularly valuable in high-dimensional data analysis. By applying L1 regularization to the regression coefficients, Lasso effectively shrinks the coefficients of irrelevant features to zero, thereby highlighting the most significant variables. This characteristic not only enhances the interpretability of the model but also reduces the risk of overfitting. This method has been widely applied in fields such as economics and finance to improve predictive performance and uncover important variables that might be overlooked by traditional approaches [43,44]. In this study, feature selection was performed with the Lasso regression model implemented in the ‘sklearn.linear_model.Lasso’ class. To ensure model convergence, we trained for a maximum of 10,000 iterations. After model fitting, the absolute values of the regression coefficients were used as feature importance scores. Features with importance values greater than zero were retained because they were considered informative for prediction. Lasso’s loss function is defined as follows:
min_w { ‖y − Xw‖₂² + α‖w‖₁ }
where y denotes the target representation vector, X denotes the genotype feature matrix, w denotes the regression coefficients, and α is a regularization parameter selected via cross-validation.
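The Lasso-based selection described above can be sketched on synthetic data as follows; alpha = 1.0 and the standardization step are illustrative choices, not the study's tuned settings:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=120, n_features=300, n_informative=8, random_state=1)
X = StandardScaler().fit_transform(X)  # common scale so the L1 penalty acts evenly

# max_iter matches the convergence setting mentioned in the text.
lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)

# |coefficient| as importance; L1 shrinkage zeroes out irrelevant features,
# so keeping importance > 0 performs the selection.
importance = np.abs(lasso.coef_)
selected = np.where(importance > 0)[0]
print(f"retained {selected.size} of {X.shape[1]} features")
```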

4.3.3. Feature Selection: ElasticNet

ElasticNet is a widely adopted regularized regression method that combines the strengths of Lasso (L1 regularization) and Ridge (L2 regularization), making it suitable for high-dimensional data and feature selection. It is particularly effective when there is multicollinearity between features or when the number of features far exceeds the number of samples. ElasticNet controls model complexity through a penalty term, preventing overfitting while retaining the important characteristics of variable selection and model interpretability. ElasticNet has been extensively used in cancer prognosis [45]. The absolute values of the regression coefficients obtained from the fitted model were used as feature importance scores. Features with importance values greater than zero were retained, as they were considered to contribute meaningfully to the model. To ensure numerical stability and convergence, the maximum number of iterations (max_iter) was set to 10,000, which helped the optimization process converge reliably and ensured reproducible feature selection results. The loss function of ElasticNet can be expressed as follows:
min_w { ‖y − Xw‖₂² + α(1 − L1_ratio)‖w‖₂² + α · L1_ratio · ‖w‖₁ }
where y denotes the target representation vector, X denotes the genotype feature matrix, and w denotes the regression coefficients.
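A small sketch on synthetic data showing how l1_ratio moves ElasticNet between the two penalties and changes how many features survive; all parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=120, n_features=300, n_informative=8, random_state=2)

# l1_ratio blends the penalties: values near 1 approach Lasso (sparse
# solutions), values near 0 approach Ridge (dense solutions).
for l1_ratio in (0.2, 0.5, 0.9):
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10_000).fit(X, y)
    n_kept = int(np.sum(np.abs(enet.coef_) > 0))
    print(f"l1_ratio={l1_ratio}: {n_kept} features retained")
```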

4.3.4. Feature Selection: RF

RF, also known as a Random Decision Forest, is an ensemble learning method widely used for classification, regression, and other tasks. It operates by constructing a large number of decision trees during training. One of the key advantages of RF is its ability to mitigate the tendency of individual decision trees to overfit the training data [46]. In recent years, RF has been increasingly applied to human genomic data analysis [47]. RF was used as a feature selection tool by leveraging the importance scores produced by the ensemble. A RandomForestRegressor with 100 trees was fit solely for the purpose of estimating feature importance, which was computed based on each feature’s contribution to impurity reduction across tree splits. Features with an importance value greater than zero were retained. This RF model was used exclusively for importance estimation rather than for phenotype prediction.
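The impurity-based selection can be sketched on synthetic data as follows (dimensions and seeds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=150, n_features=100, n_informative=6, random_state=3)

# 100 trees, as in the text; this model estimates importance only.
rf = RandomForestRegressor(n_estimators=100, random_state=3).fit(X, y)

# Impurity-based importances are normalized to sum to 1; keep any
# feature that contributed to at least one split (importance > 0).
importance = rf.feature_importances_
selected = np.where(importance > 0)[0]
print(f"retained {selected.size} of {X.shape[1]} features")
```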

4.3.5. Feature Selection: LightGBM

LightGBM (Light Gradient Boosting Machine) is a highly efficient, distributed machine learning algorithm based on the gradient boosting framework. It has gained widespread application due to its fast computation speed, low memory consumption, and strong capability in handling large-scale data. Feature selection with LightGBM was conducted using the feature importance values generated from the trained gradient boosting framework. A LGBMRegressor with 100 estimators was trained on the dataset, and feature importance was quantified based on each feature's contribution to the split gain in the boosting process. Features with importance scores exceeding 1 × 10⁻⁵ were retained. This LightGBM model was used solely for importance estimation rather than for downstream prediction.

4.3.6. Feature Selection: XGBoost

XGBoost has been proven to be an effective, feasible, and relatively mature ensemble gradient boosting machine learning algorithm. However, it may consume a large amount of memory and perform poorly on unstructured or sparse data [48]. Feature selection with XGBoost was performed based on the feature importance scores derived from the trained gradient boosting model. An XGBRegressor with 100 estimators was fitted to the training data, and the importance of each feature was computed according to its contribution to the model's split decisions. Features with importance scores greater than 1 × 10⁻⁵ were retained. This XGBoost model was used exclusively for estimating feature importance and not for phenotype prediction.
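The 1 × 10⁻⁵ cutoff applied to both boosting models can be illustrated as follows; scikit-learn's GradientBoostingRegressor stands in for LGBMRegressor/XGBRegressor here so the sketch runs without those libraries (the fit-then-read-feature_importances_ pattern is analogous), and all data and seeds are synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=150, n_features=100, n_informative=6, random_state=4)

# Stand-in for a boosted-tree regressor with 100 estimators.
gbm = GradientBoostingRegressor(n_estimators=100, random_state=4).fit(X, y)

# Tiny nonzero importances (e.g. ~1e-17) in boosted trees are numerical
# noise, so a 1e-5 cutoff is used instead of a strict "> 0" rule.
importance = gbm.feature_importances_
selected = np.where(importance > 1e-5)[0]
print(f"retained {selected.size} of {X.shape[1]} features")
```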

4.3.7. Feature Selection: Boruta

Boruta is a feature selection algorithm based on the Random Forest (RF) method. It evaluates feature importance against random controls (permuted "shadow" copies of the real features added for comparison) to identify stable and important features, demonstrating high robustness. If the importance of a real feature is significantly higher than that of the shadow features, it is considered to make a stable contribution to the target variable [49]. Boruta eliminates unimportant features through an iterative process, with a maximum of 100 iterations (max_iter = 100), to determine the final set of retained features. Boruta feature selection was implemented with the boruta Python library (version 0.4.3).
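The shadow-feature idea at the core of Boruta can be sketched in a single pass with plain scikit-learn; full Boruta iterates this comparison many times with a statistical test, and all data and dimensions here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=150, n_features=40, n_informative=5, random_state=5)

# Shadow features: each column independently permuted, so they carry
# no signal about y by construction.
X_shadow = rng.permuted(X, axis=0)
X_aug = np.hstack([X, X_shadow])

rf = RandomForestRegressor(n_estimators=100, random_state=5).fit(X_aug, y)
imp = rf.feature_importances_
real_imp, shadow_imp = imp[:40], imp[40:]

# A real feature is a (tentative) hit if it beats the best shadow;
# Boruta repeats this over iterations and confirms hits statistically.
hits = np.where(real_imp > shadow_imp.max())[0]
print(f"{hits.size} features beat the best shadow feature")
```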

4.4. Machine Learning Algorithm Models and Traditional Linear Model

This study utilized five machine learning algorithms: RF, Support Vector Machines with a Radial Basis Function kernel (SVMrbf) or a polynomial kernel (SVMpoly), LightGBM, and XGBoost. Hyperparameter tuning was conducted using a grid search strategy with carefully predefined parameter ranges, nested within the model performance assessment to avoid optimistic bias. The number of estimators (n_estimators) was fixed at 100 and was therefore excluded from the hyperparameter search. A fixed random seed (random_seed = 42) was used for the consistent partitioning of samples into training, validation, and test sets. An external 10-fold cross-validation was performed for model evaluation (outer loop), with a nine-fold cross-validation for feature selection (inner loop) within the training set (Figure S1). We repeated the 10-fold cross-validation 5 times, for a total of 50 (10 × 5) iterations per model performance assessment. In each iteration, the Pearson correlation coefficient (PCC) between the predicted and actual trait values in the test set was calculated to assess the final model's performance. Model fitting for rrBLUP was carried out with the mixed.solve procedure of the rrBLUP package (version 4.6.1) in R (version 4.5.1). The model was fitted as:
y = Xβ + Zu + ε
where β is a vector of fixed effects, X is the design matrix (n × p) for the fixed effects, in which a vector of ones models the intercept, Z is the marker incidence matrix, u is a vector of random additive marker effects, and the residual term ε captures unexplained variation.
RF was implemented using a RandomForestRegressor with 100 trees, enabling nonlinear relationship modeling through ensemble averaging. Parameters tested included max_depth (3, 5, 10) and max_features (0.1, 0.25, 0.5, 0.75, "sqrt", "log2", "None"). Gradient boosting models were implemented using XGBRegressor for XGBoost and LGBMRegressor for LightGBM. For XGBoost, max_depth was tuned over values 3, 5, and 10, while for LightGBM, max_depth (3, 5, 10) and num_leaves (31, 63) were optimized. To complement these tree-based learners, two support vector regression models were included. For SVMrbf, hyperparameter tuning was performed over the regularization parameter (C = 0.001, 0.01, 0.1, 0.5, 1, 10, 50) and the kernel width parameter (gamma = 10⁻⁵ to 10¹, logarithmically spaced). For SVMpoly, hyperparameters included the regularization parameter (C = 0.001, 0.01, 0.1, 0.5, 1, 10, 50), the polynomial degree (degree = 2, 3, 4), and the kernel coefficient (gamma = 10⁻⁵ to 10¹, logarithmically spaced). These five models collectively provide a diverse set of learning strategies covering ensemble methods, gradient boosting, and kernel-based nonlinear regression. All analyses were conducted in Python 3.9.18, using the following key libraries: xgboost (version 2.1.4), lightgbm (version 4.6.0), scikit-learn (version 1.6.1), numpy (version 1.26.4), and pandas (version 2.3.0).
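The grid search with a Pearson-correlation criterion can be sketched on synthetic data as follows; the reduced parameter grid and the simple train/test split are illustrative simplifications of the nested cross-validation scheme described above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=120, n_features=50, n_informative=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)

# Pearson correlation as the model-selection and evaluation criterion.
def pcc(y_true, y_pred):
    return np.corrcoef(y_true, y_pred)[0, 1]

# Subset of the hyperparameter ranges listed in the text.
grid = {"max_depth": [3, 5, 10], "max_features": [0.25, 0.5, "sqrt"]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42),
    grid, scoring=make_scorer(pcc), cv=5,
)
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)
print(f"test PCC: {pcc(y_te, search.predict(X_te)):.3f}")
```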

4.5. Interpretation of Candidate Genes via Explainable Artificial Intelligence (SHAP)

To elucidate the decision-making mechanism of the RF model and quantify the contribution of each feature, SHAP (SHapley Additive exPlanations) analysis was performed using TreeExplainer based on the trained models. SHAP values measure the influence of individual features on model output, capturing both the main effects of gene expression or SNPs and potential interactions among features. After model training, SHAP values were calculated for all input features and ranked by their mean absolute contributions. The top 20 features were then selected to identify the most influential molecular determinants associated with flowering time prediction.
All SHAP computations and visualizations were conducted in a Python 3.9.18 environment using the pandas, scikit-learn, shap (version 0.49.1), and matplotlib (version 3.9.4) libraries. As a well-established interpretability framework, SHAP has been widely validated in biological prediction models, with its effectiveness demonstrated in multiple studies [18,36,37].

5. Conclusions

This study systematically benchmarked the predictive performance of 42 FS–GP combinations, consisting of 7 feature selection methods and 6 genomic prediction models, for flowering time prediction. The results showed that linear methods such as Lasso and ElasticNet, as well as nonlinear methods like Boruta, are particularly effective in dimensionality reduction, significantly reducing the number of features while retaining biologically meaningful signals. In terms of prediction models, random forest performed most stably and accurately when integrating SNP and transcriptome data. The SHAP-based interpretation of the random forest model identified known flowering time regulatory genes, including ZmMADS69 and ZmRap2.7, and further revealed additional candidate genes potentially involved in the flowering regulatory network. Overall, these findings provide practical guidance for selecting effective analytical strategies and contribute to a deeper understanding of the genetic basis of flowering time. For practical applications in breeding programs, we suggest considering an ElasticNet/Lasso/Boruta → Random Forest → SHAP workflow, which aims to balance predictive accuracy with interpretability. A concise best-practice checklist can further facilitate implementation: (1) perform hyperparameter tuning on the training data to determine optimal model parameters; (2) conduct feature selection and model evaluation using 10-fold cross-validation; and (3) use SHAP values to interpret feature contributions and identify key candidate genes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms27041635/s1.

Author Contributions

Conceptualization, W.L. and X.Z.; data collection, Y.D. and N.J.; data analysis, Y.D. and N.J.; writing and editing, Y.D., N.J., Y.W., R.L., Y.L., T.W. and X.Z.; visualization, Y.D., N.J. and Y.W.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. and T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Science and Technology Innovation 2030—Major Project (2022ZD04020), the Science and Technology Innovation Team of Maize Modern Seed Industry in Hebei (21326319D), the National Key Research and Development Program of China (2022YFD1200802), and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—328017493/GRK 2366 (Sino-German International Research Training Group AMAIZE-P).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All relevant data are within the paper and its Supplementary Materials. The scripts are available in the release package on GitHub (https://github.com/Jnhcau/FS-GP, accessed on 6 February 2026).

Acknowledgments

We thank the reviewers for their valuable comments and suggestions. We thank Yusheng Zhao from the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) for providing constructive suggestions that greatly improved the manuscript. We also thank the High-performance Computing Platform of China Agricultural University for computing support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Maldonado, C.; Mora-Poblete, F.; Contreras-Soto, R.I.; Ahmar, S.; Chen, J.-T.; do Amaral Júnior, A.T.; Scapim, C.A. Genome-Wide Prediction of Complex Traits in Two Outcrossing Plant Species through Deep Learning and Bayesian Regularized Neural Network. Front. Plant Sci. 2020, 11, 593897. [Google Scholar] [CrossRef]
  2. Grote, U.; Fasse, A.; Nguyen, T.T.; Erenstein, O. Food Security and the Dynamics of Wheat and Maize Value Chains in Africa and Asia. Front. Sustain. Food Syst. 2021, 4, 617009. [Google Scholar] [CrossRef]
  3. Andorf, C.; Beavis, W.D.; Hufford, M.; Smith, S.; Suza, W.P.; Wang, K.; Woodhouse, M.; Yu, J.; Lübberstedt, T. Technological Advances in Maize Breeding: Past, Present and Future. Theor. Appl. Genet. 2019, 132, 817–849. [Google Scholar] [CrossRef]
  4. Heino, M.; Kinnunen, P.; Anderson, W.; Ray, D.K.; Puma, M.J.; Varis, O.; Siebert, S.; Kummu, M. Increased Probability of Hot and Dry Weather Extremes during the Growing Season Threatens Global Crop Yields. Sci. Rep. 2023, 13, 3583. [Google Scholar] [CrossRef]
  5. Abou-Elwafa, S.F.; Büttner, B.; Chia, T.; Schulze-Buxloh, G.; Hohmann, U.; Mutasa-Göttgens, E.; Jung, C.; Müller, A.E. Conservation and Divergence of Autonomous Pathway Genes in the Flowering Regulatory Network of Beta Vulgaris. J. Exp. Bot. 2011, 62, 3359–3374. [Google Scholar] [CrossRef] [PubMed]
  6. Büttner, B.; Abou-Elwafa, S.F.; Zhang, W.; Jung, C.; Müller, A.E. A Survey of EMS-Induced Biennial Beta Vulgaris Mutants Reveals a Novel Bolting Locus Which Is Unlinked to the Bolting Gene B. Theor. Appl. Genet. 2010, 121, 1117–1131. [Google Scholar] [CrossRef] [PubMed]
  7. Romero Navarro, J.A.; Willcox, M.; Burgueño, J.; Romay, C.; Swarts, K.; Trachsel, S.; Preciado, E.; Terron, A.; Delgado, H.V.; Vidal, V.; et al. A Study of Allelic Diversity Underlying Flowering-Time Adaptation in Maize Landraces. Nat. Genet. 2017, 49, 476–480, Erratum in Nat Genet. 2017, 49, 970. [Google Scholar] [CrossRef] [PubMed]
  8. Danilevskaya, O.N.; Meng, X.; Selinger, D.A.; Deschamps, S.; Hermon, P.; Vansant, G.; Gupta, R.; Ananiev, E.V.; Muszynski, M.G. Involvement of the MADS-Box Gene ZMM4 in Floral Induction and Inflorescence Development in Maize. Plant Physiol. 2008, 147, 2054–2069. [Google Scholar] [CrossRef]
  9. Li, Z.; Liu, X.; Xu, X.; Liu, J.; Sang, Z.; Yu, K.; Yang, Y.; Dai, W.; Jin, X.; Xu, Y. Favorable Haplotypes and Associated Genes for Flowering Time and Photoperiod Sensitivity Identified by Comparative Selective Signature Analysis and GWAS in Temperate and Tropical Maize. Crop J. 2020, 8, 227–242. [Google Scholar] [CrossRef]
  10. Sun, H.; Wang, C.; Chen, X.; Liu, H.; Huang, Y.; Li, S.; Dong, Z.; Zhao, X.; Tian, F.; Jin, W. Dlf1 Promotes Floral Transition by Directly Activating ZmMADS4 and ZmMADS67 in the Maize Shoot Apex. New Phytol. 2020, 228, 1386–1400. [Google Scholar] [CrossRef]
  11. Meuwissen, T.H.E.; Hayes, B.J.; Goddard, M.E. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 2001, 157, 1819–1829. [Google Scholar] [CrossRef]
  12. Erbe, M.; Hayes, B.J.; Matukumalli, L.K.; Goswami, S.; Bowman, P.J.; Reich, C.M.; Mason, B.A.; Goddard, M.E. Improving Accuracy of Genomic Predictions Within and Between Dairy Cattle Breeds with Imputed High-Density Single Nucleotide Polymorphism Panels. J. Dairy Sci. 2012, 95, 4114–4129, Erratum in J. Dairy Sci. 2014, 97, 6622. [Google Scholar] [CrossRef] [PubMed]
  13. Technow, F.; Bürger, A.; Melchinger, A.E. Genomic Prediction of Northern Corn Leaf Blight Resistance in Maize with Combined or Separated Training Sets for Heterotic Groups. G3 Genes Genomes Genet. 2013, 3, 197–203. [Google Scholar] [CrossRef] [PubMed]
  14. Marsh, J.I.; Hu, H.; Gill, M.; Batley, J.; Edwards, D. Crop Breeding for a Changing Climate: Integrating Phenomics and Genomics with Bioinformatics. Theor. Appl. Genet. 2021, 134, 1677–1690. [Google Scholar] [CrossRef] [PubMed]
  15. Bermingham, M.L.; Pong-Wong, R.; Spiliopoulou, A.; Hayward, C.; Rudan, I.; Campbell, H.; Wright, A.F.; Wilson, J.F.; Agakov, F.; Navarro, P.; et al. Application of High-Dimensional Feature Selection: Evaluation for Genomic Prediction in Man. Sci. Rep. 2015, 5, 10312. [Google Scholar] [CrossRef]
  16. Li, B.; Zhang, N.; Wang, Y.-G.; George, A.W.; Reverter, A.; Li, Y. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods. Front. Genet. 2018, 9, 237. [Google Scholar] [CrossRef] [PubMed]
  17. Al-Mamun, H.A.; Danilevicz, M.F.; Marsh, J.I.; Gondro, C.; Edwards, D. Exploring Genomic Feature Selection: A Comparative Analysis of GWAS and Machine Learning Algorithms in a Large-scale Soybean Dataset. Plant Genome 2025, 18, e20503. [Google Scholar] [CrossRef] [PubMed]
  18. Ning, C.; Ouyang, H.; Xiao, J.; Wu, D.; Sun, Z.; Liu, B.; Shen, D.; Hong, X.; Lin, C.; Li, J.; et al. Development and Validation of an Explainable Machine Learning Model for Mortality Prediction among Patients with Infected Pancreatic Necrosis. Eclinicalmedicine 2025, 80, 103074. [Google Scholar] [CrossRef]
  19. Madakkatel, I.; Hyppönen, E. LLpowershap: Logistic Loss-Based Automated Shapley Values Feature Selection Method. BMC Med. Res. Methodol. 2024, 24, 247. [Google Scholar] [CrossRef]
  20. Wang, P.; Lehti-Shiu, M.D.; Lotreck, S.; Segura Abá, K.; Krysan, P.J.; Shiu, S.-H. Prediction of Plant Complex Traits via Integration of Multi-Omics Data. Nat. Commun. 2024, 15, 6856. [Google Scholar] [CrossRef]
  21. Knoch, D.; Werner, C.R.; Meyer, R.C.; Riewe, D.; Abbadi, A.; Lücke, S.; Snowdon, R.J.; Altmann, T. Multi-Omics-Based Prediction of Hybrid Performance in Canola. Theor. Appl. Genet. 2021, 134, 1147–1165. [Google Scholar] [CrossRef]
  22. Alemu, A.; Åstrand, J.; Montesinos-López, O.A.; Isidro YSánchez, J.; Fernández-Gónzalez, J.; Tadesse, W.; Vetukuri, R.R.; Carlsson, A.S.; Ceplitis, A.; Crossa, J.; et al. Genomic Selection in Plant Breeding: Key Factors Shaping Two Decades of Progress. Mol. Plant 2024, 17, 552–578. [Google Scholar] [CrossRef]
  23. Yang, Z.; Song, M.; Huang, X.; Rao, Q.; Zhang, S.; Zhang, Z.; Wang, C.; Li, W.; Qin, R.; Zhao, C.; et al. AdaptiveGS: An Explainable Genomic Selection Framework Based on Adaptive Stacking Ensemble Machine Learning. Theor. Appl. Genet. 2025, 138, 204. [Google Scholar] [CrossRef]
  24. Liang, Y.; Liu, Q.; Wang, X.; Huang, C.; Xu, G.; Hey, S.; Lin, H.; Li, C.; Xu, D.; Wu, L.; et al. Zm MADS 69 Functions as a Flowering Activator through the ZmRap2.7- ZCN 8 Regulatory Module and Contributes to Maize Flowering Time Adaptation. New Phytol. 2019, 221, 2335–2347. [Google Scholar] [CrossRef]
  25. Zhao, Y.; Zhao, B.; Xie, Y.; Jia, H.; Li, Y.; Xu, M.; Wu, G.; Ma, X.; Li, Q.; Hou, M.; et al. The Evening Complex Promotes Maize Flowering and Adaptation to Temperate Regions. Plant Cell 2023, 35, 369–389. [Google Scholar] [CrossRef] [PubMed]
  26. Hansey, C.N.; Johnson, J.M.; Sekhon, R.S.; Kaeppler, S.M.; De Leon, N. Genetic Diversity of a Maize Association Population with Restricted Phenology. Crop Sci. 2011, 51, 704–715. [Google Scholar] [CrossRef]
  27. Kim, R.J.; Kim, H.J.; Shim, D.; Suh, M.C. Molecular and Biochemical Characterizations of the Monoacylglycerol Lipase Gene Family of Arabidopsis thaliana. Plant J. 2016, 85, 758–771. [Google Scholar] [CrossRef] [PubMed]
  28. Kumar, V.; Mandlik, R.; Kumawat, S.; Mahakalkar, B.; Rana, N.; Sharma, Y.; Rajora, N.; Sudhakaran, S.; Vats, S.; Deshmukh, R.; et al. Deciphering the Role of Monoacylglycerol Lipases (MAGL) under Abiotic Stress and Lipid Metabolism in Soybean (Glycine max L.). Plant Biotechnol. J. 2025, 23, 4318–4335. [Google Scholar] [CrossRef]
  29. Barnes, A.C.; Rodríguez-Zapata, F.; Juárez-Núñez, K.A.; Gates, D.J.; Janzen, G.M.; Kur, A.; Wang, L.; Jensen, S.E.; Estévez-Palmas, J.M.; Crow, T.M.; et al. An Adaptive Teosinte Mexicana Introgression Modulates Phosphatidylcholine Levels and Is Associated with Maize Flowering Time. Proc. Natl. Acad. Sci. USA 2022, 119, e2100036119. [Google Scholar] [CrossRef]
  30. Feiz, L.; Shyu, C.; Wu, S.; Ahern, K.R.; Gull, I.; Rong, Y.; Artymowicz, C.J.; Piñeros, M.A.; Fei, Z.; Brutnell, T.P.; et al. COI1 F-Box Proteins Regulate DELLA Protein Levels, Growth, and Photosynthetic Efficiency in Maize. Plant Cell 2024, 36, 3237–3259. [Google Scholar] [CrossRef]
  31. Alves, A.A.C.; Espigolan, R.; Bresolin, T.; Costa, R.M.; Fernandes Júnior, G.A.; Ventura, R.V.; Carvalheiro, R.; Albuquerque, L.G. Genome-Enabled Prediction of Reproductive Traits in Nellore Cattle Using Parametric Models and Machine Learning Methods. Anim. Genet. 2021, 52, 32–46. [Google Scholar] [CrossRef]
  32. Alves, A.A.C.; da Costa, R.M.; Bresolin, T.; Fernandes Júnior, G.A.; Espigolan, R.; Ribeiro, A.M.F.; Carvalheiro, R.; de Albuquerque, L.G. Genome-Wide Prediction for Complex Traits Under the Presence of Dominance Effects in Simulated Populations Using GBLUP and Machine Learning Methods. J. Anim. Sci. 2020, 98, skaa179. [Google Scholar] [CrossRef]
  33. He, D.; Rish, I.; Haws, D.; Parida, L. MINT: Mutual Information Based Transductive Feature Selection for Genetic Trait Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 13, 578–583. [Google Scholar] [CrossRef]
  34. e Sousa, M.B.; Galli, G.; Lyra, D.H.; Granato, Í.S.C.; Matias, F.I.; Alves, F.C.; Fritsche-Neto, R. Increasing Accuracy and Reducing Costs of Genomic Prediction by Marker Selection. Euphytica 2019, 215, 18. [Google Scholar] [CrossRef]
  35. Guo, Z.; Wang, H.; Tao, J.; Ren, Y.; Xu, C.; Wu, K.; Zou, C.; Zhang, J.; Xu, Y. Development of Multiple SNP Marker Panels Affordable to Breeders through Genotyping by Target Sequencing (GBTS) in Maize. Mol. Breed. 2019, 39, 37. [Google Scholar] [CrossRef]
  36. Wang, H.; Yan, S.; Wang, W.; Cheng, Y.; Hong, J.; He, Q.; Diao, X.; Lin, Y.; Chen, Y.; Cao, Y.; et al. Cropformer: An Interpretable Deep Learning Framework for Crop Genome Prediction. Plant Commun. 2024, 6, 101223. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Ikram, M.; Khan, N.; Zhao, N.; Derakhshani, B.; Usman, B.; Wang, H.-F. Machine Learning and Functional Validation Identify OsRAV11/12 as Negative Regulators of Drought Tolerance and Early Flowering in Rice. Int. J. Biol. Macromol. 2025, 329, 147709. [Google Scholar] [CrossRef]
  38. Deng, P.; Liu, K.; Zhou, M.; Li, M.; Yang, R.; Cao, C.; Wang, M.; Zhang, Z. DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops. arXiv 2025, arXiv:2510.08662. [Google Scholar] [CrossRef]
  39. Hirsch, C.N.; Foerster, J.M.; Johnson, J.M.; Sekhon, R.S.; Muttoni, G.; Vaillancourt, B.; Peñagaricano, F.; Lindquist, E.; Pedraza, M.A.; Barry, K.; et al. Insights into the Maize Pan-Genome and Pan-Transcriptome. Plant Cell 2014, 26, 121–135. [Google Scholar] [CrossRef] [PubMed]
  40. Azodi, C.B.; Pardo, J.; VanBuren, R.; De Los Campos, G.; Shiu, S.-H. Transcriptome-Based Prediction of Complex Traits in Maize. Plant Cell 2020, 32, 139–151. [Google Scholar] [CrossRef] [PubMed]
  41. Haghighat, M.B.A.; Aghagolzadeh, A.; Seyedarabi, H. Multi-Focus Image Fusion for Visual Sensor Networks in DCT Domain. Comput. Electr. Eng. 2011, 37, 789–797. [Google Scholar] [CrossRef]
  42. Jubair, S.; Tucker, J.R.; Henderson, N.; Hiebert, C.W.; Badea, A.; Domaratzki, M.; Fernando, W.G.D. GPTransformer: A Transformer-Based Deep Learning Method for Predicting Fusarium Related Traits in Barley. Front. Plant Sci. 2021, 12, 761402. [Google Scholar] [CrossRef]
  43. Tian, S.; Yu, Y.; Guo, H. Variable Selection and Corporate Bankruptcy Forecasts. J. Bank. Financ. 2015, 52, 89–100. [Google Scholar] [CrossRef]
  44. Coad, A.; Srhoj, S. Catching Gazelles with a Lasso: Big Data Techniques for the Prediction of High-Growth Firms. Small Bus. Econ. 2020, 55, 541–565. [Google Scholar] [CrossRef]
  45. Milanez-Almeida, P.; Martins, A.J.; Germain, R.N.; Tsang, J.S. Cancer Prognosis with Shallow Tumor RNA Sequencing. Nat. Med. 2020, 26, 188–192. [Google Scholar] [CrossRef]
  46. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, NY, USA, 2009; ISBN 978-0-387-84857-0. [Google Scholar]
  47. Ghosh, D.; Cabrera, J. Enriched Random Forest for High Dimensional Genomic Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 2817–2828. [Google Scholar] [CrossRef]
48. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery (ACM): New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  49. Kursa, M.B.; Jankowski, A.; Rudnicki, W.R. Boruta—A System for Feature Selection. Fundam. Inform. 2010, 101, 271–285. [Google Scholar] [CrossRef]
Figure 1. Schematic presentation of the framework for complex trait prediction. (A) Workflow of the prediction pipeline. (B) Schematic diagram of the ten-fold cross-validation principle for genomic prediction (GP) model performance. Blue represents the training set, and orange represents the test set.
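As a minimal illustration of the ten-fold cross-validation scheme sketched in Figure 1B, the toy example below simulates data with a single predictive feature and uses ordinary least squares in place of the paper's GP models (the data, feature, and model here are purely illustrative, not from the study); each sample is predicted exactly once from a model trained on the other nine folds, and accuracy is summarized by the Pearson correlation coefficient (PCC) between observed and predicted values.

```python
import random
import statistics

random.seed(0)

# Toy stand-in data: one predictive feature (e.g. a marker score) plus noise.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / \
        sum((u - mx) ** 2 for u in xs)
    return my - b * mx, b

def pearson(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

# Ten-fold cross-validation: each sample appears in the test set exactly once.
idx = list(range(n))
random.shuffle(idx)
folds = [idx[i::10] for i in range(10)]
pred = [0.0] * n
for fold in folds:
    held_out = set(fold)
    train = [i for i in idx if i not in held_out]
    a, b = fit_line([x[i] for i in train], [y[i] for i in train])
    for i in fold:
        pred[i] = a + b * x[i]

pcc = pearson(y, pred)
print(f"cross-validated PCC = {pcc:.2f}")
```

Because every prediction comes from a model that never saw the corresponding sample, the resulting PCC estimates out-of-sample accuracy, which is the quantity compared across FS–GP combinations in Figures 2–4.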
Figure 2. Comparison of rrBLUP across different feature selection methods. (A) Average number of features retained by each feature selection method. (B) Box plot showing the distribution of the Pearson correlation coefficient (PCC) across feature selection methods. (C) Scatter plots comparing predicted versus observed trait values for selected models. Each subplot includes a regression line (red dashed) and a diagonal reference line, illustrating the agreement between predictions and observations. RF, random forest.
Figure 3. Assessment of model performance in predicting flowering time from transcriptome-based features. Comparison of prediction performance by Pearson correlation coefficient (PCC) for combinations of seven feature selection methods and five machine learning models.
Figure 4. Performance evaluation of different feature selection methods and machine learning models across three data types (SNP, Transcriptome, and the combined SNP + Transcriptome). The heatmap on the left shows the prediction accuracy (Pearson correlation coefficient, PCC) of each feature selection–prediction model combination; white numbers mark the best-performing FS–GP combination within each data type. The violin plots on the right depict the PCC distribution for each model, with the red dashed line indicating PCC = 0.62. The violin plot at the bottom compares the PCC distributions across all FS–GP combinations. T represents transcriptome.
Figure 5. Feature importance and contribution to model output based on SHAP evaluation. (A) The top 20 features selected by the RF model, ranked by their SHAP importance scores. The x-axis represents the mean absolute SHAP value. (B) SHAP beeswarm plot illustrating how feature values influence the RF predictions. Red points indicate higher feature values/major allele (1), while blue points indicate lower feature values/minor allele (−1); positive and negative SHAP values reflect the direction and magnitude of each feature’s contribution to the model output. (C–H) Main-effect SHAP dependence plots of the top six features. These plots characterize the effect of each feature on maize flowering time across genotypes and expression levels. Points clustering above the zero baseline indicate that the genotype delays flowering time, whereas points clustering below the baseline indicate accelerated flowering time.
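To make the SHAP values underlying Figure 5 concrete, the sketch below computes exact Shapley values by brute force for a tiny hypothetical linear "model" with two features (the study itself uses the tree-based SHAP machinery within an RF framework; this toy, with made-up coefficients and background data, only illustrates the principle that each feature's SHAP value is its average marginal contribution over all feature coalitions, with absent features replaced by their background mean).

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    # Exact Shapley values: for each feature i, average the marginal
    # contribution f(S ∪ {i}) − f(S) over all coalitions S, weighting by
    # |S|!(n−|S|−1)!/n!; features outside the coalition are set to their
    # background mean.
    n = len(x)
    means = [sum(col) / len(col) for col in zip(*background)]
    def v(S):
        z = [x[i] if i in S else means[i] for i in range(n)]
        return f(z)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi.append(total)
    return phi

# Hypothetical linear model: prediction = 2*feature0 - feature1
f = lambda z: 2 * z[0] - z[1]
background = [[0.0, 0.0], [2.0, 2.0]]  # background mean is (1, 1)
phi = shapley_values(f, [3.0, 1.0], background)
print(phi)  # [4.0, 0.0]: feature0 pushes the prediction up by 4, feature1 contributes nothing
```

For a linear model this reproduces the known closed form phi_i = b_i·(x_i − mean_i), and the values sum to f(x) − f(mean), which is the additivity property that makes plots like Figure 5B,C–H interpretable as per-sample contributions.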
Figure 6. An overview of the methods used in this study.

Share and Cite

MDPI and ACS Style

Du, Y.; Jia, N.; Wang, Y.; Li, R.; Lu, Y.; Würschum, T.; Zhu, X.; Liu, W. Benchmarking Feature Selection Methods and Prediction Models for Flowering Time Prediction in Maize. Int. J. Mol. Sci. 2026, 27, 1635. https://doi.org/10.3390/ijms27041635

