A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis

Abstract: This study proposes a hybrid gene selection method to identify and predict key genes in Arabidopsis associated with various stresses (including salt, heat, cold, high-light, and flagellin), aiming to enhance crop tolerance. An open-source microarray dataset (GSE41935) comprising 207 samples and 30,380 genes was analyzed using several machine learning tools, including the synthetic minority oversampling technique (SMOTE), information gain (IG), ReliefF, and least absolute shrinkage and selection operator (LASSO), along with various classifiers (BayesNet, logistic, multilayer perceptron, sequential minimal optimization (SMO), and random forest). We identified 439 differentially expressed genes (DEGs), of which only three were down-regulated (AT3G20810, AT1G31680, and AT1G30250). The performance of the top 20 genes selected by IG and ReliefF was evaluated using the classifiers mentioned above to classify stressed versus non-stressed samples. The random forest algorithm outperformed the other algorithms, with an accuracy of 97.91% and 98.51% for IG and ReliefF, respectively. Additionally, 42 genes were identified from all 30,380 genes using LASSO regression. The top 20 genes from each feature selection method were compared to determine three common genes (AT5G44050, AT2G47180, and AT1G70700), which formed a three-gene signature. The efficiency of these three genes was evaluated using random forest and XGBoost algorithms. Further validation was performed using an independent RNA-seq dataset and random forest. These gene signatures can be exploited in plant breeding to improve stress tolerance in a variety of crops.


Introduction
The yield and nutritional quality of a crop are impacted by the different stresses experienced by plants during growth. Such stresses can be broadly classified into two groups, biotic and environmental (abiotic) [1]. Plants usually respond to stress through complicated molecular mechanisms, such as changes in the transcriptome and regulatory networks [2]. In severe cases, irreversible damage and plant death can occur if the stress exceeds the plant's tolerance threshold [2]. These threshold limits are encoded in and determined by the plant's genetic makeup.
Advances in high-throughput gene expression technologies, such as microarray platforms, have offered a new pathway for the identification of key genes involved in plant responses to specific stress conditions [3]. Because plants can activate both stress-specific and general stress response networks to adapt to various stressors [2], identifying genes that play a general role in stress response can facilitate the development of stress-tolerant cultivars through genetic engineering [4].
Algorithms 2023, 16, x FOR PEER REVIEW
... these findings, the synergy between feature selection methods, and the potential applications in breeding programs, as well as limitations and future work. Finally, Section 4 summarizes the major findings of the research.

Materials and Methods
A flowchart providing an overview of the data analysis process used in this study is presented in Figure 1, which we describe in detail in this section.

Microarray Data
The microarray data were accessed on 25 June 2022 from the Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) under the record number GSE41935. The experiment included 207 samples (arrays) and 59 unique experiments (treatments), including 10 genotypes of A. thaliana subjected to salt, temperature, high light (HL), flagellin (FLG), and their combinations [7]. We extracted the expression set using the GEOquery R package (Version 2.62.2). The sample information (phenotype data) was obtained from the expression set of the series matrix file. To identify the differentially expressed genes (DEGs) out of 30,380 microarray genes (probes), we used the limma R package (Version 4.1.2) together with the false discovery rate (FDR) method (FDR was set to 0.01). Gene expression groups showing empirical Bayes moderated p-values < 0.01 were considered differentially expressed. The identified list comprised 439 DEGs used for further analysis.
Differential expression (DE) analysis has been employed to analyze gene expression profiles and uncover the underlying mechanisms that govern tolerance to stress in Arabidopsis [17,18]. Given that gene expression profiles are often high-dimensional matrices encompassing thousands of strongly correlated genes (including hundreds of highly correlated DEGs), it is impractical to utilize all DEGs to pinpoint genes that transcriptionally respond to stress in plants. Consequently, data-driven approaches have been extensively employed to discern gene signatures from gene expression data in plants [19].
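The FDR filtering step described above can be illustrated with a minimal sketch. It assumes that the FDR method refers to the Benjamini-Hochberg adjustment; the function names, gene labels, and p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(gene_ids, p_values, fdr=0.01):
    """Keep genes whose adjusted p-value falls below the FDR cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [g for g, q in zip(gene_ids, adjusted) if q < fdr]


# Hypothetical moderated p-values for five probes (illustration only).
genes = ["g1", "g2", "g3", "g4", "g5"]
pvals = [0.001, 0.008, 0.039, 0.041, 0.9]
print(select_degs(genes, pvals))
```

In this toy input only the first probe survives the 0.01 cutoff, because the adjustment scales each p-value by the number of tests divided by its rank.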

Class Imbalance
This study's samples were binary and categorized as either control or stress. Among the 207 samples introduced in Section 2.1, 32 were control and 175 were stress, indicating class-imbalanced data. In such cases, using standard classification methods could lead to bias toward the majority class, potentially increasing the misclassification rate in the minority class.
To address this issue, a technique known as oversampling was employed. Oversampling generates synthetic samples in the minority class to balance the dataset. Details on the principle of oversampling can be found elsewhere [20]. In the present work, the synthetic minority oversampling technique (SMOTE) [20] was applied in WEKA software (Machine Learning Group, University of Waikato, New Zealand) [21] to oversample the minority class. The number of nearest neighbors (K) and the random seed value were kept at their defaults (i.e., 5 and 1, respectively). The percentage parameter, which determines the amount of oversampling, was set to 400.
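The oversampling step can be sketched as follows. This is a simplified illustration of the SMOTE idea rather than the WEKA implementation: each synthetic sample is placed at a random point on the segment between a minority sample and one of its K nearest minority neighbors. The control profiles below are randomly generated placeholders; only the parameter values (K = 5, seed = 1, percentage = 400) and the 32 control samples come from the text.

```python
import random


def smote(minority, percentage=400, k=5, seed=1):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    rng = random.Random(seed)
    per_sample = percentage // 100  # synthetic samples generated per original sample
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j])))
        neighbours = others[:k]
        for _ in range(per_sample):
            nb = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# 32 hypothetical control profiles with three features each (placeholder data).
data_rng = random.Random(0)
controls = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(32)]
synthetic = smote(controls)
balanced_controls = controls + synthetic
print(len(synthetic), len(balanced_controls))
```

With a percentage of 400, each of the 32 control samples yields four synthetic samples, giving 128 synthetic and 160 total control samples, which is close to the 175 stress samples.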

Feature Selection Methods
Feature selection is an essential step in the analysis of large datasets that reduces the dimensionality of the data by removing redundant features and selecting the most important ones [15]. This study utilized three common feature selection methods for gene expression analysis, LASSO, IG, and ReliefF, which are briefly described below.

Least Absolute Shrinkage and Selection Operator
LASSO is a regression-based feature selection method [22] commonly used in gene expression analysis. In this algorithm, a set of informative genes is selected by shrinking regression coefficients to zero in the linear regression model [23]. LASSO can be defined as follows:
\[ \min_{\beta_0,\,\beta} \; \frac{1}{N}\sum_{i=1}^{N} \omega_i\, l(y_i, \eta_i) + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right], \qquad \eta_i = \beta_0 + \beta^{T}x_i \tag{1} \]
where $l(y_i, \eta_i)$ is the negative log-likelihood contribution for observation $i$, $\omega_i$ represents the weight for observation $i$, and $y_i$ is the observed response for observation $i$, while the predicted response is given by $\beta_0 + \beta^{T}x_i$. The elastic net penalty is controlled by $\alpha$; setting $\alpha = 1$ (the default) gives the LASSO penalty. The parameter $\lambda$ is the tuning parameter that controls the overall penalty strength. It is known that LASSO tends to pick one of the coefficients of correlated predictors and discard the others. Further details on the principle and operation of the LASSO analysis method can be found elsewhere [22]. In the present study, a 10-fold cross-validation was performed on the gene expression profile using the glmnet package in R (version 4.1.2), with alpha set to 1 and λ adjusted to select the optimal number of genes.
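How the L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. This is not the glmnet implementation: it assumes a squared-error loss, unit observation weights, α = 1, and a fixed λ, and the two-feature dataset is a toy example constructed for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values in [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * sum((y - b0 - X b)^2) + lam * sum(|b|) by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the current fit with feature j left out.
            r = [y[i] - b0 - sum(X[i][l] * beta[l] for l in range(p) if l != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
        b0 = sum(y[i] - sum(X[i][l] * beta[l] for l in range(p)) for i in range(n)) / n
    return b0, beta


# Toy data: y depends only on the first (centred) feature; the second is irrelevant.
x1 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
x2 = [0.5, -0.3, 0.1, -0.2, 0.4, -0.5]
X = [[a, b] for a, b in zip(x1, x2)]
y = [1 + 2 * a for a in x1]
b0, beta = lasso_coordinate_descent(X, y, lam=0.1)
print(b0, beta)
```

In this toy case the irrelevant feature receives a coefficient of exactly zero, while the informative coefficient is shrunk slightly below its unpenalised value of 2. Increasing λ far enough removes both features, which is the behaviour exploited when λ is tuned to control the number of selected genes.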

Information Gain
IG is an entropy-based measure applied in gene selection to rank genes based on their IG value. A higher IG value indicates that the gene contributes more relevant information to the dataset [24].
IG of gene Y can be calculated as follows:
\[ IG(Y) = \mathrm{entropy}(N) - \mathrm{entropy}_Y(N) \tag{2} \]
where
\[ \mathrm{entropy}(N) = -\sum_{i=1}^{k} P(C_i, N)\,\log P(C_i, N) \tag{3} \]
\[ \mathrm{entropy}_Y(N) = \sum_{j=1}^{m} \frac{|N_j|}{|N|}\,\mathrm{entropy}(N_j) \tag{4} \]
Let $N$ represent the instances assigned to $k$ classes, and let $P(C_i, N)$ be the proportion of $C_i$ in $N$, where $C_i$ is the set of instances belonging to the $i$th class, $i = 1, 2, \ldots, k$. Assuming gene Y has the distinct values $V = \{v_1, v_2, \ldots, v_m\}$ and $N_j = \{n \in N \mid Y = v_j\}$, $\mathrm{entropy}_Y(N)$ can then be calculated [24] using Equation (4). Further details on the principle and operation of the IG analysis method can be found elsewhere [24].
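The entropy-based ranking can be made concrete with a short sketch. It assumes that expression values have already been discretized into categories (here "high" and "low") and that the logarithm is taken to base 2; both are assumptions for illustration, and the two toy genes are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the entropy remaining after splitting on the gene's values."""
    n = len(labels)
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(v, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = ["stress"] * 4 + ["control"] * 4
gene_a = ["high"] * 4 + ["low"] * 4   # separates the classes perfectly
gene_b = ["high", "low"] * 4          # carries no class information
print(information_gain(gene_a, labels), information_gain(gene_b, labels))
```

A gene whose discretized expression splits the samples cleanly by class reaches the maximum gain (one bit for two balanced classes), whereas a gene distributed identically in both classes has a gain of zero.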

ReliefF
ReliefF is another feature selection tool with high discriminatory power among different classes in microarray gene expression data [25]. In this approach, each gene is assigned a feature weight, ranging from −1 as the worst to +1 as the best, based on its feature statistics [16].
For each instance, the ReliefF algorithm identifies the K nearest neighbors from the same class (nearest hits, H) and the K nearest neighbors from each of the different classes (nearest misses, M). It then updates the quality estimation $W_i$ for each gene $i$ based on the difference in values of the gene between the instance and its nearest hits and misses. $W_i$ is estimated using Equation (5), where $n$ is the total number of instances in the dataset; $k$ is the number of nearest neighbors considered; $D_H$ (or $D_{M_C}$) represents the sum of the distances between the selected instances and the nearest hits H (or misses $M_C$) for feature $i$; $P_C$ is the probability of class C; and $W_i$ is the weight of feature $i$, which represents the importance of that feature in distinguishing between different classes. Detailed discussions on the ReliefF algorithm can be found elsewhere [26].
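The hit-and-miss update described above can be sketched in a few lines. This is a simplified illustration, not the WEKA implementation: it visits every instance once, scales feature differences by each feature's range, and weights misses by the class prior normalized by one minus the prior of the instance's own class, which is the common formulation and is an assumption here. The six-sample dataset is a toy example.

```python
def relieff(X, y, k=2):
    """Return one weight per feature; larger weights indicate better class separation."""
    n, p = len(X), len(X[0])
    # Feature ranges, used to scale per-feature differences to [0, 1].
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(p)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(p))

    priors = {c: y.count(c) / n for c in set(y)}
    w = [0.0] * p
    for i in range(n):
        for c in priors:
            # k nearest neighbours of instance i within class c.
            cand = [m for m in range(n) if m != i and y[m] == c]
            cand.sort(key=lambda m: dist(X[i], X[m]))
            near = cand[:k]
            if c == y[i]:
                scale = -1.0  # nearest hits lower the weight
            else:
                scale = priors[c] / (1.0 - priors[y[i]])  # nearest misses raise it
            for j in range(p):
                w[j] += scale * sum(diff(j, X[i], X[m]) for m in near) / (n * k)
    return w


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relieff(X, y)
print(weights)
```

The informative feature receives a positive weight because its values differ more between classes than within them, while the uninformative feature receives a negative weight.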

Identification and Validation of Gene Signature
IG and ReliefF were performed on the DEGs' expression profile in WEKA [21], with the threshold left at its default value of −1.7976 × 10^308; features with weights below the threshold are filtered out, so this default effectively retains all ranked features. In each case, the 20 top-ranked genes were selected for further discriminant analysis (Section 2.4). BioVenn (https://www.biovenn.nl/ (accessed on 25 June 2022)), a web application for comparing and visualizing biological lists, was used to identify and visualize the distribution and intersection of genes among the feature selection methods.
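The intersection step performed with BioVenn amounts to a set intersection of the top-ranked lists. A minimal sketch follows; the function name and the short ranked lists are hypothetical placeholders, and a cutoff of 3 is used only to keep the example small (the study uses the top 20).

```python
def common_genes(*ranked_lists, top_n=20):
    """Return the genes present in the top_n entries of every ranked list."""
    sets = [set(genes[:top_n]) for genes in ranked_lists]
    return sorted(set.intersection(*sets))


# Hypothetical rankings from three feature selection methods.
ig_rank = ["g1", "g2", "g3", "g4"]
relieff_rank = ["g2", "g3", "g5", "g1"]
lasso_rank = ["g3", "g2", "g6", "g7"]
print(common_genes(ig_rank, relieff_rank, lasso_rank, top_n=3))
```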
Common genes among the three methods were selected as potential biomarkers to demonstrate the efficacy of feature selection methods in identifying important genes involved in biotic and abiotic stresses in Arabidopsis. A 10-fold cross-validation was performed using the randomForest and XGBoost packages in R. The ROCR package was employed to determine accuracy and receiver operating characteristics (ROC). Ultimately, the values of the area under the ROC curve (AUC) were considered to assess the efficiency of the selected key predictive genes [27]. The out-of-bag (OOB) error, also called the OOB estimate, was reported to measure the prediction error of random forests, which use bootstrap aggregating (bagging). Bagging creates training samples by sampling with replacement, allowing the model to learn from different data combinations. The OOB error is the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample [28]. Mean decrease accuracy, mean decrease Gini, and the Boruta algorithm from the Boruta package [29] were also adopted to rank feature importance. In the Boruta analysis, each feature is labeled as confirmed, tentative, or rejected. These labels indicate whether a gene was considered important, unclear, or not essential, respectively, for the classification.
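The AUC used to assess the gene signature can be read as the probability that a randomly chosen stress sample receives a higher score than a randomly chosen control sample. The sketch below computes it by direct pairwise comparison; it is an illustration rather than the ROCR implementation, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (stress), 0 for the negative class (control).
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three stress and two control samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
print(roc_auc(scores, labels))
```

Here five of the six stress-control pairs are ordered correctly, so the AUC is 5/6; a value of 1 would mean every stress sample outranks every control sample.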

Validation of Gene Signature Using External Dataset
To validate the three-gene signature, an RNA-seq experiment, GSE158444, was selected [30]. This dataset comprises the transcriptome response to heat stress in 148 Arabidopsis samples. Counts were downloaded from GEO, and a random forest with 10-fold cross-validation was fitted using randomForest in R, as explained in Section 2.3.4.

Discrimination Analysis
Discrimination analysis was performed using the WEKA software to assess the ability of the selected genes to discriminate between samples in the stress and control classes [21]. The gene expression matrix associated with the 20 selected genes was used to construct classification models, with each row representing 1 of the 207 samples.
A preliminary screening was conducted using various methods to identify the most effective classifiers for discriminating between the control and stress classes. Ultimately, the BayesNet, logistic, multilayer perceptron, sequential minimal optimization (SMO), and random forest classifiers were selected for the discrimination analysis. A detailed discussion on the principle and operation of these classifiers can be found elsewhere [21]. The parameters of the different classifiers are provided in Table 1. The discrimination analysis was conducted using a 10-fold cross-validation approach. The performance of the classifiers was evaluated using various metrics, including confusion matrices, TP (true positive) and FP (false positive) rate values, precision, F-measure, ROC area, precision-recall (PRC) area, and Matthews correlation coefficient (MCC). Equations (6)-(13) define these evaluation metrics.
In these equations, TN represents true negative and FN indicates false negative.
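The count-based metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses their standard definitions for the positive class; the function name and the counts are hypothetical and are not taken from Tables 2 or 3.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for the positive class of a 2x2 confusion matrix."""
    tp_rate = tp / (tp + fn)            # also called recall or sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells symmetrically, which makes it informative when the two classes differ in size.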

Results and Discussion
Stress tolerance is crucial for crop plants to survive under adverse environmental and pathological conditions. Despite numerous studies on the molecular mechanisms behind stress tolerance in the model plant Arabidopsis, distinct biomarkers for germplasm screening and breeding for stress tolerance have remained limited. In this study, we aimed to identify key genes involved in the stress response in Arabidopsis and to evaluate the efficiency of different gene selection methods and classifiers.

Identification of Differentially Expressed Genes
Based on empirical Bayes statistics for differential expression and an adjusted p-value ≤ 0.01, 439 genes were found to be DEGs between control and stress conditions. The ranking of the top 20 DEGs by the absolute value of the log fold change is presented in Supplementary Table S1. Only three genes (AT3G20810, AT1G31680, and AT1G30250) were down-regulated; the remaining genes were up-regulated. The top three up-regulated genes are AT4G27310, AT4G36010, and AT4G25480.

SMOTE Balancing and Feature Selection
SMOTE was applied to generate synthetic data in the control group, resulting in 160 control samples compared with the original 32 control samples. IG and ReliefF were then used to select important genes from the 439 DEGs.
Gene expression profiles of all probes were mined through LASSO regression analysis to identify the key genes involved in various stresses in Arabidopsis. The LASSO model fitted to the gene expression data is given in Figure 2a. Each curve corresponds to a variable and shows the path of its coefficient against the L1 norm of the whole coefficient vector as λ varies. The 42 genes with non-zero coefficients (Supplementary Table S2) were obtained by the 10-fold cross-validation of LASSO presented in Figure 2b. The top 20 genes from each feature selection method were selected for further analysis.
To narrow down the list of important genes, we performed feature selection using three different methods: ReliefF, IG, and LASSO. ReliefF and IG are filter-based methods that rank genes based on their relevance to the classification task, while LASSO is an embedded method that selects a subset of genes by shrinking regression coefficients to zero during model fitting. Comparisons between ReliefF and IG in terms of accuracy and effectiveness for all classifiers tested are presented in Section 3.3 to further elucidate the role and capability of these methods in identifying genes involved in the stress response in Arabidopsis.

Machine Learning Classification
Table 2 shows the confusion matrix and classification performance of the top 20 genes selected based on IG. Accuracy as well as eight further measurements are presented in Table 2. Different classifiers were used to evaluate the performance of the 20 genes selected by the IG algorithm, resulting in relatively high accuracy. The accuracy ranged from 95.22% for the logistic and SMO classifiers to 97.91% for random forest. The average accuracy across all five classifiers was 96.24%. Moreover, the relative efficiency of random forest over the other classifiers is reflected across all performance parameters (Table 2). Despite similar results for logistic and SMO in terms of TP rate, FP rate, precision, recall, F-measure, and MCC, the logistic algorithm provided a better ROC and PRC area, demonstrating better performance than SMO (Table 2).
With the ReliefF selection method, the average accuracy was 97.016% (Table 3), which was higher than that obtained with the IG algorithm. Random forest and multilayer perceptron shared the highest accuracy of 98.51%, while the minimum accuracy was obtained by BayesNet (94.93%). Overall, random forest performed slightly better when all parameters are considered together (Table 3). Random forest was the best-performing classifier among all of the classifiers tested, providing the highest accuracy and other evaluation metrics for both ReliefF and IG (Tables 2 and 3). The effectiveness of random forest may be due to its ability to handle large datasets with many variables and automatically balance datasets, making it suitable for complex tasks [31], as previously demonstrated in other studies [32].

Selection and Validation of Key Genes
We selected the 20 top-ranked genes by LASSO, IG, and ReliefF to find the key genes common to the three methods (Figure 3). The intersection of the three top-20 lists identified three common genes: AT5G44050, AT2G47180, and AT1G70700 (Figure 3). A random forest algorithm was implemented to evaluate the performance of this three-gene signature. The OOB error was estimated to be 8.7% (Figure 4a). The ROC curve plots the true positive rate against the false positive rate, so the AUC was used as the primary measure of classification performance. The AUC of the three-gene set was 0.921875, indicating that the gene set is excellent at discriminating control samples from those subjected to various types of stress (Figure 4b); these genes can therefore be proposed as potential biomarkers for stress tolerance in Arabidopsis (Figure 4). Further, the mean decrease accuracy and mean decrease Gini of each gene in the random forest algorithm were measured (Figure 4c). AT2G47180 appears to make the largest contribution to the model, followed by AT5G44050 and AT1G70700. The same ranking was obtained by the Boruta algorithm, confirming the contribution of the three-gene signature. In comparison, the XGBoost classification model exhibited superior performance, with an accuracy of 99.1%, a sensitivity of 98.76%, and a specificity of 99.43%. XGBoost has been reported to perform better in terms of efficiency and performance relative to other classifiers [33].

External Validation of Three-Gene Signature
As depicted in Figure 5, the OOB error rate diminishes as the number of trees increases, ultimately settling at 6.02% with 500 trees (Figure 5a). The achieved AUC was 0.9898, underscoring the strong predictive capability of the three-gene signature for heat stress in Arabidopsis.
AT5G44050, located on chromosome 5, encodes the MATE efflux family protein, also known as MRH10.16. The active function of MATE genes in enhancing general stress tolerance has previously been reported in other crops, including OsMATE1 and OsMATE2 in rice [34] and DTX/MATE in cotton [35]. Ref. [36] reported 174 A MATE families in four Cucurbitaceae species coping with severe salt stress.
AT2G47180 encodes galactinol synthase 1 (GolS1), which has been reported to be induced by drought and high-salinity stresses in Arabidopsis [37]. Ref. [38] reported the GolS1 promoter as a potential biosensor for heat stress and fungal infection in Arabidopsis. Another study revealed that GolS1 expression is regulated by other stressors, including ionic, osmotic, and heat stresses [39].
AT1G70700 encodes a TIFY domain/divergent CCT motif family protein (TIFY7). TIFY7, also known as JAZ9 (Jasmonate-Zim-Domain Protein 9), plays an important role when plants are subjected to various stresses. The role of the TIFY family in response to various stresses has been reported in different species, e.g., tomato (Solanum lycopersicum) [40], wheat (Triticum aestivum) [41], and rice (Oryza sativa) [42].

Limitations and Future Work
This study has made progress in identifying crucial genes related to the stress response in Arabidopsis and provides valuable insights for the future breeding of stress-tolerant crops. However, some limitations and directions for future work must be acknowledged.
The study was conducted with the available datasets, which might not include all types of environmental stresses or Arabidopsis cultivars. Expanding the analysis to more diverse conditions and genotypes could provide a more comprehensive understanding. Moreover, the use of SMOTE to oversample the control samples may have implications for the analytical process. Although the DEGs were identified before SMOTE was applied, which preserved the original integrity of the data, the oversampling process may still introduce specific biases or effects. This complexity underscores the need for caution and may serve as an avenue for future investigations, possibly leading to refinements in the methodology.
Machine-learning-based models may have some limitations, including the reliability of data resources, different protocols for data collection and gene expression experiments, and heterogeneity of the phenotypes. These factors might negatively affect the accuracy and predictability of the biomarkers identified through machine learning. Additionally, the choice of feature selection methods and classifiers can have a significant impact on the final result. Exploring additional machine learning algorithms and techniques, such as CatBoost, a gradient boosting framework, could offer further insights into the selected features and enhance the predictive performance of the models.
While computational methods offer valuable insights, experimental validation of the identified genes in real plant systems is paramount. Field, greenhouse, and laboratory experiments are pivotal for the validation and verification of these biomarkers, promising tangible results for breeding programs. The proposed biomarkers can be authenticated using real-time qPCR. Moreover, the integration of advanced imaging and spectroscopy technologies offers a nuanced perspective on stress responses, especially in the context of agri-food quality monitoring [43][44][45]. As we move towards practical applications in agriculture, considerations such as cost, scalability, and the ethical implications of genetic modifications become indispensable. Comprehensive evaluations encompassing these factors are essential for translating research into real-world applications.
Arabidopsis serves as a model plant, but the findings should be extended to other economically important crops. Future research should also focus on how these findings can be translated to breeding programs for enhancing stress tolerance in crops of agricultural importance. Moreover, the stress response of plants is complex and may involve interactions with various environmental factors. Understanding the intricate relationship between genes and environmental conditions, including soil properties, humidity, and temperature, will be essential for a more holistic approach to improving stress tolerance.

Conclusions
In conclusion, this study utilized a hybrid gene selection approach to identify predictive genes involved in stress tolerance in Arabidopsis, which could potentially be used to improve crop production systems and address food security challenges. Through the use of various feature selection tools and machine learning algorithms, the study identified three common genes (AT5G44050, AT2G47180, and AT1G70700) that could serve as biomarkers for tolerant crops. The XGBoost and random forest algorithms demonstrated superior performance in classifying stress and control conditions, indicating their potential utility in crop breeding programs. However, further experimental research is needed to validate the identified genes and explore their potential for developing stress-tolerant crop varieties. Overall, this study provides valuable insights into the mechanisms underlying stress responses in plants and highlights the potential of gene selection and machine learning approaches for improving crop resilience.

Figure 1. Overview of the modeling process implemented to classify and interrogate gene expression relationships between control and stress conditions in Arabidopsis.

Figure 2. (a) The LASSO model: each curve corresponds to a variable and shows the path of its coefficient against the L1 norm of the whole coefficient vector as λ varies. The axis above the graph indicates the number of non-zero coefficients at the current λ, which is LASSO's effective degrees of freedom (df). (b) The LASSO model identified 42 genes that provide the most regularized model such that the cross-validated error is within one standard error of the minimum.

Figure 3. Venn diagram of common overlapping genes for the top 20 ranked genes by the information gain (IG), ReliefF, and LASSO methods.

Figure 4. (a) Random forest evaluation of the performance of the three-gene signature common among LASSO, IG, and ReliefF selection. (b) ROC curve, plotted as the true positive rate against the false positive rate. (c) Mean decrease accuracy and mean decrease Gini, used to confirm and rank the importance of the selected genes.

Figure 5. (a) Validation of the three-gene signature by the random forest algorithm on the GSE158444 dataset, which is an RNA-seq transcriptome of Arabidopsis subjected to heat stress. (b) ROC curve, plotted as the true positive rate against the false positive rate.

Table 1. The parameters of different classifiers for the discrimination of Arabidopsis based on gene expression levels under control and stress conditions.

Table 2. The confusion matrices and discrimination performance of Arabidopsis on expression levels of 20 selected differentially expressed genes (DEGs) under control and stress conditions based on the information gain (IG) feature selection algorithm.

Table 3. The confusion matrices and discrimination performance of Arabidopsis on expression levels of 20 selected differentially expressed genes (DEGs) under control and stress conditions based on the ReliefF feature selection algorithm.