MLAS: Machine Learning-Based Approach for Predicting Abiotic Stress-Responsive Genes in Chinese Cabbage

You, Xiong; Shu, Yiting; Ni, Xingcheng; Lv, Hengmin; Luo, Jian; Tao, Jianping; Bai, Guanghui; Feng, Shusu

doi:10.3390/horticulturae11010044


by Xiong You ¹,*,†, Yiting Shu ¹, Xingcheng Ni ¹, Hengmin Lv ²,†, Jian Luo ², Jianping Tao ³, Guanghui Bai ¹ and Shusu Feng ¹

¹ College of Sciences, Nanjing Agricultural University, Nanjing 210095, China

² College of Horticulture, Nanjing Agricultural University, Nanjing 210095, China

³ The Institute of Agricultural Information, Jiangsu Province Academy of Agricultural Sciences, Nanjing 210014, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Horticulturae 2025, 11(1), 44; https://doi.org/10.3390/horticulturae11010044

Submission received: 4 December 2024 / Revised: 29 December 2024 / Accepted: 3 January 2025 / Published: 6 January 2025


Abstract: The challenges posed by climate change have had a crucial impact on global food security, with crop yields negatively affected by abiotic and biotic stresses. Consequently, the identification of abiotic stress-responsive genes (SRGs) in crops is essential for augmenting their resilience. This study presents a computational model utilizing machine learning techniques to predict genes in Chinese cabbage that respond to four abiotic stresses: cold, heat, drought, and salt. To construct this model, data from relevant studies regarding responses to these abiotic stresses were compiled, and the protein sequences encoded by abiotic SRGs were converted into numerical representations for subsequent analysis. For the selected feature set, six distinct machine learning binary classification algorithms were employed. The results demonstrate that the constructed models can effectively predict SRGs associated with the four types of abiotic stresses, with the area under the receiver operating characteristic curve (auROC) for the models being 81.42%, 87.92%, 80.85%, and 88.87%, respectively. For each type of stress, a distinct number of stress-resistant genes was predicted, and the ten genes with the highest scores were selected for further analysis. To facilitate the implementation of the proposed strategy by users, an online prediction server has been developed. This study provides new insights into computational approaches to the identification of abiotic SRGs in Chinese cabbage as well as in other plants.

Keywords: Chinese cabbage; stress-responsive genes; machine learning; binary classification

1. Introduction

Chinese cabbage (Brassica rapa L. ssp. pekinensis) is a significant vegetable in Asia, with its consumption steadily increasing in Western countries [1,2]. However, the growth of Chinese cabbage is negatively impacted by abiotic stress, which leads to reductions in both yield and quality, thereby significantly affecting agricultural productivity. Abiotic stress refers to the detrimental effects on the normal growth and development of plants caused by non-biological factors. Common abiotic stressors that plants frequently encounter include drought, radiation, nutrient deficiency, extreme temperatures (both high and low), metal ion toxicity, salinity, and organic pollution, all of which severely compromise the distribution and growth conditions of plants worldwide [3]. Research indicates that abiotic stresses such as drought, extreme temperatures, and high salinity influence nearly every stage of the Chinese cabbage life cycle, affecting not only the expression of relevant genes but also cellular metabolism and developmental processes [4]. Under these stress conditions, Chinese cabbage can develop a certain degree of resistance; however, as the intensity of the stress escalates, this self-generated resistance becomes insufficient to cope with more severe abiotic challenges. Such stress can result in abnormal growth or even death of the plant, leading to diminished yields and subsequent market shortages. The capacity of plants to respond to environmental changes is critical for their adaptation and survival. Hence, it is essential to investigate the mechanisms by which plants adapt to varying climatic conditions, and research on the molecular mechanisms underlying abiotic stress in Chinese cabbage is of great significance for enhancing its yield under adverse conditions. In particular, the identification of abiotic stress-responsive genes (SRGs) [5] and proteins is vital for fostering the resilience of Chinese cabbage [6].

In recent years, the advancement of high-throughput technologies has led to the rapid generation of extensive biological data. The availability of complete genome sequences of various plant species has enabled comprehensive research on biomacromolecules, thereby promoting the investigation of abiotic SRGs at the whole-genome level. In addition to the identification of SRGs through transcriptome analysis, gene expression studies serve as an additional method for recognizing these genes [7,8,9,10,11,12,13]. Specifically, 35 BrHsf genes have been identified in Chinese cabbage, and a thorough analysis has revealed their potential to enhance plant heat tolerance [10]. Furthermore, comparative genomic analyses have enhanced our comprehension of the evolutionary dynamics of Hsf genes in cabbage. A co-expression network of genes responsive to cold, drought, and salt stress in cabbage has been constructed and analyzed from multiple perspectives, leading to the identification of previously unknown genes associated with abiotic stress tolerance [4]. Additionally, through the evolutionary analysis of gene families, abiotic SRGs have also been identified within the B-box family in cabbage. The application of advanced analytical methods to high-throughput genomic data—including genes, transcripts, proteins, and metabolites—will undoubtedly optimize data utilization and enhance the accuracy of abiotic SRGs identification [14].

In the domain of bioinformatics, various machine learning algorithms have been employed to address significant biological challenges. The integration of machine learning into biological research has introduced a novel perspective that contrasts sharply with traditional experimental and simulation methodologies. This approach has demonstrated considerable potential due to its flexibility, accuracy, and robust generalization capabilities when analyzing complex biological systems [15]. In the early stages of genomic research, gene identification primarily relied on biological experiments, gene sequence analysis, and traditional statistical methods. This phase often involved a considerable amount of manual analysis and experimental validation. Researchers employed basic computational techniques, such as sequence alignment and pattern recognition, to identify genes; however, these methods frequently depended on manual rules and intuitive judgment [16]. With the development of high-throughput sequencing technologies, genomic data began to grow explosively [17]. At this point, traditional manual analysis methods and single statistical approaches struggled to handle such vast amounts of data. Therefore, machine learning techniques began to see preliminary applications in gene identification [18]. Subsequently, the scale and complexity of genomic research have significantly increased, with the intricate relationships among genes, phenotypes, and the environment becoming the focal point of study. At this juncture, the application of machine learning has diversified, particularly in areas such as gene function prediction and gene–phenotype association analysis. In recent years, the use of machine learning has evolved toward a more integrated and multifaceted approach. 
With the advent of genomic, transcriptomic, epigenomic, and other types of data, machine learning has played a crucial role in synthesizing these diverse data types and conducting multi-level gene characterization analyses [19]. The widespread application of machine learning in gene identification and characterization is expected to steer future research toward a greater emphasis on model interpretability, thereby providing scientific evidence to support clinical and agricultural decision-making.

Artificial intelligence-driven machine learning techniques have emerged as pivotal tools for data interpretation, particularly in the context of predictive modeling and plant stress responses [20]. Previous research has successfully predicted the types of stresses to which plants respond by analyzing the expression patterns of plant microRNAs (miRNAs). In this context, intricate non-linear relationships between input variables (miRNA expression) and output variables (plant stress responses) are discerned from training datasets housed in various databases. This enables the identification of whether previously uncharacterized plant miRNAs are responsive to stress conditions [21]. Additionally, computational models utilizing machine learning have been developed to predict proteins associated with abiotic stress in plants [22], specifically focusing on the classification of abiotic stress response proteins in crops of the family Poaceae through the application of deep convolutional neural networks [14]. These investigations have illustrated the efficacy of machine learning methodologies in the identification, classification, and prediction of stress-responsive molecules in plants. However, despite the labor-intensive and time-consuming nature of identifying genes related to abiotic stress through conventional genetic techniques, there remains a lack of dedicated computational models for the identification of abiotic SRGs in Chinese cabbage. Given these considerations, the development of a computational method to predict abiotic SRGs in Chinese cabbage is warranted.

The objective of this study is to develop a machine learning-based computational model to identify the genes associated with cold, heat, drought, and salt stresses in Chinese cabbage. This endeavor aims to uncover novel abiotic SRGs within this species. It is hypothesized that the stress resistance phenotype of Chinese cabbage is influenced by multiple genes or genomic regions (quantitative trait loci, QTLs) and that there exists a correlation between these genes and various abiotic stresses, including cold, heat, drought, and salt. By analyzing the complex relationship between the genotype of Chinese cabbage and these stress resistance phenotypes using machine learning methods, relevant stress resistance genes can be identified. The relationship between genotype and stress resistance phenotype is not merely linear; there may also be complex non-linear or higher-order interactions, such as interactions between genes and responses of phenotypes to environmental changes. Machine learning, especially algorithms such as random forests (RF), support vector machines (SVM), and deep learning, can capture these complex non-linear patterns and thus effectively mine stress resistance genes. To this end, we collected and organized genes responsive to the four types of stress from related articles and obtained the protein sequence encoded by each gene. We used the protein sequence as an effective high-throughput data format to analyze the specific targets related to abiotic stress, establishing complex non-linear relationships between the protein-coding sequences of Chinese cabbage genes and the four types of stress.

2. Materials and Methods

2.1. Data Collection and Pre-Processing

To date, substantial advancements have been achieved in the investigation of stress resistance in Chinese cabbage, employing various experimental methodologies to identify genes associated with cold, heat, drought, and salt stresses. By collecting data from relevant literature, it is possible to effectively screen for abiotic stress-responsive genes. First, studies related to the four types of stress were selected, and the genes identified in these studies were extracted. Notably, very few articles examine the entire genome of Chinese cabbage, while more research focuses on specific gene families. This discrepancy arises because the Chinese cabbage genome is relatively large and the genes involved are complex, resulting in a substantial workload and high technical requirements for studying the entire genome. In contrast, research on individual gene families is more concentrated, allowing for an in-depth exploration of the roles these families play in specific biological processes, such as abiotic stress responses. In this study, abiotic stress-responsive genes were collected from each publication in turn and added to an existing gene database, thereby forming an ever-expanding gene collection. Throughout this process, genes were regularly organized and classified by stress type. This approach facilitates the gradual accumulation of valuable gene information, providing robust data support for subsequent research.

The research focuses on a comprehensive analysis of the entire genome. A transcriptome study related to cold stress was previously performed on other crops, and the GO term ‘response to stimulus’ was emphasized by GO analysis, from which some cold stress genes were identified [23]. The Gene Ontology (GO) analysis conducted in this study identified five terms that are closely associated with cold stress, including cellular response to cold (GO:0070417), cellular response to freezing (GO:0071497), cold acclimation (GO:0009631), response to cold (GO:0009409), and response to freezing (GO:0050826). Based on these five terms, we have identified genes that have been experimentally validated to respond to cold stress. Similarly, transcriptome studies were conducted to explore GO annotations for three additional types of stress: heat, drought, and salt. Corresponding genes were identified based on the relevant GO annotations. For the study of specific gene families, research focused on these families can yield deeper insights into their functions in response to abiotic stress by narrowing the focus to a single gene set, thereby offering precise targets for improving crop resilience. This type of research often extends and refines genomic studies, which aid in understanding the functions of individual genes and establish the groundwork for constructing regulatory networks. For instance, systematic studies have revealed the role of the CSD gene family in the response of Chinese cabbage to cold stress, providing foundational data for understanding the mechanisms of low-temperature adaptation in this plant [24]. Although a gene family can respond to a specific type of abiotic stress, it may also respond to multiple forms of abiotic stress [25]. For detailed articles referenced in these two types of research, please refer to the Supplementary Materials. 
For the four types of abiotic stress, there were 801, 761, 600, and 427 samples available to form the positive sample set, respectively, as shown in Figure 1. The overlap of genes across different articles is shown in Figure 2. This figure highlights the frequency with which specific genes are associated with different specific abiotic stresses across multiple articles.

Given that certain genes respond to multiple stressors, Boolean operations were employed to retain only genes that respond to a single type of stress; the resulting numbers of single-stress genes are shown in Table 1. The complete genome of Chinese cabbage contains 41,019 protein-coding genes. The number of genes predicted for each type of stress can be determined individually. While differential expression and downstream analyses have resulted in a considerable number of experimentally validated abiotic stress-responsive genes (SRGs), various platforms provide differing gene sequences. To enhance specificity, we worked with the protein sequences encoded by these genes, querying the gene identifiers in the BRAD database (http://brassicadb.cn, accessed on 24 July 2024) to obtain the corresponding protein sequences. Additionally, to reduce potential bias in the model that may arise from similar protein sequences, the CD-HIT tool was employed with a sequence identity threshold of 0.8 to eliminate redundant sequences, thereby constructing a non-redundant dataset for subsequent experimental analysis. The final counts of sequences obtained for cold, heat, drought, and salt stress were 527, 515, 409, and 239, respectively.
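The Boolean filtering step can be illustrated with plain set operations. The following is a minimal sketch, assuming that "responding to a single stress" means a gene appears in exactly one of the four curated lists; the gene identifiers and the helper name `single_stress_genes` are hypothetical and are not taken from the study.

```python
# Hypothetical gene identifiers; the real lists come from the curated literature.
stress_genes = {
    "cold": {"g1", "g2", "g3", "g4"},
    "heat": {"g3", "g5", "g6"},
    "drought": {"g2", "g6", "g7"},
    "salt": {"g4", "g8"},
}


def single_stress_genes(gene_sets):
    """For each stress, keep only genes absent from every other stress set."""
    exclusive = {}
    for stress, genes in gene_sets.items():
        # Union of all gene sets except the current one.
        others = set().union(*(g for s, g in gene_sets.items() if s != stress))
        exclusive[stress] = genes - others
    return exclusive


exclusive = single_stress_genes(stress_genes)
for stress, genes in exclusive.items():
    print(stress, sorted(genes))
```

Set difference against the union of the other lists keeps the filter symmetric across the four stresses, so the order in which the lists are processed does not matter.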

Since the samples that had not yet been verified to respond to a specific type of stress were all regarded as potential positive samples, we adopted the approach to constructing negative sample sets described in [22]. For each model, one category of abiotic stress data was used as the positive sample set, whereas the remaining three categories comprised the negative sample set. The numbers of positive samples for cold, heat, drought, and salt stress were 527, 515, 409, and 239, while the corresponding numbers of negative samples were 1163, 1175, 1281, and 1451. Furthermore, the entire dataset was partitioned into a training set and a testing set in a ratio of 7:3.
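The one-versus-rest labeling and the 7:3 split can be sketched as follows. The sequence counts are those reported above; the placeholder sample identifiers, the helper names, the random seed, and the use of a simple (non-stratified) shuffle are assumptions made for illustration only.

```python
import random

# Non-redundant sequence counts reported in the text.
counts = {"cold": 527, "heat": 515, "drought": 409, "salt": 239}


def one_vs_rest_sizes(counts):
    """Return (positives, negatives) per stress: negatives are the other three sets."""
    total = sum(counts.values())
    return {stress: (n, total - n) for stress, n in counts.items()}


def build_dataset(target, counts):
    """Label sequences of the target stress 1 and all other sequences 0."""
    return [(f"{s}_{i}", int(s == target)) for s, n in counts.items() for i in range(n)]


def split_7_3(items, seed=0):
    """Shuffle and split into 70% training and 30% testing (hypothetical helper)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 7 // 10
    return items[:cut], items[cut:]


sizes = one_vs_rest_sizes(counts)
cold = build_dataset("cold", counts)
train, test = split_7_3(cold)
print(sizes)
print(len(train), len(test))
```

The computed negative-set sizes reproduce the figures in the text, since each negative set is simply the 1690 pooled sequences minus the positives for that stress.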

2.2. Feature Construction and Selection

iFeatureOmega is a comprehensive computational tool designed for the characterization of diverse biomolecules. This platform is freely accessible and user-friendly, enabling users to generate, analyze, and visualize 189 types of numerical vector representations of biological sequences, structures, and ligands [26]. In our study, we employed six protein feature construction methods available through the graphical user interface (GUI) version of this tool to extract features from the collected protein sequence data: Composition of k-spaced Amino Acid Pairs (CKSAAP), Dipeptide Deviation from Expected Mean (DDE), Di-Peptide Composition (DPC), Tri-Peptide Composition (TPC), Pseudo-Amino Acid Composition (PAAC), and Amphiphilic Pseudo-Amino Acid Composition (APAAC).

The CKSAAP feature encoding calculates the frequency or the raw count of amino acid pairs separated by any k residues. The DPC feature calculates the raw count of the 400 di-peptides in a sequence, and the TPC feature calculates the raw count of the 8000 tri-peptides. The DDE feature was constructed by calculating three parameters: dipeptide composition ($D_{c}$), theoretical mean ($T_{m}$), and theoretical variance ($T_{v}$). $D_{c}$ measures the frequency of dipeptides in the sequence, while $T_{m}$ and $T_{v}$ represent the expected frequency and variance, respectively, based on codon usage for each dipeptide. The final DDE value was computed by calculating the difference between the observed dipeptide frequency ($D_{c}$) and the expected frequency ($T_{m}$), and then normalizing this difference by dividing it by the square root of the theoretical variance ($T_{v}$), that is, $DDE = (D_{c} - T_{m}) / \sqrt{T_{v}}$. These values capture the distribution of dipeptides relative to their expected frequencies, providing a measure of deviation from the expected dipeptide distribution. For the principles of APAAC and PAAC, please refer to [27].
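To make the DDE calculation concrete, the sketch below implements it in plain Python. The article does not spell out the individual terms, so the code follows the commonly cited definition of DDE: the expected frequency is derived from the number of codons per amino acid among the 61 sense codons, and the variance is that of a binomial proportion over the L - 1 dipeptide positions. These specifics, the function name `dde`, and the example sequence are assumptions; iFeatureOmega's own implementation may differ in details such as handling of non-standard residues.

```python
import math
from itertools import product

# Number of codons encoding each amino acid in the standard genetic code
# (61 sense codons in total).
CODONS = {"A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
          "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
          "T": 4, "W": 1, "Y": 2, "V": 4}
AMINO_ACIDS = sorted(CODONS)
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dde(sequence):
    """Return the 400-dimensional DDE vector for a protein sequence."""
    # Non-standard residues are dropped here (an assumption of this sketch).
    seq = [aa for aa in sequence.upper() if aa in CODONS]
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two standard residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for a, b in zip(seq, seq[1:]):
        counts[a + b] += 1
    features = []
    for dp in DIPEPTIDES:
        d_c = counts[dp] / n_pairs                          # observed frequency
        t_m = (CODONS[dp[0]] / 61) * (CODONS[dp[1]] / 61)   # theoretical mean
        t_v = t_m * (1 - t_m) / n_pairs                     # theoretical variance
        features.append((d_c - t_m) / math.sqrt(t_v))
    return features


vec = dde("MKVLAAGIV")  # hypothetical toy sequence
print(len(vec))
```

Dipeptides that never occur in the sequence receive a negative value, because their observed frequency of zero lies below the codon-based expectation.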

In the context of multi-dimensional space classification and datasets characterized by feature interaction variables, there is a pressing need to enhance model classification accuracy. To address these issues, feature selection techniques are commonly utilized to reduce the complexity of the data structure. This is accomplished by identifying and selecting meaningful features that contribute positively to the model, thus obviating the need to incorporate all available features during the training process. Such an approach not only reduces the computational time required for classification but also enhances overall classification accuracy. Commonly utilized feature selection methods include backward feature selection, forward feature selection, and bidirectional feature selection [28]. Additionally, SVM-recursive feature elimination (SVM-RFE) is recognized as an effective feature selection technique. SVM-RFE facilitates the identification of relevant features while simultaneously eliminating relatively insignificant feature variables, ultimately leading to improved classification performance [29]. Empirical research demonstrates that datasets curated using SVM-RFE result in more straightforward computations and substantially improve classification accuracy [30].
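A minimal sketch of SVM-RFE using scikit-learn's `RFE` wrapper around a linear-kernel `SVC` is shown below. The synthetic data, the target of 10 retained features, and the step size are illustrative assumptions; the study's actual settings are not stated in this section.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for a protein feature matrix (assumption: 120 samples, 40 features).
X, y = make_classification(n_samples=120, n_features=40, n_informative=5,
                           random_state=0)

# A linear kernel is required so that the SVM exposes per-feature weights (coef_),
# which RFE uses to rank features and drop the weakest ones at each iteration.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=5)
selector.fit(X, y)

X_selected = selector.transform(X)
print(X_selected.shape)
print(selector.ranking_)  # rank 1 marks the retained features
```

In practice the number of retained features would itself be tuned, for example by scoring each candidate subset with cross-validated auROC.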

2.3. Prediction Using Machine-Learning Methods

Machine learning techniques have been effectively utilized across various domains of bioinformatics [17,31], including gene discovery [32], genome annotation [33], protein classification prediction [34], and gene expression analysis [35]. Among the methodologies employed, supervised learning, unsupervised learning, reinforcement learning, and sparse dictionary learning have been extensively explored in prior research, with supervised learning emerging as a reliable and efficient approach for addressing challenges in the life sciences [36]. In this study, we evaluated several classification algorithms within the framework of supervised learning, specifically support vector machine (SVM) [37], extreme gradient boosting (XGB) [38], random forest (RF) [39], bagging (BAG) [40], adaptive boosting (ADB) [41], and gradient boosting decision trees (GBDT) [41]. The implementation of machine learning models was conducted using Python, leveraging Scikit-learn, an open-source Python library that offers straightforward and effective tools for data processing and modeling pertinent to data mining and analysis. Scikit-learn encompasses a comprehensive array of algorithms, including those for classification, regression, clustering, and dimensionality reduction, in addition to providing resources for model selection and evaluation [42]. This functionality enables the construction and optimization of machine learning models, making it a valuable asset for the models developed in this study.
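As an illustration of how such a comparison could be set up in Scikit-learn, the sketch below instantiates five of the six classifiers and scores them by cross-validated auROC. The synthetic data and all hyperparameters are placeholders, not the study's settings. XGB is provided by the separate `xgboost` package rather than Scikit-learn, so it is only indicated in a comment here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder hyperparameters; the tuned values from the study are not reproduced.
models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "BAG": BaggingClassifier(random_state=0),
    "ADB": AdaBoostClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    # "XGB": xgboost.XGBClassifier(...) could be added if xgboost is installed.
}

# Synthetic binary classification data standing in for the protein features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Mean auROC over five folds for each classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```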

2.4. Cross Validation and Performance Metrics

During model development, the training dataset is used to train the model, while a validation dataset is used to assess model quality and to select the optimal hyperparameters. The final model is subsequently applied to the test dataset to evaluate its real performance. However, when the dataset is limited, relying on a single training set and validation set may lead to a significantly biased estimate of model performance, and carving a separate validation set out of a small dataset may further degrade the model. To achieve a reliable and stable model, five-fold cross-validation is implemented [43]. This approach eliminates the need to partition a separate validation set, while the test set remains reserved for the final evaluation of the model. Specifically, the entire training dataset is randomly divided into five equal-sized, mutually exclusive subsets. Four of these subsets are used for model training, while the remaining subset is used to validate the resulting model. This procedure is repeated until each subset has served as the validation set. The accuracy across all five validation sets is averaged to provide a performance measure, which is used to assess model quality and to select the model and its corresponding parameters. Furthermore, the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPRC) are computed to evaluate the predictive capability of the model [44]. The ROC curve plots the true positive rate against the false positive rate, and the auROC is a commonly used metric for assessing model performance across various thresholds. The metrics are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

$$\mathrm{auROC} = \int_{0}^{1} \frac{TP}{P} \, d\left(\frac{FP}{N}\right)$$

$$\mathrm{auPRC} = \int_{0}^{1} \frac{TP}{TP + FP} \, d\left(\frac{TP}{P}\right)$$

In this context, TP, FP, TN, and FN denote the number of positive samples correctly predicted as positive, the number of negative samples incorrectly predicted as positive, the number of negative samples correctly predicted as negative, and the number of positive samples incorrectly predicted as negative, respectively; P = TP + FN and N = FP + TN are the total numbers of positive and negative samples. All steps involved in the proposed methodology are illustrated in Figure 3.
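The three metrics can be computed directly from labels and prediction scores. The sketch below is a plain-Python illustration under stated assumptions: the auROC integral is approximated with the trapezoidal rule over all score thresholds, and the auPRC integral with a step-wise sum (the "average precision" convention). The function names and toy data are hypothetical, and the study itself may have relied on library routines instead.

```python
def accuracy(y_true, y_pred):
    """Percentage of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def _sweep(y_true, scores):
    """Cumulative (TP, FP) after each distinct score threshold, high to low."""
    pairs = sorted(zip(scores, y_true), key=lambda x: -x[0])
    tp = fp = 0
    points = []
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((tp, fp))
        i = j
    return points


def au_roc(y_true, scores):
    """Trapezoidal integral of TP/P with respect to FP/N."""
    P = sum(y_true)
    N = len(y_true) - P
    area, prev_tpr, prev_fpr = 0.0, 0.0, 0.0
    for tp, fp in _sweep(y_true, scores):
        tpr, fpr = tp / P, fp / N
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return area


def au_prc(y_true, scores):
    """Step-wise integral of precision with respect to recall."""
    P = sum(y_true)
    area, prev_recall = 0.0, 0.0
    for tp, fp in _sweep(y_true, scores):
        recall, precision = tp / P, tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy example: 1 = stress-responsive, 0 = not.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
y_pred = [int(s >= 0.5) for s in scores]
print(accuracy(y_true, y_pred), au_roc(y_true, scores), au_prc(y_true, scores))
```

For the toy data, eight of the nine positive-negative pairs are ranked correctly, so the auROC is 8/9, which matches the pairwise-ranking interpretation of the metric.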

3. Results

3.1. Preliminary Analysis of the Sequence Data

Motif refers to conserved regions within DNA or protein sequences, as well as common sequence patterns that are postulated to have biological functions or to play a role in essential biological processes. Motif analysis is a methodological approach employed to identify and analyze recurring patterns or sequence fragments present in DNA, RNA, or protein sequences [45]. For each type of abiotic stress, we conducted motif analysis utilizing the MEME Suite [46], which facilitates the identification of recurring sequence fragments in both positive and negative sample sets. The substantial number of significant motifs identified in stress-related genes underscores the intricate complexity of stress response regulation. These genes may be regulated by various transcription factors and may concurrently participate in responses to multiple stress pathways. To enhance our analysis, we screened these motifs based on their significance and enrichment multiples, ultimately identifying key regulatory motifs.

In the MEME analysis, the number of motifs is preset to 3, and the motif length ranges from 6 to 10. In alignment with the research objectives, significant motifs may be associated with critical biological processes or molecular mechanisms involved in responses to abiotic stress. After analyzing the preliminary results, we reset the parameters. Ultimately, we identified two motifs that are associated with cold stress: NTKFCYYNNY and FCKSCRRYWT. Employing the same analytical method for the remaining three stress datasets, we identified the following significant motifs for heat stress: FYRQLNTYGF and LPKYFKHNNF. In the context of drought stress, the motifs QLTIFYGGKV and IQTAFRGYLA were recognized as significant. Lastly, for salt stress, the motifs TDNAVKNHWN and SCRLRWCNQL were identified as significant. The identified motifs for each stress are visualized in Figure 4. The MEME analysis results for each type of stress can be found in the Supplementary Materials. In conclusion, motif analysis is demonstrated to be an effective approach for identifying significant sequence characteristics.

3.2. Feature Construction and Selection Analysis

In this section, six feature construction methods were employed to derive features from protein sequences, and the corresponding feature sets from the training dataset were utilized to assess the predictive accuracy of various machine learning algorithms. As illustrated in Figure 5, the feature sets produced by the CKSAAP, DDE, DPC, and TPC methods demonstrated superior performance compared to the other two methods. Specifically, when the CKSAAP method was utilized to generate the feature set, the SVM-based model achieved the highest area under the receiver operating characteristic curve (auROC) for cold, heat, and drought stress, with respective values of 76.60%, 80.32%, and 78.09%. In the case of salt stress, the SVM model also attained the highest auROC of 85.14%, although this feature set was derived using the DDE method; the CKSAAP method followed closely with an auROC of 82.91%. Across the four types of stress evaluated, the CKSAAP method consistently outperformed the DDE and TPC methods based on the assessment results of various machine learning algorithms. Moreover, while the DPC and TPC methods produced a greater number of features, their performance was either slightly inferior to or comparable with that of the DDE method. Consequently, to mitigate computational demands, it may be advantageous to select feature construction methods that yield fewer features. Overall, the feature sets generated by the CKSAAP and DDE methods exhibited higher predictive accuracy than those produced by the other methods employed in this study.

The process for developing the model begins with gathering genes associated with Chinese cabbage’s response to cold, heat, drought, and salt stresses. Genes that respond to multiple stresses are filtered out using Boolean operators. Protein data are then obtained from the BRAD database, and redundant sequences are removed through CD-HIT. The dataset is categorized into positive and negative samples, followed by feature construction using iFeatureOmega software. Feature selection is conducted, and five-fold cross-validation is utilized with various machine learning models for training. After identifying the optimal features, the model is trained with these features, hyperparameters are fine-tuned, and the model is ultimately assessed using a test set.

The inclusion of an excessive number of features can increase the complexity of a model, potentially leading to overfitting, whereas an insufficient number of features may result in underfitting. The primary objective of feature selection is to optimize the model such that it is sufficiently complex to ensure generalizable performance, while remaining simple enough to facilitate training, maintenance, and interpretation. In this study, multiple numerical features were generated for each protein sequence. However, the presence of numerous sparse features resulted in high correlation among them, which can adversely affect classification accuracy due to the presence of correlated or redundant features. To address this issue, the SVM recursive feature elimination (SVM-RFE) method was employed to rank the features and systematically eliminate those deemed unimportant. Additionally, combinations of features were considered, focusing on three specific feature sets: CKSAAP, DDE, and combined CKSAAP + DDE. The findings indicated that, following the screening process, the optimal feature set for various types of abiotic stress comprised a different number of features, with the optimal set achieving the highest area under the receiver operating characteristic curve (auROC) score. For the 1600 features generated by the CKSAAP method, the optimal features selected for cold, heat, drought, and salt stress were 1045, 1224, 1016, and 1018, respectively, with auROC scores reaching 70% (see Figure 6). The DDE method, which has a lower feature dimensionality, extracted 400 features per sequence. After the feature selection process, the number of features selected for the four types of stress were 112, 84, 175, and 167, respectively. However, the performance for drought and salt stress was suboptimal, with auROC scores only reaching 60%. 
By integrating the two feature construction methods, the combined total of (1600 + 400) features resulted in the selection of 1237, 1062, 375, and 589 features for the optimal feature set, respectively.
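For reference, the 1600-dimensional CKSAAP vector can be reproduced with a short function. A dimension of 1600 corresponds to 400 amino acid pairs for each of four gap sizes, so the sketch assumes k = 0, 1, 2, 3; this gap range is inferred from the reported dimension rather than stated in the article, and the per-gap frequency normalization is likewise an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, max_gap=3):
    """Frequencies of residue pairs separated by k = 0..max_gap residues."""
    seq = sequence.upper()
    features = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                total += 1
        # Normalize counts within each gap size so they sum to 1.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


vec = cksaap("MKVLAAGIVMKT")  # hypothetical toy sequence
print(len(vec))  # 4 gap sizes x 400 pairs
```

Concatenating this vector with the 400-dimensional DDE vector gives the 2000 features of the combined CKSAAP + DDE set.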

3.3. Prediction Analysis with Selected Features

For each of the four types of abiotic stress, the optimal parameters of each predictive model were established using the selected feature set, which was evaluated through five-fold cross-validation on the training dataset. Subsequently, receiver operating characteristic (ROC) analysis was conducted on the selected models using the test dataset to ascertain the most effective model for predicting genes associated with each type of stress. Taking salt stress as a case study, ROC analysis was performed on three feature sets (CKSAAP, DDE, and CKSAAP + DDE) in conjunction with six classification algorithms (SVM, XGB, RF, GBDT, ADB, and BAG), as shown in Figure 7. The results indicated that the GBDT classification algorithm achieved a higher area under the ROC curve (auROC) than the other algorithms when using features derived from the CKSAAP method. In contrast, with the DDE feature set and the combined CKSAAP + DDE feature set, the BAG and RF algorithms performed best, respectively. Table 2 presents the most effective machine learning classification methods for the various abiotic stresses across the three feature sets. A comparative analysis revealed that the CKSAAP-derived features yielded a higher overall accuracy for each category of abiotic stress than the DDE and CKSAAP + DDE feature sets. With the CKSAAP feature construction method, all six classification algorithms achieved an accuracy of at least 70% (see Figure 8). This observation suggests that the sequences corresponding to the four categories are all suitable for feature construction using the CKSAAP approach. Consequently, following the analysis presented above, the most effective model for predicting genes associated with each type of stress has been determined (as shown in Table 3).

3.4. Discovery of New Stress-Related Genes in Chinese Cabbage

The traditional methodology for identifying genes associated with plant stress employs microarray analysis technology, which enables the statistical evaluation of gene expression data across different treatment conditions in comparison to a control group. This methodology facilitates the screening of a subset of genes exhibiting significant expression alterations. The differentially expressed genes are likely implicated in the relevant stress processes of organisms. Consequently, this study aims to investigate the intrinsic mechanisms underlying environmentally induced changes in organisms from a molecular perspective.

The genome of Chinese cabbage contains a total of 41,019 protein-coding genes. As indicated in Table 2, there are 40,291, 40,356, 40,570, and 40,691 uncharacterized protein-coding genes predicted for each of the four types of stress, respectively. MLAS successfully predicted 240 genes associated with cold stress (240/40,291), 1829 genes related to heat stress (1829/40,356), 2129 genes linked to drought stress (2129/40,570), and 115 genes connected to salt stress (115/40,691). Ultimately, the 10 genes with the highest scores for each type of stress were selected for further investigation (see Table 4). These identified genes may provide valuable insights into the molecular mechanisms underlying stress resistance in Chinese cabbage and could aid in the future breeding of stress-resistant varieties.
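The screening step above (score every uncharacterized gene, count the predicted positives, keep the ten highest-scoring genes) can be sketched as follows. The 0.5 decision threshold, the function name, and the placeholder gene identifiers are assumptions for illustration; only the reported counts in `reported` are taken from the text.

```python
import heapq


def summarize_predictions(scores, threshold=0.5, top_n=10):
    """Count genes scored at or above the threshold and return the top_n."""
    predicted = {gene: s for gene, s in scores.items() if s >= threshold}
    top = heapq.nlargest(top_n, predicted.items(), key=lambda item: item[1])
    fraction = len(predicted) / len(scores) if scores else 0.0
    return len(predicted), fraction, top


# Hypothetical scores for 20 placeholder genes (not real gene IDs).
toy_scores = {f"gene_{i:03d}": i / 20 for i in range(20)}
n_hits, fraction, top3 = summarize_predictions(toy_scores, top_n=3)
print(n_hits, fraction, top3)

# Predicted and candidate gene counts as reported in the text.
reported = {
    "cold": (240, 40291),
    "heat": (1829, 40356),
    "drought": (2129, 40570),
    "salt": (115, 40691),
}
rates = {stress: hits / total for stress, (hits, total) in reported.items()}
for stress, rate in rates.items():
    print(f"{stress}: {rate:.2%}")
```

Expressed as proportions, the reported counts correspond to roughly 0.6%, 4.5%, 5.2%, and 0.3% of the candidate genes for cold, heat, drought, and salt stress, respectively.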

3.4.1. Cold Stress

The genes Bra025001 and Bra010944 belong to the GRP (Glycine-Rich Protein) family, which plays a crucial role in plant responses to various environmental stresses, including high temperatures, low temperatures, and drought. GRP family proteins are characterized by a high content of glycine residues and perform diverse physiological functions that enhance plants’ resistance to stress, particularly by providing protection against abiotic stresses [47]. The promoter region of Bra025001 contains four cis-acting elements related to abiotic stress, indicating that it may function under various abiotic stresses. Bra010944 responds to cold stress with different expression levels, suggesting that it may play a regulatory role under cold stress by modulating the expression of cold-responsive genes to enhance cold tolerance. The variation in expression levels also indicates the activation of different protective mechanisms under varying intensities of cold stress. Bra013175 is induced under both high- and low-temperature stresses, indicating that it is a temperature stress-related gene that may play a broad protective role under extreme temperature conditions. It may help plants to resist temperature stress by regulating mechanisms such as protein folding and antioxidant enzyme activity. AT2G37170 is an Arabidopsis gene that shares some homology with Bra023102 and Bra023103 [48]. This suggests that these genes may perform similar functions in Arabidopsis and Chinese cabbage, especially in response to stresses such as cold and salt. The functional characteristics of AT2G37170, particularly its specific responses in root and leaf tissues to cold and salt stress, provide valuable clues for understanding the functions of Bra023102 and Bra023103.

3.4.2. Heat Stress

Bra009415 encodes a heat shock protein (HSP) that is activated in response to heat stress. Bra013731, a Chinese cabbage gene in the band-7 family, may play a role in enhancing membrane stability and maintaining protein homeostasis during periods of heat stress. Bra023806 encodes a KIP1-like protein [49] that is essential for the regulation of the cell cycle and the plant’s response to stress. In stressful conditions, plants may utilize KIP1-like proteins to effectively manage energy resources and mitigate cellular damage. Bra015720 and Bra009416 are associated with the heat shock protein 20 (HSP20) family, which is significant for responses to high-temperature stress [50]. Although the precise relationship to heat stress remains ambiguous, these proteins are known to influence nuclear function and mRNA stability. Bra035105 and AT1G79280 are homologous genes, with AT1G79280 potentially playing a role in the heat stress response through the regulation of mRNA. Additionally, AT4G27500, which is homologous to Bra019045, is implicated in protein binding and proton transport, thereby assisting plants in maintaining homeostasis under stress conditions.

3.4.3. Drought Stress

Bra020398 and Bra016377, which belong to the basic helix-loop-helix (bHLH) [51] gene family, are likely implicated in the regulation of gene expression in plants subjected to stress conditions, such as drought, through their function as transcription factors. Empirical evidence suggests that bHLH transcription factors are frequently correlated with plant resistance to stress. The process of protein dimerization is critical in the plant’s response to drought, and genes such as Bra034636, Bra002679, and Bra040856, which participate in this mechanism, may enhance the plant’s resilience to drought through dimerization processes [52]. Furthermore, research indicates that ANGUSTIFOLIA3 (AN3) [53] acts as a transcriptional co-activator, facilitating improved growth conditions for plants in drought environments by regulating cell expansion, division, and adaptive growth. Existing literature posits that AN3 plays a significant role in mediating plant adaptation to drought, while Bra031714 may function as a relevant gene marker associated with the plant’s drought response mechanisms. The MADS-box gene family contributes to enhancing plant drought resistance through regulatory pathways that influence stomatal function, root system development, hormone signaling, and antioxidant capacity under drought stress [54]. Bra008802, a member of the MADS-box family, may indirectly modulate the plant’s drought tolerance by upregulating or downregulating the expression of associated genes. Additionally, Bra007363, a homolog of AT3G57870 in the cabbage genome, may exhibit similar functional characteristics. Studies have demonstrated that the expression of AT3G57870 (MYB108) [55] is significantly elevated under drought stress conditions, suggesting its potential role in enhancing the plant’s drought tolerance through the regulation of stomatal movement or other water management mechanisms.

3.4.4. Salt Stress

Bra039970 has been annotated as being related to the carbohydrate metabolism process, which plays multiple roles in drought stress, including energy supply, osmotic regulation, signal transduction, and antioxidant functions [52]. This process is a crucial mechanism for plant drought resistance. To adapt to salt stress, plants regulate the activity of transmembrane transport proteins to mitigate ion toxicity, maintain water balance, and resist stress-induced damage. The molecular functions of Bra026062, Bra023592, Bra018594, Bra022658, and Bra025850 have been annotated as possessing transmembrane transport activity [56]. Further experimental validation, including gene expression analysis, transgenic functional analysis, and subcellular localization studies, will enhance the understanding of the specific roles of these genes in the salt stress response. The protein encoded by AT5G45275 may belong to the transmembrane transport protein family, which is involved in the transport of ions or metabolites, thereby playing a role in the plant’s response to biotic and abiotic stresses, such as salt and drought. The homologous gene Bra025086 in the cabbage genome may exhibit similar functional characteristics. AT2G33585 (NHX5) is a significant regulatory factor for salt tolerance in Arabidopsis; its function is activated under salt stress conditions, aiding plants in adapting to high-salinity environments and inducing the expression of downstream salt tolerance genes. Bra022945 is an NHX family gene in cabbage that is functionally similar to NHX5, thereby assisting cabbage in coping with salt stress and enhancing its salt tolerance [57]. Further molecular biology experiments can verify its specific functions and mechanisms of action.

3.5. Online Prediction Tool

To facilitate the implementation of the newly proposed method, we developed a web-based tool named MLAS, designed to predict the response genes of Chinese cabbage under four distinct types of abiotic stress: cold, heat, drought, and salt. The architecture of the web-based tool comprises three layers: a presentation layer [58], a web API layer, and an application layer. The presentation layer was constructed using HTML, CSS, and JavaScript. The application layer was primarily implemented in the Python programming language and interacts with the data layer through API calls. Furthermore, the application layer incorporates models that enhance usability for end users, thereby simplifying access and interaction with the tool. The interface of the web tool was developed using the Python Flask framework, while the backend was constructed utilizing machine learning modules within the Python framework. This online prediction server enables the rapid upload and analysis of corresponding protein sequences based on cabbage gene IDs, with results presented in a tabular format that correlates with the specified Chinese cabbage genes under the four types of abiotic stress.

Users have the option to submit either a single gene ID or multiple gene IDs for analysis. A single ID can be entered directly on the “Analysis” page, whereas multiple IDs require uploading a file in a specified format. Upon uploading the corresponding original protein sequence of the cabbage gene in FASTA format, the output will assign the gene to its predicted class. The accurate classification and prediction of abiotic stress genes are instrumental for biologists engaged in research aimed at crop improvement. Machine learning and deep learning models are capable of effectively capturing complex patterns and dependencies within protein sequences, thereby facilitating the rapid identification of abiotic stress-responsive genes from their protein sequences. Figure 9 illustrates the interface of this web server.
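The upload-and-tabulate workflow described above can be sketched in a few lines. This is a hypothetical illustration of the application-layer logic only: the server's actual code is not described in the article, the function names are invented, and the scorers are arbitrary placeholders standing in for the four trained models.

```python
def parse_fasta(text):
    """Parse FASTA text into a {gene_id: sequence} dictionary."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            # The ID is the first header token; fall back to a counter.
            current = tokens[0] if tokens else f"seq{len(records) + 1}"
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def predict_table(records, models, threshold=0.5):
    """Build one row per gene with a yes/no call from each stress model."""
    rows = []
    for gene_id, seq in records.items():
        row = {"gene": gene_id}
        for stress, model in models.items():
            row[stress] = "yes" if model(seq) >= threshold else "no"
        rows.append(row)
    return rows


# Placeholder scorers (arbitrary residue fractions), NOT the trained models.
placeholder_models = {
    stress: (lambda seq, aa=aa: min(1.0, 5 * seq.count(aa) / max(len(seq), 1)))
    for stress, aa in [("cold", "G"), ("heat", "K"), ("drought", "L"), ("salt", "S")]
}

example = ">geneA\nGGGGAK\n>geneB some description\nMKLS\nSSSS\n"
records = parse_fasta(example)
table = predict_table(records, placeholder_models)
for row in table:
    print(row)
```

In a Flask deployment, a route handler would typically read the uploaded file, pass its text to a parser such as this one, and render the resulting rows as an HTML table.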

4. Discussion

Abiotic stress is a major limiting factor affecting the yield and quality of crops, including Chinese cabbage, a highly nutritious and economically important vegetable. Investigating the molecular mechanisms behind abiotic stress responses in Chinese cabbage is essential for developing stress-resistant varieties and enhancing agricultural productivity. Traditional methods are rooted in experimental techniques that directly observe and verify molecular or physiological mechanisms [59]. However, these methods frequently encounter challenges such as high costs, low efficiency, and complex data [60]. With the development of modern technology, an increasing number of methods integrate experimental and computational approaches, thereby enhancing the efficiency of discovery [61]. Compared with traditional machine learning models, deep learning models require a large amount of data to train their complex network structures, and small datasets often provide insufficient support for effective model generalization [20]. Furthermore, traditional algorithms tend to be more efficient in feature engineering and model training, rendering them more suitable for practical applications involving small datasets. Consequently, when working with limited data, it is generally more reasonable to prioritize traditional machine learning methods.

To explore potential solutions, this study employed machine learning models to predict novel genes that may be linked to abiotic stress. Through this approach, we aimed to identify candidate genes that could be involved in stress responses, providing insights into their potential roles in stress tolerance. Compared to previous research [22,62,63], this study achieved an average auROC score of around 0.8, with a maximum score of 0.88. Additionally, this method not only contributes to uncovering the molecular factors associated with abiotic stress in crops like Chinese cabbage but also serves as a valuable tool for guiding the development of stress-resistant varieties and improving crop productivity.

A key challenge in this research is that a single gene can be associated with multiple types of stress, underscoring the complexity of abiotic stress responses [22]. If a single multiclass model were employed to predict a gene’s response to all four stresses simultaneously, the predicted probabilities for the four stresses would sum to 1. Such a model would assign each gene only to the stress with the highest response probability, thereby failing to account for the possibility of a gene responding to multiple stresses. To overcome this limitation, we developed four separate binary prediction models, each tailored to a specific type of stress.
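The difference between one multiclass model and four independent binary models can be illustrated numerically. The sketch below assumes a softmax-style multiclass output and logistic outputs for the binary models; the raw scores are made up for a single hypothetical gene and do not come from the study.

```python
import math


def softmax(logits):
    """Multiclass probabilities: non-negative and summing to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Probability from one independent binary model."""
    return 1.0 / (1.0 + math.exp(-z))


stresses = ["cold", "heat", "drought", "salt"]
logits = [2.0, 1.8, -1.0, -2.0]  # hypothetical raw scores for one gene

# One multiclass model: probabilities compete, so only one label wins.
multiclass = dict(zip(stresses, softmax(logits)))
single_call = max(multiclass, key=multiclass.get)

# Four binary models: each stress is judged on its own.
binary = {s: sigmoid(z) for s, z in zip(stresses, logits)}
multi_calls = [s for s, p in binary.items() if p >= 0.5]

print(single_call, multi_calls)
```

With these toy scores the multiclass view reports only the top-ranked stress, while the independent models can flag more than one stress for the same gene.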

In this study, various feature construction methods were employed, and the best-performing features were selected through comparative analysis to develop a more robust and comprehensive model. To minimize the risk of overfitting, enhance the model’s generalizability, and accelerate training, we utilized the SVM-RFE method for feature dimensionality reduction. To further enhance model performance, we explored feature combination strategies, in which multiple feature sets were combined to improve overall accuracy. The fundamental value of feature combination lies in its ability to expand the feature space, enabling the model to capture complex patterns within the data more effectively [64]. This, in turn, can improve both predictive performance and interpretability. In practical applications, feature combination can be used alongside methods such as feature selection to further optimize model performance.

Given that combining various feature construction methods, feature selection, and feature combinations would significantly increase computational complexity, we initially screened six feature construction methods—CKSAAP, DDE, DPC, PAAC, APAAC, and TPC—to evaluate their performance across multiple machine learning models. From this evaluation, CKSAAP and DDE consistently outperformed the other feature sets. Consequently, we focused on these two methods for subsequent feature selection and combination, while excluding the remaining four feature sets, particularly PAAC and APAAC, which showed subpar performance, with average auROC scores around 0.65.

Using auROC as the performance metric, we applied SVM-RFE to select the optimal feature subset for each construction method. Although the full feature set might be expected to maximize the auROC score, we found that the selected feature subsets outperformed it, achieving our objective of enhancing model performance with a more compact set of features. Further training with the selected feature sets (CKSAAP, DDE, and the combination of CKSAAP and DDE) across various machine learning models enabled us to identify the most robust and high-performing models. In terms of auROC, CKSAAP consistently outperformed the other feature sets across all four types of stress considered in this study. As a result, CKSAAP was chosen as the final feature set. We then fine-tuned the models’ hyperparameters using this feature set to optimize performance. This process yielded the best model for each type of stress, and these models were used to produce the final results.
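The recursive elimination idea behind SVM-RFE can be sketched without external libraries: fit a linear model, drop the feature with the smallest squared weight, and repeat. In the sketch below a logistic-regression fit by gradient descent stands in for the linear SVM, so this is a simplified illustration of the elimination loop rather than SVM-RFE itself. The toy data, step count, and learning rate are assumptions.

```python
import math


def train_linear(X, y, features, steps=300, lr=0.5):
    """Fit logistic-regression weights on the given feature columns."""
    w = {j: 0.0 for j in features}
    n = len(X)
    for _ in range(steps):
        grad = {j: 0.0 for j in features}
        for row, label in zip(X, y):
            z = sum(w[j] * row[j] for j in features)
            residual = label - 1.0 / (1.0 + math.exp(-z))
            for j in features:
                grad[j] += residual * row[j] / n
        for j in features:
            w[j] += lr * grad[j]  # gradient ascent on the log-likelihood
    return w


def rfe(X, y, n_keep):
    """Repeatedly refit and drop the feature with the smallest squared weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        w = train_linear(X, y, remaining)
        worst = min(remaining, key=lambda j: w[j] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


# Toy data: column 0 separates the classes, column 1 helps a little,
# column 2 carries no class information.
X = [[2, 1, 1], [2, 1, -1], [2, -1, 1], [2, 1, -1],
     [-2, -1, 1], [-2, -1, -1], [-2, 1, 1], [-2, -1, -1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
kept, dropped = rfe(X, y, n_keep=1)
print(kept, dropped)
```

In a real pipeline the number of features to keep would be chosen by tracking a metric such as cross-validated auROC across elimination rounds, which is how a subset can end up outperforming the full set.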

Despite the promising results, several challenges persist in this field. Continued advancements in high-throughput sequencing, data standardization, and algorithm development are essential for overcoming limitations related to data quality and model interpretability. The integration of heterogeneous datasets remains a critical challenge in biological research. The effective fusion of multisource data, such as transcriptomics, proteomics, and metabolomics, presents a promising solution to enhance the predictive power and biological relevance of machine learning models [65]. The explosive growth of biological data has underscored the need for multimodal integration, as combining diverse data types—such as genomic sequences, protein structures, and metabolic profiles—can provide a more comprehensive understanding of biological processes [15]. This integration has been shown to significantly improve model performance, leading to more accurate and biologically relevant predictions that better capture the complexity of biological systems.

5. Conclusions

In this study, a novel machine learning-based computational tool was introduced to predict stress-related proteins, marking a significant advancement over traditional methods such as BLAST. This tool demonstrated strong potential in identifying genes associated with cold, heat, drought, and salt stress, providing a solid foundation for further functional studies and crop improvement efforts.

This study identified several novel cold, heat, drought, and salt stress-related genes in the Chinese cabbage genome, many of which have functional support in the literature for other plant species. These findings lay a solid foundation for future experimental validation and functional characterization under abiotic stress conditions. These efforts will contribute to a deeper understanding of stress resistance mechanisms and promote the development of stress-tolerant crops. Furthermore, the associated online prediction server provides a user-friendly platform for researchers, facilitating the translation of computational findings into experimental applications.

In conclusion, this study highlights the transformative potential of machine learning in crop stress research. By enabling the identification and characterization of stress-related genes, it lays the groundwork for developing stress-resistant crop varieties. As climate change and environmental pressures intensify, such innovations will be crucial in ensuring sustainable agricultural production and global food security.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/horticulturae11010044/s1. The folder “MEME_analysis_output” contains the results of the MEME analysis for each stress type. The files “family.xlsx” and “genome.xlsx” include all the articles referenced during the data collection process.

Author Contributions

Conceptualization, X.Y., H.L., J.L. and J.T.; methodology, X.Y., Y.S. and X.N.; software, Y.S. and X.N.; validation, X.Y., Y.S. and X.N.; formal analysis, X.Y., Y.S. and X.N.; investigation, X.Y., Y.S., X.N., G.B. and S.F.; resources, Y.S., J.L. and H.L.; data curation, Y.S., X.N., H.L., J.L. and J.T.; writing—original draft preparation, X.Y., Y.S. and X.N.; writing—review and editing, X.Y., Y.S. and X.N.; visualization, Y.S., X.N., G.B. and S.F.; supervision, X.Y.; project administration, X.Y.; funding acquisition, X.Y. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Natural Science Foundation of China (11171155), the Natural Science Foundation of Jiangsu Province, China (BK20171370), and the Primary Research & Development Plan (Modern Agriculture) of Jiangsu Province (BE2023350).

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Materials; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Seong, G.-U.; Hwang, I.-W.; Chung, S.-K. Antioxidant Capacities and Polyphenolics of Chinese Cabbage (Brassica rapa L. Ssp. Pekinensis) Leaves. Food Chem. 2016, 199, 612–618.
2. Zhang, J.; Zhang, D.; Cai, Z.; Wang, L.; Wang, J.; Sun, L.; Fan, X.; Shen, S.; Zhao, J. Spectral Technology and Multispectral Imaging for Estimating the Photosynthetic Pigments and SPAD of the Chinese Cabbage Based on Machine Learning. Comput. Electron. Agric. 2022, 195, 106814.
3. Zhang, H.; Zhu, J.; Gong, Z.; Zhu, J.-K. Abiotic Stress Responses in Plants. Nat. Rev. Genet. 2022, 23, 104–119.
4. Lee, G.-H.; Lee, G.-S.; Yu, J.-G.; Kim, Y.-H.; Park, Y.-D. Correlation Network Analysis of Abiotic Stress-related Genes Reveals the Coordinated Regulation of Transcription in Chinese Cabbage. HST 2018, 36, 266–279.
5. Shaik, R.; Ramakrishna, W. Genes and Co-Expression Modules Common to Drought and Bacterial Stress Responses in Arabidopsis and Rice. PLoS ONE 2013, 8, e77261.
6. Ma, Y.; Qin, F.; Tran, L.-S.P. Contribution of Genomics to Gene Discovery in Plant Abiotic Stress Responses. Mol. Plant 2012, 5, 1176–1178.
7. Chen, L.; Wu, X.; Zhang, M.; Yang, L.; Ji, Z.; Chen, R.; Cao, Y.; Huang, J.; Duan, Q. Genome-Wide Identification of BrCMF Genes in Brassica Rapa and Their Expression Analysis under Abiotic Stresses. Plants 2024, 13, 1118.
8. Hui, J.; Zhang, M.; Chen, L.; Wang, Y.; He, J.; Zhang, J.; Wang, R.; Jiang, Q.; Lv, B.; Cao, Y. Identification, Classification, and Expression Analysis of Leucine-Rich Repeat Extension Genes from Brassica Rapa Reveals Salt and Osmosis Stress Response Genes. Horticulturae 2024, 10, 571.
9. Singh, S.; Chhapekar, S.S.; Ma, Y.; Rameneni, J.J.; Oh, S.H.; Kim, J.; Lim, Y.P.; Choi, S.R. Genome-Wide Identification, Evolution, and Comparative Analysis of B-Box Genes in Brassica rapa, B. oleracea, and B. napus and Their Expression Profiling in B. rapa in Response to Multiple Hormones and Abiotic Stresses. IJMS 2021, 22, 10367.
10. Song, X.; Liu, G.; Duan, W.; Liu, T.; Huang, Z.; Ren, J.; Li, Y.; Hou, X. Genome-Wide Identification, Classification and Expression Analysis of the Heat Shock Transcription Factor Family in Chinese Cabbage. Mol. Genet. Genom. 2014, 289, 541–551.
11. Wang, C.; Duan, W.; Riquicho, A.R.; Jing, Z.; Liu, T.; Hou, X.; Li, Y. Genome-Wide Survey and Expression Analysis of the PUB Family in Chinese Cabbage (Brassica rapa Ssp. Pekinesis). Mol. Genet. Genom. 2015, 290, 2241–2260.
12. Wang, J.; Huang, F.; You, X.; Hou, X. Identification and Functional Characterization of a Cold-Related Protein, BcHHP5, in Pak-Choi (Brassica rapa Ssp. Chinensis). IJMS 2018, 20, 93.
13. Yang, L.; Zhao, Y.; Wu, X.; Zhang, Y.; Fu, Y.; Duan, Q.; Ma, W.; Huang, J. Genome-Wide Identification and Expression Analysis of BraGLRs Reveal Their Potential Roles in Abiotic Stress Tolerance and Sexual Reproduction. Cells 2022, 11, 3729.
14. Ahmed, B.; Haque, M.A.; Iquebal, M.A.; Jaiswal, S.; Angadi, U.B.; Kumar, D.; Rai, A. DeepAProt: Deep Learning Based Abiotic Stress Protein Sequence Classification and Identification Tool in Cereals. Front. Plant Sci. 2023, 13, 1008756.
15. Zhang, S.; Fan, R.; Liu, Y.; Chen, S.; Liu, Q.; Zeng, W. Applications of Transformer-Based Language Models in Bioinformatics: A Survey. Bioinform. Adv. 2023, 3, vbad001.
16. Bhardwaj, N.; Gerstein, M.; Lu, H. Genome-Wide Sequence-Based Prediction of Peripheral Proteins Using a Novel Semi-Supervised Learning Technique. BMC Bioinform. 2010, 11, S6.
17. Ma, C.; Zhang, H.H.; Wang, X. Machine Learning for Big Data Analytics in Plants. Trends Plant Sci. 2014, 19, 798–808.
18. Yan, J.; Wang, X. Machine Learning Bridges Omics Sciences and Plant Breeding. Trends Plant Sci. 2023, 28, 199–210.
19. Van Dijk, A.D.J.; Kootstra, G.; Kruijer, W.; De Ridder, D. Machine Learning in Plant Science and Plant Breeding. iScience 2021, 24, 101890.
20. Gill, M.; Anderson, R.; Hu, H.; Bennamoun, M.; Petereit, J.; Valliyodan, B.; Nguyen, H.T.; Batley, J.; Bayer, P.E.; Edwards, D. Machine Learning Models Outperform Deep Learning Models, Provide Interpretation and Facilitate Feature Selection for Soybean Trait Prediction. BMC Plant Biol. 2022, 22, 180.
21. Asefpour Vakilian, K. Machine Learning Improves Our Knowledge about miRNA Functions towards Plant Abiotic Stresses. Sci. Rep. 2020, 10, 3041.
22. Meher, P.K.; Sahu, T.K.; Gupta, A.; Kumar, A.; Rustgi, S. ASRpro: A Machine-learning Computational Model for Identifying Proteins Associated with Multiple Abiotic Stress in Plants. Plant Genom. 2022, 17, e20259.
23. Ma, L.; Coulter, J.; Liu, L.; Zhao, Y.; Chang, Y.; Pu, Y.; Zeng, X.; Xu, Y.; Wu, J.; Fang, Y.; et al. Transcriptome Analysis Reveals Key Cold-Stress-Responsive Genes in Winter Rapeseed (Brassica rapa L.). IJMS 2019, 20, 1071.
24. Huang, F.; Wang, J.; Tang, J.; Hou, X. Identification, Evolution and Functional Inference on the Cold-Shock Domain Protein Family in Pak-Choi (Brassica rapa Ssp. Chinensis) and Chinese Cabbage (Brassica rapa Ssp. Pekinensis). J. Plant Interact. 2019, 14, 232–241.
25. Yuan, J.; Liu, T.; Yu, Z.; Li, Y.; Ren, H.; Hou, X.; Li, Y. Genome-Wide Analysis of the Chinese Cabbage IQD Gene Family and the Response of BrIQD5 in Drought Resistance. Plant Mol. Biol. 2019, 99, 603–620.
26. Chen, Z.; Liu, X.; Zhao, P.; Li, C.; Wang, Y.; Li, F.; Akutsu, T.; Bain, C.; Gasser, R.B.; Li, J.; et al. iFeatureOmega: An Integrative Platform for Engineering, Visualization and Analysis of Features from Molecular Sequences, Structural and Ligand Data Sets. Nucleic Acids Res. 2022, 50, W434–W447.
27. Chou, K.-C. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics 2005, 21, 10–19.
28. He, Z.; Li, L.; Huang, Z.; Situ, H. Quantum-Enhanced Feature Selection with Forward Selection and Backward Elimination. Quantum Inf. Process. 2018, 17, 154.
29. Huang, M.-L.; Hung, Y.-H.; Lee, W.M.; Li, R.K.; Jiang, B.-R. SVM-RFE Based Feature Selection and Taguchi Parameters Optimization for Multiclass SVM Classifier. Sci. World J. 2014, 2014, 795624.
30. Chen, D.; Liu, J.; Zang, L.; Xiao, T.; Zhang, X.; Li, Z.; Zhu, H.; Gao, W.; Yu, X. Integrated Machine Learning and Bioinformatic Analyses Constructed a Novel Stemness-Related Classifier to Predict Prognosis and Immunotherapy Responses for Hepatocellular Carcinoma Patients. Int. J. Biol. Sci. 2022, 18, 360–373.
31. Sun, S.; Wang, C.; Ding, H.; Zou, Q. Machine Learning and Its Applications in Plant Molecular Studies. Brief. Funct. Genom. 2020, 19, 40–48.
32. Cui, P.; Zhong, T.; Wang, Z.; Wang, T.; Zhao, H.; Liu, C.; Lu, H. Identification of Human Circadian Genes Based on Time Course Gene Expression Profiles by Using a Deep Learning Method. Biochim. Et. Biophys. Acta (BBA) Mol. Basis Dis. 2018, 1864, 2274–2283.
33. Polanski, K.; Rhodes, J.; Hill, C.; Zhang, P.; Jenkins, D.J.; Kiddle, S.J.; Jironkin, A.; Beynon, J.; Buchanan-Wollaston, V.; Ott, S.; et al. Wigwams: Identifying Gene Modules Co-Regulated across Multiple Biological Conditions. Bioinformatics 2014, 30, 962–970.
34. Li, X.; Liu, T.; Tao, P.; Wang, C.; Chen, L. A Highly Accurate Protein Structural Class Prediction Approach Using Auto Cross Covariance Transformation and Recursive Feature Elimination. Comput. Biol. Chem. 2015, 59, 95–100.
35. Kang, D.; Ahn, H.; Lee, S.; Lee, C.-J.; Hur, J.; Jung, W.; Kim, S. StressGenePred: A Twin Prediction Model Architecture for Classifying the Stress Types of Samples and Discovering Stress-Related Genes in Arabidopsis. BMC Genom. 2019, 20, 949.
36. Sohail, A.; Arif, F. Supervised and Unsupervised Algorithms for Bioinformatics and Data Science. Prog. Biophys. Mol. Biol. 2020, 151, 14–22.
37. Roy, A.; Chakraborty, S. Support Vector Machine in Structural Reliability Analysis: A Review. Reliab. Eng. Syst. Saf. 2023, 233, 109126.
38. Li, Z. Extracting Spatial Effects from Machine Learning Model Using Local Interpretation Method: An Example of SHAP and XGBoost. Comput. Environ. Urban. Syst. 2022, 96, 101845.
39. Wang, H.; Wang, G. Improving Random Forest Algorithm by Lasso Method. J. Stat. Comput. Simul. 2021, 91, 353–367.
40. Ngo, G.; Beard, R.; Chandra, R. Evolutionary Bagging for Ensemble Learning. Neurocomputing 2022, 510, 1–14.
41. Mienye, I.D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149.
42. Abraham, A. Machine Learning for Neuroimaging with Scikit-Learn. Front. Neuroinform. 2014, 8, 71792.
43. Jiang, G.; Wang, W. Error Estimation Based on Variance Analysis of k-Fold Cross-Validation. Pattern Recognit. 2017, 69, 94–106.
44. Canbek, G.; Taskaya Temizel, T.; Sagiroglu, S. BenchMetrics: A Systematic Benchmarking Method for Binary Classification Performance Metrics. Neural Comput. Applic 2021, 33, 14623–14650.
45. Xiong, H.; Capurso, D.; Sen, Ś.; Segal, M.R. Sequence-Based Classification Using Discriminatory Motif Feature Selection. PLoS ONE 2011, 6, e27382.
46. Bailey, T.L.; Johnson, J.; Grant, C.E.; Noble, W.S. The MEME Suite. Nucleic Acids Res. 2015, 43, W39–W49.
47. Lu, X.; Cheng, Y.; Gao, M.; Li, M.; Xu, X. Molecular Characterization, Expression Pattern and Function Analysis of Glycine-Rich Protein Genes Under Stresses in Chinese Cabbage (Brassica rapa L. Ssp. Pekinensis). Front. Genet. 2020, 11, 774.
48. Kreps, J.A.; Wu, Y.; Chang, H.-S.; Zhu, T.; Wang, X.; Harper, J.F. Transcriptome Changes for Arabidopsis in Response to Salt, Osmotic, and Cold Stress. Plant Physiol. 2002, 130, 2129–2141.
49. Skirpan, A.L.; McCubbin, A.G.; Ishimizu, T.; Wang, X.; Hu, Y.; Dowd, P.E.; Ma, H.; Kao, T. Isolation and Characterization of Kinase Interacting Protein 1, a Pollen Protein That Interacts with the Kinase Domain of PRK1, a Receptor-Like Kinase of Petunia. Plant Physiol. 2001, 126, 1480–1492.
50. Ji, H.; Liu, J.; Chen, Y.; Yu, X.; Luo, C.; Sang, L.; Zhou, J.; Liao, H. Bioinformatic Analysis of Codon Usage Bias of HSP20 Genes in Four Cruciferous Species. Plants 2024, 13, 468.
51. Guo, X.; Fu, Y.; Lee, Y.J.; Chern, M.; Li, M.; Cheng, M.; Dong, H.; Yuan, Z.; Gui, L.; Yin, J.; et al. The PGS1 Basic Helix-loop-helix Protein Regulates Fl3 to Impact Seed Growth and Grain Yield in Cereals. Plant Biotechnol. J. 2022, 20, 1311–1326.
52. Hong, J.K.; Je, J.; Song, C.; Hwang, J.E.; Lee, Y.-H.; Lim, C.O. Biochemical Analysis of a Chinese Cabbage Phytocystatin-1. Genes. Genom. 2012, 34, 13–18.
53. Yu, J.; Gao, L.; Liu, W.; Song, L.; Xiao, D.; Liu, T.; Hou, X.; Zhang, C. Transcription Coactivator ANGUSTIFOLIA3 (AN3) Regulates Leafy Head Formation in Chinese Cabbage. Front. Plant Sci. 2019, 10, 520.
54. Saha, G.; Park, J.-I.; Jung, H.-J.; Ahmed, N.U.; Kayum, M.A.; Chung, M.-Y.; Hur, Y.; Cho, Y.-G.; Watanabe, M.; Nou, I.-S. Genome-Wide Identification and Characterization of MADS-Box Family Genes Related to Organ Development and Stress Resistance in Brassica Rapa. BMC Genom. 2015, 16, 178.
55. Saha, G.; Park, J.-I.; Ahmed, N.U.; Kayum, M.A.; Kang, K.-K.; Nou, I.-S. Characterization and Expression Profiling of MYB Transcription Factors against Stresses and during Male Organ Development in Chinese Cabbage (Brassica rapa Ssp. Pekinensis). Plant Physiol. Biochem. 2016, 104, 200–215.
56. Ernst, M.; Robertson, J.L. The Role of the Membrane in Transporter Folding and Activity. J. Mol. Biol. 2021, 433, 167103.
57. Cui, J.; Hua, Y.; Zhou, T.; Liu, Y.; Huang, J.; Yue, C. Global Landscapes of the Na+/H+ Antiporter (NHX) Family Members Uncover Their Potential Roles in Regulating the Rapeseed Resistance to Salt Stress. Int. J. Mol. Sci. 2020, 21, 3429.
58. Orovwode, H.; Ibukun, O.; Abubakar, J.A. A Machine Learning-Driven Web Application for Sign Language Learning. Front. Artif. Intell. 2024, 7, 1297347.
59. Cai, Z.; Tang, Q.; Song, P.; Tian, E.; Yang, J.; Jia, G. The m6A Reader ECT8 Is an Abiotic Stress Sensor That Accelerates mRNA Decay in Arabidopsis. Plant Cell 2024, 36, 2908–2926.
60. Murmu, S.; Sinha, D.; Chaurasia, H.; Sharma, S.; Das, R.; Jha, G.K.; Archak, S. A Review of Artificial Intelligence-Assisted Omics Techniques in Plant Defense: Current Trends and Future Directions. Front. Plant Sci. 2024, 15, 1292054.
61. Koh, E.; Sunil, R.S.; Lam, H.Y.I.; Mutwil, M. Confronting the Data Deluge: How Artificial Intelligence Can Be Used in the Study of Plant Stress. Comput. Struct. Biotechnol. J. 2024, 23, 3454–3466.
62. Zeng, H.; Zhuang, Y.; Yan, X.; He, X.; Qiu, Q.; Liu, W.; Zhang, Y. Machine Learning-Based Identification of Novel Hub Genes Associated with Oxidative Stress in Lupus Nephritis: Implications for Diagnosis and Therapeutic Targets. Lupus Sci. Med. 2024, 11, e001126.
63. Xu, J.; Gao, Y.; Lu, Q.; Zhang, R.; Gui, J.; Liu, X.; Yue, Z. RiceSNP-BST: A Deep Learning Framework for Predicting Biotic Stress–Associated SNPs in Rice. Brief. Bioinform. 2024, 25, bbae599.
Monem, S.; Hassanien, A.E.; Abdel-Hamid, A.H. A Multi-View Feature Representation for Predicting Drugs Combination Synergy Based on Ensemble and Multi-Task Attention Models. J. Cheminform 2024, 16, 110. [Google Scholar] [CrossRef] [PubMed]
Gao, P.; Zhao, H.; Luo, Z.; Lin, Y.; Feng, W.; Li, Y.; Kong, F.; Li, X.; Fang, C.; Wang, X. SoyDNGP: A Web-Accessible Deep Learning Framework for Genomic Prediction in Soybean Breeding. Brief. Bioinform. 2023, 24, bbad349. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Total gene counts for different stress types in Chinese cabbage.

Figure 2. Overlap of genes across different articles. In each bar chart, the horizontal axis gives the number of articles in which a gene is reported as related to a specific abiotic stress, and the vertical axis gives the number of genes reported in that many articles. Genes mentioned in only one article are excluded from these charts.

Figure 3. Schematic workflow for model development.

Figure 4. Discriminative motif discovery in protein sequences under various stress conditions.

Figure 5. Scatterplot heatmaps of auROC for different machine learning algorithms and feature construction methods. In the heatmap, the color intensity indicates the performance of the models, with red representing higher auROC values and blue indicating lower auROC values.

Figure 6. Area under the receiver operating characteristic curve (auROC) as a function of the number of CKSAAP features retained by SVM-RFE feature selection. The blue vertical line marks the number of features that maximizes the auROC, and the corresponding optimal auROC value is labeled at that point.

Figure 7. ROC curves for salt stress show evaluation results from different feature construction methods and classification algorithms. Each plot represents a different feature construction method, while the legend in each plot indicates the performance of various machine learning algorithms applied to the respective feature set, with each line color corresponding to a specific algorithm.

Figure 8. ROC curves for the four abiotic stresses, evaluated on protein sequence features produced by the CKSAAP method with various classification algorithms. Each plot shows the ROC curves for the optimal CKSAAP features under one abiotic stress condition (cold, heat, drought, or salt). The legend in each plot gives the performance of each machine learning algorithm applied to the selected CKSAAP features.

Figure 9. Interface for use of MLAS.

Table 1. Summary of the positive and negative datasets.

Category	Cold	Heat	Drought	Salt
Single stress	728	663	449	328
Positive set	527	515	409	239
Negative set	1163	1175	1281	1451
Unlabeled set	40,291	40,356	40,570	40,691
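
The counts in Table 1 can be cross-checked with a few lines of Python. This is only a sketch: the dictionary layout and function names are illustrative and not taken from the paper. It confirms that the positive and negative sets sum to the same labelled total (1690) for every stress, and it makes the class imbalance explicit, which matters when auPRC is read alongside auROC.

```python
# Counts copied from Table 1; keys and helper names are illustrative only.
TABLE_1 = {
    "Cold":    {"single": 728, "positive": 527, "negative": 1163, "unlabeled": 40291},
    "Heat":    {"single": 663, "positive": 515, "negative": 1175, "unlabeled": 40356},
    "Drought": {"single": 449, "positive": 409, "negative": 1281, "unlabeled": 40570},
    "Salt":    {"single": 328, "positive": 239, "negative": 1451, "unlabeled": 40691},
}


def labelled_total(row):
    """Number of genes with a label (positive or negative) for one stress."""
    return row["positive"] + row["negative"]


def positive_fraction(row):
    """Share of labelled genes that are positive (the class prior)."""
    return row["positive"] / labelled_total(row)


labelled_totals = {stress: labelled_total(row) for stress, row in TABLE_1.items()}
single_plus_unlabeled = {
    stress: row["single"] + row["unlabeled"] for stress, row in TABLE_1.items()
}

for stress, row in TABLE_1.items():
    print(f"{stress}: labelled={labelled_totals[stress]}, "
          f"positive fraction={positive_fraction(row):.3f}")
```

The positive fraction is lowest for salt (239 of 1690), so a random classifier would have a lower auPRC baseline there than for cold or heat.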

Table 2. The best machine learning method for classifying abiotic stress with three different feature sets.

Stress	Feature	Model	Accuracy (%)	auROC (%)	auPRC (%)
Cold	CKSAAP	RF	74.26	81.42	70.92
	DDE	BAG	76.92	77.50	63.92
	CKSAAP + DDE	RF	73.67	80.86	71.22
Heat	CKSAAP	GBDT	82.84	87.92	81.76
	DDE	RF	73.08	79.59	65.73
	CKSAAP + DDE	GBDT	77.51	85.72	76.66
Drought	CKSAAP	XGB	81.36	80.85	63.11
	DDE	SVM	79.88	78.78	56.21
	CKSAAP + DDE	RF	75.74	80.62	62.54
Salt	CKSAAP	GBDT	88.48	88.87	79.63
	DDE	BAG	89.04	87.97	74.29
	CKSAAP + DDE	RF	83.15	88.15	76.79
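
For readers unfamiliar with the metrics in Table 2, the following sketch shows one standard way to compute auROC (as the probability that a positive example outscores a negative one) and accuracy at a fixed threshold. It uses only the Python standard library; the function names, the 0.5 threshold, and the toy labels and scores are assumptions made for illustration and are not the paper's data or code.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy data invented for illustration only (1 = stress-responsive, 0 = not).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

print(f"auROC    = {au_roc(labels, scores):.4f}")
print(f"accuracy = {accuracy(labels, scores):.2f}")
```

Because auROC depends only on the ranking of scores while accuracy depends on the chosen threshold, the two columns in Table 2 need not agree on which model is best.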

Table 3. Optimal models for predicting gene responses to different abiotic stresses.

Stress	Feature	Model	Number of Features
Cold	CKSAAP	RF	1045
Heat	CKSAAP	GBDT	1224
Drought	CKSAAP	XGB	1016
Salt	CKSAAP	GBDT	1018
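
CKSAAP (composition of k-spaced amino acid pairs) is the feature set behind every model in Table 3. The sketch below follows the commonly used definition: for each gap k, count the ordered residue pairs separated by k positions and normalize by the number of valid pairs. The maximum gap `k_max=5` (2400 features in total) is an assumption for illustration; this section does not state the value used in the paper, and the feature counts in Table 3 are subsets chosen by feature selection.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=5):
    """Return the CKSAAP vector for gaps k = 0..k_max (400 values per gap).

    k_max=5 is an illustrative default, not a value confirmed by the paper.
    """
    sequence = sequence.upper()
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        valid_pairs = 0
        # Pair residue i with the residue k positions further along plus one.
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                valid_pairs += 1
        # Normalize to frequencies; an all-zero block if no pair fits this gap.
        features.extend(
            counts[p] / valid_pairs if valid_pairs else 0.0 for p in PAIRS
        )
    return features


# Hypothetical short peptide, used only to show the output shape.
vector = cksaap("MKTAYIAKQR")
print(len(vector))  # 400 values for each of the 6 gaps
```

Each 400-value block sums to 1 whenever the sequence is long enough for that gap, so proteins of different lengths map to vectors on the same scale.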

Table 4. The top 10 newly predicted genes for the four stresses.

Rank	Cold	Likeliness	Heat	Likeliness	Drought	Likeliness	Salt	Likeliness
1	Bra017529	0.80	Bra009415	0.99	Bra020398	0.99	Bra039970	0.99
2	Bra023050	0.79	Bra013731	0.99	Bra002679	0.98	Bra026062	0.99
3	Bra025001	0.79	Bra031896	0.99	Bra031714	0.97	Bra021958	0.99
4	Bra010944	0.77	Bra023806	0.99	Bra036963	0.97	Bra022658	0.99
5	Bra023102	0.75	Bra015720	0.99	Bra008802	0.97	Bra023592	0.99
6	Bra022197	0.75	Bra020597	0.99	Bra040856	0.97	Bra025086	0.99
7	Bra013175	0.74	Bra035105	0.99	Bra016377	0.96	Bra000458	0.99
8	Bra002647	0.72	Bra025461	0.99	Bra019932	0.96	Bra018594	0.99
9	Bra023103	0.72	Bra019045	0.99	Bra007363	0.96	Bra025850	0.98
10	Bra032483	0.71	Bra009416	0.99	Bra034636	0.96	Bra022945	0.98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

