Article

AOPxSVM: A Support Vector Machine for Identifying Antioxidant Peptides Using a Block Substitution Matrix and Amino Acid Composition, Transformation, and Distribution Embeddings

1 College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
2 College of Food and Biological Engineering, Chengdu University, Chengdu 610106, China
3 Country Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, Chengdu 610106, China
4 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology, Chengdu 610106, China
* Author to whom correspondence should be addressed.
Foods 2025, 14(12), 2014; https://doi.org/10.3390/foods14122014
Submission received: 14 April 2025 / Revised: 31 May 2025 / Accepted: 4 June 2025 / Published: 6 June 2025

Abstract

Antioxidant peptides (AOPs) have the natural properties of food preservatives: they can improve the oxidative stability of food while also providing additional benefits such as disease prevention. Traditional experimental methods for identifying antioxidant peptides are time-consuming and costly, so effective machine learning models are increasingly valued by researchers. In this study, we integrated amino acid composition, transformation, and distribution (CTD) features with block substitution matrix 62 (BLOSUM62) features to develop an SVM-based AOP prediction model called AOPxSVM. By comparing 15 feature combinations and applying feature selection strategies, this approach significantly improves the prediction accuracy of the model, with the effectiveness of the feature engineering visually verified using UMAP. AOPxSVM achieves high accuracy values of 0.9092 and 0.9330, as well as Matthew's correlation coefficients (MCCs) of 0.8253 and 0.8670, on two independent test sets, surpassing the state-of-the-art methods on the same test sets and demonstrating excellent AOP identification capability. We believe that AOPxSVM can serve as a powerful tool for identifying AOPs.

1. Introduction

Antioxidants play a critical role in the food industry due to their ability to counteract oxidation, a significant factor in food spoilage. According to the Food and Agriculture Organization of the United Nations (FAO), approximately one-third of all food produced for human consumption worldwide is either spoiled or wasted, resulting in significant economic losses to the food industry. In the context of meat and meat products, lipid oxidation emerges as the principal non-microbial cause of quality degradation [1,2]. Similarly, in fruits and vegetables, oxidative processes lead to enzymatic and non-enzymatic browning, significantly impairing their sensory attributes, texture, and nutritional value [3]. The impact of oxidation extends beyond food, playing a pivotal role in human physiology. Oxidative cellular metabolism generates reactive oxygen species (ROS) [4], which are associated with the development of oxidative stress [5]. Prolonged oxidative stress has been implicated as a contributing factor to various diseases, including inflammatory disorders, cardiovascular diseases, diabetes mellitus, certain forms of cancer, and neurodegenerative conditions such as Alzheimer's disease [6,7,8,9]. As a result, the World Health Organization has advocated for a worldwide increase in dietary antioxidants, as food intake is the main source of these compounds [10].
Synthetic antioxidants offer significant economic advantages and are highly effective. However, they are associated with specific toxicity and harmful effects. For example, the antioxidant BHA has been shown to lead to a higher incidence of forestomachal papilloma and squamous cell carcinoma [11]. In contrast, antioxidant peptides (AOPs) exhibit lower toxicity levels and are deemed safer for use as natural antioxidants. Notably, AOPs serve dual functions: on the one hand, they can prolong the shelf life of foods by preventing lipid oxidation processes; on the other hand, they mitigate oxidative stress in cells by neutralizing harmful free radicals [12]. To date, thousands of antioxidant peptides have been identified and extracted from various sources, such as meat products [13], seafood [14], plants [15], grains, and dairy products [16]. Despite the numerous advantages of AOPs, their identification primarily relies on conventional wet laboratory experiments. However, these experimental methods are labor intensive and time consuming, leading to inefficiencies. Consequently, there has been a growing interest in exploring advanced computational tools—especially artificial intelligence methods [17]—to predict antioxidant peptides more efficiently.
In recent years, several artificial intelligence methods have been developed to predict antioxidant peptides (AOPs). In 2020, T.H. Olsen et al. developed AnOxPePred [18], the first online server for AOP prediction, which uses one-hot encoding as the input feature vector within a convolutional neural network (CNN)-based framework. In 2022, Shen Yong et al. proposed a method combining pseudo amino acid composition (PseAAC) [19] and motif-based feature extraction to predict AOPs. Although their study demonstrated promising results, it did not include comparative analyses against other approaches. In 2023, Qin Dongya et al. proposed the BiLSTM [20]-based AnOxPP [21,22], which demonstrated superior performance, with accuracies of 0.967 and 0.819 on two independent test sets, outperforming AnOxPePred. However, the large gap between the metrics on the two test sets suggests limitations in robustness. In the same year, Du Zhenjiao et al. proposed UniDL4BioPep [23] for predicting bioactive peptides; this framework uses a transformer-based ESM-2 language model to generate fixed-length (320-dimensional) peptide sequence embeddings. In 2025, Li Wanxing et al. introduced a new BiLSTM-based model, AOPP [24], which surpassed AnOxPePred and AnOxPP on two different datasets and achieved state-of-the-art (SOTA) performance.
Although previous studies have achieved commendable performance in testing, notable limitations and shortcomings still remain. Regarding feature extraction, current research has predominantly focused on either sequence fingerprints or sequence evolution features extracted in a case-specific manner, thereby neglecting the global physicochemical property information of peptides. Furthermore, some studies focused solely on sequence fingerprints and physicochemical features while omitting sequence evolution features. In other domains, such as antimicrobial peptide research [25], models that integrate sequence fingerprints, sequence evolution features, and physicochemical property features have emerged [26,27,28,29], demonstrating performance far superior to approaches relying on single feature categories. This underscores the importance of combining diverse classes of features for accurately predicting peptide properties.
In response to these limitations, this study proposes a new machine learning model: AOPxSVM. The construction process of AOPxSVM is shown in Figure 1. We introduced feature encoding methods including ASDC, BLOSUM62, AAindex, and CTD, which represent three types of features: sequence fingerprints, sequence evolution features, and physicochemical property features. By integrating these features, we demonstrated that the complementarity of multidimensional information contributes to improved model performance. We then further optimized the feature vector using feature selection methods. Finally, we used the uniform manifold approximation and projection (UMAP) algorithm for visualization to confirm the effectiveness of the feature engineering optimization. Compared with the existing optimal model, AOPP, AOPxSVM achieved significant improvements in independent test metrics on both datasets, demonstrating the method's superiority. We expect that this study will help promote the application of machine learning models for predicting antioxidant peptides in the food industry and ultimately advance peptide-related research and industrial applications.

2. Materials and Methods

2.1. Benchmark Dataset

Our model was developed using the latest AOPP datasets to facilitate comparison with other models [24]. AOPP has two different datasets. The positive samples of the first dataset comprise 1511 non-redundant AOPs from DFBP, the BIOPEP-UWM database, the antimicrobial peptide database, and PlantPepDB. Candidate peptide sequences were generated using a Python program; CD-HIT was used to filter out sequences with more than 90% similarity to the positive samples, and an equal number of peptide sequences with the same lengths as the positive samples were then randomly extracted to form the negative samples of the first dataset. The integrated dataset contains 3022 samples (1511 AOPs and 1511 non-AOPs), which were randomly divided into training and test sets at a ratio of 8:2. After this division, approximately 2417 samples were assigned to the training set and 604 samples to the test set. In addition, AOPP compiled an independent validation dataset of 75 peptide sequences with high antioxidant activity from the scientific literature from 2022 to 2023 to evaluate model performance relative to existing research results. To clearly distinguish between the datasets in this study, the two test sets are termed AOPP.test01 and AOPP.test2023, respectively. The former was used to develop and improve the model, while the latter provided a benchmark for comparative performance evaluation. The details are shown in Table S1.

2.2. Feature Extraction

2.2.1. Physicochemical Property Feature

(1)
Amino Acid Index (AAindex)
The AAindex database [30] contains 566 physicochemical properties of amino acids, including measures such as hydrophobicity and polarity. A 566-dimensional feature vector was computed by averaging each physicochemical property over the amino acids in the sequence.
$$V_i = (v_1, v_2, \ldots, v_{566})$$

$$F = \frac{1}{L} \sum_{i=1}^{L} V_i$$

where $v_j$ is the value of the $j$th physicochemical property for a given amino acid, $V_i$ is the property vector of the $i$th amino acid in the peptide of length $L$, and $F$ is the average of each physicochemical property over all amino acids.
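The averaging above can be sketched in a few lines. The three-property table below is a hypothetical miniature of the 566-property AAindex database, used only for illustration:

```python
import numpy as np

# Hypothetical mini AAindex table: 3 properties (hydrophobicity, polarity,
# volume) per residue; the real database provides 566 properties.
AAINDEX = {
    "A": np.array([1.8, 0.00, 88.6]),
    "C": np.array([2.5, 1.48, 108.5]),
    "G": np.array([-0.4, 0.00, 60.1]),
}

def aaindex_features(peptide: str) -> np.ndarray:
    """F = (1/L) * sum_i V_i: average each property over all residues."""
    return np.mean([AAINDEX[aa] for aa in peptide], axis=0)

f = aaindex_features("ACG")  # one value per property
```

Because every residue contributes one row, the output dimension depends only on the number of properties, not on peptide length.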
(2)
Composition Transformation and Distribution (CTD)
CTD [31,32] systematically categorizes 20 amino acids into 3 distinct groups based on specific physicochemical properties. This classification enables the analysis of amino acid sequences through three primary components: composition (C)—this element quantifies the relative proportion of each amino acid group within the sequence; transition frequency (T), wherein the frequency of transitions between different amino acid groups is measured as they appear in the sequence; and distribution (D)—this component captures the positional distribution of amino acids through 5 specified quantiles: 0%, 25%, 50%, 75%, and 100%. These quantiles represent cumulative proportions across the sequence’s length at these intervals [33,34]. In this study, the CTD method incorporates 13 distinct physicochemical properties for classification purposes. These include characteristics such as hydrophobicity, polarizability, and charge, among others, ensuring a comprehensive analysis of amino acid traits. This process results in the generation of a feature vector characterized by 273 dimensions.
$$C_i = \frac{N_i}{L}$$

$$T_{ij} = \frac{N_{ij} + N_{ji}}{L - 1}$$

$$D_{ik} = \frac{N_{ik}}{N_i}$$

where $L$ is the total length of the peptide sequence, $N_i$ is the number of amino acids in group $i$, $N_{ij}$ is the number of transitions from group $i$ to group $j$ along the sequence, and $N_{ik}$ is the number of amino acids in group $i$ within the first $k\%$ of positions.
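As a concrete illustration, the sketch below computes C, T, and D for a single physicochemical property; the full encoder repeats this over 13 properties, each contributing 3 + 3 + 15 = 21 values, to reach 273 dimensions. The hydrophobicity grouping used here is one standard choice and is assumed for illustration only:

```python
GROUPS = {  # one common 3-way hydrophobicity grouping (assumed)
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}

def ctd_one_property(seq, groups):
    L = len(seq)
    labels = [next(g for g, members in groups.items() if aa in members)
              for aa in seq]
    names = sorted(groups)
    # C: fraction of residues falling in each group
    comp = {g: labels.count(g) / L for g in names}
    # T: frequency of adjacent-residue transitions between two groups
    trans = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            n = sum(1 for x, y in zip(labels, labels[1:]) if {x, y} == {a, b})
            trans[(a, b)] = n / (L - 1)
    # D: relative position of the first, 25%, 50%, 75%, and last
    # occurrence of each group
    dist = {}
    for g in names:
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        if not pos:
            dist[g] = [0.0] * 5
        else:
            picks = [pos[0]] + [pos[max(0, round(q * len(pos)) - 1)]
                                for q in (0.25, 0.5, 0.75, 1.0)]
            dist[g] = [p / L for p in picks]
    return comp, trans, dist

comp, trans, dist = ctd_one_property("ARGC", GROUPS)
```

The exact quantile convention varies between implementations; the version above picks occurrence indices by rounding, which is one reasonable reading of the 0/25/50/75/100% description.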

2.2.2. Sequence Fingerprinting

(1)
Adaptive Skip Dipeptide Composition (ASDC)
ASDC [35,36] captures amino acid associations at arbitrary distances in a sequence by counting the frequency of dipeptide combinations at all possible intervals in the peptide sequence. Its feature vector can be expressed as
$$\mathrm{ASDC} = (f_{v_{1,1}}, f_{v_{1,2}}, \ldots, f_{v_{20,20}})$$

where $f_{v_{i,j}}$ represents the probability of each amino acid pair, with 400 amino acid pairs in total.
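A minimal implementation of this counting, assuming ordered pairs taken over all skip distances:

```python
from itertools import product

AAS = "ACDEFGHIKLMNPQRSTVWY"

def asdc(seq: str) -> list:
    """Count every ordered residue pair (earlier residue first) at all
    skip distances, then normalize by the total number of pairs."""
    counts = {p: 0 for p in product(AAS, AAS)}
    total = 0
    for i in range(len(seq) - 1):
        for j in range(i + 1, len(seq)):
            counts[(seq[i], seq[j])] += 1
            total += 1
    return [counts[p] / total for p in product(AAS, AAS)]

vec = asdc("ACA")  # pairs at all gaps: (A,C), (A,A), (C,A)
```

The vector always has 20 × 20 = 400 entries and sums to 1, regardless of peptide length.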

2.2.3. Sequence Evolution Features

(1)
Block Substitution Matrix 62 (BLOSUM62)
BLOSUM-n is a 20 × 20 amino acid substitution score matrix [37] designed to evaluate the likelihood of specific amino acid substitutions. BLOSUM-n calculates the substitution score s for two amino acids x and y. The calculation is as follows:
$$s(x, y) = \frac{1}{\lambda} \log \frac{p_{xy}}{f_x f_y}$$

where $p_{xy}$ is the target frequency, representing the probability of observing $x$ and $y$ aligned in homologous sequences; $f_x$ and $f_y$ are background frequencies, representing the probabilities of $x$ and $y$ occurring independently in protein sequences; $\lambda$ is a scaling factor used to convert the score $s$ to an integer; and $n$ is the percentage identity threshold used to cluster the homologous sequence blocks from which the frequencies are estimated (62% for BLOSUM62).
For peptide prediction purposes, we extended the BLOSUM62 matrix by appending a column of zero vectors, resulting in a 20 × 21 matrix. Subsequently, this expanded matrix was augmented with weights derived from amino acid frequencies in peptide sequences. This comprehensive approach yielded a 420-dimensional feature vector, capturing extensive amino acid substitution information for predictive modeling.
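The frequency-weighting step might look like the sketch below. To stay short, it uses a toy 3 × 3 substitution matrix in place of the real 20 × 20 BLOSUM62, with the zero column appended as described, so the result has 3 × 4 = 12 dimensions instead of 20 × 21 = 420:

```python
import numpy as np

# Toy substitution matrix over 3 residues (stand-in for BLOSUM62);
# a zero column is appended, mirroring the paper's 20x21 extension.
RES = "ACG"
SUB = np.array([[4,  0,  0],
                [0,  9, -3],
                [0, -3,  6]], dtype=float)
SUB_EXT = np.hstack([SUB, np.zeros((3, 1))])  # 3x4

def blosum_features(seq: str) -> np.ndarray:
    """Weight each residue's substitution row by its frequency in the
    peptide, then flatten row-major."""
    freqs = np.array([seq.count(r) / len(seq) for r in RES])
    return (freqs[:, None] * SUB_EXT).ravel()

feat = blosum_features("AAG")
```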

2.2.4. Deep Learning-Based Embedded Features

(1)
TAPE_BERT
BERT [38,39] is a transformer-based pre-training framework that performs self-supervised learning on unlabeled datasets to derive initial parameter values [40]. TAPE_BERT is trained on the Protein Family Database (Pfam) [41]. TAPE_BERT produces a 768-dimensional feature vector. The key formulation of the BERT model is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the key vector dimension.
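In NumPy, the scaled dot-product attention above is a matrix product plus a softmax; this is a generic sketch of the formula, not TAPE_BERT's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.eye(2)
K = np.eye(2)
V = np.array([[1.0, 0.0], [0.0, 1.0]])
out = scaled_dot_product_attention(Q, K, V)
```

With identity queries and keys, each output row is a convex mixture of the value rows, weighted toward the matching position.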
(2)
UniRep
UniRep [42] is built on a Multiplicative Long Short-Term Memory (mLSTM) network trained by unsupervised learning on the UniRef50 [43] database of about 24 million protein sequences. UniRep produces a 1900-dimensional feature vector. The mLSTM network is defined by the following equations:
$$m_t = (W_{mx} x_t) \odot (W_{mh} h_{t-1})$$

$$\hat{h}_t = W_{hx} x_t + W_{hm} m_t$$

$$i_t = \sigma(W_{ix} x_t + W_{im} m_t)$$

$$o_t = \sigma(W_{ox} x_t + W_{om} m_t)$$

$$f_t = \sigma(W_{fx} x_t + W_{fm} m_t)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(\hat{h}_t)$$

$$h_t = \tanh(c_t) \odot o_t$$

where $m_t$ is the current multiplicative intermediate state, $W$ denotes a weight matrix, $x_t$ is the input at the current time step, $h_{t-1}$ is the hidden state of the previous time step, $\hat{h}_t$ is the candidate update, $i_t$, $o_t$, and $f_t$ are the input, output, and forget gates, respectively, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication. $c_t$ is the state of the memory cell, which carries the long-term memory.
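A single mLSTM step following these equations can be sketched as below; the random toy weights and small dimensions are assumptions for illustration, not UniRep's trained 1900-unit parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy sizes; UniRep uses 1900 hidden units
SHAPES = {"mx": (d_h, d_in), "mh": (d_h, d_h),
          "hx": (d_h, d_in), "hm": (d_h, d_h),
          "ix": (d_h, d_in), "im": (d_h, d_h),
          "ox": (d_h, d_in), "om": (d_h, d_h),
          "fx": (d_h, d_in), "fm": (d_h, d_h)}
W = {name: rng.normal(scale=0.1, size=shape) for name, shape in SHAPES.items()}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x_t, h_prev, c_prev, W):
    m = (W["mx"] @ x_t) * (W["mh"] @ h_prev)      # multiplicative state
    h_hat = W["hx"] @ x_t + W["hm"] @ m           # candidate update
    i = sigmoid(W["ix"] @ x_t + W["im"] @ m)      # input gate
    o = sigmoid(W["ox"] @ x_t + W["om"] @ m)      # output gate
    f = sigmoid(W["fx"] @ x_t + W["fm"] @ m)      # forget gate
    c = f * c_prev + i * np.tanh(h_hat)           # memory cell
    h = np.tanh(c) * o                            # hidden state
    return h, c

h, c = mlstm_step(np.ones(d_in), np.zeros(d_h), np.zeros(d_h), W)
```

Since tanh is bounded and the output gate lies in (0, 1), the hidden state stays strictly inside (−1, 1).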

2.3. Machine Learning Methods

In the context of this study, we employed six widely recognized machine learning (ML) methodologies, each distinguished for its performance and applicability across diverse circumstances [44]. These include Support Vector Machine (SVM) [45,46,47,48], Light Gradient Boosting Machine (LGBM) [49], Logistic Regression (LR) [50], Random Forest (RF) [51,52], K-Nearest Neighbors (KNN) [53], and Gaussian Naive Bayes (GNB) [54].
The SVM algorithm seeks a hyperplane that maximizes the margin between classes, mapping the data into a high-dimensional space via a kernel function. SVM can therefore handle nonlinearly separable problems and is well suited to binary classification tasks in bioinformatics.
LGBM is a decision tree-based algorithm that employs a gradient boosting strategy to optimize the model.
LR solves the classification problem by mapping the linear output to the interval [0, 1] using a sigmoid function, which represents the probability that the sample belongs to a particular class.
RF is an ensemble learning algorithm that decides the outcome by training multiple decision trees and voting.
KNN is an instance-based learning algorithm that classifies samples by calculating the distance between them; when a new sample is input, the class of the new sample is determined by voting on the K closest sample points.
GNB is a classification algorithm based on the Bayes theorem.

2.4. Feature Selection Methods

Feature selection is essential for developing robust predictive models [55]. This process effectively reduces dimensionality by eliminating redundant features while preserving critical ones, thereby enhancing model interpretability and performance. LGBM (Light Gradient Boosting Machine), a powerful gradient-boosting decision tree framework [56], determines feature importance by quantifying the number of splits each feature undergoes in the trees, subsequently ranking them in descending order. Previous studies have shown that LGBM outperforms both analysis of variance (ANOVA) [57] and mutual information (MI) [58] in feature selection for peptide sequences [50]. The mathematical formula for LGBM feature importance ranking is as follows:
$$\mathrm{Importance}_{\mathrm{split}}(f) = \sum_{t=1}^{T} \sum_{n=1}^{N_t} I(v_{t,n} = f)$$

where $\mathrm{Importance}_{\mathrm{split}}(f)$ is the split importance of feature $f$, $T$ is the total number of trees, $N_t$ is the number of nodes in the $t$th tree, $v_{t,n}$ is the feature used by the $n$th node of the $t$th tree, and $I(\cdot)$ is the indicator function, which equals 1 if the condition holds and 0 otherwise.
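Once per-feature split counts are available (LightGBM exposes them as `feature_importances_` when `importance_type='split'`), keeping the top-k columns reduces to an argsort; the sketch below uses a hand-written importance array in place of a trained model:

```python
import numpy as np

def select_top_k(X, importances, k):
    """Keep the k columns of X with the highest importance scores,
    preserving the original column order."""
    top = np.sort(np.argsort(importances)[::-1][:k])
    return X[:, top], top

X = np.arange(12).reshape(3, 4)        # 3 samples, 4 features
split_counts = np.array([1, 5, 3, 0])  # stand-in for LGBM split counts
X_sel, kept = select_top_k(X, split_counts, k=2)
```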

2.5. Model Evaluation Metrics

To evaluate the model’s performance, we used seven assessment metrics, including accuracy (ACC), Matthew’s correlation coefficient (MCC), sensitivity (Sn), specificity (Sp), precision (Pre), area under the curve (AUC), and F1 score. These indicators were calculated from the numbers of true-positive samples (TP), true-negative samples (TN), false-positive samples (FP), and false-negative samples (FN) [59,60,61,62,63,64,65].
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{Pre} = \frac{TP}{TP + FP}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Pre} \times \mathrm{Sn}}{\mathrm{Pre} + \mathrm{Sn}}$$
Among them, the F1 score represents the harmonic mean of precision and recall, which comprehensively considers the accuracy and completeness of positive class predictions and is particularly suitable for situations where positive and negative samples are unevenly distributed. MCC, as a balance indicator, can comprehensively consider the number of TP, TN, FP, and FN, and is considered the gold standard for measuring the performance of binary classification models. Finally, AUC is the area under the ROC curve, which reflects the model’s ability to distinguish between positive and negative samples under all possible thresholds.
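Computed directly from the confusion counts, the metrics above are a few lines of arithmetic:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """ACC, MCC, Sn, Sp, Pre, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    sn = tp / (tp + fn)   # sensitivity (recall)
    sp = tn / (tn + fp)   # specificity
    pre = tp / (tp + fp)  # precision
    f1 = 2 * pre * sn / (pre + sn)
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp, "Pre": pre, "F1": f1}

m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
```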

2.6. Friedman Test

The Friedman test [41] is a nonparametric statistical test for randomized block designs proposed by Milton Friedman. It avoids the reliance on the normality assumption of the traditional analysis of variance (ANOVA). The Friedman test ranks the samples within each block, sums the ranks of each treatment group, and tests for differences between the rank sums. The formula for the Friedman test statistic is as follows:
$$\chi_F^2 = \frac{12}{N k (k+1)} \sum_{j=1}^{k} R_j^2 - 3 N (k+1)$$

where $N$ is the number of blocks (the number of different machine learning metrics), $k$ is the number of treatment groups (the number of different features or models), and $R_j$ is the total rank sum of the $j$th treatment group.
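The statistic is straightforward to compute from the rank sums. For example, with N = 2 blocks that both rank k = 3 treatments identically, the rank sums are (2, 4, 6), and the statistic reaches its maximum of N(k − 1) = 4:

```python
def friedman_statistic(rank_sums, n_blocks):
    """chi^2_F = 12 / (N k (k+1)) * sum_j R_j^2 - 3 N (k+1)."""
    k = len(rank_sums)
    s = sum(r * r for r in rank_sums)
    return 12.0 * s / (n_blocks * k * (k + 1)) - 3.0 * n_blocks * (k + 1)

chi2 = friedman_statistic([2, 4, 6], n_blocks=2)  # perfect agreement
```

Equal rank sums (no difference between treatments) drive the statistic to zero.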

3. Results

3.1. Selection of Baseline Models with Different Features and Fusion Features

We evaluated 36 baseline models that combined 6 features with 6 machine learning algorithms to determine the optimal feature extraction method. As summarized in Table S2, these models underwent a comparative analysis based on their five-fold cross-validation accuracy. Among the features examined, ASDC demonstrated superior performance across multiple algorithms, achieving the highest accuracy scores of 0.8920 (SVM), 0.8891 (RF), 0.8849 (LGBM), and 0.8634 (GNB). Conversely, AAindex and BLOSUM62 exhibited the strongest performance for KNN (0.8398) and LR (0.8713), respectively. The comparative performance of these features on an independent test set is illustrated in Figure 2, with statistical significance analyzed using Friedman’s test. The findings reveal that TAPE_BERT and UniRep exhibited significantly weaker performance compared with the remaining four features (p < 0.0001). In contrast, AAindex, ASDC, BLOSUM62, and CTD displayed no statistically significant differences in their performance outcomes (p > 0.05). Based on this analysis, we proceeded with these four features for subsequent optimization in the feature fusion process.
Combining multiple features enhances information complementarity and improves the robustness of predictive models. In this study, we developed 11 distinct fusion features by integrating AAindex, ASDC, BLOSUM62, and CTD features. As shown in Table S3, SVM achieved the highest scores across five metrics among all six machine learning algorithms, except for specificity (Sp). Figure 2B presents a comparative analysis of the scores for 90 models incorporating these fusion features and the 6 machine learning algorithms. The scores represent the averages across six independently evaluated metrics, with labels indicating the four top-performing feature combinations in SVM. Among these fusion features, the CTD + BLOSUM62 fusion feature demonstrated the most superior performance. Compared with the best-performing single features, this combination achieved significant improvements; the ACC on the independent test increased by 5.08%, the MCC by 10.57%, the Sn by 1.52%, the Sp by 4.09%, the AUC by 4.08%, and the Pre by 6.30%. Therefore, we chose the SVM model paired with the CTD + BLOSUM62 fusion feature for subsequent optimization steps. Meanwhile, these results demonstrate that fusion features significantly enhance the predictive capabilities of the model compared with single-feature approaches.

3.2. Feature Selection Optimization

High-dimensional features often lead to redundancy and overfitting. To address this, we implemented a feature selection strategy to reduce dimensionality while enhancing model performance. Using the built-in function in LGBM, we ranked the features by counting the number of splits in the trees, reflecting their importance. With a step size of 5, we selected the top 5, 10, …, 295, and 300 features iteratively, building an SVM-based model at each interval. The optimization process outlined in Figure 3A reveals that the ACC and MCC initially increased rapidly, peaked, declined, and then stabilized. Both ACC and MCC reached their highest values at a feature count of 80, achieving values of 0.9092 and 0.8253, respectively. As shown in Figure 3B, comparing six independent test metrics before and after selecting the CTD + BLOSUM62 features indicates significant improvement across all criteria. Notably, ACC, MCC, and Sn showed the most substantial enhancement. Consequently, we chose the optimized 80D CTD + BLOSUM62 feature set to construct our final SVM-based model, AOPxSVM, for subsequent analyses.

3.3. Feature Visualization

We applied UMAP for dimensionality reduction to 2D and visualized the SVM decision boundaries based on these projections to validate the effectiveness of the feature engineering optimization. As shown in Figure 4A–C, feature fusion results in better clustering than single features alone. Furthermore, as demonstrated in Figure 4C vs. Figure 4D, the selection process yields improved clustering over unselected features. These findings confirm that the feature engineering optimization successfully eliminated redundant features from fused sets and enhanced the discrimination between antioxidant peptide and non-antioxidant peptide features.

3.4. Comparison with Existing Methods

To evaluate the performance of our AOPxSVM model against existing methods, we conducted comparisons on two independent test sets. First, on the AOPP.test01 dataset, we compared AOPxSVM with AOPP, the current state-of-the-art method. As shown in Table 1, AOPxSVM demonstrates superior performance across multiple metrics: Val_ACC (0.9056), ACC (0.9092), MCC (0.8253), Sn (0.8449), AUC (0.9423), and F1 (0.9030) all outperform those of AOPP. To ensure a fair comparison, we also evaluated the models on the AOPP.test2023 test set, which contains 75 recently identified antioxidant peptides from the literature (2022–2023). Notably, this dataset is completely different from the training set of our model and AOPP.test01. The results presented in Table 1 further highlight the superiority of AOPxSVM across key metrics: ACC is 0.66% higher, MCC is 0.75% higher, Sn is 5.33% higher, and F1 is 1.04% higher compared with AOPP. These findings confirm that AOPxSVM achieves the best performance on both independent test sets, underscoring the model's generalizability and robustness.
To further test the generalization of the model, we collected 138 new antioxidant peptide sequences from the recent literature as an independent test set, completely different from the previous two datasets. To control data redundancy, we used the CD-HIT tool to remove redundant sequences at similarity thresholds of 40%, 60%, and 80%, obtaining datasets with different similarity levels. As shown in Figure S1, AOPxSVM achieved the highest ACC at all three similarity levels, further demonstrating its superiority.

3.5. Web Server Development

To facilitate researchers' use of our model to predict antioxidant peptides, we deployed it to a web server, which can be accessed via http://inova.aibiochem.net/antioxpep/ (accessed on 3 June 2025). Users can obtain the predicted antioxidant activity and confidence of samples by inputting peptide sequences or FASTA files.

4. Discussion

The prediction of antioxidant peptides holds significant importance for the food industry, and many methods are available for antioxidant peptide prediction and screening. However, the existing approaches still face challenges related to accuracy and robustness, largely due to inadequate optimization for diverse feature types. Such comprehensive optimization is critical for enhancing the reliability and interpretability of predictive models. To address these limitations, we developed a novel AOP prediction model, AOPxSVM, utilizing CTD + BLOSUM62 fusion features. This model demonstrates superior performance compared with existing methods on two independent test sets, suggesting its potential to further advance the field of antioxidant peptide prediction.
In this study, we first evaluated the fusion of four different features, among which the fusion of the physicochemical property feature CTD and the sequence evolution feature BLOSUM62 obtained the best results. This finding aligns with results reported by researchers at the China Agricultural University in 2025, who identified correlations between the antioxidant properties of fish gelatin peptides and both global electron donor capacity (a physicochemical property) and local active sites (a sequence evolution feature), as determined through quantum chemical calculations (DFT) and molecular docking [66]. This agreement supports our conclusion that combining physicochemical and sequence evolution features may better capture the characteristics of antioxidant peptides. Second, we used feature selection methods to optimize our fusion features, resulting in a dimension of only 80 D. In contrast, AOPP used the feature ADCA (spliced from AAC, DPC, CKS, and AAindex) with a dimension of more than 800 D. This suggests that our model has stronger interpretability, which is a major challenge in machine learning [67,68]. We further emphasize this with our UMAP visualization.
Despite these advancements, certain limitations remain. First, the development of high-quality, large-scale datasets remains crucial for improving model accuracy and robustness [69,70]. Additionally, due to the constraints inherent in traditional machine learning approaches, we were unable to retain sequence-specific information during training; instead, our framework relied solely on global features extracted from the data. This omission may have impacted the performance of our fused feature set. In future work, we aim to address these limitations by integrating a dual-path model that combines machine learning and deep learning. Such an approach would enable separate processing of global and sequence-specific features, potentially leading to more robust models capable of capturing the full breadth of feature information. This will be a key focus of our subsequent research efforts.

5. Conclusions

In this study, we developed a model called AOPxSVM to predict antioxidant peptides, which are essential compounds with significant potential in the food industry. The process involved selecting feature extraction methods from an initial pool of six approaches. Through rigorous evaluation, we identified four superior methods: AAindex, ASDC, CTD, and BLOSUM62. These selected features were integrated with six machine learning algorithms to optimize predictive performance. Our analysis revealed that the SVM algorithm demonstrated the highest efficacy for this classification task. The optimal fusion features were CTD, representing physicochemical properties, and BLOSUM62, capturing sequence evolution characteristics. To enhance computational efficiency and reduce overfitting risks, we implemented feature selection techniques, reducing the feature count from 693D to a more manageable 80D. Finally, we obtained the final model AOPxSVM. We also employed UMAP visualization to demonstrate the effectiveness of our feature engineering approach. On two different independent test sets, our model AOPxSVM obtained the best results, surpassing those of other models in ACC, MCC, Sn, Pre, and F1, with ACCs reaching 0.9092 and 0.9333, and MCCs reaching 0.8253 and 0.8670, respectively. In conclusion, based on feature engineering optimization and machine learning algorithms, our study proposed an accurate and reliable binary classification of antioxidant peptides. This methodological framework enhances predictive capabilities and provides a robust foundation for future applications in bioinformatics and food science, contributing to the development of innovative strategies for leveraging antioxidant peptides in industrial settings.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/foods14122014/s1, Table S1: Composition of all datasets; Table S2: Comparison of 5-fold cross-validation accuracy of 6 features in 6 machine learning algorithms; Table S3: Independent test indicators of 6 machine learning algorithms (scores are the average of 15 features); Table S4: Comparison of 5-fold cross-validation and independent test indicators based on the SVM model after feature fusion; Table S5: Comparison of independent test results of four fusion features after feature selection; Table S6: Hyperparameter ranges and optimization strategy; Figure S1: Comparison of models’ ACC at different sequence similarity thresholds.

Author Contributions

Conceptualization, Z.L.; data curation, R.L. and Z.L.; funding acquisition, L.J. and Z.L.; investigation, R.L., H.W., Q.Y. and J.C.; methodology, R.L. and Z.L.; project administration, Z.L.; software, R.L.; supervision, Z.L.; visualization, R.L.; writing—original draft preparation, R.L.; writing—review and editing, H.W., X.L. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 62371318 and 32302083), the 2024 Foundation Cultivation Research Basic Research Cultivation Special Funding (Grant No. 20826041H4211), and the Chengdu Science and Technology Bureau (Grant No. 2024-YF08-00022-GX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. The source code is available at https://github.com/yashdui/AOPxSVM.git (accessed on 3 June 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Min, B.; Ahn, D. Mechanism of lipid peroxidation in meat and meat products—A review. Food Sci. Biotechnol. 2005, 14, 152–163.
  2. Lorenzo, J.M.; Gómez, M. Shelf life of fresh foal meat under MAP, overwrap and vacuum packaging conditions. Meat Sci. 2012, 92, 610–618.
  3. Brecht, J.K. Physiology of Lightly Processed Fruits and Vegetables. HortScience 1995, 30, 18–22.
  4. Sabbatino, F.; Conti, V.; Liguori, L.; Polcaro, G.; Corbi, G.; Manzo, V.; Tortora, V.; Carlomagno, C.; Vecchione, C.; Filippelli, A. Molecules and mechanisms to overcome oxidative stress inducing cardiovascular disease in cancer patients. Life 2021, 11, 105.
  5. Rock, C.L.; Jacob, R.A.; Bowen, P.E. Update on the biological characteristics of the antioxidant micronutrients: Vitamin C, vitamin E, and the carotenoids. J. Am. Diet. Assoc. 1996, 96, 693–702.
  6. Lobo, V.; Patil, A.; Phatak, A.; Chandra, N. Free radicals, antioxidants and functional foods: Impact on human health. Pharmacogn. Rev. 2010, 4, 118.
  7. Rao, A.; Bharani, M.; Pallavi, V. Role of antioxidants and free radicals in health and disease. Adv. Pharmacol. Toxicol. 2006, 7, 29–38.
  8. Stefanis, L.; Burke, R.E.; Greene, L.A. Apoptosis in neurodegenerative disorders. Curr. Opin. Neurol. 1997, 10, 299–305.
  9. Wang, R.; Jiang, Y.; Jin, J.; Yin, C.; Yu, H.; Wang, F.; Feng, J.; Su, R.; Nakai, K.; Zou, Q. DeepBIO: An automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res. 2023, 51, 3017–3029.
  10. World Health Organization. Diet, Nutrition and the Prevention of Chronic Diseases: Report of a Joint WHO/FAO Expert Consultation; WHO Technical Report Series 916; World Health Organization: Geneva, Switzerland, 2002; pp. 1–149.
  11. Ito, N.; Fukushima, S.; Tsuda, H. Carcinogenicity and modification of the carcinogenic response by BHA, BHT, and other antioxidants. CRC Crit. Rev. Toxicol. 1985, 15, 109–150.
  12. López-García, G.; Dublan-García, O.; Arizmendi-Cotero, D.; Gómez Oliván, L.M. Antioxidant and antimicrobial peptides derived from food proteins. Molecules 2022, 27, 1343.
  13. Sohaib, M.; Anjum, F.M.; Sahar, A.; Arshad, M.S.; Rahman, U.U.; Imran, A.; Hussain, S. Antioxidant proteins and peptides to enhance the oxidative stability of meat and meat products: A comprehensive review. Int. J. Food Prop. 2017, 20, 2581–2593.
  14. Ahmadi-Vavsari, F.; Farmani, J.; Dehestani, A. Recombinant production of a bioactive peptide from spotless smooth-hound (Mustelus griseus) muscle and characterization of its antioxidant activity. Mol. Biol. Rep. 2019, 46, 2599–2608.
  15. Chen, N.; Yang, H.; Sun, Y.; Niu, J.; Liu, S. Purification and identification of antioxidant peptides from walnut (Juglans regia L.) protein hydrolysates. Peptides 2012, 38, 344–349.
  16. Qin, D.; Bo, W.; Zheng, X.; Hao, Y.; Li, B.; Zheng, J.; Liang, G. DFBP: A comprehensive database of food-derived bioactive peptides for peptidomics research. Bioinformatics 2022, 38, 3275–3280.
  17. Wei, L.; He, W.; Malik, A.; Su, R.; Cui, L.; Manavalan, B. Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework. Brief. Bioinform. 2021, 22, bbaa275.
  18. Olsen, T.H.; Yesiltas, B.; Marin, F.I.; Pertseva, M.; García-Moreno, P.J.; Gregersen, S.; Overgaard, M.T.; Jacobsen, C.; Lund, O.; Hansen, E.B. AnOxPePred: Using deep learning for the prediction of antioxidative properties of peptides. Sci. Rep. 2020, 10, 21471.
  19. Shen, Y.; Liu, C.; Chi, K.; Gao, Q.; Bai, X.; Xu, Y.; Guo, N. Development of a machine learning-based predictor for identifying and discovering antioxidant peptides based on a new strategy. Food Control 2022, 131, 108439.
  20. Xiao, C.; Zhou, Z.; She, J.; Yin, J.; Cui, F.; Zhang, Z. PEL-PVP: Application of plant vacuolar protein discriminator based on PEFT ESM-2 and bilayer LSTM in an unbalanced dataset. Int. J. Biol. Macromol. 2024, 277, 134317.
  21. Qin, D.; Jiao, L.; Wang, R.; Zhao, Y.; Hao, Y.; Liang, G. Prediction of antioxidant peptides using a quantitative structure–activity relationship predictor (AnOxPP) based on bidirectional long short-term memory neural network and interpretable amino acid descriptors. Comput. Biol. Med. 2023, 154, 106591.
  22. Chen, J.; Zou, Q.; Li, J. DeepM6ASeq-EL: Prediction of Human N6-Methyladenosine (m6A) Sites with LSTM and Ensemble Learning. Front. Comput. Sci. 2022, 16, 162302.
  23. Du, Z.; Ding, X.; Xu, Y.; Li, Y. UniDL4BioPep: A universal deep learning architecture for binary classification in peptide bioactivity. Brief. Bioinform. 2023, 24, bbad135.
  24. Li, W.X.; Liu, X.J.; Liu, Y.F.; Zheng, Z.J. High-Accuracy Identification and Structure-Activity Analysis of Antioxidant Peptides via Deep Learning and Quantum Chemistry. J. Chem. Inf. Model. 2025, 65, 603–612.
  25. Li, T.; Ren, X.; Luo, X.; Wang, Z.; Li, Z.; Luo, X.; Shen, J.; Li, Y.; Yuan, D.; Nussinov, R. A foundation model identifies broad-spectrum antimicrobial peptides against drug-resistant bacterial infection. Nat. Commun. 2024, 15, 7538.
  26. Zhou, W.Y.; Liu, Y.F.; Li, Y.X.; Kong, S.Q.; Wang, W.L.; Ding, B.Y.; Han, J.Y.; Mou, C.Z.; Gao, X.; Liu, J.T. TriNet: A tri-fusion neural network for the prediction of anticancer and antimicrobial peptides. Patterns 2023, 4, 100702.
  27. Zhang, J.H.; Zhang, Z.H.; Pu, L.R.; Tang, J.J.; Guo, F. AIEpred: An Ensemble Predictive Model of Classifier Chain to Identify Anti-Inflammatory Peptides. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 18, 1831–1840.
  28. Jiang, Y.; Wang, R.; Feng, J.; Jin, J.; Liang, S.; Li, Z.; Yu, Y.; Ma, A.; Su, R.; Zou, Q. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Adv. Sci. 2023, 10, 2206151.
  29. Li, H.; Liu, B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput. Biol. 2023, 19, e1011214.
  30. Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2007, 36, D202–D205.
  31. Li, Z.-R.; Lin, H.H.; Han, L.; Jiang, L.; Chen, X.; Chen, Y.Z. PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006, 34, W32–W37.
  32. Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S.H. Recognition of a protein fold in the context of the SCOP classification. Proteins Struct. Funct. Bioinform. 1999, 35, 401–407.
  33. Zou, X.; Ren, L.; Cai, P.; Zhang, Y.; Ding, H.; Deng, K.; Yu, X.; Lin, H.; Huang, C. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front. Med. 2023, 10, 1281880.
  34. Zhu, W.; Yuan, S.S.; Li, J.; Huang, C.B.; Lin, H.; Liao, B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics 2023, 13, 2465.
  35. Wei, L.; Tang, J.; Zou, Q. SkipCPP-Pred: An improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genom. 2017, 18, 742.
  36. Wei, L.; Zhou, C.; Chen, H.; Song, J.; Su, R. ACPred-FL: A sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018, 34, 4007–4016.
  37. Eddy, S.R. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 2004, 22, 1035–1036.
  38. Asgari, E.; McHardy, A.C.; Mofrad, M.R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci. Rep. 2019, 9, 3577.
  39. Li, Y.; Wei, X.; Yang, Q.; Xiong, A.; Li, X.; Zou, Q.; Cui, F.; Zhang, Z. msBERT-Promoter: A multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol. 2024, 22, 126.
  40. Joshi, M.; Singh, B.K. Deep Learning Techniques for Brain Lesion Classification Using Various MRI (from 2010 to 2022): Review and Challenges. Medinformatics 2024, 1–21.
  41. El-Gebali, S.; Mistry, J.; Bateman, A.; Eddy, S.R.; Luciani, A.; Potter, S.C.; Qureshi, M.; Richardson, L.J.; Salazar, G.A.; Smart, A. The Pfam protein families database in 2019. Nucleic Acids Res. 2019, 47, D427–D432.
  42. Alley, E.C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; Church, G.M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 2019, 16, 1315–1322.
  43. Suzek, B.E.; Wang, Y.; Huang, H.; McGarvey, P.B.; Wu, C.H.; Consortium, U. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31, 926–932.
  44. Liu, M.; Li, C.; Chen, R.; Cao, D.; Zeng, X. Geometric Deep Learning for Drug Discovery. Expert Syst. Appl. 2024, 240, 122498.
  45. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
  46. Wang, Y.; Zhai, Y.; Ding, Y.; Zou, Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. Sci. China Inf. Sci. 2024, 67, 144–159.
  47. Meher, P.K.; Hati, S.; Sahu, T.K.; Pradhan, U.; Gupta, A.; Rath, S.N. SVM-Root: Identification of Root-Associated Proteins in Plants by Employing the Support Vector Machine with Sequence-Derived Features. Curr. Bioinform. 2024, 19, 91–102.
  48. Li, H.; Pang, Y.; Liu, B. BioSeq-BLM: A platform for analyzing DNA, RNA, and protein sequences based on biological language models. Nucleic Acids Res. 2021, 49, e129.
  49. Ke, G.L.; Meng, Q.; Finley, T.; Wang, T.F.; Chen, W.; Ma, W.D.; Ye, Q.W.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
  50. Jiang, J.; Li, J.; Li, J.; Pei, H.; Li, M.; Zou, Q.; Lv, Z. A machine learning method to identify umami peptide sequences by using multiplicative LSTM embedded features. Foods 2023, 12, 1498.
  51. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  52. Ru, X.; Li, L.; Zou, Q. Incorporating Distance-Based Top-n-gram and Random Forest to Identify Electron Transport Proteins. J. Proteome Res. 2019, 18, 2931–2939.
  53. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  54. Boopathi, V.; Subramaniyam, S.; Malik, A.; Lee, G.; Manavalan, B.; Yang, D.-C. mACPpred: A support vector machine-based meta-predictor for identification of anticancer peptides. Int. J. Mol. Sci. 2019, 20, 1964.
  55. Zulfiqar, H.; Guo, Z.; Ahmad, R.M.; Ahmed, Z.; Cai, P.; Chen, X.; Zhang, Y.; Lin, H.; Shi, Z. Deep-STP: A deep learning-based approach to predict snake toxin proteins by using word embeddings. Front. Med. 2023, 10, 1291352.
  56. Lv, Z.; Cui, F.; Zou, Q.; Zhang, L.; Xu, L. Anticancer peptides prediction with deep representation learning features. Brief. Bioinform. 2021, 22, bbab008.
  57. Kumar, M.; Rath, N.K.; Swain, A.; Rath, S.K. Feature selection and classification of microarray data using MapReduce based ANOVA and K-nearest neighbor. Procedia Comput. Sci. 2015, 54, 301–310.
  58. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
  59. Zhu, H.; Hao, H.; Yu, L. Identification of microbe–disease signed associations via multi-scale variational graph autoencoder based on signed message propagation. BMC Biol. 2024, 22, 172.
  60. Huang, Z.; Guo, X.; Qin, J.; Gao, L.; Ju, F.; Zhao, C.; Yu, L. Accurate RNA velocity estimation based on multibatch network reveals complex lineage in batch scRNA-seq data. BMC Biol. 2024, 22, 290.
  61. Guo, X.; Huang, Z.; Ju, F.; Zhao, C.; Yu, L. Highly Accurate Estimation of Cell Type Abundance in Bulk Tissues Based on Single-Cell Reference and Domain Adaptive Matching. Adv. Sci. 2024, 11, 2306329.
  62. Zhang, H.Q.; Arif, M.; Thafar, M.A.; Albaradei, S.; Cai, P.; Zhang, Y.; Tang, H.; Lin, H. PMPred-AE: A computational model for the detection and interpretation of pathological myopia based on artificial intelligence. Front. Med. 2025, 12, 1529335.
  63. Liu, X.; Ai, C.; Yang, H.; Dong, R.; Tang, J.; Zheng, S.; Guo, F. RetroCaptioner: Beyond attention in end-to-end retrosynthesis transformer via contrastively captioned learnable graph representation. Bioinformatics 2024, 40, btae561.
  64. Ai, C.; Yang, H.; Liu, X.; Dong, R.; Ding, Y.; Guo, F. MTMol-GPT: De novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS Comput. Biol. 2024, 20, e1012229.
  65. Yan, K.; Lv, H.; Shao, J.; Chen, S.; Liu, B. TPpred-SC: Multi-functional therapeutic peptide prediction based on multi-label supervised contrastive learning. Sci. China Inf. Sci. 2024, 67, 212105.
  66. Mubango, E.; Fu, Z.; Dou, P.; Tan, Y.; Luo, Y.; Chen, L.; Wu, K.; Hong, H. Dual function antioxidant and anti-inflammatory fish maw peptides: Isolation and structure-activity analysis via tandem molecular docking and quantum chemical calculation. Food Chem. 2025, 465, 141970.
  67. Cao, C.; Ding, B.; Li, Q.; Kwok, D.; Wu, J.; Long, Q. Power analysis of transcriptome-wide association study: Implications for practical protocol choice. PLoS Genet. 2021, 17, e1009405.
  68. Cao, C.; He, J.; Mak, L.; Perera, D.; Kwok, D.; Wang, J.; Li, M.; Mourier, T.; Gavriliuc, S.; Greenberg, M. Reconstruction of microbial haplotypes by integration of statistical and physical linkage in scaffolding. Mol. Biol. Evol. 2021, 38, 2660–2672.
  69. Cao, C.; Wang, J.; Kwok, D.; Cui, F.; Zhang, Z.; Zhao, D.; Li, M.J.; Zou, Q. webTWAS: A resource for disease candidate susceptibility genes identified by transcriptome-wide association study. Nucleic Acids Res. 2022, 50, D1123–D1130.
  70. Zhou, Z.; Xiao, C.; Yin, J.; She, J.; Duan, H.; Liu, C.; Fu, X.; Cui, F.; Qi, Q.; Zhang, Z. PSAC-6mA: 6mA site identifier using self-attention capsule network based on sequence-positioning. Comput. Biol. Med. 2024, 171, 108129.
Figure 1. Technical flowchart. (A) Construction of the benchmark dataset. (B) Feature extraction using AAindex, ASDC, BLOSUM62, and CTD, which generated 566D AAindex features, 400D ASDC features, 420D BLOSUM62 features, and 273D CTD features, respectively. The 4 features were combined to generate 11 fusion features. (C) Generation of 90 different models by combining 15 different features (including single features) and 6 machine learning algorithms. (D) Feature selection for the best fusion feature, with the simplified feature inputted into the SVM to generate the final model. (E) Development of a web server based on the final optimized model.
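The 15 feature sets of Figure 1C follow directly from combining the 4 encodings: 11 multi-encoding fusions plus the 4 single encodings. A minimal sketch of that enumeration (the list names are illustrative):

```python
from itertools import combinations

encodings = ["AAindex", "ASDC", "BLOSUM62", "CTD"]

# All multi-encoding fusions: C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11
fusions = [combo for r in (2, 3, 4) for combo in combinations(encodings, r)]
assert len(fusions) == 11

# Together with the 4 single encodings: the 15 feature sets fed to the
# 6 machine learning algorithms, giving the 90 models of Figure 1C.
feature_sets = [(e,) for e in encodings] + fusions
assert len(feature_sets) == 15
```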
Figure 2. Comparison of machine learning metrics for different feature encoding methods. (A) Comparison of machine learning metrics for 6 different single features (the score aggregates 6 independent test metrics: ACC, MCC, Sn, Sp, AUC, and Pre. The boxplot uses Friedman's statistical test; asterisks denote statistical significance, where **** represents p < 0.0001 and ns represents no statistical significance). (B) Comparison of independent test metrics for the 15 features across the 6 machine learning algorithms (the 4 highest-scoring features in the SVM are labeled).
Figure 3. (A) Changes in ACC and MCC during feature selection of CTD + BLOSUM62. (Best model Params{‘C’: 2.782559402207126, ‘gamma’: 0.005994842503189409, ‘kernel’: ‘rbf’}). (B) Comparison of independent test indicators of the model before and after feature selection.
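The incremental feature selection tracked in Figure 3A requires ranking the 693 fused dimensions before adding them one at a time. The article's references include ANOVA-based selection, so the sketch below shows a pure-Python one-way ANOVA F-statistic that such a ranking could use; the toy matrix and the ranking criterion are illustrative assumptions, not the paper's exact pipeline.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across classes.

    Higher F = the feature separates AOP / non-AOP class means better
    relative to within-class scatter.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups.values())
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy 4-sample, 2-feature matrix (hypothetical values for illustration)
X = [[0.9, 0.1], [0.8, 0.5], [0.2, 0.4], [0.1, 0.9]]
y = [1, 1, 0, 0]
scores = [anova_f([row[j] for row in X], y) for j in range(2)]
ranking = sorted(range(2), key=lambda j: -scores[j])  # most discriminative first
```

Incremental selection would then retrain the SVM on the top-1, top-2, … feature subsets and keep the dimensionality where ACC/MCC peak, as plotted in Figure 3A.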
Figure 4. Visualization of four features using UMAP dimensionality reduction (parameters: {'metric': 'wminkowski', 'n_neighbors': 10, 'min_dist': 0.2, 'target_weight': 0.2}). (A) CTD feature; (B) BLOSUM62 feature; (C) CTD + BLOSUM62_693D feature; (D) CTD + BLOSUM62_125D feature.
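The UMAP runs in Figure 4 use metric 'wminkowski', the weighted Minkowski distance. For reference, here is a minimal sketch of that distance as commonly defined (e.g., in SciPy-style libraries); the choice of p = 2 as a default is an assumption, since the figure does not report p:

```python
def wminkowski(u, v, w, p=2):
    """Weighted Minkowski distance: (sum_i |w_i * (u_i - v_i)|**p) ** (1/p).

    w scales each feature dimension before the usual Minkowski norm,
    letting UMAP weight some feature dimensions more than others.
    """
    return sum(abs(wi * (ui - vi)) ** p
               for ui, vi, wi in zip(u, v, w)) ** (1 / p)

# With unit weights and p = 2 this reduces to Euclidean distance:
assert abs(wminkowski([0, 0], [3, 4], [1, 1]) - 5.0) < 1e-9
```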
Table 1. Comparison of indicators between AOPxSVM and existing models.
Test set AOPP.test01:

| Model | Val_ACC | ACC | MCC | Sn | Sp | AUC | Pre | F1 |
|---|---|---|---|---|---|---|---|---|
| AOPP | 0.8969 | 0.9043 | 0.8181 | 0.8284 | 0.9802 | 0.9043 | 0.9767 | 0.8965 |
| AnOxPP | — | — | — | — | — | — | — | — |
| AnOxPePred a | — | — | — | — | — | — | — | — |
| UniDL4BioPep | — | — | — | — | — | — | — | — |
| SBSM-Pro | — | 0.7888 | 0.5786 | 0.7591 | 0.8185 | — | 0.8070 | 0.7823 |
| AOPxSVM * | 0.9056 | 0.9092 | 0.8253 | 0.8449 | 0.9736 | 0.9423 | 0.9697 | 0.9030 |

Test set AOPP.test2023:

| Model | ACC | MCC | Sn | Sp | Pre | F1 |
|---|---|---|---|---|---|---|
| AOPP | 0.9267 | 0.8595 | 0.8667 | 0.9867 | 0.9848 | 0.9220 |
| AnOxPP | 0.8800 | 0.7610 | 0.9060 | 0.8530 | 0.8610 | 0.8829 |
| AnOxPePred a | 0.7530 | 0.4330 | 0.8100 | 0.6270 | 0.8260 | 0.8179 |
| UniDL4BioPep | 0.5800 | 0.1633 | 0.6800 | 0.4800 | 0.5667 | 0.6182 |
| SBSM-Pro | 0.7333 | 0.4668 | 0.7200 | 0.7467 | 0.7397 | 0.7297 |
| AOPxSVM * | 0.9333 | 0.8670 | 0.9200 | 0.9467 | 0.9452 | 0.9324 |
Note: * indicates the model developed in this study; a indicates results obtained by testing the model on its own dataset. The best value in each column is underlined.
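The metrics in Table 1 all derive from a binary confusion matrix, sketched below in pure Python. Plugging in a confusion matrix consistent with the reported Sn = 0.9200 and Sp = 0.9467 (which imply 75 positives and 75 negatives in AOPP.test2023 — an inference, not a stated fact) reproduces the AOPxSVM row.

```python
import math

def metrics(tp, fp, tn, fn):
    """ACC, MCC, Sn, Sp, Pre, and F1 from a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)           # sensitivity / recall
    sp = tn / (tn + fp)           # specificity
    pre = tp / (tp + fp)          # precision
    f1 = 2 * pre * sn / (pre + sn)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return dict(ACC=acc, MCC=mcc, Sn=sn, Sp=sp, Pre=pre, F1=f1)

# tp=69, fn=6 gives Sn = 0.9200; tn=71, fp=4 gives Sp = 0.9467,
# matching AOPxSVM on AOPP.test2023 (ACC 0.9333, MCC 0.8670).
m = metrics(tp=69, fp=4, tn=71, fn=6)
```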
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, R.; Wang, H.; Yu, Q.; Cai, J.; Jiang, L.; Luo, X.; Zou, Q.; Lv, Z. AOPxSVM: A Support Vector Machine for Identifying Antioxidant Peptides Using a Block Substitution Matrix and Amino Acid Composition, Transformation, and Distribution Embeddings. Foods 2025, 14, 2014. https://doi.org/10.3390/foods14122014

