Article

Knowledge Discovery from Bioactive Peptide Data in the PepLab Database Through Quantitative Analysis and Machine Learning

by Margarita Terziyska 1,*, Zhelyazko Terziyski 2, Iliana Ilieva 1,*, Stefan Bozhkov 1 and Veselin Vladev 1
1 Faculty of Economics, University of Food Technologies, 4000 Plovdiv, Bulgaria
2 Department of Computer Science and Mathematics, Trakia University, 6000 Stara Zagora, Bulgaria
* Authors to whom correspondence should be addressed.
Sci 2025, 7(3), 122; https://doi.org/10.3390/sci7030122
Submission received: 29 June 2025 / Revised: 21 August 2025 / Accepted: 1 September 2025 / Published: 2 September 2025

Abstract

Bioactive peptides have significant potential for applications in pharmaceuticals, the food industry, and cosmetics due to their wide spectrum of biological activities. However, their pronounced structural and functional heterogeneity complicates the classification and prediction of biological activity. This study uses data from the PepLab platform, comprising 2748 experimentally confirmed bioactive peptides distributed across 15 functional classes, including ACE inhibitors, antimicrobial, anticancer, antioxidant, toxins, and others. For each peptide, the amino acid sequence and key physicochemical descriptors are provided, calculated via the integrated DMPep module, such as GRAVY index, aliphatic index, isoelectric point, molecular weight, Boman index, and sequence length. The dataset exhibits class imbalance, with class sizes ranging from 14 to 524 peptides. An innovative methodology is proposed, combining descriptive statistical analysis, structural modeling via DEMATEL, and structural equation modeling with neural networks (SEM-NN), where SEM-NN is used to capture complex nonlinear causal relationships between descriptors and functional classes. The results of these dependencies are integrated into a multi-class machine learning model to improve interpretability and predictive performance. Targeted data augmentation was applied to mitigate class imbalance. The developed classifier achieved predictive accuracy of up to 66%, a relatively high value given the complexity of the problem and the limited dataset size. These results confirm that integrating structured dependency modeling with artificial intelligence is an effective approach for functional peptide classification and supports the rational design of novel bioactive molecules.

1. Introduction

Bioactive peptides (BAPs) are defined as short amino acid sequences with proven specific physiological activities, including antioxidant, antibacterial, antihypertensive, antidiabetic, and immunostimulatory properties [1,2,3,4]. They are gaining increasing importance in the pharmaceutical, food, biotechnology, and cosmetics sectors, which necessitates effective identification and classification [5,6]. Traditional approaches, including QSAR modeling, are primarily based on linear methods, which struggle to capture the complex, nonlinear relationships between the structure and function of peptides.
These limitations have stimulated the application of modern technologies such as machine learning and deep neural networks, which have demonstrated higher accuracy and efficiency [1,7,8]. Despite the progress made, significant challenges remain: lack of explainability of models, class imbalance in the available data, and a limited number of well-annotated databases [5,9]. In the biomedical and bioinformatics context, traceability and interpretability of the relationships between descriptors are critically important [10,11,12].
The present study proposes an integrated approach combining causal analysis using DEMATEL, classical structural equation modeling (SEM), and its nonlinear alternative SEM-NN. DEMATEL is used to extract causal relationships between descriptors that define the topology of the SEM model. SEM-NN upgrades classical SEM by integrating a neural network to capture complex nonlinear relationships while preserving the causal structure. This combines the advantages of deep models with the interpretability of causal modeling.
The synergy between DEMATEL and SEM-NN allows the biological significance of physicochemical descriptors to be identified and increases the adaptability of the model. Such hybrid approaches have been successfully applied in various disciplines, from the analysis of the interaction between personality traits and environmental behavior [13] to modeling the relationships between cognitive processes and functional deficits [14]. Their applicability in the field of bioactive peptides opens up prospects for more accurate classification and rational design. Despite existing limitations, such as storage stability, biological activity after passage through the digestive tract, and efficient delivery [3,8,15], the potential of bioactive peptides remains significant [16,17].
The main objective of the study is to develop an interpretable analytical model for multi-class classification of bioactive peptides, integrating quantitative statistical analysis, DEMATEL, and SEM-NN. The approach aims to provide both high classification accuracy and insights into the internal causal structure of the descriptors, with potential for application in the rational design and functional prediction of peptides. To achieve this objective, the following research questions were formulated:
Q1. What are the dependencies between the physicochemical descriptors of bioactive peptides, and how do they differ across functional classes?
Q2. How can DEMATEL and SEM-NN be adapted for the analysis of biological sequences and what new insights do they offer compared to standard methods?
Q3. Can an accurate and explainable multiclass classifier based on causal structures be built?
Q4. What are the limitations and potential applications of this approach in rational peptide design?

2. Materials and Methods

2.1. Dataset Description and Preprocessing

The present study uses data from the PepLab platform, which contains information on bioactive peptides, including their amino acid sequences and physicochemical properties. A detailed description of the database structure and the criteria for peptide inclusion is presented in [6]. All peptide sequences in the PepLab database were extracted from publications indexed in Web of Science and Scopus, or from reliable public peptide databases. When PepLab was populated, each record was manually checked for consistency with the original source, and duplicate records, sequences with ambiguous amino acid codes, and peptides without experimental validation were excluded. Each peptide is associated with only one biological activity, according to the classification in the original scientific publication from which the sequence was extracted, which ensures that there are no repeated sequences in the set. This structuring minimizes the risk of introducing unwanted distortions into the analysis and increases the reliability of subsequent modeling.
Amino acid sequence descriptors were calculated using DMPep, a specialized tool integrated into the PepLab platform and described in detail in [18]. The current dataset comprises 2748 unique peptides distributed across fifteen functional classes (Table 1), including ACE inhibitors, antimicrobial, anticancer, and antioxidant peptides, toxins, and others. The dataset is characterized by pronounced structural and functional heterogeneity, as well as a significant imbalance in class representation. This poses challenges for statistical and machine learning methods, so strategies are required to overcome the imbalance and provide interpretable, high-quality predictive results.
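To make two of the descriptors concrete, the following minimal Python sketch computes the GRAVY index (mean Kyte-Doolittle hydropathy) and the aliphatic index (Ikai's formula based on mole percentages of A, V, I, and L) for a single sequence. It is an illustration of the standard published definitions, not the DMPep implementation; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence: str) -> float:
    """Ikai's aliphatic index from the mole percentages of A, V, I, and L."""
    n = len(sequence)

    def mole_percent(aa: str) -> float:
        return 100.0 * sequence.count(aa) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# Example sequence: the NAP octapeptide mentioned in Section 3.1.3.
peptide = "NAPVSIPQ"
print(len(peptide), round(gravy(peptide), 4), round(aliphatic_index(peptide), 2))
```

A negative GRAVY value indicates a net hydrophilic sequence, which is the convention used throughout the descriptive analysis.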

2.2. Methodology of Descriptive Statistical Analysis

In order to preliminarily characterize the peptide descriptors, descriptive statistical analysis was performed by grouping observations into frequency intervals. Distributions of key descriptors, including molecular weight, isoelectric point, aliphatic index, hydropathy index (GRAVY), and Boman index, are presented as percentages for each group.
The data were grouped in strict accordance with well-known statistical methods. The number of classes was calculated using Sturges’ formula (1) [19]. This formula was applied to determine the number of classes for grouping peptide data because it provides a simple yet statistically justified approach derived from probability theory. It establishes a logarithmic relationship between the sample size (n) and the number of classes (k), which ensures that the grouping resolution scales appropriately with the volume of data. This prevents both over-grouping (resulting in many sparsely populated classes) and under-grouping (leading to excessive data aggregation and loss of detail). The method is particularly suitable for small to medium datasets, such as peptide sets with specific biological activities, where more complex estimation methods may not yield substantially better class boundaries but could introduce unnecessary computational complexity. In addition, the widespread use of Sturges’ formula in statistical and biochemical data analysis enhances comparability with other studies, thereby improving the interpretability and reproducibility of results.
For each activity-specific peptide subset, k was calculated as
$$k = 1 + 3.322 \log_{10}(n) \tag{1}$$
where
k is the number of groups (classes);
n is the number of peptide sequences with the corresponding activity.
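As a hedged illustration of how formula (1) translates into a frequency table, the sketch below computes k for the smallest and largest class sizes reported in the abstract (14 and 524 peptides) and groups a list of values into k intervals. Rounding k to the nearest integer and using equal-width intervals are assumptions made for this example; the molecular weights are invented placeholder values, not PepLab data.

```python
import math


def sturges_classes(n: int) -> float:
    """Number of classes from Sturges' formula, k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def frequency_table(values, k):
    """Group values into k equal-width intervals (assumption) and return
    (lower, upper, percent) rows, with the maximum placed in the last interval."""
    low, high = min(values), max(values)
    width = (high - low) / k or 1.0  # avoid zero width when all values are equal
    counts = [0] * k
    for v in values:
        index = min(int((v - low) / width), k - 1)
        counts[index] += 1
    return [
        (low + i * width, low + (i + 1) * width, 100.0 * c / len(values))
        for i, c in enumerate(counts)
    ]


# Class sizes taken from the abstract: smallest and largest functional class.
for n in (14, 524):
    print(n, round(sturges_classes(n), 3), round(sturges_classes(n)))

# Placeholder molecular weights (Da) for a class of 14 peptides.
weights = [468.5, 512.3, 590.1, 640.8, 701.2, 733.9, 802.4,
           845.0, 910.6, 1002.3, 1150.7, 1320.2, 1675.9, 2088.7]
for lower, upper, percent in frequency_table(weights, round(sturges_classes(len(weights)))):
    print(f"{lower:8.1f} to {upper:8.1f} Da: {percent:5.1f}%")
```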
The results are presented in tabular form (Appendix A, Tables A1–A15) using standard methods of frequency analysis for quantitative variables. The objective is to identify structural features of the peptide set that serve as a basis for the subsequent DEMATEL analysis.

2.3. Structural Modeling of Descriptor Interdependencies Using DEMATEL

In the present study, an extended DEMATEL (Decision Making Trial and Evaluation Laboratory) approach is utilized, as originally proposed by Gabus and Fontela within the Geneva Program of the Battelle Memorial Institute in 1973 [20]. This method is used to structurally model directional dependencies and influence patterns among the physicochemical descriptors of peptides. While DEMATEL is frequently referred to as a tool for causal analysis in decision science, it is more accurately described as a framework for mapping and visualizing directed interdependencies within complex systems characterized by reciprocal influences and feedback loops. It has been widely applied in fields such as management science, logistics, sustainable development, and strategic decision-making to capture influence structures, rather than proving causality in the strict statistical sense.
In contrast to conventional correlation analysis, which merely indicates the existence of a relationship between two variables and the magnitude of that relationship, DEMATEL facilitates the identification of the direction and nature of the influence between them. In summary, DEMATEL can be utilized to determine whether one peptide characteristic exerts an influence on another, and whether that influence is positive or negative. Moreover, the method demonstrates the propagation of these influences throughout the entire system of descriptors, not merely between individual pairs, thereby providing a more comprehensive representation of the interactions.
The initial direct influence matrix D was constructed from normalized quantitative dependencies extracted from the PepLab database. Specifically, pairwise Pearson correlation coefficients between peptide descriptors served as the empirical basis and were subsequently normalized into the [0, 1] interval to ensure compatibility with the DEMATEL framework. This data-driven construction allows the matrix to reflect observed descriptor interdependencies rather than subjective expert judgments. Negative values in the matrix, indicative of inhibitory or inversely correlated influences between descriptors, were anticipated, which is permissible within the signed data-driven DEMATEL framework. The resulting total matrix T (2), integrating direct and indirect influences, was computed by the classical transformation:
$$T = D\,(I - D)^{-1} \tag{2}$$
where
D is the matrix of direct influences, obtained by normalizing initial values to describe the degree to which one descriptor influences another;
I is the identity matrix, which has the same dimensions as D and acts as the neutral element for matrix multiplication;
(I − D)^{-1} is the inverse matrix that accumulates all indirect influences through an unlimited number of iterative interactions between descriptors;
T is the total influence matrix reflecting the total (direct + indirect) influence between any pair of elements of the system.
The transformation is applied under the condition that the spectral radius ρ(D), defined as the largest absolute value of the eigenvalues of D, is less than 1, as established in the original DEMATEL framework [20]. This condition guarantees that the unbounded propagation of indirect effects, D + D² + D³ + …, converges to T, and therefore that the model is stable. The four primary DEMATEL indicators are derived from the T matrix: outward influence (D, the row sums of T), inward influence (R, the column sums of T), centrality (D + R), and causality (D − R). In these indicators, D denotes the vector of row sums and should not be confused with the direct influence matrix. The indicators were utilized for the visual and quantitative analysis of the influence network, and they served as the analytical foundation for the subsequent development of the SEM-NN model.
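The computation described above can be sketched in a few lines of NumPy. The matrix values below are invented for illustration and are nonnegative for simplicity; scaling by the largest row or column sum is the classical DEMATEL normalization and is an assumption here, since the exact normalization used in the study is not reproduced. The names `outward` and `inward` correspond to the D and R indicators.

```python
import numpy as np

# Hypothetical raw influence scores among four descriptors (illustrative only).
names = ["pI", "alipha", "gravy", "MW"]
raw = np.array([
    [0.0, 0.2, 0.6, 0.3],
    [0.1, 0.0, 0.7, 0.4],
    [0.2, 0.3, 0.0, 0.2],
    [0.1, 0.2, 0.3, 0.0],
])

# Classical normalization (assumption): divide by the largest row or column sum.
scale = max(raw.sum(axis=1).max(), raw.sum(axis=0).max())
D = raw / scale

# Convergence check: the spectral radius of D must be below 1.
spectral_radius = np.abs(np.linalg.eigvals(D)).max()
assert spectral_radius < 1, "indirect effects would not converge"

# Total influence matrix, T = D (I - D)^(-1).
identity = np.eye(D.shape[0])
T = D @ np.linalg.inv(identity - D)

outward = T.sum(axis=1)          # influence given (row sums)
inward = T.sum(axis=0)           # influence received (column sums)
centrality = outward + inward    # overall involvement in the system
causality = outward - inward     # positive: net cause, negative: net effect

for name, c, r in zip(names, centrality, causality):
    print(f"{name:7s} centrality={c:.3f} causality={r:+.3f}")
```

With these placeholder values, descriptors with positive causality act as net sources of influence, which is the property used to choose exogenous variables in the SEM topology.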
To the best of our knowledge, this application of DEMATEL analysis represents the first attempt (in the available literature) to apply causal matrix decomposition to the analysis of peptide descriptors. It is anticipated that this approach will facilitate a more intricate, exact, and systematic comprehension of the interrelationships among the elements under consideration (in this instance, the peptide descriptors) in comparison to conventional methodologies such as correlation analysis, PCA, or clustering. Additionally, it is expected to establish a foundation for interpretable machine learning in subsequent phases of the analysis.

2.4. SEM and SEM-NN Architecture

In the context of this study, a hybrid approach integrating classical structural equation modeling (SEM) with a neural network (SEM-NN) was implemented to capture the intricate, potentially nonlinear nature of the interrelationships between the physicochemical descriptors of bioactive peptides. SEM is a well-established statistical framework designed to model hypothesized directional dependencies between latent and observed variables. In this framework, the measurement model captures the relationships between the observed descriptors and the latent constructs, while the structural model represents the directional influence patterns among these variables. It is important to note that, in this study, SEM does not establish formal causality but rather serves to formalize the network of interdependencies suggested by the DEMATEL analysis, providing a structured hypothesis about how descriptors interact within the dataset.
In order to surmount the limitations of SEM with regard to linearity and compensatory effects, the model structure was further developed by integrating a multilayer neural component, thereby forming a hybrid SEM-NN architecture. Specifically, the SEM topology serves as the basis for determining the connections between neurons. In this model, latent exogenous variables correspond to input neurons, and endogenous variables correspond to output neurons. The employment of hidden layers, which utilize nonlinear activation functions (typically Sigmoid), enables the modeling of intricate, nonlinear dependencies between descriptors. The initial connection weights were set based on path coefficients and factor loadings from the SEM analysis. These weights were then optimized through a process of backpropagation, the aim of which was to enhance prediction accuracy.
The model was initialized with a configuration that reflected the DEMATEL-derived influence structure, in which isoelectric point (pI) and aliphatic index (alipha) were modeled as exogenous latent variables, while GRAVY index (gravy) and molecular weight (MW) were modeled as endogenous. This configuration is consistent with the logic of causal pathways, with pI and aliphatic index functioning as primary inputs that influence GRAVY and MW. The chosen architecture allows both direct and indirect effects to be modeled, and it captures nonlinear transformations that are not possible for SEM in its classical form.
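The idea of seeding a small network with SEM estimates and then refining it by backpropagation can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the authors' SEM-NN: the data are synthetic, the path coefficients in `paths` are hypothetical, and the way they are copied into the first-layer weights is one possible initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic standardized inputs standing in for pI and aliphatic index.
X = rng.normal(size=(200, 2))
# Synthetic targets standing in for gravy and MW (arbitrary nonlinear rule).
Y = np.column_stack([
    np.tanh(X[:, 0]) + 0.5 * X[:, 1],
    0.3 * X[:, 0] * X[:, 1] + 0.2 * X[:, 1],
])

# Hypothetical SEM path coefficients (rows: pI, alipha; columns: gravy, MW).
paths = np.array([[0.6, 0.2],
                  [0.4, 0.3]])

n_hidden = 4
# First-layer weights start from the path coefficients plus small noise.
W1 = np.tile(paths, (1, n_hidden // 2)) + 0.01 * rng.normal(size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, 2))
b2 = np.zeros(2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(inputs):
    hidden = sigmoid(inputs @ W1 + b1)   # nonlinear hidden layer
    return hidden, hidden @ W2 + b2      # linear output layer


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(forward(X)[1], Y)

learning_rate = 0.1
for _ in range(500):
    hidden, pred = forward(X)
    grad_out = 2.0 * (pred - Y) / pred.size                 # dLoss/dPred
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)  # through sigmoid
    W2 -= learning_rate * hidden.T @ grad_out
    b2 -= learning_rate * grad_out.sum(axis=0)
    W1 -= learning_rate * X.T @ grad_hidden
    b1 -= learning_rate * grad_hidden.sum(axis=0)

final_loss = mse(forward(X)[1], Y)
print(f"MSE before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this toy setting both exogenous inputs feed both endogenous outputs, so no connections are masked; a topology with missing paths would additionally require zeroing the corresponding weights.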
The performance of the SEM-NN model was evaluated using prediction accuracy on an independent test dataset. The resulting values demonstrated a reduced prediction error and augmented explained variance for the SEM-NN model in comparison to the linear SEM. Thus, it can be concluded that the SEM-NN model is more adaptable and can capture the authentic nonlinear structure of the relationships between the descriptors. This outcome substantiates the efficacy of the hybrid approach and its applicability to interpretably model complex biochemical dependencies.

2.5. Domain-Specific Data Augmentation

A thorough analysis of the number of peptide sequences in each functional class in the PepLab database reveals a pronounced imbalance in class representation. The Antihypertensive group is clearly dominant, while the Antithrombotic, Antiviral, and Antifungal groups are represented by a significantly smaller number of observations. In the context of constructing a multiclass classifier, this imbalance gives rise to training bias. The model tends to “learn” the more heavily represented classes more effectively, which degrades accuracy, sensitivity, and specificity for the less frequent classes. This undermines model objectivity and generalizability, particularly in real-world applications where each class possesses potential biomedical relevance.
In order to minimize these risks, domain-specific augmentation was applied to artificially enlarge only the underrepresented classes. Any functional group with fewer than 300 observations was designated as “underrepresented”. The objective was to raise each such class to at least 80% of the size of the largest class (Antihypertensive), thereby improving the balance without compromising the realistic proportion between classes. To illustrate this process, consider a hypothetical scenario in which the Antihypertensive class contains 620 peptides, while the Antiviral class comprises 120 peptides. The target size is then 0.8 × 620 = 496, so approximately 370 novel sequences are generated for the Antiviral class, bringing its total to roughly 490.
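The target arithmetic can be written as a short helper. This is a sketch of the stated rule (threshold of 300, target of 80% of the largest class) applied to the hypothetical class sizes from the example; the function name is illustrative.

```python
def synthetic_counts(class_sizes, threshold=300, target_ratio=0.8):
    """Return how many synthetic peptides each underrepresented class needs
    to reach target_ratio times the size of the largest class."""
    target = round(target_ratio * max(class_sizes.values()))
    return {
        name: max(target - size, 0)
        for name, size in class_sizes.items()
        if size < threshold
    }


# Hypothetical class sizes from the worked example in the text.
sizes = {"Antihypertensive": 620, "Antiviral": 120}
print(synthetic_counts(sizes))
```

Exact arithmetic gives a target of 496 and therefore 376 synthetic Antiviral sequences, in line with the approximate figures quoted in the example.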
The generation of synthetic peptides is predicated on two complementary principles: (1) frequency of amino acids (AAC) per class and (2) frequency ranges of physicochemical descriptors (MW, GRAVY, pI, Aliphatic index, Boman index) determined by descriptive statistical analysis. For each under-represented class, a probabilistic amino acid matrix is constructed, from which new sequences of conserved length and compositional profile are extracted. Each synthetic peptide is validated by the computation of its base descriptors and subsequent filtration to fall within the 80% confidence interval of the corresponding descriptor for the given class.
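A minimal sketch of this sample-and-filter idea is given below, using only the Python standard library. Several simplifications are assumptions of the example: residues are drawn independently from the class-level amino acid frequencies, lengths are drawn from the observed lengths, only GRAVY is used as a filter, and the central 80% interval is approximated by the 10th to 90th percentile range. The input sequences are toy placeholders, not PepLab records.

```python
import random
import statistics

# Kyte-Doolittle hydropathy scale (standard published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def gravy(seq):
    return sum(KD[aa] for aa in seq) / len(seq)


def gravy_interval(sequences):
    """Central 80% range of GRAVY (10th to 90th percentile, an assumption)."""
    deciles = statistics.quantiles(
        [gravy(s) for s in sequences], n=10, method="inclusive"
    )
    return deciles[0], deciles[-1]


def generate(sequences, n_new, seed=0, max_tries=100_000):
    """Sample new sequences from the class amino acid composition and keep
    those whose GRAVY falls inside the class interval."""
    rng = random.Random(seed)
    counts = {}
    for seq in sequences:
        for aa in seq:
            counts[aa] = counts.get(aa, 0) + 1
    residues = list(counts)
    weights = [counts[aa] for aa in residues]
    lengths = [len(seq) for seq in sequences]
    low, high = gravy_interval(sequences)

    accepted, tries = [], 0
    while len(accepted) < n_new and tries < max_tries:
        tries += 1
        candidate = "".join(
            rng.choices(residues, weights=weights, k=rng.choice(lengths))
        )
        if low <= gravy(candidate) <= high and candidate not in sequences:
            accepted.append(candidate)
    return accepted


# Toy sequences standing in for one underrepresented class.
toy_class = ["VPP", "IPP", "LKPNM", "FQP", "YPER", "LRP", "AVPYPQR", "VY"]
new_peptides = generate(toy_class, n_new=5)
print(new_peptides)
```

In a fuller version the same filter would be repeated for MW, pI, aliphatic index, and Boman index before a candidate is accepted.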
Biological plausibility is ensured by a three-step procedure. First, peptides bearing unnatural motifs or sequences that deviate from the structural patterns specified in UniProt and literature-curated databases are eliminated. Next, the secondary structure is predicted using PEP-FOLD, and collapsed or unstable conformations are rejected. Finally, the descriptor profile is evaluated against statistical intervals derived from real cases.
It is essential to emphasize that the synthetic peptides are exclusively utilized for the training of the classification model. They are not involved in DEMATEL analysis or SEM/SEM-NN modeling to preserve causal and structural relationships. This strikes a balance between analytical rigor and classification efficiency, with augmentation only addressing imbalance and not affecting explainability of the models.

2.6. Multiclass Classification Model

In order to achieve the objective of multiclass classification of bioactive peptides by functional class, a range of machine learning and deep learning models was examined, including Random Forest, XGBoost, LightGBM, Multi-Layer Perceptron (MLP), and TabNet. The first four models are firmly established in scientific and applied practice and exhibit a high degree of robustness in scenarios where data are limited and classes are imbalanced. Due to their popularity, these algorithms are not described in detail in this study and instead serve as reference algorithms for comparison.
A more in-depth focus was placed on TabNet, a relatively new architecture proposed by Sercan Arik and Tomas Pfister in 2019 in the publication “TabNet: Attentive Interpretable Tabular Learning” [21]. TabNet is a deep neural network that has been specifically designed to process tabular data. It combines multi-layer neural transformations with a sequential attention mechanism that selects input features.
In contrast to classical Multi-Layer Perceptron architectures, where all input features are processed uniformly at each hidden level, TabNet performs hierarchical and dynamic feature selection at each processing step. This is achieved through attentive transformer blocks, which form learnable masks that constrain the features utilized at each decision step. Consequently, the network acquires localized feature weights, thereby facilitating a more precise explanation of the predictions. Its architecture is end-to-end differentiable and consists of a sequence of decision steps, each containing a self-contained transformation and attention substructure. These are built from fully connected layers, batch normalization, and ReLU activations, so the model retains a classical deep character.
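In the TabNet paper, the attentive transformer normalizes its feature scores with sparsemax, which, unlike softmax, can assign exactly zero weight to some features. The standalone NumPy sketch below shows that normalization on invented scores for five descriptors; it is not taken from pytorch-tabnet, and it omits the prior-scale term that TabNet uses to discourage reusing features across steps.

```python
import numpy as np


def sparsemax(scores):
    """Project a score vector onto the probability simplex (sparsemax).
    The result sums to 1 and may contain exact zeros."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # descending order
    cumulative = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1 + k * z_sorted > cumulative  # which top-k entries stay nonzero
    k_max = k[in_support][-1]
    tau = (cumulative[k_max - 1] - 1) / k_max   # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)


# Invented attention scores for one decision step (illustrative only).
descriptors = ["MW", "pI", "alipha", "gravy", "boman"]
scores = [1.2, 1.0, -1.0, 0.5, -0.3]
mask = sparsemax(scores)

for name, weight in zip(descriptors, mask):
    print(f"{name:7s} {weight:.2f}")
```

Here only the two highest-scoring descriptors receive nonzero weight, which is what makes the per-step masks directly readable as feature attributions.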
TabNet’s key benefits include the following:
  • Interpretability is facilitated through the visualization of feature attribution masks;
  • The robustness of the system is enhanced through the implementation of flexible loss weighting;
  • The efficiency of the system is optimized on small training sets through the utilization of sparsity and adaptive attention mechanisms;
  • The system can automatically select relevant descriptors, thus negating the necessity for prior reduction or normalization.
All models were trained with input descriptors structured after SEM-NN analysis and enriched by domain-specific augmentation, with the objective of reducing the influence of class imbalance. The classification performance was assessed using stratified five-fold cross-validation. The evaluation metrics included accuracy and macro F1-score, which are widely accepted indicators of the overall predictive capability and balance between precision and recall across multiple classes.
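The reason for reporting macro F1 alongside accuracy can be seen in a small pure-Python example. The labels below are invented: a classifier that always predicts the majority class reaches 80% accuracy while its macro F1 is much lower, because the minority class contributes an F1 of zero. The helper names are illustrative; in scikit-learn the same quantity is available as `f1_score(..., average="macro")`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Invented labels: 8 majority-class peptides, 2 minority-class peptides.
y_true = ["Antihypertensive"] * 8 + ["Antiviral"] * 2
y_pred = ["Antihypertensive"] * 10   # always predicts the majority class

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```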
The selection of a final model was based on a multifaceted evaluation that prioritized maximizing accuracy, while also ensuring robustness to an unbalanced distribution and interpretability of the solution. In this context, TabNet demonstrated comparable or superior classification performance to XGBoost and MLP. Additionally, TabNet has the unique ability to track the significance of physicochemical descriptors for each functional peptide category, both globally and at the level of individual predictions.

2.7. Computational Resources and Methods

All computational experiments were conducted using Google Colaboratory (Python 3.10) with the following core libraries: pandas, numpy, scikit-learn, xgboost, lightgbm, and pytorch-tabnet. For preliminary testing and data inspection, a local machine was used, equipped with an Intel® Core™ i7-4600M CPU @ 2.90 GHz, 8 GB RAM, Windows 10 Pro 64-bit operating system.

3. Results and Discussion

An extensive study was performed, utilizing frequency statistical analysis to establish the relationship between the physicochemical properties of peptides and the type of biological activity. The results obtained are presented in Appendix A (Tables A1–A15).
The amino acid composition (AAC) for each activity is presented in Figure 1. The profiles exhibit motifs that are typical of the given activity and that may be related to its structural and functional features. For instance, the predominance of proline (P) in ACE inhibitory and antiamnestic peptides (up to 30% in the latter) suggests a flexible structure and a role in the modulation of enzymatic responses or in crossing the blood–brain barrier. In addition, the high content of lysine (K), alanine (A), and leucine (L) in the Anticancer and Antimicrobial classes is indicative of their potential to interact with cell membranes.

3.1. Descriptive Statistical Analysis

3.1.1. ACE Inhibitory Peptides

As demonstrated in Table A1, antihypertensive peptides are distinguished by their low molecular weight (468.51–2088.7 Da, with almost 97% falling below 2088.7 Da) and their limited length (average of nine amino acids). The majority of these peptides (58.13%) exhibit hydrophilic properties, as indicated by a negative GRAVY index. Proline (P, 16%) is the most prevalent amino acid (Figure 1a), while the presence of leucine (L) and valine (V) contributes to the aliphatic index. The Boman index (–0.36 to +1.01), in combination with short length (≤12 residues), the presence of proline and aromatic amino acids at the C-terminal, moderate hydrophobicity, and thermal stability, corresponds to the structural models of natural ACE-inhibitory peptides described in the literature [22,23,24,25]. Peptides with C-terminal Pro-Pro motifs are of particular importance due to their proteolytic stability and effective absorption, a trend that was confirmed in the present study.

3.1.2. Anti-Inflammatory Peptides

According to Table A2, anti-inflammatory peptides are characterized by an enrichment of alanine (A) and leucine (L) (10% each), followed by proline (P, 9.1%) and arginine (R, 8.2%), as illustrated in Figure 1b. This composition is indicative of a predominantly hydrophilic character, while the elevated aliphatic index in approximately 86% of the sequences suggests structural stability in biological environments. The majority of peptides (72%) exhibit a low molecular weight (581.71–1025.41 Da) and a short length (average 7.86 residues). This is consistent with data demonstrating that peptides with <10 amino acids and <1 kDa exhibit high activity and effective absorption [26]. The bimodal distribution of pI (acidic range 3.08–4.76 at 35.7% versus basic 8.13–9.82 at 35.7%) indicates functional diversity and dependence on the physiological context. The Boman index differentiates between two subtypes: approximately 50% of the peptides exhibited low values (–1.51 to 1.95), associated with conventional anti-inflammatory activity. The remaining sequences demonstrated elevated values (up to 7.15), suggesting the presence of multifunctional potential. This assertion is corroborated by the findings of [27], which demonstrate that short aliphatic peptides (VH, AL, LAN, IA) derived from deer proteins through enzymatic hydrolysis inhibit the production of nitric oxide (NO) in LPS-induced macrophages. This outcome is in alignment with the frequency profile observed in the present study.

3.1.3. Antiamnestic Peptides

The amino acid profile of anti-amnesic peptides (see Table A3 and Figure 1c) reveals an enrichment of hydrophobic residues, such as proline (P), leucine (L) and valine (V), as well as aromatic amino acids including tyrosine (Y) and phenylalanine (F). This combination gives the molecules an amphiphilic nature and enables them to interact with neuronal membranes and receptor domains. Most peptides are relatively short, with an average of 8.1 amino acids, and have a molecular weight between 550 and 1200 Da. This makes them conducive to rapid diffusion and passage through physiological barriers. The wide variation in the isoelectric point (pI) (4.0–10.2) suggests functional activity in different neuronal environments. It is particularly notable that approximately 60% of the peptides have a positive GRAVY index, supporting their ability to interact with lipid membranes. The Boman index (–0.8 to 3.5) suggests that higher values may indicate the capacity to interact with enzymes and proteins involved in cognitive regulation. Such mechanisms have been observed in well-studied neuropeptides such as NAP (NAPVSIPQ, 836 Da), which has been shown to affect memory and provide neuroprotection in β-amyloid pathology [28], and Colostrinin® [29], a proline-rich peptide complex (<1.5 kDa) that modulates synaptic plasticity and BDNF expression [30]. Recent clinical data show a statistically significant increase in BDNF levels and improvement in cognitive performance in healthy adults after taking Colostrinin®.

3.1.4. Antibacterial Peptides

Antibacterial peptides (Table A4) are characterized by a negative GRAVY index in a significant proportion of sequences, which suggests a hydrophilic nature that facilitates solubility and interaction with bacterial membranes [31]. The frequency of aliphatic amino acids (I, L, V, A) is relatively low (Figure 1d), but the aliphatic index exceeds 25 in ~90% of the peptides, indicating structural and thermal stability, which are key properties for applications in the food industry and biotechnology [32]. Approximately 26% of the peptides show a Boman index above 2.67, indicating an ability to bind nonspecifically to multiple proteins. This tendency toward multifunctionality is lower than in anti-inflammatory peptides but higher than in anti-amnesic peptides. Structurally, over 58% have a molecular weight >2.4 kDa and an average length of 25 residues, which favors the formation of stable α-helices and β-sheets that are effective when incorporated into lipid membranes [33]. The profile is similar to the data in APD3 [31], and the authors of [34] confirm that peptides with MW >2.5 kDa exhibit stronger activity against Gram-positive and Gram-negative bacteria.

3.1.5. Anticancer Peptides

As shown in Table A5, anticarcinogenic peptides are characterized by a high content of lysine (K, 14.6%), leucine (L, 12%), alanine (A, 9.9%), and glycine (G, 8.9%) (Figure 1e). This corresponds to the observations in [35], which emphasize the key role of cationic amino acids (K) and hydrophobic L and A in stabilizing amphiphilic structures capable of interacting with tumor membranes and inducing apoptosis through their destruction. Over 40% of the peptides have a negative GRAVY index, suggesting moderate hydrophilicity combined with membrane activity. In more than 92% of cases the aliphatic index exceeds 25, indicating high structural stability. Approximately 90% have a Boman index below 2.7, a characteristic associated with increased selectivity for tumor cells and low nonspecific toxicity [36]. The average length is 19.76 amino acids, with >75% having a molecular weight above 1.62 kDa. This is consistent with [37], where effective anticarcinogenic peptides typically contain 12–30 amino acids, allowing the formation of α-helices, which are key to penetrating and destroying tumor membranes.

3.1.6. Antidiabetic Peptides

Figure 1f shows that antidiabetic peptides are rich in leucine (10.4%), alanine (8.6%), glycine (7.3%), and valine (7.0%), which imparts a moderately hydrophilic to amphiphilic character and facilitates interaction with enzymes and receptors involved in the regulation of glucose metabolism [38]. According to Table A6, about 60% have a negative GRAVY index, and over 85% show an aliphatic index greater than 50, indicating favorable solubility and structural stability [39]. More than 80% have a Boman index below 2.5, suggesting selective biological activity and a low propensity for nonspecific interactions, typically associated with high efficacy and low toxicity [40]. Almost all peptides (98%) have a molecular weight of 1028.09–3345.95 Da and an average length of 19.67 amino acids, in line with data for functionally active antidiabetic peptides <3–4 kDa [41]. Smaller antidiabetic peptides, typically below 3–5 kDa, demonstrate increased stability and absorption, as well as enhanced inhibitory potency, consistent with the structure–activity relationships reported in the literature [42]. Approximately 97% have an isoelectric point in the pH range 2.73–8.21, favoring electrostatic interactions with enzyme and receptor domains.

3.1.7. Antifungal Peptides

Table A7 shows that 93% of antifungal peptides have a molecular weight below 1900 Da and an average length of about 14 amino acids, parameters characteristic of short bioactive peptides with high selectivity against fungal pathogens [43,44]. A significant proportion have a negative GRAVY index, suggesting a hydrophilic nature and potential for interaction with the cell surface of fungi [45,46]. The amino acid composition is dominated by K, Y, L, and S (Figure 1g), characteristic of natural antimycotic peptides, which contribute to structural stability and functional selectivity [44]. An aliphatic index above 25 is found in 88% of the peptides, which is a sign of thermal and structural stability and facilitates interaction with cell membranes [47]. Only 14% have a Boman index >2.04, which limits non-specific binding and underscores a selective biological function [40,43]. The biophysical profiles obtained are consistent with the described mechanisms of action of AFPs, which, as a subgroup of AMPs, combine cationic charge, amphiphilicity, and α- or β-structures with specificity to fungal cell wall and membrane components (ergosterol, β-glucans) [43,44,45].

3.1.8. Antimicrobial Peptides

As shown in Table A8, approximately 48% of AMPs have a negative GRAVY index, indicating a hydrophilic and amphiphilic profile that favors interaction with microbial membranes and involvement in processes such as pore formation and cell membrane destabilization [31,48,49]. Around 20% have a Boman index greater than 2.38, which is associated with additional biological activities, including immunomodulation and anti-inflammatory effects [48,50]. The aliphatic index is elevated in 78.8% of peptides (49.29–221.83), reflecting structural and thermal stability [31]. Half of the peptides have a molecular weight ranging from 1839 to 3611 Da, with an average length of 26.9 amino acids, which corresponds to the characteristic values for natural AMPs [51,52]. Around 70% have an isoelectric point of pH 7.7–11.7, confirming their cationic nature and electrostatic attraction to the anionic components of microbial cells [52]. The amino acid composition (Figure 1h), dominated by G (10.3%), L (9.8%) and K (8.9%), matches profiles in databases such as APD3 [31]. These are associated with well-described mechanisms of action, including toroidal pore formation, the carpet mechanism, and transmembrane integration [49].

3.1.9. Antioxidative Peptides

A substantial proportion of antioxidant peptides (66%) exhibit a negative GRAVY index (Table A9), thereby confirming their hydrophilic nature, which is pivotal to neutralizing water-soluble free radicals [53]. The amino acid composition is dominated by P (10%), E (7.7%), and L (8.4%), which are characteristic of high antioxidant activity (Figure 1i). In 68.3% of the peptides, a high aliphatic index (54–243) was reported, associated with increased thermal stability and industrial applicability [54].
The Boman index is below 2.48 in 75% of the peptides, suggesting selective action and a low toxicity profile [55]. Approximately 73% of these peptides have a molecular weight in the range of 389–1234 Da, with an average length of 8.56 amino acids, values that correspond to the optimum reported for high antioxidant activity [56,57]. Together with a compact structure and a suitable amino acid profile, these characteristics are consistent with the established mechanisms of action of antioxidant peptides isolated from both plant and animal sources [58].

3.1.10. Antithrombotic Peptides

Over 94% of antithrombotic peptides have a negative GRAVY index (Table A10), reflecting their hydrophilic profile, which is due to the dominance of G (13.8%), P (10.1%), A (8%), R (7.8%), and D (7.8%) (Figure 1j). The presence of small and polar residues is associated with specific binding to thrombin and other key enzymes in the coagulation process. The average length is 11.37 amino acids, consistent with the literature data for short peptides with clearly defined binding motifs. About 60% show an aliphatic index below 44, suggesting lower thermal stability, which can be improved by chemical modifications such as cyclization or the inclusion of non-natural amino acids [59]. Approximately 52% have a Boman index above 2.34, an indicator of increased affinity for proteins, including thrombin [60,61]. The data confirm that anticoagulant potential is determined by hydrophilicity, amino acid composition, and structural parameters, in line with previous studies [59,60,61].

3.1.11. Antiviral Peptides

About 65% of AVPs have a low molecular weight in the range of 623–1633 Da and an average length of 12.12 amino acids (Table A11). The composition is dominated by small residues such as S (11.1%), L (10.5%), T (8.0%), and Y (7.4%) (Figure 1k). Approximately 71% show a negative GRAVY index, suggesting a hydrophilic nature and interaction with viral envelopes and cell receptors. About one-third have an aliphatic index <39, associated with flexibility and thermal stability, while only 15% show a Boman index >2.48, characteristic of more selective antiviral activity. The data are consistent with the literature, according to which AVPs are short peptides with specific physicochemical profiles that inhibit processes such as penetration, fusion, and transcription [62,63]. Databases such as DRAVP [64] and review articles [65] confirm that parameters such as charge, length, and hydrophilicity are the result of an optimized selection process. In addition, cyclic AVPs show increased stability and bioactivity at small sizes [65].

3.1.12. DPP-IV Inhibitor Peptides

Analysis of DPP-IV inhibitory peptides shows that approximately half of the peptides have a positive GRAVY index and the rest have a negative GRAVY index, which indicates a balanced distribution between hydrophilicity and hydrophobicity (Table A12). The amino acid profile is dominated by P (15.8%), G (11.3%), and L (8.9%) (Figure 1l), corresponding to known characteristics of effective DPP-IV inhibitors [66,67]. In 84% of cases, the aliphatic index is in the range of 43–260, suggesting structural stability and resistance to biotransformation [66]. The Boman index is below 2 in 86% of peptides, indicating a selective biological effect, consistent with data that active inhibitors are typically short, contain proline, and have characteristic PXP or XPP motifs [68,69]. In recent years, in silico approaches such as StackDPPIV, based on amino acid indices and structural descriptors, have proven to be effective, and their application in PepLab represents a promising opportunity [68,70].

3.1.13. Neuropeptides

The analysis in PepLab confirms the characteristic physicochemical properties of neuropeptides, corresponding to the data in the literature [71,72]. The amino acid profile is dominated by G (11.1%), L (8.8%) and R (8.3%) (Figure 1m), providing structural flexibility and a positive charge for interaction with receptors. About 75% have a negative GRAVY index (Table A13), an indicator of high solubility and stability in a neural environment. The aliphatic index (44–153) is within normal limits in 69% of the peptides, suggesting moderate structural stability [73]. In 67.6%, the Boman index is below 2.5, in line with the requirements for selectivity and specificity in receptor interaction [71]. The molecular weight of 77.46% is between 573 and 2765 Da, with an average length of 15.85 amino acids, in accordance with active neuropeptides that diffuse rapidly and modulate G-protein-coupled receptors [74]. The significance of these descriptors for neuronal signaling networks is confirmed by modern methods such as cryo-EM, genomics, and functional sensors [75].

3.1.14. Opioid Peptides

Opioid peptides are a subtype of neuropeptides that interact with opioid receptors and regulate processes such as pain, mood, and appetite [76]. Data from PepLab (Table A14) show that they have short sequences (average 9.09 amino acids), with 78.75% having a molecular weight of 550–1382 Da—parameters characteristic of endogenous opioids with strong motifs such as Tyr-Gly-Gly-Phe (YGGF) [77]. The amino acid profile is dominated by G (16.6%), F (13.4%), Y (10.0%), and P (9.9%) (Figure 1n), in accordance with [78], according to which aromatic and cyclic residues improve receptor binding and proteolysis resistance. About 58.75% have a negative GRAVY index, and half have an aliphatic index <22.29, indicating low thermal stability. In 80%, the Boman index is <2, indicating limited multifunctionality. These characteristics correspond to early observations on casein exorphins [79]. The functionality of opioid peptides is mainly determined by three factors: short length, presence of aromatic amino acids, and low thermal stability. Contemporary research emphasizes the importance of integrated structural-functional analyses for the development of new bioactive peptides, including those from food sources and through bioinformatics [77,78].

3.1.15. Toxin Peptides

The toxic peptides contained within the PepLab database are characterized by a clear predominance of cysteine (C—18.0%), glycine (G—9.4%) and lysine (K—8.8%) (Figure 1o). This amino acid distribution is characteristic of natural toxins, particularly those of animal origin—such as scorpions, spiders and marine organisms—in which cysteine-rich motifs stabilize the tertiary structure through disulfide bridges [80].
The data presented in Table A15 demonstrate that 66.8% of the toxic peptides have a molecular weight of 2373–4582 Da (average length 28.84 amino acids), reflecting their structural complexity and the presence of functional domains [81]. With regard to their hydropathy profile, 79.2% have a negative GRAVY index, indicating pronounced hydrophilicity, which facilitates their solubility and activity in biological fluids [82].
The aliphatic index of 76% of peptides is below 53, suggesting low thermal stability, which is compensated by the stabilizing role of the tertiary structure and disulfide bond motifs [83]. In over 65% of these cases, the Boman index is below 2.48, indicating specific rather than multifunctional binding, in line with the concept that toxins target specific molecular receptors or ion channels [84].
In summary, the PepLab profile of toxic peptides confirms their main characteristics described in the literature, namely high length and molecular weight, rich content of cysteine and hydrophilic amino acids, low aliphatic index, and moderate Boman index. This finding corroborates the reliability of the analytical approach employed and indicates that future research could focus on toxicity, stability, and selectivity.
The descriptive statistical analysis supports several conclusions about the structural and physicochemical characteristics of the bioactive peptides in the functional classes examined. Firstly, the majority of classes, including the antihypertensive, anti-inflammatory, antiamnestic, and antimicrobial classes, tend toward a compact molecular structure (a weight of less than 2.5 kDa), which is typically associated with a short amino acid sequence (fewer than 12–20 residues). Such morphology facilitates bioavailability and transport across physiological barriers. Secondly, the GRAVY index of the majority of peptides is negative or moderately positive, i.e., they have a hydrophilic to slightly hydrophobic profile, which favors solubility in physiological media and interaction with cellular receptors and membrane structures. Thirdly, the aliphatic index, which functions as an indicator of thermal stability, is elevated in the vast majority of functional groups (with over 85% of peptides exhibiting values over 50), supporting their applicability in contexts that require structural stability, such as the food and pharmaceutical industries. Finally, the Boman index for the majority of classes remains below 2.5, indicating selectivity in biological interaction and a low propensity for nonspecific binding.
The established patterns are confirmed by comparisons made with results published in the scientific literature and are consistent with structural models of bioactive peptides already described. The observed dependencies between amino acid composition and the main descriptors form typical profiles by functional classes, giving reason to assume that the considered descriptors are not independent but form a subordinate structure of interactions. This, in turn, establishes a conceptual and empirical foundation for inferring causal dependencies between the descriptors and for applying DEMATEL analysis in order to formalize them in the context of causal-interpretable modeling.
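As a minimal sketch of how two of the descriptors summarized above can be derived from a sequence, the following Python snippet computes the GRAVY index as the mean Kyte-Doolittle hydropathy value and the aliphatic index with Ikai's formula. The function names are illustrative assumptions, and the sketch is not the DMPep implementation, whose exact conventions are not described in this section.

```python
# Kyte-Doolittle hydropathy scale (standard values for the 20 amino acids).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence: str) -> float:
    """Ikai's aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue in the sequence."""
    seq = sequence.upper()
    n = len(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


if __name__ == "__main__":
    # YGGF is the opioid motif mentioned in Section 3.1.14.
    for peptide in ("YGGF", "AVIL"):
        print(peptide, round(gravy(peptide), 3), round(aliphatic_index(peptide), 1))
```

A sequence without A, V, I, or L residues, such as the YGGF motif, has an aliphatic index of zero under this formula, which matches the low aliphatic indices reported for short opioid peptides.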

3.2. DEMATEL Analysis

A DEMATEL analysis was performed to formalize the causal structure between the factors, based on the results of the descriptive statistics and the established relationships between the descriptors. Unlike the classical DEMATEL methodology, where the direct influence matrix is typically constructed from subjective expert judgments, the present study adopted a fully quantitative approach. The direct influence matrix was derived from the Pearson correlation coefficients calculated for the full set of peptides, thereby ensuring that both the strength and direction of the empirical relationships between descriptors were preserved. This methodological choice was motivated by the need to capture the real-valued numerical dependencies among physicochemical parameters, without smoothing or discarding negative correlations that often carry essential biophysical meaning. The correlation matrix was thus employed directly as the DEMATEL input, after normalization, serving as a proxy for direct influences. This approach allows the transformation of raw correlations into a structured causal network, where positive values denote stimulating influences and negative values denote inhibitory effects, consistent with the DEMATEL framework.
The total influence matrix between the physicochemical descriptors, presented in Table 2, was calculated with this modified DEMATEL procedure. The underlying direct influence matrix was constructed by correlation analysis on the full set of peptides in the initial database (n = total number of peptides across the 15 functional classes).
Specifically, for each of the six selected descriptors—sequence length (len), molecular weight (MW), isoelectric point (pI), GRAVY index (gravy), aliphatic index (alipha) and Boman index (boman)—a correlation matrix based on the Pearson coefficient was calculated, including positive and negative correlations between pairs of descriptors. The purpose of this step is to capture the actual strength and direction of the relationships between the descriptors without smoothing or eliminating opposing trends.
The resulting Pearson matrix, comprising values ranging from −1 to 1, was employed directly as a direct influence matrix. Within the framework of DEMATEL, this matrix signifies the extent of direct influence exerted by each descriptor on the others. Subsequently, the total influence matrix was calculated according to the standard transformation (1). Each matrix value indicates the extent to which a descriptor (row) influences another (column), accounting for both direct and indirect relationships. For instance, T_{pI,MW} = −4.96 indicates that the isoelectric point exerts a substantial negative influence on the molecular weight. For the peptides contained within this database, a higher pI is generally associated with a lower molecular weight. Furthermore, a negative influence value between the GRAVY index and pI could be interpreted as an indication that peptides with a more hydrophilic profile tend to possess a lower isoelectric point, a biophysical regularity that would be difficult to identify by expert estimation or by using absolute values alone.
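The procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, the diagonal of the correlation matrix is set to zero as in classical DEMATEL, the matrix is normalized by its largest absolute row sum, and the total influence matrix is taken as T = N(I − N)^{-1}, the usual DEMATEL transformation. The normalization actually used in the study is not specified in this section.

```python
import numpy as np


def dematel_total_influence(direct):
    """Return (N, T) for a direct influence matrix that may contain negatives.

    Assumptions of this sketch: scalar normalization by the largest absolute
    row sum, and the standard DEMATEL transformation T = N (I - N)^-1.
    """
    A = np.asarray(direct, dtype=float)
    scale = np.abs(A).sum(axis=1).max()
    N = A / scale
    identity = np.eye(A.shape[0])
    T = N @ np.linalg.inv(identity - N)
    return N, T


# Synthetic stand-in for six descriptor columns (NOT the PepLab data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = 0.9 * X[:, 0] + 0.3 * rng.normal(size=200)  # two correlated columns

corr = np.corrcoef(X, rowvar=False)   # Pearson correlations, values in [-1, 1]
np.fill_diagonal(corr, 0.0)           # assumption: no self-influence
N, T = dematel_total_influence(corr)

print(np.round(T, 3))
```

Under these particular assumptions a symmetric input yields a symmetric T, so any asymmetry between row and column influences would have to come from a normalization step that this section does not detail and that the sketch does not attempt to reproduce.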
The pI descriptor exhibited the most pronounced generalized influence within the system, exerting a substantial impact on the aliphatic index, molecular weight, and Boman index. The resulting dependency network structure reflects the intrinsic causal nature of the interactions between physicochemical descriptors and can serve as a basis for subsequent structural modeling, variable reduction, and functional interpretation of peptide activity.
Table 3 contains the centrality and causality values for each physicochemical descriptor, calculated from the total influence matrix. The centrality (D + R) of a factor in a dependency system is reflected by the total involvement of that factor, while the causality (D − R) indicates whether the factor is more of a cause (positive value) or an effect (negative value).
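A short sketch of how centrality and causality follow from a total influence matrix is given below. The matrix values and the descriptor labels d1 to d3 are hypothetical and chosen only for illustration; they are not the values of Table 2 or Table 3. D is taken as the row sum (influence given) and R as the column sum (influence received).

```python
import numpy as np

# Hypothetical total influence matrix (row influences column).
names = ["d1", "d2", "d3"]
T = np.array([
    [0.2, 0.5, 0.1],
    [0.1, 0.1, 0.4],
    [0.3, 0.0, 0.2],
])

D = T.sum(axis=1)        # total influence exerted by each descriptor
R = T.sum(axis=0)        # total influence received by each descriptor
centrality = D + R       # overall involvement in the system
causality = D - R        # > 0: net cause, < 0: net effect


def role(value, tol=1e-9):
    """Label a descriptor by the sign of its causality (D - R)."""
    if value > tol:
        return "cause"
    if value < -tol:
        return "effect"
    return "balanced"


roles = {name: role(c) for name, c in zip(names, causality)}
for name, cen, cau in zip(names, centrality, causality):
    print(f"{name}: D+R = {cen:.2f}, D-R = {cau:+.2f}, role = {roles[name]}")
```

A descriptor with high D + R and D − R close to zero corresponds to what the text calls a mediator: strongly involved in the network, but neither a net source nor a net sink of influence.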
An analysis of the results presented in Table 3 indicates that physicochemical descriptors may assume roles that can be categorized as either causes, mediators, or outcomes. The distribution is derived from two sources. Firstly, it is based on the biophysical nature of each variable. Secondly, it is based on the empirical values for causality (D − R) and centrality (D + R) derived from the causality matrix.
The variable len has been identified as the primary causal agent within the system. This assertion is substantiated by two factors: the positive causality (D − R) and the biochemical nature of the phenomenon. The length of a peptide is a structural feature defined in peptide synthesis that determines molecular weight. This, in turn, has the potential to influence the isoelectric point and, consequently, the other properties of the peptide. Len is a completely exogenous parameter; no other descriptor can alter length, thus positioning len as a source of influence.
Molecular weight MW is directly derived from len and amino acid composition, but it also acts as an intermediate factor relative to pI. The DEMATEL model demonstrates a positive causal relationship for MW (D − R > 0), thereby validating its function as a secondary driver. MW is implicated in determining electrochemical properties, including charge and solubility, which in turn influence biological activity.
The isoelectric point pI is characterized by a high degree of centrality (D + R) and almost neutral causality (D − R ≈ 0). This suggests that pI functions as an integrative mediator, a variable that participates in balancing influences in the network without being a dominant causative agent or receptor. From a biochemical perspective, the pI of a compound is determined by its amino acid composition and is influenced by its MW and len. However, it has been demonstrated that the pI can also affect alipha, and consequently the IC50, by moderating the charge and surface behavior of the peptide.
The aliphatic index, Alipha, demonstrates a classical mediator profile, exhibiting a balanced distribution between causal and receptive influence. The aliphatic index is associated with the relative abundance of aliphatic amino acids, which is contingent on the primary sequence (len) and exerts an influence on GRAVY and Boman. From a biophysical perspective, alipha is responsible for the regulation of thermal stability and lipophilicity in peptides.
DEMATEL unequivocally classifies the hydrophobicity index GRAVY as a receptor, with negative causality (D − R < 0). This indicates that it is shaped mainly by other factors rather than driving them. This is biologically consistent because GRAVY reflects the summed hydrophobicity of the amino acid sequence and depends strongly on alipha and secondarily on pI and MW. GRAVY exerts a negligible inverse influence on other variables.
The Boman index is a descriptor of the peptide’s potential to interact with proteins. The DEMATEL analysis reveals a profile analogous to that of GRAVY, characterized by predominant receptivity and the absence of significant causality. From a biochemical perspective, Boman is contingent on GRAVY and alipha, which collectively signify the aggregate capacity for hydrogen bond formation and nonspecific interactions.
The biological activity IC50 is defined as the final output in the model and is a fully receptive factor. Its value is determined by the complex interactions of all other descriptors, through both direct and indirect pathways.
This interpretation underscores the potential of the DEMATEL methodology to reveal hidden patterns and causal interactions between descriptors. The results obtained can be used to prioritize features in the construction of predictive models, to optimize the synthetic design of bioactive peptides, and to better understand the biophysical mechanisms underlying their functionality.
Figure 2 provides a graphical representation of the directional influences between the primary physicochemical descriptors, based on the total influence matrix calculated with the DEMATEL methodology, with negative dependencies permitted. The nodes represent the individual descriptors, and the arrows illustrate the direction and strength of influence between them. The thickness of each arrow is proportional to its influence intensity, and the numerical weights plotted on the arrows allow a quantitative assessment of the strength and direction of these interactions. These elements complement and refine the conceptual model that serves as the basis for the construction of the structural equation model (SEM) and its neural extension (SEM-NN).
The numerical values plotted on the arrows in the figure for the causal network reflect the direct influence between pairs of descriptors computed by the DEMATEL algorithm. These metrics are not to be confused with traditional correlations or regression coefficients. Rather, they represent aggregate metrics of the direction and strength of influence within the system. The computations are derived through the process of normalization of the mutual influence matrix, based on the premise of standardized linear relationships. These computations thus represent a dimensionless measure of the intensity of the impact. Positive values are indicative of a stimulating influence, while negative values are indicative of an inhibiting influence on the corresponding descriptor.
Two predominant causal chains are particularly noteworthy. The first one commences with the peptide chain length (len), which exerts a substantial positive influence on the molecular weight (influence 5.61) and consequently on the isoelectric point (influence 4.96). This finding serves to substantiate the hypothesis that length constitutes a pivotal structural parameter that determines the weight and, consequently, the charge of the molecule. pI exerts a moderate influence on the other variables, yet its values are not distinctly positive or negative, but rather balanced around a neutral position. This finding aligns with the calculated D − R coefficient for pI, which approaches zero, signifying a factor with elevated centrality yet a mediating function rather than a direct causal or outcome role.
The second causal chain commences with the GRAVY index, a descriptor of hydrophobicity. The study found a distinct negative influence on the Boman index (−0.866), reflecting the physicochemical nature of the interaction between hydrophobicity and binding ability. Subsequently, Boman exerts a substantial influence on the aliphatic index (alipha) (−0.862), which in turn exerts a considerable influence on pI (influence 6.64) and MW (influence −4.96). This observation underscores the multifaceted role of alipha as a factor with multidirectional effects on the structural and charged characteristics of the molecule.
Certain pathways within the network exhibit a weak influence, or one that runs counter to biochemical logic, for example pI → len (−1.59) and MW → len (−0.69). This can be attributed to the mediating and outcome-oriented roles of pI and MW, which respond to upstream factors rather than functioning as primary causal agents.
A heatmap diagram reflecting the Total Influence Matrix generated by the DEMATEL analysis is presented in Figure 3. The color scale ranges from intense red (strong positive influence) to intense blue (strong negative influence), thus enabling expeditious visual identification of the most significant directional dependencies among the descriptors.
The most pronounced effect was observed for pI, which shows the highest value along the main diagonal (24.10). This value is interpreted as the overall impact of pI on the system, incorporating both direct and indirect effects accumulated through all interaction pathways. pI exerts a distinct negative influence on the MW and len descriptors (−4.96 and −5.61, respectively), while demonstrating a positive influence on the aliphatic index (6.64) and Boman index (3.27). These values are consistent with the direction and strength of the links in the causal network (Figure 2) and confirm the central role of pI as a mediator node with extremely high centrality (D + R) but balanced causality (D − R ≈ 0).
Furthermore, the application of a heatmap analysis revealed the existence of two distinct causal clusters. The first of these includes the linear features, length (len) and molecular weight (MW), where strong mutual influences and modulation by pI are observed. The second cluster is compositional and comprises gravy, boman index, and alipha, with a distinct sequence of causal influence: gravy → boman → aliphatic index.
In contrast to classical correlation analyses, the utilization of DEMATEL facilitates the identification of not only the relationships between the descriptors, but also their direction, a crucial aspect in the formulation of a valid causal hypothesis. Correlation provides symmetric information about joint variation (“what a descriptor is related to”), whereas DEMATEL provides answers to the question “who is leading whom” and with what intensity. This is fundamental knowledge in the development of predictive models, functional classification, and rational molecular design.
The integration of quantitative centrality and causality metrics (Table 3) with the graphical representation of the causal structure (Figure 2) and the generalized topology of interactions (Figure 3) provides a reliable foundation for defining the directions and strengths of influences in SEM. The resulting architecture is not arbitrary, but rather it is predicated on the objectively established causal dynamics in the system of physicochemical descriptors. The causal architecture underpinning the construction and training of the extended SEM-NN model is consistent with the aforementioned principles. Consequently, DEMATEL functions as a conceptual conduit between descriptive analysis and predictive modeling, facilitating the transfer of structural information into analytical and neural paradigms.
The DEMATEL analysis reveals that the individual descriptors fulfill specific functions that correspond to the recognized mechanisms of peptide action. The isoelectric point (pI) has the highest centrality and nearly neutral causality, defining it as a mediator of electrostatic interactions. It regulates protonation around physiological pH and controls solubility, stability and membrane binding. The aliphatic index (Alipha) occupies an intermediate position, providing the necessary thermal stability and lipophilicity to maintain conformation upon interaction with targets. The hydrophobicity index (GRAVY) and the Boman index act as receptive factors: the former reflects membrane penetration and interaction with hydrophobic receptor sites, and the latter reflects protein-binding potential.
The interpretation also highlights the comparative influence of the descriptors. From a biological standpoint, it is plausible that hydrophobicity (GRAVY) has a stronger effect on IC50 than net charge, since hydrophobic amino acids facilitate the incorporation of peptides into membranes and their interaction with hydrophobic receptor sites. Charge is important for initial electrostatic attraction, but its role is secondary. A similar relationship is observed between the Boman index and the aliphatic index: high protein-binding potential appears to be a stronger indicator of activity than more general structural stability indicators.
These observations can be categorized into two main causal chains: len → MW → pI → IC50 (the structural-charge axis) and GRAVY → Boman → Alipha → IC50 (surface chemistry and binding). At the class level, these dependencies are biologically plausible: pI and Alipha dominate antimicrobial peptides, reflecting the importance of charge and stability, while GRAVY and Alipha play a decisive role in anticarcinogenic peptides by enhancing membrane disruption. In antioxidant peptides, the Boman index determines stable interactions with redox enzymes, and in ACE-inhibitory peptides, length and pI determine recognition by the enzyme.
Notably, DEMATEL confirms the biophysical plausibility of these dependencies and offers practical guidelines, such as optimizing pI for better electrostatic interactions, controlling hydrophobicity for selective membrane permeability and adjusting aliphatic content for structural stability.
To our knowledge, this is the first study to apply DEMATEL analysis in the context of bioactive peptides. Although DEMATEL has primarily been used in systems biology and decision sciences to reveal complex causal relationships, introducing it into peptide bioinformatics represents a new way of moving beyond correlational descriptors. Prioritizing indicators such as pI, GRAVY, Boman, and Alipha according to their causal role provides mechanistic plausibility and practical guidance for rational peptide design.
In this sense, DEMATEL can be considered an explainable artificial intelligence (XAI) tool in bioinformatics. Unlike classical machine learning models, which often function as ‘black boxes’, DEMATEL reveals causal relationships and directions of influence. This enables the development of predictive models that are highly accurate and provide transparent, biologically interpretable explanations of the role of individual descriptors. Therefore, DEMATEL acts as a bridge between statistical structure and biological mechanism, contributing to the broader field of XAI and supporting the rational design of peptides and the development of interpretable AI models in biomedicine.

3.3. Structural Equation Modeling (SEM) and Its Neural Extension (SEM-NN)

Based on the causal structure extracted using DEMATEL, a conceptual structural equation model (SEM) was formulated. The architecture of the SEM model is presented in Figure 4 and is characterized by two main causal pathways that mediate the influence of physicochemical descriptors on biological activity as measured by IC50. This configuration reflects the fact that IC50 results from two different types of biophysical mechanisms: structural and surface chemical.
The first causal pathway is associated with the structural characteristics of the molecule, including its overall size (length and molecular weight) and charge properties (isoelectric point). This pathway follows the sequence length → molecular weight → isoelectric point → IC50 and describes the influence of parameters determining the spatial organization, stability, solubility, and transport properties of the peptide.
The second causal pathway describes chemical properties related to hydrophobicity and molecular interaction capacity. The sequence is as follows: gravy → boman → alipha → IC50. The GRAVY index, as a measure of overall hydrophobicity, exerts a significant negative influence on the Boman index, which reflects affinity for protein targets. In turn, the Boman index positively influences the aliphatic index, which characterizes the lipophilic nature and presence of aliphatic side chains, factors with a direct impact on biological activity.
The present study employs a univariate version of SEM in which the sole final dependent variable is IC50. All other descriptors are treated as exogenous inputs or as mediators in the causal pathway, but not as separate outputs. This choice follows from the stated objective of modeling the determinants of biological activity. A multi-output SEM or SEM-NN model, in which descriptors such as Boman or GRAVY are regarded not only as mediators but also as independent output features with functional relevance, remains conceptually relevant. The present study does not cover this extension, which is left for future investigation.
The numeric values depicted along the arrows in Figure 4 represent standardized regression coefficients estimated within the SEM model. These values reflect the relative strength and direction of influence between the related variables. For example, the coefficient of 0.862 on the path len → MW indicates that a one-standard-deviation increase in length is associated with a 0.862-standard-deviation increase in molecular weight, with all other factors held constant. Analogously, the value of 0.722 for the path alipha → IC50 shows that the aliphatic index is the strongest direct predictor of IC50 within the model. All coefficients were estimated by maximum likelihood (ML) on the causal structure defined by DEMATEL. This quantitative parameterization, together with the explicitly delineated causal architecture, supports both an understanding of the internal mechanisms and the construction of an interpretable predictive model.
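The difference between a standardized coefficient (as shown on the arrows of a path diagram) and an unstandardized estimate (in raw units) can be illustrated with a single path x → y. The sketch below uses invented toy data and a hypothetical helper `path_coefficients`; it is only meant to show the conversion beta = b · SD(x) / SD(y), not to reproduce the study's estimates.

```python
# Unstandardized slope b versus standardized coefficient beta for one path.
# The toy data are invented for illustration only.
from statistics import mean, pstdev


def path_coefficients(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    b = cov / pstdev(x) ** 2           # change in y per raw unit of x
    beta = b * pstdev(x) / pstdev(y)   # change in SDs of y per SD of x
    return b, beta


length = [5, 8, 10, 12, 15, 20, 25, 30]              # residues (toy values)
mw = [560, 905, 1080, 1350, 1630, 2210, 2740, 3330]  # Da (toy values)
b, beta = path_coefficients(length, mw)
print(f"b = {b:.2f} Da per residue, standardized beta = {beta:.3f}")
```

For a single predictor the standardized coefficient equals the Pearson correlation, which is why it is bounded by 1 in absolute value while the raw slope is not.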
The quality assessment of the model, presented in Table 4, indicates that the formulated causal structure fits the empirical data well. The Comparative Fit Index (CFI = 0.957) exceeds the commonly accepted threshold of 0.95, indicating a substantial improvement over the baseline independence model, in which all variables are assumed to be uncorrelated. The Standardized Root Mean Square Residual (SRMR = 0.041) is well below the threshold of 0.08, confirming small standardized residuals between the observed and model-implied covariance matrices. The Root Mean Square Error of Approximation (RMSEA = 0.028; 90% CI: 0.020–0.037) lies in the range corresponding to a close fit (<0.05), reflecting a low approximation error of the model relative to the population covariance structure. The Tucker–Lewis Index (TLI = 0.876) is the only metric that remains below the conventional threshold of 0.90. Because the TLI penalizes model complexity more strongly than the CFI, this value may indicate that the complexity of the model is not fully matched by the variance it explains.
Nevertheless, the overall assessment remains positive, since the high CFI and the low RMSEA and SRMR offset this single indicator. The information criteria (AIC = 11,259.23, BIC = 11,269.69, SABIC = 11,264.86) are comparatively low, suggesting that the model balances complexity and explanatory capacity; lower values denote a better fit-to-parsimony ratio. The path estimates in Table 5 support the existence of two distinct causal pathways that explain the variation in the biological activity of peptides, as measured by IC50.
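The incremental and absolute fit indices discussed above (CFI, TLI, RMSEA) are all functions of the model and baseline chi-square statistics. The following sketch shows the usual textbook formulas; every numeric input is hypothetical (the chi-square values of the study are not given in this section), and the N − 1 term in the RMSEA denominator is one common convention among several.

```python
# How CFI, TLI and RMSEA follow from chi-square statistics.
# All inputs below are hypothetical, NOT the study's values.
from math import sqrt


def fit_indices(chi2_model, df_model, chi2_base, df_base, n_obs):
    d_model = max(chi2_model - df_model, 0.0)  # model noncentrality
    d_base = max(chi2_base - df_base, 0.0)     # baseline (independence) model
    denom = max(d_model, d_base)
    cfi = 1.0 - d_model / denom if denom > 0 else 1.0
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    rmsea = sqrt(d_model / (df_model * (n_obs - 1)))
    return cfi, tli, rmsea


cfi, tli, rmsea = fit_indices(chi2_model=30.0, df_model=20,
                              chi2_base=500.0, df_base=28, n_obs=300)
print(f"CFI={cfi:.3f}  TLI={tli:.3f}  RMSEA={rmsea:.3f}")
```

Because the TLI works with chi-square per degree of freedom, it can fall below the CFI for the same model, which is the pattern reported above.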
The first pathway follows the sequence len → MW → pI → IC50. The relationship between peptide length and molecular weight is strongly positive (Estimate = 109.222, Z = 382.213, p < 0.001), which is consistent with fundamental biochemical principles, since each additional residue adds to the molecular weight. Molecular weight has a significant positive influence on the isoelectric point (Estimate = 0.023, p < 0.001), although the effect is small, which is logical given that the isoelectric point depends more on amino acid composition than on size. The isoelectric point has a significant negative effect on IC50 (Estimate = −5.654, Z = −9.035, p < 0.001). Consequently, peptides with higher pI exhibit lower IC50 values, indicative of enhanced biological activity. This effect likely reflects a greater capacity for electrostatic interactions with cellular targets.
The second causal pathway follows the sequence gravy → boman → alipha → IC50. The GRAVY index exerts a strong negative influence on the Boman index (Estimate = −1.643, Z = −56.127, p < 0.001), signifying that an increase in hydrophobicity (higher GRAVY) leads to a decrease in protein-binding potential (lower Boman index). The Boman index has a strong positive influence on the aliphatic index (Estimate = 0.883, Z = 39.348, p < 0.001), indicating that peptides with higher affinity for proteins tend to possess a greater proportion of aliphatic side chains. In parallel, GRAVY also has a direct and significant, though smaller, positive effect on the aliphatic index (Estimate = 0.167, Z = 7.333, p < 0.001). The strongest effect in this pathway is observed between the aliphatic index and IC50 (Estimate = −9.647, Z = −25.744, p < 0.001): increased aliphaticity is associated with significantly lower IC50 values, i.e., increased biological activity. This is biochemically plausible, as aliphatic side chains enhance lipophilicity, thereby improving membrane permeability and promoting interactions with hydrophobic pockets on molecular targets.
All of these findings indicate that the SEM model demonstrates a robust causal structure with high explanatory power. The main regulatory driver in the first pathway is pI, which acts as a mediator between structural parameters and biological activity. In the second pathway, gravy → boman → alipha → IC50, the effects are cumulatively amplified, with the aliphatic index established as the strongest direct determinant of IC50 in the entire model. This structure highlights the fundamental interaction between structure-dependent factors (size and charge) and surface chemical properties (hydrophobicity and aliphatic character) that collectively determine the biological activity of peptides.
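In a linear recursive path model, the compound effect along a chain is the product of the path estimates on that chain, and parallel chains add. The sketch below applies this arithmetic to the unstandardized estimates quoted above; the dictionary layout and the helper name `chain_effect` are illustrative choices, and the products inherit the rounding of the reported estimates (0.023 in particular has only two significant digits).

```python
# Compound effects as products of the unstandardized path estimates
# reported in the text (rounded values, so the products are approximate).
paths = {
    ("len", "MW"): 109.222,
    ("MW", "pI"): 0.023,
    ("pI", "IC50"): -5.654,
    ("gravy", "boman"): -1.643,
    ("boman", "alipha"): 0.883,
    ("gravy", "alipha"): 0.167,
    ("alipha", "IC50"): -9.647,
}


def chain_effect(*nodes):
    """Multiply the estimates along consecutive pairs of nodes."""
    effect = 1.0
    for src, dst in zip(nodes, nodes[1:]):
        effect *= paths[(src, dst)]
    return effect


len_to_ic50 = chain_effect("len", "MW", "pI", "IC50")              # about -14.20
gravy_via_boman = chain_effect("gravy", "boman", "alipha", "IC50")  # about +14.00
gravy_via_alipha = chain_effect("gravy", "alipha", "IC50")          # about -1.61
gravy_total = gravy_via_boman + gravy_via_alipha                    # about +12.38
print(len_to_ic50, gravy_via_boman, gravy_via_alipha, gravy_total)
```

The signs follow directly from the reported estimates: a chain with an even number of negative paths yields a positive compound effect, and one with an odd number yields a negative one.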
In addition to the classic SEM framework, an extended structural equation neural network (SEM-NN) model was developed in the present study (Figure 5). The architecture of SEM-NN reproduces the two main causal pathways identified by SEM: (i) structure-dependent characteristics (len → MW → pI → IC50) and (ii) hydrophobic and surface-active properties (gravy → boman → alipha → IC50). After passing through these two branches, the information is integrated into a common layer that combines the effects of the two mechanisms prior to the final prediction step.
The input characteristics comprise six physicochemical descriptors (Len, MW, pI, GRAVY, Boman, Aliphatic), which are initially processed by the SEM component in order to capture cause-and-effect relationships. After concatenation of the input vectors, the data are fed to a neural network comprising three consecutive hidden layers with differing numbers of neurons. The final output layer uses linear activation, enabling the prediction of a continuous IC50 value.
In order to select the optimal architecture, three variants of a SEM-inspired neural network were implemented, differing in the number of hidden layers after concatenation of the latent blocks. The first variant incorporated a single hidden layer comprising 64 neurons, attaining an approximate accuracy of 73%. The second variant incorporated two hidden layers, with 64 and 32 neurons, respectively, achieving an accuracy of over 80%. The third variant was an extended version with three hidden layers, comprising 128, 64, and 32 neurons, and incorporated Dropout layers, achieving approximately 84% accuracy. In consideration of the findings, a compromise architecture with three fully connected hidden layers (64 → 32 → 16 neurons) was selected, which achieved similar or marginally higher accuracy (approximately 85%, MAE = 31.09, MSE = 15,658.44), but with significantly fewer parameters compared to the deeper three-layer network (128 → 64 → 32). The selection of the ultimate architecture was determined by a judicious balance between the predictive power and the computational complexity of the system.
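The parameter saving of the compromise architecture can be checked with a simple count of weights and biases in fully connected layers. The sketch below assumes a concatenated input of width 6 and a single linear output unit; both are assumptions, since the exact width after the SEM branches is not stated here, and the function name is hypothetical.

```python
# Count trainable parameters (weights + biases) of a fully connected stack.
def dense_param_count(n_inputs, hidden, n_outputs=1):
    sizes = [n_inputs, *hidden, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


N_INPUTS = 6  # assumption: width of the concatenated input vector

compact = dense_param_count(N_INPUTS, [64, 32, 16])  # 3073
wide = dense_param_count(N_INPUTS, [128, 64, 32])    # 11265
print(compact, wide, round(wide / compact, 2))
```

Under these assumptions the 64 → 32 → 16 network has roughly a quarter of the parameters of the 128 → 64 → 32 network.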
All hidden layers use ReLU activation to capture nonlinear dependencies. To mitigate the risk of overfitting and improve generalization, the model integrates a Dropout layer (p = 0.3) together with L2 regularization (λ = 0.01), which also limits excessive weight growth and contributes to stability. For the multi-class classification task, the final output layer employs softmax activation.
The hyperparameters are optimized through cross-validated grid search. The batch size of 64 represents a compromise between gradient stability and training efficiency. An adaptive learning rate (initial value 0.001 with exponential decay) supports both rapid initial convergence and subsequent refinement. Finally, the Adam optimizer is used due to its proven efficacy with high-dimensional, nonlinear data.
Figure 6 shows the MSE curves per epoch for both the training and validation sets. The model converges steadily and shows no substantial signs of overfitting, as the validation error tracks the training error. This behavior indicates that the model generalizes to unseen data rather than merely memorizing the training set. Consequently, SEM-NN not only reproduces the causal structure but also demonstrates reliable predictive power across multiple iterations.
The SEM-NN model demonstrates a significantly improved predictive ability compared to the classical SEM, achieving values of MAE = 31.09 and MSE = 15,658.44 on the test set. These results are significantly superior to the errors of SEM, thereby confirming the higher efficiency of the neural model in predicting biological activity.
Comparing the two approaches highlights the conceptual distinctions between them. SEM provides a stable framework for the empirical evaluation of theoretical hypotheses, the assessment of causal relationships, and the quantitative analysis of latent structures. Conversely, the SEM-NN model is oriented towards the direct minimization of predictive error and offers a more flexible approach, especially when the objective is to achieve high prediction accuracy rather than to conduct formal hypothesis testing. In summary, structural equation modeling (SEM) remains the preferred tool for theoretical modeling, while the combination of SEM with a neural network (SEM-NN) is particularly effective in terms of predictive accuracy, especially in contexts with complex and nonlinear dependencies.

3.4. AI-Based Multiclass Classification of Peptide Bioactivity

In this study, five machine learning models for multiclass classification of bioactive peptides were compared: Random Forest, XGBoost, LightGBM, multilayer perceptron (MLP), and TabNet. Each model was trained on an extended dataset generated by means of targeted, domain-specific data augmentation. This approach was employed to overcome class imbalance and enrich the training set of peptide sequences. Table 1 shows the distribution of the initial dataset by activity: the Antihypertensive group is the largest, whereas the Anti-inflammatory and Antiviral groups comprise only 14 entries each.
The augmentation process entailed the generation of synthetic peptides following two complementary criteria: the frequency distribution of amino acids (AAC profile) within each class and the confidence intervals of key physicochemical descriptors (MW, GRAVY, pI, aliphatic index, and Boman index). To provide a concise overview of the procedure, the pseudocode of Algorithm 1 is given below. The input data are as follows: dataset D, threshold T, confidence interval CI. The output is augmented dataset D*.
Algorithm 1. Pseudocode of the augmentation process.
FUNCTION augmentation(D, T, CI):
 FOR each class c in D:
    deficit = max(T, |c|) − |c|
    added = 0
    WHILE added < deficit:
      Generate new sequence from the AAC profile of c
      Compute descriptors of sequence
      IF descriptors within CI bounds of c THEN
        Add sequence to class c as synthetic
        added = added + 1
      ENDIF
    ENDWHILE
 ENDFOR
 RETURN augmented dataset D*
This compact pseudocode illustrates how synthetic sequences are iteratively generated and validated against class-specific amino acid profiles and descriptor intervals to balance the dataset while preserving biological plausibility.
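The procedure can be sketched in runnable form as follows. This is a simplified illustration rather than the authors' implementation: it checks only one descriptor (GRAVY, computed with the Kyte-Doolittle scale), uses the observed minimum and maximum of the class as a stand-in for the confidence interval, and draws sequence lengths from the real class members. The function names, the `max_tries` guard and the toy peptide list are assumptions.

```python
import random

# Kyte-Doolittle hydropathy scale, used to compute GRAVY.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def gravy(seq):
    return sum(KD[a] for a in seq) / len(seq)


def aac_profile(seqs):
    """Relative amino acid frequencies over all sequences of one class."""
    counts = {a: 0 for a in KD}
    for s in seqs:
        for a in s:
            counts[a] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def augment_class(seqs, target, bounds, rng, max_tries=100_000):
    """Generate synthetic sequences until the class holds `target` members."""
    letters, weights = zip(*aac_profile(seqs).items())
    lo, hi = bounds
    synthetic, tries = [], 0
    while len(seqs) + len(synthetic) < target and tries < max_tries:
        tries += 1
        length = len(rng.choice(seqs))           # reuse a real length
        cand = "".join(rng.choices(letters, weights=weights, k=length))
        if lo <= gravy(cand) <= hi:              # descriptor check
            synthetic.append(cand)
    return synthetic


rng = random.Random(0)
real = ["VPP", "IPP", "LKPNM", "VAP", "IVY", "LRP"]   # toy class members
values = [gravy(s) for s in real]
bounds = (min(values), max(values))                   # stand-in for the CI
synthetic = augment_class(real, target=20, bounds=bounds, rng=rng)
print(len(real) + len(synthetic), synthetic[:5])
```

A full version would apply the same acceptance test to every descriptor (MW, pI, aliphatic index, Boman index) and loop over all classes below the threshold T.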
Initially, the models were trained on the original imbalanced dataset, which comprised 2748 records. The results presented in Table 6 show moderate accuracy and F1-score values: Random Forest (Accuracy 0.5386, F1-score 0.5233), XGBoost (0.5027, 0.4928), LightGBM (0.5189, 0.5021), MLP (0.4829, 0.4552), and TabNet (0.4367, 0.4088). The precision and recall levels remain relatively low, especially for TabNet (precision 0.39, recall 0.44), thus emphasizing the limitations of classification on highly imbalanced classes.
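Why accuracy and F1-score diverge on imbalanced classes can be seen on a toy example. The sketch below assumes macro-averaged F1 (the averaging scheme is not stated in this section) and uses invented labels; the function names are illustrative.

```python
# Accuracy versus macro-averaged F1 on a deliberately imbalanced toy set.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)   # every class counts equally


y_true = ["ACE"] * 8 + ["AMP"] * 2   # toy labels, 8:2 imbalance
y_pred = ["ACE"] * 10                # classifier that ignores the rare class
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

A classifier that never predicts the minority class still reaches 0.8 accuracy here, while the macro F1 drops to about 0.44 because the rare class contributes an F1 of zero.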
In the subsequent phase, augmentation with a threshold of 250 records was applied to the underrepresented classes (a total of 4655 records). This resulted in a substantial improvement across all models. For instance, Random Forest attained an accuracy of 0.6237 and an F1-score of 0.6203, while XGBoost and LightGBM achieved approximately 0.62 for accuracy and 0.615–0.618 for F1-score. The neural models also improved: the MLP reached an accuracy of 0.5707 and an F1-score of 0.5652, while TabNet achieved an accuracy of 0.5443 and an F1-score of 0.5310. This finding indicates that even a modest number of synthetic examples can reduce the effect of class imbalance and result in more robust classification.
In the third phase, the augmentation threshold was increased to 300 records per class (a total of 5255 records), which improved performance further. Random Forest, XGBoost, and LightGBM achieved accuracies of 0.6700, 0.6800, and 0.6600, with corresponding F1-scores of 0.6400, 0.6500, and 0.6623, respectively. The MLP model improved substantially, with an accuracy of 0.6609 and an F1-score of 0.6608. TabNet also improved, though more modestly, with an accuracy of 0.5869 and an F1-score of 0.5720. Precision and recall follow a comparable trend, showing that balancing with synthetic examples contributes to more reliable recognition of rarer classes.
It should also be noted that the reported standard deviations across all models remain relatively low (typically within the range of ±0.002 to ±0.06), thereby confirming the stability of the classifiers across the cross-validation folds. This consistency suggests that the observed improvements are not random fluctuations, but rather systematic gains attributable to the augmentation strategy.
The comparison between the original data and the two levels of augmentation demonstrates that a moderate increase in synthetic examples is essential to achieving balanced and reliable classification. Ensemble tree models (LightGBM, XGBoost, Random Forest) are particularly responsive to this approach, yet even neural architectures (MLP and TabNet) improve substantially. This underscores the importance of carefully controlled augmentation as a mechanism for addressing class imbalance. At the same time, as documented in the literature [85,86], excessive use of synthetic data can introduce noise and result in overfitting. Consequently, the selected threshold of 250–300 examples can be regarded as a reasonable compromise between improving classification and limiting the risk of overfitting.
It is important to emphasize that the present study introduces a fundamentally different formulation of the peptide classification problem compared to existing publications. While most studies use multi-label classification, where a peptide can simultaneously belong to several functional categories, the present study proposes and successfully implements a multi-class classification covering 15 clearly defined bioactivity classes. This means that each peptide is classified into a single, most representative functional class.
The rationale for adopting a multi-class design is twofold. First, from a biological perspective, peptides typically exhibit one dominant bioactivity that is most relevant for therapeutic design and downstream applications; therefore, forcing the model to select a primary class enhances interpretability and practical utility. Second, from a methodological standpoint, multi-class classification avoids the ambiguity and potential redundancy of multi-label outputs, thereby providing cleaner decision boundaries and a more rigorous evaluation of model performance.
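The practical difference between the two formulations lies in how the output scores are turned into labels. The sketch below contrasts a softmax output with an argmax decision (multi-class, one label per peptide) against independent sigmoid outputs with a threshold (multi-label, any number of labels). The class names are taken from the abstract, while the logits and the 0.5 threshold are hypothetical.

```python
from math import exp


def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


classes = ["ACE inhibitor", "antimicrobial", "antioxidant"]
logits = [2.0, 1.5, -1.0]                 # hypothetical model scores

# Multi-class: probabilities sum to 1 and exactly one class is chosen.
probs = softmax(logits)
multiclass = classes[probs.index(max(probs))]

# Multi-label: each class is decided independently against a threshold.
multilabel = [c for c, z in zip(classes, logits) if sigmoid(z) >= 0.5]

print(multiclass)   # ACE inhibitor
print(multilabel)   # ['ACE inhibitor', 'antimicrobial']
```

The same scores therefore yield a single primary class in the multi-class setting but two simultaneous labels in the multi-label setting, which is why results from the two paradigms are not directly comparable.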
Given the inherent differences between the multi-label and multi-class paradigms, a direct comparison on a common dataset is not possible. Nevertheless, contextual benchmarking against established models such as MultiPep [87], MPMABP [88], and ETFC [89] highlights the distinctive contribution of the present framework.
In terms of computational complexity, ensemble tree methods (Random Forest, LightGBM, XGBoost) demonstrated the best trade-off between accuracy and efficiency, while the deep learning–based TabNet model required substantially more resources without delivering superior performance. Despite these differences, the proposed multi-class framework achieved competitive results even in the presence of significant class imbalance, confirming its robustness and practical relevance for functional classification of bioactive peptides. This underscores the novelty and applicability of the multi-class setting in peptide bioinformatics and highlights its potential to open new avenues for future research.

4. Limitations

The present study is subject to several significant limitations. Firstly, the dataset is modest in size. Although PepLab already offers an integrated database of bioactive peptides, the number of entries is still limited because collection and annotation are performed manually, which is slow and resource-intensive. This restricts the development of more sophisticated models and the attainment of broader generalizations. Future work will focus on expanding the database with new records to improve the statistical stability of the models and achieve higher predictive power, particularly for classes with a small number of records (e.g., class 14). There is also an option to create a mechanism whereby external researchers can contribute new records after they have been checked for duplication, or send scientific publications containing newly discovered peptides for the PepLab team to integrate into the database.
Secondly, while the results obtained are encouraging, they remain largely confined to computer modeling. To confirm the biological activity of peptides, molecular docking and structural bioinformatics are required, as well as in vitro and in vivo experiments. Combining in silico and experimental methods would enable the validation and practical application of the developed models, making this a natural direction for future research.
Thirdly, it is important to note that the proposed approaches have principally been developed for peptides of up to 50 amino acids in length. While these methodologies may be applicable to longer chains (proteins), it is essential to first assess their efficacy through in silico simulations prior to validation using experimental data. Nevertheless, the analytical framework presented is conceptually applicable to longer sequences, including proteins, since there are no significant methodological obstacles to extending the analysis to larger, more complex biomolecules.
Furthermore, future research will concentrate on integrating methods for structural and energy validation of interactions between peptides and their biological partners, such as receptors, enzymes, or other macromolecules. Such integration will facilitate not only the assessment of the physicochemical characteristics of peptides, but also their functional activity in the context of real biological interactions.
Finally, PepLab represents a unique integrated platform that combines a database of bioactive peptides with calculated physicochemical descriptors and analytical tools for statistical and AI analysis. In contrast to existing specialized databases that focus on specific classes of peptides, such as antimicrobial peptides (APD3, DBAASP) or antihypertensive peptides (AHTPDB), PepLab encompasses a more extensive range of biologically active peptides. This broader coverage provides enhanced versatility and opportunities for interdisciplinary applications. A further advantage of the platform is that it provides not only access to structural and physicochemical characteristics, but also the possibility of directly applying statistical methods and machine learning algorithms. This integrated solution provides a foundation for more in-depth research and accelerates the process of validating hypotheses in the field of peptide science. A comparative overview of the principal limitations of the present study, alongside a benchmarking against existing peptide databases, is provided in Table 7. This highlights both the unique contributions of PepLab and the areas where further development remains necessary.

5. Conclusions

The present study proposes an integrated methodology for the analysis and classification of bioactive peptides, which for the first time combines quantitative statistics, DEMATEL, structural equation modeling with neural networks (SEM-NN), and machine learning. These methods provide interpretability of the causal relationships between key descriptors, which is usually missing in so-called “black box” models, while achieving competitive accuracy in multi-class classification.
The results demonstrate that descriptors such as isoelectric point and aliphatic index play a pivotal role in determining the functional class of peptides. The application of domain-specific augmentation led to a substantial enhancement in the performance of the models and a reduction in the impact of class imbalance. This demonstrated the possibility of constructing interpretable and reliable classifiers for 15 biological activities.
The limitations of the present study stem mainly from its entirely computer-based nature and dependence on a predefined set of physicochemical descriptors. Furthermore, the imbalance between classes and the lack of external experimental validation may affect the generalizability of the results, highlighting the need for future in vitro and in vivo studies to confirm the biological significance and practical applicability of the proposed models.
In view of these limitations, several directions for future development can be identified. Firstly, the applicability of the proposed framework should be tested on other publicly available databases of bioactive peptides (e.g., APD3 [31], DRAVP [64]) in order to assess its generalizability and robustness across different data volumes and diversities. Secondly, the incorporation of secondary and tertiary structural features, or the integration of omics data (e.g., transcriptomics, proteomics), may enhance prediction accuracy, as evidenced by multimodal approaches to peptide classification [8]. Thirdly, the integration of such causally interpretable models into computational frameworks for drug discovery will enable the rational design of peptides with targeted biological activity, which is of particular importance in the context of anticancer and antimicrobial therapies [35,48]. Finally, the use of current methods for explainable artificial intelligence (e.g., SHAP and LIME [10]) can provide a deeper understanding of the relationship between the structural characteristics and functional manifestations of peptides. This, in turn, can enable the translation of findings into applied biomedical and industrial practices.
In conclusion, the proposed framework establishes a foundation for explainable and practical machine learning in bioinformatics and functional proteomics. This opens up opportunities for the rational design of new bioactive peptides.

Author Contributions

Conceptualization: M.T., Z.T. and I.I.; methodology: S.B. and V.V.; software: M.T. and Z.T.; validation: M.T., Z.T. and I.I.; formal analysis: M.T. and Z.T.; investigation: I.I., S.B. and V.V.; resources: I.I., S.B. and V.V.; data curation: Z.T.; writing—original draft preparation: M.T., Z.T., I.I., S.B. and V.V.; writing—review and editing: M.T., Z.T., I.I., S.B. and V.V.; visualization: M.T., Z.T. and I.I.; supervision: M.T.; project administration: M.T.; funding acquisition: M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The PepLab platform is freely available at the following website: www.pep-lab.info (accessed on 21 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AAC: Amino Acid Composition
ACE: Angiotensin-Converting Enzyme
AMP: Antimicrobial Peptide
BAP: Bioactive Peptide
D: Outward Influence (DEMATEL)
D+R: Centrality (Sum of Outward and Inward Influence)
D–R: Causality (Difference between Outward and Inward Influence)
Da: Dalton (Molecular Weight Unit)
DEMATEL: Decision Making Trial and Evaluation Laboratory
DPP-IV: Dipeptidyl Peptidase IV
GRAVY: Grand Average of Hydropathy
kDa: Kilodalton
MLP: Multi-Layer Perceptron
MW: Molecular Weight
PCA: Principal Component Analysis
pI: Isoelectric Point
QSAR: Quantitative Structure–Activity Relationship
R: Inward Influence (DEMATEL)
SEM: Structural Equation Modeling
SEM-NN: Structural Equation Modeling with Neural Networks
SHAP: SHapley Additive exPlanations
TabNet: Attentive Interpretable Tabular Learning

Appendix A

This appendix contains the complete set of tables related to the descriptive statistical analysis presented in Section 3.1. For clarity and conciseness of the main text, these tables are presented here in their entirety, while the corresponding summaries and interpretations are included in Section 3.
Table A1. Data on the physicochemical characteristics of ACE inhibitory peptides.
Molecular Weight (Da)
Interval | 468.51 to 1008.61 | 1008.61 to 1548.7 | 1548.7 to 2088.8 | 2088.8 to 2628.89 | 2628.89 to 3168.99 | 3168.99 to 3709.08 | 3709.08 to 4249.18 | 4249.18 to 4789.27 | 4789.27 to 5329.37 | 5329.37 to 5869.46
Data (%) | 64.44 | 22.94 | 9.37 | 0.96 | 1.53 | 0.57 | 0.00 | 0.00 | 0.00 | 0.19
Isoelectric Point (pH)
Interval | 2.63 to 3.73 | 3.73 to 4.82 | 4.82 to 5.92 | 5.92 to 7.02 | 7.02 to 8.12 | 8.12 to 9.21 | 9.21 to 10.31 | 10.31 to 11.41 | 11.41 to 12.5 | 12.5 to 13.6
Data (%) | 11.85 | 4.97 | 17.02 | 19.31 | 11.66 | 2.49 | 18.74 | 7.65 | 4.97 | 1.34
GRAVY
Interval | −3.9 to −3.26 | −3.26 to −2.62 | −2.62 to −1.98 | −1.98 to −1.33 | −1.33 to −0.69 | −0.69 to −0.05 | −0.05 to 0.59 | 0.59 to 1.23 | 1.23 to 1.87 | 1.87 to 2.51
Data (%) | 0.76 | 0.76 | 2.68 | 6.50 | 16.63 | 28.49 | 26.77 | 11.85 | 3.44 | 2.10
Aliphatic index
Interval | 0 to 29.29 | 29.29 to 58.57 | 58.57 to 87.86 | 87.86 to 117.14 | 117.14 to 146.43 | 146.43 to 175.72 | 175.72 to 205 | 205 to 234.29 | 234.29 to 263.57 | 263.57 to 292.86
Data (%) | 11.28 | 19.69 | 24.86 | 23.14 | 11.09 | 4.97 | 3.82 | 0.57 | 0.38 | 0.19
Boman index
Interval | −3.11 to −1.73 | −1.73 to −0.36 | −0.36 to 1.01 | 1.01 to 2.38 | 2.38 to 3.76 | 3.76 to 5.13 | 5.13 to 6.5 | 6.5 to 7.88 | 7.88 to 9.25 | 9.25 to 10.62
Data (%) | 4.59 | 19.31 | 36.52 | 23.52 | 10.33 | 3.82 | 0.96 | 0.19 | 0.57 | 0.19
Table A2. Data on the physicochemical characteristics of Anti-inflammatory peptides.
Molecular Weight (Da)
Interval | 581.71 to 803.56 | 803.56 to 1025.41 | 1025.41 to 1247.27 | 1247.27 to 1469.12 | 1469.12 to 1690.97
Data (%) | 42.86 | 28.57 | 7.14 | 7.14 | 14.29
Isoelectric Point (pH)
Interval | 3.08 to 4.76 | 4.76 to 6.45 | 6.45 to 8.13 | 8.13 to 9.82 | 9.82 to 11.5
Data (%) | 35.71 | 21.43 | 0.00 | 35.71 | 7.14
GRAVY
Interval | −2.78 to −1.95 | −1.95 to −1.11 | −1.11 to −0.28 | −0.28 to 0.55 | 0.55 to 1.39
Data (%) | 14.29 | 14.29 | 28.57 | 21.43 | 21.43
Aliphatic index
Interval | 0 to 39 | 39 to 78 | 78 to 117 | 117 to 156 | 156 to 195
Data (%) | 14.29 | 28.57 | 21.43 | 28.57 | 7.14
Boman index
Interval | −1.51 to 0.22 | 0.22 to 1.95 | 1.95 to 3.69 | 3.69 to 5.42 | 5.42 to 7.15
Data (%) | 35.71 | 14.29 | 35.71 | 0.00 | 14.29
Table A3. Data on the physicochemical characteristics of Antiamnestic peptides.
Molecular Weight (Da)
Interval | 519.64 to 818.63 | 818.63 to 1117.61 | 1117.61 to 1416.6 | 1416.6 to 1715.59 | 1715.59 to 2014.57 | 2014.57 to 2313.56
Data (%) | 21.2766 | 42.5532 | 8.5106 | 2.1277 | 17.0213 | 8.5106
Isoelectric Point (pH)
Interval | 2.76 to 4.28 | 4.28 to 5.81 | 5.81 to 7.33 | 7.33 to 8.85 | 8.85 to 10.38 | 10.38 to 11.9
Data (%) | 21.2766 | 17.0213 | 36.1702 | 17.0213 | 0.0000 | 8.5106
GRAVY
Interval | −2.5 to −1.83 | −1.83 to −1.16 | −1.16 to −0.49 | −0.49 to 0.18 | 0.18 to 0.84 | 0.84 to 1.51
Data (%) | 2.1277 | 2.1277 | 17.0213 | 31.9149 | 42.5532 | 4.2553
Aliphatic index
Interval | 0 to 23.52 | 23.52 to 47.04 | 47.04 to 70.55 | 70.56 to 94.07 | 94.07 to 117.59 | 117.59 to 141.11
Data (%) | 6.3830 | 6.3830 | 17.0213 | 38.2979 | 17.0213 | 14.8936
Boman index
Interval | −1.79 to −0.96 | −0.96 to −0.12 | −0.12 to 0.71 | 0.71 to 1.54 | 1.54 to 2.38 | 2.38 to 3.21
Data (%) | 31.9149 | 19.1489 | 21.2766 | 14.8936 | 8.5106 | 4.2553
Table A4. Data on the physicochemical characteristics of Antibacterial peptides.
Molecular Weight (Da)
Interval | 514.63 to 1149.25 | 1149.25 to 1783.88 | 1783.88 to 2418.5 | 2418.5 to 3053.13 | 3053.13 to 3687.75 | 3687.75 to 4322.38 | 4322.38 to 4957 | 4957 to 5591.62 | 5591.62 to 6226.25
Data (%) | 9.6591 | 17.3295 | 14.4886 | 12.7841 | 11.3636 | 19.8864 | 10.2273 | 3.6932 | 0.5682
Isoelectric Point (pH)
Interval | 2.74 to 4.01 | 4.01 to 5.28 | 5.28 to 6.54 | 6.54 to 7.81 | 7.81 to 9.08 | 9.08 to 10.35 | 10.35 to 11.61 | 11.61 to 12.88 | 12.88 to 14.15
Data (%) | 4.2614 | 2.2727 | 4.2614 | 5.9659 | 9.6591 | 25.5682 | 25.2841 | 11.3636 | 11.3636
GRAVY
Interval | −3.59 to −3 | −3 to −2.42 | −2.42 to −1.83 | −1.83 to −1.25 | −1.25 to −0.66 | −0.66 to −0.08 | −0.08 to 0.51 | 0.51 to 1.1 | 1.1 to 1.68
Data (%) | 1.1364 | 0.8523 | 2.5568 | 9.6591 | 13.9205 | 30.1136 | 21.3068 | 11.0795 | 9.3750
Aliphatic index
Interval | 0 to 25.28 | 25.28 to 50.56 | 50.56 to 75.83 | 75.83 to 101.11 | 101.11 to 126.39 | 126.39 to 151.67 | 151.67 to 176.94 | 176.94 to 202.22 | 202.22 to 227.5
Data (%) | 10.2273 | 18.1818 | 16.7614 | 18.1818 | 17.8977 | 11.3636 | 3.6932 | 2.2727 | 1.4205
Boman index
Interval | −2.48 to −1.45 | −1.45 to −0.42 | −0.42 to 0.61 | 0.61 to 1.64 | 1.64 to 2.67 | 2.67 to 3.7 | 3.7 to 4.73 | 4.73 to 5.77 | 5.77 to 6.8
Data (%) | 3.1250 | 10.7955 | 16.4773 | 20.4545 | 22.7273 | 17.3295 | 5.9659 | 2.8409 | 0.2841
Table A5. Data on the physicochemical characteristics of Anticancer peptides.
Molecular Weight (Da)
Interval | 573.67 to 1097.51 | 1097.51 to 1621.34 | 1621.34 to 2145.18 | 2145.18 to 2669.01 | 2669.01 to 3192.85 | 3192.85 to 3716.68 | 3716.68 to 4240.52 | 4240.52 to 4764.35 | 4764.35 to 5288.19
Data (%) | 5.3004 | 21.5548 | 27.2085 | 18.0212 | 16.6078 | 6.3604 | 3.1802 | 0.7067 | 1.0601
Isoelectric Point (pH)
Interval | 2.94 to 4.17 | 4.17 to 5.39 | 5.39 to 6.62 | 6.62 to 7.85 | 7.85 to 9.07 | 9.07 to 10.3 | 10.3 to 11.53 | 11.53 to 12.75 | 12.75 to 13.98
Data (%) | 3.8869 | 7.4205 | 4.5936 | 9.5406 | 7.0671 | 16.6078 | 33.9223 | 6.7138 | 10.2473
GRAVY
Interval | −2.74 to −2.22 | −2.22 to −1.7 | −1.7 to −1.18 | −1.18 to −0.66 | −0.66 to −0.14 | −0.14 to 0.39 | 0.39 to 0.91 | 0.91 to 1.43 | 1.43 to 1.95
Data (%) | 0.7067 | 1.0601 | 3.5336 | 10.2473 | 19.4346 | 23.3216 | 25.4417 | 12.7208 | 3.5336
Aliphatic index
Interval | 0 to 25.49 | 25.49 to 50.98 | 50.98 to 76.47 | 76.47 to 101.96 | 101.96 to 127.45 | 127.45 to 152.94 | 152.94 to 178.43 | 178.43 to 203.92 | 203.92 to 229.41
Data (%) | 7.0671 | 12.0141 | 13.7809 | 15.5477 | 15.9011 | 17.3145 | 14.1343 | 3.1802 | 1.0601
Boman index
Interval | −2.64 to −1.56 | −1.56 to −0.47 | −0.47 to 0.61 | 0.61 to 1.7 | 1.7 to 2.79 | 2.79 to 3.87 | 3.87 to 4.96 | 4.96 to 6.04 | 6.04 to 7.13
Data (%) | 3.8869 | 20.4947 | 30.3887 | 23.3216 | 14.4876 | 3.8869 | 1.7668 | 1.0601 | 0.7067
Table A6. Data on the physicochemical characteristics of Antidiabetic peptides.
Molecular Weight (Da)
Interval | 1028.09 to 1491.66 | 1491.66 to 1955.24 | 1955.24 to 2418.81 | 2418.81 to 2882.38 | 2882.38 to 3345.95 | 3345.95 to 3809.53 | 3809.53 to 4273.1 | 4273.1 to 4736.67
Data (%) | 11.6379 | 19.3966 | 42.6724 | 12.5000 | 11.6379 | 0.0000 | 1.2931 | 0.8621
Isoelectric Point (pH)
Interval | 2.73 to 4.1 | 4.1 to 5.47 | 5.47 to 6.84 | 6.84 to 8.21 | 8.21 to 9.57 | 9.57 to 10.94 | 10.94 to 12.31 | 12.31 to 13.68
Data (%) | 20.6897 | 25.4310 | 11.6379 | 14.2241 | 7.7586 | 16.3793 | 3.4483 | 0.4310
GRAVY
Interval | −2.15 to −1.64 | −1.64 to −1.13 | −1.13 to −0.62 | −0.62 to −0.11 | −0.11 to 0.4 | 0.4 to 0.9 | 0.9 to 1.41 | 1.41 to 1.92
Data (%) | 2.1552 | 6.8966 | 16.3793 | 27.1552 | 24.5690 | 12.9310 | 7.7586 | 2.1552
Aliphatic index
Interval | 0 to 25.29 | 25.29 to 50.58 | 50.58 to 75.87 | 75.87 to 101.15 | 101.15 to 126.44 | 126.44 to 151.73 | 151.73 to 177.02 | 177.02 to 202.31
Data (%) | 3.4483 | 10.7759 | 20.6897 | 26.7241 | 22.4138 | 11.6379 | 3.4483 | 0.8621
Boman index
Interval | −1.54 to −0.73 | −0.73 to 0.08 | 0.08 to 0.89 | 0.89 to 1.71 | 1.71 to 2.52 | 2.52 to 3.33 | 3.33 to 4.14 | 4.14 to 4.95
Data (%) | 5.1724 | 14.2241 | 22.8448 | 22.4138 | 21.5517 | 6.4655 | 5.6034 | 1.7241
Table A7. Data on the physicochemical characteristics of Antifungal peptides.
Molecular Weight (Da)
Interval | 653.69 to 1070.02 | 1070.02 to 1486.36 | 1486.36 to 1902.69 | 1902.69 to 2319.02 | 2319.02 to 2735.36 | 2735.36 to 3151.69
Data (%) | 9.7561 | 26.8293 | 56.0976 | 0.0000 | 0.0000 | 7.3171
Isoelectric Point (pH)
Interval | 3.29 to 4.97 | 4.97 to 6.65 | 6.65 to 8.34 | 8.34 to 10.02 | 10.02 to 11.7 | 11.7 to 13.38
Data (%) | 2.4390 | 0.0000 | 9.7561 | 41.4634 | 43.9024 | 2.4390
GRAVY
Interval | −1.59 to −1.18 | −1.18 to −0.76 | −0.76 to −0.35 | −0.35 to 0.06 | 0.06 to 0.48 | 0.48 to 0.89
Data (%) | 7.3171 | 14.6341 | 41.4634 | 9.7561 | 14.6341 | 12.1951
Aliphatic index
Interval | 0 to 31.11 | 31.11 to 62.22 | 62.22 to 93.34 | 93.34 to 124.45 | 124.45 to 155.56 | 155.56 to 186.67
Data (%) | 12.1951 | 41.4634 | 12.1951 | 0.0000 | 26.8293 | 7.3171
Boman index
Interval | −0.89 to 0.09 | 0.09 to 1.07 | 1.07 to 2.04 | 2.04 to 3.02 | 3.02 to 4 | 4 to 4.98
Data (%) | 24.3902 | 17.0732 | 43.9024 | 7.3171 | 2.4390 | 4.8780
Table A8. Data on the physicochemical characteristics of Antimicrobial peptides.
Molecular Weight (Da)
Interval | 658.8 to 1249.26 | 1249.26 to 1839.72 | 1839.72 to 2430.18 | 2430.18 to 3020.64 | 3020.64 to 3611.09 | 3611.09 to 4201.55 | 4201.55 to 4792.01 | 4792.01 to 5382.47 | 5382.47 to 5972.93
Data (%) | 4.2857 | 14.8980 | 21.8367 | 17.3469 | 10.8163 | 10.2041 | 12.4490 | 5.7143 | 2.4490
Isoelectric Point (pH)
Interval | 2.27 to 3.63 | 3.63 to 4.98 | 4.98 to 6.34 | 6.34 to 7.7 | 7.7 to 9.05 | 9.05 to 10.41 | 10.41 to 11.77 | 11.77 to 13.12 | 13.12 to 14.48
Data (%) | 3.0612 | 5.5102 | 3.4694 | 5.7143 | 20.2041 | 28.9796 | 19.5918 | 6.1224 | 7.3469
GRAVY
Interval | −2.88 to −2.25 | −2.25 to −1.62 | −1.61 to −0.98 | −0.98 to −0.35 | −0.35 to 0.28 | 0.28 to 0.92 | 0.92 to 1.55 | 1.55 to 2.18 | 2.18 to 2.81
Data (%) | 0.8163 | 2.2449 | 4.8980 | 25.9184 | 27.5510 | 23.4694 | 12.2449 | 2.4490 | 0.4082
Aliphatic index
Interval | 0 to 24.65 | 24.65 to 49.29 | 49.29 to 73.94 | 73.94 to 98.59 | 98.59 to 123.23 | 123.23 to 147.88 | 147.88 to 172.53 | 172.53 to 197.17 | 197.17 to 221.82
Data (%) | 6.3265 | 14.8980 | 21.6327 | 18.9796 | 16.9388 | 12.4490 | 5.3061 | 2.6531 | 0.8163
Boman index
Interval | −3.58 to −2.09 | −2.09 to −0.6 | −0.6 to 0.89 | 0.89 to 2.37 | 2.38 to 3.86 | 3.86 to 5.35 | 5.35 to 6.84 | 6.84 to 8.33 | 8.33 to 9.81
Data (%) | 1.8367 | 14.0816 | 34.8980 | 28.9796 | 15.5102 | 2.6531 | 1.6327 | 0.2041 | 0.2041
Table A9. Data on the physicochemical characteristics of Antioxidative peptides.
Molecular Weight (Da)
Interval | 389.41 to 671.04 | 671.04 to 952.67 | 952.67 to 1234.29 | 1234.29 to 1515.92 | 1515.92 to 1797.55 | 1797.55 to 2079.18 | 2079.18 to 2360.8 | 2360.8 to 2642.43 | 2642.43 to 2924.06
Data (%) | 29.3893 | 32.0611 | 11.4504 | 11.4504 | 11.4504 | 3.0534 | 0.3817 | 0.3817 | 0.3817
Isoelectric Point (pH)
Interval | 2.71 to 3.84 | 3.84 to 4.97 | 4.97 to 6.11 | 6.11 to 7.24 | 7.24 to 8.37 | 8.37 to 9.5 | 9.5 to 10.64 | 10.64 to 11.77 | 11.77 to 12.9
Data (%) | 21.3740 | 9.5420 | 24.0458 | 10.3053 | 10.6870 | 3.0534 | 12.2137 | 7.6336 | 1.1450
GRAVY
Interval | −3.38 to −2.63 | −2.63 to −1.88 | −1.88 to −1.12 | −1.12 to −0.37 | −0.37 to 0.38 | 0.38 to 1.13 | 1.13 to 1.88 | 1.88 to 2.63 | 2.63 to 3.38
Data (%) | 1.9084 | 7.6336 | 15.6489 | 27.4809 | 26.3359 | 14.8855 | 4.1985 | 1.5267 | 0.3817
Aliphatic index
Interval | 0 to 27.04 | 27.04 to 54.07 | 54.07 to 81.11 | 81.11 to 108.15 | 108.15 to 135.18 | 135.18 to 162.22 | 162.22 to 189.26 | 189.26 to 216.29 | 216.29 to 243.33
Data (%) | 17.5573 | 14.1221 | 27.8626 | 16.7939 | 14.8855 | 6.4885 | 0.7634 | 1.1450 | 0.3817
Boman index
Interval | −3.5 to −2.39 | −2.39 to −1.28 | −1.28 to −0.17 | −0.17 to 0.94 | 0.94 to 2.06 | 2.06 to 3.17 | 3.17 to 4.28 | 4.28 to 5.39 | 5.39 to 6.5
Data (%) | 1.9084 | 5.7252 | 18.7023 | 15.2672 | 27.0992 | 14.5038 | 10.3053 | 3.4351 | 3.0534
Table A10. Data on the physicochemical characteristics of Antithrombotic peptides.
Molecular Weight (Da)
Interval | 482.54 to 1306.31 | 1306.31 to 2130.07 | 2130.07 to 2953.84 | 2953.84 to 3777.61 | 3777.61 to 4601.37 | 4601.37 to 5425.14
Data (%) | 77.1429 | 14.2857 | 0.0000 | 2.8571 | 0.0000 | 5.7143
Isoelectric Point (pH)
Interval | 2.77 to 4.59 | 4.59 to 6.41 | 6.41 to 8.22 | 8.23 to 10.04 | 10.04 to 11.86 | 11.86 to 13.68
Data (%) | 22.8571 | 2.8571 | 34.2857 | 14.2857 | 22.8571 | 2.8571
GRAVY
Interval | −3.66 to −3.02 | −3.02 to −2.37 | −2.37 to −1.73 | −1.73 to −1.08 | −1.08 to −0.44 | −0.44 to 0.21
Data (%) | 8.5714 | 0.0000 | 14.2857 | 34.2857 | 22.8571 | 20.0000
Aliphatic index
Interval | 0 to 14.67 | 14.67 to 29.33 | 29.33 to 44 | 44 to 58.67 | 58.67 to 73.33 | 73.33 to 88
Data (%) | 40.0000 | 17.1429 | 5.7143 | 14.2857 | 8.5714 | 14.2857
Boman index
Interval | −0.59 to 0.88 | 0.88 to 2.34 | 2.34 to 3.8 | 3.81 to 5.27 | 5.27 to 6.73 | 6.73 to 8.2
Data (%) | 8.5714 | 40.0000 | 31.4286 | 11.4286 | 2.8571 | 5.7143
Table A11. Data on the physicochemical characteristics of Antiviral peptides.
Molecular Weight (Da)
Interval | 623.71 to 1128.72 | 1128.72 to 1633.73 | 1633.73 to 2138.74 | 2138.74 to 2643.75 | 2643.75 to 3148.76
Data (%) | 52.9412 | 11.7647 | 23.5294 | 0.0000 | 11.7647
Isoelectric Point (pH)
Interval | 3.29 to 4.65 | 4.65 to 6.01 | 6.01 to 7.38 | 7.38 to 8.74 | 8.74 to 10.1
Data (%) | 47.0588 | 5.8824 | 11.7647 | 11.7647 | 23.5294
GRAVY
Interval | −1.34 to −0.86 | −0.86 to −0.37 | −0.37 to 0.11 | 0.11 to 0.6 | 0.6 to 1.08
Data (%) | 23.5294 | 35.2941 | 11.7647 | 17.6471 | 11.7647
Aliphatic index
Interval | 0 to 39 | 39 to 78 | 78 to 117 | 117 to 156 | 156 to 195
Data (%) | 35.2941 | 23.5294 | 11.7647 | 17.6471 | 11.7647
Boman index
Interval | −0.73 to 0.18 | 0.18 to 1.1 | 1.1 to 2.02 | 2.02 to 2.93 | 2.93 to 3.85
Data (%) | 23.5294 | 11.7647 | 29.4118 | 23.5294 | 11.7647
Table A12. Data on the physicochemical characteristics of DPP-IV inhibitor peptides.
Molecular Weight (Da)
Interval | 359.38 to 613.86 | 613.86 to 868.34 | 868.34 to 1122.81 | 1122.82 to 1377.29 | 1377.29 to 1631.77 | 1631.77 to 1886.25
Data (%) | 14.5455 | 43.6364 | 16.3636 | 18.1818 | 3.6364 | 3.6364
Isoelectric Point (pH)
Interval | 2.89 to 4.37 | 4.37 to 5.86 | 5.86 to 7.34 | 7.34 to 8.82 | 8.82 to 10.31 | 10.31 to 11.79
Data (%) | 40.0000 | 16.3636 | 23.6364 | 0.0000 | 12.7273 | 7.2727
GRAVY
Interval | −2.19 to −1.44 | −1.43 to −0.68 | −0.68 to 0.08 | 0.08 to 0.83 | 0.83 to 1.59 | 1.59 to 2.34
Data (%) | 5.4545 | 12.7273 | 30.9091 | 29.0909 | 9.0909 | 12.7273
Aliphatic index
Interval | 0 to 43.33 | 43.33 to 86.67 | 86.67 to 130 | 130 to 173.33 | 173.33 to 216.67 | 216.67 to 260
Data (%) | 16.3636 | 34.5455 | 27.2727 | 14.5455 | 3.6364 | 3.6364
Boman index
Interval | −2.95 to −1.74 | −1.74 to −0.52 | −0.52 to 0.69 | 0.69 to 1.91 | 1.91 to 3.12 | 3.12 to 4.34
Data (%) | 10.9091 | 21.8182 | 38.1818 | 14.5455 | 5.4545 | 9.0909
Table A13. Data on the physicochemical characteristics of Neuropeptides.
Molecular Weight (Da)
Interval | 573.67 to 1304.24 | 1304.24 to 2034.81 | 2034.81 to 2765.38 | 2765.38 to 3495.95 | 3495.95 to 4226.52 | 4226.52 to 4957.09 | 4957.09 to 5687.66
Data (%) | 46.4789 | 30.9859 | 4.2254 | 7.0423 | 4.2254 | 2.8169 | 4.2254
Isoelectric Point (pH)
Interval | 3.07 to 4.56 | 4.56 to 6.05 | 6.05 to 7.54 | 7.54 to 9.03 | 9.03 to 10.52 | 10.52 to 12.01 | 12.01 to 13.5
Data (%) | 7.0423 | 9.8592 | 18.3099 | 9.8592 | 18.3099 | 22.5352 | 14.0845
GRAVY
Interval | −3.19 to −2.48 | −2.48 to −1.78 | −1.78 to −1.07 | −1.07 to −0.37 | −0.37 to 0.34 | 0.34 to 1.04 | 1.05 to 1.75
Data (%) | 2.8169 | 2.8169 | 11.2676 | 35.2113 | 33.8028 | 11.2676 | 2.8169
Aliphatic index
Interval | 0 to 21.84 | 21.84 to 43.67 | 43.67 to 65.51 | 65.51 to 87.35 | 87.35 to 109.19 | 109.19 to 131.02 | 131.02 to 152.86
Data (%) | 14.0845 | 16.9014 | 21.1268 | 23.9437 | 15.4930 | 5.6338 | 2.8169
Boman index
Interval | −2.52 to −1.26 | −1.26 to −0.01 | −0.01 to 1.25 | 1.25 to 2.51 | 2.51 to 3.76 | 3.76 to 5.02 | 5.02 to 6.28
Data (%) | 8.4507 | 8.4507 | 18.3099 | 32.3944 | 23.9437 | 7.0423 | 1.4085
Table A14. Data on the physicochemical characteristics of Opioid peptides.
Molecular Weight (Da)
Interval | 549.63 to 966.12 | 966.12 to 1382.6 | 1382.6 to 1799.09 | 1799.09 to 2215.57 | 2215.57 to 2632.06 | 2632.06 to 3048.54 | 3048.54 to 3465.03
Data (%) | 67.50 | 11.25 | 5.00 | 15.00 | 0.00 | 0.00 | 1.25
Isoelectric Point (pH)
Interval | 3.1 to 4.54 | 4.54 to 5.99 | 5.99 to 7.43 | 7.43 to 8.87 | 8.87 to 10.31 | 10.31 to 11.76 | 11.76 to 13.2
Data (%) | 3.75 | 42.5 | 1.25 | 3.75 | 21.25 | 10 | 17.5
GRAVY
Interval | −2.11 to −1.58 | −1.58 to −1.04 | −1.04 to −0.5 | −0.5 to 0.04 | 0.04 to 0.58 | 0.58 to 1.12 | 1.12 to 1.66
Data (%) | 8.75 | 10 | 16.25 | 25 | 23.75 | 15 | 1.25
Aliphatic index
Interval | 0 to 22.29 | 22.29 to 44.57 | 44.57 to 66.86 | 66.86 to 89.14 | 89.14 to 111.43 | 111.43 to 133.71 | 133.71 to 156
Data (%) | 50 | 5 | 16.25 | 12.5 | 8.75 | 3.75 | 3.75
Boman index
Interval | −2.18 to −1.35 | −1.35 to −0.51 | −0.51 to 0.32 | 0.32 to 1.16 | 1.16 to 2 | 2 to 2.83 | 2.83 to 3.67
Data (%) | 15 | 22.5 | 17.5 | 8.75 | 16.25 | 7.5 | 12.5
Table A15. Data on the physicochemical characteristics of Toxin peptides.
Molecular Weight (Da)
Interval | 716.9 to 1269.06 | 1269.06 to 1821.21 | 1821.21 to 2373.37 | 2373.37 to 2925.52 | 2925.52 to 3477.68 | 3477.68 to 4029.83 | 4029.83 to 4581.99 | 4581.99 to 5134.14 | 5134.14 to 5686.3
Data (%) | 7.6 | 11.6 | 4.8 | 12.4 | 15.2 | 22.4 | 16.8 | 5.2 | 4
Isoelectric Point (pH)
Interval | 2.71 to 3.73 | 3.73 to 4.75 | 4.75 to 5.76 | 5.76 to 6.78 | 6.78 to 7.8 | 7.8 to 8.82 | 8.82 to 9.83 | 9.83 to 10.85 | 10.85 to 11.87
Data (%) | 8 | 12.4 | 4.4 | 6 | 11.6 | 34 | 20 | 2.4 | 1.2
GRAVY
Interval | −2.22 to −1.81 | −1.81 to −1.41 | −1.41 to −1 | −1 to −0.6 | −0.6 to −0.19 | −0.19 to 0.21 | 0.21 to 0.62 | 0.62 to 1.02 | 1.02 to 1.43
Data (%) | 0.8 | 2 | 7.6 | 19.6 | 40.4 | 14.8 | 9.2 | 3.6 | 2
Aliphatic index
Interval | 0 to 17.78 | 17.78 to 35.56 | 35.56 to 53.33 | 53.33 to 71.11 | 71.11 to 88.89 | 88.89 to 106.67 | 106.67 to 124.44 | 124.44 to 142.22 | 142.22 to 160
Data (%) | 18 | 21.6 | 36.4 | 13.2 | 5.2 | 2.8 | 1.2 | 1.2 | 0.4
Boman index
Interval | −1.83 to −1.06 | −1.06 to −0.28 | −0.28 to 0.5 | 0.5 to 1.27 | 1.27 to 2.05 | 2.05 to 2.83 | 2.83 to 3.6 | 3.6 to 4.38 | 4.38 to 5.15
Data (%) | 2 | 2.8 | 8 | 20 | 31.2 | 23.6 | 8 | 3.6 | 0.8
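
The interval boundaries in these appendix tables are consistent with equal-width bins spanning each descriptor's observed minimum and maximum; for instance, the molecular-weight edges in Table A2 follow from dividing the range 581.71 to 1690.97 into five equal parts. The sketch below is a minimal Python illustration under that assumption and is not the DMPep implementation. The function names and the sample GRAVY values are hypothetical, and assigning boundary values to the upper bin is an assumption, since the tables do not state which side is inclusive.

```python
def bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def percent_per_bin(values, n_bins):
    """Share of values (in %) falling into each equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum is assigned to the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [100 * c / len(values) for c in counts]


# Edges derived from the molecular-weight range reported in Table A2.
edges = [round(e, 2) for e in bin_edges(581.71, 1690.97, 5)]

# Hypothetical GRAVY values, used only to exercise the function.
sample = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
shares = percent_per_bin(sample, 4)

print(edges)
print([round(s, 2) for s in shares])
```

Multiplying a reported percentage by the class size recovers the underlying count; for example, 42.86% of the 14 Anti-inflammatory peptides corresponds to 6 sequences.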

References

  1. Cournoyer, A.; Bernier, M.-E.; Aboubacar, H.; De Toro-Martín, J.; Vohl, M.-C.; Ravallec, R.; Cudennec, B.; Bazinet, L. Machine Learning-Driven Discovery of Bioactive Peptides from Duckweed (Lemnaceae) Protein Hydrolysates: Identification and Experimental Validation of 20 Novel Antihypertensive, Antidiabetic, and/or Antioxidant Peptides. Food Chem. 2025, 482, 144029. [Google Scholar] [CrossRef]
  2. Correas, N.H.; Martínez, A.R.; Abellán, A.; Sánchez, H.P.; Tejada, L. Curing Strategies and Bioactive Peptide Generation in Ham: In Vitro Digestion and in Silico Evaluation. Food Chem. 2025, 484, 144360. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, Z.; Wang, S.; Cao, R.; Hu, M. Review on Bioactive Peptides from Antarctic Krill: From Preparation to Structure-Activity Relationship and Tech-Functionality. Curr. Res. Food Sci. 2025, 10, 101093. [Google Scholar] [CrossRef] [PubMed]
  4. Purohit, K.; Pathak, R.; Hayes, E.; Sunna, A. Novel Bioactive Peptides from Ginger Rhizome: Integrating in Silico and in Vitro Analysis with Mechanistic Insights through Molecular Docking. Food Chem. 2025, 484, 144432. [Google Scholar] [CrossRef]
  5. Garmidolova, A.; Desseva, I.; Terziyska, M.; Pavlov, A. Food-Derived Bioactive Peptides-Methods for Purification and Analysis. BIO Web Conf. 2022, 45, 02001. [Google Scholar] [CrossRef]
  6. Terziyski, Z.; Terziyska, M.; Deseva, I.; Hadzhikoleva, S.; Krastanov, A.; Mihaylova, D.; Hadzhikolev, E. PepLab Platform: Database and Software Tools for Analysis of Food-Derived Bioactive Peptides. Appl. Sci. 2023, 13, 961. [Google Scholar] [CrossRef]
  7. Chen, L.; Hu, Z.; Rong, Y.; Lou, B. Deep2Pep: A Deep Learning Method in Multi-Label Classification of Bioactive Peptide. Comput. Biol. Chem. 2024, 109, 108021. [Google Scholar] [CrossRef]
  8. Kang, Y.; Peng, Y.; Zheng, D.; Zhang, H.; Yang, X. Multi-View Framework for Multi-Label Bioactive Peptide Classification Based on Multi-Modal Representation Learning. Appl. Soft Comput. 2025, 175, 113007. [Google Scholar] [CrossRef]
  9. Centurion, V.B.; Bizzotto, E.; Tonini, S.; Filannino, P.; Di Cagno, R.; Zampieri, G.; Campanaro, S. FEEDS, the Food wastE biopEptiDe claSsifier: From Microbial Genomes and Substrates to Biopeptides Function. Curr. Res. Biotechnol. 2024, 7, 100186. [Google Scholar] [CrossRef]
  10. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar] [CrossRef]
  11. Azodi, C.B.; Tang, J.; Shiu, S.-H. Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet. 2020, 36, 442–455. [Google Scholar] [CrossRef] [PubMed]
  12. Tonekaboni, S.; Joshi, S.; McCradden, M.D.; Goldenberg, A. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. arXiv 2019, arXiv:1905.05134. [Google Scholar] [CrossRef]
  13. Yang, P.Q.; Wilson, M.L. Explaining Personal and Public Pro-Environmental Behaviors. Sci 2023, 5, 6. [Google Scholar] [CrossRef]
  14. Dayan, H.; Khoury-Kassabri, M.; Pollak, Y. Sense of Coherence Is Associated with Functional Impairment in Individuals Diagnosed with ADHD. Sci 2025, 7, 60. [Google Scholar] [CrossRef]
  15. Yuan, Y.; Cao, K.; Gao, P.; Wang, Y.; An, W.; Dong, Y. Extracellular Vesicles and Bioactive Peptides for Regenerative Medicine in Cosmetology. Ageing Res. Rev. 2025, 107, 102712. [Google Scholar] [CrossRef]
  16. Terziyska, M.; Vladev, V.; Terziyski, Z.; Ilieva, I.; Bozhkov, S. Application of Peptide Nanostructures in the Food Industry. BIO Web Conf. 2025, 170, 01002. [Google Scholar] [CrossRef]
  17. Phyo, S.H.; Siddique, M.S.; Mushtaq, A.; Yiasmin, M.N.; Alahmad, K.; Khan, I.; Ghamry, M.; Zhao, W. Plant-Derived Peptides and Bioactive Compounds: Mechanisms of AGEs Formation, Detection, and Innovative Approaches for Prevention in Food Processing. Food Biosci. 2025, 69, 106818. [Google Scholar] [CrossRef]
  18. Terziyski, Z.; Terziyska, M.; Hadzhikoleva, S.; Desseva, I. A Software Tool for Data Mining of Physicochemical Properties of Peptides. BIO Web Conf. 2023, 58, 03007. [Google Scholar] [CrossRef]
  19. Sturges, H.A. The Choice of a Class Interval. J. Am. Stat. Assoc. 1926, 21, 65–66. [Google Scholar] [CrossRef]
  20. Fontela, E.; Gabus, A. The DEMATEL Observer; Battelle Geneva Research Center: Geneva, Switzerland, 1976. [Google Scholar]
  21. Arik, S.O.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. arXiv 2019, arXiv:1908.07442. [Google Scholar] [CrossRef]
  22. Manoharan, S.; Shuib, A.S.; Abdullah, N. Structural characteristics and antihypertensive effects of angiotensin-iconverting enzyme inhibitory peptides in the renin-angiotensin and kallikrein kinin systems. Afr. J. Tradit. Complement. Altern. Med. 2017, 14, 383–406. [Google Scholar] [CrossRef]
  23. Daskaya-Dikmen, C.; Yucetepe, A.; Karbancioglu-Guler, F.; Daskaya, H.; Ozcelik, B. Angiotensin-I-Converting Enzyme (ACE)-Inhibitory Peptides from Plants. Nutrients 2017, 9, 316. [Google Scholar] [CrossRef]
  24. Sitanggang, A.B.; Putri, J.E.; Palupi, N.S.; Hatzakis, E.; Syamsir, E.; Budijanto, S. Enzymatic Preparation of Bioactive Peptides Exhibiting ACE Inhibitory Activity from Soybean and Velvet Bean: A Systematic Review. Molecules 2021, 26, 3822. [Google Scholar] [CrossRef] [PubMed]
  25. Liu, W.-Y.; Zhang, J.-T.; Miyakawa, T.; Li, G.-M.; Gu, R.-Z.; Tanokura, M. Antioxidant Properties and Inhibition of Angiotensin-Converting Enzyme by Highly Active Peptides from Wheat Gluten. Sci. Rep. 2021, 11, 5206. [Google Scholar] [CrossRef] [PubMed]
  26. Rivera-Jiménez, J.; Berraquero-García, C.; Pérez-Gálvez, R.; García-Moreno, P.J.; Espejo-Carpio, F.J.; Guadix, A.; Guadix, E.M. Peptides and Protein Hydrolysates Exhibiting Anti-Inflammatory Activity: Sources, Structural Features and Modulation Mechanisms. Food Funct. 2022, 13, 12510–12540. [Google Scholar] [CrossRef] [PubMed]
  27. Zhao, L.; Wang, X.; Zhang, X.-L.; Xie, Q.-F. Purification and Identification of Anti-Inflammatory Peptides Derived from Simulated Gastrointestinal Digests of Velvet Antler Protein (Cervus elaphus Linnaeus). J. Food Drug Anal. 2016, 24, 376–384. [Google Scholar] [CrossRef]
  28. Gozes, I. NAP (Davunetide) Provides Functional and Structural Neuroprotection. Curr. Pharm. Des. 2011, 17, 1040–1044. [Google Scholar] [CrossRef]
  29. Banasiak-Cieślar, H.; Wiener, D.; Kuszczyk, M.; Dobrzyńska, K.; Polanowski, A. Proline-Rich Polypeptides (Colostrinin®/COLOCO®) Modulate BDNF Concentration in Blood Affecting Cognitive Function in Adults: A Double-Blind Randomized Placebo-Controlled Study. Food Sci. Nutr. 2023, 11, 1477–1485. [Google Scholar] [CrossRef]
  30. Janusz, M.; Zabłocka, A. Colostrinin: A Proline-Rich Polypeptide Complex of Potential Therapeutic Interest. Cell. Mol. Biol. Noisy—Gd. Fr. 2013, 59, 4–11. [Google Scholar]
  31. Wang, G.; Li, X.; Wang, Z. APD3: The Antimicrobial Peptide Database as a Tool for Research and Education. Nucleic Acids Res. 2016, 44, D1087–D1093. [Google Scholar] [CrossRef]
  32. Nguyen, L.T.; Haney, E.F.; Vogel, H.J. The Expanding Scope of Antimicrobial Peptide Structures and Their Modes of Action. Trends Biotechnol. 2011, 29, 464–472. [Google Scholar] [CrossRef] [PubMed]
  33. Fjell, C.D.; Hiss, J.A.; Hancock, R.E.W.; Schneider, G. Designing Antimicrobial Peptides: Form Follows Function. Nat. Rev. Drug Discov. 2012, 11, 37–51. [Google Scholar] [CrossRef]
  34. Li, J.; Koh, J.-J.; Liu, S.; Lakshminarayanan, R.; Verma, C.S.; Beuerman, R.W. Membrane Active Antimicrobial Peptides: Translating Mechanistic Insights to Design. Front. Neurosci. 2017, 11, 73. [Google Scholar] [CrossRef]
  35. Tyagi, A.; Tuknait, A.; Anand, P.; Gupta, S.; Sharma, M.; Mathur, D.; Joshi, A.; Singh, S.; Gautam, A.; Raghava, G.P.S. CancerPPD: A Database of Anticancer Peptides and Proteins. Nucleic Acids Res. 2015, 43, D837–D843. [Google Scholar] [CrossRef] [PubMed]
  36. Gaspar, D.; Veiga, A.S.; Castanho, M.A.R.B. From Antimicrobial to Anticancer Peptides. A Review. Front. Microbiol. 2013, 4, 294. [Google Scholar] [CrossRef]
  37. Papo, N.; Shai, Y. Host Defense Peptides as New Weapons in Cancer Treatment. CMLS Cell. Mol. Life Sci. 2005, 62, 784–790. [Google Scholar] [CrossRef] [PubMed]
  38. El-Sayed, M.; Awad, S. Milk Bioactive Peptides: Antioxidant, Antimicrobial and Anti-Diabetic Activities. Adv. Biochem. 2019, 7, 22. [Google Scholar] [CrossRef]
  39. Tavano, O.L.; Berenguer-Murcia, A.; Secundo, F.; Fernandez-Lafuente, R. Biotechnological Applications of Proteases in Food Technology. Compr. Rev. Food Sci. Food Saf. 2018, 17, 412–436. [Google Scholar] [CrossRef]
  40. Soltaninejad, H.; Zare-Zardini, H.; Ordooei, M.; Ghelmani, Y.; Ghadiri-Anari, A.; Mojahedi, S.; Hamidieh, A.A. Antimicrobial Peptides from Amphibian Innate Immune System as Potent Antidiabetic Agents: A Literature Review and Bioinformatics Analysis. J. Diabetes Res. 2021, 2021, 2894722. [Google Scholar] [CrossRef]
  41. Rivero-Pino, F.; Espejo-Carpio, F.J.; Guadix, E.M. Antidiabetic Food-Derived Peptides for Functional Feeding: Production, Functionality and In Vivo Evidences. Foods 2020, 9, 983. [Google Scholar] [CrossRef]
  42. Elam, E.; Feng, J.; Lv, Y.-M.; Ni, Z.-J.; Sun, P.; Thakur, K.; Zhang, J.-G.; Ma, Y.-L.; Wei, Z.-J. Recent advances on bioactive food-derived anti-diabetic hydrolysates and peptides from natural resources. J. Funct. Foods 2021, 86, 104674. [Google Scholar] [CrossRef]
  43. Fernández De Ullivarri, M.; Arbulu, S.; Garcia-Gutierrez, E.; Cotter, P.D. Antifungal Peptides as Therapeutic Agents. Front. Cell. Infect. Microbiol. 2020, 10, 105. [Google Scholar] [CrossRef] [PubMed]
  44. Van Der Weerden, N.L.; Bleackley, M.R.; Anderson, M.A. Properties and Mechanisms of Action of Naturally Occurring Antifungal Peptides. Cell. Mol. Life Sci. 2013, 70, 3545–3570. [Google Scholar] [CrossRef]
  45. De Lucca, A.J.; Walsh, T.J. Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens. Antimicrob. Agents Chemother. 1999, 43, 1–11. [Google Scholar] [CrossRef]
  46. Li, T.; Li, L.; Du, F.; Sun, L.; Shi, J.; Long, M.; Chen, Z. Activity and Mechanism of Action of Antifungal Peptides from Microorganisms: A Review. Molecules 2021, 26, 3438. [Google Scholar] [CrossRef]
  47. Freitas, C.G.; Felipe, M.S. Candida Albicans and Antifungal Peptides. Infect. Dis. Ther. 2023, 12, 2631–2648. [Google Scholar] [CrossRef]
  48. Mookherjee, N.; Anderson, M.A.; Haagsman, H.P.; Davidson, D.J. Antimicrobial Host Defence Peptides: Functions and Clinical Potential. Nat. Rev. Drug Discov. 2020, 19, 311–332. [Google Scholar] [CrossRef]
  49. Zhang, L.; Gallo, R.L. Antimicrobial Peptides. Curr. Biol. 2016, 26, R14–R19. [Google Scholar] [CrossRef]
  50. Mahlapuu, M.; Håkansson, J.; Ringstad, L.; Björn, C. Antimicrobial Peptides: An Emerging Category of Therapeutic Agents. Front. Cell. Infect. Microbiol. 2016, 6, 194. [Google Scholar] [CrossRef] [PubMed]
  51. Zasloff, M. Antimicrobial Peptides of Multicellular Organisms. Nature 2002, 415, 389–395. [Google Scholar] [CrossRef] [PubMed]
  52. Hancock, R.E.W.; Haney, E.F.; Gill, E.E. The Immunology of Host Defence Peptides: Beyond Antimicrobial Activity. Nat. Rev. Immunol. 2016, 16, 321–334. [Google Scholar] [CrossRef]
  53. Zou, T.-B.; He, T.-P.; Li, H.-B.; Tang, H.-W.; Xia, E.-Q. The Structure-Activity Relationship of the Antioxidant Peptides from Natural Proteins. Molecules 2016, 21, 72. [Google Scholar] [CrossRef]
  54. Nirmal, N.; Khanashyam, A.C.; Shah, K.; Awasti, N.; Sajith Babu, K.; Ucak, İ.; Afreen, M.; Hassoun, A.; Tuanthong, A. Plant Protein-Derived Peptides: Frontiers in Sustainable Food System and Applications. Front. Sustain. Food Syst. 2024, 8, 1292297. [Google Scholar] [CrossRef]
  55. Sila, A.; Bougatef, A. Antioxidant Peptides from Marine By-Products: Isolation, Identification and Application in Food Systems. A Review. J. Funct. Foods 2016, 21, 10–26. [Google Scholar] [CrossRef]
  56. You, L.; Zhao, M.; Regenstein, J.M.; Ren, J. Purification and Identification of Antioxidative Peptides from Loach (Misgurnus anguillicaudatus) Protein Hydrolysate by Consecutive Chromatography and Electrospray Ionization-Mass Spectrometry. Food Res. Int. 2010, 43, 1167–1173. [Google Scholar] [CrossRef]
  57. Ma, J.; Su, K.; Chen, M.; Wang, S. Study on the Antioxidant Activity of Peptides from Soybean Meal by Fermentation Based on the Chemical Method and AAPH-Induced Oxidative Stress. Food Sci. Nutr. 2023, 11, 6634–6647. [Google Scholar] [CrossRef]
  58. Udenigwe, C.C.; Aluko, R.E. Food Protein-Derived Bioactive Peptides: Production, Processing, and Potential Health Benefits. J. Food Sci. 2012, 77, R11–R24. [Google Scholar] [CrossRef]
  59. Guillen Schlippe, Y.V.; Hartman, M.C.T.; Josephson, K.; Szostak, J.W. In Vitro Selection of Highly Modified Cyclic Peptides That Act as Tight Binding Inhibitors. J. Am. Chem. Soc. 2012, 134, 10469–10477. [Google Scholar] [CrossRef] [PubMed]
  60. Balakrishnan, N.; Katkar, R.; Pham, P.V.; Downey, T.; Kashyap, P.; Anastasiu, D.C.; Ramasubramanian, A.K. Prospection of Peptide Inhibitors of Thrombin from Diverse Origins Using a Machine Learning Pipeline. Bioengineering 2023, 10, 1300. [Google Scholar] [CrossRef] [PubMed]
  61. Kretz, C.A.; Tomberg, K.; Van Esbroeck, A.; Yee, A.; Ginsburg, D. High Throughput Protease Profiling Comprehensively Defines Active Site Specificity for Thrombin and ADAMTS13. Sci. Rep. 2018, 8, 2788. [Google Scholar] [CrossRef] [PubMed]
  62. Agarwal, G.; Gabrani, R. Antiviral Peptides: Identification and Validation. Int. J. Pept. Res. Ther. 2021, 27, 149–168. [Google Scholar] [CrossRef]
  63. Vilas Boas, L.C.P.; Campos, M.L.; Berlanda, R.L.A.; De Carvalho Neves, N.; Franco, O.L. Antiviral Peptides as Promising Therapeutic Drugs. Cell. Mol. Life Sci. 2019, 76, 3525–3542. [Google Scholar] [CrossRef]
  64. Liu, Y.; Zhu, Y.; Sun, X.; Ma, T.; Lao, X.; Zheng, H. DRAVP: A Comprehensive Database of Antiviral Peptides and Proteins. Viruses 2023, 15, 820. [Google Scholar] [CrossRef]
  65. Chia, L.Y.; Kumar, P.V.; Maki, M.A.A.; Ravichandran, G.; Thilagar, S. A Review: The Antiviral Activity of Cyclic Peptides. Int. J. Pept. Res. Ther. 2022, 29, 7. [Google Scholar] [CrossRef] [PubMed]
  66. Jin, R.; Teng, X.; Shang, J.; Wang, D.; Liu, N. Identification of Novel DPP–IV Inhibitory Peptides from Atlantic Salmon (Salmo salar) Skin. Food Res. Int. 2020, 133, 109161. [Google Scholar] [CrossRef]
  67. Liu, R.; Cheng, J.; Wu, H. Discovery of Food-Derived Dipeptidyl Peptidase IV Inhibitory Peptides: A Review. Int. J. Mol. Sci. 2019, 20, 463. [Google Scholar] [CrossRef]
  68. Nongonierma, A.B.; Mooney, C.; Shields, D.C.; FitzGerald, R.J. In Silico Approaches to Predict the Potential of Milk Protein-Derived Peptides as Dipeptidyl Peptidase IV (DPP-IV) Inhibitors. Peptides 2014, 57, 43–51. [Google Scholar] [CrossRef] [PubMed]
  69. Mu, X.; Wang, R.; Cheng, C.; Ma, Y.; Zhang, Y.; Lu, W. Preparation, Structural Properties, and in Vitro and in Vivo Activities of Peptides against Dipeptidyl Peptidase IV (DPP-IV) and α-Glucosidase: A General Review. Crit. Rev. Food Sci. Nutr. 2024, 64, 9844–9858. [Google Scholar] [CrossRef]
  70. Charoenkwan, P.; Nantasenamat, C.; Hasan, M.M.; Moni, M.A.; Lio’, P.; Manavalan, B.; Shoombuatong, W. StackDPPIV: A Novel Computational Approach for Accurate Prediction of Dipeptidyl Peptidase IV (DPP-IV) Inhibitory Peptides. Methods 2022, 204, 189–198. [Google Scholar] [CrossRef]
  71. Hökfelt, T.; Broberger, C.; Xu, Z.-Q.D.; Sergeyev, V.; Ubink, R.; Diez, M. Neuropeptides—An Overview. Neuropharmacology 2000, 39, 1337–1356. [Google Scholar] [CrossRef]
  72. Li, C. Neuropeptides. WormBook 2008, 2008, 1–36. [Google Scholar] [CrossRef]
  73. DeLaney, K.; Buchberger, A.R.; Atkinson, L.; Gründer, S.; Mousley, A.; Li, L. New Techniques, Applications and Perspectives in Neuropeptide Research. J. Exp. Biol. 2018, 221, jeb151167. [Google Scholar] [CrossRef]
  74. Wang, Y.; Wang, M.; Yin, S.; Jang, R.; Wang, J.; Xue, Z.; Xu, T. NeuroPep: A Comprehensive Resource of Neuropeptides. Database 2015, 2015, bav038. [Google Scholar] [CrossRef]
  75. Girven, K.S.; Mangieri, L.; Bruchas, M.R. Emerging Approaches for Decoding Neuropeptide Transmission. Trends Neurosci. 2022, 45, 899–912. [Google Scholar] [CrossRef]
  76. Brownstein, M.J. A Brief History of Opiates, Opioid Peptides, and Opioid Receptors. Proc. Natl. Acad. Sci. USA 1993, 90, 5391–5393. [Google Scholar] [CrossRef]
  77. Fricker, L.D.; Margolis, E.B.; Gomes, I.; Devi, L.A. Five Decades of Research on Opioid Peptides: Current Knowledge and Unanswered Questions. Mol. Pharmacol. 2020, 98, 96–108. [Google Scholar] [CrossRef] [PubMed]
  78. Kaur, J.; Kumar, V.; Sharma, K.; Kaur, S.; Gat, Y.; Goyal, A.; Tanwar, B. Opioid Peptides: An Overview of Functional Significance. Int. J. Pept. Res. Ther. 2020, 26, 33–41. [Google Scholar] [CrossRef]
  79. Zioudrou, C.; Streaty, R.A.; Klee, W.A. Opioid Peptides Derived from Food Proteins. The Exorphins. J. Biol. Chem. 1979, 254, 2446–2449. [Google Scholar] [CrossRef]
  80. Possani, L.D.; Merino, E.; Corona, M.; Bolivar, F.; Becerril, B. Peptides and Genes Coding for Scorpion Toxins That Affect Ion-Channels. Biochimie 2000, 82, 861–868. [Google Scholar] [CrossRef]
  81. Sunagar, K.; Undheim, E.A.B.; Chan, A.H.C.; Koludarov, I.; Muñoz-Gómez, S.A.; Antunes, A.; Fry, B.G. Evolution Stings: The Origin and Diversification of Scorpion Toxin Peptide Scaffolds. Toxins 2013, 5, 2456–2487. [Google Scholar] [CrossRef] [PubMed]
  82. Moyes, D.L.; Wilson, D.; Richardson, J.P.; Mogavero, S.; Tang, S.X.; Wernecke, J.; Höfs, S.; Gratacap, R.L.; Robbins, J.; Runglall, M.; et al. Candidalysin Is a Fungal Peptide Toxin Critical for Mucosal Infection. Nature 2016, 532, 64–68. [Google Scholar] [CrossRef]
  83. Undheim, E.A.B.; Mobli, M.; King, G.F. Toxin Structures as Evolutionary Tools: Using Conserved 3D Folds to Study the Evolution of Rapidly Evolving Peptides. BioEssays News Rev. Mol. Cell. Dev. Biol. 2016, 38, 539–548. [Google Scholar] [CrossRef] [PubMed]
  84. Norton, R.S. Peptide Toxin Structure and Function by NMR. In Modern Magnetic Resonance; Webb, G.A., Ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–18. ISBN 978-3-319-28275-6. [Google Scholar]
  85. Fernández, A.; García, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  86. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  87. Gronning, A.G.B.; Kacprowski, T.; Scheele, C. MultiPep: A hierarchical deep learning approach for multi-label classification of peptide bioactivities. Biol. Methods Protoc. 2021, 6, bpab021. [Google Scholar] [CrossRef]
  88. LI, Y.; Li, X.; Liu, Y.; Yao, Y.; Huang, G. MPMABP: A CNN and Bi-LSTM-Based method for predicting multi-activities of bioactive peptides. Pharmaceuticals 2022, 15, 707. [Google Scholar] [CrossRef]
  89. Fan, H.; Yan, W.; Wang, L.; Liu, J.; Bin, Y.; Xia, J. Deep learning-based multi-functional therapeutic peptides prediction with a multi-label focal dice loss function. Bioinformatics 2023, 39, btad334. [Google Scholar] [CrossRef]
Figure 1. Amino acid composition of the 15 classes of peptides: (a) ACE inhibitory; (b) Anti-inflammatory; (c) Antiamnestic; (d) Antibacterial; (e) Anticancer; (f) Antidiabetic; (g) Antifungal; (h) Antimicrobial; (i) Antioxidative; (j) Antithrombotic; (k) Antiviral; (l) DPP-IV inhibitory; (m) Neuropeptide; (n) Opioid; (o) Toxin.
Figure 2. DEMATEL-Derived Causal Network of Physicochemical Descriptors.
Figure 3. DEMATEL-Based Heatmap of the Total Influence Matrix.
Figure 4. SEM diagram of the causal structure.
Figure 5. Architecture of the proposed SEM-NN model structure.
Figure 6. Training and validation MSE curves of the SEM-NN model.
Table 1. Distribution of Bioactive Peptides by Activity Class.

| Activity | Number of Peptides |
|---|---|
| ACE inhibitor | 524 |
| Antimicrobial | 490 |
| Antibacterial | 351 |
| Anticancer | 279 |
| Antioxidative | 261 |
| Toxin | 251 |
| Antidiabetic | 238 |
| Opioid | 79 |
| Neuropeptide | 71 |
| DPP-IV inhibitor | 55 |
| Antiamnestic | 47 |
| Antifungal | 39 |
| Antithrombotic | 35 |
| Anti-inflammatory | 14 |
| Antiviral | 14 |
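
The class imbalance visible in Table 1 can be quantified with a few lines of arithmetic. The following is a minimal sketch, not part of the authors' pipeline: it only re-uses the counts from Table 1, and the variable names (`class_counts`, `imbalance_ratio`) are chosen here for illustration.

```python
# Class counts copied from Table 1.
class_counts = {
    "ACE inhibitor": 524,
    "Antimicrobial": 490,
    "Antibacterial": 351,
    "Anticancer": 279,
    "Antioxidative": 261,
    "Toxin": 251,
    "Antidiabetic": 238,
    "Opioid": 79,
    "Neuropeptide": 71,
    "DPP-IV inhibitor": 55,
    "Antiamnestic": 47,
    "Antifungal": 39,
    "Antithrombotic": 35,
    "Anti-inflammatory": 14,
    "Antiviral": 14,
}

total = sum(class_counts.values())
largest = max(class_counts.values())
smallest = min(class_counts.values())

# Ratio between the largest and the smallest class (illustrative metric).
imbalance_ratio = largest / smallest

print(f"total peptides: {total}")
print(f"imbalance ratio (largest / smallest): {imbalance_ratio:.1f}")

# Share of each class in the full dataset, largest first.
for name, count in sorted(class_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {count:>4} {100 * count / total:5.1f}%")
```

The counts sum to the 2748 peptides stated in the abstract, and the largest class is roughly 37 times the size of the smallest, which motivates the targeted augmentation the authors apply.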
Table 2. Total Influence Matrix among Physicochemical Descriptors (DEMATEL).

| Descriptor | len | MW | pI | gravy | alipha | boman |
|---|---|---|---|---|---|---|
| len | 0.239726 | 0.095303 | −5.61042 | 0.094433 | −1.59476 | −0.76024 |
| MW | 0.095303 | −0.01738 | −4.95706 | 0.093243 | −1.35694 | −0.68756 |
| pI | −5.61042 | −4.95706 | 24.09702 | −1.20775 | 6.640577 | 3.26559 |
| gravy | 0.094433 | 0.093243 | −1.20775 | −0.52175 | −1.04366 | 0.408626 |
| alipha | −1.59476 | −1.35694 | 6.640577 | −1.04366 | 1.790399 | 1.718069 |
| boman | −0.76024 | −0.68756 | 3.26559 | 0.408626 | 1.718069 | 0.131324 |
Table 3. Descriptor Centrality (D + R) and Causality (D − R) According to DEMATEL Analysis.

| Descriptor | Causality (D − R) | Centrality (D + R) |
|---|---|---|
| pI | 0 | 44.45590436 |
| alipha | 1.77636 × 10^−15 | 12.30738337 |
| boman | 0 | 8.151630598 |
| gravy | −1.33227 × 10^−15 | −4.353722692 |
| MW | 1.77636 × 10^−15 | −13.6607655 |
| len | 8.88178 × 10^−16 | −15.07190701 |
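
In DEMATEL, D is the vector of row sums and R the vector of column sums of the total influence matrix; centrality is D + R and causality is D − R. The sketch below recomputes both from the values printed in Table 2, assuming Table 3 was derived from that matrix in this standard way. The variable names are illustrative and the code is not taken from the authors.

```python
# Total influence matrix copied from Table 2 (rows and columns in the same order).
labels = ["len", "MW", "pI", "gravy", "alipha", "boman"]
T = [
    [0.239726, 0.095303, -5.61042, 0.094433, -1.59476, -0.76024],
    [0.095303, -0.01738, -4.95706, 0.093243, -1.35694, -0.68756],
    [-5.61042, -4.95706, 24.09702, -1.20775, 6.640577, 3.26559],
    [0.094433, 0.093243, -1.20775, -0.52175, -1.04366, 0.408626],
    [-1.59476, -1.35694, 6.640577, -1.04366, 1.790399, 1.718069],
    [-0.76024, -0.68756, 3.26559, 0.408626, 1.718069, 0.131324],
]

# D: influence given (row sums); R: influence received (column sums).
D = [sum(row) for row in T]
R = [sum(col) for col in zip(*T)]

centrality = {name: d + r for name, d, r in zip(labels, D, R)}
causality = {name: d - r for name, d, r in zip(labels, D, R)}

for name in sorted(labels, key=lambda n: -centrality[n]):
    print(f"{name:<7} D+R = {centrality[name]:10.4f}   D-R = {causality[name]:.2e}")
```

Because the matrix in Table 2 is symmetric, each row sum equals the corresponding column sum, so D − R vanishes for every descriptor. This is consistent with the causality values in Table 3, which are zero or of the order of floating-point rounding error, while the recomputed D + R values agree with the reported centralities to about four decimal places.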
Table 4. Model Fit Indices.

| Index | Value |
|---|---|
| Chi-square | 854.171 |
| Degrees of freedom | 7 |
| CFI | 0.957 |
| TLI | 0.876 |
| RMSEA | 0.028 (90% CI: 0.020–0.037) |
| SRMR | 0.041 |
| AIC | 11,259.23 |
| BIC | 11,269.69 |
| SABIC | 11,264.86 |
Table 5. Estimated Parameters for Causal Paths in the SEM Framework.

| Endogenous Variable | Exogenous Variable | Estimate | Z-Value | p-Value |
|---|---|---|---|---|
| MW | len | 109.222 | 382.213 | <0.001 |
| pI | MW | 0.023 | 8.999 | <0.001 |
| boman | gravy | −1.643 | −56.127 | <0.001 |
| alipha | boman | 0.883 | 39.348 | <0.001 |
| alipha | gravy | 0.167 | 7.333 | <0.001 |
| IC50 | alipha | −9.647 | −25.744 | <0.001 |
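
In a linear path model, the effect transmitted along a chain of paths is the product of the coefficients on that chain, and the total effect is the sum over all chains. The sketch below applies this rule to the estimates in Table 5. It is an illustration only: the derived effects are not reported by the authors, the source does not say whether the estimates are standardized, and the SEM-NN stage models nonlinear relations that this linear arithmetic does not capture. The helper `path_product` is a hypothetical name introduced here.

```python
# Path estimates copied from Table 5, keyed as (exogenous, endogenous).
paths = {
    ("len", "MW"): 109.222,
    ("MW", "pI"): 0.023,
    ("gravy", "boman"): -1.643,
    ("boman", "alipha"): 0.883,
    ("gravy", "alipha"): 0.167,
    ("alipha", "IC50"): -9.647,
}


def path_product(*nodes):
    """Multiply the coefficients along a chain of nodes, e.g. a -> b -> c."""
    effect = 1.0
    for src, dst in zip(nodes, nodes[1:]):
        effect *= paths[(src, dst)]
    return effect


# gravy influences alipha directly and indirectly through boman.
direct = path_product("gravy", "alipha")
indirect = path_product("gravy", "boman", "alipha")
total_gravy_alipha = direct + indirect

# IC50 is reached only through alipha in Table 5.
total_gravy_ic50 = total_gravy_alipha * paths[("alipha", "IC50")]

print(f"direct gravy -> alipha:            {direct:.3f}")
print(f"indirect gravy -> boman -> alipha: {indirect:.3f}")
print(f"total gravy -> alipha:             {total_gravy_alipha:.3f}")
print(f"total gravy -> IC50:               {total_gravy_ic50:.3f}")
```

Under these assumptions the indirect path through the Boman index has the opposite sign to the direct path and is larger in magnitude, so the net linear effect of GRAVY on the aliphatic index is negative even though the direct coefficient is positive.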
Table 6. Classification Performance of ML Models on Augmented Peptide Datasets.

| Model | Dataset | Accuracy | F1 score | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | Original data | 0.5386 ± 0.014 | 0.5233 ± 0.04 | 0.5100 ± 0.076 | 0.5200 ± 0.03 |
| Random Forest | 4655 entries | 0.6237 ± 0.016 | 0.6203 ± 0.0033 | 0.6300 ± 0.06 | 0.6200 ± 0.05 |
| Random Forest | 5255 entries | 0.6700 ± 0.02 | 0.6400 ± 0.003 | 0.6800 ± 0.08 | 0.6300 ± 0.024 |
| XGBoost | Original data | 0.5027 ± 0.02 | 0.4928 ± 0.05 | 0.4800 ± 0.07 | 0.4900 ± 0.039 |
| XGBoost | 4655 entries | 0.6199 ± 0.021 | 0.6154 ± 0.0042 | 0.6200 ± 0.07 | 0.6100 ± 0.09 |
| XGBoost | 5255 entries | 0.6800 ± 0.023 | 0.6500 ± 0.002 | 0.6900 ± 0.061 | 0.6400 ± 0.008 |
| LightGBM | Original data | 0.5189 ± 0.019 | 0.5021 ± 0.037 | 0.4500 ± 0.058 | 0.5200 ± 0.03 |
| LightGBM | 4655 entries | 0.6208 ± 0.09 | 0.6218 ± 0.018 | 0.6200 ± 0.043 | 0.6200 ± 0.05 |
| LightGBM | 5255 entries | 0.6600 ± 0.006 | 0.6623 ± 0.007 | 0.6600 ± 0.0057 | 0.6600 ± 0.0063 |
| MLP | Original data | 0.4829 ± 0.0078 | 0.4552 ± 0.01 | 0.4200 ± 0.018 | 0.5200 ± 0.013 |
| MLP | 4655 entries | 0.5707 ± 0.032 | 0.5625 ± 0.027 | 0.5800 ± 0.026 | 0.5700 ± 0.017 |
| MLP | 5255 entries | 0.6609 ± 0.044 | 0.6608 ± 0.05 | 0.6600 ± 0.047 | 0.6600 ± 0.053 |
| TabNet | Original data | 0.4367 ± 0.0051 | 0.4088 ± 0.01 | 0.3900 ± 0.005 | 0.4400 ± 0.016 |
| TabNet | 4655 entries | 0.5443 ± 0.006 | 0.5310 ± 0.031 | 0.5500 ± 0.023 | 0.5400 ± 0.02 |
| TabNet | 5255 entries | 0.5869 ± 0.008 | 0.5720 ± 0.06 | 0.5900 ± 0.0063 | 0.5700 ± 0.0047 |
Table 7. Comparative overview of PepLab and existing peptide databases.

| Database | Scope of Peptides | Key Features | Limitations |
|---|---|---|---|
| APD3 | Antimicrobial peptides | Manually curated; sequence data | Limited to antimicrobial activity |
| CAMP | Antimicrobial peptides | Multiple prediction tools | Restricted scope; less focus on physicochemical descriptors |
| DBAASP | Antimicrobial peptides | Structural data; activity assays | Focused mainly on antimicrobial function |
| BIOPEP-UWM | Broad spectrum of bioactive peptides and proteins | Functional activities; enzymatic release | Narrow application in food/functional peptides |
| PepLab | Broad spectrum of bioactive peptides | Integrated database + physicochemical descriptors + statistical tool | Currently limited dataset size; ongoing manual curation |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
