4. Discussion
Antimicrobial peptides act against bacteria, fungi, and viruses through key properties such as cationic charge, hydrophobicity, secondary structure (α-helix, β-sheet), and amphipathicity, which facilitate membrane interaction and microbial lysis. For instance, positively charged peptides bind to anionic phospholipids on bacterial membranes, while hydrophobicity aids in bilayer penetration [
24]. To explore the link between structure and activity, experimental and structural data were compiled into a structured database. The APD3 database [
18], established in 2004, was a pioneer in collecting natural antimicrobial peptides, followed by others like CAMPR3 [
25] and DBAASP [
17], reflecting ongoing advances. This study incorporates manually curated information, including new data such as measurement type, quantitative specificity, and taxonomic details of target microorganisms. Additionally, peptides were structurally represented using molecular descriptors—numerical values derived from physicochemical and structural properties—calculated from side-chain characteristics [
26] using tools such as modlAMP and AAindex, enabling precise encoding of peptide composition and shape.
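As a rough illustration of how a sequence is turned into numerical descriptors, the following minimal sketch computes three simple values from side-chain properties. It is not the descriptor set used in this study: the Kyte-Doolittle hydropathy scale, the simplified charge rule (K/R counted as +1, D/E as -1, histidine and termini ignored), and the function names are all assumptions made for the example.

```python
# Illustrative only: a few sequence-derived descriptors in plain Python.
# The scale and the charge rule below are assumptions, not the study's settings.

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq):
    """Crude net charge: +1 per K/R, -1 per D/E (His and termini ignored)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def describe(seq):
    """Return a small descriptor dictionary for one peptide sequence."""
    seq = seq.upper()
    charge = net_charge(seq)
    return {
        "length": len(seq),
        "net_charge": charge,
        # Charge per residue; a hypothetical stand-in for charge density.
        "charge_per_residue": charge / len(seq),
        "mean_hydropathy": mean_hydropathy(seq),
    }


if __name__ == "__main__":
    print(describe("KLAKLAKKLAKLAK"))
```

Each peptide then becomes one row of numbers, which is the form the machine learning models require as input.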
Machine learning models make it possible to establish relationships between input data and a specific dependent variable. This study used such models to analyze the relationship between the chemical structure of peptides (based on flanking residues) and their antimicrobial activity. The independent variables were the molecular descriptors, and the dependent variable was antimicrobial activity, which is difficult to define because evaluation methods vary (
Figure 1). Therefore, the minimum inhibitory concentration (MIC) was selected as the dependent variable, as it is considered the standard for measuring the antimicrobial activity of peptides [
17,
18,
27]. However, there are microbiological experimental factors not considered in this study, such as incubation times and culture media, under the premise that QSAR can still be established despite these variations [
28]. Because the same peptide may be active against some microorganisms but not others, and machine learning classification models require well-defined classes, a criterion was needed to label each peptide as antimicrobial or non-antimicrobial. Peptides were therefore classified as AMPs when their MIC was < 25 µg/mL against more than 50% of the microorganisms evaluated, and as NON-AMPs when their MIC was > 100 µg/mL against 50% of the microorganisms evaluated [
20]. In contrast, other studies distinguish only between antimicrobial and non-antimicrobial peptides in databases without explaining the criteria used [29, 30, 31, 32]; either the basis is left unclear, or a peptide is assumed to be antimicrobial if it is active against at least one microorganism, whatever it may be. To characterize each peptide, 318 molecular descriptors were calculated, although not all were useful: some were redundant (highly correlated) or had low variance, which can reduce the efficiency and accuracy of the models [
33]. Therefore, these variables were discarded using recursive feature elimination (RFE) and genetic algorithms, resulting in a different subset of predictor variables for each organism (
Table 1). The domain of applicability (DA) of the QSAR model refers to the chemical space defined by the molecular descriptors and allows the reliability of predictions to be assessed [
34,
35]. In this study, the k-nearest neighbors (KNN) method was used, which considers the variance and covariance of the dataset to ensure that predictions are made on chemically similar molecules [
35,
36]. Some studies do not consider DA, which limits the implementation of predictive models [
20,
29,
37]. Although there is no consensus on the best methodology to determine it, distance-based techniques are the most commonly used [
35]. For example, Tian et al. used Euclidean distances for HIV-1 antiviral peptides [
38], and Pinacho-Castellanos et al. employed a consensus of five methods in AMBIT Discovery for antimicrobial peptides [
31]. A prediction that falls outside the DA is not thereby invalidated, but its reliability is lower, so it should be treated with greater caution.
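The AMP / NON-AMP labeling criterion described earlier in this paragraph can be sketched as a small function. This is a hedged illustration rather than the code used in the study: the function name and parameters are hypothetical, and the NON-AMP condition is interpreted here as "MIC > 100 µg/mL against at least 50% of the organisms tested", which is an assumption about the wording. Peptides that satisfy neither condition are left unlabeled.

```python
# Illustrative labeling rule based on per-organism MIC values (in µg/mL).
# Thresholds follow the text; the ">= 50%" reading for NON-AMP is an assumption.

def label_peptide(mics_ug_ml, active_below=25.0, inactive_above=100.0):
    """Return "AMP", "NON-AMP", or None for one peptide.

    mics_ug_ml: list of MIC values, one per microorganism evaluated.
    """
    if not mics_ug_ml:
        return None  # no measurements, no label

    n = len(mics_ug_ml)
    frac_active = sum(m < active_below for m in mics_ug_ml) / n
    frac_inactive = sum(m > inactive_above for m in mics_ug_ml) / n

    if frac_active > 0.5:      # active against more than half of the organisms
        return "AMP"
    if frac_inactive >= 0.5:   # inactive against at least half (assumed reading)
        return "NON-AMP"
    return None                # intermediate activity: excluded from both classes


if __name__ == "__main__":
    print(label_peptide([4, 8, 16, 200]))
    print(label_peptide([128, 256, 10, 300]))
    print(label_peptide([10, 50, 50, 200]))
```

Leaving the intermediate range (25 to 100 µg/mL) unlabeled keeps the two classes well separated, which is what a classification model needs.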
In this study, 28 regression and 28 classification models were generated based on four algorithms (
Tables S1 and S2), trained with physicochemical, structural, hydrophobic, and compositional descriptors calculated from the amino acid sequence. Random Forest (RF)-based models were the best performers in both regression and classification. However, the regression models showed limited performance on the test set (R² = 0.33–0.57), with better results (in some cases) when training was restricted to specific bacterial groups (
Table 2). Wang et al. [
39] built multiple regression models for three classes of antimicrobial peptides, achieving R² values of 0.326, 0.589 and 0.663 using 89 descriptors. Their peptides were homogeneous in length (9 or 12 residues), unlike this work, which considered peptides of any length. On the other hand, Avram et al. [
40] used only eight descriptors for 37 mastoparan-derived peptides, obtaining R² between 0.655 and 0.720, possibly due to the high similarity between the analyzed peptides. In contrast, this study is the first to apply regression models to a large and diverse set of peptides, which increases the complexity of the analysis but makes it possible to address greater structural and biological variability.
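For readers comparing the R² values above, the following minimal sketch shows one common definition, the coefficient of determination, computed on held-out data. Whether the cited studies and this one report this quantity or the squared Pearson correlation is not stated in the text, so the choice here is an assumption, and the function name is hypothetical.

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
# Assumed definition; some papers report squared Pearson correlation instead.

def r_squared(y_true, y_pred):
    """Return R^2 for paired lists of observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variance
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed_log_mic = [1.0, 2.0, 3.0, 4.0]    # made-up values for illustration
    predicted_log_mic = [1.1, 1.9, 3.2, 3.8]
    print(r_squared(observed_log_mic, predicted_log_mic))
```

Under this definition, a model that always predicts the mean of the observed values scores 0, so a test-set value near 0.5 means roughly half of the variance in logMIC is explained.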
The classification models showed good performance (ACC ≥ 0.831 and MCC ≥ 0.662) and also improved when the data were restricted by bacterial group (
Table 2). Pinacho-Castellanos et al. [
21] developed five RF models with 96,026 descriptors, achieving an ACC of 0.90 with 135 predictors. This study obtained an ACC of 0.831 using only 26 predictors, facilitating the interpretation of the structure–activity relationship. Vishnepolsky et al. [
20] used the DBSCAN algorithm and nine redesigned molecular descriptors to differentiate peptides against Gram-negative bacteria, achieving ACC = 0.80 ± 0.02, slightly lower than that of this study. Here, optimizing and selecting from a larger descriptor set allowed a more precise characterization of the peptides and a search for relationships with their antimicrobial activity. Dong et al. [
41] used reduced amino acid-based (RAAC) descriptors to classify peptides according to their target microorganism, achieving high ACCs for parasites, viruses, and cancer (89.72–91.92), but lower performance for fungi and Gram-positive and -negative bacteria (74.73–77.92). In comparison, the models in the present study showed superior performance, indicating that compositional descriptors may not be sufficient to classify antimicrobial peptides against bacteria due to their higher complexity and variability.
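The two classification metrics compared above, accuracy (ACC) and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below uses the standard formulas; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
import math

# ACC and MCC from a binary confusion matrix (AMP = positive class).

def acc_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


if __name__ == "__main__":
    # Made-up counts for illustration only.
    acc, mcc = acc_and_mcc(tp=45, tn=40, fp=10, fn=5)
    print(f"ACC = {acc:.3f}, MCC = {mcc:.3f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it stays informative when the AMP and NON-AMP classes are unbalanced. This is a reason to report both.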
After model generation, the relative importance of each descriptor, both individually and by type, was assessed in the classification and regression models. In the classification models, all descriptors contributed to overall performance; however, those related to physicochemical properties were, on average, the most relevant. Similarly, in the regression models, physicochemical descriptors showed the highest relative importance, although certain descriptors related to hydrophobicity, alpha-structure propensity, and composition were also identified as relevant. Among the physicochemical descriptors, easy-to-interpret variables such as molecular weight (MW), isoelectric point (pI), and peptide charge (net charge and charge density) stood out. In particular, descriptors associated with peptide charge were the most significant in both types of models. The importance of net charge, charge density, charge at acidic pH, and isoelectric point suggests that a positive charge is a common requirement in most antimicrobial peptides. This finding is consistent with one of the main proposed mechanisms of action of these peptides, based on electrostatic interactions between the cationic charges of the peptide and the anionic charges of the bacterial membrane [
42]. Furthermore, descriptors such as isotropic surface area and ISAECI index (related to steric effects and the ability to form local dipoles) indicate that antimicrobial peptides are characterized by their tendency to have bulky side chains with greater steric effects. This can be associated with the structural stability required for their activity. It was also identified that a low molecular weight is favorable, since it can allow peptides to cross barriers such as the cell wall and reach the bacterial membrane, facilitating greater interaction with it [
43]. Hydrophobicity descriptors were also relevant, although to a lesser extent. These quantify the frequency and orientation of polar and non-polar residues and are essential for the insertion of peptides into the lipid bilayer, as well as for the formation of transmembrane pores that destabilize the electrochemical gradient and lead to bacterial death [
44]. Interestingly, we observed that antimicrobial peptides tend to be less hydrophobic than those without antimicrobial activity. This could be explained by the need for antimicrobial peptides to also contain polar residues, which could facilitate specific interactions with components of the bacterial membrane or favor a more controlled permeabilization, without compromising selectivity or inducing non-specific aggregation. Descriptors assessing the propensity to form secondary structures (alpha helices and turns) were also well rated by the models. For example, levitt-alpha and QIAN880129 estimate the probability of alpha-helix formation, while CHOP780212 assesses the probability of absence of these structures. The models show that antimicrobial peptides tend to incorporate amino acids that favor these conformations, which is consistent with their mechanism of action [
43]. In this sense, it is recommended to favor the incorporation of amino acids such as alanine, glutamic acid, leucine and methionine, and to avoid glycine, tyrosine, serine and proline, since the latter discourage the formation of these structures. In contrast, beta-sheet-related descriptors were considered in only five classification models and two regression models, suggesting a limited contribution to predicting antimicrobial activity. This is possibly because few AMPs exist with this type of conformation, although they could also form functional amphipathic structures. Finally, the frequency of occurrence of certain amino acids, such as serine (S) and lysine (K), the latter with a positive charge, was evaluated, being more relevant in the regression models [
45]. Despite its lower relative importance in this study, lysine content may play a key role in antimicrobial activity and should be considered in the design and evaluation of new peptides.
To assess model performance, we predicted the antimicrobial activity (logMIC) of peptides not included in the original QSAR training set. Using the general regression model trained on the full dataset, the correlation with experimental logMIC values was low (R² = 0.459), likely due to the dataset’s heterogeneity, which includes diverse microorganisms. To improve generalization, models were then generated for specific bacterial groups. The Gram-negative model showed a higher correlation (R² = 0.476), possibly due to greater structural homogeneity and a larger subset size. Conversely, the Gram-positive model showed reduced performance (R² = 0.339), possibly due to insufficient peptide data and higher bacterial diversity. Further refinement by bacterial genera improved predictions: Escherichia (R² = 0.547) and Bacillus (R² = 0.574) models showed higher accuracy, suggesting that intra-genus homogeneity and data quantity enhance prediction. In contrast, Pseudomonas (R² = 0.415) and Staphylococcus (R² = 0.360) models did not show improvement, likely due to high diversity and limited data. Peptides are highly heterogeneous molecules, which limits the development of species-specific QSAR models due to data scarcity [
40]. Although similarity clustering can enhance predictability, general classification models still showed strong performance (MCC = 0.662). The Random Forest-based model effectively predicts antimicrobial activity using features like physicochemical properties, hydrophobicity, and secondary structure propensity. Additionally, specific models targeting Gram-positive (MCC = 0.708), Gram-negative (MCC = 0.675), and the genera
Escherichia (MCC = 0.754), Staphylococcus (MCC = 0.701), Bacillus (MCC = 0.746), and Pseudomonas (MCC = 0.755) achieved even better results. These improvements in Matthews correlation and accuracy (see
Table 6) support the strategy of restricting datasets to enhance model performance.
In the final part of this work, approximately one million peptides were designed, and their antimicrobial activity was predicted using classification and regression models. From these predictions, the 10 peptides with the greatest antimicrobial potential were selected. The structures of these peptides were modeled, revealing the presence of total or partial alpha-helix secondary structures in most of them. This suggests that the presence of secondary structures, especially alpha helices—a feature linked to membrane disruption mechanisms—is a common trait among high-activity AMPs, aligning with key descriptors identified in our QSAR analysis (e.g., propensity for helical folding). To advance these in silico findings toward therapeutic applications, future work should focus on experimental validation and stability optimization, such as using stability-guided design strategies similar to those employed for oncolytic peptides like LTX-315 [
46], or mirror-image phage display techniques to enhance proteolytic resistance [
47]. Furthermore, targeted delivery systems [
48] could be explored to improve the bioavailability and tissue specificity of these promising candidates. Future work includes generating experimental MIC data for
Enterobacteriaceae to correlate these values with specific molecular descriptors and develop more accurate, group-specific QSAR models. This strategy may be extended to other bacterial groups to strengthen predictive capabilities. Although the tool was validated using independent test sets, further experimental validation is planned to enhance its applicability.