Video-Driven Artificial Intelligence for Predictive Modelling of Antimicrobial Peptide Generation: Literature Review on Advances and Challenges

Yan, Jielu; Chen, Zhengli; Cai, Jianxiu; Xian, Weizhi; Wei, Xuekai; Qin, Yi; Li, Yifan

doi:10.3390/app15137363

by Jielu Yan ¹, Zhengli Chen ¹, Jianxiu Cai ², Weizhi Xian ³,*, Xuekai Wei ¹,*, Yi Qin ¹ and Yifan Li ¹

¹ School of Computer Science, Chongqing University, Chongqing 400044, China

² Faculty of Applied Sciences, Macao Polytechnic University, Rua de Luís Gonzaga Gomes, Macau SAR, China

³ Chongqing Research Institute of Harbin Institute of Technology, Harbin Institute of Technology, Chongqing 401151, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(13), 7363; https://doi.org/10.3390/app15137363

Submission received: 25 May 2025 / Revised: 20 June 2025 / Accepted: 24 June 2025 / Published: 30 June 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This review examines how video-based methodologies and advanced computer vision algorithms can facilitate the development of antimicrobial peptide (AMP) generation models, help elucidate structural and functional patterns, and enhance the generative power of in silico pipelines. AMPs have drawn significant interest as promising therapeutic agents because of their broad-spectrum efficacy, low resistance profile, and membrane-disrupting mechanisms. However, traditional discovery methods are hindered by high costs, lengthy synthesis processes, and difficulty in accessing the extensive chemical space involved in AMP research. Recent advances in artificial intelligence, especially machine learning (ML), deep learning (DL), and pattern recognition, offer game-changing opportunities to accelerate AMP design and validation. By integrating video analysis with computational modelling, researchers can visualise and quantify AMP–microbe interactions at unprecedented levels of detail, thereby informing both experimental design and the refinement of predictive algorithms. This review provides a comprehensive overview of these emerging techniques, highlights major breakthroughs, addresses critical challenges, and ultimately emphasises the powerful synergy between video-driven pattern recognition, AI-based modelling, and experimental validation in the pursuit of next-generation antimicrobial strategies.

Keywords: antimicrobial peptides; generation model; deep learning; machine learning

1. Introduction

Antibiotic residues and drug resistance genes spread through wastewater, soil, and food chains, posing a long-term threat to the environment and human health [1]. Antibiotic resistance also causes considerable losses to the global economy. According to a World Bank study, if effective measures are not taken, antibiotic resistance may cause a 1.1–3.8% decline in global GDP by 2050, equivalent to trillions of USD in economic losses [2]. Antibiotic resistance may also increase social inequality: in resource-scarce areas, the cost of treating drug-resistant infections is relatively high and medical resources are limited, so poorer individuals face greater health risks [3]. Antimicrobial peptides (AMPs), by contrast, have broad application prospects in medicine, agriculture, and the food industry [4]. In medicine, AMPs can be used to treat drug-resistant bacterial infections, prevent surgical infections, and develop new antimicrobial drugs [5]. In agriculture, they can replace antibiotics as animal feed additives and thereby reduce agricultural antibiotic use [6]. In the food industry, they can be used to preserve food and extend its shelf-life [7].

The use of antibiotics, a transformative medical breakthrough of the 20th century, has dramatically reduced mortality rates from infectious diseases [8,9]. However, their widespread overuse has precipitated a global crisis of antibiotic resistance, identified as one of the foremost threats to public health and development [10,11,12,13]. According to the World Health Organisation (WHO), the misuse of antibiotics worldwide has increased substantially over the past few decades, with approximately half of all antibiotic use deemed unnecessary or inappropriate [14,15]. In some regions, notably India, minimal regulatory oversight on antibiotic sales has exacerbated drug resistance [16,17], whereas in the United States, nearly 30% of prescriptions are superfluous [18]. The rapid increase in antibiotic resistance significantly complicates the management of infectious diseases, endangers patient outcomes, and intensifies the economic burden of healthcare systems [19,20]. Resistant bacteria not only prolong hospital stays and increase treatment costs but also increase the risk of therapeutic failure. The environmental impact is equally profound; antibiotic residues and resistance genes spread through wastewater, soil, and food chains, undermining ecosystems and posing a long-term hazard to human health [1]. The World Bank projects that, without timely intervention, the global gross domestic product (GDP) could decline by 1.1%–3.8% by 2050 due to antibiotic resistance, translating into trillions of USD in economic losses [2]. Moreover, resource-scarce regions bear a heavier burden, as higher treatment costs for resistant infections and limited healthcare resources worsen socioeconomic disparities [3].

In response, AMPs have gained prominence as viable alternatives to conventional antibiotics. Notable for their broad-spectrum activity and reduced propensity for resistance, AMPs exhibit substantial potential in medical, agricultural, and food industry applications [4,5]. In the medical realm, AMPs hold promise for treating drug-resistant infections, preventing surgical complications, and fostering novel therapeutic approaches. In agriculture, they can serve as safer feed additives to reduce antibiotic dependence [6], whereas in the food sector, their bactericidal properties can extend product shelf-life [7]. In the field of AMP generation, researchers have widely adopted powerful deep learning architectures that were originally developed for computer vision and video processing, such as convolutional neural networks (CNNs), diffusion models, transformers, variational autoencoders (VAEs), and generative adversarial networks (GANs) [21,22,23,24,25,26,27]. These architectures have proven highly effective at capturing complex patterns and generating high-quality outputs. By leveraging their strengths, researchers can build robust deep generative networks for AMP generation [28,29,30,31,32,33], explore the vast search space of peptide sequences, and identify promising candidates with antimicrobial properties. Crucially, ongoing research integrates state-of-the-art pattern recognition methods, including video-based analysis, to accelerate AMP screening and functional verification, in line with the goal of harnessing advanced video processing models to address real-world challenges. By leveraging machine learning, computer vision, and other AI-driven techniques, researchers can more accurately characterise the dynamic interactions between AMPs and target pathogens, thereby expediting both fundamental discoveries and translational applications. Consequently, we systematically evaluate video-based computational strategies, particularly advanced computer vision architectures, for their capacity to advance AMP generative modelling, elucidate hidden sequence–function relationships, and enhance the robustness and generative power of in silico frameworks.

With the abuse of antibiotics and the non-specific killing action of traditional antibiotics, many bacteria, viruses, parasites, and tumours have developed multidrug resistance, which seriously threatens human life and health [34]. Frequent epidemics have also contributed to the rise in antibiotic resistance. For example, since the large-scale outbreak of coronavirus disease 2019 (COVID-19), the number of excess deaths caused by the disease worldwide is estimated to have exceeded 5.42 million [35]. China (including mainland China, Hong Kong, Macao, and Taiwan) has recorded a total of 18.22 million confirmed cases and more than 30,000 deaths [36]. Antibiotic use during this period was unavoidable, which increased antibiotic resistance during the COVID-19 pandemic [37]. Today, antibiotic resistance is a public health issue worldwide [38].

AMPs generally refer to amino acid sequences shorter than 200 residues, such as “SDKEVDEVDAALSDLEITLE”, where each letter represents an amino acid [39]. AMPs have broad-spectrum antimicrobial activity and can quickly kill various targets, including bacteria, viruses, fungi, parasites, and tumour cells [40]. Many of them are naturally occurring peptides, which makes them promising therapeutic candidates [41]. Traditional biological experimental methods for screening new AMPs from large pools of candidate peptides are expensive, time-consuming, and labour-intensive [42]. In silico methods are therefore needed: machine learning or deep learning models first shortlist the most likely candidates, and computational and biological experiments then verify them, which narrows the candidate pool and improves the efficiency of screening new drugs.

In recent years, with the rapid development of artificial intelligence and bioinformatics technology, research on antimicrobial peptide generation models has made significant progress [43]. For example, deep learning-based recurrent neural networks (RNNs) [44,45,46], generative adversarial networks (GANs) [47,48,49], variational autoencoders (VAEs) [50,51,52,53], and diffusion models [54,55] have been widely used in the generation and optimisation of antimicrobial peptide sequences. These models are able to learn features from many known antimicrobial peptide sequences and generate new sequences with potential antimicrobial activity. In addition, researchers have used many existing machine learning- or deep learning-based methods to evaluate the antimicrobial activity and toxicity of generated sequences [56]. For example, classifiers are used to distinguish whether a given peptide chain is an antimicrobial peptide [57,58,59,60]; regression models are used to predict the minimum inhibitory concentration of an antimicrobial peptide against a certain target (the higher the minimum inhibitory concentration, the weaker the antimicrobial peptide is against the corresponding target, and vice versa) [61,62,63]; and machine learning- or deep learning-based classifiers are used to predict the toxicity of candidate peptides by integrating amino acid sequence and structural information [64,65,66] to quickly screen out candidate sequences with high potential. Although significant progress has been made in the research of antimicrobial peptide generation models, several challenges remain, including the scarcity of high-quality real data of antimicrobial peptides, the integration of multilabel features, and the verification of the effectiveness of the generation model. 
The first challenge, data scarcity, means that high-quality experimental data on AMPs are limited, which restricts the training and verification of models. The second, the selection and integration of multilabel features, concerns how to choose specific features (such as length, charge, and hydrophobicity) and how to combine them to optimise the input data of the generation model and improve the diversity and specificity of the generated sequences. The third, validation, serves not only to verify the generative model but also to select and optimise it, and involves three aspects. The first aspect is experimental validation and optimisation: the activity and specificity of the generated AMPs are verified through computer simulation experiments, an appropriate model architecture is selected, and the model parameters are further optimised. The second is generalisation: existing models may overfit when generating new AMPs, which leads to poor results in practical applications. The third is multiobjective optimisation: AMP design must balance multiple factors, such as antimicrobial activity, toxicity, and stability, and achieving this balance remains an important challenge. In addition, we elaborate on how general AI methodologies, such as GANs, VAEs, and diffusion models, are tailored and applied to AMP generation. This includes detailed discussions of how these architectures capture the unique features and patterns in biological data, thereby increasing the accuracy and efficiency of AMP prediction and design.

AMPs are emerging as promising alternatives to conventional antibiotics because of their broad-spectrum activity, low resistance risk, and unique membrane-disrupting mechanisms. However, traditional experimental methods for AMP discovery face challenges such as high costs, time-consuming synthesis, and limited exploration of the vast chemical space. Recent advances in computational biology, particularly ML and DL, have revolutionised AMP generation by enabling rapid virtual screening and de novo design. This review synthesises materials and methodologies, breakthroughs, and challenges in AMP generation, focusing on AI-driven approaches and their integration with experimental validation.

2. Datasets

Data serve as the foundation for AMP generation models, so the availability and quality of AMP databases are crucial. In practice, collecting an AMP dataset for a generation task typically involves curating data from several online AMP databases, removing duplicates, and retaining peptides within a length range relevant to the study. Some models also require non-AMP data to learn contrasting information, which helps in generating higher-quality AMPs. Classifier models trained on both AMP and non-AMP datasets are likewise essential for deciding whether a given sequence is an AMP, which is necessary for measuring the accuracy of AMP generation. However, few annotated non-AMP sequences are available for developing machine learning or deep learning models, so a variety of non-AMP generation methods have been proposed and utilised. Additionally, incorporating more real peptide sequences can provide further insights during model training or when assessing a model's generation capabilities. In this section, we describe online databases that contain AMP sequences, methods for generating non-AMP data, and general peptide sequence databases. To provide an overview, Table 1 lists all three categories of online databases along with their corresponding names and numbers.
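The curation steps described above (merging records from several sources, removing duplicates, and keeping peptides within a chosen length range) can be sketched as follows. This is a minimal illustration rather than the pipeline of any cited study: the function name `curate_sequences` and the length bounds of 5 and 50 residues are assumptions made for the example.

```python
from typing import Iterable, List


def curate_sequences(raw: Iterable[str], min_len: int = 5, max_len: int = 50) -> List[str]:
    """Normalise, de-duplicate, and length-filter peptide sequences.

    `min_len` and `max_len` are illustrative defaults, not values from the review.
    """
    seen = set()
    curated = []
    for seq in raw:
        seq = seq.strip().upper()          # normalise case and whitespace
        if not seq or seq in seen:         # skip empty entries and duplicates
            continue
        if not (min_len <= len(seq) <= max_len):
            continue                       # outside the length range of interest
        seen.add(seq)
        curated.append(seq)
    return curated


if __name__ == "__main__":
    # Hypothetical records pooled from two different databases.
    pooled = ["GIGKFLHSAKKF", "gigkflhsakkf ", "AK", "KWKLFKKIEK"]
    print(curate_sequences(pooled))
```

Keeping the first occurrence of each sequence preserves the input order, which makes the curated set reproducible when the same sources are merged in the same order.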

2.1. AMP Databases

APD

Wang et al. released the Antimicrobial Peptide Database (APD) in three versions between 2004 and 2019: APD, APD2, and APD3 [67,77,78]. The newest version, APD3, is described here. APD3 is a comprehensive and widely referenced resource for antimicrobial peptides. It contains 5099 AMPs in total, categorised into three groups as follows: 3306 natural AMPs, 1299 synthetic AMPs, and 231 predicted AMPs. The natural AMPs span the following six life kingdoms: bacteria, archaea, protists, fungi, plants, and animals. APD3 covers AMPs with many different activities, including antibacterial, antifungal, antiviral, anticancer, antibiofilm, anti-HIV, and antiparasitic activities. APD3 continuously integrates new AMPs identified through genomic screening, synthetic design, and experimental validation, reflecting advancements in combating antimicrobial resistance.

CAMP

Thomas et al. developed CAMP, a database of natural and synthetic AMPs, in 2010, and it has since been updated through three versions to CAMPR4 [68,79,80,81]. CAMPR4 introduced separate prediction algorithms tailored for natural and synthetic AMPs. The updated database now contains 24,243 AMP sequences, 933 structures, 2143 patents, and 263 AMP family signatures. It provides comprehensive data on sequences, sources/target organisms, inhibitory/hemolytic concentrations, terminal modifications, and unusual amino acids. Integrated tools enable AMP prediction, rational design (for both natural and synthetic types), sequence/structure analysis (BLAST [82], Clustal Omega [83], and VAST [84]), and family characterisation (PRATT, ScanProsite, and CAMPSign).

LAMP

Zhao et al. collected an AMP database called LAMP in 2013 and updated it to LAMP2 in 2020 [69,85]. The LAMP2 database, an upgraded version of the Linking Antimicrobial Peptide (LAMP) platform launched in 2013, addresses the growing need to consolidate AMPs as critical tools against drug-resistant bacteria. The key features of LAMP2 include comprehensive data, cross-database integration, functional organisation, and evidence-based curation. LAMP2 contains 23,253 unique AMP sequences, with 20,909 (∼89%) being experimentally validated. LAMP2 is a cross-database integration that links 16 public AMP databases, revealing that 12,236 sequences (>50%) are exclusive to one database, whereas >45% connect to two or more. The functional organisation of LAMP2 includes the classification of AMPs into eight major functional classes and 38 specific activities, which are categorised by structure, composition, source, and function for enhanced analysis. Evidence-based curation is supported by 1924 references documenting AMP activity and cytotoxicity. LAMP2 streamlines research by centralising fragmented AMP data and providing robust tools for discovery, validation, and therapeutic development.

DBAASP

Gogoladze et al. manually curated the Database of Antimicrobial Activity and Structure of Peptides (DBAASP) in 2014 to support the development of antimicrobial compounds with increased therapeutic potential. Pirtskhalava et al. revised the database and added new sequences in 2016 and 2021 [70,86,87]. DBAASP contains approximately 23,600 AMPs and compiles detailed structural and functional data on AMPs, which are evolutionarily conserved defence molecules that are effective against diverse pathogens, including antibiotic-resistant strains. The database integrates chemical and three-dimensional structural information, post-translational modifications, antimicrobial and cytotoxic activity profiles, and experimental conditions to facilitate structure–activity relationship studies. It covers both natural (ribosomal and nonribosomal) and synthetic peptides in monomeric, multimeric, and multipeptide forms. DBAASP also documents synergistic interactions between peptides or antimicrobial agents, quantified through fractional inhibitory concentration index (FICI) values. This comprehensive platform serves as a critical tool for advancing research in antimicrobial drug discovery across medical, agricultural, and industrial applications.

DRAMP

The Data Repository of Antimicrobial Peptides (DRAMP) is an open access, manually maintained database offering extensive annotations of AMPs, encompassing sequences, structural details, activity profiles, physicochemical properties, patent information, and clinical data, including trial phases, therapeutic applications, and the related literature [71,88,89,90]. Its latest version, DRAMP V4.0, emphasises clinical translation by introducing unique annotations such as serum and protease stability metrics—features absent in existing AMP databases—alongside updated entries on hemolysis, cytotoxicity, newly reported AMPs, and those in clinical research. Hosting over 30,000 entries, the database operates under a CC BY 4.0 licence, permitting data access, analysis, and bulk downloads for noncommercial research, with proper citations required for usage. Future plans include the integration of computational prediction tools and high-efficiency classifiers to leverage rich datasets for guiding AMP drug optimisation and development.

dbAMP

Jhong et al. developed a database of antimicrobial peptides (dbAMP) in 2019 and updated it to versions 2.0 and 3.0 in 2022 and 2025, respectively [72,91,92]. The latest version, dbAMP 3.0, is an enhanced resource offering comprehensive insights into the structural and functional characteristics of AMPs, updated to address emerging challenges in the post-pandemic era. dbAMP 3.0 contains 33,065 AMPs and 2453 antimicrobial proteins. The platform integrates large-scale transcriptomic and proteomic datasets to identify AMPs and classify their roles, particularly in combating antibiotic-resistant pathogens exacerbated by widespread antibiotic use. The database highlights the critical function of AMPs in innate immunity and underscores their importance in neutralising microbial threats, as exemplified by human elafin, an antiviral protein produced by γδ T cells in mucosal defences, and by lactoferricin derivatives, which exhibit dual antibacterial and anticancer properties. By combining structural annotations with activity profiles, dbAMP 3.0 serves as a vital tool for advancing research into novel therapeutic strategies against resistant infections and complex pathogens.

ESKtides

In 2024, Wu et al. developed ESKtides, an AMP database that contains 12,067,248 peptides with high antibacterial activity [73]. ESKtides offers streamlined access to a vast peptide library sourced from ESKAPE phages and prophages, facilitating research into next-generation antimicrobial solutions against drug-resistant infections. The emergence of antibiotic-resistant pathogens, particularly ESKAPE bacteria (e.g., Enterococcus faecium and Staphylococcus aureus), which were responsible for more than 1.27 million deaths in 2019, has driven the exploration of novel antimicrobial agents such as phage-derived peptidoglycan hydrolase (PGH)-based peptides. To address the lack of systematic methods for mining such peptides, the authors used 6809 ESKAPE-associated bacterial and phage genomes to identify PGHs through a standardised annotation framework, yielding the 12,067,248 high-potency antibacterial peptides. They also developed a computational tool that predicts phage PGH-derived peptides from user-submitted genomes and enables the analysis of phylogenetic relationships, physicochemical traits, and structural features.

2.2. Non-AMP Generation Methods

Given the scarcity of annotated non-AMP sequences, negative examples must be curated or generated according to rational criteria. Here, we outline several commonly used methods for building non-AMP datasets.

Normal generation method

Several methodologies involve downloading peptide sequences from UniProt and filtering out those annotated as AMPs or predicted to have a high AMP likelihood. For example, Bhadra et al. and Yan et al. excluded sequences labelled as AMPs, membrane proteins, toxins, or secretory proteins, as well as those associated with defensive, antibiotic, anticancer, antiviral, or antifungal functions [93,94]. Redundant sequences were then eliminated, and entries containing non-standard amino acids (B, J, O, U, X, Z) were discarded; the remaining sequences formed the negative dataset. However, constructing balanced classifiers requires matching the size and length distribution of the negative dataset to those of the AMP dataset. This is difficult because modern AMP repositories now exceed 30,000 entries, which makes parity in dataset scale and structural consistency hard to achieve. The challenge has prompted researchers to pursue alternative approaches for generating non-AMP datasets.
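The filtering procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (dictionaries with `sequence` and `annotation` keys), the function name `build_negative_set`, and the keyword list in `EXCLUDED_TERMS` are hypothetical stand-ins for the annotation-based exclusion criteria, not the exact filters used in [93,94].

```python
from typing import Dict, Iterable, List, Sequence

# Non-standard residue codes named in the text.
NON_STANDARD = set("BJOUXZ")

# Hypothetical keyword list approximating the excluded annotation categories.
EXCLUDED_TERMS = (
    "antimicrobial", "membrane", "toxin", "secret", "defens",
    "antibiotic", "anticancer", "antiviral", "antifungal",
)


def build_negative_set(
    records: Iterable[Dict[str, str]],
    excluded_terms: Sequence[str] = EXCLUDED_TERMS,
) -> List[str]:
    """Return unique sequences whose annotation and residues pass all filters."""
    seen = set()
    negatives = []
    for rec in records:
        annotation = rec["annotation"].lower()
        seq = rec["sequence"].strip().upper()
        if any(term in annotation for term in excluded_terms):
            continue                      # annotated with an excluded function
        if NON_STANDARD & set(seq):
            continue                      # contains B, J, O, U, X, or Z
        if seq in seen:
            continue                      # redundant sequence
        seen.add(seq)
        negatives.append(seq)
    return negatives


if __name__ == "__main__":
    demo = [
        {"sequence": "MKTAYIAK", "annotation": "Ribosomal protein"},
        {"sequence": "GIGKFLHS", "annotation": "Antimicrobial peptide"},
        {"sequence": "MKXAYIAK", "annotation": "Kinase"},
    ]
    print(build_negative_set(demo))
```

The sketch removes only exact duplicates; similarity-based redundancy reduction and matching the length distribution of the positive set would be separate steps.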

Random generation method

Yan et al. introduced a randomised negative dataset generation approach in 2022 [74]. For every sequence within the positive dataset, a corresponding negative sequence of identical length was created by substituting each residue in the template with a randomly selected natural amino acid. This method operates under the assumption that randomly generated sequences exhibit an extremely low probability of retaining biological activity akin to the original functional peptides.
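A minimal sketch of this random generation idea is given below. The function name `random_negatives`, the fixed seed, and uniform sampling over the 20 standard amino acids are assumptions made for illustration; the original study may sample differently.

```python
import random
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def random_negatives(positives: Sequence[str], seed: int = 0) -> List[str]:
    """For each positive peptide, draw a random peptide of the same length."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in template)
        for template in positives
    ]


if __name__ == "__main__":
    amps = ["GIGKFLHSAKKF", "KWKLFKKI"]
    for amp, neg in zip(amps, random_negatives(amps)):
        print(amp, "->", neg)
```

Because every negative inherits the length of its template, the negative set automatically matches the size and length distribution of the positive set.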

Shuffle generation method

Yan et al. introduced a shuffled negative dataset generation strategy in 2022 [74], which replaces residue mutation with amino acid shuffling within positive peptide sequences to construct negative counterparts. Specifically, instead of altering individual residues, they randomised the order of amino acids in functional peptides from the positive dataset to generate nonfunctional analogues. This approach similarly assumes that scrambled sequences exhibit a negligible likelihood of retaining the biological activity of their original templates, thereby serving as reliable negative dataset entries for model training and validation.
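The shuffle strategy can be sketched in the same spirit. The function name `shuffled_negatives` and the fixed seed are illustrative assumptions. Note that a shuffled peptide keeps the exact amino acid composition of its template, and that very short or highly repetitive templates can shuffle back to the original sequence, so a real pipeline may want to check for that case.

```python
import random
from typing import List, Sequence


def shuffled_negatives(positives: Sequence[str], seed: int = 0) -> List[str]:
    """Permute the residues of each positive peptide to build a negative."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    negatives = []
    for template in positives:
        residues = list(template)
        rng.shuffle(residues)              # in-place random permutation
        negatives.append("".join(residues))
    return negatives


if __name__ == "__main__":
    amps = ["GIGKFLHSAKKF", "KWKLFKKI"]
    for amp, neg in zip(amps, shuffled_negatives(amps)):
        print(amp, "->", neg)
```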

2.3. Peptide Datasets

PeptideAtlas

Desiere et al. established the PeptideAtlas database in 2006 to curate a comprehensive peptide dataset, and it has been continuously updated since [75,95,96]. PeptideAtlas contains at least 3,979,590 distinct peptides. It is a publicly accessible, multispecies repository consolidating peptide data identified through tandem mass spectrometry-based proteomic studies. It aggregates raw mass spectrometry outputs from diverse organisms, including humans, mice, and yeast, and processes them via advanced search algorithms and protein sequence databases. To ensure data reliability, all results are subjected to uniform analysis via the Trans-Proteomic Pipeline, which assigns confidence metrics for peptide identification and calculates global false discovery rates. The platform enables users to explore, query, and retrieve verified peptide information through its web interface while also providing downloadable datasets (raw files, processed outputs, and comprehensive builds) to support further research. By standardising quality control across experiments, PeptideAtlas serves as a robust resource for advancing proteomic investigations and validating peptide identification.

UniProt

UniProt is a globally authoritative, open access database renowned for its meticulously curated and exhaustive collection of protein sequences and functional annotations, serving as an essential resource for researchers, educators, and biomedical professionals advancing studies in genomics, proteomics, and therapeutic development [76]. Apweiler et al. established the UniProt database in 2004, building upon the Swiss-Prot sequence database originally developed by Bairoch et al. in 1997, and both databases have been continuously updated [76,97,98,99,100,101]. The Universal Protein Resource (UniProt) consortium, a collaborative effort by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR), aims to offer the scientific community a comprehensive central repository for protein sequences and functional data. The UniProt KnowledgeBase (UniProtKB), which is updated every four weeks, is the primary resource maintained by the consortium. Additionally, supplementary databases such as the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc) are also maintained. The UniProtKB/Swiss-Prot section, a component of the UniProt KnowledgeBase, comprises expertly curated protein sequences from a wide range of organisms that are publicly accessible. Within the framework of the Plant Proteome Annotation Program (PPAP), plant protein entries are generated, with a particular focus on well-characterised proteins from Arabidopsis thaliana and Oryza sativa. The high-quality annotations provided by UniProtKB/Swiss-Prot are extensively utilised to predict annotations for newly discovered proteins via automated processes.

3. Feature Encoding Methods

In this section, three categories, including mapping-based methods, physicochemical-based methods, and secondary structure-based methods, are illustrated.

3.1. Mapping-Based Methods

Mapping-based methods enable bidirectional conversion between sequences and feature vectors—not only transforming sequences into vectors but also reconstructing sequences from vectors. This capability ensures that output vectors produced by generative models can ultimately be translated back into functional sequences, bridging the gap between abstract representations and biologically interpretable data.

Token

Token is an encoding strategy that employs a dictionary to map each of the 20 natural amino acids and a termination symbol to distinct integer identifiers, with an additional integer reserved for padding. Specifically, sequences are padded to a uniform length corresponding to the maximum sequence length across the dataset plus one (to accommodate termination), ensuring dimensional consistency for ML/DL frameworks. As formalised in Equation (1), this method generates numerical representations directly compatible with model architectures.

\begin{equation}
\begin{aligned}
\mathrm{Token}_{\mathrm{dict}} = \{ & A:1,\ R:2,\ N:3,\ D:4,\ C:5,\ E:6,\ Q:7,\ G:8,\ H:9,\ I:10,\ L:11,\ K:12, \\
& M:13,\ F:14,\ P:15,\ S:16,\ T:17,\ W:18,\ Y:19,\ V:20,\ \mathrm{Padding}:24,\ \mathrm{End}:25 \}.
\end{aligned}
\tag{1}
\end{equation}
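The token mapping in Equation (1) and its inverse can be sketched as below. The integer assignments are copied from Equation (1); the function names `encode` and `decode`, and the choice to stop decoding at the first End or Padding token, are assumptions made for illustration.

```python
from typing import List

# Integer identifiers taken from Equation (1).
TOKEN_DICT = {
    "A": 1, "R": 2, "N": 3, "D": 4, "C": 5, "E": 6, "Q": 7, "G": 8, "H": 9,
    "I": 10, "L": 11, "K": 12, "M": 13, "F": 14, "P": 15, "S": 16, "T": 17,
    "W": 18, "Y": 19, "V": 20,
}
PADDING, END = 24, 25
INVERSE_DICT = {index: aa for aa, index in TOKEN_DICT.items()}


def encode(seq: str, max_len: int) -> List[int]:
    """Map a peptide to max_len + 1 integers: residues, End, then Padding."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    ids = [TOKEN_DICT[aa] for aa in seq] + [END]
    return ids + [PADDING] * (max_len + 1 - len(ids))


def decode(ids: List[int]) -> str:
    """Rebuild the peptide, stopping at the first End or Padding token."""
    residues = []
    for index in ids:
        if index in (END, PADDING):
            break
        residues.append(INVERSE_DICT[index])
    return "".join(residues)


if __name__ == "__main__":
    tokens = encode("ACK", max_len=5)
    print(tokens)          # [1, 5, 12, 25, 24, 24]
    print(decode(tokens))  # ACK
```

The round trip through `decode` reflects the bidirectional property of mapping-based methods: integer outputs of a generative model can be translated back into a peptide sequence.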

One-Hot

The one-hot encoding method represents each natural amino acid as a 20-dimensional binary vector, where all the elements are zero except for a single position uniquely assigned to the corresponding amino acid. Padding tokens, which are required to standardise sequence lengths, are encoded as all-zero vectors of the same dimensionality. For a given sequence, each amino acid (from the first position to the maximum predefined length) is replaced by its one-hot vector, and these vectors are concatenated sequentially to form a $20 \times \mathrm{Maximum}_{\mathrm{Length}}$ dimensional feature matrix. This matrix serves as the standardised input for the ML or DL models. One-hot encoding, also referred to as binary encoding, was adopted in two seminal studies published in 2011 and 2013 [102,103].
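A minimal sketch of one-hot encoding is shown below, using plain Python lists so that it has no dependencies. The row order follows the amino acid order of Equation (1), which is an assumption made for the example (any fixed order works), and the function name `one_hot` is hypothetical.

```python
from typing import List

# Row order follows Equation (1); any fixed order of the 20 residues would do.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def one_hot(seq: str, max_len: int) -> List[List[int]]:
    """Return a 20 x max_len binary matrix; padded positions stay all-zero."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    matrix = [[0] * max_len for _ in AMINO_ACIDS]
    for position, aa in enumerate(seq):
        matrix[AMINO_ACIDS.index(aa)][position] = 1
    return matrix


if __name__ == "__main__":
    encoded = one_hot("AR", max_len=3)
    print(len(encoded), len(encoded[0]))  # 20 3
    print(encoded[0])                     # row for A: [1, 0, 0]
    print(encoded[1])                     # row for R: [0, 1, 0]
```

Each column contains at most a single 1, so a sequence can be recovered by reading off the non-zero row in every column until the first all-zero (padding) column.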

3.2. Disorder-Based Methods

Disorder refers to intrinsically disordered regions (IDRs) or intrinsically disordered proteins (IDPs) that lack a fixed three-dimensional structure under physiological conditions. Such regions are highly flexible and dynamic, are often enriched with hydrophilic and charged amino acids, and play crucial roles in biological processes such as signal transduction and molecular recognition. Disorder-based encodings rely on bioinformatics tools that identify disordered regions from sequence features and physicochemical properties. DisorderC is a type of disorder prediction that focuses on compositional bias: it compares observed amino acid frequencies with expected values and flags segments with significant deviations, such as an over-representation of alanine, glycine, or proline, as potentially disordered. DisorderB classifies each amino acid position as either ordered (0) or disordered (1), typically using machine learning models trained on known examples. Because it yields one binary label per position, DisorderB is straightforward, computationally efficient, and suitable for high-throughput analysis, and it is often used as a preliminary screening tool in large-scale studies.

Disorder

In the context of protein sequences, disorder refers to intrinsically disordered regions (IDRs) or intrinsically disordered proteins (IDPs) that lack a fixed or ordered three-dimensional structure under physiological conditions [104,105,106]. These regions or proteins are highly flexible and dynamic, which allows them to adopt multiple conformations and to interact with a wide range of binding partners, often in a context-dependent manner. Despite their lack of a fixed structure, disordered regions play crucial roles in various biological processes, including signal transduction, molecular recognition, and protein–protein interactions. Disordered regions often contain a high proportion of hydrophilic and charged amino acids, which contributes to their flexibility and dynamic nature. They are typically enriched with amino acids such as serine (S), threonine (T), glutamine (Q), and proline (P). Identifying disordered regions in protein sequences is important for understanding their functional roles and for predicting their interactions with other biomolecules. Disorder prediction tools are commonly used in bioinformatics to identify these regions.

DisorderC

In the context of protein sequences, DisorderC refers to a specific type of intrinsic disorder prediction that focuses on the compositional bias of amino acids within a protein sequence [107]. This method is part of a broader set of approaches used to identify intrinsically disordered regions (IDRs) [108] or intrinsically disordered proteins (IDPs) [109] on the basis of the over-representation or under-representation of certain amino acids. DisorderC identifies regions of a protein sequence that have a significant compositional bias towards certain amino acids. This bias is often associated with intrinsic disorder because some amino acids are more likely to be found in disordered regions. Amino acids such as alanine (A), glycine (G), proline (P), serine (S), and threonine (T) are often over-represented in disordered regions. Conversely, amino acids such as tryptophan (W), tyrosine (Y), and valine (V) are typically under-represented. DisorderC uses statistical methods to compare the observed frequency of each amino acid in a given sequence segment with the expected frequency derived from a reference dataset. Segments with significant deviations from the expected frequencies are flagged as potentially disordered. DisorderC can be used in conjunction with other disorder prediction methods, such as sequence complexity analysis, secondary structure prediction, and evolutionary conservation, to improve the accuracy of identifying disordered regions.

DisorderB

Disorder binary (DisorderB) is a method used in bioinformatics to predict intrinsically disordered regions (IDRs) within protein sequences by classifying each amino acid position in the sequence as either ordered (0) or disordered (1) [110]. This binary classification simplifies disorder prediction and provides a straightforward and computationally efficient way to identify disordered regions. DisorderB predictions are typically generated via machine learning models trained on known examples of disordered and ordered regions. These models can be based on various features, such as amino acid composition, sequence complexity, and physicochemical properties. The binary classification approach is well suited for high-throughput analysis, allowing for the rapid screening of large datasets to identify potential disordered regions in protein sequences. DisorderB predictions can also be integrated with other bioinformatics tools and methods to provide a more comprehensive analysis of protein disorder. For example, they can be combined with more detailed disorder prediction methods or used as a preliminary step in structural and functional studies.

3.3. Physicochemical-Based Methods

To analyse the physicochemical characteristics of AMP sequences systematically, we implemented a suite of computational methods grounded in physicochemical property analysis. These methods enable the quantitative evaluation of critical features such as hydrophobicity, charge distribution, and secondary structure propensity, providing a robust framework for sequence–function relationship studies.

AAC

AAC refers to the method of calculating the frequency of each natural amino acid within a peptide sequence and arranging these values in a predefined order to generate a 20-dimensional vector, where each dimension corresponds to the frequency of a specific amino acid.

DPC

Dipeptide composition (DPC) refers to the method of analysing consecutive pairs of natural amino acids within a peptide sequence. It calculates the frequency of each possible dipeptide combination (20 × 20 = 400 unique pairs) and organises these values into a 400-dimensional vector, where each dimension corresponds to the occurrence rate of a specific amino acid pair, ordered systematically (e.g., alphabetically by residue type).

TPC

Tripeptide composition (TPC) extends this concept to consecutive triplets of amino acids. By computing the frequency of all possible three-residue combinations (20 × 20 × 20 = 8000 unique triplets), an 8000-dimensional vector is generated, with each dimension representing the normalised frequency of a distinct tripeptide arrangement, similarly ordered according to a standardised convention.
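AAC, DPC, and TPC differ only in the length k of the residue window (k = 1, 2, 3), so a single routine can illustrate all three. The following is a minimal sketch: the function name `kmer_composition`, the alphabetical ordering of the output vector, and the decision to skip windows containing non-standard residues are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical ordering


def kmer_composition(sequence, k):
    """Return the normalised frequencies of all 20**k k-mers.

    k = 1 gives AAC (20 values), k = 2 gives DPC (400 values),
    and k = 3 gives TPC (8000 values).
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows with non-standard residues are skipped
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


aac = kmer_composition("AAKK", 1)  # 20-dimensional vector
dpc = kmer_composition("AAKK", 2)  # 400-dimensional vector
tpc = kmer_composition("AAKK", 3)  # 8000-dimensional vector
```

The 8000-dimensional TPC vector is very sparse for short peptides, which is one practical reason grouped variants of these descriptors are often preferred.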

GAAC

Grouped amino acid composition (GAAC) is a feature extraction method used to analyse peptide sequences by categorising amino acids into predefined groups on the basis of their physicochemical properties [111]. This approach simplifies the representation of a peptide sequence by focusing on the collective properties of amino acids rather than individual residues. Amino acids are classified into groups on the basis of their shared characteristics, such as hydrophobicity, charge, size, or polarity. For example, hydrophobic amino acids such as alanine (A), valine (V), leucine (L), isoleucine (I), and phenylalanine (F) might be grouped together, whereas charged amino acids such as lysine (K), arginine (R), aspartic acid (D), and glutamic acid (E) may form another group. This grouping reduces the complexity of the sequence by considering the collective properties of these groups. For each group, the GAAC algorithm calculates the frequency or proportion of amino acids within that group in the peptide sequence. For example, if a peptide sequence contains ten amino acids and four of them belong to the hydrophobic group, the composition of the hydrophobic group would be 40%. This provides a high-level view of the physicochemical properties of the peptide.

GDPC

Grouped dipeptide composition (GDPC) is an advanced feature extraction method used in bioinformatics to analyse peptide sequences by considering the composition of dipeptides (pairs of consecutive amino acids) grouped by their physicochemical properties. This method extends the concept of grouped amino acid composition (GAAC) to dipeptides, providing a more detailed and nuanced representation of the sequence [112]. Amino acids are first grouped on the basis of their physicochemical properties, such as hydrophobicity, charge, size, or polarity. For example, hydrophobic amino acids might be grouped together, and charged amino acids might form another group. Dipeptides are then formed by considering pairs of consecutive amino acids within these groups. For example, if we have two groups, hydrophobic (H) and charged (C), the possible dipeptide groups are HH, HC, CH, and CC. For each dipeptide group, the GDPC algorithm calculates the frequency or proportion of that dipeptide group in the peptide sequence. For example, if a peptide sequence contains ten dipeptides and three of them are from the hydrophobic-hydrophobic (HH) group, the composition of the HH group would be 30%.

GTPC

Grouped tripeptide composition (GTPC) is an advanced feature extraction method used in bioinformatics to analyse peptide sequences by considering the composition of tripeptides (triplets of consecutive amino acids) grouped by their physicochemical properties. This method extends the concept of grouped dipeptide composition (GDPC) to tripeptides, providing an even more detailed and nuanced representation of the sequence [113]. Amino acids are first grouped on the basis of their physicochemical properties, such as hydrophobicity, charge, size, or polarity. For example, hydrophobic amino acids might be grouped together, and charged amino acids might form another group. Tripeptides are then formed by considering triplets of consecutive amino acids within these groups. For example, if we have two groups, hydrophobic (H) and charged (C), the possible tripeptide groups are HHH, HHC, HCH, HCC, CHH, CHC, CCH, and CCC. For each tripeptide group, the GTPC algorithm calculates the frequency or proportion of that tripeptide group in the peptide sequence. For example, if a peptide sequence contains ten tripeptides and three of them are from the hydrophobic-hydrophobic-hydrophobic (HHH) group, the composition of the HHH group would be 30%.
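The grouped descriptors GAAC, GDPC, and GTPC can be illustrated together, because they apply the same counting to windows of one, two, or three group labels. The sketch below assumes a five-group scheme (aliphatic, aromatic, positively charged, negatively charged, uncharged); this grouping is one common choice and is an assumption here, as are the function and variable names.

```python
from itertools import product

# Assumed grouping for illustration; other physicochemical groupings are possible.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
GROUP_NAMES = list(GROUPS)
AA_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def grouped_composition(sequence, k):
    """Return the proportion of each k-tuple of group labels.

    k = 1 corresponds to GAAC, k = 2 to GDPC, and k = 3 to GTPC.
    Windows containing a non-standard residue are skipped.
    """
    labels = [AA_TO_GROUP.get(aa) for aa in sequence]
    keys = list(product(GROUP_NAMES, repeat=k))
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(labels) - k + 1):
        window = tuple(labels[i:i + k])
        if None not in window:
            counts[window] += 1
            total += 1
    return {key: (counts[key] / total if total else 0.0) for key in keys}


peptide = "AVLIKRDEFW"  # hypothetical ten-residue peptide
gaac = grouped_composition(peptide, 1)
gdpc = grouped_composition(peptide, 2)
gtpc = grouped_composition(peptide, 3)
```

With this grouping, four of the ten residues in the example peptide are aliphatic, so `gaac[("aliphatic",)]` is 0.4, which mirrors the 40% worked example given for GAAC.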

EGAAC

The enhanced group amino acid composition (EGAAC) is an advanced feature extraction method that builds upon the traditional GAAC by incorporating additional layers of information and complexity [114]. This method aims to provide a more comprehensive and nuanced representation of peptide sequences by considering not only the physicochemical properties of amino acids but also their spatial distribution and interaction patterns. Amino acids are categorised into groups on the basis of their physicochemical properties, such as hydrophobicity, charge, size, and polarity. The EGAAC method calculates the frequency or proportion of each amino acid group in the peptide sequence, similar to GAAC. However, it also considers the distribution and interaction patterns of these groups within the sequence. The method can analyse the distribution of amino acid groups along a sequence, identifying patterns such as clustering or periodicity. EGAAC can capture the interactions between different groups, such as transitions from one group to another, which can provide insights into the structural and functional properties of the peptide.

CTD

The composition/transition/distribution (CTD) algorithm is a widely used feature extraction method for AMPs [115,116]. It captures the essential characteristics of a peptide sequence by analysing the composition, transition, and distribution of its constituent amino acids. Composition means that the algorithm calculates the frequency of each amino acid in the peptide sequence. For example, if a peptide contains ten amino acids and three of them are lysine (K), the composition of lysine would be 30%. This provides a snapshot of the overall makeup of the peptide, highlighting the prevalence of specific amino acids that might be crucial for its antimicrobial activity. Transition refers to the frequency of changes between different types of amino acids along a sequence. For example, it measures how often a hydrophobic amino acid is followed by a hydrophilic amino acid or vice versa. This aspect is important because the spatial arrangement and interaction between different types of amino acids can influence a peptide’s structure and its ability to interact with bacterial membranes.

Distribution examines the spatial arrangement of specific amino acids or groups of amino acids within the peptide. It identifies patterns such as clustering of certain amino acids at the N-terminus, C-terminus, or throughout the sequence. This is significant because the position of key residues can affect the peptide’s stability, ability to penetrate bacterial membranes, and overall bioactivity. By combining these three components, the CTD algorithm provides a comprehensive representation of a peptide’s characteristics. This multifaceted approach allows researchers to capture both the quantitative and qualitative aspects of peptide sequences, making it a powerful tool for predicting and understanding the antimicrobial properties of peptides.
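The three CTD components can be sketched as follows. Note the assumptions: this sketch works on residue classes rather than individual amino acids, using a three-class hydrophobicity grouping (polar, neutral, hydrophobic) that is common in CTD implementations; the rank rule used for the distribution marks (first, 25%, 50%, 75%, and last occurrence) is one of several conventions; and the function name `ctd` is hypothetical.

```python
import math

# Assumed three-class hydrophobicity grouping: 1 = polar, 2 = neutral, 3 = hydrophobic.
CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_CLASS = {aa: c for c, members in CLASSES.items() for aa in members}


def ctd(sequence):
    """Return (composition, transition, distribution) for one property grouping."""
    labels = [AA_TO_CLASS[aa] for aa in sequence]  # KeyError on non-standard residues
    n = len(labels)

    # Composition: fraction of residues in each class.
    composition = {c: labels.count(c) / n for c in CLASSES}

    # Transition: fraction of adjacent pairs that switch between two given classes.
    pairs = list(zip(labels, labels[1:]))
    transition = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        transition[a + b] = hits / len(pairs) if pairs else 0.0

    # Distribution: relative position (in %) of the first, 25%, 50%, 75%,
    # and last occurrence of each class along the sequence.
    distribution = {}
    for c in CLASSES:
        positions = [i + 1 for i, lab in enumerate(labels) if lab == c]
        if not positions:
            distribution[c] = [0.0] * 5
            continue
        marks = []
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, math.ceil(frac * len(positions)))
            marks.append(100.0 * positions[rank - 1] / n)
        distribution[c] = marks
    return composition, transition, distribution


composition, transition, distribution = ctd("KKLLGG")
```

For the toy sequence `KKLLGG`, each class makes up one third of the residues, and one of the five adjacent pairs is a polar-to-hydrophobic switch, so `transition["13"]` is 0.2.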

PseAAC

Pseudo amino acid composition (PseAAC) is an advanced feature extraction method used in bioinformatics to represent peptide sequences by incorporating both sequence-order information and the physicochemical properties of amino acids [117,118]. This method extends the traditional amino acid composition (AAC) by adding pseudo components that capture sequence-order effects, making it particularly useful for predicting protein functions and properties. The AAC calculates the frequency or proportion of each of the 20 standard amino acids in the peptide sequence. For example, if a peptide contains ten amino acids and two of them are lysine (K), the composition of lysine would be 20%. The pseudo components capture sequence-order information, which is crucial for understanding the structural and functional properties of peptides. These components are derived from the various physicochemical properties of amino acids, such as hydrophobicity, charge, and polarity, and may describe the probability of transitioning from one amino acid to another in the sequence, the positional distribution of amino acids along the sequence, and the correlation between amino acids at different positions in the sequence. Additional properties, such as hydrophobicity, charge, and flexibility, can be included to capture more detailed sequence-order effects.

The PseAAC is typically calculated via a combination of AAC and pseudo components. The formula for the PseAAC can be expressed as Equation (2),

$$\mathrm{PseAAC} = \mathrm{AAC} + \lambda \times \mathrm{Pseudo}_{\mathrm{Components}}, \tag{2}$$

where AAC is the amino acid composition, $\lambda$ is a weight factor that balances the contribution of the pseudo components, and $\mathrm{Pseudo}_{\mathrm{Components}}$ are the additional sequence-order features.

APAAC

Amphiphilic pseudo amino acid composition (APAAC) is a specialised feature extraction method used in bioinformatics to represent peptide sequences, particularly those with amphiphilic properties, such as AMPs [119,120]. This method extends the traditional PseAAC by focusing on the amphiphilic nature of peptides, which is crucial for their interaction with biological membranes. The AAC calculates the frequency or proportion of each of the 20 standard amino acids in the peptide sequence. For example, if a peptide contains ten amino acids and two of them are lysine (K), the composition of lysine would be 20%. The pseudo components in APAAC specifically capture the amphiphilic properties of peptides, which are essential for their ability to interact with and disrupt bacterial membranes. Hydrophobicity measures the hydrophobic nature of amino acids. Charge measures the charge distribution along the peptide sequence. Polarity measures the polarity of amino acids. Secondary structure propensity measures the propensity of amino acids to form secondary structures such as alpha-helices or beta-sheets. Solvent accessibility measures the solvent accessibility of amino acids. The APAAC is calculated by combining the AAC with the amphiphilic pseudo components. The formula for the APAAC can be expressed as Equation (3).

$$\mathrm{APAAC} = \mathrm{AAC} + \lambda \times \mathrm{Amphiphilic}_{\mathrm{Components}}^{\mathrm{Pseudo}}, \tag{3}$$

where $\mathrm{Amphiphilic}_{\mathrm{Components}}^{\mathrm{Pseudo}}$ are the additional features capturing the amphiphilic properties.

PseKRAAC

Pseudo K-tuple reduced amino acid composition (PseKRAAC) is a bioinformatics tool designed to generate pseudo K-tuple reduced amino acid compositions from protein sequences [121]. This method simplifies protein complexity by reducing the amino acid alphabet, which helps in identifying functionally conserved regions and reduces the risk of overfitting, computational burden, and information redundancy. Amino acids are grouped into clusters on the basis of their physicochemical properties, and this reduction makes protein sequences easier to analyse. PseKRAAC incorporates three crucial parameters that describe protein composition, making it a versatile tool for computational proteomics and protein sequence analysis. Its parameterisation is flexible: users can select from various reduced amino acid alphabets and adjust parameters such as the type of analysis (gap or lambda correlation), the number of gaps or lambda values, and the k-mer size. PseKRAAC is freely available as a web server, allowing users to easily generate different modes of PseKRAAC tailored to their specific needs. This tool is anticipated to become a very useful resource in computational proteomics and protein sequence analysis.

CKSAAGP

The composition of k-spaced amino acid group pairs (CKSAAGP) is a feature extraction method used in bioinformatics to analyse peptide and protein sequences [122,123]. This method extends the traditional amino acid composition by considering the spatial relationships between amino acids within a sequence, specifically focusing on pairs of amino acids that are separated by a fixed number of positions (k-spaced). Amino acids are categorised into groups on the basis of their physicochemical properties, such as hydrophobicity, charge, size, and polarity. The method then considers pairs of amino acids that have exactly k residues between them. For example, if k = 2, the method pairs each residue with the residue three positions further along, so that two residues lie between them. For a given sequence, all possible k-spaced pairs are identified and grouped on the basis of their physicochemical properties, and the frequency or proportion of each group of k-spaced pairs is calculated. This provides a detailed representation of the sequence, capturing both the composition and the spatial relationships between amino acids. The formula for CKSAAGP can be expressed as Equation (4).

$$\mathrm{CKSAAGP} = \sum_{i=1}^{L-k-1} \mathrm{Group}(S_{i}, S_{i+k+1}), \tag{4}$$

where L is the length of the sequence S, $S_{i}$ and $S_{i+k+1}$ are the amino acids at positions i and $i+k+1$ in the sequence, and $\mathrm{Group}(S_{i}, S_{i+k+1})$ is the group to which the pair $(S_{i}, S_{i+k+1})$ belongs.
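The counting in Equation (4) can be sketched as follows. The five-group scheme and all names in the code are assumptions for illustration; the loop pairs the residue at position i with the residue at position i + k + 1, so k = 0 corresponds to adjacent residues.

```python
from itertools import product

# Assumed grouping for illustration; the method itself does not fix a scheme here.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
GROUP_NAMES = list(GROUPS)
AA_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def cksaagp(sequence, k):
    """Return the proportion of each group pair among all k-spaced residue pairs."""
    labels = [AA_TO_GROUP.get(aa) for aa in sequence]
    keys = list(product(GROUP_NAMES, repeat=2))
    counts = dict.fromkeys(keys, 0)
    total = 0
    # 0-based i = 0 .. L-k-2 corresponds to i = 1 .. L-k-1 in Equation (4).
    for i in range(len(labels) - k - 1):
        pair = (labels[i], labels[i + k + 1])
        if None not in pair:  # skip pairs involving non-standard residues
            counts[pair] += 1
            total += 1
    return {key: (counts[key] / total if total else 0.0) for key in keys}


adjacent = cksaagp("AKAKA", k=0)    # pairs: AK, KA, AK, KA
one_spaced = cksaagp("AKAKA", k=1)  # pairs: A.A, K.K, A.A
```

In practice, the vectors for several values of k (for example 0 to 5) are often concatenated into one feature vector; that choice is left to the user in this sketch.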

NMBroto

Normalised Moreau–Broto autocorrelation (NMBroto) is a type of autocorrelation descriptor used in bioinformatics to analyse the distribution of amino acid properties along a protein sequence [124]. It captures the correlation between the properties of amino acids at different positions in a sequence, providing insights into the structural and functional characteristics of proteins. The NMBroto descriptor is calculated on the basis of the distribution of specific physicochemical properties of amino acids along the protein sequence. The formula for NMBroto is given by Equation (5).

$$\mathrm{NMBroto}(d) = \frac{1}{N-d} \sum_{i=1}^{N-d} P_{i} \cdot P_{i+d}, \tag{5}$$

where d is the lag parameter, which represents the distance between the amino acids being compared; $P_{i}$ and $P_{i+d}$ are the properties of the amino acids at positions i and $i+d$, respectively; and N is the length of the protein sequence.
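Equation (5) translates directly into code. In the sketch below, the Kyte-Doolittle hydropathy scale is used as the property P, and its values are standardised to zero mean and unit variance over the 20 amino acids before use. Both choices, along with the function name `nmbroto`, are assumptions for illustration; any numerical amino acid property could be substituted.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values (used here as an example property).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
_mu, _sd = mean(KD.values()), pstdev(KD.values())
KD_STD = {aa: (v - _mu) / _sd for aa, v in KD.items()}  # standardised scale


def nmbroto(sequence, max_lag, prop=KD_STD):
    """Return [NMBroto(1), ..., NMBroto(max_lag)] following Equation (5)."""
    values = [prop[aa] for aa in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        sum(values[i] * values[i + d] for i in range(n - d)) / (n - d)
        for d in range(1, max_lag + 1)
    ]


features = nmbroto("GIGKFLHSAKKF", max_lag=3)
toy = nmbroto("AKAK", 2, prop={"A": 1.0, "K": -1.0})  # gives [-1.0, 1.0]
```

The toy call shows the intended behaviour: with alternating property values, neighbouring residues (lag 1) are perfectly anti-correlated and residues two positions apart (lag 2) are perfectly correlated.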

KSCTriad

The k-spaced conjoint triad (KSCTriad) is a feature extraction method used in bioinformatics to analyse protein sequences [125,126]. It extends the concept of the conjoint triad descriptor by considering not only continuous amino acid triplets but also those that are separated by a fixed number of residues (k-spaced). A k-spaced triad is a set of three amino acids where the middle amino acid is separated from each of the other two by exactly k residues. For example, in a sequence …A…B…C…, if k = 2, then A, B, and C form a k-spaced triad when two residues lie between A and B and two residues lie between B and C. The method generates a feature vector by counting the occurrences of all possible k-spaced triads in the sequence. This provides a comprehensive representation of the structural and compositional properties of a sequence. The dimensionality of the feature vector depends on the value of k and the number of possible triads.

SOCNumber

The sequence-order-coupling number (SOC) is a feature extraction method used in bioinformatics to capture the sequence-order information of protein sequences [127]. This method is particularly useful for predicting protein properties and functions by considering the spatial arrangement of amino acids within a sequence. The SOC number captures sequence-order information by considering the spatial arrangement of amino acids in the protein sequence. This information is crucial for understanding the structural and functional properties of proteins. The coupling number is calculated for each amino acid position in the sequence, considering its interactions with other amino acids within a specified range. This provides a measure of how each amino acid is influenced by its neighbours. The method generates a feature vector that represents the sequence-order information. This vector can be used as input for machine learning models to predict protein properties and functions. The SOC number is used in various bioinformatics applications, including protein structure prediction, protein function prediction, and AMP design. It helps in capturing the local sequence patterns that are crucial for understanding protein behaviour. We choose a range d within which the interactions between amino acids are considered. This range can be adjusted on the basis of the specific requirements of the analysis. For each amino acid position i in the sequence, the coupling number SOC(i) is calculated via Equation (6).

$$\mathrm{SOC}(i) = \sum_{j=1}^{d} \frac{1}{j} \cdot \frac{1}{1 + |x_{i} - x_{i+j}|}, \tag{6}$$

where $x_{i}$ and $x_{i+j}$ are the physicochemical properties (e.g., hydrophobicity, charge) of the amino acids at positions i and $i+j$, and j is the distance between the amino acids within the specified range d. The feature vector is generated by concatenating the SOC numbers for all amino acid positions in the sequence.
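A direct transcription of Equation (6) is sketched below. The equation does not say what happens when position i + j runs past the end of the sequence; this sketch simply truncates the sum there, which is an assumption. The property table is passed in by the caller, and the toy two-letter table in the example is hypothetical.

```python
def soc_numbers(sequence, d, prop):
    """Return SOC(i) for every position i, following Equation (6).

    prop maps each residue to a numerical physicochemical property value.
    Terms with i + j beyond the sequence end are omitted (assumed convention).
    """
    x = [prop[aa] for aa in sequence]
    n = len(x)
    result = []
    for i in range(n):
        total = 0.0
        for j in range(1, d + 1):
            if i + j >= n:
                break
            total += (1.0 / j) * (1.0 / (1.0 + abs(x[i] - x[i + j])))
        result.append(total)
    return result


toy_prop = {"A": 0.0, "K": 1.0}  # hypothetical property values
soc = soc_numbers("AAK", d=2, prop=toy_prop)  # gives [1.25, 0.5, 0.0]
```

For the first residue, the lag-1 neighbour has an identical property value (contribution 1.0) and the lag-2 neighbour differs by 1 (contribution 0.5 × 0.5 = 0.25), giving 1.25. Because the vector length equals the sequence length, peptides of different lengths need padding or truncation before the vectors are passed to a model.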

QSOrder

QSOrder is a feature extraction algorithm for AMPs [128]. It is based on the concept of pseudo amino acid composition (PseAAC), which incorporates both sequence-order information and the physicochemical properties of amino acids. The algorithm calculates a series of features that reflect the local sequence-order effects and the global physicochemical properties of the peptides. Specifically, it uses a combination of dipeptide compositions, which captures the frequency of adjacent amino acid pairs, and a set of predefined physicochemical properties to generate a feature vector. This feature vector can then be used as input for machine learning models to predict the antimicrobial activity of peptides. The QSOrder algorithm is particularly useful for identifying key features that distinguish antimicrobial peptides from non-antimicrobial peptides, thereby aiding in the design and discovery of new AMPs.

PSSM

The position-specific scoring matrix (PSSM) profile is a powerful tool used in bioinformatics to represent the evolutionary conservation and sequence-specific information of a protein or peptide sequence [129]. The first step in generating a PSSM profile is to perform multiple sequence alignment (MSA) of homologous sequences. This involves aligning a set of sequences that are related by evolution, typically obtained through database searches such as BLAST. The alignment captures the conservation and variability of amino acids at each position across the aligned sequences. From the MSA, a frequency matrix is constructed. Each row in the matrix corresponds to a position in the aligned sequences, and each column corresponds to one of the 20 standard amino acids. The entries in the matrix represent the frequency of each amino acid at each position in the alignment. The frequency matrix is then converted into a scoring matrix. This involves calculating the log-odds score for each amino acid at each position. The log-odds score is computed as the logarithm of the ratio of the observed frequency of an amino acid at a position to the expected frequency of that amino acid in a background model (usually the frequency of amino acids in a large database of proteins). A PSSM profile is typically represented as a matrix with dimensions L × 20, where L is the length of the sequence and 20 corresponds to the 20 standard amino acids. Each entry P(i,a) in the matrix represents the log-odds score for amino acid a at position i in the sequence.
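The frequency-to-log-odds steps above can be sketched on a toy alignment. This is a simplified illustration and not the procedure used by PSI-BLAST, which applies sequence weighting and more elaborate pseudocounts. The uniform background distribution, the additive pseudocount, and the function name `pssm_from_alignment` are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_alignment(alignment, background=None, pseudocount=1.0):
    """Build an L x 20 log-odds matrix from equal-length aligned sequences.

    background: expected amino acid frequencies (uniform if not given).
    pseudocount: added to every count so that unseen residues get finite scores.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for pos in range(length):
        # Gap characters and non-standard residues are ignored in this column.
        column = [seq[pos] for seq in alignment if seq[pos] in background]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        row = []
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            row.append(math.log2(freq / background[aa]))
        pssm.append(row)
    return pssm


toy_alignment = ["KLA", "KLG", "KIA"]  # hypothetical aligned homologues
pssm = pssm_from_alignment(toy_alignment)
```

In the toy alignment, lysine is fully conserved at the first position, so its score there is positive, whereas residues never observed at that position receive negative scores.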

AAindex

The AAindex database is a comprehensive collection of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids [130,131,132]. It is widely used in bioinformatics and computational biology to analyse and predict the properties of proteins and peptides. The database consists of three sections: AAindex1, AAindex2, and AAindex3. The AAindex1 section contains indices representing the properties of individual amino acids. Each index is a set of 20 numerical values, one for each of the 20 standard amino acids. The AAindex2 section includes amino acid substitution matrices, which represent the likelihood of one amino acid being replaced by another during evolution. The AAindex3 section contains statistical protein contact potentials, which describe the interactions between pairs of amino acids in protein structures. The AAindex indices can be used to predict the secondary and tertiary structures of proteins, and analysing the properties of the amino acids in a protein sequence can help predict the function of the protein. The AAindex is often used as a feature set in machine learning models to predict various properties of proteins, such as enzyme activity, binding affinity, and stability. The AAindex database can be accessed through the DBGET/LinkDB system at GenomeNet [133]. It can also be downloaded via anonymous FTP from the GenomeNet server. Additionally, there are software packages available for working with AAindex data, such as the aaindex Python package.

BLOSUM62

The BLOSUM62 matrix is a widely used substitution matrix in bioinformatics for protein sequence alignment [134]. It is part of the BLOSUM (blocks substitution matrix) family, which was developed to improve the accuracy of protein sequence alignments by capturing evolutionary trends in amino acid substitutions. BLOSUM matrices are derived from conserved sequence blocks in protein families, specifically from the BLOCKS database. For BLOSUM62, sequences within a block that share at least 62% identity are clustered and weighted as a single sequence, which prevents bias from closely related sequences. Substitutions between amino acids are counted within these conserved blocks. The substitution frequencies are converted into log-odds scores, which reflect the likelihood of a substitution occurring relative to random chance. Positive scores indicate frequent substitutions, often conservative replacements that preserve biochemical properties. Negative scores reflect rare or unfavourable substitutions that could disrupt protein structure and function. BLOSUM62 balances sensitivity and specificity in detecting homologous sequences, making it effective for a wide range of evolutionary distances.

ASA

Accessible surface area (ASA), also known as the solvent-accessible surface area, is a measure of the surface area of a molecule that is accessible to a solvent, typically water [135]. It is an important concept in understanding the structure, function, and interactions of proteins and other biomolecules. The ASA can be calculated via various computational methods, with the most common approach being the rolling probe method. Each atom in the molecule is assigned a van der Waals radius, and a spherical probe representing a solvent molecule is rolled over the resulting van der Waals surface. The probe radius is typically set to 1.4 Å, which is the approximate radius of a water molecule. The ASA is the area that the probe can access without overlapping any atoms.

TA

Torsion angles, also known as dihedral angles, are crucial for describing the three-dimensional structure of proteins [136]. These angles define the spatial arrangement of atoms in the polypeptide backbone and side chains. The two most important torsion angles in proteins are the phi (ϕ) and psi (ψ) angles, which describe the rotation around the bonds in the polypeptide backbone. The ϕ angle is defined by the rotation around the bond between the nitrogen (N) and the alpha carbon (Cα) of an amino acid residue. It ranges from −180° to +180° and affects the conformation of the backbone preceding the alpha carbon. The ψ angle is defined by the rotation around the bond between the alpha carbon (Cα) and the carbonyl carbon (C) of an amino acid residue. It also ranges from −180° to +180° and affects the conformation of the backbone following the alpha carbon. The Ramachandran plot is a graphical representation of the allowed regions for the ϕ and ψ angles in a protein structure. It helps visualise the steric constraints and preferred conformations of the polypeptide backbone. The plot is divided into regions corresponding to different secondary structures. The alpha helix typically has (ϕ, ψ) angles of approximately (−57°, −47°). Beta sheets typically have (ϕ, ψ) angles of approximately (−119°, 113°) for parallel sheets and (−139°, 135°) for antiparallel sheets. A random coil covers a broader range of angles but avoids regions where steric clashes occur. Torsion angles can be calculated from the coordinates of the atoms in the polypeptide backbone. For a given residue i, the phi angle $\phi_{i}$ is defined by the four consecutive backbone atoms C(i−1), N(i), Cα(i), and C(i).
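The calculation of a torsion angle from backbone coordinates can be sketched with a generic four-point dihedral function. The sketch relies on the standard convention that ϕ of residue i is the dihedral over C(i−1), N(i), Cα(i), C(i) and ψ is the dihedral over N(i), Cα(i), C(i), N(i+1). The helper names and the toy coordinates are hypothetical; real coordinates would come from a structure file.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (range -180 to +180) for four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Phi of residue i from C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Psi of residue i from N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)


# Toy coordinates (not a real peptide) chosen to give a +90 degree dihedral.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applying `phi` and `psi` to every interior residue of a chain yields the (ϕ, ψ) pairs that are plotted in a Ramachandran plot.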

Z-Scale

The Z-scale is a set of numerical descriptors of the physicochemical properties of amino acids [137] and is widely applied in protein structure analysis and machine learning. These descriptors are derived from experimental data and are used to predict protein structure and function. When applied to a protein sequence, they characterise each residue and can help predict the secondary and tertiary structures of proteins by providing insights into how amino acids interact with each other.

3.4. Secondary Structure-Based Methods

To investigate the secondary structure profiles of AMPs, we introduced computational methods grounded in secondary structure prediction algorithms, such as SSEC or SSEB, enabling the residue-level analysis of the α-helix, β-sheet, and coil conformations.

SSEC

The secondary structure element content (SSEC) is a descriptor used in bioinformatics to quantify the secondary structure content of protein sequences [138]. Secondary structures are the local conformations that a protein chain can adopt, such as α-helices, β-sheets, and coils. Understanding the secondary structure content is crucial for predicting the overall structure and function of proteins. The secondary structure types are α-helices (H), β-sheets (E), and coils (C). H corresponds to helical structures that are stabilised by hydrogen bonds between backbone atoms. E refers to extended structures formed by hydrogen bonds between adjacent strands. C represents regions that do not form regular secondary structures. SSEC calculates the proportion of each secondary structure type within the protein sequence. This is typically done by assigning a secondary structure type to each residue and expressing the count of each type as a percentage or fraction of the total sequence length. Per-residue labels can be predicted from the sequence with tools such as PSIPRED, which uses machine learning models trained on known protein structures, or assigned from known three-dimensional coordinates with DSSP. The secondary structure content can provide insights into protein stability, folding, and function. For example, proteins with a high α-helix content might be more stable in certain environments, whereas those rich in β-sheets might be involved in interactions with other molecules.
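The SSEC calculation described above reduces to counting labels. The following Python sketch assumes that a per-residue H/E/C string has already been obtained from a predictor or an assignment tool; the function name `ssec` is illustrative.

```python
from collections import Counter


def ssec(ss_string):
    """Fraction of helix (H), strand (E), and coil (C) labels in a per-residue string."""
    if not ss_string:
        raise ValueError("secondary structure string must not be empty")
    counts = Counter(ss_string)
    total = len(ss_string)
    # Labels that never occur get a fraction of 0.0.
    return {label: counts.get(label, 0) / total for label in "HEC"}


# Hypothetical 8-residue label string, used only to show the call.
print(ssec("HHHHEECC"))
```

The three fractions form a fixed-length feature vector regardless of peptide length, which is convenient for classical classifiers.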

SSEB

Secondary structure element binary (SSEB) is a method used in bioinformatics to represent the secondary structure of protein sequences in a binary format [139]. Each amino acid position in the sequence is classified as one of three secondary structure types, namely, α-helix (H), β-sheet (E), or coil (C), and each type is represented by a binary vector: H by [1, 0, 0], E by [0, 1, 0], and C by [0, 0, 1]. This encoding is easy to use in machine learning models and other computational analyses. The per-residue labels are typically generated via machine learning models trained on known examples of secondary structures. These models can be based on various features, such as amino acid composition, sequence complexity, and physicochemical properties. The binary representation makes SSEB suitable for high-throughput analysis, allowing for the rapid screening of large datasets to identify secondary structure patterns in protein sequences. SSEB can also be integrated with other bioinformatics tools, for example, combined with more detailed secondary structure prediction methods or used as a preliminary step in structural and functional studies.
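The binary encoding can be sketched in a few lines of Python. The mapping follows the H, E, and C vectors given in the text; the names `SSEB_CODES` and `sseb` are illustrative, and the input label string is assumed to come from an upstream predictor.

```python
# One binary vector per secondary structure label, as listed in the text.
SSEB_CODES = {
    "H": [1, 0, 0],  # alpha-helix
    "E": [0, 1, 0],  # beta-sheet
    "C": [0, 0, 1],  # coil
}


def sseb(ss_string):
    """Encode a per-residue H/E/C string as a list of 3-bit vectors."""
    try:
        return [SSEB_CODES[label] for label in ss_string]
    except KeyError as exc:
        raise ValueError(f"unknown secondary structure label: {exc.args[0]!r}")


# Hypothetical label string, used only to show the call.
print(sseb("HHEC"))
```

Unlike SSEC, the output grows with the sequence length, so padding or truncation to a fixed length is usually needed before feeding it to a model.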

4. Methodologies in AMP Generation

In the early research stage, deep learning methods were usually used to design new antimicrobial peptides. Typical methods include recurrent neural networks (RNNs) [140] and long short-term memory (LSTM) networks [141]. RNNs use memory units to process an entire sequence and achieved some success in the early stage, but they process only one input at a time [142,143]. Moreover, owing to the vanishing and exploding gradient problems of RNNs, performance during inference is usually not ideal. Some researchers subsequently proposed the use of LSTM to generate antimicrobial peptides. LSTM uses a gating mechanism and a recurrent loop to ensure that the error signal, in the form of a gradient, is not lost during processing, and it is often used to process long sequences. Following this approach, Müller et al. trained RNNs with LSTM units and proposed a network that focuses on generating linear cationic peptides with amphipathic helices, capturing the characteristics most relevant to antimicrobial activity [44]. The network is trained to predict the next amino acid at each position in the input, and it was used to design new antimicrobial peptides. They used the CAMP AMP prediction tool to evaluate the antimicrobial activity of the generated sequences and compared them with the training data, random sequences, and manually designed helical sequences. The Euclidean distance between the generated sequences and the training data in the global peptide descriptor space was calculated to analyse the similarity. Finally, a helical wheel plot was generated to visualise the amphipathic helical structure of the top-ranked sequences. Bolatchiev et al. used the LSTM model proposed by Müller et al. in 2018 to generate a wider range of antimicrobial peptide sequences by modifying the input dataset [144]. They first obtained 3100 antimicrobial peptide sequences with lengths of more than seven amino acid residues from the APD3 database as a dataset and performed model training and sequence generation. The generated sequences were then screened via the CAMP AMP prediction tool, and sequences predicted to have a high probability of being antimicrobial peptides (greater than 0.950) by three algorithms, namely, support vector machine, random forest, and discriminant analysis, were selected [68]. Furthermore, 35 sequences with activity against specific microbial species were screened via the algorithm proposed by Vishnepolsky et al. [145]. Finally, five peptide sequences were selected for synthesis and experimental verification. The AlphaFold algorithm was used to model the spatial structure of the synthetic peptides, which were subsequently compared with known antimicrobial peptides [146]. The HAPPENN classifier was used to predict their hemolytic potential, and molecular dynamics simulations were performed to study their interactions with bacterial membranes [147].
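The probability-based screening step can be illustrated with a short Python sketch. It assumes that a candidate is kept only when every classifier reports a probability above 0.950, which is one reading of the procedure and is not stated explicitly in the text. The classifier keys, the sequences, and the scores are hypothetical placeholders, not outputs of the CAMP server.

```python
def passes_screen(scores, threshold=0.950):
    """True if every classifier probability exceeds the threshold."""
    return all(p > threshold for p in scores.values())


def screen(candidates, threshold=0.950):
    """Return the sequences whose scores all exceed the threshold."""
    return [seq for seq, scores in candidates.items()
            if passes_screen(scores, threshold)]


# Hypothetical sequences and scores, for illustration only.
candidates = {
    "KLLKKLLKKLLKKL": {"svm": 0.991, "rf": 0.972, "da": 0.988},
    "GAVLGAVLGAVL": {"svm": 0.962, "rf": 0.913, "da": 0.970},
}
print(screen(candidates))
```

Requiring agreement across classifiers trades recall for precision, which suits a pipeline in which only a handful of peptides can be synthesised.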

Another typical approach is the variational autoencoder (VAE) [148]. The VAE consists of an encoder and a decoder. The encoder converts molecules into latent vector representations, and the decoder attempts to reconstruct the input molecule from the latent representation. Owing to its good performance, the VAE has been widely used to design new antimicrobial peptides [148]. Following this approach, Dean et al. proposed two generative architectures based on the VAE [52,53]. They used the VAE to design antimicrobial peptides from scratch. The model was trained on a large number of antimicrobial peptide sequences, and they experimentally verified that the generated peptides had antimicrobial activity. Das et al. proposed an improved VAE method in which generation is guided by trained classifiers [149]; 20 generated sequences were tested, and the method achieved excellent performance. Wang et al. proposed the LSSAMP model, which is composed of an encoder, a decoder, a multiscale VQ-VAE, and a prior model [150,151]. The sequence features and secondary structure information are mapped to a shared latent space, thereby generating antimicrobial peptides with desirable sequence properties and secondary structures. Ghorbani et al. trained a binary classifier network to predict antimicrobial peptides and evaluate the quality of the generated peptides [152]. They used a variational autoencoder combined with a variational attention mechanism to learn the latent space of antimicrobial peptides and generate new antimicrobial peptides. The physicochemical properties of the generated antimicrobial peptides were subsequently analysed and compared with those of real antimicrobial peptides. Zhao et al. proposed a conditional denoising VAE framework for generating novel AMPs [51]. The model incorporates explicit physicochemical property constraints (e.g., charge, hydrophobicity) as conditional inputs to guide both the encoder and decoder during training, ensuring that the generated peptides align with predefined biological requirements. To address data scarcity and noise robustness, denoising layers are integrated into the architecture, enabling the stable generation of high-fidelity AMPs even under noisy or limited training data conditions. A multiobjective loss function harmonises three components: a reconstruction loss for sequence fidelity, a KL divergence term for latent space regularisation, and a property preservation (PP) loss to enforce adherence to target physicochemical profiles. This unified optimisation strategy enhances the model’s ability to produce functionally validated AMP candidates while maintaining structural diversity. In 2022, Hasegawa et al. proposed Feedback-AVPGAN, a GAN enhanced with a feedback loop for in silico discovery of antiviral peptides (AVPs), to address the scarcity of experimentally validated AVPs [153]. Traditional GANs struggle with limited training data, as the discriminator relies solely on known AVPs, which are rare. Their framework integrates a transformer-based classifier into the feedback loop to iteratively refine the discriminator’s training using both real and synthetic peptide sequences. They used ten features as inputs, including the molecular volume, molecular weight, disulphide bonds, isoelectric point (pI), GRAVY (grand average of hydropathy), aromaticity, instability index, and secondary structure fraction (α-helix, β-sheet, and random coil).
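To make the three-term objective described above for the conditional denoising VAE more concrete, the Python sketch below combines a reconstruction term, the closed-form KL divergence of a diagonal Gaussian from a standard normal prior, and a property term. The mean-squared-error form of the property preservation loss and the weights `beta` and `gamma` are assumptions made for illustration; they are not taken from the cited work.

```python
import math


def reconstruction_loss(pred_probs, target_idx):
    """Mean negative log-likelihood of the true residue at each position."""
    nll = -sum(math.log(p[t]) for p, t in zip(pred_probs, target_idx))
    return nll / len(target_idx)


def kl_divergence(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def property_loss(pred_props, target_props):
    """Assumed form: mean squared error between predicted and target properties."""
    sq = sum((p - t) ** 2 for p, t in zip(pred_props, target_props))
    return sq / len(target_props)


def total_loss(pred_probs, target_idx, mu, logvar,
               pred_props, target_props, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; beta and gamma are hypothetical weights."""
    return (reconstruction_loss(pred_probs, target_idx)
            + beta * kl_divergence(mu, logvar)
            + gamma * property_loss(pred_props, target_props))
```

Here `pred_probs` holds one probability distribution over residues per sequence position, `mu` and `logvar` are the encoder outputs, and the property vectors could contain quantities such as net charge and hydrophobicity. In practice these terms would be computed on tensors inside a deep learning framework so that gradients can flow through them.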

In recent years, generative adversarial networks (GANs) have become a very popular architecture for generating highly realistic content [154,155]. A GAN has two components, a generator and a discriminator, that compete with each other during training. The generator produces data, and the discriminator tries to distinguish the generated data from the real data. Because training is adversarial, the GAN does not require an explicitly assumed data distribution. Tucs et al. proposed the PepGAN method, which uses a GAN to generate active peptides and avoid overlap with inactive peptides [47]. Van Oort et al. proposed AMPGAN v2, an antimicrobial peptide generation method that uses a bidirectional conditional GAN to learn data-driven priors and control antimicrobial peptide generation via conditional variables [49,156]. Moreover, CAMP R3 was used to verify the generated antimicrobial peptide candidate sequences, and a large proportion (89%) of the generated samples were predicted to have antimicrobial properties [81]. Ferrell et al. proposed AMP-GAN, a generation model based on the conditional generative adversarial network [156]. AMP-GAN consists of three modules: an encoder, a generator, and a discriminator. The encoder is based on a bidirectional GAN (BiGAN) and maps real AMP sequences to representations in the latent space. The generator produces new AMP sequences via a conditional generative adversarial network (CGAN); that is, it takes into account a given conditional vector, such as the sequence length, microbial target, target mechanism, and MIC50 value, when generating sequences. The goal of the generator is to produce sequences that can deceive the discriminator, and this adversarial training drives it to generate more realistic peptide sequences. In 2019, Gupta et al. proposed a feedback generative adversarial network (FBGAN) for optimising the DNA sequences that encode antimicrobial peptides [157]. The method combines a GAN with a feedback mechanism to optimise the gene sequences encoding antimicrobial peptides, with the aim of improving the functional and structural quality of the peptides and, in turn, their antimicrobial activity. The performance of AMP-encoding genes was evaluated, and the AMP analyser was trained on 2600 AMPs and 2600 random peptides from the APD3 database. Next, the edit distance between the generated AMPs and known AMPs was calculated to evaluate sequence similarity and diversity. Moreover, the physicochemical properties (such as length, molecular weight, and charge) of the generated proteins and known AMPs were compared. Finally, the generated peptides were folded, and the edit distances between the generated sequences after feedback and the natural sequences were compared.
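The edit-distance comparison mentioned above can be sketched with the classic Levenshtein dynamic programme. This is a generic implementation, not the code used in the cited study; `nearest_known` is an illustrative helper that reports how far a generated peptide is from its closest known AMP.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_known(candidate, known_sequences):
    """Smallest edit distance from a candidate to any known sequence."""
    return min(edit_distance(candidate, k) for k in known_sequences)
```

A small minimum distance suggests that the generator is reproducing training sequences, whereas a very large one suggests that it has drifted away from the known AMP family, so the distribution of these distances is a simple novelty and diversity check.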

Diffusion models are a popular and powerful framework that is widely used in generative modelling [158,159]. They are a family of generative algorithms that learn to synthesise data by gradually denoising random noise through an iterative process. Inspired by thermodynamics, they consist of a forward process, which incrementally adds noise to the data until it becomes Gaussian, and a reverse process, in which a neural network is trained to reconstruct the original data by predicting and removing noise stepwise. Unlike GANs or VAEs, diffusion models excel at generating high-fidelity, diverse outputs while avoiding mode collapse, leveraging gradient-based learning for stable training. They have revolutionised domains such as image generation (e.g., DALL·E 2 [160], Stable Diffusion [161]), audio synthesis, and molecular design, with extensions such as guided diffusion enabling precise control over generated properties (e.g., antimicrobial activity and structural motifs [162]). The ability of these models to balance quality and diversity makes them particularly powerful for biomedical applications, such as the design of novel peptides with tailored functional characteristics. Cao et al. addressed AMP diversity via a text-guided conditional denoising diffusion (TG-CDDPM) model for generating novel AMPs [163]. TG-CDDPM is a three-stage framework that leverages denoising diffusion principles to generate novel, homologous AMPs with enhanced diversity, with the aim of overcoming the limitations in cost, scalability, and diversity of earlier AMP generation models, including LSTMs, VAEs, and GANs. Wang et al. proposed an integrated deep learning framework, Diff-AMP, which can automatically complete the generation, recognition, attribute prediction, and iterative optimisation of AMPs [54]. They incorporate dynamic diffusion and attention mechanisms into the reinforcement learning framework for efficient processing.
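The forward (noising) process described above has a closed form, which the Python sketch below illustrates: a sample at step t is drawn as sqrt(ᾱ_t)·x₀ + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1 − β). The linear schedule from 1e-4 to 0.02 is a commonly used default and is an assumption here, as is the idea that `x0` is a continuous embedding of a peptide sequence; neither is taken from the AMP models cited above.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta) up to and including step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in a single step using the closed form."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
noisy = q_sample([1.0, 0.0, 0.0], t=500, alpha_bars=alpha_bars,
                 rng=random.Random(0))
```

As t grows, ᾱ_t shrinks towards zero and the sample approaches pure Gaussian noise. The reverse process, which is not shown, trains a network to predict the added noise so that the steps can be undone one at a time.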

5. Challenges and Limitations

Data scarcity.

One of the primary challenges in applying video-driven AI methods to AMP predictive modelling is the limited availability of high-quality, labelled data. AMP datasets are often small and imbalanced, which can hinder the training and generalisation of deep learning models. As outlined in the Dataset section, fewer than 100,000 experimentally validated AMPs are currently available for model training. Furthermore, the scarcity of annotated non-AMP sequences compels researchers to rely on ad hoc methods, such as off-the-shelf tools or custom computational pipelines, to generate synthetic non-AMP datasets. These data limitations fundamentally undermine model performance, leading to suboptimal generalisability and feature extraction biases. For example, the elevated misclassification rates of random sequences (a common negative control) reveal inherent biases in learned representations, where models may conflate noise with discriminative patterns.

Lack of unified evaluation standards.

The absence of a unified evaluation standard in the AMP generation field hampers direct comparisons across studies, making it difficult to objectively assess which computational models perform best. To compensate for this methodological gap, researchers are increasingly compelled to rely on wet-lab experiments for validation—an approach that significantly escalates labour demands, time investments, and operational costs.

Translational gaps.

Only a small fraction of AI-generated peptides have shown bioactivity in experiments, underscoring the need for wet-lab validation.

Scalability.

While AI-based generation methods reduce computational costs, the chemical space grows exponentially with peptide length (20^L possible sequences of length L over the 20 canonical amino acids), making global searches impractical.
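The scale of the problem is easy to check numerically. The short Python sketch below counts the linear sequences over the 20 canonical amino acids for a few lengths; the chosen lengths are arbitrary examples, and non-canonical residues or modifications would enlarge the space further.

```python
def sequence_space_size(length, alphabet_size=20):
    """Number of distinct linear sequences of a given length."""
    return alphabet_size ** length


for length in (10, 20, 30):
    print(length, f"{sequence_space_size(length):.3e}")
```

Each additional residue multiplies the space by 20, which is why exhaustive enumeration becomes infeasible even for short peptides and why generative models that sample promising regions are attractive.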

Computational efficiency.

Training and deploying complex video-driven AI models, such as those involving diffusion models or transformers, can be computationally intensive. This can be a barrier to their widespread adoption, especially in resource-constrained environments.

Model interpretability.

Many video-driven AI models, such as CNNs and transformers, are known for their black-box nature. This lack of interpretability can be a significant drawback in biological applications where understanding the underlying mechanisms is crucial.

Adaptation to the biological context.

While video-driven AI methods have shown great success in computer vision tasks, their direct application to AMP predictive modelling requires careful adaptation. The unique characteristics of biological data, such as sequence variability and functional diversity, necessitate specialised model architectures and training strategies.

6. Discussion and Conclusions

This paper systematically examines the components of video-driven AI-based AMP generation frameworks, encompassing the following four dimensions: datasets, feature encoding methods, methodologies in AMP generation, and challenges and limitations. For the dataset part, we introduced AMP, non-AMP, and real peptide databases. AMP databases are generally curated repositories (e.g., APD3, DRAMP) and provide experimentally validated AMP sequences for model training. Synthetic non-AMP datasets, created via rule-based sampling or adversarial training, enhance discrimination capability by contrasting functional and nonfunctional sequences. Real peptide databases (e.g., UniProt) can augment training with natural non-AMP peptides, improving model generalisability and evaluation robustness. In the Feature Encoding Methods section, we introduced the following three categories: mapping-based methods, physicochemical-based methods, and secondary structure-based methods. Mapping-based methods (e.g., one-hot encoding, k-mer tokenisation) provide bidirectional sequence–feature translation, ensuring interpretability and reversibility by linking abstract vectors to biophysically meaningful sequences. Physicochemical-based methods involve the quantitative profiling of hydrophobicity, charge, and amphipathicity via tools such as Peptides (R package) or PyBioMed to determine structure–activity relationships. The secondary structure-based methods involve residue-level conformation analysis (α-helix, β-sheet, and coil) via algorithms to capture structural determinants of antimicrobial function. The Methodologies in AMP Generation section introduced AMP generative models, including RNN, VAE, GAN, and diffusion architectures. RNNs generate sequences autoregressively, typically with long short-term memory (LSTM) units. VAEs explore the latent space with encoder–decoder frameworks. GANs use adversarial training for high-fidelity AMP synthesis. Diffusion models use iterative denoising for diverse, property-constrained generation. In the Challenges and Limitations section, data scarcity, the lack of unified evaluation standards, translational gaps, scalability, computational efficiency, model interpretability, and adaptation to the biological context are discussed.

The synergy of AI and experimental biology holds transformative potential for addressing antimicrobial resistance. Emerging paradigms, such as diffusion models with structural conditioning and hybrid neurosymbolic frameworks, offer avenues to overcome current limitations. Success will require interdisciplinary collaboration, merging computational innovation with rigorous biochemical validation to advance next-generation peptide therapeutics.

Furthermore, we discuss the strengths, weaknesses, and future directions of the video-driven AI methods that were reviewed. Video-driven AI methods, such as GANs, VAEs, and diffusion models, have demonstrated remarkable performance in capturing complex patterns and generating high-quality outputs. These methods can leverage large-scale video data to learn robust feature representations, which can be beneficial for AMP predictive modelling. Despite their strengths, these methods face several challenges when applied to AMP predictive modelling. The limitations identified in Section 5, such as data scarcity, model interpretability, and computational efficiency, highlight the need for tailored solutions. To address the identified gaps and challenges, we suggest several future research directions. These include developing data augmentation techniques to expand AMP datasets, exploring interpretable AI methods to enhance model transparency, and optimising model architectures for computational efficiency.

Author Contributions

J.Y. conceived the research framework, designed the content structure, and wrote the original draft of the review. Z.C. conducted the literature collection and analysis. Y.L. contributed to the material collection and grammar checking. J.C. focused on comparative studies and case analysis during the writing and revision processes. W.X. analysed the theoretical framework development, conducted comprehensive technical validation, and revised the paper. Y.Q. systematically revised the manuscript for logical coherence and academic rigour, whereas X.W. supervised the project design, manuscript structure, and critical revisions throughout the study. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Postdoctoral Fellowship Program of CPSF under Grant Number GZC20233322, the General Program of the Natural Science Foundation of Chongqing under Grant CSTB2024NSCQ-MSX0479, the Chongqing Postdoctoral Foundation Special Support Program under Grant 2023CQBSHTB3119, and the China Postdoctoral Science Foundation under Grant 2024MD754244. J.Y. was supported by Grant GZC20233322; J.C. was the recipient of the Macau Polytechnic University graduate scholarship; and W.X. was supported by Grants CSTB2024NSCQ-MSX0479, 2023CQBSHTB3119, and 2024MD754244. The funders had no role in the study design, data collection, interpretation, or decision to submit the work for publication.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Klemens, J.; Anna, M.; Joachim, J. Antibiotic resistance in the environment. Nat. Rev. Microbiol. 2021, 19, 254–266.
World Bank. Drug-Resistant Infections: A Threat to Our Economic Future; Technical Report; World Bank Group: Washington, DC, USA, 2017.
Laxminarayan, R.; Duse, A.; Wattal, C.; Zaidi, A.K.; Wertheim, H.F.; Sumpradit, N.; Vlieghe, E.; Hara, G.L.; Gould, I.M.; Goossens, H.; et al. Antibiotic resistance—the need for global solutions. Lancet Infect. Dis. 2013, 13, 1057–1098.
Hancock, R.E.W.; Sahl, H.G. Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat. Biotechnol. 2006, 24, 1551–1557.
Zasloff, M. Antimicrobial peptides of multicellular organisms. Nature 2002, 415, 389–395.
Cheng, G.; Hao, H.; Xie, S.; Wang, X.; Dai, M.; Huang, L.; Yuan, Z. Antibiotic alternatives: The substitution of antibiotics in animal husbandry? Front. Microbiol. 2014, 5, 217.
Gálvez, A.; Abriouel, H.; López, R.L.; Omar, N.B. Bacteriocin-based strategies for food biopreservation. Int. J. Food Microbiol. 2007, 120, 51–70.
Dodds, D.R. Antibiotic resistance: A current epilogue. Biochem. Pharmacol. 2017, 134, 139–146.
Marques, C.R.; Sousa, C.; Moutinho, C.; Matos, C.; Vinha, A.F. Characterization of Dietary Constituents, Phytochemicals, and Antioxidant Capacity of Carpobrotus edulis Fruit: Potential Application in Nutrition. Appl. Sci. 2025, 15, 5599.
World Health Organization. Antimicrobial Resistance: Global Report on Surveillance; Technical Report; World Health Organization: Geneva, Switzerland, 2014.
Ferreira, H.S.; Mouga, T.; Lourenço, S.; Matias, M.H.; Freitas, M.V.; Afonso, C.N. Assessing High-Value Bioproducts from Seaweed Biomass: A Comparative Study of Wild, Cultivated and Residual Pulp Sources. Appl. Sci. 2025, 15, 5745.
World Health Organization. Antimicrobial Resistance; Technical Report; World Health Organization: Geneva, Switzerland, 2023.
Shang, Q.; Li, Z.; Wang, J.; Zou, L.; Xing, Z.; Ni, J.; Liu, X.; Chen, G.; Chen, Z.; Jiang, Z. Unraveling the Immobilization Mechanisms of Biochar and Humic Acid on Heavy Metals: DOM Insights from EEMs-PARAFAC and 2D-COS Analysis. Appl. Sci. 2025, 15, 5803.
Centers for Disease Control and Prevention. Antibiotic Use in the United States, 2017: Progress and Opportunities; Technical Report; CDC: Atlanta, GA, USA, 2017.
Bhati, D.; Hayes, M. From Ocean to Market: Technical Applications of Fish Protein Hydrolysates in Human Functional Food, Pet Wellness, Aquaculture and Agricultural Bio-Stimulant Product Sectors. Appl. Sci. 2025, 15, 5769.
Laxminarayan, R.; Chaudhury, R.R. Antibiotic resistance in India: Drivers and opportunities for action. PLoS Med. 2016, 13, e1001974.
Luo, Y.; Song, D.; Zhang, C.; Su, A. ProLinker–Generator: Design of a PROTAC Linker Base on a Generation Model Using Transfer and Reinforcement Learning. Appl. Sci. 2025, 15, 5616.
Ventola, C.L. The antibiotic resistance crisis: Part 1: Causes and threats. Pharm. Ther. 2015, 40, 277–283.
Kang, H.K.; Kim, C.; Seo, C.H.; Park, Y. The therapeutic applications of antimicrobial peptides (AMPs): A patent review. J. Microbiol. 2017, 55, 1–12.
Frieden, T.R. Antibiotic Resistance Threats in the United States; Technical Report 4; Centers for Disease Control and Prevention: Atlanta, GA, USA, 2019.
Cheng, S.; Song, J.; Zhou, M.; Wei, X.; Pu, H.; Luo, J.; Jia, W. EF-DETR: A lightweight transformer-based object detector with an encoder-free neck. IEEE Trans. Ind. Inform. 2024, 20, 12994–13002. [Google Scholar] [CrossRef]
Zhang, W.; Zhou, M.; Ji, C.; Sui, X.; Bai, J. Cross-frame transformer-based spatio-temporal video super-resolution. IEEE Trans. Broadcast. 2022, 68, 359–369. [Google Scholar] [CrossRef]
Song, J.; Zhou, M.; Luo, J.; Pu, H.; Feng, Y.; Wei, X.; Jia, W. Boundary-Aware Feature Fusion with Dual-Stream Attention for Remote Sensing Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5600213. [Google Scholar] [CrossRef]
Shen, Y.; Feng, Y.; Fang, B.; Zhou, M.; Kwong, S.; Qiang, B.H. DSRPH: Deep semantic-aware ranking preserving hashing for efficient multi-label image retrieval. Inf. Sci. 2020, 539, 145–156. [Google Scholar] [CrossRef]
Zhou, M.; Zhang, Y.; Li, B.; Lin, X. Complexity correlation-based CTU-level rate control with direction selection for HEVC. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2017, 13, 1–23. [Google Scholar] [CrossRef]
Xue, M.; He, J.; Wang, W.; Zhou, M. Low-light image enhancement via clip-fourier guided wavelet diffusion. arXiv 2024, arXiv:2401.03788. [Google Scholar]
Gao, T.; Sheng, W.; Zhou, M.; Fang, B.; Luo, F.; Li, J. Method for fault diagnosis of temperature-related mems inertial sensors by combining Hilbert–Huang transform and deep learning. Sensors 2020, 20, 5633. [Google Scholar] [CrossRef]
Wei, X.; Zhou, M.; Wang, H.; Yang, H.; Chen, L.; Kwong, S. Recent advances in rate control: From optimization to implementation and beyond. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 17–33. [Google Scholar] [CrossRef]
Zhou, M.; Wu, X.; Wei, X.; Xiang, T.; Fang, B.; Kwong, S. Low-light enhancement method based on a Retinex model for structure preservation. IEEE Trans. Multimed. 2023, 26, 650–662. [Google Scholar] [CrossRef]
Zhou, M.; Shen, W.; Wei, X.; Luo, J.; Jia, F.; Zhuang, X.; Jia, W. Blind Image Quality Assessment: Exploring Content Fidelity Perceptibility via Quality Adversarial Learning. Int. J. Comput. Vis. 2025, 133, 3242–3258. [Google Scholar] [CrossRef]
Zhou, M.; Lan, X.; Wei, X.; Liao, X.; Mao, Q.; Li, Y.; Wu, C.; Xiang, T.; Fang, B. An end-to-end blind image quality assessment method using a recurrent network and self-attention. IEEE Trans. Broadcast. 2022, 69, 369–377. [Google Scholar] [CrossRef]
Shen, W.; Zhou, M.; Luo, J.; Li, Z.; Kwong, S. Graph-represented distribution similarity index for full-reference image quality assessment. IEEE Trans. Image Process. 2024, 33, 3075–3089. [Google Scholar] [CrossRef]
Zhou, M.; Leng, H.; Fang, B.; Xiang, T.; Wei, X.; Jia, W. Low-light image enhancement via a frequency-based model with structure and texture decomposition. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23. [Google Scholar] [CrossRef]
Catalano, A.; Iacopetta, D.; Ceramella, J.; Scumaci, D.; Giuzio, F.; Saturnino, C.; Aquaro, S.; Rosano, C.; Sinicropi, M.S. Multidrug resistance (MDR): A widespread phenomenon in pharmacological therapies. Molecules 2022, 27, 616. [Google Scholar] [CrossRef]
Msemburi, W.; Karlinsky, A.; Knutson, V.; Aleshin-Guendel, S.; Chatterji, S.; Wakefield, J. The WHO estimates of excess mortality associated with the COVID-19 pandemic. Nature 2023, 613, 130–137. [Google Scholar] [CrossRef]
Sulayyim, H.J.A.; Ismail, R.; Hamid, A.A.; Ghafar, N.A. Antibiotic resistance during COVID-19: A systematic review. Int. J. Environ. Res. Public Health 2022, 19, 11931. [Google Scholar] [CrossRef] [PubMed]
Langford, B.J.; Soucy, J.P.R.; Leung, V.; So, M.; Kwan, A.T.; Portnoff, J.S.; Bertagnolio, S.; Raybardhan, S.; MacFadden, D.R.; Daneman, N. Antibiotic resistance associated with the COVID-19 pandemic: A systematic review and meta-analysis. Clin. Microbiol. Infect. 2023, 29, 302–309. [Google Scholar] [CrossRef]
Jindal, A.; Pandya, K.; Khan, I. Antimicrobial resistance: A public health challenge. Med. J. Armed Forces India 2015, 71, 178–181. [Google Scholar] [CrossRef] [PubMed]
Rathinakumar, R.; Walkenhorst, W.F.; Wimley, W.C. Broad-spectrum antimicrobial peptides by rational combinatorial design and high-throughput screening: The importance of interfacial activity. J. Am. Chem. Soc. 2009, 131, 7609–7617. [Google Scholar] [CrossRef]
Büyükkiraz, E.; Mine, Z. Antimicrobial peptides (AMPs): A promising class of antimicrobial compounds. J. Appl. Microbiol. 2022, 132, 1573–1596. [Google Scholar] [CrossRef] [PubMed]
Rathinakumar, R.; Wimley, W.C. High-throughput discovery of broad-spectrum peptide antibiotics. FASEB J. 2010, 24, 3232.
Musin, K.; Asyanova, E. How Machine Learning Helps in Combating Antimicrobial Resistance: A Review of AMP Analysis and Generation Methods. Int. J. Pept. Res. Ther. 2025, 31, 59.
da Cunha, N.B.; Cobacho, N.B.; Viana, J.F.; Lima, L.A.; Sampaio, K.B.; Dohms, S.S.; Ferreira, A.C.; de la Fuente-Núñez, C.; Costa, F.F.; Franco, O.L.; et al. The next generation of antimicrobial peptides (AMPs) as molecular therapeutic tools for the treatment of diseases with social and economic impacts. Drug Discov. Today 2017, 22, 234–248.
Muller, A.T.; Hiss, J.A.; Schneider, G. Recurrent neural network model for constructive peptide design. J. Chem. Inf. Model. 2018, 58, 472–479.
Li, C.; Sutherland, D.; Richter, A.; Coombe, L.; Yanai, A.; Warren, R.L.; Kotkoff, M.; Hof, F.; Hoang, L.M.; Helbing, C.C.; et al. De novo synthetic antimicrobial peptide design with a recurrent neural network. Protein Sci. 2024, 33, e5088.
Shaon, M.S.H.; Karim, T.; Sultan, M.F.; Ali, M.M.; Ahmed, K.; Hasan, M.Z.; Moustafa, A.; Bui, F.M.; Al-Zahrani, F.A. AMP-RNNpro: A two-stage approach for identification of antimicrobials using probabilistic features. Sci. Rep. 2024, 14, 12892.
Tucs, A.; Tran, D.P.; Yumoto, A.; Ito, Y.; Uzawa, T.; Tsuda, K. Generating ampicillin-level antimicrobial peptides with activity-aware generative adversarial networks. ACS Omega 2020, 5, 22847–22851.
Yu, H.; Wang, R.; Qiao, J.; Wei, L. Multi-CGAN: Deep generative model-based multiproperty antimicrobial peptide design. J. Chem. Inf. Model. 2023, 64, 316–326.
Van Oort, C.M.; Ferrell, J.B.; Remington, J.M.; Wshah, S.; Li, J. AMPGAN v2: Machine learning-guided design of antimicrobial peptides. J. Chem. Inf. Model. 2021, 61, 2198–2207.
Hou, K.; Zhao, W.; He, T. Physicochemical Property-guided Conditional VAE for Antimicrobial Peptides Generation. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; pp. 3279–3282.
Zhao, W.; Hou, K.; Shen, Y.; Hu, X. A Conditional Denoising VAE-based Framework for Antimicrobial Peptides Generation with Preserving Desirable Properties. Bioinformatics 2025, 41, btaf069.
Dean, S.N.; Alvarez, J.A.E.; Zabetakis, D.; Walper, S.A.; Malanoski, A.P. PepVAE: Variational autoencoder framework for antimicrobial peptide generation and activity prediction. Front. Microbiol. 2021, 12, 725727.
Dean, S.N.; Walper, S.A. Variational autoencoder for generation of antimicrobial peptides. ACS Omega 2020, 5, 20746–20754.
Wang, R.; Wang, T.; Zhuo, L.; Wei, J.; Fu, X.; Zou, Q.; Yao, X. Diff-AMP: Tailored designed antimicrobial peptide framework with all-in-one generation, identification, prediction and optimization. Brief. Bioinform. 2024, 25, bbae078.
Chen, T.; Vure, P.; Pulugurta, R.; Chatterjee, P. AMP-diffusion: Integrating latent diffusion with protein language models for antimicrobial peptide generation. bioRxiv 2024, 2024-03.
Yan, J.; Cai, J.; Zhang, B.; Wang, Y.; Wong, D.F.; Siu, S.W. Recent progress in the discovery and design of antimicrobial peptides using traditional machine learning and deep learning. Antibiotics 2022, 11, 1451.
Xu, J.; Li, F.; Li, C.; Guo, X.; Landersdorfer, C.; Shen, H.H.; Peleg, A.Y.; Li, J.; Imoto, S.; Yao, J.; et al. iAMPCN: A deep-learning approach for identifying antimicrobial peptides and their functional activities. Brief. Bioinform. 2023, 24, bbad240.
Zhao, J.; Liu, H.; Kang, L.; Gao, W.; Lu, Q.; Rao, Y.; Yue, Z. deep-AMPpred: A Deep Learning Method for Identifying Antimicrobial Peptides and Their Functional Activities. J. Chem. Inf. Model. 2025, 65, 997–1008.
Li, C.; Sutherland, D.; Hammond, S.A.; Yang, C.; Taho, F.; Bergman, L.; Houston, S.; Warren, R.L.; Wong, T.; Hoang, L.M.; et al. AMPlify: Attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genom. 2022, 23, 77.
Lin, T.T.; Yang, L.Y.; Lu, I.H.; Cheng, W.C.; Hsu, Z.R.; Chen, S.H.; Lin, C.Y. AI4AMP: An antimicrobial peptide predictor using physicochemical property-based encoding method and deep learning. mSystems 2021, 6, e00299-21.
Yan, J.; Zhang, B.; Zhou, M.; Campbell-Valois, F.X.; Siu, S.W. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 2023, 8, e00345-23.
Chung, C.R.; Chien, C.Y.; Tang, Y.; Wu, L.C.; Hsu, J.B.K.; Lu, J.J.; Lee, T.Y.; Bai, C.; Horng, J.T. An ensemble deep learning model for predicting minimum inhibitory concentrations of antimicrobial peptides against pathogenic bacteria. iScience 2024, 27, 110718.
Bournez, C.; Riool, M.; de Boer, L.; Cordfunke, R.A.; de Best, L.; van Leeuwen, R.; Drijfhout, J.W.; Zaat, S.A.; van Westen, G.J. CalcAMP: A new machine learning model for the accurate prediction of antimicrobial activity of peptides. Antibiotics 2023, 12, 725.
Ebrahimikondori, H.; Sutherland, D.; Yanai, A.; Richter, A.; Salehi, A.; Li, C.; Coombe, L.; Kotkoff, M.; Warren, R.L.; Birol, I. Structure-aware deep learning model for peptide toxicity prediction. Protein Sci. 2024, 33, e5076.
Khabbaz, H.; Karimi-Jafari, M.H.; Saboury, A.A.; BabaAli, B. Prediction of antimicrobial peptides toxicity based on their physico-chemical properties using machine learning techniques. BMC Bioinform. 2021, 22, 549.
Taho, F. Antimicrobial Peptide Host Toxicity Prediction with Transfer Learning for Proteins. Ph.D. Thesis, University of British Columbia, Vancouver, BC, Canada, 2020.
Wang, G.; Li, X.; Wang, Z. APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 2016, 44, D1087–D1093.
Gawde, U.; Chakraborty, S.; Waghu, F.H.; Barai, R.S.; Khanderkar, A.; Indraguru, R.; Shirsat, T.; Idicula-Thomas, S. CAMPR4: A database of natural and synthetic antimicrobial peptides. Nucleic Acids Res. 2023, 51, D377–D383.
Ye, G.; Wu, H.; Huang, J.; Wang, W.; Ge, K.; Li, G.; Zhong, J.; Huang, Q. LAMP2: A major update of the database linking antimicrobial peptides. Database 2020, 2020, baaa061.
Pirtskhalava, M.; Amstrong, A.A.; Grigolava, M.; Chubinidze, M.; Alimbarashvili, E.; Vishnepolsky, B.; Gabrielian, A.; Rosenthal, A.; Hurt, D.E.; Tartakovsky, M. DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 2021, 49, D288–D297.
Ma, T.; Liu, Y.; Yu, B.; Sun, X.; Yao, H.; Hao, C.; Li, J.; Nawaz, M.; Jiang, X.; Lao, X.; et al. DRAMP 4.0: An open-access data repository dedicated to the clinical translation of antimicrobial peptides. Nucleic Acids Res. 2025, 53, D403–D410.
Yao, L.; Guan, J.; Xie, P.; Chung, C.R.; Zhao, Z.; Dong, D.; Guo, Y.; Zhang, W.; Deng, J.; Pang, Y.; et al. dbAMP 3.0: Updated resource of antimicrobial activity and structural annotation of peptides in the post-pandemic era. Nucleic Acids Res. 2025, 53, D364–D376.
Wu, H.; Chen, R.; Li, X.; Zhang, Y.; Zhang, J.; Yang, Y.; Wan, J.; Zhou, Y.; Chen, H.; Li, J.; et al. ESKtides: A comprehensive database and mining method for ESKAPE phage-derived antimicrobial peptides. Database 2024, 2024, baae022.
Yan, J.; Zhang, B.; Zhou, M.; Kwok, H.F.; Siu, S.W. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput. Biol. Med. 2022, 147, 105717.
van Wijk, K.J.; Leppert, T.; Sun, Z.; Kearly, A.; Li, M.; Mendoza, L.; Guzchenko, I.; Debley, E.; Sauermann, G.; Routray, P.; et al. Detection of the Arabidopsis proteome and its post-translational modifications and the nature of the unobserved (dark) proteome in PeptideAtlas. J. Proteome Res. 2023, 23, 185–214.
The UniProt Consortium. UniProt: The Universal protein knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D617.
Wang, Z.; Wang, G. APD: The antimicrobial peptide database. Nucleic Acids Res. 2004, 32, D590–D592.
Wang, G.; Li, X.; Wang, Z. APD2: The updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res. 2009, 37, D933–D937.
Thomas, S.; Karnik, S.; Barai, R.S.; Jayaraman, V.K.; Idicula-Thomas, S. CAMP: A useful resource for research on antimicrobial peptides. Nucleic Acids Res. 2010, 38, D774–D780.
Waghu, F.H.; Gopi, L.; Barai, R.S.; Ramteke, P.; Nizami, B.; Idicula-Thomas, S. CAMP: Collection of sequences and structures of antimicrobial peptides. Nucleic Acids Res. 2014, 42, D1154–D1158.
Waghu, F.H.; Barai, R.S.; Gurung, P.; Idicula-Thomas, S. CAMPR3: A database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res. 2016, 44, D1094–D1097.
Madden, T. The BLAST sequence analysis tool. NCBI Handb. 2013, 2, 425–436.
Sievers, F.; Higgins, D.G. Clustal Omega. Curr. Protoc. Bioinform. 2014, 48, 3–13.
Tsang, H.S. Vector Alignment Search Tool (VAST) Automated Protein Structure Comparison Using Special Structural Elements; The Johns Hopkins University: Baltimore, MD, USA, 2007.
Zhao, X.; Wu, H.; Lu, H.; Li, G.; Huang, Q. LAMP: A database linking antimicrobial peptides. PLoS ONE 2013, 8, e66557.
Gogoladze, G.; Grigolava, M.; Vishnepolsky, B.; Chubinidze, M.; Duroux, P.; Lefranc, M.P.; Pirtskhalava, M. DBAASP: Database of antimicrobial activity and structure of peptides. FEMS Microbiol. Lett. 2014, 357, 63–68.
Pirtskhalava, M.; Gabrielian, A.; Cruz, P.; Griggs, H.L.; Squires, R.B.; Hurt, D.E.; Grigolava, M.; Chubinidze, M.; Gogoladze, G.; Vishnepolsky, B.; et al. DBAASP v. 2: An enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic Acids Res. 2016, 44, D1104–D1112.
Fan, L.; Sun, J.; Zhou, M.; Zhou, J.; Lao, X.; Zheng, H.; Xu, H. DRAMP: A comprehensive data repository of antimicrobial peptides. Sci. Rep. 2016, 6, 24482.
Kang, X.; Dong, F.; Shi, C.; Liu, S.; Sun, J.; Chen, J.; Li, H.; Xu, H.; Lao, X.; Zheng, H. DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci. Data 2019, 6, 148.
Shi, G.; Kang, X.; Dong, F.; Liu, Y.; Zhu, N.; Hu, Y.; Xu, H.; Lao, X.; Zheng, H. DRAMP 3.0: An enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Res. 2022, 50, D488–D496.
Jhong, J.H.; Chi, Y.H.; Li, W.C.; Lin, T.H.; Huang, K.Y.; Lee, T.Y. dbAMP: An integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res. 2019, 47, D285–D297.
Jhong, J.H.; Yao, L.; Pang, Y.; Li, Z.; Chung, C.R.; Wang, R.; Li, S.; Li, W.; Luo, M.; Ma, R.; et al. dbAMP 2.0: Updated resource for antimicrobial peptides with an enhanced scanning method for genomic and proteomic data. Nucleic Acids Res. 2022, 50, D460–D470.
Bhadra, P.; Yan, J.; Li, J.; Fong, S.; Siu, S.W. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 2018, 8, 1697.
Yan, J.; Bhadra, P.; Li, A.; Sethiya, P.; Qin, L.; Tai, H.K.; Wong, K.H.; Siu, S.W. Deep-AmPEP30: Improve short antimicrobial peptides prediction with deep learning. Mol. Ther. Nucleic Acids 2020, 20, 882–894.
Desiere, F.; Deutsch, E.W.; King, N.L.; Nesvizhskii, A.I.; Mallick, P.; Eng, J.; Chen, S.; Eddes, J.; Loevenich, S.N.; Aebersold, R. The PeptideAtlas project. Nucleic Acids Res. 2006, 34, D655–D658.
Deutsch, E.W. The PeptideAtlas project. Proteome Bioinform. 2010, 604, 285–296.
Bairoch, A. The SWISS-PROT protein sequence database: Its relevance to human molecular medical research. J. Mol. Med. 1997, 75, 312–316.
Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.C.; Estreicher, A.; Gasteiger, E.; Martin, M.J.; Michoud, K.; O’Donovan, C.; Phan, I.; et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31, 365–370.
Boutet, E.; Lieberherr, D.; Tognolli, M.; Schneider, M.; Bansal, P.; Bridge, A.J.; Poux, S.; Bougueleret, L.; Xenarios, I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: How to use the entry view. In Plant Bioinformatics: Methods and Protocols; Springer: Berlin/Heidelberg, Germany, 2016; pp. 23–54.
Apweiler, R.; Bairoch, A.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2004, 32, D115–D119.
The UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019, 47, D506–D515.
Chen, Z.; Chen, Y.Z.; Wang, X.F.; Wang, C.; Yan, R.X.; Zhang, Z. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PLoS ONE 2011, 6, e22930.
Chen, Z.; Zhou, Y.; Song, J.; Zhang, Z. hCKSAAP_UbSite: Improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Biochim. Biophys. Acta (BBA)-Proteins Proteom. 2013, 1834, 1461–1467.
Linding, R.; Jensen, L.J.; Diella, F.; Bork, P.; Gibson, T.J.; Russell, R.B. Protein disorder prediction: Implications for structural proteomics. Structure 2003, 11, 1453–1459.
Obradovic, Z.; Peng, K.; Vucetic, S.; Radivojac, P.; Dunker, A.K. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins Struct. Funct. Bioinform. 2005, 61, 176–182.
Peng, K.; Radivojac, P.; Vucetic, S.; Dunker, A.K.; Obradovic, Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinform. 2006, 7, 208.
Romero, P.; Obradovic, Z.; Li, X.; Garner, E.C.; Brown, C.J.; Dunker, A.K. Sequence complexity of disordered protein. Proteins Struct. Funct. Bioinform. 2001, 42, 38–48.
Ward, J.J.; Sodhi, J.S.; McGuffin, L.J.; Buxton, B.F.; Jones, D.T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 2004, 337, 635–645.
Wright, P.E.; Dyson, H.J. Intrinsically unstructured proteins: Re-assessing the protein structure-function paradigm. J. Mol. Biol. 1999, 293, 321–331.
Fukuchi, S.; Hosoda, K.; Homma, K.; Gojobori, T.; Nishikawa, K. Binary classification of protein molecules into intrinsically disordered and ordered segments. BMC Struct. Biol. 2011, 11, 29.
Liang, Y.; Ma, X. iACP-GE: Accurate identification of anticancer peptides by using gradient boosting decision tree and extra tree. SAR QSAR Environ. Res. 2023, 34, 1–19.
Yu, H.; Luo, X. IPPF-FE: An integrated peptide and protein function prediction framework based on fused features and ensemble models. Brief. Bioinform. 2023, 24, bbac476.
Basith, S.; Lee, G.; Manavalan, B. STALLION: A stacking-based ensemble learning framework for prokaryotic lysine acetylation site prediction. Brief. Bioinform. 2022, 23, bbab376.
Chen, Z.; He, N.; Huang, Y.; Xu, C.; Liu, H.; Hu, J.; Xia, J.; Hu, H.; Li, D. Integration of a Deep Learning Classifier with a Random Forest Approach for Predicting Malonylation Sites. Genom. Proteom. Bioinform. 2018, 16, 451–459.
Govindan, G.; Nair, A.S. Composition, Transition and Distribution (CTD) – a dynamic feature for predictions based on hierarchical structure of cellular sorting. In Proceedings of the 2011 Annual IEEE India Conference, Hyderabad, India, 16–18 December 2011; pp. 1–6.
Hou, R.; Wu, J.; Xu, L.; Zou, Q.; Wu, Y.J. Computational prediction of protein arginine methylation based on composition–transition–distribution features. ACS Omega 2020, 5, 27470–27479.
Shen, H.B.; Chou, K.C. PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 2008, 373, 386–388.
Chou, K.C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. 2009, 6, 262–274.
Chou, K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21, 10–19.
Wang, C.; Wang, W.; Lu, K.; Zhang, J.; Chen, P.; Wang, B. Predicting drug-target interactions with electrotopological state fingerprints and amphiphilic pseudo amino acid composition. Int. J. Mol. Sci. 2020, 21, 5694.
Zuo, Y.; Li, Y.; Chen, Y.; Li, G.; Yan, Z.; Yang, L. PseKRAAC: A flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics 2017, 33, 122–124.
Chen, K.; Kurgan, L.A.; Ruan, J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct. Biol. 2007, 7, 25.
Tung, C.W. Prediction of pupylation sites using the composition of k-spaced amino acid pairs. J. Theor. Biol. 2013, 336, 11–17.
Zulfiqar, H.; Ahmed, Z.; Ma, C.Y.; Khan, R.S.; Grace-Mercure, B.K.; Yu, X.L.; Zhang, Z.Y. Comprehensive prediction of lipocalin proteins using artificial intelligence strategy. Front. Biosci.-Landmark 2022, 27, 84.
Rao, B.; Zhou, C.; Zhang, G.; Su, R.; Wei, L. ACPred-Fuse: Fusing multi-view information improves the prediction of anticancer peptides. Brief. Bioinform. 2020, 21, 1846–1855.
Chen, D.; Li, Y. PredMHC: An effective predictor of major histocompatibility complex using mixed features. Front. Genet. 2022, 13, 875112.
Madugula, S.S.; Pujar, P.; Nammi, B.; Wang, S.; Jayasinghe-Arachchige, V.M.; Pham, T.; Mashburn, D.; Artiles, M.; Liu, J. Identification of family-specific features in Cas9 and Cas12 proteins: A machine learning approach using complete protein feature spectrum. J. Chem. Inf. Model. 2024, 64, 4897–4911.
Gaffar, S.; Tayara, H.; Chong, K.T. Stack-AAgP: Computational prediction and interpretation of anti-angiogenic peptides using a meta-learning framework. Comput. Biol. Med. 2024, 174, 108438.
Beckstette, M.; Homann, R.; Giegerich, R.; Kurtz, S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinform. 2006, 7, 389.
Kawashima, S.; Ogata, H.; Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 1999, 27, 368–369.
Kawashima, S.; Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 2000, 28, 374.
Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2007, 36, D202–D205.
Kanehisa, M.; Goto, S.; Kawashima, S.; Nakaya, A. The KEGG databases at GenomeNet. Nucleic Acids Res. 2002, 30, 42–46.
Eddy, S.R. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 2004, 22, 1035–1036.
Ahmad, S.; Gromiha, M.M.; Sarai, A. Real value prediction of solvent accessibility from amino acid sequence. Proteins Struct. Funct. Bioinform. 2003, 50, 629–635.
Li, H.; Hou, J.; Adhikari, B.; Lyu, Q.; Cheng, J. Deep learning methods for protein torsion angle prediction. BMC Bioinform. 2017, 18, 417.
Tian, F.; Zhou, P.; Li, Z. T-scale as a novel vector of topological descriptors for amino acids and its application in QSARs of peptides. J. Mol. Struct. 2007, 830, 106–115.
Rost, B.; Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins Struct. Funct. Bioinform. 1994, 19, 55–72.
Sun, S.; Thomas, P.D.; Dill, K.A. A simple protein folding algorithm using a binary code and secondary structure constraints. Protein Eng. Des. Sel. 1995, 8, 769–778.
Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306.
Al-Selwi, S.M.; Hassan, M.F.; Abdulkadir, S.J.; Muneer, A.; Sumiea, E.H.; Alqushaibi, A.; Ragab, M.G. RNN-LSTM: From applications to modeling techniques and beyond – Systematic review. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102068.
Fang, W.; Chen, Y.; Xue, Q. Survey on research of RNN-based spatio-temporal sequence prediction algorithms. J. Big Data 2021, 3, 97.
Zhao, J.; Huang, F.; Lv, J.; Duan, Y.; Qin, Z.; Li, G.; Tian, G. Do RNN and LSTM have long memory? In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 11365–11375.
Bolatchiev, A.; Baturin, V.; Shchetinin, E.; Bolatchieva, E. Novel antimicrobial peptides designed using a recurrent neural network reduce mortality in experimental sepsis. Antibiotics 2022, 11, 411.
Vishnepolsky, B.; Pirtskhalava, M. Prediction of linear cationic antimicrobial peptides based on characteristics responsible for their interaction with the membranes. J. Chem. Inf. Model. 2014, 54, 1512–1523.
Nussinov, R.; Zhang, M.; Liu, Y.; Jang, H. AlphaFold, artificial intelligence (AI), and allostery. J. Phys. Chem. B 2022, 126, 6372–6383.
Timmons, P.B.; Hewage, C.M. HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks. Sci. Rep. 2020, 10, 10869.
Mak, H.W.L.; Han, R.; Yin, H.H. Application of variational autoEncoder (VAE) model and image processing approaches in game design. Sensors 2023, 23, 3457.
Das, P.; Sercu, T.; Wadhawan, K.; Padhi, I.; Gehrmann, S.; Cipcigan, F.; Chenthamarakshan, V.; Strobelt, H.; Dos Santos, C.; Chen, P.Y.; et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 2021, 5, 613–623.
Wang, D.; Wen, Z.; Li, L.; Zhou, H.; Ye, F. Accelerating antimicrobial peptide discovery with latent sequence-structure model. In Proceedings of the ICLR 2023-Machine Learning for Drug Discovery Workshop, Kigali, Rwanda, 5 May 2023.
Wang, D.; Wen, Z.; Ye, F.; Li, L.; Zhou, H. Accelerating Antimicrobial Peptide Discovery with Latent Structure. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 2243–2255.
Ghorbani, M.; Prasad, S.; Brooks, B.R.; Klauda, J.B. Deep attention based variational autoencoder for antimicrobial peptide discovery. bioRxiv 2022, 2022-07.
Hasegawa, K.; Moriwaki, Y.; Terada, T.; Wei, C.; Shimizu, K. Feedback-AVPGAN: Feedback-guided generative adversarial network for generating antiviral peptides. J. Bioinform. Comput. Biol. 2022, 20, 2250026.
Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65.
Durgadevi, M. Generative Adversarial Network (GAN): A general review on different variants of GAN and applications. In Proceedings of the 2021 6th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 8–10 July 2021; pp. 1–8.
Ferrell, J.B.; Remington, J.M.; Van Oort, C.M.; Sharafi, M.; Aboushousha, R.; Janssen-Heininger, Y.; Schneebeli, S.T.; Wargo, M.J.; Wshah, S.; Li, J. A generative approach toward precision antimicrobial peptide design. bioRxiv 2020, 2020-10.
Gupta, A.; Zou, J. Feedback GAN for DNA optimizes protein functions. Nat. Mach. Intell. 2019, 1, 105–111.
Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869.
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39.
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2021, arXiv:2112.10752.
Cao, H.; Tan, C.; Gao, Z.; Xu, Y.; Chen, G.; Heng, P.A.; Li, S.Z. A survey on generative diffusion models. IEEE Trans. Knowl. Data Eng. 2024, 36, 2814–2830.
Cao, J.; Zhang, J.; Yu, Q.; Ji, J.; Li, J.; He, S.; Zhu, Z. TG-CDDPM: Text-guided antimicrobial peptides generation based on conditional denoising diffusion probabilistic model. Brief. Bioinform. 2025, 26, bbae644.

Table 1. Online database information summary.

Category	Name	Number of entries
AMP Datasets	APD [67]	5099
	CAMP [68]	24,243
	LAMP [69]	23,253
	DBAASP [70]	23,600
	DRAMP [71]	30,260
	dbAMP [72]	33,065
	ESKtides [73]	12,067,248
Non-AMP Generation Methods	Normal [74]	–
	Random [74]	–
	Shuffle [74]	–
Peptide Datasets	PeptideAtlas [75]	3,979,590
	UniProt [76]	252,761,752

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
