Thinking on the Construction of Antimicrobial Peptide Databases: Powerful Tools for the Molecular Design and Screening

With the accelerating growth of antimicrobial resistance (AMR), there is an urgent need for new antimicrobial agents with low or no AMR. Antimicrobial peptides (AMPs) have been extensively studied as alternatives to antibiotics (ATAs). Coupled with the new generation of high-throughput technology for AMP mining, the number of derivatives has increased dramatically, but screening them manually is time-consuming and laborious. Therefore, it is necessary to establish databases that combine computer algorithms to summarize, analyze, and design new AMPs. A number of AMP databases have already been established, such as the Antimicrobial Peptides Database (APD), the Collection of Antimicrobial Peptides (CAMP), the Database of Antimicrobial Activity and Structure of Peptides (DBAASP), and the Database of Antimicrobial Peptides (dbAMP). These four AMP databases are comprehensive and widely used. This review covers the construction, evolution, characteristic functions, prediction, and design capabilities of these four AMP databases. It also offers ideas for improving and applying these databases by merging the respective advantages of the four peptide libraries. This review aims to promote research and development into new AMPs and to lay a foundation for their druggability and use in clinical precision treatment.


Introduction
Antibiotics represent one of the major discoveries made in the field of health during the 20th century. Starting with the discovery of penicillin in 1942 as the first key milestone, antibiotics have greatly benefited humanity, playing a key role in the treatment of human and animal diseases. However, due to the long-term abuse of antibiotics, especially in animal husbandry, many bacteria have developed AMR over time. These bacteria include Staphylococcus aureus, Streptococcus, Escherichia coli, and other species. Some of them have developed multi-drug resistance quickly, which significantly reduces the efficacy of antibiotic treatment [1][2][3][4]. The first sulfonamide drug with a special resistance mechanism was reported in 1937, but the threat of AMR received little attention at that time. After drug-resistant plasmids were first reported in 1960, the number of antimicrobial-resistant bacteria increased steadily over the nearly 30 years that followed [1,5]. There is now an urgent need for a series of new ATAs to address this issue. Figure 1 shows the timeline of resistance development for the major classes of antibiotics.
AMPs are produced naturally in organisms and act as an innate defense system against invading pathogens via diverse mechanisms of action [6]. Melittin and magainin were first discovered by Fennell and Zasloff in 1967 and 1987, respectively [7,8]. In Sweden, Boman's team discovered and reported typical antibacterial peptides known as cecropins from the insect Hyalophora cecropia during the 1970s and 1980s [9][10][11][12][13], marking a key moment in the development of AMP science. From 1980 to 2000, AMPs, including defensins, [...] number of AMPs, the processes of in vivo/in vitro, one-by-one, and step-by-step verification consume considerable resources in terms of design, screening, and confirmation. Most AMPs suffer serious limitations with regard to low yield, instability, and toxicity. Therefore, it is necessary to establish an AMP database and to combine it with computer algorithms to efficiently and accurately predict and design new AMPs [52,53] and to further validate the iron triangle theory [26][27][28][29][30][31] and its application in health maintenance. More than ten AMP databases have been established to collect and classify AMPs so far, including APD3, DBAASPv3, CAMP3, dbAMP2, ANTIMIC, YADAMP, LAMP2, DRAMP3.0, CyBase, and PenBase [54][55][56][57][58][59]. Among them, the first four are the most popular because of their superior tools, large data resources, and powerful functions, which attract more users [60].
These four databases were first built in 2005, 2008, 2014, and 2018 [56][60][61][62][63][64] and updated in 2016, 2016, 2021, and 2022, respectively [55][65][66][67]. The data resources and analytical functions of AMP databases are their essential features. AMP databases are increasingly recognized as bioinformatics resources for identifying, predicting, and designing new AMP derivatives with improved properties. For example, non-hemolytic anti-MRSA AMPs from plant sources have been designed using these tools [42,56,68]. Although a variety of AMP databases have been established, they have not been applied fully or extensively because their predictions are not yet reliable enough for design work [60]. In practice, only their data acquisition and prediction functions are used. Further resources are urgently needed to support additional requirements such as AMP mining [68], DNA editing, AMP AI editing [69], complex BI analysis [70], computer-aided design [71], and chemical and synthetic biology [72][73][74][75]. There is therefore considerable room for improvement in addressing these shortcomings and achieving these goals, and meeting the new requirements for AMPs in human and animal health practice remains a major challenge. The evolution of antibiotics, AMPs, and AMP databases is shown in the timeline in Figure 1.
In this paper, the advantages, disadvantages, applications, and challenges associated with four AMP databases are reviewed, and some suggestions are provided for constructing databases capable of quick screening and accurate prediction. Focusing on the design workflow and drawing on these four typical AMP databases, key principles that combine their advantages and strengthen their tools are put forward as a new scheme.

Four Typical AMP Databases
AMP databases usually feature a number of functions, such as large datasets with logical classification, accurate prediction abilities, fast searching, and unique computer algorithms. Their most important features are prediction tools and abundant data from different sources. These prediction tools were developed by analyzing the physicochemical properties, toxicity, and specificity of AMPs. Four databases (DBAASP, CAMP, APD, and dbAMP) are the most popular so far (see Tables 1 and 2); their advantageous modules are shown in Figure 3. They are introduced one by one in the following sections [61].

DBAASP
DBAASP is a manually curated database that collects experimentally validated AMPs and allows their physicochemical properties to be predicted or analyzed [56]. Recently, the 3D structures of the AMPs in this database were updated [63]. Presently, a total of 18,719 entries have been collected and classified in DBAASP (Tables 1 and 2) [66]. Because it collects validated AMPs from laboratory studies, it is the most comprehensive database for evaluating the antimicrobial activity, cytotoxicity, and hemolysis of target peptides. Users can search by peptide ID, name, synthesis type, sequence, length, C-terminal and N-terminal modification, family source, intracellular target, UniProt ID, 3D structure, hemolysis, and other fields to obtain the target sequence. Another advantage of DBAASP compared to other databases is its capacity to learn the structural and functional relationships of AMPs (Figure 3). However, instability, molecular weight, secondary structure, and half-life parameters should be added or supplemented if possible, and more machine learning (ML) algorithms should be adopted to improve the accuracy of prediction results.

APD
The APD database was established in 2003 by Guangshun Wang's team and has been updated in recent years. It initially contained 1228 peptides (including 65 anticancer peptides, 76 antiviral peptides, 327 antifungal peptides, and 994 antibacterial peptides) and offers search capability [76], statistical analysis, structure-function relationships, and other AMP indexes [66]. Currently, there are 3425 AMPs in the APD3 database, which are mainly derived from natural species; this figure is reliable and very close to the actual number of reported active AMPs (Tables 1 and 2). Through this database, the physical and chemical properties of AMPs, including their molecular size, isoelectric point, hydrophilicity, structure, hydrophobic residues, protein-binding capacity, and net charge, can be predicted and calculated. Another feature is the AMP timeline module, which allows a better understanding of how AMP research has developed over time (Figure 3). This database is considered to be the best tool for learning about the development of AMPs and for predicting their physical and chemical properties. The APD3 database is nevertheless in need of improvement. Its coverage of candidates and physicochemical properties is relatively limited: for example, some derived peptides with better antibacterial activity are not included in the library or are not classified in detail. Furthermore, in-depth analysis of peptides active against Gram-negative and Gram-positive bacteria, more family sources, and a better ability to predict the potential and toxicity of AMPs should be integrated.
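The kind of property calculation described above can be illustrated with a minimal Python sketch. It is not the APD calculator: the function name `describe`, the side-chain-only charge rule (Lys/Arg as +1, Asp/Glu as -1, termini and histidine ignored), and the hydrophobic residue set are simplifying assumptions made here for illustration. Melittin, mentioned in the Introduction, is used as the example sequence.

```python
# Simplified physicochemical descriptors for a peptide sequence.
# Assumptions of this sketch (not taken from APD): charge is counted from
# side chains only at neutral pH, and the hydrophobic set is fixed below.

POSITIVE = set("KR")            # Lys, Arg counted as +1
NEGATIVE = set("DE")            # Asp, Glu counted as -1
HYDROPHOBIC = set("AILMFVWC")   # assumed hydrophobic residues


def describe(sequence: str) -> dict:
    """Return length, net charge, and hydrophobic content of a peptide."""
    seq = sequence.upper()
    length = len(seq)
    net_charge = sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)
    hydrophobic = sum(aa in HYDROPHOBIC for aa in seq)
    return {
        "length": length,
        "net_charge": net_charge,
        "hydrophobic_count": hydrophobic,
        "hydrophobic_ratio": hydrophobic / length if length else 0.0,
    }


melittin = "GIGAVLKVLTTGLPALISWIKRKRQQ"
print(describe(melittin))
```

A production tool would also account for terminal groups, histidine protonation, and pH-dependent pKa values; the sketch only shows how such descriptors follow directly from the sequence.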

CAMP
The Collection of Antimicrobial Peptides (CAMP), established by Shaini Thomas in 2010, is a free online database that includes mature ML algorithms for various AMPs. It initially included 3782 AMPs: 2766 AMPs from experimentally verified patents/non-patents and 1016 predicted sequences [62]. Its latest version features 10,247 sequences, containing 8164 AMPs, 2083 patented AMPs, 757 structures, and 114 AMPs with family-specific features (Tables 1 and 2). The best feature of CAMP is its prediction tools based on ML algorithms such as Random Forest (RF), Support Vector Machine (SVM), and Discriminant Analysis (DA), which achieve accuracy levels of 93.2%, 91.5%, and 87.5%, respectively. This database marks the relationship between sequence structure and antibacterial activity for the first time and is useful for searching sequence activities and for determining their specificity and relationships with AMPs [77,78]. By analyzing sequence signatures consisting of patterns and Hidden Markov Models (HMMs) from 1386 experimentally studied AMPs, 45 AMP families have been generated in this database. It is expected that sequence optimization algorithms for the rational design of AMPs will be widely used (Figure 3) in design practice in the future. In addition, the statistical results and derived peptides relating to the physicochemical properties of AMPs, such as hydrophobicity, net charge, instability, amphipathicity, and toxicity, should be further improved.

dbAMP
The dbAMP database is the largest database and was developed by Tzong-Yi Lee in 2018 [64]. It initially contained 12,389 AMPs retrieved from the NCBI, UniProt, PDB, and AMP databases such as APD3, CAMPR3, ADAM, PhytAMP, AMPer, Antip2, BACTIBASE, and LAMP. References can be retrieved by querying the searchable fields of AMP-related articles individually [64]. The latest version, updated in 2022, includes 26,447 AMPs and 2262 antimicrobial proteins, with 4579 references [79] (Tables 1 and 2). It also provides quick access to transcriptomic and proteomic data from all species and simulates the 3D structures of AMPs online. Thus far, a total of 458 3D-structured AMPs have been collected and are available to users [65]. Compared with other databases, its best feature is the capacity to predict the activity of AMPs on different targets (bacteria, viruses, cancer cells, fungi, and mammals) and to handle the transcriptomic and proteomic data obtained by applying high-throughput technologies such as mass spectrometry (Figure 3). Because of this, it has particular value when dealing with transcriptomic and proteomic data and when analyzing their specificity. In addition, AMPs can be searched by their dbAMP ID number, although this feature works less smoothly than is desirable. Another drawback is that the dbAMP database lacks the ability to predict physicochemical properties such as the hydrophobicity, net charge, amphiphilicity, instability, and hemolysis of AMPs.

ML Methods of the Four AMP Databases
Machine learning (ML) algorithms have been integrated and used in a variety of disciplines such as psychology, biology, and neurophysiology as well as in mathematics and automation. They can improve the production of vaccines and the design and screening of AMPs and target drugs, increasing efficiency and reducing drug application [80]. The combination of biology and ML has greatly promoted the development of bioinformatics, in which many amino acid sequences of AMPs with higher-complexity structures are analyzed quickly, especially when processing high-throughput data from transcriptomics and proteomics [80,81]. At present, some mature ML algorithms are used in prediction software to categorize and analyze data, and the newly developed AMP databases also contain classical ML algorithms such as RF, SVM, DA, Artificial Neural Network (ANN), and Deep Neural Network (DNN) [82][83][84]. Previous studies have shown that ML algorithms are an important feature of databases, especially in the CAMP database, which contains all of the above ML algorithms for the prediction and design of AMPs [77], whereas only parameter spaces and thresholds or cut-off discriminator algorithms are embedded in the APD and DBAASP databases, respectively [66,67]. Many specific AMP databases that incorporate ML algorithms have been established, such as those for linear cationic AMPs (LCAP), hemolytic and non-hemolytic AMPs, and anti-Gram-negative peptides (PHNX) [85][86][87][88][89][90]. For example, the ML algorithms integrated with AntiBP2, Hemolytik, and DASamp1 are ANN and DNN [85,86]. The ML algorithms implemented by the four AMP databases analyzed here, and their derived databases, are summarized in Figure 4.
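To make the idea of sequence-based prediction concrete, the following Python sketch classifies peptides from their amino acid composition. It deliberately uses a nearest-centroid rule, a much simpler stand-in for the RF, SVM, and DA models named above, and it is not the method of any of the four databases. The function names and all training sequences are hypothetical toy data chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> list[float]:
    """20-dimensional amino acid composition (fraction of each residue)."""
    counts = Counter(seq.upper())
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(labelled: dict[str, list[str]]) -> dict[str, list[float]]:
    """Build one composition centroid per class label."""
    return {label: centroid([composition(s) for s in seqs])
            for label, seqs in labelled.items()}


def predict(model: dict[str, list[float]], seq: str) -> str:
    """Assign the label whose centroid is closest in Euclidean distance."""
    vec = composition(seq)
    return min(model, key=lambda label: math.dist(vec, model[label]))


# Hypothetical toy training set: cationic/hydrophobic vs. acidic/polar peptides.
training = {
    "AMP": ["KKLLKKLLKKLL", "RRWWRRWWKK", "GIGKKLLKKAKKF"],
    "non-AMP": ["DDEESSGGTTDE", "EEDDSSTTGGNQ", "SGSGDEDESGTN"],
}
model = train(training)
print(predict(model, "KRLLKKLWKK"))
print(predict(model, "DESGDESGTT"))
```

Real predictors are trained on thousands of curated sequences and richer features; the sketch only shows the general pattern of turning sequences into feature vectors and applying a learned decision rule.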

Challenges Facing the Application of Four AMP Databases
Owing to their rapid development, AMP databases are now widely used in many fields, with two notable strengths: abundant data resources and tools for the prediction and design of AMPs. ML algorithms are increasingly involved in these processes, promoting deep learning on AMPs [86].

Application of APD and CAMP
These two databases have been widely used to design new AMPs, including anti-methicillin-resistant Staphylococcus aureus peptides, hemolytic and non-hemolytic AMPs, anti-HIV-1 peptides, anti-Acinetobacter baumannii peptides, cysteine-free AMPs, and cuttlefish AMPs [91][92][93][94][95]. Combining the two databases helps to design and screen special AMPs. Houyvet reported using APD3 and CAMPR3 to obtain nine AMPs with a length of less than 25 amino acids from cuttlefish (Sepia officinalis) [95].

Application of DBAASP
The hemolytic property of AMPs is one of the major obstacles hindering their clinical application [85]; therefore, it is essential to select candidates with low hemolytic activity. The DBAASP database has been used to design non-hemolytic AMPs against methicillin-resistant Staphylococcus aureus (MRSA). In 2021, Capecchi used DBAASP together with a recurrent neural network (RNN) to design non-hemolytic AMPs; a total of 28 AMPs were synthesized and tested, and eight novel non-hemolytic AMPs against Pseudomonas aeruginosa, Acinetobacter baumannii, and MRSA were finally identified [85,93].

Challenges Facing the Four Databases
Due to rapid development in the field of AMPs, many databases have been established, and their comprehensiveness and accuracy are the two key factors determining their usefulness. The first problem is that, in all AMP databases, the information used for design and prediction is too limited, as only antimicrobial activity (MIC) is considered as a screening index [66,67,78,79]. A single activity index is not enough to consistently identify the best candidate when the AMP is judged as a whole. More functional indexes should be included to meet the full range of practical requirements. In addition to antibacterial activity, low toxicity, stability, specificity, and high yield should be considered during design and evaluation, as they are closely related to the viability, persistence, precision, cost, and other attributes of new candidate drugs [24]. Coordinating these parameters and merging them in a scientifically appropriate way is a major challenge in constructing or improving AMP databases but also represents an opportunity for improvement and optimization.
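The argument that a single activity index is insufficient can be illustrated with a small multi-index ranking sketch in Python. Everything here is an assumption made for illustration: the candidate names and values are hypothetical, half-life is used as a stand-in for stability, and the thresholds and weights are arbitrary choices rather than recommended values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mic_ug_ml: float      # minimum inhibitory concentration; lower is better
    hemolysis_pct: float  # hemolysis at a test concentration; lower is better
    half_life_h: float    # stability proxy; higher is better


def passes_filters(c: Candidate, max_mic=16.0, max_hemolysis=10.0,
                   min_half_life=1.0) -> bool:
    """Hard cut-offs: a candidate must be acceptable on every index."""
    return (c.mic_ug_ml <= max_mic
            and c.hemolysis_pct <= max_hemolysis
            and c.half_life_h >= min_half_life)


def score(c: Candidate, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of activity, safety, and stability, each scaled to 0..1."""
    w_act, w_safe, w_stab = weights
    activity = 1.0 / (1.0 + c.mic_ug_ml)
    safety = 1.0 - min(c.hemolysis_pct, 100.0) / 100.0
    stability = c.half_life_h / (1.0 + c.half_life_h)
    return w_act * activity + w_safe * safety + w_stab * stability


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Drop candidates failing any cut-off, then sort by combined score."""
    return sorted((c for c in candidates if passes_filters(c)),
                  key=score, reverse=True)


# Hypothetical candidates: pep-A has the best MIC but is strongly hemolytic.
pool = [
    Candidate("pep-A", mic_ug_ml=2.0, hemolysis_pct=40.0, half_life_h=4.0),
    Candidate("pep-B", mic_ug_ml=4.0, hemolysis_pct=5.0, half_life_h=3.0),
    Candidate("pep-C", mic_ug_ml=8.0, hemolysis_pct=2.0, half_life_h=0.5),
    Candidate("pep-D", mic_ug_ml=8.0, hemolysis_pct=1.0, half_life_h=6.0),
]
for c in rank(pool):
    print(c.name, round(score(c), 3))
```

Ranking by MIC alone would select pep-A, whereas the combined scheme excludes it for hemolysis and excludes pep-C for instability, which is the kind of trade-off a multi-index database would need to expose.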

Performance of Database Tools for Screening
Many peptides are found by researchers in vivo and in vitro, and their antibacterial activity and stability cannot be easily ensured. Considering the high cost and labor-intensive experimental identification of AMPs, many computational methods have been proposed for predicting different functional types and for the de novo design of new and more effective antimicrobial agents. In order to enhance the clinical application of AMPs, researchers have tended to focus exclusively on traditional rational design to increase their antibacterial activity, proteolytic resistance, and production [20]. New approaches are necessary, particularly in the field of bioinformatics, as these databases are only partly used to predict and design AMPs. For example, AureinM3 and PT-5 were designed using APD, and their mutants were analyzed by APD and CAMP in 2018 and 2021, respectively [80,81]. In 2018, using bioinformatics software for sequence comparison and the conserved sequences in cathelicidin and aurein, Natthaports designed a series of short hybrid peptides with APD3, I-TASSER, and ExPASy and achieved impressive results [92][93][94][95][96]. These examples indicate that these databases can make reliable predictions for AMPs and can serve as strong BI tools, provided that the database is constructed on a sound scheme oriented toward mining and design.
Based on the above analysis of four AMP databases, an integrative approach to the design and construction of new AMP databases is proposed, built on three key principles defined in the following three paragraphs (see Figure 5). The first key principle covers the following five points: (i) transcriptomic and proteomic data are obtained and analyzed from the dbAMP database using the AMPfinder function; (ii) the hydrophobicity, isoelectric point, amphiphilicity, net charge, and other properties of the target peptide are predicted, screened, and designed by means of the AMP Calculator and predictor in the APD or the tools in the DBAASP; (iii) the druggability of candidate AMPs is analyzed and evaluated using CAMP; (iv) the key activity of AMPs on specific target pathogenic species is predicted using the AMPpredictor and dbAMP database and then tested and screened by in vitro/in vivo trials; (v) products are obtained by expression or chemical synthesis at a reasonable cost, and their bio-activity and mechanism are verified by in vivo/in vitro experiments.
The second key principle also includes five points: (i) AMPs are designed de novo using the DBAASP or dbAMP database, with target search parameters that can be adjusted conveniently through options or input boxes, such as cationic strength and specific action on different Gram types or specific target bacteria, biofilms, or DNA and other target biomacromolecules in pathogens; (ii) the hydrophobicity, isoelectric point, amphipathicity, bio-safety, stability, and other properties of target AMPs are predicted and evaluated by the AMP Calculator, the predictor in the APD, and tools in the DBAASP; (iii) the AMP odds are predicted by CAMP; (iv) the AMPpredictor tool is used to analyze activity strength and spectrum against different species, and target candidate peptides are screened after a series of predictions and analyses; (v) candidate peptides are acquired by expression/synthesis, and their activity and mechanism are verified through in vivo/in vitro experiments.
The third principle concerns derived and modified AMPs, which are fed back into the system through an optimization cycle to achieve the best results. The integrated scheme combines the unique modules of the four AMP databases to screen and predict AMPs, which increases efficiency in designing AMPs.
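The screen-predict-derive cycle described in these three principles can be sketched in Python as a simple loop. This is a toy illustration under stated assumptions, not an implementation of the proposed scheme: the four databases are not queried, `property_screen` stands in for the property tools, `predicted_activity` is a placeholder score with no predictive validity, derived peptides are limited to single lysine substitutions, and the seed sequence is hypothetical.

```python
POSITIVE, NEGATIVE = set("KR"), set("DE")
HYDROPHOBIC = set("AILMFVWC")  # assumed hydrophobic residue set


def net_charge(seq: str) -> int:
    """Side-chain-only charge estimate at neutral pH."""
    return sum(a in POSITIVE for a in seq) - sum(a in NEGATIVE for a in seq)


def hydrophobic_ratio(seq: str) -> float:
    return sum(a in HYDROPHOBIC for a in seq) / len(seq)


def property_screen(seq: str) -> bool:
    """Stand-in for the physicochemical screening step (points ii and iii)."""
    return net_charge(seq) >= 2 and 0.3 <= hydrophobic_ratio(seq) <= 0.6


def predicted_activity(seq: str) -> float:
    """Placeholder for an activity predictor (point iv); illustrative only."""
    return net_charge(seq) * hydrophobic_ratio(seq)


def derive_variants(seq: str) -> list[str]:
    """Derived peptides: every single-residue substitution to lysine."""
    return [seq[:i] + "K" + seq[i + 1:] for i, a in enumerate(seq) if a != "K"]


def optimise(seed: str, rounds: int = 3) -> str:
    """Optimization cycle: derive, screen, keep the best, repeat."""
    best = seed
    for _ in range(rounds):
        pool = [v for v in derive_variants(best) if property_screen(v)]
        if not pool:
            break
        top = max(pool, key=predicted_activity)
        if predicted_activity(top) <= predicted_activity(best):
            break  # no further improvement; stop the cycle
        best = top
    return best


seed = "GLFDALKKLLG"  # hypothetical starting peptide
improved = optimise(seed)
print(seed, "->", improved)
```

In the scheme proposed here, each placeholder would be replaced by the corresponding database module, and the final candidates would still require expression or synthesis and experimental verification (point v).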

Conclusions
AMPs merit more attention than they are currently receiving, and further extensive research is required, as they represent one of the pioneer ATAs with very strong potential, across multiple dimensions, to reduce the stresses and threats of AMR in the ecosystem [8]. Because high-throughput technology yields a large number of AMPs from transcriptome and proteome data, it is expensive and labor-intensive to verify them using the index of antibacterial activity alone during long chains of experiments that may go on for years [52]. Therefore, it is necessary to establish an AMP database that combines bioinformatics technology, computer algorithms, machine learning, data mining, and AI with experimental verification. Today, a number of AMP databases based on computer algorithms are in operation. They play an important role in the prediction, screening, and design of AMPs. However, it remains difficult to choose the best candidate on criteria beyond antimicrobial activity. Using AMP databases, a comprehensive, high-capacity, and efficient analysis of a large data set to determine activity, toxicity, stability, specificity, and expression ability could be carried out quickly by simultaneously using accurate machine learning algorithms and other new powerful BI/AI tools [52,56]. More and better new AMPs could be created quickly by means of these AMP databases, and they could play an important role in the struggle to alleviate the threat posed by AMR to the health ecosystem. We believe that the integrative approach proposed in this paper will lead to the improvement of AMP databases, allowing wide coverage and balance among the three key principles and across final goals including druggability, activity, safety, stability, resistance, and cost [97].