Diversity of the Antimicrobial Peptide Genes in Collembola

Simple Summary: Collembola (springtails) are tiny non-insect hexapods that live in soil and play an important ecological role as detritivores. How they have survived in microbe-rich environments for millions of years is unclear. This study used homology-based gene identification methods to identify antimicrobial peptide genes from collembola genomes and transcriptomes. We analyzed five collembola species representing three main suborders and identified 45 antimicrobial peptide genes from five families: diapausin, Alo, diptericin, defensin, and cecropin. These peptides potentially have broad activity against bacteria, fungi, and viruses. This study highlights collembola as a new source for discovering novel AMPs that may help address the current multidrug-resistant pathogen crisis.

Abstract: Multidrug-resistant bacteria pose a health crisis that threatens the world's population, and scientists are looking for new drugs to combat them. Antimicrobial peptides (AMPs), which are part of the innate immune system of organisms, are a promising new drug class because they can disrupt bacterial cell membranes. This study explored antimicrobial peptide genes in collembola, a non-insect hexapod lineage that has survived in microbe-rich habitats for millions of years and whose antimicrobial peptides have not been thoroughly investigated. We used in silico analysis (homology-based gene identification, physicochemical and antimicrobial property prediction) to identify AMP genes from the genomes and transcriptomes of five collembola representing three main suborders: Entomobryomorpha (Orchesella cincta, Sinella curviseta), Poduromorpha (Holacanthella duospinosa, Anurida maritima), and Symphypleona (Sminthurus viridis). We identified 45 genes belonging to five AMP families, including (a) cysteine-rich peptides: diapausin, defensin, and Alo; (b) a linear α-helical peptide without cysteine: cecropin; and (c) a glycine-rich peptide: diptericin. Frequent gene gains and losses were observed in their evolution. Based on the functions of their orthologs in insects, these AMPs potentially have broad activity against bacteria, fungi, and viruses. This study provides candidate collembolan AMPs for further functional analysis that could lead to medicinal use.


Introduction
The emergence of multidrug-resistant bacteria is currently one of the most acute health crises of the 21st century. In 2019, more than one million deaths globally were estimated to be directly associated with drug-resistant bacteria [1], and the number is predicted to reach 10 million deaths annually by 2050 [2]. It is, thus, an urgent issue that requires global action to promote the correct use of antibiotics in humans and farm animals and to find new therapeutic methods, e.g., phage therapy and new drugs, that can effectively kill bacteria [3].
Antimicrobial peptides (AMPs) are promising new drugs for multidrug-resistant bacteria. They are small polypeptide molecules (12-50 amino acids) that are part of the innate immune system of all living things [4][5][6][7]. They are diverse in form and function; many and discovered new AMPs for future functional study, which may help cope with current drug-resistant pathogen crises.

Collembola Species and Genetic Data
This study investigated AMP genes in five collembola species representing three major collembolan taxonomic orders (Entomobryomorpha: O. cincta, Si. curviseta; Poduromorpha: H. duospinosa, A. maritima; Symphypleona: Sm. viridis). These species were selected based on the availability of their genetic data. RNA-sequencing read data are available for all five species, and whole-genome data have been published for three of them: H. duospinosa, O. cincta, and Si. curviseta [16,22,31,32]. For transcriptome analysis, the RNA-sequencing reads were downloaded from the NCBI database. For genome analysis, gene annotations were conducted online using the NCBI-BLAST tool.

Transcriptome Assembly
To construct a transcriptome for each species, the downloaded SRA data were assembled de novo using programs available in OmicsBox [33]. In brief, the quality of the initial reads was examined using FASTQC [34], and low-quality bases and reads were then removed using Trimmomatic [35]. The cleaned reads were assembled into a transcriptome using Trinity [36]. Redundant transcripts, defined as those sharing a sequence similarity higher than 95%, were filtered out using the CD-HIT-EST program [37]. BUSCO analysis was then used to evaluate the completeness of the transcriptomes [38]. Basic transcriptome statistics were calculated using the Fasta statistics tool on the UseGalaxy server [39].

Identification of AMP Genes
We first downloaded all reported arthropod AMP sequences from the UniProt database [40] using the keywords "arthropod + antimicrobial peptide", which yielded 711 proteins in total (downloaded on 5 February 2021). We then manually filtered out non-AMPs such as large proteins and enzymes (e.g., serine protease inhibitors, Toll-like receptors, and lysozymes). This resulted in 698 AMPs, which were used to construct the AMP database for our analysis. We performed gene annotations for each species, one by one. DNA sequences in each collembolan transcriptome were used in a BlastX search (e-value cut-off = 1 × 10⁻⁵) against the prepared arthropod AMP database using OmicsBox. Contigs with BLAST hits were then grouped according to AMP class. We manually annotated these hits by retrieving the DNA sequences from the transcriptomes and translating them using the Expasy translate tool [41]. The correct translation frames were selected based on the query sequences of the BLAST results. At the end of these steps, we obtained candidate AMP peptide sequences from each species. We refined the search by using these peptides as queries in a tBlastn search against the other collembola transcriptomes, using the local BLAST program (NCBI-blast-2.12.0+) [42], to find additional AMPs that might have been missed in the previous steps.
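The filtering and grouping of BlastX hits described above can be illustrated with a minimal Python sketch. It assumes the hits were exported in the standard 12-column BLAST tabular format, which may differ from the OmicsBox export actually used; the function name, the lookup table `subject_to_family`, and the inclusive treatment of the cut-off are hypothetical choices made only for illustration.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-5  # e-value cut-off stated in the text


def group_hits_by_family(blast_lines, subject_to_family, cutoff=EVALUE_CUTOFF):
    """Group contigs by AMP family for hits that pass the e-value cut-off.

    blast_lines: iterable of tab-separated lines in BLAST tabular format
        (qseqid, sseqid, pident, length, mismatch, gapopen,
         qstart, qend, sstart, send, evalue, bitscore).
    subject_to_family: hypothetical lookup from database entry ID to AMP family.
    """
    families = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        contig, subject, evalue = fields[0], fields[1], float(fields[10])
        # Assumption: a hit exactly at the cut-off is kept.
        if evalue <= cutoff and subject in subject_to_family:
            families[subject_to_family[subject]].add(contig)
    return {family: sorted(contigs) for family, contigs in families.items()}
```

Each contig is listed once per family, even when it hits several database entries of that family, which mirrors grouping contigs by AMP class before manual annotation.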
As some genes might be missing from the transcriptome due to low or no expression in the particular tissue or stage sampled, gene annotations were also performed to find potential additional genes and to complete the gene models in the three collembola species with genome data (O. cincta, Si. curviseta, and H. duospinosa). AMP peptides identified from the transcriptomes were used as queries in a tBlastn search against the whole-genome shotgun (WGS) databases using the online NCBI BLAST (e-value cut-off = 1 × 10⁻⁵). DNA sequences from the BLAST hits were downloaded and used to predict gene models with GeneWise [43]. The cDNA and peptide sequences of each AMP gene were reported. The completeness of each gene model was indicated by a suffix appended to the gene name: Full = N and C termini found, NTE = N terminus missing, CTE = C terminus missing, NC = N and C termini missing.
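The suffix scheme can be written as a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and how the presence of each terminus is determined from a gene model is left outside the sketch.

```python
def completeness_suffix(has_n_terminus: bool, has_c_terminus: bool) -> str:
    """Return the gene-model completeness suffix described in the text."""
    if has_n_terminus and has_c_terminus:
        return "Full"  # N and C termini found
    if has_c_terminus:
        return "NTE"   # N terminus missing
    if has_n_terminus:
        return "CTE"   # C terminus missing
    return "NC"        # N and C termini missing
```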

Physicochemical Property and Functional Prediction
To characterize the physicochemical properties of the candidate AMPs, we first used InterProScan [44] to predict the positions of signal peptides and other signature domains. Signal peptides were removed, and only the mature peptide sequences were used to estimate physicochemical properties. We used EMBOSS PEPSTATS [45] to determine sequence length, molecular weight, percentage of polar amino acids, percentage of positively charged amino acids, percentage of negatively charged amino acids, and percentage of proline and glycine. The isoelectric point and hydrophobic ratio were estimated with the protein report tool in CLC Main Workbench V7.9.1 (QIAGEN, Aarhus, Denmark) [46]. The total net charge was calculated with APD3 [47]. We used Phyre2 to predict the 3D structures of the candidate AMPs based on the best-matching peptide [48].
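To make the composition-based properties concrete, the following Python sketch computes a few of them from a mature peptide sequence. It is a simplified stand-in and not the PEPSTATS or APD3 implementation: it assumes that lysine and arginine are the positively charged residues (histidine is ignored), that aspartate and glutamate are the negatively charged residues, and that net charge can be approximated as their difference. The function name and the returned keys are hypothetical.

```python
def composition_summary(mature_peptide: str) -> dict:
    """Summarize simple composition-based properties of a mature peptide."""
    seq = mature_peptide.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    positive = seq.count("K") + seq.count("R")   # assumption: His not counted
    negative = seq.count("D") + seq.count("E")
    gly_pro = seq.count("G") + seq.count("P")
    return {
        "length": n,
        "percent_positive": 100 * positive / n,
        "percent_negative": 100 * negative / n,
        "percent_gly_pro": 100 * gly_pro / n,
        "cysteine_count": seq.count("C"),
        # Rough approximation of net charge near neutral pH.
        "approx_net_charge": positive - negative,
    }
```

A positive `approx_net_charge` corresponds to a cationic peptide under these simplifying assumptions.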
The antimicrobial properties were predicted with the SVM (Support Vector Machine), RF (Random Forest), ANN (Artificial Neural Network), and DA (Discriminant Analysis) algorithms available in Campr3 [49], the SVM and RF algorithms available in ClassAMP [50], and the AB (antibacterial), AV (antiviral), and AF (antifungal) predictions available in the iAmpPred tool [51]. The probability scores were reported in a heatmap (probability = 1 is red, and probability = 0 is green), which was made using Microsoft Excel. We used the DBAASP database [52,53] to predict the antimicrobial activities of collembolan AMPs against specific strains of bacteria and fungi. Three available methods were used to predict antibacterial activities against five bacterial strains: Escherichia coli ATCC 25922, Pseudomonas aeruginosa ATCC 27853, Klebsiella pneumoniae, Staphylococcus aureus ATCC 25923, and Bacillus subtilis. The methods were a machine learning (ML) approach based on AMP sequences, a clusterization approach based on AMP sequences, and an ML approach based on AMP sequences and bacterial genomes. To predict antifungal activities against Candida albicans and Saccharomyces cerevisiae, we used the only available tool, the ML approach based on AMP sequences. For the antimicrobial prediction, active peptides were defined as those with a predicted minimum inhibitory concentration (MIC) of less than 25 µg/mL, while non-active peptides were defined as those with a MIC greater than 100 µg/mL. Finally, we used the ML approach based on AMP sequences to predict the toxicity (hemolytic activity) of collembolan AMPs against human erythrocytes. Active peptides were those predicted to induce hemolysis greater than 40% at a concentration of less than 40 µg/mL.
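The MIC thresholds above amount to a simple decision rule, sketched here in Python. The text does not say how a predicted MIC between 25 and 100 µg/mL is treated, so the "undetermined" label for that range is an assumption of this sketch, as is the function name.

```python
def classify_mic(mic_ug_per_ml: float) -> str:
    """Classify a predicted MIC (in µg/mL) using the thresholds in the text."""
    if mic_ug_per_ml < 25:
        return "active"        # MIC less than 25 µg/mL
    if mic_ug_per_ml > 100:
        return "non-active"    # MIC greater than 100 µg/mL
    return "undetermined"      # assumption: range not covered by the text
```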

Phylogenetic Analysis
We reconstructed the phylogenetic relationships of collembola and other arthropod AMPs using the maximum likelihood method. To obtain related AMPs from other arthropods, we performed a BlastP search (e-value cut-off = 1 × 10⁻³) against the arthropod proteins in the UniProtKB database using collembola AMPs as queries. Sequences that produced significant BLAST hits were retrieved and used for phylogenetic analyses (Supplementary Table S1). Peptide sequences of the same AMP family were aligned using MAFFT (E-INS-i) [54]. Gappy regions were removed using trimAl [55]. Maximum likelihood trees were constructed using PhyML [56] with SMS automatic model selection [57]. Branch supports were evaluated using an approximate likelihood ratio test (SH-like). An overview of the workflow is shown in Figure 1.

Figure 1. Overview of the AMP gene identification pipeline used in this study. Key steps include transcriptome assembly, AMP gene identification from transcriptomes and genomes, prediction of physicochemical properties, AMP activity, 3D structure, and phylogenetic analysis.

Transcriptome Assembly
We constructed five collembola transcriptomes using RNA-sequencing reads from the NCBI database. The number of reads used for the Trinity assembly ranged from 10 to 52 million, with H. duospinosa and Sm. viridis having the highest and lowest numbers of reads, respectively (Table 1). In the final assemblies, the number of non-redundant unigenes varies between ~31,000 and 72,000. The average length ranges from 616 to 1337 base pairs, and the N50 value ranges from 907 to 2873 base pairs. BUSCO completeness, which measures the presence of conserved orthologous genes (Arthropoda_odb9), is generally high (88-91.5% in four species and 72% in O. cincta), suggesting that the assembled transcriptomes are suitable for gene annotation.
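For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows one common way to compute it from a list of contig lengths: N50 is the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. This is a generic illustration and not the code of the Fasta statistics tool used in the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length
        # covers at least half of the total assembly length.
        if 2 * running >= total:
            return length
```

For contig lengths 2 through 10 (total 54), the three longest contigs (10, 9, 8) sum to 27, which is half of the total, so the N50 is 8.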

Overview of the Collembolan Candidate AMPs
We used a homology-based method to identify candidate AMP genes in collembola. All collembola unigenes from the five transcriptomes were searched against a database of approximately 700 arthropod AMPs of various types downloaded from the UniProt database. All possible orthologous AMP genes in collembola were reported and classified according to their hits. For the species with genome data (O. cincta, Si. curviseta, and H. duospinosa), we further searched the genomes to find additional genes and to complete the gene models.
We identified only five classes of AMP genes (45 genes in total) in the five collembola species, which represent three taxonomic orders spanning over 250 million years of collembolan evolution (Figure 2). The total number of AMP genes in each species varies between 5 and 13. Diapausin is the only AMP present in all five species, whereas Alo is missing in Sm. viridis and diptericin is missing in A. maritima and H. duospinosa. Cecropin and defensin are each present in only one species (Si. curviseta and Sm. viridis, respectively). As genome annotation (O. cincta, Si. curviseta, and H. duospinosa) and transcriptome annotation (A. maritima and Sm. viridis) yielded similar results, we believe that the number of genes reported here reflects the limited number of direct orthologs of known arthropod AMP genes in collembola.
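As a consistency check, the per-species gene counts given in the family sections of this paper can be tallied to recover the totals stated above. The Python sketch below only rearranges numbers reported in the text; the dictionary layout and variable names are illustrative.

```python
from collections import Counter

# Gene counts per family and species, as reported in the family sections.
GENE_COUNTS = {
    "diapausin": {"H. duospinosa": 10, "O. cincta": 6, "Sm. viridis": 4,
                  "Si. curviseta": 2, "A. maritima": 1},
    "Alo": {"A. maritima": 4, "H. duospinosa": 3, "O. cincta": 3,
            "Si. curviseta": 2},
    "diptericin": {"Sm. viridis": 2, "O. cincta": 1, "Si. curviseta": 1},
    "cecropin": {"Si. curviseta": 3},
    "defensin": {"Sm. viridis": 3},
}

# Sum the counts of each family to obtain the total per species.
per_species = Counter()
for species_counts in GENE_COUNTS.values():
    per_species.update(species_counts)

total_genes = sum(per_species.values())
print(dict(per_species))
print(total_genes, min(per_species.values()), max(per_species.values()))
```

The tally gives 45 genes in total, with 5 genes in A. maritima and 13 in H. duospinosa as the extremes, in agreement with the range of 5 to 13 genes per species.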

Physicochemical Properties of Collembolan AMPs
To verify that the identified collembola AMPs were assigned to the correct AMP families, we used in silico prediction tools to investigate whether the mature sequences of collembola AMPs have the same physicochemical properties as their orthologs. These candidate collembolan AMPs are short peptides, with mean lengths of 40-97.5 residues in the mature sequences, and have a low molecular weight (3930.4-9979.3 Dalton) (Table 2). All of these peptides are cationic at pH 7. The hydrophobic amino acid ratio is low (0.35-0.58), suggesting that these peptides are highly soluble. The percentage of positively charged amino acids varies between families, with cecropin having the highest value (25.72%), and is higher than the percentage of negatively charged amino acids (10.87-25.72% vs. 2.55-8.57%). The percentage of proline and glycine also varies between families, with diptericin having the highest value (26.66%). In general, the physicochemical properties of collembola AMPs are similar to those of their orthologs, particularly the cationic property of all these AMP families and the glycine-rich property of diptericin.

Diapausin Family
Diapausin is the only AMP family present in all five collembola species investigated in this study. We identified 23 diapausin genes from H. duospinosa (10 genes), O. cincta (six genes), Sm. viridis (four genes), Si. curviseta (two genes), and A. maritima (one gene). Seventeen genes are complete gene models, and six genes are N-terminus-missing. Collembola diapausins are 63-85 amino acids in length, and the mature peptides are 39-64 amino acids. Collembola diapausins share six conserved cysteine residues with the motif C1-X3-C2-X(9-14)-C3C4-X(9-11)-C5-X(6-9)-C6 and potentially have three disulfide bridges between C1 and C3, C2 and C5, and C4 and C6, which are features of insect diapausin (Figure 3a) [58]. Phyre2 predicted collembola diapausins to have two α-helices and a triple-stranded β-sheet. The predicted 3D structures are most similar to a diapausin with antifungal properties (Supplementary Table S2).
Most collembola diapausins form a distinct lineage unique to collembola. However, all four Sm. viridis diapausin genes and one O. cincta diapausin (OcinDiapausin4) are more closely related to diapausins from insects (Lepidoptera, Diptera, and Coleoptera) (Figure 3b), suggesting that some collembola diapausins have diverged and evolved independently of insect diapausins.

Alo Peptide Family
We identified a total of 12 Alo peptides from A. maritima (four genes), H. duospinosa (three genes), O. cincta (three genes), and Si. curviseta (two genes). Only two genes from A. maritima are full gene models, while the rest are N- and C-terminus-missing (eight genes) or N-terminus-missing (two genes). The Alo peptides encoded by the full gene models are 61-62 amino acids long, and the mature sequences without signal peptides are 34 amino acids. All 12 collembola Alo peptides share six conserved cysteine residues with the motif C1-X6-C2-X(7-9)-C3C4-X3-C5-X10-C6, which is a feature of the knottin domain, suggesting three disulfide bridges between C1 and C4, C2 and C5, and C3 and C6 (Figure 4a) [59]. Phyre2 predicted collembola Alo to have a 3D structure most similar to Alo-3 from A. longimanus, which exhibits a knottin fold and has antifungal properties (Supplementary Table S2). Collembola Alo peptides do not form a single monophyletic clade (Figure 4b). Three collembola Alo peptides (AmarAlo1-2, HduoAlo3) are found within clades containing Alo peptides from the cowpea weevil, Callosobruchus maculatus (Coleoptera). Seven collembola Alo peptides are more closely related to each other, forming one clade with an Alo gene from C. maculatus (Figure 4b). Although O. cincta, Si. curviseta, A. maritima, and H. duospinosa have about the same number of Alo genes (2-4 genes), their Alo genes are not direct orthologs (i.e., there are no 1:1 orthologous relationships), suggesting independent gene gain and loss events in each species rather than the existence of conserved Alo genes in their common ancestor.

Diptericin Family
We identified a total of four diptericins from Sm. viridis (two genes), O. cincta (one gene), and Si. curviseta (one gene). All of them are full gene models. The diptericins encoded by the full gene models are 108-125 amino acids long, and the mature sequences without signal peptides are 89-102 amino acids. These proteins are glycine-rich, with 21.35-25.49% glycine and no cysteine in the mature peptides (Figure 5a). Phyre2 could not predict the structure of collembola diptericin with high confidence, possibly because the 3D structure of diptericin has not been characterized and none exists in the PDB database. Phylogenetic analysis shows that diptericins of Diptera, Myriapoda, and Chelicerata, but not collembola, form monophyletic clades (Figure 5b). No orthologous genes were found across these main lineages, suggesting that the diptericins of species in the same lineage are descendants of diptericins found in the common ancestor of that lineage. The branch support values are high throughout the tree, suggesting that the degree of sequence conservation is high; therefore, relationships within and among groups can be inferred.

Cecropin Family
We identified only three cecropins, all of them in Si. curviseta. All of them are full gene models. The cecropins encoded by the full gene models are 61-62 amino acids long, and the mature sequences without signal peptides are 38-41 amino acids. The mature sequences do not have cysteine residues (Figure 6a). This family contains the highest percentage of positively charged amino acids (mean = 25.72%), mainly due to a high frequency of K and R residues. Phyre2 predicted their 3D structure to have two α-helices, similar to cecropin from the swallowtail butterfly, Papilio xuthus (Supplementary Table S2). Similar to the diptericin family, phylogenetic analysis reveals monophyletic relationships of cecropins from species in the same lineage, including Lepidoptera, Coleoptera (red palm weevil), Diptera (mosquitoes), and collembola (Si. curviseta) (Figure 6b). This suggests that these proteins are relatively conserved; thus, the relationships within and among groups can be inferred. Gene family expansion was also observed, particularly the eight cecropin genes in Rhynchophorus ferrugineus.

Defensin Family
We identified only three defensins, all of them in Sm. viridis. All of them are full gene models. The defensins encoded by the full gene models are 63-72 amino acids long, and the mature sequences without signal peptides are 37-49 amino acids. All three defensins share six conserved cysteine residues with the motif C1-X(5-6)-C2-X3-C3-X(9-10)-C4-X7-C5-X1-C6, which is characteristic of the defensin domain, suggesting three disulfide bridges between C1 and C4, C2 and C5, and C3 and C6 (Figure 7a). Phyre2 predicted their 3D structure to have a knottin fold (Supplementary Table S2). Phylogenetic analysis reveals two main lineages of defensin, one from insects and another from Chelicerata, the latter including the three defensins from Sm. viridis (Figure 7b). SvirDefensin1 is more closely related to SvirDefensin2 than to SvirDefensin3. Defensins in the Chelicerata clade belong to a wide range of taxa, including mites, scorpions, and spiders. In the insect lineage, defensin genes are also from different taxonomic orders, including Coleoptera, Hymenoptera, and Diptera. Of the five collembola species investigated, we could identify defensins only from Sm. viridis. As collembola defensins are more closely related to defensins from Chelicerata, which suggests an ancient origin, defensin may have been lost in many collembola lineages rather than gained recently in Sm. viridis.
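The conserved cysteine spacing reported for diapausin, Alo, and defensin can be expressed as regular expressions, which offers a quick way to screen candidate mature peptides for a family signature. The Python sketch below is only an illustration of that idea and was not part of the pipeline described in this study. It assumes that each X in a motif stands for any residue, the function name is hypothetical, and the sequences used to exercise it are synthetic placeholders rather than real peptides.

```python
import re

# Cysteine spacing motifs as reported in the text; "." stands for any residue (X).
CYSTEINE_MOTIFS = {
    "diapausin": re.compile(r"C.{3}C.{9,14}CC.{9,11}C.{6,9}C"),
    "Alo": re.compile(r"C.{6}C.{7,9}CC.{3}C.{10}C"),
    "defensin": re.compile(r"C.{5,6}C.{3}C.{9,10}C.{7}C.{1}C"),
}


def matching_families(mature_peptide: str) -> list:
    """Return the families whose cysteine motif occurs in the sequence."""
    seq = mature_peptide.upper()
    return [name for name, pattern in CYSTEINE_MOTIFS.items()
            if pattern.search(seq)]


# Synthetic placeholder sequences built only from the motif spacings.
synthetic_diapausin = "C" + "A" * 3 + "C" + "A" * 10 + "CC" + "A" * 10 + "C" + "A" * 7 + "C"
synthetic_alo = "C" + "A" * 6 + "C" + "A" * 8 + "CC" + "A" * 3 + "C" + "A" * 10 + "C"
synthetic_defensin = "C" + "A" * 5 + "C" + "A" * 3 + "C" + "A" * 9 + "C" + "A" * 7 + "C" + "A" + "C"
```

A motif match alone does not establish family membership; in this study, the assignment also relied on BLAST homology, domain prediction, and phylogenetic analysis.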

Functional Prediction
We used three programs (ClassAMP, iAmpPred, and Campr3) to predict the antimicrobial properties of the candidate collembolan AMPs based on sequence properties, such as amino acid composition, family signatures, and physicochemical properties. The results (probability values) were reported as a heatmap (Figure 8a). These programs use different training sets and methods (support vector machine, random forest, artificial neural network, discriminant analysis) and, thus, do not give identical results. All candidate collembola AMPs were predicted to have antimicrobial properties with high probability scores by at least one program. In general, ClassAMP predicted higher probability values than iAmpPred and Campr3, and the SVM method predicted higher scores than the RF method. iAmpPred predicted higher probability scores for antibacterial properties than for antifungal and antiviral properties. Diptericins have the highest probability scores in all three programs, whereas defensins have the lowest scores, mainly from Campr3. Alos, cecropins, and diapausins generally have good prediction scores from ClassAMP and iAmpPred, but about 50% of them were predicted with lower probability scores in Campr3. Campr3 (SVM and RF methods) has previously been shown to outperform other tools [49], while ClassAMP may be prone to false positive results [51,60]. Based on these predictions, the diptericins from O. cincta, Sm. viridis, and Si. curviseta have the highest potential for antimicrobial properties. However, we note that the low probability scores predicted by Campr3 can partly be explained by the scarcity of collembola AMP sequences in the training set, which limits the prediction power.
Except for OcinDiptericin1, all candidate collembola AMPs were predicted to be non-active against human erythrocytes (Figure 8b), which increases their potential for medical application. According to the prediction tools in the DBAASP database (Supplementary Table S2), K. pneumoniae is sensitive to certain members of all AMP classes, while B. subtilis is sensitive to all AMP classes except defensins. E. coli and P. aeruginosa are susceptible to cecropins and diptericins, whereas S. aureus is susceptible to cecropins only. Finally, C. albicans and S. cerevisiae are sensitive to Alos, cecropins, and diapausins. Among the five AMP classes, cecropins show the most promising activities because all of them (ScurCecropin1-3) are predicted to be active against both Gram-positive and Gram-negative bacteria, and ScurCecropin1 is also predicted to be active against fungi.

Roles of Antimicrobial Peptides in Collembola Immunity
Collembola AMPs identified in this study potentially have broad antimicrobial properties. We believe these peptides were assigned to correct AMP families based on sequence homology and protein family signature (e.g., conserved cysteine residues, physicochemical property, and phylogenetic analysis); thus, their functions may be inferred from other arthropod orthologs. Diapausins, present in all five collembola species, have been reported to have active roles against fungi, e.g., S. cerevisiae, C. albicans, C. krusei, and Beauveria bassiana [61,62]. Alo-3 peptide from a beetle, A. longimanus, has an active role against fungi C. albicans and C. glabrata [59]. The strain-specific AMP prediction using the DBAASP database also indicated that two fungi, C. albicans and S. cerevisiae, are susceptible to Alos and diapausins from collembola. Diptericins from flies were shown to have active roles against Gram-negative bacteria, including Erwinia herbicola, Er. carotovora, E. coli, and Providencia rettgeri [63][64][65]. Defensins have broad activity against bacteria, viruses, and fungi but are most effective against Gram-positive bacteria, including S. aureus [66]. Insect cecropins can kill both Gram-positive (Listeria monocytogenes) and Gram-negative bacteria (Acinetobacter baumannii and P. aeruginosa), disrupt uropathogenic E. coli biofilms, and also exhibit antifungal properties [67][68][69][70]. As functions of collembolan AMP families can only be inferred from their orthologs in other arthropods, further functional analysis is crucial to confirm the prediction.
Our findings help fill knowledge gaps regarding how collembola defend themselves against microbes in their natural habitats. Collembolan gut bacteria have been shown to exhibit antimicrobial properties against various pathogenic bacteria and fungi, thus contributing to the collembola immune system [71]. A cluster of beta-lactam biosynthesis genes, which produce antibiotics such as penicillins and cephalosporins, was identified in Folsomia candida and in other collembola from different families, but not in Protura, Diplura, insects, or other animals, suggesting a single horizontal gene transfer event from bacteria to the common ancestor of collembola [30,72]. Genes in these pathways were upregulated under heat-shock stress and produced beta-lactam products [30]. Although many collembola genomes have been published [27,29,31,32], the AMP families have not been thoroughly investigated and reported. Our analysis shows that collembola have at least five AMP families, which potentially contribute to collembolan immunity through broad activities against bacteria, fungi, and viruses.

Evolution of Collembola AMPs
Our study reveals the dynamic evolution of the collembolan AMP gene families, most importantly, frequent gene gains and losses. Previously, defensin was believed to be the only known ortholog of insect AMPs in collembola [20]. We investigated five species of collembola representing three main collembola suborders (Entomobryomorpha, Poduromorpha, Symphypleona) and showed that collembola have at least five AMP families. Diapausins are present in all five collembolan species, suggesting the presence of the diapausin gene in the common ancestor of collembola. Diapausins have been identified in a few insect orders, including Coleoptera and Lepidoptera; thus, multiple gene gains and losses in different insect lineages may explain the evolution of diapausin in hexapods.
The presence of Alo peptides in four collembolan species from the suborders Entomobryomorpha and Poduromorpha suggests that Alo is another ancient immunity gene of collembola. Previously, Alo peptides present in Hemiptera and Coleoptera were proposed to have evolved via horizontal gene transfer from plants or fungi, based on the idea that the knottin domain is unique to plants and fungi [20]. However, recent studies show that proteins containing a knottin domain play essential roles in the toxins of many animals, including nettle caterpillars, spiders, scorpions, and cone snails [73]. Thus, the evolution of arthropod Alo peptides may also be explained by multiple gene gain and loss events.
For diptericin, O. cincta and Si. curviseta each have a single gene, and Sm. viridis has two genes. Diptericin appears to have been lost in the Poduromorpha lineage, but this has to be confirmed by further analysis that includes more species with complete genomes. Cecropin and defensin are only found in Si. curviseta and Sm. viridis, respectively. As these two AMPs are widely distributed among insect taxa [20], the origin of these peptides could date back to at least the common ancestor of hexapods. However, these genes might have been lost in many collembola species. Multiple insect AMPs, e.g., drosomycins and thaumatins, have a scattered distribution across insect taxa [20], supporting the idea that the gene gain and loss events observed in collembola AMPs are a common phenomenon of hexapod AMPs. We propose that frequent gene gain and loss in AMP evolution may be due to the broad activities of AMPs and to the presence of many AMP classes in the genome that can compensate for the function of lost or newly duplicated genes. A similar process explains the dynamic gene gain and loss events in the evolution of the insect chemoreceptor gene family [74,75]. However, we do not observe a large lineage-specific gene expansion in collembolan AMPs. A recent study on the evolution of dipteran diptericins suggests trade-offs between having diptericins to combat pathogens and the potential risks due to their neuronal toxicity [76]. This might explain purifying selection against having many gene copies in the genome. We note that future analysis of more collembolan species with complete genomes would improve the estimation of gene gain and loss events.

Significance and Implications
Our study supports the view that collembola are promising new sources for novel AMP discovery. We have identified 45 AMPs from five collembola species that potentially have broad activities against fungi (diapausin, Alo, defensin, and cecropin), bacteria (diptericin, defensin, and cecropin), and viruses (defensin). As the number of AMPs per species is small (5-13 genes), we suspect that collembola might have novel AMPs that are not orthologs of any described AMPs. Future analysis using existing annotation pipelines for novel AMPs [77,78] may reveal the hidden diversity of collembola AMPs.
Identification of AMPs from unexplored species could yield numerous candidates, which makes further functional studies difficult when resources are limited. We propose using in silico AMP prediction tools to help choose promising candidates. In our case, we consider Campr3 the most stringent tool. Therefore, collembolan diptericins, the AMPs that passed all programs (ClassAMP, iAmpPred, and Campr3) with high probability scores, are favorable candidates for further analysis. We note, however, that novel AMPs that are distinctly different from the training sets may not be recognized by the AI tools, i.e., they may give false negative results. Other filtering criteria, such as expression profiles (e.g., AMP genes that are upregulated after organisms are experimentally challenged with pathogens), could be powerful tools for screening [79].
Our study serves as a primer for the investigation of collembolan AMPs, which could lead to a better understanding of how collembola can survive in a pathogen-rich environment. It also opens a new road to identifying AMPs that may have a potential use in medicine to help combat the current crisis from multidrug-resistant pathogens.
Author Contributions: Conceptualization, P.E.; methodology, G.P. and P.E.; software, G.P. and P.E.; validation, P.E.; formal analysis, G.P. and P.E.; investigation, G.P. and P.E.; resources, P.E.; data curation, G.P.; writing-original draft preparation, G.P. and P.E.; writing-review and editing, G.P. and P.E.; visualization, G.P. and P.E.; supervision, P.E.; project administration, P.E.; funding acquisition, G.P. and P.E. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement: Ethical review and approval were waived for this study due to the nature of the work. This study only used publicly available DNA sequences in the database for the analysis.

Data Availability Statement:
The data presented in this study are available in the Supplementary Materials.