Molecular Diversity and Evolution of Antimicrobial Peptides in Musca domestica

: As a worldwide sanitary insect pest, the houseﬂy Musca domestica can carry and transmit more than 100 human pathogens without suffering any illness itself, indicative of the high efﬁciency of its innate immune system. Antimicrobial peptides (AMPs) are the effectors of the innate immune system of multicellular organisms and establish the ﬁrst line of defense to protect hosts from microbial infection. To explore the molecular diversity of the M. domestica AMPs and related evolutionary basis, we conducted a systematic survey of its full AMP components based on a combination of computational approaches. These components include the cysteine-containing peptides (MdDefensins, MdEppins, MdMuslins, MdSVWCs and MdCrustins), the linear α -helical peptides (MdCecropins) and the speciﬁc amino acid-rich peptides (MdDomesticins, MdDiptericins, MdEdins and MdAttacins). On this basis, we identiﬁed multiple genetic mechanisms that could have shaped the molecular and structural diversity of the M. domestica AMPs, including: (1) Gene duplication; (2) Exon duplication via shufﬂing; (3) Protein terminal variations; (4) Evolution of disulﬁde bridges via compensation. Our results not only enlarge the insect AMP family members, but also offer a basic platform for further studying the roles of such molecular diversity in contributing to the high efﬁciency of the houseﬂy antimicrobial immune system.


Introduction
Insects account for 90% of all extant animal organisms in the world [1,2] and coexist with a variety of microorganisms in different environments [3]. Therefore, they need to evolve a potent defense system for clearing potential invaders. Antimicrobial peptides (AMPs) are effectors of the innate immunity against bacteria, fungi, parasites and viruses [4], which exhibit some common properties (e.g., cationicity, hydrophobicity and amphipathcity) for their antimicrobial activity [5]. Many AMPs are induced from the insect immune organs (e.g., fat body) in response to microbial infections. They secret into hemolymph to reach a concentration between 0.1 and 100 µM to inhibit the growth of exotic microorganisms [6].
Insects are an important resource for understanding the basic biology of the immune system and for searching for new peptides for anti-infective drugs. In recent years, some computational approaches have been applied to discover AMPs in a given species with whole genomes sequenced [17][18][19][20]. Musca domestica is a worldwide sanitary insect pest whose larvae often feed on microbe-rich, decaying organic materials and the adults are the major vectors of pathogens causing human or animal diseases. Thus, more AMPs might have been evolved [21]. Thanks to the release of its whole sequences [22], we have an opportunity to survey the AMPs in a vector insect for studying their evolution.
Here, we report the molecular diversity of the M. dometica AMPs based on a systematic database search, which could provide us with a special perspective to understand how the evolution of the antimicrobial immune system occurs in a species with the tenacious vitality. We found that different from Drosophila that has a limited number of AMPs [6], M. domestica has largely expanded its AMP number via multiple genetic mechanisms to create structural diversity of their AMPs, which would have commonly shaped its high-efficient antimicrobial immune system. These include: (1) Gene duplication; (2) Exon duplication via shuffling; (3) Protein terminal variations; (4) Evolution of disulfide bridges via compensation. A similar phenomenon was also observed previously in the evolution of AMPs in the parasitoid Nasonia vitripennis [20], suggesting that these two species of insects could suffer from a similar selective pressure to drive the evolution of their antimicrobial systems.

Database Searches
Strategies for gene discovery used in this study are provided in the supplementary information (Supplementary Information Figure S1). In brief, potential AMPs were firstly searched against the proteome of M. domestica that was downloaded from the Genome Database (up to September 2, 2020) (https://www.ncbi.nlm.nih.gov/) by filtering using a threshold < 100 amino acids with a signal peptide. The potential peptides were predicted in the Collection of Anti-microbial Peptides (CAMPR3) server (http://www.camp.bicnirrh. res.in/prediction.php), and then they were used as templates to search for more peptides from the non-redundant sequences database until no new peptides appeared by BLASTP (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Secondly, these newly-discovered peptides were again used as queries to mine the whole genome shotgun and nucleotide collection databases (https://blast.ncbi.nlm.nih.gov/Blast.cgi) by TBLASTN. Gene runner (http: //www.generunner.net/) was employed to translate a complete open reading frame from a selected nucleotide sequence. Retrieved sequences with a signal peptide and a classical AMP signature were blasted again for new rounds of TBLASTN and BLASTP searches. The method would be continuously repeated until no new hit appeared. Thirdly, BLASTP and TBLASTN programs were used to characterize orthologues of known peptides of Drosophila melanogaster against the database of M. domestica. Finally, the protein pattern method by the PHI-BLAST algorithm program (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was conducted to search for M. domestica AMPs based on the cysteine arrangement pattern of defensins, i.e., the CXXC/CXC motif [23].

Phylogenetic Tree Construction
Multiple sequences were aligned with MUSCLE (https://www.ebi.ac.uk/Tools/msa/ muscle/) and the alignments were used to build phylogenetic trees by iqtree-2.0-rc2 with substitution models BLOSUM62 and PMB. Phylogenetic testing included 1000 replicates of Ultrafast bootstrap (UFBoot) and 1000 replicates of SH-aLRT to provide support for tree branches [25]. In our study, both BLOSUM62 and PMB models generated very similar results with good agreement. The trees presented here were prepared by Evolview v2 (https://evolgenius.info/evolview-v2).
To detect whether cysteines in a specific M. domestica AMP would form one disulfide bridge, we first built its initial structure by I-TASSER and then refined the structure with the help of molecular dynamics simulations or energy minimization. For MdEppin35-1, its model was first obtained on I-TASSER with a position restraint to Cys2 (position 15) and Cys6 (position 45) and Cys3 (position 32) and Cys5 (position 42). The MdMuslin1 and MdMuslin26 were modeled with the potential disulfide bridges in MdMuslin2 and MdMuslin25, respectively, and their initial homodimer structures were assembled by Z-DOCK (http://zdock.umassmed.edu/) [26]. The MdMulsin1 homodimer was used to MD simulations (50 ns) with GROMACS 2020.1 with the OPLS (Optimized Potential for Liquid Simulations)-AA/L all-atom force field (2001 amino acid dihedrals) [27]. Sodium and chloride ions were added to neutralize the total system charge and simulated after the peptide was immersed in a cubic box from the surface at least 1 nm and solvated with SPC water. Solved structure was energy minimized for 5000 steps of steepest descent minimization termination with a maximum force less than 1000 KJ/mol/nm. The temperature at 300 K was maintained by velocity rescaling method, along with the pressure at 1 bar being kept by Parrinello-Rahman methods, followed after the system was equilibration phase of 100 ps number of particles, volume, and temperature (NVT) equilibration and 100 ps number of particles, pressure, and temperature (NPT) equilibration. The particle mesh Ewald method was used for long range electrostatic interactions and the linear constraint solver (LINCS) algorithm constrained all bonds. Trajectories were saved every 2 fs for analysis. A homodimer snapshot was extracted from the simulations for constructing the inter-monomer disulfide bridge by Swiss-PdbViewer (https://spdbv.vital-it.ch/). Energy minimizations were performed with the force fields [28] implemented in the MOE2019 software (OPLS-AA for MdEppin35-1, AMBER 10 for Md-Defensin20, MdMuslin1 and MdMuslin26). These homodimers were analyzed by PDBsum (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/Generate.html) to evaluate their quality.

Positive Selection Analysis
Codon-substitution models were selected to estimate the nonsynonymous-to-synonymous rate ration (ω = dN/dS) using CODEML implemented in the PAML software package [29,30]. In these models, M0 assumes that all sites have a ω ratio and is used as a control. Two pairs of codon-based likelihood models (M1a/M2a, M7/M8) were chosen for making two likelihood ratio tests (LRTs). M1a (nearly neutral model) constraints a proportion p0 of conserved sites with 0 < ω <1, while a proportion p1 = 1 − p0 of neutral sites with ω1 = 1; M2a (positive selection model) adds an extra class of sites with the proportion p2 = 1 − p0 − p1 and with ω estimated from the data. M7 (β distribution model) does not allow for positively selected sites and M8 (β and ω model) adds an extra class of sites to M7, allowing for ω > 1, which means the presence of positively selected sites. The calculation of posterior probabilities was completed using the Bayes Empirical Bayes (BEB) method [31].

Co-Evolutionary Analysis
Multiple sequence alignments (MSA) of AMPs from the housefly and a representative structure were submitted to MISTIC2 for co-evolutionary analysis [32] (https://mistic2 .leloir.org.ar). Four covariation methods: (1) corrected mutual information (MIp), (2) mean field direct coupling analysis (mfDCA), (3) pseudo-likelihood maximization DCA (plmDCA) and (4) multivariate Gaussian modeling DCA (gaussianDCA) were chosen for analyzing the inter relationship of residues in a protein sequence, which could identify the structurally or functionally important positions. At the same time, these sequences were also input into the Weblogo server [33] (http://weblogo.berkeley.edu/logo.cgi) for creating sequence logos using default parameters.

Results
Using a combination of computational approaches, we have largely enlarged the Musca domestica AMP repertoire to 186 AMP-like peptides/proteins (Table A1) [22], in which 148 are considered as newly discovered members. These components included the cysteine-containing peptides (MdDefensins, MdEppins, MdMuslins, MdSVWCs and MdCrustins); the linear α-helical peptides (i.e., MdCecropins); and the specific amino acid-rich AMPs (i.e., MdDomesticins, MdDiptericins, MdEdins and MdAttacins). Their characteristics including length, pI and net charges at pH = 7.0 are provided in Table A1.
Our results indicate that most AMPs described here are smaller than 150 amino acids in length with some typical AMP features. Some putative AMPs are larger in length due to internal duplication. These peptides are described in details as follows: Defensins are approximately 4 kDa AMPs with three or four conserved disulfide bridges, which exist in nearly all multicellular organisms [34]. Based on their structural characteristics, defensins can be classified into two distinct superfamilies called cisand trans-defensins. The former includes those with the cysteine-stabilized α-helix/β-sheet (CSαβ) fold produced by plants, fungi and invertebrates; the latter includes α-defensins, β-defensins and θ-defensin from vertebrates as well as big defensins from invertebrates. In insects, Tian and colleagues found three different types of defensin from N. vitripennis including classical insect-type defensins (CITDs), nasonins and navitricins [20]. Insect defensins are composed of an n-terminal loop, an α-helix, followed by an antiparallel β sheet. Defensins in insects show antimicrobial activity on Gram-positive bacteria by forming voltage-dependent channels to disrupt the permeability barrier of the cytoplasmic membrane resulting in cytoplasmic potassium loss [35]. Besides, some insect defensins can kill the Gram-negative Escherichia coli and some fungi [15,36].
In the housefly, there are a total of 21 defensins-like AMPs (named MdDefensins) with 11 new members described here (Figure 1a and Figure S2). Their mature peptides are composed of 40-65 residues and contain three disulfide bridges with net charges ranging from 0.9 to + 4.2 (Table A1). Based on our phylogenetic tree analysis, these peptides can be divided into three groups ( Figure 1b): Group I includes MdDefensin1-MdDefensin9 which all belong to the CITDs; Group II includes MdDefensin10-MdDefensin15 which all lack a classical pro-peptide; and Group III includes MdDefensin16-MdDefensin21 which all contain a short n-loop without a pro-peptide ( Figure 1b). Among them, some members (e.g., MdDefensin4, 6, MdDefensin13, 15 and MdDefensin16) have been identified to be transcriptionally active after body wall injury [37]. MdDefensin1-MdDefensin16, MdDe-fensin18, and MdDefensin19 share their precursor organization to CITDs that comprises a signal peptide, an acidic propeptide ending with an R/KXKR motif (X denoting F, Q or Y) or its variants followed by a mature peptide. This motif is lost in MdDefensin17, -20 and -21, giving rise to failure in propeptide processing and thus generating an extended Nterminus (Figure 1a and Figure S2). The 3D structures of four representative MdDefensins with different N-terminal lengths (Figure 1a) show they all adopt a typical fold of CSαβ ( Figure 1b). The α-helix spans residues L 15 -I 21 in MdDefensin2, G 2 -L 28 in MdDefensin10, G 24 -L 34 in MdDefensin17, and N 18 -I 26 in MdDefensin20 and a hydrophobic cluster is present in MdDefensin2 (L 15 , A 17 and A 18 ), MdDefensin10 (W 27 and L 28 ), MdDefensin17 (W 26 , M 29 , and L 34 ) and MdDefensin20 (L 20 , L 23 , I 26 ). The two β-strands constitute an antiparallel sheet linked by a loop that commonly forms a functional γ-core contributing to the antimicrobial activity in some defensins [38]. The N-terminally extended region in MdDefensin17 folds into a short two-stranded antiparallel sheet followed by an α-helix. This unique subdomain structure is firstly found in an insect defensin [39]. MdDefensin20 has evolved a free cysteine that is not involved in the intramolecular disulfide bridges ( Figure 1). Compared with the CITDs, MdDefensin17 -MdDefensin21 have a shorter n-loop ( Figure 1b and Figure S2), analogous to the antibacterial ancient invertebrate-type defensins (AITDs) [40]. In our analysis, M1a/M2a models identified Ser 7 (numbered according to Mddefensin1) as a positively selected site ( Table 1). The lacking of positive selection signals in M7/M8 would be due to the sparse sampling of species and low sequence divergence. Note: l is the log likelihood; LRT is likelihood ratio test, which is twice the log likelihood difference (2∆l) between the null models (M1a and M7) and their alternative models (M2a and M8): M1a/M2a = 5.74 (df = 2, p = 0.05670); M7/M8 = 0.26 (p = 1). Positively selected sites identified by the BEB method under M2a with posterior probabilities P > 0.6 are shown and with P > 0.95 by "*".  Figure S2) and insect defensins from other insects. Previously known sequences are labeled by a red "#". Cysteines are shaded in yellow and the conserved glycines in grey. Basic residues (K, R, H) and acidic residues (E, D) are highlighted in blue and red, respectively. The extended N-terminus in MdDefesin10, MdDefesin 17, and MdDefesin 20 is shadowed in cyan. Structural elements, including three loops (designated as n-loop, m-loop and c-loop), the α-helix, the two-stranded β-sheet, the γ-Core motif as well as the three conserved disulfides are indicated at the bottom. The free cysteine (Cys1) is underlined once. Dm: Drosophila melanogaster (GenBank: P36192.1), Lucifensin: Lucilia sericata (PDB:2LLD), Pt: Protophormia terraenovae (GenBank: P10891.2), Sb: Simulium bannaense (GenBank: AJP36711.1). (b) Phylogenetic tree and 3D structure representatives of the defensins from the housefly and other insects. This tree was inferred using iqtree-2.0-rc2. Significant bootstrap values are indicated by a black circle for SH-aLRT > 90 and white for UFBoot > 85. See GenBank IDs and other details for each AMP in Table A1. The structures are shown as ribbons by MolMol, with their N-and C-termini labeled. The n-loops for all the structures and the free cysteine (Cys1) in MdDefensin20 are denoted. The extra N-terminal domain in MdDefensin17 is circled in blue.

MdEppins
Serine protease inhibitors (SPIs) exist in all organisms that participate in many important metabolisms progresses, such as blood coagulation, fibrinolysis, inflammation and immunity. The inhibitors are categorized into four groups (Kazal, Kunitz, Serpin and α macroglobulin), all with a disulfide-rich α/β fold and a P1 site [41]. They show an inhibitory activity against a broad-spectrum of enzymes, e.g., trypsin, chymotrypsin, plasmin, elastase and microbial serine proteases. P1 site plays an important role in specificity and binding strength of serine protease inhibitors because of its exposure to the proteasebinding loop [42]. Kunitz serine protease inhibitors (KuSPIs) are extensively distributed in microbes, plants, insects and mammals. In recent years, many studies focused on their antimicrobial activity. For example, in human eppins comprising two potential protease inhibitory domains (a whey acid protein (WAP) or four disulfide core domain and a kunitz domain) has been reported to kill Gram-negative bacteria [43]. IPS1-3, a KuSPI isolated from the cell-free hemolymph of the Galleria mellonella larvae, can be induced to respond to the injected fungal elicitor zymosan [44]. In insect silk, KuSPIs can inhibit bacterial and fungal proteinases [45].
The Eppin family contains 35 members in the housefly (herein named MdEppin1-MdEppin35) and shares a conserved domain with the KuSPI family. Among them, MdEp-pin35 contains nine KuSPI domains (named MdEppin35-1 to MdEppin35-9) (Figure 2a and Figure S3). The pattern can be drawn as CX 8-14 CX 15-17 CX 6-7 GCX 12-13 CX 3 C with three disulfide bridges (C1-C6, C2-C4 and C3-C5) ( Figure 2a and Figure S3), in which C1-C6 and C3-C5 are essential for the maintenance of a native conformation but the third one (C2-C4) appears to be involved in stabilizing the binding domains in the loops containing the active site (P1) [46]. The phylogenetic tree reveals that Mdeppin1-Mdeppin8, Mdeppin 35-1 and 35-9 share high similarity to the kuntiz domain of eppin isolated from human whereas Mdeppin35-5 and Mdeppin35-6 are separated as single taxa (Figure 2b). In MdEppin35-1, an N-terminal deletion led to the loss of the first two cysteines. Alternatively, two C-terminal cysteines are evolved, which could compensate the loss via the formation of new bridge bridges to stabilize its structures, as verified by our structural modeling (Figure 2a,c).
Evolutionary analysis identified two positively selected sites (K 12 and L 18 , shown in MdEppin1) whose mutations might be relevant to their functional divergence (Table 2). Consistently, L 18 was also identified as an essential site potentially related to the activity of MdEppin, as analyzed by MISTIC. In addition, this analysis suggests its possible connection with other amino acids, including K 12 , R 16 (active site P1), I 19 , P 20 and E 33 (Figure 2d,e).  Figure S3) and Kuntiz-domain-type AMPs from other species. Cysteines involved in the formation of disulfides are colored in yellow. Identical residues are shadowed in grey. Basic residues (K, R, H) and acidic residues (E, D) are highlighted in blue and red, respectively. The highly conserved domain in eppins is boxed in green. The P1 amino acids are italicized and shadowed in cyan. Conserved disulfides, α-helix and two-stranded β-sheets are also indicated at the bottom, while the disulfides in MdEppin35-1 are displayed above MdEppin35-1, in which newly emerged ones are shown in dark red. Gm: Galleria mellonella (GenBank: AAK40037.1), Pp: Pseudechis_porphyriacus (GenBank: sp_B5G6G6.1), Hm: Homo sapiens (GenBank:AAG00547.1). The phase 1 intron or 2 intron is boxed in green or red, and phase 0 intron showed by black lines. & represents only signal peptide and kuntiz-domain in human eppin are displayed. (b) Phylogenetic tree constructed from the alignment of amino acid sequences present in Figure S3 by iqtree with a maximum-likelihood method. Branches with a significant bootstrap value are indicated by black circles for SH-aLRT > 90 and white for UFBoot > 85. (c) 3D models of MdEppin-1 and 35-1. The disulfides are shown as color sticks (blue for MdEppin-1 and red for MdEppin35-1) with the conserved one indicated by a blue arrow. (d) The cicro visualization of the coevolution of MdEppin. Amino acid names and the position are in the outer ring. Conservation (second ring) from light blue (lower) to red (higher); cScore (third ring) from yellow (lower) to violet (higher). pScore (inner ring) form green (lower) to red (higher). Inner lines are the top 5% covariation scores. (e) Weblogo of MdEppins, and the cola of positions on cScore from yellow (lower) to violet (higher) is shown on the top with the distance being displayed. The P1 site is arrowed in blue and the positively selected sites are arrowed in turquoise. The highly conserved domain is boxed in green. Note: 2∆l between null models (M1a and M7) and their alternative models (M2a and M8): M1a/M2a = 11.22 (df = 2, p = 0.00366); M7/M8 = 0.00 (p = 1). Positively selected sites identified by the BEB method under M2a with posterior probabilities P > 0.6 are shown and those with P > 0.95 indicated by "*".

MdMuslins
Kazal-type serine proteinase inhibitors (KaSPI) were firstly isolated by Kazal and colleagues from pancrease [41]. KaSPIs have been identified in many insects such as mosquitos [47], Drosophila [48] and locusts [49]. They have a broad activity in various biological and physiological processes in many organisms, such as blood coagulation and innate immunity. Interestingly, KaSPIs exhibit an antibacterial activity in response to microbial infection. The recombinant CsKSPI inhibits the growth of Gram-positive and Gram-negative [50], and PSKP-1 and its variants reduce E. coli mobility and cell agglutination [51].
In the housefly, 24 muslins (named MdMuslin1-MdMuslin24) are identified to contain the typical kazal domain with three disulfide bridges and nine muslins (named MdMuslin25-MdMuslin33) contain four disulfide bridges (Figure 3a and Figure S4). Among them, four muslins (MdMuslin15, MdMuslin16, MdMuslin23 and MdMuslin24) contain two kazal domains and named -1 and -2. Our phylogenetic tree reveals four types of Mdmuslins (Figure 3b), each type clustering together, in favor of their monophyletic origin. Of them, type I to III contain three disulfide bridges (i.e., C1-C5, C2-C4, C3-C6) which are different from the peptides containing the kuntiz-domain. Their sequence pattern can be drawn as CX1-3CX5PVCX0-5GX6-9NX1-5CX3-6CX7-22C. In the tree, the two domains in the paralogous MdMuslin15 and MdMuslin16 are classified into two different types, indicating that their significant divergence occurred after domain repeats. Similarly, in other two kazal domain-containing members (MdMuslin23 and MdMuslin24) their domain-2 is categorized into type III and domain 1 into type IV. For the eight cysteinescontaining members except MdMuslin23-1 and MdMuslin24-1, their sequence pattern can be described as CX1-3CX5PVCX5-6GX5-6CX3NXCX6CX7-12CX2-5C (Figure 3c).  Figure S4) and kazal domains from other sepcies. Two serine residues mutated from the conserved cysteines are circled. Cysteines involved in the formation of disulfides are colored in yellow. Conservation replacements are shadowed in grey. Basic residues (K, R, H) and acidic residues (E, D) are highlighted in blue and red, respectively. The P1 amino acids are italicized and shadowed in cyan. Residues split by phase 1 introns are shadowed in green. α-helix and β-sheets are also indicated at the bottom together with the potential disulfides being showed with black lines, and the fourth bridge In MdMuslin1 the first cysteine is replaced by a serine and in Muslin26, the fifth cysteine is lost, leading to the initial first disulfide bridge disrupted. In these two molecules, the free cysteines are exposed to their molecular surface, as revealed by our structural models (Figure 3c), suggesting that they might participate in the formation of a homodimer. The conservation analysis indicates that the R 10 (P1 site shown as MdMuslin2) has an inner relationship with P 7 , N 11 and P 13 (Figure 3d,e).

SVWC Domain AMPs
Single domain von willebrand factor type C (SVWC) proteins mostly contain eight cysteines. Although the Bombyx mori BmSVWC gene was decreased in the cuticle when the insect was infected with fungi [52], granularin from the snails Lymnaea stagnalis was up-regulated during parasitation of the avian schistosome Trichobilharzia ocellata, in favor of a role of this class of proteins in the molluscan internal defense response [53].
In the housefly, the SVWC-type AMPs are expanded to a family comprising 29 small, single domain secreted proteins (named MdSVWC1-MdSVWC29) (Figure 4a and Figure S5). Among them, MdSVWC29 is unique in that it contains two classical fragments, named MdSVWC29-1 and MdSVWC29-2. The family displays a consensus pattern as CX18-23CX4CX10-12CX7-10CX11-14CCX1-5C. They are diverse in sequence, but the eight cysteines are conserved throughout the group to form four disulfide bridges (i.e., C1-C3, C2-C6, C4-C7, and C5-C8). Both conserved introns (phase 1 and phase 2) are located at the nearly identical position among all peptides whose gene structures are available, supporting their common origin ( Figure S5). Based on our phylogenetic tree analysis, these MdSVWC peptides can be divided into two distinct groups (Figure 4b). The predicted 3D structures of MdSVWC1 and MdSVWC17 reveal a typical structure in their N-termini that contain a four β-stranded sheet (residues M 3 -F 5 , T 12 -E 14 , S 18 -E 20 , R 27 -T 29 in MdSVWC1 and C 6 -V 8 , K 11 -V 13 , G 16 -H 21 , T 27 -D 32 in MdSVWC17). Residues P 37 -L 41 are folded into an α-helix in MdSVWC1, but G 36 -E 41 in MdSVWC17 are folded into a β-sheet. In addition, the C-terminus of MdSVWC1 forms a β-sheet (residues K 58 -D 60 and F 77 -C 79 ) and an α-helix (Y 87 -V 91 ) (Figure 4c).

Crustin-like AMPs
Crustins are antibacterial proteins with a precursor organization including a signal peptide at the N-terminus and a whey acidic protein (WAP) at the C-terminus [54,55]. Crustins have been identified in diverse invertebrate animals, with a WAP domain of approximately 50 amino acids containing a conserved motif and a four disulfide bridge core (4-DSC) in the C-terminus. In previous studies crustins are divided into four families, crustin I contains a cys-rich domain, crustin II contains a glycine-rich domain in the Nterminus. Only one WAP domain exist in crustin III and thus are also named a single WAP domain-containing peptides (SWDs). Type IV contains a cysteine-rich, an aromatic-rich region and a WAP domain in the C-terminus. This type of crustins is exclusively present in ants. Crustins are widely regarded as antimicrobial molecules since they can kill Grampositive bacteria [56,57] and some fungi [58]. Moreover, LcSWD3 isolated from L.vannamei may contribute to antiviral immune response [59].
Even though the mechanism of action of crustins on pathogens remains unknown, it appears clear that the common WAP domain in all crustins plays a key role in their antibacterial effect. This has been evidenced by several previous observations: (1) the crustin in Fenneropenaeus chinensis, which only contains a glycine-rich region without a WAP domain, exhibits no antibacterial activity [60]. (2) SWAM1 and SWAM2 in mice which belong to the crustin III family can inhibit the growth of both E. coli and S. aureus [61]. Additionally, the reduction and alkylation of the cysteine residues in the WAP domain of a crustin-like peptide from a snake venom destroyed its antimicrobial activity [62].
The housefly crustins (named MdCrustins) are identified as types V and VI based on their sequence characteristics. In comparison with all other crustins, these peptides have developed an extra C-terminal region accompanying the loss of the cysteine-rich or glycine-rich region located between the signal peptide and the WAP region (Figure 5a; see sequences in Figure S6). In the phylogenetic tree, the housefly crustins are clustered together with type III crustins (Figure 5a), supporting their close evolutionary relationship. Structural modeling suggests that the WAP domain in both MdCrustin1 and MdCrustin3 may form four disulfide bridges with a connectivity pattern as C1-C6, C2-C7, C3-C5 and C4-C8. Compared with MdCrustin1, the C-terminal region of MdCrustin3 may be more rigid given that it folds into several β-strands stabilized by two disulfide bridges (C9-C10 and C11-C12) (Figure 5b).

Linear α-helical Peptides Cecropins
Cecropins are a group of classical linear α-helical peptides containing 31-39 residues with a molecular weight of about 4 kDa. The first cecropin was firstly isolated from Hyalophora cecropia [63] and later found in many insects, such as Diptera [10,[64][65][66][67][68], Lepidoptera [69] and Coleoptera [70] and so on, but not in Hemiptera [4]. Cecropins display a broad spectrum of activity against Gram-positive and Gram-negative bacteria, fungi and HIV virus [63,71]. In some cecropins, Trp 2 and Phe 5 (e.g., cecropin A and papiliocin) was found to contribute to interactions with the negatively charged bacterial membrane [72] whereas Gly 1 , Trp 2 , Lys 4 and Lys 5 in sarcotoxin-IA are important for binding with lipid A of LPS [73,74]. In cecropins, the hinge region disrupting the long helix is important for their structural flexibility [75].
In the housefly, 11 cecropins have been identified to contain a typical precursor organization [64]. Using our method, we found another five homologs that share similarity with those found in Diptera (Figure 6a). As reflected by MdCecropin2, in a hydrophobic environment, these peptides adopt an α-helical conformation, in which two α-helices (W 2 -Q 23 and G 26 -G 40 ) are joined by a flexible hinge comprising G 24 and L 25 (Figure 6b). The N-terminal helix is strongly basic and the C-terminal one is hydrophobic (Figure 6b). By default, the output presents the hydrophilic residues as circles, hydrophobic residues as diamonds, potentially negatively charged as triangles, and potentially positively charged as pentagons. Hydrophobicity is color coded as well: the most hydrophobic residue is green, and the amount of green is decreasing proportionally to the hydrophobicity, with zero hydrophobicity coded as yellow. Hydrophilic residues are coded red with pure red being the most hydrophilic (uncharged) residue, and the amount of red decreasing proportionally to the hydrophilicity. The potentially charged residues are light blue. Middle: Carton model of MdCecropin2. The residues are colored as followed: the hydrophilic residues are in cyan, positive ones are in blue, negative are in red, and the hydrophobic ones are in gray. Glycine and Leucine residue at position 24, 25 serves to connect the two helices which are shown as stick as well. Right: The spheres diagram shows the structure of Mdcecropin2 with the same color codes with the carton picture.

Domesticins, Diptericins and Edins
Domesticins are a class of proline-rich AMPs active against Gram-positive and Gramnegative bacteria and some fungi [76][77][78]. In our data mining, we found a new domesticin, named MdDomesticin2 (Figure 7 and Figure S7), with a proline content of 27.5%. Diptericins are about 9 kDa of glycine-rich peptides active on Gram-negative bacteria, which were initially isolated from the fly Phormia terranovae [79]. Some studies have shown that they are bacterially induced through the IMD signaling pathway [80]. In addition, Diptericin in mosquito larva is up-regulated after Sindbis virus infection [81]. For two housefly-sourced diptericin genes that were not named previously but have been found to be transcriptionally active when house fly larvae and pupae were injured [37], we named them MdDiptericinD and MdDiptericinD1. Their proteins share high similarity with other four diptericins previously named (Figure 7 and Figure S8). These diptericins can be clearly divided into two domains: a proline-rich domain and a glycine-rich domain. We found there are a phase 0 intron disrupting the proline rich domain of MdDiptericinD and D1 ( Figure S8).
Edins are a class of inducible insect AMPs [82,83] with a precursor comprising a signal peptide, a propeptide ending with a RXXR motif and a mature glycine-rich domain. Interestingly, in the housefly MdEdin6-MdEdin10 have two Gly-rich domains, which is different from the only single glycine-rich domain in the orthologs from other insects and MdEdin1-MdEdin5 in the housefly (Figure 7 and Figure S9).

Attacins
Attacins are a class of 20-23 kDa proteins found in Lepidoptera and Diptera. Attacins have two types, the basic attacins (A-D) and acidic attacins (E-F) [6]. All attacins share a high similarity in their amino acid sequences, but more aspartic acids exist in the acidic attacins. The precursor organization of attacins contains a signal peptide, a propeptide ending with a conserved RXXR, an attacin-N domain, an attacin-C domain (also called G1 domain and G2 domain). In the housefly, 16 attacins can be divided into three types (named MdAttacinA, MdAttacinC and MdAttacinD), in which three MdAttacinA, one MdAttacinC (Herein named MdAttacinC2) and one MdAttacinD (MdAttacinD4) have been named. Not like attacinA and attacinB in Drosophila, MdAttacinAs in the housefly lacks a propeptide. Their glycine proportion in the N-domain is universally higher than that in the N-domain of the fruit fly counterparts ( Figure S10). In our study, we found two new MdAttacinCs (MdAttacinC6 and MdAttacin7) ( Figure S11). In addition, although some MdAttacinCs have been reported [37], we found that the original MdAttacinC3 contains two whole attacin sequences and thus named C3-1 and C3-2 ( Figure 8). MdAttacinCs contain a longer propeptide ending with an RXXR motif. The glycine residues in the attacin-N domain are much less than those in the domain of the fruitfly counterparts, but higher in attacin-G1 and G2 domain ( Figure S11). MdAttacinDs lack a signal peptide ( Figure S12). However, like MdAttacinA3, MdAttacinD3 (herein named) was also expressed in larval tissues after injury [37]. Compared with the fruit fly attacinDs, we found that the glycine percentages largely varied due to sequence divergence.  Our phylogenetic analysis based on the MSA of diptericin, domesticin and attacinC, a group of proline-rich AMPs described above, reveals a close relationship between MdAttac-inC and the D. melanogaster attacinC and a paralogous relationship between diptericins and domesticins (Figure 9). In a similar manner, we analyzed edins, diptericins and attacins, a group of glycine-rich AMPs described above. These can be divided into eight groups based on the evolutionary tree (Figure 9) where the Attacin-N domains belonging to the type attacinA are grouped together (named AttacinA-N group) with the exogenous Attacin-N in the attacinA-D from the fruit fly ( Figure 9). The N-domain of attacinD except attacinD4 in the housefly are clustered with the AttacinA-N group. The Attacin-N peptides in type attacinC and attacinD4 are clustered together (named AttacinC-N group). The AttacinA-G1 group contains the G1 domain of attacinA and attacinD1. AttacinC-G1 group contains the G1 domain of attacinC and attacinD4. AttacinA-G2 group contains the G2 domain of attacinA and attacinD1-4 as well as attacinA-D from D. melanogaster. The AttacinC-G2 group contains the G2 domain of attacinC and attacinD4. The diptericin group contains only one Gly-rich domain. All Edins are grouped together (Figure 9). These results reveal that edins and the G2 domain of attacinC have the closest evolutionary relationship and diptericins are closer to the G2 domain of attacinA. Even though MdEdin10 has been named Attacin, we found that it has a closer relationship with the Edin family according to the evolutionary tree ( Figure 9). The domain architecture of different proline-rich and glycine-rich AMPs are presented in Figure 10.

Discussion
Although some AMPs have been identified previously in M. domestica [77,84,85], more potential AMPs may be discovered by surveying its whole genome and related databases. Here, we identified 148 potential AMPs in M. domestica with different structural types. Compared with D. melanogaster, M. domestica has evolved a set of more complex antimicrobial components with diversification in number, protein size, and structure. For instance, the housefly has 21 defensins with a variable n-loop in length, but D. melanogaster has only one. Besides the increase in number, structural alteration by extension or truncation also contributes to the diversity. For example, some members belonging to MdEppins, MdMuslins and MdEdins have extended their C-termini compared with those in D. melanogaster. What attracts us most is which evolutionary or genetic mechanisms have shaped the diversity in the housefly?

Gene Duplication
Gene duplication has been found in nearly all organisms. In the M. domestica genome, there are at least seven AMPs families exhibit an adjacent chromosome location with gene structure conservation, which can be considered as a consequence of tandem duplication ( Figure 11). Gene duplication followed by positive selection is important for the creation of new biological functions, which has been observed in the evolution of insect multigene of AMPs [20]. In the housefly, we have detected some positive selection signals in the MdDefensin and MdEppin families (Tables 1 and 2) but not in other families. In addition, as mentioned above, some structural variations are also observed among different paralogs in the housefly defensin family via dynamic insertion/deletions (indels) in their n-loop ( Figure S2). Such change could have an impact on their antimicrobial activity [20].

Exon Duplication via Shuffling
Exon duplication-mediated internal repeats plays an important role in the evolution of proteins as it can create an obvious complexity increase and likely contributes to the emergence of new functions. From the house fly genome, we identified an unusual eppin protein (MdEppin35) that carries nine repeats of kuntiz domain without a protease processing signal. In the MdMuslin family, MdMuslin15, MdMuslin16, MdMuslin23 and MdMuslin24 all share two kazal domains. Moreover, in the MdEdin family, MdEdin6 -MdEdin10 contain two glycine-rich domains (Figure 12a-c). Since there are two same phase introns (i.e., phase 0) corresponding to the boundaries of the MdEppin35-1 domain, we speculate that the evolution of multiple Kuntiz-domains might be a result of exon-shuffling, as illustrated in the schematic diagram (Figure 12d). In this case, the lack of introns at other domain boundaries could be explained by intron loss during evolution [86].

Protein Terminal Variations
In comparison with their paralogs some members belonging to the three housefly AMP families (i.e., MdCrustins, MdDefensins and MdEppins) have changed their terminal length through truncation or extension. For example, compared with the ancestral state present in the insect lineage (e.g., MdCrustin3, MdCrustin4 and ArWaprinThr1), MdCrustin1 and MdCrustin2 have evolved to form a truncated C-terminus ( Figure 5) through the deletion of a disulfide-bonded sub-domain structure ( Figure 5 and Figure S6). In the MdDefensins, several members extended their N-termini via the loss of a pro-peptide processing signal and the extended fragment in MdDefensin17 forms an isolated sub-domain structure. In the MdEppin members (e.g., MdEppin4, MdEppin6, MdEppin25, MdEppin33 and MdEppin34), all have an extended C-terminus of 20-103 amino acids ( Figure S3). These observations indicate a clear structural diversification occurring among different paralogs of housefly AMPs and could hint at their functional divergence, an open question to be answered in the future.

Evolution of New Disulfide Bridges
Disulfide bridges are important to both protein structure and function [87]. In this work, we found that several M. domestica AMPs have odd cysteines. For example, Md-Defensin20 has evolved one additional cysteine in its N-terminus ( Figure 1) whereas Md-Muslin1 and MdMuslin26 have a cysteine mutation to a non-cysteine residue (Figure 3c). To a secreted protein, the presence of a free cysteine often is detrimental to its structural stability due to air oxidization, especially when this residue is exposed to the molecular surface. We thus speculated this free cysteine might be involved in the formations of a homodimer structure, as previously observed in scorpion venom lipolysis activating peptides [88]. This speculation is supported by our structural modeling, in which one intermolecular disulfide bridge is clearly formed between two monomer MdDefensin20 (Figure 13a). A Ramachandran plot indicates that in this homodimer almost all ϕ/ψ torsion angles are found in the favored or additionally allowed regions except several residues (Figure 13b). In MdMuslin1, the fifth cysteine in monomer lost the opportunity to form the intramolecular disulfide bridge, but alternatively an intermolecular disulfide bridge could be found by ZDOCK method (Figure 13c), Ramachandran plot indicates that in this homodimer almost all ϕ/ψ torsion angles are found in the favored or additionally allowed regions except a serine in position 5 (Figure 13d). A similar observation is also made in MdMuslin26, in which the loss of the sixth cysteine leads to the first cysteine residue being not paired and thus one predicted intermolecular disulfide bridge links the two monomers into a homodimer (Figure 13e,f). In the two types of MdMuslins, the type I peptides are shared with the orthologs from a diversity of species whereas the type II peptides are only shared with the orthologs from a Diptera species. This observation suggests that the former might represent the ancestor from which the latter emerged via evolutionary gain of one additional disulfide bridge in the common ancestor of Diptera. Since members with this disulfide bridge all have a long C-terminus whereas members without this bridge exhibit more divergence in their C-terminal length, it appears that gaining a disulfide bridge during evolution could have an impact on the stabilization of the size of a protein. In Mdeppin35-1, its two unusual cysteines are located at the C-terminus. Due to a deletion in its N-terminus, this peptide lose the first two cysteine residues compared with other paralogs (Figure 2), leading to the disruption of the original disulfide bridge pattern, which probably makes it become a pseudogene in the case of the loss of its structural stability. However, our structural modeling indicates that the two C-terminal cysteine residues can provide new pairings for disulfide bridge formation (Figure 2) to save the life of this gene via restoring its structural stability. These observations suggest that when the structurally important cysteines are substituted or deleted in protein evolution, compensation might be a choice.

Conclusions
In prior studies, some insect-derived AMPs have been found in the housefly based on biochemical purification combined with functional identification as well as the analysis of genomic sequences. In this work, we used a combination of computational approaches to establish a relative complete M. domestica peptidome associated with its antimicrobial immunity, which largely expands the repertoire of AMP-like molecules in a sanitary insect pest. These molecules exhibit considerable diversity in their gene numbers and structural type with some new architectures that may be assembled as a homodimer structure or display repeats by multiple homologous domains or even a novel fold different from its ancestral state. It is clear that such diversity can attribute to several evolutionary scenarios involving gene and exon duplication, terminal variations and disulfide bridge reconstruction. Although the housefly has a closer phylogenetic relationship with Drosophila (both belonging to the Order Diptera), their evolution in developing their antimicrobial immune system remarkably differs. Compared with Drosophila, the housefly seems to have increased the complexity of the system, a similar case also previously reported in the parasitoid Nasonia vitripennis. This might be a result of evolutionary convergence in facing the selective pressure towards parasitism and a dirty microbe-rich environment. Our work will offer a basic platform to further study the immune and evolutionary significance of these newly discovered AMPs and the role of the molecular and structural diversity in contributing to the immune response of houseflies.