Domain Analysis and Motif Matcher (DAMM): A Program to Predict Selectivity Determinants in Monosiga brevicollis PDZ Domains Using Human PDZ Data

Choanoflagellates are single-celled eukaryotes with complex signaling pathways. They are considered the closest non-metazoan ancestors to mammals and other metazoans and form multicellular-like states called rosettes. The choanoflagellate Monosiga brevicollis contains over 150 PDZ domains, an important peptide-binding domain in all three domains of life (Archaea, Bacteria, and Eukarya). Therefore, an understanding of PDZ domain signaling pathways in choanoflagellates may provide insight into the origins of multicellularity. PDZ domains recognize the C-terminus of target proteins and regulate signaling and trafficking pathways, as well as cellular adhesion. Here, we developed a computational software suite, Domain Analysis and Motif Matcher (DAMM), that analyzes peptide-binding cleft sequence identity as compared with human PDZ domains and that can be used in combination with literature searches of known human PDZ-interacting sequences to predict target specificity in choanoflagellate PDZ domains. We used this program, protein biochemistry, fluorescence polarization, and structural analyses to characterize the specificity of A9UPE9_MONBE, a M. brevicollis PDZ domain-containing protein with no homology to any metazoan protein, finding that its PDZ domain is most similar to those of the DLG family. We then identified two endogenous sequences that bind A9UPE9 PDZ with <100 μM affinity, a value commonly considered the threshold for cellular PDZ–peptide interactions. Taken together, this approach can be used to predict cellular targets of previously uncharacterized PDZ domains in choanoflagellates and other organisms. Our data contribute to investigations into choanoflagellate signaling and how it informs metazoan evolution.


Introduction
PDZ domains are small peptide-binding domains named for the proteins where they were first discovered: PSD-95, Dlg1,and ZO-1 (PDZ) [1][2][3][4]. These are scaffolding domains that recognize the extreme C-terminus of target proteins, with the bulk of the PDZ-protein interaction involving up to 6 residues on the target [5][6][7]. PDZ domains are found in all three domains of life (Archaea, Bacteria, and Eukarya), and PDZ-mediated interactions are important in signaling and trafficking pathways, as well as in cell adhesion [5,8,9]. Examples include: regulation of the trafficking of receptors throughout the cell, e.g., the cystic fibrosis transmembrane conductance regulator (CFTR), signaling pathways mediated by G-protein coupled receptors (GPCRs), and interactions with proteins involved in tight Bound peptides are in black sticks and colored by heteroatom (N = blue, O = red) and labeled. PDB ID codes for choanoflagellate PDZ domains from this work include: 6X1X, 6X20, 6X22, 6X23, 6X1P, 6X1N, and 6X1R [48].
Choanoflagellates are microbial eukaryotes that live in marine and freshwater environments [49]. They are considered the closest relative to metazoans, and evolutionary biologists argue that these organisms can provide insight into the origins of multicellularity [49,50]. For example, choanoflagellates have extensive signaling networks and form a molecular architecture that resembles a multicellular state, called rosettes [51][52][53]. In our previous work, we concluded that there are 178 PDZ domains in M. brevicollis, with 1-20 PDZ domains in a single protein [48]. We previously used protein biochemistry and structural biology to characterize several PDZ domains from the M. brevicollis proteome, including those from mbDLG, mbGIPC, and mbSHANK1 ( Figure 1) [48,54]. These PDZ domain-containing proteins all have clear homologues in the human proteome, and we found that selectivity determinants are largely conserved, despite ~750 million years of evolution between these organisms [48]. Choanoflagellates are microbial eukaryotes that live in marine and freshwater environments [49]. They are considered the closest relative to metazoans, and evolutionary biologists argue that these organisms can provide insight into the origins of multicellularity [49,50]. For example, choanoflagellates have extensive signaling networks and form a molecular architecture that resembles a multicellular state, called rosettes [51][52][53]. In our previous work, we concluded that there are 178 PDZ domains in M. brevicollis, with 1-20 PDZ domains in a single protein [48]. We previously used protein biochemistry and structural biology to characterize several PDZ domains from the M. brevicollis proteome, including those from mbDLG, mbGIPC, and mbSHANK1 ( Figure 1) [48,54]. These PDZ domain-containing proteins all have clear homologues in the human proteome, and we found that selectivity determinants are largely conserved, despite~750 million years of evolution between these organisms [48].
In this work, we wanted to expand the investigation of choanoflagellate PDZ domains and develop techniques to characterize those that do not have homologues in the human proteome. Based on our previous work and the work of others, we hypothesized that binding preferences in PDZ domains are directly determined by the amino acids that interact with specific peptide positions at both motif and modulator (or non-motif) positions [45][46][47][55][56][57][58]. Because we aimed to identify endogenous interactions, we chose not to use common high-throughput techniques to determine PDZ selectivity, e.g., phage display or peptide arrays, which are tuned to look for highest affinity binders [32,38,47,59,60]. We reasoned that this approach may limit the number of matching sequences in the M. brevicollis proteome and cause us to overlook sequences with lower affinity. It is well established that endogenous PDZ interactions are not necessarily optimized for affinity, as these are often regulatory in nature and in pathways involving several proteins and molecular states [5]. Therefore, in our approach, we developed a computational program, Domain Analysis and Motif Matcher (DAMM), which determines the human PDZ domain(s) that share the greatest number of conserved residues in the peptide-binding cleft with an input PDZ sequence, independent of overall sequence identity. The target sequences of human PDZ domains are largely known; therefore, we then predicted potential target sequences in the choanoflagellate proteome using the Motif Matcher part of the DAMM, based on our published MotifAnalyzer-PDZ program [54]. We used protein biochemistry, structural biology, and DAMM to identify two M. brevicollis sequences that bind to a previously uncharacterized PDZ domain with <100 µM affinity, validating our approach. We argue that this computational and experimental pipeline provides a straightforward methodology for predicting cellular targets of PDZ domains and can be applied to other peptide-binding domains, e.g., SH2 and SH3 domains, that also contain conserved structural folds.

Identification of Peptide-Interacting Residues in PDZ Domains
We wanted to develop a computational pipeline that could be used in combination with knowledge of human PDZ domain specificity to predict the endogenous targets of a PDZ domain that does not have a homologue in Homo sapiens. Based on our previous work characterizing PDZ domains in the Monosiga brevicollis proteome, we chose to focus on UniProt ID A9UPE9_MONBE, a 545 amino acid protein with one PDZ and one SH3 domain. We will refer to this protein as A9UPE9 moving forward for simplicity. Notably, there are no proteins in the human proteome that contain this domain architecture, and a BlastP search of the full-length sequence returns alignments limited to one of the isolated domains.
Then, using structural analyses of peptide-bound human PDZ domains, we identified seven key residues whose side chains interact directly with the peptide. Examples include peptide-bound structures of DLG1 PDZ2 (or DLG1-2), PSD95-1 (which is also referred to as DLG4-1), and ZO1-1 ( Figure 2) [61][62][63]. These include: the βB +2 , βB +4 , βC −2 , βC-αA +1 , αB +1 , αB +5 , and αB +9 residues, where the secondary structure elements are as defined in Figure 1, the superscript numbers indicate residues from the N-(positive values) or C-termini (negative values) of those secondary structure elements, and "βC-αA" is the loop between the βC strand and αA helix. Notably, due to a high degree of variability in the length of the βB-βC loop amongst PDZ domains, we did not include residues in this structural element, although they can interact with peptide positions, e.g., P −4 and P −5 . In addition, other residues make non-covalent interactions with peptide residues in individual PDZ domains, but we kept our assignments relatively conservative, limited to those that form the core of the peptide-binding cleft ( Figure 2).
We also wanted to use the online Robetta server for de novo structure prediction to generate a model of A9UPE9 (Figure 3) [64]. Robetta output four conformations that mainly differed in the conformation of the βB-βC loop, and alignments revealed RMSD values for main chain atoms, as compared with conformation 1 of 0.455 Å for 289 atoms (conformation 2), 0.530 Å for 291 atoms (conformation 3), and 0.679 Å for 311 atoms (conformation 4) (Figure 3a). Due to the similarities, we restricted our structural analyses to conformation 1. The structure, as well as sequence patterns, allowed us to identify the relevant peptide-binding cleft residues (Figure 3b).  . The peptide-binding cleft residues used in this analysis are represented as side chain sticks and are colored by heteroatom or monochrome, respectively, and labeled. DLG1-2 is bound to a peptide of the HPV18 E6 oncoprotein (sequence: RRETQV) [61]. PSD95-1 is bound to a peptide with sequence RRESEI [62]. ZO1-1 is bound to the Claudin2 C-terminal tail (SLTGYV) [63].
We also wanted to use the online Robetta server for de novo structure prediction to generate a model of A9UPE9 (Figure 3) [64]. Robetta output four conformations that mainly differed in the conformation of the B-C loop, and alignments revealed RMSD values for main chain atoms, as compared with conformation 1 of 0.455 Å for 289 atoms (conformation 2), 0.530 Å for 291 atoms (conformation 3), and 0.679 Å for 311 atoms (conformation 4) (Figure 3a). Due to the similarities, we restricted our structural analyses to conformation 1. The structure, as well as sequence patterns, allowed us to identify the relevant peptide-binding cleft residues (Figure 3b).   . The peptide-binding cleft residues used in this analysis are represented as side chain sticks and are colored by heteroatom or monochrome, respectively, and labeled. DLG1-2 is bound to a peptide of the HPV18 E6 oncoprotein (sequence: RRETQV) [61]. PSD95-1 is bound to a peptide with sequence RRESEI [62]. ZO1-1 is bound to the Claudin2 C-terminal tail (SLTGYV) [63]. . The peptide-binding cleft residues used in this analysis are represented as side chain sticks and are colored by heteroatom or monochrome, respectively, and labeled. DLG1-2 is bound to a peptide of the HPV18 E6 oncoprotein (sequence: RRETQV) [61]. PSD95-1 is bound to a peptide with sequence RRESEI [62]. ZO1-1 is bound to the Claudin2 C-terminal tail (SLTGYV) [63].
We also wanted to use the online Robetta server for de novo structure prediction to generate a model of A9UPE9 (Figure 3) [64]. Robetta output four conformations that mainly differed in the conformation of the βB-βC loop, and alignments revealed RMSD values for main chain atoms, as compared with conformation 1 of 0.455 Å for 289 atoms (conformation 2), 0.530 Å for 291 atoms (conformation 3), and 0.679 Å for 311 atoms (conformation 4) (Figure 3a). Due to the similarities, we restricted our structural analyses to conformation 1. The structure, as well as sequence patterns, allowed us to identify the relevant peptide-binding cleft residues (Figure 3b).

Domain Analysis and Motif Matcher (DAMM) Program
In order to determine the human PDZ domain with the most similar peptide-binding cleft as A9UPE9, we developed a Python-based software suite called Domain Analysis and Motif Matcher (DAMM). The computational pipeline of DAMM is in Figure 4a. We will refer to the input sequence as "A9UPE9" or "the choanoflagellate PDZ domain se- quence", but this program can theoretically be used for any PDZ domain sequence from any organism. Ultimately, we did not want to require a structural model of an input PDZ domain sequence in order to identify the seven peptide-binding cleft residues. Therefore, we manually identified these amino acids for 140 solved PDZ domain structures, based on our previous curation of mammalian PDZ domains in the Protein Data Bank [5]. These data became an input database for our computational program, and the first step in DAMM is to take a single input PDZ domain sequence (<100 residues) in FASTA file format and run a BLAST alignment with the 140 PDB sequences that we manually curated [5,65]. The top pairwise alignment based on sequence similarity is provided as an output with asterisks marking the peptide-binding cleft residues of the annotated PDB structure. The user PDZ domain input sequence has reference numbers, allowing the user to identify and input the seven peptide-binding cleft positions for matching. For example, for A9UPE9, if the input sequence included the N-terminus of the protein, the user would input: 26, 28, 44, 47, 77, 81, and 85 ( Figure 3b). It is important to note that these numbers may vary depending on the input sequence, e.g., if the input FASTA file only contained the UniProt annotated A9UPE9 PDZ domain (residues 12-99), the numbers would be shifted by 12 as a result because E12 would be considered residue 1 by DAMM. This description is to highlight that the identified peptide-binding cleft residues will likely not correspond to the residue numbers in the full-length protein.
colored marine by heteroatom (N = blue, O = red) and labeled.

Domain Analysis and Motif Matcher (DAMM) Program
In order to determine the human PDZ domain with the most similar peptide-binding cleft as A9UPE9, we developed a Python-based software suite called Domain Analysis and Motif Matcher (DAMM). The computational pipeline of DAMM is in Figure 4a. We will refer to the input sequence as "A9UPE9" or "the choanoflagellate PDZ domain sequence", but this program can theoretically be used for any PDZ domain sequence from any organism. Ultimately, we did not want to require a structural model of an input PDZ domain sequence in order to identify the seven peptide-binding cleft residues. Therefore, we manually identified these amino acids for 140 solved PDZ domain structures, based on our previous curation of mammalian PDZ domains in the Protein Data Bank [5]. These data became an input database for our computational program, and the first step in DAMM is to take a single input PDZ domain sequence (<100 residues) in FASTA file format and run a BLAST alignment with the 140 PDB sequences that we manually curated [5,65]. The top pairwise alignment based on sequence similarity is provided as an output with asterisks marking the peptide-binding cleft residues of the annotated PDB structure. The user PDZ domain input sequence has reference numbers, allowing the user to identify and input the seven peptide-binding cleft positions for matching. For example, for A9UPE9, if the input sequence included the N-terminus of the protein, the user would input: 26, 28, 44, 47, 77, 81, and 85 ( Figure 3b). It is important to note that these numbers may vary depending on the input sequence, e.g., if the input FASTA file only contained the UniProt annotated A9UPE9 PDZ domain (residues 12-99), the numbers would be shifted by 12 as a result because E12 would be considered residue 1 by DAMM. This description is to highlight that the identified peptide-binding cleft residues will likely not correspond to the residue numbers in the full-length protein.  Following identification of the peptide-binding cleft residues for matching, DAMM runs BLAST pairwise sequence alignments with all previously curated 272 human PDZ domain sequences, including those that do not have a solved structure in the Protein Data Following identification of the peptide-binding cleft residues for matching, DAMM runs BLAST pairwise sequence alignments with all previously curated 272 human PDZ domain sequences, including those that do not have a solved structure in the Protein Data Bank [5]. The user is prompted to specify how many sequences to output, listed in descending order by overall sequence identity. Included in the output file is a list of the pairwise alignments with number of peptide-binding cleft matches, in both identity and similarity, indicated for each of the alignments. The output file is ordered by sequence identity to ensure that alignments listed at the top are those where the majority of the PDZ domain sequences aligned, but overall, sequence identity does not play a role in number of matches specified. Amino acids are defined as "similar" based on the following categories, using one-letter abbreviations: acidic (D, E), basic (H, K, R), polar (C, M, N, Q, S, T), non-polar (A, I, L, V), and aromatic (F, W, Y).
Once these results are obtained, the user can run literature and database searches to identity potential binding sequences based on known human PDZ targets. The Motif Matcher piece of DAMM and MotifMatcher-PDZ can then be used similarly, where C-terminal sequences in the choanoflagellate proteome are filtered and screened for closest matches to an input sequence [54]. Alternatively, we can use DAMM to screen the M. brevicollis proteome for specific preferences at each position, similar to those that would be identified using phage display or another high-throughput technique. All components of DAMM were developed in Python 3, and are available at https://pdzselectivity.cs.wwu.edu (accessed on 11 September 2021).

Binding Preferences of A9UPE9 PDZ
To identify the binding preferences of A9UPE9 PDZ, we used the DAMM program to find the human PDZ domains with the most similar binding cleft residues. For reference, the PDZ domains with the top 5 highest overall sequence identities are in Table 1. Based on the overall sequences, A9UPE9 appears to most closely resemble the GRIP PDZ domains (Table 1). However, the results from the DAMM program revealed a different result. Table 2 shows the human PDZ domains with the most similar peptide-binding clefts as A9UPE9 PDZ. For reference, the GRIP1-2 and A9UPE9 PDZ domains have one identical residue amongst the peptide-binding cleft positions, an αB +1 His, and three similar residues, at the βC −2 (Ser for A9UPE9, Thr for GRIP1-2), βC-αA +1 (Lys and Arg, respectively), and αB +9 (Arg and Lys, respectively). This result highlights the importance of using number of matches and not sequence identity for this type of analysis.

PDZ Domain
Matching/Similar Residues Sequence Identity Our DAMM program results clearly show that while the top result for A9UPE9 PDZ is SYNJ2BP PDZ, overall, its peptide-binding cleft is most similar to the DLG family of PDZ domains. Of these PDZ domains, those from DLG proteins are more extensively studied, including by our lab. Therefore, we chose to use DLG1-1 PDZ in fluorescence polarization experiments to directly compare binding affinity values with A9UPE9 PDZ.
The A9UPE9 and DLG1-1 PDZ domains were expressed and purified following similar protocols, as previously described and in the Materials and Methods [48,54]. The reporter peptide used for both proteins was a fluoresceinated sequence matching the C-terminus of the HPV18 E6 oncoprotein (sequence: FITC-RLQRRRETQV), or F*-HPV18 E6. Determined K D values for each protein following triplicate experiments were 2.3 ± 1.6 µM for DLG1-1 PDZ and 6.9 ± 1.7 µM for A9UPE9 PDZ ( Figure 5). The HPV18 E6 sequence was chosen based on its known interaction with DLG1-1 PDZ [66][67][68]. A fluoresceinated version of the HPV16 E6 C-terminal sequence (FITC-SSRTRRETQL) was also tested and bound A9UPE9 PDZ with K D = 19.5 ± 1.8 µM, based on duplicate experiments. Therefore, we used the higher affinity F*-HPV18 E6 as our reporter peptide.
is SYNJ2BP PDZ, overall, its peptide-binding cleft is most similar to the DLG family of PDZ domains. Of these PDZ domains, those from DLG proteins are more extensively studied, including by our lab. Therefore, we chose to use DLG1-1 PDZ in fluorescence polarization experiments to directly compare binding affinity values with A9UPE9 PDZ.
The A9UPE9 and DLG1-1 PDZ domains were expressed and purified following similar protocols, as previously described and in the Materials and Methods [48,54]. The reporter peptide used for both proteins was a fluoresceinated sequence matching the C-terminus of the HPV18 E6 oncoprotein (sequence: FITC-RLQRRRETQV), or F*-HPV18 E6.
Finally, we predicted that the apparent preferences of A9UPE9 PDZ for P −4 and P −5 Arg residues (e.g., for P −5 , in RRETDV as compared with SRETDV) are due to D33 and/or D34, residues in the βB-βC loop, as well as D78, the amino acid following the αB +1 His, and that a preference for a P −3 Glu (as in the HPV sequences) is likely due to K45 (Figure 7b). Taken together, our DAMM program successfully identified DLG PDZ domains as having similar binding preferences as A9UPE9 PDZ, based on shared binding cleft residues. A small number of substitutions in a target sequence, here HPV18 E6, were able to further refine these preferences for A9UPE9 PDZ, and we identified two sequences, F*-HPV18 E6 and RRETDV, that bound with <10 µM affinity.

Endogenous Targets of A9UPE9 PDZ
Finally, we set out to identify potential endogenous targets of A9UPE9 PDZ, based on peptide-binding cleft similarities with human PDZ domains. We used our HPV18 E6 variant sequences (Table 3), as well as known human targets of the top hit from the DAMM program, SYNJ2BP PDZ. As previously described, Motif Matcher scans a filtered set of M. brevicollis sequences consisting of the final 6 C-terminal residues for each protein in the UniProtdownloaded proteome for most-similar matches to each sequence [54]. We initially required the program to maintain a P 0 Val residue, based on the affinity difference between the HPV16 E6 and HPV18 E6 sequences, but allowed P −2 to be Ser or Thr. In addition, we ran M. brevicollis proteins that were closest matches to known SYNJ2BP PDZ targets, including  [70][71][72]. We ultimately tested seven additional peptides in our fluorescence polarization assay, including A9V6G5_MONBE (HRESTV), A9UWH1_MONBE (STRSDV), A9UR52_MONBE (SRRTEV), A9UWP5_MONVE (GSESSV), and A9UYY4_MONBE (RLA-SEV), as well as A9VB85_MONBE (ARESEI), which contained P −1 /P −3 Glu residues and a P −4 Arg, as well as the DLL1/DLL4 sequence directly (Table 4). A9VB85_MONBE was identified by using Motif Matcher to search for sequences with the highest number of positive modulators for A9UPE9 based on our biochemical analyses, including P −1 = E/D/Q/N, P −3 = E/D, P −4 = R/K, and P −5 = any residue except E/D. where black text indicates shared residues and blue/pink text indicates differences. The βC-αA +1 residue, which influences P −1 selectivity, is indicated with a black circle. (b) Additional residues in A9UPE9 that may influence peptide preferences include D33, D34, K45, and/or D78, which are labeled and highlighted with circles. All side chains are shown as gray sticks and colored by heteroatom, with peptide-binding cleft residues used in the DAMM program colored as in (a). The HPV18 E6 peptide (sequence: RRETQV) from PDB ID 2OQS is shown as reference and labeled.
Finally, we predicted that the apparent preferences of A9UPE9 PDZ for P −4 and P −5 Arg residues (e.g., for P −5 , in RRETDV as compared with SRETDV) are due to D33 and/or D34, residues in the βB-βC loop, as well as D78, the amino acid following the αB +1 His, and that a preference for a P −3 Glu (as in the HPV sequences) is likely due to K45 ( Figure  7b). Taken together, our DAMM program successfully identified DLG PDZ domains as having similar binding preferences as A9UPE9 PDZ, based on shared binding cleft residues. A small number of substitutions in a target sequence, here HPV18 E6, were able to further refine these preferences for A9UPE9 PDZ, and we identified two sequences, F*-HPV18 E6 and RRETDV, that bound with <10 µM affinity.

Endogenous Targets of A9UPE9 PDZ
Finally, we set out to identify potential endogenous targets of A9UPE9 PDZ, based on peptide-binding cleft similarities with human PDZ domains. We used our HPV18 E6 variant sequences (Table 3), as well as known human targets of the top hit from the DAMM program, SYNJ2BP PDZ. As previously described, Motif Matcher scans a filtered set of M. brevicollis sequences consisting of the final 6 C-terminal residues for each protein in the UniProt-downloaded proteome for most-similar matches to each sequence [54]. We initially required the program to maintain a P 0 Val residue, based on the affinity difference between the HPV16 E6 and HPV18 E6 sequences, but allowed P −2 to be Ser or Thr. In addition, we ran M. brevicollis proteins that were closest matches to known SYNJ2BP PDZ targets, including TMIGD1 (sequence: HSETAL), DLL1/DLL4 (VIATEV), and SYNJ2BP (ASGSSV), through Motif Matcher [70][71][72]. We ultimately tested seven additional peptides in our fluorescence polarization assay, including A9V6G5_MONBE (HRESTV), A9UWH1_MONBE (STRSDV), A9UR52_MONBE (SRRTEV), A9UWP5_MONVE Figure 7. (a) Comparison of the peptide-binding clefts of A9UPE9 PDZ (marine) and DLG1-1 (pink, PDB ID 3RL7). DLG1-1 is bound to the peptide matching the C-terminal sequence of the adenomatous polyposis coli (APC) tumor suppressor protein (sequence: YLVTSV) [69]. Peptidebinding cleft residues are shown as side chain sticks, colored by heteroatom (N = blue, O = red) and labeled, where black text indicates shared residues and blue/pink text indicates differences. The βC-αA +1 residue, which influences P −1 selectivity, is indicated with a black circle. (b) Additional residues in A9UPE9 that may influence peptide preferences include D33, D34, K45, and/or D78, which are labeled and highlighted with circles. All side chains are shown as gray sticks and colored by heteroatom, with peptide-binding cleft residues used in the DAMM program colored as in (a). The HPV18 E6 peptide (sequence: RRETQV) from PDB ID 2OQS is shown as reference and labeled. As shown in Table 4, most of our peptides failed to bind A9UPE9 PDZ with reasonable affinities to suggest an endogenous interaction. Structural analyses suggested that A9UWH1_MONBE and A9UR52_MONBE likely failed to bind due to the presence of the P −3 Arg residue, which may negatively interact with the positively-charged K45 ( Figure 7b). It is not immediately clear why A9UYY4_MONBE bound with no detectable affinity, defined as >1000 µM. One hypothesis is that A9UPE9 PDZ requires a P −3 Glu for binding. Despite containing all other preferred selectivity determinants, including P −1 /P −3 Glu residues and a P −4 Arg, the P 0 Ile in A9VB85_MONBE likely explains why that sequence does not bind A9UPE9 PDZ. Our previous work investigating the basis of P 0 selectivity found that the PDZ-defining carboxylate-binding loop sequence "GFGF", as in A9UPE9 PDZ, will likely only accommodate Val and Leu residues [46].
Overall, this work shows that it is challenging to predict endogenous targets of a choanoflagellate PDZ sequence that is not homologous to human PDZ domains. However, we identified one sequence, A9V6G5_MONBE, that binds with an endogenously relevant affinity, <100 µM, using the DAMM program. In addition, a previously tested M. brevicollis peptide, A9UP44 (QSESRL), bound A9UPE9 PDZ with K I = 64 µM (Table 3). It is not clear why A9UPE9 PDZ bound this sequence with relatively high affinity, despite the P −1 Arg residue, and suggests that the P −3 Glu provides the majority of free energy of binding amongst non-motif/modulator residues. In our previous work, we successfully used MotifMatcher-PDZ to identify sequences that bound mbSHANK with~40-70 µM affinity, which is in the same relative range and may reflect the relatively low affinities that PDZ domains have for their endogenous targets [5,54]. Taken together, structural and sequence analyses, in combination with the DAMM program, allowed us to identify two potential endogenous targets of A9UPE9 PDZ. Future work to validate these targets in vivo, as well as to test additional M. brevicollis sequences is necessary to develop a holistic view of the cellular function of A9UPE9.

Discussion
Characterization of the proteins in choanoflagellate signaling pathways can provide important insights into the evolution of human cellular processes. We developed a program to predict the target selectivity of PDZ domains in choanoflagellates, based on conserved amino acids in the peptide-binding cleft. Another critical consideration in future studies will be to think about the other domains in PDZ domain-containing proteins and how they coordinate to regulate cellular processes. For example, there are eight proteins in the M. brevicollis proteome that contain both PDZ and SH3 domains, a domain architecture that is only present in choanoflagellates based on BLAST searches, including A9UNP0_MONBE, A9UPE9_MONBE, A9UPI8_MONBE, A9UUC6_MONBE, A9UYE7_MONBE, A9V111_MONBE, A9V6P1_MONBE, and A9V7E4_MONBE. Notably, proteins that contain both PDZ and SH2 domains appear to be unique to organisms from the phylum Choanozoa as well, and there are five M. brevicollis proteins that contain these domains: A9V6T4_MONBE, A9V7X5_MONBE, A9VA09_MONBE, A9VB90_MONBE, and A9VC25_MONBE [48].
Work over the past decade has identified a complex network of tyrosine kinase signaling in choanoflagellates, more so than in any metazoan characterized [88][89][90][91][92]. It is clear that such networks are critical for cellular communication, environmental adaptation, and other processes in both non-metazoans, such as choanoflagellates, as well as in metazoans [90]. However, the intricacies of metazoan signaling pathways are not well understood, and future investigations into how PDZ and SH2/SH3 domains act synergistically in these proteins may provide deeper insight into the role of tyrosine kinase signaling in non-metazoans.
We developed the DAMM program, which can be used to characterize a previously uncharacterized PDZ domain, using sequence identity at key residues in the peptidebinding cleft. We hypothesize that the same approach could be applied to SH3 and other domains, based on conservation in amino acids that directly interact with a peptide target (Figure 8a,b. For example, despite only 29% sequence identity (14/48 residues), structural analyses of the A9UPE9 SH3 and human Src SH3 domains reveal similar peptide-binding cleft residues (Figure 8b). While protein biochemistry can tease apart unique selectivity determinants and identify potential protein-protein interactions in choanoflagellates, cellbased experiments to directly test predictions and investigate these complex pathways will be an exciting next area of research in understanding the evolution of PDZ and tyrosine kinase signaling.
naling in choanoflagellates, more so than in any metazoan characterized [88][89][90][91][92]. It is clear that such networks are critical for cellular communication, environmental adaptation, and other processes in both non-metazoans, such as choanoflagellates, as well as in metazoans [90]. However, the intricacies of metazoan signaling pathways are not well understood, and future investigations into how PDZ and SH2/SH3 domains act synergistically in these proteins may provide deeper insight into the role of tyrosine kinase signaling in non-metazoans.
We developed the DAMM program, which can be used to characterize a previously uncharacterized PDZ domain, using sequence identity at key residues in the peptidebinding cleft. We hypothesize that the same approach could be applied to SH3 and other domains, based on conservation in amino acids that directly interact with a peptide target (Figure 8a,b. For example, despite only 29% sequence identity (14/48 residues), structural analyses of the A9UPE9 SH3 and human Src SH3 domains reveal similar peptide-binding cleft residues (Figure 8b). While protein biochemistry can tease apart unique selectivity determinants and identify potential protein-protein interactions in choanoflagellates, cellbased experiments to directly test predictions and investigate these complex pathways will be an exciting next area of research in understanding the evolution of PDZ and tyrosine kinase signaling.

Protein Expression and Purification
Expression and purification of all human and M. brevicollis PDZ domains followed a similar protocol, as previously reported for mbSHANK1 PDZ [48,54]. Briefly, N-terminal His-tagged versions of A9UPE9 PDZ (residues 7-98) and DLG1-1 PDZ (residues 220-317) with cleavable TEV sites were inserted into the pET28a+ vector by gene synthesis (Gen-Script) and expressed in Escherichia coli BL21 (DE3) cells. Cells were lysed on ice using sonication. The lysis buffer used was 50 mM Tris pH 8.5, 200 mM NaCl, 10 mM CaCl2, 10 mM MgCl2, 20% (w/v) glycerol, 50 mM imidazole pH 8.5, 0.25 mM TCEP, DNAse, and protease inhibitor cocktail. Immobilized metal-affinity chromatography, 5 mL HisTrap (GE Healthcare, now Cytiva), was used to purify proteins from the clarified supernatant. The wash buffer used was 25 mM imidazole pH 8.5, 25 mM Tris pH 8.5, 250 mM NaCl, 10% (v/v) glycerol, and 0.25 mM TCEP, and the elution buffer was 400 mM imidazole pH 8.5, 25 mM Tris pH 8.5, 50 mM NaCl, 10% (v/v) glycerol, and 0.25 mM TCEP. Except for the protein used in binding experiments, the protein was then dialyzed in dialysis/gel filtration buffer (25 mM Tris pH 8.5, 150 mM NaCl, 10% (w/v) glycerol, 0.5 mM TCEP) and incubated at 4 • C overnight with TEV protease to cleave off the His-tag. The cleaved protein was then purified using a second nickel column with the wash and elution buffers described above. All proteins were further purified on a Superdex S75 column (Cytiva), using gel filtration buffer. Proteins were concentrated using Amicon centrifugal concentrators (3 MWCO). Concentrated proteins used in fluorescence polarization assays were flash frozen in liquid nitrogen for storage at −80 • C. The extinction coefficient values used for quantification of the proteins at A 280 were 1490 cm −1 ·M −1 (A9UPE9) and 4470 cm −1 ·M −1 (DLG1-1).

Design and Development of Domain Analysis and Motif Matcher (DAMM)
The DAMM software suite was written using Python 3. The first step of DAMM, Domain Analysis (DA), receives a PDZ domain sequence in FASTA format as input from the user. DA interfaces with BioPython and performs a pairwise BLAST-style alignment of the input sequence to the 140 PDZ-labeled domains in our database [93]. DAMM outputs the top-scoring pairwise alignment, showing known conserved residues, then prompts the user for a list of seven conserved residues. Critically, the sequence will not properly align with the annotated PDZ domain sequences if the input is >100 residues, so it is important to input just the PDZ domain itself.
Following identification of peptide-binding cleft residues, DA aligns the input sequence with the sequences of all 272 human PDZ domains, based on previous curation [5]. The output file lists pairwise alignments in order based on BlastP score and also lists the number of matching and/or similar peptide-binding cleft residues. Amino acid similarity is based on the following five groupings: (group 1, hydrophobic) AILV; (group 2, negativelycharged) DE; (group 3, positively-charged) KRH; (group 4, aromatic) FYW; (group 5, polar) CMNQST. Glycine (G) and proline (P) are not included in the groupings because of their unique chemical and/or structural properties.
A user might want to perform a motif similarity search after the alignment analysis and determination of number of matching/similar peptide-binding cleft residues. DAMM provides a separate motif matching program, inspired by MotifAnalyzer-PDZ and here named Motif Matcher (MM) [54]. Motif Matcher takes as input a target proteome, specific position constraints for each motif residue (e.g., P 0 = F, I, L, or V and P −2 = S or T), and the maximum amount of substitutions a user will tolerate (e.g., for a hexamer with two defined motif positions, the maximum number here would be equal to four). The output is a list of all C-terminal hexameric sequences that meet these constraints, including the number of substitutions at non-motif positions (Table 5).

Protein Analyses
Structural models of A9UPE9 PDZ were created using the Robetta online server [64]. A structural homology model of A9UPE9 SH3 was created using SwissModel [94][95][96]. Sequence alignments were performed using BLAST and T-coffee [65,97]. All structure figures were created using PyMOL.

Conclusions
Here we provide an experimental pipeline to predict endogenous targets of a previously uncharacterized PDZ domain with no homology to any human protein. In our approach, we use structural conservation and sequence analyses to determine positional selectivity determinants, and we successfully identify two choanoflagellate sequences with endogenously relevant binding affinities. We argue that this type of approach is applicable to other peptide-binding domains, e.g., SH2 and SH3 domains. These experiments have the potential to provide important insights into signaling pathways in choanoflagellates and other organisms, allowing us to better understand the origins of multicellularity.