Survey of Protein Sequence Embedding Models

Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).


Introduction
Proteins are biopolymers made up of amino acid residues. The composition and order of amino acids determine the 3D structure and function of a protein. Since the creation of the first databases of protein sequences, there has been a need for a descriptor that can uniquely identify a protein sequence, its physicochemical properties, and function.
Recent advances in natural language processing (NLP) and deep learning (DL) have led to the development of protein language models that consider single amino acids and their doublets or triplets as tokens (words) and generate fixed-size vectors (embeddings) representing a given protein. These vectors account for both the composition of amino acids and their order in the sequence [1]. Two major encoding approaches used in the field are long short-term memory (LSTM) [2] and transformer [3], e.g., bidirectional encoder representations from transformers (BERT).
In this work, we surveyed a number of representative embedding models for execution time, memory needs, and their ability to perform various tasks related to global properties for different protein sets: proteome-wide sequence embedding, GO annotation, association of human variants with disease, correlation of mutants with antimicrobial resistance in a bacterial drug target, finding similarity in diverse fungal mating factors, and distinguishing taxonomy and function in virulence factors.

Proteome Wide Benchmarking
First, all of the models were assessed for speed and memory needs to generate embeddings on a proteome-wide level. We used Saccharomyces cerevisiae (S288C strain) for this benchmarking as this organism has a considerably sized proteome (over 6000 proteins of various length, which are largely annotated) with the structural and functional complexity of eukaryotic organisms. Of note, given that both Esm and Esm1b impose the input sequence length limit of 1022 amino acids, we used two sets of proteins: proteins with length within 1022 aa for all five models, and all proteins for the One-Hot, SeqVec, and ProtT5 models only. Figure 1 shows the benchmarking results of generating the proteome-wide embeddings using the same hardware. As anticipated, One-Hot runs very fast and does not require much memory for either of the protein sets. SeqVec is the longest model to execute (33,305 ± 179 and 42,816 ± 494 s for short and all proteins, respectively) followed by ProtT5 (17,724 ± 145 and 23,390 ± 532 s). On the other hand, the SeqVec model demonstrates the lowest memory footprint among the four embedders-953 ± 10 and 1142 ± 73 MB on average and peak memory, respectively, for short proteins, whereas ProtT5 is the most memory demanding model (5176 ± 45 and 15,763 ± 2 MB). Thus, execution time appears to depend on the type of the neural network: Transformer-based models are faster than LSTM, whereas memory requirements are directly proportional to the complexity of the models (the numbers of layers and parameters).
Original publications of the embedding models assessed in this survey already provide visualizations as to how they can distinguish proteins by structure organization (e.g., all alpha, all beta, multi-domain, etc.), enzymatic function (e.g., oxidoreductases, transferases, hydrolases, etc.), or by the origin (viruses, archaea, bacteria, eukaryota). Given that S. cerevisiae, as any other sequenced organism, contains some not yet annotated proteins (845 of 6016 proteins in the S288C strain, in particular), we posed a different question-whether embeddings can separate the uncharacterized proteins in a given proteome and why. Figure 2 shows that all four embedding models are able to cluster out a large fraction of the uncharacterized proteins. When compared by sequence names, the clusters of uncharacterized proteins appear largely overlapped ( Figure 3A). To determine why all models group these sequences in a separate cluster, we compared four subsets: (1) uncharacterized proteins within the cluster, (2) uncharacterized proteins outside the cluster, and annotated proteins within (3) and outside (4) that cluster. Since the uncharacterized sequences constituting a separate cluster are nearly the same across the embedders, only the SeqVec-based data was analyzed. The first observation was that the sequences constituting this cluster were short (under 200 aa, Figure 3B). The second observation was that three amino acids demonstrate a skewed composition within the cluster: aspartate and glutamate are under-represented, whereas cysteine is enriched in the cluster of uncharacterized proteins ( Figure 3C-E). The third observation is that no compositional bias is observed for the pairs of amino acids within the cluster. glutamate are under-represented, whereas cysteine is enriched in the cluster of uncharacterized proteins ( Figure 3C-E). The third observation is that no compositional bias is observed for the pairs of amino acids within the cluster.  glutamate are under-represented, whereas cysteine is enriched in the cluster of uncharacterized proteins ( Figure 3C-E). The third observation is that no compositional bias is observed for the pairs of amino acids within the cluster.  acterized proteins by definition, the goal was to compare the concordance of these two methods in their annotations. Of the 255 uncharacterized proteins submitted to these two tools, GoPredSim was able to annotate 80 proteins with a high confidence, whereas PLAST assigned GO terms to 153 proteins. Overall, GoPredSim is more conservative in the GO annotation than PLAST ( Figure 4A,B). Interestingly, these two tools have overlaps in 62 proteins and 15 GO terms only, even for the relaxed cutoffs (RI > 0.2 and p-value < 0.01, Figure 4C,D).  Amino acid composition is defined as the log-odds ratio between the given sequence and the background frequency (Equation (1)): where a is an amino acid type (1 out of the 20 most common natural amino acids); n a -the count of the amino acid in a given sequence; n-the sequence length; N a -the total count of the amino acid in the proteome; N-the total length of the proteome. Next, we looked into the ability of the embedding models to predict the global properties of these uncharacterized proteins as gene ontology (GO). Two recently published methods were assessed: GoPredSim [10] and PLAST [16]. GoPredSim employs SeqVec embeddings, whereas PLAST is based on the Esm1b model. We ran GoPredSim using the k-nearest neighbors mode, with k = 1. Both methods used annotated proteins from Swiss-Prot as a reference database. Since the true GO annotations are not known for the uncharacterized proteins by definition, the goal was to compare the concordance of these two methods in their annotations.
Of the 255 uncharacterized proteins submitted to these two tools, GoPredSim was able to annotate 80 proteins with a high confidence, whereas PLAST assigned GO terms to 153 proteins. Overall, GoPredSim is more conservative in the GO annotation than PLAST ( Figure 4A,B). Interestingly, these two tools have overlaps in 62 proteins and 15 GO terms only, even for the relaxed cutoffs (RI > 0.2 and p-value < 0.01, Figure 4C,D). aa for a better resolution of the short-range lengths. (C-E). Composition (Equation (1)) distributions of aspartate, glutamate, and cysteine, respectively, in sequences within and outside the cluster of uncharacterized proteins. The embedding data and line colors are the same as in panel (B). Distributions for all amino acids can be found in Supplementary Material S2.

Human Protein Variants and Diseases
To assess whether the embeddings of human gene variants may be indicative of a disease status (Pathogenic versus Benign), 100 proteins were randomly picked from the UniProt/Swiss-Prot humsavar database of human variants (Release 2022_03 of 3 August 2022). These proteins contained 2889 likely pathogenic/pathogenic (LP/P) and 833 likely benign/benign (LB/B) mutations curated from the literature (see the complete list of assessed proteins and mutations in Supplementary material S3). The length of protein sequences ranged between 79 and 5202 aa. As some proteins were larger than 1022 aa (the limit for the Esm/Esm1b models), the SeqVec and ProtT5 embeddings were only considered for this analysis.
As can be seen from Table 1, both benign and pathogenic variants in human proteins cause very minor changes in the corresponding embedding vectors: cosine similarity scores between mutated sequences and their reference proteins have a standard deviation in the third digit after the point. Overall, cosine scores are not normally distributed. However, the Mann-Whitney U test shows that similarity scores for benign mutations are statistically different from the pathogenic variants.

Human Protein Variants and Diseases
To assess whether the embeddings of human gene variants may be indicative of a disease status (Pathogenic versus Benign), 100 proteins were randomly picked from the UniProt/Swiss-Prot humsavar database of human variants (Release 2022_03 of 3 August 2022). These proteins contained 2889 likely pathogenic/pathogenic (LP/P) and 833 likely benign/benign (LB/B) mutations curated from the literature (see the complete list of assessed proteins and mutations in Supplementary material S3). The length of protein sequences ranged between 79 and 5202 aa. As some proteins were larger than 1022 aa (the limit for the Esm/Esm1b models), the SeqVec and ProtT5 embeddings were only considered for this analysis.
As can be seen from Table 1, both benign and pathogenic variants in human proteins cause very minor changes in the corresponding embedding vectors: cosine similarity scores between mutated sequences and their reference proteins have a standard deviation in the third digit after the point. Overall, cosine scores are not normally distributed. However, the Mann-Whitney U test shows that similarity scores for benign mutations are statistically different from the pathogenic variants.

TEM-1 Variants and Antimicrobial Resistance
Beta-lactamase TEM-1 from Escherichia coli was used to assess how embeddings for various variants deviate from the reference protein sequence and whether they may correlate with the experimentally measured antimicrobial resistance. Jacquier and colleagues generated a large library of TEM-1 mutants using the GeneMorph II Random Mutagenesis Kit (Stratagene) and measured the minimal inhibitory concentration (MIC) of amoxicillin necessary to stop the growth of the respective bacterial colonies [17].
Following the original publication, MIC values were binned into 0, 12.5, 25, 50, 100, 250, 500, 1000, 2000, and 4000 mg/L, and log transformed. The reference TEM-1 sequence of 286 aa long was taken from the UniProt database (UniProt ID: P62593). Of the 8621 total sequenced mutants, only those representing non-synonymous mutations and containing no early stop codons or frame shifts were considered in this work. If different mutations resulted in identical protein variants, only one protein sequence was included in the dataset. Such filtering resulted in 4930 unique mutants ranging from single to eleven simultaneous amino acid mutations. This dataset gives a unique opportunity to assess the effect of diverse and multiple simultaneous mutations on sequence embeddings on a large scale and to relate them to experimentally derived data, such as MIC. A summary of the mutants considered and the corresponding results of the embeddings is presented in Table 2.

Mating Factors
Mating factors (pheromones) are essential for sexual reproduction in fungi and are classified as mating factors "a" and "α". They allow for the recognition and mating of cells of different mating types in heterothallic species. After being thoroughly studied in baker's yeast, mating factors are identified in many fungi. As pheromones display a high complexity and variability of sequences, the identification of a homolog through the simple sequence homology search is non-trivial [18]. In this survey, the ability of sequence embeddings to identify hidden common features among the diverse mating factors of the same type across species was examined. Then, how such similarities correlate with the actual sequence homology derived from the pairwise sequence alignments by BLAST were assessed. Full protein sets of α-factor precursors (n = 60) and mating factors of type a1 and a2 (n = 34) were retrieved from the Pfam database (Pfam IDs: PF05436 and PF17317, respectively).
As can be seen from Table 3, both classes of mating factors are quite divergent, yielding only a 38% and 59% sequence identity on average within the precursors of the α-factors and a-factors, respectively. However, the average pairwise cosine similarity for the Esm and Esm1b models is high-over 0.99-for both classes of mating factors. ProtT5 and SeqVec also display high pairwise cosine similarity within the classes-over 0.8. However, when compared with the actual sequence identity, the cosine scores show low to no correlation for the α-factors, and a moderate correlation for a-factors. Among the four embedding models, SeqVec shows the highest correlation between the cosine similarity and the actual sequence identity, potentially making it the most reliable in finding sequences representing the mating factors in the proteomes of non-annotated fungi. Table 3. Comparison of embedding models with BLAST over mating factors (MF).

Virulence Factors
Pathogenic microorganisms have evolved to employ various ways of infecting higher organisms and thriving within hostile environments. Evasion or adaptation to host immunity, targeting different host cell receptors, or hijacking the transcriptional machinery are some examples of such strategies. The Victors database provides access to the known virulence factors of pathogens from bacteria, viruses, fungi, and other single cell eukaryotic parasites [19]. As of 20 October 2022, the database contained 5304 proteins for 127 pathogens. We assessed how the embedding models differentiate virulence factors ( Figure 5). Int. J. Mol. Sci. 2023, 24, x FOR PEER REVIEW 8 of 11 Figure 5. Distribution of embeddings by the Esm, Esm1b, SeqVec, and ProtT5 models for the virulence factors obtained from the Victors database [19]. The upper row presents all the virulence factors colored by kingdoms, and the lower row presents the clusters of the orthologous group (COG) categories, using cell motility (red) in contrast to other functions (blue), proteins with multiple functions (orange), and other proteins with no COG assigned (grey). The t-SNE plots for all the other COG categories can be found in Supplementary materials S4.
While embeddings representing the bacterial virulence factors appear broadly distributed in all models, ProtT5 makes the best clustering of fungal and viral virulence factors ( Figure 5, upper row). When considered from the perspective of a specific function, all the models successfully cluster out the proteins involved in cell motility ( Figure 5, lower row, red). The distribution of proteins from all 18 functions, as defined by the clusters of the orthologous group (COG) categories in the Victors database can be found in Supplementary materials S4. Interestingly, all models, except for SeqVec, separate the virulence factors of unknown function ( Figure 5, lower row, grey).

Discussion
Our survey of the embedding models demonstrates that most models are able to separate proteins of unknown function from annotated proteins in a given dataset (Figure 2, Figure 5). In part, it may be attributed to the biases in sequence length as uncharacterized proteins tend to be shorter ( Figure 3B), but there may also be biases in the amino acid composition ( Figure 3C-E). In the context of the functional annotation of such proteins, sequence embedding-based methods demonstrate variable performance. GoPredSim appears to be highly conservative in assigning gene ontology terms-it was able to annotate only about 30% proteins with unknown function in S. cerevisiae with a degree of high confidence. Interestingly, its reliability index, as a measure of confidence, appears to follow a bimodal distribution and may potentially serve as a classification variable with a cutoff of around 0.25 ( Figure 4A). On the other hand, PLAST has a more gradual relationship between p-value and embedding similarity ( Figure 4B). But PLAST generally finds similar sequences more frequently than GoPredSim, which may be attributed to the fact that PLAST is based on Esm1b embeddings. From the other tests in our survey, Esm and Esm1b models tend to generate embeddings for diverse proteins that result in a very narrow range of high cosine similarity scores (0.99-1.00, Table 2), which may potentially yield a high false positive rate. Low variance in the embedding vectors of the Esm/Esm1b models is further demonstrated in the diverse sets of fungal mating factors that have a low pairwise sequence identity. However, although cosine similarity is very high, it has a low Figure 5. Distribution of embeddings by the Esm, Esm1b, SeqVec, and ProtT5 models for the virulence factors obtained from the Victors database [19]. The upper row presents all the virulence factors colored by kingdoms, and the lower row presents the clusters of the orthologous group (COG) categories, using cell motility (red) in contrast to other functions (blue), proteins with multiple functions (orange), and other proteins with no COG assigned (grey). The t-SNE plots for all the other COG categories can be found in Supplementary material S4.
While embeddings representing the bacterial virulence factors appear broadly distributed in all models, ProtT5 makes the best clustering of fungal and viral virulence factors ( Figure 5, upper row). When considered from the perspective of a specific function, all the models successfully cluster out the proteins involved in cell motility ( Figure 5, lower row, red). The distribution of proteins from all 18 functions, as defined by the clusters of the orthologous group (COG) categories in the Victors database can be found in Supplementary material S4. Interestingly, all models, except for SeqVec, separate the virulence factors of unknown function ( Figure 5, lower row, grey).

Discussion
Our survey of the embedding models demonstrates that most models are able to separate proteins of unknown function from annotated proteins in a given dataset (Figure 2, Figure 5). In part, it may be attributed to the biases in sequence length as uncharacterized proteins tend to be shorter ( Figure 3B), but there may also be biases in the amino acid composition ( Figure 3C-E). In the context of the functional annotation of such proteins, sequence embedding-based methods demonstrate variable performance. GoPredSim appears to be highly conservative in assigning gene ontology terms-it was able to annotate only about 30% proteins with unknown function in S. cerevisiae with a degree of high confidence. Interestingly, its reliability index, as a measure of confidence, appears to follow a bimodal distribution and may potentially serve as a classification variable with a cutoff of around 0.25 ( Figure 4A). On the other hand, PLAST has a more gradual relationship between p-value and embedding similarity ( Figure 4B). But PLAST generally finds similar sequences more frequently than GoPredSim, which may be attributed to the fact that PLAST is based on Esm1b embeddings. From the other tests in our survey, Esm and Esm1b models tend to generate embeddings for diverse proteins that result in a very narrow range of high cosine similarity scores (0.99-1.00, Table 2), which may potentially yield a high false positive rate. Low variance in the embedding vectors of the Esm/Esm1b models is further demonstrated in the diverse sets of fungal mating factors that have a low pairwise sequence identity. However, although cosine similarity is very high, it has a low to no correlation with sequence identity for the α-factors, and a moderate correlation for the a-factors (Table 3).
Interestingly, a comparison of embeddings for variants in human proteins, as well as various mutants in the beta-lactamase TEM-1 of E. coli, shows that the resulting vectors are very similar to those of the reference sequences, yielding a cosine similarity score in the 0.9-1.0 range (Tables 1 and 2). Nevertheless, the Mann-Whitney U test indicates that there is a statistically significant difference in the distributions of these scores between benign and disease-causing mutations (Table 1). This is in line with the recent report on the ability of the protein language models to detect conserved positions in proteins and disease-causing single amino acid variants [14]. The use of such embeddings in predicting antimicrobial resistance remains to be seen, though, as our survey shows low to no correlation between the cosine similarity and minimal inhibitory concentrations for all models ( Table 2).

Protein Sequence Embedding Models
In this survey, four of the most cited embedding models were used: Esm/Esm1b [12], Sequence-to-Vector (SeqVec) [4], and ProtTrans (ProtT5) [9], along with One-Hot as a baseline model. Their respective implementations were taken from the Bio Embeddings python library [20]. Both the Esm and Esm1b models are based on the Transformer [3] architecture and trained on the high-diversity sparse dataset of the UniRef50 representative sequences, with 1280-dimensional output vectors. Esm has 670 M parameters (34 layers), while Esm1b has 650M parameters (33 layers). SeqVec is based on the bidirectional LSTM language model [2] with 93M parameters and is trained on the UniRef50 protein dataset. ProtT5-XL is based on the T5 text-to-text transformer [21] with 3B parameters (24 layers) and is also trained on UniRef50. Both SeqVec and ProtT5 generate 1024-dimensional embedding vectors. The One-Hot output is a 21-dimensional vector, including 20 common plus one rare (selenocysteine, U) natural amino acid.

Conclusions
This survey provides an overview of the performance of four protein embedding models (Esm, Esm1b, SeqVec, and ProtT5) on various tasks, such as identifying uncharacterized proteins, predicting gene ontology, and differentiating virulence factors. The results indicate that the Esm and Esm1b models are the fastest models to execute, but have a restriction on protein length. This limits their broad application, especially to the proteomes of higher organisms. Although the SecVec model is the slowest, it is the most memory efficient. All of the models were able to cluster out a large fraction of uncharacterized proteins. In baker's yeast, such proteins demonstrated biases in the sequence length and amino acid composition. The performance of the embedding models in predicting the gene ontology varied, with GoPredSim being more conservative and PLAST assigning more GO terms. The embeddings of human gene variants did not show much deviation from a reference sequence but yielded statistically significant differences in the distributions between the benign and pathogenic mutations. The embeddings of various TEM-1 mutants did not show a correlation with antimicrobial resistance. The survey also found that Esm/Esm1b models tend to generate embeddings for diverse proteins that result in a very narrow range of high cosine similarity scores, which may potentially yield a high false positive rate. Overall, the ability of the protein language models to generate uniform, fixed length numerical identifiers for proteins has the potential to enable fast searches for similar proteins and different annotations, which is especially important in light of the vast amount of data generated by modern sequencing technology.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/ijms24043775/s1. Author Contributions: Conceptualization and Supervision, A.P.; Software and Coding, C.T. and S.K.; Data Curation and Visualization, C.T. and S.K.; Data Analysis and Interpretation, C.T., S.K., and A.P.; Funding Acquisition, A.P.; All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by the National Institutes of Health, grant number R21-AI143467 (A.P.).

Data Availability Statement:
Any new data generated through this study is provided in Supplementary Material files.