Article

Explainability of Protein Deep Learning Models

by Zahra Fazel 1, Camila P. E. de Souza 2, G. Brian Golding 3 and Lucian Ilie 1,*
1 Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
2 Department of Statistical and Actuarial Sciences, University of Western Ontario, London, ON N6A 5B7, Canada
3 Department of Biology, McMaster University, Hamilton, ON L6S 4K1, Canada
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2025, 26(11), 5255; https://doi.org/10.3390/ijms26115255
Submission received: 16 April 2025 / Revised: 18 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025
(This article belongs to the Section Biochemistry)

Abstract

Protein embeddings are the new main source of information about proteins, producing state-of-the-art solutions to many problems, including protein interaction prediction, a fundamental issue in proteomics. Understanding the embeddings and what causes the interactions is very important, as these models lack transparency due to their black-box nature. In the first study of its kind, we investigate the inner workings of these models using XAI (explainable AI) approaches. We perform extensive testing (3.3 TB of total data) involving nine of the best-known XAI methods on two problems: (i) the prediction of protein interaction sites using the current top method, Seq-InSite, and (ii) the production of protein embedding vectors using three methods, ProtBERT, ProtT5, and Ankh. The results are evaluated in terms of their ability to correlate with six basic amino acid properties—aromaticity, acidity/basicity, hydrophobicity, molecular mass, van der Waals volume, and dipole moment—as well as the propensity for interaction with other proteins, the impact of distant residues, and the infidelity scores of the XAI methods. The results are unexpected. Some XAI methods are much better than others at discovering essential information. Simple methods can be as good as advanced ones. Different protein embedding vectors can capture distinct properties, indicating significant room for improvement in embedding quality.

1. Introduction and Background

The field of computational biology has undergone significant transformations in recent years, driven mainly by the advent of deep learning models. Deep learning approaches have provided unprecedented insights into protein structure prediction, function annotation, and interaction dynamics, enabling researchers to explore the complexities of biological systems with remarkable precision and efficiency [1,2,3,4].
Deep learning models have emerged as powerful tools in this context, capable of processing vast amounts of biological data and extracting meaningful patterns that often elude human perception. These models have demonstrated remarkable success in predicting protein structures, identifying functional sites, and elucidating protein–protein interactions. Such capabilities have far-reaching implications for various fields, including drug discovery, personalized medicine, and synthetic biology [1,5,6,7].
Despite these advancements, the application of deep learning in computational biology has its challenges. One of the most pressing concerns is the potential for errors in the predictions of these models. A significant factor contributing to this challenge is the black-box nature of deep learning models. These models often transform inputs into outputs through complex, multi-layered neural networks, with little insight into the intervening processes. The lack of interpretability raises concerns about the reliability and trustworthiness of these models in high-stakes biological applications.
In this paper, we address the need for enhanced explainability in deep learning models applied to protein analysis. We explore methods to elucidate the decision-making processes of these models. This paper aims to bridge the gap between complex computational techniques and their practical applications in biology. Our goal is to foster a deeper understanding of how these models work, ultimately leading to more accurate predictions, better scientific discoveries, and safer, more practical applications.
We address two specific problems, protein embedding construction and protein interaction-site prediction, which we attempt to clarify using methods for AI explainability. We introduce these topics below.

1.1. Protein Embeddings

The natural language processing (NLP) field has undergone a revolutionary transformation through the emergence and development of contextual embeddings. This journey began with early models like Word2Vec [8] and GloVe [9] and moved to more sophisticated context-dependent architectures such as BERT [10] and T5 [11]. A critical factor in this progress has been the technique of self-supervised learning. This approach allows models to extract meaningful representations from vast amounts of unlabeled data, eliminating the need for extensive manual annotation.
Inspired by the remarkable success of these approaches in NLP, researchers have applied similar principles to the domain of proteomics. In this field, protein residues (amino acids) are associated with high-dimensional numerical vectors analogous to word embeddings in NLP. This cross-disciplinary transfer of knowledge has led to the development of a diverse array of protein embedding models. These include but are not limited to ProtVec [12], SeqVec [13], SSA [14], MSA-transformer [15], ProtBERT, ProtT5 [4], ESM2 [16], and Ankh [17]. Each model brings unique strengths and capabilities, contributing to the growing toolkit available to researchers in proteomics.
The impact of protein embeddings is far-reaching, as they enable researchers to gain deeper insights into the complex relationships between protein sequences, structures, and functions. This enhanced understanding can pave the way for advancements in protein engineering, drug discovery, and the elucidation of cellular mechanisms, ultimately contributing to the overall progress in the life sciences. In this paper, we investigate explanations for the construction of such embeddings for three methods: ProtBERT, ProtT5, and Ankh. Other interesting candidates include ESM2, another top method; SSA, a biLSTM-based method; and MSA-transformer, an alignment-based method. However, GPU memory limitations imposed constraints on the length of the proteins we could test (see the Results section). ESM2 and MSA-transformer take a batch of protein sequences as input, which would restrict the length of the tested proteins even further.
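As a concrete illustration (not part of this study's pipeline), the following minimal Python sketch shows how per-residue embeddings can be obtained from the publicly released ProtT5 encoder via Hugging Face transformers; the checkpoint name and preprocessing follow the public Rostlab release, and the sequence is a toy example.

```python
# Minimal sketch, assuming the public Rostlab ProtT5 encoder checkpoint; not the authors' pipeline.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"      # encoder-only ProtT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQR"                                   # toy sequence, not from the test set
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))       # ProtT5 expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # shape: (1, n + 1, 1024), includes </s>
embeddings = hidden[0, : len(sequence)]                   # n x 1024 per-residue embedding matrix
print(embeddings.shape)
```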

1.2. Protein Interaction-Site Prediction

The other problem we consider concerns a fundamental issue in proteomics: protein interaction-site prediction. Proteins are among the most important molecules in a cell and are responsible for a multitude of essential processes, which they perform by interacting with other proteins. Therefore, the investigation of these interactions is a key problem. Protein interaction-site prediction aims to identify the specific regions of a protein where interactions with other molecules are most probable. Recent developments in protein embeddings, as mentioned above, have helped to significantly improve the state of the art in interaction-site prediction. Various deep learning models have been developed to tackle this challenge, including GraphPPIS [18], DeepPPIS [19], DLPred [20], DELPHI [21], PITHIA [22], ISPRED-SEQ [23], and Seq-InSite [24]. These models can be divided into two main categories: structure-based and sequence-based. Structure-based models use 3D structures of proteins as input, while sequence-based models use protein sequences as input. Seq-InSite [24] is the current state-of-the-art model for this task, having outperformed structure-based models despite being sequence-based. We look into explaining how Seq-InSite makes its predictions.

1.3. Explainable AI (XAI) Methods

Explainability is the degree of human comprehension of an AI model’s decision-making processes and inherent biases. It encompasses methodologies that elucidate the internal mechanisms of AI systems, fostering user understanding and trust. The pursuit of explainability serves multiple critical functions: it enhances transparency, facilitates debugging, enables model improvements, and promotes knowledge discovery from otherwise opaque “black-box” models [25]. Three categories of explainability methods are investigated in this work: gradient-based, path-attribution, and local model-agnostic methods.
Gradient-based methods are designed to determine the gradient of a prediction or classification score relative to the input features. The differences between the methods in this category depend on how these gradients are calculated [25]. Saliency maps (Vanilla Gradient) [26] compute the gradient of the class score of interest with respect to the input pixels by approximating the score with a first-order Taylor expansion. A deconvolutional network (DeconvNet) [27] can be seen as a convolutional neural network (ConvNet) operating in reverse. This method inverts the forward pass of a traditional ConvNet, enabling the visualization and interpretation of learned features at various network layers. Guided backpropagation [28] integrates both saliency map and DeconvNet approaches and selectively filters gradients during backpropagation using a modified version of the backpropagation algorithm. Input X Gradient [29] calculates attributions by multiplying the derivatives of the output with respect to the input element-wise with the input itself. This approach combines the sensitivity information captured by gradients with the scale of the input features, potentially offering more precise and interpretable attributions than previous gradient-based methods [30].
Path-attribution methods analyze the influence of an input on a model’s prediction by contrasting it with a reference point. By evaluating the difference between the predictions for the original input and the reference input, these methods distribute this difference across all features of the original input [25]. DeepLIFT [31] assigns importance scores to input features based on differences in neuron activations compared to a reference. It overcomes gradient discontinuities and provides non-zero contributions even when gradients are zero. Integrated Gradients [32] is based on two fundamental principles of attribution methodologies: sensitivity and implementation invariance. The former stipulates that features receive non-zero attribution if and only if they are influencing predictions, while the latter requires that functionally equivalent networks yield identical attributions.
Local model-agnostic methods explain individual predictions [25]. LIME (Local Interpretable Model-agnostic Explanations) [33] approximates the model locally around an instance of data by first generating perturbed samples of that instance and then weighing them based on proximity. Then, it trains an interpretable model like a decision tree on these samples and uses it to explain the instance’s prediction. SHAP (Shapley Additive Explanations) [34] is based on the Shapley values from game theory. This method considers features’ importance as players in a coalition, with the model’s predictions as the payout. SHAP uses Shapley values to allocate this payout among features fairly. KernelShap [34] is an extension of SHAP that integrates LIME with a linear explanation model and Shapley values. GradientShap [35] offers an alternative approach for estimating SHAP values based on two key assumptions: the independence of input features and the linearity of the explanation model.
To summarize, the methods we employ are saliency maps, deconvolution, guided backpropagation, Input X Gradient, DeepLIFT, Integrated Gradients, LIME, KernelShap, and GradientShap. Different methods can provide very different explainability results, as seen in the example in Figure 1.
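To make the comparison concrete, the sketch below shows how these nine methods can be instantiated with the Captum library (which we also used for our experiments; see Section 3.4) on a placeholder PyTorch model; the toy model, input, and baseline are illustrative assumptions, not part of our pipeline.

```python
# Illustrative sketch only: the listed XAI methods applied via Captum to a toy model.
import torch
import torch.nn as nn
from captum.attr import (Saliency, Deconvolution, GuidedBackprop, InputXGradient,
                         DeepLift, IntegratedGradients, Lime, KernelShap, GradientShap)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # toy predictor
x = torch.randn(1, 8, requires_grad=True)                              # toy input
baseline = torch.zeros_like(x)                                         # reference point for path methods

methods = {
    "saliency": Saliency(model),
    "deconvolution": Deconvolution(model),
    "guided_backprop": GuidedBackprop(model),
    "input_x_gradient": InputXGradient(model),
    "deeplift": DeepLift(model),
    "integrated_gradients": IntegratedGradients(model),
    "lime": Lime(model),
    "kernel_shap": KernelShap(model),
    "gradient_shap": GradientShap(model),
}

for name, method in methods.items():
    if name in ("deeplift", "integrated_gradients", "gradient_shap"):
        attr = method.attribute(x, baselines=baseline)   # path-attribution methods use a baseline
    else:
        attr = method.attribute(x)
    print(name, attr.shape)                              # one attribution score per input feature
```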

1.4. Explaining Protein Learning Models

Despite the wide range of applications of transformers, limited research has been carried out on their explainability beyond visualizing their attention layers [36,37,38,39]. To the best of our knowledge, this is the first investigation into the explainability of protein language models. We consider the embedding construction and interaction-site prediction for three embedding methods using nine XAI methods for 34 protein sequences that fulfilled the requirements. For a given protein sequence of length n and any given XAI method, the explainability matrix (see Methods) is an n × n matrix, whose (i, j) entry gives the effect, or influence, of the j-th residue in computing the embedding vector (or prediction value) of the i-th residue. Figure 2 gives two such matrices for the same protein, 2L2T_A, and embedding method, ProtT5, which are very different due to the different XAI methods used.
The evaluation of explainability is performed in several ways. After ensuring that our matrices are fully distinguishable from random, we first calculate the correlation between the explainability matrices and each of the seven amino acid properties: interactivity, aromaticity, acidity/basicity, hydrophobicity, molecular mass, van der Waals volume, and dipole moment. Next, we evaluate how the influence attributed by each XAI method varies with the distance separating residues. Finally, the infidelity measure is used to evaluate the quality of the explanations of each method.

2. Results and Discussion

2.1. Data

We used a modified version of the protein sequence dataset used by Seq-InSite [24]. Due to the limitations of the Captum library [35], an explanation method cannot be run for one sequence across multiple GPUs. Therefore, GPU memory limitations meant that our experiments had to use protein sequences with at most 44 residues. We selected the protein sequences shown in Table 1, removed them from the training data, and retrained Seq-InSite on the remaining protein sequences.
For each protein, we computed explanation matrices for three embedding methods—ProtBERT, ProtT5, and Ankh—and for Seq-InSite predictions using the three embeddings as part of its input. We applied the nine XAI methods mentioned above. This resulted in a total of 54 explanation matrices for each protein sequence. All the matrices for the 2L2T_A protein are shown in Figure 3. A wide variety of patterns can be seen across the embeddings and XAI methods. Diagonal, row-wise, and column-wise patterns are visible, together with more complex patterns, even for the same method or the same test type.

2.2. Comparison with Random Matrices

In order to demonstrate that the resulting explanations were not arbitrary and contained relevant information, we trained an SVM classifier with a Radial Basis Function (RBF) kernel on 80% of the given explanations. The trained classifier was then evaluated on the remaining 20% of the explanations. The classifier distinguished between explanation maps and random matrices with 100% accuracy. This indicates that, in the kernel's feature space, a hyperplane separates the explanations from random matrices with a margin, validating the information content within the explanations.
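The following sketch illustrates the type of check described above, training an RBF-kernel SVM to separate flattened explanation matrices from random ones; the synthetic data stand in for the real explanation matrices and do not reproduce our experiment.

```python
# Hedged sketch of the sanity check: RBF-kernel SVM on flattened matrices (synthetic stand-ins).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_matrices, n = 100, 44                                           # 44 is the maximum protein length used
explanations = rng.normal(0.0, 1.0, (n_matrices, n * n)) + 0.5    # placeholder "explanation" matrices
random_maps = rng.normal(0.0, 1.0, (n_matrices, n * n))           # random matrices

X = np.vstack([explanations, random_maps])
y = np.array([1] * n_matrices + [0] * n_matrices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)                     # train on 80% of the examples
print("test accuracy:", clf.score(X_test, y_test))                # evaluate on the remaining 20%
```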

2.3. Amino Acid Properties

We investigated seven properties of amino acids, which are briefly described below. Three such properties are categorical—interactivity, aromaticity, and acidity/basicity—while the other four are numerical—hydrophobicity, molecular mass, van der Waals volume, and dipole moment. They are summarized in Table 2 with the exception of interactivity, which is dataset-dependent. For all protein sequences in Table 1, the interaction sites are known. The goal of the Seq-InSite program is to predict these interaction sites as accurately as possible.
Interactivity is the property of each residue to interact with other proteins. Hydrophobicity denotes the propensity for certain residues to minimize interactions with aqueous environments, preferentially associating with non-polar substances. This property significantly influences protein folding, stability, and intermolecular interactions [40]. The molecular mass of amino acids is defined as the cumulative sum of the atomic masses comprising a single amino acid molecule and serves as a fundamental parameter in various biochemical and analytical procedures [41]. The van der Waals volume quantifies the spatial occupation of an atom or molecule, encompassing the region influenced by its electron cloud, and is crucial for elucidating molecular interactions and steric effects [41]. The dipole moment represents a vectorial quantity characterizing the magnitude and orientation of charge separation within the molecular structure and plays a significant role in determining the electrostatic properties and intermolecular interactions of amino acids [41]. The aromaticity of amino acids is determined by the presence of a six-carbon ring in the side chain; phenylalanine (F), tyrosine (Y), and tryptophan (W) are classified as aromatic [40]. The acidity/basicity is determined by whether the amino acids exhibit acidic properties (aspartic acid (D) and glutamic acid (E)) or basic characteristics (histidine (H), lysine (K), and arginine (R)) [40].
For each of these properties, we calculated the mean explanation score for each of the 20 amino acids (see the Methods section). Explainability provides the impact that each input residue—called the source—has on the value (embedding or prediction) of any output residue—called the target. The calculation of the mean explanation score for each of the 20 amino acids was performed separately for the source and target. Therefore, for a given embedding and XAI method, we had four tests for each property, combining embeddings/predictions with source/target.
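A minimal sketch of this aggregation is given below; the helper function and toy data are illustrative assumptions, but the source/target convention follows the definition of the explanation matrices.

```python
# Assumed aggregation (a sketch, not the authors' exact code): per-amino-acid mean source/target scores.
import numpy as np
from collections import defaultdict

def mean_scores_by_amino_acid(X, sequence):
    """X[i, j] is the influence of residue j (source) on residue i (target)."""
    target_scores = X.mean(axis=1)        # how strongly each residue is influenced
    source_scores = X.mean(axis=0)        # how strongly each residue influences others
    per_aa = defaultdict(lambda: {"target": [], "source": []})
    for i, aa in enumerate(sequence):
        per_aa[aa]["target"].append(target_scores[i])
        per_aa[aa]["source"].append(source_scores[i])
    return {aa: {k: float(np.mean(v)) for k, v in d.items()} for aa, d in per_aa.items()}

# Example with a toy matrix and sequence:
X = np.abs(np.random.default_rng(1).normal(size=(8, 8)))
print(mean_scores_by_amino_acid(X, "ACDEFGHA"))
```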

2.4. Comparison of Amino Acid Properties

In order to evaluate how well an explanation captures an amino acid property, we used the Mann–Whitney U test for each of the three categorical properties and Kendall’s τ test for the four numerical properties. The results for the categorical tests are given in Table 3 for ProtBERT, Table 4 for ProtT5, and Table 5 for Ankh. The results for the numerical tests are given in Table 6, Table 7 and Table 8. In all cases, p-values below 0.05 are considered significant and shown in boldface. Kendall’s correlations in Table 6, Table 7 and Table 8 are also provided as a heat map for better visualization.
The statistical tests for the categorical properties appeared to be easier to pass than those for the numerical properties: 53% of the former tests were passed overall, whereas only 18% of the latter tests were passed. Overall, one-third of the tests were passed. Besides the sparsity of significant p-values in Table 6, Table 7 and Table 8, one can see a number of disagreements between the XAI methods, with the same property providing a positive correlation for one method but a negative correlation for another; a good example is the dipole moment in Table 6, where Integrated Gradients exhibits a positive correlation, with significant p-values in three out of four cases, whereas saliency exhibits a negative correlation, with all four cases having significant p-values. The largest difference is found in Table 7, also for the dipole moment, in the top quarter between saliency and KernelShap. Such cases are present in both the embedding and prediction tables for all three embedding methods, albeit mostly with non-significant p-values. The reason for this may be related to the possible instability in how embeddings represent physicochemical properties or the variability among XAI methods.
A summary is presented in Table 9. In terms of the XAI methods, KernelShap performed the best, with 45 tests passed, followed, unexpectedly, by the simple saliency method with 41, and guided backpropagation with 37. The last column of Table 9 indicates a wide range of performance, with LIME performing the worst, with nine tests passed.
Regarding embeddings—bottom row of Table 9—things were well balanced, with the three methods performing similarly overall, with 78, 84, and 86 tests passed. However, there was a significant difference between the categorical tests, where Ankh performed the best, and the numerical tests, where ProtT5 performed the best. ProtBERT performed second best in both and was slightly behind overall.
Breaking down the results by XAI method and test type, as shown in Table 10, there appears to be a good balance most of the time between the target and the source, as well as between the embedding and the prediction, with the exception of GradientShap and, especially, LIME. Interestingly, LIME, which ranked last overall, performed the best in the target numerical tests and prediction numerical tests. In fact, eight out of nine tests passed by LIME were in the prediction and target numerical tests.
Breaking down the results by embedding and test type, as shown in Table 11, there appears to be a better balance among the performance of all embeddings across the four test types, with source and embedding generally lagging behind target and prediction. Overall, with the notable exception of LIME, the passed tests were fairly well distributed across the different categories.
It is important to note that, even though the number of tests passed by the three embeddings was similar, the embeddings actually performed very differently, and complementarily, for different tests. In Figure 4, we plot the situation for categorical, numerical, and all tests separately. In the categorical tests, 14 tests were passed only by Ankh, while in the numerical tests, 19 tests were passed only by ProtT5. A large proportion of the tests were passed by only a single embedding method: 22% for categorical and 70% for numerical, with 45% overall. This result is important as it indicates a large potential for improvement in the embedding generation.
Another observation concerns the fact that Seq-InSite uses a given embedding to produce its predictions. Therefore, it can predict certain properties only when the necessary information is available in the embedding itself. This means that, for a test, given the embedding and XAI method, it is unexpected for the prediction to pass the test while the embedding fails. Table 12 summarizes all four possible situations, combining whether the embedding was a pass/fail with the prediction being a pass/fail. There were fewer unexpected fail/pass situations for the categorical tests than for the numerical ones. ProtBERT had the most, and ProtT5 had the fewest.
The reason for the existence of these fail/pass cases is unclear. The property must be available in the embedding to be picked up by Seq-InSite; therefore, it is possible that the XAI methods sometimes failed to identify it at the embedding stage. It is interesting to note that the number of fail/pass cases appears to be inversely proportional to the performance of the embeddings for protein interaction-site prediction; ProtT5 performed the best, with Ankh ranking second and ProtBERT last. On the other hand, the performance ranking is not completely reliable, as Ankh passed slightly more tests overall than ProtT5, as shown in Table 9. More testing is needed to clarify what caused this behavior.

2.5. Distances

It is expected that a residue in close proximity to another exerts a more significant influence on its interactions than those located further away. For each explanation method and both embeddings and predictions, we conducted an analysis to determine the average impact score for amino acids separated by a fixed distance. The resulting impact scores are plotted as a function of the distance in Figure 5. While the expectation is that the influence diminishes with distance, the hope is that the influence maintains a high level even at longer distances. Note that, in the ProtBERT and Ankh embedding distance plots, deconvolution and guided backpropagation have the same values; therefore, the line for deconvolution is covered by the one for guided backpropagation.
The first observation regarding the plots in Figure 5 is that the influences were generally smaller for all XAI methods in the case of ProtBERT (first column). Overall, KernelShap, saliency, guided backpropagation, and deconvolution were among the best-performing methods, which is in agreement with our results for test passing. One exception to this was the poor performance of guided backpropagation and deconvolution for ProtBERT. KernelShap and saliency were still the top two methods for ProtBERT. LIME, the last-ranked method overall in test passing, exhibited mixed performance in terms of the distance plots, ranking last in two cases out of six. GradientShap, the second-last-ranked method overall in test passing, exhibited very poor performance in terms of the distance plots. In general, there was good agreement between the distance plots and the previous results for test passing.

2.6. Infidelity

Infidelity measures the quality of an explanation as the ability of an XAI method to capture changes in a prediction in response to perturbations (see Section 3). We present the infidelity results in Table 13. While there are some differences among the XAI methods for a fixed embedding method (ProtBERT, ProtT5, or Ankh) and mode (embedding vs. prediction), the largest differences appear when the embedding or mode changes. That is, the infidelity values appear to be dictated mostly by the embedding method and the mode. Using Ankh appears to yield the lowest infidelity values overall. To investigate the correlation between the infidelity values and our results for test passing, we present the latter in Table 14, organized in the same way as in Table 13. While there is some correlation, important differences stand out. Several saliency infidelity results are high (which is bad) despite the number of tests passed being very high. Conversely, LIME achieved mostly very good infidelity values, while its test-passing performance was very low. Overall, the correlation between the infidelity results and test passing was not very good. The possible reasons for this are numerous and remain to be investigated.

3. Materials and Methods

3.1. Interpretability of Protein Embeddings

For each residue, an embedding vector of size e is constructed using a method such as the ones investigated here: ProtBERT, ProtT5, and Ankh. ProtBERT and ProtT5 produce embedding vectors of size e = 1024, while Ankh's vectors are of size e = 768. The input for a protein language model is a protein sequence, and its length is denoted by n. The output of the model is an embedding matrix of size n × e, one vector for each residue. The first layer in the model is an embedding layer, a look-up table that outputs an n × e matrix. Since this layer is not trainable, it does not have a gradient; therefore, it is not interpretable. The interpretation connects this layer with the output, computing the influence (attribution) of each input element on each element of the output, thus producing an n × e × n × e array, denoted by E; the element E[i, k, j, ℓ] gives the effect of the input (j, ℓ) on the output (i, k). To obtain the effect of one residue on another residue, with respect to embedding computation, we convert this array to an n × n array by calculating the sum along the second and fourth dimensions:
X_E[i, j] = \sum_{k=1}^{e} \sum_{\ell=1}^{e} E[i, k, j, \ell].
The array X_E measures the impact of each residue on computing the embedding vector of any other residue. The element X_E[i, j] gives the effect of the j-th residue on computing the embedding vector of the i-th residue. Examples are shown in Figure 2 and in the top three rows of the plots in Figure 3.
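Assuming the full attribution array E is held in memory as a NumPy array (in practice it is computed one output element at a time), the reduction above amounts to a single sum over two axes, as in the sketch below.

```python
# Sketch of the reduction above on a toy array; the real e is 1024 or 768.
import numpy as np

n, e = 5, 4
E = np.random.default_rng(0).normal(size=(n, e, n, e))   # toy attribution array

X_E = E.sum(axis=(1, 3))                                  # sum over output and input embedding dimensions
# X_E[i, j] is the total effect of residue j on the embedding vector of residue i
assert X_E.shape == (n, n)
```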

3.2. Interpretability of Interaction-Site Prediction Models

Seq-InSite uses the embedding vectors of the residues within a window of an odd size, 2w + 1, centered on the residue being predicted. Therefore, it takes as input a (2w + 1) × e matrix and returns a number between 0 and 1. To interpret the interaction-site prediction for a sequence with n residues, we iterate over the sequence and obtain the interpretation of the model's output for each residue. Therefore, the interpretation output is an n × (2w + 1) × e array, denoted by P. The element P[i, j, k] gives the effect of the k-th element of the embedding vector of the j-th residue in the window on the prediction of the i-th residue. In order to obtain the effect of each residue on the prediction of any other residue, we need to combine this array P with the E array above. Recall that E gives the effect of input embedding vector elements on output embedding vector elements. To obtain the effect of residues on embedding vector elements, we sum along the fourth dimension:
T[i, j, k] = \sum_{\ell=1}^{e} E[i, j, k, \ell].
T has dimension n × e × n, and T[i, j, k] gives the effect of the k-th residue on computing the element (i, j), that is, the j-th element of the embedding vector for the i-th residue. In order to put together P and T, note that, for each residue i, we need only a slice of size 2w + 1 of T, namely T[i − w .. i + w, :, :]. The n × n array X_P, giving the residue-on-residue impact for interaction prediction, is computed as follows: for any 1 ≤ i, j ≤ n, we have
X_P[i, j] = \sum_{k=1}^{2w+1} \sum_{\ell=1}^{e} P[i, k, \ell] \times T[i + k - (w + 1), \ell, j].
The array X_P gives the residue-on-residue impact for prediction. Precisely, X_P[i, j] gives the effect of the j-th residue on the interaction propensity for the i-th residue. Examples are shown in the bottom three rows of the plots in Figure 3.
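The following sketch implements the combination above on toy arrays (0-based indexing); how window positions falling outside the sequence are handled is an assumption of the sketch, not a detail taken from Seq-InSite.

```python
# Sketch of combining P (n x (2w+1) x e) and T (n x e x n) into X_P (n x n); toy data.
import numpy as np

n, w, e = 10, 2, 4
rng = np.random.default_rng(0)
P = rng.normal(size=(n, 2 * w + 1, e))
T = rng.normal(size=(n, e, n))

X_P = np.zeros((n, n))
for i in range(n):
    for k in range(2 * w + 1):
        r = i + k - w                              # residue index covered by window position k
        if 0 <= r < n:                             # assumption: out-of-range window positions are skipped
            X_P[i, :] += P[i, k, :] @ T[r, :, :]   # sum over the embedding dimension
# X_P[i, j] is the effect of residue j on the interaction prediction for residue i
```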

3.3. Evaluation of Interpretations

3.3.1. Categorical Tests

For the categorical tests, we used the Mann–Whitney U test [42], which is a non-parametric statistical test for analyzing the distribution of two random variables, with the null hypothesis that the distributions are equal. In order to evaluate the performance of the models in capturing specific properties, this test was applied to both embeddings and predictions, focusing on the impacting (source) and impacted (target) scores of three distinct groups:
1.
Interacting and non-interacting amino acids.
2.
Aromatic and non-aromatic amino acids.
3.
Acidic and basic amino acids.
If the models effectively captured these properties, the Mann–Whitney U test should reveal a significant difference between the distributions of the two populations within each group. p-values below 0.05 are considered significant.
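For illustration, the test can be run with SciPy as sketched below; the two groups of scores are placeholders for, e.g., the mean explanation scores of interacting and non-interacting residues.

```python
# Illustrative Mann–Whitney U test on two groups of scores (toy data).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
interacting = rng.normal(0.6, 0.2, 50)            # placeholder scores for one group
non_interacting = rng.normal(0.4, 0.2, 200)       # placeholder scores for the other group

stat, p_value = mannwhitneyu(interacting, non_interacting, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("significant difference between the two distributions")
```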

3.3.2. Numerical Tests

For the numerical tests, we used Kendall’s τ test, another non-parametric statistical test based on the correlation between two random variables, with the null hypothesis that there is no correlation between the random variables. To assess the effectiveness of each explanation method, we computed the average source and target scores of each amino acid across all proteins for both embeddings and predictions. This process yielded four distinct vectors, each of size 20. Subsequently, for each of these four vectors, we employed Kendall’s τ test to investigate the correlation between the vector and various properties of amino acids: hydrophobicity, molecular mass, van der Waals volume, and dipole moment. The underlying assumption is that a correlation should be evident if the model accurately captures the specific property under consideration. p-values below 0.05 are considered significant.
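A minimal SciPy sketch of this test is shown below; both 20-element vectors are placeholders for the per-amino-acid mean scores and one property column of Table 2.

```python
# Illustrative Kendall's tau test between per-amino-acid mean scores and one numerical property.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
mean_scores = rng.normal(size=20)                 # one value per amino acid (source or target)
hydrophobicity = rng.normal(size=20)              # e.g. the hydrophobicity column of Table 2

tau, p_value = kendalltau(mean_scores, hydrophobicity)
print(f"tau = {tau:.3f}, p = {p_value:.3g}")      # significant correlation if p < 0.05
```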

3.3.3. Distance

From a biological perspective, the average interaction of each residue, ignoring its three-dimensional location, is influenced more by residues at small linear (sequence) separations than by those farther away. Therefore, for each explanation method, and for both embeddings and predictions, we computed the average impact score for amino acids separated by a distance of i, 0 ≤ i ≤ 20, and plotted these scores against the distance. The expectation is that these plots exhibit a decreasing trend.
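A sketch of this computation on a toy explanation matrix is given below; the helper function is an illustrative assumption rather than our exact code.

```python
# Sketch: average explanation score X[i, j] over all pairs with |i - j| = d, for d = 0..20.
import numpy as np

def mean_impact_by_distance(X, max_dist=20):
    n = X.shape[0]
    means = []
    for d in range(min(max_dist, n - 1) + 1):
        idx_i, idx_j = np.where(np.abs(np.subtract.outer(np.arange(n), np.arange(n))) == d)
        means.append(float(X[idx_i, idx_j].mean()))
    return means

X = np.abs(np.random.default_rng(0).normal(size=(44, 44)))   # toy explanation matrix
print(mean_impact_by_distance(X))                            # expected to decrease with distance
```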

3.3.4. Explanation Infidelity

One way to evaluate the quality of an explanation is to measure how much it captures the changes in the predictor function in response to significant perturbations. Explanation infidelity is defined as follows [43]. Let X ⊆ ℝ^d be the input space, Y ⊆ ℝ the output space, and f: ℝ^d → ℝ a black-box predictor, which at some test input x ∈ ℝ^d predicts the output f(x). Let Φ: F × ℝ^d → ℝ^d be a feature attribution explanation that, given a black-box predictor, f, and a test input, x, returns importance scores Φ(f, x) for the set of input features. Given a random variable I ∈ ℝ^d with a probability measure μ_I, which represents the meaningful perturbations of interest, the explanation infidelity of Φ is defined as
\mathrm{INFD}(\Phi, f, x) = \mathbb{E}_{I \sim \mu_I}\left[\left(I^{\top} \Phi(f, x) - \left(f(x) - f(x - I)\right)\right)^{2}\right].
In our tests, a normal distribution with a mean of 0 and a standard deviation of 0.01 was used as a perturbation and applied to the outputs of the embedding layer of the transformer model and to the embedding inputs of the prediction models.
The formula for INFD above was used on each matrix E[i, j, :, :], for 1 ≤ i ≤ n and 1 ≤ j ≤ e, to produce an infidelity score. This resulted in an n × e matrix, I^E, where I^E_{ij} is the infidelity score for the interpretability of element (i, j) in the embeddings. The infidelity score for the whole interpretation of embeddings is defined as the mean of I^E:
\mathrm{INFD}_E = \frac{\sum_{i=1}^{n} \sum_{j=1}^{e} I^{E}_{ij}}{n e}.
To calculate the infidelity scores for the Seq-InSite model interpretations, the above formula for INFD was used on each matrix T[i, :, :] to give a vector I^P of size n. Then, the infidelity of the prediction interpretations was calculated as follows:
\mathrm{INFD}_P = \frac{S}{(n - 2w)(2w + 1) + 2\sum_{i=w+1}^{2w} i} = \frac{S}{(n - 2w)(2w + 1) + w(3w + 1)},
where
S = \left(\sum_{i=w+1}^{n-w} \sum_{j=1}^{e} I^{E}_{ij} + \sum_{i=w+1}^{n-w} I^{P}_{i}\right) \times (2w + 1) + \sum_{i=1}^{w} \left(\left(\sum_{j=1}^{e} I^{E}_{ij} + \sum_{j=1}^{e} I^{E}_{(n-i+1)j} + I^{P}_{i} + I^{P}_{n-i+1}\right) \times (w + i)\right).
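For a single explanation vector of one output element, the expectation in the infidelity definition can be estimated by Monte Carlo sampling of the Gaussian perturbations, as in the hedged sketch below; the predictor, input, and attribution vector are placeholders.

```python
# Hedged sketch of a Monte Carlo estimate of the infidelity formula for one explanation vector phi.
import numpy as np

def infidelity(f, x, phi, n_samples=100, sigma=0.01, seed=0):
    """INFD ~ mean over perturbations I of (I . phi - (f(x) - f(x - I)))^2."""
    rng = np.random.default_rng(seed)
    fx = f(x)
    vals = []
    for _ in range(n_samples):
        I = rng.normal(0.0, sigma, size=x.shape)          # Gaussian perturbation, as in the text
        vals.append((np.dot(I, phi) - (fx - f(x - I))) ** 2)
    return float(np.mean(vals))

f = lambda v: float(np.tanh(v).sum())                     # toy black-box predictor
x = np.random.default_rng(1).normal(size=8)               # toy input
phi = np.ones(8) * 0.1                                    # toy attribution vector
print(infidelity(f, x, phi))
```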

3.4. Implementation

The tests were performed using Python 3.10.0, Hugging Face transformers 4.31.0 [44], PyTorch 2.0.1 [45], and the Captum library 0.6.0 [35] to compute and analyze the protein embeddings and their explanations. Because the attribution can be calculated for only one output element at a time, computing the explanation for one protein sequence took between one and five days, depending on the protein sequence, embedding model, and method. The explanations were computed once and stored. Due to the n × e × n × e size of the initial arrays, the total memory required to store all our computed matrices was 3.3 TB.

4. Conclusions

In this work, we performed extensive experiments to understand protein deep learning models through their explanations. Based on these experiments, we showed that protein language models capture seven important biological characteristics of amino acids. We also demonstrated that no single explanation method consistently performed best across all models and metrics, with high variability in performance across different models. While current explanation methods work for short proteins, improvements are needed to make them applicable to long proteins, which are more common in real-world scenarios.
We showed that the direct evaluation of explanations provided by the infidelity measure was not consistent with the indirect evaluation, as indicated by the ability to capture various physicochemical properties. The reason for this is unclear and remains to be investigated.
Our investigations showed a very complex picture. We raised more questions than we answered. More work is needed to clarify the findings. Longer protein sequences have to be tested, as explained above, for more reliable results. Also, more embedding methods have to be tested, as well as more prediction problems. All models discussed—regarding embedding generation, prediction, and explanation—are very complex, and enhancing our understanding of them is both important and difficult.

Author Contributions

Z.F. selected the dataset; wrote the software; performed all tests, including installing and running the necessary libraries; contributed to the methodology; and wrote an initial draft of the manuscript; C.P.E.d.S. advised on the statistical analysis; G.B.G. advised on the biological analysis; L.I. proposed and designed the study, analyzed the results, and wrote the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSERC Discovery Grants RGPIN-2021-03978 to L.I. and RGPIN-2020-05733 to G.B.G.

Data Availability Statement

The software used for this project is available at github.com/lucian-ilie/Protein-XAI (accessed on 25 May 2025).

Acknowledgments

All computations were performed on the Graham cluster of the Digital Research Alliance of Canada using T4 GPUs.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  2. Baek, M.; DiMaio, F.; Anishchenko, I.; Dauparas, J.; Ovchinnikov, S.; Lee, G.; Wang, J.; Cong, Q.; Kinch, L.; Schaeffer, R.; et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, eabj8754. [Google Scholar] [CrossRef] [PubMed]
  3. Alley, E.; Khimulya, G.; Biswas, S.; Alquraishi, M.; Church, G. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 2019, 16, 1315–1322. [Google Scholar] [CrossRef] [PubMed]
  4. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Yu, W.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar] [CrossRef]
  5. Strokach, A.; Becerra, D.; Corbi-Verge, C.; Perez-Riba, A.; Kim, P. Fast and Flexible Protein Design Using Deep Graph Neural Networks. Cell Syst. 2020, 11, 402–411. [Google Scholar] [CrossRef]
  6. Anishchenko, I.V.; Pellock, S.J.; Chidyausiku, T.M.; Ramelot, T.A.; Ovchinnikov, S.; Hao, J.; Bafna, K.; Norn, C.H.; Kang, A.; Bera, A.K.; et al. De novo protein design by deep network hallucination. Nature 2020, 600, 547–552. [Google Scholar] [CrossRef]
  7. Trinquier, J.; Uguzzoni, G.; Pagnani, A.; Zamponi, F.; Weigt, M. Efficient generative modeling of protein sequences using simple autoregressive models. Nat. Commun. 2021, 12, 5800. [Google Scholar] [CrossRef]
  8. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
  9. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  10. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Minneapolis, MN, USA, 2019. [Google Scholar]
  11. Raffel, C.; Shazeer, N.M.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2019, 21, 140:1–140:67. [Google Scholar]
  12. Asgari, E.; Mofrad, M.R.K. ProtVec: A Continuous Distributed Representation of Biological Sequences. arXiv 2015, arXiv:1503.05140. [Google Scholar]
  13. Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 2019, 20, 723. [Google Scholar] [CrossRef] [PubMed]
  14. Bepler, T.; Berger, B. Learning protein sequence embeddings using information from structure. arXiv 2019, arXiv:1902.08661. [Google Scholar]
  15. Rao, R.; Liu, J.; Verkuil, R.; Meier, J.; Canny, J.F.; Abbeel, P.; Sercu, T.; Rives, A. MSA Transformer. bioRxiv 2021.
  16. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef] [PubMed]
  17. Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv 2023, arXiv:2301.06568. [Google Scholar]
  18. Yuan, Q.; Chen, J.; Zhao, H.; Zhou, Y.; Yang, Y. Structure-aware protein-protein interaction site prediction using deep graph convolutional network. Bioinformatics 2021, 38, 125–132. [Google Scholar] [CrossRef]
  19. Zeng, M.; Zhang, F.; Wu, F.X.; Li, Y.; Wang, J.; Li, M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 2019, 36, 1114–1120. [Google Scholar] [CrossRef]
  20. Zhang, B.; Li, J.; Quan, L.; Chen, Y.; Lü, Q. Sequence-based prediction of protein-protein interaction sites by simplified long-short term memory network. Neurocomputing 2019, 357, 86–100. [Google Scholar] [CrossRef]
  21. Li, Y.; Golding, G.; Ilie, L. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics 2020, 37, 896–904. [Google Scholar] [CrossRef]
  22. Hosseini, S.; Ilie, L. PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention. Int. J. Mol. Sci. 2022, 23, 12814. [Google Scholar] [CrossRef]
  23. Manfredi, M.; Savojardo, C.; Martelli, P.L.; Casadio, R. ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences. J. Mol. Biol. 2023, 435, 167963. [Google Scholar] [CrossRef] [PubMed]
  24. Hosseini, S.; Golding, G.B.; Ilie, L. Seq-InSite: Sequence supersedes structure for protein interaction site prediction. Bioinformatics 2024, 40, btad738. [Google Scholar] [CrossRef] [PubMed]
  25. Molnar, C. Interpretable Machine Learning, 2nd ed.; Independently Published, 2022. [Google Scholar]
  26. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  27. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In Lecture Notes in Computer Science, Proceedings of the Computer Vision–ECCV 2014–13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D.J., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8689, pp. 818–833. [Google Scholar] [CrossRef]
  28. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M.A. Striving for Simplicity: The All Convolutional Net. arXiv 2014, arXiv:1412.6806. [Google Scholar]
  29. Shrikumar, A.; Greenside, P.; Shcherbina, A.; Kundaje, A. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. arXiv 2016, arXiv:1605.01713. [Google Scholar]
  30. Ancona, M.; Ceolini, E.; Öztireli, C.; Gross, M. Towards better understanding of gradient-based attribution methods for Deep Neural Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018; Conference Track Proceedings; OpenReview.net. 2018. [Google Scholar]
  31. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features Through Propagating Activation Differences. arXiv 2017, arXiv:1704.02685. [Google Scholar]
  32. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Machine Learning Research, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR: Cambridge, MA, USA, 2017; Volume 70, pp. 3319–3328. [Google Scholar]
  33. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the Demonstrations Session, NAACL HLT 2016, the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; The Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 97–101. [Google Scholar] [CrossRef]
  34. Lundberg, S.M.; Lee, S. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
  35. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv 2020, arXiv:2009.07896. [Google Scholar]
  36. Ali, A.; Schnake, T.; Eberle, O.; Montavon, G.; Müller, K.R.; Wolf, L. XAI for Transformers: Better Explanations through Conservative Propagation. In Machine Learning Research, Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: Cambridge, MA, USA, 2022; Volume 162, pp. 435–451. [Google Scholar]
  37. Chefer, H.; Gur, S.; Wolf, L. Transformer Interpretability Beyond Attention Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 782–791. [Google Scholar]
  38. Mohebbi, H.; Jumelet, J.; Hanna, M.; Alishahi, A.; Zuidema, W. Transformer-specific Interpretability. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts; Mesgar, M., Loáiciga, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 21–26. [Google Scholar]
  39. Fantozzi, P.; Naldi, M. The Explainability of Transformers: Current Status and Directions. Computers 2024, 13, 92. [Google Scholar] [CrossRef]
  40. Allison, L.A. Fundamental Molecular Biology, 2nd ed.; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
  41. Weast, R. CRC Handbook of Chemistry and Physics, 62nd ed.; CRC Press: Boca Raton, FL, USA, 1981. [Google Scholar]
  42. Mann, H.B.; Whitney, D.R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 1947, 18, 50–60. [Google Scholar] [CrossRef]
  43. Yeh, C.K.; Hsieh, C.Y.; Suggala, A.S.; Inouye, D.I.; Ravikumar, P. On the (In)fidelity and Sensitivity of Explanations. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  44. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2020; pp. 38–45. [Google Scholar]
  45. Ansel, J.; Yang, E.; He, H.; Gimelshein, N.; Jain, A.; Voznesensky, M.; Bao, B.; Bell, P.; Berard, D.; Burovski, E.; et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, La Jolla, CA, USA, 27 April–1 May 2024; Volume 2. [Google Scholar]
Figure 1. Output of explainability methods for a CNN classification model on the same image. Green pixels have a positive contribution, while red ones have a negative impact on predicting the label of the picture.
Figure 2. Explainability example: ProtT5 embedding computation for protein sequence 2L2T_A, explained using Integrated Gradients (left) and KernelShap (right). On the left and in the bottom margins, interacting residues are shown in green, and non-interacting ones in red.
Figure 3. Interpretations for protein 2L2T_A. The first three rows are for embeddings—ProtBERT, ProtT5, Ankh—and the next three are for Seq-InSite predictions using the three embeddings, in the same order. From left to right we have the nine XAI methods in the order mentioned above.
Figure 4. Embedding comparison for (a) categorical tests, (b) numerical tests, and (c) all tests.
Figure 5. Distance plots: The average attribution between residues in terms of the distance between them. The embedding tests are shown in the top row, and the prediction tests are shown in the bottom row. From left to right are the results for ProtBERT, ProtT5, and Ankh.
Table 1. Protein sequences used for testing.
Protein ID | Chain | Length | Protein ID | Chain | Length
2CCI | F | 30 | 1SGH | B | 39
1MZW | B | 31 | 6F4U | D | 40
1OQE | K | 31 | 2L9U | A | 40
5KQ1 | C | 31 | 5OM2 | B | 40
5JPO | E | 32 | 2XZE | R | 40
2L34 | A | 33 | 4LZX | B | 40
6B7G | B | 33 | 5TUV | C | 41
3MJH | B | 34 | 2XA6 | A | 41
4NAW | D | 34 | 2MOF | A | 42
3DXC | B | 35 | 2K9J | B | 43
2XJY | B | 35 | 2F9D | P | 43
2BE6 | D | 37 | 4GDO | A | 43
1IK9 | C | 37 | 6GNY | B | 43
5XJL | M | 37 | 6AU8 | C | 43
5FV8 | E | 38 | 2KS1 | A | 44
4UED | B | 38 | 3HRO | A | 44
5FV8 | A | 38 | 2L2T | A | 44
Table 2. Amino acid properties.
Amino Acid | Hydrophobicity | Molecular Mass | Van der Waals Volume | Dipole Moment | Aromaticity | Acidity/Basicity
Glycine (G) | −0.4 | 57 | 48 | 0.000 | |
Alanine (A) | 1.8 | 71 | 67 | 5.937 | |
Serine (S) | −0.8 | 87 | 73 | 9.836 | |
Proline (P) | −1.6 | 97 | 90 | 7.916 | |
Valine (V) | 4.2 | 99 | 105 | 2.692 | |
Threonine (T) | −0.7 | 101 | 93 | 9.304 | |
Cysteine (C) | 2.5 | 103 | 86 | 10.740 | |
Isoleucine (I) | 4.5 | 113 | 124 | 3.371 | |
Leucine (L) | 3.8 | 113 | 124 | 3.782 | |
Asparagine (N) | −3.5 | 114 | 96 | 18.890 | |
Aspartic acid (D) | −3.5 | 115 | 91 | 29.490 | | A
Glutamine (Q) | −3.5 | 128 | 114 | 39.890 | |
Lysine (K) | −3.9 | 128 | 135 | 50.020 | | B
Glutamic acid (E) | −3.5 | 129 | 109 | 42.520 | | A
Methionine (M) | 1.9 | 131 | 124 | 8.589 | |
Histidine (H) | −3.2 | 137 | 118 | 20.440 | | B
Phenylalanine (F) | 2.8 | 147 | 135 | 5.980 | A |
Arginine (R) | −4.5 | 156 | 148 | 37.500 | | B
Tyrosine (Y) | −1.3 | 163 | 141 | 10.410 | A |
Tryptophan (W) | −0.9 | 186 | 163 | 10.730 | A |
Table 3. ProtBERT—categorical tests. The p-values correspond to the Mann–Whitney U test; p-values below 0.05 are considered significant and shown in boldface.
Method | Interactivity | Aromaticity | Acidity/Basicity | Interactivity | Aromaticity | Acidity/Basicity
Embeddings—Target (left three value columns) / Embeddings—Source (right three value columns)
Saliency | 6.09 × 10^−28 | 1.87 × 10^−82 | 8.98 × 10^−31 | 1.88 × 10^−21 | 1.21 × 10^−124 | 7.03 × 10^−22
Deconvolution | 5.05 × 10^−1 | 2.83 × 10^−1 | 2.79 × 10^−1 | 4.71 × 10^−2 | 1.53 × 10^−4 | 2.55 × 10^−1
Guided Backprop. | 5.05 × 10^−1 | 2.83 × 10^−1 | 2.79 × 10^−1 | 4.71 × 10^−2 | 1.53 × 10^−4 | 2.55 × 10^−1
Input X Grad. | 6.16 × 10^−2 | 6.45 × 10^−1 | 6.98 × 10^−1 | 2.41 × 10^−2 | 1.86 × 10^−2 | 1.60 × 10^−1
DeepLIFT | 4.56 × 10^−1 | 3.48 × 10^−1 | 1.47 × 10^−2 | 3.28 × 10^−1 | 1.62 × 10^−3 | 2.15 × 10^−2
Integrated Grad. | 5.09 × 10^−1 | 9.09 × 10^−2 | 1.20 × 10^−8 | 2.28 × 10^−19 | 1.09 × 10^−5 | 2.92 × 10^−19
LIME | 3.82 × 10^−1 | 5.47 × 10^−1 | 3.93 × 10^−1 | 6.28 × 10^−1 | 3.55 × 10^−1 | 2.24 × 10^−1
KernelShap | 0.00 × 10^0 | 3.74 × 10^−126 | 2.41 × 10^−20 | 3.95 × 10^−33 | 7.09 × 10^−17 | 1.01 × 10^−12
GradientShap | 2.79 × 10^−1 | 9.32 × 10^−1 | 8.37 × 10^−6 | 6.14 × 10^−1 | 2.35 × 10^−2 | 7.30 × 10^−1
Predictions—Target (left three value columns) / Predictions—Source (right three value columns)
Saliency | 6.20 × 10^−52 | 1.04 × 10^−170 | 4.58 × 10^−38 | 7.43 × 10^−2 | 3.49 × 10^−219 | 1.90 × 10^−32
Deconvolution | 1.65 × 10^−2 | 1.29 × 10^−6 | 2.63 × 10^−1 | 7.41 × 10^−1 | 6.66 × 10^−9 | 8.29 × 10^−9
Guided Backprop. | 6.56 × 10^−3 | 9.32 × 10^−1 | 4.12 × 10^−1 | 2.53 × 10^−1 | 3.52 × 10^−3 | 3.53 × 10^−1
Input X Grad. | 9.01 × 10^−1 | 8.19 × 10^−1 | 7.53 × 10^−1 | 2.88 × 10^−1 | 1.46 × 10^−2 | 4.90 × 10^−1
DeepLIFT | 7.11 × 10^−13 | 4.81 × 10^−1 | 3.41 × 10^−1 | 6.81 × 10^−6 | 4.63 × 10^−4 | 7.06 × 10^−2
Integrated Grad. | 6.86 × 10^−108 | 4.81 × 10^−1 | 4.23 × 10^−20 | 1.51 × 10^−2 | 9.82 × 10^−7 | 2.76 × 10^−7
LIME | 2.06 × 10^−1 | 7.92 × 10^−1 | 1.67 × 10^−1 | 4.45 × 10^−1 | 3.82 × 10^−1 | 1.91 × 10^−1
KernelShap | 3.96 × 10^−236 | 4.68 × 10^−157 | 1.12 × 10^−20 | 2.49 × 10^−29 | 2.97 × 10^−12 | 1.34 × 10^−7
GradientShap | 2.92 × 10^−1 | 2.87 × 10^−2 | 4.41 × 10^−5 | 9.66 × 10^−1 | 2.82 × 10^−2 | 8.91 × 10^−1
Table 4. ProtT5—categorical tests. The p-values correspond to the Mann–Whitney U test; p-values below 0.05 are considered significant and shown in boldface.
Method | Interactivity | Aromaticity | Acidity/Basicity | Interactivity | Aromaticity | Acidity/Basicity
Embeddings—Target (left three value columns) / Embeddings—Source (right three value columns)
Saliency | 2.58 × 10^−9 | 5.67 × 10^−8 | 8.32 × 10^−2 | 2.49 × 10^−152 | 1.19 × 10^−2 | 8.39 × 10^−61
Deconvolution | 4.53 × 10^−15 | 7.06 × 10^−1 | 2.79 × 10^−2 | 3.12 × 10^−2 | 4.96 × 10^−45 | 0.00 × 10^0
Guided Backprop. | 5.25 × 10^−6 | 9.70 × 10^−1 | 7.20 × 10^−60 | 3.57 × 10^−14 | 1.35 × 10^−35 | 9.16 × 10^−139
Input X Grad. | 6.85 × 10^−2 | 7.12 × 10^−1 | 4.75 × 10^−1 | 3.25 × 10^−1 | 9.18 × 10^−1 | 3.60 × 10^−1
DeepLIFT | 5.95 × 10^−1 | 3.73 × 10^−1 | 5.07 × 10^−1 | 3.82 × 10^−22 | 8.98 × 10^−30 | 1.41 × 10^−21
Integrated Grad. | 2.19 × 10^−1 | 8.96 × 10^−2 | 8.56 × 10^−1 | 9.56 × 10^−32 | 4.23 × 10^−4 | 3.23 × 10^−1
LIME | 7.64 × 10^−1 | 1.58 × 10^−1 | 7.80 × 10^−1 | 9.03 × 10^−1 | 4.18 × 10^−1 | 5.31 × 10^−1
KernelShap | 1.17 × 10^−2 | 3.79 × 10^−67 | 5.89 × 10^−63 | 2.84 × 10^−13 | 3.61 × 10^−5 | 6.75 × 10^−3
GradientShap | 9.30 × 10^−2 | 7.30 × 10^−2 | 8.92 × 10^−1 | 7.21 × 10^−2 | 1.23 × 10^−1 | 4.41 × 10^−1
Predictions—Target (left three value columns) / Predictions—Source (right three value columns)
Saliency | 5.61 × 10^−2 | 1.29 × 10^−33 | 2.52 × 10^−1 | 4.48 × 10^−132 | 6.03 × 10^−1 | 3.49 × 10^−127
Deconvolution | 8.16 × 10^−69 | 5.95 × 10^−12 | 6.02 × 10^−1 | 1.71 × 10^−14 | 9.20 × 10^−62 | 0.00 × 10^0
Guided Backprop. | 5.26 × 10^−94 | 7.31 × 10^−2 | 8.84 × 10^−102 | 9.26 × 10^−12 | 1.17 × 10^−61 | 9.02 × 10^−276
Input X Grad. | 1.44 × 10^−1 | 4.54 × 10^−1 | 7.71 × 10^−1 | 9.88 × 10^−1 | 6.27 × 10^−1 | 7.30 × 10^−1
DeepLIFT | 1.12 × 10^−1 | 1.94 × 10^−1 | 2.13 × 10^−1 | 2.27 × 10^−19 | 1.67 × 10^−22 | 2.37 × 10^−18
Integrated Grad. | 1.09 × 10^−2 | 8.95 × 10^−1 | 9.01 × 10^−1 | 4.77 × 10^−28 | 8.94 × 10^−1 | 9.82 × 10^−1
LIME | 8.42 × 10^−1 | 1.42 × 10^−1 | 7.55 × 10^−1 | 6.80 × 10^−1 | 6.38 × 10^−1 | 7.21 × 10^−1
KernelShap | 1.62 × 10^−93 | 3.90 × 10^−7 | 6.22 × 10^−21 | 3.40 × 10^−6 | 2.35 × 10^−3 | 1.77 × 10^−3
GradientShap | 5.14 × 10^−1 | 7.91 × 10^−1 | 2.08 × 10^−1 | 7.09 × 10^−2 | 3.88 × 10^−1 | 9.76 × 10^−1
Table 5. Ankh—categorical tests. The p-values correspond to the Mann–Whitney U test; p-values below 0.05 are considered significant and shown in boldface.
MethodInteractivityAromaticityAcidity/BasicityInteractivityAromaticityAcidity/Basicity
Embeddings—TargetEmbeddings—Source
Saliency5.59 × 10−1642.64 × 10−57.14 × 10−29.23 × 10−1014.12 × 10−11.81 × 10−1
Deconvolution5.00 × 10−38.79 × 10−16.77 × 10−165.95 × 10−227.93 × 10−15.57 × 10−7
Guided Backprop.5.00 × 10−38.79 × 10−16.77 × 10−165.95 × 10−227.93 × 10−15.57 × 10−7
Input X Grad.6.45 × 10−13.69 × 10−36.14 × 10−33.01 × 10−15.36 × 10−52.88 × 10−1
DeepLIFT5.75 × 10−82.57 × 10−13.58 × 10−11.45 × 10−12.65 × 10−32.80 × 10−2
Integrated Grad.2.50 × 10−13.29 × 10−27.26 × 10−81.94 × 10−272.51 × 10−872.12 × 10−288
LIME2.18 × 10−12.40 × 10−13.04 × 10−14.60 × 10−29.19 × 10−26.05 × 10−1
KernelShap1.78 × 10−213.01 × 10−30.00 × 1001.61 × 10−41.54 × 10−51.04 × 10−11
GradientShap1.51 × 10−19.62 × 10−13.46 × 10−15.43 × 10−35.03 × 10−32.45 × 10−5
Predictions—TargetPredictions—Source
Saliency0.00 × 1001.70 × 10−298.82 × 10−72.22 × 10−1541.48 × 10−17.43 × 10−8
Deconvolution3.97 × 10−144.89 × 10−56.77 × 10−81.59 × 10−151.38 × 10−125.22 × 10−1
Guided Backprop.5.68 × 10−69.42 × 10−21.99 × 10−61.38 × 10−514.74 × 10−22.40 × 10−4
Input X Grad.6.42 × 10−16.72 × 10−21.75 × 10−44.05 × 10−81.51 × 10−73.39 × 10−1
DeepLIFT1.39 × 10−91.93 × 10−28.31 × 10−77.14 × 10−28.33 × 10−79.57 × 10−1
Integrated Grad.1.19 × 10−19.03 × 10−19.37 × 10−33.78 × 10−154.92 × 10−621.30 × 10−1
LIME2.02 × 10−12.04 × 10−11.02 × 10−17.91 × 10−29.86 × 10−25.28 × 10−1
KernelShap7.78 × 10−117.89 × 10−11.31 × 10−2131.47 × 10−23.41 × 10−37.55 × 10−9
GradientShap1.17 × 10−18.57 × 10−25.02 × 10−13.98 × 10−42.62 × 10−94.35 × 10−2
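
For readers who want to reproduce the categorical comparisons in outline, the sketch below applies a two-sided Mann–Whitney U test to attribution scores split by a binary amino-acid property (here, aromaticity). The array names, the random placeholder data, and the SciPy-based implementation are illustrative assumptions, not the pipeline used in this study.

    # Minimal sketch: Mann-Whitney U test on XAI attribution scores, split by a
    # binary amino-acid property (e.g., aromatic vs. non-aromatic residues).
    # `attributions` and `is_aromatic` are hypothetical arrays, one entry per residue.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    attributions = rng.normal(size=1000)          # per-residue attribution scores (placeholder)
    is_aromatic = rng.integers(0, 2, size=1000)   # 1 if the residue is aromatic (placeholder labels)

    group_in = attributions[is_aromatic == 1]
    group_out = attributions[is_aromatic == 0]

    stat, p_value = mannwhitneyu(group_in, group_out, alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p_value:.3e}")   # p < 0.05 would count as a passed test
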
Table 6. ProtBERT—numerical tests. Kendall correlations are presented as heat maps, with red and blue indicating positive and negative correlations, respectively; p-values are from the corresponding Kendall test for assessing whether the correlations are significantly different from zero; p-values below 0.05 are considered significant and shown in boldface.
Method | Hydrophobicity (corr., p) | Molecular Mass (corr., p) | Van der Waals (corr., p) | Dipole Moment (corr., p)
Embeddings—Target
Saliency | 0.300, 0.068 | −0.254, 0.119 | −0.021, 0.896 | −0.442, 0.006
Deconvolution | −0.064, 0.696 | 0.392, 0.016 | 0.287, 0.079 | 0.095, 0.586
Guided Backprop. | −0.064, 0.696 | 0.392, 0.016 | 0.287, 0.079 | 0.095, 0.586
Input X Grad. | 0.225, 0.171 | 0.074, 0.649 | 0.138, 0.398 | −0.147, 0.386
DeepLIFT | 0.182, 0.268 | −0.201, 0.217 | −0.170, 0.298 | −0.284, 0.086
Integrated Grad. | −0.428, 0.009 | 0.392, 0.016 | 0.266, 0.104 | 0.526, 0.001
LIME | 0.257, 0.118 | −0.180, 0.269 | −0.287, 0.079 | −0.200, 0.233
KernelShap | −0.203, 0.216 | 0.116, 0.475 | 0.106, 0.515 | 0.189, 0.260
GradientShap | −0.171, 0.297 | 0.243, 0.135 | 0.192, 0.242 | 0.221, 0.186
Embeddings—Source
Saliency | 0.278, 0.090 | −0.254, 0.119 | −0.043, 0.795 | −0.379, 0.020
Deconvolution | 0.118, 0.473 | −0.011, 0.948 | −0.106, 0.515 | 0.137, 0.422
Guided Backprop. | 0.118, 0.473 | −0.011, 0.948 | −0.106, 0.515 | −0.137, 0.422
Input X Grad. | 0.000, 1.000 | −0.032, 0.845 | −0.106, 0.515 | −0.063, 0.725
DeepLIFT | 0.171, 0.297 | −0.201, 0.217 | −0.181, 0.269 | −0.179, 0.288
Integrated Grad. | −0.182, 0.268 | −0.085, 0.603 | −0.266, 0.104 | 0.232, 0.165
LIME | −0.086, 0.602 | 0.042, 0.795 | 0.149, 0.362 | 0.116, 0.501
KernelShap | 0.289, 0.078 | −0.169, 0.299 | 0.064, 0.696 | −0.379, 0.020
GradientShap | −0.011, 0.948 | −0.169, 0.299 | −0.106, 0.515 | 0.126, 0.461
Predictions—Target
Saliency | 0.310, 0.059 | −0.212, 0.194 | 0.000, 1.000 | −0.432, 0.007
Deconvolution | −0.053, 0.744 | 0.063, 0.697 | −0.149, 0.362 | 0.147, 0.386
Guided Backprop. | −0.300, 0.068 | 0.042, 0.795 | −0.138, 0.398 | 0.442, 0.006
Input X Grad. | 0.439, 0.008 | −0.085, 0.603 | −0.053, 0.745 | −0.432, 0.007
DeepLIFT | 0.203, 0.216 | −0.243, 0.135 | −0.160, 0.329 | −0.263, 0.113
Integrated Grad. | −0.150, 0.361 | 0.190, 0.242 | −0.043, 0.795 | 0.337, 0.040
LIME | 0.503, 0.002 | −0.201, 0.217 | −0.170, 0.298 | −0.453, 0.005
KernelShap | 0.139, 0.397 | −0.085, 0.603 | 0.000, 1.000 | −0.295, 0.074
GradientShap | 0.086, 0.602 | −0.243, 0.135 | −0.074, 0.649 | −0.084, 0.631
Predictions—Source
Saliency | 0.267, 0.103 | −0.265, 0.104 | −0.053, 0.745 | −0.389, 0.016
Deconvolution | 0.011, 0.948 | 0.127, 0.436 | 0.032, 0.845 | 0.021, 0.924
Guided Backprop. | −0.118, 0.473 | 0.011, 0.948 | 0.011, 0.948 | 0.074, 0.677
Input X Grad. | −0.096, 0.557 | −0.254, 0.119 | −0.383, 0.019 | −0.032, 0.873
DeepLIFT | 0.278, 0.090 | −0.254, 0.119 | −0.287, 0.079 | −0.284, 0.086
Integrated Grad. | −0.524, 0.001 | 0.360, 0.027 | 0.213, 0.193 | 0.495, 0.002
LIME | −0.278, 0.090 | −0.127, 0.436 | −0.170, 0.298 | 0.137, 0.422
KernelShap | 0.289, 0.078 | −0.180, 0.269 | 0.053, 0.745 | −0.389, 0.016
GradientShap | 0.321, 0.051 | −0.190, 0.242 | 0.011, 0.948 | −0.505, 0.001
Table 7. ProtT5—numerical tests. Kendall correlations are presented as heat maps, with red and blue indicating positive and negative correlations, respectively; p-values are from the corresponding Kendall test for assessing whether the correlations are significantly different from zero; p-values below 0.05 are considered significant and shown in boldface.
Method | Hydrophobicity (corr., p) | Molecular Mass (corr., p) | Van der Waals (corr., p) | Dipole Moment (corr., p)
Embeddings—Target
Saliency | 0.246, 0.134 | −0.180, 0.269 | 0.032, 0.845 | −0.368, 0.024
Deconvolution | −0.278, 0.090 | 0.169, 0.299 | −0.064, 0.696 | 0.400, 0.014
Guided Backprop. | 0.257, 0.118 | −0.233, 0.153 | −0.032, 0.845 | −0.379, 0.020
Input X Grad. | −0.011, 0.948 | −0.275, 0.091 | −0.383, 0.019 | −0.116, 0.501
DeepLIFT | −0.214, 0.192 | 0.201, 0.217 | 0.053, 0.745 | 0.274, 0.098
Integrated Grad. | 0.214, 0.192 | −0.021, 0.897 | −0.170, 0.298 | −0.211, 0.209
LIME | −0.064, 0.696 | 0.063, 0.697 | 0.074, 0.649 | 0.147, 0.386
KernelShap | −0.492, 0.003 | 0.370, 0.023 | 0.223, 0.172 | 0.695, 0.000
GradientShap | −0.011, 0.948 | −0.106, 0.516 | −0.277, 0.091 | −0.053, 0.773
Embeddings—Source
Saliency | 0.364, 0.027 | −0.296, 0.069 | −0.064, 0.696 | −0.463, 0.004
Deconvolution | −0.171, 0.297 | 0.339, 0.038 | 0.106, 0.515 | 0.295, 0.074
Guided Backprop. | 0.342, 0.037 | −0.106, 0.516 | 0.021, 0.896 | −0.389, 0.016
Input X Grad. | −0.107, 0.514 | −0.063, 0.697 | −0.106, 0.515 | 0.084, 0.631
DeepLIFT | 0.000, 1.000 | 0.381, 0.019 | 0.277, 0.091 | 0.074, 0.677
Integrated Grad. | 0.182, 0.268 | 0.021, 0.897 | 0.064, 0.696 | −0.063, 0.725
LIME | 0.193, 0.241 | 0.063, 0.697 | 0.074, 0.649 | −0.095, 0.586
KernelShap | −0.267, 0.103 | 0.180, 0.269 | −0.053, 0.745 | 0.389, 0.016
GradientShap | −0.171, 0.297 | 0.180, 0.269 | 0.032, 0.845 | 0.326, 0.047
Predictions—Target
Saliency | 0.214, 0.192 | −0.222, 0.173 | −0.011, 0.948 | −0.368, 0.024
Deconvolution | −0.300, 0.068 | 0.212, 0.194 | 0.011, 0.948 | 0.442, 0.006
Guided Backprop. | 0.396, 0.016 | −0.159, 0.330 | 0.011, 0.948 | −0.463, 0.004
Input X Grad. | 0.214, 0.192 | −0.169, 0.299 | −0.106, 0.515 | −0.326, 0.047
DeepLIFT | 0.342, 0.037 | −0.095, 0.559 | 0.053, 0.745 | −0.189, 0.260
Integrated Grad. | −0.182, 0.268 | 0.349, 0.032 | 0.160, 0.329 | 0.263, 0.113
LIME | −0.385, 0.019 | 0.254, 0.119 | 0.149, 0.362 | 0.337, 0.040
KernelShap | 0.257, 0.118 | −0.392, 0.016 | −0.287, 0.079 | −0.421, 0.009
GradientShap | −0.064, 0.696 | 0.254, 0.119 | 0.032, 0.845 | 0.095, 0.586
Predictions—Source
Saliency | 0.364, 0.027 | −0.296, 0.069 | −0.064, 0.696 | −0.463, 0.004
Deconvolution | −0.171, 0.297 | 0.339, 0.038 | 0.106, 0.515 | 0.295, 0.074
Guided Backprop. | 0.342, 0.037 | −0.106, 0.516 | 0.021, 0.896 | −0.389, 0.016
Input X Grad. | 0.449, 0.006 | −0.265, 0.104 | −0.170, 0.298 | −0.579, 0.000
DeepLIFT | 0.021, 0.896 | −0.021, 0.897 | 0.053, 0.745 | 0.000, 1.000
Integrated Grad. | −0.118, 0.473 | −0.127, 0.436 | −0.255, 0.118 | 0.084, 0.631
LIME | 0.000, 1.000 | 0.042, 0.795 | 0.021, 0.896 | −0.179, 0.288
KernelShap | 0.278, 0.090 | −0.222, 0.173 | 0.032, 0.845 | −0.368, 0.024
GradientShap | −0.043, 0.794 | 0.063, 0.697 | 0.021, 0.896 | 0.116, 0.501
Table 8. Ankh—numerical tests. Kendall correlations are presented as heat maps, with red and blue indicating positive and negative correlations, respectively; p-values are from the corresponding Kendall test for assessing whether the correlations are significantly different from zero; p-values below 0.05 are considered significant and shown in boldface. A minimal computational sketch of this test follows the table.
Method | Hydrophobicity (corr., p) | Molecular Mass (corr., p) | Van der Waals (corr., p) | Dipole Moment (corr., p)
Embeddings—Target
Saliency | 0.278, 0.090 | −0.212, 0.194 | 0.000, 1.000 | −0.358, 0.028
Deconvolution | −0.353, 0.031 | 0.042, 0.795 | −0.085, 0.603 | 0.347, 0.034
Guided Backprop. | −0.353, 0.031 | 0.042, 0.795 | −0.085, 0.603 | 0.347, 0.034
Input X Grad. | 0.160, 0.328 | −0.116, 0.475 | −0.011, 0.948 | −0.095, 0.586
DeepLIFT | 0.300, 0.068 | −0.254, 0.119 | −0.032, 0.845 | −0.295, 0.074
Integrated Grad. | −0.289, 0.078 | 0.021, 0.897 | −0.149, 0.362 | 0.326, 0.047
LIME | −0.214, 0.192 | 0.127, 0.436 | 0.043, 0.795 | 0.063, 0.725
KernelShap | 0.011, 0.948 | −0.127, 0.436 | −0.021, 0.896 | −0.168, 0.319
GradientShap | −0.203, 0.216 | 0.063, 0.697 | −0.021, 0.896 | 0.137, 0.422
Embeddings—Source
Saliency | 0.225, 0.171 | −0.307, 0.060 | −0.096, 0.558 | −0.326, 0.047
Deconvolution | −0.182, 0.268 | 0.074, 0.649 | 0.011, 0.948 | 0.295, 0.074
Guided Backprop. | −0.182, 0.268 | 0.074, 0.649 | 0.011, 0.948 | 0.295, 0.074
Input X Grad. | 0.021, 0.896 | 0.233, 0.153 | 0.362, 0.027 | 0.000, 1.000
DeepLIFT | 0.075, 0.648 | 0.021, 0.897 | 0.181, 0.269 | −0.179, 0.288
Integrated Grad. | −0.246, 0.134 | −0.159, 0.330 | −0.330, 0.044 | 0.179, 0.288
LIME | 0.128, 0.434 | −0.169, 0.299 | −0.170, 0.298 | −0.116, 0.501
KernelShap | 0.246, 0.134 | −0.212, 0.194 | 0.043, 0.795 | −0.358, 0.028
GradientShap | −0.235, 0.152 | −0.180, 0.269 | −0.298, 0.069 | 0.200, 0.233
Predictions—Target
Saliency | 0.257, 0.118 | −0.212, 0.194 | −0.011, 0.948 | −0.400, 0.014
Deconvolution | −0.257, 0.118 | 0.201, 0.217 | 0.053, 0.745 | 0.358, 0.028
Guided Backprop. | −0.246, 0.134 | 0.159, 0.330 | −0.085, 0.603 | 0.389, 0.016
Input X Grad. | 0.203, 0.216 | 0.201, 0.217 | 0.192, 0.242 | −0.053, 0.773
DeepLIFT | 0.075, 0.648 | −0.074, 0.649 | −0.149, 0.362 | 0.074, 0.677
Integrated Grad. | 0.289, 0.078 | −0.063, 0.697 | −0.064, 0.696 | −0.253, 0.128
LIME | 0.385, 0.019 | −0.455, 0.005 | −0.383, 0.019 | −0.442, 0.006
KernelShap | 0.257, 0.118 | −0.085, 0.603 | −0.181, 0.269 | −0.253, 0.128
GradientShap | 0.118, 0.473 | 0.169, 0.299 | 0.149, 0.362 | 0.042, 0.823
Predictions—Source
Saliency | 0.257, 0.118 | −0.317, 0.051 | −0.106, 0.515 | −0.358, 0.028
Deconvolution | −0.193, 0.241 | 0.053, 0.745 | −0.011, 0.948 | 0.274, 0.098
Guided Backprop. | −0.439, 0.008 | 0.159, 0.330 | 0.032, 0.845 | 0.505, 0.001
Input X Grad. | −0.160, 0.328 | −0.063, 0.697 | −0.213, 0.193 | 0.242, 0.146
DeepLIFT | −0.075, 0.648 | −0.053, 0.745 | −0.223, 0.172 | 0.116, 0.501
Integrated Grad. | 0.021, 0.896 | 0.085, 0.603 | 0.234, 0.152 | 0.000, 1.000
LIME | 0.043, 0.794 | −0.127, 0.436 | −0.043, 0.795 | 0.032, 0.873
KernelShap | −0.075, 0.648 | 0.085, 0.603 | −0.128, 0.435 | 0.137, 0.422
GradientShap | 0.417, 0.011 | 0.042, 0.795 | 0.074, 0.649 | −0.189, 0.260
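
The numerical tests in Tables 6–8 report Kendall correlations between attribution scores and continuous amino-acid properties. The sketch below illustrates one such test; aggregating attributions by amino-acid type before correlating them with a property value is shown here only as an assumption, with placeholder data, and is not taken from the paper.

    # Minimal sketch: Kendall correlation between mean attribution per amino-acid type
    # and a numerical property such as hydrophobicity. Property values and the
    # aggregation by residue type are placeholders for illustration.
    import numpy as np
    from scipy.stats import kendalltau

    amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
    rng = np.random.default_rng(0)
    # Hypothetical per-type mean attribution scores and property values (20 each).
    mean_attribution = {aa: rng.normal() for aa in amino_acids}
    hydrophobicity = {aa: rng.normal() for aa in amino_acids}

    x = [mean_attribution[aa] for aa in amino_acids]
    y = [hydrophobicity[aa] for aa in amino_acids]

    tau, p_value = kendalltau(x, y)
    print(f"tau = {tau:.3f}, p = {p_value:.3f}")  # correlations with p < 0.05 are boldfaced in the tables
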
Table 9. Overall number of tests passed. Colour map added for readability; darker colours correspond to higher numbers. A small tallying sketch follows the table.
XAI Method | ProtBERT (Cat. / Num. / Tot.) | ProtT5 (Cat. / Num. / Tot.) | Ankh (Cat. / Num. / Tot.) | Total by XAI (Cat. / Num. / Tot.)
Saliency | 11 / 4 / 15 | 8 / 6 / 14 | 8 / 4 / 12 | 27 / 14 / 41
Deconvolution | 6 / 1 / 7 | 10 / 4 / 14 | 9 / 3 / 12 | 25 / 8 / 33
Guided Backprop. | 4 / 2 / 6 | 10 / 7 / 17 | 9 / 5 / 14 | 23 / 14 / 37
Input X Grad. | 3 / 3 / 6 | 0 / 4 / 4 | 6 / 1 / 7 | 9 / 8 / 17
DeepLIFT | 6 / 0 / 6 | 6 / 2 / 8 | 7 / 0 / 7 | 19 / 2 / 21
Integrated Grad. | 9 / 7 / 16 | 4 / 1 / 5 | 8 / 2 / 10 | 21 / 10 / 31
LIME | 0 / 2 / 2 | 0 / 2 / 2 | 1 / 4 / 5 | 1 / 8 / 9
KernelShap | 12 / 2 / 14 | 12 / 7 / 19 | 11 / 1 / 12 | 35 / 10 / 45
GradientShap | 5 / 1 / 6 | 0 / 1 / 1 | 6 / 1 / 7 | 11 / 3 / 14
Total by Embed. | 56 / 22 / 78 | 50 / 34 / 84 | 65 / 21 / 86 | 171 / 77 / 248
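
The counts in Table 9 (and in Tables 10–12 below) follow from thresholding the p-values of the individual tests at 0.05. The toy tally below illustrates the bookkeeping only; the method subset and the p-value matrix are invented for the example.

    # Minimal sketch: tally "tests passed" (p < 0.05) per XAI method from a matrix of
    # p-values with one row per method and one column per individual test.
    import numpy as np

    methods = ["Saliency", "Deconvolution", "KernelShap"]   # illustrative subset
    p_values = np.array([
        [1e-52, 0.74, 1e-38],    # made-up p-values, one row per method
        [0.016, 0.74, 0.26],
        [1e-236, 1e-157, 0.03],
    ])

    passed = (p_values < 0.05).sum(axis=1)
    for name, n in zip(methods, passed):
        print(f"{name}: {n} tests passed")
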
Table 10. Tests passed by XAI methods and test type.
XAI Method | Target | Source | Embedding | Prediction
Saliency | 20 | 21 | 21 | 20
Deconvolution | 17 | 16 | 16 | 17
Guided Backpropagation | 17 | 20 | 17 | 20
Input X Gradient | 7 | 10 | 7 | 10
DeepLIFT | 7 | 14 | 10 | 11
Integrated Gradient | 13 | 18 | 16 | 15
LIME | 8 | 1 | 1 | 8
KernelShap | 22 | 23 | 24 | 21
GradientShap | 3 | 11 | 6 | 8
Table 11. Tests passed by embedding and test type.
Test Type | ProtBERT (Cat. / Num. / Total) | ProtT5 (Cat. / Num. / Total) | Ankh (Cat. / Num. / Total) | Total by Test Type (Cat. / Num. / Total)
Target | 23 / 13 / 36 | 18 / 18 / 36 | 29 / 13 / 42 | 70 / 44 / 114
Source | 33 / 9 / 42 | 32 / 16 / 48 | 36 / 8 / 44 | 101 / 33 / 134
Embedding | 27 / 8 / 35 | 26 / 15 / 41 | 32 / 10 / 42 | 85 / 33 / 118
Prediction | 29 / 14 / 43 | 24 / 19 / 43 | 33 / 11 / 44 | 86 / 44 / 130
Table 12. Comparison of embedding vs. prediction in terms of test passing.
Embed. / Predict. | ProtBERT (Cat. / Num. / Total) | ProtT5 (Cat. / Num. / Total) | Ankh (Cat. / Num. / Total) | All (Cat. / Num. / Total)
pass / pass | 21 / 4 / 25 | 22 / 11 / 33 | 25 / 4 / 29 | 68 / 19 / 87
pass / fail | 6 / 4 / 10 | 4 / 4 / 8 | 7 / 6 / 13 | 17 / 14 / 31
fail / pass | 8 / 10 / 18 | 2 / 8 / 10 | 8 / 7 / 15 | 18 / 25 / 43
fail / fail | 19 / 54 / 73 | 26 / 49 / 75 | 14 / 55 / 69 | 59 / 158 / 217
Total | 54 / 72 / 126 | 54 / 72 / 126 | 54 / 72 / 126 | 162 / 216 / 378
Table 13. Mean infidelity results with a separate heat map for each column; darker is better (lower infidelity); colors in different columns are not comparable. A minimal sketch of the infidelity computation follows the table.
XAI Method | ProtBERT Embed. | ProtBERT Predict. | ProtT5 Embed. | ProtT5 Predict. | Ankh Embed. | Ankh Predict.
Saliency | 6.98 × 10^−8 | 5.03 × 10^−5 | 5.76 × 10^−9 | 5.50 × 10^−6 | 2.05 × 10^−11 | 3.58 × 10^−6
Deconvolution | 7.03 × 10^−8 | 5.31 × 10^−5 | 7.64 × 10^−9 | 6.98 × 10^−6 | 2.03 × 10^−11 | 6.17 × 10^−5
Guided Backprop. | 6.95 × 10^−8 | 4.93 × 10^−5 | 1.14 × 10^−1 | 1.10 × 10^−4 | 2.02 × 10^−11 | 1.63 × 10^−6
Input X Gradient | 5.15 × 10^−8 | 3.73 × 10^−5 | 6.49 × 10^−9 | 4.75 × 10^−6 | 4.81 × 10^−10 | 2.09 × 10^−6
DeepLIFT | 4.61 × 10^−8 | 3.32 × 10^−5 | 8.42 × 10^−8 | 7.90 × 10^−5 | 5.80 × 10^−10 | 2.08 × 10^−6
Integrated Gradient | 4.27 × 10^−8 | 3.22 × 10^−5 | 4.10 × 10^−9 | 3.50 × 10^−6 | 1.36 × 10^−11 | 1.69 × 10^−6
LIME | 4.40 × 10^−8 | 3.22 × 10^−5 | 2.39 × 10^−7 | 3.51 × 10^−6 | 4.34 × 10^−11 | 1.77 × 10^−6
KernelShap | 4.47 × 10^−8 | 3.25 × 10^−5 | 1.00 × 10^−6 | 1.00 × 10^−6 | 1.61 × 10^−11 | 1.74 × 10^−6
GradientShap | 4.56 × 10^−8 | 3.03 × 10^−5 | 4.52 × 10^−8 | 4.21 × 10^−5 | 3.10 × 10^−10 | 2.03 × 10^−6
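
The infidelity scores in Table 13 measure how well an attribution vector predicts the change in model output under small random perturbations of the input; lower is better. The sketch below is a from-scratch Monte-Carlo estimate of that definition on a toy linear model; it is illustrative only and is not the evaluation code used to produce the table.

    # Minimal sketch of the infidelity metric: the expected squared difference between
    # the inner product of a random perturbation with the attribution vector and the
    # resulting change in model output. The linear "model" and Gaussian perturbations
    # are placeholders chosen only to keep the example small.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=16)

    def model(x):                      # toy scalar-output model: f(x) = w . x
        return x @ w

    def infidelity(f, x, attr, n_samples=2000, scale=0.1):
        """Monte-Carlo estimate of E[(I . attr - (f(x) - f(x - I)))^2] over random I."""
        total = 0.0
        for _ in range(n_samples):
            pert = rng.normal(scale=scale, size=x.shape)   # random perturbation I
            total += (pert @ attr - (f(x) - f(x - pert))) ** 2
        return total / n_samples

    x = rng.normal(size=16)                                # one input (e.g., an embedding vector)
    print(infidelity(model, x, w))                         # gradient attribution: near zero (faithful)
    print(infidelity(model, x, rng.normal(size=16)))       # random attribution: much larger
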
Table 14. Tests passed by XAI methods and embeddings, with the same organization as in Table 13 for the mean infidelity results; darker is better.
XAI Method | ProtBERT Embed. | ProtBERT Predict. | ProtT5 Embed. | ProtT5 Predict. | Ankh Embed. | Ankh Predict.
Saliency | 8 | 7 | 8 | 6 | 5 | 7
Deconvolution | 3 | 4 | 7 | 7 | 6 | 6
Guided Backprop. | 3 | 3 | 8 | 9 | 6 | 8
Input X Grad. | 2 | 4 | 1 | 3 | 4 | 3
DeepLIFT | 3 | 3 | 4 | 4 | 3 | 4
Integrated Grad. | 7 | 9 | 2 | 3 | 7 | 3
LIME | 0 | 2 | 0 | 2 | 1 | 4
KernelShap | 7 | 7 | 10 | 9 | 7 | 5
GradientShap | 2 | 4 | 1 | 0 | 3 | 4