dom2vec: Capturing domain structure and function using self-supervision on protein domain architectures

Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches map words to a low-dimensional vector space in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where the words are amino acids and the sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, these sequence embeddings have the caveat that no biological metadata exist per amino acid with which to measure the quality of each unique learned embedding vector. Results: Here, we present dom2vec, an approach for learning protein domain embeddings by applying word2vec to InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains, and they relate to each domain separately. We therefore present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of domains: the structure, enzymatic function and molecular function of a given domain. These evaluations allow us to assess the quality of learned embeddings independently of any particular downstream task. Notably, dom2vec attains an adequate level of performance in the intrinsic assessment, allowing us to draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec on protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks.
We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction. Conclusions: We report that the application of word2vec to InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated against its available structure and function metadata. Second, the produced embeddings can outperform the sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks. Hence, researchers can reliably use them for domain architecture and protein prediction tasks.


Background
A primary way in which proteins evolve is through rearrangement of their functional and structural units, known as protein domains [1,2]. Domains are independent folding and functional modules and exhibit conserved sequence segments. Prediction algorithms exploit this information and use the domain composition of a protein as input features for various tasks. For example, [3] classified the cellular location and [4,5] predicted the associated Gene Ontology (GO) terms. One common way to represent the domain composition of a protein is by the linear order of its domains, the domain architecture [6].
Moreover, [7] investigated whether the "language" of domain architectures has a grammar, as a natural spoken language does. They compared the bi-gram entropy of domain architectures for Pfam domains [8] to the respective entropy of the English language, showing that although it was lower than that of English, it was significantly different from that of a language produced by shuffling the domains. Prior to this result, methods had already exploited the domain architecture representation for various applications, such as fast homology search [9] and retrieval of similar proteins [10].
Word embeddings are unsupervised learning methods that take large corpora as input and output a dense vector representation of the words contained in the sentences of these documents, based on the distributional semantics hypothesis. Under this assumption, the meaning of a word can be understood from its context. Thus a word vector encapsulates local linguistic features, such as lexical or semantic information, of the respective word. Several methods to train word embeddings have been established [11,12,13]. These representations have been shown to exhibit several properties, such as analogy and grouping of semantically similar words [14,15]. Word embeddings are currently the mainstream input for neural networks in NLP, as their use has improved performance on most tasks.
Various methods to create embeddings for proteins have been proposed [16,17,18,19,20,21,22]. ProtVec fragmented the protein sequence into 3-mers for all possible starting shifts, learned an embedding for each 3-mer, and represented the respective protein as the average of its constituent 3-mer vectors [16]. SeqVec utilized and extended Embeddings from Language Models (ELMo) [23] to learn a dense representation per amino acid residue. SeqVec embedding results in a matrix representation of a protein, created by concatenating its learned residue vectors [20].
The previous embedding approaches evaluated the learned representations qualitatively and quantitatively. For qualitative evaluation, they averaged the amino acid embeddings of a whole protein to compute an aggregated vector. Then they used known biological characteristics of proteins (biophysical, chemical, structural, enzymatic and taxonomic) as distinct colors in a reduced 2-D embedding space. In such visualizations, they reported the appearance of distinct clusters of proteins, each consisting of proteins with similar properties. For quantitative evaluation, they measured the improvement of performance in downstream tasks.
Focusing on the qualitative evaluation of existing protein embeddings, we identify two caveats. First, researchers averaged the protein amino acid vectors, so this qualitative evaluation is not related in a straightforward way to each learned embedding vector trained per amino acid. In addition, this averaging operation may not reveal the function of the most important sites of a protein, so the comparison results hold a low degree of biological significance. Second, we argue that the presented qualitative evaluations lack the ability to assess different learned embeddings in a sophisticated manner, because there is no systematic way to quantitatively compare 2-D plots of reduced embedding spaces, each produced by a protein embedding method under investigation.
For word embeddings, in contrast, there is a growing number of methods to evaluate representations intrinsically and quantitatively, such as [24,25]. Having such evaluation metrics allows us to validate the knowledge acquired per word vector and to use the best performing space for downstream tasks. However, intrinsic evaluation of current amino acid embedding representations is prevented by the incomplete biological metadata at the amino acid level for the proteins deposited in the UniProt KnowledgeBase (UniProtKB) [26].
To address this evaluation shortcoming of protein sequence embeddings, we make five major contributions. (1) We propose dom2vec, in which the words are InterPro annotations and the sentences are domain architectures; we then use the word2vec method to learn an embedding vector representation for each InterPro annotation. (2) We establish a novel intrinsic evaluation method based on the most significant biological information for a domain: its structure and function. First, we evaluated the learned embedding space by domain hierarchy. Then, we investigated the performance of a nearest neighbor classifier, C^d_nearest, to predict the secondary structure class provided by SCOPe [27] and the Enzyme Commission (EC) primary class. Finally, we equally examined the performance of the C^d_nearest classifier to predict the GO molecular function class for three example model organisms and one human pathogen. (3) Strikingly, we observed that C^d_nearest reaches adequate accuracy, compared to C^d_nearest on randomized domain vectors, for secondary structure, enzymatic and GO molecular function classes. Thus we hypothesized an analogy between word embedding clustering by local linguistic features and protein domain clustering by domain structure and function. (4) To evaluate our embeddings extrinsically, we fed the learned domain embeddings to simple neural networks and compared their performance with state-of-the-art protein sequence embeddings in three full-protein tasks. We surpassed both SeqVec and ProtVec for the toxin presence and enzymatic primary function prediction tasks and we report comparable results in the cellular location prediction task. (5) We make the trained domain embeddings available to the research community.

Methods
Our approach is summarized in Figure 1, in the following we explain the methodology for each part of the approach.
Building domain architecture
The InterPro database contains functional annotations for superfamilies, families and single domains as well as functional protein sites. Hereafter, we will refer to all such functional annotations as InterPro annotations. Furthermore, we will call the ordered arrangement of domains in a protein its domain architecture. We consider two distinct strategies to represent a protein based on its domain architecture, consisting of either non-overlapping or non-redundant annotations. These annotation types are defined as follows. Non-overlapping annotations. For each region with overlapping InterPro annotations, all InterPro annotations except the longest are removed.
Non-redundant annotations. For each region with overlapping InterPro annotations, only the longest annotation per unique InterPro identifier is kept; shorter annotations with the same InterPro identifier are removed.
To efficiently resolve the annotation overlaps, an interval tree was built for each protein to detect overlapping domains, and each protein was split into regions of overlapping domains. After resolving overlaps, the admitted annotations in each protein were sorted by start position to construct its domain architecture. Following the approach of [5], we also added a "GAP" domain to annotate any subsequence of more than 30 amino acids that does not match any InterPro annotation entry.
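As an illustration, the non-overlapping strategy above can be sketched in a few lines of Python. This simplified version sorts hits by start position and sweeps once instead of building an interval tree; the function name and the `(id, start, end)` tuple layout are assumptions for illustration.

```python
# Sketch of the non-overlapping strategy: within each group of mutually
# overlapping InterPro hits, keep only the longest one. The paper builds an
# interval tree per protein; this simplified version sorts hits by start
# position and sweeps once over them.

def resolve_non_overlapping(hits):
    """hits: list of (interpro_id, start, end); returns the kept hits
    sorted by start position (the protein's domain architecture)."""
    if not hits:
        return []
    hits = sorted(hits, key=lambda h: h[1])
    kept = []
    group = [hits[0]]          # current group of overlapping hits
    group_end = hits[0][2]
    for h in hits[1:]:
        if h[1] <= group_end:  # overlaps the current group
            group.append(h)
            group_end = max(group_end, h[2])
        else:                  # group closed: keep its longest hit
            kept.append(max(group, key=lambda g: g[2] - g[1]))
            group, group_end = [h], h[2]
    kept.append(max(group, key=lambda g: g[2] - g[1]))
    return kept
```

For example, a hit spanning positions 10-50 nested inside one spanning 1-100 is discarded, while a hit starting after position 100 opens a new group.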
Training domain embeddings
Given a protein, we assumed that the words are its resolved InterPro annotations and the sentences are the protein domain architectures. Under this assumption, we learned task-independent embeddings for each InterPro annotation using two variants of word2vec: the continuous bag of words and skip-gram models, hereafter denoted CBOW and SKIP. See [12] for technical details on the difference between these approaches. Through this training, each InterPro annotation is associated with a task-independent embedding vector.

Novel intrinsic evaluation methods
Previous works evaluated the quality of embeddings only indirectly by measuring performance on downstream tasks. Nevertheless, in NLP, the quality of a learned word embedding space is often evaluated intrinsically by considering relationships among words, such as analogies. Such an evaluation is important, as it ensures the learned embeddings are meaningful without choosing a specific downstream task.
In the following, we used the metadata for the most characteristic properties of domains, in order to evaluate the learned embedding space for various hyperparameters of word2vec. We propose four intrinsic evaluation approaches for domain embeddings: domain hierarchy based on the family/subfamily relation, SCOPe secondary structure class, EC primary class, and GO molecular function annotation.
We refer to the embedding space learned by word2vec for a particular set of hyperparameters as V_emb. We refer to the k nearest neighbors of a domain d, found using the Euclidean distance, as C^d_nearest.
Random InterPro annotation vectors To inspect the relative performance of V_emb on each of the following evaluations, we randomized all domain vectors and ran each evaluation task. That is, we assigned to each domain a newly created random vector, for each unique dimensionality of the embedding space, irrespective of all other embedding method parameters.

Domain hierarchy
InterPro defines a strict family-subfamily relationship among domains, based on sequence similarity of the domain signatures. We refer to the children of domain p as S_p. We use these relationships to evaluate an embedding space, posing the following research question, RQ_hierarchy: Did vectors of hierarchically close domains form clusters in V_emb?
Evaluation We predicted the |S_p| domains closest to the parent, based on cosine similarity of their vectors to the parent vector, and we denote this predicted set as Ŝ_p. For all learned embedding spaces, we measured their recall performance, Recall_hier, defined per parent p as Recall_hier(p) = |Ŝ_p ∩ S_p| / |S_p| and averaged over all parents.

SCOPe secondary structure class
We extracted the secondary structure of InterPro domains from the SCOPe database and form the following research question, RQ_SCOPe: Did vectors of domains with the same secondary structure class form clusters in V_emb?
Evaluation We evaluated V_emb by retrieving C^d_nearest for each domain. Then, we applied stratified 5-fold cross validation and measured the performance of a k-nearest neighbor classifier in predicting the structure class of each domain. The intrinsic evaluation performance metric is the average accuracy across all folds, Accuracy_SCOPe.
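The cross-validated nearest-neighbor evaluation can be sketched with scikit-learn; the function name and synthetic inputs below are illustrative, and the same scheme applies unchanged to the EC and GO evaluations that follow.

```python
# Sketch of the intrinsic accuracy evaluation: a k-nearest-neighbor
# classifier over domain vectors, scored with stratified 5-fold cross
# validation (Euclidean distance, as in the paper).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_accuracy(X, y, k=2, folds=5, seed=0):
    """X: (n_domains, dim) embedding matrix; y: class labels.
    Returns the average accuracy across the stratified folds."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
```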

EC primary class
The enzymatic activity of each domain is given by its primary EC class [28], and we pose the following research question, RQ_EC: Did vectors of domains with the same enzymatic primary class form clusters in V_emb?
Evaluation We again evaluate V_emb using k nearest neighbors in a stratified 5-fold cross validation setting. Average accuracy across all folds is again used to quantify the intrinsic quality of the embedding space.

GO molecular function
For our last intrinsic evaluation, we aimed to assess V_emb using the molecular function GO annotations. We extracted all molecular function GO annotations associated with each domain. In order to account for differences in specificity between GO annotations, we always used the depth-1 ancestor of each annotation, i.e. a child of the root molecular function term, GO:0003674.
Since model organisms have the most annotated proteins, we created GO molecular function data sets for one example prokaryote (Escherichia coli, denoted E.coli), one example simple eukaryote (Saccharomyces cerevisiae, denoted S.cerevisiae) and one complex eukaryote (Homo sapiens, denoted Human). To assess our embeddings also for less highly annotated organisms, we included a molecular function data set for an example human pathogen (Plasmodium falciparum, denoted Malaria). Finally, we pose the following research question, RQ_GO: Did vectors of domains with the same GO molecular function form clusters in V_emb?
Evaluation We again evaluate an embedding space using k nearest neighbors in a stratified 5-fold cross validation setting. Average accuracy across all folds is again used to quantify performance.
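The depth-1 ancestor mapping described above can be sketched as follows. This simplified version assumes a dict from each term to a single is_a parent, whereas the real GO is a DAG where a term may have several parents; the function name is illustrative.

```python
# Sketch of mapping a molecular-function GO term to its depth-1 ancestor
# (a direct child of the root term GO:0003674), used to normalize the
# specificity of GO labels. `parents` maps a term to its (single, for
# simplicity) is_a parent; the real GO is a DAG with multiple parents.
ROOT = "GO:0003674"

def depth1_ancestor(term, parents):
    """Walk up the parent links until reaching a child of the root.
    Returns None if `term` is the root itself."""
    while parents.get(term, ROOT) != ROOT:
        term = parents[term]
    return term if term != ROOT else None
```

For example, GO:0016301 (kinase activity) is_a GO:0003824 (catalytic activity), which is a direct child of the root, so both terms map to GO:0003824.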

Qualitative evaluation
As a preliminary evaluation strategy, we used the qualitative evaluation approaches adopted in existing work. Following the qualitative approach of ProtVec and SeqVec, we also visualized the embedding space for selected domain superfamilies, to answer the following research question, RQ_qualitative: Did the vectors of each domain superfamily form a cluster in V_emb?
Evaluation To find out, we added the vector of each domain in a randomly chosen domain superfamily to an empty space. Then we performed principal component analysis (PCA) [29] to reduce the space to two dimensions and observed the formed clusters.
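A minimal sketch of this projection step with scikit-learn, assuming the superfamily vectors are stacked into one matrix; in practice the 2-D scores would then be scatter-plotted with one color per superfamily.

```python
# Project domain vectors to 2-D with PCA for visual cluster inspection.
import numpy as np
from sklearn.decomposition import PCA

def project_2d(vectors):
    """vectors: (n, dim) array of domain embeddings -> (n, 2) PCA scores."""
    return PCA(n_components=2).fit_transform(vectors)
```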

Extrinsic evaluation
In addition to assessing the learned V_emb, we also examined the performance change in downstream tasks. For the three supervised tasks, TargetP, Toxin and NEW, we fed the domain representations into simple neural networks and compared the performance of our model with the state-of-the-art protein embeddings, ProtVec and SeqVec.

TargetP
This data set concerns predicting the cellular location of a given protein. We downloaded the TargetP data set provided by [30], using the non-plant version. It consists of 2 738 proteins, each with its UniProt id, sequence and cellular location label, which can be nuclear, cytosolic, secretory pathway (signal) or mitochondrial. Finally, we removed all instances with a duplicate set of domains, resulting in a total of 2 418 proteins. This is a multi-class task and its class distribution is summarized in Supplementary section E.
Evaluation For TargetP, we used the multi-class area under the ROC curve (mc-AuROC) as the performance metric.

Toxin
[31] introduced a data set associating protein sequences with toxic or other physiological content. We used the hard setting, which provides the UniProt id, sequence and label (toxin or non-toxin content) for 15 496 proteins. Finally, we kept only the proteins with a unique domain composition, resulting in 2 270 protein instances in total. This is a binary task and the class distribution is shown in Suppl. E.
Evaluation As the Toxin data set poses a binary task, we used AuROC as the performance metric.

NEW
The NEW data set [32] contains data for predicting the enzymatic function of proteins. For each of its 22 618 proteins, the data set provides the sequence and the EC number. The primary enzyme class, the first digit of the EC number, is the label in this prediction task, making it a multi-class task. Finally, we removed all instances with a duplicate domain composition, resulting in a total of 14 434 protein instances. There are six possible classes, and the class distribution is shown in Supplementary section E.
Evaluation As the NEW data set poses a multi-class task, we used mc-AuROC as the performance metric.
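The mc-AuROC metric used for TargetP and NEW can be computed with scikit-learn's one-vs-rest ROC AUC; the macro averaging chosen here is an assumption, as the paper does not state the exact averaging scheme.

```python
# Multi-class AuROC via one-vs-rest macro averaging.
import numpy as np
from sklearn.metrics import roc_auc_score

def mc_auroc(y_true, y_proba):
    """y_true: (n,) integer labels; y_proba: (n, n_classes) probabilities
    whose rows sum to 1, as required by scikit-learn for multi-class AUC."""
    return roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
```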

Data partitioning
We divided each data set into 70/30% train and test splits. To perform model selection, we created inner three-fold cross validation sets on each train split.
Out-of-vocabulary experiment We observed that the performance of a classifier relying on protein domains is highly dependent on the out-of-vocabulary (OOV) domains, as first discussed in [33]. OOV domains are the domains contained in the test set but not in the train set. For TargetP, Toxin and NEW, we observed that approximately 60%, 20% and 20% of test proteins contain at least one OOV domain, respectively.
As TargetP has the highest OOV rate, we experimented with compensating for the high degree of OOV. We split the test set into smaller sets with an increasing degree of OOV, namely 0%, 10%, 30%, 50%, 70% and 100%. Then we trained models on the whole train set and benchmarked their performance on each of these test subsets.
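The OOV measurement itself is straightforward; a sketch with illustrative names, where each protein is represented by its list of InterPro identifiers:

```python
# Fraction of test proteins containing at least one domain
# never seen in the training set (out-of-vocabulary, OOV).
def oov_fraction(train_proteins, test_proteins):
    """Each protein is an iterable of InterPro identifiers."""
    vocab = {d for p in train_proteins for d in p}
    n_oov = sum(1 for p in test_proteins if any(d not in vocab for d in p))
    return n_oov / len(test_proteins)
```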
Generalization experiment For the Toxin and NEW data sets, which experience low OOV, we sought to investigate the generalization ability of the produced classifiers. We increased the number of training examples that the model was allowed to learn from and always benchmarked on the entire test set. To do so, we created training splits of 10%, 20% and 50% of the whole train set. To perform significance testing, we trained on 10 random subsamples for each training split percentage and then tested on the separate test set. We used the paired-sample t-test with Benjamini-Hochberg multiple-testing correction to compare the performance of a pair of classifiers on the test set.
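The significance test can be sketched as follows, assuming the inputs are the per-subsample test scores of two classifiers; the Benjamini-Hochberg step-up procedure is implemented by hand since the paper does not name a library.

```python
# Paired-sample t-test over per-subsample scores, plus Benjamini-Hochberg
# adjusted p-values across the compared classifier pairs.
import numpy as np
from scipy.stats import ttest_rel

def paired_pvalue(scores_a, scores_b):
    """p-value of the paired t-test between two classifiers' scores."""
    return ttest_rel(scores_a, scores_b).pvalue

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (monotone step-up)."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    prev = 1.0
    for rank, idx in enumerate(order[::-1]):   # largest p first
        k = m - rank                           # its rank among sorted p
        prev = min(prev, p[idx] * m / k)
        adj[idx] = prev
    return adj
```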

Simple neural models for prediction
We consider a set of simple, well-established neural models to combine the InterPro annotation embeddings for each protein to perform downstream tasks, that is, for extrinsic evaluation tasks.

Building domain architecture
We used the domain hits for UniProt proteins from InterPro version 75, containing 128 660 257 proteins with an InterPro signature, making up 80.9% of the total UniProtKB proteome (version 2019 06). For all these proteins, we extracted the non-overlapping and non-redundant annotation sequences, which we processed in the next section. The number of unique non-overlapping annotations was 35 183 plus the "GAP" domain, and the number of unique non-redundant annotations was 36 872 plus the "GAP" domain. Compared to the total number of domains in InterPro version 75, which was 36 872, we observed that the non-overlapping InterPro annotations captured 95.42% and the non-redundant annotations 100% of the InterPro annotation entries. To enable visual comparison of the created types of domain architectures versus the downloaded InterPro annotations, Figure 2 illustrates the non-overlapping and non-redundant domain architectures of the Diphthine synthase protein. This same protein, Diphthine synthase, was picked as the example illustration for annotations in the latest InterPro work [37].
Training domain embeddings
Domain architecture Before applying the word2vec method, we examined the histograms of the number of non-overlapping and non-redundant InterPro annotations per protein in Figure 3. We observed that these distributions are long-tailed, with modes equal to 1 and 3 respectively. Then, we used both the CBOW and SKIP algorithms to learn domain embeddings with the following parameter sets. Based on the histograms, we selected the context window parameter to be 2 or 5, w ∈ {2, 5}. For the number of dimensions, we used common values from the NLP literature, dim ∈ {50, 100, 200}. We trained the embeddings from 5 to 50 epochs with a step size of 5 epochs, ep ∈ {5, 10, 15, ..., 50}. Finally, all other parameters were set to their default values; for example, the negative sampling parameter was left at its default, ng=5.

Novel intrinsic evaluation
In the following, we evaluated each instance of a learned embedding space V_emb for both the non-overlapping and non-redundant representations of domain architectures. An instance of V_emb is the embedding space learned for one combination of annotation type (non-overlapping or non-redundant), context window w, dimensionality dim and number of epochs ep. Consequently, the total number of embedding space instances is |annotation type| × |w| × |dim| × |ep| = 2 × 2 × 3 × 10 = 120. Let V^i_emb denote such an embedding space instance. In the following subsections, we evaluated each V^i_emb instance for domain hierarchy, secondary structure, enzymatic primary class and GO molecular function. Finally, all reported performances are shown for the best performing epoch value (ep).
RQ_hierarchy: Did vectors of hierarchically close domains form clusters in V_emb? For the first research question, we loaded the parent-child tree T_hier, provided by InterPro, consisting of 2 430 parent domains. Then for each V^i_emb we compared the actual and predicted children of each parent, and we averaged the recall over all parents. For ease of presentation, we show only the results for non-redundant InterPro annotations in Table 1a and provide the complete results in Suppl. A.
From Tables S1 and 1a (Suppl. A), we observed that SKIP performed better overall, and that the embeddings learned from non-redundant InterPro annotations always have better average recall values than the non-overlapping ones. The best performing V^i_emb achieved an average Recall_hier of 0.538. We compared this moderate performance with the performance of the randomized spaces, which was equal to 0. We concluded that our embedding spaces greatly outperformed each randomized space for the domain hierarchy relation, and therefore that the majority of domains of the same hierarchy were placed in close proximity in the embedding space.
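The per-parent recall computation behind these numbers can be sketched in pure Python; the function names are illustrative and the vectors here are toy values.

```python
# Sketch of the hierarchy recall: predict the |children| domains closest
# (by cosine similarity) to the parent vector and compute the overlap
# with the true children, i.e. |predicted ∩ true| / |true|.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_hier(vectors, parent, children):
    """vectors: dict id -> embedding; children: set of true child ids."""
    candidates = [d for d in vectors if d != parent]
    candidates.sort(key=lambda d: cosine(vectors[parent], vectors[d]),
                    reverse=True)
    predicted = set(candidates[:len(children)])
    return len(predicted & set(children)) / len(children)
```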
RQ_SCOPe: Did vectors of domains with the same secondary structure class form clusters in V_emb? We extracted the SCOPe class for each InterPro domain. This resulted in 25 196 domains with unknown secondary structure class, 9 411 with a single secondary structure class and 2 265 domains with more than one assigned class (multi-label). For clarity, we removed all multi-label and unknown instances, resulting in 9 411 single-labeled instances. The class distribution of the resulting data set is shown in Suppl. B.
We measured the performance of the C^d_nearest classifier in each V^i_emb to examine the homogeneity of the space with respect to the SCOPe class. We split the 9 411 domains into 5-fold stratified cross validation sets. To test the change in prediction accuracy for an increasing number of neighbors, we used different numbers of neighbors, namely k ∈ {2, 5, 20, 40}. We summarize the results for the best performing C^d_nearest, which used k = 2, for non-redundant InterPro annotations in Table 1b; the respective table for non-overlapping InterPro annotations is in Suppl. B. We compared these accuracy measurements to the respective ones of the random spaces, and we found that the lowest accuracy values, achieved by CBOW with w=5 using non-overlapping domains, are twice as high as the accuracy values of the random spaces for all possible dimensions. Consequently, we concluded that domain embeddings of the same secondary structure class formed distinct clusters in the learned embedding space.
RQ_EC: Did vectors of domains with the same enzymatic primary class form clusters in V_emb? We processed the EC primary classes, resulting in 29 354 domains with unknown EC, 7 248 domains with only one EC and 721 with more than one EC. As before, we removed all multi-label and unknown instances, leaving 7 428 domains with known EC. We augmented each domain instance with its vector representation in each V^i_emb, and then used C^d_nearest to predict the EC label. See Suppl. C for the class distribution of the EC task.
We report the average Accuracy_EC obtained on the embedding spaces learned using non-redundant InterPro annotations in Table 1c; the respective table for non-overlapping annotations is in Suppl. C. We compared these accuracy measurements to the respective ones of the random spaces. We found that the minimum average Accuracy_EC value was 60.51, achieved by CBOW with w=5 for non-overlapping InterPro annotations. That value is approximately twice as large as the accuracy values of the random spaces for all possible dimensions; the maximum average Accuracy_EC for a random space, with dim=100, was 32.64. Hence, we concluded that domain embeddings of the same EC primary class formed distinct clusters in a learned embedding space.
RQ_GO: Did vectors of domains with the same GO molecular function form clusters in V_emb? We parsed the GO annotation file of InterPro to extract the first-level GO molecular function of the domains for the four organisms. We followed the same methodology to examine the homogeneity of a V_emb with respect to GO molecular function annotations. For each V^i_emb, we augmented each domain with its vector and its GO label, and we classified each domain using C^d_nearest. As before, we used 5-fold stratified cross validation for evaluation. In our experiments, we varied the number of neighbors, k ∈ {2, 5, 20, 40}, to test its influence on performance.
Due to space limitations, we summarize the performances showing only the best average accuracy over the number of neighbors. For ease of presentation, we omit the resulting tables for the first three organisms and show only the one for Human, but we discuss the results for all organisms. See Suppl. D for the full results.
For Malaria, the best average accuracy was 76.86 and the minimum was 56.94. We compared this moderate minimum accuracy to the maximum accuracy obtained by the randomized embedding spaces, which was 47.57. Therefore, we concluded that the dom2vec embeddings outperformed the random baseline by nearly ten percentage points.
For E.coli, the best accuracy was 81.72 and the minimum was 67.34. Comparing with the random baseline, whose best accuracy was 64.46, we observed that dom2vec again surpassed the random baseline.
For Yeast, the best accuracy was 75.10 and the minimum accuracy was 59.82. We contrasted this with the maximum accuracy obtained in a random space, which was 53.73, to report that the dom2vec vectors in the V_emb for S.cerevisiae captured the GO molecular function classes to a much higher degree than randomized vectors.
For Human, the best average performances for non-redundant InterPro annotations are shown in Table 1d. The best average accuracy was 75.96, scored by 2-NN for the V_emb instance (non-redundant, SKIP, w=5, dim=50, ep=40). We compared the minimum accuracy values, obtained by CBOW with w=5, to the accuracy values of the random spaces, and found that the worst-performing dom2vec was 20 percentage points higher than the random baseline.
For all four example organisms, we observed that SKIP on non-redundant InterPro annotations produced the V_emb in which C^d_nearest achieved the best average accuracy. For three of the four organisms, the best performances were achieved with the lowest number of dimensions (dim=50). In all cases, we found that the worst-performing dom2vec embeddings outperformed the random baselines. By these findings, we affirmed that domain embeddings of the same GO molecular function class formed distinct clusters in the learned embedding space.

Concluding on novel intrinsic evaluation
Based on the previous four experiments, we aimed to evaluate the learned V_emb spaces and select the best domain embedding space for downstream tasks. In all experiments, the non-redundant InterPro annotations created better performing embedding spaces than the non-overlapping annotations. We explain this finding by comparing the modes of the number of annotations per protein for the two annotation types (Figure 3). We hypothesized that, because of the very low mode for non-overlapping annotations (one annotation per protein), the word2vec method could not exploit even the stringent context window value of two. In contrast, 52% of proteins contain at most three non-redundant InterPro annotations.
This enables word2vec to produce the embedding spaces with the best intrinsic performance. From the individual results, we saw that the parameter configuration (non-redundant, SKIP, w=5, dim=50) gave the best C^d_nearest performance for SCOPe, EC and GO for E.coli, Yeast and Human, the second best for Malaria, and the sixth best recall (0.507) for the domain hierarchy relation. Therefore, we denote by V_emb^best-intrinsic the space produced by (non-redundant, SKIP, w=5, dim=50, ep=50).
RQ_qualitative: Did the vectors of each domain superfamily form a cluster in V_emb? To explore V_emb in terms of the last research question, RQ_qualitative, we randomly selected five InterPro domain superfamilies for the visualization experiment. The selected superfamilies were the PMP-22/EMP/MP20/Claudin superfamily with parent InterPro id IPR004031, the small GTPase superfamily with parent InterPro id IPR006689, Kinase-pyrophosphorylase with parent InterPro id IPR005177, Exonuclease, RNase T/DNA polymerase III with parent InterPro id IPR013520 and the SH2 domain with parent InterPro id IPR000980.
We loaded the parent-child tree T_hier provided by InterPro and, for each domain superfamily, starting from the parent domain we recursively included all domains that have a subfamily relation with this parent. For example, the Kinase-pyrophosphorylase superfamily had the parent domain IPR005177, which in turn had two immediate subfamilies, IPR026530 and IPR026565. The IPR026565 domain contained a subfamily domain with id IPR017409; consequently, the set of domains for the Kinase-pyrophosphorylase superfamily was {IPR005177, IPR026530, IPR026565, IPR017409}. We retrieved the vector for each domain in a superfamily using V_emb^best-intrinsic, the best performing dom2vec space selected previously.
Finally, we visualized the two-dimensional PCA-reduced space in Figure 4. We recognized that the domain embeddings of each superfamily organized into well-separated clusters. Among these clusters, that of the Exonuclease, RNase T/DNA polymerase III superfamily had the highest dispersion. By this finding, we could answer the research question: embedding vectors of the same superfamily clustered well in the learned V_emb.

Extrinsic evaluation
Extracting domain architecture For each data set that contained the UniProt identifier of the protein instance, we extracted the domain architecture for non-redundant InterPro annotations, already created in Section "Building domain architecture". For all proteins whose UniProt identifier could not be matched, or for data sets not providing the protein identifier, we used InterProScan [38] to find the domain hits per protein. For proteins without a domain hit after InterProScan, we created a protein-specific, artificial protein-long domain; for example, we assigned to the protein G5EBR8 a protein-long domain named "G5EBR8 unk dom".
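The fallback to a protein-specific artificial domain can be sketched as follows; the exact token format (underscore-joined) is our assumption for illustration, since the text renders the example as "G5EBR8 unk dom":

```python
def domain_architecture(protein_id, domain_hits):
    """Return a protein's domain architecture as a list of tokens;
    a protein with no InterPro hit gets a protein-specific artificial
    domain token (token format is illustrative, not the paper's code)."""
    if domain_hits:
        return list(domain_hits)
    return [f"{protein_id}_unk_dom"]
```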

Model selection
To select which simple neural model we should compare to the baselines, we performed hyperparameter selection using an inner, three-fold cross-validation on the training set; the test set was not used to select hyperparameters. We used common parameters: a dropout of 0.5, a batch size of 64, the Adam optimizer [39] with a learning rate of 0.0003, zero weight decay for the last fully connected layer, and 300 epochs. As a final hyperparameter, we allowed updates to the learned domain embeddings, initialized by the selected dom2vec embeddings. The results are shown in Suppl. E.
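For reference, the fixed hyperparameters listed above can be collected in one configuration mapping (framework-agnostic; the key names are ours):

```python
# Fixed training hyperparameters from the text; key names are illustrative.
TRAIN_CONFIG = {
    "dropout": 0.5,
    "batch_size": 64,
    "optimizer": "Adam",          # [39]
    "learning_rate": 0.0003,
    "weight_decay_last_fc": 0.0,  # no weight decay on the last FC layer
    "epochs": 300,
    "finetune_embeddings": True,  # dom2vec-initialized embeddings are updated
}
```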

Running baselines
Then, we used the same network as the one on the right side of Figure 5 of [20]; we refer to this network as SeqVecNet. Namely, the network first averages the 100-dimensional (ProtVec) or 1 024-dimensional (SeqVec) embedding vectors into one vector per protein; it then applies a fully connected layer to compress a batch of such vectors into 32 dimensions. Next, a ReLU activation function (with 0.25 dropout) is applied to that vector, followed by batch normalization. Finally, another fully connected layer is followed by the prediction layer. As a third baseline, we added the 1-hot encoding of domains in order to investigate the performance change compared to the learned dom2vec embeddings.
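The first step of SeqVecNet, averaging per-residue embedding vectors into one per-protein vector, can be sketched as follows (a minimal illustration, not the paper's implementation):

```python
def average_pool(residue_embeddings):
    """Average an L x d list of per-residue embedding vectors into a
    single d-dimensional per-protein vector (SeqVecNet's first step)."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[j] for vec in residue_embeddings) / length
            for j in range(dim)]
```

The resulting fixed-size vector is what the subsequent fully connected layers operate on, regardless of protein length.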

Evaluation
For TargetP, we sought to investigate the effect of OOV on the produced classifier compared to classifiers based on sequence embeddings, which do not experience OOV because their sequence features are highly common in both the training and test sets. For the Toxin and NEW data sets, we benchmarked the generalization of the produced classifier against the classifiers based on sequence embeddings. Finally, for both kinds of experiments, we applied the trained models to each test set. Hence, this evaluation shows how differences in the training set affect performance on the test set. The resulting performances are shown in Figure 5.
Out-of-vocabulary experiment For TargetP, we validated that OOV affects the performance of domain-dependent classifiers. That is, for OOV in the range of 0-30% the dom2vec classifier was comparable to the best performing model, SeqVec. As OOV increased further, the performance of our model dropped, though it remained competitive with SeqVec. dom2vec greatly outperformed the 1-hot representation, validating the NLP assumption that unsupervised embeddings improve classification on unseen words (in this context, protein domains) compared to 1-hot word (domain) vectors.
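The OOV rate driving this experiment is simply the fraction of test-set domain tokens absent from the training vocabulary; a minimal sketch (function name ours):

```python
def oov_rate(train_domains, test_architectures):
    """Fraction of domain tokens across the test architectures that
    never occur in the training vocabulary (out-of-vocabulary, OOV)."""
    vocab = set(train_domains)
    test_tokens = [d for arch in test_architectures for d in arch]
    oov = sum(1 for d in test_tokens if d not in vocab)
    return oov / len(test_tokens)
```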
Generalization experiment For both Toxin and NEW, dom2vec significantly outperformed SeqVec, ProtVec and the domain 1-hot vectors (Benjamini-Hochberg multiple-test corrected p-value < 0.05). In the Toxin data set, we observed that ProtVec learned the least variant model, but with the trade-off of obtaining the lowest performance (mc-AuROC). For the NEW data set, the domain 1-hot representation was the second best, outperforming SeqVec and ProtVec; this validates the finding that domain composition is the most important feature for enzymatic function prediction, as concluded by [32].
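The Benjamini-Hochberg correction behind this significance claim can be sketched as follows (a standard textbook implementation, not the paper's code):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean list marking which hypotheses are rejected
    under Benjamini-Hochberg FDR control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # reject all hypotheses whose sorted rank is at most k_max
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```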

Discussion
Comparison with Pfam domain embeddings Of all proposed protein embedding works, only [22] developed intrinsic quantitative benchmarks. They applied word2vec to Pfam domain annotations of eukaryotic proteins only. They used the following three experiments to benchmark their Pfam domain embeddings. First, they benchmarked the performance of a nearest neighbor classifier, C d nearest , on predicting the three main GO ontologies of a Pfam domain from its embedding vector. Second, they assessed the Matthews correlation coefficient [40] between Pfam embeddings and first-order Markov encodings. They also investigated whether vector arithmetic holds by comparing, each time, two pairs of Pfam domains with mutually exclusive binary GO assignments; for example, one pair consisted of two domains annotated as intra-cellular (GO:0005622) and the contrasting pair of two domains annotated as extra-cellular (GO:0005615).
Building and evaluating Pfam domain embeddings is the approach closest to dom2vec. Our approach differs in four main points. First, we trained embeddings on the domain annotations of all proteins available in InterPro. We included all available InterPro annotations, consisting of superfamilies, families, single domains and functional sites, as "words" input to the word2vec method. Therefore, we used a broader set of annotations and the whole spectrum of organisms. Moreover, word2vec was developed for sentences in corpora of natural languages, which have a moderate number of words per sentence and a large number of sentences. To cope with the sentence-length assumption, we resolved overlapping and redundant annotations so as to increase the number of InterPro annotations per protein, making our input more suitable for the word2vec method. To follow the second assumption, on the number of sentences, we input all proteins with InterPro annotations. This increased the number of sentences used as word2vec input about 14-fold, from 9 030 650 (Pfam domains) to 128 660 257 (dom2vec). Second, we benchmarked both word2vec models (CBOW and SKIP) and their parameters in each experiment of our intrinsic evaluation step, and then used our established intrinsic evaluation to choose the best embedding space. Third, we established unique intrinsic evaluation benchmarks for the characteristic biological features of domains, namely secondary structure and enzymatic function. We also formed an intrinsic evaluation for another important biological feature, the hierarchy of domains. Last, dom2vec was also evaluated extrinsically on three downstream prediction tasks. We have shown that dom2vec embeddings can surpass established sequence embeddings for toxin and enzymatic function prediction and are comparable to these embeddings for cellular location prediction.
This downstream evaluation revealed that dom2vec not only captures the biological features of domains adequately, but can also improve performance on protein prediction tasks.
Was domain architecture informative enough for word2vec? Our InterPro annotation histograms showed skewed empirical distributions of the number of annotations per protein (Figure 3). For non-overlapping annotations, the mode is one annotation, whereas for non-redundant annotations only 10% of proteins contained a single annotation. The latter form of annotations had a mode of three annotations; consequently, word2vec was presented with an input corpus with a sufficient number of words per sentence, even for the shortest window of two (w=2).
We argue that, despite the low modes of these distributions, the word2vec inputs were informative. To support this, we refer to the information gain distribution of the grammar created by Pfam domain architectures, found by [7]. Even though that grammar distribution was computed for bi-grams (a low window for n-grams compared to a natural language), its difference from the grammar distribution of randomized architectures was still significant. In our intrinsic evaluation, we validated that such a corpus, with a low mode of the number of words (domains), could still allow word2vec to produce embeddings that capture the biological features known for domains.

The analogy between a natural language and protein domain architectures
We have shown that dom2vec adequately captured the SCOPe structural information, EC enzymatic function and GO molecular function of each domain with available metadata. However, dom2vec produced moderate results in the domain hierarchy evaluation task. After investigating the properties of the domain families for which dom2vec produced these moderate results, we concluded that dom2vec fails to capture the domain hierarchy mostly for domain families of low cardinality. We argue that classifiers more complex than C d nearest could improve the hierarchy performance, but this was beyond the scope of our evaluation.
Importantly, we discovered that dom2vec embeddings captured the most distinctive biological characteristics of an individual domain: secondary structure, enzymatic function and molecular function. That is, word2vec produced domain embeddings that clustered sufficiently well by their structure and function class. Therefore, our finding supports the accepted modular evolution of proteins [1] in a data-driven way. It also makes possible a striking analogy between words in natural language that cluster together in the word2vec space [14] and domains in domain architectures that cluster together in the dom2vec space. Therefore, we parallel the semantic and lexical similarity of words with the functional and structural resemblance of domains. This analogy may augment research on understanding the rules underlying the domain architecture grammar [7]. We are confident that this interpretability aspect of dom2vec will allow researchers to apply it reliably to predict biological features of novel domain architectures and of proteins with identifiable InterPro annotation(s).

Boosting downstream prediction performance
In the downstream task evaluation, dom2vec significantly outperformed domain 1-hot vectors and state-of-the-art sequence-based embeddings for the Toxin and NEW data sets. For TargetP, dom2vec was comparable to the best performing sequence-based embedding, SeqVec, for OOV up to 30%.
Consequently, if a protein prediction task allows the inclusion of input features beyond the mere amino-acid sequence, we recommend extracting InterPro annotations for the proteins at hand and using dom2vec in combination with sequence embeddings to boost predictive performance. For example, as researchers already use domain information successfully for protein function prediction, we foresee that combining sequence embeddings with dom2vec will boost performance. If the prediction task does not allow the use of domain information, for example CASP (Critical Assessment of protein Structure Prediction), we suggest using dom2vec as a hard-to-beat baseline for sequence embedding classifiers.
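One simple way to combine the two representations at the protein level is feature concatenation; the sketch below is purely illustrative, as the paper does not prescribe a specific combination scheme:

```python
def combine_features(dom2vec_vec, seq_vec):
    """Concatenate a protein-level dom2vec vector with a sequence
    embedding vector into one joint feature vector (an illustrative
    combination scheme, not the paper's pipeline)."""
    return list(dom2vec_vec) + list(seq_vec)
```

The joint vector can then be fed to any downstream classifier in place of either representation alone.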

Conclusions
We presented dom2vec, an approach for learning protein domain embeddings. We processed InterPro annotations to create domain architectures and then applied word2vec to these architectures. We introduced a novel intrinsic evaluation based on metadata related to the most critical biological characteristics of an individual domain. In this evaluation, we found that dom2vec vectors cluster sufficiently well by secondary structure, enzymatic function and molecular function.
We then used dom2vec embeddings as input to simple neural networks for three different protein prediction tasks, to compare their performance with state-of-the-art sequence embeddings. We found that dom2vec models surpassed ProtVec and SeqVec models in two tasks, toxin and enzymatic function prediction, and were comparable to sequence embedding models in the cellular localization task. We believe that dom2vec can be used reliably by the research community to boost prediction performance for individual domains and whole proteins.