Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures

Predicting biological properties of unseen proteins is improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid; hence, the quality of each unique learned embedding vector cannot be measured separately. Therefore, current sequence embeddings cannot be intrinsically evaluated, in a quantitative manner, on the degree of biological information they capture. We address this drawback with our approach, dom2vec, by learning vector representations for protein domains rather than for individual amino acids, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain: its structure, enzymatic activity, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec on protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.


Introduction
A primary way in which proteins evolve is through rearrangement of their functional/structural units, known as protein domains [1,2]. Domains are independent folding and functional modules, and thus they exhibit conserved sequence segments. Prediction algorithms have exploited this information, using the domain composition of a protein as input features for various tasks. For example, [3] classified the cellular location, and [4,5] predicted the associated Gene Ontology (GO) terms. There exist two ways to represent domains: either by their linear order in a protein, called domain architectures [6], or by a graph whose nodes are domains and whose edges connect domains that co-exist in a protein [1,2].
Moreover, [7] investigated whether domain architectures have a grammar, as a natural spoken language does. They compared the bi-gram entropy of domain architectures of Pfam domains [8] to the entropy of the English language, showing that although it was lower than that of English, it was significantly different from that of a language produced by shuffling the domains. Prior to this result, methods had already exploited the domain architecture representation in various applications, such as fast homology search [9] and retrieval of similar proteins [10].
Word embeddings are unsupervised learning methods which take large corpora as input and output a dense vector representation of the words contained in the sentences of these documents, based on the distributional semantics hypothesis: the meaning of a word can be understood by its context. Thus, a word vector represents local linguistic features, such as lexical or semantic information, of the respective word. Several methods to train word embeddings have been established, for example, [11][12][13]. These representations have been shown to hold several properties, such as analogy and grouping of semantically similar words [14,15]. Importantly, these properties are learned without the need for a labeled data set. Word embeddings are currently the mainstream input for neural networks in the Natural Language Processing (NLP) field: firstly, they reduce the feature space compared to the 1-hot representation, and secondly, they provide word features that encapsulate relations between words based on linguistic features. The use of word embeddings has improved performance on most tasks, such as sentiment analysis and Named Entity Recognition (NER) [16].
Various methods used to create embeddings for proteins have been proposed [17][18][19][20][21][22][23]. ProtVec fragmented the protein sequence in 3-mers for all possible starting shifts, then learned embeddings for each 3-mer and represented the respective protein as the average of its constituting 3-mer vectors [17]. SeqVec utilized and extended the Embeddings from Language Models (ELMo) [24] to learn a dense representation per amino acid residue, resulting in matrix representations of proteins, created by concatenating their learned residue vectors [21].
Focusing on their word segmentation, we note that learning embeddings for each amino acid or 3-mer may not always reflect evolutionary signals. That is, a pair of proteins with low sequence similarity may still belong to the same protein super-family and preserve similar function [25].
The previous embedding approaches evaluated the learned representations intrinsically in a qualitative manner. They averaged the amino acid embeddings of a whole protein to compute an aggregated vector. Then, known biological characteristics of proteins, such as biophysical, chemical, structural, enzymatic, and taxonomic properties, were used as distinct colors in a reduced 2-D embedding space. In such visualizations, previous embedding approaches reported the appearance of distinct clusters of proteins, each consisting of proteins with similar properties. For downstream evaluation, they measured the improvement of performance in downstream tasks.
Concerning the qualitative intrinsic evaluation, two caveats exist. First, researchers averaged the protein's amino acid vectors; consequently, this qualitative evaluation is not related in a straightforward way to each learned embedding vector trained per amino acid. In addition, this averaging operation may not reveal the function of the most important sites of a protein, meaning the comparative result holds a low degree of biological significance. Second, we argue that the presented qualitative evaluations lack the ability to assess different learned embeddings in a systematic manner, because there is no systematic way to quantitatively compare 2-D plots of reduced embedding spaces, each produced by a protein-embedding method under investigation.
Indeed, for word embeddings, there has been an increase in methods to evaluate word representations intrinsically and in a quantitative manner, such as [26,27]. Having such evaluation metrics allows us to validate the knowledge acquired per word vector and to use the best-performing space for downstream tasks. However, intrinsic evaluation of current amino acid embedding representations is prevented by incomplete biological metadata at the amino acid level for all deposited proteins in the UniProt KnowledgeBase (UniProtKB) [28].
To address this limitation in quantitative intrinsic evaluations of protein sequence embeddings, we present our approach with five major contributions:

1.
Our dom2vec approach is developed, in which words are InterPro annotations and sentences are the domain architectures. Then, we use the word2vec method to learn the embedding vector representation for each InterPro annotation.

2.
A quantitative intrinsic evaluation method is established based on the most significant biological information for a domain: its structure and function. First, we evaluated the learned embedding space for domain hierarchy, comparing known parent-child relations of domains to the cosine similarity between child vectors and the parent domain vector. Then, we investigated the performance of a nearest neighbor classifier, C^d_nearest, in predicting the secondary structure class provided by SCOPe [29] and the Enzyme Commission (EC) primary class. Finally, we equally examined the performance of the C^d_nearest classifier in predicting the GO molecular function class for three example model organisms and one human pathogen.

3.
Strikingly, we observed that C^d_nearest reaches adequate accuracy, compared to C^d_nearest on randomized domain vectors, for secondary structure, enzymatic function, and GO molecular function. Thus, we hypothesized an analogy between the clustering of word embeddings by local linguistic features and the clustering of protein domain embeddings by domain structure and function.

4.
To evaluate our embeddings extrinsically, we inputted the learned domain embeddings to simple neural networks and compared their performance with state-of-the-art protein sequence embeddings in three full-protein tasks. We surpassed both SeqVec and ProtVec in the toxin presence and enzymatic primary function prediction tasks, and we reported comparable results in the cellular location prediction task.

5.
The pre-trained protein domain embeddings are available online at https://doi.org/10.25835/0039431, to be used by the research community.
The remainder of the paper is organized as follows: related work on protein embeddings is reviewed in Section 2. The methodology used to train and evaluate dom2vec embeddings is described in Section 3. The intrinsic and extrinsic evaluation results are presented in Section 4. In Section 5, we conclude.

Background
Current studies on protein embeddings are evaluated intrinsically and extrinsically. In extrinsic evaluation, prediction measures, such as performance on a supervised learning task, are most commonly used to evaluate the quality of embeddings. For example, the ProtVec work [17] evaluated its proposed embeddings extrinsically by measuring their performance in predicting protein family and disorder. SeqVec [21] assessed its embeddings extrinsically by measuring performance on protein-level tasks, namely prediction of sub-cellular localization and water solubility, and on residue-level tasks, namely prediction of the functional effect of single amino acid mutations. However, extrinsic evaluation methods are based on a downstream prediction task and thus do not measure the biological information captured by each learned subsequence vector separately.
Previous studies evaluated the quality of their sequence embeddings intrinsically by averaging the amino acid embedding vectors per protein and then drawing t-SNE visualizations [30], using distinct biological labels of a protein, such as taxonomy, SCOPe class, and EC primary class, as colors. However, this qualitative assessment hinders the selection of the best-performing embeddings, irrespective of the downstream task, because there is no systematic method to rank 2-D visualizations.
Nevertheless, in NLP, the quality of a learned word embedding space is often evaluated intrinsically in a quantitative manner by considering relationships among words, such as analogies. Compared to qualitative evaluation, quantitative intrinsic evaluation enables assessment of the degree of biological information captured by the embeddings. This advantage allows us to choose the best set of parameters to create the embeddings that contain the highest degree of meaningful information without choosing a specific downstream task.
Of all the discussed protein embedding studies, only [23] developed quantitative intrinsic evaluation methods. To benchmark their Pfam domain embeddings, they used the following three experiments. First, they benchmarked the performance of a nearest neighbor classifier predicting the three main GO ontologies of a Pfam domain from its embedding vector. Second, they assessed the Matthews correlation coefficient [31] between Pfam embeddings and first-order Markov encodings. Third, they assessed vector arithmetic to compare binary GO concept assignments; for example, one pair was intracellular (GO:0005622) vs. extracellular (GO:0005615).
Our approach differs from [23] in four main points. First, we trained embeddings for all domain annotations of all proteins available in InterPro. That is, we included all available InterPro annotations, consisting of super-families, families, single domains, and functional sites, as "words" input to the word2vec method. Therefore, we used a broader set of annotations covering the whole spectrum of organisms. Besides, word2vec was developed for sentences of natural languages, which have a moderate number of words. To cope with this assumption on sentence length, we resolved overlapping and redundant annotations, making our input more suitable for the word2vec method. Second, we benchmarked the two word2vec models (CBOW and SKIP) and their parameters on each experiment of our quantitative intrinsic evaluation step and, consequently, used our assessment to choose the best embedding space. Third, we established three unique intrinsic evaluation benchmarks: domain hierarchy, SCOPe secondary structure, and EC primary class. Lastly, our approach was also extrinsically evaluated on three downstream tasks, to show that dom2vec embeddings can surpass or be comparable to state-of-the-art protein sequence embeddings.

Materials and Methods
In the following, the methodology for each part of our approach is explained. A conceptual summary is presented in Figure 1.

Building Domain Architectures
The InterPro database contains functional annotations for super-families, families, and single domains, as well as functional protein sites. Hereafter, we refer to all such functional annotations as InterPro annotations. Furthermore, we denote by domain architecture the ordered arrangement of domains in a protein. We consider two distinct strategies to represent a protein based on its domain architecture, consisting of either non-overlapping or non-redundant annotations. For both annotation types, we insert each annotation, based on its start and end positions, into an interval tree T_hit. For each entry of T_hit, we save the annotation's InterPro identifier, significance score, and length. Based on the annotation type, we apply one of the following two strategies to create the linear domain architectures:
Non-overlapping annotations. For each overlapping region in a protein, we keep only the longest of all overlapping annotations. Annotations of non-overlapping regions are included unchanged.
Non-redundant annotations. For each overlapping region in a protein, we keep all annotations with a distinct InterPro identifier. We break ties among annotations with the same InterPro identifier by keeping the longest one. Annotations of non-overlapping regions are again kept.
For both annotation types, we sort the retained annotations by their starting position. Finally, following the approach of [5], we also added the "GAP" domain to annotate sub-sequences of more than 30 amino acids that do not match any InterPro annotation entry.
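The two resolution strategies, together with the "GAP" insertion, can be sketched as follows. This is a simplified illustration: the actual pipeline stores the hits in an interval tree (T_hit), whereas here a plain greedy scan is used, and the identifiers and coordinates are only examples.

```python
def resolve_hits(hits, mode="non_overlapping"):
    """Resolve InterPro hits for one protein.

    hits: list of (start, end, interpro_id) tuples, positions inclusive.
    mode: "non_overlapping" keeps the longest hit per overlapping region;
          "non_redundant" keeps the longest hit per distinct InterPro id.
    Returns the kept hits sorted by starting position.
    """
    if mode == "non_redundant":
        # Tie-break equal identifiers by keeping the longest hit.
        best = {}
        for start, end, ipr in hits:
            if ipr not in best or end - start > best[ipr][1] - best[ipr][0]:
                best[ipr] = (start, end, ipr)
        kept = list(best.values())
    else:
        # Greedy scan, longest hits first; drop hits overlapping a kept one.
        kept = []
        for start, end, ipr in sorted(hits, key=lambda h: h[0] - h[1]):
            if all(end < s or start > e for s, e, _ in kept):
                kept.append((start, end, ipr))
    return sorted(kept)


def architecture(hits, mode="non_overlapping", min_gap=30):
    """Linear domain architecture with "GAP" tokens for unannotated
    stretches longer than min_gap amino acids between kept hits."""
    tokens, prev_end = [], None
    for start, end, ipr in resolve_hits(hits, mode):
        if prev_end is not None and start - prev_end - 1 > min_gap:
            tokens.append("GAP")
        tokens.append(ipr)
        prev_end = max(prev_end or 0, end)
    return tokens
```

For example, two nested hits with distinct identifiers collapse to the single longest domain under the non-overlapping strategy, while both survive under the non-redundant one.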
An example of the resulting domain architectures for the Diphthine synthase protein is shown in Figure 2. All domains overlap, with the longest one colored in blue; hence, the non-overlapping architecture consists of the single longest domain (IPR035966). All domains have a unique InterPro identifier; therefore, the non-redundant architecture includes all presented domains, sorted by their starting position and colored in green.
Applying the previous steps for all annotated proteins produces the domain architectures, constituting the input corpus to the following embedding module.

Training Domain Embeddings
Given a protein, we assumed that words were its resolved InterPro annotations and sentences were the protein's domain architectures. Under this assumption, we learned task-independent embeddings for each InterPro annotation using two variants of word2vec: the continuous bag-of-words model and the skip-gram model, hereafter denoted CBOW and SKIP, respectively. See [12] for technical details on the difference between these approaches. Through this training, each InterPro annotation is associated with a task-independent embedding vector.

Quantitative Intrinsic Evaluation
In the following, we use the metadata for the most characteristic properties of domains, in order to evaluate the learned embedding space for various hyper-parameters of word2vec. We propose four intrinsic evaluation approaches for domain embeddings: domain hierarchy based on the family/subfamily relation, SCOPe secondary structure class, EC primary class, and GO molecular function annotation.
We refer to the embedding space learned by word2vec for a particular set of hyper-parameters as V_emb. The k nearest neighbors of a domain d are found using the Euclidean distance, and the resulting nearest-neighbor classifier is denoted C^d_nearest. To inspect the relative performance of V_emb on each of the following evaluations, we randomized all domain vectors and re-ran each evaluation task. That is, we assigned to each domain a newly created random vector, for each unique dimensionality of the embedding space, irrespective of all other embedding hyper-parameters.

Domain hierarchy
InterPro defines a strict family-subfamily relationship among domains. This relationship is based on sequence similarity of the domain signatures. We refer to the children of domain p as S_p. We use these relationships to evaluate an embedding space, posing the following research question, RQ_hierarchy: did vectors of hierarchically close domains form clusters in V_emb? Evaluation We predicted the |S_p| domains closest to the parent, by cosine similarity of their vectors to the parent vector, and we denote this predicted set as Ŝ_p. For all learned embedding spaces, we measured their recall performance, Recall_hier, defined as follows:

Recall_hier = |S_p ∩ Ŝ_p| / |S_p|,

averaged over all parent domains p.
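The per-parent recall can be sketched as follows (an illustrative re-implementation, assuming the embeddings are held in a plain dictionary mapping domain ids to NumPy vectors; the identifiers in the usage example are invented):

```python
import numpy as np

def recall_hier(emb, parent, children):
    """Recall of a parent's true children S_p among the |S_p| domains
    closest to the parent by cosine similarity (the set Ŝ_p).

    emb: dict of domain id -> 1-D numpy vector.
    parent: parent domain id; children: set of true child ids.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    p = emb[parent]
    others = [d for d in emb if d != parent]
    ranked = sorted(others, key=lambda d: cos(emb[d], p), reverse=True)
    predicted = set(ranked[:len(children)])           # Ŝ_p
    return len(predicted & children) / len(children)  # |S_p ∩ Ŝ_p| / |S_p|
```

Averaging this value over all parents in the InterPro parent-child tree yields the reported Recall_hier.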

SCOPe Secondary Structure Class
We extracted the secondary structure of InterPro domains from the SCOPe database and formed the following research question, RQ_SCOPe: did vectors of domains with the same secondary structure class form clusters in V_emb?
Evaluation We evaluated V_emb using the C^d_nearest classifier: we applied stratified 5-fold cross-validation and measured the performance of the k-nearest neighbor classifier in predicting the structure class of each domain. The intrinsic evaluation performance metric is the average accuracy across all folds, Accuracy_SCOPe.
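The cross-validated nearest-neighbor benchmark, which is reused for the EC and GO evaluations below, can be sketched with scikit-learn (a sketch, assuming scikit-learn is available; the data in the test is synthetic):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_accuracy(X, y, k=2, folds=5, seed=0):
    """Average accuracy of a k-NN classifier under stratified
    cross-validation, as in the Accuracy_SCOPe/EC/GO benchmarks.

    X: (n_domains, dim) matrix of domain vectors; y: class labels.
    """
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k)
    return float(cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean())
```

Running the same function on randomly drawn vectors of the same dimensionality gives the randomized baseline described above.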

EC Primary Class
The enzymatic activity of each domain is given by its primary EC class [32], and we pose the following research question, RQ_EC: did vectors of domains with the same enzymatic primary class form clusters in V_emb?
Evaluation We again evaluate V_emb using the k-nearest neighbor classifier in a stratified 5-fold cross-validation setting. The average accuracy across all folds, Accuracy_EC, is again used to quantify the intrinsic quality of the embedding space.

GO Molecular Function
For our last intrinsic evaluation, we aimed to assess V_emb using the molecular function GO annotations. We extracted all molecular function GO annotations associated with each domain. To account for differences in specificity among GO annotations, we always used the depth-1 ancestor of each annotation, that is, a child of the root molecular function term, GO:0003674.
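Mapping a term to its depth-1 ancestors can be sketched as follows. This is a toy illustration: in practice the parent links are parsed from the GO release, while the DAG in the test below is invented.

```python
def depth1_ancestors(term, parents, root="GO:0003674"):
    """Return the depth-1 ancestor(s) of a GO term, i.e. the children
    of the root term that the term descends from.

    parents: dict mapping each term to its list of direct parent terms.
    """
    if root in parents.get(term, []):
        return {term}  # the term itself is already a child of the root
    found = set()
    for p in parents.get(term, []):
        if p != root:
            found |= depth1_ancestors(p, parents, root)
    return found
```

Because GO is a DAG, a specific term may map to more than one depth-1 ancestor; the evaluation then uses the resulting class label(s) per domain.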
Since model organisms have the most annotated proteins, we created GO molecular function data sets for an example prokaryote (Escherichia coli, denoted E. coli), a simple eukaryote (Saccharomyces cerevisiae, denoted S. cerevisiae), and a complex eukaryote (Homo sapiens, denoted Human). To also assess our embeddings on a less richly annotated organism, we included a molecular function data set for a human pathogen (Plasmodium falciparum, denoted Malaria). Finally, we pose the following research question, RQ_GO: did vectors of domains with the same GO molecular function form clusters in V_emb?
Evaluation Similarly, the k-nearest neighbor classifier is used here in a stratified 5-fold cross-validation setting. The average accuracy across all folds, Accuracy_GO, is again used to quantify performance.

Qualitative Evaluation
As a preliminary evaluation strategy, we used the qualitative evaluation approach adopted in existing work. Following the qualitative approach of ProtVec and SeqVec, we visualized the embedding space for selected domain superfamilies, to answer the following research question, RQ_qualitative: did the vectors of each domain superfamily form a cluster in V_emb? Evaluation First, we collected the vectors of all domains belonging to randomly chosen domain superfamilies. Then, we performed principal component analysis (PCA) [33] to reduce the space to two dimensions, and we observed the formed clusters.
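The 2-D reduction step can be sketched with scikit-learn's PCA (a sketch, assuming scikit-learn is available; the scatter-plotting of the projected points, colored by superfamily, is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(vectors):
    """Reduce domain embedding vectors to two dimensions with PCA,
    for qualitative inspection of superfamily clusters."""
    return PCA(n_components=2).fit_transform(np.asarray(vectors))
```

Each row of the returned array is the 2-D coordinate of one domain vector, ready to be drawn with one color per superfamily.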

Extrinsic Evaluation
In addition, we assessed the learned V_emb by examining performance in downstream tasks. For the three supervised tasks, TargetP, Toxin, and NEW, the domain representations were used as input to simple neural networks. Then, our models' performance was compared to that of the state-of-the-art protein sequence embeddings, ProtVec and SeqVec.

TargetP
This data set concerns predicting the cellular location of a given protein. We downloaded the TargetP data set provided by [34] and used its non-plant subset. This subset consists of 2738 proteins, each accompanied by its UniProt ID, sequence, and cellular location label, which can be nuclear, cytosolic, secretory pathway (signal), or mitochondrial. Finally, we removed all instances with a duplicate set of domains, resulting in a total of 2418 proteins. This is a multi-class task, and its class distribution is summarized in Appendix E.
Evaluation For TargetP, we used the multi-class area under the ROC curve (mc-AuROC) as the performance metric.

Toxin
The research work [35] introduced a data set associating protein sequences with toxic or other physiological content. We used the hard setting, which provides a UniProt ID, sequence, and a toxin/non-toxin label for 15,496 proteins. Finally, we kept only the proteins with a unique domain composition, resulting in 2270 protein instances in total. This is a binary task, and the class distribution is shown in Appendix E.
Evaluation As the Toxin data set is a binary task, we used AuROC as a performance metric.

NEW
The NEW data set [36] contains the data for predicting the enzymatic function of proteins. For each of its 22,618 proteins, the data set provides the sequence and the EC number. The primary enzyme class, the first digit of the EC number, is the label in this prediction task, making it a multi-class task. Finally, we removed all instances with duplicate domain composition, resulting in a total of 14,434 protein instances. There are six possible classes, and the class distribution is shown in Appendix E.
Evaluation The NEW data set is a multi-class task; thus, we used mc-AuROC as a performance metric.

Data Partitioning
We divided each data set into 70/30% train and test splits. To perform model selection, we created inner three-fold cross-validation sets on the train split.
Out-of-vocabulary experiment We observed that the performance of classifiers based on protein domains depended strongly on the out-of-vocabulary (OOV) domains, as first discussed in [37]. OOV domains are the domains contained in the test set but not in the train set. For TargetP, Toxin, and NEW, we observed that approximately 60%, 20%, and 20% of test proteins, respectively, contained at least one OOV domain.
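Computing the fraction of test proteins with at least one OOV domain is a simple set operation (a sketch, assuming each architecture is a list of InterPro tokens; the tokens in the test are invented):

```python
def oov_fraction(train_archs, test_archs):
    """Fraction of test proteins containing at least one domain
    never seen in the training split (out-of-vocabulary)."""
    vocab = {d for arch in train_archs for d in arch}
    n_oov = sum(1 for arch in test_archs
                if any(d not in vocab for d in arch))
    return n_oov / len(test_archs)
```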
As TargetP had the highest OOV rate, we experimented to compensate for this high degree of OOV. We split the test set into smaller sets of increasing OOV degree, namely 0%, 10%, 30%, 50%, 70%, and 100%. Then, we trained models on the whole train set and benchmarked their performance on each of these test subsets.
Generalization experiment For the Toxin and NEW data sets, which experience low OOV, we sought to investigate the generalization ability of the produced classifiers. We increased the number of training examples that the model was allowed to learn from, and we always benchmarked on the entire test set. To do so, we created training splits of size 10%, 20%, and 50% of the whole train set. To perform significance testing, we trained on 10 random subsamples for each training split percentage and then tested on the held-out test set. We used the paired-sample t-test with Benjamini-Hochberg correction for multiple testing to compare the performance of pairs of classifiers on the test set.
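The significance-testing procedure can be sketched with SciPy (a sketch: the paired t-test uses scipy.stats.ttest_rel, the Benjamini-Hochberg step-up procedure is implemented directly, and the score arrays in the test are invented):

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_tests_bh(scores_a, scores_b, alpha=0.05):
    """Paired t-tests over matched score samples for several classifier
    pairs (e.g. the 10 subsample runs per training-split percentage),
    with Benjamini-Hochberg correction for multiple testing.

    Returns the raw p-values and a boolean rejection mask.
    """
    pvals = np.array([ttest_rel(a, b).pvalue
                      for a, b in zip(scores_a, scores_b)])
    m = len(pvals)
    order = np.argsort(pvals)
    # BH step-up: find the largest i with p_(i) <= (i/m) * alpha
    # and reject the hypotheses with the i smallest p-values.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])
        reject[order[:k + 1]] = True
    return pvals, reject
```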

Simple Neural Models for Prediction
We consider a set of simple, well-established neural models to combine the InterPro annotation embeddings of each protein for the downstream, that is, extrinsic evaluation tasks. In particular, we use FastText [38], convolutional neural networks (CNNs) [39], and recurrent neural networks (RNNs) with long short-term memory (LSTM) cells [40], including bi-directional LSTMs.

Building Domain Architectures
We used the domain hits for UniProt proteins from InterPro version 75, containing 128,660,257 proteins with an InterPro signature, making up 80.9% of the total UniProtKB proteome (version 2019_06). For all these proteins, we extracted the non-overlapping and non-redundant domain architectures, which we process in the next section. The number of unique non-overlapping annotations was 35,183 (+1 for the added "GAP" domain), and the number of unique non-redundant annotations was 36,872 (+1 for "GAP"). Comparing this to the total number of domains in InterPro version 75, which was 36,872, we observed that non-overlapping InterPro annotations captured 95.42%, and non-redundant annotations captured 100%, of the InterPro annotation entries. To enable visual comparison of the created types of domain architectures versus the downloaded InterPro annotations, in Figure 2 we illustrate the non-overlapping and non-redundant domain architectures of the Diphthine synthase protein. This same protein was picked as an example illustration for annotations in the latest InterPro work [41].

Training Domain Embeddings
Before applying the word2vec method, we examined the histograms of the number of non-overlapping and non-redundant InterPro annotations per protein in Figure 3. We observed that these distributions were long-tailed, with modes equal to 1 and 3, respectively. Then, we used both the CBOW and SKIP algorithms to learn domain embeddings, with the following parameter sets. Based on the histograms, we selected the context window parameter to be 2 or 5, w ∈ {2, 5}. For the number of dimensions, we used common values from the NLP literature, dim ∈ {50, 100, 200}. We trained the embeddings from 5 to 50 epochs with a step size of 5 epochs, ep ∈ {5, 10, 15, . . . , 50}. Finally, all other parameters were set to their default values; for example, the negative sampling parameter was ng = 5.

Quantitative Intrinsic Evaluation
In the following, we evaluate each instance of the learned embedding space V_emb, for both non-overlapping and non-redundant representations of domain architectures. An instance of V_emb is the embedding space learned for one combination in the product annotation_type × w × dim × ep. Consequently, the total number of embedding space instances is |annotation_type| · |w| · |dim| · |ep| = 2 · 2 · 3 · 10 = 120. Let V^i_emb denote such an embedding space instance. In the following subsections, we evaluate each V^i_emb for domain hierarchy, secondary structure, enzymatic primary class, and GO molecular function. All reported performances are shown for the best-performing epoch value (ep), and results are shown in Table 1. For the first research question, we loaded the parent-child tree T_hier, provided by InterPro, consisting of 2430 parent domains. Then, for each V^i_emb, we compared the actual and predicted children of each parent, and we averaged the recall over all parents. For ease of presentation, we show only the results for non-redundant InterPro annotations in Table 1a, and we provide the complete results in Appendix A.
From Tables A1 and 1a (Appendix A), we observed that SKIP performed better overall and that embeddings learned from non-redundant InterPro annotations always had better average recall than the non-overlapping ones. The best-performing V^i_emb achieved an average Recall_hier of 0.538. We compared this moderate performance with that of the randomized spaces, which was equal to 0. We concluded that our embedding spaces greatly outperformed each randomized space on the domain hierarchy relation; therefore, the majority of domains of the same hierarchy were placed in close proximity in the embedding space.

RQ_SCOPe: Did Vectors of Domains with the Same Secondary Structure Class Form Clusters in V_emb?
We extracted the SCOPe class for each InterPro domain. This resulted in 25,196 domains with an unknown secondary structure class, 9411 with a single secondary structure class, and 2265 domains with more than one assigned class (multi-label). For clarity, we removed all multi-label and unknown instances, resulting in 9411 single-labeled instances. The class distribution of the resulting data set is shown in Appendix B.
We measured the performance of the C^d_nearest classifier in each V^i_emb to examine the homogeneity of the space with respect to the SCOPe class. We split the 9411 domains into stratified 5-fold cross-validation sets. To test the change in prediction accuracy for an increasing number of neighbors, we used different numbers of neighbors, namely k ∈ {2, 5, 20, 40}. We summarize the results for the best-performing C^d_nearest, which was k = 2 for non-redundant InterPro annotations, in Table 1b, and we show the respective table for non-overlapping InterPro annotations in Appendix B. We compared these accuracy measurements to the respective ones of the random spaces, and we found that the lowest accuracy values, achieved for (non-overlapping, CBOW, w = 5, dim = 200, ep = 15) as shown in Appendix Table A2, are twice as high as the accuracy values of the random spaces for all possible dimensions. Consequently, we concluded that domain embeddings of the same secondary structure class formed distinct clusters in the learned embedding space.

RQ_EC: Did Vectors of Domains with the Same Enzymatic Primary Class Form Clusters in V_emb?
We processed the EC primary class, resulting in 29,354 domains with unknown EC, 7248 domains with only one EC, and 721 with more than one EC. As before, we removed all multi-label and unknown instances, leaving 7428 domains with known EC. We augmented each domain instance with its vector representation in each V^i_emb, and we then used C^d_nearest to predict the EC label. See Appendix C for the class distribution of the EC task.
We report the average Accuracy_EC obtained in embedding spaces learned using non-redundant InterPro annotations in Table 1c, and we show the respective table for non-overlapping annotations in Appendix C. We compared these accuracy measurements to the respective ones of the random spaces. We found that the minimum average Accuracy_EC value was 60.51, achieved using (non-overlapping, CBOW, w = 5, dim = 200, ep = 15), as presented in Appendix Table A3. That value was approximately twice as large as the accuracy values of the random spaces for all possible dimensions; the maximum average Accuracy_EC for a random space, with dim = 100, was 32.64. Hence, we accepted that domain embeddings of the same EC primary class formed distinct clusters in a learned embedding space.

RQ_GO: Did Vectors of Domains with the Same GO Molecular Function Form Clusters in V_emb?
We parsed the GO annotation file of InterPro to extract the first-level GO molecular function of the domains of the four organisms. We followed the same methodology to examine the homogeneity of a V_emb with respect to GO molecular function annotations. For each V^i_emb, we augmented each domain with its vector and its GO label, and we classified each domain using C^d_nearest. As before, we used stratified 5-fold cross-validation for evaluation. In our experiments, we varied the number of neighbors, k ∈ {2, 5, 20, 40}, to test its influence on performance.
Due to space limitations, we summarized the performances showing only the best average accuracy over the number of neighbors. For ease of presentation, we omitted the result tables for the first three organisms and show only that for Human, though we discuss the results for all organisms. See Appendix D for full results.
For Malaria, the best average accuracy was 76.86 (non-redundant, SKIP, w = 5, dim = 100, ep = 40) and the minimum was 56.94 (non-overlapping, CBOW, w = 5, dim = 100, ep = 10), presented in Table A4b,c respectively. We compared this moderate minimum accuracy to the maximum accuracy obtained by the randomized embedding space, which was 47.57 for dim = 200. Therefore, we concluded that even the worst dom2vec embeddings outperformed the random baseline by roughly 10 percentage points.
For Yeast, the best accuracy was 75.10 (non-redundant, SKIP, w = 5, dim = 50, ep = 50), and the minimum was 59.82 (non-overlapping, CBOW, w = 5, dim = 50, ep = 50), presented in Table A6b,c respectively. We contrasted this with the maximum accuracy obtained in a random space, 53.73 (achieved for dim = 100), to report that dom2vec vectors in V^yeast_emb captured GO molecular function classes to a much higher degree than randomized vectors.
For Human, the best average performances for non-redundant InterPro annotations are shown in Table 1d. The best average accuracy was 75.96, scored by 2-NN for V^human_emb (non-redundant, SKIP, w = 5, dim = 50, ep = 40). The minimum accuracy was 57.7, obtained by (non-overlapping, CBOW, w = 2, dim = 50, ep = 10), shown in Table A7b. The best performance of a random space was 37.36 (Table A7b). Comparing the minimum accuracy of the trained spaces with the best of the random spaces, we found that even the worst dom2vec space was roughly 20 percentage points higher than the best random space.
For all four example organisms, we observed that SKIP on non-redundant InterPro annotations produced the V_emb in which C^d_nearest achieved the best average accuracy. For three of the four organisms, the best performance was achieved for the lowest number of dimensions (dim = 50). In all cases, the worst-performing dom2vec embeddings outperformed the random baselines. From these findings, we affirmed that domain embeddings of the same GO molecular function class formed distinct clusters in the learned embedding space.

Concluding on Quantitative Intrinsic Evaluation
Based on the previous four experiments, we aimed to evaluate the learned V_emb spaces and select the best domain embedding space for the downstream tasks. In all experiments, the non-redundant InterPro annotations created better-performing embedding spaces than the non-overlapping annotations. We explain this finding by comparing the modes of the number of annotations per protein for the two annotation types (Figure 3). We hypothesized that, because of the very low mode for non-overlapping annotations (a mode of one annotation), the word2vec method could not learn informative embeddings even for the most stringent context window value of two. In contrast, 52% of proteins contained three or fewer non-redundant InterPro annotations.
This allowed SKIP to produce embedding spaces attaining the best intrinsic performance. From the individual results, we saw that the parameter configuration (non-redundant, SKIP, w = 5, dim = 50) brought the best C^d_nearest performance for SCOPe, for EC, and for GO for E. coli, Yeast, and Human, the second best for Malaria, and the sixth-best recall (0.507) for the domain hierarchy relation. Therefore, we denote by V^best-intrinsic_emb the space produced by (non-redundant, SKIP, w = 5, dim = 50, ep = 50).

Qualitative Evaluation

RQ_qualitative: Did Vectors of Each Domain Superfamily Form a Cluster in the V_emb?
To explore the V_emb in terms of the last research question, RQ_qualitative, we randomly selected five InterPro domain superfamilies for the visualization experiment. The selected domain superfamilies were the PMP-22/EMP/MP20/Claudin superfamily (parent InterPro id IPR004031), the small GTPase superfamily (parent InterPro id IPR006689), Kinase-pyrophosphorylase (parent InterPro id IPR005177), Exonuclease, RNase T/DNA polymerase III (parent InterPro id IPR013520), and the SH2 domain (parent InterPro id IPR000980).
We loaded the parent-child tree T_hier, provided by InterPro, and for each domain superfamily, starting from the parent domain, we recursively included all domains that had a subfamily relationship with this parent. For example, the Kinase-pyrophosphorylase superfamily had the parent domain IPR005177, which in turn had two immediate subfamilies, IPR026530 and IPR026565. The IPR026565 domain contained the subfamily domain IPR017409; consequently, the set of domains for the Kinase-pyrophosphorylase superfamily was {IPR005177, IPR026530, IPR026565, IPR017409}. We retrieved the vectors for each domain of each superfamily in the V^best-intrinsic_emb. Finally, we applied principal component analysis (PCA) to produce a two-dimensional space.
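The recursive collection of a superfamily from the parent-child tree can be sketched as follows. The tree slice below encodes exactly the Kinase-pyrophosphorylase example from the text; the function name and dict representation of T_hier are illustrative choices, not the paper's actual code.

```python
def collect_superfamily(parent, children):
    """Recursively gather a parent InterPro domain and all of its
    subfamily descendants from a parent -> children mapping."""
    members = [parent]
    for child in children.get(parent, []):
        members.extend(collect_superfamily(child, children))
    return members

# Slice of the InterPro parent-child tree T_hier for the
# Kinase-pyrophosphorylase superfamily described above.
t_hier = {
    "IPR005177": ["IPR026530", "IPR026565"],
    "IPR026565": ["IPR017409"],
}

print(sorted(collect_superfamily("IPR005177", t_hier)))
# -> ['IPR005177', 'IPR017409', 'IPR026530', 'IPR026565']
```

The collected identifiers are then looked up in the embedding matrix before dimensionality reduction.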
The visualization of the reduced space is depicted in Figure 4. Domain embeddings of each superfamily are organized in well-separated clusters; the cluster of the Exonuclease, RNase T/DNA polymerase III superfamily had the highest dispersion of all presented superfamilies. From this finding, we can answer the research question as follows: embedding vectors of the same superfamily are well clustered in the trained V_emb.
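A minimal PCA projection of embedding vectors to two dimensions, as used for Figure 4, can be written directly with an SVD. The random 50-dimensional vectors here stand in for the actual dom2vec superfamily embeddings.

```python
import numpy as np

def pca_2d(x):
    """Project row vectors onto their first two principal components."""
    xc = x - x.mean(axis=0)                 # center the data
    # SVD of the centered data: rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:2].T

rng = np.random.default_rng(1)
emb = rng.normal(size=(20, 50))             # 20 hypothetical 50-dim domain vectors
proj = pca_2d(emb)
print(proj.shape)                           # -> (20, 2)
```

Each row of `proj` is one domain's coordinate in the 2-D plot, which can then be colored by superfamily.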

Extracting Domain Architectures
For each data set that contained the UniProt identifier of the protein instance, we extracted the domain architectures for non-redundant InterPro annotations, already created in Section "Building domain architectures". For all proteins whose UniProt identifier could not be matched, or for data sets not providing the protein identifier, we used InterProScan [42] to find the domain hits per protein. For proteins without a domain hit after InterProScan, we created a protein-specific, artificial protein-long domain; for example, we assigned to the protein G5EBR8 a protein-long domain named "G5EBR8_unk_dom".
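The fallback logic above can be sketched as a small lookup with an artificial-domain default. The `domain_hits` mapping and the accession P12345 are hypothetical; the "_unk_dom" naming follows the G5EBR8 example from the text.

```python
def domain_architecture(uniprot_id, domain_hits):
    """Return the ordered InterPro domains for a protein; if the protein
    has no hits, fall back to a protein-specific artificial domain."""
    hits = domain_hits.get(uniprot_id)
    if hits:
        return hits
    return [f"{uniprot_id}_unk_dom"]

hits = {"P12345": ["IPR000980", "IPR006689"]}   # hypothetical InterProScan hits
print(domain_architecture("P12345", hits))      # -> ['IPR000980', 'IPR006689']
print(domain_architecture("G5EBR8", hits))      # -> ['G5EBR8_unk_dom']
```

This guarantees every protein has a non-empty "sentence" of domains to feed to the downstream classifier, at the cost of a vocabulary entry seen only once.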

Model Selection
To select which simple neural model we should compare to the baselines, we performed hyperparameter selection using an inner, three-fold cross-validation on the training set; the test set was not used to select hyperparameters. We used common parameters, with a dropout of 0.5, batch size of 64, an Adam optimizer [43] with learning rate of 0.000, weight decay for the last fully connected layer of 0, and number of epochs equal to 300. As a final hyperparameter, we allowed updates to the learned domain embeddings, initialized by selected dom2vec embeddings. The results are shown in Appendix E.

Running Baselines
Then, we used the same network as the one on the right side of Figure 5 of [21]; we refer to this network as SeqVecNet. Namely, the network first averages the 100-dimensional (ProtVec) or 1024-dimensional (SeqVec) embedding vectors of a protein; it then applies a fully connected layer to compress a batch of such vectors into 32 dimensions. Next, a ReLU activation (with 0.25 dropout) is applied to that vector, followed by batch normalization. Finally, another fully connected layer is followed by the prediction layer. As the third baseline, we added the 1-hot encoding of domains in order to investigate the performance change compared to the learned dom2vec embeddings.
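A forward pass through the SeqVecNet head described above can be sketched in NumPy as follows. This is an inference-time simplification, not the paper's implementation: dropout is omitted, batch normalization is reduced to a per-vector standardization with learned scale and shift, and all weights and sizes (120 residues, 3 output classes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def seqvecnet_forward(residue_embs, w1, b1, gamma, beta, w2, b2):
    """Sketch of the SeqVecNet head: average the per-residue embeddings,
    compress to 32 dims, ReLU, simplified normalization, prediction layer."""
    x = residue_embs.mean(axis=0)           # average pooling over the sequence
    h = x @ w1 + b1                         # fully connected: emb_dim -> 32
    h = np.maximum(h, 0.0)                  # ReLU (0.25 dropout omitted here)
    h = gamma * (h - h.mean()) / (h.std() + 1e-5) + beta  # simplified norm
    return h @ w2 + b2                      # prediction layer (class logits)

emb_dim, n_classes = 1024, 3                # SeqVec dims, hypothetical 3 classes
embs = rng.normal(size=(120, emb_dim))      # hypothetical 120-residue protein
w1, b1 = rng.normal(size=(emb_dim, 32)) * 0.01, np.zeros(32)
gamma, beta = np.ones(32), np.zeros(32)
w2, b2 = rng.normal(size=(32, n_classes)) * 0.1, np.zeros(n_classes)

logits = seqvecnet_forward(embs, w1, b1, gamma, beta, w2, b2)
print(logits.shape)                         # -> (3,)
```

The averaging step is what makes the head independent of protein length, so ProtVec and SeqVec inputs only differ in `emb_dim`.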

Evaluation
For TargetP, we sought to investigate the effect of OOV on the produced classifier compared to the sequence-based embedding classifiers, which do not experience OOV because their sequence features were highly common to both the train and test sets. For the Toxin and NEW data sets, we benchmarked the generalization of the produced classifier against the sequence-based embedding classifiers. Finally, for both kinds of experiments, we applied the trained models to each test set. Hence, this evaluation shows how differences in the training set affect performance on the test set. The resulting performances are shown in Figure 5.
Out-of-vocabulary experiment. For TargetP, we validated that OOV affects the performance of domain-dependent classifiers. For OOV in the range of 0-30%, the dom2vec classifier was comparable to the best-performing model, SeqVec. However, when OOV increased further, the performance of our model dropped, though it remained competitive with SeqVec. dom2vec greatly outperformed the 1-hot representation, validating the NLP finding that unsupervised embeddings improve classification on unseen words (in this context, protein domains) compared to 1-hot word (domain) vectors.
Generalization experiment. For both Toxin and NEW, dom2vec significantly outperformed SeqVec, ProtVec, and domain 1-hot vectors (Benjamini-Hochberg multiple-test corrected p-value < 0.05). In the Toxin data set, we observed that ProtVec learned the least variable model, but at the trade-off of obtaining the lowest performance (mc-AuROC). For the NEW data set, the domain 1-hot representation was the second-best representation, outperforming SeqVec and ProtVec, which validates the finding that domain composition is the most important feature for enzymatic function prediction, as concluded by [36].
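The Benjamini-Hochberg correction used for these significance claims is a step-up procedure over the sorted p-values. A minimal sketch (with made-up p-values, not the paper's actual comparison results) is shown below; in practice one would use an off-the-shelf routine such as statsmodels' `multipletests(..., method='fdr_bh')`.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return a parallel list of
    booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha, then
    # reject the k hypotheses with the smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Hypothetical raw p-values from pairwise model comparisons.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60]))
# -> [True, True, False, False, False]
```

Note that 0.039 and 0.041 would pass an uncorrected 0.05 threshold but fail after the FDR correction, which is exactly the kind of inflation the procedure guards against.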

Conclusions
In this work, we presented dom2vec, an approach for learning quantitatively assessable protein domain embeddings using the word2vec method on domain architectures from InterPro annotations.
We have shown that dom2vec adequately captured the SCOPe structural class, the EC enzymatic function, and the GO molecular function of each domain with available metadata. However, dom2vec produced moderate results in the domain hierarchy evaluation task. After investigating the properties of the domain families for which dom2vec produced these moderate results, we concluded that dom2vec cannot capture the domain hierarchy mostly for domain families of low cardinality. We argue that more complex classifiers than C^d_nearest could improve hierarchy performance, but this was beyond the scope of our evaluation.
Importantly, we discovered that dom2vec embeddings captured the most distinctive biological characteristics of domains: the secondary structure, enzymatic function, and molecular function of an individual domain. That is, word2vec produced domain embeddings that clustered sufficiently well by their structure and function class. Therefore, our finding supports the accepted modular evolution of proteins [1] in a data-driven way. It also makes possible a striking analogy between words in natural language that cluster together in word2vec space [14] and protein domains in domain architectures that cluster together in dom2vec space. Therefore, we parallel the semantic and lexical similarity of words with the functional and structural resemblance of domains. This analogy may augment research on understanding the rules underlying the domain architecture grammar [7]. We are confident that this interpretability aspect of dom2vec will allow researchers to apply it reliably to predict biological features of novel domain architectures and of proteins with identifiable InterPro annotations.
In the downstream task evaluation, dom2vec significantly outperformed domain 1-hot vectors and state-of-the-art sequence-based embeddings for the Toxin and NEW data sets. For TargetP, dom2vec was comparable to the best-performing sequence-based embedding, SeqVec, for OOV up to 30%. Therefore, we recommend using dom2vec in combination with sequence embeddings to boost prediction performance.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Domain Hierarchy
The average recall for all InterPro parents in T_hier (see main paper) for non-overlapping annotations is shown in Table A1. The histogram of the average recall for the best-performing embedding space is shown in Figure A1a. We observe that the embedding space brought close domains with unknown family-subfamily relation for almost one third of the parent domains (827 out of 2430).
To diagnose the reason for this moderate performance, we plotted the histogram of the number of children of each parent having recall 0 (Figure A1b). We observed that most of these parents had only one child. Consequently, the embedding space would have to be very homogeneous for each of these parent-child relations in order to achieve a recall better than 0.

Appendix B. SCOPe Secondary Structure Class

The class distribution of the secondary structure class is shown in Table A2a. The average C^d_nearest accuracy over all folds, Accuracy_SCOPe, for non-overlapping annotations is shown in Table A2b.

Appendix C. EC Primary Class
The class distribution of the EC primary class is shown in Table A3a. The average C^d_nearest accuracy over all folds, Accuracy_EC, for non-overlapping annotations is shown in Table A3b.

Appendix D. GO Molecular Function
Appendix D.1. Malaria

The GO class distribution and the average C^d_nearest accuracy over all folds, Accuracy_GO, for non-overlapping and non-redundant annotations for Malaria are shown in Table A4a-c, respectively. Table A4. Malaria GO molecular function evaluation: (a) GO class summary; (b,c) average C^d_nearest accuracy over all folds, Accuracy_GO, for non-overlapping and non-redundant annotations; whenever k is not shown, k = 2; best shown in bold.

Appendix D.2. E. coli

The GO class distribution and the average C^d_nearest accuracy over all folds, Accuracy_GO, for non-overlapping and non-redundant annotations for E. coli are shown in Table A5a-c, respectively. Table A5. E. coli GO molecular function evaluation: (a) GO class summary; (b,c) average C^d_nearest accuracy over folds, Accuracy_GO, for non-overlapping and non-redundant annotations; whenever k is not shown, k = 2; best shown in bold.

Appendix D.3. Yeast

The GO class distribution and the average C^d_nearest accuracy over all folds, Accuracy_GO, for non-overlapping and non-redundant annotations for Yeast are shown in Table A6a-c, respectively. Table A6. S. cerevisiae GO molecular function evaluation: (a) GO class summary; (b,c) average C^d_nearest accuracy over folds, Accuracy_GO, for non-overlapping and non-redundant annotations; best shown in bold.

Appendix D.4. Human

The GO class distribution and the average C^d_nearest accuracy, Accuracy_GO, for non-overlapping annotations for Human are shown in Table A7a,b. Table A7. Human GO molecular function evaluation: (a) GO class summary; (b) average C^d_nearest accuracy over folds, Accuracy_GO, for non-overlapping annotations; when k is not shown, k = 2; best shown in bold.


Appendix E. Extrinsic Evaluation
The class distributions for the TargetP, Toxin, and NEW data sets are shown in Table A8a-c, respectively. Model selection over hyperparameters, including the architecture, is shown in Table A9.