dom2vec: Unsupervised protein domain embeddings capture domain structure and function, providing data-driven insights into collocations in domain architectures

Motivation Word embedding approaches have revolutionized Natural Language Processing (NLP) research. These approaches aim to map words to a low-dimensional vector space in which words with similar linguistic features are close to each other. Such approaches also preserve local linguistic features, such as analogy. Embedding-based approaches have also been developed for proteins. To date, such approaches treat amino acids as words, and proteins as sentences of amino acids. These approaches have been evaluated either qualitatively, via visual inspection of the embedding space, or extrinsically, via performance on a downstream task. However, it is difficult to directly assess the intrinsic quality of the learned embeddings. Results In this paper, we introduce dom2vec, an approach for learning protein domain embeddings. We also present four intrinsic evaluation strategies which directly assess the quality of protein domain embeddings. We leverage the hierarchy relationship of InterPro domains, known secondary structure classes, Enzyme Commission class information, and Gene Ontology annotations in these assessments. These evaluations allow us to assess the quality of learned embeddings independently of a particular downstream task. Importantly, they allow us to draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures, thus providing data-driven insights into the context found in the language of domain architectures. We also show that dom2vec embeddings outperform, or are comparable with, state-of-the-art approaches on downstream tasks. Availability The protein domain embedding vectors and the entire code to reproduce the results are available at https://github.com/damianosmel/dom2vec. Contact melidis@l3s.uni-hannover.de


Introduction
A primary way in which proteins evolve is through rearrangement of their functional/structural units, known as protein domains (Moore et al., 2008; Forslund and Sonnhammer, 2012). Domains are independent folding and functional modules, and so they exhibit conserved sequence segments. Prediction algorithms have exploited this information and used the domain composition of a protein as input features for various tasks. For example, (Chou and Cai, 2002) classified the cellular location and (Forslund and Sonnhammer, 2008; Doǧan et al., 2016) predicted the associated Gene Ontology (GO) terms. There exist two ways to represent domains: either by their linear order in a protein, the domain architecture (Scaiewicz and Levitt, 2015), or by a graph in which nodes are domains and edges connect domains that co-exist in a protein (Moore et al., 2008; Forslund and Sonnhammer, 2012).
Moreover, (Yu et al., 2019) investigated whether domain architectures have a grammar like a natural spoken language. They compared the bigram entropy of domain architectures for PFAM domains (Sonnhammer et al., 1998) to the entropy of the English language, showing that although it was lower than that of English, it was significantly different from a language produced by shuffling the domains. Building on this, we learned embeddings for protein domains and evaluated the resulting embedding space using the domain hierarchy (the tree structure capturing family-subfamily relations provided by InterPro). We investigated the performance of a nearest neighbor classifier, C^d_nearest, in the learned embedding space to predict the secondary structure class provided by SCOPe (Fox et al., 2013) and the primary EC class. We also examined the performance of the C^d_nearest classifier to predict the GO molecular function class for three example model organisms and one human pathogen.
- As a by-product of the intrinsic evaluation, we observed that C^d_nearest reaches adequate accuracy, compared to C^d_nearest on randomized domain vectors, for secondary structure, enzymatic, and GO molecular function classes. We thus hypothesized an analogy between word embeddings clustering by local linguistic features and protein domain embeddings clustering by domain structure and function, which, in turn, brings data-driven insights into the semantic context of domain architectures.
- To evaluate our embeddings extrinsically, we fed the learned domain embeddings into simple neural networks and compared their performance with state-of-the-art methods on three full-protein tasks, surpassing state-of-the-art methods on two tasks, EC primary class and toxin presence, and being comparable on the cellular location prediction task.
- We make the trained domain embeddings available to the research community for protein supervised learning tasks.
Related work

Protein embedding methods
Work on protein embeddings can be divided into two categories based on their word assignment: the former learns embeddings for constant-length amino acid subsequences, the latter for variable-length subsequences.

Constant length subsequence vectors
Early work in protein embeddings, ProtVec (Asgari and Mofrad, 2015), belongs to this category. In this work, the authors split the protein sequence into 3-mers with the 3 possible starting shifts. They then used the skip-gram model from (Mikolov et al., 2013b) to learn the embeddings for each distinct 3-mer. The embedding for a protein was taken as the mean of all its 3-mers. (Yang et al., 2018) adapted ProtVec by using doc2vec (Le and Mikolov, 2014) to find the whole-protein embedding.
Similarly, (Woloszynek et al., 2019) computed embeddings for all k-mers using a range of k; they also used doc2vec to find the whole-protein embedding. (Bepler and Berger, 2019) learned an embedding of each amino acid position incorporating global structural similarity between a pair of proteins and contact map information for each protein. They represented the alignment of two proteins as a soft symmetric alignment of the embeddings of the protein residues. Recently, SeqVec (Heinzinger et al., 2019) applied and extended the ELMo model (Peters et al., 2018) to learn a contextualized embedding per amino acid position, and UniRep (Alley et al., 2019) was introduced to learn embeddings for amino acids with an RNN-based language model.

Variable-length subsequence vectors
ProtVecX (Asgari et al., 2019) embeds proteins by first extracting (variable-length) motifs from proteins using a data compression algorithm, byte-pair encoding (Gage, 1994). ProtVec is then used to learn embeddings for the motifs.

Extrinsic evaluation methods
Extrinsic evaluation methods, like performance on a supervised learning task, are most commonly used to evaluate the quality of embeddings. To date, different papers have evaluated performance on different downstream tasks using different datasets.
Sequence-level prediction (Asgari and Mofrad, 2015) predicted protein family and disorder, while (Yang et al., 2018) evaluated embeddings by predicting Channelrhodopsin (ChR) localization, the thermostability of Cytochrome P450, Rhodopsin absorption wavelength, and Epoxide hydrolase enantioselectivity. (Bepler and Berger, 2019) predicted transmembrane regions of proteins, and (Heinzinger et al., 2019) used subcellular localization and water solubility as validation. (Asgari et al., 2019) considered venom toxin, subcellular localization, and enzyme primary class prediction tasks for downstream validation. (Alley et al., 2019) predicted the stability of proteins and the phenotype of diverse variants of the green fluorescent protein.
Per-residue prediction (Heinzinger et al., 2019) predicted secondary structure and intrinsic disorder, and (Alley et al., 2019) predicted the functional effect of single amino acid mutations.

Qualitative evaluation methods
Many of the described approaches use some form of qualitative evaluation. Commonly, the learned embeddings are projected to two dimensions using techniques like t-SNE (Maaten and Hinton, 2008) or UMAP (McInnes et al., 2018) to visually inspect the embeddings. While such approaches are helpful for gaining trust in the embeddings, they do not provide a rigorous way to compare the quality of multiple embedding spaces, such as those obtained when using different hyperparameters for the embedding models. (Asgari and Mofrad, 2015) used a qualitative intrinsic evaluation approach for ProtVec: they visualized the two-dimensional reduced protein space along with biophysical and biochemical properties and reported the appearance of clusters with similar biochemical properties.
For intrinsic evaluation, (Heinzinger et al., 2019) visualized the two-dimensional reduced protein space along with the secondary structure class, the EC class, and the taxonomic kingdom per protein.

Materials and methods
Our approach is summarized in Figure 1; in the following, we explain the methodology for each part of the approach.

dom2vec
We now describe dom2vec, our novel approach for learning protein domain embeddings based on domain architectures.

Building domain architecture
Hereafter we refer to the sequence of domains in a protein as its domain architecture. We consider two distinct strategies to represent a protein based on its domain architecture: non-overlapping and non-redundant. In both cases, the sequences are based on protein domain entries from InterPro. For efficiency, an interval tree is built for each protein to detect overlapping domains, and each protein is split into regions of overlapping domains.
Non-overlapping sequences. For each region with overlapping domains, all domains except the longest are removed.
Non-redundant sequences. For each region with overlapping domains, all domains with the same InterPro identifier, except the longest domain for each identifier, are removed.
In both cases, the domains in each protein are sorted by start position to construct its domain architecture. Following the approach of (Doǧan et al., 2016), we also added a "GAP" domain to annotate any subsequence of more than 30 amino acids that does not match any InterPro domain entry.
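The two strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: each domain hit is assumed to be a (start, end, InterPro id) tuple, and a simple linear sweep over sorted hits stands in for the interval tree.

```python
# Sketch of the two architecture-building strategies.
# Assumed data layout: each hit is a (start, end, interpro_id) tuple.

def overlap_regions(hits):
    """Group domain hits into regions of mutually overlapping intervals."""
    hits = sorted(hits)                       # sort by start position
    regions, current, current_end = [], [], -1
    for start, end, acc in hits:
        if current and start > current_end:   # no overlap with current region
            regions.append(current)
            current = []
        current.append((start, end, acc))
        current_end = max(current_end, end)
    if current:
        regions.append(current)
    return regions

def non_overlapping(hits):
    """Keep only the longest domain in each overlapping region."""
    keep = [max(r, key=lambda h: h[1] - h[0]) for r in overlap_regions(hits)]
    return [acc for _, _, acc in sorted(keep)]

def non_redundant(hits):
    """Within each region, keep the longest hit per InterPro identifier."""
    keep = []
    for region in overlap_regions(hits):
        best = {}
        for h in region:
            acc = h[2]
            if acc not in best or h[1] - h[0] > best[acc][1] - best[acc][0]:
                best[acc] = h
        keep.extend(best.values())
    return [acc for _, _, acc in sorted(keep)]

# Illustrative hits for one protein (identifiers chosen arbitrarily):
hits = [(1, 120, "IPR000980"), (10, 60, "IPR004031"), (20, 70, "IPR004031")]
```

On this toy protein, the non-overlapping strategy keeps only the longest hit, while the non-redundant strategy additionally keeps one representative of the second identifier.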

Training domain embeddings
Next, we learned task-independent embeddings for each domain using two variants of word2vec (Mikolov et al., 2013b): continuous bag of words (CBOW) and skip-gram (SKIP). See (Mikolov et al., 2013b) for technical details on the differences between these approaches. In our context, each domain is treated as a word, and each protein sequence is considered a sentence. Thus, after learning, each domain is associated with a task-independent embedding.
In Section 3.3, we consider several intrinsic evaluation strategies to determine the quality of these embeddings for various hyperparameters of word2vec.

Qualitative evaluation
As a preliminary evaluation strategy, we used qualitative evaluation approaches adopted in existing work. Following the qualitative approach of ProtVec and SeqVec, we visualized the embedding space for selected domain superfamilies to answer the following research question, RQ: Do the vectors of each domain superfamily form a cluster in V_emb?
Evaluation To find out, we added the vector of each domain in a randomly chosen domain superfamily to an empty space. Then we performed principal component analysis (PCA) (Pearson, 1901) to reduce the space to two dimensions and observed the formed clusters.
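The projection step needs nothing beyond PCA to two dimensions, which can be sketched with numpy alone; this is a stand-in for whatever PCA implementation was actually used.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)                        # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # coordinates in the top-2 components
```

Applied to the vectors of one superfamily, the returned two columns are the coordinates that would be scattered for visual inspection.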

Novel intrinsic evaluation methods
Previous work has evaluated the quality of embeddings only indirectly by measuring performance on downstream, supervised tasks, as described in Section 3.4. However, in the natural language processing field, the quality of a learned word embedding space is often evaluated intrinsically by considering relationships among words, such as analogies. Such an evaluation is important because it ensures the learned embeddings are meaningful without choosing a specific downstream task.
Such intrinsic evaluations cannot be performed for the ProtVec and SeqVec protein embeddings; in contrast with these approaches, dom2vec directly learns embeddings for protein domains. Protein domains are typically represented using structures like profile hidden Markov models (HMMs) (Eddy, 1998) rather than unique amino acid sequences, so it is not straightforward to combine sequence feature embeddings for a given profile HMM in order to evaluate the learned sequence embeddings in the way we describe next. Thus, prior approaches such as ProtVec and SeqVec cannot learn unique embeddings per domain, which prevents us from performing the same evaluation tasks for the existing protein embeddings.
We propose four intrinsic evaluation approaches for domain embeddings: domain hierarchy based on the family/subfamily relation, secondary structure class, Enzyme Commission (EC) primary class, and Gene Ontology (GO) molecular function annotation.
We refer to the embedding space learned by word2vec for a particular set of hyperparameters as V_emb. We refer to the k nearest neighbors of a domain d, found using the Euclidean distance, as C^d_nearest. To inspect the relative performance of V_emb on each of the following evaluations, we randomized all domain vectors and ran each evaluation task. That is, irrespective of the embedding method parameters, for each space dimensionality we assigned a newly created random vector to each domain and saved this random space to be evaluated.
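The randomized baseline described above can be sketched as follows; the function name and dictionary layout are our own, and only the dimensionality of the original space is preserved.

```python
import random

def randomize_space(vectors, seed=0):
    """Replace every domain vector with a fresh random vector of the same dimension."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    return {d: [rng.gauss(0.0, 1.0) for _ in range(dim)] for d in vectors}
```

Evaluating such a space alongside the learned V_emb gives the random reference values reported in the experiments.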

Domain hierarchy
InterPro defines a strict family-subfamily relationship among domains, based on the sequence similarity of the domain signatures. We refer to the children of domain p as C_p. We use these relationships to evaluate an embedding space, posing the following research question, RQ: Do the vectors of hierarchically close domains form clusters in V_emb?
Evaluation We evaluate V_emb by retrieving C^d_nearest of each domain. For all learned embedding spaces, we measured their recall performance, Recall_hier, defined as follows:

Recall_hier = (1 / |P|) * Σ_{p ∈ P} |C_p ∩ C^p_nearest| / |C_p|,   (1)

where P is the set of parent domains and C^p_nearest are the |C_p| nearest neighbors of parent p.

Secondary structure class
We extracted the secondary structure class of InterPro domains from the SCOPe database and formed the following research question, RQ: Do the vectors of domains with the same secondary structure class form clusters in V_emb?
Evaluation We evaluated V_emb by retrieving C^d_nearest of each domain. Then we applied stratified k-fold cross-validation and measured the performance of a k-nearest neighbor classifier in predicting the structure class of each domain. The intrinsic evaluation performance metric is the average accuracy across all folds, Accuracy_SCOPe.
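The classifier at the core of this (and the following) evaluations is a plain k-nearest-neighbour vote, which can be sketched in pure Python; the data layout is illustrative.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points nearest (Euclidean) to query.

    train: list of (vector, label) pairs; query: a vector.
    """
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In the paper's setting, `train` would hold the domain vectors of the training folds with their SCOPe (or EC, or GO) labels, and accuracy is averaged over stratified folds.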

Enzyme Commission (EC) primary class
The enzymatic activity of each domain is given by its primary EC class (Fleischmann et al., 2004). We pose the following research question, RQ: Do the vectors of domains with the same EC primary class form clusters in V_emb?
Evaluation We again evaluate V_emb using k nearest neighbors in a stratified cross-validation setting. Average accuracy across all folds is again used to quantify the intrinsic quality of the embedding space.

GO molecular function
For our last intrinsic evaluation, we aimed to assess V_emb using molecular function GO annotations. We extracted all molecular function GO annotations associated with each domain. Since model organisms have the most annotated proteins, we created GO molecular function data sets for one example prokaryote (Escherichia coli, denoted E. coli), one example simple eukaryote (Saccharomyces cerevisiae, denoted S. cerevisiae), and one complex eukaryote (Homo sapiens, denoted Human). To also assess our embeddings for less highly annotated organisms, we included a molecular function data set for an example human pathogen (Plasmodium falciparum, denoted Malaria). Finally, we pose the following research question, RQ: Do the vectors of domains with the same GO molecular function form clusters in V_emb?
Evaluation We again evaluate an embedding space using k nearest neighbors in a stratified cross-validation setting. Average accuracy across all folds is again used to quantify performance.

Extrinsic evaluation
In addition to assessing the learned V_emb, we also examine performance on downstream tasks. That is, for three supervised-task data sets, we fed the domain representations into simple neural networks and compared the performance of our model with state-of-the-art protein embeddings.

Simple neural models for downstream tasks
We consider a set of simple, well-established neural models to combine the domain embeddings of each protein for downstream tasks, that is, for extrinsic evaluation. In particular, we use FastText (Joulin et al., 2017), convolutional neural networks (CNNs) (LeCun et al., 1998), and recurrent neural networks (RNNs) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997), including bi-directional LSTMs. We leave evaluation with more sophisticated models, such as transformers like BERT (Devlin et al., 2018), for future work.
The cellular location data set is a multi-class task; thus we used as performance metric a multi-class generalization of the area under the receiver operating characteristic curve (mc-AuROC). For binary tasks, the performance metric was the area under the receiver operating characteristic curve (AuROC).

TargetP
We downloaded the TargetP data set provided by (Emanuelsson et al., 2000). To compare with ProtVecX, we used the non-plant data set. This data set consists of 2,738 proteins, each accompanied by its UniProt id, sequence, and cellular location label, which can be nuclear, cytosol, pathway or signal, and mitochondrial. Finally, we removed all instances with a duplicate set of domains, resulting in 2,418 instances in total. This is a multi-class task, and its class distribution is summarized in supplementary section E.
Evaluation For TargetP, we used the mc-AuROC performance metric.

Toxin
(Gacesa et al., 2016) introduced a data set associating protein sequences with toxic or other physiological content. To compare with ProtVecX, we used the hard setting, which provides the UniProt id, sequence, and label (toxin or non-toxin content) for 15,496 proteins. Finally, we kept only the proteins with unique domain composition, resulting in 2,270 protein instances in total. This is a binary task, and the class distribution is shown in supplementary section E.
Evaluation As the Toxin data set is a binary task, we used AuROC as the performance metric.

NEW
We downloaded the NEW data set from (Li et al., 2017). For each of the 22,618 proteins, the data set provides the sequence and the EC number (class). The primary enzyme class, the first digit of the EC number, is our label for this prediction task, making it a multi-class task. Finally, we removed all instances with duplicate domain composition, resulting in a total of 14,434 protein instances. There are six possible classes, and the class distribution is shown in supplementary section E.
Evaluation The NEW data set is a multi-class task; thus we used mc-AuROC as the performance metric.

Data partitioning
We divided each data set into 70/30% train and test splits. To perform model selection, we created inner three-fold cross-validation sets on each train split.
We observed that the performance of a classifier that depends on protein domains is highly sensitive to out-of-vocabulary (OOV) domains, as first discussed in (Luong et al., 2015). OOV domains are all domains contained in the test set but not in the train set. For TargetP, Toxin, and NEW, we observed that approximately 60%, 20%, and 20% of test proteins, respectively, contain at least one OOV domain. Thus, we ran experiments based on this observation. For TargetP, which has the highest OOV percentage, we split the test set into smaller subsets with an increasing degree of OOV, namely 0%, 10%, 30%, 50%, 70%, and 100%. Then we learned models on the whole train set and benchmarked the performance on each of these test subsets.
For the Toxin and NEW data sets, which have low OOV, we sought to investigate the generalization of the produced classifier by increasing the number of training examples the model learns from, always benchmarking on the entire test set. Thus, we created training splits of 10%, 20%, and 50% of the whole train set. To perform significance testing, we created 10 random subsamples for each training split percentage.
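The OOV bookkeeping above can be sketched as follows. The exact subset construction in the paper may differ; here, as one plausible reading, each test protein is assigned to the smallest OOV threshold that covers its OOV degree.

```python
def oov_degree(domains, train_vocab):
    """Fraction of a protein's domains never seen during training."""
    return sum(1 for d in domains if d not in train_vocab) / len(domains)

def split_by_oov(test_proteins, train_vocab,
                 thresholds=(0.0, 0.1, 0.3, 0.5, 0.7, 1.0)):
    """Bucket test proteins by the smallest OOV threshold covering them.

    test_proteins: dict mapping protein id -> list of domain ids.
    """
    buckets = {t: [] for t in thresholds}
    for name, domains in test_proteins.items():
        deg = oov_degree(domains, train_vocab)
        for t in thresholds:
            if deg <= t:
                buckets[t].append(name)
                break
    return buckets
```

Each bucket then serves as one of the test subsets on which the trained model is benchmarked.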

Building domain architecture
We downloaded the domain hits, protein2ipr.dat, for UniProt proteins from InterPro version 75. This file contained 128,660,257 proteins with an InterPro signature, making up 80.9% of the total UniProtKB proteome (version 2019_06). For all these proteins, we extracted the non-overlapping and non-redundant sequences, which we processed as described in the next section. The number of unique domains in non-overlapping sequences was 35,183 plus the "GAP" domain, and in non-redundant sequences 36,872 plus the "GAP" domain. Compared to the total number of domains in InterPro version 75, which was 36,872, the non-overlapping sequences captured 95.42% and the non-redundant sequences captured 100% of the InterPro domains. Figure 2 demonstrates an example of constructed domain architectures, showing the architectures built from the subdomains of the Tetrapyrrole methylase domain of the Diphthine synthase protein (see (Mitchell et al., 2019) for more details).

Domain architecture
Before applying the word2vec package, we examined the histograms of the number of domains per protein for non-overlapping and non-redundant sequences. We show the histogram for non-overlapping sequences in Figure 3. We observed that these distributions are skewed and long-tailed. We then used both the CBOW and SKIP algorithms to learn domain embeddings, with the following parameter sets. Based on the histograms, we set the window parameter to w ∈ {2, 5}. For the number of dimensions, we used common values from the NLP literature, dim ∈ {50, 100, 200}. We trained the embeddings from 5 to 50 epochs with a step size of 5 epochs, ep ∈ {5, 10, 15, ..., 50}. All other parameters were set to their default values; for example, the negative sampling parameter was left at its default, ng=5.
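The resulting hyperparameter grid can be enumerated directly; the commented Word2Vec call is a gensim-style sketch of the training step, not the paper's exact code.

```python
from itertools import product

# Hyperparameter values from the text.
representations = ["non_overlapping", "non_redundant"]
windows = [2, 5]
dims = [50, 100, 200]
epochs = list(range(5, 55, 5))        # 5, 10, ..., 50

grid = list(product(representations, windows, dims, epochs))

# For each grid point, both CBOW and SKIP models would be trained, e.g.
# (gensim-style call, assumed):
# model = Word2Vec(sentences, sg=use_skip, window=w, vector_size=dim,
#                  epochs=ep, negative=5)
```

The grid has 2 × 2 × 3 × 10 = 120 combinations, matching the number of embedding spaces evaluated per training algorithm in the next section.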

Qualitative evaluation
We randomly selected five InterPro domain superfamilies for this experiment: the PMP-22/EMP/MP20/Claudin superfamily with parent InterPro id IPR004031, the small GTPase superfamily with parent InterPro id IPR006689, Kinase-pyrophosphorylase with parent InterPro id IPR005177, Exonuclease, RNase T/DNA polymerase III with parent InterPro id IPR013520, and the SH2 domain with parent InterPro id IPR000980. As for the domain hierarchy evaluation, we loaded the parent-child tree T_hier provided by InterPro, and for each domain superfamily, starting from the parent domain, we recursively included all domains that have a subfamily relation with it. For example, the Kinase-pyrophosphorylase domain superfamily has the parent domain IPR005177, which has two immediate subfamilies, IPR026530 and IPR026565, and IPR026565 has the subfamily IPR017409; consequently, the set of domains for the Kinase-pyrophosphorylase superfamily is {IPR005177, IPR026530, IPR026565, IPR017409}.
We retrieved the vectors for each domain in a superfamily using V^downstream_emb, the best-performing dom2vec space selected in Section 4.5. Finally, we visualized the two-dimensional PCA-reduced space in Figure 4. From this figure, we observed that the domain embeddings of each superfamily form well-separated clusters; the cluster of the Exonuclease, RNase T/DNA polymerase III superfamily has the highest dispersion of all presented superfamilies.

Novel intrinsic evaluation
In the following, we evaluated each instance of the learned embedding space V_emb for both non-overlapping and non-redundant representations of domain architectures. An instance of V_emb is the embedding space learned for each combination in the product non_overlap × w × dim × ep. Thus, the total number of embedding spaces is |non_overlap| × |w| × |dim| × |ep| = 2 × 2 × 3 × 10 = 120. Let V^i_emb denote such an embedding space instance. In the following subsections, we evaluated each V^i_emb instance for domain hierarchy, secondary structure, enzymatic process, and GO molecular function.

Domain hierarchy
We loaded the parent-child tree T_hier provided by InterPro, consisting of 2,430 parent domains. Then for each V^i_emb, we compared the actual and predicted children of each parent, and we averaged the recall over all parents. For ease of presentation, we show only the results for non-redundant sequences in Table 1 and provide the complete results in Supplementary A; in the following, we discuss these results.
From Tables S1 and 1 (appendix A), we observed that SKIP performed better overall, and the embeddings learned from non-redundant annotations always have better average recall values than those from non-overlapping annotations. The best-performing V^i_emb achieved an average Recall_hier of 0.538; that embedding space was trained using non-redundant annotations with SKIP, w = 2, and dim = 200. We plotted the histogram of recall values for this best-performing space, Figure S1 (Suppl. A), and observed that for almost one third of the parent domains (827 out of 2,430) the embedding space placed close to them domains with no known family-subfamily relation. To diagnose the reason for this moderate performance, we plotted the histogram of the number of children for each parent with recall 0, Figure S2 (Suppl. A). We observed that most of these parents have only one child. Consequently, the embedding space would need to be very precise for each of these parent-child relations to achieve a recall better than 0. We compared this moderate performance of V_emb with the performance of the randomized spaces, which was equal to 0; thus, we concluded that our embedding spaces greatly outperformed each randomized space for the domain hierarchy relation. Consequently, we can answer our initial question: the majority of domains of the same hierarchy are placed in close proximity in the embedding space.

Secondary structure class
We extracted the SCOPe class for each InterPro domain from the interpro.xml file. This resulted in 25,196 domains with unknown secondary structure class, 9,411 with a single secondary structure class, and 2,265 domains with more than one assigned class (multi-label). For clarity, we removed all multi-label and unknown instances, leaving 9,411 single-labeled instances; the class distribution of the resulting data set is shown in Table S2 (Supplementary B).
To answer the research question, we measured the performance of the C^d_nearest classifier in each V^i_emb to examine the homogeneity of the space with respect to the SCOPe class. We split the 9,411 domains into 5-fold stratified cross-validation sets. To test the change in prediction accuracy with an increasing number of neighbors, we used k ∈ {2, 5, 20, 40}. We summarize the results for the best-performing C^d_nearest, which was k = 2 for non-redundant sequences, in Table 2; the respective table for non-overlapping sequences is in Suppl. B. We compared these accuracy measurements to those of the random spaces and found that the lowest accuracy values, achieved for CBOW with w=5 using non-overlapping domains, are twice as high as the accuracy values of the random spaces for all possible dimensions. We observed again that non-redundant domains resulted in higher accuracy than non-overlapping domains. The best-performing C^d_nearest was 2-NN for (non-redundant, SKIP, w=5, dim=50, ep=25), with an average accuracy over the folds of 84.56. Consequently, we can answer our second research question: domain embeddings of the same secondary structure class form distinct clusters in a learned embedding space.

Enzyme Commission (EC) primary class
To this end, we extracted the EC primary class from the interpro.xml file. This resulted in 29,354 domains with unknown EC, 7,248 domains with only one EC, and 721 with more than one EC. As before, we removed all multi-label and unknown instances, leaving 7,428 domains with known EC. For each V^i_emb, we augmented each domain instance with its vector representation and then used C^d_nearest to predict the EC. Please refer to Supplementary C for the class distribution of the EC task. We again omit the full result tables and discuss the main results.
We present the average Accuracy_EC obtained on embedding spaces learned using non-redundant sequences in Table 3; the respective table for non-overlapping sequences is in Suppl. C. We compared these accuracy measurements to those of the random spaces. We found that the minimum average Accuracy_EC value was 60.51, achieved using CBOW with w=5 for non-overlapping sequences. That value is approximately twice as large as the accuracy values of the random spaces for all possible dimensions (the maximum average Accuracy_EC for a random space, with dim=100, was 32.64). The best results were always produced using 2-NN, and we saw once more that SKIP trained on non-redundant sequences performed best, with a maximum average accuracy of 90.85 (SKIP, w=5, dim=50, ep=30). Consequently, we can answer our third research question: domain embeddings of the same EC primary class form distinct clusters in a learned embedding space.

GO molecular function
We parsed the GO annotation file of InterPro to extract first-level GO molecular function annotations for the domains of the four organisms. We followed the same methodology to examine the homogeneity of a V_emb with respect to GO molecular function annotations. Thus, for each V^i_emb, we augmented each domain with its vector and its GO label, and we classified each domain using C^d_nearest. As before, we used 5-fold stratified cross-validation for evaluation. In our experiments, we varied the number of neighbors, k ∈ {2, 5, 20, 40}, to test its influence on performance. For space limitations, we summarize only the best average accuracy over the number of neighbors. For ease of presentation, we omit the result tables for the first three organisms and show results only for Human; however, we discuss the results for all organisms. Please see Supplementary D for full results.
For Malaria, the best average accuracy was 76.86, scored by 2-NN for the V^malaria_emb corresponding to (non-redundant, SKIP, w=5, dim=100, ep=40). The minimum accuracy value was 56.94, achieved using CBOW with w=5 on non-overlapping sequences. We compared this minimum accuracy to the maximum accuracy obtained in a random space, which was 47.57. That is, even the worst-performing dom2vec approach outperformed the random baseline by roughly ten percentage points.
For E. coli, the best average accuracy was 81.72, scored using 2-NN for the V^ecoli_emb corresponding to (non-redundant, SKIP, w=5, dim=50, ep=5). The minimum accuracy value was 67.34, achieved using CBOW with w=5 on non-overlapping sequences. For E. coli, the random baseline was still worse, with an accuracy of 64.46.
For Yeast, the best average accuracy was 75.10, scored using 5-NN for the V^yeast_emb corresponding to (non-redundant, SKIP, w=5, dim=50, ep=50). The minimum accuracy value was 59.82, achieved using CBOW with w=5 on non-overlapping sequences. We compared this minimum accuracy to the maximum accuracy obtained in a random space, which was 53.73.
For Human, the best average performances for non-redundant sequences are shown in Table 4. The best average accuracy was 75.96, scored by 2-NN for the V^human_emb (non-redundant, SKIP, w=5, dim=50, ep=40). We compared the minimum accuracy values, obtained by CBOW with w=5, to the accuracy values of the random spaces, and we found that the worst-performing dom2vec was 20 percentage points higher than the random baseline.
For all four example organisms, we observed that SKIP on non-redundant sequences produced the V_emb in which C^d_nearest achieved the best average accuracy. For three of the four organisms, the best performance was achieved at the lowest number of dimensions (dim=50). In all cases, the worst-performing dom2vec embeddings outperformed the random baselines.

Selecting trained embeddings
Based on the previous four experiments, we aimed to evaluate the learned V_emb spaces and select the best domain embedding space for downstream tasks. In all experiments, the SKIP algorithm on non-redundant sequences created the best-performing embedding spaces. From the individual results, we saw that the parameter configuration (non-redundant, SKIP, w=5, dim=50) yielded the best C^d_nearest performance for SCOPe, EC and GO for E.coli, Yeast and Human, the second best for Malaria, and the sixth best average recall (0.507) for the domain hierarchy relation. Therefore, for the remainder of this work, V^downstream_emb denotes the space produced by (non-redundant, SKIP, w=5, dim=50).

Extracting domain architecture
For each data set that contained the UniProt identifier of the protein instance, we extracted the domain architecture from the non-redundant sequences already created in Section 4.1. For all proteins whose UniProt identifier could not be matched, and for data sets not providing the protein identifier, we used InterProScan (Jones et al., 2014) to find the domain hits per protein. For proteins without a domain hit after InterProScan, we created a protein-specific, artificial protein-long domain; for example, for the protein G5EBR8, we assigned a protein-long domain named "G5EBR8_unk_dom".
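The fall-back described above can be sketched in a few lines. The function name and the shape of the domain-hit tuples are our own illustration, not taken from the dom2vec code base; the naming scheme `<UniProt id>_unk_dom` is the one stated in the text.

```python
# Sketch of building a non-redundant domain architecture for one protein,
# with the artificial protein-long domain as fall-back when InterProScan
# returns no hits. Function and tuple layout are illustrative assumptions.
def domain_architecture(uniprot_id, domain_hits):
    """Return the domain architecture as an ordered list of domain ids.

    domain_hits: list of (interpro_id, start_position) tuples, possibly empty.
    """
    if not domain_hits:
        # No hit: the whole protein becomes one artificial domain.
        return [f"{uniprot_id}_unk_dom"]
    # Non-redundant architecture: unique domains sorted by start position.
    seen, architecture = set(), []
    for interpro_id, start in sorted(domain_hits, key=lambda h: h[1]):
        if interpro_id not in seen:
            seen.add(interpro_id)
            architecture.append(interpro_id)
    return architecture

print(domain_architecture("G5EBR8", []))  # ['G5EBR8_unk_dom']
print(domain_architecture("P0A6P9", [("IPR000001", 10), ("IPR000002", 3)]))
# ['IPR000002', 'IPR000001']
```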

Model selection
To select which simple neural model to compare to the baselines, we performed hyperparameter selection using an inner, three-fold cross-validation on the training set; the test set was not used to select hyperparameters. We used common parameters: dropout of 0.5, batch size of 64, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0003, weight decay of 0 and 300 epochs. As a final hyperparameter, we allowed updates to the learned domain embeddings, initialized by V^downstream_emb. The results are shown in Supplementary E.
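The training setup above, in particular initializing a trainable embedding layer from the pretrained dom2vec vectors, can be sketched in PyTorch. The model class is a minimal stand-in of our own, not the paper's architecture; only the listed hyperparameters (dropout 0.5, batch size 64, Adam with lr 0.0003 and weight decay 0) are taken from the text.

```python
# Minimal sketch: a classifier whose embedding layer is initialized from
# pretrained dom2vec vectors and left trainable, with the optimizer
# settings stated in the text. The model itself is an illustrative stand-in.
import torch
import torch.nn as nn

pretrained = torch.randn(100, 50)  # stand-in for the 50-dim dom2vec vectors

class DomainClassifier(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        # freeze=False allows the optimizer to update the dom2vec-initialized
        # embeddings during training, as described above.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.drop = nn.Dropout(0.5)
        self.out = nn.Linear(50, n_classes)

    def forward(self, domain_ids):  # domain_ids: (batch, architecture_len)
        x = self.emb(domain_ids).mean(dim=1)  # average over the architecture
        return self.out(self.drop(x))

model = DomainClassifier()
opt = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.0)
logits = model(torch.randint(0, 100, (64, 7)))  # batch of 64 architectures
print(logits.shape)  # torch.Size([64, 4])
```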

Running baselines
We then used the same network as the one on the right side of Figure 5 of (Heinzinger et al., 2019); we refer to this network as SeqVecNet. Namely, the network first averages the 100-dimensional (ProtVec) or 1,024-dimensional (SeqVec) embedding vectors of a protein; it then applies a fully connected layer to compress a batch of such vectors into 32 dimensions. Next, a ReLU activation function (with 0.25 dropout) is applied to that vector, followed by batch normalization. Finally, another fully connected layer is followed by the prediction layer. As a third baseline, we added the 1-hot encoding of domains in order to investigate the performance change compared to the dom2vec learned embeddings.
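The SeqVecNet pipeline just described can be sketched layer by layer. This is our reading of the description (average, compress to 32, ReLU with 0.25 dropout, batch normalization, a further fully connected layer, prediction), not the authors' exact implementation; layer sizes other than 32 and 1,024 are assumptions.

```python
# Sketch of SeqVecNet as described in the text; implementation details
# such as the hidden size of the final fully connected layer are assumed.
import torch
import torch.nn as nn

class SeqVecNet(nn.Module):
    def __init__(self, emb_dim=1024, n_classes=4):  # 1024 for SeqVec, 100 for ProtVec
        super().__init__()
        self.compress = nn.Linear(emb_dim, 32)  # compress to 32 dimensions
        self.drop = nn.Dropout(0.25)
        self.bn = nn.BatchNorm1d(32)
        self.fc = nn.Linear(32, 32)             # "another fully connected layer"
        self.pred = nn.Linear(32, n_classes)    # prediction layer

    def forward(self, emb):          # emb: (batch, seq_len, emb_dim)
        x = emb.mean(dim=1)          # average the per-residue embeddings
        x = self.bn(self.drop(torch.relu(self.compress(x))))
        return self.pred(self.fc(x))

net = SeqVecNet()
print(net(torch.randn(8, 120, 1024)).shape)  # torch.Size([8, 4])
```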

Evaluation
As discussed in Section 3.4.5, we performed two kinds of experiments. For TargetP, we sought to investigate the effect of OOV on the produced classifier compared to classifiers based on sequence embeddings, which do not experience OOV because the sequence features they use are highly common in both the training and test sets. For the Toxin and NEW data sets, we benchmarked the generalization of the produced classifier compared to the sequence-based embedding classifiers. Finally, for both kinds of experiments, we applied the trained models to each test set. Thus, this evaluation shows how differences in the training set affect performance on the test set. The resulting performances are shown in Figure 5.

Statistical significance
For both Toxin and NEW, dom2vec significantly outperformed SeqVec, ProtVec and the domain 1-hot vectors. On the Toxin data set, we observed that ProtVec learned the least variable model, but with the trade-off of obtaining the lowest performance (mc-AuROC). On the NEW data set, the 1-hot representation was the second-best representation, outperforming SeqVec and ProtVec; this allows us to validate the finding that domain composition is the most important feature for enzymatic function prediction (Li et al., 2017).
For TargetP, we validated our assumption that OOV affects the performance of domain-dependent classifiers. That is, for OOV in the range of 0-30%, the dom2vec classifier was comparable to the best-performing model, SeqVec. However, when OOV increased further, the performance of our model dropped, though it remained competitive with SeqVec. OOV is a known problem in NLP, and as future work we would like to investigate ways to resolve this issue. Finally, dom2vec greatly outperformed the 1-hot representation, validating the NLP assumption that unsupervised embeddings improve classification on unseen words, in this context protein domains, compared to 1-hot word (domain) vectors.
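The OOV degree discussed above can be computed as the fraction of test-set domains never observed in the training vocabulary. The helper below is a sketch of that measure; the function name and the InterPro identifiers are illustrative.

```python
# Sketch of measuring the out-of-vocabulary (OOV) degree over domains:
# the fraction of test-set domain tokens absent from the training vocabulary.
def oov_rate(train_architectures, test_architectures):
    train_vocab = {d for arch in train_architectures for d in arch}
    test_domains = [d for arch in test_architectures for d in arch]
    unseen = sum(1 for d in test_domains if d not in train_vocab)
    return unseen / len(test_domains)

train = [["IPR000001", "IPR000002"], ["IPR000003"]]
test = [["IPR000001", "IPR000004"]]  # IPR000004 never seen in training
print(oov_rate(train, test))  # 0.5
```

A 1-hot model has no usable representation for such unseen domains, whereas pretrained embeddings can still place related domains nearby, which is the behavior observed in the TargetP experiment.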

Conclusions
We presented dom2vec, an approach for learning protein domain embeddings using the domain architectures deposited in InterPro.
We introduced a novel intrinsic evaluation based on four sources of biological knowledge of protein domains. We found that the dom2vec vectors cluster by secondary structure, enzymatic function and GO molecular function; however, the representation could not capture the domain hierarchy, mostly for domain families of low cardinality. This result is in line with findings for word embeddings, where it has been shown that word vectors cluster by semantic and lexical similarity. This byproduct allows us to draw an analogy between words in natural languages and protein domains. Further, this finding supports, in a data-driven way, the accepted modular evolution of proteins (Moore et al., 2008) and enhances the grammar of protein domain architectures (Yu et al., 2019) by giving insights into the "forms" of collocations that can be found in such a grammar.
In the downstream task evaluation, dom2vec significantly outperformed domain 1-hot vectors and state-of-the-art sequence-based embeddings on the Toxin and NEW data sets. For TargetP, dom2vec was comparable to the best-performing sequence-based embedding, SeqVec, for OOV up to 30%. Thus, we recommend using dom2vec for prediction experiments when the OOV is limited; otherwise, sequence-based embeddings can be used.
Finally, we hope that these embeddings will be used for prediction tasks, as well as for creating data-driven hypotheses to augment our understanding of protein domain architectures.

Fig. 1 .
Fig. 1. Summary of our approach, divided into four parts: building two forms of domain architectures, training domain embeddings, and performing intrinsic and extrinsic evaluation of the embeddings. Numbers in parentheses indicate the sections discussing the respective part.

Fig. 2 .
Fig. 2. Non-overlapping and non-redundant domain architectures of the Diphthine synthase protein. Because all domains overlap with the largest one, colored in blue, the non-overlapping sequence is just the single longest domain (IPR035966). All other domains have a unique InterPro id, so the set of non-redundant sequences includes all presented domains sorted by starting position; we colored these other domains in fading hues of gray based on their starting position.

Fig. 3 .
Fig. 3. Histogram of non-overlapping domains per protein, where the number of proteins is on a log10 scale.

Fig. 4 .
Fig. 4. Visualization of domain vectors for five domain superfamilies in the best-performing dom2vec space (V^downstream_emb).

Fig. 5 .
Fig. 5. TargetP, OOV experiment: training on the whole training set and benchmarking on test splits of increasing out-of-vocabulary degree. Toxin and NEW, generalization experiment: training on increasing training splits, 10 replicates each, and benchmarking on the whole test sets. The marked points represent the mean performance on the test set and the shaded regions show one standard deviation above and below the mean.

Table 2 .
Average C^d_nearest Accuracy SCOPe for non-redundant sequences (k=2), best shown in bold

Table 3 .
Average C^d_nearest Accuracy EC for non-redundant sequences (k=2), best shown in bold

Table 4 .
Average C^d_nearest Accuracy GO for non-redundant sequences; when k is not shown, k=2; best shown in bold

Table S7 .
The GO class distribution and C^d_nearest Accuracy GO for non-overlapping and non-redundant sequences are shown, for Malaria, in Tables S6, S7 and S8, respectively; for E.coli, in Tables S9, S10 and S11, respectively; and for Yeast, in Tables S12, S13 and S14, respectively. For Human, the C^d_nearest Accuracy GO for non-overlapping sequences is shown in Table S16.
Average C^d_nearest Accuracy GO for non-overlapping sequences; when k is not shown, k=2; best shown in bold (Malaria)

Table S8 .
Average C^d_nearest Accuracy GO for non-redundant sequences (k=2), best shown in bold (Malaria)

Table S10 .
Average C^d_nearest Accuracy GO for non-overlapping sequences, best shown in bold (E.coli)

Table S11 .
Average C^d_nearest Accuracy GO for non-redundant domains (k=2), best shown in bold (E.coli)