Accurate Prediction of Transcriptional Activity of Single Missense Variants in HIV Tat with Deep Learning

Tat is an essential gene for increasing the transcription of all HIV genes, and affects HIV replication, HIV exit from latency, and AIDS progression. The Tat gene frequently mutates in vivo and produces variants with diverse activities, contributing to HIV viral heterogeneity as well as drug-resistant clones. Thus, identifying the transcriptional activities of Tat variants will help to better understand AIDS pathology and treatment. We recently reported the missense mutation landscape of all single amino acid Tat variants. In these experiments, a fraction of double missense alleles exhibited intragenic epistasis. However, it is too time-consuming and costly to determine the effect of the variants for all double mutant alleles through experiments. Therefore, we propose a combined GigaAssay/deep learning approach. As a first step to determine activity landscapes for complex variants, we evaluated a deep learning framework using previously reported GigaAssay experiments to predict how transcription activity is affected by Tat variants with single missense substitutions. Our approach achieved a 0.94 Pearson correlation coefficient when comparing the predicted to experimental activities. This hybrid approach can be extensible to more complex Tat alleles for a better understanding of the genetic control of HIV genome transcription.


Introduction
Human immunodeficiency virus (HIV) causes acquired immunodeficiency syndrome (AIDS), characterized by a progressive failure of the immune system. It remains an important health problem in the United States with 1,189,700 infected people, 18,489 annual deaths, and an annual medical cost exceeding USD 50 billion [1]. HIV lacks proofreading of its replicated RNA genome and has a high mutation rate of 1 in 10 4 bp, with each virion 9 kB genome having about 10 new variants [2]. Furthermore, an active HIV infection in a single individual is estimated to generate approximately 10 11 virions per day [3]. The combination of high mutation rates with efficient virion generation creates an extremely genetically heterogeneous and diverse viral genome population, which is a key consideration for important pathogenic processes such as antiretroviral therapy (ARV) resistance, latency, and strain evolution. After selective pressure from ARV therapy, variant virions with drug-resistant variants may survive and propagate, limiting therapeutic efficacy. Therefore, it is important to understand how HIV evolves both within a person, and in worldwide populations with relevance to AIDS pathogenesis and treatment [4].
Tat is an essential regulatory gene that drastically enhances the efficiency of HIV genome transcription and replication. The absence of Tat may lead to short and abortive viral transcripts, and Tat variants widely affect different viral activities. Therefore, a comprehensive investigation of various activities of Tat variants can deepen the understanding of AIDS pathology and assist drug design targeting for a broader range of HIV-1 strains.
Studies of variant frequencies, HIV evolution, and small-scale mutagenesis studies have greatly advanced knowledge about ARV drug resistance and how to effectively treat AIDS. Although viral isolates from an infected patient are experimentally tested, the ability to link specific genetic changes with the functions of viral proteins is largely limited to low-throughput experiments, which slowly and incrementally reveal how the vast variant landscape of a typical infection impacts HIV replication, viral latency, drug resistance, and AIDS pathogenesis.
Two high-throughput approaches are available to estimate the effects of the variants: the GigaAssay [5] directly measures a functional readout such as transcription, whereas the alternative multiplex assays of variant effect (MAVEs) are survival screens [6][7][8]. Activities are determined in a GigaAssay by measuring thousands of reads for approximately a million individually UMI-barcoded variant cDNAs. By comparing populations of cDNAs for each mutant to the populations for controls, this GigaAssay approach produces an accurate measurement and classification of Tat transcriptional activity with high confidence.
A previous analysis of Tat with the GigaAssay reported transcriptional activities for all 1615 Tat single and 3429 double missense variants with a~95% accuracy [5]. A total of 35% of all possible single amino acid variants in Tat are loss-of-function. However, it is currently too time-consuming and costly to conduct GigaAssay experiments on millions of variants to complete the Tat double missense mutant landscape. Our aim is to engineer efficient computational approaches to accurately estimate Tat's transcriptional activities of single variants. These tools will next be used in a future extension to predict the activities of double-variant Tat alleles. The novelty of our approach lies in the combination of highaccuracy GigaAssay variant/activity data with deep learning algorithms. The evaluations herein demonstrate the efficiency of our novel approach in predicting single missense mutant activities. This clearly suggests that this approach can likely be extended to predict the effect of more complex variants and possibly for other protein activities.

Overview of the Proposed Deep Learning Framework Called Rep2Mut
We proposed testing a deep learning framework called Rep2Mut to accurately estimate the transcriptional activity of missense variants. The architecture of the Rep2Mut algorithm is shown in Figure 1 using the Tat protein (86 amino acids) as an example. The output of Rep2Mut is the predicted effect upon the transcriptional activity of Tat variants, and the inputs are the wild type (WT) protein sequence, mutated protein sequences (missense variants), and mutated amino acid positions. is an input to fully connected layer 3 ( Figure 1) with a Sigmoid activation function to generate the prediction of transcriptional activity. In total, the Rep2Mut network with the Tat protein has 328,153 trainable weights.

Evaluation of the Proposed Deep Learning Framework Rep2Mut
We evaluated Rep2Mut on the GigaAssay transcriptional activity data for all 1615 single amino acid missense variants in HIV Tat. Layers 1 and 2 of Rep2Mut (in Figure 1) were pretrained on 115,997 single variants of 37 existing protein datasets with different There are several steps in the Rep2Mut algorithm to capture the difference between a WT sequence and its mutated sequences. First, a WT sequence is used as an input of the evolutionary scale modeling (ESM) protein language model [9] to learn the representation of the position of interest in the WT sequence. This WT representation, with a vector of 1280 elements, is then an input of fully connected layer 1 (Figure 1) to generate a vector with 128 elements. Similarly, the corresponding mutated sequence is used as input to the ESM to generate a representation vector of 1280 elements in the mutated sequence. The learned representation is then fed into a fully connected layer 2 ( Figure 1) to generate the other vector with 128 elements. The two vectors of 128 elements are combined by applying element wise multiplication (see Figure 1) followed by concatenation with the position encoding vector of a mutation position. The position encoding vector has N elements (N = 86 for the Tat protein); each element is for one position in the protein sequence. All values are zero except for the mutated position, which is marked by 1. Next, the combined vector is an input to fully connected layer 3 ( Figure 1) with a Sigmoid activation function to generate the prediction of transcriptional activity. In total, the Rep2Mut network with the Tat protein has 328,153 trainable weights.

Evaluation of the Proposed Deep Learning Framework Rep2Mut
We evaluated Rep2Mut on the GigaAssay transcriptional activity data for all 1615 single amino acid missense variants in HIV Tat. Layers 1 and 2 of Rep2Mut (in Figure 1) were pretrained on 115,997 single variants of 37 existing protein datasets with different protein functional measurements ( Figure 1). After adding layer 3, all layers in Rep2Mut were then optimized and fine-tuned on the experimental GigaAssay data. To avoid overfitting, 10-fold cross-validation was repeated 10 times and used to calculate the performance of Rep2Mut.
When the variant activities predicted with Rep2Mut were compared to the GigaAssay results, a Pearson correlation coefficient of 0.94 and a Spearman correlation coefficient of 0.89 were observed. We repeated the analysis with a baseline method and two recently published methods, ESM [9] and DeepSequence [10], compared to Rep2Mut in Figure 2 and Table 1. The baseline method is a feed forward neural network with simple encoding of the variant sequences as input (described in the Methods), and its performance was 0.17 lower than Rep2Mut. For the ESM prediction methods (called ESM_pred, to be distinguished from ESM models), we tested all Tat variant activities of the five trained ESM_pred estimations compared with the experimental activities and calculated the best performance. The predictions and the performance of the averaged prediction by ESM_pred are shown in Figure 2a,b and in Table 1. As expected, the averaged estimation of variants from the five ESM_pred achieved a better performance (0.59 Spearman correlation coefficient) than any of the individual ESM_pred estimations.
For the DeepSequence method, we generated multiple sequence alignments using EVision and retrained DeepSequence as suggested by Riesselman et al. [10]. DeepSequence needs to be retrained for each protein sequence due to the different number of dimensions. DeepSequence generated a higher Pearson correlation coefficient (0.57), but lower Spearman correlation coefficient (0.41) when compared to the ESM_pred prediction of the activities of the Tat variants ( Figure 2c and Table 1). Unfortunately, DeepSequence was only able to generate predictions for 10% of all variant data (see Figure 2c), even after fine-tuning the retraining process with more sequences in the multiple sequence alignments, demonstrating a limitation of DeepSequence for this application.
In conclusion, our Rep2Mut algorithm achieved a much better performance when compared to the state-of-the-art models ESM_pred and DeepSequence ( Figure 2d) and ( Table 1). The Pearson correlation coefficient for Rep2Mut was 0.39 higher than ESM_pred, 0.37 higher than DeepSequence, and 0.18 higher than the baseline method. Likewise, the Spearman correlation coefficient was 0.31 higher than ESM_pred, 0.48 higher than DeepSequence and 0.17 higher than the baseline method. compared with the experimental activities and calculated the best performance. The predictions and the performance of the averaged prediction by ESM_pred are shown in Figure 2a,b and in Table 1. As expected, the averaged estimation of variants from the five ESM_pred achieved a better performance (0.59 Spearman correlation coefficient) than any of the individual ESM_pred estimations. For the DeepSequence method, we generated multiple sequence alignments using EVision and retrained DeepSequence as suggested by Riesselman et al. [10]. DeepSequence needs to be retrained for each protein sequence due to the different number

Effect of Amino Acid Position on Activity Prediction
The activities of Tat variants are partially dependent upon the positions in the Tat protein sequence. Figure 3 shows that the last 20 positions of the Tat protein (C-terminal) have WT GigaAssay activities with some outliers such as K85E, which demonstrate a higher tolerance than the N-terminal amino acids [5]. Therefore, we tested the predictions of the Rep2Mut without a position encoding vector. Compared with Rep2Mut, a modified algorithm without a position vector achieved slightly lower Pearson and Spearman correlation coefficients (0.03 and 0.02 lower, respectively; Table 1). This result suggests that Rep2Mut has no significant overfitting with positional information. of dimensions. DeepSequence generated a higher Pearson correlation coefficient (0.57), but lower Spearman correlation coefficient (0.41) when compared to the ESM_pred prediction of the activities of the Tat variants ( Figure 2c and Table 1). Unfortunately, DeepSequence was only able to generate predictions for 10% of all variant data (see Figure  2c), even after fine-tuning the retraining process with more sequences in the multiple sequence alignments, demonstrating a limitation of DeepSequence for this application.
In conclusion, our Rep2Mut algorithm achieved a much better performance when compared to the state-of-the-art models ESM_pred and DeepSequence ( Figure 2d) and ( Table 1). The Pearson correlation coefficient for Rep2Mut was 0.39 higher than ESM_pred, 0.37 higher than DeepSequence, and 0.18 higher than the baseline method. Likewise, the Spearman correlation coefficient was 0.31 higher than ESM_pred, 0.48 higher than DeepSequence and 0.17 higher than the baseline method.

Effect of Amino Acid Position on Activity Prediction
The activities of Tat variants are partially dependent upon the positions in the Tat protein sequence. Figure 3 shows that the last 20 positions of the Tat protein (C-terminal) have WT GigaAssay activities with some outliers such as K85E, which demonstrate a higher tolerance than the N-terminal amino acids [5]. Therefore, we tested the predictions of the Rep2Mut without a position encoding vector. Compared with Rep2Mut, a modified algorithm without a position vector achieved slightly lower Pearson and Spearman correlation coefficients (0.03 and 0.02 lower, respectively; Table 1). This result suggests that Rep2Mut has no significant overfitting with positional information.

Rep2Mut Sensitivity Analysis for the Fraction of Training Data
The initial evaluation of Rep2Mut predictions used 90% of the variant activity data (n = 1457) to train Rep2Mut, and the remaining 10% for testing. However, for scaling, even with high-throughput wet-lab experiments such as the GigaAssay, acquiring experimental data is too time consuming and cost prohibitive to assess complex variants. Therefore, we evaluated Rep2Mut's performance to identify the minimal amount of training data needed to maintain near-maximal performance. We trained Rep2Mut with 70, 50, 30, 20, 10, and 7% of the Tat activity single missense variant dataset, and tested Rep2Mut performance with the remainder of the data. To reduce errors from random sampling, we split, trained, and tested those predictions 50 times, calculating an average performance for all tests (Figure 4).

Rep2Mut Sensitivity Analysis for the Fraction of Training Data
The initial evaluation of Rep2Mut predictions used 90% of the variant activity data (n = 1457) to train Rep2Mut, and the remaining 10% for testing. However, for scaling, even with high-throughput wet-lab experiments such as the GigaAssay, acquiring experimental data is too time consuming and cost prohibitive to assess complex variants. Therefore, we evaluated Rep2Mut's performance to identify the minimal amount of training data needed to maintain near-maximal performance. We trained Rep2Mut with 70, 50, 30, 20, 10, and 7% of the Tat activity single missense variant dataset, and tested Rep2Mut performance with the remainder of the data. To reduce errors from random sampling, we split, trained, and tested those predictions 50 times, calculating an average performance for all tests (Figure 4). reduction to 20 or 30% of the training data only reduced the performance by 0.03. The performance further decreased by another ~0.02 when 10% of the data was used for training. Surprisingly, Rep2Mut achieved more than a 0.80 or 0.82 Spearman correlation coefficient when only 7% (113 variants) or 10% of the data, respectively, were used for training. This will make the combination of deep learning with the GigaAssay more scalable for multivariant alleles because as little as ~20% of the variant data can generate predictions of variant effect with little compromise in performance.

Visualization of Predicted Vectors
To better understand how Rep2Mut predicts variant effects, we used combined vector after dot product in Figure 1 of all variants to create a global map after dimension reduction to a 2D space using uniform manifold approximation and projection (UMAP) [11]. We then investigated the resulting 2D map for correlations with GigaAssay activity, variant position, and the physiochemical types of amino acids ( Figure 5). Figure 5a,b clearly demonstrate that variants with different experimental activities had a smooth distribution from the right to the left, with or without position encoding. Variants with high activities were in the left half, while most variants with low experimental activities were in the right half for both Figure 5a  As expected, reduced Rep2Mut performance was observed with smaller training sampling. However, the performance (Spearman correlation coefficient) only showed a minimal reduction when Rep2Mut was trained with 50% or more of the data. Further reduction to 20 or 30% of the training data only reduced the performance by 0.03. The performance further decreased by another~0.02 when 10% of the data was used for training. Surprisingly, Rep2Mut achieved more than a 0.80 or 0.82 Spearman correlation coefficient when only 7% (113 variants) or 10% of the data, respectively, were used for training. This will make the combination of deep learning with the GigaAssay more scalable for multivariant alleles because as little as~20% of the variant data can generate predictions of variant effect with little compromise in performance.

Visualization of Predicted Vectors
To better understand how Rep2Mut predicts variant effects, we used combined vector after dot product in Figure 1 of all variants to create a global map after dimension reduction to a 2D space using uniform manifold approximation and projection (UMAP) [11]. We then investigated the resulting 2D map for correlations with GigaAssay activity, variant position, and the physiochemical types of amino acids ( Figure 5). Figure 5a,b clearly demonstrate that variants with different experimental activities had a smooth distribution from the right to the left, with or without position encoding. Variants with high activities were in the left half, while most variants with low experimental activities were in the right half for both Figure 5a

Association of Amino Acid Types with Tat Activity Predictions
Figure 5e-j show three different groups of WT amino acid types in 2D space, with and without the position vector. In all subplots, there was a clear distribution of experimental activities from right to left. Interestingly, there was no such pattern in the 2D space with different groups of mutated amino acids, suggesting non-randomness of the WT amino acids at each position.

Outliers in Rep2Mut Prediction
To better understand incorrect predictions, we analyzed the activity prediction outliers. In Figure 2d and Table 2, we annotated those variants whose predicted activities were 0.3 larger or smaller than the experimentally-determined activities (n = 20). We split them into two groups: overestimation if the predicted activities were 0.3 larger than the experimental activities, or underestimation if predicted activities were 0.3 smaller than experimental activities. We chose the 0.2 to 0.3 range because this is the approximate error rate for activities determined for UMI barcodes in the GigaAssay [5]. In Table 2, we also listed the mean, maximum, and minimum of the predicted and GigaAssay activities for the positions of the outlier predictions.
In all outlier predictions, the average predicted activities of each mutated position were very similar to the averaged experimental activities for that position. The majority of the outlier overestimations had very low experimental activities among the 19 variants for each position, while all of the underestimated outliers had the highest experimental activities. In particular, K12P and K85E, which were overestimated by Rep2Mut, had significantly lower experimental values when compared to other variants at the same positions (K85E: 0.16 vs. >0.72 for other K85 variants; K12P: 0.12 vs. >0.5 for other K12 variants). Several other variants such as P3R, P6K, K12P, and R7P all involved a proline substitution, suggesting that the unique nature of proline might not be captured by the deep learning algorithm.
Curiously, all overestimated and underestimated outliers were in the Cyclin T1 interaction site defined in a structure of the Tat:Cyclin T1 complex [12]. Visualization of this structure (PDB: 4OR5) with PyMOL identified M39K and F32H in two α-helices (27-32 and 34-42) of the Tat protein, and the majority of these mutated positions interacted with Cyclin T1 (Figure 6). According to the accessible surface area calculated by RDBePISA, many of the mutated positions had a buried accessible surface area >40 Å 2 when the Tat protein binds with Cyclin T1. The structure analysis also demonstrates that the outliers Q35R, Q17V, and E2A have hydrogen bonds with Cyclin T1.
One potential explanation of the observation of the erroneous predictions for substitutions at the Cyclin T1 interface is that Cyclin T1 accessibility or interactions may differ in the cell lines used in the GigaAssay (the cells used for the GigaAssay express Cyclin T1 [13]). An alternative explanation could be that our predictions do not have data that considers the structural complexity of the protein-protein interaction in this region. Another possibility is that these amino acids are important for a presently unknown interaction.  substitutions at the Cyclin T1 interface is that Cyclin T1 accessibility or interactions may differ in the cell lines used in the GigaAssay (the cells used for the GigaAssay express Cyclin T1 [13]). An alternative explanation could be that our predictions do not have data that considers the structural complexity of the protein-protein interaction in this region. Another possibility is that these amino acids are important for a presently unknown interaction.

Dataset
The dataset used for testing was made up of single variants of HIV Tat proteins, generated by GigaAssay [5]. This Tat protein is composed of 86 amino acids, and each position except the first amino acid is mutated individually to 19 other amino acids besides the WT amino acid. In total, there are 1615 single missense variants of the Tat protein. Each missense variant was sequenced with more than five barcodes, and the transcriptional activity of each variant was calculated by the GigaAssay. The effect of a single variant was estimated with a value ranging from 0 to 1, where the larger the value, the less effect of the single variant on the transcriptional activities of the Tat protein.

Rep2Mut Framework to Estimate Tat Variants' Effect on Transcriptional Activities
Rep2Mut is a sequence-based prediction of the effect of variants on transcriptional activities measured by GigaAssay. As shown in Figure 1, the input of Rep2Mut includes three types of sequence information. One is the WT sequence, and the other is the mutated sequence with a substitution of an amino acid at a position of interest. For either the WT or mutated sequence, we used evolutionary scale modeling (ESM) [9] to learn the representation of the mutated position.
ESM [9,14] is a self-supervised learning framework that was trained on millions of protein sequences to learn multiple levels of protein knowledge from biochemical properties to evolutionary information. It is composed of multiple transformer layers and trained using the masked language modeling objective [15]. Usually, the learned representation at the 33 rd layer is used to predict the diverse functions of proteins. ESM-1v [9] is a 34-layer transformer trained on a UniRef90 dataset [16] with five released pretrained models. We used the first pretrained model, and fed WT or mutated sequences into it. We used the learned vector of the position of interest at the 33 rd layer to represent WT or variant information. This learned vector has 1280 elements.
Each of the learned representation vectors from ESM-1v was used as the input of a fully connected neural network layer with a vector of 128 elements as the output (as shown in Figure 1). The PReLU activation function [17] was applied to the layers with a dropout rate of 0.2 to avoid overfitting [18]. The two 128-deminsion vectors were then merged with an elementwise dot product. The element-wise product (or the Hadamard product) is a binary operation that takes two matrices of the same dimensions as the input and produces another matrix of the same dimension as the operands. In other words, given two matrices A m,n and B m,n , of the same dimension m × n, the elementwise product A B = (A) ij (B) ij , where 0 < i ≤ m and 0 < j ≤ n.
The third type of input to Rep2Mut was a mutated position. A position was encoded into a binary vector of N elements each of which corresponds to a position in the protein sequence of interest. This encoding vector only has one value of 1 at the mutated positions for variants and 0 for all the other positions. This position encoding vector is then concatenated with the dot-product vector and used as the input of another fully connected neural network to predict the transcriptional activity. The prediction is normalized with a Sigmoid activation function so that the output value ranges from 0 to 1.

Training and Testing Rep2Mut
There are two steps to train Rep2Mut. First, we pretrained layers 1 and 2 (as shown in Figure 1) on another 37 protein datasets with various measurements of protein functions. The pretraining was used to optimize weights in the two neural networks. Afterward, we added layer 3 in Figure 1, and fine-tuned Rep2Mut to predict the GigaAssay activities. In both the pretraining and fine-tuning processes, we used the Adam optimizer [19] and MSE loss function in back-propagation. MSE is defined in Equation (1) where n is the number of data points, Y i is the observed activities, andŶ is the predicted activities.
During the fine-tuning process, we used a batch size of 8, and a learning rate of 1 × 10 −3 for layer 3. Since layers 1 and 2 were optimized during the pretraining step, we used a smaller rate, and the learning rate for layers 1 and 2 was 1 × 10 −5 .
To compare Rep2Mut with the other methods, we used 10-fold cross validation. We randomly split the GigaAssay data into 10 groups, each of which had 10% of the experimental data. Each time, a group was used for testing and the remaining data for training. We repeated this process ten times and obtained the mean of the performance to evaluate Rep2Mut.
To test the performance of Rep2Mut with different sizes of training data, the variants in the dataset were shuffled and then split into two sets called the training (90%) and test (10%) set. Rep2Mut was learned on training data and evaluated on test data using the Pearson and Spearman's correlation coefficients defined below. To avoid random splits, the process above was repeated 50 times, and the averaged performance was calculated for a final evaluation.

Evaluation Measurements
We used the Pearson and Spearman's correlation coefficients to measure the performance of each tested method. We used the python package scipy to calculate both the Pearson and Spearman's correlations for the prediction activities of a method. In detail, let X be the GigaAssay activities of a list of variants, and Y be the predicted activities of the same list, and then the Pearson correlation coefficients (PCC) are calculated using Equation (2), where p is the Pearson correlation coefficient, x i is the ith observed values in X, x is the mean of X, y i is the ith predicted values in Y, andȳ is the mean of Y.
Likewise, the Spearman's rank correlation coefficients (sPCC) were estimated with Equation (3) where sp is the Spearman's rank correlation coefficient; R( * ) is the ranking of items in * ; cov(R(X), R(Y)) is the covariance of X and Y; σ R(X) is the standard deviation of X; σ R(Y) is the standard deviation of Y.
4.5. How to Use ESM to Predict Tat Variants' Activities ESM [9] has a diverse capability to estimate proteins' activities and functions. Here, we used ESM (called ESM_pred so that it is different to the ESM released models) to estimate the GigaAssay activities that were not previously conducted. To determine the effect of the variants, the probability of each amino acid type at a position of interest is predicted in ESM_pred, and the variant effect is calculated based on the logarithmic ratio of the probability between the mutated amino acid and the WT amino acid in Equation (4), where T is the set of mutated positions; x \T is the masked input sequence; p x t = x mt t x \T is the probability assigned to the mutated amino acid x mt t ; p x t = x wt t x \T is the probability assigned to the wildtype.
As recommended by ESM [9], five released ESM models (ESM-2 Public Release v1.0.3: esm1v_t33_650M_UR90S_1, esm1v_t33_650M_UR90S_2, esm1v_t33_650M_UR90S_3, esm1v_t33_650M_UR90S_4, and esm1v_t33_650M_UR90S_5) were used individually to predict the transcriptional effect after Tat variants. sPCC was then calculated for each model. In addition, the average prediction for the transcriptional effect of each Tat variant was determined by combining the predictions of five models, and estimated using sPCC.

How to Test DeepSequence on Tat Variants
DeepSequence [10] is a generative, unsupervised latent variable model to estimate the effects of the variants on biological sequences across a variety of datasets with deep mutational scanning. The model was learned in an unsupervised manner solely from sequence information, and grounded with biologically motivated priors, revealing a latent organization of sequence families. There are three steps to run DeepSequence. First, a protein sequence of interest was used as input of multiple sequence alignment (MSA) tools to generate multiple sequence alignments. We used the recommended tools by DeepSequence, EVcoupling from the website v2.evcouplings.org. DeepSequence [10] suggests the use of a bit score of 0.5 bits/residue as a threshold to generate MSA results. However, MSA results of the Tat protein with this score generated only 1.7 Seqs/L with 123 sequences, which was not enough to train DeepSequence. We thus tested two bit scores: 0.3 bits/residue with 1645 effective sequences and 23.8 Seqs/L as well as 0.25 bits/residue with 7871 effective sequences and 110.9 Seqs/L. Second, DeepSequence was trained with the sequences from MSA. Although DeepSequence is a generative model, each protein sequence requires a different model. We used MSA sequences with each bit score to retrain DeepSequence to generate separate models. Finally, the retrained models were used to predict the effect of the variants. However, both models predicted only~10% (114) of the single variants, although 0.3 bits/residue produced better results.

The Framework of a Baseline Method
A simple baseline method was also designed and compared with Rep2Mut. This method uses the one-hot encoding of amino acids at each position as the input, and has three fully connected layers of feed forward networks: the input is a vector of 1720 elements, the first hidden layer generates a vector of 860 elements, and the second generates a vector of 256 elements. The output is the predicted activity for a variant. This method was trained with a batch size of 16 as well as a learning rate of 5 × 10 −4 , and tested with a similar strategy to Rep2Mut.

Conclusions
We designed a deep learning-based method that only used protein sequences to accurately predict the transcriptional activities of experimentally-determined Tat variants. With the representation learning from protein sequence models, our approach achieved a 0.94 Pearson correlation coefficient. This demonstrates that our deep learning-based method can precisely estimate the transcriptional activities of proteins with various variants and has great potential to be extended to complex mutations and other protein sequences. Although we used supervised learning while state-of-the-art methods such as ESM and DeepSequence models use unsupervised training, the superior performance makes our approach more promising for new applications. In particular, our method, trained on as little as 20 or 30% of data, was able to achieve a much better performance than the state-of-the-art methods, demonstrating its potential application on other proteins with limited training data. We plan to extend our methods to complex variant alleles and to other proteins for human disease studies.