Identification of Clinically Relevant HIV Vif Protein Motif Mutations through Machine Learning and Undersampling

Human Immunodeficiency virus (HIV) and its clinical entity, the Acquired Immunodeficiency Syndrome (AIDS) continue to represent an important health burden worldwide. Although great advances have been made towards determining the way viral genetic diversity affects clinical outcome, genetic association studies have been hindered by the complexity of their interactions with the human host. This study provides an innovative approach for the identification and analysis of epidemiological associations between HIV Viral Infectivity Factor (Vif) protein mutations and four clinical endpoints (Viral load and CD4 T cell numbers at time of both clinical debut and on historical follow-up of patients. Furthermore, this study highlights an alternative approach to the analysis of imbalanced datasets, where patients without specific mutations outnumber those with mutations. Imbalanced datasets are still a challenge hindering the development of classification algorithms through machine learning. This research deals with Decision Trees, Naïve Bayes (NB), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). This paper proposes a new methodology considering an undersampling approach to deal with imbalanced datasets and introduces two novel and differing approaches (MAREV-1 and MAREV-2). As theses approaches do not involve human pre-determined and hypothesis-driven combinations of motifs having functional or clinical relevance, they provide a unique opportunity to discover novel complex motif combinations of interest. Moreover, the motif combinations found can be analyzed through traditional statistical approaches avoiding statistical corrections for multiple tests.


Introduction
Human immunodeficiency virus (HIV) and its clinical entity, the Acquired Immunodeficiency Syndrome (AIDS) continue to represent an important health burden worldwide. Since the first reports of HIV more than 35 years ago, 78 million people have been infected with HIV and 35 million have died from AIDS-related illnesses. In 2021, approximately 1.5 million people contracted HIV and 650,000 people died from HIV-related diseases (UNAIDS, https://www.unaids.org/en, accessed on 28 October 2022. Although the overall number of new infections has declined since 2010, the resource limited countries of Latin America, Asia, and Africa have shown a steady increase in new infections and excess deaths due to HIV [1]. Different strategies have been employed in the fight against HIV and AIDS, mostly focused on either preventative measures or the development of novel anti-retroviral drugs targeting the main viral enzymes involved in HIV replication [2]. On the other hand, current HIV research efforts continue to focus on increasing our understanding of viral-host interactions at the molecular level, with the aim to discover those worth exploiting to interfere with viral tropism, fusion, replication, integration, and transmission.
Our understanding of the function of some viral proteins such as the protease, reverse transcriptase, and integrase enzymes has allowed for the development of potent preventative and therapeutic strategies [3]. However, for some accessory and non-structural viral proteins, little is known with regards to the function and their potential as candidate targets for antiviral drug development. While the use of molecular biology techniques allows for an estimation of functional or clinical relevance of these proteins, complex genetic and clinical variable comparisons decrease the statistical power of such studies.
The HIV genome has 9719 base pairs (HXB2 reference strain) and a total of 3 open reading frames encoded in a prototypical lentivirinae genome organization comprised of gag, pol, and env genes, long terminal repeat regions (LTRs) and accessory-proteinencoding regions (Vif, vpr, tat, rev, vpu, and nef ). The gag gene encodes for the matrix, capsid, nucleocapsid, and p6 proteins, pol encodes for the enzymes protease, reversetranscriptase, and integrase and env encodes for the glycoproteins GP41 and GP120. The different aforementioned accessory proteins facilitate or promote HIV replication and viral fitness. The best studied accessory proteins include tat (which acts as viral transcriptional transactivator), rev (which regulates RNA trafficking), and nVifef which promotes viral maturation and release from the host cell [4,5]. Vif is a 192-amino acid HIV accessory protein essential for replication. Vif protein counteracts human antiviral proteins of the APOlipoprotein Bmessenger RNA Editing enzyme, Catalytic polypeptide-like (APOBEC3) family. APOBEC3 proteins are zinc-dependent deaminases which mutate viral cytidine (dC) to uridine (dU) in both viral DNA and RNA molecules, thus interfering with the fidelity of the viral genome. APOBEC3 is a host innate mechanism that protects human cells from exogenous viruses and endogenous mobile retroelements. The Vif protein allows HIV to evade such innate mechanisms. This viral protein has recently become a candidate target for both therapeutic and preventive interventions in HIV/AIDS. Nevertheless, little is known about the clinical relevance of Vif accessory protein, particularly among HIV-infected patients of developing countries and Latin America [6].
Members of the human APOBEC family of proteins include APOBEC1, APOBEC2, APOBEC3, and the poorly expressed APOBEC4. The APOBEC3 subfamily has seven known members, including APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3DE, APOBEC3F, APOBEC3G, and APOBEC3H. Among all APOBEC3 subfamily members, APOBEC3G is notable for exerting the strongest antiviral effect [7]. APOBEC3G is incorporated into the HIV-1 virions as they emerge from an infected cell when HIV-1 lacks the capacity to encode for Vif protein. During the second round of viral replication, after infecting a second cell, APOBEC3G would normally cause extensive dC to dU mutations of the single-stranded viral DNA during reverse transcription [8]. HIV's Vif protein inhibits and interferes with APOBEC3G activity and thus renders the virus immune to this important innate immunity. However, HIV-1 evolution and quasi-species diversification within a single human being might lead to the accumulation of mutations in the Vif region, which might affect protein function and have clinical significance by either decreasing viral replication or affecting integration and transmission.
The use of machine learning approaches has been extensively applied to the search of statistical associations between genetic and clinical variables during the last years given their known capacity at tackling high dimensional data [9,10]. Previously, some research groups have applied combined algorithm based approaches, such as ANN coupled to genetic algorithms, grammatical evolution, and genetic programming, to the discovery of genetic associations and classification [11][12][13][14][15]. Other combined-algorithm approaches have been SVMs with genetic algorithms [16] and ANNs coupled to Rule Association Mining (Apriori algorithm) [17]. Although combining different machine learning approaches does not guarantee better performance, there is ample evidence supporting the statistical benefits and capabilities at discovering novel genetic associations in the context of infectious diseases [18,19].
One important factor in assessing the importance of different genetic variables mentioned in previously published studies is their combined effect on classification performance.
We previously applied this approach to the study of HIV's Vif gene mutations by using four different machine learning approaches for the discovery of clinical endpoint associations [20]. A mayor caveat to our previous effort was the availability of an imbalanced dataset arising from the difficulty in collecting large cohort samples and extensive genetic data. Data imbalance is a fundamental and challenging problem in machine learning that limits the power of small clinical datasets. This limitation has also been shown to be present in other non-medical applications such as fraud detection, finance, ecology, and biology [21,22]. As such, in this study we set forth to evaluating the performance of state-of-the-art machine learning approaches (Decision Trees, NB, SVMs, and ANNs) enhanced with an undersampling process for dealing with the data imbalance in the dataset. Furthermore, we present a probabilistic method capable of suggesting the most clinically relevant variable combinations associated to clinical outcomes.
The paper is organized as follows: Section 2 describes the dataset and the undersampling approach. The methods are presented in Section 3, followed by the results and conclusions sections.

Dataset
For the purpose of this study we relied on a previously consolidated dataset including Vif protein amino acid physicochemical changes and clinical outcome variables (CD4 T cell numbers and HIV viral load at both initial diagnosis and on follow-up) [23]. From the original 192 amino-acid sites conforming the Vif protein, those pertaining to 17 protein motifs were encoded into binary data as either conserved or mutated, as described previously [20]. Eight of the 17 variables representing Vif protein domains are known to interact with APOBEC3 proteins (herein designated as APOBEC-1 to APOBEC-8). Other motifs considered in this study include the Nuclear Localisation Inhibitory Signal (NLIS), two (CBFβ-1 and -2) interaction sites as well as three Cullin-5 binding sites (Cul5-1, Cul5-2, and Cul5-3). When the different Vif motif sequences implied a non-conservative change in physicochemical properties, the genetic variable for that motif was encoded as a "1", and when the site was conserved it was encoded as "0".
The values for the clinical endpoints (outcome class) were encoded based on thresholds recommended by the World Health Organization and the U.S. Centers for Disease Control and Prevention. The CD4Ini and CD4Hist clinical endpoints reflect the levels of CD4+ T cells number (cells/per micro liter) at the first time of diagnosis (CD4Ini) and as the median number of CD4+ cells from quarterly assessments during two years of patient follow-up (CD4Hist). For both CD4Ini and CD4Hist, ≥ 500 CD4+ T cells/µL corresponds to a value of "0", as CD4+ T cell numbers above this threshold are not indicative of poor clinical prognosis. Contrarily, the clinical endpoint is encoded as "1", when ≤ 500 CD4+ T cells/µL when the cell numbers are below normal and reflecting immunodeficiency. Similarly, VLIni and VLHist outputs reflect another clinical aspect used to assess HIV-prognosis, where high viral loads are associated with worsening clinical progression. As mentioned above, VLIni and VLHist reflect HIV viral titres at the time of initial diagnosis and the median of quarterly follow-up assessments of viral load (copies/milliliter). For both VLIni and VLHits ≥ 10, 000 copies/mL/µL corresponds to a value of "1", as viral loads above 10,000 cp/mL are suggestive of intense viral replication and worsening clinical prognosis. Contrarily, this value is encoded as "0", when ≤ 500 copies/ml/µL when the viral load is below 10,000 cp/ml and stable [24].

Undersampling
In the case of binary classification, the class-imbalance is defined as the over representation of one class (the majority class) over another class (the minority class). Over representation affects the learning process of the algorithms as most of them are designed to construct the most general and simplest hypothesis from the data [25]. Undersampling can lead to a bias towards the over-represented class during the learning process. Different approaches have been used to resolve the problem of undersampling, which range from applying data balancing strategies (either undersampling or oversampling), modifying the machine learning process to address data imbalance or through data penalization to enhance minority class attribute detection [26]. Undersampling balancing strategies are the most popular approach as they are based on the original dataset, whereas oversampling requires the generation of artificial data, derived from the original dataset but not necessarily true in content [27].
As the use of oversampling involves the generation of artificial data, in this work we decided to use an undersampling approach to better preserve the biological distribution of genetic variables and clinical endpoints of our dataset. Figure 1 describes the undersampling process. The original dataset contains m + n examples where n is the minority class and m is the majority class. The algorithm identifies the least represented class (i.e., n) and then creates a new balanced dataset by subtracting m class elements until it is similar in size to n class subset. These undersampled balanced sets are generated 100 times (1, 2, . . . , p), and each one is used for machine learning and training.
the data by modifying the learning process. Finally, the penalty process invo nalty function to help for a better learning of the minority class. Haixiang et a tensive review on these problems and approaches [23].
Since the use of oversampling involves the generation of artificial data, in this w use undersampling. The goalbeing to avoid generating genetic relations that ist in nature. Figure 1 shows the undersampling process. The original dataset amples. The process determines the class with the minimum number of examples presented class. Then, 1, 2, . . . , p balanced datasets are generated containing the amples per class (i.e. n = m).

Methods
This paper compares the performance of well known machine learning methods f ecision Trees, NB, Multi-Layer Perceptron (MLP) and SVM.

Decision trees
Decision trees represent the simplest and most widely used non-parametric su g method. There are many algorithmic implementations to generate decision cluding Iterative Dichotomiser 3 (ID3) [24], its successor -C4.5, Classification

Methods
This paper compared the classification performance of the well-known machine learning methods: Decision Trees, NB, SVMs, and Multi-Layer Perceptron (MLP).

Decision Trees
Decision trees represent the simplest and most widely used non-parametric supervised learning method. There are many algorithmic implementations to generate decision trees from data including Iterative Dichotomiser 3 (ID3) [28], its successor-C4.5, Classification And Regression Tree (CART), Chi-square Automatic Interaction Detection (CHAID), and Multivariate Adaptive Regression Splines (MARS). This paper focus only on the CART implementation [29] available in Scikit-learn [30,31].
For CART, the use of the Gini index and a max depth of five were used as predefined parameters, as they provided a similar performance to the C4.5 algorithm. Contrary to C4.5, CART helped identify the most significant variables and to eliminate non-significant ones [32].

Multinomial Naïve Bayes
NB classifiers include several highly-scalable and simple probabilistic classifiers that rely on Bayes theorem with strict independence assumptions between features. When coupled with kernel density estimation they can achieve elevated classification accuracy levels [26].
The NB classifier is defined as: where p(v 1 , v 2 , . . . , a i , . . . , v 17 |class j ) = ∏ i p(v i |class j ), because this classifier assumes that the variables, v i are conditionally independent, given the class, and class j ∈ C are the classes or labels [33]. NB usage relied on calculations of the prior probabilities and estimation on the prior probabilities.

Multi Layer Perceptron (MLP)
MLP is based on classical ANN models, in particular the Perceptron introduced by F. Rosenblatt in 1957 [34]. MLP architecture is a more complex ANN where at least one or more hidden layers are included before the clinical endpoint variable layer [35]. MLP is also known as backpropagation [36][37][38][39], a generalization of the delta rule learning algorithm proposed by B. Widrow in 1962 [40]. MLPs are also referred to as feedforward neural networks. Figure 2 illustrates a general MLP architecture with v 1 , v 2 , . . . , v 17 input variables (green), a hidden layer (blue) and a single clinical endpoint (red). There is a single MLP for each of the clinical endpoint variable classes: CD4Ini, CD4Hist, VLIni, and VLHist.  For MLP training, we use the logistic activation function, a hidden layer with 8 neurons, 2 outputs, and 10,000 epochs with the Limited-memory BFGS algorithm (the Broyden-Fletcher-Goldfarb-Shanno algorithm), which is a method for numerical optimization [41].

Support Vector Machine (SVM)
SVMs are state-of-the-art algorithms initially introduced by Cortes and Vapnik as support-vector networks [42,43]. SVM were developed in an effort to develop artificial intelligence strategies for complex problems. SVM have mostly been applied to classification or regression problems. For classification purposes, SVMs aim to produce a mathematical n-dimensional space function capable of non-linearly distinguishing between different classes from complex and multivariate (training and test) datasets Given a dataset D = {(x 1 , y 1 ), · · · , (x l , y l )}, where x ∈ R 17 (inputs), y ∈ {−1, +1} (clinical endpoint), and l is the size of the dataset. The SVM classifier is defined as which is a linear combination of kernels, K(x i , x), where the sign function (sgn) gives the class [42]: with constrains, 0 ≤ α i ≤ C, i = 1, · · · , l, and ∑ l j=1 α j y j = 0. The parameter C is known as the margin and the Support Vectors (SV) will have non-zero Lagrange multipliers, α i ; K(x i , x j ) is the kernel function performing the non-linear mapping into feature space φ, known as the "kernel trick" [26,42,43].
There are many kernel functions available for use with SVMs including linear, Gaussian Radial Basis Function (RBF), sigmoid, and polynomial. Our approach made use of the RBF kernel, where the width of a kernel is given by the γ parameter.
Across this research, SVMs used RBF as kernel with the following values: C = 10 and γ = 1.0.

Methods for Assessing the Relevance of each Vif Variable
In order to assess the relevance that the different Vif variables (input) have on each of the included clinical endpoint variables (output), a series of steps were used, including:
Constructing input variable combinations of less than 10 in size (k); 3.
Identifying the variable combinations of each balanced datasets providing the best classification performance; 4.
Calculating the relevance of each variable through a probabilistic approach, and; 5.
Optimizing the selection of the most relevant variables by using a threshold value.
For the first step, balanced datasets are generated through undersampling by creating p partitions, which include all elements of the minority class (n) and an equal number of randomly selected elements of the majority class (i.e., n examples out of m), as shown in Figure 1. After producing balanced datasets, a second step addresses the construction of k size variable combinations by using each of them as input in different classification algorithms. For this, a five-fold cross-validation training process using weighted accuracy was used. The construction of the variable combinations relied on using greedy step-wise variable selection, as shown in Figure 3, in such a way as to identify the best variable capable of discriminating between the clinical endpoint classes. This process was repeated for a second variable in combination with the first identified and the process was repeated k-times so as to identify the k best variable combinations available.      where p(v 1 , v 2 , . . . , a i , . . . , v 17 |class j ) = i p(v i |class j ), because this classifier assumes that the variables, v i are conditionally independent, given the class, and class j ∈ C are the classes or labels [28].

Multinomial NB
There are several versions of the NB classifier including Bernoulli, Multinomial and Gaussian versions. Since we have discrete categorical variable s with possibly more than two distinct values, this paper employs the Multinomial version which includes the Laplace estimation to avoid the zero frequency problem.

Multi-Layer Perceptron
This method is based on classical ANNs models, in particular from the Perceptron (a single output layer ANN) introduced by F. Rosenblatt in 1957 [29]. A MLP architecture is a more complex ANN with at least one or more layers (called hidden layers) before the output layer [30]. The MLP is also known across the literature as backpropagation [31,32,33,34], which is the generalisation of the delta rule (learning algorithm) proposed by B. Widrow in 1962 [35]. The MLPs are also called feed forward neural networks. Figure 4 show the general MLP architecture A third step involved discovering the best k combinations for each p balanced dataset. As the discovery of a global optimum is not guaranteed, a reasonably good local optimum (based on classification performance) was used, as shown in Figure 4. Global optimums are not realistically feasible as the search space exponentially explodes with k.
Select the q th variable combination with best classification performance, with q ≤ k Repeat on the resulting combination from the p balanced datasets . . . . . .

Variable Combination Variables
Classification Performance with v 1 , v 2 , . . . , v 17 inputs, a hidden layer and a single output. There is a MLP per output/class, i.e. CD4Ini, CD4Hist, VLIni and VLHist.

Support Vector Machine
Given a dataset , and l is the size of the dataset The SVM classifier is defined as 6 Figure 4. The selection of the overall-best combinations for each p balanced dataset by using their classification performance.
In a fourth step, variable relevance assessment is achieved using the p best combinations through a probabilistic approach. For this, the probability of each input Vif variable appearing at jth position on the variable combination matrix produced in the previous step is calculated using Equation (3).
where p(v j i ) indicates the probability that the ith variable was selected at the jth position of the generated combinations. The frequencies for the variable and that of the different variables at the position jth are expressed as f (v j i ). This equation is applied for each one of the k positions (j ≤ k). These probabilities define the relevance score (r) for each variable by using Equation (4): where r i indicates the relevance score for the variable ith, considering its probability of appearing on each of the k positions in the combination matrix. This process assigns greater weight to the variables that are found closest to the root (lower entropy) of the combination matrix and less weight to those that appear farther from the root (higher entropy). In a fifth step, the relevance scores obtained in the previous step are then used for sorting the variables considering their relevance scores and by establishing a threshold value (which involves calculating the upper limit of a 99% confidence interval of their relevance scores) to determine the most relevant variables (those surpassing the threshold limit).

MAREV-1
The first Method for Assessing the Relevance of Each Variable (hereafter called MAREV-1) considers the classification results produced by each algorithm (CART, Multinomial NB, SVMs, and MLP) on p = 100 balanced datasets. This yielded a total of 400 variable combinations having the highest classification performances, all of which were then tested further, including traditional statistical analysis, as mentioned below, see Section 3.5.3.

MAREV-2
The second method, MAREV-2, selects only the best variable combinations assessed as classification performance for each algorithm (the third step described above), see Section 3.5. This yielded four input variable combinations, one per algorithm. Again, as mentioned above for the score assessment on each variable, all were then tested through the following traditional statistical analysis.

Hypothesis Evaluation on the MAREV-1 and MAREV-2 Approaches
Once the most relevant variables had been identified in the previous steps, subsequent analysis involved establishing the clinical importance of the different machine learning algorithm-suggested variable combinations and their status (Mut or Cons) through traditional statistical association methods. For this, the Vif protein conserved sites, synonymous amino acid substitutions, or those being non-synonymous but conserved in physicochemical properties were encoded as "0" (Cons in the following discussion, figures, and tables). Contrarily, mutations leading to non-synonymous amino acid substitutions resulting in non-conserved physicochemical properties of the Vif protein (polar to non-polar changes, acidic to basic changes, gross molecular structure size changes, as well as changes in susceptibility to post-translational modifications such as phosphorilation, ubiquitination, SUMOylation, methylation, and glycosylation) were encoded as "1" (Mut). The definition of explicit variable-value combinations used the ID3 algorithm as implemented in the Waikato Environment for Knowledge Analysis (WEKA) workbench v3.6 [44]. ID3 was used for generating a decision tree for each clinical endpoint relying on tree branches to incorporate variable status (Mut or Cons) combinations. The calculation of the statistical significance of variable frequency differences between clinical endpoint groups relied on two-sided Fisher's exact test using IBM SPSS Statistics (version 21, IBM Corporation, Armonk, NY, USA).

Results
The position of the Vif encoding region within the HIV-1 reference sequence HXB2, and the position and nomenclature of the Vif protein motifs and their putative ligands, is provided in Figure 1. The APOBEC-1 variable, corresponding to the N-terminal APOBEC3 binding site ( 14 DRMR 17 ), was excluded from the original dataset as it remained conserved.

Classification on the Balanced Datasets
The assessment of the relevance of each variable, as explained in Section 3.5, was based on the classification performance from four different classifiers (CART, MLP, SVMs, and Multinomial-NB) as implemented in the Scikit-learn package [30].
We have identified the top 100 variable-combinations associated to each clinical endpoint class by applying the proposed method to assess variable relevance. We obtained 1600 top-performing genetic variable-combinations associated to each clinical endpoint (CD4Ini, CD4Hist, VLIni, and VLHist) using the four classification algorithms. The balancedaccuracy was calculated with a 5-Cross-Validation approach during each training process. Algorithm accuracy was defined as the correct identification of both true positive and true negative registry examples (patients) and encompasses true-positive and true-negative predictive rates.
Out of the four machine learning algorithms tested, MLP superseded the three other machine learning algorithms during the analysis of each of the four clinical endpoints, accurately classifying, 79.6%, 76%, 68.5%, and 66.3% of CD4Ini, CD4HIts, VLIni, and VLHist patient registries, respectively. The classification performance of each machine learning algorithm for each clinical endpoint is summarized in Table 1. Although the best classification results achieved higher values than those previously reported elsewhere [20], this can easily be explained by the use of balanced datasets and 5-Cross-Validation settings in this report. The genetic variable combinations providing the best classification performance are summarized in Table 2. Considering the top scores per clinical endpoint shown in Table 2, the best discrimination was achieved for the CD4 T cells counts (CD4Ini and CD4Hist clinical endpoints).
On the other hand, low performance was observed on the VLIni clinical endpoint [71.5-80.2], and even lower for the VLHist [68. 5-73.8].
Some variables were shown to be present in all "top combinations" identified for each different clinical endpoints. These were: [BCbox-3, BCbox-2, and APOBEC-2] for CD4Ini, [APOBEC-2, APOBEC-4, and BCbox-3] for CD4Hist, [APOBEC-2 and APOBEC-4] for VLIni, and [NLIS, APOBEC-2, and BCbox-1] for VLHist. Only the variable APOBEC-2 was present in 15 of the 16 best-combinations, except for in the combination with the highest classification when using MLP with the CD4Ini clinical endpoint. On the other hand, BCbox-3 was present in all the best combinations related to the CD4 T cell count.

Results Using the MAREV-1
After defining the 100 best-combinations per clinical endpoint by each algorithm, an assessment on the relevance of each variable was then undertaken. This involved calculating the probabilities for each variable of being selected as the most informative (i.e., root variable) in each of the best combinations. The relevance scores (r) per algorithm and positions are shown in Appendix A, see Tables A1-A4. After evaluating all the variables for each clinical endpoint, a threshold was calculated per clinical endpoint and used for selecting the most relevant variables as mentioned previously; see Section 3.5. The calculated threshold values for the most relevant variables are summarized in Appendix A, see Table A6a. The variables indicated as most relevant for CD4Ini (ordered by their relevance scores) were: [BCbox-3, APOBEC-3, APOBEC-5, APOBEC-2]; for CD4Hist: [APOBEC-2, APOBEC-3, APOBEC-5]; for VLIni they were [APOBEC-2, BCbox-1, APOBEC-3] and, finally; for VLHist they were [NLIS, APOBEC-3, APOBEC-5]. Considering these most relevant variables, APOBEC-3 proved to be associated with all the clinical endpoints, while APOBEC-2 and APOBEC-5 were present in only three clinical endpoints. BCbox-1 was seen to be the most relevant for only VLIni. BCbox-3 was only relevant for CD4Ini, and NLIS was suggested as being the most relevant in only VLHist.
The most relevant variables identified were in agreement with the best variables identified in previous efforts using alternative approaches [20], as shown in Table A7b; see Appendix A. This was also the case for the second variables in the clinical endpoints CD4Hist and VLIni. Another difference was that the quantity of variables defined as the most relevant when using the MAREV-1 approach was much higher for the clinical endpoints CD4Ini and CD4Hist than reported previously.

Results Using the MAREV-2
In this approach, the variable assessment process was done considering only the combinations of variables having the best classification performance, see Table 2. As happens with MAREV-1, MAREV-2 also calculated the probability for each variable to appear at every available position. This was later used to determine the score per variable and clinical endpoint as shown in Table A6b); see Appendix A. The variables discovered to be more relevant for CD4Ini (ordered by their scores) were:  Table A7c in Appendix A).
The comparison among the variables identified as the most relevant by the previous approach, MAREV-1 and MAREV-1, show a coincidence in some of the variables detected as most relevant. This is the case of BCBox-3 in CD4Ini and APOBEC-2 in both CD4Hist and VLIni. Although MAREV-1 and the previous approach agreed on assigning NLIS as the most relevant variable for VLHist, this motif was only suggested as the second most relevant for this clinical endpoint by MAREV-2.

Decision Trees and the Most Relevant Variable Combinations from MAREV-1 and MAREV-2
The decision trees defined with the variables determined by the MAREV-1 are shown in Figure 5, while those using the MAREV-2 are shown in Figure 6.  ID3 branch frequency was used to identify specific combinations of input variable status ( Mut or Cons ) as related to the clinical endpoints in Fisher's exact test. Only branches having more than 1 variable were considered, yielding a total of 20 variable combinations for the MAREV-1 approach (6 for CD4Ini, 5 for CD4Hist, 6 for VLIni, and 3 for VLHist) whereas the MAREV-2 approach identified 22 different relevant variable combinations (4 for CD4Ini, 6 for CD4Hist, 6 for VLIni, and 6 for VLHist. The results of the statistical assessment for the MAREV-1 and MAREV-2 approaches are shown in Table 3. Table 3. The most relevant Vif protein variable combinations associated with the clinical endpoints. (a) Significant associations after testing the 20 hypothesis suggested by the MAREV-1 approach; (b) Significant associations after testing the 22 hypothesis suggested by the MAREV-2 approach. Vif protein regions can either be conserved (Cons) or mutated (Mut) and associated with protection (prot) or risk to either <500 cells/µL CD4 T cells or ≥10,000 cp/mL of viral load.  Four of the 20 ID3-combinations defined from the MAREV-1 approach were detected as associated with clinical endpoints after further statistical testing. One was present for CD4Ini (p-value = 0.0011), two for CD4Hist (p-value = 0.0136, p-value = 0.0182), and one for VLIni (p-value = 0.0207). None of the associated combinations were present in VLHist. The combination for CD4Ini [BCboc-3 Mut , APOBEC-3 Cons ] suggests protection from having lower numbers of CD4 T lymphocytes at the time of initial medical assessment as it was present in only 6 patient samples having ≤500 CD4 T cells, compared to 53 patient samples not having said combination. In the case of CD4Hist, only one combination [APOBEC-2 Cons , APOBEC-3 Cons ] suggested protection from having less than 500 T Lymphocytes on medical follow-up, as was also found in our previously published work. A second combination [APOBEC-2 Mut , APOBEC-3 Cons , APOBEC-5 Cons ] was found to be associated with the risk of progression to less than 500 CD4 T lymphocytes on medical follow-up. The absence of said combination was detected in 14 out of 15 sequences with ≥500 CD4 T cells. Finally, in the case of VLIni, the [APOBEC-2 Mut , BCbox-1 Cons , APOBEC-3 Cons ] combination suggested a risk of having higher HIV viral loads on the first medical examination as it was absent in 22 out of the 26 cases with less than 10,000 virus copies.

Contingency Tables Classification
On the other hand, the 22 ID3-combinations generated using the variables defined by the MAREV-2 yielded 5 clinical associations. Both of the associations found in CD4Ini involved variables BCBox-2 and BCBox-3 where the conservation of both protein regions was associated with a higher risk of having lower initial CD4 T lymphocytes on the first medical examination (p-value = 0.0068). This variable combination was present in 26 of the patient cases with < 500 CD4 cells/µL, compared with a single occurrence in a patient having ≥ 500. A second variable combination, [BCBox-2 Mut and BCBox-3 Mut ], was associated with protection from low CD4 T lymphocytes counts as it was observed to be more frequent in patients having ≥ 500 CD4 cell count/µL (p-value = 0.0049). Regarding historic CD4 T cell counts, one variable combination [APOBEC-2 Mut , BCbox-3 Cons ] was associated with the risk of having low CD4 T cell counts on medical follow-up as it was present in 20 cases with a CD4 cell count below 500 and not in patients having ≥ 500 CD4 T cells/µL. Regarding initial viral load assessments, [APOBEC-2 Mut , APOBEC-4 Mut , BCbox-1 Cons ] was associated with the risk of having high viral titres (≥10,000 viral copies) at the time of initial medical examination and was present in 11 patients having ≥10,000 viral copies, yet in only a single patient having lower viral loads. Finally, [NLIS Mut , APOBEC-2 Mut , BCbox-1 Cons ] was observed to be associated with a higher risk of low historical viral loads on patient follow-up as it was seen only once in a patient having <10,000 copies but it was present in 6 patients having more than 10,000 copies of the virus. As mentioned before, eight novel HIV associations were identified through this approach: three by MAREV-1, and five with MAREV-2.
Distinct Vif protein regions were identified through this approach as being highly relevant by MAREV-1, mainly involved in APOBEC3 interactions and Elongin B/C binding. Relevant APOBEC3 interaction motifs included APOBEC-3, which was found to be conserved in all cases as well as APOBEC-2, which only failed to be relevant with regard to CD4Ini. Similarly, APOBEC-5 was found to be absent in CD4Hist while BCbox-1 was related to VLIni. Similarly, MAREV-2 also identified APOBEC-3, APOBEC-2, and APOBEC-4, and the Elongin B/C-box binding motifs, BCbox-1, BCbox-2, and BCbox-3 as most relevant. The results from the MAREV-2 for VLHist agree with our previously published findings by suggesting a higher relevance of the NLIS segment.
These results help supporting the variables detected as more informative in our previous findings [20], being: (i) [BCbox-3] for CD4Ini, (ii) [APOBEC-2] for CD4Hist, VLIni and VLHist, (iii) [BCbox-1] for VLIni and VLHist, and iv) [NLIS] for VLHist. Additionally, the MAREV-1 approach places relevance for the variables [APOBEC-3 and APOBEC-5] while MAREV-2 places relevance for [APOBEC-4, BCbox-2, and BCbox-3]. On the other hand, the four associations determined with MAREV-1 and the five determined by MAREV-2 were less than the seven suggested with the previously methodology. Only one of said associations was present when using both approaches. Fewer associations were found when considering the viral load clinical status, both the initial and historical. This was the case for VLHist, where no association was found when using the MAREV-1 approach. However, determining which set of associations have more biological significance requires further research. Table 3 concentrates the most relevant associations of genetic variable combinations with each of the four clinical endpoint variables out of the 20 and 22 hypotheses tested by the MAREV-1 and MAREV-2 algorithms, respectively. On initial examination, the reiterative appearance of APOBEC and Elongin B/C Box motifs stands out in the results generated by both algorithms, irrespective of site status (mutated or conserved). This is a reflection of the importance of Vif protein, a function which involves both binding of Elongin B/C and recognition of APOBEC molecules to provide HIV with the capacity to escape from APOBEC-mediated innate immunity. From within the eight different APOBEC binding sites included in the analysis, APOBEC-2 and APOBEC-3 stand out for the number of times they appear in the associations shown in Table A5. Interestingly, the APOBEC-2 and -3 sites bind APOBEC3G and APOBEC3F, the two most relevant members of the APOBEC3 family of antiviral proteins. Nevertheless, our results are indicative that the APOBEC3G and APOBEC3F protein binding site (APOBEC-2) is perhaps the least important of all the genetic Vif variables assessed. This is based on the fact that both MAREV-1 and MAREV-2 results show higher viral titres and lower CD4 T cell numbers (suggesting ongoing viral robustness) even in the presence of APOBEC-2 mutations, as long as the other APOBECbinding regions or Elongin B/C binding regions remain conserved. This was observed in historic CD4 T cell numbers, the initial viral loads, and regarding the historic viral loads.
Similarly, the recursive appearance of Elongin B/C box-1 and box-3 binding sites also highlights the relevance that the Elongin interactions have for the Vif protein mediated ubiquitination of APOBEC3 anti-viral proteins. Overall, our results emphasize the clinical relevance of both APOBEC3G and Elongin B/C binding sites from among the remaining Vif protein domains assessed. Figure 7 illustrates the position of the Vif encoding region within a reference (HXB2) HIV-1 genome, the Vif protein domains and regions, as well as some of the putative or known ligands. Even greater detail is provided by our results regarding the weight of each of these genetic variables when individual clinical outcomes are considered. At least one previous report has identified that amino acid substitutions in Elongin B/C sites lead to a loss-of-infectivity in HIV [45]. The results of both MAREV-1 and -2 suggest that initial CD4 T cell numbers seem to depend more on Elongin B/C site status than any other Vif protein attribute. When Elongin B/C box mutations are present, such as in [BCbox-3 Mut , APOBEC-3 Cons ] (MAREV-1) and [BCbox-3 Mut , BCbox-2 Mut ] (MAREV-2), a greater number of patients are seen to be present in the ≥500 cells/µl class than in the ≤500 cells/µl class. This supports the notion that Elongin B/C binding box mutations are detrimental to viral fitness and thus prevent HIV from escaping APOBEC3 inhibition or interference.
An additional interesting finding relates to historic CD4 T cell numbers and viral loads. HIV patients are normally enrolled into anti-retroviral therapy protocols after being diagnosed, irrespective of CD4 T cell counts and viral load numbers. The clinical impact that viral mutations have at this stage, after initiating treatment, has largely been linked to protease, reverse-transcriptase, and integrase sites, those most subjected to selective pressures by anti-retroviral drugs. Our results indicate that the conservation of APOBEC binding motifs are essential to viral fitness (and worsening clinical progression), at least in the MAREV-1 results. As such, [APOBEC-2 Mut , APOBEC-3 Cons , APOBEC-5 Cons ] and [APOBEC-2 Cons , APOBEC-3 Cons ] were more common among patients having lower CD4 T Cell numbers on follow-up. This was also true for BCbox-3 in MAREV-2 results, where [APOBEC-2 Mut , BCbox-3 Cons ] was also more common among patients having ≤500 cells/µL. Previous reports have highlighted how the conservation of APOBEC binding sites is crucial for vifmediated viral fitness. Our results suggest that the mutation of certain APOBEC3 binding site motifs (i.e., APOBEC-2) is tolerated without a significant effect on viral fitness as long as other, perhaps more important, remaining motifs are conserved (i.e., APOBEC-3 and or -5) [46].

Conclusions
This paper proposes a new methodology based on machine learning algorithms (CART, NB, SVMs, and MLP) combined with an undersampling approach to deal with an imbalanced HIV dataset. Additionally, we present evidence of the classification performance of two different approaches (MAREV-1 and MAREV-2) for the identification of associations of Vif protein motifs with clinical endpoints in HIV. These variables subsequently proved to play a crucial role when different combinations of them were linked to HIV outcome, a difficult task that is not possible to achieve in human terms without relying on statistical corrections that decrease the statistical power of the study. These findings are in agreement with the known properties and with the functional and clinical relevance of the different Vif protein motifs found to be relevant. Needless to say, further research employing cell biology and molecular epidemiology tools is warranted so as to provide further support for these claims. Efforts are currently underway in our group to test the clinical utility of the identified variable combinations in a novel, larger HIV cohort.
When comparing the different strategies described in this manuscript, MAREV-2 was able to identify many more clinical associations, at least one per clinical outcome. This might be interpreted to suggest that this approach might prove more useful in future analysis and in clinical settings.
Many techniques are currently available to deal with imbalanced datasets. Although we studied the capacity of an undersampling approach to resolve this limitation, future work will explore the performance of oversampling techniques. These results provide further evidence on the usefulness and potential that machine learning methods have at analyzing complex datasets. Given the exponential growth of applications of artificial intelligence and classification strategies, this field is likely to benefit from the results presented herein.
Elongin B/C binding site mutations might prove to be the single most important Vif genetic feature determining CD4 T cell numbers at the time of clinical debut and at a time when viral replication has not been subjected to the influence of anti-retroviral drugs (as patients are treatment-naïve at this time). This opens the possibility that molecular approaches targeting HIV-1 Elongin B/C binding motifs or those inhibiting the interactions of Elongin B/C and Vif might provide innovative preventative strategies in the fight against HIV.
Overall, our results provide insight into the utility that both MAREV-1 and -2 algorithms have at discriminating complex genetic variable combinations linked to clinical endpoints in HIV, the practical utility of screening for accessory protein encoding region mutations in HIV prognosis, as well as at guiding the development of novel therapeutic interventions in HIV. Funding: The first author wants to give thanks to the National Science and Technology Council (CONACYT) for the funding through the scholarship #436028 and for the support for a research stay. He also thanks the University of Birmingham in the UK for its generous support during the research visit. The authors also thank the wonderful people caring for HIV patients at Centro Ambulatorio para la Prevención y Atención del SIDA e Infecciones de Transmisión Sexual (CAPASITS, SLP) for their unconditional help and work.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The dataset used in this research was already published [23], and it is publicly available at http://www.genomica.uaslp.mx/Research/HIV.html, accessed on 28 October 2022.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Variable Assessment
Following the proposed methodology for assessing the relevance of each variable (see Section 3.5), the results from the fourth step are in Tables A1-A4. There is a table per clinical endpoint CD4Ini, CD4Hist, VLIni, and VLHist, respectively. Each table shows the results from each classification algorithm: CART, MLP, NB, and SVMs. Table A1. Relevance scores (r) in a descending order per algorithm and variable considering the clinical endpoint CD4Ini using the MAREV-1 approach. The four variables with higher values are highlighted in bold.               Table A5 shows the results from the fourth step of the proposed methodology, see Section 3.5.   Table A6. The most informative variables per clinical endpoint considering those surpassing a calculated threshold (relevance scores in boldface). a, Scores when considering the classifications results from all the combinations; b, Scores calculated using only the best classification performance per clinical endpoint and algorithm (see Table 2).  Table A7 compares the findings of MAREV-1 and MAREV-2 with the previous results.