1. Introduction
Human immunodeficiency virus (HIV), which causes acquired immunodeficiency syndrome (AIDS), affects over 1.1 million people in the U.S. today [1]. While there is still no cure, HIV can be treated effectively with antiretroviral therapy (ART). By reducing the viral load to below detectable levels, consistent treatment with ART can extend the life expectancy of an HIV-positive individual to nearly that of a person without HIV [2] and can reduce transmission rates [3,4]. However, the fast replication rate and lack of repair mechanisms of HIV lead to a large number of mutations, many of which allow HIV to evolve resistance to antiretroviral drugs [5,6]. Drug resistance may be conferred at the time of HIV transmission, so even treatment-naïve patients may be resistant to certain ART drugs, which can lead to rapid drug failure [7]. As a result, the analysis of drug resistance is critical to treating HIV and is thus an important focus of HIV research.
HIV drug resistance may be directly evaluated using phenotypic assays such as the PhenoSense assay [8]. Both the patient’s isolated HIV strain and a wild-type reference strain are exposed to an antiretroviral drug at several concentrations, and the difference in effect between the two indicates the level of drug resistance. Resistance is measured in terms of fold change, defined as the ratio between the concentration of the drug necessary to inhibit replication of the patient’s virus and that necessary to inhibit the wild type [9]. These tests are laborious, time-intensive, and costly. Additionally, phenotypic assays rely on prior knowledge of correlations between mutations and resistance to specific drugs, which evolve quickly and thus cannot be totally accounted for [10]. An alternative, the “virtual” or genotypic test, predicts the outcome of a phenotypic test from the genotype using statistical methods. Several web tools exist for HIV drug resistance prediction based on known genotype profiles, including HIVdb [11] and WebPSSM [12], which use rules-based classification and position-specific scoring matrices, respectively, to predict drug resistance. Two additional tools, geno2pheno [13] and SHIVA [14], utilize machine learning approaches.
Several machine learning architectures have been applied to predict drug resistance, including random forests such as those used in SHIVA [14,15,16], support vector machines as in geno2pheno [13], decision trees [17], logistic regression [16], and artificial neural networks [10,18,19,20]. Deep learning models (i.e., neural networks) are a major focus of current machine learning research and have been successfully applied to several classes of computational biology data [21]. Yet one aspect of deep learning models in particular has been overlooked to date: model interpretability. Deep learning is often criticized for its “black box” nature, as it is unclear from the model itself why a given classification was made. The resulting ambiguity in the classification model is a major limitation to the utility of deep learning in translational and clinical applications [22]. In response to this concern, recently developed model interpretability methods seek to map model outputs back to a specific subset of the most influential inputs, or features [23]. In practice, these methods allow researchers to better understand whether predictions are based on relevant patterns in the training data rather than on bias, thus attesting to the model’s trustworthiness and, in turn, providing the potential for deep learning models to identify novel patterns in the input data.
Here, we integrate deep learning techniques with HIV genotypic and phenotypic data and analyses in order to investigate the implications of the underlying evolutionary processes of HIV-1 drug resistance for classification performance and vice versa. The objectives of this study are: (i) to compare the performance of three deep learning architectures that may be used for virtual HIV-1 drug resistance tests, namely the multilayer perceptron (MLP), the bidirectional recurrent neural network (BRNN), and the convolutional neural network (CNN); (ii) to evaluate feature importance in the context of drug resistance mutations; and (iii) to explore the relationship between the molecular evolution of drug-resistant strains and model performance.
2. Methods
Briefly, our deep learning approach included training and cross-validation of three architectures for binary classification of labeled HIV-1 sequence data: MLP, BRNN, and CNN (Figure 1). In addition to comparing performance metrics across architectures, we evaluated feature importance using a permutation-based method and interpreted the results in light of known drug resistance mutation (DRM) loci. Lastly, we reconstructed and annotated phylogenetic trees from the same data in order to assess clustering patterns of resistant sequences.
2.1. Data
Genotype-phenotype data were obtained from Stanford University’s HIV Drug Resistance Database, one of the largest publicly available sources of such data [9]. The filtered genotype-phenotype datasets for eighteen protease inhibitor (PI), nucleoside reverse transcriptase inhibitor (NRTI), and non-nucleoside reverse transcriptase inhibitor (NNRTI) drugs were used (downloaded August 2018; Table 1). While the Stanford database additionally includes data pertaining to integrase inhibitors (INIs), insufficient INI data were available for deep learning analysis at the time of this study, so we have focused here on polymerase. In the filtered datasets, redundant sequences from intra-patient data and sequences with mixtures at major DRM positions were excluded from further analyses. In total, the data pulled from the Stanford database contained 2112 sequences associated with PI susceptibility, 1772 sequences associated with NNRTI susceptibility, and 2129 sequences associated with NRTI susceptibility (Table 1). Drug susceptibility testing results included in the Stanford dataset had been generated using the PhenoSense assay [8]. Susceptibility is expressed as fold change in comparison to wild-type HIV-1; a fold change value greater than 3.5 indicates that a sample is resistant to a given drug [10,17].
All sequences were from regions of the polymerase (pol) gene, which is routinely sequenced in studies of HIV-1 drug resistance. Sequences in the PI dataset comprise a 99-amino-acid sequence from the protease (PR) region (HXB2 coordinates: 2253–2549), while the NRTI and NNRTI datasets comprise a 240-amino-acid sequence from the reverse transcriptase (RT) region (HXB2 coordinates: 2550–3269). Subtype is not explicitly listed in the filtered dataset, though the consensus sequence provided is that of subtype B. The data were separated into 18 drug-specific datasets, each of which contained all sequences for which a resistance assay value for the drug was available, along with their respective drug resistance status. For training and evaluating the deep learning models, amino acid and ambiguity codes were encoded as integers (integer encoding).
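As an illustration, integer encoding maps each amino acid or ambiguity character to a small integer index. The sketch below shows a minimal version of this idea in R; the alphabet and function names (aa_alphabet, encode_seq) are our own illustrative choices, not the study’s exact code.

```r
# Minimal sketch of integer encoding for amino acid sequences (illustrative only).
# The alphabet covers the 20 standard amino acids plus placeholder symbols for
# ambiguity, stop, and gap characters; the study's exact alphabet may differ.
aa_alphabet <- c("A","C","D","E","F","G","H","I","K","L",
                 "M","N","P","Q","R","S","T","V","W","Y",
                 "X","*","-")

encode_seq <- function(seq_string) {
  chars <- strsplit(seq_string, "")[[1]]
  match(chars, aa_alphabet)  # each residue becomes its index in the alphabet
}

encode_seq("PQVTL")  # returns 13 14 18 17 10
```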
2.2. Deep Learning Classifiers
Three classes of deep learning classifiers were constructed, trained, and evaluated on each of the 18 datasets: multilayer perceptron (MLP), bidirectional recurrent neural network (BRNN), and convolutional neural network (CNN). The structure and parameters of each model are described below (Section 2.2.1, Section 2.2.2 and Section 2.2.3). All data pre-processing, model training, and model evaluation steps were completed using R v3.6.0 [24] and RStudio v1.2.1335 [25]. All classifiers were trained and evaluated using the Keras R package v2.2.4.1 [26] as a front end to TensorFlow v1.12.0 [27], utilizing Python v3.6.8 [28].
All classifiers were evaluated using 5-fold stratified cross-validation with adjusted class weights in order to account for small dataset sizes and class imbalances (Table 1), implemented as follows. All data were randomly shuffled and then split evenly into five partitions such that each partition had the same proportion of resistant to non-resistant sequences. A model was then initialized and trained using four of the five partitions as training data and the fifth as a hold-out validation set. Next, a new, independent model was initialized and trained using a different set of four partitions for training and the remaining partition as the hold-out validation set. This was repeated three additional times such that each partition was used as a validation set exactly once. Thus, for each architecture and dataset, a total of five independent neural network models were trained and evaluated, each using a distinct partition of the total dataset for validation and the remainder for training. This method is advantageous when working with limited data, as it allows all data to be utilized in evaluating a given architecture without the same data ever being used for both training and validation of any one model. Averaging the performance metrics across the five folds also lessens the bias that the choice of validation data may introduce into the results.
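A minimal sketch of stratified fold assignment in R is shown below; the variable names (labels, fold_id) and the seed are illustrative, and the study’s exact implementation may differ.

```r
# Minimal sketch of stratified 5-fold assignment (illustrative).
# 'labels' is a binary vector: 1 = resistant, 0 = non-resistant.
set.seed(42)  # assumed seed, for reproducibility of the shuffle
k <- 5
fold_id <- integer(length(labels))
for (cls in unique(labels)) {
  idx <- sample(which(labels == cls))        # shuffle within each class
  fold_id[idx] <- rep_len(1:k, length(idx))  # deal indices round-robin into k folds
}
# Each fold now preserves the overall class proportions; fold i serves once
# as the validation set while the remaining folds are used for training.
```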
The class weights used to train each model were calculated such that misclassification of a non-resistant sequence carried a penalty of 1, while misclassification of a resistant sequence carried a penalty equal to the ratio of non-resistant to resistant sequences (i.e., the penalty for misclassifying a resistant sequence was higher by a factor equal to the observed class imbalance). Each model was compiled using the Root Mean Square Propagation (RMSprop) optimizer and binary cross-entropy as the loss function, as is standard for binary classification tasks, and was trained for 500 epochs with a batch size of 64.
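In the Keras R interface, this weighting and training setup might look like the following sketch; the model object and data variables (model, train_seqs, train_labels, val_seqs, val_labels) are placeholders, with the hyperparameters taken from the description above.

```r
library(keras)

# Class weights: non-resistant (label 0) penalized 1, resistant (label 1)
# penalized by the observed class imbalance (illustrative variable names).
n_nonres <- sum(train_labels == 0)
n_res    <- sum(train_labels == 1)
class_wt <- list("0" = 1, "1" = n_nonres / n_res)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss      = "binary_crossentropy",
  metrics   = c("accuracy")
)

history <- model %>% fit(
  train_seqs, train_labels,
  epochs          = 500,
  batch_size      = 64,
  class_weight    = class_wt,
  validation_data = list(val_seqs, val_labels)
)
```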
2.2.1. Multilayer Perceptron
A multilayer perceptron (MLP) is a feed-forward neural network consisting of densely connected input, hidden, and output layers. MLPs are the baseline, classical form of a neural network. The MLP used here began with embedding and 1D global average pooling layers (input), followed by four feed-forward hidden layers, each with either 33 (PI) or 99 (NNRTI and NRTI) units, Rectified Linear Unit (ReLU) activation, and L2 regularization, and ended with an output layer with a sigmoid activation function.
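A sketch of this architecture in the Keras R interface might look as follows; the vocabulary size (vocab_size), embedding dimension (embed_dim), and L2 strength (0.001) are assumptions of ours, as the text does not specify them.

```r
library(keras)

units <- 99  # 33 for the PI datasets; 99 for NRTI/NNRTI

mlp <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = embed_dim) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dense(units, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dense(units, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dense(units, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dense(1, activation = "sigmoid")  # binary resistant/non-resistant output
```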
2.2.2. Bidirectional Recurrent Neural Network
A bidirectional recurrent neural network (BRNN) includes a pair of hidden recurrent layers that feed information in opposite directions, thus utilizing both the forward and backward context of the input data. BRNNs are commonly used in applications where such directionality is important, such as language recognition and time-series tasks. The BRNN used here included one embedding layer and one bidirectional long short-term memory (LSTM) layer with either 33 (PI) or 99 (NNRTI and NRTI) units, a dropout of 0.2, and a recurrent dropout of 0.2, and ended with an output layer with a sigmoid activation function.
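A corresponding Keras R sketch is below, under the same assumed vocab_size and embed_dim as in the MLP sketch.

```r
library(keras)

units <- 99  # 33 for the PI datasets; 99 for NRTI/NNRTI

brnn <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = embed_dim) %>%
  bidirectional(layer_lstm(units = units,
                           dropout = 0.2,           # input dropout
                           recurrent_dropout = 0.2) # dropout on recurrent state
  ) %>%
  layer_dense(1, activation = "sigmoid")
```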
2.2.3. Convolutional Neural Network
A convolutional neural network (CNN) operates in a manner inspired by the human visual cortex and consists of convolutional (feature extraction) and pooling (dimension reduction) layers. CNNs are best known for their use in computer vision and image analysis but have recently been applied to genetic sequence data with much success, especially for training directly on DNA sequence data [21]. The CNN used here included one embedding layer and two 1D convolution layers, each with 32 filters, a kernel size of 9, and a ReLU activation function, with one 1D max pooling layer in between, and ended with an output layer using a sigmoid activation function.
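A Keras R sketch of this layout follows; the pool size and the final global pooling step before the dense output are assumptions of ours, since the text does not state how the convolutional output is flattened.

```r
library(keras)

cnn <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = embed_dim) %>%
  layer_conv_1d(filters = 32, kernel_size = 9, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 2) %>%   # pool size assumed; not stated in the text
  layer_conv_1d(filters = 32, kernel_size = 9, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%         # assumed flattening step before the output
  layer_dense(1, activation = "sigmoid")
```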
2.3. Performance Metrics
The following metrics were recorded for each evaluation step: true positive rate/sensitivity (TPR), true negative rate/specificity (TNR), false positive rate (FPR), false negative rate (FNR), accuracy, F-measure (F1), binary cross-entropy (loss), and area under the receiver operating characteristic curve (AUC). Formulas for the calculation of all metrics are given below. Binary cross-entropy was reported in the Keras output, and AUC and receiver operating characteristic (ROC) curves were generated using the ggplot2 v3.1.1 [29] and pROC v1.15.0 [30] R packages.
In the following, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative classifications, respectively.

Accuracy is a measure of the overall correctness of the model’s classification:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Sensitivity, or true positive rate (TPR), measures how often the model predicts that a sequence is resistant to a drug when it is actually resistant:

$$TPR = \frac{TP}{TP + FN}$$

Specificity, or true negative rate (TNR), measures how often the model predicts that a sequence is not resistant to a drug when it is actually not resistant:

$$TNR = \frac{TN}{TN + FP}$$

The false positive rate (FPR) measures how often the model predicts that a sequence is resistant to a drug when it is actually not resistant:

$$FPR = \frac{FP}{FP + TN}$$

The false negative rate (FNR) measures how often the model predicts that a sequence is not resistant to a drug when it is actually resistant:

$$FNR = \frac{FN}{FN + TP}$$

The F1 score is the harmonic mean of precision and recall and is often preferred to accuracy when the data have imbalanced classes:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
Binary cross-entropy is a measure of the error of the model for the given classification problem and is the quantity minimized by the neural network during the training phase. Thus, this metric represents performance during training but not necessarily performance on the held-out data used for evaluation. Binary cross-entropy is calculated as follows, where $N$ is the number of observations, $y_i$ is a binary indicator of whether the class label for observation $i$ is correct, and $p_i$ is the predicted probability that the observation is of that class:

$$\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
AUC measures the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across classification thresholds. AUC is also commonly used when the data have imbalanced classes, as the ROC curve captures performance across the full range of decision thresholds rather than at a single cutoff.
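For concreteness, the threshold-based metrics can be computed from confusion-matrix counts as in the sketch below, with AUC computed via the pROC package cited above; the variable names (truth, prob) and the 0.5 cutoff are illustrative.

```r
library(pROC)

# 'truth' is the binary resistance label; 'prob' is the model's predicted
# probability of resistance. Counts are taken at an assumed 0.5 threshold.
pred <- as.integer(prob >= 0.5)
tp <- sum(pred == 1 & truth == 1)
tn <- sum(pred == 0 & truth == 0)
fp <- sum(pred == 1 & truth == 0)
fn <- sum(pred == 0 & truth == 1)

accuracy <- (tp + tn) / (tp + tn + fp + fn)
tpr <- tp / (tp + fn)              # sensitivity
tnr <- tn / (tn + fp)              # specificity
fpr <- fp / (fp + tn)
fnr <- fn / (fn + tp)
f1  <- 2 * tp / (2 * tp + fp + fn)

# Threshold-free AUC from the predicted probabilities:
auc_value <- auc(roc(truth, prob))
```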
2.4. Model Interpretation
Model interpretation analysis was conducted in R/RStudio using the permutation feature importance function implemented in the iml package v0.9.0 [23]. This function is a model-agnostic implementation of the model reliance measure [31]. Put simply, permutation feature importance quantifies the change in model performance when all data for a given feature are shuffled (permuted); here, importance is measured in terms of 1 − AUC. Feature importance plots were rendered using the ggplot2 package and annotated with known DRM positions from the Stanford database [9], both for the top 20 most important features and across the entire gene region.
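With the iml package, such an analysis might be set up roughly as follows; the wrapped model, data objects, and the custom 1 − AUC loss are our own illustrative choices consistent with the description above, not the study’s exact code.

```r
library(iml)
library(pROC)

# Wrap the trained Keras model and held-out data for iml (illustrative names).
predictor <- Predictor$new(
  model = trained_model,
  data  = as.data.frame(val_seqs),
  y     = val_labels,
  predict.function = function(model, newdata) {
    predict(model, as.matrix(newdata))  # Keras returns predicted probabilities
  }
)

# Permutation feature importance with loss measured as 1 - AUC:
imp <- FeatureImp$new(
  predictor,
  loss = function(actual, predicted) 1 - as.numeric(auc(roc(actual, predicted)))
)

plot(imp)  # iml renders a ggplot2-based plot of per-feature importance
```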
2.5. Phylogenetics
In addition to the deep learning analyses, we reconstructed phylogenetic trees for all datasets in order to empirically test whether resistant and non-resistant sequences formed distinct clades and to visualize the evolutionary relationships present in the data. ModelTest-NG v0.1.5 [32] was used to estimate the best-fit amino acid substitution model for each dataset for use in phylogeny reconstruction. The selected models included HIVb (FPV, ATV, TPV, and all PI), FLU (IDV, LPV, SQV, and DRV), which has been shown to be highly correlated with HIVb [33], JTT (NFV, ETR, RPV, 3TC, D4T, DDI, TDF, and all NRTI), and JTT-DCMUT (EFV, NVP, ABC, AZT, and all NNRTI). We then used RAxML v8.2.12 [34] to estimate phylogenies for each dataset under the maximum likelihood optimality criterion, including a bootstrap analysis with 100 replicates to evaluate branch support. Both ModelTest-NG and RAxML were run within the CIPRES Web Interface v3.3 [35]. Trees were then annotated with drug resistance classes using iTOL v4 [36]. The approximately unbiased (AU) test for constrained trees [37], implemented in IQ-Tree v1.6.10 [38], was used to test the hypothesis that each tree was perfectly clustered by drug resistance class, with midpoint rooting used for all trees.
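As an illustration of the clustering check, the sketch below uses the ape and phangorn R packages (not part of the study’s stated toolchain) to midpoint-root a RAxML tree and ask whether the resistant sequences form a single clade; the file and variable names are hypothetical.

```r
library(ape)       # tree input/output and monophyly test
library(phangorn)  # midpoint rooting

# Hypothetical file names; the study's actual outputs may be organized differently.
tree <- read.tree("RAxML_bestTree.PI_FPV")
tree <- midpoint(tree)  # midpoint rooting, as applied before the AU test

# 'resistant_tips' holds the tip labels classified as resistant (fold change > 3.5).
resistant_tips <- readLines("FPV_resistant_ids.txt")

# TRUE only if the resistant sequences form a single, exclusive clade.
is.monophyletic(tree, tips = resistant_tips)
```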