Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights

Simple Summary Infectious bacteria (microbes) are able to evolve to become resistant to antibiotics (develop antimicrobial resistance, or AMR). Resistant microbes are harder to treat, requiring higher doses or alternative medications, which can be more toxic. Because of inappropriate use of medicine, microbes are being subjected to evolutionary pressure, resulting in increased AMR development. As a result, AMR is emerging as one of the biggest public health challenges of our time, posing the risk of a pandemic without effective treatment or vaccine. The goals of this paper are to develop and analyze machine learning methods that use the genome sequence of a bacterium to: (1) predict the minimum required dose of an antibiotic to treat a bacterial infection, and (2) identify the specific mutations or altered genetic content that give rise to AMR. In particular, we propose a novel method for applying machine learning algorithms to learn patterns of amino acid sequences in the genes of the bacteria. We show that our proposed method produces comparable or even more accurate results than existing methods for dose prediction, and that it can provide additional insight for scientists who study AMR mechanisms.

Abstract Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods.
In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.

for Escherichia coli [20]. MIC is defined as the minimum concentration of an antibiotic that will inhibit the visible growth of a microorganism after overnight incubation [44]. One shortcoming of methods that predict resistance or susceptibility is that the definition of a resistant or susceptible strain depends on clinical breakpoints (thresholds) [45,46]; however, in some cases, there is no agreement on the values of these breakpoints between scientific institutions [45]. Moreover, the breakpoints are subject to change over time [47], resulting in different definitions of resistance or susceptibility to an antibiotic. MIC prediction, on the other hand, does not rely on a breakpoint. MIC also provides some resolution as to the level of resistance, rather than a binary output of resistant or susceptible. As a practical example, this can help distinguish between a strain that has a very low MIC and a strain that is susceptible but has an MIC close to the breakpoint [19]. From the machine learning point of view, quantitative MIC prediction is a regression problem, meaning that the machine learning algorithm outputs a continuous value (the MIC), unlike classification problems, where a discrete label, such as 0 for susceptible or 1 for resistant, is output.
Different papers have used different feature extraction methods to present genome sequence data as input to prediction models such as support vector machines [38,48,49], neural networks [41], or gradient boosting [15,24,36], for AMR prediction. In [50,51], for example, the authors used, as features, single nucleotide polymorphisms (SNPs) in genes previously known to carry drug-resistance mutations in Mycobacterium tuberculosis. In another paper [14], the authors also used SNPs to predict resistance in Mycobacterium tuberculosis, in two scenarios: using only SNPs in known AMR genes and using all SNPs in the dataset. In [34], SNPs from whole-genome sequence (WGS) data were used to predict resistance for Neisseria gonorrhoeae. In [35], the following feature extraction method was used for binary AMR prediction in Escherichia coli: The authors first clustered all genes in all genomes using 95% amino acid identity. Genes that existed in all genomes were labeled as "core" genes, and genes that existed in only some genomes were labeled as "accessory" genes. Then, a genetic algorithm (GA) method was used to choose which subset of accessory genes should be used for AMR prediction. Finally, the absence or presence of the chosen accessory genes in each genome was used as a feature to predict AMR. Khaledi et al. [38] tried gene expression, gene absence/presence, SNP calling and different combinations of those features for binary AMR prediction in Pseudomonas aeruginosa. For gene absence/presence, they clustered all of the genes with the condition of 95% sequence alignment coverage (genes went to the same cluster only if there was 95% coverage when aligned using BLAST). Then, the absence or presence of each gene cluster in each genome was used as a feature. Hyun et al. [39] also used absence/presence, but since gene clustering ignores genetic variation, they additionally included unique amino acid sequence variants, or "alleles", of each gene.
In their definition, a core gene was a gene missing in at most 10 genomes. Similarly, in [37], both SNPs and gene absence/presence were used to predict AMR for Elizabethkingia. Reference [36] used different features, including the absence/presence of accessory genomes, SNPs, indels and year of isolation for Escherichia coli.
Another approach has been to count nucleotide k-mers (oligonucleotides) for predicting MIC: for example, refs. [15,24] employed counts of 10-mers (subsequences of length 10) for Klebsiella pneumoniae and 15-mers for Salmonella enterica, respectively. In [22,23], nucleotide 31-mers were employed to predict resistance for multiple species. Liu et al. [52] also used nucleotide 31-mers to predict resistance in Actinobacillus pleuropneumoniae. In [49], nucleotide k-mers of length 5 to 10 were used for Neisseria gonorrhoeae. Although nucleotide k-mers can perform well in terms of predicting AMR accurately, good performance is only guaranteed when the k-mers are long enough. Furthermore, important features (k-mers) chosen by the model can only be interpreted by searching against databases such as NCBI [53] if k is long enough. However, simply increasing the k-mer length runs into a substantial practical problem: the number of features increases exponentially as k increases, at least until the number of features becomes limited by the sizes of the genomes. As such, longer k-mers can lead to memory issues when a machine learning model is trained on these features.
To mitigate the problems associated with increasing the k-mer length to achieve interpretability, in this paper we propose an alternative: counting amino acid k-mers (i.e., oligopeptide sequences) in protein sequences. Amino acid k-mer counts exploit the biological redundancy of the nucleotide sequences to provide a more compact representation of the data. Amino acid k-mers can provide easier model interpretation and require less computational complexity compared to nucleotide k-mers, while providing comparable accuracy to the other feature extraction methods evaluated herein, such as counting nucleotide k-mers of varying lengths, identifying the absence/presence of gene clusters and obtaining SNPs.
We principally used extreme gradient boosting (XGBoost) regression [54] to train the models evaluated in this paper. In XGBoost, several trees are trained to learn the relationship between input and output. After being trained, each tree calculates its output by comparing the inputs to a series of thresholds in a hierarchical manner. Each tree attempts to correct the mistakes of the previous trees. The final output is the sum of the predictions from all trees. As a result, in XGBoost, a strong learner is built by combining the decisions of several weak learners [54].
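The additive prediction scheme described above can be sketched in a few lines. The depth-1 "stump" learner, toy data, and learning rate below are illustrative stand-ins for XGBoost's regularized trees, not the actual implementation:

```python
# Minimal illustration of gradient boosting for regression (the idea behind
# XGBoost): each weak learner fits the residuals left by the ensemble so far,
# and the final prediction is the sum of all learners' (shrunken) outputs.

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump): pick the threshold that
    minimizes the squared error of predicting the mean on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost(xs, ys, n_rounds=20, lr=0.5):
    """Train an additive ensemble of stumps, each fitted to the residuals."""
    ensemble = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    # final prediction = sum of all stumps' shrunken outputs
    return lambda x: sum(lr * s(x) for s in ensemble)

model = boost([1, 2, 3, 4, 5, 6], [1.0, 1.2, 0.9, 4.0, 4.1, 3.9])
```

After a few rounds, the summed stump outputs approximate the training targets; XGBoost adds regularization, second-order gradients, and deeper trees on top of this basic idea.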
In this paper, we first use amino acid k-mers alongside other feature extraction methods to predict MIC and compare them in terms of ability to predict MIC, interpretability, feature stability and computational complexity. Then, we build and demonstrate the results of a feature selection pipeline for the extraction of important features, and use it to identify important AMR determinants from the model without any prior knowledge about AMR mechanisms. For example, the pipeline discovers that although the truncated version of the tetracycline resistance gene tet(D) is annotated as a resistance-conferring gene in the PATRIC database, Klebsiella pneumoniae strains that have this version have lower MICs compared to strains that do not have any version of this gene. Finally, we show that we can build a model with only a few important features picked by our feature selection pipeline, which in many cases reaches better accuracy than a model that uses all features. In this paper, we apply the proposed feature extraction method as well as other existing feature extraction methods to four gram-negative bacteria species.

Data Acquisition and Pre-Processing
We applied all feature extraction methods described below to four bacterial species, namely, Campylobacter jejuni, Neisseria gonorrhoeae, Klebsiella pneumoniae and Salmonella enterica. For S. enterica, the MIC values were downloaded from the published metadata of a previous study [24], and for K. pneumoniae, the MIC values were downloaded from the metadata of another study [15]. For C. jejuni and N. gonorrhoeae, the MIC values were downloaded from the PATRIC database [26]. For all datasets, the nucleotide sequences and the amino acid sequences were obtained from the PATRIC database. Nucleotide genomes were in FASTA format (.fna) and amino acid sequences were in protein FASTA format (.faa). The amino acid sequences were annotated by PATRIC using the RAST tool kit (RASTtk) [55]. For dual antibiotics, such as trimethoprim-sulfamethoxazole, we used the MIC values of the first antibiotic, because the second one either depends on the first one or is constant [15]. For each antibiotic and each species, we discarded MIC target values that were represented by too few strains. We set the threshold for discarding a target value to 3, meaning that any target value represented by only 1, 2 or 3 strains was discarded from the dataset of that particular species-antibiotic combination. The number of genomes and distribution of MIC values for each species and each antibiotic are provided in Appendix A. Chemical structures of all antibiotics are provided in Appendix J.

Feature Extraction Methods
We compared the results of the following feature extraction methods: counting nucleotide k-mers in raw DNA sequences, counting amino acid k-mers in protein sequences annotated by PATRIC from the raw DNA sequences, gene content (finding which genes exist in which genomes and, if present, how many times each gene occurs in the genome), SNP calling, and the combination of gene content and SNP calling.

Nucleotide k-mers
Nucleotide k-mers were counted using KMC3 [56] for the genome of each strain. The minimum count for output k-mers was set to 1, meaning that any k-mer with a frequency of more than zero was counted. By default, the maximum count for a k-mer in KMC3 is 255. In order to avoid cutting off the counts when a k-mer was repeated more than this many times, we set the maximum value to 4,294,967,295, which is larger than any possible k-mer count in our data, because the longest genome in our data had 9,985,884 characters and a k-mer count theoretically cannot be larger than that. k-mers of length 8, 9, 10 and 11 were tested. We did not try longer k-mers because of memory limitations.
Two scenarios are possible for counting nucleotide k-mers. One scenario is to convert all of the non-canonical k-mers (k-mers that are lexicographically larger than their reverse complements [57]) to their reverse complements to obtain the canonical form, and count only canonical k-mers. The other scenario is to count all k-mers [57]. For each k, both scenarios were tested in Appendix F. The results show that converting non-canonical k-mers to canonical form performs better. Therefore, canonical counting was used for all nucleotide k-mer analyses in the main text. The rest of the parameters were set to their default values.
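Canonicalization can be illustrated with a short sketch (KMC3 performs this internally; the code below only demonstrates the idea on a toy sequence):

```python
# A canonical k-mer is the lexicographically smaller of a k-mer and its
# reverse complement; counting canonically merges the two strands' counts.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the canonical form of a nucleotide k-mer."""
    rc = kmer.translate(COMPLEMENT)[::-1]  # reverse complement
    return min(kmer, rc)

def count_canonical_kmers(seq, k):
    """Count canonical k-mers in a single nucleotide sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        key = canonical(seq[i:i + k])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, "TTT" and its reverse complement "AAA" are counted under the single key "AAA", which is why canonical counting roughly halves the feature space.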

Amino Acid k-mers
Amino acid k-mers were counted in the protein FASTA sequences downloaded from the PATRIC database. For counting the amino acid k-mers, we used MerCat [58]. The minimum frequency for output k-mers was set to 1, so that any existing k-mer was counted; 3-mers, 4-mers and 5-mers of amino acids (which correspond to 9-mers, 12-mers and 15-mers of nucleotides, respectively) were counted for the genome of each strain. We did not try longer k-mers because of memory limitations.
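As an illustration of what MerCat computes, a minimal amino acid k-mer counter might look like the following. The FASTA parser and the toy records are assumptions for the sketch, and k-mers are counted within each protein only, never across protein boundaries:

```python
from collections import Counter
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def aa_kmer_counts(handle, k):
    """Count amino acid k-mers over all protein sequences of one genome."""
    counts = Counter()
    for _, seq in read_fasta(handle):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy protein FASTA file with two hypothetical proteins:
faa = StringIO(">p1\nMKVL\n>p2\nMKV\n")
counts = aa_kmer_counts(faa, 3)
```

Because the 20-letter amino acid alphabet is larger than the 4-letter nucleotide alphabet, a 3-mer of amino acids carries roughly as much sequence information as a 9-mer of nucleotides while producing far fewer distinct features.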

Gene Content
The goal of this method is to predict MIC based on the gene content of the strains. The gene content of a genome is defined as the set of genes that exist in that genome and the number of times each gene exists. In [35], a subset of genes was selected by a GA method to be used for AMR prediction. On the other hand, [36,38,39] used all genes as features. We chose the second approach, which includes all genes. Moreover, in [35,36,38,39], the clustering of the genes was performed on all genomes, regardless of the partitioning of the genomes into training and testing sets. This creates a bias for the machine learning model, because the model is exposed to some information about the test data during the clustering step before training. To avoid this bias, we performed clustering on the training data only and then searched for the resulting genes in the test set.
For each species and each antibiotic, in order to find orthologous genes across different genomes, in each training set, we combined all of the genes in the amino acid sequences of all of the training genomes into one FASTA file. Then, the entries of this FASTA file were clustered using MMseqs2 version 9.d36de [59,60]. We used the easy-linclust command with default settings: Setcover clustering mode, maximum e-value: 0.001, minimum alignment length 80%, amino acid substitution matrix: Blosum62. After performing the clustering, the list of all existing gene clusters was extracted. Each cluster was labeled with the PATRIC ID of a representative gene sequence. For each genome, the number of times each gene cluster was observed was counted to create the gene cluster feature matrix. In this matrix, each row is a genome and each column is a gene, and the corresponding element is the number of times that gene exists in that genome. We used counts rather than binary absence/presence so that the model could tell whether a gene was observed more than once in a genome. After training the model, to determine the existence and counts of the genes in the test genomes, we again used MMseqs2 with the same parameters (maximum e-value: 0.001, minimum alignment length 80%, amino acid substitution matrix: Blosum62) to search the test genomes against the training gene clusters. By doing this, we made sure that the training set and the test set were completely separated and that the model was not exposed to any information about the test genomes during or before training.
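The construction of the gene-content count matrix from cluster assignments can be sketched as follows. The (representative, member) pair format mirrors the cluster TSV that MMseqs2 emits, while the gene-ID-to-genome mapping used here is a hypothetical convention for illustration:

```python
from collections import defaultdict

def gene_content_matrix(cluster_pairs, genome_of):
    """Build the gene-content count matrix from (representative, member)
    pairs, such as the rows of an MMseqs2 cluster TSV. `genome_of` maps a
    gene ID back to its source genome (the mapping scheme is an assumption).
    Returns (genome labels, cluster labels, count matrix)."""
    counts = defaultdict(lambda: defaultdict(int))  # genome -> cluster -> n
    clusters = set()
    for rep, member in cluster_pairs:
        clusters.add(rep)
        counts[genome_of(member)][rep] += 1
    genomes = sorted(counts)
    cols = sorted(clusters)
    matrix = [[counts[g][c] for c in cols] for g in genomes]
    return genomes, cols, matrix

# Toy clustering: genome g2 carries two copies of the gene in cluster gA.
pairs = [("gA", "g1.gA"), ("gA", "g2.gA"), ("gA", "g2.gA2"), ("gB", "g1.gB")]
genomes, cols, M = gene_content_matrix(pairs, lambda gid: gid.split(".")[0])
```

Each cell holds a copy number rather than a 0/1 flag, matching the choice described above of counts over binary absence/presence.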

SNP Calling
The goal of this method is to predict AMR based on the SNPs in the genome. There are multiple software packages for SNP calling used in the literature [61][62][63][64]. Based on the comparative analysis of [65], Snippy [66] had the best overall rank when the reference genome was divergent from the sources of the reads; thus, we chose Snippy for our analysis. For the genome sequence of each strain, SNPs were extracted by comparing each genome to the reference sequence. The reference sequences were downloaded from the National Center for Biotechnology Information (NCBI) [53] in FASTA format. The NCBI accession IDs of the chosen reference genomes for the different species are as follows. C. jejuni: NC_002163; K. pneumoniae: NC_016845; N. gonorrhoeae: NC_002946.2; and S. enterica: NC_003198.1. We used Snippy version 4.6.0 [66] for extracting SNPs. Since genes might be at different locations in the query genomes with respect to the reference genome, Snippy shreds each query into 250 bp pseudo-reads with 20x coverage, aligns the short pseudo-reads to the reference, and finds SNPs using Freebayes [62] (version 1.3.2-dirty). We annotated the SNPs with respect to the reference genome. At each position on the reference, the nucleotide can change to three different bases. For example, if at position 200 the reference has A, the variant at that position can be C, T or G. In order to keep all of the information, we one-hot coded the SNP features. For example, for the aforementioned position, 3 features can exist: position 200 > C, 200 > G, and 200 > T. Snippy also finds indels, which we did not use.
This representation of the data leads to a sparse matrix (more than 95% zeros), because many SNPs do not exist in most of the strains. We exploited this property of the data to conserve memory by storing the data in the form of a compressed sparse row matrix from SciPy (version 1.3.0) [67].
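The one-hot SNP encoding and its sparse storage can be sketched together. To keep the example dependency-free, the code builds the data/indices/indptr arrays of the compressed-sparse-row layout by hand, which is the same layout SciPy's csr_matrix stores internally; the SNP labels are illustrative:

```python
def one_hot_snp_features(strain_snps, all_snps):
    """One-hot encode SNPs: each (position, alternate base) pair, written
    here as e.g. '200>C', becomes its own binary feature. Returns the
    CSR-style arrays (data, indices, indptr) plus the feature index map,
    so only the nonzero entries are ever stored."""
    index = {snp: j for j, snp in enumerate(sorted(all_snps))}
    data, indices, indptr = [], [], [0]
    for snps in strain_snps:                      # one row per strain
        for snp in sorted(snps, key=index.get):   # column indices, ascending
            data.append(1)
            indices.append(index[snp])
        indptr.append(len(indices))               # row boundary
    return data, indices, indptr, index

# Three toy strains; position 200 can mutate to C or G, position 305 to T.
strains = [{"200>C", "305>T"}, {"200>C"}, set()]
all_snps = {"200>C", "200>G", "305>T"}
data, indices, indptr, index = one_hot_snp_features(strains, all_snps)
```

The same arrays can be handed directly to `scipy.sparse.csr_matrix((data, indices, indptr))`, so a matrix that is more than 95% zeros costs memory only for its nonzero cells.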

Gene Content and SNP Calling
Both the gene content and SNP features were combined and tested as another feature extraction method: the SNP features extracted according to Section 2.2.4 and the gene content features extracted according to Section 2.2.3 were concatenated into one vector. Since most of the features in this data were SNP features, the data were also sparse, and we used the same sparse representation as in Section 2.2.4.

Feature Matrix and Target Values
For each species and each antibiotic, regardless of the feature extraction method, certain features exist only in certain genomes, not in all of them. For example, for k-mer features, the genome of one strain has one subset of all possible k-mers and the genome of another strain has a different subset. To build a unified feature matrix, the union of all existing features was calculated to create the feature space. The feature matrix was created as an N_feat × N_strain matrix, where N_feat is the total number of existing features and N_strain is the number of genomes in the dataset. The feature matrix was used for machine learning in conjunction with the target MIC values.
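A minimal sketch of building the unified feature matrix from per-genome feature counts (toy k-mer counts are used for illustration; missing features simply count as 0):

```python
def build_feature_matrix(per_genome_counts):
    """Union all features seen in any genome, then build the
    N_feat x N_strain count matrix over that shared feature space."""
    features = sorted(set().union(*per_genome_counts))
    matrix = [[counts.get(f, 0) for counts in per_genome_counts]
              for f in features]
    return features, matrix

# Two toy genomes with partially overlapping k-mer vocabularies:
counts = [{"AAA": 2, "AAC": 1}, {"AAC": 3, "GGG": 1}]
features, X = build_feature_matrix(counts)
```

Each row of the resulting matrix corresponds to one feature and each column to one strain, matching the N_feat × N_strain orientation described above.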
Similarly to [15,24], when the MIC value was larger than the maximum testing threshold (reported as MIC > x), the employed MIC value was replaced with 2 × x, and when the MIC value was smaller than the minimum testing threshold (reported as MIC < x), it was replaced with x/2. In all cases, the target MIC values were converted to the log2 scale for the machine learning task. Without this conversion, the model would not be able to distinguish the differences between small MIC values in the presence of larger values. For example, small target values, such as 0.125 and 0.25, look the same in the presence of large target values, such as 64 and 128, because the difference between two small target values looks like a small amount of noise compared to the large target values. After the conversion, the mentioned target values become −3, −2, 6 and 7, respectively (see Figures A1-A4). The overall pipeline for all feature extraction methods is depicted in Figure 1.
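The conversion of reported MIC values into regression targets can be sketched as follows; the string forms ">x" and "<x" are assumed representations of the censored measurements:

```python
import math

def mic_to_target(mic_string):
    """Convert a reported MIC into the log2-scale regression target:
    '>x' (above the tested range) becomes log2(2 * x),
    '<x' (below the tested range) becomes log2(x / 2),
    and an exact measurement 'x' becomes log2(x)."""
    s = mic_string.strip()
    if s.startswith(">"):
        value = 2 * float(s[1:])
    elif s.startswith("<"):
        value = float(s[1:]) / 2
    else:
        value = float(s)
    return math.log2(value)
```

With this mapping, the examples in the text (0.125, 0.25, 64, 128) become −3, −2, 6 and 7, so one dilution step always corresponds to a unit step in the target.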

Measuring the Prediction Performance
According to the FDA standards for antimicrobial susceptibility tests [68], essential agreement (EA) between two MIC measurement approaches is achieved when the MIC of the proposed method is within ±1 two-fold dilution of the reference method. In other words, if the reference MIC is denoted by x and the MIC of the proposed method is denoted by x̃, agreement is reached when x/2 ≤ x̃ ≤ 2x. This can be used to evaluate the performance of MIC prediction methods: ±1 two-fold accuracy is defined as the number of predictions that satisfy EA, divided by the total number of predictions, as done in [15,19,24,42,43]. Note that in the log2 scale, the accepted range of EA becomes [log2(x) − 1, log2(x) + 1]. We used this metric as the main method to evaluate the prediction performance. As an alternative that tolerates greater error, ±2 two-fold dilution accuracy accepts anything in the range of [x/4, 4x] as a correct prediction for the actual target value of x.
A more theoretical alternative metric is the root mean squared error (RMSE). For a set of N target values x_i and a set of corresponding predictions x̃_i, the RMSE is defined as:

RMSE = sqrt( (1/N) · Σ_{i=1}^{N} (x_i − x̃_i)² )

RMSE is more precise compared to ±1 two-fold accuracy, because it better quantifies the error. Two other biologically important metrics are the major error (ME), which is the type I error, and the very major error (VME), also known as the type II error [68]. These metrics are calculated based on breakpoints for the MIC. Breakpoints are concentrations (mg/L) of an antibiotic that define whether a strain is susceptible or resistant to the antibiotic. If the MIC is less than or equal to the susceptibility breakpoint, the strain is considered susceptible to the antibiotic. If the MIC is greater than the resistance breakpoint, the strain is considered resistant to the antibiotic. If the MIC is between the susceptibility and resistance breakpoints, the strain is considered intermediate. For C. jejuni and S. enterica, the breakpoints were obtained from the National Antimicrobial Resistance Monitoring System for Enteric Bacteria (NARMS) [69]. For N. gonorrhoeae and K. pneumoniae, the breakpoints were obtained from the Clinical and Laboratory Standards Institute (CLSI) [46]. Predicting a truly susceptible strain as resistant is an ME, and predicting a truly resistant strain as susceptible is a VME. The rates of ME and VME can be calculated by dividing the number of errors by the total number of tested strains.
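The ±1 two-fold accuracy and the RMSE described above can each be computed in a few lines on log2-scale values (a sketch, not the evaluation code used in the study):

```python
import math

def two_fold_accuracy(y_true_log2, y_pred_log2, tol=1):
    """Fraction of predictions within ±tol two-fold dilutions of the
    reference MIC (tol=1 gives essential agreement, tol=2 the looser
    metric). Both inputs are MICs already converted to log2 scale, where
    one dilution step equals a distance of 1."""
    hits = sum(abs(p - t) <= tol for t, p in zip(y_true_log2, y_pred_log2))
    return hits / len(y_true_log2)

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))
```

ME and VME rates would additionally require the per-antibiotic breakpoints to first map each true and predicted MIC to a susceptible/intermediate/resistant call.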
We report a performance evaluation with all of the metrics for all of the species-antibiotic combinations in Appendix E.

Machine Learning
For each species and each feature extraction method, we trained a separate model for each antibiotic. Similarly to methods in [15,19,20,24,43], we trained the model to predict MIC as a regression task, rather than just classifying the genomes into resistant or susceptible.
In order to choose the best regression package, different methods were tested on an experimental dataset (S. enterica and ampicillin, 4-mers of amino acids) to see which one performed better. After separating 10% of the data as a hold-out set, the remaining genomes were divided into 10 folds of stratified cross-validation. The regression packages that we tried were linear regression, ridge regression, support vector machine regression with three kernels (linear, radial basis function (RBF) and polynomial), random forest regression [70], AdaBoost regression [71] and XGBoost regression [54]. For XGBoost, we used the Python implementation version 1.0.2 [72] with the regression objective function and squared loss. For all other methods, we used scikit-learn version 0.23.1 [73]. In this experiment, all methods were trained with their default parameters and no hyper-parameter tuning was performed. The results of the comparison are provided in Section 3.1.2. Since XGBoost performed better than all of the other methods and had a reasonable computational complexity, we chose it.

Hyper-Parameter Tuning for XGBoost
After selecting XGBoost, in order to find the best combination of hyper-parameters, we performed hyper-parameter tuning for each species and each feature extraction method separately. Since we had four species and five feature extraction methods, the hyper-parameter tuning was performed 20 times. For the k-mer counting feature extraction methods, different k-mer lengths exist; we ran hyper-parameter tuning on 4-mers of amino acids and 10-mers of canonical nucleotides. Since different antibiotics exist, for each species, one antibiotic was used for tuning. For C. jejuni, we used clindamycin; for N. gonorrhoeae, we used tetracycline; for K. pneumoniae, we used cefoxitin; and for S. enterica, we used streptomycin. These antibiotics were chosen because they each had (i) a relatively uniform distribution of MIC values, (ii) a large range of values and (iii) a sample size large enough to allow the model to learn the data while being small enough to keep the computational cost manageable. Before performing hyper-parameter tuning, the genomes of each microbe were divided into two parts: 90% of the genomes were used for running the experiment, and 10% were held out (see Figure 2). For each microbe, the hold-out set was not used in any of the hyper-parameter tuning or cross-validation steps and was saved exclusively for the final evaluation. Since there are different numbers of genomes for different antibiotics, we could not use the same hold-out set for all antibiotics. We also ensured that no genome in the hold-out set would ever be observed by any of the models during the parameter-tuning step. Thus, for each species, the genome IDs of the genomes that were used for hyper-parameter tuning were saved by the pipeline, and the pipeline made sure that for the other antibiotics these genomes did not fall into the hold-out set.
In cases where the number of unobserved genomes was less than 10% of the total number of genomes, due to the smaller number of genomes for some antibiotics compared to the antibiotic that was used for hyper-parameter tuning, less than 10% of the data were used as the hold-out set.
In the hyper-parameter tuning stage, we performed cross-validation with five folds. In each fold, 20% of the data were used for validation and 80% were used for parameter tuning (see Figure 2). In each fold, the hyper-parameter combination that minimized the mean squared error on the validation set was selected as the optimal combination. After running the tests, we had five sets of hyper-parameters corresponding to the five folds. The final chosen hyper-parameters were those that minimized the RMSE on the validation set. Since some of the datasets are very large, more than one hundred gigabytes, it is not computationally feasible for many researchers to experiment with all possible combinations of all hyper-parameters in an exhaustive search manner. We used Optuna [74], a software package that implements early stopping for XGBoost by using the built-in validation check feature of XGBoost and a tree-structured Parzen estimator (TPE) [75] for choosing combinations of hyper-parameters, to perform hyper-parameter tuning in each of the five folds. The hyper-parameters we searched were the learning rate, maximum tree depth, minimum child weight, lambda, gamma, column sub-sampling, maximum delta step, and number of estimators. Table 1 shows the ranges of the hyper-parameters that were swept.

Figure 2. Overall pipeline: First, a hold-out set is separated for the final evaluation. Then, for each microbe and each feature extraction method, hyper-parameter tuning is done on one antibiotic with 5 folds of cross-validation, which results in 5 different sets of hyper-parameters. The parameter set that minimizes the RMSE is chosen for 2 experiments: cross-validation using 10 folds, and a final evaluation on the hold-out set.
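The per-fold search loop can be sketched as follows. Random search stands in here for Optuna's TPE sampler, the parameter ranges are illustrative rather than those of Table 1, and `evaluate` is a hypothetical callback that would train XGBoost and return the validation MSE:

```python
import random

def tune_fold(evaluate, n_trials=200, seed=0):
    """Pick the hyper-parameter combination that minimizes the validation
    MSE for one cross-validation fold. Each trial samples a candidate
    configuration and scores it via the `evaluate` callback."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform
            "max_depth": rng.randint(2, 12),
            "min_child_weight": rng.uniform(0, 10),
        }
        mse = evaluate(params)  # e.g. train XGBoost, score the val fold
        if best is None or mse < best[0]:
            best = (mse, params)
    return best

# Toy objective standing in for "train XGBoost, return validation MSE":
# it is minimized near max_depth == 6 with a small learning rate.
def toy_val_mse(params):
    return (params["max_depth"] - 6) ** 2 + params["learning_rate"]

best_mse, best_params = tune_fold(toy_val_mse)
```

Running this once per fold yields the five candidate configurations from which the final set, the one with the lowest validation RMSE, is chosen.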

Training the Model
After optimizing the parameters of XGBoost, it was used for training and testing with the 10 different feature extraction methods for C. jejuni, S. enterica, N. gonorrhoeae and K. pneumoniae. We used two schemes for training and testing: 10-fold cross-validation and evaluation on the hold-out set (see Figure 2). For the 10-fold cross-validation, in each fold, 10% of the data were used for testing and 90% were used for training. Within the training part, 10% of the training data were used for validation to prevent over-fitting, and the rest were used for training the model. The trained model of each fold was then tested on the corresponding test set. In the hold-out evaluation, all of the data except for the hold-out set were used for training, and the hold-out set was used for the final evaluation.
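The nested splitting scheme (an outer test fold plus an inner validation set for early stopping) can be sketched as:

```python
import random

def cross_validation_splits(n, n_folds=10, val_frac=0.1, seed=0):
    """Yield (train, val, test) index lists for each fold: one fold serves
    as the test set, and val_frac of the remaining data is held back as a
    validation set (used, e.g., to early-stop XGBoost and prevent
    over-fitting). Indices are shuffled once for reproducibility."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        rest = [j for f in folds[:i] + folds[i + 1:] for j in f]
        n_val = max(1, int(len(rest) * val_frac))
        yield rest[n_val:], rest[:n_val], test
```

For 100 genomes this gives, per fold, 10 test genomes, 9 validation genomes and 81 training genomes, with the three sets always disjoint.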

Feature Selection
After training the model with cross-validation in 10 folds, we wanted to extract the important features. Important features serve two purposes:

•
A concise model using only important features can be built.

•
The features can be used to gain biological insights.

Feature Selection Pipeline
To find the important features, we ranked them by their absolute SHAP [76,77] values. SHAP (SHapley Additive exPlanations) assigns each feature an importance value based on the contribution of that feature to the output of the model, by comparing the output of the model with and without that feature [76]. A feature that makes a large change in the output will have a large absolute SHAP value, and a feature that does not make a large difference in the output will have a small absolute SHAP value.
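Ranking features by mean absolute SHAP value can be sketched as follows, assuming the per-sample SHAP matrix has already been computed (e.g., with shap.TreeExplainer applied to the trained XGBoost model); the feature names and values are illustrative:

```python
def rank_features_by_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| across samples. `shap_values` is an
    (n_samples x n_features) matrix of per-sample SHAP values, such as the
    array returned by shap.TreeExplainer(...).shap_values(X)."""
    n = len(shap_values)
    importance = [
        (sum(abs(row[j]) for row in shap_values) / n, name)
        for j, name in enumerate(feature_names)
    ]
    return [name for _, name in sorted(importance, reverse=True)]

# Two toy samples, three toy features:
ranked = rank_features_by_shap(
    [[0.1, -2.0, 0.0], [-0.3, 1.5, 0.0]],
    ["kmerA", "kmerB", "kmerC"],
)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel to a deceptively small importance.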
Since we performed 10-fold cross-validation for each species-antibiotic combination, there were 10 models in each experiment. However, we wanted a unified set of important features from all folds. To combine the features selected in different folds, we used the following method: In each fold, the top N_top features are picked and added to the set of important features. Any feature in this set might be in the top N_top features of the model in one or more folds. For each feature in the set of important features, the number of folds in which it makes it into the top N_top is counted, and the features are ranked based on this number. For example, a feature that makes it into the top N_top in 9 folds out of 10 receives a higher rank than a feature that makes it into the top N_top only 3 times. This method returns a list of features ranked based on their importance across all folds of cross-validation. In this pipeline, N_top is a hyper-parameter. We tuned this hyper-parameter and found an optimal value of 50. Details of this optimization are provided in the rest of this section. Pseudo-code for the feature selection pipeline is presented in Algorithm 1.

Algorithm 1 Feature selection pipeline
1: for fold i = 1 to 10 do
2: Train the model on fold i;
3: Selected features_i = top N_top features in fold i;
4: end for
5: Find the union of the selected features in all folds;
6: for each feature in the union set do
7: Find in how many folds this feature makes it into the top N_top;
8: end for
9: Rank the features in the union based on the number of times each feature appears in the top N_top;
10: Output the ranked list of features;

In the Results section, for the different antibiotic-species combinations, we report the top features selected by this pipeline. Moreover, for each feature, all of the existing strains (including the ones that were filtered out) are divided into 2 groups: positive strains, which have the feature at least once, and negative strains, which do not have the feature.
The Kruskal-Wallis (K-W) test [78] is used to compare the MICs of the two groups and see if there is a significant difference. The reported features are the ones that appear in the top 50 features in at least 8 out of 10 folds and whose p-values are less than the significance threshold (0.05).
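The fold-consensus step of the pipeline can be sketched as follows; the significance filter via the K-W test would be applied to the returned features afterwards, and the feature names are illustrative:

```python
from collections import Counter

def consensus_features(per_fold_top_features, min_folds=8):
    """Rank features by the number of cross-validation folds in which they
    appear among the top N_top, keeping those seen in at least `min_folds`
    folds (the counting step of Algorithm 1). Ties are broken by name for
    determinism. Returns a list of (feature, fold count) pairs."""
    tally = Counter()
    for top in per_fold_top_features:
        tally.update(set(top))  # a feature counts once per fold
    ranked = sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f, n) for f, n in ranked if n >= min_folds]

# Toy run: gA is in the top list of all 10 folds, gB in 9, gC in only 1.
folds = [["gA", "gB"]] * 9 + [["gA", "gC"]]
selected = consensus_features(folds, min_folds=8)
```

Features that surface in almost every fold are far less likely to be artifacts of a particular train/test split, which is the rationale for ranking by fold count rather than by raw SHAP magnitude alone.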

Training the Model with the Selected Features
After obtaining the list of features sorted by their importance, we built models using only the most important features. To do this, we first picked the single most important feature and trained the model with 10-fold cross-validation using only that feature. Then, we selected the 2 most important features and trained the model in 10 folds. The same process was repeated up to the 40 most important features. The minimum number of features that reached the maximum accuracy was selected as the required number of features for the "selected-feature" model. This selected-feature model was then tested in the hold-out evaluation scheme.
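Picking the smallest feature count that attains the best cross-validation accuracy can be sketched as (the accuracy values are illustrative):

```python
def minimal_feature_count(cv_accuracies):
    """cv_accuracies[i] is the cross-validation accuracy of a model trained
    with the (i + 1) most important features; return the smallest number of
    features that attains the maximum accuracy (list.index returns the
    first, i.e. smallest, position of the maximum)."""
    best = max(cv_accuracies)
    return cv_accuracies.index(best) + 1

# Toy sweep over 1..5 features: accuracy plateaus from 3 features onward.
n_features = minimal_feature_count([0.70, 0.81, 0.85, 0.85, 0.84])
```

Here the selected-feature model would be trained with 3 features, since adding a 4th or 5th feature does not improve the cross-validation accuracy.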

Tuning the Hyper-Parameter N top
In Algorithm 1, N_top is a hyper-parameter. We tested the values [20, 30, 40, 50, 60, 70, 80] for this hyper-parameter on a dataset of C. jejuni and tetracycline using gene content feature extraction. To find the best value, we needed a metric for evaluating candidate values. Since the purpose of this pipeline is to find important features, the best performance is achieved by the set of features that reaches good accuracy with the fewest features. The benchmark accuracy we chose was the average cross-validation accuracy when all of the features are used. In other words, the question was: with which value of N_top can a pipeline that uses only the selected features reach the average cross-validation accuracy of a pipeline that uses all features? The best performance was achieved by N_top = 50, where the pipeline reached an accuracy better than the average accuracy of the model trained with all features while using only 10 features.

Interpreting the Top Features
After selecting the top features, we wanted to know what these features are and what information we can extract from them, for two purposes:
• Validation: When the model finds top features that we already know are important AMR determinants, we know that the model has been trained and is working properly.
• Discovery of new AMR determinants: In an accurate model, the top features that are derived from the data can be used for discovering new AMR genes/SNPs.
For the gene content pipeline, we interpreted the functions of the selected genes by searching the gene ID of a representative gene of each cluster in the PATRIC database. In some cases, we also blasted the important gene sequences in other databases, such as CARD [27] and NCBI [53], as reported in Section 3.2. For SNP features, we first extracted the position of each SNP with respect to the reference, then looked for the gene in which the SNP occurred, and then found the function of that gene in the GenBank file provided for the reference in the NCBI database. For the k-mer features, we searched them in the NCBI database using BLAST [79]. We specified the species name before each BLAST search. When a query is searched against a database, it might align with multiple positions in different genes. We only considered hits with perfect matches to the query and discarded cases where only a part of the k-mer matched a subject. When there was a disagreement among perfect matches, meaning that the query aligned with multiple genes, we reported the gene with the larger number of hits. In cases where the hits with full query coverage came from different genes and none of these genes outnumbered the others, or where the product of the gene with the maximum number of hits was a hypothetical protein, we did not report anything.
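The tie-breaking logic for BLAST hits described above can be sketched as follows; the hit representation (a gene name plus a perfect-match flag) is a simplification for illustration, not the study's actual parsing code:

```python
from collections import Counter

def gene_from_blast_hits(hits):
    """Decide which gene a k-mer query belongs to, given BLAST hits as
    (gene_name, perfect_match) tuples. Only perfect (full-coverage,
    100%-identity) hits count; the gene with the most such hits is
    reported, while ties and hypothetical proteins yield None."""
    perfect = [gene for gene, is_perfect in hits if is_perfect]
    if not perfect:
        return None  # only partial matches: nothing to report
    counts = Counter(perfect).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no gene outnumbers the others
    top_gene = counts[0][0]
    return None if "hypothetical" in top_gene.lower() else top_gene

# Two perfect hits on tet(A) outnumber one on gyrA, so tet(A) is reported.
reported = gene_from_blast_hits(
    [("tet(A)", True), ("tet(A)", True), ("gyrA", True), ("tet(B)", False)])
```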

Software Implementation and Availability
All of the computational results reported herein were obtained using Python version 3.7.1 on CentOS Linux release 7.6.1810 and Red Hat Enterprise Linux Server release 6.5. We have made all of the source code available on GitHub at https://github.com/TahaAslani/AAk-mer.

Results
The results section is divided into two subsections: First, we present results of performances of different techniques. Second, we analyze the AMR determinants for different antibiotics by interpreting the top features that our feature selection pipeline picked. Analyses of feature stability and performance evaluations for selected-feature models are provided in Appendices G and H, respectively.

Comparison of Required Numbers of Features
When comparing different feature extraction methods, it is important to take the number of features into account, because it directly affects memory usage and computational complexity. For k-mer counting methods, the number of features is determined by the number of existing k-mers in all genomes. For SNPs, the number of features is the number of existing SNPs in all genomes multiplied by four, because of the one-hot coding of the four possible nucleotides after mutation. In the case of gene content, the number of features is the number of clusters obtained after clustering the genomes. Finally, for "gene content + SNP," the number of features is simply the sum of the numbers for the gene content and SNP methods. Figure 3 depicts the average and standard deviation of the number of features for each method across all species. For SNP data, although the number of features is large, the data are always sparse (more than 95% zero), so a sparse data format can be used to mitigate memory issues.
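A back-of-the-envelope illustration of the memory saving from a sparse format; the strain and feature counts below are invented for illustration, and the byte sizes assume 8-byte values with 4-byte row/column indices (COO-style storage):

```python
def sparse_storage_ratio(n_strains, n_features, density,
                         bytes_per_value=8, bytes_per_index=4):
    """Rough memory comparison: dense matrix vs COO-style sparse storage
    (one value plus a row index and a column index per nonzero entry)."""
    dense = n_strains * n_features * bytes_per_value
    nnz = int(n_strains * n_features * density)
    sparse = nnz * (bytes_per_value + 2 * bytes_per_index)
    return dense / sparse

# SNP features: more than 95% zeros, i.e. density <= 0.05
ratio = sparse_storage_ratio(n_strains=5000, n_features=2_000_000, density=0.05)
# 8 bytes/value dense vs ~16 bytes/nonzero sparse at 5% density: ~10x saving
```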

Comparison of Regression Packages
As mentioned in Section 2.5, different regression packages were compared based on predicting MIC of S. enterica and ampicillin from 4-mers of amino acid. We chose 4-mers of amino acid because the number of features and data size for this feature extraction method were not too large and we could test computationally expensive methods, like random forest, using this dataset. Figure 4 shows the results of this comparison.
The best performance, as measured by the accuracy of prediction within ±1 two-fold dilution level, was achieved by XGBoost. The second best option was random forest, which performed close to XGBoost in terms of accuracy; however, the computational complexity of XGBoost was significantly lower than that of random forest, making XGBoost more scalable to bigger datasets. XGBoost was also reported to outperform other machine learning packages for MIC prediction in [15]. Moreover, in [36], gradient boosting overall performed better than logistic regression, random forest and deep learning.
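The following sketch illustrates this kind of regression setup on synthetic data. It uses scikit-learn's GradientBoostingRegressor as a stand-in with a fit/predict interface similar to XGBoost's; the feature matrix, the log2(MIC) target and the resistance feature are all toy assumptions, not the study's data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy stand-in for a k-mer count matrix: 200 strains x 50 features.
# A single "resistance" feature shifts log2(MIC) by 4 dilution steps.
X = rng.integers(0, 5, size=(200, 50)).astype(float)
y = np.where(X[:, 0] > 2, 4.0, 0.0) + rng.normal(0, 0.3, 200)  # log2(MIC)

model = GradientBoostingRegressor(random_state=0)  # XGBoost exposes a similar API
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])

# ±1 two-fold dilution accuracy: the rounded predicted log2(MIC) is
# within one dilution step of the rounded true log2(MIC).
acc = np.mean(np.abs(np.rint(pred) - np.rint(y[150:])) <= 1)
```

Working in log2(MIC) makes "within ±1 two-fold dilution" a simple absolute-difference check on the regression output.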

Accuracy of Different Feature Extraction Methods Using XGBoost
For the dataset of each species and antibiotic combination, we tested and compared different feature extraction methods. In the main text, we use ±1 two-fold dilution to measure MIC prediction accuracy. A complete evaluation of all models using all metrics described in Section 2.4 is provided in Appendix E. Since we used cross-validation to measure the performance, in each experiment there were 10 different prediction accuracies corresponding to the 10 different folds. The averages and box plots of the distributions of the ±1 two-fold dilution accuracies are reported in Figures 5-8 for the different species. In these figures, the antibiotics are ordered by their classes. In each box plot, the whiskers represent the maximum and minimum, the boxes represent the first and third quartiles, the orange line represents the median and the green line represents the mean. The accuracy on the hold-out set is represented by an "×" mark.

Searching Top Amino Acid and Nucleotide k-mers in NCBI Database
In order to compare the interpretability of models trained with nucleotide and amino acid k-mers, for eight species-antibiotic combinations, we searched the top amino acid 5-mers and nucleotide 11-mers chosen by the proposed feature selection pipeline in the NCBI database. These datasets were: K. pneumoniae-tetracycline, K. pneumoniae-tobramycin, K. pneumoniae-imipenem, S. enterica-cefoxitin, S. enterica-amoxicillin clavulanic acid, S. enterica-ampicillin, S. enterica-chloramphenicol and S. enterica-sulfisoxazole. For both amino acid and nucleotide k-mers, we chose the longest k-mer size permitted by computational complexity limitations. As mentioned in Section 3.1.1, both of these methods have virtually the same feature size. The searched features were chosen based on two criteria: they appear in the top 50 features in at least eight of 10 folds, and their K-W test p-value is less than the significance threshold (see Section 2.5.3). In all of the searches, we used the default parameters of NCBI BLAST.
For the top amino acid 5-mers, the chosen features in all datasets aligned to genes that were known AMR determinants. Results of these alignments are presented in Tables 2, A5, A6, A7, A8, A9, A10 and A11, respectively. On the contrary, in the case of nucleotide 11-mers, none of the queries hit a subject in the NCBI database. (Figure caption: the box plots are similar to Figure 4; the orange line represents the median and the green line the mean; the × marks represent the accuracy on the hold-out set; the antibiotic used for hyper-parameter tuning is indicated by an asterisk; for each method, the top boxes, labeled "All," were obtained by combining all ten folds for all antibiotics.)

Analysis of AMR Determinants for Specific Antibiotics Families
In this subsection, we analyze the performances of models that were trained with different feature extraction methods for different families of antibiotics, and we interpret the models by analyzing the top features that they selected.

Resistance to Tetracycline
In this section, we analyze the determinants of resistance to tetracycline that the models found for different species. It is well known that resistance of C. jejuni to tetracycline is conferred by the presence of the tet(O) gene [80,81]. Tet(O) belongs to the class of ribosomal protection proteins that cause resistance by dislodging tetracycline from its primary binding site on the ribosome [81][82][83]. When XGBoost uses feature extraction methods that are capable of detecting the presence of this gene, such as amino acid 5-mers or gene content, it can detect AMR and predict the MICs of organisms reasonably well. On the other hand, when SNPs are used as features, it cannot detect the presence or absence of this gene. This can be seen in Figure 5, where the results of using SNP features are not as good as those of other methods. Moreover, Figure A5 in Appendix C provides more details about the prediction error for models trained with different methods. Our feature selection method found tet(O) for gene content and gene content + SNP. In the amino acid 5-mer pipeline, the top feature, "RKAEY," aligns with this gene. Out of a total of 481 strains (including the ones that were filtered out), 327 strains had this gene and 154 strains did not. The average MIC of strains that had this gene was 33.30 mg/L (standard deviation 51.16) and the average MIC of strains without this gene was 13.11 mg/L (standard deviation 30.33); K-W test [78] p-value: 0.0418.
The same pattern in the accuracies of the different methods can be seen in Figures 7 and 8 for K. pneumoniae and S. enterica, respectively. For K. pneumoniae, the gene content and gene content + SNP pipelines found tet(A), tet(D) and tetR, which are known to confer resistance to tetracycline in K. pneumoniae [84,85]. We found that SNP features are not capable of reflecting the presence of those genes. Important features found by the amino acid 5-mer and gene content pipelines, using the method described in Section 2.5.3, for K. pneumoniae and tetracycline are presented in Table A5 in Appendix D. The top features of the amino acid 5-mer pipeline are "DGLTT," "LIMPV" and "HYGIL," which align with TetR, Tet(A) and Tet(A), respectively.

Two Types of the tet(D) Gene
Gene clustering reveals two clusters associated with the tet(D) gene, and two clusters associated with the tetR gene. We investigated further to see why, and to determine whether both clusters are correlated with AMR. In the case of tet(D), two clusters are formed because, in the cluster represented by fig573.13783.peg.5353, the gene has been truncated to approximately one quarter of the length of the corresponding gene found in the cluster represented by fig573.14286.peg.4536. In fact, the gene sequence in most of the members of the cluster represented by fig573.13783.peg.5353 is the last 97 amino acids of the consensus gene sequence of the cluster represented by fig573.14286.peg.4536 (all sequences in that cluster are 394 amino acids long). In the case of tetR, we found that this is due to mutations in this gene in some strains; e.g., the sequence identity obtained by blasting the representative genes of the two clusters against each other is 52%.
Interestingly, strains that have the truncated version of the tet(D) gene have a lower average MIC value compared to strains that do not have this gene at all. The gene content pipeline also found a class A β-lactamase gene as a factor for resistance to tetracycline. This shows that some strains are multi-drug resistant.
For S. enterica, tet(A) has been reported before to confer resistance [86]. For the gene content method, our feature selection pipeline found tet(A) [86], tet(D) [86] and tetR. For the amino acid 5-mer method, the feature selection pipeline selected "GLIMP" and "GPLLF" which align with tet(B) and "ALYWH," which aligns with tetR.

Resistance to Quinolone Antibiotics
In this subsection, we analyze the causes of resistance to quinolone antibiotics, which include nalidixic acid and ciprofloxacin. Quinolone antibiotics target two essential bacterial enzymes, DNA gyrase and DNA topoisomerase IV [88]. Quinolone resistance can be caused by single amino acid changes in gyrase [88]. Gyrase is composed of two GyrA and two GyrB subunits [89]. For C. jejuni and both ciprofloxacin and nalidixic acid, the SNP and gene content + SNP pipelines recognize a mutation at position 959,966 with respect to the reference genome (conversion of the original nucleotide to A) as the most important feature. This nucleotide is in the gyrA gene. For nalidixic acid and C. jejuni, the mean MIC of strains that have this mutation (N = 87) is 119.26 mg/L (standard deviation 25.56) and for strains without it (N = 394) it is 6.43 mg/L (standard deviation 13.91), for a K-W p-value of 9.88 × 10^−62. The top three k-mers selected by the amino acid 5-mer method also hit this gene. For ciprofloxacin and C. jejuni, the average MIC of strains that have this mutation (N = 87) is 7.27 mg/L (standard deviation 7.06) and the average MIC of strains that do not (N = 394) is 0.16 mg/L (standard deviation 0.80), for a K-W p-value of 2.29 × 10^−53. Mutations in many other positions of the gyrA gene were also found as important features for both antibiotics and C. jejuni. Both the SNP and gene content + SNP methods also found mutations in gyrB as less important features for both antibiotics and C. jejuni. For C. jejuni and ciprofloxacin, the amino acid 5-mer pipeline found "PHGDT," "HGDTA," "GDTAV" and "DTAVY" as the top four important features. These are parts of a longer sequence, "PHGDTAVY," which aligns with a gene labeled "Campylobacter jejuni gyrA conferring resistance to fluoroquinolones" in the CARD database [27] (GenBank accession: AJY14066.1). The top four features of the amino acid 5-mer pipeline for C. jejuni and nalidixic acid are exactly the same.
For these antibiotics, models that use SNP features predict MIC better than gene content, because the mutations in question occur at a finer resolution than simple gene presence/absence/count. Long amino acid and nucleotide k-mers are also capable of detecting these mutations, while shorter k-mers cannot distinguish them from other genes with the same motif. With a short k-mer size, short motifs are more likely to occur across the genome. The longer the k-mer, the more unique (i.e., only occurring once) it is likely to be; therefore, the interpretation based on the k-mer count will be less ambiguous. This is reflected in the sparsity of the feature matrix. For C. jejuni and nalidixic acid, the sparsity (portion of the matrix that is zero) of the feature matrix for amino acid 3-mers is 1%. For 4-mers and 5-mers of amino acids, the values are 22% and 58%, respectively. In the same manner, the sparsity of the feature matrix for nucleotide 8-mers, 9-mers, 10-mers and 11-mers is 4%, 14%, 28% and 42%, respectively. From Figure A6, it can be seen that the nucleotide 8-mer, amino acid 3-mer and gene content pipelines perform worse at MIC prediction than longer k-mer sizes and SNP methods.
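The relation between k-mer length and sparsity can be illustrated with a toy computation on short protein fragments (the sequences below are invented for illustration; real genomes of course produce far larger vocabularies):

```python
def kmer_matrix_sparsity(seqs, k):
    """Fraction of zeros in the sequences x observed-k-mers presence
    matrix: a toy illustration of why longer k-mers yield sparser
    (and hence less ambiguous) feature matrices."""
    vocab = set()
    per_seq = []
    for s in seqs:
        kmers = {s[i:i + k] for i in range(len(s) - k + 1)}
        per_seq.append(kmers)
        vocab |= kmers
    zeros = sum(len(vocab) - len(km) for km in per_seq)
    return zeros / (len(seqs) * len(vocab))

proteins = ["MKHRRAEYWAY", "MKHRKAEYWAT", "MKQRKAEYAAY"]
# Longer k-mers are shared by fewer sequences, so the matrix gets sparser.
s3 = kmer_matrix_sparsity(proteins, 3)
s5 = kmer_matrix_sparsity(proteins, 5)
```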
The same pattern was observed for performance of models trained with different feature extraction methods in [38] for classification of Pseudomonas aeruginosa strains as resistant or susceptible to ciprofloxacin: Methods that did not account for SNPs performed worse than methods that accounted for SNPs.
In the case of C. jejuni and nalidixic acid, only one amino acid 5-mer is required to reach the best accuracy. For these data, the selected-feature pipeline (described in Section 2.5.3.2) with amino acid 5-mers reached very good accuracy using only the top feature (AA 5-mer: "GDTAV"), which is in "gyrase subunit A" (GenBank accession: AJY14066.1). Figure 9a depicts the change in accuracy as the number of features increases; it can be seen that predicting with more than one 5-mer feature only decreases accuracy. Additionally, Figure A7 depicts the prediction performance, via a violin plot of the error, using only this one feature ("GDTAV").

Resistance to Aminoglycoside Antibiotics
Aminoglycosides (AGs) are a class of antibiotics including tobramycin, gentamicin, amikacin, streptomycin and kanamycin, which bind to the bacterial ribosome and interfere with bacterial protein translation. Although in some rare cases mutations in the ribosomal target of AGs can contribute to resistance, the most widespread mechanism of resistance to these antibiotics is achieved by AG-modifying enzymes (AMEs) [90].
In Table A6, the most important genes discovered by the amino acid 5-mer and gene content pipelines for tobramycin and K. pneumoniae are provided. The gene content pipeline selected ANT(2″)-Ia, AAC(6′)-Ib and ANT(3″)-Ia, which are all known AMEs [91], as important features. The amino acid 5-mer pipeline selected "PYEET" and "DASMV," which align with AAC(3)-IIe; "YAQSY," which aligns with AAC(6′)-Ib/AAC(6′)-II; and "DTTQV," which aligns with ANT(2″)-Ia. All of these genes are known AMEs [91]. ("DTTQV" and "YAQSY" are not shown in Table A6.) Models trained with SNPs and short k-mers are not able to detect these genes. The same relatively poor accuracy of SNP features can be seen for gentamicin and K. pneumoniae, kanamycin and S. enterica, and gentamicin and S. enterica.
In Table A6, the gene cluster represented by PATRIC ID fig573.12887.peg.5713 is labeled as a hypothetical protein in the PATRIC database, but when we BLASTed the protein sequence against the NCBI database, it matched a tunicamycin resistance protein in Escherichia coli (NCBI Reference Sequence: WP_110074664.1) with 100% coverage and 100% identity. This shows that the same gene is correlated with resistance for K. pneumoniae and tobramycin.
Both the gene content and amino acid 5-mer pipelines also found the OXA-1 β-lactamase gene (see Section 3.2.4), which is known for conferring resistance to β-lactam antibiotics in K. pneumoniae [92]. In the case of the amino acid 5-mer pipeline, this gene was found via the 5-mer "QFLRK." Here, both pipelines selected it as an important factor for tobramycin, which means that there are multi-drug-resistant strains in the data.

Resistance to β-Lactam Antibiotics
β-lactam antibiotics are among the most commonly prescribed antibiotics and each have a 3-carbon and 1-nitrogen ring (β-lactam ring) [93]. The most important factor for resistance to these antibiotics is production of beta-lactamases [93]. Other mechanisms of resistance are decreased penetration to the target site, alteration of target site penicillin-binding proteins (PBPs) and efflux from the periplasmic space through specific pumping mechanisms [93].

Carbapenem Antibiotics
It can be seen in Figure 7 that models trained with SNP features alone are not capable of providing a good prediction for resistance of K. pneumoniae to imipenem and meropenem. Both of these antibiotics belong to a class of β-lactam antibiotics called carbapenems [94]. According to [95], the most common resistance mechanism of K. pneumoniae against carbapenem antibiotics is the production of enzymes with carbapenemase activity that hydrolyze β-lactam antibiotics. SNP features are not very effective for finding these genes. Gene content and long k-mer pipelines, on the other hand, can find the presence of these genes easily. For imipenem, a class A β-lactamase gene was selected as an important feature by the gene content pipeline. The average MIC of strains that have this gene (N = 488) is 12.15 mg/L (standard deviation 5.37) and the average MIC of strains that do not have this gene (N = 1178) is 1.29 mg/L (standard deviation 2.21), for a K-W p-value of 1.56 × 10^−236 (see Table A7). The amino acid 5-mer pipeline found "TCGVY" as the fourth most important feature (not shown in Table A7). This 5-mer aligns with "carbapenem-hydrolyzing class A beta-lactamase" in the NCBI database. Interestingly, as soon as this feature is included in the selected-feature pipeline (described in Section 2.5.3.2), the accuracy increases significantly (see Figure 9b). Amino acid 5-mers "TCGVY" and "WELE" also align with class A beta-lactamase genes.

Cephalosporin Antibiotics
Cephalosporins are a class of β-lactam antibiotics that includes cefoxitin, ceftiofur, ceftriaxone, cefazolin, cefepime, cefpodoxime, cefixime, ceftazidime and cefuroxime [96]. Like other β-lactam antibiotics, cephalosporins are inactivated by the β-lactamases produced by the bacteria [97]. Table A8 shows the features that the amino acid 5-mer and gene content pipelines find for S. enterica and cefoxitin. Based on this table, the "class C β-lactamase" gene that was found makes a significant difference in the MIC value. The amino acid 5-mer pipeline found "ANKSY" as the most important 5-mer, which aligns with the same gene. The model trained with SNP features is not capable of detecting such genes, which is why, for cefoxitin, ceftiofur and ceftriaxone, SNP features perform relatively poorly compared to other methods (see Figure 8).
Other β-Lactam Antibiotics
The same pattern can be observed for other β-lactam antibiotics (ampicillin and amoxicillin clavulanic acid) and S. enterica in Figure 8: models trained with SNPs alone do not perform as well as models trained with long k-mers or with gene content. Important genes found by the amino acid 5-mer and gene content pipelines for S. enterica-ampicillin and S. enterica-amoxicillin clavulanic acid are provided in Tables 2 and A9, respectively. It can be seen that for both antibiotics, the presence of β-lactamase genes increases MIC significantly. SNPs are not the best feature extraction method for detecting the presence of these genes. Figure 9e,f depicts the results of gene content pipelines trained and tested with a few selected features on the ampicillin and amoxicillin clavulanic acid datasets, respectively. For S. enterica and amoxicillin clavulanic acid, the accuracy increases significantly when the third and the fifth features are included (Figure 9e). These genes are "class C beta-lactamase (EC 3.5.2.6) => CMY/CMY-2/CFE/LAT family" and "class A beta-lactamase (EC 3.5.2.6) => TEM family," respectively. For S. enterica and ampicillin, the first two features are class A beta-lactamase genes; however, it is only after inclusion of the fifth feature that the accuracy increases (Figure 9f). This gene is "lipocalin Blc." The presence or absence of this gene alone does not make a significant difference in MIC; however, it is involved in the dissemination of antibiotic resistance genes [98], which is why it plays an important role in MIC prediction.
For S. enterica and amoxicillin clavulanic acid, the amino acid 5-mer pipeline found "TDFLR," "AHTWI," "VIDMA," "QNEQK" and "ASWVH" as important features that align with β-lactamase genes (see Table 2). Another important 5-mer is "NTAAN," which aligns with "type IV conjugative transfer system coupling protein TraD." Conjugative plasmids harboring antibiotic resistance genes can be transferred from one bacterium to another through physical contact; after conjugation, the recipient bacterium harbors the antibiotic resistance genes and transfers the acquired plasmid to other bacteria [99]. The amino acid 5-mer pipeline also selects "NQNYG," which aligns with a "cysteine synthase family protein." It has been reported that cysteine synthesis is associated with antibiotic resistance of swarming S. enterica cells [100,101]. Another important 5-mer is "YWDYN," which aligns with a TolC family protein. TolC is required for the function of several drug efflux systems in S. enterica serovar Typhimurium [102]. The gene content pipeline found the gene whose product is the tetracycline resistance regulatory protein TetR as an important feature for resistance to amoxicillin clavulanic acid, which suggests that the strains that have this gene are probably multi-drug resistant.

Resistance to Chloramphenicol
Chloramphenicol is a broad-spectrum antibiotic that inhibits bacterial growth by stopping protein synthesis [104]. Important genes found for S. enterica and chloramphenicol by the amino acid 5-mer and gene content pipelines are presented in Table A10. The top 5-mers, "WAYTL," "YMVML" and "MDIYL," align with "chloramphenicol/florfenicol efflux MFS transporter FloR," which is known to mediate resistance to chloramphenicol [105]. "TAWPV" aligns with "CmlA family chloramphenicol efflux MFS transporter," which confers non-enzymatic chloramphenicol resistance [106]; "CDGFH" and "PVFTM" align with "type A chloramphenicol O-acetyltransferase," a chloramphenicol acetyltransferase (CAT) gene that confers enzymatic resistance to chloramphenicol [105,[107][108][109]. "FAKFF" aligns with a type IV secretion system protein, which can have an antibiotic resistance function [110]. For the gene content pipeline, efflux genes appear among the top genes. The pipeline found chloramphenicol acetyltransferases, which are known to cause resistance to chloramphenicol [108,109]. The model also found the LysR transcriptional regulator; the role of this gene in the regulation of antibiotic resistance in Aeromonas hydrophila was investigated in [111]. The TetR resistance gene was also found by the model as an important feature, which again indicates multi-drug resistance. The importance of all of these genes causes the SNP method to perform relatively poorly compared to the other methods.

Resistance to Sulfonamide
Sulfonamide antibiotics are broad-spectrum antibiotics, including sulfisoxazole and trimethoprim-sulfamethoxazole [112]. These antibiotics interfere with the synthesis of folic acid [112]. Table A11 presents important genes found for S. enterica and sulfisoxazole. The amino acid 5-mer pipeline found the 5-mers "LDPGM," "DPGMG," "GMGFF" and "MGFFL," which all align with "sulfonamide-resistant dihydropteroate synthase Sul1." In fact, these 5-mers are all different sections of the sequence "LDPGMGFFL." "LDPGM" and "GMGFF" are among the top 50 features in all 10 folds, but "MGFFL" and "DPGMG" only make it into the top features in seven folds (possibly because the model does not need them, since they are correlated with the other two important features), which is why they are not represented in Table A11. The gene content pipeline found two gene clusters represented by sulfonamide resistance proteins, as well as ANT(3″)-Ia, an AG-resistance gene discussed in Section 3.2.3, and RmuC, which has been reported to be involved in resistance against norfloxacin [113].
For trimethoprim-sulfamethoxazole and S. enterica, the most important gene that the gene content pipeline found is "dihydrofolate reductase," which is known for conferring resistance to trimethoprim in Salmonella typhimurium [114]. In Figure 9c,d, the accuracy of the selected-feature (i.e., reduced feature number) pipeline is depicted for different numbers of features for amino acid 5-mers and nucleotide 11-mers, respectively. Interestingly, the amino acid 5-mer pipeline can predict MIC with 99% accuracy using only two 5-mers: "IPWKI" and "TYNQW." Both of these amino acid 5-mers align with the same gene in NCBI BLAST when the organism is specified: for "IPWKI," in 48 out of the top 50 matches, and for "TYNQW," in 47 out of the top 50 matches. For the same data, nucleotide 11-mers also reach very good accuracy using selected features; in that case, the best accuracy is obtained when the top three 11-mers are used. These three nucleotide 11-mers are "TATTAGGACCA," "ATATTAGGACC" and "CCCAATAGGAA." No significant similarity was found by NCBI BLAST when these nucleotide 11-mers were searched, because of the short query length. Fortunately, in this particular case, the first and second 11-mers belong to the same region, shifted by one position. This enabled us to extract a longer 12-mer, "ATATTAGGACCA," by combining the two. After extending the length, NCBI BLAST found significant matches for the query but was not able to identify the gene, because the pattern was similar to many other positions in the genome. There was no hit for a "dihydrofolate reductase" gene in the top 50; seven matches to the target gene were found when the top 100 matches were considered. This example shows the superiority of amino acid k-mers compared to nucleotide k-mers.
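The overlap-merging step used above to recover the 12-mer can be sketched in a few lines:

```python
def merge_overlapping_kmers(a, b):
    """Merge two k-mers that overlap with a shift of one position
    into a single (k+1)-mer; returns None if neither ordering of the
    two k-mers produces such an overlap."""
    if a[1:] == b[:-1]:          # a precedes b by one position
        return a + b[-1]
    if b[1:] == a[:-1]:          # b precedes a by one position
        return b + a[-1]
    return None

merged = merge_overlapping_kmers("TATTAGGACCA", "ATATTAGGACC")
# → "ATATTAGGACCA", the 12-mer recovered in the text
```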

Resistance to Macrolide Antibiotics
Macrolide antibiotics are composed of more than two amino or neutral sugars attached to a lactone ring [115]. These antibiotics include azithromycin and erythromycin [115].
For N. gonorrhoeae and erythromycin, the top feature selected by the amino acid 5-mer pipeline is "SKSET," which aligns with a two-component sensor histidine kinase. A two-component system (TCS) is a mechanism in bacteria consisting of a sensor for receiving an external signal, for example, the presence of an antibiotic, and a response regulator that conveys a proper change in the bacterial cell physiology [116]. The expression of antibiotic resistance determinants may be regulated by some TCSs [116]. Another important 5-mer is "SINRE," which aligns with the pilus assembly/adherence protein PilC. Mutation of this gene in Pseudomonas aeruginosa has been reported to cause resistance to aminoglycosides [117]. The gene content pipeline found "RND efflux system" as an important feature, which is a widespread resistance mechanism [118].

Discussion
When considering the choice of feature extraction method for machine learning AMR prediction, our results demonstrate that different considerations support the use of different methods. In particular, the optimal choice of feature extraction method depends on whether gene assembly is necessary, on quantitative AMR prediction accuracy, on constraints on computational complexity (i.e., speed and memory) and on the ability to interpret the model to yield biological insight. Table 3 summarizes the comparison of the different methods on these criteria. We discuss the comparative performance with respect to each of these practical considerations in turn below.

Requirement of Gene Assembly
Among the feature extraction methods we present here, nucleotide k-mers and SNPs do not require the assembly of genes. The input to these methods can be nucleotide contigs obtained by assembling short reads together, possibly using de novo assembly. In the case of SNP features, full assembly of the genome is required only for the reference because each SNP must have a unique position with respect to the reference genome. The requirement of assembly just for the reference genome is usually not an issue because the reference can be obtained from databases such as NCBI [53].
For the amino acid k-mer and gene content (and consequently gene content + SNP) methods, by contrast, the input to the feature extraction method is the amino acid sequence of the genes. This means that just assembling the short reads into contigs is not enough: gene finding must be performed to locate the regions of the sequence that encode genes, and these regions must be translated into amino acid sequences. This adds an extra pre-processing step to these methods.
Not requiring gene assembly makes nucleotide k-mers an interesting option for clinical AMR prediction, where next generation sequencing technologies can be used for predicting AMR in real-time without using the culture-based methods [119,120]. In that situation, predicting AMR as fast as possible and as cheaply as possible is the top priority. Thus, the best option will be a model that utilizes selected-features (i.e., a reduced model based on the top features) that also does not require gene assembly to generate features. For example, DNA microarrays can be employed to rapidly identify presence or absence of certain nucleotide k-mers (selected features). On the other hand, in a scenario where researchers want to train and interpret a model to learn more about AMR mechanisms and possibly discover new mechanisms, assembly of the genes is not an issue. Thus, amino acid k-mers are the better option because of the lower feature size and better interpretability of this method.

Predicting AMR Accurately
In Table 4, performance rankings of each method, based on the average ±1 two-fold dilution accuracy for all antibiotics combined, are presented for the different species. Amino acid 5-mers achieve the best performance for every species. Nucleotide 11-mers are second, except for S. enterica, where gene content is the second best.
Generally, gene content features and SNP features, when used separately, are unable to capture all of the AMR determinants in all datasets, because in some cases genes cause resistance and in other cases SNPs do. This was discussed in detail in Section 3.2. On the other hand, k-mer counting methods can capture the AMR phenotype from both genes and SNPs. If a certain gene causes AMR, k-mer features of that gene will appear only in the resistant genomes, and if a certain SNP causes AMR, k-mers can capture the changed nucleotide pattern. For both of these tasks, however, the k-mer length must be long enough to distinguish between the genomes that have these patterns and those that do not. Using long k-mers is hard because the number of features increases exponentially with k (at least until a saturation point limited by the species' entire k-mer vocabulary). However, we have shown that for amino acid k-mers this increase in feature size is less severe than for nucleotide k-mers. Moreover, amino acid k-mers achieve better performance in terms of average ±1 two-fold dilution accuracy.
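As a minimal illustration (with a made-up protein sequence; "GDTAV" is a real 5-mer discussed in Appendix C), counting amino acid k-mers across a genome's gene products yields its feature vector:

```python
from collections import Counter

def count_aa_kmers(proteins, k=5):
    """Count amino acid k-mers (oligopeptides) across all gene products
    of one genome; the resulting counts form one feature vector."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy example: a resistance-gene product present only in resistant strains
# contributes k-mers that appear only in those genomes' feature vectors.
counts = count_aa_kmers(["MAKDLVGDTAV"], k=5)
```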

Comparison of Required Number of Features
In machine learning, an excessive number of features can increase the required memory and lead to over-fitting [121]. We have shown that longer k-mers reach better accuracies since they are more likely to capture a certain gene or SNP that causes AMR. However, as the k-mer length increases, the dataset becomes too large for the machine learning algorithm to handle. One advantage of amino acid k-mers over nucleotide k-mers is that they are a more compact representation of the biological information: each codon consists of three nucleotides and translates into one amino acid. Since the nucleotide alphabet has four letters, there are 4^3 = 64 distinct codons, but they translate into only 20 amino acids. This redundancy in the nucleotide alphabet relative to the amino acid alphabet can be exploited to decrease the feature size and obtain a more compact representation of the data when amino acid k-mers are employed instead of nucleotide k-mers. A nucleotide k-mer of length k is equivalent to an amino acid k-mer of length k/3. For the case of canonical nucleotide k-mers, the ratio of the maximum number of nucleotide k-mers to the maximum number of equivalent amino acid k-mers is 4^k/(2 × 20^(k/3)) for odd k and (4^k + 2^k)/(2 × 20^(k/3)) for even k (proof provided in Appendix B). This can be seen in Figure 3, where the number of amino acid 5-mers (equivalent to nucleotide 15-mers) is approximately equal to the number of nucleotide 11-mers. This is why Nguyen et al. had to break the total number of strains into different parts and train a separate model on each part of the data in order to train XGBoost with nucleotide 15-mers for S. enterica [24]. On the other hand, as we have shown here, amino acid 5-mers lead to a reasonable number of features. The number of features for the gene content method is usually low compared to k-mer counting methods with long k-mers (see Figure 3). For the SNP feature extraction method and the combined gene content + SNP features, the number of features is large for large datasets, such as S. enterica.
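The growth of this ratio can be checked numerically. The sketch below evaluates the closed forms derived in Appendix B (`feature_ratio` is our own illustrative helper, not code from the paper):

```python
def feature_ratio(k: int) -> float:
    """Ratio of the maximum number of canonical nucleotide k-mers to the
    maximum number of equivalent amino acid (k/3)-mers (see Appendix B)."""
    canonical_nuc = (4**k + 2**k) / 2 if k % 2 == 0 else 4**k / 2
    amino = 20 ** (k / 3)
    return canonical_nuc / amino

# The ratio grows roughly like 1.4736**k / 2: for k = 15 (equivalent to
# amino acid 5-mers) there are about 168 times as many nucleotide k-mers.
print(round(feature_ratio(15)))  # → 168
```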
The required resources for the dataset of one antibiotic for each species are provided in Appendix I.

Interpretability of the Model
When the AMR prediction model is trained without any a priori knowledge, it does not have any initial bias and learns the entire mechanism from the data. For such a model, a desired characteristic is interpretability. Here, we call a model "interpretable" if it has the following properties:
• Validation: mechanisms predicted to be important by an interpretable model can be compared to previously known AMR mechanisms, so that we know the model is working properly.
• Discovery: predictions of an interpretable model can be used to discover new AMR mechanisms that have not been described before.
Unlike nucleotide k-mers, amino acid k-mers are easy to interpret. We interpreted the models by analyzing the top features selected by our proposed feature selection pipeline; this interpretation was performed in Section 3.2. For the gene content method, the important features can be interpreted by looking up the genes selected by the pipeline in databases such as PATRIC [26], NCBI [53] or CARD [27]. For SNP features, the same process can be performed by finding the gene in which each SNP occurred, locating the position of the mutation, and interpreting the gene and the mutation. Thus, the gene content and SNP models are both easily interpretable, because the genes always align uniquely to the corresponding subject using a method like BLAST. For k-mer features, interpreting the model can be achieved by searching the important k-mers against databases such as NCBI [53] using BLAST [79] to see where the k-mers align. If k-mers do not align, or align with multiple genes with different functions, the model is not easily interpretable. In the case of nucleotide k-mers, although 10-mers and 11-mers achieve good accuracy in MIC prediction, in many cases the important 11-mers are not long enough to align to the right position when searched with tools like BLAST. This was analyzed in Section 3.1.4. For instance, Nguyen et al. noted that they increased the k-mer length to 15 nucleotides to make the features identifiable using BLAST, which increased the computational complexity of their analysis significantly [24]. On the other hand, amino acid 5-mers can be easily searched using BLAST, and they align unambiguously to the target genes. Thus, for amino acid k-mers, interpreting the top features using BLAST is easier than for nucleotide features.
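To illustrate the general idea of selecting a small number of important features for interpretation, the sketch below ranks features by a simple univariate correlation on synthetic data. This is a hypothetical illustration of the concept, not the actual feature selection pipeline proposed in this paper.

```python
import numpy as np

# Synthetic k-mer count matrix: 200 strains x 50 features; by construction,
# only feature 7 drives the (toy) log2 MIC target.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 50)).astype(float)
y = 2.0 * X[:, 7] + rng.normal(0.0, 0.1, size=200)

# Rank features by absolute Pearson correlation with the target and keep
# the top few; each surviving k-mer could then be searched with BLAST.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top = np.argsort(corr)[::-1][:5]
```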

Conclusions
In this paper, we proposed a new feature extraction method, amino acid k-mers, for predicting the MIC of an antibiotic for a bacterium from its genome, and compared this method to existing methods, namely nucleotide k-mers, gene content, SNP and gene content + SNP. We applied all of the methods to MIC data of C. jejuni, S. enterica, N. gonorrhoeae and K. pneumoniae, finding that amino acid k-mers provide comparable or superior performance across multiple criteria, in particular accuracy, interpretability and computational complexity.
Notably, amino acid and nucleotide k-mers have different applications. k-mer counting methods are robust in predicting AMR, particularly because they can capture the absence or presence of a unique gene or SNP. However, these methods are only effective when the k-mer length is long enough, which can make training computationally expensive because a long k-mer length leads to a large feature size. We have shown that amino acid k-mers are less prone to this problem than nucleotide k-mers. Moreover, amino acid k-mers are easier to interpret, because their top features are more likely to be uniquely identified using search algorithms such as BLAST. The main drawback of amino acid k-mers is that they require assembly of the genes. In a situation where assembly of the genes is not a problem, we recommend amino acid k-mers; an example is when researchers want to train a model to predict AMR in a dataset of sequenced genomes and learn more about AMR mechanisms by interpreting the model. On the other hand, nucleotide k-mers are an alternative option when assembling the genes is not possible and the goal is to predict AMR with model interpretation not being the top priority, or when enough resources are available and a large number of features is not an issue. An example of the former scenario is predicting AMR in a medical clinic. In such a scenario, we propose utilizing a feature selection pipeline in which the most important features are selected and used to build models with a small number of features. Such selected-feature models, in many cases, achieve better accuracy than a model that uses all features. The selected-feature model can be useful in clinical resistance testing applications, where resistance to an antibiotic must be tested in real time and at minimum cost.
Acknowledgments: This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. This work also used Proteus, which is Drexel University's main high-performance computer cluster. Proteus is a part of University Research Computing Facility, Drexel University.

Appendix B. Comparison of Number of Features for Nucleotide and Amino Acid k-mers
A nucleotide k-mer of length k is equivalent to an amino acid k-mer of length k/3. Since each nucleotide has 4 choices and each amino acid has 20 choices, if all nucleotide k-mers are counted, the ratio of the maximum number of nucleotide k-mers to the maximum number of amino acid k-mers will be

4^k / 20^(k/3) ≈ 1.4736^k.

Now we calculate the same ratio for the scenario where only canonical nucleotide k-mers are counted. For odd k, the maximum number of possible features is halved, because half of the k-mers are non-canonical. For even k, the number is more than half, because there are palindromic k-mers: k-mers that are equal to their own reverse complements. For example, the reverse complement of "AATT" is "AATT." If all possible k-mers are listed alphabetically, all of the k-mers in the first half of the list are canonical. In the second half, only palindromic k-mers are canonical, because the canonical counterparts of the rest have already been counted in the first half of the list. Hence, the total number of theoretically possible canonical k-mers for an even k is the number of k-mers in the first half of the list, which is 4^k/2, plus the number of palindromic k-mers in the second half. We now count the palindromic k-mers in the second half of the list. The first base must be G or T; otherwise the k-mer would not be in the second half. Moreover, if a palindromic k-mer is divided into two chunks of the same length, the second chunk must be the reverse complement of the first chunk, so the second chunk is a function of the first. To count all palindromic k-mers in the second half, we therefore count all possible combinations of the first chunk, which has k/2 bases: the first base has 2 options (G or T) and each of the remaining k/2 − 1 bases has 4 options.
Hence, the total number of palindromic k-mers in the second half of the list is

2 × 4^(k/2 − 1) = 2^(k−1).

Thus, the total number of canonical k-mers for an even k is

4^k/2 + 2^(k−1) = (4^k + 2^k)/2,

and in general, the total number of canonical k-mers is 4^k/2 for odd k and (4^k + 2^k)/2 for even k. Therefore, for odd k the ratio of the total number of canonical nucleotide k-mers to the total number of amino acid k-mers is (1/2) × 4^k/20^(k/3) ≈ (1/2) × 1.4736^k, and for even k the ratio is (1/2) × (4^k + 2^k)/20^(k/3) ≈ (1/2) × (1.4736^k + 0.7368^k), or, in a more compact form, (4^k + [k even] × 2^k)/(2 × 20^(k/3)), where [k even] equals 1 when k is even and 0 otherwise.
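The closed forms above can be verified by brute-force enumeration. The short check below (an illustrative sketch, not code from the paper) counts canonical k-mers directly:

```python
from itertools import product

def revcomp(kmer: str) -> str:
    """Watson-Crick reverse complement of a nucleotide k-mer."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(kmer))

def count_canonical(k: int) -> int:
    """Enumerate all 4**k k-mers and count the canonical ones
    (lexicographically <= their reverse complement)."""
    return sum(1 for p in product("ACGT", repeat=k)
               if (s := "".join(p)) <= revcomp(s))

# Agrees with the closed forms: 4**3 / 2 = 32 for k = 3 (odd) and
# (4**4 + 2**4) / 2 = 136 for k = 4 (even).
print(count_canonical(3), count_canonical(4))  # → 32 136
```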

Appendix C. Violin Plots of Error in Different Methods
To visualize the prediction error of the different methods, we collected the predicted MIC values versus the actual values of all predictions pooled across the 10 folds of cross-validation. For example, if the actual MIC value of one strain is 2 and the model predicts 3, this creates the data point (2, 3). We then grouped all of those pairs by the actual values and generated a violin plot of the distribution of the predictions for each target value. In these plots, the top and bottom horizontal lines represent the minimum and maximum predicted values, and the middle horizontal line represents the mean predicted value. The body of each violin is the kernel density estimate of the data. The ±1 two-fold dilution accuracy for that target value, and, in parentheses, the number of strains with that actual value (i.e., the target sample size) are labeled under each violin. The green line represents a 1-to-1 perfect prediction, the yellow line represents the first-order regression between the actual and predicted values, and the red lines represent the limits of ±1 two-fold dilution for the predictions of the model.
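The pooling and grouping step behind these plots can be sketched as follows (the log2-MIC pairs below are our own toy values, for illustration only):

```python
from collections import defaultdict

# (actual, predicted) log2-MIC pairs pooled across cross-validation folds.
pairs = [(2, 3), (2, 2), (2, 1), (2, 5), (3, 3), (3, 4)]  # toy data

# Group predictions by the actual value (one group per violin).
groups = defaultdict(list)
for actual, predicted in pairs:
    groups[actual].append(predicted)

# Per-target +/-1 two-fold dilution accuracy, as labeled under each violin.
per_target_dd1 = {a: sum(abs(p - a) <= 1 for p in ps) / len(ps)
                  for a, ps in groups.items()}
```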
Such plots are presented for predictions of tetracycline for C. jejuni in Figure A5. They show how the actual versus predicted MIC values vary from a perfect prediction, within or outside the bounds set by the two-fold dilution error. Using SNP features, the correlation between the actual values and the predictions is not as strong as in the other methods, and the slope of the yellow regression line is not close to 45 degrees.
In Figure A6, such plots are provided for C. jejuni and nalidixic acid. Here, SNP features perform better than the gene content method.
Figure A7. C. jejuni and nalidixic acid, amino acid 5-mers: violin plots of performance using only one 5-mer ("GDTAV"). The x-axis is the actual value and the y-axis is the predicted value. Each violin shows the kernel density estimate of the distribution of predictions for one target value. Below each violin, the ±1 two-fold dilution accuracy of that target value is given, and below that, the number of strains with that target value is given in parentheses. The green line represents the perfect prediction and the yellow line represents the first-order regression between the actual and predicted values (R² score: 0.87). The red lines represent the limits of ±1 two-fold dilution.

Table A5. Important features found for K. pneumoniae and tetracycline by the amino acid 5-mer and gene content methods.

Table A9. Important genes found for S. enterica and ampicillin by the amino acid 5-mer and gene content methods.

Appendix E. Performance Evaluation
In this section, the performances of the different methods are evaluated using the metrics described in Section 2.4: root mean square error (RMSE), ±1 two-fold dilution accuracy (DD1), ±2 two-fold dilution accuracy (DD2), major error rate (ME) and very major error rate (VME). For each metric, we report the mean and standard deviation across cross-validation folds, labeled CV, and the value for the held-out set, labeled H. For example, RMSE-CV means RMSE in cross-validation, reported as the mean and, in parentheses, the standard deviation across 10 folds; for the hold-out set, only one value is reported. Results for each species-antibiotic combination are presented in a separate table. In each column, the best-performing method is shown in boldface. Since there are currently no approved breakpoints for erythromycin and N. gonorrhoeae, the major error rate and very major error rate were not calculated for this dataset.
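As a minimal sketch of the dilution-accuracy metrics (assuming MIC values encoded on a log2 scale and the regression output rounded to the nearest dilution step, which is one possible convention rather than necessarily the paper's exact procedure):

```python
import numpy as np

def dilution_accuracy(actual_log2, predicted_log2, tolerance=1):
    """Fraction of predictions within +/- `tolerance` two-fold dilution
    steps of the actual MIC (tolerance=1 gives DD1, tolerance=2 gives DD2).
    Rounding the prediction to the nearest dilution step is an assumption
    of this sketch."""
    actual = np.asarray(actual_log2, dtype=float)
    predicted = np.round(np.asarray(predicted_log2, dtype=float))
    return float(np.mean(np.abs(predicted - actual) <= tolerance))

# Toy data: 3 of 4 predictions fall within one dilution step of the truth.
dd1 = dilution_accuracy([1, 2, 3, 4], [1.2, 2.9, 5.4, 4.0])  # → 0.75
```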

Appendix F. Canonical and Non-Canonical Nucleotide k-mers
Using lexicographical order, nucleotide k-mers can be divided into two groups: canonical or non-canonical. A k-mer is called canonical if its lexicographical order is smaller than or equal to the order of its Watson-Crick reverse complement [57]. For example, in the case of a 3-mer like "TTT," the lexicographical order of the Watson-Crick reverse complement, "AAA" is smaller, so "AAA" is canonical and "TTT" is non-canonical.
As mentioned in Section 2.2.1, two scenarios are possible for nucleotide k-mers: converting all of the non-canonical k-mers to their canonical counterparts and counting them after conversion, or counting every k-mer, canonical or otherwise [57]. We tested both scenarios for k-mers of length 8, 9, 10 and 11 and compared the results. When non-canonical k-mers are converted to canonical form, the accuracy of MIC prediction becomes significantly better than in the scenario in which all k-mers are counted. To investigate the reasons for this observation, we hypothesized that when all (canonical and non-canonical) k-mers are counted separately, the counts of non-canonical k-mers are correlated with the counts of their canonical counterparts; e.g., the frequency of an 8-mer like "TTTTTTTT" is correlated with the frequency of "AAAAAAAA." This hypothesis is based on the generalization of the second Chargaff rule, which states that "k-mer frequency counted on a single chromosomal strand equals the frequency of the reverse-complement k-mer," and which holds for many species, both prokaryotes and eukaryotes [122]. Having correlated features increases the feature-to-strain ratio without actually adding information, and worsens learning performance [123], particularly when the number of strains is low compared to the number of features, as in our data.
To test this hypothesis, we calculated the correlations between the frequencies of canonical and non-canonical features, with both frequencies counted separately (no conversion to canonical form was applied). For each k-mer length, in each genome, we calculated the Pearson correlation coefficient between 1000 randomly selected non-canonical features and their canonical counterparts. The distribution of the correlation coefficients across the genomes is depicted in Figure A12; it shows a strong correlation between the features. This correlation explains the poor performance of k-mer counting methods when both canonical and non-canonical k-mers are counted and no conversion is applied.

Figure A12. Distribution of the correlation coefficient between the frequencies of canonical and non-canonical features, when both frequencies are counted separately, across the genomes for different microbes and different k-mer lengths. In each box plot, the whiskers represent the maximum and minimum, the boxes represent the first and third quartiles, the orange line represents the median and the green line represents the mean.
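The intuition behind this check can be sketched on synthetic counts: if a genome obeys the generalized second Chargaff rule, each non-canonical k-mer's count tracks its canonical counterpart's count up to small strand-specific noise. The Poisson/Gaussian model below is a hypothetical illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated counts of 1000 canonical k-mers in one genome.
canonical = rng.poisson(100, size=1000).astype(float)
# Non-canonical counterparts: equal up to small strand-specific noise,
# as the generalized second Chargaff rule suggests.
non_canonical = canonical + rng.normal(0.0, 3.0, size=1000)

# A strong Pearson correlation means the non-canonical features add
# little information beyond their canonical counterparts.
r = np.corrcoef(canonical, non_canonical)[0, 1]
```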

Appendix I. Resources Used for Example Datasets
In this section, we report the resources requested for one example antibiotic for each species, to give the reader a sense of the computational complexity of the problem. The wall-time covers training and testing the model with all features, and training and testing the model with the selected features, which was done for different numbers of features, as described in the Methods section.