Detection of DGA-Generated Domain Names with TF-IDF

Abstract: Botnets often apply domain name generation algorithms (DGAs) to evade detection by generating large numbers of pseudo-random domain names, of which only a few are registered by cybercriminals. In this paper, we address how DGA-generated domain names can be detected by means of machine learning and deep learning. We first present an extensive literature review on recent prior work in which machine learning and deep learning have been applied for detecting DGA-generated domain names. We observe that a common methodology is still missing, and that the use of different datasets means that experimental results can hardly be compared. We next propose the use of TF-IDF to measure the frequencies of the most relevant n-grams in domain names, and use these as features in learning algorithms. We perform experiments with various machine-learning and deep-learning models using TF-IDF features, of which a deep MLP model yields the best results. For comparison, we also apply an LSTM model with an embedding layer to convert domain names from a sequence of characters into a vector representation. The performance of our LSTM and MLP models is rather similar, achieving AUC values of 0.994 and 0.995, and average F1-scores of 0.907 and 0.891, respectively.


Introduction
Botnets pose a severe threat to the security of systems connected to the Internet and their users. A botnet is composed of a collection of compromised systems ('bots') that receive and respond to commands from a Command and Control (C&C) server. A C&C server acts as rendezvous point between the bots and the botmaster, who controls the botnet. By updating the malware running on the bots, the botmaster can configure the botnet to perform different types of attacks, such as launching DDoS attacks, sending spam, or stealing credentials. This versatility is why botnets are considered the Swiss army knife of cybercriminals.
C&C servers and the communication channels between botmaster and bots are critical components of a botnet. By taking down the C&C servers, or by blocking the communication channels, the link between bots and botmaster is broken, which renders the botnet useless. Numerous techniques have been applied to provide stealthy botnet operation and to increase resilience against take-down attempts [1]. An eminent technique to evade detection is the application of domain name generation algorithms (DGAs) in bot malware that generate large numbers of pseudo-random domain names for contacting the C&C server, of which only a few are actually registered, and only briefly, by the botmaster. Due to the dynamic DGA operation and short-lived domain names, the communication between C&C servers and bots is protected against take-down attempts.
The presence of botnets that use DGAs can be revealed by analysing network traffic. For instance, most of the domain names that are generated by DGAs are not registered, and hence DNS lookups for resolving such domain names into IP addresses will result in NXDomain responses from name servers. Hence, by monitoring and analysing NXDomain responses, the presence of DGA-based botnets can be revealed. In addition, the domain names generated by DGAs typically differ from regular domain names. Regular domain names are usually meant to be interpretable by humans and hence they are often rather short and meaningful, while domain names generated by DGAs typically consist of random strings of letters and digits that humans cannot pronounce or interpret as meaningful. Hence, by analysing the syntax or semantics of domain names, the presence of DGA-based botnets can be revealed.
In recent years, numerous methods applying machine learning, and more recently also deep learning, have been explored for detecting DGA-based botnets. Machine-learning algorithms typically have been used to train models using sample data of network traffic in order to make predictions on whether the traffic contains traces originating from botnets.
The key principle is that the models are not programmed explicitly up front to detect botnets, but the models evolve during training by discovering patterns in the sample data. The sample data typically consist of features that are derived from captured network traffic, such as relevant fields in packet headers or payload data. In prior work on detection of DGA-generated domain names with machine learning, a large variety of structural, linguistic, and statistical features have been explored that are derived from domain names in DNS traffic. In deep learning, as an advanced form of machine learning, more complex models are trained that can discover higher-level patterns in the sample data. While machine learning requires sample data to be provided by means of selected features during training, deep learning is able to implicitly derive features at multiple levels from the sample data.
In this paper, we apply machine learning and deep learning for detecting DGA-based botnets, and use TF-IDF (term frequency-inverse document frequency) as a statistical method to derive features from domain names. TF-IDF originates from information retrieval and automated text analysis, where it is used as a weighting factor to evaluate how relevant a word is to a document in a collection of documents [2]. Terms that appear often but in a smaller number of documents have a higher TF-IDF score. We observed that the distributions of characters and n-grams vary considerably between regular domain names and domain names that are generated by different types of DGAs (as we will show in Section 4.1). Our hypothesis therefore is that TF-IDF scores of n-grams can be used as features for classifying domain names in learning algorithms.
Our contributions are as follows:

1. We provide an extensive literature review on recent prior work in which machine learning and deep learning have been applied for detecting DGA-based botnets.

2. We explore the usage of TF-IDF for feature selection. Although TF-IDF has been studied and applied extensively for decades in information retrieval, we are, to the best of our knowledge, the first to apply TF-IDF for detecting DGA-based botnets.

3. We provide experimental results using TF-IDF features with the most popular algorithms for machine learning (Decision Tree, Gradient Boosting, K-Neighbours, Logistic Regression, Multinomial Naive Bayes, Random Forest, and Support Vector Machine) and deep learning (Multi-Layer Perceptron).

4. We compare the results obtained by machine learning, in which TF-IDF scores of n-grams in domain names are used as features, with featureless deep learning (using a long short-term memory (LSTM) classifier) in which domain names are embedded as sequences of input characters.
In the remainder of this paper, we first provide more details about botnets and fluxing methods, DGAs, and TF-IDF in Section 2. In Section 3, we present an extensive literature review on recent prior work in which machine learning and deep learning have been applied for detecting DGA-based botnets. In Section 4, we present our research method, including a description of the datasets we applied in our experiments, the setup of our experiments, and the experimental results with discussion. We conclude the paper in Section 5.

Background
This section provides more details on fluxing methods as applied by botnets to evade detection in Section 2.1, on DGAs in Section 2.2, and on TF-IDF in Section 2.3.

Botnets and Fluxing
The bots in a botnet regularly contact their C&C server. This is the case during the rallying process, when a bot tries to contact its C&C server for the first time to announce its presence, and later on when the bot contacts the C&C server to upload data (such as stolen credentials) or to download malware updates. In order to do so, the bot should know either the IP address or the domain name of the C&C server.
The IP address can be hardcoded in the bot malware. This offers stealthy botnet operation since no DNS lookup is required. However, the IP address can easily be revealed by reverse engineering of the malware. Network administrators can subsequently blacklist the IP address in ACLs at gateways, or apply BGP route announcements to route the IP address to a blackhole where the traffic is dropped.
Alternatively, the domain name can be hardcoded in the bot malware. This is less stealthy since it requires a DNS lookup to resolve the domain name into an IP address. To evade detection, botnets can apply IP/fast flux by using dynamic DNS so that the domain name resolves to an IP address that changes frequently. These IP addresses refer to proxy bots that relay communication between bots and the C&C server. Bringing down the botnet now requires blacklisting or blackholing the IP addresses of all proxy bots. To further evade detection, botnets can apply double flux, where the concept of flux is also applied to the name server that is responsible for resolving the domain name. The name server, which is under control of the botmaster, will refer to frequently changing authoritative name servers, which in turn will resolve the domain name into frequently changing IP addresses of proxy bots. However, the domain name can be blacklisted, or, by applying DNS sinkholing, the domain name can be resolved into an IP address that is not under control of the botnet. DNS sinkholing allows, for instance, law enforcement agencies to take over the botnet.
Many botnets therefore do not rely on a single domain name, but apply domain flux by generating a large number of domain names of which only few actually are registered by the botmaster for a short time period. Domain flux renders botnet detection by static domain name blacklists or sinkholing ineffective.

DGA
Domain flux is implemented in bot malware by a DGA that dynamically generates a large number of pseudo-random domain names from a seed. The seed, which acts as a shared secret between botmaster and bots, can be either static or dynamic [3].
A static seed was, for instance, applied in early versions of the Kraken botnet, and therefore the same set of domain names is generated at each execution [4]. Early versions of the Torpig botnet applied a deterministic seed that is derived from the current date. Since the domain names derived from such deterministic seeds can be precomputed easily, and botmasters do not register all future domain names in advance, a botnet can be taken over. For instance, a research team was able to preregister some domain names and take over the Torpig botnet for 10 days in 2009 [5]. The Conficker botnet also applied a time-dependent seed based on GMT that is derived from the response of querying a public website [6]. The Conficker.C botnet applied domain flux by generating 50,000 domain names of which bots daily tried up to 500 for contacting the C&C server to receive updates. If the botmaster registered one of these 50,000 domain names, bots had a 1% probability per day to contact the C&C server, and hence would contact the C&C server once every 100 days on average. While the botmaster had to register only one or a few domain names, law enforcement would have to preregister 50,000 domain names to block the C&C communication.
Domain names derived from dynamic seeds rely on non-deterministic sources. For instance, the Bedep DGA applied a seed that relates to foreign exchange reference rates published daily by the European Central Bank, while the seed in later versions of the Torpig DGA related to trending topics on Twitter [7]. Since domain names from non-deterministic seeds cannot be precomputed in advance, blacklisting and sinkholing or preregistering large numbers of short-lived domain names by law enforcement agencies is a challenging, time-critical task that requires continuous effort. However, botmasters likewise have only a small time window and must switch continuously to new domain names.
Next to classifying DGAs based on their seeding characteristics, Plohmann et al. classified DGAs into 4 types based on how domain names are constructed [7]. Arithmetic-based DGAs (type DGA-A) are most common. They construct domain names by generating sequences of values that have either an ASCII representation directly or index hardcoded arrays that constitute the DGA alphabet. Hash-based DGAs (type DGA-H) construct domain names from hashing algorithms such as MD5 and SHA256. Wordlist-based DGAs (type DGA-W) construct domain names by concatenating sequences of words from dictionaries that are embedded in the malware or obtained from a publicly accessible source. Permutation-based DGAs (type DGA-P) construct domain names through permutation of an initial domain name.
Domain names generated by DGAs of type DGA-A typically consist of random sequences of characters (letters and digits). Domain names generated by DGAs of type DGA-H represent a hexadecimal number and consist of digits and the letters A-F. Domain names generated by DGAs of type DGA-W are less random and more pronounceable, which makes them harder to distinguish from regular domain names. Likewise, domain names generated by DGAs of type DGA-P, which are derived by permutation of regular domain names, look similar to regular domain names.
RFC 1035 initially specified the preferred syntax of domain names as a sequence of labels separated by dots [8]. The right-most label conveys the top-level domain. Each label is a sequence of at most 63 characters containing letters (A-Z, a-z), digits (0-9), or the hyphen symbol (-), with the restriction that a label starts with a letter and ends with a letter or digit. Although uppercase and lowercase letters are allowed, no significance is attached to the case. The length of a domain name is at most 255 characters. In later specifications, this has been relaxed to allow labels that contain the underscore symbol (_), leading or trailing hyphens, other ASCII characters (such as the symbols # and $), and even Unicode characters in internationalized domain names [9].
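As an illustration, the original preferred syntax can be checked with a short regular expression (a sketch covering only the RFC 1035 rules above, deliberately not the later relaxations):

```python
import re

# RFC 1035 preferred syntax: a label starts with a letter, ends with a
# letter or digit, may contain hyphens in between, and is at most 63 chars.
LABEL_RE = re.compile(r"^[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

def is_preferred_syntax(domain: str) -> bool:
    """Check a domain name against the original RFC 1035 preferred syntax."""
    if len(domain) > 255:
        return False
    labels = domain.split(".")
    return all(LABEL_RE.fullmatch(label) for label in labels)
```

For example, `is_preferred_syntax("ex-ample.com")` holds, while a label starting with a digit or containing an underscore is rejected under the original rules.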

TF-IDF
TF-IDF originates from information retrieval and automated text analysis, where it is used as a weighting factor to evaluate how relevant a term is to a document in a collection of documents [2]. For instance, TF-IDF is the most popular weighting scheme in recommender systems for research papers that apply content-based filtering [10].
TF-IDF is computed by multiplying term frequency (TF) [11] and inverse document frequency (IDF) [12], where TF indicates how often a term appears in a document, and IDF discounts terms that appear in many documents of the corpus.
The simplest way to compute TF is the raw count of appearances of term t_i in document d_j, where t_i is in the set of terms T = {t_1, ..., t_K} in the corpus of documents D = {d_1, ..., d_N}. This can be normalized by considering, for instance, the length of the document or the most frequent term in the document. IDF adjusts for the general appearance of terms across documents and is usually defined as log(N/n_i), where N is the number of documents in the corpus and n_i is the number of documents in which term t_i occurs. IDF is close to 0 when the term appears in many documents, and increases when the term appears in fewer documents. Hence, TF-IDF discriminates key terms that appear often but in a smaller number of documents.
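The computation above can be sketched in a few lines of Python, using raw counts for TF and log(N/n_i) for IDF (an illustrative variant; other normalizations exist):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Raw-count TF times log(N/n_i) IDF for each term in each document."""
    N = len(corpus)
    df = Counter()                 # n_i: number of documents containing t_i
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)          # raw count of each term in the document
        scores.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return scores

# Documents as lists of terms; for domain names, the terms could be n-grams.
docs = [["goo", "oog", "ogl"], ["goo", "xqz"], ["xqz", "qzv"]]
scores = tf_idf(docs)
```

A term occurring in every document gets a score of 0, since log(N/N) = 0, while a term unique to one document gets the full weight log(N).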

Literature Review
We conducted an extensive literature review on recent prior work in which machine learning (ML) and deep learning (DL) have been applied for detecting DGA-based botnets. Zago et al. [13] previously published a literature review on DGA-based botnet detection that covered literature up to May 2018. We extend their literature review by covering 38 additional scientific papers that were published afterwards from May 2018 to February 2021.
Zago et al. built a taxonomy of approaches for botnet detection considering the applied learning approach (supervised, unsupervised, or semi-supervised) and the type of features adopted (context-aware, context-free, or featureless). Context-aware features are dependent on a specific malware sample execution, such as features extracted from DNS responses that consider timing, origin, or any other environment configuration. Context-free features are related only to domain names, considering structural, statistical or linguistic properties of a domain name. Featureless models, as typically applied in DL, do not require features and use encoded domain names as inputs.
We adopt a slightly different taxonomy by considering the learning method, which is either feature-based ML (see Table 1), featureless DL (see Table 2), or other (see Table 3). Nearly all studies included in our review applied supervised learning algorithms, using either classical ML models such as Decision Tree (DT), Random Forest (RF), k-Nearest Neighbour (kNN), Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), Gradient Boost (GB), and Multi-Layer Perceptron (MLP), or novel DL models such as Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN).
The following subsections provide more details on the ML models in Section 3.1, on the DL models in Section 3.2, on other methods in Section 3.3, and on the datasets used in the reviewed studies in Section 3.4.

Table 3. DGA-based botnet detection using other approaches.

ML Models
The studies in our literature review apply a range of ML methods, as shown in Table 1. RF and SVM are applied most often. For each study, the best performing model is shown in bold. It can be observed that there is no single best performing method overall, but RF and MLP (with a single hidden layer) give the best results in most cases. Due to differences in the applied hyperparameters, features, datasets of benign and malicious domain names, and evaluation metrics, it is hardly feasible to compare the experimental results obtained in these studies.

Context-Free Features
Zago et al. [13] focused in their literature review on context-free features. They identified 74 features of 32 types, although we found that their overview is somewhat inaccurate. Most of the studies cited by Zago et al. use string metrics as features, in which a domain name is considered as a string of characters or words. Most frequently used are string length (in 69% of the cited works) and entropy (46%). Some features relate to linguistics, such as the pronounceability score (13%) and normality score (13%), while other features capture more complex structural aspects of domain names, such as the Jaccard Index measure (17%) and the Kullback-Leibler divergence (8%). Zago et al. identified features related to length, ratio and sequence of digits, and frequencies and pronounceability of n-grams as most relevant. They observed that arbitrary combinations of features are used in most studies, often with different names or definitions, and hence a common ground for features is still missing.
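For reference, the entropy feature mentioned above is typically the Shannon entropy of the character distribution of a domain name; a minimal sketch:

```python
import math
from collections import Counter

def char_entropy(domain: str) -> float:
    """Shannon entropy (bits per character) of a domain name string."""
    counts = Counter(domain)
    n = len(domain)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# DGA-generated names tend to score higher than human-readable ones,
# e.g. char_entropy("google") < char_entropy("xj4kqz9w").
```

The entropy is 0 for a string of one repeated character and reaches its maximum when all characters are distinct.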
As a follow-up of the literature review by Zago et al., our literature review covered 22 more recent studies that applied ML-based methods (see Table 1). Nearly all studies (19 out of 22) use context-free features that relate to domain names.
As shown in Table 4, we identified 97 context-free feature types that we categorised as being related to the domain name, subdomains, character-level information, linguistics, and n-grams. Compared to the 32 feature types that were identified by Zago et al. [13] up to early 2018, researchers explored 65 additional feature types during the following two years. We observe, however, that arbitrary combinations of features have been used. Domain name length, number of subdomains, alphabet cardinality, entropy, and the ratios of digits and vowels are used most frequently. A limited number of studies also used features derived from statistics of n-grams; the n-gram related feature types in Table 4 include n-gram distance [29], n-skip-gram distribution [24], number of most frequently used n-grams [26,33], number of masked n-grams [26], ratio of 4-grams without vowel [16], ratios of n-grams from benign and from malicious domains [33], n-gram entropy, covariance, pronounceability score, normality score, transition probability, probability of appearance, and index probability [13], n-gram Kullback-Leibler divergence [13,34], n-gram Jaccard Index measure [13,34], and n-gram distance-threshold, distance-avg. frequency, and distance-avg. count [13]. In addition to n-grams, Selvi et al. [21] also used masked n-grams, in which every character is substituted by a symbol representing the character type (consonant, vowel, digit, other). Alaeiyan et al. [24] also consider the distribution of n-skip-grams, in which n centre characters are removed in a sequence of adjacent characters (for n = 1, 2).
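The masking described above can be sketched as follows (the one-letter type symbols are our illustrative choice, not necessarily those used by Selvi et al.):

```python
def mask(domain: str) -> str:
    """Substitute each character by a symbol for its type:
    c = consonant, v = vowel, d = digit, o = other."""
    out = []
    for ch in domain.lower():
        if ch in "aeiou":
            out.append("v")
        elif ch.isalpha():
            out.append("c")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append("o")
    return "".join(out)

def masked_ngrams(domain: str, n: int):
    """Masked n-grams obtained by sliding a window over the masked string."""
    m = mask(domain)
    return [m[i:i + n] for i in range(len(m) - n + 1)]
```

For example, `mask("ab3-x")` yields `"vcdoc"`, and the masked 2-grams of `"abc"` are `["vc", "cc"]`.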
Yang et al. [22] and Patsakis et al. [35] both focus on wordlist-based DGAs and use a large number of features that try to distinguish wordlist-based malicious and benign domain names (excluded from Table 4). Yang et al. use 24 features based on word frequency, part-of-speech frequency, inter-word correlation, and inter-domain correlation. Patsakis et al. use 32 features that consider alphanumeric sequences, statistical and lexical characteristics, and entropy.
Hwang et al. [27] used 10 context-free features and in addition they extracted 100 features using a TextCNN. The TextCNN takes as input a 70 × 100 matrix for each domain name, constructed by taking 100 characters from the domain name (using truncation for longer domain names and padding for shorter domain names) and one-hot encoding with a dictionary of 70 characters. The TextCNN is composed of two convolutional and max pooling layers with ReLU activation function, three dense layers with ReLU activation function, and dropout.
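The input encoding for such a character-level CNN can be sketched as follows (a smaller hypothetical alphabet stands in for the 70-character dictionary, which is not specified here):

```python
# Hypothetical character dictionary; Hwang et al. use a dictionary of
# 70 characters, which is not fully specified here.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-_."
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(domain: str, width: int = 100):
    """Encode a domain name as a len(ALPHABET) x width one-hot matrix:
    column j holds the one-hot encoding of the j-th character; longer
    names are truncated and shorter ones are zero-padded."""
    matrix = [[0] * width for _ in range(len(ALPHABET))]
    for pos, ch in enumerate(domain[:width]):
        row = INDEX.get(ch)
        if row is not None:
            matrix[row][pos] = 1
    return matrix
```

Each column then contains at most a single 1, and padding columns are all zeros.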

Context-Aware Features
Of the 22 ML-based studies in our literature review, five studies used a combination of context-free and context-aware features [18,19,28,30,31], and two studies used context-aware features only [14,20], as indicated in Table 1.
Chiba et al. [14] used 55 context-aware features: 20 features reflect how and when a domain name is included in evolving lists of popular and malicious domain names in a certain time window; 18 features consider information from BGP prefixes, ASN, and IP address registration corresponding to the related IP addresses of a domain name; eight features consider relations between domain names of which the IP addresses are in the same ASN (so-called rDomains). They also use nine features that relate to domain names in rDomains. He et al. [18] used 153 context-aware features: 25 features derived from five feature types that consider DNS information; 128 features obtained from graph embedding to estimate the likelihood of a specific sequence of connected nodes, using a domain relationship graph in which domain names are connected if they are mapped to the same IP addresses, with the Jaccard coefficient as edge weight. They also use 41 context-free features, which are the same as used by Schüppen et al. [15]. Li et al. [19] focused on the Rustock botnet that applies fast-flux and DGA. They used 31 context-aware features of eight feature types that relate to DNS, and one context-free feature (the ratio of the number of characters in the longest successive string of letters or digits and the total length of the domain name). Liang et al. [28] used five context-aware features that relate to DNS and BGP, and five context-free features. Palaniappan et al. [30] used 13 context-aware features: six DNS-based features and seven web-based features that relate to the web site for which the domain name provides the URL. They also used four context-free features. Sivaguru et al. [31] used nine DNS-based features.
There are two studies, by Schüppen et al. [15] and Liu et al. [20], that focus completely on monitoring of non-existent domain (NXDomain) responses in DNS traffic. Schüppen et al. [15] apply ML for classifying NXDomain responses as originating from benign or malicious sources. Benign NXDomains can originate from typing errors due to users that misspell existing domain names, misconfigurations due to systems that erroneously try to resolve domain names that do not exist (anymore), or misuse due to non-intended DNS usage, such as probing to detect DNS hijacking attempts or anti-virus software performing signature checks. They apply 21 context-free feature types related to domain names. Liu et al. [20] apply filtering to remove benign NXDomain responses using a whitelist, and clustering to group malicious domain names from the same DGA considering the DNS behaviour of hosts. Next, they apply statistical analysis on the clusters considering the distributions of the DNS querying time, count, and domains, from which 18 context-aware features are derived.

CNN Models
Xu et al. [41] apply a CNN-based method called n-CBDC (n-gram Character-Based Domain Classification). A sliding window is applied to obtain a sequence of length l of n-grams from a domain name. The sequence of n-grams is represented in an n × l matrix with one-hot encoding. A CNN layer is used for feature extraction by stacking multiple convolution kernels of different sizes using an inception-like structure. The CNN output is fed into a fully-connected classification network consisting of three layers with dropout. The output is derived from a sigmoid function.

RNN Models
RNN models for detecting DGAs have been applied in several studies. RNN models in general are composed of an embedding layer to transform a domain name into a vector representation, an LSTM layer for implicit feature extraction, and a dense output layer. Dropout is usually applied to prevent overfitting.
Woodbridge et al. [36] and Akarsh et al. [43] apply the Keras embedding layer that learns a 128-dimensional vector representation for each character in the set of valid domain characters. The output of the embedding layer is fed into an LSTM layer with 128 LSTM units for implicit feature extraction. A dropout layer is added to prevent overfitting during training. The final dense layer applies logistic regression and the output is derived from a sigmoid (for binary classification) or softmax (for multi-class classification) activation function.
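What such an embedding layer computes can be illustrated without a DL framework: each valid character maps to an index, and each index to a 128-dimensional vector (randomly initialized in this sketch; in the actual models these vectors are learned jointly with the LSTM during training, and the character set shown is hypothetical):

```python
import random

VALID_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-_."
CHAR2IDX = {ch: i + 1 for i, ch in enumerate(VALID_CHARS)}  # 0 = padding

EMBED_DIM = 128
random.seed(0)
# One vector per character index; trained with the network in practice.
embedding_table = [[random.gauss(0.0, 0.05) for _ in range(EMBED_DIM)]
                   for _ in range(len(VALID_CHARS) + 1)]

def embed(domain: str, maxlen: int = 64):
    """Map a domain name onto a maxlen x EMBED_DIM matrix of vectors,
    truncating long names and padding short ones with index 0."""
    idx = [CHAR2IDX.get(ch, 0) for ch in domain.lower()[:maxlen]]
    idx += [0] * (maxlen - len(idx))
    return [embedding_table[i] for i in idx]
```

The resulting matrix is what the LSTM layer consumes, one 128-dimensional vector per input character.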
Lison and Mavroeidis [37] apply one-hot input representation, a layer with 512 GRU or LSTM units, a dense output layer that takes a linear combination, and a sigmoid activation function to generate the output.
Koh and Rhodes [38] apply the pretrained ELMo (Embeddings from Language Models) word embedding layer, a fully-connected layer with 128 rectified linear units (ReLUs), and a logistic regression output layer.
Qiao et al. [44] apply an input layer where a domain name (using a fixed length of 54 characters with padding or truncation) is converted into a matrix of dimension 54 × 128 using Word2Vec's CBOW model. An LSTM layer with an attention mechanism is used that gives different attention to different parts of the input domain name, followed by a fully connected layer, dropout, and a softmax classification function.
Vij et al. [47] use an embedding layer where each character is mapped onto a vector with 128 dimensions using a lookup. An LSTM layer with 128 units is added and dropout is used for preventing overfitting.
Yilmaz et al. [50] use an embedding layer where characters are encoded by their ASCII representation. An LSTM layer with two hidden layers is used with dropout to avoid overfitting.
Tran et al. [39] use an embedding layer that projects a padded sequence of input characters of length l to a sequence of vectors with dimension 128 × l. LSTM.MI is used, where the original LSTM is adapted to be cost-sensitive for dealing with multi-class imbalance. Sivaguru et al. [31] use a hybrid model where the output of the LSTM.MI model by Tran et al. is used, together with context-free and context-aware features, as input for a B-RF classifier that consists of 100 trees, where each tree is trained using a subset of the feature space.

Hybrid CNN-RNN Models
Several studies have applied hybrid models in which CNN and RNN models are combined.
Vinayakumar et al. [40] use a hybrid CNN-LSTM model. When compared to RNN, LSTM, GRU, I-RNN, and CNN models, the best results are obtained with the LSTM and hybrid CNN-LSTM models, achieving over 0.99 accuracy. The accuracy with ML methods (Adaboost, DT, LR, ME, NB, RF) and hand-crafted features is below 0.96.
Yu et al. [42] compared LSTM, BiLSTM, stacked CNN, parallel CNN, and hybrid CNN-LSTM models. All these models performed equally well and obtained over 0.98 accuracy. For comparison, the accuracy of ML methods (RF and MLP) with lexical features is below 0.92.
Liu et al. [45] use a hybrid RCNN-SPP model that combines a bi-directional LSTM network, a CNN, and spatial pyramid pooling.
Highnam et al. [48] use a hybrid CNN-LSTM-ANN model. The output of the embedding layer is passed to separate LSTM and CNN models in parallel. The features extracted by the LSTM and CNN models are sent to a single layer ANN, which is then flattened to produce the output.
Ren et al. [46] compare CNN, LSTM, CNN-BiLSTM, ATT-CNN-BiLSTM, and ML (SVM) models. Best results are obtained with the ATT-CNN-BiLSTM model that is composed of an embedding layer, a CNN layer to extract local parallel features, a BiLSTM layer to extract features that depend on neighbouring characters or on characters that are wider apart, an attention layer, dropout, and an output layer.
Namgung et al. [49] compare CNN, LSTM, BiLSTM, and hybrid CNN-BiLSTM models. In the hybrid model, the output of the embedding layer is sent to a CNN and a BiLSTM with attention in parallel, which subsequently feed into a fully-connected output layer using ReLU and dropout.
Cucchiarelli et al. [34] compare LSTM.MI from Tran et al. [39], BiLSTM from Mac et al. [56], hybrid ATT-CNN-BiLSTM from Ren et al. [46], and ML (MLP, RF, SVM) models. The best accuracy is obtained by an MLP with a single hidden layer of 128 units. Although previous studies [40,46] showed that DL methods outperform basic ML methods, Cucchiarelli et al. [34] show that ML methods with careful feature selection and classifier tuning can still outperform DL methods.

Other Methods
Of the other methods, that are not based on ML or DL, Wang et al. [51] and Yin et al. [55] focused on NXDomain responses. Wang et al. first filter 'normal' NXDomain responses, next cluster hosts that seem compromised by the same DGA-based malware, and finally identify compromised hosts using a supervised statistical algorithm based on query time and query count distributions. Yin et al. implemented client-side detection by using Threshold Random Walk for sequential hypothesis testing that relies solely on benign domains.
Satoh et al. [52] filter benign domain names using whitelists, select the longest subdomain and split it into words using dictionaries, and estimate the randomness of character strings. To compensate for deficiencies of the dictionaries, they also estimate the randomness of a subdomain by referring to web search results.
Sun et al. [53] applied a graph convolutional network method considering the character distribution of domain names, resources aggregation of attackers, and the query behaviour of clients. The DNS context is modelled as a Heterogeneous Information Network (HIN) of clients, domains, IP addresses, and different types of relations among them. Meta-paths are elaborately extracted to help uncover higher-level semantics hiding in the HIN. A graph convolutional network (GCN) is used that applies an attention mechanism to adaptively learn the meta-paths.
Yan et al. [54] applied graph analysis in a semi-supervised learning scheme. They first extract three types of feature vectors: vectors that represent visiting patterns of domain names in traffic during a fixed time frame are extracted using a CNN-based auto-encoder; vectors that represent the visiting order of domain names are extracted using an embedding scheme where a series of domain names is considered as a series of words in NLP; vectors that represent lexical features of domain names are extracted using an LSTM. These three vectors are combined into a comprehensive feature vector for each domain name. Graph analysis algorithms are used next to group domain names from the same DGA family. By considering thresholds for the number of domain names visited, the most visited domain name, the dispersion of the length of visited domain names, and the dispersion of time intervals in which domain names are visited, it is determined whether a host is infected.

Datasets
The 42 studies in our literature review, as listed in Tables 1-3, used a variety of datasets with benign and malicious domain names. DGArchive [7] originally contained lists of domain names generated by 43 DGA families, and has been extended later on.

Research Method
We applied the method outlined in Figure 1. We trained and tested different multi-class classification models, using both ML and DL, for which we applied TF-IDF features. We also trained and tested an LSTM model without TF-IDF that contains an embedding layer to convert domain names into vector representations. We consider such an LSTM model state-of-the-art; hence, by comparing the results obtained with the LSTM model and the TF-IDF-based models, we can evaluate the effectiveness of the TF-IDF-based models. In the following subsections we provide details on the datasets (Section 4.1), the usage of TF-IDF (Section 4.2), the ML and DL models (Section 4.3), the evaluation metrics (Section 4.4), and the experimental results (Section 4.5), followed by a discussion of these results (Section 4.6).

Datasets
We obtained datasets with benign domain names and malicious domain names as generated by DGAs from public sources.
We derived our dataset with benign domain names from the TRANCO list of the top one million most popular domains on the web https://tranco-list.eu. The TRANCO list is based on available rankings from Alexa, Cisco Umbrella, Majestic, and Quantcast, but improves upon each of these rankings by addressing agreement on the set of popular domains, stability over time by averaging the rankings over the past 30 days, popularity and availability of the listed websites, and lack of malicious domains [57]. We performed the following operations: We removed domain names that were also in the list of malicious domain names.
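This cleaning step amounts to a set difference over the two lists. A minimal sketch, using illustrative, made-up domain names rather than the actual datasets, could look as follows:

```python
def clean_benign(benign, malicious):
    """Drop benign domains that also occur in the malicious (DGA) list,
    preserving the popularity order of the benign list."""
    blocked = set(malicious)  # a set gives O(1) membership tests
    return [d for d in benign if d not in blocked]

# Illustrative example lists (not taken from the actual datasets):
benign = ["google.com", "example.org", "suppobox-word.net", "wikipedia.org"]
malicious = ["suppobox-word.net", "xjkqpzvw.biz"]

cleaned = clean_benign(benign, malicious)
# cleaned == ["google.com", "example.org", "wikipedia.org"]
```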
Our final dataset contains 583,954 benign domain names. The length of these domain names varies from 4 to 67 characters.
We derived our dataset with malicious domain names from DGArchive, a free service offered by Fraunhofer FKIE [7]. We downloaded the DGArchive dataset at the end of 2018, which contained 110,497,746 malicious domain names from 87 DGA families. We performed the following operations on this dataset:  In total, our dataset contains 1,076,754 domain names. For our experiments, we divided the dataset into two disjoint subsets: 70% is included in the training dataset for training the models (containing 753,727 domain names), and 30% is included in the test dataset for evaluating the trained models (containing 323,027 domain names).
The distribution of the average frequency of unigram occurrences in malicious domain names per DGA family, as well as in benign domain names from the topsites, is shown in Figure 2. We derived these distributions from the domain names in the training dataset as follows: Since domain names are case insensitive, we first lowercased all domain names. The resulting domain names contain characters from a dictionary of 39 characters {a, ..., z, 0, ..., 9, -, ., _} (i.e., letters, digits, hyphen, dot, and underscore). Next, we computed the relative frequency of each character, which is the number of occurrences of the character divided by the length of the domain name. Finally, we computed the average frequency of each character for each category of domain names (which is either a DGA family or topsites). Of the DGA-A families (banjori to xxhex), some have a rather uniform distribution (e.g., chinad and gameover), but in most distributions no digits are present. Since the second-level domain names of DGA-H families (bamital to wd) are hexadecimal numbers, the distributions mainly contain the characters {a, ..., f, 0, ..., 9}; other characters are rare and occur in the top-level domain name. Since the domain names of DGA-W families (gozi to suppobox) contain words, the distributions are similar to the distribution of the domain names of topsites. However, this also holds for some of the distributions of DGA-A families (e.g., banjori and pykspa2).
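The per-character statistics underlying Figure 2 can be computed in a few lines. The sketch below is our own illustration (the example names are made up, not taken from the datasets): it derives the relative character frequencies of a single domain name and the average frequencies over a category of domain names.

```python
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"  # the 39-character dictionary

def char_frequencies(domain):
    """Relative frequency of each character in a single domain name:
    occurrence count divided by the length of the (lowercased) name."""
    domain = domain.lower()
    return {c: domain.count(c) / len(domain) for c in ALPHABET}

def average_frequencies(domains):
    """Average per-character relative frequency over one category
    (a DGA family or the topsites)."""
    totals = defaultdict(float)
    for d in domains:
        for c, f in char_frequencies(d).items():
            totals[c] += f
    return {c: totals[c] / len(domains) for c in ALPHABET}

# Hexadecimal-looking (DGA-H style) example names only use {a..f, 0..9}
# in the second-level domain, plus characters from the TLD:
hex_like = ["a1b2c3d4.com", "deadbeef.net"]
avg = average_frequencies(hex_like)
```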
The differences between these distributions show that they can serve as a basis for multi-class or binary classification of domain names. Hence, classifiers that apply features relying on n-gram distributions (such as TF-IDF, as we propose in this paper) look promising.

TF-IDF
We use TF-IDF (as explained in Section 2.3) as the single feature type in various classifiers. In general, TF-IDF yields a weighting factor for each term which indicates how relevant the term is to a document in a collection of documents. We apply TF-IDF to obtain weighting factors to evaluate how relevant n-grams are to a domain name in a set of domain names.
We applied the code as shown in Listing 1 to derive the TF-IDF for the top 5000 n-grams in the training dataset, using the class TfidfVectorizer from scikit-learn. We first determine the top 5000 n-grams (for n ∈ {1, 2, 3}) that occur most often in the training dataset, and learn the IDF for each of these n-grams (i.e., the inverse frequency of each n-gram over the domain names in the training dataset). The training dataset is transformed from a set of domain names into a set of vectors of dimension 5000 representing the TF-IDF of the top 5000 n-grams. Next, the TF-IDF is determined for these 5000 n-grams in the test dataset, so that the test dataset is likewise transformed into a set of vectors of dimension 5000, using the vocabulary and IDF of n-grams derived from the training dataset. The TF-IDF for n-gram x in domain name y is computed by Equation (1), TF-IDF_{x,y} = TF_{x,y} · log(N / DF_x), where TF_{x,y} denotes the frequency of x in y, DF_x the number of domain names containing x, and N the total number of domain names.
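Listing 1 uses TfidfVectorizer from scikit-learn, which by default applies IDF smoothing and L2 normalisation on top of the basic scheme. To make the computation of Equation (1) concrete, the sketch below implements the plain, textbook TF-IDF for character n-grams in pure Python; the corpus is a made-up example, not part of our datasets.

```python
import math
from collections import Counter

def char_ngrams(domain, ns=(1, 2, 3)):
    """All character n-grams (n = 1, 2, 3) of a lowercased domain name."""
    domain = domain.lower()
    return [domain[i:i + n] for n in ns for i in range(len(domain) - n + 1)]

def tf_idf(ngram, domain, corpus):
    """Equation (1): TF-IDF_{x,y} = TF_{x,y} * log(N / DF_x)."""
    tf = Counter(char_ngrams(domain))[ngram]           # frequency of x in y
    df = sum(ngram in char_ngrams(d) for d in corpus)  # names containing x
    n = len(corpus)                                    # total number of names
    return tf * math.log(n / df) if df else 0.0

# Illustrative corpus: "goo" occurs in 2 of the 3 domain names.
corpus = ["google.com", "aqxkzw.biz", "google.net"]
score = tf_idf("goo", "google.com", corpus)  # 1 * log(3/2)
```

Rare n-grams thus receive a higher IDF weight than n-grams that occur in nearly every domain name, which is what makes the resulting vectors discriminative between DGA families.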

ML and DL Models
As indicated in our literature review in Section 3.1 (see Table 1), different ML models have been applied in prior studies for DGA-based botnet detection, including Decision Trees, K-Nearest Neighbours, Logistic Regression, Naive Bayes, Neural Networks, Support Vector Machine, and ensemble approaches based on boosting (such as AdaBoost, BT, C5.0, GB, GBM, GBT, XGBoost) or bagging (such as CART, ET, RF). RF and SVM are applied most often. In line with these prior studies, we considered seven ML models, all with TF-IDF feature vectors: Decision Tree (DT), Gradient Boosting (GB), K-Neighbours (KN), Logistic Regression (LR), Multinomial Naive Bayes (MNB), Random Forest (RF), and Support Vector Machine (SVM). All models are multi-class classifiers with 58 outputs (corresponding to the 57 DGA families and topsites).
We applied TensorFlow with Keras and Scikit-learn in our experiments. We tuned the hyperparameters for each ML model using RandomizedSearchCV on the training set with 2-fold cross-validation and 20 iterations. The hyperparameters involved are shown in Table 6. In each iteration, a different hyperparameter setting was tried. The hyperparameter settings that yielded the best classification results (in terms of the mean F1-score) are shown in bold. We also considered two DL models: a Multi-Layer Perceptron (MLP) with TF-IDF feature vectors, and a Long Short-Term Memory (LSTM) network with embedding (without TF-IDF). For training the DL models, we used 'Adam' as optimizer, 'categorical_crossentropy' as loss function, and 'accuracy' as metric. The 'validation_split' is set to 0.3, which means that 70% of the training set is used for training and 30% for validation. We trained the DL models for 20 epochs with a batch size of 512. We used an early stopping mechanism, configured by EarlyStopping(monitor = 'val_categorical_crossentropy', patience = 5), which means that training is stopped after five epochs without improvement.
We tried MLP models with two to five dense layers, with and without dropout. All models are multi-class with 58 outputs (corresponding to the 57 DGA families and topsites). The best results (in terms of mean F1-score) were obtained using the model shown in Table 7. For comparison, we also applied an LSTM model without TF-IDF. We used an embedding layer to convert a domain name into a vector representation. We derived a dictionary of 39 characters and assigned a unique id to each character. We lowercased and tokenized each domain name into a sequence of characters, which we next transformed into a numeric vector by assigning the character ids from the dictionary. We used vectors with a fixed length of 67. For domain names containing fewer than 67 characters, we padded the corresponding vector with 0 values. For domain names containing more than 67 characters, we discarded the extra characters. Like the MLP models, our LSTM model is multi-class with 58 outputs (corresponding to the 57 DGA families and topsites). Details of the model are shown in Table 8.
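The character-to-id encoding for the LSTM input can be sketched as follows. Note that the specific id assignment below is our assumption for illustration (ids start at 1 so that 0 remains reserved for padding); the essential properties are the 39-character dictionary, the fixed length of 67, 0-padding, and truncation of longer names.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"   # 39 characters
CHAR_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}   # 0 is reserved for padding
MAX_LEN = 67

def encode(domain):
    """Map a domain name to a fixed-length vector of character ids."""
    ids = [CHAR_ID[c] for c in domain.lower() if c in CHAR_ID]
    ids = ids[:MAX_LEN]                       # discard extra characters
    return ids + [0] * (MAX_LEN - len(ids))   # pad shorter names with 0

vec = encode("Example.com")  # 11 character ids followed by 56 zeros
```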

Metrics
We evaluated the accuracy of our models by considering how well the models are able to correctly classify domain names. Consider for instance the actual positive class of domain names that are generated by the banjori DGA, and the actual negative class of all other domain names (including benign domain names and domain names generated by other DGAs). We can express the classification results, i.e., the predictions as output by a model, by considering the positive predictions (i.e., domain names that are classified as being generated by the banjori DGA) and the negative predictions (i.e., domain names that are classified as not being generated by the banjori DGA), and whether the predicted class matches the actual class. A classification result is either a true positive (TP) in case of a correct positive prediction, a false positive (FP) in case of an incorrect positive prediction, a true negative (TN) in case of a correct negative prediction, or a false negative (FN) in case of an incorrect negative prediction. The ratio between these classifications can be expressed by different metrics. Common metrics are precision, which is the fraction of correct positive predictions among all correct and incorrect positive predictions (TP/(TP + FP)), and recall, which is the fraction of correct positive predictions among all elements in the actual positive class (TP/(TP + FN)). Precision and recall may not be particularly useful when used in isolation, since deviations in FP and FN may lead to considerable differences between recall and precision. This is addressed by the F1-score, which takes the harmonic mean of precision and recall: F1 = 2 · (precision · recall)/(precision + recall). We evaluated all our DL and ML models using the held-out test set and computed the F1-score. For further analysis of the results, we also considered the confusion matrix, precision-recall curve, and ROC-curve of each model. In a confusion matrix, each row represents an actual class, while each column represents a predicted class.
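The metrics above follow directly from the TP, FP, and FN counts. A minimal sketch, with illustrative counts rather than values from our experiments:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1-score from the raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 90 banjori names classified correctly, 10 false alarms, 30 missed:
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
# p = 0.9, r = 0.75, f1 = 2 * 0.9 * 0.75 / 1.65 ≈ 0.818
```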
A precision-recall curve shows the trade-off between precision and recall for different settings of the classification threshold. An ROC-curve shows the true positive rate against the false positive rate at various settings of the classification threshold. Table 9 shows the F1-score for different ML and DL models. The corresponding results for precision and recall are shown in Tables A1 and A2. The aggregated results in Table 9 also show that the best performing models in terms of the highest average F1-score also have the smallest standard deviation, and hence the smallest spread in F1-scores. Nevertheless, the standard deviations of 14.68 for the LSTM model and 17.17 for the MLP model are still rather large.
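For illustration, a single (TPR, FPR) point of an ROC-curve can be computed as follows; sweeping the threshold over all predicted scores traces the full curve. The scores and labels below are made up, not taken from our experiments.

```python
def roc_point(scores, labels, threshold):
    """True and false positive rate at one classification threshold.
    scores: predicted probability of the positive class; labels: 1 = positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg  # (TPR, FPR)

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
tpr, fpr = roc_point(scores, labels, threshold=0.5)
# tpr = 2/3 (two of three positives exceed 0.5), fpr = 0.0
```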

Results
The spread in F1-scores is also visible in the boxplots. Figure 4 shows the confusion matrices of the LSTM, MLP, SVM, and LR models. All matrices show that DGA-W domain names are frequently misclassified as topsites. As observed in Section 4.1, the DGA-W domain names closely resemble benign domain names. In the LSTM and MLP models, misclassified domain names are most often assigned to topsites, pykspa, and vidro, while in the SVM and LR models they are most often assigned to topsites, dnschanger, and ramnit.

Discussion
In our experiments, the DL models clearly yielded better results than the ML models in multi-class classification, as shown by the F1-scores. This is as expected, since DL models are more advanced and allow deeper analysis of the input data either by applying several hidden layers (in the MLP model) or feedback (in the LSTM model). This is in line with previous studies [40,46], which also showed that DL methods tend to outperform basic ML methods. Cucchiarelli et al. [34] on the other hand showed that ML methods with careful feature selection and classifier tuning can still outperform DL methods. Although our experiments show that ML methods perform worse than DL methods with multi-class classification, the AUC results with binary classification for the DL models and the best performing ML models (SVM and LR) are similar, which confirms the importance of feature selection (as demonstrated by our TF-IDF feature selection) and classifier tuning.
We used TF-IDF features with the MLP model, and with all ML models. We argued that TF-IDF features are promising due to the differences in the n-gram distributions of benign domain names and malicious domain names from different DGA families. Our experiments indicate that models using TF-IDF features indeed perform well. The results obtained with TF-IDF in the MLP model are comparable with the results obtained with the LSTM model in which we applied standard embedding.
Our experiments also show that there are notable differences among domain names from different DGA types. DGA-H domain names are distinguishable since they represent hexadecimal numbers, and hence they are easier to classify. This is reflected in our results, as the highest aggregated F1-scores are obtained for DGA-H (up to 99.96% with the LSTM model). DGA-W domain names are close to regular domain names, and hence they are more difficult to classify. This is also reflected in our results, as the F1-scores for DGA-W are much lower than for DGA-A and DGA-H. For DGA-W, neither the LSTM nor the MLP model performs very well (73.18% and 78.57% micro-average F1-score), while the SVM model shows better results (83.61%).
Unfortunately, it is not straightforward to compare our results to prior results published in the scientific literature. This is mainly due to differences in the datasets used in experiments. As discussed in Section 3, nearly all studies used different datasets of benign and malicious domain names. Even when the same sources are used, the datasets originate from different time periods, and different subsets are taken with different numbers and types of DGA families. As discussed above, the mix of DGA families included in the dataset has a large impact.
Comparison with prior work would therefore require implementing the models from prior work and evaluating them with our datasets. There is a vast amount of prior work, and criteria would be needed to decide which models to consider. Furthermore, details needed to reproduce models, such as the values of all hyperparameters and default settings, and the configurations of software tools, are often missing in the literature. And even when reproducing the models is feasible, training and evaluating them takes considerable effort.

Conclusions
We presented the results of an extensive literature review on the application of ML and DL for detection of DGA-generated domain names. We observed that this is an active research field to which numerous groups all over the world are contributing. We also observed that there is no common methodology for performing experiments and reporting on results. Different ML and DL models are being used. Arbitrary combinations of features for ML models are being used, and a common ground for features is missing. Different datasets from various sources are being used, with different subsets of DGA families. These differences make it hard to compare experimental results.
We proposed the use of TF-IDF as a single feature type. We apply TF-IDF to evaluate how relevant n-grams are to a domain name in a set of domain names. We used the TF-IDF of the 5000 most popular n-grams (for n ∈ {1, 2, 3}) as features for popular ML and DL models. For comparison, we also used an LSTM model with an embedding layer to convert domain names from a sequence of characters into a vector representation. Our results show that the DL models outperform the ML models. The LSTM and MLP models provide the highest overall F1-scores (micro-averages of 90.69% and 89.08%), the highest area under the precision-recall curve (micro-averages of 0.974 and 0.965), the highest area under the ROC-curve (0.994 and 0.995), and the highest true positive rates (95.67% and 96.54%) with the lowest false positive rates (2.68% and 2.53%). Hence, the performance of the MLP model with TF-IDF features and the LSTM model with embedding is rather similar.
A limitation of any approach that relies on a single feature type is that an adversary can tune the DGA such that the feature values of malicious domain names match those of benign domain names. This also holds for our TF-IDF-based approach: an adversary may tune a DGA to generate domain names whose n-gram distributions match those of benign domain names, although this might be difficult to implement due to the large number of n-grams. We also observe that results differ across DGA types. Our LSTM and MLP models perform well for classifying arithmetic-based and hash-based DGAs, but less well for wordlist-based DGAs, where domain names resemble regular domain names.
In our future work we intend to look at the effectiveness of features for ML models as mentioned in scientific literature. We observed that a large variety and rather arbitrary combinations of features have been applied, and it is not clear yet which features are effective in what cases. We also plan to look into hybrid learning models, where different types of models are combined. We observed that different models perform best for classifying domain names from different types of DGAs. We also intend to explore more advanced deep-learning models to derive features and to classify DGA-generated domain names.