Because the interconnectivity and reachability of computer networks make them vulnerable to malware threats, numerous research communities have expressed strong ongoing interest in the development of adequate defense solutions [12]. Accordingly, we begin this section by describing the details of DGA malware, after which we introduce studies aimed at detecting various types of malware and note their limitations.
2.1. DGA Malware
Malware programs such as Conficker and Kraken, which have caused considerable worldwide damage, implement DGAs as one of their capabilities. Moreover, researchers have found malicious code embedded in websites and web-based advertisements that is designed to spread such malware programs over broad target areas [15]. The accompanying figure schematically depicts callbacks implemented by DGA malware. In this figure, one set of communications consists of requests for name resolution from DGA malware to a recursive DNS server (RDNS), while the other consists of responses to these requests. We begin by assuming that the DGA-generated domain name that serves as the C&C is preregistered in an authoritative DNS server (ADNS). The malware first dynamically generates multiple domain names, such as d8wgr9gpa7.com and gxwx123nrs.net, based on its DGA and then sends DNS queries for these domain names to the RDNS in the network to which the malware itself belongs. Upon receiving a query, the RDNS returns the address assigned to the domain name if it is registered in an ADNS, or a nonexistent domain (NXDOMAIN) response if it is not. Finally, any domain that yields an affirmative DNS response, e.g., gxwx123nrs.net in this figure, is assumed to be the C&C, and the malware attempts callbacks to the address assigned to that domain.
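The callback procedure described above can be sketched as follows. This is a minimal illustration rather than code from any malware family: the function names resolve and find_c2, the in-memory table standing in for the RDNS/ADNS lookup, and the address 203.0.113.7 (taken from a documentation-only range) are assumptions made for this sketch. Only the two domain names come from the text.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for the RDNS: maps registered names to addresses.
# Any name absent from this table is treated as an NXDOMAIN response.
REGISTERED: Dict[str, str] = {"gxwx123nrs.net": "203.0.113.7"}


def resolve(domain: str, table: Dict[str, str]) -> Optional[str]:
    """Return the address of a registered domain, or None for NXDOMAIN."""
    return table.get(domain)


def find_c2(
    candidates: Iterable[str], table: Dict[str, str]
) -> Tuple[Optional[str], Optional[str], int]:
    """Query candidates in order and stop at the first affirmative response.

    Returns (domain, address, number of NXDOMAIN responses seen so far).
    """
    nxdomain_count = 0
    for domain in candidates:
        address = resolve(domain, table)
        if address is not None:
            return domain, address, nxdomain_count
        nxdomain_count += 1
    return None, None, nxdomain_count


result = find_c2(["d8wgr9gpa7.com", "gxwx123nrs.net"], REGISTERED)
print(result)  # ('gxwx123nrs.net', '203.0.113.7', 1)
```

The NXDOMAIN counter is included because the failed lookups that precede the successful one are the part of this exchange that is visible in DNS traffic.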
The character strings generated by typical DGA malware for use as malicious domain names are easily distinguished from benign domain names. In GameOver Zeus [17], domain names are random sequences of letters and numerals ranging in length from 15 to 30 characters, such as 1ygx14u1vnf8hb1twhv8619h8ygr.net. These domain names are designed to avoid namespace collisions with previously registered domains. By contrast, in Rovnix [6], a type of dict-DGA malware, domain names are generated by concatenating words from dictionaries, such as accelerateaccountant.in.net. These domain names are difficult to distinguish from human-generated domain names due to their string similarities.
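To make the two naming styles concrete, the following toy generators produce one name of each kind. They are not the GameOver Zeus or Rovnix algorithms: the function names, the word list, the fixed seed, and the use of Python's random module are all assumptions chosen for illustration. Only the length range of 15 to 30 characters and the idea of concatenating dictionary words come from the text.

```python
import random
import string

# Hypothetical word list; real dict-DGA families use their own dictionaries.
WORDS = ["accelerate", "accountant", "garden", "window", "market", "silver"]


def random_dga(rng: random.Random, min_len: int = 15, max_len: int = 30,
               tld: str = "net") -> str:
    """Build a label of random letters and digits within the given lengths."""
    length = rng.randint(min_len, max_len)
    alphabet = string.ascii_lowercase + string.digits
    label = "".join(rng.choice(alphabet) for _ in range(length))
    return f"{label}.{tld}"


def dict_dga(rng: random.Random, n_words: int = 2, tld: str = "in.net") -> str:
    """Build a label by concatenating words drawn from the word list."""
    label = "".join(rng.choice(WORDS) for _ in range(n_words))
    return f"{label}.{tld}"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_name = random_dga(rng)
dict_name = dict_dga(rng)
print(random_name)
print(dict_name)
```

The first name has no recognizable substrings, whereas the second is built entirely from ordinary words, which is why string-based detection is harder for dict-DGA families.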
The objective of a DGA is to establish highly available communication channels between the malware and the C&C. One result of this process is that communication barriers based on blacklists are easily avoided by changing C&C domains. Furthermore, communications from within the network to the outside span a wide range of destination addresses and are thus difficult to detect since they are not restricted by firewalls. Notably, because the malware and the C&C run identical DGAs, they need not exchange any information when the domain name changes.
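The last point, that both sides can agree on the next domain without communicating, follows from the generator being deterministic. The sketch below assumes a generator seeded by a shared secret and the current date; this seeding scheme, the use of SHA-256, and every name in the code are hypothetical choices made for illustration and are not taken from any specific malware family.

```python
import hashlib
from datetime import date
from typing import List


def daily_domains(shared_secret: str, day: date, count: int = 3,
                  tld: str = "com") -> List[str]:
    """Derive `count` domain names from a shared secret and a date."""
    names = []
    for index in range(count):
        material = f"{shared_secret}|{day.isoformat()}|{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        names.append(f"{digest[:12]}.{tld}")
    return names


secret = "example-secret"  # hypothetical value embedded on both sides

# Both parties compute the same list independently for the same day.
malware_side = daily_domains(secret, date(2020, 1, 1))
operator_side = daily_domains(secret, date(2020, 1, 1))

# The next day's list is different, so a blacklist built from the first
# day's names does not cover it.
next_day = daily_domains(secret, date(2020, 1, 2))
print(malware_side == operator_side)
print(set(malware_side) & set(next_day))
```

Under these assumptions, the operator only needs to register one of the names on a given day's list for the rendezvous to succeed.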
2.2. Related Work
Blacklists constitute the core of most network threat defense strategies, and studies that focus on producing more sophisticated blacklists are conducted frequently. For example, Soldo et al. [18] proposed a method to substantially enhance the performance of blacklists based on previous attack logs provided by multiple contributors. Meanwhile, Freudiger et al. [19] improved the confidentiality of this method by sharing attack logs through peer-to-peer communications. Separately, Špaček et al. [20] developed a DNS firewall system that blocks communications from a protected network to malicious domains on outside networks. This system uses DNS response policy zone (RPZ) technology [22] for advanced domain blacklisting. Unfortunately, malware families constructed using DGAs are capable of avoiding blacklist-based detection by frequently changing the domain of their C&C.
In another approach, deep packet inspection (DPI), which examines packet payloads, has been used to strengthen protections against malware intrusions. For example, Gu et al. [23] implemented BotHunter, a DPI-based passive network monitoring system that models typical malware behavior and interprets communications exhibiting strong associations with such behavior as evidence of malware infection. Other studies have discussed efforts to improve the performance of DPI-based detection [24]. However, in a recent cybersecurity report [26], Cisco Systems noted that more than 50% of current Internet communications are encrypted and that approximately 70% of all malware programs encrypt their communications. Thus, as the fraction of encrypted communications grows, the share of communications amenable to DPI analysis will eventually shrink to a negligible subset. A common practice for applying DPI to encrypted packets is to employ a man-in-the-middle approach [27], which intercepts, decrypts, and analyzes communications between two endpoints, but this approach raises both practical and privacy concerns. Specifically, large latency is incurred due to the computational cost of the decryption and re-encryption processes, and the privacy guarantee is violated, raising potential compliance issues. These circumstances have motivated a focus on domain name resolution through traditional DNS, which is not encrypted, as an information source for detecting malware. Through previous investigations [28], we have confirmed that traces of malicious activities appear in queries sent to traditional DNS.
Rahbarinia et al. [29] developed a system called Segugio that finds unknown malicious domains based on their co-occurrence with known malicious domains in DNS queries. Segugio is based on the following insights: (1) computers infected by the same malware family tend to communicate with the same group of malicious domains, and (2) uninfected computers have no reason to communicate with malicious domains. However, since the temporary malicious domains used for DGA malware callbacks have extremely short lifetimes and since the domains that co-occur with temporary malicious domains may not actually exist, this system cannot adequately detect the callbacks of DGA malware.
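The co-occurrence idea behind the two insights above can be illustrated with a small scoring function. This is not Segugio's implementation, which is not described in this section; the scoring rule (the share of a domain's querying hosts that also queried a known malicious domain), the function name, and the sample log with its host and domain names are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_scores(
    queries: Iterable[Tuple[str, str]], known_malicious: Set[str]
) -> Dict[str, float]:
    """Score each unknown domain by the share of its querying hosts that
    also queried at least one known malicious domain."""
    hosts_by_domain = defaultdict(set)
    for host, domain in queries:
        hosts_by_domain[domain].add(host)

    # Hosts that contacted any known malicious domain are presumed infected.
    infected_hosts = set()
    for domain in known_malicious:
        infected_hosts |= hosts_by_domain.get(domain, set())

    scores = {}
    for domain, hosts in hosts_by_domain.items():
        if domain in known_malicious:
            continue
        scores[domain] = len(hosts & infected_hosts) / len(hosts)
    return scores


# Hypothetical (host, queried domain) log entries.
log = [
    ("h1", "bad.example"), ("h2", "bad.example"),
    ("h1", "new.example"), ("h2", "new.example"),
    ("h1", "news.example"), ("h3", "news.example"),
]
scores = cooccurrence_scores(log, {"bad.example"})
print(scores)  # {'new.example': 1.0, 'news.example': 0.5}
```

A score of this kind depends on a domain being queried by several hosts over time, which is consistent with the limitation noted above for short-lived DGA domains.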
In another example, Berger et al. [30] developed a system called DNSMap that discovers potentially compromised computers based on DNS traffic. Specifically, DNSMap identifies suspicious agile DNS mappings, i.e., mappings characterized by rapidly changing domain names and/or addresses, which are often used by the C&C. Meanwhile, Wang et al. [31] deployed a system called DBod, which classifies and detects DGA malware based on an analysis of DNS query behaviors. Because computers infected by the same DGA malware family tend to generate a large number of identical DNS queries, those queries also tend to exhibit a similar domain scope and distribution, and DBod exploits these similarities for classification and detection. However, since both DNSMap and DBod require the observation of an enormous quantity of DNS traffic, their utility is limited to large-scale networks, such as those of Internet service providers (ISPs).
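The observation that hosts infected by the same family issue largely identical queries can be expressed as a set-similarity test. The sketch below uses Jaccard similarity between the query sets of pairs of hosts; this measure, the threshold of 0.5, and the sample data are assumptions chosen for illustration and should not be read as the actual DBod algorithm.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_host_pairs(
    queries_by_host: Dict[str, Set[str]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return host pairs whose query sets overlap at or above the threshold."""
    pairs = []
    for left, right in combinations(sorted(queries_by_host), 2):
        score = jaccard(queries_by_host[left], queries_by_host[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


# Hypothetical per-host sets of queried domain names.
observed = {
    "h1": {"qk3v9x.example", "p0zt7m.example", "w8rj2c.example"},
    "h2": {"qk3v9x.example", "p0zt7m.example", "n5ly4d.example"},
    "h3": {"intranet.example"},
}
pairs = similar_host_pairs(observed)
print(pairs)  # [('h1', 'h2', 0.5)]
```

Comparing every pair of hosts grows quadratically with the number of hosts, so a sketch like this is only practical for small inputs.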
Plohmann et al. [32] revealed the DGA landscape by reverse-engineering 43 malware families, and Zago et al. [33] provided a mature dataset with analysis results for over 30 million domains generated by 50 malware families. Based on those results, other researchers have attempted to distinguish between benign and malicious domains using only their character strings, in a manner similar to our approach. For example, Truong et al. [4] proposed a technique that learns and predicts character patterns using bigram models with supervised learning algorithms, and Anderson et al. [5] extended this technique using character-level models with LSTM networks. Meanwhile, Qiao et al. [34] combined LSTM networks with attention mechanisms to assign appropriate weight values to the characters in domain names. In addition, Vinayakumar et al. [35] compared the performance of various machine learning (ML) and deep learning (DL) algorithms when detecting DGA-generated malicious domains. These techniques rely on the existence of a discernible bias in the rules for generating malicious domain names and thus must learn that bias in advance by analyzing both benign and malicious datasets.
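As a rough illustration of what learning character patterns from bigrams can mean, the sketch below scores a domain label by the average smoothed log-probability of its character bigrams under counts taken from benign names. It is a simplified frequency score, not the supervised pipeline of [4] or the LSTM models of [5]; the class name, the smoothing scheme, and the tiny training list are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable, List


def bigrams(label: str) -> List[str]:
    """Split a label into character bigrams, with ^ and $ as boundary marks."""
    padded = f"^{label.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramScorer:
    def __init__(self, benign_labels: Iterable[str]) -> None:
        self.counts = Counter(
            bg for label in benign_labels for bg in bigrams(label)
        )
        self.total = sum(self.counts.values())
        # One extra bucket reserves probability mass for unseen bigrams.
        self.buckets = len(self.counts) + 1

    def score(self, label: str) -> float:
        """Average smoothed log-probability of the label's bigrams."""
        log_probs = [
            math.log((self.counts[bg] + 1) / (self.total + self.buckets))
            for bg in bigrams(label)
        ]
        return sum(log_probs) / len(log_probs)


# Hypothetical benign training labels (top-level domains removed).
BENIGN = ["google", "facebook", "wikipedia", "amazon",
          "youtube", "weather", "mailbox", "booking"]
scorer = BigramScorer(BENIGN)
benign_like = scorer.score("bookmail")
random_like = scorer.score("d8wgr9gpa7")
print(benign_like > random_like)  # True
```

A practical detector would train on far larger benign and malicious datasets and feed such features to a classifier, as the cited studies do.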
Pereira et al. [36] proposed a method for detecting dict-DGA malware based solely on domain strings. This method has two functions: estimating the dictionary used by dict-DGA malware from their callbacks and identifying malicious domain names using the estimated dictionary. However, this method is extremely simple in that a domain name is identified as malicious when the number of words from the estimated dictionary that make up its character string exceeds a given threshold. As a result, because the method does not take benign domain names into account, it produces numerous false positives. In contrast, our work focuses on using ML to characterize the differences between benign and malicious domain names and thus produces better results than can be achieved by the method in [36