Cybersquatting attacks emerged with the creation and implementation of the Domain Name System (DNS) in internet architecture and have since evolved to include various techniques that manipulate domain names for malicious purposes [20]. Initially, in a process called domain-squatting, attackers attempted to preemptively register desirable domain names and sell them at a premium to businesses and trademark owners [20]. Squatting attacks have subsequently expanded to include numerous different techniques and processes, exploiting both human and hardware errors [21]. These different attacks are described in greater detail in Section 2.1.
Much research has focused on understanding the expansive nature of squatting attacks and their effect on the DNS landscape. Experimental techniques and processes have varied, but most studies have sought to understand malicious actors’ goals and the trends in domain selection and creation [2,13,22].
Section 2.2 provides an overview of various squatting experiments. In addition to understanding the space, researchers have worked to develop applications and models to detect and combat squatted domains. These solutions range from machine learning models that detect squatted domains to browser extensions that detect typos, and even gamified training for individuals [8,15,23]. These approaches are further detailed in
Section 2.3. An overview of the current scope of research related to the privacy impacts of squatting is provided in
Section 2.4. Finally, an introduction to the architecture of the Use and Abuse project, along with the privacy-oriented active OSINT capabilities developed for it, will be presented in
Section 2.5.
2.1. Overview of Squatting Techniques
Squatting has come to represent a class of attacks that exploit the DNS process to mislead users into accessing potentially malicious websites at domains that closely resemble the ones they intended to visit. Common techniques include typosquatting, combosquatting, soundsquatting, homograph squatting, bitsquatting, and email squatting. As shown in
Figure 1, each of these methods involves modifying the base domain in some form. Using one of these techniques, actors generate and register domains to host their own content and then count on users to unintentionally access their domain.
One of the more common techniques used by attackers is typosquatting, which exploits the likelihood that an individual will mistype a legitimate domain or mistake an illegitimate domain for the real one [21]. The majority of typosquatted domains rely on users adding, omitting, or mistyping a single character when attempting to access a legitimate domain, which corresponds to a Damerau-Levenshtein distance of one [24]. This concept has been termed the “fat-finger distance,” representing a user’s likelihood of accidentally striking a nearby key when typing on a keyboard [25]. An example of a domain in this category is fscebook.com, which a user might accidentally type instead of facebook.com.
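The distance-one idea above can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the cited studies: the function names and the small `QWERTY_NEIGHBORS` table are assumptions introduced here, and the adjacency table is deliberately partial. It enumerates every label within a Damerau-Levenshtein distance of one (insertion, deletion, substitution, adjacent transposition) and, separately, the narrower set of adjacent-key substitutions.

```python
import string

# Characters permitted in a DNS label (letters, digits, hyphen).
ALPHABET = string.ascii_lowercase + string.digits + "-"

# Hypothetical, deliberately partial QWERTY adjacency map (illustration only).
QWERTY_NEIGHBORS = {
    "a": "qwsz",
    "c": "xdfv",
    "e": "wsdr",
    "o": "iklp",
}


def _is_valid_label(label: str) -> bool:
    """Reject empty labels and labels that start or end with a hyphen."""
    return bool(label) and not label.startswith("-") and not label.endswith("-")


def distance_one_variants(label: str) -> set[str]:
    """Return all valid labels at Damerau-Levenshtein distance one from `label`."""
    variants = set()
    for i in range(len(label) + 1):
        for ch in ALPHABET:
            variants.add(label[:i] + ch + label[i:])  # insertion
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])  # deletion
        for ch in ALPHABET:
            variants.add(label[:i] + ch + label[i + 1:])  # substitution
    for i in range(len(label) - 1):
        # transposition of two adjacent characters
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    variants.discard(label)  # the original label is distance zero
    return {v for v in variants if _is_valid_label(v)}


def fat_finger_variants(label: str) -> set[str]:
    """Return substitutions that replace one character with an adjacent key."""
    variants = set()
    for i, ch in enumerate(label):
        for neighbor in QWERTY_NEIGHBORS.get(ch, ""):
            variants.add(label[:i] + neighbor + label[i + 1:])
    return variants


typos = distance_one_variants("facebook")
print(len(typos), "fscebook" in typos)
print(sorted(fat_finger_variants("facebook"))[:5])
```

Even for an eight-character label the distance-one set contains several hundred candidates, which gives a sense of why exhaustive defensive registration is difficult.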
Combosquatting is another squatting technique that involves creating domains by combining well-known company or organization names with believable keywords [
26]. An example for facebook.com could be a domain registered as facebook-friends.com [
22]. The purpose of combosquatting is to trick or mislead users by creating contextual and believable domain variations based on legitimate domains. Combosquatting and typosquatting represent the two most popular squatting techniques used by attackers [
22]. Although squatting research primarily focuses on the English QWERTY keyboard layout, studies have also shown that attacks target non-English keyboards and languages [
27].
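As a rough illustration of the combosquatting pattern described above, the following Python sketch combines a brand name with keywords to produce candidate domains, and applies a naive check for whether a domain embeds a brand. The keyword list, function names, and separator choices are hypothetical and chosen for illustration; they do not reproduce the methodology of the cited work.

```python
# Hypothetical keyword list; real campaigns draw on far larger vocabularies.
KEYWORDS = ["friends", "login", "support"]


def combosquat_candidates(brand: str, keywords: list[str], tld: str = "com") -> list[str]:
    """Combine a brand with each keyword, before and after, with and without a hyphen."""
    candidates = []
    for keyword in keywords:
        for separator in ("-", ""):
            candidates.append(f"{brand}{separator}{keyword}.{tld}")
            candidates.append(f"{keyword}{separator}{brand}.{tld}")
    return candidates


def looks_like_combosquat(domain: str, brand: str) -> bool:
    """Naive check: the first label contains the brand plus extra text.

    Assumes `domain` is a bare registered domain with no subdomain.
    """
    label = domain.lower().split(".")[0]
    return brand in label and label != brand


for candidate in combosquat_candidates("facebook", KEYWORDS):
    print(candidate, looks_like_combosquat(candidate, "facebook"))
```

A substring check of this kind will also flag legitimate domains that happen to contain the brand string, so in practice it would only serve as a first-pass filter.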
Another form of squatting attack is soundsquatting, which takes advantage of similar-sounding words, otherwise known as homophones, within domain names [
28]. Soundsquatting is not as popular as other squatting techniques, but it is becoming more important as the usage of virtual assistants and smart devices increases [
29,
30]. Users are more likely to access these websites as a result of misinterpretations by smart virtual assistants and platforms. In addition, these attacks are prevalent in multi-language scenarios, where an individual’s language, accents, and comprehension abilities are critical [
31].
Homograph squatting is a technique that relies on utilizing visually similar characters when creating domains in order to visually deceive users [
32]. This type of attack exploits characters from other scripts that have different Unicode code points but visually resemble the characters of a legitimate domain [
33]. Documented attacks have utilized Russian, Latin, Cyrillic, and other language characters to target well-known domains [
33]. For example, an attacker might register ‘facebook.com’ using the Cyrillic character ‘a’ (U+0430) to replace the Latin ‘a’ (U+0061) in ‘facebook.com’. While the domains appear visually identical, they are registered as two completely distinct domains.
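The difference between the two visually identical strings can be shown with a few lines of Python. This is a minimal sketch using only the standard library; the helper names are introduced here for illustration. Note that Python's built-in `idna` codec implements the older IDNA 2003 rules, so its output may differ from what a modern registry or browser computes.

```python
import unicodedata

LEGITIMATE = "facebook.com"
# Same string, but with CYRILLIC SMALL LETTER A (U+0430) in place of Latin "a".
SPOOFED = "f\u0430cebook.com"


def non_ascii_characters(domain: str) -> list[tuple[str, str, str]]:
    """List each non-ASCII character with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in domain
        if ord(ch) > 127
    ]


def to_ascii(domain: str) -> str:
    """Convert a domain to its ASCII (Punycode) form using the IDNA 2003 codec."""
    return domain.encode("idna").decode("ascii")


print(LEGITIMATE == SPOOFED)          # the strings differ despite looking alike
print(non_ascii_characters(SPOOFED))  # reveals the Cyrillic character
print(to_ascii(SPOOFED))              # the ASCII form carries an "xn--" prefix
```

Inspecting the ASCII form is one simple way a tool can surface a lookalike domain that the rendered Unicode form conceals.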
A more challenging and randomized attack harnesses unavoidable errors in memory and routing infrastructure. This type of attack, known as bitsquatting, relies on random bit errors in device memory to redirect users from legitimate domains to the registered malicious one [
9]. More specifically, in the process of transmitting data from the user interface, network errors or hardware faults can result in bit flips occurring during the DNS process, resulting in misdirected requests [
20]. For example, a single bit flip in ‘google.com’ might result in the ‘g’ (binary 1100111) changing to ‘c’ (binary 1100011), creating ‘coogle.com’. Attackers could then register ‘coogle.com’ in an effort to capitalize on the domain misrouting. There are several potential solutions to address bitsquatting, including ECC memory, CRC checks, and DNSSEC protocols. However, these solutions are not widely implemented and would require significant time and financial investment to deploy at scale [
9].
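The single-bit-flip example above generalizes to every character of a domain. The sketch below, written for illustration rather than taken from the cited work, flips each bit of each character and keeps the results that are still valid hostname characters. The function name is an assumption, and the sketch does not check whether a flipped top-level domain actually exists.

```python
import string

# Characters that may appear in a DNS label.
LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bitflip_variants(domain: str) -> set[str]:
    """Return domains that differ from `domain` by one flipped bit in one character."""
    domain = domain.lower()
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # keep label separators intact
        for bit in range(8):
            # DNS names are case-insensitive, so compare in lowercase.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped == ch or flipped not in LABEL_CHARS:
                continue
            candidate = domain[:i] + flipped + domain[i + 1:]
            labels = candidate.split(".")
            # Labels must be non-empty and must not start or end with a hyphen.
            if all(lbl and lbl[0] != "-" and lbl[-1] != "-" for lbl in labels):
                variants.add(candidate)
    return variants


variants = bitflip_variants("google.com")
print(len(variants), "coogle.com" in variants)
```

Because only a handful of the eight possible flips per character yield a valid hostname character, the bitsquatting candidate set for a domain is much smaller than its typosquatting candidate set.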
Squatting attacks also present themselves in non-web applications such as email. This attack, known as email squatting, aims to capture emails sent when a user mistypes the recipient’s email address or mistypes their own address when registering for a website or service [
24]. In other cases, email squatting can take advantage of typos in SMTP configurations within email clients [
24]. An example of an attack in this format may be a user typing gmaol.com rather than gmail.com.
A summary of each squatting technique described, along with relevant citations, is provided in
Table 1. Overall, there are a variety of techniques that are commonly employed by attackers when carrying out cybersquatting attacks.
2.2. Categories of Typosquatting Research
Most research within the squatting space focuses on further understanding squatting attacks’ targeting, generation, purpose, and impacts. A better understanding of the squatting space enables researchers and organizations to create effective countermeasures and solutions. To achieve a better understanding, squatting experiments typically focus on one of three categories:
1. Understand the motives of squatting actors (distinguishing between profit-driven, politically motivated, and maliciously driven actors)
2. Survey and understand the current scope of squatting attacks
3. Perform active OSINT through registration of squatting domains
As mentioned previously, the typical goal of squatting is to generate a profit or to disseminate malware and phishing content via the content hosted at a squatted domain. Four main methodologies were discovered through a survey of squatted domain websites [
4]. Thirty-three percent of the websites attempted to generate profit through advertisements, 17% were involved in affiliate abuse schemes, 13% violated trademarks, and 7%, considered to be the most harmful category, were phishing or scam websites designed to steal personal information [
4]. With the wide range of potential motives, researchers have attempted to formulate methods to quantify the harm caused by squatting attacks as a function of time loss [
13].
In addition to understanding motives and goals, it is crucial to understand the trends and scale of the squatting landscape so that mitigation efforts can be tailored and targeted effectively. Researchers have performed targeted analyses and surveys of DNS records and squatted domains to better understand the squatting field. One survey, focused on combosquatting domains, analyzed over 400 billion DNS records and determined that the majority of cases only add a single character to the legitimate domain to create the malicious domain [
26]. Additionally, the majority of combosquatted domains were found to exist for extended periods without suffering from remediation efforts [
26]. Researchers who tracked the registrations of bitsquatted domains for the 500 most popular websites found 5366 different domains over a 270 day period [
9]. Additionally, throughout their experiment, the number of active registrations increased by 46% [
9]. Another large-scale survey of 8255 typosquatting URLs found 8828 different malicious pop-up messages that attempted to mislead visitors into downloading malicious content or sharing their private information [
34].
A more active approach to understanding the scope of squatting has also been utilized, in which researchers create and register a set of squatted domains and monitor the traffic those domains receive. Researchers who registered 76 misspelled variations of popular email domains found that the traffic received by squatted domains is largely influenced by the popularity of the legitimate domain and the degree of similarity between the squatted and legitimate domains [
24]. Based on their data, they predict that five domains (gmail.com, hotmail.com, outlook.com, comcast.com, and verizon.com), targeted by 1211 typosquatted email domain registrations, would receive between 22,577 and 905,174 emails per year due to user typos [
24].
2.3. Squatting Attack Countermeasures
To reduce the number of typosquatting incidents, researchers have developed and proposed a variety of solutions. Proposed solutions range from AI/ML models that predict the domain variations attackers are likely to target to browser extensions that detect user typos [
8,
35]. Squatting is used to facilitate identity theft, financial fraud, and malware distribution; therefore, mitigating this threat is crucial to protecting the average individual using the Internet [
36].
Many of the countermeasures developed and proposed seek to leverage machine learning and AI capabilities to detect squatted domains. Similar investigations have sought to leverage machine learning to detect phishing websites, utilizing techniques such as semantic feature extraction and mutual information-based classification [
37,
38]. These approaches share a similar methodology with domain squatting detection, especially in URL analysis and feature-based classification techniques. Approaches to detecting potentially malicious DNS queries can be divided into two process categories, query level approaches and traffic level approaches, each having a set of features that can be focused on to detect irregularities [
35]. Additionally, AI/ML solutions are typically divided into one of two methodologies: employing AI/ML for the detection of squatted domains or employing AI/ML to generate the likely set of targeted domains so that they can be defensively registered. Defensive registration is the process of domain owners purchasing and registering similar domain names so that malicious actors cannot register the domains themselves [
39].
Determining the features to prioritize when training a machine learning (ML) model to detect phishing or squatting domains is crucial. Current research identifies four main feature categories: URL-based features, domain-based features, page-based features, and content-based features [
40]. One implementation utilizes an ensemble learning classifier model based on five classification algorithms: K-Nearest Neighbors (K-NN), C4.5, Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM) [
35]. This model was capable of achieving an 88.4% accuracy and 85.5% precision rating based on 8 key features of domain names, including domain length, unique letters/numbers, and ratios of character types [
35]. Another approach trains and compares different machine-learning classifiers utilizing a dataset of known phishing URLs so that it can be applied to detect malicious domains, with the most successful classifier achieving an accuracy of 98.03% [
29]. A potential limitation of some machine learning approaches is that they focus on only one squatting technique. One approach addresses this limitation by employing and comparing large language models to detect squatting attacks [
36]. Through the application of the Llama-3-70B language model on a dataset of 1649 squatting domains, with curated prompts consisting of squatting domain examples and reference domains, this approach achieved 94.7% accuracy [
36]. In a real-world application, the system detected 34,359 squatting domains from 2.09 million new domains. Adversaries, however, are constantly developing methods to avoid detection by ML algorithms, such as HTML and URL manipulation; the range of such techniques has been termed the “evasion space” [
41]. Successful avoidance can significantly decrease the effectiveness of ML detection of phishing domains and techniques.
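To make the notion of domain-name features more concrete, the sketch below computes a few lexical features of the kind mentioned above (domain length, unique letters and digits, and ratios of character types). It is a simplified illustration: the exact eight features used in the cited model are not listed in this section, so the feature names and definitions here are assumptions.

```python
def lexical_features(domain: str) -> dict[str, float]:
    """Compute simple lexical features of a domain name (illustrative subset)."""
    name = domain.lower().rstrip(".")
    chars = name.replace(".", "")  # ignore label separators for the ratios
    total = len(chars) or 1        # avoid division by zero on empty input
    letters = [c for c in chars if c.isalpha()]
    digits = [c for c in chars if c.isdigit()]
    return {
        "length": len(name),
        "unique_letters": len(set(letters)),
        "unique_digits": len(set(digits)),
        "letter_ratio": len(letters) / total,
        "digit_ratio": len(digits) / total,
        "hyphen_ratio": chars.count("-") / total,
    }


for example in ("facebook.com", "facebook-friends.com", "faceb00k-login.com"):
    print(example, lexical_features(example))
```

Feature vectors of this form could then be passed to any standard classifier; the choice of classifier and training data is outside the scope of this sketch.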
One solution researchers have pursued from a corporation standpoint is to analyze network traffic datasets to determine the most likely set of typo errors [
15]. By training a random forest regressor model utilizing features from these datasets, researchers were able to achieve 95.7% accuracy in predicting likely iterations of domains, which organizations can then defensively register [
15]. One tool, TypoWriter, utilizes a Recurrent Neural Network (RNN) to generate and predict the most probable set of typo-error domains for organizations to defensively register [
42]. Alternative approaches are tailored towards specific squatting techniques. For example, a transformer neural network has been utilized to predict soundsquatting domains for multi-language scenarios [
30]. Using tools like this enables corporations to gain better oversight of squatting cases and domains that might otherwise go unnoticed.
Defensive registration can become extremely costly and complex, as attackers could abuse any of the countless possible variations of a domain. This burden is especially acute for small businesses and organizations that may lack the resources to register their squatting domain space effectively. As a result, many organizations employ the services of defensive registrars like MarkMonitor or Com Laude [
43]. Another underutilized approach organizations can adopt is the usage of sunrise periods, in which domain registrars will notify trademark owners if a potentially infringing domain is registered, so that organizations can take appropriate actions [
43].
Other approaches by researchers are focused on protecting individual users accessing the Internet using their own devices. One example of this is an anti-typo Squatting Tool browser extension that provides real-time suggestions and error detection for users accessing domains on the Internet [
8]. The tool itself functions by comparing user domain query inputs to a local database of common, popular website domains [
8]. A similar approach aims to detect and prevent users from inputting sensitive information into untrustworthy websites or sources in order to prevent phishing attacks [
44]. A different method takes advantage of the Swype keyboard framework, which is a keyboard that allows users to slide their finger from character to character to type words [
45]. The TypoSwype tool analyzes Swype pattern images utilizing image recognition algorithms and a convolutional neural network (CNN) to compare entered queries to commonly queried domains [
45].
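The comparison of a typed domain against a local list of popular domains, as described for the browser extension above, can be sketched as follows. This is not the cited tool's implementation: the domain list, the function name, and the similarity cutoff are hypothetical, and the standard library's `difflib` similarity ratio is swapped in as a simple stand-in for whatever matching the tool actually performs.

```python
import difflib

# Hypothetical local list of popular domains; a real tool would ship a much larger one.
POPULAR_DOMAINS = ["facebook.com", "google.com", "gmail.com"]


def suggest_domain(typed: str, cutoff: float = 0.8):
    """Return the popular domain the user probably meant, or None.

    None is returned when the input is itself a known popular domain or
    when nothing in the list is similar enough.
    """
    typed = typed.strip().lower()
    if typed in POPULAR_DOMAINS:
        return None  # exact match: nothing to warn about
    matches = difflib.get_close_matches(typed, POPULAR_DOMAINS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for query in ("fscebook.com", "gmaol.com", "facebook.com", "example.org"):
    print(query, "->", suggest_domain(query))
```

The cutoff trades missed typos against false warnings on unrelated domains, so it would need tuning against real browsing data.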
Increasing users’ ability to detect malicious domains can greatly decrease their likelihood of falling victim, and therefore the threat of squatting attacks. Researchers have created a gamified application with features such as scores and leaderboards to train users to detect and avoid spoofed websites [
23].
In addition to technical and gamified approaches, another potential area to address domain squatting is policy-based approaches focusing on domain registration. Domain name registrations are overseen by the Internet Corporation for Assigned Names and Numbers (ICANN), which is responsible for defining policies for domain name registrations [
46]. ICANN, however, has received criticism over its lack of enforcement and administration internationally for TLD registrations, with studies indicating the need for verification standards for WHOIS registrations [
47]. Additionally, the expansion of generic top-level domain (gTLD) names has been shown to have resulted in an increase in typosquatting attacks targeting legitimate organizations [
48]. One potential approach to improve domain name registration verification is requiring that to register domains with certain TLDs, a user must possess a registered business. A more complex approach, which would require reform in the domain registration space, would be to limit the number of registration providers and hold those companies liable for allowing registrations of squatting domains. A third approach, similar to defensive registrations, would be to make domain registrars responsible for blocking registrations that mimic well-known brands and trademarks. In general, policy-based reform could result in significant changes in the domain space that would greatly reduce squatting attacks.
Overall, a variety of countermeasure approaches have been developed and explored. A summary of these countermeasures is provided in
Table 2. Machine learning algorithms and applications have demonstrated success in detecting typo domains but have yet to be widely adopted and applied. Defensive registration is a practice applied by major corporations, but the sheer number of possible domains combined with constrained resources limits its effectiveness, especially for small businesses. Approaches tailored for user-side interactions lack widespread implementation. In short, effective applications and detection methods have been demonstrated, but their mainstream usage remains limited.
2.4. Privacy Impacts of Squatting Attacks
The increased digitization of people’s day-to-day lives has resulted in greater privacy and personally identifiable information (PII) related risks. The challenge for individuals and organizations tasked with protecting the PII they collect, however, is that there is no uniform definition or standard for what constitutes PII. OMB Memorandum M-07-16 defines PII as “information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual [
49].” This definition leaves room for nuanced scenarios and the requirement for case-by-case assessments to be performed based on the data collected and available [
50]. The European Union’s GDPR refers to personal data as information relating to an identified or identifiable natural person, and refers to identifiers as factors such as name, identification number, location data, and factors related to online identifiers [
51]. Although both definitions focus on the identification component for privacy data, both leave room for interpretation based on the scenario and type of data that is being collected or processed.
In the context of online transactions, certain categories of PII are more commonly collected when creating accounts or signing up for services. The required information differs based on the type of service an individual is trying to register for. For example, the creation of an account on a blog site typically requires an email address, name, birth date, and password, while an account on a social media website tracks additional information about user activity, interactions, and location [
5]. Malicious actors seek to harvest and steal PII to carry out impersonation, identity fraud, and other cyber attacks [
52]. One common technique leveraged by attackers to collect PII is creating fraudulent websites that mimic legitimate websites or services. Hosting these websites at squatted domains is particularly effective, as users who mistype a URL or are redirected may unknowingly share their PII with a website they believe to be legitimate. To understand the true extent and impact of squatting attacks on PII collection, researchers must examine not only whether these domains solicit information, but also what happens to that information after it is submitted.
The body of research on the privacy impacts of domain squatting is currently limited. While studies have shown squatted domains being used for PII collection and phishing schemes, the broader privacy implications remain largely unknown. In an analysis of 40,299 non-legitimate domains, 174 were found to be conducting phishing attacks [
26]. A similar study identified 657,000 domains impersonating 702 popular brands, of which 1175 were found to be squatted domains attempting to carry out phishing attacks [
53]. A study on defensive registrations found a moderately positive relationship between phishing attacks targeting company domains and the number of defensive registrations made by the company [
43]. Overall, research indicates that squatting techniques can be used to steal PII; however, the broader large-scale impacts remain uncertain. Existing studies have focused primarily on identifying phishing domains rather than investigating what happens to PII after it is collected. This gap underscores the need for active investigative approaches that can track the downstream consequences of sharing personal information with squatted domains.