To develop the proposed MUDS, the feature engineering and machine learning algorithms were implemented using the Pandas [34], Scikit-learn [35], and XGBoost [36] libraries in Python, while the hyperparameter optimization (HPO) methods were implemented using Optuna [37] and extending the Skopt [38] and Hyperopt [39] libraries. The source code of our project is available at: https://github.com/yunduannnn/Malicious-URL-Detection-System (accessed on 23 October 2024). The experiments were carried out on a Dell Precision 3660 Tower machine with an i9-12900 central processing unit (CPU) (12-core, 2.40 GHz) and 64 gigabytes (GB) of memory, representing a server machine for model training and a vehicle-level machine for model testing, respectively.
The experimental study is structured as follows. In Section 5.1, we present the results of a comprehensive statistical analysis, demonstrating that our feature engineering for this dataset has been well designed and is effective. In Section 5.2, we evaluate the performance of the known-MUDS when tested with known malicious URLs (MURLs). Section 5.2.2 examines the robustness of the known-MUDS against simulated unknown MURLs, analyzing how well the system performs under these conditions. In Section 5.3, we investigate the performance of the proposed unknown-MUDS, specifically designed to address unknown MURLs. Additionally, we compare the proposed unknown-MUDS, which incorporates CL_K-means and biased classifiers, against a binary classifier specifically designed to tackle unknown MURLs.
5.1. Statistical Analysis of URL-Based Features
In the following, we present the results of a comprehensive statistical analysis, demonstrating that the proposed feature engineering for this MURL dataset has been well designed and is effective.
Following the systematic application of feature construction and selection methodologies, 21 features were obtained; they are delineated in Figure 5 and Figure 6. Figure 5 illustrates the correlations between all features, while Figure 6 shows the ten features most strongly correlated with the URL category.
There are several strong positive correlations (shown in dark red) between variables, such as:
“abnormal_url” and “count_http”;
“abnormal_url” and “tld_length”;
“abnormal_url” and “hostname_length”;
“count_http” and “hostname_length”;
“hostname_length” and “tld_length”.
There are also some weak or negative correlations (shown in blue/green) between certain variables, indicating little relationship or an inverse relationship.
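As a concrete illustration, the correlation analysis behind Figure 5 and Figure 6 can be reproduced with Pandas. The following is a minimal sketch, assuming a DataFrame df that holds the 21 engineered features plus a type column (column names are illustrative, not the exact names used in our pipeline):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df: engineered feature table (assumed), one numeric column per feature
# plus a 'type' column with the URL category.
num = df.select_dtypes("number")

# Full feature-feature correlation matrix (Figure 5 style):
# dark red cells correspond to strong positive correlations.
sns.heatmap(num.corr(), cmap="RdBu_r", center=0)
plt.title("Correlations between URL features")
plt.show()

# Ten features most correlated with the category (Figure 6 style),
# using an integer encoding of the 'type' labels.
type_codes = df["type"].astype("category").cat.codes
top10 = num.corrwith(type_codes).abs().sort_values(ascending=False).head(10)
print(top10)
```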
Figure 7 depicts the relationship between URL length and URL type, which can be classified as phishing, benign, defacement, or malware.
Figure 7 indicates that the average URL length is 46 characters for phishing URLs, 58 characters for benign URLs, 86 characters for defacement URLs, and 57 characters for malware URLs.
From these data, we can observe the following.
Defacement URLs have the longest average URL length at 86 characters, while phishing URLs have the shortest at 46 characters. Benign URLs and malware URLs have similar average lengths, around 57-58 characters. These distinct averages across URL types indicate that URL length could be a valuable feature for the MUDS framework: by incorporating it, the model may better differentiate benign URLs from MURLs, potentially improving the overall classification performance.
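The per-type averages in Figure 7 can be computed directly with a group-by; a minimal sketch, assuming df has 'url' and 'type' columns:

```python
# Average URL length per category (assumed columns 'url' and 'type').
df["url_length"] = df["url"].str.len()
print(df.groupby("type")["url_length"].mean().round(1))
# Expected ordering: defacement longest (~86), phishing shortest (~46).
```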
Based on Figure 8, the following analysis can be provided. The bar graph in Figure 8 displays the frequency distribution of abnormal and normal URLs. The blue bar represents a significantly higher count of 463,185 for “normal URL” compared to the orange bar of 188,006 for “abnormal URL”. This indicates that the dataset contains a predominant proportion of normal URLs relative to abnormal URLs.
Figure 9 presents the distribution of URLs by their classification as “Abnormal_URL” and their associated “type” (benign, defacement, malware, phishing), showing the frequency counts for each category. The majority of URLs are classified as “Abnormal_URL”, with a frequency count of 392,709, indicating that the dataset under study contains a significant proportion of abnormal URLs. Within the “Abnormal_URL” category, the most prevalent type is “defacement” with a frequency of 96,457, followed by “malware” with 31,310 and “phishing” with 24,845. This distribution highlights the diverse nature of the abnormal URLs within the dataset, which has implications for MURL detection and classification.
Figure 10 presents the distribution of HTTPS usage within the dataset. The graph displays two distinct categories: HTTPS usage and non-HTTPS usage. The data reveal that most URLs, approximately 97.5% (634,768), utilize HTTPS for communication. In contrast, only a small portion, approximately 2.5% (16,423), of the URLs do not use HTTPS. This suggests that the dataset is predominantly composed of URLs that prioritize secure communication through the adoption of the HTTPS protocol. The high prevalence of HTTPS usage among the URLs in the dataset is noteworthy and aligns with the broader industry trend towards increased adoption of secure protocols.
Figure 11 illustrates the relationship between the URL “type” and the presence of HTTPS (“HasHTTPS” = 1 indicates HTTPS, “HasHTTPS” = 0 indicates non-HTTPS). The data reveal several key insights.
Benign URLs: the majority (96,457) utilize HTTPS, while a smaller portion (7263) do not. Defacement URLs: the majority (25,756) employ HTTPS, with a smaller number (6764) using non-HTTPS. Malware URLs: a significant number (86,848) are HTTPS-enabled, while a smaller proportion (2396) use non-HTTPS. Phishing URLs: the majority (425,707) utilize HTTPS, while a smaller number (7263) use non-HTTPS.
This analysis highlights the diverse security practices across different URL types. While a significant portion of URLs, regardless of their classification, employ the more secure HTTPS protocol, there is also a notable presence of non-HTTPS usage, particularly for MURL types such as malware and phishing.
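Breakdowns such as those in Figures 8, 9 and 11 reduce to simple value counts and cross-tabulations; a minimal sketch, assuming flag columns named 'abnormal_url' and 'HasHTTPS' (illustrative names):

```python
import pandas as pd

# Counts of normal vs. abnormal URLs (Figure 8 style).
print(df["abnormal_url"].value_counts())

# URL type broken down by the abnormal flag (Figure 9 style).
print(pd.crosstab(df["type"], df["abnormal_url"]))

# URL type broken down by HTTPS usage (Figure 11 style).
print(pd.crosstab(df["type"], df["HasHTTPS"]))
```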
Figure 12 shows the relationship between the URL “type” and “Has Shortening Service” (0 or 1). The key observations are:
1. Benign URLs have a significantly higher count when “Has Shortening Service” is 0, indicating that the majority of benign URLs do not use a URL shortening service.
2. Phishing, defacement, and malware URLs also show a higher count when “Has Shortening Service” is 0, though not as pronounced as for benign URLs.
3. A smaller portion of URLs across the different types have “Has Shortening Service” set to 1, indicating the use of a URL shortening service.
This suggests that URL shortening is less common overall, and benign URLs, in particular, are less likely to utilize a shortening service compared to other URL types.
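For reference, a “Has Shortening Service” flag is typically derived by matching the URL against known shortener domains. The sketch below shows one common way to compute such a feature; the shortener list is an illustrative (incomplete) assumption, not the exact list used in our feature engineering:

```python
import pandas as pd

# Illustrative subset of known URL-shortener domains.
SHORTENERS = r"bit\.ly|goo\.gl|tinyurl\.com|t\.co|ow\.ly|is\.gd|buff\.ly"

df["has_shortening_service"] = (
    df["url"].str.contains(SHORTENERS, case=False, regex=True).astype(int)
)
print(pd.crosstab(df["type"], df["has_shortening_service"]))
```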
The in-depth analyses conducted above provide strong evidence that our feature engineering for this dataset has been effective. The analyses support the quality of the feature engineering in three ways:
1. Comprehensive URL type categorization: The ability to clearly distinguish and quantify different URL types, such as benign, defacement, malware, and phishing, indicates that the feature engineering has captured meaningful and discriminative characteristics that enable accurate URL classification. The granular breakdown of URL types showcases the robustness of the feature engineering process.
2. Alignment with real-world security trends: The observed high prevalence of HTTPS usage among the URLs aligns with the broader industry trend towards increased adoption of secure communication protocols. This alignment between the dataset’s characteristics and real-world security practices validates the relevance and quality of the feature engineering.
3. Potential for targeted security enhancements: The analyses reveal nuanced security behaviors across different URL types, such as the varying HTTPS adoption rates. This level of granularity enables the identification of specific areas for targeted security improvements and the development of more specialized models.
In summary, the comprehensive and insightful analyses presented in the above figures serve as strong evidence that the feature engineering for this dataset has been well designed and is effective. The ability to extract meaningful patterns and security-related characteristics aligns with real-world trends and underscores the quality of the feature engineering process. This, in turn, lays a solid foundation for building robust and reliable models for URL classification, MURL detection, and other security-related applications.
5.2. Performance Analysis of Known-MUDS
This section presents a comprehensive performance analysis of the proposed known-MUDS on known MURLs. Our goal is to evaluate the ability of the learning models to classify instances of the same type as the training data. The results are detailed in Table 7, and the key insights are as follows:
1. Impact of Optuna: The table separates the proposed algorithms into two categories: “Proposed algorithms without Optuna” and “Proposed algorithms with Optuna”. Comparing the performance metrics between these two categories allows us to assess the impact of using Optuna for hyperparameter optimization.
2. Performance improvements: The use of Optuna improves performance across the various metrics for most models. For example, the accuracy of the XGBoost model increases from 96.18% without Optuna to 96.83% with Optuna, and similar improvements appear in its precision, recall, and F1 score. To rigorously compare the proposed algorithms with and without Optuna optimization, we conducted statistical significance testing using the Wilcoxon signed-rank test, a non-parametric test chosen for its ability to handle non-normal distributions and small sample sizes (a sketch of the tuning and testing procedure follows this list). The test was applied across four key performance metrics: accuracy, precision, recall, and F1 score. The results showed statistically significant differences for all metrics, with p-values of 0.043, indicating that Optuna’s optimization process had a meaningful and consistent impact on improving the model performance. These findings validate the effectiveness of Optuna in enhancing the overall efficiency and accuracy of the proposed system.
3. Computational time: The impact of Optuna on computational time is more mixed. While training time generally increases due to the additional optimization process, prediction time can either increase or decrease depending on the model. For the XGBoost model, the prediction time increases from 0.52 microseconds without Optuna to 1.66 microseconds with Optuna, indicating a slight decrease in inference speed.
4. Trade-offs: Using Optuna for hyperparameter optimization can improve accuracy, precision, recall, and F1 score, but this comes at the cost of increased training time, as the optimization process adds computational overhead. For example, the training time of the XGBoost model increases from 3.13 s without Optuna to 8.67 s with Optuna.
5. Top-performing model: The XGBoost model achieves the best performance among all the proposed algorithms, both with and without Optuna. With Optuna, it reaches an accuracy of 96.83%, precision of 96.78%, recall of 96.83%, and F1 score of 96.79%, outperforming all other models.
6. Comparison with state-of-the-art models: Evaluating the XGBoost model against two recent Random Forest-based models on the same dataset demonstrates the advantages of the data preprocessing and hyperparameter tuning techniques employed. XGBoost achieves an accuracy of 96.83%, while the Random Forest models in the literature reach 96.15%. XGBoost also outperforms these models on the other classification metrics, including precision, recall, and F1 score.
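The sketch below makes the tuning and testing procedure concrete: an Optuna study over an XGBoost classifier followed by a Wilcoxon signed-rank test on paired metric scores. It is a minimal sketch; the search space, trial budget, and the variables X_train, y_train, scores_with_optuna, and scores_without_optuna are illustrative assumptions rather than our exact configuration:

```python
import optuna
from scipy.stats import wilcoxon
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    # Illustrative search space; the actual ranges may differ.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBClassifier(**params)
    # X_train, y_train: assumed training split of URL features/labels.
    return cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best hyperparameters:", study.best_params)

# Paired significance test: metric scores of the tuned vs. untuned models
# (e.g., accuracy, precision, recall, F1, collected per model).
stat, p = wilcoxon(scores_with_optuna, scores_without_optuna)
print(f"Wilcoxon signed-rank test: statistic={stat:.3f}, p={p:.3f}")
```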
The performance of the eight models before and after Optuna optimization is shown in Figure 13c and Figure 13d, respectively. The execution time comparison of the five machine learning models before and after Optuna in classifying malicious and benign URLs is presented in Figure 13a, and their training times are presented in Figure 13b.
5.2.1. Model Evaluation and Discussions
To further investigate the behavior of the best model in the known-MUDS, the multi-class classifier, in detecting known MURLs, we conducted an extensive analysis, presented in Figure 14. The confusion matrix depicted in Figure 14a, together with its normalized version, provides a comprehensive overview of the model’s classification performance. Additionally, Figure 14b focuses on the errors in the confusion matrix.
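For reproducibility, raw and normalized confusion matrices like those in Figure 14 can be generated with Scikit-learn; a minimal sketch, assuming y_test and y_pred hold the true and predicted URL types:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

classes = ["benign", "defacement", "malware", "phishing"]

# Raw counts and row-normalized rates (per true class).
cm = confusion_matrix(y_test, y_pred, labels=classes)
cm_norm = confusion_matrix(y_test, y_pred, labels=classes, normalize="true")

ConfusionMatrixDisplay(cm_norm, display_labels=classes).plot(cmap="Blues")
plt.title("Normalized confusion matrix of the multi-class classifier")
plt.show()
```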
Figure 15 illustrates the average feature importance of the top 20 features within our developed framework. Feature importance is a metric that signifies the relative contribution of each feature to predicting the target variable or facilitating classification. Below, we analyze the top five features; a sketch of how such a ranking can be obtained follows the list.
1. count_www: The frequency of ‘www’ in a URL can serve as a distinguishing characteristic for different types of websites or web structures. Its significant importance suggests that the presence or absence of ‘www’ elements in URLs plays a pivotal role in the model’s predictive capacity.
2. count_dir: This feature signifies the complexity or depth of the URL structure based on its directories. Its pronounced importance indicates that the number of directories in a URL is a crucial determinant in the model’s decision-making process.
3. hostname_length: The length of the hostname in a URL offers insights into the URL’s intricacy or origin. Its notable importance implies that the hostname’s length significantly impacts the model’s predictions.
4. fd_length: The length of the first directory in a URL path may reveal specific patterns or categories. Its high importance suggests that this feature strongly influences the model’s predictions.
5. count_http (the number of ‘http’ occurrences within the URL): The count of ‘http’ occurrences in a URL can indicate certain URL types or protocols. Its substantial importance highlights that the presence or frequency of ‘http’ is a critical factor in the model’s performance.
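A ranking such as the one in Figure 15 can be read directly from a fitted tree ensemble; a minimal sketch, assuming model is the tuned XGBoost classifier and feature_names lists the 21 engineered features in training-column order:

```python
import numpy as np

# model: fitted XGBClassifier (assumed); feature_names: list of the
# engineered feature names in training-column order (assumed).
importances = model.feature_importances_
order = np.argsort(importances)[::-1][:20]

for rank, idx in enumerate(order, start=1):
    print(f"{rank:2d}. {feature_names[idx]:<20s} {importances[idx]:.4f}")
```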
Despite the solid performance of the known-MUDS in detecting known MURLs, many other types exist beyond malware, defacement, and phishing. Cybercriminals frequently develop new MURLs for illicit activities. Therefore, we must thoroughly examine the robustness of the known-MUDS models. The following section presents the robustness of the proposed MUDS against unknown MURLs.
5.2.2. Robustness Study of Known-MUDS
While the current focus has been on effectively detecting known MURLs, it is imperative to study the robustness of the established known-MUDS, analyzing how its machine learning models perform in unknown scenarios.
To conduct this crucial robustness study, we simulated zero-day MURLs using various data-splitting techniques. Below, we illustrate the process for simulating “Defacement” as an unknown attack.
1. Filtering the dataset: Filter the dataset to include only the classes ‘benign’, ‘phishing’, and ‘malware’, and store the result in the variable train_df.
2. Extracting ‘benign’ URLs: Extract the URLs corresponding to the ‘benign’ class from train_df and assign them to the variable benign_urls.
3. Random sampling: Randomly sample 96,457 URLs from benign_urls using a specified random state and store the result in test_benign_urls.
4. Creating the test set: Create the test set test_df by selecting rows from the original dataset where the URL is in test_benign_urls or the type is ‘defacement’.
5. Creating the training set: Create the training set by filtering train_df to exclude the URLs present in test_benign_urls.
As illustrated above, a new test set is created by including defacement URLs and an equal number of benign URLs. A corresponding training set is then formed by removing this test set from the original dataset. Learning algorithms are trained on this training set and tested on the unseen test set. For evaluation, the known-MUDS is considered robust if it accurately classifies benign test instances as “benign” and defacement instances as either “phishing” or “malware”. This process is similarly repeated for the “phishing” and “malware” types.
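A minimal sketch of this splitting procedure and the robustness rule follows, assuming df has 'url' and 'type' columns; the sampled benign count equals the size of the held-out class (96,457 for defacement). The helper names are illustrative:

```python
import pandas as pd

MALICIOUS = {"phishing", "defacement", "malware"}

def simulate_unknown(df, unknown="defacement", seed=42):
    """Hold out one malicious class so it is unseen during training."""
    known = (MALICIOUS - {unknown}) | {"benign"}
    train_df = df[df["type"].isin(known)]
    benign_urls = train_df.loc[train_df["type"] == "benign", "url"]
    # Sample as many benign URLs as there are held-out instances.
    test_benign_urls = benign_urls.sample(
        n=(df["type"] == unknown).sum(), random_state=seed
    )
    test_df = df[df["url"].isin(test_benign_urls) | (df["type"] == unknown)]
    train_df = train_df[~train_df["url"].isin(test_benign_urls)]
    return train_df, test_df

def robust_accuracy(y_true, y_pred, unknown="defacement"):
    """Benign must be predicted 'benign'; the unknown class counts as
    correctly handled if it is flagged as any remaining malicious type."""
    known_malicious = MALICIOUS - {unknown}
    hits = [
        p == "benign" if t == "benign" else p in known_malicious
        for t, p in zip(y_true, y_pred)
    ]
    return sum(hits) / len(hits)
```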
The detailed robustness results of the known-MUDS are shown in Table 8. The results reveal that the classifiers’ performance varies across different types of unknown MURLs. Specifically, accuracy remains above 90% for defacement and malware URLs, but the system’s robustness is lower for phishing URLs, with accuracy around 72%. Additionally, the models exhibit minimal improvement when tuned with Optuna. These findings underscore the need for enhancing the performance of the known-MUDS.
Detecting unknown MURLs presents greater challenges compared to known MURLs because classifiers are not trained on these specific class types [41]. Furthermore, unknown MURLs may differ significantly in nature and characteristics from known MURLs, complicating accurate classification.
To address this challenge, further research and development are crucial to enhance the classifiers’ ability to identify and categorize unknown MURLs accurately. In response, we have utilized a combination of supervised and unsupervised algorithms, specifically CL_K-means and biased classifiers, as potential solutions. Details of this approach are presented in the following sections.
5.3. Performance Analysis of Unknown-MUDS
In this section, we evaluate the performance of the CL_K-means_BC approach for detecting unknown MURLs. We compare it against two benchmarks:
1. Known-MUDS: This refers to using the existing MUDS without modifications to handle unknown attacks. For comparison, we use XGBoost, the top-performing model from our robustness study, as the reference for the known-MUDS. The results are averaged across the three types of unknown MURLs.
2. Supervised binary classifiers: We compare CL_K-means_BC with binary classifiers trained to differentiate between benign URLs and MURLs. For this, a test set is created by simulating one type of unknown MURL alongside an equal number of benign instances, and the training set is formed by excluding the test instances from the original dataset and combining all malicious types into one category (a sketch of this construction follows the list). The performance results for the binary classifiers, before and after tuning, are detailed in Table 9 and Table 10, respectively.
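A minimal sketch of the binary benchmark construction, reusing the hypothetical simulate_unknown helper sketched in Section 5.2.2 and binarizing the labels:

```python
# Reuse the illustrative split from Section 5.2.2, holding out one
# malicious type (here 'phishing') as the unknown attack.
train_df, test_df = simulate_unknown(df, unknown="phishing")

# Collapse all malicious types into a single positive class.
y_train = (train_df["type"] != "benign").astype(int)  # 1 = malicious
y_test = (test_df["type"] != "benign").astype(int)
# A binary classifier (e.g., SGDClassifier) is then trained on the
# engineered features of train_df and evaluated on test_df.
```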
Table 9 and Table 10 highlight the impact of hyperparameter optimization on binary classifiers for detecting unknown MURLs.
Before tuning, XGBoost exhibited the highest average accuracy of 87.94%, particularly excelling in defacement (97.05%) and malware (94.74%) detection, but struggled with phishing detection, achieving an accuracy of only 72.03%. Conversely, AdaBoost had an average accuracy of 85.88%, with notable difficulties in phishing detection at 72.53%.
Post-tuning, AdaBoost showed a clear improvement, with AdaBoost_HPO’s average accuracy increasing to 86.55%. XGBoost_HPO achieved an average accuracy of 87.88%, essentially matching its untuned counterpart, with marginally better phishing detection (72.07%) and slightly lower malware detection (94.62%). Other classifiers, such as Random Forest_HPO and Gradient Boosting_HPO, also displayed enhanced results, especially in the phishing and defacement categories, indicating that hyperparameter tuning effectively improved their overall performance.
Notably, the SGD_HPO classifier stood out, with the highest average accuracy of 88.64%. It demonstrated strong precision and recall for phishing detection (95.70%) and maintained robust performance for defacement detection (94.18%) and malware detection (85.68%). Additionally, SGD_HPO exhibited low processing and training times, underscoring its efficiency. These findings suggest that SGD_HPO is the most effective binary method for detecting unknown MURLs, offering high accuracy and reliability across attack types while ensuring efficient computational performance. Consequently, SGD_HPO is used in the subsequent comparisons against the proposed CL_K-means_BC.
After examining the performance of the known-MUDS and supervised binary classifiers for unknown malicious types, we now present the results for the proposed CL_K-means_BC method.
Table 11 summarizes the performance of CL_K-means and CL_K-means_BC, where CL_K-means refers to the proposed method without the two biased classifiers that correct its detection errors.
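To give a flavor of how clustering and biased classifiers can be combined, the sketch below pairs K-means with two deliberately class-weighted classifiers that re-check its decisions. This is an illustrative pattern under assumed inputs (NumPy arrays X_train, X_test and integer labels y_train with 0 = benign, 1 = malicious), not the authors’ exact CL_K-means_BC implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier

# Cluster the training features into two groups and map each cluster
# to the majority training label (0 = benign, 1 = malicious).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
cluster_to_label = {
    c: np.bincount(y_train[kmeans.labels_ == c]).argmax() for c in (0, 1)
}

# Two biased classifiers: one weighted toward catching malicious URLs
# the clustering misses, one weighted toward recovering benign URLs.
bc_malicious = SGDClassifier(class_weight={0: 1, 1: 5}).fit(X_train, y_train)
bc_benign = SGDClassifier(class_weight={0: 5, 1: 1}).fit(X_train, y_train)

# Cluster-based prediction, then correction by the oppositely biased model.
y_km = np.array([cluster_to_label[c] for c in kmeans.predict(X_test)])
y_final = np.where(
    y_km == 0, bc_malicious.predict(X_test), bc_benign.predict(X_test)
)
```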
Table 11 indicates that for phishing URLs, CL_K-means_BC achieved an accuracy of 93.04%, indicating its ability to accurately classify a large portion of phishing instances. The precision score of 86.53% demonstrates its capability to correctly identify and label phishing URLs, minimizing false positives. The recall score of 93.04% highlights its effectiveness in capturing a significant number of phishing URLs from the test set. The F1 score of 89.32% combines precision and recall, indicating the overall performance of CL_K-means_BC in identifying phishing URLs. The processing time for phishing URLs was 14.00 ms, while the training time was 2.64 s.
In the case of defacement URLs, CL_K-means_BC achieved superior performance, with an accuracy of 96.10%. It also exhibits a high precision of 95.78%, indicating its ability to accurately identify defacement URLs. The recall score of 96.08% demonstrates its effectiveness in capturing the majority of defacement URLs from the test set. The F1 score of 96.08% further emphasizes the overall performance of CL_K-means_BC in detecting defacement URLs. The processing time for defacement URLs was 6.00 ms, and the training time was 1.05 s.
For malware URLs, CL_K-means_BC achieved an accuracy of 88.47%, demonstrating its ability to accurately classify malware instances. The precision score of 88.47% indicates its capability to correctly identify and label malware URLs. The recall score of 88.55% highlights its effectiveness in capturing a significant number of malware URLs from the test set. The F1 score of 88.51% combines precision and recall, reflecting the overall performance of CL_K-means_BC in detecting malware URLs. The processing time for malware URLs was 14.00 ms, and the training time was 0.89 s.
Overall, the average performance of CL_K-means_BC across all types of unknown MURLs is substantially better than that of CL_K-means without the biased classifiers. With an average accuracy of 92.56% and an average F1 score of 91.30%, CL_K-means_BC demonstrates its effectiveness in accurately identifying and classifying different types of unknown MURLs. The average processing time of 11.33 ms further emphasizes its suitability for real-time applications, and the training time of 1.53 s indicates that it can be trained within a reasonable time frame.
The performance comparison of the three methods for unknown MURLs (known-MUDS, supervised binary classifiers, and CL_K-means_BC) is summarized in Table 12. The table demonstrates the superior performance of the proposed CL_K-means_BC method across several evaluation metrics. Specifically, CL_K-means_BC achieves an accuracy of 92.54%, precision of 90.26%, recall of 92.56%, and an F1 score of 91.30%. These results surpass both the best multi-class model (XGB) and the best binary model (SGD), which achieved respective accuracies of 88.19% and 88.64%. The recall of CL_K-means_BC also significantly outperforms that of the other models, while its precision remains competitive, highlighting its effectiveness in correctly identifying both positive and negative instances of MURLs.
The higher accuracy of 92.54% achieved by CL_K-means_BC, compared to 88.19% for the best multi-class model (XGB) and 88.64% for the best binary model (SGD), indicates that CL_K-means_BC is more reliable in correctly classifying unknown MURLs. The precision of CL_K-means_BC is 90.26%, higher than that of SGD (89.20%) and close to that of XGB (91.65%), suggesting that the proposed method maintains a low rate of false positives and accurately identifies actual threats. The recall of 92.56% for CL_K-means_BC is also higher than that of XGB (88.19%) and SGD (88.41%), indicating a better capability of detecting true positives and minimizing false negatives.
The processing time (P-Time) and training time (T-Time) of CL_K-means_BC are competitive, at 11.33 ms and 1.53 s, respectively; these figures refer only to the biased classifiers added to CL_K-means. Although the processing time is higher than that of the binary model (SGD) at 0.11 ms, the training time of 1.53 s demonstrates that the proposed method can be feasibly deployed in real-world scenarios, where timely detection is critical. The efficiency of CL_K-means_BC in terms of both processing and training times underscores its practical applicability despite the slight trade-off in processing speed.
The significant improvements in accuracy, precision, recall, and F1 score achieved by CL_K-means_BC validate its robustness and effectiveness in detecting unknown MURLs. These metrics highlight the method’s potential to improve the overall performance and reliability of MURL detection frameworks. By leveraging CL_K-means_BC, we can achieve a highly accurate and efficient unknown-MUDS, making it a promising approach for addressing the ongoing security challenges posed by evolving cyber threats.