Malicious URL Detection with Advanced Machine Learning and Optimization-Supported Deep Learning Models

Türk, Fuat; Kılıçaslan, Mahmut

doi:10.3390/app151810090


by Fuat Türk ¹,* and Mahmut Kılıçaslan ²

¹ Department of Computer Engineering, Faculty of Technology, Gazi University, Ankara 06500, Türkiye
² Department of Statistics, Vocational School of Information Technologies, Ankara University, Ankara 06110, Türkiye
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10090; https://doi.org/10.3390/app151810090

Submission received: 1 August 2025 / Revised: 23 August 2025 / Accepted: 12 September 2025 / Published: 15 September 2025


Abstract

This study presents a comprehensive comparative analysis of machine learning, deep learning, and optimization-based hybrid methods for malicious URL detection on the Malicious Phish dataset. For feature selection and model hyperparameter tuning, the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Harris Hawk Optimizer (HHO) were employed. Both multiclass and binary classification tasks were addressed using classic machine learning algorithms such as LightGBM, XGBoost, and Random Forest, as well as deep learning models including LSTM, CNN, and hybrid CNN+LSTM architectures, with optimization support also integrated into these models. The experimental results reveal that the ELECTRA-based deep learning model achieved outstanding accuracy and F1-scores of up to 99% in both multiclass and binary scenarios. Although optimization-supported hybrid models also improved performance, the language-model-based ELECTRA architecture demonstrated a significant superiority over classical and optimized approaches. The findings indicate that optimization algorithms are effective in feature selection and enhancing model performance, yet next-generation language models clearly set a new benchmark in malicious URL detection.

Keywords: optimization algorithms; malware detection; ELECTRA model; feature selection

1. Introduction

Malware constitutes one of the most significant threats facing the modern information technology ecosystem. Various types of malicious software—such as viruses, trojans, worms, rootkits, and ransomware—have the potential to enable unauthorized system access, cause data theft, disrupt services, and result in severe financial losses. Malware detection and classification have thus become critical research areas within cybersecurity. Malicious URLs are fundamental components of many cyberattacks, including phishing, malware dissemination, and financial fraud, posing serious threats to individuals and organizations. The inadequacy of traditional signature-based and blacklist approaches to detect emerging attack types has increased interest in machine learning (ML) and deep learning (DL)-based methods.

In recent years, a multitude of models have been proposed in the literature for malware and malicious URL detection. The deepBF model by Patgiri et al. integrates a Bloom Filter with deep learning to detect malicious URLs rapidly and resource-efficiently, using a two-dimensional Bloom Filter and evolutionary CNN that maintains system flexibility by updating the filter for new URLs [1]. In 2023, Hilal et al. introduced the AFSADL-MURLC model, combining Glove-based word embeddings, a GRU classifier, and Artificial Fish Swarm Optimization to achieve high detection accuracy, particularly for URLs spread via social engineering [2]. Alsowail (2025) presented an effective approach by modeling spatial relationships among URL features using a self-attention-based CapsNet architecture, achieving an accuracy as high as 99.2% [3]. Zaimi et al. utilized DistilBERT for text-based feature extraction and a hybrid CNN-LSTM model to analyze local and temporal URL patterns, reporting an accuracy of 98.19% [4]. Similarly, Alsaedi et al. designed a multimodal CNN-based model by combining textual and visual features, achieving a 4.3% improvement in performance and a 1.5% reduction in the false positive rate [5].

Machine-learning-based solutions are also widely represented in the literature. Raja et al. performed effective detection using a model based solely on URL structural features with a count vectorizer and Random Forest [6]. Tashtoush et al. employed certain statistical features and a CNN, achieving 94.09% accuracy with low complexity [7]. Gupta et al. (2024) improved CNN hyperparameters with the Brown Bear optimization algorithm, attaining 93% accuracy and 92% precision [8].

Transformer and language-model-based approaches have also attracted attention. For example, the PMANet model adapts a pretrained language model to the URL domain via post-training and leverages layer-wise attention, achieving an AUC of 99.41% [9]. Do et al. addressed the shortcomings of CNNs and RNNs by utilizing a Temporal Convolutional Network (TCN) and Multi-Head Self-Attention, achieving 98.78% accuracy [10].

There has also been progress in parallel processing and real-time detection. In 2023, Nagy et al. used different parallel processing strategies to train ML and DL models, improving speed and accuracy [11]. Lavanya and Shanthi increased host security using URL-API intensity-based feature selection and spectral deep learning, achieving about 96% accuracy [12]. In a 2025 study, a method using only 14 basic features yielded 94.46% accuracy for low-cost phishing detection [13]. Similarly, recent studies have leveraged the most popular ML and DL algorithms, including quantum machine learning, in URL classification [14]. A contemporary study emphasized the failure of traditional methods in detecting malicious Uniform Resource Locators, reporting approximately 98% accuracy using ML and DL on 5000 real-world URLs [15].

The TransURL model by Liu et al. provides an innovative transformer-based solution that integrates multiscale feature learning and regional attention mechanisms, outperforming many existing methods in certain scenarios but facing challenges with computational cost and real-time system integration [16]. A CNN-based model, surpassing blacklisting and heuristic approaches, achieved high F1-scores (98.99%) in experiments on more than 651,000 labeled URLs by capturing local character-level patterns [17]. Su et al. (2023) demonstrated that a BERT-based method could leverage self-attention to learn semantic relationships in URLs and achieve 98–99% accuracy on various datasets, including tests against IoT and DoH attacks [18]. Similarly, Elsadig et al. combined BERT-based feature extraction with a CNN classifier for phishing URL detection, achieving 96.66% accuracy and highlighting the role of NLP features in this domain [19]. Islam et al. used ensemble models such as Random Forest and XGBoost on URL content and metadata, reaching an accuracy of 97% [20]. Hani et al. (2024) compared various ML techniques for malicious URL detection, obtaining successful results but noting the lack of optimization techniques as a limitation [21]. Gupta et al. compared the performance of LSTM, Bi-LSTM, RNN, and CNN models for detecting malicious web addresses [22].

In summary, the literature clearly demonstrates the critical importance of malicious URL detection for cybersecurity. Deep learning and machine learning techniques have shown superior performance compared to traditional systems, providing more adaptable solutions for new threats encountered in real-world scenarios. Future research is expected to focus on real-time processing, minimizing resource consumption, and enhancing resistance to adversarial attacks.

The main contributions of this study to the field of science can be summarized as follows:

Comprehensive comparison of different models: For the first time, machine learning, deep learning, and hybrid optimization-based methods are systematically compared for multiclass and binary classification on the Malicious Phish dataset.
Optimization-based feature selection: GA, PSO, and HHO algorithms are jointly used for feature selection and hyperparameter tuning, resulting in significant improvements in model performance.
Advanced deep learning utilization: The superiority of next-generation language-model-based deep learning approaches, such as ELECTRA, for malicious URL detection is directly demonstrated in comparison to conventional methods.
Analysis of hybrid and optimized models: CNN+LSTM and optimization-supported hybrid models reduce error rates and enhance performance compared to traditional approaches.
Feature contribution analysis: The impact of selected features on model decisions is thoroughly examined, revealing which attributes are most decisive.

2. Materials and Methods

Figure 1 presents the overall workflow of the proposed study. Initially, exploratory data analysis, preprocessing, and feature extraction were performed on the Malicious Phish dataset. Subsequently, various optimization algorithms were applied for feature selection. Using the selected features, machine learning, deep learning, and hybrid methods were employed to perform binary and multiclass classification tasks. The results obtained from the models were compared and analyzed using performance metrics such as accuracy, recall, precision, and F1-score. In the final stage, the prediction results were further analyzed and visualized to comprehensively demonstrate the effectiveness of the models.

In this study, model performances were compared using different classification algorithms, and the proposed hybrid structure was evaluated. The methods utilized are detailed below.

Light Gradient Boosting Machine (LightGBM), developed by Microsoft, is a gradient boosting framework optimized for large datasets and high-dimensional feature spaces. Its histogram-based splitting approach keeps memory usage low and significantly reduces training time, while Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) further reduce computational cost. LightGBM aims to improve accuracy through a leaf-wise growth strategy while also offering depth control to mitigate overfitting [23]. Its objective function has the same regularized form as that of XGBoost:

$$ L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \Omega(f) \tag{1} $$

but LightGBM grows trees leaf-wise rather than level-wise. At each step, the leaf with the maximum loss reduction is expanded, which tends to improve accuracy for a given number of leaves. The gain of a split is computed as

$$ \mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{2} $$

where $G$ and $H$ denote the sums of the first- and second-order gradients of the loss over the samples in the left ($L$) and right ($R$) child nodes, and $\lambda$ and $\gamma$ are regularization parameters. This formulation ensures both computational efficiency and high accuracy.

Another method, Extreme Gradient Boosting (XGBoost), is a highly optimized implementation of gradient boosting that delivers strong performance in both classification and regression problems. By incorporating a regularized objective function (L1 and L2 norms), XGBoost controls overfitting and improves generalization, while parallel computation and pre-sorting of features reduce processing time. The objective function is defined as

$$ L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \tag{3} $$

where $l(y_i, \hat{y}_i)$ is the loss function, and $\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$ is the regularization term penalizing the complexity of tree $f_k$ with $T$ leaves and weights $w$. The additive model update at iteration $t$ is

$$ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \quad f_t \in F, \tag{4} $$

where $F$ represents the functional space of regression trees. This formulation allows XGBoost to achieve superior efficiency and performance in large-scale datasets.
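As a small worked illustration of the split gain in Equation (2), the following sketch evaluates the formula for given gradient sums. The function name `split_gain` and the example numbers are assumptions chosen for illustration; they are not taken from the LightGBM or XGBoost code bases.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following Equation (2).

    g_* and h_* are the sums of first- and second-order gradients of the
    loss over the samples that fall into the left and right child.
    """
    def score(g, h):
        # Structure score of a single node with L2 regularization lam.
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Illustrative numbers: two samples per child, unit Hessians (as for
# squared loss), with gradients of opposite sign on each side.
example = split_gain(g_left=-2.0, h_left=2.0, g_right=2.0, h_right=2.0)
```

In this toy case the parent gradients cancel, so the whole gain comes from the two children; a larger `gamma` would make the same split less attractive.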

Gradient boosting is an ensemble learning method that constructs predictive models in a forward stage-wise manner by sequentially adding weak learners (decision trees), each of which attempts to correct the residual errors of its predecessors [24]. At each iteration, the algorithm fits a new learner to the negative gradient of the loss function with respect to the current model. Formally, the procedure begins with an initial constant model:

$$ F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \tag{5} $$

where $L(y, F(x))$ denotes the chosen loss function. At iteration $m$, pseudo-residuals are calculated as

$$ r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)} \tag{6} $$

and a weak learner $h_m(x)$ is trained to approximate these residuals. The optimal step size $\gamma_m$ is obtained by

$$ \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)) \tag{7} $$

Finally, the model is updated as

$$ F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x) \tag{8} $$

where $\nu \in (0, 1)$ is the learning rate. This iterative optimization ensures gradual improvement of the predictive accuracy while controlling overfitting.
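The procedure in Equations (5) to (8) can be sketched in a few lines for the special case of squared loss on one-dimensional inputs. This is a minimal illustration under stated assumptions, not the implementation used in the study: the weak learner is a one-split regression stump, and the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gradient_boost(xs, ys, n_rounds=50, nu=0.1):
    """Gradient boosting for the loss L = (y - F)^2 / 2."""
    f0 = sum(ys) / len(ys)  # Eq. (5): the mean minimizes squared loss
    preds = [f0] * len(xs)
    learners = []
    for _ in range(n_rounds):
        # Eq. (6): for squared loss the pseudo-residual is y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # Eq. (7): closed-form line search for squared loss.
        denom = sum(v * v for v in hx)
        gamma = sum(r * v for r, v in zip(residuals, hx)) / denom if denom else 0.0
        learners.append((gamma, h))
        # Eq. (8): shrink the step by the learning rate nu.
        preds = [p + nu * gamma * v for p, v in zip(preds, hx)]

    def predict(x):
        return f0 + sum(nu * g * h(x) for g, h in learners)

    return predict
```

For other losses, only the pseudo-residual and the line search change; the stage-wise structure stays the same.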

Random Forest is a powerful ensemble learning technique widely used for classification and regression problems, comprising a collection of decision trees. Each tree is trained on a randomly sampled subset of the training data, and a random subset of features is used at each node split. This reduces correlation among trees, decreases overall model error, and prevents overfitting. In classification problems, predictions from individual trees are combined by majority vote, while in regression, the average of tree outputs is taken. Random Forest performs effectively on high-dimensional data, is robust to missing data and imbalanced class distributions, and provides feature importance scores for variable selection. However, increasing the number of trees can raise computational cost and memory requirements, and interpretability is lower compared to a single decision tree [25,26,27].

Long Short-Term Memory (LSTM) networks are a variant of Recurrent Neural Networks (RNNs) extensively used for sequential data and time-series analysis. Developed to address the problem of learning long-term dependencies in standard RNNs, LSTM cells utilize forget, input, and output gates to dynamically control which information is retained or forgotten. This architecture enables the network to capture both short- and long-term dependencies effectively [28,29,30].
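For reference, one common textbook formulation of the gating mechanism described above is given below. The notation (input $x_t$, hidden state $h_t$, cell state $c_t$, weight matrices $W$ and $U$, biases $b$, logistic sigmoid $\sigma$, and element-wise product $\odot$) is an assumption introduced here for illustration and is not taken from the authors' implementation.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate $f_t$ scales how much of the previous cell state is retained, which is what lets the network carry information across long sequences such as the characters of a URL.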

A Small Convolutional Neural Network (CNN) is a deep learning model designed for automated feature extraction and classification on high-dimensional data such as images or sequential data but with reduced parameters and computational cost. Through convolution and pooling operations between layers, local patterns, edges, and structural details are efficiently captured from input data, while fully connected layers perform classification based on the extracted features. The compact design of small CNNs with fewer layers and filters compared to conventional deep CNN architectures offers low computational overhead and faster processing, making them suitable for embedded systems and resource-constrained environments.

One of the effective hybrid methods proposed is the CNN+LSTM hybrid model. This architecture aims to improve classification performance by learning both spatial and temporal patterns simultaneously. The small CNN component automatically extracts local spatial features from input data, while the LSTM layer processes feature vectors obtained from the CNN in sequential order to learn long-term dependencies. This combined structure delivers superior performance, especially in the analysis of time series or dynamic pattern image sequences. The low computational cost of the small CNN enables faster operation, while the gating mechanisms of the LSTM contribute to efficient integration of past and present information.

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a pretrained language model used in natural language processing that learns by replaced-token detection. During pretraining, a small generator network replaces some tokens of the input text with plausible alternatives, and the main model (the discriminator) predicts for every token whether it is original or replaced. Because the training signal comes from all input tokens rather than only a masked subset, ELECTRA can be trained faster and more efficiently than traditional masked language models, yielding high accuracy in tasks such as text classification. The ability to achieve effective results with lower computational requirements is a major advantage [31].

To optimize model hyperparameters and enhance classification performance, we employed three widely used nature-inspired optimization algorithms—Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Harris Hawk Optimizer (HHO). This selection not only allowed us to utilize algorithms that are popular and benchmarked in the literature but also enabled us to test the capability of an innovative optimizer (HHO) within the context of feature selection for malicious URL detection. Prior research has demonstrated the success of GA and PSO in cybersecurity-related tasks, while HHO has recently gained attention for its ability to avoid premature convergence. Thus, including these three optimizers provides a meaningful balance between established reliability and novelty, ensuring that the feature selection results are both comparable and innovative [32,33,34,35,36].

The GA is a stochastic population-based optimization technique inspired by the principle of natural selection. A set of candidate solutions, referred to as chromosomes, evolves over generations using genetic operations such as selection, crossover, and mutation. The quality of each solution is measured by a fitness function $f(x)$.

Mathematically, given a population $P = \{x_1, x_2, \dots, x_n\}$, the GA updates the population as

$$ P^{(t+1)} = \mathrm{Mutation}(\mathrm{Crossover}(\mathrm{Selection}(P^{(t)}))) \tag{9} $$

where $t$ denotes the current generation. Selection is based on fitness probability:

$$ p_i = \frac{f(x_i)}{\sum_{j=1}^{n} f(x_j)} \tag{10} $$

This ensures that high-quality solutions are more likely to contribute to the next generation [37,38].

The PSO algorithm models the social behavior of bird flocks or fish schools. Each solution, called a particle, has a position vector $x_i$ and velocity vector $v_i$. Particles move in the search space by combining their own best-known position $\mathrm{pbest}_i$ and the global best position $\mathrm{gbest}$.

The update rules are

$$ \begin{aligned} v_i^{(t+1)} &= \omega v_i^{(t)} + c_1 r_1 (\mathrm{pbest}_i - x_i^{(t)}) + c_2 r_2 (\mathrm{gbest} - x_i^{(t)}) \\ x_i^{(t+1)} &= x_i^{(t)} + v_i^{(t+1)} \end{aligned} \tag{11} $$

where $\omega$ is the inertia weight, $c_1$ and $c_2$ are learning factors, and $r_1$ and $r_2$ are random numbers in $[0, 1]$. This mechanism provides both exploration and exploitation of the search space [39].

The HHO algorithm is inspired by the cooperative hunting strategy of Harris hawks. It dynamically balances exploration and exploitation. The position of each hawk is updated based on the prey position $X_{rabbit}$ as

$$ X^{(t+1)} = \begin{cases} X_{rabbit}^{(t)} - E \left| J \cdot X_{rabbit}^{(t)} - X^{(t)} \right| & \text{if } |E| \geq 1 \text{ (exploration)} \\ X_{rabbit}^{(t)} - E \left| X_{rabbit}^{(t)} - X^{(t)} \right| & \text{if } |E| < 1 \text{ (exploitation)} \end{cases} \tag{12} $$

where $E = 2 \left(1 - \frac{t}{T}\right) r - 1$ represents the escaping energy of the prey, $r$ is a random number, $J$ is the jump strength, $t$ is the current iteration, and $T$ is the maximum number of iterations. The adaptive switching makes HHO highly effective in complex, high-dimensional spaces [40].

The main reason for choosing GA, PSO, and HHO in this study is that their diverse search strategies and problem-solving approaches provide both variety and balance in feature selection and hyperparameter optimization. GA offers a robust exploration strategy by mimicking biological evolution and promoting genetic diversity among solutions. PSO, based on social behavior, ensures rapid convergence and low computational cost as each particle learns from its own experience and that of the best neighbors. The HHO algorithm, with its predator–prey-based dynamic exploration and exploitation, excels particularly in complex, high-dimensional spaces. Using all three algorithms together enables more effective feature selection and better generalization of models for various data structures. Algorithms 1, 2 and 3 show the pseudocode for the GA, PSO, and HHO approaches, respectively.

Algorithm 1 Pseudocode of GA.

1: Initialize the population randomly
2: Evaluate the fitness of each individual
3: While the predefined number of generations has not been reached:
4:     Select the best individuals
5:     Apply crossover to generate offspring
6:     Apply mutation to offspring
7:     Evaluate the fitness of the new individuals
8:     Form a new population from the best individuals
9: Return the best individual as the solution
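The steps of Algorithm 1 can be sketched for binary feature masks as follows. This is a minimal illustration under stated assumptions rather than the configuration used in the study: it uses roulette-wheel selection as in Equation (10), single-point crossover, bit-flip mutation, and elitist survivor selection, and both `genetic_feature_selection` and `toy_fitness` are hypothetical names. In practice the fitness would be a validation score of a classifier trained on the selected features.

```python
import random


def genetic_feature_selection(fitness, n_features, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Evolve binary feature masks; `fitness` must return a non-negative score."""
    rng = random.Random(seed)
    # Step 1: random initial population of 0/1 masks (1 = feature kept).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Steps 2 and 4: fitness-proportional (roulette-wheel) selection.
        scores = [fitness(ind) for ind in population]
        weights = scores if sum(scores) > 0 else None
        offspring = []
        while len(offspring) < pop_size:
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            # Step 5: single-point crossover.
            cut = rng.randrange(1, n_features)
            child = parent_a[:cut] + parent_b[cut:]
            # Step 6: bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        # Steps 7 and 8: keep the best individuals among parents and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    # Step 9: the population is sorted, so the first mask is the best one.
    return population[0]


def toy_fitness(mask, informative=(0, 3, 5)):
    """Hypothetical fitness: reward informative features, lightly penalize size."""
    hits = sum(mask[i] for i in informative)
    return max(0.0, hits - 0.1 * sum(mask))


best_mask = genetic_feature_selection(toy_fitness, n_features=8)
```

Keeping parents and offspring in one pool before truncation guarantees that the best mask found so far is never lost between generations.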

Algorithm 2 Pseudocode of PSO.

1: Initialize particles’ positions and velocities randomly
2: Evaluate the fitness of each particle
3: Set each particle’s personal best (pBest) and the global best (gBest)
4: For a predefined number of iterations:
5:     For each particle:
6:         Update velocity based on pBest and gBest
7:         Update position using the new velocity
8:         Evaluate the new fitness
9:         If improved, update pBest
10:     Update gBest if needed
11: Return gBest as the optimal solution
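Algorithm 2 and the update rules in Equation (11) can be sketched for a continuous minimization problem as shown below. The function name `pso_minimize`, the default coefficients ($\omega = 0.7$, $c_1 = c_2 = 1.5$), the velocity limit, and the sphere test function are illustrative assumptions chosen for stable behavior on a toy problem; they are not the settings reported for the experiments.

```python
import random


def pso_minimize(objective, dim, bounds=(-5.0, 5.0), n_particles=20,
                 iterations=100, omega=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    v_max = 0.2 * (hi - lo)  # velocity clamping limit (assumed value)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Eq. (11): inertia + cognitive pull + social pull.
                v = (omega * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                pos[i][d] = max(lo, min(hi, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
        # Update gBest once per sweep, as in Algorithm 2.
        g = min(range(n_particles), key=lambda i: pbest_val[i])
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g][:], pbest_val[g]
    return gbest, gbest_val


def sphere(point):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(x * x for x in point)


best_position, best_value = pso_minimize(sphere, dim=2)
```

For feature selection, a common adaptation is to threshold each coordinate of the position to obtain a binary mask; that variant is not shown here.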

Algorithm 3 Pseudocode of HHO.

1: Initialize the hawk population randomly
2: Evaluate the fitness of each hawk
3: Set the best solution as X_best
4: For a predefined number of iterations:
5:     For each hawk:
6:         Calculate escape energy E
7:         If $|E| \geq 1$: apply exploration strategy
8:         If $|E| < 1$: apply exploitation strategy
9:         Update position accordingly
10:         Evaluate new fitness
11:         Update X_best if improved
12: Return X_best as the best solution
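A simplified sketch of Algorithm 3 with the two-case update of Equation (12) is given below. Several details are assumptions made for illustration: the escaping energy uses the commonly cited form $E = 2 E_0 (1 - t/T)$ with $E_0$ drawn uniformly from $[-1, 1]$, which may differ from the exact expression used in this study; the jump strength is drawn as $J = 2(1 - r)$; and the besiege and Lévy-flight variants of the full HHO algorithm are omitted. The name `hho_minimize` is hypothetical.

```python
import random


def hho_minimize(objective, dim, bounds=(-5.0, 5.0), n_hawks=20,
                 iterations=100, seed=0):
    """Simplified Harris Hawk search following the two cases of Eq. (12)."""
    rng = random.Random(seed)
    lo, hi = bounds
    hawks = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_hawks)]
    values = [objective(h) for h in hawks]
    b = min(range(n_hawks), key=lambda i: values[i])
    rabbit, rabbit_val = hawks[b][:], values[b]  # best solution so far

    for t in range(iterations):
        for i in range(n_hawks):
            # Escaping energy decays linearly with the iteration count (assumed form).
            e0 = rng.uniform(-1.0, 1.0)
            energy = 2.0 * e0 * (1.0 - t / iterations)
            if abs(energy) >= 1.0:
                jump = 2.0 * (1.0 - rng.random())  # exploration: random jump strength
            else:
                jump = 1.0                         # exploitation: J drops out
            new = [rabbit[d] - energy * abs(jump * rabbit[d] - hawks[i][d])
                   for d in range(dim)]
            new = [max(lo, min(hi, x)) for x in new]  # keep inside the box
            hawks[i] = new
            val = objective(new)
            if val < rabbit_val:
                rabbit, rabbit_val = new[:], val
    return rabbit, rabbit_val


def sphere(point):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(x * x for x in point)


best_position, best_value = hho_minimize(sphere, dim=2)
```

Because the energy shrinks over time, early iterations tend to take larger steps around the prey while later ones stay close to it, which is the exploration-to-exploitation switch that the algorithm description refers to.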

3. Experimental Results and Discussion

The Malicious Phish dataset is a widely used and comprehensive resource for cybersecurity and malware detection research, based on the analysis of various URLs. This dataset comprises approximately 650,000 URLs, categorized into four main classes: “benign”, “defacement”, “phishing”, and “malware”. Each URL in the dataset is annotated with detailed textual and numerical features. In particular, characteristics such as domain length, IP address usage, suspicious keywords, and the presence of special characters within the URL play a crucial role in enabling machine learning and deep learning algorithms to identify malicious links. The Malicious Phish dataset is frequently used in academic and industrial research to model current threats and enable early detection of next-generation cyberattacks. Visualizations related to the dataset distribution are provided in Figure 2.

The dataset was partitioned into training–validation (80%) and testing (20%) subsets using a stratified random split with a fixed random seed to preserve class distributions across subsets. Prior to splitting, duplicate entries were removed to ensure data integrity. Furthermore, domain-level isolation was applied, ensuring that URLs originating from the same domain were not distributed across different subsets, thereby preventing potential data leakage. This protocol provides a fair and robust evaluation setting for the proposed models.
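The deduplication and domain-level isolation steps can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the authors' pipeline: it groups URLs by full hostname rather than by registered domain, it omits class stratification for brevity, and the names `hostname_of` and `domain_isolated_split` are hypothetical.

```python
import random
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lower-cased hostname, tolerating URLs without a scheme."""
    full = url if "://" in url else "http://" + url
    try:
        return (urlsplit(full).hostname or "").lower()
    except ValueError:  # malformed URL, e.g. unbalanced IPv6 brackets
        return ""


def domain_isolated_split(urls, test_fraction=0.2, seed=42):
    """Split URLs so that no hostname appears in both subsets."""
    unique_urls = list(dict.fromkeys(urls))  # drop duplicates, keep order
    groups = {}
    for url in unique_urls:
        groups.setdefault(hostname_of(url), []).append(url)

    hosts = sorted(groups)
    random.Random(seed).shuffle(hosts)  # fixed seed for reproducibility

    target = test_fraction * len(unique_urls)
    train, test = [], []
    for host in hosts:
        # Whole hostname groups are assigned together to avoid leakage.
        (test if len(test) < target else train).extend(groups[host])
    return train, test
```

Because whole groups are assigned at once, the realized test share can deviate from the requested fraction when a few hostnames contribute many URLs.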

The available features in the dataset are as follows: ‘use_of_ip’, ‘abnormal_url’, ‘count.’, ‘count-www’, ‘count@’, ‘count_dir’, ‘count_embed_domain’, ‘short_url’, ‘count-https’, ‘count-http’, ‘count%’, ‘count-’, ‘count=’, ‘url_length’, ‘hostname_length’, ‘sus_url’, ‘fd_length’, ‘tld_length’, ‘count-digits’, and ‘count-letters’. The features selected by the GA, PSO, and HHO optimization algorithms are presented in Table 1.
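To make the lexical features more concrete, the sketch below computes a subset of them from a raw URL string. The definitions are inferred from the feature names and are therefore assumptions; for example, ‘count-http’ is taken to be the number of occurrences of the substring `http`, which also counts `https`. Features that need keyword lists or lookup tables (such as ‘short_url’ and ‘sus_url’) are omitted, and `lexical_features` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit

IPV4_PATTERN = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")


def lexical_features(url):
    """Compute simple count- and length-based features from a URL string."""
    full = url if "://" in url else "http://" + url
    try:
        parts = urlsplit(full)
        hostname, path = parts.hostname or "", parts.path
    except ValueError:  # malformed URL
        hostname, path = "", ""
    return {
        "use_of_ip": int(IPV4_PATTERN.fullmatch(hostname) is not None),
        "count.": url.count("."),
        "count-www": url.count("www"),
        "count@": url.count("@"),
        "count_dir": path.count("/"),
        "count-https": url.count("https"),
        "count-http": url.count("http"),
        "count%": url.count("%"),
        "count-": url.count("-"),
        "count=": url.count("="),
        "url_length": len(url),
        "hostname_length": len(hostname),
        "count-digits": sum(c.isdigit() for c in url),
        "count-letters": sum(c.isalpha() for c in url),
    }


features = lexical_features("http://192.168.0.1/login.php?id=7")
```

A URL that uses a raw IPv4 address as its host, as in the example call, sets `use_of_ip` to 1, which is one of the signals commonly associated with malicious links.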

In cases where GA, PSO, and HHO selected overlapping but not identical feature subsets (as shown in Table 1), we adopted an intersection–union strategy. Specifically, we constructed a consensus feature set by (i) prioritizing features selected by at least two optimizers and (ii) including additional unique features only if they contributed significantly to validation performance. This approach ensured that the final feature set was both compact and robust, minimizing redundancy while retaining discriminative power. Moreover, feature selection results demonstrated that lexical and entropy-based features (e.g., URL length, the number of special characters, and randomness) were consistently prioritized across GA, PSO, and HHO, highlighting their strong discriminative power for detecting malicious obfuscation. Additionally, contextual attributes such as domain age and WHOIS records were often selected by GA and HHO, reflecting their importance in capturing the temporal and reliability aspects of domains. HHO further identified less common but critical content-based indicators, such as iframe or script intensity, which align with known attack behaviors. These findings indicate that optimization algorithms not only improve model performance but also uncover domain-relevant features that are strongly linked to malicious URL characteristics.
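The consensus rule described above (keep features chosen by at least two optimizers, then admit unique features only if they help on validation data) can be sketched as a simple voting procedure. The feature subsets in the example are illustrative and are not the contents of Table 1; `consensus_features` and the `improves_validation` callback are hypothetical names.

```python
from collections import Counter


def consensus_features(selections, min_votes=2, improves_validation=None):
    """Combine per-optimizer feature subsets into one consensus set.

    selections: dict mapping optimizer name -> iterable of feature names.
    improves_validation: optional callable (current_set, feature) -> bool
        deciding whether a feature with fewer votes should still be added.
    """
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    # Rule (i): features selected by at least `min_votes` optimizers.
    consensus = {f for f, n in votes.items() if n >= min_votes}
    # Rule (ii): unique features are added only if they help on validation.
    if improves_validation is not None:
        for feature in sorted(f for f, n in votes.items() if n < min_votes):
            if improves_validation(consensus, feature):
                consensus.add(feature)
    return sorted(consensus)


# Illustrative subsets only (not the actual Table 1 contents).
example_selections = {
    "GA": ["count_embed_domain", "count-https", "count-letters", "short_url", "count@"],
    "PSO": ["count_embed_domain", "count-https", "count-letters"],
    "HHO": ["count_embed_domain", "count-https", "count-letters", "count@", "count%"],
}
selected = consensus_features(example_selections)
```

With these illustrative inputs, the three shared features and ‘count@’ pass the two-vote threshold, while the features chosen by a single optimizer are left out unless the validation callback accepts them.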

In this study, the features selected by three different algorithms are compared. As seen, the GA, PSO, and HHO algorithms selected some features in common (e.g., ‘count_embed_domain’, ‘count-https’, and ‘count-letters’) while differing on others. For instance, the ‘short_url’ feature was selected only by GA, whereas ‘count@’ was among the features chosen by both GA and HHO. Notably, the HHO algorithm identified a greater number of features compared to the others, whereas PSO focused on a more limited subset. These differences indicate that each algorithm’s feature selection strategies and evaluation criteria exhibit diversity depending on the dataset’s structure. Therefore, employing multiple algorithms together offers variety and flexibility in determining the most effective features for the dataset.

In this study, we employed the ELECTRA-Base variant (12 layers, 768 hidden units, and 12 attention heads), which balances computational efficiency and representational capacity, and used the WordPiece tokenizer with case-insensitive settings and a maximum sequence length of 128 tokens to ensure that URL structures such as ‘http’ and ‘www’, domain endings, and subdomain fragments were preserved. The model was fine-tuned end to end, with all layers updated instead of freezing pretrained parameters. Training was conducted for 10 epochs using the AdamW optimizer with a learning rate of $2 \times 10^{-5}$, a batch size of 32, a weight decay of 0.01, and a dropout rate of 0.1. Pre-processing involved stripping URL protocols (e.g., ‘http://’), removing dynamic query strings to reduce noise, normalizing percent-encoded characters (e.g., ‘%20’) into their ASCII equivalents, and deduplicating identical samples to avoid data leakage, while post-processing ensured consistent token representation across the dataset. This configuration was adopted to guarantee reproducibility and transparency while enabling robust fine-tuning of ELECTRA for the classification task.
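The pre-processing steps listed above can be sketched with the Python standard library. This is a minimal illustration, assuming that the query string is everything after the first `?` and that percent-decoding is applied last; `preprocess_url` and `preprocess_dataset` are hypothetical helper names, and the tokenization step itself is not shown.

```python
from urllib.parse import unquote


def preprocess_url(url):
    """Strip the protocol, drop the query string, and decode percent-escapes."""
    url = url.strip()
    if "://" in url:
        url = url.split("://", 1)[1]  # remove 'http://', 'https://', ...
    url = url.split("?", 1)[0]        # remove the dynamic query string
    return unquote(url)               # e.g. '%20' becomes a space


def preprocess_dataset(urls):
    """Apply preprocess_url to every URL and drop exact duplicates in order."""
    return list(dict.fromkeys(preprocess_url(u) for u in urls))


cleaned = preprocess_dataset([
    "http://example.com/a%20b?x=1",
    "https://example.com/a%20b?x=2",
    "example.com/other",
])
```

In a labeled dataset, deduplication would normally be applied to (URL, label) pairs, so that conflicting labels for the same cleaned URL can be detected rather than silently merged.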

The search spaces, population sizes, and iteration budgets of the metaheuristic algorithms used are explicitly defined. For feature selection and hyperparameter optimization, GA, PSO, and HHO were applied. For GA, the population size was set to 50 with a maximum of 100 iterations, and chromosomes were represented in a binary format covering 30 features. In PSO, the population size was set to 40 with 80 iterations, and particles were updated under velocity clamping with learning coefficients ($c_1 = 2.0$, $c_2 = 2.0$). For the HHO algorithm, the population size was configured as 30 with 70 iterations, where the exploration and exploitation phases were dynamically balanced. The search spaces for the model hyperparameters were defined as follows: learning rate (0.0001–0.01), batch size (16–128), dropout rate (0.1–0.5), and maximum depth (3–12) for tree-based models. These parameter ranges are consistent with widely adopted optimization limits in the literature and were determined considering the size of the Malicious Phish dataset and the available computational resources. Thus, the performance of the algorithms in both feature selection and hyperparameter optimization was evaluated on a comparable basis.
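One way a metaheuristic can work with these ranges is to search in the unit hypercube and decode each position into concrete hyperparameters. The sketch below illustrates that idea; the decoding scheme (logarithmic scaling for the learning rate, rounding for integer parameters) and the names `SEARCH_SPACE` and `decode` are assumptions made for illustration and are not described in the study.

```python
import math

# Ranges as stated in the text; the scale of each parameter is an assumption.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-2, "log"),
    "batch_size": (16, 128, "int"),
    "dropout": (0.1, 0.5, "linear"),
    "max_depth": (3, 12, "int"),
}


def decode(position):
    """Map a position in [0, 1]^4 to a dictionary of hyperparameters."""
    params = {}
    for u, (name, (lo, hi, scale)) in zip(position, SEARCH_SPACE.items()):
        u = min(1.0, max(0.0, u))  # clip coordinates that left the unit box
        if scale == "log":
            # Interpolate in log space so small learning rates are well covered.
            params[name] = math.exp(math.log(lo) + u * (math.log(hi) - math.log(lo)))
        elif scale == "int":
            params[name] = int(round(lo + u * (hi - lo)))
        else:
            params[name] = lo + u * (hi - lo)
    return params


midpoint = decode([0.5, 0.5, 0.5, 0.5])
```

Under this scheme the midpoint of the learning-rate coordinate corresponds to the geometric mean of the range, 0.001, rather than the arithmetic mean.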

When the results presented in Table 2 are analyzed, it becomes clear that both machine learning and deep learning approaches deliver highly competitive performances for multiclass malicious URL classification. Among machine learning algorithms, RF emerged as the strongest performer with an accuracy of 97% and an F1-score of 0.95, highlighting its robustness in handling feature variability within the dataset. Similarly, LGBM and XGB achieved accuracy rates of 96%, accompanied by balanced precision and recall values (0.95–0.96), demonstrating their strong capability to generalize across different malicious URL categories. In contrast, GB produced slightly lower results (94% accuracy and 0.92 F1-score), suggesting that while effective, it may be more sensitive to parameter tuning compared to the other ensemble-based methods. For deep learning models, both LSTM and CNN individually demonstrated competitive performance, each achieving around 95% accuracy, although CNN showed slightly lower recall (0.91). The hybrid CNN+LSTM model without optimization provided a modest improvement (95.6% accuracy), indicating that combining convolutional layers for feature extraction with sequential layers for temporal dependencies can enhance classification performance. However, optimization-based enhancements led to varied outcomes. For instance, while HHO+CNN+LSTM achieved the highest performance among optimization-supported models (95.7% accuracy and F1-score 0.92), the PSO+CNN+LSTM combination performed poorly (81% accuracy and F1-score 0.70). This suggests that the effectiveness of meta-heuristic optimization is highly dependent on the alignment between the algorithm’s search dynamics and the underlying data distribution. The underperformance of PSO may indicate premature convergence or insufficient exploration of the hyperparameter space for this dataset. 
Such premature convergence to local optima is a known risk for PSO, especially in high-dimensional feature spaces like that of the malicious URL dataset.

Finally, the ELECTRA model clearly outperformed all other methods, achieving 99% accuracy, precision, recall, and F1-score. This remarkable performance highlights the superiority of transformer-based pretrained language models in malicious URL detection. Unlike traditional ML and DL models, ELECTRA benefits from contextual understanding and deep semantic representation, which allows it to detect subtle differences in URL patterns more effectively. These results confirm that while classical machine learning and CNN/LSTM-based architectures remain highly competitive, transformer-based models offer a distinct advantage, especially in large-scale and complex text-based cybersecurity tasks.

The confusion matrices in Figure 3 detail the successes and errors of the models across the four classes: benign, defacement, phishing, and malware. The Random Forest (RF) model correctly classified 42,350 benign samples but misclassified 3632 benign samples as defacement; within the malware class, it made 7784 correct classifications but labeled 1239 malware samples as benign. With the CNN+LSTM model, misclassifications in the benign and defacement classes decreased significantly, and high accuracy was achieved for phishing and malware examples. Notably, the CNN+LSTM model misclassified only 651 benign samples while classifying 42,275 correctly. The CNN+LSTM model optimized with HHO achieved higher correct classification counts in the phishing and defacement classes, with 9255 and 2957 accurate predictions, respectively; mislabeling in the benign and malware classes also decreased.

The ELECTRA-based model clearly outperformed the other models. Only 5 of 85,311 benign samples were misclassified into another class. Almost all samples in the defacement and phishing categories were also correctly classified, and in the malware class only 19 of 6310 samples were mislabeled. The ELECTRA model thus exhibited consistently high true classification rates and extremely low misclassification rates across all four classes. Overall, while classical machine learning and basic deep learning models showed relatively higher misclassification rates in dominant classes such as benign and defacement, language-model-based approaches like ELECTRA minimized errors for both minority and majority classes, yielding a much more balanced and higher overall performance. This explains why advanced models are preferred, especially in cybersecurity problems that require multiclass classification and the handling of imbalanced data distributions.
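The ELECTRA counts quoted above can be turned into per-class recall with a few lines of arithmetic. The sketch below uses only the two pairs of numbers stated in the text (benign and malware); the helper name `recall_from_errors` is hypothetical and is not taken from the authors' repository.

```python
def recall_from_errors(total: int, misclassified: int) -> float:
    """Recall of one class, given its sample count and its number of errors."""
    if total <= 0 or not 0 <= misclassified <= total:
        raise ValueError("counts must satisfy 0 <= misclassified <= total, total > 0")
    return (total - misclassified) / total


# (total samples, misclassified samples) as quoted in the text for ELECTRA.
reported = {
    "benign": (85_311, 5),
    "malware": (6_310, 19),
}

per_class_recall = {
    label: recall_from_errors(total, errors)
    for label, (total, errors) in reported.items()
}

for label, value in per_class_recall.items():
    print(f"{label}: recall = {value:.4f}")
```

Under these counts the benign recall is above 0.9999 and the malware recall is about 0.997, which is consistent with the 0.99 reported for ELECTRA in Table 2.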

In Figure 4, the distribution of feature importance values for the attributes selected by the HHO algorithm is represented alongside the corresponding feature names. Upon examination, it is evident that ‘count-letters’ (Feature 19), ‘count%’ (Feature 10), and ‘count-www’ (Feature 3) contribute most significantly to the model’s success in detecting malicious URLs. In other words, the number of letters within a URL, the use of the percent (%) character, and the frequency of ‘www’ are highly decisive in distinguishing malicious links. Additionally, features such as ‘count_embed_domain’ (Feature 6), ‘count@’ (Feature 4), and ‘count-https’ (Feature 8) also possess high importance and play a critical role in the model’s decision-making process. Conversely, attributes like ‘tld_length’ (Feature 17), ‘count-’ (Feature 11), and ‘fd_length’ (Feature 16) hold relatively lower importance for the model. These findings indicate that URL characteristics—such as the count of letters, special characters, and keywords—are prominent determinants in identifying malicious content and that the HHO algorithm effectively highlights such informative features. As a result, features with unnecessary or low informational value are eliminated, enhancing the effectiveness and accuracy of the model.
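To make the count-based features concrete, the following minimal sketch computes a few of them from a raw URL string. The feature names follow Table 1, but the exact definitions used in the authors' preprocessing are not given in this section, so the counting rules below (for example, plain substring counts) are assumptions, and the function name `lexical_features` is hypothetical.

```python
def lexical_features(url: str) -> dict[str, int]:
    """Count-based lexical features of a URL (assumed definitions)."""
    return {
        "count-letters": sum(ch.isalpha() for ch in url),
        "count-digits": sum(ch.isdigit() for ch in url),
        "count%": url.count("%"),
        "count-www": url.count("www"),
        "count@": url.count("@"),
        "count-https": url.count("https"),
        # Substring count: every "https" also contributes one "http".
        "count-http": url.count("http"),
        "url_length": len(url),
    }


# Example URL taken from Table 5.
features = lexical_features("http://www.avclub.com/content/node/24539")
for name, value in features.items():
    print(f"{name}: {value}")
```

Because each feature is a simple count over the string, the extraction cost is linear in the URL length, which is one reason such features are attractive as inputs to the classical models.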

Table 3 compares the performance metrics for binary classification of malicious and benign URLs using machine learning, deep learning, and optimization-supported hybrid models. Among the traditional machine learning approaches, RF achieved the highest overall performance, with an accuracy of 0.95, precision of 0.96, recall of 0.95, and F1-score of 0.95. This result highlights RF's robustness in handling the feature set of the Malicious Phish dataset, likely due to its ensemble nature and ability to reduce variance. XGBoost also performed strongly, with an accuracy of 0.94 and an F1-score of 0.94, which is consistent with its reputation for efficient gradient boosting and strong generalization capability. LGBM followed closely with an accuracy of 0.93, demonstrating that boosting-based models provide a competitive edge in detecting malicious URLs.

Deep learning models also performed competitively. The LSTM model reached an accuracy of 0.95 and the CNN model 0.94, indicating that both recurrent and convolutional structures are capable of learning relevant URL patterns. However, the hybrid CNN+LSTM model, contrary to expectations, performed considerably worse (accuracy = 0.84, F1 = 0.80). This suggests that combining convolutional and recurrent layers without careful optimization may introduce complexity without yielding additional representational benefits. A likely explanation is that feature redundancy or overfitting negatively impacted its generalization ability. When metaheuristic optimization algorithms were applied to the hybrid deep learning models, moderate improvements over the basic CNN+LSTM were observed. Both PSO+CNN+LSTM and HHO+CNN+LSTM achieved similar performance levels (accuracy ≈ 0.90, F1 ≈ 0.89), demonstrating the potential of optimization methods in enhancing feature selection and hyperparameter tuning. However, GA+CNN+LSTM performed the worst among these, with an accuracy of only 0.81 and an F1-score of 0.76. This result indicates that the effectiveness of metaheuristic optimization is highly algorithm-dependent; while PSO and HHO could better balance exploration and exploitation during the search process, GA may have failed to converge effectively in the given parameter space.

Finally, the ELECTRA-based transformer model substantially outperformed all other approaches, achieving 0.99 across all metrics (accuracy, precision, recall, and F1-score). This result underscores the strength of pretrained language models in capturing semantic and structural patterns within URLs. Unlike classical ML and deep learning models that rely heavily on handcrafted feature representations or limited training, ELECTRA leverages large-scale pretraining and contextual embeddings, enabling it to generalize better and achieve near-perfect classification. In summary, while classical ML and optimized deep learning models provided solid baselines, ELECTRA demonstrated superior scalability and reliability. These results confirm that advanced transformer-based architectures are significantly more effective in malicious URL detection tasks than both ensemble machine learning and conventional neural networks.

In Figure 5, the binary classification results of the four best-performing algorithms are visualized using confusion matrices. The RF and LSTM models achieved high true classification rates in the benign class, but a portion of the malicious examples were incorrectly labeled as benign. The CNN+LSTM model optimized with HHO, on the other hand, correctly classified 81,787 benign samples but misclassified 3991 benign instances as malicious, and in the malicious class, 8929 samples were incorrectly classified; this indicates that the misclassification rate for this model is somewhat higher than for the others. The ELECTRA-based deep language model, however, produced much more balanced and highly accurate results. This figure clearly demonstrates that the ELECTRA model keeps errors to a minimum in both classes, exhibiting superior discriminatory performance in the binary classification task compared to all other methods. In conclusion, while classical ML and basic DL approaches yield moderate results regarding misclassifications, advanced language-model-based methods stand out with high overall accuracy and low error rates, especially on large and imbalanced datasets.
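The HHO+CNN+LSTM counts quoted above are enough to derive two rates that matter operationally: the share of benign URLs that raise a false alarm, and the share of URLs passed as benign that are in fact malicious. The sketch below assumes that the 8929 misclassified malicious samples were all labeled benign, which follows from the task being binary; the variable names are illustrative.

```python
# Counts quoted in the text for the binary HHO+CNN+LSTM confusion matrix.
benign_correct = 81_787       # benign predicted as benign
benign_as_malicious = 3_991   # benign predicted as malicious (false alarms)
malicious_as_benign = 8_929   # malicious predicted as benign (missed threats)

benign_total = benign_correct + benign_as_malicious
predicted_benign = benign_correct + malicious_as_benign

# Share of benign URLs that are recognized as benign.
benign_recall = benign_correct / benign_total
# Share of benign URLs that trigger a false alarm.
false_alarm_rate = benign_as_malicious / benign_total
# Share of "benign" verdicts that actually hide a malicious URL.
false_omission_rate = malicious_as_benign / predicted_benign

print(f"benign recall:       {benign_recall:.4f}")
print(f"false alarm rate:    {false_alarm_rate:.4f}")
print(f"false omission rate: {false_omission_rate:.4f}")
```

With these counts, roughly 4.7% of benign URLs are flagged and roughly 9.8% of the URLs passed as benign are malicious, which illustrates why missed malicious URLs are the more serious error for this model.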

Table 4 reports the F1-scores and approximate confidence intervals of the best-performing algorithms for the binary and multiclass classification problems. In the binary task, the Random Forest and LSTM models achieved comparable performance (F1 = 0.95 and 0.94, respectively). Their approximate confidence intervals, [0.93–0.97] for Random Forest and [0.92–0.96] for LSTM, overlap substantially, which suggests that their performances are statistically similar. In contrast, the ELECTRA model achieved a substantially higher F1-score of 0.99, with a confidence interval of [0.98–1.00] that lies entirely above those of the other two models. This highlights the superior and consistent effectiveness of language-model-based approaches in binary classification tasks. For the multiclass scenario, LightGBM and XGBoost both yielded F1-scores of 0.94, with estimated confidence intervals of [0.92–0.96], which suggests no significant performance difference between the two models. Once again, the ELECTRA model stood out with an F1-score of 0.99, clearly surpassing the traditional machine learning algorithms. The findings indicate that while classical machine learning and conventional deep learning methods achieve satisfactory results, transformer-based models such as ELECTRA provide more reliable and robust performance in both binary and multiclass classification tasks. This suggests that transformer-based approaches are likely to play a more prominent role in the future of malicious URL detection and similar cybersecurity applications.
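One common way to obtain an approximate interval of this kind is a percentile bootstrap over the test set. Whether the intervals in Table 4 were computed this way is not stated here, so the following is only a sketch under that assumption; the function names and the small synthetic label vectors are illustrative and do not come from the study.

```python
import random


def f1_binary(y_true, y_pred, positive=1):
    """F1-score of the positive class from two equally long label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (resampling test items with replacement)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_binary([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(alpha / 2 * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Synthetic labels for illustration only: 1 = malicious, 0 = benign.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 47 + [0] * 3 + [1] * 2 + [0] * 48

point = f1_binary(y_true, y_pred)
lower, upper = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, approx. 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Two models whose intervals overlap heavily, as RF and LSTM do in Table 4, cannot be separated on this evidence alone; a paired test on the same resamples would be a more direct comparison.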

Table 5 illustrates sample predictions and misclassification cases for different models (ELECTRA, HHO+CNN+LSTM, and Random Forest) under the multiclass and binary classification scenarios. The ELECTRA model produced mostly accurate predictions and achieved high precision, especially in the phishing and benign classes. However, it did misclassify, for example, the URL transit-port.net/AI.CogSci.Robotics/robotics.html, which is benign, as phishing. The HHO+CNN+LSTM and Random Forest models also performed well overall but occasionally produced false positives in the benign class, such as labeling a benign URL as phishing or malware. Similarly, in the binary classification scenario, the ELECTRA model correctly identified most benign URLs, although in some cases it mislabeled malicious URLs as benign. Overall, this table shows that the ELECTRA model consistently has the lowest error rate; however, no model is entirely flawless, and there remains potential for misclassification, particularly with the diverse URL structures encountered in real-world data. It is also notable that most misclassified examples were labeled as benign, which highlights the importance of minimizing false negatives in cybersecurity applications.

Table 6 compares the proposed approach against several state-of-the-art methods for both the multiclass (Scenario-1) and binary (Scenario-2) classification tasks. In the multiclass scenario, the proposed ELECTRA-based model achieved highly competitive results, with accuracy, precision, recall, and F1-score all reported at 0.99. This performance underscores the robustness of the ELECTRA architecture, which effectively balances false positives and false negatives. Compared with other multiclass approaches in the literature, such as ref. [4] (Scenario-1) with 96.04% accuracy and 93.97% F1, ref. [18] (Scenario-1) with 98.78% accuracy, and ref. [20] (Scenario-1) with 97% accuracy, the proposed model demonstrates superior performance, positioning it as a state-of-the-art solution in multiclass classification.

In contrast, the binary classification task (Scenario-2) employed a hybrid HHO+CNN+LSTM model, which obtained an accuracy of 0.904, precision of 0.90, recall of 0.87, and F1-score of 0.89. Although these results confirm the adaptability of our optimization-driven pipeline, they fall short of stronger binary baselines. For example, ref. [4] (Scenario-2) reported 98.19% accuracy and 97.26% F1, ref. [17] (Scenario-2) achieved 99.26% precision, 98.73% recall, and 98.99% F1, and ref. [41] (Scenario-2) outperformed all other methods with 99.82% accuracy, supported by its integration of n-gram features, CNN, BiLSTM, and attention mechanisms. The high performance of these approaches is largely due to the use of more complex and computationally intensive architectures, which enable deeper feature representation but at the cost of increased training and inference complexity.
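Table 6 reports some metrics as fractions (for example 0.99) and others as percentages (for example 96.04), so the values need to be put on one scale before they are ranked. The sketch below does this for the accuracy row; the rule that any value above 1 is a percentage is an assumption made for illustration, and `to_fraction` is a hypothetical helper.

```python
def to_fraction(value: float) -> float:
    """Map a metric to the 0-1 scale, treating values above 1 as percentages.

    Assumption: no fractional metric exceeds 1, so the threshold is unambiguous
    except for a value of exactly 1, which is left unchanged.
    """
    return value / 100 if value > 1 else value


# Accuracy row of Table 6, copied as printed (mixed scales).
accuracy_as_printed = {
    "Proposed (Scenario-1)": 0.99,
    "Proposed (Scenario-2)": 0.904,
    "Ref. [4] (Scenario-1)": 96.04,
    "Ref. [4] (Scenario-2)": 98.19,
    "Ref. [18] (Scenario-1)": 98.78,
    "Ref. [20] (Scenario-1)": 97.0,
    "Ref. [41] (Scenario-2)": 0.9982,
}

accuracy = {name: to_fraction(v) for name, v in accuracy_as_printed.items()}
ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.4f}")
```

The ranking mixes the two scenarios only to show the normalization; a like-for-like comparison should be made within each scenario, as in the discussion above.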

The findings highlight two important aspects. First, the proposed method delivers state-of-the-art performance in multiclass classification, surpassing most competing methods in the literature. Second, while the binary scenario results are competitive, they remain below those of heavily engineered or ensemble-based methods such as ref. [41]. This trade-off emphasizes the efficiency–performance balance of our approach: ELECTRA achieves near-ceiling performance with a relatively streamlined architecture in the multiclass setting, while the binary model offers a resource-efficient yet flexible framework. Future work will focus on enhancing the binary case by incorporating more powerful representation learning techniques, such as transformer fine-tuning or calibrated ensembling, to bridge the performance gap with highly complex architectures.

4. Conclusions and Future Works

In this study, various ML, DL, and optimization-based hybrid classification approaches were comprehensively compared using the Malicious Phish dataset. The results revealed that classical machine learning methods, such as Random Forest and XGBoost, achieved high accuracy and F1-scores (95–97%). In contrast, deep learning methods offered more stable results, particularly in complex and multiclass scenarios. Hybrid models supported by CNN+LSTM and optimization algorithms reduced error rates in certain metrics and improved overall performance compared to traditional approaches. Nonetheless, the ELECTRA-based deep language model significantly outperformed all competitors in both binary and multiclass classification, achieving 99% accuracy and F1-scores. This finding highlights the superiority of next-generation language-model-based approaches for malicious URL detection. Although computational efficiency was not the main focus of this study, it is important to note that ELECTRA required longer training and inference times compared to lighter ML and DL models. This trade-off reflects the increased complexity of transformer-based architectures. Nevertheless, the high predictive performance of ELECTRA demonstrates that even with higher computational costs, advanced language models provide a practical and highly accurate framework for malicious URL detection.

Future work will focus on several key directions. First, integrating the proposed models into real-time cybersecurity applications is planned, particularly by testing inference times and scalability in operational environments. Second, we aim to investigate adversarial robustness by evaluating the resilience of ELECTRA and hybrid models against adversarial attacks, which are increasingly relevant in cybersecurity contexts. Third, low-resource deployment will be explored by optimizing models for lightweight environments, ensuring applicability in mobile devices or IoT-based security systems. Additionally, further research will include cross-validation with diverse and up-to-date datasets, development of more advanced optimization algorithms, and efforts to reduce computational cost while maintaining high interpretability. In conclusion, this study demonstrates that the combined use of optimization algorithms and advanced deep language models provides significant advantages in malicious URL detection. The results suggest that ELECTRA and similar transformer-based architectures have the potential to establish a new benchmark in this domain, balancing predictive performance with adaptability to future challenges such as adversarial robustness and resource-constrained deployment.

5. Availability of Data and Materials

The source code developed and used in this study has been made publicly available to ensure reproducibility and transparency. All scripts, model configurations, and optimization procedures (GA, PSO, and HHO implementations) can be accessed via our GitHub repository: https://github.com/mahmutkilicaslan-ankara/Network_Malicious_Detection. Due to licensing and ethical considerations, the raw dataset cannot be shared directly; however, the preprocessing scripts and detailed documentation provided in the repository allow researchers to replicate the experiments using the Malicious Phish dataset, which is openly available.

Author Contributions

Conceptualization, F.T. and M.K.; methodology, F.T.; software, F.T.; validation, F.T. and M.K.; formal analysis, M.K.; investigation, F.T.; resources, M.K.; data curation, M.K.; writing—original draft preparation, F.T.; writing—review and editing, M.K.; visualization, F.T.; supervision, M.K.; project administration, F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study is a public dataset. It can be found at https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset (accessed on 14 January 2025).

Acknowledgments

The authors would like to thank the reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Patgiri, R.; Biswas, A.; Nayak, S. deepBF: Malicious URL detection using learned bloom filter and evolutionary deep learning. Comput. Commun. 2023, 200, 30–41.
[2] Hilal, A.M.; Hashim, A.H.A.; Mohamed, H.G.; Nour, M.K.; Asiri, M.M.; Al-Sharafi, A.M.; Othman, M.; Motwakel, A. Malicious url classification using artificial fish swarm optimization and deep learning. Comput. Mater. Contin. 2023, 74, 607–621.
[3] Alsowail, R.A. Anomaly detection based capsnet for malicious Url detection system. Wirel. Netw. 2025, 31, 3785–3801.
[4] Zaimi, R.; Safi Eljil, K.; Hafidi, M.; Lamia, M.; Nait-Abdesselam, F. An enhanced mechanism for malicious URL detection using deep learning and DistilBERT-based feature extraction. J. Supercomput. 2025, 81, 438.
[5] Alsaedi, M.; Ghaleb, F.A.; Saeed, F.; Ahmad, J.; Alasli, M. Multi-modal features representation-based convolutional neural network model for malicious website detection. IEEE Access 2023, 12, 7271–7284.
[6] Raja, A.S.; Peerbasha, S.; Iqbal, Y.M.; Sundarvadivazhagan, B.; Surputheen, M.M. Structural Analysis of URL For Malicious URL Detection Using Machine Learning. J. Adv. Appl. Sci. Res. 2023, 5, 28–41.
[7] Tashtoush, Y.; Alajlouni, M.; Albalas, F.; Darwish, O. Exploring low-level statistical features of n-grams in phishing URLs: A comparative analysis with high-level features. Clust. Comput. 2024, 27, 13717–13736.
[8] Gupta, B.B.; Gaurav, A.; Attar, R.W.; Arya, V.; Bansal, S.; Alhomoud, A.; Chui, K.T. A Hybrid CNN-Brown-Bear Optimization Framework for Enhanced Detection of URL Phishing Attacks. Comput. Mater. Contin. 2024, 81, 4853–4874.
[9] Liu, R.; Wang, Y.; Xu, H.; Qin, Z.; Zhang, F.; Liu, Y.; Cao, Z. PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network. Inf. Fusion 2025, 113, 102638.
[10] Do, N.Q.; Selamat, A.; Krejcar, O.; Fujita, H. Detection of malicious URLs using Temporal Convolutional Network and Multi-Head Self-Attention mechanism. Appl. Soft Comput. 2025, 169, 112540.
[11] Nagy, N.; Aljabri, M.; Shaahid, A.; Ahmed, A.A.; Alnasser, F.; Almakramy, L.; Alhadab, M.; Alfaddagh, S. Phishing URLs detection using sequential and parallel ML techniques: Comparative analysis. Sensors 2023, 23, 3467.
[12] Lavanya, B.; Shanthi, C. Malicious software detection based on URL-API intensity feature selection using deep spectral neural classification for improving host security. Int. J. Comput. Intell. Appl. 2023, 22, 2350002.
[13] Nayak, G.S.; Muniyal, B.; Belavagi, M.C. Enhancing Phishing Detection: A Machine Learning Approach with Feature Selection and Deep Learning Models. IEEE Access 2025, 13, 33308–33320.
[14] Reyes-Dorta, N.; Caballero-Gil, P.; Rosa-Remedios, C. Detection of malicious URLs using machine learning. Wirel. Netw. 2024, 30, 7543–7560.
[15] Rafsanjani, A.S.; Kamaruddin, N.B.; Behjati, M.; Aslam, S.; Sarfaraz, A.; Amphawan, A. Enhancing malicious URL detection: A novel framework leveraging priority coefficient and feature evaluation. IEEE Access 2024, 12, 85001–85026.
[16] Liu, R.; Wang, Y.; Guo, Z.; Xu, H.; Qin, Z.; Ma, W.; Zhang, F. TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features. Comput. Netw. 2024, 253, 110707.
[17] Hoang, X.D.; Le Minh, D.; Ninh, T.T.T. A CNN-Based Model for Detecting Malicious URLs. In Proceedings of the 2023 RIVF International Conference on Computing and Communication Technologies (RIVF), Hanoi, Vietnam, 23–25 December 2023; IEEE: New York, NY, USA, 2023; pp. 284–288.
[18] Su, M.-Y.; Su, K.-L. Bert-based approaches to identifying malicious urls. Sensors 2023, 23, 8499.
[19] Elsadig, M.; Ibrahim, A.O.; Basheer, S.; Alohali, M.A.; Alshunaifi, S.; Alqahtani, H.; Alharbi, N.; Nagmeldin, W. Intelligent deep machine learning cyber phishing url detection based on bert features extraction. Electronics 2022, 11, 3647.
[20] Islam, M.S.; Jyoti, M.N.J.; Mia, M.S.; Hussain, M.G. Fake website detection using machine learning algorithms. In Proceedings of the 2023 International Conference on Digital Applications, Transformation & Economy (ICDATE), Miri, Sarawak, Malaysia, 14–16 July 2023; IEEE: New York, NY, USA, 2023; pp. 255–259.
[21] Hani, R.B.; Amoura, M.; Ammourah, M.; Khalil, Y.A.; Swailm, M. Malicious URL detection using machine learning. In Proceedings of the 2024 15th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 13–15 August 2024; IEEE: New York, NY, USA, 2024; pp. 1–5.
[22] Gupta, N.; Thapliyal, S.; Sharma, A.; Sheladia, J.; Wazid, M.; Giri, D. Deep learning approach for malicious url detection using cnn, rnn, lstm and bi-lstm models. In Proceedings of the 2024 6th International Conference on Computational Intelligence and Networks (CINE), Bhubaneswar, India, 19–21 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–5.
[23] Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981.
[24] Ahmetoglu, H.; Das, R. A comprehensive review on detection of cyber-attacks: Data sets, methods, challenges, and future research directions. Internet Things 2022, 20, 100615.
[25] Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
[26] Chaudhary, A.; Kolhe, S.; Kamal, R. An improved random forest classifier for multi-class classification. Inf. Process. Agric. 2016, 3, 215–222.
[27] Kulkarni, V.Y.; Sinha, P.K. Pruning of random forest classifiers: A survey and future directions. In Proceedings of the 2012 International Conference on Data Science & Engineering (ICDSE), Cochin, India, 18–20 July 2012; IEEE: New York, NY, USA, 2012; pp. 64–68.
[28] Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
[29] Muhuri, P.S.; Chatterjee, P.; Yuan, X.; Roy, K.; Esterline, A. Using a long short-term memory recurrent neural network (LSTM-RNN) to classify network attacks. Information 2020, 11, 243.
[30] Hiriyannaiah, S.; GM, S.; MHM, K.; Srinivasa, K. A comparative study and analysis of LSTM deep neural networks for heartbeats classification. Health Technol. 2021, 11, 663–671.
[31] Haq, M.I.U.; Mahmood, K.; Li, Q.; Das, A.K.; Shetty, S.; Hussain, M. Efficiently Learning an Encoder that Classifies Token Replacements and Masked Permuted Network-Based BIGRU Attention Classifier for Enhancing Sentiment Classification of Scientific Text. IEEE Access 2024, 12, 190240–190254.
[32] Zheng, H.; Chu, J.; Li, Z.; Ji, J.; Li, T. Accelerating Federated Learning with genetic algorithm enhancements. Expert Syst. Appl. 2025, 281, 127636.
[33] Asif, S. OSEN-IoT: An optimized stack ensemble network with genetic algorithm for robust intrusion detection in heterogeneous IoT networks. Expert Syst. Appl. 2025, 276, 127183.
[34] Almomani, O.; Alsaaidah, A.; Abu-Shareha, A.A.; Alzaqebah, A.; Almaiah, M.A.; Shambour, Q. Enhance URL Defacement Attack Detection Using Particle Swarm Optimization and Machine Learning. J. Comput. Cogn. Eng. 2025, 4, 296–308.
[35] Almseidin, M.; Gawanmeh, A.; Alzubi, M.; Al-Sawwa, J.; Mashaleh, A.S.; Alkasassbeh, M. Hybrid deep neural network optimization with particle swarm and grey wolf algorithms for sunburst attack detection. Computers 2025, 14, 107.
[36] Alohali, M.A.; Alahmari, S.; Aljebreen, M.; Asiri, M.M.; Miled, A.B.; Albouq, S.S.; Alrusaini, O.; Alqazzaz, A. Two stage malware detection model in internet of vehicles (IoV) using deep learning-based explainable artificial intelligence with optimization algorithms. Sci. Rep. 2025, 15, 20615.
[37] Mitchell, M. An Introduction to Genetic Algorithms; MIT Press: Cambridge, MA, USA, 1998.
[38] Bartz-Beielstein, T.; Branke, J.; Mehnen, J.; Mersmann, O. Evolutionary algorithms. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2014, 4, 178–195.
[39] Bonyadi, M.R.; Michalewicz, Z. Particle swarm optimization for single objective continuous space problems: A review. Evol. Comput. 2017, 25, 1–54.
[40] Shehab, M.; Mashal, I.; Momani, Z.; Shambour, M.K.Y.; L-Badareen, A.A.; Al-Dabet, S.; Bataina, N.; Alsoud, A.R.; Abualigah, L. Harris hawks optimization algorithm: Variants and applications. Arch. Comput. Methods Eng. 2022, 29, 5579–5603.
[41] Bozkir, A.S.; Dalgic, F.C.; Aydos, M. GramBeddings: A new neural network for URL based identification of phishing web pages through n-gram embeddings. Comput. Secur. 2023, 124, 102964.

Figure 1. Workflow diagram of the proposed method.

Figure 2. Dataset distribution. (a) Bar chart and (b) curve chart.

Figure 3. Multiclass confusion matrices of the best algorithms: (a) RF, (b) CNN+LSTM, (c) HHO+CNN+LSTM, and (d) ELECTRA.

Figure 4. Importance levels of the features selected by the HHO algorithm.

Figure 5. Binary confusion matrices of the best-performing algorithms: (a) RF, (b) LSTM, (c) HHO+CNN+LSTM, and (d) ELECTRA.

Table 1. Features selected by the optimization algorithms.

Number	Name	GA	PSO	HHO
0	‘use_of_ip’	×	✓	✓
1	‘abnormal_url’	✓	×	✓
2	‘count.’	×	✓	✓
3	‘count-www’	×	✓	✓
4	‘count@’	✓	×	✓
5	‘count_dir’	✓	×	✓
6	‘count_embed_domian’	✓	✓	✓
7	‘short_url’	✓	×	×
8	‘count-https’	✓	✓	✓
9	‘count-http’	×	✓	✓
10	‘count%’	×	×	✓
11	‘count-’	✓	×	✓
12	‘count=’	×	×	✓
13	‘url_length’	✓	×	×
14	‘hostname_length’	✓	×	✓
15	‘sus_url’	✓	×	✓
16	‘fd_length’	✓	×	×
17	‘tld_length’	✓	×	✓
18	‘count-digits’	✓	×	✓
19	‘count-letters’	✓	✓	✓

Table 2. Performance metrics for multiclass evaluation.

Model Name	Accuracy	Precision	Recall	F1-Score
Machine Learning
LGBM	0.96	0.95	0.93	0.94
XGB	0.96	0.96	0.94	0.94
GB	0.94	0.92	0.88	0.92
RF	0.97	0.96	0.95	0.95
Deep Learning
LSTM	0.95	0.95	0.95	0.95
CNN	0.95	0.93	0.91	0.92
CNN+LSTM	0.956	0.94	0.93	0.94
Deep Learning with Opt.
GA+CNN+LSTM	0.92	0.92	0.85	0.87
PSO+CNN+LSTM	0.81	0.79	0.66	0.70
HHO+CNN+LSTM	0.957	0.93	0.90	0.92
Electra
ELECTRA	0.99	0.99	0.99	0.99

Table 3. Performance metrics for binary classification evaluation.

Model Name	Accuracy	Precision	Recall	F1-Score
Machine Learning
LGBM	0.93	0.94	0.92	0.93
XGB	0.94	0.95	0.92	0.94
GB	0.91	0.92	0.945	0.91
RF	0.95	0.96	0.95	0.95
Deep Learning
LSTM	0.95	0.94	0.935	0.94
CNN	0.94	0.94	0.93	0.93
CNN+LSTM	0.84	0.85	0.77	0.80
Deep Learning with Opt.
GA+CNN+LSTM	0.81	0.81	0.75	0.76
PSO+CNN+LSTM	0.901	0.90	0.88	0.89
HHO+CNN+LSTM	0.904	0.90	0.87	0.89
Electra
ELECTRA	0.99	0.99	0.99	0.99

Table 4. Comparison of the confidence intervals of the best results for binary and multiclass classification.

Algorithms	Model	Scenario	F1-Score	Approx. CI
Machine Learning	RF	Binary	0.95	[0.93–0.97]
	LGBM	Multiclass	0.94	[0.92–0.96]
	XGB	Multiclass	0.94	[0.92–0.96]
Deep Learning	LSTM	Binary	0.94	[0.92–0.96]
Electra	ELECTRA	Binary	0.99	[0.98–1.00]
Electra	ELECTRA	Multiclass	0.99	[0.98–1.00]

Table 5. Model prediction examples from classification tasks.

Class	Method	URLs	False Label	True Label
Multi-Class	Electra	clienteltau.com		✓ Phishing
	HHO+CNN+LSTM	centralmich.craigslist.org/		✓ Benign
	RF	https://mitsui-jyuku.mixh.jp/uploads/84ODNO38B		✓ Malware
	Electra	transit-port.net/AI.CogSci.Robotics/robotics.html	× Benign	Phishing
	HHO+CNN+LSTM	bluefin.writerbin.com/css/secure	× Malware	Benign
	RF	ea.ea.home.mindspring.com/*Star.html	× Benign	Phishing
Binary Class	Electra	http://www.dutchthewiz.com/freeware/		✓ Benign
	HHO+CNN+LSTM	http://www.deadlinedata.com		✓ Benign
	RF	http://www.avclub.com/content/node/24539		✓ Benign
	Electra	http://www.blackmistress.com/	× Benign	Malicious
	HHO+CNN+LSTM	http://www.muschi-feuchte.de/	× Benign	Malicious
	RF	http://myblog.de/promisc/page/99122	× Benign	Malicious

Table 6. Comparison of the proposed method with different approaches.

	Proposed		Ref. [4]		Ref. [17]	Ref. [18]	Ref. [20]	Ref. [41]
	(Scenario-1)	(Scenario-2)	(Scenario-1)	(Scenario-2)	(Scenario-2)	(Scenario-1)	(Scenario-1)	(Scenario-2)
Accuracy	0.99	0.904	96.04	98.19	-	98.78	97.0	0.9982
Precision	0.99	0.90	95.16	98.22	99.26	99.12	96	0.9995
Recall	0.99	0.87	92.80	96.32	98.73	98.02	95	0.9970
F1-Score	0.99	0.89	93.97	97.26	98.99	-	95	0.9985
Classifier	ELECTRA	HHO+CNN+LSTM	DistilBERT	DistilBERT + 21 URL feat.	CNN	BERT	RF, LGBM, XGBoost	n-gram + CNN_BiLSTM + Attention


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
