Article

A Hybrid Graph Neural Network Framework for Malicious URL Classification

by Sarah Mohammed Alshehri *, Sanaa Abdullah Sharaf and Rania Abdulrahman Molla
Computer Science Department, King Abdulaziz University, Jeddah 21589, Saudi Arabia
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4387; https://doi.org/10.3390/electronics14224387
Submission received: 8 October 2025 / Revised: 27 October 2025 / Accepted: 5 November 2025 / Published: 10 November 2025
(This article belongs to the Section Computer Science & Engineering)

Abstract

The increasing reliance on Internet-based services has been accompanied by a rapid growth in cyber threats, particularly phishing attacks that use deceptive Uniform Resource Locators (URLs) to mislead users and compromise sensitive data. This paper proposes a hybrid deep learning architecture that integrates Graph Convolutional Networks (GCN), an attention mechanism, and Long Short-Term Memory (LSTM) networks for accurate classification of malicious and benign URLs. The model combines sequential pattern recognition through LSTM, structural graph representations via GCN, and feature prioritization using attention to enhance detection performance. Experiments were conducted on a labeled URL dataset of 100,000 and subsequently 200,000 samples, using consistent training and testing splits. The proposed model showed stable performance across different dataset sizes and ultimately outperformed other approaches on the expanded dataset, demonstrating stronger generalization capabilities. These findings highlight the effectiveness of the proposed hybrid model in capturing structural URL features, provide a reliable approach for detecting phishing attacks via structural URL analysis, and offer a foundation for future research on graph-based cybersecurity systems.

1. Introduction

In recent years, the Internet has grown into the world’s largest communication platform, supported by rapid progress in networking technologies that affect almost every aspect of daily life. Millions of users now rely on it for activities such as e-commerce, social networking, and online banking. However, the open and largely uncontrolled nature of the Internet makes it a common target for cyberattacks, where weaknesses in its infrastructure can be exploited. One of the most widespread threats is phishing, in which attackers trick users into revealing personal information by clicking on malicious Uniform Resource Locators (URLs) [1].
The main goal of information security is to protect systems and prevent unauthorized access to valuable resources through effective measures and policies. Despite these efforts, illegitimate URLs remain a serious threat. As global identifiers of online resources, URLs can easily be manipulated to host phishing content and mislead users [2]. Reports show that phishing incidents increased from about 114,702 cases in 2019 to more than 241,000 in 2020 during the COVID-19 pandemic. Verizon’s 2021 Data Breach Investigations Report [3] further highlighted that phishing was the most common type of breach, accounting for 43% of incidents across 88 countries.
Malicious URLs are designed to deceive users and steal sensitive data such as usernames, passwords, and financial details. Numerous researchers have investigated artificial intelligence techniques, particularly deep learning approaches, to identify and block malicious URLs efficiently, and these techniques have achieved good results [4]. A key advantage of deep learning is its ability to process raw data directly, without requiring manual feature engineering [5]. Most studies on the classification and detection of URL-based attacks have relied primarily on traditional deep learning models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks. Recently, some cybersecurity research has begun exploring the use of Graph Neural Networks (GNNs) in areas like network protection and fake content detection on web pages, due to their ability to capture complex structural relationships, model interdependencies among entities, and generalize across irregular data formats.
Despite these advancements, existing approaches still suffer from several limitations. Some are restricted to capturing sequential patterns, while others rely heavily on external page content such as HTML. Phishing remains the preferred technique for attackers because it exploits both system vulnerabilities and human behavior. Many users become victims of phishing due to a lack of awareness, carelessness, or accidental mistakes. At the same time, attackers are constantly improving their techniques, often using machine learning to conduct large-scale phishing attacks that specifically target users with high-level access to sensitive data [6].
In this study, we present a hybrid framework that leverages the combined strengths of Graph Convolutional Networks (GCN), an attention mechanism, and Long Short-Term Memory (LSTM) networks to improve the accuracy of classifying URLs as malicious or benign. This architecture is designed to capture both the structural patterns and sequential behaviors within URLs, enabling more intelligent and reliable threat detection.
Experiments were conducted on a large-scale dataset exceeding 100,000 labeled samples, with evaluation performed before and after dataset augmentation. The proposed model consistently outperformed both classical machine learning and recent deep learning baselines, achieving over 90% accuracy.
The contributions of this paper are summarized as follows:
  • Use a hybrid architecture that combines GCN, Attention Mechanism and LSTM to jointly capture structural and sequential URL features.
  • Evaluate the model on large-scale real-world datasets, including augmented samples.
  • Provide a comprehensive comparison with traditional and state-of-the-art baselines to demonstrate the advantage of the hybrid models.
  • Highlight the significance of implementing deep learning techniques into cybersecurity tasks to enhance intelligent defense mechanisms.
  • Demonstrate the effectiveness of graph-based representations in phishing URL classification and emphasize their promising potential as a specialized defense mechanism in cybersecurity.
The rest of this paper is organized as follows. Section 2 reviews related work on URL classification. Section 3 describes the dataset and preprocessing steps. Section 4 presents the proposed methodology. Section 5 discusses the experimental results. Finally, Section 6 concludes the paper and outlines future directions.

2. Related Works

A variety of approaches have been proposed to address phishing URL challenges, ranging from traditional machine learning (ML) classifiers to more advanced deep learning (DL) models, including those based on graph structures. In this section, we review the most relevant studies, organized into three categories: ML-based approaches, DL-based models, and hybrid graph-based architectures.
Zhang et al. [7] proposed CANTINA, a TF–IDF–based method that compares webpage keywords with search engine results to identify phishing pages. Building on similar principles, Ma et al. [8] employed lexical and host-based features with classifiers such as Naïve Bayes, SVM, and logistic regression to detect malicious URLs with high accuracy. Chen et al. [9] further improved detection reliability using a reputation-based two-stage model that combines domain trust evaluation and sandbox inspection, achieving 94% accuracy.
Subsequent studies have shifted toward deep learning to overcome the limitations of manual feature extraction. A CNN-based model [6] was developed with multiple convolutional and pooling layers followed by a sigmoid classifier, achieving 94.31% accuracy, while Singh et al. [10] proposed a text-based CNN architecture that processes tokenized and embedded URL sequences through convolutional and dense layers, attaining 98% accuracy. Similarly, RNN-based architectures have been utilized for sequential data modeling. A comparative study [11] revealed that LSTM networks outperform Random Forest classifiers by effectively capturing long-term dependencies, reaching 98.7% accuracy. An improved AB-BiLSTM with an attention mechanism [12] further enhanced performance by capturing bidirectional sequence semantics and anomalous URL segments, achieving 98.06% accuracy. Likewise, Su [13] proposed a deep LSTM optimized with stochastic gradient descent, demonstrating 99.1% accuracy on Yahoo Directory and PhishTank datasets.
Recently, hybrid and graph-based architectures have been investigated to further boost detection robustness. Yao et al. [14] introduced an improved Feature Pyramid Network combined with Faster R-CNN for QR code–based phishing detection, where logo extraction and recognition verified website legitimacy, achieving 91.4% precision. Similarly, Adebowale et al. [15] developed the Intelligent Phishing Detection System (IPDS), which integrates CNN and LSTM to analyze image, text, and frame features, achieving 93.28% accuracy. In another study, Pooja and Sridhar [16] proposed a multidimensional hybrid model employing deep learning and XGBoost to enhance detection speed and scalability, also achieving 93.28% accuracy.
In the context of graph learning, Ouyang and Zhang [17] introduced an enhanced Graph Convolutional Network (GCN) utilizing DOM tree structures and RNN-based node embeddings for phishing detection, achieving 95.5% accuracy. Huang et al. [18] applied GNNs with cross-protocol feature integration and random forest pre-selection to detect malicious IP addresses, obtaining 85.28% accuracy. Moreover, Ariyadasa et al. [19] developed PhishDet, a hybrid model combining Long-term Recurrent Convolutional Networks (LRCN) with GCN to jointly analyze URL and HTML features, achieving 96.42% accuracy.
Overall, the literature demonstrates that while traditional machine learning methods perform well with handcrafted features, deep learning and graph-based architectures provide superior adaptability and feature-learning capabilities, making them more effective for modern phishing detection.
Table 1 provides a summary of previous studies in terms of dataset, model type, and achieved accuracy.
In summary, existing approaches to malicious URL detection either rely on manually engineered features, treat URLs as flat sequential data, or depend heavily on external page content such as HTML and JavaScript. Attention mechanisms have been employed to help models focus on the most informative parts of the URL, yet most DL-based methods still treat URLs as flat sequences and fail to capture the complex structural relationships between the components of the URL string. This gap has recently led to increased interest in graph-based methods, which are better suited to modeling such dependencies. While hybrid and graph-based models have shown promising results, there remains a lack of methods that jointly capture both the structural relationships within a URL and its sequential behavior using graph-based deep learning. Moreover, attention mechanisms have rarely been integrated with GCN and LSTM for malicious URL detection.
To address this gap, our study presents a hybrid deep learning model that integrates GCN, Attention Mechanism and LSTM, using only the raw URL string as input. This architecture combines structural and sequential learning capabilities, leveraging the strengths of each component to improve classification accuracy without relying on external webpage content. This model serves as a foundational step toward future research that aims to further integrate graph-based learning into the development of more advanced solutions for malicious URL detection and prevention.

3. Dataset and Preprocessing

A large-scale dataset of URLs was collected from publicly available sources, comprising both malicious and benign examples. The data were structured in two columns: one containing the URL string and the other containing its corresponding binary label. To ensure quality, duplicates were removed, and basic preprocessing steps were applied to retain only meaningful and well-structured entries. The resulting dataset included over 100,000 unique URLs, representing a diverse set of patterns relevant to malicious link analysis.
The dataset was collected from several publicly available sources, including Alexa [20], PhishTank [21] and Openphish [22].
Given that this study focuses on the textual nature of URLs, Term Frequency-Inverse Document Frequency (TF-IDF) was employed to transform raw strings into meaningful numerical vectors. This encoding method highlights the significance of individual tokens within URLs without relying on manually crafted features. To enhance the model’s sensitivity to subtle variations common in phishing attempts, TF-IDF was applied at both the character and word levels. The resulting vectors served as inputs to the learning models in subsequent stages [23].
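The character-level and word-level encoding described above can be illustrated with a minimal, standard-library sketch. The tokenization rule, the trigram size, the weighting formula (tf = count / length, idf = ln(N / df) + 1) and the sample URLs are assumptions chosen for illustration; the paper does not state which TF-IDF variant or parameters were used.

```python
import math
import re
from collections import Counter


def char_ngrams(url, n=3):
    """Return overlapping character n-grams of a URL string (n=3 is an assumption)."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def word_tokens(url):
    """Split a URL on non-alphanumeric delimiters such as '.', '/' and '-'."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def tfidf(docs_tokens):
    """Compute one common TF-IDF variant for a list of token lists.

    tf = count / document length, idf = ln(N / df) + 1.
    Returns one {token: weight} dictionary per document.
    """
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # document frequency counts each token once per URL
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            tok: (cnt / total) * (math.log(n_docs / df[tok]) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


# Hypothetical sample URLs, not taken from the dataset.
urls = [
    "example.com/login",
    "example.com/about",
    "secure-login.example-verify.net/login",
]
word_vecs = tfidf([word_tokens(u) for u in urls])
char_vecs = tfidf([char_ngrams(u) for u in urls])
```

In this toy corpus, a token that occurs in every URL (such as "example") receives a lower weight than one that occurs in only some of them, which is the behavior the encoding relies on.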

3.1. Data Collection

The initial dataset consisted of approximately 100,000 labeled URLs, collected from a publicly available source. To increase the variety and improve generalization, additional phishing and benign URLs were later gathered from multiple repositories, including recent threat intelligence feeds and well-known public datasets such as Alexa, OpenPhish and PhishTank. This process aimed to expose the model to diverse URL patterns and reduce overfitting. Unlike artificial oversampling techniques, dataset expansion was achieved through direct data gathering without duplicating any entries.

3.2. Model Input Preparation

The TF-IDF vectors were reshaped to align with the input requirements of the hybrid GCN-Attention-LSTM model, ensuring architectural compatibility without modifying its core structure.
For evaluation, two experimental phases were conducted:
  • Run 1: The model was trained and tested on approximately 100,000 unique samples.
  • Run 2: The dataset was expanded by including additional non-redundant samples, nearly doubling its size, and the model was re-evaluated under the same conditions.
In both phases, the dataset was divided into training and testing subsets to maintain balanced class distributions. Table 2 summarizes the dataset distribution across the two runs.
Both datasets used in this study were balanced, containing an equal number of benign and malicious URLs. The average URL length was approximately 52 characters. Each URL consisted of an average of 7 tokens, indicating moderate structural diversity. To ensure data reliability and fair evaluation, several standard measures were applied during preprocessing. The dataset was cleaned to remove duplicates and normalized by stripping redundant components such as protocol identifiers, query strings, and “www” subdomains. The data were then divided into training and testing subsets using an 80/20 split, ensuring representative coverage across domains and a balanced distribution of classes. These precautions were taken to maintain consistency and prevent bias in the evaluation process.
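The cleaning steps listed above (stripping protocol identifiers, query strings and "www" subdomains, then removing duplicates) could be sketched as follows. The function names, the regular expressions and the sample rows are hypothetical; the exact rules applied by the authors are not given in the text.

```python
import re


def normalize_url(url):
    """Strip the protocol identifier, a leading 'www.' and the query string."""
    url = url.strip()
    url = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)  # protocol identifier
    url = re.sub(r"^www\.", "", url, flags=re.IGNORECASE)   # 'www' subdomain
    url = url.split("?", 1)[0]                              # query string
    return url


def deduplicate(rows):
    """Keep the first occurrence of each normalized URL.

    `rows` is a list of (url, label) pairs; label 1 = malicious, 0 = benign.
    """
    seen = set()
    unique = []
    for url, label in rows:
        key = normalize_url(url)
        if key and key not in seen:
            seen.add(key)
            unique.append((key, label))
    return unique


# Hypothetical rows for illustration only.
raw = [
    ("https://www.example.com/login?user=1", 0),
    ("http://example.com/login", 0),
    ("http://phish.example.net/verify?id=9", 1),
]
clean = deduplicate(raw)
```

Deduplicating after normalization, as done here, treats two URLs that differ only in protocol or query string as the same entry; whether the authors deduplicated before or after normalization is an assumption.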

4. Proposed Methodology

4.1. Problem Formulation

The task of identifying malicious URLs is framed as a binary classification problem. Given an input URL string $u$, the objective is to predict its corresponding label $y \in \{0, 1\}$, where $y = 1$ denotes a malicious URL and $y = 0$ indicates a benign one. Formally, the model learns a function $F: u \mapsto y$ that maps URL strings to their respective classes. The goal is to minimize classification errors on unseen URLs by effectively capturing both sequential patterns and structural dependencies present in the input. This study also aims to assess the overall effectiveness of the model in accurately distinguishing between malicious and benign links based on these patterns.

4.2. Data Representation

Each URL in the dataset was treated as raw text and encoded into numerical vectors using the Term Frequency–Inverse Document Frequency (TF-IDF) technique. This method assigns higher weights to informative tokens that frequently appear in malicious URLs but are uncommon in benign ones. Unlike handcrafted lexical or content-based features, TF-IDF offers a scalable, text-only representation that does not rely on external metadata such as HTML or JavaScript [23].
The tokenized input was represented as a graph, where nodes represent tokens and edges indicate their relative positions within the string. This structure helps the model learn how different parts of the URL are arranged, supporting further processing by GCN and LSTM components. To effectively process the structured and sequential representations of URLs, this study employs a hybrid architecture that combines GCN, Attention Mechanism and LSTM. Each component was selected for its ability to handle a specific aspect of the data, enabling the model to extract richer patterns and improve classification accuracy.
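As a rough illustration of the token graph described above, the sketch below builds an adjacency matrix in which each node is a token and an edge links tokens that are neighbors in the string. The delimiter-based tokenizer and the `window` parameter are assumptions made for this example; the paper does not specify how relative positions are turned into edges.

```python
import re


def url_to_graph(url, window=1):
    """Build an undirected token graph for one URL.

    Nodes are token positions; an edge links two tokens whose positions
    differ by at most `window` (a hypothetical choice for illustration).
    """
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    n = len(tokens)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(i + window, n - 1) + 1):
            adjacency[i][j] = 1
            adjacency[j][i] = 1  # keep the matrix symmetric
    return tokens, adjacency


# Hypothetical URL for illustration only.
tokens, adjacency = url_to_graph("login.example.com/account/verify")
```

With `window=1` the graph is a simple chain that mirrors token order; a larger window would also connect tokens that are two or more positions apart.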

4.3. Model Architecture

This section reviews each component of the proposed model, namely the GCN, the attention mechanism, and the LSTM. The proposed GCN-Attention-LSTM (GCA-LSTM) model is then explained.

4.3.1. Graph Convolutional Network (GCN)

The Graph Convolutional Network (GCN) is designed to analyze data organized as a graph, that is, a set of vertices connected by edges. In this work, the GCN is used to classify URLs as either benign or malicious by analyzing the structural relationships within the URL dataset that are relevant to this task. It distinguishes harmless from dangerous URLs by examining the relationships and patterns between them for structural indicators; for example, it can find connections between dangerous URLs or identify linking patterns shared among harmless ones.
The model operates on data represented as a graph of nodes and edges. Before the data enter the model, all characters are transformed into numbers so that they can be processed as feature vectors. The relationships between the feature vectors on the graph are stored in an adjacency matrix.
The adjacency matrix $A$ of a graph $G = (V, E)$ with $n$ vertices $V$ and $m$ edges $E$ contains 1s on the diagonal to represent each vertex being connected to itself, and 1s at position $(i, j)$ to indicate an edge $e_{i,j}$ connecting node $v_i$ to node $v_j$ [24].
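To make the role of the adjacency matrix concrete, the following sketch applies one graph-convolution step in the widely used form ReLU(D^{-1/2} Â D^{-1/2} X W), where Â is the adjacency matrix with 1s on the diagonal. That the paper uses exactly this normalization is an assumption; the small graph, feature matrix and weights are hypothetical and untrained.

```python
import math


def add_self_loops(adj):
    """Set the diagonal to 1 so that each vertex is connected to itself."""
    n = len(adj)
    return [[1 if i == j else adj[i][j] for j in range(n)] for i in range(n)]


def normalize_adjacency(adj_hat):
    """Symmetric normalization D^{-1/2} A_hat D^{-1/2}, with D the degree matrix."""
    n = len(adj_hat)
    deg = [sum(row) for row in adj_hat]
    return [
        [adj_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    a_norm = normalize_adjacency(add_self_loops(adj))
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Three-node chain graph with two features per node (illustrative values).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0], [1.0]]  # hypothetical weight matrix, not learned
out = gcn_layer(adj, features, weights)
```

Each output row mixes a node's own features with those of its neighbors, which is how structural context enters the node representation.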

4.3.2. Attention Mechanism

The implementation of attention mechanism has proven to be highly effective in several deep learning applications, including image recognition, natural language processing, and voice recognition. The attention function contains the process of connecting a query with several key-value pairs and providing a weighted sum of the values. The form of the output varies depending on the level of information you want to highlight, such as spatial, temporal, or other elements [25].
The attention technique dynamically allocates importance to different URL features and temporal sequences that are relevant to the classification of URLs as either benign or harmful. The purpose is to prioritize important indicators or sequences that have significant effects on the classification process, increasing its focus on highlighting the fundamental components of URLs that determine their benign or harmful nature. For instance, it may give priority to specific URL features or time-related patterns that reflect harmful intent.
Attention mechanisms are commonly incorporated into hybrid neural models to enhance feature selection and representation. There are several forms of attention that can be integrated and applied depending on the model structure and the nature of the data. For example, temporal attention is often used alongside models such as LSTM to highlight important time steps within a sequence. In graph-based models, graph attention mechanisms enable the model to assign different weights to neighboring nodes during the aggregation process. The graph attention module follows a multi-head structure, where multiple attention heads jointly learn the importance of node connections and aggregate their representations through concatenation or averaging. Figure 1 displays the multi-head attention technique with a vertex in its neighborhood. The features obtained from both the neighboring vertices and the vertex itself are combined either by concatenation or by averaging in order to form the attention vector [25]. In this study, a single-head attention mechanism was implemented to dynamically assign weights to graph-encoded URL features before sequential modelling. Unlike multi-head structures that aggregate multiple attention subspaces, this configuration applies one adaptive weighting process across all node representations.
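A single-head weighting of node representations, in the spirit of the configuration described above, might look like the sketch below. The dot-product scoring against one vector and the softmax normalization are assumptions for illustration; the paper does not give the scoring function, and `score_vector` stands in for parameters that would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(node_features, score_vector):
    """Weight each node's feature vector by a softmax-normalized relevance score."""
    scores = [
        sum(f * w for f, w in zip(feat, score_vector)) for feat in node_features
    ]
    weights = softmax(scores)
    weighted = [[w * f for f in feat] for w, feat in zip(weights, node_features)]
    return weights, weighted


# Illustrative node features and a hypothetical (untrained) scoring vector.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score_vector = [2.0, 0.0]
attn_weights, attended = single_head_attention(node_features, score_vector)
```

Because a single set of weights is applied across all nodes, there is no concatenation or averaging step as in the multi-head case.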

4.3.3. Long Short-Term Memory (LSTM)

The Recurrent Neural Network (RNN) is frequently employed for predicting sequential data. However, gradient problems may occur when RNN models are used to capture long-term dependencies in time-series data, because the derivatives of the network are repeatedly multiplied across the layers. Small derivatives cause the gradient to shrink rapidly, whereas large derivatives cause it to grow rapidly. These two conditions are known as the vanishing and exploding gradient problems, respectively. LSTM, an improved version of the original RNN, addresses these problems and captures long-range relationships efficiently. The main difference is that LSTM maintains a single cell state and includes three gates in each recurrent unit. These structures help the model capture long-term interactions [26]. Figure 2 displays an individual LSTM cell.
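For reference, the cell state and the three gates mentioned above are commonly written as follows. This is the standard LSTM formulation, given here as an illustration; the symbols ($x_t$ for the input, $h_t$ for the hidden state, $c_t$ for the cell state, $W$, $U$, $b$ for learned parameters) are notation introduced for this sketch and do not appear in the original text.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The additive update of $c_t$ is what lets gradients flow across many time steps without being repeatedly multiplied by small derivatives.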
The proposed model incorporates an LSTM component to effectively capture temporal relationships and sequences of events. This enables the model to accurately identify URLs as either benign or malicious in the binary classification. The system will analyze temporal patterns of user activity and URL accesses to detect specific sequences of events and user interactions that determine whether a URL is classified as benign or harmful. This may involve analyzing common trends of behavior that occur when accessing malicious URLs or recognizing normal user activities before visiting harmless URLs.

4.3.4. Graph Convolutional Network-Attention-Long Short-Term Memory (GCA-LSTM)

The proposed classification model builds on the approach of X. Ma et al. [27]. That study presents a graph framework for predicting the short-term demand of a bike-sharing system at the station level, using multiple data sources, including historical bike-sharing trip data, land-use data, meteorological data, and users’ personal information. Inspired by this structure, our proposed model implements an architecture tailored to a different application domain: binary URL classification. It aligns with the spatial-temporal graph learning framework, where Graph Convolutional Networks (GCNs) capture the structural relationships between URL components, Long Short-Term Memory (LSTM) layers model their sequential dependencies, and the attention mechanism enhances representation by emphasizing the most informative features. This hybrid design is theoretically grounded in spatial-temporal graph learning, which integrates structural and temporal dimensions to jointly optimize feature representation. By modeling URLs in this unified framework, the proposed model captures both the structural correlations and temporal dynamics underlying evolving malicious behaviors, thereby strengthening its ability to detect complex threat patterns.
Our proposed model is specifically designed for the classification of URLs as either benign or malicious. It integrates Graph Convolutional Networks (GCNs) to capture the structural dependencies within URL components, an attention mechanism to emphasize the most informative features, and Long Short-Term Memory (LSTM) networks to model the sequential characteristics of the data. These components work together to enable accurate and robust classification of harmful and benign URLs.
In this paper, we refer to our proposed hybrid model as GCA-LSTM, a variant of GCN-LSTM in which an attention mechanism is explicitly integrated between the GCN and LSTM layers. The name reflects the key architectural enhancement (Graph Convolution + Attention + LSTM) while maintaining a consistent naming convention.
The proposed model was selected for its robustness and performance [27] with the goal of effectively addressing the problem of malicious URL classification which requires capturing both structural and sequential characteristics from input data.
As shown in Figure 3, URLs are fed into the model through the input layer, which defines the input shape. The second layer is a Reshape layer, which reshapes the input data into the format expected by the subsequent layers.
Graph Convolutional Networks (GCNs) are used first to extract the underlying structural relationships between different URL components, such as subdomains, paths, and query parameters.
Next, an attention mechanism is applied to focus on the most informative features, helping the model prioritize critical patterns that may indicate malicious behavior. Long Short-Term Memory (LSTM) networks are then used to analyze the sequential flow of URL components, enabling the model to detect irregularities or abnormal repetition patterns.
A Flatten layer converts the output of the GCA-LSTM layers into a one-dimensional vector, which is typically required before passing it to a fully connected layer. A final dense layer with a sigmoid activation function, which is commonly used for binary classification tasks, produces the model output: the probability that the input URL is malicious, based on the features learned by the previous layers.
The model was compiled with binary cross-entropy as the loss function and the Adam optimizer, which adapts the learning rate during training. The model was trained for up to 10 epochs, with early stopping after 3 consecutive epochs without improvement in the loss.
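The stopping rule described above (at most 10 epochs, stop after 3 epochs without loss improvement) can be sketched independently of any deep learning library. The function name and the sample loss values are hypothetical, and the text does not state whether the training or the validation loss is monitored.

```python
def train_with_early_stopping(epoch_losses, max_epochs=10, patience=3):
    """Return (epochs_run, best_loss) for a sequence of per-epoch losses.

    `epoch_losses` stands in for the loss a real training loop would
    produce after each epoch.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    epochs_run = 0
    for loss in epoch_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best


# Hypothetical loss curve: improves for three epochs, then stalls.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.30]
epochs_run, best_loss = train_with_early_stopping(losses)
```

In this example training stops after the sixth epoch, so the later, lower loss is never reached; this is the usual trade-off of a short patience window.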
All experiments were conducted on Google Colab. The implementation was developed in Python v3.8.8 with TensorFlow v2.13.0 and Keras v2.13.1. The models were trained and tested on the following system: Windows 10; processor: Intel(R) Core(TM) i5-1035G4 CPU @ 1.10 GHz 1.50 GHz; RAM: 8 GB.
The model was trained using the Adam optimizer to minimize the binary cross-entropy loss function. Hyperparameters were empirically selected as follows: a batch size of 32, a learning rate of 0.001, and 10 training epochs.
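For clarity, the binary cross-entropy loss minimized during training has the standard form shown below. The symbols are notation introduced for this sketch: $N$ is the number of training URLs, $y_i \in \{0, 1\}$ is the true label, and $\hat{y}_i$ is the sigmoid output of the final dense layer for the $i$-th URL.

```latex
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + \left(1 - y_i\right) \log\left(1 - \hat{y}_i\right) \right]
```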
The dataset was randomly divided into training and testing subsets using an 80% and 20% ratio. This split was consistently applied across both experimental phases, regardless of dataset size. The same split strategy was applied across baseline models used for comparison to ensure an objective assessment.
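A reproducible 80/20 split that also preserves the class balance could be sketched as follows. Splitting each class separately and the fixed seed are assumptions made for illustration; the text only states that the split was random and consistently applied.

```python
import random


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Shuffle each class separately and hold out `test_fraction` of it.

    `rows` is a list of (url, label) pairs. The seed value is arbitrary.
    """
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Hypothetical balanced toy dataset: 50 benign and 50 malicious URLs.
rows = [(f"benign{i}.example.com", 0) for i in range(50)]
rows += [(f"phish{i}.example.net", 1) for i in range(50)]
train, test = stratified_split(rows)
```

Reusing the same seed for every baseline is one simple way to ensure that all models see identical training and testing subsets.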
The proposed GCA-LSTM model integrates graph convolution, attention, and sequential learning in a unified structure. The embeddings produced by the Graph Convolution (GCN) layer are sequentially passed into the LSTM unit to capture temporal dependencies. An attention mechanism is applied between these components to assign adaptive importance weights to graph-structured features before sequential modelling. The training objective encourages the model to jointly learn both spatial structure from the (GCN) and temporal behaviour from the (LSTM), so that both sources of information contribute to the final binary cross-entropy loss used for classification. The dimensional flow of the model progresses from the input layer (1000 features × 1) through the GCN layer (1000 features × 128 nodes) and the attention layer (128 nodes × 64 units), followed by the LSTM (64 units × 32 units) and the final dense layer (32 units × 1), which produces the classification output.

4.4. Evaluation Metrics

To comprehensively assess the performance of the proposed model, several standard classification metrics were employed. These include:
  • Accuracy: Measures the overall proportion of correctly classified URLs.
  • Precision: Represents the fraction of URLs predicted as malicious that are actually malicious.
  • Recall: Indicates the model’s ability to identify all malicious URLs in the dataset.
  • F1-Score: The harmonic mean of precision and recall, balancing false positives and false negatives.
All metrics were computed based on the confusion matrix generated from the held-out test set. In addition, the training time was recorded to evaluate the computational efficiency of the proposed hybrid architecture.
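The four metrics above follow directly from the confusion-matrix counts, as the short sketch below shows. The function names and the toy label lists are illustrative and are not taken from the experiments; the malicious class (label 1) is treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = malicious as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```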
To further validate robustness, the model’s performance was compared with baseline machine learning and deep learning models under the same experimental settings.

5. Experimental Results and Discussion

5.1. Experimental Runs

Two experimental settings were considered to evaluate the proposed model.
  • Run 1: Training and testing on approximately 100,000 unique URLs after preprocessing.
  • Run 2: Evaluation on an extended dataset of nearly 200,000 unique URLs, created by augmenting the first dataset with additional non-redundant samples.
Table 3 and Table 4 present the performance of various traditional machine learning algorithms evaluated on both datasets. These models serve as baseline algorithms for comparison.
Table 3. Performance of the baseline machine learning models on dataset A.
| Input | Dataset | Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| URLs | Dataset A | XG-Boost | 0.9930 | 0.9989 | 0.9873 | 0.9931 |
| URLs | Dataset A | SVM | 0.9929 | 0.9987 | 0.9873 | 0.9929 |
| URLs | Dataset A | LDA | 0.9863 | 0.9938 | 0.9790 | 0.9864 |
| URLs | Dataset A | KNN | 0.9805 | 0.9883 | 0.9731 | 0.9806 |
| URLs | Dataset A | QDA | 0.9890 | 0.9962 | 0.9821 | 0.9891 |
| URLs | Dataset A | Gaussian NB | 0.9853 | 0.9870 | 0.9841 | 0.9855 |
Table 4. Performance of the baseline machine learning models on dataset B.
| Input | Dataset | Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| URLs | Dataset B | XG-Boost | 0.8429 | 0.8182 | 0.8815 | 0.8487 |
| URLs | Dataset B | SVM | 0.8742 | 0.8402 | 0.9239 | 0.8801 |
| URLs | Dataset B | LDA | 0.8492 | 0.8165 | 0.9009 | 0.8565 |
| URLs | Dataset B | KNN | 0.8531 | 0.8080 | 0.9260 | 0.8630 |
| URLs | Dataset B | QDA | 0.7544 | 0.9605 | 0.5304 | 0.6834 |
| URLs | Dataset B | Gaussian NB | 0.7897 | 0.9283 | 0.6312 | 0.7499 |
Table 5 and Table 6 summarize the results obtained from both runs using some recent deep learning-based models used for comparison.
The results demonstrate that the proposed model consistently outperformed the baselines in both experimental runs, achieving an accuracy above 85% and higher F1-scores on the larger evaluation dataset. Graph-based approaches reliably outperformed purely sequential or feature-engineering-based methods, highlighting the importance of incorporating structural relationships into URL classification tasks.

5.2. Performance Comparison

The proposed hybrid model was compared against nine baseline approaches, including both classical ML algorithms and recent DL architectures: XG-Boost, SVM, LDA, KNN, QDA, Gaussian NB, LSTM, GCN, and GCN-LSTM.

5.2.1. Comparison Against GCN-LSTM

To assess the value of integrating attention mechanisms into graph-based models, a direct comparison was conducted between the proposed GCA-LSTM model and the GCN-LSTM baseline. GCN-LSTM combines graph convolutional layers with LSTM units to capture both local structures using GCN and temporal features using LSTM.
However, the proposed GCA-LSTM enhances this architecture by incorporating an attention mechanism, allowing the model to focus on the most informative parts of the input. While GCN-LSTM processes the input uniformly, the attention mechanism in GCA-LSTM prioritizes critical patterns, which improves classification accuracy.
In the experiments, GCA-LSTM outperformed GCN-LSTM on Dataset B in both accuracy (0.8513 vs. 0.8471) and F1-score (0.8582 vs. 0.8543), while on Dataset A the two models were nearly identical, with GCN-LSTM marginally ahead (accuracy 0.9917 vs. 0.9909). This suggests that integrating attention improves the model’s focus and adaptability to complex patterns as the data grows, especially when classifying URL-based textual inputs.
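The contrast between uniform processing and attention-based prioritization can be illustrated with a minimal sketch. It compares plain mean pooling with softmax-weighted pooling over a small set of feature vectors. The dot-product scoring and the `query` vector are assumptions made for illustration; they are not the learned attention parameters of the proposed model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_pool(features):
    """Average all feature vectors with equal weight."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def attention_pool(features, query):
    """Weight each feature vector by softmax(dot(query, feature))."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Hypothetical 2-dimensional features for three URL components.
    feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    pooled, weights = attention_pool(feats, query=[2.0, 0.0])
    print("uniform:", uniform_pool(feats))
    print("attention:", pooled, "weights:", weights)
```

With equal weights every component contributes one third of the pooled vector, whereas the attention weights shift most of the mass toward the component that best matches the query.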

5.2.2. Comparison Against Deep Learning Models

To further validate the impact of combining temporal and graph-based features, we compared GCA-LSTM to standalone deep learning models: LSTM and GCN.
  • LSTM: Although capable of modeling sequential dependencies, its performance was limited due to the non-sequential nature of URL data. URLs often encode structured patterns rather than time-based sequences, making LSTM less effective in this context. To ensure fair comparisons, all baseline models were trained under identical experimental settings, including the same input features (TF-IDF vectors), binary classification targets, and consistent train–test splits. The LSTM model exhibited notable training difficulties on Dataset B, likely due to the dataset’s high dimensionality and large size. Its sequential architecture and recurrent dependencies make it more sensitive to long input sequences and memory constraints, which may explain its lower performance in this context despite being trained under the same conditions as the other baseline models. Furthermore, unlike natural language, URLs are characterized by structural rather than linguistic dependencies, making graph-based and attention-driven approaches more effective for capturing their underlying patterns.
  • GCN: GCN captures structural relationships between data points but lacks the ability to model sequential context or prioritize important parts of the input without an attention mechanism.
  • ANN: To provide a simple reference point, an Artificial Neural Network (ANN) was included as a baseline. The ANN is a conventional feed-forward network, suitable for testing textual datasets and verifying learning efficiency under minimal architectural complexity. ANN achieved slightly higher accuracy than GCA-LSTM on Dataset A (0.9921 vs. 0.9909), an outcome that is expected given its simplicity and the dataset’s limited structural diversity; it also scored higher on Dataset B (0.8671 vs. 0.8513). Nevertheless, the proposed GCA-LSTM demonstrated competitive performance while incorporating graph-based and sequential components, indicating its ability to generalize across both structural and temporal dependencies, an essential characteristic for real-world malicious URL analysis.
The results showed that GCA-LSTM achieved superior performance compared to both LSTM and GCN individually, validating the benefit of combining attention-based enhancement with hybrid modeling strategies.
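The TF-IDF input representation shared by the compared models can be sketched in a few lines of standard-library Python. The tokenizer (splitting on non-alphanumeric characters), the smoothed IDF formula, and the example URLs are all assumptions for illustration; the exact preprocessing used in the experiments is not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (illustrative choice)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def tfidf(urls):
    """Return (vocabulary, vectors) with term frequency times smoothed IDF."""
    docs = [Counter(tokenize(u)) for u in urls]
    n_docs = len(docs)
    # Document frequency: number of URLs containing each token.
    df = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(df)
    # A common smoothed form: ln((1 + N) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append([doc[tok] / total * idf[tok] if total else 0.0 for tok in vocab])
    return vocab, vectors


if __name__ == "__main__":
    # Hypothetical URLs, not taken from the datasets.
    urls = [
        "http://example.com/login",
        "http://example.com/about",
        "http://secure-login.example.net/verify",
    ]
    vocab, vectors = tfidf(urls)
    for url, vec in zip(urls, vectors):
        print(url, [round(v, 3) for v in vec])
```

Tokens that appear in every URL, such as the scheme, receive the lowest IDF, while rare tokens receive higher weights. The resulting vectors are high-dimensional and sparse, and they carry no information about token order.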

5.2.3. Comparison with Machine Learning Algorithms

In alignment with prior work, several traditional machine learning models were also used as baselines, including SVM, XG-Boost, LDA, QDA, KNN, and GaussianNB. Below is a brief description of each and a summary of their behavior on the dataset:
  • XG-Boost: A powerful gradient boosting method that performed strongly on the dataset due to its ability to handle structured data like TF-IDF representations of URLs [28]. It consistently ranked among the top-performing models.
  • SVM: Well-suited for high-dimensional text data, SVM also achieved competitive results and showed reliable performance in classification tasks [29].
  • LDA & QDA: These discriminant analysis methods performed reasonably well, especially LDA. QDA required more data to generalize effectively, which affected its consistency [30].
  • KNN: Performance dropped significantly due to the high dimensionality and large volume of the dataset, which KNN struggles with [31].
  • GaussianNB: This model assumes a Gaussian distribution of features, which does not align well with TF-IDF vectors, resulting in lower performance [32].
Overall, while some traditional models such as XG-Boost and SVM delivered competitive results, especially on Dataset A, the performance of all six declined as the dataset scale increased. QDA and Gaussian NB showed the largest drops in accuracy and F1-score on Dataset B (accuracy 0.7544 and 0.7897, respectively), suggesting limitations in handling high-dimensional, large-scale textual data.
In contrast, our proposed hybrid model consistently maintained high performance across both datasets, particularly excelling after dataset expansion. This highlights the robustness and scalability of the model, making it more suitable for real-world applications where data complexity and volume are significant.
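The change in accuracy between the two runs can be tabulated directly from the reported values. The sketch below copies the accuracy figures from Tables 3 to 6 for the classical baselines and the proposed model and ranks them by the size of the drop; the helper names are illustrative.

```python
# (accuracy on Dataset A, accuracy on Dataset B), copied from Tables 3-6.
ACCURACY = {
    "XG-Boost": (0.9930, 0.8429),
    "SVM": (0.9929, 0.8742),
    "LDA": (0.9863, 0.8492),
    "KNN": (0.9805, 0.8531),
    "QDA": (0.9890, 0.7544),
    "Gaussian NB": (0.9853, 0.7897),
    "GCA-LSTM": (0.9909, 0.8513),
}


def accuracy_drops(table=ACCURACY):
    """Absolute accuracy drop from Dataset A to Dataset B per model."""
    return {model: round(a - b, 4) for model, (a, b) in table.items()}


def rank_by_drop(table=ACCURACY):
    """Model names ordered from smallest to largest drop."""
    drops = accuracy_drops(table)
    return sorted(drops, key=drops.get)


if __name__ == "__main__":
    drops = accuracy_drops()
    for model in rank_by_drop():
        print(f"{model}: {drops[model]:.4f}")
```

Under these reported values, QDA shows the largest drop (0.2346) and SVM the smallest (0.1187).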

5.2.4. Graph-Based Models

Both GCA-LSTM and GCN-LSTM, as graph-enhanced models, showed better alignment with the characteristics of the data. Their ability to handle structured relationships across different URL segments enabled better pattern extraction and classification accuracy. Additionally, the inclusion of attention in GCA-LSTM provided further advantage by guiding the model to focus on the most informative parts of the TF-IDF input, rather than treating all features equally.
Figure 4 and Figure 5 illustrate the accuracy and F1-scores obtained by the deep learning models, including the proposed hybrid model and the GCN-LSTM model. Accuracy and F1-score were selected for the comparative charts because they provide the most balanced and interpretable assessment of model performance in classification tasks.
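To make the structural modeling in the graph-based models concrete, the following minimal sketch implements a single graph convolution step in its standard form, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The three-node chain graph, the feature values, and the weight matrix are hypothetical; how the URL graph is actually constructed in the proposed model is not reproduced here.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization.

    Assumes `adj` is a square, symmetric matrix with a zero diagonal.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weights):
    """One graph convolution: propagate over neighbors, transform, apply ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


if __name__ == "__main__":
    # Hypothetical chain of three URL segments: node 0 - node 1 - node 2.
    adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights = [[1.0, -1.0], [0.5, 0.5]]
    for row in gcn_layer(adj, features, weights):
        print([round(v, 3) for v in row])
```

Each output row mixes a node's own features with those of its neighbors, which is how a graph layer can relate different URL segments to each other.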

5.2.5. Evaluation of Training Time

Following initial experiments, Dataset B was introduced to evaluate model scalability. This helped to assess whether increased data volume impacts performance and training dynamics.
As shown in Figure 6, the training time analysis revealed that while GCA-LSTM had a slightly higher training time on Dataset A, it converged faster on Dataset B and trained more quickly than models such as GCN-LSTM. This suggests that attention mechanisms can improve scalability and training efficiency as data complexity increases.

5.3. Discussion

The effective performance of the proposed hybrid architecture can be attributed to three key factors:
  • Structural modeling by the GCN: Captures complex relationships within URL components that sequential models fail to capture.
  • Sequential dependency modeling by the LSTM: Preserves the order of characters, which is critical for detecting subtle malicious patterns.
  • Feature prioritization by the attention mechanism: Focuses on the most discriminative parts of the URL, improving robustness to noise and irrelevant tokens.
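The order sensitivity attributed to the LSTM component above can be demonstrated with a single-unit LSTM cell using the standard gate equations. The scalar parameters below are hypothetical values chosen for illustration, not trained weights from the proposed model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell with scalar input and state.

    `params` maps each gate name ("i", "f", "o", "g") to a tuple
    (input weight, recurrent weight, bias).
    """
    pre = {}
    for name in ("i", "f", "o", "g"):
        w_x, w_h, b = params[name]
        pre[name] = w_x * x + w_h * h_prev + b
    i = sigmoid(pre["i"])    # input gate
    f = sigmoid(pre["f"])    # forget gate
    o = sigmoid(pre["o"])    # output gate
    g = math.tanh(pre["g"])  # candidate cell value
    c = f * c_prev + i * g   # updated cell state
    h = o * math.tanh(c)     # updated hidden state
    return h, c


def run_sequence(xs, params):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h


if __name__ == "__main__":
    # Hypothetical parameters shared by all four gates.
    params = {name: (1.0, 0.5, 0.0) for name in ("i", "f", "o", "g")}
    print(run_sequence([1.0, 0.0, -1.0], params))
    print(run_sequence([-1.0, 0.0, 1.0], params))
```

The two calls receive the same values in reverse order and return different final states, which is the property that lets a recurrent layer distinguish inputs that an order-insensitive representation would treat as identical.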
Moreover, the results demonstrate that using a larger dataset in Run 2 led to more consistent model behavior and improved generalization performance, without relying on synthetic duplication techniques.
While the integration of GCN, Attention, and LSTM significantly increased training time, especially during the second experimental run with a larger dataset, the resulting performance demonstrated high classification accuracy. This indicates that despite higher training cost, the model remains well-suited for deployment in real-world and automated URL filtering systems.
Although this study focused specifically on URLs and web-based content, other domains have also adopted the spatial-temporal framework to address diverse prediction challenges, such as traffic flow forecasting [27], complex traffic network security [33], and environmental monitoring [34]. Our work stands alongside these studies by extending the spatial-temporal concept to malicious URL detection, contributing a perspective that can guide future research on detection and model enhancement.

6. Conclusions and Future Work

In this paper, we proposed a hybrid deep learning framework for malicious URL classification that integrates Graph Convolutional Networks (GCN), an attention mechanism, and Long Short-Term Memory (LSTM) networks. The model leverages both structural and sequential representations of URLs, aiming for more accurate and robust classification than traditional ML approaches and recent DL baselines. Experimental results on large-scale datasets demonstrated the effectiveness of the proposed model, achieving accuracy exceeding 99% on Dataset A and 85% on Dataset B and performing on par with or better than the graph-based and sequential deep learning baselines across multiple evaluation metrics. This study addresses the growing need to enhance URL filtering technologies, given the significant risks that malicious URLs pose to both users and security infrastructures. It explores the application of graph-based deep learning, particularly Graph Convolutional Networks (GCNs), due to their ability to model the structural characteristics of URLs. In addition, the study demonstrates how combining structural learning, sequential modeling, and feature weighting within a single architecture can enhance URL classification and support more effective filtering of malicious links. Beyond its experimental contributions, this work establishes a foundational baseline for future research, providing a flexible architecture that can be extended, combined, or adapted to advance the development of more robust solutions for malicious URL classification.
Although the current evaluation did not specifically test the model on shortened or dynamically generated URLs, these types of URLs often present unique challenges in real-world environments. Future work will therefore consider extending the model to handle such cases more effectively. One potential direction involves experimenting with additional data sources that incorporate HTML content and JavaScript-based features. Alongside this, future work will include general improvements in data preprocessing and validation strategies. These enhancements combined with alternative feature extraction techniques aim to determine the most effective approach for modeling and analyzing URL-related components. Lastly, future efforts will focus on deploying the model in real-time environments and evaluating its robustness against adversarial attacks, with the goal of assessing its practical readiness for deployment in cybersecurity systems.

Author Contributions

S.M.A. led the research process, including the literature review, methodology development, experimentation, and drafting of the manuscript. S.A.S. and R.A.M. provided critical academic supervision, conducted thorough proofreading, and offered insightful recommendations that enriched the structure and content of the paper, including the addition of sections and clarification of overlooked aspects. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in public repositories: Kaggle, https://www.kaggle.com/datasets/cheedcheed/top1m (accessed on 1 September 2022); OpenPhish, https://openphish.com/faq.html (accessed on 10 September 2022); and PhishTank, http://www.phishtank.com (accessed on 26 September 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| ANN | Artificial Neural Network |
| CNN | Convolutional Neural Network |
| DL | Deep Learning |
| GAN | Generative Adversarial Network |
| GCN | Graph Convolutional Network |
| GNN | Graph Neural Network |
| LSTM | Long Short-Term Memory Network |
| ML | Machine Learning |

References

  1. Hajgude, J.; Ragha, L. Phish Mail Guard: Phishing Mail Detection Technique by Using Textual and URL Analysis. In Proceedings of the 2012 World Congress on Information and Communication Technologies (WICT), Trivandrum, India, 30 October–2 November 2012; pp. 297–302.
  2. Zieni, R.; Massari, L.; Calzarossa, M.C. Phishing or Not Phishing? A Survey on the Detection of Phishing Websites. IEEE Access 2023, 11, 18499–18519.
  3. Valiyaveedu, N.; Jamal, S.; Reju, R.; Murali, V.; M, N.K. Survey and Analysis on AI-Based Phishing Detection Techniques. In Proceedings of the 2021 International Conference on Communication, Control and Information Sciences (ICCISc), Idukki, India, 16–18 June 2021; pp. 1–6.
  4. Dhingra, M.; Jain, M.; Jadon, R.S. Role of Artificial Intelligence in Enterprise Information Security: A Review. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 188–191.
  5. Ongsulee, P. Artificial Intelligence, Machine Learning and Deep Learning. In Proceedings of the 2017 15th International Conference on ICT and Knowledge Engineering (ICT&KE), Bangkok, Thailand, 22–24 November 2017; pp. 1–6.
  6. Al-Milli, N.; Hammo, B.H. A Convolutional Neural Network Model to Detect Illegitimate URLs. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 220–225.
  7. Zhang, Y.; Hong, J.I.; Cranor, L.F. CANTINA: A Content-Based Approach to Detecting Phishing Web Sites. In Proceedings of the 2007 16th International Conference on World Wide Web (WWW’07), New York, NY, USA, 8–12 May 2007; pp. 639–648.
  8. Ma, J.; Saul, L.K.; Savage, S.; Voelker, G.M. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. In Proceedings of the 2009 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09), New York, NY, USA, 28 June–1 July 2009; pp. 1245–1254.
  9. Chen, C.M.; Huang, J.J.; Ou, Y.H. Efficient Suspicious URL Filtering Based on Reputation. J. Inf. Secur. Appl. 2015, 20, 26–36.
  10. Singh, S.; Singh, M.P.; Pandey, R. Phishing Detection from URLs Using Deep Learning Approach. In Proceedings of the 2020 5th International Conference on Computing, Communication and Security (ICCCS), Patna, India, 14–16 October 2020; pp. 1–4.
  11. Bahnsen, A.C.; Bohorquez, E.C.; Villegas, S.; Vargas, J.; González, F.A. Classifying Phishing URLs Using Recurrent Neural Networks. In Proceedings of the 2017 APWG Symposium on Electronic Crime Research (eCrime), Scottsdale, AZ, USA, 25–27 April 2017; pp. 1–8.
  12. Ren, F.; Jiang, Z.; Liu, J. A Bi-Directional LSTM Model with Attention for Malicious URL Detection. In Proceedings of the 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 13 February 2020; pp. 300–305.
  13. Su, Y. Research on Website Phishing Detection Based on LSTM RNN. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; pp. 284–288.
  14. Yao, W.; Ding, Y.; Li, X. Deep Learning for Phishing Detection. In Proceedings of the 2018 IEEE International Conference on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), Melbourne, Australia, 11–13 December 2018; IEEE: New York, NY, USA, 2018; pp. 645–650.
  15. Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Deep Learning with Convolutional Neural Network and Long Short-Term Memory for Phishing Detection. In Proceedings of the 2019 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), Island of Ulkulhas, Maldives, 26–28 August 2019; pp. 1–8.
  16. Pooja, A.S.S.V.L.; Sridhar, M. Analysis of Phishing Website Detection Using CNN and Bidirectional LSTM. In Proceedings of the 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 28 December 2020; pp. 1620–1629.
  17. Ouyang, L.; Zhang, Y. Phishing Web Page Detection with HTML-Level Graph Neural Network. In Proceedings of the 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Shenyang, China, 20–22 October 2021; pp. 952–958.
  18. Huang, Y.; Negrete, J.; Wagener, J.; Fralick, C.; Rodriguez, A.; Peterson, E. Graph Neural Networks and Cross-Protocol Analysis for Detecting Malicious IP Addresses. Complex Intell. Syst. 2023, 9, 3857–3869.
  19. Ariyadasa, S.; Fernando, S.; Fernando, S. Combining Long-Term Recurrent Convolutional and Graph Convolutional Networks to Detect Phishing Sites Using URL and HTML. IEEE Access 2022, 10, 82355–82375.
  20. Alexa. Alexa Dataset Repository (Kaggle). Available online: https://www.kaggle.com/ (accessed on 1 September 2022).
  21. PhishTank. PhishTank Developer Information. Available online: https://www.phishtank.com/ (accessed on 26 September 2022).
  22. OpenPhish. Open Phish: Real-Time Phishing Intelligence Feed. Available online: https://openphish.com/ (accessed on 10 September 2022).
  23. Mondal, D.K.; Singh, B.C.; Hu, H.; Biswas, S.; Alom, Z.; Azim, M.A. SeizeMaliciousURL: A Novel Learning Approach to Detect Malicious URLs. J. Inf. Secur. Appl. 2021, 62, 102967.
  24. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. AI Open 2020, 1, 57–81.
  25. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903. Available online: https://arxiv.org/abs/1710.10903 (accessed on 25 January 2024).
  26. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  27. Ma, X.; Yin, Y.; Jin, Y.; He, M.; Zhu, M. Short-Term Prediction of Bike-Sharing Demand Using Multi-Source Data: A Spatial-Temporal Graph Attentional LSTM Approach. Appl. Sci. 2022, 12, 1161.
  28. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
  29. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Machine Learning: ECML-98; Nédellec, C., Rouveirol, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1398, pp. 137–142.
  30. Grouven, U.; Bergel, F.; Schultz, A. Implementation of Linear and Quadratic Discriminant Analysis Incorporating Costs of Misclassification. Comput. Methods Programs Biomed. 1996, 49, 55–60.
  31. Manocha, S.; Girolami, M.A. An Empirical Analysis of the Probabilistic K-Nearest Neighbour Classifier. Pattern Recognit. Lett. 2007, 28, 1818–1824.
  32. Bhavitha, B.K.; Rodrigues, A.P.; Chiplunkar, N.N. Comparative Study of Machine Learning Techniques in Sentimental Analysis. In Proceedings of the 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 10–11 March 2017; IEEE: New York, NY, USA, 2017; pp. 216–221.
  33. Hong, S.; Yue, T.; You, Y.; Lv, Z.; Tang, X.; Hu, J.; Yin, H. A Resilience Recovery Method for Complex Traffic Network Security Based on Trend Forecasting. Int. J. Intell. Syst. 2025, 2025, 3715086.
  34. Guan, X.; Mo, X.; Li, H. A Novel Spatio-Temporal Graph Convolutional Network with Attention Mechanism for PM2.5 Concentration Prediction. Mach. Learn. Knowl. Extr. 2025, 7, 88.
Figure 1. Multi-head attention method. Each arrow represents attention weights from one head. Aggregated features form the final node representation [25].
Figure 2. LSTM cell [13].
Figure 3. Architecture of the proposed hybrid GCN–Attention–LSTM (GCA-LSTM) model.
Figure 4. Model performance comparison on Dataset A.
Figure 5. Model performance comparison on Dataset B.
Figure 6. Training time across deep learning-based models for the two datasets.
Table 1. Summary of recent studies on URL malicious detection and classification.
| Study | Dataset | Model Type | Accuracy (%) |
|---|---|---|---|
| [7] | Real-world phishing sites & legitimate pages | Content-based TF-IDF lexical signatures | 94–97 |
| [8] | Public blacklists (PhishTank (San Francisco, CA, USA), Yahoo (Sunnyvale, CA, USA), DMOZ (Netscape, CA, USA)) | ML-based blacklist learning model (Naïve Bayes, SVM, Logistic Regression) | ≈96 |
| [9] | Real web requests | Two-stage model: reputation filter, sandbox analysis | 94 |
| [6] | Public URL dataset | CNN-based model | 94.3 |
| [10] | Public URL dataset | CNN with token embedding | 98 |
| [11] | Public URL dataset (PhishTank, Common Crawl (San Francisco, CA, USA)) | LSTM-RNN | 98 |
| [12] | Public URL dataset | Bi-LSTM-based model | 98.7 |
| [13] | Yahoo Dictionary, PhishTank | LSTM-based model | 99.1 |
| [14] | QR-code websites data | Faster R-CNN | 91 (Precision) |
| [15] | Public URL dataset | IPDS (CNN, LSTM hybrid) | 93 |
| [16] | Public URL dataset | Hybrid DL, XGBoost | 95 |
| [17] | DOM tree, RNN features | GCN-based model | 85.28 |
| [18] | IP address dataset | GCN with Random Forest | 96.4 |
| [19] | URL, HTML dataset | LRCN, GCN (PhishDet) | 98.7 |
Table 2. Sample distribution per class for Dataset A and B.
| Dataset | Class | No. of URLs | Total |
|---|---|---|---|
| Dataset A | Phishing URLs | 56,937 | 113,874 |
| Dataset A | Benign URLs | 56,937 | |
| Dataset B | Phishing URLs | 97,399 | 194,798 |
| Dataset B | Benign URLs | 97,399 | |
Table 5. Performance of the proposed model and deep learning-based models on Dataset A.
| Input | Dataset | Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| URLs | Dataset A | GCA-LSTM | 0.9909 | 0.9955 | 0.9865 | 0.9910 |
| URLs | Dataset A | GCN-LSTM | 0.9917 | 0.9967 | 0.9869 | 0.9918 |
| URLs | Dataset A | LSTM | 0.6184 | 0.9734 | 0.2540 | 0.4029 |
| URLs | Dataset A | GCN | 0.9898 | 0.9935 | 0.9863 | 0.9899 |
| URLs | Dataset A | ANN | 0.9921 | 0.9867 | 0.9974 | 0.9920 |
Table 6. Performance of the proposed model and deep learning-based models on Dataset B.
| Input | Dataset | Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| URLs | Dataset B | GCA-LSTM | 0.8513 | 0.8195 | 0.9009 | 0.8582 |
| URLs | Dataset B | GCN-LSTM | 0.8471 | 0.8152 | 0.8974 | 0.8543 |
| URLs | Dataset B | LSTM | 0.5741 | 0.7595 | 0.2161 | 0.3364 |
| URLs | Dataset B | GCN | 0.8504 | 0.8177 | 0.9019 | 0.8576 |
| URLs | Dataset B | ANN | 0.8671 | 0.8286 | 0.9255 | 0.8744 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alshehri, S.M.; Sharaf, S.A.; Molla, R.A. A Hybrid Graph Neural Network Framework for Malicious URL Classification. Electronics 2025, 14, 4387. https://doi.org/10.3390/electronics14224387
