Sensors
  • Article
  • Open Access

16 October 2023

BERT-Based Approaches to Identifying Malicious URLs

Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan City 333, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Data Engineering in the Internet of Things

Abstract

Malicious uniform resource locators (URLs) are prevalent in cyberattacks, particularly in phishing attempts aimed at stealing sensitive information or distributing malware. Accurately detecting malicious URLs is therefore of paramount importance. Prior research has explored deep-learning models that segment URL strings into character-level or word-level tokens, embed the tokens, and train a model to distinguish malicious URLs from benign ones. In this study, a model based on bidirectional encoder representations from transformers (BERT) was devised to tokenize URL strings and to use its self-attention mechanism to capture the correlations among tokens. Subsequently, a classifier was employed to determine whether a given URL was malicious. In evaluating the proposed methods, three different types of public datasets were utilized: a dataset consisting solely of URL strings from Kaggle, a dataset containing only URL features from GitHub, and a dataset including both types of data from the University of New Brunswick, namely, ISCX 2016. The proposed system achieved accuracy rates of 98.78%, 96.71%, and 99.98% on the three datasets, respectively. Additionally, experiments were conducted on two datasets from different domains, the Internet of Things (IoT) and Domain Name System over HTTPS (DoH), to demonstrate the versatility of the proposed model.

1. Introduction

In July 2022, the Interisle Consulting Group published a report on the phishing landscape covering the period from 1 May 2021 to 30 April 2022 []. The report highlighted that over 3 million phishing events were detected, resulting in 1.1 million unique phishing attacks during this period. Compared to a previous annual report released in July 2021, the number of unique phishing attacks increased by 61%. Additionally, there was a 72% rise in malicious domain names and an 83% increase in registered domain names used by phishers. Furthermore, cryptocurrency phishing attacks experienced a significant surge of 257%, explicitly targeting digital currency wallets and exchanges. According to Trend Micro’s 2022 Cyber Security Report [], over 16 million phishing attempts were detected in 2021 worldwide, a 2.3-fold increase from the previous year. Of these incidents, 62% originated from spam and 38% were associated with fake login pages. Furthermore, 90% of data breaches in 2021 were attributed to phishing emails. The report further highlighted the increasing vulnerability of non-fungible tokens (NFTs) to fraud, with scams involving fake NFT exchange domains and deceptive websites that trick users into linking their wallets while facilitating subsequent attacks.
URL analysis typically involves feature extraction or character embedding. Feature extraction identifies essential URL attributes, such as domain, length, and character count, for input into classification algorithms. Early machine learning models [,,,,,] relied on manual feature extraction for accuracy. However, their dependence on historical data and time-consuming pre-processing limited their effectiveness for real-time defense against cyberattacks. Character embedding converts individual URL characters into vectors, enabling deep-learning models to assess URL maliciousness. However, it has limitations: the same character can carry different meanings at different positions in a URL, and the relationships between characters are not considered. To address this gap, ensemble models [,,,,] have often been used alongside character embeddings to capture URL features before making decisions. Nevertheless, these malicious URL detection models were primarily tailored to datasets of URL strings and lack the versatility to handle different data formats or domains effectively.
Therefore, to overcome these limitations, this paper introduces a BERT-based (bidirectional encoder representations from transformers) [] approach to enhance the detection of malicious URLs. This model excels at effectively capturing semantic relationships. In particular, this study conducts a comprehensive evaluation using various publicly available datasets, including those from Kaggle, GitHub, and ISCX 2016, to ensure a rigorous and robust analysis. The proposed model achieves remarkably high accuracy rates across all the datasets. Furthermore, the study assesses the model’s performance on both direct URL strings and their derived feature representations, showcasing its versatility. Additionally, the research extends the model’s applicability beyond detecting malicious URLs to include attacks in IoT and DoH domains, highlighting its flexibility.
The remaining sections of this paper are organized as follows: Section 2 offers an overview of related research and highlights the contributions of this study. Section 3 details the proposed approaches for URL analysis. Section 4 presents the experimental results on a variety of datasets, including accuracy, extensibility, and prediction time requirements. Lastly, Section 5 provides a summary of the advantages, performance, and limitations of the proposed system.

3. The Proposed Methodology

Public datasets on URLs can be categorized into two types. One provides only the URL string (Figure 1a), like those on Kaggle [], and the other offers various features extracted from the URLs (Figure 1b), seen in datasets on GitHub [], with 111 features per URL. Some datasets, such as ISCX 2016 [], include both URL strings and features.
Figure 1. Different types of URL datasets. (a) URL strings and labels (https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset) (accessed on 26 August 2023). (b) URL features and labels (without URL string) (https://github.com/GregaVrbancic/Phishing-Dataset) (accessed on 26 August 2023).
The approach in Figure 2 was designed to manage both types of datasets. When dealing with datasets containing only URL strings, BERT was utilized for tokenization, leveraging its self-attention mechanism to grasp semantic meaning. A subsequent classifier determined the maliciousness of each URL. For datasets comprising URL features without URL strings, the random forest algorithm was used for feature engineering: it selected the key features, which were then concatenated into a feature string for each URL entry. Subsequently, the same BERT process was applied to the feature strings of all entries. Essentially, the proposed system employed BERT for both URL strings and feature strings, facilitating the effective analysis and classification of malicious URLs. The procedure is given in Algorithm 1.
Algorithm 1: URL Classification
Input: A dataset of labeled URLs or features, divided into 80% training and 20% testing.
Output: Confusion matrix.
1. If the dataset contains URLs, use the training dataset to fine-tune the pre-trained BERT model and use the test dataset to evaluate.
2. Else: // for the dataset with only features
3.     Select k important features using the random forest algorithm.
4.     If the dataset is imbalanced, apply the SMOTE algorithm [] to balance classes with fewer instances.
5.     Normalize the data.
6.     Concatenate selected features of each entry into a feature string with “/” as a separator.
7.     Fine-tune the pre-trained BERT model using the training dataset.
8.     Evaluate the model using the testing dataset.
9.     Output the confusion matrix accompanied by accuracy, precision, and recall rate.
10. Endif
Figure 2. The proposed approach.

3.1. Data Pre-Processing

This section outlines the distinct pre-processing steps for the two types of datasets.

3.1.1. URL String and BERT Tokenization

For datasets containing only URL strings and labels, URL strings were used as the inputs for BERT tokenization. Unlike traditional character-level tokenization that assigns the same embeddings to identical characters, BERT tokenization takes into account the significance of letters within different vocabularies. In this study, the bert-base-cased model was employed for tokenizing the URL strings. The BERT token dictionary (see Figure 3a) and an example of URL tokenization (see Figure 3b) are provided. For example, the URL string “br-icloud.com.br” is divided into 10 tokens.
Figure 3. Illustration of BERT tokenization for an example URL string. (a) Part of the BERT dictionary and (b) BERT tokenization for the URL string.
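The sub-word splitting described above can be illustrated with a minimal sketch of WordPiece-style greedy longest-match tokenization. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and chosen only for illustration; the real bert-base-cased vocabulary is far larger, so the actual split of a URL may differ from what this toy vocabulary produces.

```python
import re

# Hypothetical toy vocabulary (assumption); not the real bert-base-cased vocabulary.
TOY_VOCAB = {"[UNK]", "br", "com", "i", "##cloud", "-", "."}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one chunk into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole chunk as unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_url(url, vocab=TOY_VOCAB):
    """Split on punctuation first, then apply WordPiece to each chunk."""
    chunks = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", url)
    tokens = []
    for chunk in chunks:
        tokens.extend(wordpiece(chunk, vocab))
    return tokens


print(tokenize_url("br-icloud.com.br"))
```

Because the same characters can end up in different pieces depending on their neighbours, each piece receives its own embedding, unlike a character-level scheme in which every occurrence of a letter shares one embedding.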

3.1.2. URL Feature String and BERT Tokenization

The pre-processing complexity increases when a dataset incorporates URL-related features without including URL strings. In the GitHub dataset [], each entry is associated with 111 features and a label. This study selected essential features using the random forest algorithm and combined them into a feature string representation of the entry. To preserve feature integrity and prevent arbitrary tokenization during BERT processing, a separator “/” was added between essential features, as illustrated in Figure 4. For example, when two adjacent features have values of 25 and 24, the “/” separator results in the feature string “25/24” being used and split into three tokens: {“25”, “/”, and “24”}. Without the separator, “2524” could be split into tokens like {“2524”}, {“252” and “4”}, or {“2” and “524”}. The “/” separator helps maintain the integrity and distinction of individual features during tokenization.
Figure 4. Illustration of BERT tokenization for URL features.
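As a rough sketch of the concatenation step, the function below joins the selected feature values of one entry with the “/” separator. The feature names and values are made up for illustration, and `to_feature_string` is a hypothetical helper, not code from the study.

```python
def to_feature_string(row, selected, sep="/"):
    """Join the selected feature values of one entry into a single string."""
    return sep.join(str(row[name]) for name in selected)


# Hypothetical entry with made-up feature names and values (assumption).
entry = {"feature_a": 25, "feature_b": 24, "feature_c": 3}
print(to_feature_string(entry, ["feature_a", "feature_b"]))  # 25/24
```

Keeping the separator explicit means that adjacent values such as 25 and 24 cannot be merged into a single run of digits before tokenization.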

3.2. Fine-Tuning the BERT Model

The pre-trained BERT model was designed for natural language text, so its performance must be enhanced when it is applied to non-natural language strings such as URLs. Transfer learning techniques were employed for this purpose. The choice of the pre-trained bert-base-cased model was motivated by its sensitivity to character case in URLs. Input for the BERT model consists of three tensors, as depicted in Figure 5. The “tokens” tensor captures token embeddings, exemplified by the URL “br-icloud.com.br” being divided into 10 tokens. The initial {CLS} token marks the start of input and carries overall semantics. If the URL is short, {pad} tokens are appended at the end. The “segments” tensor distinguishes between sentences A (i.e., 0) and B (i.e., 1). However, given a single URL input, all values in the “segments” tensor are set to 0. Lastly, the “attention masks” tensor determines the scope of self-attention. In this study, the entire URL was treated as a single sentence, necessitating attention for all tokens. Consequently, all values in the “attention masks” tensor were set to 1.
Figure 5. BERT input format.
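The three input tensors can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `[SEP]` terminator and the mask value of 0 on padding positions follow the usual BERT input convention rather than anything stated in the text, which describes a mask of 1 for every URL token. The function name and `max_len` parameter are hypothetical.

```python
def build_inputs(tokens, max_len, pad_token="[PAD]"):
    """Build the token, segment, and attention-mask lists for one URL."""
    # Reserve two positions for the special tokens, truncating if needed.
    seq = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    n_pad = max_len - len(seq)
    # Real tokens are attended to (1); padding is masked out (0, assumption).
    attention_mask = [1] * len(seq) + [0] * n_pad
    seq = seq + [pad_token] * n_pad
    # A single URL is one "sentence", so every segment id is 0.
    segments = [0] * max_len
    return seq, segments, attention_mask


tokens, segments, mask = build_inputs(["br", "-", "i", "##cloud"], max_len=8)
print(tokens)
print(segments)
print(mask)
```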
The output of the self-attention processing, representing the entire string—whether a URL string or a URL feature string—was encapsulated in the corresponding output {CLS}. A classifier was added to this {CLS} output for fine-tuning, as seen in Figure 6. In this study, the BertForSequenceClassification classifier was utilized. Once trained, the model could determine whether a given URL is normal or malicious. A brief introduction to the self-attention concept is provided below, while in-depth details about the BERT mechanism are available in [].
Figure 6. Illustration of self-attention, using token x1 as an example.
Consider $X = [x_1, x_2, \ldots, x_n]$ as the sequence of tokens within a URL, including {CLS}, as illustrated in Figure 6. Each input token $x_i$, where $1 \le i \le n$, is transformed into a vector and multiplied by three weight matrices, $W^Q$, $W^K$, and $W^V$, resulting in the corresponding triplet $q_i$, $k_i$, and $v_i$, as indicated in Equations (1)–(3), respectively. The three matrices, $W^Q$, $W^K$, and $W^V$, are derived through a learning process. The vector $q_i$ serves as the query, $k_i$ serves as the key to be queried, and $v_i$ represents the token’s information. For clarity, token $x_1$ is employed to illustrate self-attention in Figure 6.
$$x_i \cdot W^Q = q_i \tag{1}$$
$$x_i \cdot W^K = k_i \tag{2}$$
$$x_i \cdot W^V = v_i \tag{3}$$
In Figure 6, $q_1$ is used to query $k_j$, where $1 \le j \le n$, through an inner product operation, as described in Equation (4). The value $\alpha_{1,j}$ represents the attention score of token $x_1$ towards token $x_j$. Subsequently, the Softmax function is applied to determine the proportions $\alpha'_{1,j}$ by which token $x_1$ should be influenced by each token, as depicted in Equation (5). Ultimately, the output of token $x_1$, denoted as $y_1$, is obtained by summing the contributions from each token, as demonstrated in Equation (6).
$$q_1 \cdot k_j = \alpha_{1,j} \tag{4}$$
$$\alpha'_{1,j} = \frac{\exp(\alpha_{1,j})}{\sum_{m=1}^{n} \exp(\alpha_{1,m})} \tag{5}$$
$$y_1 = \sum_{j=1}^{n} \alpha'_{1,j}\, v_j \tag{6}$$
Note that the outputs of all tokens $x_i$, where $1 \le i \le n$, can be calculated in parallel using Equation (7), where $Q = [q_1, q_2, \ldots, q_n]$, $K = [k_1, k_2, \ldots, k_n]$, $V = [v_1, v_2, \ldots, v_n]$, and $d_k$ denotes the dimension of key $k$. The output of BERT is a 768-dimensional vector.
$$\mathrm{self\_attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{7}$$
In summary, the model divides the URL into individual tokens and employs an attention mechanism to calculate contextual relationships among them. Each token is evaluated for its attention score in relation to others using Equations (1)–(6). Finally, Equation (7) is applied to extract the comprehensive semantic meaning of the entire URL, with the aim of determining whether the URL is malicious.
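The computation in Equations (1)–(7) can be sketched for a single attention head in plain Python. This is a toy illustration, not the BERT implementation: the weight matrices are passed in by the caller (identity matrices in the example, an assumption for readability), whereas in BERT they are learned, and the real model stacks many heads and layers.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over token vectors x (one row per token)."""
    q = matmul(x, w_q)  # Eq. (1): queries
    k = matmul(x, w_k)  # Eq. (2): keys
    v = matmul(x, w_v)  # Eq. (3): values
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # all pairwise q_i . k_j, as in Eq. (4)
    # Row-wise softmax of the scaled scores, as in Eq. (7)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted sum of values, as in Eq. (6)


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-dimensional token vectors
outputs, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors of all tokens.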

4. Experimental Results

This study conducted experiments using three distinct types of public datasets sourced from Kaggle [], GitHub [], and ISCX 2016 []. Specifically, Kaggle [] exclusively contained URL strings, GitHub [] provided URL features without accompanying URL strings, and ISCX 2016 [] encompassed both URL strings and features. In assessing the efficacy of the proposed methods, various metrics, including accuracy, precision, and recall, were employed, as defined below:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP, TN, FN, and FP represent the numbers of true positives, true negatives, false negatives, and false positives, respectively. The experiments were run on a computer (ASUS WS750T) equipped with an Intel Core i9 CPU with 64 GB memory and an NVIDIA RTX3070Ti GPU with 8 GB memory. All the hyperparameters used in the experiments are detailed in Table 1.
Table 1. Hyperparameters used for training the proposed model.
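The three metrics defined above can be computed directly from confusion-matrix counts. The sketch below covers the binary case only; the counts in the example are hypothetical and are not results from this study.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from binary confusion-matrix counts."""
    accuracy = safe_div(tp + tn, tp + tn + fn + fp)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return accuracy, precision, recall


# Hypothetical counts for illustration only (assumption).
acc, prec, rec = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```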

4.1. Performance on Kaggle Dataset

The Kaggle dataset [] comprises four distinct URL types: benign, defacement, phishing, and malware. Among URL strings with lengths below 250 characters, the respective entry counts for these categories are approximately 424,000 for benign, 95,000 for defacement, 93,000 for phishing, and 32,000 for malware. URLs with exceptionally long strings were infrequent; hence, they were excluded to reduce GPU memory usage and training duration. The dataset used in this study contained 99% of the entries from the original dataset. For the initial experiment, ten thousand samples were randomly selected from each category, resulting in a total of 40 thousand samples. The results of this experiment are presented in Figure 7a, showing an accuracy rate of 96.70%. When the sample size was expanded to at most one hundred thousand samples per category (all entries were used for categories with fewer than one hundred thousand), the results (shown in Figure 7b) indicate an improved accuracy rate of 98.02%. Finally, when the entire dataset was used, the outcomes (displayed in Figure 7c) exhibit an accuracy rate of 98.78%. In all three experiments, 80% of instances were allocated for training, while the remaining 20% were reserved for testing. Detailed results for these three experiments can be found in Table 2. A comparison with other related work is provided in Table 3.
Figure 7. Different sizes of the Kaggle dataset used for experiments: (a) 40 thousand instances, (b) about 320 thousand instances, and (c) 646 thousand instances.
Table 2. Kaggle dataset for multiclass classification.
Table 3. Comparison with the literature using the Kaggle dataset.
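The per-category sampling cap and the 80–20% split used in these experiments can be sketched as follows. This is a minimal illustration under assumptions: the study does not say how samples were drawn or whether the split was stratified, so a plain random shuffle with a fixed seed is used here, and the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def cap_per_class(samples, cap, seed=0):
    """Keep at most `cap` randomly chosen (url, label) pairs per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for url, label in samples:
        by_label[label].append((url, label))
    kept = []
    for items in by_label.values():
        rng.shuffle(items)
        kept.extend(items[: min(cap, len(items))])  # min{cap, actual count}
    return kept


def split_80_20(samples, seed=0):
    """Shuffle and split into 80% training and 20% testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


# Toy data: 50 benign and 5 malware entries (hypothetical).
data = [(f"benign-{i}.example", "benign") for i in range(50)]
data += [(f"malware-{i}.example", "malware") for i in range(5)]
subset = cap_per_class(data, cap=10)
train, test = split_80_20(subset)
print(len(subset), len(train), len(test))
```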

4.2. Performance on GitHub Dataset

The GitHub dataset [] comprises approximately 88,000 entries, divided into two categories: 58,000 benign URLs and 30,000 phishing URLs. Rather than URL strings, this dataset provides 111 features for each entry. In this study, the importance of each feature was computed using the random forest algorithm, and the 46 features with an importance value exceeding 0.009 were selected. These features were then concatenated into a feature string, as illustrated in Figure 8. To preserve the integrity of individual features and facilitate self-attention within BERT tokens, the symbol “/” was employed for concatenation. With an 80–20% split for training and testing, the confusion matrix is depicted in Figure 9, where k = 46 denotes the number of selected features; the accuracy, precision, and recall rates were 96.71%, 96.25%, and 96.50%, respectively.
Figure 8. Feature strings of URLs after feature selection and concatenation.
Figure 9. Confusion matrices for the GitHub dataset (k = 46).
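The threshold-based selection described above can be sketched as a simple filter over feature importances. The importance values and feature names below are hypothetical; in practice they would come from a fitted random forest (for example, the `feature_importances_` attribute of scikit-learn's `RandomForestClassifier`, which is an assumption here since the study does not name the library used).

```python
def select_features(importances, threshold=0.009):
    """Return feature names whose importance exceeds the threshold, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical importances for illustration only (assumption).
importances = {
    "feature_a": 0.120,
    "feature_b": 0.004,
    "feature_c": 0.031,
    "feature_d": 0.009,
}
selected = select_features(importances)
print(selected, "k =", len(selected))
```

Raising or lowering the threshold changes k, the number of features that end up in each feature string.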
Table 4 outlines the performance metrics for different k values, while Figure 10 illustrates the learning curves of the validation accuracy for the model across various k values during training. To the best of our knowledge, there were no pertinent experimental results in the literature for this dataset, thus rendering direct comparisons unfeasible. Nonetheless, the proposed approach achieved a noteworthy accuracy of 96.71%, emphasizing its effectiveness in handling the dataset.
Table 4. Performance of the different number of features (i.e., k) selected on the GitHub dataset.
Figure 10. Learning curves of the proposed approach for different k values.

4.3. Performance on ISCX 2016 Dataset

The ISCX 2016 dataset [] comprises approximately 160,000 entries divided into five categories: around 35,000 benign URLs, 96,000 defacements, 11,000 malware, 11,000 spam, and 10,000 instances of phishing. The confusion matrix for an 80–20% split for training and testing, with the number of epochs set to 30, is shown in Figure 11. The achieved accuracy, precision, and recall were 99.78%, 99.73%, and 99.34%, respectively. For the purpose of straightforward comparison with other research, a binary classification was also performed, wherein the four non-benign classes (defacements, malware, spam, and phishing) were collectively labeled as “malicious”. In this binary scenario, an accuracy rate of 99.98% was attained. The comparative results are summarized in Table 5.
Figure 11. Confusion matrix for the multiclass classification on the ISCX 2016 dataset.
Table 5. Comparison with other research on the ISCX 2016 dataset.
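The relabeling used for the binary experiment can be sketched as a small mapping from the five class labels to two. The exact label strings in the dataset files are an assumption here, and `to_binary` is a hypothetical helper.

```python
# Label spellings are assumptions; adjust them to match the dataset files.
MALICIOUS_LABELS = {"defacement", "malware", "spam", "phishing"}


def to_binary(label):
    """Collapse the four non-benign classes into a single 'malicious' class."""
    name = label.strip().lower()
    if name in MALICIOUS_LABELS:
        return "malicious"
    if name == "benign":
        return "benign"
    raise ValueError(f"unexpected label: {label!r}")


labels = ["benign", "Defacement", "spam", "phishing", "malware"]
print([to_binary(label) for label in labels])
```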

4.4. Extending to Other Domains

The proposed feature string approach was further extended to include the detection of attacks targeting the Internet of Things (IoT) and attacks directed at DNS over HTTPS (DoH)—a protocol designed to enhance the security of DNS queries and responses. Two publicly available datasets, the “IoT Attack Dataset 2023” and the “DoHBrw 2020”, were utilized for this purpose and can be obtained from the website []. Detailed dataset information is also provided on the same website. The IoT dataset contains approximately 620,000 instances, categorized into eight classes, while the DoHBrw 2020 dataset consists of about 560,000 instances, falling into three classes. The confusion matrices for the two datasets, with an 80–20% split for training and testing, are presented in Figure 12a and Figure 12c, respectively.
Figure 12. Confusion matrices for the IoT 2023 and DoHBrw 2020 datasets. (a) IoT 2023 dataset (original), (b) IoT 2023 dataset (augmented), and (c) DoHBrw 2020 (original).
The performance metrics are summarized in Table 6. Notably, the IoT dataset exhibited class imbalance because the brute-force and web-based classes contained significantly fewer instances than the others. To address this imbalance, the SMOTE algorithm [] was employed to augment these two classes, ensuring a more balanced dataset. Experiments were then conducted on the augmented IoT dataset. The confusion matrix is presented in Figure 12b, with corresponding numerical performance values included in Table 6.
Table 6. Expanding the proposed method to other domains.
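The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration and not the configuration used in the study: base samples are drawn at random rather than visited in turn, and the parameters `n_new`, `k`, and `seed` are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Four toy minority samples with two features each (hypothetical).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_new=3, k=2):
    print([round(c, 3) for c in point])
```

Every synthetic point lies on a line segment between two existing minority samples, so the augmented class stays within the region already covered by real data.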

4.5. URL Prediction Time

Prediction times were also measured to assess the suitability of the proposed approach for real-time detection. Figure 13a lists the tested URLs, comprising six malicious URLs and two benign URLs. The average prediction time per URL was approximately 0.010146 s, as shown in Figure 13b. These measurements were carried out on a desktop equipped with a 3.50 GHz Intel Core i9 processor.
Figure 13. Average prediction time for one URL. (a) Tested URLs and (b) prediction time.
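An average per-URL prediction time of this kind can be measured with a simple wall-clock loop. In the sketch below, `dummy_predict` is a hypothetical stand-in for the fine-tuned classifier and the URLs are placeholders, so the printed timing says nothing about the model in this study.

```python
import time


def average_prediction_time(predict, urls):
    """Return the mean wall-clock time in seconds that predict() takes per URL."""
    start = time.perf_counter()
    for url in urls:
        predict(url)
    return (time.perf_counter() - start) / len(urls)


def dummy_predict(url):
    """Hypothetical stand-in for the trained classifier (assumption)."""
    return "malicious" if "login" in url else "benign"


urls = ["http://example.com/", "http://example.org/login", "http://example.net/a/b"]
print(f"{average_prediction_time(dummy_predict, urls):.6f} s per URL")
```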

5. Conclusions and Future Work

This study presents a BERT-based approach for non-natural language processing tasks, with a specific focus on identifying malicious URLs. Extensive experiments carried out on three distinct public datasets (Kaggle, GitHub, and ISCX 2016) demonstrated the effectiveness of the proposed model. In comparison to previous research, the proposed system achieves higher accuracy. In the multi-classification experiments conducted on the Kaggle dataset, the achieved accuracy was 98.78%. For the GitHub dataset, which provides only features without corresponding URL strings, the proposed model exhibited an accuracy of 96.71%. In the ISCX 2016 dataset experiments, the model achieved accuracy rates of 99.98% in binary classification and 99.78% in multi-classification tasks. Furthermore, two datasets from different domains concerning IoT and DNS over HTTPS were incorporated into the study to demonstrate the versatility of the proposed system. Moreover, the trained model can make decisions on tested URLs quickly, at about 0.01 s per URL, making the system suitable for real-time detection deployment. However, the effectiveness of the approach in detecting zero-day malicious URL attacks, including newly registered URLs or benign web servers that have turned malicious due to infections, remains uncertain. In the future, we aspire to conduct further investigations into these related issues.

Author Contributions

Conceptualization, M.-Y.S. and K.-L.S.; methodology, M.-Y.S.; software, K.-L.S.; validation, M.-Y.S. and K.-L.S.; formal analysis, M.-Y.S. and K.-L.S.; investigation, M.-Y.S. and K.-L.S.; resources, M.-Y.S.; data curation, M.-Y.S. and K.-L.S.; writing—original draft preparation, M.-Y.S. and K.-L.S.; writing—review and editing, M.-Y.S. and K.-L.S.; visualization, M.-Y.S. and K.-L.S.; supervision, M.-Y.S.; funding acquisition, M.-Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology (MOST) Taiwan, grant number MOST 111-2221-E-130-007, and the APC was funded by MOST 111-2221-E-130-007.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All datasets used in this study are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Aaron, G.; Chapin, L.; Piscitello, D.; Strutt, C. Phishing Landscape 2022: An Annual Study of the Scope and Distribution of Phishing; Interisle Consulting Group, LLC: Boston, MA, USA, 2022; pp. 1–65. Available online: https://interisle.net/PhishingLandscape2022.pdf (accessed on 26 August 2023).
  2. Trend Micro 2021 Annual Cybersecurity Report: Navigating New Frontiers, 17 March 2022; pp. 1–42. Available online: https://documents.trendmicro.com/assets/rpt/rpt-navigating-new-frontiers-trend-micro-2021-annual-cybersecurity-report.pdf (accessed on 26 August 2023).
  3. Kumar, R.; Zhang, X.; Tariq, H.A.; Khan, R.U. Malicious URL Detection Using Multi-Layer Filtering Model. In Proceedings of the 14th IEEE International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 15–17 December 2017; pp. 97–100.
  4. Ahammad, S.H.; Kale, S.D.; Upadhye, G.D.; Pande, S.D.; Babu, E.V.; Dhumane, A.V.; Bahadur, D.K.J. Phishing URL detection using machine learning methods. Adv. Eng. Softw. 2022, 173, 103288.
  5. Gupta, B.B.; Yadav, K.; Razzak, I.; Psannis, K.; Castiglione, A.; Chang, X. A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment. Comput. Commun. 2021, 175, 47–57.
  6. Saleem, R.A.; Vinodini, R.; Kavitha, A. Lexical features based malicious URL detection using machine learning techniques. Mater. Today Proc. 2021, 47, 163–166.
  7. Li, T.; Kou, G.; Peng, Y. Improving malicious URLs detection via feature engineering: Linear and nonlinear space transformation methods. Inf. Syst. 2020, 91, 101494.
  8. Mondal, D.K.; Singh, B.C.; Hu, H.; Biswas, S.; Alom, Z.; Azim, M.A. SeizeMaliciousURL: A novel learning approach to detect malicious URLs. J. Inf. Secur. Appl. 2021, 62, 102967.
  9. Srinivasan, S.; Ravi, V.; Arunachalam, A.; Alazab, M.; Soman, K.P. DURLD: Malicious URL Detection Using Deep Learning-Based Character Level Representations. In Malware Analysis Using Artificial Intelligence and Deep Learning; Springer: Berlin/Heidelberg, Germany, 2021; pp. 535–554.
  10. Bozkir, A.S.; Dalgic, F.C.; Aydos, M. GramBeddings: A New Neural Network for URL Based Identification of Phishing Web Pages Through N-gram Embeddings. Comput. Secur. 2023, 124, 102964.
  11. Alshehri, M.; Abugabah, A.; Algarni, A.; Almotairi, S. Character-level word encoding deep learning model for combating cyber threats in phishing URL detection. Comput. Electr. Eng. 2022, 100, 107868.
  12. Zheng, F.; Yan, Q.; Leung, V.C.M.; Yu, F.R.; Ming, Z. HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection. Comput. Secur. 2022, 114, 102584.
  13. Hussain, M.; Cheng, C.; Xu, R.; Afzal, M. CNN-Fusion: An effective and lightweight phishing detection method based on multi-variant ConvNet. Inf. Sci. 2023, 631, 328–345.
  14. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  15. Piñeiro, J.J.M.L.; Portillo, L.R.W. Web architecture for URL-based phishing detection based on Random Forest, Classification Trees, and Support Vector Machine. Intel. Artif. 2022, 25, 107–121.
  16. Kalabarige, L.R.; Rao, R.S.; Abraham, A.; Gabralla, L.A. Multilayer Stacked Ensemble Learning Model to Detect Phishing Websites. IEEE Access 2022, 10, 79543–79552.
  17. Somesha, M.; Alwyn, R.P. Classification of Phishing Email Using Word Embedding and Machine Learning Techniques. J. Cyber Secur. Mobil. 2022, 11, 279–320.
  18. Li, Q.; Cheng, M.; Wang, J.; Su, B. LSTM Based Phishing Detection for Big Email Data. IEEE Trans. Big Data 2022, 8, 278–288.
  19. Singh, S.; Singh, M.P.; Pandey, R. Phishing Detection from URLs Using Deep Learning Approach. In Proceedings of the 5th IEEE International Conference on Computing, Communication and Security (ICCCS), Patna, Bihar, India, 14–16 October 2020; pp. 1–4.
  20. Ariyadasa, S.; Fernando, S.; Fernando, S. Combining Long-Term Recurrent Convolutional and Graph Convolutional Networks to Detect Phishing Sites Using URL and HTML. IEEE Access 2022, 10, 82355–82375.
  21. Alsaedi, M.; Ghaleb, F.A.; Saeed, F.; Ahmad, J.; Alasli, M. Cyber Threat Intelligence-Based Malicious URL Detection Model Using Ensemble Learning. Sensors 2022, 22, 3373.
  22. Remmide, M.A.; Boumahdi, F.; Boustia, N.; Feknous, C.L.; Della, R. Detection of Phishing URLs Using Temporal Convolutional Network. Procedia Comput. Sci. 2022, 212, 74–82.
  23. Wang, C.; Chen, Y. TCURL: Exploring hybrid transformer and convolutional neural network on phishing URL detection. Knowl.-Based Syst. 2022, 258, 109955.
  24. Maneriker, P.; Stokes, J.W.; Lazo, E.G. URLTran: Improving Phishing URL Detection Using Transformers. In Proceedings of the IEEE Military Communications Conference (MILCOM), San Diego, CA, USA, 29 November–2 December 2021; pp. 197–204.
  25. Ullah, F.; Alsirhani, A.; Alshahrani, M.M.; Alomari, A.; Naeem, H.; Shah, S.A. Explainable Malware Detection System Using Transformers-Based Transfer Learning and Multi-Model Visual Representation. Sensors 2022, 22, 6766.
  26. Lin, X.; Xiong, G.; Gou, G.; Li, Z.; Shi, J.; Yu, J. ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification. In Proceedings of the ACM Web Conference, Lyon, France, 25–29 April 2022; pp. 633–642.
  27. Shi, Z.; Luktarhan, N.; Song, Y.; Yin, H. TSFN: A Novel Malicious Traffic Classification Method Using BERT and LSTM. Entropy 2023, 25, 821.
  28. Malicious URLs Dataset. Available online: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset (accessed on 26 August 2023).
  29. Vrbančič, G.; Fister, I., Jr.; Podgorelec, V. Datasets for Phishing Websites Detection. Data Brief 2020, 33, 1–7. Available online: https://github.com/GregaVrbancic/Phishing-Dataset (accessed on 26 August 2023).
  30. ISCX-URL 2016 Dataset. Available online: https://www.unb.ca/cic/datasets/url-2016.html (accessed on 26 August 2023).
  31. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. arXiv 2011, arXiv:1106.1813.
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  33. Canadian Institute for Cybersecurity. Available online: https://www.unb.ca/cic/datasets/ (accessed on 26 August 2023).
