Electronics
  • Article
  • Open Access

30 March 2022

Optimized URL Feature Selection Based on Genetic-Algorithm-Embedded Deep Learning for Phishing Website Detection

and
1 Department of Computer Science, Yonsei University, Seoul 03722, Korea
2 Department of Computer Science, Kyungil University, Daegu 38428, Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Cybersecurity in the Next-Generation Industrial Internet of Things Era: Modelling, Detecting and Mitigating Threats

Abstract

Deep learning models for phishing URL classification based on character- and word-level URL features achieve the best performance in terms of accuracy. Various improvements have been proposed through deep learning parameters, including the structure and learning strategy. However, existing deep learning approaches show degraded recall because of the nature of phishing attacks, in which a URL is discarded immediately after being reported. An additional optimization process that minimizes false negatives by selecting the core features of phishing URLs is a promising avenue of improvement. To search for the optimal URL feature set and fully exploit it, we propose a combined searching and learning strategy that effectively models the URL classifier for recall. By combining the deep-learning-based URL classifier with a genetic algorithm that searches for the feature set minimizing false negatives, we obtained a classifier optimized for recall. Extensive experiments on three real-world datasets consisting of 222,541 URLs showed the highest recall among the deep learning models. We demonstrated the superiority of the method by 10-fold cross-validation and confirmed that the recall improved compared to the latest deep learning method. In particular, the accuracy and recall were improved by 4.13%p and 7.07%p (percentage points), respectively, compared to the convolutional–recurrent neural network in which the feature selection optimization was omitted.

1. Introduction

A phishing attack using URLs can be defined as a scalable and deceptive action in which an impersonated web server obtains information from an individual [1]. Various deep learning models have been introduced to classify phishing attacks based on URL features [2,3]. Prominent among these is the method of optimizing the structure of the convolutional–recurrent neural network [4] by focusing on the representation of URL features at the character and word levels, which has achieved performance comparable to that of existing detection systems [5]. The attention-mechanism-based deep learning method developed by Microsoft has reported the best performance in terms of accuracy [6].
However, past deep learning approaches that aim to minimize the loss function of the classification task have been criticized for limited recall (sensitivity), which stems from the nature of phishing URLs. Because phishing URLs are discarded immediately after being reported, an additional optimization process that minimizes the false negatives for unobserved attacks is essential [7]. Table 1 shows the characteristics of each URL type traditionally used for modeling phishing attacks. To find the most generalized characteristics of phishing URLs and fully exploit them, two processes must be performed simultaneously: extracting the attack characteristics expressed in the URL with a deep model, and searching for the optimal combination of URL characteristics.
Table 1. URL feature sets for modeling phishing attacks.
In this paper, therefore, we propose a deep learning method combined with a genetic-algorithm-based feature-optimization method that searches for an optimal combination of URL features to improve the recall. The proposed method achieved an improved recall compared to the latest deep-learning-based model on three benchmark datasets consisting of 222,541 URLs and improved accuracy and recall by 4.13 and 7.07 percentage points, respectively, compared to the convolutional–recurrent neural network. We additionally demonstrated the superiority of the proposed method by comparing, through self-organizing maps, the URL features that contribute to the performance enhancement. The main findings of this paper are as follows:
  • The genetic-algorithm-embedded convolutional recurrent network works well for classifying benign and phishing URLs, resulting in the best recall and accuracy for phishing website detection;
  • We formulated the task of regulating the learning process of a deep learning model to enhance a specific metric as an optimization problem, and solved it effectively by extending the existing genetic algorithm.

3. Proposed Method

In this section, we describe the combination of the convolutional recurrent network and the genetic-algorithm-based feature optimization process. Figure 1 illustrates the overall architecture of the proposed method, which consists of URL preprocessing steps, a deep URL model based on the convolutional recurrent network, and optimization of the network based on a recall-based fitness score.
Figure 1. Combined structure of genetic algorithm and deep learning for combinatorial optimization of the feature-selection process targeting the recall score.

3.1. Genetic-Algorithm-Based URL Feature Optimization

URL features for modeling a phishing attack are subdivided into address-bar-based, HTTP-request-based, domain-information-based, and script-based features, including JavaScript [18,19]. Each is known to be a meaningful feature in machine-learning-based URL modeling [20], but to fully exploit the URL feature set, the recall-oriented combination of URL features should be explored at the same time.
First, to define the evolutionary-algorithm-based search for URL feature combinations, we specify the $i$th input URL vector $x_i$, the number of rules $d$ (the dimension of the input vector), the label $y_i$ indicating whether the URL is normal, and the number of URLs $n$ in the original dataset.
The genetic algorithm is a representative method to segment a wide search space and quickly find a combinatorial solution using crossover, mutation, and reproduction operations [21]. The proposed combination of deep learning and the genetic algorithm is defined with the $j$th-generation chromosome $\omega_j$, which performs the feature selection operation as in Equation (1).
$$\omega_j = \{a_1, a_2, \ldots, a_d\}, \quad a \in \{0, 1\} \tag{1}$$
The chromosome of each generation performs the filter operation based on the Hadamard product to filter the features that are meaningful to the phishing URL classification and outputs $\bar{x}_i$ with the selected features. Equation (2) defines the dataset $(\bar{X}_{\omega_j}, \bar{Y}_{\omega_j})$ obtained by applying the feature selection of the $j$th generation.
$$\bar{X}_{\omega_j} = \{\bar{x}_1, \ldots, \bar{x}_n\} = \{x_1 \odot \omega_j, \ldots, x_n \odot \omega_j\} \tag{2}$$
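The masking in Equations (1) and (2) can be illustrated with a minimal sketch. The function names `apply_chromosome` and `select_features` and the toy values are hypothetical; the only behavior taken from the text is the element-wise (Hadamard) product of each URL feature vector with a 0/1 chromosome.

```python
from typing import List, Sequence


def apply_chromosome(x: Sequence[float], omega: Sequence[int]) -> List[float]:
    """Hadamard product of one URL feature vector x_i with a 0/1 chromosome."""
    if len(x) != len(omega):
        raise ValueError("feature vector and chromosome must share dimension d")
    return [xi * ai for xi, ai in zip(x, omega)]


def select_features(X: Sequence[Sequence[float]], omega: Sequence[int]) -> List[List[float]]:
    """Apply the same chromosome to every URL vector in the dataset."""
    return [apply_chromosome(x, omega) for x in X]


# Toy dataset: n = 2 URLs, d = 4 features (values are made up for illustration).
X = [[3.0, 1.0, 0.0, 7.0],
     [5.0, 0.0, 1.0, 2.0]]
omega = [1, 0, 0, 1]  # keep the first and last feature only

X_bar = select_features(X, omega)
print(X_bar)  # [[3.0, 0.0, 0.0, 7.0], [5.0, 0.0, 0.0, 2.0]]
```

Masked-out features are zeroed rather than removed, so the input dimension of the classifier stays at $d$ for every chromosome.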
The set of chromosomes $\{\omega_j\}$ expressed in the corresponding generation is evaluated in two ways. The first is the importance computed as the recall score of the deep learning classifier expressed through the chromosome, and the second is the diversity of the deep learning classifier expressed in the generation. Equations (3) and (4) define the importance function $\mathrm{IMP}$ of the $j$th generation and the diversity function $\mathrm{DIV}$.
$$\mathrm{IMP}(\bar{Y}_{\omega_j}) = \mathrm{Recall}_{\omega_j} \tag{3}$$
$$\mathrm{DIV}(\bar{Y}_{\omega_j}) = E\left[\bar{Y}_{\omega_j}^{T} \bar{Y}_{\omega_j}\right] \tag{4}$$
Lastly, the fitness $J(\omega_j)$ of generation $j$ is defined in Equation (5) as the expectation over importance and diversity. As generations proceed, the recall score of the deep learning classifier set expressed in the corresponding generation and the feature selection rule converge. The optimization of the recall score of the deep learning classifier is performed iteratively according to the procedure depicted in Figure 2.
$$J(\omega_j) = E\left[\mathrm{DIV}(\bar{Y}_{\omega_j})^{-1},\ \mathrm{IMP}(\bar{Y}_{\omega_j})\right] \tag{5}$$
Figure 2. Optimization process of feature selection rules based on genetic operations.
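The evaluation of one generation can be sketched as follows. Recall is computed as stated in the text; the diversity and fitness terms, however, are simplified stand-ins and not the paper's exact formulas: diversity is taken as the mean disagreement between one classifier's predictions and those of the other classifiers in the generation, and fitness as the plain average of recall and diversity. All function names and toy predictions are hypothetical.

```python
from typing import List, Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall = TP / (TP + FN), with phishing encoded as label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def disagreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of URLs on which two classifiers predict different labels."""
    return sum(1 for p, q in zip(a, b) if p != q) / len(a)


def diversity(index: int, predictions: List[Sequence[int]]) -> float:
    """Assumed stand-in: mean disagreement with the other classifiers."""
    others = [p for k, p in enumerate(predictions) if k != index]
    if not others:
        return 0.0
    return sum(disagreement(predictions[index], o) for o in others) / len(others)


def fitness(index: int, y_true: Sequence[int], predictions: List[Sequence[int]]) -> float:
    """Assumed stand-in: plain average of importance (recall) and diversity."""
    return 0.5 * (recall(y_true, predictions[index]) + diversity(index, predictions))


# Toy generation with three classifiers evaluated on five URLs.
y_true = [1, 1, 1, 0, 0]
predictions = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
scores = [fitness(k, y_true, predictions) for k in range(len(predictions))]
```

In this toy generation the second classifier scores highest because it recovers every phishing URL and also disagrees most with the others.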
After the fitness $J(\omega_j)$ is computed, optimized classifiers are selected, and genetic operations such as crossover and mutation are applied to form a new classifier set, represented as a new population. The classifiers are selected by the roulette wheel method and the elitism method with the selection probability $p_j$ in Equation (6), where $s$ is the size of the population and the fitness is the average of the objectives. As a result, the generated population maximizes both the diversity of the set and the recall of each deep-learning-based classifier. The optimization process is shown in Figure 3.
$$p_j = \frac{J(\omega_j)}{\sum_{k=1}^{s} J(\omega_k)} \tag{6}$$
Figure 3. Diagram of optimization process for selection of feature sets that maximize recall.
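A minimal sketch of one generation step is given below. The selection probability follows Equation (6); the remaining details (one elite chromosome copied unchanged, one-point crossover, bit-flip mutation at a rate of 0.05) are assumptions for illustration, since the text names the operators without fixing their parameters. All function names are hypothetical.

```python
import random
from typing import List


def selection_probabilities(fitness: List[float]) -> List[float]:
    """Equation (6): p_j = J(w_j) / sum_k J(w_k)."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    return [f / total for f in fitness]


def roulette_select(population: List[List[int]], fitness: List[float],
                    rng: random.Random) -> List[int]:
    """Draw one parent with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]


def one_point_crossover(a: List[int], b: List[int], rng: random.Random) -> List[int]:
    """Take the head of parent a and the tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome: List[int], rate: float, rng: random.Random) -> List[int]:
    """Flip each bit independently with the given probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def next_generation(population: List[List[int]], fitness: List[float],
                    rng: random.Random, n_elite: int = 1,
                    mutation_rate: float = 0.05) -> List[List[int]]:
    """Elitism first, then fill the population with mutated offspring."""
    ranked = sorted(zip(fitness, population), key=lambda fp: fp[0], reverse=True)
    new_population = [list(ch) for _, ch in ranked[:n_elite]]
    while len(new_population) < len(population):
        parent_a = roulette_select(population, fitness, rng)
        parent_b = roulette_select(population, fitness, rng)
        child = one_point_crossover(parent_a, parent_b, rng)
        new_population.append(mutate(child, mutation_rate, rng))
    return new_population
```

Passing an explicit `random.Random` instance keeps the search reproducible, which matters when each fitness evaluation requires training a classifier.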

3.2. Convolutional–Recurrent Neural-Network-Based URL Classification

While the feature selection rule is optimized by the genetic algorithm, a deep learning model that applies the rule must be built and fitted to evaluate the feature selection rules of each generation. Two types of deep learning network were applied to model the character- and word-level features of phishing URLs, as described in Figure 4. The convolutional recurrent neural network, which is known to be a representative method for modeling character- and word-level features, has proven its performance in the existing phishing-detection field [5].
Figure 4. Convolutional recurrent network for URL classification.
First, an integer was assigned to each character, and modeling of a low-level signal obtained through this process was performed by the CNN to model the syntactic features of random characters, including enumerated special characters, which are frequently observed in phishing URLs. Second, each word was embedded based on the word-to-vector model, and the modeling of a sequence of words obtained through this process was performed by the LSTM to model the semantic features of domains and subdomains composing the internal URLs.
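The two input views described above can be sketched in a few lines. The details are assumptions rather than the paper's preprocessing: index 0 is reserved for padding and unseen characters, URLs are truncated or padded to a fixed length, and words are obtained by splitting on non-alphanumeric characters. The function names are hypothetical.

```python
import re
from typing import Dict, List


def build_char_index(urls: List[str]) -> Dict[str, int]:
    """Assign an integer to each character; 0 is reserved for padding/unknown."""
    chars = sorted({c for u in urls for c in u})
    return {c: i + 1 for i, c in enumerate(chars)}


def encode_chars(url: str, index: Dict[str, int], max_len: int) -> List[int]:
    """Character-level view: integer sequence truncated or zero-padded to max_len."""
    ids = [index.get(c, 0) for c in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def split_words(url: str) -> List[str]:
    """Word-level view: tokens separated by any non-alphanumeric characters."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


url = "http://login.example.com/a-b"
index = build_char_index([url])
char_ids = encode_chars(url, index, max_len=40)
words = split_words(url)
print(words)  # ['http', 'login', 'example', 'com', 'a', 'b']
```

The integer sequence would feed the CNN branch, and the word list would be embedded and passed to the LSTM branch.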
The convolution operation $\phi_c$, designed to extract spatial correlation from a vector composed of URL features, and the pooling operation $\phi_p$, which extracts a representative value from the information added by the convolution operation, are defined by Equations (7) and (8), respectively, for the output $x_{ij}^{l}$ of the node in the $i$th row and the $j$th column of the $l$th layer.
$$\phi_c^{l}(x) = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} w_{ab}^{l}\, x_{(i+a)(j+b)}^{l-1} \tag{7}$$
$$\phi_p^{l}(x) = \max x_{ij \times \tau}^{l-1} \tag{8}$$
Here, the $f$th convolution filter $w_f^{l}$ has size $m \times m$, and $\tau$ is the pooling distance of the pooling area of size $k \times k$. Learning the convolution parameters is the process of optimizing the weights of the filter $w$ that extracts the syntactic features while preserving the spatial correlation between characters, and the pooling operation is the process of extracting emphasized features using the stride parameter $\tau$.
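Equations (7) and (8) can be made concrete with a small pure-Python sketch for a single filter and a single channel. The absence of padding ("valid" output size) and the toy input are assumptions, and `conv2d_valid` and `max_pool` are hypothetical names; a real model would rely on a deep learning framework instead.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(x: Matrix, w: Matrix) -> Matrix:
    """Equation (7): sum over an m x m window of w[a][b] * x[i+a][j+b]."""
    m = len(w)
    rows = len(x) - m + 1
    cols = len(x[0]) - m + 1
    return [
        [
            sum(w[a][b] * x[i + a][j + b] for a in range(m) for b in range(m))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def max_pool(x: Matrix, k: int, tau: int) -> Matrix:
    """Equation (8): maximum over each k x k area, moving with stride tau."""
    rows = (len(x) - k) // tau + 1
    cols = (len(x[0]) - k) // tau + 1
    return [
        [
            max(x[i * tau + a][j * tau + b] for a in range(k) for b in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
w = [[1, 0],
     [0, 1]]  # adds each entry to its lower-right neighbor

feature_map = conv2d_valid(x, w)
pooled = max_pool(feature_map, k=2, tau=1)
print(feature_map)  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
print(pooled)       # [[17, 19], [25, 27]]
```

Note that Equation (7) does not flip the filter, so the operation is a cross-correlation, which is also what most deep learning libraries compute under the name convolution.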
The semantics of phishing URLs were modeled through word embedding based on the word-to-vector model and an LSTM applied for time-series modeling. Moreover, 20 words that appeared in subdomains were additionally extracted, since phishing URLs generally include various subdomains. Each word was replaced by a 32-dimensional vector using the word-to-vector model, and the URLs, forming a tensor of size $n \times 20 \times 32$ for $n$ observations, were input to the word-level LSTM.
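The shape of the word-level input can be checked with a short sketch. The sizes 20 and 32 come from the text; everything else is assumed: `toy_embedding` is a deterministic placeholder standing in for a pretrained word-to-vector lookup, and shorter word lists are padded with zero vectors.

```python
import random
from typing import List

MAX_WORDS = 20   # words kept per URL (from the text)
EMBED_DIM = 32   # embedding dimension (from the text)


def toy_embedding(word: str) -> List[float]:
    """Placeholder for a pretrained word-to-vector lookup (not a real model)."""
    rng = random.Random(word)  # seeded by the word, so the vector is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def embed_url_words(words: List[str]) -> List[List[float]]:
    """Truncate to MAX_WORDS, embed each word, and zero-pad the remainder."""
    clipped = words[:MAX_WORDS]
    vectors = [toy_embedding(w) for w in clipped]
    padding = [[0.0] * EMBED_DIM for _ in range(MAX_WORDS - len(clipped))]
    return vectors + padding


def embed_batch(batch: List[List[str]]) -> List[List[List[float]]]:
    """Stack n URLs into a nested list of shape n x 20 x 32."""
    return [embed_url_words(words) for words in batch]


batch = [["login", "example", "com"], ["secure", "pay", "verify", "account"]]
tensor = embed_batch(batch)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)  # (2, 20, 32)
```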
The LSTM network is a type of RNN in which three types of nonlinear gates are implemented. The LSTM $\phi_L^{l}$ performs the time-series modeling of the sequence of domains and subdomains.
$$\phi_L^{l}(x_{ij}) = o_t \odot \tanh(c_t) \tag{9}$$
The input gate ($i$), forget gate ($f$), output gate ($o$), and LSTM cell state ($c$) are defined based on the input domain sequence $x = (x_t, \ldots, x_{t-\omega})$ with word sequence length $\omega$, as shown in Equation (9). Here, $b$, $\sigma$, and $\odot$ refer to the bias added to each neural network, the sigmoid activation function, and Hadamard multiplication, respectively. A pretrained word-to-vector (W2V) model and an LSTM neural network are used to model the features obtained from domains and subdomains, which are among the representative features of phishing URLs.
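A single LSTM time step can be written out to show how the gates combine into the output $o_t \odot \tanh(c_t)$. The gate equations below follow the standard LSTM formulation, which is assumed here because the text does not spell them out; the parameter layout (`params` mapping each gate name to input weights, recurrent weights, and bias) is hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b row by row."""
    return [
        sum(w * xv for w, xv in zip(w_row, x))
        + sum(u * hv for u, hv in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One time step of a standard LSTM cell (assumed formulation)."""
    def gate(name: str, act: Callable[[float], float]) -> Vector:
        W, U, b = params[name]
        return [act(z) for z in affine(W, x, U, h_prev, b)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell state
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]  # o_t (Hadamard) tanh(c_t)
    return h, c


# With all-zero parameters every sigmoid gate equals 0.5 and the candidate is 0.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])  # 2 inputs, 1 hidden unit
params = {name: zero for name in ("i", "f", "o", "g")}
h, c = lstm_step([1.0, -1.0], h_prev=[0.0], c_prev=[2.0], params=params)
```

With zero parameters the cell state is simply halved by the forget gate, which makes the step easy to verify by hand.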

4. Experimental Results

In this section, we present how the genetic-algorithm-embedded convolutional recurrent network predicts phishing attacks and evaluate its performance with 10-fold cross-validation in terms of accuracy and recall, followed by a quantitative comparison with the latest deep learning models.

4.1. Dataset and Implementation

Two types of benchmark datasets were used, and URLs were additionally collected directly from an open source database, to evaluate the deep-learning-based URL classifier combined with the optimized feature extraction algorithm based on the proposed evolutionary algorithm. Table 3 summarizes the sources, numbers, and examples of the 222,541 collected normal and phishing URLs.
Table 3. Source, quantity, and examples of benchmark datasets.
The ISCX-URL-2016 dataset targets a four-way classification task consisting of benign, phishing, malware, and spam URLs, and has a 3:1 class imbalance, which is characteristic of malicious URL modeling. The web-accessible PhishStorm and PhishTank provide known phishing attack cases. Unlike the PhishStorm dataset, in which class sampling was performed, PhishTank does not provide benign URLs. We therefore collected benign URLs from the Open Directory Project, giving 95,541 and 60,000 URLs for the PhishStorm and PhishTank datasets, respectively.
The architecture and training of the proposed method can be configured in various ways through the genetic algorithm settings, such as the number of chromosomes, the number of generations, and the selection strategy. Since there are various options for designing a deep-learning-based URL classifier, it is essential to adjust and optimize the hyperparameters carefully. Table 4 summarizes the list of available hyperparameters for each objective, the range of options, and the configuration of the best model. The proposed method was developed in a hardware environment of four NVIDIA Tesla V100 GPUs with 128 GB of GPU memory. Ubuntu, Python 3.x, and TensorFlow 2.3 were used as the software environment, and Python scientific libraries including scikit-learn were additionally used.
Table 4. The types of hyperparameters for each component and the range of options.

4.2. Phishing Detection Performance

Table 5 compares, under 10-fold cross-validation, the accuracy and recall of the proposed method with those of the latest deep-learning-based models, including the convolutional network and the LSTM network, which are the basic structures for phishing URL classification. The convolutional neural network selected as the base network is known to be suitable for receiving text URLs without a separate feature selection process and for modeling the correlation between characters. LSTM neural networks are known to be suitable for time-series modeling of the character sequences found in phishing URLs. The CNN-LSTM neural network is a variant that places a recurrent neural network on top of the convolution layer to model, as a time series, the correlation of characters extracted by the convolution operation.
Table 5. Ten-fold cross-validation of recall factor of phishing URL classification by method.
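The evaluation protocol can be sketched with standard-library Python. This is a simplified stand-in: the folds are contiguous and neither shuffled nor stratified, whereas a practical setup would typically shuffle and stratify by class. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous (train, test) folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


def accuracy_and_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy = (TP + TN) / n and recall = TP / (TP + FN), phishing = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


folds = k_fold_indices(25, k=10)
acc, rec = accuracy_and_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In the toy call, one phishing URL is missed and one benign URL is flagged, so accuracy is 3/5 while recall is 2/3; a false negative lowers both metrics, but recall isolates it.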
We regarded URLNet, which achieved the best performance in URL classification by using convolutional networks in parallel, and Microsoft's Texception network, which improved the convolution operation with an attention mechanism for the URL field, as the major comparative studies. The Texception network achieved accuracies of 0.9765, 0.9710, and 0.9319 on the three datasets, whereas URLNet, composed of vanilla convolutional networks, achieved a level of performance similar to that of the convolutional network and the convolutional–recurrent network.
As argued, the proposed method achieved the highest recall for the ISCX-URL-2016 benchmark data and for the PhishTank data consisting of 60,000 directly collected URLs, because the feature-extraction rule was optimized based on the recall. In particular, it improved the accuracy and recall by 4.13 and 7.07 percentage points, respectively, on the PhishTank dataset compared to the existing convolutional–recurrent network.
Figure 5 visualizes, for each generation, the performance of the deep-learning-based phishing URL classifier combined with the feature-selection algorithm and the diversity within the generation. Because the chromosomes representing the feature selection rules were randomly initialized, the diversity first degraded and then increased, while the accuracy of the combination gradually increased. Divergence is defined as the variation of outputs between classifiers, and the chart shows that the genetic algorithm selects various combinations of URL features after the initial degradation. Meanwhile, the average accuracy of the emerging classifiers increased gradually and converged in the 60th generation.
Figure 5. Accuracy and diversity improvement for each generation.
Table 6 lists the selected URL features. Feature categories consisting of 50, 6, 15, and 20 features were constructed from the address bar, request, domain, and script, respectively. The total number of available combinations is 90,000 (=50 × 6 × 15 × 20). The proposed method finds the optimal feature set for phishing classification and selects 15 key features for phishing URL classification.
Table 6. URL features selected for classification of phishing URLs.
In Figure 6, quantized deep learning parameters represented by SOM for the major features of a phishing attack are visualized in order to qualitatively evaluate the selected URL features. The feasibility of the optimized feature set was qualitatively visualized by mapping URLs to different spaces according to each selected rule.
Figure 6. Self-organizing map-based visualization to show the complementarity of selected features.

5. Concluding Remarks

In this paper, we have proposed a genetic-algorithm-embedded deep learning classifier that searches for the optimal combination of URL features for recall. We confirmed that optimizing the URL feature-selection process with the genetic algorithm can improve the deep learning classifier in terms of recall, and we cross-validated the method on three benchmark datasets. We analyzed the method quantitatively and qualitatively by presenting the performance of each generation and visualizing the selected URL features to show their complementarity.
Meanwhile, the proposed method iteratively builds and evaluates as many deep learning structures as there are chromosomes per generation. It is necessary to consider the trade-off between the performance improvement and the rapidly increasing computational cost. In that respect, a study on reducing the number of deep learning parameters should precede the application of the proposed method to a deeper neural network.

Author Contributions

Conceptualization, S.-J.B.; Formal analysis, S.-J.B.; Funding acquisition, H.-J.K.; Investigation, H.-J.K.; Methodology, S.-J.B. and H.-J.K.; Supervision, H.-J.K.; Visualization, S.-J.B.; Writing (review and editing), S.-J.B. and H.-J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1F1A1063085).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Marchal, S.; François, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
  2. Bu, S.-J.; Cho, S.-B. Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing URL detection. Electronics 2021, 10, 1492.
  3. Bu, S.-J.; Cho, S.-B. Integrating deep learning with first-order logic programmed constraints for zero-day phishing attack detection. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2685–2689.
  4. Wei, W.; Ke, Q.; Nowak, J.; Korytkowski, M.; Scherer, R.; Woźniak, M. Accurate and fast URL phishing detector: A convolutional neural network approach. Comput. Netw. 2020, 178, 107275.
  5. Le, H.; Pham, Q.; Sahoo, D.; Hoi, S.C. URLNet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv:1802.03162.
  6. Tajaddodianfar, F.; Stokes, J.W.; Gururajan, A. Texception: A character/word-level deep learning model for phishing URL detection. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2857–2861.
  7. Muntasir, M.; Rahman, S.S.M.M.; Jahan, N.; Siddikk, A.B.; Islam, T. AntiPhishTuner: Multi-level approaches focusing on optimization by parameters tuning in phishing URLs detection. In Artificial Intelligence and Blockchain for Future Cybersecurity Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 161–180.
  8. Le, A.; Markopoulou, A.; Faloutsos, M. Phishdef: Url names say it all. In Proceedings of the 2011 Proceedings IEEE INFOCOM, Shanghai, China, 10–15 April 2011; pp. 191–195.
  9. Mohammad, R.M.; Thabtah, F.; McCluskey, L. An assessment of features related to phishing websites using an automated technique. In Proceedings of the 2012 International Conference for Internet Technology and Secured Transactions, London, UK, 10–12 December 2012; pp. 492–497.
  10. Iuga, C.; Nurse, J.R.; Erola, A. Baiting the hook: Factors impacting susceptibility to phishing attacks. Hum.-Cent. Comput. Inf. Sci. 2016, 6, 8.
  11. Bahnsen, A.C.; Bohorquez, E.C.; Villegas, S.; Vargas, J.; González, F.A. Classifying phishing URLs using recurrent neural networks. In Proceedings of the 2017 APWG Symposium on Electronic Crime Research (eCrime), Scottsdale, AZ, USA, 25–27 April 2017; pp. 1–8.
  12. Zhao, J.; Wang, N.; Ma, Q.; Cheng, Z. Classifying malicious URLs using gated recurrent neural networks. In Proceedings of the International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, Matsue, Japan, 3–5 July 2018; pp. 385–394.
  13. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 649–657.
  14. Anand, A.; Gorde, K.; Moniz, J.R.A.; Park, N.; Chakraborty, T.; Chu, B.-T. Phishing URL detection with oversampling based on text generative adversarial networks. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 1168–1177.
  15. Bu, S.-J.; Cho, S.-B. A convolutional neural-based learning classifier system for detecting database intrusion via insider attack. Inf. Sci. 2020, 512, 123–136.
  16. Suleman, M.-T.; Awan, S.M. Optimization of URL-based phishing websites detection through genetic algorithms. Autom. Control. Comput. Sci. 2019, 53, 333–341.
  17. Park, K.-W.; Bu, S.-J.; Cho, S.-B. Evolutionary optimization of neuro-symbolic integration for phishing URL detection. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Bilbao, Spain, 22–24 September 2021; pp. 88–100.
  18. Da Silva, C.M.R.; Fernandes, B.J.T.; Feitosa, E.L.; Garcia, V.C. Piracema.io: A rules-based tree model for phishing prediction. Expert Syst. Appl. 2022, 191, 116239.
  19. Shreeram, V.; Suban, M.; Shanthi, P.; Manjula, K. Anti-phishing detection of phishing attacks using genetic algorithm. In Proceedings of the 2010 International Conference on Communication Control and Computing Technologies, Nagercoil, India, 7–9 October 2010; pp. 447–450.
  20. Moghimi, M.; Varjani, A.Y. New rule-based phishing detection method. Expert Syst. Appl. 2016, 53, 231–242.
  21. Sun, Y.; Xue, B.; Zhang, M.; Yen, G.G.; Lv, J. Automatically designing CNN architectures using the genetic algorithm for image classification. IEEE Trans. Cybern. 2020, 50, 3840–3854.
  22. Mamun, M.S.I.; Rathore, M.A.; Lashkari, A.H.; Stakhanova, N.; Ghorbani, A.A. Detecting malicious urls using lexical analysis. In Proceedings of the International Conference on Network and System Security, Taipei, Taiwan, 28–30 September 2016; pp. 467–482.
  23. Cui, Q.; Jourdan, G.-V.; Bochmann, G.V.; Couturier, R.; Onut, I.-V. Tracking phishing attacks over time. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 667–676.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
