Optimized URL Feature Selection Based on Genetic-Algorithm-Embedded Deep Learning for Phishing Website Detection

Bu, Seok-Jun; Kim, Hae-Jung

doi:10.3390/electronics11071090


by Seok-Jun Bu ^1,* and Hae-Jung Kim ^2

^1 Department of Computer Science, Yonsei University, Seoul 03722, Korea
^2 Department of Computer Science, Kyungil University, Daegu 38428, Korea
^* Author to whom correspondence should be addressed.

Electronics 2022, 11(7), 1090; https://doi.org/10.3390/electronics11071090

Submission received: 15 February 2022 / Revised: 26 March 2022 / Accepted: 28 March 2022 / Published: 30 March 2022

(This article belongs to the Special Issue Cybersecurity in the Next-Generation Industrial Internet of Things Era: Modelling, Detecting and Mitigating Threats)


Abstract: Deep learning models for phishing URL classification based on character- and word-level URL features achieve the best performance in terms of accuracy. Various improvements have been proposed through deep learning parameters, including the structure and learning strategy. However, existing deep learning approaches show a degradation in recall owing to the nature of phishing attacks, in which a URL is discarded immediately after being reported. An additional optimization process that minimizes the false negatives by selecting the core features of phishing URLs is a promising avenue of improvement. To search for the optimal URL feature set and to fully exploit it, we propose a combined searching and learning strategy that effectively models the URL classifier for recall. By combining the deep-learning-based URL classifier with a genetic algorithm that searches for the feature set minimizing the false negatives, we obtained an optimized classifier with the best recall in our experiments. Extensive experiments on three real-world datasets consisting of 222,541 URLs showed the highest recall among the deep learning models. We demonstrated the superiority of the method by 10-fold cross-validation and confirmed that the recall improved compared to the latest deep learning method. In particular, the accuracy and recall were improved by 4.13%p and 7.07%p, respectively, compared to the convolutional–recurrent neural network in which the feature selection optimization was omitted.

Keywords: phishing detection; URL classification; deep learning optimization; genetic algorithm; feature selection

1. Introduction

A phishing attack using URLs can be defined as a scalable and deceptive action in which an impersonated web server obtains information from an individual [1]. Various deep learning models have been introduced to classify phishing attacks based on URL features [2,3]. Prominent among these is the method of optimizing the structure of the convolutional–recurrent neural network [4] by focusing on the expression of URL features at the character and word levels, which has achieved performance comparable to the existing detection system [5]. The attention-mechanism-based deep learning method developed by Microsoft has demonstrated the best performance in terms of accuracy [6].

However, past deep learning approaches that aim only to minimize the loss function of the classification task have been criticized for limited recall (sensitivity), which stems from the nature of phishing URLs. Because phishing URLs are discarded immediately after being reported, an additional optimization process that minimizes the false negatives on unobserved attacks is essential [7]. Table 1 shows the characteristics of each URL type traditionally used for modeling phishing attacks. To search for the most generalizable characteristics of phishing URLs and to fully exploit them, the extraction of the attack characteristics expressed in the URL by the deep model and the search for the optimal combination of URL characteristics must be performed simultaneously.

In this paper, therefore, we propose a deep learning method combined with a genetic-algorithm-based feature-optimization method that searches for an optimal combination of URL features to improve the recall. The proposed method achieved an improved recall compared to the latest deep-learning-based model on three benchmark datasets consisting of 222,541 URLs, and showed accuracy and recall increases of 4.13 and 7.07 percentage points, respectively, compared to the convolutional–recurrent neural network. We additionally demonstrated the superiority of the proposed method by comparing the URL features that contribute to performance enhancement through self-organizing maps. The main findings of this paper are as follows:

The genetic-algorithm-embedded convolutional recurrent network works well for classifying benign and phishing URLs, resulting in the best recall and accuracy for phishing website detection;
We formulated the task of regulating the learning process of a deep learning model to enhance a specific metric as an optimization problem, and solved it effectively by extending the existing genetic algorithm.

2. Related Works

In this section, we review recent phishing URL-detection methods based on machine learning algorithms. Phishing URL detection research can be categorized by URL representation, modeling, and learning method, as summarized in Table 2.

As an initial attempt to model phishing attacks, a primitive method was proposed that constructs a feature set from the list of words appearing in a URL as bag-of-words vectors [8]. Since classification based on basic machine learning algorithms mainly relies on conditional probability, the selection of URL features has also been studied as an important criterion. Mohammad et al. contributed to the automation of the phishing-URL-detection task by systematically extracting URL features and proposing a hierarchical classifier according to the extraction rule [9].

As it was revealed that rule-based feature selection and modeling generalize poorly to unobserved URLs [10], deep-learning-based phishing detection has been actively studied. Deep learning is known as a method of fitting a complex mapping function from a large number of observations. The feature selection process is automated by modeling the word-level features with the recurrent neural network [11] and its variants [12].

On the other hand, since language modeling tasks, including sentiment analysis, are known to be possible from the sequence of characters constituting a string [13], the character-level features constituting the URL were selected as the key features. Since a feature set composed of a sequence of characters requires less feature selection or preprocessing, deep-learning-based research has focused on optimizing computation and structure. For virtual phishing URLs generated with a generative adversarial network (GAN), the effect of data augmentation on performance enhancement was verified [14]. Bu and Cho pointed out the extreme class imbalance in phishing URL classification and filtered the phishing attacks using an unsupervised learning method [2].

Following this research stream, Microsoft introduced a deep learning model that utilizes both character- and word-level features to maximize the detection performance of phishing attacks [5]. The best accuracy and recall among the existing phishing-detection methods were achieved by improving the deep learning operation with a self-attention mechanism and by enriching the URL feature set [6].

Bu and Cho improved performance by additionally using not only character- and word-level URL features, but also feature sets based on expert knowledge [3]. The output of the deep learning classifier was successfully corrected by utilizing phishing attack-detection rules expressed in the form of first-order logic, and the necessity of optimizing the feature set for phishing detection was addressed. This is not the first time that deep learning and traditional machine learning algorithms have been combined with genetic-algorithm-based combinatorial search to achieve better performance. Suleman et al. improved the accuracy of the Naive Bayes, k-nearest neighbor, decision tree, and random forest classifiers by introducing a feature-selection algorithm based on evolutionary computation to the traditional machine-learning-based phishing-website-detection task [16]. Park et al. performed detection rule optimization based on a genetic algorithm to maximize the accuracy and recall of a deep learning classifier and improved detection performance [17].

In this paper, we extend the URL feature extraction and selection process to detect phishing attacks. In contrast to attempts that use a rule-based system consisting of an optimized detection ruleset and machine learning algorithms in parallel [3,17], we explicitly include the step of optimizing the feature-selection process. The genetic algorithm is a representative method of combinatorial search [15], which divides the wide search space of deep learning parameters and performs recall-oriented optimization.

3. Proposed Method

In this section, we describe the combination of the convolutional recurrent network and the genetic-algorithm-based feature optimization process. Figure 1 illustrates the overall architecture of the proposed method, which consists of URL preprocessing steps, a deep URL model based on the convolutional recurrent network, and optimization of the network based on a recall-based fitness score.

3.1. Genetic-Algorithm-Based URL Feature Optimization

URL features for modeling a phishing attack are subdivided into address-bar-based, HTTP-request-based, domain-information-based, and script-based features, including JavaScript [18,19]. Each is known to be a meaningful feature in machine-learning-based URL modeling [20], but fully exploiting the URL feature set also requires exploring recall-oriented combinations of URL features at the same time.

First, to define the evolutionary-algorithm-based search for URL feature combinations, we specify the $i$th input URL vector $x_{i}$, the number of rules $d$ indicating the dimension of the input vector, the variable $y_{i}$ indicating whether the URL is normal or not, and the number of URLs $n$ that make up the original dataset.

The genetic algorithm is a representative method for segmenting a wide search space and quickly finding a combinatorial solution using crossover, mutation, and reproduction operations [21]. The suggested combination of deep learning and the genetic algorithm is defined with the $j$th-generation chromosome $ω_{j}$, which performs the feature selection operation, as in Equation (1).

$$ω_{j} = [a_{1}, a_{2}, \dots, a_{d}], \quad a_{k} \in \{0, 1\} \tag{1}$$

The chromosome of each generation performs the filter operation $\circ$, based on the Hadamard product, to keep the features that are meaningful for phishing URL classification, and outputs ${\bar{x}}_{i}$ with the selected features. Equation (2) defines the dataset $({\bar{X}}^{ω_{j}}, {\bar{Y}}^{ω_{j}})$ obtained by applying the feature selection of the $j$th generation.

$${\bar{X}}^{ω_{j}} = \{{\bar{x}}_{1}, \dots, {\bar{x}}_{n}\} = \{(x_{1} \circ ω_{j}), \dots, (x_{n} \circ ω_{j})\} \tag{2}$$
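The masking in Equations (1) and (2) can be illustrated with a minimal sketch. The toy feature values and the helper names `random_chromosome`, `apply_mask`, and `filter_dataset` are assumptions made for illustration and do not come from the paper.

```python
import random

def random_chromosome(d, rng):
    """Binary mask of length d, as in Equation (1)."""
    return [rng.randint(0, 1) for _ in range(d)]

def apply_mask(x, omega):
    """Hadamard (element-wise) product of one URL vector and the mask."""
    if len(x) != len(omega):
        raise ValueError("feature vector and chromosome must have the same length")
    return [xi * ai for xi, ai in zip(x, omega)]

def filter_dataset(X, omega):
    """Equation (2): apply the same mask to every URL vector."""
    return [apply_mask(x, omega) for x in X]

# Hypothetical toy data: 3 URLs described by d = 4 hand-made features.
X = [[1.0, 12.0, 45.0, 0.0],
     [0.0, 3.0, 20.0, 1.0],
     [1.0, 60.0, 80.0, 1.0]]
omega = [1, 0, 1, 0]  # keep features 0 and 2, zero out features 1 and 3
X_bar = filter_dataset(X, omega)
print(X_bar)
```

Because the Hadamard product zeroes a deselected feature instead of deleting its column, every chromosome yields inputs of the same dimension $d$, so one network input shape can serve all candidates.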

The set of chromosomes $\{ω_{j}\}$ expressed in the corresponding generation is evaluated in two ways. The first is the importance, computed as the recall score of the deep learning classifier expressed through the chromosome, and the second is the diversity of the deep learning classifiers expressed in the generation. Equations (3) and (4) define the importance function $\mathrm{IMP}(\cdot)$ and the diversity function $\mathrm{DIV}(\cdot)$ of the $j$th generation.

$$\mathrm{IMP}({\bar{Y}}^{ω_{j}}) = \mathrm{Recall}_{ω_{j}} \tag{3}$$

$$\mathrm{DIV}({\bar{Y}}^{ω_{j}}) = E[{({\bar{Y}}^{ω_{j}})}^{T} \cdot ({\bar{Y}}^{ω_{j}})] \tag{4}$$

Lastly, the fitness $J_{ω_{j}}$ of generation $j$ is defined as the expectation of importance and diversity, as in Equation (5). The optimization of the recall score of the deep learning classifier is performed iteratively, following the procedure depicted in Figure 2, until the recall score of the classifier set expressed in the generation and the feature selection rule converge.

$$J_{ω_{j}} = E[{\mathrm{DIV}({\bar{Y}}^{ω_{j}})}^{-1}, \mathrm{IMP}({\bar{Y}}^{ω_{j}})] \tag{5}$$
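A rough sketch of how Equations (3) to (5) could be evaluated follows. Equation (4) does not fix how the expectation is taken, so the code adopts one possible reading, stated here as an assumption: the diversity term of a chromosome is the mean normalized inner product between its classifier's predictions and those of the other classifiers in the generation, so its inverse grows when the classifiers disagree. The function names, the toy predictions, and the small constant `eps` are likewise illustrative.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for binary labels (1 = phishing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def agreement(y_a, y_b):
    """Normalized inner product of two binary prediction vectors."""
    return sum(a * b for a, b in zip(y_a, y_b)) / len(y_a)

def diversity_term(j, predictions):
    """Assumed reading of Equation (4): mean agreement with the other classifiers."""
    others = [agreement(predictions[j], predictions[k])
              for k in range(len(predictions)) if k != j]
    return sum(others) / len(others) if others else 1.0

def fitness(j, y_true, predictions, eps=1e-9):
    """Equation (5) read as the mean of DIV^-1 and IMP (eps avoids division by zero)."""
    imp = recall(y_true, predictions[j])
    div = diversity_term(j, predictions)
    return 0.5 * (1.0 / (div + eps) + imp)

# Hypothetical labels and predictions of three classifiers in one generation.
y_true = [1, 1, 1, 0, 0, 0]
predictions = [
    [1, 1, 0, 0, 0, 0],  # classifier built from chromosome 0
    [1, 1, 1, 1, 0, 0],  # chromosome 1
    [1, 0, 1, 0, 0, 1],  # chromosome 2
]
scores = [fitness(j, y_true, predictions) for j in range(len(predictions))]
print(scores)
```

In this form the two terms live on different scales, since the inverse diversity is unbounded while recall lies in [0, 1]; the text does not state whether any normalization is applied, so none is shown.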

After the fitness $J_{ω_{j}}$ is computed, optimized classifiers are selected, and genetic operations such as crossover and mutation are applied to form a new classifier set, represented as a new population. The selection of classifiers is conducted by the roulette wheel method and the elitism method with the selection probability $p_{j}$ in Equation (6), where $s$ is the size of the population and each fitness value is the average of the objectives. As a result, the generated population maximizes the diversity of the set and the recall of each deep-learning-based classifier. The diagram of the optimization is shown in Figure 3.

$$p_{j} = \frac{J_{ω_{j}}}{\sum_{k = 1}^{s} J_{ω_{k}}} \tag{6}$$
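The selection step of Equation (6) and the genetic operations can be sketched as below. The paper names roulette wheel selection, elitism, crossover, and mutation but does not give their details, so the single-point crossover, the bit-flip mutation, the mutation rate, and the number of elites are assumptions chosen for illustration.

```python
import random

def selection_probabilities(fitnesses):
    """Equation (6): p_j = J_j / sum_k J_k."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_pick(population, probs, rng):
    """Draw one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=probs, k=1)[0]

def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two binary masks (assumed operator)."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate` (assumed operator)."""
    return [1 - a if rng.random() < rate else a for a in chromosome]

def next_generation(population, fitnesses, rng, n_elite=1, mutation_rate=0.05):
    probs = selection_probabilities(fitnesses)
    # Elitism: copy the fittest chromosomes into the new population unchanged.
    ranked = sorted(zip(fitnesses, population), key=lambda fp: fp[0], reverse=True)
    new_pop = [list(chrom) for _, chrom in ranked[:n_elite]]
    # Fill the remaining slots with mutated offspring of roulette-selected parents.
    while len(new_pop) < len(population):
        a = roulette_pick(population, probs, rng)
        b = roulette_pick(population, probs, rng)
        new_pop.append(mutate(crossover(a, b, rng), mutation_rate, rng))
    return new_pop

rng = random.Random(0)
population = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
fitnesses = [0.9, 0.6, 0.3, 0.2]  # hypothetical fitness values
new_population = next_generation(population, fitnesses, rng)
```

In the full method each fitness value would come from training and evaluating a classifier on the masked dataset, which is the expensive part of every generation.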

3.2. Convolutional–Recurrent Neural-Network-Based URL Classification

Although the feature selection rule is optimized by the genetic algorithm, a deep learning model applying the rule must be built and fitted to evaluate the feature selection rules of each generation. Two types of deep learning networks were applied to model the character- and word-level features of phishing URLs, as described in Figure 4. The convolutional recurrent neural network, which is known to be a representative method for modeling character- and word-level features, has proven its performance in the existing phishing-detection field [5].

First, an integer was assigned to each character, and modeling of a low-level signal obtained through this process was performed by the CNN to model the syntactic features of random characters, including enumerated special characters, which are frequently observed in phishing URLs. Second, each word was embedded based on the word-to-vector model, and the modeling of a sequence of words obtained through this process was performed by the LSTM to model the semantic features of domains and subdomains composing the internal URLs.
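The character-level step, in which each character is assigned an integer, might look like the following sketch. The vocabulary (printable ASCII), the reserved padding and unknown indices, and the fixed length `max_len` are assumptions; the paper does not specify them.

```python
import string

# Hypothetical vocabulary: printable ASCII characters.
# Index 0 is reserved for padding, index 1 for out-of-vocabulary characters.
VOCAB = {ch: idx + 2 for idx, ch in enumerate(string.printable)}
PAD, UNK = 0, 1

def encode_url(url, max_len=64):
    """Map each character to an integer, then truncate or right-pad to max_len."""
    ids = [VOCAB.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

encoded = encode_url("http://example.com/login?id=%E5", max_len=40)
print(len(encoded))
```

The resulting fixed-length integer sequence is the kind of low-level signal that a character-level CNN can take as input, typically after an embedding layer.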

The convolution operation $ϕ_{c}$, designed to extract spatial correlation from a vector composed of URL features, and the pooling operation $ϕ_{p}$, which extracts a representative value from the output of the convolution operation, are defined by Equations (7) and (8), respectively, for the output $x_{ij}^{l}$ of the node in the $i$th row and the $j$th column of the $l$th layer.

$$ϕ_{c}^{l}(x) = \sum_{a = 0}^{m - 1} \sum_{b = 0}^{m - 1} w_{ab}^{l} x_{(i + a)(j + b)}^{l - 1} \tag{7}$$

$$ϕ_{p}^{l}(x) = \max x_{ij \times τ}^{l - 1} \tag{8}$$

Here, the $f$th convolution filter $w_{f}^{l}$ of size $(m \times m)$ and the pooling distance $τ$ over a pooling area of size $(k \times k)$ are used. Learning the convolution parameters is the process of optimizing the filter weights $w$ that extract the syntactic features while preserving the spatial correlation between characters, and the pooling operation extracts the emphasized features using the stride parameter $τ$.
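Equations (7) and (8) can be traced by hand with a small pure-Python sketch. The input grid, the filter values, and the function names are illustrative assumptions; in practice a deep learning framework would perform these operations.

```python
def conv2d_valid(x, w):
    """Equation (7): slide an m x m filter over x without padding."""
    m = len(w)
    rows, cols = len(x) - m + 1, len(x[0]) - m + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(m) for b in range(m))
             for j in range(cols)]
            for i in range(rows)]

def max_pool(x, k, tau):
    """Equation (8): maximum over k x k windows taken every tau steps."""
    rows = (len(x) - k) // tau + 1
    cols = (len(x[0]) - k) // tau + 1
    return [[max(x[i * tau + a][j * tau + b] for a in range(k) for b in range(k))
             for j in range(cols)]
            for i in range(rows)]

# Hypothetical 4 x 4 input and 2 x 2 filter.
x = [[1, 2, 0, 1],
     [0, 1, 3, 1],
     [2, 1, 0, 0],
     [1, 0, 1, 2]]
w = [[1, 0],
     [0, 1]]
feature_map = conv2d_valid(x, w)             # [[2, 5, 1], [1, 1, 3], [2, 2, 2]]
pooled = max_pool(feature_map, k=2, tau=1)   # [[5, 5], [2, 3]]
```

With this filter each output cell is the sum of an input value and its lower-right neighbour, which makes the result easy to check by eye.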

The semantics of phishing URLs were modeled through word embedding based on the word-to-vector model and through the LSTM deep learning algorithm for time-series modeling. Moreover, 20 words that appeared in subdomains were additionally extracted, since phishing URLs generally include various subdomains. Each word was replaced by a 32-dimensional vector using the word-to-vector model, and the URLs, formed as a tensor of size $n \times 20 \times 32$ for $n$ observations, were input into the word-level LSTM.
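The shape of the word-level input can be made concrete with a short sketch. The sizes 20 and 32 come from the text; everything else is assumed for illustration, in particular the tokenizer (which splits the whole URL instead of only its subdomains) and the pseudo-random embedding table, which merely stands in for a pretrained word-to-vector model.

```python
import random
import re

MAX_WORDS, EMBED_DIM = 20, 32   # sizes stated in the text
rng = random.Random(0)
_embedding_table = {}           # placeholder for a pretrained word-to-vector model

def embed(word):
    """Return a fixed pseudo-random 32-d vector per word (placeholder only)."""
    if word not in _embedding_table:
        _embedding_table[word] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return _embedding_table[word]

def url_to_words(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]

def url_to_matrix(url):
    """Truncate or zero-pad the token sequence to a 20 x 32 matrix."""
    words = url_to_words(url)[:MAX_WORDS]
    rows = [embed(w) for w in words]
    rows += [[0.0] * EMBED_DIM for _ in range(MAX_WORDS - len(rows))]
    return rows

urls = ["http://login.secure.example.com/account", "en.wikipedia.org/wiki/URL"]
batch = [url_to_matrix(u) for u in urls]   # nested list of shape n x 20 x 32
print(len(batch), len(batch[0]), len(batch[0][0]))
```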

The LSTM network is a type of RNN in which three types of nonlinear gates are implemented. The LSTM $ϕ_{L}^{l}(\cdot)$ performs the time-series modeling of the sequence of domains and subdomains.

$$ϕ_{L}^{l}(x_{ij}) = o_{t} ⊙ \tanh(c_{t}) \tag{9}$$

The input gate ($i$), forget gate ($f$), output gate ($o$), and LSTM cell state ($c$) are defined based on the input domain sequence $x = (x(t), \dots, x(t - ω))$ with word sequence length $ω$, and the output gate and cell state combine as shown in Equation (9). The symbols $b$, $σ$, and $⊙$ refer to the bias added to each neural network, the sigmoid activation function of the neural networks, and Hadamard multiplication, respectively. A pretrained word-to-vector (W2V) model and the LSTM neural network are used to model the features obtained from domains and subdomains, which are among the representative features of the phishing URL.
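For reference, the gate definitions are not spelled out in this text. The standard LSTM formulation, which is consistent with Equation (9) and with the symbols $b$, $σ$, and $⊙$ above, is given below. The weight matrices $W$ and $U$ and the hidden state $h_{t}$ are notation assumed here, and the paper's own formulation may differ in detail.

```latex
\begin{aligned}
i_{t} &= σ(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}) \\
f_{t} &= σ(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}) \\
o_{t} &= σ(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}) \\
c_{t} &= f_{t} ⊙ c_{t-1} + i_{t} ⊙ \tanh(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}) \\
h_{t} &= o_{t} ⊙ \tanh(c_{t})
\end{aligned}
```

The last line is the output that Equation (9) denotes by $ϕ_{L}^{l}$.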

4. Experimental Results

In this section, we present how the genetic-algorithm-embedded convolutional recurrent network predicts phishing attacks and evaluate its performance with 10-fold cross-validation in terms of accuracy and recall, followed by a quantitative comparison with the latest deep learning models.

4.1. Dataset and Implementation

Two benchmark datasets were used, and additional URLs were collected directly from an open-source database, to evaluate the deep-learning-based URL classifier combined with the proposed evolutionary feature-selection algorithm. Table 3 summarizes the sources, numbers, and examples of the 222,541 collected normal and phishing URLs.

The ISCX-URL-2016 dataset targets a four-way classification task consisting of benign, phishing, malware, and spam URLs, and has a 3:1 class imbalance as a characteristic of malicious URL modeling. The web-accessible PhishStorm and PhishTank provide known phishing attack cases. Unlike the PhishStorm dataset, where class sampling was performed, PhishTank does not provide benign URLs, so we collected benign URLs from the Open Directory Project. The resulting PhishStorm and PhishTank datasets contain 95,541 and 60,000 URLs, respectively.
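The dataset sizes quoted above can be cross-checked against Table 3 with a few lines of arithmetic. The dictionary layout is only an illustrative way to hold the counts.

```python
# Instance counts copied from Table 3.
datasets = {
    "ISCX-URL-2016": {"benign": 35_000, "phishing": 9_000, "malware": 11_000, "spam": 12_000},
    "PhishStorm": {"benign": 47_682, "phishing": 47_859},
    "PhishTank": {"benign": 45_000, "phishing": 15_000},
}

totals = {name: sum(counts.values()) for name, counts in datasets.items()}
grand_total = sum(totals.values())
print(totals)       # {'ISCX-URL-2016': 67000, 'PhishStorm': 95541, 'PhishTank': 60000}
print(grand_total)  # 222541
```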

The architecture and training of the proposed method can be varied through the configuration of the genetic algorithm, such as the number of chromosomes, the number of generations, and the selection strategy. Since there are various options for designing a deep-learning-based URL classifier, it is essential to adjust and optimize the hyperparameters carefully. Table 4 summarizes the list of available hyperparameters for each objective, the range of options, and the configuration of the best model. The proposed method was developed in a hardware environment of four NVIDIA Tesla V100 GPUs with 128 GB of GPU memory. Ubuntu, Python 3.x, and TensorFlow 2.3 were used as the software environment, and Python scientific libraries including Scikit-learn were additionally used.

4.2. Phishing Detection Performance

Table 5 compares the 10-fold cross-validated accuracy and recall with those of the latest deep-learning-based models, including the convolutional network and the LSTM network, which are the basic structures for phishing URL classification. The convolutional neural network selected as the base network is known to be suitable for receiving text URLs without a separate feature selection process and for modeling the correlation between characters. LSTM neural networks are known to be suitable for time-series modeling of the character sequences found in phishing URLs. The CNN-LSTM neural network is a variant that stacks a recurrent neural network on top of the convolution layer to model, as a time series, the correlation of characters extracted by the convolution operation.
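The 10-fold protocol can be sketched with the standard library alone. The text does not say whether the folds were stratified by class, so this sketch uses a plain shuffled split; the function name and the seed are assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for each fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=25, k=10))
print(len(splits))
```

Each URL appears in exactly one test fold, and the reported accuracy and recall would be aggregated over the ten folds.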

We selected URLNet, which achieved the best performance in URL classification by using convolutional networks in parallel, and Microsoft's Texception network, which improved the convolutional operation with an attention mechanism for the URL field, as the major comparative studies. The Texception network achieved accuracies of 0.9765, 0.9710, and 0.9319 on the three datasets, whereas URLNet, composed of a vanilla convolutional network, achieved a level of performance similar to that of the convolutional and convolutional–recurrent networks.

As argued, the proposed method achieved the highest recall for the ISCX-URL-2016 benchmark data and the PhishTank data consisting of 60,000 directly collected URLs, because the feature-extraction rule was optimized for recall. In particular, it improved accuracy and recall by 4.13 and 7.07 percentage points, respectively, on the PhishTank dataset compared to the existing convolutional–recurrent network.
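The reported gains follow directly from the PhishTank columns of Table 5, as the short calculation below shows. The variable names are illustrative.

```python
# Accuracy and recall on PhishTank, copied from Table 5.
cnn_lstm = {"accuracy": 0.9070, "recall": 0.8374}
proposed = {"accuracy": 0.9483, "recall": 0.9081}

# Absolute difference, expressed in percentage points.
gain_pp = {m: round((proposed[m] - cnn_lstm[m]) * 100, 2) for m in proposed}
# Relative improvement over the baseline, in percent.
relative_gain = {m: round((proposed[m] / cnn_lstm[m] - 1) * 100, 2) for m in proposed}

print(gain_pp)  # {'accuracy': 4.13, 'recall': 7.07}
```

The 4.13 and 7.07 figures are therefore absolute differences in percentage points; the relative improvements over the baseline are somewhat larger.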

Figure 5 visualizes, for each generation, the performance of the deep-learning-based phishing URL classifiers combined with the feature-selection algorithm and the diversity within the generation. Because the chromosomes representing the feature selection rules were randomly initialized, the diversity increased after an initial degradation, and the accuracy of the combination gradually increased. The chart describes the accuracy and divergence of the chromosomes that emerge as generations progress. Divergence is defined as the variation of outputs between classifiers, which shows that the genetic algorithm selects various combinations of URL features after the initial degradation. Meanwhile, the average accuracy of the emerging classifiers increased gradually and converged by the 60th generation.

Table 6 lists the selected URL features. Feature categories consisting of 50, 6, 15, and 20 features were constructed from the address bar, request, domain, and script, respectively. The total number of available combinations is 90,000 ($= 50 \times 6 \times 15 \times 20$). The proposed method successfully finds the optimal feature set for phishing classification and selects 15 key features for phishing URL classification.
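The combination count can be reproduced as follows. The second figure is not from the paper: it rests on the assumption that the chromosome of Equation (1) covers all 91 features, so that each can be switched on or off independently.

```python
import math

# Category sizes stated in the text.
category_sizes = {"address_bar": 50, "request": 6, "domain": 15, "script": 20}

# Product reported in the text: one feature chosen from each category.
one_per_category = math.prod(category_sizes.values())

# Assumption: with an independent on/off bit per feature, the mask space is 2^d.
d = sum(category_sizes.values())
all_masks = 2 ** d

print(one_per_category)  # 90000
print(d)                 # 91
```

Either way the space is too large to evaluate exhaustively when each candidate requires training a deep network, which is the motivation for a guided search.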

In Figure 6, quantized deep learning parameters represented by SOM for the major features of a phishing attack are visualized in order to qualitatively evaluate the selected URL features. The feasibility of the optimized feature set was qualitatively visualized by mapping URLs to different spaces according to each selected rule.

5. Concluding Remarks

In this paper, we have proposed a genetic-algorithm-embedded deep learning classifier that searches for the optimal combination of URL features for recall. We confirmed that optimizing the URL feature-selection process with the genetic algorithm can improve the deep learning classifier in terms of recall, and we cross-validated the method on three benchmark datasets. We analyzed the method quantitatively and qualitatively by presenting the performance of each generation and visualizing the selected URL features to show their complementarity.

Meanwhile, the proposed method iteratively builds and evaluates as many deep learning structures as there are chromosomes per generation. It is necessary to consider the trade-off between the performance improvement and the rapidly increasing computational cost. In that respect, a study on reducing the number of deep learning parameters should precede the application of the proposed method to a deeper neural network.

Author Contributions

Conceptualization, S.-J.B.; Formal analysis, S.-J.B.; Funding acquisition, H.-J.K.; Investigation, H.-J.K.; Methodology, S.-J.B. and H.-J.K.; Supervision, H.-J.K.; Visualization, S.-J.B.; Writing, review and editing, S.-J.B. and H.-J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1F1A1063085).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Marchal, S.; François, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
2. Bu, S.-J.; Cho, S.-B. Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing URL detection. Electronics 2021, 10, 1492.
3. Bu, S.-J.; Cho, S.-B. Integrating deep learning with first-order logic programmed constraints for zero-day phishing attack detection. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2685–2689.
4. Wei, W.; Ke, Q.; Nowak, J.; Korytkowski, M.; Scherer, R.; Woźniak, M. Accurate and fast URL phishing detector: A convolutional neural network approach. Comput. Netw. 2020, 178, 107275.
5. Le, H.; Pham, Q.; Sahoo, D.; Hoi, S.C. URLNet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv:1802.03162.
6. Tajaddodianfar, F.; Stokes, J.W.; Gururajan, A. Texception: A character/word-level deep learning model for phishing URL detection. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2857–2861.
7. Muntasir, M.; Rahman, S.S.M.M.; Jahan, N.; Siddikk, A.B.; Islam, T. AntiPhishTuner: Multi-level approaches focusing on optimization by parameters tuning in phishing URLs detection. In Artificial Intelligence and Blockchain for Future Cybersecurity Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 161–180.
8. Le, A.; Markopoulou, A.; Faloutsos, M. Phishdef: Url names say it all. In Proceedings of the 2011 Proceedings IEEE INFOCOM, Shanghai, China, 10–15 April 2011; pp. 191–195.
9. Mohammad, R.M.; Thabtah, F.; McCluskey, L. An assessment of features related to phishing websites using an automated technique. In Proceedings of the 2012 International Conference for Internet Technology and Secured Transactions, London, UK, 10–12 December 2012; pp. 492–497.
10. Iuga, C.; Nurse, J.R.; Erola, A. Baiting the hook: Factors impacting susceptibility to phishing attacks. Hum.-Cent. Comput. Inf. Sci. 2016, 6, 8.
11. Bahnsen, A.C.; Bohorquez, E.C.; Villegas, S.; Vargas, J.; González, F.A. Classifying phishing URLs using recurrent neural networks. In Proceedings of the 2017 APWG Symposium on Electronic Crime Research (eCrime), Scottsdale, AZ, USA, 25–27 April 2017; pp. 1–8.
12. Zhao, J.; Wang, N.; Ma, Q.; Cheng, Z. Classifying malicious URLs using gated recurrent neural networks. In Proceedings of the International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, Matsue, Japan, 3–5 July 2018; pp. 385–394.
13. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 649–657.
14. Anand, A.; Gorde, K.; Moniz, J.R.A.; Park, N.; Chakraborty, T.; Chu, B.-T. Phishing URL detection with oversampling based on text generative adversarial networks. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 1168–1177.
15. Bu, S.-J.; Cho, S.-B. A convolutional neural-based learning classifier system for detecting database intrusion via insider attack. Inf. Sci. 2020, 512, 123–136.
16. Suleman, M.-T.; Awan, S.M. Optimization of URL-based phishing websites detection through genetic algorithms. Autom. Control. Comput. Sci. 2019, 53, 333–341.
17. Park, K.-W.; Bu, S.-J.; Cho, S.-B. Evolutionary optimization of neuro-symbolic integration for phishing URL detection. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Bilbao, Spain, 22–24 September 2021; pp. 88–100.
18. Da Silva, C.M.R.; Fernandes, B.J.T.; Feitosa, E.L.; Garcia, V.C. Piracema.io: A rules-based tree model for phishing prediction. Expert Syst. Appl. 2022, 191, 116239.
19. Shreeram, V.; Suban, M.; Shanthi, P.; Manjula, K. Anti-phishing detection of phishing attacks using genetic algorithm. In Proceedings of the 2010 International Conference on Communication Control and Computing Technologies, Nagercoil, India, 7–9 October 2010; pp. 447–450.
20. Moghimi, M.; Varjani, A.Y. New rule-based phishing detection method. Expert Syst. Appl. 2016, 53, 231–242.
21. Sun, Y.; Xue, B.; Zhang, M.; Yen, G.G.; Lv, J. Automatically designing CNN architectures using the genetic algorithm for image classification. IEEE Trans. Cybern. 2020, 50, 3840–3854.
22. Mamun, M.S.I.; Rathore, M.A.; Lashkari, A.H.; Stakhanova, N.; Ghorbani, A.A. Detecting malicious urls using lexical analysis. In Proceedings of the International Conference on Network and System Security, Taipei, Taiwan, 28–30 September 2016; pp. 467–482.
23. Cui, Q.; Jourdan, G.-V.; Bochmann, G.V.; Couturier, R.; Onut, I.-V. Tracking phishing attacks over time. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 667–676.

Figure 1. Combined structure of the genetic algorithm and deep learning for combinatorial optimization of the feature-selection process targeting the recall score.

Figure 2. Optimization process of feature selection rules based on genetic operations.

Figure 3. Diagram of optimization process for selection of feature sets that maximize recall.

Figure 4. Convolutional recurrent network for URL classification.

Figure 5. Accuracy and diversity improvement for each generation.

Figure 6. Self-organizing map-based visualization to show the complementarity of selected features.

Table 1. URL feature sets for modeling phishing attacks.

Type	URL Features
Address-bar-based	Usage of IP address
	Digit count
	Character length
	Special characters count
	http:// or https://
Abnormal request-based	URL request count
Abnormal request-based	Server form handler
Domain-based	Age of domain
	Registered top domain
	Subdomain count
	Typosquatted URL (e.g., google → goggle.com)
Script-based	Usage of mouseover script
	Usage of popup window
	Disabling right click

Table 2. Related works on the classification of phishing URLs according to the expression and modeling of URL features (SVM = support vector machine; LSTM = long short-term memory; GRU = gated recurrent unit; GAN = generative adversarial network).

Authors	URL Feature	Model	Dataset
Le [8]	Bag-of-words	SVM	VirusTotal
Mohammad [9]	Address bar, HTTP request, Domain, Script	Hierarchical classifier based on feature groups	PhishTank
Bahnsen [11]	Word embeddings	LSTM	PhishTank
Zhao [12]	Word embeddings	GRU	PhishTank
Anand [14]	Character-level URL features	GAN	PhishTank
Bu [2]	Character-level URL features	Convolutional autoencoder	PhishTank, PhishStorm, ISCX-URL-2016
Le [5]	Character-level, Word vector-level URL features	Character-CNN with W2V-LSTM	VirusTotal
Tajaddodianfar [6]	Character-level, Word vector-level URL features	CNN-LSTM with Attention	MS anonymized browsing data
Suleman [16]	Address bar, HTTP request, Domain, Script	Machine learning with genetic algorithm	UCI phishing website dataset
Park [17]	Character-level URL features, Address bar, HTTP request, Domain, Script	CNN-LSTM with genetic algorithm	PhishTank, PhishStorm, ISCX-URL-2016
Bu [3]	Character-level URL features, Address bar, HTTP request, Domain, Script	Transformer-style network with first-order logics	PhishTank, PhishStorm, ISCX-URL-2016

Table 3. Source, quantity, and examples of benchmark datasets.

Source	Desc.	Instances	Examples (Accessed Date: 19 October 2020)
ISCX-URL-2016 [22]	Benign	35,000	http://metro.co.uk/2015/05...
	Phishing	9000	http://standardprincipal.pt/...
	Malware	11,000	http://9779.info/%E5%88%...
	Spam	12,000	http://adverse*s.co.uk/scr/cl...
PhishStorm [1]	Benign	47,682	en.wikipedia.org/wiki/dead...
PhishStorm [1]	Phishing	47,859	nobell.it/70ffb52d079109dc...
PhishTank [23]	DMOZ Open Directory (Benign)	45,000	http://geneba**.org/ftp/...
	OpenDNS (Phishing)	15,000	http://droopbxoxx.com/@@@..
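The instance counts in Table 3 can be checked against the total of 222,541 URLs reported in the abstract. The short sketch below copies the counts from the table; the dictionary layout and variable names are illustrative only. It also prints the share of phishing URLs per source, which shows how differently balanced the three datasets are.

```python
# Instance counts copied from Table 3; the data structure itself is illustrative.
DATASETS = {
    "ISCX-URL-2016": {"Benign": 35_000, "Phishing": 9_000,
                      "Malware": 11_000, "Spam": 12_000},
    "PhishStorm": {"Benign": 47_682, "Phishing": 47_859},
    "PhishTank": {"Benign": 45_000, "Phishing": 15_000},
}

# Total number of URLs contributed by each source.
per_source = {name: sum(counts.values()) for name, counts in DATASETS.items()}

# Grand total across the three sources.
total = sum(per_source.values())

# Fraction of each source that is labeled as phishing.
phishing_share = {
    name: counts["Phishing"] / per_source[name]
    for name, counts in DATASETS.items()
}

for name in DATASETS:
    print(f"{name}: {per_source[name]:,} URLs, "
          f"{phishing_share[name]:.1%} phishing")
print(f"Total: {total:,} URLs")
```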

Table 4. The types of hyperparameters for each component and the range of options.

Objective	Hyperparameter	Choice of Value Range	Best Model for Phishing Detection
Genetic algorithm configurations	Population per generation	50, 100, 150, 200, 250, 300	300
	Number of generations	16, 32, 64, 128, 256	64
	Chromosome selection strategy	Roulette, Rank, Elitism	Elitism
Network architecture	Activation functions	relu, sigmoid, tanh	relu
	Embedding dimension	32, 64, 128, 256, 512	64
	Dropout rate	0.25, 0.5	0.5
Learning strategy	Batch size	32, 64, 128, 256, 512	32
	Loss optimizer	adam, adadelta, sgd	adam
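To make the genetic algorithm settings in Table 4 concrete, the following is a minimal sketch of an elitism-based search over binary feature masks. Only the population size (300) and the number of generations (64) come from Table 4. The chromosome length of 17 (one bit per row of Table 6), the elite fraction, the mutation rate, the single-point crossover, and the toy fitness function are all assumptions made for illustration. In the paper the fitness is tied to the recall of the trained URL classifier; here a cheap stand-in is used so that the example runs in well under a second.

```python
import random

# Values taken from the "Best Model" column of Table 4.
POPULATION_SIZE = 300
NUM_GENERATIONS = 64

# Assumptions for illustration only (not stated in the paper).
NUM_FEATURES = 17       # one bit per candidate feature row in Table 6
ELITE_FRACTION = 0.1    # share of chromosomes carried over unchanged
MUTATION_RATE = 0.05    # per-gene bit-flip probability


def toy_fitness(mask):
    """Stand-in for validation recall: agreement with a fixed target mask."""
    target = [1, 0] * (NUM_FEATURES // 2) + [1] * (NUM_FEATURES % 2)
    return sum(m == t for m, t in zip(mask, target)) / NUM_FEATURES


def evolve(fitness, seed=0):
    """Run an elitist genetic algorithm and return the best feature mask."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(NUM_FEATURES)]
                  for _ in range(POPULATION_SIZE)]
    n_elite = max(2, int(ELITE_FRACTION * POPULATION_SIZE))

    for _ in range(NUM_GENERATIONS):
        # Elitism: the fittest chromosomes survive unchanged.
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:n_elite]

        # Refill the population with mutated crossovers of elite parents.
        children = []
        while len(children) < POPULATION_SIZE - n_elite:
            parent_a, parent_b = rng.sample(elites, 2)
            cut = rng.randrange(1, NUM_FEATURES)  # single-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < MUTATION_RATE else g
                     for g in child]
            children.append(child)

        population = [e[:] for e in elites] + children

    return max(population, key=fitness)


best_mask = evolve(toy_fitness)
print("selected feature mask:", best_mask)
print("toy fitness:", toy_fitness(best_mask))
```

Because the elites are copied into the next generation unchanged, the best fitness in the population never decreases from one generation to the next. Replacing toy_fitness with a function that trains and validates a classifier on the masked feature set would make each evaluation far more expensive, which is one reason the population size and generation count matter as hyperparameters.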

Table 5. Ten-fold cross-validation accuracy (Acc.) and recall of phishing URL classification by method.

Dataset	ISCX-URL-2016		PhishStorm		PhishTank
Metrics	Acc.	Recall	Acc.	Recall	Acc.	Recall
Base network
Character-CNN	0.9363	0.8909	0.9016	0.8565	0.8852	0.8034
LSTM [10]	0.9175	0.8803	0.8777	0.8440	0.8544	0.7865
CNN-LSTM	0.9424	0.9015	0.9229	0.8785	0.9070	0.8374
Comparative studies
URLNet [5]	0.9450	0.9390	0.9395	0.8864	0.9226	0.8785
Texception Net [6]	0.9765	0.9462	0.9710	0.9227	0.9319	0.9075
GA-based URL feature set optimization (Proposed)
Ours	0.9685	0.9510	0.9505	0.9332	0.9483	0.9081
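The improvement figures quoted in the abstract (4.13%p in accuracy and 7.07%p in recall over the convolutional–recurrent network) correspond to the PhishTank columns of Table 5, comparing the CNN-LSTM row with the proposed method. The sketch below reproduces that arithmetic and also states the two metrics in terms of confusion-matrix counts. The function names and the example counts are illustrative, not taken from the paper.

```python
def recall(tp, fn):
    """Fraction of actual phishing URLs that were detected."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all URLs that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def percentage_point_gain(ours, baseline):
    """Difference between two rates, expressed in percentage points."""
    return round((ours - baseline) * 100, 2)


# PhishTank columns of Table 5.
cnn_lstm = {"acc": 0.9070, "recall": 0.8374}
proposed = {"acc": 0.9483, "recall": 0.9081}

acc_gain = percentage_point_gain(proposed["acc"], cnn_lstm["acc"])
recall_gain = percentage_point_gain(proposed["recall"], cnn_lstm["recall"])
print(f"accuracy gain: {acc_gain} %p")   # 4.13 %p
print(f"recall gain:   {recall_gain} %p")  # 7.07 %p

# Hypothetical confusion counts, only to show how the metrics are defined:
# 90 phishing URLs caught, 10 missed, 95 benign passed, 5 benign flagged.
print(recall(tp=90, fn=10))                   # 0.9
print(accuracy(tp=90, tn=95, fp=5, fn=10))    # 0.925
```

Recall depends only on the false negatives, that is, phishing URLs that slip through, which is why the method is tuned for it rather than for accuracy alone.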

Table 6. URL features selected for classification of phishing URLs.

Type	URL Features	Selected URL Features
Address-bar-based	Usage of IP address	is_ip_address
	Digit count	digits > 50
	Character length	length > 30
	Special character count	special_characters > 10
	Special character count	dots > 5
	http:// or https://	is_https
Abnormal-request-based	URL request count	-
	SFH	SFH_blank
	SFH	SFH_empty
Domain-based	Age of domain	domain_age > average_domain_age
	Registered top domain	is_registered
	Subdomain count	subdomains > 5
	Typo-squatted domain	is_typosquatted
	Suspicious TLD	suspicious_tld
Script-based	Usage of mouseover	is_mouseover_available
	Usage of pop-up	-
	Disabling right click	is_rightclick_available
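Several of the selected features in Table 6 can be computed from the URL string alone. The following is a minimal sketch of such an extractor using only the Python standard library. The thresholds (50 digits, 30 characters, 10 special characters, 5 dots, 5 subdomains) come from Table 6; the function name, the definition of a "special character", and the way subdomains are counted are assumptions, since the table does not specify them. Features that need WHOIS data or page content (domain age, SFH, mouseover and right-click scripts) are left out.

```python
import ipaddress
from urllib.parse import urlsplit


def _is_ip(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute the string-only binary features listed in Table 6."""
    # Some dataset entries have no scheme; add one so the host is parsed.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # Assumption: a special character is anything other than letters,
    # digits, and the structural characters '.', ':' and '/'.
    specials = sum(1 for c in url if not c.isalnum() and c not in ".:/")

    # Assumption: labels to the left of the last two host labels
    # are counted as subdomains.
    subdomains = 0 if _is_ip(host) else max(host.count(".") - 1, 0)

    return {
        "is_ip_address": _is_ip(host),
        "digits > 50": sum(c.isdigit() for c in url) > 50,
        "length > 30": len(url) > 30,
        "special_characters > 10": specials > 10,
        "dots > 5": url.count(".") > 5,
        "is_https": parts.scheme == "https",
        "subdomains > 5": subdomains > 5,
    }


ip_url = lexical_features("http://192.168.0.1/login")
deep_url = lexical_features(
    "https://a.b.c.d.e.f.g.example.com/path/to/resource")
print(ip_url)
print(deep_url)
```

Counting subdomains from the dot count is a simplification: it miscounts hosts under multi-label public suffixes such as co.uk, so a production extractor would consult a public suffix list instead.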

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

