Deep Character-Level Anomaly Detection Based on a Convolutional Autoencoder for Zero-Day Phishing URL Detection

Abstract: Given the severity of phishing attacks, data-driven approaches using massive URL observations have been validated, especially in the field of cyber security. However, the supervised learning approach, which relies on known attacks, has limited robustness against zero-day phishing attacks. Moreover, fully exploiting the sequential features of URL characters is known to be critical for the phishing detection task. Taken together, to ensure both sustainability and intelligibility, we propose the combination of a convolution operation to model the character-level URL features and a deep convolutional autoencoder (CAE) to consider the nature of zero-day attacks. Extensive experiments on three real-world datasets consisting of 222,541 URLs showed the highest performance among the latest deep-learning methods. We demonstrated the superiority of the proposed method by receiver-operating characteristic (ROC) curve analysis in addition to 10-fold cross-validation and confirmed that the sensitivity improved by 3.98% compared to the latest deep model.


Introduction
A phishing attack in its broadest sense can be defined as a scalable act of deception whereby impersonation is used by an attacker to obtain information from an individual [1]. Because the most common form of online phishing is a malicious hyperlink embedded in a message, the recent trend in which personal connections are reinforced by the explosive growth of social media services leaves users particularly vulnerable. Consequently, it is important to better understand the diffusion of phishing URLs in order to improve the safety and reliability of devices and networks [2].
In the field of cyber security, the supervised learning approach, which learns the features of phishing attacks using various machine-learning techniques and massive known-attack observations, was introduced [3,4]. Deep learning is a representative method of learning the mapping function between observed URL features and labels through a large number of parameters (weights) expressed by layer-by-layer matrix product and sum operations. Among the most prominent methods, the combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) has been found to significantly improve detection performance by explicitly modeling the character- and word-level features of phishing attacks [5]. The Texception network [6], designed by Microsoft to classify phishing attacks, is an effective modification of a CNN that shows the best performance in supervised learning-based phishing URL classification tasks. The convolution operation aims to learn a spatial filter to extract features in the local receptive field that shares weights [7], and the long short-term memory (LSTM), a variant of an RNN, is a memory cell that stores the weights used for mapping between inputs and outputs [8].
Despite the successful development of deep-learning-based phishing URL classifiers, the supervised learning approach, which focuses on minimizing the classification loss using a large number of observations and known attacks, has limitations in the detection of phishing attacks. The main difficulty, referred to as a zero-day phishing attack [9], is that phishing URLs are generated and then discarded immediately after the information is stolen.
In Figure 1, we visualize the benign and phishing URL space classified by existing supervised machine-learning techniques using URL observations collected from the PhishTank [10] database. The blue and red dots represent benign and phishing URL observations, respectively, and the colored areas represent the decision boundaries of the supervised classifiers. From the nature of the zero-day attack, we note that the confusion of the classifier occurs due to the class imbalance issue in which benign URLs are observed far less often than phishing URLs.
Electronics 2021, 10, x FOR PEER REVIEW 2 of 16
Furthermore, we visualized three major statistics of the character distribution of URLs in Figure 2 to highlight the difficulties inherent in URL modeling. In Figure 2a, we quantified the effect of specific subdomains on the characteristics of phishing URLs as mutual information. As security experts point out, keywords such as wp, admin, and content, which come from the default settings of personal servers, as well as the php keyword, can serve as abnormal features of phishing URLs. Figure 2b,c shows that phishing URLs are notably longer than benign URLs and have a character composition that differs significantly from the alphabetic distribution of natural language.
Taken together, the simultaneous consideration of the nature of a zero-day attack and character-level characteristics of phishing URLs is a promising approach that can ensure both sustainability and intelligibility in the phishing URL detection task. We noted that in order to satisfy the two requirements, an anomaly detection framework that can cope with class imbalance [11] and an optimized operation for URL modeling [12] are essential.
In this paper, we propose a combination of a convolution operation to model the character-level URL features and a deep autoencoder (AE) to consider the nature of zero-day attacks. The main innovation of this study is the introduction of an anomaly detection framework for phishing URL detection based on a convolutional autoencoder (CAE). Unlike the supervised learning approach, our method has the advantage of constructing a URL template by training the autoencoder with only benign URLs. We defined the abnormal score of phishing URLs by exploiting the property that an autoencoder reconstructs unobserved data poorly. Moreover, we trained an auxiliary convolutional neural network that improves the sensitivity of detection by using the phishing abnormal score as a feature of the phishing URL. Extensive experiments on three real-world datasets consisting of 222,541 URLs showed the highest performance among the latest deep-learning methods. In order to demonstrate the superiority of the proposed method, we performed receiver-operating characteristic (ROC) curve analysis in addition to 10-fold cross-validation and confirmed that the accuracy improved by 3.98%.
Figure 2. Three main statistics supporting the strong need for character-level modeling in the phishing URL detection task: (a) mutual information by keyword; (b) availability of the URL length feature; (c) character distribution that separates benign and phishing URLs.
The main contribution of this paper is that we formulated phishing detection as an anomaly detection problem with a severe class-imbalanced condition and solved it efficiently by extending the existing deep autoencoder. To the best of our knowledge, this is the first attempt where a convolutional autoencoder is incorporated to reconstruct a URL and measure the abnormal score for a phishing attack. The main findings of this research can be summarized as follows:

• The convolutional autoencoder works well for modeling the deep representation of benign URLs, resulting in the best accuracy for phishing detection.
• The abnormal score defined based on the reconstruction error of the autoencoder is suitable for phishing detection, resulting in a significant improvement in recall.

Related Works
In this section, we review the relevant phishing URL detection methods based on machine-learning algorithms. Phishing URL detection research can be categorized by URL representation, modeling method, and learning method, as summarized in Table 1.
Table 1 (partial): selected URL features, rule-based detection algorithm, PhishTank, Azeez [13]; hierarchical classifier based on feature group, PhishTank, Mohammad [14]; machine-learning methods: AdaBoost, BN, Decision ... [20].
As an initial attempt to model phishing attacks, malware epidemiology was proposed and implemented with diagram-based compartment models [2]. Azeez et al. proposed a simple rule-based detection algorithm utilizing four characteristics of suspicious URLs and identified the URL characteristics that are effective for classification based on the classification performance [13]. Mohammad et al. contributed to the automation of the phishing URL detection task by systematically extracting URL features and proposing a hierarchical classifier according to the extraction rule [14,21]. Osho et al. applied the URL features collected and refined for phishing classification to 35 machine-learning-based classifiers, including less common methods, and achieved a classification performance of 0.9570 with the random forest algorithm [15].
On the other hand, as rule-based URL feature selection and modeling were shown to generalize poorly to unobserved URLs [22], machine-learning-based [23] phishing detection has been actively studied and has reached better performance. Naive Bayes (NB), decision tree (DT), random forest (RF), Bayesian network (BN), and support vector machine (SVM) were quantitatively evaluated for modeling phishing URLs [16], and it was emphasized that a nonlinear mapping function is effective given the natural language characteristics of URLs. Moreover, the phishing URL databases [2,21,24] that store observed phishing attacks provide an ideal testbed for the deep-learning-based URL classification task with a relatively closed environment. Various deep-learning methods such as CNN [5,6] and its modifications [18,25,26] have been proposed, as well as the LSTM-based generative adversarial network (GAN) [17], which addresses the class imbalance issue by generating phishing URLs.
The majority of the current research in deep-learning-based phishing detection focuses mainly on optimizing the operation of the neural network. In particular, the comparative study in [6] demonstrates the superiority of the modified 1D-convolution operation with variable filter size over several competitors. This motivates our decision to consider the anomaly detection-based approach proposed in this paper. Learning methods fall mainly into four approaches: a supervised approach that learns the phishing URL features and their selection method directly from the label classification result, a semi- or weakly supervised approach that uses only a small number of labels or noisy labels to consider realistic constraints [19], an unsupervised approach that does not use the label information of URLs [27], and an autoencoder-based anomaly detection approach [20]. The fact that phishing URLs are not used to learn the benign URL model in the unsupervised approach is an advantage in class imbalance conditions and, more importantly, is an amenable solution for modeling the nature of a zero-day attack.

Proposed Method
In this section, we describe the combination of the convolutional autoencoder with an auxiliary classifier that learns the threshold function to detect the phishing URL based on the anomaly detection framework. Figure 3 illustrates the overall architecture of the proposed method, which consists of URL preprocessing steps, a character-level deep URL model based on an autoencoder, and phishing URL detection based on the abnormal score and URL reconstruction representing the phishing URL features.

Character-Level URL Model Based on a Convolutional Autoencoder
We performed two preprocessing steps to focus on the syntax of the URL described in Section 1. The first is to allocate a unique integer to each character that constitutes the URL. We implemented the allocation function by extracting the ASCII code using the built-in Python function ord() [28]. The second is the one-hot encoding of each code to remove the arithmetic relationship from the sequence of integers; that is, each character is replaced with a 1-of-m vector over a predefined dictionary. We defined the character dictionary as 26 letters, 10 digits, and 54 to 64 special characters, including whitespace, and encoded the URLs as shown in Figure 4. Three benchmark phishing URL datasets were preprocessed, and each URL was cropped to 100 characters in consideration of the average URL length in each dataset. URLs shorter than the 100-character limit were zero padded. Finally, the $i$th observed URL $x_i$ of $X = [x_1, \ldots, x_n]$ forms a vector of size (length, size of dictionary).
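The two preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary `ALPHABET`, the lowercasing of the URL, and the helper name `encode_url` are assumptions made here, since the exact character set is not listed in the text.

```python
import string

MAX_LEN = 100  # crop/pad length stated in the text

# Hypothetical dictionary: lowercase letters, digits, ASCII punctuation, and space.
# The exact character set used in the paper is not given, so this is an assumption.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CODE_TO_INDEX = {ord(ch): i for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=MAX_LEN):
    """Return a max_len x len(ALPHABET) one-hot matrix as nested lists."""
    rows = []
    for ch in url.lower()[:max_len]:        # crop to max_len characters
        row = [0] * len(ALPHABET)
        idx = CODE_TO_INDEX.get(ord(ch))    # ord() gives the integer code
        if idx is not None:                 # characters outside the dictionary stay all-zero
            row[idx] = 1
        rows.append(row)
    # zero-pad short URLs so that every URL has the same shape
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows


image = encode_url("http://example.com/login.php")
print(len(image), len(image[0]))  # number of rows (length) and columns (dictionary size)
```

Each row of the result corresponds to one character position, which is what allows the URL to be treated as an image by the convolutional layers.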
The general idea of an autoencoder is to map the data through a nonlinear encoder to a hidden layer and to use the hidden units as the new feature representation, as depicted in Figure 5 [29,30]:
$$h_l = f(x_i) = \sigma(W_1 x_i + b_1)$$
$$\hat{x}_i = g(h_l) = \sigma(W_2 h_l + b_2)$$
where $h_l \in \mathbb{R}^z$ is the URL representation of the $l$th layer, and $\hat{x}_i \in \mathbb{R}^d$ is interpreted as a reconstruction of a normalized input URL $x_i \in \mathbb{R}^d$. The parameter set includes weight matrices $W_1 \in \mathbb{R}^{z \times d}$ and $W_2 \in \mathbb{R}^{d \times z}$ and bias vectors $b_1 \in \mathbb{R}^z$ and $b_2 \in \mathbb{R}^d$ with dimensionality $z$ and $d$, and $\sigma(\cdot)$ is a nonlinear activation function. The core idea is to maximize the reconstruction error for the unobserved URL instance by training the autoencoder using only benign URLs and to implement the encoding function $f(\cdot)$ and decoding function $g(\cdot)$ with a convolutional neural network to fully exploit the character-level URL features [31].
Figure 5. An illustration of an autoencoder that reconstructs a URL image expressed as a vector of (length, dictionary size). The URL image is encoded as a 120-dimensional vector in the hidden layer.
The major hurdle in modeling the URL with a neural network lies in extracting the spatial features from the limited URL samples [32,33]. We construct convolutional and deconvolutional layers so that the convolutional autoencoder learns the benign URL features. The convolution operation is well known in the field of pattern recognition for its data-driven filter learning, which focuses on extracting spatial features [34,35].
The convolution operation $\phi_C(\cdot)$ and the max-pooling operation $\phi_P(\cdot)$ in CNNs, which have been successfully applied to extract character-level features, are suitable for modeling the sequence of characters in URLs and extracting features using the local connectivity between characters [36]. The convolution operation is known to reduce the translational variance between features [37,38] and preserves the spatial relationship between URL characters by learning filters that extract the hidden correlations. Given the $(k \times k)$-sized filter $W$ of the $l$th convolutional layer, the stacked convolution operation applied to the input URL $x^l_{mn}$ in row $m$ and column $n$ takes the standard form:
$$\phi_C(x^l_{mn}) = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} W_{ab}\, x^l_{(m+a)(n+b)}$$
Because the dimension of the output vector that has been distorted and copied by the convolution operation $\phi_C(\cdot)$ is increased by the number of convolution filters, summary statistics of nearby node activations are extracted by max-pooling $\phi_P(\cdot)$. Pooling refers to a dimensionality reduction process over the $(k \times k)$ region that imposes a capacity bottleneck and facilitates faster computation [39]:
$$\phi_P(x^l_{mn}) = \max_{0 \le a,\, b < k} x^l_{(m+a)(n+b)}$$
The convolutional autoencoder has been extensively utilized in the field of anomaly detection and novelty detection. It compresses and reconstructs character-level URL features through an encoding function $f_\theta(\cdot)$ consisting of the convolution/pooling operation and a decoding function $g_\theta(\cdot)$ performing the inverse operation.
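To make the two operations concrete, the following sketch applies a single $(k \times k)$ filter and non-overlapping max pooling to a small binary matrix. It is only an illustration under simplifying assumptions: one filter, stride 1 for the convolution, stride $k$ for the pooling, no bias or activation, and hypothetical helper names `conv2d_valid` and `max_pool`.

```python
def conv2d_valid(x, w):
    """Slide a k x k filter w over the 2-D list x (stride 1, no padding)."""
    k = len(w)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [[sum(w[a][b] * x[m + a][n + b] for a in range(k) for b in range(k))
             for n in range(cols)] for m in range(rows)]


def max_pool(x, k):
    """Non-overlapping k x k max pooling (stride k); ragged borders are dropped."""
    return [[max(x[m + a][n + b] for a in range(k) for b in range(k))
             for n in range(0, len(x[0]) - k + 1, k)]
            for m in range(0, len(x) - k + 1, k)]


# A 5 x 5 toy "URL image" with ones on the diagonal.
x = [[1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]
w = [[1, 0],
     [0, 1]]                        # filter that responds to diagonal patterns

feature_map = conv2d_valid(x, w)    # 4 x 4 response map
pooled = max_pool(feature_map, 2)   # 2 x 2 summary of the strongest responses
print(pooled)                       # [[2, 0], [0, 2]]
```

The filter responds only where the diagonal pattern is present, and pooling keeps the strongest response of each region while halving each dimension.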
We define a convolutional autoencoder to construct a deep URL model with the reconstructed URL $\hat{x}_i$ for the compressed URL code $h$:
$$h = f_\theta(x_i), \qquad \hat{x}_i = g_\theta(h) = g_\theta(f_\theta(x_i))$$
The distance between the input benign URL $x_i$ and the reconstructed URL $\hat{x}_i$ can be implemented with the Euclidean distance, and the loss function of the convolutional autoencoder is defined as the error between the input $x_i$ and the reconstruction $\hat{x}_i$:
$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert_2^2$$
The objective of autoencoder learning is to find the encoding/decoding parameter $\theta$ that minimizes the loss function $L_{MSE}$, and we trained the network using the backpropagation algorithm with stochastic gradient descent, following the basic neural network training method:
$$\theta^* = \arg\min_\theta L_{MSE}$$
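The principle behind this objective, namely that a model trained to reconstruct only benign data reconstructs other data poorly, can be illustrated with a toy example. The sketch below is not the convolutional autoencoder of this paper: it assumes a linear autoencoder with a one-dimensional code, two-dimensional inputs, and hand-written SGD updates, and all names and constants are chosen here for illustration.

```python
def reconstruct(x, w, v):
    h = sum(wi * xi for wi, xi in zip(w, x))   # encoder: 1-D code h = w . x
    return h, [vi * h for vi in v]             # decoder: x_hat = v * h


def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def train(data, epochs=500, lr=0.02):
    w, v = [0.5, 0.5], [0.5, 0.5]              # fixed start for reproducibility
    for _ in range(epochs):
        for x in data:                         # one SGD step per sample
            h, x_hat = reconstruct(x, w, v)
            err = [b - a for a, b in zip(x, x_hat)]        # x_hat - x
            back = sum(e * vi for e, vi in zip(err, v))    # error passed back to the encoder
            v = [vi - lr * 2 * e * h for vi, e in zip(v, err)]
            w = [wi - lr * 2 * back * xi for wi, xi in zip(w, x)]
    return w, v


# "Benign" training points all lie on the line x2 = 2 * x1.
benign = [[t, 2 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
w, v = train(benign)

benign_error = max(squared_error(x, reconstruct(x, w, v)[1]) for x in benign)
unseen_error = squared_error([2.0, -1.0], reconstruct([2.0, -1.0], w, v)[1])
print(benign_error < unseen_error)
```

The point `[2.0, -1.0]` lies off the training line, so its reconstruction error stays large while the benign points are reconstructed almost exactly; the abnormal score in the next subsection relies on the same gap.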

Phishing URL Classification Based on Reconstruction Errors
Because only benign URLs are used to train the convolutional autoencoder, the parameter $\theta^*$ that optimizes the reconstruction of benign URLs makes it difficult to reconstruct phishing URLs, which have different character distributions and length characteristics. Following the traditional autoencoder-based anomaly detection framework, we defined an abnormal score $S_\tau$ with threshold $\tau$ and distance function $d(\cdot, \cdot)$ based on the reconstruction error; in the usual thresholded form,
$$S_\tau(x_i) = \begin{cases} 1, & d(x_i, \hat{x}_i) > \tau \\ 0, & \text{otherwise} \end{cases}$$
The distance function $d(\cdot, \cdot)$ can be implemented as a Manhattan distance or a cosine distance, but we defined it as the most intuitive Euclidean distance, in line with the loss function of the convolutional autoencoder.
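A thresholding rule of this kind can be sketched as follows. The three distance functions correspond to the options named above; the function names, the flattening of the URL image into a plain list, and the example threshold are assumptions made for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


def abnormal_score(x, x_hat, tau, distance=euclidean):
    """Return 1 (flag as phishing) when the reconstruction error exceeds tau."""
    return 1 if distance(x, x_hat) > tau else 0


x = [1, 0, 0, 1]                     # flattened input (toy values)
close = [0.9, 0.1, 0.0, 0.8]         # faithful reconstruction
blurred = [0.4, 0.5, 0.3, 0.2]       # poor reconstruction
print(abnormal_score(x, close, tau=0.5), abnormal_score(x, blurred, tau=0.5))  # 0 1
```

In practice the threshold would have to be chosen on held-out data, which is one reason a learned classifier can be preferable to a fixed rule.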
The abnormal score $S_\tau$ defined for the reconstructed URL $\hat{x}_i$ can itself be used as a classifier by applying a thresholding rule. However, we constructed an additional phishing URL classifier for the reconstructed URL $\hat{x}_i$, because the thresholding rule is known to generalize poorly to unobserved instances and to phishing URLs that resemble the benign URL distribution. Because the input of the auxiliary classifier that finally performs the phishing classification task is the reconstructed URL image, we implemented the classifier $\phi(\cdot)$ using a convolutional neural network. Intuitively, the convolutional neural network learns a thresholding function that classifies labels from reconstructed URL images with weight matrices $W \in \mathbb{R}^2$:
$$\hat{y}_i = \phi(\hat{x}_i; W)$$
Finally, the objective of the auxiliary classifier that learns the thresholding function is to find the parameter $\theta$ that minimizes the loss function $L_{CE}$, implemented as the cross-entropy between the predicted and actual labels:
$$L_{CE} = -\sum_{i=1}^{n} \sum_{c} y_{ic} \log \hat{y}_{ic}$$
where $c$ ranges over the benign and phishing classes.
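The cross-entropy objective of the auxiliary classifier can be computed as in the sketch below. It assumes a two-class softmax output and integer class labels (0 for benign, 1 for phishing), and it reports the mean over samples rather than the sum; the function names and the small `eps` guard are illustrative choices, not details taken from the paper.

```python
import math


def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy; y_true holds class indices, y_prob class probabilities."""
    return -sum(math.log(p[y] + eps) for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy classifier outputs (logits) for two reconstructed URLs.
probs = [softmax([2.0, 0.0]), softmax([0.0, 0.0])]
labels = [0, 1]                                  # 0 = benign, 1 = phishing
loss = cross_entropy(labels, probs)
print(round(loss, 3))
```

A confident correct prediction contributes little to the loss, whereas an undecided prediction such as `[0.5, 0.5]` contributes $\log 2$.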

Experimental Results
In this section, we present how the convolutional autoencoder with character-level embedded URLs predicts the phishing attack and evaluate the performance with 10-fold cross-validation in terms of accuracy and recall [40], which is followed by quantitative comparison with the latest deep-learning models.
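For reference, a 10-fold cross-validation loop can be organized as in the following sketch. The splitting scheme (shuffled, interleaved folds) and the callback `score_fold`, which is expected to train on one index list and return a metric on the other, are assumptions made here; the text does not specify how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, score_fold, k=10, seed=0):
    """Average a user-supplied score over the k folds."""
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Placeholder scorer: reports the share of samples held out in each fold.
mean_share = cross_validate(25, lambda train, test: len(test) / 25)
print(mean_share)
```

Replacing the placeholder scorer with a function that trains the model on `train` and returns accuracy or recall on `test` yields the fold-averaged metrics.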

Dataset and Implementation
Here we validate the proposed convolutional autoencoder and the auxiliary classifier, which utilizes the reconstruction error as an abnormal score, on benchmark URL databases. For extensive evaluation, three real-world URL datasets consisting of 222,541 benign and phishing URLs were collected and are summarized in Table 2. The ISCX-URL-2016 dataset targets a four-way classification task consisting of benign, phishing, malware, and spam URLs and has a 3:1 class imbalance, which is characteristic of malicious URL modeling. The web-accessible PhishStorm and PhishTank datasets provide known phishing attack cases. Unlike the PhishStorm dataset, where class sampling was performed, PhishTank does not provide benign URLs. We collected benign URLs from the Open Directory Project and collected 95,541 and 60,000 URLs.
The architecture of the convolutional autoencoder can be varied according to the number of stacked convolution and pooling layers, the number of convolutional filters, the kernel size, and the number of nodes in each layer. Given that typical deep-learning models require an optimization process, it is essential to adjust the hyperparameters carefully. A total of 3,677,115 deep-learning hyperparameters of the proposed method were determined through empirical trial and error in an iterative optimization process. The number of convolutional filters (the number of local receptive fields for learning spatial features between URL characters), the size of the receptive field, the stride (which controls the overlap between regions), the type of activation function of each layer, and the number of layer-by-layer parameters are specified in Table 3.

Phishing Detection Performance
Table 4 compares the accuracy and recall of the latest deep models, including the standard deep-learning networks (CNN, LSTM) and their major modifications, which achieve state-of-the-art results.
CNN and CNN-LSTM, used as the base networks, achieved an accuracy of 0.9424 and a recall of 0.9015 on the ISCX-URL-2016 dataset. We considered URLNet, which achieved the best performance in URL classification by using CNNs in parallel, and Microsoft's Texception network, which adapted the inception operation of the CNN to the URL domain, as the major comparative studies. The Texception network achieved accuracies of 0.9765, 0.9710, and 0.9319 on the respective datasets, whereas URLNet, composed of a vanilla CNN, achieved performance similar to that of CNN and CNN-LSTM. Surprisingly, the triplet network structure, which has recently attracted much attention in the field of signal processing and image classification, and its modification, the Monte Carlo search-based triplet network, achieved robust performance. The triplet network is a recent implementation of metric learning that explicitly learns the distributions of a dataset, and we note that it is relatively well suited to modeling character-level URL images.
The proposed method outperforms the latest deep-learning models. As argued, modeling both the class imbalance and the character-level features of URLs was effective, and we achieved the highest accuracy and recall on all three benchmark datasets. On the other hand, thresholding on the anomaly score calculated from URL reconstruction, without the auxiliary classifier, showed performance degradation.
In Figure 6, receiver-operating characteristic (ROC) analysis was conducted to show the improvement in recall relative to the comparative study. The x- and y-axes represent the false positive rate and the true positive rate of the phishing URL classifier's output, respectively, and our approach of learning the thresholding function produced an area-under-the-curve (AUC) improvement of 1.06%.
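The ROC curve and its AUC can be computed from classifier scores as sketched below. This is a generic illustration, not the evaluation script used for Figure 6: it assumes binary labels (1 for phishing), at least one sample of each class, and a trapezoidal estimate of the area.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # emit a point only after the last sample in a group of tied scores
        last_of_tie = rank + 1 == len(order) or scores[order[rank + 1]] != scores[i]
        if last_of_tie:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


fpr, tpr = roc_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # 0.75
```

An AUC of 1.0 corresponds to scores that separate the two classes perfectly, and 0.5 to scores that carry no information about the label.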
We compared the proposed method and the comparative study in terms of accuracy and recall under severe class imbalance conditions in Figure 7. The class imbalance ratio was adjusted along the x-axis by removing phishing URLs from the PhishTank training dataset, under the assumption of a zero-day phishing attack situation. The imbalance ratio is the number of phishing URLs relative to the benign URLs, scaled to the [0.0, 1.0] range. For example, at an imbalance ratio of 1.0, it is assumed that there is no phishing URL instance in the training dataset. For a fair comparison, we applied a class weight algorithm proportional to the number of data when training the two networks.
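One way to set up such an experiment is sketched below. Two points are assumptions rather than details from the text: the imbalance ratio is read as the fraction of phishing URLs removed from the training set (so that 1.0 leaves none, as in the example above), and the class weights use the common inverse-frequency heuristic, since the exact weighting scheme is not specified.

```python
import random


def drop_phishing(samples, labels, imbalance_ratio, seed=0):
    """Remove a fraction `imbalance_ratio` of phishing samples (label 1)."""
    phishing = [i for i, y in enumerate(labels) if y == 1]
    n_drop = round(imbalance_ratio * len(phishing))
    dropped = set(random.Random(seed).sample(phishing, n_drop))
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [samples[i] for i in kept], [labels[i] for i in kept]


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: len(labels) / (len(counts) * c) for y, c in counts.items()}


urls = list(range(10))               # stand-ins for encoded URLs
labels = [0] * 6 + [1] * 4           # 6 benign, 4 phishing
_, half = drop_phishing(urls, labels, 0.5)
_, none_left = drop_phishing(urls, labels, 1.0)
print(sum(half), sum(none_left))     # 2 0
print(class_weights(half))
```

With this weighting, the rarer class receives a proportionally larger weight in the loss, which keeps the comparison between the two networks from being dominated by the majority class.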
Initially, the proposed method and the Texception network showed accuracies of 0.9642 and 0.9635, respectively, but the accuracy degraded linearly as the number of phishing URL instances decreased. Because the proposed method includes the thresholding mechanism based on the abnormal score, a classification accuracy of 0.8883 was achieved even in the severe class-imbalanced condition.

Performance Evaluation by Component: URL Reconstruction and Effect of the Auxiliary Classifier
We conducted confusion matrix analysis on the Phishtank dataset to verify the effect of the auxiliary classifier, as shown in Table 5. The values in parentheses are the results of thresholding-based classification using the anomaly score alone, that is, without the auxiliary classifier that utilizes the URL reconstruction from the convolutional autoencoder. Referring to the statistics of misclassified cases that deviate from the main diagonal of the matrix, we confirmed an improvement in recall and accuracy for both benign and phishing URLs.
The convolutional autoencoder, which is the core idea of the proposed method, is optimized for reconstructing benign URLs. We compared the input and reconstructed images for normal and phishing URLs in Figure 8. The white dots in the URL image represent characters, and the sequence of characters in the URL is recorded along the y-axis. For the benign URL, there is little visual difference between the input and the reconstructed URL, whereas the reconstruction of the phishing URL shows a blurring effect. There was no significant difference in terms of the structural similarity index (SSIM), which measures the difference in distribution instead of the pixel difference in the image; however, in terms of the root mean square error (RMSE), which measures the Euclidean distance between pixels, we confirmed the increased reconstruction error for the phishing URL.
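To make the RMSE comparison concrete, the following is a minimal sketch of encoding a URL as a character-level binary image and measuring the pixel-wise RMSE against a reconstruction. The alphabet, the fixed length of 16, and the "blurred" reconstruction are hypothetical stand-ins chosen for illustration; the encoding and the autoencoder output used in this study may differ.

```python
import math
import string

# Hypothetical alphabet and length; the actual encoding in this study may differ.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
MAX_LEN = 16


def one_hot_url(url, alphabet=ALPHABET, max_len=MAX_LEN):
    """Encode a URL as a max_len x len(alphabet) binary matrix (one row per character)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0.0] * len(alphabet) for _ in range(max_len)]
    for row, ch in enumerate(url.lower()[:max_len]):
        col = index.get(ch)
        if col is not None:  # characters outside the alphabet stay all-zero
            matrix[row][col] = 1.0
    return matrix


def rmse(original, reconstructed):
    """Root mean square error over all cells of two equally shaped matrices."""
    squared = [
        (a - b) ** 2
        for row_a, row_b in zip(original, reconstructed)
        for a, b in zip(row_a, row_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# A made-up "blurred" reconstruction standing in for an autoencoder output.
original = one_hot_url("example.com/login")
blurred = [[0.8 * value for value in row] for row in original]
reconstruction_error = rmse(original, blurred)
```

A perfect reconstruction gives an RMSE of zero, and any attenuation or smearing of the white dots increases it, which is the behavior relied upon when the reconstruction error is used as an anomaly score.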

Discussions
We compared the supervised approach and the proposed autoencoder-based anomaly detection approach in Figure 9. The deep-learning-based URL classifier, which has achieved the highest performance so far, as described in Figure 9a, focuses on minimizing the classification errors to learn the parameter θ, defined as the set of weights of a neural network. On the other hand, in the proposed method described in Figure 9b, there is an explicit step of modeling a benign URL before classification. Considering that the autoencoder learns the encoding/decoding operation to reconstruct its input, the reconstruction performance degrades for inputs with different distributions (mainly phishing URLs) after learning with only the benign URLs.
Figure 9. Anomaly detection approach that can construct a template for benign URLs and measure the abnormal score for a phishing attack based on the abnormal score measured by the autoencoder.
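The anomaly detection step can be sketched as a simple decision rule on the reconstruction error. In this sketch the threshold is taken as an empirical quantile of the errors observed on benign URLs, which is an assumption made for illustration; the proposed method learns its thresholding function, and the names `fit_threshold` and `is_phishing` as well as the toy error values are hypothetical.

```python
import math


def fit_threshold(benign_scores, quantile=0.95):
    """Pick the threshold as an empirical quantile of benign reconstruction errors."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ordered = sorted(benign_scores)
    # Nearest-rank quantile: smallest value with at least `quantile`
    # of the benign scores at or below it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def is_phishing(score, threshold):
    """Flag a URL as phishing when its reconstruction error exceeds the threshold."""
    return score > threshold


# Toy reconstruction errors, made up for illustration only.
benign_errors = [0.02, 0.03, 0.05, 0.04]
threshold = fit_threshold(benign_errors, quantile=1.0)
flags = [is_phishing(score, threshold) for score in (0.03, 0.20)]
```

Because only benign URLs are needed to set the threshold, a rule of this kind does not depend on having observed the phishing pattern beforehand, which is the property that matters in the zero-day setting.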

We visualized the decision boundary mentioned in Section 1 to understand the pros and cons of the proposed method. The deep URL representation generated from the hidden layer of the convolutional autoencoder was mapped into a two-dimensional space using the t-SNE algorithm [42], as depicted in Figure 10, and the main misclassified cases were extracted from the top-right area, where the classifier is confused. Correctly classified cases at the top and bottom of both sides were also extracted and are listed in Table 6. We confirmed that the correctly classified cases fully support the research hypothesis that the syntactics of the sequence of characters in a URL should be exploited. For example, a phishing URL composed of a sequence of random characters produced a significantly increased anomaly score, whereas a case that fits the benign URL distribution produced a low anomaly score, as expected.
However, a benign URL is misclassified as phishing when a long and noisy sequence of characters is observed, and several readable phishing URLs are misclassified as benign. These misclassified cases suggest that additional URL features remain to be utilized, although the proposed method achieves the best performance among the deep models. Considering that CNNs were used in parallel in the comparative study to model not only character-level but also word-level URL features, this limitation can be handled by extending the proposed method with an additional convolution operation for the full utilization of URL characteristics.

Concluding Remarks
In this study, we proposed a character-level convolutional autoencoder based on the anomaly detection framework to overcome the two difficulties of phishing URL detection. The main innovation of this study is the introduction of deep anomaly detection to the field of phishing URL detection and achieving the best performance compared to classification-based deep-learning methods by implementing a neural network structure and an operation optimized for URL modeling. The combination of the encoding/decoding structure to facilitate disentanglement between classes and convolution operation optimized for character-level URL characteristics was utilized to define an anomaly score based on the reconstruction error.
The limitation of the proposed methodology is that it was optimized for character-level features among the various features constituting URLs. We discussed that the confusion of the character-level features is the main cause of the performance degradation of the proposed method. Considering the structure of the web address consisting of domains and subdomains, additional performance improvements can be expected by utilizing the word-level features, including the typos and the keywords listed in the blacklist.
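As a hedged illustration of what word-level URL features could look like, the sketch below splits a URL into word tokens from its host, path, and query and checks them against a keyword list. The keyword set, the tokenization rule, and the helper names are assumptions made for this example; they are not part of the proposed method or of any blacklist used in this study.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword list for illustration only; not taken from this study.
SUSPICIOUS_KEYWORDS = {"login", "verify", "secure", "account", "update"}


def word_tokens(url):
    """Split a URL into lowercase word-level tokens from its host, path, and query."""
    # urlsplit only fills in the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def keyword_hits(url, keywords=SUSPICIOUS_KEYWORDS):
    """Return the tokens of the URL that appear in the keyword list, in order."""
    return [tok for tok in word_tokens(url) if tok in keywords]


example_tokens = word_tokens("http://secure-login.example.com/account/verify?id=42")
example_hits = keyword_hits("http://secure-login.example.com/account/verify?id=42")
```

Tokens of this kind could be fed to a word-level convolution alongside the character-level input, and detecting typos would additionally require a comparison against known brand names, which is not shown here.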
In a future study, we can consider the additional exploitation of URL features to improve the detection performance. At the same time, an additional convolution operation that utilizes both the character- and word-level URL features is required to fully exploit URL features. We also suggest exploring a plausible solution to zero-day attacks, which can be expressed as an out-of-distribution issue. Considering that the features that are not exposed in the dataset can be modeled from the external knowledge of domain experts, it would be promising to introduce a symbolic AI approach that leverages the detection rules based on the domain knowledge into the field of phishing URL detection.