1. Introduction
Network security threats continue to evolve alongside economic development, and the methods of network attack keep escalating. The network security market is expected to recover gradually in 2021. The ‘2021 China Network Security Report’ released by Ruixing Company provides a detailed analysis of malware, malicious URLs, mobile security, enterprise security, and other fields. The report points out that network attacks are becoming more and more frequent, and that maintaining network security and defending against network attacks have become a top priority. Many malicious URLs are now mixed in with the URLs that users access in their daily network activities. As technology continues to advance, malicious URLs have come to closely resemble benign URLs, so identifying malicious URLs correctly and efficiently is particularly important.
Researchers have used a variety of benchmark models for malicious URL detection. Abdulhamit Subasi et al. [1] compared the accuracy of KNN, SVM, CART, C4.5, LADTree, and NBTree on a data set of URL instances and verified, through integration technology, that the classification results of the support vector machine and KNN are more accurate. Anjali B. Sayamber et al. [2] proposed a method for the automatic classification and detection of malicious URLs using a naive Bayesian classifier. Across a wide range of benchmark data sets, the naive Bayesian model, which learns a probabilistic model, achieved better accuracy than the support vector machine model. Liu Jian et al. [3] proposed a multi-level filtering detection model that uses machine learning algorithms to identify malicious URLs. By training a key threshold for each classifier, each classifier judges only the URLs it handles well, which plays to its strengths. URLs that none of the classifiers handles well are decided by a vote among multiple classifiers. Compared with the naive Bayes, decision tree, and SVM models, this method improves the accuracy of detecting malicious URLs. Vara Vundavalli [4] used logistic regression, neural networks, and different types of naive Bayes algorithms to classify malicious URLs, and naive Bayes obtained the better results. Sheikh Shah Mohammad Motiur Rahman et al. [5] evaluated the performance of various machine learning classifiers for identifying phishing URLs using the AUC-ROC curve, accuracy, misclassification rate, and mean absolute error. The best AUC was obtained by the random forest and the multi-layer perceptron, respectively, and stacking generalization achieved higher accuracy on the binary classification and multi-class feature sets. Li Zeyu et al. [6] studied the effect of various machine learning models, especially ensemble learning models, on malicious URL recognition. Using several indicators, such as recall rate, accuracy rate, and AUC value, they verified that the ensemble learning method is superior to traditional machine learning models and that the random forest performs best. Thuy Thi Thanh Pham et al. [7] used three different deep neural networks, CNN, LSTM, and CNN-LSTM, to detect malicious URLs, but did not compare different numbers of hidden layers and neurons in the experiment to find the optimal network parameters. Chen Z et al. [8] proposed an improved multi-layer recursive convolutional neural network model based on the YOLO algorithm to detect malicious URLs. Compared with Text-RCNN, BRNN, and other models, this method has higher accuracy. In the experiment, however, truncation is used to standardize the length of all URLs, which inevitably loses some information for longer URLs. Beyond the choice of benchmark machine learning model, the recognition rate of malicious URLs can also be improved through better feature engineering. Tie Li et al. [9] proposed a feature engineering method that combines linear and nonlinear spatial transformations. New features are generated by five spatial transformation models and applied to the classifier, which significantly improves the recognition rate of KNN, the linear support vector machine, and the multi-layer perceptron.
In current research, most malicious URL detection methods extract features from the content of URLs. Kumi S et al. [10] combined classification with association rules and proposed a data mining algorithm based on associative classification, which uses URL and web content features to detect malicious URLs. A. Saleem Raja et al. [11] proposed a weighted method that uses only URL lexical features, extracts a small number of features for malicious URL detection, and evaluates machine learning algorithms in terms of performance and execution time. The random forest and KNN algorithms obtained good results. Apoorva Joshi et al. [12] extracted static lexical features from URL strings, detected malicious URLs with an ensemble classification method, and concluded that a purely lexical approach can produce fast, real-time decisions on URLs in lightweight systems. Chen Kang et al. [13] proposed a detection method based entirely on lexical features, in which a convolutional neural network extracts features from the URL strings and classifies them, obtaining a better classification result. Yuan J T et al. [14] proposed a joint neural network model of Bi-IndRNN and CapsNet to identify and detect malicious URLs. The model extracts vector features and texture features at the character and word levels and fuses them. Classification by the joint neural network effectively improves the efficiency and accuracy of malicious URL detection. H Le et al. [15] proposed URLNet, a CNN-based deep neural network for malicious URL detection. To address the reliance on manual feature engineering and the inability to handle features not seen in test URLs, they proposed a character-level CNN and a word-level CNN and optimized the network jointly.
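To make the notion of URL lexical features concrete, the following is a minimal Python sketch of a feature extractor. The function name `lexical_features` and the particular features chosen here are illustrative assumptions; they are not the feature sets used in any of the cited works.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    The feature list is hypothetical and chosen only for illustration.
    """
    # urlparse only fills in the host when a scheme is present,
    # so a default scheme is added for bare URLs such as "example.com/a".
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    path = parsed.path
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        # Number of non-empty path segments, e.g. /a/b.php has depth 2.
        "path_depth": len([seg for seg in path.split("/") if seg]),
        # Hosts written as dotted IPv4 addresses instead of domain names.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "has_at_symbol": "@" in url,
    }


if __name__ == "__main__":
    for example in ("http://192.168.0.1/login/update.php", "example.com/a-b"):
        print(example, lexical_features(example))
```

Because such features are computed from the string alone, without fetching page content, they suit the lightweight, fast detection setting described above.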
In addition, many researchers identify URLs by extracting texture features and structural features. Yuan J et al. [16] proposed a parallel neural joint model algorithm for the analysis and detection of malicious URLs, using CapsNet to combine texture features with text features and capture multi-modal vectors. Zhao Gang et al. [17] proposed an intelligent decision tree detection method for two-dimensional code URLs. The method identifies regular URLs and two-dimensional code malicious URLs with relatively good accuracy, but the selected data volume is small, and the model's generalization ability is not well trained. Liu C et al. [18] proposed a statistical method to study the character features of malicious URLs and obtain character distribution features and structural features. The extracted features were combined with a random forest whose parameters were tuned, and they obtained better performance than the original features. Lin Helen et al. [19] proposed an efficient method for detecting malicious URLs based on segment patterns. The method parses the three semantic segments of each labeled malicious URL (domain name, path name, and file name), uses an inverted index with these triples as terms to quickly compute the pattern of each semantic segment, and finally determines the segment pattern according to the inverted index. However, features other than the content of the domain name, path, and file name, such as the IP address, were not studied. Gabriel AD et al. [20] proposed a system for analyzing URLs in network traffic. Each correctly classified URL is reused as part of a new data set, and different clustering techniques are used to identify missing features of malicious URLs. With the OSC perceptron optimization algorithm, the final recognition accuracy is improved, and the overall detection rate is increased to 82.9%.
Based on the above research background, and in order to better identify malicious URLs and improve detection accuracy, this paper makes use of some non-obvious features in the URL text. Based on the grammatical features, structural features, and probabilistic features of URLs, a convolutional neural network model optimized by a genetic algorithm is proposed. A high-dimensional feature space consumes a great deal of training time, and it may also introduce interfering information into the training results. Therefore, the genetic algorithm is first used to reduce the dimension of the URL features by searching for a feature subset with a good classification effect. The feature subset is then classified by a convolutional neural network. Finally, compared with traditional machine learning methods, the accuracy of malicious URL recognition is improved.
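The feature-selection step described above can be sketched as a generic genetic algorithm over bit masks. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), and all parameter values are assumed, and the toy fitness function merely stands in for the classification performance that a real fitness function would measure.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a feature mask (1 = keep, 0 = drop) that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick two individuals at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation with a small per-gene probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Toy stand-in for classifier accuracy (an assumption for this sketch):
# features 0, 3 and 5 are "informative", and every selected feature
# incurs a small cost, so compact masks are preferred.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


if __name__ == "__main__":
    mask = genetic_feature_selection(10, toy_fitness, seed=1)
    print("selected features:", [i for i, bit in enumerate(mask) if bit])
```

In practice, `fitness` would train and validate a classifier on the columns selected by the mask, which is where most of the computation time is spent.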
4. Conclusions and Future Research
In order to improve the accuracy of malicious URL recognition, this paper proposes a convolutional neural network model based on a genetic algorithm. First, by comparing malicious URLs with benign URLs, the grammatical features, structural features, and probabilistic features in the URL text are extracted. The genetic algorithm then reduces the dimension of the features to a final set of 20 features, and the convolutional neural network classifies URLs using these 20 features. Compared with traditional machine learning algorithms, the recognition rate of malicious URL detection is improved. The experimental results show that the accuracy of the model in malicious URL classification is 93.99%, which achieves the expected classification effect.
The proposed model has three advantages in malicious URL detection. First, by extracting non-obvious features from the URL, it achieves a better effect than directly detecting the entire URL text content. Second, the genetic algorithm reduces the dimension of the extracted features and removes redundant features, which reduces the computational overhead. Third, the convolutional neural network improves the classification effect while sharing parameters in the convolutional layer, which reduces the number of parameters.
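The effect of parameter sharing can be illustrated with a short parameter count. Only the input length of 20 features comes from this paper; the kernel size and number of filters below are assumed values for illustration and are not the architecture used in the experiments.

```python
def conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a 1D convolutional layer (shared across positions)."""
    return kernel_size * in_channels * out_channels + out_channels


def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


n_features = 20   # length of the selected feature vector (from the paper)
kernel_size = 3   # assumed for illustration
filters = 16      # assumed for illustration

# Number of positions the kernel visits without padding.
out_len = n_features - kernel_size + 1

conv = conv1d_params(kernel_size, in_channels=1, out_channels=filters)
# A dense layer that produces the same number of output values.
dense = dense_params(n_features, out_len * filters)

print(conv)   # 64
print(dense)  # 6048
```

Under these assumptions, the convolutional layer needs 64 parameters where a fully connected layer with the same output size needs 6048, because the same kernel weights are reused at every position.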
However, the experiment still has some limitations. The genetic algorithm can obtain the globally optimal feature subset and effectively improve the accuracy of malicious URL recognition, but it uses many parameters, requires a large amount of computation, and makes the experiment take a long time. As a result, the model cannot detect malicious URLs in real time. To address these problems, our future work will focus on optimizing the model to reduce the feature vector space and the time complexity, and on exploring methods for detecting malicious URLs in real time.