1. Introduction
With the growing number of mobile devices, web applications, and users, as well as the rapid evolution of computing, cyber-security has become critical in recent years [1]. Easy access to computer networks and Internet applications, the availability of high-speed networks, and major industrial advancements, such as the development of 4G and 5G technologies, have comprehensively expanded Internet usage worldwide [2]. In particular, as a result of recent improvements in information technology and digitalization, many enterprises and companies have shifted their businesses from the physical to the digital domain, employing mobile and web applications to decrease physical contact [3,4]. However, this growth in Internet applications and websites has made them significantly more open and accessible to the rest of the world, which comes with a significant security risk and the potential leakage of private data.
Because of the easy access to the Internet, a person’s digital visibility has increased extensively, creating opportunities for digital thieves and hackers to gain access to private data and credentials. Detecting malicious integrity attacks and protecting transmitted data from eavesdropping attacks are critical to prevent user information from being manipulated or destroyed in an integrated data-driven framework [5]. Such cyber-security breaches lead to severe mental stress and financial losses over time [6]. Malicious websites are a means by which hackers can collect private information from unsuspecting Internet consumers. Malicious websites normally appear to be benign websites and request private information, such as passwords, usernames, and credit card information, or gain access to personal images or crucial data. Such data are stored in some form of data storage and might be exploited to achieve the attackers’ goals, such as criminal money transfers, online shopping, harassing the user, or blackmail. Using forged webpages identical to genuine webpages is a common method of stealing users’ data and private information [7].
One method of ensuring user security is to use classification techniques to determine whether an accessed website is malicious [8,9]. A correct classification ensures that the user is warned not to enter his/her data and important information on suspect websites [10,11]. Recently, machine learning methods have demonstrated exceptional results in data classification [12,13]. They are not only used for malicious website and attack detection, but are also applied in numerous identification, classification, and prediction tasks in a variety of fields, such as stock prediction [14], weather prediction [15], and image processing [16]. The machine learning field has shown much progress and promise, with large amounts of data becoming available, the development of advanced computational models, and advancements in computing power. Traditional machine learning methods begin with building a new model architecture, followed by the initialization of its parameters. The model is then trained on the training set to learn the relationship between the features of the inputs and the actual class outputs. After that, the model can be validated during the training stage by computing its results on a validation set, which is known as the validation stage. A trained model is then tested on unseen instances, and the classification results are obtained to assess its performance. Support Vector Machine (SVM) [17], K-Nearest Neighbors (KNN) [18], Decision Tree (DT) [19], Naïve Bayes (NB) [20], and Logistic Regression (LR) [21] are some common machine learning methods.
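For illustration, the train/validate/test workflow described above can be sketched with scikit-learn. This is a minimal sketch on synthetic placeholder data with an arbitrarily chosen decision-tree classifier, not the pipeline used later in this paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic placeholder data standing in for website features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set, then carve a validation set out of the training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_tr, y_tr)                                     # training stage
val_acc = accuracy_score(y_val, model.predict(X_val))     # validation stage
test_acc = accuracy_score(y_test, model.predict(X_test))  # testing on unseen data
```

The validation score guides model selection during training, while the test score assesses performance on unseen instances.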
In the literature, evaluation metrics such as precision, accuracy, recall, and F1-score are applied to evaluate machine learning outputs [22]. These evaluation metrics are mathematically computed from confusion matrices, in which true positives, false positives, true negatives, and false negatives are counted. A defined formula using the ratio of the two true and two false counts can then be used to assess a model’s performance on a particular dataset. One of the main issues encountered in machine learning data classification in general [23,24], and in malicious website classification specifically [25], is the class-imbalanced dataset. In the class-imbalance problem, one class or a group of classes contains more instances than another class or group of classes. This means that, in the training stage, the machine learning method becomes biased toward the class or classes with a large number of instances. This can produce a high result for the dominating class or classes, while the results of other classes are largely disregarded. A number of data-balancing strategies have been suggested in the literature. Some common strategies are under-sampling, SMOTE, and over-sampling, which have been used for imbalanced data classification in general [26], and for malicious website classification specifically [25]. However, the biases in the original dataset of malicious websites might persist even after the data are balanced. Moreover, quality assurance in data balancing is expensive work, and finding an effective data-balancing method is a challenging task, as it must prevent overlap between the synthesized samples of classes [27].
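As a brief sketch, the metrics above can be computed directly from the four confusion-matrix counts; the counts below are illustrative only, not taken from any experiment in this paper:

```python
def metrics_from_confusion(tp, fp, tn, fn):
    """Compute standard evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only (not from this paper's experiments)
acc, prec, rec, f1 = metrics_from_confusion(tp=90, fp=10, tn=85, fn=15)
```

Note that, on a class-imbalanced set, accuracy alone can look high while the minority-class recall is poor, which is why the F1-score is also reported.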
In this research work, we propose a decision-fusion-based approach that builds a decision-fusion ensemble learning (DFEL) model by combining the decisions of three effective ensemble classifiers, namely, a gradient boosting (GB) classifier, an extreme gradient boosting (XGB) classifier, and a random forest (RF) classifier. The approach exploits the advantages of these classifiers when used for class-imbalanced data and data with statistical noise, as in the case of malicious website detection. We utilize a weighted majority-voting rule to generate the final decisions of the developed model. The main contributions of the proposed approach can be summarized as follows:
Improving the accuracy of malicious website detection by exploiting the diversity of boosting (i.e., GB and XGB) and bagging (i.e., RF) techniques. In boosting, the approach creates sequential models by combining weak learners into strong learners, where the final model has the highest accuracy. In bagging, the approach creates different training subsets from the training set by sampling with replacement, and the output of the final model is based on majority voting.
Reducing the class-imbalance and over-fitting problems in malicious website classification through the regularization ability of the GB, XGB, and RF classifiers.
Proposing a weighted soft voting rule to fuse the final classification scores utilizing the competence of well-calibrated and diverse classifiers such as the base classifiers in the approach. Furthermore, evaluating and comparing the accuracy of the proposed DFEL model with its base classifiers and some recent related work.
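The weighted soft-voting rule can be sketched as a weighted average of the base classifiers’ per-class probability outputs followed by an arg-max. The probabilities and weights below are hypothetical placeholders, not values learned in this work:

```python
import numpy as np

# Hypothetical class-probability outputs of the three base classifiers
# for one test instance: [P(malicious), P(benign)]
p_gb = np.array([0.70, 0.30])   # gradient boosting
p_xgb = np.array([0.60, 0.40])  # extreme gradient boosting
p_rf = np.array([0.55, 0.45])   # random forest

# Hypothetical weights reflecting each base classifier's competence
weights = np.array([0.40, 0.35, 0.25])

# Weighted soft vote: fuse the probabilities, then pick the arg-max class
fused = weights @ np.vstack([p_gb, p_xgb, p_rf])
label = int(np.argmax(fused))  # 0 = malicious, 1 = benign
```

Because the weights sum to one, the fused vector remains a valid probability distribution over the classes.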
The rest of the paper is organized as follows: Section 2 presents the methods and approaches of the related work. Section 3 offers an explanation of the materials and methods used in the research work, including a description of the dataset, the methods of the proposed approach, and the evaluation metrics. The experiments and results, along with a discussion, are introduced in Section 4. Section 5 summarizes the conclusions and future work.
2. Related Work
Singhal et al. [28] classified malicious and benign websites using supervised machine learning classifiers such as decision trees, random forest, a deep neural network, and gradient boosting. First, the authors gathered the URLs. Then, they extracted host-based, lexical-based, and content-based features from the malicious and benign websites. These features were used as inputs to the machine learning models. The authors generated the lexical features by choosing the host length, URL length, path length, host token count, and some other symbols. Similarly, the host-based features, namely the autonomous system number (ASN) and location, were obtained from the URL. The authors chose applet count, HTTPS-enabled, the Eval function, redirection, XMLHttpRequest, the unescape function, and popups as the content-based features. The authors obtained the malicious websites from PhishTank’s public blacklist. The dataset contains a total of 80,000 unique, balanced URLs. The features were extracted after the data were collected. To compare the different classifiers, the same measures (precision, accuracy, F1-score, and recall) were used to quantify the classifier results on this dataset. Using the gradient-boosting method, the authors achieved the best accuracy result of 96.4%.
Amrutkar et al. [29] created an analysis technique, named kAYO, to distinguish between benign and malicious mobile websites based on static features. For classification, their method makes use of the static features of a website. The authors applied the proposed method to a large, labeled dataset of 350,000 benign and malicious mobile websites and attained 90% accuracy. They also created a browser extension for their proposed technique, in which kAYO runs in the backend to determine whether selected webpages are benign or malicious.
McGahagan et al. [30] investigated the relationship between the number of features extracted from HTTP headers and the likelihood of malicious webpage detection. They examined the HTTP headers of 6021 malicious websites and 39,853 benign websites. A total of 672 features were extracted from the HTTP headers, from which the authors selected 22 features for further analysis; 11 of these features had been considered in previous research and the remaining 11 were introduced in the authors’ work. Three of the twenty-two features contained 80% of the total feature importance. The authors conducted a principal component analysis (PCA) of the extracted features and used eight classifiers to improve the detection rate. They found that the 22 features attained a better accuracy result.
A hybrid approach was used by Patil et al. [31] to find malicious URLs. In the hybrid approach, they combined static and dynamic features; the static features were extracted using a static method and the dynamic features using a dynamic method. A total of 117 features were extracted, 44 of which were new. The dataset used in their study comprised 52,082 samples. The training data contained a total of 40,082 instances: 20,041 in the malicious class and 20,041 in the benign class, demonstrating that the study’s training dataset was balanced. The authors used six machine learning methods, including simple CART, DT, RF, ADTree, random tree (RT), and REPTree, to assess the effectiveness of their approach.
A one-dimensional convolutional neural network (1D-CNN) architecture was proposed by Al-milli et al. [32] to detect malicious URLs. The experiment was conducted by the authors using a benchmark dataset and the receiver operating characteristic (ROC) curve along with accuracy evaluation metrics. The authors’ dataset comprised 2456 records with 30 features. A total of 70% of the dataset was used for training, and 30% was utilized for testing. They built a CNN architecture containing 64 filters and a kernel size of 16, trained for 500 and 2000 epochs. The authors’ model achieved a 91.23% area under the curve (AUC) and 94.31% accuracy.
A two-step method for the detection of benign and malicious URLs was introduced by Jayakanthan et al. [33]. An algorithm called “enhanced probing classification of malicious URLs (EPCMU)” is used as the first step, and an NB classifier is used as the second step to find malicious URLs. The first step involves the detection task and the second is used for classification. The input URLs are thoroughly examined by the EPCMU. The system flags the URLs as malicious if they exhibit any characteristics of a malicious website or appear on blacklisted websites. Otherwise, more checks are performed. In the EPCMU, a collection of URLs serves as input to the NB classifier during the classification stage, which determines whether the URLs in the set are genuine or malicious.
An auto-encoder model was used by Assefa et al. [34] to differentiate between benign and malicious websites. Three layers (input, hidden, and output) make up the structure of the auto-encoder model. The data of legitimate webpages were gathered from the dataset created by the Canadian Institute for Cybersecurity, while the phishing webpage data were gathered from the dataset generated by the open-source PhishTank. The final dataset contained a total of 16 features and 10,000 instances. The authors cleaned the missing extracted values in the preprocessing stage. The effectiveness of the auto-encoder-based model was compared to DT and SVM. The authors’ model attained 91.24% accuracy, while the DT and SVM methods delivered accuracies of 86.1% and 88.4%, respectively.
The contributions and results of important related research have confirmed the ability and applicability of these methods and techniques for detecting benign and malicious websites. Recently, Hassan et al. [25] proposed an approach using DT, RF, SVC, LR, and Stochastic Gradient Descent (SGD) classifiers and achieved 94.19% accuracy on imbalanced datasets. The data-balancing results in [25] are discarded here because the authors balanced the dataset before splitting it into training and test sets. This means that the distribution of augmented instances in the test set is almost the same as the distribution of original instances in the training set, which inflates the accuracy. Singhal et al. [28] applied RF, GB, DT, and deep neural network (DNN) methods to obtain an accuracy of up to 96.4%. Amrutkar et al. [29] achieved 90% accuracy using the kAYO technique. Adaptive Boosting (AB), Extra Trees (ET), RF, GB, Bagging Classifier (BC), LR, and k-NN were used by McGahagan et al. [30] to obtain an accuracy of up to 89%. Al-milli et al. [32] proposed an approach using a 1D-CNN to attain 94.31% accuracy. Assefa et al. [34] used an auto-encoder, DT, and SVM to attain 91.24% accuracy. Sandag et al. [35] applied the k-NN method to website features and attained 95% accuracy. Alkhudair et al. [36] and Panischev et al. [37] proposed methods using RF, obtaining 95% accuracy in both studies. Labhsetwar et al. [38] achieved 92% accuracy using an RF classifier. Singh et al. [39] proposed a multilayer CNN and attained an accuracy of 91%. Aljabri et al. [40] used an NB classifier and obtained 96% accuracy. Utku and Can [41] proposed an approach using LightGBM, DT, SVM, k-NN, LR, a multilayer perceptron (MLP), RF, and XGB to gain 96% accuracy.
The results of previous studies show that the existing methods have several strengths, such as good performance on malicious website detection tasks. They are simple and can be regularized to decrease the chance of over-fitting. Their outputs can be interpreted easily, some of them do not need feature-value scaling, and they can be used for both linear and non-linear features. However, they are limited in their ability to produce an effectively trained model with reduced variance and bias and improved classification results. Moreover, a common limitation of the existing work is its limited ability to reduce the effect of class imbalance and highly correlated features on the accuracy and performance of malicious website classification. These limitations remain a research gap that needs to be addressed.
4. Experiments and Results
This section validates the proposed approach to detecting malicious websites. The approach classifies websites as malicious or benign based on the developed decision-fusion-based ensemble classification model. The experiments use the benchmark dataset and evaluation metrics explained in the previous subsections. Two experiments are conducted to validate the research work. The first experiment is executed using holdout evaluation, in which two evaluation methods are used: evaluation method one (evaluation method 1) divides the dataset into a training set (70%) and a test set (30%), and evaluation method two (evaluation method 2) executes the training and validation processes for 15 runs with 15 splits. In each split, the models are trained on a different number of training examples and validated on 10% of the dataset. The second experiment uses a 10-fold cross-validation technique, which divides the dataset into 10 sets. The training process is conducted for 10 runs, and in each run, one set is used to test the base and developed models. In both experiments, the model parameters are initialized with their default values. The experiments are implemented in the Python programming language on a laptop with an Intel Core i7-8750H CPU @ 2.21 GHz, 32.0 GB of RAM, and a 64-bit Windows 11 operating system.
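The 10-fold routine of the second experiment can be sketched as follows. This is a minimal sketch using a synthetic imbalanced stand-in dataset and a single RF base model rather than the full DFEL model; stratified folds are assumed so that each fold preserves the malicious/benign ratio:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic stand-in for the malicious/benign dataset
X, y = make_classification(n_samples=1500, n_features=20,
                           weights=[0.87, 0.13], random_state=42)

# Stratified folds preserve the class ratio in each of the 10 splits;
# each run trains on 9 folds and tests on the held-out fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
mean_acc = scores.mean()
```

Averaging the ten per-fold scores gives the cross-validated accuracy reported for each model.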
Before training the ensemble models, we pre-process the dataset. We read it and replace the null values with zeroes so that all feature values are numeric when training the ensemble models. Then, we compute some basic statistics for the dataset to provide an overview of its feature values.
Figure 4 shows the uniqueness and frequency of the dataset feature values.
From Figure 4, we can see that the values of the URL feature are all unique, with very low frequency. This feature can create noise and decrease the accuracy, because the differences between its values will increase the distance of the model’s decision boundary during the training step. The non-numeric feature values, including the ‘Type’ class label, are also encoded into numbers using label encoding, because the classifiers can only learn from numeric feature values. After that, the correlation between features is analyzed to find highly correlated features.
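These pre-processing steps can be sketched with pandas on a toy frame; the column names and values below are placeholders standing in for the actual dataset:

```python
import pandas as pd

# Toy frame standing in for the website dataset (placeholder values)
df = pd.DataFrame({
    "URL": ["a.com", "b.net", "c.org"],
    "CONTENT_LENGTH": [263.0, None, 512.0],
    "Type": ["benign", "malicious", "malicious"],
})

df = df.fillna(0)                                     # nulls become numeric zeroes
df["Type"] = df["Type"].astype("category").cat.codes  # label-encode the class
df = df.drop(columns=["URL"])                         # drop the all-unique, noisy URL feature
```

After these steps, every remaining column is numeric and usable by the ensemble classifiers.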
Figure 5 provides a correlation heat map of the dataset features. The correlation heat map is a visual graphical representation of the relationship between features.
Figure 5 demonstrates how each feature is correlated with the others. It measures the strength of the relationship between every pair of feature variables. Understanding the correlation between features is useful because the value of one feature can be used to predict the value of the other. Correlated features indicate that, as the value of one feature changes, the other feature tends to change in a specific direction. As shown in Figure 5, ‘TCP_CONVERSATION_EXCHANGE’, ‘APP_PACKETS’, ‘SOURCE_APP_PACKETS’, ‘REMOTE_APP_PACKETS’, ‘APP_BYTES’, and ‘REMOTE_APP_BYTES’ are highly correlated features with a correlation score of one. In the next experiment, we check whether removing these highly correlated features is necessary for a more accurate classification.
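One common way to find and remove such features is to scan the upper triangle of the absolute correlation matrix and drop any column whose correlation with an earlier column exceeds a threshold. The toy data and the 0.95 threshold below are illustrative assumptions, not the exact procedure used in this paper:

```python
import numpy as np
import pandas as pd

# Toy data: the second column duplicates the first (correlation = 1)
df = pd.DataFrame({
    "TCP_CONVERSATION_EXCHANGE": [1, 2, 3, 4],
    "APP_PACKETS":               [1, 2, 3, 4],
    "DIST_REMOTE_TCP_PORT":      [4, 1, 3, 2],
})

corr = df.corr().abs()
# Keep only the strictly upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```

Scanning the upper triangle keeps one representative of each highly correlated group and drops its duplicates.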
4.1. Results of First Experiment
In the first experiment, we randomly select a subset of the dataset to test the models. Two evaluation methods are applied to obtain the results of this experiment. The following subsections explain the evaluation methods, along with their outcomes, in detail.
4.1.1. Evaluation Method 1
This evaluation method randomly divides 30% of the dataset to test the models, and the remaining 70% is utilized for training.
Figure 6 presents the number of instances of both malicious and benign classes in the training and test sets.
As shown in Table 2, the distribution of instances in the training set is imbalanced, as 87.08% of the instances are malicious and 12.92% are benign.
After building the DFEL model on the training set without the highly correlated features, it is tested on the test set.
Figure 7 displays the confusion matrix of the classification results. The true positive (TP) and true negative (TN) instances are colored with a light green color.
From Figure 7, we can see that the model correctly classifies 477 malicious instances out of 480 and 50 benign instances out of 55. Based on the correctly classified instances, Table 3 provides the recall, precision, F1-score, and accuracy results for classifying the malicious and benign classes.
As shown in Table 3, the proposed model achieves notable classification results, with an accuracy of 98.50%. Since the dataset classes are imbalanced, the accuracy metric alone is not enough for evaluation; hence, the F1-score is taken as another metric. In Table 3, we can see that the weighted avg. F1-score is 98.49%. The accuracy and F1-score results prove the model’s capability and effectiveness in alleviating the class-imbalance problem produced by the dominant malicious class label.
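These reported figures follow directly from the confusion-matrix counts in Figure 7 (477 of 480 malicious and 50 of 55 benign instances classified correctly), which a short calculation confirms:

```python
# Confusion-matrix counts from Figure 7 (positive class = malicious)
tp, fn = 477, 3  # malicious instances: 477 correct, 3 missed
tn, fp = 50, 5   # benign instances: 50 correct, 5 flagged as malicious

accuracy = (tp + tn) / (tp + fn + tn + fp)  # 527 / 535 ~= 0.9850

# Per-class F1 via F1 = 2*TP / (2*TP + FP + FN)
f1_malicious = 2 * tp / (2 * tp + fp + fn)  # ~= 0.9917
f1_benign = 2 * tn / (2 * tn + fn + fp)     # ~= 0.9259

# Weighted average F1: weight each class F1 by its support (480 and 55)
weighted_f1 = (480 * f1_malicious + 55 * f1_benign) / 535  # ~= 0.9849
```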
To show the effect of highly correlated features, we train the DFEL model on the training set with all features, including the highly correlated features, and test it on the same test set.
Figure 8 visualizes the confusion matrix of the classification results. The numbers in the light green color on the confusion matrix are TP and TN instances.
From Figure 8, we can see that the highly correlated features decrease the number of TP instances to 476, compared with 477 when the model is trained without the highly correlated features. The number of TN instances is not affected. The results of the other evaluation metrics are given in Table 4.
In Table 4, the model is shown to achieve a classification result of 98.32% accuracy and a 98.31% weighted avg. F1-score. These results confirm that removing the highly correlated features is necessary for a more accurate classification. Moreover, decreasing the number of features increases the efficiency of the model.
Figure 9 shows the average classification time of all test sets in seconds for the model trained on the training set with and without highly correlated features.
In Figure 9, we can see that removing the highly correlated features decreases the average classification time over all test sets from 0.031 s to 0.016 s, improving the efficiency of the DFEL model in addition to its detection accuracy. Moreover, this average classification time confirms the applicability of the proposed model for real-time detection.
In addition, to analyze the ability of the proposed DFEL model compared with its base models individually at the 0.5 classification threshold, Figure 10 illustrates the ROC curves of the DFEL model and its base models.
From Figure 10, we can see that the DFEL model outperforms its base models and achieves a 0.95 AUC, compared with AUCs of 0.93, 0.94, and 0.93 for GB, XGB, and RF, respectively. The high AUC value of the DFEL model demonstrates how well the trained model can differentiate between the test set classes. This means that the built model can classify instances of the malicious class as malicious and instances of the benign class as benign.
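As a sketch, an AUC of this kind is computed from the fused malicious-class scores rather than the hard 0.5-threshold labels; the labels and scores below are hypothetical, not outputs of the DFEL model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical fused malicious-class scores and true labels (1 = malicious)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.95, 0.88, 0.91, 0.45, 0.30,
                    0.52, 0.77, 0.20, 0.85, 0.60])

# AUC: probability that a random malicious instance is scored
# higher than a random benign one
auc = roc_auc_score(y_true, y_score)
```

An AUC near 1.0 indicates that the model ranks almost all malicious instances above the benign ones across all thresholds, not just at 0.5.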
4.1.2. Evaluation Method 2
This evaluation method trains and validates the proposed decision fusion model and its base models 15 times with 15 splits. Each time, the models are trained on a different number of training examples and validated on 10% of the dataset without highly correlated features.
Figure 11 demonstrates the learning curves of the DFEL model and its base models. The training and validation scores are compared to the training data examples (data size).
From Figure 11, we can see that fewer than 600 training examples are not enough to allow the trained models to classify the validation set. In addition, we can see that the training accuracy scores for all models remain near the maximum, and the validation scores increase with more training examples. However, the gap between the training and validation accuracy scores of the base models is larger than that of the DFEL model for all training sets with more than 600 examples. This means that the performance of the DFEL model is better than that of its base classifier models.
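Learning curves of this kind can be produced with scikit-learn’s `learning_curve` helper; this sketch uses synthetic stand-in data and a single GB base model with an arbitrarily reduced estimator count, rather than the full 15-split setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Imbalanced synthetic stand-in for the website dataset
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.87, 0.13], random_state=42)

# Train on increasing fractions of the data; validate on held-out folds
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=50, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# Train/validation gap per training size: a shrinking gap with more
# data indicates reduced over-fitting
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```

Plotting the mean train and validation scores against `sizes` reproduces the curve shape discussed above.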
4.2. Results of Second Experiment
In this experiment, another evaluation experiment is conducted using a 10-fold cross-validation routine to compare the DFEL model with its base models (GB, XGB, and RF).
Table 5 demonstrates the average accuracy and F1 score of the ten-fold validation results for malicious and benign classification using GB, XGB, RF, and DFEL models trained on training folds without highly correlated features.
As seen in Table 5, the numbers highlighted in bold are the best evaluation results achieved using the 10-fold cross-validation technique. We can see that the proposed DFEL model outperforms its base models. It classifies the malicious and benign cases with a 97.20% weighted average F1 score and 97.25% average accuracy. In addition, the macro-average F1 scores for each model are lower than the weighted-average F1 scores because of the class-imbalance problem. The macro-average F1 score can provide a truer evaluation of a classifier model in a class-imbalanced evaluation task.
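The gap between the macro and weighted averages under class imbalance can be seen with a small numeric example; the per-class F1 scores below are hypothetical, chosen only to mirror the dataset’s class supports:

```python
# Hypothetical per-class F1 scores and supports (class counts)
f1_scores = {"malicious": 0.99, "benign": 0.85}
support = {"malicious": 480, "benign": 55}

# Macro average: every class counts equally, so weak minority-class
# performance pulls the score down and is clearly exposed
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: classes weighted by support, so the score is
# dominated by the majority (malicious) class
total = sum(support.values())
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total
```

Here the macro average falls well below the weighted average, which is why it is the more demanding metric for imbalanced classification.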
To confirm whether highly correlated features affect the DFEL’s performance results, we performed 10-fold cross-validation on the training set with highly correlated features.
Figure 12 visualizes the 10-fold cross-validation averaged accuracy and F1 score results of the proposed DFEL model trained on the training folds with and without highly correlated features.
From Figure 12, we can see that the DFEL model with the highly correlated features attains averages of 96.91% and 96%, instead of 97.25% and 97.2%, for its 10-fold accuracy and 10-fold weighted average F1 scores, respectively. Moreover, there is a significant difference in the average 10-fold macro-averaged F1 scores, where removing the highly correlated features improves the result from 90.4% to 93.1%. These outcomes also confirm that removing highly correlated features is necessary for a more accurate classification.
4.3. Comparison of Results with Related Work
In this subsection, we compare the accuracy results of this study with the results of the methods and techniques in some recent related studies.
Table 6 lists the obtained accuracy result of the proposed DFEL model compared with the accuracies of important studies in the literature review.
As shown in Table 6, our DFEL model results demonstrate a substantial improvement (98.50% vs. 96.4% in terms of accuracy) over the best result in the recent related work. Such notable performance in website classification suggests a possible application of the proposed DFEL model to help cyber-security technicians and researchers in the detection of malicious websites. Another advantage of the decision-fusion-based approach is the diversity of the GB, XGB, and RF models in reducing the effect of the over-fitting and class-imbalance problems during training.