An Effective and Secure Mechanism for Phishing Attacks Using a Machine Learning Approach

Abstract: Phishing is one of the biggest crimes in the world and involves the theft of the user’s sensitive data. Phishing websites usually target individuals’ websites, organizations, cloud storage sites, and government websites. Most users are unaware of phishing attacks while surfing the internet. Many existing anti-phishing approaches have failed to provide a useful solution to e-mail attacks. Currently, hardware-based approaches are used to face software attacks. Due to the rise in these kinds of problems, the proposed work focuses on a three-stage phishing attack mechanism that detects attacks precisely in a content-based manner. There are three inputs: uniform resource locators, web traffic, and web content, based on the features of phishing and non-phishing websites. To implement the proposed phishing attack mechanism, a dataset was collected from recent phishing cases. It was found that real phishing cases give a higher accuracy both on zero-day phishing attacks and in phishing attack detection. Three different classifiers were used to determine classification accuracy in detecting phishing, resulting in a classification accuracy of 95.18%, 85.45%, and 78.89% for NN, SVM, and RF, respectively. The results suggest that a machine learning approach is best for detecting phishing.


Introduction
In the modern era of networks, almost every industry uses the internet, and many different security attacks affect businesses. Phishing is among the main attacks. It is carried out through e-mail spoofing and look-alike webpages: the phisher sends spoofed e-mails and copies the design of a genuine website. Over the internet, the phisher can steal the user's personal information. Phishing is an offence that combines technical and social engineering techniques to steal user identity information and banking information. Phishing threats are enabled by user weakness and by the development of sophisticated mechanisms by phishers [1].
When working on the internet, users need to enter data such as personal information and banking information. Phishing attacks are used to steal this data, and they are increasing. Phishing websites look the same as the original websites [2]. A group named Anti-Phishing was formed to control phishing attacks, and a report of that group says that phishing activities are increasing. The main targets of phishers are the victim's e-mails, messages and phone calls. There are many kinds of phishing. In deceptive phishing, the attacker's focus is on the organization in which the employees work; here the URL is the main means of distinguishing genuine links from those of the scammer. Spear phishing is a kind of phishing in which attacks are conducted through e-mail after targeting an entity and collecting data about it, for example on Facebook. Victims are also targeted through e-mails by crafting a poisoning attack on DNS: through DNS, attackers can easily identify users' IP addresses and redirect users to malicious websites. A newer type of phishing is conducted through Dropbox, where attackers want to steal users' Dropbox files. Phishing attackers create a fake Dropbox signature, which is then passed to the users' Dropbox. Phishers can then easily steal files and users' credentials which are hosted on the website. Google Docs phishing was derived from Dropbox phishing, and through it attackers can easily target victims. Attackers target SaaS or webmail services with the theft of sensitive data as their primary goal. An anti-phishing simulator, into which e-mail addresses are integrated, was designed to prevent serious threats by catching malicious e-mails arriving in the system. This system also helps to evaluate keywords in the existing database and to determine the contents of the database.

Existing Issues of Phishing on Website
In 2011, Prevost et al. identified phishing heuristically, based on more than 25 webpage features. The main disadvantage of this work is that the method is not robust: it stops working when webpages change. It also requires more heuristic parameters [2]. In 2016, Moghimi et al. developed rule-based phishing detection techniques based on the parameters of a webpage and applied the rules to retrieve the hidden information in the webpage. The major drawback of this algorithm is that it is not reliable for identifying the phisher [3]. In 2010, Prakash et al. applied predictive blacklist techniques to identify and delete the correct blacklist using the PhishNet database. This algorithm fails to detect zero-day phishing [4].

Existing Issues of Phishing in E-Mail
In 2006, Lyon et al. proposed domain-level authentication for securely sending data by e-mail. The main drawback of this algorithm is that it works only when both the sender and receiver have the same device [5]. Chen et al. developed the LinkGuard algorithm, which is capable of detecting the actual domain of phishers and also the phisher's link URL. The main drawback of this algorithm is that it is applicable only to e-mail and not to detecting phishers on the web [6]. In 2009, Gansterer et al. applied a machine learning-based K-nearest neighbour approach to detect phishing based on the parameters of e-mail, and ranked those parameters. The main limitation is that it attained a high false-positive rate for spam-detecting filters [7,8].
A phishing attack is a dangerous threat. For example, at Delaware University, nearly 75 thousand people faced this problem as phishers stole the personal information of teachers, students, researchers and faculty through websites [9]. In 2001, the first online gold fraud was investigated. This was similar to a phisher sending spam mail to increase their network. Since the number of phishing attacks has increased, the government set up the working group Anti-Phishing and also implemented several laws for victims [10].

Proposed Research Key Features
The objective of this work is to evaluate the harmfulness of this problem and to offer better solutions to protect against phishing attacks.
• Phishing attacks may occur at any time. Thus, this work develops a mechanism based on database features.
• The proposed work focuses on recent databases, and performance can be evaluated based on parameters.
• According to the literature survey, most of the existing work is based on imbalanced mechanisms.
• Three different classifiers are used to determine classification accuracy in detecting phishing.
• The motivation for the current work is the increasing number of phishing attacks; we need to develop a computational automated methodology for detecting phishing. Figure 3 shows a year-wise online report of phishing attack incidents from 2014 to 2019.

• The motivation for this work is that there is still a lack of awareness regarding phishing threats.
• The main phishing crimes are the theft of banking details, such as CVV details and credit card information, through websites such as PayPal and eBay.
• Other phishing crimes include theft of personal data and capturing trade secrets and important documents.

Literature Survey
A literature survey related to the mechanism developed in this investigation is presented in this section. Table 1 summarizes the literature survey for detecting phishing. Figure 4 illustrates phishing attacks in the industry.
Table 1. Literature survey for detecting phishing.
Author | Technique | Description | Advantage | Limitation
(missing) | (missing) | Applying 7 different machine learning processes for the anti-detection process. | Easy to identify the words present in the URL by using NLP features. | To handle large datasets, machine learning will not be more effective.
Kim et al. (2017) [14] | Features of machine learning | Authentication on user and domain levels. | Communication of security can be increased. | The same technology should be used on the sender side and the receiver side.
Zhang et al. (2017) [15] | Neural Networks | A Neural Network was classified with the Monte Carlo algorithm. | Increases accuracy rate and stability detection. | The whole page has to be downloaded.

The Proposed Methodology
The phishing attack mechanism can be categorized into three categories:
• A DNS blacklist-based approach.
• A heuristic-based approach.
• A web crawler-based approach.
These approaches are subsequently used for feature extraction from phishing attacks. Figure 5 shows the proposed architecture of the phishing attack mechanism.

DNS Blacklist and Web Crawler
A DNS blacklist (domain name system blacklist) provides a list of Internet Protocol (IP) addresses that can easily be queried programmatically, for example from the browser. The DNS blacklist is built from source files on the internet and lists IP addresses that are used for spam purposes. The information in the DNS system is updated frequently. The web crawler starts from a website and follows its interconnected pages and links. Crawling from one website, the phishing attack mechanism goes through all the links in the web index; the proposed mechanism creates a crawl for each webpage of the website under inspection. Figure 6 shows the crawler for web indexing.
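The blacklist check and the link-following crawl described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `crawl`, the in-memory `LINKS` graph (which stands in for fetching and parsing real pages) and the host-name blacklist are all assumptions made for illustration. Real DNS blacklists are typically keyed by IP address and queried over DNS.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl(start_url, link_graph, blacklist):
    """Breadth-first crawl over a link graph.

    link_graph maps a URL to the URLs it links to (a stand-in for fetching pages).
    blacklist is a set of host names (an assumption; real DNSBLs list IP addresses).
    Returns (pages in visiting order, pages whose host is blacklisted).
    """
    visited, flagged = [], []
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        visited.append(url)
        if urlsplit(url).hostname in blacklist:
            flagged.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:  # visit every page only once
                seen.add(link)
                queue.append(link)
    return visited, flagged

# Hypothetical three-page web used only for this example.
LINKS = {
    "http://a.example/": ["http://a.example/login", "http://b.example/"],
    "http://a.example/login": ["http://a.example/"],
    "http://b.example/": [],
}
visited, flagged = crawl("http://a.example/", LINKS, {"b.example"})
```

Breadth-first order is only one reasonable choice; it visits pages close to the starting page first, which suits a crawl that begins at a suspected landing page.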

Heuristic Analysis and URL Analysis
Algorithm 1 details the working module of heuristic-based phishing detection. The heuristic analysis phase covers three features: URLs, web content, and web traffic. A URL is partitioned as follows: <protocol>://<subdomain>.<primary domain>.<TLD>/<path>. Algorithm 2 explains the working module of URL-based phishing detection.
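The URL partition can be sketched with Python's standard `urllib.parse` module. The function name `partition_url` and the dictionary keys are illustrative assumptions, and the sketch assumes a single-label TLD; suffixes such as "co.uk" and hosts given as raw IP addresses would need separate handling.

```python
from urllib.parse import urlsplit

def partition_url(url):
    """Split a URL into protocol, subdomain, primary domain, TLD and path."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if len(labels) < 2:
        raise ValueError("expected a host name with at least a domain and a TLD")
    return {
        "protocol": parts.scheme,
        "subdomain": ".".join(labels[:-2]),  # empty string when there is none
        "primary_domain": labels[-2],
        "tld": labels[-1],                   # assumes a single-label TLD
        "path": parts.path,
    }

example = partition_url("https://login.secure.example.com/account/verify")
```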

Web Content Analysis and Web Traffic Analyzer
The proposed phishing detection mechanism crawls through the website and checks the web page content and copyright information. When content is suspicious, the classifiers of the phishing detection mechanism send an alert message about that content. The web traffic analyzer provides parameters such as total visits to the site, pages viewed per visit, average visit duration per person, and the bounce rate. SiriReputation is calculated for the evaluated website from the links that other web pages make to it; PageRank is similar to this [20]. As with PageRank, SiriReputation values are lower for phishing websites than for legitimate sites.

A Machine Learning Approach for Detecting the Attacks
The datasets are collected from Alexa, Siri, and PhishTank and then passed to machine learning algorithms. A learning algorithm extracts useful information from training examples. Machine learning algorithms can be classified into supervised and unsupervised. A supervised learning algorithm learns from labelled samples, whereas an unsupervised learning algorithm learns from unlabelled samples. Initially, the classifier goes through a training phase that is used to build a decision model. The most important machine learning classifiers used for detecting phishing attacks are described below. Figure 7 shows the machine learning techniques for detecting phishing attacks.
Neural Network Algorithm (NN)
The NN algorithm consists of three layers: an input layer, a hidden layer and an output layer. The hidden layer processes the data and passes it to the output layer. Several attacks are found and recognized by the multi-layer perceptron algorithm. This algorithm is trained by the back-propagation technique, which combines a feed-forward pass with backward propagation of the error [21-23].
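The feed-forward and back-propagation steps can be sketched as follows. This is a toy multi-layer perceptron in plain Python, not the network used in the experiments: the class name `TinyMLP`, the layer sizes, the sigmoid activation, the squared-error loss, the learning rate and the four-row toy dataset are all assumptions chosen to keep the example short.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """One hidden layer, one sigmoid output, trained by back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        # Feed-forward pass: input -> hidden -> output.
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        # Backward pass for the loss 0.5 * (y - target)^2.
        delta_out = (y - target) * y * (1.0 - y)
        delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(self.w2, h)]
        for j, hi in enumerate(h):
            self.w2[j] -= lr * delta_out * hi
        self.b2 -= lr * delta_out
        for j, dj in enumerate(delta_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj

def mean_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)

# Hypothetical toy data: two binary URL flags; label 1 (phishing) if either is set.
DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = TinyMLP(n_in=2, n_hidden=3)
loss_before = mean_squared_error(net, DATA)
for _ in range(2000):
    for x, t in DATA:
        net.train_step(x, t)
loss_after = mean_squared_error(net, DATA)
```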

Support Vector Machine (SVM)
The SVM is used for prediction and classification; it finds the boundaries between classes in multi-dimensional space [24-26]. Its distinct data points can be divided into two classes, +1 and −1, using a hyperplane. Here, +1 denotes ordinary data and −1 denotes doubtful data.
The hyperplane can be written as Equation (1):
$$W \cdot X + b = 0 \qquad (1)$$
where $W = (w_1, w_2, \ldots, w_n)$ is the weight vector for the $n$ attribute values $X = (x_1, x_2, \ldots, x_n)$ and $b$ is a scalar. The Support Vector Machine aims to find the best linear hyperplane, so that the margin separating the two classes is maximized. The hyperplane with the maximum margin is treated as a good hyperplane. This machine classifies two classes, and multi-class classification is handled by developing an SVM for each pair of classes.
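A short sketch may help to show how a trained hyperplane is used. Given a weight vector and a bias, a sample is assigned to class +1 or −1 according to the sign of W · X + b. The function names and the numeric values are hypothetical, and the margin formula 2/‖W‖ assumes the usual scaling in which the support vectors satisfy W · X + b = ±1.

```python
import math

def decision_value(w, x, b):
    """Signed score W . X + b of a sample relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Return +1 (ordinary) or -1 (doubtful) depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def margin_width(w):
    """Width of the margin, 2 / ||W||, under the canonical scaling assumption."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
W, B = [3.0, 4.0], -5.0
label_a = classify(W, [3.0, 1.0], B)   # score 3*3 + 4*1 - 5 = 8
label_b = classify(W, [0.0, 0.0], B)   # score -5
width = margin_width(W)                # ||W|| = 5
```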

Random Forest
Random Forests are ensembles based on decision trees [27-29]. The computational methodology is the best method to classify phishing in phishing attack mechanisms. Figure 8 shows the extraction parameter in the URL and website. Figure 9 shows the extraction parameter in an e-mail. Table 2 shows the feature extraction and dataset creation.

Feature Extraction
Table 2. Feature extraction and dataset creation.
Feature | Description
Lengthy URL | Websites with a URL length greater than 1750 are likely phishing.
Symbol "-" | Domain names including "-" are considered legitimate URLs.
URL subdomain | URL subdomains are likely phishing.
HTTPS | This is considered a secure URL.
IP address used as the domain name | Hackers hide the number with their name when it is phishing.
URL request | URLs which consider all images and text together in the same domain.
Domain age | Websites created within the last year are likely phishing.
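Several of the URL features listed above can be computed directly from the URL string. The sketch below is illustrative only: the function name `extract_url_features`, the feature keys and the `domain_age_days` argument (which would have to come from an external source such as a WHOIS lookup) are assumptions. The URL request feature is omitted because it requires the page content. The sketch reports raw feature values and leaves their interpretation to the classifier.

```python
import ipaddress
from urllib.parse import urlsplit

def extract_url_features(url, domain_age_days=None, max_length=1750):
    """Return a dictionary of simple URL-based features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "long_url": len(url) > max_length,
        "hyphen_in_domain": "-" in host,
        # Labels in front of "<primary domain>.<TLD>"; assumes a single-label TLD.
        "subdomain_count": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "uses_https": parts.scheme == "https",
        "host_is_ip": host_is_ip,
        # None when the age is unknown; otherwise True if younger than one year.
        "young_domain": None if domain_age_days is None else domain_age_days < 365,
    }

ip_features = extract_url_features("http://192.168.0.1/login")
named_features = extract_url_features("https://secure-login.pay.example.com/", domain_age_days=30)
```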

For training, 200 samples with features and labels were divided into training and testing sets. Three classification algorithms are used to develop an accurate approach for detecting phishing. The performance of the algorithms was measured by using evaluation metrics on the test samples. A total of 70% of the data were used in the training stage. Testing and validation were carried out with the remaining 30% of the samples.
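The 70%/30% division of the 200 samples can be sketched as a shuffled split. The function name, the fixed seed and the use of integer identifiers in place of real feature vectors are assumptions for illustration; the source does not state how the samples were assigned to the two sets.

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples reproducibly and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# 200 placeholder sample identifiers standing in for the labelled dataset.
train_set, test_set = train_test_split(range(200))
```

With 200 samples this yields 140 training samples and 60 samples for testing and validation.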
Sensitivity is the ratio of correctly recognized phishing attacks to all phishing attacks. (1 − specificity) is another attribute of a classifier, which describes the ratio of non-phishing samples that are wrongly flagged as phishing. Sensitivity and specificity are the most significant performance metrics computed for any classifier.
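These metrics follow from the four counts of a binary confusion matrix, as sketched below. The counts in the example are hypothetical and are not taken from the experiments; the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp: phishing correctly flagged      fp: legitimate wrongly flagged
    tn: legitimate correctly passed     fn: phishing missed
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts for a 60-sample test set.
metrics = classification_metrics(tp=27, fp=2, tn=28, fn=3)
```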
The confusion matrix is a method for analyzing the performance metrics of any classification algorithm, and it also helps in identifying better solutions. Table 3 shows a comparison of existing work based on features. In general, researchers have achieved an accuracy of approximately 90%. However, in all the previous cases, researchers have restricted their analysis to only one algorithm. Moreover, none of them has considered all areas, i.e., URL, website and e-mail. Thus, the current work has a definite advantage over them in terms of its wider application area and its comprehensive analysis of several algorithms.
With respect to the Random Forest classifier, the Support Vector Machine and Neural Network classifiers showed 8.32% and 20.65% improvement in accuracy, respectively. A total of 12 parameters were extracted by the algorithms. Table 5 shows performance evaluation metrics with different classifiers. In terms of specificity, the Random Forest and Neural Network classifiers are better than the Support Vector Machine classifier by 0.42% and 4.65%, respectively. However, the sensitivity of the Support Vector Machine and Neural Network classifiers is greater than that of the Random Forest classifier by 9.74% and 21.99%, respectively. Similarly, the precision of the Support Vector Machine and Neural Network classifiers is greater than that of the Random Forest classifier by 16.46% and 28.77%, respectively. The F1 scores of the Support Vector Machine and Neural Network classifiers show 14.27% and 26.66% respective improvements over the Random Forest classifier.
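The quoted accuracy improvements are consistent with a relative difference taken against the Random Forest accuracy. The check below assumes that interpretation and uses the three reported accuracies; the function and variable names are illustrative.

```python
def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, as a percentage of the baseline."""
    return (new - baseline) / baseline * 100.0

# Classification accuracies (%) reported for the three classifiers.
ACCURACY = {"NN": 95.18, "SVM": 85.45, "RF": 78.89}

svm_gain = round(relative_improvement(ACCURACY["SVM"], ACCURACY["RF"]), 2)
nn_gain = round(relative_improvement(ACCURACY["NN"], ACCURACY["RF"]), 2)
```

Under this reading, `svm_gain` and `nn_gain` reproduce the 8.32% and 20.65% figures.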

Results
Figure 11 shows the comparison of different methodologies in terms of accuracy (%). The machine learning approach is found to be the best among the three approaches. The heuristic-based approach and the machine learning approach show an improvement of 2.22% and 5.76%, respectively, over the blacklist-based approach. Figure 12 illustrates spam mails detected by the proposed methodology in e-mail. Figure 13 shows fake websites detected and blocked by the proposed methodology in Netcraft. Figure 14 illustrates fake websites detected by the proposed methodology in Google Safe Browsing.

Conclusions
A phishing detection mechanism was proposed to detect phishing attacks. The mechanism is implemented in three phases. Detection based on the DNS blacklist is performed first, and heuristic-based detection then follows, using a web crawler. Websites that frequently use phishing IPs are easy to identify in the DNS blacklist. Using the web crawler and analysis phase, phishing e-mails and sites are identified. An experimental analysis of the phishing detection mechanism was performed, and the mechanism is used for precisely detecting phishing websites, as it has the best accuracy. Three different classifiers were used to determine classification accuracy in detecting phishing, resulting in a classification accuracy of 95.18%, 85.45% and 78.89% for NN, SVM, and RF, respectively. The results suggest that a machine learning approach is best for detecting phishing.

Data Availability Statement:
The data presented in this study are available through email upon request to the corresponding author.