Systematic Review

Machine Learning and Neural Networks for Phishing Detection: A Systematic Review (2017–2024)

by Jacek Lukasz Wilk-Jakubowski 1,2, Lukasz Pawlik 1,*, Grzegorz Wilk-Jakubowski 2,3 and Aleksandra Sikora 4,*
1 Department of Information Systems, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
2 Institute of Crisis Management and Computer Modelling, 28-100 Busko-Zdrój, Poland
3 Institute of Internal Security, Old Polish University of Applied Sciences, 49 Ponurego Piwnika Str., 25-666 Kielce, Poland
4 Department of Computer Science, Electronics and Electrical Engineering, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(18), 3744; https://doi.org/10.3390/electronics14183744
Submission received: 16 August 2025 / Revised: 18 September 2025 / Accepted: 19 September 2025 / Published: 22 September 2025

Abstract

Phishing remains a persistent and evolving cyber threat, constantly adapting its tactics to bypass traditional security measures. The advent of Machine Learning (ML) and Neural Networks (NN) has significantly enhanced the capabilities of automated phishing detection systems. This comprehensive review systematically examines the landscape of ML- and NN-based approaches for identifying and mitigating phishing attacks. Our analysis, based on a rigorous search methodology, focuses on articles published between 2017 and 2024 across relevant subject areas in computer science and mathematics. We categorize existing research by phishing delivery channels, including websites, electronic mail, social networking, and malware. Furthermore, we delve into the specific machine learning models and techniques employed, such as various algorithms, classification and ensemble methods, neural network architectures (including deep learning), and feature engineering strategies. This review provides insights into the prevailing research trends, identifies key challenges, and highlights promising future directions in the application of machine learning and neural networks for robust phishing detection.

1. Introduction

In recent years, the need to ensure comprehensive cybersecurity on a global scale has become increasingly evident. The growing sophistication and volume of cyber threats have prompted research institutions and industry stakeholders worldwide to focus on enhancing the efficiency of threat detection systems. This includes the design and deployment of more advanced and effective countermeasures. Within this context, phishing attacks remain one of the most pervasive and adaptive forms of cybercrime, and their evolution is closely tied to the rapid expansion of digital communication platforms and services. The global landscape suggests that further changes in phishing techniques are inevitable, driven by the continuous growth in attack volume and the diversity of delivery channels.
A widely accepted definition of phishing is provided by the Anti-Phishing Working Group (APWG) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32], an international coalition that coordinates the global response to phishing and cybercrime. This definition captures phishing’s core characteristics and is frequently cited in research and industry reports. According to APWG,
Definition 1. “Phishing is a crime employing both social engineering and technical subterfuge to steal consumers’ personal identity data and financial account credentials. Social engineering schemes prey on unwary victims by fooling them into believing they are dealing with a trusted, legitimate party, such as by using deceptive email addresses and messages, bogus web sites, and deceptive domain names. These are designed to lead consumers to counterfeit Web sites that trick recipients into divulging financial data such as usernames and passwords. Technical subterfuge schemes plant malware onto computers to steal credentials directly, often using systems that intercept consumers’ account usernames and passwords or misdirect consumers to counterfeit Web sites” [32].
A general overview of early phishing detection methods is presented in Table 1. Each method provided incremental improvements but suffered from high false negative rates, limited adaptability, or high computational costs.
The earliest scientific publications on phishing indexed in Scopus (https://www.scopus.com) appeared in 2006, marking the formal beginning of academic research in this field. Detection methods have since evolved rapidly. Starting around 2016, these methods began to be widely replaced or supplemented by Machine Learning (ML) and Neural Network (NN) approaches. This shift reflects the need for more adaptive, data-driven systems capable of addressing zero-day attacks and evolving threat patterns. The present article examines this transformation in depth, providing a structured analysis of research published between 2017 and 2024, identifying key methodological trends, evaluating technical implementations, and mapping global contributions. By synthesizing existing knowledge, it aims to clarify the current state of the field, highlight gaps in research, and suggest potential directions for future development.
In the current literature, there is no deployment-oriented synthesis across the four delivery channels through which phishing is propagated (Websites, Electronic Mail, Malware, and Social Networking) that comparatively examines data quality, leakage risk between training and test sets, time-aware validation, model selection procedures, and system-level metrics.
This article addresses this gap by introducing a unified assessment of selected studies in Table 2, which defines fields that normalize evidence and track common validity threats, including leakage and temporal drift, and links these fields to per-channel deployment checklists that translate the literature into actionable guidance. In addition, we complement the synthesis with a coherent categorization of the corpus and a quantitative summary that organizes studies by delivery channel, classes of ML and neural network methods, methodological practices, and geographic distribution. Finally, we synthesize findings from cross-tabulations that show the diversity of technique and methodology profiles observed across phishing delivery channels.

2. Materials and Methods

This article presents a review of the literature on phishing detection methods using ML and NN. The aim was to collect, organize, and analyze studies published between 2017 and 2024. The scope includes phishing delivery channels, ML models and techniques, as well as research methodologies.

2.1. Data Retrieval and Corpus Construction

To ensure a focused review, bibliographic data were retrieved from the Scopus database. A structured search strategy was developed to capture research on phishing detection using machine learning or neural networks (Figure 1). The search query was formulated to match occurrences of the term phishing combined with either machine learning or neural network in the title, abstract, or keywords fields. The search was limited to journal articles published between 2017 and 2024, written in English, and indexed under the Computer Science or Mathematics subject areas. The time frame was set between 2017 and 2024 because earlier years showed very limited coverage of this topic in Scopus, with only sporadic publications indexed before 2017. The end year was set to 2024, since 2025 is still in progress and does not yet provide a complete set of annual research outputs. Publications from unrelated subject areas, such as medicine, economics, or the arts, were excluded using Scopus filters. To focus on detection methods tailored to individual delivery channels (Websites, Electronic Mail, Social Networking (online), and Malware), an additional “Limit to” filter was applied.
To allow replication of the dataset, we provide the exact wording of the query:
TITLE-ABS-KEY (“Phishing” AND (“Machine Learning” OR “Neural Network”))
AND PUBYEAR > 2016 AND PUBYEAR < 2025
AND (EXCLUDE (SUBJAREA, “CENG”) OR EXCLUDE (SUBJAREA, “ARTS”) OR EXCLUDE (SUBJAREA, “NEUR”) OR EXCLUDE (SUBJAREA, “ECON”) OR EXCLUDE (SUBJAREA, “ENVI”) OR EXCLUDE (SUBJAREA, “BUSI”) OR EXCLUDE (SUBJAREA, “MEDI”) OR EXCLUDE (SUBJAREA, “PHYS”) OR EXCLUDE (SUBJAREA, “ENER”) OR EXCLUDE (SUBJAREA, “MATE”) OR EXCLUDE (SUBJAREA, “ENGI”) OR EXCLUDE (SUBJAREA, “MULT”) OR EXCLUDE (SUBJAREA, “PHAR”) OR EXCLUDE (SUBJAREA, “EART”) OR EXCLUDE (SUBJAREA, “CHEM”) OR EXCLUDE (SUBJAREA, “BIOC”) OR EXCLUDE (SUBJAREA, “SOCI”) OR EXCLUDE (SUBJAREA, “DECI”))
AND (LIMIT-TO (DOCTYPE, “ar”))
AND (LIMIT-TO (LANGUAGE, “English”))
AND (LIMIT-TO (EXACTKEYWORD, “Websites”))
OR LIMIT-TO (EXACTKEYWORD, “Electronic Mail”)
OR LIMIT-TO (EXACTKEYWORD, “Social Networking (online)”)
OR LIMIT-TO (EXACTKEYWORD, “Malware”)
Finally, we further refined the keywords to capture studies involving specific machine learning models and techniques:
AND (LIMIT-TO (EXACTKEYWORD, “Machine Learning”))
OR LIMIT-TO (EXACTKEYWORD, “Learning Systems”)
OR LIMIT-TO (EXACTKEYWORD, “Machine-learning”)
OR LIMIT-TO (EXACTKEYWORD, “Classification (of Information)”)
OR LIMIT-TO (EXACTKEYWORD, “Learning Algorithms”)
OR LIMIT-TO (EXACTKEYWORD, “Deep Learning”)
OR LIMIT-TO (EXACTKEYWORD, “Feature Extraction”)
OR LIMIT-TO (EXACTKEYWORD, “Decision Trees”)
OR LIMIT-TO (EXACTKEYWORD, “Support Vector Machines”)
OR LIMIT-TO (EXACTKEYWORD, “Features Selection”)
OR LIMIT-TO (EXACTKEYWORD, “Deep Neural Networks”)
OR LIMIT-TO (EXACTKEYWORD, “Neural-networks”)
OR LIMIT-TO (EXACTKEYWORD, “Feature Selection”)
OR LIMIT-TO (EXACTKEYWORD, “Random Forests”)
OR LIMIT-TO (EXACTKEYWORD, “Neural Networks”)
OR LIMIT-TO (EXACTKEYWORD, “Classification”)
OR LIMIT-TO (EXACTKEYWORD, “Machine Learning Algorithms”)
OR LIMIT-TO (EXACTKEYWORD, “Long Short-term Memory”)
OR LIMIT-TO (EXACTKEYWORD, “Convolutional Neural Network”)
OR LIMIT-TO (EXACTKEYWORD, “Supervised Learning”)
OR LIMIT-TO (EXACTKEYWORD, “Nearest Neighbor Search”)
OR LIMIT-TO (EXACTKEYWORD, “Convolutional Neural Networks”)
OR LIMIT-TO (EXACTKEYWORD, “Adaptive Boosting”)
Only articles containing at least one term from a predefined list of 23 machine learning-related keywords (e.g., support vector machines, deep neural networks, feature selection) were retained. This step resulted in a final set of 105 articles.
Based on the index keywords applied in this initial filtering step, the first thematic grouping was established under the category Phishing Delivery Channels, comprising four distinct types: Websites, Malware, Electronic Mail, and Social Networking (Section 3). Subsequently, we used index-keyword filtering to define a second thematic grouping, Machine Learning Models and Techniques, encompassing machine learning, neural networks, classification and ensemble methods, and feature engineering. Additionally, authors’ countries of affiliation were identified from Scopus metadata. The Research Methodology category was derived by manual content analysis of the articles.
The metadata of the selected publications were exported to a Comma-Separated Values (CSV) file containing details such as title, authors, year of publication, and other bibliographic fields. This file was then imported into a PostgreSQL 16.2 database to enable query-based analysis, data mining, and aggregation via Structured Query Language (SQL). The process was fully automated using a Python 3.12.2 script, which also generated tables and graphs to support further analysis. The data were exported on 21 July 2025. Throughout the remainder of this article, we refer to this dataset as the corpus to avoid confusion with other datasets used in the study.
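To make this pipeline concrete, the following minimal sketch loads the export and aggregates publications per year, mirroring the kind of aggregation later run in SQL. It assumes the pandas library and the standard Scopus export column name Year, which should be verified against the actual scopus.csv:

import pandas as pd

# Load the Scopus CSV export; "Year" is a standard Scopus export column,
# but column names should be verified against the actual file.
corpus = pd.read_csv("scopus.csv")

# Count publications per year, analogous to a GROUP BY query in PostgreSQL.
per_year = corpus.groupby("Year").size().rename("articles")
print(per_year.loc[2017:2024])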
All relevant replication materials, including the raw scopus.csv export (Table S1), the thesaurus_mapping.csv file (Table S2), and the apwg_data.csv dataset (Table S3), are provided in the Supplementary Materials to enable full replication of the analysis.

2.2. Supplementary Data Sources

To provide a broader empirical context for the review, this study incorporates statistical data published by the Anti-Phishing Working Group (APWG) in its Phishing Activity Trends Reports [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]. These quarterly reports are recognized as one of the most authoritative global sources on phishing activity, offering aggregated metrics such as the number of unique phishing websites, the volume of phishing email campaigns, and the number of targeted brands. Incorporating APWG data documents changes in the volume of phishing attacks over time, enabling interpretation of research trends alongside real-world developments in the threat landscape.
For this study, APWG data for 2017–2024 were obtained from official reports on the organization’s website (https://apwg.org/trendsreports (accessed on 8 August 2025)). In particular, the data were manually extracted from the listed quarterly reports and processed using a Python 3.12.2 script. In later sections, these figures are used to divide the study period into two distinct intervals, highlighting a clear shift in the phishing dynamics, with a relatively stable phase followed by a period of sharp, sustained growth in activity.
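As a hedged illustration of this processing step, the sketch below reads a quarterly series and compares mean activity before and after a candidate break point; the column names (year, quarter, unique_phishing_sites) and the break year are assumptions for illustration, not the actual layout of apwg_data.csv or the interval boundary reported later:

import pandas as pd

# Assumed layout: one row per quarter with columns
# year, quarter, unique_phishing_sites (verify against apwg_data.csv).
apwg = pd.read_csv("apwg_data.csv")

break_year = 2020  # hypothetical candidate break point
before = apwg.loc[apwg["year"] < break_year, "unique_phishing_sites"]
after = apwg.loc[apwg["year"] >= break_year, "unique_phishing_sites"]
print(f"mean before: {before.mean():.0f}, mean after: {after.mean():.0f}")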

2.3. Bibliometric Analysis Procedure

To gain a comprehensive understanding of research directions and thematic structures in phishing detection using machine learning and neural networks, we conducted a bibliometric analysis. This approach enables the identification of key concepts, their interconnections, and emerging trends within the scientific literature. The objective was to identify and visualize the most significant research themes and their relationships.
The analysis was conducted using VOSviewer (version 1.6.20, https://www.VOSviewer.com), which generated a co-occurrence map of keywords derived from Scopus bibliographic data. The dataset used for this purpose comprised the 105 documents described in Section 2.1, exported from Scopus in CSV format. Index keywords were considered, with a thesaurus file applied that introduced minimal intervention—limited solely to resolving spelling differences—in order to preserve the most faithful observation of the dataset. A minimum occurrence threshold of 5 was set, and fractional counting was applied to measure link strengths. This configuration ensured a balanced and reliable representation of keyword relationships in the analyzed corpus.
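The spirit of this configuration can be reproduced outside VOSviewer; the sketch below normalizes index keywords with a thesaurus mapping and applies the minimum-occurrence threshold of five. The thesaurus column names (label, replace by) follow VOSviewer's thesaurus convention, and Index Keywords is the standard Scopus export column; both are assumptions to verify against the Supplementary files:

import pandas as pd
from collections import Counter

# Assumed thesaurus layout: "label" and "replace by" columns, as in
# VOSviewer thesaurus files; verify against thesaurus_mapping.csv.
thesaurus = pd.read_csv("thesaurus_mapping.csv")
mapping = dict(zip(thesaurus["label"].str.lower(), thesaurus["replace by"]))

corpus = pd.read_csv("scopus.csv")
counts = Counter()
for cell in corpus["Index Keywords"].dropna():
    # Scopus separates index keywords with semicolons; normalize spelling
    # via the thesaurus before counting, as described above.
    keywords = {mapping.get(k.strip().lower(), k.strip().lower())
                for k in cell.split(";")}
    counts.update(keywords)

# Keep only keywords meeting the minimum occurrence threshold of 5.
frequent = {k: n for k, n in counts.items() if n >= 5}
print(sorted(frequent.items(), key=lambda kv: -kv[1])[:10])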

2.4. Review Protocol and Publication Quality

This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework. The process was carried out in three main stages (Figure 2):
  • In the identification stage, a comprehensive search was conducted in the Scopus database. The search strategy used a defined set of keywords applied to titles, abstracts, or author keywords in order to capture relevant publications. Filters were applied to restrict the results to English-language articles within the defined time frame (2017–2024). Records from unrelated subject areas were removed. A total of 108 records were identified.
  • In the screening stage, all 108 records identified in the previous step were examined. Three records were excluded after applying an additional keyword filter in Scopus. This left 105 records for further retrieval.
  • In the eligibility assessment, 90 full-text articles and 15 abstracts were reviewed. The inclusion of abstracts helped maintain methodological consistency and increased the sample size, which was essential for conducting a reliable quantitative analysis. Although abstracts provide less detail than full texts, they contain key information on the scope of the study, the applied methods, and the main findings, making them a valuable source of data in a systematic review.
The quality of the included publications (full texts and abstracts) was ensured by selecting only peer-reviewed articles indexed in Scopus. The selection covered major publishers such as Springer, Elsevier, the Institute of Electrical and Electronics Engineers (IEEE), and the Multidisciplinary Digital Publishing Institute (MDPI), as well as other recognized peer-reviewed journals including the Institution of Engineering and Technology (IET), Hindawi (Wiley), and the International Journal of Advanced Computer Science and Applications (IJACSA). The final set of 105 publications represented both recent studies with few citations and highly cited works, showing the coexistence of emerging approaches and established research.
Each publication was independently assessed by two authors, with disagreements resolved through discussion to reach a consensus. This process enabled accurate multi-labeling of hybrid publications, as reflected in the tables in Section 4. The evaluation considered topic relevance to phishing detection, publication completeness, and methodological clarity. The verification was consistent with the results obtained from the search process.

2.5. Study Quality and Risk-of-Bias Assessment

To ensure the credibility and reliability of the review, each included study was systematically assessed for methodological quality and potential sources of bias. A structured appraisal rubric was developed to evaluate common threats to validity in machine learning-based phishing detection research (Table 2). The evaluation considered the following main aspects: data quality, class balance, external sources used (blacklists/metadata), risk of data leakage, validation method, model selection procedure, evaluation metrics, and handling of class imbalance. This process ensured a consistent basis for comparing studies and made it possible to identify common weaknesses.
The column Data quality reports how the dataset was constructed and from which sources it was obtained (single-source or combined, with repository names as applicable), then records the acquisition window or snapshot used and any preprocessing steps that affect inclusion, such as duplicate removal, unreachable links, or Uniform Resource Locator (URL) sanitation; the entry concludes with one overall item count for the entire dataset. This scope keeps provenance and basic quality controls together. Note on “Total items”: even when per-source counts are listed, a single overall total is often unavailable or unreliable because sources commonly overlap and must be deduplicated, authors may not specify the exact snapshot or time window used for each source, and preprocessing steps such as URL validation, removal of duplicates, and filtering of unreachable or malformed entries change the final size. Unless a paper reports the post-processing size of the dataset actually used for training and testing, this field is recorded as Not reported.
The Class balance column begins with a short status (e.g., Balanced, Imbalanced, or Not reported), then shows the distribution between phishing and benign classes. If the authors report per-split distributions, the column presents the Train, Validation, and Test splits. If only an overall distribution is reported, the column reflects that. If the information is missing, the cell states Not reported.
The column External sources used (blacklists/metadata) states whether a study relied on external sources either for labels or for input metadata, which helps normalize evidence across papers and assess comparability and leakage risk. Cells follow a fixed pattern: “Labels: ... Metadata: ...”. Labels indicate the origin of ground-truth class assignments, for example PhishTank or OpenPhish, preferably with a snapshot date or version if provided. Metadata refers only to external signals obtained beyond the URL string itself, for example, Registration Data Access Protocol (RDAP) registration data, Domain Name System (DNS) records such as Address (A) and Name Server (NS) records with properties like time to live (TTL), and Transport Layer Security (TLS) certificate information including Certificate Transparency (CT) evidence. These sources are transformed into numeric or categorical features, such as domain age, registrar, record counts, TTL values, issuer fields, and presence in CT logs, and then used as model inputs. Features derived solely from the URL string are not external metadata in this column; in such cases, the cell states “Metadata: none”.
The column Risk of data leakage indicates the likelihood that the reported results may have been affected by an unintended overlap between training and test data. A Low rating indicates that the dataset was clearly separated between training and test sets, with no evidence of overlap. A Medium rating indicates that multiple datasets were combined and/or the separation procedure was insufficiently described, leaving the possibility of overlap between training and test sets. A High rating indicates that studies either provided insufficient information or used procedures that strongly suggest a risk of overlap between the training and test sets.
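A first-pass check of this kind can be automated. The sketch below, a minimal illustration assuming normalized URL lists for each split, reports exact overlaps between training and test sets; detecting near-duplicates would additionally require canonicalization (lowercasing hosts, stripping query strings, and so on):

def overlap_report(train_urls, test_urls):
    """Report exact overlap between training and test URL lists."""
    train, test = set(train_urls), set(test_urls)
    shared = train & test
    print(f"train={len(train)}, test={len(test)}, shared={len(shared)}")
    return shared

# Toy example: one URL leaks from training into testing.
train = ["http://a.example/login", "http://b.example/verify"]
test = ["http://c.example/update", "http://a.example/login"]
overlap_report(train, test)  # shared=1 -> potential leakage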
The column Validation method specifies how each study divided the dataset into training, validation, and test sets. The most common strategy is a hold-out split. In this approach, the dataset is divided once into fixed parts, for example, 80/20 (80% for training and 20% for testing) or 60/20/20 (60% for training, 20% for validation, and 20% for testing). A variant is the random split, where the partitioning is performed randomly. If class proportions are preserved within each subset, this is termed a stratified random split. Another common approach is k-fold cross-validation (CV), in which the dataset is split into k folds, and the model is trained and tested k times, each time using a different fold as the test set; when k is specified, it is written as, for example, 10-fold CV. A more rigorous design, nested CV, uses an inner loop for hyperparameter tuning and an outer loop for performance estimation, thereby reducing bias from model selection. In the table, the terminology follows the authors’ descriptions; when not explicitly stated, the generic term hold-out split is used to denote a fixed partition of the dataset. Because URL liveness and labels age, time-based splits are necessary to estimate performance under drift rather than on mixed-era samples.
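The split families above can be expressed compactly with scikit-learn; the sketch below shows a stratified random hold-out split, 10-fold CV, and a simple time-ordered split on synthetic data, as an illustration rather than a prescription:

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))      # synthetic feature matrix
y = rng.integers(0, 2, size=1000)   # synthetic binary labels

# Stratified random hold-out split (80/20), preserving class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 10-fold cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(cv.split(X, y))

# Time-based split: assuming rows are ordered by collection time, train on
# the earliest 80% and test on the latest 20% to estimate behavior under drift.
cut = int(0.8 * len(X))
X_tr_time, y_tr_time, X_te_time, y_te_time = X[:cut], y[:cut], X[cut:], y[cut:]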
The column Model selection procedure describes how the final model and its hyperparameters were chosen. Not reported means that the procedure was not described.
The column Evaluation/system metrics presents, for each study, the performance criteria used to assess predictive quality and, where available, quantitative characteristics of computational cost. The evaluation part enumerates metric families such as Accuracy, Precision, Recall, F1-score, Receiver Operating Characteristic Area Under the Curve (ROC AUC), and Matthews Correlation Coefficient (MCC). The System metrics part reports numerical efficiency and resource indicators provided by the authors, including training and inference time, per-request latency, throughput, and memory or model size, with values and units exactly as stated in the source. When a study does not include runtime, latency, memory, or throughput figures, this part indicates that such cost or time metrics were not reported.
Based on the approaches discussed in recent studies on imbalanced learning [140,141,142], the authors adopted a three-level categorization to assess how class imbalance was handled in the reviewed publications. The column Handling of class imbalance reflects whether and how the studies addressed the problem of unequal class distribution in phishing datasets. A Not addressed rating indicates that the study relied primarily on accuracy or omitted any discussion of class imbalance. Partially addressed (metrics only) means that the authors reported appropriate evaluation metrics such as Precision, Recall, F1-score, MCC, or AUC, but did not apply explicit balancing techniques. Adequately addressed (metrics and techniques) refers to studies that combined suitable metrics with explicit methods such as Synthetic Minority Over-sampling Technique (SMOTE), undersampling, or class weighting to mitigate the effects of imbalance.
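As a concrete illustration of the third level, the sketch below combines class weighting with imbalance-aware metrics on synthetic data; SMOTE would play an analogous resampling role but requires the separate imbalanced-learn package, so class weighting is shown here instead:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (rng.random(2000) < 0.1).astype(int)  # roughly 10% positives: imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# class_weight="balanced" reweights errors inversely to class frequency.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Report metrics that remain informative under imbalance (not accuracy alone).
print(f"F1={f1_score(y_te, pred):.3f}, MCC={matthews_corrcoef(y_te, pred):.3f}")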
Table 2 was compiled from full-text analysis of all included articles, based on a predefined appraisal rubric. Two authors independently coded each study, and any disagreements were resolved through discussion until consensus was reached. The table was prepared manually in a word processor rather than generated by software. To support replication, a concise legend placed directly below Table 2 explains the meaning and coding rules for every column, and the Supplementary Materials include the Scopus export that lists all publications considered in the review.

2.6. Summary

This study combines a constructed Scopus corpus of 105 journal articles on phishing detection using machine learning or neural networks (2017–2024) with statistical data from the APWG to compare research trends with real-world attack dynamics. The corpus was compiled using a structured, replicable query restricted to relevant subject areas, delivery channels, and a predefined set of 23 machine-learning keywords. APWG quarterly reports provide authoritative global metrics on phishing activity, enabling contextual interpretation of bibliometric results. Keyword co-occurrence analysis using VOSviewer identified key research themes and their interconnections, forming the basis for the thematic analysis in subsequent sections.
During the preparation of this work, the authors used ChatGPT (GPT-4.5, GPT-5, and GPT-5 Thinking; OpenAI, https://chat.openai.com) to refine the language.

3. Deployment Checklists by Phishing Delivery Channel

Section 3.1, Section 3.2, Section 3.3 and Section 3.4 translate our review findings into actionable deployment checklists for each phishing delivery channel. For each channel, we summarize privacy controls, data collection risks, fail-safe behavior, model updates or rollbacks, and explainability for analyst triage, with each item anchored in the evidence fields captured in Table 2. This framing clarifies what the reported results imply for engineering and operations across contexts.
Across all channels, privacy controls follow a common baseline. Limit collection and retention to what is necessary for detection, prefer on-device feature extraction, remove direct identifiers when telemetry leaves a device, keep raw artifacts only for short, defined windows, and document any third-party inputs using the exact Table 2 columns External sources used (blacklists/metadata) and Data quality. Channel sections provide representative examples from Table 2 rather than an exhaustive catalog. Further detailed rules and recommendations on privacy controls are available in legal sources [143] and technical frameworks [144]. This article focuses on translating the evidence encoded in Table 2 into deployable, channel-specific controls with representative examples.
Data collection risks are consistently addressed using the study-level evidence recorded in Table 2. Deployment should mirror the controls in Table 2 by documenting snapshot windows, applying deduplication and liveness or crawl-validity checks, preventing cross-split overlap, and stating post-processing class balance. When sources are continuously updated, use time-aware splits to reduce temporal leakage and reflect the order of arrival in production. These practices map to Table 2 fields (Data quality, Risk of data leakage, Class balance, and Validation method), and address limitations noted in the corpus regarding outdated data, overlap, and drift.
Fail-safe behavior and safe defaults use the same vocabulary as the Evaluation/system metrics field in Table 2. Where latency, throughput, memory, or runtime are reported, use them to set timeouts, backoff, caching, and degradation paths for partial features or service unavailability. When cost metrics are not reported in a source study, record Not reported in Table 2, and define explicit operational budgets for deployment.
Model updates and rollbacks adhere to the validation and selection practices documented in Table 2. Version data snapshots, models, feature schemas, and any External sources used; gate promotions with shadow or forward-chaining tests consistent with the recorded Validation method; and keep a last-known-good bundle for rapid rollback. Where the Model selection procedure was not nested or not reported, treat pre-deployment checks and canary thresholds as mandatory safeguards.
Explainability for triage provides concise, case-level reasons consistent with the features actually used by the model in each channel. Store the top-contributing indicators with the prediction, link them to the model version and data snapshot ID, and retain only the minimal artifacts needed for audit. Channel sections surface the kinds of indicators reported in the studies and tie them to the Evaluation/system metrics evidence.
Finally, the channel subsections present evidence-backed examples drawn from Table 2. They are representative rather than exhaustive and can be extended in future revisions by first documenting additional signals in Table 2 and then incorporating them into the corresponding checklists.

3.1. Deployment Checklist for the Phishing Delivery Channel: Websites

This checklist is anchored in the evidence fields in Table 2 for the Websites channel (Data, Data quality, Risk of data leakage, Class balance, Validation method, Model selection procedure, External sources used (blacklists/metadata), Use of external lists or metadata) [34,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95].

3.1.1. Privacy Controls

For website detectors that process URLs, Hypertext Markup Language (HTML), or rendered snapshots, Table 2 documents feature families such as URL lexical tokens [92], DOM-, HTML-, or render-derived features [91], and third-party metadata where applicable, including WHOIS domain registration records (WHOIS), DNS, and TLS certificate fields [36,37,90]. Use the table’s column names when documenting provenance in the External sources used (blacklists/metadata) field and data handling in the Data quality field [34,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95].

3.1.2. Data Collection Risks

Make data collection reproducible and contamination-aware. Table 2 records snapshot dates (when reported), deduplication, liveness or crawl-validity checks, class balance, and overlap between training and test URL lists; mirror these controls by documenting snapshot windows, enforcing deduplication and liveness checks, and preventing cross-split overlap [36,37,75,92]. Typical risks identified in Table 2 include merged sources without clear separation [36,37], missing deduplication in hold-out or CV settings [92], and mismatched labeling in mixed live and archival sets [75].

3.1.3. Failsafe Behavior and Safe Defaults

Align operational safeguards with the Evaluation/system metrics field in Table 2. Where cost figures exist, set timeouts and degradation paths accordingly; examples include prototype/extension response times [37] and per-request detection/classification times [92]. If metrics are not reported, state this explicitly and record results using the same vocabulary [37,92]. Common patterns include time limits for rendering [37,91,94] and fallback to URL-only features when HTML is unavailable [43,44,92].
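A minimal sketch of such a degradation path, assuming the requests library and hypothetical feature extractors, is given below; the time limit and fallback order are deployment choices, not values taken from the cited studies:

import requests

def url_lexical_features(url: str) -> dict:
    # Hypothetical URL-only features: always computable, even offline.
    return {"length": len(url), "num_dots": url.count("."), "has_at": "@" in url}

def extract_features(url: str, timeout_s: float = 5.0) -> dict:
    """Try to fetch HTML within a time budget; fall back to URL-only features."""
    features = url_lexical_features(url)
    try:
        resp = requests.get(url, timeout=timeout_s)
        features["html_length"] = len(resp.text)
        features["num_forms"] = resp.text.lower().count("<form")
    except requests.RequestException:
        # Fail safe: proceed with URL-only features when HTML is unavailable.
        features["html_available"] = False
    return features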

3.1.4. Model Updates and Rollback

Keep versioned, dated snapshots of models, feature schemas, and any External sources used (blacklists/metadata) as recorded in Table 2. Gate promotions using the Validation method actually reported (e.g., 10-fold or 5-fold CV; hold-outs) and keep decisions consistent with the documented Model selection procedure (e.g., GridSearchCV, Bayesian optimization, or Not reported) [36,37,88,92]; pin versions or snapshots of external sources where applicable [75,90].
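One lightweight way to keep such versioned, dated bundles is an append-only registry file; the sketch below is a minimal illustration with hypothetical field names, not a description of tooling used in the reviewed studies:

import json

# Hypothetical registry entry: everything needed to reproduce or roll back.
registry_entry = {
    "model_version": "2025.07.1",
    "data_snapshot": "urls-2025-07-01",
    "feature_schema": "url-lexical-v3",
    "external_sources": {"labels": "PhishTank 2025-07-01", "metadata": "none"},
    "validation": "10-fold CV",
    "status": "candidate",  # promoted to "last-known-good" only after gating
}

with open("model_registry.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(registry_entry) + "\n")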

3.1.5. Explainability for Triage

Provide concise case-level rationales consistent with feature families used by the Websites studies. Table 2 indicates which studies report feature importance or instance-level diagnostics; for example, random forest importance reports and per-instance cues in website classifiers [75]. Surface influential URL tokens, key DOM/HTML elements, and simple visual cues when those features are used by the model; link explanations to the model version and data snapshot ID [75].
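A minimal sketch of attaching such rationales to predictions, using a random forest's global feature importances on synthetic data, is shown below; per-instance attributions would require extra tooling, and all field names are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["url_length", "num_dots", "has_at", "num_forms"]
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)
model = RandomForestClassifier(random_state=42).fit(X, y)

def triage_record(x, top_k=3):
    """Attach the top global importances to a prediction, with provenance."""
    ranked = sorted(zip(feature_names, model.feature_importances_),
                    key=lambda kv: -kv[1])[:top_k]
    return {"prediction": int(model.predict([x])[0]),
            "top_indicators": ranked,
            "model_version": "2025.07.1",       # illustrative provenance
            "data_snapshot": "urls-2025-07-01"}

print(triage_record(X[0]))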

3.2. Deployment Checklist for the Phishing Delivery Channel: Malware

This checklist is anchored in the evidence fields of Table 2 for the Malware channel [36,55,66,69,92,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116].

3.2.1. Privacy Controls

Representative signals documented for this channel include dynamic Application Programming Interface (API) call sequences captured prior to encryption in the RISS ransomware dataset [98], Android static and dynamic features, such as declared permissions and selected API call counts, reported for Drebin [99], and network-level aggregates used in the included studies, for example, NetFlow statistics from CTU-13 [101]. Prefer transmitting derived features documented in Table 2 rather than raw binaries or packet captures [98,99,101].
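To make “derived features rather than raw artifacts” concrete, the sketch below turns a dynamic API call sequence into bigram counts that could be transmitted instead of a binary or packet capture; the call names are invented for illustration:

from collections import Counter

def api_bigram_features(calls: list[str]) -> dict:
    """Derive bigram counts from an API call sequence (no raw binary needed)."""
    return {f"{a}->{b}": n
            for (a, b), n in Counter(zip(calls, calls[1:])).items()}

# Hypothetical pre-encryption call trace from a sandbox run.
trace = ["FindFirstFile", "ReadFile", "CryptEncrypt",
         "WriteFile", "ReadFile", "CryptEncrypt", "WriteFile"]
print(api_bigram_features(trace))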

3.2.2. Data Collection Risks

Table 2 indicates typical risks for Malware studies that deployments should mirror and mitigate [98,99,100,101,104]. Examples include merged sources or mixed benign/malicious collections without deduplication or temporal isolation [99,104], single-scenario NetFlow evaluations without flow-correlation isolation [101], and random or k-fold splits without nesting of model selection [98,101,104]. Use time-aware splits where feeds evolve, avoid cross-split near-duplicates, and document snapshot windows.

3.2.3. Failsafe Behavior and Safe Defaults

System-cost reporting is often sparse for Malware entries in Table 2, with the Evaluation/system metrics field frequently marked Not reported [98,99,101]. Define explicit timeouts, backoff, and safe defaults, and record degradation paths when features or services are unavailable, then log outcomes using the same metric vocabulary used for evaluation.

3.2.4. Model Updates and Rollback

Version models, feature schemas, and any External sources used (blacklists/metadata) listed in Table 2, and keep immutable, dated snapshots [98,99]. Gate promotions using the same Validation method recorded for this channel and keep decisions consistent with the documented Model selection procedure [98,101,104]. Monitor and log field behavior using the vocabulary of the Evaluation/system metrics field, noting explicitly when system metrics are not reported in the source studies [98,99,101]. Pin versions and refresh cadence for external sources following the table’s “Labels: …; Metadata: …” pattern [98,99,101,104,108].

3.2.5. Explainability for Triage

Provide case-level rationales mapped to the feature families recorded for the Malware channel in Table 2. For ransomware pre-encryption detectors, surface the most influential dynamic API call sequences prior to encryption, as reported for the RISS dataset [98]. For Android malware, show top-contributing static permissions and selected API-call counts consistent with Drebin-based analyses [99]. For traffic-driven detectors, report aggregates aligned with the literature, for example, NetFlow statistics in CTU-13 [101] and DNS-derived fields, such as TTL distributions and query types, in ISOT botnet experiments [104]. Keep summaries concise and restricted to inputs documented in Table 2 for this channel.

3.3. Deployment Checklist for the Phishing Delivery Channel: Electronic Mail

This checklist is anchored in the evidence fields of Table 2 for the Electronic Mail channel (Data, Data quality, Risk of data leakage, Class balance, Validation method, Model selection procedure, Evaluation/system metrics, External sources used (blacklists/metadata)) [45,50,69,75,92,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134].

3.3.1. Privacy Controls

Use signals that studies actually derive from messages: header and body features and attributes of embedded URLs [119,121,122,123]. Representative inputs in Table 2 include header irregularities and sender–recipient patterns, tokenized subject/body features, and URL-level vectors [117,119,121,122,123]. Keep references aligned with the table’s Data quality and External sources used (blacklists/metadata) fields.
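A minimal sketch of deriving header, body, and embedded-URL signals with the standard library email module follows; the concrete features are illustrative, not those used in the cited studies:

import re
from email import message_from_string

raw = """From: support@examp1e-bank.com
Reply-To: attacker@other.example
Subject: Verify your account

Your account is locked. Visit http://examp1e-bank.com/verify now.
"""

msg = message_from_string(raw)
body = msg.get_payload()
urls = re.findall(r"https?://\S+", body)

features = {
    # Header irregularity: Reply-To domain differs from From domain.
    "reply_to_mismatch": (msg.get("Reply-To", "").split("@")[-1]
                          != msg.get("From", "").split("@")[-1]),
    "subject_has_verify": "verify" in msg.get("Subject", "").lower(),
    "num_urls": len(urls),
    "max_url_length": max((len(u) for u in urls), default=0),
}
print(features)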

3.3.2. Data Collection Risks

Table 2 highlights the risks of merged corpora without thorough deduplication or time-aware separation, and of random splits or k-fold CV that allow leakage across folds [117,119,121,122,123]. Examples include 10-fold CV without deduplication or timestamp isolation [117], and multi-corpus merges with benign-only deduplication and no cross-split deduplication [119]. Mirror the controls in Table 2 by documenting snapshot windows and preventing cross-split overlap.

3.3.3. Failsafe Behavior and Safe Defaults

For many e-mail entries, System metrics are Not reported or limited to training-time figures [117,121]. Use the Evaluation/system metrics vocabulary from Table 2 when recording costs in deployment, and note explicitly when a source study provides no system metrics.

3.3.4. Model Updates and Rollback

Align promotions with the Validation method and Model selection procedure used in the channel studies. Table 2 records random hold-out and k-fold protocols for merged datasets [119,121]; where sources evolve over time, prefer date-aware checks consistent with these entries, and keep snapshot references for comparability.

3.3.5. Explainability for Triage

Provide short rationales tied to the feature families evidenced in the e-mail rows. Surface the most influential header or body indicators and URL attributes when these features are part of the model [117,121,122,123]. Keep explanations consistent with the inputs and metric families used in Table 2 for this channel.

3.4. Deployment Checklist for the Phishing Delivery Channel: Social Networking

This checklist is anchored in the evidence fields of Table 2 for the Social Networking channel (Data quality, Risk of data leakage, Class balance, Validation method, Model selection procedure, Evaluation/system metrics, External sources used (blacklists/metadata)) [85,100,103,135,136,137,138,139].

3.4.1. Privacy Controls

Limit data collection and storage to the signal families actually used by studies in this channel: domain reputation of linked URLs in Twitter spam detection [100]; account- and content-level features for malicious-user detection [103]; profile-level features for Instagram fake-account detection [135]; and behavioral signals relevant to Sybil and multi-account deception [138,139]. Document how these signals are derived and retained, and avoid processing raw personal content beyond what these feature sets require.

3.4.2. Data Collection Risks

Guard against leakage when datasets are merged and randomly split. Table 2 flags a high leakage risk in a study that combined Twitter and Instagram accounts using an 80/20 random hold-out without identity-level separation or deduplication [103]. Use identity- or account-level isolation and avoid random splits in such settings.
When URL or domain features are part of the feature set, ensure grouping and deduplication policies prevent cross-split overlap of identical or near-duplicates, consistent with the risk patterns highlighted for this channel [100,103].
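Identity-level isolation can be enforced with group-aware splitting; the sketch below uses scikit-learn's GroupShuffleSplit on synthetic account IDs so that all samples from one account land in the same split:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # synthetic per-post features
y = rng.integers(0, 2, size=100)
accounts = rng.integers(0, 20, size=100)  # account ID for each sample

# All samples sharing an account ID stay together, preventing the
# identity-level leakage that a plain 80/20 random hold-out would allow.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=accounts))
assert not set(accounts[train_idx]) & set(accounts[test_idx])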

3.4.3. Failsafe Behavior and Safe Defaults

When features or feeds are partially unavailable, degrade gracefully by relying on feature families evidenced in Table 2 for this channel, for example, domain reputation [100], account or profile features [103,135], and behavioral cues for Sybil or multi-account deception [138,139].

3.4.4. Model Updates and Rollback

Align update checks with the Validation method and Model selection procedure fields recorded for Social Networking entries. For example, mirror the reported hold-out or cross-validation setup during pre-promotion tests, and assess changes using the evaluation metrics reported in Table 2 (Accuracy, Precision, Recall, F1) [103]. Keep versioned, dated snapshots of models and feature schemas so you can revert if metrics regress.

3.4.5. Explainability for Triage

Surface the most influential signals that correspond to Table 2 features for this channel: report reputation indicators for linked domains in tweet-borne spam [100], profile- and content-level attributes used by malicious-user and Instagram fake-account detectors [103,135], and behavioral patterns relevant to Sybil or multi-account deception [138,139]. Keep summaries concise and consistent with the feature families Table 2 documents for Social Networking.

3.5. Summary

Each checklist item maps to Table 2 fields for the corresponding channel, so readers can trace operational guidance back to the reported validation methods, model selection procedures, leakage risks, and system metrics.

4. Discussion

This section presents a comprehensive analysis of research on phishing detection using Machine Learning (ML) and Neural Networks (NN). The analysis is based on the curated Scopus corpus described in Section 2. The results are organized to present both the conceptual landscape and the methodological distribution of studies published between 2017 and 2024. The discussion begins with a keyword co-occurrence analysis. This step highlights dominant topics and their interconnections within the dataset. The section then examines the relationship between global phishing activity and research engagement. A categorization framework is applied to classify publications by delivery channel, applied ML/NN techniques, and research methodology. Subsequent sections investigate international contributions. This is followed by an analysis of methodological patterns across channels. The structure enables identification of dominant approaches, persistent gaps, and emerging areas of interest. This multi-layered analysis provides a foundation for interpreting how technical and methodological trends align with evolving phishing threats.

4.1. Keyword Co-Occurrence Map: Dataset, Parameters, and Metrics

This subsection provides a quantitative overview of the keyword landscape in the Scopus corpus exported on 21 July 2025. We use VOSviewer to construct a co-occurrence map from bibliographic data (Figure 3), focusing on index keywords and normalizing terms with a thesaurus file. A minimum occurrence threshold of five was applied; 49 of 737 keywords met this criterion. Fractional counting was used and the 25 most relevant terms were selected for visualization. We report three standard VOSviewer metrics: occurrences (how many publications in this corpus include a given keyword), co-occurrence (how often two keywords appear together in the same publication, with contributions down-weighted for records listing many keywords) and total link strength (the overall strength of a keyword’s connections to all other keywords in the map) [145]. The purpose of Section 4.1 is to complement the qualitative review by identifying the dominant topics and the strongest interrelations strictly within this dataset and configuration.
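For readers unfamiliar with fractional counting, the toy sketch below implements one common fractionalization, in which each keyword pair in a document with k keywords contributes 1/(k - 1), down-weighting keyword-rich records; VOSviewer's exact normalization may differ in detail, so this illustrates the principle rather than reimplementing the tool:

from collections import Counter
from itertools import combinations

docs = [  # index keywords per publication (toy corpus)
    {"phishing", "machine learning", "websites"},
    {"phishing", "deep learning"},
    {"machine learning", "websites", "classification"},
]

cooc = Counter()
for keywords in docs:
    k = len(keywords)
    if k < 2:
        continue
    weight = 1.0 / (k - 1)  # down-weight records listing many keywords
    for pair in combinations(sorted(keywords), 2):
        cooc[pair] += weight

for pair, w in sorted(cooc.items(), key=lambda kv: -kv[1]):
    print(pair, round(w, 2))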
In the analyzed map, the most frequent keywords (Occurrences) are computer crime (n = 68), websites (n = 61), phishing (n = 55), machine learning (n = 50), learning systems (n = 39), phishing websites (n = 31), classification (of information) (n = 30), phishing detection (n = 29), cybersecurity (n = 27), and malware (n = 25). The ranking by total link strength matches the ranking by occurrences (same ordering and values): computer crime (68), websites (61), phishing (55), machine learning (50), learning systems (39), phishing websites (31), classification (of information) (30), phishing detection (29), cybersecurity (27), and malware (25). The co-occurrence analysis shows that frequency and connectivity coincide: the most frequent terms are also the most strongly connected to the rest of the vocabulary. No rare yet structurally central terms emerge, and there are no very frequent but weakly connected terms. As a result, the network exhibits a compact conceptual core dominated by a small set of general keywords; we do not observe bridging niche terms that would tie distant topical areas, and the diversity of cross-topic relations is correspondingly limited. These counts refer exclusively to publications in this corpus.
We analyze the color-coded clusters in the VOSviewer co-occurrence map to see how keywords group together and how strongly they are connected within this corpus. For each cluster, we report the top keywords by Occurrences and by Total Link Strength to establish both frequency and connectivity. We describe internal connectivity by reporting the number of links that key terms in the cluster have with other terms and by identifying the strongest edges within the cluster and to neighboring clusters. We also check alignment with the delivery channels introduced in Section 3 (Websites, Malware, Electronic Mail, Social Networking), ensuring that the quantitative structure matches the substantive organization of the review. Finally, we add practical significance—what the observed patterns suggest for data, features, model placement, or evaluation—stating such implications cautiously when the evidence is indirect.
Across clusters, we look for signals that shape the narrative: whether bridging terms appear (rare but central keywords that connect distant areas) or are absent; whether the map shows cohesion or separation (a compact core versus dispersed topics); and whether frequency and connectivity are consistent (that is, whether Occurrences and TLS identify the same or different sets of key terms).
Cluster 1 (red; machine-learning-centric)
Within this cluster, the dominant keywords are machine learning (n = 50), malware (n = 25), decision trees (n = 21), network security (n = 21), crime (n = 10), losses (n = 9), and random forests (n = 9). Internally, the subgraph is fully connected: every term links to every other term in the cluster, with the strongest internal edges observed for decision trees–machine learning (≈3.31), machine learning–malware (≈3.09), and malware–network security (≈1.66). Externally, this cluster is tightly integrated with the network’s core concepts: machine learning forms high-weight links to phishing (≈5.66) and websites (≈4.59), and it also connects to phishing detection (≈2.63) and electronic mail (≈2.12). Degree counts underline this connectivity: machine learning and malware are linked to 24 of the 24 other selected terms (Links = 24, i.e., 24 is the maximum possible number of links in this map given the selected parameters), and network security links to 23. Taken together, these patterns indicate that the cluster aligns with multiple delivery channels from Section 3—most directly with Malware (present inside the cluster) and, via strong cross-links, with Websites and Electronic Mail—so its role is methodological and cross-channel rather than tied to a single medium. Practically, this suggests keeping robust classical ML baselines (e.g., decision trees, random forests) alongside newer models, reporting results per channel where possible (malware/web/email) and checking for data leakage between related samples, since the same ML methods are widely reused across contexts in this corpus.
Cluster 2 (green; learning-oriented + e-mail)
In this cluster the dominant keywords are learning systems (n = 39), classification (of information) (n = 30), learning algorithms (n = 24), electronic mail (n = 23), and support vector machines (n = 16). Internally, the strongest links are learning algorithms–learning systems (~2.76), electronic mail–learning systems (~2.46), classification–electronic mail (~2.09), and classification–learning systems (~2.08); remaining pairs (e.g., with SVM) also connect but with lower weights (~1.20, ~1.10, ~0.98). By degree, classification (of information) connects to 24 other selected terms (Links = 24), while learning systems and learning algorithms connect to 23, and electronic mail and SVM to 22.
Externally, this cluster is well connected to the network core. The highest-weight cross-cluster edges include computer crime–learning systems (~5.82), learning systems–websites (~3.67), learning systems–phishing (~3.01), electronic mail–phishing (~2.89), classification–websites (~2.86), classification–phishing (~2.69), and several links to machine learning and malware (~2.46–2.36). Read together, these patterns indicate that Cluster 2 captures the learning/classification spine of the literature with a clear attachment to the Electronic Mail channel, while remaining strongly coupled to the web-centric and general “abuse” terminology at the network’s core.
Practical significance (cautious): The prominence of classification/learning alongside electronic mail suggests prioritizing well-specified e-mail feature sets (headers/body/attachments) and stable baselines (e.g., SVM) in evaluations, with metrics reported under class imbalance. The dense links from learning systems to websites/phishing suggest reporting per-channel results (email vs. web) and checking for data leakage, since the same learning setups recur across channels in this corpus.
Cluster 3 (blue; web-centric detection focus)
This cluster centers around keywords related to websites and detection strategies. The dominant terms are websites (n = 61), phishing websites (n = 31), phishing detection (n = 29), phishing (n = 55), detection rate (n = 12), and false positive rate (n = 7). Internally, the strongest edges are between websites–phishing (~5.75), websites–phishing detection (~4.38), and phishing–phishing websites (~3.36). This subgraph is densely connected, with websites and phishing each linked to 24 of the 24 other selected terms (Links = 24), and phishing detection to 22. These degree counts confirm that the cluster is tightly embedded in the network’s conceptual core.
Cross-cluster connectivity is also strong: websites links to machine learning (~4.59), learning systems (~3.67), classification (~2.86), and electronic mail (~2.55), among others. Phishing detection also bridges to the machine-learning-centric cluster through links to decision trees, support vector machines, and random forests. This high degree of integration indicates that website-based phishing remains a dominant testbed for evaluating learning algorithms, particularly for classification tasks and metrics such as detection rates and false positive rates.
The practical implication of this structure is twofold. First, it suggests that many detection systems, especially those benchmarked in this literature, have been trained and tested on datasets derived from phishing websites. Second, because these website-centered terms are highly connected to general learning methods and metrics, results from such studies may not generalize to other delivery channels (e.g., e-mail or malware). Therefore, reporting performance per delivery channel becomes essential. Without such disaggregation, conclusions drawn from web-based benchmarks may be incorrectly extrapolated to e-mail or malware contexts, despite the structural and behavioral differences between them. This is particularly important in studies that reuse similar learning pipelines across multiple types of data; separation helps avoid conflating distinct detection challenges and feature spaces.
Cluster 4 (yellow; neural networks + social networking)
This cluster groups together the keywords phishing attack (n = 22), neural networks (n = 10), and social networking (online) (n = 8). It forms a distinct but peripheral area on the map, with relatively low frequencies and total link strengths. The strongest internal links are phishing attack–neural networks (~1.19) and phishing attack–social networking (online) (~0.79), while neural networks and social networking are weakly connected to each other and to the rest of the network. All three terms exhibit lower external integration compared to the main ML-related nodes.
Despite these limitations, phishing attack is linked to 24 other terms on the map, including phishing (~1.67), phishing detection (~1.27), machine learning (~1.86), and multiple learning methods such as decision trees (~0.62), support vector machines (~0.37), and deep learning (~1.06). These connections confirm that phishing attack functions as the conceptual hub of the cluster and serves as a bridge to the core ML vocabulary.
The presence of neural networks and social networking (online) in this cluster suggests that these publications investigate phishing attacks on social media platforms using neural architectures. However, the relative isolation of social networking (Links = 16) and the weak integration of neural networks (Links = 20) imply that this direction is still underrepresented in the dataset. Stronger ties between phishing attack and central terms like phishing detection and machine learning confirm topical alignment, but the low co-occurrence weights suggest that this area remains niche.
Practical significance (cautious): The limited size and sparse connectivity of this cluster imply that the use of neural networks for detecting phishing on social networking platforms is still emerging. The strong dependence on phishing attack as a bridging term and the weak ties of neural-networks and social networking (online) to the broader ML ecosystem highlight a potential gap in the literature. This suggests a need for more studies applying neural network architectures in social media contexts, with attention to platform-specific features and evolving threat models.
Cluster 5 (purple; phishing detection + cybersecurity + deep learning)
This cluster comprises phishing detection (n = 29), cybersecurity (n = 27) and deep learning (n = 24). Internal links are moderate: phishing detection–cybersecurity (2.13), phishing detection–deep learning (2.08) and cybersecurity–deep learning (0.92). Each keyword connects to almost every other node in the network (cybersecurity = 24 links; phishing detection = 23; deep learning = 23), exhibiting high network centrality rather than dense intra-cluster cohesion. Such “connector-hub” behavior (high degree centrality with weaker internal density) matches patterns described in bibliometric network theory [145].
Strong cross-cluster ties reinforce this bridging role: deep learning–websites (2.85) and cybersecurity–websites (2.07) link to the web-centric cluster; deep learning–phishing (2.51) and phishing detection–machine learning (2.63) anchor the group to classical ML topics; phishing detection also couples to decision trees (1.01) and support vector machines (0.55). Connections to electronic mail (1.22, 1.17, 0.52, respectively) show that research framed by this cluster spans multiple delivery channels outlined in Section 3.
Practical significance (cautious). The mixture of deep-learning terms with classical models and several attack channels (Websites, Electronic Mail, Social Networking) suggests that neural architectures are typically evaluated alongside, not in isolation from, traditional algorithms. Comparative studies that disclose full model configurations and report channel-specific metrics remain essential for reproducibility and for quantifying the incremental benefit of deep models.
Cluster 6 (teal; deep neural networks)
This cluster is a single-node group, containing only deep neural networks (n = 12); consequently, no internal edges exist. However, the term links to 23 of the other 24 keywords (Links = 23), giving it a total-link-strength of 12.00 and marking it as a narrowly defined yet well-connected node in the overall map.
The strongest outward links are to websites (1.23), computer crime (1.23), learning systems (1.10), phishing (1.08) and deep learning (0.67). Additional edges above 0.60 connect to learning algorithms (0.65), phishing attack (0.65) and phishing websites (0.62). These values—lower than the top weights in Clusters 1–5—confirm that deep neural networks function as a bridge term referenced across web-centric, crime-focused and learning-method studies rather than as the nucleus of a cohesive sub-topic.
Practical significance (cautious). The single-node status reveals a vocabulary split: some papers prefer the generic label deep learning, others the more specific deep neural networks. Keeping the terms distinct preserves fidelity to the source dataset. Subsequent sections will discuss results under broader headings, but in this section the two labels remain separate to reflect the Scopus classification exactly.
Cluster 7 (orange; phishing core term)
This cluster is a single-node group containing only phishing (n = 55). Because no companion keywords belong to the same cluster, there are no internal edges. Even so, phishing links to every other selected keyword (Links = 24) and has the highest total-link-strength in the map (TLS = 55), confirming its role as the conceptual hub of the entire network.
The strongest outward edges tie phishing to websites (6.77), computer crime (6.02) and machine learning (5.66). Additional high-weight links include learning systems (3.01), electronic mail (2.89), classification (of information) (2.69), phishing websites (2.65), deep learning (2.51), phishing detection (2.33) and malware (2.25). This pattern shows that the term acts as an all-purpose connector across every attack channel and methodological family represented in the corpus.
Practical significance (cautious). The single-node status illustrates how a broad, domain-wide keyword can dominate co-occurrence metrics, potentially masking finer distinctions among delivery channels or model types. Retaining phishing as a standalone label preserves fidelity to the Scopus dataset; however, later analytical sections will treat this core term as an overarching context, while narrower keywords (e.g., phishing websites, phishing detection) provide channel- and task-specific detail.
To keep the keyword map aligned with the goals of this review, we applied a minimum-occurrence threshold of five, fractional counting, and a “top-25 most relevant terms” filter. These settings reduce visual noise and stabilize co-occurrence statistics, ensuring that the visualization highlights the core vocabulary and its strongest relationships.
Within the resulting map, frequency and connectivity coincide: computer crime, websites, phishing, and machine learning are simultaneously the most frequent (n ≈ 50–68) and the most strongly linked (TLS ≈ 50–68). No rare yet structurally central keywords appear, and no very frequent but weakly connected ones emerge. Consequently, the network exhibits a compact conceptual core dominated by a small set of broadly framed terms, with color-coded clusters aligning closely to the delivery channels defined in Section 3.
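For readers who wish to reproduce such statistics outside VOSviewer, the following minimal Python sketch illustrates occurrence counting, a simplified pair-weighting in the spirit of fractional counting (VOSviewer's exact fractional-counting formula differs in detail), and the minimum-occurrence filter; the input records are placeholders rather than the Scopus corpus.

```python
# Illustrative sketch of keyword co-occurrence counting with a
# minimum-occurrence filter. The two input records are placeholders,
# and the pair weighting is a simplified variant of fractional counting.
from collections import Counter
from itertools import combinations

papers = [
    ["phishing", "machine learning", "websites"],
    ["phishing", "deep learning"],
]  # one author/index keyword list per publication

occurrences, links = Counter(), Counter()
for keywords in papers:
    keywords = sorted(set(keywords))
    occurrences.update(keywords)
    pairs = list(combinations(keywords, 2))
    for pair in pairs:
        links[pair] += 1.0 / len(pairs)  # each paper contributes total weight 1

MIN_OCC = 5  # threshold used in this review
kept = {k for k, n in occurrences.items() if n >= MIN_OCC}
total_link_strength = Counter()
for (a, b), w in links.items():
    if a in kept and b in kept:
        total_link_strength[a] += w  # TLS = sum of a keyword's link weights
        total_link_strength[b] += w
```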

4.2. Trends in Global Phishing Activity

Table 2 presents a numbered review of scientific articles published between 2017 and 2024 that focus on machine learning and neural networks for phishing detection from different perspectives. To contextualize the evolution of these methods, our primary metric is the number of unique phishing websites detected in each quarter, which serves as a reliable indicator of the overall scale and evolution of phishing attacks over time. We follow the APWG reporting convention for “unique phishing websites” as documented in the quarterly reports; year-to-year definitional notes are enumerated in Supplement Table S1 and were taken into account during aggregation. Based on an analysis of phishing attack data reported by APWG [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] (Figure 4) between 2017 and 2024, we divided the period into two intervals to reflect significant changes in attack dynamics. The series comprises 32 quarterly observations (2017 Q1–2024 Q4) derived from the extraction sheet provided in the Supplement. The raw APWG data in CSV format (apwg_data.csv, Table S3) and the Python script used for analysis (trend_break_analysis.py, Table S4) are included in the Supplementary Materials.
For the APWG quarterly data, a two-segment model identifies a statistically significant structural break in Q3 2020 (F = 18.12, p = 0.00020), indicating a sharp increase in phishing activity. The analysis also reveals several other statistically significant candidate breaks around the 2020–2021 period (2020 Q1, 2020 Q2, 2020 Q4, 2021 Q1, and 2021 Q2), with the strongest statistical evidence for a fundamental shift in trend located in the second half of 2020. Together, these results justify dividing the timeline into two phases.
Figure 5 presents the annual distribution of publications in the analyzed corpus between 2017 and 2024. For annual publication counts, joinpoint tests are underpowered with only eight data points, but a Poisson block comparison shows a 2.28-fold higher publication rate in 2021–2024 versus 2017–2020 (95 percent CI 1.51–3.46).
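This block comparison can be reproduced from the period totals reported in Section 4.4 (32 publications in 2017–2020 and 73 in 2021–2024); the sketch below assumes only those two counts and a standard Wald interval on the log rate ratio, rather than reproducing the supplementary script.

```python
# Minimal sketch of the Poisson block-rate comparison: ratio of
# publication rates across two equal-length blocks, with a Wald
# interval on the log incidence rate ratio.
import math

def poisson_irr(count_post: int, count_pre: int,
                years_post: float = 4, years_pre: float = 4):
    irr = (count_post / years_post) / (count_pre / years_pre)
    se = math.sqrt(1 / count_post + 1 / count_pre)  # SE of log(IRR)
    lo = math.exp(math.log(irr) - 1.96 * se)
    hi = math.exp(math.log(irr) + 1.96 * se)
    return irr, lo, hi

print(poisson_irr(count_post=73, count_pre=32))
# -> approximately (2.28, 1.51, 3.46), matching the values above
```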
A complementary specification of the APWG quarterly series locates the structural break in 2021 Q2, with a 95 percent confidence interval spanning 2021 Q1 to 2021 Q3 (F = 11.65, p = 0.00021); a monthly reanalysis identifies April 2021 with comparable significance (F = 30.78, p < 1 × 10−10). Incidence rate ratios (IRR) were estimated with a Poisson generalized linear model using a post-2020 indicator; 95 percent confidence intervals are Wald intervals, and a Negative Binomial sensitivity analysis produces similar point estimates. Breakpoints were estimated using piecewise linear regression with a Chow-type comparison against a single-trend model and residual bootstrap for the break-date uncertainty; joinpoint regression is reported as a sensitivity check. Taken together, the quarterly and monthly analyses bracket the regime change between mid-2020 and mid-2021. Consequently, separating the timeline into two distinct phases, pre-2021 (moderate growth) and post-2021 (high-intensity attacks), enables more accurate trend analysis and contextual interpretation of technological advancements in detection methods, particularly those leveraging machine learning models and neural network architectures.
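As a simplified stand-in for the supplementary trend_break_analysis.py (Table S4), the sketch below scans candidate break dates with a Chow-type comparison of a single linear trend against two independent segments; the CSV column names are assumptions, not the supplementary schema.

```python
# Simplified Chow-type structural-break scan over quarterly counts.
# Column names "quarter" and "unique_phishing_sites" are assumptions.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("apwg_data.csv")
y = df["unique_phishing_sites"].to_numpy(dtype=float)
t = np.arange(len(y), dtype=float)

def rss(t_seg, y_seg):
    # Residual sum of squares of an OLS line fit to one segment.
    slope, intercept = np.polyfit(t_seg, y_seg, 1)
    resid = y_seg - (slope * t_seg + intercept)
    return float(resid @ resid)

def chow_f(k):
    # Pooled single-trend fit vs. two independent linear segments
    # split after observation k; two parameters per segment.
    rss_pooled = rss(t, y)
    rss_split = rss(t[:k], y[:k]) + rss(t[k:], y[k:])
    dof = len(y) - 4
    f = ((rss_pooled - rss_split) / 2) / (rss_split / dof)
    return f, stats.f.sf(f, 2, dof)

for k in range(4, len(y) - 4):  # keep a few observations per segment
    f, p = chow_f(k)
    if p < 0.01:
        print(df["quarter"].iloc[k], round(f, 2), p)
```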

4.3. Categorization Framework for Analyzed Publications

Table 3 presents a quantitative review of scientific articles published between 2017 and 2024, showing the number of publications across predefined categories and features related to machine learning and neural networks for phishing detection.
The categorization applied for the analysis of publications is structured into three main dimensions: Phishing Delivery Channels, Machine Learning Models and Techniques, and Research Methodology (Table 3). This approach allows for a systematic examination of studies based on both the nature of phishing threats and the technical solutions proposed for detection.
The first category, Phishing Delivery Channels, includes four primary vectors through which phishing attacks are executed: Websites, Malware, Electronic Mail, and Social Networking. These channels represent the main media exploited by attackers, enabling the differentiation of research based on the attack surface. Grouping by delivery channel is essential because defensive strategies and detection mechanisms often vary significantly depending on the context (e.g., email-based phishing vs. website-based phishing).
The second category, Machine Learning Models and Techniques, focuses on the machine learning and neural networks approaches utilized for phishing detection: Machine Learning, Neural Networks, Classification and Ensembles, and Feature Engineering. This categorization enables evaluation of the specific algorithms, learning paradigms, and feature selection strategies applied in the studies. It is justified by the need to understand not only which algorithms are employed but also how feature engineering contributes to detection performance, as it often plays a critical role in phishing detection systems.
The third category, Research Methodology, addresses the methodological basis of the studies: Experiment, Literature Analysis, Case Study, and Conceptual. This classification reflects the level of empirical validation and scientific rigor of the research. Experimental studies typically provide quantitative performance metrics, while conceptual papers may introduce theoretical frameworks or new models without extensive testing.
This multidimensional classification provides a comprehensive lens for analyzing research from three perspectives: the problem domain (delivery channels), the applied solution (machine learning methods), and the scientific approach (research methodology). It ensures comparability across studies and highlights trends, strengths, and gaps in the existing literature.
It is important to note that the publication counts within individual categories do not sum to 105 (or 100%), as some studies were classified under multiple categories. This overlap occurs because a single publication may address several delivery channels, apply different machine learning techniques, or combine various research methodologies. Consequently, a strictly mutually exclusive classification was not possible, and the categorization should be interpreted as a representation of thematic coverage rather than of distinct groups.
The distribution of publications across phishing delivery channels (Figure 6) indicates a clear research focus on web-based phishing. A total of 61 studies (approximately 58% of the analyzed sample) addressed the detection of phishing on websites. In comparison, 26 publications (≈25%) explored malware-related phishing, while 23 studies (≈22%) concentrated on phishing through electronic mail. Only 8 studies (≈8%) investigated threats originating from social networking platforms. This pattern remained relatively stable across the examined periods, confirming the persistent dominance of website-based phishing as the primary research area.
The Machine Learning Models and Techniques category (Figure 7) encompasses various approaches and components applied in phishing detection research. The Machine Learning subcategory includes studies that utilize general supervised [83,85,97,101,104,111,113,126], semi-supervised [109,124], unsupervised [97] or mixed [85,131] learning models for phishing detection.
The Neural Networks subcategory covers research employing deep learning [41,44,49,54,64,66], convolutional neural networks (CNN) [40,50] or artificial neural networks [65,68] to classify phishing threats.
The Classification and Ensembles subcategory refers to approaches that combine multiple classifiers (e.g., Random Forest, boosting) to improve prediction performance [42,51,64,71].
The Feature Engineering subcategory involves techniques for selecting, extracting, and optimizing input features to enhance model accuracy and reduce complexity [38,44,48,73].
The category Research Methodology (Figure 8) refers to the approach adopted by authors to conduct their research. It includes experimental research [66,68,70], where models such as machine learning algorithms or neural networks are implemented and tested on datasets to evaluate performance. Literature analysis [71,138,139] involves reviewing and synthesizing existing research to identify trends and techniques. The Case Study category involves practical research conducted in real-world environments, such as developing phishing email detection models using actual company data [134] or implementing real-time spear phishing detection within organizational networks to validate effectiveness in operational settings [120]. Conceptual research [66,72] introduces new frameworks, models, or theoretical concepts without extensive experimental validation.

4.4. International Research Contributions in Phishing Detection

Figure 9 presents the distribution of phishing detection research publications using ML and NN across countries between 2017 and 2024. The timeline is split into two subperiods, 2017–2020 and 2021–2024, allowing observation of temporal trends in research activity.
During 2017–2020, the total output across all countries was 32 publications. This number more than doubled in the subsequent period (2021–2024), reaching 73 publications, indicating a clear acceleration in global research efforts. In total, 105 publications were identified for the full period.
India leads the ranking with 34 publications (32.38% of all records). The country shows strong growth, increasing from 9 publications in the first period to 25 in the second, suggesting a significant expansion of academic and institutional engagement in ML- and NN-based phishing detection research.
Saudi Arabia holds the second position with 15 publications (14.29%), also showing a positive trend—from 5 to 10 publications between the two periods. China follows with 12 publications (11.43%), maintaining steady growth from 5 to 7 publications.
Jordan and the United States each contributed 7 publications (6.67%), with Jordan showing a sharp increase (from 1 to 6), while the United States exhibited a more gradual rise (from 2 to 5). Malaysia’s output grew from 2 to 4 publications, for a total of 6 (5.71%). The United Kingdom produced 5 publications (4.76%) over the period, with a modest increase from 2 to 3.
Notably, Pakistan and the United Arab Emirates contributed no publications in the first period but entered the field in 2021–2024 with 4 publications each (3.81%). This emergence may reflect a recent strategic focus or the establishment of new research programs.
The “Other” category, encompassing all remaining countries, accounts for 24 publications (22.86%), increasing from 8 to 16 publications.
The data reveal a significant increase in global research activity on phishing detection using Machine Learning and Neural Networks, with publication output more than doubling between the first and second periods. This upward trend confirms the growing importance of the topic in the international cybersecurity agenda. Notably, the entry of Pakistan and the United Arab Emirates in the later years suggests the emergence of new regional initiatives and the possible influence of targeted funding schemes. India stands out as the leading contributor, combining the highest publication volume with consistent growth, which points to a strong academic and industrial foundation in learning-based research for cybersecurity. The rising share of the “Other” group indicates a gradual broadening of participation, with more countries contributing to the field despite lower individual outputs. In addition, the substantial presence of Saudi Arabia, Jordan, and the United Arab Emirates highlights the Middle East as an emerging region of interest, reflecting increasing investment in learning-based security solutions.

4.5. Technical and Methodological Approaches to Phishing Detection by Channel

The purpose of this section is to quantify and interpret how research approaches are distributed across phishing delivery channels (Table 4). Using numerical data, descriptive statistics, and visual representations, this section identifies dominant research strategies, notes methodological trends, and highlights underexplored intersections that offer opportunities for further study. Shares are calculated within each channel (not against the 105-publication corpus). Percentages therefore reflect the proportion of occurrences within each channel, and counts are shown in parentheses. Totals across channels or categories may exceed 105 because individual publications can be coded to multiple categories and, in some cases, to multiple channels.
Websites. Among the Machine Learning Models and Techniques topics (Figure 10), Machine Learning accounts for 36% (46 documents), Neural Networks for 21% (27), Classification and Ensembles for 26% (33), and Feature Engineering for 17% (22). This mix indicates a balanced focus between model-centric work and feature-driven design for web data. In methodology (Figure 11), experimental studies dominate at 58% (58 documents), with literature analysis at 20% (20) and conceptual contributions at 22% (22). No case studies are recorded, 0% (0). The prevalence of experiments suggests dataset-based evaluation pipelines for website phishing detection, while the share of conceptual work indicates ongoing refinement of problem framing and architectures.
Malware. For the Machine Learning Models and Techniques topics (Figure 10), Machine Learning accounts for 49% (22 documents), Classification and Ensembles for 27% (12), Neural Networks for 16% (7), and Feature Engineering for 9% (4). The pattern emphasizes general machine-learning solutions and ensemble strategies, while explicit feature-engineering reports are less common. Methodologically, experiments again lead with 57% (21), followed by literature analysis at 27% (10) and conceptual work at 16% (6); case studies are not present, 0% (0). This distribution points to sustained empirical testing, with secondary emphasis on evidence synthesis and problem conceptualization.
Electronic Mail. Topic shares are Machine Learning 39% (20), Classification and Ensembles 29% (15), Neural Networks 18% (9), and Feature Engineering 14% (7). The profile is more evenly spread across learning approaches than in malware. Methodologically, experiments constitute 52% (22), literature analysis 29% (12), conceptual work 14% (6), and case studies 5% (2). Notably, case studies appear only in this channel, indicating efforts to situate findings in concrete organizational or campaign contexts [120,134].
Social Networking (online). Topic shares are Machine Learning 54% (7), Neural Networks 31% (4), and Classification and Ensembles 15% (2). The emphasis falls on learning-driven approaches, with a comparatively high share for Neural Networks. In methodology, experiments account for 40% (6), conceptual work 33% (5), and literature analysis 27% (4). The relatively greater weight of conceptual contributions suggests this channel is still consolidating tasks, data representations, and evaluation standards. Interpretations for this channel should nevertheless be made with caution, given the small base of eight unique publications (cf. Figure 6).
Machine Learning Models and Techniques topics concentrate on the websites phishing delivery channel (Figure 10); in this cross-channel view, shares are computed against the full 105-publication corpus. For Machine Learning, the distribution is Websites 44% (46), Malware 21% (22), Electronic Mail 19% (20), and Social Networking 7% (7). For Neural Networks, the distribution is Websites 26% (27), Malware 7% (7), Electronic Mail 9% (9), and Social Networking 4% (4). For Classification and Ensembles, the distribution is Websites 31% (33), Malware 11% (12), Electronic Mail 14% (15), and Social Networking 2% (2). For Feature Engineering, the distribution is Websites 21% (22), Malware 4% (4), Electronic Mail 7% (7), and Social Networking 0% (0).
The results indicate a clear channel hierarchy. Websites concentrate the majority of work across topics and methods. Malware and electronic mail receive moderate but steady attention. Social networking remains underrepresented, including no instances of feature engineering, 0% (0). Case studies are almost absent and occur only in electronic mail, 2% (2). These gaps highlight opportunities for deeper empirical and design-oriented studies in social networking and for more case-based evaluations across all channels.
In response to this gap, recent research explores transformer-based models for detecting fake or inauthentic profiles on social platforms. For instance, [147] introduces an encoder-only, attention-guided Transformer that captures profile and behavioral signals using positional encodings and multi-head self-attention. The attention weights emphasize attributes such as follower count, number of favorites, and total posts. Hyperparameters are optimized using a Tree-structured Parzen Estimator. This method reduces dependence on manual feature engineering and offers built-in explainability to support triage workflows. We reference [147] to highlight the relevance of social-media impersonation, which enables pretext creation and dissemination in phishing campaigns.
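As an illustration only, and not the implementation from [147], the sketch below shows how tabular profile attributes can be treated as tokens by a small encoder-only Transformer classifier; the feature count, dimensions, and pooling choice are assumptions, and in practice hyperparameters could be tuned with a TPE sampler (e.g., Optuna's TPESampler).

```python
# Hedged sketch of an encoder-only attention classifier over tabular
# profile features (e.g., follower count, favourites, post totals).
# All dimensions and the feature set are illustrative assumptions.
import torch
import torch.nn as nn

class ProfileEncoder(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        # Each scalar feature becomes one "token" via a learned projection.
        self.embed = nn.Linear(1, d_model)
        # Learned positional encoding, one position per feature.
        self.pos = nn.Parameter(torch.zeros(n_features, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # fake/genuine logit

    def forward(self, x):                              # x: (batch, n_features)
        tokens = self.embed(x.unsqueeze(-1)) + self.pos  # (batch, n_feat, d)
        pooled = self.encoder(tokens).mean(dim=1)        # mean-pooled encoding
        return self.head(pooled).squeeze(-1)

model = ProfileEncoder(n_features=6)
logits = model(torch.randn(8, 6))  # dummy batch of 8 profiles
```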
A complementary research direction involves agentic Large Language Model (LLM) pipelines that leverage social-media streams as sources of cyber threat intelligence. Retrieval-augmented agents can collect suspicious posts and profiles, contextualize them using open-source reports, and integrate entities and tactics into a knowledge graph to assist analysts in triage and attribution. This line of work also motivates the development of multimodal defenses against deepfakes and chatbot-driven social engineering, alongside time-sensitive evaluation methods for rapidly evolving campaigns [148].
Another related approach adapts reasoning-centric, multimodal link analysis—originally developed for email security—to social-network content. Recent findings on phishing-email URLs demonstrate improved accuracy when models receive layered metadata for each link, including domain information, certificate details, regulatory filings, browser context, and Optical Character Recognition (OCR) of rendered previews, and when they generate explanations prior to predictions. Applying this framework to social-network posts involves combining post text, account metadata, rendered previews, and explanation-first prompting, thereby enhancing robustness and operator trust [149].

4.6. Common Validity Threats Observed in the Reviewed Studies

Across the papers summarized in Table 2, we observed recurring threats to validity that can inflate headline metrics and hinder reproducibility. The most frequent problem is overlap between training and test URL lists. When studies aggregate feeds such as PhishTank or Alexa-derived benign sets without careful deduplication, near-duplicate or identical URLs can appear on both sides of a split, which makes the task easier than it would be in deployment. Clear examples include combined-source evaluations without documented cross-split deduplication or host/domain isolation [37,39,56,60], and cases where authors explicitly acknowledge duplicate-related limitations [34]. Potential near-duplicate overlap across cross-validation folds is likewise noted in a deep sequence model study [81]. These patterns justify multi-granularity deduplication and host- or campaign-aware splitting before any partitioning, as sketched below.
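A minimal sketch of such host-aware splitting, assuming a flat CSV of labeled URLs (the column names "url" and "label" are illustrative) and the third-party tldextract package, might look as follows.

```python
# Sketch of host-aware splitting to avoid near-duplicate URL leakage:
# all URLs sharing a registrable domain (eTLD+1) end up on one side.
import pandas as pd
import tldextract
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("urls.csv").drop_duplicates(subset="url")  # exact dedup first

def registrable_domain(url: str) -> str:
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

groups = df["url"].map(registrable_domain)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=groups))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# No host now appears in both train_df and test_df.
```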
A second pattern is temporal leakage caused by random splits. Phishing ecosystems evolve quickly, yet many studies use random hold-out or cross-validation that mixes older and newer examples, allowing future information to influence training [38,56,70,71]. In contrast, one study reports both random and date-based splits and observes drift over time, which illustrates the importance of time-aware validation in this domain [43]. Forward-chaining or blocked evaluation aligned to collection windows would better reflect operational conditions. This matters operationally because phishing distributions are non-stationary. Models trained on stale snapshots experience covariate and concept drift, which degrades precision and increases false negatives on novel campaigns. Time-aware validation and continuous refresh of training data are therefore required for any claim of deployable performance.
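For illustration, a forward-chaining evaluation aligned to quarterly collection windows could be organized as follows; the "collected_at" column is an assumption about how snapshot dates are stored.

```python
# Sketch of forward-chaining (time-aware) evaluation: train only on
# data collected strictly before each test window.
import pandas as pd

df = pd.read_csv("urls.csv", parse_dates=["collected_at"])
df = df.sort_values("collected_at")
quarters = df["collected_at"].dt.to_period("Q")

for test_q in quarters.unique()[1:]:
    train = df[quarters < test_q]   # strictly earlier collection windows
    test = df[quarters == test_q]   # one held-out window
    # Fit on train and evaluate on test here; reporting per-window
    # metrics keeps drift across quarters visible instead of averaged away.
```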
We also noted leakage in model selection and preprocessing. Hyperparameters are often tuned and performance estimated within the same resampling scheme, without a nested protocol, which yields optimistic error estimates [37,41,66,80,81,88,92,94]. In several papers, feature selection, oversampling, or representation learning are applied to the entire dataset before splitting or across folds, which propagates test information into training [38,49,54,55,73,78]. A defensible workflow fits all preprocessing steps inside each training fold and uses a separate outer loop for final estimation.
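Such a nested protocol can be expressed compactly with scikit-learn; the sketch below uses synthetic data and illustrative hyperparameter grids, with feature selection placed inside the pipeline so that it is refit within each training fold only.

```python
# Sketch of a nested protocol: preprocessing and tuning live in the
# inner loop, so the outer estimate never sees test information.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest()),                 # fitted within each fold only
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {"select__k": [10, 20], "clf__n_estimators": [100, 300]}
inner = GridSearchCV(pipe, param_grid, cv=3)    # hyperparameter tuning
outer_scores = cross_val_score(inner, X, y, cv=5)  # unbiased outer estimate
print(outer_scores.mean())
```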
Metric choice can also hide class imbalance effects. Accuracy alone is frequently reported on imbalanced datasets, which can obscure practical precision at realistic alert budgets [49,50,68,71,83]. Imbalance-aware summaries like precision, recall, F1, ROC AUC, and MCC, accompanied by the predicted positive rate or threshold policy, provide a more informative picture of utility.
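The toy example below, using placeholder labels and scores rather than results from any reviewed study, shows how such a summary exposes what accuracy alone hides.

```python
# Imbalance-aware reporting alongside accuracy on a deliberately
# imbalanced toy problem (8 benign, 2 phishing).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # placeholder hold-out labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # placeholder hard predictions
y_score = [.1, .2, .1, .3, .2, .1, .2, .6, .9, .4]  # placeholder scores

print("accuracy ", accuracy_score(y_true, y_pred))   # looks strong: 0.8
print("precision", precision_score(y_true, y_pred))  # only 0.5
print("recall   ", recall_score(y_true, y_pred))     # only 0.5
print("F1       ", f1_score(y_true, y_pred))
print("ROC AUC  ", roc_auc_score(y_true, y_score))
print("MCC      ", matthews_corrcoef(y_true, y_pred))
print("pred-pos rate", sum(y_pred) / len(y_pred))    # alert budget context
```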
Finally, limited transferability and incomplete documentation reduce comparability. Many studies evaluate on a single dataset or use within-dataset splits only, so generalization across sources or time remains untested [49,50,68,71]; where across-time evaluation is attempted, performance drift appears [43]. Several papers also omit critical dataset hygiene details such as acquisition windows, total usable items after cleaning, or explicit deduplication procedures, which complicates replication and may bias results [37,38,39,65,90]. Transparent reporting of snapshot dates, cleaning outcomes, and exact split protocols is essential for credible evidence.

4.7. Limitations

An important consideration is the nature and quality of the datasets used in the reviewed studies. Dataset quality can be evaluated from various angles, and there is no universally agreed-upon definition of what makes a dataset high quality. Hence, no security dataset for challenges such as phishing, whether related to emails, websites, or URLs, can be considered complete. Many works rely on publicly available repositories such as the UCI Machine Learning Repository [36,38,39,88], Kaggle [36], PhishTank [37,38], MillerSmiles [37], ISCX-URL2016 [38], OpenPhish [38], Mendeley [38,88], and Phish_NetDS [39]. Their use carries certain risks: a large proportion of phishing URLs become inactive within a short time after collection, which can reduce the representativeness and relevance of the data, and there may be overlaps between datasets from different repositories. Moreover, public datasets often do not provide real-world validation, which can limit the generalizability of the findings. Such issues can affect the robustness and reproducibility of reported results [150]. As distributions shift, models trained on such stale datasets tend to underperform on emerging campaigns, which underscores the need for recency controls and time-aware evaluation.
To make these limitations explicit at the study level, we annotated them in Table 2. The “Data quality” column records URL verification or recrawl practices, deduplication, and snapshot descriptions that surface the risk of outdated or dead links. “Class balance” captures shifts in prevalence that may vary with time and source. “External lists/metadata used” identifies the public feeds and auxiliary metadata and notes timing or provenance when reported. “Risk of data leakage” flags reuse or overlap between training and test splits and cross-source collisions. “Validation method” distinguishes temporal from random splits, which is crucial as URL liveness declines and labels age. “Model selection procedure” records that model selection was non-nested or not reported across all included studies, and marks this as a validity risk when combined with dataset quirks. “Evaluation/system metrics” documents what was measured and the execution context, which is relevant when unreachable or stale URLs could skew outcomes. “Handling of class imbalance” summarizes whether rebalancing or appropriate metrics were applied, since imbalance often co-occurs with partial or aging datasets.
This review covers studies published from 2017 to 2024, identified through searches conducted in the Scopus database. After applying our inclusion and exclusion rules, 105 records remained, but full texts were unavailable for 15 of them; those items were analyzed based on metadata alone. The studies report different metrics and use varied datasets, which limits direct comparisons. These constraints narrow the range of evidence and call for caution when generalizing the findings.
Findings reflect the state of the art as of 31 December 2024. Early 2025 publications were excluded to avoid partial-year bias and indexing lag; incorporating 2025 will require a separate update.
The conclusions drawn from the VOSviewer map should be interpreted within several boundaries. The co-occurrence network reflects the thresholds and settings chosen here: a minimum of five occurrences, fractional counting, and the selection of the twenty-five “most relevant” terms out of 737 keywords. Changing any of these parameters may alter cluster structure, rankings, or link strengths [146]. All network measures are correlational; strong links indicate frequent co-mentioning, not causal relationships, and the absence of a link does not prove conceptual independence, which may simply reflect the threshold [151]. Finally, heterogeneous reporting practices across studies (e.g., different keyword conventions, incomplete keyword lists) introduce noise that can bias term frequencies and connectivity [152]. Together, these factors mean that the findings are specific to this corpus and configuration and should not be generalized uncritically beyond them.
In summary, the limitations identified in this review stem from three main areas: the inherent imperfections of publicly available phishing datasets, the scope restrictions of a literature corpus sourced exclusively from Scopus searches, and the methodological constraints of the bibliometric analysis. These factors influence the robustness, representativeness, and generalizability of the evidence, underscoring the need for cautious interpretation of the findings.

5. Conclusions

The discussion confirms that phishing detection research using ML and NN is concentrated around a compact set of high-frequency, strongly connected concepts, with “phishing”, “websites”, “computer crime”, and “machine learning” forming the conceptual core. The analysis of global phishing activity shows a marked escalation in attacks after 2021, paralleled by a significant rise in scientific output. Across delivery channels in the analyzed corpus of published articles, websites dominate as the primary focus, while malware and electronic mail receive moderate attention and social networking remains underrepresented. Methodologically, experimental studies prevail, supported by literature analyses and conceptual works, with case studies appearing rarely and almost exclusively in the electronic mail context. Internationally, India leads in publication volume, with notable growth also observed in Saudi Arabia, China, and emerging contributors such as Pakistan and the United Arab Emirates. The cross-tabulations of techniques and methodologies by channel highlight opportunities for expanding research in underexplored areas, particularly neural-network-based detection on social media platforms and case-based evaluations across all channels.
Social networking emerges as the sparsest yet most heterogeneous channel in the corpus. Studies span in-stream Twitter/X phishing detection pipelines that fuse URL and content features with lightweight neural classifiers [136]; malicious-profile detection using hybrid LSTM-CNN architectures applied to user metadata [103]; reinforcement learning-augmented feature extraction for social-media URLs [137]; and graph-based defenses against Sybil/bot infiltration, which threaten trust signals and amplify phishing reach [138]. Despite this methodological breadth, practical progress is repeatedly gated by two constraints: (i) inconsistent or restricted access to platform data (including API limits and dataset takedowns), and (ii) frequent changes to platform policies and terms that break pipelines or preclude replication, hindering longitudinal evaluation and cross-platform generalization [137,138].
Future research priorities must remain tightly coupled to the current tactics, techniques, and procedures of the criminal ecosystem. As threat actors pivot delivery channels and lures, research agendas should align with operational threat intelligence so that datasets, taxonomies, and benchmarks reflect the live attack mix. Concretely, incorporating signals from quarterly APWG Phishing Activity Trends Reports (e.g., Q1 2025’s 1,003,924 observed attacks and the rise of QR-code “quishing”) helps identify emerging vectors that merit rapid methodological attention [153]. The European Union Agency for Cybersecurity (ENISA) publishes Threat Landscape analyses that track the prevalence of phishing and related scams across sectors and regions, providing context on regional priorities and underexplored channels [154]. Vendor threat-intelligence reports such as the Microsoft Digital Defense Report [155] and advisories from the Cybersecurity and Infrastructure Security Agency (CISA) indicate which evasion patterns and delivery paths are gaining traction, for example adversary-in-the-middle attacks and token theft (https://www.cisa.gov).
In the future, the integration of large language models (LLMs) into operational environments will significantly impact phishing detection, as these models can identify diverse phishing content. However, LLMs also pose a major threat, since they enable the generation of high-quality phishing content, which means that security measures must become increasingly advanced.
In parallel, multimodal technologies can increase the accuracy of phishing detection by analyzing textual, visual, and audio data. This approach can make systems more sensitive to behaviors that were previously difficult to diagnose because different data types could not be combined into a single representation. Closer cooperation between researchers and industry is therefore necessary to bring such solutions into practice more quickly.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/electronics14183744/s1, Table S1. scopus.csv (raw Scopus query results), Table S2. thesaurus_mapping.csv (thesaurus mapping file), Table S3. apwg_data.csv (phishing attack counts dataset compiled for the study), Table S4. trend_break_analysis.py (Python script for analyzing trends in APWG phishing data and identifying statistically significant breakpoints).

Author Contributions

Conceptualization, G.W.-J.; methodology, G.W.-J.; software, L.P.; validation, J.L.W.-J., L.P. and A.S.; formal analysis, L.P. and A.S.; investigation, L.P.; resources, L.P.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.S., J.L.W.-J., G.W.-J. and L.P.; visualization, L.P. and A.S.; supervision, J.L.W.-J.; project administration, J.L.W.-J., G.W.-J., L.P. and A.S.; funding acquisition, J.L.W.-J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2017; Anti-Phishing Working Group: Lexington, KY, USA, 2017.
2. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2017; Anti-Phishing Working Group: Lexington, KY, USA, 2017.
3. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2017; Anti-Phishing Working Group: Lexington, KY, USA, 2017.
4. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2017; Anti-Phishing Working Group: Lexington, KY, USA, 2017.
5. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2018; Anti-Phishing Working Group: Lexington, KY, USA, 2018.
6. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2018; Anti-Phishing Working Group: Lexington, KY, USA, 2018.
7. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2018; Anti-Phishing Working Group: Lexington, KY, USA, 2018.
8. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2018; Anti-Phishing Working Group: Lexington, KY, USA, 2018.
9. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2019; Anti-Phishing Working Group: Lexington, KY, USA, 2019.
10. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2019; Anti-Phishing Working Group: Lexington, KY, USA, 2019.
11. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2019; Anti-Phishing Working Group: Lexington, KY, USA, 2019.
12. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2019; Anti-Phishing Working Group: Lexington, KY, USA, 2019.
13. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2020; Anti-Phishing Working Group: Lexington, KY, USA, 2020.
14. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2020; Anti-Phishing Working Group: Lexington, KY, USA, 2020.
15. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2020; Anti-Phishing Working Group: Lexington, KY, USA, 2020.
16. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2020; Anti-Phishing Working Group: Lexington, KY, USA, 2020.
17. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2021; Anti-Phishing Working Group: Lexington, KY, USA, 2021.
18. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2021; Anti-Phishing Working Group: Lexington, KY, USA, 2021.
19. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2021; Anti-Phishing Working Group: Lexington, KY, USA, 2021.
20. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2021; Anti-Phishing Working Group: Lexington, KY, USA, 2021.
21. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2022; Anti-Phishing Working Group: Lexington, KY, USA, 2022.
22. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2022; Anti-Phishing Working Group: Lexington, KY, USA, 2022.
23. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2022; Anti-Phishing Working Group: Lexington, KY, USA, 2022.
24. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2022; Anti-Phishing Working Group: Lexington, KY, USA, 2022.
25. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2023; Anti-Phishing Working Group: Lexington, KY, USA, 2023.
26. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2023; Anti-Phishing Working Group: Lexington, KY, USA, 2023.
27. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2023; Anti-Phishing Working Group: Lexington, KY, USA, 2023.
28. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2023; Anti-Phishing Working Group: Lexington, KY, USA, 2023.
29. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 1st Quarter 2024; Anti-Phishing Working Group: Lexington, KY, USA, 2024.
30. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 2nd Quarter 2024; Anti-Phishing Working Group: Lexington, KY, USA, 2024.
31. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 3rd Quarter 2024; Anti-Phishing Working Group: Lexington, KY, USA, 2024.
32. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report 4th Quarter 2024; Anti-Phishing Working Group: Lexington, KY, USA, 2024.
  33. Sheng, S.; Wardman, B.; Warner, G.; Cranor, L.; Hong, J.; Zhang, C. An Empirical Analysis of Phishing Blacklists. In Proceedings of the 6th Annual Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 13–14 August 2009; pp. 1–8. [Google Scholar]
  34. Rao, R.S.; Pais, A.R. Detection of Phishing Websites Using an Efficient Feature-Based Machine Learning Framework. Neural Comput. Appl. 2019, 31, 3851–3873. [Google Scholar] [CrossRef]
  35. Aburrous, M.; Hossain, M.; Dahal, K.; Thabtah, F. Intelligent Phishing Detection System for E-Banking Using Fuzzy Data Mining. Expert Syst. Appl. 2010, 37, 7913–7921. [Google Scholar] [CrossRef]
  36. Awasthi, A.; Goel, N. Phishing Website Prediction Using Base and Ensemble Classifier Techniques with Cross-Validation. Cybersecur 2022, 5, 22. [Google Scholar] [CrossRef]
  37. Hr, M.G.; Mv, A.; Gunesh Prasad, S.; Vinay, S. Development of Anti-Phishing Browser Based on Random Forest and Rule of Extraction Framework. Cybersecur 2020, 3, 20. [Google Scholar] [CrossRef]
  38. Gopal, S.B.; Poongodi, C. Mitigation of Phishing URL Attack in IoT Using H-ANN with H-FFGWO Algorithm. KSII Trans. Internet Inf. Syst. 2023, 17, 1916–1934. [Google Scholar] [CrossRef]
  39. Priya, S.; Selvakumar, S.; Velusamy, R.L. Evidential Theoretic Deep Radial and Probabilistic Neural Ensemble Approach for Detecting Phishing Attacks. J. Ambient Intell. Humaniz. Comput. 2023, 14, 1951–1975. [Google Scholar] [CrossRef]
  40. Wang, W.; Zhang, F.; Luo, X.; Zhang, S. PDRCNN: Precise Phishing Detection with Recurrent Convolutional Neural Networks. Secur. Commun. Netw. 2019, 2019, 2595794. [Google Scholar] [CrossRef]
  41. Ali, W.; Ahmed, A.A. Hybrid Intelligent Phishing Website Prediction Using Deep Neural Networks with Genetic Algorithm-Based Feature Selection and Weighting. IET Inf. Secur. 2019, 13, 659–669. [Google Scholar] [CrossRef]
  42. Feng, F.; Zhou, Q.; Shen, Z.; Yang, X.; Han, L.; Wang, J. The Application of a Novel Neural Network in the Detection of Phishing Websites. J. Ambient Intell. Humaniz. Comput. 2024, 15, 1865–1879. [Google Scholar] [CrossRef]
  43. Al-Alyan, A.; Al-Ahmadi, S. Robust URL Phishing Detection Based on Deep Learning. KSII Trans. Internet Inf. Syst. 2020, 14, 2752–2768. [Google Scholar] [CrossRef]
  44. Wazirali, R.; Ahmad, R.; Abu-Ein, A.A.-K. Sustaining Accurate Detection of Phishing URLs Using SDN and Feature Selection Approaches. Comput. Netw. 2021, 201, 108591. [Google Scholar] [CrossRef]
  45. Oram, E.; Dash, P.B.; Naik, B.; Nayak, J.; Vimal, S.; Nataraj, S.K. Light Gradient Boosting Machine-Based Phishing Webpage Detection Model Using Phisher Website Features of Mimic URLs. Pattern Recognit. Lett. 2021, 152, 100–106. [Google Scholar] [CrossRef]
  46. Jain, A.K.; Gupta, B.B. Two-Level Authentication Approach to Protect from Phishing Attacks in Real Time. J. Ambient Intell. Humaniz. Comput. 2018, 9, 1783–1796. [Google Scholar] [CrossRef]
  47. Mao, J.; Bian, J.; Tian, W.; Zhu, S.; Wei, T.; Li, A.; Liang, Z. Phishing Page Detection via Learning Classifiers from Page Layout Feature. EURASIP J. Wirel. Commun. Netw. 2019, 2019, 43. [Google Scholar] [CrossRef]
  48. He, D.; Liu, Z.; Lv, X.; Chan, S.; Guizani, M. On Phishing URL Detection Using Feature Extension. IEEE Internet Things J. 2024, 11, 39527–39536. [Google Scholar] [CrossRef]
  49. Khatun, M.; Mozumder, M.A.I.; Polash, M.N.H.; Hasan, M.R.; Ahammad, K.; Shaiham, M.S. An Approach to Detect Phishing Websites with Features Selection Method and Ensemble Learning. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 768–775. [Google Scholar] [CrossRef]
  50. Kulkarni, A.D. Convolution Neural Networks for Phishing Detection. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 15–19. [Google Scholar] [CrossRef]
  51. Tashtoush, Y.; Alajlouni, M.; Albalas, F.; Darwish, O. Exploring Low-Level Statistical Features of n-Grams in Phishing URLs: A Comparative Analysis with High-Level Features. Clust. Comput. 2024, 27, 13717–13736. [Google Scholar] [CrossRef]
  52. Almomani, A.; Alauthman, M.; Shatnawi, M.T.; Alweshah, M.; Alrosan, A.; Alomoush, W.; Gupta, B.B. Phishing Website Detection With Semantic Features Based on Machine Learning Classifiers: A Comparative Study. Int. J. Semant. Web Inf. Syst. 2022, 18, 24. [Google Scholar] [CrossRef]
  53. Jibat, D.; Jamjoom, S.; Al-Haija, Q.A.; Qusef, A. A Systematic Review: Detecting Phishing Websites Using Data Mining Models. Intell. Converg. Netw. 2023, 4, 326–341. [Google Scholar] [CrossRef]
  54. Prabakaran, M.K.; Meenakshi Sundaram, P.; Chandrasekar, A.D. An Enhanced Deep Learning-Based Phishing Detection Mechanism to Effectively Identify Malicious URLs Using Variational Autoencoders. IET Inf. Secur. 2023, 17, 423–440. [Google Scholar] [CrossRef]
  55. Samad, S.R.A.; Ganesan, P.; Al-Kaabi, A.S.; Rajasekaran, J.; Singaravelan, M.; Basha, P.S. Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 328–341. [Google Scholar] [CrossRef]
  56. Jalil, S.; Usman, M.; Fong, A. Highly Accurate Phishing URL Detection Based on Machine Learning. J. Ambient Intell. Humaniz. Comput. 2023, 14, 9233–9251. [Google Scholar] [CrossRef]
  57. Kulkarni, A.; Brown, L.L. Phishing Websites Detection Using Machine Learning. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 8–13. [Google Scholar] [CrossRef]
  58. Ndichu, S.; Kim, S.; Ozawa, S.; Misu, T.; Makishima, K. A Machine Learning Approach to Detection of JavaScript-Based Attacks Using AST Features and Paragraph Vectors. Appl. Soft Comput. 2019, 84, 105721. [Google Scholar] [CrossRef]
  59. Sharma, S.R.; Singh, B.; Kaur, M. Improving the Classification of Phishing Websites Using a Hybrid Algorithm. Comput. Intell. 2022, 38, 667–689. [Google Scholar] [CrossRef]
  60. Li, Y.; Yang, Z.; Chen, X.; Yuan, H.; Liu, W. A Stacking Model Using URL and HTML Features for Phishing Webpage Detection. Future Gener. Comput. Syst. 2019, 94, 27–39. [Google Scholar] [CrossRef]
  61. Qasim, M.A.; Flayh, N.A. Enhancing Phishing Website Detection via Feature Selection in URL-Based Analysis. Informatica 2023, 47, 145–155. [Google Scholar] [CrossRef]
  62. Song, F.; Lei, Y.; Chen, S.; Fan, L.; Liu, Y. Advanced Evasion Attacks and Mitigations on Practical ML-Based Phishing Website Classifiers. Int. J. Intell. Syst. 2021, 36, 5210–5240. [Google Scholar] [CrossRef]
  63. Mishra, S.; Soni, D. Smishing Detector: A Security Model to Detect Smishing through SMS Content Analysis and URL Behavior Analysis. Future Gener. Comput. Syst. 2020, 108, 803–815. [Google Scholar] [CrossRef]
  64. Zaimi, R.; Hafidi, M.; Lamia, M. A Deep Learning Mechanism to Detect Phishing URLs Using the Permutation Importance Method and SMOTE-Tomek Link. J. Supercomput. 2024, 80, 17159–17191. [Google Scholar] [CrossRef]
  65. Mohamad, M.A.; Ahmad, M.A.; Mustaffa, Z. Hybrid Honey Badger Algorithm with Artificial Neural Network (HBA-ANN) for Website Phishing Detection. Iraqi J. Comput. Sci. Math. 2024, 5, 671–682. [Google Scholar] [CrossRef]
  66. Mahdavifar, S.; Ghorbani, A.A. DeNNeS: Deep Embedded Neural Network Expert System for Detecting Cyber Attacks. Neural Comput. Appl. 2020, 32, 14753–14780. [Google Scholar] [CrossRef]
  67. Moedjahedy, J.; Setyanto, A.; Alarfaj, F.K.; Alreshoodi, M. CCrFS: Combine Correlation Features Selection for Detecting Phishing Websites Using Machine Learning. Future Internet 2022, 14, 229. [Google Scholar] [CrossRef]
  68. Hassan, N.H.; Fakharudin, A.S. Web Phishing Classification Model Using Artificial Neural Network and Deep Learning Neural Network. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 535–542. [Google Scholar] [CrossRef]
  69. Gandotra, E.; Gupta, D. Improving Spoofed Website Detection Using Machine Learning. Cybern. Syst. 2021, 52, 169–190. [Google Scholar] [CrossRef]
  70. Roy, S.S.; Awad, A.I.; Amare, L.A.; Erkihun, M.T.; Anas, M. Multimodel Phishing URL Detection Using LSTM, Bidirectional LSTM, and GRU Models. Future Internet 2022, 14, 340. [Google Scholar] [CrossRef]
  71. Shabudin, S.; Sani, N.S.; Ariffin, K.A.Z.; Aliff, M. Feature Selection for Phishing Website Classification. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 587–595. [Google Scholar] [CrossRef]
  72. Chen, S.; Lu, Y.; Liu, D.-J. Phishing Target Identification Based on Neural Networks Using Category Features and Images. Secur. Commun. Netw. 2022, 2022, 5653270. [Google Scholar] [CrossRef]
  73. Anitha, J.; Kalaiarasu, M. A New Hybrid Deep Learning-Based Phishing Detection System Using MCS-DNN Classifier. Neural Comput. Appl. 2022, 34, 5867–5882. [Google Scholar] [CrossRef]
  74. Priya, S.; Selvakumar, S. Detection of Phishing Attacks Using Probabilistic Neural Network with a Novel Training Algorithm for Reduced Gaussian Kernels and Optimal Smoothing Parameter Adaptation for Mobile Web Services. Int. J. Ad Hoc Ubiquitous Comput. 2021, 36, 67–88. [Google Scholar] [CrossRef]
  75. Maurya, S.; Saini, H.S.; Jain, A. Browser Extension Based Hybrid Anti-Phishing Framework Using Feature Selection. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 579–588. [Google Scholar] [CrossRef]
  76. Gururaj, H.L.; Mitra, P.; Koner, S.; Bal, S.; Flammini, F.; Janhavi, V.; Kumar, R.V. Prediction of Phishing Websites Using AI Techniques. Int. J. Inf. Secur. Priv. 2022, 16, 14. [Google Scholar] [CrossRef]
  77. Vrbančič, G.; Fister, I.; Podgorelec, V. Parameter Setting for Deep Neural Networks Using Swarm Intelligence on Phishing Websites Classification. Int. J. Artif. Intell. Tools 2019, 28, 1960008. [Google Scholar] [CrossRef]
  78. Nagaraj, K.; Bhattacharjee, B.; Sridhar, A.; Sharvani, G.S. Detection of Phishing Websites Using a Novel Twofold Ensemble Model. J. Syst. Inf. Technol. 2018, 20, 321–357. [Google Scholar] [CrossRef]
  79. Feng, J.; Zou, L.; Nan, T. A Phishing Webpage Detection Method Based on Stacked Autoencoder and Correlation Coefficients. J. Compt. Inf. Technol. 2019, 27, 41–54. [Google Scholar] [CrossRef]
  80. Gupta, S.; Bansal, H. Trust Evaluation of Health Websites by Eliminating Phishing Websites and Using Similarity Techniques. Concurr. Comput. Pract. Exp. 2023, 35, e7695. [Google Scholar] [CrossRef]
  81. Ozcan, A.; Catal, C.; Donmez, E.; Senturk, B. A Hybrid DNN–LSTM Model for Detecting Phishing URLs. Neural Comput. Appl. 2023, 35, 4957–4973. [Google Scholar] [CrossRef]
  82. Alotaibi, B.; Alotaibi, M. Consensus and Majority Vote Feature Selection Methods and a Detection Technique for Web Phishing. J. Ambient Intell. Humaniz. Comput. 2021, 12, 717–727. [Google Scholar] [CrossRef]
  83. Vaitkevicius, P.; Marcinkevicius, V. Comparison of Classification Algorithms for Detection of Phishing Websites. Informatica 2020, 31, 143–160. [Google Scholar] [CrossRef]
  84. Zaimi, R.; Hafidi, M.; Lamia, M. A Deep Learning Approach to Detect Phishing Websites Using CNN for Privacy Protection. Intell. Decis. Technol. 2023, 17, 713–728. [Google Scholar] [CrossRef]
  85. Catal, C.; Giray, G.; Tekinerdogan, B.; Kumar, S.; Shukla, S. Applications of Deep Learning for Phishing Detection: A Systematic Literature Review. Knowl. Inf. Syst. 2022, 64, 1457–1500. [Google Scholar] [CrossRef]
  86. Gao, B.; Liu, W.; Liu, G.; Nie, F. Resource Knowledge-Driven Heterogeneous Graph Learning for Website Fingerprinting. IEEE Trans. Cogn. Commun. Netw. 2024, 10, 968–981. [Google Scholar] [CrossRef]
  87. Jain, A.K.; Gupta, B.B. A Machine Learning Based Approach for Phishing Detection Using Hyperlinks Information. J. Ambient Intell. Humaniz. Comput. 2019, 10, 2015–2028. [Google Scholar] [CrossRef]
  88. Almujahid, N.F.; Haq, M.A.; Alshehri, M. Comparative Evaluation of Machine Learning Algorithms for Phishing Site Detection. PeerJ Comput. Sci. 2024, 10, e2131. [Google Scholar] [CrossRef] [PubMed]
  89. Hossain, S.; Sarma, D.; Chakma, R.J. Machine Learning-Based Phishing Attack Detection. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 378–388. [Google Scholar] [CrossRef]
  90. Goud, N.S.; Mathur, A. Feature Engineering Framework to Detect Phishing Websites Using URL Analysis. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 295–303. [Google Scholar] [CrossRef]
  91. Mehedi, I.M.; Shah, M.H.M. Categorization of Webpages Using Dynamic Mutation Based Differential Evolution and Gradient Boost Classifier. J. Ambient Intell. Humaniz. Comput. 2023, 14, 8363–8374. [Google Scholar] [CrossRef]
  92. Abu Al-Haija, Q.; Al-Fayoumi, M. An Intelligent Identification and Classification System for Malicious Uniform Resource Locators (URLs). Neural Comput. Appl. 2023, 35, 16995–17011. [Google Scholar] [CrossRef]
  93. El-Alfy, E.-S.M. Detection of Phishing Websites Based on Probabilistic Neural Networks and K-Medoids Clustering. Comput. J. 2017, 60, 1745–1759. [Google Scholar] [CrossRef]
  94. Zhang, W.; Jiang, Q.; Chen, L.; Li, C. Two-Stage ELM for Phishing Web Pages Detection Using Hybrid Features. World Wide Web 2017, 20, 797–813. [Google Scholar] [CrossRef]
  95. Marchal, S.; Armano, G.; Grondahl, T.; Saari, K.; Singh, N.; Asokan, N. Off-the-Hook: An Efficient and Usable Client-Side Phishing Prevention Application. IEEE Trans. Comput. 2017, 66, 1717–1733. [Google Scholar] [CrossRef]
  96. Abutair, H.; Belghith, A.; AlAhmadi, S. CBR-PDS: A Case-Based Reasoning Phishing Detection System. J. Ambient Intell. Humaniz. Comput. 2019, 10, 2593–2606. [Google Scholar] [CrossRef]
  97. Muhammad, A.; Murtza, I.; Saadia, A.; Kifayat, K. Cortex-Inspired Ensemble Based Network Intrusion Detection System. Neural Comput. Appl. 2023, 35, 15415–15428. [Google Scholar] [CrossRef]
  98. Zakaria, W.Z.A.; Abdollah, M.F.; Mohd, O.; Yassin, S.M.W.M.S.M.M.; Ariffin, A. RENTAKA: A Novel Machine Learning Framework for Crypto-Ransomware Pre-Encryption Detection. Intl. J. Adv. Comput. Sci. Appl. 2022, 13, 378–385. [Google Scholar] [CrossRef]
  99. Arhsad, M.; Karim, A. Android Botnet Detection Using Hybrid Analysis. KSII Trans. Internet Inf. Syst. 2024, 18, 704–719. [Google Scholar] [CrossRef]
  100. Binsaeed, K.; Stringhini, G.; Youssef, A.E. Detecting Spam in Twitter Microblogging Services: A Novel Machine Learning Approach Based on Domain Popularity. Intl. J. Adv. Comput. Sci. Appl. 2020, 11, 11–22. [Google Scholar] [CrossRef]
  101. Baruah, S.; Borah, D.J.; Deka, V. Detection of Peer-to-Peer Botnet Using Machine Learning Techniques and Ensemble Learning Algorithm. Int. J. Inf. Secur. Priv. 2023, 17, 16. [Google Scholar] [CrossRef]
  102. Shang, Y. Detection and Prevention of Cyber Defense Attacks Using Machine Learning Algorithms. Scalable Comput. Pract. Exp. 2024, 25, 760–769. [Google Scholar] [CrossRef]
  103. Shah, A.; Varshney, S.; Mehrotra, M. DeepMUI: A Novel Method to Identify Malicious Users on Online Social Network Platforms. Concurr. Comput. Pract. Exper. 2024, 36, e7917. [Google Scholar] [CrossRef]
  104. Almomani, A. Fast-Flux Hunter: A System for Filtering Online Fast-Flux Botnet. Neural Comput. Appl. 2018, 29, 483–493. [Google Scholar] [CrossRef]
  105. Chipa, I.H.; Gamboa-Cruzado, J.; Villacorta, J.R. Mobile Applications for Cybercrime Prevention: A Comprehensive Systematic Review. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 73–82. [Google Scholar] [CrossRef]
  106. Ilyasa, S.N.; Khadidos, A.O. Optimized SMS Spam Detection Using SVM-DistilBERT and Voting Classifier: A Comparative Study on the Impact of Lemmatization. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1323–1333. [Google Scholar] [CrossRef]
  107. Taherdoost, H. Insights into Cybercrime Detection and Response: A Review of Time Factor. Information 2024, 15, 273. [Google Scholar] [CrossRef]
  108. Rustam, F.; Ashraf, I.; Jurcut, A.D.; Bashir, A.K.; Zikria, Y.B. Malware Detection Using Image Representation of Malware Data and Transfer Learning. J. Parallel Distrib. Comput. 2023, 172, 32–50. [Google Scholar] [CrossRef]
  109. Mvula, P.K.; Branco, P.; Jourdan, G.-V.; Viktor, H.L. A Survey on the Applications of Semi-Supervised Learning to Cyber-Security. ACM Comput. Surv. 2024, 56, 1–41. [Google Scholar] [CrossRef]
  110. Al-Fawa’Reh, M.; Abu-Khalaf, J.; Szewczyk, P.; Kang, J.J. MalBoT-DRL: Malware Botnet Detection Using Deep Reinforcement Learning in IoT Networks. IEEE Internet Things J. 2024, 11, 9610–9629. [Google Scholar] [CrossRef]
  111. Diko, Z.; Sibanda, K. Comparative Analysis of Popular Supervised Machine Learning Algorithms for Detecting Malicious Universal Resource Locators. J. Cyber Secur. Mobil. 2024, 13, 1105–1128. [Google Scholar] [CrossRef]
  112. Alqahtani, A.S.; Altammami, O.A.; Haq, M.A. A Comprehensive Analysis of Network Security Attack Classification Using Machine Learning Algorithms. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1269–1280. [Google Scholar] [CrossRef]
  113. Butnaru, A.; Mylonas, A.; Pitropakis, N. Towards Lightweight Url-Based Phishing Detection. Future Internet 2021, 13, 154. [Google Scholar] [CrossRef]
  114. Demmese, F.A.; Shajarian, S.; Khorsandroo, S. Transfer Learning with ResNet50 for Malicious Domains Classification Using Image Visualization. Discov. Artif. Intell. 2024, 4, 52. [Google Scholar] [CrossRef]
  115. Das, L.; Ahuja, L.; Pandey, A. A Novel Deep Learning Model-Based Optimization Algorithm for Text Message Spam Detection. J. Supercomput. 2024, 80, 17823–17848. [Google Scholar] [CrossRef]
  116. Hans, K.; Ahuja, L.; Muttoo, S.K. Detecting Redirection Spam Using Multilayer Perceptron Neural Network. Soft Comput. 2017, 21, 3803–3814. [Google Scholar] [CrossRef]
  117. Naswir, A.F.; Zakaria, L.Q.; Saad, S. Determining the Best Email and Human Behavior Features on Phishing Email Classification. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 175–184. [Google Scholar] [CrossRef]
  118. Das, S.; Mandal, S.; Basak, R. Spam Email Detection Using a Novel Multilayer Classification-Based Decision Technique. Int. J. Comput. Appl. 2023, 45, 587–599. [Google Scholar] [CrossRef]
  119. Bountakas, P.; Xenakis, C. HELPHED: Hybrid Ensemble Learning PHishing Email Detection. J. Netw. Comput. Appl. 2023, 210, 103545. [Google Scholar] [CrossRef]
  120. Bhadane, A.; Mane, S.B. Detecting Lateral Spear Phishing Attacks in Organisations. IET Inf. Secur. 2019, 13, 133–140. [Google Scholar] [CrossRef]
  121. Magdy, S.; Abouelseoud, Y.; Mikhail, M. Efficient Spam and Phishing Emails Filtering Based on Deep Learning. Comput. Netw. 2022, 206, 108826. [Google Scholar] [CrossRef]
  122. Stevanović, N. Character And Word Embeddings for Phishing Email Detection. Comput. Inf. 2022, 41, 1337–1357. [Google Scholar] [CrossRef]
  123. Somesha, M.; Pais, A.R. Classification of Phishing Email Using Word Embedding and Machine Learning Techniques. J. Cyber Secur. Mobil. 2022, 11, 279–320. [Google Scholar] [CrossRef]
  124. Almousa, B.N.; Uliyan, D.M. Anti-Spoofing in Medical Employee’s Email Using Machine Learning Uclassify Algorithm. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 241–251. [Google Scholar] [CrossRef]
  125. Mohammed, M.A.; Ibrahim, D.A.; Salman, A.O. Adaptive Intelligent Learning Approach Based on Visual Anti-Spam Email Model for Multi-Natural Language. J. Intell. Syst. 2021, 30, 774–792. [Google Scholar] [CrossRef]
  126. Li, W.; Ke, L.; Meng, W.; Han, J. An Empirical Study of Supervised Email Classification in Internet of Things: Practical Performance and Key Influencing Factors. Int. J. Intell. Syst. 2022, 37, 287–304. [Google Scholar] [CrossRef]
  127. Loh, P.K.K.; Lee, A.Z.Y.; Balachandran, V. Towards a Hybrid Security Framework for Phishing Awareness Education and Defense. Future Internet 2024, 16, 86. [Google Scholar] [CrossRef]
  128. Manita, G.; Chhabra, A.; Korbaa, O. Efficient E-Mail Spam Filtering Approach Combining Logistic Regression Model and Orthogonal Atomic Orbital Search Algorithm. Appl. Soft Comput. 2023, 144, 110478. [Google Scholar] [CrossRef]
  129. Akinyelu, A.A.; Adewumi, A.O. On the Performance of Cuckoo Search and Bat Algorithms Based Instance Selection Techniques for SVM Speed Optimization with Application to E-Fraud Detection. KSII Trans. Internet Inf. Syst. 2018, 12, 1348–1375. [Google Scholar] [CrossRef]
  130. Siddique, Z.B.; Khan, M.A.; Din, I.U.; Almogren, A.; Mohiuddin, I.; Nazir, S. Machine Learning-Based Detection of Spam Emails. Sci. Program. 2021, 2021, 6508784. [Google Scholar] [CrossRef]
  131. Abari, O.J.; Sani, N.F.M.; Khalid, F.; Sharum, M.Y.B.; Ariffin, N.A.M. Phishing Image Spam Classification Research Trends: Survey and Open Issues. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 794–805. [Google Scholar] [CrossRef]
  132. Mughaid, A.; AlZu’bi, S.; Hnaif, A.; Taamneh, S.; Alnajjar, A.; Elsoud, E.A. An Intelligent Cyber Security Phishing Detection System Using Deep Learning Techniques. Clust. Comput. 2022, 25, 3819–3828. [Google Scholar] [CrossRef]
  133. Akinyelu, A.A.; Ezugwu, A.E.; Adewumi, A.O. Ant Colony Optimization Edge Selection for Support Vector Machine Speed Optimization. Neural Comput. Appl. 2020, 32, 11385–11417. [Google Scholar] [CrossRef]
  134. Bezerra, A.; Pereira, I.; Rebelo, M.Â.; Coelho, D.; Oliveira, D.A.D.; Costa, J.F.P.; Cruz, R.P.M. A Case Study on Phishing Detection with a Machine Learning Net. Int. J. Data Sci. Anal. 2024, 20, 2001–2020. [Google Scholar] [CrossRef]
  135. Kaushik, K.; Bhardwaj, A.; Kumar, M.; Gupta, S.K.; Gupta, A. A Novel Machine Learning-Based Framework for Detecting Fake Instagram Profiles. Concurr. Comput. Pract. Exp. 2022, 34, e7349. [Google Scholar] [CrossRef]
  136. Djaballah, K.A.; Boukhalfa, K.; Guelmaoui, M.A.; Saidani, A.; Ramdane, Y. A Proposal Phishing Attack Detection System on Twitter. Int. J. Inf. Secur. Priv. 2022, 16, 27. [Google Scholar] [CrossRef]
  137. Khan, A.I.; Unhelkar, B. An Enhanced Anti-Phishing Technique for Social Media Users: A Multilayer Q-Learning Approach. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 18–28. [Google Scholar] [CrossRef]
  138. Shetty, N.P.; Muniyal, B.; Anand, A.; Kumar, S. An Enhanced Sybil Guard to Detect Bots in Online Social Networks. J. Cyber Secur. Mobil. 2022, 11, 105–126. [Google Scholar] [CrossRef]
  139. Yamak, Z.; Saunier, J.; Vercouter, L. Automatic Detection of Multiple Account Deception in Social Media. Web Intell. 2017, 15, 219–231. [Google Scholar] [CrossRef]
  140. Khan, A.A.; Chaudhari, O.; Chandra, R. A Review of Ensemble Learning and Data Augmentation Models for Class Imbalanced Problems: Combination, Implementation and Evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  141. Sharma, S.; Gosain, A. Addressing Class Imbalance in Remote Sensing Using Deep Learning Approaches: A Systematic Literature Review. Evol. Intell. 2025, 18, 23. [Google Scholar] [CrossRef]
  142. Rezvani, S.; Wang, X. A Broad Review on Class Imbalance Learning Techniques. Appl. Soft Comput. 2023, 143, 110415. [Google Scholar] [CrossRef]
143. Regulation (EU) 2016/679 of the European Parliament and of the Council (General Data Protection Regulation). Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (accessed on 14 September 2025).
  144. National Institute of Standards and Technology. NIST Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management, Version 1.0; NIST: Gaithersburg, MD, USA, 2020. [Google Scholar]
  145. van Eck, N.J.; Waltman, L. VOSviewer Manual; Centre for Science and Technology Studies (CWTS), Leiden University: Leiden, The Netherlands, 2023. [Google Scholar]
  146. van Eck, N.J.; Waltman, L. Software Survey: VOSviewer, a Computer Program for Bibliometric Mapping. Scientometrics 2010, 84, 523–538. [Google Scholar] [CrossRef]
  147. Shukla, P.K.; Veerasamy, B.D.; Alduaiji, N.; Addula, S.R.; Sharma, S.; Shukla, P.K. Encoder Only Attention-Guided Transformer Framework for Accurate and Explainable Social Media Fake Profile Detection. Peer-to-Peer Netw. Appl. 2025, 18, 232. [Google Scholar] [CrossRef]
  148. Balasubramanian, P.; Liyana, S.; Sankaran, H.; Sivaramakrishnan, S.; Pusuluri, S.; Pirttikangas, S.; Peltonen, E. Generative AI for Cyber Threat Intelligence: Applications, Challenges, and Analysis of Real-World Case Studies. Artif. Intell. Rev. 2025, 58, 336. [Google Scholar] [CrossRef]
  149. Li, H.; Li, Y.; Li, K. Phishing Email Uniform Resource Locator Detection Based on Large Language Model. In Proceedings of the International Conference on Computer Application and Information Security (ICCAIS 2024), Wuhan, China, 20–22 December 2024; SPIE: Bellingham, WA, USA, 2025; Volume 13562, pp. 1245–1250. [Google Scholar]
150. Zeng, V.; Baki, S.; El Aassal, A.; Verma, R.; Teixeira De Moraes, L.F.; Das, A. Diverse Datasets and a Customizable Benchmarking Framework for Phishing. In Proceedings of the Sixth International Workshop on Security and Privacy Analytics, New Orleans, LA, USA, 18 March 2020. [Google Scholar] [CrossRef]
  151. Waltman, L.; Van Eck, N.J.; Noyons, E.C.M. A Unified Approach to Mapping and Clustering of Bibliometric Networks. J. Informetr. 2010, 4, 629–635. [Google Scholar] [CrossRef]
  152. Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to Conduct a Bibliometric Analysis: An Overview and Guidelines. J. Bus. Res. 2021, 133, 285–296. [Google Scholar] [CrossRef]
  153. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, 1st Quarter 2025; Anti-Phishing Working Group (APWG): Lexington, MA, USA, 2025. [Google Scholar]
  154. European Union Agency for Cybersecurity. ENISA Threat Landscape 2024: July 2023 to June 2024; European Union Agency for Cybersecurity (ENISA): Luxembourg, 2024. [Google Scholar]
  155. Microsoft Digital Defense Report 2024. Available online: https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/final/en-us/microsoft-brand/documents/Microsoft%20Digital%20Defense%20Report%202024%20%281%29.pdf (accessed on 14 September 2025).
Figure 1. Overview of the search strategy and thematic scope for data retrieval. The Scopus query focused on phishing detection using Machine Learning (ML) and Neural Networks (NN) between 2017 and 2024, filtered by subject area, document type, language, and index keywords reflecting delivery channels and learning techniques. A total of 105 articles met the final criteria. Source: Scopus, search performed 21 July 2025.
Figure 2. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram illustrating the identification, screening, eligibility assessment, and inclusion of studies retrieved from Scopus.
Figure 3. Network visualization of relationships between keywords generated using VOSviewer software [146].
Figure 4. Number of reported phishing attacks worldwide between 2017 and 2024, based on Anti-Phishing Working Group (APWG) Phishing Activity Trends Reports.
Figure 5. Number of publications per year between 2017 and 2024, based on the analyzed corpus.
Figure 6. Number of publications per phishing delivery channel by period (2017–2020 vs. 2021–2024).
Figure 7. Number of publications per category of Machine Learning Models and Techniques.
Figure 8. Number of publications per research methodology by period.
Figure 9. Number of publications per year by country.
Figure 10. Cross-tabulation of machine learning models and techniques applied to phishing detection across different delivery channels.
Figure 11. Cross-tabulation of research methodologies used in phishing detection studies across different delivery channels.
Table 1. Evolution of phishing detection methods (2000–2016).

Time Frame * | Dominant Approaches | Example Technologies/Features | Characteristics
2000–2005 | List-based approaches [33,34] | Blacklists, whitelists (Google Safe Browsing, Microsoft SmartScreen) | Simple and fast; high false negative rate for zero-day attacks
2006–2010 | Visual similarity-based approaches [34,35] | DOM structure comparison, screenshot matching | Effective for look-alike pages; computationally expensive
2011–2016 | URL and website content feature-based approaches (heuristics) [34,35] | URL length, HTTPS presence, number of forms | Manual rules, easy to bypass; low adaptability to evolving attacks
* The time frames are approximate, marking the transitions between dominant phishing detection techniques. Document Object Model (DOM). Uniform Resource Locator (URL). Hypertext Transfer Protocol Secure (HTTPS).
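To ground the heuristic era summarized in Table 1, the listing below gives a minimal, illustrative Python sketch of the kind of URL-level rules such systems applied; the feature set and the example URL are assumptions chosen for demonstration, not a reconstruction of any surveyed detector.

from urllib.parse import urlparse
import re

def url_heuristics(url: str) -> dict:
    # Illustrative URL-level heuristic features (hypothetical feature set).
    parsed = urlparse(url)
    return {
        "url_length": len(url),                  # unusually long URLs are a classic phishing cue
        "uses_https": parsed.scheme == "https",  # HTTPS presence, as listed in Table 1
        "num_dots": parsed.netloc.count("."),    # deep subdomain nesting can mimic trusted brands
        "host_is_ip": bool(re.fullmatch(r"(?:\d{1,3}\.){3}\d{1,3}", parsed.netloc)),  # raw IP host
        "has_at_symbol": "@" in url,             # '@' can be used to obscure the real destination
    }

# Hypothetical example URL, for illustration only.
print(url_heuristics("http://192.0.2.1/login@account-update.example.com"))

Rules of this kind are cheap to compute but, as Table 1 notes, easy for attackers to bypass, which motivated the shift toward the learned models appraised in Table 2.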
Table 2. Structured appraisal rubric for included studies.

No. | Study | Data Quality | Class Balance | External Sources Used (Blacklists/Metadata) | Risk of Data Leakage | Validation Method | Model Selection Procedure | Evaluation/System Metrics | Handling of Class Imbalance
1[36]Construction and sources: combined; UCI Machine Learning Repository—Phishing Websites Data Set; Kaggle—“Phishing website dataset”; Preprocessing: standardization applied by authors—datasets described as preprocessed/normalized; Total items: 13,511.Not reportedLabels: UCI Phishing Websites Data Set; Kaggle “Phishing website dataset. Metadata: WHOIS-derived domain age; DNS record presence; web traffic; Google index; page rank; external links—per dataset feature list.Medium; datasets from UCI and Kaggle were merged, and separation or deduplication procedures were not described in detail.10-fold CV for the ensemble modelsClassifiers compared by accuracy across datasets; hyperparameters and selection procedure not described.Evaluation: Accuracy; Precision; Recall; F1-score; ROC AUC; Cohen’s kappa.
System metrics: Not reported.
Partially addressed (metrics only)
2[37]Construction and sources: combined; PhishTank; MillerSmiles; source of benign: Not reported; Acquisition window: Not reported; Preprocessing: Not reported; Total items: 11,055.Not reportedLabels: PhishTank; MillerSmiles; benign labels: Not reported. Metadata: TLS/SSL certificate information; domain registration length/age (WHOIS); DNS record presence; web traffic rank; PageRank; Google index; links pointing to page; statistical list of phishing IP addresses.Medium—combined PhishTank and MillerSmiles; deduplication and temporal/host-level separation not described.5-fold CVGridSearchCV for Random Forest; optimal hyperparameters reported; no nested evaluation.Accuracy, Precision, Recall, F1, confusion matrix;
System metrics: controlled testbed; avg response time 4 s (prototype) vs. 6 s (Chrome extension); 33.3% lower time overhead
Partially addressed (metrics only)
3[38]Construction and sources: single-source; ISCX-URL2016; OpenPhish; PhishTank; UCI Machine Learning Repository; Mendeley website dataset; Preprocessing: removal of empty and NaN values; removal of redundant/empty fields; URL-based features only; Total items: Not reported.Imbalanced; ISCX-URL2016: benign 35,000/phishing 10,000; OpenPhish: benign 20,025,990/phishing 85,003; PhishTank: benign 48,009/phishing 48,009; UCI: benign 204,863/phishing 24,567; Mendeley: benign 58,000/phishing 30,647Labels: ISCX-URL2016; OpenPhish; PhishTank; UCI Machine Learning Repository; Mendeley website dataset; snapshot/version not reported. Metadata: noneHigh—feature selection performed before dataset split; only an 80/20 random hold-out described; no deduplication or temporal separation detailedHold-out split (80/20)Hyperparameter-optimized ANN; H-FFGWO for feature selection; parameters set after experimentation; no formal search procedure describedEvaluation: Accuracy; Precision; Recall; F1-score
System metrics: Not reported
Partially addressed (metrics only)
4[39]Construction and sources: combined; UCI ML Repository; PhishTank; Starting Point Directory; Acquisition window: UCI accessed 30 Mar 2020; PhishTank and Starting Point Directory accessed 30 Jul 2019; Preprocessing: continuous attributes converted to categorical; duplicate or invalid URL filtering not reported; Total items: UCI_DS1 = 11,055; UCI_DS2 = 1353; Phish_NetDS = 10,493Imbalanced; UCI_DS1: phishing 6157, benign 4898; UCI_DS2: Not reported; Phish_NetDS: phishing 4654, benign 5839.Labels: UCI repository; Phish_NetDS phishing labels from PhishTank, benign from Starting Point Directory. Metadata: WHOIS domain data, domain age checker, Google index/SEO tools; DNS record.Medium—multiple datasets evaluated and internal 65/35 hold-out only, no deduplication or temporal split reported.Hold-out split 65/35; stratification not reported.Architecture and hyperparameters specified (Deep_Radial m-6-5-4-3-2; activations; epochs = 1000; smoothing = 0.1; RBF spread = 1.0); base-classifier weights optimized with IntSquad (DE + SQP); selection procedure for these settings not described.Accuracy, Precision, Recall, F1, MCC, TPR, FPR.
System metric: the proposed ensemble was slower than DNN by 3.54–5.83% (test detection time, averaged over multiple runs).
Partially addressed (metrics only)
5[40]Construction and sources: combined; PhishTank phishing + Alexa Top-1M-derived benign; Acquisition window: PhishTank Aug 2006–Mar 2018; Alexa snapshot date not reported; Preprocessing: liveness check, removal of non-surviving or HTML-error pages, de-dup of benign links via search engine collection; Total items: 490,408 URLs.Balanced; overall 245,385 phishing/245,023 benign; Train 196,308/196,019; Validation 24,538/24,502; Test 24,539/24,502.Labels: PhishTank (August 2006–March 2018) for phishing; Alexa top domains with search engine top-10 links for benign. Metadata: none.Medium—Combined sources. No deduplication reported.Fixed split 392,327/49,040/49,041 and separate 10-fold CV reported.Hyperparameters explored on validation set and chosen by accuracy/loss: RNN units {8, 16, 32, 64, 128} best 64; CNN kernel sizes 2–7 best {5, 6, 7}; batch size 2048; epochs 32; optimizer Adam, learning rate 0.01; architecture and hyperparameters provided with selection on validation set.Accuracy, Precision, Recall, F1, AUC.
System metrics: training time 4426.15 s and test time 40.66 s, average per-URL detection 0.4 ms.
Partially addressed (metrics only)
6[41]Construction and sources: single-source; UCI Machine Learning Repository “Phishing Websites Data Set”; Acquisition window: snapshot retrieved 9 May 2016; Preprocessing: dataset-encoded features with values −1/0/1 as described; Total items: 1353.Imbalanced; phishing 702, legitimate 548, suspicious 103; per-split distributions Not reported.Labels: pre-labeled benchmark (UCI). Metadata: features beyond URL string included in UCI dataset (e.g., Age of Domain, Website Traffic, HTTPS/SSL); specific external providers Not reported.Medium—no per-fold description for GA; single global “best features by GA”; no nested CV10-fold CV, also 70/30 hold-out (reported as yielding similar results; paper presents CV results) GA used to select features and to weight features; DNN architecture and hyperparameters given (TanhWithDropout; 2 hidden layers; 50 neurons each; dropout 0.5; ADADELTA; cross-entropy; max epochs 100) without describing how these settings were chosen.Accuracy, Sensitivity (TPR), Specificity (TNR), G-mean; multi-class formulas explicitly defined
System metrics: Not reported
Partially addressed (metrics only)
7[42]Public dataset: UCI Phishing Websites (11,055 URLs), collected from PhishTank, MillerSmiles, and Google search operators.Imbalanced.
Overall: 55.69% phishing vs. 44.31% benign
Labels: pre-labeled benchmark (UCI). Features: dataset-provided fields only, no external metadata.High (hyperparameters chosen to maximize test-set accuracy across g, h, activation, cβ).Random split (80/20), stratified Manual experimental tuning (Design Risk Minimization + Monte Carlo) over g, h, activation, cβ; selection by test-set accuracy; no nested selection.Accuracy, TPR, FPR, Precision, Recall, F1, MCC. System metric: test run time about 1 s for each model. Partially addressed (metrics only)
8[43]MUPD: collected from PhishTank (phishing) and DomCop top-4M (legitimate); deduplicated and balanced; 2,307,800 URLs after deduplication. Sahingoz: 26,052 URLs after deduplication; also evaluated without preprocessing at 73,575 URLs.Near balanced (overall). MUPD: 1,167,201 phishing vs. 1,140,599 benign; Sahingoz: 14,356 vs. 11,696; Sahingoz (no preprocessing): 37,175 vs. 36,400.Labels: PhishTank (phishing) and benign derived from DomCop top-4M; plus the published Sahingoz dataset. Features: URL string only (character-level CNN), no external metadata.Low–medium. URLs and hosts deduplicated; both random 60/20/20 and date-based splits reported; temporal drift observedRandom split 60/20/20. Additional time-based split on MUPD: train 2006–2013, validation 2013–2015, test 2015–2018.CNN (PUCNN) selected by highest validation accuracy; no nested CV reported.Accuracy, Precision, Recall, F1. No system metrics reported.Adequately addressed (metrics and techniques)
9[44]Sources: 5000 best.com (legitimate), PhishTank (phishing); Construction: combined; Size: 51,200 URLs total. PhishTank access date: 21 May 2021.Imbalanced; Legitimate 40,000; Phishing 11,200; Overall distribution only.Labels: 5000 best.com (legitimate), PhishTank (phishing). Metadata: none (features derived solely from URL string).HighNot reportedFS-CNN hyperparameters fixed; RFE-SVM used for feature selection; feature-map sizes comparedEvaluation: Accuracy, Recall, F1-score, Precision.
System metrics: throughput 460 URLs/s, Packet inspection time increasing from 77 ms (100 packets) to 1129 ms (5000 packets), memory usage 570 MB, URL length effect on packet inspection time: from 53 ms (15 characters) to 134 ms (100 characters)
Partially addressed (metrics only)
10[45]Public repository: Mendeley Data, dataset “Phishing Dataset for Machine Learning”; 10,000 webpages total; sources: PhishTank, OpenPhish, Alexa, Common Crawl; 48 features; capture windows: 2015-01–2015-05; 2017-05–2017-06.Imbalanced;
distribution: not reported.
Labels: PhishTank, OpenPhish; Alexa, Common Crawl.
Metadata: none
High; combined multi-source dataset; split procedure not detailedHold-out split; ratio: not reported (training and testing phases mentioned)Hyperparameters specified for baselines, LightGBM parameter selection: not reported, overall selection procedure: not reportedEvaluation: Accuracy, Precision, Recall, F1-score, ROC curves reported.
System metrics: not reported
Adequately addressed (metrics and technique: random naive oversampling)
11[46]Sources: PhishTank, OpenPhish, Alexa Top Sites; Collection window: January to June 2017; Preprocessing: duplicates removed, HTTP 404 excluded; Total: 4000 URLsBalanced; overall: phishing 2000; benign 2000Labels: PhishTank; Openphish; Alexa Top Websites (legitimate).
Metadata: none
LowNot applicableNot applicable (no machine learning model)Evaluation: True Positive Rate; True Negative Rate; Accuracy
System metrics: Average response time 2358 ms; First-level 1727 ms; Second-level 2043 ms
Partially addressed (metrics only)
12[47]Construction and sources: combined, PhishTank phishing pages plus target and normal pages collected by authors; comparison vectors from CSS layout features. Preprocessing: manual invalid-page filtering, exclusion of pages with too-small layout elements or entirely different appearance from targets. Total items: 24,051 samples (comparison vectors).Imbalanced;
Train: Positive (similar) 3719, Negative (different) 17,926; Test: Positive 414, Negative 1992
Labels: PhishTank (phishing URLs), target/normal pages from authors’ collection. Metadata: noneMedium; pairwise vectors centered on shared target pages, split given by counts without page-level deduplication detailsHold-out split, training and testing sets, ratio not specifiedManual parameter variation reported for SVM gamma, Decision Tree max_depth, AdaBoost n_estimators, Random Forest n_estimators; no formal search or validation split describedEvaluation: Accuracy, Precision, Recall, F1-score.
System metrics: Not reported
Partially addressed (metrics only), no explicit resampling or class weighting
13[48][A]
14[49]Construction and sources: single-source, UCI Machine Learning Repository “Phishing Websites Data Set”; Preprocessing: dropped index column, recoded feature values to 0/1; removed records with missing values; Total items: 11,055Balanced;
per-class counts: Not reported
Labels: UCI “Phishing Websites Data Set”; Metadata: noneHigh, feature selection performed before train/test split, single 70/30 hold-out, separation details limitedHold-out split (70/30); stratification: Not reportedCompared multiple classifiers and feature-selection methods, selected by hold-out accuracy, random parameter tuning mentioned, details not reported.Evaluation: Accuracy
System metrics: Not reported.
Adequately addressed
15[50]Construction and sources: single-source, UCI Machine Learning Repository (Website Phishing Data Set); Total items: 1353.Imbalanced; Overall: phishing 702, legitimate 548, suspicious 103; Per-split: Not reported.Labels: UCI Website Phishing Data Set; Metadata: Web Traffic, Domain Age; providers not stated.Low, single UCI dataset, random 70/30 hold-out, no dataset merging describedHold-out split (70/30), randomNot reported (architecture and hyperparameters provided without describing the selection procedure)Evaluation: Accuracy
System metrics: Not reported
Not addressed (accuracy only)
16[51]Construction and sources: single-source, Hannousse&Yahiouche benchmark dataset (Kaggle: web-page-phishing-detection-dataset); Total items: 11,430.Balanced; overall: Legitimate 5715, Phishing 5715; per-split: Not reportedLabels: benchmark dataset “status” column. Metadata: none.Medium; random 80/20 hold-out, no stratification stated.Hold-out split (80/20), random, stratification: Not reported.Best model reported (CNN with 8 g); architecture outlined; hyperparameters not specified; selection procedure not described.Evaluation: Accuracy.
System metrics: Training time enhancement ratio vs. 41-feature baseline (percent, 4-feature set): Decision tree 88.89; Gradient boosting 86.52; AdaBoost 81.93; XGBoost 73.42; Random forest 65.05; ExtraTrees 61.60; Logistic regression 50.00; LightGBM 44.64; CatBoost 21.62; Naive Bayes 0.00.
Not applicable (balanced dataset)
17[52]Construction and sources: combined; Huddersfield phishing dataset built from PhishTank, MillerSmiles, Google query operators; plus Tan (2018) dataset from PhishTank, OpenPhish, Alexa, Common Crawl; Acquisition window: 2012–2018; 2015–2017; Total items: 12,456.Mixed: Dataset 2 balanced (5000 phishing/5000 legitimate); Dataset 1 not reportedLabels: PhishTank, MillerSmiles, Google (Huddersfield); PhishTank, OpenPhish; Alexa/Common Crawl (Tan). Metadata: domain age, DNS record, website traffic, PageRank, Google Index, backlinks count, statistical-reports feature.Medium; combined datasets and random split; no de-duplication or time-based isolation described; features include list/traffic-based signals.Hold-out split (70/30), random; 10-fold CV also mentionedSixteen scikit-learn classifiers compared; RandomForest reported as best; hyperparameters and selection procedure not described.Evaluation: Accuracy, Precision, Recall, AUC, Mean squared error
System metrics: Not reported
Partially addressed (metrics only)
18[53][R]
19[54]Construction and sources: combined; ISCX-URL-2016 (malicious), Kaggle (benign); Preprocessing: one-hot encoding of URL characters (84-symbol alphabet), fixed length 116 via trimming and zero-padding; Total items: 99,658.Balanced; Train: Not reported; Test: Benign 10,002, Malicious 9911; Validation: Not applicable.Labels: ISCX-URL-2016 (malicious: phishing, malware, spam, defacement); Kaggle (benign).
Metadata: none.
High; VAE trained on entire dataset before split, so test distribution influenced the feature extractor; deduplication and per-domain separation not reported.Hold-out split (80/20); stratification: Not reported; randomization: Not reported.VAE latent dimension chosen via loss curves over L ∈ {5, 10, 24, 48, 64} (selected L = 24); DNN architecture and hyperparameters provided without describing the selection procedure; no nested selection.Evaluation: Accuracy, Precision, Recall, F1-score, ROC.
System metrics: Response time 1.9 s; total training time 268 s.
Not applicable (balanced dataset)
20[55]Construction and sources: single-source, Kaggle “Malicious and Benign URLs” dataset; Preprocessing: webpage paragraph text extraction with Requests and BeautifulSoup, text cleaning to lowercase with stopword removal and lemmatization, vectorization to CSV and merge into combined.csv; Total items: 12,982.Balanced; overall: Malicious 6478, Benign 6504.Labels: Kaggle “Malicious and Benign URLs” dataset; Metadata: none.High—vectorizers fitted on the full dataset and saved to CSV before applying 10-fold CV on combined.csv; no de-duplication or domain-wise split reported.10-fold CVAlgorithms compared across feature sets and vectorizers; best reported configuration is Hashing Vectorizer with Extreme Gradient Boosting; architecture and hyperparameters not described, selection procedure not described.Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Not reported.
Adequately addressed (balanced subset, Precision, Recall, F1-score)
21[56]Construction and sources: multiple public datasets evaluated separately; Kaggle (D1–D2), CatchPhish (D3–D5), Ebbu2017 (D6); Preprocessing: removal of missing and duplicate URLs; normalization of quoted/comma-separated entries; Total items: 969,311 (six datasets).Imbalanced; D1: phishing 114,203; benign 392,801; D2: phishing 55,914; benign 39,996; D3: phishing 40,668; benign 85,409; D4: phishing 40,668; benign 42,220; D5: phishing 40,668; benign 43,189; D6: phishing 37,175; benign 36,400Labels: Kaggle (D1–D2), CatchPhish (D3–D5), Ebbu2017 (D6). Metadata: noneMedium; six pre-labeled public datasets; hold-out 70/30 without stated stratification/time policy; duplicates removed within datasets only; potential cross-split overlap not excludedHold-out split (70/30); stratification: Not reportedEight classifiers compared on hold-out; Random Forest selected as best; hyperparameters WEKA default; selection procedure not described; 30 features chosen via domain knowledge + ReliefFEvaluation: Accuracy, Precision, TPR, FPR, TNR, FNR, F1-score, ROC
System metrics: Training time RF on D1 384 s; latency, memory, throughput: Not reported
Partially addressed (metrics only)
22[57]Construction and sources: single-source; UCI Machine Learning Repository “Website Phishing Data Set”; Total items: 1353.Imbalanced; overall: phishing 702, suspicious 103, legitimate 548.Labels: UCI Machine Learning Repository “Website Phishing Data Set”; Metadata: age of domain; web traffic; SSL final state Low; single-source dataset with explicitly described exclusive random hold-out splits; no mixing of datasets or reuse of test items indicated.Neural network: hold-out split 60/20/20, random; Decision tree, Naïve Bayes, SVM: hold-out split 40/60, random.Neural network architecture specified (9-10-3; backprop) without describing a selection procedure; decision tree pruned, selection procedure not described; SVM and Naïve Bayes hyperparameters and selection procedure: Not reported.Evaluation: Accuracy, True Positive Rate, False Positive Rate, Confusion matrix
System metrics: Not reported
Partially addressed (metrics only)
23[58]Construction and sources: combined; D3M (malicious) and JSUNPACK plus Alexa top 100 (benign); Preprocessing: regex filtering to plain-JS and AST parsing with Esprima with syntactic validation; Total items: 5024 JS codes.Balanced; overall: malicious 2512; benign 2512Labels: D3M (malicious); JSUNPACK and Alexa top 100 (benign); snapshot dates not reported. Metadata: noneHigh; large-scale data augmentation (dummy plain-JS and AST-JS manipulations) combined with 10-fold CV performed by splitting feature vectors, with no grouping by original code reported10-fold CVModel selection procedure: 10-fold CV hyperparameter sweep (vector_size 100–1000; min_count 1–10; window 1–8); selected vector_size = 200, min_count = 5, window = 8; SVM (kernel = linear, C = 1)Evaluation: Precision; Recall; F1-score; ROC AUC
System metrics: training time per JS code 1.4780 s (PV-DBoW), 1.5290 s (PV-DM); detection time per JS code 0.0019 s (PV-DBoW), 0.0012 s (PV-DM)
Partially addressed (metrics only)
24[59][A]
25[60]Construction and sources: combined (Alexa; PhishTank); Acquisition window: 50 K Phishing Detection (50 K-PD) 2009–2017; 50 K Image Phishing Detection (50 K-IPD) 2009–2017; 2 K Phishing Detection (2 K-PD) 2017; Preprocessing: Not reported; Total items: 49,947; 53,103; 2000.Imbalanced overall; 50 K-PD: Legitimate 30,873, Phishing 19,074; 2 K-PD: Legitimate 1000, Phishing 1000; 50 K-IPD: Legitimate 28,320, Phishing 24,789; per-split counts: Not reported.Labels: PhishTank and Alexa rankings plus hyperlinks for legitimate; Metadata: none.Medium; combined sources and hold-out split, no deduplication or domain/time separation reported.Hold-out split (70/30); stratification: Not reported.Two-layer stacking with base models GBDT, XGBoost, LightGBM; base-model candidates compared via 5-fold CV using Kappa and average error; parameters mostly default; architecture and hyperparameters provided without a formal search procedure.Evaluation: Accuracy, Missing rate, False alarm rate.
System metrics: Training time 64 min (CPU i7-6700HQ; RAM 16G; 35 K training samples)
Partially addressed (metrics only)
26[61]Construction and sources: single-source; Mendeley Data “Phishing Websites Dataset” (2021); Preprocessing: random down-selection from 80,000 to 8000 URLs; Total items: 8000.Balanced; Overall: Legitimate 4000; Phishing 4000.Labels: Mendeley Data “Phishing Websites Dataset” (2021). Metadata: none.Medium; random 80/20 split at URL level stated, no deduplication or domain-level separation reported.Hold-out split (80/20); randomization: Not reported; stratification: Not reported.Not reported; algorithms compared (DT, SVM, RF); hyperparameters: Not reported.Evaluation: Accuracy, Precision, Recall, F1-score
System metrics: Not reported.
Not applicable (balanced dataset)
27[62]Construction and sources: combined, PhishTank (phishing URLs, phishing websites), PHISHNET (legitimate websites, URLs); Preprocessing: inactive phishing URLs only; extraction of links and texts from webpages; Total items: 2,599,834.Not reportedLabels: PhishTank, PHISHNET
Metadata: none
Medium—combined sources and no explicit train–test separation; evaluation against production classifiers with unknown training data (GPPF, Bitdefender)Not applicableNot applicable (no model trained; evaluation on GPPF and commercial tools)Evaluation: Attack success rate; Transferability success rate; Detection rate (Pelican)
System metrics: Attack crafting time per website < 1 s; Feature inference time: URL/DOM 0.68 h, term 11.66 h, total 12.34 h
Not addressed
28[63]Construction and sources: combined; Almeida SMS spam dataset; Pinterest smishing images converted to text; Preprocessing: punctuation removal, lowercasing, tokenization, stemming, TF-IDF vectorization; short-to-long URL expansion for analysis; Total items: 5858 messages.Imbalanced; Overall: Smishing 538, Ham 5320; Per-split: Not reported.Labels: Almeida SMS spam collection; manual extraction of smishing from spam plus Pinterest images to text; snapshot/version not reported. Metadata: PhishTank blacklist lookups; domain age from WHOIS/RDAP; other checks on HTML source and APK download; Medium; combined sources, no deduplication or time-based separation described, model chosen and evaluated on the same dataset without nesting.5-fold CVClassifier family compared; Naive Bayes selected by empirical performance; hyperparameters and selection procedure not reported.Evaluation: Accuracy, Precision, Recall, F1-score
System metrics: Not reported.
Partially addressed (metrics only)
29[64]Construction and sources: combined; Mendeley Phishing Websites datasets D1 and D2 (legitimate from Alexa; phishing from PhishTank); Preprocessing: datasets verified for null/duplicate samples; SMOTE-Tomek balancing applied.Imbalanced; D1 overall: Legitimate 27,998, Phishing 30,647; After SMOTE-Tomek: Legitimate 29,194, Phishing 29,194; D2 overall: Legitimate 58,000, Phishing 30,647; After SMOTE-Tomek: Legitimate 56,605, Phishing 56,605Labels: PhishTank; Alexa; snapshot: Not reported. Metadata: DNS records and resolver signals (A/NS counts, MX servers, TTL, number of resolved IPs), TLS certificate validity, Sender Policy Framework (SPF), redirects, response time, Google indexing, ASN/IP features; provided as dataset attributesMedium; SMOTE-Tomek applied before 80/20 split; potential synthetic overlap across train/test; no stratified or temporal split describedHold-out split (80/20); random; stratification: Not reportedArchitectures and hyperparameters provided without describing the selection procedure (XGBoost, CNN, LSTM, CNN-LSTM, LSTM-CNN)Evaluation: Accuracy; Precision; Recall; F1-score
System metrics: Not reported
Adequately addressed (metrics and techniques: SMOTE-Tomek link; metrics reported)
30[65]Construction and sources: single-source; UCI Machine Learning Repository; Total items: Not reported.Not reportedLabels: Not reported; Metadata: Not reported.High; separation procedure not described; single-source dataset; potential overlap between training and testing not ruled out.Hold-out split; split proportions: Not reported; stratification: Not reported.Architecture and hyperparameters declared without describing the selection procedure; weights optimized via HBA using MSE as fitness.Evaluation: Accuracy; Precision; Recall; F1-score; Error rate
System metrics: Convergence time 528 s; Learning iterations 1689; Minimum MSE 0.00498.
Partially addressed (metrics only).
31[66]Construction and sources: single-source; UCI Phishing Websites Dataset (UCI repository); Total items: 11,055.Imbalanced; overall: phishing 4898, legitimate 6157; per-split: Not reported.Labels: UCI Phishing Websites Dataset (snapshot/version: Not reported). Metadata: Website traffic (Alexa rank), Page rank, Google index, DNS record, Domain registration, SSL final state, Statistical report (Top 10 domains/IPs from PhishTank).Medium—single combined dataset with random stratified 5-fold CV; no deduplication or time-window separation reported; hyperparameters and architecture selected using the same CV settingStratified 5-fold CV; dataset shuffled before batching; 10 runs averaged; non-converged runs discarded.Grid over Adam parameters a ∈ {0.05, 0.01, 0.1, 0.5, 1}, β1 ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; architectures tested (1–3 hidden layers; various neuron counts); best model chosen by Accuracy/F1 on stratified 5-fold CV; no nested procedure described.Evaluation: Accuracy; Precision; Recall; F1-score; False Positive Rate; False Negative Rate.
System metrics: Not reported.
Partially addressed (metrics only).
32[67]Construction and sources: combined; Tan(PhishTank, OpenPhish, Alexa, General Archives); Hannousse&Yahiouche (PhishTank, OpenPhish, Alexa, Yandex); Acquisition window: Tan—two collection sessions between January–May and May–June across two years; Hannousse&Yahiouche—2021 build; Preprocessing: Tan—removed broken/404 pages; screenshots saved for filtering; Hannousse&Yahiouche—removed duplicates and inactive URLs; used DOM for limited-lifetime URLs; Total items: 21,430 (10,000 + 11,430).Balanced; Tan: 5000 phishing/5000 benign; Hannousse&Yahiouche: 5715 phishing/5715 benignLabels: PhishTank, OpenPhish; benign from Alexa, General Archives (Tan) and Alexa, Yandex (Hannousse&Yahiouche); Metadata: noneMedium; random 70/30 hold-out only stated for dataset 2; no domain/time-wise isolation described; deduplication mentioned only for dataset 2; Tan collected in sessions across two yearsHold-out split (70/30) (dataset 2); Tan dataset: Not reportedModels compared (RF, SVM, DT, AdaBoost); no hyperparameters or tuning/selection procedure describedEvaluation: Accuracy
System metrics: Execution time (s): 0.983028 [48 features, dataset 1, RF]; 0.970703 [10 features, dataset 1, RF]; 0.969786 [87 features, dataset 2, RF]; 0.957109 [10 features, dataset 2, RF]
Not applicable (balanced)
33[68]Construction and sources: single-source; UCI Machine Learning Repository “Website Phishing Data Set”; Acquisition window: page accessed 4 July 2022; Total items: 1353.Imbalanced; overall distribution: Legitimate 548; Suspicious 103; Phishing 702.Labels: UCI “Website Phishing Data Set” (accessed 4 July 2022). Metadata: SSL/TLS final state, domain age, and site traffic provided within the UCI dataset; no additional metadata retrieval by authors.Medium—single hold-out split with no details on randomization or deduplication checks; multiple runs on the same split.Hold-out split (70/30); repetitions: 20 runs; stratification: Not reported.Architectures and hyperparameters provided without describing the selection procedure; examples: ANN 9-10-10-1; learning rate 0.01; momentum 0.1; epochs 200; batch size 50.Evaluation: RMSE
System metrics: Not reported
Not addressed.
34[69][A]
35[70]Construction and sources: single-source; Kaggle “malicious-and-benign-urls” (siddharthkumar25); Acquisition window: accessed 11 September 2022; Preprocessing: character-level tokenization, fixed-length padding/truncation, embedding; Total items: 450,176.Class balance: Imbalanced; overall: Phishing 104,438; Benign 345,738.Labels: Kaggle “malicious-and-benign-urls” (accessed 11 September 2022). Metadata: none.Medium; single Kaggle snapshot with 70/30 hold-out; no deduplication or temporal separation described.Hold-out split (70/30); stratification: Not reported.Architecture and hyperparameters provided without describing the selection procedure; LSTM, Bi-LSTM, GRU models.Evaluation: Accuracy; Precision; Recall; F1-score
System metrics: Not reported
Partially addressed (metrics only)
36[71]Construction and sources: single-source; UCI Machine Learning Repository “Phishing Websites”; Total items: 11,055.Imbalanced; Overall: Phishing 4898; Benign 6157; Per-split: Not reported.Labels: UCI “Phishing Websites” dataset; snapshot/version: Not reported. Metadata: SSLfinal-State, Domain-registration-length, Age-of-domain, DNSRecord, Web-traffic, Page-Rank, Google-Index.Medium—single-source dataset with 10-fold CV; no deduplication or split-detail reported.10-fold CV; randomization: Not reported; stratification: Not reported.Random Forest tuned via one-at-a-time parameter sweeps; final parameters reported: maxDepth = 14, numIterations = 105, batchSize = 10; MLP and Naive Bayes hyperparameters/architecture not detailed, selection procedure not described.Evaluation: Accuracy
System metrics: Processing time—All features: RF 15 s, MLP 945 s, NB 1 s; FSOR: RF 10 s, MLP 600 s, NB 1 s; FSFM: RF 6 s, MLP 360 s, NB 1 s.
Not addressed.
37[72]Construction and sources: single-source; PhishTank URLs with targets plus collected WHOIS, DNS, screenshots, HTML, favicon; Acquisition window: October 2021–June 2022; Preprocessing: manual removal of invalid/blank pages, correction of mislabels, standardized screenshots; Total items: 3500.Imbalanced; 70 target-brand classes; per-class counts: Not reported; per-split distributions: stratified 80/20.Labels: PhishTank target labels; authors corrected some labels during cleaning. Metadata: WHOIS (creation/expiration dates, registrant country), DNS A and CNAME records, HTML text and tag counts, favicon ICO hex, OCR text from screenshots.Medium—random stratified 80/20 split on URLs; no campaign/domain-level separation described.
Hold-out split (80/20); stratified random.Architecture and hyperparameters provided without describing the selection procedure; learning rate 0.1; batch size 50; epochs 100; random state 0.Evaluation: Accuracy; Macro-F1; Weighted-F1
System metrics: Not reported
Partially addressed (metrics only)
38[73]Construction and sources: single-source; University of Huddersfield (Phishing Websites Dataset No. 1 and No. 2); Preprocessing: duplicate removal; Total items: Not reported.Not reportedLabels: University of Huddersfield Phishing Websites Dataset No. 1 and No. 2;
Metadata: DNS records; domain age; PageRank; Website Traffic; Google Index
Medium—feature selection and preprocessing described outside the cross-validation loop; separation per fold not detailed10-fold CVArchitecture (3 hidden layers, 20 nodes each) and hyperparameters (learning rate 0.001; epochs 50) provided without describing the selection procedureEvaluation: Accuracy; Precision; Recall; F-score; TPR; TNR; FPR; FNR; MCC
System metrics: Not reported
Partially addressed (metrics only)
39[74][A]
40[75]Construction and sources: combined; UCI Phishing Websites dataset (repository) for training and a separate live URL set for evaluation; Total items: 11,055 (UCI) and 2000 live URLs.Balanced; Overall (live set): Phishing 1000, Legitimate 1000; UCI training distribution: Not reported.Labels: UCI Phishing Websites dataset for training; labels for the live URL set not described. Metadata: Alexa Top Sites whitelist, PhishTank API blacklist, WHOIS Domain Registration data, DNS record checks, SSL certificate checks, Google index, PageRank.Medium—dataset splitting and deduplication not described; GridSearchCV used for tuning without a clearly separated evaluation protocol; labeling process for the live set not specified.Not reported.Stack of classifiers selected–RF, SVM with RBF, Logistic Regression; GridSearchCV described for SVM and RF; feature importance via RF (Gini) mentioned; search space and selection protocol details not described.Accuracy; Mean Squared Error.
System metrics: Average execution time–Proposed framework 0.62 ms; Logistic Regression 0.98 ms; SVM 0.87 ms; Random Forest 1.75 ms.
Not applicable—balanced evaluation set
41[76][A]
42[77][A]
43[78]Construction and sources: single-source; UCI Machine Learning Repository phishing websites dataset; Preprocessing: cluster-based oversampling with k-means, removal of 988 “inappropriate” instances; feature selection via correlation filter and Boruta; Total items: 10,068.Balanced; Overall distribution: 5034 phishing/5034 benign; per-split distributions: Not reported.Labels: UCI phishing websites dataset (snapshot/version not specified).
Metadata: none.
High oversampling and feature selection performed on the full dataset prior to evaluation; results reported with 10-fold CV without nesting.10-fold CV; also reports fixed hold-out partitions 60:40, 70:30, 75:25.Architecture and hyperparameters provided without describing the selection procedure; twofold FFNN with five hidden layers and eight neurons; SVM kernels: polynomial, RBF; RF parameters not specified.Evaluation: Accuracy; Precision; Recall; F1-score; MSE.
System metrics: Not reported.
Adequately addressed (metrics and techniques)
44[79][A]
45[80]Construction and sources: combined; PhishTank (phishing), UNB CIC URL-2016 (legitimate); Total items: 2000.Balanced; Overall: Phishing 1000, Legitimate 1000; Per-split: Not reportedLabels: PhishTank; UNB CIC URL-2016. Metadata: WHOIS/registration lookup (DNS record, domain age), Alexa ranking; no TLS/DNS TTL/RDAP fields reportedMedium—80/20 split and 5-fold CV without nesting; algorithm selected on same data; duplicate handling/stratification not describedHold-out split 80/20; 5-fold CV (mean CV score reported)Compared Decision Tree, RF, SVM on 80/20 split and mean 5-fold CV; selected Decision Tree; hyperparameters not described; no nested selectionEvaluation: Accuracy, Precision, Recall, Cross-validation score
System metrics: Not reported
Not applicable (dataset balanced)
46[81]Construction and sources: combined (Ebbu2017; PhishTank; Marchal2014 legitimate URLs); Total items: Ebbu2017 73,575; PhishTank 26,000.Balanced; Overall: Ebbu2017 Legitimate 36,400, Phishing 37,175; PhishTank dataset Legitimate 13,000, Phishing 13,000; Per-split: Not reportedLabels: Ebbu2017; PhishTank; Marchal2014; Metadata: noneMedium; 10-fold CV without deduplication or temporal controls described; potential overlap of near-duplicate URLs across folds; dataset construction details limited10-fold CV on each datasetHyperparameter search space reported; best values given (optimizer = adam; activation = relu; dropout = 0.3; epochs = 40; batch size = 128); architecture specified (DNN–LSTM and DNN–BiLSTM); selection procedure not fully described; no nested CVEvaluation: Accuracy; AUC; F1-score
System metrics: Not reported
Partially addressed (metrics only)
47[82]Construction and sources: combined; Mendeley Data “Phishing dataset for machine learning” and UCI “Phishing Websites” dataset; Total items: 21,055.Mixed
Balanced (Mendeley): Phishing 5000, Legitimate 5000;
Imbalanced (UCI): Phishing 3793, Legitimate 7262; Per-split distributions: Not reported.
Labels: From the two datasets; Metadata: External features present in the datasets (e.g., WebTraffic, SSLfinalState, AgeOfDomain, GoogleIndex, DNSRecord); source services not reported.Medium—single 70/30 hold-out; no details on randomization, deduplication, or host/domain-level separationHold-out split 70/30 train/test; number of runs and stratification not reported.Architecture: AdaBoost with LightGBM; ≥15 algorithms investigated for comparison; base feature selection methods: RF, Gradient Boosting, LightGBM; hyperparameters not reported; selection procedure not described.Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Detection time for entire test set—14 ms (Dataset 1, full features); 13.9 ms (Dataset 1, consensus); 13.9 ms (Dataset 1, majority); 214 ms (Dataset 2, full features); 185 ms (Dataset 2, majority); 300 ms (Dataset 2, consensus); per-instance detection time 4.63 μs (Dataset 1, consensus); training time figures not reported numerically.
Partially addressed (metrics only)
48[83]Construction and sources: single-source; UCI-2015 (UCI repository), UCI-2016 (UCI repository), MDP-2018 (Mendeley Data); Acquisition window: UCI-2015 donated March 2015; UCI-2016 contributed November 2016; MDP-2018 published March 2018; Total items: 22,408.Mixed; UCI-2015 imbalanced (phish 6157; benign 4898); UCI-2016 imbalanced (phish 805; benign 548); MDP-2018 balanced (phishing 5000; benign 5000).Labels: UCI-2015, UCI-2016, MDP-2018 dataset labels as provided by dataset authors; snapshot months as above; Metadata: none.Low—30-fold stratified CV within each dataset; no cross-dataset mixing described.30-fold stratified CV (per dataset).Manual, expert-guided hyperparameter tuning using learning curves and 30-fold CV; best hyperparameters reported per dataset; no formal grid or nested search described.Evaluation: Accuracy
System metrics: Not reported
Not addressed
49[84][A]
50[85][R]
51[86][A]
52[34]Construction and sources: combined; PhishTank; Alexa Top Sites; Total items: 3526.Imbalanced; overall: phishing 2119; legitimate 1407; per-split: Not reported.Labels: PhishTank(phishing), Alexa Top Sites (legitimate); snapshot/version: Not reported. Metadata: WHOIS domain age, Alexa Page Rank, Bing search-engine results using title/description/copyright matching.Medium; combined sources and repeated random hold-out split; no deduplication or per-domain grouping described, raising possibility of near-duplicate/domain overlap across splits; authors explicitly note duplicates as a limitation.Hold-out split 75/25; repeated 10 times with randomly selected training set; metrics averaged across repeats; stratification: Not reported.Algorithm comparison across RF, J48, LR, BN, MLP, SMO, AdaBoostM1, SVM; RF selected by highest average accuracy over 10 repeated 75/25 splits; RF hyperparameters fixed (ntree = 100; mtry = 4); no inner validation/tuning described.Evaluation: Sensitivity; Specificity; Precision; Accuracy; Error rate; False positive rate; False negative rate.
System metrics: Not reported.
Partially addressed (metrics only)
53[87]Construction and sources: combined; PhishTank (phishing), Alexa top websites, Stuffgate Free Online Website Analyzer, List of online payment service providers; Preprocessing: removed identical feature vectors; label encoding of class values; Total items: 2544.Imbalanced: phishing 1428; legitimate 1116Labels: PhishTank (2018); Alexa top websites (2018); Stuffgate (2018); online payment service providers list (2018). Metadata: noneMedium—combined sources; random 10-fold CV; no domain/time de-duplication described; only “identical values removed” noted10-fold CVClassifier family comparison via 10-fold CV (SMO, Naive Bayes, Random Forest, SVM, Adaboost, Neural Networks, C4.5, Logistic Regression); selected Logistic Regression; hyperparameters not reportedEvaluation: Accuracy, Precision, Recall/TPR, FPR, TNR, FNR, F1-score, ROC AUC
System metrics: Not reported
Partially addressed (metrics only)
54[88]Construction and sources: combined; Mendeley dataset built from Alexa and Common Crawl for benign plus PhishTank and Open-Phish for phishing; UCI Phishing Websites dataset; Preprocessing: dropped non-informative/index columns; label normalization; SMOTE applied to Dataset 2; Total items: D1 10,000; D2 12,314 after SMOTE.Balanced (D1 5000 phishing; 5000 legitimate); Imbalanced (D2 4898 phishing; 6157 legitimate); Balanced after SMOTE (D2 6157 phishing; 6157 legitimate)Labels: Alexa; Common Crawl; PhishTank; Open-Phish; snapshot versions Not reported. Metadata: none.High—SMOTE described as producing a fully balanced Dataset 2 prior to reporting totals; feature selection/correlation filtering and GridSearchCV discussed without a clearly separated outer evaluation loop; nested CV not specified; CV settings inconsistent (k = 5 vs. stratified k = 10).Stratified 10-fold CVGridSearchCV hyperparameter tuning; reported chosen settings include LR (penalty = L2, C = 0.1, solver = saga, max_iter = 500), DT (criterion = gini, max_depth = 3, min_samples_leaf = 5), RF (n_estimators = 150, max_depth = 10, min_samples_split = 5, min_samples_leaf = 2, max_features = log2), KNN (n_neighbors = 3, algorithm = brute), SVC (C = 0.7, kernel = sigmoid), XGBoost (learning_rate = 0.2, n_estimators = 100, max_depth = 5, min_child_weight = 2, subsample = 0.8, colsample_bytree = 1.0), CNN (64 filters, 3 × 3, pool 3 × 3, dense 128, dropout = 0.5), DL (optimizer/learning rate/batch size/dropout tuned).Evaluation: Accuracy; Precision; Recall; F1-score; FPR
System metrics: CNN training time 94 s 29 ms; other ML models < 10 s; hardware TPU v2–8 (8 cores, 64 GiB).
Adequately addressed (metrics and techniques)
55[89]Construction and sources: single-source; Mendeley online repository; Total items: Not clear.Balanced; overall 5000 phishing; 5000 legitimateLabels: Mendeley repository (snapshot/version not reported); Metadata: noneHigh; split not described; dataset count description inconsistentNot reportedAlgorithms specified (KNN, Decision Tree, Random Forest, Extra Trees, SVM, Logistic Regression); hyperparameters and selection procedure not reportedEvaluation: ROC AUC; Precision; Recall; F1-score
System metrics: Not reported
Partially addressed (metrics only)
56[90]Not reported Not reported.Labels: Not reported. Metadata: DNS records and counts (qty_nameservers, qty_mx_servers, ttl_hostname, qty_ip_resolved); TLS certificate (tls_ssl_certificate); domain age/expiration (time_domain_activation, time_domain_expiration); ASN/IP (asn_ip); Google index status (url_google_index, domain_google_index).High—dataset provenance and label sources not described; 80/20 split and repeated stratified CV both mentioned without deduplication or strict separation details.Hold-out split 80/20; repeated stratified cross-validation (k and repetitions not reported).Recursive Feature Elimination with XGBoost estimator; pipeline comparison of LR, AdaBoost, GBM, XGBoost using repeated stratified CV; selected 29 features and XGBoost based on accuracy; hyperparameters not described.Evaluation: Accuracy
System metrics: Not reported
Not addressed.
57 [91] Construction and sources: proprietary; inspired by Bruni and Bianchi (2020); Preprocessing: web scraping and OCR; screenshot acquisition; logo detection; data cleaning; tokenization; stop-word removal; HOG feature extraction; Total items: 1000.
Class balance: Not reported.
Labels: authors’ manual assignment into five categories; no external lists. Metadata: none.
Leakage risk: Medium; hold-out 50/50 mentioned; split procedure and deduplication not described.
Validation protocol: Hold-out split 50/50.
Model selection and tuning: Dynamic mutation-based differential evolution tuning GBC hyperparameters; objectives: accuracy and F-measure; bounds per Table 1; DE settings: crossover ratio 0.5; adaptive mutation; generations 400; population 90; selection: tournament.
Evaluation: Accuracy; F-measure; Kappa.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
58 [92] Construction and sources: single-source; ISCX-URL2016 (Canadian Institute for Cybersecurity); Preprocessing: duplicate/redundant removal, missing-data handling, structural error fixes, outlier handling, MRMR feature selection, dataset shuffling; Total items: 57,000.
Class balance: Not reported; qualitative note: multi-class nearly balanced; binary imbalanced.
Labels: ISCX-URL2016 (CIC); snapshot/version Not reported. Metadata: none (URL lexical features only, e.g., query length, domain/path token counts).
Leakage risk: High; 5-fold CV with random 70/30 per fold; hyperparameters tuned via Bayesian optimizer; no nested CV or independent hold-out described.
Validation protocol: Five-fold cross-validation; each fold uses a random 70/30 split; metrics averaged over 5 folds.
Model selection and tuning: Bayesian optimization minimizing classification error over model-specific hyperparameters; final model selected as En_Bag based on CV metrics.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Detection time 6.67 μs; Classification time 11.77 μs.
Handling of class imbalance: Partially addressed (metrics only).
59 [93] [A]
60 [94] Construction and sources: combined; phishing from PhishTank; legitimate from Statscrop top sites; Chinese set phishing from search engine, SMS, emails; legitimate from Statscrop; Preprocessing: HTML parsed with Jsoup; OCR (Tesseract) for image-format pages; other QC Not reported; Total items: Dataset 1 = 5905; Dataset 2 = 1000.
Class balance: Mixed; Imbalanced (Dataset 1: 2784 phishing/3121 legitimate); Balanced (Dataset 2: 500 phishing/500 legitimate); Per-split: Not reported.
Labels: PhishTank (phishing); Statscrop top sites (legitimate); plus search engine/SMS/email collection for Chinese dataset. Metadata: Alexa website traffic rank; DNS record; domain age; Who.is count of URLs linking to the site.
Leakage risk: High; hyperparameters and ensemble size tuned using the same 5-fold CV used for reporting; no nested CV; no deduplication or per-domain isolation described.
Validation protocol: 5-fold CV; two datasets evaluated.
Model selection and tuning: 5-fold CV sweep; hidden nodes: selected 10; ensemble size: selected 25; sigmoid activation; input weights/biases random in [−1, 1]; non-nested; no independent hold-out described.
Evaluation: Accuracy; FPR; FNR; SD.
System metrics: Training time 1.16 s (LC-ELMs, avg across 100 simulations); training time for the two-stage ELM 1.21 s; average detection time per page 1.89 s; environment: MATLAB 2012B, Intel Pentium G850 2.89 GHz, 2 GB RAM; feature extractor in Java.
Handling of class imbalance: Partially addressed (metrics only).
61 [95] [A]
62 [96] Construction and sources: combined; PhishTank (phishing), DMOZ/Open Directory (legitimate); Total items: 500 and 750 (two datasets).
Class balance: Balanced; Overall: 250 phishing/250 benign (Dataset 1); 375 phishing/375 benign (Dataset 2); Per-split distributions: Not reported.
Labels: PhishTank (validated phishing) and DMOZ (legitimate). Metadata: Alexa Rank, Alexa Links-In Count; search-engine presence for mld and mld.ps (Google, Yahoo).
Leakage risk: High; train/test separation not described; combined sources; OPT component evaluated separately without clear isolation from evaluation sets.
Validation protocol: Not reported.
Model selection and tuning: Two-stage feature selection and weighting: Information Gain to discard features with IG = 0 and assign weights; GA to select 8 features and tune weights; GA: population sizes 20/50/100, crossover 0.5–1, mutation 0.05, 10 runs; fitness = CBR accuracy; final CBR uses weighted Euclidean similarity; k not reported.
Evaluation: Accuracy, F-measure, TPR, TNR, FPR, FNR; comparative accuracy vs. RF, C4.5, JRip, PART, LMT, SVM; separate OPT scenario reported.
System metrics: Not reported.
Handling of class imbalance: Not applicable; datasets balanced; no balancing techniques reported.
63 [97] Construction and sources: combined; KDDCup99; CICIDS2017; Acquisition window: KDDCup99 Not reported; CICIDS2017 7-day capture; Preprocessing: removed missing and infinity values; label encoding; standardization (StandardScaler); Total items: KDDCup99 490,000; CICIDS2017 2,299,535.
Class balance: Balanced; KDDCup99 multiclass 14 classes, each 7.14% after undersampling; CICIDS2017 8 major classes, each 12.5% after undersampling.
Labels: dataset ground truth from KDDCup99 and CICIDS2017. Metadata: none.
Leakage risk: Medium; datasets balanced via undersampling and no explicit train/test separation procedure described.
Validation protocol: Hold-out split (not specified).
Model selection and tuning: Empirical selection of number of prototypes (26 for KDDCup99; 102 for CICIDS2017) and similarity measure (histogram intersection) based on performance; random forest meta-classifier used; RF hyperparameters not reported.
Evaluation: Accuracy; Precision; Recall; F1-score; FPR; AUC.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
64 [98] Construction and sources: single-source; RISS (Imperial College London); Acquisition window: 2016; Preprocessing: dynamic analysis sandbox; pre-encryption boundary extraction of API calls before ENC flag; Total items: 1524.
Class balance: Imbalanced; overall 582 ransomware/942 benign; splits: Not reported.
Labels: RISS dataset. Metadata: none.
Leakage risk: Medium; single-source dataset and split procedure not described.
Validation protocol: Not reported.
Model selection and tuning: Final model reported as SVM best; selection procedure not described.
Evaluation: Accuracy; TPR; FPR.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
65 [99] Construction and sources: combined; Drebin Android malware dataset for botnet/malware; Google Play Store and other internet repositories for benign; Preprocessing: reverse engineering; static features (13 permissions, 26 API calls) via AAPT/Androguard; dynamic features via DroidBox; Total items: 100.
Class balance: Imbalanced; Botnet 70; Malware 20; Benign 10; Splits: Not reported.
Labels: Drebin Android malware dataset (botnet/malware), Google Play Store and other internet repositories (benign); snapshot/version not reported. Metadata: none.
Leakage risk: High; no split procedure described; combined multiple sources (Drebin + benign repositories).
Validation protocol: Not reported.
Model selection and tuning: Algorithms compared (Decision Tree, Random Forest, SVM with SMO, Naive Bayes, MLP); final choice by highest accuracy; selection procedure and hyperparameters not described.
Evaluation: Accuracy; Precision; Recall; F1-score; True Positive Rate; False Positive Rate.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
66 [100] Construction and sources: single-source; Twitter API stream; Spamhaus list referenced for confirmed spam domains; Alexa Top 1M used for domain popularity; Acquisition window: 27 days; Preprocessing: frequency filter ≥200 tweets/hour, Alexa Top 1M check, manual verification of 1131 distinct domains; Total items: 268,921,568 tweets.
Class balance: Balanced; 26,986 tweets: 50% spam, 50% legitimate; grouped domains: 630 records, 50% spam.
Labels: manual domain verification; Spamhaus confirmed-spam list used to identify false negatives. Metadata: Alexa Top 1M domain list (domain popularity).
Leakage risk: Medium; random 50/50 hold-out and 10-fold CV on raw tweets without explicit split by domain/user; a grouped-by-domain variant was also evaluated but no isolation policy stated.
Validation protocol: Hold-out split 50/50; 10-fold CV; also evaluated grouped-record method at domain level.
Model selection and tuning: Comparison of Random Forest, J48, Naïve Bayes; Random Forest reported as most reliable; selection/tuning procedure not described.
Evaluation: Accuracy; Precision; Sensitivity (Recall); F1-score.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
67 [101] Construction and sources: single-source; CTU-13 (scenario 12; NetFlow); Acquisition window: 10 August 2011 10:02:43–19 August 2011 11:45:43; Total items: Not reported.
Class balance: Imbalanced; overall distribution Not reported; per-split distribution Not reported; stratified 5-fold used.
Labels: CTU-13 scenario 12 ground truth. Metadata: none.
Leakage risk: Medium; single-scenario dataset, stratified k-fold without stated isolation policy between correlated flows.
Validation protocol: Stratified 5-fold CV.
Model selection and tuning: Models: NB, kNN, LDA, DT, RF, SVM; ensemble: soft majority voting; hyperparameters mentioned for LDA (solver = svd; n_components 2–5) and Naive Bayes (var_smoothing 1 × 10^−9–1 × 10^−1); selection procedure not described.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
68 [102] [A]
69 [103] Construction and sources: combined; Twitter Cresci et al. dataset plus self-crawled Twitter profiles, and a separate Instagram dataset by Akyon et al.; Acquisition window: Twitter Not reported; Instagram 6 months; Preprocessing: token cleaning of symbols, emoticons, stop words, lowercasing, missing text replaced with “missing”, standardization of numerical features, binary encoding, Word2Vec embedding, feature selection; Total items: Twitter 9082; Instagram 1400.
Class balance: Balanced; Twitter 4531 legitimate/4551 fake; Instagram 700 legitimate/700 automated.
Labels: Twitter from Cresci et al. plus self-crawled manual labeling and replica detection following Zarei et al.; Instagram from Akyon et al. via API and bot-behavior rules across 6 months. Metadata: none.
Leakage risk: High; multiple datasets merged and self-crawled accounts with impersonation replicas; random 80/20 split with no identity-level separation or deduplication described.
Validation protocol: Hold-out split 80/20.
Model selection and tuning: Architecture and hyperparameters provided without describing the selection procedure.
Evaluation: Accuracy; Precision; Recall; F1-score; ROC curve; Confusion matrix.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only); class balance achieved at data construction stage; no resampling or class weights reported.
70 [104] Construction and sources: combined; ISOT botnet dataset (2010; includes Storm and Waledac); Acquisition window: 2010 snapshot; Preprocessing: DNS traffic parsing and stemming, conversion of features to numeric, Information Gain Ratio feature ranking, linear normalization; Total items: 7615 domain records.
Class balance: Imbalanced; fast-flux 83; benign not reported; splits: 5-fold CV with 80/20 train–test per fold.
Labels: ISOT botnet dataset (2010; Storm, Waledac). Metadata: DNS records and traffic features including TTL and its standard deviation, query type (A, AAAA, MX, CNAME), number of DNS servers, number of TLDs, synchronization status, number of DNS queries, packet sizes, duration; plus source/destination IP and domain extracted into 14 features.
Leakage risk: Medium; combined dataset with random 80/20 splits and 5-fold CV, no deduplication or temporal isolation described.
Validation protocol: 5-fold cross-validation; each fold uses approximately 80% training and 20% testing.
Model selection and tuning: Information Gain Ratio used for feature ranking; EFuNN distance threshold Dthr swept over candidate values with best Dthr = 0.9 reported; selection procedure relative to CV folds not described.
Evaluation: Accuracy; RMSE; NDEI.
System metrics: training 40.3–47.6 s; testing 8.0–8.4 s; fuzzy rules 491–513; memory 496–518 KB; values reported per Dthr setting.
Handling of class imbalance: Not addressed.
71 [105] [R]
72 [106] Construction and sources: combined; UCI SMS Spam Collection; Acquisition window: accessed 25 November 2023; Preprocessing: no missing values; lowercasing; whitespace cleanup; synonym replacement with WordNet; with and without lemmatization; Total items: 5574.
Class balance: Imbalanced; 4827 ham/747 spam (overall).
Labels: UCI SMS Spam Collection. Metadata: none.
Leakage risk: Medium; non-nested tuning with stratified 5-fold CV and synonym augmentation pipeline not detailed for fold isolation; no de-duplication described.
Validation protocol: Stratified 5-fold CV.
Model selection and tuning: Optuna hyperparameter optimization with stratified 5-fold CV; tuned TF-IDF max_features; SVC C; Logistic Regression C; RF n_estimators and max_depth; Gradient Boosting n_estimators; KNN neighbors; XGBoost max_depth, subsample, scale_pos_weight; AdaBoost n_estimators and learning rate; shared hyperparameters for DistilBERT-based models.
Evaluation: Accuracy; Precision; Recall; F1-score; ROC AUC.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
73 [107] [R]
74 [108] Construction and sources: single-source; Malimg dataset; Total items: 9342.
Class balance: Imbalanced; no benign class; 25 malware classes with varying counts; phishing vs. benign distribution not applicable.
Labels: Malimg dataset (25 malware families); snapshot/version not reported. Metadata: none.
Leakage risk: Low; single-source dataset; explicit 80/20 hold-out and separate 10-fold CV reported; no evidence of train–test mixing.
Validation protocol: Hold-out split 80/20; 10-fold CV.
Model selection and tuning: Hyperparameters and architectures provided with tuning ranges; best settings chosen empirically; procedure not formally described.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Training time (s): Bi-SVM 71.143; Bi-KNN 18.928; Bi-RF 188.306; Bi-LR 394.631; SVM 0.221; KNN 0.665; RF 4.529; LR 2.509; 10-fold SVM 119.139; 10-fold KNN 21.539; 10-fold RF 231.674; 10-fold LR 523.74.
Handling of class imbalance: Partially addressed (metrics only).
75 [109] [A]
76 [110] Construction and sources: combined; MedBIoT, N-BaIoT; Preprocessing: time-window aggregation {0.1 s, 0.5 s, 1.5 s, 10 s, 60 s}, duplicate-packet removal, 23 statistical features, min–max scaling; Total items: 599,152 flows.
Class balance: Imbalanced; early stage overall: 300,000 malicious/700,000 normal; late stage overall: 300,000 malicious/555,932 normal (855,932 total); per-split Not reported.
Labels: dataset ground truth from MedBIoT and N-BaIoT (versions/snapshots not specified). Metadata: none.
Leakage risk: Medium; random hold-out 60/10/30 on flows without device or time isolation described; multi-scenario use.
Validation protocol: Hold-out split 60/10/30.
Model selection and tuning: Architecture and hyperparameters provided (DQN 256-64-32, 115 input features); discount factor chosen empirically; selection procedure not described.
Evaluation: Accuracy; Precision; Recall/Detection Rate; F1-score; G-mean.
System metrics: training 12.6 min, additional RAM 140.3 MB, CPU +17.1% (training); testing 8.4 s, additional RAM 6.8 MB, CPU 42% (testing); time per sample 0.00126 s (training) and 3.266 × 10^−5 s/sample (testing); after larger test set, testing 4.725 × 10^−5 s/sample.
Handling of class imbalance: Adequately addressed (metrics and techniques).
77 [111] Construction and sources: single-source; Kaggle malicious URLs dataset; Total items: 651,191.
Class balance: Imbalanced; overall per-class distribution Not reported.
Labels: Kaggle malicious URLs dataset, snapshot date not reported. Metadata: none.
Leakage risk: Medium; random 80/20 split on a pre-collected URL list, no deduplication or temporal separation described.
Validation protocol: Hold-out split; 80/20; stratified.
Model selection and tuning: Not reported.
Evaluation: Accuracy; Precision; Recall; F1-score; Confusion matrix.
System metrics: Not reported.
78 [112] Construction and sources: single-source; UNSW-NB15 via Kaggle; Acquisition window: 2015; Preprocessing: standardization and normalization; min–max scaling fitted on train only; one-hot encoding of proto/service/state; PCA to 10 principal components explaining 90% variance; Total items: 257,673 (train 175,341 + test 82,332).
Class balance: Not reported; figures show label distributions but no numeric proportions per split.
Labels: UNSW-NB15 predefined ground truth; snapshot/version not reported. Metadata: none.
Leakage risk: Medium; single-source train/test used and scaler fit on train only; no deduplication/stratification details provided.
Validation protocol: Hold-out split using UNSW-NB15 train/test CSVs (train 175,341, test 82,332).
Model selection and tuning: Not reported; algorithms implemented with “appropriate hyperparameters” without describing selection/tuning procedure.
Evaluation: Accuracy; Precision; Recall; F1-score; Confusion matrix.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
79 [113] Construction and sources: combined; Kaggle “Malicious and Benign URLs” (Kumar Siddharth) and PhishTank; Acquisition window: April 2020; Preprocessing: Not reported; Total items: 100,000.
Class balance: Imbalanced; overall: phishing 60,315; benign 40,000.
Labels: PhishTank for phishing; Kaggle “Malicious and Benign URLs” for benign. Metadata: none (features extracted solely from the URL; no WHOIS/DNS/TLS).
Leakage risk: Medium; combined sources with 80/20 split; deduplication and domain-level separation not reported.
Validation protocol: Hold-out split 80/20 for training and testing; 5-fold CV used during hyperparameter tuning.
Model selection and tuning: GridSearchCV (cv = 5) over Random Forest, SVM, and MLP with specified parameter grids; model chosen by F1-score; RF selected after tuning.
Evaluation: Accuracy; Precision; Recall (Sensitivity); F1-score; ROC AUC; Confusion matrix.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
80 [114] Construction and sources: single-source; CIC-Bell-DNS 2021 (Canadian Institute for Cybersecurity); Acquisition window: accessed 5 January 2023; Preprocessing: removed outliers; removed empty rows/columns; dropped page_rank; Total items: Not reported.
Class balance: Imbalanced; Malware 4337; Phishing 4337; Spam 4337; Benign 2337; Per-split: Not reported.
Labels: CIC-Bell-DNS 2021, accessed 5 January 2023. Metadata: DNS traffic features (32 fields) from captured packets.
Leakage risk: Medium; random/hold-out splits on DNS transactions with a separate test set declared non-overlapping, but no domain-level grouping described.
Validation protocol: Hold-out split; train 80%, validation 20%; separate test set 2608 samples; random seeds noted; stratification: Not reported.
Model selection and tuning: Validation-set tuning; transfer learning with ResNet-50 plus global average pooling and two dense layers; hyperparameters specified (image 224 × 224, batch size 32, dropout 0.5, epochs 25, LR 0.0001, Adam); search strategy details: Not reported.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
81 [115] Construction and sources: single-source; UCI SMS Spam Collection via Kaggle; Preprocessing: tokenization; stop-word removal; lemmatization; text cleaning; Total items: 5850.
Class balance: Imbalanced; overall counts Not reported; per-split distributions Not reported.
Labels: UCI SMS Spam Collection (via Kaggle), snapshot/version Not reported. Metadata: none.
Leakage risk: Medium; hold-out split described only as “80% train, 20% testing and validation”; SMOTE mentioned, not stated if applied only to the training set; no de-duplication details.
Validation protocol: Hold-out split 80/20 (train vs. test + validation); allocation within the 20% Not reported; randomization/stratification Not reported.
Model selection and tuning: Architecture and hyperparameters provided without describing the selection procedure; embedding GloVe 300d; convolution filters 256; kernel size 3; activation ReLU; max pooling 2; learning rate 0.001; epochs 30; ROA population size 30; C = 0.1.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
82 [116] Construction and sources: combined; PhishTank (malicious); Acquisition window: PhishTank accessed 8 November 2015; Preprocessing: only active URLs; semi-manual verification with instrumented browser; Total items: 2383.
Class balance: Imbalanced; overall: phishing 1409, benign 974; per-split: Not reported.
Labels: PhishTank (OpenDNS) plus semi-manual verification. Metadata: none.
Leakage risk: Medium; random hold-out without deduplication or temporal split described.
Validation protocol: Hold-out split 70/15/15; Train 1670; Validation 357; Test 356; stratification: Not reported.
Model selection and tuning: Empirical selection by MSE across neuron combinations; best architecture 2 hidden layers with 10 and 11 neurons; SCG training chosen after comparing algorithms; up to 1000 epochs; tansig transfer; performance metric MSE.
Evaluation: Accuracy; RMSE; MSE; regression R; confusion matrix.
System metrics: Not reported.
Handling of class imbalance: Not addressed.
83 [117] Construction and sources: combined; IWSPA-AP 2018; PECORP (Nazario Phishing Corpus + Enron CALO); Preprocessing: conversion to CSV with fields FROM, TO, DATE, SUBJECT, BODY and LABEL; punctuation removal in FROM, HTML check/removal in BODY, tokenization; Total items: IWSPA full-header 4585; IWSPA no-header 5719; PECORP 5513.
Class balance: Imbalanced; IWSPA full-header 503 phishing/4082 legitimate; IWSPA no-header 628 phishing/5091 legitimate; PECORP 2712 phishing/2801 legitimate.
Labels: IWSPA-AP 2018; Nazario Phishing Corpus; Enron CALO. Metadata: none.
Leakage risk: Medium; multiple corpora; 10-fold CV without deduplication/timestamp separation described; SMOTE use on IWSPA not clearly nested with CV.
Validation protocol: 10-fold CV.
Model selection and tuning: Models compared in PyCaret across 13 classifiers; best model per experiment reported; hyperparameter selection procedure not described.
Evaluation: Accuracy; AUC; Recall; Precision; F1; Kappa; MCC.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
84 [118] [A]
85 [119] Construction and sources: combined; Enron Email Corpus, SpamAssassin Public Corpus, Nazario Phishing Corpus, authors’ mailboxes; Acquisition window: phishing 2015–2021 (Nazario 2015–2020 plus a pre-2015 subset; authors’ mailboxes 2019–2021); Preprocessing: duplicates removed when merging Enron and SpamAssassin benign emails; parsing and text cleansing including lowercasing, stopword and HTML removal, URL placeholder, tokenization, lemmatization; Total items: 35,511.
Class balance: Imbalanced; Train: 22,432 benign/2425 phishing; Test: 9619 benign/1035 phishing; Total: 32,051 benign/3460 phishing.
Labels: Enron (benign), SpamAssassin (benign), Nazario (phishing), authors’ mailboxes 2019–2021. Metadata: none.
Leakage risk: Medium; multiple corpora combined, random 70/30 split, deduplication explicitly reported only for benign merge; no explicit cross-split or domain-level de-duplication.
Validation protocol: Hold-out split 70/30; training/validation 24,857 and test 10,654.
Model selection and tuning: Compared LR, GNB, KNN, DT, MLP on content-based and text-based features; selected DT for content and KNN for text; MLP used as fusion; MLP hidden layers varied 1–4 and chose 3; other hyperparameter selection not reported.
Evaluation: Accuracy; Precision; Recall; F1-score; ROC AUC; MCC; FPR; FNR.
System metrics: Training time 0.0313 s for Method 2 (Soft Voting).
Handling of class imbalance: Partially addressed (metrics only).
86 [120] [A]
87 [121] Construction and sources: combined; UCI SpamBase; CSDMC2010 (ICONIP 2010); merged Phishing_corpus from SpamAssassin and Nazario (phishing emails); Preprocessing: email parsing to header/body, tokenization, stemming, lemmatization, case folding; regex-based extraction of URLs and other features; normalization/scaling; Total items: SpamBase 4601; CSDMC2010 4327; Phishing_corpus 5850.
Class balance: Imbalanced; SpamBase 2788 ham/1813 spam; CSDMC2010 2949 ham/1378 spam; Phishing_corpus DS1 2758 ham/660 spam; DS2 2758 ham/2432 phishing; DS3 2758 ham/660 spam/2432 phishing.
Labels: UCI SpamBase; CSDMC2010; SpamAssassin; Nazario (no snapshot/version reported). Metadata: none.
Leakage risk: Medium; multiple datasets combined; random 70/30 split without deduplication or temporal separation described; potential overlap/near-duplicates across splits.
Validation protocol: Hold-out split (70/30); 10-fold CV; early stopping variant evaluated.
Model selection and tuning: Grid search over number of layers (2–20), hidden units {16, 32, 64, 128, 256}, and learning rate {0.01, 0.001, 0.0001} with 10-fold CV; architecture and LR chosen by best validation accuracy.
Evaluation: Accuracy; Precision; Recall; F1-score; MCC; BDR.
System metrics: SpamBase (200 epochs, all features): building time 229.32218 s; testing time 0.05756 s; Phishing_corpus (200 epochs, all features): DS1 building 117.0894 s/testing 0.0501 s; DS2 213.5986 s/0.0513 s; DS3 230.6510 s/0.0527 s.
Handling of class imbalance: Partially addressed (metrics only).
88 [122] Construction and sources: combined; SpamAssassin Public Corpus (20030228 easy ham, 20030228 hard ham, 20030228 easy ham 2), Nazario Phishing Corpus (phishing3.mbox); Acquisition window: dataset archive versions as named; Preprocessing: subject and body extracted, multipart parts concatenated, HTML converted to text and links using html2text, link text and URL extracted, attachments ignored; Total items: 6429.
Class balance: Imbalanced; Overall: ham 4150; phishing 2279.
Labels: SpamAssassin Public Corpus, Nazario Phishing Corpus. Metadata: none.
Leakage risk: Medium; combined datasets from different sources and random 10-fold CV, no deduplication described.
Validation protocol: 10-fold CV.
Model selection and tuning: Architecture and hyperparameters provided without describing the selection or tuning procedure.
Evaluation: Accuracy; Precision; Recall; F1-score; False Positive Rate.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
89 [123] Construction and sources: combined; Phishing Corpus, SpamAssassin ham, plus in-house email repository; Acquisition window: SpamAssassin ham (2002); Phishing Corpus (2004–2007 and 2015–2017); in-house corpus collected contemporaneously by authors; Preprocessing: duplicate removal for open-source corpora, header parsing and cleanup (removing extra spaces, angle brackets, quotes) before tokenization/lemmatization; Total items: Dataset-1 15,430; Dataset-2 27,405; Dataset-3 27,256.
Class balance: Imbalanced; overall distributions: Dataset-1 6295 legitimate/9135 phishing; Dataset-2 18,270 legitimate/9135 phishing; Dataset-3 18,270 legitimate/8986 phishing.
Labels: Phishing Corpus and SpamAssassin ham; in-house emails manually labeled using header analysis with Google warnings and MXToolbox. Metadata: none.
Leakage risk: Medium; multiple sources combined and a random 70/30 split; deduplication only reported for Dataset-1; no stated controls for near-duplicates/header-similar items across splits.
Validation protocol: Hold-out split (random), 70% train/30% test for each dataset.
Model selection and tuning: Multiple embeddings and classifiers tried; vector size and final combination chosen from observed results; architecture and hyperparameters provided without a separate, described selection procedure.
Evaluation: Accuracy; Precision; Recall/TPR; Specificity/TNR; F-score; MCC; FPR.
System metrics: Training time 67.15 s (TF-IDF) to 425.02 s (Word2Vec-SkipGram); Testing time 50.44 s (TF-IDF) to 328.56 s (Word2Vec-SkipGram), for vector size 200.
Handling of class imbalance: Partially addressed (metrics only).
90 [124] Construction and sources: combined; Kaggle Phishing Email Collection (2020 revision by Akashsurya156); PhishTank phishing URLs; Acquisition window: Kaggle 2020 revision; PhishTank “active” at crawl time; Preprocessing: tokenization; lemmatization; BeautifulSoup crawl for active URLs; internal/external feature sets (IFS/EFS) defined; Total items: emails 525,754; URLs used 20,000.
Class balance: Not reported; URL dataset balanced (Train 8000 phishing/8000 benign; Test 2000 phishing/2000 benign); Kaggle emails: Not reported.
Labels: Kaggle Phishing Email Collection (2020 revision), PhishTank verified phishing URLs (active at crawl). Metadata: none.
Leakage risk: Medium; multiple datasets and split/deduplication procedures not fully described; potential overlap not excluded.
Validation protocol: Hold-out 80/20 for URL dataset; additional hold-out tests with 20/25/30/40 percent splits on emails; k-fold CV used, k not reported.
Model selection and tuning: Algorithms compared (Multinomial Naive Bayes, SVM, RF, AdaBoost, Logistic Regression); hyperparameters and final selection procedure not described.
Evaluation: Accuracy; Precision; Recall; F1-score; Specificity.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
91 [125] Construction and sources: single-source; researcher’s Outlook mailbox emails saved as HTML/text; Preprocessing: header/body split, tokenization, short-form expansion, stop-word removal, stemming, regex noise handling, document-frequency filter, mutual information feature selection; Total items: 2000 emails.
Class balance: Not reported.
Labels: proprietary manual labeling of researcher’s Outlook emails into spam vs. legitimate; snapshot not specified. Metadata: none.
Leakage risk: Medium; 10-fold CV on a proprietary email corpus with no deduplication or sender/thread grouping described, so near-duplicates may cross folds.
Validation protocol: 10-fold CV.
Model selection and tuning: Naive Bayes specified; feature selection via document frequency and mutual information; hyperparameters and selection procedure not reported.
Evaluation: Accuracy; Precision; Recall; F-measure; FP rate; FN rate.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
92 [126] Construction and sources: proprietary; three environments (research institute, university, IT company); official accounts; Java collection tool; Acquisition window: June 2018–December 2019; 6 months per participant; Preprocessing: user labeling; automatic feature extraction (14 features); other QC: Not reported; Total items: Not reported.
Class balance: Imbalanced; spam proportion by environment: research institute 46.8%, university 53.5%, company 27.1%.
Labels: user-provided labels in the tool. Metadata: none.
Leakage risk: Medium; random 60/40 split within users; no de-duplication or time-based separation described.
Validation protocol: 60/40 random split with 10-fold CV; Phase 2: train on all labeled data and classify new emails for 2 weeks.
Model selection and tuning: Algorithms enumerated (NaiveBayes, J48, IBK, LibSVM, RBFNetwork, FFNN, BiLSTM, SMO-LibSVM); WEKA default settings; selection procedure Not reported.
Evaluation: AUC; False positive rate; False negative rate; Accuracy.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
93 [127] Construction and sources: combined; PhishTank; PhishStats; OpenPhish; Acquisition window: continuous crawl (cron every 12 h); Preprocessing: labeling by source; duplicate-row removal; removal of rows with redacted keywords; extraction of 32 lexical URL features; Total items: 817,997.
Class balance: Imbalanced; Overall: 468,005 malicious; 349,992 benign; Per split: Not reported.
Labels: PhishTank; PhishStats; OpenPhish; snapshot Not reported. Metadata: none.
Leakage risk: Medium; multi-source feeds combined; only duplicate rows removed; split ratio unspecified.
Validation protocol: Hold-out split; ratio Not reported.
Model selection and tuning: Comparative evaluation of FNN, Bi-RNN, GRU, LSTM, RNN, CNN; CNN selected based on best evaluation; ablations on conv layers, dropout, loss, batch size, epochs; procedure details beyond comparisons not described.
Evaluation: Accuracy; Precision; Recall; F1; Confusion matrix.
System metrics: Execution time (s) reported, e.g., CNN 629.896 s; batch size 128 variant 549.733 s; epochs 12 variant 618.987 s; class-balance variants 649.639–832.164 s.
Handling of class imbalance: Adequately addressed (metrics and techniques).
94 [128] Construction and sources: combined; CSDMC2010 (ICONIP competition), Enron email corpus; Preprocessing: removing punctuation, lowercasing, tokenization, stop-word removal, lemmatization; TF-IDF vectorization with n-first features (n = 500 or 1000); Total items: CSDMC2010 4307; Enron 0.5 M messages in corpus, subset for experiments Not reported.
Class balance: Imbalanced; CSDMC2010 overall: spam 1378, ham 2929; Enron: Not reported.
Labels: CSDMC2010 competition labels; Enron corpus labels. Metadata: none.
Leakage risk: Medium; random 10-fold CV across full datasets; no deduplication or user/thread grouping described.
Validation protocol: 10-fold CV (random; stratification not specified).
Model selection and tuning: GridSearchCV used to tune baseline ML models; OAOS optimizes LR weights; search spaces not detailed; final hyperparameters listed.
Evaluation: F1-score; Precision; Recall.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
95 [129] Construction and sources: combined; SpamAssassin ham, Jose Nazario phishing email set; Preprocessing: feature extraction on emails, Information Gain feature selection, Gaussian scaling, libSVM formatting; Total items: 4000.
Class balance: Imbalanced; Overall: 3500 ham (87.5%), 500 phishing (12.5%).
Labels: SpamAssassin ham; Jose Nazario phishing (snapshot not specified). Metadata: none.
Leakage risk: Medium; combined sources; no deduplication or temporal split described.
Validation protocol: Repeated 10-fold CV (10 × 10).
Model selection and tuning: RBF kernel; C and γ explored on exponential grid; final selection procedure not reported.
Evaluation: Accuracy; Precision; Recall; F-Measure; False Positive rate; False Negative rate.
System metrics: Training time 30.54–45.62 s (filter-based) and 378.12–409.69 s (wrapper-based); storage reduction 5.90–8.92% (filter-based) and 47.83–50.10% (wrapper-based).
Handling of class imbalance: Partially addressed (metrics only).
96 [130] Construction and sources: single-source; Kaggle; authors’ Urdu-translated dataset posted to GitHub; Preprocessing: Googletrans translation with manual correction; tokenization, stop-word removal, stemming; Total items: 5000 emails.
Class balance: Not reported.
Labels: Kaggle emails translated to Urdu; snapshot not reported. Metadata: none.
Leakage risk: High; duplicates present (4.8%) and no deduplication described; simple 80/20 hold-out split.
Validation protocol: Hold-out split (80/20; train 4000, test 1000).
Model selection and tuning: Not reported.
Evaluation: Accuracy; Precision; Recall; F1-score; ROC-AUC; Model loss.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
97 [131] [R]
98 [132] Construction and sources: combined; three public datasets “Phishing email collection,” “Phishing legitimate full,” “Spam or not spam dataset”; Preprocessing: duplicate removal, missing-value removal, balancing by random sampling for dataset 1, tokenizing numbers and URLs as NUMBER and URL for dataset 3; Total items: 16,751; 10,000; 3000.
Class balance: Exp1 Balanced (Train 5846 phishing/5881 legitimate; Test 2506 phishing/2520 legitimate); Exp2 Balanced (Train 3502 phishing/3498 legitimate; Test 1498 phishing/1502 legitimate); Exp3 Imbalanced (Train 351 phishing/1749 benign; Test 149 phishing/751 benign).
Labels: Not reported. Metadata: none.
Leakage risk: Medium; random 70/30 splits, only exact-duplicate removal described, no temporal split or cross-dataset deduplication reported.
Validation protocol: Hold-out split 70/30 for each dataset.
Model selection and tuning: Seven algorithms compared; final choice by highest accuracy; hyperparameters and selection procedure not described.
Evaluation: Accuracy; Precision; Recall; F1-score.
System metrics: Not reported.
Handling of class imbalance: Exp1 Adequately addressed; Exp2 Dataset balanced; Exp3 Partially addressed (metrics only).
99 [133] Construction and sources: combined; SpamAssassin Data (ham) and Nazario Phishing Corpus (phishing); Preprocessing: programmatic feature extraction in C#, conversion to LIBSVM format, Gaussian scaling to zero-mean/unit-variance, information gain feature reduction; Total items: 4000.
Class balance: Imbalanced; overall: phishing 500 (12.5%), ham 3500 (87.5%); per-split: Not reported.
Labels: SpamAssassin Data; Nazario Phishing Corpus; snapshot/version not reported. Metadata: none.
Leakage risk: High; combined sources without deduplication described and information-gain feature selection not stated as train-only; repeated 10-fold CV without nesting.
Validation protocol: Repeated 10 × 10-fold CV.
Model selection and tuning: RBF SVM; grid search over exponentially spaced C and γ; best pair selected by prediction accuracy; feature count reduced via information gain; selection relative to CV not described.
Evaluation: Accuracy; Global-best accuracy; False-positive rate; False-negative rate; Recall; Precision; F-measure.
System metrics: Training time 38.46 s, 44.76 s, 64.35 s, 71.08 s; storage reduction 5.56% or 8.33% (by subset size/K).
Handling of class imbalance: Partially addressed (metrics only).
100 [134] Construction and sources: single-source; E-goi servers (EML); Preprocessing: deduplication; removal of emails without content or address; feature standardization and text embedding with PCA/HC reduction; Total items: 214,214.
Class balance: Imbalanced; Overall: phishing 214; benign 214,000; Train: phishing 160; benign 3050; Test: phishing 54; benign 1016.
Labels: internal E-goi classification; snapshot Not reported. Metadata: none.
Leakage risk: Medium; single-source with random/k-means sub-sampling and k-fold/hold-out; duplicates removed, but no temporal or account-level separation reported.
Validation protocol: 3-fold CV for grid search; final evaluation on hold-out split 75/25; training 3210 emails and testing 1070 emails (5% phishing in each).
Model selection and tuning: Exhaustive grid search with 3-fold CV; RF tuned over {criterion, oob_score, min_samples_leaf, max_features} with F1/recall scoring; MLP tuned over {hidden_layer_sizes, activation, solver, max_iter}; final choice prioritized F1/recall and low blocked-accounts on 5% “pca_centroids_phish” sets; selected NN with ReLU and Adam, two hidden layers.
Evaluation: Accuracy; Precision; Recall; F1; ROC AUC; confusion matrix.
System metrics: % Blocked accounts 4.62%; % New right 82.67%.
Handling of class imbalance: Adequately addressed (metrics and techniques).
101 [135] Construction and sources: single-source; Kaggle “Instagram fake spammer genuine accounts” (two CSVs: train and test); Acquisition window: accessed 17 September 2021; Preprocessing: feature scaling to [0, 1] with scikit-learn; Total items: 576.
Class balance: Balanced; Overall: 288 fake, 288 genuine; Splits: Not reported.
Labels: Kaggle “Instagram fake spammer genuine accounts”. Metadata: none.
Leakage risk: Medium; two CSVs for train and test only; split procedure and leakage controls not described.
Validation protocol: Hold-out split; sizes not reported.
Model selection and tuning: Architecture and hyperparameters provided without describing the selection procedure (Sequential ANN with layers 50–150–150–2; ReLU; Softmax; Adam).
Evaluation: Accuracy; Precision; Recall; F1-score; Confusion matrix.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
102 [136] [A]
103 [137] Construction and sources: combined; PhishTank (2018) for phishing, Yandex Search API top-ranked pages for benign; Preprocessing: tokenization; Weka StringToWordVector; feature reduction with CfsSubsetEval; generic cleaning of missing values and removal of personal information; Total items: 73,575.
Class balance: Balanced; Overall: 37,175 phishing/36,400 legitimate; Train/Test: 75/25 (random); per-split class proportions Not reported.
Labels: PhishTank (phishing) and Yandex Search API top-ranked pages (benign). Metadata: none.
Leakage risk: High; random URL split over a combined dataset, no deduplication or temporal separation described, and inconsistent use of 75/25 split and 10-fold CV.
Validation protocol: Random hold-out 75/25; 10-fold cross-validation also reported.
Model selection and tuning: Architecture and hyperparameters varied (number of LSTM units, dense layers, epochs) without describing the selection procedure.
Evaluation: Accuracy; Precision; Recall; F1-score; AUC; MSE.
System metrics: Not reported.
Handling of class imbalance: Partially addressed (metrics only).
104 [138] Construction and sources: combined; Kaggle “MachineLearning-Detecting-Twitter-Bots” and Twitter API stream; Preprocessing: missing-value treatment for profile-centric features; graph construction to .mtx; Total items: Not reported.
Class balance: Not reported.
Labels: Kaggle “MachineLearning-Detecting-Twitter-Bots” and Twitter API streamed data. Metadata: none.
Leakage risk: High; combined pre-existing Kaggle data with newly streamed Twitter data, no split, deduplication, or leakage controls described.
Validation protocol: Not reported.
Model selection and tuning: Proposed Improved Sybil Guard with fixed thresholds and rules; architecture and thresholds provided without describing the selection procedure.
Evaluation: Accuracy.
System metrics: Not reported.
Handling of class imbalance: Not addressed.
105 [139] Construction and sources: combined; English Wikipedia (EnWiki) block logs and user contributions; Acquisition window: February 2004–April 2015; Preprocessing: filtered accounts blocked for Sockpuppetry with infinite duration, grouped by Sockpuppeteer, sampled 5000 Sockpuppets from groups with >3 plus 5000 Active accounts with >1 year activity and ≥1 contribution, extracted revisions across 30 namespaces and computed 11 non-verbal features including revert detection; Total items: 10,000 accounts.
Class balance: Balanced; 5000 Sockpuppet, 5000 Active (overall).
Labels: English Wikipedia Sockpuppet block logs and Sockpuppet Investigations up to April 2015. Metadata: none.
Leakage risk: High; random 2/3–1/3 split without group-wise separation can place accounts from the same Sockpuppeteer on both train and test; procedure not described to prevent this.
Validation protocol: Hold-out split (2/3 train + validation, 1/3 test); 10-fold CV on training for hyperparameter selection.
Model selection and tuning: 10-fold CV on training in Weka to choose algorithm hyperparameters; best settings then evaluated on the hold-out test set; standardized vs. normalized variants compared.
Evaluation: Accuracy; TP Rate; FP Rate; Precision; Recall; F-Measure; MCC; AUC.
System metrics: Not reported.
Handling of class imbalance: Adequately addressed (metrics and techniques).
WHOIS domain registration data (WHOIS), Domain Name System (DNS), Cross-validation (CV), Transport Layer Security (TLS), Secure Sockets Layer (SSL), Receiver Operating Characteristic Area Under the Curve (ROC AUC), Artificial Neural Network (ANN), Hypertext Markup Language (HTML), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Dempster Shafer Theory (DST), Deep Radial Basis Function Network (Deep_RBF), Deep Generalized Radial Basis Function Network (Deep_GRBF), Deep Probabilistic Neural Network (Deep_PNN), Deep Hypothesis Probabilistic Neural Network (Deep_HPNN), Matthews Correlation Coefficient (MCC), Area Under the ROC Curve (AUC), True Positive Rate (TPR/Recall/Sensitivity), True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), Software Defined Network (SDN), Recursive Feature Elimination with Support Vector Machine (RFE-SVM), Abstract Syntax Tree (AST), Feature Selection Convolutional Neural Network (FS-CNN), Convolutional Neural Network (CNN), Genetic Algorithm (GA), Application Programming Interface (API), Geometric Mean (G-mean), Receiver Operating Characteristic (ROC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Variational Autoencoder (VAE), Waikato Environment for Knowledge Analysis (WEKA), Central Processing Unit (CPU), Random Access Memory (RAM), Random Forest (RF), JavaScript (JS), Mean Square Error (MSE), Root Mean Square Error (RMSE), Non-Dimensional Error Index (NDEI), Multilayer Perceptron (MLP), Naive Bayes (NB), Gaussian Naive Bayes (GNB), Feature Selection by Omitting Redundant Features (FSOR), Feature Selection by Filter Method (FSFM), Registration Data Access Protocol (RDAP), Deep Learning (DL), Logistic Regression (LR), Term Frequency–Inverse Document Frequency (TF-IDF), Bayesian Network (BN), Autonomous System Number (ASN), Sequential Minimal Optimization (SMO), Time To Live (TTL), Support Vector Machine (SVM), Support Vector Classifier (SVC), Differential Evolution (DE), Honey Badger Algorithm (HBA), Mail Exchange (MX), IPv6 Address Record (AAAA), Canonical Name (CNAME), Top-Level Domain (TLD), Android Application Package (APK), Google’s Phishing Page Filter (GPPF), Logistic Model Trees (LMT), Tensor Processing Unit (TPU), Online Phishing Threats (OPT), Histogram of Oriented Gradients (HOG), Paragraph Vector–Distributed Bag of Words (PV-DBoW), Paragraph Vector–Distributed Memory (PV-DM), Evolving Fuzzy Neural Network (EFuNN), Optical Character Recognition (OCR), Distance Threshold (Dthr), International Workshop on Security and Privacy Analytics (IWSPA), Minimum Redundancy Maximum Relevance (MRMR), Gradient Boosting Classifier (GBC), Gradient Boosting Machine (GBM), Rectified Linear Unit (ReLU), k-Nearest Neighbors (KNN), Feedforward Neural Network (FFNN), Decision Tree (DT), Principal Component Analysis (PCA), Hierarchical Clustering (HC), Dataset (DS), [A]—abstract, [R]—review.
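Two patterns account for many of the leakage risks flagged in the preceding table: resampling (e.g., SMOTE) applied before the train/test split, and hyperparameter searches (e.g., GridSearchCV) run on the same folds used to report results. The following sketch is a minimal illustration, not a reconstruction of any reviewed pipeline; the synthetic data, classifier, and parameter grid are placeholder assumptions. It shows how an imblearn Pipeline confines SMOTE to the training portion of each fold while an outer cross-validation loop keeps the reported score independent of the inner tuning loop.

```python
# Minimal sketch (assumes scikit-learn and imbalanced-learn; data, classifier,
# and grid are illustrative placeholders, not taken from any reviewed study).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline

# Stand-in for an imbalanced phishing feature matrix (85% benign, 15% phishing).
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.85, 0.15], random_state=42)

# SMOTE sits inside the pipeline, so it is re-fitted on the training portion
# of every fold and never touches the corresponding evaluation fold.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {"clf__n_estimators": [100, 150], "clf__max_depth": [5, 10]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # tuning folds
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # reporting folds

# Inner loop: GridSearchCV selects hyperparameters on the training folds only.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=inner)

# Outer loop: the reported score comes from folds the tuning never saw.
scores = cross_val_score(search, X, y, scoring="f1", cv=outer)
print(f"Nested-CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Under this arrangement, synthetic minority samples never appear in an evaluation fold and the selected hyperparameters are never scored on the data that chose them, so the optimistic bias noted for several table entries cannot arise.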
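A second recurring gap is the absence of domain-level isolation: when several URLs from the same site land on both sides of a random split, near-duplicate pages can inflate the measured accuracy. The sketch below illustrates the grouping discipline many entries lack; the URLs and labels are hypothetical, and a registered-domain extractor such as tldextract would group more strictly than the hostname used here.

```python
# Minimal sketch of domain-grouped splitting (URLs and labels are hypothetical).
from urllib.parse import urlparse
from sklearn.model_selection import GroupKFold

urls = [
    "http://paypal.example-login.com/verify",
    "http://paypal.example-login.com/update",   # same domain as the line above
    "https://en.wikipedia.org/wiki/Phishing",
    "https://en.wikipedia.org/wiki/Main_Page",  # same domain as the line above
    "http://secure-bank-alerts.net/signin",
    "https://github.com/openai",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = phishing, 0 = benign

# Group key: hostname, so pages from one site always share a fold.
groups = [urlparse(u).hostname for u in urls]

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(urls, labels, groups)):
    train_domains = {groups[i] for i in train_idx}
    test_domains = {groups[i] for i in test_idx}
    assert train_domains.isdisjoint(test_domains)  # no domain on both sides
    print(f"fold {fold}: test domains = {sorted(test_domains)}")
```

The same idea extends to emails (grouping by sender or thread) and to social-media accounts (grouping by user or Sockpuppeteer), which addresses the identity-level leakage flagged for several entries above.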
Table 3. Publications across all categories by time period (2017–2020, 2021–2024).
Labeling | 2017–2020 | 2021–2024 | All Years | Share [%]
Unique Publications | 32 | 73 | 105 | 100.0
Phishing Delivery Channels a
Websites | 23 | 38 | 61 | 58.10
Malware | 5 | 21 | 26 | 24.76
Electronic Mail | 5 | 18 | 23 | 21.90
Social Networking | 2 | 6 | 8 | 7.62
Machine Learning Models and Techniques b
Machine Learning | 27 | 56 | 83 | 79.05
Neural Networks | 9 | 35 | 44 | 41.90
Classification and Ensembles | 16 | 37 | 53 | 50.48
Feature Engineering | 8 | 21 | 29 | 27.62
Research Methodology c
Experiment | 30 | 65 | 95 | 90.48
Literature Analysis | 10 | 30 | 40 | 38.10
Case Study | 1 | 1 | 2 | 1.90
Conceptual | 14 | 21 | 35 | 33.33
a A single research paper can address more than one delivery channel; therefore, it may be classified under multiple subcategories simultaneously. b Many studies apply multiple approaches within the same research; consequently, some publications are included in several subcategories. c More than one research method can be applied in each analyzed document.
Table 4. Publications by Phishing Delivery Channels in other categories.
Research Approach | Websites | Malware | Electronic Mail | Social Networking | Total
Unique Publications | 61 | 26 | 23 | 8 | 105
Machine Learning Models and Techniques a
Machine Learning | 46 | 22 | 20 | 7 | 83
Neural Networks | 27 | 7 | 9 | 4 | 44
Classification and Ensembles | 33 | 12 | 15 | 2 | 53
Feature Engineering | 22 | 4 | 7 | 0 | 29
Research Methodology a
Experiment | 58 | 21 | 22 | 6 | 98
Literature Analysis | 20 | 10 | 12 | 4 | 40
Case Study | 0 | 0 | 2 | 0 | 2
Conceptual | 22 | 6 | 5 | 5 | 35
a A single research paper can address more than one research approach; therefore, it may be classified under multiple subcategories simultaneously.