Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs
Abstract
1. Introduction
- RQ1: Do SLMs deliver coherent and reliable outputs, including meaningful reasoning for phishing website detection?
- RQ2: How do the costs and benefits of local models compare to state-of-the-art proprietary models such as GPT4?
- RQ3: How does the length of the HTML code influence both analysis runtime and detection performance?
1. We design an LLM-based phishing website detection pipeline and benchmark 15 SLMs against various numeric and text-based criteria.
2. The benchmarking methodology, dataset, and source code are publicly available on GitHub (https://github.com/sbaresearch/benchmarking-SLMs, accessed on 2 March 2026), so other researchers and practitioners can reproduce the results and test new models in the future.
3. We discuss the benefits and challenges of small, local models compared to commercially available options and provide recommendations for applying SLMs to website phishing detection.
2. Related Work
3. Methodology
3.1. Model Selection
- Since no single llama3.x model on Ollama comes in all required parameter sizes, multiple versions were included to cover the range of sizes.
- The chosen Qwen3 models feature an exceptionally large context window length of 256,000 tokens, which allows them to process more input. To maintain comparability, the maximum prompt length was capped at 128,000 tokens.
- Since LLM-based phishing detection research often relies on GPT-family models, we decided to include gpt-oss:20b as an open-source GPT alternative with a comparable model size to the other models tested in our experiments.
3.2. Prompt Design
- Flat Prompt Design: We use the generate mode (no conversation history), since each website must be analysed independently. As a result, the prompt had to be written in a flattened format, i.e., without explicit role labels like system or user.
- Few-Shot Prompting: To guide the model in detecting phishing characteristics, the prompt includes a limited number of examples of common phishing indicators (e.g., suspicious URLs, hidden forms, misleading redirects). These examples provide context and decision-making cues.
- Output Formatting: A strict JSON-parsable output format was defined to standardise model responses. The output contains three fields: phishing_score (an integer from 0 to 10, segmented into predefined risk brackets to reduce ambiguity and facilitate consistent interpretation across different models), is_phishing (a boolean classification), and reasoning (a brief textual explanation citing key indicators). Furthermore, the output length is limited to 1000 tokens to encourage models to retain the required output formatting and to give concise answers.
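To illustrate how such an output format can be checked automatically, the following minimal sketch extracts and validates the three fields from a raw model response. It assumes that responses may be wrapped in a Markdown code fence or preceded by a reasoning trace ending in `</think>`, as some reasoning models produce; the function name `parse_model_output` is hypothetical and not taken from the published source code.

```python
import json
import re
from typing import Optional

# Reasoning models may emit a <think>...</think> block before the answer.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Greedy match from the first "{" to the last "}" in the remaining text.
JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def parse_model_output(raw: str) -> Optional[dict]:
    """Return the three validated fields, or None if the output is unusable."""
    text = THINK_BLOCK.sub("", raw)
    # Some outputs contain only the closing tag; drop everything before it.
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    match = JSON_OBJECT.search(text)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("phishing_score")
    flag = data.get("is_phishing")
    reason = data.get("reasoning")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 0 <= score <= 10:
        return None
    if not isinstance(flag, bool) or not isinstance(reason, str):
        return None
    return {"phishing_score": score, "is_phishing": flag, "reasoning": reason}
```

A response that fails any of these checks would be counted as a formatting failure rather than as a classification.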
4. Evaluation Design
4.1. Dataset
4.2. Experiments
4.2.1. Experiment 1—Syntactic Performance and Runtime
- Is the output deterministic or do variations occur?
- How long are the model runtimes, and how does the prompt length influence them?
- To what extent do models adhere to the specified JSON output format?
4.2.2. Experiment 2—Phishing Detection Performance
4.3. Used Hardware
5. Results
5.1. Results Experiment 1
5.1.1. LLM Runtime
- Prompt length strongly affects runtime, with longer inputs leading to higher computational costs, for almost all models.
- Model family, rather than model size alone, is a key determinant of runtime efficiency.
- Gemma and Llama models are generally faster and more consistent than DeepSeek and Qwen models.
- Runtime variability plays a crucial role in determining model practicality, with low-variance models being more predictable.
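The point about runtime variability can be made concrete with a small sketch that summarises per-request runtimes by their mean, standard deviation, and coefficient of variation. The two runtime lists are hypothetical values chosen for illustration and are not measurements from this study.

```python
from statistics import mean, stdev


def runtime_summary(runtimes):
    """Mean, sample standard deviation and coefficient of variation."""
    avg = mean(runtimes)
    sd = stdev(runtimes)
    return {"mean": avg, "stdev": sd, "cv": sd / avg}


# Hypothetical per-request runtimes, NOT measurements from the study.
steady = [2.0, 2.1, 1.9, 2.0, 2.0]
erratic = [0.5, 4.0, 1.0, 3.5, 1.0]

steady_summary = runtime_summary(steady)
erratic_summary = runtime_summary(erratic)

# Both lists share the same mean, but the second has a much larger spread,
# which makes capacity planning for it less predictable.
print(steady_summary)
print(erratic_summary)
```

Two models with the same average runtime can therefore differ considerably in how predictable their behaviour is under load.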
5.1.2. Syntactic Performance
5.1.3. Output Stability
- is_phishing = True, phishing_score ≥ 5, reasoning says it is phishing,
- is_phishing = False, phishing_score < 5, reasoning says it is not phishing.
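The numeric part of these two consistent cases can be checked mechanically, as in the minimal sketch below. It covers only the agreement between is_phishing and phishing_score at the cut-off of 5 stated above; whether the reasoning text matches the verdict still has to be judged by reading it. The function names are hypothetical.

```python
def fields_agree(result: dict, threshold: int = 5) -> bool:
    """True if is_phishing matches the side of the threshold the score is on."""
    return result["is_phishing"] == (result["phishing_score"] >= threshold)


def count_inconsistent(results: list) -> int:
    """Number of outputs whose boolean verdict contradicts the score."""
    return sum(1 for r in results if not fields_agree(r))


# Hypothetical outputs for illustration only.
examples = [
    {"phishing_score": 8, "is_phishing": True},   # consistent
    {"phishing_score": 2, "is_phishing": False},  # consistent
    {"phishing_score": 3, "is_phishing": True},   # contradictory
]
print(count_inconsistent(examples))
```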
Quantitative Analysis
Qualitative Analysis of Reasoning Outputs
5.1.4. Consequences for the Setup of Experiment 2
5.2. Results Experiment 2
Threshold Sensitivity Analysis
6. Discussion
6.1. Costs
- Proprietary API: In related literature, GPT4o and GPT4-turbo are frequently used for phishing detection. At the time of writing (September 2025), GPT4o is priced at $2.5 per million input tokens and $10 per million output tokens, while GPT4-turbo costs $10 per million input tokens and $30 per million output tokens. The newer and comparable GPT5 models, such as GPT5, GPT5.1, and GPT5.2, are priced at a similar level (as of February 2026), with $1.25 to $1.75 per million input tokens and $10 to $14 per million output tokens. The most expensive model, GPT5.2-pro, costs $21 per million input tokens and $168 per million output tokens [38]. Concretely, Experiment 2 comprises two analysis runs for each of the 1000 websites, or 3.8 million tokens in total per model; for the three proprietary models, we paid $11.26 for GPT4.1, $10.09 for GPT5.2 and $10.05 for gemini2.5-flash. Across both experiments in this study, each model processed approximately 8 million input tokens. Output token counts varied between models; gpt-oss, which belongs to the same family as GPT-4, generated roughly 1.6 million output tokens in total and serves as an example here. Using these figures, the total cost of running the experiments with the models commonly used in related work would be approximately $36 (31€) for GPT4o, or $128 (109€) for GPT4-turbo.
- Renting GPUs: In this study, models were executed on rented GPUs via runpod, with the following rates: Nvidia A100, 1.4 EUR per GPU hour, and Nvidia H100, 2.03 EUR per GPU hour. The total runtime for Experiments 1 and 2 was slightly over 28 h, resulting in a total cost of approximately 41 EUR for the A100 setup, including disk storage fees. For the two largest 70b models, which required an H100 GPU, the total runtime was 22 h and 20 min, amounting to roughly 46 EUR, including disk storage.
- Fully local Setup: As the Nvidia A100 is no longer widely available, the latest GPU generation (H100 series) is an option, which costs between 27,000 EUR and 39,000 EUR, depending on the provider and specifications (https://www.newegg.com/p/pl?d=h100 (accessed on 11 September 2025), https://geizhals.at/nvidia-h100-nvl-900-21010-0020-000-a3356480.html (accessed on 11 September 2025)). In addition to the initial investment, electricity, cooling, and maintenance costs must be factored in, which can vary significantly by region and infrastructure.
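The cost figures above follow from simple arithmetic, which the sketch below reproduces using the token counts, prices, and runtimes quoted in this section. Storage fees are not itemised in the text and are therefore left out, so the rental values come out slightly below the reported totals; the function names are hypothetical.

```python
def api_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Pay-per-token cost in USD for given per-million-token prices."""
    return (input_tokens / 1e6) * usd_per_m_input + (output_tokens / 1e6) * usd_per_m_output


def rental_cost(hours, eur_per_gpu_hour):
    """GPU rental cost in EUR, excluding storage fees."""
    return hours * eur_per_gpu_hour


INPUT_TOKENS = 8_000_000    # approximate input tokens per model, both experiments
OUTPUT_TOKENS = 1_600_000   # approximate output tokens, gpt-oss used as example

gpt4o_usd = api_cost(INPUT_TOKENS, OUTPUT_TOKENS, 2.5, 10.0)
gpt4_turbo_usd = api_cost(INPUT_TOKENS, OUTPUT_TOKENS, 10.0, 30.0)
a100_eur = rental_cost(28, 1.4)              # "slightly over 28 h"
h100_eur = rental_cost(22 + 20 / 60, 2.03)   # 22 h 20 min

print(f"GPT4o:      ${gpt4o_usd:.2f}")       # $36.00
print(f"GPT4-turbo: ${gpt4_turbo_usd:.2f}")  # $128.00
print(f"A100:       {a100_eur:.2f} EUR")     # 39.20 EUR before storage
print(f"H100:       {h100_eur:.2f} EUR")     # 45.34 EUR before storage
```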
6.2. Benefits of SLMs
- Data Privacy, Control and Security: Running all inference on local infrastructure ensures that sensitive information such as URLs, HTML content, and user metadata remains internal and is not transmitted to external providers. This is particularly important for organisations subject to strict data protection regulations or operating in sensitive domains. In the context of phishing detection, local models allow organisations to maintain full control over their data, reducing exposure to external systems. Furthermore, keeping the analysis methods and intellectual property (IP) in-house provides a competitive advantage, preventing sensitive prompts and detection strategies from being shared with third parties. It also lowers the risk of manipulation or infiltration by adversaries.
- Customisability: Although models in this study were evaluated in their out-of-the-box state, local models can be fine-tuned for phishing detection. Organisations with relevant expertise can leverage their proprietary data to partially retrain model weights or implement a retrieval-augmented generation (RAG) system to improve performance. The current open-source LLM ecosystem offers a wide variety of models for domain-specific adaptations. Furthermore, fine-tuned models for related tasks are often shared on platforms such as Hugging Face. However, to the best of our knowledge, no fine-tuned models specifically for phishing detection are publicly available at this time.
- Independence and Availability: Local model deployment eliminates vendor lock-in, ensuring that operations are not dependent on the availability, pricing policies, or strategic decisions of external providers. Local models can also offer greater reliability, as they are not affected by cloud service outages or external network issues. In addition, they can deliver lower latency and faster response times, which is particularly beneficial for time-critical applications such as real-time phishing detection.
6.3. Challenges of SLMs
- Performance: The most evident drawback is the lower performance of small models compared to larger proprietary ones. The results achieved in this study are promising; the tested 70b models, for instance, demonstrated solid overall performance, while other models excelled in specific areas such as precision or recall. However, the best proprietary large models used in related work outperformed local models across all performance metrics. We could confirm these results by using state-of-the-art large models in the same analysis framework that was used for the SLMs. Even though the performance gap is relatively small (F1-score difference of 0.036), it still could have practical implications, leading to higher rates of false positives or false negatives. Such errors can have severe consequences, including missed threats or unnecessary disruptions.
- Customisability: Customisability is one of the key strengths of local models, but realising it is non-trivial. First, a suitable base model must be selected, which requires in-depth knowledge of available models and careful evaluation of their strengths and weaknesses. Second, high-quality, domain-specific data is needed to achieve a meaningful fine-tuning result. Finally, fine-tuning itself can be a computationally intensive process, requiring specialised hardware and expertise.
- Hardware cost: Running even relatively small models locally requires specialised hardware, and the technical expertise to set up, optimise, and maintain the infrastructure. For organisations without existing ML infrastructure, the initial hardware investment can be substantial. In short-term or exploratory projects, this upfront investment may be too high, making proprietary models or rental GPU services more cost-effective alternatives. Ultimately, the exact cost must be carefully calculated based on the organisation’s specific strategy.
- Scalability: Proprietary cloud-based solutions offer elastic scaling with demand. In contrast, local deployments are constrained by on-site hardware capacity. Handling sudden spikes in phishing detection workloads may require over-provisioning hardware or accepting degraded performance.
6.4. Application Recommendation
- (i) Proprietary models for high performance: If high performance and accuracy are the primary objectives and economic considerations are less critical, proprietary models are the preferred choice. This is particularly true when the number of websites to analyse is small to moderate, or if the analysis is performed only infrequently. In these cases, ease of deployment, scalability, and superior performance outweigh the recurring costs.
- (ii) Local LLMs for cost efficiency and privacy: The results show that there is potential in running smaller local LLMs for phishing detection, particularly when cost, data control, and privacy are key priorities. Using local models in combination with customisation techniques could become a sustainable long-term solution. While the initial investment may be substantial, these upfront costs can be offset over time. Furthermore, having full ownership of the model enables continuous improvements through iterative fine-tuning. Among the tested local models, llama3.3:70b, deepseek-r1:70b and mistral-nemo:12b stand out as the most promising options out of the box. The two 70b models demonstrated particularly strong performance. All three models are small enough to be fine-tuned at a reasonable computational cost; the Mistral model, with only 12 billion parameters, is a particularly promising candidate in this respect. Regarding analysis runtime, the Llama, Gemma, and Mistral models are among the fastest.
7. Limitations
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Additional Materials
Appendix A.1. Prompt Template
Prompt:
You are a cybersecurity expert analysing websites for phishing attempts. Your task is to examine the provided HTML code and the original website URL to determine if the website is likely a phishing site.
**Important:** The HTML may be truncated to reduce costs, so CSS styles and JavaScript code may be missing. Focus on the HTML structure, text content, URLs, and the original URL.
Original URL: '{original_url}'
HTML: '{html_text}'
Original HTML character count: '{original_character_count}'
When assessing the likelihood of phishing, consider both your general knowledge of phishing techniques and the evidence you see in the data - focus on HTML structure, content, and domain/URL consistency.
Below are common examples of phishing indicators you may find useful, but you are not limited to them:
1. **Suspicious URLs/domains**—Check href attributes, form actions, image sources for:
- Misspelled brand names, unusual domains, suspicious subdomains
- IP addresses instead of domains, excessive hyphens, unusual TLDs
2. **Form analysis**—Login/input forms with:
- Action URLs pointing to wrong domains
- Password/sensitive data collection for mismatched brands
- Excessive personal information requests (SSN, full address, etc.)
3. **Content and language**—Text containing:
- Urgent threats: 'Account suspended', 'Verify immediately', 'Limited time'
- Fear tactics: 'Security breach', 'Unauthorised access detected'
- Reward baits: 'You have won', 'Free gift', 'Exclusive offer'
4. **HTML structure issues**:
- Spelling/grammar errors in text content
- Inconsistent or poor HTML structure
- Missing or suspicious meta tags (title, description)
5. **Link analysis**—Check all href attributes for:
- Links to different domains than expected
- Shortened URLs (bit.ly, tinyurl, etc.)
- Misleading anchor text vs actual URL
6. **Brand impersonation**—Look for:
- Company names in text that don’t match domain
- References to legitimate services (PayPal, Amazon, banks, ...) on wrong domains
- Official-sounding but incorrect terminology
7. **Missing legitimacy markers**:
- No contact information or privacy policy links
- Missing proper company details in footer
- No legitimate copyright notices
**Note:** Since CSS/JS may be truncated, focus on HTML tags, text content, and URL analysis rather than visual styling or dynamic behaviour.
**Scoring guide:**
- 0–2: Very unlikely phishing (legitimate site)
- 3–4: Low risk (minor suspicious elements)
- 5–6: Medium risk (several concerning indicators)
- 7–8: High risk (multiple clear phishing signs)
- 9–10: Very high risk (obvious phishing attempt)
**Required output format (JSON only):**
{
phishing_score: int [0–10],
is_phishing: boolean [true/false],
reasoning: string [Brief explanation of your decision based on specific indicators found]
}
**Output Constraints:**
Do only output the JSON-formatted output and nothing else.
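As an illustration of how the three placeholders in this template might be filled, the sketch below substitutes them in a single pass. It is a minimal example under stated assumptions: truncation is done by character count via a hypothetical `max_chars` parameter, which may differ from the truncation used in the study, and the function name `build_prompt` is likewise hypothetical. `str.format` is avoided because the template contains literal braces in the JSON output specification.

```python
import re

# Only the three known placeholders are substituted; any other braces in the
# template (e.g. the JSON output specification) are left untouched.
PLACEHOLDER = re.compile(r"\{(original_url|html_text|original_character_count)\}")


def build_prompt(template: str, url: str, html: str, max_chars: int) -> str:
    """Fill the template, truncating the HTML to at most max_chars characters."""
    values = {
        "original_url": url,
        "html_text": html[:max_chars],
        # The count refers to the untruncated HTML.
        "original_character_count": str(len(html)),
    }
    # A single pass means placeholder-like text inside the URL or HTML
    # is never substituted a second time.
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


# Shortened stand-in for the full template, for illustration only.
demo_template = (
    "Original URL: '{original_url}'\n"
    "HTML: '{html_text}'\n"
    "Original HTML character count: '{original_character_count}'\n"
    "{\nphishing_score: int [0-10]\n}"
)
demo_prompt = build_prompt(demo_template, "http://example.test", "abcdefghij", 5)
print(demo_prompt)
```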
Appendix A.2. Runtime Analysis
| Dataset | Model | Runtime |
|---|---|---|
| d5 | llama3.2:1b | 0.550 |
| d5 | dolphin3:8b | 0.892 |
| d5 | mistral-nemo:12b | 1.793 |
| d5 | gemma3:4b | 1.966 |
| d5 | deepseek-r1:1.5b | 2.916 |
| d5 | gemma3:12b | 3.545 |
| d5 | phi3:14b | 4.515 |
| d5 | qwen3:4b | 5.725 |
| d5 | llama3.1:8b | 5.729 |
| d5 | gpt-oss:20b | 5.784 |
| d5 | deepseek-r1:14b | 6.386 |
| d5 | gemma3:27b | 7.077 |
| d5 | llama3.3:70b | 7.731 |
| d5 | qwen3:30b | 9.375 |
| d5 | deepseek-r1:70b | 23.318 |
| d50 | llama3.2:1b | 0.876 |
| d50 | dolphin3:8b | 2.162 |
| d50 | mistral-nemo:12b | 3.358 |
| d50 | deepseek-r1:1.5b | 3.752 |
| d50 | gemma3:4b | 3.840 |
| d50 | deepseek-r1:14b | 4.618 |
| d50 | llama3.1:8b | 5.353 |
| d50 | gemma3:12b | 7.846 |
| d50 | gpt-oss:20b | 7.946 |
| d50 | qwen3:4b | 8.601 |
| d50 | phi3:14b | 8.913 |
| d50 | qwen3:30b | 12.709 |
| d50 | gemma3:27b | 15.658 |
| d50 | llama3.3:70b | 15.748 |
| d50 | deepseek-r1:70b | 30.775 |
Appendix A.3. Qualitative Analysis
| Model | Dataset | Phishing Score Tendency | Benign Score Tendency |
|---|---|---|---|
| deepseek-r1:1.5b | d5 | 336 | 86 |
| deepseek-r1:14b | d5 | 322 | 109 |
| deepseek-r1:70b | d5 | 333 | 127 |
| dolphin3:8b | d5 | 257 | 125 |
| gemma3:12b | d5 | 485 | 212 |
| gemma3:27b | d5 | 483 | 186 |
| gemma3:4b | d5 | 518 | 123 |
| gpt-oss:20b | d5 | 194 | 95 |
| llama3.1:8b | d5 | 116 | 19 |
| llama3.2:1b | d5 | 10 | 0 |
| llama3.3:70b | d5 | 306 | 151 |
| mistral-nemo:latest | d5 | 219 | 166 |
| phi3:medium | d5 | 304 | 117 |
| qwen3:30b | d5 | 14 | 2 |
| qwen3:4b | d5 | 101 | 106 |
| deepseek-r1:1.5b | d50 | 207 | 31 |
| deepseek-r1:14b | d50 | 301 | 166 |
| deepseek-r1:70b | d50 | 318 | 176 |
| dolphin3:8b | d50 | 197 | 83 |
| gemma3:12b | d50 | 475 | 190 |
| gemma3:27b | d50 | 425 | 195 |
| gemma3:4b | d50 | 470 | 115 |
| gpt-oss:20b | d50 | 107 | 63 |
| llama3.1:8b | d50 | 170 | 19 |
| llama3.2:1b | d50 | 18 | 0 |
| llama3.3:70b | d50 | 246 | 113 |
| mistral-nemo:latest | d50 | 156 | 112 |
| phi3:medium | d50 | 235 | 57 |
| qwen3:30b | d50 | 0 | 0 |
| qwen3:4b | d50 | 164 | 239 |
Appendix A.4. Model Output Example
Result Samples:
“\n\nOkay, so I need to analyze this website’s HTML code for phishing attempts. The user provided an HTML snippet that seems very minimal—just a title link. But since CSS and JS might be missing, I have to focus on the structure and content.\n\nFirst, let me look at the HTML structure. It starts with <html>, then <head> which includes meta tags, charset, and a title. The body is empty except for some empty <body/body/html>. So, no actual content in the page beyond the title link.\n\nNow, looking at the phishing indicators:\n\n1. **Suspicious URLs/Domain**: The HTML doesn’t have any links or href attributes because it’s just an empty body. But wait, maybe there are other elements? No, the structure is very minimal—probably no form fields or images either since they’re not present.\n\n2. **Form Analysis**: There are no login inputs or forms in this snippet. So, that’s probably not a concern here.\n\n3. **Content and Language**: The only text is the title: \“Amazon\”. That doesn’t seem like it contains any phishing indicators. It’s just a link to an account sign-in page for Amazon. I don’t see any urgency, threats, or reward baits here.\n\n4. **HTML Structure Issues**: There are no errors in the HTML structure because there are no tags beyond what’s already present. So, that doesn’t add anything.\n\n5. **Link Analysis**: The only link is to \“Amazon\”, which is a legitimate account sign-in page for Amazon. It doesn’t point to any other domain or have any suspicious links.\n\n6. **Brand impersonation**: The text has \“Amazon\” in the title, but it’s just a regular brand name. There are no references to other companies or services on this page. So, that’s not a phishing indicator either.\n\n7. **Missing Legitimacy Markers**: The HTML doesn’t have any meta tags like title or description. It does have a link tag, which is fine because it’s pointing to a legitimate domain. There are no proper company details in the footer or any privacy policy links. 
So, that’s not an issue here.\n\nPutting this all together: The website has very minimal HTML structure and content. It doesn’t contain any phishing indicators like suspicious URLs, form fields, incorrect domains, or references to other companies. The only link is to a legitimate account sign-in page for Amazon, which isn’t phishing material.\n\nTherefore, based on the analysis, there are no clear signs of phishing attempts here. So, the phishing score should be 0–2, indicating very unlikely. \n</think>\n\n“‘json\n{\n \“phishing_score\”: 0,\n \“is_phishing\”: false,\n \“reasoning\”: \“The website’s HTML code is minimal and lacks any links or content that would indicate phishing attempts. The only link points to a legitimate Amazon account sign-in page, which does not contain phishing indicators.\”\n}\n""’
Result Samples:
Run 1: ““json\n{\n“phishing_score”: 3,\n“is_phishing”: false,\n“reasoning”: “The HTML code is minimal and lacks significant content. It contains basic HTML tags for a webpage, but there are no suspicious URLs, forms, or text content that would immediately indicate a phishing attempt. The lack of content makes it difficult to assess the legitimacy of the site.“\n}\n””
Run 2: ““‘json\n{\n\“phishing_score\”: 3,\n\“is_phishing\”: false,\n\“reasoning\”: \“The HTML code is minimal and lacks significant content. It contains basic HTML tags for a webpage, but no suspicious URLs, forms, or text content are present. It’s difficult to assess the legitimacy of the site without more information. The score reflects the lack of indicators, but it’s not a definitive determination of legitimacy.\”\n}\n”””
Run 3: ““‘json\n{\n\“phishing_score\”: 3,\n\“is_phishing\”: false,\n\“reasoning\”: \“The HTML code is minimal and lacks significant content. It contains basic HTML tags for a webpage, but no suspicious URLs, forms, or text content are present. It’s difficult to assess the legitimacy of the site without more information. The score reflects the lack of indicators, but it’s not a definitive determination of legitimacy.\”\n}\n””’
Run 4: ““‘json\n{\n\“phishing_score\”: 3,\n\“is_phishing\”: false,\n\“reasoning\": \“The HTML code is minimal and lacks significant content. It contains basic HTML tags for a webpage, but no suspicious URLs, forms, or text content are present. It’s difficult to assess the legitimacy of the site without more information. The score reflects the lack of indicators, but it’s not a definitive determination of legitimacy.\"\n}\n””’
Run 5: ““‘json\n{\n\“phishing_score\": 3,\n\“is_phishing\": false,\n\“reasoning\": \“The HTML code is minimal and lacks significant content. It contains basic HTML tags for a webpage, but no suspicious URLs, forms, or text content are present. It’s difficult to assess the legitimacy of the site without more information. The score reflects the lack of indicators, but it’s not a definitive determination of legitimacy.\"\n}\n””’
| Model | F1-Score | F1-Threshold | F2-Score | F2-Threshold | Exp. Cost | Exp. Cost Threshold |
|---|---|---|---|---|---|---|
| llama3.3:70b | 0.901 | 8 | 0.925 | 5 | 349 | 6 |
| deepseek-r1:70b | 0.871 | 6 | 0.910 | 3 | 267 | 2 |
| mistral-nemo:12b | 0.854 | 7 | 0.902 | 5 | 377 | 3 |
| deepseek-r1:14b | 0.842 | 5 | 0.888 | 3 | 337 | 1 |
| gemma3:27b | 0.850 | 7 | 0.920 | 6 | 277 | 6 |
| gemma3:12b | 0.832 | 6 | 0.897 | 5 | 385 | 4 |
| gemma3:4b | 0.694 | 7 | 0.840 | 4 | 498 | 3 |
| dolphin3:8b | 0.791 | 3 | 0.863 | 2 | 450 | 2 |
| phi3:14b | 0.664 | 0 | 0.832 | 0 | 496 | 0 |
| gpt-5.2 | 0.972 | 4 | 0.987 | 3 | 39 | 3 |
| gpt-4.1 | 0.910 | 4 | 0.931 | 3 | 280 | 2 |
| gemini-2.5-flash | 0.920 | 7 | 0.952 | 3 | 164 | 3 |
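The per-metric thresholds in this table can be understood as the result of a sweep over all possible score cut-offs. The sketch below shows one way to perform such a sweep for the F-beta score, assuming a website is predicted as phishing when its phishing_score is greater than or equal to the threshold. The toy scores and labels are hypothetical, and the expected-cost column is not covered because its definition is not given here.

```python
def fbeta_at(scores, labels, threshold, beta):
    """F-beta for the rule 'predict phishing if score >= threshold'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores, labels, beta):
    """Lowest threshold in 0..10 that maximises F-beta."""
    return max(range(11), key=lambda t: fbeta_at(scores, labels, t, beta))


# Hypothetical toy data: 1 = phishing, 0 = benign.
toy_scores = [9, 8, 7, 3, 6, 2, 1, 4]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

best_f1_threshold = best_threshold(toy_scores, toy_labels, beta=1)
best_f2_threshold = best_threshold(toy_scores, toy_labels, beta=2)
print(best_f1_threshold, best_f2_threshold)
```

Because F2 weights recall more heavily than precision, its optimal threshold tends to be lower than the F1-optimal one, which matches the pattern in most rows of the table.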

| Model | TPR | FPR | PPV list |
|---|---|---|---|
| gpt-4.1-url | 0.910 | 0.066 | [0.420, 0.122, 0.014] |
| dolphin3:8b | 0.506 | 0.048 | [0.357, 0.096, 0.010] |
| gpt-5.2-url | 0.960 | 0.162 | [0.238, 0.057, 0.006] |
| gemini-2.5-flash-url | 0.978 | 0.175 | [0.227, 0.053, 0.006] |
| llama3.3:70b | 0.948 | 0.174 | [0.223, 0.052, 0.005] |
| deepseek-r1:70b | 0.929 | 0.199 | [0.197, 0.045, 0.005] |
| deepseek-r1:14b | 0.877 | 0.189 | [0.196, 0.045, 0.005] |
| mistral-nemo:12b | 0.909 | 0.213 | [0.184, 0.041, 0.004] |
| gemma3:27b | 0.964 | 0.347 | [0.128, 0.027, 0.003] |
| gemma3:12b | 0.951 | 0.340 | [0.128, 0.027, 0.003] |
| phi3:medium | 0.345 | 0.145 | [0.111, 0.023, 0.002] |
| gemma3:4b | 0.980 | 0.853 | [0.057, 0.011, 0.001] |
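The relationship between TPR, FPR, and the positive predictive value (PPV) follows from Bayes' rule and can be sketched as below. The prevalence values of 5%, 1%, and 0.1% are an assumption: they are not stated alongside the table, but they approximately reproduce the listed PPV values when applied to the rounded TPR and FPR figures.

```python
def ppv(tpr: float, fpr: float, prevalence: float) -> float:
    """Positive predictive value for a given phishing prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# ASSUMED prevalence values (inferred, not given in the table).
ASSUMED_PREVALENCES = [0.05, 0.01, 0.001]

# TPR and FPR of the dolphin3:8b row in the table above.
dolphin_ppv = [ppv(0.506, 0.048, p) for p in ASSUMED_PREVALENCES]
print([round(v, 3) for v in dolphin_ppv])
```

The calculation shows why even a moderate false positive rate dominates the outcome when phishing pages are rare: at low prevalence, most flagged pages are benign.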
References
- Federal Bureau of Investigation, Internet Crime Complaint Center (IC3). Internet Crime Report 2024; Federal Bureau of Investigation: Washington, DC, USA, 2025. Available online: https://www.ic3.gov/AnnualReport/Reports/2024_IC3Report.pdf (accessed on 25 September 2025).
- Europol. Internet Organised Crime Threat Assessment (IOCTA) 2024; European Union Agency for Law Enforcement Cooperation: Luxembourg, 2024.
- Anti-Phishing Working Group (APWG). Phishing Activity Trends Report: 2nd Quarter 2025; APWG: Lexington, MA, USA, 2025. Available online: https://docs.apwg.org/reports/apwg_trends_report_q2_2025.pdf (accessed on 25 September 2025).
- Chiew, K.L.; Yong, K.S.C.; Tan, C.L. A survey of phishing attacks: Their types, vectors and technical approaches. Expert Syst. Appl. 2018, 106, 1–20.
- Mahboubi, A.; Luong, K.; Aboutorab, H.; Bui, H.T.; Jarrad, G.; Bahutair, M.; Camtepe, S.; Pogrebna, G.; Ahmed, E.; Barry, B.; et al. Evolving techniques in cyber threat hunting: A systematic review. J. Netw. Comput. Appl. 2024, 232, 104004.
- Yang, R.; Zheng, K.; Wu, B.; Wu, C.; Wang, X. Phishing Website Detection Based on Deep Convolutional Neural Network and Random Forest Ensemble Learning. Sensors 2021, 21, 8281.
- Bahaghighat, M.; Ghasemi, M.; Ozen, F. A high-accuracy phishing website detection method based on machine learning. J. Inf. Secur. Appl. 2023, 77, 103553.
- Yerima, S.Y.; Alzaylaee, M.K. High Accuracy Phishing Detection Based on Convolutional Neural Networks. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
- Abdelnabi, S.; Krombholz, K.; Fritz, M. VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS ’20); Association for Computing Machinery: New York, NY, USA, 2020; pp. 1681–1698.
- Lin, Y.; Liu, R.; Divakaran, D.M.; Ng, J.Y.; Chan, Q.Z.; Lu, Y.; Si, Y.; Zhang, F.; Dong, J.S. Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages. In 30th USENIX Security Symposium (USENIX Security 21); USENIX Association: Berkeley, CA, USA, 2021; pp. 3793–3810. Available online: https://www.usenix.org/conference/usenixsecurity21/presentation/lin (accessed on 25 September 2025).
- Panum, T.K.; Hageman, K.; Hansen, R.R.; Pedersen, J.M. Towards Adversarial Phishing Detection. In Proceedings of the 13th USENIX Workshop on Cyber Security Experimentation and Test (CSET ’20); USENIX Association: Berkeley, CA, USA, 2020; Available online: https://www.usenix.org/conference/cset20/presentation/panum (accessed on 25 September 2025).
- Nahapetyan, A.; Khare, K.; Schwarz, K.; Reaves, B.; Kapravelos, A. Characterizing Phishing Pages by JavaScript Capabilities. arXiv 2025, arXiv:2509.13186.
- Abuadbba, A.; Wang, S.; Almashor, M.; Ahmed, M.E.; Gaire, R.; Camtepe, S.; Nepal, S. Towards Web Phishing Detection Limitations and Mitigation. arXiv 2022, arXiv:2204.00985.
- Kavya, S.; Sumathi, D. Staying ahead of phishers: A review of recent advances and emerging methodologies in phishing detection. Artif. Intell. Rev. 2024, 58, 50.
- Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly. High-Confid. Comput. 2024, 4, 100211.
- Liu, R.; Lin, Y.; Teoh, X.; Liu, G.; Huang, Z.; Dong, J.S. Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List. In 33rd USENIX Security Symposium (USENIX Security 24); USENIX Association: Berkeley, CA, USA, 2024; pp. 523–540. Available online: https://www.usenix.org/conference/usenixsecurity24/presentation/liu-ruofan (accessed on 25 September 2025).
- Koide, T.; Nakano, H.; Chiba, D. ChatPhishDetector: Detecting Phishing Sites Using Large Language Models. IEEE Access 2024, 12, 154381–154400.
- Barberá, I. AI Privacy Risks & Mitigations—Large Language Models (LLMs); European Data Protection Board: Brussels, Belgium, 2025; Available online: https://www.edpb.europa.eu/system/files/2025-04/ai-privacy-risks-and-mitigations-in-llms.pdf (accessed on 25 September 2025).
- Irugalbandara, C.; Mahendra, A.; Daynauth, R.; Arachchige, T.K.; Dantanarayana, J.L.; Flautner, K.; Tang, L.; Kang, Y.; Mars, J. Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI’s LLM with Open Source SLMs in Production. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS); IEEE: Piscataway, NJ, USA, 2024; pp. 280–291.
- Putra, I.K.A.A. Phishing Website Dataset, version v1; Zenodo: Geneva, Switzerland, 2023.
- Lee, J.; Lim, P.; Hooi, B.; Divakaran, D.M. Multimodal Large Language Models for Phishing Webpage Detection and Identification. In 2024 APWG Symposium on Electronic Crime Research (eCrime), Boston, MA, USA, 24–26 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–13. Available online: https://ieeexplore.ieee.org/document/10896256 (accessed on 25 September 2025).
- Li, W.; Manickam, S.; Chong, Y.-W.; Karuppayah, S. PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection. arXiv 2025, arXiv:2506.15656.
- Trad, F.; Chehab, A. Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models. Mach. Learn. Knowl. Extr. 2024, 6, 367–384.
- Kulkarni, A.; Balachandran, V.; Divakaran, D.M.; Das, T. From ML to LLM: Evaluating the Robustness of Phishing Web Page Detection Models against Adversarial Attacks. Digit. Threat. Res. Pract. 2025, 6, 10.
- Li, W.; Manickam, S.; Chong, Y.-W.; Leng, W.; Nanda, P. A state-of-the-art review on phishing website detection techniques. IEEE Access 2024, 12, 187976–188012.
- Vulfin, A.; Sulavko, A.; Vasiliev, V.; Minko, A.; Kirillova, A.; Samotuga, A. A multimodal phishing website detection system using explainable artificial intelligence technologies. Mach. Learn. Knowl. Extr. 2026, 8, 11.
- Nakano, H.; Koide, T.; Chiba, D. PhishParrot: LLM-driven adaptive crawling to unveil cloaked phishing sites. arXiv 2025, arXiv:2508.02035.
- Hannousse, A.; Yahiouche, S. Web Page Phishing Detection, version v3; Mendeley Data; Elsevier: Amsterdam, The Netherlands, 2021.
- Aljofey, A.; Bello, S.A.; Lu, J.; Xu, C. Comprehensive phishing detection: A multi-channel approach with variants TCN fusion leveraging URL and HTML features. J. Netw. Comput. Appl. 2025, 238, 104170.
- Google Developers. Introduction to Large Language Models. 2025. Available online: https://developers.google.com/machine-learning/resources/intro-llms (accessed on 25 September 2025).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Long Beach, CA, USA, 2017; pp. 5998–6008. Available online: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf (accessed on 25 September 2025).
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768.
- Keles, F.D.; Wijewardena, P.M.; Hegde, C. On The Computational Complexity of Self-Attention. In Proceedings of the 34th International Conference on Algorithmic Learning Theory; PMLR: New York, NY, USA, 2023; pp. 597–619. Available online: https://proceedings.mlr.press/v201/duman-keles23a.html (accessed on 25 September 2025).
- Fu, Y. Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis. arXiv 2024, arXiv:2405.08944.
- Ratul, I.J.; Zhou, Y.; Yang, K. Accelerating Deep Learning Inference: A Comparative Analysis of Modern Acceleration Frameworks. Electronics 2025, 14, 2977.
- Dalvi, S.; Gressel, G.; Achuthan, K. Tuning the False Positive Rate/False Negative Rate with Phishing Detection Models. Int. J. Eng. Adv. Technol. 2019, 9, 7–13.
- Bold, R.; Al-Khateeb, H.; Ersotelos, N. Reducing False Negatives in Ransomware Detection: A Critical Evaluation of Machine Learning Algorithms. Appl. Sci. 2022, 12, 12941.
- OpenAI. OpenAI API Pricing. 2025. Available online: https://platform.openai.com/docs/pricing (accessed on 25 September 2025).




| Publication | Input Data | LLM | Dataset Size | F1-Score |
|---|---|---|---|---|
| [21] | Text + Vision | Gemini Pro-Vision 1, GPT-4-turbo, Claude 3 Opus | ∼3000 benign, ∼1500 phish | 0.81–0.92 |
| [17] | Text + Vision | GPT-4;4V;4o, GPT-3.5-turbo, Gemini Pro 1.0, Command R+, Llama 2 70b, Llama 3 70b, Gemma 2 9b | 1000 benign, 1000 phish | 0.72–0.99 |
| [22] | Text | Qwen2.5-vl.72b-instruct, Gemini-2.0-Flash, GPT-4o, GPT-4o-mini | 500 benign, 500 phish | 0.85–0.97 |
| [23] | Text | GPT-3.5-turbo, Claude 2 | 500 benign, 500 phish | 0.78–0.93 |
| [24] | Text + Vision | Gemini-1.5-Flash | 17,794 benign, 9105 phish | 0.75–0.98 |
| [16] | Text + Vision | GPT-3.5-turbo-16k | 6000 benign, 6000 phish | 0.86 |

| SLM Selection Criteria |
|---|
| In the top 30 “Popular” models on Ollama in July 2025 |
| Model has approximately 70 billion parameters or fewer |
| Model has a context window length of 128,000 tokens |
| If there are multiple model variants available, choose the smallest model, a model in the centre of the parameter range and the model closest to 70b |
| Exclusion if tagged as “embedding” |
| If the exact popular model does not have the required context window length, choose a model of the same family that falls within the scope (e.g., mistral:7b’s context window is too short; therefore, it is replaced by mistral-nemo) |

Number of Runs per Website with Correct JSON Output

| Model Name | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| deepseek-r1:70b | 0 | 0 | 0 | 0 | 0 | 80 |
| gemma3:12b | 0 | 0 | 0 | 0 | 0 | 80 |
| gemma3:27b | 0 | 0 | 0 | 0 | 0 | 80 |
| deepseek-r1:14b | 1 | 1 | 0 | 0 | 0 | 78 |
| gemma3:4b | 2 | 0 | 0 | 0 | 0 | 78 |
| llama3.3:70b | 2 | 1 | 0 | 0 | 1 | 76 |
| dolphin3:8b | 8 | 0 | 0 | 0 | 1 | 71 |
| phi3:14b | 1 | 5 | 0 | 0 | 3 | 71 |
| mistral-nemo:12b | 8 | 2 | 0 | 0 | 0 | 70 |
| qwen3:4b | 23 | 0 | 0 | 0 | 0 | 57 |
| llama3.1:8b | 23 | 6 | 0 | 0 | 8 | 43 |
| deepseek-r1:1.5b | 33 | 10 | 0 | 0 | 7 | 30 |
| gpt-oss:20b | 45 | 6 | 1 | 0 | 5 | 23 |
| llama3.2:1b | 77 | 0 | 0 | 0 | 1 | 2 |
| qwen3:30b | 77 | 1 | 0 | 0 | 0 | 2 |
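
A histogram like the one above can be produced by counting, for each website, how many of its five runs returned parsable JSON. The following is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the validity criterion (output parses as a JSON object), and the example data are all hypothetical.

```python
import json
from collections import Counter


def is_valid_output(raw: str) -> bool:
    """Assumed criterion: the raw model output parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def histogram_of_valid_runs(outputs_per_site: dict[str, list[str]], runs: int = 5) -> list[int]:
    """Return [number of sites with 0 valid runs, with 1 valid run, ..., with `runs` valid runs]."""
    counts = Counter(
        sum(is_valid_output(raw) for raw in outputs)
        for outputs in outputs_per_site.values()
    )
    return [counts.get(k, 0) for k in range(runs + 1)]


# Hypothetical outputs for three websites, five runs each.
example_outputs = {
    "site-a": ['{"phishing_score": 8}'] * 5,                   # all five runs valid
    "site-b": ['{"phishing_score": 2}'] + ["not json"] * 4,    # one valid run
    "site-c": ["<think>unfinished"] * 5,                       # no valid run
}

print(histogram_of_valid_runs(example_outputs))
```

Each position in the returned list corresponds to one of the columns 0 to 5 in the table.
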

Phishing Score

| Dataset | Model | Phishing Cat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | NaN | <5 | ≥5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d5 | deepseek-r1:1.5b | False | 31 | 0 | 0 | 45 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 76 | 8 |
| d5 | deepseek-r1:1.5b | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 30 | 0 | 0 | 0 | 35 |
| d5 | deepseek-r1:1.5b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 81 | 0 | 0 |
| d5 | deepseek-r1:14b | False | 56 | 9 | 10 | 73 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 149 | 0 |
| d5 | deepseek-r1:14b | True | 0 | 0 | 0 | 0 | 0 | 23 | 0 | 19 | 5 | 0 | 0 | 0 | 0 | 47 |
| d5 | deepseek-r1:14b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| d5 | deepseek-r1:70b | False | 4 | 64 | 43 | 28 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 145 | 0 |
| d5 | deepseek-r1:70b | True | 0 | 0 | 0 | 0 | 0 | 14 | 5 | 17 | 17 | 2 | 0 | 0 | 0 | 55 |
| d5 | dolphin3:8b | False | 10 | 5 | 76 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 125 | 0 |
| d5 | dolphin3:8b | True | 0 | 0 | 0 | 0 | 0 | 10 | 25 | 30 | 0 | 0 | 0 | 0 | 0 | 65 |
| d5 | dolphin3:8b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 |
| d5 | gemma3:12b | False | 0 | 20 | 20 | 60 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 105 | 0 |
| d5 | gemma3:12b | True | 0 | 0 | 0 | 0 | 0 | 0 | 65 | 25 | 5 | 0 | 0 | 0 | 0 | 95 |
| d5 | gemma3:27b | False | 0 | 0 | 0 | 35 | 35 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 70 | 30 |
| d5 | gemma3:27b | True | 0 | 0 | 0 | 0 | 0 | 5 | 40 | 45 | 10 | 0 | 0 | 0 | 0 | 100 |
| d5 | gemma3:4b | False | 0 | 0 | 0 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25 | 0 |
| d5 | gemma3:4b | True | 0 | 0 | 0 | 0 | 0 | 5 | 40 | 90 | 35 | 0 | 0 | 0 | 0 | 170 |
| d5 | gemma3:4b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| d5 | gpt-oss:20b | False | 15 | 48 | 8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79 | 0 |
| d5 | gpt-oss:20b | True | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 7 | 2 | 0 | 0 | 0 | 13 |
| d5 | gpt-oss:20b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 108 | 0 | 0 |
| d5 | llama3.1:8b | False | 6 | 0 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 0 |
| d5 | llama3.1:8b | True | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 34 | 44 | 0 | 0 | 0 | 87 |
| d5 | llama3.1:8b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92 | 0 | 0 |
| d5 | llama3.2:1b | False | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| d5 | llama3.2:1b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 195 | 0 | 0 |
| d5 | llama3.3:70b | False | 5 | 14 | 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 0 |
| d5 | llama3.3:70b | True | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 66 | 9 | 0 | 0 | 0 | 80 |
| d5 | mistral-nemo:12b | False | 0 | 0 | 33 | 51 | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 85 | 10 |
| d5 | mistral-nemo:12b | True | 0 | 0 | 0 | 0 | 0 | 5 | 8 | 32 | 22 | 33 | 0 | 0 | 0 | 100 |
| d5 | mistral-nemo:12b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| d5 | phi3:14b | False | 84 | 6 | 21 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 116 | 0 |
| d5 | phi3:14b | True | 0 | 0 | 0 | 0 | 0 | 5 | 4 | 37 | 15 | 10 | 0 | 0 | 0 | 71 |
| d5 | phi3:14b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 |
| d5 | qwen3:30b | False | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 |
| d5 | qwen3:30b | True | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 |
| d5 | qwen3:30b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 189 | 0 | 0 |
| d5 | qwen3:4b | False | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 95 | 0 |
| d5 | qwen3:4b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 105 | 0 | 0 |
| d50 | deepseek-r1:1.5b | False | 1 | 0 | 0 | 4 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 12 |
| d50 | deepseek-r1:1.5b | True | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 11 | 0 | 36 | 0 | 0 | 4 | 48 |
| d50 | deepseek-r1:1.5b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 131 | 0 | 0 |
| d50 | deepseek-r1:14b | False | 50 | 0 | 0 | 79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 129 | 0 |
| d50 | deepseek-r1:14b | True | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 21 | 14 | 0 | 0 | 0 | 0 | 66 |
| d50 | deepseek-r1:14b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| d50 | deepseek-r1:70b | False | 11 | 78 | 31 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 134 | 0 |
| d50 | deepseek-r1:70b | True | 0 | 0 | 0 | 0 | 0 | 12 | 4 | 30 | 19 | 1 | 0 | 0 | 0 | 66 |
| d50 | dolphin3:8b | False | 0 | 5 | 29 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 54 | 0 |
| d50 | dolphin3:8b | True | 0 | 0 | 0 | 0 | 0 | 26 | 48 | 41 | 0 | 0 | 0 | 0 | 0 | 115 |
| d50 | dolphin3:8b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 |
| d50 | gemma3:12b | False | 0 | 10 | 5 | 40 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70 | 0 |
| d50 | gemma3:12b | True | 0 | 0 | 0 | 0 | 0 | 0 | 65 | 45 | 20 | 0 | 0 | 0 | 0 | 130 |
| d50 | gemma3:27b | False | 0 | 0 | 0 | 15 | 45 | 10 | 15 | 0 | 0 | 0 | 0 | 0 | 60 | 25 |
| d50 | gemma3:27b | True | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 50 | 30 | 0 | 0 | 0 | 0 | 115 |
| d50 | gemma3:4b | True | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 70 | 85 | 0 | 0 | 0 | 0 | 195 |
| d50 | gemma3:4b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| d50 | gpt-oss:20b | False | 5 | 27 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 0 |
| d50 | gpt-oss:20b | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 6 | 0 | 0 | 0 | 10 |
| d50 | gpt-oss:20b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 149 | 0 | 0 |
| d50 | llama3.1:8b | False | 10 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 |
| d50 | llama3.1:8b | True | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 1 | 81 | 39 | 41 | 0 | 0 | 130 |
| d50 | llama3.1:8b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 0 | 0 |
| d50 | llama3.2:1b | False | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| d50 | llama3.2:1b | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 4 |
| d50 | llama3.2:1b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 191 | 0 | 0 |
| d50 | llama3.3:70b | False | 5 | 5 | 59 | 0 | 0 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 69 | 6 |
| d50 | llama3.3:70b | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 101 | 9 | 0 | 0 | 0 | 110 |
| d50 | llama3.3:70b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 |
| d50 | mistral-nemo:12b | False | 5 | 0 | 15 | 16 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 36 | 4 |
| d50 | mistral-nemo:12b | True | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 42 | 39 | 35 | 0 | 0 | 0 | 117 |
| d50 | mistral-nemo:12b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 0 | 0 |
| d50 | phi3:14b | False | 19 | 5 | 5 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 0 |
| d50 | phi3:14b | True | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 116 | 12 | 10 | 0 | 0 | 0 | 142 |
| d50 | phi3:14b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 |
| d50 | qwen3:30b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 200 | 0 | 0 |
| d50 | qwen3:4b | False | 190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 190 | 0 |
| d50 | qwen3:4b | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 |
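
The two rightmost columns of the table above appear to be row-wise totals of the score counts below 5 and at or above 5. The sketch below reproduces them for one row as a consistency check; `split_at_threshold` is a hypothetical helper, and the threshold of 5 is taken from the column headers rather than from a stated decision rule.

```python
def split_at_threshold(score_counts: dict[int, int], threshold: int = 5) -> tuple[int, int]:
    """Sum run counts for scores below the threshold and at or above it."""
    below = sum(n for score, n in score_counts.items() if score < threshold)
    at_or_above = sum(n for score, n in score_counts.items() if score >= threshold)
    return below, at_or_above


# Non-zero score counts from the row: d5, deepseek-r1:1.5b, Phishing Cat = False.
row = {0: 31, 3: 45, 5: 8}

print(split_at_threshold(row))  # -> (76, 8), matching the "<5" and "≥5" columns
```
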

Number of Classification Result Differences Between the D5 and D50 Dataset per Website Runs

| Model Name | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| llama3.2:1b | 39 | 0 | 0 | 0 | 1 | 0 |
| qwen3:30b | 37 | 1 | 0 | 0 | 0 | 2 |
| gemma3:27b | 35 | 0 | 0 | 0 | 0 | 5 |
| gemma3:4b | 33 | 0 | 0 | 0 | 0 | 7 |
| deepseek-r1:14b | 30 | 1 | 0 | 0 | 2 | 7 |
| deepseek-r1:70b | 30 | 1 | 0 | 0 | 3 | 6 |
| gemma3:12b | 29 | 0 | 0 | 0 | 0 | 11 |
| llama3.3:70b | 29 | 1 | 0 | 0 | 2 | 8 |
| dolphin3:8b | 24 | 1 | 0 | 0 | 0 | 15 |
| mistral-nemo:12b | 24 | 0 | 0 | 0 | 2 | 14 |
| qwen3:4b | 21 | 0 | 0 | 0 | 0 | 19 |
| gpt-oss:20b | 20 | 4 | 0 | 1 | 4 | 11 |
| phi3:14b | 19 | 3 | 0 | 0 | 3 | 15 |
| llama3.1:8b | 16 | 7 | 0 | 0 | 4 | 13 |
| deepseek-r1:1.5b | 13 | 5 | 0 | 0 | 4 | 18 |

| Model | F1_Score | Accuracy | Precision | Recall | NaN Proportion |
|---|---|---|---|---|---|
| llama3.3:70b | 0.893 | 0.887 | 0.845 | 0.948 | 0.001 |
| deepseek-r1:70b | 0.873 | 0.865 | 0.824 | 0.929 | 0.073 |
| mistral-nemo:12b | 0.858 | 0.849 | 0.811 | 0.909 | 0.009 |
| deepseek-r1:14b | 0.842 | 0.842 | 0.810 | 0.877 | 0.086 |
| gemma3:27b | 0.835 | 0.809 | 0.737 | 0.964 | 0.004 |
| gemma3:12b | 0.828 | 0.804 | 0.734 | 0.951 | 0.008 |
| gemma3:4b | 0.690 | 0.562 | 0.533 | 0.980 | 0.007 |
| dolphin3:8b | 0.651 | 0.729 | 0.913 | 0.506 | 0.000 |
| phi3:14b | 0.462 | 0.601 | 0.701 | 0.345 | 0.014 |
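
The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this for three rows of the table above. Because the published precision and recall are rounded to three decimals, the recomputed value can differ from the reported one by about 0.001. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the table above.
rows = {
    "llama3.3:70b": (0.845, 0.948, 0.893),
    "deepseek-r1:70b": (0.824, 0.929, 0.873),
    "dolphin3:8b": (0.913, 0.506, 0.651),
}

for model, (precision, recall, reported) in rows.items():
    recomputed = f1_score(precision, recall)
    print(f"{model}: recomputed={recomputed:.4f} reported={reported:.3f}")
```

The dolphin3:8b row shows why F1 is reported alongside precision: a precision of 0.913 is offset by a recall of 0.506.
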

| Model | F1_Score | Accuracy | Precision | Recall | NaN Proportion |
|---|---|---|---|---|---|
| llama3.3:70b | 0.892 | 0.885 | 0.842 | 0.948 | 0.001 |
| deepseek-r1:70b | 0.870 | 0.862 | 0.821 | 0.927 | 0.073 |
| mistral-nemo:12b | 0.843 | 0.823 | 0.761 | 0.946 | 0.009 |
| deepseek-r1:14b | 0.842 | 0.839 | 0.795 | 0.895 | 0.086 |
| gemma3:12b | 0.828 | 0.803 | 0.732 | 0.951 | 0.008 |
| gemma3:27b | 0.764 | 0.690 | 0.618 | 1.000 | 0.004 |
| gemma3:4b | 0.690 | 0.562 | 0.533 | 0.980 | 0.007 |
| dolphin3:8b | 0.651 | 0.729 | 0.913 | 0.506 | 0.000 |
| phi3:14b | 0.462 | 0.601 | 0.701 | 0.345 | 0.014 |

| Model | F1_Score | Accuracy | Precision | Recall | NaN Proportion |
|---|---|---|---|---|---|
| gpt-5.2 | 0.929 | 0.927 | 0.900 | 0.960 | 0.001 |
| gpt-4.1 | 0.920 | 0.921 | 0.932 | 0.908 | 0.000 |
| gemini-2.5-flash | 0.910 | 0.902 | 0.852 | 0.978 | 0.026 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Goldenits, G.; König, P.; Raubitzek, S.; Ekelhart, A. Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs. J. Cybersecur. Priv. 2026, 6, 48. https://doi.org/10.3390/jcp6020048

