Phish-Master: Leveraging Large Language Models for Advanced Phishing Email Generation and Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper introduces a pioneering dual-purpose framework that both exploits and defends against the misuse of LLMs in phishing contexts. Its main contribution lies in the Hybrid Prompt algorithm, which ingeniously combines CoT reasoning and MetaPrompt engineering to generate realistic, high-quality phishing emails capable of bypassing corporate filters with a 99% success rate in real-world tests. Simultaneously, the study presents a multi-model detection framework, trained on both public and LLM-generated phishing datasets, achieving a remarkable 99.87% detection rate. This bidirectional approach, demonstrating offensive capability and defensive countermeasures, establishes Phish-Master as a critical benchmark for understanding and mitigating the evolving cybersecurity risks posed by generative AI in phishing attacks.
The proposal is interesting and current. Here are my comments:
1. How does the study ensure that its use of real-world network simulations accurately reflects diverse enterprise environments beyond the limited academic testbed, and to what extent might the high success rate (99%) be inflated by controlled experimental conditions? Please discuss further.
2. The authors could complement some techniques for compressing LLM schemas, such as:
- Anliker, C., Lain, D., & Capkun, S. (2025). Phishing attacks against password manager browser extensions. In 34th USENIX Security Symposium (USENIX Security 25) (pp. 7857–7876).
- Nolazco-Flores, J. A., Guerrero-Galván, A. V., Del-Valle-Soto, C., & Garcia-Perera, L. P. (2023). Genre Classification of Books on Spanish. IEEE Access, 11, 132878-132892.
3. Given that the detection framework relies heavily on datasets like TREC, Enron, and Kaggle, which may not capture the linguistic diversity of global phishing attempts, how generalizable are the reported 99.87% detection results to multilingual or cross-cultural phishing scenarios?
4. By publicly releasing the source code and dataset for Phish-Master, does the paper adequately mitigate the dual-use risks of enabling malicious actors to enhance their own phishing capabilities using the same Hybrid Prompt techniques?
5. The study focuses primarily on quantitative evasion and detection rates. How might incorporating qualitative analyses, such as human cognitive susceptibility or user behavior modeling, offer deeper insights into the psychological realism of LLM-generated phishing attacks?
6. Since the LLMs used were not fine-tuned, could fine-tuning on real phishing corpora dramatically alter the balance between offensive generation and defensive detection? Does the detection framework remain robust against adversarial fine-tuned LLMs?
7. While the paper claims to achieve the first comprehensive evaluation of LLM-generated phishing in real networks, how does it advance beyond prior studies that already demonstrated the use of GPT-3.5 or Claude for targeted spear-phishing? Does Phish-Master truly shift the paradigm, or mainly refine existing methodologies through more systematic prompt engineering?
Author Response
Comments 1: How does the study ensure that its use of real-world network simulations accurately reflects diverse enterprise environments beyond the limited academic testbed, and to what extent might the high success rate (99%) be inflated by controlled experimental conditions? Please discuss further.
Response 1: Thank you for pointing this out. We agree with this comment.
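As an illustration of the concern raised in this comment, how much weight a success rate near 100% can bear depends on how many emails were sent. The following is a minimal sketch of a Wilson score interval for a proportion. The counts (99 of 100) are hypothetical, since the sample size is not stated in this record, and the function name `wilson_interval` is illustrative only, not part of Phish-Master.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 99 delivered emails out of 100 sent.
low, high = wilson_interval(99, 100)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

With these hypothetical counts the interval runs from about 0.946 to 0.998, so a testbed of that size would not by itself rule out a true bypass rate several points below 99%.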
Comments 2: The authors could complement some techniques for compressing LLM schemas, such as:
- Anliker, C., Lain, D., & Capkun, S. (2025). Phishing attacks against password manager browser extensions. In 34th USENIX Security Symposium (USENIX Security 25) (pp. 7857–7876).
- Nolazco-Flores, J. A., Guerrero-Galván, A. V., Del-Valle-Soto, C., & Garcia-Perera, L. P. (2023). Genre Classification of Books on Spanish. IEEE Access, 11, 132878-132892.
Response 2: Agree. We have supplemented the "Discuss" section (Section 6) with the two suggested references and included a brief discussion on LLM model compression and multilingual classification methods.
Comments 3: Given that the detection framework relies heavily on datasets like TREC, Enron, and Kaggle, which may not capture the linguistic diversity of global phishing attempts, how generalizable are the reported 99.87% detection results to multilingual or cross-cultural phishing scenarios?
Response 3: Thank you for raising this important question. We agree that linguistic and cultural diversity are key factors affecting the model's generalizability. However, due to the authors' limited familiarity with Spanish and other non-English cultural contexts, comprehensively addressing international phishing scenarios in the current study is challenging. We have acknowledged this limitation in the "Limitations and Future Work" section (Section 6) and hope the reviewer understands the constraints regarding language coverage in this research.
Comments 4: By publicly releasing the source code and dataset for Phish-Master, does the paper adequately mitigate the dual-use risks of enabling malicious actors to enhance their own phishing capabilities using the same Hybrid Prompt techniques?
Response 4:
We acknowledge the potential dual-use risks. To address this, we have introduced a corresponding detection framework as a defensive measure. It should be noted that in social engineering attacks, the human factor always remains the most critical element. Additionally, the attack method proposed in this paper has certain practical deployment constraints (e.g., reliance on browser automation), and enterprise administrators can intervene through measures such as real-time blocking, thereby keeping the overall risk manageable.
Comments 5: The study focuses primarily on quantitative evasion and detection rates. How might incorporating qualitative analyses, such as human cognitive susceptibility or user behavior modeling, offer deeper insights into the psychological realism of LLM-generated phishing attacks?
Response 5:
Many phishing email studies do incorporate human cognitive susceptibility and user behavior modeling to analyze delivery success rates. The primary goal of this study, however, is to verify whether LLMs can successfully deliver spear-phishing emails under specific conditions. We plan to further explore LLM-generated phishing emails across multiple scenarios and incorporate cognitive-behavioral modeling in future work. Given the substantial impact such content would have on the length and structure of the current paper, it has not been included at this stage.
Comments 6: Since the LLMs used were not fine-tuned, could fine-tuning on real phishing corpora dramatically alter the balance between offensive generation and defensive detection? Does the detection framework remain robust against adversarial fine-tuned LLMs?
Response 6:
We previously attempted fine-tuning LLMs on phishing corpora, including lexical substitution and stylistic polishing of email content. Experimental results showed that fine-tuning did not significantly improve generation quality, and the detection framework proposed in this paper maintained a detection rate of 97.38% against fine-tuned samples. Given the minor performance drop, a detailed analysis was not included in the main text.
Comments 7: While the paper claims to achieve the first comprehensive evaluation of LLM-generated phishing in real networks, how does it advance beyond prior studies that already demonstrated the use of GPT-3.5 or Claude for targeted spear-phishing? Does Phish-Master truly shift the paradigm, or mainly refine existing methodologies through more systematic prompt engineering?
Response 7:
We believe that Phish-Master, like other LLM-based phishing email generation methods, does not fundamentally shift the paradigm of phishing attacks. In APT attacks, phishing emails typically involve longer incubation periods and more covert techniques, where LLMs can only provide auxiliary support in the content generation stage. Currently, Phish-Master is primarily suited for "fast-in, fast-out" penetration testing scenarios and offers a degree of automation and detection reference within that context.
Reviewer 2 Report
Comments and Suggestions for Authors
Introduction, Background and Related Work sections
A comprehensive overview of the subject is given in the Intro part. The overview of the literature and research sources is current, referencing some of the most contemporary works on the use of LLMs to generate phishing emails as a new and emerging attack vector. The only point that should potentially be taken into account is the use of LLM models beyond 3.5 (i.e. 4.0 and 4.5).
These sections also give a comprehensive overview of phishing methods in general, which puts the paper into context.
Methodology
The methodology is robust and sound. Combining the CoT framework, LLM prompts, and Meta prompts gave the authors the possibility to explore different options.
Models were trained via decision tree, logistic regression, and random forest.
Experiment
The experiment was well established, setting up the evaluation indicators as well as two research questions. I would suggest that the authors change the naming here and reformulate a bit, in order to streamline the logic and apply a standard hypothesis-testing methodological approach.
The results are well described, and, in particular, the real-world evaluation is invaluable.
The authors also tested the model against different test scenarios, which also demonstrated the effectiveness of the Phish-Master framework.
Discussion
The Discussion section is very good, very well organized and presented, leading the reader through the evolution of the results, as well as giving a good overview and drawing relevant conclusions from the results.
The contribution is especially important with regard to tackling LLM-based phishing attacks.
Author Response
Comments 1:
Introduction, Background and Related Work sections
A comprehensive overview of the subject is given in the Intro part. The overview of the literature and research sources is current, referencing some of the most contemporary works on the use of LLMs to generate phishing emails as a new and emerging attack vector. The only point that should potentially be taken into account is the use of LLM models beyond 3.5 (i.e. 4.0 and 4.5).
These sections also give a comprehensive overview of phishing methods in general, which puts the paper into context.
Response 1:
Thank you for this constructive suggestion. We agree that incorporating more advanced LLMs is a valuable direction for future research. In response, we have added a discussion in the “Discuss” section (Section 6) to explicitly address this point. The added text outlines the exploration of next-generation models like GPT-4 and beyond as a key future goal. This addition can be found at the end of Section 6 of the revised manuscript.
Comments 2:
Methodology
The methodology is robust and sound. Combining the CoT framework, LLM prompts, and Meta prompts gave the authors the possibility to explore different options.
Models were trained via decision tree, logistic regression, and random forest.
Response 2:
We thank the reviewer for their positive and constructive feedback on the structure and clarity of these sections.
Comments 3:
Experiment
The experiment was well established, setting up the evaluation indicators as well as two research questions. I would suggest that the authors change the naming here and reformulate a bit, in order to streamline the logic and apply a standard hypothesis-testing methodological approach.
The results are well described, and, in particular, the real-world evaluation is invaluable.
The authors also tested the model against different test scenarios, which also demonstrated the effectiveness of the Phish-Master framework.
Response 3:
We thank the reviewer for this positive feedback and valuable suggestion regarding the methodological approach. We appreciate the comment on streamlining the logic by reformulating the research questions into standard hypotheses. However, we seek a bit of clarification to ensure we address this point most effectively. Would the reviewer be able to provide a brief example of how they would suggest we rephrase our research questions (e.g., RQ1 and RQ2 as presented on Page 13) into formal hypotheses? We are very willing to make these changes and want to ensure they align with the reviewer's expectations and the standard practices in the field. Any additional guidance would be greatly appreciated.
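For illustration only, one possible shape for such a formal hypothesis is a one-sided test on the bypass rate: H0 states that the rate is at most some baseline p0, and H1 states that it exceeds p0. The sketch below computes the exact binomial p-value for that test. The baseline (0.90) and the counts (99 of 100) are hypothetical and are not taken from the manuscript, and the function name is illustrative.

```python
from math import comb


def binomial_p_value_greater(successes: int, trials: int, p0: float) -> float:
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(trials, p0).

    Tests H0: rate <= p0 against H1: rate > p0.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must lie in [0, 1]")
    return sum(
        comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical example: H0 says the bypass rate is at most 0.90,
# and 99 of 100 emails are observed to bypass the filter.
p_value = binomial_p_value_greater(99, 100, 0.90)
print(f"one-sided p-value: {p_value:.6f}")
```

Under these hypothetical numbers the p-value is about 0.0003, so H0 would be rejected at the usual 5% level. A research question rephrased this way makes the baseline and the decision rule explicit.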
Comments 4:
Discussion
The Discussion section is very good, very well organized and presented, leading the reader through the evolution of the results, as well as giving a good overview and drawing relevant conclusions from the results.
The contribution is especially important with regard to tackling LLM-based phishing attacks.
Response 4:
We thank the reviewer for their positive and constructive feedback on the structure, clarity, and contribution of these sections. No specific revisions were requested in these comments, but we have carefully reviewed the entire manuscript to ensure consistency and accuracy in response to all feedback.
Reviewer 3 Report
Comments and Suggestions for Authors
Comments:
1. The section is fine; simply write that the contributions of the work are included separately, in a document attached to the article itself.
2. Section 2 is well-defined; the authors present the background to their submitted work.
3. Considering that it is the section on related work, I would expect a table to be added to section 3 citing the consulted documents with the main contributions, including the authors' contributions.
4. The methodology section is appropriate for the article presented by the authors; their mastery of the topic is evident.
5. The section is appropriate for the submitted article.
6. The discussion section is appropriate, and therefore the conclusion section is appropriate.
The references used are appropriate.
Author Response
Comments 1:
1. The section is fine; simply write that the contributions of the work are included separately, in a document attached to the article itself.
Response 1:
Thank you for this suggestion. We agree with the comment and have revised Section 1 accordingly.
Comments 2:
2. Section 2 is well-defined; the authors present the background to their submitted work.
Response 2:
Agree. We appreciate the positive feedback on the background section. No changes were made to this section as it was deemed appropriate.
Comments 3:
3. Considering that it is the section on related work, I would expect a table to be added to section 3 citing the consulted documents with the main contributions, including the authors' contributions.
Response 3:
Thank you for this valuable suggestion. We agree and have added a table to Section 3 (Related Work) summarizing the consulted documents, their main contributions, and authors' key insights.
Comments 4:
4. The methodology section is appropriate for the article presented by the authors; their mastery of the topic is evident.
Response 4:
Agree. We thank the reviewer for acknowledging the methodology section. No revisions were necessary.
Comments 5:
5. The section is appropriate for the submitted article.
Response 5:
Agree. We appreciate the comment on the experimental section. No changes were made.
Comments 6:
6. The discussion section is appropriate, and therefore the conclusion section is appropriate.
Response 6:
Agree. We are glad the discussion and conclusion sections met expectations. No modifications were required.
Comments 7:
The references used are appropriate.
Response 7:
Agree. We confirm that all references are relevant and properly cited. No changes were needed.
Reviewer 4 Report
Comments and Suggestions for Authors
In their article, the authors address a very interesting topic related to cybersecurity, specifically phishing detection. Considering the rest of the article, the title chosen and formulated by the authors is appropriate and fitting.
In the introduction, the authors clearly demonstrate the reasons for their research, citing current and carefully selected literature sources. It's worth noting that the authors explicitly demonstrate the added value of their research and the arguments they present in the article. This is excellent.
This extensive introduction is supplemented by some theoretical background and related works. Although this chapter could be more extensive, if considered in conjunction with the introduction, it is, in my opinion, sufficient.
The authors then provide an engaging presentation of the methodology and their Phish Master. The included figures and diagrams facilitate understanding and verification of the authors' ideas. Furthermore, the algorithms presented are clear, well-described, and look interesting.
The chapter on the experiment deserves positive attention. The obtained results are interesting and confirm the authors' research theses.
Technical note: I believe Figure 5 should be reduced in size; it doesn't fit in its current form.
An additional advantage of the authors' research is its real-world evaluation. The results confirm the quality of the solution presented by the authors.
Given the very interesting results, I believe the Discussion section should be more refined. Currently, in my opinion, it is too terse and calls for a broader reference to the work of other teams, and for a more comprehensive approach.
The article is worth publishing with minor changes.
Author Response
Comments 1:
Technical note: I believe Figure 5 should be reduced in size; it doesn't fit in its current form.
Response 1:
Thank you for pointing this out. We agree with this comment. Therefore, we have reduced the size of Figure 5 to 60% of its original dimensions to ensure it fits properly within the page layout. This change can be found in the section containing Figure 5.
Comments 2:
An additional advantage of the authors' research is its real-world evaluation. The results confirm the quality of the solution presented by the authors.
Given the very interesting results, I believe the Discussion section should be more refined. Currently, in my opinion, it is too terse and calls for a broader reference to the work of other teams, and for a more comprehensive approach.
Response 2:
We agree with this comment. Therefore, we have revised the Discussion section to include a broader reference to related work from other research teams and a more comprehensive analysis of the implications of our findings.
Specifically, we have expanded the discussion on the limitations of our approach, incorporated additional citations such as [1] and [2] to contextualize our results within existing literature, and elaborated on future research directions to provide a more balanced perspective.
[1] Anliker, C., Lain, D., & Capkun, S. (2025). Phishing attacks against password manager browser extensions. In 34th USENIX Security Symposium (USENIX Security 25) (pp. 7857–7876).
[2] Nolazco-Flores, J. A., Guerrero-Galván, A. V., Del-Valle-Soto, C., & Garcia-Perera, L. P. (2023). Genre Classification of Books on Spanish. IEEE Access, 11, 132878-132892.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have adequately addressed my comments. From my side, they may proceed with the publication of the paper.

