Article
Peer-Review Record

You Got Phished! Analyzing How to Provide Useful Feedback in Anti-Phishing Training with LLM Teacher Models

Electronics 2025, 14(19), 3872; https://doi.org/10.3390/electronics14193872
by Tailia Malloy 1,2,*,†, Laura Bernardy 1,*,†, Omar El Bachyr 1, Fred Philippy 1, Jordan Samhi 1, Jacques Klein 1 and Tegawendé F. Bissyandé 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 26 August 2025 / Revised: 23 September 2025 / Accepted: 25 September 2025 / Published: 29 September 2025
(This article belongs to the Special Issue Human-Centric AI for Cyber Security in Critical Infrastructures)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript explores how Large Language Models (LLMs) can be used as "teachers" in anti-phishing training. It introduces a dataset of embeddings derived from conversations between human learners and LLMs in a phishing education context. Through regression and mediation analyses, the authors evaluate the relationship between message–email similarity and various measures of user performance, learning outcomes, demographics, and confidence. The work aims to provide actionable recommendations for designing more effective and inclusive LLM-driven training systems.
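For concreteness, the central quantity studied here, the message–email similarity derived from text embeddings, can be computed along the following lines. This is only a minimal sketch using the OpenAI embeddings API with hypothetical example texts; it is not the authors' exact pipeline.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(text: str, model: str = "text-embedding-3-large") -> np.ndarray:
        """Return the embedding vector for a single piece of text."""
        response = client.embeddings.create(model=model, input=text)
        return np.array(response.data[0].embedding)

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical teacher feedback and phishing email, for illustration only.
    teacher_message = ("Notice the mismatched sender domain and the urgent demand "
                       "to reset your password within 24 hours.")
    phishing_email = ("Your account will be suspended. Click the link below to "
                      "reset your password immediately.")

    score = cosine_similarity(embed(teacher_message), embed(phishing_email))
    print(f"message-email cosine similarity: {score:.3f}")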


The study addresses a highly relevant and emerging intersection between cybersecurity training and generative AI. The originality lies in leveraging embedding similarity as a proxy for evaluating the educational impact of LLM feedback. This focus is novel and could provide a methodological foundation for improving LLM-based learning environments. The significance is clear given the increasing sophistication of phishing attacks and the pressing need for scalable, effective training solutions.

The methodology is generally sound, with detailed explanations of dataset construction, embedding generation, and regression/mediation analyses. The use of multiple embedding models and diverse statistical approaches strengthens the reliability of the findings. However, some aspects could be clarified:

  • The rationale for selecting specific embedding models (e.g., text-embedding-3-large, ada-002) could be better justified.
  • The interpretation of non-significant results (e.g., ANOVA discrepancies) sometimes appears overstated.
  • More explicit limitations of correlational analysis (e.g., causality cautions) should be emphasized.

The abstract is concise and well aligned with the content. It clearly identifies the problem, approach, and contributions. However, it could benefit from more explicit reference to the key findings (e.g., teacher similarity improves accuracy and confidence, while student similarity correlates negatively with learning gains).

The manuscript is well written, with clear organization across Introduction, Related Work, Dataset, Analysis, Results, and Discussion. The flow is logical, though some sections (e.g., Related Work) could be condensed to avoid redundancy. Figures are informative, but captions could be more interpretive rather than purely descriptive. The figures are appropriate and support the analysis. Regression plots and mediation tables are well presented, though some legends are overly technical and might be simplified for clarity. No critical equations are missing. Tables summarize mediation outcomes effectively.

The paper is written in clear and fluent academic English. Minor editorial polishing would improve readability (e.g., streamlining long sentences, removing redundancies like "previous previous approaches" at line 97).

Among the strengths of the paper, I would highlight:

  • Novel dataset contribution with embedding dictionaries of human–LLM conversations.
  • Rigorous use of regression and mediation analysis.
  • Strong contextualization within existing literature on phishing education, personalization, and cognitive modeling.
  • Practical recommendations for improving LLM-based training platforms, especially in terms of inclusivity (age, education, experience).

On the other hand, I would also point out several aspects that I consider could be improved:

  • Methodological clarity: better justification of embedding model selection and preprocessing steps would strengthen replicability.
  • Result interpretation: some non-significant findings are given too much emphasis; clearer separation between trends and robust outcomes is needed.
  • Broader applicability: while the focus is on phishing, the authors could briefly discuss potential generalization to other cybersecurity or digital literacy contexts.
  • Dataset limitations: the constraints of using embeddings as proxies for learning interactions (e.g., semantic similarity vs. pedagogical quality) should be acknowledged more explicitly.
  • Figures: enhance captions with interpretive insights to guide the reader (not only describing what is plotted).

 

In conclusion, I consider the paper to be technically sound, original, and highly relevant for Electronics readership. With some refinements to clarify methodology, strengthen interpretation, and polish presentation, it will make a strong contribution.

Author Response

Reviewer 1

We thank the reviewer for their thorough reading and comments on our manuscript. Please find below a description of the changes that we made to the manuscript for each of the comments and edits that were requested. 

1. The rationale for selecting specific embedding models (e.g., text-embedding-3-large, ada-002) could be better justified.

In the original manuscript, the reason for selecting these models was that they were the same embedding models used in the study that produced the original dataset. However, based on this feedback, as well as feedback from other reviewers, we have chosen to expand the number of embedding models used in our analysis. Additionally, we provide more justification for this expanded list of models used in our dataset.

2. The interpretation of non-significant results (e.g., ANOVA discrepancies) sometimes appears overstated. More explicit limitations of correlational analysis (e.g., causality cautions) should be emphasized.

The previous results and discussion sections had failed to adequately emphasize the meaning and interpretation of our correlational analysis, so we have added a passage at the beginning of these sections addressing these factors.

 

3. The abstract is concise and well aligned with the content. It clearly identifies the problem, approach, and contributions. However, it could benefit from more explicit reference to the key findings (e.g., teacher similarity improves accuracy and confidence, while student similarity correlates negatively with learning gains).

Thank you for the comment on our abstract and the idea for improving it. To incorporate this point we have added the following sentence to the end of the abstract: “Specifically, we suggest that LLM teachers be trained or fine-tuned to either speak generally or mention specific sections of emails depending on user demographics and behaviors, and to steer conversations away from students that over focus on the current example.”

4. The manuscript is well written, with clear organization across Introduction, Related Work, Dataset, Analysis, Results, and Discussion. The flow is logical, though some sections (e.g., Related Work) could be condensed to avoid redundancy. 

Thank you for your comments on the structure and writing of our manuscript. Taking into consideration your comments as well as those of the editor, we have condensed the related work section considerably to remove irrelevant background information on cognitive modeling.

5. Figures are informative, but captions could be more interpretive rather than purely descriptive. The figures are appropriate and support the analysis. 

We have added additional information to the figure captions, including the meaning of elements such as shaded regions, and a summary of the interpretation we give in the main text.

6. Regression plots and mediation tables are well presented, though some legends are overly technical and might be simplified for clarity. No critical equations are missing. Tables summarize mediation outcomes effectively.

Thank you for this feedback on our plots and tables. We hope that the edits to the appendix clarify the meaning of the mediation and ANOVA analyses.

7. The paper is written in clear and fluent academic English. Minor editorial polishing would improve readability (e.g., streamlining long sentences, removing redundancies like "previous previous approaches" at line 97).

We have gone through the manuscript to double check for redundancies and overly complex passages. Part of this issue has also been reduced by removing irrelevant related work which adds focus and clarity to the manuscript. 

8. Methodological clarity: better justification of embedding model selection and preprocessing steps would strengthen replicability.

We have added significantly to the description of how we selected embedding models by comparing a wide range of open-source models to the closed-source models we had previously been using.

9. Result interpretation: some non-significant findings are given too much emphasis; clearer separation between trends and robust outcomes is needed.

Our updates to the initial description of our results, explaining why we performed the analyses we did, have hopefully addressed this lack of clarity in the interpretation of results.

10. Broader applicability: while the focus is on phishing, the authors could briefly discuss potential generalization to other cybersecurity or digital literacy contexts.

We have added a passage on the broader applicability of our research in the context of cybersecurity. We believe this work demonstrates an application of LLMs in a sort of ‘red teaming’ scenario; blue teaming, or LLM agents acting as cyber defenders, is another interesting application.

11. Dataset limitations: the constraints of using embeddings as proxies for learning interactions (e.g., semantic similarity vs. pedagogical quality) should be acknowledged more explicitly.

We agree with this limitation. To address it, we have added an analysis of multiple embedding models and describe the differences between these models in their ability to measure semantic similarity and how that impacts results.

12. Figures: enhance captions with interpretive insights to guide the reader (not only describing what is plotted).

We have added some interpretation to our figure captions, as well as more description of the content of the figures, including the meaning of shaded regions and how values are calculated.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This paper introduces a new dataset of conversations between human students and LLM-based teacher models in an anti-phishing training context. It embeds both teacher and student messages to analyze semantic similarity with phishing emails and relates these embeddings to educational outcomes such as accuracy, confidence, learning improvement, and demographics. The study demonstrates that LLM feedback more closely aligned with the phishing emails correlates with higher accuracy and confidence, while excessive student focus on specific email content correlates with weaker long-term learning.

Technical Weaknesses:

The dataset relies exclusively on GPT-4.1 generated emails and teacher responses, limiting generalizability to other LLM architectures or phishing examples created by humans.

The analysis is largely correlational; causal claims about LLM similarity leading to better outcomes remain unsupported without controlled intervention studies.

The work presents actionable recommendations but does not provide a concrete evaluation pipeline or fine-tuning framework to implement those suggestions.

Important educational outcomes like long-term retention or resistance to novel phishing attacks are not assessed, limiting practical impact.

The current literature is very weak. The authors should cite more recently published works on LLMs and education to attract field researchers. Please cite:

https://ieeexplore.ieee.org/abstract/document/10706931

https://ieeexplore.ieee.org/abstract/document/10577164

Author Response

Reviewer 2 

1. The dataset relies exclusively on GPT-4.1 generated emails and teacher responses, limiting generalizability to other LLM architectures or phishing examples created by humans.

To expand on the generalizability and reproducibility of our results, we have expanded our dataset to include 7 additional open-source embedding models. Part of this limitation stems from the original dataset, which used GPT-4 generated emails and conversations and cannot be adjusted in this work. Additionally, one of the goals of this work is to direct future research in LLM teaching, which can employ multiple different models. We hope that the inclusion of these additional embedding models sufficiently addresses this concern.

2. The analysis is largely correlational; causal claims about LLM similarity leading to better outcomes remain unsupported without controlled intervention studies.

To address these limitations, we have removed some of the stronger claims in the previous manuscript and added paragraphs to contextualize the reason for our analysis and how we interpret the results. The analysis we perform is exploratory and intended to identify potential areas of future study that would include controlled interventions.

3. The work presents actionable recommendations but does not provide a concrete evaluation pipeline or fine-tuning framework to implement those suggestions.

While our work is exploratory, the actionable recommendations we give can be evaluated in many different ways; similarly, there are a large number of fine-tuning frameworks that could implement them. Our hope is to provide clear recommendations without requiring a commitment to a specific method for achieving them. We hope that this leads to more broadly applicable results that can be useful to researchers investigating a variety of teaching settings, evaluation methods, and LLM fine-tuning approaches.

4. Important educational outcomes like long-term retention or resistance to novel phishing attacks are not assessed, limiting practical impact.

While longitudinal studies of phishing education have specific benefits associated with them, so too do the types of studies that form the basis of the dataset that we augment. While longitudinal studies could be an interesting future direction of this research, the context of assessing the quality of educational feedback from LLM teachers does not necessitate it. Additionally, we make recommendations for potential improvements in LLM teaching of anti-phishing training that could be tested in future longitudinal studies. This could be an interesting direction for future research and we hope that our results are useful for researchers working on long term improvement of phishing education. (see edits to sections 5 and 6)

5. The current literature is very weak. The authors should cite more recently published works on LLMs and education to attract field researchers. Please cite: https://ieeexplore.ieee.org/abstract/document/10706931 https://ieeexplore.ieee.org/abstract/document/10577164 

As suggested by other reviewers, we have substantially edited the related work section to remove irrelevant previous literature and expand on more recent related work. We thank you for pointing out the relevance of these papers and have added them, as well as other related works, to our literature review. These papers and other related works now form our related work section on LLM educational chatbots (see Section 2.1).

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

Summary

The paper presents exploratory analyses and an expanded dataset from an online anti-phishing education study in which several hundred participants rated dozens of email messages, including conversations with the LLM chatbot. The paper makes available an "embedding dictionary" for several thousand chat messages and justifications for students' open-ended responses. Three OpenAI embedding models are used for calculations. The paper relates the cosine similarity of message and email to confidence, accuracy, reaction time, learning outcomes, quiz scores, demographics, and mediation models. Some patterns reported: higher similarity of teacher message and email correlates with higher accuracy and confidence (Pearson), with mixed ANOVA confirmation; student similarity negatively correlates with confidence; some mediation effects are reported. Code is provided.
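For readers less familiar with mediation models of this kind, a minimal Baron-Kenny style estimate using ordinary least squares is sketched below; the variable names and simulated data are hypothetical and are not taken from the paper.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in data: similarity (predictor), confidence (mediator), accuracy (outcome).
    rng = np.random.default_rng(0)
    n = 300
    similarity = rng.uniform(0.2, 0.9, n)
    confidence = 0.5 * similarity + rng.normal(0, 0.1, n)
    accuracy = 0.3 * similarity + 0.4 * confidence + rng.normal(0, 0.1, n)
    df = pd.DataFrame({"similarity": similarity, "confidence": confidence, "accuracy": accuracy})

    # Baron-Kenny steps: total effect (c), effect on the mediator (a), direct (c') and mediator (b) effects.
    total = smf.ols("accuracy ~ similarity", df).fit()
    mediator = smf.ols("confidence ~ similarity", df).fit()
    full = smf.ols("accuracy ~ similarity + confidence", df).fit()

    indirect = mediator.params["similarity"] * full.params["confidence"]
    print(f"total effect c     = {total.params['similarity']:.3f}")
    print(f"direct effect c'   = {full.params['similarity']:.3f}")
    print(f"indirect effect ab = {indirect:.3f}")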

The topic is not, at first glance, aligned with the journal in the narrow sense, but the manuscript is still acceptable because the journal regularly publishes special issues on artificial intelligence in cybersecurity and related topics. Moreover, the focus on phishing education with an LLM is a timely topic at the intersection of education, large language models, and phishing defense.

Content comments

The authors measure the “similarity” between the teacher’s message (chatbot) and the email using numbers from the model (embedding + cosine similarity). But if the teacher often quotes or repeats words from the email (“AliPay”, “reset”…), this automatically raises the similarity measure, and it may seem as if better similarity causes better accuracy, even though part of the effect is just copying the words. This is a question of construct validity: does your similarity measure measure what we want (real didactic consistency), or does it measure something easier (how many words are the same)? I suggest the following:

  1. Do a length and quotation control. Remove the quoted parts of the email from the teacher’s message. Add the length of the message as a control in the analysis (longer messages naturally have a higher chance of overlapping).
  2. Control for “lexical” overlap using Jaccard similarity (the proportion of words shared between the message and the email) and ROUGE (which counts how many of the same short phrases, or n-grams, appear in both texts). If Jaccard/ROUGE explain a good part of the relationship, then the “similarity” measure reflects word copying more than real semantics (a minimal sketch of both measures is given after the references below).

https://www.e-periodica.ch/digbib/view?pid=bsv-002:1901:37::790

https://aclanthology.org/W04-1013/
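To make these suggested controls concrete, a minimal sketch of both lexical-overlap measures is given below; the tokenizer and example texts are simplified assumptions, and published ROUGE implementations differ in detail.

    import re

    def tokens(text: str) -> list[str]:
        """Lowercase word tokens from a simple regex split."""
        return re.findall(r"[a-z0-9']+", text.lower())

    def jaccard(a: str, b: str) -> float:
        """Proportion of unique words shared between two texts."""
        sa, sb = set(tokens(a)), set(tokens(b))
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

    def rouge_n_recall(reference: str, candidate: str, n: int = 2) -> float:
        """ROUGE-N style recall: fraction of reference n-grams also present in the candidate."""
        def ngrams(words: list[str], size: int) -> set[tuple[str, ...]]:
            return set(zip(*[words[i:] for i in range(size)]))
        ref, cand = ngrams(tokens(reference), n), ngrams(tokens(candidate), n)
        return len(ref & cand) / len(ref) if ref else 0.0

    # Hypothetical email and teacher message, for illustration only.
    email = "Your AliPay account was locked. Reset your password now to avoid suspension."
    message = "The email pressures you to reset your AliPay password, a classic urgency cue."

    print(f"Jaccard overlap: {jaccard(message, email):.3f}")
    print(f"ROUGE-2 recall:  {rouge_n_recall(email, message):.3f}")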

Formal comments

The title of the paper “LLMao, you've been phished!” is attention-grabbing, but it seems inappropriate for a scientific journal. I understand that this practice is becoming more common, but I'm not inclined to attract attention in this way. I suggest choosing a more professional alternative.

The choice of service for storing the program code and data is strange. OSF does not allow downloading the data, only viewing. It may be possible to download the data after registration, which further complicates access. I suggest another server, and MDPI offers such a service. In addition, the program code is on the 4open server, which is a better choice.

Explain the abbreviation ECSS, which is mentioned in the appendix but not in the text. In principle, it will be clear to a good part of the readership, but it is recommended to define each abbreviation the first time it is used.

 

Author Response

Reviewer 3 

2. The authors measure the “similarity” between the teacher’s message (chatbot) and the email using numbers from the model (embedding + cosine similarity). But if the teacher often quotes or repeats words from the email (“AliPay”, “reset”…), this automatically raises the similarity measure, and it may seem as if better similarity causes better accuracy, even though part of the effect is just copying the words. This is a question of construct validity: does your similarity measure measure what we want (real didactic consistency), or does it measure something easier (how many words are the same)? I suggest the following:

    1. Do a length and quotation control. Remove the quoted parts of the email from the teacher’s message. Add the length of the message as a control in the analysis (longer messages naturally have a higher chance of overlapping).
    2. Control for “lexical” overlap using Jaccard similarity (the proportion of words shared between the message and the email) and ROUGE (which counts how many of the same short phrases, or n-grams, appear in both texts). If Jaccard/ROUGE explain a good part of the relationship, then the “similarity” measure reflects word copying more than real semantics.  https://www.e-periodica.ch/digbib/view?pid=bsv-002:1901:37::790  https://aclanthology.org/W04-1013/

Because these points and subpoints are related, we have grouped them together and provided a description of the changes we have made to address the concerns raised in this section of feedback. We agree with the possible confound due to the lexical overlap between messages and emails. To address this, we have introduced an initial analysis of the correlation between embedding similarities and our metrics of student performance and compared them across ten different embedding models. 

The next point raised by the reviewer relates to our use of cosine similarity without alternative, simpler measures. To address this, we compared these similarity values in terms of their correlation with message length, the proportion of common words, and n-gram counts. We found that, for the metrics of student performance, there was a significant correlation for many of these embedding models, as well as a general trend in which larger embeddings were more closely correlated. We then compared this degree of correlation to that of the email length, Jaccard, and ROUGE measures, which showed no significant correlation with the metrics of student learning outcomes.
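As an illustration of the kind of comparison described above (not our exact analysis code), the correlation of each candidate measure with a learning outcome can be computed along these lines; the variable names and simulated data are hypothetical.

    import numpy as np
    from scipy.stats import pearsonr

    # Simulated stand-in data: per-message measures and one learning-outcome metric.
    rng = np.random.default_rng(1)
    n = 500
    embedding_similarity = rng.uniform(0.0, 1.0, n)
    message_length = rng.integers(20, 400, n).astype(float)
    jaccard_overlap = rng.uniform(0.0, 0.5, n)
    accuracy = 0.4 * embedding_similarity + rng.normal(0.0, 0.2, n)

    # Compare how strongly each measure correlates with the outcome.
    measures = {
        "embedding similarity": embedding_similarity,
        "message length": message_length,
        "Jaccard overlap": jaccard_overlap,
    }
    for name, values in measures.items():
        r, p = pearsonr(values, accuracy)
        print(f"{name:>20}: r = {r:+.3f}, p = {p:.4f}")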

Regarding the removal of quoted email text before computing embeddings of the messages sent by the teacher, this would unfortunately require a recalculation of the embeddings, which is now more difficult due to our introduction of additional embedding models. For this reason we limit our added analysis to the three other main points that you raise. However, we believe that our analysis of the alternative measures (length, Jaccard, ROUGE), as well as our comparison of 10 different embedding methods, helps to address this related concern, and we hope this sufficiently addresses your concerns.

3. The title of the paper “LLMao, you've been phished!” is attention-grabbing, but it seems inappropriate for a scientific journal. I understand that this practice is becoming more common, but I'm not inclined to attract attention in this way. I suggest choosing a more professional alternative.

We agree that the previous title used an unprofessional acronym. We have updated the title to remove the acronym while keeping the attention-grabbing beginning.

4. The choice of service for storing the program code and data is strange. ODF does not allow downloading the data, only viewing. It may be possible to download the data after registration, which further complicates access. I suggest another server, and MDPI offers such a service. In addition, the program code is on the 4open server, which is a better choice.

We apologize for this oversight; we had not made the OSF repository publicly downloadable. It is now available for public download.

5. Explain the abbreviation ECSS which is mentioned in the appendix, but not in the text. In principle, it will be clear to a good part of the reader, but it is recommended to define each abbreviation the first time it is used.

We have added the description of that abbreviation, which stands for Message-Email Cosine Similarity Score. 

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have made considerable efforts to address the comments, and I am satisfied with the revised manuscript.
