A Culturally Aware LLM Framework for Analyzing Social Engineering Tactics in Korean Phishing Messages

Lee, Kiho; Lee, Yongjoon; Jeong, Jaeyeong; Choi, Yong-ha; Shin, Dongkyoo

doi:10.3390/electronics15102196

by Kiho Lee ¹, Yongjoon Lee ¹, Jaeyeong Jeong ²,³, Yong-ha Choi ¹,* and Dongkyoo Shin ²,³,*

¹ Department of Artificial Intelligence & Security, Far East University, Eumseong-gun 27601, Republic of Korea

² Department of Computer Engineering, Sejong University, Seoul 05006, Republic of Korea

³ Convergence Engineering for Intelligent Drone, Sejong University, Seoul 05006, Republic of Korea

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(10), 2196; https://doi.org/10.3390/electronics15102196

Submission received: 17 April 2026 / Revised: 15 May 2026 / Accepted: 17 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue AI-Powered Natural Language Processing Applications)

Abstract

Phishing messages have evolved from simple fraud templates into socially engineered texts that exploit anxiety, trust, relational obligation, and culturally embedded norms. In Korean phishing messages, attackers frequently combine institutional authority, family or acquaintance framing, requests for cooperation, and urgency cues to induce concrete victim actions such as money transfer, link clicking, phone contact, app installation, or credential submission. However, prior studies have largely emphasized binary phishing detection while offering limited interpretability regarding how such messages mobilize social and cultural persuasion strategies. This study proposes a culturally aware large language model framework for analyzing social engineering tactics in Korean phishing messages. The framework is built on a multidimensional codebook that represents the message text, phishing label, tactic type, relation type, requested action, cultural lever, and evidence span, enabling structured and explainable analysis beyond simple classification. To operationalize this framework, an OpenChat-based model is fine-tuned with QLoRA to generate structured outputs that jointly predict the phishing status and socially relevant attributes, while evidence-span supervision is incorporated to improve grounding and explanation consistency. The evaluation examines not only phishing-detection performance but also attribute-level prediction accuracy, evidence alignment, parsing reliability, and human-rated usefulness and trustworthiness. By integrating the cultural context, relational framing, and evidence-grounded explanation into LLM-based phishing analysis, this study provides an interpretable analytical framework for Korean phishing messages and an evidence-grounded basis for analyst-supportive phishing triage. On the 82-sample authoritative clean hold-out split, Model D produced error-free label predictions and achieved 0.841 exact-match core and 0.886 span-F1. 
However, because the evaluation used a single 82-sample internal hold-out split and no independent external corpus, these results should be interpreted as feasibility evidence under leakage-controlled conditions rather than as proof of deployment-level robustness or cross-domain generalization. The main contribution of this study is therefore not improved binary detection over strong lexical baselines, but the structured and evidence-grounded representation of Korean phishing persuasion tactics for analyst-supportive triage.

Keywords:

Korean phishing messages; social engineering tactics; culturally aware LLM; explainable phishing analysis; evidence-grounded classification

1. Introduction

Phishing is no longer adequately described as a simple malicious-message-classification problem. Its operational core is social engineering: attackers craft messages that manipulate human judgment by exploiting trust, context, perceived legitimacy, and decision pressure [1,2,3]. Human-centered phishing research has consistently shown that susceptibility is shaped not only by technical indicators but also by cognitive and situational factors, including social influence, contextual relevance, and persuasive framing. Empirical studies further show that phishing success is affected by how messages deploy cues such as credibility, urgency, and authority within realistic organizational settings. From this perspective, a phishing message should be understood not merely as a carrier of a malicious link or fraudulent request, but as a compact persuasive artifact designed to trigger a specific user action under uncertainty [1,2,3,4].

Despite this, much of the anti-phishing literature still evaluates phishing primarily as a binary detection task. Recent surveys and experimental studies have reported strong performance for machine learning and deep learning models based on phishing-versus-legitimate classification, and this line of work has produced important advances in feature engineering, benchmark construction, and automated filtering. However, its dominant evaluation paradigm remains centered on classification outcomes such as accuracy, precision, recall, and the F1-score. As a result, many systems provide limited interpretive visibility into how a message attempts to persuade the recipient, which relationship frame is invoked, what concrete action is being requested, and which textual evidence directly supports the model decision. Prior profiling work has also analyzed phishing attack patterns and campaign characteristics, but such analyses are typically separated from structured message-level interpretation for end-user protection [5,6].

This limitation is particularly consequential for Korean phishing messages. Korean-language phishing and smishing studies have highlighted the importance of mobile and messenger-based attack channels and have recently begun to explore explainable or LLM-based approaches for Korean message analysis. At the same time, short Korean messages often compress trust signals into institution names, kinship terms, brand references, cooperative requests, and urgent directives. In such texts, the persuasive mechanism is frequently embedded in culturally and socially meaningful cues rather than in overtly malicious wording alone. Existing Korean-focused detection work and recent explainable SMS-phishing systems show the value of moving beyond raw detection, yet they still leave room for a more standardized and attribute-level analytical framework that can systematically represent the relation, requested action, persuasion mechanism, and evidence grounding in a single output structure [7,8,9].

Accordingly, the objective of this study is to develop and evaluate a culturally aware large language model framework for analyzing social engineering tactics in Korean phishing messages. The contribution of this study is methodological rather than algorithmic. Instead of proposing a new model architecture, this paper formulates Korean phishing-message analysis as a culturally aware structured interpretation task. First, it defines a multidimensional codebook and output schema that jointly represent the phishing status, social-engineering tactic, relation type, requested victim action, cultural lever, and exact evidence span. Second, it implements this schema through OpenChat-based QLoRA supervised fine-tuning to examine whether structured supervision improves output consistency and evidence grounding. Third, it evaluates the framework using detection metrics, attribute-level Macro-F1, parsing reliability, exact-match consistency, span-level evidence alignment, and human evaluation. Thus, the novelty lies in task formulation, culturally grounded schema design, and evidence-based evaluation, rather than in QLoRA itself. By integrating the cultural context, relational framing, and evidence-grounded explanation into LLM-based phishing analysis, this study aims to provide an interpretable analytical framework for Korean phishing messages and an evidence-grounded basis for analyst-supportive phishing triage.

The cultural specificity of the proposed schema should be interpreted as a scope condition rather than as a claim of universal applicability. The framework is designed to capture persuasion patterns that are salient in Korean phishing messages, such as kinship-based obligation, institutional formality, cooperation-oriented requests, and urgency or disadvantage-avoidance cues. Therefore, applying the same framework to other languages or phishing ecosystems would require codebook adaptation, external annotation, and independent validation rather than direct transfer.

Figure 1 illustrates the overall research workflow.

2. Materials and Methods

This study was designed to move beyond binary phishing detection and to analyze Korean phishing messages as socially engineered texts. The methodological pipeline consisted of dataset construction and quality assurance, codebook design, leakage-controlled data splitting, structured output formulation, and downstream model training and evaluation. In contrast to prior Korean phishing studies that primarily focused on message classification, the present study aimed to preserve message-level interpretability by jointly representing the phishing status, tactic type, relation type, requested action, cultural lever, and evidence span [7,8]. This design was also informed by recent explainable phishing studies that emphasized evidence-based reasoning and user-oriented explanation rather than label prediction alone [9].

2.1. Dataset Construction and Quality Assurance

The dataset used in this study was constructed from a Korean phishing message corpus that was iteratively reviewed and refined for experimental use. The initial working dataset contained 1250 messages, comprising 625 phishing and 625 normal messages. Each instance included the raw message text and a set of structured annotation fields, including label, url_risk, emotion_score, tactic_type, relation_type, requested_action, cultural_lever, and evidence_span. The corpus was designed to support not only binary phishing detection but also the attribute-level interpretation of social engineering strategies [10,11,12,13,14,15,16].

Because Korean phishing messages often recur as templated alerts, impersonation notices, or platform-style short messages, simple random splitting was considered insufficient. A dataset audit identified 62 duplicate text instances and additional near-duplicate institutional or public-notice templates. To address this issue, duplicate and near-duplicate messages were grouped using a split_group variable, and the main experimental protocol adopted leakage-aware splitting rather than a naive random partition. Rare classes were also explicitly monitored during quality assurance, including app-installation requests and friend/acquaintance relation messages, because these categories were too sparse to be treated reliably without additional controls [17,18,19,20,21,22].
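The grouping step described above can be illustrated with a minimal sketch. The normalization rules (masking URLs and digit runs, collapsing whitespace) and the function names `template_key` and `assign_split_groups` are assumptions made for illustration; the source does not specify how near-duplicates were detected, and variation in personal names is not handled here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")


def template_key(text: str) -> str:
    """Collapse surface variation so templated messages share one key (assumed rules)."""
    text = URL_RE.sub("<URL>", text)      # mask links first
    text = DIGIT_RE.sub("<NUM>", text)    # mask phone numbers, amounts, dates
    return SPACE_RE.sub(" ", text).strip().lower()


def assign_split_groups(messages):
    """Return one split_group id per message; identical keys share an id."""
    key_to_group = {}
    groups = []
    for text in messages:
        key = template_key(text)
        if key not in key_to_group:
            key_to_group[key] = f"g{len(key_to_group):04d}"
        groups.append(key_to_group[key])
    return groups


# Invented toy messages, not taken from the corpus.
toy_messages = [
    "[Bank] Account locked. Verify at http://a.example/x1 or call 02-1234-5678",
    "[Bank] Account locked.  Verify at http://b.example/y2 or call 02-9876-5432",
    "Lunch at 12?",
]
toy_groups = assign_split_groups(toy_messages)
```

Exact-key grouping of this kind only catches templates that differ in masked fields; semantically similar messages with reworded text would need a similarity measure or manual review.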

The quality-assurance process was conducted as a staged manual review procedure. The original file, the third-round reviewed file, and the fourth experimental file were stored separately to preserve traceability. Each row was rechecked for logical consistency across the text and its core labels, with particular attention to agreement among label, requested_action, relation_type, cultural_lever, and evidence_span. Borderline cases were collected separately, especially those in which phishing messages closely resembled institutional notifications or in which normal business messages superficially resembled cooperation-oriented phishing attempts. Additional management columns such as qa_flag, qa_reason, reviewer_note, needs_recheck, and final_decision were used to record the review history and retain an auditable cleaning trail. Messages with unresolved ambiguity were excluded from the main clean split.

Because reviewer-specific parallel annotations were not archived, a formal overlap-based inter-annotator agreement statistic could not be reconstructed retrospectively. Instead, annotation reproducibility was assessed using an audit-based agreement proxy between the pre-consensus annotations and the finalized consensus annotations across 1250 messages. Under this staged-review proxy, Cohen’s kappa was 1.000 for label, 0.431 for tactic_type, 0.211 for relation_type, 0.664 for requested_action, and 0.531 for cultural_lever. For evidence_span, the exact-match rate was 0.610, the mean token-overlap Jaccard score was 0.746, and the mean token-level F1 was 0.776. These results indicate that label assignments remained highly stable across review stages, whereas the attribute-level fields, relation_type in particular, required more extensive consensus refinement.
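For reference, Cohen's kappa between two label sequences (here, pre-consensus versus finalized annotations for one field) can be computed as in the following sketch. The function name and the toy inputs are illustrative assumptions; this is the textbook two-rater formula and not necessarily the authors' exact script.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two aligned label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:  # both sequences use one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Invented toy example.
pre_consensus = ["family", "family", "company", "company"]
finalized = ["family", "company", "company", "company"]
kappa = cohen_kappa(pre_consensus, finalized)
```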

The final experiments were conducted on an authoritative clean split defined as final_decision = keep and needs_recheck = N. Under this criterion, the fixed train/dev/test split was 730/84/82. This clean split was used as the main dataset for baseline models, structured prompt baselines, and QLoRA-based fine-tuning so that all reported results shared the same annotation standard and leakage-control policy.
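The clean-split criterion amounts to a simple row filter, sketched below. The column names and the values `keep` and `N` come from the text; the row structure and the helper name `in_clean_split` are assumptions. The split sizes sum to 896 messages.

```python
def in_clean_split(row: dict) -> bool:
    """Keep only rows that passed review and need no further recheck."""
    return row.get("final_decision") == "keep" and row.get("needs_recheck") == "N"


# Invented toy rows illustrating the filter.
toy_rows = [
    {"text": "msg 1", "final_decision": "keep", "needs_recheck": "N"},
    {"text": "msg 2", "final_decision": "keep", "needs_recheck": "Y"},
    {"text": "msg 3", "final_decision": "drop", "needs_recheck": "N"},
]
clean_rows = [row for row in toy_rows if in_clean_split(row)]

# Reported fixed split sizes for the authoritative clean split.
SPLIT_SIZES = {"train": 730, "dev": 84, "test": 82}
clean_total = sum(SPLIT_SIZES.values())
```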

The dataset should therefore be interpreted as a curated and leakage-controlled benchmark rather than as a fully realistic deployment stream. Although duplicate and near-duplicate messages were grouped through split_group and prevented from crossing train/dev/test partitions, the corpus still reflects a template-heavy Korean phishing-message environment. Consequently, the clean split was designed to reduce direct template leakage, not to eliminate all forms of template familiarity. This distinction is important because high performance based on the clean split may still overestimate robustness against noisy, newly emerging, or adversarially modified phishing messages.

2.2. Codebook Design

The codebook was designed to support a structured and interpretable analysis of Korean phishing messages. Rather than collapsing messages into a single phishing label, the codebook represented each message along multiple dimensions that correspond to how social engineering operates in practice: the overt persuasion tactic, the implied social relation, the requested victim action, the underlying cultural lever, and the textual evidence supporting the annotation. This design choice was motivated by the observation that Korean phishing messages often compress their persuasive force into short but socially meaningful cues, such as kinship terms, institutional names, cooperation requests, urgent directives, and loss-avoidance language. It also aligns with recent work on explainable smishing detection, which has shown the value of structured, evidence-based outputs for end-user understanding [23,24,25,26].

The codebook preserved the original fine-grained labels and adopted a primary single-label policy for each annotated field. In other words, even when a message could plausibly express multiple tactics or social cues, annotators were instructed to assign the single most decisive label for each dimension. The codebook further specified a text-evidence-only rule for relation_type, meaning that family, acquaintance, institutional, or corporate relations were assigned only when such relations were explicitly recoverable from the message text itself. Implicit assumptions about workplace hierarchy or interpersonal familiarity were not used unless directly supported by the wording of the message. In addition, evidence_span was constrained to exact substrings from the source text, with one to three decisive spans connected only when necessary. Paraphrased summaries or loosely grounded explanations were not permitted.
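The exact-substring constraint on evidence_span lends itself to an automatic check. The sketch below assumes that multiple spans are joined with a `" | "` separator; that separator, the function name, and the toy message are hypothetical, since the source does not state how connected spans were serialized.

```python
def is_valid_evidence(text: str, evidence_span: str, sep: str = " | ", max_spans: int = 3) -> bool:
    """True if evidence_span holds one to max_spans parts, each an exact substring of text."""
    spans = [part.strip() for part in evidence_span.split(sep)]
    if not 1 <= len(spans) <= max_spans:
        return False
    # Empty parts and paraphrases (anything not literally in the text) are rejected.
    return all(part and part in text for part in spans)


# Invented toy message.
toy_text = "Mom, my phone broke. Please send 300,000 won now."
ok = is_valid_evidence(toy_text, "my phone broke | send 300,000 won")
paraphrased = is_valid_evidence(toy_text, "phone is broken")
```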

The tactic_type field was defined using seven categories: relational, information-seeking, financial, reward-oriented, authority/institutional, neutral/informational, and urgent/fear-based tactics. These categories were intended to capture the dominant surface strategy used to move the recipient toward compliance. The relation_type field consisted of five categories: family, friend/acquaintance, institution, company, and no explicit relation. The requested_action field used six categories: link clicking/access, phone or consultation connection, money transfer/payment/refund, authentication or personal-information entry, app installation/update/remote control, and general reply/confirmation/visit induction. The cultural_lever field also used six categories: relational appeal, authority/formality, responsibility/cooperation request, urgency/crisis framing, disadvantage or shame avoidance, and monetary gain/loss framing. Together, these dimensions were designed to separate surface tactics from underlying persuasive pressure and to make message interpretation more reproducible across annotators. Table 1 summarizes the structured annotation schema and label dimensions.
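One way to make the label dimensions machine-checkable is to encode them as a lookup table, as in the sketch below. The category counts follow the text (7, 5, 6, and 6), but the identifier strings are hypothetical English slugs chosen for illustration and are not the label names used in the dataset.

```python
# Hypothetical slugs for the categories listed in the codebook description.
CODEBOOK = {
    "tactic_type": [
        "relational", "information_seeking", "financial", "reward_oriented",
        "authority_institutional", "neutral_informational", "urgent_fear",
    ],
    "relation_type": [
        "family", "friend_acquaintance", "institution", "company", "no_explicit_relation",
    ],
    "requested_action": [
        "link_access", "phone_consultation", "money_transfer",
        "credential_entry", "app_install", "reply_confirm_visit",
    ],
    "cultural_lever": [
        "relational_appeal", "authority_formality", "responsibility_cooperation",
        "urgency_crisis", "disadvantage_shame_avoidance", "monetary_gain_loss",
    ],
}


def invalid_fields(record: dict) -> list:
    """Return the codebook fields whose value is missing or outside the allowed set."""
    return [field for field, allowed in CODEBOOK.items() if record.get(field) not in allowed]


# Invented toy annotation with one out-of-codebook value.
toy_record = {
    "tactic_type": "relational",
    "relation_type": "family",
    "requested_action": "money_transfer",
    "cultural_lever": "guilt",
}
problems = invalid_fields(toy_record)
```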

The codebook was not intended to serve as a universal taxonomy of all Korean phishing strategies. Rather, it was designed as an auditable operational schema for the present dataset and evaluation setting. Its fixed categories improve reproducibility and allow a structured evaluation, but they also limit adaptability to newly emerging phishing tactics. Future versions of the framework should therefore incorporate periodic codebook revision, active-learning-based label expansion, and external annotator validation to reduce schema rigidity.

To improve modeling stability without discarding fine-grained interpretation, the codebook also introduced derived fields such as emotion_band, relation_group, and action_group. These auxiliary group labels were not intended to replace the original labels, but to support hierarchical supervision and more stable learning for sparse categories. This principle was later reflected in the structured output design and in the final Model D configuration used for the main experiments.

Another important rule in the codebook concerned normal messages. Even when a message was labeled as non-phishing, requested_action and cultural_lever were not left blank by default. Instead, annotators were instructed to record the minimal observable action demand and the dominant social frame that appeared in the text. This decision was necessary because many borderline errors arose not from clearly malicious wording, but from messages that were structurally similar to ordinary notifications, delivery updates, confirmation messages, or workplace coordination texts. By maintaining comparable structured fields for both phishing and normal messages, the framework enabled a finer-grained comparison between malicious persuasion and benign communication patterns.

Finally, the codebook was designed as a reproducibility instrument rather than a descriptive appendix. Each label was accompanied by a definition, inclusion criteria, exclusion criteria, boundary rules, and representative excerpts. The operational goal was that different annotators, when presented with the same message, would converge on similar decisions under a shared rule set. In this sense, the codebook functioned as the central interface among human annotation, structured model output, and evidence-grounded evaluation.

This reproducibility objective was not treated as a purely conceptual design principle. It was also examined empirically through audit-based agreement analysis between pre-consensus and finalized annotations, which showed high stability for label (kappa = 1.000) and lower, field-dependent stability for the other structured fields (kappa between 0.211 and 0.664) under the finalized codebook.

2.3. Authoritative Clean Split and Leakage Control

To ensure reproducibility and to reduce overestimation caused by repeated message templates, this study did not rely on a naive random split. Instead, the main experiments used an authoritative clean split defined by retaining only records marked as final_decision = keep and needs_recheck = N. Under this criterion, the final train/dev/test split was fixed to 730/84/82, and the same split was used consistently for the baseline models, prompt baselines, and QLoRA-based fine-tuning experiments. This design was adopted to ensure that all reported results were directly comparable under the same annotation and data-quality standard.

Leakage control was particularly important because Korean phishing messages often recur as short institutional alerts, transaction notifications, delivery messages, or impersonation templates with only minor variations in names, URLs, or phone numbers. The project specification therefore treated duplicate and near-duplicate messages as a serious evaluation risk. To address this issue, structurally or semantically similar instances were assigned to a shared split_group, and the same group was not allowed to appear across different partitions. In other words, if $g_{i}$ denotes the group identifier of message $x_{i}$, the split was designed so that each $g_{i}$ belongs to exactly one of the group sets $G_{train}$, $G_{dev}$, and $G_{test}$, that is, $G_{train} \cap G_{dev} = G_{train} \cap G_{test} = G_{dev} \cap G_{test} = \emptyset$. This grouped partitioning rule was used to prevent the model from achieving artificially high performance by memorizing repeated templates rather than learning transferable phishing characteristics.

This grouped leakage-control strategy was also necessary for rare classes. The dataset design notes explicitly identified sparse categories such as app-installation requests and friend/acquaintance relation messages as unstable under simple fine-label learning. If repeated templates from these rare categories were distributed across train and test sets, their apparent performance could be substantially inflated. For this reason, the authoritative clean split and split_group control were treated not as optional preprocessing choices but as core elements of the experimental design [27,28]. Data splitting was performed as shown in Figure 2.
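A group-aware partition of this kind can be sketched in a few lines. The fill-in-order assignment, the 0.8/0.1/0.1 target fractions, and the function name are assumptions for illustration; the study reports only the resulting 730/84/82 split and the rule that a split_group never crosses partitions.

```python
import random
from collections import defaultdict


def grouped_split(groups, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign whole groups to train/dev/test so that no group spans two partitions."""
    by_group = defaultdict(list)
    for index, group_id in enumerate(groups):
        by_group[group_id].append(index)

    group_ids = sorted(by_group)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)

    targets = [fraction * len(groups) for fraction in fractions]
    parts = ([], [], [])
    current = 0
    for group_id in group_ids:
        # Move on to the next partition once the current one has reached its target size.
        while current < 2 and len(parts[current]) >= targets[current]:
            current += 1
        parts[current].extend(by_group[group_id])
    return parts


# Invented toy group ids, one per message.
toy_groups = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "f"]
train_idx, dev_idx, test_idx = grouped_split(toy_groups)
```

Because whole groups are moved at once, the realized partition sizes only approximate the target fractions, which is the usual trade-off of group-level splitting.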

2.4. Structured Output Schema

The proposed framework was implemented as a structured generation task rather than a plain binary classification task. Each input message was mapped to a fixed JSON-style output so that the model could jointly predict phishing status and socially grounded interpretive attributes. The final supervised schema consisted of 12 keys: label, label_name, emotion_band, tactic_type, relation_type, relation_group, requested_action, action_group, cultural_lever, evidence_span, evidence_label, and short_rationale. This schema was fixed before model training and was used consistently across prompt baselines and fine-tuned models.
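As an illustration of how a fixed 12-key schema can be enforced on generated text, the sketch below decodes a model output and accepts it only if it is a JSON object with exactly the schema keys. The key names come from the text; the function name, the strictness of the check, and the example values are assumptions.

```python
import json

SCHEMA_KEYS = (
    "label", "label_name", "emotion_band", "tactic_type", "relation_type",
    "relation_group", "requested_action", "action_group", "cultural_lever",
    "evidence_span", "evidence_label", "short_rationale",
)


def parse_structured_output(raw: str):
    """Return the decoded dict if raw is a JSON object with exactly the schema keys, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA_KEYS):
        return None
    return obj


# Invented placeholder values; real targets follow the codebook labels.
toy_target = {key: "placeholder" for key in SCHEMA_KEYS}
toy_target["label"] = 1
parsed = parse_structured_output(json.dumps(toy_target, ensure_ascii=False))
rejected = parse_structured_output('{"label": 1}')
```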

The schema was designed to separate different layers of message interpretation. The label and label_name fields represented the phishing status. The fields tactic_type, relation_type, requested_action, and cultural_lever captured how the message attempted to persuade the recipient. The grouped fields relation_group and action_group were introduced to support hierarchical supervision for sparse subclasses. The evidence_span field grounded the prediction in exact substrings from the original text, while short_rationale provided a concise natural-language explanation suitable for downstream human evaluation. Importantly, phishing_probability was excluded from the supervised targets because it did not have a stable gold annotation. It was therefore generated only at inference time and not included in the training loss.

The structured schema was also tightly linked to the codebook design. The codebook preserved fine labels, enforced a primary single-label rule for each field, applied a text-evidence-only rule for relation_type, and constrained evidence_span to exact substrings from the source message. This meant that the model was not trained to produce free-form summaries of why a message was suspicious, but rather to generate a controlled and interpretable representation aligned with the annotation policy. The evidence field was especially important because the project aimed to evaluate not only whether the model could identify phishing, but also whether it could point to the minimal decisive span that supported that judgment.

Formally, if $x$ denotes the input message and $y = (y_{1}, \dots, y_{T})$ denotes the serialized structured target, supervised fine-tuning optimized the autoregressive likelihood

$$L_{SFT} = - \sum_{t = 1}^{T} \log p_{\theta} (y_{t} \mid x, y_{< t}) .$$

This formulation allowed all fields to be learned within a single generative sequence while preserving a deterministic output format [29,30,31,32,33].

2.5. Baselines

The experimental design included three types of baselines. First, a TF-IDF + Logistic Regression model was used as a conventional lexical baseline for binary phishing classification. This model was included because phishing messages often contain short but strongly discriminative lexical cues, and sparse-feature classification remains a practical reference point for Korean short-message filtering. Second, a KoELECTRA-based supervised classifier was used as a contextual transformer baseline for the same binary task. These two baselines established a detection-oriented reference level for conventional discriminative modeling on the authoritative clean split [34].

Third, prompt-based OpenChat baselines were prepared in zero-shot and few-shot settings to assess whether structured phishing analysis could be performed through prompting alone, without parameter updates. In this setting, the model was required to generate the same structured JSON output used by the fine-tuned models. The prompt baselines were evaluated not only on phishing classification but also on parse success, exact match, and attribute-level Macro-F1. This comparison was important because the contribution of the study was not merely that an LLM could detect phishing, but that structured supervision might improve output consistency, interpretability, and evidence alignment relative to prompt-only generation.

The baseline design therefore covered three complementary regimes: lexical binary classification, contextual binary classification, and prompt-based structured generation. This setup made it possible to separate improvements due to structured supervision and QLoRA fine-tuning from improvements that could already be achieved through conventional classifiers or in-context prompting alone. On the authoritative clean test split, TF-IDF achieved a Macro-F1 of 98.61%, whereas KoELECTRA achieved 97.18%.

It is important to note that the binary baselines and the proposed structured model do not represent identical task settings. TF-IDF and KoELECTRA were evaluated as detection-only classifiers, whereas the QLoRA models were evaluated as structured generators that jointly produced phishing labels, social-engineering attributes, and evidence spans. Therefore, the binary baselines serve as detection reference points rather than direct competitors for the full structured-output task. The key question is not whether Model D substantially exceeds TF-IDF in binary F1, but whether structured fine-tuning improves schema compliance, attribute-level fidelity, and evidence grounding beyond prompt-only generation and reduced-output QLoRA variants.

2.6. OpenChat QLoRA Setup

The main training experiments were conducted using the Hugging Face checkpoint openchat/openchat-3.5-0106 as the base model. Fine-tuning was performed with QLoRA to enable structured supervised learning within limited Colab resources. The fixed training configuration used 4-bit NF4 quantization, a maximum sequence length of 1024, a per-device train batch size of 1, a per-device evaluation batch size of 1, gradient accumulation of 16 steps, a learning rate of 2 × 10⁻⁴, cosine scheduling with a warmup ratio of 0.03, four training epochs, and a random seed of 42. Early stopping was applied with a patience of two evaluation rounds.

The LoRA configuration used rank 64, lora_alpha = 16, and lora_dropout = 0.05, and adaptation was applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. The project specification also explicitly stated that GGUF files were reserved only for inference or demonstration and were not used for training. All reported supervised fine-tuning experiments therefore used the Hugging Face checkpoint with QLoRA adapters rather than any GGUF-based workflow.
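The reported hyperparameters can be collected into plain dictionaries, as in the sketch below. The key names are chosen to mirror the argument names of `BitsAndBytesConfig` and `TrainingArguments` in Hugging Face transformers and of `LoraConfig` in peft, but the grouping into three dictionaries and the derived quantities are illustrative assumptions rather than the authors' training script. The update count per epoch is a rough single-device estimate.

```python
import math

BASE_MODEL = "openchat/openchat-3.5-0106"

# Mirrors BitsAndBytesConfig argument names (4-bit NF4 quantization).
quantization = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}

# Mirrors peft LoraConfig argument names.
lora = {
    "r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Mirrors transformers TrainingArguments argument names.
training = {
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "num_train_epochs": 4,
    "seed": 42,
}

MAX_SEQ_LENGTH = 1024         # maximum sequence length reported in the text
EARLY_STOPPING_PATIENCE = 2   # evaluation rounds without improvement

# Derived quantities (assuming a single device and 730 training messages).
effective_batch_size = (
    training["per_device_train_batch_size"] * training["gradient_accumulation_steps"]
)
lora_scaling = lora["lora_alpha"] / lora["r"]   # standard LoRA scaling factor alpha / r
approx_updates_per_epoch = math.ceil(730 / effective_batch_size)
```

With rank 64 and lora_alpha 16, the standard LoRA scaling factor is 0.25, so the adapter update is down-weighted relative to the common alpha = 2r setting.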

To examine the contribution of structured supervision, four model variants were defined. Model A predicted only label and label_name and served as the binary-only generative baseline. Model B extended Model A by adding emotion_band, tactic_type, relation_type, requested_action, and cultural_lever, thereby representing the basic structured interpretation task. Model C further added evidence_span, evidence_label, and short_rationale to test the effect of explicit grounding and explanation supervision. Model D, which was fixed as the main model of the paper, additionally included relation_group and action_group so that fine-grained labels and auxiliary grouped labels could be learned jointly. This design was intended to improve stability for sparse categories while preserving full interpretability.
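The four variants differ only in which keys appear in the supervised target, which can be expressed as nested key sets. The sketch below follows the field lists given in the text; the serialization order of keys and the helper name `project_target` are assumptions.

```python
MODEL_A = ("label", "label_name")
MODEL_B = MODEL_A + (
    "emotion_band", "tactic_type", "relation_type", "requested_action", "cultural_lever",
)
MODEL_C = MODEL_B + ("evidence_span", "evidence_label", "short_rationale")
MODEL_D = MODEL_C + ("relation_group", "action_group")

VARIANTS = {"A": MODEL_A, "B": MODEL_B, "C": MODEL_C, "D": MODEL_D}


def project_target(full_record: dict, variant: str) -> dict:
    """Reduce a full 12-key annotation to the keys supervised by the given variant."""
    return {key: full_record[key] for key in VARIANTS[variant]}


# Invented placeholder record covering all Model D keys.
toy_full = {key: f"<{key}>" for key in MODEL_D}
target_b = project_target(toy_full, "B")
```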

2.7. Evaluation Metrics

The evaluation protocol was designed to assess both binary detection and structured interpretability. For phishing label prediction, accuracy, precision, recall, binary F1, and Macro-F1 were computed. Macro-F1 was treated as the main summary metric because it gives equal weight to each class and is therefore more appropriate than raw accuracy when performance must be interpreted across both phishing and normal messages. For class c, the F1-score is defined as

$$\mathrm{F1}_{c} = \frac{2 P_{c} R_{c}}{P_{c} + R_{c}},$$

and Macro-F1 is defined as

$$\text{Macro-F1} = \frac{1}{C} \sum_{c = 1}^{C} \mathrm{F1}_{c},$$

where $P_{c}$ and $R_{c}$ denote the precision and recall for class $c$, and $C$ is the number of classes [35].
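The two definitions above translate directly into code. The following is a minimal sketch; treating undefined precision or recall as zero and taking the class set as the union of gold and predicted labels are assumed conventions, not details stated in the source.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Invented toy labels: 1 = phishing, 0 = normal.
toy_gold = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_macro = macro_f1(toy_gold, toy_pred)
```

In this toy case the per-class scores are 0.8 for the normal class and 2/3 for the phishing class, so the macro average weights the weaker class as heavily as the stronger one.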

For structured generation, parse success was measured as the proportion of outputs that could be validly decoded into the fixed schema. Exact match was reported under two levels: exact match core and exact match strict. Exact match core focused on the principal structured fields used for phishing interpretation, whereas exact match strict required the full target output to match exactly across all fields. This distinction was necessary because a model may correctly identify the phishing status and the main social-engineering attributes while still differing in auxiliary grouped labels or concise explanation text.
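The three structured-output metrics can be sketched as follows. The set of core fields is an assumption (the text refers only to the principal structured fields), as is the convention that an unparseable output, represented here as `None`, counts as a mismatch at both levels.

```python
# Assumed core fields; the source does not enumerate them.
CORE_FIELDS = ("label", "tactic_type", "relation_type", "requested_action", "cultural_lever")


def exact_match(pred, gold: dict, fields) -> bool:
    """True if the parsed prediction agrees with gold on every listed field."""
    if pred is None:  # unparseable output
        return False
    return all(pred.get(field) == gold[field] for field in fields)


def structured_metrics(preds, golds) -> dict:
    """Parse success, exact-match core, and exact-match strict over aligned lists."""
    n = len(golds)
    return {
        "parse_success": sum(p is not None for p in preds) / n,
        "exact_match_core": sum(exact_match(p, g, CORE_FIELDS) for p, g in zip(preds, golds)) / n,
        "exact_match_strict": sum(exact_match(p, g, tuple(g)) for p, g in zip(preds, golds)) / n,
    }


# Invented toy case: the first prediction differs only in short_rationale, the second failed to parse.
toy_gold = {field: "x" for field in CORE_FIELDS}
toy_gold["short_rationale"] = "gold wording"
toy_pred = dict(toy_gold, short_rationale="different wording")
toy_metrics = structured_metrics([toy_pred, None], [toy_gold, toy_gold])
```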

Attribute-level interpretability was evaluated using Macro-F1 for tactic_type, relation_type, requested_action, and cultural_lever. These metrics were central to the study because the main contribution lay in interpreting how Korean phishing messages mobilize social and cultural persuasion mechanisms rather than merely deciding whether a message is phishing. In addition, evidence-grounding performance was evaluated using span-F1 and token overlap between predicted and gold evidence_span values. If P and G denote the token sets of the predicted and gold evidence spans, token-level precision and recall were defined as

$$\frac{|P \cap G|}{|P|} \quad \text{and} \quad \frac{|P \cap G|}{|G|},$$

respectively, and span-F1 was computed from these values. These evidence-based metrics were included to test whether the model’s decision was supported by textually aligned grounds rather than by unsupported free-form explanation.
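A minimal sketch of the token-set span metrics is given below. Whitespace tokenization and the handling of empty spans are assumptions; the source specifies token sets but not the tokenizer, and Korean text may call for a morphological tokenizer instead.

```python
def span_scores(pred_span: str, gold_span: str) -> dict:
    """Token-set precision, recall, F1, and Jaccard overlap between two evidence spans."""
    pred_tokens, gold_tokens = set(pred_span.split()), set(gold_span.split())
    if not pred_tokens or not gold_tokens:
        # Assumed convention: two empty spans agree fully, otherwise no credit.
        value = 1.0 if pred_tokens == gold_tokens else 0.0
        return {"precision": value, "recall": value, "f1": value, "jaccard": value}
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    jaccard = overlap / len(pred_tokens | gold_tokens)
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


# Invented toy spans.
toy_scores = span_scores("send money now", "send money")
```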

Finally, ablation analysis compared binary-only and multi-attribute models, as well as variants that included or excluded relation, cultural, and evidence supervision. The project design explicitly linked these comparisons to the study hypotheses, including whether multi-task structured learning improves label prediction, whether relation and cultural supervision improve social-engineering interpretation, and whether evidence supervision improves grounding quality and human-rated usefulness [36].

Because the final hold-out test split contained only 82 messages, point estimates were not treated as sufficient evidence of generalization. For binary detection comparisons, McNemar’s exact test was used because the same test instances were evaluated by each model. For Macro-F1, attribute-level F1, exact-match core, and span-level metrics, 95% confidence intervals were estimated using paired bootstrap resampling over test instances. Human-evaluation scores were compared against the neutral midpoint using Wilcoxon signed-rank tests with Holm correction. These statistical checks were used to distinguish descriptive performance differences from reliable improvements under the small-sample evaluation setting.
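The statistical checks named above can be sketched with the standard library alone. The function names, the number of bootstrap resamples, and the percentile method are assumptions. The bootstrap shown here applies to metrics that are means of per-instance scores (such as exact match or span-F1); for Macro-F1 the metric would be recomputed on each resample rather than averaged.

```python
import math
import random


def mcnemar_exact(correct_a, correct_b) -> float:
    """Two-sided exact McNemar p-value from per-instance correctness of two models."""
    b = sum(x and not y for x, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(y and not x for x, y in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean paired difference, resampling test instances."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in sample) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted


# Invented toy inputs.
p_mcnemar = mcnemar_exact([True] * 5 + [True], [False] * 5 + [True])
ci_low, ci_high = paired_bootstrap_ci([1.0] * 10, [0.0] * 10)
p_holm = holm_adjust([0.01, 0.04, 0.03])
```

With only 82 test messages, the number of discordant pairs between two strong models is typically small, which is why the exact binomial form is preferable to the chi-square approximation.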

2.8. Human Evaluation Protocol

Human evaluation was incorporated to assess practical usefulness beyond automatic metrics. The human-evaluation protocol was initially designed for 120–150 samples. After the exclusion of incomplete or unusable items, the finalized blind evaluation was conducted on 99 samples with three evaluators using a 5-point Likert design. The evaluated items included both phishing and normal messages, as well as borderline cases such as transaction alerts, institutional notices, and cooperation-oriented workplace messages. This sampling strategy was adopted because practical usefulness must be tested not only on clearly malicious cases but also on messages whose surface form closely resembles benign communication.

Each evaluator reviewed the input message together with anonymized model outputs and rated five dimensions: label validity, evidence appropriateness, adequacy of social-engineering interpretation, usefulness of response guidance, and practical usefulness for phishing-risk prevention. The model identities were hidden during evaluation to reduce expectation bias. The protocol therefore treated human judgment not as a replacement for automatic metrics, but as a complementary criterion for determining whether structured and evidence-grounded outputs are more understandable and actionable for users.

For statistical reporting, the human-evaluation layer used means and standard deviations for item-level scores and either Fleiss’ kappa or Krippendorff’s alpha for inter-rater agreement among evaluators, depending on the final table structure. This evaluator-level agreement should be distinguished from the audit-based annotation reproducibility analysis reported for the dataset-construction stage. For pairwise model comparison, either a paired t-test or a Wilcoxon signed-rank test could be used depending on the score distribution. This evaluation layer was important because strict exact match alone may undervalue outputs that are practically useful despite small deviations from the gold annotation. A model output may still be effective for user protection if its phishing judgment, major social-engineering interpretation, and supporting evidence are appropriate even when auxiliary fields are not perfectly matched.

3. Results

3.1. Dataset Split and Class Statistics

After the fourth-round quality assurance and the authoritative filtering step, 896 records were retained for the main experiments. Among these, 730 instances were assigned to the training set, 84 to the development set, and 82 to the test set, following the fixed authoritative clean split used throughout the study. The corresponding label distribution was 460 normal and 270 phishing messages in the training set, 59 normal and 25 phishing messages in the development set, and 56 normal and 26 phishing messages in the test set. This configuration preserved the project’s fixed train/dev/test counts and ensured that all baseline and QLoRA results were reported on the same evaluation basis. Authoritative clean split and class distribution are as shown in Table 2 below.

At the full clean-subset level, the data remained more skewed toward normal messages than in the original 1250-row balanced sheet, yielding 575 normal and 321 phishing instances after the removal of uncertain or pending-review cases. The fine-grained attribute distributions were also notably imbalanced. In the clean subset, tactic_type was dominated by the neutral/informational class, while relation_type was heavily concentrated in the no-explicit-relation class. Likewise, requested_action was dominated by reply/confirmation/visit-induction patterns, and cultural_lever was concentrated in responsibility/cooperation-request and authority/formality frames. This imbalance confirmed that attribute-level prediction was substantially harder than binary phishing detection and justified the use of grouped auxiliary labels in the later QLoRA experiments.

The class skew was especially pronounced in rare subclasses. The project design had already identified app-installation messages and friend/acquaintance relation cases as low-frequency categories, and the clean split preserved this sparsity pattern rather than artificially balancing it. As a result, the dataset used for the main experiments reflected a more realistic but more difficult evaluation setting, in which strong overall binary performance could coexist with uneven attribute-level difficulty. Major structured-attribute distribution patterns in the clean subset are as shown in Table 3 below.

3.2. Binary and Prompt Baseline Results

The binary baselines performed strongly on the authoritative clean split. On the 82-sample test set, TF-IDF + Logistic Regression achieved 98.78% accuracy, 98.11% binary F1, and 98.61% Macro-F1, while KoELECTRA achieved 97.56% accuracy, 96.15% binary F1, and 97.18% Macro-F1. TF-IDF produced only one false positive and no false negatives on the test split, whereas KoELECTRA produced one false positive and one false negative. These results indicate that phishing-versus-normal discrimination was already strong under the cleaned and leakage-controlled setting.
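The reported baseline percentages can be reproduced from the stated error counts and the test-set class distribution (56 normal, 26 phishing). The sketch below is a worked arithmetic check, assuming phishing is the positive class and that Macro-F1 is the unweighted mean of the two class-wise F1 scores; the function name `binary_metrics` is introduced here for illustration.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, phishing-class F1, and Macro-F1 (in percent) from confusion counts."""
    total = tp + fp + fn + tn
    f1_phishing = 2 * tp / (2 * tp + fp + fn)
    # For the normal class, the roles of false positives and false negatives swap.
    f1_normal = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": 100 * (tp + tn) / total,
        "binary_f1": 100 * f1_phishing,
        "macro_f1": 100 * (f1_phishing + f1_normal) / 2,
    }


# Test split: 56 normal and 26 phishing messages.
tfidf = binary_metrics(tp=26, fp=1, fn=0, tn=55)      # one false positive, no false negatives
koelectra = binary_metrics(tp=25, fp=1, fn=1, tn=55)  # one false positive, one false negative

print({k: round(v, 2) for k, v in tfidf.items()})
# {'accuracy': 98.78, 'binary_f1': 98.11, 'macro_f1': 98.61}
print({k: round(v, 2) for k, v in koelectra.items()})
# {'accuracy': 97.56, 'binary_f1': 96.15, 'macro_f1': 97.18}
```

The check also makes the granularity of the test set explicit: with 82 instances, a single misclassified message shifts accuracy by about 1.22 percentage points.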

However, the error notes showed that the remaining mistakes were concentrated in boundary cases rather than in clearly malicious messages. One false negative corresponded to a short transaction-approval alert that resembled a benign payment notification, and one false positive corresponded to a rough but normal workplace coordination message. In both cases, the message surface was short, neutral, and weak in explicit action demand, making the phishing status difficult to infer from lexical cues alone. This pattern suggests that conventional binary classifiers were strong at coarse filtering but less informative for borderline social-engineering interpretation.

For the prompt baseline, the available stabilized OpenChat three-shot run was retained as a structured-output reference. In that run, parse success was perfect, indicating that the calibrated prompt could consistently produce machine-readable outputs. Nevertheless, structured prediction quality remained limited. In the final-view evaluation based on the development set, label Macro-F1 was 0.610, binary F1 was 0.488, exact match core was 0.032, and exact match strict was 0.024. Attribute-level Macro-F1 values were also modest, including 0.215 for tactic_type, 0.259 for relation_type, 0.407 for requested_action, and 0.225 for cultural_lever. In other words, prompt-only generation was adequate for format control, but insufficient for a reliable structured phishing analysis.

Taken together, the baseline results indicate that binary phishing detection was already a high-performing task based on the curated clean split. The TF-IDF and KoELECTRA baselines therefore establish that the proposed framework should not be evaluated primarily as a detector that substantially improves binary classification. Instead, the relevant comparison is whether structured supervision enables more reliable schema-compliant interpretation, attribute-level prediction, and evidence-span grounding than prompt-only structured generation. In this respect, prompt-only OpenChat inference achieved stable formatting but remained weak in structured prediction quality. Binary classification and prompt-baseline performance are as shown in Table 4 below.

3.3. Comparison of QLoRA Models A–D

The authoritative re-evaluation based on the 82-sample test set showed clear differences among Models A–D. All four models achieved a parse success of 1.000, indicating that the structured decoding pipeline itself was stable across variants. The main differences appeared in phishing classification, attribute-level Macro-F1, exact-match behavior, and evidence-grounding quality.

Model A, which predicted only label and label_name, achieved 92.68% accuracy, 89.29% binary F1, and 0.919 label Macro-F1. As expected, its exact-match core and exact-match strict scores were both 0.000 because it did not generate the full structured schema. Its span-F1 and token-overlap scores were also very low, reflecting the absence of explicit evidence supervision. Model B improved binary detection and substantially expanded structured prediction. It achieved 93.90% accuracy, 91.23% binary F1, and 0.933 label Macro-F1, with 0.646 exact match core. At the attribute level, Model B reached Macro-F1 scores of 0.610 for tactic_type, 0.911 for relation_type, 0.733 for requested_action, and 0.643 for cultural_lever.

Model C introduced evidence supervision and explanation-related targets. Its binary classification performance decreased relative to Model B, with 90.24% accuracy, 84.62% binary F1, and 0.887 label Macro-F1. However, evidence-related metrics improved sharply: span-F1 reached 0.899 and token-overlap Jaccard reached 0.875. Attribute-level performance also became more uneven, with requested_action Macro-F1 increasing to 0.868, while relation_type and cultural_lever fell to 0.783 and 0.562, respectively. This result indicates that explicit grounding supervision improved evidence alignment, but did not automatically improve all social-engineering attributes at the same time.

Model D, predefined as the main model, offered the most stable trade-off among label prediction, core structured agreement, and evidence grounding. On the 82-sample authoritative internal test split, Model D made no label-level errors and reached an exact-match core score of 0.841. This result should be treated as a split-specific observation under leakage-controlled conditions, not as evidence that the model would maintain the same performance on external or out-of-template phishing corpora; the small test size and the template-driven nature of the corpus both call for cautious interpretation. The attribute-level results were also uneven: Model D achieved strong performance for requested_action (Macro-F1 = 0.950) and tactic_type (Macro-F1 = 0.822), but relation_type remained unstable (Macro-F1 = 0.590). This indicates that relation inference is a high-value but high-uncertainty component of the framework, particularly under sparse relation classes and the codebook’s strict text-evidence-only rule. Model D also maintained strong evidence-grounding performance, with span-F1 of 0.886 and token-overlap Jaccard of 0.829.

Overall, the A–D comparison showed a progressive shift from binary prediction to structured and grounded interpretation. Model A served as a minimal generative classifier. Model B demonstrated that adding social-engineering attributes substantially improved structured exact-match performance without sacrificing binary discrimination. Model C showed the direct benefit of evidence supervision for grounding metrics. Model D provided the best overall balance between phishing detection, attribute-level interpretability, and evidence alignment, supporting its selection as the main model for the remainder of the study. Table 5 compares the four QLoRA variants.

Exact match core and tactic_type Macro-F1 for Model C were not retained in the archived summary table and are therefore marked as not available in Table 5.

3.4. Final Results Based on the Authoritative Clean Split

The final results on the authoritative clean split identified Model D as the main model of the study. On the 82-sample test set, Model D produced no label-level errors and achieved an exact-match core score of 0.841, a tactic_type Macro-F1 of 0.822, a requested_action Macro-F1 of 0.950, and a cultural_lever Macro-F1 of 0.741. These results indicate that Model D was effective under the fixed leakage-controlled hold-out setting. However, because the test split contained only 82 instances and the dataset retained repeated or templated phishing-message structures, the label-level result should be interpreted as strong hold-out performance under split-group leakage control rather than as evidence of deployment-level or broad generalization.

Compared with the other variants, Model D showed the strongest overall balance between detection and structured analysis. Model A was limited to binary prediction and therefore did not provide usable attribute-level exact matches. Model B improved the structured fields while maintaining high phishing classification performance, and Model C substantially improved evidence-grounding performance, but neither of these variants matched the overall profile of Model D. In particular, Model D outperformed the other variants in label-level performance and exceeded every exact-match core score that was retained for comparison (the Model C value was not available), while also preserving strong evidence alignment.

At the same time, the exact match strict score remained 0.000 for all variants, including Model D. This result should be interpreted as a diagnostic limitation of full-schema generation rather than as a failure of the entire framework. Strict exact match requires every field in the complete schema, including fine-grained labels, grouped labels, evidence text, and short rationale, to match the gold output exactly. A minor deviation in any single field results in a strict mismatch. Therefore, the zero strict exact-match score indicates that deterministic reproduction of the full structured schema remains unsolved, even when the model correctly predicts the phishing label, major attributes, and evidence spans. For this reason, exact match core, attribute-level Macro-F1, span-F1, and token overlap are treated as the primary structured-output metrics.
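The difference between the two exact-match criteria can be illustrated with a short sketch. The composition of the core field set, the field values, and the helper name `exact_match` are assumptions made for illustration; the study's codebook defines the authoritative schema.

```python
# Assumed core field set; the codebook's actual definition may differ.
CORE_FIELDS = ("label", "tactic_type", "relation_type",
               "requested_action", "cultural_lever")


def exact_match(pred: dict, gold: dict, fields=None) -> bool:
    """True if pred equals gold on the given fields (all gold fields when None)."""
    keys = fields if fields is not None else gold.keys()
    return all(pred.get(k) == gold.get(k) for k in keys)


# Hypothetical gold record and prediction (values are illustrative only).
gold = {
    "label": 1,
    "tactic_type": "institution_impersonation",
    "relation_type": "institution",
    "requested_action": "link_click",
    "cultural_lever": "authority_formality",
    "evidence_span": "verify your account at the link below",
    "short_rationale": "Impersonates a bank and demands link-based verification.",
}
pred = dict(gold, short_rationale="Bank impersonation that urges a link click.")

core_ok = exact_match(pred, gold, CORE_FIELDS)  # major attributes all agree
strict_ok = exact_match(pred, gold)             # free-text rationale differs
print(core_ok, strict_ok)  # True False
```

Because the strict criterion includes free-text fields, a paraphrased rationale alone is enough to fail it, which is consistent with a strict score of 0.000 coexisting with a high core score.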

3.5. Evidence-Span Evaluation

Evidence-grounding performance differed substantially across the model variants. Model A, which did not explicitly learn evidence supervision, produced a span-F1 of 0.119 and a token-overlap Jaccard score of 0.082. Model B, which introduced structured social-engineering attributes but not full evidence-focused supervision, improved only slightly based on these metrics, with a span-F1 of 0.155 and token-overlap Jaccard of 0.124. In contrast, Model C showed a marked improvement after the addition of evidence_span, evidence_label, and short_rationale, reaching a span-F1 of 0.899 and token-overlap Jaccard of 0.875.

Model D maintained strong evidence-grounding performance, with a span-F1 of 0.886 and token-overlap Jaccard of 0.829. Although its span-based scores were slightly lower than those of Model C, they remained substantially higher than those of Models A and B while co-occurring with the best overall phishing detection and attribute-level performance. This result suggests that explicit evidence supervision was necessary for grounded structured output and that the final Model D was able to retain most of that grounding benefit while improving the broader interpretive task.

Taken together, the evidence-span evaluation showed a clear separation between models without explicit grounding supervision and models trained to predict textual evidence directly. The contrast between Model B and Model C was especially pronounced, indicating that evidence alignment did not emerge automatically from multi-attribute structured prediction alone. Instead, it required direct supervision based on evidence spans. Model D then demonstrated that strong evidence alignment could be preserved even after adding grouped auxiliary labels and a broader structured-output objective.

An example of evidence-grounded structured output is shown in Figure 3.

3.6. Human Evaluation

The human evaluation was conducted on 99 samples with three evaluators under a blind 5-point Likert protocol. The evaluated dimensions were label validity, evidence appropriateness, adequacy of social-engineering interpretation, usefulness of response guidance, and practical usefulness for phishing-risk prevention. Across all raters, the highest mean scores were obtained for label validity and evidence appropriateness, which were 3.896 ± 1.572 and 3.929 ± 1.553, respectively. Both values were significantly above the neutral midpoint of 3 under the Holm-corrected Wilcoxon test. The human evaluation results across five usefulness dimensions are as shown in Table 6 below.

The remaining items received lower scores. Social-engineering interpretation adequacy was rated at 2.542 ± 1.182, and response-guidance usefulness was rated at 2.394 ± 1.164. Both differed significantly from the neutral midpoint, but in the unfavorable direction, with means below 3. Practical usefulness for phishing-risk prevention was rated at 2.630 ± 1.399 and did not reach statistical significance relative to the neutral midpoint after correction. These results indicate that the current framework should not be presented as a stand-alone user-facing defense system. Its empirically supported role is narrower: it can assist analysts or reviewers by organizing phishing judgments and evidence spans, but its response-guidance and practical-prevention outputs require further redesign before deployment-oriented claims can be made.
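The Holm step-down correction applied to the five midpoint comparisons can be sketched as follows. The p-values in the example are hypothetical placeholders, not the study's results, and the function name `holm_adjust` is introduced here. In practice the raw p-values would come from a one-sample Wilcoxon signed-rank test on the ratings minus 3, for example via `scipy.stats.wilcoxon`.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so a larger raw p never gets a smaller adjusted p.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

An item is declared significant at level 0.05 when its adjusted p-value is below 0.05, which controls the family-wise error rate across the rated dimensions.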

The low inter-rater agreement further limits the strength of the human-evaluation conclusions. Rather than treating the human scores as definitive evidence of practical usefulness, this study uses them as diagnostic indicators. They suggest that label validity and evidence localization were more acceptable to evaluators, whereas social-engineering interpretation and response guidance remained insufficiently stable.

The subgroup pattern was also uneven across evaluator profiles. The undergraduate-educated office-worker evaluator assigned the highest scores across most items, including 4.657 for label validity and 4.848 for evidence appropriateness, whereas the security-expert evaluator assigned substantially lower scores, including 2.444 for both label validity and evidence appropriateness. This gap suggests that the perceived usefulness of the outputs depended strongly on evaluator expectations and expertise level.

Inter-rater agreement was limited across the evaluation items. Fleiss’ kappa values ranged from −0.101 to 0.086, indicating weak or inconsistent agreement among the three raters. In particular, agreement was negative for evidence appropriateness, social-engineering interpretation adequacy, and response-guidance usefulness. Therefore, the human-evaluation results were interpreted cautiously. They nevertheless showed a consistent tendency: the structured outputs were rated more positively for supporting phishing judgment and textual grounding than for providing robust action guidance or universally convincing social-engineering interpretation.
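To make the negative kappa values easier to interpret, the following sketch computes Fleiss' kappa for ordinal ratings treated as nominal categories. It is a minimal standard-library illustration with toy data, assuming every item is rated by the same number of raters; the function name and the example ratings are not taken from the study.

```python
def fleiss_kappa(ratings: list[list[int]], categories) -> float:
    """Fleiss' kappa; ratings[i] holds one category per rater for item i."""
    categories = list(categories)
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the pooled category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


likert = range(1, 6)
# Toy data: three raters who each use a different, fixed score on every item.
systematic_disagreement = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(fleiss_kappa(systematic_disagreement, likert))  # -0.5
```

The toy case shows one way kappa can fall below zero: when raters apply consistently different score levels, observed agreement is lower than chance agreement. This is compatible with the evaluator-profile gap described above, although the study's own rating tables would be needed to confirm that explanation.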

3.7. Error Analysis and Failure Modes

Although Model D produced no binary label errors based on the fixed clean hold-out split, the remaining failures appeared in structured interpretation rather than in coarse phishing detection. The main failure modes were relation_type ambiguity, cultural_lever boundary confusion, evidence-span over-selection or under-selection, and weak response-guidance quality. Relation errors were most common when messages contained institutional or workplace-like wording without explicit relational markers. Cultural-lever errors occurred when authority, urgency, and responsibility cues appeared together in short messages. Evidence-span errors usually involved selecting a span that was semantically related but either too broad or not minimal enough to satisfy the exact-substring rule. Human-evaluation weaknesses were concentrated in response guidance, suggesting that the model’s structured analysis should not be equated with actionable end-user advice.
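The evidence-span failure types described above can be separated mechanically once the exact-substring rule is expressed as code. The sketch below is illustrative only: the category names, the function `classify_span`, and the example message are hypothetical, and the study's own error taxonomy may be finer-grained.

```python
def classify_span(message: str, pred: str, gold: str) -> str:
    """Assign a predicted evidence span to a coarse error category."""
    if not pred or pred not in message:
        return "not_substring"    # violates the exact-substring rule
    if pred == gold:
        return "exact"
    if gold in pred:
        return "over_selection"   # contains the gold span plus extra text
    if pred in gold:
        return "under_selection"  # covers only part of the gold span
    return "mismatch"             # valid substring, but a different region


# Hypothetical message and gold span, for illustration.
message = "[Bank notice] Your account is restricted. Verify now at the link below."
gold = "Verify now at the link below"

for pred in (
    "Verify now at the link below",
    "Your account is restricted. Verify now at the link below",
    "Verify now",
    "verify immediately",
    "Your account is restricted",
):
    print(repr(pred), "->", classify_span(message, pred, gold))
```

Over-selected and under-selected spans still overlap the gold tokens, so they reduce span-F1 only partially, whereas non-substring outputs indicate that the model paraphrased instead of quoting the message.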

Table 7 summarizes the major failure types observed in the structured-output evaluation and human-evaluation diagnostics.

4. Discussion

The results should not be read as showing that QLoRA-based LLM fine-tuning meaningfully outperforms simple lexical or transformer baselines in binary phishing detection. Simple phishing-versus-normal discrimination, at least in the cleaned evaluation setting, was not the main unresolved challenge. The more meaningful problem was whether the model could explain how a message attempted to persuade the recipient and whether that explanation could be grounded in the message text itself. The contribution therefore lies less in marginal binary improvement and more in reframing Korean phishing analysis as a structured interpretation task that jointly represents the tactic, relation, requested action, cultural lever, and evidence span: how a message attempts to persuade the recipient, what action it requests, which social or cultural pressure it invokes, and which exact text span supports the judgment. The final Model D best matched this goal by showing the strongest overall balance across label prediction, core structured agreement, and evidence grounding on the present hold-out split, although the perfect label score should be interpreted cautiously given the limited test size.

A second important finding is the practical value of predicting the relation type and cultural lever, in addition to the phishing status. Korean phishing messages often compress persuasion into kinship terms, institutional naming, cooperation requests, urgency, and loss-avoidance cues rather than into overtly malicious wording alone. The codebook was explicitly designed to separate the surface tactic from underlying social or cultural pressure, and this distinction is important because two messages can request similar actions while relying on very different persuasive mechanisms. In practical terms, identifying that a message uses authority/formality, responsibility/cooperation framing, or relational appeal can make the output more informative than a single phishing label, especially in borderline cases that resemble ordinary notifications or workplace coordination. At the same time, the model comparison also showed that relation prediction remained less stable than some other structured fields, particularly under sparse classes and under the codebook’s strict text-evidence-only rule. This suggests that relation-aware interpretation is useful, but also more fragile than action prediction and therefore should be interpreted as a structured signal rather than as a fully solved capability.

The weakest structured dimension was relation_type. This result is important because relation-aware interpretation is central to the culturally aware framing of the study. The low relation_type Macro-F1 indicates that explicit relation inference remains difficult when the message does not contain clear family, acquaintance, institutional, or corporate markers. Under the current codebook, relation_type follows a strict text-evidence-only rule; therefore, implicit social relations were intentionally not inferred. This improves annotation discipline but also reduces model stability for sparse or ambiguous relation classes.

The evidence-supervision results further clarify the role of grounding in this framework. Evidence alignment did not emerge automatically from multi-attribute prediction alone. Models without explicit evidence supervision showed very low span-F1 and token-overlap performance, whereas Model C and Model D achieved large gains after evidence_span was directly included in the training targets. This pattern indicates that a model can predict socially meaningful labels without reliably pointing to the minimal decisive textual evidence, and that grounded explanation must therefore be trained as a dedicated objective rather than assumed to follow from classification quality. At the same time, the comparison between Model C and Model D suggests that grounding alone is not the only target that matters. Model C achieved the highest evidence metrics, but Model D offered the most stable trade-off among label prediction, structured attribute fidelity, and evidence alignment. This supports the study’s design choice to treat evidence supervision as one component of a broader interpretability objective rather than as a standalone endpoint.

Two cautionary points are also necessary when interpreting the evaluation results. First, the strict exact-match score remained 0.000 even for the best model. This should not be read as a failure of the framework. Under the fixed structured-output design, strict exact match requires every field in the full schema to be identical to the gold target, including auxiliary grouped labels and short explanatory fields. In a multi-field generation task, this criterion is intentionally severe and can remain zero even when the model correctly predicts the phishing label, major interpretive attributes, and grounded evidence. For this reason, the present results should be interpreted primarily through label Macro-F1, exact match core, attribute-level Macro-F1, and evidence-grounding metrics rather than through strict exact match alone. Second, the human-evaluation results should also be interpreted conservatively. The outputs received more favorable ratings for label validity and evidence appropriateness than for response guidance or overall preventive usefulness, and inter-rater agreement was weak. This means that the structured outputs were more consistently perceived as helpful for judging and locating suspicious content than for delivering universally convincing user guidance. In other words, human evaluation supports the usefulness of grounded analytical output, but not a strong claim that the current system is already sufficient as a standalone user-protection assistant [37].

The human-evaluation results also constrain the practical claims of the study. The framework was rated more favorably for label validity and evidence appropriateness than for response-guidance usefulness or practical prevention. Thus, the current system is better understood as an evidence-grounded analyst-support tool than as a mature end-user intervention system. Future work should separate the analytic module from a dedicated response-guidance module and evaluate whether user-facing recommendations can be improved through task-specific instruction tuning or human-centered design.

The evaluation scope should be interpreted conservatively. The authoritative clean split was useful for controlling annotation quality and reducing direct template leakage, but it remains an internal validation setting. The final test set contained 82 messages, and no independently collected external corpus was available for this study. Therefore, the reported error-free label prediction of Model D does not demonstrate deployment-level robustness. Instead, the result indicates that the proposed structured-output formulation is feasible under a curated, Korean-language, leakage-controlled benchmark. Robustness against newly emerging phishing templates, noisier real-world messages, and non-Korean phishing ecosystems remains to be validated through larger external corpora and grouped cross-validation in future work.

Several limitations follow directly from these findings [29]. The experiments were conducted on a Korean-language dataset under a fixed clean split and a codebook that intentionally preserved sparse fine-grained categories. This was appropriate for the study objective, but it limits direct generalization to other languages, other phishing ecosystems, and freer annotation settings. In addition, the dataset contained repeated templates and low-frequency subclasses from the beginning, which is why split-group control and grouped supervision were treated as core design elements. The resulting framework is therefore best understood as an interpretable analytical system for Korean phishing messages rather than as a universal phishing solution. Even so, the results show that combining culturally informed annotation, structured generation, and evidence-based supervision is a viable direction for phishing analysis. The study therefore supports a shift from binary alerting toward grounded, attribute-level explanation, while also showing that relation inference, strict full-schema agreement, and user-facing action guidance remain open areas for improvement [38,39]. Because the main hold-out test set contained 82 messages, the observed 1.000 label Macro-F1 should not be interpreted as evidence that Korean phishing detection has been solved. Additional grouped cross-validation and external validation on harder out-of-template samples are necessary to assess generalization more robustly.

5. Conclusions

This study addressed the limitation of treating Korean phishing-message analysis as a binary classification problem only. Instead of focusing solely on phishing-versus-normal detection, it proposed a culturally aware LLM framework for the structured analysis of social engineering tactics in Korean phishing messages. The framework integrated phishing status prediction with the tactic type, relation type, requested action, cultural lever, and evidence span so that the output could support both classification and interpretation. This design was grounded in the project’s fixed codebook v1.3 and authoritative clean split, which were established to improve annotation consistency, leakage control, and experimental reproducibility [40].

The experimental results showed that binary detection alone was not the main contribution of the study. Conventional baselines already achieved high performance on the authoritative clean split, whereas the proposed structured framework provided additional value by making phishing analysis more interpretable. Among the QLoRA variants, Model D achieved the best balance between phishing detection, attribute-level interpretation, and evidence grounding. On the authoritative 82-sample hold-out test split, it achieved error-free binary predictions together with strong exact match core, tactic_type, requested_action, cultural_lever, and evidence-span results. These findings indicate that structured supervision and grouped auxiliary labels can support a more informative analysis pipeline than binary prediction alone.

The practical implication of this study lies primarily in structured analyst support, review prioritization, and evidence presentation, not in direct end-user action guidance. The human-evaluation results show that the current framework is more reliable for organizing phishing judgments and evidence spans than for providing immediately actionable prevention advice. Therefore, deployment-oriented claims should be limited until the response-guidance component is redesigned and validated in user-centered evaluation settings. In its current form, the framework is better suited to explaining how a message persuades than to prescribing a universally convincing user response. In the Korean context, where phishing messages often rely on authority, kinship, cooperation pressure, urgency, and loss-avoidance cues, such structured outputs may be more useful than a single binary alert for downstream review, analyst support, and user-oriented protective systems. Evidence-span supervision was particularly important because it improved grounding quality and made the analytical outputs more traceable to the original message text.

Several limitations should be noted. First, the study was conducted on a Korean-language dataset under a fixed clean split, so the results should not be generalized directly to other languages or phishing ecosystems. Second, some fine-grained classes remained sparse, especially relation-sensitive subclasses, which limited the stability of certain attribute-level predictions. Third, the human-evaluation results were more favorable for label validity and evidence appropriateness than for action guidance and overall preventive usefulness, indicating that user-facing explanation quality is still uneven.

Future work should proceed in four directions. First, the framework should be validated on larger and independently collected Korean phishing corpora, including noisy messages, newly emerging templates, and harder out-of-template cases. Second, grouped cross-validation or external test sets should be used to estimate generalization more robustly than a single 82-sample hold-out split. Third, relation_type and cultural_lever prediction should be improved through better sparse-class handling, confidence-aware outputs, and possible secondary-label annotation. Fourth, the structured analysis module should be separated from a dedicated response-guidance module so that evidence-grounded interpretation and user-facing prevention advice can be evaluated independently.

Overall, this study provides an interpretable analytical foundation for Korean phishing-message analysis and shows that culturally aware, evidence-grounded structured generation is a viable direction for socially engineered threat analysis. However, the present evidence supports feasibility and analyst-supportive interpretability within a curated Korean benchmark, rather than broad robustness across external corpora, unseen phishing campaigns, or non-Korean cultural contexts.
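The grouped cross-validation named above as a future direction can be illustrated with a minimal sketch. It assumes each message carries a group identifier (for example, a hypothetical template or near-duplicate cluster id) and keeps all messages of one group in the same fold, so that no template appears on both sides of a train/test boundary. The function name `grouped_folds` and the group ids are illustrative assumptions, not part of the authors' pipeline.

```python
from collections import defaultdict

def grouped_folds(group_ids, n_folds):
    """Assign sample indices to folds so that samples sharing a group id stay together."""
    members = defaultdict(list)
    for index, group in enumerate(group_ids):
        members[group].append(index)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the currently smallest fold.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])
    return folds

# Hypothetical template ids for seven messages (illustration only).
template_ids = ["t1", "t1", "t2", "t3", "t3", "t3", "t4"]
for held_out in grouped_folds(template_ids, n_folds=2):
    print(sorted(held_out))
```

Each fold then serves once as the held-out set while the remaining folds are used for training. The greedy assignment only balances fold sizes; it does not stratify by class, which a small and sparse dataset would probably also need.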

Author Contributions

Conceptualization, K.L., Y.L. and J.J.; funding acquisition, Y.L.; methodology, K.L., J.J. and Y.L.; supervision, D.S.; validation, K.L. and Y.-h.C.; writing—original draft, K.L. and Y.L.; writing—review and editing, D.S. and Y.-h.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Human Resources Development Project for Regional Energy Clusters, funded by the Ministry of Trade, Industry and Energy of the Republic of Korea (Project No. 20224000000070).

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall framework of the proposed culturally aware phishing analysis pipeline.

Figure 2. Data curation, clean split construction, and leakage-control process.

Figure 3. Example of evidence-grounded structured output. The check mark (√) indicates that the corresponding field matched the gold annotation or satisfied the evaluation criterion.

Table 1. Summary of the structured annotation schema and label dimensions.

Field	No. of Labels	Purpose	Example Values	Key Annotation Rule
label	2	Binary phishing status	normal; phishing	Primary label for each message
tactic_type	7	Surface persuasion tactic	relational; authority/institutional; urgent/fear-based	Assign the single most decisive tactic
relation_type	5	Explicit social relation in text	family; institution; no explicit relation	Text-evidence-only rule
requested_action	6	Main action demanded from the recipient	link clicking; money transfer; authentication entry	Choose the most specific action channel
cultural_lever	6	Underlying social/cultural pressure	authority/formality; responsibility/cooperation	Separate pressure mechanism from surface tactic
evidence_span	1–3 spans	Grounding evidence	exact substring(s) from source text	No paraphrase; exact substring only
relation_group	aux.	Grouped supervision for sparse relation labels	personal/institutional/none	Auxiliary field for stable learning
action_group	aux.	Grouped supervision for sparse action labels	interactive/financial/credential	Auxiliary field for stable learning
emotion_band	aux.	Coarse emotional intensity band	low/medium/high	Auxiliary interpretive field
short_rationale	free text	Concise explanation for evaluators	one-sentence rationale	Used with structured evidence, not as free-form summary
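As a rough illustration of the schema in Table 1, the sketch below checks a single structured record against two of the stated rules: the binary label set and the exact-substring requirement for evidence_span. The `StructuredOutput` class, the `validate` helper, and the English sample message are hypothetical and are not taken from the authors' implementation; only the field names, the two label values, and the 1–3 span limit come from the table.

```python
from dataclasses import dataclass, field
from typing import List

# Label values as listed in Table 1.
ALLOWED_LABELS = {"normal", "phishing"}

@dataclass
class StructuredOutput:
    text: str                      # original message text
    label: str                     # binary phishing status
    evidence_span: List[str] = field(default_factory=list)

def validate(record: StructuredOutput) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if record.label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {record.label!r}")
    if not 1 <= len(record.evidence_span) <= 3:
        problems.append("evidence_span must contain 1-3 spans")
    for span in record.evidence_span:
        # "No paraphrase; exact substring only" rule from Table 1.
        if span not in record.text:
            problems.append(f"span is not an exact substring: {span!r}")
    return problems

# Hypothetical sample message (illustration only).
message = "Your parcel is on hold. Confirm your address at http://example.test today."
grounded = StructuredOutput(message, "phishing", ["Confirm your address at http://example.test"])
paraphrased = StructuredOutput(message, "phishing", ["Please confirm your address"])
print(validate(grounded))     # []
print(validate(paraphrased))  # one violation: the span is a paraphrase
```

A check of this kind could also be extended to the categorical fields (tactic_type, relation_type, requested_action, cultural_lever) once their full label sets are available.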

Table 2. Authoritative clean split and class distribution.

Split	Total	Normal	Phishing	Phishing Ratio
Train	730	460	270	37.0%
Development	84	59	25	29.8%
Test	82	56	26	31.7%

Table 3. Major structured-attribute distribution patterns in the clean subset.

Field	Label/Pattern	Relative Status	Description	Source Note
tactic_type	Neutral/informational	Dominant	Most frequent class in clean subset	Reported in manuscript text
relation_type	No explicit relation	Dominant	Most frequent class in clean subset	Reported in manuscript text
requested_action	Reply/confirmation/visit induction	Dominant	Most frequent class in clean subset	Reported in manuscript text
cultural_lever	Responsibility/cooperation request	Dominant	One of the most frequent frames	Reported in manuscript text
cultural_lever	Authority/formality	Dominant	One of the most frequent frames	Reported in manuscript text
rare subclass	App-installation requests	Sparse	Low-frequency subclass preserved in clean split	Explicitly noted in project docs
rare subclass	Friend/acquaintance relation	Sparse	Low-frequency subclass preserved in clean split	Explicitly noted in project docs

Table 4. Binary classification and prompt-baseline performance.

Model	Eval Split	Accuracy	Precision	Recall	Binary F1	Macro-F1	Parse Success	Exact Match Core	Exact Match Strict
TF-IDF/Logistic Regression	Test	0.9878	0.9630	1.0000	0.9811	0.9861	-	-	-
KoELECTRA	Test	0.9756	0.9615	0.9615	0.9615	0.9718	-	-	-
OpenChat few-shot (3-shot)	Dev/final-view	-	-	-	0.4880	0.6100	1.0000	0.0320	0.0240
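The two supervised baseline rows in Table 4 can be cross-checked against the test-split counts in Table 2 (56 normal, 26 phishing). The sketch below recomputes the reported metrics from a confusion matrix. The error counts used here (one false positive for TF-IDF/Logistic Regression; one false positive and one false negative for KoELECTRA) are not stated in the source; they are inferred because they reproduce the reported figures to four decimals.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, positive-class precision/recall/F1, and macro-F1 from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # F1 of the negative (normal) class: swap the roles of the two classes.
    neg_precision = tn / (tn + fn)
    neg_recall = tn / (tn + fp)
    neg_f1 = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)
    return {
        "accuracy": round((tp + tn) / (tp + fp + fn + tn), 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "binary_f1": round(f1, 4),
        "macro_f1": round((f1 + neg_f1) / 2, 4),
    }

# Inferred confusion matrices on the 82-sample test split (assumption, see lead-in).
print(binary_metrics(tp=26, fp=1, fn=0, tn=55))  # TF-IDF/Logistic Regression row
print(binary_metrics(tp=25, fp=1, fn=1, tn=55))  # KoELECTRA row
```

Under these counts, the gap between the two baselines amounts to a single additional misclassified phishing message, which underlines how coarse the resolution of an 82-sample hold-out split is.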

Table 5. Comparison of QLoRA model variants (A–D).

Model	Output Scope	Parse Success	Accuracy	Binary F1	Label Macro-F1	Exact Match Core	Tactic Macro-F1	Relation Macro-F1	Requested_Action Macro-F1	Cultural_Lever Macro-F1	Span-F1	Token Overlap
A	label + label_name	1.000	0.927	0.893	0.919	0.000	-	-	-	-	0.119	0.082
B	A + emotion_band, tactic, relation, action, cultural	1.000	0.939	0.912	0.933	0.646	0.610	0.911	0.733	0.643	0.155	0.124
C	B + evidence_span, evidence_label, short_rationale	1.000	0.902	0.846	0.887	-	-	0.783	0.868	0.562	0.899	0.875
D	C + relation_group, action_group	1.000	1.000	1.000	1.000	0.841	0.822	0.590	0.950	0.741	0.886	0.829
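To make the Span-F1 column more concrete, the following sketch computes a token-level F1 between a predicted and a gold evidence span. This is one common way to score span overlap and is shown only as an assumption; the exact definitions of Span-F1 and Token Overlap used in the study are given in its evaluation-metrics section and may differ. Whitespace tokenization is a simplification, particularly for Korean text, and the sample strings are hypothetical.

```python
from collections import Counter

def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between two spans, using whitespace tokens and multiset overlap."""
    pred_tokens = predicted.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: the prediction selects one token more than the gold span.
print(token_f1("click the link now", "click the link"))
```

Under a measure of this kind, selecting a full sentence instead of the minimal decisive phrase keeps recall high but lowers precision, which is the over-selection pattern listed in the error analysis.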

Table 6. Human evaluation results across five usefulness dimensions.

Dimension	Mean	SD	Statistical Test	Interpretation
Label validity	3.896	1.572	Wilcoxon vs. midpoint = 3	Significantly above midpoint
Evidence appropriateness	3.929	1.553	Wilcoxon vs. midpoint = 3	Significantly above midpoint
Social-engineering interpretation adequacy	2.542	1.182	Wilcoxon vs. midpoint = 3	Significantly below midpoint
Response-guidance usefulness	2.394	1.164	Wilcoxon vs. midpoint = 3	Significantly below midpoint
Practical usefulness for phishing-risk prevention	2.630	1.399	Wilcoxon vs. midpoint = 3	Not significant after correction

Table 7. Error analysis by failure type.

Failure Type	Affected Field	Typical Pattern	Interpretation	Revision Implication
Relation ambiguity	relation_type	Workplace-like or notification-style messages without explicit relation markers	Text-evidence-only rule limits implicit inference	Add relation_group confidence or implicit-relation flag
Cultural boundary confusion	cultural_lever	Authority + urgency + cooperation cues appear together	Single-label policy forces one dominant lever	Consider secondary lever or confidence score
Evidence over-selection	evidence_span	Full sentence selected instead of minimal phrase	Model identifies relevant text but not minimal decisive span	Add post-validation for span length
Evidence under-selection	evidence_span	Only URL or institution name selected	Evidence misses action demand or pressure cue	Train with multi-span examples
Weak response guidance	short_rationale/guidance	Generic or insufficient user advice	Current model is analytic, not guidance-specialized	Separate response-guidance module
Borderline normal message	label/tactic_type	Normal workplace coordination resembles phishing request	Surface form overlaps with social-engineering tactics	Add hard-negative training examples
