Diagnosing and Mitigating LLM Failures in Recognizing Culturally Specific Korean Names: An Error-Driven Prompting Framework
Abstract
1. Introduction
2. Related Work
2.1. PII Detection Technologies
2.1.1. Traditional NER-Based PII Detection Methods
2.1.2. Deep Learning and Language Model-Based PII Detection
2.2. Benchmarks for PII Detection
2.2.1. English-Centric Benchmarks and Their Limitations
2.2.2. The Gap in Korean PII Benchmarks
2.3. Enhancing LLM Performance via Prompt Engineering
3. Methodology
3.1. Building a Diagnostic Framework: From Errors to 12 Fine-Grained Subtypes
3.1.1. Constructing and Executing the Diagnostic Probe
3.1.2. The Diagnostic Process: From Qualitative Analysis to Pattern Induction
- Structural Decomposition Failure covers errors in which the model fails to correctly segment compound name structures, reflecting limited morphological and syntactic decomposition ability. This phenomenon is strongly shaped by Korean cultural conventions, in which personal names are frequently combined with occupational titles, honorifics, or intimate suffixes, making boundary identification harder for the model.
- (a) NAME-S3 (Composite Entity: Name + Position). In Korean discourse, personal names are often attached to professional titles, forming conventionalized expressions such as Park gwajang (manager) or Lee sajang (company president). This “Name + Position” pattern is common in workplace communication and conveys clear signals about a person’s social identity. The model often misinterprets the entire phrase as an inseparable lexical unit, failing to capture the hierarchical structure of the compound.
- (b) NAME-S4 (Name + Suffix). Korean names frequently appear with politeness or intimacy suffixes such as -ssi (neutral polite marker), -nim (formal honorific), or -ah/-ya (intimate address form). Examples like Yoon-jung-ssi or Jin-ah illustrate how these suffixes encode social distance within Korea’s speech-level hierarchy. The model’s inability to isolate the core name while recognizing these morphological attachments indicates its limited understanding of Korean sociolinguistic conventions.
- Common-sense Bias refers to a category of errors that arise from the model’s overreliance on “world knowledge” acquired during pretraining, while neglecting contextual cues within discourse. As a result, the model often fails to interpret culturally specific meanings. In Korean communication, it is common to use words with literal meanings as personal names or nicknames, a phenomenon of “semantic overloading” that poses unique challenges for language models.
- (a) NAME-S6 (Highly Ambiguous Noun-like Name). Words such as Nara (meaning “nation”) or Uju (meaning “universe”) function both as ordinary nouns and as personal names in Korean usage. The model often activates the literal semantics of such words rather than recognizing their referential role, which indicates its difficulty in handling polysemous name forms grounded in cultural conventions.
- (b) NICK-S1 (Common Noun Type). In Korean culture, nicknames derived from everyday nouns such as foods, animals, or objects are frequently used to express intimacy or humor. Examples include Strawberry, Puppy, or Tteok (rice cake). The model tends to interpret these words according to their literal meanings, overlooking their pragmatic function as interpersonal references. This reflects its lack of sensitivity to cultural metaphor and affective expression in nickname usage.
- Recognition Failure refers to a category of errors involving unconventional entities that fall outside the model’s learned representation space, constituting another major source of false negatives (FN). This phenomenon is particularly prominent in Korean contexts, where diverse forms such as foreign names, online identity markers, and colloquial nicknames are widespread.
- (a) NAME-S2 (Prototypical Korean Full Name). Standard Korean full names typically follow a one-syllable surname and two-syllable given name structure (e.g., Kim Minji, Park Jisoo), the most common and canonical naming convention in Korea. Despite this high prototypicality, the model still fails to recognize such names when explicit contextual or syntactic cues are absent. This indicates the model’s limited morphological sensitivity to Korean naming structures and weak representational capacity for encoding canonical name patterns. At a deeper level, this error reflects insufficient coverage and diversity in Korean-specific training data, exposing structural limitations in the model’s ability to perform basic named entity recognition in low-resource languages.
- (b) NAME-S5 (Foreign/Atypical Name). Names such as Ahmed or Mika do not conform to Korean phonological or orthographic conventions. The increasing globalization of Korean society has made such foreign names common in workplaces and media, yet the model’s poor recognition performance indicates limited cross-cultural adaptability.
- (c) NICK-S2 (ID/Creative Type). In Korean online culture, hybrid nicknames combining Korean, English, and numerals are widely used, such as minsu123 or cute_jin. These creative and personalized forms challenge the model’s ability to detect boundaries and identify entities accurately, reflecting limitations in recognizing culturally specific digital naming patterns.
- (d) NICK-S4 (Phonetic/Affectionate Variation Type). In spoken Korean, affectionate nicknames are often formed through abbreviation, phonetic alteration, or reduplication. For instance, suffixes such as -i or -sseu are added to convey intimacy or endearment. The model fails to capture these phonological and morphological variations, suggesting a limitation in processing colloquial forms shaped by sociolinguistic intimacy conventions.
- Fine-grained Conceptual Confusion refers to a category of errors that reflect the model’s inability to distinguish between semantically similar but pragmatically different expressions, resulting in frequent label mismatches or false positives (FP). This phenomenon is closely related to the complexity of Korean address systems, in which the referential function of an expression often depends on relational roles, communicative context, and social distance.
- (a) NICK-S3 (Relational Reference Type). When a relational expression includes an explicit name anchor (e.g., Minsu’s mom, Chulsoo’s brother), it typically refers to a specific individual in real-world Korean discourse, functioning as a stable referential identifier. Therefore, such expressions can be considered personally identifiable information (PII) or name-like nicknames. In contrast, relational expressions without a name anchor (e.g., “my mom,” “your brother,” “the kid’s mom”) merely indicate a relational role or category rather than a specific individual, and thus are not PII. The model often fails to capture this pragmatic distinction: sometimes it extracts only the name while ignoring the relational term (false negative), and sometimes it incorrectly labels non-referential relational expressions as PII (false positive). These inconsistencies indicate that the model lacks a fine-grained understanding of the cultural pragmatics that govern relational reference in Korean, resulting in unstable and context-insensitive behavior.
- (b) NICK-S5 (Descriptive/Characteristic Type). Descriptive nicknames in Korean are often derived from personal appearance, personality, or behavioral traits, such as Bossy, Shorty, or Smiley. Depending on the social group and interactional context, these expressions can function either as temporary descriptions or as stable identity markers. The model frequently fails to infer this pragmatic boundary, misclassifying evaluative adjectives or descriptive terms as permanent identifiers. This suggests a lack of sensitivity to the social-pragmatic cues that distinguish contextual characterization from stable personal reference.
- (c) NICK-S6 (Foreign-style Nicknames). These are nicknames derived from non-Korean naming conventions, often exhibiting phonological patterns that differ from standard Korean forms. Because of their atypical structure, models frequently misinterpret them as generic foreign words or non-PII expressions rather than person-specific identifiers.
- (d) NAME-S1 (Independent Surname). In Korean discourse, surnames such as Kim, Park, and Lee are commonly used as abbreviated forms of address, particularly in familiar or workplace contexts. When combined with politeness suffixes like -ssi (neutral politeness) or -nim (formal politeness), these forms usually indicate a specific referent and therefore qualify as PII. However, when used generically (e.g., “many Kims,” “a certain Park”), they lose referential specificity. The model exhibits inconsistent behavior in handling such cases: sometimes it omits surnames that carry clear referential intent (false negatives), while at other times it overgeneralizes and mislabels non-specific mentions as PII (false positives). This inconsistency shows the model’s limited understanding of the Korean politeness and address system, as well as its difficulty in modeling pragmatically driven referentiality. A compact, machine-readable view of the full twelve-subtype taxonomy is sketched after this list.
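The twelve subtypes above can also be organized programmatically, which is convenient when tallying diagnostic errors. The Python sketch below is illustrative only: the grouping mirrors the taxonomy in this section, while the `DiagnosticError` record and the `tally_errors` helper are hypothetical names introduced here for exposition, not part of the released framework.

```python
from collections import Counter
from dataclasses import dataclass

# Four top-level failure categories and their fine-grained subtypes,
# mirroring the taxonomy described in Section 3.1.2.
TAXONOMY = {
    "Structural Decomposition Failure": ["NAME-S3", "NAME-S4"],
    "Common-sense Bias": ["NAME-S6", "NICK-S1"],
    "Recognition Failure": ["NAME-S2", "NAME-S5", "NICK-S2", "NICK-S4"],
    "Fine-grained Conceptual Confusion": ["NICK-S3", "NICK-S5", "NICK-S6", "NAME-S1"],
}

@dataclass
class DiagnosticError:
    """One manually adjudicated model error (hypothetical record layout)."""
    surface_form: str   # e.g., "Park gwajang"
    gold_label: str     # "PS_NAME" or "PS_NICKNAME"
    error_type: str     # "FN" (missed span) or "FP" (spurious or mislabeled span)
    subtype: str        # one of the twelve codes above

def tally_errors(errors: list[DiagnosticError]) -> Counter:
    """Count adjudicated errors per subtype, rejecting unknown codes early."""
    valid = {code for codes in TAXONOMY.values() for code in codes}
    for e in errors:
        if e.subtype not in valid:
            raise ValueError(f"Unknown subtype: {e.subtype}")
    return Counter(e.subtype for e in errors)
```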
3.2. Error-Driven Prompt Engineering
3.3. K-NameDiag: A Diagnostic Benchmark for Name and Nickname Disambiguation
3.3.1. Stage 1: Construction of the High-Difficulty Entity Lexicon
- High-Difficulty Entity Pool Curation: Following the diagnostic analysis in Section 3.1, we quantitatively identify instances of false negatives and false positives in the KDPII dataset. From these diagnostic results, we extract all PII entities that cause recognition failures to form an Error Entity Pool.
- Lexicon Construction based on Frequency and Coverage: From this Error Entity Pool, we curate the final lexicon according to two core criteria (an illustrative sketch of this selection step follows the list):
- High-Frequency First: We prioritize entities with the highest error frequency in the diagnostic analysis. This design inherently guarantees the “high-difficulty” nature of the lexicon, as it directly targets the model’s most common vulnerabilities.
- Ensuring Coverage: To ensure diagnostic comprehensiveness, we conduct expert review and supplementation to confirm that all twelve challenge types, including rare but semantically complex cases, are adequately represented in the lexicon.
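As a rough illustration of the two curation criteria, the sketch below keeps the most frequent error-inducing forms and then checks that every subtype remains represented. The `top_k` threshold, the data layout, and the helper name are assumptions made for exposition; the expert review and supplementation step is manual and is not modeled here.

```python
from collections import Counter

def curate_lexicon(error_pool: list[tuple[str, str]], top_k: int = 200) -> dict[str, list[str]]:
    """error_pool: (entity_form, subtype) pairs harvested from FN/FP diagnostics.

    Criterion 1 (high-frequency first): keep the top_k most frequent forms.
    Criterion 2 (coverage): re-add the most frequent form of any subtype that
    the frequency cut left unrepresented.
    """
    freq = Counter(form for form, _ in error_pool)
    selected = {form for form, _ in freq.most_common(top_k)}

    lexicon: dict[str, list[str]] = {}
    for form, subtype in error_pool:
        if form in selected and form not in lexicon.get(subtype, []):
            lexicon.setdefault(subtype, []).append(form)

    # Coverage pass: every subtype observed in the pool appears at least once.
    for form, subtype in sorted(error_pool, key=lambda pair: -freq[pair[0]]):
        if subtype not in lexicon:
            lexicon[subtype] = [form]
    return lexicon
```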
3.3.2. Stage 2: Dialogue Generation via Adversarial Pairing Protocol
3.3.3. Expert Refinement and Independent Quality Verification
4. Experimental Setup
- BP (Baseline Prompt). The basic configuration that serves as the reference point for all models, as shown in Appendix A.
- EDP (Error-Driven Prompt). The version that incorporates explicit linguistic and diagnostic knowledge, used to examine the effect of knowledge infusion.
- CEP (Combined Enhanced Prompt). The final configuration integrating structured reasoning instructions to evaluate the impact of general reasoning augmentation. The full prompt content of EDP and CEP is provided in Figure 8; an illustrative sketch of how the three configurations could be assembled follows this list.
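To make the contrast between the three configurations concrete, the sketch below shows one way they could be assembled and dispatched to a chat-style API. The BP instruction text is quoted from Appendix A; `EDP_KNOWLEDGE` and `CEP_REASONING` are placeholders standing in for the Figure 8 content (not reproduced here), and `call_llm` stands in for whichever provider client is used, so none of these names should be read as the actual implementation.

```python
import json

# Baseline instruction, quoted from Appendix A.
BP_INSTRUCTION = (
    "You are an information extraction model. Identify all spans in the given "
    "text that contain Korean personal names or nicknames."
)
# Placeholders for the Figure 8 prompt content (assumptions, not the real text).
EDP_KNOWLEDGE = "<error-driven linguistic and diagnostic knowledge>"
CEP_REASONING = "<structured reasoning instructions>"

def build_prompt(config: str, dialogue: str) -> str:
    """Assemble BP, EDP, or CEP for a single KDPII dialogue (illustrative layout)."""
    parts = [BP_INSTRUCTION]
    if config in ("EDP", "CEP"):
        parts.append(EDP_KNOWLEDGE)
    if config == "CEP":
        parts.append(CEP_REASONING)
    parts.append(f"Text:\n{dialogue}\n\nReturn JSON in the form {{\"PNE\": [...]}}.")
    return "\n\n".join(parts)

def extract_entities(config: str, dialogue: str, call_llm) -> list[dict]:
    """call_llm: any function mapping a prompt string to the model's raw completion."""
    raw = call_llm(build_prompt(config, dialogue))
    try:
        return json.loads(raw).get("PNE", [])
    except json.JSONDecodeError:
        return []  # treat malformed output as no prediction
```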
5. Results and Discussion
5.1. Overall Performance Comparison Across Prompt Configurations
5.2. Fine-Grained Diagnostic Analysis
6. Conclusions
7. Limitations and Future Work
7.1. Limitations
7.2. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| BP | Baseline Prompt |
| BiLSTM | Bidirectional Long Short-Term Memory |
| CEP | Combined Enhanced Prompt |
| CoT | Chain-of-Thought |
| CRF | Conditional Random Field |
| EDP | Error-Driven Prompt |
| FN | False Negative |
| FP | False Positive |
| GDPR | General Data Protection Regulation |
| HMM | Hidden Markov Model |
| ICL | In-Context Learning |
| KDPII | Korean Dialogic PII Dataset |
| KISA | Korea Internet & Security Agency |
| LLM | Large Language Model |
| NER | Named Entity Recognition |
| NLP | Natural Language Processing |
| PHI | Protected Health Information |
| PII | Personally Identifiable Information |
| PLM | Pre-trained Language Model |
| SVM | Support Vector Machine |
| TP | True Positive |
Appendix A. Baseline Prompt Configuration
| Instruction | You are an information extraction model. Identify all spans in the given text that contain Korean personal names or nicknames. |
| Input | Dialogue sample from KDPII containing Korean personal names or nicknames. |
| Task | Extract all words or phrases in the text that represent a Korean personal name (PS_NAME) or nickname (PS_NICKNAME). If no entities are found, return an empty list as shown below. |
| Output Format | {“PNE”: [{“form”: “...”, “label”: “PS_NAME”}, {“form”: “...”, “label”: “PS_NICKNAME”}]} If no entities are found, return: {“PNE”: []} |
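Because the output schema above is a flat list of (form, label) pairs, span-level scoring reduces to multiset matching between predicted and gold pairs. The following sketch computes precision, recall, and F1 under that convention; exact string matching on the surface form is an assumption, as the matching criterion is not specified in this table.

```python
from collections import Counter

def prf(pred: list[dict], gold: list[dict]) -> tuple[float, float, float]:
    """Multiset precision/recall/F1 over (form, label) pairs from the PNE output."""
    pred_c = Counter((e["form"], e["label"]) for e in pred)
    gold_c = Counter((e["form"], e["label"]) for e in gold)
    tp = sum((pred_c & gold_c).values())                 # true positives
    p = tp / sum(pred_c.values()) if pred_c else 0.0     # precision
    r = tp / sum(gold_c.values()) if gold_c else 0.0     # recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Usage with the Appendix A output schema (hypothetical example values):
pred = [{"form": "김민지", "label": "PS_NAME"}]
gold = [{"form": "김민지", "label": "PS_NAME"}, {"form": "딸기", "label": "PS_NICKNAME"}]
print(prf(pred, gold))  # -> (1.0, 0.5, 0.666...)
```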
Appendix B. Organizational Structure of KDPII Dataset

Appendix C. Adversarial Dialogue Generation Prompt
| Adversarial Dialogue Generation Prompt |
|---|
| # ROLE |
| You are an expert-level Korean scriptwriter and a linguist specializing in subtle linguistic nuances. |
| # TASK |
| Using the provided “Adversarial Name-Related Entity Set,” you must write a natural, multi-turn Korean dialogue of 8–12 turns that embeds these entities ambiguously. |
| # GOAL |
| This dialogue is used to test a language model’s ability to classify the precise label of the name-related entities. Therefore, the entities must be used naturally within the context while simultaneously containing traps or ambiguity that make their classification difficult. |
| # ADVERSARIAL NAME-RELATED ENTITY SET |
| {ADVERSARIAL_ENTITY_SET} |
| # CONSTRAINTS |
| 1. Entity Usage: Include all entities from the provided set in the dialogue. |
| 2. Dialogue Length: The dialogue must contain 8–12 turns and may involve two or three speakers (P01, P02, optionally P03). |
| 3. Naturalness: The dialogue must be realistic and coherent, as if spoken by native Koreans. |
| 4. Maintain Ambiguity (Critical): |
|
| # OUTPUT FORMAT |
| P01: [Dialogue content] |
| P02: [Dialogue content] |
| P03: [Dialogue content, if applicable] |
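The {ADVERSARIAL_ENTITY_SET} slot in the prompt above is filled per dialogue from the Stage 1 lexicon. A minimal templating sketch follows; the bullet-list serialization of the entity set is an assumption, since the paper shows only the placeholder.

```python
def fill_adversarial_prompt(template: str, entity_set: dict[str, list[str]]) -> str:
    """Render the adversarial generation prompt for one sampled entity set.

    entity_set maps subtype codes (e.g., "NAME-S3") to surface forms; the
    serialization format chosen here is illustrative only.
    """
    lines = [f"- {subtype}: {', '.join(forms)}" for subtype, forms in entity_set.items()]
    return template.replace("{ADVERSARIAL_ENTITY_SET}", "\n".join(lines))

# Example (forms taken from the taxonomy discussion):
# sample = {"NAME-S3": ["박 과장"], "NICK-S1": ["딸기"]}
# prompt = fill_adversarial_prompt(prompt_template, sample)
```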
Appendix D. Dialogue Quality Assessment Rubric
| Dialogue Quality Assessment Rubric |
|---|
| Objective |
| This guideline assists researchers in evaluating the quality of candidate dialogues for K-NameDiag. The goal is to ensure linguistic naturalness, contextual appropriateness, and annotation feasibility of the final benchmark. |
| Evaluation Procedure |
| Each dialogue is evaluated along four core dimensions. For every dimension, the evaluator must assign one of two judgments: Acceptable (Pass) or Unacceptable (Fail). |
| Final Decision Rule |
| A dialogue is considered Pass (1) only if it is rated Acceptable on all four dimensions. If a dialogue is judged Unacceptable in even one dimension, it is rated Fail (0). |
| Dimension 1. Linguistic Naturalness and Coherence |
| Acceptable (Pass): |
|
| Unacceptable (Fail): |
|
| Dimension 2. Contextual Appropriateness of Target Entities |
| Acceptable (Pass): |
|
| Unacceptable (Fail): |
|
| Dimension 3. Verification of Target Entity Generation |
| Acceptable (Pass): |
|
| Unacceptable (Fail): |
|
| Dimension 4. Annotation Feasibility for Incidental PIIs |
| Acceptable (Pass): |
|
| Unacceptable (Fail): |
|
Appendix E. Statistics of the K-NameDiag Benchmark
| Category | Label | Value |
|---|---|---|
| A. Overall Scale | Total Dialogues | 3000 |
| | Total PII Instances | 6105 |
| | Total Dialogue Turns | 30,111 |
| | Total Tokens | 333,934 |
| B. Label Distribution | PS_NAME | 3151 |
| | PS_NICKNAME | 2954 |
| | NAME-S1 | 211 |
| | NAME-S2 | 2421 |
| | NAME-S3 | 200 |
| | NAME-S4 | 90 |
| | NAME-S5 | 182 |
| | NAME-S6 | 47 |
| | NICK-S1 | 477 |
| | NICK-S2 | 306 |
| | NICK-S3 | 210 |
| | NICK-S4 | 1541 |
| | NICK-S5 | 313 |
| | NICK-S6 | 107 |
| | Unique PII Forms | 1375 |
| C. Dialogue Characteristics | Average Turns per Dialogue | 10.04 |
| | Average PII Instances per Dialogue | 2.04 |
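The derived statistics in Part C are consistent with the raw counts in Parts A and B, as the following arithmetic confirms.

```latex
\begin{align*}
\text{PS\_NAME} + \text{PS\_NICKNAME} &= 3151 + 2954 = 6105 = \text{Total PII Instances},\\
\text{Avg.\ turns per dialogue} &= 30{,}111 / 3000 \approx 10.04,\\
\text{Avg.\ PII instances per dialogue} &= 6105 / 3000 \approx 2.04.
\end{align*}
```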
References
- Li, Y.; Pei, Q.; Sun, M.; Lin, H.; Ming, C.; Gao, X.; Wu, J.; He, C.; Wu, L. CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenge. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 5929–5965. [Google Scholar]
- Sheikhaei, M.S.; Tian, Y.; Wang, S.; Xu, B. An Empirical Study on the Effectiveness of Large Language Models for SATD Identification and Classification. arXiv 2024, arXiv:2405.06806. [Google Scholar] [CrossRef]
- Deußer, T.; Sparrenberg, L.; Berger, A.; Hahnbück, M.; Bauckhage, C.; Sifa, R. A Survey on Current Trends and Recent Advances in Text Anonymization. arXiv 2025, arXiv:2508.21587. [Google Scholar] [CrossRef]
- Hui, B.S.H.; Miao, X.; Wang, X. SecureSpeech: Prompt-based Speaker and Content Protection. arXiv 2025, arXiv:2507.07799. [Google Scholar]
- Fei, L.; Kang, Y.; Park, S.; Jang, Y.; Lee, J.; Kim, H. KDPII: A new Korean dialogic dataset for the deidentification of personally identifiable information. IEEE Access 2024, 12, 135626–135641. [Google Scholar] [CrossRef]
- Pham, D.; Kairouz, P.; Mireshghallah, N.; Bagdasarian, E.; Pham, C.M.; Houmansadr, A. Can Large Language Models Really Recognize Your Name? arXiv 2025, arXiv:2505.14549. [Google Scholar] [CrossRef]
- Sang, E.F.T.K.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147. [Google Scholar]
- Chinchor, N.; Robinson, P. MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, Fairfax, VA, USA, 29 April–1 May 1998; Volume 29, pp. 1–21. [Google Scholar]
- Stubbs, A.; Uzuner, Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J. Biomed. Inform. 2015, 58, S20–S29. [Google Scholar] [CrossRef]
- Singh, D.; Narayanan, S. Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability. arXiv 2025, arXiv:2504.12308. [Google Scholar]
- Shen, H.; Gu, Z.; Hong, H.; Han, W. PII-Bench: Evaluating Query-Aware Privacy Protection Systems. arXiv 2025, arXiv:2502.18545. [Google Scholar]
- Kocetkov, D.; Li, R.; Allal, L.B.; Li, J.; Mou, C.; Ferrandis, C.M.; Jernite, Y.; Mitchell, M.; Hughes, S.; Wolf, T.; et al. The Stack: 3 TB of permissively licensed source code. arXiv 2022, arXiv:2211.15533. [Google Scholar] [CrossRef]
- Asthana, S.; Mahindru, R.; Zhang, B.; Sanz, J. Adaptive PII Mitigation Framework for Large Language Models. arXiv 2025, arXiv:2501.12465. [Google Scholar] [CrossRef]
- Savkin, M.; Ionov, T.; Konovalov, V. SPY: Enhancing Privacy with Synthetic PII Detection Dataset. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Student Research Workshop), Albuquerque, NM, USA, 30 April–1 May 2025; pp. 236–246. [Google Scholar]
- Bikel, D.M.; Miller, S.; Schwartz, R.; Weischedel, R. Nymble: A high-performance learning name-finder. arXiv 1998, arXiv:cmp-lg/9803003. [Google Scholar] [CrossRef]
- Isozaki, H.; Kazawa, H. Efficient support vector classifiers for named entity recognition. In Proceedings of the COLING 2002: The 19th International Conference on Computational Linguistics, Taipei, Taiwan, 24 August–1 September 2002. [Google Scholar]
- Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Williamstown, MA, USA, 28 June–1 July 2001; Morgan Kaufmann: Burlington, MA, USA, 2001; pp. 282–289. [Google Scholar]
- McCallum, A.; Li, W. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May 2003; pp. 188–191. [Google Scholar]
- Settles, B. Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications (NLPBA/BioNLP), Geneva, Switzerland, 28–29 August 2004; pp. 107–110. [Google Scholar]
- Tomás, J.; Rasteiro, D.; Bernardino, J. Data anonymization: An experimental evaluation using open-source tools. Future Internet 2022, 14, 167. [Google Scholar] [CrossRef]
- Yang, J.; Zhang, X.; Liang, K.; Liu, Y. Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study*. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2116–2123. [Google Scholar]
- Ratinov, L.; Roth, D. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, CO, USA, 4 June 2009; pp. 147–155. [Google Scholar]
- Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar] [CrossRef]
- Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar] [CrossRef]
- Ma, X.; Hovy, E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv 2016, arXiv:1603.01354. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- He, P.; Liu, X.; Gao, J.; Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv 2020, arXiv:2006.03654. [Google Scholar]
- Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Xie, P.; Xu, J.; Chen, Y.; Zhang, M.; et al. Chatie: Zero-shot information extraction via chatting with chatgpt. arXiv 2023, arXiv:2302.10205. [Google Scholar]
- Mainetti, L.; Elia, A. Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward. Appl. Syst. Innov. 2025, 8, 55. [Google Scholar] [CrossRef]
- Nakka, K.K.; Frikha, A.; Mendes, R.; Jiang, X.; Zhou, X. PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs. arXiv 2024, arXiv:2410.06704. [Google Scholar]
- Zhang, W.; Aljunied, M.; Gao, C.; Chia, Y.K.; Bing, L. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 5484–5505. [Google Scholar]
- Chiu, Y.Y.; Jiang, L.; Lin, B.Y.; Park, C.Y.; Li, S.S.; Ravi, S.; Bhatia, M.; Antoniak, M.; Tsvetkov, Y.; Shwartz, V.; et al. CulturalBench: A Robust, Diverse and Challenging Benchmark on Measuring (the Lack of) Cultural Knowledge of LLMs. arXiv 2024, arXiv:2410.02677v1. [Google Scholar] [CrossRef]
- Shi, F.; Suzgun, M.; Freitag, M.; Wang, X.; Srivats, S.; Vosoughi, S.; Chung, H.W.; Tay, Y.; Ruder, S.; Zhou, D.; et al. Language models are multilingual chain-of-thought reasoners. arXiv 2022, arXiv:2210.03057. [Google Scholar]
- AI Hub. Korean SNS Multi-Turn Dialogue Data. 2024. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?pageIndex=1&currMenu=115&topMenu=100&srchOptnCnd=OPTNCND001&searchKeyword=SNS&srchDetailCnd=DETAILCND001&srchOrder=ORDER001&srchPagePer=20&aihubDataSe=data&dataSetSn=71694 (accessed on 10 October 2025).
- Center for Science and Technology Security Research. Network Intrusion Detection Dataset; Korea Institute of Science and Technology Security Research Center: Seoul, Republic of Korea, 2022. [Google Scholar] [CrossRef]
- Park, S.; Moon, J.; Kim, S.; Cho, W.I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; et al. Klue: Korean language understanding evaluation. arXiv 2021, arXiv:2105.09680. [Google Scholar] [CrossRef]
- Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
- Oppenlaender, J.; Linder, R.; Silvennoinen, J. Prompting AI art: An investigation into the creative skill of prompt engineering. Int. J. Hum.–Comput. Interact. 2025, 41, 10207–10229. [Google Scholar] [CrossRef]
- Henrickson, L.; Meroño-Peñuela, A. Prompting meaning: A hermeneutic approach to optimising prompt engineering with ChatGPT. AI Soc. 2025, 40, 903–918. [Google Scholar] [CrossRef]
- Cherukuri, M. Cost, Complexity, and Efficacy of Prompt Engineering Techniques for Large Language Models. IJSAT-Int. J. Sci. Technol. 2025, 16, 1–24. Available online: https://www.ijsat.org/papers/2025/2/2584.pdf (accessed on 10 October 2025).
- Abdallah, A.E.M.; Mozafari, J.; Piryani, B.; Abdelgwad, M.M.; Jatowt, A. DynRank: Improve passage retrieval with dynamic zero-shot prompting based on question classification. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 4768–4778. [Google Scholar]
- Wu, J.; Wang, X.; Jia, W. Enhancing text annotation through rationale-driven collaborative few-shot prompting. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Zhang, Y.; Wang, X.; Wu, L.; Wang, J. Enhancing chain of thought prompting in large language models via reasoning patterns. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 25985–25993. [Google Scholar]
- Chen, Y.C.; Lee, S.H.; Sheu, H.; Lin, S.H.; Hu, C.C.; Fu, S.C.; Yang, C.P.; Lin, Y.C. Enhancing responses from large language models with role-playing prompts: A comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med. Inform. Decis. Mak. 2025, 25, 196. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Yraola, L.; Zhu, K.; O’Brien, S. Error Reflection Prompting: Can Large Language Models Successfully Understand Errors? In The Sixth Workshop on Insights from Negative Results in NLP; Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 157–170. [Google Scholar]
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 46534–46594. [Google Scholar]
- Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652. [Google Scholar]
- Islam, R.; Moushi, O.M. Gpt-4o: The cutting-edge advancement in multimodal llm. In Intelligent Computing, Proceedings of the Computing Conference, London, UK, 19–20 June 2025; Springer: Cham, Switzerland, 2025; pp. 47–60. [Google Scholar]
| Prompt | Claude Sonnet 4.5 | | | GPT-5 | | | Gemini 2.5 Pro | | |
|---|---|---|---|---|---|---|---|---|---|
| | F1-Score | Precision | Recall | F1-Score | Precision | Recall | F1-Score | Precision | Recall |
| BP | 74.74 | 70.52 | 79.48 | 78.92 | 82.12 | 75.96 | 76.55 | 73.72 | 79.60 |
| EDP | 81.62 ↑ | 78.28 ↑ | 85.25 ↑ | 85.37 ↑ | 83.81 ↑ | 86.99 ↑ | 85.49 ↑ | 84.43 ↑ | 86.58 ↑ |
| CEP | 84.93 ↑ | 84.68 ↑ | 85.17 ↓ | 87.73 ↑ | 86.86 ↑ | 88.61 ↑ | 85.67 ↑ | 83.27 ↓ | 88.20 ↑ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).