Understanding and Mitigating Multilingual Bias in LLM-Driven Verilog Code Generation via Hard-Example In-Context Learning
Abstract
1. Introduction
1. Multi-VerilogEval, the first multilingual Verilog code generation benchmark, built from 156 unique underlying tasks, each instantiated in four languages (English, Japanese, Hindi, and Mongolian), yielding 624 language-specific test cases constructed via a multi-agent translation pipeline with human oversight.
2. A comprehensive empirical study evaluating four representative LLMs (two commercial, one open-source, one domain-specific) on Multi-VerilogEval, complemented by hidden-state analysis that probes where multilingual representations diverge inside biased models.
3. HE-ICL (Hard-Example In-Context Learning), a train-free inference-time method that constructs few-shot hard-example demonstrations to mitigate multilingual bias without any parameter updates.
2. Related Work
2.1. LLM-Driven Verilog Generation
2.2. Multilingual Code Generation
2.3. Broader Deep Learning Paradigms
2.4. Positioning of This Work
3. Background
3.1. LLM-Driven Verilog Code Generation
3.2. Problem Formulation
4. Multi-VerilogEval Construction
4.1. Language Selection
- English (EN): the default language of hardware documentation and the reference language in prior benchmarks.
- Japanese (JA): representative of a major East Asian electronics and semiconductor ecosystem.
- Hindi (HI): representative of a rapidly growing South Asian hardware engineering ecosystem.
- Mongolian (MN): included as a deliberately low-resource language to stress-test multilingual robustness beyond high-resource settings.
4.2. Seed Dataset
4.3. Translation Pipeline
(1) Multi-Agent Automated Translation.
1. Translator Agent: receives the English specification and produces an initial target-language translation, instructed to act as a professional technical translator for hardware design while preserving all Verilog keywords and structural elements.
2. Native Evaluator Agent: role-plays a native speaker of the target language and scores the translation for fluency, naturalness, and technical terminology accuracy on a scale of 0 to 10, providing detailed feedback and revision suggestions.
3. Back-Translator Agent: independently renders the translation back into English, preserving all Verilog keywords and identifiers, without access to the original specification.
4. Judge Agent: compares the back-translation against the original specification to compute a semantic equivalence score, focusing on functional requirements, timing constraints, interface definitions, and implementation details.
(2) Human Intervention.
- Structural Preservation: Verilog keywords (module, always, assign, etc.), module/port/signal names, and interface declarations remain in English.
- Format Consistency: numerical literals (e.g., 4’b1010), truth tables, Karnaugh maps, and timing diagrams are preserved verbatim.
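
The four agent roles and the human fallback described above can be read as a bounded refinement loop. The following Python sketch illustrates that reading under stated assumptions: the agent callables, the acceptance thresholds (`min_fluency`, `min_equivalence`), the normalization of the judge score to [0, 1], and the three-round limit are hypothetical choices made for illustration, not values reported in this section.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical agent interfaces; in practice each would wrap an LLM call.
Translator = Callable[[str, str, Optional[str]], str]   # (english_spec, lang, feedback) -> translation
Evaluator = Callable[[str, str], Tuple[float, str]]     # (translation, lang) -> (0-10 score, feedback)
BackTranslator = Callable[[str, str], str]              # (translation, lang) -> English text
Judge = Callable[[str, str], float]                     # (original, back_translation) -> score, assumed in [0, 1]


@dataclass
class TranslationResult:
    text: str
    rounds: int
    needs_human: bool


def translate_spec(english_spec: str, lang: str, translator: Translator,
                   evaluator: Evaluator, back_translator: BackTranslator, judge: Judge,
                   max_rounds: int = 3, min_fluency: float = 8.0,
                   min_equivalence: float = 0.9) -> TranslationResult:
    """Refine a translation until both checks pass or the round budget is spent."""
    feedback: Optional[str] = None
    translation = ""
    for round_no in range(1, max_rounds + 1):
        translation = translator(english_spec, lang, feedback)
        fluency, feedback = evaluator(translation, lang)
        # The back-translator only sees the translation, never the original.
        back = back_translator(translation, lang)
        equivalence = judge(english_spec, back)
        if fluency >= min_fluency and equivalence >= min_equivalence:
            return TranslationResult(translation, round_no, needs_human=False)
    # Budget exhausted: hand the last draft to a human reviewer.
    return TranslationResult(translation, max_rounds, needs_human=True)


# Toy stand-ins so the sketch runs without any model access.
def stub_translator(spec, lang, feedback):
    return f"[{lang}] {spec}" + (" (revised)" if feedback else "")

def stub_evaluator(translation, lang):
    return (9.0 if "(revised)" in translation else 6.0), "use standard terminology"

def stub_back_translator(translation, lang):
    return translation.split("] ", 1)[1].replace(" (revised)", "")

def stub_judge(original, back_translation):
    return 1.0 if original == back_translation else 0.0


result = translate_spec("Implement a 2-to-1 mux.", "ja", stub_translator,
                        stub_evaluator, stub_back_translator, stub_judge)
print(result)
```

With these toy agents the first draft is rejected on fluency and the revised draft is accepted in the second round; a judge that never reports equivalence would instead route the item to human review.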
4.4. Dataset Statistics
5. Empirical Study
5.1. Evaluated Models
5.2. Evaluation Metrics
5.3. Prompt Template
Please write a Verilog module that solves the following problem
efficiently, using the exact module header below:
Problem: {problem.prompt}
Module header (must not be changed): {problem.module_header}
Return only the Verilog code, without any explanation.
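
As a minimal sketch, the template above can be filled programmatically. The `Problem` class below is a hypothetical container whose two fields mirror the `problem.prompt` and `problem.module_header` placeholders; the multiplexer task is an invented example, not a benchmark item.

```python
from dataclasses import dataclass

# Template text copied from the prompt above; only the placeholder names are shortened.
PROMPT_TEMPLATE = (
    "Please write a Verilog module that solves the following problem\n"
    "efficiently, using the exact module header below:\n"
    "Problem: {prompt}\n"
    "Module header (must not be changed): {module_header}\n"
    "Return only the Verilog code, without any explanation."
)


@dataclass(frozen=True)
class Problem:
    prompt: str         # natural-language specification (any of the four languages)
    module_header: str  # fixed Verilog module header, always kept in English


def build_prompt(problem: Problem) -> str:
    """Fill the template with one task's specification and module header."""
    return PROMPT_TEMPLATE.format(prompt=problem.prompt,
                                  module_header=problem.module_header)


example = Problem(
    prompt="Implement a 2-to-1 multiplexer that outputs b when sel is 1, otherwise a.",
    module_header="module top_module(input a, input b, input sel, output out);",
)
prompt_text = build_prompt(example)
print(prompt_text)
```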
5.4. Implementation Details
5.4.1. Hardware and Software Environment
5.4.2. Model Identifiers and Inference Settings
5.5. Empirical Results
(1) Syntax vs. Functional Correctness.
(2) Commercial Models vs. Open-Source and Domain-Specific Models.
(3) Hidden-State Analysis of Multilingual Representations.

6. HE-ICL: Hard-Example In-Context Learning
6.1. Motivation and Approach
6.1.1. Stage 1: Hard Example Mining
6.1.2. Stage 2: Hard-Example Demonstration Construction
1. The task specification in the target language ℓ (showing the model what a non-English prompt looks like);
2. The corresponding English specification (providing the semantic anchor);
3. The correct Verilog code generated from the English prompt (providing the reference output).
6.1.3. Stage 3: ICL Inference
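| Algorithm 1 HE-ICL Inference Pipeline |
| Require: Model $G$, target language $\ell$, test task $t$, demonstration count $k$, hard example set $\mathcal{H}_{\ell}$ |
| Ensure: Generated Verilog module $\hat{y}$ |
| 1: Sample $k$ examples $\{e_1, \dots, e_k\}$ from $\mathcal{H}_{\ell}$ |
| 2: $P \leftarrow \emptyset$ |
| 3: for $i = 1, \dots, k$ do |
| 4: $P \leftarrow P \oplus \mathrm{spec}_{\ell}(e_i)$ {Target-language spec} |
| 5: $P \leftarrow P \oplus \mathrm{spec}_{\mathrm{EN}}(e_i)$ {English spec} |
| 6: $P \leftarrow P \oplus \mathrm{code}(e_i)$ {Correct Verilog} |
| 7: end for |
| 8: $P \leftarrow P \oplus \mathrm{spec}_{\ell}(t)$ {Test task} |
| 9: $\hat{y} \leftarrow G(P)$ |
| 10: return $\hat{y}$ |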
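
A minimal Python sketch of the inference pipeline follows. It assumes the model is available as a plain text-in, text-out callable; the `HardExample` container, the section labels inside the prompt, and the fixed sampling seed are illustrative assumptions rather than details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HardExample:
    target_spec: str   # specification in the target language
    english_spec: str  # English specification (semantic anchor)
    verilog_code: str  # correct Verilog generated from the English prompt


def build_he_icl_prompt(demos: Sequence[HardExample], test_spec: str, lang: str) -> str:
    """Concatenate k three-part demonstrations, then the test task."""
    parts = []
    for i, ex in enumerate(demos, start=1):
        parts.append(
            f"### Example {i}\n"
            f"Specification ({lang}):\n{ex.target_spec}\n"
            f"Specification (EN):\n{ex.english_spec}\n"
            f"Verilog:\n{ex.verilog_code}\n"
        )
    parts.append(f"### Task\nSpecification ({lang}):\n{test_spec}\nVerilog:\n")
    return "\n".join(parts)


def he_icl_generate(model: Callable[[str], str], lang: str, test_spec: str,
                    hard_examples: Sequence[HardExample], k: int = 3, seed: int = 0) -> str:
    """Sample k hard examples, build the few-shot prompt, and query the model."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    demos = rng.sample(list(hard_examples), min(k, len(hard_examples)))
    return model(build_he_icl_prompt(demos, test_spec, lang))


# Toy usage: an "echo" model returns the prompt so its structure can be inspected.
pool = [HardExample(f"target spec {n}", f"english spec {n}", f"module m{n}(); endmodule")
        for n in range(4)]
output = he_icl_generate(lambda p: p, "MN", "test spec", pool, k=3)
print(output)
```

No parameters are updated anywhere in this flow; the only model interaction is the single call on the assembled prompt.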
6.2. Research Questions
- RQ1 (Effectiveness): How does HE-ICL compare against other train-free baselines for mitigating multilingual bias?
- RQ2 (Ablation): How does the quality of demonstrations affect performance? Specifically, what is the contribution of hard-example mining versus random or no demonstrations?
- RQ3 (Sensitivity): How sensitive is HE-ICL to the number of demonstrations k?
6.3. RQ1: Comparison with Baselines
- CoT (Chain-of-Thought): The model is instructed to reason step-by-step about the non-English specification before generating Verilog code.
- TtG (Translate-then-Generate): The non-English prompt is first translated to English using the Google Translate API, and the translated English prompt is then fed to the model for code generation.
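
The two baselines differ only in how the prompt is prepared before generation, which the sketch below makes explicit. The instruction wording is an assumption, and `translate` is a hypothetical callable standing in for the translation service; no real translation API is called here.

```python
from typing import Callable


def base_prompt(spec: str, module_header: str) -> str:
    """Shared task body used by every strategy in this sketch."""
    return (f"Problem: {spec}\n"
            f"Module header (must not be changed): {module_header}\n")


def zero_shot_prompt(spec: str, module_header: str) -> str:
    return base_prompt(spec, module_header) + "Return only the Verilog code."


def cot_prompt(spec: str, module_header: str) -> str:
    """CoT: ask for step-by-step reasoning about the specification first."""
    return (base_prompt(spec, module_header)
            + "First reason step by step about the specification, "
              "then give the final Verilog module.")


def ttg_prompt(spec: str, module_header: str, translate: Callable[[str], str]) -> str:
    """TtG: translate the specification to English, then prompt as zero-shot."""
    return zero_shot_prompt(translate(spec), module_header)


# Toy usage with a fake translator that only tags its input.
header = "module top_module(input a, output y);"
fake_translate = lambda text: "EN<" + text + ">"
ttg_text = ttg_prompt("non-English spec", header, fake_translate)
cot_text = cot_prompt("non-English spec", header)
print(ttg_text)
print(cot_text)
```

Under TtG the model never sees the original non-English text, so any translation error propagates directly into the generation prompt.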

6.4. RQ2: Ablation Study
- No Demonstrations: zero-shot inference without any in-context demonstrations.
- Random Demonstrations: k demonstrations are randomly sampled from RTLLM-v2, regardless of whether the model succeeds or fails on them.
- Hard-Example Demonstrations (HE-ICL): k demonstrations are selected from the hard-example set, i.e., tasks where the model succeeds in English but fails in the target language.
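
The three ablation settings can be sketched as one selection function. In the code below, the per-task pass/fail dictionaries, the task names, and the strategy labels are hypothetical; only the mining rule (passes in English, fails in the target language) comes from the text above.

```python
import random
from typing import Dict, List, Sequence


def mine_hard_examples(passed_en: Dict[str, bool],
                       passed_target: Dict[str, bool]) -> List[str]:
    """Keep tasks the model solves in English but not in the target language."""
    return sorted(task for task, ok in passed_en.items()
                  if ok and task in passed_target and not passed_target[task])


def select_demonstrations(strategy: str, pool: Sequence[str],
                          hard_set: Sequence[str], k: int = 3, seed: int = 0) -> List[str]:
    """Return demonstration task IDs for the 'none', 'random', or 'hard' setting."""
    if strategy == "none":
        return []
    if strategy == "random":
        source = list(pool)      # any task, regardless of model success
    elif strategy == "hard":
        source = list(hard_set)  # only mined hard examples
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)
    return rng.sample(source, min(k, len(source)))


# Toy pass/fail records for four hypothetical demonstration-pool tasks.
passed_en = {"adder": True, "fsm": True, "mux": True, "ram": False}
passed_mn = {"adder": True, "fsm": False, "mux": False, "ram": False}

hard = mine_hard_examples(passed_en, passed_mn)
print(hard)
print(select_demonstrations("hard", list(passed_en), hard, k=2))
```

Tasks that fail in both languages (here `ram`) are excluded, since no correct English-derived Verilog exists to serve as the reference output.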

6.5. RQ3: Sensitivity to Demonstration Count k

7. Threats to Validity
7.1. Internal Validity
7.2. External Validity
7.3. Construct Validity
8. Conclusions and Future Work
Funding
Data Availability Statement
Conflicts of Interest








| Language | R1 | R2 | R3 | Auto-Approved | Human | Auto Rate (%) |
|---|---|---|---|---|---|---|
| Japanese | 128 | 19 | 5 | 152 | 4 | 97.4 |
| Hindi | 108 | 25 | 12 | 145 | 11 | 92.9 |
| Mongolian | 89 | 28 | 18 | 135 | 21 | 86.5 |
| Total | 325 | 72 | 35 | 432 | 36 | 92.3 |
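
The derived columns of the table above can be re-checked with a few lines of arithmetic. The sketch assumes that R1 to R3 count items auto-approved in successive rounds, which is consistent with their sum matching the Auto-Approved column; the dictionary layout is illustrative.

```python
# Per-language counts copied from the table: (R1, R2, R3, Human).
ROWS = {
    "Japanese": (128, 19, 5, 4),
    "Hindi": (108, 25, 12, 11),
    "Mongolian": (89, 28, 18, 21),
}


def summarize(r1: int, r2: int, r3: int, human: int):
    """Return (auto-approved count, total items, auto-approval rate in %)."""
    auto = r1 + r2 + r3
    total = auto + human
    return auto, total, round(100 * auto / total, 1)


summary = {lang: summarize(*counts) for lang, counts in ROWS.items()}
# Column-wise sums give the Total row.
overall = summarize(*[sum(col) for col in zip(*ROWS.values())])

for lang, (auto, total, rate) in summary.items():
    print(f"{lang}: {auto}/{total} auto-approved ({rate}%)")
print(f"Total: {overall[0]}/{overall[1]} auto-approved ({overall[2]}%)")
```

Each language totals 156 items, matching the number of underlying tasks, and the recomputed rates agree with the reported 97.4, 92.9, 86.5, and 92.3.
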
| Language | Qwen2.5-Coder 7B Syn. | Qwen2.5-Coder 7B Func. | HaVen Syn. | HaVen Func. | GPT-5.4 Syn. | GPT-5.4 Func. | Opus-4.6 Syn. | Opus-4.6 Func. |
|---|---|---|---|---|---|---|---|---|
| English | 83.33 | 40.38 | 92.31 | 42.95 | 97.44 | 78.21 | 98.08 | 90.38 |
| Japanese | 83.33 | 33.97 | 91.67 | 35.90 | 98.08 | 80.77 | 98.08 | 87.82 |
| Mongolian | 82.50 | 32.69 | 91.67 | 32.69 | 98.08 | 79.49 | 98.08 | 89.74 |
| Hindi | 76.92 | 32.69 | 93.59 | 35.26 | 98.08 | 76.92 | 98.08 | 89.74 |
| Avg. | 80.92 | 33.12 | 92.31 | 34.62 | 97.92 | 79.06 | 98.08 | 89.10 |
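
One simple way to read the functional-correctness columns is to compare each model's English score with the mean of its three non-English scores. The gap defined below is an illustrative measure introduced for this sketch, not necessarily the metric used in the paper; the numbers are copied from the table above.

```python
# Functional correctness (%) per model and prompt language, from the table above.
FUNC = {
    "Qwen2.5-Coder 7B": {"EN": 40.38, "JA": 33.97, "MN": 32.69, "HI": 32.69},
    "HaVen": {"EN": 42.95, "JA": 35.90, "MN": 32.69, "HI": 35.26},
    "GPT-5.4": {"EN": 78.21, "JA": 80.77, "MN": 79.49, "HI": 76.92},
    "Opus-4.6": {"EN": 90.38, "JA": 87.82, "MN": 89.74, "HI": 89.74},
}


def bias_gap(scores: dict):
    """Return (mean non-English score, English minus that mean), in points."""
    non_english = [v for lang, v in scores.items() if lang != "EN"]
    mean = sum(non_english) / len(non_english)
    return round(mean, 2), round(scores["EN"] - mean, 2)


gaps = {model: bias_gap(scores) for model, scores in FUNC.items()}
for model, (mean, gap) in gaps.items():
    print(f"{model}: non-English mean {mean:.2f}, gap to English {gap:+.2f}")
```

The non-English means reproduce the functional entries of the Avg. row. Under this measure the open-source and domain-specific models lose roughly 7 to 8 points outside English, while the two commercial models stay within about 1.3 points of their English score.
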
| Language | Method | Qwen2.5-Coder 7B Syntax | Qwen2.5-Coder 7B Functional | HaVen Syntax | HaVen Functional |
|---|---|---|---|---|---|
| Japanese | Zero-Shot | 83.33 | 33.97 | 91.67 | 35.90 |
| Japanese | CoT | 71.79 | 29.49 | 80.77 | 36.54 |
| Japanese | TtG | 83.97 | 35.90 | 93.59 | 36.54 |
| Japanese | HE-ICL (Ours) | 83.97 | 39.10 | 92.31 | 43.59 |
| Japanese | English (reference) | 83.33 | 40.38 | 92.31 | 42.95 |
| Mongolian | Zero-Shot | 82.50 | 32.69 | 91.67 | 32.69 |
| Mongolian | CoT | 65.38 | 26.28 | 74.36 | 26.92 |
| Mongolian | TtG | 80.77 | 20.51 | 91.03 | 26.92 |
| Mongolian | HE-ICL (Ours) | 84.62 | 39.74 | 92.95 | 42.31 |
| Mongolian | English (reference) | 83.33 | 40.38 | 92.31 | 42.95 |
| Hindi | Zero-Shot | 76.92 | 32.69 | 93.59 | 35.26 |
| Hindi | CoT | 69.87 | 30.13 | 84.62 | 30.13 |
| Hindi | TtG | 81.41 | 37.18 | 92.31 | 39.74 |
| Hindi | HE-ICL (Ours) | 83.97 | 41.03 | 92.95 | 42.95 |
| Hindi | English (reference) | 83.33 | 40.38 | 92.31 | 42.95 |
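
The comparison in the table above can be summarized per model and language as the best-scoring method, the gain of HE-ICL over zero-shot, and the distance that remains to the English reference. The sketch below uses the functional-correctness values from the table; the summary quantities themselves are illustrative choices.

```python
# Functional correctness (%) from the table above, keyed by (model, language).
FUNC = {
    ("Qwen2.5-Coder 7B", "JA"): {"Zero-Shot": 33.97, "CoT": 29.49, "TtG": 35.90, "HE-ICL": 39.10},
    ("Qwen2.5-Coder 7B", "MN"): {"Zero-Shot": 32.69, "CoT": 26.28, "TtG": 20.51, "HE-ICL": 39.74},
    ("Qwen2.5-Coder 7B", "HI"): {"Zero-Shot": 32.69, "CoT": 30.13, "TtG": 37.18, "HE-ICL": 41.03},
    ("HaVen", "JA"): {"Zero-Shot": 35.90, "CoT": 36.54, "TtG": 36.54, "HE-ICL": 43.59},
    ("HaVen", "MN"): {"Zero-Shot": 32.69, "CoT": 26.92, "TtG": 26.92, "HE-ICL": 42.31},
    ("HaVen", "HI"): {"Zero-Shot": 35.26, "CoT": 30.13, "TtG": 39.74, "HE-ICL": 42.95},
}
ENGLISH_REF = {"Qwen2.5-Coder 7B": 40.38, "HaVen": 42.95}


def summarize(model: str, lang: str):
    """Return (best method, HE-ICL gain over zero-shot, remaining gap to English)."""
    row = FUNC[(model, lang)]
    best = max(row, key=row.get)
    gain = round(row["HE-ICL"] - row["Zero-Shot"], 2)
    remaining = round(ENGLISH_REF[model] - row["HE-ICL"], 2)
    return best, gain, remaining


results = {key: summarize(*key) for key in FUNC}
for (model, lang), (best, gain, remaining) in results.items():
    print(f"{model} [{lang}]: best={best}, gain={gain:+.2f}, gap to English={remaining:+.2f}")
```

On these numbers HE-ICL is the best method in all six model-language pairs, with gains between 5.13 and 9.62 points over zero-shot, whereas CoT and TtG fall below zero-shot in several cells, most visibly for Mongolian.
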
| Demonstration Strategy | Qwen2.5-Coder 7B JA | Qwen2.5-Coder 7B MN | Qwen2.5-Coder 7B HI | HaVen JA | HaVen MN | HaVen HI |
|---|---|---|---|---|---|---|
| No Demonstrations | 33.97 | 32.69 | 32.69 | 35.90 | 32.69 | 35.26 |
| Random Demonstrations | 35.26 | 34.62 | 35.26 | 37.82 | 35.90 | 37.18 |
| Hard-Example Demonstrations | 39.10 | 39.74 | 41.03 | 43.59 | 42.31 | 42.95 |
| Model | Zero-shot ref. | k = 1 | k = 2 | k = 3 | k = 5 | k = 8 |
|---|---|---|---|---|---|---|
| Qwen2.5-Coder 7B | 33.12 | 35.26 | 37.18 | 39.96 | 38.46 | 35.90 |
| HaVen | 34.62 | 36.54 | 39.74 | 42.95 | 41.67 | 38.46 |
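
The sensitivity results above can be inspected with a short script that finds the best demonstration count and the gain over the zero-shot reference at every k. The values are copied from the table; the data layout and function name are illustrative.

```python
# Scores (%) for each demonstration count k, from the table above.
SCORES = {
    "Qwen2.5-Coder 7B": {1: 35.26, 2: 37.18, 3: 39.96, 5: 38.46, 8: 35.90},
    "HaVen": {1: 36.54, 2: 39.74, 3: 42.95, 5: 41.67, 8: 38.46},
}
ZERO_SHOT = {"Qwen2.5-Coder 7B": 33.12, "HaVen": 34.62}


def analyze(model: str):
    """Return (best k, gains over zero-shot per k) for one model."""
    by_k = SCORES[model]
    best_k = max(by_k, key=by_k.get)
    gains = {k: round(score - ZERO_SHOT[model], 2) for k, score in by_k.items()}
    return best_k, gains


analysis = {model: analyze(model) for model in SCORES}
for model, (best_k, gains) in analysis.items():
    print(f"{model}: best k = {best_k}, gains over zero-shot = {gains}")
```

Both models peak at k = 3 and decline for larger k, yet every tested k remains above the zero-shot reference.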
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yang, G. Understanding and Mitigating Multilingual Bias in LLM-Driven Verilog Code Generation via Hard-Example In-Context Learning. Electronics 2026, 15, 2275. https://doi.org/10.3390/electronics15112275

