Understanding and Mitigating Multilingual Bias in LLM-Driven Verilog Code Generation via Hard-Example In-Context Learning

Yang, Guang

doi:10.3390/electronics15112275

by Guang Yang ¹,²

¹ Institute of Artificial Intelligence and Educational Big Data, Nantong Normal College, Nantong 226000, China

² The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou 310052, China

Electronics 2026, 15(11), 2275; https://doi.org/10.3390/electronics15112275

Submission received: 12 April 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 25 May 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Large language models (LLMs) are increasingly adopted for Verilog code generation, yet existing benchmarks assume English-only prompts, overlooking the linguistic diversity of the global FPGA engineering community. We introduce Multi-VerilogEval, the first multilingual Verilog benchmark, built from 156 unique underlying tasks instantiated in four languages (English, Japanese, Hindi, and Mongolian), yielding 624 language-specific test cases. Our evaluation of four representative LLMs reveals a silent failure pattern: syntactic correctness remains high (∼90%) across languages, but functional correctness degrades by up to 23.9% for non-English prompts in open-source and domain-specific models, while commercial models remain near-parity. Hidden-state analysis suggests that multilingual bias is associated with persistent cross-lingual representation divergence throughout the network, which becomes most pronounced in the final layers that directly drive token generation. As fine-tuning and common prompt-based mitigations remain impractical or unreliable for multilingual RTL, we propose HE-ICL (Hard-Example In-Context Learning), a train-free method that constructs few-shot hard-example demonstrations from cross-lingually difficult cases. HE-ICL closes 80–100% of the multilingual gap without any parameter updates, achieving near-parity with or exceeding the English reference level across all evaluated HE-ICL settings.

Keywords:

Large Language Models; Verilog code generation; multilingual bias; in-context learning; hardware design automation

1. Introduction

The integration of Large Language Models (LLMs) into hardware design automation has catalyzed a paradigm shift in Register-Transfer Level (RTL) development [1]. Recent advancements demonstrate that LLMs can significantly accelerate Verilog code generation, from simple combinational logic to complex application-driven modules in domains such as cryptography and machine learning [2,3,4]. Domain-specific models such as VeriGen [5] and HaVen [6], together with simulation-based benchmarks like VerilogEval [7] and RTLLM [8], have further advanced both the capability and rigor of LLM-driven hardware generation.

However, existing evaluation frameworks are predominantly English-centric. Current benchmarks assume that hardware specifications are formulated exclusively in English, overlooking the linguistic diversity of the global semiconductor industry: according to the Wilson Research Group, nearly 60% of FPGA and ASIC design engineers reside in non-English-speaking regions, including major hubs in East Asia and South Asia [9]. Whether LLMs can reliably interpret non-English hardware specifications remains an open and critical question.

To address this gap, we make three contributions:

1. Multi-VerilogEval, the first multilingual Verilog code generation benchmark, built from 156 unique underlying tasks, each instantiated in four languages (English, Japanese, Hindi, and Mongolian), yielding 624 language-specific test cases constructed via a multi-agent translation pipeline with human oversight.
2. A comprehensive empirical study evaluating four representative LLMs (two commercial, one open-source, one domain-specific) on Multi-VerilogEval, complemented by hidden-state analysis that probes where multilingual representations diverge inside biased models.
3. HE-ICL (Hard-Example In-Context Learning), a train-free inference-time method that constructs few-shot hard-example demonstrations to mitigate multilingual bias without any parameter updates.

Our evaluation reveals a pervasive “silent failure” pattern: syntactic correctness remains high across languages (∼90%), but functional correctness degrades by up to 23.9% for non-English prompts in open-source and domain-specific models ($p < 0.001$), while commercial models maintain near-parity ($p > 0.3$). Hidden-state analysis further suggests that cross-lingual representations remain partially aligned in intermediate layers but diverge more strongly toward the final layers that directly drive token generation.

Addressing this bias through fine-tuning is impractical: commercial models expose no training interface, and multilingual Verilog corpora do not yet exist. HE-ICL instead operates purely at inference time, closing 80–100% of the multilingual gap across all tested settings and achieving near-parity with or exceeding the English upper bound, without any parameter updates.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 provides background on LLM-based Verilog generation and multilingual bias. Section 4 details the construction of Multi-VerilogEval. Section 5 presents the empirical study and hidden-state analysis. Section 6 introduces HE-ICL and evaluates it through three research questions. Section 8 concludes the paper.

2. Related Work

2.1. LLM-Driven Verilog Generation

The application of Large Language Models to Verilog code generation has progressed rapidly. General-purpose models such as GPT-4 and CodeLlama have demonstrated the ability to produce simple hardware modules from natural-language specifications. Domain-specific approaches, including VeriGen [5], RTLCoder [10], and HaVen [6], further improve generation quality through fine-tuning on curated Verilog corpora. On the evaluation side, benchmarks such as VerilogEval [7] and RTLLM-v2 [8] have established simulation-based verification as the standard for assessing functional correctness. Despite this progress, existing work has overwhelmingly assumed English-only specifications, leaving the impact of prompt language on Verilog generation largely unexplored.

2.2. Multilingual Code Generation

Multilingual evaluation of LLM-based code generation has been explored primarily for general-purpose programming languages. Benchmarks such as MultiPL-E [11], HumanEval-XL [12], and mHumanEval [13] translate established coding tasks into dozens of natural languages to assess cross-lingual transfer. These studies consistently report a “multilingual tax”: non-English prompts lead to measurably lower code quality, even for high-resource languages [14].

To mitigate this gap, existing approaches predominantly rely on multilingual fine-tuning or parallel-corpus augmentation [15], both of which require substantial training resources and access to model weights.

2.3. Broader Deep Learning Paradigms

Beyond code-generation benchmarks, broader deep learning work underscores challenges relevant to multilingual RTL generation: structured input modeling, multi-source fusion, and robustness under distribution shift. Supervised multimodal super-resolution with reversible guidance and cyclical knowledge distillation [16], unsupervised graph-based transfer learning for fault diagnosis [17], and attention-enhanced perception systems in agricultural engineering [18] illustrate cross-modal alignment, cross-domain transfer, and robust AI deployment under scarce labels.

These paradigms motivate our train-free focus: when multilingual RTL corpora and model weights are unavailable, HE-ICL offers a lightweight inference-time alternative to training-heavy fine-tuning or transfer pipelines.

2.4. Positioning of This Work

Our work differs from prior studies in two key respects. First, we present the first systematic investigation of multilingual bias in the hardware domain. Unlike Python or Java, Verilog carries strict hardware semantics (clock-driven concurrency, module interface constraints) where specification misunderstanding may lead to silent failures: outputs that compile successfully yet do not satisfy the intended hardware behavior. This makes multilingual robustness a particularly important concern in hardware code generation. We address this gap by constructing Multi-VerilogEval, a four-language Verilog benchmark, and providing both quantitative bias measurements and representation-level analysis of where cross-lingual divergence persists inside biased models.

Second, while existing mitigation strategies for multilingual bias in code generation often focus on training-time interventions such as multilingual fine-tuning or representation alignment [19,20], these methods are less practical when model weights are inaccessible or when multilingual hardware corpora are unavailable. We therefore propose HE-ICL, a purely train-free inference-time method based on hard-example demonstrations. Because it does not require access to model weights, the method is in principle compatible with both open-source and commercial APIs, although our empirical validation in this paper focuses on Qwen2.5-Coder 7B and HaVen.

3. Background

3.1. LLM-Driven Verilog Code Generation

Unlike software languages such as Python, Verilog carries hardware semantics (e.g., clock-driven concurrency via always @(posedge clk)) and requires strict module interface constraints on port names, widths, and directions. Correctness is verified through simulation-based testing against a testbench rather than simple unit-test assertions [21]. These characteristics make Verilog generation particularly sensitive to the precision of the input specification, raising the question of whether non-English prompts can convey hardware requirements with sufficient fidelity.

3.2. Problem Formulation

Let $L$ denote the set of natural languages considered, and let $x_{i,\ell}$ be the specification of problem $i$ written in language $\ell \in L$. A code-generation model $G$ produces a candidate Verilog module $G(x_{i,\ell})$, which is evaluated against a language-independent testbench $tb_i$. The functional correctness for a given model and language is measured by the pass@1 metric [22]:

$$\mathrm{Perf}(G, \ell) = \text{pass@1}\left(\{(G(x_{i,\ell}), tb_i)\}_{i=1}^{N}\right), \tag{1}$$

where $N$ is the number of problems. Because all languages share the same underlying task semantics and testbenches, observed differences are primarily attributable to prompt language. However, tokenization and prompt-length differences across languages may also contribute, so our setup isolates linguistic variation as the dominant experimental factor rather than the only one.
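As a minimal sketch of Equation (1), assuming one generated module per problem, pass@1 reduces to the fraction of problems whose module passes its testbench. The function name `perf` and the boolean-outcome input are illustrative assumptions; in practice each outcome would come from compiling and simulating the generated module.

```python
from typing import Sequence


def perf(outcomes: Sequence[bool]) -> float:
    """pass@1 with a single sample per problem: share of problems that pass.

    outcomes[i] is True when the module generated for problem i passes tb_i.
    """
    if not outcomes:
        raise ValueError("need at least one problem outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for N = 4 problems prompted in one language.
example_outcomes = [True, False, True, True]
print(f"pass@1 = {perf(example_outcomes):.2f}")
```

Because the testbench is shared across languages, calling `perf` once per language on the same task list gives directly comparable values.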

4. Multi-VerilogEval Construction

4.1. Language Selection

Guided by the Wilson Research Group FPGA Functional Verification Trends report [9], we select four languages to balance industry relevance with resource diversity:

English (EN): the default language of hardware documentation and the reference language in prior benchmarks.
Japanese (JA): representative of a major East Asian electronics and semiconductor ecosystem.
Hindi (HI): representative of a rapidly growing South Asian hardware engineering ecosystem.
Mongolian (MN): included as a deliberately low-resource language to stress-test multilingual robustness beyond high-resource settings.

4.2. Seed Dataset

We adopt VerilogEval [7] as the seed dataset. VerilogEval is derived from HDLBits, a widely used online platform for learning Verilog through interactive exercises. It contains 156 Verilog design tasks spanning a broad range of difficulty levels, from basic combinational logic (e.g., multiplexers, encoders) to sequential circuits (e.g., finite state machines, counters) and arithmetic modules. Each task is paired with an English natural-language specification describing the desired functionality and a simulation-based testbench for automated functional verification. VerilogEval has been widely adopted as a standard benchmark for evaluating LLM-based Verilog code generation [5,6], making it a natural foundation for our multilingual extension.

4.3. Translation Pipeline

Each English specification is translated into the three non-English target languages through a hybrid pipeline combining automated multi-agent collaboration [23] with human oversight. All agents are powered by GPT-5.4 to ensure high translation quality across diverse target languages.

(1): Multi-Agent Automated Translation.

The automated stage employs four LLM-based agents operating in sequence:

1. Translator Agent: receives the English specification and produces an initial target-language translation, instructed to act as a professional technical translator for hardware design while preserving all Verilog keywords and structural elements.
2. Native Evaluator Agent: role-plays a native speaker of the target language and scores the translation for fluency, naturalness, and technical terminology accuracy on a scale of 0 to 10, providing detailed feedback and revision suggestions.
3. Back-Translator Agent: renders the translation back into English independently, preserving all Verilog keywords and identifiers, without access to the original specification.
4. Judge Agent: compares the back-translation against the original specification to compute a semantic equivalence score in $[0, 1]$, focusing on functional requirements, timing constraints, interface definitions, and implementation details.

The four agents iterate for up to three rounds. A translation is approved when the semantic equivalence score is ≥0.95 and the native evaluation score is ≥7/10. If not approved, the Native Evaluator’s feedback and the Judge’s reasoning are concatenated and fed back to the Translator Agent for refinement in the next iteration.
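The approval rule can be sketched as a small control loop. This is an illustration under stated assumptions, not the authors' implementation: the four agents are passed in as hypothetical callables (`translate`, `evaluate`, `back_translate`, `judge`), and only the thresholds, the three-round limit, and the feedback concatenation come from the text.

```python
from typing import Callable, Optional, Tuple

SEMANTIC_THRESHOLD = 0.95  # Judge Agent score in [0, 1]
NATIVE_THRESHOLD = 7.0     # Native Evaluator score out of 10
MAX_ROUNDS = 3


def translate_with_review(
    spec_en: str,
    translate: Callable[[str, str], str],            # (spec, feedback) -> translation
    evaluate: Callable[[str], Tuple[float, str]],    # translation -> (score, feedback)
    back_translate: Callable[[str], str],            # translation -> English text
    judge: Callable[[str, str], Tuple[float, str]],  # (original, back) -> (score, reasoning)
) -> Optional[str]:
    """Return an approved translation, or None to signal human translation."""
    feedback = ""
    for _ in range(MAX_ROUNDS):
        translation = translate(spec_en, feedback)
        native_score, native_notes = evaluate(translation)
        # The back-translator sees only the translation, never the original.
        back = back_translate(translation)
        semantic_score, judge_notes = judge(spec_en, back)
        if semantic_score >= SEMANTIC_THRESHOLD and native_score >= NATIVE_THRESHOLD:
            return translation
        # Both critiques are concatenated and fed to the next round.
        feedback = native_notes + "\n" + judge_notes
    return None
```

Returning `None` after the third round corresponds to handing the specification to a human annotator.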

(2): Human Intervention.

If a specification remains unapproved after three automated rounds, a bilingual annotator with hardware design expertise manually translates it from scratch [24], ensuring verified quality for every entry.

Table 1 summarizes the per-language approval statistics. The automated pipeline achieves an overall approval rate of 92.3% (432/468 tasks), with Japanese exhibiting the highest first-round approval rate (82.1%) owing to GPT-5.4’s strong Japanese proficiency, while Mongolian requires the most iterations and the highest proportion of human intervention (13.5%), consistent with its status as a low-resource language. In total, 36 specifications (7.7%) required manual translation.

Two translation principles are enforced throughout:

Structural Preservation: Verilog keywords (module, always, assign, etc.), module/port/signal names, and interface declarations remain in English.
Format Consistency: numerical literals (e.g., 4’b1010), truth tables, Karnaugh maps, and timing diagrams are preserved verbatim.

4.4. Dataset Statistics

The final Multi-VerilogEval contains 156 unique underlying tasks. Each task is instantiated in four languages, yielding 624 language-specific test cases in total.

Figure 1 presents the token length distribution across the four languages. English prompts are the most compact, with a median of approximately 170 tokens and the interquartile range (IQR) concentrated below 300 tokens. In contrast, Hindi and Mongolian prompts exhibit substantially higher token counts, with medians around 440 and 370 tokens respectively, and long-tailed distributions extending beyond 3000 tokens. Japanese prompts fall in between, with a median near 280 tokens. These differences are important for interpretation: although task semantics and verification are matched across languages, tokenization and prompt length necessarily vary by language.

Figure 2 illustrates a representative task: implementing a combinational logic function specified by a Karnaugh map. The English prompt (a) serves as the source specification, while (b), (c), and (d) show the Hindi, Mongolian, and Japanese translations, respectively. Across all four versions, structural elements such as the module interface declaration, port names (x, f), bit-width annotations, and the Karnaugh map itself are preserved verbatim, whereas only the natural-language instructions are translated. This design ensures that any observed performance gap can be attributed primarily to how the model handles the natural-language context, including the tokenization and prompt-length differences noted above, rather than to differences in technical content.

5. Empirical Study

In this section, we empirically study the multilingual bias in Verilog code generation.

5.1. Evaluated Models

We evaluate four LLMs spanning three categories.

Commercial models: GPT-5.4 [25] and Claude Opus-4.6 [26] (hereafter Opus-4.6), representing strong generally available systems and serving as reference points for multilingual Verilog generation.

Open-source code models: Qwen2.5-Coder 7B [27], allowing us to examine whether multilingual bias persists in openly available systems of comparable scale.

Domain-specific models: HaVen [6], fine-tuned for hardware or Verilog generation, enabling us to assess whether domain specialization improves or weakens multilingual robustness.

5.2. Evaluation Metrics

We adopt pass@1 as the primary metric, since practical hardware design workflows typically require a correct solution on the first attempt. Each generated Verilog module is evaluated for (i) syntactic correctness, i.e., whether it compiles without errors, and (ii) functional correctness, i.e., whether it passes the simulation-based testbench. A sample is counted as successful only when both criteria are satisfied.

To quantify multilingual bias, we define the Robustness Ratio with English as the reference language. Let $\ell$ denote a non-English target language:

$$M_R(\ell) = \frac{\text{pass@1}(\ell)}{\text{pass@1}(\mathrm{EN})},$$

where $M_R = 1$ indicates parity with English and lower values indicate greater degradation. We also report the relative drop $\Delta_{rel}(\ell) = 1 - M_R(\ell)$ when absolute percentages aid interpretation.
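A short worked example of the two quantities, using the HaVen figures reported later in the paper (English pass@1 of 42.95%, zero-shot Mongolian pass@1 of 32.69%). The function names are illustrative assumptions.

```python
def robustness_ratio(pass_target: float, pass_en: float) -> float:
    """M_R: target-language pass@1 divided by English pass@1."""
    if pass_en <= 0:
        raise ValueError("English pass@1 must be positive")
    return pass_target / pass_en


def relative_drop(pass_target: float, pass_en: float) -> float:
    """Delta_rel = 1 - M_R."""
    return 1.0 - robustness_ratio(pass_target, pass_en)


# HaVen, Mongolian vs. English (percent values from the paper's results).
m_r = robustness_ratio(32.69, 42.95)
drop = relative_drop(32.69, 42.95)
print(f"M_R = {m_r:.2f}, relative drop = {100 * drop:.1f}%")
# prints "M_R = 0.76, relative drop = 23.9%"
```

This reproduces the 23.9% worst-case degradation quoted for HaVen on Mongolian.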

5.3. Prompt Template

To make prompt language the primary controlled experimental factor, we adopt a unified prompt template across all experiments. The system instruction is fixed in English for every model, while only the task specification is translated into the target language. Verilog keywords and module interface definitions are always preserved in English.

This design keeps task semantics, verification, and prompt structure fixed across languages, while acknowledging that tokenization and prompt length necessarily vary by language:

Please write a Verilog module that solves the following problem efficiently, using the exact module header below:

Problem: {problem.prompt}

Module header (must not be changed): {problem.module_header}

Return only the Verilog code, without any explanation.
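A minimal sketch of how the template could be filled per task. The `Problem` dataclass and the example task are hypothetical; only the template wording and the two placeholders come from the text above.

```python
from dataclasses import dataclass

PROMPT_TEMPLATE = (
    "Please write a Verilog module that solves the following problem "
    "efficiently, using the exact module header below:\n\n"
    "Problem: {problem.prompt}\n\n"
    "Module header (must not be changed): {problem.module_header}\n\n"
    "Return only the Verilog code, without any explanation."
)


@dataclass(frozen=True)
class Problem:
    prompt: str         # task specification, in the target language
    module_header: str  # always kept in English


def build_prompt(problem: Problem) -> str:
    """Fill the fixed template; only problem.prompt varies by language."""
    # str.format parses braces in the template only, so Verilog
    # concatenations such as {a, b} inside the fields are left untouched.
    return PROMPT_TEMPLATE.format(problem=problem)


# Hypothetical task used purely for illustration.
task = Problem(
    prompt="Implement a 2-to-1 multiplexer.",
    module_header="module top_module(input a, input b, input sel, output out);",
)
print(build_prompt(task))
```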

5.4. Implementation Details

5.4.1. Hardware and Software Environment

All open-source model experiments and hidden-state extraction were conducted on a single Linux workstation with an Intel Xeon Gold 6248R CPU (24 cores, 3.00 GHz), 256 GB RAM, and one NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). The software stack was Ubuntu 22.04.4 LTS, Python 3.10.14, PyTorch 2.2.2, Transformers 4.44.2, and CUDA 12.1; local inference for Qwen2.5-Coder 7B and HaVen used Hugging Face transformers with model.generate (bfloat16, device map on the single GPU). Verilog compilation and simulation used Icarus Verilog 12.0 (v12_0) with the same testbench scripts for all models and languages.

5.4.2. Model Identifiers and Inference Settings

We evaluate four models. The commercial API models GPT-5.4 (OpenAI) and Claude Opus-4.6 (Anthropic) were accessed via the providers’ official APIs between March and May 2026. For both APIs we used the default inference parameters provided by each platform, including default temperature and sampling settings, while keeping the same fixed English system instruction and the unified user prompt template described in Section 5.3. Only the natural-language task specification varies by language.

The open-source local models Qwen2.5-Coder 7B and HaVen were loaded from Hugging Face and run locally on the workstation described above with bfloat16 weights. For fair comparison and to fit within GPU memory, we cap the input context to 8192 tokens and set the maximum generation length to 4096 tokens for both models. Decoding is greedy with temperature $T = 0$ and do_sample = False.

5.5. Empirical Results

Table 2 presents the syntax and functional pass@1 rates for each model across all four languages. The “Avg.” row reports the unweighted arithmetic mean over the three non-English languages, treating each language as an equally important evaluation axis regardless of its speaker population, thereby reflecting each model’s overall cross-lingual capability.

(1): Syntax vs. Functional Correctness.

A striking pattern emerges: syntactic correctness remains consistently high across all models and languages (76.92–98.08%), yet functional correctness varies dramatically. This gap reveals a “silent failure” mode where generated code compiles successfully but fails to meet the specification, a particularly dangerous outcome in hardware design where undetected logic errors can propagate to silicon.

(2): Commercial Models vs. Open-Source and Domain-Specific Models.

To assess statistical significance, we first compare English and each non-English language separately using the one-sided exact McNemar test on paired task outcomes, applying Holm–Bonferroni correction across the three language comparisons for each model. To provide a compact model-level summary of cross-lingual degradation, we additionally report Fisher’s combined probability statistic over the three language-specific tests:

$$\chi^2 = -2 \sum_{i=1}^{k} \ln p_i \sim \chi^2(2k),$$

where $k = 3$ is the number of non-English languages. We interpret the Holm-corrected per-language tests as the primary inferential results and the Fisher statistic as an overall summary. Figure 3 visualizes both $M_R$ and the combined p-values.
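The testing procedure can be sketched in pure Python. This is an illustration, not the authors' analysis script: the discordant counts are hypothetical, the function names are assumptions, and feeding the uncorrected per-language p-values into Fisher's statistic is also an assumption. The chi-square tail uses the closed form available for even degrees of freedom.

```python
from math import comb, exp, log
from typing import List, Tuple


def mcnemar_exact_one_sided(b: int, c: int) -> float:
    """P(X >= b) for X ~ Binomial(b + c, 0.5).

    b: tasks that pass in English but fail in the target language.
    c: tasks that fail in English but pass in the target language.
    """
    n = b + c
    if n == 0:
        return 1.0
    return sum(comb(n, j) for j in range(b, n + 1)) / 2 ** n


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


def fisher_combined(p_values: List[float]) -> Tuple[float, float]:
    """Return (chi2 statistic, combined p) for k independent p-values."""
    k = len(p_values)
    stat = -2.0 * sum(log(p) for p in p_values)
    half = stat / 2.0
    # Survival function of chi2 with 2k dof: exp(-x/2) * sum_{j<k} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, exp(-half) * total


# Hypothetical discordant counts (b, c) per language; not from the paper.
discordant = {"JA": (14, 4), "HI": (15, 3), "MN": (17, 2)}
raw = [mcnemar_exact_one_sided(b, c) for b, c in discordant.values()]
print(dict(zip(discordant, holm_adjust(raw))))
print(fisher_combined(raw))
```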

Commercial models show no clear evidence of systematic multilingual degradation. GPT-5.4 and Opus-4.6 maintain strong cross-lingual robustness, with $M_R$ of 0.98–1.03 and 0.97–0.99, respectively. Fisher’s combined test yields non-significant aggregate evidence of degradation (p = 0.854 for GPT-5.4; p = 0.303 for Opus-4.6), consistent with near-parity performance across languages.

Open-source and domain-specific models exhibit strong multilingual degradation. In contrast, Qwen2.5-Coder 7B ($M_R$ = 0.81–0.84, $\Delta_{rel}$ = 15.9–19.0%) and HaVen ($M_R$ = 0.76–0.84, worst on Mongolian at $\Delta_{rel} = 23.9\%$) both show substantial degradation, with Fisher’s combined p < 0.001 ***. Notably, HaVen’s domain-specific fine-tuning on English-only Verilog data does not improve cross-lingual robustness; its $M_R$ values are comparable to or worse than those of the general-purpose Qwen2.5-Coder.

(3): Hidden-State Analysis of Multilingual Representations.

To understand where multilingual bias arises inside the model, we probe the internal representations of Qwen2.5-Coder 7B and HaVen, the two models exhibiting statistically significant bias. For each task, we extract the hidden state at the last token of the task specification across selected layers, then analyze cross-lingual alignment via PCA visualization and pairwise cosine similarity.

Language-specific clusters form early and persist. Figure 4 and Figure 5 show the PCA projections of hidden states at nine layers spanning 20–100% of model depth. For both models, the four languages form clearly separable clusters as early as Layer 6, indicating that the models encode language identity in lower layers. As depth increases, the clusters shift in position but do not fully converge: even at the final layers (Layer 28 for Qwen2.5-Coder, Layer 32 for HaVen), representations of semantically identical tasks remain separated by prompt language. This pattern suggests that multilingual divergence is not created only at the final layer; rather, it appears early and remains unresolved through depth, becoming especially consequential near generation.

Cosine similarity degrades sharply in deeper layers. Figure 6 quantifies cross-lingual alignment by computing the mean cosine similarity between hidden states of the same task across language pairs. In both models, cosine similarity is moderately high in the middle layers (∼0.85–0.90 for Qwen2.5-Coder, ∼0.75–0.80 for HaVen), suggesting partial convergence of multilingual representations. However, a sharp drop occurs in the final layers: for Qwen2.5-Coder, similarity falls to 0.42–0.65 at Layer 28; for HaVen, it drops to 0.35–0.75 at Layer 32. This degradation is most pronounced for language pairs involving Hindi and Mongolian, consistent with the larger $M_R$ drops observed for these languages. Overall, the evidence suggests that intermediate layers retain partial cross-lingual alignment, but this alignment weakens substantially near the layers that directly shape token generation.
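The alignment measure can be sketched as follows, assuming the last-token hidden states for one layer have already been extracted and stored per task. The function names and the dictionary layout (task id to vector) are illustrative assumptions; the extraction step itself is omitted.

```python
from math import sqrt
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / norm


def mean_cross_lingual_similarity(
    states_a: Dict[str, Sequence[float]],
    states_b: Dict[str, Sequence[float]],
) -> float:
    """Mean cosine similarity over tasks present in both languages.

    states_x maps a task id to that task's hidden state at one layer.
    """
    shared = sorted(set(states_a) & set(states_b))
    if not shared:
        raise ValueError("no shared tasks between the two languages")
    return sum(cosine(states_a[t], states_b[t]) for t in shared) / len(shared)


# Toy 2-dimensional states for two tasks (illustrative values only).
en = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
mn = {"t1": [1.0, 0.0], "t2": [1.0, 0.0]}
print(mean_cross_lingual_similarity(en, mn))
```

Repeating the call for each layer and language pair yields a layer-wise curve of the kind the analysis describes.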

6. HE-ICL: Hard-Example In-Context Learning

6.1. Motivation and Approach

The hidden-state analysis in Section 5 reveals a key insight: cross-lingual cosine similarity remains moderately high in intermediate layers before dropping sharply at the final generation layers. This suggests that the models retain partial multilingual semantic alignment in intermediate representations, but this alignment is not stably preserved as representations approach the output distribution. Rather than proving that multilingual understanding is fully intact, our analysis indicates that a key failure mode appears when partially aligned internal representations are converted into final code tokens.

This observation motivates a train-free mitigation strategy. Retraining or fine-tuning is impractical for most practitioners: commercial models expose no training interface, and fine-tuning open-source models on multilingual Verilog data requires costly parallel corpora that do not yet exist. We therefore seek a purely inference-time solution based on in-context learning (ICL).

We propose HE-ICL (Hard-Example In-Context Learning), a method that constructs few-shot hard-example demonstrations from cross-lingually difficult cases to bridge the gap between multilingual prompting and correct code generation. Figure 7 illustrates the module-level HE-ICL pipeline, including inputs, outputs, and information flow across the three stages.

6.1.1. Stage 1: Hard Example Mining

For a given model and target language ℓ, we identify tasks where the model generates functionally correct code from the English prompt but fails on the target-language prompt under zero-shot inference. Formally, we define the hard example set:

$$H(\ell) = \{\, t_i \mid \text{pass}(t_i, \mathrm{EN}) = 1 \land \text{pass}(t_i, \ell) = 0 \,\},$$

where $\text{pass}(t_i, \ell)$ denotes whether task $t_i$ passes functional verification when prompted in language ℓ. These are the tasks for which the model succeeds under the English specification but fails under the target-language specification, making them natural candidates for cross-lingual corrective demonstrations.

To avoid data contamination between demonstrations and test instances, the hard examples are mined from RTLLM-v2 [8], an independent Verilog benchmark with no task overlap with Multi-VerilogEval. This separation ensures that no test-set information leaks into the demonstrations. The target-language specifications for RTLLM-v2 are produced with the same multi-agent translation pipeline and quality checks described in Section 4, so demonstration translations are comparable to Multi-VerilogEval in methodology.
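The mining step amounts to a set filter over paired zero-shot outcomes on the demonstration pool. A minimal sketch, assuming the outcomes are stored as dictionaries from task id to pass/fail; the function name and the task ids are hypothetical.

```python
from typing import Dict, List


def mine_hard_examples(
    pass_en: Dict[str, bool],
    pass_target: Dict[str, bool],
) -> List[str]:
    """Task ids with pass(t, EN) = 1 and pass(t, target) = 0.

    Only tasks evaluated in both languages are considered.
    """
    return sorted(
        task
        for task in pass_en
        if task in pass_target and pass_en[task] and not pass_target[task]
    )


# Hypothetical zero-shot outcomes on a demonstration pool.
english = {"adder": True, "fsm": True, "fifo": False, "alu": True}
mongolian = {"adder": True, "fsm": False, "fifo": False, "alu": False}
print(mine_hard_examples(english, mongolian))
# prints "['alu', 'fsm']"
```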

6.1.2. Stage 2: Hard-Example Demonstration Construction

From $H(\ell)$, we randomly sample $k = 3$ hard examples and organize each hard-example demonstration as a triplet in the following order:

1. The task specification in the target language ℓ (showing the model what a non-English prompt looks like);
2. The corresponding English specification (providing the semantic anchor);
3. The correct Verilog code generated from the English prompt (providing the reference output).

This ordering is deliberate: by presenting the target-language prompt first, followed by its English equivalent and the correct code, we encourage the model to associate non-English specifications with an English-aligned problem formulation that leads to correct code generation. The choice of $k = 3$ balances informativeness against context-length overhead; we analyze sensitivity to k in Section 6.5.

6.1.3. Stage 3: ICL Inference

The k hard-example demonstrations are prepended to the current test query, which is presented solely in the target language ℓ. Algorithm 1 summarizes the full HE-ICL pipeline.

Algorithm 1 HE-ICL Inference Pipeline

Require: Model $G$, target language $\ell$, test task $t_{test}$, demonstration count $k$, hard example set $H(\ell)$
Ensure: Generated Verilog module $\hat{v}$
1: Sample $k$ examples $\{t_1, \dots, t_k\}$ from $H(\ell)$
2: $prompt \leftarrow$ ""
3: for $i = 1$ to $k$ do
4:     $prompt \leftarrow prompt \,\|\, spec(t_i, \ell)$ {Target-language spec}
5:     $prompt \leftarrow prompt \,\|\, spec(t_i, \mathrm{EN})$ {English spec}
6:     $prompt \leftarrow prompt \,\|\, code(t_i, \mathrm{EN})$ {Correct Verilog}
7: end for
8: $prompt \leftarrow prompt \,\|\, spec(t_{test}, \ell)$ {Test task}
9: $\hat{v} \leftarrow G(prompt)$
10: return $\hat{v}$

The model is expected to infer a cross-lingual mapping pattern from the hard-example demonstrations and generate correct Verilog code for the current task without requiring an explicit test-time translation step.
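The prompt-assembly part of Algorithm 1 can be sketched in Python. This is a sketch under stated assumptions: the `HardExample` container, the function name, the blank-line separator between segments, and the optional `seed` are illustrative choices, and the final call to the model $G$ is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HardExample:
    spec_target: str  # specification in the target language
    spec_en: str      # corresponding English specification
    code_en: str      # correct Verilog generated from the English prompt


def build_he_icl_prompt(
    hard_examples: List[HardExample],
    test_spec_target: str,
    k: int = 3,
    seed: Optional[int] = None,
) -> str:
    """Prepend k sampled triplets to a test query given in the target language."""
    if len(hard_examples) < k:
        raise ValueError("not enough hard examples to sample k demonstrations")
    rng = random.Random(seed)
    demos = rng.sample(hard_examples, k)  # sampling without replacement
    parts: List[str] = []
    for demo in demos:
        # Order follows the triplet: target spec, English spec, correct code.
        parts.extend([demo.spec_target, demo.spec_en, demo.code_en])
    parts.append(test_spec_target)  # the test task appears in the target language only
    return "\n\n".join(parts)
```

The returned string would then be passed to the model in place of the zero-shot prompt; no translation of the test task is performed.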

6.2. Research Questions

We design three research questions to evaluate HE-ICL:

RQ1 (Effectiveness): How does HE-ICL compare against other train-free baselines for mitigating multilingual bias?
RQ2 (Ablation): How does the quality of demonstrations affect performance? Specifically, what is the contribution of hard-example mining versus random or no demonstrations?
RQ3 (Sensitivity): How sensitive is HE-ICL to the number of demonstrations k?

We focus on the two models with statistically significant multilingual bias (Qwen2.5-Coder 7B and HaVen), as commercial models already achieve near-parity and leave little room for improvement.

6.3. RQ1: Comparison with Baselines

We compare HE-ICL against two train-free baselines:

CoT (Chain-of-Thought): The model is instructed to reason step-by-step about the non-English specification before generating Verilog code.
TtG (Translate-then-Generate): The non-English prompt is first translated to English using the Google Translate API, and the translated English prompt is then fed to the model for code generation.

We additionally include the Zero-Shot baseline (direct generation from non-English prompts) and the English upper bound (generation from the original English prompts) for reference.

Table 3 presents the results. Several observations emerge:

CoT degrades performance rather than helping. Across nearly all settings, CoT leads to lower functional pass@1 than the zero-shot baseline. For Mongolian, the drop is particularly severe: Qwen2.5-Coder falls from 32.69% to 26.28%, and HaVen from 32.69% to 26.92%. CoT also substantially reduces syntactic correctness (e.g., 65.38% for Qwen2.5-Coder on Mongolian), suggesting that step-by-step reasoning in a non-English context introduces additional noise and confuses the model’s code generation process.

Translate-then-Generate yields mixed results. TtG improves performance on Hindi (e.g., Qwen2.5-Coder: 32.69% → 37.18%), where Google Translate provides relatively high-quality translations. However, it fails catastrophically on Mongolian (Qwen2.5-Coder: 32.69% → 20.51%), likely because machine translation quality for low-resource languages introduces semantic distortions that further mislead the model. This highlights a fundamental limitation of translation-based approaches: their effectiveness is bottlenecked by the quality of the translation system for the target language.

HE-ICL achieves near-parity with the English upper bound. Across all six (model, language) settings, HE-ICL substantially closes the multilingual gap. For Qwen2.5-Coder 7B, functional pass@1 improves by 5.13–8.34 percentage points (pp) over zero-shot, reaching 39.10–41.03% compared to the English baseline of 40.38%. For HaVen, the gains are even more pronounced (+7.69–9.62 pp), with HE-ICL matching or exceeding the English upper bound on Japanese (43.59% vs. 42.95%) and Hindi (42.95% vs. 42.95%). On Mongolian, the lowest-resource language, HE-ICL recovers nearly the entire zero-shot gap for both models (Qwen2.5-Coder: 32.69% → 39.74%; HaVen: 32.69% → 42.31%). The HaVen gain on Mongolian (+9.62 pp) is the largest in Table 3, whereas for Qwen2.5-Coder the largest gain is on Hindi (+8.34 pp, against +7.05 pp on Mongolian). These results indicate that hard-example demonstrations effectively bridge the cross-lingual generation bottleneck identified in our hidden-state analysis, without any model retraining.
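The gains and the share of the multilingual gap that HE-ICL closes can be recomputed from the functional pass@1 values in Table 3. The short sketch below does this arithmetic; the function name `gap_closed` is introduced here for illustration only. Values above 100% mean that HE-ICL exceeds the English reference.

```python
# Functional pass@1 (%) copied from Table 3.
ENGLISH = {"Qwen2.5-Coder 7B": 40.38, "HaVen": 42.95}

ZERO_SHOT = {
    ("Qwen2.5-Coder 7B", "Japanese"): 33.97,
    ("Qwen2.5-Coder 7B", "Mongolian"): 32.69,
    ("Qwen2.5-Coder 7B", "Hindi"): 32.69,
    ("HaVen", "Japanese"): 35.90,
    ("HaVen", "Mongolian"): 32.69,
    ("HaVen", "Hindi"): 35.26,
}

HE_ICL = {
    ("Qwen2.5-Coder 7B", "Japanese"): 39.10,
    ("Qwen2.5-Coder 7B", "Mongolian"): 39.74,
    ("Qwen2.5-Coder 7B", "Hindi"): 41.03,
    ("HaVen", "Japanese"): 43.59,
    ("HaVen", "Mongolian"): 42.31,
    ("HaVen", "Hindi"): 42.95,
}


def gap_closed(model: str, language: str) -> tuple:
    """Return (gain in pp, percentage of the English gap that is closed)."""
    key = (model, language)
    gap = ENGLISH[model] - ZERO_SHOT[key]   # zero-shot shortfall vs. English
    gain = HE_ICL[key] - ZERO_SHOT[key]     # improvement due to HE-ICL
    return gain, 100.0 * gain / gap


if __name__ == "__main__":
    for model, language in ZERO_SHOT:
        gain, closed = gap_closed(model, language)
        print(f"{model:18s} {language:10s} +{gain:.2f} pp  {closed:.1f}% of gap")
```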

Failure Case Analysis. Figure 2 illustrates a combinational task in Multi-VerilogEval: the Karnaugh map and module interface are fixed across languages, and only the natural-language instructions are translated. In our evaluation logs for Qwen2.5-Coder 7B, we find tasks for which generation from the English prompt passes the shared testbench, but the same underlying task fails under a Mongolian prompt for zero-shot inference, CoT, TtG, and HE-ICL with $k = 3$. In these instances the model often still returns compilable Verilog (syntax pass) yet produces incorrect logic (functional fail), matching the silent-failure pattern in Section 5. HE-ICL raises Mongolian pass@1 on average (Table 3) but does not correct every task: demonstrations mined from RTLLM-v2 target other failure modes and need not cover a given Karnaugh-map or timing interpretation error on the test instance. The example therefore shows both that baseline mitigations can fail and that HE-ICL is not guaranteed to succeed task by task.

6.4. RQ2: Ablation Study

To isolate the contribution of hard-example mining, we compare three in-context strategies with $k = 3$:

No Demonstrations: zero-shot inference without any in-context demonstrations.
Random Demonstrations: $k$ demonstrations are randomly sampled from RTLLM-v2, regardless of whether the model succeeds or fails on them.
Hard-Example Demonstrations (HE-ICL): $k$ demonstrations are selected from $H(\ell)$, i.e., tasks where the model succeeds in English but fails in the target language.

Table 4 presents the results. Random demonstrations yield modest improvements over zero-shot (1.3–3.2 pp), indicating that the ICL format itself provides a weak cross-lingual signal. Hard-example demonstrations yield gains roughly three to four times larger (5.1–9.6 pp over zero-shot) and outperform random demonstrations in every setting, by 3.84–6.41 pp. The gap between Random Demonstrations and Hard-Example Demonstrations is largest on Mongolian for HaVen (+6.41 pp) and on Hindi for Qwen2.5-Coder (+5.77 pp, against +5.12 pp on Mongolian). This indicates that the specificity of hard examples, rather than the mere presence of few-shot examples, is the primary driver of HE-ICL’s effectiveness.
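The three ablation arms differ only in the pool from which demonstrations are drawn. A minimal sketch of that selection logic follows, assuming per-task pass/fail outcomes on the mining corpus are available as dictionaries; the function names and the toy task identifiers are hypothetical and are not taken from the study's code.

```python
import random
from typing import Dict, List


def mine_hard_examples(
    pass_en: Dict[str, bool], pass_target: Dict[str, bool]
) -> List[str]:
    """Tasks solved from the English prompt but failed in the target language."""
    return sorted(t for t in pass_en if pass_en[t] and not pass_target[t])


def select_demonstrations(
    strategy: str,
    task_ids: List[str],
    hard_ids: List[str],
    k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Pick up to k demonstration task ids for one ablation arm."""
    rng = random.Random(seed)
    if strategy == "none":        # No Demonstrations (zero-shot)
        return []
    if strategy == "random":      # Random Demonstrations: whole corpus
        pool = task_ids
    elif strategy == "hard":      # Hard-Example Demonstrations: H(l) only
        pool = hard_ids
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    # Toy outcomes for four hypothetical corpus tasks.
    pass_en = {"adder": True, "fsm": True, "mux": False, "fifo": True}
    pass_mn = {"adder": False, "fsm": True, "mux": False, "fifo": False}
    hard = mine_hard_examples(pass_en, pass_mn)
    print(hard, select_demonstrations("hard", list(pass_en), hard))
```

Keeping the sampler identical across arms means that any difference in pass@1 can be traced to the pool rather than to the number or format of the demonstrations.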

6.5. RQ3: Sensitivity to Demonstration Count k

We vary the number of hard-example demonstrations $k \in \{1, 2, 3, 5, 8\}$ and report the average functional pass@1 across the three non-English languages.

Table 5 reveals a clear inverted-U pattern. Performance increases steadily from $k = 1$ to $k = 3$, peaking at 39.96% for Qwen2.5-Coder and 42.95% for HaVen, which corresponds to gains of 6.84 pp and 8.33 pp over the respective zero-shot baselines. Beyond $k = 3$, however, performance declines: at $k = 8$, Qwen2.5-Coder drops back to 35.90% and HaVen to 38.46%. We attribute this degradation to context window saturation. Each demonstration comprises a target-language specification, its English equivalent, and the corresponding Verilog code, collectively consuming a significant portion of the model’s effective context. When too many demonstrations are prepended, the model’s attention to the actual test query is diluted, particularly for Hindi and Mongolian prompts whose token counts are already high (median 370 to 440 tokens per specification). The consistent optimum at $k = 3$ across both models suggests that this value strikes the best balance between providing sufficient cross-lingual alignment signal and preserving attention capacity for the test task.
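To make the dilution argument concrete, the sketch below assembles a prompt from demonstrations that each hold a target-language specification, its English equivalent, and the Verilog code, and then measures how small the test query becomes relative to the whole prompt as the demonstration count grows. The section labels, the `Demonstration` class, and the whitespace-based token count are all assumptions made for illustration; a real tokenizer would give different absolute numbers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Demonstration:
    target_spec: str   # specification in the target language
    english_spec: str  # English equivalent
    verilog: str       # reference Verilog code


def build_prompt(demos: List[Demonstration], query_spec: str, k: int) -> str:
    """Prepend the first k demonstrations to the test query."""
    blocks = []
    for i, demo in enumerate(demos[:k], start=1):
        blocks.append(
            f"### Example {i}\n"
            f"[Specification]\n{demo.target_spec}\n"
            f"[English]\n{demo.english_spec}\n"
            f"[Verilog]\n{demo.verilog}"
        )
    blocks.append(f"### Task\n[Specification]\n{query_spec}\n[Verilog]")
    return "\n\n".join(blocks)


def query_share(
    demos: List[Demonstration],
    query_spec: str,
    k: int,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> float:
    """Fraction of the prompt occupied by the test query block."""
    # Whitespace splitting is only a crude proxy for a model tokenizer.
    count = count_tokens or (lambda text: len(text.split()))
    task_only = build_prompt([], query_spec, 0)
    return count(task_only) / count(build_prompt(demos, query_spec, k))


if __name__ == "__main__":
    demo = Demonstration("spec in target language", "spec in English",
                         "module m(input a, output y); assign y = a; endmodule")
    for k in (1, 2, 3, 5, 8):
        print(k, round(query_share([demo] * 8, "test specification", k), 3))
```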

7. Threats to Validity

7.1. Internal Validity

Our multi-agent translation pipeline enforces a dual threshold (semantic equivalence $\geq 0.95$ and native evaluation $\geq 7/10$) with up to three refinement rounds, and specifications that fail automation are manually translated by bilingual annotators with hardware design expertise. Nevertheless, residual translation artifacts may still inflate or deflate measured bias for specific languages, particularly for Mongolian where 13.5% of specifications required human intervention. Additionally, we rely on the testbenches provided by VerilogEval; insufficient test-vector coverage could allow functional errors to go undetected equally across languages, potentially underestimating true bias. All generation is performed with temperature 0 to reduce randomness, but non-determinism in commercial API endpoints may introduce minor variance.
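The dual-threshold rule with a bounded number of refinement rounds can be sketched as a small acceptance loop. This is an illustration under stated assumptions rather than the pipeline itself: `translate`, `semantic_score`, and `native_score` are hypothetical callables standing in for the translation and evaluation agents, and the feedback string format is invented for the example.

```python
from typing import Callable, Optional, Tuple

SEMANTIC_THRESHOLD = 0.95  # semantic equivalence must be >= 0.95
NATIVE_THRESHOLD = 7.0     # native evaluation must be >= 7 out of 10
MAX_ROUNDS = 3             # up to three refinement rounds


def approve_translation(
    spec_en: str,
    translate: Callable[[str, Optional[str]], str],
    semantic_score: Callable[[str, str], float],
    native_score: Callable[[str], float],
) -> Tuple[Optional[str], int]:
    """Return (approved translation, round number), or (None, MAX_ROUNDS).

    A result of None means the specification is handed to a human translator.
    """
    feedback: Optional[str] = None
    for round_no in range(1, MAX_ROUNDS + 1):
        candidate = translate(spec_en, feedback)
        semantic = semantic_score(spec_en, candidate)
        native = native_score(candidate)
        # Both thresholds must hold at once for automatic approval.
        if semantic >= SEMANTIC_THRESHOLD and native >= NATIVE_THRESHOLD:
            return candidate, round_no
        feedback = f"semantic={semantic:.2f}, native={native:.1f}"
    return None, MAX_ROUNDS
```

The round number returned on approval corresponds to the R1/R2/R3 columns of Table 1, and a `None` result corresponds to the Human column.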

Prompt length and tokenization also differ across languages (Figure 1), and these factors may interact with model behavior, particularly for smaller open-source models. Our study treats prompt language as the primary factor because task semantics, module headers, testbenches, and the prompt template are fixed across languages, but length is not artificially matched. Artificial truncation of faithful translations could introduce new semantic noise, so we report realistic translated prompts instead of a length-controlled subset in this work. We note, however, that Japanese prompts are longer than English yet often exhibit smaller degradation than Hindi or Mongolian, and hidden-state analysis compares the same task across languages, which suggests that language-linked effects are not explained by token count alone. Length-stratified or length-matched analyses are left to future work.

7.2. External Validity

Our four languages (English, Japanese, Hindi, Mongolian) span multiple script families and a range of resource levels, but they do not cover all major engineering languages (e.g., Korean, German, French). The four evaluated models represent three categories (commercial, open-source, domain-specific); newer or larger models may exhibit different bias profiles. We focus on module-level Verilog tasks derived from VerilogEval, so generalizability to VHDL or system-level design specifications remains open. Furthermore, the hard-example demonstrations used by HE-ICL are mined from RTLLM-v2, and its task distribution may not be representative of all practical hardware design scenarios.

7.3. Construct Validity

Pass@1 captures functional correctness but not non-functional qualities such as power, area, or timing; a module passing all testbench assertions may still be unsuitable for synthesis. The translation pipeline itself uses GPT-5.4, whose multilingual biases could correlate with the biases we measure in code generation; we mitigate this through back-translation verification and human review of unapproved cases but cannot fully exclude residual confounding. Finally, HE-ICL’s effectiveness is evaluated on the same two models (Qwen2.5-Coder 7B and HaVen) that exhibit significant bias; whether the approach generalizes to other biased models requires further investigation.
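For reference, the sketch below implements the unbiased pass@k estimator of Chen et al. (listed in the References) and the single-sample case it reduces to. It assumes one greedy sample per task, which is consistent with the temperature 0 setting reported above but is not stated explicitly in this section; the helper name `benchmark_pass_at_1` is introduced here for illustration.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: List[bool]) -> float:
    """Benchmark-level pass@1 (%) with a single sample per task."""
    return 100.0 * sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)
```

With 156 tasks and one sample each, every reported percentage is a multiple of 1/156 (about 0.64 pp); for example, 63 passing tasks give 40.38%, which matches the English functional score of Qwen2.5-Coder 7B in Table 2.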

8. Conclusions and Future Work

We presented Multi-VerilogEval, the first multilingual Verilog code generation benchmark, built from 156 unique underlying tasks instantiated in four languages, yielding 624 language-specific test cases. Our empirical study on four LLMs uncovers a silent failure pattern: non-English prompts preserve high syntactic correctness but can substantially degrade functional correctness in open-source and domain-specific models. Hidden-state analysis suggests that this bias is associated with persistent cross-lingual representation divergence that becomes strongest in the final layers.

To address this issue, we proposed HE-ICL, a train-free method based on hard-example demonstrations. On the biased models evaluated in this paper, HE-ICL closes 80–100% of the multilingual gap without parameter updates. Future work includes extending Multi-VerilogEval to more languages, VHDL, and system-level tasks; evaluation beyond pass@1, including stronger verification coverage and synthesis-oriented metrics (area, timing, power); representation-level alignment near the final layers; multilingual hardware corpora for training-time fine-tuning; and broader HE-ICL studies across additional models and scales.

Funding

This research was supported by the Electronics Best PhD Thesis Award, and by Nantong Normal College.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The author declares no conflict of interest.

References

Yang, G.; Zheng, W.; Chen, X.; Liang, D.; Hu, P.; Yang, Y.; Peng, S.; Li, Z.; Feng, J.; Wei, X.; et al. Large language model for verilog code generation: Literature review and the road ahead. arXiv 2025, arXiv:2512.00020.
Garcia-Gasulla, D.; Kestor, G.; Parisi, E.; Albertí-Binimelis, M.; Gutierrez, C.; Ghorab, R.M.; Montenegro, O.; Homs, B.; Moreto, M. Turtle: A unified evaluation of llms for rtl generation. In Proceedings of the 2025 ACM/IEEE 7th Symposium on Machine Learning for CAD (MLCAD); IEEE: Piscataway, NJ, USA, 2025; pp. 1–12.
Ibnat, Z.; Calzada, P.E.; Saha, D.; Al-Shaikh, H.; Saha, S.K.; Zhou, J.; Farahmandi, F.; Tehranipoor, M. Trusting the Machine: How Secure is LLM-Generated RTL Code? In Proceedings of the 2025 ACM/IEEE 7th Symposium on Machine Learning for CAD (MLCAD); IEEE: Piscataway, NJ, USA, 2025; pp. 1–8.
Zhang, J.; Liu, C.; Cheng, L.; Li, X.; Li, H. Understanding and Mitigating Errors of LLM-Generated RTL Code. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2026; early access.
Thakur, S.; Ahmad, B.; Pearce, H.; Tan, B.; Dolan-Gavitt, B.; Karri, R.; Garg, S. Verigen: A large language model for verilog code generation. ACM Trans. Des. Autom. Electron. Syst. 2024, 29, 1–31.
Yang, Y.; Teng, F.; Liu, P.; Qi, M.; Lv, C.; Li, J.; Zhang, X.; He, Z. Haven: Hallucination-mitigated llm for verilog code generation aligned with hdl engineers. In Proceedings of the 2025 Design, Automation & Test in Europe Conference (DATE); IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
Liu, M.; Pinckney, N.; Khailany, B.; Ren, H. Verilogeval: Evaluating large language models for verilog code generation. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD); IEEE: Piscataway, NJ, USA, 2023; pp. 1–8.
Lu, Y.; Liu, S.; Zhang, Q.; Xie, Z. Rtllm: An open-source benchmark for design rtl generation with large language model. In Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC); IEEE: Piscataway, NJ, USA, 2024; pp. 722–727.
Wilson Research Group. 2024 Wilson Research Group FPGA Functional Verification Trend Report; White Paper; Siemens EDA: Wilsonville, OR, USA, 2024.
Liu, S.; Fang, W.; Lu, Y.; Wang, J.; Zhang, Q.; Zhang, H.; Xie, Z. Rtlcoder: Fully open-source and efficient llm-assisted rtl code generation technique. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 44, 1448–1461.
Cassano, F.; Gouwar, J.; Nguyen, D.; Nguyen, S.; Phipps-Costin, L.; Pinckney, D.; Yee, M.H.; Zi, Y.; Anderson, C.J.; Feldman, M.Q.; et al. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Softw. Eng. 2023, 49, 3675–3691.
Peng, Q.; Chai, Y.; Li, X. Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); ELRA and ICCL: Luxembourg, 2024; pp. 8383–8394.
Raihan, M.N.; Anastasopoulos, A.; Zampieri, M. mHumanEval-a multilingual benchmark to evaluate large language models for code generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 11432–11461.
Muennighoff, N.; Wang, T.; Sutawika, L.; Roberts, A.; Biderman, S.; Le Scao, T.; Bari, M.S.; Shen, S.; Yong, Z.X.; Schoelkopf, H.; et al. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 15991–16111.
Chua, L.; Ghazi, B.; Huang, Y.; Kamath, P.; Kumar, R.; Manurangsi, P.; Sinha, A.; Xie, C.; Zhang, C. Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models. arXiv 2024, arXiv:2406.16135.
Yan, J.; Wang, Q.; Cheng, Y.; Su, Z.; Zhang, F.; Zhong, M.; Liu, L.; Jin, B.; Zhang, W. Optimized single-image super-resolution reconstruction: A multimodal approach based on reversible guidance and cyclical knowledge distillation. Eng. Appl. Artif. Intell. 2024, 133, 108496.
Wang, X.; Jiang, H.; Dong, Y.; Mu, M. Spatial-channel collaborative multi-scale graph interaction deep transfer learning for unsupervised rotating machinery fault diagnosis. Eng. Appl. Artif. Intell. 2026, 176, 114691.
Jiang, D.; Wang, H.; Li, T.; Gouda, M.A.; Zhou, B. Real-time tracker of chicken for poultry based on attention mechanism-enhanced YOLO-Chicken algorithm. Comput. Electron. Agric. 2025, 237, 110640.
Yang, Z.; Yang, Y.; Cer, D.; Darve, E. A simple and effective method to eliminate the self language bias in multilingual representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5825–5832.
Nie, S.; Fromm, M.; Welch, C.; Görge, R.; Karimi, A.; Plepi, J.; Mowmita, N.; Flores-Herr, N.; Ali, M.; Flek, L. Do Multilingual Large Language Models Mitigate Stereotype Bias? In Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 65–83.
Boutobza, S.; Popa, S.; Costa, A. An automatic testbench generator for test patterns validation. In Proceedings of the 2018 IEEE East-West Design & Test Symposium (EWDTS); IEEE: Piscataway, NJ, USA, 2018; pp. 1–11.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374.
Briva-Iglesias, V. Are AI agents the new machine translation frontier? Challenges and opportunities of single-and multi-agent systems for multilingual digital communication. In Proceedings of Machine Translation Summit XX: Volume 1; European Association for Machine Translation: Allschwil, Switzerland, 2025; pp. 365–377.
Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Han, T.; Chen, T. ExploitGen: Template-augmented exploit code generation based on CodeBERT. J. Syst. Softw. 2023, 197, 111577.
OpenAI. Introducing GPT-5.4. 2026. Available online: https://openai.com/index/introducing-gpt-5-4/ (accessed on 3 April 2026).
Anthropic. Claude Opus 4.6. 2026. Available online: https://www.anthropic.com/claude/opus (accessed on 3 April 2026).
Hui, B.; Yang, J.; Cui, Z.; Yang, J.; Liu, D.; Zhang, L.; Liu, T.; Zhang, J.; Yu, B.; Lu, K.; et al. Qwen2.5-Coder technical report. arXiv 2024, arXiv:2409.12186.

Figure 1. Token length distribution of Multi-VerilogEval across four languages.

Figure 2. A representative Multi-VerilogEval task (Karnaugh map-based combinational logic) shown in four languages. Verilog keywords, port names, and the Karnaugh map are preserved verbatim; only natural-language instructions are translated.

Figure 3. (Left): Robustness ratio $M_R$ (functional pass@1 relative to English) for each model–language pair; values below 1.0 indicate degradation. Parenthetical values preceded by a downward arrow (↓) denote the relative functional change compared with English; negative values indicate improvement over the English baseline. (Right): Fisher’s combined p-value summarizing evidence across the three non-English languages; per-language significance is assessed with Holm-corrected exact McNemar tests. Asterisks denote significance of Fisher’s combined test: *** $p < 0.001$. Color scales show $M_R$ (left) and $-\log_{10}(p)$ (right).
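The three statistical procedures named in the Figure 3 caption can be sketched in plain Python. This is a textbook-style illustration, not the analysis code behind the figure: the two-sided exact McNemar test is computed as a binomial test on the discordant pairs, Holm's step-down adjustment is applied to the per-language p-values, and Fisher's method uses the closed-form chi-square tail for an even number of degrees of freedom. The discordant counts in the usage example are hypothetical, since per-task outcomes are not listed here.

```python
from math import comb, exp, log
from typing import List


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


def fisher_combined(p_values: List[float]) -> float:
    """Fisher's combined p-value; all inputs must be strictly positive."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # X/2 with X = -2 * sum(ln p)
    # Chi-square survival function with 2k degrees of freedom (closed form).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical (English-only pass, target-only pass) counts per language.
    discordant = {"Japanese": (14, 4), "Mongolian": (18, 2), "Hindi": (15, 3)}
    raw = [mcnemar_exact(b, c) for b, c in discordant.values()]
    print(holm_adjust(raw), fisher_combined(raw))
```

Fisher's method assumes independent tests; the three comparisons here share the same English outcomes, so the combined p-value is best read as a summary rather than an exact probability.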

Figure 4. PCA projection of hidden states across layers for Qwen2.5-Coder 7B. Each point represents one task in one language. Language-specific clusters emerge in early layers and persist through deeper layers.

Figure 5. PCA projection of hidden states across layers for HaVen. Compared to Qwen2.5-Coder, language clusters are more tightly separated even at early layers, and cross-lingual mixing remains limited throughout.

Figure 6. Mean pairwise cosine similarity between hidden states of the same task in different languages, measured across layers for Qwen2.5-Coder 7B (left) and HaVen (right). Shaded bands indicate 95% confidence intervals.
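The quantity plotted in Figure 6 can be illustrated for a single task and layer as follows. This is a minimal sketch that assumes each language contributes one fixed-size vector per task and layer; how the hidden states are pooled into such a vector is not specified here, and the toy vectors are invented for the example.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_pairwise_similarity(states_by_lang: Dict[str, List[float]]) -> float:
    """Average cosine similarity over all unordered language pairs."""
    pairs = list(combinations(sorted(states_by_lang), 2))
    total = sum(cosine(states_by_lang[a], states_by_lang[b]) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy hidden states for one task at one layer (hypothetical values).
    states = {
        "English": [0.9, 0.1, 0.0],
        "Japanese": [0.8, 0.2, 0.1],
        "Hindi": [0.5, 0.5, 0.2],
        "Mongolian": [0.3, 0.6, 0.4],
    }
    print(round(mean_pairwise_similarity(states), 3))
```

Averaging this value over tasks at each layer yields one curve per model.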

Figure 7. Overview of the HE-ICL pipeline (train-free, inference-time mitigation). Stage 1 (Hard-example mining): From an independent corpus, tasks with $\mathrm{pass}(\mathrm{EN}) \land \neg\,\mathrm{pass}(\ell)$ are collected into the hard-example set $H(\ell)$. Stage 2 (Demonstration construction): $K = 3$ hard examples are sampled. Stage 3 (ICL inference): Demonstrations are prepended to the test query in $\ell$; the LLM generates the final Verilog module.

Table 1. Translation pipeline approval statistics per language. “R1/R2/R3” denote the number of specifications approved after rounds 1, 2, and 3, respectively. “Human” indicates specifications requiring manual translation after three automated rounds.

Language	R1	R2	R3	Auto-Approved	Human	Auto Rate (%)
Japanese	128	19	5	152	4	97.4
Hindi	108	25	12	145	11	92.9
Mongolian	89	28	18	135	21	86.5
Total	325	72	35	432	36	92.3
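The derived columns of Table 1 follow from the per-round counts and the 156 tasks per language. A short consistency check, with `summarize` as a name introduced here for illustration:

```python
# Per-round approval counts (R1, R2, R3) copied from Table 1.
ROUNDS = {
    "Japanese": (128, 19, 5),
    "Hindi": (108, 25, 12),
    "Mongolian": (89, 28, 18),
}
TASKS_PER_LANGUAGE = 156


def summarize(rounds: tuple, n_tasks: int = TASKS_PER_LANGUAGE) -> tuple:
    """Return (auto-approved, human-translated, auto rate in %)."""
    auto = sum(rounds)
    human = n_tasks - auto
    return auto, human, round(100.0 * auto / n_tasks, 1)


if __name__ == "__main__":
    for language, rounds in ROUNDS.items():
        print(language, summarize(rounds))
    all_rounds = tuple(sum(col) for col in zip(*ROUNDS.values()))
    print("Total", summarize(all_rounds, TASKS_PER_LANGUAGE * len(ROUNDS)))
```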

Table 2. Syntax and functional pass@1 (%) on Multi-VerilogEval. The highest functional score per language is bolded.

	Qwen2.5-Coder 7B		HaVen		GPT-5.4		Opus-4.6
	Syn.	Func.	Syn.	Func.	Syn.	Func.	Syn.	Func.
English	83.33	40.38	92.31	42.95	97.44	78.21	98.08	90.38
Japanese	83.33	33.97	91.67	35.90	98.08	80.77	98.08	87.82
Mongolian	82.50	32.69	91.67	32.69	98.08	79.49	98.08	89.74
Hindi	76.92	32.69	93.59	35.26	98.08	76.92	98.08	89.74
Avg.	80.92	33.12	92.31	34.62	97.92	79.06	98.08	89.10
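The robustness ratio $M_R$ shown in Figure 3 (functional pass@1 relative to English) and the corresponding relative drop can be recomputed directly from the functional columns of Table 2. The helper names in this sketch are introduced for illustration only.

```python
# Functional pass@1 (%) copied from Table 2.
FUNCTIONAL = {
    "Qwen2.5-Coder 7B": {"English": 40.38, "Japanese": 33.97,
                         "Mongolian": 32.69, "Hindi": 32.69},
    "HaVen": {"English": 42.95, "Japanese": 35.90,
              "Mongolian": 32.69, "Hindi": 35.26},
    "GPT-5.4": {"English": 78.21, "Japanese": 80.77,
                "Mongolian": 79.49, "Hindi": 76.92},
    "Opus-4.6": {"English": 90.38, "Japanese": 87.82,
                 "Mongolian": 89.74, "Hindi": 89.74},
}


def robustness_ratio(model: str, language: str) -> float:
    """M_R: functional pass@1 in `language` divided by the English score."""
    scores = FUNCTIONAL[model]
    return scores[language] / scores["English"]


def relative_drop(model: str, language: str) -> float:
    """Relative functional degradation versus English, in percent."""
    return 100.0 * (1.0 - robustness_ratio(model, language))


def worst_case() -> tuple:
    """(model, language) pair with the largest relative drop."""
    pairs = [(m, lang) for m, s in FUNCTIONAL.items()
             for lang in s if lang != "English"]
    return max(pairs, key=lambda pair: relative_drop(*pair))


if __name__ == "__main__":
    model, language = worst_case()
    print(model, language, round(relative_drop(model, language), 1))
```

The largest drop obtained this way is for HaVen on Mongolian, which reproduces the 23.9% figure quoted in the Abstract.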

Table 3. Syntax and functional pass@1 (%) under different mitigation strategies. Eng. denotes the English upper bound. Best non-English results per model are bolded.

	Qwen2.5-Coder 7B		HaVen
	Syntax	Functional	Syntax	Functional
Japanese
Zero-Shot	83.33	33.97	91.67	35.90
CoT	71.79	29.49	80.77	36.54
TtG	83.97	35.90	93.59	36.54
HE-ICL (Ours)	83.97	39.10	92.31	43.59
Eng.	83.33	40.38	92.31	42.95
Mongolian
Zero-Shot	82.50	32.69	91.67	32.69
CoT	65.38	26.28	74.36	26.92
TtG	80.77	20.51	91.03	26.92
HE-ICL (Ours)	84.62	39.74	92.95	42.31
Eng.	83.33	40.38	92.31	42.95
Hindi
Zero-Shot	76.92	32.69	93.59	35.26
CoT	69.87	30.13	84.62	30.13
TtG	81.41	37.18	92.31	39.74
HE-ICL (Ours)	83.97	41.03	92.95	42.95
Eng.	83.33	40.38	92.31	42.95

Table 4. Ablation on in-context demonstration selection strategy. Functional pass@1 (%) is reported with $k = 3$ demonstrations. Best non-English results per model are bolded.

	Qwen2.5-Coder 7B			HaVen
	JA	MN	HI	JA	MN	HI
No Demonstrations	33.97	32.69	32.69	35.90	32.69	35.26
Random Demonstrations	35.26	34.62	35.26	37.82	35.90	37.18
Hard-Example Demonstrations	39.10	39.74	41.03	43.59	42.31	42.95

Table 5. Sensitivity to demonstration count k. Average functional pass@1 (%) over three non-English languages. Best non-English results per model are bolded.

k	1	2	3	5	8
Qwen2.5-Coder 7B	35.26	37.18	39.96	38.46	35.90
HaVen	36.54	39.74	42.95	41.67	38.46
Zero-shot ref.	Qwen: 33.12 HaVen: 34.62
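The optimum and the inverted-U shape can be checked mechanically from the Table 5 values. The sketch below is a small worked example; `best_k` and `is_inverted_u` are illustrative helper names.

```python
# Average functional pass@1 (%) over the three non-English languages (Table 5).
K_VALUES = (1, 2, 3, 5, 8)
AVERAGE = {
    "Qwen2.5-Coder 7B": (35.26, 37.18, 39.96, 38.46, 35.90),
    "HaVen": (36.54, 39.74, 42.95, 41.67, 38.46),
}
ZERO_SHOT = {"Qwen2.5-Coder 7B": 33.12, "HaVen": 34.62}


def best_k(model: str) -> tuple:
    """Return (best k, gain over zero-shot in pp) for one model."""
    scores = AVERAGE[model]
    peak = max(range(len(scores)), key=scores.__getitem__)
    return K_VALUES[peak], scores[peak] - ZERO_SHOT[model]


def is_inverted_u(model: str) -> bool:
    """True if scores strictly rise to a single peak and then strictly fall."""
    scores = AVERAGE[model]
    peak = max(range(len(scores)), key=scores.__getitem__)
    rising = all(a < b for a, b in zip(scores[:peak], scores[1:peak + 1]))
    falling = all(a > b for a, b in zip(scores[peak:], scores[peak + 1:]))
    return rising and falling


if __name__ == "__main__":
    for model in AVERAGE:
        k, gain = best_k(model)
        print(f"{model}: best k = {k}, gain = {gain:.2f} pp, "
              f"inverted-U = {is_inverted_u(model)}")
```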

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
