Multi-Hardware Benchmarking of Open-Source Large Language Models with Retrieval-Augmented Generation for Mitsubishi FX-Series PLC Instruction List Code Generation
Abstract
1. Introduction
- Multi-vendor, multi-hardware benchmark. We evaluate ten open-source LLMs—Meta llama3.1:8b and llama3.3:70b; Alibaba qwen2.5-coder at 7B/14B/32B and qwen3.5:122b; Mistral AI mistral:latest and mistral-small3.1:24b; OpenAI gpt-oss:120b; NVIDIA nemotron-3-super:120b—on three hardware tiers (RTX 3090 24GB, RTX 4090 24GB, NVIDIA DGX Spark 128GB unified memory), each tested in both LLM-only and LLM + RAG configurations on the same 285-question dataset.
- Static syntax checker calibrated on the FX3U/FX3UC instruction set. A three-tier validator (Lexical mnemonic recognition, Syntactic operand-format and operand-count validation, and Semantic program-network and stack-balance checks) supplies a static correctness signal beyond cosine similarity, BLEU, and ROUGE. The checker, calibrated against a ground-truth pass rate of 93.3%, is explicitly a lower bound on functional correctness rather than a substitute for controller-side validation (Section 6.6).
- Retrieval-depth ablation. A controlled ablation on qwen2.5-coder:7b quantifies the marginal value of additional retrieved exemplars on similarity, BLEU, ROUGE, and inference time.
- Honest failure-mode analysis. We report two negative findings prominently. First, llama3.3:70b gains only a few percentage points in static pass rate from RAG (0.081 to 0.147, Table 4), far below the median gain across the ten models. Second, gpt-oss:120b rises only from 0.183 to 0.439, the lowest absolute RAG pass rate among the XL-tier models. For llama3.3:70b we also report an empirical diagnostic on the per-sample CSV that re-attributes the dominant failure to prose preamble contamination rather than to the originally hypothesised output truncation (Section 6.2).
2. Related Work
2.1. LLM-Based Code Generation
2.2. Retrieval-Augmented Generation
2.3. AI for PLC Programming
2.4. Functional Validation of Generated Code
3. Methodology
3.1. Dataset
3.2. RAG Pipeline
- Index. ChromaDB collection plc_dataset, cosine space, indexed once per experiment over all 285 question texts (the answers are stored as metadata). We index each question text as a single atomic document; no sub-question chunking is performed. Because each dataset item is itself a short, semantically self-contained unit (50–200 tokens), conventional chunking concerns about splitting variable declarations from their use do not arise. For larger codebases (e.g., function block libraries), semantic chunking by network or function block boundary would be appropriate; we leave this for future work on broader corpora.
- Embedding model. all-MiniLM-L6-v2 (HuggingFace sentence-transformers) is a 22.7M-parameter, 384-dimensional sentence encoder distilled from BERT-base via deep self-attention distillation [21,22]. We selected it for its low memory footprint and fast batch encoding so that the entire pipeline fits on commodity GPUs alongside the LLM weights.
- Retrieval. An initial candidate pool is fetched from ChromaDB by cosine similarity (9 candidates at the default k), filtered by a similarity floor of 0.25 (cosine), and then reranked by Maximal Marginal Relevance (MMR) with trade-off weight λ [23]. Concretely, given the query q, the candidate set R, and the already-selected result set S, the next exemplar is

  $$d^{*} = \arg\max_{d \in R \setminus S} \left[ \lambda \, \mathrm{sim}(d, q) - (1 - \lambda) \max_{s \in S} \mathrm{sim}(d, s) \right] \qquad (1)$$

  In Equation (1), the first term rewards relevance to the query and the second term penalizes near-duplicates of exemplars already selected. This penalty matters because the dataset contains many parametric variants of the same template (e.g., 60 traffic light items differing only in timer constants); without MMR, the top-k would be near-identical clones of one exemplar. Ties are broken by retrieval rank. The query item itself is excluded from retrieval to prevent trivial leakage. If no candidate clears the 0.25 floor, the pipeline falls back to the unfiltered top-k exemplars so that every query receives the same number of in-context examples regardless of retrieval quality. This fallback triggers on a small minority of queries in our runs.
- Defaults. Each query receives the default k retrieved exemplars; ablation experiments vary k from 1 to 5 (Section 5.3).
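
The retrieval steps above (similarity floor, fallback, and greedy MMR selection) can be sketched in a few lines of Python. This is a minimal illustration rather than the code in the experiment script: the function names, the plain-list vector representation, and the value `lam=0.5` are assumptions made for the example; only the 0.25 floor, the fallback rule, and the tie-breaking by retrieval rank come from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_exemplars(query_vec, candidates, k, lam=0.5, floor=0.25):
    """Return up to k document ids chosen by floor filtering plus greedy MMR.

    candidates: list of (doc_id, vector), with the query item already removed.
    lam: MMR trade-off weight (assumed value, not taken from the paper).
    floor: cosine similarity floor (0.25 in the text above).
    """
    # Relevance ranking; sorted() is stable, so ties keep the input order.
    ranked = sorted(
        ((cosine(query_vec, vec), doc_id, vec) for doc_id, vec in candidates),
        key=lambda item: -item[0],
    )
    pool = [item for item in ranked if item[0] >= floor]
    if not pool:
        # Fallback: nothing cleared the floor, so use the unfiltered top-k.
        return [doc_id for _, doc_id, _ in ranked[:k]]

    selected = []
    while pool and len(selected) < k:
        best_index, best_score = 0, -math.inf
        for index, (relevance, _, vec) in enumerate(pool):
            # Similarity to the closest exemplar already chosen (0 if none yet).
            redundancy = max((cosine(vec, chosen[2]) for chosen in selected), default=0.0)
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:  # strict ">" keeps the higher-ranked item on ties
                best_index, best_score = index, score
        selected.append(pool.pop(best_index))
    return [doc_id for _, doc_id, _ in selected]
```

With two near-duplicate candidates and one distinct candidate, setting `lam=1.0` (pure relevance) returns both near-duplicates, whereas a smaller `lam` swaps the second one for the distinct exemplar, which is the behaviour the MMR penalty is meant to produce on parametric template variants.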

Listing 1: Large language model (LLM) prompt templates: the LLM-only baseline and the LLM with Retrieval-Augmented Generation (RAG) variant. Both are transcribed from experiments/run_experiment_minilm.py (cosmetic line breaks added for column width). {question} and {context} are filled in at call time.

    # PLC_PROMPT (LLM-only configuration)
    You are a professional PLC programming engineer with expertise in
    Mitsubishi FX-series Instruction List programming.
    Please provide your response in two parts:
    [Part One] - Mitsubishi FX3U/FX3UC IL Instruction List Code
    Write the complete IL program with line-by-line English comments
    explaining each instruction.
    [Part Two] - Program Logic Explanation
    Provide a step-by-step description of the program’s logic flow in
    engineering documentation style.
    Question: {question}
    Answer:

    # RAG_PROMPT (LLM + RAG configuration)
    You are a professional PLC programming engineer with expertise in
    Mitsubishi FX-series Instruction List programming.
    Please answer based on the reference examples provided below.
    [Part One] - Mitsubishi FX3U/FX3UC IL Instruction List Code
    Write the complete IL program with line-by-line English comments.
    [Part Two] - Program Logic Explanation
    Provide a step-by-step description of the program’s logic flow.
    Reference Examples: {context}
    Question: {question}
    Answer:

Listing 2: A representative dataset item (basic instruction category, item index 82). The question text is the input to the prompt template above; the answer text is the reference against which generated outputs are compared on the four similarity metrics.

    Question: How do LD and OUT work together to control an output?
    Answer:
    Example code:
    LD X1
    OUT Y0
    Explanation: When X1 is ON, Y0 is driven ON; when X1 is OFF, Y0 is turned OFF.
3.3. Inference Configuration
3.4. Static Syntax Checker
- Load family: LD, LDI, LDP, LDF; aliases LDN→LDI.
- Series and parallel: AND, ANI, ANDP, ANDF, OR, ORI, ORP, ORF.
- Set/reset and pulse: SET, RST, PLS, PLF.
- Block: ANB, ORB, MPS, MRD, MPP, INV, MEP, MEF, NOP, END, FEND.
- Comparison: LD=, LD<, LD>, LD<=, LD>=, LD<> and AND-/OR- variants. The tokenizer normalizes LD = to LD= so that whitespace before the comparator does not register a parse error.
- Output and applied (29 ops): OUT; MOV, DMOV, BMOV; INC[P], DEC[P]; ADD, SUB, MUL, DIV, CMP, ZCP, ZRST, FMOV; ROL, ROR, SFTL, SFTR, DECO, ENCO, WAND, WOR, WXOR; TO, FROM, REF; STL, RET, CALL, SRET, MC, MCR; EI, DI, IRET, WDT, FOR, NEXT; plus CU/CD as compound counter operands.
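
To make the lexical tier concrete, the following sketch shows one way the normalization and mnemonic lookup described above could be written. It is a simplified illustration, not the published syntax_checker.py: the mnemonic set is a subset of the lists above, the comment delimiters (`;` and `//`) are an assumption, and scoring the tier as the fraction of lines with a recognized leading mnemonic is a hypothetical choice.

```python
import re

# Alias table and a subset of the mnemonic lists given above.
ALIASES = {"LDN": "LDI"}
MNEMONICS = {
    "LD", "LDI", "LDP", "LDF",
    "AND", "ANI", "ANDP", "ANDF", "OR", "ORI", "ORP", "ORF",
    "SET", "RST", "PLS", "PLF",
    "ANB", "ORB", "MPS", "MRD", "MPP", "INV", "MEP", "MEF", "NOP", "END", "FEND",
    "OUT", "MOV",
}
COMPARATORS = ("<=", ">=", "<>", "=", "<", ">")
# Comparison contacts: LD=, LD<, ..., plus the AND-/OR- variants.
MNEMONICS |= {base + op for base in ("LD", "AND", "OR") for op in COMPARATORS}

# Matches "LD =", "AND <=", ... so the comparator can be glued to the mnemonic.
_COMPARATOR_GAP = re.compile(r"^(LD|AND|OR)\s+(<=|>=|<>|=|<|>)")


def normalize(line):
    """Drop a trailing comment, uppercase, join 'LD =' into 'LD=', apply aliases."""
    code = re.split(r";|//", line, maxsplit=1)[0].strip().upper()  # assumed delimiters
    code = _COMPARATOR_GAP.sub(r"\1\2", code)
    tokens = code.split()
    if tokens:
        tokens[0] = ALIASES.get(tokens[0], tokens[0])
    return tokens


def lexical_score(program):
    """Fraction of non-empty lines whose first token is a known mnemonic."""
    lines = [tokens for tokens in map(normalize, program.splitlines()) if tokens]
    if not lines:
        return 0.0
    return sum(1 for tokens in lines if tokens[0] in MNEMONICS) / len(lines)
```

For example, `normalize("LD = D0 K10")` yields the tokens `LD=`, `D0`, `K10`, so whitespace before the comparator no longer produces an unknown-mnemonic error, and `normalize("LDN X0")` is treated as `LDI X0`.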
3.5. Metrics
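- BLEU. Sentence-level BLEU computed with NLTK using smoothing method 1 [27], against the reference answer. We use the default uniform 4-gram weights and the whitespace tokenizer (str.split()). With brevity penalty BP and modified n-gram precisions $p_n$,

  $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{4} w_n \log p_n \right), \qquad w_n = \tfrac{1}{4}, \qquad \mathrm{BP} = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases}$$

  where c is the candidate length and r is the reference length, both in tokens.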
- ROUGE-L. F-measure of longest common subsequence [28] via the rouge_score library, with stemming.
- Inference time. Wall-clock time for a single /api/generate call, in seconds. This is generation-only: the timer wraps the Ollama HTTP call exclusively, so for the RAG configuration the cost of MiniLM encoding, the ChromaDB cosine fetch, similarity-floor filtering, and MMR re-ranking is excluded from the metric. Retrieval latency is reported separately: on an RTX 3090 at the default k, the mean per-query retrieval cost (MiniLM encode, ChromaDB fetch, and MMR re-rank, measured over a 50-query warm cache) is in the millisecond range, two to four orders of magnitude smaller than the per-query LLM generation time for every model in Table 3 (5.7 s to 166.0 s), so including it would not change the qualitative ranking. Reporting generation-only time keeps the RAG and LLM-only columns comparable on the same scale and isolates the model-level effect of the additional in-context exemplars from the retrieval-stage cost.
- Syntax checker tiers. Lexical, Syntactic, and Semantic scores (each in [0, 1]) and a binary is_valid flag.
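
As an illustration of the ROUGE-L metric listed above, the sketch below computes the longest-common-subsequence F-measure between a reference and a generated answer. It is a simplified stand-in for the rouge_score library used in the experiments: it assumes lowercase whitespace tokenization and applies no stemming, so its values will generally differ slightly from the reported ones.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure with whitespace tokens and no stemming (simplification)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on token order and overlap, a generated program that changes a single operand (for example `Y1` in place of `Y0`) still scores highly, which is one reason the static syntax checker is reported alongside the overlap metrics.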
4. Experimental Setup
4.1. Models
4.2. Hardware
- RTX 3090 (24 GB GDDR6X): consumer workstation, used for S-tier models. Ollama runs on a local endpoint at port 11434.
- RTX 4090 (24 GB GDDR6X): separate machine on the lab network, used for M-tier models. Network round-trip time is counted as part of inference latency; absolute values are reported, not normalized.
- NVIDIA DGX Spark (128 GB unified memory): used for L- and XL-tier models. It is the only platform on which the 70B and 120B+ models fit alongside the KV cache for the maximum prompt budget.
Software Stack
4.3. Procedure
1. The 285 dataset items are processed in dataset order. For each item, the LLM-only configuration is queried first, immediately followed by the RAG configuration, so that any drift in server-side state affects both arms equally.
2. Both raw answers and all retrieved context (IDs and similarities) are written to a per-model CSV; the same data is also serialized to a per-model JSON. Per-sample CSVs are committed to the repository; the larger per-sample JSONs are not (per project policy).
3. After generation, a single summary JSON is written containing, for each of the four similarity/time metrics, the mean, standard deviation, delta (RAG minus LLM), and percentage change.
4. After all ten models finish, syntax_checker.py (in --all-csv mode) re-reads each CSV and produces syntax_report_all.json, the syntax-tier matrix used in Table 4 of Section 5.2.
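
The per-metric summary written after generation can be reproduced from the per-sample values with a few lines of Python. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, and the use of the sample standard deviation is an assumption, since the procedure above does not state which estimator the summary JSON uses.

```python
import statistics


def summarize_metric(llm_values, rag_values):
    """Mean, standard deviation, delta (RAG minus LLM), and percentage change."""
    llm_mean = statistics.mean(llm_values)
    rag_mean = statistics.mean(rag_values)
    delta = rag_mean - llm_mean
    return {
        "llm_mean": llm_mean,
        "llm_std": statistics.stdev(llm_values),  # sample std (assumed estimator)
        "rag_mean": rag_mean,
        "rag_std": statistics.stdev(rag_values),
        "delta": delta,
        # Percentage change relative to the LLM-only mean; undefined if that mean is 0.
        "pct_change": 100.0 * delta / llm_mean if llm_mean else None,
    }
```

Each list needs at least two values for the sample standard deviation to be defined. The percentage change is taken relative to the LLM-only mean, which explains why metrics with near-zero baselines, such as BLEU in the LLM-only arm, produce very large relative changes from modest absolute gains.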
4.4. Ablation: Retrieval Depth k
4.5. Reproducibility
5. Results
5.1. Multi-Model Similarity, BLEU, ROUGE, Time
| Model | Sim (Cosine) LLM | Sim (Cosine) RAG | BLEU LLM | BLEU RAG | ROUGE-L LLM | ROUGE-L RAG | Time (s) LLM | Time (s) RAG |
|---|---|---|---|---|---|---|---|---|
| llama3.1:8b | 0.453 | 0.632 | 0.002 | 0.115 | 0.055 | 0.207 | 16.3 | 14.8 |
| qwen2.5-coder:7b | 0.481 | 0.709 | 0.003 | 0.202 | 0.065 | 0.363 | 9.6 | 6.7 |
| mistral:latest | 0.473 | 0.699 | 0.002 | 0.143 | 0.054 | 0.277 | 8.4 | 7.7 |
| qwen2.5-coder:14b | 0.490 | 0.653 | 0.003 | 0.104 | 0.063 | 0.229 | 9.3 | 5.7 |
| mistral-small3.1:24b | 0.504 | 0.613 | 0.003 | 0.084 | 0.060 | 0.180 | 16.4 | 12.4 |
| qwen2.5-coder:32b | 0.504 | 0.618 | 0.003 | 0.084 | 0.065 | 0.182 | 88.3 | 64.0 |
| llama3.3:70b | 0.516 | 0.626 | 0.003 | 0.085 | 0.059 | 0.175 | 166.0 | 130.4 |
| gpt-oss:120b | 0.483 | 0.522 | 0.001 | 0.007 | 0.043 | 0.077 | 71.5 | 66.4 |
| nemotron-3-super:120b | 0.509 | 0.655 | 0.005 | 0.145 | 0.065 | 0.213 | 54.4 | 43.2 |
| qwen3.5:122b | 0.544 | 0.637 | 0.004 | 0.061 | 0.069 | 0.188 | 44.2 | 27.7 |
5.1.1. RAG Raises Similarity for All Ten Models
5.1.2. Coder-Tuned Small Models Are Competitive with General Large Ones
5.1.3. gpt-oss:120b Is an Outlier
5.1.4. The Cosine Similarity Gain Is Broad, Not Driven by Outliers
5.2. Static Syntax Checker Pass Rates
| Model | LLM-Only Pass | LLM-Only Lex | LLM-Only Syn | LLM-Only Sem | LLM + RAG Pass | LLM + RAG Lex | LLM + RAG Syn | LLM + RAG Sem | Δ Pass (pp) | Δ Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| llama3.1:8b | 0.035 | 0.765 | 0.089 | 0.633 | 0.523 | 0.972 | 0.848 | 0.774 | ||
| qwen2.5-coder:7b | 0.242 | 0.940 | 0.573 | 0.755 | 0.853 | 0.940 | 0.914 | 0.768 | ||
| mistral:latest | 0.021 | 0.853 | 0.070 | 0.632 | 0.628 | 0.958 | 0.811 | 0.742 | ||
| qwen2.5-coder:14b | 0.239 | 0.902 | 0.634 | 0.673 | 0.698 | 0.944 | 0.877 | 0.772 | ||
| mistral-small3.1:24b | 0.172 | 0.754 | 0.493 | 0.630 | 0.709 | 0.997 | 0.924 | 0.838 | ||
| qwen2.5-coder:32b | 0.347 | 0.747 | 0.626 | 0.568 | 0.821 | 0.993 | 0.955 | 0.803 | ||
| llama3.3:70b | 0.081 | 0.891 | 0.406 | 0.806 | 0.147 | 1.000 | 0.784 | 0.805 | ||
| gpt-oss:120b | 0.183 | 0.600 | 0.481 | 0.422 | 0.439 | 0.768 | 0.712 | 0.606 | ||
| nemotron-3-super:120b | 0.428 | 0.775 | 0.693 | 0.550 | 0.649 | 0.898 | 0.854 | 0.702 | ||
| qwen3.5:122b | 0.709 | 0.937 | 0.909 | 0.788 | 0.958 | 1.000 | 0.993 | 0.811 | ||
| Ground-truth dataset: Pass = 0.933, Lex = 0.968, Syn = 0.959, Sem = 0.792 | ||||||||||
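
The per-model gain in static pass rate can be recomputed directly from the Pass columns of the table above. The following sketch does this in percentage points and also derives the median gain; the pass rates are copied from the table, while the variable and function names are hypothetical. Because the inputs are rounded to three decimals, the recomputed values may differ in the last digit from figures derived from the per-sample CSVs.

```python
from statistics import median

# (LLM-only pass rate, LLM + RAG pass rate), copied from the table above.
PASS_RATES = {
    "llama3.1:8b": (0.035, 0.523),
    "qwen2.5-coder:7b": (0.242, 0.853),
    "mistral:latest": (0.021, 0.628),
    "qwen2.5-coder:14b": (0.239, 0.698),
    "mistral-small3.1:24b": (0.172, 0.709),
    "qwen2.5-coder:32b": (0.347, 0.821),
    "llama3.3:70b": (0.081, 0.147),
    "gpt-oss:120b": (0.183, 0.439),
    "nemotron-3-super:120b": (0.428, 0.649),
    "qwen3.5:122b": (0.709, 0.958),
}
GROUND_TRUTH_PASS = 0.933  # static pass rate of the reference answers


def gains_pp(rates):
    """RAG minus LLM-only pass rate per model, in percentage points."""
    return {model: round((rag - llm) * 100, 1) for model, (llm, rag) in rates.items()}


gains = gains_pp(PASS_RATES)
median_gain = median(gains.values())
smallest_gain_model = min(gains, key=gains.get)
above_ground_truth = [m for m, (_, rag) in PASS_RATES.items() if rag > GROUND_TRUTH_PASS]

for model, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"{model:24s} {gain:+5.1f} pp")
print(f"median gain: {median_gain:.2f} pp")
```

Sorting by gain makes the two groups visible at a glance: the models whose pass rate rises by several tens of percentage points, and the small number of models for which retrieval helps much less.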
5.2.1. RAG Improves Syntactic Correctness Across All 10 Models
5.2.2. The Best RAG Configuration Exceeds the Human-Authored Ground-Truth Pass Rate
5.2.3. Coder-Tuned Small Versus General Large
5.2.4. Two Outliers
5.2.5. Per-Category Robustness
5.3. Retrieval-Depth Ablation (k)
5.4. Summary
1. RAG improves similarity and static syntactic correctness for all ten models, with one weak pass-rate case (llama3.3:70b, 0.081 to 0.147) and one weak similarity case (gpt-oss:120b, 0.483 to 0.522 cosine).
2. qwen3.5:122b with RAG (95.8% pass) exceeds the ground-truth dataset’s own static pass rate (93.3%).
3. qwen2.5-coder:7b with RAG (85.3% pass) beats llama3.3:70b with RAG (14.7%): domain alignment outweighs parameter count.
4. Increasing k from 1 to 5 improves all overlap metrics on qwen2.5-coder:7b without penalizing latency.
6. Discussion
6.1. Sensor Layer Framing of the Generated Programs
Sustainability Angle
6.2. Failure-Mode Analysis: llama3.3:70b and gpt-oss:120b
6.2.1. llama3.3:70b—Prose Preamble Contamination, Not Truncation
6.2.2. gpt-oss:120b—Residual Thinking-Trace Capacity Loss
6.2.3. Model Family-Specific RAG Sensitivity
6.3. Budget-Paradox Confirmation (num_predict = 4096)
6.4. Comparison to Prior PLC-LLM Work
6.5. Local LLMs with Retrieval as a Practical Recipe for Deprecated Industrial Languages
6.6. Limitations
1. Dataset scope (basic and sensor-driven logic only). The 285 questions span five categories (Traffic Light, Basic Instruction, Special Relay, Timer/Counter, and Basic Control), which together exercise discrete sensor-to-actuator logic and the FX-internal special relay layer. The corpus deliberately excludes four important program classes that practising automation engineers encounter on FX-series hardware and to which we explicitly do not claim our results generalize: (a) closed-loop process control implemented with the FX PID applied instruction; (b) serial and bus communications (RS, IVCK, FX-ENET/CCLINK message handlers); (c) motion control (PLSY, PLSR, DRVI, DRVA pulse-train outputs to stepper or servo drives); and (d) step-ladder programs structured around the STL/RET pair with master-control (MC/MCR) blocks. Validating that retrieval-grounded generation transfers to these instruction families requires a separately curated reference corpus, an extension of the static checker (e.g., PID parameter-block rules, STL state-graph integrity), and, for motion and PID, an additional simulation harness with timing semantics. The 93.3% calibration baseline and all pass-rate claims in Section 5 therefore apply to the five categories named above, not to PLC programming as a whole.
2. Single PLC family. Only Mitsubishi FX-series IL is covered. Cross-brand evaluation (Siemens S7 STL, Omron CX, and Allen-Bradley) is a desirable extension but requires a comparably sized hand-curated dataset and a per-family static checker; both are out of scope here. The IL deprecation note in IEC 61131-3 Edition 3.0 makes this a “legacy modernization” study rather than a “future language” study; the broader claim that retrieval-grounded local LLMs help on deprecated industrial languages is hypothesized but not established by a single-family evaluation, and we frame it as such in Section 6.5.
3. Static checks, not functional verification. The three-tier syntax checker is a strict lower bound on functional correctness, and we want this point to be unambiguous. It evaluates lexical, syntactic, and structural-semantic conformance; it does not compile generated programs through the vendor toolchain, does not execute them on a controller or simulator, and consequently cannot detect: incorrect device assignments that pass type-and-range checks (e.g., start and stop buttons swapped), wrong timer or counter constants (K100 vs. K10), off-by-one networks in interlock or stop-priority logic, race conditions on shared internal relays, watchdog misconfiguration, scan-time-dependent sequencing bugs, or any property whose evaluation requires running code. Section 3.4 gives a worked example of a program that swaps X0 for X1 and yet passes every tier; such a program is syntactically valid but functionally wrong. Tier 4 (compilation through GX Works2) and Tier 5 (GX Simulator2 trace evaluation against canonical sensor input timelines) are listed as future work in Section 7, and a binary pass under the present checker should accordingly be interpreted as “not statically dead” rather than “ready for plant-floor deployment”.
4. Category imbalance. The five categories are not equally represented: Basic Instruction (110 items) and Special Relay (95) dominate, Traffic Light (60) is sizeable, and the two smallest, Basic Control and Timer/Counter, have only 10 items each. This imbalance is a property of the dataset as authored, not a design choice in this paper, but it has measurable consequences. First, two of the per-category rows in Table 5 sit on n = 10 denominators, which means a single misclassification moves the pass rate by 10 pp; the negative cell on qwen3.5:122b Timer/Counter (−10 pp) is one fewer passing item out of ten and should be read accordingly. Second, the aggregate pass rate is dominated by the two largest categories; a future macro-balanced evaluation would either weight categories equally or expand the smallest two to at least 30 items. We refrain from re-weighting the present numbers because doing so would alter the calibration baseline that anchors every other claim in the paper; the disaggregated view in Table 5 is the more transparent report.
5. No fine-tuning arm. A fine-tuned third configuration was not evaluated. With only 285 items, a credible fine-tuning study would require a held-out evaluation partition, leaving fewer than 230 training items, which is too few to support reliable conclusions about the fine-tuned arm. We therefore treat fine-tuning as future work paired with dataset expansion. The 7B coder-tuned model was, however, used precisely because it represents the upper edge of accessible fine-tuning cost in this hardware tier.
6. Sample size and statistical headroom. A dataset of 285 items is small for k-fold cross-validation, and the per-category split above further limits the resolution of category-conditioned analyses. Confidence intervals on the per-model pass rate are wide (on the order of 5 pp for the Wilson CI). The differences highlighted in Section 5 (qwen3.5:122b RAG vs. ground truth, qwen2.5-coder:7b RAG vs. llama3.3:70b RAG, and the two failure-mode anomalies) all exceed this margin by a comfortable factor, but smaller cross-model rankings within a single tier should be interpreted with caution.
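
The width of the confidence intervals mentioned in the last item can be checked with the Wilson score interval. The sketch below is a generic implementation, not code from the paper; the choice of z = 1.96 (a 95% level) is an assumption, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a 95% confidence level (assumed here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denominator = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denominator
    )
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Full dataset: 243 passes out of 285 items (a pass rate of about 0.853).
low_all, high_all = wilson_interval(243, 285)
# Smallest categories: 8 passes out of 10 items.
low_small, high_small = wilson_interval(8, 10)
```

Under these assumptions the interval for 243 of 285 items is roughly 0.807 to 0.889, whereas 8 of 10 items gives roughly 0.49 to 0.94, which illustrates why the 10-item categories support only coarse per-category conclusions.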
7. Conclusions
7.1. Scope of the Conclusions
7.2. Future Work
1. Tier-4 compilation validation. Wrap GX Works2 (or its CLI) so that each generated program is compiled and the compile log captured. This converts the lower-bound static check into a stricter “compiles or not” pass criterion.
2. Tier-5 simulation validation. Feed compiled programs into GX Simulator2 with a canonical sensor-input timeline per category (e.g., a periodic X0 pulse for traffic light items) and check actuator output against an expected pattern.
3. Failure-mode-controlled mitigations. For llama3.3:70b, evaluate two mitigations for the prose preamble pattern documented in Section 6.2: a tokenizer-level pre-pass that strips any prefix prose paragraph before the static check, and a stricter system prompt that suppresses the meta-introduction. For gpt-oss:120b, rerun with the thinking channel disabled more aggressively through Ollama options.
4. Fine-tuning third arm. LoRA or QLoRA fine-tuning of a coder-tuned 7B–14B model on the 285-item dataset (with k-fold partitioning to avoid train/test leakage) would supply a third comparison point alongside LLM-only and LLM + RAG.
5. Cross-PLC-brand extension. A hand-curated parallel dataset for Siemens S7 STL or Omron CX would test cross-vendor generalization. This requires a separate per-vendor static checker; the present FX3U/FX3UC checker is a template.
6. Dynamic context provision via MCP. An emerging alternative to the static-corpus retrieval used here is dynamic context provision through Model Context Protocol (MCP) servers backed by curated PLC libraries; such a server could extend the present pipeline without altering the LLM or the syntax checker.
7. Coverage extension to advanced instructions. Extend the dataset and the static checker to cover PID, communications (RS, IVCK), motion (PLSY, DRVI, DRVA), and step-ladder (STL/RET, MC/MCR) programs, including, for PID and motion, a per-instruction simulation harness with timing semantics so that the static tier check is not the sole correctness signal in those categories.
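
As a rough indication of what the preamble-stripping pre-pass in item 3 might look like, the sketch below drops every line that precedes the first load instruction with a device operand. It is a hypothetical illustration of a mitigation that has not been evaluated: the function name, the restriction to the LD/LDI/LDP/LDF family, and the assumed operand pattern (a device letter followed by digits) are all assumptions.

```python
import re

# First line that looks like a load instruction with a device operand, e.g. "LD X1".
# Case-sensitive on purpose so that prose such as "Load the input" does not match.
_FIRST_LOAD = re.compile(r"^\s*(?:LD|LDI|LDP|LDF)\s+[XYMTCS]\d+")


def strip_prose_preamble(answer):
    """Remove any text before the first load instruction; leave the answer
    unchanged if no such instruction is found."""
    lines = answer.splitlines()
    for index, line in enumerate(lines):
        if _FIRST_LOAD.match(line):
            return "\n".join(lines[index:])
    return answer
```

A pre-pass of this kind only changes what the static checker sees; comparing pass rates with and without it would indicate how much of the llama3.3:70b gap is attributable to the preamble rather than to the generated program itself.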
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Lasi, H.; Fettke, P.; Kemper, H.G.; Feld, T.; Hoffmann, M. Industry 4.0. Bus. Inf. Syst. Eng. 2014, 6, 239–242.
2. Xu, L.D.; Xu, E.L.; Li, L. Industry 4.0: State of the Art and Future Trends. Int. J. Prod. Res. 2018, 56, 2941–2962.
3. IEC 61131-3; Programmable Controllers - Part 3: Programming Languages, Edition 3.0. IEC: Geneva, Switzerland, 2013.
4. John, K.H.; Tiegelkamp, M. IEC 61131-3: Programming Industrial Automation Systems - Concepts and Programming Languages, Requirements for Programming Systems, Decision-Making Aids, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2010.
5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2017; pp. 5998–6008.
6. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2020; pp. 1877–1901.
7. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
8. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv 2024, arXiv:2406.00515.
9. Fakih, M.; Dharmaji, R.; Moghaddas, Y.; Quiros Araya, G.; Ogundare, O.T.; Al Faruque, M.A. LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP ’24); Association for Computing Machinery: New York, NY, USA, 2024.
10. Liu, Z.; Zeng, R.; Wang, D.; Peng, G.; Wang, J.; Liu, Q.; Liu, P.; Wang, W. Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems Using LLM-Based Agents. arXiv 2024, arXiv:2410.14209.
11. Koziolek, H.; Grüner, S.; Hark, R.; Ashiwal, V.; Linsbauer, S.; Eskandani, N. LLM-Based and Retrieval-Augmented Control Code Generation. In Proceedings of the 1st International Workshop on Large Language Models for Code (LLM4Code ’24); Association for Computing Machinery: New York, NY, USA, 2024.
12. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2020; pp. 9459–9474.
13. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
14. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open Foundation Models for Code. arXiv 2023, arXiv:2308.12950.
15. Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. StarCoder: May the Source be With You! arXiv 2023, arXiv:2305.06161.
16. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
17. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
18. Hui, B.; Yang, J.; Cui, Z.; Yang, J.; Liu, D.; Zhang, L.; Liu, T.; Zhang, J.; Yu, B.; Lu, K.; et al. Qwen2.5-Coder Technical Report. arXiv 2024, arXiv:2409.12186.
19. Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3784–3803.
20. Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6769–6781.
21. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992.
22. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2020; pp. 5776–5788.
23. Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98); Association for Computing Machinery: New York, NY, USA, 1998; pp. 335–336.
24. Chroma Team. Chroma: The Open-Source Embedding Database. Open-Source Software. 2023. Available online: https://www.trychroma.com (accessed on 24 May 2026).
25. Koziolek, H.; Gruener, S.; Ashiwal, V. ChatGPT for PLC/DCS Control Logic Generation. In Proceedings of the 2023 IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA); IEEE: New York, NY, USA, 2023.
26. Kersting, J.; Rummel, M.; Benndorf, G. Vendor-Aware Industrial Agents: RAG-Enhanced LLMs for Secure On-Premise PLC Code Generation. arXiv 2025, arXiv:2511.09122.
27. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02); Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
28. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81.






| Category | Count | Sensor/Actuator Content |
|---|---|---|
| Traffic light | 60 | Start/stop push buttons (X0, X1); lamp outputs (Y0–Y2); timers (T0–T3) |
| Basic instruction | 110 | LD/AND/OR logic on X-inputs driving Y-outputs and M-relays |
| Special relay | 95 | M8000-series special relays (e.g., M8013 1-Hz pulse) feeding outputs |
| Timer/counter | 10 | T-timer/C-counter operations gated by X-inputs |
| Basic control | 10 | Standard start–stop interlocks, latched outputs |

| Vendor | Ollama Tag | Quant. | Params | Hardware | Tier |
|---|---|---|---|---|---|
| Meta | llama3.1:8b | Q4_K_M | 8B | RTX 3090 24GB | S |
| Alibaba | qwen2.5-coder:7b | Q4_K_M | 7B | RTX 3090 24GB | S |
| Mistral AI | mistral:latest | Q4_K_M | 7B | RTX 3090 24GB | S |
| Alibaba | qwen2.5-coder:14b | Q4_K_M | 14B | RTX 4090 24GB | M |
| Mistral AI | mistral-small3.1:24b | Q4_K_M | 24B | RTX 4090 24GB | M |
| Alibaba | qwen2.5-coder:32b | Q4_K_M | 32B | DGX Spark 128GB | L |
| Meta | llama3.3:70b | Q4_K_M | 70B | DGX Spark 128GB | L |
| OpenAI | gpt-oss:120b | MXFP4 | 120B | DGX Spark 128GB | XL |
| NVIDIA | nemotron-3-super:120b | Q4_K_M | 120B | DGX Spark 128GB | XL |
| Alibaba | qwen3.5:122b | Q4_K_M | 122B | DGX Spark 128GB | XL |

| Category | n | qwen2.5-coder:7b LLM (%) | qwen2.5-coder:7b RAG (%) | qwen3.5:122b LLM (%) | qwen3.5:122b RAG (%) |
|---|---|---|---|---|---|
| Traffic Light | 60 | 13.3 | 100.0 | 61.7 | 100.0 |
| Basic Control | 10 | 10.0 | 90.0 | 60.0 | 60.0 |
| Timer/Counter | 10 | 10.0 | 40.0 | 90.0 | 80.0 |
| Basic Instruction | 110 | 11.8 | 77.3 | 73.6 | 98.2 |
| Special Relay | 95 | 48.4 | 89.5 | 72.6 | 95.8 |

| Setting | Sim (Cosine) LLM | Sim (Cosine) RAG | BLEU LLM | BLEU RAG | ROUGE-L LLM | ROUGE-L RAG | Time (s) LLM | Time (s) RAG |
|---|---|---|---|---|---|---|---|---|
| k = 1 (run 1) | 0.4801 | 0.6898 | 0.0025 | 0.1823 | 0.0625 | 0.3444 | 9.0 | 7.7 |
| k = 1 (run 2) | 0.4734 | 0.6877 | 0.0024 | 0.1790 | 0.0606 | 0.3466 | 10.7 | 5.5 |
| default k | 0.4806 | 0.7087 | 0.0028 | 0.2020 | 0.0648 | 0.3635 | 9.6 | 6.7 |
| k = 5 | 0.4726 | 0.7274 | 0.0027 | 0.2072 | 0.0625 | 0.3788 | 7.8 | 4.8 |

| Configuration | np = 1024 | np = 4096 | Δ (pp) |
|---|---|---|---|
| LLM-only pass rate | 8.07% | 4.91% | −3.16 |
| LLM + RAG pass rate | 14.74% | 18.95% | +4.21 |
| “To”-preamble fraction of RAG failures | | | |
| RAG passed-answer mean length (chars) | 2215 | 2321 | |
| RAG failed-answer mean length (chars) | 2564 | 2856 | |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yeh, M.-F.; Luo, C.-C.; Lu, C.-L. Multi-Hardware Benchmarking of Open-Source Large Language Models with Retrieval-Augmented Generation for Mitsubishi FX-Series PLC Instruction List Code Generation. Sensors 2026, 26, 3602. https://doi.org/10.3390/s26113602
