Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design

Faruque, Md Omar; Jamieson, Peter; Patooghy, Ahmad; Badawy, Abdel-Hameed A.

doi:10.3390/electronics14234745

Open AccessEditor’s ChoiceArticle

Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design

¹

Klipsch School of ECE, New Mexico State University, Las Cruces, NM 88003, USA

²

Department of Electrical and Computer Engineering, Miami University, Oxford, OH 45056, USA

³

Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC 27411, USA

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(23), 4745; https://doi.org/10.3390/electronics14234745

Submission received: 10 September 2025 / Revised: 25 November 2025 / Accepted: 25 November 2025 / Published: 2 December 2025

(This article belongs to the Section Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

Traditionally, inserting realistic Hardware Trojans (HTs) in complex hardware systems has been a time-consuming manual process, requiring comprehensive knowledge of the design and navigating intricate Hardware Description Language (HDL) codebases. Machine Learning (ML)-based approaches have attempted to automate this process but often struggle with the need for extensive training data, learning time, and limited generalizability across diverse hardware design landscapes. This paper introduces GHOST, an automated tool that leverages Large Language Models (LLMs) for rapid generation and insertion of HT. The research encompasses both the development of the GHOST framework and a comprehensive evaluation of its effectiveness across three state-of-the-art LLMs-GPT-4, Gemini-1.5-Pro, and Llama-3-70B. According to our evaluations, GPT-4 demonstrates the best performance by successfully generating and inserting HTs in 88.9% of its attempts. This study also highlights the security risks posed by LLM-generated HTs, as 100% of successful GHOST-generated HTs that completed inference within the time limit evaded detection by a state-of-the-art ML-based HT detection tool. These results underscore the need for advanced detection and prevention mechanisms in hardware security to address the emerging threat of LLM-generated HTs.

Keywords:

Hardware Trojans; Large Language Models; Hardware Security; Hardware Design

1. Introduction

Hardware Trojans (HTs) are malicious and unauthorized modifications to hardware designs that can alter functionality, degrade performance, leak sensitive information, or facilitate other devastating attacks [1]. HTs are becoming an increasing concern in the electronics and semiconductor industry due to the globalization of the supply chain, design reuse, and the proliferation of Third-Party Intellectual Property (3P-IP) cores [2].

The challenge of inserting HTs into hardware systems, traditionally a manual and labor-intensive process, has become increasingly complex as hardware designs grow in size and complexity. This manual process not only demands comprehensive expertise in various hardware architectures but is also inherently limited by the biases and assumptions of human designers. These biases often result in predictable and narrowly focused HTs, thereby reducing their effectiveness and making them easier to detect with targeted security tools [3]. Although some semi- and fully automated HT insertion tools have been developed using algorithmic and machine learning (ML) approaches, to the best of our knowledge, none are currently leveraging the available Large Language Models (LLMs), thereby limiting their applicability.

To better understand the landscape of HT insertion tools, we provide a comparison (summarized in Table 1) of existing methodologies, focusing on key attributes such as automation, learning time, open-source availability, and platform compatibility. The Trust-Hub repository [4] provides a collection of manually inserted HT benchmarks extensively used for research. However, manual insertion introduces human bias, which limits the diversity of HT scenarios and reduces the overall effectiveness of training detection tools. TAINT [5] tried to automate HT insertion in FPGA designs at various levels. Despite automation claims, this tool requires significant user input for trigger selection, payload definition, and selection of insertion location(s), which can reintroduce human biases. TRIT [3] automates HT insertion for gate-level netlists. It allows dynamic HT generation with flexible configurations. However, it relies on human input for initial setup and configuration choices. The MIMIC framework [6] automates HT insertion using ML by learning from existing HT samples. While reducing human intervention, it requires extensive training data and computational resources. MIMIC’s reliance on specific features from existing HTs may limit its generalizability across different hardware designs. ATTRITION [7] uses ML to insert HTs into rare nets, enhancing stealthiness. However, it requires significant training time and may overlook other vulnerable areas, with the tool being non-open-source, limiting broader research access. Trojan Playground [8] employs reinforcement learning (RL) for HT insertion, allowing an agent to explore insertion points autonomously. Although it reduces biases and adapts to different designs, extensive training is required. DTjRTL [9] automates HT insertion at the RTL level with configurable parameters. While flexible, it relies on predefined configurations, limiting its adaptability to novel HT scenarios. TrojanForge [10] integrates adversarial learning with RL and a Generative Adversarial Network (GAN) like approach to generate HTs designed to evade detection. This approach enhances stealthiness but has high computational complexity and long training times. FEINT [11] automates template/Trojan insertion into FPGA designs, focusing on flexibility. FEINT allows insertion at various stages of FPGA design but is tailored for specific FPGA contexts, which may limit its generalizability to other platforms. Recently, Kumar et al. [12] proposed a compatibility graph-assisted framework that systematically selects rare nodes as trigger nodes, generating HT instances with 25–125 trigger nodes. While achieving significant speedup and detection evasion, this approach requires pre-computed compatibility graphs and is limited to gate-level netlists. Concurrently, Kokolakis et al. [13] explored the use of general-purpose LLMs for HT insertion in complex designs, such as CVA6 RISC-V, addressing context length limitations through hierarchical filtering. However, their approach lacks automation, requiring manual prompt engineering for each HT insertion and does not support batch generation of multiple HT variants across different designs, unlike GHOST.

Although the papers mentioned above are innovative, the tools bring their own challenges, including long training times, limited generalizability across different designs and hardware platforms, and a lack of open-source availability, which restricts their accessibility and adaptability for broader research and development. To address these limitations of existing tools, we introduce GHOST (Generator for Hardware Oriented Stealthy Trojans). This framework leverages available LLMs to automate the HT generation and insertion processes. The primary focus of this research is the development and evaluation of GHOST as an automated tool for HT insertion. We present the systematic methodology used to design the framework, including its prompt engineering strategies and LLM inference pipeline, and evaluate its effectiveness across diverse hardware designs and multiple state-of-the-art LLMs.

This work makes the following contributions:

We develop and introduce GHOST, an automated tool that leverages LLMs for HT generation and insertion in complex RTL designs. The tool employs systematic prompt engineering strategies (Role-Based Prompting, Contextual Trojan Prompting, and Reflexive Validation Prompting) to guide LLMs in generating stealthy and functional hardware Trojans.
GHOST is platform-agnostic and design-agnostic. It can target both ASIC and FPGA flows across diverse hardware architectures.
We comprehensively evaluate the GHOST tool across three state-of-the-art LLMs (GPT-4, Gemini-1.5-Pro, and LLaMA3) and multiple hardware designs, offering insights into the security implications of LLM-generated HTs.
We present an analysis of each LLM’s performance, capabilities, and limitations in HT insertion when using the GHOST framework. We also evaluate their effectiveness in evading detection by a modern ML-based HT detection tool.
We contribute 14 functional and synthesizable hardware Trojan (HT) benchmarks generated by our framework, addressing a critical need for benchmarks in hardware security research. This is particularly significant given recent findings [14] that only 3 out of 86 HT benchmarks from the Trust-Hub suite are considered effective. We are also open-sourcing our prompts and Python scripts, enabling the automated generation of more high-quality HT benchmarks. This resource can be utilized for future HT research, especially in developing detection schemes for large language model (LLM)-generated HTs.

The rest of this paper is organized as follows: Section 2 reviews related work on LLM applications in hardware security. Section 3 details our threat model. Section 4 presents the GHOST framework architecture and prompt engineering strategies. Section 5 describes our evaluation methodology. Section 6 presents our experimental results, including a comparative analysis of LLM performance, case studies, and an evaluation of HT detection evasion. Section 7 concludes the paper.

2. LLMs for Hardware Design and Security

Recently, LLMs have demonstrated capabilities in creating hardware designs. Chang et al. [15] developed ChipGPT, an LLM-based design environment that generates optimized logic designs from natural language specifications. Efforts have also been made to address the issue of LLMs producing incorrect HDL code by fine-tuning on Verilog datasets [16] or developing frameworks like Autochip to fix bugs in HDL codes [17].

In the hardware security domain, LLMs have been applied to verification [18,19,20], secure hardware generation [20,21,22,23], and vulnerability detection and remediation [24,25]. Ahmad et al. [24] used LLMs to guide the fixing of security vulnerabilities. Fu et al. [25] trained domain-specific LLMs on a dataset of hardware defects and fixes to enable automated debugging. Saha et al. [26] provided an analysis of using GPTs for various hardware security tasks, including vulnerability insertion. They demonstrate ChatGPT’s capability to integrate vulnerabilities into hardware designs.

Compared to defensive applications, there has been limited exploration of using LLMs for offensive security purposes, such as HT insertion. As this might be an easy practice for adversaries, this work explores this capability in depth by presenting a methodology to leverage LLMs to automate the insertion of HTs into hardware designs.

3. Threat Model

Similar to the threat model of Shakya et al. [27], our threat model addresses the security challenges in the increasingly globalized and outsourced System-on-Chip (SoC) development process. As Figure 1 illustrates, various stages of SoC development are outsourced to different entities. This paper focuses on the case where a trusted RTL designer, possibly from a third-party IP vendor, provides a clean IP core to an SoC integrator. This integrator, potentially an offshore entity, is tasked with incorporating the IP into a larger SoC design. The crux of our threat model lies in the assumption that the SoC integrator might be the adversary. Despite having access to the RTL code, integrators often face significant time constraints, typically due to tight project deadlines, preventing them from fully comprehending the intricate complexities of the IP core’s internal architecture and implementation details.

This is where the GHOST framework becomes a powerful tool for the potential attacker. GHOST leverages LLMs to bridge the knowledge gap, enabling integrators to rapidly analyze HDL code, identify vulnerabilities, and insert HTs with minimal manual effort before passing it to trusted Computer-Aided Design (CAD) tools for synthesis and eventual chip fabrication at the foundry. The framework’s capabilities enable the creation of stealthy HTs that can potentially evade detection during pre- and post-fabrication testing, yet be activated post-deployment.

It is crucial to note that in this model, while the SoC integrator is considered untrusted, the foundry responsible for chip fabrication is assumed to be trusted. The attacker’s modifications are confined to the RTL level.

In our threat model, the malicious SoC integrator receives the clean IP core from the trusted designer, inserts the HT, and then passes the trojaned design to downstream entities (foundry, fabrication facilities, and ultimately end users). Critically, these downstream entities receive only the HT-infected design without access to the original, golden, clean version. Hence, traditional before-versus-after comparison approaches—such as diff-based comparison of RTL files, grep searches for suspicious signal names, or comparison of top-level port lists—are not applicable in this threat model, as they inherently require access to both the clean and HT-infected versions for comparison. The downstream entities, lacking a golden reference, must rely on single-version analysis methods that can detect HTs from the HT-infected design alone, without requiring comparison to a clean baseline.

4. Proposed Methodology

This section presents our proposed automated HT insertion framework, GHOST, that seamlessly integrates LLM capabilities. Figure 2 illustrates the overall architecture of the GHOST framework, which consists of two main components: (1) Prompt Engineering and (2) LLM Inference. We will discuss each separately in the following sections.

4.1. Prompt Engineering

In the prompt engineering component of GHOST, shown on the left block of Figure 2, we employ a combination of three prompting strategies, i.e., Role-Based Prompting (RBP), Reflexive Validation Prompting (RVP), and Contextual Trojan Prompting (CTP) to guide the LLM in executing HT insertion tasks.

4.1.1. Role-Based Prompting (RBP)

RBP involves assigning the LLM a specific role or persona, which helps frame the task by providing context and clarity. It enables the LLM to leverage domain-specific knowledge [28] and thus helps maintain consistency across various HT designs. For HT insertion tasks, the LLM is prompted to assume the role of a hardware security expert specializing in HT insertion. This role provides the LLM with the context to understand and implement sophisticated HTs. A typical prompt may start with the following:

4.1.2. Reflexive Validation Prompting (RVP)

RVP makes the LLM self-review and verify its output, enhancing the quality and reliability of the generated HTs, drawing inspiration from the concept introduced by Shinn et al. [29]. RVP typically includes a series of prompts that guide the LLM through a structured self-evaluation process. For instance, (see Appendix A for the complete template), the directive “Ensure that all instructions are followed” initiates this self-checking process. The subsequent instruction to “Describe how the HT trigger and payload have been implemented in the code” ensures that the LLM provides a detailed account of its actions, making the HT insertion process more transparent and traceable. The final directive to “Verify the correctness, stealthiness, and synthesizability of the HT implementation.” prompts the LLM to critically evaluate its work, considering the functionality and the covert nature of the inserted HT.

4.1.3. Contextual Trojan Prompting (CTP)

CTP provides relevant context about HTs and their characteristics, similar to the few-shot learning approach introduced by Brown et al. [30].

Our proposed framework investigates three types of HT inspired by the Trust-Hub benchmarks [4]. For each HT type, we use a tailored CTP strategy. The following shows the three types of HT functionality and an associated sample CTP:

HT1 (Change functionality): This refers to HTs that alter the intended functionality of the circuit. Examples include privilege escalation, bypassing encryption algorithms, or producing incorrect computational results.

HT2 (Leak information): An information-leaking HT is a malicious modification to the design that covertly transmits sensitive data from the system. Such HTs operate stealthily, maintaining the circuit’s original functionality while creating hidden channels to leak critical information such as encryption keys, secure data, or internal states. The leaked information can be transmitted through various means, including covert output channels, timing variations, or power consumption patterns.

HT3 (Denial of Service): This refers to HTs that cause the circuit or system to stop functioning entirely or become unavailable for its intended use. The primary goal is to disrupt or prevent the normal operation of the system. Examples might include causing the circuit to enter a non-functional state, continuously resetting the system, or blocking access to critical resources.

Each CTP comes with ➀ A clear objective (e.g., “change functionality”, “leak information”), ➁ Desired implementation details (e.g., “subtle logical modification”, “covert channel”), ➂ Guidance on triggering mechanisms (e.g., “specific rare input sequence”, “seemingly benign signal or state”), and ➃ Instructions on maintaining stealth (e.g., “hard to detect”, “extremely rare”). By combining the three prompting strategies, GHOST achieves a balance between expert-level task framing (RBP), specific HT implementation guidance (CTP), and rigorous self-validation (RVP). This prompting approach enables the framework to generate diverse, realistic, and stealthy HTs while maintaining high quality and consistency.

4.2. LLM Inference

The LLM Inference component, as illustrated on the right block of Figure 2, translates the crafted prompts into actual HT designs based on the following three steps.

4.2.1. Model Selection

The LLM inference process begins with model selection (left side of the LLM Inference block in Figure 2). GHOST supports both closed-source (e.g., GPT-4, Gemini-1.5-Pro) and open-source (e.g., LLaMA3-70b) models, chosen for their performance in general language benchmarks [31]. GHOST’s modular architecture enables easy integration and comparative analysis of these models in the context of HT generation. While closed-source models offer ease of use and regular updates, open-source alternatives provide customizability, enhanced privacy, and potential cost benefits for large-scale applications. Some open-source models, like LLaMA3-70b, can run on consumer hardware, thereby broadening accessibility [32]. This approach enables a rigorous evaluation of the effectiveness of various LLMs in HT insertion, a task that has not been studied in depth previously.

4.2.2. LLM Tasks

Once selected, GHOST interacts with the chosen LLM via API calls, handling authentication and request construction. The constructed prompt, including role-based instructions and contextual HT information, is submitted to the LLM along with the targeted clean RTL designs. For example, when inserting an HT to leak information (HT2), the framework uses the API call shown in Listing 1. The central part of the LLM Inference block in Figure 2 shows the four main tasks the selected LLM performs. The selected LLM analyzes the given RTL design to understand its functionality and structure. It then identifies suitable attack points where an HT could be inserted without disrupting the design’s normal operation. Based on this analysis, the LLM generates the HT code that meets the specified requirements. Finally, it inserts the HT into the original design, modifying the RTL code to integrate the HT seamlessly.

Listing 1. Python code to generate an HT in Verilog using OpenAI’s API call.

4.2.3. Response Extraction

The final step in the LLM Inference process is the response extraction, which involves processing the LLM’s output to extract the modified RTL code containing the inserted HT. Additionally, the framework extracts explanations of the HT functionality, insertion process, trigger, and payload details provided by the LLM. It also extracts an HT taxonomy similar to that of the Trust-Hub HT benchmarks [4].

4.3. GHOST Main Steps

The core of the GHOST framework is the HT insertion algorithm. This algorithm leverages LLMs to automate designing and inserting HTs into clean RTL designs.

As shown in Algorithm 1, the process begins with a set of inputs including clean RTL designs D, a set of HT types

T = {H T 1, H T 2, H T 3}

, a Role-Based Prompt R, three Contextual Trojan Prompts

C T P = {C T P_{1}, C T P_{2}, C T P_{3}}

corresponding to each HT type, a Reflexive Validation Prompt

R V P

, and a set of pre-trained LLMs L. The algorithm iterates over each clean RTL design

d \in D

(line 1) and applies each HT type

t \in T

(line 2) to it. The algorithm constructs a combined prompt for each combination that integrates all three prompting strategies: RBP, CTP, and RVP (lines 3–5). An appropriate LLM is then selected (line 6) and used to generate an initial HT-infected design based on this combined prompt (line 7). After the initial HT insertion, the LLM checks if the generated design complies with the instructions and requirements specified in the RVP (line 8). If the design is not compliant, the LLM modifies its response (line 9). This step ensures that the LLM self-reviews and improves the HT insertion without manual intervention. The final HT-infected design is then added to the set of outputs

D_{H T}

(line 11). This process repeats for all combinations of clean RTL designs and HT types, resulting in a comprehensive set of HT-infected RTL designs (line 14). By synergistically applying the three prompting strategies throughout the process, the algorithm guides the LLM in generating effective and stealthy HTs across various clean RTL designs in a fully automated manner.

Algorithm 1 HT Insertion Algorithm
Require: Set of clean RTL designs D, set of HT types $T = {H T 1, H T 2, H T 3}$ , Role-Based Prompt R, Contextual Trojan Prompts $C T P = {C T P_{1}, C T P_{2}, C T P_{3}}$ corresponding to HT types, Reflexive Validation Prompt $R V P$ , set of LLMs L
Ensure: Set of HT-infected RTL designs $D_{H T}$
1: for each design $d \in D$ do
2: for each HT type $t \in T$ do
3: $P_{R B P} \leftarrow$ ConstructRolePrompt(R, t)
4: $P_{C T P} \leftarrow$ SelectContextualPrompt( $C T P$ , t)
5: $P_{c o m b i n e d} \leftarrow$ CombinePrompts( $P_{R B P}$ , $P_{C T P}$ , $R V P$ , d)
6: $L_{t} \leftarrow$ SelectLLM(L, t)
7: $d_{t}^{'} \leftarrow L_{t} (P_{c o m b i n e d})$	▷Generate initial HT-infected design
8: if not CheckCompliance( $L_{t}$ , $d_{t}^{'}$ , $R V P$ ) then
9: $d_{t}^{'} \leftarrow M o d i f y (d_{t}^{'})$	▷Modify HT design if non-compliant
10: end if
11: $D_{H T} \leftarrow D_{H T} \cup {d_{t}^{'}}$
12: end for
13: end for
14: return $D_{H T}$

5. Evaluation Methodology

We present a comprehensive evaluation methodology depicted in Figure 3 to assess the effectiveness of GHOST. Our approach encompasses (1) pre-synthesis simulations (highlighted in red) and (2) post-synthesis verification (highlighted in green). We define four evaluation metrics to quantitatively measure the success of LLMs in HT stealthiness, functionality, and persistence throughout the entire design flow. Table 2 summarizes the metrics we will define in the following sections. Practical considerations in hardware security research drove the choice of simulation-based verification over formal methods. While formal verification offers complete coverage, it becomes computationally intractable for complex designs like AES-128 due to state space explosion, particularly when dealing with HTs designed to activate under rare conditions. Simulation-based verification allowed us to directly validate both functional correctness and triggering behavior through targeted test vectors. Additionally, formal verification would have required significant manual effort to create design-specific properties and assertions for each HT type, thereby limiting the scalability of our evaluation framework. The verification effort in our framework was deliberately structured to minimize manual intervention while maintaining rigor and accuracy. We developed automated compilation and simulation scripts to handle routine verification tasks and leveraged pre-existing testbenches from original designs for functional verification.

5.1. Pre-Synthesis Simulations

5.1.1. Compilation Verification (Eval0)

As shown in the leftmost stage of Figure 3, we begin by compiling each HT-infected design using an open-source RTL compiler tool. This step, automated via a Python script, verifies the syntactic correctness and basic design integrity of the code. We quantify this using the Compilation Success Rate parameter (Eval0), which measures the proportion of HT-infected designs that compile without errors. The “Compiled?” decision point determines whether the design proceeds to the next stage or is marked as an Eval0 failure.

5.1.2. Functional Consistency Check (Eval1)

Designs passing the compilation stage undergo functional simulation using an open-source Verilog simulator with their original testbenches, as depicted by the “Unaffected?” decision point in the next stage of Figure 3. We analyze the resulting simulation logs to check if the intended original functionality is preserved when the HT is dormant. This crucial step, implemented with a Python script, validates HT stealthiness. We capture this using the Normal Operation Preservation Rate parameter (Eval1), representing the fraction of designs that maintain correct functionality in non-triggering conditions.

5.1.3. Trojan Activation Verification (Eval2)

For designs passing the functional consistency check, we move to the third stage indicated by the “HT Functional?” decision point in Figure 3. Here, we employ manually crafted testbenches to attempt to activate the HT. The manual creation of the testbenches allows for precise control over the testing conditions and ensures that the unique characteristics of each HT are thoroughly examined. The testbenches simulate various input conditions and operational scenarios to activate the HT. We carefully analyze the resulting simulation logs and waveforms to verify if the HT behaves as intended when triggered. We quantify the results of this verification process using the HT Triggering Success Rate parameter (Eval2), which represents the proportion of inserted HTs that can be successfully activated. We note the importance of this property, which is not addressed in many automated insertion techniques. We suggest that an LLM’s capability to perform this task makes these tools both promising and concerning.

5.2. Post-Synthesis Simulations (Eval3)

As depicted in the right half of Figure 3, designs that successfully pass pre-synthesis evaluations undergo logic synthesis using an open-source logic synthesizer tool. This step translates RTL designs to gate-level netlists. We then use the netlist designs to perform post-synthesis simulation using the same pre-synthesis testbenches. We generate and analyze simulation logs to verify the preservation of HT behavior. This final step assesses HT resilience against synthesis optimizations and transformations as indicated by the “HT Survived” decision point. We quantify this using the HT Survival Rate parameter (Eval3), which measures the fraction of HTs that remain functional post-synthesis.

Throughout this process, designs may fail at various stages, as indicated by the red failed endpoints in Figure 3. Designs that successfully pass all stages are classified as “Survived HT” and are shown in the rightmost part of Figure 3.

6. Experimental Results

6.1. Experimental Setup

We use Icarus Verilog (version 11.0) [33] for RTL compilation and functional simulations. Waveform visualization was performed using GTKWave (version 3.3) [34]. The technology-mapped netlist was generated using Yosys (version 0.9) [35] in conjunction with the Google SkyWater 130 nm PDK [36], utilizing the sky130_fd_sc_hd__tt_025C_1v80.lib library, which provides fabrication-ready digital standard cells. Our experimental setup utilizes a Linux Ubuntu 22.04 environment for all tests and evaluations conducted. Python scripts were deployed using a Conda environment with Python version 3.10.14.

6.2. Large Language Models

Our experiments use three state-of-the-art LLMs: OpenAI’s GPT-4, Google’s Gemini-1.5-Pro, and Meta’s LLaMA 3 70B (accessed through Groq API [37]). These models were accessed via their respective APIs rather than running locally. Typical HT generation time is approximately 1–2 seconds per attempt, measured as the LLM API response latency. Table 3 provides the specific configurations used for each model.

6.3. Dataset

Our experiments utilize a diverse set of Verilog IP cores, representing security-critical components of a typical SoC, categorized by various difficulty levels. The Cryptographic Core, represented by the Advanced Encryption Standard (AES-128 [41]) with 768 lines, is the most complex, making it a hard-level challenge for the LLMs. The Communication Core, represented by the Universal Asynchronous Receiver-Transmitter (UART [42]) with 430 lines, manages serial data transmission and reception, making it of medium difficulty. The Memory Controller Core, represented by the single-port, synchronous Static Random Access Memory (SRAM [43]) controller with 52 lines, is the simplest and is responsible for secure data storage and retrieval. This categorization enables us to assess the effectiveness of the GHOST framework in HT insertion across different difficulty levels. Table 4 provides an overview of the IP cores used.

6.4. Case Study: An Information Leakage Trojan (HT2) in an AES-128 Cryptographic Core, Designed by GPT-4

In this section, we review one of the HTs generated by GHOST. We discuss how HT2 is inserted into an AES-128 cryptographic core using GPT-4. The tool was tasked with inserting an HT while preserving the original functionality (the complete prompt template is provided in Appendix A).

As shown in Listing 2, the LLM modified the AES-128 module by adding a ‘secret_trigger’ input for activation and a ‘trojan_out’ output for leaking the key,

k 0

. The HT code, shown in the listing, activates when ‘secret_trigger’ is asserted, storing the key in a shadow register ‘t_state’ and outputting it via ‘trojan_out’.

For the pre-synthesis verification, a testbench simulates normal operation and HT activation. The simulation shell command is shown in Figure 4a. The waveform in Figure 4b (only relevant signals are shown) depicts ‘trojan_out’ updating with the

k 0

value (edge ‘a∼b’) when ‘secret_trigger’ is asserted (point ‘t’), thus confirming the HT’s functionality. Post-synthesis verification with Yosys and the Google Skywater 130nm PDK ensures the HT’s behavior is preserved. The synthesis commands and the subsequent waveform analysis are shown in Figure 5.

Listing 2. Information Leakage HT inserted in AES-128 RTL by GPT-4.

In the following sections, we analyze the performance of the three LLMs in generating and inserting HTs into SRAM, AES-128, and UART designs using the GHOST framework. We evaluate the models using four metrics (Eval0 through Eval3) as defined in Table 2. Table 5 presents the results, organized by LLMs, design types, and individual HT attempts. Success is indicated by checkmarks (✓) and failures by crosses (×), with (–) denoting “Not Applicable” stages. The table also includes standard cell counts, overhead in percent (change in cell counts), trigger types, and brief descriptions of trigger mechanisms.

6.5. GPT-4 Performance

Overall Performance Metrics: GPT-4 demonstrated exceptional proficiency in generating and inserting HT across all evaluated designs. The model achieved a compilation success rate (Eval0) of 88.9%, successfully compiling eight out of nine attempted HTs. Notably, GPT-4 excelled in maintaining normal operation (Eval1) and achieving the intended HT functionality (Eval2), with perfect 100% success rates for both metrics. All functional HTs generated by GPT-4 survived the synthesis process (Eval3: 100%), underscoring the model’s ability to produce hardware-aware implementations.

Design-Specific Performance: In terms of design-specific performance, GPT-4 showcased remarkable consistency. For the SRAM design, all three attempted HTs were successfully generated, inserted, and synthesized. The AES-128 design, despite its complexity, posed no significant challenge for GPT-4, with all three HTs passing synthesis. For the UART design, two out of three attempts resulted in functional and synthesizable HTs.

Trigger Characteristics: GPT-4’s generated HTs exhibited a diverse range of trigger mechanisms (both external and internal). Internal triggers utilized techniques such as counters and specific address access patterns, while external triggers relied on dedicated trigger signals. This variety demonstrates GPT-4’s understanding of different triggering methods and its ability to adapt them to various hardware designs.

Resource Utilization: GPT-4 demonstrated varying overheads in HT insertions across designs. For SRAM, overheads ranged from 0.90% to 40.72%. AES-128 initially appeared zero-overhead (no extra cell used) for HT1, but a closer analysis revealed subtle increases in wire and bit counts. Other AES-128 HTs had minimal overheads of 0.15% and 0.22%. UART HTs showed higher overheads of 22.80% and 9.42%. This variability reflects GPT-4’s adaptability, with efficient HTs in some cases and more noticeable impacts in others.

Implications for Hardware Security: GPT-4’s success in handling the complex AES-128 design is particularly noteworthy. This performance indicates the model’s robust capability in comprehending and manipulating intricate hardware structures, suggesting its potential applicability to a wide range of hardware designs of varying complexity.

6.6. Gemini-1.5-Pro Performance

Overall Performance Metrics: Gemini-1.5-Pro demonstrated moderate success in HT generation and insertion. The model achieved a compilation success rate (Eval0) of 88.9%, matching GPT-4’s performance in this initial stage. However, Gemini-1.5-Pro showed some degradation in subsequent metrics, with a normal operation preservation rate (Eval1) of 87.5% and a HT functionality success rate (Eval2) of 71.4%. Notably, all functional HTs produced by Gemini-1.5-Pro survived the synthesis process (Eval3: 100%), indicating a strong grasp of hardware-synthesizable constructs. Overall, five out of nine attempts were successful (55.6% success rate), showing moderate success in HT insertion across different designs.

Design-Specific Performance: In terms of design-specific performance, Gemini-1.5-Pro’s results varied across the different hardware designs. For the SRAM design, only one out of three attempted HTs was successfully generated and synthesized. However, the model showed improved performance with the AES-128 design, successfully generating and synthesizing two out of three attempted HTs. The UART design saw similar success, with two out of three HTs passing synthesis.

Trigger Characteristics: Gemini-1.5-Pro demonstrated sophistication in its trigger designs, particularly evident in the AES-128 HT3, which implemented a trigger based on a specific input pattern over 255 cycles, which confirms the model’s capacity to generate stealthy HTs.

Resource Utilization: Gemini-1.5-Pro showed more consistent, generally lower overheads, particularly in complex designs like AES-128 (0.15% to 0.48%). However, it saw higher overheads in the UART design (up to 15.50%).

Implications for Hardware Security: Gemini-1.5-Pro’s performance, particularly its success with the complex AES-128 design and its 100% synthesis survival rate for functional HTs, indicates its potential as a tool for automated HT generation.

6.7. LLaMA3 Performance

Overall Performance Metrics: LLaMA3 demonstrated more limited success in HT generation and insertion compared to the other evaluated models. The model achieved a compilation success rate (Eval0) of 88.9%, matching the performance of GPT-4 and Gemini-1.5-Pro in this initial stage. However, LLaMA3 showed significant degradation in subsequent metrics. The normal operation preservation rate (Eval1) was 75.0%, indicating that a quarter of the compiled HTs disrupted the original functionality of the designs. The HT functionality success rate (Eval2) was particularly low at 33.3%, suggesting difficulties in implementing the intended malicious behavior. Furthermore, only half of the functional HTs survived the synthesis process (Eval3: 50.0%), indicating potential issues in generating hardware-synthesizable constructs. Only one out of nine attempts was successful (11.1% success rate), indicating significant challenges in generating functional and synthesizable HTs.

Design-Specific Performance: LLaMA3’s performance varied across different hardware designs. For the SRAM design, only one out of three HTs was successful. The AES-128 design was more challenging, with all three attempts failing at different stages. The UART design saw similar struggles, with no fully successful HTs.

Challenges and Limitations: Key issues identified in LLaMA3’s performance include problems with variable handling (i.e., getting the variable names wrong, not initializing the registers), implementing unsatisfiable trigger conditions, and generating HTs that could not survive the synthesis process. Overall, the model struggled to translate high-level HT concepts into correct hardware implementations.

Implications for Hardware Security: Despite its limitations, LLaMA3’s partial success, particularly with the simpler design (SRAM), indicates some baseline capability in HT generation. However, its performance underscores the challenges with less advanced LLMs for this complex task.

6.8. Overall Hardware Overhead Analysis

Analysis of the 14 successfully synthesized HTs in Table 5 reveals that the GHOST framework predominantly generates HTs with low hardware overhead (in terms of total physical cell count).

Overhead Distribution: The overhead distribution demonstrates that the majority of generated HTs achieve a favorable total physical cell count:

10 HTs (71.4%) exhibit low overhead (0.00–1.82%)
2 HTs (14.3%) show moderate overhead (9.42–15.50%)
2 HTs (14.3%) display higher overhead (22.80–40.72%)

Context for High Overhead Cases: The two HTs with notably high overhead (SRAM HT1: 40.72% and UART HT1: 22.80%) occur in small baseline designs where percentage metrics are particularly sensitive to absolute cell count additions:

SRAM HT1 (40.72%): Implements a counter-based trigger reaching 50,000 cycles, adding 4465 cells to a 10,964-cell baseline. The SRAM design itself is relatively small (52 lines of code), making the percentage overhead appear large despite the trigger’s complexity.
UART HT1 (22.80%): Uses a counter trigger requiring 1 million cycles, adding 75 cells to a 329-cell baseline. Similarly, the UART design is compact (430 lines of code), amplifying the percentage impact.

In contrast, larger designs like AES-128 (768 lines of code) demonstrate significantly lower overhead percentages even when implementing similar counter-based trigger mechanisms. For instance, GPT-4’s AES-128 HT3 utilizes a 1 million-cycle counter but achieves only 0.22% overhead due to the larger baseline (169,168 cells).

Framework Flexibility: Importantly, GHOST provides adversaries with fine-grained control over overhead characteristics through prompt engineering. By specifying constraints such as “Implement the HT with minimal area overhead (<5%) by reusing existing signals”, adversaries can explicitly guide the LLM to generate HTs that meet specific stealth requirements.

Implications: The finding that over 71% of successfully generated HTs achieve low overhead without explicit overhead constraints in the prompts indicates the framework’s natural tendency toward stealthy implementations. This characteristic, combined with the flexibility to specify explicit overhead constraints when needed, makes GHOST a versatile tool for generating HTs.

Physical Cell-Level Analysis

To understand how synthesis achieves low cell overhead, we performed physical cell-level analysis of the final flattened designs after ABC technology mapping to SkyWater 130nm PDK cells. This analysis examines how HT logic is realized in actual silicon, providing insight beyond aggregate cell counts. Table 6, Table 7 and Table 8 present the top 15 most changed physical cell types for each design, showing how synthesis redistributes logic among specific standard cell implementations.

Zero Cell Overhead Through Logic Restructuring: The physical cell analysis reveals the mechanism behind GPT-4 HT1’s zero cell overhead for AES-128. While the total cell count remains 169,168 (identical to baseline), the distribution of cell types changes: xor3_1 increases from 160 to 256 (+60.00%), while xor2_1 decreases from 4144 to 4048 (−2.32%). The HT logic is realized by reconfiguring existing XOR logic paths rather than instantiating new gates during the synthesis process. Most other cell types remain unchanged (0.00%), demonstrating that the HT was implemented through sophisticated logic restructuring, not resource addition.

Physical Cell Type Diversity: The baseline AES-128 design utilizes 60 unique physical cell types, while all HT infected AES-128 versions introduce only 3 new types in total (mux2_1, a311oi_1, and $_DFF_PP0_). For SRAM, the baseline uses 47 cell types, with HT-infected SRAM designs adding 14 new types. In contrast, UART uses 38 baseline types, with 11 new additions. This limited introduction of new cell types across most HTs demonstrates the ability of synthesis optimization to implement HTs within existing, pre-designed cell vocabularies by restructuring and resource sharing.

Contrasting Overhead Profiles: The SRAM GPT-4 HT1 case (40.72% overhead) provides instructive contrast directly reflecting the counter-based trigger’s hardware requirements. In contrast, SRAM GPT-4 HT2/HT3 show minimal changes across most cell types, achieving 0.90–0.94% overhead through efficient trigger implementations.

Detection Implications: Physical cell-level analysis has critical implications for hardware Trojan detection. Traditional anomaly detection approaches that flag overhead in aggregate metrics (such as total cells or area) can be evaded through synthesis-level optimization. Even cell-type distribution analysis faces challenges, as zero-overhead cases show most cell types unchanged, with only selective XOR logic redistribution. The ability of modern synthesis tools to absorb HT logic through logic restructuring rather than resource addition fundamentally limits the effectiveness of overhead-based detection heuristics, particularly for simple trigger conditions that can be folded into existing logic paths.

6.9. GHOST HT Benchmark Exploration and Applicability

This section examines the applicability and characteristics of the generated HT benchmark through four perspectives. First, we analyze physical implementation characteristics through RTL-to-GDS synthesis to quantify area, power, and timing impacts at the silicon level. Second, we explore stealthiness and obfuscation strategies, examining naming conventions, port reuse patterns, and prompt-driven customization for evading detection. Third, we discuss scalability considerations and design complexity limitations. Finally, we address the novelty and contributions of the GHOST benchmark in the context of existing HT research.

6.9.1. Physical Implementation and PPA Analysis

To characterize GHOST-generated HTs beyond cell-count metrics (Table 5), we performed RTL-to-GDS physical implementation using OpenROAD-flow-scripts (ORFS) with SkyWater 130nm PDK. This analysis examines whether HT insertion introduces timing violations, power anomalies, or area overhead at the physical design level.

Implementation Methodology: Physical synthesis was performed using ORFS with SkyWater sky130hd standard cell library. The flow included: (1) logic synthesis with Yosys, (2) floorplanning and placement, (3) clock tree synthesis, (4) global and detailed routing, (5) static timing analysis (STA), and (6) power analysis. Table 9 shows the timing constraints and implementation configurations.

Design Scope: We implemented SRAM and UART designs. SRAM designs range from 51,833 to 56,443 cells post-placement (52 k cells average), while UART designs range from 1750 to 2138 cells (1.8 k cells average). AES-128 designs were excluded due to out-of-memory (OOM) failures during detailed routing.

Table 10 and Table 11 show Power-Performance-Area (PPA) metrics from post-route analysis. Area values represent the core area (the region containing placed cells). Percentage changes (Δ) are relative to the HT-free baseline.

Results: Four out of five SRAM HT variants exhibit area changes within ±0.7% and power changes within ±0.7%. All but one design (Gemini UART HT1, –4.6%) show frequency improvements ranging from +0.9% to +25.2%. Timing improvements occur because additional logic can enable synthesis tool optimization during placement and routing.

SRAM HT1 and UART HT1 from GPT-4 show +12.6% and +13.8% area overhead, consistent with the high cell-count HT1 implementations in Table 5. Functionality-altering HTs (HT1) require more resources than information-leaking (HT2) or DoS HTs (HT3). All designs met timing requirements and passed static timing analysis.

Figure 6 shows the GDS layouts of baseline (HT-free) SRAM and UART designs after physical implementation. The SRAM core area is 183,680 µm² with 45% utilization (45% of core occupied by logic cells), while the UART core area is 4237 µm² with 37% utilization. These layouts serve as baselines for comparing HT-infected variants.

6.9.2. Stealthiness and Obfuscation Strategies

The seemingly explicit variable naming (shown in Listing 2) in some HT implementations (e.g., ‘trojan_out’ and ‘secret_trigger’), was deliberately included for experimental clarity. In real deployment scenarios, these could be obfuscated using standard code transformation techniques, which we applied through LLM-based post-processing. While we performed naming cleanup post-generation, prompt engineering can also direct the LLM to generate obfuscated names from the outset. Analysis of these post-processed GHOST benchmarks reveals multiple stealthiness strategies, some of which emerge autonomously and others through explicit prompting.

Observed Stealthiness Techniques:

(1) Benign Identifier Renaming: Obvious names like trojan_counter or secret_trigger can be replaced with benign-looking identifiers through either post-processing cleanup or explicit prompting during HT generation. In our benchmarks, we applied post-processing to transform explicit names into benign abbreviations that blend naturally with legitimate hardware design. Listing 3 shows actual examples from post-processed GHOST benchmarks across different LLMs and designs. These identifiers use common 2–3 letter abbreviations typical in hardware design:

PS, pm, pc—could represent Packet State, power management, protocol control, packet counter
sc, RS—could represent state counter, sync check, Reset State, register select

These abbreviations appear professional and raise no suspicion during code review

(2) Existing Interface Exploitation: Not all GHOST HTs required additional ports for data exfiltration. In several cases, LLMs autonomously chose to reuse existing ports rather than adding new ones, without being explicitly prompted to do so. This strategy is demonstrated in our SRAM HT2 benchmark (generated by GPT-4), where the existing dout0 data output port is hijacked to leak sensitive memory contents. When the trigger address (0xAA) is accessed, the HT modifies dout0 to output data from sequentially incremented memory addresses, rather than the requested address. This implementation exemplifies both port reuse (using the legitimate data output) and temporal multiplexing (the port alternates between normal operation and data leakage based on trigger conditions). The HT remains stealthy as the leakage occurs through a pre-existing interface without adding suspicious new ports.

Listing 3. Benign Identifier Renaming Examples from GHOST Benchmarks.

Listing 4 shows the actual implementation from the SRAM HT2 benchmark. The HT adds a counter (lines 15–23) that increments when the trigger address is accessed. The payload (lines 27–31) hijacks the existing dout0 port: when reading from the trigger address, instead of returning the data at that address, it returns data from the memory location pointed to by the counter, effectively leaking sequential memory contents through the legitimate data output port.

(3) Prompt-Driven Customization: The framework’s prompt-based approach enables systematic control over HT stealthiness characteristics. By modifying the Contextual Trojan Prompt (CTP) to explicitly request benign naming conventions and architectural constraints (e.g., “Insert an information leakage HT using only existing signals and benign-looking variable names that would not raise suspicion in code review”), adversaries can systematically generate HTs with stealthier profiles.

These obfuscation strategies demonstrate that through post-processing cleanup or targeted prompt engineering, GHOST can generate stealthy HTs that would evade manual code inspection in real-world scenarios.

6.9.3. Quantitative Analysis of Name Obfuscation Impact

To quantify the detectability decrease from name obfuscation, we performed a systematic grep-based analysis on all 14 successful GHOST-generated HT pairs (original HT-infected files with obvious naming versus files with obfuscated naming). We define a vocabulary of suspicious identifiers based on hardware security domain knowledge, including terms related to trojans (trojan, backdoor, malicious), triggers (trigger, activate, armed), payloads (payload, leak, exfiltrate, corrupt, disable), and covert functionality (secret, hidden, covert, shadow, stealth). Using case-insensitive substring matching on actual code (comments excluded), we quantify: N—count of suspicious identifiers before obfuscation; Y—count after obfuscation; and X—binary detection status (whether grep finds any suspicious identifier).

Listing 4. SRAM HT2: Port Reuse and Temporal Multiplexing (GPT-4 Generated).

Table 12 presents the quantitative results. Before obfuscation, 85.7% of HTs (12 out of 14) contained at least one suspicious identifier detectable through simple grep-based scanning, with a total of 116 Trojan-related occurrences across all designs. After post-processing obfuscation, 100% of these suspicious identifiers were successfully replaced with benign-looking alternatives, achieving complete elimination of grep-detectable Trojan-specific naming patterns. Our analysis excludes the term “payload” for UART designs, as it represents legitimate protocol terminology (PAYLOAD_BITS parameter) present in the original Trojan-free UART design, rather than Trojan-related naming. This distinction highlights the importance of context-aware analysis in HT detection.

This quantitative analysis demonstrates that name obfuscation through post-processing or targeted prompting completely eliminates suspicious HT identifiers’ detectability by simple grep-based code review methods. While sophisticated ML-based detection tools, such as HW2VEC, analyze structural properties beyond naming conventions, the complete removal of suspicious identifiers renders manual code inspection and simple automated scanning ineffective for detecting LLM-generated HTs, thereby increasing the threat level that these automated tools pose to hardware security.

Threat Model Context: It is important to note that while this analysis involves before/after comparison of HT-infected designs with name-obfuscated designs for quantification purposes, such comparisons do not represent detection methods applicable in our threat model. As explained in Section 3, downstream entities (foundry, fabrication facilities, end users) in our threat model receive only the HT-infected design without access to the original clean golden reference, making traditional before-versus-after comparison approaches (diff of RTL files, grep for suspicious names, port list comparisons) inapplicable. However, this analysis demonstrates the effectiveness of obfuscation techniques that adversaries could apply within the threat model to further reduce the detectability of suspicious HT identifiers.

6.9.4. Scalability and Design Complexity

Regarding dataset size and scalability to larger designs, our current evaluation focused on three representative designs (SRAM, UART, and AES-128) to establish baseline capabilities of LLMs in HT generation. While these designs are smaller than complex microprocessors, they represent common security-critical components in modern SoCs. Scaling to larger designs presents challenges primarily due to limitations in the LLM context length. However, these issues could be addressed through techniques such as hierarchical analysis or design partitioning.

6.9.5. Benchmark Novelty and Contributions

On the novelty of generated HTs, our framework’s objective was not to create entirely new classes of HTs but rather to demonstrate LLMs’ capability to implement HT functionality from high-level descriptions. While known HT categories inspired the CTPs, the LLMs independently generated the actual circuit implementations without specific implementation guidance. This resulted in unique combinations of triggering mechanisms and payload implementations, even though they fall within established HT categories.

6.10. HT Detection Analysis

We evaluated the detectability of our GHOST Framework-generated HTs using the state-of-the-art ‘HW2VEC’ [44], an open-source ML-based HT detection tool that operates at both RTL and gate-level netlist. The selection of HW2VEC as our primary detection mechanism was driven by pragmatic constraints in the current hardware security landscape. The scarcity of open-source RTL-level HT detection tools significantly limited our options. While other detection methods exist in the literature, recreating these tools is not feasible due to the missing implementation details, time, and complexity of these methods. Therefore, we utilized HW2VEC as it represents a complete, open-source implementation that allows for reproducible evaluation. We used Data Flow Graphs (DFG) for detection and the pre-trained model weights provided by HW2VEC’s authors. Our experiments were performed on a machine equipped with an Intel 12th Gen i7-12700H CPU and an NVIDIA GeForce RTX 3060 Laptop GPU, with a maximum inference time of 4 hours.

Results presented in Table 13 highlight the following key points:

GPT-4’s HTs for SRAM and UART designs went undetected (7–9.5 min inference time).
For AES-128, all GPT-4 HTs caused HW2VEC to timeout (>4 hours).
HTs generated by Gemini-1.5-Pro and LLaMA3 also went undetected or produced inconclusive results.

In our validation of HW2VEC [44], we successfully reproduced the detection accuracy metrics reported in the original paper by running their provided dataset on our local machine. However, we observed significantly longer inference times compared to those reported in the original HW2VEC paper [44]. This discrepancy can be attributed to differences in computational resources. This hardware difference explains why we set a practical timeout threshold of 4 h for our experiments, as waiting longer would be impractical for evaluation purposes. Despite HW2VEC’s previous success with human-written Trust-Hub [4] HTs (F1 score of 0.926), it struggled to detect LLM-generated HTs. This demonstrates the effectiveness of our LLM-based attack framework in generating undetectable HTs, representing a new and severe threat vector that requires further analysis and mitigation. The inability of HW2VEC to detect GHOST-generated HTs can be attributed to the fundamental limitations of GNN approaches and possibly the training data used, rather than indicating weaknesses in our evaluation. GNNs analyze structural relationships within circuits, and their performance heavily depends on the composition of the training data. The fact that HW2VEC’s training dataset contained AES but lacked SRAM and UART designs explains its variable performance across different architectures. This highlights a broader challenge in ML HT detection methods: generalization to structurally different designs.

6.11. Ablation Study: Component Contribution Analysis

To quantify the individual contributions of each GHOST component (RBP, CTP, and RVP), we conducted an ablation study using LLaMA 3 70B accessed via Together.ai API (https://together.ai) on the SRAM controller design. We evaluated four configurations with progressive component addition: (1) Baseline prompting, (2) Baseline+CTP, (3) RBP+CTP, and (4) RVP+RBP+CTP. Each configuration generated 10 HT samples, evaluated through our four-stage pipeline (Eval0-Eval3).

6.11.1. Initial Baseline Configuration Analysis

Initial experiments with a pure baseline configuration (no GHOST components) showed that 5 out of 10 samples exhibited correct HT3 behavior (Denial of Service), while the remaining samples implemented HT1-type modifications (Change functionality). This inconsistency in the Trojan type motivated the use of Baseline+CTP as the comparison point for subsequent experiments, which achieved 10/10 correct HT3 specifications. This suggests that CTP is necessary for consistent adherence to Trojan specification.

The baseline prompt used was minimal and lacked contextual guidance:

Despite explicitly requesting denial of service (HT3), the LLM frequently produced HT1-type trojans that corrupt data instead of disabling the module. Listing 5 from the baseline configuration illustrates this misinterpretation:

Listing 5. Baseline Sample 1: Data Corruption (HT1) - Writes all 1’s instead of disabling the module.

The LLM-generated code includes comments labeling the behavior as “DoS” (Denial of Service), but the actual implementation corrupts data by writing all 1’s to memory. This is HT1 (Change functionality) rather than HT3 (Denial of Service). The code contains a syntax error: for (int i = 0; i < RAM_DEPTH; i++) uses the increment operator i++, which does not exist in Verilog (correct syntax is i = i + 1). This shows the LLM misinterpreted the Trojan specification and generated syntactically incorrect HDL code. CTP’s explicit strategy guidance addresses both issues.

In contrast, when CTP was added (Baseline+CTP configuration), all 10 samples correctly implemented HT3 behavior. Listing 6 demonstrates proper denial of service implementation:

Listing 6. Baseline+CTP Sample 3: Correct HT3 Implementation-Disables module by forcing chip select high.

This example disables the module by forcing the chip select signal high (inactive state) when triggered, preventing all read and write operations. This is a denial-of-service (DoS) attack (HT3) rather than data corruption (HT1).

6.11.2. Progressive Component Addition Results

Table 14 presents the evaluation results across all four stages.

6.11.3. Component Contributions and Analysis

The results demonstrate distinct contributions from each GHOST prompting component. CTP proves essential for Trojan-type consistency. Without it, only 50% of samples correctly implemented the requested HT3 (Denial of Service) behavior, with the remainder defaulting to HT1 (Change functionality). This inconsistency motivated the use of Baseline+CTP as the comparison point.

Adding RBP to the CTP baseline improved compilation success (Eval0) from 70% to 80% and functional correctness (Eval1) from 14% to 25%. However, RBP alone proved insufficient for end-to-end success. The single RBP + CTP sample that passed activation testing was optimized away during synthesis, resulting in 0 out of 10 end-to-end successes.

RVP addition produced the most substantial functional improvement, increasing Eval1 success from 25% to 88%, a 3.5× gain. This improvement stems from RVP’s self-validation mechanism, which catches and corrects errors before code generation. More importantly, RVP-generated designs proved more synthesis-robust: RVP + RBP + CTP achieved 100% synthesis success (2/2) compared to RBP + CTP’s 0% (0/1).

The end-to-end success progression (0% Baseline + CTP → 0% RBP + CTP → 20% RVP + RBP + CTP) indicates that all three components are necessary. While the absolute 20% success rate remains modest, it represents a meaningful improvement over configurations lacking any component, and demonstrates that the complete GHOST framework is required for generating synthesis-survivable hardware Trojans.

7. Conclusions

This paper presents a novel framework, GHOST, that takes advantage of commercial LLMs to automate HT design and insertion. We evaluate GPT-4, Gemini-1.5-Pro, and LLaMA3 using different hardware designs and demonstrate high potential for this approach, particularly concerning GPT-4, which shows extraordinary proficiency in generating functional and stealthy HTs. Our analysis using the HW2VEC detection tool showed that LLM-generated HTs consistently evaded detection, creating a new and significant threat vector in hardware security. This work underscores the possibility of a near-future paradigm shift that allows adversaries to generate undetectable HTs with minimal human effort. Although GHOST points out many potential security risks, other avenues also open up for research. Specifically, future work on developing robust detection techniques against LLM-generated HTs and exploring LLMs in developing defensive measures should be pursued. By understanding and anticipating these AI-driven threats, we can develop more resilient and secure hardware systems for the future.

Author Contributions

M.O.F. prepared the initial manuscript draft, which was subsequently extensively edited by all authors. M.O.F. conducted all technical development work, while technical feedback, discussions, and conceptual ideas were collaboratively developed by the entire team. All authors have read and agreed to the published version of the manuscript.

Funding

The work is partially supported by the National Science Foundation under award numbers 2219679, 2219680, and #OIA-2417062.

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Informed Consent Statement

Not applicable for studies not involving humans.

Data Availability Statement

The prompt templates used in the GHOST framework are provided in Appendix A to enable methodology reproduction. Following guidance from our institution’s research compliance office regarding the distribution of offensive cybersecurity materials, the HT-infected RTL designs and synthesized netlists are available to verified academic and industry researchers upon request. Researchers may contact the corresponding author (badawy@nmsu.edu) with their institutional affiliation and intended use case to request access to these materials.

Acknowledgments

We acknowledge the use of Claude 3.5 (https://claude.ai/, accessed October 2024–December 2025) to improve the organization and academic writing of this document. No portion of this work was produced exclusively by any AI tools.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Complete GHOST Prompt Template

This appendix provides the complete prompt template used in the GHOST framework for HT insertion. The template consists of three main components: Role-Based Prompting (RBP), Contextual Trojan Prompting (CTP), and Reflexive Validation Prompting (RVP).

Appendix B. GHOST Benchmark Detailed Characterization

This appendix provides comprehensive characterization of all 14 successfully generated hardware Trojans in the GHOST benchmark suite. Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 detail the trigger mechanisms, payload effects, activation probabilities during normal operation, and post-activation behavior for each HT.

HT Naming Convention: To systematically organize the benchmark and enable traceability, each HT identifier follows a three-digit naming scheme where the first digit indicates the LLM model (1 = GPT-4, 2 = Gemini, 3 = LLaMA3), the second digit indicates the HT type (0 = HT1/Change Functionality, 1 = HT2/Leak Information, 2 = HT3/Denial of Service), and the third digit indicates the attempt number (0 = first attempt). For example, AES-HT120 denotes an AES design with a GPT-4-generated Type 3 (DoS) HT from the first attempt.

Activation Probability Calculation: All activation probabilities are calculated from a normal operation perspective (without adversary intervention) assuming: (1) 100 MHz clock frequency, (2) random/uniform data distributions, and (3) external trigger pins held at default inactive state. Probability formulas used:

Counter-based: P ≈ 100%; Time to trigger $= threshold / f_{c l k}$
Data/Address pattern: $P (match) = 1 / 2^{N}$ for N-bit value; $P (consecutive) = {(1 / 2^{N})}^{M}$ for M consecutive matches
Multi-condition: $P (all) = P (C_{1}) \times P (C_{2}) \times \dots \times P (C_{n})$ for independent conditions
External trigger: P ≈ 0% during normal operation (trigger pin inactive)

Table A1 provides a high-level classification of all 14 HTs, showing the generating LLM model, target design, HT type (HT1: Change Functionality, HT2: Leak Information, HT3: Denial of Service), trigger mechanism category, and trigger implementation type (combinational or sequential). Trigger types are classified into five categories:

External: Activated via dedicated input pin requiring adversary control. Implementation is combinational (direct signal check).
Internal Counter: Activated when an internal counter reaches a threshold. Implementation is sequential (requires state elements).
Internal Data/Address Pattern: Activated when specific data or address values are observed. Implementation is combinational (direct comparison) or sequential (if tracking consecutive occurrences).
Internal Rare-Event: Activated by statistically improbable conditions (e.g., consecutive pattern matches). Implementation is sequential (requires counters to track consecutive occurrences).
Internal Conditional: Activated when a specific internal signal condition persists. Implementation is sequential (requires state tracking).

Table A1. GHOST Benchmark Overview: HT Classification by LLM, Design, and Payload Type.

HT ID	LLM	Design	HT Type	Payload Category	Trigger Type	Impl.
AES-HT100	GPT-4	AES-128	HT1	Change Functionality	External	Comb.
AES-HT110	GPT-4	AES-128	HT2	Leak Information	External	Comb.
AES-HT120	GPT-4	AES-128	HT3	Denial of Service	Internal Counter	Seq.
AES-HT210	Gemini	AES-128	HT2	Leak Information	External	Comb.
AES-HT220	Gemini	AES-128	HT3	Denial of Service	Internal Rare-Event	Seq.
SRAM-HT100	GPT-4	SRAM	HT1	Change Functionality	Internal Counter	Seq.
SRAM-HT110	GPT-4	SRAM	HT2	Leak Information	Internal Address	Comb.
SRAM-HT120	GPT-4	SRAM	HT3	Denial of Service	Internal Rare-Event	Seq.
SRAM-HT220	Gemini	SRAM	HT3	Denial of Service	Internal Rare-Event	Seq.
SRAM-HT320	LLaMA3	SRAM	HT3	Denial of Service	Internal Conditional	Seq.
UART-HT100	GPT-4	UART	HT1	Change Functionality	Internal Counter	Seq.
UART-HT110	GPT-4	UART	HT2	Leak Information	External	Comb.
UART-HT200	Gemini	UART	HT1	Change Functionality	Internal Data Pattern	Comb.
UART-HT220	Gemini	UART	HT3	Denial of Service	Internal Rare-Event	Seq.

Trigger Types: External (dedicated input pin), Internal Counter (threshold-based), Internal Address/Data Pattern (value comparison), Internal Rare-Event (consecutive pattern match), Internal Conditional (signal persistence). Impl.: Comb. = Combinational (direct logic), Seq. = Sequential (requires state elements/counters).

Table A2 details the specific trigger conditions and payload effects for each HT. Trigger conditions range from simple external signals to complex multi-cycle pattern detection, while payloads include data corruption, key leakage, and permanent denial-of-service attacks.

Table A2. GHOST Benchmark Trigger Conditions and Payload Effects.

HT ID	Trigger Condition	Payload Effect
AES-HT100	External signal `trojan_trigger` = 1	Corrupts encryption by XORing additional key with state
AES-HT110	External signal `secret_trigger` goes high	Leaks internal AES key `k0` via `trojan_out` pin
AES-HT120	32-bit counter reaches 1,000,000 cycles	Resets state and key to zero (destroys encryption)
AES-HT210	External signal `trigger_signal` = high	Leaks 128-bit intermediate key `k0b`
AES-HT220	`state[7:0]==0xAA` AND `key[7:0]==0x55` for 255 consecutive cycles	Halts encryption by preventing state/key updates
SRAM-HT100	16-bit counter reaches 50,000 cycles	Inverts write data (corrupts memory writes)
SRAM-HT110	Read operation to `TRIGGER_ADDR`	Leaks sequential memory contents via covert channel
SRAM-HT120	Consecutive all-one addresses (`0x7F`) for 15 cycles	Disables read/write while condition persists
SRAM-HT220	Access to address `0b1010101` ten consecutive times	Disables read/write while condition persists
SRAM-HT320	`web0` signal high for 4 consecutive cycles	Forces chip select high, permanently disables module
UART-HT100	16-bit counter reaches 65,535 (0xFFFF)	Forces `uart_rx_valid` to false (drops packet)
UART-HT110	External signal `trojan_en` = high (synchronized with valid data)	Leaks received data via `trojan_data` pin
UART-HT200	Received data equals `0xAB`	Permanently inverts all subsequent received data
UART-HT220	Receive byte `0xAA` eight consecutive times	Permanently blocks output data

Table A3 quantifies the activation probability for each HT during normal operation (without adversary intervention). Probabilities are calculated assuming a 100 MHz clock frequency and random/uniform data distributions. External trigger HTs have near-zero probability since the trigger pins remain inactive during legitimate use.

Table A3. GHOST Benchmark Activation Probability During Normal Operation.

HT ID	Probability Rationale	Activation Prob.	Category
AES-HT100	External pin inactive during normal operation	∼0%	Near-Zero
AES-HT110	External pin inactive during normal operation	∼0%	Near-Zero
AES-HT120	Deterministic counter fires every 10 ms @ 100 MHz	∼100%	Very High
AES-HT210	External pin inactive during normal operation	∼0%	Near-Zero
AES-HT220	Requires specific input pattern for 255 cycles	∼10⁻¹²³²	Negligible
SRAM-HT100	Deterministic counter fires every 0.5 ms @ 100 MHz	∼100%	Very High
SRAM-HT110	Random address access: 1/128 per read (7-bit addr)	∼0.78%	Low
SRAM-HT120	Consecutive `0x7F` addresses: ${(1 / 128)}^{15}$	∼10⁻³²	Negligible
SRAM-HT220	Same address 10× consecutive: ${(1 / 128)}^{10}$	∼10⁻²¹	Negligible
SRAM-HT320	Write-enable high 4 consecutive cycles: ${(0.5)}^{4}$	∼6.25%	Medium
UART-HT100	Deterministic counter fires every 0.65ms @ 100MHz	∼100%	Very High
UART-HT110	External enable pin inactive during normal operation	∼0%	Near-Zero
UART-HT200	Random data: P(byte=0xAB) = 1/256	∼0.39%	Low
UART-HT220	Consecutive `0xAA` bytes: ${(1 / 256)}^{8}$	∼10⁻¹⁹	Negligible

Categories: Very High (∼100%, deterministic counters); Medium (1–10%, usage-dependent); Low (0.1–1%, data patterns); Negligible (<10⁻⁶, rare events); Near-Zero (∼0%, external triggers inactive without adversary).

Table A4 characterizes the post-activation behavior and recovery mechanisms for each HT. Post-activation behavior types are classified into five categories:

Continuous: Payload active while trigger signal is held high.
Conditional: Payload active while a specific condition persists (e.g., consecutive address matches) and automatically deactivates when the condition ends.
Persistent: Payload permanently latched after a single trigger event, requiring system reset.
Periodic: Payload fires at regular intervals as counter wraps.
One-shot: Single payload event per trigger occurrence.

Recovery mechanisms, which indicate how the system returns to normal operation after HT activation, are classified into three categories:

Self-reset: HT deactivates automatically when trigger condition is no longer satisfied (e.g., external enable pin goes low, or consecutive address pattern breaks), returning the circuit to normal operation without intervention.
Auto-cycle: Automatically resets after each payload event, allowing repeated activations.
Hard reset: HT has latched internal state that persists until full system reset.

Of the 14 HTs, 11 (78.6%) recover automatically, while 3 (21.4%) require a hard reset.

Table A4. GHOST Benchmark Post-Activation Behavior and Recovery.

HT ID	Activation Behavior	Post-Trigger Effect	Recovery
AES-HT100	Continuous (Level)	Active only while trigger = 1; normal when trigger = 0	Self-reset
AES-HT110	Continuous (Level)	Leaks key while trigger high; stops when low	Self-reset
AES-HT120	Periodic (Counter)	Resets state for 1 cycle; counter restarts; repeats	Auto-cycle
AES-HT210	Continuous (Level)	Leaks key while trigger high; stops when low	Self-reset
AES-HT220	Conditional (Counter)	Halts encryption while counter threshold met; resets when inputs change	Self-reset
SRAM-HT100	One-Shot (Event)	Inverts one write at counter threshold; continues	Auto-cycle
SRAM-HT110	One-Shot (Event)	Leaks one address per trigger; increments sequence	Auto-cycle
SRAM-HT120	Conditional (Counter)	Disables memory while consecutive `0x7F` addresses; resets on different addr	Self-reset
SRAM-HT220	Conditional (Counter)	Disables memory while same address repeated; resets on different addr	Self-reset
SRAM-HT320	Persistent (Latch)	Permanently disables module; `trojan_active` latched	Hard reset
UART-HT100	One-Shot (Event)	Drops one packet at counter max; counter wraps	Auto-cycle
UART-HT110	One-Shot (Event)	Leaks one byte per valid+enable; FSM auto-returns	Auto-cycle
UART-HT200	Persistent (Latch)	Inverts all data after 0xAB received; flag never clears	Hard reset
UART-HT220	Persistent (Latch)	Permanently blocks output; `trojan_active` latched	Hard reset

Behavior Types: Continuous—active while trigger held; Conditional—active while condition met, resets when condition ends; Persistent—permanent after single trigger; Periodic—repeats at intervals; One-Shot—single event per trigger. Recovery: Self-reset—returns to normal when trigger/condition removed; Auto-cycle—resets after event; Hard reset—requires system reset.

Table A5 provides aggregated statistics across all 14 HTs, summarizing distributions by LLM model, target design, payload type, trigger mechanism, activation probability category, and recovery requirements.

Table A5. GHOST Benchmark Summary Statistics.

Metric	Description	Count	Percentage
By LLM Model
GPT-4	3 designs × 3 HT types (2 failed)	8	57.1%
Gemini	3 designs × 3 HT types (4 failed)	5	35.7%
LLaMA3	3 designs × 3 HT types (8 failed)	1	7.1%
By Target Design
AES-128	Cryptographic core (128-bit encryption)	5	35.7%
SRAM	Memory module (OpenRAM-based)	5	35.7%
UART	Communication receiver	4	28.6%
By Payload Type
Change Functionality	Corrupts data/computation	4	28.6%
Leak Information	Exfiltrates sensitive data	4	28.6%
Denial of Service	Disables module operation	6	42.9%
By Trigger Type
External (adversary-controlled)	Requires deliberate activation	4	28.6%
Internal Counter (deterministic)	Fires automatically over time	3	21.4%
Internal Pattern/Conditional	Usage-dependent activation	3	21.4%
Internal Rare-Event	Negligible activation probability	4	28.6%
By Normal Operation Activation
Very High (∼100%)	Activates within milliseconds	3	21.4%
Medium/Low (0.39–6.25%)	Usage-dependent	3	21.4%
Near-Zero (∼0%)	Requires adversary intervention	4	28.6%
Negligible (<10⁻⁹)	Effectively never activates	4	28.6%
By Post-Activation Recovery
Self-reset/Auto-cycle	Recovers automatically	11	78.6%
Hard reset required	Permanent until system reset	3	21.4%

Table A6 provides a detailed per-LLM breakdown of the successfully generated HTs, showing the specific designs targeted, trigger mechanisms employed, payload types implemented, and recovery characteristics. This analysis reveals trends in how each LLM approached HT design. GPT-4 produced the most diverse HTs with all payload types, trigger mechanisms, and exclusively recoverable designs (0% requiring hard reset). Gemini showed a preference for rare-event triggers, with 40% of its HTs being persistent and requiring a hard reset. LLaMA3’s single successful HT was a conditional-trigger DoS that requires a hard reset to recover.

Table A6. Per-LLM Analysis of Successfully Generated Hardware Trojans.

Metric	GPT-4	Gemini	LLaMA3	Total
Success Rate
Successful HTs	8 of 9 (88.9%)	5 of 9 (55.6%)	1 of 9 (11.1%)	14 of 27 (51.9%)
By Target Design
AES-128	HT100, HT110, HT120	HT210, HT220	—	5
SRAM	HT100, HT110, HT120	HT220	HT320	5
UART	HT100, HT110	HT200, HT220	—	4
By Payload Type
Change Functionality (HT1)	3 (SRAM, AES, UART)	1 (UART)	0	4
Leak Information (HT2)	3 (SRAM, AES, UART)	1 (AES)	0	4
Denial of Service (HT3)	2 (SRAM, AES)	3 (AES, SRAM, UART)	1 (SRAM)	6
By Trigger Type
External	3 (AES-HT100, HT110, UART-HT110)	1 (AES-HT210)	0	4
Internal Counter	3 (SRAM-HT100, AES-HT120, UART-HT100)	0	0	3
Internal Pattern	1 (SRAM-HT110)	1 (UART-HT200)	0	2
Internal Rare-Event	1 (SRAM-HT120)	3 (AES-HT220, SRAM-HT220, UART-HT220)	0	4
Internal Conditional	0	0	1 (SRAM-HT320)	1
By Recovery Mechanism
Self-reset	4	2	0	6
Auto-cycle	4	1	0	5
Hard reset	0	2	1	3
By Activation Probability
Very High (∼100%)	3	0	0	3
Medium/Low	1	1	1	3
Near-Zero (∼0%)	3	1	0	4
Negligible	1	3	0	4

References

Tehranipoor, M.; Koushanfar, F. A survey of hardware trojan taxonomy and detection. IEEE Des. Test Comput. 2010, 27, 10–25. [Google Scholar] [CrossRef]
Xiao, K.; Forte, D.; Tehranipoor, M. Hardware trojans: Lessons learned after one decade of research. Acm Trans. Des. Autom. Electron. Syst. (TODAES) 2016, 22, 1–23. [Google Scholar] [CrossRef]
Cruz, J.; Huang, Y.; Mishra, P.; Bhunia, S. An automated configurable Trojan insertion framework for dynamic trust benchmarks. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 1598–1603. [Google Scholar]
Trust-HUB. Chip-Level Trojan Benchmarks. 2024. Available online: https://trust-hub.org/#/benchmarks/chip-level-trojan (accessed on 26 August 2024).
Jyothi, V.; Krishnamurthy, P.; Khorrami, F.; Karri, R. Taint: Tool for automated insertion of trojans. In Proceedings of the 2017 IEEE International Conference on Computer Design (ICCD), Boston Area, MA, USA, 5–8 November 2017; pp. 545–548. [Google Scholar]
Cruz, J.; Gaikwad, P.; Nair, A.; Chakraborty, P.; Bhunia, S. Automatic hardware trojan insertion using machine learning. arXiv 2022, arXiv:2204.08580. [Google Scholar] [CrossRef]
Gohil, V.; Guo, H.; Patnaik, S.; Rajendran, J. Attrition: Attacking static hardware trojan detection techniques using reinforcement learning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, Los Angeles, CA, USA, 7–11 November 2022; pp. 1275–1289. [Google Scholar]
Sarihi, A.; Patooghy, A.; Jamieson, P.; Badawy, A.-H.A. Trojan playground: A reinforcement learning framework for hardware Trojan insertion and detection. J. Supercomput. 2024, 80, 14295–14329. [Google Scholar] [CrossRef]
Dai, R.; Liu, Z.; Arias, O.; Guo, X.; Yavuz, T. DTjRTL: A Configurable Framework for Automated Hardware Trojan Insertion at RTL. In Proceedings of the Great Lakes Symposium on VLSI 2024, Clearwater, FL, USA, 12–14 June 2024; pp. 465–470. [Google Scholar]
Sarihi, A.; Jamieson, P.; Patooghy, A.; Badawy, A.-H.A. TrojanForge: Adversarial Hardware Trojan Examples with Reinforcement Learning. arXiv 2024, arXiv:2405.15184. [Google Scholar] [CrossRef]
Surabhi, V.R.; Sadhukhan, R.; Raz, M.; Pearce, H.; Krishnamurthy, P.; Trujillo, J.; Karri, R.; Khorrami, F. FEINT: Automated Framework for Efficient INsertion of Templates/Trojans into FPGAs. Information 2024, 15, 395. [Google Scholar] [CrossRef]
Kumar, G.; Shaik, A.H.; Riaz, A.; Prasad, Y.; Ahlawat, S. Compatibility Graph Assisted Automatic Hardware Trojan Insertion Framework. In Proceedings of the 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 31 March–2 April 2025; pp. 1–7. [Google Scholar]
Kokolakis, G.; Moschos, A.; Keromytis, A.D. Harnessing the power of general-purpose LLMs in hardware Trojan design. In International Conference on Applied Cryptography and Network Security; Springer: Cham, Switzerland, 2024; pp. 176–194. [Google Scholar]
Krieg, C. Reflections on trusting TrustHUB. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 29 October–2 November 2023; pp. 1–9. [Google Scholar]
Chang, K.C.; Wang, Y.Y.; Ren, H.T.; Wang, M.H.; Liang, S.Y.; Han, Y.S.; Li, H.Y.; Li, X. Chipgpt: How far are we from natural language hardware design. arXiv 2023, arXiv:2305.14019. [Google Scholar] [CrossRef]
Thakur, S.; Ahmad, B.; Fan, Z.; Pearce, H.; Tan, B.; Karri, R.; Dolan-Gavitt, B.; Garg, S. Benchmarking large language models for automated verilog rtl code generation. arXiv 2022, arXiv:2212.11140. [Google Scholar] [CrossRef]
Thakur, S.; Blocklove, J.; Pearce, H.; Tan, B.; Garg, S.; Karri, R. Autochip: Automating hdl generation using llm feedback. arXiv 2023, arXiv:2311.04887. [Google Scholar] [CrossRef]
Kande, R.; Pearce, H.; Tan, B.; Dolan-Gavitt, B.; Thakur, S.; Karri, R.; Rajendran, J. Llm-assisted generation of hardware assertions. arXiv 2023, arXiv:2306.14027. [Google Scholar] [CrossRef]
Orenes-Vera, M.; Martonosi, M.; Wentzlaff, D. Using llms to facilitate formal verification of rtl. arXiv 2023, arXiv:2309.09437. [Google Scholar] [CrossRef]
Srikumar, P. Fast and wrong: The case for formally specifying hardware with llms. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vancouver, BC, Canada, 25–29 March 2023. [Google Scholar]
Meng, X.; Srivastava, A.; Arunachalam, A.; Ray, A.; Silva, P.H.; Psiakis, R.; Makris, Y.; Basu, K. Unlocking hardware security assurance: The potential of LLMS. arXiv 2023, arXiv:2308.11042. [Google Scholar] [CrossRef]
Nair, M.; Sadhukhan, R.; Mukhopadhyay, D. Generating Secure Hardware using ChatGPT Resistant to CWEs. Cryptology ePrint Archive, Paper 2023/212. 2023. Available online: https://eprint.iacr.org/2023/212 (accessed on 26 August 2024).
Paria, S.; Dasgupta, A.; Bhunia, S. Divas: An llm-based end-to-end framework for soc security analysis and policy-based protection. arXiv 2023, arXiv:2308.06932. [Google Scholar]
Ahmad, B.; Thakur, S.; Tan, B.; Karri, R.; Pearce, H. Fixing hardware security bugs with large language models. arXiv 2023, arXiv:2302.01215. [Google Scholar] [CrossRef]
Fu, W.; Yang, K.; Dutta, R.S.G.; Guo, X.; Qu, G. Llm4sechw: Leveraging domain-specific large language model for hardware debugging. In Proceedings of the Asian Hardware Oriented Security and Trust (AsianHOST), Tianjin, China, 13–15 December 2023. [Google Scholar]
Saha, D.; Tarek, S.; Yahyaei, K.; Saha, S.K.; Zhou, J.; Tehranipoor, M.; Farahmandi, F. Llm for soc security: A paradigm shift. arXiv 2023, arXiv:2310.06046. [Google Scholar] [CrossRef]
Shakya, B.; He, T.; Salmani, H.; Forte, D.; Bhunia, S.; Tehranipoor, M. Benchmarking of hardware trojans and maliciously affected circuits. J. Hardw. Syst. Secur. 2017, 1, 85–102. [Google Scholar] [CrossRef]
Amatriain, X. Prompt design and engineering: Introduction and advanced methods. arXiv 2024, arXiv:2401.14423. [Google Scholar] [CrossRef]
Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2024, 36, 8634–8652. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf (accessed on 28 August 2024).
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
LMSys. Chatbot Arena Leaderboard. 2024. Available online: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (accessed on 28 August 2024).
Ollama. Meta Llama 3: The Most Capable Openly Available LLM to Date. 2024. Available online: https://ollama.com/library/llama3:70b (accessed on 28 August 2024).
Williams, S. Icarus Verilog. 2024. Available online: https://steveicarus.github.io/iverilog/ (accessed on 8 August 2024).
GTKWave GTK+ Based Wave Viewer. 2024. Available online: https://gtkwave.github.io/gtkwave/ (accessed on 8 August 2024).
Yosys Open SYnthesis Suite. 2024. Available online: https://yosyshq.net/yosys/ (accessed on 8 August 2024).
Google. SkyWater Open Source PDK. 2024. Available online: https://github.com/google/skywater-pdk (accessed on 15 March 2024).
Groq Inc. Groq: Fast AI Inference. 2024. Available online: https://groq.com/ (accessed on 30 August 2024).
OpenAI. GPT-4 Model Documentation. 2024. Available online: https://platform.openai.com/docs/models (accessed on 30 August 2024).
Google. Gemini API Documentation. 2024. Available online: https://ai.google.dev/gemini-api/docs/models/gemini (accessed on 30 August 2024).
Groq. Llama 3 Model Documentation. 2024. Available online: https://console.groq.com/docs/quickstart (accessed on 30 August 2024).
Tappero, F. VHDL/Verilog IP Cores Repository. 2024. Available online: https://github.com/fabriziotappero/ip-cores/tree/crypto_core_aes (accessed on 7 September 2024).
Marshall, B. UART: A Simple Implementation of a UART Modem in Verilog. 2024. Available online: https://github.com/ben-marshall/uart (accessed on 7 September 2024).
Guthaus, M.R.; Stine, J.E.; Ataei, S.; Chen, B.; Wu, B.; Sarwar, M. OpenRAM: An open-source memory compiler. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 7–10 November 2016; pp. 1–6. [Google Scholar]
Yu, S.Y.; Yasaei, R.; Zhou, Q.; Nguyen, T.; Al Faruque, M.A. HW2VEC: A graph learning tool for automating hardware security. In Proceedings of the 2021 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), Washington, DC, USA, 12–15 December 2021; pp. 13–23. [Google Scholar]

Figure 1. Assumed Threat Model.

Figure 2. GHOST Framework Key Components.

Figure 3. Evaluation Framework overview.

Figure 4. Pre-synthesis Simulation: Commands (a) and Resulting Waveforms (b).

Figure 5. Post-synthesis Verification: Commands (a) and Resulting Waveforms (b).

Figure 6. Physical Layouts (GDS) of HT-Free Baseline Designs (SkyWater 130 nm PDK).

Table 1. Comparison of Hardware Trojan Insertion Tools.

Tool	Platform	Agent Type	Automatic	Learning Time
Trust-Hub [4]	Both	Human	✗	—
TAINT [5]	FPGA	Human	✗	—
TRIT [3]	ASIC	Human-Config.	✓	✗
MIMIC [6]	ASIC	ML	✓	✓
ATTRITION [7]	ASIC	ML/RL	✓	✓
Trojan Playground [8]	ASIC	RL	✓	✓
DTjRTL [9]	Both	Human-Config.	✓	✗
TrojanForge [10]	ASIC	RL/GAN	✓	✓
FEINT [11]	FPGA	Human-Config.	✓	✗
Kumar et al. [12]	ASIC	Algorithmic	✓	✗
Kokolakis et al. [13]	Both	LLM	✗	✗
GHOST (Our Work)	Both	LLM	✓	✗

Legend: ✓ Yes, ✗ No, — N/A, Config. = Configured.

Table 2. Evaluation Metrics.

Metric	Description	Formula
Compilation Success Rate (Eval0)	Proportion of Trojan-infected designs compiling without errors	$\frac{Compiled}{Total}$
Normal Operation Preservation Rate (Eval1)	Fraction of designs maintaining correct non-triggered functionality	$\frac{Functional}{Compiled}$
Trojan Triggering Success Rate (Eval2)	Proportion of Trojans successfully activated	$\frac{Activated}{Functional}$
Trojan Survival Rate (Eval3)	Fraction of Trojans remaining functional post-synthesis	$\frac{Post - Synth .}{Activated}$

Table 3. LLM Configurations.

Parameter	gpt-4-0613 [38]	gemini-1.5-pro [39]	llama3-70b-8192 [40]
# Params	1.76 T (est.)	1.5 T (est.)	70 B
Temperature	1.0	1.0	1.0
Top-p	1.0	0.95	1.0
Context window (tokens)	8192	2,097,152	8192
Max Output (tokens)	8192	8192	8192
Knowledge Cutoff	Sep 2021	Nov 2023	Dec 2023
Cost per 1M tokens (I/O)	$30.00/$60.00	free-tier	$0.59/$0.79

#: Number of; Params: Parameters; I/O: Input/Output.

Table 4. IP Cores Used in Experiments.

IP Name	Type	Total Lines of Code
AES-128 [41]	Cryptographic Core	768
UART [42]	Communication Core	430
SRAM Controller [43]	Memory Controller Core	52

Table 5. Hardware Trojan Insertion Results and Evaluation Metrics.

LLM	Design	HT	C	U	T	S	# Cells (tj-free/tj-in)	% Overhead
GPT-4	SRAM	HT1	✓	✓	✓	✓	10,964/15,429	40.72%
		HT2	✓	✓	✓	✓	10,964/11,063	0.90%
		HT3	✓	✓	✓	✓	10,964/11,067	0.94%
	AES-128	HT1	✓	✓	✓	✓	169,168/169,168	0.00%
		HT2	✓	✓	✓	✓	169,168/169,424	0.15%
		HT3	✓	✓	✓	✓	169,168/169,543	0.22%
	UART	HT1	✓	✓	✓	✓	329/404	22.80%
		HT2	✓	✓	✓	✓	329/360	9.42%
		HT3	×	—	—	—	—	—
GPT-4 Metrics			E0:	E1:	E2:	E3:
			88.9%	100%	100%	100%
Gemini-1.5-Pro	SRAM	HT1	✓	✓	×	—	—	—
		HT2	✓	✓	×	—	—	—
		HT3	✓	✓	✓	✓	10,964/11,041	0.70%
	AES-128	HT1	✓	×	—	—	—	—
		HT2	✓	✓	✓	✓	169,168/169,424	0.15%
		HT3	✓	✓	✓	✓	169,168/169,973	0.48%
	UART	HT1	✓	✓	✓	✓	329/335	1.82%
		HT2	×	—	—	—	—	—
		HT3	✓	✓	✓	✓	329/380	15.50%
Gemini-1.5-Pro Metrics			E0:	E1:	E2:	E3:
			88.9%	87.5%	71.4%	100%
LLaMA3	SRAM	HT1	✓	✓	×	—	—	—
		HT2	✓	✓	×	—	—	—
		HT3	✓	✓	✓	✓	10,964/11,034	0.64%
	AES-128	HT1	✓	×	—	—	—	—
		HT2	×	—	—	—	—	—
		HT3	✓	✓	✓	×	—	—
	UART	HT1	✓	✓	×	—	—	—
		HT2	✓	×	—	—	—	—
		HT3	✓	✓	×	—	—	—
LLaMA3 Metrics			E0:	E1:	E2:	E3:
			88.9%	75.0%	33.3%	50.0%

Legend: #: Number of, C: Compiled, U: Unaffected, T: Functional HT, S: Synthesized, ✓: Success, ×: Failure, —: Not Applicable, E0–E3: Evaluation Metrics 0–3. Cell colors: Green = Success, Red = Failure, Gray = Not Applicable.

Table 6. Physical Cell Breakdown for AES-128: Top 15 Most Changed Cell Types.

Cell Type	Baseline	GPT-4 HT1	GPT-4 HT2	Gemini HT2	GPT-4 HT3	Gemini HT3
$_DFF_PP0_	0	0	128 (NEW)	0	288 (NEW)	0
a311oi_1	0	0	0	0	1 (NEW)	0
mux2_1	0	0	128 (NEW)	128 (NEW)	0	0
xor3_1	160	256 (+60.00%)	160	160	160	160
xnor2_1	1776	1776	1776	1776	1913 (+7.71%)	1777 (+0.06%)
lpflow_isobufsrc_1	1920	1920	1920	1920	2049 (+6.72%)	1920
$_DFF_P_	6848	6848	6848	6976 (+1.87%)	6592 (−3.74%)	6856 (+0.12%)
xor2_1	4144	4048 (−2.32%)	4144	4144	4025 (−2.87%)	4144
a21oi_1	11,152	11,152	11,152	11,152	11,159 (+0.06%)	11,409 (+2.30%)
nor3_1	12,472	12,472	12,472	12,472	12,478 (+0.05%)	12,734 (+2.10%)
nand3_1	9208	9208	9208	9208	9210 (+0.02%)	9336 (+1.39%)
clkinv_1	584	584	584	584	592 (+1.37%)	585 (+0.17%)
o21ai_0	12,024	12,024	12,024	12,024	12,024	12,152 (+1.06%)
and4_1	200	200	200	200	202 (+1.00%)	202 (+1.00%)
nor2b_1	200	200	200	200	202 (+1.00%)	200

Table 7. Physical Cell Breakdown for SRAM: Top 15 Most Changed Cell Types.

Cell Type	Baseline	GPT-4 HT1	GPT-4 HT2	GPT-4 HT3	LLaMA3 HT3	Gemini HT3
o21ai_0	15	4114 (+27,326.67%)	35 (+133.33%)	3 (−80.00%)	19 (+26.67%)	4 (−73.33%)
nand4_1	6	14 (+133.33%)	8 (+33.33%)	131 (+2083.33%)	108 (+1700.00%)	133 (+2116.67%)
nand2_1	248	4450 (+1694.35%)	274 (+10.48%)	229 (−7.66%)	338 (+36.29%)	320 (+29.03%)
a211oi_1	1	1	1	2 (+100.00%)	4 (+300.00%)	8 (+700.00%)
o31a_1	1	6 (+500.00%)	1	0 (−100.00%)	5 (+400.00%)	0 (−100.00%)
nor2_1	40	54 (+35.00%)	35 (−12.50%)	167 (+317.50%)	34 (−15.00%)	45 (+12.50%)
a211o_1	1	0 (−100.00%)	0 (−100.00%)	4 (+300.00%)	1	1
or4b_1	1	4 (+300.00%)	0 (−100.00%)	0 (−100.00%)	0 (−100.00%)	0 (−100.00%)
nor4b_1	20	79 (+295.00%)	44 (+120.00%)	13 (−35.00%)	5 (−75.00%)	5 (−75.00%)
a21oi_1	28	60 (+114.29%)	72 (+157.14%)	85 (+203.57%)	72 (+157.14%)	92 (+228.57%)
and4b_1	43	138 (+220.93%)	75 (+74.42%)	25 (−41.86%)	6 (−86.05%)	12 (−72.09%)
a2111oi_0	47	115 (+144.68%)	67 (+42.55%)	30 (−36.17%)	15 (−68.09%)	18 (−61.70%)
nand3_1	42	84 (+100.00%)	98 (+133.33%)	76 (+80.95%)	57 (+35.71%)	75 (+78.57%)
and2_0	4	9 (+125.00%)	7 (+75.00%)	3 (−25.00%)	4	3 (−25.00%)
nor4bb_1	56	125 (+123.21%)	66 (+17.86%)	18 (−67.86%)	18 (−67.86%)	8 (−85.71%)

Table 8. Physical Cell Breakdown for UART: Top 15 Most Changed Cell Types.

Cell Type	Baseline	GPT-4 HT1	Gemini HT1	GPT-4 HT2	Gemini HT3
xnor2_1	1	1	10 (+900.00%)	1	2 (+100.00%)
a41oi_1	1	4 (+300.00%)	1	1	2 (+100.00%)
nand4_1	2	8 (+300.00%)	6 (+200.00%)	2	7 (+250.00%)
o21a_1	1	4 (+300.00%)	3 (+200.00%)	1	2 (+100.00%)
a211oi_1	2	4 (+100.00%)	3 (+50.00%)	3 (+50.00%)	3 (+50.00%)
a21boi_0	1	0 (−100.00%)	0 (−100.00%)	0 (−100.00%)	2 (+100.00%)
a21o_1	0	0	0	1 (NEW)	2 (NEW)
a221o_1	0	1 (NEW)	0	0	0
a22o_1	0	0	1 (NEW)	1 (NEW)	1 (NEW)
a2bb2oi_1	0	1 (NEW)	0	0	0
a31oi_1	2	2	2	2	4 (+100.00%)
a32o_1	1	0 (−100.00%)	0 (−100.00%)	0 (−100.00%)	1
clkinv_1	7	14 (+100.00%)	10 (+42.86%)	10 (+42.86%)	11 (+57.14%)
lpflow_inputiso1p_1	0	0	0	1 (NEW)	0
mux2_1	0	0	0	8 (NEW)	0

Table 9. Physical Implementation Setup and Configuration.

Parameter	SRAM	UART
SDC Timing Constraints
Clock Period	10 ns
Target Frequency	100 MHz
Input Delay	2 ns
Output Delay	2 ns
Clock Uncertainty	0.5 ns
Floorplan & Placement Configuration
Core Utilization	40%	30%
Placement Density	0.65	0.55
Core Aspect Ratio	1	1
Clock Tree Synthesis	Automatic	Automatic
Power Delivery Network	Automatic	Automatic

Table 10. SRAM PP Metrics.

Design	Core Area		Power (mW)		Timing
	µm²	Δ%	Total	Δ%	Freq (MHz)	Δ%
Baseline (HT-free)	183,680	–	43.4	–	226.35	–
HT1 (GPT-4)	206,755	+12.6	48.2	+11.1	283.36	+25.2
HT2 (GPT-4)	182,518	–0.6	43.2	–0.5	269.23	+18.9
HT3 (GPT-4)	183,622	–0.0	43.2	–0.5	250.48	+10.7
HT3 (Gemini)	183,211	–0.3	43.3	–0.2	268.60	+18.7
HT3 (LLaMA3)	184,328	+0.4	43.7	+0.7	262.38	+15.9

Table 11. UART PPA Metrics.

Design	Core Area		Power (mW)		Timing
	µm²	Δ%	Total	Δ%	Freq (MHz)	Δ%
Baseline (HT-free)	4237	–	0.67	–	208.85	–
HT1 (GPT-4)	4823	+13.8	0.73	+9.4	230.77	+10.5
HT2 (GPT-4)	4209	–0.7	0.64	–4.6	210.68	+0.9
HT1 (Gemini)	4282	+1.1	0.68	+2.1	199.28	–4.6
HT3 (Gemini)	4607	+8.7	0.69	+3.4	214.06	+2.5

Table 12. Quantitative Impact of Name Obfuscation on Detectability.

LLM	Design	HT	N	X Before	X After	Reduction
GPT-4	AES-128	HT1	6	✓	✗	100%
	AES-128	HT2	9	✓	✗	100%
	AES-128	HT3	0	✗	✗	N/A
	SRAM	HT1	4	✓	✗	100%
	SRAM	HT2	3	✓	✗	100%
	SRAM	HT3	0	✗	✗	N/A
	UART *	HT1	6	✓	✗	100%
	UART *	HT2	42	✓	✗	100%
Gemini	AES-128	HT2	6	✓	✗	100%
	AES-128	HT3	6	✓	✗	100%
	SRAM	HT3	6	✓	✗	100%
	UART *	HT1	4	✓	✗	100%
	UART *	HT3	10	✓	✗	100%
LLaMA3	SRAM	HT3	14	✓	✗	100%
Overall	(14 samples)		116	12/14	0/14	100%

N: Suspicious IDs before; Y: After; X: Detected by grep (✓ = yes, ✗ = no); Reduction: (N − Y)/N×100%; * ‘payload’ excluded for UART.

Table 13. HW2VEC’s Performance on GHOST inserted HTs. Results are categorized as: (1) Not Detected—HT evaded detection within inference time limit, (2) Timed Out—Inconclusive results where inference exceeded 4 h, not equivalent to confirmed evasion.

LLM	Design	HT Type	Detection Status	Inference Time (mm:ss)
GPT-4	SRAM	HT1	Not Detected	07:14.0
		HT2	Not Detected	08:19.6
		HT3	Not Detected	08:01.0
	AES-128	HT1	Timed Out	>4 hrs
		HT2	Timed Out	>4 hrs
		HT3	Timed Out	>4 hrs
	UART	HT1	Not Detected	07:00.6
	UART	HT2	Not Detected	09:31.4
Gemini-1.5-Pro	SRAM	HT3	Not Detected	07:56.5
	AES-128	HT2	Timed Out	>4 hrs
	AES-128	HT3	Timed Out	>4 hrs
	UART	HT1	Not Detected	07:59.1
	UART	HT3	Not Detected	07:10.5
LLaMA3	SRAM	HT3	Not Detected	11:00.5

Legend: Not Detected: Evasion Confirmed, Timed Out: Inconclusive (Inference > 4 h).

Table 14. Ablation Study Results: Progressive Component Addition (LLaMA 3 70B, SRAM Design, N = 10 samples per configuration).

Configuration	Eval0	Eval1	Eval2	Eval3	End-to-End
Configuration	(Compile)	(Functional)	(Activation)	(Synthesis)	Success
Baseline + CTP	7/10 (70%)	1/7 (14%)	0/1 (0%)	–	0/10 (0%)
+RBP (RBP + CTP)	8/10 (80%)	2/8 (25%)	1/2 (50%)	0/1 (0%)	0/10 (0%)
+RVP (RVP + RBP + CTP)	8/10 (80%)	7/8 (88%)	2/7 (29%)	2/2 (100%)	2/10 (20%)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Faruque, M.O.; Jamieson, P.; Patooghy, A.; Badawy, A.-H.A. Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design. Electronics 2025, 14, 4745. https://doi.org/10.3390/electronics14234745

AMA Style

Faruque MO, Jamieson P, Patooghy A, Badawy A-HA. Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design. Electronics. 2025; 14(23):4745. https://doi.org/10.3390/electronics14234745

Chicago/Turabian Style

Faruque, Md Omar, Peter Jamieson, Ahmad Patooghy, and Abdel-Hameed A. Badawy. 2025. "Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design" Electronics 14, no. 23: 4745. https://doi.org/10.3390/electronics14234745

APA Style

Faruque, M. O., Jamieson, P., Patooghy, A., & Badawy, A.-H. A. (2025). Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design. Electronics, 14(23), 4745. https://doi.org/10.3390/electronics14234745

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Unleashing GHOST: An LLM-Powered Framework for Automated Hardware Trojan Design

Abstract

1. Introduction

2. LLMs for Hardware Design and Security

3. Threat Model

4. Proposed Methodology

4.1. Prompt Engineering

4.1.1. Role-Based Prompting (RBP)

4.1.2. Reflexive Validation Prompting (RVP)

4.1.3. Contextual Trojan Prompting (CTP)

4.2. LLM Inference

4.2.1. Model Selection

4.2.2. LLM Tasks

4.2.3. Response Extraction

4.3. GHOST Main Steps

5. Evaluation Methodology

5.1. Pre-Synthesis Simulations

5.1.1. Compilation Verification (Eval0)

5.1.2. Functional Consistency Check (Eval1)

5.1.3. Trojan Activation Verification (Eval2)

5.2. Post-Synthesis Simulations (Eval3)

6. Experimental Results

6.1. Experimental Setup

6.2. Large Language Models

6.3. Dataset

6.4. Case Study: An Information Leakage Trojan (HT2) in an AES-128 Cryptographic Core, Designed by GPT-4

6.5. GPT-4 Performance

6.6. Gemini-1.5-Pro Performance

6.7. LLaMA3 Performance

6.8. Overall Hardware Overhead Analysis

Physical Cell-Level Analysis

6.9. GHOST HT Benchmark Exploration and Applicability

6.9.1. Physical Implementation and PPA Analysis

6.9.2. Stealthiness and Obfuscation Strategies

6.9.3. Quantitative Analysis of Name Obfuscation Impact

6.9.4. Scalability and Design Complexity

6.9.5. Benchmark Novelty and Contributions

6.10. HT Detection Analysis

6.11. Ablation Study: Component Contribution Analysis

6.11.1. Initial Baseline Configuration Analysis

6.11.2. Progressive Component Addition Results

6.11.3. Component Contributions and Analysis

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Complete GHOST Prompt Template

Appendix B. GHOST Benchmark Detailed Characterization

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI