A Reproducible FPGA-to-Silicon Verification Methodology for an Embedded SoC Platform in 28 nm CMOS

Sun, Hyeseung; Ryoo, Kwangki

doi:10.3390/electronics15102202

by Hyeseung Sun and Kwangki Ryoo *

Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2202; https://doi.org/10.3390/electronics15102202

Submission received: 24 April 2026 / Revised: 15 May 2026 / Accepted: 19 May 2026 / Published: 20 May 2026

(This article belongs to the Topic Advanced Integrated Circuit Design and Application)


Abstract

Many System-on-Chip (SoC) studies rely solely on simulation and tool-based results, encountering unexpected failures during post-silicon validation. In particular, silicon-level demonstrations of Hardware/Software (HW/SW) functional equivalence, which confirms that an FPGA-validated design operates identically on an ASIC with the same firmware, remain extremely rare. This work proposes a reproducible FPGA-to-silicon verification methodology that establishes HW/SW functional equivalence at the silicon level by applying an identical firmware source code, device driver, and memory map to both platforms. The methodology is validated on an Arm Cortex-M0-based SoC platform fabricated in Samsung 28 nm Low Power Plus (LPP) CMOS technology with a dual Inter-Integrated Circuit (I2C) interface. The fabricated chip integrates two 64 KB on-chip memories within a core area of 653 μm × 769 μm, operates at 125 MHz, and consumes 17.5 mW at the optimal operating point of 1.0 V. The primary contributions are: (1) a reproducible FPGA-to-silicon HW/SW functional equivalence verification methodology based on shared firmware source code, device driver, and memory map across both platforms, (2) silicon-measurement-based performance characterization with verified experimental data, (3) a reproducible design methodology documenting the complete flow from FPGA verification through ASIC fabrication, including static timing closure, place-and-route, and physical verification, and (4) an extensible SoC platform architecture enabling researchers to integrate and validate their own Intellectual Property (IP) via Advanced High-performance Bus (AHB) and I2C interfaces.

Keywords:

Arm Cortex-M0; SoC platform; FPGA-to-silicon verification; functional equivalence; post-silicon validation; 28 nm CMOS; I2C interface; AHB-Lite; verification methodology

1. Introduction

1.1. Research Background and Motivation

Recent integrated circuit design has evolved toward constructing complex SoCs by assembling pre-verified IP (Intellectual Property) blocks, driven by the need to shorten time-to-market and reduce development costs. In application domains such as automotive, IoT, and wearable devices, designs that integrate diverse peripheral and analog/digital IP blocks around an embedded processor have become essential.

However, this IP-based integration approach carries several inherent challenges, particularly in university research environments. First, the absence of a verified platform requires the entire system stability to be re-validated each time a new IP is implemented and integrated. Unforeseen interface-level mismatches between the processor bus and newly added IP blocks frequently cause development delays and force researchers to repeat verification from the ground up. Second, reusability is insufficient. Redesigning an entire system from scratch for every project is highly inefficient, and without a stable, reusable platform, cumulative progress across projects becomes difficult to achieve. Third, many existing studies rely solely on simulation and tool-based results, repeatedly encountering unexpected problems at the post-silicon validation stage.

These challenges motivate the need for a silicon-verified, reusable SoC platform accompanied by a systematic FPGA-to-silicon verification methodology. The relevant prior work and the specific research gap addressed by this study are discussed in Section 2.

1.2. Technical Objectives

This work addresses two key challenges: the absence of a verified platform, and the lack of a systematic FPGA-to-silicon verification methodology. To address these challenges, four technical objectives are defined.

The first objective is state-of-the-art process implementation. The SoC is implemented in Samsung 28 nm LPP CMOS technology using only standard cells, I/O cells, and on-chip memories, without a Phase-Locked Loop (PLL) or Power Management Kit (PMK). This constraint reflects the reality of university MPW programs, in which PLL and clock management IP are typically unavailable. Within these constraints, the design targets the highest achievable clock frequency in a compact area, implementing a pure Arm Cortex-M0 SoC.

The second objective is extensible interface design. A dual I2C interface is implemented with two independent channels. Channel 0 is designated for connection to external off-chip devices, while Channel 1 faces inward toward the chip interior for integration of digital or analog IP blocks. An AHB Slave interface is additionally exposed at the system boundary to support high-performance digital IP integration.

The third objective is a complete FPGA-to-silicon verification flow. The methodology targets direct demonstration of HW/SW functional equivalence by ensuring that the same firmware source code, device driver, and memory map are applied to both the FPGA and the ASIC without modification. This shared software artifact strategy ensures that any behavioral discrepancy between the two platforms is attributable solely to hardware differences, not software divergence.

The fourth objective is hard macro platform delivery. The platform is provided as a hard macro, enabling designers to connect their own IP to the AMBA bus or I2C interface pins and verify its operation at the software level without modifying the verified core.

1.3. Purpose and Contributions

The purpose of this work is not to competitively achieve peak performance or minimum power consumption at a specific process node. Instead, the focus is on developing an SoC platform whose operation has been stably verified under real chip design conditions and process constraints, and on providing a methodological framework for integrating and validating diverse IP blocks. This work presents four contributions that differentiate it from existing studies.

The first contribution is a reproducible FPGA-to-silicon verification methodology. HW/SW functional equivalence is confirmed by applying an identical firmware source code, device driver, and memory map to both the FPGA and the ASIC, and confirming identical UART terminal output and I2C SDA/SCL waveforms on both platforms through silicon-level measurement. This methodology provides a systematic and repeatable approach that other researchers can apply to their own FPGA-to-silicon transitions.

The second contribution is silicon-measurement-based performance characterization. The complete flow from FPGA verification through 28 nm ASIC fabrication, packaging, and measurement is carried out, and all reported performance data are obtained from post-silicon measurement rather than simulation or EDA tool estimates. This provides reliable performance benchmarks for designers working with comparable platforms.

The third contribution is a reproducible design methodology. The complete design flow is documented in detail, covering HW/SW integration verification on FPGA, static timing analysis and closure, automated place-and-route, real-pattern-based merge, and physical verification. The most aggressive timing constraint achievable without a PLL under MPW program constraints is also proposed and validated.

The fourth contribution is an extensible SoC platform architecture. A dual I2C interface is implemented and validated, with Channel 0 serving external device connectivity and Channel 1 designated for integration of internal digital and analog IP blocks. An AHB Slave interface further extends the platform to support high-performance accelerator integration, establishing a foundation for future multi-IP co-verification studies.

1.4. Organization of the Paper

The remainder of this paper is organized as follows. Section 2 reviews related work and identifies the research gap addressed by this study. Section 3 describes the overall architecture of the proposed platform. Section 4 presents the design methodology in detail, covering the complete flow from FPGA verification to ASIC implementation, along with FPGA-to-silicon functional equivalence verification results. Section 5 presents the measurement results of the fabricated chip, including performance analysis and platform extensibility validation through the I2C interface. Section 6 provides a comparative analysis against prior work, and Section 7 concludes the paper.

2. Background and Related Work

2.1. Overview of Arm-Based SoC Research

Existing Arm-based SoC studies have primarily pursued three directions: energy efficiency optimization through voltage scaling and pipeline reconfiguration [1,2,3,4,5,6,7,8,9,10,11,12,13], system expansion by connecting IP blocks or subsystems to an Arm SoC [14,15,16], and platform- or methodology-oriented frameworks for reusable SoC development [17,18]. Although many prior studies were conducted with corporate support in well-equipped laboratories, the Arm Academic Access (AAA) program now alleviates this barrier by providing university researchers with Arm IP and software development tools.

Performance-oriented studies achieve impressive energy efficiency, but under operating conditions that are incompatible with general-purpose computing. Representative designs report 7.3 pJ/cycle at 2.8 MHz and 0.51 V on a 28 nm process [1], and 43.2 pJ/cycle at 15 MHz and 0.37 V on a 40 nm process [12]. While these results represent the frontier of ultra-low-power design, the operating frequencies of hundreds of kHz to a few MHz limit their applicability to general-purpose embedded workloads. This work instead targets stable operation at 125 MHz, prioritizing platform reproducibility and IP integration capability over minimum energy per cycle.
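The trade-off described above can be made concrete with a short calculation. The following is a minimal sketch that converts between energy per cycle, clock frequency, and average power; it assumes the 17.5 mW figure quoted in the abstract was measured at 125 MHz, and the function names are illustrative only. The result is an indicative number, not a like-for-like comparison, since the measurement conditions of the cited designs differ.

```python
def power_uw(energy_pj_per_cycle, freq_mhz):
    """Average power in microwatts: pJ/cycle * MHz = uW."""
    return energy_pj_per_cycle * freq_mhz

def energy_pj(power_mw, freq_mhz):
    """Energy per cycle in picojoules: (mW * 1000) / MHz = pJ/cycle."""
    return power_mw * 1000.0 / freq_mhz

# Operating points quoted in the text for the ultra-low-power designs.
ref1_uw = power_uw(7.3, 2.8)     # design [1]
ref12_uw = power_uw(43.2, 15.0)  # design [12]

# Assumption: 17.5 mW and 125 MHz describe the same operating point.
this_work_pj = energy_pj(17.5, 125.0)

print(f"[1]:  {ref1_uw:.2f} uW at 2.8 MHz")
print(f"[12]: {ref12_uw:.1f} uW at 15 MHz")
print(f"this work: {this_work_pj:.0f} pJ/cycle at 125 MHz")
```

Under that assumption, the platform spends more energy per cycle than the cited designs but runs at a clock rate roughly one to two orders of magnitude higher, which is the priority stated above.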

Platform- and methodology-oriented studies, represented by CHIPKIT [17] and the OQPSK SoC platform [18], provide reusable frameworks but share a common limitation: none provides silicon-level verification demonstrating that an FPGA-validated design operates identically on an ASIC using the same firmware source code. This work belongs to the platform-oriented category and directly addresses this gap.

2.2. Current Status and Limitations of Open-Source SoC Platforms

The recent proliferation of open-source Process Design Kits (PDKs) and RISC-V-based designs has significantly lowered the barrier to chip design. Efabless’s Caravel harness, built around a PicoRV32-based RISC-V (RV32IMC) core, a 32-bit Wishbone bus, and the SkyWater 130 nm process, has enabled hundreds of projects to be realized as actual silicon [19]. Tiny Tapeout has further reduced the entry barrier to chip design by providing a collaborative tapeout service based on open-source PDKs [20]. However, these open-source ecosystems carry several fundamental limitations. First, they are confined to legacy processes of 130 nm or 180 nm, making it difficult to reflect modern design conditions. Second, the use of non-standard buses such as Wishbone limits compatibility with the AMBA-based IP ecosystem that dominates the industry. Third, post-silicon power measurement data are rarely provided, making it difficult for designers to obtain reliable performance benchmarks. Fourth, no documented cases exist in which HW/SW functional equivalence between an FPGA prototype and an ASIC has been systematically verified and reported. While open-source platforms have driven innovation in accessibility, this work fills the gap of a silicon-verified platform based on a commercial process, accompanied by a reproducible FPGA-to-silicon verification methodology.

2.3. The Gap Between FPGA Prototyping and Silicon Verification

FPGA prototyping serves as a primary means of HW/SW integration verification in the early stages of SoC development. However, studies that demonstrate through silicon-level measurement that a design validated on an FPGA operates identically on an ASIC using the same firmware remain extremely rare. Existing studies either present FPGA verification results and silicon results in separate sections without direct comparison, or do not explicitly address the discrepancies that arise during migration from FPGA to silicon. The FPGA-to-silicon verification methodology proposed in this work directly bridges this gap. By applying an identical firmware source code, driver, and memory map to both platforms and comparing UART terminal output and I2C SDA/SCL waveforms side by side, HW/SW functional equivalence is quantitatively demonstrated.

3. Proposed Platform Architecture

3.1. Overall System Architecture

The core elements of the proposed SoC platform are illustrated in Figure 1. The baseline code from Arm Cortex-M0 DesignStart is instantiated with I/O cells and memories provided by the foundry. The input/output ports of the platform are highlighted in blue.

The Arm Cortex-M0 processor is a 32-bit Reduced Instruction Set Computer (RISC) architecture with a three-stage pipeline. The AMBA bus system consists of an AHB-Lite bus and an Advanced Peripheral Bus (APB), enabling operation across high-speed and low-speed clock domains. This bus architecture follows the approach adopted in prior Arm-based SoC studies, in which high-performance accelerators are connected via AHB and low-speed peripherals via APB [14,15,16]. The main clock, represented as XTAL1, is distributed through the clk_ctrl module and supplied to the main system as multiple derived clocks. The asynchronous reset signal NRST is synchronized to the clock inside the clk_ctrl module before being applied to the main system. The FIFO module shown on the right side of the figure generates a control signal that creates idle periods in the overall system operation; this signal is logically OR-ed with the main reset signal NRST.

The on-chip memory consists of 64 KB each of ROM and RAM. Since the core is a 32-bit processor, an SRAM of size 16,384 × 32 is compiled for use. For FPGA implementation, a block memory type is used, while for ASIC implementation, an SRAM compiled using the Samsung Memory Compiler is employed. The peripheral devices include GPIO 0 and GPIO 1 as AHB bus slaves, and UART, Timer, Dual Timer, and Watchdog Timer as APB bus slaves.
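The relationship between the 16,384 × 32 organization and the 64 KB capacity can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative.

```python
# Word-organised SRAM geometry quoted in the text: 16,384 words of 32 bits.
WORDS = 16_384
BITS_PER_WORD = 32

size_bytes = WORDS * BITS_PER_WORD // 8         # total capacity in bytes
word_addr_bits = (WORDS - 1).bit_length()       # address pins on the SRAM macro
byte_addr_bits = (size_bytes - 1).bit_length()  # CPU byte-address bits spanned

print(size_bytes // 1024, "KB,",
      word_addr_bits, "word-address bits,",
      byte_addr_bits, "byte-address bits")
```

The two extra byte-address bits relative to the word address correspond to the four byte lanes of a 32-bit word.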

To enhance platform extensibility, this work provides both a high-speed interface and a low-speed interface. The high-speed interface is implemented as an AHB Slave and can be used for high-performance IP integration, while the low-speed interface is implemented as I2C and can be utilized for integrating analog and digital IP blocks such as sensors, Analog-to-Digital Converters (ADCs), and Digital-to-Analog Converters (DACs).

3.2. Memory Map Organization

The memory map of the system is organized as shown in Table 1.

Since the baseline code is openly available in Verilog, designers can flexibly add IP blocks and modify the address decoder within the bus. When extending the AHB or APB bus, address allocation is possible in units of 4 KB. Whenever the memory map is modified, the same changes must be reflected in the software code, and the device driver must be updated accordingly so that the Cortex-M0 CPU can access the newly added IP.
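The 4 KB allocation rule can be illustrated with a small behavioral model of an address decoder. This is a sketch under stated assumptions, not the platform's actual decoder: the base address, the slot names, and the helper functions are all hypothetical, and the real assignments are those listed in Table 1.

```python
SLOT_SIZE = 0x1000  # 4 KB allocation unit stated in the text

# Hypothetical base address and slot assignment, for illustration only.
PERIPH_BASE = 0x4000_0000
SLOTS = {0: "timer", 1: "uart", 2: "i2c0", 3: "i2c1"}

def decode(addr):
    """Return (slave_name, offset) for a byte address, or None if unmapped."""
    if addr < PERIPH_BASE:
        return None
    index, offset = divmod(addr - PERIPH_BASE, SLOT_SIZE)
    name = SLOTS.get(index)
    return (name, offset) if name is not None else None

def add_slave(name):
    """Allocate the next free 4 KB slot and return its base address."""
    index = max(SLOTS) + 1
    SLOTS[index] = name
    return PERIPH_BASE + index * SLOT_SIZE

new_base = add_slave("my_ip")          # hypothetical user IP
print(hex(new_base), decode(new_base + 0x10))
```

The base address returned by add_slave is the value that would also have to appear in the firmware headers and device driver, which is why the hardware and software views of the memory map must be updated together.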

3.3. IP Integration via AHB-Lite Interface

This work employs the AHB-Lite bus. AHB-Lite is a lightweight implementation of the standard AHB that supports only a single master. Therefore, any additional IP must be connected as a slave and cannot operate as a master. The platform exposes an AHB Slave interface to the design boundary to support high-performance digital IP integration, consistent with the approach used in deep neural network (DNN) accelerator integration studies [14,15]. Designers can connect their own IP as an AHB Slave within the AHB memory address space and communicate directly with the Cortex-M0 core. An important characteristic of AHB-Lite is its two-stage pipeline operation. During the Address Phase, the address and control signals (HADDR, HWRITE, etc.) are transmitted, and during the Data Phase in the following cycle, the actual data is transferred. Therefore, when connecting memory or high-speed IP as an AHB Slave, this two-stage timing characteristic must be carefully considered. In particular, timing optimization is required at the interface with on-chip memory (see Section 4.3).
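The overlap of address and data phases can be visualized with a cycle-level model. The following is a minimal sketch that assumes zero-wait-state reads (HREADY always high) and ignores all control signals other than the address; the function name and the memory contents at addresses 0x4 and 0x8 are hypothetical, while the value at address 0x0 is the one reported in Section 4.3.1.

```python
def ahb_lite_reads(memory, addresses):
    """Simulate back-to-back zero-wait-state reads; return a per-cycle trace.

    Each trace entry is (cycle, haddr_on_bus, hrdata_on_bus); None marks an
    idle phase.
    """
    trace = []
    pending = None                     # address accepted in the previous cycle
    for cycle, haddr in enumerate(list(addresses) + [None]):
        hrdata = memory[pending] if pending is not None else None
        trace.append((cycle, haddr, hrdata))
        pending = haddr                # address phase now, data phase next
    return trace

# Address 0x0 holds the word reported in Section 4.3.1; the rest is made up.
rom = {0x0: 0x2000_0368, 0x4: 0x0000_00C1, 0x8: 0x0000_00D5}
for cycle, haddr, hrdata in ahb_lite_reads(rom, [0x0, 0x4, 0x8]):
    print(cycle,
          "HADDR=" + ("idle" if haddr is None else hex(haddr)),
          "HRDATA=" + ("-" if hrdata is None else hex(hrdata)))
```

In each cycle the bus carries the address of the next transfer together with the data of the previous one, so a slave that cannot return data in the cycle after it samples the address breaks the pipeline unless wait states are inserted.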

3.4. I2C Interface Design for Platform Connectivity

For a system to function as a platform, it must be equipped with interfaces compatible with both digital and analog IP blocks. Since the standard Arm Cortex-M0 DesignStart package provides only UART and GPIO as external interfaces, this work adopts I2C as the low-speed interface of choice due to its high versatility and widespread use in sensor and ADC integration for embedded platforms [18]. The I2C interface is a core element of the extensibility of the proposed SoC platform, implemented as an APB bus slave and controlled by the processor via memory-mapped access. The detailed address of I2C within the APB subsystem is shown in Figure 2.

The adoption of a dual I2C interface in this platform is intended to maximize both silicon verification capability and platform reusability. I2C Channel 0 is directed toward external off-chip devices, while Channel 1 faces inward and is exposed only as physical pins at the hard macro boundary. Through this predefined interface, designers can freely integrate their own IP during the place-and-route stage. Since Channel 1 can be used not only for digital IP but also for analog IP integration, it serves as a communication interface between the digital and analog domains.

This architecture provides three advantages. First, by separating external I/O operation from communication with internal IP, clock domain interference and digital switching noise are effectively isolated, enabling stable operation when analog IP is integrated in the future. Second, during silicon bring-up and post-silicon verification, external I2C-based validation and internal IP verification can be performed independently, facilitating debugging. Third, Channel 1 allows multiple internal sensors or analog IP blocks to be expanded in a multi-drop configuration without additional bus arbitration logic.

3.5. Software Driver and Software Stack

To validate the I2C IP, it is not sufficient to simply connect it to the platform via the I2C interface. The Cortex-M0 core must also be programmed to directly control it. That is, the core must be able to initialize the newly connected IP and perform read/write operations on its status values, input values, and output results. To enhance the practical utility of the platform, the following software components are prepared.

Interrupt handler entries for newly added IP blocks are registered to support interrupt-driven operation. Device drivers are structured around memory-mapped read/write access. The main function calls the driver to perform initialization, data transmission and reception, and status monitoring of the newly added IP.

A complete example program is provided so that users can easily write application code without needing to understand the underlying hardware details. Figure 3 shows the device driver code and a usage example.

4. Design and Implementation Methodology

This section describes the complete design flow from FPGA prototyping to 28 nm ASIC implementation in detail. The purpose is to provide a methodology that other researchers seeking to implement a similar SoC can reproduce using the same approach. Figure 4 illustrates the overall flow of the proposed methodology. The process begins with RTL design and firmware development, from which two parallel tracks diverge: FPGA prototyping on a Terasic DE2-115 board (Intel Cyclone IV) and ASIC implementation in Samsung 28 nm CMOS. Both tracks share identical firmware source code, device driver, and memory map, denoted by the yellow boxes in the figure, ensuring that any behavioral discrepancy observed between the two platforms is attributable solely to hardware differences rather than software divergence. The four verification stages executed on each platform are: base SoC software verification, I2C peripheral integration, firmware execution, and waveform observation. Upon completion of both tracks, FPGA and ASIC outputs are compared side by side; a match in UART terminal output and I2C SDA/SCL waveforms confirms HW/SW functional equivalence at the silicon level. The following subsections describe each stage in detail.
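The final comparison step of the flow can be expressed as a simple check on captured data. The sketch below is one possible way to automate it, assuming the UART output has been saved as text lines and the I2C waveform has already been decoded into bus bytes by the logic analyzer software; the record layout, field names, and sample values are hypothetical.

```python
def compare_captures(fpga, asic):
    """Compare two capture records and list every mismatch.

    Each record is a dict with 'uart' (list of text lines) and 'i2c'
    (list of decoded bus bytes). Both field names are assumptions.
    """
    mismatches = []
    for channel in ("uart", "i2c"):
        a, b = fpga[channel], asic[channel]
        if len(a) != len(b):
            mismatches.append((channel, "length", len(a), len(b)))
        for index, (x, y) in enumerate(zip(a, b)):
            if x != y:
                mismatches.append((channel, index, x, y))
    return mismatches

# Hypothetical captures from the same firmware image on both platforms.
fpga_run = {"uart": ["EEPROM write done", "EEPROM read: 0xA5"],
            "i2c": [0xA0, 0x00, 0xA5, 0xA1, 0xA5]}
asic_run = {"uart": ["EEPROM write done", "EEPROM read: 0xA5"],
            "i2c": [0xA0, 0x00, 0xA5, 0xA1, 0xA5]}

print("equivalent" if not compare_captures(fpga_run, asic_run) else "differs")
```

An empty mismatch list corresponds to the match condition described above. The check compares decoded content only and does not examine analog waveform properties such as edge rates.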

4.1. FPGA Implementation and Verification

The design was implemented in the development environment listed below and meets setup and hold static timing constraints at 50 MHz; HW/SW integration and verification were then performed on this implementation. Figure 5 shows the FPGA implementation result using Intel Quartus Prime 18.0.

FPGA board: Terasic DE2-115
– Equipped with an Intel Cyclone IV EP4CE115F29 device
– Provides 114,480 logic elements and 3888 Kb of on-chip memory
– Offers sufficient resources for full implementation of the Arm Cortex-M0 SoC
– Includes abundant GPIO pins and an on-board I2C EEPROM (24LC08), making it suitable for this work

Development tools
– Quartus Prime: Verilog-based design entry and FPGA implementation
– MegaWizard: Configuration and implementation of block memory
– Siemens QuestaSim: Functional verification through simulation

Since the standard Cortex-M0 DesignStart provides RAM and ROM as behavioral models, they must be replaced with hard macro memory blocks for FPGA implementation. In this work, memories are generated using the MegaWizard tool in Quartus Prime 18.0. It is important to select a type that preserves the communication protocol and timing between the Cortex-M0 core and memory. A single-port RAM with synchronous read/write and no output register is applied uniformly to both ROM and RAM. Implementing ROM as RAM allows designers to load code at any time without fixing the software content. This requires a dedicated interface capable of transferring data from outside the chip into the ROM region. While adding a separate master to the SoC’s AHB bus is a common approach [17], the standard Arm Cortex-M0 SoC uses AHB-Lite, which supports only a single master. Therefore, this work implements a ROM Writer that operates by intercepting the interface between the core and the memory. Figure 6 shows the internal structure of the ROM Writer.

The ROM Writer consists of a simple structure comprising UART, FIFO, and ROM memory. It receives the compiled HEX code in 8-bit units at 115,200 bps, with the FIFO handling data buffering and overall flow control. The UART module converts the received 8-bit ASCII code back to hex, assembles it into 32-bit words, and writes sequentially to memory starting from the base address in little-endian format. Figure 7 shows the write operation to memory and the normal operation of the core immediately after reset.
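The conversion performed by the ROM Writer can be modeled in software. The following is a minimal sketch, assuming the HEX stream carries two ASCII hex digits per memory byte in ascending address order, with whitespace and line breaks between bytes ignored; the exact file format is not specified here, so this layout and the function name are assumptions.

```python
def hex_stream_to_words(ascii_bytes):
    """Convert a stream of ASCII hex digits into 32-bit little-endian words.

    Assumes two ASCII hex digits per memory byte, lowest address first;
    spaces, tabs, and line breaks are skipped.
    """
    digits = bytes(c for c in ascii_bytes if c not in b" \r\n\t").decode("ascii")
    if len(digits) % 8:
        raise ValueError("stream does not end on a 32-bit word boundary")
    raw = bytes.fromhex(digits)
    return [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]

# Four received bytes form the first word written at the base address.
words = hex_stream_to_words(b"68\n03\n00\n20\n")
rom_image = {4 * i: w for i, w in enumerate(words)}   # byte address -> word
print(hex(rom_image[0]))
```

With this byte order, the stream 68 03 00 20 yields the word 0x20000368, which is the value read back from address 0x0 in the post-layout simulation discussed in Section 4.3.1.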

The ROM Writer operating sequence is as follows.

(1) Immediately after system reset, the ROM Writer acquires control of the ROM.
(2) The HEX file is transmitted from the PC through the ROM Writer’s UART channel.
(3) The FIFO sequentially writes the data to ROM.
(4) Once the write completes, the specified address can be read back for verification, or the entire memory contents can be transmitted to the PC and compared against the original HEX file.
(5) After verification, the ROM Writer gates its own clock to deactivate itself and transfers ROM control to the Cortex-M0.
(6) A system-level reset is issued. Immediately after reset, the core takes control and begins fetching and executing the program from ROM.

HW/SW integration verification was performed on the FPGA. The complete system was constructed using Arm Cortex-M0 DesignStart and the ROM Writer, and the application program was compiled in the Keil MDK environment. Test programs for UART communication, interrupt handling, timer operation, and watchdog timer were executed, and all functions were confirmed to operate correctly as shown in Figure 8.

I2C EEPROM interfacing was subsequently tested using the on-board I2C EEPROM (24LC08) of the DE2-115 board. A test program utilizing the device driver was used to write and read data to and from the EEPROM, confirming correct operation. Figure 9 shows the result confirmed via terminal messages, and Figure 10 shows the waveform captured using a logic analyzer.

The completed final SoC platform, as shown in Figure 11, encompasses both the hardware and software domains. The complete HW/SW integration methodology documented in this work can serve as a reference for other designers seeking to implement a similar SoC platform.

4.2. ASIC Implementation

In this work, the Cortex-M0 SoC verified on FPGA is implemented in Samsung 28 nm LPP CMOS technology. The digital libraries available for the design include standard cells and memories with a nominal voltage of 1.0 V, and I/O cells with a nominal voltage of 1.8 V. The I/O cells incorporate Electrostatic Discharge (ESD) protection and support signal input/output up to 200 MHz. The total chip size is 4000 μm × 4000 μm including the seal ring, and the package type is Low-Profile Quad Flat Package (LQFP) 208. The design flow and EDA tools applied during ASIC implementation are shown in Figure 12 [18], and the detailed information of the EDA tools used at each stage is summarized in Table 2.

This research was made possible through the AAA program, which provided access to the Arm Cortex-M0 DesignStart IP and software development tools, and through IDEC, which provided the MPW chip manufacturing opportunity and EDA license support. The authors thank IDEC and its staff members for their support.

A hierarchical design approach is adopted for the SoC implementation in this work. By pre-implementing each major block and integrating them at the top level, timing and Design Rule Check (DRC) issues can be addressed rapidly at the module level without re-implementing the entire design.

Memory Wrapper: During place-and-route, phantom cells are used, which causes three issues: routing concentrated at the center of memory pins rather than the pins themselves, routing metal patterns penetrating into the memory cell interior, and clock shield VSS patterns entering the memory interior. Since this work uses RAM and ROM of identical size and type, a Memory Wrapper module containing a single memory instance is implemented precisely and reused. Figure 13 shows the correct implementation of the Memory Wrapper, illustrating the problematic situation on the left, the resolved intermediate result in the center, and the final result with correct memory pin connections on the right.

ROM Writer: Since the ROM Writer operates prior to main system operation, it is pre-implemented and receives a separate 50 MHz clock input. It contains a ROM internally and therefore instantiates the pre-implemented Memory Wrapper. During HEX code writing to memory, it holds the Cortex-M0 core in reset and gates the main clock. During main system operation, it gates its own input clock, contributing to the low-power strategy.

Cortex-M0 Core: The core contains a large amount of combinational logic internally, requiring timing optimization. Pre-implementation is performed to meet a 4 ns (250 MHz) clock constraint through the front-end flow, and the set_dont_touch command is applied when instantiated in the top-level module to prevent modification of the internally synthesized design.

Cortex-M0 SoC: This is the top-level hierarchical design containing all modules described above, implemented as a rectangular block. It is prepared as a hard macro in Samsung 28 nm LPP CMOS technology, and the platform interface specification is available for designers seeking to integrate their own IP blocks. In this work, Signal I/O cells are connected to the GPIO 0, GPIO 1, I2C, and UART ports, and Power I/O cells are added to supply power to the core region, forming an I/O domain consisting of a total of 160 I/O cells.

4.3. Key Implementation Issues and Solutions

This section describes the critical implementation challenges encountered during the design of the Cortex-M0 SoC in Samsung 28 nm LPP CMOS technology and the solutions applied to resolve them.

4.3.1. AHB Bus and Memory Timing Optimization

The timing characteristics of the AHB-Lite bus and the Samsung on-chip SRAM introduce a potential timing violation on the core-to-memory read path. The AHB-Lite bus operates as a two-stage pipeline: when the Cortex-M0 issues an instruction fetch request, the data must be returned within two clock cycles. While this constraint is readily satisfied in behavioral simulation, the actual timing behavior in silicon is strongly dependent on the memory timing characteristics specified by the foundry. The timing sequence for an instruction fetch is as follows:

T0: The Cortex-M0 issues an instruction fetch request, driving HADDR and the AHB control signals.
T1: The AHB slave (ROM) receives HADDR and the control signals, and generates the corresponding memory control signals.
T2: The ROM asserts the address and control signals to the SRAM, which then drives the instruction data to its output.
T3: The CPU core completes the instruction read.

Since the core requires the instruction to be available at T2 but the SRAM does not present valid data until T3, a one-cycle latency mismatch causes incorrect operation of the entire SoC. To resolve this, the address and control signal generation logic inside the AHB_ROM module is implemented on the negative clock edge. This allows the memory control signals to be prepared half a cycle earlier than T1, enabling the SRAM to present valid instruction data at the correct timing. Because only the clock edge is changed, this solution incurs no area or power overhead and introduces no additional design risk. Figure 14 shows a comparison of the AHB bus and memory timing before and after applying this optimization.

Figure 15 shows the post-layout simulation waveform confirming correct AHB-to-memory read timing with the Samsung on-chip SRAM model. At the rising edge where HADDR = 0x0 and HWRITE = 0 (T0), the Cortex-M0 initiates an instruction fetch. HRDATA = 0x2000_0368 is correctly presented two rising edges later (T2), confirming that the negative-edge clock implementation resolves the one-cycle latency mismatch.

4.3.2. Static Timing Closure

The primary objective in implementing the Cortex-M0 SoC was to achieve the highest operating frequency attainable in the target process. Two factors were given priority consideration.

The first is the maximum allowable frequency of the I/O cells. The I/O cells provided for the Samsung 28 nm LPP CMOS process support signal switching up to 200 MHz. Because the Multi-Project Wafer (MPW) program under which this work was fabricated did not permit the use of a PLL or similar clock IP, a 5 ns clock period was initially declared. Following the foundry-recommended Topographical Synthesis flow, a 20% timing margin was applied, resulting in a final constraint of 4 ns (250 MHz).
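The derivation of the 4 ns constraint from the I/O limit and the synthesis margin is a two-step calculation, reproduced below using only the numbers given above. The variable names are illustrative.

```python
io_max_mhz = 200.0   # I/O cell switching limit from the text
margin = 0.20        # timing margin applied during synthesis, from the text

base_period_ns = 1000.0 / io_max_mhz             # period at the I/O limit
constraint_ns = base_period_ns * (1.0 - margin)  # period after the margin
constraint_mhz = 1000.0 / constraint_ns          # equivalent clock frequency

print(f"{base_period_ns:.1f} ns -> {constraint_ns:.1f} ns "
      f"({constraint_mhz:.0f} MHz)")
```

Tightening the period by 20% means the logic is synthesized against a faster clock than the one the I/O cells can actually deliver, which leaves headroom for effects that appear later in the flow.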

The second is the maximum synthesizable operating frequency. The Cortex-M0 core contains a large amount of combinational logic, making it a significant challenge to meet setup and hold time requirements across the numerous timing paths at the 4 ns constraint.

In accordance with Samsung 28 nm LPP CMOS design guidelines, Topographical Mode synthesis and Multi-Corner Multi-Mode (MCMM) analysis with On-Chip Variation (OCV) derating were applied throughout this work [18]. The Regular Voltage Threshold (RVT) library corners selected for timing closure are listed in Table 3. A cell delay derate factor of 1.036 (late) and 0.964 (early) was applied under OCV conditions to ensure conservative timing analysis. Setup time violations were primarily resolved during the front-end stage, while hold time violations were addressed during the back-end stage.
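The effect of the derate factors on slack can be sketched with a simplified single-path model. The code below applies the late factor to the path that should be pessimistically slow and the early factor to the path that should be pessimistically fast; it treats every delay as a cell delay and omits refinements that a sign-off tool applies, such as removing pessimism on the shared portion of the clock path. All delay values in the example are hypothetical.

```python
LATE, EARLY = 1.036, 0.964   # cell delay derate factors quoted in the text

def setup_slack(period, launch_clk, data, capture_clk, setup):
    """Setup check: slow launch and data path against a fast capture clock."""
    arrival = (launch_clk + data) * LATE
    required = period + capture_clk * EARLY - setup
    return required - arrival

def hold_slack(launch_clk, data, capture_clk, hold):
    """Hold check: fast launch and data path against a slow capture clock."""
    arrival = (launch_clk + data) * EARLY
    required = capture_clk * LATE + hold
    return arrival - required

# Hypothetical delays in nanoseconds against the 4 ns constraint.
s = setup_slack(period=4.0, launch_clk=0.5, data=3.0, capture_clk=0.5, setup=0.05)
h = hold_slack(launch_clk=0.5, data=0.1, capture_clk=0.5, hold=0.02)
print(f"setup slack {s:.3f} ns, hold slack {h:.4f} ns")
```

In this example the derating reduces both slacks relative to an analysis with nominal delays, which is the conservative behavior the OCV setting is intended to produce.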

4.3.3. Post-Layout ECO Process

After completion of place-and-route, parasitic resistance and capacitance (RC) values on all nets were extracted using the StarRCXT tool. Static timing analysis (STA) was then performed using PrimeTime, which identified hold time violations on several paths. To resolve these violations, an Engineering Change Order (ECO) process was applied. The insert_buffer command in PrimeTime was used to fix the hold violations, generating an ECO file as output. The flow then returned to the place-and-route environment, where the ECO file was applied to insert the required buffers. Parasitic RC extraction was repeated on the updated layout, and STA was re-executed to confirm that all timing violations had been resolved. Figure 16 shows the post-layout STA result after completion of the ECO process.

4.3.4. Implementation

The digital design area used for floorplanning is 3958 μm × 3958 μm (including I/O cells and bond pads), with a core area of 653 μm × 769 μm. Since the SRAM macros inside the Memory Wrapper occupy the largest area, their placement is considered first. The two memories are placed symmetrically so that their pins face each other at the same vertical alignment. I/O pins are then positioned in the space between the memories, and sufficient area is reserved to accommodate all standard cells. To allow the possibility of providing the design as a hard macro in the future, the SoC block is placed on the right side of the layout, reserving the left side as open space for designers to place their own IP blocks. Figure 17 shows the final state of the hierarchical design, distinguished by the presence or absence of I/O cells.

4.3.5. Physical Verification

The design, having passed timing verification, power/ground connectivity checks, DRC, and Layout Versus Schematic (LVS) verification within the ICC2 place-and-route environment, is exported in GDS format. The GDS is then imported into Cadence Virtuoso, where it is merged with the real-pattern GDS files provided by the foundry. The resulting layout view is shown in Figure 18. When performing DRC verification using the Siemens Calibre tool, two primary types of issues are typically encountered. The first type consists of DRC violations that arise naturally inside the SRAM macros generated by the Samsung Memory Compiler. These are classified as warnings rather than errors and are therefore resolved through waiver processing. The second type consists of antenna violations caused by plasma charging effects during fabrication. These are resolved either by applying metal hopping techniques, in which the routing ascends to a higher metal layer and returns, or by inserting diode cells at the affected nodes. LVS verification is performed by converting the netlist extracted from ICC2 into SPICE format and comparing it against the merged layout.

4.4. FPGA-to-Silicon Functional Equivalence Verification

This section presents the HW/SW functional equivalence between the FPGA prototype and the fabricated silicon. Unlike prior works that present FPGA verification results and silicon results in separate sections without explicit cross-platform comparison, this work places the results of both platforms side by side to provide quantitative evidence of equivalence. The identical firmware source code is compiled with Keil MDK and loaded onto both the FPGA and the ASIC. The only platform-specific adaptation required is the I2C clock divider value, which is adjusted to produce the target 200 kHz SCL frequency on each platform (FPGA: 50 MHz system clock, ASIC: 125 MHz system clock). The same device driver structure, memory map, and application logic are used on both platforms without modification. As shown in Figure 19, both the FPGA and the fabricated ASIC produce identical I2C SDA/SCL waveforms. The SCL frequency of 200 kHz and the transmitted data sequence (0xA0, 0x00, 0x00) are consistent across both platforms, confirming that the proposed methodology successfully achieves FPGA-to-silicon HW/SW functional equivalence. The ASIC-side waveform is further analyzed in Section 5.4 in the context of platform extensibility verification.
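The platform-specific divider adaptation can be sketched as follows. The relation SCL = system clock / divider is an assumption made for illustration. The actual register encoding of the I2C controller is not given in the text and may include additional fixed factors or offsets. The dictionary and function names are hypothetical.

```python
# Divider needed for a 200 kHz SCL on each platform, assuming
# SCL frequency = system clock frequency / divider (illustrative model).
TARGET_SCL_HZ = 200_000
SYSTEM_CLOCK_HZ = {"FPGA": 50_000_000, "ASIC": 125_000_000}  # from the text


def scl_divider(sys_clk_hz: int, scl_hz: int) -> int:
    """Return the integer divider, rejecting ratios that are not exact."""
    divider, remainder = divmod(sys_clk_hz, scl_hz)
    if remainder:
        raise ValueError("system clock is not an integer multiple of SCL")
    return divider


dividers = {name: scl_divider(freq, TARGET_SCL_HZ)
            for name, freq in SYSTEM_CLOCK_HZ.items()}

for name, value in dividers.items():
    print(f"{name}: divider = {value}")
```

Under this assumption, the divider is the only value that differs between the two builds, while the driver code that writes it stays the same.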

5. Measurement Results and Verification

Having confirmed HW/SW functional equivalence between the FPGA prototype and the fabricated ASIC in Section 4.4, this section presents the silicon measurement results. All performance data reported here were obtained from physical measurements on the fabricated chip.

5.1. Chip Design Summary

Prior to presenting the measurement results, the chip design outcomes obtained from EDA tools are summarized in Table 4. A design comprising approximately 2300 gates operating at 250 MHz is implemented within a core area of 653 μm × 769 μm. Including the two on-chip memories, the complete SoC consumes an average power of 5.592 mW during execution of the matrix multiplication benchmark, as estimated by PrimePower. It should be noted that this figure is a pre-silicon estimate generated by the EDA tool, not a measured value. Figure 20 shows photographs of the fabricated bare die and packaged chip. All performance data presented in this work are based on physical measurements of the fabricated silicon, which constitutes a fundamental distinction from prior works that report simulation results only [14,15,17].

5.2. Test Environment

Chip testing is performed using a custom-designed socket module and a printed circuit board (PCB). Figure 21 shows the socket module and test board used in this work. The measurement equipment connected to the test board is as follows:

Power supply: 1.8 V for I/O cells; 0.9–1.6 V for the core and on-chip memories
Logic analyzer: Saleae Logic Pro
Clock source: Pulse generator 81130A
Oscilloscope: Tektronix TDS3052
UART communication: PC connection via USB-to-UART converter

5.3. Measurement Results

The chip is powered and the benchmark program is executed. Power consumption is derived from measurements taken with the test equipment, and execution time is recorded via UART messages. Table 5 presents the matrix multiplication benchmark results measured at room temperature across a range of supply voltage and clock frequency conditions.

The clock frequencies listed represent the highest achievable operating frequency at each supply voltage. The chip was confirmed to operate across a core voltage range of 0.90 V to 1.60 V; applying voltages beyond 1.60 V did not yield operation at higher frequencies. The matrix multiplication benchmark completed in a consistent execution time of 111 µs across all conditions, corresponding to a throughput of 8.6 MOPS.

Each benchmark run was repeated ten times, and the reported power values represent the average of these measurements. All measurements were performed at room temperature (+25 °C) under natural convection cooling with no forced airflow applied to the device.

The voltage sweep range was selected to cover the standard design corners: 0.90 V corresponds to the worst-case (WC) supply corner, and 1.10 V corresponds to the best-case (BC) corner. The nominal 1.00 V point falls at the center of this range. The 1.20 V point extends one step beyond the BC corner to characterize behavior outside the standard design envelope. Above 1.20 V, the step size increases to 0.20 V, as the primary interest is in confirming the degradation trend rather than fine-grained characterization. The 1.02 V point was added to characterize the sensitivity of the optimal operating region. Since 1.00 V lies near the boundary of the near-threshold region, even a small voltage step reveals the nonlinear power-voltage relationship characteristic of this region.

The optimal operating point is defined as the voltage at which power efficiency (MOPS/mW) is maximized. In terms of energy and power efficiency, the optimal operating point is 1.00 V at 125 MHz, at which the chip consumes 17.5 mW and achieves an energy efficiency of 140 pJ/cycle and a power efficiency of 491 MOPS/mW. The supply voltage was set and verified at the power supply output terminal using an oscilloscope. A slight voltage drop between the output terminal and the chip pins is expected due to contact resistance and PCB trace impedance. Since 1.00 V lies near the boundary of the near-threshold region, small variations in the actual pin voltage may cause a disproportionate change in power consumption, consistent with the nonlinear power-voltage relationship of this region. The relatively large power difference observed between 1.00 V and 1.02 V is therefore attributed to this near-threshold sensitivity.
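The energy-per-cycle figure at the optimal operating point follows directly from the measured power and clock frequency. The sketch below reproduces that calculation from the two values quoted in the text; the function name is illustrative.

```python
# Energy per clock cycle = average power / clock frequency.
def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """mW / MHz gives nJ per cycle; multiply by 1000 to express it in pJ."""
    return power_mw / freq_mhz * 1e3


POWER_MW = 17.5    # measured at 1.00 V (from the text)
FREQ_MHZ = 125.0   # operating frequency at 1.00 V (from the text)

energy_pj = energy_per_cycle_pj(POWER_MW, FREQ_MHZ)
print(f"energy per cycle: {energy_pj:.0f} pJ")
```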

Comparing the EDA tool predictions with the measurement results reveals a significant difference. The maximum measured operating frequency of 125 MHz falls short of the design target of 250 MHz, and the measured power of 17.5 mW is substantially higher than the PrimePower estimate of 5.592 mW. The frequency deviation is attributed to the conservative timing closure applied with OCV derating under the MPW program constraint of no PLL. The power deviation is attributed to I/O cell dissipation that is not fully captured in the PrimePower estimate. Although the absolute figures differ from the EDA tool predictions, the measurement results show a clear and consistent trend that aligns with the design intent. The fabricated chip was not optimized for peak performance, but rather designed to operate reliably under a wide range of conditions.

To further investigate the source of this discrepancy, Table 6 presents the PrimePower breakdown at the measured operating condition (1.0 V, +25 °C, 125 MHz). The I/O cells account for 55.61% of the total estimated power, and the clock network accounts for 43.06%. Core logic and on-chip memory together contribute less than 1.4%. This breakdown confirms that I/O switching and clock distribution dominate the pre-silicon estimate, and that the gap between the EDA estimate (3.867 mW) and the measured value (17.5 mW) is primarily attributable to the full-swing external clock traversing the I/O cells without a PLL, and to PCB trace capacitance not captured in the simulation model.
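The breakdown can be checked numerically from the percentages and totals quoted above. The following sketch derives the absolute contribution of each component and the ratio between the measured and estimated power; the dictionary keys are illustrative labels rather than PrimePower report names.

```python
# PrimePower breakdown at 1.0 V, +25 °C, 125 MHz (values from the text).
ESTIMATED_TOTAL_MW = 3.867
MEASURED_MW = 17.5
SHARE_PERCENT = {"io_cells": 55.61, "clock_network": 43.06}

# Whatever is left is core logic plus on-chip memory.
remaining_percent = 100.0 - sum(SHARE_PERCENT.values())

# Absolute power of each listed component in mW.
component_mw = {name: ESTIMATED_TOTAL_MW * pct / 100.0
                for name, pct in SHARE_PERCENT.items()}

# How many times larger the measurement is than the estimate.
gap_ratio = MEASURED_MW / ESTIMATED_TOTAL_MW

for name, value in component_mw.items():
    print(f"{name}: {value:.3f} mW")
print(f"core logic + memory share: {remaining_percent:.2f} %")
print(f"measured / estimated: {gap_ratio:.2f}x")
```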

Power consumption is the sum of dynamic power and leakage power. At a fixed clock frequency, dynamic power scales with the square of the supply voltage (V²) [1,12]. At 1.2 V relative to 1.0 V, the voltage ratio is 1.2×, and squaring this predicts a dynamic power increase of approximately 1.44×, corresponding to roughly 25 mW. This is consistent with the measured value of 24 mW, confirming that the region from 1.0 V to 1.2 V is dominated by dynamic power and follows the expected quadratic scaling. Beyond 1.4 V, the measured power deviates significantly from the quadratic reference line. This indicates that leakage power becomes the dominant contributor. Energy per cycle exceeds 300 pJ/cycle and power efficiency falls below 200 MOPS/mW above 1.4 V.
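The quadratic reference used in this comparison can be written out explicitly. The sketch below scales the 1.0 V measurement by the squared voltage ratio and compares the prediction with the measured 1.2 V value. It assumes that the entire 17.5 mW reference is dynamic power and that the clock frequency is unchanged between the two points, which is a simplification.

```python
# Quadratic (V^2) scaling of dynamic power from a reference operating point.
def scaled_dynamic_power_mw(ref_power_mw: float, ref_voltage: float,
                            voltage: float) -> float:
    """Scale power by the squared voltage ratio, frequency held constant."""
    return ref_power_mw * (voltage / ref_voltage) ** 2


REF_POWER_MW = 17.5        # measured at 1.0 V (from the text)
MEASURED_AT_1V2_MW = 24.0  # measured at 1.2 V (from the text)

predicted_mw = scaled_dynamic_power_mw(REF_POWER_MW, 1.0, 1.2)
deviation_pct = (predicted_mw - MEASURED_AT_1V2_MW) / MEASURED_AT_1V2_MW * 100

print(f"predicted at 1.2 V: {predicted_mw:.1f} mW")
print(f"deviation from measurement: {deviation_pct:.1f} %")
```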

Figure 22 shows the power efficiency (MOPS/mW) across the measured voltage range. Higher bars indicate better efficiency. The optimal point at 1.0 V is highlighted in yellow, the normal operating region up to 1.2 V is shown in green, and the region from 1.4 V onward is shown in blue. Higher clock frequencies are accessible in the blue region, but at significantly reduced efficiency.

Figure 23 shows the power consumption and energy metrics. The x-axis represents the applied supply voltage and corresponding clock frequency, and the y-axis represents power consumption (mW) and energy per cycle (pJ/cycle). The dashed reference line represents the theoretical quadratic scaling of dynamic power with increasing voltage.

The measured results are summarized as follows:

The full operating range is divided into three regions: the optimal region at 1.00 V, the normal region up to 1.20 V, and the poor region at 1.40 V and above.
The worst-case, best-case, and optimal characteristics all appear within ±10% of the nominal voltage of 1.00 V.
Voltage scaling effects are clearly observed within the 125 MHz operating range, while efficiency degrades sharply above 1.4 V despite the availability of higher clock frequencies.

5.4. Platform Extensibility Verification via I2C Interface

Building on the equivalence verification in Section 4.4, the extensibility of the silicon-verified SoC platform is further validated at the chip level. The EEPROM write/read program used during FPGA verification is applied to the fabricated chip without modification, and the operation of the I2C channel is confirmed. Figure 24 shows the verification results of the UART output and the SDA/SCL waveforms of I2C channel 0, compared against the software code. The test program initiates a write operation to address 0x0000 of the EEPROM. The measurement results confirm that the values 0xA0, 0x00, and 0x00 are correctly driven on the SDA pin in accordance with the 200 kHz SCL clock. Since an identical I2C channel is implemented as channel 1 for on-chip use, these results demonstrate that both I2C channels are capable of reliable communication with external devices and on-chip IP blocks.
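The observed byte sequence can be interpreted with a small decoder. In standard I2C framing, the first byte after the start condition carries a 7-bit device address followed by the read/write bit. Treating the next two bytes as a 16-bit memory address is an assumption based on the stated target address 0x0000; the addressing width of the EEPROM used on the test board is not given in the text.

```python
# Decode the I2C bytes observed on SDA: 0xA0, 0x00, 0x00 (from the text).
def decode_control_byte(byte: int) -> dict:
    """Split the first I2C byte into a 7-bit address and the R/W flag."""
    return {"address_7bit": byte >> 1, "read": bool(byte & 0x01)}


observed = [0xA0, 0x00, 0x00]

control = decode_control_byte(observed[0])
# Assumed layout: high address byte first, then low address byte.
memory_address = (observed[1] << 8) | observed[2]

print(f"device address: 0x{control['address_7bit']:02X}")
print(f"operation: {'read' if control['read'] else 'write'}")
print(f"memory address: 0x{memory_address:04X}")
```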

Based on the results presented so far, it is concluded that other researchers can readily integrate and validate their own IP blocks using the proposed platform. For example, a researcher may add a digital accelerator IP through the AHB interface and verify its operation at the software level by using the provided memory map and device driver templates. Alternatively, an external I2C debug channel can be configured via channel 0, while an ADC is connected through channel 1 for on-chip measurement and verification.

6. Comparison and Discussion

6.1. Comparison with Platform-Oriented Prior Works

The preceding section presented raw measurement data. This section interprets those results in the context of prior work and discusses the limitations and future directions of this study.

Table 7 presents a feature comparison with prior works that explicitly propose platform-level reuse or methodology contributions for Arm-based SoC design. Works focused primarily on energy minimization through subthreshold operation are excluded from this comparison, as their design objectives differ fundamentally from the platform-oriented goals of this work.

The comparison results show that this work is the only design among the four compared that simultaneously achieves silicon fabrication in a commercial 28 nm process, post-silicon power characterization across seven voltage points (0.90–1.60 V), and FPGA-to-silicon functional equivalence verification. Equivalence is confirmed by applying three shared software artifacts, namely firmware source code, device driver, and memory map, identically to both platforms, and by matching UART output and I2C SDA/SCL waveforms at 200 kHz on both platforms. None of the three prior works provides post-silicon power measurement or explicit cross-platform equivalence verification through direct waveform-level comparison. While CHIPKIT [17] describes FPGA emulation as a pre-silicon validation step, it does not explicitly demonstrate cross-platform functional equivalence through direct measurement comparison using identical firmware, which is the primary focus of this work. While the open-source platforms represented by Tiny Tapeout have made significant contributions to accessibility, they are based on legacy processes (130/180 nm) and non-standard bus architectures (Wishbone), which limits their compatibility with the industrial AMBA IP ecosystem. This work applies the industry-standard AMBA bus and a commercial advanced process. The proposed methodology is therefore applicable to practical SoC design environments.

As illustrated in Figure 4, the distinguishing feature of the proposed methodology lies in the simultaneous application of identical software artifacts to both the FPGA and ASIC platforms, followed by direct waveform-level comparison at the silicon level. Specifically, the same firmware source code, device driver, and memory map are deployed on both platforms without modification, so that any behavioral discrepancy observed between the two is attributable solely to hardware differences rather than software divergence. This end-to-end traceability from RTL design through post-silicon measurement is not present in any of the platform-level studies summarized in Table 7, and constitutes the primary differentiating contribution of this work.

6.2. Limitations and Future Work

This work establishes the foundation of the proposed platform, but further research can be pursued in terms of energy efficiency and extensibility.

First, the current energy consumption of 140 pJ/cycle at 1.0 V is significantly higher than that of subthreshold designs (7–12 pJ/cycle) [4,12]. This can be improved by applying power gating techniques in future work.
Second, a direct silicon-level demonstration of I2C channel 1 is not provided in this work. Instead, the correct operation of I2C channel 0 is confirmed through the EEPROM test program, and channel 1, which is implemented with identical hardware, is expected to operate reliably on-chip. Future work can demonstrate a mixed-signal SoC by connecting an external device through channel 0 and integrating an ADC through channel 1.
Third, the maximum measured operating frequency of 125 MHz falls short of the design target of 250 MHz. As shown in Figure 16, the post-layout STA confirms zero timing violations across all 131,021 checks, including setup, hold, recovery, and removal, under both MAX and MIN corners at the 4 ns (250 MHz) constraint after ECO. This result indicates that timing closure was achieved at the design target frequency within the EDA environment. The gap between the STA result and the measured operating frequency is attributed to the conservative OCV derating and the absence of a PLL under the MPW program constraints. Both factors introduce design margin that post-layout STA does not fully capture. The precise mechanism has not been conclusively identified and remains a subject of future investigation.

7. Conclusions

This work proposes and implements a silicon-verified SoC platform based on the Arm Cortex-M0 processor, and presents a reproducible FPGA-to-silicon verification methodology. The platform is implemented in Samsung 28 nm LPP CMOS technology within a core area of 653 μm × 769 μm, operates at 125 MHz, consumes 17.5 mW, and achieves an energy efficiency of 140 pJ/cycle and a power efficiency of 491 MOPS/mW. The dual I2C interface supports both external device connectivity via channel 0 and on-chip IP integration via channel 1, demonstrating the extensibility of the proposed platform.

The key contributions of this work are summarized as follows. First, the identical firmware source code, device driver, and memory map are applied to both the FPGA and the fabricated ASIC without modification, and identical UART output and I2C waveforms are confirmed on both platforms through direct measurement, thereby demonstrating FPGA-to-silicon HW/SW functional equivalence. Second, all performance data are obtained through physical measurement of the fabricated silicon rather than simulation, providing reliable performance metrics for subsequent researchers. Third, the complete design flow from FPGA prototyping to ASIC fabrication is documented in detail, covering static timing closure, automated place-and-route, and physical verification, enabling other researchers to reproduce the same approach for their own SoC implementations. Fourth, the extensible platform architecture, equipped with a dual I2C interface and an AHB Slave interface, provides a practical foundation for integrating and validating diverse IP blocks at the silicon level.

Future work includes the construction of a fully integrated mixed-signal SoC platform through analog IP and digital accelerator integration, and energy efficiency improvement via DVFS and clock/power gating. Porting to 14 nm FinFET technology with multiple threshold voltage options is also planned. This work prioritizes verified reliability and practical reusability over peak performance, and the measured data and detailed design methodology obtained through silicon verification provide a trustworthy foundation for both academic researchers and industry designers.

Author Contributions

Conceptualization, H.S.; methodology, H.S.; software, H.S.; validation, H.S.; formal analysis, H.S.; investigation, H.S.; resources, H.S.; data curation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and K.R.; visualization, H.S.; supervision, K.R.; project administration, H.S. and K.R.; funding acquisition, K.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The measurement data presented in this study are available within the article. Source code and design scripts cannot be made publicly available due to licensing restrictions of the third-party IP and EDA tools used in this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ADC	Analog-to-Digital Converter
AHB	Advanced High-performance Bus
AMBA	Advanced Microcontroller Bus Architecture
APB	Advanced Peripheral Bus
Arm	Advanced RISC Machines
ASCII	American Standard Code for Information Interchange
ASIC	Application-Specific Integrated Circuit
CMOS	Complementary Metal-Oxide Semiconductor
CPU	Central Processing Unit
DAC	Digital-to-Analog Converter
DNN	Deep Neural Network
DRC	Design Rule Check
DVFS	Dynamic Voltage and Frequency Scaling
ECO	Engineering Change Order
EDA	Electronic Design Automation
EEPROM	Electrically Erasable Programmable Read-Only Memory
ESD	Electro-Static Discharge
FIFO	First-In First-Out
FPGA	Field-Programmable Gate Array
GDS	Graphic Data System
GPIO	General-Purpose Input/Output
HW	Hardware
I2C	Inter-Integrated Circuit
IC	Integrated Circuit
IP	Intellectual Property
LPP	Low Power Plus
LQFP	Low-Profile Quad Flat Package
LVS	Layout Versus Schematic
MCMM	Multi-Corner Multi-Mode
MOPS	Million Operations Per Second
MPW	Multi-Project Wafer
OCV	On-Chip Variation
PCB	Printed Circuit Board
PDK	Process Design Kit
PLL	Phase-Locked Loop
PMK	Power Management Kit
RAM	Random-Access Memory
RC	Resistance-Capacitance
RISC	Reduced Instruction Set Computer
ROM	Read-Only Memory
RVT	Regular Voltage Threshold
SCL	Serial Clock Line
SDA	Serial Data Line
SoC	System-on-Chip
SRAM	Static Random-Access Memory
STA	Static Timing Analysis
SW	Software
UART	Universal Asynchronous Receiver/Transmitter
USB	Universal Serial Bus

References

1. Jain, S.; Lin, L.; Alioto, M. Processor Energy–Performance Range Extension Beyond Voltage Scaling via Drop-In Methodologies. IEEE J. Solid-State Circuits 2020, 55, 2670–2679.
2. Chandrakasan, A.P.; Daly, D.C.; Finchelstein, D.F.; Kwong, J.; Ramadass, Y.K.; Sinangil, M.E.; Sze, V.; Verma, N. Technologies for Ultradynamic Voltage Scaling. Proc. IEEE 2010, 98, 191–214.
3. Kaul, H.; Anders, M.; Mathew, S.; Hsu, S.; Agarwal, A.; Sheikh, F.; Krishnamurthy, R.; Borkar, S. A 1.45 GHz 52-to-162GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32 nm CMOS. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2012; IEEE: New York, NY, USA, 2012; pp. 182–184.
4. Myers, J.; Savanth, A.; Howard, D.; Gaddh, R.; Prabhat, P.; Flynn, D. An 80 nW retention 11.7 pJ/cycle active subthreshold ARM Cortex-M0+ subsystem in 65 nm CMOS for WSN applications. In Proceedings of the 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, San Francisco, CA, USA, 22–26 February 2015; IEEE: New York, NY, USA, 2015; pp. 1–3.
5. Hsu, S.; Agarwal, A.; Anders, M.; Mathew, S.; Kaul, H.; Sheikh, F.; Krishnamurthy, R. A 280 mV-to-1.1 V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22 nm CMOS. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 19–23 February 2012; IEEE: New York, NY, USA, 2012; pp. 178–180.
6. Wang, J.; Pinckney, N.; Blaauw, D.; Sylvester, D. Reconfigurable self-timed regenerators for wide-range voltage scaled interconnect. In Proceedings of the 2015 IEEE Asian Solid-State Circuits Conference (A-SSCC), Xiamen, China, 9–11 November 2015; IEEE: New York, NY, USA, 2015; pp. 1–4.
7. Jain, S.; Khare, S.; Yada, S.; Ambili, V.; Salihundam, P.; Ramani, S. A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2012; IEEE: New York, NY, USA, 2012; pp. 66–68.
8. Sheikh, F.; Mathew, S.K.; Anders, M.A.; Kaul, H.; Hsu, S.K.; Agarwal, A.; Krishnamurthy, R.K.; Borkar, S. A 2.05 GVertices/s 151 mW Lighting Accelerator for 3D Graphics Vertex and Pixel Shading in 32 nm CMOS. IEEE J. Solid-State Circuits 2013, 48, 128–139.
9. Ickes, N.; Gammie, G.; Sinangil, M.E.; Rithe, R.; Gu, J.; Wang, A. A 28 nm 0.6 V Low Power DSP for Mobile Applications. IEEE J. Solid-State Circuits 2012, 47, 35–46.
10. Jain, S.; Lin, L.; Alioto, M. Automated Design of Reconfigurable Microarchitectures for Accelerators Under Wide-Voltage Scaling. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 777–790.
11. Myers, J.; Savanth, A.; Prabhat, P.; Yang, S.; Gaddh, R.; Toh, S.O.; Flynn, D. A 12.4 pJ/cycle sub-threshold, 16 pJ/cycle near-threshold ARM Cortex-M0+ MCU with autonomous SRPG/DVFS and temperature tracking clocks. In Proceedings of the 2017 Symposium on VLSI Circuits, Kyoto, Japan, 5–8 June 2017; IEEE: New York, NY, USA, 2017; pp. C332–C333.
12. Reyserhove, H.; Dehaene, W. A Differential Transmission Gate Design Flow for Minimum Energy Sub-10-pJ/Cycle ARM Cortex-M0 MCUs. IEEE J. Solid-State Circuits 2017, 52, 1904–1914.
13. Lee, Y.; Kim, G.; Bang, S.; Kim, Y.; Lee, I.; Dutta, P. A modular 1 mm³ die-stacked sensing platform with optical communication and multi-modal energy harvesting. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2012; IEEE: New York, NY, USA, 2012; pp. 402–404.
14. Whatmough, P.N.; Lee, S.K.; Brooks, D.; Wei, G.Y. DNN Engine: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications. IEEE J. Solid-State Circuits 2018, 53, 2722–2731.
15. Lee, S.K.; Whatmough, P.N.; Brooks, D.; Wei, G.Y. A 16-nm Always-On DNN Processor With Adaptive Clocking and Multi-Cycle Banked SRAMs. IEEE J. Solid-State Circuits 2019, 54, 1982–1992.
16. Whatmough, P.N.; Lee, S.K.; Donato, M.; Hsueh, H.-C.; Xi, S.; Gupta, U. A 16 nm 25 mm² SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators. In Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, 9–14 June 2019; IEEE: New York, NY, USA, 2019; pp. C34–C35.
17. Whatmough, P.N.; Donato, M.; Ko, G.G.; Lee, S.K.; Brooks, D.; Wei, G.Y. CHIPKIT: An Agile, Reusable Open-Source Framework for Rapid Test Chip Development. IEEE Micro 2020, 40, 32–40.
18. Sun, H.S.; Cho, I.S. A Proposal of Methodologies for Implementing Digital Chips in the Latest Processes. IDEC J. Integr. Circuits Syst. 2024, 10, 42–48.
19. Efabless Corporation. Caravel Harness SoC Documentation. Available online: https://caravel-harness.readthedocs.io/en/latest/ (accessed on 8 April 2026).
20. Venn, M. Tiny Tapeout: A Shared Silicon Tape-Out Platform Accessible to Everyone. IEEE Solid-State Circuits Mag. 2024, 16, 68–72.
21. Mascorro-Guardado, E.; Luna-Rodriguez, L.A.; Ortega-Cisneros, S.; Becerra-Luna, E.I.; Jimenez-Torres, U.; Murillo-Garcia, E.; Hernandez-Andrade, M. Design and Test of Offset Quadrature Phase-Shift Keying Modulator with GF180MCU Open Source Process Design Kit. Electronics 2024, 13, 1705.

Figure 1. Overall architecture of the proposed Arm Cortex-M0 SoC platform. Blue-highlighted ports indicate the I/O boundary. I2C Channel 1 and the AHB Slave port face inward, designated for on-chip IP integration.

Figure 2. Detailed address map of the dual I2C interfaces within the APB subsystem. I2C Channel 0 (0x4000_9000) is designated for external devices; Channel 1 (0x4000_A000) is reserved for on-chip IP integration. Both channels share an identical hardware implementation.

Figure 3. I2C device driver and usage example. Red boxes highlight the key memory-mapped register access operations. The driver applies to both Channel 0 and Channel 1 by changing only the base address argument.

Figure 4. The proposed FPGA-to-silicon verification methodology. The dashed boundary denotes the scope of the methodology. Purple and teal boxes represent the RTL/firmware starting point and FPGA verification steps, blue boxes represent ASIC implementation steps, and the amber box represents the functional equivalence check, which is the primary verification step. Yellow boxes indicate shared software artifacts applied to both platforms without modification.

Figure 5. FPGA implementation result of the Arm Cortex-M0 SoC on a Terasic DE2-115 board (Intel Cyclone IV EP4CE115F29) using Quartus Prime 18.0. The design meets setup and hold timing constraints at 50 MHz, including the ROM Writer and all peripherals.

Figure 6. Internal architecture of the ROM Writer. HEX code is received at 115,200 bps via UART, buffered in a FIFO, reassembled into 32-bit little-endian words, and written sequentially to ROM. Upon completion, the ROM Writer gates its own clock and transfers control to the Cortex-M0 core.

Figure 7. UART terminal output showing ROM write completion and Cortex-M0 core startup after system reset. Following HEX code transfer via the ROM Writer, the core acquires ROM control and begins program execution immediately after reset.

Figure 8. HW/SW integration verification result on the FPGA platform. UART communication, interrupt handling, timer, and watchdog timer functions all operate correctly, establishing the verified FPGA baseline for the subsequent FPGA-to-silicon equivalence check.

Figure 9. UART terminal output confirming I2C EEPROM write and read operation on the FPGA platform (DE2-115 on-board 24LC08). Values 0–9 are written and read back correctly, establishing software-level I2C correctness prior to the FPGA-to-silicon equivalence check.

Figure 10. Logic analyzer waveform showing simultaneous operation of I2C Channel 0 and Channel 1 on the FPGA platform. This is the sole waveform evidence for dual-channel concurrent I2C operation in the FPGA verification stage.

Figure 11. Complete SoC platform validated through FPGA implementation, encompassing RTL timing closure, ROM Writer-based code loading, HW/SW integration verification, and I2C peripheral testing. This flow constitutes the FPGA-side foundation of the proposed FPGA-to-silicon verification methodology.

Figure 12. ASIC digital design flow for Samsung 28 nm LPP CMOS, following the foundry-recommended Topographical Synthesis approach with MCMM analysis and OCV derating. EDA tool details for each stage are listed in Table 2.

Figure 13. Memory Wrapper implementation: issue identification (left), intermediate correction (center), and final resolved result (right). Routing concentrated at pin centers and clock shield VSS patterns penetrating the memory interior are corrected. A single Wrapper instance is reused for both ROM and RAM, which share an identical SRAM16384×32 macro.

Figure 14. AHB bus and Samsung on-chip SRAM timing before (top) and after (bottom) the negative-edge clock optimization in AHB_ROM. Without optimization, SRAM presents valid data at T3, one cycle after the required T2, causing incorrect SoC operation. Implementing the control signal generation on the negative clock edge advances the SRAM output by half a cycle, resolving the mismatch with no area or power overhead.

Figure 15. Post-layout simulation waveform of the AHB-to-memory read path with the Samsung 28 nm on-chip SRAM model. HRDATA presents valid data (0x2000_0368) at T2, two clock cycles after the instruction fetch request at T0 (HADDR = 0x0, HWRITE = 0), confirming correct timing achieved by the negative-edge clock implementation in AHB_ROM.

Figure 16. Post-layout STA result after ECO, generated by Synopsys PrimeTime. All 131,021 timing checks (setup, hold, recovery, removal) pass with zero violations under both MAX and MIN corners at the 4 ns (250 MHz) constraint with OCV derating applied.

Figure 17. Final hierarchical design views with I/O cells (left) and without I/O cells (right). The SoC core block (653 μm × 769 μm) is placed on the right side of the 3958 μm × 3958 μm die, with the left side reserved as open space for designers integrating the platform as a hard macro.

Figure 18. Final layout view after GDS merge with Samsung 28 nm LPP foundry cells in Cadence Virtuoso. The two SRAM macros (ROM and RAM) are placed symmetrically. The total chip size is 4000 μm × 4000 μm including the seal ring.

Figure 19. FPGA-to-silicon I2C SDA/SCL waveform comparison confirming HW/SW functional equivalence. Both platforms produce the same 200 kHz SCL clock and the same SDA data sequence (0xA0, 0x00, 0x00) using the same firmware source code.

Figure 20. Bare die (left) and LQFP-208 packaged chip (right) fabricated in Samsung 28 nm LPP CMOS. The die size is 4000 μm × 4000 μm including the seal ring. Limited visibility of metal layers in the bare die photograph is due to the passivation layer applied during fabrication.

Figure 21. Socket module and PCB test board for chip measurement. The board supplies 0.9–1.6 V to the core and 1.8 V to the I/O cells, and connects to a pulse generator (81130A), a Saleae Logic Pro logic analyzer, a Tektronix TDS3052 oscilloscope, and a PC via a USB-to-UART converter.

Figure 22. Power efficiency across the measured voltage range (higher is better). The optimal point at 1.0 V (yellow) achieves 491 MOPS/mW. Efficiency degrades above 1.4 V (blue) as leakage power becomes dominant.

Figure 23. Power consumption and energy efficiency (pJ/cycle) across the measured voltage range (lower is better). The dashed line represents theoretical V² dynamic power scaling. Measured values follow the trend up to 1.2 V, then deviate significantly above 1.4 V due to leakage-dominated operation.
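
The V² reference line in Figure 23 can be approximated numerically. The sketch below is an assumption-laden illustration: the caption does not say where the dashed line is anchored, so the 1.0 V, 125 MHz, 17.5 mW row of Table 5 is assumed as the reference point, and the function name `v2_reference_mw` is hypothetical. Dynamic power is modeled as proportional to V² · f.

```python
# Assumed anchor for the reference curve: the 1.0 V row of Table 5.
V0, F0_MHZ, P0_MW = 1.00, 125.0, 17.5


def v2_reference_mw(voltage_v: float, freq_mhz: float = F0_MHZ) -> float:
    """Ideal dynamic power (P proportional to V^2 * f), scaled from the anchor."""
    return P0_MW * (voltage_v / V0) ** 2 * (freq_mhz / F0_MHZ)


# (core voltage in V, clock in MHz, measured power in mW), taken from Table 5.
MEASURED = [(1.20, 125.0, 24.0), (1.40, 130.0, 44.0), (1.60, 135.0, 61.0)]

for v, f, p_meas in MEASURED:
    p_ref = v2_reference_mw(v, f)
    print(f"{v:.2f} V: reference {p_ref:5.1f} mW, measured {p_meas:5.1f} mW, "
          f"ratio {p_meas / p_ref:.2f}")
```

Under this assumed anchoring, the measured power stays close to the reference at 1.2 V and exceeds it by more than 20% at 1.4 V and 1.6 V, which is consistent with the deviation the caption describes.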

Figure 24. Verification of I2C Channel 0 SDA/SCL waveforms on the fabricated ASIC, compared against firmware behavior. SCL operates at 200 kHz and SDA correctly drives the target sequence (0xA0, 0x00, 0x00). Since Channel 1 uses identical hardware, these results confirm reliable operation of both I2C channels.
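
The first byte of the observed SDA sequence can be decoded with standard I2C framing, in which the upper seven bits carry the target address and the least-significant bit is the read/write flag. The sketch below is illustrative only: `split_address_byte` is a hypothetical helper, and the two trailing 0x00 bytes are left uninterpreted because the caption does not state their meaning.

```python
from typing import Tuple


def split_address_byte(byte: int) -> Tuple[int, str]:
    """Split the first I2C byte into a 7-bit address and a transfer direction."""
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"not a byte: {byte!r}")
    direction = "read" if byte & 0x01 else "write"  # R/W flag is bit 0
    return byte >> 1, direction                      # address is bits [7:1]


frame = [0xA0, 0x00, 0x00]           # SDA sequence reported in the captions
address, direction = split_address_byte(frame[0])
payload = frame[1:]                  # meaning not specified in the captions

SCL_HZ = 200_000                     # SCL frequency reported in the captions
scl_period_us = 1e6 / SCL_HZ         # one SCL cycle in microseconds

print(f"address 0x{address:02X}, {direction}, payload {payload}, "
      f"SCL period {scl_period_us} us")
```

Here 0xA0 corresponds to 7-bit address 0x50 with the write flag, and a 200 kHz SCL gives a 5 μs clock period, which is what a logic-analyzer capture on either platform would be checked against.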

Table 1. Memory map of the proposed SoC platform. Address space is organized in 4 KB-aligned segments. I2C Channel 0 and Channel 1 sub-ranges are highlighted in red.

Name	Address Range
ROM (Booting Codes)	`0x0000_0000`~`0x0000_FFFF`
SRAM	`0x2000_0000`~`0x2000_FFFF`
APB Peripherals	`0x4000_0000`~`0x4000_FFFF`
(I2C Channel 0)	`0x4000_9000`~`0x4000_9FFF`
(I2C Channel 1)	`0x4000_A000`~`0x4000_AFFF`
AHB Peripherals	`0x4001_0000`~`0x4001_FFFF`
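
The address ranges in Table 1 can be expressed as a small lookup table. The following is a minimal sketch for illustration, not the authors' decoder or driver code: the names `MEMORY_MAP`, `decode`, and `region_size` are hypothetical, and the I2C sub-ranges are listed first so that they take priority over the enclosing APB range.

```python
from typing import Optional

# Regions from Table 1 as (name, first address, last address).
# More specific sub-ranges come first so they win over the APB range.
MEMORY_MAP = [
    ("I2C Channel 0", 0x4000_9000, 0x4000_9FFF),
    ("I2C Channel 1", 0x4000_A000, 0x4000_AFFF),
    ("ROM", 0x0000_0000, 0x0000_FFFF),
    ("SRAM", 0x2000_0000, 0x2000_FFFF),
    ("APB Peripherals", 0x4000_0000, 0x4000_FFFF),
    ("AHB Peripherals", 0x4001_0000, 0x4001_FFFF),
]


def decode(addr: int) -> Optional[str]:
    """Return the name of the first region containing addr, or None."""
    for name, first, last in MEMORY_MAP:
        if first <= addr <= last:
            return name
    return None


def region_size(name: str) -> int:
    """Return the size of a named region in bytes."""
    for region, first, last in MEMORY_MAP:
        if region == name:
            return last - first + 1
    raise KeyError(name)


print(decode(0x0000_0000), decode(0x4000_A004), decode(0x5000_0000))
print(region_size("ROM") // 1024, "KB ROM,",
      region_size("I2C Channel 0") // 1024, "KB per I2C channel")
```

With these ranges, the ROM and SRAM regions each span 64 KB and each I2C channel occupies one 4 KB segment.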

Table 2. EDA tools applied at each stage of the ASIC design flow. The Synopsys synthesis, place-and-route, timing, power, parasitic extraction, and equivalence-checking tools are all from the 2021.06 release (VCS+Verdi is 2021.09), ensuring a consistent sign-off environment.

Stage	Tool	Version
Logic Synthesis	Synopsys Design Compiler	2021.06-SP4
Place-and-Route	Synopsys ICC2	2021.06-SP4
Static Timing Analysis	Synopsys PrimeTime	2021.06
Dynamic Timing Simulation	Synopsys VCS+Verdi	2021.09
Power Consumption Analysis	Synopsys PrimePower	2021.06-SP5
Net Parasitic Extraction	Synopsys StarRCXT	2021.06-SP2
Equivalence Check	Synopsys Formality	2021.06
Physical Verification	Siemens Calibre	aoi_cal_2014.1
Merge & Layout Patterning	Cadence Virtuoso	IC617_ISR23
On Chip Memory Generation	Samsung Memory Compiler	SRAM16384×32

Table 3. Library corners and operating conditions applied for MCMM static timing analysis with OCV derating (1.036 late/0.964 early). FF (BC) at −40 °C is used for hold checks; SS (WC) at +125 °C is used for setup checks.

Corner	Temperature	STD	IO	Memory	OCV Derate
FF (BC)	−40 °C	1.1 V	1.95 V	1.1 V	1.036
SS (WC)	+125 °C	0.90 V	1.65 V	0.95 V	0.964

Table 4. Chip design results obtained from EDA tools. The operating frequency (250 MHz) is the design target, and the power value (5.592 mW) is a pre-silicon PrimePower estimate. Measured results are reported separately in Table 5.

Items	Result
Process	Samsung 28 nm LPP CMOS
Bare Die Chip Size	4000 μm × 4000 μm
Digital Design Area	3958 μm × 3958 μm
SoC Area	653 μm × 769 μm
Memory Area	2 × (276.5 μm × 769 μm)
Core Area (Except Memory)	100 μm × 769 μm
Gate Count	2296 Gates
Memory Instance	SRAM 2 × (16,384 × 32) bits
Operating Frequency	Cortex-M0 SoC: 250 MHz (design target)
Power Consumption (Matrix Mult.)	5.592 mW (averaged)
Process Reference Voltage	Core: 1.0 V, I/O: 1.8 V, Memory: 1.0 V
Package Type	LQFP 208 Type

Table 5. Matrix multiplication benchmark results measured on the fabricated ASIC at room temperature (+25 °C), averaged over ten runs. The optimal operating point at 1.0 V is highlighted in red. Execution time is consistent at 111 μs across all conditions.

Clock Frequency	Core Voltage (V)	Power Consumption (mW)	Energy Efficiency (pJ/Cycle)	Power Efficiency (MOPS/mW)	Remark
125 MHz	0.90	32.4	259	265	Near Threshold
125 MHz	1.00	17.5	140	491	Optimal
125 MHz	1.02	21.8	174	394	Normal
125 MHz	1.10	22.0	176	391	Normal
125 MHz	1.20	24.0	192	358	Normal
130 MHz	1.40	44.0	338	195	High Voltage
135 MHz	1.60	61.0	453	141	High Voltage
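
The energy-per-cycle column of Table 5 follows from dividing the measured power by the clock frequency. The sketch below recomputes it as a sanity check; the helper name `energy_per_cycle_pj` and the tuple layout are hypothetical, and the input values are copied from the table.

```python
def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """Energy per clock cycle in pJ: (mW / MHz) gives nJ, times 1000 gives pJ."""
    return power_mw / freq_mhz * 1000.0


# (clock in MHz, core voltage in V, power in mW, reported pJ/cycle) from Table 5.
ROWS = [
    (125, 0.90, 32.4, 259),
    (125, 1.00, 17.5, 140),
    (125, 1.02, 21.8, 174),
    (125, 1.10, 22.0, 176),
    (125, 1.20, 24.0, 192),
    (130, 1.40, 44.0, 338),
    (135, 1.60, 61.0, 453),
]

for freq_mhz, vdd, power_mw, reported_pj in ROWS:
    computed = energy_per_cycle_pj(power_mw, freq_mhz)
    print(f"{vdd:.2f} V: computed {computed:6.1f} pJ/cycle, "
          f"reported {reported_pj} pJ/cycle")
```

The recomputed values agree with the reported column to within rounding for most rows; the 1.60 V row differs by roughly 1 pJ/cycle, which may simply reflect rounding of the reported power value.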

Table 6. Pre-silicon power breakdown estimated by PrimePower at 1.0 V, +25 °C, 125 MHz (matrix multiplication benchmark).

Power Group	Internal (mW)	Switching (mW)	Leakage (mW)	Total (mW)	Ratio (%)
Clock Network	1.401	0.264	<0.001	1.665	43.06
I/O Cells	1.461	0.009	0.680	2.150	55.61
Memory	0.008	0.000	0.036	0.044	1.14
Core Logic	0.001	<0.001	0.006	0.007	0.19
Total	2.872	0.273	0.722	3.867	100.0

Table 7. Comparison of platform-oriented SoC works in terms of methodology and extensibility. Works focused primarily on energy minimization through subthreshold operation are compared separately, as their design objectives differ fundamentally from the platform-oriented goals of this work.

Feature	CHIPKIT [17]	Tiny Tapeout [20]	OQPSK [21]	This Work
Silicon fabricated	Yes (16 nm)	Yes (130 nm) ^a	Yes (180 nm) ^a	Yes (28 nm)
Commercial foundry process	Yes (TSMC)	No	No	Yes (Samsung)
Post-silicon power measured	No	No	No	Yes (17.5 mW)
Voltage characterization	No	No	No	Yes (0.9–1.6 V)
FPGA-to-ASIC identical FW/driver	Partial ^c	No	No	Yes
Documented reproducible methodology	Partial	No	No	Yes
Standard AMBA bus (AHB + APB)	Yes	No ^b	No	Yes
Dual low-speed interface (I2C × 2)	No	No	No	Yes
SW driver + memory map template	No	No	No	Yes
Hard macro delivery	No	No	No	Yes

^a Open-source PDK (SkyWater 130 nm/GF 180 nm). ^b Wishbone bus (non-AMBA). ^c CHIPKIT describes FPGA emulation as a pre-silicon validation step [17], but does not explicitly demonstrate cross-platform functional equivalence through direct measurement using identical firmware. Tiny Tapeout and OQPSK primarily target accessibility and open-source goals; their design objectives differ from the platform-oriented methodology focus of this work.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, H.; Ryoo, K. A Reproducible FPGA-to-Silicon Verification Methodology for an Embedded SoC Platform in 28 nm CMOS. Electronics 2026, 15, 2202. https://doi.org/10.3390/electronics15102202

