Article

WinEdge: Low-Power Winograd CNN Execution with Transposed MRAM for Edge Devices

Electrical and Computer Engineering Department, University of Illinois Chicago, Chicago, IL 60607, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2485; https://doi.org/10.3390/electronics14122485
Submission received: 20 May 2025 / Revised: 13 June 2025 / Accepted: 17 June 2025 / Published: 19 June 2025
(This article belongs to the Special Issue Emerging Computing Paradigms for Efficient Edge AI Acceleration)

Abstract

This paper presents a novel transposed MRAM architecture (WinEdge) specifically optimized for Winograd convolution acceleration in edge computing devices. Leveraging Magnetic Tunnel Junctions (MTJs) with Spin Hall Effect (SHE)-assisted Spin-Transfer Torque (STT) writing, the proposed design enables a single SHE current to simultaneously write data to four MTJs, substantially reducing power consumption. Additionally, the integration of stacked MTJs significantly improves storage density. The proposed WinEdge efficiently supports both standard and transposed data access modes regardless of bit-width, achieving up to 36% lower power, 47% reduced energy consumption, and 28% faster processing speed compared to existing designs. Simulations conducted in 45 nm CMOS technology validate its superiority over conventional SRAM-based solutions for convolutional neural network (CNN) acceleration in resource-constrained edge environments.

1. Introduction

Neural networks have transformed computational systems by efficiently addressing a wide variety of complex challenges. Their extensive applications include image processing, natural language understanding, and real-time decision making. Among neural networks, Convolutional Neural Networks (CNNs) have particularly excelled in tasks such as image recognition and classification, making them crucial for processing visual data [1]. However, deploying CNNs on resource-constrained edge devices, like smartphones, IoT sensors, and wearable gadgets, presents significant challenges. These devices typically have limited power, storage, and processing capabilities. Therefore, developing methods that efficiently bridge high-performance neural network requirements with the constraints inherent to edge computing environments is an active and critical research area [2]. A prominent method for improving CNN efficiency is the Winograd algorithm, which optimizes the convolution process by significantly reducing multiplications, the primary computational operation in CNNs. Though this reduces computational overhead and enhances processing speed and energy efficiency, it increases the number of required additions [3]. Additionally, enabling on-device learning is vital for real-time adaptability and privacy in applications that demand responsiveness and customization, further emphasizing the need for optimized computational resources and specialized memory structures [4,5].
Efficient deployment of the Winograd algorithm and on-device learning [6,7,8] requires specialized memory architectures, such as transposable memories, which efficiently handle the specific data access patterns of CNN operations [9,10,11,12]. Traditional memory technologies, such as Static Random Access Memory (SRAM), consume significant chip area and exhibit high leakage power even in idle states, making them unsuitable for edge applications with stringent energy and area constraints. Addressing these challenges is essential to unlocking CNNs’ full potential on edge devices [13,14].
To address the drawbacks of traditional memories, recent studies have explored Non-Volatile Memory (NVM) technologies, such as Magnetoresistive Random Access Memory (MRAM), due to their minimal static power consumption and non-volatility. Magnetic Tunnel Junctions (MTJs), key components in MRAM technology, have been utilized in transposable memory designs. Previous approaches primarily employed Spin-Transfer Torque (STT) to write data into transposable memory arrays. Although effective, STT-based methods face limitations in terms of slow write operations and higher energy consumption, making them less suitable for energy-sensitive applications. Alternative approaches leveraging Spin Hall Effect (SHE)-assisted STT improve write efficiency but often require multiple cycles to perform transposed data reads, reducing overall performance in scenarios demanding high-speed access [15]. Furthermore, certain prior works [16,17] introduced diagonal data storage combined with shifting and switching circuitry to facilitate transposed access. While these designs improve some aspects, they tend to be inefficient for handling smaller bit-width data, such as binary or quaternary, and exhibit complexity issues with multi-bit data. Such limitations highlight the necessity for more adaptable and efficient transposable memory solutions.
In this paper, we propose a novel transposed MRAM (T-MRAM) architecture using Magnetic Tunnel Junctions (MTJs) with SHE-assisted STT for efficient data writing. Our approach enables simultaneous writing to four MTJs using a single SHE current, significantly reducing power consumption and enhancing energy efficiency. Furthermore, the design incorporates stacked MTJs to increase data storage density. The proposed memory architecture effectively supports both normal and transposed data accesses, regardless of bit-width, making it versatile and well-suited for diverse CNN applications. Extensive simulations demonstrate that our proposed architecture achieves substantial improvements in energy efficiency, power consumption, and processing speed, making it a highly effective solution for CNN accelerators deployed in resource-constrained edge environments.

2. Background

2.1. Magnetic Tunnel Junction

Figure 1 illustrates a Magnetic Tunnel Junction (MTJ), which consists of two ferromagnetic layers separated by a thin oxide barrier. The magnetic orientation of one layer, called the reference layer, is fixed, while the orientation of the other (free layer) can be altered. Depending on whether these two layers have parallel (P) or anti-parallel (AP) magnetic orientations, two stable states with distinct electrical resistances are achievable. These resistance differences, utilized for non-volatile data storage, are quantified using the tunneling magnetoresistance (TMR) ratio [18], defined by (1).
$$\mathrm{TMR}(\%) = \frac{R_{\mathrm{AP}} - R_{\mathrm{P}}}{R_{\mathrm{P}}} \times 100 \quad (1)$$
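For example, with the 150% zero-bias TMR ratio listed in Table 2, Equation (1) implies that the anti-parallel resistance is 2.5 times the parallel resistance:
$$R_{\mathrm{AP}} = \left(1 + \frac{\mathrm{TMR}}{100}\right) R_{\mathrm{P}} = (1 + 1.5)\, R_{\mathrm{P}} = 2.5\, R_{\mathrm{P}},$$
which leaves an ample resistance window for the sense amplifiers to distinguish the two states.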
Several methods are employed to perform data writing and state switching in MTJs, among which the Spin–Orbit Torque (SOT)-assisted Spin-Transfer Torque (STT) method has demonstrated significant advantages. In this approach, initially, a SOT current passes through a heavy metal layer underneath the MTJ, aligning the free layer’s magnetic orientation perpendicularly to the fixed layer. Subsequently, an STT current, whose direction depends on the desired switching from anti-parallel (AP) to parallel (P) or vice versa, is applied to finalize the magnetic state transition [19].

2.2. Winograd Convolution

The Winograd algorithm is a highly efficient method for performing convolution operations, significantly reducing computational complexity by minimizing the number of required multiplications. However, this efficiency gain is accompanied by an increase in the number of addition operations [3]. Formally, the Winograd convolution for an input matrix I of dimension m × m and a kernel matrix F of dimension n × n produces an output matrix Y of dimensions o × o, defined as follows:
$$Y = A^{\top} \left[ \left( G F G^{\top} \right) \odot \left( B^{\top} I B \right) \right] A \quad (2)$$
where o = m − n + 1. The transformation matrices A, B, and G used for a 4 × 4 input matrix and 3 × 3 kernel are provided explicitly in Equation (3).
$$B = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 1 \\ -1 & 1 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 \\ 0 & 0 & 1 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & -1 \\ 0 & -1 \end{bmatrix} \quad (3)$$
The operational procedure of the Winograd convolution is visually summarized in Figure 2, demonstrating its effectiveness in significantly reducing the computational burden, which is particularly beneficial for convolutional neural networks in resource-limited environments [20].
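For readers who prefer an executable reference, the following minimal NumPy sketch applies Equation (2) with the F(2×2, 3×3) matrices of Equation (3) to a single 4×4 tile and checks the result against direct sliding-window convolution. It is an illustrative sketch only, not the accelerator implementation, and the helper name winograd_tile is ours.

```python
import numpy as np

# Transformation matrices of Equation (3) for Winograd F(2x2, 3x3):
# 4x4 input tile, 3x3 kernel, 2x2 output tile.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """Equation (2) for one tile: Y = A^T [ (G g G^T) o (B^T d B) ] A."""
    U = G @ g @ G.T          # kernel transform (4x4)
    V = B_T @ d @ B_T.T      # input-tile transform (4x4); note B_T.T == B
    M = U * V                # Hadamard product: 16 multiplications instead of 36
    return A_T @ M @ A_T.T   # inverse transform -> 2x2 output tile

# Sanity check against direct (sliding-window) convolution on a random tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_tile(d, g), direct)
```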

3. Proposed WinEdge Architecture

3.1. T-MRAM Cell

As illustrated in Figure 3a, our proposed T-MRAM cell employs a SHE-assisted STT mechanism for efficient data writing and storage operations. To significantly reduce power consumption, the SHE current simultaneously targets two memory cells comprising four MTJs. During the write operation, the SHE current passes through transistor TW-1 by activating the word line WLW. Concurrently, the source lines (SL) SL-1 and SL-2 are set to VDD and ground, respectively. Following the application of the SHE current, the magnetization vectors of the MTJs align perpendicular to the initial orientation, thus achieving a transient intermediate magnetization state. Subsequently, the STT current path is activated once this intermediate state is reached. Intermediate voltage levels are then applied to the SLs, while bit lines (BL1 and BL2) are driven to either ground or VDD depending on the desired data state, completing the magnetic orientation switch.
In the two stacked MTJs, a data-dependent path is established between either 0 V and the intermediate voltage or VDD and the intermediate voltage, and the appropriate state is written to the cells. Figure 3a illustrates the detailed interconnections and simultaneous write operations of the proposed T-MRAM cells. Table 1 summarizes the logic states and corresponding interconnect values utilized during the simultaneous 4-bit data (‘1001’) writing operation, facilitated by the activation of transistors T1, T2, and T3. This approach achieves efficient data writing within a single clock cycle. Furthermore, the physical layout of the proposed T-MRAM cell, designed according to the λ-based scaling rule, is depicted in Figure 3b. Note that in each of the two sections labeled as MTJs in this figure, two MTJ cells are designed in a stacked configuration.
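For clarity, the two-phase sequencing of Table 1 can be summarized by the behavioral sketch below. It is a software illustration of the control-line drive levels only (not the HSPICE netlist), and the names write_phases and apply_phase are ours.

```python
# Behavioral sketch of the two write phases for the 4-bit pattern '1001'
# (drive levels taken from Table 1; symbolic voltage levels).
VDD, GND, HALF = "VDD", "0", "VDD/2"

write_phases = {
    # Phase 1: a single SHE current through the heavy-metal strip pre-sets all four MTJs.
    "SHE": {"WLW": VDD, "WL1": GND, "WL2": VDD, "WL3": VDD, "WL4": GND,
            "SL-1": VDD, "BL1-1": GND, "BL2-1": GND, "BL1-2": GND, "BL2-2": GND,
            "SL-2": GND},
    # Phase 2: data-dependent STT currents finalize each MTJ's P/AP state.
    "STT": {"WLW": GND, "WL1": VDD, "WL2": VDD, "WL3": VDD, "WL4": VDD,
            "SL-1": HALF, "BL1-1": GND, "BL2-1": VDD, "BL1-2": VDD, "BL2-2": GND,
            "SL-2": HALF},
}

def apply_phase(name: str) -> None:
    """Print the drive level a controller would assert on each line in the given phase."""
    for line, level in write_phases[name].items():
        print(f"{name}: {line} = {level}")

apply_phase("SHE")
apply_phase("STT")
```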

3.2. Proposed Memory Array

Figure 4 illustrates the detailed architecture of the proposed transposed memory array, designed explicitly for efficient data storage and retrieval in neural network acceleration applications. The memory array is partitioned into multiple memory banks, each capable of independently accessing stored data in both normal and transposed orientations. This dual-mode access significantly enhances computational flexibility, allowing seamless integration with convolution algorithms such as the Winograd algorithm. In this design, dedicated vertical and horizontal sense amplifiers (SAs) enable rapid and efficient data retrieval from the memory cells. The retrieved data are directly transferred to the integrated computational unit, facilitating high-speed data processing critical for edge device operations. Computation outcomes are subsequently written back into the memory via specialized address lines, ensuring data integrity and consistency throughout processing cycles. Figure 4 further details the internal connectivity structure of the memory cells within each bank, illustrating how each cell comprises two stacked MTJs. This configuration permits versatile storage capacities, supporting various bit-width requirements and data precision levels ranging from binary to multi-bit formats. The design achieves optimized performance by allowing simultaneous access to multiple bits, thereby reducing overall latency and enhancing throughput for diverse data-intensive applications.
Figure 5 illustrates the detailed storage and data access schemes for various bit-width configurations, specifically highlighting binary, quaternary, and four-bit data arrangements. This figure demonstrates the proposed memory array’s capability for efficient simultaneous access in both normal (row-wise) and transposed (column-wise) directions. In Figure 5a, the storage configuration for binary data is depicted, where each MTJ stores a single bit. Normal data access is performed by grounding the source lines and using the BL2 bit lines connected to sense amplifiers (SAs), enabling rapid retrieval of stored data. For transposed data access, the architecture uses dedicated transposed bit lines (BLT) connected to SAs, while the BL1 bit lines are grounded, ensuring equally efficient data retrieval in column-wise orientation. Figure 5b demonstrates the quaternary data storage and access scheme, where each cell stores two bits of data across two stacked MTJs. Here, normal data retrieval involves connecting BL2 lines to the SAs and grounding BL1 lines, allowing simultaneous access to both bits stored in a single cell. Conversely, transposed access utilizes BLT lines connected to SAs and grounded BL1 lines, enabling efficient and parallel retrieval of column-wise stored data.
In the four-bit data arrangement illustrated in Figure 5c, each data element spans two adjacent memory cells, thereby accommodating four bits of storage per matrix element. For normal mode retrieval, each data element is accessed through two consecutive SAs, with BL2 lines connecting directly to the SAs and BL1 lines grounded. For transposed access, BLT lines (BLT1 and BLT2) are sequentially connected to consecutive SAs to read individual bits efficiently. This systematic approach ensures flexible, high-speed data handling in both normal and transposed memory orientations, making the proposed architecture highly suitable for resource-constrained neural network acceleration tasks.
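The bit-width-dependent mapping of Figure 5 can be mimicked by a small addressing sketch, shown below. This is a software analogy under the stated assumptions (two stacked MTJs, i.e., two bits, per cell; 4-bit elements spanning two adjacent cells), not the array's peripheral circuitry, and the helper names are ours.

```python
import numpy as np

def cells_per_element(bits: int) -> int:
    """Binary and quaternary elements fit in one cell; 4-bit elements span two cells."""
    return 2 if bits == 4 else 1

def read_row(bank: np.ndarray, row: int, bits: int) -> np.ndarray:
    """Normal mode: BL2 lines feed the SAs; one activation returns a full stored row,
    grouped into elements of one or two cells."""
    return bank[row].reshape(-1, cells_per_element(bits))

def read_column(bank: np.ndarray, col: int, bits: int) -> np.ndarray:
    """Transposed mode: BLT lines feed the SAs; one activation returns a full stored column."""
    span = cells_per_element(bits)
    return bank[:, col * span:(col + 1) * span]

bank = np.arange(32).reshape(4, 8)       # toy bank: 4 rows x 8 cells
print(read_row(bank, row=0, bits=4))     # four 4-bit elements, two cells each
print(read_column(bank, col=1, bits=4))  # second column of 4-bit elements (cells 2 and 3)
```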
Figure 6 presents a comprehensive timing diagram demonstrating the operational performance of a representative segment of the proposed memory array, specifically focusing on the data access methods depicted in Figure 5b. During the initial 5 nanoseconds, binary data are written sequentially to the first row, comprising four MTJ cells. In the subsequent interval of 5 nanoseconds, the write operation is replicated for the second row. This controlled sequential writing process ensures accurate initialization of data prior to read operations. Following these write phases, the memory array undergoes standard read operations. The first row is accessed during the 15–20 ns window, where data integrity and state verification are performed through the connected SAs. Similarly, the second row undergoes a normal read operation between 20 and 25 ns, allowing confirmation of the successful data storage and retrieval for each MTJ cell.
To validate transposed access capabilities, two separate intervals are allocated specifically for reading data stored column-wise. The first transposed read operation occurs between 30 and 35 ns, targeting the first column. Subsequently, the second column data are read from 35 to 40 ns. This sequential column-wise access highlights the flexibility of the proposed architecture, demonstrating its effectiveness in rapidly switching between normal and transposed data retrieval modes. This detailed operational timeline, visually represented in Figure 6, underscores the high efficiency, reliability, and versatility of our T-MRAM array architecture. It confirms the design’s ability to achieve rapid simultaneous data writing and flexible, efficient reading operations, both critical for high-performance neural network acceleration in resource-constrained edge computing environments. Leveraging the dual-mode access capabilities of our T-MRAM array, we now present its strategic mapping for neural network computation, specifically optimized for the Winograd convolution algorithm.

3.3. Hardware Mapping

The proposed T-MRAM architecture is strategically designed to facilitate efficient mapping of convolutional neural networks, particularly employing the Winograd algorithm. Initially, convolution kernels and input feature maps are partitioned into sub-matrices that align with the T-MRAM’s storage structure. Each sub-matrix element is then stored within individual MTJs, allowing efficient simultaneous access during both normal and transposed data reads. Specifically, the convolution operation, performed using the Winograd algorithm, benefits from the transposable memory capability. During the transformation phase, input data and kernels undergo transformation utilizing matrices G and B . The resultant transformed data and kernels are stored in normal access mode. For subsequent matrix multiplications and element-wise operations (Hadamard product), the architecture seamlessly switches to transposed access mode. The sense amplifiers retrieve rows and columns simultaneously, feeding data directly into the computational sub-arrays for efficient execution of arithmetic operations.
Following the element-wise multiplication, the results undergo an inverse transformation using matrix A , converting back to the spatial domain. These final output matrices are subsequently written back into the memory cells. This approach optimally utilizes the transposed memory structure, significantly minimizing data movement, latency, and overall power consumption. Such streamlined hardware mapping underscores the suitability of the proposed T-MRAM architecture for resource-constrained CNN acceleration in edge computing environments.
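As a functional reference for this mapping, the sketch below processes a 64×64 input with a 3×3 kernel tile by tile, reusing the B_T, G, and A_T matrices from the Section 2.2 sketch. It mirrors the dataflow described above (kernel transform stored once, per-tile input transform, Hadamard product, inverse transform) but does not model the memory accesses themselves; winograd_conv2d is an illustrative name, not part of the implemented design.

```python
def winograd_conv2d(image, kernel, tile=4, out_tile=2):
    """Tiled Winograd flow: kernel transform G k G^T computed once, per-tile input
    transform B^T d B, Hadamard product, then inverse transform with A^T (.) A.
    Assumes stride 1 and that (H - 2) and (W - 2) are divisible by out_tile."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    U = G @ kernel @ G.T                              # transformed kernel, reused per tile
    for i in range(0, H - 2, out_tile):
        for j in range(0, W - 2, out_tile):
            d = image[i:i + tile, j:j + tile]         # overlapping 4x4 input tile
            V = B_T @ d @ B_T.T                       # transformed input tile
            out[i:i + out_tile, j:j + out_tile] = A_T @ (U * V) @ A_T.T
    return out

rng = np.random.default_rng(1)
feature_map = rng.standard_normal((64, 64))
kernel = rng.standard_normal((3, 3))
output = winograd_conv2d(feature_map, kernel)         # 62 x 62 output feature map
```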

4. Experiments

Circuit-level simulations were performed using HSPICE with the 45 nm NCSU Process Design Kit (PDK) library [21] and a validated MTJ compact model [22] at a nominal supply voltage of 1.2 V. Table 2 summarizes the critical MTJ parameters used in the simulations, ensuring accurate modeling of physical characteristics and performance metrics of the proposed T-MRAM cell.
To enhance reproducibility, we provide key simulation parameters used in our evaluations, summarized in Table 3. The proposed T-MRAM array supports binary, quaternary, and 4-bit data formats via a stacked MTJ structure and dual-mode access architecture. A nominal activity factor of 0.5 was used to model alternating read/write behavior. For algorithm-level testing, we implemented a 64 × 64 input with a 3 × 3 kernel using the Winograd F ( 4 × 4 , 3 × 3 ) configuration. Representative layers from ShuffleNet, ResNet18, TinyYOLO, MobileNetV2, and EfficientNet-V0 were selected as workloads. All simulations were performed in HSPICE using the 45 nm NCSU FreePDK at 1.2 V and 25 °C. Each memory access was modeled over a 5 ns window.
To evaluate robustness and reliability against process variations, extensive Monte Carlo simulations comprising 1000 iterations were executed. Specifically, variations in critical fabrication parameters, including free layer thickness (TF), free layer surface area, MTJ resistance-area product, and Tunneling Magnetoresistance (TMR), were modeled using Gaussian distributions with standard deviations of 5%, 15%, 15%, and 10%, respectively [23,24]. Reliable multi-bit storage requires distinct resistance states despite fabrication inconsistencies; Figure 7 demonstrates the clearly distinguishable four resistance states achieved under these process variations, confirming the robustness of the proposed design.
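The sampling step of this variation study can be reproduced with the short sketch below, assuming the nominal values of Table 2 and the standard deviations quoted above; the published analysis applies these perturbations to the MTJ compact model inside HSPICE, which this sketch does not attempt.

```python
import numpy as np

# Gaussian sampling of the four varied MTJ parameters (1000 Monte Carlo draws each).
# Nominal values follow Table 2; the RA product of 1e-11 Ohm*m^2 equals 10 Ohm*um^2.
rng = np.random.default_rng(7)
N = 1000
nominal   = {"free_layer_thickness_nm": 0.7, "free_layer_area_rel": 1.0,
             "RA_product_ohm_um2": 10.0, "TMR_percent": 150.0}
rel_sigma = {"free_layer_thickness_nm": 0.05, "free_layer_area_rel": 0.15,
             "RA_product_ohm_um2": 0.15, "TMR_percent": 0.10}

samples = {k: rng.normal(loc=v, scale=rel_sigma[k] * v, size=N)
           for k, v in nominal.items()}
for k, s in samples.items():
    print(f"{k}: mean = {s.mean():.3g}, std = {s.std():.3g}")
```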
To validate the performance and efficiency of the proposed T-MRAM architecture for convolution acceleration, the Winograd algorithm was implemented for a standard convolution scenario involving a 64 × 64 input matrix and a 3 × 3 kernel. Comparative assessments against existing designs [17] were conducted, evaluating key performance metrics, including processing delay, power consumption, and Power-Delay Product (PDP). As summarized in Table 4, the proposed architecture achieves a power consumption of 136 μ W, a processing delay of 807 ns, and a PDP of 110 pJ. These results demonstrate substantial improvements over existing approaches, showcasing reductions of approximately 33% in power consumption and up to 72% in PDP relative to comparable spintronic-based implementations. Additionally, static power analysis emphasizes the considerable energy advantages, with near-zero static power consumption compared to conventional SRAM-based solutions, which exhibit significantly higher leakage power.
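As a consistency check on the reported figures, the PDP follows directly from the measured power and delay:
$$\mathrm{PDP} = P \times t_{\mathrm{delay}} = 136\ \mu\mathrm{W} \times 807\ \mathrm{ns} \approx 110\ \mathrm{pJ}.$$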
For a comprehensive performance evaluation, practical neural network workloads were considered. Figure 8 presents performance comparisons across several widely used CNN architectures, including ShuffleNet, ResNet18, TinyYOLO, MobileNetV2, and EfficientNet-V0. These comparisons underscore substantial reductions in power and energy consumption—achieving at least 36% power and 47% energy savings compared to existing designs. These improvements are primarily attributable to our optimized design features, such as simultaneous multi-bit reading and writing capabilities, which significantly reduce memory access overhead and enhance computational throughput. Furthermore, the proposed architecture’s execution latency closely matches the state-of-the-art optimized approach described in [17], demonstrating approximately a 28% improvement compared to other evaluated designs. These results collectively confirm the viability and superior performance of our T-MRAM architecture for resource-efficient CNN acceleration in edge computing environments.

5. Discussion and Conclusions

In this paper, we introduced a novel T-MRAM architecture, namely WinEdge, that significantly advances Winograd convolution acceleration for edge computing. Our design combines SHE-assisted STT writing with stacked MTJs to achieve substantial improvements: 36% power reduction, 47% lower energy consumption, and 28% faster processing compared to existing designs. These enhancements directly address the critical constraints limiting AI deployment in resource-constrained environments. By enabling efficient on-device learning and inference while maintaining near-zero static power consumption, our T-MRAM architecture paves the way for next-generation intelligent edge systems across IoT, wearable technology, and mobile computing domains. The demonstrated performance benefits position this technology as a compelling alternative to conventional memory solutions for future energy-efficient AI hardware.
While WinEdge offers compelling advantages, several limitations merit consideration. Its sequential, SHE-assisted STT write mechanism, though energy efficient, imposes a fixed latency that may constrain ultra-high-throughput applications. The stacked MTJ design, while improving density, poses thermal reliability concerns under sustained high-frequency writes due to potential crosstalk. Additionally, enabling dual-mode (normal and transposed) access necessitates extra multiplexing and control circuitry, introducing modest area and complexity overhead—an acceptable trade-off for the observed gains in Winograd convolution performance and efficiency.

Author Contributions

Conceptualization, S.T. and A.R.; Methodology, M.A.G.; Validation, M.A.G. and S.T.; Writing—original draft, M.A.G.; Writing—review and editing, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Science Foundation under Grant Nos. 2448133, 2504839, and 2447566.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NVM: Non-Volatile Memory
ASIC: Application-Specific Integrated Circuit
CNN: Convolutional Neural Network
MAC: Multiplication-and-Accumulation
SHE: Spin Hall Effect
MTJ: Magnetic Tunnel Junction
SRAM: Static Random Access Memory
MRAM: Magnetoresistive Random Access Memory
BLT: Transposed Bit Line
SA: Sense Amplifier
T-MRAM: Transposed MRAM
WLW: Write Word Line
SL: Source Line
nvSRAM: Non-Volatile Static Random Access Memory
SOT-MRAM: Spin-Orbit Torque Magnetic Random Access Memory
STT-MRAM: Spin-Transfer Torque Magnetic Random Access Memory
PDP: Power-Delay Product

References

  1. Xie, K.; Lu, Y.; He, X.; Yi, D.; Dong, H.; Chen, Y. Winols: A large-tiling sparse winograd CNN accelerator on FPGAs. ACM TACO 2024, 21, 1–24. [Google Scholar] [CrossRef]
  2. Tabrizchi, S.; Shafiee Sarvestani, A.; Amin, M.H.; Najafi, D.; Angizi, S.; Zand, R.; Roohi, A. Magnetic In/Near-Sensor Architectures: From Raw Sensing to Smart Processing. In Proceedings of the GLSVLSI, New Orleans, LA, USA, 30 June–2 July 2025. [Google Scholar]
  3. Wang, S.; Zhu, J.; Wang, Q.; He, C.; Ye, T.T. Customized instruction on RISC-V for Winograd-based convolution acceleration. In Proceedings of the IEEE ASAP, Virtual, 7–9 July 2021; pp. 65–68. [Google Scholar]
  4. Zhang, F.; Sridharan, A.; Hwang, W.; Xue, F.; Tsai, W.; Wang, S.X.; Fan, D. On-device continual learning with STT-assisted-SOT MRAM based in-memory computing. IEEE TCAD 2024, 43, 2393–2404. [Google Scholar] [CrossRef]
  5. Luo, Y.; Wang, P.; Yu, S. Accelerating on-chip training with ferroelectric-based hybrid precision synapse. ACM JETC 2022, 18, 1–20. [Google Scholar] [CrossRef]
  6. Mori, P.; Frickenstein, L.; Sampath, S.B.; Thoma, M.; Fasfous, N.; Vemparala, M.R.; Frickenstein, A.; Unger, C.; Stechele, W.; Mueller-Gritschneder, D.; et al. Wino vidi vici: Conquering numerical instability of 8-bit winograd convolution for accurate inference acceleration on edge. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 53–62. [Google Scholar]
  7. Li, M.; Li, P.; Yin, S.; Chen, S.; Li, B.; Tong, C.; Yang, J.; Chen, T.; Yu, B. WinoGen: A Highly Configurable Winograd Convolution IP Generator for Efficient CNN Acceleration on FPGA. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024; pp. 1–6. [Google Scholar]
  8. Mori, P.; Rahman, M.S.; Frickenstein, L.; Sampath, S.B.; Thoma, M.; Fasfous, N.; Vemparala, M.R.; Frickenstein, A.; Stechele, W.; Passerone, C. End-to-End Deployment of Winograd-Based DNNs on Edge GPU. Electronics 2024, 13, 4538. [Google Scholar] [CrossRef]
  9. Luo, J.; Xu, X.; Shen, Y.; Huang, B.; Wang, W. Efficient turbo product code decoder with Build-In SRAM-based transpose memory. Electron. Lett. 2024, 60, e13296. [Google Scholar] [CrossRef]
  10. Lyu, D.; Li, Z.; Chen, Y.; Wang, G.; He, W.; Xu, N.; He, G. FLNA: Flexibly Accelerating Feature Learning Networks for Large-Scale Point Clouds With Efficient Dataflow Decoupling. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2024, 32, 739–751. [Google Scholar] [CrossRef]
  11. Wu, C.; Yang, C.; Bandara, S.; Geng, T.; Guo, A.; Haghi, P.; Li, A.; Herbordt, M. FPGA-accelerated range-limited molecular dynamics. IEEE Trans. Comput. 2024. [Google Scholar] [CrossRef]
  12. Boutros, A.M.M. Reconfigurable Architectures for Deep Learning. Ph.D. Thesis, University of Toronto (Canada), Toronto, ON, Canada, 2024. [Google Scholar]
  13. Chang, L.; Zhu, Z.; Zhu, Z.; Yang, S.; Li, W.; Zhou, J. Energy-Efficient Spin-Orbit Torque MRAM Operations for Neural Network Processor. In Proceedings of the IEEE ISCAS, Daegu, Republic of Korea, 22–28 May 2021; pp. 1–5. [Google Scholar]
  14. Najafi, D.; Tabrizchi, S.; Zhou, R.; Amel Solouki, M.; Marshal, A.; Roohi, A.; Angizi, S. Hybrid Magneto-electric FET-CMOS Integrated Memory Design for Instant-on Computing. In Proceedings of the GLSVLSI, Clearwater, FL, USA, 12–14 June 2024; pp. 770–775. [Google Scholar]
  15. Verma, G.; Soni, S.; Nisar, A.; Kaushik, B.K. Multi-bit MRAM based high performance neuromorphic accelerator for image classification. Neuromorphic Comput. Eng. 2024, 4, 014008. [Google Scholar] [CrossRef]
  16. Yin, S.; Seo, J.S. A 2.6 TOPS/W 16-bit fixed-point convolutional neural network learning processor in 65-nm CMOS. IEEE SSCL 2019, 3, 13–16. [Google Scholar] [CrossRef]
  17. Koo, J.; Kim, J.; Ryu, S.; Kim, C.; Kim, J.J. Area-efficient transposable 6T SRAM for fast online learning in neuromorphic processors. In Proceedings of the 2019 IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, USA, 14–17 April 2019; pp. 1–4. [Google Scholar]
  18. Natsui, M.; Hanyu, T. Design of an Intermittent-Computing-Oriented Nonvolatile Register with a Switching-Probability-Aware Store-and-Verify Scheme. IEEE Access 2025, 13, 38104–38114. [Google Scholar] [CrossRef]
  19. Bahador, A.; Moaiyeri, M.H.; Ghaderi, R. Algorithmically-enhanced design of spintronic-based tunable true random number generator for dependable stochastic computing. IEEE TCAD 2024, 44, 961–974. [Google Scholar] [CrossRef]
  20. Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021. [Google Scholar]
  21. NCSU EDA Group. NCSU EDA FreePDK45 Process Design Kit. 2011. Available online: https://eda.ncsu.edu/freepdk/freepdk45/ (accessed on 13 June 2025).
  22. Zhang, Y.; Zhao, W.; Lakys, Y.; Klein, J.O.; Kim, J.V.; Ravelosona, D.; Chappert, C. Compact modeling of perpendicular-anisotropy CoFeB/MgO magnetic tunnel junctions. IEEE TED 2012, 59, 819–826. [Google Scholar] [CrossRef]
  23. Gargari, M.A.; Eslami, N.; Moaiyeri, M.H. Multi-Bit Memory Architecture for In-memory Computing using In-Plane MTJ. In Proceedings of the IEEE ICEE, Tehran, Iran, 14–16 May 2024; pp. 1–5. [Google Scholar]
  24. Adel, M.J.; Rezayati, M.H.; Moaiyeri, M.H.; Amirany, A.; Jafari, K. A robust deep learning attack immune MRAM-based physical unclonable function. Sci. Rep. 2024, 14, 20649. [Google Scholar] [CrossRef] [PubMed]
  25. Choi, S.; Han, D.; Choi, C.; Seo, Y. Layout-aware area optimization of transposable STT-MRAM for a processing-in-memory system. IEEE TVLSI 2023, 32, 245–255. [Google Scholar] [CrossRef]
Figure 1. Magnetic Tunnel Junction in (a) anti-parallel and (b) parallel states.
Figure 2. Steps of Winograd Convolution.
Figure 3. (a) Two cells of the proposed array and their connections for simultaneous SHE write current, with specified SHE and STT current paths. (b) The physical layout of the proposed T-MRAM cell is designed based on the λ-based scaling rule.
Figure 4. The proposed WinEdge architecture comprises memory banks, access lines and decoders, sense amplifiers, and a computation unit. The connectivity structure of memory cells within each memory bank is shown on the left.
Figure 5. Storage and access methods for normal and transposed modes: (a) binary, (b) quaternary, (c) four-bit data. Green lines and sense amplifiers (SAs) are used for normal access, blue lines and SAs for transposed access, and black lines are shared for both modes.
Figure 6. The waveform of the proposed array in write (0–10 ns) and read operations under both standard (10–20 ns) and transposed modes (20–30 ns). In the write phase, the value of +1 represents the Parallel (P) mode, and −1 is for the Antiparallel (AP) mode.
Figure 7. The four levels of resistance in each cell in the presence of process variation.
Figure 8. Comparison of the proposed architecture with previous designs (SOT21 [13], TSTT23 [25], and 6T19 [17]) in terms of (a) Power-Delay Product (PDP) and (b) Delay of neural networks, including ShuffleNet, ResNet18, TinyYOLO, MobileNetV2, and EfficientNet-V0, using the Winograd algorithm.
Table 1. The value of each interconnect during the two writing phases (horizontal view).

Phase | WLW | WL1 | WL2 | WL3 | WL4 | SL-1 | BL1-1 | BL2-1 | BL1-2 | BL2-2 | SL-2
SHE | VDD | 0 | VDD | VDD | 0 | VDD | 0 | 0 | 0 | 0 | 0
STT | 0 | VDD | VDD | VDD | VDD | VDD/2 | 0 | VDD | VDD | 0 | VDD/2
Table 2. Critical parameters of Magnetic Tunnel Junctions (MTJs).

Parameter | Value
Free layer thickness (TF) | 0.7 nm
Oxide barrier thickness (Tb), upper cell | 0.8 nm
Oxide barrier thickness (Tb), lower cell | 1 nm
Diameter of the free layer surface | 65 nm
Dimensions of the metal strip (l, w, d) | 90 × 70 × 3 nm³
Resistance-area product of the MTJs (RAP) | 10⁻¹¹ Ω·m²
Tunneling magnetoresistance ratio under zero bias | 150%
Saturation magnetization (Ms) | 8.5 × 10⁵ A·m⁻¹
Anisotropy field (Hk) | 2.5 × 10⁴ A·m⁻¹
Temperature | 25 °C
Table 3. Summary of simulation parameters used for reproducibility.

Parameter | Value
Memory bank size | 256 × 256 cells (up to 128 kB)
Supported precision | Binary, Quaternary, 4-bit
Activity factor | 0.5
CNN workload | 64 × 64 input, 3 × 3 kernel (Winograd F(4 × 4, 3 × 3))
Benchmarked networks | ShuffleNet, ResNet18, TinyYOLO, MobileNetV2, EfficientNet-V0
Simulation tool | HSPICE with 45 nm NCSU FreePDK
Supply voltage | 1.2 V
Temperature | 25 °C
Access timing | 5 ns per read/write operation
Monte Carlo runs | 1000 iterations
Table 4. Simulation results and comparison of the proposed design with previous designs.

Metric | Proposed | [13] | [25] | [17]
Technology | SOT-assisted STT | SOT-assisted STT | STT | SRAM
Non-volatile | Y | Y | Y | N
Transistors/bit | 2.25 | 2 | 2 | 8
Power | 136 μW | 202 μW | 287 μW | 281 μW
Delay | 807 ns | 1123 ns | 1346 ns | 798 ns
PDP | 110 pJ | 227 pJ | 387 pJ | 210 pJ
Static power | ≅0 | ≅0 | ≅0 | 11.67 μW