Article

A Unified FPGA/CGRA Acceleration Pipeline for Time-Critical Edge AI: Case Study on Autoencoder-Based Anomaly Detection in Smart Grids

Electrical and Computer Engineering Department, University of Patras, 26504 Patras, Greece
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 414; https://doi.org/10.3390/electronics15020414
Submission received: 15 December 2025 / Revised: 12 January 2026 / Accepted: 14 January 2026 / Published: 17 January 2026
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)

Abstract

The ever-increasing need for energy-efficient implementation of AI algorithms has driven the research community towards the development of many hardware architectures and frameworks for AI. Much of this work has focused on FPGAs, while more sophisticated architectures such as CGRAs have also attracted considerable attention. However, AI ecosystems remain isolated and fragmented, with no standardized way to compare different frameworks through detailed Power–Performance–Area (PPA) analysis. This paper bridges the gap by presenting a unified, fully open-source hardware-aware AI acceleration pipeline that enables seamless deployment of neural networks on both FPGA and CGRA architectures. Built around the Brevitas quantization framework, it supports two distinct backend flows: FINN for high-performance dataflow accelerators and CGRA4ML for low-power coarse-grained reconfigurable designs. To facilitate this, a model translation layer from QONNX to QKeras is also introduced. To demonstrate its effectiveness, we use an autoencoder model for anomaly detection in wind turbines. We deploy our accelerated models on AMD's ZCU104 platform and benchmark them against a Raspberry Pi. Evaluation on a realistic cyber–physical testbed shows that the hardware-accelerated solutions achieve substantial gains, with up to 10× and 37× faster inference for the two flows and over 11× higher energy efficiency, while maintaining acceptable reconstruction accuracy.

1. Introduction

1.1. Motivation

Artificial Intelligence (AI) has emerged as a transformative technology with applications across numerous and diverse domains. Applications in computer vision, natural language processing, predictive maintenance, and anomaly detection are typical examples of the capabilities of AI nowadays. As AI continues to mature, interest is growing in deploying it in time-critical cyber–physical environments, where rapid response times and reliable operation are essential [1]. Indicative examples of such AI applications already exist, such as autoencoder models, which provide timely responses and accurate performance [2,3,4]. However, the widespread adoption of AI has been accompanied by an alarming escalation in computational demands and energy consumption [5]. Modern deep learning models require extensive computational resources during both training and inference phases. This leads to substantial power consumption that poses environmental, economic, and practical deployment concerns. For AI to continue its growth trajectory and maintain long-term sustainability, the development of energy-efficient AI acceleration solutions has become absolutely essential. This urgency has sparked intensive research into specialized hardware architectures that can deliver the required computational performance while reducing power consumption.

1.2. Background and State of the Art

Neural Processing Units (NPUs) and Deep Learning Processing Units (DPUs) have revolutionized the landscape of AI inference at the edge. These specialized processors are fundamentally designed to excel at the matrix multiplication and multiply-accumulate operations that dominate neural network computations, achieving orders of magnitude improvements in energy efficiency compared to general-purpose CPUs and even GPUs. At their architectural core, these accelerators employ design principles rooted in coarse-grained reconfigurable architectures (CGRAs), which provide a balance between the flexibility of software-programmable processors and the efficiency of application-specific integrated circuits [6,7]. CGRAs consist of arrays of functional units interconnected through a programmable routing network. This enables the mapping of computational kernels directly onto hardware datapaths with minimal control overhead. By exploiting spatial parallelism and eliminating much of the instruction fetch and decode overhead inherent in traditional processor architectures, CGRA-based accelerators can achieve remarkable throughput and energy efficiency for structured computational patterns typical of neural network inference. This architectural paradigm has proven particularly effective for edge AI applications where power consumption, latency, and throughput requirements must be simultaneously optimized.
Field-Programmable Gate Arrays (FPGAs) and CGRAs have long been recognized as promising platforms for implementing energy-efficient AI accelerators. FPGAs offer the unique advantage of hardware reconfigurability, allowing designers to implement custom datapaths tailored to specific neural network architectures and precision requirements, thereby achieving superior energy efficiency compared to fixed-function processors. Similarly, CGRAs have attracted significant attention as an intermediate point between the full flexibility of FPGAs and the efficiency of ASICs, with various research frameworks demonstrating their effectiveness for neural network acceleration, and many CGRA architectures have been mapped onto FPGA fabrics and validated. There are many works in the literature that demonstrate hardware acceleration of AI applications on such platforms [8,9,10,11]. Ref. [8] presents a summary of many accelerated AI applications on FPGAs featuring all the major deep learning model architectures (ANNs, CNNs, RNNs, etc.). Ref. [9] presents a full software stack for accelerating Convolutional Neural Networks (CNNs) into an energy-efficient streaming hardware architecture. Ref. [10] showcases a systolic parallel hardware architecture able to efficiently accelerate Artificial Neural Networks (ANNs), autoencoders, etc., while ref. [11] presents a hardware accelerated version of a Vision Transformer (ViT)-based model with significant energy efficiency results.
Despite the abundance of available tools and frameworks, the FPGA and CGRA Machine Learning (ML) acceleration landscape remains fragmented and challenging to navigate for practitioners seeking to deploy AI at the edge. Many existing pipelines suffer from significant limitations. Some are built upon deprecated or outdated toolchains that no longer receive active development support, while others present steep learning curves with complex workflows that require deep hardware expertise. The lack of standardization across different frameworks makes it extremely difficult to perform fair comparisons between FPGA-based and CGRA-based acceleration approaches, or to benchmark these hardware solutions against contemporary software frameworks such as PyTorch or TensorFlow. Furthermore, most existing tools provide limited support for hardware-aware quantization. This forces designers to either accept suboptimal quantization schemes or manually iterate through time-consuming design-space exploration processes. Researchers and engineers working on edge AI applications often find themselves navigating a complex landscape of incompatible tools, each with its own model formats, quantization schemes, and deployment workflows, making it difficult to leverage the full potential of both FPGA and CGRA acceleration technologies.
Existing approaches to FPGA and CGRA-based neural network acceleration include several notable frameworks. FINN [12,13] provides a dataflow-oriented design flow for quantized neural networks on FPGAs with emphasis on streaming architectures and customizable processing elements. In addition, hls4ml [14], originally developed for high-energy physics applications, focuses on ultra-low latency inference. For CGRA acceleration, frameworks like CGRA-ME [15] and various academic research tools have demonstrated the potential of coarse-grained reconfigurable architectures for neural network inference, though many remain research prototypes with limited production readiness. Commercial solutions from FPGA vendors, including Xilinx’s Vitis AI [16] and Intel’s OpenVINO toolkit [17], offer comprehensive design flows but often require significant vendor-specific expertise and may not provide the level of customization desired for research applications. Other notable efforts include DNNDK [18], VTA [19], and various high-level synthesis approaches that attempt to bridge the gap between software model description and hardware implementation. However, these tools typically operate in isolation, each with proprietary model formats and quantization approaches. This makes systematic comparison and evaluation across different acceleration strategies extremely challenging for researchers and practitioners in the field.

1.3. Contributions

This paper presents a unified, fully open-source hardware-aware AI acceleration pipeline that addresses these fragmentation challenges. Our pipeline provides a common frontend based on the Brevitas 0.12.0 quantization library [20] (an add-on library of PyTorch), supporting two distinct acceleration flows targeting both custom FPGA dataflow accelerators through FINN 0.10.1 and CGRA-based architectures through a state-of-the-art CGRA-ML framework called CGRA4ML [21,22,23]. Brevitas enables seamless comparison between different hardware acceleration approaches while maintaining full compatibility with PyTorch, supporting both Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) of pretrained models. The hardware-aware quantization capabilities of Brevitas allow our pipeline to adapt quantization strategies to the specific requirements and constraints of each target acceleration flow, accommodating the distinct characteristics of dataflow FPGA architectures versus CGRA mappings. We demonstrate the effectiveness and versatility of our proposed pipeline through a comprehensive case study targeting autoencoder-based anomaly detection for wind turbines in smart energy systems, a representative time-critical cyber–physical system application.
The main contributions of this work are the following:
  • A unified hardware-aware AI acceleration pipeline targeting FPGA devices with two diverse acceleration flows, one focused on custom high-performance low-latency AI dataflow accelerator design via the FINN library, and the other centered around a state-of-the-art CGRA-ML framework. A model translation layer from QONNX 0.4.0 to QKeras 0.9.0 is also introduced to facilitate the process. To our knowledge, this is the first time that a fully open-source unified ML pipeline for both FPGAs/CGRAs is presented, with the powerful Brevitas quantization library as the frontend.
  • Application of the proposed pipeline on a real-world time-critical cyber–physical system scenario in the smart grid domain for the acceleration of a real-time anomaly detection solution for wind turbines.
  • Validation of the proposed pipeline on the presented cyber–physical system scenario by applying our anomaly detection solution on two different hardware platforms: a Commercial Off-The-Shelf (COTS) Raspberry Pi (RPi) utilizing ONNX Runtime, which serves as the reference baseline, and an AMD ZCU104 platform. The core idea is to compare both platforms in terms of accuracy, performance, latency, power, energy efficiency, etc., as well as to compare the two acceleration flows when given the same quantized model input.

1.4. Paper Structure

This paper is structured as follows: Section 1 introduces the problem and shows the need for our proposed pipeline. Section 2 presents the hardware-aware AI acceleration pipeline. Section 3 introduces the cyber–physical system problem under testing conditions and presents the anomaly detection solution for which the proposed pipeline will be used. In Section 4, we describe the experimentation procedure and present the results. We conclude our work in Section 5, where the benefits of the proposed solution are summarized.

2. Hardware-Aware AI Acceleration Pipeline

This section presents the high-level architecture of the proposed hardware-aware AI acceleration pipeline and describes the two different acceleration flows in depth.

2.1. High-Level View of the Proposed Pipeline

The proposed hardware-aware AI acceleration pipeline is presented in Figure 1 below. The pipeline can be split into a set of steps:
1.
The first step is for the AI engineer to prepare (or pre-process) the available dataset and select the model architecture best suited for the application of interest in a PyTorch environment. The outcome of this step is a set of Python 3.10 scripts: one containing the model architecture description and one for the training procedure.
2.
The second step is the model training phase. Here we can follow two different paths:
(a)
Either train the model in PyTorch in floating-point mode directly and then apply PTQ on the model with Brevitas, or 
(b)
Apply QAT on the model with Brevitas, therefore training the quantized model directly, given that the model architecture description and training script from Step 1 are already prepared in a Brevitas QAT-ready format.
3.
(Optional) The third step mainly applies to those having followed Step 2(a) and concerns quantization-aware model finetuning. For those who used Brevitas PTQ methods, the resulting quantized model may not be accurate enough, making a finetuning process mandatory. This process is the same as training a standard PyTorch model, but in this case the model is a Brevitas PTQ-generated one featuring special “quantization hooks”, which enable the generated model to be used as a standard floating-point one during finetuning. This further facilitates the whole quantization process.
4.
During the fourth step, the quantized model is extracted in the quantized Open Neural Network eXchange (ONNX), also known as QONNX [24], format, and is given as input to the acceleration flow of choice:
(a)
FINN flow;
(b)
CGRA4ML flow.

2.2. Brevitas Frontend

Brevitas is the main intersection point between the FINN and CGRA4ML flows. It was selected for the following reasons:
1.
It features many state-of-the-art quantization algorithms for almost all popular AI models, from basic DNNs to even core Large Language Model (LLM) functionalities.
2.
It supports both QAT and PTQ while providing full flexibility to the developer, who can choose custom integer resolutions below 8 bits, down to 1 bit for Binary Neural Network (BNN) computation support.
3.
It exports quantized models in the quantized version of the popular ONNX Intermediate Representation (IR) model format, which is already supported in all major AI workflows.
4.
It is inherently supported in FINN; therefore, it is easy to export a quantized model and work with it there.
5.
It is a very active open-source project, with a great community and a lot of support and examples that enable the developer to go straight into action.

2.3. FINN Flow

The FINN flow is based on the homonymous open-source project from Xilinx and is essentially a compiler toolchain that can transform a quantized model into a Register-Transfer-Level (RTL) description. The toolchain provides a frontend with its transformation and analysis passes and a high-level synthesis (HLS)-based backend to explore the design space in terms of resource and performance constraints. A user can build a custom step-by-step flow, but in this work, we leverage the automatic build flow functionality and create a more generic one that is able to optimize both common Deep Neural Network (DNN) topologies and more specialized ones like the models presented later in Section 3, based on a performance target and device constraints. Internally, FINN operates exclusively on an ONNX-based IR, the well-known QONNX format. The FINN toolchain comprises three main phases of graph transformation passes: the preparation phase, the mapping phase, and the tuning phase. These phases are responsible for rearranging or fusing operations, assigning them to specific HLS/RTL templates, and finally calculating and preparing the necessary compute resources, via a selection of parallel processing elements (PEs), Single-Instruction–Multiple-Data (SIMD) input channel lanes per PE, number and size of First-In First-Out (FIFO) queues, etc., for each layer to obtain the desired throughput within a balanced pipeline. In addition, FINN provides numerous utility tools for code analysis, resource estimation, and simulation, available at any phase of the compilation process. Finally, FINN is able to export the resulting RTL as a bitstream file for the appropriate FPGA platform with little or no effort, which can be deployed immediately with the generated shell project and driver for Xilinx Alveo and PYNQ platforms.
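The automatic build flow can be invoked roughly as follows, using FINN's builder API. This is a configuration sketch: the performance target, clock period, output directory, and model file name are illustrative values, not the exact configuration used in this work.

```python
# Sketch of FINN's automatic build flow; field values are illustrative.
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="build_autoencoder",
    target_fps=20000,              # performance target driving PE/SIMD folding
    synth_clk_period_ns=10.0,      # 100 MHz target clock
    board="ZCU104",
    shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
        build_cfg.DataflowOutputType.BITFILE,
        build_cfg.DataflowOutputType.PYNQ_DRIVER,
    ],
)
build.build_dataflow_cfg("autoencoder_qonnx.onnx", cfg)
```

Given a QONNX input model, this single call runs the preparation, mapping, and tuning phases and emits resource estimates, the bitstream, and a PYNQ deployment driver.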

2.4. CGRA4ML Flow

The CGRA4ML flow was chosen as an alternative AI hardware acceleration flow to FINN. It is based on the homonymous open-source project of Kastner R., which is basically a comprehensive toolchain that transforms quantized models into vendor-agnostic SystemVerilog RTL designs suitable for both FPGA and application-specific integrated circuit (ASIC) implementations. CGRA4ML provides a Python-based frontend built on top of QKeras for both QAT and PTQ, along with an automated backend that generates parameterizable CGRA hardware specifications, production-ready C runtime firmware, and comprehensive verification testbenches. Unlike spatial layer-by-layer implementations, CGRA4ML enables temporal reuse of processing elements across multiple layers through dynamic reconfiguration, allowing the deployment of larger and more complex models that exceed the capacity constraints of purely on-chip implementations. Internally, CGRA4ML works with an IR based on bundles, which are groups of QKeras layers that can be deterministically executed on the generated hardware, with compute-intensive operations accelerated on the CGRA while lightweight pixel-wise operations are offloaded to the host CPU. The framework features a unified dataflow architecture with runtime-programmable control that maximizes data reuse through intelligent weight caching and pixel shifting, alongside high-performance AXI DMAs for efficient off-chip data movement. Finally, CGRA4ML provides automated Tool Command Language (TCL) toolflows for both FPGA and ASIC implementations, along with integration scripts for deployment on Zynq-based platforms with PYNQ support, facilitating rapid prototyping and production-ready implementations with minimal manual intervention.

2.5. QONNX-QKeras Translation

The existing FINN and CGRA4ML tools are built on two different quantization frontends, namely the Brevitas and QKeras libraries. As explained in Section 2.2, we have decided to use Brevitas as our frontend, due to the active development around it and the many features it provides in all cases, for both QAT and PTQ, and all kinds of AI models. This means that we are responsible for creating a custom compilation flow from the QONNX to the QKeras IR model format. At this point, our work focuses on Conv/ConvTranspose nodes and standard activation nodes (e.g., ReLUs), while support for other popular layers and functionalities (e.g., residual blocks) will be added in the future.
In general, the compilation flow consists of a set of seven steps:
1.
Identification of Conv/ConvTranspose nodes, along with their respective activation nodes and quantizers.
2.
Extraction of basic parameters for the Conv/ConvTranspose nodes (kernel size, strides, dilations, filters).
3.
Identification of relevant weight and bias Quant nodes, along with extraction of the respective raw tensors and quantization information.
4.
Preparation of the model architecture layer per layer (or more precisely bundle by bundle, following the CGRA4ML/QKeras logic) based on a QKeras bundle template.
5.
Extraction of input shape from the QONNX model and transformation from NCHW (batch, channels, height, width) to NHWC (batch, height, width, channels) format.
6.
Assignment of weight and bias tensors to the respective bundles, with transformation of the weight tensors’ shapes for QKeras compatibility.
7.
Model export and continuation with the rest of the CGRA4ML flow.
At this point, it is necessary to highlight that throughout this process we follow CGRA4ML’s internal structures to ensure compatibility. This leads us to use CGRA4ML’s XConvBN bundle for Conv layer mapping. XConvBN includes an internal BatchNorm layer for weight/bias normalization. During the QONNX-QKeras translation, since the extracted weights and biases do not require normalization, having already been prepared by Brevitas, we assign appropriate tensors to the BatchNorm layer so that it is nullified.
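The nullification amounts to choosing BatchNorm parameters that make the layer an exact identity. A minimal numerical check in NumPy (the epsilon value is an assumed example, not CGRA4ML's exact setting):

```python
import numpy as np

# Keras-style BatchNorm computes y = gamma * (x - mean) / sqrt(var + eps) + beta.
# The parameters below make it an exact identity mapping.
def identity_batchnorm_params(channels, eps=1e-3):
    gamma = np.ones(channels)
    beta = np.zeros(channels)
    mean = np.zeros(channels)
    var = np.ones(channels) - eps   # so that sqrt(var + eps) == 1 exactly
    return gamma, beta, mean, var

gamma, beta, mean, var = identity_batchnorm_params(4)
x = np.random.randn(8, 4)
y = gamma * (x - mean) / np.sqrt(var + 1e-3) + beta
print(np.allclose(x, y))   # True
```

With these tensors assigned, the XConvBN bundle behaves as a plain Conv layer, passing the Brevitas-prepared weights and biases through unchanged.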
The compilation steps are summarized in a pseudocode format in Algorithm 1.
Algorithm 1 QONNX-QKeras Translation Flow
Require: QONNX model M_QONNX
Ensure: QKeras IR model M_QK
 1: function QONNX_to_QKeras(M_QONNX)
 2:     N_conv ← FindNodes(M_QONNX, {Conv, ConvTranspose})
 3:     N_act ← FindActivationNodes(M_QONNX)
 4:     Q ← FindQuantizers(M_QONNX)
 5:     for all n ∈ N_conv do
 6:         (k_h, k_w) ← GetKernelSize(n)
 7:         (s_h, s_w) ← GetStrides(n)
 8:         (d_h, d_w) ← GetDilations(n)
 9:         C_out ← GetNumFilters(n)
10:     end for
11:     for all n ∈ N_conv do
12:         W_n ← ExtractWeights(n)
13:         b_n ← ExtractBias(n)
14:         (Q_n^W, Q_n^b) ← ExtractQuantInfo(Q, n)
15:     end for
16:     B ← ∅                      ▹ set of QKeras bundles
17:     for all n ∈ N_conv do
18:         B_n ← InstantiateBundleTemplate(XConvBN)
19:         AssignParameters(B_n, k_h, k_w, s_h, s_w, d_h, d_w, C_out)
20:         B ← B ∪ {B_n}
21:     end for
22:     (N, C, H, W) ← GetInputShape(M_QONNX)
23:     (N, H, W, C) ← TransformToNHWC(N, C, H, W)
24:     for all B_n ∈ B do
25:         W̃_n ← TransformWeightLayout(W_n)
26:         AssignWeights(B_n, W̃_n)
27:         AssignBias(B_n, b_n)
28:         NullifyInternalBatchNorm(B_n)
29:     end for
30:     M_QK ← AssembleModel(B)
31:     return M_QK
32: end function

ConvTranspose Translation

A 2D transposed convolution can be re-expressed as a standard convolution applied to an explicitly upsampled version of the input [25]. Instead of relying on the implicit gradient-of-convolution interpretation, the operator is decomposed into two concrete steps:
1.
Pixel Padding (Explicit Upsampling)
A transposed convolution with weights $W \in \mathbb{R}^{C_{in} \times C_{out} \times K_h \times K_w}$, stride $(s_h, s_w)$, dilation $(d_h, d_w)$, and padding $p$ is written as
$$Y = \mathrm{ConvTranspose}(X, W).$$
The first step is to upsample the input tensor by inserting zeros between adjacent samples according to the stride. For an input feature map $X \in \mathbb{R}^{H \times W \times C_{in}}$, the pixel-padded tensor $X'$ is defined as
$$X'_{i s_h,\, j s_w,\, c} = X_{i,j,c}, \qquad X'_{u,v,c} = 0 \ \text{otherwise},$$
with output dimensions
$$H' = H s_h - (s_h - 1), \qquad W' = W s_w - (s_w - 1).$$
This operation performs deterministic upsampling without introducing learnable parameters:
$$X' = \mathrm{PixelPadding}(X;\, s_h, s_w).$$
2.
Kernel Rotation and Channel Reordering
The second step constructs a convolution kernel that matches the effect of the original transposed kernel. The required transformation consists of
(a)
Rotating the spatial dimensions of $W$ by $180^{\circ}$;
(b)
Swapping the input and output channel axes.
Formally, the rotated kernel is given by
$$\widetilde{W} = \mathrm{Transpose}\big(\mathrm{Rot180}(W),\, (1, 0, 2, 3)\big),$$
where
$$\mathrm{Rot180}(W)_{c_{in}, c_{out}, u, v} = W_{c_{in}, c_{out}, K_h - 1 - u, K_w - 1 - v}.$$
After rotation and channel transposition, the tensor $\widetilde{W} \in \mathbb{R}^{C_{out} \times C_{in} \times K_h \times K_w}$ conforms to the format expected by a standard convolution.
Combining the two steps yields the following equivalence:
$$\mathrm{ConvTranspose}(X, W) = \mathrm{Conv}\big(\mathrm{PixelPadding}(X;\, s_h, s_w),\, \widetilde{W}\big),$$
with convolution stride fixed to $(1, 1)$ and dilation and padding preserved from the original operator.
This formulation expresses ConvTranspose entirely through a conventional Conv layer acting on an explicitly upsampled input. The resulting representation preserves all learnable parameters while providing a structurally compatible, Conv-only form suitable for downstream processing inside CGRA4ML without introducing many changes in the flow, both during model preparation and during runtime.
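The equivalence can be verified numerically with naive NumPy implementations of both operators (restricted here, for brevity, to a single batch, unit dilation, and zero padding in the transposed convolution):

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive cross-correlation. x: (C_in, H, W), w: (C_out, C_in, kh, kw)."""
    cin, h, wd = x.shape
    cout, _, kh, kw = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (wd + 2 * pad - kw) // stride + 1
    y = np.zeros((cout, oh, ow))
    for co in range(cout):
        for i in range(oh):
            for j in range(ow):
                y[co, i, j] = np.sum(
                    xp[:, i * stride:i * stride + kh, j * stride:j * stride + kw] * w[co])
    return y

def conv_transpose2d(x, w, stride=1):
    """Naive transposed convolution. x: (C_in, H, W), w: (C_in, C_out, kh, kw)."""
    cin, h, wd = x.shape
    _, cout, kh, kw = w.shape
    y = np.zeros((cout, (h - 1) * stride + kh, (wd - 1) * stride + kw))
    for ci in range(cin):
        for i in range(h):
            for j in range(wd):
                y[:, i * stride:i * stride + kh, j * stride:j * stride + kw] += x[ci, i, j] * w[ci]
    return y

def pixel_pad(x, s):
    """Insert (s - 1) zeros between adjacent samples along H and W."""
    cin, h, wd = x.shape
    out = np.zeros((cin, h * s - (s - 1), wd * s - (s - 1)))
    out[:, ::s, ::s] = x
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))
w = rng.standard_normal((3, 4, 3, 3))            # (C_in, C_out, kh, kw)
s = 2

ref = conv_transpose2d(x, w, stride=s)
w_tilde = w[:, :, ::-1, ::-1].transpose(1, 0, 2, 3)  # Rot180 + channel swap
out = conv2d(pixel_pad(x, s), w_tilde, stride=1, pad=3 - 1)
print(np.allclose(ref, out))   # True
```

Both paths produce an 11 × 11 output for this configuration, confirming term by term that pixel padding followed by a stride-1 convolution with the rotated, channel-swapped kernel reproduces the transposed convolution.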

3. Real-Time Anomaly Detection in Wind Turbines

This section presents the cyber–physical system problem under testing conditions and describes the components which implement the core functionality of the real-time anomaly detection algorithm.

3.1. Concept Description

We assume the cyber–physical system scenario of Figure 2. In this scenario, a wind turbine fleet is connected to the main power grid via an inverter and a power switch. The power switch is controlled by a local turbine controller, which can be a Programmable Logic Controller (PLC). The power switch is opened when an anomaly is detected in the wind turbine fleet behavior, or whenever the wind turbine fleet output should not be connected to the main grid due to a non-nominal generated voltage output, which could introduce harmonics into the power grid. This is critical, since harmonics could drive the rest of the grid into instability and de-synchronization with severe consequences, e.g., power outages [26,27,28]. The cyber–physical system scenario we are dealing with has a critical response time of 50 ms, which is the ideal time between the generation of an anomaly, its detection, and finally the generation of the appropriate control signal at the controller.
To solve this problem and ensure the system’s stability and reliability, we assume the installation of appropriate sensors at the wind turbine fleet which measure a number of features, e.g., the wind speed, the generated voltage, etc. (presented in more detail in Section 3.2). The sensors’ measurements are sent to an embedded board which acts as an AI-enabling extension card to the local turbine controller. Our goal is to present an AI solution that detects anomalies within the critical timespan we set, while taking into account the time it takes for the transmission of measurements from the sensors to the board and the transmission of the anomaly state detection signal from the board to the controller. To achieve this, hardware acceleration is desirable, if not mandatory, since it can yield results in hard real time and with the deterministic behavior needed to accurately predict the time required for the generation of the appropriate control signal.
To this end, in the subsequent sections, we present a complete AI solution based on autoencoder architecture which achieves anomaly detection in wind turbine fleets in real-time.

3.2. Autoencoder-Based Anomaly Detection Module

Autoencoders are a type of Artificial Neural Network that have a wide range of practical applications across numerous domains [29]. In our case study, an autoencoder model is designed and trained for the purpose of detecting anomalies on a fleet of wind turbines in real-time, based on sensor data analysis. Specifically, the function of an autoencoder model is to convert the input data into a compact representation known as the latent space—thus implementing the encoder part—and then to reconstruct the original input from the encoded version—that is, the decoder part. The goal in designing such an architecture is to minimize the difference between the original and reconstructed data. In general, the main advantages of the autoencoder architecture are the lack of supervision during the training procedure and the fact that it can directly receive data from fully functional wind turbines for training. Therefore, it can be trained with normal sensor readings and thereby learn normal behavior. As a consequence, when an anomaly occurs, the change in the input data is amplified at the output and instantly recognized as an abnormality.
The dimensionality of the latent space representation is an important characteristic that is used to categorize autoencoders into either Undercomplete or Overcomplete. What distinguishes the latter from the former is that the size of the latent space representation is greater than the size of the input. While Overcomplete autoencoders carry the risk of learning trivial solutions such as identity mappings, several regularization techniques can be applied to prevent this. Applying sparsity constraints through an L1 penalty or KL divergence can effectively mitigate the risk of extracting trivial solutions [30]. Another applicable technique is weight decay, which encourages smaller, simpler weight values, improving the generalization and robustness of the model. Moreover, Undercomplete autoencoders pose considerable trade-offs, since a smaller latent representation means limited expressive capability, which can be costly when dealing with input data features of equal importance. This means that in cases where all input features carry valuable information, using an Undercomplete autoencoder model forces the loss of determinative data representation. Thus, we select the Overcomplete approach as it allows us to capture information of greater complexity from the data, since the network is able to learn and detect intricate patterns with very high efficiency [31].
To this end, our autoencoder model is based on convolution layers, which are good at extracting hidden features from the input data. Specifically, our model comprises a set of three convolution and three ConvTranspose layers for the encoder and decoder parts, respectively, as shown in Figure 3. On the one hand, the convolution layers are able to extract useful features from the input matrix, leading to the latent space representation. On the other hand, the ConvTranspose layers apply the inverse mathematical operation of the convolution layers, gradually restoring the dimensionality of the input and producing the final output. Lastly, the Rectified Linear Unit (ReLU) is chosen as the activation function. This simple setup allows the network to learn complex patterns while preventing common problems such as the vanishing gradient problem [32], making use of the non-linear behavior of ReLU in both the encoder and the decoder part.
The input data consist of sensor readings of four selected features: angular acceleration, wind speed in rotations per second, rotation of the rotor in rotations per second, and generated voltage in Volts. As part of the data preparation process for the model training, the angular acceleration measurements are converted from the quaternion system to Euler angles; thus, the selected input features are represented by six values. Then, all sensor data are filtered by applying Wavelet denoising, and finally they are normalized. Regarding the denoising process, we use PyWavelets [33], an open-source wavelet transform library for Python that provides a fast and user-friendly interface, to perform filtering on the accelerometer data. For this purpose, Discrete Wavelet Transform (DWT) is used to decompose the chosen data into frequency components, then “soft” thresholding is applied to the resulting wavelet coefficients to suppress high-frequency noise and finally, the original data are reconstructed using multi-level Inverse Discrete Wavelet Transform (IDWT). The resulting denoised data go through Z-Score normalization ensuring uniform scaling among all features to achieve accurate functionality for the model. Z-Score normalization results in a dataset with features of mean value of zero and Standard Deviation value of one.
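The denoising chain (DWT decomposition, soft thresholding, multi-level IDWT) followed by Z-Score normalization can be sketched with PyWavelets as follows. The wavelet family, decomposition level, and the universal-threshold rule are illustrative choices rather than the exact settings used in our preprocessing:

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=3):
    """DWT -> soft-threshold the detail coefficients -> multi-level IDWT."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise estimate from the finest detail band; universal threshold rule
    # (an assumed, common choice for illustration).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(signal)]

def zscore(x):
    """Z-Score normalization: zero mean, unit standard deviation per feature."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

np.random.seed(0)
t = np.linspace(0.0, 1.0, 512)
truth = np.sin(2 * np.pi * 5 * t)          # stand-in for a clean sensor trace
noisy = truth + 0.3 * np.random.randn(512)
clean = wavelet_denoise(noisy)
normalized = zscore(clean)
```

On this synthetic trace, soft thresholding suppresses the high-frequency noise while the normalization yields the zero-mean, unit-deviation features expected by the model.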
The final input tensor is created by applying a sliding window to the resulting dataset, capturing the temporal correlation of neighboring measurements, and converting each time window into a tensor with a defined structure. Each window contains the six features mentioned above across one hundred consecutive samples, with the time axis folded into a 10 × 10 grid. The resulting shape of both the input and the output tensor is (6, 10, 10), and the corresponding datatype is single-precision floating point (FP32).
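As a sketch of this windowing step: assuming the one hundred samples of each window are folded into a 10 × 10 grid (consistent with the stated (6, 10, 10) shape), the tensor construction could look like the following, where the stride and function name are illustrative:

```python
import numpy as np

def make_windows(data, window=100, stride=1):
    """Slice a (num_samples, 6) feature matrix into FP32 tensors of shape (6, 10, 10).

    Each window of 100 consecutive samples is transposed to (features, time)
    and the time axis is folded into a 10 x 10 grid.
    """
    tensors = []
    for start in range(0, data.shape[0] - window + 1, stride):
        w = data[start:start + window]   # (100, 6) slice of consecutive samples
        t = w.T.reshape(6, 10, 10)       # (6, 100) -> (6, 10, 10)
        tensors.append(t.astype(np.float32))
    return np.stack(tensors)

readings = np.random.default_rng(1).standard_normal((250, 6))
batch = make_windows(readings, stride=50)
print(batch.shape, batch.dtype)  # (4, 6, 10, 10) float32
```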

3.3. Threshold Trigger Module

The output of the autoencoder model requires further processing to determine whether it indicates an anomaly. For this reason, the threshold trigger module is created. Its role is to compare the autoencoder's output with its input vector and against a predefined threshold vector. This threshold vector is calculated by comparing the output vectors of the autoencoder module during normal and abnormal states and monitoring their value ranges. If the difference between an output and its respective input vector exceeds the normal value ranges identified by the threshold vector, an anomaly alarm signal is generated and sent directly to the local wind turbine controller for actuation. The threshold trigger module runs in all cases in software on the CPU after each autoencoder inference finishes and does not necessitate hardware acceleration. Its operation is summarized in Figure 4.
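A minimal sketch of this logic follows; the exact comparison rule (here, a per-feature mean absolute reconstruction error checked against the threshold vector) and the function name are our assumptions for illustration:

```python
import numpy as np

def threshold_trigger(model_input, model_output, threshold_vec):
    """Raise an anomaly alarm if the per-feature reconstruction error
    exceeds the normal ranges captured by the threshold vector."""
    # Per-feature mean absolute difference between input and reconstruction
    err = np.abs(model_output - model_input).mean(axis=(1, 2))  # shape (6,)
    return bool(np.any(err > threshold_vec))

x = np.zeros((6, 10, 10), dtype=np.float32)
good = x + 0.01   # reconstruction close to the input: no alarm
bad = x + 0.50    # reconstruction far from the input: alarm
thr = np.full(6, 0.1, dtype=np.float32)
print(threshold_trigger(x, good, thr), threshold_trigger(x, bad, thr))  # False True
```

The returned boolean would then be forwarded as the alarm signal to the local wind turbine controller.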

4. Results

In this section, we describe the experiments conducted to validate our proposed hardware-aware AI acceleration pipeline. We first describe our experimentation setup, then analyze the methodology behind the acquisition of the power/energy metrics and list the metrics used in this paper, and finally present the relevant results.

4.1. Experimentation Setup

In order to validate our hardware-aware AI acceleration pipeline and our autoencoder-based AI solution, we prepare a realistic, complete testbed of the cyber–physical system scenario in our lab, as shown in Figure 5. For the emulation of the power grid components, we prepare a modified IEEE 5-bus test system in Simulink and export it as a Matlab script, based on our previous work on real-time power grid simulation [34,35,36]. The scripts run on a standard Dell Precision 3630 Tower PC with an Intel Core i7-9700 CPU, 32 GB RAM, Windows 11 OS, and Matlab R2024b. The rest of the components are tested on two different embedded platforms. For the baseline reference test, we use a standard RPi 3B+ board. For the acceleration pipeline tests, we use an AMD-Xilinx ZCU104 FPGA board [37]. In our acceleration tests with FINN, all the components run on the quad-core Cortex-A53 of the ZCU104, while the autoencoder model is offloaded to the FINN-generated hardware accelerator. In our acceleration tests with CGRA4ML, all the components run on a MicroBlazeV RISC-V core [38] configured with the microcontroller preset inside the ZCU104's FPGA region, while the autoencoder model is offloaded to the CGRA4ML-generated accelerator configured as a 10 × 64 PE array. For the wind turbine fleet, a playback simulator is used with data from a public repository [39], processed to be more realistic for our scenario. More specifically, the original dataset represents sensor readings from five model mini-wind turbines, each mounted with seven different sensors. From these features, we isolate the relevant ones and apply pre-processing calculations so that they resemble real-world-scale measurements. We place the simulator on the embedded platform as well, since we assume that the sensors are connected to the platform via a fiber optic link; therefore, minimal transmission delays are incurred.
In our tests, we compare the two acceleration flows by starting from a common QONNX autoencoder model, as described in Section 3.2, quantized via the Brevitas PTQ process on 8 bits for weights and activations and 16 bits for biases. To elaborate further, during the quantization process, the original dataset is shuffled and 20% of it is used for the calibration of the quantized model, while the rest is used for the validation of the resulting model.

4.2. Power Measurement Methodology

In order to measure the power for each platform during our experiments, we utilize direct hardware monitoring. For the RPi, we use a COTS Adafruit INA219 power sensor [40], which is connected between the main DC power supply and the Device Under Test (DUT), as shown in Figure 6a. This sensor is a high-side current and voltage monitor chip with an I2C interface that operates using an external shunt resistor of 0.1 Ohms connected to a Programmable Gain Amplifier (PGA) and a 12-bit Analog-to-Digital Converter (ADC). Featuring programmable calibration, gain, filtering, and ADC resolution, it is deemed a versatile and sufficiently accurate instrument for the purposes of this application, providing a direct hardware-based solution. In order to initialize and calibrate the sensor as well as read the sensor measurements, we use the Adafruit INA219 library for Python [41], as it provides an intuitive and modifiable interface that handles the hardware–software interconnection. In our tests, we calibrate the INA219 so that we achieve a power sampling rate of approximately 20 samples per second.
In contrast to the RPi, the ZCU104 FPGA platform features integrated power monitoring capabilities accessible to the user via software. More specifically, the ZCU104 is equipped with three power controller chips: two Infineon IRPS5401 and one TI INA226 [42]. All three chips are accessible via I2C and can provide precise power monitoring information from the board. In our tests, we use only the TI INA226 power controller to monitor the power consumption of the board. For access to the controller via the ZCU104's quad-core Cortex-A53 ARM CPU, we leverage the PMBus module of the PYNQ v3.0.0 library, which facilitates monitoring of the board's main 12 V power rail. For access to the controller during the experiment with the RISC-V MicroBlazeV microprocessor, we utilize the RPi's I2C interface via its GPIO header and connect to the ZCU104's I2C bus via its PMOD interface. This way, the RPi monitors the ZCU104's power during the RISC-V experiment, since in our setup the MicroBlazeV microprocessor does not have access to the sensor. In our tests with the TI INA226 in the ARM/FINN power measurement topology, we achieve a maximum power sampling rate of approximately 65 samples per second. While this limitation is likely introduced by the platform's parallel execution of the accelerator-related and power monitoring scripts, we verified that it does not lead to erroneous results by cross-checking with the topology deployed in the MicroBlazeV tests, which attains a sampling rate of 1000 samples per second. The setups for these cases are shown in Figure 6b,c.
We begin by obtaining a sufficient amount of power measurements while the DUT is in an idle state (without load), thus establishing a base power figure. The power figure measurements are saved in appropriate .csv files during each monitoring session, containing the information of the derived power value, the timestamp, and a tag unique for each inference. During each acceleration test (when the DUT is under the load), we concurrently trigger a power monitoring routine and obtain a new power figure which reflects both the base power consumption of the board and the power consumption of the accelerator. By comparing those two figures, we are able to identify precisely the power behavior related to the acceleration routine and obtain valuable insights, thus enabling the extraction of ground-truth metrics, which are later used to compare the overall efficiency of the proposed solutions. Furthermore, to ensure fair comparison between the two acceleration flows and take into account the data transactions’ power utilization, we develop scripts that are repeatedly executing data transactions to and from the external memory in the respective accelerator’s fashion, that is, by taking into account its computation latency. In this way, we determine the mean power consumption of each host CPU data scheduling approach and utilize it in the subsequent computations.
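The subtraction of the base power figure and the per-inference grouping by tag can be illustrated as follows; the log format is simplified (the actual logs are .csv files that also carry timestamps), and the numbers are made up:

```python
from collections import defaultdict
from statistics import mean

def per_inference_power(samples, base_power_mw):
    """Group (power_mW, inference_tag) log entries by tag, subtract the idle
    baseline, and return the mean load power of each inference."""
    by_tag = defaultdict(list)
    for power_mw, tag in samples:
        by_tag[tag].append(power_mw - base_power_mw)
    return {tag: mean(vals) for tag, vals in by_tag.items()}

# Two samples per inference, total board power in mW; baseline is the idle figure
log = [(10500.0, "inf0"), (10700.0, "inf0"), (10600.0, "inf1"), (10800.0, "inf1")]
print(per_inference_power(log, base_power_mw=10142))
# {'inf0': 458.0, 'inf1': 558.0}
```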
The metrics we utilize in our work for comparing the different platforms in terms of power/energy are as follows:
  • Overall Mean Power per Inference (mW);
  • Mean Absolute Deviation of Power per Inference (MAD) (mW);
  • Standard Deviation of Mean Power per Inference (STDev) (mW);
  • Mean Inference Duration (ms);
  • Mean Energy per Inference (mJ);
  • Performance (inferences per second);
  • Efficiency (Performance per W).
We develop Python scripts for automatic parsing and post-processing of the power logs. We begin by calculating and subtracting the base power consumption of the DUT from all power measurements logged during the execution under load. Hence, the computations described from this point forward concern the power associated exclusively with the load examined in each case.
We calculate the mean power $\bar{P}_i$ consumed during inference $i$ from the $n_i$ power measurements $P_{i,j}$ logged during that inference's execution period:

$$\bar{P}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} P_{i,j}$$
We are then able to define the Overall Mean Power per Inference as the mean of the $N$ per-inference values $\bar{P}_i$ calculated previously. This gives us a precise, representative value that accurately depicts power consumption at the inference level.

$$\bar{P}_{\mathrm{overall}} = \frac{1}{N} \sum_{i=1}^{N} \bar{P}_i$$
We compute the Mean Absolute Deviation of the per-inference values $\bar{P}_i$ to obtain a measure of the dispersion in the mean power per inference execution, as well as their Standard Deviation so as to examine the amount of variation around $\bar{P}_{\mathrm{overall}}$. Furthermore, the Mean Inference Duration $\bar{t}$ is derived.
Having specified the Mean Inference Duration $\bar{t}$, we calculate the Mean Energy per Inference based on the definition below:

$$\bar{E} = \bar{P}_{\mathrm{overall}} \times \bar{t}$$
We also focus on the performance of the implementations, establishing the average number of inferences performed per second:

$$\overline{Prf} = \frac{1}{\bar{t}}$$
Lastly, we quantify efficiency as the ratio between performance and overall mean power:

$$\overline{Ef} = \frac{\overline{Prf}}{\bar{P}_{\mathrm{overall}}}$$
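The chain of definitions above can be condensed into a short routine. The inputs are illustrative values in the range reported for the FINN accelerator, not measured data:

```python
def derive_metrics(mean_power_per_inference_mw, mean_duration_ms):
    """Compute energy, performance, and efficiency from the per-inference
    mean power values and the mean inference duration."""
    n = len(mean_power_per_inference_mw)
    p_overall_mw = sum(mean_power_per_inference_mw) / n   # Overall Mean Power
    t_bar_s = mean_duration_ms / 1000.0                   # Mean Inference Duration
    energy_mj = p_overall_mw * t_bar_s                    # mW * s = mJ
    perf_inf_s = 1.0 / t_bar_s                            # inferences per second
    eff_per_w = perf_inf_s / (p_overall_mw / 1000.0)      # inferences/s per Watt
    return energy_mj, perf_inf_s, eff_per_w

e, prf, ef = derive_metrics([395.0, 400.0, 403.0], mean_duration_ms=27.5)
print(round(e, 2), round(prf, 1), round(ef, 1))  # 10.98 36.4 91.1
```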

4.3. Evaluation of the Proposed Pipeline

In this section, we elaborate on the testing scenarios studied, as well as present the results in detail and assess them.
Firstly, we capture the power profile of each platform in the idle state. We run the respective software monitoring routines described in the previous section on each device for a total duration of thirty minutes, allowing us to extract the mean idle power consumption of the devices with high confidence. By parsing these profiles, we compute the mean idle power consumed by the RPi to be 1.639 W, whereas the ZCU104 consumes 10.142 W. Additionally, we gather power measurements with the devices in the under-load state. The resulting graphs of both the idle and under-load states for each platform are depicted in Figure 7a,b. The measurements of the RPi in the under-load state, shown in Figure 7a, also correspond to a wall-clock duration of thirty minutes, comprising approximately 2000 inferences. The same duration was chosen for the ZCU104, as shown in Figure 7b. It is worth noting that the runtime needed for the ZCU104 platform to perform the same number of inferences is significantly shorter, amounting to approximately 1 min and 24 s for the FINN flow alone.
The results of the post-processing methodology described in the previous section are presented in Table 1 for each DUT:
When comparing the results of the ONNX Runtime implementation on the RPi against the FINN-generated accelerator on the ZCU104 platform, several observations can be made. In terms of Overall Mean Power per Inference, the ZCU104 FINN implementation shows a 7.5% increase, while its Mean Absolute Deviation is 74% lower and its Standard Deviation 73.8% lower. This indicates that the FINN accelerator operates with a more stable power profile than the RPi, which benefits the platform's power stability and health. Moreover, the ZCU104 achieves a major reduction of 89.4% in Mean Inference Duration, causing an equally major drop of 88.6% in Energy Consumption per Inference. This confirms the initial expectation that a custom design, exploiting parallelism and folding schemes, optimizes the execution of the algorithm. The conclusion is amplified by the last two metrics, performance and efficiency: the RPi achieves 3 inf/s, whereas the FINN-derived accelerator delivers a 12-fold improvement, and in terms of efficiency the RPi reaches only about 9% of the FINN accelerator's figure.
The metrics of CGRA4ML on the ZCU104 platform are equally interesting. It exhibits the highest Overall Mean Power per Inference of all implementations, specifically 15.7% more than the FINN case. This correlates with the fact that the CGRA4ML-generated accelerator needs to move data to and from the off-chip RAM, which causes the power increase. Nevertheless, regarding MAD and STDev, the CGRA4ML topology shows favorable results on par with the FINN accelerator, again indicating a stable power profile. Furthermore, CGRA4ML offers an exceptionally low Mean Inference Duration of merely 6.89 ms, around 75% faster than the FINN accelerator. This speedup is mainly due to the fact that the FINN accelerator's runtime environment is managed by the PYNQ library, while the CGRA accelerator runs via baremetal C code. The FINN accelerator's lower speed is also explained by the target Frames-Per-Second (FPS) choice during the build process: since the goal is to provide a basic performance overview of the presented flows, we use the default setup without specifying an FPS constraint. The Mean Energy per Inference amounts to 3.2 mJ, a 71% drop compared to the FINN design. Correspondingly, the performance and efficiency of the CGRA4ML accelerator are significantly better than FINN's (4× and 3.4×, respectively). We conclude that CGRA4ML is equally capable in ML acceleration, offering an impactful improvement in efficiency during operation.
Another important characteristic of the flows tested in this research is the accuracy achieved by the algorithm in each environment. We express it through the reconstruction error of the model, i.e., its Mean Squared Error (MSE), since this is one of the most common and widely accepted metrics for autoencoder models. The resulting value for each DUT is shown in Table 2:
The original autoencoder model is validated right after the training process, exhibiting a reconstruction error of 0.0015 (MSE). This is the baseline error of the model, which we also observe in the RPi deployment. The model's accuracy is tested again right after we perform PTQ with the Brevitas library, resulting in a reconstruction error of 0.0026 (MSE). The error remains the same during both the functional verification of the model inside the FINN build process and the final hardware verification, so accuracy is sufficiently preserved throughout the build process. Lastly, the accuracy of the CGRA4ML implementation matches that of the FINN flow, which shows that our pipeline works as expected and does not tamper with the QONNX model during the QKeras translation. All things considered, 8-bit quantization performs very well. Lower-precision quantization schemes could yield similar accuracy with greater energy savings; although not investigated in our tests, 4-bit quantization is expected to roughly double the energy performance metrics of the ZCU104 shown in Table 1.
It is also worthwhile to compare the resources used by the designs originating from the FINN and CGRA4ML flow. Table 3 and Table 4 summarize the hardware resources used by each flow during our tests, in order to better grasp the impact that these strategies have on the final design footprint:
The resource utilization percentages reveal no clearly superior architecture. The FINN design uses fewer Look-Up Tables (LUTs) and Flip-Flops (FFs) than the CGRA4ML design (10% and 5%, respectively, as opposed to 25% and 20%), whereas the CGRA4ML design utilizes slightly less Look-Up Table Random Access Memory (LUTRAM) as well as UltraRAM. Regarding Block RAM (BRAM), the CGRA4ML accelerator occupies approximately half the quantity that FINN requires. However, the reported resources of the CGRA4ML accelerator also include those of the MicroBlazeV microcontroller used in our tests.
Lastly, we point out the real-time response of the accelerators in contrast to the RPi platform. Our experiments showed that the FINN accelerator achieved a Mean Inference Duration of 27.5 ms and the CGRA one reached 6.89 ms, whereas the RPi needed 259.3 ms per inference. Adding the threshold trigger module's processing delay (less than 1 ms in all cases) and the minimal transmission delay of the sensor data from the sensors to the board, the accelerators easily meet the hard real-time constraint of 50 ms set at the beginning, while the RPi fails to produce an output within the acceptable time period. The deterministic nature of the accelerators further makes them a reliable solution, suitable for cyber–physical problems.
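The deadline check reduces to a simple sum of the latency components; the following sketch uses the reported mean inference durations and assumes a negligible transmission delay (the function name and default arguments are illustrative):

```python
def meets_deadline(inference_ms, trigger_ms=1.0, transmission_ms=0.0, deadline_ms=50.0):
    """End-to-end latency check against the hard real-time constraint."""
    return inference_ms + trigger_ms + transmission_ms <= deadline_ms

# Reported Mean Inference Durations per platform
results = {
    "RPi": 259.3,      # ONNX Runtime baseline
    "FINN": 27.5,      # FPGA dataflow accelerator
    "CGRA4ML": 6.89,   # CGRA accelerator
}
for name, latency in results.items():
    print(name, meets_deadline(latency))
# RPi False, FINN True, CGRA4ML True
```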
Figure 8 presents an example of voltage transients and the respective alarm signal generation from the anomaly detection application. Specifically, in Figure 8a we see the voltage transients’ trace and how the application is closely monitoring it and identifies the anomaly events in real-time. In Figure 8b, we can see more clearly the difference in latency between the different implementations, with the RPi alarm signal always lagging behind the accelerator ones. This further highlights the superior capabilities of the accelerators.

5. Discussion

This paper presented a unified, fully open-source hardware-aware AI acceleration pipeline that addresses fragmentation challenges in FPGA and CGRA-based neural network deployment through a common frontend based on the Brevitas quantization library. By supporting two distinct acceleration flows—custom FPGA dataflow accelerators via FINN and CGRA-based architectures through CGRA4ML—our pipeline enables seamless comparison between different hardware acceleration approaches while maintaining full compatibility with PyTorch-based development workflows. We demonstrated the practical effectiveness of our proposed pipeline through a comprehensive case study on autoencoder-based anomaly detection for wind turbines in smart grids, deploying our solution on both a RPi baseline platform with ONNX Runtime environment and an AMD-Xilinx ZCU104 FPGA platform utilizing both acceleration flows. Our experimental validation confirmed true real-time, low-energy, high-performance operation with both acceleration flows, with detailed comparisons demonstrating the superiority of FPGA-accelerated solutions for edge AI applications.
Future work will be dedicated to further automation of the pipeline. More specifically, focus will be placed on an automation routine dedicated to selecting the best quantization scheme for each acceleration flow. This routine will also compare performance estimates for each flow and suggest which flow is better suited for the application of interest. In addition, further research will be devoted to the flows themselves, specifically their enhancement with more features, e.g., the addition of Recurrent Neural Network (RNN) support in CGRA4ML and its comparison with the FINN flow. Finally, broader testing of the pipeline is also in scope, through its integration with the established research field of Neural Architecture Search (NAS) and its active branch of hardware-aware NAS.

Author Contributions

Conceptualization, E.M.; methodology, E.M. and C.F.; software, E.M. and C.F.; validation, E.M., C.F. and S.K.; formal analysis, E.M.; investigation, E.M. and C.F.; resources, E.M.; data curation, E.M.; writing—original draft preparation, E.M. and C.F.; writing—review and editing, E.M. and C.F.; visualization, E.M. and S.K.; supervision, M.B. and A.B.; project administration, A.B.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the European Union’s Horizon 2020 research and innovation programs under grant agreement No 101139194 6G TransContinental Edge Learning (6G-XCEL).

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/lefmylonas/finn_cgra4ml_hw_pipeline.git (accessed on 15 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PPA: Power–Performance–Area
AI: Artificial Intelligence
CNN: Convolutional Neural Network
ANN: Artificial Neural Network
ViT: Vision Transformer
ML: Machine Learning
COTS: Commercial Off-The-Shelf
RPi: Raspberry Pi
NPU: Neural Processing Unit
DPU: Deep Learning Processing Unit
CPU: Central Processing Unit
GPU: Graphics Processing Unit
CGRA: Coarse-Grained Reconfigurable Architecture
FPGA: Field-Programmable Gate Array
QAT: Quantization-Aware Training
PTQ: Post-Training Quantization
ONNX: Open Neural Network eXchange
QONNX: Quantized Open Neural Network eXchange
LLM: Large Language Model
BNN: Binary Neural Network
IR: Intermediate Representation
RTL: Register-Transfer Level
HLS: High-Level Synthesis
DNN: Deep Neural Network
PE: Processing Element
SIMD: Single-Instruction–Multiple-Data
FIFO: First-In First-Out
ASIC: Application-Specific Integrated Circuit
DMA: Direct Memory Access
TCL: Tool Command Language
PLC: Programmable Logic Controller
DWT: Discrete Wavelet Transform
IDWT: Inverse Discrete Wavelet Transform
FP32: Single-Precision Floating-Point Format
DUT: Device Under Test
PGA: Programmable Gain Amplifier
ADC: Analog-to-Digital Converter
GPIO: General-Purpose Input/Output
MAD: Mean Absolute Deviation
STDev: Standard Deviation
MSE: Mean Squared Error
FPS: Frames Per Second
LUT: Look-Up Table
LUTRAM: Look-Up Table Random Access Memory
FF: Flip Flop
BRAM: Block Random Access Memory
URAM: UltraRAM
DSP: Digital Signal Processing
RNN: Recurrent Neural Network
NAS: Neural Architecture Search

References

  1. Radanliev, P.; De Roure, D.; Van Kleek, M.; Santos, O.; Ani, U. Artificial intelligence in cyber physical systems. AI Soc. 2021, 36, 783–796. [Google Scholar] [CrossRef]
  2. Chen, J.; Wen, K.; Xia, J.; Huang, R.; Chen, Z.; Li, W. Knowledge Embedded Autoencoder Network for Harmonic Drive Fault Diagnosis Under Few-Shot Industrial Scenarios. IEEE Internet Things J. 2024, 11, 22915–22925. [Google Scholar] [CrossRef]
  3. Choi, Y.; Joe, I. Motor Fault Diagnosis and Detection with Convolutional Autoencoder (CAE) Based on Analysis of Electrical Energy Data. Electronics 2024, 13, 3946. [Google Scholar] [CrossRef]
  4. Ghazimoghadam, S.; Hosseinzadeh, S.A.A. A novel unsupervised deep learning approach for vibration-based damage diagnosis using a multi-head self-attention LSTM autoencoder. Elsevier Meas. 2024, 229, 114410. [Google Scholar] [CrossRef]
  5. We Did the Math on AI’s Energy Footprint. Here’s the Story You Haven’t Heard, MIT Technology Review. Available online: https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/ (accessed on 15 December 2025).
  6. Liu, L.; Zhu, J.; Li, Z.; Lu, Y.; Deng, Y.; Han, J.; Yin, S.; Wei, S. A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications. ACM Comput. Surv. 2019, 53, 1–39. [Google Scholar] [CrossRef]
  7. Podobas, A.; Sano, K.; Matsuoka, S. A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective. IEEE Access 2020, 8, 146719–146743. [Google Scholar] [CrossRef]
  8. Mohaidat, T.; Khalil, K. A Survey on Neural Network Hardware Accelerators. IEEE Trans. Artif. Intell. 2024, 4, 3801–3822. [Google Scholar] [CrossRef]
  9. Liu, S.; Fan, H.; Ferianc, M.; Niu, X.; Shi, H.; Luk, W. Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 3974–3987. [Google Scholar] [CrossRef] [PubMed]
  10. Medus, L.D.; Iakymchuk, T.; Frances-Villora, J.V.; Bataller-Mompeán, M.; Rosado-Munoz, A. A Novel Systolic Parallel Hardware Architecture for the FPGA Acceleration of Feedforward Neural Networks. IEEE Access 2019, 7, 76084–76103. [Google Scholar] [CrossRef]
  11. Zhao, Z.; Cao, R.; Un, K.F.; Yu, W.H.; Mak, P.I.; Martins, R.P. An FPGA-Based Transformer Accelerator Using Output Block Stationary Dataflow for Object Recognition Applications. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 281–285. [Google Scholar] [CrossRef]
  12. Blott, M.; Preußer, T.B.; Fraser, N.J.; Gambardella, G.; O’brien, K.; Umuroglu, Y.; Leeser, M.; Vissers, K. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Trans. Reconfigurable Technol. Syst. 2018, 11, 1–23. [Google Scholar] [CrossRef]
  13. Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017. [Google Scholar] [CrossRef]
  14. Schulte, J.F.; Ramhorst, B.; Sun, C.; Mitrevski, J.; Ghielmetti, N.; Lupi, E.; Danopoulos, D.; Loncar, V.; Duarte, J.; Burnette, D.; et al. hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware. arXiv 2025, arXiv:2512.01463. [Google Scholar]
  15. Anderson, J.; Beidas, R.; Chacko, V.; Hsiao, H.; Ling, X.; Ragheb, O.; Wang, X.; Yu, T. CGRA-ME: An Open-Source Framework for CGRA Architecture and CAD Research: (Invited Paper). In Proceedings of the 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Virtual, 7–9 July 2021. [Google Scholar] [CrossRef]
  16. Vitis AI User Guide (UG1414). Available online: https://docs.amd.com/r/en-US/ug1414-vitis-ai/Vitis-AI-Overview (accessed on 15 December 2025).
  17. OpenVINO Website. Available online: https://docs.openvino.ai/2025/index.html (accessed on 15 December 2025).
  18. DNNDK User Guide. Available online: https://docs.amd.com/v/u/en-US/ug1327-dnndk-user-guide (accessed on 15 December 2025).
  19. Faure-Gignoux, A.; Delmas, K.; Gauffriau, A.; Pagetti, C. Open-source Stand-Alone Versatile Tensor Accelerator. arXiv 2025, arXiv:2509.19790. [Google Scholar]
  20. Xilinx/brevitas Zenodo Website. Available online: https://zenodo.org/records/16987789 (accessed on 15 December 2025).
  21. CGRA4ML’s GitHub Website. Available online: https://github.com/KastnerRG/cgra4ml.git (accessed on 13 January 2026).
  22. Abarajithan, G.; Ma, Z.; Li, Z.; Koparkar, S.; Munasinghe, R.; Restuccia, F.; Kastner, R. CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing. arXiv 2024, arXiv:2408.15561. [Google Scholar] [CrossRef]
  23. Khadem Hosseini, A.M.; Mirzakuchaki, S. Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework. arXiv 2025, arXiv:2510.22243. [Google Scholar]
  24. Pappalardo, A.; Umuroglu, Y.; Blott, M.; Mitrevski, J.; Hawks, B.; Tran, N.; Loncar, V.; Summers, S.; Borras, H.; Muhizi, J.; et al. QONNX: Representing Arbitrary-Precision Quantized Neural Networks. arXiv 2022, arXiv:2206.07527. [Google Scholar] [CrossRef]
  25. What Is Transposed Convolutional Layer? Available online: https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11/ (accessed on 15 December 2025).
  26. Preventing Blackouts: Real-Time Data Processing for Millisecond-Level Fault Handling. Available online: https://www.ververica.com/blog/preventing-blackouts-real-time-data-processing-for-mission-critical-infrastructure (accessed on 15 December 2025).
  27. Singh, G.K. Power system harmonics research: A survey. Eur. Trans. Electr. Power 2009, 19, 151–172. [Google Scholar] [CrossRef]
  28. Liang, X.; Andalib-Bin-Karim, C. Harmonics and Mitigation Techniques Through Advanced Control in Grid-Connected Renewable Energy Sources: A Review. IEEE Trans. Ind. Appl. 2018, 54, 3100–3111. [Google Scholar] [CrossRef]
  29. Berahm, K.; Daneshfar, F.; Salehi, E.S.; Li, Y.; Xu, Y. Autoencoders and their applications in machine learning: A survey. Artif. Intell. Rev. 2024, 57, 28. [Google Scholar] [CrossRef]
  30. Zhu, X.; Yang, C.; Lin, T. Maximum Variance Regularization for Latent Variables Makes Autoencoder Become Better One-Class Classifier. In Proceedings of the China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; Available online: https://ieeexplore.ieee.org/document/10865629 (accessed on 3 January 2026).
  31. Yong, B.X.; Brintrup, A. Do Autoencoders Need a Bottleneck for Anomaly Detection? IEEE Access 2022, 10, 78455–78471. [Google Scholar] [CrossRef]
  32. Mercioni, M.A.; Holban, S. Developing Novel Activation Functions in Time Series Anomaly Detection with LSTM Autoencoder. In Proceedings of the 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 19–21 May 2021. [Google Scholar] [CrossRef]
  33. PyWavelets—Wavelet Transforms in Python. Available online: https://pywavelets.readthedocs.io/en/latest/ (accessed on 15 December 2025).
  34. Mylonas, E.; Tzanis, N.; Birbas, M.; Birbas, A. An Automatic Design Framework for Real-Time Power System Simulators Supporting Smart Grid Applications. Electronics 2020, 9, 299. [Google Scholar] [CrossRef]
  35. Stavropoulos, S.; Tzanis, N.; Mylonas, E.; Birbas, M.; Birbas, A.; Papalexopoulos, A. FPGA-enabled Real-Time Power Grid Simulation Using Grid Partitioning. In Proceedings of the 12th Mediterranean Conference on Power Generation, Transmission, Distribution and Energy Conversion (MEDPOWER 2020), Paphos, Cyprus, 9–12 November 2020. [Google Scholar] [CrossRef]
  36. Tzanis, N.; Proiskos, G.; Birbas, M.; Birbas, A. FPGA-Assisted Distribution Grid Simulator. In Proceedings of the 14th International Symposium, ARC 2018, Santorini, Greece, 2–4 May 2018. [Google Scholar] [CrossRef]
  37. ZCU104 Board User Guide (UG1267). Available online: https://docs.amd.com/v/u/en-US/ug1267-zcu104-eval-bd (accessed on 14 December 2025).
  38. MicroBlaze V Processor Embedded Design User Guide (UG1711). Available online: https://docs.amd.com/r/en-US/ug1711-microblaze-v-embedded-design/Introduction (accessed on 14 December 2025).
  39. Wind Turbine Fleet Dataset for Anomaly Detection. Available online: https://aws-ml-blog.s3.amazonaws.com/artifacts/monitor-manage-anomaly-detection-model-wind-turbine-fleet-sagemaker-neo/dataset_wind_turbine.csv.gz (accessed on 14 December 2025).
  40. INA219 Datasheet. Available online: https://www.ti.com/lit/ds/symlink/ina219.pdf?ts=1765634341896 (accessed on 14 December 2025).
  41. Circuit Python Driver for INA219 Current Sensor. Available online: https://github.com/adafruit/Adafruit_CircuitPython_INA219.git (accessed on 14 December 2025).
  42. INA226 Datasheet. Available online: https://www.ti.com/product/INA226 (accessed on 14 December 2025).
Figure 1. Hardware-aware AI acceleration pipeline.
Figure 1. Hardware-aware AI acceleration pipeline.
Electronics 15 00414 g001
Figure 2. Visualization of the wind turbine anomaly detection concept.
Figure 3. Autoencoder architecture.
Figure 4. Threshold trigger module operation.
Figure 5. Experimentation testbed.
Figure 6. Power measurement topology: (a) RPi (baseline). (b) ARM and FINN. (c) RISC-V and CGRA4ML.
Figure 7. (a) RPi power measurements in the idle and under-load states. (b) ZCU104 power measurements in the idle and under-load states.
Figure 8. (a) Example of voltage transients and alarm generation. (b) Zoomed-in view, in which the different alarm latencies are clearly depicted.
Table 1. Metrics across tested platforms.

| Metric (Unit) | RPi (ONNX Runtime) | ZCU104 (FINN) | ZCU104 (CGRA4ML) |
|---|---|---|---|
| Overall Mean Power per Inference (mW) | 371.3 | 399.3 | 461.9 |
| MAD of Mean Power per Inference (mW) | 93.1 | 24.1 | 20.9 |
| STDev (mW) | 133.5 | 35.0 | 25.6 |
| Mean Inference Duration (ms) | 259.3 | 27.5 | 6.89 |
| Mean Energy per Inference (mJ) | 96.3 | 11.0 | 3.2 |
| Performance (inf/s) | 3 | 36 | 145 |
| Efficiency (Performance/W) | 8.079 | 90.156 | 313.941 |
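The derived rows of Table 1 follow directly from the measured mean power and inference duration (energy = power × time, throughput = 1/time, efficiency = throughput/power). A minimal sketch that reproduces them, with the measured values transcribed from the table (small differences versus the printed figures are rounding only):

```python
# Reproduce the derived metrics of Table 1 from measured power and duration.
platforms = {
    "RPi (ONNX Runtime)": {"power_mW": 371.3, "duration_ms": 259.3},
    "ZCU104 (FINN)":      {"power_mW": 399.3, "duration_ms": 27.5},
    "ZCU104 (CGRA4ML)":   {"power_mW": 461.9, "duration_ms": 6.89},
}

for name, m in platforms.items():
    energy_mJ = m["power_mW"] * m["duration_ms"] / 1000.0   # E = P * t
    perf_inf_s = 1000.0 / m["duration_ms"]                  # inferences per second
    efficiency = perf_inf_s / (m["power_mW"] / 1000.0)      # inf/s per watt
    print(f"{name}: {energy_mJ:.1f} mJ/inf, {perf_inf_s:.1f} inf/s, "
          f"{efficiency:.1f} inf/s/W")
```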
Table 2. Algorithm accuracy across tested platforms.

| Metric (Unit) | RPi (ONNX Runtime) | ZCU104 (FINN) | ZCU104 (CGRA4ML) |
|---|---|---|---|
| Accuracy (MSE) | 0.0015 | 0.0026 | 0.0026 |
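Table 2 reports reconstruction accuracy as the mean squared error between the autoencoder input and its reconstruction. A minimal sketch of how such an MSE anomaly score and threshold check might look (the sample values and threshold here are illustrative only, not the paper's data):

```python
import numpy as np

def reconstruction_mse(x, x_hat):
    """Mean squared reconstruction error used as the anomaly score."""
    return float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))

# Illustrative values: a window is flagged anomalous when its
# reconstruction error exceeds a calibrated threshold.
threshold = 0.0026
x     = [0.10, 0.20, 0.30, 0.40]
x_hat = [0.11, 0.19, 0.33, 0.38]

score = reconstruction_mse(x, x_hat)
is_anomaly = score > threshold
print(score, is_anomaly)
```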
Table 3. Resource usage of FINN hardware design.

| Design Attribute | Utilization * | Available | Utilization % |
|---|---|---|---|
| LUT | 22,228 | 230,400 | 9.65 |
| LUTRAM | 9195 | 101,760 | 9.04 |
| FF | 21,088 | 460,800 | 4.58 |
| BRAM | 107 | 312 | 34.13 |
| URAM | 3 | 96 | 3.13 |
| DSP | 21 | 1728 | 1.22 |

* Design operates at a 100 MHz clock frequency.
Table 4. Resource usage of CGRA hardware design.

| Design Attribute | Utilization * | Available | Utilization % |
|---|---|---|---|
| LUT | 57,600 | 230,400 | 25 |
| LUTRAM | 6106 | 101,760 | 6 |
| FF | 92,160 | 460,800 | 20 |
| BRAM | 47 | 312 | 15.06 |
| URAM | 1 | 96 | 1.04 |
| DSP | 18 | 1728 | 1.04 |

* Design operates at a 100 MHz clock frequency.
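The utilization percentages in Tables 3 and 4 are simply used/available resources on the ZCU104's device. A quick cross-check of a few rows, with the counts transcribed from the tables:

```python
# Cross-check utilization percentages from Tables 3 and 4
# (utilization % = used / available * 100).
rows = [
    ("FINN LUT",     22_228, 230_400),
    ("FINN DSP",         21,   1_728),
    ("CGRA4ML LUT",  57_600, 230_400),
    ("CGRA4ML BRAM",     47,     312),
]

for name, used, avail in rows:
    print(f"{name}: {100.0 * used / avail:.2f}%")
```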
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mylonas, E.; Filippou, C.; Kontraros, S.; Birbas, M.; Birbas, A. A Unified FPGA/CGRA Acceleration Pipeline for Time-Critical Edge AI: Case Study on Autoencoder-Based Anomaly Detection in Smart Grids. Electronics 2026, 15, 414. https://doi.org/10.3390/electronics15020414
