The Potential of SoC FPAAs for Emerging Ultra-Low-Power Machine Learning

Abstract: Large-scale field-programmable analog arrays (FPAA) have the potential to handle machine inference and learning applications with significantly lower energy requirements, potentially alleviating the high cost of these processes today, even in cloud-based systems. FPAA devices enable embedded machine learning, one form of physical mixed-signal computing, bringing machine learning and inference to low-power embedded platforms, particularly edge platforms. This discussion reviews the current capabilities of large-scale FPAAs and considers the future potential of these SoC FPAA devices, including the questions that must be answered to enable ubiquitous use of FPAA devices, similar to FPGA devices. Today's FPAA devices include integrated analog and digital fabric, as well as specialized processors and infrastructure, becoming a platform for mixed-signal development and analog-enabled computing. We show that next-generation FPAAs can handle the load of 10,000–10,000,000,000 PMAC required for present and future large fielded applications, at energy levels orders of magnitude lower than those expected with current technology, motivating the development of these new generations of FPAA devices.


Motivating Ultra-Low-Power Embedded Machine Learning
Large-scale field-programmable analog arrays (FPAA) show significant potential for mixed-signal computing [1], including embedded machine learning [2][3][4], that is, machine learning and inference on low-power embedded platforms. Although cloud-centric machine learning will continue to be used, the energy constraints at the edge device require significantly more processing to be computed locally, rather than paying the high energy costs of transmitting data off the device. The energy required for cloud computations is not negligible: the costs of producing cloud-based machine learning techniques are becoming a significant fraction of the USA's energy budget (e.g., [5]).
Physical computing techniques enable local, edge-embedded computation through significantly improved energy efficiency. Mead originally proposed that analog computation would have a 1000× computational energy efficiency advantage over digital approaches [6], a factor experimentally demonstrated in custom Si in 2004 [7] and repeatedly demonstrated ever since, including multiple FPAA demonstrations (e.g., [1,8]). As a physical, neuromorphically engineered Si cortex could be possible in less than 100 W [9], physical computing devices can eventually go beyond current cloud-based machine learning approaches [5].
FPAA devices enable embedded machine inference and learning [2][3][4], which includes neural networks (NN) and neurally inspired datapaths, using analog elements and signals integrated with potential logic and mixed-signal-enabled routing (Figure 1). Analog implementations of NN have roots in the earliest demonstrations (e.g., [10,11]). This integration enables full end-to-end, sensor-to-decision computation that includes machine learning. Current SoC FPAA devices include integrated analog and digital fabric as well as specialized processors and infrastructure, becoming a platform for mixed-signal development as well as analog-enabled computing. The fundamental question is the following: What is the potential of future FPAA devices in machine learning applications? Future FPAA devices seem to offer the promise of ubiquitous reconfigurable devices for ultra-low-power machine learning that can directly address the enormous energy requirements of current fielded machine learning applications, as well as enable what is typically considered cloud-based machine learning in edge devices.
SoC FPAAs provide an analog-enabled mixed-signal computing platform that supports wide user development of the emerging analog computing techniques (e.g., [12]). Many of these techniques were first shown in FPAA devices, or in custom ICs that were on the development path into FPAA devices [1]. Large-scale field-programmable analog arrays were explicitly defined (e.g., [13]) as reconfigurable, mixed-signal devices to be used for computation, rather than as the glue-logic devices (e.g., digital CPLDs) typical of early analog reconfigurable approaches (e.g., [14]), while staying consistent with development history.
End-to-end computation requires analog input and computation for physical implementations, and communication and data conversion are part of the system cost (Figure 2). An FPGA can handle analog signal inputs and outputs with the addition of sufficient data converters (Figure 2). FPGA devices are constructed to enable a wide range of digital applications, and have a number of specialized blocks to optimize particular applications [15][16][17][18][19][20][21][22][23]. Xilinx's Zynq FPGA [24], a recent FPGA, includes high-speed (up to 10 GSPS) 14-bit DACs and (up to 4 GSPS) 14-bit ADCs on-chip with six ARM µP, as well as digital fabric and interfaces. These ICs show a need for integrated analog components in configurable structures in reducing the overall system complexity and system throughput. The Zynq FPGA family, fabricated in 16-10 nm CMOS nodes, can support on-chip RF capability. The overall high energy, area, and complexity costs for the digital computation (e.g., [8] vs. [25]), static FPGA power, and data converters will overwhelm many applications.
FPGAs are not low power when looking at solutions requiring 10-100 mW of power. Most FPGAs require hundreds of mW simply to power up the SRAM elements holding the programming variables. Commercial flash-based FPGAs significantly decrease the starting power requirements (e.g., 7-10 mW standby power [26,27]) while enabling 350-500 MHz signals and 70 mW 5G SERDES, although these power requirements are still too high for lower-power systems of 10-20 mW and below. Analog computation (FPAAs) addresses these issues through energy- and area-efficient analog computation, as well as by requiring far fewer data converters.
Analog co-processors for a digital computation with a bank of data converters create a huge infrastructure that nearly eliminates the benefits of analog processing. Current FPGA-based solutions (Figure 2a) are used because of the lack of commercially available FPAAs, as well as the lack of available FPAA engineering experience. In addition to the high energy and complexity costs of these data converters, the FPGA has some latency for its digital computation and is constrained by its static and dynamic power requirements. An FPAA device (Figure 2b) can typically interface directly to the incoming analog signals. Where necessary (e.g., RF), input signals can be adjusted to the IC's power and supply voltage levels. The FPAA device uses analog techniques where possible in a near-zero-latency path utilizing digital control.
Programmable and configurable FPAA devices enable end-to-end embedded machine learning applications (Figures 1 and 3). A machine learning algorithm that ignores other parts of the application, such as translating sensor data into an input for the network, often shows incremental or negligible system improvements. End-to-end embedded machine learning eliminates the need for a bank of FPGAs and interface chips in front of a low-power classifier chip, so that the wins from the physical system still provide advantages for the overall system concept. Configurability is essential so that a limited number of commercially relevant ICs can handle a wide application space, particularly the wide range of front-end computations for the end-to-end computation, as well as the efficient implementation of local and sparse weight matrices. Building end-to-end architectures typically reduces the amount of raw computation required, further empowering these physical computing approaches.
Current SoC FPAAs have already demonstrated end-to-end acoustic applications, from microphones to classified result, even without the FPAA designed for machine learning. The question becomes the following: What are the additional capabilities of scaled FPAA devices for end-to-end classifiers for acoustic applications, as well as for newly enabled applications that utilize vision and RF sensors?
The questions are whether FPAA approaches can handle the load of 10,000 to 10,000,000,000 PMAC required for future fielded applications (e.g., [5]), and whether these FPAA computations perform at far lower energy levels. A positive response motivates the engineering efforts to develop these new generations of FPAA devices. The projected scaling of FPAA devices enables us to directly address these questions, and that, in turn, requires further analysis of the capabilities and flexibility of FPAA architectures.
This discussion works through the opportunities and questions towards building a ubiquitous supply of large-scale field-programmable analog arrays (FPAA), particularly SoC FPAA-type devices [28], for machine learning applications with ultra-low energy requirements. The discussion starts by providing an overview of the FPAA capabilities that enable the wider application to end-to-end machine learning (Section 2). Understanding the architectural and granularity tradeoffs in FPAA architectures (Section 3) enables predicting the future capabilities of these devices (Section 4).

Configurable Technology, Architecture, and Capabilities
SoC FPAAs enable a mixed-signal end-to-end embedded machine learning platform, as they include analog elements and signals integrated with potential logic and mixed-signal-enabled routing (Figure 4). End-to-end acoustic embedded learning and classification have been demonstrated at 20-30 µW on command-word recognition, as well as on the full Nzero database [2][3][4]. FPAA devices have experimentally demonstrated a wide range of computations (Figure 4) that include the computations that set up machine learning, as well as the machine learning inference and learning itself, sometimes at small scale given the component constraints of a 350 nm CMOS SoC FPAA [28]. Floating-gate (FG) techniques enable programmability, with precise parameters (e.g., 14-bit accuracy [29]) in standard CMOS, as well as configurability, through long-term retention of FG charge (0-100 µV over 10 years [30]). SoC FPAAs enable wide user development of the emerging analog computing techniques (e.g., [12]).
FPAA tools empower a wide application ecosystem through systematic analog design [1,31,32] to enable an engineering team to rapidly develop new embedded machine learning components. These tools give the user the ability to create, model, and simulate analog and digital designs. This early tool enables the development of an FPAA toolset that can start with high-level definitions and automatically generate targeted hardware where the user has the ability to optimize the process at each level. These analog and mixed-signal tools are expanding to analog synthesis using standard cells for custom ICs (e.g., [33]).
FPAA devices can be the solution for analog and mixed-signal security and component obsolescence [34], just as FPGAs solve security and obsolescence issues. Multiple digital techniques can verify an FPGA, allowing for secure and confident FPGA programming for a particular application. An FPAA device can be a completely generic and known device that can be completely verified in a safe location [34], where the secret sauce for the technology can also be programmed on the device in a safe location (Figure 5). The resulting FPAA device layout says nearly nothing about the programmed function, similar to FPGA devices. FPAA devices can directly and discretely map secure functions, such as unique functions and physically unclonable functions (PUF), into the FPAA fabric [34]. In general, the more specific the resulting solution, and the fewer the levels of software stacks, the fewer the security holes; the nonvolatile programming of an FPAA device minimizes the number of mixed-signal security concerns. The use of nonvolatile memory (e.g., FG) eliminates SRAM loading vulnerabilities, analog values are difficult to measure without significantly distorting the measurements, digital computing can be analog encoded, and low-power circuits provide unique challenges for external measurements [34]. FPAAs can replicate similar analog circuit elements, or a combination of analog circuit elements, to achieve linear and nonlinear dynamics similar to those seen in older custom or configurable devices, including some of the unintended dynamics, eliminating the obsolescence issues. Mapping discrete-component (e.g., BJT vs. FET) analog music synthesis to FPAA devices demonstrates translating intended and unintended dynamics [35].
Figure 5. [...] and have the technology secret sauce be programmed in a safe location. The output product is a custom chip due to the nonvolatile device programming.

Granularity for End-To-End Machine Learning: Flexibility vs. Switch Cost
To predict how FPAA devices scale to advanced process nodes, one needs to understand how the configurable fabric might scale at those technology nodes. An effective configurable fabric for embedded machine learning applications empowers the user's creativity through flexible opportunities while minimizing the added cost of that flexibility, particularly with increasing application size. Flexibility (φ) enables more computations in a single architecture, where Φ quantifies the possible combinations available. Flexibility affects the types, sparsity, and energy efficiency of machine learning algorithms, as well as the processing circuitry before and after the machine learning operations. Flexibility requires switches, and more switches result in higher area, circuit, and interconnect cost (Figure 6).
A configurable architecture will always have a higher cost in area, by a factor K, however small, compared with the area of a fully custom architecture ($A_1$). The custom block area ($A_1$) is fully custom, with any configurability or parameters explicitly included in K. Each switch linearly increases K by a factor a, the ratio of the size of a switch to the size of an individual selection block. For n switches configuring a custom block area, the resulting area (A) is
$A = K A_1 = (1 + a n) A_1.$
Typical values of a for small to moderate cells connected to this switch would be between 0.01 and 0.1; switches selecting a single transistor element would have a closer to 1. Switches connecting individual transistors offer significant opportunities [36,37], with significant additional cost (a > 1). Switch implementation in a particular technology (Figure 6b) directly affects a.
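The area relation can be sketched numerically. The snippet below is a minimal illustration that assumes the linear model K = 1 + a·n implied by the statement that each switch linearly increases K by a factor a; the function names and the choice of n = 20 switches are hypothetical and are not taken from the original work.

```python
def area_cost_factor(n_switches: int, a: float) -> float:
    """Return K = 1 + a * n (assumed linear switch-overhead model).

    a is the ratio of switch size to the size of one selection block.
    """
    if n_switches < 0 or a < 0:
        raise ValueError("n_switches and a must be non-negative")
    return 1.0 + a * n_switches


def configurable_area(custom_area: float, n_switches: int, a: float) -> float:
    """Return A = K * A_1 for a custom block of area A_1."""
    return area_cost_factor(n_switches, a) * custom_area


if __name__ == "__main__":
    # Illustrative values only: the text quotes a between 0.01 and 0.1 for
    # small-to-moderate cells, and a near 1 for single-transistor selection.
    for a in (0.01, 0.1, 1.0):
        k = area_cost_factor(n_switches=20, a=a)
        print(f"a = {a:<4}  n = 20  K = {k:.2f}  area efficiency 1/K = {1 / k:.2f}")
```

Under this assumed model, the same number of switches costs far more when each switch selects a single transistor (a near 1) than when it selects a moderate-size cell.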
Switches add circuit costs to the flexible fabric. The custom computation has a total load capacitance ($C_L$), and with the increase in area by K, the configurable computation has a capacitance roughly scaled by the same cost factor (K). Area efficiency due to configurability is the inverse of cost (1/K). The custom computation power-delay product ($E_1$) is proportional to $C_L$, which is proportional to $A_1$, so the configurable power-delay product (E) scales by the same factor:
$E_1 \propto C_L \propto A_1, \qquad E = K E_1.$
For subthreshold operation, near-threshold operation, and some other situations, $E_1$ is constant with frequency. The size, weight, and power (SWaP) metric, the product of the area and the power-delay product, for a custom and a configurable system is
$\mathrm{SWaP}_{\mathrm{custom}} = A_1 E_1, \qquad \mathrm{SWaP}_{\mathrm{configurable}} = A E = K^2 A_1 E_1.$
Furthermore, CMOS switches have a resistive loss (Figure 6b), although other technologies (e.g., III-V transistors and chalcogenides) can potentially reduce the signal loss for an on switch. Switch approaches include SRAM-driven transmission gates [38][39][40][41][42][43][44][45][46][47], memristor elements [48][49][50][51], and phase-change memories [52][53][54][55][56][57][58][59][60][61][62], as well as FG devices.
Switch granularity is typically pictured as a continuum between coarse-grain granularity, which has a minimum of switches between a menu of items, and fine-grain granularity, which has switches between the lowest level of components (Figure 6a). Different architectures create a different K in their attempt to achieve their desired flexibility. Over the following subsections, we develop the cost of configurability (K), which trades off against the resulting increase in flexibility (φ), described as increased functionality when connecting n blocks together, for coarse-grain architectures (Section 3.1), Manhattan architectures (Section 3.2), and fine-grain architectures (Section 3.3).
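A short numerical sketch of this scaling argument follows. It assumes, as the paragraph describes, that both the area and the load capacitance (and therefore the power-delay product) scale by the same factor K, so the SWaP product scales as K squared. The function names and the sample values are hypothetical, chosen only for illustration.

```python
def configurable_energy(e_custom: float, k: float) -> float:
    """Power-delay product of the configurable design, E = K * E_1.

    Assumes load capacitance scales with area, i.e., by the same factor K.
    """
    if k < 1.0:
        raise ValueError("K is a cost factor and cannot be below 1")
    return k * e_custom


def swap_metric(area: float, energy: float) -> float:
    """SWaP metric taken here as area times power-delay product."""
    return area * energy


def swap_penalty(k: float) -> float:
    """Ratio of configurable to custom SWaP under the assumptions above."""
    if k < 1.0:
        raise ValueError("K is a cost factor and cannot be below 1")
    return k * k


if __name__ == "__main__":
    a1, e1 = 1.0, 1.0  # normalized custom area and power-delay product
    for k in (1.25, 2.0, 10.0):
        custom = swap_metric(a1, e1)
        configurable = swap_metric(k * a1, configurable_energy(e1, k))
        print(f"K = {k:>5}: area efficiency = {1 / k:.2f}, "
              f"SWaP penalty = {configurable / custom:.2f}")
```

The quadratic penalty is why keeping K close to 1 matters more for the combined metric than for area alone.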

Coarse-Grain Architectures
Given the concern about switches, many FPAAs (e.g., [14,38–47,63]) utilize coarse-grain architectures (Figure 6a), minimizing the number of switches (Figure 6b) and the associated parasitics required for any particular computation. Coarse-grain architectures attempt to minimize the effect of additional switches by only switching between large fixed components, and the opportunity lost through this strategy is incorporated into the flexibility metric (φ). In a simple crossbar network (Figure 7a), Φ follows from the connections available to each block: each block could have a selection connection to the n − 1 other blocks, or could have parallel connections to each of the n − 1 blocks (Figure 7c).

Manhattan Architectures Improve Flexibility
Manhattan architectures utilize a multilevel routing scheme to reduce the scaling of K with the number of elements while still achieving significant flexibility. Manhattan architectures enable reconfigurable and efficient routing of local and sparse interconnections, a critical issue for many neural network algorithms. FPGAs significantly improve their granularity through Manhattan architectures [64]. The evolution of configurable digital devices from fully connected structures to Manhattan-type approaches enabled the production of FPGAs with a routing structure that provides a level of granularity beyond typical LUTs [64]. More flexible, efficient, and fine-grained granularity enables creativity by the designer, although these approaches require placement and routing design tools [64], that is, optimization algorithms that place and route an application into the architecture. The wide range of configurations, as well as the portability of high-level code, enabled FPGAs to solve the digital design legacy problem as well as enabling secure FPGAs.
The Manhattan routing approach improves the effective granularity by assuming that more connections are effectively local, which is typical of digital and analog designs. Manhattan routing structures (Figure 7b) assume a starting element size significantly larger than the crossbar, determined by the local switch matrix parameters b (CAB/CLB lines), d (lines into CAB/CLB), and f (lines in the connection block), resulting in an improved scaling metric. The values of b, d, and f would increase weakly with an increasing total number of nodes. K scales with the local routing (b, d, f) within each module (e.g., CLB or CAB) instead of with the entire array (Figure 7b). Other multilevel routing schemes have similar scaling properties. Manhattan architectures utilize these crossbar arrays in each of their local regions (CLB/CAB), typically having n = 8 to 64 block elements, where one wants to maximize Φ in each local region. A local region is defined as a large block where the routing architecture focuses on local computation, enabling these bus connections to grow only weakly with an increased number of local regions and number of components.

Fine-Grain Architectures
CMOS devices using FG elements allow for non-volatile switches, potentially enabling analog granularity. Analog parameters improve the resulting density and the resulting system flexibility (Figure 7c). For analog m-bit switch elements, the parallel flexibility increases by a factor of $2^m$. Having analog parameters with parallel connections enables using the routing fabric as computing fabric [65]. In this computing-in-memory approach [1,66], the number of additional switches, and therefore K, for a particular computation decreases significantly.
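As a small worked example of the $2^m$ factor, the sketch below counts the distinguishable settings of one m-bit analog switch relative to a binary on/off interpretation of flexibility. The function name is hypothetical, and using m = 14 simply reuses the 14-bit FG programming accuracy quoted earlier as an illustrative value, not as a claim about achievable switch resolution in a routing fabric.

```python
def parallel_flexibility_gain(m_bits: int) -> int:
    """Return the 2**m factor by which an m-bit analog switch
    increases parallel flexibility, per the statement in the text."""
    if m_bits < 0:
        raise ValueError("m_bits must be non-negative")
    return 2 ** m_bits


if __name__ == "__main__":
    for m in (1, 4, 8, 14):
        print(f"m = {m:>2} bits -> flexibility factor {parallel_flexibility_gain(m)}")
```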
Manhattan architectures with fine-grain analog storage provide an energy- and area-efficient implementation of reconfigurable routing of local and sparse interconnections, where the neural network weight computation occurs directly through the weight fabric routing. In these cases, the network complexity scales with the number of neurons, and depends only weakly on the number of synapses within reasonable neuron sparsity.
Fine-grain granularity, particularly analog programmable granularity, greatly improves the tradeoff between configurable architecture efficiency (1/K) and flexibility (Φ), requiring fewer nodes for similar flexibility as well as having a lower cost (K) for that flexibility (Figure 8). The higher granularity from parallel analog connections significantly decreases the number of components (n >> 1000 to n = 15), and the resulting SWaP efficiency (1/K → 0.1% to 80%) illustrates the potential advantage of using switch elements as programmable transistors. Decreasing the required number of blocks for the same Φ illustrates the potential system-level reduction to achieve a range of potential applications.
Fine-grain, analog switch architectures provide a favorable tradeoff between configurable architecture efficiency (1/K) and flexibility (Φ), and further research to enable these techniques should yield many significant opportunities. These advantages are consistent with the demonstrated orders-of-magnitude advantage of FPAA devices using analog switching matrices (e.g., SoC FPAA [1,28]). As the high Φ makes the computational routing fabric more important, additional fabric infrastructure, such as partial high-speed in-circuit reconfigurability [28], further empowers the range of potential applications.
The difference in Φ between digital connection switches and parallel analog switches, which enable computing through switches structured in a memory configuration, can be seen in the capabilities of a typical CLB and CAB (Figure 9). Where a typical CLB can impressively enable a state machine per CLB, a single CAB could potentially implement a small acoustic classifier stage. These differences in capabilities are almost entirely due to fine-grain routing vs. an efficient traditional routing approach; coarse-grain routing techniques leave even more unused Φ and capability.

Figure 9. [...] (CLB) or a computational analog block (CAB). A CLB typically uses binary connection switches to form multiple (e.g., 8) lookup tables and some selection RAM, enabling small state machines in a single CLB. A CAB typically uses analog parallel switches for its computation, which includes FG routing that can be used for programmable and configurable computation, as well as programmable FG-based circuits. Within such a structure, a small auditory classifier could be compiled in a single CAB.

Scaled FPAA Devices Opportunities towards Low-Energy Machine Learning
We want to understand the scaling opportunities for new machine-learning-capable FPAA devices, given the current FPAA capabilities. Configurable mixed-signal devices, having the opportunity of flexibility in a reasonably granular solution, can justify the IC design cost for new embedded devices (Figure 10). As mask costs increase exponentially with decreasing process node, the resulting design costs to obtain value from these investments increase exponentially, requiring a significantly higher expected market return from the effort (Figure 10). Only a few applications (e.g., cell phone processors) can have the market impact necessary to justify the cost of advanced IC nodes (e.g., 10 nm, 14 nm), whereas configurable solutions can utilize a single IC design across a number of applications to justify the investment cost.
Figure 10. Configurable devices such as FPGA and FPAA devices provide a cost-effective end-to-end machine learning solution because of the ever-increasing cost of IC design for scaled-down processes. The cost of making a set of IC masks scales inversely as a power law of the CMOS minimum channel length, and the design cost for a new design is typically at least 10× the mask cost, typically requiring 10× the expected financial return to even attempt such a venture. The resulting cost of designing an IC is often far too high for most engineering applications to hope to reach these financial returns. A configurable device can spread this engineering cost over a wide number of designs.

Machine Learning Computation Opportunities from Scaled CMOS FPAAs
Given the demonstration of high flexibility of fine-grain capabilities compared with architectural cost, as well as demonstration of SoC FPAAs for embedded machine inference and learning [2,28], we ask the following question: What is the potential of these FPAA devices given the existing understanding of these techniques? Scaling allows for a higher signal bandwidth in the FPAA fabric architectures (Figure 11a), roughly with an inverse quadratic scaling on the minimum channel length, enabling some RF bandwidths (e.g., 4 GHz) at 40-45 nm CMOS [67,68].
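The inverse quadratic scaling can be illustrated with a short calculation. This is a rough sketch that assumes the fabric bandwidth scales exactly as the inverse square of the minimum channel length, which the text states only as an approximate trend; the function name is hypothetical, and the 350 nm reference is the existing SoC FPAA node mentioned earlier. Only the relative factor is computed, since no baseline bandwidth is given here.

```python
def bandwidth_scale_factor(l_ref_nm: float, l_new_nm: float) -> float:
    """Relative bandwidth gain when moving from channel length l_ref_nm
    to l_new_nm, assuming bandwidth ~ 1 / L**2 (idealized trend)."""
    if l_ref_nm <= 0 or l_new_nm <= 0:
        raise ValueError("channel lengths must be positive")
    return (l_ref_nm / l_new_nm) ** 2


if __name__ == "__main__":
    for node in (130, 40, 14):
        gain = bandwidth_scale_factor(350, node)
        print(f"350 nm -> {node} nm: ~{gain:.1f}x fabric bandwidth (idealized)")
```

Under this idealization, moving from 350 nm to 40 nm gives a gain of roughly 77×, which is the kind of increase needed to reach the RF bandwidths mentioned above.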

FG devices scale to a number of CMOS processes (e.g., 40 nm, 14 nm) [67] and do not limit any expected scaling opportunities. FG devices have been demonstrated across a number of IC technologies from 2.0 µm to 40 nm CMOS [9,67,68,69], with designs awaiting measurement in 14 nm CMOS. Issues around different insulators [67,68], temperature effects on circuit operation [70,71], handling of on-chip voltage levels for programming [1,72,73], and long-term reliability and multiple writing [30] have all been carefully studied, and FG capabilities across standard CMOS processes have been established. IC processes at 14 nm and below show no significant constraint on further scaling of these devices as CMOS continues to scale to smaller channel lengths.
The impact of scaled-down FPAA devices for end-to-end machine learning (e.g., neural network) applications builds on a number of experimental results from FPAA devices at the 350 nm CMOS node. Early experimental measurements in smaller IC processes (e.g., [67,68]) and efforts in analog system automation derived from FPAA tools (e.g., [1,31,32]) give significant grounding for these next-generation FPAAs. The FPAA CAB/CLB density should be similar to existing FPGA densities. A CLB in the 350 nm SoC FPAA [1] has the same size and complexity (8 LUTs + registers) as the CLBs used by Xilinx Zynq RF-enabled devices. The 200 CABs + CLBs in existing 350 nm FPAAs could be optimized to improve the density by at least a factor of 2. Scaled sizes show a similar number of CLBs in a similar area: 20-40 k (40 nm CMOS) vs. 50 k in the Zynq RF (12-16 nm CMOS) [24]. Many different features might be possible in each case, and yet these approaches are of a similar order of magnitude.
Scaling creates smaller switches and processing elements, resulting in higher density and lower energy consumption (Figure 11b,c). One expects a significantly increased number of FG devices with CMOS scaling, depending on process capabilities (Figure 11b). The number of vector-matrix multiplications (VMM) on a 5 mm × 5 mm die grows rapidly with decreasing process node (Figure 11b); these results assume that 1/8 of the total routing fabric is used for VMM computations. The 45 nm and 14 nm devices are capable of PMAC(/s)-level computation on a single die, a computation level that typically requires a large supercomputer (Figure 11c). A neural network or similar machine learning problem could utilize PMAC(/s) computations for inference and learning. As one expects 10,000-10,000,000,000 PMAC for current and future large fielded applications (e.g., [5]), a single 10 W, 10 PMAC(/s) device at 40 nm and a single 25 W, 250 PMAC(/s) device at 14 nm dramatically decrease the computing time and power requirements, and significantly decrease the overall energy requirements (Table 1). Another important aspect of these devices is the significantly lower power required for these operations, particularly the possible computation at 1 µW and 1 mW levels (Figure 11c). A range of machine learning (inference and learning) applications has been demonstrated with the 350 nm CMOS FPAA device [2][3][4]. In particular, a range of acoustic microphone-to-classification machine learning (inference and training) techniques has been demonstrated, with inference in 350 nm CMOS at 20-30 µW levels. Improved circuit design in 350 nm CMOS would already move these devices to 1 µW levels [2]; therefore, scaled devices would certainly implement machine learning for similar applications at 1 µW levels.
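The time and energy implied by these device figures follow from simple arithmetic. The sketch below is illustrative only: it uses the two device operating points and the workload range quoted above, and the `Device` class and `run_cost` helper are hypothetical names introduced for this example, not part of any FPAA toolset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    power_w: float          # average power quoted in the text
    rate_pmac_per_s: float  # throughput quoted in the text


def run_cost(device: Device, workload_pmac: float) -> tuple[float, float]:
    """Return (seconds, joules) for one device to finish a workload in PMAC."""
    seconds = workload_pmac / device.rate_pmac_per_s
    return seconds, seconds * device.power_w


DEVICES = (
    Device("40 nm FPAA", power_w=10.0, rate_pmac_per_s=10.0),
    Device("14 nm FPAA", power_w=25.0, rate_pmac_per_s=250.0),
)

for dev in DEVICES:
    for workload in (1e4, 1e10):  # smallest and largest workloads in the text
        seconds, joules = run_cost(dev, workload)
        print(f"{dev.name}: {workload:.0e} PMAC -> {seconds:.3g} s, {joules:.3g} J")
```

For the smallest workload (10,000 PMAC), a single 40 nm device would need about 1000 s and 10 kJ under these figures. The largest workload on a single device would take on the order of 10^9 s, which is one reason a multi-device system is attractive for training.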
Further improvements from utilizing more neuromorphic physical algorithms in FPAA devices, such as neurons, synapses, and dendrites (e.g., [9,74,75,76]), with their improved energy efficiency over the analog matrix-vector multiplication used in NNs (e.g., [9]), further illustrate the opportunities for energy-efficient neural network and machine learning computing at 1 µW power levels.
A device requiring 1 mW of average power could easily be supplied by a battery, enabling months of continuous fielded use, and a device requiring 1 µW of average power could easily be supplied by small (<1 cm²) energy-harvesting devices. Power levels of 1 mW can possibly be supplied by moderate-sized (e.g., 10 cm × 10 cm) energy-harvesting devices. At these power levels, a 40 nm CMOS structure enables computation from around 1 GMAC(/s), around the level of a fully capable laptop computer, to around 1 TMAC(/s), around the level of a small GPU or FPGA cluster. Low-power end-to-end embedded machine learning powered through energy harvesting eliminates the need for external energy sources, eliminating part of the machine learning energy crisis [5].
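The "months of continuous fielded use" claim at 1 mW can be checked with a short estimate. This sketch assumes a hypothetical 1000 mAh, 3 V cell (a value chosen for illustration, not taken from the source) and ignores self-discharge and converter losses.

```python
SECONDS_PER_DAY = 86_400.0


def battery_life_days(capacity_mah: float, voltage_v: float, load_w: float) -> float:
    """Ideal runtime in days for a constant load; ignores self-discharge."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    energy_j = capacity_mah / 1000.0 * voltage_v * 3600.0  # mAh -> joules
    return energy_j / load_w / SECONDS_PER_DAY


# Hypothetical cell: 1000 mAh at 3 V (an assumption for illustration).
days_at_1mw = battery_life_days(1000.0, 3.0, 1e-3)
print(f"1 mW load: about {days_at_1mw:.0f} days")
```

With these assumed values the ideal runtime is 125 days, consistent with months of use. At 1 µW the same calculation exceeds the shelf life of typical cells, which is why energy harvesting becomes the natural supply at that level.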

Algorithm Opportunities from Scaled CMOS FPAAs
The current [1] and future (Figure 12) FPAA directions towards machine learning show opportunities for embedded end-to-end system algorithms. Increased density, decreased energy consumption, and increased bandwidth at each CMOS node (Figure 12) directly impact the range of computations, although the same computations are possible at each CMOS node at a lower operating frequency and problem size.
Developing end-to-end machine learning systems requires building the computation that precedes the machine learning, as well as the computing for the machine learning operations (Figure 13). The optimal operating frequency matches the input data rate, eliminating the need for any internal storage or related infrastructure. For a given operating frequency, the objective is to minimize energy consumption as well as classification latency (which is small for analog computing). Analog numerical analysis [77], analog architecture theory [78], and real-valued computing theory [12] provide the framework for analog computations.
A single 40 nm FPAA device (Figure 12) could potentially be used for an acoustic NN classifier, an image NN classifier, and an RF NN classifier. These examples provide a good comparison with a single high-end FPGA IC (e.g., the RF Zynq [24], 12-16 nm CMOS) solving these applications (Figure 13). The 5 GHz FPAA bandwidth (dc to 5 GHz [67]) compares to the highest-end 10 GSPS DACs on the RF Zynq [24]. The FG elements that are not programmed are initialized to accumulation, and the negligible current (e.g., [67]) results in little static power required in a 40 nm node. An FPGA has significant initial static power (100's of mW to W) because of the on-chip SRAM storage. A 40 nm CMOS FPAA would have roughly 20 k CABs and CLBs, with roughly 50 M multiplication elements in the FPAA fabric that are likely to be typically accessible. The highest-end Zynq processor has roughly 50 k CLBs and 4 k DSP (with single multiply) units. This 40 nm FPAA device would be similar to the 350 nm SoC FPAA, likely with multiple embedded processors (MSP430 or RISC-V) in an open design architecture; the Zynq includes 6 ARM cores with a 1-2 GHz maximum clock rate. A 14 nm FPAA device should have improved metrics; a 40 nm FPAA device requires considerably lower building and design costs.
End-to-end solutions require different front-end solutions for different applications (e.g., acoustic, imaging, RF) to prepare the data before the network inference and training (Figure 13). For acoustic or speech processing and classification, scaling increases the problem size (command word to small vocabulary to speech classification), while potentially further improving the energy efficiency. Acoustic applications often require front-end filterbanks, delay lines, and other sub-banding processes before the machine learning computation, as well as asynchronous event processing to encode the machine learning process (Figure 13a). A 40 nm FPAA, similar to what is already done in the 350 nm SoC FPAA, could directly implement the front-end computation before the classifier, which would include bandpass filters (BPF) over acoustic frequencies, amplitude detection, and continuous-time delay approximations in each parallel channel. One expects a power cost of 1-20 µW for this entire front-end infrastructure (e.g., [3,28]), with an upper-end output frequency of 1 kHz for each output. A speech recognizer using multiple neural layers (phonemes, syllables, words) in realistic SNR (<10 dB) environments with a moderate number of weights (10 M) might use single-layer VMM+WTA blocks (e.g., [3]). As the NN computation requires 10 µW (10 GMAC(/s)), one easily expects this application to fit in a 50 µW budget, including additional potential overhead.
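The acoustic power budget can be checked with a few lines of arithmetic. The sketch below uses only figures quoted for the speech recognizer (10 M weights, 1 kHz output rate, 10 µW for the NN, 1-20 µW for the front end, 50 µW budget); it assumes every weight is used once per output sample, which is a simplification, and the constant and function names are hypothetical.

```python
NUM_WEIGHTS = 10e6                  # weights in the speech recognizer
OUTPUT_RATE_HZ = 1e3                # upper-end output frequency per channel
NN_POWER_W = 10e-6                  # quoted NN computation power
FRONT_END_POWER_W = (1e-6, 20e-6)   # quoted front-end range (low, high)
BUDGET_W = 50e-6                    # quoted application budget


def mac_rate(num_weights: float, update_rate_hz: float) -> float:
    """MAC/s, assuming each weight is used once per output sample."""
    return num_weights * update_rate_hz


rate_mac_s = mac_rate(NUM_WEIGHTS, OUTPUT_RATE_HZ)          # 1e10 = 10 GMAC/s
energy_per_mac_j = NN_POWER_W / rate_mac_s                   # about 1 fJ/MAC
totals_w = [fe + NN_POWER_W for fe in FRONT_END_POWER_W]     # front end + NN
headroom_w = BUDGET_W - max(totals_w)                        # margin for overhead

print(f"NN rate: {rate_mac_s / 1e9:.0f} GMAC/s")
print(f"Energy per MAC: {energy_per_mac_j * 1e15:.1f} fJ")
print(f"Total: {totals_w[0] * 1e6:.0f}-{totals_w[1] * 1e6:.0f} uW, "
      f"headroom {headroom_w * 1e6:.0f} uW")
```

Under these assumptions the 10 M weights at 1 kHz reproduce the quoted 10 GMAC(/s), and the front end plus NN total 11-30 µW, leaving at least 20 µW of the 50 µW budget for overhead.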
In other cases, the increased problem size opens new architectural solutions, such as in image classification, where more processing can occur on the incoming streamed image from a sensor (Figure 13b). Image classification on larger nodes might take a standard database with on-board compression (e.g., compressed DCT [79]), whereas a scaled-down system would compute and classify sub-images in parallel.
Typical image processing and NN classification architectures would build around an image IC that transmits an image one pixel value at a time (Figure 14a). Matching data and computing speed results in optimal computing efficiency and minimal overhead, minimizing the expensive requirement for buffering or caching data. An alternate path could enable a CMOS imager to have direct interconnections between a pixel and reconfigurable Si processing on another IC, which would enable significantly higher computing opportunities (Figure 14b). Unlike the trajectories for single-wafer imagers (e.g., [80,81]), through-hole wafer connections and die stacking (e.g., [82,83]) enable a wide range of opportunities for 3D die-stacked imagers [84][85][86][87][88][89]. Image processing would have CABs or groups of CABs for handling different symbols. Data moves throughout the IC and utilizes local computing in memory. A 10 M pixel imager requires an average of 10 TMAC(/s) at roughly 10-20 mW in the FPAA device, a similar cost to the CMOS imager and a similar cost to transmitting the data between the two chips on a PCB. If an imager has vertical connections from one wafer to another, the vertical connections would change the algorithm, and more parallelism would occur due to the parallel input (naturally from the source).
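The imager figures can be unpacked into per-pixel and per-operation terms. This sketch only divides the quantities quoted above (10 M pixels, 10 TMAC(/s), 10-20 mW); the frame rate and network structure behind the 10 TMAC(/s) figure are not stated in the text, so no assumption is made about them here, and the helper names are hypothetical.

```python
IMAGER_PIXELS = 10e6               # 10 M pixel imager
FABRIC_RATE_MAC_S = 10e12          # 10 TMAC/s average computation
FABRIC_POWER_W = (10e-3, 20e-3)    # quoted 10-20 mW range


def macs_per_pixel_per_second(rate_mac_s: float, pixels: float) -> float:
    """Average MAC operations available per pixel per second."""
    return rate_mac_s / pixels


def energy_per_mac_fj(power_w: float, rate_mac_s: float) -> float:
    """Energy per MAC in femtojoules at a given power and throughput."""
    return power_w / rate_mac_s * 1e15


per_pixel = macs_per_pixel_per_second(FABRIC_RATE_MAC_S, IMAGER_PIXELS)
energy_range_fj = [energy_per_mac_fj(p, FABRIC_RATE_MAC_S) for p in FABRIC_POWER_W]

print(f"{per_pixel:.0e} MAC/s per pixel")
print(f"{energy_range_fj[0]:.1f}-{energy_range_fj[1]:.1f} fJ per MAC")
```

The quoted numbers correspond to about one million MAC operations per pixel per second at roughly 1-2 fJ per MAC.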
An FPAA device could be a common module for RF-related machine learning (Figure 13c) at 45 nm and smaller CMOS nodes [68]. CMOS scaling enables these applications, such as beamforming and demodulation, which could operate at 40 MHz in 350 nm CMOS, improving to 400 MHz in 130 nm CMOS, 4 GHz in 40 nm CMOS, and higher for smaller CMOS processes [67]. The device, depending on process node, might include some specialized LNA at the input, configuration for initial signal processing (e.g., VMM for beamforming), as well as demodulation for classifying modulated signals. The 40 nm FPAA device can be directly built to enable 10-20 GHz signals with 4-5 GHz signal bandwidths (dc to 4-5 GHz) in the routing fabric [68], including FG tunable delay elements for spatiotemporal filtering through the routing fabric. These inputs could directly be used to classify spectrum dynamics, utilizing a large classifier network (again, 10 M NN), as well as utilizing those dynamics with minimal system latency [78]. A 10 M NN operating at 1 GHz bandwidth puts the computation in the 10 PMAC(/s) range, operating at 10 W of average power.
Using multiple FPAA devices could enable a platform for larger algorithms and an accelerator for training a network. Training current NN models requires multiple parallel networks training for different initial conditions and other parallel-type structures. One would expect many FPAA chips compiled to the desired NN structure with the same signal and training inputs, with the system recording the near-digital classified outputs and eventually reading the converged weight values. Rewriting the networks consumes a small percentage of time compared with the training algorithm. Supplying an arbitrary network with vectors of analog signals requires a large number of DACs; therefore, compiling and using the preprocessing algorithms on the FPAA with the NN would drastically reduce the number of DACs (e.g., acoustic classification), system complexity (e.g., input data movement), and energy requirements by utilizing end-to-end computing.
Again using the same 40 nm FPAA devices with an application of 10 M weights, with the bandwidth accelerated (where possible) to 1 GHz speeds, one expects computation in the 10 PMAC(/s) range operating at 10 W of average power. A parallel system of 1000 devices would occupy a rack infrastructure with 10 kW of computing power. The infrastructure to control the system (input DAC signals) and store the resulting inputs likely requires similar complexity for this 10 EMAC(/s) system. FPGAs used in accelerators for training follow a similar path, although with higher energy and complexity (data movement) requirements than those seen in the other applications. An FPGA system would also benefit from implementing front-end processing, reducing the input complexity. The FPAA system would require roughly 20-40 kW of power, and the FPGA system would require on the order of 100 MW of power, typical of a data-computing node. The large 10,000,000,000 PMAC production training problems would require roughly 2 weeks of compute time, assuming the FPAA reconfiguration/reloading time is a small fraction of the computing time.
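The rack-level estimate can be reproduced from the per-device figures. The following sketch uses only the quoted numbers (1000 devices, 10 PMAC(/s) and 10 W each, a 10,000,000,000 PMAC workload) and, as the text does, assumes that reconfiguration and reloading time is negligible; the function names are hypothetical.

```python
PMAC = 1e15                      # MAC operations in one peta-MAC
DEVICE_RATE_MAC_S = 10 * PMAC    # 10 PMAC/s per 40 nm device
DEVICE_POWER_W = 10.0            # average power per device
NUM_DEVICES = 1000
WORKLOAD_MAC = 1e10 * PMAC       # largest production training workload
SECONDS_PER_DAY = 86_400.0


def system_rate_mac_s(num_devices: int, device_rate_mac_s: float) -> float:
    """Aggregate throughput, assuming perfect parallel scaling."""
    return num_devices * device_rate_mac_s


def training_days(workload_mac: float, rate_mac_s: float) -> float:
    """Compute time in days, ignoring reconfiguration and reloading time."""
    return workload_mac / rate_mac_s / SECONDS_PER_DAY


rate = system_rate_mac_s(NUM_DEVICES, DEVICE_RATE_MAC_S)   # 1e19 = 10 EMAC/s
compute_power_kw = NUM_DEVICES * DEVICE_POWER_W / 1e3      # 10 kW
days = training_days(WORKLOAD_MAC, rate)                   # about 11.6 days

print(f"System rate: {rate / 1e18:.0f} EMAC/s at {compute_power_kw:.0f} kW")
print(f"Training time: {days:.1f} days")
```

The result, about 11.6 days of ideal compute time, is in line with the roughly 2 weeks stated above once some reconfiguration overhead is allowed for.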

Summary and Further Directions
We have shown that FPAAs have the potential to handle machine inference and learning applications with significantly lower energy requirements, potentially alleviating the high cost experienced today even in cloud-based systems. FPAA devices enable embedded machine learning, one form of physical mixed-signal computing, bringing machine learning and inference to low-power embedded platforms, particularly edge platforms. The SoC FPAA device uses fine-grain analog programmability and therefore minimizes the high cost of fine-grain switch networks. Current FPAA devices are a platform for mixed-signal development as well as analog-enabled computing, and future FPAA devices will significantly increase the size, area, and energy efficiency of these capabilities. Next-generation FPAAs can handle loads of 10,000-10,000,000,000 PMAC, as required for current and future large fielded applications, at energy levels orders of magnitude lower than expected from current technology, motivating the need to develop these new generations of FPAA devices.
An end-to-end solution perspective takes the computing and communication issues into account, from sensor to classified result. For configurability, the related architectural constraint is to avoid large external memories, because they significantly reduce performance without significantly increasing Φ. Local computing eliminates the deep communication to memories, among other differences in how learning architectures are efficiently computed [78]. Manhattan architectures with fine-grain analog storage provide an energy- and area-efficient implementation of reconfigurable routing of local and sparse interconnections, where the neural network weight computation occurs directly through the weight fabric routing. In these cases, the network complexity scales as the number of neurons, and depends only weakly on the number of synapses within reasonable neuron sparsity.
FPAAs for online learning may want to incorporate direct FG on-chip learning algorithms into the architecture. FPAA networks that use FG elements both for routing and for adaptation have been demonstrated, requiring the use of nFET FG elements to route the pFET adaptive paths while retaining their immediate state. These architectures should be considered in the next generations of FPAA devices, where specializations towards machine learning would be used.
One might question the long period of research and development before a commercial SoC FPAA device will be available. The development of FPGA devices took considerable time, even in an environment where digital computing was well established with a clear framework and other system design issues were being addressed in parallel [64]. The success of the SoC and earlier FPAA devices [1] led to the development of an initial complete toolset [33], a toolset that shows the next steps in FPAA automation. These successes in developing 350 nm FPAA devices of significant size (nearly 1 M FG parameters, ≈200 CABs + CLBs, µP), with design tools capable of designing an entire FPAA target, led to the development of analog computing techniques [12,90] that included foundational work on analog numerical analysis [77], architectures [78], and abstraction [91]. Like the SoC FPAA devices, these techniques are still relatively new and are expanding towards new applications (e.g., [92]).
From a technical perspective, the current and projected SoC FPAA capabilities could impact a range of applications. Commercial success often requires an alignment of technical and non-technical factors, and the path for platform technologies, such as FPGAs, is rarely linear [64]. The decades of market interest in Anadigm FPAAs, as well as recent interest in the FG-based Aspinity FPAAs, show a continued market interest in FPAAs even given the limited capabilities of these devices. Even though it is hard to predict commercial opportunities, the technical capabilities potentially open new opportunities in low-power end-to-end computing for machine learning applications.