Low-Power Ultra-Small Edge AI Accelerators for Image Recognition with Convolution Neural Networks: Analysis and Future Directions

Abstract: Edge AI accelerators have been emerging as a solution for near-customer applications in areas such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites. These applications must meet performance targets and resilience constraints due to the limited device area and hostile operating environments. Numerous research articles have proposed edge AI accelerators to satisfy these applications, but not all include full specifications. Most of them compare their architecture with existing CPUs, GPUs, or other reference research, which means the performance exposés in these articles are not comprehensive. Thus, this work lists the essential specifications of prior art edge AI accelerators and CGRA accelerators from the past few years to define and evaluate low-power ultra-small edge AI accelerators. The actual performance, implementation, and productized examples of edge AI accelerators are released in this paper. We introduce evaluation results showing the edge AI accelerator design trend in terms of key performance metrics to guide designers. Last but not least, we present the prospects of edge AI's existing and future directions and trends, which will involve other technologies to address future challenging constraints.


Introduction
The convolution neural network (CNN), widely applied to image recognition, is a machine learning algorithm. CNNs are usually adopted by software programs supported by artificial intelligence (AI) frameworks such as TensorFlow and Caffe. These programs typically run on central processing units (CPUs) or graphics processing units (GPUs) to form the AI systems that construct image recognition models. The models, trained with massive data such as big data and inferring results from given data, are commonly seen running on cloud-based systems.
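As a minimal, framework-agnostic sketch of the core operation behind CNN-based image recognition, the following pure-Python function (the names are ours, not from any framework) computes a 2-D convolution and counts the multiply-accumulate (MAC) operations involved; each MAC corresponds to two of the 'operations' counted by the OPs metric used later in this paper.

```python
# Minimal sketch of the core CNN operation an accelerator must execute:
# a 2-D convolution, where every output pixel is a sum of
# multiply-accumulate (MAC) operations.

def conv2d(image, kernel):
    """Valid-mode 2-D convolution; returns (output, mac_count)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    macs = 0
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
                    macs += 1
            out[y][x] = acc
    return out, macs

# A 4x4 image with a 3x3 kernel yields a 2x2 output and 2*2*3*3 = 36 MACs.
img = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
ker = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
result, macs = conv2d(img, ker)
```

The fully nested loops make clear why CNN workloads favor massively parallel hardware: every MAC is independent within an output pixel's accumulation, and output pixels are independent of each other.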
Hardware platforms for running AI technology can be sorted into the following hierarchies: data-center-bound systems, edge-cloud coordination systems, and 'edge' AI devices. The three hierarchies of hardware platforms, from the data center to edge devices, require different hardware resources and are exploited by various applications according to their demands. The state-of-the-art applications for image recognition, such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites, belong to the third hierarchy and are called edge devices. Edge devices refer to devices that connect to the internet but sit near the consumers, at the edge of the whole Internet of Things (IoT) system. They are also called edge AI devices when they utilize AI algorithms. The targeted AI algorithm of the accelerators in this paper is CNN.
The mentioned applications share several features that demand a dedicated system to cover. Edge AI holds several advantages and is qualified to deal with them. The specific capabilities of edge AI technology are:

• Edge AI can improve the user experience as AI technology and data processing move ever closer to customers.

• Edge AI can reduce data transmission latency, which implies real-time processing ability.

• Edge AI can run without internet coverage, offering privacy through local processing.

• Edge AI pursues a compact size and manages power consumption to suit mobility and limited power sources.
Some edge AI systems are neither power-sensitive nor size-limited, such as surveillance systems for face recognition and unmanned shops. Being designed as immobile systems is a specific feature of these edge AI systems. Although these applications do not care about power consumption and size, they tend to be more aware of data privacy. As a result, they also avoid using the first- and second-hierarchy platforms. However, these power-non-sensitive edge AI systems are not a target of this paper. This paper focuses on surveying AI accelerators designed for power-sensitive and size-limited edge AI devices, which run on batteries or limited power sources such as solar panels. Edge AI mentioned in the following refers to this kind of system. Typically, such a system is portable and comes to mind when people mention edge AI devices, because GPU- and CPU-based systems can easily substitute for the non-power-sensitive and non-size-limited edge AI systems, which do not care about power consumption and size. As an edge AI accelerator requires mobility, power consumption and area size are the features of most concern. Greedily, when these two features meet the requirements, the computation ability is expected to be as high as possible. High computation ability helps to satisfy the critical feature of this kind of edge AI device: real-time computing ability for predicting or inferring the subsequent decision from pre-trained data.
CPUs and GPUs have been used extensively in the first two hierarchies of AI hardware platforms for running CNN algorithms. Due to the inflexibility of CPUs and the high power consumption of GPUs, they are not suitable for power-sensitive edge AI devices. As a result, power-sensitive edge AI devices require a new customized and flexible AI hardware platform to implement arbitrary CNN algorithms for real-time computing with low power consumption.
Furthermore, as edge devices develop into various applications, such as monitoring natural hazards by UAV, detecting radiation leakage after a nuclear disaster by robotics, and remote sensing in space by satellites, these applied fields are more critical than usual. These critical environments, such as radiation fields, can cause system failure. As a result, not only power consumption and area size are key but also the fault tolerance of edge AI devices, to satisfy their compact and mobile features with reliability. Various research articles have been proposed targeting fault tolerance. Reference [1] introduces a clipped activation technique that blocks potentially faulty activations by mapping them to zero, evaluated on a CPU and two GPUs. Reference [2] focuses on systolic array fault mitigation, which utilizes fault-aware pruning with/without a retraining technique. The retraining feature takes at least 12 min to finish, and the worst case is 1 h for AlexNet; it is not suitable for edge AI. For permanent faults, Reference [3] proposes a fault-aware mapping technique to mitigate permanent faults in MAC units. For power-efficient technology, Reference [4] proposes a computation-reuse-aware neural network technique to reuse weights by constructing a computational reuse table. Reference [5] uses an approximate computing technique and retrains the network to obtain the resilient neurons. It also shows that dynamic reconfiguration is the crucial feature for the flexibility of arranging the processing engines. These articles focus specifically on fault tolerance technology. Some of them address the relationship between accuracy and power efficiency together but lack computation ability information [4,5]. Besides these listed articles, many more published works have targeted fault tolerance in recent years, which indicates that edge AI with fault tolerance is the trend.
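As a rough illustration of the clipped-activation idea in [1] (the clipping bound below is a hypothetical profiled value, not one taken from the paper), activations that exceed an expected maximum, as a bit flip in a high-order bit typically produces, are mapped to zero instead of being propagated:

```python
# Sketch of clipped activation (after [1]): abnormally large activations,
# which usually indicate a bit-flip fault, are mapped to zero instead of
# being propagated through the network. The bound 6.0 is a hypothetical
# profiled maximum, not a value from the paper.

CLIP_BOUND = 6.0

def clipped_relu(x, bound=CLIP_BOUND):
    if x < 0.0:
        return 0.0          # ordinary ReLU behaviour
    if x > bound:
        return 0.0          # potentially faulty activation -> zero
    return x

# A bit flip in a high exponent bit can turn a small activation into a
# huge value; clipping stops it from corrupting downstream layers.
faulty = 1.5 * 2.0 ** 40
```

Mapping suspect values to zero (rather than clamping them to the bound) exploits the observation that zero is a harmless value for most CNN activations, so accuracy degrades gracefully.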
In summary, edge AI devices' significant issues are power sensitivity, device size limitation, limited local-processing ability, and fault tolerance. The limitations of power sensitivity, device size, and local-processing ability are bound by the native constraints of edge AI devices. These constraints arise from the limited power source, such as a battery, and the portability requirement, which limits the size of an edge AI system. To address these issues, examining the three key features of the prior art accelerators tailored for edge AI devices is necessary for providing future design directions.
From the point of view of a complete edge AI accelerator, the released specifications of the above fault tolerance articles are not comprehensive, because they focus on the fault tolerance feature rather than a whole edge AI accelerator. Furthermore, most related edge AI accelerator survey works tend to focus on an individual structure's specific topics and features without comparing the three key features. Although each article on an individual structure is interesting to read and learn from, without direct comparison in the same standard units it is unclear how good each structure is and which generation it belongs to. This situation also makes it hard for designers to compare the structures of edge AI accelerators and determine which design is more suitable to reference. As a result, this paper focuses on evaluating the prior art edge AI accelerators on the three key features.
The prior arts focused on in this work are not limited to the released prior art accelerators but also include edge accelerator architectures based on coarse-grained reconfigurable array (CGRA) technology, because one of the solutions for achieving flexibility of the hardware structure within a compact size is dynamic reconfiguration. The reconfigurable function realizes different types of CNN algorithms, such that they can be loaded into an AI platform depending on the required edge computing. Moreover, the reconfigurable function also potentially provides fault tolerance to the system by reconfiguring the connections between processing elements (PEs). Overall, this survey will benefit those looking up low-power ultra-small edge AI accelerators' specifications and setting up their own designs. This paper helps designers choose or design a suitable architecture by indicating reasonable parameters for their low-power ultra-small edge AI accelerator. The rest of this paper is organized as follows: Section 2 introduces the hardware types adopted by AI applications. Section 3 introduces the edge AI accelerators, including prior edge AI accelerators, CGRA accelerators, the units used in this paper for evaluating their three key features, and the suitable technologies for implementing the accelerators. Section 4 releases the analysis results and indicates future directions. Conclusions and future works are summarized in Section 5.

System Platform for AI Algorithms
To achieve the performance of AI algorithms, several design trends of complete platforms for AI systems, such as cloud training and inference, edge-cloud coordination, near-memory computing, and in-memory computing, have been proposed [6]. Currently, AI algorithms rely on cloud or edge-cloud coordinating platforms, such as Nvidia's GPU-based chipsets, Xilinx's Versal platform, MediaTek's NeuroPilot platform, and Apple's A13 CPU [7]. The advantages and disadvantages of the CPU and GPU when applied to edge devices are shown in Table 1 [8,9]. As shown in Table 1, the CPU and GPU are more suitable for data-center-bound platforms due to the CPU's sequential processing feature and the GPU's high power consumption. They do not meet the demands of low-power edge devices, which are strictly power-limited and size-sensitive [10].
Edge-cloud coordination systems belong to the second hierarchy, which cannot run in areas with no network coverage. Data transfer through the network has significant latency, which is not acceptable for real-time AI applications such as security and emergency response [9]. Privacy is another concern when personal data is transferred through the internet. Low-power edge AI devices require hardware to support high-performance AI computation with minimal power consumption in real time. As a result, designing a reconfigurable AI hardware platform that allows the adoption of arbitrary CNN algorithms for low-power edge AI devices with no internet coverage is the trend.

Advantages:
• Can process high-throughput video data.
• High memory bandwidth.
• Power- and computation-efficient.
• Customizable design for the specific application.

Disadvantages:
• The sequential processing feature does not match the characteristics of CNN, which requires massively parallel computing.
• High power consumption restricts its application for power-sensitive edge devices.
• Images in a streaming video and some tracking algorithms are input sequentially, not in parallel [15].
• Customizable for the specific targeted application (inflexible for all types of computations).
• Computational power is limited compared to data-center CPUs and GPUs.

Application platform:
• CPU: more suitable for a data center; cooperates with an AI accelerator.
• GPU: more suitable for a data center; cooperates with an AI accelerator.
• AI accelerator: customized for specific edge devices; can cooperate with a CPU or GPU.

Edge AI Accelerators
Several architectures and methods have been proposed to achieve the compact size, low power consumption, and computation ability required by edge devices. The following Sections 3.2 and 3.3 introduce the released prior art edge AI accelerators and state-of-the-art edge accelerators based on CGRA, which are potentially suitable for low-power edge AI devices.
Most of the proposed review articles introduce accelerators feature by feature, and some miss mentioning the three key features. These articles tend to report the existing works only but not compare them. On the other hand, other edge AI accelerator articles contain all three key features. Still, the results they release are hard to understand because they only release comparison results against reference accelerators. It turns out that the units used in these articles for comparing architecture area, power consumption, and computation ability are not those expected by the edge AI designers' community. Instead of using square millimeters (mm²), watts (W), and operations per second (OPs), the results show how many 'times' better they are than their reference works. As a result, we decided to compare the edge AI accelerators and CGRA architectures using the units used by most AI accelerator designers.

Specification Normalization and Evaluation
The unit of computation ability presented in the following sections is OPs (operations per second); MOPs, GOPs, and TOPs represent Mega, Giga, and Tera OPs, respectively. The arithmetic of each accelerator varies in data representation, e.g., floating point (FP) and fixed point (fixed). When comparing the accelerators, using different arithmetic would lose impartiality. As a result, converting the units is the following task.
The computation ability is represented as c. If the arithmetic of the accelerator is FP, its c is defined as cFP. On the other hand, cFixed means the computation ability under fixed arithmetic. In the computation rows of Tables 2-7, the initial c is the original data released by the reference works; it may vary in arithmetic type and precision. Based on [16,17], the computation ability of FP (cFP) can be converted to cFixed by scaling by a factor of three. As a result, (1) is introduced.
Converted computation ability to a fixed point is defined as follows:

cFixed = 3 × cFP. (1)

However, because not all accelerators have the same data precision, cFixed is not convincing when comparing the accelerators' abilities. Reference [18] indicates that if a structure is not optimized for specific precisions, as Nvidia does in their GPUs, the theoretical performance of half precision follows the natural 2×/4× speedups against single/double precision, respectively. As a result, accelerators' computation ability needs to be normalized to 16 bit, as it is the most widely used, without loss of generality. After normalization, the computation ability of each accelerator can be represented as cFixed16.
To specify the accelerators' performance fairly, (2) is introduced. The final computation abilities shown in the computation rows of Tables 2-7 are the computation abilities in 16-bit fixed-point format.
Converted computation ability to a 16-bit fixed point is defined as follows:

cFixed16 = cFixed × (n / 16), (2)

where n is the bit width of the accelerator's arithmetic. To specify the accelerators' synergy performance, (3) is introduced to represent the accelerators' evaluation value. Since edge devices require low power consumption and compact size, in (3) the denominator is power consumption p (W) times chip size s (mm²), and the numerator is computation ability cFixed16 (GOPs).
The equation for evaluating an accelerator's synergy performance is:

E = cFixed16 / (p × s). (3)
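The pipeline of (1)-(3) can be sketched in code as follows; the helper names and the example accelerator's figures (10 GFlops at 32-bit FP, 0.5 W, 4 mm²) are ours, purely for illustration.

```python
# Sketch of the normalization pipeline in (1)-(3). Helper names and the
# example figures are hypothetical; the scaling rules follow the text.

def to_fixed(c_fp):
    """(1): cFixed = 3 x cFP (floating-point OPs -> fixed-point OPs)."""
    return 3.0 * c_fp

def to_fixed16(c_fixed, bits):
    """(2): cFixed16 = cFixed x (n / 16), linear in the bit width n."""
    return c_fixed * (bits / 16.0)

def evaluation_value(c_fixed16, power_w, area_mm2):
    """(3): E = cFixed16 / (p x s), in GOPs per watt per mm^2."""
    return c_fixed16 / (power_w * area_mm2)

# Hypothetical accelerator: 10 GFlops in 32-bit floating point,
# 0.5 W power consumption, 4 mm^2 chip area.
c_fixed = to_fixed(10.0)                   # 30 GOPs (32-bit fixed)
c_fixed16 = to_fixed16(c_fixed, 32)        # 60 GOPs (16-bit equivalent)
e = evaluation_value(c_fixed16, 0.5, 4.0)  # E = 60 / 2 = 30
```

Running the chain on reported specifications in this way gives every surveyed accelerator a single comparable figure of merit, which is how the E values in Tables 2-7 are obtained.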

Prior Art Edge AI Accelerators
The following shows the edge AI accelerators [9,19-28], which focus on the demands of edge devices and are organized into Tables 2-4 according to their precision and power consumption. Table 2 shows the accelerators with 16-bit precision and less than one watt of power consumption. Table 3 shows the 16-bit precision accelerators with relatively high power consumption, higher than one watt. Table 4 shows the accelerators that do not use 16-bit precision.

Commercial product example
Packaged in a USB stick; packaged on a PCIe interface card (no public sales channel found).

After calculating their evaluation value E, accelerators [9,21] show similar abilities, with E in the 80s. On the other hand, References [19,22,26] share similar E values in the 20s. Although some accelerators have close evaluation values E, the cFixed16, p, and s values of these accelerators still need to be examined, because the accelerators may target different purposes and environments. For example, Reference [16] has the highest evaluation value E, but its size is 9.8 times that of [9], 3 times that of [21], and nearly 2 times that of [26]. Overall, the evaluation value E indicates the general efficiency of an AI accelerator, which is computation ability per unit area and watt. In Tables 2-4, several accelerators [23,25,28] lack detailed specifications since they only release module-level data. As a result, their evaluation values E should be treated more conservatively. On the other hand, Reference [20] is a complete system on an FPGA board and does not release its size as a single chip, so its evaluation value E is hard to measure. Nevertheless, its data are good study material for designers who intend to build their future projects on an FPGA board for prototyping.

Three Key Features and the Evaluation Value
Some works, such as [29], use analog components and memristors to mimic neurons for CNN computing. However, none of the commercially proposed systems uses memristors. Several developers have researched memristor technology, including HP, Knowm, Inc. (Santa Fe, NM, USA), Crossbar, SK Hynix, HRL Laboratories, and Rambus. HP built the first workable memristor in 2008, yet it is still some distance from prototype to commercial application. Knowm, Inc. sells their fabricated memristors for experimentation purposes; again, the memristor is not intended for application in commercial products [30]. Besides, it is worth mentioning that many CPUs in smartphones contain built-in neural processing units (NPUs) or AI modules, for example, the MediaTek Helio P90, Apple A13 Bionic, Samsung Exynos 990, and Huawei Kirin 990. However, the detailed performance of the individual NPUs or so-called AI modules in these commercial CPUs is not public. As a result, these AI modules are hard to compare with pure AI accelerators, but it is worth keeping an eye on these commercial products so as not to miss the latest information.

Coarse-Grained Cell Array Accelerators
Dynamically reconfigurable technology is the key feature of an edge AI hardware platform for flexibility and fault tolerance. The term 'dynamic' means that reconfiguring the platform is still possible during runtime. Generally, reconfigurable architectures can be grouped into two major types: fine-grained reconfigurable architecture (FGRA) and coarse-grained reconfigurable architecture (CGRA). FGRA allocates a large amount of silicon for interconnecting the logic, which implies that FGRA limits the rate of reconfiguring devices in real time due to the larger instruction bitstreams needed. As a result, CGRA is the better solution for real-time computing.
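To make the fault-tolerance-by-reconfiguration argument concrete, here is a toy sketch (our own, not taken from any surveyed CGRA) of remapping work away from a faulty processing element in a PE grid:

```python
# Toy sketch of CGRA-style fault tolerance by reconfiguration: tasks
# mapped onto a grid of PEs are remapped onto the remaining healthy PEs
# when one is marked faulty. Real CGRAs reconfigure interconnect and
# context memories; this only models the mapping step.

def map_tasks(tasks, pes, faulty):
    """Round-robin tasks onto healthy PEs, skipping faulty ones."""
    healthy = [p for p in pes if p not in faulty]
    if not healthy:
        raise RuntimeError("no healthy PEs left")
    return {t: healthy[i % len(healthy)] for i, t in enumerate(tasks)}

pes = ["PE00", "PE01", "PE10", "PE11"]
tasks = ["conv1", "conv2", "pool", "fc"]

before = map_tasks(tasks, pes, faulty=set())
after = map_tasks(tasks, pes, faulty={"PE01"})  # PE01 detected faulty
```

The system keeps running at reduced throughput rather than failing outright, which is the essence of the reliability argument for reconfigurable edge AI accelerators.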
Reference [31] presents many CGRAs and categorizes them into different categories: early pioneering, modern, large, and deep learning. The article includes plentiful information, and the authors also collect statistics to show readers the developing trend of CGRA compared to GPU. However, Reference [31] does not compare the three key features of the CGRAs. To understand the performance differences between CGRAs and determine which architectures are potential candidates for edge AI accelerators, this paper presents architectures [32-42] published in the recent few years for comparison. To unify the units and reference standards used in each article, this paper consults the references and converts the various units into standardized ones according to the information revealed for each architecture in Tables 5-7. Table 5 shows the CGRAs using 32-bit precision, while Table 6 shows the 16-bit precision CGRAs. Last but not least, Table 7 presents the CGRAs that use neither 32-bit nor 16-bit precision. Some of the works do not show their computation ability in OPs, the standard reference unit for AI accelerator designers and clients. Instead, a few of them compare their computation ability with the ARM Cortex A9 processor [32,33,36]. For example, Reference [33] releases Versat's performance in operation cycles by running benchmarks and shows the ARM Cortex A9 processor's performance on those benchmarks for comparison. However, the article does not show Versat's OPs. According to the results, Versat is 2.4 times faster than the ARM Cortex A9 processor on average. Reference [43] shows that the performance of the ARM Cortex A9 processor is 500 mega floating-point operations per second (MFlops). After calculation, Versat's operation ability equals 1.17 giga floating-point operations per second (GFlops). However, GFlops is still not the preferred unit for edge AI devices' designers and clients. Based on (1), GFlops can be converted to GOPs by scaling by three. Finally, the performance of Versat is obtained as 3.51 GOPs in 32-bit precision. We adopt (2) to get its cFixed16, 7.02 GOPs, as shown in Table 5, for easy comparison with other accelerators. Similar work is done for the rest of the architectures in Tables 5-7. As an exception, the area size of [34] cannot be found due to lack of information. Reference [36] does its work on an FPGA, so its core size cannot be evaluated.
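The conversion chain for Versat described above can be checked numerically; note that the exact speedup factor of 2.34× is back-derived from the reported 1.17 GFlops and the 500 MFlops baseline (the text quotes it rounded to 2.4×).

```python
# Sanity check of the Versat conversion chain. The speedup 2.34x is
# back-derived from the 1.17 GFlops result (the text rounds it to 2.4x).

arm_a9_mflops = 500.0          # ARM Cortex A9 baseline [43]
speedup = 2.34                 # Versat vs. A9 (back-derived)

versat_gflops = arm_a9_mflops * speedup / 1000.0   # 1.17 GFlops (32-bit FP)
versat_gops = versat_gflops * 3.0                  # (1): FP -> fixed, 3x
versat_gops16 = versat_gops * (32.0 / 16.0)        # (2): normalize to 16 bit
```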
The computation unit used by [37] is faces per second, because it targets face recognition; this makes converting the computation unit even harder and introduces significant deviation when referencing the ability from other similar work. Table 5 shows the computation ability of [37] as 450 faces/second, roughly equal to 201.6 GOPs [44]. In [37], 30 faces are recognized per frame at a frame rate of 15 per second, which amounts to 450 recognitions performed per second. On the other hand, the reference work [44] recognizes up to 10 objects per frame at a frame rate of 60 per second. As a result, the converted computation ability of [37] remains a reference value with a certain deviation. Reference [40] has the highest evaluation value E of all the listed works. However, the revealed size of [40] covers only part of the architecture, so edge AI designers should be more conservative in assessing its specifications. Reference [42] does not release its computation ability. Since [42] shares the same architecture as [45], the operations/power ratio in [45] can serve as a reference. Furthermore, Reference [42] contains double the cores and extra heterogeneous PEs compared to [45]. As a result, the overall operations/power ratio of [42] would be higher than evaluated.
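The proportional conversion used for [37] can be reproduced as follows; note that the GOPs figure assumed for the reference design [44] is back-computed from the numbers in the text (201.6 / (450/600)), so this only checks internal consistency rather than adding new data.

```python
# Reproducing the faces/second -> GOPs conversion for [37]. The GOPs
# throughput attributed to [44] is back-computed from the text, so this
# is a consistency check, not an independently sourced figure.

target_recognitions = 30 * 15      # [37]: 30 faces/frame at 15 fps = 450/s
ref_recognitions = 10 * 60         # [44]: 10 objects/frame at 60 fps = 600/s
ref_gops = 268.8                   # back-computed throughput of [44]

# Scale the reference throughput by the ratio of recognitions per second.
target_gops = ref_gops * target_recognitions / ref_recognitions  # ~201.6
```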
Overall, the evaluation results show that [37,38,41] have tens-grade evaluation values E, between 10 and 40. References [32,33,35,42] have hundreds-grade evaluation values E, between 100 and 400. According to their evaluation values E, CGRAs show the potential ability to execute edge AI applications, like the outstanding prior art edge AI accelerators in Tables 2-4.

Implementation Technology
The implementation method of each edge AI accelerator has been shown in Tables 2-7. Most of the prior art edge AI accelerators in Tables 2-4 have been commercialized and taped out as ASIC chips, such as [9,16,22,23,25,26,28], which can easily be found for sale online in different system packages. Commonly, the system packages adopted are development boards or USB sticks; examples are in Tables 2-4. On the academic side, References [19,21] are also built as ASICs, but [20] is implemented in a single FPGA chip. As for the CGRA accelerators in Tables 5-7, although most of them have not been taped out as physical chips, they have been synthesized with ASIC standard-cell libraries. The technologies they use are organized in Tables 5-7.
From the implementation information in Tables 2-7, it can be noticed that FPGAs and application-specific integrated circuits (ASICs) are the most used approaches to implement edge AI accelerators, including CGRA-based ones, for their customizability and low power consumption. The non-recurring engineering (NRE) expense and flexibility of an ASIC are high and low, respectively, compared to an FPGA. As a result, building a system as an ASIC has a higher cost than an FPGA when the product volume is small. Moreover, the development time of an ASIC is longer than that of an FPGA. At the beginning of the system development process, FPGA-based platforms are the better solution due to their high throughput, reasonable price, low power consumption, and reconfigurability [46]. Accordingly, at the prototyping stage, building the future AI platform design on a suitable FPGA platform at the system level [47] is suggested.
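The ASIC-versus-FPGA cost trade-off described above can be illustrated with a toy break-even calculation (all dollar figures below are hypothetical, chosen only for illustration):

```python
# Toy break-even analysis for ASIC vs. FPGA (all costs hypothetical).
# ASIC: high one-time NRE, low per-unit cost. FPGA: no NRE, higher
# per-unit cost. The ASIC wins only above some production volume.

ASIC_NRE = 2_000_000.0     # hypothetical mask/tooling (NRE) cost
ASIC_UNIT = 5.0            # hypothetical per-chip cost
FPGA_UNIT = 55.0           # hypothetical per-device cost

def total_cost(volume, nre, unit):
    return nre + unit * volume

def break_even_volume(nre, asic_unit, fpga_unit):
    """Volume above which the ASIC's total cost drops below the FPGA's."""
    return nre / (fpga_unit - asic_unit)

# Below the break-even volume the FPGA is cheaper; above it, the ASIC.
v = break_even_volume(ASIC_NRE, ASIC_UNIT, FPGA_UNIT)  # 40,000 units here
```

This is why the text recommends FPGA platforms for prototyping and small volumes, and ASIC tape-out for commercialized, high-volume accelerators.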

Architecture Analysis and Design Direction
Figure 1 organizes the accelerators whose evaluation value E is in the grade of tens or hundreds from Tables 2-7. References [20,34,36,40] are not included in Figure 1 because they lack chip-level area data. The power consumption data of [23,25,28] are only released at module level, so, to be fair, they are not listed in Figure 1 either. The yellow area in Figure 1 represents the area size of each accelerator. Every bar in Figure 1 is composed of two parts, upper and lower, in blue and orange, respectively. The upper blue part represents GOPs/mm²; the lower orange part represents mW/mm². The two lines in Figure 1, upper and lower, represent the evaluation value E and the ratio of GOPs to mW, respectively. To focus on accelerators targeting ultra-small areas, the accelerators whose area size is below 10 mm² are selected, and their three key features are presented in Figure 2a. In Figure 2a, each accelerator has two bars: the left orange one represents power consumption in mW, while the right blue one represents computation ability in GOPs. According to Figure 2a, the accelerators can be grouped into two categories by area size: the units grade and the decimal grade. In the units grade group are [9,35,37,38]. On the other hand, References [32,33,42] belong to the decimal grade group. In the units grade group, References [9,35] share a similar GOPs/mW ratio, while [37] has a relatively lower ratio and [38] has the lowest. The following analyzes these ratios. Although [37] has computation ability close to 1.3 times that of [9] and 0.45 times that of [35], it consumes too much power, nearly 3.5 times that of [9] and 1.3 times that of [35]. As for [38], its computation ability almost reaches a hundred GOPs but with huge power consumption, even higher than [37]. As a result, References [9,35] have better computation ability and power consumption in the units-mm² area grade. It is interesting that area size and the GOPs/mW ratio correlate positively in the decimal grade group. Reference [42] has a relatively large area compared to its computation ability. The next paragraph introduces the analysis in more detail.
Figure 2b shows the normalized three key features of the accelerators in Figure 2a. The three key features of the accelerators are normalized to the same grade of computation ability by linearly scaling up [32,33,38,42] and scaling down [35,37] [48]. The result shows that, except for [38,42], the remaining five accelerators have a similar trend in power consumption and area size. After normalization, the result emphasizes the unsatisfactory performance of [38,42] for low-power edge AI devices. Reference [38] consumes too much power, while [42] has too large an area compared to its computation ability. However, if the target application requires ultra-low power consumption and can accept hundreds-grade MOPs, Reference [42] is a good choice. Overall, a trade-off between computation ability and power consumption can be made once the architecture size has been chosen. Designers can set an accelerator's specifications according to its target application. As a result, if designers want to design an architecture for an edge AI accelerator in an ultra-small area (units of mm²), the power consumption and operation ability should be on the order of hundreds of mW and hundreds of GOPs, respectively.
Figure 3 shows the scenario of AI applications from the original to the future. It also illustrates the model of AI application distribution in the yellow and purple ovals, from data-center-based cloud AI to edge AI, mentioned as the three hierarchies. Figure 4 shows the legends of Figure 3.
The history of developing data-center-based AI in the first hierarchy can be traced back to the 1980s. At that time, AI systems were developed to solve specific problems with the decision-making ability of a human expert. This was achieved by if-then rules instead of conventional procedural code, composed of an inference engine and knowledge base units. The idea is like the AI technology we are using nowadays; rather than the knowledge base, we use big data and deep learning. Several modern database AI achievements are well known; examples are Deep Blue (a chess computer, 1996) and AlphaGO (a board game Go program, 2014). The significant technology gap between Deep Blue and AlphaGO is the deep learning neural network, which can handle the larger branching factor of Go games.
In summary, training on chips in edge AI accelerators will be a popular research topic. Distributed computing technology, 5G or future telecommunication protocols, data encryption for training on chips, cross-platform data exchange protocols, and harsh-environment tolerance will be involved in the development path of edge AI accelerators in the future.

Conclusions and Future Works
This paper has presented a survey of up-to-date edge AI accelerators and CGRA accelerators that can apply to image recognition systems and introduced the evaluation value E for both edge AI accelerators and CGRAs. CGRA architectures meet the evaluation values E of the existing prior art edge AI accelerators, which implies the potential suitability of CGRA architectures for running edge AI applications. The results reveal that the evaluation values E of prior art edge AI accelerators and CGRAs lie between the tens and four hundred, indicating that future edge AI accelerator designs should meet this grade. Overall, the analysis shows that the power consumption and operation ability of future ultra-small-area (under 10 mm²) accelerator designs should be on the order of hundreds of mW and hundreds of GOPs, respectively.
As the edge devices are finding their way into various applications such as monitoring natural hazards by UAVs, detecting radiation leakage for nuclear disaster by robotics, and remote sensing in space by satellites, these applied fields are more critical than usual.Many research articles are targeting the fault tolerance feature for edge AI accelerators in recent years, which indicates that the trend of the edge AI accelerators is resilient in terms of reliability and high radiation field applicability.Finally, we illustrate the current status of edge AI applications with a future vision, which addresses the technologies involved in the future design of edge AI accelerators and their challenges.The history of the developing data center-based AI in the first hierarchy can trace back to the 1980s.At that time, AI systems had developed to solve several specific problems as the decision-making ability of a human expert.It is achieved by if-then rules instead through conventional procedural code and composed by the inference engine and the knowledge base units.The idea is like the AI technology we are using nowadays.Rather than the knowledge base, we use big data and deep learning.Several modern database AI achievements are well noticed; the examples are Deep Blue (a chess computer 1996) and AlphaGO (a board game Go program 2014).The significant technology gap between Deep Blue and AlphaGO is deep learning neural network that can handle a larger branching factor in GO games.
Nowadays, much different information is collected by internet browsers, databases, and sensors through big data technology. This massively collected data is good material for training. Depending on the individual application, the trained weights differ. Trained weights are stored in cloud systems, waiting for unjudged data to come in. A specific application-oriented AI system uses a suitable deep learning neural network algorithm to infer the result, then transmits the result back to the requesting system. A cloud-based data center AI system is powerful thanks to its high-end hardware, such as supercomputers or high-end GPUs, but it is hard for ordinary consumers to reach directly.
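The cloud-side flow described above (store trained weights, wait for unjudged data, infer, return the result) can be sketched minimally as follows. The weights, model, and labels here are hypothetical placeholders, not taken from any system in the survey:

```python
import numpy as np

# Hypothetical trained weights, as stored in the cloud system.
TRAINED_WEIGHTS = np.array([0.8, -0.3, 0.5])

def infer(unjudged_data):
    """Run a minimal linear classifier on incoming unjudged data
    and return the result that is sent back to the requesting system."""
    score = float(np.dot(TRAINED_WEIGHTS, unjudged_data))
    return {"score": score, "label": "match" if score > 0 else "no match"}

# An edge client sends a feature vector; the cloud replies with the result.
result = infer(np.array([1.0, 2.0, 0.0]))
```

In a real deployment the linear classifier would be replaced by a full CNN, but the request/response shape of the pipeline is the same.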
Thanks to the well-developed internet and personal consumer electronics, AI technology can now benefit the general public. As a result, edge AI has become a popular topic. The current development status of edge AI can be seen in both the yellow and purple ovals in Figure 3. The yellow oval shows that the data center AI system is connected with several edge AI devices through internet infrastructure such as routers and cell towers. Some edge AI systems need to work with intermediary systems, such as a server or a PC, to process data before transmitting it to or receiving it from the cloud. This type of edge AI system, which requires a connection to the data center, is called an edge-cloud coordination system; such systems are currently widespread and well known, e.g., virtual assistants such as Siri.
In Figure 3, the face recognition door lock could send the unjudged data, i.e., a face picture, to the cloud for processing and get the inference result back. In reality, however, this method is challenged by safety concerns. As a result, systems requiring high-standard security, such as door locks, usually adopt edge AI systems instead of edge-cloud coordination systems. Because of their local inference capability, edge AI systems can process data without an internet connection. Local inference is a big step toward guaranteeing data privacy, security, real-time processing, and operation in environments without internet coverage. This paper has introduced several edge AI accelerators that target this field to satisfy the demand for edge AI devices.
The latest-generation telecommunication standards, e.g., 5G, can be valuable technologies for future edge AI development when it comes to training on chips. Training on chips might require something like distributed computing, achieved by the Internet of Things (IoT) through 5G or future communication protocols. The idea is shown in the orange oval in Figure 3.
For example, autonomous cars contain weights pre-trained during development to allow their AI systems to make self-driving decisions. The car manufacturer Tesla has reached epochal milestones in this area. However, pre-trained weights might have flaws when the environment changes in an unexpected way; several non-fatal crashes are reported in [49]. The idea of training on chips is that when a human driver takes over from the autopilot in a situation the system cannot recognize but a human can handle, the system can share that information with other compatible systems for training on chips. Training on chips can adopt distributed computing technology to reduce the computation burden.
A future research trend is how to distribute the data for training and share the weights between compatible systems, which might demand 5G or future telecommunication protocols to implement massive data exchange. Training on chips also encounters other challenges. As shown in Figure 3, edge AI systems in different unmanned shops may face several significant issues that people care about, notably personal data utilization and a standard platform/protocol for data exchange. Although personal data utilization may sound beyond the technical design level of edge AI accelerators, it is suggested that an accelerator designer incorporate a data encryption algorithm alongside on-chip data training at the design level. Another significant topic is the data-training exchange protocol. Applying training on chips to universal edge AI devices and systems is needed to maximize its benefit. The answer might be a cross-platform protocol, which will require coordination among designers.
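The weight-sharing scheme sketched above resembles federated averaging. The following Python sketch (a hypothetical illustration, not taken from any accelerator in this survey) shows how compatible edge systems could combine locally trained weights without exchanging the raw training data:

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """Combine per-device weight arrays into one shared model.

    local_weights: list of numpy arrays, one per edge device
    sample_counts: number of training samples each device contributed
    """
    total = sum(sample_counts)
    # Weight each device's contribution by how much data it trained on.
    return sum(w * (n / total) for w, n in zip(local_weights, sample_counts))

# Three hypothetical edge devices with locally updated weights
device_weights = [np.array([0.2, 0.4]), np.array([0.3, 0.5]), np.array([0.1, 0.6])]
samples = [100, 300, 100]
shared = federated_average(device_weights, samples)
```

Only the weights travel over the network, which is why a data encryption layer and a cross-platform exchange protocol, as discussed above, would sit naturally on top of this step.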
Furthermore, there is substantial research on edge AI systems targeting the harsh environments mentioned in the introduction; the UAV working in a nuclear power station and the remote sensing satellite in Figure 3 are examples. Designers should be aware of these scenarios when designing future edge AI accelerators.
In summary, training on chips in edge AI accelerators will be a popular research topic. Distributed computing technology, 5G or future telecommunication protocols, data encryption for training on chips, cross-platform data exchange protocols, and harsh-environment tolerance will all be involved in the future development path of edge AI accelerators.

Conclusions and Future Works
This paper has presented a survey of up-to-date edge AI accelerators and CGRA accelerators that can be applied to image recognition systems and introduced the evaluation value E for both. CGRA architectures meet the evaluation value E of existing prior-art edge AI accelerators, which implies the potential suitability of CGRA architectures for running edge AI applications. The results reveal that the evaluation values E of prior-art edge AI accelerators and CGRAs lie between the tens and about four hundred, indicating that future edge AI accelerator designs should meet this grade. Overall, the analysis shows that future ultra-small-area (under 10 mm²) accelerators should have power consumption and operation ability on the order of hundreds of mW and GOPs, respectively.
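As a rough illustration of these design targets, a designer might sanity-check a candidate design point as below. Note that the formula used here, operations per watt per mm², is an assumed figure of merit for illustration only; the survey's actual definition of E is given earlier in the paper:

```python
def evaluation_value(gops, power_w, area_mm2):
    """Assumed figure of merit: operations per watt per mm^2.

    NOTE: this formula is an illustrative assumption, not necessarily
    the definition of E used in the survey.
    """
    return gops / (power_w * area_mm2)

def meets_ultra_small_targets(gops, power_w, area_mm2):
    # Targets from the survey: area under 10 mm^2, power in the
    # hundreds-of-mW range, throughput on the order of GOPs.
    return area_mm2 < 10.0 and 0.1 <= power_w < 1.0 and gops >= 1.0

# Hypothetical design point: 100 GOPS at 500 mW on 5 mm^2
e = evaluation_value(100, 0.5, 5.0)   # -> 40.0
ok = meets_ultra_small_targets(100, 0.5, 5.0)
```

With these assumed numbers, the resulting value of 40 lands in the tens-to-four-hundred band reported for prior-art accelerators.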
As edge devices find their way into applications such as monitoring natural hazards with UAVs, detecting radiation leakage after nuclear disasters with robots, and remote sensing in space with satellites, these application fields are more critical than usual. Many research articles have targeted fault tolerance for edge AI accelerators in recent years, which indicates a trend toward resilience in terms of reliability and applicability in high-radiation fields. Finally, we illustrated the current status of edge AI applications with a future vision, addressing the technologies involved in the future design of edge AI accelerators and their challenges.
Some works, such as [29], use analog components and memristors to mimic neurons for CNN computing. However, none of the commercially proposed systems uses memristors. Several developers have researched memristor technology, including HP, Knowm, Inc. (Santa Fe, NM, USA), Crossbar, SK Hynix, HRL Laboratories, and Rambus.

Figure 1 .
Figure 1. Power consumption and operations per area statistics.

Figure 2 .
Figure 2. (a) Three key features statistics (accelerators under 10 mm²) and (b) the statistics normalized to the same grade of GOPs.

Figure 3 .
Figure 3. The scenario of AI applications on hardware platforms. (Figure legends can be found in Figure 4.)
The edge device in this circle might be designed to work in the radiation field.

Table 1 .
Pros and cons of CPU, GPU, and Edge AI accelerator.

Table 2 .
Prior Art Edge AI Accelerators.

Table 3 .
Prior Art Edge AI Accelerators.

Table 4 .
Prior Art Edge AI Accelerators.