Survey of Novel Architectures for Energy Efficient High-Performance Mobile Computing Platforms

O’Connor, Owen; Elfouly, Tarek; Alouani, Ali

doi:10.3390/en16166043

Open Access Review


by Owen O’Connor, Tarek Elfouly and Ali Alouani *

Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA

* Author to whom correspondence should be addressed.

Energies 2023, 16(16), 6043; https://doi.org/10.3390/en16166043

Submission received: 26 June 2023 / Revised: 11 August 2023 / Accepted: 16 August 2023 / Published: 18 August 2023

(This article belongs to the Section K: State-of-the-Art Energy Related Technologies)


Abstract:

There are many real-world applications that require high-performance mobile computing systems for onboard, real-time processing of gathered data due to latency, reliability, security, or other application constraints. Unfortunately, most existing high-performance mobile computing systems require a prohibitively high power consumption in the face of the limited power available from the batteries typically used in these applications. For high-performance mobile computing to be practical, alternative hardware designs are needed to increase the computing performance while minimizing the required power consumption. This article surveys the state-of-the-art in high-efficiency, high-performance onboard mobile computing, focusing on the latest developments. It was found that more research is needed to design high-performance mobile computing systems while minimizing the required power consumption to meet the needs of these applications.

Keywords:

power; high-performance mobile computing systems; endpoint computing; onboard computing; embedded computing; energy efficiency; co-processors; acceleration

1. Introduction

In recent years, the capabilities of high-performance computing systems have been growing by leaps and bounds, with individual processors reaching hundreds if not thousands of teraflops [1]. At the cutting edge of performance, Oak Ridge National Lab’s Frontier supercomputer is able to perform more than one quintillion floating point operations per second [2]. This performance comes at a price, both in terms of physical space and the massive power consumption required. While these constraints can be accommodated in workstations and data centers, they are major issues for mobile computing systems.

Current and future voice- and gesture-controlled smart devices, wearable and portable medical monitoring devices, advanced metering infrastructure (AMI) in smart grids and sensors for monitoring power generation such as those outlined in [3], and many other applications requiring local, onboard processing of big data have to use high-performance mobile computing systems. One of the factors limiting the acceptance of smart devices is concern about users’ privacy [4]. To mitigate this concern, the devices must be able to execute complex machine learning algorithms on sensor data without offloading data to a central server. For many medical and smart grid applications, large amounts of sensor data must be acted on quickly and reliably in order to make a timely diagnosis or issue a warning about impending medical or stability issues, respectively.

Current mobile computing systems with insufficient onboard computational performance will often offload sensor data to remote servers for processing, but this approach has several disadvantages. First, it requires the system to maintain a constant connection to the remote server in order to transfer data. A constant, high-bandwidth connection comes with its own associated power draw [5]. The data transfer may also introduce unwanted latency, especially for applications with time-sensitive outcomes such as medical or AMI devices. Additionally, data transfer may not be possible in remote areas where access to communication is poor or non-existent. Finally, concerns have been raised about user privacy when data is transferred to a remote system [4]. Data collected by sensors often includes sensitive personal information from the users, such as biomedical data. This data must be protected before it is transmitted from the system, adding to the overall complexity of the design. Research has been conducted regarding better ways of protecting user privacy, such as encryption or anonymization. However, the most reliable method of protecting sensitive data would be to never transfer it off the device if it is possible to process it locally.

Furthermore, offloading data to central servers for processing introduces issues on the server side as well. In order to process all the incoming data in a reasonable amount of time, expensive, high-bandwidth data centers are needed [6]. In order to alleviate these issues somewhat while keeping the computational load off the mobile computing system, an alternative technique called edge computing has been proposed. In edge computing, data processing is offloaded to one of many small, localized servers instead of a large centralized data center [7]. In this way, the computational performance and bandwidth required for each of the edge computing servers are greatly reduced, decreasing the overall cost and complexity of the system. This technique has gained traction in recent years, with much research being conducted to find the most efficient way to distribute a workload across an edge computing environment and different network topologies to yield the fastest and most reliable results [8,9,10]. While edge computing does offer several advantages over the more traditional, centralized data processing, it still does not fully address the concerns outlined in the previous paragraph. Edge computing techniques still require a connection to a server to offload data processing to, and although the connections tend to have a much lower latency [11], this still introduces reliability and security concerns.

In order for these issues to be fully remedied, true onboard or “endpoint” computing is needed. Unfortunately, due to the limited computational performance of the mobile computing systems currently available and the complexity of the tasks they would have to perform, this would be next to impossible [7]. As such, new and much more efficient mobile computing hardware is required. A brief overview of the existing mobile computing hardware and its limitations will be provided in Section 2. Section 3 then provides a review of research being conducted into alternative hardware designs that can potentially provide the additional efficiency required for onboard, real-time processing of data in mobile computing systems.

2. Existing Solutions

There are several different approaches to accelerating mathematical operations in conventional, commercially available mobile computing systems. At the most basic level, there are the floating point units (FPUs) seen in architectures such as the Cortex-M4 and Cortex-M7 [12,13]. These FPUs allow the processors to perform more floating point operations per second compared to similar processors without FPUs, but they are still very limited. The style of FPU used in these architectures is typically only capable of performing a single floating point operation at a time, limiting the overall throughput. This limited throughput, combined with the non-negligible power consumption of the FPU, leads to a mediocre computational efficiency of the overall system, typically less than one billion operations per second per Watt (1 GOPS/W) [14]. In applications where a large amount of computation needs to be performed every second or where high efficiency is required, some alternative solution is needed.

Two of the most direct approaches used to increase the computational performance of mobile computing systems are techniques called single instruction, multiple data (SIMD) and the closely related single instruction, multiple thread (SIMT). With these techniques, the same operation is performed simultaneously on multiple data points, greatly increasing the overall throughput of the processor without an unreasonable increase in hardware cost. This parallel processing approach makes it very easy to scale up an architecture, allowing for extremely high throughput at the cost of increased power consumption. This scalability can be seen in NVIDIA’s Jetson Orin line of high-performance mobile computing systems. At the lowest end of their product stack is a system capable of performing 40 trillion operations per second (TOPS) with a maximum power consumption of 15 W [15]. Their flagship system, on the other hand, has been scaled up to be capable of performing 275 TOPS with a maximum power consumption of 60 W [1]. The parallel nature of these processors also grants them a higher efficiency than the simpler FPUs, with modern processors approaching an efficiency of around 5 TOPS/W [1]. The high efficiency of SIMD/SIMT architectures has been optimized even further without sacrificing computational power through even more highly specialized processors. By focusing on a very small set of operations that the processor can perform, designers are able to greatly optimize both the physical size and power consumption of the mobile computing system. One such example is AMD’s Versal AI Edge processors [16]. Because these processors are designed specifically to accelerate AI operations and nothing else, they need a much lower data precision and as a result a much smaller ALU with a much lower power consumption. These optimizations give specialized processors a significant increase in efficiency when compared to contemporary, non-specialized processors [17].
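As a quick check on the efficiency figures above, the short sketch below divides the quoted peak throughput by the quoted maximum power draw. The function name `tops_per_watt` is an illustrative choice, and because both inputs are maximum ratings, the results are only rough peak-efficiency estimates rather than measured values.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Return computational efficiency in TOPS/W for a given power draw."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return tops / watts


# Throughput and maximum power figures quoted in the text for the
# Jetson Orin line [1,15].
entry_level = tops_per_watt(40, 15)   # roughly 2.7 TOPS/W
flagship = tops_per_watt(275, 60)     # roughly 4.6 TOPS/W

print(f"entry level: {entry_level:.2f} TOPS/W")
print(f"flagship:    {flagship:.2f} TOPS/W")
```

The flagship figure is consistent with the "around 5 TOPS/W" stated in the text.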

At the most extreme end of application specialization are field programmable gate arrays (FPGAs) and application specific integrated circuits (ASICs). ASICs are designed specifically to perform one task and one task only, often eschewing a traditional Harvard or von Neumann architecture for one purpose-built for the application. As a result, ASICs are capable of delivering extremely fast, extremely efficient data processing but require huge amounts of investment into research and development to implement. Because of this, ASICs are usually confined to large-volume applications, such as networking and communication equipment where the additional development costs can be spread across thousands or even millions of products [18,19]. For smaller-volume applications, commercially available FPGAs can be used to implement custom logic and custom architecture without the need for designing a complete chip from scratch. However, FPGAs still present additional design complexities over a conventional processor and offer lower overall efficiencies and speeds than ASICs.

Conventional portable computing accelerators capable of onboard processing of large amounts of data often have extremely high power requirements [1,16]. As a result of these high power requirements, large, heavy, and expensive batteries are required for the system to have any reasonable battery life. Similarly, if the requirements for the device demand a small form factor and long battery life, the amount of data processing that the device is capable of will be limited. In the applications discussed above where large amounts of onboard data processing are required while maintaining low power consumption, these trade-offs are not possible. As a result, new and more efficient hardware designs for high-performance mobile computing systems are required for these applications. Section 3 discusses research being conducted into alternative hardware designs for high-performance mobile computing systems in more detail.

3. Literature Survey

To enable real-time onboard processing of data, new architectural paradigms are needed. In 1961, Landauer established a theoretical upper bound on the efficiency of a non-reversible computing system using thermodynamic analysis [20]. While it is unlikely that a computing system will reach this perfect efficiency anytime in the near future, the conventional approaches in use are still multiple orders of magnitude away from the limit. According to Landauer’s work, the theoretical minimum energy that can be consumed when a bit is flipped is $2.6 \times 10^{-21}$ J [20], but even current state-of-the-art computing systems require something on the order of $1.0 \times 10^{-13}$ J [21].
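The size of this gap can be estimated with a few lines of arithmetic. The sketch below uses the usual statement of Landauer's bound, k_B T ln 2 per bit, which is a standard result rather than a formula given in this article. The temperature of 273.15 K is an assumption, chosen only because it roughly reproduces the quoted figure; the function names are illustrative.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy per irreversible bit operation: k_B * T * ln(2)."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def orders_of_magnitude_gap(actual_joules: float, limit_joules: float) -> float:
    """Base-10 logarithm of the ratio between actual and limiting energy."""
    return math.log10(actual_joules / limit_joules)


# Assumed temperature (0 degrees Celsius); not stated in the text.
limit_at_0c = landauer_limit_joules(273.15)

# Gap between the two energies quoted in the text [20,21].
gap = orders_of_magnitude_gap(1.0e-13, 2.6e-21)

print(f"Landauer limit at 273.15 K: {limit_at_0c:.2e} J")
print(f"Gap: about {gap:.1f} orders of magnitude")
```

Under these assumptions the gap is between seven and eight orders of magnitude.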

Researchers have identified several inefficiencies in the architectural design of a computing system, in the way data is stored and processed, and in the technology used to implement the system. These inefficiencies include complications from modeling a physical system in binary logic [22], memory bottlenecks in the von Neumann architecture [23], wasted energy dissipated within logic gates [24] (pp. 181–185), and inefficiencies arising from the physical properties of transistors themselves [25]. This paper provides a survey of several related high-performance computing systems, focusing on power efficiency while also discussing total computational performance and the architectural design of the system.

Power consumption of a processor is typically divided into two parts: dynamic and static power consumption. Dynamic power is consumed when internal signals are changed, e.g., when data is being processed or transferred, and is primarily dependent on the size and number of transistors, the supply voltage, how frequently the signals change, and the operating frequency of the system [24] (pp. 185–194). In a conventional CMOS circuit, the primary cause of dynamic power consumption is the gate capacitance of the transistors being repeatedly charged from the source voltage and then discharged directly to ground [24] (pp. 181–185). Dynamic power consumption can be further divided into two parts, the power consumed to store and transfer data and the power used to operate on that data. In a general purpose processor, the power required to store and transmit data can be close to or even above 50% of the total power consumed by the processor [26]. Static power, on the other hand, is consumed both when the processor is actively processing data and when it is idle. In a conventional CMOS circuit, static power consumption is primarily caused by internal leakage of the transistors. Static power consumption is primarily dependent on the source voltage, transistor size, and number of transistors [24] (pp. 194–200). As the transistors used in modern processors have gotten progressively smaller, they have allowed more and more current leakage. This increase in leakage current has led to a corresponding increase in wasted energy from static power consumption.
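The dependencies listed above are often summarized by the first-order textbook model P_dyn = α C V² f for dynamic power and P_static = V I_leak for static power. The sketch below encodes that model. The model itself is a common approximation rather than a formula taken from this article, and every numeric value is a hypothetical placeholder.

```python
def dynamic_power_watts(activity_factor: float,
                        switched_capacitance_farads: float,
                        supply_voltage: float,
                        clock_hz: float) -> float:
    """First-order CMOS dynamic power: alpha * C * V^2 * f."""
    return (activity_factor * switched_capacitance_farads
            * supply_voltage ** 2 * clock_hz)


def static_power_watts(supply_voltage: float,
                       leakage_current_amps: float) -> float:
    """First-order static power: supply voltage times total leakage current."""
    return supply_voltage * leakage_current_amps


# Hypothetical chip: 1 nF of total switched capacitance, 10% of it toggling
# each cycle, a 1.0 V supply, a 1 GHz clock, and 20 mA of leakage.
p_dynamic = dynamic_power_watts(0.1, 1e-9, 1.0, 1e9)
p_static = static_power_watts(1.0, 0.02)

print(f"dynamic: {p_dynamic:.3f} W, static: {p_static:.3f} W")
```

In this model, transistor size and count enter through the capacitance and leakage terms, and the rate of signal changes enters through the activity factor and clock frequency.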

Optimization techniques for conventional CMOS designs typically focus on decreasing one of these sources of power consumption. For example, the specialized circuits discussed earlier, such as tensor processing units (TPUs) and ASICs, reduce dynamic power consumption by decreasing the number of transistors required to perform any given task while simultaneously cutting back on internal data transfer [17]. Dynamic power consumption can also be reduced at the cost of computational power by limiting the operational frequency. When the operational frequency of a processor is lowered, the voltage required for the processor to function properly is also reduced. By dynamically varying the operational frequency and source voltage simultaneously, not only are the dynamic and static power consumption reduced, but the computational efficiency is also increased [27]. Idle power consumption can be reduced even further with techniques called power gating and clock gating, where power sources and clock signals are completely shut off from portions of a circuit when they are not being used. However, this does not have any significant effect on efficiency when the processor is under full load, and even shows mixed results for extremely low power circuits [28].
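The benefit of scaling voltage and frequency together can be illustrated with the same first-order model, P_dyn = α C V² f, in which the dynamic energy per clock cycle is α C V² and so depends on voltage but not on frequency. The sketch below compares two hypothetical operating points. The voltages and frequencies are made-up values, and leakage is ignored for simplicity.

```python
def dynamic_power_watts(alpha: float, capacitance: float,
                        voltage: float, frequency_hz: float) -> float:
    """First-order CMOS dynamic power: alpha * C * V^2 * f."""
    return alpha * capacitance * voltage ** 2 * frequency_hz


def dynamic_energy_per_cycle(alpha: float, capacitance: float,
                             voltage: float) -> float:
    """Dynamic energy per clock cycle: power divided by frequency."""
    return alpha * capacitance * voltage ** 2


ALPHA, CAP = 0.1, 1e-9  # hypothetical activity factor and capacitance (F)

# Hypothetical operating points as (volts, hertz).
nominal = (1.0, 1.0e9)
scaled = (0.7, 0.5e9)

power_ratio = (dynamic_power_watts(ALPHA, CAP, *scaled)
               / dynamic_power_watts(ALPHA, CAP, *nominal))
energy_ratio = (dynamic_energy_per_cycle(ALPHA, CAP, scaled[0])
                / dynamic_energy_per_cycle(ALPHA, CAP, nominal[0]))

print(f"power falls to {power_ratio:.3f} of nominal")
print(f"energy per cycle falls to {energy_ratio:.2f} of nominal")
```

Halving the frequency alone would halve power but leave the energy per cycle unchanged; in this model the efficiency gain comes from the accompanying voltage reduction.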

The rest of this section will cover alternative techniques that are being researched to avoid or mitigate these inefficiencies, grouped into general sections based on what basic approach the researchers took.

3.1. Analog Computing

One major limitation of the mobile computing systems currently in use is the need to translate complex physical systems or even simple mathematical operations into discrete binary operations. One alternative that has been proposed to this approach is analog computing, which uses the physical properties of circuit components to create analogous models of physical systems or mathematical equations directly in hardware. In an analog computer, a variable is represented by a continuous value such as voltage, current, or resistance instead of a vector of bits. This approach greatly decreases the bus width required to transfer data and the number of transistors required to process it, but it does come with its own disadvantages. Fabrication inconsistencies in the photolithography process mean that no two analog components are guaranteed to have the same parameters. Due to the nature of analog circuits, these small inconsistencies accumulate and create inaccuracies in the output. Additionally, any time data needs to be translated between the analog and digital domains, digital to analog converters (DACs) or analog to digital converters (ADCs) are needed, increasing circuit complexity and power draw.

Similarly to digital circuits, analog circuits can be designed to be general purpose or optimized for a specific problem. As expected, application-specific analog circuits are often highly efficient and can take up very little die space. One example is presented in [29], where the authors propose a hybrid analog approach for highly efficient detection of voice keywords. Because they were able to operate directly on analog data captured from the microphone, they were also able to avoid expensive conversions between the digital and analog domains. The authors of [30] applied a similar approach to data pre-processing of EEG signals.

One approach that has been taken for more general purpose analog computing is to design an analog computer with configurable parameters to model a family of mathematical systems. Two popular targets, due to their applicability to many real-world problems, are linear algebra systems and partial differential equations (PDEs). The authors of [31] note some limitations of existing analog PDE solvers, primarily the size required, and propose a potential solution that uses an improved, iterative algorithm. Using this algorithm, the authors were able to demonstrate a four-order-of-magnitude improvement in the die size required for a given computational performance. An alternative approach is presented in [32], where the authors choose to focus solely on non-linear PDEs. This restriction meant that the authors were able to design a very simple, parallelizable circuit to find the solution to the PDE at any given point. In [22], the authors designed an analog computer to model and solve linear algebra systems using ordinary differential equations with extremely high efficiency. The authors extrapolated their results to a 600 mm² chip, a similar size to high-end data center accelerators, and predicted that even with such a large die their design would still only draw about 1 W of power.
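To give a feel for how an ordinary differential equation can solve a linear system, the sketch below numerically integrates dx/dt = b - Ax until it settles; at the steady state the derivative is zero, so Ax = b. This particular formulation is an assumption chosen for illustration and is not claimed to be the circuit used in [22]. It converges only when the eigenvalues of A have positive real parts, and the digital Euler integration merely stands in for the continuous settling of an analog circuit.

```python
def settle_linear_system(a, b, dt=0.01, steps=5000):
    """Integrate dx/dt = b - A x with forward Euler and return the final x.

    a is a list of rows (square matrix), b is a list of the same length.
    The time step dt must be small relative to 2 / (largest eigenvalue of A).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        # Residual b - A x drives the state toward the solution.
        residual = [b[i] - sum(a[i][j] * x[j] for j in range(n))
                    for i in range(n)]
        x = [x[i] + dt * residual[i] for i in range(n)]
    return x


# Small symmetric positive-definite example; exact solution is [1/11, 7/11].
A = [[4.0, 1.0],
     [1.0, 3.0]]
B = [1.0, 2.0]

solution = settle_linear_system(A, B)
print(solution)
```

In an analog implementation the integration would happen continuously in circuit elements, so the cost of the loop above would be replaced by the settling time of the hardware.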

Another option for more general purpose analog computing is a field programmable analog array, or FPAA [33]. An FPAA is essentially the analog equivalent of an FPGA, where there are numerous small computation blocks joined together with a network of configurable interconnects. Similar to an FPGA, this configurability gives FPAAs a large amount of flexibility at the cost of an increased die size and lower maximum performance. The analog nature of an FPAA, however, introduces a new set of design complexities. In an FPGA, the configurable interconnects do little more than create a small delay, but in an FPAA they also can have a small effect on the value of the signal being passed. Because of manufacturing inconsistencies, this small effect can vary from one interconnect to another, creating uncertainty in how any given configuration will perform. The authors of [34] propose a modification to the basic design of an FPAA along with a calibration method to minimize the effects of these inconsistencies, leading to a more reliable system overall. FPAAs have also shown good results in digital/analog hybrid architectures, as demonstrated in [35]. The authors of this article integrated an FPAA with a traditional digital microprocessor, allowing them to take advantage of the benefits of both analog and digital computing while mitigating some of the disadvantages of both.

3.2. In-Memory Computing

The von Neumann and Harvard architectures are capable of extremely efficient operations in applications where only small amounts of data need to be processed, but start suffering as more and more data needs to be processed. One of the major factors contributing to this loss of efficiency, known as the von Neumann bottleneck or the “memory wall”, was first recognized multiple decades ago [23]. In the von Neumann architecture, any data that needs to be processed must be transferred from memory to some small number of register files located within the processor core, have the necessary operations performed on it, and then be transferred back to memory. Due to the mismatch between the maximum bandwidth of memory and the number of operations the processor core is capable of performing per second, this setup often leaves the processor core starved for data. The addition of local high-speed cache memory and out-of-order or speculative instruction execution within the processor core have alleviated this problem somewhat, but they both come with their own disadvantages. Cache memory takes up a large amount of die area and is extremely power intensive [26], and various implementations of speculative or out-of-order execution have introduced major hardware-level security vulnerabilities to processors [36]. In addition to limiting overall throughput, this data transfer to and from the processing core is a major contributing factor in the power draw of a von Neumann system [26].

One solution that has been proposed to bypass the von Neumann bottleneck is integrating numerous small computing elements into memory in a technique called in-memory computing, or IMC. These computing elements will operate on the data directly in place in memory, often in parallel with each other, removing the need to transfer data back and forth between the memory and the central processor. In order to save on die space, designs usually opt to implement a very small number of functions on the in-memory circuitry. One popular operation to implement due to its usefulness for data processing, signal processing, and neural networks is the multiply and accumulate operation (MAC).
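The multiply and accumulate operation and its use for in-place processing can be sketched in software as follows. This is only a conceptual model: the function names are illustrative, and the loop over rows stands in for what would be many small MAC units working in parallel inside the memory array, so that only one accumulated result per row has to leave the memory.

```python
def mac(accumulator, a, b):
    """Single multiply and accumulate step: accumulator + a * b."""
    return accumulator + a * b


def in_memory_matvec(memory_rows, input_vector):
    """Compute one dot product per stored row using repeated MAC steps.

    memory_rows models data that stays in place in memory; input_vector
    models an operand broadcast to every row's local MAC unit.
    """
    results = []
    for row in memory_rows:  # conceptually parallel across rows
        acc = 0
        for stored_value, operand in zip(row, input_vector):
            acc = mac(acc, stored_value, operand)
        results.append(acc)
    return results


stored = [[1, 2, 3],
          [4, 5, 6]]
outputs = in_memory_matvec(stored, [1, 0, -1])
print(outputs)
```

The same pattern underlies matrix-vector products in signal processing and neural network layers, which is why it is a popular choice for the limited in-memory circuitry.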

Many optimizations have been made to the basic IMC MAC architecture over the years. In [37], in-memory MAC operations are added to the base layer of a high-bandwidth, 3D-stacked memory solution. This approach greatly increased the number of data points that can be operated on in place at any given time, further reducing the amount of data transfer required. Another optimization more tailor-fit to a specific problem is discussed in [38]. In this paper, the authors apply an optimization technique already common in convolutional neural networks to perform MAC operations on large bins of data simultaneously. This binning approach greatly reduces the total number of operations that need to be performed, leading to a decrease in overall power consumption.

Another method that has been used to further reduce the amount of die space required to perform MAC operations in memory is stochastic computing [39]. In stochastic computing, values are represented by the proportion of high and low bits in a signal instead of the location of the bits as in a standard binary value. This approach leads to a lower maximum precision available for any given bit-width, but has the advantage of being able to perform arithmetic operations directly with simple logic gates. This not only decreases the die size required to perform MAC operations, but also greatly decreases the power draw. Unfortunately, the translation between traditional binary representations of numbers and their stochastic equivalent can draw a lot of power. The authors of [40] worked around this issue by implementing an IMC architecture that stores and operates on stochastic values natively, only converting them to binary when needed by the main processor.
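The claim that arithmetic reduces to simple logic gates can be illustrated with multiplication. In the sketch below, which assumes the common unipolar encoding (a value in [0, 1] is the probability that any given bit is 1) and statistically independent bitstreams, a bitwise AND of two streams yields a stream whose proportion of ones approximates the product. The helper names and stream length are illustrative choices, not details from [39] or [40].

```python
import random


def encode(value, length, rng):
    """Encode a value in [0, 1] as a random bitstream with that share of ones."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def decode(bits):
    """Recover the represented value as the proportion of ones."""
    return sum(bits) / len(bits)


def stochastic_multiply(x_bits, y_bits):
    """Multiply two unipolar stochastic values with one AND gate per bit."""
    return [a & b for a, b in zip(x_bits, y_bits)]


rng = random.Random(0)  # fixed seed so the example is repeatable
x_stream = encode(0.5, 10_000, rng)
y_stream = encode(0.4, 10_000, rng)

product_estimate = decode(stochastic_multiply(x_stream, y_stream))
print(f"estimated 0.5 * 0.4 = {product_estimate:.3f}")
```

The estimate is only approximate, and its accuracy improves slowly with stream length, which reflects the reduced precision for a given bit-width noted above.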

Hybrid designs that combine analog and in-memory computing have also been proposed [41]. Similarly to stochastic computing, using analog computing in memory leads to significant improvements in efficiency and die size, but requires expensive conversion circuits in order to use it in tandem with digital circuitry. To combat this, the authors set up their designs so that multiple analog elements would share a single ADC. This reduced the overall conversion bandwidth but also decreased the die size and power draw required, increasing the overall efficiency of the computing system.

3.3. Adiabatic Computing

In traditional CMOS designs, large amounts of energy are lost by the repeated charging and discharging of the internal capacitance in the transistors [24]. Adiabatic computing, also known as charge recovery computing, improves the overall efficiency of a system by recovering a portion of that charge at the end of every clock cycle. In order to recover the charge, additional transistors are required, increasing the physical size of the circuit. Additionally, there are extra design rules that need to be followed when implementing an adiabatic circuit, complicating the design process. One of the major requirements is that all operations must be reversible, which can introduce otherwise unnecessary signals into a design. Furthermore, even though adiabatic logic can greatly decrease the dynamic power draw of a circuit, static power dissipation is still a major problem that must be considered [42].
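A first-order model commonly used to explain charge recovery compares the energy dissipated per transition in the two styles. Conventional switching of a capacitance C to a voltage V dissipates ½CV², while charging the same capacitance through a resistance R with a slow ramp of duration T dissipates approximately (RC/T)CV², provided T is much longer than RC. These expressions are standard approximations rather than formulas from this article, and the component values below are hypothetical.

```python
def conventional_energy_joules(capacitance, voltage):
    """Energy dissipated per transition in conventional CMOS: C * V^2 / 2."""
    return 0.5 * capacitance * voltage ** 2


def adiabatic_energy_joules(resistance, capacitance, voltage, ramp_time):
    """Approximate dissipation for ramped charging: (RC / T) * C * V^2.

    Only valid when ramp_time is much longer than resistance * capacitance.
    """
    return (resistance * capacitance / ramp_time) * capacitance * voltage ** 2


# Hypothetical node: 1 kOhm path resistance, 10 fF load, 1 V swing.
R, C, V = 1e3, 10e-15, 1.0

e_conv = conventional_energy_joules(C, V)
e_slow = adiabatic_energy_joules(R, C, V, ramp_time=10e-9)
e_fast = adiabatic_energy_joules(R, C, V, ramp_time=2.5e-9)

print(f"saving with 10 ns ramp:  {e_conv / e_slow:.0f}x")
print(f"saving with 2.5 ns ramp: {e_conv / e_fast:.0f}x")
```

In this model the saving is proportional to the ramp time, so the advantage of an adiabatic circuit shrinks as its operating frequency rises.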

Because of these limitations, the majority of research has focused on implementing the components of a processor with the highest power consumption in adiabatic logic. These components can then be used as part of a hybrid design, either with a conventional CMOS processor or with one of the other techniques discussed in this paper, to increase the overall efficiency of the system. The authors of [43] compare the performance and efficiency of four possible implementations of an adiabatic adder. For their tests, the authors simulated a four-bit adder in each of the four adiabatic logic families, and compared the adders based on their physical size in the layout and their power consumption at various frequencies. As noted earlier, memory and data transfer are among the primary sources of power dissipation in a traditional digital computer [26]. Because of this, the authors of [44] chose to demonstrate the feasibility of using charge recovery computing for memory circuits. Their adiabatic implementation of a 5T SRAM cell showed impressive savings in power dissipation, though it did come at the cost of increased write times [44]. Adiabatic logic can also be used to implement other sequential logic elements, as demonstrated in [45]. In this paper, the authors used adiabatic computing to implement data and toggle flip-flops, and then characterized their power consumption in comparison to a D flip-flop (DFF) implemented in conventional CMOS logic. At a frequency of 25 MHz, they were able to demonstrate an impressive efficiency improvement of 164 times that of the conventional CMOS design, though this dropped off to about 7 times when the frequency was increased to 100 MHz.

Adiabatic computing principles have also been used in the design of analog processor components. The authors of [46,47] propose two possible adiabatic implementations of a comparator with promising results. In [46], the authors demonstrate an adiabatic implementation of the strongARM comparator that decreases energy dissipation by 55% when compared to a conventional CMOS implementation. The authors of [47], on the other hand, primarily focus on increasing the maximum operational frequency of an existing adiabatic comparator design, though they are still able to demonstrate significant energy savings. Studies have also shown great success in using adiabatic logic as an interface between the analog and digital domains. As discussed in Section 3.1, converting between the analog and digital representations of a value is a major limiting factor for the efficiency of an analog computer system. The authors of [48] mitigate this conversion cost with a high-speed, high-precision ADC using adiabatic logic with a respectable efficiency of 4.4 fJ per conversion step. The authors of [49], on the other hand, implemented a flash ADC with a three-bit resolution that drew about 90% less power compared to a conventional CMOS design when operated at 1 MHz.

One of the first implementations of a fully adiabatic processor was published in 1995 in [50]. In this thesis the author discusses the various issues that are encountered when designing an adiabatic processor, and where future work could be performed to mitigate these issues and improve the overall design. One of the largest issues presented is dealing with extraneous data required to meet the reversibility requirements of an adiabatic design. This extraneous data must be generated as part of the computations, but is otherwise unneeded for the processor to function and increases the overall complexity of the processor [50].

In recent years, fully adiabatic systems have seen great interest for their potential applications in medical sensors and encryption. Because of their use of charge recovery, adiabatic circuits not only have increased efficiency but also present much more uniform power draw profiles. The authors of [51] take advantage of this second fact to design a high-efficiency cryptographic accelerator that is much more resistant to power-based side-channel attacks than similar designs implemented in conventional logic. The authors of [52] take the use of adiabatic logic one step further, implementing an entire medical device in it. By using only adiabatic logic, the authors were able to power the device entirely off of energy harvested from radio frequency signals, removing the need for a physical power source. The authors note that the ability for the device to be powered wirelessly would be an incredible advantage for implanted medical devices.

As with analog computing, adiabatic computing can be combined with other techniques for even greater improvements in efficiency. One example of such a hybrid design is seen in [53], which combines elements of in-memory, analog, and adiabatic computing into a single design. As is common with other IMC designs, the authors chose to accelerate MAC operations for use in linear algebra operations. One interesting feature of their design is that it can be operated either as a conventional CMOS circuit or as a resonant adiabatic circuit for even greater energy efficiency. The design is capable of performing 640 million MAC operations every second with a computational efficiency of 1.1 TMACS/mW of power consumed when operated in adiabatic mode. This is around three orders of magnitude more efficient than NVIDIA’s new H100 data center accelerator, demonstrating the incredible gains in efficiency that can be achieved with these techniques.

3.4. Alternate Technologies

All of the techniques mentioned in the previous sections use traditional circuit elements such as transistors, resistors, and capacitors. However, research has also been conducted using more exotic circuit components to perform computation, often with some combination of analog or IMC techniques. These approaches typically attempt to solve some limitation of transistor-based systems, such as static power consumption or the high power draw of transistor-based SRAM.

One alternative to transistor-based memory circuits is resistive RAM, also known as ReRAM or RRAM. In traditional memory circuits, logic states are represented by the charge level of a capacitor or the voltage on a feedback network, both of which come with their own complications and sources of power dissipation. RRAM, however, represents logic states with the resistance of circuit components such as memristors or MLC flash cells. These components are usually nonvolatile and do not require a power source to maintain their values, decreasing the overall power draw of the memory subsystem. The authors of [54] discuss the use of MLC-based RRAM in an analog IMC linear algebra accelerator and various techniques that can be used to compensate for manufacturing inconsistencies. One main drawback of MLC-based RRAM is its limited operational life, since the MLC flash used naturally degrades every time a new value is written. Memristor-based RRAM solves this issue, though memristors have the major disadvantage of being a much less mature technology. The authors of [55] use a memristor-based RRAM system to implement an analog linear algebra accelerator. The authors of [56], on the other hand, use RRAM to implement an alternative, highly efficient digital logic family based on the logical implication operation. Several earlier papers had implemented a similar logic family, though they suffered from reliability or speed issues. The authors of [56] solved these issues by adding an additional amplifier to the design.
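Logical implication is a workable basis for a logic family because, together with a constant FALSE, it is functionally complete. The sketch below illustrates this at the Boolean level only; it is not a model of the circuit in [56], and the function names are hypothetical.

```python
from itertools import product


def imply(p: bool, q: bool) -> bool:
    """Material implication: p IMPLY q == (NOT p) OR q."""
    return (not p) or q


def not_(p: bool) -> bool:
    # NOT p == p IMPLY FALSE
    return imply(p, False)


def nand(p: bool, q: bool) -> bool:
    # p NAND q == p IMPLY (q IMPLY FALSE)
    return imply(p, imply(q, False))


# Check the derived gates against Python's built-in operators
# and print the truth table.
for p, q in product([False, True], repeat=2):
    assert not_(p) == (not p)
    assert nand(p, q) == (not (p and q))
    print(p, q, imply(p, q), nand(p, q))
```

Because NAND alone is universal, any combinational function can in principle be composed from implication steps and FALSE resets, at the cost of serializing the computation into several steps.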

Another alternative memory system that has been proposed to improve efficiency is spin transfer torque magnetic RAM (STT-MRAM). STT-MRAM utilizes an effect called tunneling magnetoresistance that occurs when two conductive magnetic bodies are separated by an extremely thin insulating layer. When the two magnetic bodies have aligned polarities, electrons are more likely to tunnel through the insulating layer than when the magnetic fields are opposed [57]. This can be used to store and retrieve data by measuring the minuscule leakage current that tunnels through the insulating layer. In [58], STT-MRAM serves as the data storage for a general-purpose digital IMC device, reducing the power overhead of the memory itself. STT-MRAM can also be used to store continuous analog values. By varying the alignment of the magnetic fields, the amount of leakage current that passes through the insulating layer can be controlled. This technique is used in [59,60] to implement low-power in-memory analog computing circuitry. The authors of [59] use extra circuitry to improve the ratio between the high-state leakage current and the low-state leakage current, allowing for higher efficiency while performing in-memory linear algebra operations. The authors of [60] take advantage of the low voltages required for STT-MRAM to implement an extremely low-power accelerator for neural networks.
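The read mechanism described above can be illustrated with a simple behavioural model. The sketch assumes the commonly used cosine dependence of junction conductance on the angle between the two magnetizations and uses made-up resistance and voltage values; none of the numbers or names are taken from [57,58,59,60].

```python
import math

# Hypothetical junction resistances, chosen only for illustration.
R_PARALLEL = 5_000.0        # ohms, magnetizations aligned (low resistance)
R_ANTIPARALLEL = 10_000.0   # ohms, magnetizations opposed (high resistance)


def tmr_ratio(r_p, r_ap):
    """Tunneling magnetoresistance ratio, (R_AP - R_P) / R_P."""
    return (r_ap - r_p) / r_p


def conductance(theta_rad, r_p=R_PARALLEL, r_ap=R_ANTIPARALLEL):
    """Junction conductance for a relative magnetization angle theta.

    Assumption: cosine interpolation between the parallel (theta = 0)
    and antiparallel (theta = pi) conductances.
    """
    g_p, g_ap = 1.0 / r_p, 1.0 / r_ap
    return 0.5 * (g_p + g_ap) + 0.5 * (g_p - g_ap) * math.cos(theta_rad)


def read_current(theta_rad, v_read=0.1):
    """Current through the junction at a small read voltage (Ohm's law)."""
    return v_read * conductance(theta_rad)


print(f"TMR ratio: {tmr_ratio(R_PARALLEL, R_ANTIPARALLEL):.0%}")
for deg in (0, 90, 180):
    i_ua = read_current(math.radians(deg)) * 1e6
    print(f"theta = {deg:3d} deg -> read current {i_ua:.1f} uA")
```

In this model the two digital states correspond to the endpoints of the angle range, while intermediate angles give the continuous range of currents that analog designs exploit. A larger gap between the endpoint currents makes the states easier to distinguish.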

A completely different fabrication technique is demonstrated in [25], where calculations are performed by mechanical logic gates implemented using micro-electromechanical systems (MEMS). The design is compatible with adiabatic logic, and the authors claim that, since the design is contact-free, it does not suffer from the leakage seen in transistor-based implementations of adiabatic circuits. For simplicity, the authors implemented the design in a relatively large MEMS technology, which affected the overall efficiency, but they point out that it should scale very well to smaller, state-of-the-art MEMS.

4. Discussion and Future Directions

An overview of the strengths and weaknesses of the various optimizations discussed in this survey is given in Table 1, along with the limitations of conventional computer systems that they are intended to solve. A more detailed comparison of the operational efficiency of the various systems discussed in this paper can be seen in Figure 1. In Figure 1, the power consumption of various commercially available mobile computing systems and alternative systems under development is plotted against their respective performance in operations per second. Each approach is represented in Figure 1 with a unique symbol. To aid in the interpretation of Figure 1, guide lines showing computational efficiency are given. It should be noted that the performance metrics for each architecture depend on the hardware implementation and the task being performed. In general, as the performance of a system increases, so does its power consumption. From Figure 1, it can be seen that systems using analog computing techniques offer the highest efficiency [29], while IMC offers the highest computational performance [61]. Meanwhile, adiabatic techniques are currently used mostly in systems with extremely low power draw but low performance [53,62], potentially due to the limitations discussed in Section 3.3. Alternative technologies such as STT-MRAM and memristors are not currently seen to excel in any particular category, but still offer a good balance between power consumption and performance while showing much room for future improvement.

Since the majority of past research and commercial implementations of mobile processors have focused on conventional, transistor-based designs, alternative technologies such as those using memristors or STT-MRAM are relatively immature. While there have been some commercial products utilizing these technologies [63,64], there is still much room for research to improve their suitability for wide-scale deployment. Research into fabrication techniques could improve aspects such as yields, reliability, or electrical characteristics. There is also room for further research into the architectural aspects of these new technologies. Because these alternative technologies differ so much from conventional transistor-based circuits, the design procedures and conventions used in conventional designs will need to be adapted to them. In some cases, it is even possible that the low-level logical operations that processors are built upon can be modified and improved [56].

Although processor designs using just one of the techniques presented above show improvements in efficiency over conventional processors, hybrid designs that employ multiple techniques have shown the highest efficiencies [35,41,46,53,65]. By using multiple techniques, researchers are able to take advantage of the strengths of each individual technique while mitigating their individual weaknesses. These hybrid designs seem to be the most promising field of future research for further increases in processing efficiency. Some combinations of techniques in particular, such as using adiabatic charge recovery in an analog circuit [66], have seen relatively little research up to this point but are showing very promising initial results. It is also possible that hybrid designs could be created using new, innovative techniques or more conventional techniques that were not discussed in this survey, such as approximate computing [67]. Finally, there are several exotic alternative techniques, such as quantum computing [68,69,70] or adiabatic computing using superconducting Josephson junctions [71,72,73], that have shown exceptionally high computing performance and efficiencies. Unfortunately, these techniques currently require cryogenic cooling, making them impractical for use in high-performance mobile computing systems. If these cooling requirements can be removed, these techniques could become very attractive for future high-performance mobile computing systems, but much research is still needed to reach that goal.

Table 1. Comparison of different optimization techniques.

Technique	Conventional	Analog	IMC	Adiabatic	Alternative Technologies
Optimization Method	N/A	Computational model	Bottleneck mitigation	Energy recovery	Hardware implementation
Operational Frequency	50 MHz [74]–1.3 GHz [1]	10 kHz [29]–690 MHz [75]	10 MHz [40]–2.2 GHz [76]	20 kHz [53]–1 GHz [77]	2.5 kHz [25]–3.3 GHz [78]
Computational Efficiency	560 MOPS/W [14]–5.4 TOPS/W [16]	21 GOPS/W [31]–410 POPS/W [29]	46 GOPS/W [79]–1.6 POPS/W [40]	150 TOPS/W [62]–110 POPS/W [53]	25 TOPS/W [80]–710 TOPS/W [81]
Typical Use Cases	General purpose	Matrix operations and PDEs	Matrix operations	General purpose	General purpose
Primary Limitations	Power consumption, limited efficiency	Reduced precision, manufacturing inconsistencies	Increased memory size, requires SIMD friendly problems	Increased design complexity, extraneous signals, additional design constraints	Immature technologies, reliability
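To make the efficiency ranges in Table 1 easier to compare, the sketch below converts the upper bound of each range to operations per second per watt and expresses it relative to the best conventional figure. The values are copied from Table 1; the dictionary layout and helper names are illustrative only, and the comparison ignores the differences in task, precision, and implementation noted in the text.

```python
import math

SI = {"M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15}


def ops_per_watt(value, prefix):
    """Convert e.g. (5.4, "T") for 5.4 TOPS/W into OPS/W."""
    return value * SI[prefix]


# Upper bound of the computational-efficiency range for each column of Table 1.
BEST_EFFICIENCY = {
    "Conventional": ops_per_watt(5.4, "T"),              # [16]
    "Analog": ops_per_watt(410, "P"),                    # [29]
    "IMC": ops_per_watt(1.6, "P"),                       # [40]
    "Adiabatic": ops_per_watt(110, "P"),                 # [53]
    "Alternative Technologies": ops_per_watt(710, "T"),  # [81]
}

baseline = BEST_EFFICIENCY["Conventional"]
for technique, eff in BEST_EFFICIENCY.items():
    orders = math.log10(eff / baseline)
    print(f"{technique:<25} {eff:9.2e} OPS/W  ({orders:+.1f} orders of magnitude)")
```

Working in base-10 logarithms mirrors the way Figure 1 is read: each guide line of constant efficiency differs from its neighbour by a fixed ratio, so the gap between two systems is naturally expressed in orders of magnitude.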

Figure 1. Comparison of system efficiencies (Conventional [1,14,15,16,82], Adiabatic [53,62], Alternative [55,81,83,84], Analog [31,75,85,86,87,88,89], IMC [40,61,79,90,91]).

5. Conclusions

HPC systems have become increasingly powerful in recent years, but the immense power draw required by conventional HPC hardware makes them impractical for use in high-performance mobile computing systems. While it is possible to offload the data from the mobile computing system to a central server for processing, many applications would benefit from having energy-efficient, onboard data processing capabilities. For these applications, new hardware designs are needed. Several hardware designs are presented in this survey, each of which aims to solve at least one issue with conventional, transistor-based von Neumann systems. Some of them, such as in-memory or analog computing, target inefficiencies in the architecture itself to eliminate processing bottlenecks. Adiabatic computing and the alternative technologies, on the other hand, seek to decrease the power wasted by the circuit components themselves, increasing the overall efficiency of the system. While each of these approaches showed improvements in efficiency when used individually, papers that combined multiple techniques in a single system showed the biggest improvements. These hybrid architectures present a promising field for future research.

Author Contributions

Conceptualization, O.O., T.E. and A.A.; methodology, O.O., T.E. and A.A.; investigation, O.O.; writing—original draft preparation, O.O.; writing—review and editing, O.O., T.E. and A.A.; project administration, T.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Karumbunathan, L.S. NVIDIA Jetson AGX Orin Series; Technical Report 10749-001; Nvidia: Santa Clara, CA, USA, 2022.
2. ORNL’s Frontier First to Break the Exaflop Ceiling. 2021. Available online: https://www.top500.org/news/ornls-frontier-first-to-break-the-exaflop-ceiling/ (accessed on 25 June 2023).
3. Fortuna, L.; Buscarino, A. Sustainable Energy Systems. Energies 2022, 15, 9227.
4. Yang, H.; Lee, H.; Zo, H. User acceptance of smart home services: An extension of the theory of planned behavior. Ind. Manag. Data Syst. 2017, 117, 68–89.
5. Vilar, B.M.J.C.; Luiz, S.O.D.; Perkusich, A.; Santos, D.R. Dynamic power management for network interfaces. In Proceedings of the 2015 IEEE 5th International Conference on Consumer Electronics—Berlin (ICCE-Berlin), Berlin, Germany, 6–9 September 2015; pp. 383–387.
6. Shi, W.; Dustdar, S. The Promise of Edge Computing. Computer 2016, 49, 78–81.
7. Qin, M.; Chen, L.; Zhao, N.; Chen, Y.; Yu, F.R.; Wei, G. Power-Constrained Edge Computing with Maximum Processing Capacity for IoT Networks. IEEE Internet Things J. 2019, 6, 4330–4343.
8. Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge Computing with Artificial Intelligence: A Machine Learning Perspective. ACM Comput. Surv. 2023, 55, 1–35.
9. Wang, X.; Han, Y.; Leung, V.C.M.; Niyato, D.; Yan, X.; Chen, X. Convergence of Edge Computing and Deep Learning: A Comprehensive Survey. IEEE Commun. Surv. Tutorials 2020, 22, 869–904.
10. Campolo, C.; Iera, A.; Molinaro, A. Network for Distributed Intelligence: A Survey and Future Perspectives. IEEE Access 2023, 11, 52840–52861.
11. Zhou, F.; Chai, Y. Near-sensor and in-sensor computing. Nat. Electron. 2020, 3, 664–671.
12. ARM. Cortex-M4 Devices Generic User Guide; Technical Report dui0553; ARM: Cambridge, UK, 2011.
13. ARM. Arm Cortex-M7 Devices Generic User Guide; Technical Report dui0646; ARM: Cambridge, UK, 2018.
14. STMicroelectronics. STM32F427xx STM32F429xx Datasheet—Production Data; Technical Report 024030; STMicroelectronics: Geneva, Switzerland, 2019.
15. Karumbunathan, L.S. Solving Entry-Level Edge AI Challenges with NVIDIA Jetson Orin Nano. 2022. Available online: https://developer.nvidia.com/blog/solving-entry-level-edge-ai-challenges-with-nvidia-jetson-orin-nano/ (accessed on 25 June 2023).
16. Xilinx. ACAP at the Edge with the Versal AI Edge Series; Technical Report WP518; Xilinx: San Jose, CA, USA, 2021.
17. Fuchs, A.; Wentzlaff, D. The Accelerator Wall: Limits of Chip Specialization. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 1–14.
18. Le, S.T.; Drenski, T.; Hills, A.; King, M.; Kim, K.; Matsui, Y.; Sizer, T. 100 Gbps DMT ASIC for Hybrid LTE-5G Mobile Fronthaul Networks. J. Light. Technol. 2021, 39, 801–812.
19. Corrêa, M.; Neto, L.; Palomino, D.; Corrêa, G.; Agostini, L. ASIC Solution for the Directional Intra Prediction of the AV1 Encoder Targeting UHD 4K Videos. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5.
20. Landauer, R. Irreversibility and Heat Generation in the Computing Process. IBM J. Res. Dev. 1961, 5, 183–191.
21. Jeon, W.; Park, J.H.; Kim, Y.; Koo, G.; Ro, W.W. Hi-End: Hierarchical, Endurance-Aware STT-MRAM-Based Register File for Energy-Efficient GPUs. IEEE Access 2020, 8, 127768–127780.
22. Huang, Y.; Guo, N.; Seok, M.; Tsividis, Y.; Sethumadhavan, S. Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study. IEEE Micro 2017, 37, 30–38.
23. Kogge, P.M. Function-based computing and parallelism: A review. Parallel Comput. 1985, 2, 243–253.
24. Weste, N.; Harris, D. CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed.; Addison Wesley: Boston, MA, USA, 2011.
25. Perrin, Y.; Galisultanov, A.; Hutin, L.; Basset, P.; Fanet, H.; Pillonnet, G. Contact-Free MEMS Devices for Reliable and Low-Power Logic Operations. IEEE Trans. Electron Devices 2021, 68, 2938–2943.
26. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
27. Simevski, A.; Schrape, O.; Benito, C. Comparative Analyses of Low-Power IC Design Techniques based on Chip Measurements. In Proceedings of the 2018 16th Biennial Baltic Electronics Conference (BEC), Tallinn, Estonia, 8–10 October 2018; pp. 1–6.
28. Bartík, M. External Power Gating Technique—An Inappropriate Solution for Low Power Devices. In Proceedings of the 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 4–7 November 2020; pp. 0241–0245.
29. Seok, M.; Yang, M.; Jiang, Z.; Lazar, A.A.; Seo, J.S. Cases for Analog Mixed Signal Computing Integrated Circuits for Deep Neural Networks. In Proceedings of the 2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu, Taiwan, 22–25 April 2019; pp. 1–2.
30. Rodriguez-Perez, A.; Ruiz-Amaya, J.; Delgado-Restituto, M.; Rodriguez-Vazquez, A. A Low-Power Programmable Neural Spike Detection Channel with Embedded Calibration and Data Compression. IEEE Trans. Biomed. Circuits Syst. 2012, 6, 87–100.
31. Chen, T.; Botimer, J.; Chou, T.; Zhang, Z. A 1.87-mm² 56.9-GOPS Accelerator for Solving Partial Differential Equations. IEEE J. Solid-State Circuits 2020, 55, 1709–1718.
32. Malavipathirana, H.; Hariharan, S.I.; Udayanga, N.; Mandal, S.; Madanayake, A. A Fast and Fully Parallel Analog CMOS Solver for Nonlinear PDEs. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 3363–3376.
33. Hasler, J. Large-Scale Field-Programmable Analog Arrays. Proc. IEEE 2020, 108, 1283–1302.
34. Kim, S.; Shah, S.; Hasler, J. Calibration of Floating-Gate SoC FPAA System. IEEE Trans. Very Large Scale Integr. Syst. 2017, 25, 2649–2657.
35. Schlottmann, C.R.; Shapero, S.; Nease, S.; Hasler, P. A Digitally Enhanced Dynamically Reconfigurable Analog Platform for Low-Power Signal Processing. IEEE J. Solid-State Circuits 2012, 47, 2174–2184.
36. Johnson, A.; Davies, R. Speculative Execution Attack Methodologies (SEAM): An overview and component modelling of Spectre, Meltdown and Foreshadow attack methods. In Proceedings of the 2019 7th International Symposium on Digital Forensics and Security (ISDFS), Barcelos, Portugal, 10–12 June 2019; pp. 1–6.
37. Jeon, D.I.; Park, K.B.; Chung, K.S. HMC-MAC: Processing-in Memory Architecture for Multiply-Accumulate Operations with Hybrid Memory Cube. IEEE Comput. Archit. Lett. 2018, 17, 5–8.
38. Garland, J.; Gregg, D. Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks. IEEE Comput. Archit. Lett. 2017, 16, 132–135.
39. Shanbhag, N.R.; Abdallah, R.A.; Kumar, R.; Jones, D.L. Stochastic computation. In Proceedings of the Design Automation Conference, Anaheim, CA, USA, 13–18 June 2010; pp. 859–864.
40. Zhang, X.; Wang, Y.; Zhang, Y.; Song, J.; Zhang, Z.; Cheng, K.; Wang, R.; Huang, R. Memory System Designed for Multiply-Accumulate (MAC) Engine Based on Stochastic Computing. In Proceedings of the 2019 International Conference on IC Design and Technology (ICICDT), Suzhou, China, 17–19 June 2019; pp. 1–4.
41. Chung, S.; Wang, J. Tightly Coupled Machine Learning Coprocessor Architecture with Analog In-Memory Computing for Instruction-Level Acceleration. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 544–561.
42. Sarma, T.; Parikh, C.D. Effect of Leakage Currents in Adiabatic Logic Circuits at Lower Technology Nodes. In Proceedings of the 2019 IEEE Conference on Modeling of Systems Circuits and Devices (MOS-AK India), Hyderabad, India, 25–27 February 2019; pp. 79–81.
43. Srilakshmi, K.; Tilak, A.V.N.; Rao, K.S. Performance of FinFET based adiabatic logic circuits. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 22–25 November 2016; pp. 2377–2382.
44. Samson, M.; Mandavalli, S. Adiabatic 5T SRAM. In Proceedings of the 2011 International Symposium on Electronic System Design, Kochi, India, 19–21 December 2011; pp. 267–272.
45. Samanta, S. Sequential adiabatic logic for ultra low power applications. In Proceedings of the 2017 Devices for Integrated Circuit (DevIC), Kalyani, India, 23–24 March 2017; pp. 821–824.
46. Filippini, L.; Taskin, B. The Adiabatically Driven StrongARM Comparator. IEEE Trans. Circuits Syst. II Express Briefs 2019, 66, 1957–1961.
47. Filippini, L.; Taskin, B. A 900 MHz Charge Recovery Comparator with 40 fJ per Conversion. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5.
48. Van Elzakker, M.; van Tuijl, E.; Geraedts, P.; Schinkel, D.; Klumperink, E.A.M.; Nauta, B. A 10-bit Charge-Redistribution ADC Consuming 1.9 μW at 1 MS/s. IEEE J. Solid-State Circuits 2010, 45, 1007–1015.
49. Moyal, V.; Tripathi, N. Adiabatic Threshold Inverter Quantizer for a 3-bit Flash ADC. In Proceedings of the 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 23–25 March 2016; pp. 1543–1546.
50. Vieri, C.J. Pendulum: A Reversible Computer Architecture. Master’s Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995.
51. Degada, A.; Thapliyal, H. Single-Rail Adiabatic Logic for Energy-Efficient and CPA-Resistant Cryptographic Circuit in Low-Frequency Medical Devices. IEEE Open J. Nanotechnol. 2022, 3, 1–14.
52. Dhananjay, K.; Salman, E. SEAL-RF: SEcure Adiabatic Logic for Wirelessly-Powered IoT Devices. IEEE Internet Things J. 2022, 10, 1112–1123.
53. Karakiewicz, R.; Genov, R.; Cauwenberghs, G. 1.1 TMACS/mW Fine-Grained Stochastic Resonant Charge-Recycling Array Processor. IEEE Sens. J. 2012, 12, 785–792.
54. Milo, V.; Glukhov, A.; Pérez, E.; Zambelli, C.; Lepri, N.; Mahadevaiah, M.K.; Quesada, E.P.B.; Olivo, P.; Wenger, C.; Ielmini, D. Accurate Program/Verify Schemes of Resistive Switching Memory (RRAM) for In-Memory Neural Network Circuits. IEEE Trans. Electron Devices 2021, 68, 3832–3837.
55. Bavandpour, M.; Mahmoodi, M.R.; Strukov, D.B. aCortex: An Energy-Efficient Multipurpose Mixed-Signal Inference Accelerator. IEEE J. Explor. Solid-State Comput. Devices Circuits 2020, 6, 98–106.
56. Zanotti, T.; Puglisi, F.M.; Pavan, P. Smart Logic-in-Memory Architecture for Low-Power Non-Von Neumann Computing. IEEE J. Electron Devices Soc. 2020, 8, 757–764.
57. Julliere, M. Tunneling between ferromagnetic films. Phys. Lett. A 1975, 54, 225–226.
58. Jain, S.; Ranjan, A.; Roy, K.; Raghunathan, A. Computing in Memory with Spin-Transfer Torque Magnetic RAM. IEEE Trans. Very Large Scale Integr. Syst. 2018, 26, 470–483.
59. Cai, H.; Guo, Y.; Liu, B.; Zhou, M.; Chen, J.; Liu, X.; Yang, J. Proposal of Analog In-Memory Computing with Magnified Tunnel Magnetoresistance Ratio and Universal STT-MRAM Cell. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 1519–1531.
60. Fan, D.; Shim, Y.; Raghunathan, A.; Roy, K. STT-SNN: A Spin-Transfer-Torque Based Soft-Limiting Non-Linear Neuron for Low-Power Artificial Neural Networks. IEEE Trans. Nanotechnol. 2015, 14, 1013–1023.
61. Snelgrove, M.; Beachler, R. speedAI240: A 2-Petaflop, 30-Teraflops/W At-Memory Inference Acceleration Device with 1456 RISC-V Cores. IEEE Micro 2023, 43, 58–63.
62. Sanni, K.; Andreou, A. A Mixed-Signal Successive Approximation Architecture for Energy-Efficient Fixed-Point Arithmetic in 16 nm FinFET. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–5.
63. Everspin Technologies, Inc. Everspin Announces Commercial Availability of Its EMxxLX STT-MRAM Devices. 2022. Available online: https://www.everspin.com/news/everspin-announces-commercial-availability-its-emxxlx-stt-mram-devices (accessed on 25 June 2023).
64. Choe, J. Recent Technology Insights on STT-MRAM: Structure, Materials, and Process Integration. In Proceedings of the 2023 IEEE International Memory Workshop (IMW), Monterey, CA, USA, 21–24 May 2023; pp. 1–4.
65. Zhang, H.; Liu, J.; Bai, J.; Li, S.; Luo, L.; Wei, S.; Wu, J.; Kang, W. HD-CIM: Hybrid-Device Computing-In-Memory Structure Based on MRAM and SRAM to Reduce Weight Loading Energy of Neural Networks. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4465–4474.
66. Filippini, L.; Khuon, L.; Taskin, B. Charge recovery implementation of an analog comparator: Initial results. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1505–1508.
67. Jiang, H.; Santiago, F.J.H.; Mo, H.; Liu, L.; Han, J. Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications. Proc. IEEE 2020, 108, 2108–2135.
68. Kuppusamy, P.; Yaswanth Kumar, N.; Dontireddy, J.; Iwendi, C. Quantum Computing and Quantum Machine Learning Classification—A Survey. In Proceedings of the 2022 IEEE 4th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA), Goa, India, 8–9 October 2022; pp. 200–204.
69. Yang, Z.; Zolanvari, M.; Jain, R. A Survey of Important Issues in Quantum Computing and Communications. IEEE Commun. Surv. Tutorials 2023, 25, 1059–1094.
70. Upama, P.B.; Faruk, M.J.H.; Nazim, M.; Masum, M.; Shahriar, H.; Uddin, G.; Barzanjeh, S.; Ahamed, S.I.; Rahman, A. Evolution of Quantum Computing: A Systematic Survey on the Use of Quantum Computing Tools. In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; pp. 520–529.
71. Ayala, C.L.; Tanaka, T.; Saito, R.; Nozoe, M.; Takeuchi, N.; Yoshikawa, N. MANA: A Monolithic Adiabatic iNtegration Architecture Microprocessor Using 1.4-zJ/op Unshunted Superconductor Josephson Junction Devices. IEEE J. Solid-State Circuits 2021, 56, 1152–1165.
72. Yamauchi, T.; San, H.; Yoshikawa, N.; Chen, O. A Study on the Efficient Design of Adders Using Adiabatic Quantum-Flux-Parametron Circuits. In Proceedings of the 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 18–21 October 2022; pp. 114–116.
73. Takahashi, D.; Takeuchi, N.; Yamanashi, Y.; Yoshikawa, N. Design and Demonstration of a Superconducting Field-Programmable Gate Array Using Adiabatic Quantum-Flux-Parametron Logic and Memory. IEEE Trans. Appl. Supercond. 2022, 32, 1–7.
74. Chou, C.C.; Chen, T.Y.; Fang, W.C. FPGA implementation of EEG system-on-chip with automatic artifacts removal based on BSS-CCA method. In Proceedings of the 2016 IEEE Biomedical Circuits and Systems Conference (BioCAS), Shanghai, China, 17–19 October 2016; pp. 224–227.
75. Zhang, B.; Saikia, J.; Meng, J.; Wang, D.; Kwon, S.; Myung, S.; Kim, H.; Kim, S.J.; Seo, J.s.; Seok, M. A 177 TOPS/W, Capacitor-based In-Memory Computing SRAM Macro with Stepwise-Charging/Discharging DACs and Sparsity-Optimized Bitcells for 4-Bit Deep Convolutional Neural Networks. In Proceedings of the 2022 IEEE Custom Integrated Circuits Conference (CICC), Newport Beach, CA, USA, 24–27 April 2022; pp. 1–2.
76. Simon, W.A.; Qureshi, Y.M.; Rios, M.; Levisse, A.; Zapater, M.; Atienza, D. BLADE: An in-Cache Computing Architecture for Edge Devices. IEEE Trans. Comput. 2020, 69, 1349–1363.
77. Sathe, V.S.; Chueh, J.Y.; Papaefthymiou, M.C. Energy-Efficient GHz-Class Charge-Recovery Logic. IEEE J. Solid-State Circuits 2007, 42, 38–47.
78. Wu, B.; Zhu, H.; Reis, D.; Wang, Z.; Wang, Y.; Chen, K.; Liu, W.; Lombardi, F.; Hu, X.S. An Energy-Efficient Computing-in-Memory (CiM) Scheme Using Field-Free Spin-Orbit Torque (SOT) Magnetic RAMs. IEEE Trans. Emerg. Top. Comput. 2023, 11, 331–342.
79. Villemur, M.; Tognetti, G.; Julian, P. Memory based computation core for nonlinear neural operations. In Proceedings of the 2019 Argentine Conference on Electronics (CAE), Mar del Plata, Argentina, 14–15 March 2019; pp. 98–102.
80. Lu, L.; Mani, A.; Do, A.T. A 129.83 TOPS/W Area Efficient Digital SOT/STT MRAM-Based Computing-In-Memory for Advanced Edge AI Chips. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5.
81. Xie, W.; Sang, H.; Kwon, B.; Im, D.; Kim, S.; Kim, S.; Yoo, H.J. A 709.3 TOPS/W Event-Driven Smart Vision SoC with High-Linearity and Reconfigurable MRAM PIM. In Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 11–16 June 2023; pp. 1–2.
82. Cavalcante, M.; Schuiki, F.; Zaruba, F.; Schaffner, M.; Benini, L. Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multiprecision Floating-Point Support in 22-nm FD-SOI. IEEE Trans. Very Large Scale Integr. Syst. 2020, 28, 530–543.
83. Han, L.; Huang, P.; Wang, Y.; Zhou, Z.; Zhang, Y.; Liu, X.; Kang, J. Efficient Discrete Temporal Coding Spike-Driven In-Memory Computing Macro for Deep Neural Network Based on Nonvolatile Memory. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4487–4498.
84. Chiu, Y.C.; Khwa, W.S.; Li, C.Y.; Hsieh, F.L.; Chien, Y.A.; Lin, G.Y.; Chen, P.J.; Pan, T.H.; You, D.Q.; Chen, F.Y.; et al. A 22 nm 8 Mb STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices. In Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 496–498.
85. Cowan, G.; Melville, R.; Tsividis, Y. A VLSI analog computer/digital computer accelerator. IEEE J. Solid-State Circuits 2006, 41, 42–53.
86. Dong, Q.; Sinangil, M.E.; Erbagci, B.; Sun, D.; Khwa, W.S.; Liao, H.J.; Wang, Y.; Chang, J. 15.3 A 351 TOPS/W and 372.4 GOPS compute-in-memory SRAM macro in 7 nm FINFET CMOS for machine-learning applications. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020.
87. Kneip, A.; Lefebvre, M.; Verecken, J.; Bol, D. IMPACT: A 1-to-4b 813-TOPS/W 22-nm FD-SOI Compute-in-Memory CNN Accelerator Featuring a 4.2-POPS/W 146-TOPS/mm² CIM-SRAM with Multi-Bit Analog Batch-Normalization. IEEE J. Solid-State Circuits 2023, 58, 1871–1884.
88. Kneip, A.; Lefebvre, M.; Verecken, J.; Bol, D. A 1-to-4b 16.8-POPS/W 473-TOPS/mm² 6T-based In-Memory Computing SRAM in 22 nm FD-SOI with Multi-Bit Analog Batch-Normalization. In Proceedings of the ESSCIRC 2022—IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; pp. 157–160.
89. Kim, S.; Kim, S.; Um, S.; Kim, S.; Kim, K.; Yoo, H.J. Neuro-CIM: A 310.4 TOPS/W Neuromorphic Computing-in-Memory Processor with Low WL/BL activity and Digital-Analog Mixed-mode Neuron Firing. In Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Honolulu, HI, USA, 12–17 June 2022; pp. 38–39.
90. Kim, H.; Yoo, T.; Kim, T.T.H.; Kim, B. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE J. Solid-State Circuits 2021, 56, 2221–2233.
91. Zang, Q.; Goh, W.L.; Lu, L.; Yu, C.; Mu, J.; Kim, T.T.H.; Kim, B.; Lit, D.; Dot, A.T. 282-to-607 TOPS/W, 7T-SRAM Based CiM with Reconfigurable Column SAR ADC for Neural Network Processing. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
