Study on the Implementation of a Simple and Effective Memory System for an AI Chip

In this study, a simple and effective memory system required for the implementation of an AI chip is proposed. The use of internal or external memory is essential to implementing an AI chip, because data are read from and written to memory frequently. The memory systems currently in use are large in design size and complex to implement because they must handle high speeds and wide bandwidths. Therefore, depending on the AI application, there are cases where the circuit size of the memory system is larger than that of the AI core. In this study, SDRAM, which has lower performance than the memory systems currently in use but is sufficient for operating AI, was used, and all circuits were implemented digitally for simple and efficient implementation. In particular, a delay controller was designed to reduce errors caused by data skew on the memory bus, ensuring stability in reading and writing data. First, the memory system was verified with the You Only Look Once (YOLO) algorithm on an FPGA to confirm that the proposed memory system works efficiently for AI. Based on the verified memory system, a chip was implemented using Samsung Electronics' 65 nm process and tested. As a result, a simple and efficient memory system for AI chip implementation was designed and verified in hardware.


Introduction
Artificial intelligence (AI) technology has been under development for a long time. Initially, it was approached mathematically based on theory and was developed and implemented in software [1]. Since AI implemented in software uses the existing CPU, GPU, and memory system, hardware dedicated to the AI structure itself was not required [2]. However, as the use of AI gradually increased, not only did the implementation of AI on CPUs and GPUs become necessary, but so did AI-dedicated processors, represented by the neural processing unit (NPU) [3]. Table 1 shows the reasons why AI-dedicated hardware is needed [4]. As shown in Table 1, an AI-dedicated processor can be implemented in less than half the size of a CPU, and in a smaller size than a GPU as well. In terms of power, it can operate at less than half the consumption of a GPU. Finally, it shows superior performance in TOPS, which demonstrates its potential in AI applications.
AI makes inferences and judgments about new data based on the results of learning a lot of data. To learn data, data must be read, and for this, a system that reads and writes memory is required.
As mentioned earlier, when AI runs on an existing CPU or GPU, it can use the memory system connected to the CPU or GPU. Figure 1 shows the CPU, GPU, and AI core using the memory system. As shown in Figure 1, even if a dedicated AI core is developed without using a CPU or GPU, the use of the memory system remains unchanged [5,6]. The most popular memory system at present is DDR4, specified as a standard by the Joint Electron Device Engineering Council (JEDEC). DDR4 operates at a maximum clock of 1600 MHz and 3200 MT/s [7]. In addition, data can be transferred 8 to 32 bits at a time. However, memory systems such as DDR4 require special physical layers, and the use of these physical layers leads to certain problems. First, for high-speed operation, a large number of delay cells is required to match the skew between the PLL, the high-speed IO, and the data bits. Second, operating systems, including the CPU and GPU, must implement data-processing methods for their memory systems. Although the size and speed of an AI design differ depending on the implementation method, the total size of a DDR4 memory system is very likely to occupy a larger area of the chip than the AI core itself. In addition, such a memory system is difficult to implement, which can make it hard to implement the AI chip. To solve these problems, a simple and effective memory system is required [8-10]. In this study, the following approach was taken to satisfy these conditions. First, a simple memory controller was implemented to read and write data in the external memory. The memory controller was designed to be flexible so that its operation could be changed to match the external memory through I2C. Second, the external memory and the chip were connected through GPIO.
Since this is a common IO, no separate design is required, but an algorithm that automatically prevents skew between data bits, even when the internal data are distorted by the external environment (e.g., package, PCB design, and actual temperature and humidity), is useful. This algorithm ensures stable reading and writing of memory. In addition, through a compensation and monitoring circuit, the algorithm was implemented so that the system can be used stably in the external environment where the AI operates. Before fabricating the chip, the AI environment was verified using an FPGA. The proposed memory interface system was implemented using the Samsung foundry 65 nm process and verified by connecting the test board and the external memory to the chip.

Design
The proposed method consists of the blocks shown in Figure 2. First, a delay cell is connected to an external memory through the general purpose input output (GPIO). Each delay cell is connected to a delay control block to adjust the amount of delay. The output of the delay is connected to the memory controller, whose role is to deliver addresses, reads, and writes in a timely manner with respect to memory operations. There is a compensation block under the memory controller that not only adjusts the skew between the pins connected to the memory, but also corrects the data path at system start-up so that data can be exchanged reliably under the external environment. The core, which is the AI processor, uses the necessary data by passing instructions to the memory controller to read and write data from the external memory.

Delay Logic
The delay controller is a block that controls the amount of delay by turning delay cells on and off. A delay controller can be connected to each delay chain to control its delay individually. A duty cycle check (DCC) is connected to the compensation block to ensure that the data are at the center of the clock, so that data can be sent and received reliably. The delay cell used in this study was a primitive cell provided in the process. Therefore, no analog design or layout was required, and place and route (P&R) could be carried out quickly after designing at the register transfer level (RTL). Figure 3 shows a block diagram of one delay chain. The inverters are connected in series, and the output of each inverter is connected to a multiplexer so that the desired delay output can be selected by the control pin. To design RTL using a delay cell, a model must be created by calculating the delay corresponding to the inverter cell provided by the process. Figure 4 shows the simulation code for the inverter logic.
The inverter cell among the primitive cells provided by the process is selected, and its delay value is calculated and entered as the # delay so that the change in delay can be observed in the RTL simulation and the overall operation checked. Figure 5 shows the simulation results, indicating that the delay increases as the delay step is increased using the simulation model. In Figure 5, each step increases the delay by the value set in #delay in Figure 4: (1) is the result for a single step, which shifts the signal by about 200 ps, while (2) shows a shift of 6600 ps because the simulation was run with 33 steps. Among the primitive cells provided by the process, cells suitable for the desired amount of delay can be chosen for testing. The delay controller is a combination of muxes, as shown in Figure 3. It is directly connected to the compensation block and indirectly to the memory controller. Its main role is to increase or decrease the amount of delay requested by the memory controller or compensation block, and it outputs the desired delay by changing the mux output. The number of delay stages should be adjusted in even numbers, as an odd number of inverter stages will invert the clock or data, producing unwanted results.
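The behavior of the mux-selected delay chain can be sketched as a small Python model. This is a behavioral illustration only: the 200 ps per step is the value observed in the RTL simulation above, and the assumption that one step corresponds to an inverter pair follows from the even-stage requirement.

```python
# Behavioral sketch of the mux-selected delay chain described above.
# Assumption: each step selects one additional inverter pair (keeping
# polarity), adding roughly 200 ps per step as seen in the simulation.

STEP_DELAY_PS = 200  # delay added per selected step (two inverter stages)

class DelayChain:
    def __init__(self, max_steps=33):
        self.max_steps = max_steps
        self.steps = 0  # current mux selection

    def set_steps(self, steps):
        """Select the mux tap; a 'step' always corresponds to an inverter
        pair, so the signal polarity is never inverted."""
        if not 0 <= steps <= self.max_steps:
            raise ValueError("step out of range")
        self.steps = steps

    def delay_ps(self):
        return self.steps * STEP_DELAY_PS

chain = DelayChain()
chain.set_steps(1)
print(chain.delay_ps())   # 200 ps, matching result (1) in Figure 5
chain.set_steps(33)
print(chain.delay_ps())   # 6600 ps, matching result (2) in Figure 5
```

The model mirrors the hardware in that the selected step only changes which mux tap is driven to the output; the delay cells themselves are always present.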

Memory Controller
The memory controller is responsible for reading and writing data in accordance with the specifications of the memory [11]. In this study, a memory that can be used in the AI chip and purchased on the market was selected rather than a memory with very high performance. The AS6C6416-55TIN SDRAM developed by Alliance Memory was used, which has a storage capacity of 64 Mbit. Figure 6 shows the timing diagram for reading and writing the SDRAM. The read operation proceeds in the following order:
a. The corresponding address is driven first.
b. The controller waits for tAA.
c. After the tAA time, the data value of the corresponding address is output on the DQ0~7 pins.
The write operation is as follows:
a. The desired address is driven.
b. CE# (chip enable) is set to 0 and CE2 to 1.
c. LB# (low byte) and UB# (up byte) are both set to 0.
d. WE# (write enable) is set to 0.
e. The desired value is driven on DQ0~7.
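The pin sequencing above can be sketched in Python. This is a hypothetical software model, not the RTL of the actual controller: the memory array is a dictionary, and tAA is elided since a behavioral model has no propagation delay.

```python
# Hypothetical sketch of the memory controller's write/read sequencing,
# following the steps listed above. Active-low control pins are suffixed
# with '_n'; the external 64 Mbit memory is modeled as a dictionary.

class SimpleMemoryController:
    def __init__(self):
        self.mem = {}     # models the external memory array
        self.ce_n = 1     # CE#  (chip enable, active low)
        self.ce2 = 0      # CE2  (chip enable, active high)
        self.lb_n = 1     # LB#  (low byte, active low)
        self.ub_n = 1     # UB#  (up byte, active low)
        self.we_n = 1     # WE#  (write enable, active low)

    def write(self, addr, data):
        # a. drive the desired address (the addr argument)
        # b. CE# = 0 and CE2 = 1
        self.ce_n, self.ce2 = 0, 1
        # c. LB# = 0 and UB# = 0 (enable both bytes)
        self.lb_n, self.ub_n = 0, 0
        # d. WE# = 0
        self.we_n = 0
        # e. drive the value on DQ0~7
        self.mem[addr] = data & 0xFF
        self.we_n = 1  # deassert WE# to latch the data

    def read(self, addr):
        # a. drive the address; b. wait tAA (elided in this model);
        # c. the data appear on DQ0~7
        self.ce_n, self.ce2, self.we_n = 0, 1, 1
        return self.mem.get(addr, 0)

ctrl = SimpleMemoryController()
ctrl.write(0x0000, 0xAA)
print(hex(ctrl.read(0x0000)))  # 0xaa
```

In the real design these pin transitions are generated by the controller state machine and pass through the delay cells before reaching the GPIO.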

Compensation Circuit
The compensation block consists of two blocks. Block 1 confirms that the location of the data detected by the delay is in the desired position, while block 2 compensates for errors so that the current system can operate stably in the external environment. The block that finds the location uses the delay-locked loop (DLL) method [12], a circuit that checks whether the data exist at the desired position along the delay chain. Figure 7 displays a block diagram of this circuit. The compensation block uses a Built-In Self-Test (BIST) method [13]. Compensation operates in the following order:
a. The register set is changed to compensation-enable mode through I2C.
b. The memory controller in Figure 2 writes the data 0xaa to address 0.
c. After the write operation, the value at address 0 is read.
d. If the read value is not 0xaa, the DQ delay of the mismatched bits is increased.
e. If the data match, the check passes.
f. A value of 0x55 is written to the address.
g. The same procedure as above is followed.
Since 0xaa is 10101010 and 0x55 is 01010101, we can examine both 0 and 1 of the data. Usually, CMOS circuits change their operating speed due to external conditions such as temperature and humidity. Therefore, for accurate data transmission, the distortion of the data due to the surrounding situation can be corrected through the delay cell using the above-described compensation algorithm.
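The training loop above can be sketched in Python. The per-bit error model below (a DQ line reads back wrong until its delay reaches some required value) is a toy assumption used only to exercise the algorithm; the write/read callbacks and the 33-step limit mirror the delay chain described earlier.

```python
# Sketch of the compensation (BIST) algorithm described above. A toy
# memory model flips any DQ bit whose delay setting is still below the
# value that bit requires; the algorithm raises delays until both test
# patterns read back cleanly.

PATTERNS = (0xAA, 0x55)  # 10101010 and 01010101 exercise every bit as 0 and 1

def compensate(write_fn, read_fn, dq_delays, max_steps=33):
    """Increase the delay of each mismatching DQ bit until both test
    patterns written to address 0 read back correctly."""
    for pattern in PATTERNS:
        while True:
            write_fn(0, pattern)
            errors = read_fn(0) ^ pattern  # set bits mark mismatched DQ lines
            if errors == 0:
                break  # this pattern passes; move to the next one
            for bit in range(8):
                if errors & (1 << bit):
                    if dq_delays[bit] >= max_steps:
                        return False  # cannot compensate this bit
                    dq_delays[bit] += 1
    return True

# Toy model: bit i reads back wrong until its delay reaches required[i].
required = [0, 3, 1, 0, 5, 0, 2, 0]
delays = [0] * 8
mem = {}

def w(addr, val):
    mem[addr] = val

def r(addr):
    bad = sum(1 << i for i in range(8) if delays[i] < required[i])
    return mem[addr] ^ bad

print(compensate(w, r, delays))  # True
print(delays)                    # [0, 3, 1, 0, 5, 0, 2, 0]
```

Because XOR flags a flip regardless of the bit's value, the two complementary patterns together guarantee that every DQ line is checked driving both a 0 and a 1.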

FPGA
For system-level verification, we implemented YOLO-V2, the algorithm most used for object recognition in embedded environments in recent years, on an FPGA and checked its results and operation. First, the YOLO-V2 algorithm was verified using the C simulation function of Xilinx's Vivado high-level synthesis (HLS). Next, the block to be implemented on the actual FPGA was converted into IP using HLS, and the entire system was designed using the Vivado tool and the IP synthesized through HLS. Using the Vivado SDK, the board was set up to measure communication with the controlling host PC, the FPGA control signals, and the processing time, and the design was finalized using PetaLinux so that it could run on the FPGA board.
The FPGA implementation environments were the Xilinx ZCU102 and the Zedboard; detailed specifications are shown in Table 2. YOLO-V2 was applied both to the ZCU102, which has sufficient resources, and to the Zedboard, which has relatively fewer resources. The verified algorithm was synthesized into real FPGA IP through Vivado HLS. At this stage, the use of the internal #pragma directives varied according to the resources of the target FPGA. Since the ZCU102 has relatively abundant resources compared to the Zedboard, many loops could be unrolled; in contrast, the Zedboard cannot unroll many loops because of its smaller number of resources. Therefore, the parameters were tuned according to the resources of the target FPGA. Table 4 indicates the number of digital signal processors (DSPs), which dominate the resource usage, when synthesized for the ZCU102 and the Zedboard. Figure 9 shows a block diagram of the entire FPGA system. The IP synthesized through HLS was connected to the PS for control, and AXI was used to access the DRAM and the divided data. Figure 10 shows the overall chip design. Two memory controllers are connected, and an I2C slave and registers are placed inside to control the chip's registers. The core is a circuit related to image compression. Table 5 shows the synthesized size of the delay cell and the P&R size of the actual chip. The unit delay cell is the size of the cell used for the two inverters and the mux, while the memory controller size is the sum of the sizes of the internal flip-flops and debugging logic. The delay control includes the logic that adds and subtracts delay and determines the edge. Table 6 is a comparison between previous studies and this study. A memory system largely consists of a PHY, including the IO, and a memory controller. In this study, the PHY and the memory controller were implemented together. The table shows the gate count of this study, designed in a 65 nm process, normalized to one, with the other designs compared relative to it.
Because processes differ from foundry to foundry, general Moore's law scaling was applied for the comparison. In [13], the PHY of LPDDR4 was implemented in the Samsung 10 nm process. Compared to this study, its absolute size was smaller, but considering the process node, it was approximately three times larger. Moreover, if the implementation of the controller is added, the size is expected to be approximately seven times larger. [14] presented a controller implementation of the memory system, excluding the PHY. The memory type is not known exactly, but compared to this work it is approximately four times larger. [15] implemented a PHY for DDR3. Again, only the PHY was designed, but it was approximately three times larger than in this study. Although it is difficult to see from Table 6 alone, this study, implemented purely digitally, has two advantages over [13,15]. The first is implementation time. It is difficult to quantify the time spent on a purely digital design compared to an analog circuit design, but considering layout and simulation time, a purely digital design takes much less time to implement. The second is cost. As mentioned earlier, cost is reduced because the implementation time is shortened, and there is the additional effect of reducing the die size because the layout is optimized on the chip. Table 7 shows the power measurement results. The operating voltage was 1.2 V. "Busy" here refers to the power consumed while reading and writing data. In particular, much of the power was consumed by the IO and the delay cells. In the future, if the delay cell is optimized, lower power consumption can be expected. As mentioned in the introduction, existing AI systems do not have their own memory system, but instead use the memory system used by the GPU or CPU. Table 8 shows a comparison of the memory used in the existing AI environment and the memory used in this study.
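The process normalization above can be made concrete with a small calculation. It assumes the classic quadratic relation between feature size and area; the gate-count inputs below are hypothetical placeholders, not the actual values from Table 6.

```python
# Illustration of the Moore's-law normalization used for Table 6: under
# classic scaling, area goes roughly with the square of the feature size,
# so a design at node A is compared to node B via (node_A / node_B)**2.
# The numeric inputs here are hypothetical, not the Table 6 values.

def normalized_ratio(area_other, node_other_nm, area_this, node_this_nm=65):
    """How many times larger the other design is once both are normalized
    to the same process node (quadratic area scaling assumed)."""
    scale = (node_this_nm / node_other_nm) ** 2
    return (area_other * scale) / area_this

# Hypothetical example: a PHY at 10 nm that is absolutely smaller than a
# 65 nm design can still come out ~3x larger once the node is normalized.
ratio = normalized_ratio(area_other=1.0, node_other_nm=10, area_this=14.0)
print(round(ratio, 2))  # 3.02
```

This is why [13], despite its smaller absolute size in 10 nm, is reported as roughly three times larger than the 65 nm design once the process difference is removed.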
As shown in the results of Table 3, when YOLO-V2 was implemented on the Zedboard, 101 of the 18 kb BRAMs inside the FPGA were used. It is difficult to map the RAM used in the FPGA to external RAM 1:1, but numerically about 1.8 Mb of RAM is used. This means that for a simple AI application such as YOLO-V2, a limited memory system is sufficient.

ASIC Chip
As mentioned in Section 2, since one stage of the delay cell is 200 ps, a single stage cannot be accurately measured using an oscilloscope. However, when the delay is increased in 10-step units, the data delayed by the delay cells can be confirmed. In this study, much smaller delay steps could be obtained by using synthesized delay cells rather than an analog design. Table 9 displays the delay step of this work compared to other studies. The smaller the delay step, the more precisely the skew of the data bus can be adjusted, so the smaller the value that can be adjusted in one step, the better. Figure 11 shows the result of measuring the data pins using an oscilloscope. Three data signals were measured at the same time, and a phase difference was deliberately introduced using the delay controller for the measurement. Looking at Figure 11a, we can see that the skews of DQ0, DQ1, and DQ2 are different. Figure 11b shows the result of correcting the skew by adjusting the delay controller corresponding to DQ1 through I2C. The reading and writing of the memory can be confirmed indirectly using I2C. When an address and data are written to the registers inside the chip through I2C, the chip drives the corresponding pins in the order shown in Figure 11; on a read, the data from the SDRAM are saved in a register inside the chip. That value was then read through the external I2C master and was found to be identical to the value written to the SDRAM. Table 10 describes the time taken to read and write data from the external memory. The design operates from 1 to 100 MHz, but these results were measured at 20 MHz. The memory used in this study supports both 8- and 16-bit modes. Table 10 describes the time required to read and write 16 and 32 MB. Figure 12 shows the environment for measuring the chip. The chip was mounted in a socket to allow exchange, and an FPGA was used to provide the I2C master and the measurement-board environment.
For measurement, all of the chip's IOs were brought out to pins.
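A rough lower bound for the Table 10 transfer times can be computed from the clock and bus width. The assumption of one 16-bit transfer per clock cycle is ours; the real controller may need several cycles per access, so the measured values will be longer than this estimate.

```python
# Back-of-the-envelope estimate of transfer time at 20 MHz, assuming one
# bus transfer per clock cycle. This is a lower bound under that
# assumption, not the measured values reported in Table 10.

def transfer_time_s(total_bytes, bus_bits, clock_hz, cycles_per_access=1):
    """Time to move total_bytes over a bus_bits-wide bus at clock_hz."""
    transfers = total_bytes / (bus_bits // 8)
    return transfers * cycles_per_access / clock_hz

MB = 1024 * 1024
print(transfer_time_s(16 * MB, bus_bits=16, clock_hz=20e6))  # ~0.42 s
print(transfer_time_s(32 * MB, bus_bits=16, clock_hz=20e6))  # ~0.84 s
```

Raising `cycles_per_access` models controllers that spend extra cycles per access on address setup or tAA waits.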

Conclusions
In this study, we proposed and implemented a memory system that is effective in an AI environment. To determine how much memory is needed in an AI environment, the required memory capacity was calculated by implementing the YOLO application on an FPGA. Based on this, an easy-to-implement and efficient memory system was designed and manufactured using Samsung Electronics' 65 nm process. As a result, the entire design could be implemented digitally at less than one-third the size of a system using existing LPDDR. In future work, the YOLO application verified on the FPGA will also be implemented in a chip.