AIM: Annealing in Memory for Vision Applications

As the era of Moore's law draws to a close, domain-specific architectures, and even non-Von Neumann systems, have been proposed to sustain progress in computing. This paper proposes a novel annealing in memory (AIM) architecture to implement Ising calculation, which is based on the Ising model and is expected to accelerate the solution of combinatorial optimization problems. The Ising model has a symmetrical structure and realizes phase transition by symmetry breaking. AIM moves the annealing calculation into memory to reduce the cost of transferring information between the calculation unit and the memory, and improves parallel processing by enabling each Static Random-Access Memory (SRAM) array to perform calculations. An approximate probability flipping circuit is proposed to keep the system from getting trapped in local optima. The bit-serial design incurs only an estimated 4.24% area overhead on the SRAM and allows the accuracy to be adjusted easily. Two vision applications are mapped for acceleration, and the results show that AIM speeds up Multi-Object Tracking (MOT) by 780× and Multiple People Head Detection (MPHD) by 161×, while consuming only 0.0064% and 0.031% of the energy, respectively, of approximate algorithms.


Introduction
Over the last 50 years, Moore's law has predicted that the performance of integrated circuits approximately doubles every two years [1]. However, it has become increasingly difficult to sustain this trend in recent years. Domain-specific architectures, and even non-Von Neumann systems, have been proposed to sustain progress, such as machine-learning accelerators [2][3][4], neuromorphic chips [5][6][7] and quantum computing [8,9]. These new architectures, which are based on fundamentally different theoretical models, are defining a new era of computing.
The Ising chip [10][11][12][13][14][15], based on the Ising model [16], is one of these new architectures and can efficiently solve combinatorial optimization problems. Combinatorial optimization is a branch of discrete mathematics that seeks the best possible solution from a finite or countably infinite set of possibilities [17]. An enormous number of key issues in science and engineering can be classified as combinatorial optimization problems. As many of these problems are known to be NP-hard [18], solving them on a Von Neumann architecture consumes a great deal of time and energy. The Ising model, a symmetrical structure for modeling the behavior of magnetic spins [16], can speed up the ground state search by enabling each spin to search in parallel. Unlike a Von Neumann architecture, the Ising model solves a combinatorial optimization problem in three steps: (1) mapping the input problem to the interaction coefficients of the Ising model, (2) performing iterative annealing calculation to approach the ground state, (3) reading the final spin states to obtain the solution of the original problem. The self-convergence property of the Ising model makes it a promising method for solving combinatorial optimization problems.
The Ising chip is mainly composed of memory that stores the interaction coefficients and calculation logic for iterative annealing [14]. The mapped coefficients of the combinatorial optimization problem are first stored in the memory, and then the iterative annealing calculation is started to search for the ground state. The next state of each spin is determined by its neighboring spin states and their interaction coefficients. According to how the spins are connected, Ising chips can be divided into local-interconnect and global-interconnect structures. In a system of $n$ spins, each spin is connected to only four or eight spins in a local-interconnect structure, but to $(n-1)$ spins in a global-interconnect structure. The oversimplified connectivity of the local-interconnect structure limits its application, and many combinatorial optimization problems can only be mapped to a global-interconnect structure. However, because of the dense connectivity of the global-interconnect structure, the memory capacity for storing coefficients and the iterative annealing logic grow quadratically with the number of spins. The hardware complexity is $O(n)$ for the local-interconnect structure and $O(n^2)$ for the global-interconnect structure. Current Ising chips are mainly local-interconnect structures, because the high hardware cost limits the realization of global-interconnect structures. The global-interconnect structure has the following characteristics: (1) large-capacity memory is required to store the interaction coefficients, (2) the iterative annealing calculation is relatively simple, (3) many calculations in the annealing can be processed in parallel, (4) different applications require different calculation accuracy. Based on these characteristics, this paper proposes a novel AIM architecture to implement a global-interconnect Ising chip.
The advantages of the structure are as follows: (1) it moves annealing into memory to reduce the cost of transferring information between the calculation unit and the memory, (2) it improves parallel processing by enabling each SRAM array to perform calculations, since each spin needs to update its local search term (LST) in each annealing step, (3) it adopts a bit-serial design for the annealing calculation to limit the impact of the annealing logic on the area of the SRAM array, which incurs only an estimated 4.24% area overhead for the SRAM array. The disadvantage is that bitwise calculation increases the calculation time. On the other hand, the bit-serial design allows the computation accuracy to be adjusted easily.
Owing to this special computing paradigm, another challenge for the Ising chip is how to map real-world applications to it for acceleration. Data association, which is a kind of combinatorial optimization problem, is widespread in vision applications, since objects do not exist in isolation [19,20]. A better understanding can be obtained if association relationships are considered in computer vision algorithms. This paper demonstrates how to map two vision applications to the annealing in memory architecture for acceleration. The results show that the proposed structure speeds up MOT by 780× and MPHD by 161×, while consuming only 0.0064% and 0.031% of the energy, respectively, of approximate algorithms.
The remainder of this paper is organized as follows. Section 2 reviews the Ising model. The annealing in memory architecture is introduced in Section 3. Section 4 demonstrates how to map MOT and MPHD to the proposed architecture. The conclusion is presented in Section 5.

Overview of the Ising Model
The Ising model was proposed by Wilhelm Lenz in 1920 to study the ferromagnetism of atomic spins [16]. The model consists of spins, each of which takes one of two states to store information. The spins are arranged in a graph, and each spin interacts with its neighbors through the interaction coefficients. The topology of the graph can be local-interconnect or global-interconnect, depending on how the spins are connected. The Hamiltonian of the Ising model is given in (1), where $s_i$ is the state of spin $i$, $J_{ij}$ is the coefficient between spin $i$ and spin $j$, and $h_i$ is the external magnetic field for spin $i$. The Hamiltonian represents the total energy of the Ising model and acts as the cost function to be optimized. The goal of the ground state search is to find a spin configuration that minimizes the cost function, which corresponds to the ground state of the Ising model. Each spin simultaneously performs iterative annealing calculation to search for the ground state. The annealing calculation is composed of local search and probability flipping. In the local search, $Ls_i(t)$ is the LST of spin $i$ and $s_i(t)$ is the neighboring state of spin $i$ at time $t$ that makes the local energy lower. To help the system escape from local optima, the state transition is defined as a probabilistic behavior, which means that spin $i$ changes its state probabilistically. The flipping probability is given in (4); it depends on the energy change $\Delta E$ in (5) when the state of spin $i$ changes from $s_i(t)$ to $1 - s_i(t)$, and on the annealing temperature $T(t)$ at time $t$.
The annealing temperature $T(t)$ has a strong influence on the flipping probability and changes from high to low over the annealing process. It determines how sensitive the flipping probability is to energy changes. By combining (4) and (5), the next state $s_i(t + 1)$ of spin $i$ can be determined by (6).
The system's energy tends to decrease as the annealing calculation progresses, and the system finally reaches the ground state or a near-ground state when the annealing temperature $T(t)$ reaches its final value at the end of the search.
The local search of each spin can be realized in parallel according to the De Gloria algorithm [21]. First, the states of all spins are initialized and the local search term of each spin is precomputed. When one spin is randomly selected to update its state, the probability in (6) is calculated and the next state of the spin is determined by comparing the probability $P$ with a random number $r$ as in (7), where $r$ is between 0 and 1. If the spin changes its state, all other spins update their local search terms as in (8), which can be done in parallel. After that, another spin is randomly selected and the above iterative annealing process is repeated.
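The incremental update of the local search terms described above can be sketched in Python. This is only an illustration under stated assumptions, not the paper's implementation: spin states are taken to be 0 or 1, the coefficient matrix J is assumed symmetric with a zero diagonal, and the LST is assumed to have the form Ls_i = h_i + sum_j J_ij s_j. The function names (`local_search_terms`, `flip_spin`, `random_problem`) are hypothetical.

```python
import random

def local_search_terms(J, h, s):
    """Recompute every LST from scratch (assumed form: h_i + sum_j J_ij * s_j)."""
    n = len(s)
    return [h[i] + sum(J[i][j] * s[j] for j in range(n)) for i in range(n)]

def flip_spin(J, lst, s, k):
    """Flip spin k in place and update the LSTs of all other spins incrementally."""
    delta = 1 - 2 * s[k]          # +1 for a 0 -> 1 flip, -1 for a 1 -> 0 flip
    s[k] += delta
    for i in range(len(s)):       # each iteration is independent, so parallelizable
        if i != k:
            lst[i] += J[i][k] * delta

def random_problem(n, rng):
    """Random symmetric integer coefficients with zero diagonal (illustrative data)."""
    J = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = J[j][i] = rng.randint(-5, 5)
    h = [rng.randint(-5, 5) for _ in range(n)]
    s = [rng.randint(0, 1) for _ in range(n)]
    return J, h, s

rng = random.Random(0)
J, h, s = random_problem(8, rng)
lst = local_search_terms(J, h, s)
for _ in range(20):
    flip_spin(J, lst, s, rng.randrange(8))
# The incrementally maintained LSTs should match a full recomputation.
print(lst == local_search_terms(J, h, s))
```

Each flip costs one addition per spin instead of a full row sum, and the per-spin additions do not depend on each other, which is what makes the parallel update possible.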

Annealing in Memory Architecture
The annealing in memory architecture aims to reduce data movement by performing the annealing calculation directly in memory. Coefficients between spins are stored in the SRAM array and the LSTs are updated in parallel next to the SRAM. The bit-serial design of the annealing calculation adds little area overhead to the SRAM and allows the accuracy to be adjusted easily. To keep the system from getting trapped in local optima, an approximate probability flipping method is presented. Figure 1 shows the top-level architecture of the annealing in memory design. The Annealing Control (Anl_Ctrl) module controls the flow of the iterative annealing calculation. It also performs probability flipping to determine the next state of the selected spin based on the LST and the probability flipping term (PFT). The Compute Static Random-Access Memory (CSRAM) modules store the coefficients and LSTs. The LST update is performed in parallel in the CSRAM modules when they receive an update command from the Anl_Ctrl module. The Randomly Select (Rdm_Slt) module is a random number generator that produces a random number at regular intervals to select a spin to update. The Probability Flipping (Prob_Flip) module generates the PFT for probability flipping. The flow chart of the iterative annealing calculation is shown in Figure 2. First, the coefficients, LSTs and spins are initialized in the CSRAM modules. Next, the Anl_Ctrl module selects a spin to update according to the random number generated by the Rdm_Slt module. It then reads the LST of that spin from the corresponding CSRAM module and, at the same time, obtains the PFT from the Prob_Flip module. The next state of the spin is determined from its LST and the PFT. If the spin has changed its state, the spin and all other LSTs are updated, and the spin unchanged number n is reset in the following step. Otherwise, the spin unchanged number n is increased by 1.
Another spin is then selected and the annealing process repeats as long as the spin unchanged number is less than the predefined maximum value. The iteration stops when no spin has updated its state for a certain period, and finally the spins are read out to obtain the solution to the mapped problem.
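The control flow above can be sketched as a short software loop. The sketch below is an assumption-laden illustration rather than the Anl_Ctrl implementation: it assumes 0/1 spin states, an LST of the form h_i + sum_j J_ij s_j, and a logistic rule in which the selected spin becomes 1 with probability 1 / (1 + exp(-Ls_i / T)). The function name `anneal`, its parameters, and the `max_steps` safety cap are hypothetical additions.

```python
import math
import random

def anneal(J, h, temps, hold, max_unchanged, max_steps=100_000, seed=0):
    """Iterative annealing with a 'spin unchanged' counter as the stop condition."""
    rng = random.Random(seed)
    n = len(h)
    s = [rng.randint(0, 1) for _ in range(n)]
    lst = [h[i] + sum(J[i][j] * s[j] for j in range(n)) for i in range(n)]
    unchanged = 0
    for step in range(max_steps):            # cap added only to guarantee termination
        # Hold each temperature for `hold` steps, then stay at the last one.
        T = temps[min(step // hold, len(temps) - 1)]
        k = rng.randrange(n)                 # randomly selected spin
        # Assumed logistic rule; the exponent is clamped to avoid overflow.
        p_one = 1.0 / (1.0 + math.exp(min(-lst[k] / T, 700.0)))
        new_state = 1 if rng.random() < p_one else 0
        if new_state != s[k]:
            delta = new_state - s[k]
            s[k] = new_state
            for i in range(n):               # update all other LSTs
                if i != k:
                    lst[i] += J[i][k] * delta
            unchanged = 0                    # reset the counter after a change
        else:
            unchanged += 1
            if unchanged >= max_unchanged:   # no change for long enough: stop
                break
    return s

# Small illustrative problem with positive couplings and fields.
J = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
h = [1, 1, 1]
print(anneal(J, h, temps=[4.0, 2.0, 1.0, 0.5], hold=50, max_unchanged=30))
```

The temperature list and hold time stand in for the annealing schedule; any decreasing sequence could be substituted.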

Local Search in Memory
Local search is performed in the CSRAM module, which is shown in Figure 3. The CSRAM module consists of an SRAM array, adjacent annealing calculation logic and two shift registers. The coefficients between spins are stored in the SRAM array and the LST is stored in the Lst shift register. The adjacent annealing calculation logic updates the LST in each iterative annealing step. To reduce the hardware cost and make the calculation accuracy easy to adapt, a bit-serial design is adopted to update the LST. When the selected spin changes its state in an iterative annealing step, all spins need to update their LSTs in the corresponding CSRAM modules. When the update command arrives, the related coefficient is first read into the Cof shift register, and then the LST is updated by adding the related coefficient to the original LST according to the changed spin. C_reg is a register that stores the carry in the bit-serial calculation.
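The bit-serial update can be modeled in software as a one-bit full adder applied once per cycle, with a carry variable playing the role of C_reg. The sketch below is a behavioral illustration only; the function name `bit_serial_add`, the two's-complement encoding and the `subtract` flag are assumptions, not details taken from the circuit in Figure 3.

```python
def bit_serial_add(lst, cof, width, subtract=False):
    """Add (or subtract) a coefficient to an LST one bit per cycle, LSB first.

    Operands and result are `width`-bit two's-complement values; the result
    wraps around on overflow, as a fixed-width register would.
    """
    mask = (1 << width) - 1
    a = lst & mask                              # contents of the LST shift register
    b = (~cof if subtract else cof) & mask      # coefficient, inverted to subtract
    carry = 1 if subtract else 0                # carry register; 1 completes the negation
    result = 0
    for bit in range(width):                    # one cycle per bit
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        total = x ^ y ^ carry                   # sum bit of a full adder
        carry = (x & y) | (carry & (x ^ y))     # carry out, kept for the next cycle
        result |= total << bit
    if result & (1 << (width - 1)):             # reinterpret as a signed value
        result -= 1 << width
    return result

print(bit_serial_add(5, 3, width=8))
print(bit_serial_add(5, 3, width=8, subtract=True))
```

Because the loop length equals `width`, changing the precision only changes the number of cycles, not the adder itself, which is the trade-off between accuracy and latency noted for the bit-serial design.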

Approximate Probability Flipping Method
To keep the system from getting trapped in local optima, the state transition is defined as a probabilistic behavior, which means that the spin state flips probabilistically. The flipping probability depends on the energy change and the annealing temperature. The next state of the spin is determined by comparing the probability $P$ with a random number $r$ as in (7), which can be transformed into a comparison involving $P_f(t)$, the PFT, which is the product of the annealing temperature $T(t)$ and $\ln(1/r - 1)$. The state of the spin can then be determined from the sign bit of the sum obtained by adding the LST $Ls_i(t)$ and the PFT $P_f(t)$. Because the PFT is complex to compute exactly, a digital hardware circuit is proposed to implement it approximately.
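To see why a sign check on $Ls_i(t) + P_f(t)$ can replace the comparison of $P$ with $r$, the following derivation may help. It assumes, as an illustration, that the probability takes the logistic form $P = 1/(1 + \exp(-Ls_i(t)/T(t)))$ and that the next state is 1 when $r < P$; the exact form and sign convention are those of (6) and (7) and may differ. For $0 < r < 1$ and $T(t) > 0$:

```latex
\begin{align}
r < \frac{1}{1 + \exp\!\left(-Ls_i(t)/T(t)\right)}
  &\iff \frac{1}{r} - 1 > \exp\!\left(-\frac{Ls_i(t)}{T(t)}\right) \\
  &\iff T(t)\,\ln\!\left(\frac{1}{r} - 1\right) > -Ls_i(t) \\
  &\iff Ls_i(t) + P_f(t) > 0,
  \qquad P_f(t) = T(t)\,\ln\!\left(\frac{1}{r} - 1\right).
\end{align}
```

Under these assumptions, the exponential and the division disappear from the per-step decision, leaving one addition and a sign test.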
As can be seen in Figure 4, the curve of $\ln(1/r - 1)$, where $r$ is a random number between 0 and 1, is approximated by three lines, two of which have the same slope. The linear interpolation of $\ln(1/r - 1)$ is given in (11). Equation (11) can be approximately replaced by (12) using a technique similar to the one proposed in [22]. The fixed constants can be represented by a small, finite number of bits, and the products of the constants with $r$ can be realized by shifting, so the approximation in (12) is more convenient for digital hardware implementation. The selection among the fixed constants and products, which depends on the value of $r$, can be done with multiplexers. The proposed structure for approximating $\ln(1/r - 1)$ is shown in Figure 5, where the final approximate value is stored in register R2. The annealing temperature $T$ is set to a series of fixed values, which are powers of 2, and each value is held for a fixed time during the annealing process. Therefore, the PFT $P_f(t)$ can be obtained by shifting register R2 according to the value of $T$. The energy histories of the iterative annealing calculation with different probability flipping methods are depicted in Figure 6. The coefficients of the Ising model are randomly generated. Without probability flipping, that is, when the next state of the spin is determined by local search alone, the system soon falls into a local optimum. Although simplified for hardware implementation, the approximate probability flipping method finds a state with an energy similar to that of the theoretical method at the end of the annealing process.
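A three-segment, shift-only approximation of this kind can be sketched as follows. The breakpoints (1/8 and 7/8), slopes (-4 and -16) and intercepts used here are hypothetical values chosen so that every multiplication is a power-of-two shift; they are not the constants of (11) or (12). The names `approx_log_term` and `pft` and the 16-bit fixed-point format are likewise assumptions.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS   # fixed-point representation of 1.0

def approx_log_term(r_int):
    """Approximate ln(1/r - 1) in fixed point, for r = r_int / 2**16, 0 < r < 1.

    Three line segments; the two outer segments share the same slope.
    """
    if r_int < (ONE >> 3):                    # r < 1/8:  3.5 - 16 r
        return (7 << (FRAC_BITS - 1)) - (r_int << 4)
    if r_int > ONE - (ONE >> 3):              # r > 7/8: 12.5 - 16 r
        return (25 << (FRAC_BITS - 1)) - (r_int << 4)
    return (2 << FRAC_BITS) - (r_int << 2)    # middle:   2 - 4 r

def pft(r_int, log2_T):
    """PFT for a power-of-two temperature T = 2**log2_T, obtained by shifting."""
    y = approx_log_term(r_int)
    return y << log2_T if log2_T >= 0 else y >> -log2_T

# r = 0.25 with T = 8: the approximation gives 1.0 * 8 in fixed point.
print(pft(ONE >> 2, 3) / ONE)
```

The branch on `r_int` corresponds to the multiplexer selection described above, and the final shift in `pft` corresponds to shifting the stored approximation by the temperature.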

Hardware Performance
The annealing in memory architecture is synthesized using Design Compiler in a 28 nm process, and the SRAM array is generated by a memory compiler with a 28 nm library. The target clock frequency is 1 GHz, and the size of each SRAM array is 1024 × 64. The entire chip consists of 4096 SRAM arrays, so it can support up to 4096 globally interconnected spins when the bit width of the coefficients is configured to 16 bits.
We summarize the hardware performance in Table 1. The areas of the CSRAM and the entire chip are 20.74 µm² and 85 mm², respectively. The local search logic incurs only 4.24% area overhead on the CSRAM module. The chip power is 3.26 W.
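The capacity figures above can be checked with a few lines of arithmetic. The sketch assumes that each SRAM array holds the coefficients of one spin, which is an interpretation of the text rather than a stated fact; the variable names are illustrative.

```python
ROWS, COLS = 1024, 64      # size of one SRAM array, in bits
NUM_ARRAYS = 4096          # SRAM arrays on the chip
COEFF_BITS = 16            # configured coefficient bit width

bits_per_array = ROWS * COLS
coeffs_per_array = bits_per_array // COEFF_BITS   # coefficients that fit in one array
total_mib = NUM_ARRAYS * bits_per_array / 8 / 2**20

print(bits_per_array, coeffs_per_array, total_mib)
```

With 16-bit coefficients, one array holds exactly as many coefficients as there are arrays, which is consistent with the stated limit of 4096 globally interconnected spins.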

Application
Vision applications mainly focus on extracting useful information from images. Data association is widespread in vision applications, since objects do not exist in isolation. A better understanding can be obtained if association relationships are considered in computer vision algorithms. Data association is a kind of combinatorial optimization problem, which can be accelerated by the Ising model. In this section, we demonstrate how to map two vision applications to the annealing in memory architecture for acceleration.

Mapping MOT to AIM
MOT is the problem of simultaneously tracking multiple moving objects in a sequence of images and is a key component of many vision tasks. Tracking-by-detection is a widely accepted MOT method. The core issue is how to effectively associate hypotheses and build complete trajectories. Figure 7a shows the process of mapping MOT to AIM. First, hypotheses for each frame are obtained from the detector, and the combinations of trackers with hypotheses are regarded as spins. The unary term in (1) is defined by the affinity between tracker and hypothesis, and the pairwise term is defined by the cost of violating interactions [19]. In the next step, the coefficients are passed to AIM and the annealing process starts. We read the spin states at the end of the annealing process and map them to the solution of the MOT problem. If the spin state is 1, the hypothesis is assigned to the tracker; otherwise, the hypothesis does not belong to the tracker.

Mapping MPHD to AIM
Data association is also considered in MPHD. Since objects do not exist in isolation, the model in [20] captures spatial associations between objects, which provide complementary contextual cues for detection, and achieves better results than other models.
The process of mapping MPHD to AIM is shown in Figure 7b. First, the objects in the image are obtained from the detector and each object is regarded as a spin. The unary term in (1) is defined by the response of the local object detector at the corresponding location, and the pairwise term depends on the image data. After that, the coefficients are passed to AIM, and the spin states at the end of the annealing process correspond to the solution of the MPHD problem. If the spin state is 1, the object is a head; otherwise, it is not.

Evaluation
To evaluate the performance, we develop a cycle-accurate simulator for AIM and run the baselines, namely the multi-branch algorithm for MOT [19] and the Quadratic Pseudo-Boolean Optimization (QPBO) and sequential tree-reweighted message passing (TRWS) algorithms for MPHD [20], on a Core i7-8450 processor. The evaluation for MOT is based on the KITTI Vision Benchmark Suite [23], a widely used MOT benchmark, and the CLEAR MOT metrics [24]. The datasets used for the MPHD evaluation are the HollywoodHeads dataset [20] and the Casablanca dataset [25], both widely used for MPHD. The MOT results are given in Table 2 and Figure 8a. Figure 9 reports the precision-recall (PR) curves for MPHD using QPBO, TRWS and AIM. All three methods achieve similar results for multiple people head detection. One qualitative head detection result obtained with AIM is shown in Figure 8b.
Figure 9. Precision-recall curves for MPHD on the HollywoodHeads [20] and Casablanca [25] datasets using AIM, QPBO and TRWS. All three methods achieve similar results for MPHD.
The run time and energy consumption of all methods are shown in Figure 10. AIM achieves a 780× speedup in latency over the multi-branch algorithm for MOT, and 161× and 195× speedups over QPBO and TRWS for MPHD, respectively. The significant speedup can be attributed to the highly parallel processing that results from enabling each SRAM array to carry out local search. AIM consumes only 0.0064% of the energy of the multi-branch algorithm for MOT, and 0.031% and 0.025% of the energy of QPBO and TRWS for MPHD, respectively. The energy reduction can be explained by the annealing in memory architecture, which reduces data movement between the calculation unit and the memory.
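For readers who prefer reduction factors to percentages, the energy figures above can be converted with simple arithmetic. The helper name `reduction_factor` is illustrative, and the output is derived only from the percentages quoted in the text.

```python
def reduction_factor(percent_of_baseline):
    """Convert 'X% of the baseline energy' into a 'times lower' factor."""
    return 100.0 / percent_of_baseline

reported = [
    ("MOT vs. multi-branch", 0.0064),
    ("MPHD vs. QPBO", 0.031),
    ("MPHD vs. TRWS", 0.025),
]
for name, pct in reported:
    print(f"{name}: about {reduction_factor(pct):.0f}x lower energy")
```

These factors are much larger than the reported speedups, so the gain cannot come from shorter run time alone; under the usual energy = power × time relation, the average power must also be lower than that of the baseline platform.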

[Figure 10a: per-sequence results for Seq03, Seq04, Seq10, Seq12, Seq14 and Average; axis label: Energy (mJ).]

Conclusions
This paper presents an annealing in memory architecture to realize Ising calculation, which is expected to accelerate the solution of combinatorial optimization problems. Based on the main characteristics of the Ising chip, namely large-capacity memory and simple iterative calculation, the architecture moves annealing into memory to reduce the cost of transferring information between the calculation unit and the memory, and improves parallel processing by enabling each SRAM array to perform calculations. An approximate probability flipping circuit is proposed to avoid getting trapped in local optima. This paper also demonstrates how to map two vision applications to the proposed architecture for acceleration. The results show that it speeds up multi-object tracking by 780× and multiple people head detection by 161×, while consuming only 0.0064% and 0.031% of the energy, respectively, of approximate algorithms.

Conflicts of Interest:
The authors declare no conflict of interest.