High-Performance Motion Estimation for Image Sensors with Video Compression

It is important to reduce the time cost of video compression for image sensors in video sensor networks. Motion estimation (ME) is the most time-consuming part of video compression. Previous work on ME exploited intra-frame data reuse within a reference frame to improve time efficiency but neglected inter-frame data reuse. We propose a novel inter-frame data reuse scheme which exploits both intra-frame and inter-frame data reuse for ME in video compression (VC-ME). Pixels of reconstructed frames are kept on chip until they are used by the next current frame, avoiding off-chip memory accesses. On-chip buffers with smart schedules of data access are designed to implement the new data reuse scheme. Three levels of the proposed inter-frame data reuse scheme are presented and analyzed, offering different tradeoffs between off-chip bandwidth requirement and on-chip memory size. All three levels have better data reuse efficiency than their intra-frame counterparts, so off-chip memory traffic is reduced effectively. Compared with the traditional intra-frame data reuse scheme, the new inter-frame scheme reduces memory traffic by 50% for VC-ME.


Introduction
Video compression (VC) is very widely used, e.g., in mobile phones, notebook computers, and video sensor networks (VSNs). Due to their ease of deployment, dynamically configurable nature, and self-organizing characteristics, VSNs are becoming increasingly popular [1]. VSNs are essential in the surveillance of remote areas, such as the monitoring of battlefields, farmlands, and forests. VSNs can also be used in intelligent transportation, environmental monitoring, public security, and so on. An important factor for a VSN is the transmission speed of the video content for real-time processing. The video content obtained by image sensors in a VSN is usually compressed to reduce the time overhead of video transmission. Therefore, it is also important to reduce the time overhead of video compression and of its kernel algorithm, motion estimation (ME). Figure 1 gives the block diagram of a video-acquisition system [2]. The sensor interface obtains pixels from the image sensor (Lens Sensor Module) and continuously transfers them to the external memory (Frame Memory) via the memory controller. The external memory is a DDR3 memory with a 64-bit bus, 4 GB capacity, and an 800 MHz memory clock frequency. Once one row of pixel blocks in an image is ready in the external memory, the image process module processes these pixel blocks one by one. The output pixel block can be transferred to a typical video encoder module for video compression. The host processor (e.g., an ARM processor) is used to control or configure the other modules. Figure 2 shows the data flow of a video-acquisition system mainly comprising an image sensor with video compression. The lens and image sensor are used to acquire the raw image. The image pipeline processes the pixels from the image sensor to restore vivid videos. A buffer is used to reorder line-scan pixels into block-scan ones. The video encoder compresses the video from the buffer and outputs the compressed bitstream for transmission.
In the video encoder, ME is used to reduce temporal redundancy among adjacent frames for video compression (VC-ME) [3]. There are many ME implementations on various platforms (embedded systems, GPU and FPGA) for different types of ME, such as gradient-based, energy-based and block-based [4][5][6][7][8][9][10][11][12][13]. There are a lot of works for accelerating ME based on specific hardware. A novel customizable architecture of a neuromorphic robust optical flow (multichannel gradient model) is proposed in [4], which is based on reconfigurable hardware with the properties of the cortical motion pathway. A complete quantization study of neuromorphic robust optical flow architecture is performed in [5], using properties found in the cortical motion pathway. This architecture is designed for VLSI systems. An extensive analysis is performed to avoid compromising the viability and the robustness of the final system. A robust gradient-based optical flow model is efficiently implemented in a low-power platform based on a multi-core digital signal processor (DSP) [6]. Graphics processor units (GPUs) offer high performance and power efficiency for a large number of data-parallel applications. Therefore, there are also many works using GPU or multi-GPU to accelerate motion estimation [7,8]. FPGA is also a very useful platform to accelerate motion estimation [9,10]. There is also a novel bioinspired sensor based on the synergy between optical flow and orthogonal variant moments; the bioinspired sensor has been designed for VLSI and implemented on FPGA [9]. A tensor-based optical flow algorithm is developed and implemented using field programmable gate array (FPGA) technology [10].
Block-based ME is popular for its simplicity and efficiency, and there are many hardware-based block-matching algorithms [11][12][13]. Block matching finds the macro-block (MB) in the reference frame that best matches the current MB. The sum of absolute differences (SAD) is one way to determine the best match. The displacement between the current MB and the best matching reference MB is the motion vector (MV). ME usually takes most of the time in VC, and the accuracy of ME affects the compression ratio. Full search integer ME (FSIME) employs brute-force search to find the optimal MB in the search range and achieves the best accuracy. FSIME is suitable for efficient hardware implementation because of its regularity, but it demands a large number of computations and memory accesses. Therefore, it is important to reduce the time cost of VC-ME in VSNs. Fast search methods have been proposed to reduce time overhead, usually at the cost of accuracy, such as Three Step Search [14], New Three Step Search [15], Diamond Search [16,17] and Four Step Search [18]. Fast search methods may not find the optimal MB, and many of them are not regular enough for hardware implementation. We focus on FSIME in this paper. In recent years, the speed gap between on-chip computing and off-chip memory access has grown larger and larger, so it is important to reduce the off-chip bandwidth requirement to improve overall performance, especially for real-time video applications [19]. Reusing data on chip is usually considered to reduce off-chip memory traffic. Some data reuse methods have been proposed for FSIME [20][21][22][23][24]. Previous work mainly focused on intra-frame data reuse within a reference frame, but inter-frame data reuse was neglected. For VC-ME, the reconstructed frame was usually stored to off-chip memory and then loaded to on-chip memory when needed [23,25,26,27,28].
If the reconstructed frame is reused on chip without being stored to off-chip memory for VC-ME, the off-chip memory traffic will be greatly reduced.
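As a concrete illustration of the block matching described above, the following sketch computes the SAD for every candidate position in a search window and returns the MV of the best match. This is a behavioral model of FSIME only, not the hardware architecture; the frame layout (lists of pixel rows) and function names are our own assumptions.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r) for crow, rrow in zip(cur, ref)
               for c, r in zip(crow, rrow))

def full_search(frame_cur, frame_ref, bx, by, N, sr):
    """Exhaustively search a +/- sr window in the reference frame for
    the N x N current block at (bx, by); returns (best MV, best SAD)."""
    H, W = len(frame_cur), len(frame_cur[0])
    cur = [row[bx:bx + N] for row in frame_cur[by:by + N]]
    best = (None, float('inf'))
    for dy in range(-sr, sr + 1):          # every candidate displacement
        for dx in range(-sr, sr + 1):
            x, y = bx + dx, by + dy
            if 0 <= x <= W - N and 0 <= y <= H - N:
                ref = [row[x:x + N] for row in frame_ref[y:y + N]]
                s = sad(cur, ref)
                if s < best[1]:
                    best = ((dx, dy), s)
    return best
```

The doubly nested displacement loop is exactly the regular structure that makes FSIME attractive for hardware, and also the source of its heavy memory traffic.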
In this paper, we propose a high-performance motion estimation method to reduce the time cost for image sensors with video compression. The new method exploits inter-frame data reuse with an affordable on-chip memory size for FSIME. The inter-frame data reuse scheme can effectively reduce off-chip memory traffic. For VC-ME, reconstructed frames are kept on chip until they are used by the next current frame. Three levels (Inter-E, Inter-D, Inter-C) of the proposed inter-frame data reuse scheme are presented and analyzed, which give a good tradeoff between data reuse efficiency and on-chip memory size. All three levels have better data reuse efficiency than their intra-frame counterparts. Furthermore, the new method is compatible with different intra-frame data reuse schemes [20,21] and scan orders [22,23]. Comparing the inter-frame data reuse scheme with the intra-frame data reuse scheme, we find that memory traffic can be reduced by 50% for VC-ME in our case study.
The rest of the paper is organized as follows: the parallelism and data locality of FSIME are analyzed in Section 2; three levels of inter-frame data reuse scheme for VC-ME are proposed and analyzed in Section 3; an implementation of the inter-frame data reuse architecture is presented in Section 4; experiment results are given in Section 5; and Section 6 is the conclusion.

Parallelism and Locality Analysis for FSIME
Some basic concepts are explained in Table 1 for better understanding of the following sections. MB, BS, SR, and SRS are four levels of data ranges in a frame. Different scan orders are used to implement different data reuse schemes and can lead to different data reuse efficiency [22,23]. The IME engine computes the SAD values and the MV for each current block.

Parallelism Analysis for FSIME
There are four levels of parallelism for FSIME (Figure 3). Level A is the parallelism among pixels. Different pixels can be computed in parallel to get the SAD between two MBs. A typical implementation of Level A is an N × N PEA with an adder tree [25]. Level B is the parallelism among reference blocks. One CB is compared with all the reference blocks in an SR, and different reference blocks can be computed in parallel [26]. Level C is the parallelism among SRs or CBs. Different SRs in a frame can be computed in parallel [28]. Level C cannot be directly applied to VC-ME because of the data dependency between adjacent CBs. After modifying the method of computing the MV predictor [25], the data dependency is eliminated and Level C can be applied to VC-ME. Level D is the parallelism among frames. For VC-ME, the current frame cannot be processed until the previous frame is reconstructed, so Level D is not possible. The parallelism degrees of the four levels are listed in Table 2, where it is assumed that there is one reference frame per current frame and that F is the number of frames in the video sequence.
Figure 3. Four different levels of parallelism for FSIME.
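Level A above maps the N × N absolute differences onto a PEA whose outputs are summed by a log-depth adder tree. The reduction can be pictured in software as below; this is our own minimal model of the reduction pattern, not the PEA described in [25].

```python
def adder_tree_sum(values):
    """Sum a list by pairwise addition, mirroring the log2-depth adder
    tree that accumulates all per-pixel |difference| terms in one pass."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:            # zero-pad odd levels, as hardware would
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def sad_level_a(cur, ref):
    """SAD of two N x N blocks: the |c - r| terms are mutually
    independent (Level A parallelism) and feed a single adder tree."""
    diffs = [abs(c - r) for crow, rrow in zip(cur, ref)
             for c, r in zip(crow, rrow)]
    return adder_tree_sum(diffs)
```

In hardware all N × N subtractions happen in the same cycle, and the tree finishes in log2(N × N) adder stages rather than N × N sequential additions.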

Data Locality Analysis for FSIME
Data locality includes spatial locality and temporal locality. A common technique to utilize spatial locality is pre-fetching. The temporal locality can be analyzed for better implementation of data reuse, which is our focus. Different intra-frame data reuse levels are proposed and analyzed in Table 3.
Level A is within a reference block strip in an SR. Level B is among adjacent reference block strips in an SR. Level C is within an SRS and is the most popular data reuse level. Level D is among adjacent SRSs and is a one-access data reuse level [20,29]. Level C+ [21] is the data reuse level between Levels C and D, which carefully trades off data reuse efficiency against on-chip memory size. For variable block size ME, the SAD of smaller blocks can be reused to compute the SAD of larger blocks; that is data reuse within one MB, and we call it Level O data reuse. Level A+ is the data reuse level between Levels A and B, which employs a similar tradeoff between data reuse efficiency and on-chip memory size. Level A+ is similar to Level C+; however, Level A+ is usually implemented to load data from SRAM to registers [22], while Level C+ is typically used to load data from off-chip memory to on-chip memory.
The above data reuse levels only consider data reuse within the reference frame, and we call them intra-frame data reuse levels. The disadvantage of using only intra-frame data reuse is that the reconstructed frame is usually stored to off-chip memory and then loaded back to on-chip memory when needed for VC-ME. Level E is an inter-frame reuse level, but it was not analyzed in detail and was considered impractical because it demands storing at least one frame on chip [20].
Traditionally, Ra, defined by Equation (1), is the redundancy access factor used to evaluate memory access efficiency [20]. Only the memory traffic of loading the reference frame is considered in Ra. We list Ra and the on-chip memory size of the intra-frame data reuse levels in Table 3, where n is the parameter of the C+ scheme.
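Although Table 3 and Equation (1) are not reproduced here, the Ra values of the most commonly used levels can be sketched as follows. The expressions for Levels C and C+ are back-derived from the Inter-C formula in Section 3 (under Equation (1), Ra of Intra-C is 1 + SRV/N) and from the C+ scheme of [21]; treat them as illustrative rather than as the paper's Table 3.

```python
def ra_intra_ref_only(level, N, sr_v, n=1):
    """Redundancy access factor Ra under the traditional definition
    (Equation (1)): reference-frame loads divided by one frame of
    pixels.  Only levels whose closed forms are corroborated elsewhere
    in the paper are sketched here (an assumption, not Table 3)."""
    if level == 'C':    # each SRS of height N + SRv serves one BS of height N
        return 1 + sr_v / N
    if level == 'C+':   # n BS rows share one taller strip [21]
        return 1 + sr_v / (n * N)
    if level == 'D':    # one-access: every reference pixel loaded once
        return 1.0
    raise ValueError('unsupported level: ' + level)
```

With N = 16 and SRV = 2N, for example, Level C reloads every reference pixel three times while Level D loads it exactly once.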

Inter-Frame Data Reuse Scheme for VC-ME
VC-ME aims to reduce the temporal redundancy in adjacent frames. The reconstructed frame instead of original frame is used as the reference frame for VC-ME. In this section, we present and analyze three inter-frame data reuse levels for two kinds of VC-ME, single reference frame VC-ME (VC-SRME) and multiple reference frames VC-ME (VC-MRME) [30].

New Definition of Ra for VC-ME
Only the memory traffic of loading the reference frame was considered in previous work [20][21][22][23] when computing Ra. However, the memory traffic of storing the reconstructed frame to off-chip memory and loading the current frame from off-chip memory should also be considered. We use Equation (2) as the new definition of Ra for VC-ME. The memory traffic of loading the reference frame (Ra_intra) is computed by Equation (3), where memrefload is the memory traffic to load a reference frame. Ra_inter, defined in Equation (4), arises from storing the reconstructed frame to off-chip memory (memrefstore) and loading the current frame into on-chip memory (memcurload).
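The new definition can be sketched directly from the terms named above. We assume, as Equation (1) does, that the traffic terms are normalized by one frame of pixels (the minimum possible traffic); the function below is our reading of Equations (2)-(4), not a verbatim transcription.

```python
def ra_vcme(mem_ref_load, mem_ref_store, mem_cur_load, frame_pixels):
    """New Ra for VC-ME: besides loading the reference frame
    (Ra_intra, Equation (3)), count storing the reconstructed frame
    and loading the current frame (Ra_inter, Equation (4))."""
    ra_intra = mem_ref_load / frame_pixels                    # Equation (3)
    ra_inter = (mem_ref_store + mem_cur_load) / frame_pixels  # Equation (4)
    return ra_intra + ra_inter                                # Equation (2)
```

For a purely intra-frame scheme, storing the reconstructed frame and loading the current frame each cost one frame of traffic, so Ra_inter contributes a constant 2.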
Ra of different intra-frame data reuse levels for VC-SRME (Table 4) can be calculated according to Equations (2)-(4). For example, Ra of Intra-C is calculated as follows.

Table 4. Ra and On-chip Memory Size for VC-SRME.

Ra of different intra-frame data reuse levels for VC-MRME can also be calculated according to Equations (2)-(4). For example, Ra of Intra-C is calculated as follows, where r is the number of reference frames for one current frame (Table 5).

Table 5. Ra and On-chip Memory Size for VC-MRME.

Level E Inter-Frame Data Reuse (Inter-E)
To keep the reconstructed frame buffered on chip for use by the next current frame, we can use two frame buffers and one PEA to implement Inter-E for VC-SRME as in Figure 4, where the two frame buffers store two reconstructed frames. We can reduce the size of the frame buffer by sharing it between two adjacent reconstructed frames. In Figure 5, the reconstructed frame buffer is divided into H/N + 2 BS buffers and designed as a circular buffer. We assume that SRV equals 2N for convenience, which means that one SRS includes three BSs. Frame i is the reconstructed previous frame and Frame i + 1 is the reconstructed current frame. When processing BS2 of current Frame i + 1, BS1, BS2, and BS3 of reconstructed Frame i are used as the corresponding SRS. Thus, BS0 of reconstructed Frame i is no longer needed and can be replaced by BS2 of reconstructed Frame i + 1 (Figure 6). In the same way, BS1 of reconstructed Frame i and BS3 of reconstructed Frame i + 1 share the same BS buffer. Thus, a BS buffer is shared by two adjacent reconstructed frames. Comparing Figure 5 with Figure 4, the size of the frame buffer is reduced from 2 × W × H to W × H + 2 × N × W. As only current frames are loaded from off-chip memory and reference frames are cached on chip, Ra equals 1 for Inter-E of VC-SRME. Figure 5. Architecture of Inter-E for VC-SRME. A circular frame buffer is used. The grey part is the buffer shared between Frame i and Frame i + 1.
Figure 6. The processing order and the BS replacement strategy of the circular frame buffer for VC-SRME.

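The replacement pattern of Figures 5 and 6 amounts to addressing the H/N + 2 BS buffers modulo their count: consecutive reconstructed frames are laid out back-to-back in the circular buffer. The small model below is our own formulation of that addressing; it reproduces the sharing described above.

```python
def bs_slot(frame_idx, bs_idx, bs_per_frame):
    """Slot of a block strip (BS) in a circular frame buffer of
    H/N + 2 BS buffers.  Writing strip k of reconstructed Frame i + 1
    overwrites strip k - 2 of Frame i, which has already left the
    search range of the current frame."""
    slots = bs_per_frame + 2
    return (frame_idx * bs_per_frame + bs_idx) % slots
```

Because each frame occupies H/N consecutive slots out of H/N + 2, exactly two extra BS buffers separate the write front from the strips still needed as reference data, giving the W × H + 2 × N × W total noted above.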
For VC-MRME, we use two reference frames per current frame to explain how to implement the inter-frame data reuse scheme, but more reference frames are also supported. Figure 7 gives an implementation with three reference frame buffers on chip. The processing steps are shown in Figure 8. A circular buffer is used to reduce the frame buffer size (Figure 9). Frame i and Frame i + 2 share H/N − 2 BS buffers, and the frame buffer size is reduced from 3 × W × H to 2 × W × H + 2 × N × W. Ra is 1 for Inter-E of VC-MRME because only the current frame is loaded from off-chip memory and no reconstructed frame is stored to or loaded from off-chip memory. Figure 7. Architecture of Inter-E for VC-MRME. Three frame buffers are on chip to hold three reference frames.
Figure 8. Processing steps of the PEA for Inter-E of VC-MRME.

Level D Inter-Frame Data Reuse (Inter-D)
The on-chip memory size of Inter-E is at least W × H + 2 × N × W pixels, so we propose Inter-D to reduce the on-chip memory size. For Inter-D, multiple SRSs of the reconstructed frame are kept on chip at the same time. An SRS buffer is used to store one SRS of a reconstructed frame. In Figure 10, one PEA processes CBs of two current frames alternately in one time period, with two SRS buffers and two CB buffers (or one shared CB buffer) integrated on chip for VC-SRME. Each SRS buffer contains three BS buffers and is designed as a circular buffer. Frame i is the current frame for Frame i − 1 and the reference frame for Frame i + 1. We assume that SRV equals 2N. We seek a schedule in which a minimum number of pixels of reconstructed Frame i are stored to or loaded from off-chip memory. In Figure 11, current BSs of Frames i and i + 1 are processed alternately with the same scan order. After processing BS2 of Frame i in Step 3, reconstructed BS2 of Frame i is stored to SRS buffer 1 instead of off-chip memory, and then the reconstructed BS0, BS1, and BS2 of Frame i are all in SRS buffer 1, where they serve as the SRS for BS1 of Frame i + 1 in Step 4. In this way, reconstructed BSs of Frame i are always on chip in time for Frame i + 1 and do not need to be stored to or loaded from off-chip memory, so reconstructed Frame i is inter-frame reused. However, Frames i − 1 and i + 1 cannot be inter-frame reused. One frame needs to be stored to and loaded from off-chip memory every m frames. Therefore, Ra of Inter-D for VC-SRME is calculated according to Equations (2)-(4) as follows, where m is the number of current frames processed in one time period. When m equals 1, it reduces to Ra of Intra-D. Figure 11. The order of the BS in process for Inter-D of VC-SRME.

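The Ra derivation above can be captured in a one-line closed form, our reading of the text: every current frame is loaded once, and only one of every m reconstructed frames must be stored to and reloaded from off-chip memory.

```python
def ra_inter_d(m):
    """Ra of Inter-D for VC-SRME per Equations (2)-(4): one load per
    current frame, plus a store and a reload for one out of every m
    reconstructed frames.  m = 1 degenerates to Intra-D (Ra = 3)."""
    if m < 1:
        raise ValueError('m must be at least 1')
    return 1 + 2 / m
```

With m = 4 SRS buffers, Ra drops from 3 to 1.5, which is the 50% memory traffic reduction reported in Section 5.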
For VC-MRME, we also use two reference frames for one current frame to explain how to implement the inter-frame data reuse. Figure 12 gives an implementation with three current frames (Frames i, i + 1, and i + 2) processed in one time period. Three BS buffers are combined for Frames i − 2 and i + 1 to store the reconstructed SRS, and one more BS buffer is needed for reconstructed Frames i and i − 1 because these two frames are the reference frames of two current frames. The processing order of the BSs in the three current frames is shown in Figure 13. After one current BS of Frame i or i + 1 is processed, it is reconstructed and stored in the corresponding BS buffer. Ra of Inter-D for VC-MRME is also calculated according to Equations (2)-(4). r is the number of reference frames for one current frame and m is the number of current frames processed in one time period. It is assumed that m ≥ r. There are r reconstructed frames which need to be stored to and loaded from off-chip memory every m frames, so Ra is calculated as follows.
The buffer size of Inter-D for VC-MRME is as follows (Figure 12). We can improve data reuse efficiency by processing more current frames in one time period and adding more SRS buffers for Inter-D of VC-ME. If there is only one SRS buffer, no reconstructed pixels can be buffered on chip and inter-frame data reuse cannot be implemented. The data reuse degree grows as the number of SRS buffers increases, owing to the reduction of Ra. For example, only one of every m reconstructed frames is stored to off-chip memory when m SRS buffers are used for VC-SRME. In addition, Inter-D is compatible with Intra-D for VC-ME, and we can also increase parallelism by adding more PEAs.

Level C Inter-Frame Data Reuse (Inter-C)
We propose Inter-C to further reduce the on-chip memory size. Multiple SR buffers are integrated on chip, and an SR buffer is used to store one SR of a reference frame. In Figure 14, the PEA processes CBs of two current frames (Frames i and i + 1) alternately for VC-SRME. Frame i is the current frame for Frame i − 1 and the reference frame for Frame i + 1. Two SR buffers and two CB buffers are on chip. Both SRH and SRV equal 2N. We arrange the processing order of CBs as in Figure 15 so that part of reconstructed Frame i is kept on chip and used as the SR for the CBs of Frame i + 1. Bk stands for the kth block in a BS. Bk of BSj in Frame i + 1 is processed just after Bk + 1 of BSj + 1 in Frame i. Before Step 0, B0-BS0, B1-BS0, B0-BS1 and B1-BS1 of reconstructed Frame i are already in the SR buffer, and BS0 and BS1 of Frame i and BS0 of Frame i + 1 have been processed. After B0, B1, and B2 of BS2-Frame i are reconstructed, the SR for B1-BS1-Frame i + 1 is complete. When processing B1-BS1-Frame i + 1 in Step 3, only B2-BS0-Frame i and B2-BS1-Frame i are loaded from off-chip memory, and B2-BS2-Frame i is already in the SR buffer. In this way, the reconstructed data of Frame i are partly inter-frame reused in the SR buffer. However, Frames i − 1 and i + 1 are only intra-frame reused. A part of the reference frame is kept on chip for reuse, so Ra_intra of Inter-C is calculated as follows.
Every reconstructed frame must be stored to off-chip memory, so Ra_inter of Inter-C equals 2. Then Ra of Inter-C is 2 + SRV/N + 1/m. When m equals 1, it reduces to Ra of Intra-C. Figure 14. Architecture of Inter-C for VC-SRME. Two SR buffers are on chip.
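The closed form just given is stated explicitly in the text, so it can be coded and checked directly against the case-study numbers in Section 5.

```python
def ra_inter_c(N, sr_v, m):
    """Ra of Inter-C for VC-SRME: every reconstructed frame is stored
    (1) and every current frame loaded (1), while the partially reused
    search ranges cost SRv/N + 1/m reference loads per frame pixel.
    m = 1 degenerates to Intra-C (Ra = 3 + SRv/N)."""
    return 2 + sr_v / N + 1 / m
```

With N = 16, SRV = 32, and m = 4, Ra falls from 5.0 (Intra-C) to 4.25, matching the 15% bandwidth reduction reported for the 1080 p case.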
For VC-MRME, we use two reference frames per current frame to explain how to implement Inter-C. Figure 16 gives an implementation with three current frames processed in one time period: Frames i, i + 1, and i + 2. Parts of reconstructed Frames i − 2, i − 1, i, and i + 1 are buffered on chip as SRs for the current frames. Bk in BSj of Frame i + 1 is processed just after Bk + 1 in BSj + 1 of Frame i, and Bk − 1 in BSj − 1 of Frame i + 2 is processed just after Bk in BSj of Frame i + 1 (Figure 17).

Step 0 is an initial step to load twelve N × N blocks of two reference frames for B0-BS3-Frame i.
Steps 2 and 5 are initial steps to load six N × N blocks of two reference frames for B0-BS2-Frame i + 1 and B0-BS1-Frame i + 2, respectively. After the initial steps, only six N × N blocks of two reference frames are loaded for each CB of Frame i, and only three N × N blocks of two reference frames are loaded for each CB of Frames i + 1 and i + 2. r is the number of reference frames for one current frame and m is the number of current frames processed in one time period. It is assumed that m ≥ r. A part of the reference frames is kept on chip for reuse. Every m frames, there is one frame (Frame i in Figure 16) which needs to load H/N × (W + SRH) × (N + SRV) × r pixels, and the other m − 1 frames only need to load H/N × (W + SRH) × ((r − 1)N + SRV) pixels. So Ra_intra of Inter-C is calculated as follows. When r equals 1, Ra_intra is the same as that of VC-SRME.
All the data of the reconstructed frame must be stored to off-chip memory, so Ra_inter of Inter-C equals 2. Then Ra of Inter-C is as follows. The buffer size of Inter-C for VC-MRME is as follows. Calculated in the same way as for Inter-C, the buffer size of Inter-C+ for VC-MRME is as follows.
Figure 16. Architecture of Inter-C for VC-MRME.
We can also improve data reuse efficiency by adding more SR buffers and increase parallelism by adding more PEAs for VC-ME. The data reuse degree grows as the number of current frames processed in one time period increases. In addition, Inter-C (Inter-C+) is compatible with Intra-C (Intra-C+).

Analysis and Comparison of Different Data Reuse Levels
We list Ra and the on-chip memory requirement for different reuse levels of VC-SRME in Table 4, where m is the number of SR or SRS buffers on chip and n is the parameter of the C+ scheme. The proposed inter-frame data reuse schemes always have better reuse efficiency than their intra-frame counterparts. A larger m leads to a smaller Ra but a larger on-chip memory size. Ra of Inter-D is less than half of Ra of Intra-D when m is greater than 2, but the on-chip memory size of Inter-D is m times that of Intra-D. Compared with Inter-E, the on-chip memory sizes of Inter-D and Inter-C are reduced to m SRS buffers and m SR buffers, respectively. We give Ra and the on-chip memory size for different reuse levels of VC-MRME in Table 5, where m is the number of SR or SRS buffers on chip, n is the parameter of the C+ scheme, and r is the number of reference frames for one current frame. The proposed inter-frame data reuse schemes also show better reuse efficiency than their intra-frame counterparts when m ≥ r ≥ 2.

Implementation of Inter-D for VC-SRME
Inter-D gives a useful tradeoff between off-chip memory traffic and on-chip memory size, so we implement an Inter-D architecture for VC-SRME (Figure 18) with m = 4. The IME module in this implementation (Figure 19) mainly comprises one SAD tree [15] with an MV selector (41 SAD comparators), which can support variable block size ME. FME, IP, EC and DB are the other modules of a complete encoder architecture, but these modules are not included in our implementation. Fractional ME (FME) is usually the module after IME in an encoding pipeline. Instead of having its own on-chip buffer, FME can load data directly from the IME buffer [31], so the proposed data reuse scheme is compatible with FME. The CB Reg. Array is a 16 × 16 register array which is used to store the current block. A two-level memory hierarchy is implemented between IME and off-chip memory for loading reference pixels. We implement the Inter-D VC-SRME architecture in synthesizable Verilog HDL and list the synthesis results using a TSMC 65 nm GP technology at 360 MHz (Table 6). Figure 20 gives a picture of the implemented ME module in 65 nm. The proposed architecture is compared with other FSIME architectures using intra-frame data reuse. All the works in Table 6 use full search (FS) with a frame rate of 30 f/s and a block size of 16 × 16, but they support different resolutions, SRs, and numbers of reference frames. The reuse level (off-chip to on-chip) affects the off-chip memory bandwidth requirement. Our implementation adopts Inter-D data reuse, so it achieves the best off-chip memory access efficiency. However, the off-chip bandwidth is related not only to the data reuse level (Inter-D in the proposed architecture or Intra-D in [29]) but also to other parameters such as resolution, frame rate, SR, N, and so on. Some of these parameters differ between [29] and our work, so the off-chip bandwidth of our work (Inter-D) is greater than that of [29] (Intra-D).
If the parameters were all the same for the two works, Inter-D would need only half the off-chip memory traffic of Intra-D. In Section 5, we give the comparison between the proposed Inter-D and Intra-D [29] under the same parameters. The reuse level (on-chip) represents the reuse level from SRAM to registers on chip and affects the on-chip memory bandwidth requirement. Reference [32] proposed a novel data reuse method, "Inter-candidate + Inter-macroblock", to reduce the on-chip memory bandwidth to 5 GByte/s. This method can be regarded as an improved Intra-A or Inter-A. However, the goal of our proposed method is to reduce the off-chip memory traffic, which is usually the performance bottleneck. The inter-frame data reuse methods proposed in this paper are compatible with "Inter-candidate + Inter-macroblock". So we simply use Level A data reuse with "Inter-candidate + Inter-macroblock" to reduce the on-chip memory bandwidth. Note that some implementations [25,29] in Table 6 only consider the memory traffic of loading the reference frame when computing the off-chip or on-chip bandwidth requirement and neglect the memory traffic of loading current frames or storing reconstructed frames. Both Level A and Level B parallelism (eight SAD trees) are implemented in [25], while only Level A parallelism (one SAD tree) is implemented in our design. The gate count increases as the parallelism level increases because more PEAs are used. Because we use Inter-D data reuse, our implementation demands a larger on-chip SRAM size (241.92 KB) than the other three works. Figure 19. The implemented architecture for IME. Figure 20. Picture of the implemented ME module in 65 nm.

Experiment Results
We give three case studies (1080 p, 720 p, and 4 K) to analyze and compare different data reuse levels for VC-ME. Ra and the on-chip memory size of two different application scenarios are computed according to Tables 4 and 5, respectively, where m and n both equal 4. Intra-A, Intra-B, Intra-C, Intra-D, and Inter-E are traditional data reuse methods. Inter-A, Inter-B, Inter-C, Inter-D, and New Inter-E are the proposed inter-frame data reuse methods. Bandwidth in Equation (8) is the off-chip memory bandwidth requirement, and f is the frame rate.
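The bandwidth requirement follows from Ra, the resolution, and the frame rate. The sketch below assumes Equation (8) takes the form Ra × W × H × f with 8-bit (1-byte) pixels; the exact equation is not reproduced here, so treat the formula as our reconstruction.

```python
def bandwidth_gbs(ra, width, height, fps):
    """Off-chip memory bandwidth requirement in GByte/s: the
    redundancy access factor Ra times the frame size times the frame
    rate, assuming 1 byte per pixel (our assumption for Equation (8))."""
    return ra * width * height * fps / 1e9
```

For example, Intra-D (Ra = 3) at 1080 p and 30 f/s needs about 0.187 GByte/s, and halving Ra halves the bandwidth requirement.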
For VC-SRME (Table 7), the bandwidth requirement reduction of Inter-D is 50% compared with Intra-D [29], which is the largest memory traffic reduction ratio for VC-SRME. Inter-E needs 4.1 MB of on-chip memory for 1080 p, 1.8 MB for 720 p, and 16.6 MB for 4 K, and the on-chip memory size is reduced by nearly half after using the proposed circular frame buffer (New Inter-E) while the bandwidth requirement stays the same. Compared with no data reuse, Intra-A [20] and Intra-B [20] can reduce a large amount of memory traffic with a small on-chip memory size. However, they still demand too much memory bandwidth. The proposed three inter-frame data reuse schemes all have better data reuse efficiency than their intra-frame counterparts for all three specifications. For 1080 p and 4 K, the bandwidth requirement reductions of Inter-C, Inter-C+, and Inter-D are 15%, 21.4%, and 50%, respectively, compared with Intra-C [25,29], Intra-C+ [21], and Intra-D. For 720 p, the bandwidth requirement reductions of Inter-C, Inter-C+, and Inter-D are 18.8%, 23.1%, and 50%, respectively, compared with Intra-C, Intra-C+, and Intra-D. For VC-MRME (Table 8), the video sequence is encoded with one initial I frame, and the following frames are all encoded using the previously encoded and reconstructed r frames. r equals 4 in this case study. The on-chip memory size is r times that of VC-SRME for the corresponding intra-frame data reuse level because it has to keep r buffers for r reference frames. Compared with no data reuse, Intra-A and Intra-B can reduce a large amount of memory traffic with a small on-chip memory size. However, they still demand too much memory bandwidth: 24.0 GByte/s and 2.4 GByte/s for Intra-A and Intra-B, respectively. For 1080 p and 4 K, the bandwidth requirement reductions of Inter-C, Inter-C+, and Inter-D are 37.5%, 30.6%, and 50%, respectively, compared with their intra-frame counterparts. For 720 p, the bandwidth requirement reductions of Inter-C, Inter-C+, and Inter-D are 30%, 18.8%, and 50%, respectively, compared with their intra-frame counterparts.
m is an important factor affecting the data reuse efficiency of the new inter-frame data reuse scheme, so the bandwidth requirement for 1080 p with different m is given in Figure 21 for the different data reuse levels of VC-SRME. n equals 4 in the C+ scheme. The bandwidth requirements of all the intra-frame data reuse levels (including Intra-A and Intra-B, which are not shown in the figure) and of Inter-E are unchanged with the variation of m. When m equals 1, the bandwidth requirement of each inter-frame data reuse level is the same as that of its corresponding intra-frame data reuse level. The bandwidth requirements of Inter-C, Inter-C+, and Inter-D become lower as m increases, but the magnitude of the reduction becomes smaller because there is a constant term in the formula of Ra which does not change with m. The bandwidth requirement reduction of Inter-D is more effective than that of Inter-C or Inter-C+ because Inter-D can reuse the reconstructed frame more efficiently. The bandwidth requirement for 1080 p with different m is given in Figure 22 for the different data reuse levels of VC-MRME, where n equals 4 in the C+ scheme and r equals 2. The same trends hold: the intra-frame levels and Inter-E are unchanged with m; at m equal to 1 each inter-frame level matches its intra-frame counterpart; and the bandwidth requirements of Inter-C, Inter-C+, and Inter-D decrease with m, with diminishing reductions.
The value of r reflects how the number of reference frames affects the memory traffic of different data reuse levels for VC-MRME, so we give the bandwidth requirements for different values of r when n equals 4 and m equals 8 in Figure 23. There is a linear relationship between r and the bandwidth requirement of each data reuse level: a larger r leads to more memory traffic because more data need to be loaded from off-chip memory. The slope of an intra-frame data reuse scheme is larger than that of its inter-frame counterpart, which means that the memory traffic of intra-frame data reuse increases faster with r. Therefore, the proposed inter-frame data reuse scheme is more scalable than the traditional intra-frame data reuse scheme for multiple reference frame ME. The Level C parallelism defined in Section 2 is based on modifying the calculation of the predicted motion vector (MVP) [25], so a quality analysis is needed. Figure 24a,b shows the comparisons of RD curves between JM and the modified motion vector prediction. Many sequences from QCIF to HDTV have been tested; two of them are Racecar (720 × 288, 30 fps) and Taxi (672 × 288, 30 fps). From the figure, we find that the quality loss is near zero at high bit rates (larger than 1 Mb/s) and the quality is degraded by 0.1 dB at low bit rates [25]. We conclude that the coding performance of the modified MVP is competitive with that of JM. For Level A and B parallelism and the proposed data reuse levels, a standard motion vector prediction and a classical full search algorithm are used, and nothing that can affect quality is modified; therefore, the best quality is achieved. Our method effectively reduces the number of off-chip memory accesses, which positively affects power consumption.
We perform a power consumption analysis for the different data reuse schemes Intra-C, Inter-C, Intra-C+, Inter-C+, Intra-D, and Inter-D. We model the DRAM access power according to [33,34]. We assume that the static power is constant with respect to accesses and evenly distributed across all banks. The dynamic power is proportional to the read bandwidth and the write bandwidth. The equation for DRAM access power is as follows [34], where BWr and BWw represent the read bandwidth and write bandwidth in GB/s:
P (mW) = 5.85 + 753 × BWr + 671 × BWw
Table 9 gives the power consumption of the 1080 p case for VC-SRME according to Table 7. We find that the power consumption of each inter-frame data reuse scheme is lower than that of its intra-frame counterpart. Table 9. Power analysis of 1080 p for VC-SRME.
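The power model above is a simple affine function of the read and write bandwidths and can be coded directly; the constants are those quoted from [34].

```python
def dram_power_mw(bw_read_gbs, bw_write_gbs):
    """DRAM access power in mW per the model of [34]: a constant
    static term plus terms proportional to the read and write
    bandwidths (in GB/s)."""
    return 5.85 + 753 * bw_read_gbs + 671 * bw_write_gbs
```

Because the dynamic terms dominate, any scheme that halves the off-chip read and write traffic nearly halves the DRAM access power, which is why the inter-frame schemes come out ahead in Table 9.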

Conclusions
In this paper, we propose a novel inter-frame data reuse scheme for FSIME in image sensors with video compression. The new scheme improves data reuse efficiency of the reconstructed frame for VC-ME. The proposed inter-frame data reuse scheme effectively reduces the number of off-chip memory accesses and outperforms the traditional intra-frame scheme on memory bandwidth requirement.