SAD-Based Stereo Vision Machine on a System-on-Programmable-Chip (SoPC)

This paper proposes a novel solution for a stereo vision machine based on the System-on-Programmable-Chip (SoPC) architecture. SoPC technology provides convenient access to many hardware devices, such as DDRII, SSRAM and Flash, through IP reuse. The system hardware is implemented in a single FPGA chip that includes a 32-bit Nios II microprocessor, a configurable soft IP core in charge of managing the image buffer and the users' configuration data. The Sum of Absolute Differences (SAD) algorithm is used for dense disparity map computation. The circuits of the algorithmic module are modeled with the Matlab-based DSP Builder. With a set of configuration interfaces, the machine can process stereo image pairs of many different sizes, up to a maximum of 512 K pixels. The machine is designed for real-time stereo vision applications: with an FPGA clock of 90 MHz, 23 disparity maps of 640 × 480 pixels can be obtained per second with a 5 × 5 matching window and a maximum disparity of 64 pixels.


Introduction
The major task of a stereo vision system is to reconstruct the 3D representation of a scene from 2D images captured by cameras that are fixed with their optical axes parallel and separated by a certain distance. The 3D information can be applied to complex tasks such as robot navigation, obstacle detection and lane detection [1,2].
Stereo matching algorithms play an important role in stereo vision. They can be classified as either local or global correspondence methods. Local methods match a window region centered at a pixel of interest in one image with a similar window region in the other image by searching along epipolar lines. The disparity is the distance between the two window regions with the highest similarity. The performance of local stereo matching algorithms depends to a large extent on the similarity metric selected. Typical similarity metrics are cross-correlation (CC), the sum of absolute differences (SAD), the sum of squared differences (SSD) and the census transform (CENS). SSD and SAD find correspondences by minimizing the sum of squared differences or the sum of absolute differences, respectively, in W × W windows.
As is well known, stereo matching is computationally and data intensive because it has to perform an identical operation on a large number of pixels. Consequently, dedicated hardware is most often required.
There are various examples of stereo vision algorithms implemented on FPGAs in the literature. The circuit in [3] is a stereo vision system based on a Xilinx Virtex II using the SAD algorithm. The system can process images with a size of 270 × 270 at a frame rate of 30 fps. Paper [4] presents an FPGA-based stereo matching system that operates on 512 × 512 stereo images with a maximum disparity of 255 and achieves a frame rate of 25.6 fps at a clock frequency of 286 MHz. In [5], a development system based on four Xilinx XCV2000E chips is used to implement a dense, phase correlation-based stereo system that runs at a frame rate of 30 fps for 256 × 360 pixel stereo pairs. Gardel et al. introduce in [6] their design, which can obtain 30,000 depth points from images of 2 Mpix at a frame rate of 50 frames per second under a 100 MHz working frequency. A real-time fuzzy hardware module based on a color SAD window-based technique is proposed in [7]. Given a stereo image pair with a resolution of 640 × 480 pixels and a disparity range of 80 pixels, this module can theoretically provide accurate disparity map computation at a rate of nearly 440 frames per second, without considering memory delay and other time-consuming factors. The design in [8] is a 7 × 7 binary adaptive SAD-based real-time stereo vision architecture with a depth range of 80, which is implemented on the Altera Cyclone II EP2C70 FPGA chip for 800 × 600 color images and operates in real time at a frame rate of 56 Hz. The architecture captures the 90 Megapixel/sec 12-bit signals of two cameras in real time and does not require memories external to the FPGA.
This paper proposes a new architecture that can solve the matching problem for image resolutions varying from 256 × 256 to 695 × 555 pixels by using the SAD stereo matching algorithm. The hardware is based on SoPC technology and all circuits are implemented on a single Cyclone II FPGA chip.
The Nios II processor, a configurable soft IP core, is added to the system to manage the buffer addresses of the stereo images in SSRAM and to transfer the users' configuration data to other hardware modules through the Avalon bus interface. The disparity computation unit, modeled with the Matlab-based DSP Builder, computes the SAD values of 5 × 5 pixel windows and extracts the disparity from the 64 SAD candidates. The stereo matching controller, designed in Verilog HDL, updates the line buffer data in the on-chip dual-port RAM (DPRAM) and writes the disparities back to the off-chip DDRII SDRAM. The whole system can produce 640 × 480 dense disparity maps at a frame rate of 23 fps under a 90 MHz working clock frequency.
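As a rough sanity check on these figures, the average cycle budget per output pixel can be derived from the clock frequency, the resolution and the frame rate. The short Python calculation below uses only the numbers quoted in the text; the variable names are illustrative and not taken from the design.

```python
# Cycle budget implied by the reported figures (names are illustrative).
CLOCK_HZ = 90_000_000      # working clock stated in the text
WIDTH, HEIGHT = 640, 480   # disparity map resolution
FPS = 23                   # reported frame rate

pixels_per_second = WIDTH * HEIGHT * FPS
cycles_per_pixel = CLOCK_HZ / pixels_per_second

print(f"{pixels_per_second} disparities/s, "
      f"about {cycles_per_pixel:.2f} clock cycles per pixel")
```

The result, roughly 12.7 clock cycles per disparity value, is an average over the whole frame, so it also absorbs any per-row overhead.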

DSP Builder Design Flow
DSP Builder integrates the algorithm development, simulation, and verification capabilities of the MathWorks MATLAB and Simulink system-level design tools with the Altera Quartus II software and third-party synthesis and simulation tools. DSP Builder works within the Simulink environment. The designer can combine Simulink blocks with DSP Builder blocks to verify system-level specifications and perform simulation. Figure 1 shows the DSP Builder system-level design flow. The modules of our design mentioned in Section 4 are all modeled and simulated in Matlab/Simulink. Researchers interested in those models can e-mail the author to ask for the mdl files. Usually, we model the algorithm modules in Simulink, not the whole system. After automatic HDL generation, the algorithm modules can easily be added to the top-level design file of the system: we only need to add a *.qip file (Quartus II IP file) to the Quartus II project and instantiate the algorithm module in the top-level HDL file.

SoPC Architecture for SAD-Based Stereo Vision Machine
The system proposed herein is divided into the main modules as shown in Figure 2: (a) Nios II processor system: It consists of a 32-bit Nios II processor core, a set of on-chip peripherals, on-chip memory and interfaces to off-chip memory. The different modules of the system are interconnected with the Altera Avalon Memory-Mapped (Avalon-MM) interface, which is used for the read and write interfaces on master and slave components in a memory-mapped system [9]. The stereo images are stored in the off-chip SSRAM because it offers a shorter read cycle than the DDRII SDRAM. The start address and resolution of the images are passed to special function registers of the disparity computation unit (DCU) through the Avalon interface by the C code executing on the Nios II processor. The stereo matching controller (SMC) starts to initialize the line buffers of the left and right images after the start bit of the system is set by the processor. Pixel data are sent from the line buffers to the DCU continuously until the whole dense disparity map is established.

Hardware Implementation of the Disparity Computation Unit
The SAD algorithm has the advantage of computational efficiency. The SAD equation used for 5 × 5 windows with a maximum disparity of 64 is:

$$\mathrm{SAD}(i, j, disp) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} \left| P_R(i+m,\, j+n) - P_L(i+m,\, j+n+disp) \right| \qquad (1)$$

where disp is the disparity value ranging from 0 to 63, P_R(i, j) serves as the reference pixel in the right image and P_L(i, j+disp) as the currently analyzed candidate pixel in the left image.
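A plain software model can make the SAD matching rule concrete. The Python sketch below is only a reference model under stated assumptions, not the hardware described in this paper: images are lists of rows, the window is centered on the reference pixel, and ties are resolved in favor of the smallest disparity (the text does not specify tie handling). The function names `sad` and `disparity` are hypothetical.

```python
def sad(right, left, i, j, disp, half=2):
    """SAD between the window centered at right[i][j] and the window
    centered at left[i][j + disp]; half=2 gives a 5 x 5 window."""
    total = 0
    for m in range(-half, half + 1):
        for n in range(-half, half + 1):
            total += abs(right[i + m][j + n] - left[i + m][j + n + disp])
    return total


def disparity(right, left, i, j, max_disp=64, half=2):
    """Disparity in [0, max_disp) with the smallest SAD.
    Ties go to the smallest disparity (an assumption)."""
    best_disp, best_sad = 0, None
    for d in range(max_disp):
        s = sad(right, left, i, j, d, half)
        if best_sad is None or s < best_sad:
            best_disp, best_sad = d, s
    return best_disp


# Synthetic example: a bright 5 x 5 patch that sits 17 columns further
# right in the left image than in the right image.
right = [[100] * 5 for _ in range(5)]
left = [[100 if 17 <= c < 22 else 0 for c in range(68)] for _ in range(5)]
print(disparity(right, left, 2, 2))  # prints 17
```

The left image in the example is 68 columns wide because the 64 candidate windows for one pixel together span 5 + 63 columns.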
The reference 5 × 5 window centered at P_R(i, j) is compared with 64 candidate windows to calculate 64 SAD values. Calculating disparity(i, j), the disparity value of pixel (i, j), involves 25 bytes of right-image data and 340 bytes of left-image data (5 rows × 68 columns, since the 64 candidate windows together span 5 + 63 columns). Different disparity calculations clearly have many operations in common. For example, as shown in Figure 3, the data involved in calculating disparity(3,3) and disparity(3,4) differ by only 10 bytes (5 bytes each from the right and the left image). We use two shift-taps to temporarily store the data used in calculating the disparity. The shift-tap for the right image has 25 taps and the one for the left image has 340 taps. We propose two principles for feeding data to two shift-taps accordingly: [...]

The block diagram of the DCU is shown in Figure 4. The two shift-taps receive image data from the buffer management unit serially and feed it to the 64-SAD processing element (SAD-PE) in parallel. The computation of a 5 × 5 SAD needs a 25-input parallel adder, which costs too many logic elements in the FPGA; therefore, we arrange only 32 parallel SAD calculators in the SAD-PE. Figure 5 illustrates the layout plan of the parallel SAD calculator. The SAD-PE can calculate 32 SAD values within 5 clock cycles (the 25-input parallel adder has four pipeline stages inside). With an input switch signal, it can finish 64 SAD computations in six clock cycles. Four additional clock cycles should be [...]

In front of every absolute difference (AD) calculator, there is a multiplexer which separates the 266 bytes of left image data into two groups with an offset of 32 × 5 = 160 bytes. As an example, if In1A is P_L(0,0), the first pixel of the 5 × 5 window centered at P_L(2,2), then In1B should be assigned to P_L(0,32), the first pixel of the 5 × 5 window centered at P_L(2,34).
Under the control of the SW signal, the 32 SAD processing modules determine the 64 SAD values in two groups within six clock cycles.
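The two-group scheme can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden illustration, not the DSP Builder design: it assumes the left shift-tap stores pixels column by column (5 bytes per column, which is what the 32 × 5 = 160 byte offset suggests), and the names `tap_index`, `sad_pass` and `all_sads` are hypothetical. Pipelining and timing are not modeled.

```python
ROWS, WIN, CALCULATORS = 5, 5, 32
GROUP_OFFSET = 32 * ROWS   # 160 bytes between input groups A and B


def tap_index(row, col):
    """Position of pixel (row, col) in a column-ordered shift-tap
    (assumed layout)."""
    return col * ROWS + row


def sad_pass(right_taps, left_taps, sw):
    """One pass of the 32 SAD calculators. sw=0 selects input group A
    (disparities 0..31), sw=1 selects group B (disparities 32..63)."""
    offset = GROUP_OFFSET if sw else 0
    results = []
    for k in range(CALCULATORS):
        total = 0
        for col in range(WIN):
            for row in range(ROWS):
                r = right_taps[tap_index(row, col)]
                l = left_taps[tap_index(row, col + k) + offset]
                total += abs(r - l)
        results.append(total)
    return results


def all_sads(right_taps, left_taps):
    """All 64 SAD values, obtained from two passes over the same calculators."""
    return sad_pass(right_taps, left_taps, 0) + sad_pass(right_taps, left_taps, 1)


# 25 right taps and 340 left taps; the matching patch is at disparity 40.
right_taps = [100] * 25
left_taps = [100 if 40 <= c < 45 else 0 for c in range(68) for _ in range(ROWS)]
sads = all_sads(right_taps, left_taps)
print(sads.index(min(sads)))  # prints 40
```

With this layout, `tap_index(0, 32)` equals `tap_index(0, 0) + GROUP_OFFSET`, which mirrors the In1A/In1B example: the same calculator input sees P_L(0,0) in one pass and P_L(0,32) in the other.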

Design Principle of the Stereo Matching Controller
The controller is the commander of the stereo matching processing machine with three main functions listed below:

Line Buffer Management
There are two DPRAMs placed in the FPGA. Each has 1,024 × 16 = 16,384 bytes of memory space and acts as a line buffer for one image of the stereo pair. The controller initializes and updates the line buffers using data read from the image frame buffers in the off-chip SSRAM through the Avalon-MM read master interface. The line buffers always store 16 lines of image data, provided that the number of pixels per line is lower than 1,024. The direct mapping method is used to locate the pixel address in the line buffer. The mapping formula is shown as follows: Select Bit (2) where J is the pixel address in the line buffer, X and Y are the pixel coordinates of the image, and Linepixel is the number of horizontal pixels. The line buffer management process is explained through the finite state machine (FSM) in Figure 8 and Table 1. The FSM for line buffer management in the SMC has nine states. It is activated by two signals: the start signal of the system and the update signal of the SMC. When the start signal is set, the FSM reads the first 16 lines of pixels of each stereo image to fill the two line buffers. After initialization of the line buffers, the FSM enters the idle state. The update signal moves the FSM into the updating buffer state, in which it reads one word (4 bytes of pixel data) from each image in the SSRAM, replaces the oldest 4 pixels in the line buffers, and then becomes idle again.

Disparity Write Back
When the system finishes computing the disparity of a pixel, a direct memory access controller is invoked to write it to the disparity table. To avoid conflicts with the SSRAM reads caused by line buffer update events, the dense disparity table is located in the off-chip DDRII SDRAM. The write back address is generated by Equation (3):

$$\mathrm{Addr} = \mathrm{D\_ADDR} + (Y \times \mathrm{Linepixel} + X) \times 2 \qquad (3)$$

where Addr is the write back address, D_ADDR is the base address of the disparity table in the DDRII SDRAM, input by the Nios II processor, X and Y are the pixel coordinates, Linepixel is the number of horizontal pixels, and the multiplication by 2 indicates that every disparity value occupies two bytes of memory.
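The address arithmetic can be checked with a few lines of Python. The sketch below assumes a row-major disparity table with two bytes per entry, as the definitions of Addr, D_ADDR, X, Y and Linepixel suggest; the function name and the base address used in the example are hypothetical.

```python
def write_back_address(d_addr, x, y, line_pixel):
    """Byte address of the disparity of pixel (x, y) in a row-major
    table with 2 bytes per entry (assumed layout)."""
    return d_addr + (y * line_pixel + x) * 2


BASE = 0x1000  # hypothetical base address of the disparity table
print(hex(write_back_address(BASE, 0, 0, 640)))   # first entry sits at the base
print(write_back_address(BASE, 3, 2, 640) - BASE)  # byte offset of pixel (3, 2)
print(640 * 480 * 2)                               # table size in bytes for 640 x 480
```

Under this layout a 640 × 480 disparity map occupies 614,400 bytes of DDRII SDRAM.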

Stereo Matching Process Control
This is the most significant function in the whole system. The process control is performed by an FSM involving only six states, each with different tasks. The tasks performed in each state are described in the following list. Figure 9 is the FSM diagram of stereo matching process control.
(a) After system reset, the FSM enters the IDLE state automatically. In this state, all variables are initialized.
(b) After the two line buffers are initialized with 16 lines of pixels, the FSM moves to the INIT_SHIFTTAP state. In this state, 25 data bytes from the right line buffer and 340 data bytes from the left line buffer are read out and sent into the shift-taps concurrently, and then the FSM moves to the CALCULATE_DISP state.
(c) In the CALCULATE_DISP state, the stereo machine spends two clock cycles setting the switch signal and sending an activation signal to wake the latching data module. The latching data module is in charge of latching the 64 SAD values, finding the disparity from them, and then writing it to the DDRII SDRAM. The module executes concurrently with this FSM; therefore, the time spent waiting for the disparity computation is decreased from 10 clocks to 2 clocks.
(d) In the UPDATE_COORDINATE state, the variables X and Y, the coordinates of the pixel currently being processed, are updated. There are three conditions for state transition in this state: (1) if X is smaller than the number of horizontal pixels, the next state is FEEDING_SHIFTTAP; (2) if X is equal to the number of horizontal pixels and Y is smaller than the number of vertical pixels, the next state is INIT_SHIFTTAP; (3) if X is equal to the number of horizontal pixels and Y is equal to the number of vertical pixels, the next state is DONE.
(e) The task of the FEEDING_SHIFTTAP state is to read 5 bytes of new pixel data from each line buffer and send them into the shift-taps, and then to change to the CALCULATE_DISP state.
(f) The DONE state indicates that the whole dense disparity map has been generated.
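The state sequence described above can be traced with a small behavioral model. The Python sketch below is a simplification under several assumptions that are not stated in the text: CALCULATE_DISP is followed directly by UPDATE_COORDINATE, X is incremented once per pixel and Y once per completed row, the line buffers are always ready, and image borders are ignored. The function name `run` is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INIT_SHIFTTAP = auto()
    CALCULATE_DISP = auto()
    UPDATE_COORDINATE = auto()
    FEEDING_SHIFTTAP = auto()
    DONE = auto()


def run(width, height):
    """Walk the FSM over a width x height image and return the visited states."""
    state, x, y = State.IDLE, 0, 0
    trace = [state]
    while state is not State.DONE:
        if state is State.IDLE:
            x, y = 0, 0                       # variables are initialized
            state = State.INIT_SHIFTTAP       # line buffers assumed ready
        elif state in (State.INIT_SHIFTTAP, State.FEEDING_SHIFTTAP):
            state = State.CALCULATE_DISP
        elif state is State.CALCULATE_DISP:
            state = State.UPDATE_COORDINATE   # assumed transition
        elif state is State.UPDATE_COORDINATE:
            x += 1                            # assumed update rule
            if x < width:
                state = State.FEEDING_SHIFTTAP
            else:
                y += 1                        # row finished
                if y < height:
                    x = 0
                    state = State.INIT_SHIFTTAP
                else:
                    state = State.DONE
        trace.append(state)
    return trace


trace = run(3, 2)
print(sum(s is State.CALCULATE_DISP for s in trace))  # prints 6, one per pixel
```

In this model the shift-taps are fully reloaded once per row (INIT_SHIFTTAP) and topped up with one new column for every other pixel (FEEDING_SHIFTTAP), which reflects the data reuse between neighboring disparity calculations.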

Results and Discussion
The stereo matching circuit has been realized on an Altera Cyclone II EP2C70F672C6N device mounted on the Altera DSP Development Kit, Cyclone II Edition board, as shown in Figure 10. It is clocked by an external 100 MHz crystal. Figure 11 is a screenshot of the Nios II system designed in the SOPC Builder of the Quartus II software. In Figure 11, the item named "stereo_dma_0" is an IP package that includes the DCU and SMC modules described in Sections 4 and 5. The item "pcounter" is a performance counter unit used to measure the time taken to process a disparity map. The performance counter, the only mechanism available with the Nios II development kits, provides measurements with little intrusion [10].