This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to determine the depth at which each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel, pipelined architecture that implements the algorithm without external memory. Although this considerably increases the use of the device's block RAM (BRAM), real-time constraints are met by exploiting parallelism and by accessing several memories simultaneously. Quantitative results with 16-bit precision show that the output is very close to that of the original Matlab implementation.

3D reconstruction has been a very active research field for many years. The problem can be approached with active techniques, in which the system interacts with the scene, or with passive techniques, in which the system does not interact with the scene but captures images from several viewpoints in order to reconstruct its depth information.

With passive techniques, two views are enough to reconstruct 3D information from the scene by means of a stereo algorithm. These techniques can be generalized to more than two views, in which case they are called multistereo techniques. Both stereo and multistereo are generally based on finding correspondences between the pixels of several images taken from different viewpoints. This is known as the correspondence problem, and it generally requires an optimization process to find the best match between pixels.

The correspondence problem can be solved within the Markov Random Field (MRF) framework [

CAFADIS is a 3D video camera patented by the University of La Laguna that performs depth reconstruction in real time. The CAFADIS camera is an intermediate sensor between the Shack-Hartmann and the pyramid sensor [

The optimization process is very slow, so specific hardware has to be used to achieve real-time performance. A first prototype of the CAFADIS camera for 3D reconstruction was built using a computer provided with multiple Graphical Processing Units (GPUs) and achieving satisfactory results [

The FPGA technology makes the sensor applications small-sized (portable), flexible, customizable, reconfigurable and reprogrammable with the advantages of good customization, cost-effectiveness, integration, accessibility and expandability [

In this sense, the main objective of this work is to select an efficient belief propagation algorithm and to implement it on an FPGA platform, paving the way towards meeting the real-time processing and size requirements of the CAFADIS camera. The fast, specialized hardware implementation of the belief propagation algorithm was carried out and compared favorably with other existing FPGA implementations of the same algorithm.

The rest of the paper is structured as follows: we will start by describing the belief propagation algorithm. Then, Section 3 describes the design of the architecture. Section 4 explains the obtained results and, finally, the conclusions and future work are presented.

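The belief propagation algorithm is used to optimize an energy function in an MRF framework. The energy function is composed of a data term $E_d$ and a smoothness term $E_s$, $E = E_d + \lambda E_s$, where the parameter $\lambda$ measures the relative importance of each term. The data term is the sum of the per-pixel data costs, $E_d = \sum_{p} c_p(d)$, where $c_p(d)$ is the data cost at pixel $p$ for depth level $d$. The smoothness term is based on the 4-connected neighbors of each pixel and can be written as $E_s = \sum_{p,q} V_{pq}(d_p, d_q)$, where $p$ and $q$ are two neighboring pixels and $V_{pq}(d_p, d_q)$ is the smoothness cost between their depth levels.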

The energy function is optimized using an iterative message passing scheme that passes messages over the 4-connected neighbors of each pixel in the image grid. Each message consists in a vector of _{q}

After a certain number of iterations

The depth value for pixel _{q}^{2}^{2})

Two of the approaches used in [_{pq}(d_{p}, d_{q})

The transformation of the general message update rule gives the following update rule:

This allows computation of the message update for each pixel in

On the other hand, the image grid can be split into two sets so that the outgoing messages of a pixel in set A depend only on the incoming messages from neighbors in set B, and vice versa.
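The split into two sets can be illustrated with a short sketch. It assumes that the sets are the two colors of a checkerboard, chosen by the parity of x + y; the function names are hypothetical and are not taken from the VHDL design. The check confirms that every 4-connected neighbor of a pixel lies in the opposite set, which is what allows one set to be updated while the other is only read.

```python
def checkerboard_set(x: int, y: int) -> str:
    """Assign pixel (x, y) to set 'A' or 'B' by the parity of x + y (assumed rule)."""
    return "A" if (x + y) % 2 == 0 else "B"


def neighbors4(x: int, y: int, nx: int, ny: int):
    """Yield the 4-connected neighbors of (x, y) that lie inside an nx-by-ny grid."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        qx, qy = x + dx, y + dy
        if 0 <= qx < nx and 0 <= qy < ny:
            yield qx, qy


def split_is_bipartite(nx: int, ny: int) -> bool:
    """Return True if every neighbor of every pixel belongs to the opposite set."""
    return all(
        checkerboard_set(qx, qy) != checkerboard_set(x, y)
        for y in range(ny)
        for x in range(nx)
        for qx, qy in neighbors4(x, y, nx, ny)
    )
```

Because a 4-connected step always changes x + y by exactly one, the property holds for any grid size, so only half of the pixels need to be updated in each iteration.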

The global control system to be developed is shown in

We will focus on the FPGA implementation from

The algorithm can be accelerated by using the parallel processing power of FPGAs instead of other classical technology platforms. In our implementation the improvements come from two design decisions:

Arithmetic computations are pipelined and performed in parallel wherever possible.

All depth planes of the architecture are processed in parallel.

Taking into account these considerations, the overall implemented architecture is depicted in

Finally, the smoothing module compares the new values obtained from all levels and the new values are stored in the message passing memory after smoothing (

These steps are performed using gray pixels in odd iterations and the white pixels in even iterations (

Simultaneously, the

The implementations of each of the modules that make up the overall architecture are detailed below.

According to the algorithm, each memory plane consists of one cost memory and four message-passing memories.

Taking into account

To calculate the new messages associated with a given pixel, the up-memory must supply the value of its right, the down-memory, the value of the left, and the left and right-memories should access the top and bottom positions respectively. This addressing causes conflicts at the ends of the arrays. In

The software algorithms solve these conflicts using zero padding. This implies an extra memory of 8Nx + 8Ny + 16 for each plane in a hardware implementation. A second approach is to avoid this zero padding. As shown in

However, the FPGA's internal memory is a critical resource when implementing this algorithm and the final design optimizes the memory usage by eliminating the above mentioned excesses. Instead of increasing memory sizes, additional logic was added in the address generator design in order to indicate when an address is valid. With this alternative design, the size of the memory is minimized. Furthermore, the size is the same for all the memories, making the VHDL implementation more modular and flexible.
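As an illustrative check, the zero-padding overhead of 8Nx + 8Ny + 16 words per plane is consistent with adding a one-pixel border to each of the four message-passing memories of a plane. That interpretation and the function name below are assumptions made for this sketch; they are not stated in the design description.

```python
def padded_extra_words(nx: int, ny: int, message_memories: int = 4) -> int:
    """Extra words per plane if every message memory gets a one-pixel zero border.

    Assumption: the border is one pixel wide on all four sides, so a single
    memory grows from nx*ny to (nx+2)*(ny+2) words.
    """
    border = (nx + 2) * (ny + 2) - nx * ny  # equals 2*nx + 2*ny + 4
    return message_memories * border        # equals 8*nx + 8*ny + 16 for 4 memories
```

For a 64 × 64 image this gives 1,040 extra words per plane, which is the cost avoided by generating a validity flag in the address generator instead.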

The block diagram of the address generator and control signals are depicted in

The operation of the module is as follows: the x-counter is enabled when the

The effective address is generated using

The control unit provides

The validity of the message addresses can be calculated using only the

This module is responsible for performing the calculations of the message passing algorithm according to the equations. The implemented module is depicted in

The value of the

Intermediate values of the arithmetic module are conveniently rounded. So, the input precision is the same as the output precision (generic

The module is synthesized

This module performs the smoothness corresponding to the last line of the pseudo-code of _{2}

This block selects the plane that contains the minimum of

The update of the messages takes 13 clock cycles (9 from the arithmetic and 2 + ⌈log_{2}

A first script was successfully tested using Matlab. Then the design was programmed using the VHDL hardware description language, simulated using ModelSim, and XST was used to synthesize these modules. An overview of the module operation is shown in a functional simulation (

Depth estimation using multistereo is less clear than using stereo because the cost function is more complex. Even so, quantitative results with 16-bit precision are very close to those of the original Matlab implementation.

The implemented architecture is pipelined and permits continuous data streaming. The use of internal memory allows simultaneous access to the messages for each direction and each plane. In addition, all arithmetic computations have been replicated for each plane, so that the number of cycles needed to obtain the final depth map is independent of the number of planes. Taking this and the checkerboard algorithm into account, the cycles for the operation of the module are:

These results can be contrasted with other works. [

Block RAMs are the critical resource for the implementation of the system in an FPGA device.

The current investigation develops a first FPGA implementation for depth map estimation using the belief propagation algorithm for the CAFADIS plenoptic sensor. The main contribution of this work is the use of FPGA technology for processing the huge amount of data from the plenoptic sensor. FPGA technology features are an important consideration in the CAFADIS camera. Real-time depth reconstruction is ensured by the high-performance signal processing and conditioning capabilities obtained through parallelism based on FPGA slices and arithmetic circuits, and by highly flexible interconnection possibilities. Furthermore, a single FPGA can meet the size requirements of a portable video camera. The low cost of an FPGA implementation should also allow the camera to be sold at a moderate price in the future.

However, the implementation of the algorithm requires a very large internal memory. This storage requirement is one of the most important limitations when targeting the Virtex-4, Virtex-5 and Virtex-6 FPGA families, and the development platform has to be replaced by a subsequent generation of FPGA. Quantitative results with 16-bit precision are very close to those of the original Matlab implementation. Our results have been compared with other FPGA implementations of belief propagation, and our implementation is faster.

The belief propagation design was developed in functional VHDL and is technology-independent, so the system can be implemented on any sufficiently large FPGA. Xilinx has just announced the release of 28-nm Virtex-7 FPGAs. These devices provide the highest performance and capacity for FPGAs (up to 65 Mb) [

In the future, we will implement this architecture in a Virtex-7 and integrate it in a real-time multistereo vision system. The goal is to obtain a fully portable system.

This work has been partially supported by “Programa Nacional de Diseño y Producción Industrial” (Project AYA 2009-13075) of the “Ministerio de Educación y Ciencia” of the Spanish government, and by “European Regional Development Fund” (ERDF).

Overall system to be integrated in a portable video camera.

Architecture of the designed belief propagation system.

Memory addressing for even iterations.

Memory addressing for odd iterations.

Architectural block diagram of the address generator.

Architectural block diagram of the arithmetic core.

Diagram of the smoothing operation.

Functional simulation of belief propagation for a 64 × 64 frame and 10.

Lightfield captured with a plenoptic camera. Image taken from [

Pseudo-code for the algorithm.

Nx and Ny determine the size of the image, and Nz is the number of planes.

Address generation for the example.

Iteration | Address | Address + 3 | Address − 3 | Address + 1 | Address − 1 |
odd | 0 | 3 | out | 1 | out |
even | 1 | 4 | out | 2 | 0 |
odd | 2 | 5 | out | out | 1 |
even | 3 | 6 | 0 | 4 | out |
odd | 4 | 7 | 1 | 5 | 3 |
even | 5 | 8 | 2 | out | 4 |
odd | 6 | 9 | 3 | 7 | out |
even | 7 | 10 | 4 | 8 | 6 |
odd | 8 | 11 | 5 | out | 7 |
even | 9 | out | 6 | 10 | out |
odd | 10 | out | 7 | 11 | 9 |
even | 11 | out | 8 | out | 10 |
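The address table above can be reproduced with a small software model of an address generator that flags invalid neighbors instead of relying on zero padding. The sketch assumes a row-major layout with 3 columns and 4 rows, which is what the offsets of 3 and 1 in the table suggest. The function names, and the rule that pixels with even x + y carry the "odd" label, are inferred from the table rows and are not taken from the VHDL design.

```python
def neighbor_addresses(addr: int, nx: int, ny: int):
    """Return (addr + nx, addr - nx, addr + 1, addr - 1) for a row-major grid.

    A neighbor that would fall outside the nx-by-ny grid is reported as 'out',
    mimicking a validity flag rather than a padded memory.
    """
    x, y = addr % nx, addr // nx
    plus_row = addr + nx if y < ny - 1 else "out"
    minus_row = addr - nx if y > 0 else "out"
    plus_one = addr + 1 if x < nx - 1 else "out"
    minus_one = addr - 1 if x > 0 else "out"
    return plus_row, minus_row, plus_one, minus_one


def address_table(nx: int, ny: int):
    """Build one row per pixel: (iteration label, address, four neighbor addresses)."""
    rows = []
    for addr in range(nx * ny):
        x, y = addr % nx, addr // nx
        # Labeling inferred from the table: even x + y is handled in 'odd' iterations.
        label = "odd" if (x + y) % 2 == 0 else "even"
        rows.append((label, addr, *neighbor_addresses(addr, nx, ny)))
    return rows
```

Calling `address_table(3, 4)` yields the twelve rows listed in the table, with the invalid positions appearing only on the borders of the grid.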

Execution time for the belief algorithm in FPGA.

Nx | Ny | Iterations | Cycles | Time (ms) |
64 | 64 | 10 | 22,539 | 0.11 |
64 | 64 | 25 | 53,259 | 0.27 |
120 | 160 | 10 | 105,611 | 0.53 |
120 | 160 | 25 | 249,611 | 1.25 |
128 | 128 | 10 | 90,123 | 0.45 |
128 | 128 | 25 | 213,003 | 1.07 |
256 | 256 | 10 | 360,459 | 1.80 |
256 | 256 | 25 | 851,979 | 4.26 |
512 | 512 | 10 | 1,441,803 | 7.21 |
512 | 512 | 25 | 3,407,883 | 17.04 |
1,024 | 1,024 | 10 | 5,767,179 | 28.84 |
1,024 | 1,024 | 25 | 13,631,499 | 68.16 |
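The cycle counts in the execution-time table follow a simple pattern. The sketch below is an empirical fit to the tabulated values, not a formula taken from the design: each iteration appears to cost Nx·Ny/2 cycles, which is consistent with updating half of the pixels per iteration under the checkerboard scheme, plus one further pass of the same length and a fixed 11 cycles. It takes the third column (10 or 25) to be the number of iterations. The 200 MHz clock is likewise an assumption, inferred from the ratio between the cycle and time columns when the times are read as milliseconds.

```python
def bp_cycles(nx: int, ny: int, iterations: int) -> int:
    """Cycle count fitted to the execution-time table (empirical, assumed form)."""
    half_frame = nx * ny // 2  # pixels updated per checkerboard iteration
    return half_frame * (iterations + 1) + 11


def bp_time_ms(nx: int, ny: int, iterations: int, clock_hz: float = 200e6) -> float:
    """Execution time in milliseconds for an assumed clock frequency."""
    return bp_cycles(nx, ny, iterations) / clock_hz * 1e3
```

Under these assumptions a 1,024 × 1,024 frame with 25 iterations takes about 68 ms, so the count depends only on the image size and the number of iterations and not on the number of planes.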

FPGA internal memory resources.

Device | Nx × Ny × Nz | Block RAM type | Block RAMs used |
XC4SX35 Virtex-4 | 64 × 64 × 4 | RAMB16 1K × 16 | 80/192 (41%) |
XC5SX50 Virtex-5 | 64 × 64 × 4 | BRAM 2K × 16 | 40/132 (30%) |
XC5SX50 Virtex-5 | 64 × 64 × 8 | BRAM 2K × 16 | 80/132 (60%) |
XC6VLX240 Virtex-6 | 64 × 64 × 4 | BRAM 2K × 16 | 40/416 (9%) |
XC6VLX240 Virtex-6 | 64 × 64 × 8 | BRAM 2K × 16 | 80/416 (19%) |
XC6VLX240 Virtex-6 | 128 × 128 × 4 | BRAM 2K × 16 | 160/416 (38%) |
XC6VLX240 Virtex-6 | 128 × 128 × 8 | BRAM 2K × 16 | 320/416 (77%) |
XC6VLX240 Virtex-6 | 256 × 128 × 4 | BRAM 2K × 16 | 320/416 (77%) |
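The block RAM counts in the memory-resources table can be estimated from the memory organization described earlier, in which each plane has one cost memory and four message-passing memories. The sketch below additionally assumes that every memory stores Nx·Ny words of 16 bits and is built from whole blocks of the stated depth (1K or 2K words). The function name and parameters are hypothetical.

```python
import math


def bram_blocks(nx: int, ny: int, nz: int, block_depth: int,
                memories_per_plane: int = 5) -> int:
    """Estimate the block RAMs needed for nz planes of an nx-by-ny image.

    memories_per_plane = 5 corresponds to one cost memory plus four
    message-passing memories; block_depth is the number of 16-bit words
    held by one block RAM (1024 or 2048 in the table).
    """
    blocks_per_memory = math.ceil(nx * ny / block_depth)
    return memories_per_plane * nz * blocks_per_memory
```

With these assumptions, memory use grows linearly with both the image area and the number of planes, which is why a 128 × 128 image with 8 planes and a 256 × 128 image with 4 planes require the same 320 blocks.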