
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/)

In this paper we present a fast, specialized hardware implementation of the wavefront phase recovery algorithm using the CAFADIS camera. The CAFADIS camera is a new plenoptic sensor patented by the Universidad de La Laguna (Canary Islands, Spain): international patent PCT/ES2007/000046 (WIPO publication number WO/2007/082975). It can simultaneously measure the wavefront phase and the distance to the light source in real time. The pipelined algorithm is implemented using Field Programmable Gate Arrays (FPGAs). These devices offer an architecture capable of handling the sensor output stream with a massively parallel approach, and they are fast enough to meet the processing time requirements of several Adaptive Optics (AO) problems in Extremely Large Telescopes (ELTs). The FPGA implementation of the wavefront phase recovery algorithm using the CAFADIS camera is based on the very fast computation of two-dimensional fast Fourier transforms (FFTs). We have therefore carried out a comparison between our novel FPGA 2D-FFT and other implementations.

The resolution of ground-based astronomical observations is strongly affected by atmospheric turbulence above the observation site. In order to achieve resolution close to the diffraction limit of the telescopes, AO techniques have been developed to offset wavefront distortion as it passes through turbulent layers in the atmosphere.

AO includes several steps: detection of the phase gradients, wavefront phase recovery, transmission of information to the actuators, and their mechanical movement. The next generation of extremely large telescopes (with diameters from 50 to 100 meters) will demand significant technological advances to keep the telescope segments aligned (phasing of segmented mirrors) and also to offset atmospheric aberrations. For this reason, faster wavefront phase reconstruction seems to be of utmost importance, and new wavefront sensor designs and technologies must be explored. The CAFADIS camera presents a robust optical design that can meet AO objectives even when the references are extended objects (elongated LGS and solar observations). The CAFADIS camera is an intermediate sensor between the Shack-Hartmann and the pyramid sensor. It samples an image plane using a microlens array. The pupil phase gradients can be obtained from there, and after that, the phase recovery problem is the same as in the Shack-Hartmann sensor.

In this work, our main objective is to select a wavefront phase reconstruction algorithm that is both accurate and fast enough, and then to implement it on an FPGA platform, paving the way to meeting the computational requirements imposed by the number of actuators of an ELT within a 6 ms limit, which is the atmospheric response time.

The modal estimation of the wavefront consists in using the slope measurements to fit the coefficients of an aperture function in a phase expansion of orthogonal functions. These functions are usually Zernike polynomials or complex exponentials, but there are other possibilities, depending on the pupil mask. Very fast algorithms can be implemented when using complex exponential polynomials because the FFT kernel is the same [

We will start by describing the modal Fourier wavefront phase reconstruction algorithm and how the Fast Fourier Transform fits the FPGA architecture, analyzing the obtained efficiency and comparing it to implementations in other technologies and platforms. We designed an initial 64 × 64 fully pipelined phase recovery prototype using the synthesized 2D-FFT module. The system was successfully tested in circuit using simulated phase gradients as input data. Finally, we analyze the obtained efficiency and compare it to modal wavefront reconstruction on a high-end CPU.

The CAFADIS plenoptic sensor samples the signal ψ_{telescope}

The final phase resolution depends on the number of pixels sampling each microlens, and the depth resolution depends on the same quantity. This implies that increasing the phase resolution also increases the height resolution. At one extreme, when using a pyramid sensor (2 × 2 microlenses), the phase and height resolutions are maximized. At the other extreme, when using a Shack-Hartmann wavefront sensor, the phase resolution depends on the number of subpupils, and the height resolution is minimized (or even lost).

A compromise solution can be adopted: a single plenoptic sensor composed of 6 × 6 subpupils, each sampled by 84 × 84 pixels, would be enough to obtain phases with 84 × 84 pixel resolution using only one 504 × 504 pixel detector. Alternatively, in order to avoid detector contamination from the neighboring LGS images, the plenoptic image could be sampled by 12 × 12 subpupils. In this case, a 1,008 × 1,008 detector is needed [
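The detector sizes quoted above follow from multiplying the number of subpupils per side by the number of pixels sampling each one. A minimal check of that arithmetic (the function name `detector_side` is ours, for illustration only):

```python
def detector_side(subpupils_per_side: int, pixels_per_subpupil: int) -> int:
    """Side length, in pixels, of a square detector tiled by square subpupil images."""
    return subpupils_per_side * pixels_per_subpupil


print(detector_side(6, 84))   # 504
print(detector_side(12, 84))  # 1008
```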

The phase gradients at the pupil plane are calculated from this plenoptic frame using the partial derivatives of the wavefront aberration estimated in [

The gradient is then written:

Making a least squares adjustment over the _{pq}

The phase can then be recovered from the gradient data by reverse transformation of the coefficients:

Recovering the phase therefore requires a filter composed of three two-dimensional Fourier transforms. In order to accelerate the process, an exhaustive study of the crucial FFT algorithm was carried out, which allowed the FFT to be specifically adapted to the modal wavefront recovery pipeline and to the FPGA architecture.
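For reference, the standard Fourier modal formulation that the description above follows can be sketched as below. The symbols (N for the grid size, S^x and S^y for the measured gradients, a_{pq} for the expansion coefficients), the unitary FFT normalization and the sign convention are assumptions on our part and may differ from those of the original equations.

```latex
% Phase expanded in complex exponentials on an N x N grid
\phi(x,y) = \sum_{p,q} a_{pq}\, Z_{pq}(x,y), \qquad
Z_{pq}(x,y) = \frac{1}{N}\, e^{\frac{2\pi i}{N}(px+qy)}
\;\;\Longrightarrow\;\; \phi = \mathrm{IFFT}\{a_{pq}\}

% Gradient of the expansion (p, q taken as signed frequencies)
S^{x}(x,y) = \frac{\partial \phi}{\partial x} = \sum_{p,q} \frac{2\pi i\,p}{N}\, a_{pq}\, Z_{pq}(x,y), \qquad
S^{y}(x,y) = \frac{\partial \phi}{\partial y} = \sum_{p,q} \frac{2\pi i\,q}{N}\, a_{pq}\, Z_{pq}(x,y)

% Least-squares fit of the coefficients to the measured gradients
a_{pq} = -\,i\,\frac{N}{2\pi}\;
\frac{p\,\mathrm{FFT}\{S^{x}\}_{pq} + q\,\mathrm{FFT}\{S^{y}\}_{pq}}{p^{2}+q^{2}},
\qquad (p,q) \neq (0,0)
```

The coefficient at p = q = 0 (the piston term) cannot be determined from gradients and is usually set to zero. The three transforms are the two forward FFTs of the gradients and the inverse FFT of the coefficients.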

The global control system to be developed is shown in

We will focus on the FPGA implementation from

The block diagram of the designed recoverer is depicted in _{pq}

The design takes into account an analysis of the equations and a parallel architecture for their implementation. We then break the design down into the following steps or stages:

Compute two real forward 2D FFT that compute FFT (

Compute the complex coefficients

Carry out a complex inverse 2D FFT on _{pq}

Flip data results
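The stages above can be sketched in NumPy as follows. This is an illustrative floating-point model, not the FPGA design: the function name, the use of signed angular frequencies and the sign convention are assumptions, and the final data flip of the hardware pipeline has no counterpart here because `numpy.fft.ifft2` already returns its result in natural order.

```python
import numpy as np


def recover_phase(sx, sy):
    """Least-squares Fourier recovery of a periodic phase from its gradients.

    sx, sy: real N x N arrays holding d(phi)/dx and d(phi)/dy, indexed [y, x].
    Returns the real N x N phase with zero mean (piston is unobservable).
    """
    n = sx.shape[0]
    # Signed angular frequencies in radians per sample (assumed convention).
    k = 2.0 * np.pi * np.fft.fftfreq(n)
    kx, ky = np.meshgrid(k, k, indexing="xy")  # kx varies along axis 1 (x)

    # Stage 1: two forward 2D FFTs of the real gradient maps.
    fsx = np.fft.fft2(sx)
    fsy = np.fft.fft2(sy)

    # Stage 2: complex coefficients; the zero-frequency term is set to 0.
    denom = kx**2 + ky**2
    denom[0, 0] = 1.0  # avoid division by zero at p = q = 0
    a = -1j * (kx * fsx + ky * fsy) / denom
    a[0, 0] = 0.0

    # Stage 3: one complex inverse 2D FFT gives the phase.
    return np.fft.ifft2(a).real


# Usage: a synthetic periodic phase and its analytic gradients.
n = 64
y, x = np.mgrid[0:n, 0:n]
wx, wy = 2 * np.pi * 2 / n, 2 * np.pi * 3 / n
phi = np.sin(wx * x) + 0.5 * np.cos(wy * y)
sx = wx * np.cos(wx * x)
sy = -0.5 * wy * np.sin(wy * y)
phi_rec = recover_phase(sx, sy)
print(np.max(np.abs(phi_rec - phi)))  # residual at floating-point level
```

The test phase is band-limited and periodic on the grid, so the recovery is exact up to rounding; real gradient data with noise or a non-periodic aperture would not behave this cleanly.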

Generally, each butterfly requires one complex multiplier and two complex adders. Multipliers in particular consume a large amount of FPGA silicon area because they are implemented with adder trees. Various implementation proposals have been made to save area by removing these multipliers [

However, in order to implement multipliers efficiently, the latest Virtex-4 FPGA devices incorporate specific arithmetic modules called DSP48 slices. Each DSP48 slice has a two-input multiplier followed by multiplexers and a three-input adder/subtractor. With these circuits, the FPGA needs only four clock cycles to calculate a complex multiplication, at clock rates of up to 550 MHz in the XC4VSX35 Virtex-4 [

The complete pipelined radix-2 butterfly can be implemented with this specialized multiplier. An FPGA Look-Up Table (LUT), configured as an SRL16 shift register, is needed to keep the data paths synchronized. The implemented butterfly is depicted in
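As a software illustration of the arithmetic a radix-2 butterfly performs (one complex multiplication and two complex additions), the sketch below uses the decimation-in-time form and builds the complex product from four real multiplications. The decimation-in-time choice and the function names are assumptions for illustration; the pipelining, the DSP48 mapping and the SRL16 delay lines of the hardware are not modeled.

```python
def complex_mul(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using four real multiplies and two adds."""
    return ar * br - ai * bi, ar * bi + ai * br


def butterfly_dit(a, b, w):
    """Radix-2 decimation-in-time butterfly on (re, im) tuples.

    Returns (a + w*b, a - w*b): one complex multiplier, two complex adders.
    """
    tr, ti = complex_mul(b[0], b[1], w[0], w[1])
    return (a[0] + tr, a[1] + ti), (a[0] - tr, a[1] - ti)


# Integer inputs, as in a fixed-point datapath; twiddle w = -j.
upper, lower = butterfly_dit((1, 2), (3, -1), (0, -1))
print(upper, lower)  # (0, -1) (2, 5)
```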

A pipelined radix-2 FFT can be implemented using one butterfly at each stage. The twiddle coefficients used in each stage are stored in twiddle LUT ROMs in the FPGA. In our implementation, the logic resources and the clock cycles of the FFT module are reduced by using specific butterfly modules at the first and second stages. The first stage exploits a property of the twiddle factors associated with the first stages of the pipeline:

So, the first stage can be implemented in a very simple way with an adder/subtractor. In the second stage, the next twiddle factors are:

This twiddle suggests a similar splitting structure in the second pipeline stage as in the first one; however, the imaginary unit imposes a special consideration: two additional multiplexers change real and imaginary data, and the pipeline adder/subtractor works according to
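To illustrate why the first two stages need no general multiplier, the sketch below lists the twiddle factors of 2-point and 4-point decimation-in-time butterflies and shows that multiplying by -j amounts to swapping the real and imaginary parts and negating one of them. The indexing W_N^k = exp(-2πjk/N) and the helper names are assumptions; the original equations are not reproduced here.

```python
import cmath


def twiddles(n_points):
    """Twiddle factors W_N^k = exp(-2*pi*j*k/N) for k = 0 .. N/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(n_points // 2)]


def mul_by_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: a swap plus one sign change, no multiplier."""
    return im, -re


# Twiddles rounded to integers and shown as (re, im) pairs.
print([(round(w.real), round(w.imag)) for w in twiddles(2)])  # [(1, 0)]
print([(round(w.real), round(w.imag)) for w in twiddles(4)])  # [(1, 0), (0, -1)]
print(mul_by_minus_j(3, 4))  # (4, -3)
```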

Taking into account these features, the 1D-FFT architecture implementation is depicted in

The system calculates the FFT with no scaling. The unscaled, full-precision method was used to avoid error propagation. This option avoids overflow because the output data have more bits than the input data. The data precision at the output is:

The number of bits on the output of the multipliers is much larger than the input and must be reduced to a manageable width with the use of one-cycle symmetric rounding stages (

Taking into account the clock cycles of each block in

When the number of points of the FFT is a power of 4, it is computationally more efficient to use a radix 4 algorithm instead of radix 2. The reasoning is the same as in radix 2 but subdividing iteratively a sequence of _{4}_{2}_{2}_{2}

When the number of points is a power of 4, the pipelined radix-4 FFT module has half the arithmetic stages, but the swap modules need twice as many clock cycles to arrange the data. The latency is then expressed as:
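The halving of the arithmetic stages follows from the number of stages of a radix-r pipeline, which is log_r N for an N-point transform. A small sketch of that count (the function name is ours, and the extra swap-module cycles mentioned above are not modeled):

```python
def fft_stages(n_points: int, radix: int) -> int:
    """Number of butterfly stages of a radix-`radix` FFT of n_points points."""
    stages, size = 0, 1
    while size < n_points:
        size *= radix
        stages += 1
    if size != n_points:
        raise ValueError("n_points is not a power of the radix")
    return stages


print(fft_stages(1024, 2))  # 10
print(fft_stages(1024, 4))  # 5
print(fft_stages(4096, 4))  # 6
```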

This time estimation has been carried out for other radices, as shown in the following equations:

Generalizing:

Several radix-2 FFTs were successfully synthesized in an XC4VSX35 Virtex-4 FPGA. Our design has been compared with other implementations. The combination of FPGA technology and the developed architecture achieves better performance than the other alternatives. This is shown in

For a first prototype of the phase recoverer, we have selected a plenoptic sensor with 64 × 64 pixels sampling each microlens. Calculating the corresponding 64 × 64 2D-FFT is equivalent to applying a 1D-FFT to the rows of the matrix and then applying a 1D-FFT to the columns of the result. Traditionally, the parallel and pipelined algorithm is then implemented in the following four steps:

Compute the 1D-FFT for each row

Transpose the matrix

Compute the 1D-FFT for each column

Transpose the matrix
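The row-column decomposition in the four steps above can be checked numerically with a few lines of NumPy. This is a software sketch only (the function name is ours); it confirms that the result matches a direct 2D FFT.

```python
import numpy as np


def fft2_row_column(frame):
    """2D FFT built from 1D FFTs and two transpositions."""
    step1 = np.fft.fft(frame, axis=1)   # 1. 1D-FFT of each row
    step2 = step1.T                     # 2. transpose
    step3 = np.fft.fft(step2, axis=1)   # 3. 1D-FFT of each original column
    return step3.T                      # 4. transpose back


rng = np.random.default_rng(0)
frame = rng.standard_normal((64, 64))
print(np.allclose(fft2_row_column(frame), np.fft.fft2(frame)))  # True
```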

Continuous data processing using a single dual-port memory (real and imaginary) is not possible: newly transformed data would have to wait until the old data had been fed into the second FFT block, otherwise data would be overwritten, and the pipeline property of the FFT architecture could not be exploited. This problem is avoided by using two memories instead of one, which continuously alternate between write and read modes. While the odd memory is being read to feed data values into the second FFT module, the even memory is being written with data arriving from the first FFT. Data flow is therefore continuous throughout the two-dimensional transform. The memory modes always alternate, and the mode is selected by the counter. The same signal switches the multiplexer that selects the data entering the column transform unit.

It is worth mentioning that the transposition step defined above (Step 2) is implemented simultaneously with the transfer of the column data vector to the memory, with no delay penalty. In this way, the counter acts as an address generation unit. The last transposition step (Step 4) is not implemented, in order to save resources and obtain a fast global system. The last transposition is therefore taken into account only at the end of the algorithm described in
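The alternating two-memory scheme and the transposition by addressing can be modeled in software as follows. This is a behavioral sketch under our own assumptions (whole frames per counter step rather than one sample per clock cycle, hypothetical names); it only illustrates that reading one memory column-wise while the other is being written yields a continuous stream of transposed 2D FFTs.

```python
import numpy as np


def fft2_stream(frames):
    """2D FFT of a stream of square frames using two alternating memories.

    Each result is the TRANSPOSED 2D FFT of its frame, because the final
    transposition (Step 4) is deliberately omitted.
    """
    n = frames[0].shape[0]
    memory = [np.zeros((n, n), dtype=complex) for _ in range(2)]
    results = []
    for counter in range(len(frames) + 1):
        write_sel = counter % 2    # memory currently in write mode
        read_sel = 1 - write_sel   # memory currently in read mode
        if counter >= 1:
            # Read side: column-wise addressing performs the transposition
            # (Step 2) while feeding the second (column) FFT.
            results.append(np.fft.fft(memory[read_sel].T, axis=1))
        if counter < len(frames):
            # Write side: store the row FFTs of the incoming frame.
            memory[write_sel][:] = np.fft.fft(frames[counter], axis=1)
    return results


rng = np.random.default_rng(1)
frames = [rng.standard_normal((8, 8)) for _ in range(3)]
out = fft2_stream(frames)
print(all(np.allclose(o, np.fft.fft2(f).T) for o, f in zip(out, frames)))  # True
```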

Row 1D-FFT block and column 1D-FFT block are not identical due to the unscaled data precision. So, a 64 × 64 2D-FFT for the phase recoverer must meet certain requirements. If the precision of data input is 8 bits, the output data of 1D-FFT of the rows has to be 15 bits according
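The growth from 8 to 15 bits quoted above is consistent with a rule in which an unscaled N-point FFT adds log2(N) + 1 bits to the input word length. The sketch below assumes that rule; it is our reading of the figures, not a formula taken from the text, and the function name is hypothetical.

```python
import math


def unscaled_output_bits(input_bits: int, n_points: int) -> int:
    """Output word length of an unscaled FFT, assuming growth of log2(N) + 1 bits."""
    return input_bits + int(math.log2(n_points)) + 1


row_out = unscaled_output_bits(8, 64)
print(row_out)                            # 15
print(unscaled_output_bits(row_out, 64))  # 22 under the same assumed rule
```

Under this assumption the column FFT receives wider input words than the row FFT, which is one way to see why the two blocks cannot be identical.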

Several FFTs were implemented on an XC4VSX35 Virtex-4 device and the numerical results were satisfactorily compared with MATLAB simulations. As we show in

Taking into account the latency of the FFT (

The design of a 64 × 64 phase recoverer was programmed using the VHDL hardware description language [

The implemented architecture is pipelined. It allows new phase data to be obtained every 4,096 clock cycles (this number coincides with the number of points of the transforms, that is, the number of subpupils, 64 × 64, of the CAFADIS camera). Using a 100 MHz clock, the prototype provides new phase data every 40.96 μs.
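The throughput and latency figures can be reproduced from the cycle counts: one result per 4,096 cycles at 100 MHz, and a total latency equal to the sum of the stage latencies listed in the execution-time table. The variable and function names below are ours.

```python
CLOCK_MHZ = 100  # prototype clock quoted in the text


def cycles_to_us(cycles: int, clock_mhz: float = CLOCK_MHZ) -> float:
    """Convert a clock-cycle count to microseconds."""
    return cycles / clock_mhz


# Latency in clock cycles of each stage, as listed in the execution-time table.
stage_cycles = {
    "2D-FFT (Sx and Sy)": 4438,
    "Multipliers": 4,
    "Adder": 1,
    "2D-IFFT": 4438,
    "Flip-RAM": 4096,
    "Rounding (3)": 3,
}

print(cycles_to_us(64 * 64))                     # 40.96
print(sum(stage_cycles.values()))                # 12980
print(cycles_to_us(sum(stage_cycles.values())))  # 129.8
```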

These results can be compared with other works. Rodriguez-Ramos

A 64 × 64 wavefront recoverer prototype was synthesized with a Xilinx XC4VSX35 Virtex-4 as the sole computational resource. This FPGA is provided in an ML402 XtremeDSP evaluation platform. Our prototype was designed using ISE Foundation 8.2 and the ModelSim 6.0 simulator. The system has been successfully validated in the FPGA chip using simulated data.

A two-dimensional FFT is implemented as the core algorithm of the recoverer, and processing times are very short. The system can process data in much less time than the atmospheric response time. This allows more phases to be introduced into the adaptive optics process. This demonstrates the viability of FPGAs for AO in ELTs.

Future work is expected to focus on optimizing the 2D-FFT using other algorithms (radix-8, radix-16) and on implementing a larger recoverer in Virtex-5 and Virtex-6 devices, as needed for the 84 × 84 recoverer using the CAFADIS camera. Such prototypes could be four times faster than with Virtex-4 FPGA devices. Moreover, the system should be tested on a telescope in the near future.

This work has been partially supported by “Programa Nacional de Diseño y Producción Industrial” (Project DPI 2006-07906) of the “Ministerio de Educación y Ciencia” of the Spanish government, and by “European Regional Development Fund” (ERDF).

Outline of the Plenoptic camera used as wavefront sensor.

Section of the plenoptic frame showing the LGS on axis. The remaining, off-axis LGS present a similar aspect.

Modules of the control system.

Architecture of the synthesized phase recoverer.

Pipeline radix-2 butterfly in FPGA.

Architectural block diagram of a pipeline radix-2 FFT.

(a) Normalized latency. (b) Relative improvement regarding radix-2 algorithm.

Execution times in microseconds for various algorithms of 1024-points FFT using different technologies.

Block diagram of the implemented 2D-FFT.

Phase gradients and original and recovered phase for a CAFADIS camera with 64 × 64 subpupils.

2D-FFT performance comparison with other designs.

114.5 μs | 1,580 μs | - | 44.4 μs | |

811.0 μs | 1,680 μs | 2,380 μs | 170.8 μs |

Execution time (latency) for the different stages of the phase recoverer.

Stage | Clock cycles | Time at 100 MHz |
2D-FFT (Sx and Sy) | 4,438 | 44.38 μs |
Multipliers | 4 | 0.04 μs |
Adder | 1 | 0.01 μs |
2D-IFFT | 4,438 | 44.38 μs |
Flip-RAM | 4,096 | 40.96 μs |
Rounding (3) | 3 | 0.03 μs |
Total | 12,980 | 129.8 μs |

Virtex-4 resources.

13993 (91%) | 13201 (62%) | 19478 (63%) | 39 (8%) | 74 (38%) | 53 (27%) | 189.519 MHz |