This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This paper presents a novel phase unwrapping architecture for accelerating the computational speed of digital holographic microscopy (DHM). A fast Fourier transform (FFT) based phase unwrapping algorithm providing a minimum squared error solution is adopted for hardware implementation because of its simplicity and robustness to noise. The proposed architecture is pipelined to maximize computational throughput. Moreover, the number of hardware multipliers and dividers is minimized to reduce hardware costs. The proposed architecture is used as custom user logic in a system on programmable chip (SOPC) for physical performance measurement. Experimental results reveal that the proposed architecture expedites the computation effectively while consuming few hardware resources, making it suitable for designing an embedded DHM system.

Digital holographic microscopy (DHM) [

A simple raster scan algorithm is able to perform realtime phase unwrapping. However, in the presence of noise, the raster-scan algorithm may lead to an accumulation of error that eventually results in large deviations near the end of the accumulation. Popular approaches to the robust phase unwrapping include least square techniques, where the unwrapped phase is obtained as the function whose discrete gradient has the least squares deviation from its available estimate. The Poisson equations are then derived for the optimization, which can be solved by the preconditioned conjugate gradient (PCG) [

A number of implementations for fast phase unwrapping have been proposed [

The goal of this paper is to present a novel phase unwrapping hardware architecture for accelerating the computational speed of DHM. The algorithm [

Based on the algorithm [

The proposed architecture has been implemented on FPGA devices so that it can operate in conjunction with a softcore CPU. Using the reconfigurable hardware, we are then able to construct a system on programmable chip (SOPC) for the physical performance measurement of phase unwrapping in embedded systems. Compared with its software counterpart running on an Intel Core i7 quad-core CPU, the proposed system has a significantly lower computation time for phase unwrapping. In particular, when the image resolution is 513 × 513, the proposed system attains a speedup of 605 over its software counterpart. These results demonstrate the effectiveness of the proposed architecture.

This section briefly reviews the algorithm adopted for the hardware phase unwrapping implementation. Please refer to [_{i,j} be the wrapped phase function of an unknown real-valued function _{i,j} for 0 ≤ _{i,j} ≤ ^{jζi,j} = ^{jϕi,j}. Let _{i,j} for 0 ≤ _{i,j} using the mirror reflection technique. That is,

Let _{i,j} be an estimation of _{i,j} based on _{i,j}. The goal of the phase unwrapping algorithm is to find _{i,j} minimizing
_{i,j} is the solution to
_{i,j} and _{i,j} are periodic, the Fourier transform can be used to solve _{m,n} and Γ_{m,n} are the Fourier transforms of _{i,j} and _{i,j}, respectively. The function _{i,j} is obtained by the inverse Fourier transform to _{i,j} is then obtained by restricting the results to the grid defined by 0 ≤

Based on the discussions shown above, the phase unwrapping algorithm using the FFT is summarized as follows:

Step 0: Suppose _{i,j}, 0 ≤

Step 1: Compute _{i,j}, 0 ≤

Step 2: Compute Γ_{m,n}, 0 ≤

The 2D-FFT operates as follows.

Step 2.1 For each row _{i,j}, compute _{i,j}, 0 ≤

That is, _{i,j} = _{i,j} for 0 ≤ _{i,j} = _{i,2N−j} for

Step 2.2 Compute Λ_{i,n}, 0 ≤ _{i,j}, 0 ≤

Step 2.3 Replace _{i,j} by Λ_{i,n} with the restriction that 0 ≤

Step 2.4 After all of the rows are processed in this way, repeat the process (Step 2.1–2.3) on columns.

Step 3: Compute Φ_{m,n} using

Step 4: Compute the inverse FFT of Φ_{m,n} to obtain _{i,j}.
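
The steps above can be illustrated in software. The following is a minimal NumPy sketch, not the hardware design: it assumes the standard FFT-based least-squares formulation (wrapped differences of the mirror-reflected phase, followed by division by 2cos(πm/N) + 2cos(πn/N) − 4 in the frequency domain), because the equations themselves are not reproduced in this copy of the text. The function names `wrap`, `mirror2d` and `unwrap_fft` are hypothetical, and the sketch builds the full 2N × 2N mirrored array explicitly rather than exploiting its symmetry.

```python
import numpy as np


def wrap(a):
    """Wrap values into the interval [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi


def mirror2d(a):
    """Whole-sample mirror reflection: (N+1) x (N+1) -> 2N x 2N."""
    a = np.concatenate([a, a[-2:0:-1, :]], axis=0)
    return np.concatenate([a, a[:, -2:0:-1]], axis=1)


def unwrap_fft(psi):
    """Least-squares unwrapping of a wrapped (N+1) x (N+1) phase map (assumed formulation)."""
    n1 = psi.shape[0]
    N = n1 - 1
    ext = mirror2d(psi)  # periodic, mirror-reflected wrapped phase

    # Wrapped forward differences on the periodic extension.
    dx = wrap(np.roll(ext, -1, axis=0) - ext)
    dy = wrap(np.roll(ext, -1, axis=1) - ext)

    # Right-hand side of the discrete Poisson equation.
    rho = (dx - np.roll(dx, 1, axis=0)) + (dy - np.roll(dy, 1, axis=1))

    # Solve in the frequency domain.
    P = np.fft.fft2(rho)
    c = 2.0 * np.cos(np.pi * np.arange(2 * N) / N)
    denom = c[:, None] + c[None, :] - 4.0
    denom[0, 0] = 1.0      # avoid 0/0 at the DC term
    Phi = P / denom
    Phi[0, 0] = 0.0        # the mean of the solution is arbitrary

    phi = np.real(np.fft.ifft2(Phi))
    return phi[:n1, :n1]   # restrict to the original grid
```

For a noise-free phase whose neighbouring samples differ by less than π, this sketch recovers the phase up to an additive constant.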

_{i,j}, 0 ≤ _{i,j}. The FFT unit is then adopted for computing Γ_{m,n}. After that, the post-transform unit is used for calculating Φ_{m,n}. Finally, the FFT unit is used again for computing _{i,j} based on Φ_{m,n}. The on-chip memory is used for storing both the original data and the intermediate and final results of the pre-transform unit, the FFT unit and the post-transform unit. Storing the original data and intermediate results in the on-chip memory effectively reduces the memory access time for the algorithm.

The on-chip memory consists of two identical RAM modules, each able to store an (N+1) × (N+1) array. The modules are shared by all the units in the proposed architecture and hold the original data and the intermediate results produced by each unit, which then serve as the source data for subsequent operations. Keeping these data on chip significantly reduces the memory access time for phase unwrapping.

The goal of the pre-transform unit is to implement Step 1 of the algorithm in hardware.

The source data for the pre-transform operations, _{i,j}, 0 ≤ _{i,j}, 0 ≤ _{i,j} in two steps. At the first step, _{i,j} is retrieved from the on-chip RAM 1 to compute

At the second step, _{i,j} is retrieved again from the on-chip RAM 1 to compute
_{i,j}, which will then be stored back to the on-chip RAM 1 and RAM 2 for the subsequent FFT operations. _{i,j} are stored in on-chip RAM 1 and RAM 2, respectively. By storing the results to two modules, the computation precision is then doubled for subsequent operations.

Note that, as shown in _{i,j}, which is the mirror reflected version of _{i,j} in accordance with _{i,j} is based on _{i,j} instead of _{i,j}. When 0 ≤ _{i,j} = _{i,j}, the _{i,j} stored in on-chip RAM 1 is used as _{i,j}. Otherwise, _{i,j} should be computed using _{−1,}_{j}_{N}_{+1,}_{j}_{i,−1}, and _{i,N+1}, 0 ≤ _{0,j}, _{N,j}, _{i,0} and _{i,N}. Using _{−1,j} = _{1,j}, _{N+1,j} = _{N−1,j}, _{i,−1} = _{i,1}, and _{i,N+1} = _{i,N−1}, it is not necessary to design a circuit for mirror reflection for the pre-transform unit. We only have to reconfigure the address generator in the unit so that when _{−1,j}, _{N+1,j}, _{i,−1}, or _{i,N+1} are desired, the address of _{1,j}, _{N−1,j}, _{i,1} or _{i,N−1} will be delivered to on-chip RAM 1, respectively.
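
The address remapping described above can be sketched as follows. This is an illustrative software model only: the row-major address layout and the names `reflect_index` and `neighbour_addresses` are assumptions, not details taken from the hardware design.

```python
def reflect_index(k, n):
    """Map an index in [-1, n+1] to its mirror-reflected index in [0, n]."""
    if k < 0:
        return -k            # index -1 maps to 1
    if k > n:
        return 2 * n - k     # index n+1 maps to n-1
    return k


def neighbour_addresses(i, j, n):
    """Addresses of the four neighbours of (i, j) in an (n+1) x (n+1) array.

    A row-major layout (address = row * (n + 1) + column) is assumed here.
    """
    width = n + 1
    return {
        "up": reflect_index(i - 1, n) * width + j,
        "down": reflect_index(i + 1, n) * width + j,
        "left": i * width + reflect_index(j - 1, n),
        "right": i * width + reflect_index(j + 1, n),
    }
```

Because out-of-range requests are redirected to samples that are already stored, no extra storage or reflection circuit is needed in this model, which mirrors the argument made in the text.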

Another advantage of the employment of address generator is that it is able to generate multiple addresses for the concurrent read and write accesses of the on-chip memory. Multiple address generation is essential for the implementation of the pipeline in the pre-transform unit. For the shaded time interval indicated in _{i+1,j} and
_{i−2,j}, should also be written to the RAM 1 and RAM 2. As shown in _{i+1,j} from RAM 1, address for reading
_{i−2,j} to RAM 1 and RAM 2. Other alternatives for memory accesses are based on CPU or direct memory access (DMA). However, because there is only one memory access at a time, using the CPU or DMA-based memory accesses for the proposed pipeline architecture may be difficult.

The FFT unit is employed for implementing Steps 2 and 4 of the algorithm.

In the FFT unit, each row of _{i,j} is loaded from the on-chip memory one at a time. The FFT unit then writes the computational results directly back to the same row in the on-chip memory. After the row operations are completed, the column operations will proceed in the same manner. After the completion of all the column operations, the array stored in the on-chip RAM is Γ_{m,n}, the two-dimensional FFT of _{i,j}.
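
The row-then-column procedure can be checked in software. The sketch below is a NumPy model under stated assumptions, not the Altera-based hardware: each row of length N+1 is mirror-reflected to length 2N, transformed by a 1D FFT, and truncated to its first N+1 coefficients before being written back, and the same is then done on columns. The helper names are hypothetical.

```python
import numpy as np


def mirror(v):
    """Whole-sample mirror reflection along the last axis: N+1 -> 2N samples."""
    return np.concatenate([v, v[..., -2:0:-1]], axis=-1)


def mirrored_fft_rows(a):
    """Mirror each row, take its 1D FFT, keep the first N+1 coefficients."""
    n1 = a.shape[-1]
    return np.fft.fft(mirror(a), axis=-1)[..., :n1]


def mirrored_fft2(a):
    """Row pass followed by column pass, each stored in place of its input."""
    rows = mirrored_fft_rows(a)
    return mirrored_fft_rows(rows.T).T
```

In this model the result equals the 2D FFT of the fully mirrored 2N × 2N array restricted to the (N+1) × (N+1) grid, which is why only (N+1) × (N+1) values need to be kept in memory.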

From Step 2.1 of the phase unwrapping algorithm, it follows that the mirror reflection is required before the 1D-FFT transform (or inverse transform). The mirror reflection module is a 2

We use Altera FFT MegaCore function [

The FFT unit is able to operate as a two-stage pipeline, where the first stage is mirror reflection, and the second stage is 1D-FFT. _{i,j}. Note that the operation of each stage of FFT unit is separated into two phases. Both the stages will operate at the same phase at the same time. _{i,j} (e.g., _{i,j}, 0 ≤ _{i−1,j}, 0 ≤ _{i−1,n}, 0 ≤ _{i,j}, 0 ≤ _{i−1,n}, 0 ≤

Note that the FFT unit is also used for the computation of inverse 2D-FFT of Φ_{m,n}. The data stored in the on-chip memory is Φ_{m,n}. The 1D-FFT module will operate as the 1D inverse FFT for the input data. The FFT unit will then produce _{i,j} to the on-chip memory.

The post-transform unit is used for the hardware computation of Step 3 in the algorithm. Therefore, the objective of the post-transform unit is to realize

Although LUTs can be used for the implementation of cosine modules, they may be difficult to be used for the design of divider in the post-transform unit. From

Given Γ_{m,n} in the on-chip memory, the post-transform unit operates as follows. The unit loads the FFT coefficients Γ_{m,n} from the on-chip memory one at a time based on the raster scan order. To reduce the amount of bus traffic, the address delivered to the on-chip memory for loading Γ_{m,n} is also delivered to the post-transform unit for extracting the indices _{m,n} loaded from the on-chip memory and (2 cos(_{m,n}. The output of the divider is then stored directly back to the on-chip memory.
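
A software model of this data flow is sketched below. It assumes that the divisor has the standard form 2cos(πm/N) + 2cos(πn/N) − 4 and that the indices are recovered from a row-major address; both are assumptions, since the defining equation is not reproduced in this copy of the text. The cosine lookup table stands in for the cosine modules, and `post_transform` is a hypothetical name.

```python
import numpy as np


def post_transform(gamma):
    """Divide each FFT coefficient by the assumed Poisson-equation divisor."""
    n1 = gamma.shape[0]
    N = n1 - 1
    cos_lut = np.cos(np.pi * np.arange(n1) / N)   # cosine lookup table
    flat_in = np.asarray(gamma, dtype=complex).ravel()
    flat_out = np.zeros(n1 * n1, dtype=complex)

    for addr in range(n1 * n1):                   # raster scan order
        m, n = divmod(addr, n1)                   # indices from the address
        if m == 0 and n == 0:
            continue                              # DC term: divisor is zero
        divisor = 2.0 * cos_lut[m] + 2.0 * cos_lut[n] - 4.0
        flat_out[addr] = flat_in[addr] / divisor

    return flat_out.reshape(n1, n1)
```

Only the coefficient itself is read from memory for each address; the two cosine terms are derived from the address, which is the property the text relies on to limit memory traffic.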

The post-transform unit is implemented as a (2 + _{m,n} from the on-chip RAM is used for computing indices _{m,n}.

The major advantage of the design is that only the Γ_{m,n} is required from the input ports. The terms cos(_{m,n} from the on-chip RAM. Based on the address, the computation of cos(_{m,n} actually produces Γ_{m,n}, cos(_{m,n} using

Two types of performance are considered in this paper: latency and area complexity. The latency of each unit is defined as the time required for finishing the operations of that unit. Because the arithmetic operators and storage cells are the basic building blocks of the architecture, the area complexities are separated into two categories: the number of arithmetic operators and the number of storage cells. The arithmetic operators consist of adders, multipliers, and dividers. The storage cells comprise registers, ROM cells and RAM cells.

The number of arithmetic operators is independent of the image resolution because each unit in the proposed architecture only uses a fixed number of adders, multipliers and/or dividers, independent of

The number of storage cells used by the proposed architecture increases with the image resolution. For the FFT unit, because the mirror reflection module contains 2^{2}), and grows linearly with the image resolution.

To evaluate the time complexity, we first note that the pre-transform and post-transform units need to perform additions and division to each of the (^{2}). For the 2D FFT and 2D IFFT operations, the latency is given by ^{2}

The proposed architecture is used as a custom user logic in a SOPC system consisting of softcore NIOS CPU [

The NIOS CPU controls the data flow of the proposed architecture. Note that the on-chip RAM in the proposed architecture provides the source data for the pre-transform unit, the FFT unit and the post-transform unit, and also stores the computation results from these units. To ensure that the data in the on-chip RAM are delivered to the correct unit, and that the computation results of each unit are written back to the on-chip RAM, the CPU activates the controller of each unit and sets the proper value in the status register of the on-chip RAM, which controls the multiplexers at the read and write ports of the memory.

As shown in

This section presents some experimental results of the proposed architecture. The design platform is Altera Quartus II with SOPC Builder and NIOS II IDE. The target FPGA device for the hardware design is Altera Stratix III EP3SL150.

The hardware resource utilization of each unit in the proposed architecture are revealed in

It can be observed from ^{2}. The FFT unit utilizes most of the ALMs, which are used for the mirror reflection module and the 1D FFT module. The FFT unit uses DSP blocks only for the implementation of complex multipliers. Since there is only one complex multiplier in the 1D FFT module regardless of the image resolution, the DSP block utilization of the FFT unit is also independent of the image resolution.

The execution time of each step of phase unwrapping algorithm implemented by the proposed architecture for various image resolutions is shown in

Although the division and cosine operations are required in the post-transform unit, the computation time is low and comparable to that of the pre-transform unit. For a 513 × 513 image, the post-transform unit consumes only 15.7% of the total computation time. The fast computation is due to the efficient single-address-multiple-data operations as revealed in _{m,n} actually produces Γ_{m,n}, cos(_{m,n} in a pipeline fashion. As a result, the architecture is able to minimize number of memory accesses while maintaining high throughput.

The proposed architecture has been found to be effective for phase unwrapping. It uses few hardware resources: only a single divider and a single complex multiplier are used, so the utilization of DSP blocks is minimized, and the ALM and memory bit utilization grows only linearly with the image resolution. Each unit in the architecture is pipelined to enhance throughput, so the architecture attains a fast computation speed. In particular, when the image resolution is 513 × 513, the computation time is only 8.9 ms, a speedup of 605 over its software counterpart. The architecture is able to support frame rates above 100 fps for embedded DHM rendering. The architecture is an effective alternative for the implementation of embedded DHM systems where low hardware resource utilization, high image resolution and a high image rendering rate are desired.

The work was financially supported in part by National Taiwan Normal University (NTNU100-D-01).

The proposed architecture for phase unwrapping.

The architecture of pre-transform unit, where REG and MUX are the abbreviations of register and multiplexer, respectively.

The timing diagram for pipeline operation at the first step of pre-transform unit.

The input/output to each stage of the pipeline for the shaded time interval marked in the

The timing diagram for pipeline operation at the second step of pre-transform unit.

The input/output to each stage of the pipeline for the shaded time interval marked in the

The architecture of FFT unit.

The timing diagram for pipeline operation of FFT unit.

The operation of phase 1 at each stage of the pipeline for the shaded time interval marked in the

The operation of phase 2 at each stage of the pipeline for the shaded time interval marked in the

The architecture of post-transform unit.

The timing diagram for pipeline operation at post-transform unit for

The input/output to each stage of the pipeline for the shaded time interval marked in the

The SOPC system for phase unwrapping.

The flowchart of the software executed by the CPU.

The phase unwrapping results for 257 × 257 images:

The phase unwrapping results for 513 × 513 image:

Area complexities and latency of the proposed architecture with respect to the image resolution (N+1) × (N+1), where the function

Pre-Transform | FFT | Post-Transform | On-Chip Memory | Overall | |
---|---|---|---|---|---|

Arithmetic Operators | 0 | ||||

Storage Cells | ^{2}) |
^{2}) | |||

Latency | ^{2}) |
^{2} log |
^{2}) |
^{2} log |

The ALM utilization of each unit in the proposed architecture for various image resolutions.

Image Resolutions | Pre-Transform | FFT | Post-Transform | On-Chip Memory |
---|---|---|---|---|
129 × 129 | 197 | 8,358 | 889 | 136 |
257 × 257 | 201 | 10,532 | 959 | 178 |
513 × 513 | 221 | 19,049 | 1,641 | 301 |

The embedded memory bit utilization of each unit in the proposed architecture for various image resolutions.

Image Resolutions | Pre-Transform | FFT | Post-Transform | On-Chip Memory |
---|---|---|---|---|
129 × 129 | 0 | 61,440 | 4,608 | 299,538 |
257 × 257 | 0 | 122,880 | 4,608 | 1,188,882 |
513 × 513 | 0 | 233,472 | 4,608 | 4,737,042 |
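
As a quick check of how the on-chip memory figures scale, the short script below divides each on-chip memory entry by the number of array elements. The resulting 18 bits per element is inferred from the table values only; the text does not state the word width, nor how the bits are split between the two RAM modules.

```python
# On-chip memory bits taken from the table above, keyed by image side length.
ON_CHIP_MEMORY_BITS = {129: 299_538, 257: 1_188_882, 513: 4_737_042}


def bits_per_element(side, total_bits):
    """Total on-chip memory bits divided by the number of array elements."""
    return total_bits / (side * side)


for side, bits in ON_CHIP_MEMORY_BITS.items():
    print(f"{side} x {side}: {bits_per_element(side, bits):.1f} bits per element")
```

The ratio is the same for all three resolutions, which is consistent with memory usage growing in proportion to the number of pixels.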

The DSP block utilization of each unit in the proposed architecture for various image resolutions.

Image Resolutions | Pre-Transform | FFT | Post-Transform | On-Chip Memory |
---|---|---|---|---|
129 × 129 | 0 | 24 | 16 | 0 |
257 × 257 | 0 | 24 | 16 | 0 |
513 × 513 | 0 | 24 | 16 | 0 |

The total area costs of the proposed architecture for various image resolutions.

Image Resolutions | Proposed Arch.: ALMs | Proposed Arch.: Embedded Memory Bits | Proposed Arch.: DSP Blocks | Entire SOPC: ALMs | Entire SOPC: Embedded Memory Bits | Entire SOPC: DSP Blocks |
---|---|---|---|---|---|---|
129 × 129 | 9,580/56,800 (17%) | 365,586/5,630,976 (6%) | 40/384 (10%) | 17,081/56,800 (30%) | 968,722/5,630,976 (17%) | 44/384 (11%) |
257 × 257 | 11,870/56,800 (21%) | 1,316,370/5,630,976 (23%) | 40/384 (10%) | 20,905/56,800 (36%) | 1,916,434/5,630,976 (34%) | 44/384 (11%) |
513 × 513 | 21,212/56,800 (37%) | 4,975,122/5,630,976 (88%) | 40/384 (10%) | 28,568/56,800 (50%) | 5,085,778/5,630,976 (90%) | 44/384 (11%) |

The execution time of the proposed phase unwrapping architecture for various image resolutions.

Image Resolutions | Pre-Transform (ms) | FFT (ms) | Post-Transform (ms) | Inverse FFT (ms) | Total (ms) |
---|---|---|---|---|---|
129 × 129 | 0.1 | 0.3 | 0.2 | 0.3 | 0.9 |
257 × 257 | 0.3 | 0.9 | 0.4 | 0.9 | 2.5 |
513 × 513 | 1.1 | 3.2 | 1.4 | 3.2 | 8.9 |
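
The figures quoted in the text for the 513 × 513 case can be reproduced from the table with a few lines of arithmetic. The script below is only a worked check using the rounded table values; the variable names are illustrative.

```python
# Per-stage execution times in milliseconds for a 513 x 513 image (from the table).
STAGES_MS = {
    "pre_transform": 1.1,
    "fft": 3.2,
    "post_transform": 1.4,
    "inverse_fft": 3.2,
}

total_ms = sum(STAGES_MS.values())
post_share = 100.0 * STAGES_MS["post_transform"] / total_ms  # percent of total
frames_per_second = 1000.0 / total_ms

print(f"total: {total_ms:.1f} ms")
print(f"post-transform share: {post_share:.1f} %")
print(f"frame rate: {frames_per_second:.0f} fps")
```

With these rounded inputs, the post-transform share and the frame rate agree with the 15.7% and the "above 100 fps" statements made in the text.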

The execution time of the proposed phase unwrapping architecture for various image resolutions.

Image Resolutions | Proposed Architecture (ms) | Software Counterpart (ms) | Speedup |
---|---|---|---|
129 × 129 | 0.9 | 468 | 585 |
257 × 257 | 2.5 | 1504 | 601 |
513 × 513 | 8.9 | 5389 | 605 |

The execution time of different phase unwrapping implementations.

Implementations | Computation Time | Image Resolutions | Platforms |
---|---|---|---|
Proposed Architecture | 8.9 (ms) | 513 × 513 | FPGA (Altera Stratix III EP3SL150) |
[ | 672 (ms) | 512 × 512 | GPU (NVIDIA GeForce 8800 GTX) |
[ | 2.8 (s) | 640 × 480 | GPU (NVIDIA GeForce 8800 GTX) |
[ | 24.7 (s) | 1,024 × 512 | FPGA (Xilinx Virtex-II Pro) |