^{*}

Tomasz Kryjak — survey of previous works and tables, idea and design of the pre-filtering, post-filtering and reliability check modules, proposal of the improvements resulting in the modified version.

Marek Gorgon — conception of the FPGA pipelined processing system, evaluation of the experiment against scientific trends, efficiency evaluation using the GOPS, W and GOPS/W measures.

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (

This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows a data throughput of 175 MPixels/s to be achieved and makes processing of a Full HD video stream (1,920 × 1,080 @ 60 fps) possible. The structure of the optical flow module, as well as the pre- and post-filtering blocks and a flow reliability computation unit, is described in detail. Three versions of the optical flow module, differing in numerical precision, working frequency and result accuracy, are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from the Middlebury University optical flow dataset. It achieves state-of-the-art results among hardware implementations of single-scale methods. The designed fixed-point architecture achieves a performance of 418 GOPS with a power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with a power efficiency of 24 GFLOPS/W. Moreover, a 100-times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on a Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for a Full HD video stream received from an HDMI camera in real-time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems.

Nowadays, a continuous increase in the importance of vision sensors can be observed in both commercial and industrial applications. TV sets, smartphones and game consoles are being equipped with functions enabling more natural, contactless interaction with the user by analysing pose, gestures or facial expression. Smart cameras [

In the case of video sequence analysis, extracting the movement parameters of individual objects present in the scene is often desired. Because objects consist of pixels, their movement can be derived from the resultant displacement of the corresponding pixels. Therefore, the object movement can be obtained by computing the optical flow, which is a vector field describing the relative displacement of pixels between two consecutive frames of a video sequence.

Information about optical flow is valuable in many different applications. It allows the extraction of the object's movement direction and its speed [

The issue of accurate optical flow computation is a separate and very comprehensive research problem. Improvements and new algorithms have been described in many publications. A summary of these efforts is presented in [

Optical flow computation algorithms can be divided into two groups. The first consists of methods which determine the optical flow field for every pixel of the image (dense flow). Algorithms computing the flow only for selected pixels (sparse flow) belong to the second group. This classification was proposed because some points are easier to track than others. For example, a pixel with a unique colour is simpler to localize than a pixel surrounded by other pixels of the same or similar colour. The methods used for picking suitable points vary widely: from choosing pixels located on a rectangular grid virtually imposed on the image, to corner or feature detection (e.g., the Scale Invariant Feature Transform). It is not possible to determine optical flow for all pixels, for example when a pixel disappears between two frames due to occlusion. This is why dense optical flow methods use more cues than pixel intensity alone (e.g., they assume local or global smoothness of the flow).

Determining optical flow is a computationally demanding problem. This can be demonstrated using the performance results from tables presented at the Middlebury evaluation web page [

ASIC devices allow the parallel implementation of many algorithms and are characterized by very low power consumption. Their usage is, however, justified only in the case of large-volume production, because of the long and costly design, testing and production process. It is also impossible to introduce any changes or improvements to the algorithm once the device is produced.

Modern FPGA devices have capabilities almost similar to those of ASICs and are very well suited for prototyping and small or medium volume production series. Moreover, their main advantage is the ability to modify and update the designed logic (

In this work a hardware implementation of dense Horn-Schunck [

The main contributions of this paper can be summarized as:

a comprehensive survey of hardware implementations of the Horn-Schunck method,

the proposal of a fully pipelined Horn-Schunck computation architecture which requires external RAM only for image storage, instead of the iterative approach proposed in all previous papers,

measuring the impact of various differentiation and averaging kernels on accuracy, maximum operating frequency and resource utilization of the proposed design,

the realization and comparison of modules computing optical flow with either fixed- or floating-point precision,

the implementation of the first FPGA hardware system able to compute dense optical flow for a Full HD video stream in real-time,

results comparable with state-of-the-art mono-scale optical flow implementations,

verification of the proposed architecture not only in simulation, but also on an evaluation board; a working example of high definition video processing streamed from an HDMI camera is presented.

In this paper optical flow is computed using the method proposed by Horn

Optical flow computed according to the method proposed by Horn-Schunck is used in many practical applications. In a vision system for monitoring a road intersection [

In the Horn-Schunck method [ the optical flow is computed using the spatial and temporal image brightness derivatives I_x, I_y and I_t.

Because the optimal solution should be found globally, the authors of the algorithm proposed an iterative method of solving this minimization problem. It is based on multiple repetitions of successive approximations given by the equations:
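In the classical formulation, each step corrects the neighbourhood-averaged flow by a common term built from the brightness derivatives. A behavioural sketch of one per-pixel update is given below; the function and variable names are ours, not the paper's:

```python
def hs_update(u_avg, v_avg, Ix, Iy, It, alpha):
    """One Horn-Schunck iteration step for a single pixel.

    u_avg, v_avg: neighbourhood averages of the current flow estimate.
    Ix, Iy, It:   spatial and temporal brightness derivatives.
    alpha:        smoothness (regularization) weight.
    """
    # Common correction term shared by both flow components.
    psi = (Ix * u_avg + Iy * v_avg + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
    return u_avg - Ix * psi, v_avg - Iy * psi
```

In hardware, this expression maps naturally onto a small set of multipliers, adders and one divider per pixel, which is what makes the per-iteration datapath compact.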

Due to the high usefulness of optical flow in many vision systems, and because most of the algorithms can be effectively parallelized, optical flow computation is often implemented in reprogrammable FPGA devices. In the paragraphs below, a survey of previous hardware implementations, with particular focus on realizations based on the Horn-Schunck method, is presented.

In the article [

The work [

In the article [

An architecture able to compute optical flow according to the Horn-Schunck method in real-time for image resolution of 256 × 256 pixels with 60 frames per second was presented in [

Two different architectures for optical flow computation based on the Horn-Schunck algorithm were presented in the article [

In a recent work published in 2012 [

A hardware module implementing a change driven modification of the Horn-Schunck optical flow computation was presented in [

In the article [

Hardware implementations of optical flow computation based on methods other than the Horn-Schunck algorithm are also described in the literature. A design worth mentioning was presented by M. Abutaleb [

An optical flow hardware implementation used for UAV (unmanned aerial vehicle) navigation was presented in [

In two articles [

The previous hardware implementations of Horn-Schunck computation modules, described in Section 3, have different properties. A summary is presented in

It can be noticed (

It can, however, be noticed that the transition from one iteration to the next requires only the flow estimates from the previous iteration, while the derivatives I_x, I_y and I_t remain constant for a given pair of frames.

Observation of the algorithm's data flow allows us to propose a novel computing architecture based on the well-known pipelining technique. Its main feature is the data processing scheme: it is not iterative, but fully pipelined, as presented in

It was also decided to provide compatibility of the base version of the hardware module with the software implementation of the Horn-Schunck algorithm available in the popular computer vision library OpenCV [

After analysing the original work of Horn-Schunck [ it was noticed that the derivatives I_x, I_y and I_t can be computed with different convolution kernels than those used in the OpenCV implementation.

Based on the results published in [

In this section all designed hardware modules are described in detail. The VHDL and Verilog languages were used for the implementation. A reference software model in C++ was also created for each module. The verification of each core was done by comparing the results obtained from hardware simulation in the ISim software with the results returned by the corresponding reference model.

The block schematic of the proposed module for optical flow computation based on the Horn-Schunck algorithm is presented in

In the first stage, the spatial and temporal derivatives I_x, I_y and I_t are computed.

A hardware divider is needed, but computing

The computed values of the derivatives I_x, I_y and I_t

The next block (

After computing the first estimation, the flow vectors are transferred to a cascade of serially connected modules, each of which performs a single iteration given by the update equations based on the derivatives I_x, I_y and I_t.

It should be noticed that, depending on the configuration, the module can compute with either fixed- or floating-point precision. The first stage is always computed with fixed-point precision, because the result of convolving 8-bit values with a kernel whose coefficient sum can be expressed as a power of two does not require floating-point precision. The optional conversion to floating-point representation is performed in stage II, and from this point all subsequent stages (II, III, IV and so on) use processing elements able to work with either floating-point or a configurable fixed-point representation.
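The effect of the configurable fixed-point representation can be modelled in software. The sketch below assumes a signed word with a 6-bit integer and 10-bit fractional part (the split used by the ICV variant described in Section 7); round-to-nearest and saturation on overflow are our assumptions, not documented properties of the hardware:

```python
def to_fixed(x, int_bits=6, frac_bits=10):
    """Quantize x to a signed fixed-point value (one sign bit assumed).
    Rounds to nearest and saturates on overflow."""
    scale = 1 << frac_bits
    max_raw = (1 << (int_bits + frac_bits)) - 1  # largest positive raw value
    raw = max(-max_raw - 1, min(max_raw, round(x * scale)))
    return raw / scale
```

With 10 fractional bits the quantization step is 2^-10, so the worst-case rounding error of a single conversion is 2^-11, which is small compared to typical flow magnitudes.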

As initial pre-processing, a linear filter is proposed. During the research stage, it turned out that simple averaging in a 3 × 3 neighbourhood (mask coefficients equal to 1) improves the results in many cases. The context is generated using the classical delay line setup [
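A behavioural model of such context-based averaging is shown below; full image rows stand in for the hardware delay lines, and border pixels are simply skipped, which need not match the border handling of the actual module:

```python
from collections import deque

def avg3x3_stream(pixels, width):
    """Streaming 3x3 averaging built on buffered rows (delay lines),
    mirroring the classical FPGA context-generation scheme.
    'pixels' is the raster-scan input; one output value is produced
    per fully available 3x3 window."""
    rows = deque(maxlen=3)   # the last three complete image rows
    cur = []                 # row currently being received
    out = []
    for p in pixels:
        cur.append(p)
        if len(cur) == width:        # a full line has arrived
            rows.append(cur)
            cur = []
            if len(rows) == 3:       # enough context for a window
                top, mid, bot = rows
                for x in range(1, width - 1):
                    s = (sum(top[x-1:x+2]) + sum(mid[x-1:x+2])
                         + sum(bot[x-1:x+2]))
                    out.append(s / 9.0)
    return out
```

In hardware the same effect is obtained with two block-RAM line buffers and a 3 × 3 register window, producing one result per pixel clock.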

In the post-processing stage the obtained

To measure the reliability of the computed optical flow, the absolute deviation in a 3 × 3 neighbourhood is used, which is given by the equation:

This is a simplification of the standard deviation measure and consumes fewer hardware resources, as it does not require squaring or division by a variable. Using such a criterion is based on the assumption of local smoothness of the optical flow. In other words, if the flow for a given point is reliable, the absolute deviation value should be small. If it is large, it is assumed that the algorithm did not compute a proper optical flow value and it should be discarded. The schematic of the proposed module is presented in
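A software sketch of this criterion is given below; the threshold value is a free parameter of our sketch, not one specified in the paper:

```python
def flow_reliable(window, threshold):
    """Reliability check for a 3x3 window of flow component values.

    Uses the mean absolute deviation instead of the standard deviation,
    so neither squaring nor square roots are required.
    window: flat list of 9 values; returns True if the flow is accepted.
    """
    mean = sum(window) / 9.0
    abs_dev = sum(abs(v - mean) for v in window) / 9.0
    return abs_dev <= threshold
```

A smooth window passes the check, while a window containing an outlier (a locally inconsistent flow value) is rejected.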

In this section, a detailed discussion of different implementation variants of the hardware architecture for optical flow computation presented in Section 5.1 is given. Several design options were investigated: the impact of the number of iterations, the difference between floating- and fixed-point precision, and the influence of the differentiation and averaging convolution kernels on the final accuracy of optical flow computation. A short introduction to the testing methodology, i.e., the methods of determining optical flow accuracy and the reference sequences used, is also presented.

The methodology of evaluating an optical flow algorithm's accuracy was extensively described in [ It is based on comparing the computed flow vector (u, v) with the reference (ground truth) vector (u_r, v_r).

The first proposed error measure is the average angular error (AAE), computed as:

It is described in [ The second measure is the average endpoint error (AEE): the Euclidean distance between the computed flow vector (u, v) and the reference vector (u_r, v_r).

In order to visualize the result of optical flow computation, two methods are employed. The first one is used mainly for sparse methods: for each pixel for which the flow was computed, a displacement vector is drawn. This method is problematic in the case of dense flow, because drawing an arrow for every pixel location would result in an unreadable visualization.

This is why, in the case of dense flow visualization, a method based on appropriate pixel colouring according to the value of the obtained
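A common realization of such colouring maps the flow direction to hue and the flow magnitude to colour saturation. The sketch below uses this convention; the exact mapping applied in the paper's visualization may differ:

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag):
    """Map a flow vector to an 8-bit RGB colour:
    direction -> hue, magnitude -> saturation (zero flow is white)."""
    ang = (math.atan2(v, u) + math.pi) / (2.0 * math.pi)  # hue in [0, 1]
    mag = min(1.0, math.hypot(u, v) / max_mag)            # clipped magnitude
    r, g, b = colorsys.hsv_to_rgb(ang, mag, 1.0)
    return int(r * 255), int(g * 255), int(b * 255)
```

With this convention, pixels moving in opposite directions receive clearly different colours, which is exactly the property used later to verify the flow direction on the live system.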

Two sequences, Yosemite and Rubber Whale, from the Middlebury dataset [

Initial research as well as results presented in the paper [

Before starting the quantitative assessment, the values of

Computing optical flow based on the Horn-Schunck method requires many iterations of

In

The original implementation of the Horn-Schunck algorithm from the OpenCV library works with single-precision floating-point number representation. In order to make an efficient hardware implementation, the impact of switching from floating-point to fixed-point numbers, which are much better suited to hardware architectures, has to be evaluated. To achieve this, two software models were created. The first one resembles, on a bit level, the behaviour of the floating-point modules provided by Xilinx (Xilinx Bit Accurate C Model [

Quantitative results for both sequences, for the original floating-point version from OpenCV (SP — single precision) and for the fixed-point versions (7 to 11 bits assigned to the fractional part), are presented in

For computing the derivatives I_x, I_y and I_t

In

In the OpenCV implementation for

In the next step, the impact of using different kernel masks on optical flow computation accuracy was analysed. In

The analysis of

Because the choice of the averaging method only slightly affects the obtained results, the focus was placed on the kernels used for computing the derivatives I_x, I_y and I_t

Three different versions of hardware architectures were selected for final testing and implementation:

FCV — an architecture working with floating-point numbers (single precision) compliant with the implementation from OpenCV library,

ICV — an architecture which uses the same methods for derivative and averaging computations as the OpenCV library, but works with a fixed-point number representation. The word has 17 bits (one bit for the sign, 6 bits for the integer part, 10 bits for the fractional part). According to the research results presented in Section 6.3, this representation assures good accuracy of optical flow computation,

MOD — an architecture modified according to results from Section 6.4. It is using the differentiation scheme introduced in the original work of Horn-Schunck [

In the block schematic presented in

The gradient computation module is identical in both implementations based on the OpenCV version (ICV, FCV). In the case of the modified architecture (MOD), a different kernel is used. In the next processing stage (PSI computation) the

In

In

All modules described in this section are able to work at a frequency far beyond the video clock frequency of a Full HD (1,920 × 1,080 @ 60 fps) video stream, which is 148.5 MHz. These measurements are, however, valid only for single modules. In the next step, it was tested how connecting these modules into larger entities impacts the maximum achievable frequency and throughput of the whole system.

The presented ICV and FCV modules realize the same version of the algorithm. The only difference is the numerical representation used in all processing elements (adders, multipliers

Moreover, it can be noticed that the difference in the maximum working frequency of both versions is rather small. In the case of the iteration modules, the floating-point version even has a slightly higher maximum working frequency. Such results were obtained by using processing elements with the maximum available latency (and thus maximum working frequency) in both cases. For example, the latency of the PSI computation module, which uses two multipliers, two adders and one divider, is equal to 38 clock cycles for the ICV version and 72 clock cycles for the FCV version. In some applications, the excessive growth of latency might be the second limitation, after the larger resource usage, in the efficient realization of floating-point computations in FPGA devices.

The resource usage and estimated performance presented in

The same tests were conducted for the fixed-point version of the system (ICV and MOD). The results are presented in

Analysis of

This is caused by the fact that, after synthesis, the design is mapped onto the real 2D structure of the FPGA device, where logic resources are organized into rows and columns of finite size. Switching to a larger FPGA device than the XC7VX980T chip used should allow either improving the working frequency of the proposed architectures or increasing the number of iteration blocks (better flow accuracy).

In this section the evaluation of the proposed optical flow computation module is presented. The aspects of real-time video stream processing, achieved computing performance, power consumption and comparison with other hardware systems described in the literature are investigated. By real-time video stream processing, computing optical flow for every pixel received from the camera is meant.

The designed hardware modules were compared, in terms of image processing time, with the software implementation from the OpenCV library. A PC with an Intel Core i7 2600K [

It is assumed that, in order to achieve real-time video stream processing, the system should be able to process 60 frames per second. This results in a maximum single-frame processing time of 16.6 ms. It can be noticed that the CPU implementation is not able to meet this requirement for a VGA frame (resolution 640 × 480) without SIMD support. The use of the AVX instructions available in the Core i7 processor allows real-time performance to be obtained only for about 8 to 9 iterations of the Horn-Schunck algorithm.

In the FPGA, each iteration is computed by a separate hardware module. The modules work in a scalable, pipelined architecture. Thanks to this, executing more iterations increases only the latency. The throughput is determined by the image size and the maximum working frequency of the obtained architecture (presented in

It is worth noticing that processing a Full HD video stream (1,920 × 1,080) with 128 iterations of the Horn-Schunck algorithm requires only 14.14 ms on the proposed hardware architecture. Processing the same image on a CPU (Core i7) requires 1,509.44 ms. Therefore, the obtained speedup is 106 times in this case. Thanks to such a significant throughput, the designed hardware system is able to process the Full HD video stream in real-time.
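The reported frame time is consistent with a fully pipelined datapath consuming one pixel per clock over the complete video timing. Assuming the standard 1080p total frame size of 2200 × 1125 clocks (active area plus blanking; our assumption) and a 175 MHz clock, the arithmetic is:

```python
def frame_time_ms(total_w, total_h, clk_hz):
    """Frame processing time of a pipeline that consumes one pixel
    (one clock of video timing) per cycle."""
    return total_w * total_h / clk_hz * 1000.0

# 175 MHz clock, standard 1080p total timing (2200 x 1125 clocks):
t_fpga = frame_time_ms(2200, 1125, 175e6)   # about 14.14 ms per frame
speedup = 1509.44 / t_fpga                  # roughly the reported 106x
```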

In order to determine the computing performance, the methodology presented in the paper [

In the same tables, the estimated power consumption for each architecture, obtained using the Xilinx XPower Analyzer software, is presented. All measurements were conducted for a main clock frequency of 148.5 MHz. In the last step, a coefficient showing how many operations are performed per 1 W of power consumed by the FPGA device was determined.

The fixed-point hardware architecture reaches a computing performance of 418 GOPS. To achieve this, only 12 W of power is needed; the efficiency is therefore 34 GOPS per watt. This is a very good result and proves the large potential of FPGA devices in embedded systems.

In the case of the floating-point architecture (

In

PRE MED REL MOD128 — fixed-point MOD architecture, number of iterations 128, pre-filtering, median post-filtering, reliability calculation,

PRE MED MOD128 — fixed-point MOD architecture, number of iterations 128, pre-filtering, median post-filtering,

PRE MED REL ICV128 — fixed-point ICV architecture, number of iterations 128, pre-filtering, median post-filtering, reliability calculation,

PRE MED ICV128 — fixed-point ICV architecture, number of iterations 128, pre-filtering, median post-filtering,

The floating-point architecture (FCV) was excluded from the comparison, because it can only reach 32 iterations (Section 7.2). This results in lower accuracy than that obtained by the fixed-point modules, which are able to perform 128 iterations.

The implementations were sorted from the newest to the oldest. For accuracy comparison, the average angular error (AAE) computed for the well-known Yosemite sequence (without clouds) was used. This requires some clarification. The Yosemite sequence was generated by Lynn Quam using an aerial photo of the Yosemite valley. Because it is synthetic, a ground truth reference map was automatically generated. Computing the flow in the cloud region is, however, controversial, especially its impact on the final error computation. According to Michael J. Black, who published the sequence, the cloud region should be excluded entirely from the analysis, since the reference ground truth was not generated for it.

The density parameter is strictly related to the calculation of optical flow reliability [

The data presented in

The modules described in the previous sections were used to design a complete hardware system able to compute optical flow for a video stream obtained from a Full HD camera. The proposed solution is a proof-of-concept of an optical flow sensor which can be used in a smart camera. Its block schematic is presented in

In the scheme, apart from the previously described Horn-Schunck, pre-filtering (PRE), median post-processing (MED) and reliability computation (REL) modules, some additional blocks are present which allow the desired system functionality to be obtained.

The HDMI source (e.g., a digital camera) is connected to the FPGA logic (HDMI INPUT) via the specialised Avnet DVI I/O FMC (FPGA Mezzanine Card) extension card. The received colour video stream is converted to greyscale (RGB 2 GREY) and transferred to the pre-filtering module (PRE). In the next step, it is split and directed to the memory controller (RAM CTRL) and the HORN-SCHUNCK module. Because the optical flow module needs two images, the previous and the current one (frames N-1 and N), the current frame is stored in RAM memory. As a Full HD image has a considerable size, external DDR3 memory is used for this task. A detailed description of the memory controller is presented in the article [

In parallel with buffering the current frame (N), the previous frame (N-1) is read back from the memory. Both frames are transferred to the optical flow computation module (HORN-SCHUNCK), described in detail in Section 5.1. Because of the limited number of logic resources available on the VC707 board (FPGA device XC7VX485T), the fixed-point version of the algorithm with 32 iterations was implemented. The vectors

For flow visualization, the colouring block is used (COLOUR). The original method proposed in [

Finally, the pixels are processed by a module which is responsible for transmitting them off the board (HDMI OUTPUT). It allows the results to be displayed on an LCD monitor. The UART module is used for establishing a control connection between the PC and the FPGA card. This allows changing the parameter

The resource usage of the FPGA device is presented in

In

The left image shows a person's silhouette rotating a longitudinal object (a poster tube) about an axis located at its centre. This is why the points lying to the left of the axis centre move in the opposite direction to the points on its right side (one side moves up while the other moves down). The right image presents a person waving their hands in opposite directions. It can be noticed in both images that points moving in different directions have different colours. This means that the flow direction was determined correctly.

In the article, a hardware architecture able to compute optical flow based on the Horn-Schunck algorithm was presented. After preliminary research, three different module variants were proposed for final implementation: a floating-point module compliant with the OpenCV library and two fixed-point modules with different convolution masks. Moreover, pre- and post-filtering blocks were designed and tested. Compared to the software OpenCV implementation executed on a Core i7 processor with AVX instruction support, the highest speedup obtained by the proposed hardware realization is above 100x. Moreover, the designed optical flow computation module achieved twice the throughput of the most powerful hardware architecture described in the literature so far, with similar accuracy.

Such high performance was achieved thanks to an architecture distinguished by several features. The proposed module does not require external RAM memory to store temporary flow values between algorithm iterations; the memory is used only for buffering the previous frame. Each iteration is executed by a separate hardware submodule. This approach results in large resource usage, but it allows fully pipelined processing of the video stream and takes advantage of the parallelization provided by FPGA devices.

All modules were described in VHDL and Verilog HDL languages. A bit-accurate reference software model was created for each one of them, which was later used for verification of hardware simulation results. Finally, the described modules were combined into a vision system, which was then successfully tested on a Xilinx VC707 evaluation board with high definition HDMI camera as a video source.

Several observations were made during the research. Firstly, it was experimentally verified that, in the case of a fixed-point implementation, a 10-bit fractional part guarantees a small error and stability of the obtained results. Moreover, four different variants of convolution masks and their impact on optical flow accuracy were evaluated. It was also pointed out that using proper kernels can improve the flow accuracy of the Horn-Schunck method with only a small increase in device logic resource usage.

Direct comparison of the floating-point version (compliant with OpenCV) with its fixed-point modification showed a 4-times difference in logic resource usage. This confirms the thesis that hardware realization of single-precision floating-point computation should be avoided, because it consumes a lot of the available resources. Moreover, the research proved that the number of iterations has a greater impact on the final flow accuracy than the numerical representation used.

An interesting conclusion can also be drawn from the comparison of the software OpenCV version with the proposed hardware realisation. It shows that modern CPUs, which are able to execute 8 operations at the same time, cannot compute dense optical flow for Full HD images in real time, not even for one iteration of the Horn-Schunck method. On the other hand, FPGA devices allow processing 60 frames of that resolution per second, and the number of iterations is limited only by the available hardware resources. It should be noticed that optical flow computation is often only one of several operations realised in a video system; others may be foreground object segmentation, object detection or classification. This is why moving the optical flow computation from the processing unit to the smart camera (

The presented implementation is characterized by both high computing performance and low power consumption at the same time. The fixed-point version is able to achieve 418 GOPS with a power efficiency of 34 GOPS/W. The floating-point implementation achieves 103 GFLOPS with 24 GFLOPS/W efficiency. These results prove that FPGA devices are a very good platform for the realization of embedded systems. They can be used in the automotive industry, where high computing performance is demanded, but small power consumption is also a decisive factor.

The authors plan to extend the module to allow flow computation at multiple scales. This approach should allow even better accuracy to be achieved, especially when the pixel displacement between frames is large. However, designing a module with such functionality, which additionally would be able to process a Full HD video stream, is a challenging task.

The proposed system is a proof of concept of a real-time, high-resolution optical flow sensor required in smart cameras, UAVs (unmanned aerial vehicles), autonomous robots, the automotive industry and automatic surveillance systems. The designed architecture is fully scalable and the number of iterations can easily be adjusted to the application requirements. The remaining hardware resources can be used for implementing other image processing and analysis algorithms.

The work presented in this paper was sponsored by AGH UST projects number 15.11.120.356 and 15.11.120.330.

The authors declare no conflict of interest.

Two ways of hardware optical flow computation: (

Proposed system. Optical flow images from [

General block schematic of the hardware optical flow computation module.

Block schematic of reliability calculation module. MEAN - mean computation, ABSD — absolute deviation computation (

Flow colouring scheme according to

Sequences (first frame) with flow ground truth reference image used for accuracy evaluation.

Comparison of the obtained results depending on the number of algorithm iterations.

Angular error for different numbers of iterations and numerical representations. SP — single precision floating-point, 7–11 — fixed-point precision with

Endpoint error for different numbers of iterations and numerical representations. SP — single precision floating-point, 7–11 — fixed-point precision with

Kernel masks for derivatives computation and flow averaging used in different implementations.

The obtained average angular and endpoint errors for different number of iterations and different convolution kernels used for differentiation and averaging. D_ — differentiation method, A_ — averaging method, HS — original kernels from the Horn-Schunck proposal, CV — kernels from the OpenCV implementation. Results for Rubber Whale sequence.

System block schematic.

Working system. Camera, FPGA evaluation board and LCD monitor displaying the result. (

Comparison of proposed hardware module for the Horn-Schunck algorithm computation with previous works. I — iterative computation, P — pipelined computation.

| |||||
---|---|---|---|---|---|

(1998) [ |
1 | 256 × 256 @ - | - | N | N |

(1998) [ |
1 | 256 × 256 @ 25 | 1.64 | N | N |

(1998) [ |
3 - I | 50 × 50 @ 19 | 0.05 | N | N |

(2005) [ |
1 | 256 × 256 @ 60 | 3.93 | N | Y (memory) |

(2012) [ |
10 - I | 320 × 240 @ 15 | 1.15 | N | Y (JTAG) |

(2012) [ |
8 - I | 320 × 240 @ 1029 | 0.73 | N | N |

(2012) [ |
10 - I | 256 × 256 @ 90 | <5.90 | N | Y (PCIe) |

(2013) [ |
1 | 256 × 256 @ 247 | 16.19 | N | Y (RS232) |

Proposed 1 | 32 - P | 1,920 × 1,080 @ 60 | 124.16 | Y | Y (camera) |

Proposed 2 | 128 - P | 1,920 × 1,080 @ 84 | 174.18 | N | N |

Resource usage and maximum working frequency of different modules (description in text).

| |||||||||
---|---|---|---|---|---|---|---|---|---|

FF | 345 | 381 | 1123 | 1285 | 3970 | 2933 | 3715 | 13589 | 1224000 |

LUT 6 | 405 | 455 | 729 | 948 | 3370 | 3115 | 3713 | 11766 | 612000 |

SLICE | 155 | 178 | 335 | 362 | 1254 | 1021 | 1220 | 4435 | 153000 |

BRAM36 | 3 | 2 | 0 | 0 | 0 | 11 | 11 | 12 | 1500 |

DSP48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3600 |

| |||||||||

clk_{max} |
401 | 410 | 553 | 512 | 503 | 236 | 223 | 281 | - |

Resource usage and maximum working frequency of different modules (description in text).

| |||
---|---|---|---|

FF | 154 | 5050 | 1345 |

LUT 6 | 149 | 5619 | 1478 |

SLICE | 78 | 2238 | 538 |

BRAM36 | 3 | 4 | 5 |

DSP48 | 1 | 0 | 2 |

| |||

clk_{max} |
316 MHz | 368 MHz | 313 MHz |

Resource usage for different number of iteration stages for floating-point optical flow computation module (FCV).

| ||||||
---|---|---|---|---|---|---|

FF | 17481 | 31084 | 58266 | 112627 | 222851 | 441779 |

LUT 6 | 15399 | 27452 | 51198 | 98630 | 195156 | 386369 |

SLICE | 5521 | 9880 | 18544 | 35132 | 69200 | 124343 |

BRAM36 | 15 | 27 | 51 | 99 | 195 | 387 |

| ||||||

clk_{max} |
274 | 244 | 184 | 197 | 200 | 197 |

Resource usage for different number of iteration stages for fixed-point optical flow computation module (ICV).

| ||||||||
---|---|---|---|---|---|---|---|---|

FF | 4265 | 7209 | 13069 | 24810 | 49794 | 98242 | 194949 | 388544 |

LUT 6 | 4391 | 7820 | 14446 | 27690 | 56511 | 111784 | 222358 | 416314 |

SLICE | 1575 | 2518 | 4488 | 8449 | 17073 | 33245 | 63618 | 119201 |

BRAM36 | 14 | 24 | 45 | 87 | 171 | 339 | 675 | 1347 |

| ||||||||

clk_{max} |
257 | 243 | 238 | 236 | 256 | 193 | 187 | 175 |

Resource usage for different number of iteration stages for fixed-point optical flow computation module (MOD).

| ||||||||

FF | 6492 | 10218 | 17642 | 32511 | 63367 | 123943 | 244906 | 487013 |

LUT 6 | 6286 | 10264 | 18040 | 33649 | 67032 | 131145 | 259693 | 490279 |

SLICE | 2101 | 3397 | 5804 | 10714 | 20720 | 37309 | 72163 | 137049 |

BRAM36 | 13 | 23 | 43 | 86 | 170 | 338 | 674 | 1346 |

| ||||||||

clk_{max} |
202 | 245 | 215 | 184 | 206 | 191 | 177 | 129 |

Processing times in milliseconds for Core i7 2600K 3.4 GHz processor and hardware implementations in Virtex 7 XC7VX980T device.

640 × 480 | 1,920 × 1,080 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | 17.6 | 5.8 | 1.63 | 2.08 | 1.53 | 117.92 | 38.62 | 9.63 | 12.25 | 9.03 |

2 | 21.68 | 5.56 | 1.73 | 1.71 | 1.72 | 157.76 | 50.23 | 10.19 | 10.10 | 10.14 |

4 | 33.19 | 9.08 | 1.76 | 1.95 | 2.28 | 239.34 | 72.71 | 10.4 | 11.51 | 13.45 |

8 | 58.17 | 15.92 | 1.78 | 2.28 | 2.13 | 400.21 | 119.09 | 10.49 | 13.45 | 12.56 |

16 | 103.54 | 30.2 | 1.64 | 2.04 | 2.1 | 743.1 | 214.53 | 9.67 | 12.01 | 12.38 |

32 | 199.08 | 56.79 | 2.17 | 2.20 | 2.13 | 1376.41 | 399.97 | 12.82 | 12.96 | 12.56 |

64 | 396.07 | 116.72 | 2.24 | 2.37 | 2705.08 | 769.61 | 13.24 | 13.98 | ||

128 | 784.83 | 216.61 | 2.4 | 3.25 | 5305.69 | 1509.44 | 14.14 | 19.19 |

Computing performance in GOPS and power consumption for the fixed-point architectures (ICV) with assumed 148.5 MHz clock.

| ||||||||
---|---|---|---|---|---|---|---|---|

GOPS | 3.86 | 7.13 | 13.66 | 26.73 | 52.87 | 105.14 | 209.68 | 418.77 |

| ||||||||

W | 0.53 | 0.64 | 0.84 | 1.23 | 1.99 | 3.46 | 6.60 | 12.22 |

| ||||||||

GOPS/W | 7.28 | 11.14 | 16.26 | 21.73 | 26.57 | 30.39 | 31.77 | 34.27 |

Computing performance in GFLOPS and power consumption for the floating-point architectures (FCV) with assumed 148.5 MHz clock.

| ||||||
---|---|---|---|---|---|---|

GFLOPS | 2.08 | 5.35 | 11.88 | 24.95 | 51.08 | 103.36 |

| ||||||

W | 0.60 | 0.75 | 1.02 | 1.50 | 2.60 | 4.25 |

| ||||||

GFLOPS/W | 3.47 | 7.13 | 11.65 | 16.63 | 19.65 | 24.32 |

Comparison of the proposed architectures with previous implementations of optical flow computation methods for well known Yosemite sequence (without clouds).

PRE_MED_REL_ |
H&S | 1,920 × 1,080 | 62 | 129 | Xilinx V7 |
5.35 | 59.43 |

PRE_MED_ |
H&S | 1,920 × 1,080 | 62 | 129 | Xilinx V7 |
9.07 | 100 |

PRE_MED_REL_ |
H&S | 1,920 × 1,080 | 84 | 175 | Xilinx V7 |
11.85 | 59.04 |

PRE_MED_ |
H&S | 1,920 × 1,080 | 84 | 175 | Xilinx V7 |
16.51 | 100 |

Barranco [ |
L&K |
640 × 480 | 270 | 82.9 | Xilinx V4 |
5.97 | 59.88 |

Barranco [ |
L&K |
640 × 480 | 31.91 | 9.8 | Xilinx V4 |
4.55 | 58.50 |

Tomasi [ |
Multiscale |
640 × 480 | 31.5 | 9.6 | Xilinx V4 |
7.91 | 92.01 |

Botella [ |
Multi-channel |
128 ×96 | 16 | 0.2 | Xilinx V2 | 5.5 | 100 |

Mahalingham [ |
L&K |
640 × 480 | 30 | 9.2 | Xilinx V2P |
6.37 | 38.6 |

Anguita [ |
L&K |
1,280 × 1,026 | 68.5 | 90.0 | Core2 Quad Q9550 |
3.79 | 71.8 |

Pauwels [ |
Phase-based | 640 × 512 | 48.5 | 15.9 | NVIDIA GeForce |
2.09 | 63 |

Diaz [ |
L&K |
800 × 600 | 170 | 81.6 | Xilinx V2 |
7.86 | 57.2 |

Resource utilization for VC707 card (XC7VX485T device).

| |||
---|---|---|---|

FF | 149151 | 607200 | 24% |

LUT 6 | 152118 | 303600 | 50% |

SLICE | 46912 | 75900 | 61% |

BRAM36 | 395 | 1030 | 38% |

DSP 48 | 3 | 2800 | 0% |

| |||

clk_{max} |
150 MHz | - | - |