# Real-Time Efficient FPGA Implementation of the Multi-Scale Lucas-Kanade and Horn-Schunck Optical Flow Algorithms for a 4K Video Stream


## Abstract


## 1. Introduction

- A proposal of an architecture able to run the multi-scale versions of the Horn–Schunck and Lucas–Kanade optical flow algorithms in real time for an Ultra HD (4K) video stream on an FPGA platform, which, to the best of our knowledge, has not been done before.
- An efficient implementation of the multi-scale method, which takes advantage of processing a different number of pixels simultaneously depending on the scale and does not use additional external memory to store temporary values.

## 2. The Horn-Schunck and Lucas-Kanade OF Computation Algorithms

- $\frac{\partial I}{\partial x},\frac{\partial I}{\partial y}$ — spatial derivatives,
- $\frac{\partial I}{\partial t}$ — temporal derivative,
- $\frac{\partial x}{\partial t},\frac{\partial y}{\partial t}$ — optical flow.
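For reference, these symbols combine into the standard brightness-constancy constraint from which both algorithms start (written here in its common textbook form, not reproduced from the paper's own equation numbering):

```latex
\frac{\partial I}{\partial x}\,\frac{\partial x}{\partial t}
+ \frac{\partial I}{\partial y}\,\frac{\partial y}{\partial t}
+ \frac{\partial I}{\partial t} = 0
```

This single equation is under-determined (one constraint, two unknowns), which is why HS adds a smoothness term and LK aggregates constraints over a neighbourhood.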

#### 2.1. Horn–Schunck Algorithm

- $\overline{{u}_{n}},\overline{{v}_{n}}$ — average velocity in the neighbourhood.
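As a reminder of the structure of the update (this is the classic Horn–Schunck iteration; the common factor corresponds to the $\psi$ value used later in Section 4.2.2, and the exact form of the paper's Equations (6) and (7) is assumed to match it):

```latex
\psi = \frac{I_x\,\overline{u_n} + I_y\,\overline{v_n} + I_t}{\alpha^2 + I_x^2 + I_y^2},
\qquad
u_{n+1} = \overline{u_n} - I_x\,\psi,
\qquad
v_{n+1} = \overline{v_n} - I_y\,\psi
```

where $\alpha$ is the smoothness weight. Computing $\psi$ once and reusing it for both components is what makes the hardware refinement module compact.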

#### 2.2. Lucas–Kanade Algorithm

For the neighbourhood pixels $p_1, p_2, \ldots, p_n$, the equation related to the pixel brightness can be written in matrix form as Equation (8).

This overdetermined system was multiplied by $A^TW$ on both sides, which led to Equation (11). The weight matrix $W$ takes into account the different impact of the pixels depending on their distance from the analysed pixel. In the simplest case, $W$ can be an identity matrix, while using a Gaussian-like mask increases the weights of the closest pixels.


A solution exists if it is possible to invert the matrix $A^TWA$. In other words, if its determinant is different from 0, the optical flow is expressed by Equation (13).

However, the invertibility of $A^TWA$ does not always guarantee a correct solution. For this reason, additional conditions are proposed in the literature, in which the eigenvalues of this matrix are taken into account: they cannot be too small, and their quotient cannot be too large. Determining whether the eigenvalues of the matrix satisfy these conditions is usually done by comparison with a threshold.
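The per-pixel solve and the eigenvalue check described above can be sketched as follows. This is a NumPy reference model, not the hardware datapath; the threshold `tau` and the ratio limit `ratio_max` are hypothetical values, since the paper only states that such thresholds exist:

```python
import numpy as np

def lk_flow_single_pixel(Ix, Iy, It, w=None, tau=1e-2, ratio_max=100.0):
    """Solve the Lucas-Kanade system for one pixel from the derivatives of
    its neighbourhood. Returns (u, v), or None when the eigenvalue
    conditions reject the solution.

    Ix, Iy, It: 1-D arrays of derivatives over the n neighbourhood pixels.
    w:          optional per-pixel weights (diagonal of W); identity if None.
    tau:        threshold for the smaller eigenvalue (assumed value).
    ratio_max:  maximum allowed eigenvalue quotient (assumed value).
    """
    A = np.stack([Ix, Iy], axis=1)            # n x 2 matrix of spatial derivatives
    b = -np.asarray(It, dtype=float)          # right-hand side from temporal derivative
    W = np.diag(w) if w is not None else np.eye(len(b))
    G = A.T @ W @ A                           # the 2 x 2 matrix A^T W A
    lam = np.linalg.eigvalsh(G)               # eigenvalues, ascending order
    # Reject ill-conditioned pixels: eigenvalues too small or too unequal.
    if lam[0] < tau or lam[1] / max(lam[0], 1e-12) > ratio_max:
        return None
    u, v = np.linalg.solve(G, A.T @ W @ b)    # Equation (13): (A^T W A)^-1 A^T W b
    return u, v
```

For a textured neighbourhood both eigenvalues are large and the flow is accepted; for a flat patch `G` is near-singular and the pixel is rejected, mirroring the threshold comparison in hardware.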

#### 2.3. Multi-Scale Method

## 3. FPGA Implementations of Optical Flow Methods

#### 3.1. Lucas-Kanade FPGA Implementations

#### 3.2. Horn-Schunck FPGA Implementations

## 4. The Proposed OF System

#### 4.1. Video Processing in 4K

#### 4.2. Optical Flow Algorithms

#### 4.2.1. Implementation of the Lucas-Kanade Algorithm

The first step is the computation of the spatial derivatives $I_x$ and $I_y$ on the previous frame (from RAM) and the temporal derivative $I_t$ between the previous frame and the current one, coming from the camera. Various masks for the spatial derivative in both directions were analysed, including $[-1,0,1]/2$ and $[-1,8,0,-8,1]/12$. After testing them on a few sequences, it turned out that the first one ensures slightly better accuracy. In the case of a hardware implementation, it also avoids the division operation, limiting the resource utilisation and eliminating rounding errors, and reduces the latency of the module, which is even more important in the multi-scale version. For the temporal derivative, the simplest subtraction between the frames was realised. For all three derivatives, additional thresholding was used: if the result was small (e.g., below 5), it was zeroed. This was motivated by the noise that occurs in the source video signal. Finally, the derivatives calculated for each pixel were output simultaneously. A simplified scheme of these calculations (for clarity, for one processed pixel, which translates to the 1ppc mode) is presented in Figure 6. In the general case, X contexts are generated for X pixels during derivative calculations, as in Figure 4b.
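The derivative stage above can be modelled in a few lines. This is our software reading of the description, not the RTL: the $[-1,0,1]/2$ mask on the previous frame, a plain frame difference for $I_t$, and zeroing of small responses (the threshold 5 is the example value given in the text):

```python
import numpy as np

def lk_derivatives(prev, curr, noise_thr=5):
    """Sketch of the LK derivative stage: spatial derivatives on the
    previous frame with the [-1, 0, 1]/2 mask, temporal derivative as a
    plain frame difference, and thresholding of small (noisy) responses."""
    prev = prev.astype(np.int32)
    curr = curr.astype(np.int32)
    Ix = np.zeros_like(prev)
    Iy = np.zeros_like(prev)
    # Central difference [-1, 0, 1] / 2 (the mask variant chosen in the text).
    Ix[:, 1:-1] = (prev[:, 2:] - prev[:, :-2]) // 2
    Iy[1:-1, :] = (prev[2:, :] - prev[:-2, :]) // 2
    It = curr - prev                      # simplest frame-to-frame subtraction
    for d in (Ix, Iy, It):
        d[np.abs(d) < noise_thr] = 0      # zero small responses (sensor noise)
    return Ix, Iy, It
```

In the 1ppc mode this corresponds to one 3 × 3 context per clock; in the Xppc modes the same arithmetic is replicated for X neighbouring contexts.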

The scheme of the summation method (e.g., for $I_{xx}$) is presented in Figure 7. The weights $w_i$ from Equation (12) were all assigned 1 to smooth the flow and reduce the influence of erroneous “central” pixels. Another solution was also implemented and tested: generating the context for the incoming derivatives and then performing their multiplication. This method significantly reduced BRAM usage, as the number of bits per pixel was much smaller (27 vs. 85), but at the cost of considerably higher DSP/LUT/FF utilisation. This approach, even with an effective split of the used resource types, resulted in congestion and routing problems during the implementation of the multi-scale version, and thus it was not used in the presented solution.
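With all weights equal to 1, the five sums needed by the LK solve reduce to box sums of the derivative products. A NumPy sketch of that reduction (the separable summation here stands in for the pipelined adder tree of Figure 7; the window size `win` is a parameter, not a value taken from the paper):

```python
import numpy as np

def lk_sums(Ix, Iy, It, win=3):
    """Sum the five derivative products over a win x win neighbourhood with
    all weights w_i = 1, giving the entries of A^T A and A^T b per pixel."""
    def box_sum(img):
        # Separable box filter: 1-D convolution along rows, then columns.
        k = np.ones(win)
        tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
        return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)
    products = (Ix * Ix, Ix * Iy, Iy * Iy, Ix * It, Iy * It)
    return tuple(box_sum(p.astype(np.float64)) for p in products)
```

Summing products (85 bits per pixel in the chosen design) trades BRAM width for a single shared summation, which is the trade-off discussed above.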

#### 4.2.2. Implementation of the Horn–Schunck Algorithm

The first step is the computation of the spatial derivatives $I_x$, $I_y$ and the temporal derivative $I_t$. For this task, a context of size 2 × 2 × 2 px has to be generated, with time as the third dimension (previous and current frame). Therefore, a context of size 2 × 2 px is generated for both images. Marking the previous frame (from RAM) as $I_1$ and the current one as $I_2$, the derivatives are calculated according to Equations (18)–(20). Similarly to the LK method, additional thresholding was used to zero the values smaller than a set parameter. A scheme of the derivative calculation in the HS algorithm is shown in Figure 9.

The initialisation module calculates $u_0$ and $v_0$ from Equations (6) and (7) with the average velocities set to 0, which resulted in Equations (22) and (23). The obtained values of $\psi$, $u_0$, $v_0$ and the delayed derivatives $I_x$, $I_y$ and $I_t$ were output simultaneously from the initialisation module.

The derivatives $I_x$, $I_y$, $I_t$ and the calculated $\psi$ were passed to the output of the module, which enabled further flow updates in subsequent iterations. In this way, the iterative flow refinement in the HS method was realised, with the number of iterations controlled by a parameter initially set to 10. As already shown in the literature (e.g., [35]), the more iterations, the higher both the accuracy of the results and the resource consumption. Therefore, choosing the proper number of iterations is always a compromise between accuracy and resource utilisation. The values obtained at the output of the last refinement module constituted the final result of the HS algorithm.

#### 4.3. Implementation of the Multi-Scale Method

## 5. Evaluation

The first indicator is $E_{AAE}$ (average angular error), defined by Equation (26), which is the average angular error between the normalised ground-truth vector $(u_r, v_r, 1)$ and the determined one $(u, v, 1)$ over all $N$ pixels. The second popular indicator is $E_{AEE}$ (average endpoint error), expressed by Equation (27). It corresponds to the average endpoint error between the obtained flow and the ground truth in pixels, calculated using the Euclidean norm. Density is also often used; it denotes the ratio of the number of pixels with a determined optical flow to the total number of pixels in the image. In the case of our implementation, the density is 100% (apart from the pixels without a “correct” context), since the flow values for all pixels are determined and no thresholding of the results is applied.

#### 5.1. Middlebury Dataset

#### 5.2. Resource Utilisation

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Liu, C.; Freeman, W.T.; Adelson, E.H.; Weiss, Y. Human-assisted motion annotation. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
2. Horn, B.K.; Schunck, B.G. Determining optical flow. *Artif. Intell.* **1981**, *17*, 185–203.
3. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, BC, Canada, 24–28 August 1981; Volume 81.
4. Zach, C.; Pock, T.; Bischof, H. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Proceedings of the Pattern Recognition, Leipzig, Germany, 18–20 July 2007; Hamprecht, F.A., Schnörr, C., Jähne, B., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 214–223.
5. Anandan, P. Measuring Visual Motion from Image Sequences. Ph.D. Thesis, University of Massachusetts Amherst, Amherst, MA, USA, 1987.
6. Fleet, D.J.; Jepson, A.D. Computation of component image velocity from local phase information. *Int. J. Comput. Vis.* **1990**, *5*, 77–104.
7. Liu, C.; Yuen, J.; Torralba, A. SIFT Flow: Dense Correspondence across Scenes and Its Applications. *IEEE Trans. Pattern Anal. Mach. Intell.* **2011**, *33*, 978–994.
8. Brox, T.; Bregler, C.; Malik, J. Large displacement optical flow. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
9. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Häusser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2758–2766.
10. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. *arXiv* **2016**, arXiv:1612.01925.
11. Ranjan, A.; Black, M. Optical Flow Estimation Using a Spatial Pyramid Network. *arXiv* **2016**, arXiv:1611.00850.
12. Sun, Z.; Wang, H. Deeper Spatial Pyramid Network with Refined Up-Sampling for Optical Flow Estimation. In Proceedings of the Advances in Multimedia Information Processing—PCM 2018, Hefei, China, 21–22 September 2018; Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 492–501.
13. Hui, T.W.; Tang, X.; Loy, C.C. LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation. *arXiv* **2018**, arXiv:1805.07036.
14. Hui, T.W.; Tang, X.; Loy, C.C. A Lightweight Optical Flow CNN—Revisiting Data Fidelity and Regularization. *arXiv* **2019**, arXiv:1903.07414.
15. Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. *arXiv* **2017**, arXiv:1709.02371.
16. Ahn, H.E.; Jeong, J.; Kim, J.W.; Kwon, S.; Yoo, J. A Fast 4K Video Frame Interpolation Using a Multi-Scale Optical Flow Reconstruction Network. *Symmetry* **2019**, *11*, 1251.
17. Wei, Z.; Lee, D.; Nelson, B.; Martineau, M. A Fast and Accurate Tensor-based Optical Flow Algorithm Implemented in FPGA. In Proceedings of the 2007 IEEE Workshop on Applications of Computer Vision (WACV’07), Austin, TX, USA, 21–22 February 2007; pp. 18–23.
18. Chase, J.; Nelson, B.; Bodily, J.; Wei, Z.; Lee, D. Real-Time Optical Flow Calculations on FPGA and GPU Architectures: A Comparison Study. In Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, Stanford, CA, USA, 14–15 April 2008; pp. 173–182.
19. Seyid, K.; Richaud, A.; Capoccia, R.; Leblebici, Y. FPGA-Based Hardware Implementation of Real-Time Optical Flow Calculation. *IEEE Trans. Circuits Syst. Video Technol.* **2018**, *28*, 206–216.
20. Diaz, J.; Ros, E.; Pelayo, F.; Ortigosa, E.M.; Mota, S. FPGA-based real-time optical-flow system. *IEEE Trans. Circuits Syst. Video Technol.* **2006**, *16*, 274–279.
21. Díaz, J.; Ros, E.; Agís, R.; Bernier, J.L. Superpipelined high-performance optical-flow computation architecture. *Comput. Vis. Image Underst.* **2008**, *112*, 262–273.
22. Barranco, F.; Tomasi, M.; Diaz, J.; Vanegas, M.; Ros, E. Parallel Architecture for Hierarchical Optical Flow Estimation Based on FPGA. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* **2012**, *20*, 1058–1067.
23. Kalyan, T.R.S.; Malathi, M. Architectural implementation of high speed optical flow computation based on Lucas-Kanade algorithm. In Proceedings of the 2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, 8–10 April 2011; Volume 4, pp. 192–195.
24. Seong, H.; Rhee, C.E.; Lee, H. A Novel Hardware Architecture of the Lucas–Kanade Optical Flow for Reduced Frame Memory Access. *IEEE Trans. Circuits Syst. Video Technol.* **2016**, *26*, 1187–1199.
25. Bagni, D.; Kannan, P.; Neuendorffer, S. Demystifying the Lucas-Kanade Optical Flow Algorithm with Vivado HLS; Technical Report XAPP1300; Xilinx: San Jose, CA, USA, 2017.
26. Hsiao, S.F.; Tsai, C.Y. Design and Implementation of Low-Cost LK Optical Flow Computation for Images of Single and Multiple Levels. In Proceedings of the 2018 21st Euromicro Conference on Digital System Design (DSD), Prague, Czech Republic, 29–31 August 2018; pp. 276–279.
27. Blachut, K.; Kryjak, T.; Gorgon, M. Hardware implementation of multi-scale Lucas-Kanade optical flow computation algorithm—A demo. In Proceedings of the 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP), Porto, Portugal, 10–12 October 2018.
28. Murachi, Y.; Fukuyama, Y.; Yamamoto, R.; Miyakoshi, J.; Kawaguchi, H.; Ishihara, H.; Miyama, M.; Matsuda, Y.; Yoshimoto, M. A VGA 30-fps Realtime Optical-Flow Processor Core for Moving Picture Recognition. *IEICE Trans. Electron.* **2008**, *91-C*, 457–464.
29. Smets, S.; Goedemé, T.; Verhelst, M. Custom processor design for efficient, yet flexible Lucas-Kanade optical flow. In Proceedings of the 2016 Conference on Design and Architectures for Signal and Image Processing (DASIP), Rennes, France, 12–14 October 2016; pp. 138–145.
30. Martín, J.L.; Zuloaga, A.; Cuadrado, C.; Lázaro, J.; Bidarte, U. Hardware implementation of optical flow constraint equation using FPGAs. *Comput. Vis. Image Underst.* **2005**, *98*, 462–490.
31. Balazadeh Bahar, M.R.; Karimian, G. High performance implementation of the Horn and Schunck optical flow algorithm on FPGA. In Proceedings of the 20th Iranian Conference on Electrical Engineering (ICEE2012), Tehran, Iran, 15–17 May 2012; pp. 736–741.
32. Gultekin, G.K.; Saranli, A. An FPGA based high performance optical flow hardware design for computer vision applications. *Microprocess. Microsyst.* **2013**, *37*, 270–286.
33. Kunz, M.; Ostrowski, A.; Zipf, P. An FPGA-optimized architecture of Horn and Schunck optical flow algorithm for real-time applications. In Proceedings of the 2014 24th International Conference on Field Programmable Logic and Applications (FPL), Munich, Germany, 2–4 September 2014; pp. 1–4.
34. Johnson, B.; Sheeba Rani, J. A high throughput fully parallel-pipelined FPGA accelerator for dense cloud motion analysis. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 22–25 November 2016; pp. 2589–2592.
35. Komorkiewicz, M.; Kryjak, T.; Gorgon, M. Efficient Hardware Implementation of the Horn-Schunck Algorithm for High-Resolution Real-Time Dense Optical Flow Sensor. *Sensors* **2014**, *14*, 2860–2891.
36. Johnson, B.; Thomas, S.; Rani, J.S. A High-Performance Dense Optical Flow Architecture Based on Red-Black SOR Solver. *J. Signal Process. Syst.* **2020**, *92*, 357–373.
37. Imamura, K.; Kanda, S.; Ohira, S.; Matsuda, Y.; Matsumura, T. Scalable Architecture for High-Resolution Real-time Optical Flow Processor. In Proceedings of the 2019 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS), Bali, Indonesia, 5–7 November 2019; pp. 248–253.
38. Bournias, I.; Chotin, R.; Lacassagne, L. FPGA Acceleration of the Horn and Schunck Hierarchical Algorithm. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 22–28 May 2021; pp. 1–5.
39. Kowalczyk, M.; Przewłocka, D.; Kryjak, T. Real-Time Implementation of Contextual Image Processing Operations for 4K Video Stream in Zynq UltraScale+ MPSoC. In Proceedings of the 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP), Porto, Portugal, 10–12 October 2018; pp. 37–42.
40. Batcher, K.E. Sorting Networks and Their Applications. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April–2 May 1968; Association for Computing Machinery: New York, NY, USA, 1968; pp. 307–314.
41. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; Part IV, LNCS 7577; pp. 611–625.
42. Menze, M.; Heipke, C.; Geiger, A. Object Scene Flow. *ISPRS J. Photogramm. Remote Sens.* **2018**, *140*, 60–76.
43. Baker, S.; Scharstein, D.; Lewis, J.; Roth, S.; Black, M.; Szeliski, R. A Database and Evaluation Methodology for Optical Flow. *Int. J. Comput. Vis.* **2011**, *92*, 1–31.

**Figure 2.**Optical flow calculation scheme in the multi-scale method. An image pyramid is generated for both frames from the sequence. Then, the optical flow is determined in the smallest scale, and its results are used to modify the previous frame to perform the motion compensation. Next, the optical flow is calculated in a bigger scale and the motion compensation is performed again—this procedure continues to the biggest scale.

**Figure 4.** Comparison of context generation for different data formats. The 2ppc format can be generalised to any Xppc. (**a**) Context generation for the 1ppc format; (**b**) context generation for the 2ppc format. Two central pixels are coloured in red and blue.

**Figure 6.** Scheme of calculating derivatives in the LK algorithm. Using a context of size 3 × 3 px, three derivatives are calculated for one pixel. Image n−1 denotes the previous frame, while Image n denotes the current one.

**Figure 7.**Scheme of the summation method for the 4ppc data format. The sum of non-white pixels is calculated only once for the joint context, saving hardware resources.

**Figure 9.** Derivatives calculation scheme in the HS algorithm. White pixels are from $I_1$ (previous frame), while grey pixels are from $I_2$ (current frame). Using a cube of size 2 × 2 × 2 px, three derivatives are calculated for one pixel.

**Figure 10.** Methods for processing multiple iterations of the HS algorithm. (**a**) Pipeline approach; (**b**) iterative approach.

**Figure 12.** Implemented downscaling method. Black pixels in the output image are invalid, i.e., their tvalid_mod signal is set to 0, so they are not processed in the smaller scale. The blue, green and gold contexts are the next ones processed in the same way.

**Figure 13.**Warping of the image based on the calculated optical flow. The dark red circle in the input image is the processed pixel, the blue line shows its movement (optical flow), the green circle determines the pixel’s target position (with the fractional parts), while black dots represent “pixel centres” (for the purpose of visualising fractionals). A context of 2 × 2 px is generated for bilinear interpolation, and the resulting pixel brightness is assigned to the processed pixel in the output image.
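The motion compensation of Figure 13 amounts to bilinearly sampling the image at fractional positions displaced by the flow. A NumPy sketch of that idea (a software model, not the streaming hardware module; boundary clamping is an assumption made so the 2 × 2 context always stays in-bounds):

```python
import numpy as np

def warp_bilinear(img, u, v):
    """Move each pixel by its flow vector (u, v) and read the brightness at
    the fractional target position with bilinear interpolation over a
    2 x 2 context, as in Figure 13."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Target sampling positions, clamped so the 2 x 2 context is in-bounds.
    tx = np.clip(xs + u, 0, w - 1.001)
    ty = np.clip(ys + v, 0, h - 1.001)
    x0 = tx.astype(int)
    y0 = ty.astype(int)
    fx = tx - x0                              # fractional parts
    fy = ty - y0
    img = img.astype(np.float64)
    return ((1 - fy) * ((1 - fx) * img[y0, x0]     + fx * img[y0, x0 + 1])
            +     fy  * ((1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]))
```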

**Figure 14.** Implemented upscaling method. Black pixels in the input image are invalid, i.e., their tvalid_mod signal is set to 0 and they were not processed in the smaller scale. A context of 2 × 2 valid pixels is generated for the bilinear interpolation to calculate new valid pixels, which are put in the relevant places of the output image. The blue, green, gold and purple contexts are the next ones processed in the same way.

**Figure 15.** Block scheme of the multi-scale optical flow computation. In the case of our implementation, white modules work in the 4ppc format and grey ones in the 2ppc format. Black and grey arrows have the same meaning, signifying data transfer in the 4ppc or 2ppc format, respectively. The optical flow module can implement the LK method as in Section 4.2.1, the HS method as in Section 4.2.2 or, in the general case, another algorithm.

**Figure 16.** Exemplary results for the sequence Camera motion from the MIT CSAIL database. The images show: (**a**) a frame from the sequence, (**b**) the ground truth, (**c**) the result of the LK algorithm, (**d**) the result of the HS algorithm.

**Figure 17.**Exemplary results on Middlebury dataset sequences. The following rows contain the sequences Hydrangea and Grove3. The columns show a frame from the sequence, the ground truth, the result of the LK algorithm and the result of the HS algorithm.

**Figure 18.**The photo of the proposed optical flow system. The input video signal is transmitted from the computer (the source) and processed in real-time on a ZCU 104 platform equipped with the Xilinx Zynq UltraScale+ MPSoC device. The calculated optical flow is transmitted and visualised on a 4K resolution monitor. The video being processed shows traffic at the intersection seen from above (top left corner of the image). Different colours relate to various directions of the moving objects.

**Table 1.** Hardware implementations of the LK algorithm on an FPGA platform. Our solution works for the highest video resolution compared to those available in the literature. It also uses the multi-scale approach, which was previously realised only for VGA and HD resolutions.

Implementation | Scales | Resolution | FPS | Platform
---|---|---|---|---
Diaz [20] | 1 | 320 × 240 | 30 | Xilinx Virtex 2000-E
Diaz [21] | 1 | 800 × 600 | 170 | Xilinx Virtex II XC2V6000-4
Barranco [22] | 1 | 640 × 480 | 270 | Xilinx Virtex4 XC4vfx100
Kalyan [23] | 1 | 1200 × 680 | 500 | Altera Cyclone II
Seong [24] | 1 | 800 × 600 | 196 | Xilinx Virtex-6 LX760
Bagni [25] | 1 | 1920 × 1080 | 123 | Xilinx Zynq 7045-2
Murachi [28] | 3 | 640 × 480 | 30 | Custom 90 nm CMOS
Smets [29] | 3 | 640 × 480 | 16 | Custom 40 nm CMOS
Hsiao [26] | 3 | - | - | Xilinx Virtex-4 FX100
Barranco [22] | 4 | 640 × 480 | 32 | Xilinx Virtex4 XC4vfx100
Blachut [27] | 2 | 1280 × 720 | 50 | Xilinx Virtex-7 VC707
This work | 2 | 3840 × 2160 | 60 | Xilinx UltraScale+ ZCU 104

**Table 2.** Hardware implementations of the HS algorithm on an FPGA platform. Our solution can process a 4K video in multiple scales, which is rarely found in the literature and was previously done at most for Full HD resolution.

Implementation | Scales | Resolution | Iterations | FPS | Platform
---|---|---|---|---|---
Martin [30] | 1 | 256 × 256 | 1 | 60 | Altera APEX 20K
Bahar [31] | 1 | 320 × 240 | 8 | 1029 | Altera Cyclone II
Gultekin [32] | 1 | 256 × 256 | 1 | 257 | Altera Cyclone II EP2C70
Kunz [33] | 1 | 640 × 512 | 30 | 30 | Altera Stratix IV
Kunz [33] | 1 | 4096 × 2304 | 20 | 30 | Altera Stratix IV
Johnson [34] | 1 | 3750 × 3750 | 10 | 30 | Xilinx Virtex-7 VC707
Komor. [35] | 1 | 1920 × 1080 | 32 | 60 | Xilinx Virtex-7 VC707
Komor. [35] | 1 | 1920 × 1080 | 128 | 84 | Xilinx Virtex-7 VC707
Johnson [36] | 1 | 1920 × 1080 | 15 | 200 | Xilinx Virtex-7 VC707
Johnson [36] | 1 | 3840 × 2160 | 15 | 48 | Xilinx Virtex-7 VC707
Imamura [37] | 2 | 1920 × 1080 | 32, 8 | 60 | Custom 90 nm CMOS
Bournias [38] | 3 | 1024 × 1024 | 20, 10, 5 | 29 | Altera Stratix V
This work | 2 | 3840 × 2160 | 10, 5 | 60 | Xilinx UltraScale+ ZCU 104

**Table 3.** Resource utilisation for the LK module in different data formats in 4K resolution. The resource usage decreases significantly as fewer pixels are processed in parallel. The 0.5ppc mode requires fewer memory elements, but a similar number of computing elements as the 1ppc mode. In general, many scales can be processed at the same time, each with reduced hardware utilisation.

Resource Type | 4ppc | 2ppc | 1ppc | 0.5ppc
---|---|---|---|---
LUT | 47,683 | 26,254 | 15,574 | 15,565
Flip-Flop | 82,482 | 49,987 | 33,999 | 33,981
Block RAM | 92 | 50 | 25 | 17
DSP | 540 | 290 | 165 | 165

**Table 4.** Resource utilisation for the HS module with 10 iterations in different data formats in 4K resolution. The resource usage decreases significantly as fewer pixels are processed in parallel. The 0.5ppc mode requires fewer memory elements, but a similar number of computing elements as the 1ppc mode. In general, many scales can be processed at the same time, each with reduced hardware utilisation.

Resource Type | 4ppc | 2ppc | 1ppc | 0.5ppc
---|---|---|---|---
LUT | 43,265 | 21,907 | 11,462 | 11,408
Flip-Flop | 61,530 | 31,405 | 16,958 | 17,069
Block RAM | 134 | 69 | 39.5 | 26.5
DSP | 488 | 244 | 122 | 122

**Table 5.** Comparison of $E_{AAE}$ errors in degrees for Middlebury dataset sequences. D—Dimetrodon, V—Venus, H—Hydrangea, G2—Grove2, G3—Grove3.

Implementation | Method | Version | D | V | H | G2 | G3
---|---|---|---|---|---|---|---
Seyid [19] | Block | 3 scales | 8.23 | 6.41 | 14.80 | 5.80 | 10.90
Hsiao [26] | LK | 1 scale | 35.69 | - | - | - | -
Hsiao [26] | LK | 3 scales | 21.35 | - | - | - | -
Smets [29] | LK | 2 scales | 20.51 | 24.16 | 19.32 | 11.51 | 16.05
Smets [29] | LK | 4 scales | 10.15 | 16.21 | 8.28 | 5.50 | 10.08
This work | LK | 1 scale | 20.44 | 41.92 | 34.51 | 38.22 | 37.36
This work | LK | 2 scales | 12.54 | 28.14 | 18.08 | 17.81 | 24.88
Johnson [34] | HS | 10 iter. | 26.33 | - | 40.30 | - | -
Johnson [34] | HS | 50 iter. | 21.32 | - | 36.93 | - | -
Johnson [36] | HS | Precision | 10.67 | 26.12 | 25.23 | 26.88 | 26.64
Johnson [36] | HS | Throughput | 10.99 | 26.88 | 25.56 | 27.08 | 26.89
This work | HS | 1 scale | 32.27 | 41.99 | 35.61 | 33.11 | 35.68
This work | HS | 2 scales | 22.94 | 29.63 | 18.60 | 17.90 | 26.76

**Table 6.** Comparison of $E_{AEE}$ errors in pixels for Middlebury dataset sequences. D—Dimetrodon, V—Venus, H—Hydrangea, G2—Grove2, G3—Grove3.

Implementation | Method | Version | D | V | H | G2 | G3
---|---|---|---|---|---|---|---
Seyid [19] | Block | 3 scales | 0.44 | 0.47 | 1.98 | 0.42 | 0.99
Hsiao [26] | LK | 1 scale | 2.16 | - | - | - | -
Hsiao [26] | LK | 3 scales | 1.84 | - | - | - | -
This work | LK | 1 scale | 1.02 | 3.33 | 2.47 | 2.22 | 3.11
This work | LK | 2 scales | 0.63 | 2.53 | 1.45 | 1.19 | 2.35
Johnson [34] | HS | 10 iter. | 1.18 | - | 2.71 | - | -
Johnson [34] | HS | 50 iter. | 1.02 | - | 2.21 | - | -
Johnson [36] | HS | Precision | 0.63 | 2.34 | 2.23 | 1.56 | 2.53
Johnson [36] | HS | Throughput | 0.65 | 2.43 | 2.34 | 1.66 | 2.62
This work | HS | 1 scale | 1.32 | 2.97 | 2.41 | 1.90 | 3.02
This work | HS | 2 scales | 1.01 | 2.31 | 1.40 | 1.21 | 2.38

**Table 7.** Resource utilisation for the LK algorithm on a ZCU 104 platform. Due to the efficient implementation of the multi-scale method, there is only a small increase in resource utilisation (apart from BRAMs) when adding the second scale to the algorithm.

Resource Type | Available | Pass-Through | 1-Scale Version | 2-Scale Version
---|---|---|---|---
LUT | 230,400 | 38,097 (17%) | 89,167 (39%) | 122,734 (53%)
Flip-Flop | 460,800 | 44,673 (10%) | 123,995 (27%) | 183,688 (40%)
Block RAM | 312 | 7 (2%) | 119 (38%) | 311 (100%)
DSP | 1728 | 3 (0%) | 559 (32%) | 861 (50%)

**Table 8.** Resource utilisation for the HS algorithm on a ZCU 104 platform. Due to the efficient implementation of the multi-scale method, there is only a small increase in resource utilisation (apart from BRAMs) when adding the second scale to the algorithm.

Resource Type | Available | Pass-Through | 1-Scale Version | 2-Scale Version
---|---|---|---|---
LUT | 230,400 | 38,097 (17%) | 84,477 (37%) | 104,728 (45%)
Flip-Flop | 460,800 | 44,673 (10%) | 113,922 (25%) | 145,872 (32%)
Block RAM | 312 | 7 (2%) | 161 (52%) | 312 (100%)
DSP | 1728 | 3 (0%) | 507 (29%) | 523 (30%)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Blachut, K.; Kryjak, T. Real-Time Efficient FPGA Implementation of the Multi-Scale Lucas-Kanade and Horn-Schunck Optical Flow Algorithms for a 4K Video Stream. *Sensors* **2022**, *22*, 5017.
https://doi.org/10.3390/s22135017

**AMA Style**

Blachut K, Kryjak T. Real-Time Efficient FPGA Implementation of the Multi-Scale Lucas-Kanade and Horn-Schunck Optical Flow Algorithms for a 4K Video Stream. *Sensors*. 2022; 22(13):5017.
https://doi.org/10.3390/s22135017

**Chicago/Turabian Style**

Blachut, Krzysztof, and Tomasz Kryjak. 2022. "Real-Time Efficient FPGA Implementation of the Multi-Scale Lucas-Kanade and Horn-Schunck Optical Flow Algorithms for a 4K Video Stream" *Sensors* 22, no. 13: 5017.
https://doi.org/10.3390/s22135017