This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This paper presents a complete implementation of the Principal Component Analysis (PCA) algorithm in Field Programmable Gate Array (FPGA) devices applied to high rate background segmentation of images. The classical sequential execution of different parts of the PCA algorithm has been parallelized. This parallelization has led to the specific development and implementation in hardware of the different stages of PCA, such as computation of the correlation matrix, matrix diagonalization using the Jacobi method and subspace projections of images. On the application side, the paper presents a motion detection algorithm, also entirely implemented on the FPGA, and based on the developed PCA core. This consists of dynamically thresholding the differences between the input image and the one obtained by expressing the input image using the PCA linear subspace previously obtained as a background model. The proposal achieves a high ratio of processed images (up to 120 frames per second) and high quality segmentation results, with a completely embedded and reliable hardware architecture based on commercial CMOS sensors and FPGA devices.

One of the main research areas in the field of computer vision is the automatic description of the features of a given scene [

These algorithms are usually run on platforms that execute sequential programs, in which the only improvement currently available consists in applying multi-threaded programming techniques so that the power of modern multicore processors can be exploited. However, from a performance point of view these processing architectures are not efficient in many applications such as digital image processing, which normally requires a large number of operations to be handled at the bit level as quickly as possible, processing a small number of input samples in parallel. Owing to the sequential architecture of conventional computers, a considerable number of operations cannot be performed concurrently. Another issue is the amount of data processed in each instruction, which is limited by the type and width of the communication bus and by the image capture board; when a large amount of data must be handled, the system therefore performs slowly. This has given rise to solutions based on coprocessor systems that handle the low-level preprocessing tasks, where the amount of data to be processed is high but the operations to be carried out are simple [

The detection of both static and moving objects within a captured area is one of the more common tasks undertaken by many computer vision applications. Movement analysis is involved, among other things, in real time applications such as navigation and tracking and obtaining information about static and moving objects within a scene [

Within the field of image processing previous works have partially developed the processing algorithm of PCA using programmable devices. In [

These situations make it difficult to partition the hardware processing of the different parts of PCA. For this reason, the execution of PCA is normally divided between an FPGA and a PC or microprocessor [

One of the main contributions of this work is the FPGA implementation of the complete PCA algorithm on reconfigurable hardware; indeed, it is the first work in the literature to do so. The classic sequential execution of the different parts of the PCA algorithm has been parallelized. This parallelization has led to the development and implementation of seldom-used alternatives for the different stages of PCA, such as the calculation of eigenvalues and eigenvectors, matrix multiplication in hardware, and the calculation of a dynamic threshold for detecting moving objects. This last point is another major contribution of the paper, because the information generated by PCA is used directly to detect moving objects within a scene. To achieve this, a specifically designed intelligent camera has been implemented based on a CMOS sensor and an FPGA [

The other sections of this paper are as follows: Section 2 sets out the mathematical foundations of the PCA algorithm applied to image processing; Section 3 describes the platform design; Section 4 presents the implementation in VHDL of the PCA algorithm on an FPGA; and finally, Sections 5 and 6 set out the results and present the conclusions respectively.

Principal Component Analysis (PCA) is a method that is used in different fields, such as statistics, power electronics or artificial vision. The main feature of PCA is the reduction of redundant information, retaining only information that is fundamental (principal components).

Artificial vision is a good example of a field where the PCA technique can be applied directly, as an image contains a large number of highly correlated variables (pixels). Therefore, applying the PCA technique to image processing allows us to reduce the redundant information of the initial variables and determine the degree of similarity between two or more images by analyzing only the basic features within the transformed space. This last feature is of interest as far as the detection of new objects within the scene is concerned.

The PCA algorithm can be applied to images using the following steps [

Consider a set of M reference images Γ_i ∈ R^{N×N}, i = 1, …, M.

Each image is represented as a column vector Γ_i of dimension N²×1.

Calculating the mean image from the M reference images, Ψ ∈ R^{N²×1}, given by:

Ψ = (1/M) · Σ_{i=1}^{M} Γ_i

where Γ_{i,j} is the j-th (j = 1, …, N²) element of the Γ_i image.

Form a matrix A ∈ R^{N²×M} whose columns are the difference images Φ_j = Γ_j − Ψ, i.e., A = [Φ_1, Φ_2, …, Φ_M].

The covariance matrix of the set is C = A·A^T ∈ R^{N²×N²}.

Since C is of size N²×N², its direct diagonalization is unaffordable. Instead, the reduced matrix A^T·A ∈ R^{M×M} is diagonalized: if v is an eigenvector of A^T·A, then u = A·v is an eigenvector of C = A·A^T with the same eigenvalue.

The eigenvalues are sorted in descending order, λ_1 > λ_2 > … > λ_t, and only the t most significant ones are retained.

The transformation matrix U_t ∈ R^{N²×t} is given by (8), where U_t = [u_1, u_2, …, u_t] and u_1, u_2, …, u_t are the eigenvectors associated with the t largest eigenvalues.

An important issue is the quantification of the value of t.

According to the results shown in [_{t}

In the transformed space, each image is characterized by a vector Ω ∈ R^{t×1} of dimension t, Ω = [ω_1, ω_2, …, ω_t]^T, where ω_i = u_i^T·Φ_j is the projection of the difference image Φ_j onto the i-th eigenvector.

The image Γ_j ∈ R^{N²×1} is recovered from its projection using (10): Γ̂_j = U_t·Ω_j + Ψ.
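The chain of steps above (mean image, difference matrix, reduced diagonalization, projection and recovery) can be sketched in software. The following is a minimal NumPy illustration of the method, with hypothetical function names, standing in for the paper's hardware datapath:

```python
import numpy as np

def train_pca_background(images, t):
    """Off-line stage: build the PCA background model from M reference
    images (each N x N), using the reduced M x M diagonalization trick.
    All names here are illustrative, not the paper's VHDL entities."""
    G = np.stack([im.reshape(-1) for im in images], axis=1)  # N^2 x M
    psi = G.mean(axis=1, keepdims=True)                      # mean image
    A = G - psi                                              # difference images
    L = A.T @ A                                              # M x M instead of N^2 x N^2
    vals, vecs = np.linalg.eigh(L)                           # ascending eigenvalues
    order = np.argsort(vals)[::-1][:t]                       # keep the t largest
    U = A @ vecs[:, order]                                   # eigenvectors of A A^T
    U /= np.linalg.norm(U, axis=0)                           # normalize columns
    return psi, U

def project_and_recover(image, psi, U):
    """On-line stage: project a new image onto the subspace and recover it."""
    phi = image.reshape(-1, 1) - psi
    omega = U.T @ phi            # coordinates in the transformed space
    recovered = U @ omega + psi  # back-projection (Eq. 10)
    return recovered.reshape(image.shape)
```

In a surveillance setting, the recovered image approximates how the background model "explains" the input; large per-pixel differences indicate new objects.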

The difference between the input image and its recovery is used to build a Map of Distances (MD), from which new objects are detected by comparison against a threshold (Th_MD).

To reduce the influence of noise, MD is smoothed with an averaging mask, obtaining the averaged map of distances (MD_V), where each of its elements (MD_{V,w,i}) is the mean of the corresponding neighbourhood of MD.

Once the MD_V map has been obtained, it is compared against the dynamic threshold Th_MD in order to mark the pixels belonging to new objects.

The system proposed is based on a high speed CMOS sensor (up to 500 images per second with a maximum resolution of 1,280 × 1,024 [

The mathematical complexity of the operations of the PCA algorithm presented in Section 2 (calculation of eigenvectors, matrix multiplication, square roots, etc.) makes it impractical to implement them directly on reconfigurable hardware. Proposing and selecting hardware structures and computing alternatives that solve these operations efficiently on FPGAs is essential for the PCA implementation and thus constitutes one of the major contributions of this paper. This section presents the hardware solution that permits the PCA algorithm to be implemented on an FPGA.

The first phase of the PCA algorithm is the generation of the eigenvectors that form the reduced transformation matrix U_t. It comprises the following steps:

Calculating the mean of the M images (Ψ ∈ R^{N²×1}) and the difference matrix A ∈ R^{N²×M} (3).

Obtaining the reduced covariance matrix A^T·A ∈ R^{M×M}.

Calculating the eigenvectors of the matrix A^T·A ∈ R^{M×M} and retaining the t most significant ones, V_t ∈ R^{M×t}.

Obtaining the eigenvectors U_t ∈ R^{N²×t} of the covariance matrix from V_t.

Calculating the norms of matrix eigenvectors.

The hardware architecture that has been developed for this module stores the captured images in memory (block B_1). Once the eight pixels have been extracted, one for each image, the mean calculation process is initiated using a set of cascade adders (see blocks B_2 to B_6).
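A behavioural sketch of the cascade-adder mean over eight pixels (one per reference image): the three-level adder tree and the shift-based division by eight mirror the structure such a block typically has, not the paper's exact VHDL.

```python
def mean_of_eight(pixels):
    """Adder-tree mean of eight pixel values, as a cascade of two-input
    adders (three levels), with the final division by 8 performed as a
    right shift. Integer (truncating) arithmetic, as in hardware."""
    assert len(pixels) == 8
    s1 = [pixels[i] + pixels[i + 1] for i in range(0, 8, 2)]  # level 1: 4 adders
    s2 = [s1[0] + s1[1], s1[2] + s1[3]]                       # level 2: 2 adders
    s3 = s2[0] + s2[1]                                        # level 3: 1 adder
    return s3 >> 3                                            # divide by 8
```

In hardware, each level can be pipelined so that a new set of pixels enters the tree every clock cycle.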

Generating the matrix A^T·A and the eigenvectors that form U_t

The computation of eigenvalues and eigenvectors represents the greatest computational burden on the PCA algorithm. Different techniques have been proposed for obtaining the eigenvalues of a matrix using specific hardware, all of them based on recurrent methods that look to diagonalize the matrix [
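The recurrent diagonalization can be sketched with the Jacobi method, which is the one this paper adopts. The software version below applies cyclic plane rotations until the matrix is (numerically) diagonal; a hardware implementation would typically realize each rotation with shift-add (CORDIC-like) stages rather than trigonometric functions:

```python
import numpy as np

def jacobi_eig(S, sweeps=10):
    """Cyclic Jacobi diagonalization of a symmetric matrix S: each plane
    rotation zeroes one off-diagonal element. Returns (eigenvalues,
    eigenvector matrix with eigenvectors as columns)."""
    A = S.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)  # accumulates the applied rotations = eigenvectors
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                # angle that annihilates A[p, q]: tan(2θ) = 2A_pq / (A_qq - A_pp)
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J   # similarity transform preserves eigenvalues
                V = V @ J
    return np.diag(A), V
```

Because each sweep only reads and writes two rows and two columns, the rotations map naturally onto a systolic or sequential hardware datapath.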

The first step in determining the most significant eigenvectors is the diagonalization of the matrix by means of the Jacobi method.

To obtain the matrix U_t, the matrix A is multiplied by the reduced eigenvector matrix V_t (U_t = A·V_t).

The eigenvectors obtained in the previous stage do not have unit norm, so they must be normalized (14), producing U_tn.

Implementing the arithmetic operations shown in expressions (14) and (15) in hardware is extremely complex because of the square root, and it also consumes a large amount of resources. To avoid calculating the square root when calculating Ω_j, the operations are reformulated to work with squared magnitudes.
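One common way to avoid the square root, which may or may not match the paper's exact reformulation, is to defer the normalization and operate on squared magnitudes; a sketch, with illustrative names:

```python
def projection_without_sqrt(u, phi):
    """Avoid normalizing the eigenvector u by its Euclidean norm (which
    needs a square root): compute the un-normalized dot product and carry
    the squared norm alongside. The square of the normalized coefficient
    is then formed with one division and no square root."""
    dot = sum(a * b for a, b in zip(u, phi))  # u^T * phi
    norm_sq = sum(a * a for a in u)           # ||u||^2, no sqrt needed
    coeff_sq = dot * dot / norm_sq            # (dot / ||u||)^2
    return coeff_sq
```

Any subsequent comparison against a threshold T is then performed on squared values (coeff_sq >= T*T), so the square root never has to be materialized in hardware.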

Whether a new object is present in the captured image, with respect to the reference scene, is determined during the on-line stage, which processes each new image Γ_j as follows:

Projecting the difference image Φ_j = Γ_j − Ψ onto the transformed space to obtain Ω_j.

Recovering the image Γ̂_j from Ω_j using the transformation matrix.

Computing the error between Γ_j and the recovered image Γ̂_j.

Thresholding the resulting map of distances to mark the pixels of Γ_j that belong to new objects.

Once the first results from the projection stage are available, the recovery of the N²-pixel image begins.

To perform the division operation on an FPGA, there are basically two possibilities: either design a division unit specifically for that purpose, or use a coordinate rotation digital computer (CORDIC) algorithm [
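As an illustration of the first option, a dedicated division unit typically implements bit-serial restoring division, producing one quotient bit per clock cycle. The sketch below models that behaviour in software and is not taken from the paper:

```python
def restoring_divide(dividend, divisor, bits=16):
    """Bit-serial restoring division of unsigned integers: shift one
    dividend bit into the partial remainder per step; if the remainder
    covers the divisor, subtract it and set the quotient bit. Each loop
    iteration corresponds to one clock cycle of the hardware unit."""
    assert divisor > 0 and dividend >= 0
    rem, quo = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)  # shift in next bit
        if rem >= divisor:
            rem -= divisor
            quo |= 1 << i
    return quo, rem
```

The CORDIC alternative trades this fixed-latency shift-subtract loop for rotation-based iterations that can share hardware with other transcendental operations.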

When the first component of the division has been obtained, the next step to be performed in (16) is to obtain Ω_j.

This section presents the solution developed for identifying new objects from the recovery error (Γ_j − Γ̂_j). It covers the construction of the map of distances (MD), its averaged version (MD_V), and the dynamic threshold (Th_MD) used to decide which pixels of Γ_j belong to new objects.

The Map of Distances (MD) holds, for each pixel i, the distance between the i-th pixel of the captured image Γ_j and the i-th pixel of the recovered image Γ̂_j, for i = 1, …, N² (for images of the size N×N).

Working with the square of the Euclidean distance rather than the Euclidean distance itself (17) facilitates the design of the hardware associated with this function, as it avoids the need to perform the square root operation. As such, each element of MD is obtained with only one subtraction and one multiplication.
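The per-pixel squared error that fills MD can be written directly; names are illustrative and the images are treated as flat pixel sequences:

```python
def map_of_distances(img, recovered):
    """Per-pixel squared error between the input image and its PCA
    recovery: MD_i = (img_i - rec_i)^2. Using the squared distance
    needs one subtraction and one multiplication per pixel, and no
    square root."""
    return [(a - b) * (a - b) for a, b in zip(img, recovered)]
```
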

Once the initial components of MD are available, the construction of MD_V begins, where each of its elements (MD_{v i,w}) is the mean of the corresponding neighbourhood of MD.

The size of the averaging mask is a compromise: larger masks reject more noise but also blur small objects. In this work, a mask of 3 × 3 elements has been used to implement the averaging function.

To perform the convolution between a matrix and a generic mask, nine multiplication operations and eight accumulation operations must be performed for each resulting component. However, when all the coefficients of the mask are identical, as happens in our case, the convolution can be performed according to (19), which reduces the number of multiplications to one: the nine neighbours are first accumulated and the result is multiplied once by the common coefficient. In this way, each element MD_{vi,w} of MD_V is obtained with eight additions and a single multiplication.

Once the map of average distances (MD_V) has been obtained, it is compared with the dynamic threshold Th_MD: the pixels of MD_V that exceed Th_MD are marked as belonging to new objects. The calculation of Th_MD is described next.

Analyzing the information supplied by the histogram of the maxima of the columns of MD_V, the background pixels concentrate around a dominant peak, while new objects produce values beyond it; the threshold Th_MD is therefore placed at the valley that separates both regions.

The hardware to perform the threshold is shown in

Obtaining the maximum of each column of MD_V.

Building the histogram of these column maxima of MD_V.

Locating the maximum (MX) of the histogram.

To find the valley, a hardware block has been designed to scan the histogram memory, starting from the maximum, until a local minimum of the MD_V values is found.
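A behavioural sketch of the valley search follows; it assumes the histogram has already been built, and the scan direction and stop condition are our reading of the text, not the VHDL block itself:

```python
def valley_threshold(hist):
    """Dynamic threshold from a histogram of column maxima: start at the
    dominant (background) peak, walk towards higher bin indices, and stop
    at the first local minimum (the valley separating background from
    new-object values). Returns the bin index used as threshold."""
    peak = max(range(len(hist)), key=lambda i: hist[i])
    for i in range(peak + 1, len(hist) - 1):
        if hist[i] <= hist[i - 1] and hist[i] <= hist[i + 1]:
            return i  # first valley after the main peak
    return len(hist) - 1  # no valley found: treat everything as background
```

Because the scan is a single pass over a small memory, it fits in a few clock cycles per frame on the FPGA.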

This section sets out the results obtained in detecting new objects with an FPGA running the PCA algorithm. All the images presented in this work have been captured by the “intelligent camera” described in [

From a quantitative point of view, in calculating the execution time of the entire proposal presented in this work (T_{PCA_TOTAL}) from the moment the first

When it comes to calculating the number of complete clock cycles employed by T_PCA_TOTAL, the value obtained is not constant, as it depends on the number of significant eigenvectors, the size of the matrix and the number of Jacobi algorithm iterations, as explained in [. T_CLK_CAMERA is the signal period of the CMOS sensor’s clock and T_CLK that of the FPGA’s master clock. The camera clock is generated by the FPGA using a DCM (digital clock management) block; thanks to this element and to a register bank managed by an FSM (finite state machine), both clock domains work correctly together. To estimate the rate at which the system processes images: if the CMOS sensor’s clock runs at 66 MHz and the FPGA’s master clock at 100 MHz (the frequency reached once the entire system has been implemented), a minimum of 121 images of 256 × 256 pixels are processed per second. This ratio increases notably if any of the following situations occur:

Under these conditions, the T_PCA_TOTAL value would reach an equivalent image-per-second ratio of 189.

As may be seen from this figure, from

With respect to the frequency of the FPGA clock, according to the reports generated by the implementation tool, a maximum value of 112.4 MHz for the entire FPGA is assured. However, the master frequency chosen for our design is 100 MHz, since from this value all the other necessary frequencies (the camera and external memory frequencies) can be generated.

As for the real results obtained,

This work presents a new image capture and processing system implemented on FPGAs for detecting new objects in a scene, starting from a reference model of the scene. To achieve this, the Principal Component Analysis (PCA) technique has been used. The main objective has been to parallelize it in order to achieve a concurrent execution which enables processing speeds of around 120 images per second. This processing speed, including all stages included in the PCA technique (calculating eigenvalues and eigenvectors, projection and recovery of images to/from the transformed space, obtaining map of distances,

This work was made possible thanks to the sponsorship of the Ministry of Education and Science (MEC) and the projects ESPIRA (REF-DPI2009-10143) and SIAUCON (REF-CCG08-UAH/DPI-4139), funded by the University of Alcala and the Madrid Regional Government.

Block diagram of the internal architecture of the FPGA.

Block diagram of the PCA algorithm implemented on an FPGA.

Block diagram of the proposed circuit for calculating the mean (

Block diagram of the modules of the design in VHDL of the on-line stage of the PCA.

Proposal for the system consisting of the construction of the MD, detection of new objects and the updating of the background model.

Example of histogram construction of the maxima of the columns for an average map of distances (MD_V).

Block diagram on an FPGA of the dynamic threshold calculating system for detecting new objects.

Ratio of images achieved per second with

Sequence of images captured to determine new objects.

Sequence of images detected to determine a new object from those captured in

Description of the partial times of T_{PCA_TOTAL}.

T_GEN_WR_U | Time the FPGA takes to generate and write in SDRAM the eigenvectors of the matrix U_t
T_IMAGE | Time employed in capturing a new image and its subsequent writing in SDRAM
L_MEM | Latency of the SDRAM memory, from the time it gives the order to read an image until the first data is received
T_OBJ | Time consumed in detecting new objects once the recovered image (Γ̂_j) is available

Summary of all the resources consumed by the entire developed system on a XC2VP7.

Slices | MULT18x18 multipliers | Block RAMs | F_CLKMAX
---|---|---|---
4225 (86%) | 40 (91%) | 43 (98%) | 112.4 MHz