A Real-Time Marker-Based Visual Sensor Based on a FPGA and a Soft Core Processor

This paper introduces a real-time marker-based visual sensor architecture for mobile robot localization and navigation. A hardware acceleration architecture for post video processing system was implemented on a field-programmable gate array (FPGA). The pose calculation algorithm was implemented in a System on Chip (SoC) with an Altera Nios II soft-core processor. For every frame, single pass image segmentation and Feature Accelerated Segment Test (FAST) corner detection were used for extracting the predefined markers with known geometries in FPGA. Coplanar PosIT algorithm was implemented on the Nios II soft-core processor supplied with floating point hardware for accelerating floating point operations. Trigonometric functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Inverse square root method has been implemented for approximating square root computations. Real time results have been achieved and pixel streams have been processed on the fly without any need to buffer the input frame for further implementation.


Introduction
Computer vision (CV) has many possible utilizations and is one of the fastest growing research areas these days. Passive optical sensors have been used in the automotive industry for tasks such as management and off-board traffic observation [1]. The CV technique is often used in driver assistance and traffic sign detection in onboard systems [2]. Also, the augmented reality (AR) resulting from CV functions, improves the real environment by using virtual elements and allows diversified functions such as guided order picking or upkeep efforts [3,4].
Various light conditions, different views of an object, surface reflections and noise from image sensors are the challenging questions with which one has to deal when it comes to pose estimation and optical object detection. The solution to these problems to some extent can be achieved thanks to the use of SIFT or SURP algorithms as they compute the point features, which are invariant towards scaling and rotation [5,6]. However, these algorithms demand powerful hardware and consume high battery power because of their high computational complexity, making their use in automotive applications and in the field of mobile devices generally challenging.
The Nios II microprocessor has been widely used in many applications such as image processing, control, and mathematic acceleration. Low-cost sensors have been discussed in [7]. They have accelerated block matching motion estimation techniques using the Altera C2H. Their processor can process 50 × 50 at 29.5 fps. Custom instructions in the Nios II were used for accelerating block matching algorithms [8] where a 75% improvement was achieved by utilizing custom instructions. In [9], customized Nios II multi-cycle instructions have been used to accelerate block matching techniques.

Related Works
Natural features of images, including global or local features, can been used in object recognition because of the increasing processing power of traditional PC hardware [21]. Feature recognition techniques have been used in developing many applications such as autonomous systems and automotive systems [22]. Harris Corner Detection, SIFT, and SURF are well-known algorithms for finding and describing point features [5,23]. SIFT and SURF extract local features which outperform global features [6,24]. However, the speed of SURF is faster than SIFT. The main drawback of the aforementioned algorithms is the need for high computational power which requires powerful PC-like hardware [21], making the availability of high-speed platforms with such powerful PC-like hardware a necessity to implement these algorithms. A modern GPU hardware and programmable logic (FPGAs) have been used to meet the computational requirements of these algorithms [25]. Histogram of Oriented Gradient (HOG) for object detection has been implemented using reduced bit width fixed point presentation in order to reduce required area resources. This implementation was able to process 68.2 fps for 640 × 480 image resolution on single FPGA [26]. Another reduced fixed point bit width for human action recognition has been introduced in [27] where hardware requirements have been reduced due to using a classifier with 8 bit-fixed-point. Bio-inspired optical flow has been implemented in reconfigurable hardware using a customizable architecture [28,29]. FPGA was used in order to enhance the performance of a localization algorithm. FAST corner detection has been accelerated using a FPGA-based grid approach in [30]. Mirage pose estimation has been used for robot localization and trajectory tracking. Pose was estimated analytically without iterative solutions [31]. A POSIT algorithm for non-coplanar feature points has been used for UAV localization in [32].
A SURF-based object recognition and pose estimation embedded system was introduced in [33]. This system is able to process images at five frames per second and recognize nine different objects. However, this system is not suitable for real-time systems due to the latency produced by the SURF Sensors 2016, 16, 2139 3 of 21 algorithm implementation. On-board vision sensors have been designed and implemented using a linked list algorithm in [34]. In this work, memory consumption had been minimized and the proposed algorithm is able to detect blobs in real-time.

Architecture of the Proposed System
The proposed system has five main stages as shown in Figure 1. It starts by converting the input pixel stream into grayscale which will be the input for the next stage. The grayscale level is represented by 8-bits. In the second stage, image segmentation and FAST corner detection are performed on parallel for extracting foreground objects and objects corners, respectively. The results of the second stage will be bounding box coordinates for every detected object and its corners. Bounding box coordinates and corners coordinates will be stored in the third stage which is implemented by M9K block memories. Since the main target of the proposed system is fast implementation, a simple marker has been chosen as shown in Figure 2. The marker geometries are predefined, with each side measuring 113 mm. This marker will reduce the matching time and lead to high-speed implementation. The output of pattern matching stage is four vertices of the detected markers which will be sent to the pose estimation stage. These vertices should be ordered as shown in Figure 2. The order of marker's vertices is important for pose estimation procedure. In the final stage, NIOS II soft-core processor will be used for executing the coplanar PosIt algorithm to find out the pose of the detected markers. In the following section, a detailed explanation about the components of the proposed system is given.
Sensors 2016, 16,2139 3 of 21 algorithm implementation. On-board vision sensors have been designed and implemented using a linked list algorithm in [34]. In this work, memory consumption had been minimized and the proposed algorithm is able to detect blobs in real-time.

Architecture of the Proposed System
The proposed system has five main stages as shown in Figure 1. It starts by converting the input pixel stream into grayscale which will be the input for the next stage. The grayscale level is represented by 8-bits. In the second stage, image segmentation and FAST corner detection are performed on parallel for extracting foreground objects and objects corners, respectively. The results of the second stage will be bounding box coordinates for every detected object and its corners. Bounding box coordinates and corners coordinates will be stored in the third stage which is implemented by M9K block memories. Since the main target of the proposed system is fast implementation, a simple marker has been chosen as shown in Figure 2. The marker geometries are predefined, with each side measuring 113 mm. This marker will reduce the matching time and lead to high-speed implementation. The output of pattern matching stage is four vertices of the detected markers which will be sent to the pose estimation stage. These vertices should be ordered as shown in Figure 2. The order of marker's vertices is important for pose estimation procedure. In the final stage, NIOS II soft-core processor will be used for executing the coplanar PosIt algorithm to find out the pose of the detected markers. In the following section, a detailed explanation about the components of the proposed system is given.   algorithm implementation. On-board vision sensors have been designed and implemented using a linked list algorithm in [34]. In this work, memory consumption had been minimized and the proposed algorithm is able to detect blobs in real-time.

Architecture of the Proposed System
The proposed system has five main stages as shown in Figure 1. It starts by converting the input pixel stream into grayscale which will be the input for the next stage. The grayscale level is represented by 8-bits. In the second stage, image segmentation and FAST corner detection are performed on parallel for extracting foreground objects and objects corners, respectively. The results of the second stage will be bounding box coordinates for every detected object and its corners. Bounding box coordinates and corners coordinates will be stored in the third stage which is implemented by M9K block memories. Since the main target of the proposed system is fast implementation, a simple marker has been chosen as shown in Figure 2. The marker geometries are predefined, with each side measuring 113 mm. This marker will reduce the matching time and lead to high-speed implementation. The output of pattern matching stage is four vertices of the detected markers which will be sent to the pose estimation stage. These vertices should be ordered as shown in Figure 2. The order of marker's vertices is important for pose estimation procedure. In the final stage, NIOS II soft-core processor will be used for executing the coplanar PosIt algorithm to find out the pose of the detected markers. In the following section, a detailed explanation about the components of the proposed system is given.

Grayscale
Grayscale pixels are needed to perform image segmentation and corner detection. In order to implement this subsystem in hardware efficiently, red and blue components were divided by four, through shifting two digits to the right. Whereas green component was divided by two, by shifting one digit to the right. The summation of the shifted components gives grayscale level as shown in Equation (1) where ">>" means logical right-shift. (1)

Segmentation
One of the most central image analysis tools is Blob analysis; it is used in marker-based tracking applications. The major idea of using blob analysis is extracting the foreground regions from the frame and calculating their geometric features. Talking about the classical algorithms, two or three successive raster scans through the image are required [35,36]. A temporary label is designated to the current foreground pixel based on the labels of the neighbors that have already been inspected in the first scan. Different labels can be used with the same blob [37]. In the second pass, the replacement of the temporary label is required by its equivalent. In the final pass, the features of each blob are computed based on the blob information such as area, bounding box, and orientation. It is clearly determined that the classical algorithms are not convenient for hardware implementation due to the following reasons, conversion of these algorithms to parallel is severe because of their sequential nature. Secondly, a complete image frame should be buffered before the next pass starts. To achieve this, we need high memory bandwidth and long processing delays, which is not acceptable in applications where the system latency is critical. In the proposed system, it is required that pixels should be processed on the fly to ensure minimum latency. As a result, single pass image segmentation architecture was adopted to fulfill this constraint. The single pass image segmentation architecture consists of four main blocks, as shown in Figure 3. It is based on the architecture suggested in [38]. This architecture has the following blocks.

Grayscale
Grayscale pixels are needed to perform image segmentation and corner detection. In order to implement this subsystem in hardware efficiently, red and blue components were divided by four, through shifting two digits to the right. Whereas green component was divided by two, by shifting one digit to the right. The summation of the shifted components gives grayscale level as shown in Equation (1) where ">>" means logical right-shift. (1)

Segmentation
One of the most central image analysis tools is Blob analysis; it is used in marker-based tracking applications. The major idea of using blob analysis is extracting the foreground regions from the frame and calculating their geometric features. Talking about the classical algorithms, two or three successive raster scans through the image are required [35,36]. A temporary label is designated to the current foreground pixel based on the labels of the neighbors that have already been inspected in the first scan. Different labels can be used with the same blob [37]. In the second pass, the replacement of the temporary label is required by its equivalent. In the final pass, the features of each blob are computed based on the blob information such as area, bounding box, and orientation. It is clearly determined that the classical algorithms are not convenient for hardware implementation due to the following reasons, conversion of these algorithms to parallel is severe because of their sequential nature. Secondly, a complete image frame should be buffered before the next pass starts. To achieve this, we need high memory bandwidth and long processing delays, which is not acceptable in applications where the system latency is critical. In the proposed system, it is required that pixels should be processed on the fly to ensure minimum latency. As a result, single pass image segmentation architecture was adopted to fulfill this constraint. The single pass image segmentation architecture consists of four main blocks, as shown in Figure 3. It is based on the architecture suggested in [38]. This architecture has the following blocks.

Neighborhood Block
This block is used for dealing with the neighborhood pixels. The neighbor pixels are stored in the registers P1, P2, P3 and P4. These registers are shifted at each clock cycle during the window scanning across the image. A row buffer is used for storing the last processed row of the image since the resulting labeled image is not stored. The labels of this row buffer need to be updated from the "Label_LUT" before being used in processing further input pixels of the image. Multiplexers will be added before P3 and P4. When there is a merger case the value of P3 should be updated by the new label value. For pixel P4, the next value of this pixel is being read out at the same time the "Label_LUT" is being updated which means that the value read from the "Label_LUT" is the label before the merger. Consequently, if the next value for P4 is not background pixel, the newly merged label should be written in P4.

Label_LUT
Output of the shift register will be assigned to "RD_address" of the "Label_LUT" block to assure that the appropriate label will be used for any label stored in the shift register. Initialization of the "Label_LUT" before the processing is not required because at any time when there is a new label, a new entry will be added in the "Label_LUT" pointing to itself. Whenever a new merger case appears, this LUT will be updated by the labeling block.

Labeling Block
Based on the neighboring pixels the labeling of the current pixels takes place making the use of following algorithm [36]: • Assign zero when the input pixel is background.

•
Assign new label when all neighbors are backgrounds.

•
In case of only one label is used in the labeled neighbors, this label will be assigned to the current pixel. • A merger condition occurs when there are two different labels used among the neighbors.
By understanding that image segmentation algorithm processes foreground pixels, the overall execution time is reduced by filtering out the background pixels that require no further process. The fact that the neighboring pixels are not independent, the processing of foreground pixels can be optimized [39]. For example, if the pixel P3-which is the neighbor of all the pixels in the mask-is foreground pixel, all pixels in the neighborhood must share the same label of P3. This shows that the number of the neighbors to be inspected is decreased from four to one. By studying the 8-connectivity pixel mask which is shown in Figure 4, a total of thirty-two cases will appear. These cases can be classified into ten categories, as listed in Table 1.

Neighborhood Block
This block is used for dealing with the neighborhood pixels. The neighbor pixels are stored in the registers P1, P2, P3 and P4. These registers are shifted at each clock cycle during the window scanning across the image. A row buffer is used for storing the last processed row of the image since the resulting labeled image is not stored. The labels of this row buffer need to be updated from the "Label_LUT" before being used in processing further input pixels of the image. Multiplexers will be added before P3 and P4. When there is a merger case the value of P3 should be updated by the new label value. For pixel P4, the next value of this pixel is being read out at the same time the "Label_LUT" is being updated which means that the value read from the "Label_LUT" is the label before the merger. Consequently, if the next value for P4 is not background pixel, the newly merged label should be written in P4.

Label_LUT
Output of the shift register will be assigned to "RD_address" of the "Label_LUT" block to assure that the appropriate label will be used for any label stored in the shift register. Initialization of the "Label_LUT" before the processing is not required because at any time when there is a new label, a new entry will be added in the "Label_LUT" pointing to itself. Whenever a new merger case appears, this LUT will be updated by the labeling block.

Labeling Block
Based on the neighboring pixels the labeling of the current pixels takes place making the use of following algorithm [36]:


Assign zero when the input pixel is background.  Assign new label when all neighbors are backgrounds.  In case of only one label is used in the labeled neighbors, this label will be assigned to the current pixel.  A merger condition occurs when there are two different labels used among the neighbors.
By understanding that image segmentation algorithm processes foreground pixels, the overall execution time is reduced by filtering out the background pixels that require no further process. The fact that the neighboring pixels are not independent, the processing of foreground pixels can be optimized [39]. For example, if the pixel P3-which is the neighbor of all the pixels in the mask-is foreground pixel, all pixels in the neighborhood must share the same label of P3. This shows that the number of the neighbors to be inspected is decreased from four to one. By studying the 8-connectivity pixel mask which is shown in Figure 4, a total of thirty-two cases will appear. These cases can be classified into ten categories, as listed in Table 1.
The remaining cases Leave Blob These categories can be explained as follows: • New Label: in this case, a new label is created and assigned to the input pixel P. A new entry will be created in "Label_LUT". A new entry in "Features Table" for the newly created label will be added. The number of blobs will be incremented by one. • Copy P1: in this case, the label of P1 will be assigned to P. The corresponding features values in "Features Table" at P1 address will be updated. "Label_LUT" will not be updated. A similar process will occur for Copy P2, Copy P3 and Copy P4. • Compare P1 and P4: If P1 equals P4, P1 will be copied to the pixel P and features values at P1 address will be updated. "Label_LUT" will not be updated. If P1 < P4, a merger condition occurs. In this case, P1 will be copied to the pixel P, features values in the "Features Table" at P1 and P4 will be merged and stored at P1 address and "Label_LUT" will be updated by replacing the value at P4 with the value at P1. In addition, multiplexers will be set to read the newly updated label. When P4 < P1, a merger condition takes place. In this case, P4 will be copied to the pixel P, features values in the "Features Table" at P1 and P4 will be merged and stored at P4 and "Label_LUT" will be updated by replacing the value at P1 with the value of P4. Multiplexers will be set to read the newly updated label. The number of blobs will be decremented by one when there is a merger case. A similar process will occur for "Compare P2 and P4". • Leave Blob: Labeling block has a lookup table called "Last_x" which is used for storing the x-coordinate of the pixel P. This "Last_x" LUT is used for deciding the "finish blob" case. • Finish Blob: If the value of Last_x[P2] equals to the current x-coordinate of the input pixel, the current blob is finished, bounding box coordinates of the current region can be sent to next processing step and the label can be reused again. • NOP: No operation.

Features Table
The last component to be discussed in image segmentation block is the features table. It is updated for every processed pixel. In the proposed system, the bounding box is the only required feature for every blob. Four Dual Port Rams (DPR) are used for storing Xmin, Xmax, Ymin and Ymax for every extracted region. For updating the bounding box feature for every region in the image, two states have been considered in this architecture.

•
State0: In this state, the feature values at the addresses "Label" and "Merger Label" will be read out. After that, according to the "opcode" from the labeling block the value that will be written back in DPR's will be figured out. A two-bit op code is used for representing four possible cases as follows: op = 00, new blob is recognized, the value of x and y coordinates of this blob will be prepared to be stored in the DPR's in the next clock cycle of the features table's frequency. When op = 01, the bounding box information of the blob defined by "label" will be updated. When op = 10, a merger case occurred, the data read out from port A and port B will be compared and the minimum one will be selected as xmin/ymin and the maximum one will be selected as xmax/ymax. When op = 11, the current blob completes and the data read out from DPR's will be sent to the next processing step. • State1: the data prepared from the state0 will be written in the DPR's. Algorithm 1 shows the pseudo code for updating xmin/ymin feature.
The merger table requires three clock cycles for performing one merge. However, since dual-port ram is used, these three steps are pipelined with a throughput with one merger per clock cycle.

FAST Corner Detection
In the proposed system, a Features from Accelerated Segment Test (FAST) algorithm was used since it provides high-speed corner detection and requires fewer hardware resources compared to other corner detection algorithms such as Harris corner detection. FAST was introduced by Rosten and Drummond [40,41]. FAST algorithm can be summarized in the following steps:

1.
Select a pixel P with intensity Ip. Get a circle of sixteen pixels around the pixel under test as shown in Figure 5.

4.
Decide that the central pixel P is a corner if there exists a set of n contiguous pixels in the surrounding circle which are all brighter than Ip + T, or darker than Ip − T where Ip is the intensity of the central pixel and T is the selected threshold.
Sensors 2016, 16, 2139 8 of 21 4. Decide that the central pixel P is a corner if there exists a set of n contiguous pixels in the surrounding circle which are all brighter than Ip + T, or darker than Ip − T where Ip is the intensity of the central pixel and T is the selected threshold. FAST corner detection consists of a two-stage pipeline. The first stage is thresholder block and the second one for n contiguity checking. This pipelined implementation is efficient for determining in each clock cycle if a pixel is a corner or not. The subsystem of FAST corner detection is given by Figure 6.  Window 7 × 7: this block is used for getting the 16 surrounding pixels of the circle and the central pixel. In order to achieve this, six FIFO delay buffers were used, the width of every FIFO equals to the image width. In addition, 42 8-bit registers were used to store pixel intensity values in the investigated window. The 16 pixels and the central pixel will pass to the thresholder stage.  Thresholder: this block decides if the pixels in the circle are brighter or darker than the central pixel. If the intensity of the pixel Pi-Pi is a pixel in the circle-is greater than the intensity of the central pixel P added to threshold then the pixel Pi is brighter than the pixel P. In this case, it will be assigned 1 in the output vector in the corresponding position of the pixel Pi. On the other hand, if the intensity of the pixel Pi is smaller than the intensity of the central pixel P subtracted by threshold then the pixel Pi is darker than the pixel P. It will be assigned 1 in the output vector in the same position of the pixel Pi. As a result, two 16-bit vectors will be the result of this stage. One is for the brighter pixels and the other one is for the darker pixels.  Contiguity blocks: these blocks are used to study if there is n contiguous bright or dark pixels in the circle. In this system, three blocks were used for studying the contiguity n = 10, n = 11 and n = 12. The architecture for each block contains 16 n-input logical AND gates. The output of these contiguity blocks will be set to 1 if there is a '10' or '11' contiguous bright or dark pixels in the circle.  OR gate: this gate is used for deciding whether the pixel is a corner or not from brighter or darker contiguities blocks. FAST corner detection consists of a two-stage pipeline. The first stage is thresholder block and the second one for n contiguity checking. This pipelined implementation is efficient for determining in each clock cycle if a pixel is a corner or not. The subsystem of FAST corner detection is given by Figure 6. 4. Decide that the central pixel P is a corner if there exists a set of n contiguous pixels in the surrounding circle which are all brighter than Ip + T, or darker than Ip − T where Ip is the intensity of the central pixel and T is the selected threshold. FAST corner detection consists of a two-stage pipeline. The first stage is thresholder block and the second one for n contiguity checking. This pipelined implementation is efficient for determining in each clock cycle if a pixel is a corner or not. The subsystem of FAST corner detection is given by Figure 6.  Window 7 × 7: this block is used for getting the 16 surrounding pixels of the circle and the central pixel. In order to achieve this, six FIFO delay buffers were used, the width of every FIFO equals to the image width. In addition, 42 8-bit registers were used to store pixel intensity values in the investigated window. The 16 pixels and the central pixel will pass to the thresholder stage.  Thresholder: this block decides if the pixels in the circle are brighter or darker than the central pixel. If the intensity of the pixel Pi-Pi is a pixel in the circle-is greater than the intensity of the central pixel P added to threshold then the pixel Pi is brighter than the pixel P. In this case, it will be assigned 1 in the output vector in the corresponding position of the pixel Pi. On the other hand, if the intensity of the pixel Pi is smaller than the intensity of the central pixel P subtracted by threshold then the pixel Pi is darker than the pixel P. It will be assigned 1 in the output vector in the same position of the pixel Pi. As a result, two 16-bit vectors will be the result of this stage. One is for the brighter pixels and the other one is for the darker pixels.  Contiguity blocks: these blocks are used to study if there is n contiguous bright or dark pixels in the circle. In this system, three blocks were used for studying the contiguity n = 10, n = 11 and n = 12. The architecture for each block contains 16 n-input logical AND gates. The output of these contiguity blocks will be set to 1 if there is a '10' or '11' contiguous bright or dark pixels in the circle.  OR gate: this gate is used for deciding whether the pixel is a corner or not from brighter or darker contiguities blocks. • Window 7 × 7: this block is used for getting the 16 surrounding pixels of the circle and the central pixel. In order to achieve this, six FIFO delay buffers were used, the width of every FIFO equals to the image width. In addition, 42 8-bit registers were used to store pixel intensity values in the investigated window. The 16 pixels and the central pixel will pass to the thresholder stage. • Thresholder: this block decides if the pixels in the circle are brighter or darker than the central pixel. If the intensity of the pixel Pi-Pi is a pixel in the circle-is greater than the intensity of the central pixel P added to threshold then the pixel Pi is brighter than the pixel P. In this case, it will be assigned 1 in the output vector in the corresponding position of the pixel Pi. On the other hand, if the intensity of the pixel Pi is smaller than the intensity of the central pixel P subtracted by threshold then the pixel Pi is darker than the pixel P. It will be assigned 1 in the output vector in the same position of the pixel Pi. As a result, two 16-bit vectors will be the result of this stage. One is for the brighter pixels and the other one is for the darker pixels. • Contiguity blocks: these blocks are used to study if there is n contiguous bright or dark pixels in the circle. In this system, three blocks were used for studying the contiguity n = 10, n = 11 and n = 12. The architecture for each block contains 16 n-input logical AND gates. The output of these contiguity blocks will be set to 1 if there is a '10' or '11' contiguous bright or dark pixels in the circle.

Memory Blocks
In this block, six memory blocks were used to store the coordinates of detected corners from FAST block and the coordinates of bounding box from single pass image segmentation block. For x, y coordinates of the extracted corners, the memory depth is 1024 and the word width is 10 bits. For bounding box coordinates, four memory blocks were used for storing Xmin, Xmax, Ymin and Ymax. In this system, 256 objects could be extracted from the image and word width is 10 bits. M9K blocks where used to implement those memory blocks.

Pattern Matching
The main goal in this stage is to make matching process as simple as possible in order to achieve real time implementation. For every extracted region, the location of the corners will be used for deciding whether the region is the predefined marker or not. The general steps followed for extracting the vertices of the detected markers are:

•
Divide the bounding box into four quarters by calculating the center coordinates of the bounding box. • Every quarter should have one corner, located only on the edge of the bounding box.

•
It is assumed that all corners of the inner square are located in one quarter or in two adjacent quarters.

•
Based on the location of the inner square, the first vertex of the marker will be decided.

•
The remaining vertices are a clockwise direction to the first one.

Memory Blocks
In this block, six memory blocks were used to store the coordinates of detected corners from FAST block and the coordinates of bounding box from single pass image segmentation block. For x, y coordinates of the extracted corners, the memory depth is 1024 and the word width is 10 bits. For bounding box coordinates, four memory blocks were used for storing Xmin, Xmax, Ymin and Ymax. In this system, 256 objects could be extracted from the image and word width is 10 bits. M9K blocks where used to implement those memory blocks.

Pattern Matching
The main goal in this stage is to make matching process as simple as possible in order to achieve real time implementation. For every extracted region, the location of the corners will be used for deciding whether the region is the predefined marker or not. The general steps followed for extracting the vertices of the detected markers are:


Divide the bounding box into four quarters by calculating the center coordinates of the bounding box.  Every quarter should have one corner, located only on the edge of the bounding box.  It is assumed that all corners of the inner square are located in one quarter or in two adjacent quarters.  Based on the location of the inner square, the first vertex of the marker will be decided.  The remaining vertices are a clockwise direction to the first one.

Pose Estimation
The Coplanar PosIt algorithm was used in order to estimate the pose of the detected markers. This algorithm was developed by Daniel DeMenthon [19]. It works by comparing some coplanar point coordinates with predefined object points and statistically finding the best rotation and translation that fit the object projection in these points.

Coplanar PosIt Algorithm
The Coplanar PosIt algorithm starts by finding the pose from orthography and scaling, which approximates the perspective projection with a scaled orthographic projection and finds the rotation matrix and translation vector from a linear set of equations. After that, PosIt iteratively uses a scale factor for each point to enhance the found orthographic projection and then uses POS algorithm on

Pose Estimation
The Coplanar PosIt algorithm was used in order to estimate the pose of the detected markers. This algorithm was developed by Daniel DeMenthon [19]. It works by comparing some coplanar point coordinates with predefined object points and statistically finding the best rotation and translation that fit the object projection in these points.

Coplanar PosIt Algorithm
The Coplanar PosIt algorithm starts by finding the pose from orthography and scaling, which approximates the perspective projection with a scaled orthographic projection and finds the rotation matrix and translation vector from a linear set of equations. After that, PosIt iteratively uses a scale factor for each point to enhance the found orthographic projection and then uses POS algorithm on the new points instead of the original ones until the threshold is met. However, dealing with planes will give more poses for same orthographic projection as show in Figure 8. Two branches will be created after the first iteration of the algorithm. Every branch will be assigned with one solution produced from the first iteration. After that, the best pose will be kept for each branch. The best solution will be decided based on measuring the average Euclidean distance (E) between the actual image points and corresponding predicted image points for the found pose divided by the image diagonal. The iteration stops if E is smaller than a predefined threshold or a certain number of iterations were achieved. Figure 9 shows the flow chart of the implemented algorithm on Nios II soft core processor. the new points instead of the original ones until the threshold is met. However, dealing with planes will give more poses for same orthographic projection as show in Figure 8. Two branches will be created after the first iteration of the algorithm. Every branch will be assigned with one solution produced from the first iteration. After that, the best pose will be kept for each branch. The best solution will be decided based on measuring the average Euclidean distance (E) between the actual image points and corresponding predicted image points for the found pose divided by the image diagonal. The iteration stops if E is smaller than a predefined threshold or a certain number of iterations were achieved. Figure 9 shows the flow chart of the implemented algorithm on Nios II soft core processor.   the new points instead of the original ones until the threshold is met. However, dealing with planes will give more poses for same orthographic projection as show in Figure 8. Two branches will be created after the first iteration of the algorithm. Every branch will be assigned with one solution produced from the first iteration. After that, the best pose will be kept for each branch. The best solution will be decided based on measuring the average Euclidean distance (E) between the actual image points and corresponding predicted image points for the found pose divided by the image diagonal. The iteration stops if E is smaller than a predefined threshold or a certain number of iterations were achieved. Figure 9 shows the flow chart of the implemented algorithm on Nios II soft core processor.

Soft-Core Processor
The Coplanar PosIt algorithm was implemented by using z Nios II soft-core processor which was built using Qsys. Qsys is a tool provided by Altera which optimizes time and effort in the FPGA design process. A complete and complex system could be created by using the peripherals available in the Qsys libraries [42]. Qsys provides faster development, faster timing closure and faster verification. The proposed soft-core processor system is shown in Figure 10. The clock frequency of the proposed system is 50 MHz. A NIOS II/f processor with hardware multipliers and hardware dividers is used. The NIOS II/f processor is a high-performance 32-bit processor based on a Reduced Instruction Set Computer (RISC) architecture. The fastest version of this processor, which allows utilization of entire instruction set available, is used. In order to accelerate the arithmetic functions executed on float variables to find the pose, NiosII Floating Point Custom Instructions (FPCI) is used. FPCI implements single-precision floating-point arithmetic operations. The existence of FPCI in the system forces the compiler to use the custom instructions for floating point operations. The ca ustom instruction master of NiosII CPU is connected with custom instruction slave in the Floating Point Hardware as shown in Figure 11. The size of on chip RAM is 200 KB. Eight parallel I/O (PIO) ports are used for reading the coordinates of the four vertices of the detected markers. The width of these ports is 10 bits and they act as input ports. One PIO is used as output port with a width equals to 8 bits in order to be used as read address for the vertices stored in the memory blocks. In addition, performance counter component is used for studying the performance of the system.

Soft-Core Processor
The Coplanar PosIt algorithm was implemented by using z Nios II soft-core processor which was built using Qsys. Qsys is a tool provided by Altera which optimizes time and effort in the FPGA design process. A complete and complex system could be created by using the peripherals available in the Qsys libraries [42]. Qsys provides faster development, faster timing closure and faster verification. The proposed soft-core processor system is shown in Figure 10. The clock frequency of the proposed system is 50 MHz. A NIOS II/f processor with hardware multipliers and hardware dividers is used. The NIOS II/f processor is a high-performance 32-bit processor based on a Reduced Instruction Set Computer (RISC) architecture. The fastest version of this processor, which allows utilization of entire instruction set available, is used. In order to accelerate the arithmetic functions executed on float variables to find the pose, NiosII Floating Point Custom Instructions (FPCI) is used. FPCI implements single-precision floating-point arithmetic operations. The existence of FPCI in the system forces the compiler to use the custom instructions for floating point operations. The ca ustom instruction master of NiosII CPU is connected with custom instruction slave in the Floating Point Hardware as shown in Figure 11. The size of on chip RAM is 200 KB. Eight parallel I/O (PIO) ports are used for reading the coordinates of the four vertices of the detected markers. The width of these ports is 10 bits and they act as input ports. One PIO is used as output port with a width equals to 8 bits in order to be used as read address for the vertices stored in the memory blocks. In addition, performance counter component is used for studying the performance of the system.

Soft-Core Processor
The Coplanar PosIt algorithm was implemented by using z Nios II soft-core processor which was built using Qsys. Qsys is a tool provided by Altera which optimizes time and effort in the FPGA design process. A complete and complex system could be created by using the peripherals available in the Qsys libraries [42]. Qsys provides faster development, faster timing closure and faster verification. The proposed soft-core processor system is shown in Figure 10. The clock frequency of the proposed system is 50 MHz. A NIOS II/f processor with hardware multipliers and hardware dividers is used. The NIOS II/f processor is a high-performance 32-bit processor based on a Reduced Instruction Set Computer (RISC) architecture. The fastest version of this processor, which allows utilization of entire instruction set available, is used. In order to accelerate the arithmetic functions executed on float variables to find the pose, NiosII Floating Point Custom Instructions (FPCI) is used. FPCI implements single-precision floating-point arithmetic operations. The existence of FPCI in the system forces the compiler to use the custom instructions for floating point operations. The ca ustom instruction master of NiosII CPU is connected with custom instruction slave in the Floating Point Hardware as shown in Figure 11. The size of on chip RAM is 200 KB. Eight parallel I/O (PIO) ports are used for reading the coordinates of the four vertices of the detected markers. The width of these ports is 10 bits and they act as input ports. One PIO is used as output port with a width equals to 8 bits in order to be used as read address for the vertices stored in the memory blocks. In addition, performance counter component is used for studying the performance of the system.

Floating-Point versus Fixed-Point
Fixed-point representation is a way of implementing floating-point operations using ordinary integer representation. In other words, a fixed-point number is a scaled version of a floating-point number. It is defined by word length in bits, position of the binary point, and whether it is signed or unsigned. This position affects the range-the difference between the largest and smallest representable numbers-and precision -the smallest possible difference between any two numbers. The main advantage of using fixed-point is its simplicity since, we can use same hardware that does integer arithmetic in fix-point arithmetic. However, using fixed-point in implementing an algorithm is error-prone due to following reasons: (i) deciding where the position of the fixed-point of the variables requires good understanding of the algorithm; (ii) Infinities are not represented by fixed-point; (iii) Overflow should be predicted and manually managed; (iv) Rounding toward-∞ for signed and unsigned arithmetic due to truncation since it is the cheapest rounding mode; (v) Multiplying two integers with the same size produces a result with twice the size of inputs. On

Mathematical Functions Approximation
Pose estimation algorithms require calculating some trigonometric functions like sine, cosine, arc cosine and arctangent. These functions consume long execution times. In order to reduce this time, these functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Sine, cosine and arctangent were approximated using Taylor series, however, the best approximation for arc cosine was a Lagrange polynomial. Sine function has been approximated as shown in Equation (2). Cosine function has been approximated using Equation (3), while arc cosine and arctangent have been approximated using Equations (4) and (5), respectively. In addition to trigonometric functions, square roots are also required for estimating the pose of the detected markers. However, this function requires a long execution time too, hence, fast inverse square root was utilized. This method uses Newton's method of approximation but it starts with a guess very close to the solution. Only one iteration is enough for Newton's method to get a good enough solution [44]:

Experimental Results
System experiments and simulations where carried out using an Altera DE2-115 board powered by a Cyclon IVE FPGA chip. Figure 12 shows the experimental equipment used in this experiment.
The simulation results for a frame containing one marker in the center of the frame will be introduced. The illustration of this frame is shown in Figure 13.   In addition, detected corners using FAST algorithm are shown in Figure 14b. It can be seen that the 'corner' signal is set to one because there is a contiguity pattern equals to '1111111000000111'. In this case n = 10. The detected corner at (X = 254, Y = 174) will be written at the address specified by 'Coneraddr'.
The matching process is illustrated in Figure 15. This figure is marked from one to four.
Step 1 represents reading coordinates of the bounding box.
Step 2 shows how vertices signals became 1 during the reading of corner coordinates.
Step 3 depicts that, the first vertex is located in the second quarter.    In addition, detected corners using FAST algorithm are shown in Figure 14b. It can be seen that the 'corner' signal is set to one because there is a contiguity pattern equals to '1111111000000111'. In this case n = 10. The detected corner at (X = 254, Y = 174) will be written at the address specified by 'Coneraddr'.
The matching process is illustrated in Figure 15. This figure is marked from one to four.
Step 1 represents reading coordinates of the bounding box.
Step 2 shows how vertices signals became 1 during the reading of corner coordinates.
Step 3 depicts that, the first vertex is located in the second quarter.     In addition, detected corners using FAST algorithm are shown in Figure 14b. It can be seen that the 'corner' signal is set to one because there is a contiguity pattern equals to '1111111000000111'. In this case n = 10. The detected corner at (X = 254, Y = 174) will be written at the address specified by 'Coneraddr'.
The matching process is illustrated in Figure 15. This figure is marked from one to four.
Step 1 represents reading coordinates of the bounding box.
Step 2 shows how vertices signals became 1 during the reading of corner coordinates.
Step 3 depicts that, the first vertex is located in the second quarter.  In addition, detected corners using FAST algorithm are shown in Figure 14b. It can be seen that the 'corner' signal is set to one because there is a contiguity pattern equals to '1111111000000111'. In this case n = 10. The detected corner at (X = 254, Y = 174) will be written at the address specified by 'Coneraddr'.
The matching process is illustrated in Figure 15. This figure is marked from one to four.
Step 1 represents reading coordinates of the bounding box.
Step 2 shows how vertices signals became 1 during the reading of corner coordinates.
Step 3 depicts that, the first vertex is located in the second quarter.  illustrates the vertices found in the bounding box, "3" illustrates that the first vertex is located in quarter 2, and "4" illustrates the output of the ordered vertices.
The complete system simulation is shown in Figure 16. Features such as detected corners, bounding box coordinates of the marker, writing these coordinates in the memory block and finally, matching process which gives the ordered vertices of the detected markers are illustrated in this figure. These vertices are sent to the designed soft core processor in order to find out the pose of the detected markers.  illustrates the vertices found in the bounding box, "3" illustrates that the first vertex is located in quarter 2, and "4" illustrates the output of the ordered vertices.
The complete system simulation is shown in Figure 16. Features such as detected corners, bounding box coordinates of the marker, writing these coordinates in the memory block and finally, matching process which gives the ordered vertices of the detected markers are illustrated in this figure. These vertices are sent to the designed soft core processor in order to find out the pose of the detected markers. illustrates the vertices found in the bounding box, "3" illustrates that the first vertex is located in quarter 2, and "4" illustrates the output of the ordered vertices.
The complete system simulation is shown in Figure 16. Features such as detected corners, bounding box coordinates of the marker, writing these coordinates in the memory block and finally, matching process which gives the ordered vertices of the detected markers are illustrated in this figure. These vertices are sent to the designed soft core processor in order to find out the pose of the detected markers.  Another example of a rotated marker is shown in Figure 17.
Another example of a rotated marker is shown in Figure 17. Figure 17. Illustration of an input frame with a rotated marker.
The whole system simulation for extracting the ordered vertices of the detected marker is shown in Figure 18. The pose estimation using coplanar PosIt running on the designed processor is   The pose estimation using coplanar PosIt running on the designed processor is Another example of a rotated marker is shown in Figure 17. Figure 17. Illustration of an input frame with a rotated marker.
The whole system simulation for extracting the ordered vertices of the detected marker is shown in Figure 18. The pose estimation using coplanar PosIt running on the designed processor is   The required time for executing the PosIt algorithm on Nios II soft-core processor by using FPCI and functions approximation is 0.00118 s. However, the time required for running this algorithm without acceleration is 0.01043 s. This is illustrated in Figure 19.
The required time for executing the PosIt algorithm on Nios II soft-core processor by using FPCI and functions approximation is 0.00118 s. However, the time required for running this algorithm without acceleration is 0.01043 s. This is illustrated in Figure 19.  In order to test the performance of proposed pose estimation algorithm on Nios II, two scenarios have been considered. In the first scenario, we study the performance of translation and orientation at predefined locations. For instance, the robot moves on the X-axis and all other degrees of freedom are fixed. In other words, the target is either translated or rotated with respect to one axis only. Figure 21 illustrates the pose of the camera and target. Figure 22 illustrates the comparison of estimated pose with the real pose.   The required time for executing the PosIt algorithm on Nios II soft-core processor by using FPCI and functions approximation is 0.00118 s. However, the time required for running this algorithm without acceleration is 0.01043 s. This is illustrated in Figure 19.  In order to test the performance of proposed pose estimation algorithm on Nios II, two scenarios have been considered. In the first scenario, we study the performance of translation and orientation at predefined locations. For instance, the robot moves on the X-axis and all other degrees of freedom are fixed. In other words, the target is either translated or rotated with respect to one axis only. Figure 21 illustrates the pose of the camera and target. Figure 22 illustrates the comparison of estimated pose with the real pose.  In order to test the performance of proposed pose estimation algorithm on Nios II, two scenarios have been considered. In the first scenario, we study the performance of translation and orientation at predefined locations. For instance, the robot moves on the X-axis and all other degrees of freedom are fixed. In other words, the target is either translated or rotated with respect to one axis only. Figure 21 illustrates the pose of the camera and target. Figure 22 illustrates the comparison of estimated pose with the real pose. The required time for executing the PosIt algorithm on Nios II soft-core processor by using FPCI and functions approximation is 0.00118 s. However, the time required for running this algorithm without acceleration is 0.01043 s. This is illustrated in Figure 19.  In order to test the performance of proposed pose estimation algorithm on Nios II, two scenarios have been considered. In the first scenario, we study the performance of translation and orientation at predefined locations. For instance, the robot moves on the X-axis and all other degrees of freedom are fixed. In other words, the target is either translated or rotated with respect to one axis only. Figure 21 illustrates the pose of the camera and target. Figure 22 illustrates the comparison of estimated pose with the real pose.   The minimum, maximum, and mean error of x, y, and z coordinates and roll-yaw-pitch angles are shown in Table 2. In the second scenario, the marker traveled on a predefined trajectory and the estimation error has been recorded and compared with [31]. Figure 23 shows pose estimation and rotation angle results. In [31], the estimation performance was studied with respect to the number of target points (4, 20, 50, 100). However, we estimate the pose using four coplanar target points only. Table 3 shows the comparison of pose estimation error between implemented coplanar Posit algorithm and [31] results. The results shows that our proposed design outperforms the method introduced in [31]. The minimum, maximum, and mean error of x, y, and z coordinates and roll-yaw-pitch angles are shown in Table 2. In the second scenario, the marker traveled on a predefined trajectory and the estimation error has been recorded and compared with [31]. Figure 23 shows pose estimation and rotation angle results. In [31], the estimation performance was studied with respect to the number of target points (4, 20, 50, 100). However, we estimate the pose using four coplanar target points only. Table 3 shows the comparison of pose estimation error between implemented coplanar Posit algorithm and [31] results. The results shows that our proposed design outperforms the method introduced in [31].  The algorithm mentioned in [34] is used for object tracking, the extracted foreground objects are sent to CPU where simple motion-based tracker is used for object tracking. On the other hand, our proposed algorithm extracts the blobs and corners simultaneously. These information are sent to onboard soft-core processor NIOS II to estimate the pose of the extracted objects using Coplanar-Posit algorithm. Coplanar-Posit algorithm was accelerated by approximating trigonometric functions to achieve real-time results. The performance comparison of the proposed method with reference method is shown in Table 4. Hardware resources used for building this system are shown in the Table 5. The Cyclone IVE EP4CE115F29C7 has been used for implementing the proposed system.   The algorithm mentioned in [34] is used for object tracking, the extracted foreground objects are sent to CPU where simple motion-based tracker is used for object tracking. On the other hand, our proposed algorithm extracts the blobs and corners simultaneously. These information are sent to on-board soft-core processor NIOS II to estimate the pose of the extracted objects using Coplanar-Posit algorithm. Coplanar-Posit algorithm was accelerated by approximating trigonometric functions to achieve real-time results. The performance comparison of the proposed method with reference method is shown in Table 4. Hardware resources used for building this system are shown in the Table 5. The Cyclone IVE EP4CE115F29C7 has been used for implementing the proposed system.

Conclusions
In this paper, we have introduced a complete hardware architecture for a real-time visual sensor based on predefined markers. The developed system has two stages; FPGA implementation for image processing algorithms and soft-core processor system using a Nios II core for pose calculation. The proposed system is able to process the input stream of pixels on-the-fly without any need to buffer the frame in the memory. This is because of the running of a FAST algorithm and image segmentation in parallel, use of a simple marker which reduces the matching time, implemention of single pass image segmentation, and the usage of FPCI which accelerates the arithmetic operations on floating point data. Moreover, approximating trigonometric functions played a great role in enhancing system performance. The final system was able to process 162 fps with five markers inside the frame.