FPGA implementation of a real-time edge detection system based on an improved Canny algorithm

Abstract
Canny edge detection is one of the most widely used edge detection algorithms due to its superior performance. However, it is a complex, time-consuming process with a high hardware cost. In addition, most existing implementations of the algorithm use the same fixed pair of high and low threshold values for all input images. Such implementations cannot automatically adapt to changes in the external detection environment, which results in decreased performance. To overcome these issues, an improved Canny algorithm is proposed in this paper. It uses the Sobel operator and approximation methods to calculate the gradient magnitude and direction, replacing complex operations and reducing hardware cost. Otsu's algorithm is introduced to adaptively determine the image threshold. However, Otsu's algorithm contains division operations, which are complex, inefficient and slow in hardware. We introduce a logarithmic unit that turns division into subtraction, which is easy to implement in hardware and does not affect the selection of the threshold. Experimental results show that the system detects the edges of the image well without adjusting the threshold value when the external environment changes, and requires only 1.231 ms to detect the edges of a 512 × 512 image when clocked at 50 MHz. Compared with existing FPGA implementations, our implementation uses the least amount of logical resources. Thus, it is more suitable for platforms that have limited logical resources.


Introduction
Image edge detection has important applications in many fields, such as medical imaging, geological exploration, satellite remote sensing, aerospace, transportation, and computer vision. Among the existing edge detection algorithms, the Canny edge detection algorithm [1] has been a standard for many years and has the best performance. However, the Canny edge detection algorithm is a complex, time-consuming process, and the traditional implementations of the algorithm use the same fixed pair of high and low threshold values for all input images. When the external environment, such as light, changes, the image to be detected will also change, and the threshold value needs to be adjusted. These shortcomings seriously affect the real-time performance, adaptive ability and flexibility of the system. Therefore, a direct implementation of the Canny algorithm cannot be employed in real-time applications.
Many researchers have proposed diverse techniques to overcome these issues. One of these is the hardware implementation of the Canny edge detection algorithm based on a field programmable gate array (FPGA). Several works [2][3][4][5] have dealt with FPGA-based Canny edge detection algorithms for real-time applications. All of them reduce the complexity of the Canny edge detection algorithm to a certain extent and propose their own solutions for threshold selection. In [2], the hysteresis threshold uses constant low and high threshold values to reduce complexity, but this degrades the performance. In [3], the threshold is calculated using the data of the histogram of gradient magnitude. On the other hand, block classification techniques introduced in [4][5] are used to calculate the threshold.
In this paper, an improved Canny edge detection algorithm is proposed with adequate accuracy, reduced computational complexity and self-adaptive threshold calculation. We calculate the threshold using Otsu's algorithm [6]. Several works have dealt with the FPGA-based Otsu's algorithm for threshold calculation [7][8]. The approximate method and logarithm operation are introduced to reduce the complexity of computation. Our contributions in terms of the proposed design strategy are given below.
1. Computation of the gradient magnitude and orientation using the approximate method.
3. The design of a 32-bit logarithmic arithmetic unit to reduce the complexity of Otsu's algorithm.

Proposed design strategy
Conventional Canny edge detection algorithm
Canny developed an approach to derive an optimal edge detector to deal with step edges corrupted by white Gaussian noise. Figure 1 summarizes the original Canny edge detection algorithm [1] for an input image with n bits per pixel. It consists of the following steps:
1) Calculate the horizontal gradient Gx and vertical gradient Gy at each pixel location by convolving with gradient masks.
2) Compute the gradient magnitude G and direction θG at each pixel location.

Improved Canny edge detection algorithm
In all hardware implementations of the Canny edge detection algorithm, the method of selecting the threshold can be roughly divided into two types, namely, fixed and adaptive thresholds.
The adoption of a fixed threshold will undoubtedly reduce algorithm complexity and hardware resource consumption. However, when the object to be detected changes or the external environment of the detected object changes, such as light, the threshold needs to be reset. This obviously cannot meet the requirements of real-time performance. In contrast, an adaptive threshold is a good solution to the above problems. In this section, we introduce three approaches to improve the Canny edge detection algorithm for hardware implementation, namely, Otsu's algorithm, which is used to automatically select the threshold; a low complexity gradient magnitude and direction calculation method; and logarithmic approximation, which is used to reduce the complexity of Otsu's algorithm. Figure 2 shows an overview of the improved Canny edge detection design strategy. The detailed description of the proposed Canny edge detection algorithm is given below.
Gradient computation based on the Sobel operator
Here, we use the Sobel operator [12][13] as a gradient mask. We prefer the Sobel operator over the operators used in [10,11] because it combines differentiation and Gaussian smoothing, and it allocates more weight to pixel intensities around edges. Figure 3 shows the 3 × 3 sub-window of an image and the Sobel operator kernels in the horizontal (x) and vertical (y) directions. The gradients in the x- and y-directions, denoted by $G_x(x, y)$ and $G_y(x, y)$, are computed using Eqs. (1) and (2).

Fig. 3(c). Sobel operator kernel in the y direction
Before computing $G_x(x, y)$ and $G_y(x, y)$, we need to obtain the corresponding pixels. This is done by the Line Buffer. Figure 4 shows the architecture of the Line Buffer. Here we use two first-in-first-out (FIFO) modules to cache the first two rows.
Fig. 4. Architecture of Line Buffer
After obtaining the corresponding pixels, the computation of $G_x(x, y)$ and $G_y(x, y)$ is started. Figure 5 shows the architecture to calculate $G_x(x, y)$. The architecture to calculate $G_y(x, y)$ is similar, and only the corresponding pixel values that are involved in the computation are different. In this part, the computed values of $G_x(x, y)$ and $G_y(x, y)$ are signed numbers, and their most significant bits (MSBs) determine whether their values are positive or negative.
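The gradient step can be illustrated in software. The following is a minimal sketch, assuming the standard Sobel kernels (the sign convention in Fig. 3 may differ) and leaving border pixels at zero; the names `SOBEL_X`, `SOBEL_Y` and `sobel_gradients` are hypothetical and do not come from the paper. It models only the arithmetic, not the FIFO-based Line Buffer.

```python
# Standard Sobel kernels (assumed; the paper's Fig. 3 may use another sign convention).
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradients(image):
    """Return signed (gx, gy) maps for a 2-D list of pixel intensities.

    Border pixels are left at 0 because no full 3x3 window exists there.
    """
    rows, cols = len(image), len(image[0])
    gx = [[0] * cols for _ in range(rows)]
    gy = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            sum_x = sum_y = 0
            # Slide over the 3x3 sub-window centred on (r, c).
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    sum_x += SOBEL_X[dr][dc] * pixel
                    sum_y += SOBEL_Y[dr][dc] * pixel
            gx[r][c] = sum_x
            gy[r][c] = sum_y
    return gx, gy


# A vertical step edge: strong horizontal gradient, no vertical gradient.
step = [[0, 0, 255, 255] for _ in range(4)]
gx, gy = sobel_gradients(step)
```

For 8-bit pixels the results can reach magnitudes of 4 × 255 = 1020, which is why signed values wider than the input are needed.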

Low complexity gradient magnitude and direction calculation
The magnitude of the gradient defines the strength of the edge at a particular pixel. The orientation of the gradient defines the direction of the edge at a particular pixel. The gradient magnitude and direction calculations of a single pixel are given in Eqs. (3) and (4):
$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2} \tag{3}$$
$$\theta(x, y) = \arctan\left(\frac{G_y(x, y)}{G_x(x, y)}\right) \tag{4}$$
The hardware cost of implementing Eqs. (3) and (4) is high. For Eq. (3), we use the addition of the absolute values of $G_x(x, y)$ and $G_y(x, y)$, which is presented in [12,13], to find the gradient magnitude mag(x, y). Eq. (3) can be rewritten as
$$\mathrm{mag}(x, y) = |G_x(x, y)| + |G_y(x, y)| \tag{5}$$
For Eq. (4), the coordinate rotation digital computer (CORDIC) module suggested in [14,15] can be used to calculate the value of θ(x, y). The downside of the CORDIC module is that many iterations are required to achieve acceptable accuracy, which leads to the utilization of more hardware. In this part, the shift-based orientation method [5] is adopted instead. The orientations are spaced uniformly and are divided into eight zones, as shown in Figure 6. The approximated tangent values based on shift operations are listed in Table 1. The tabulated values show that orientations from 0° to 90° lie in the first quadrant, while the other orientations up to 180° lie in the second quadrant. The tan values of the second and first quadrants are the same, but they differ in polarity. The quadrant is found using quadrant_flag. The gradient magnitude calculation and quadrant_flag computation are shown in Figure 7.
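The effect of replacing the square root with a sum of absolute values can be checked with a short sketch. It assumes that quadrant_flag is derived from the sign bits of the two gradients (set when they differ), which the text suggests but does not spell out; the function name `approx_magnitude` is hypothetical.

```python
import math


def approx_magnitude(gx, gy):
    """Return (|gx| + |gy|, quadrant_flag) for signed integer gradients.

    quadrant_flag is assumed to be the XOR of the two sign bits:
    True means the orientation lies in the second quadrant (90 to 180 degrees).
    """
    magnitude = abs(gx) + abs(gy)
    quadrant_flag = (gx < 0) != (gy < 0)
    return magnitude, quadrant_flag


# Compare against the exact Euclidean magnitude for one gradient pair.
mag, flag = approx_magnitude(3, -4)
exact = math.hypot(3, -4)
```

The approximation never underestimates the exact magnitude and overestimates it by at most a factor of √2 (reached when |Gx| = |Gy|), so it preserves the ordering of strong and weak edges reasonably well while needing only adders.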
Fig. 6. Eight zones of gradient orientation
Fig. 7. Architecture of the gradient magnitude calculation and quadrant_flag computation
To find a zone for θ(x, y), we modify θ(x, y), and a particular zone is defined between two angle values. For simplification, Eq. (4) is rewritten as Eq. (6). The modified angle is defined by Eq. (7), rewriting Eq. (7) gives Eq. (8), and the condition derived from Eq. (8) is given in Eq. (9). The zone value is determined by satisfying the condition given in Eq. (9). Since the condition contains a multiplication of a gradient component by a tangent value, the multiplication is implemented by shifting that gradient component by particular amounts, as listed in Table 1. Figure 8 illustrates the architecture for computing the product for the 22.5° boundary and finding the direction.
Nonmaximum suppression (NMS) is an edge-thinning technique for preserving local maxima along the image gradient direction. Figure 9 shows the NMS architecture. The selector is used to find the pixels that lie along the direction of the current pixel (the center pixel of the 3 × 3 window), depicted as a and b in Fig. 9. For directions 1 or 8, the values of a and b are g3 and g5; for directions 2 or 3, the values of a and b are g0 and g8; for directions 4 or 5, the values of a and b are g5 and g6; for directions 6 or 7, the values of a and b are g1 and g7. The comparator (CMP) is used to compare the current pixel (g4) with a and b to determine whether it is a local maximum. If the current pixel is a local maximum, it is preserved. Otherwise, it is suppressed to 0. Although the maximum value of g4 requires 10 bits, we limit the value range to 8 bits to save hardware resources for the later statistics. The value generated after the NMS is sent to both the data cache module and the adaptive threshold computation module, as shown in Figure 2.
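The shift-based zone test described at the start of this part can be sketched as follows. This is an illustration under stated assumptions, not the contents of Table 1: the shift combinations used for tan 22.5° and tan 67.5°, the comparison of |Gy| against shifted |Gx|, the zone numbering (zone 1 starting at 0°, zones 5 to 8 mirrored into the second quadrant) and the name `gradient_zone` are all hypothetical.

```python
def gradient_zone(gx, gy):
    """Classify the gradient orientation into one of eight 22.5-degree zones.

    Only shifts, adds and comparisons are used, so no multiplier or
    arctangent is required.
    """
    ax, ay = abs(gx), abs(gy)
    # Assumed sign-bit XOR: True when the angle lies between 90 and 180 degrees.
    quadrant_flag = (gx < 0) != (gy < 0)

    # tan(22.5 deg) ~ 0.4142, approximated as 1/4 + 1/8 + 1/32 = 0.40625.
    tan22_gx = (ax >> 2) + (ax >> 3) + (ax >> 5)
    # tan(67.5 deg) ~ 2.4142 = 2 + tan(22.5 deg).
    tan67_gx = (ax << 1) + tan22_gx

    if ay < tan22_gx:
        zone = 1          # 0 to 22.5 degrees
    elif ay < ax:         # tan(45 deg) = 1, no shift needed
        zone = 2          # 22.5 to 45 degrees
    elif ay < tan67_gx:
        zone = 3          # 45 to 67.5 degrees
    else:
        zone = 4          # 67.5 to 90 degrees

    # Mirror into the second quadrant: zone 1 pairs with 8, 4 with 5, and so on.
    return 9 - zone if quadrant_flag else zone


zones = [gradient_zone(100, 0), gradient_zone(100, 50), gradient_zone(0, 100)]
```

With this numbering, zones 1 and 8 are both nearly horizontal and zones 4 and 5 are both nearly vertical, which matches the way the NMS selector groups directions in pairs.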
Data can be cached inside the FPGA or outside the FPGA. Here, data are cached in external SDRAM.
Otsu's algorithm [6] is a popular thresholding algorithm and is used to find an optimal threshold that separates an image into two classes: the background and the object. These classes are represented by C0 and C1, respectively. This method performs a variance analysis to find the optimal threshold k by the maximization shown in Eq. (10):
$$k^{*} = \arg\max_{0 \le k \le L-1} \sigma_k^2 \tag{10}$$
where $\sigma_k^2$ is the k-th between-class variance of the image, defined as:
$$\sigma_k^2 = \frac{(\mu_{L-1}\,\omega_k - \mu_k)^2}{\omega_k\,(1 - \omega_k)} \tag{11}$$
where $\omega_k$ and $\mu_k$ are the probability of class occurrence given a threshold k and the mean intensity value of the pixels up to the threshold k of the image, respectively, and $\mu_{L-1}$ is the average intensity of the entire image, called the global mean, with a value equal to $\mu_k$ when k = L − 1. Normally, L is equal to 256. With $p_i$ denoting the normalized histogram value of intensity level i, the variables $\omega_k$ and $\mu_k$ can be expressed as
$$\omega_k = \sum_{i=0}^{k} p_i \tag{12}$$
and
$$\mu_k = \sum_{i=0}^{k} i\,p_i \tag{13}$$
2) Logarithm Approximation
The division operation is costly in hardware in terms of processing speed and is the architecture's bottleneck because it has the longest critical path. To avoid the division operation in Eq. (11), the logarithm function presented in [17,18] is used, and Eq. (11) can be accordingly rewritten as:
$$\log_2 \sigma_k^2 = \log_2 (\mu_{L-1}\,\omega_k - \mu_k)^2 - \log_2 \omega_k - \log_2 (1 - \omega_k) \tag{14}$$
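Because the logarithm is monotonically increasing, maximizing the logarithm of the between-class variance selects the same threshold as maximizing the variance itself. The sketch below illustrates this in floating point with the standard form of Otsu's criterion; it uses `math.log2` rather than the paper's fixed-point logarithmic unit, and the name `otsu_threshold` and its `use_log` parameter are hypothetical.

```python
import math


def otsu_threshold(hist, use_log=False):
    """Return the k that maximizes the between-class variance of a histogram.

    With use_log=True the division is replaced by subtractions of base-2
    logarithms; the selected k is expected to be the same.
    """
    total = sum(hist)
    p = [count / total for count in hist]
    global_mean = sum(i * pi for i, pi in enumerate(p))

    best_k, best_score = 0, -math.inf
    omega = mu = 0.0  # cumulative probability and cumulative mean up to k
    for k, pk in enumerate(p):
        omega += pk
        mu += k * pk
        numerator = (global_mean * omega - mu) ** 2
        denominator = omega * (1.0 - omega)
        if numerator <= 0.0 or denominator <= 0.0:
            continue  # one class is empty, or the logarithm is undefined
        if use_log:
            score = (math.log2(numerator)
                     - math.log2(omega) - math.log2(1.0 - omega))
        else:
            score = numerator / denominator
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Synthetic bimodal histogram: two triangular peaks on a flat floor.
hist = [1 + max(0, 40 - abs(i - 60)) + max(0, 30 - abs(i - 180))
        for i in range(256)]
k_div = otsu_threshold(hist)
k_log = otsu_threshold(hist, use_log=True)
```

Both calls should return the same threshold, located in the valley between the two peaks; only the score values differ, not their ordering.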

3) Architecture of adaptive threshold computation
a) Architecture of histogram
The histogram is the key part of Otsu's algorithm. Histogram statistics are performed on the series of values generated by the NMS. Figure 10 shows the architecture of the histogram. Here, we use a RAM with an address width of 8 bits. The data width can be flexibly selected according to the number of pixels in the image; 32 bits are used here. The architecture is divided into two parts. The first part operates before the statistics are complete, and the other part operates after the statistics are complete. The steps for computing the histogram statistics are as follows:
(1) Read out the current count, add 1 and write it back to RAM.
(2) Repeat the above step until the whole current image has been counted. If imhist_end is 1, as shown in Figure 10, the statistics are complete.
(3) Next, start reading out the data in order.
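The three steps above can be sketched in software as a read-modify-write loop over a 256-entry memory. This is only a behavioral illustration; the function name `build_histogram` is hypothetical, and the list `ram` stands in for the on-chip RAM addressed by the 8-bit NMS value.

```python
def build_histogram(nms_values, depth=256):
    """Count occurrences of each 8-bit NMS value.

    Mirrors the read, add 1, write back sequence of step (1).
    """
    ram = [0] * depth
    for value in nms_values:
        count = ram[value]      # read the current count at this address
        ram[value] = count + 1  # add 1 and write it back
    return ram                  # step (3): counts can now be read out in order


counts = build_histogram([0, 3, 3, 255])
```

In hardware, consecutive identical values can cause a read-after-write hazard in such a loop; how the paper's circuit handles that case is not stated here.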
Step (1) is realized by the statistical circuit module, and step (3) is realized by the circuit after completion of the statistics module, as shown in Fig. 10.
b) Architecture of Otsu's algorithm calculation
After the statistics are completed, the parameters $\mu_{L-1}$, $\omega_k$ and $\mu_k$ in Otsu's algorithm are calculated. Before the calculation, a normalized cumulative histogram is needed. Here, we use the fixed-point number system [7] to represent non-integers. Figure 11 shows 32-bit (16.16) unsigned fixed-point numbers. In the proposed architecture, a 32-bit (8.24) unsigned fixed-point number system is used.
Fig. 11. 32-bit fixed-point number format
Since we use 24 bits as the fractional part, the normalized value in Eq. (12) is calculated by multiplying the ratio of the histogram count to the total number of pixels by $2^{24}$. In this paper, all images are 512 × 512 in size, so the total number of pixels is $2^{18}$ and the normalized value can be obtained by shifting the histogram count six places to the left ($2^{24} / 2^{18} = 2^{6}$). The histogram count is calculated in the histogram module, as shown in Figure 10. Figure 12 shows the architecture of Otsu's algorithm calculation.
c) 32-bit Logarithmic Arithmetic Unit
The logarithmic number system (LNS) has been studied to simplify arithmetic computations and achieve a lower computation complexity [16][17][18]. Here, a 32-bit logarithmic arithmetic unit suggested in [18] is adopted to compute Eq. (14). Figure 13 shows the architecture of the logarithmic converter. It is composed of a 32-bit count leading zero (CLZ) block, a barrel shifter (BSH), a characteristic generator (CGen), and a fractional part generation block (FPGen). The variable input number range Qm.n is modified to the fixed number range Q6.26 at the end of the logarithmic converter. The CLZ block calculates the number of leading zero bits of the input. The five bits of the CLZ block output determine the characteristic value and the shift amount of the BSH. The BSH converts the input number range Qm.n to Q6.26. After the shifting operation, the fractional part is generated by the FPGen block, and the characteristic part is generated by the CGen block. The above two values are combined to give the logarithmic conversion result.
d) Threshold calculation
Figure 14 shows the architecture of the threshold calculation. When the result of $\log_2 \sigma_k^2$ is obtained, consecutive values are subtracted, and the value of k corresponding to the first negative difference (when the MSB of the result is 1) is found. Finally, the obtained k value is reduced by one to obtain the optimal threshold. The output signal dout_valid in Figure 14 is equal to threshold calculate end in Figure 2, and its purpose is to enable the output of the temporarily cached NMS values.
4) Binarization
After the threshold calculation is completed, the value obtained is set as the high threshold value, and the low threshold value is half of it. Then, a comparison is conducted to find the strong edges and the weak edges. Figure 15 shows the architecture of binarization, where gh represents the strong edge and gl represents the weak edge.
5) Hysteresis
The nonmaximum suppression technique finds strong edges. The edges affected by noise and changes are preserved by this technique. After obtaining the values of gh and gl, if gh is 1, the gradient magnitude value is a strong edge pixel and is preserved. If gl is 0, it is not an edge pixel. If gl is 1, which means the gradient magnitude value is a weak edge pixel, then the current pixel is considered a strong edge pixel if and only if any of the neighbors of the current pixel, depicted as Ghn with various coordinate values in Fig. 16, is a strong edge pixel. Otherwise, it is a weak edge pixel and is suppressed. Figure 16 shows the architecture of hysteresis thresholding.
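The binarization and hysteresis rules above can be summarized in a short sketch. It assumes strict greater-than comparisons against the thresholds and a single pass over the 8-neighborhood, which is how the text reads; the comparator details of Figs. 15 and 16 are not given, and the name `hysteresis` is hypothetical.

```python
def hysteresis(mag, high):
    """Return a binary edge map from NMS magnitudes and a high threshold.

    The low threshold is half of the high threshold, as stated in the text.
    """
    low = high // 2
    rows, cols = len(mag), len(mag[0])
    gh = [[value > high for value in row] for row in mag]  # strong edges
    gl = [[value > low for value in row] for row in mag]   # weak or strong
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if gh[r][c]:
                edges[r][c] = 1          # strong edge: always preserved
            elif gl[r][c]:
                # Weak edge: keep only if some neighbor is a strong edge.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols and gh[rr][cc]:
                            edges[r][c] = 1
    return edges


nms = [[0, 0, 0, 0, 0],
       [60, 0, 0, 60, 100],
       [0, 0, 0, 0, 0]]
edge_map = hysteresis(nms, high=80)
```

In this example the weak pixel next to the strong one is kept, while the isolated weak pixel at the left border is suppressed.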

Experimental results and analysis
Figure 17 shows the comparison results between the Canny edge detection algorithm with a fixed threshold value and the improved Canny edge detection algorithm proposed in this paper. Figure 17(a) shows the grayscale image. Figure 17(b) shows the edge map detected by the Canny edge detection algorithm with a fixed threshold value, which is implemented in MATLAB. The Gaussian kernel used has a size of 9 and a standard deviation of 1.5, and the fixed threshold value is 70. Figure 17(c) shows the edge map detected by the proposed Canny edge detection algorithm, which is implemented on an FPGA. The FPGA chip used is Altera's Cyclone IV E: EP4CE10F17C8. Table 2 shows the resource utilization and computation time comparison of the proposed Canny edge detection algorithm with other Canny edge detection algorithms. The results show that the improved Canny edge detection algorithm can complete the edge detection task well even when the detection image changes. In contrast, the Canny edge detection algorithm with a fixed threshold value shows obvious differences in edge detection quality across different images.

Conclusion
In this paper, an improved Canny edge detection algorithm is proposed and implemented on an FPGA. In this new Canny edge detector, an approximate method is used to reduce the complexity of computing the gradient magnitude and orientation. In addition, Otsu's algorithm is used to adaptively determine the image threshold. Due to its high complexity, it is difficult to implement Otsu's algorithm directly. Therefore, the logarithmic operation unit is introduced to simplify the calculation of Otsu's algorithm. The results show that the proposed implementation requires only 1.231 ms to detect the edges of a 512 × 512 image when clocked at 50 MHz. In addition, the proposed hardware architecture uses the fewest logical resources among the compared implementations and is more suitable for hardware platforms with limited resources. In the future, this system can be a candidate for real-time edge detection applications.
3) Nonmaximum suppression (NMS): Convert blurred edges of the image into sharp edges by suppressing all values except the local maxima in the gradient image.
4) Computation of threshold: Potential edges are determined by high and low thresholds, which are calculated based on the gradient magnitude histogram of the whole image.
5) Hysteresis thresholding: Create a continuous edge map by comparing the gradient magnitude of each pixel with the low and high threshold values. This removes edge pixels caused by noise and illumination variation.

Fig. 1. Block diagram of the Canny edge detection algorithm.

Fig. 2. Architecture of improved Canny edge detection

Fig. 3. Sub-window of an image and Sobel operator kernel

Fig. 10. Histogram architecture

Fig. 12. Architecture of Otsu's algorithm calculation

Fig. 13. Architecture of logarithmic converter

Fig. 14. Architecture of threshold calculation

Fig. 15. Architecture of Binarization

Table 1
Shift-based tangent value computation

Table 2
Comparison of resource utilization and computation time