A Geometric Algebra Co-Processor for Color Edge Detection

This paper describes an advance in color edge detection, using a dedicated Geometric Algebra (GA) co-processor implemented on an Application Specific Integrated Circuit (ASIC). GA provides a rich set of geometric operations, giving the advantage that many signal and image processing operations become straightforward and the algorithms intuitive to design. The use of GA allows images to be represented with the three R, G, B color channels defined as a single entity, rather than separate quantities. A novel custom ASIC is proposed and fabricated that directly targets GA operations and results in significant performance improvement for color edge detection. Use of the hardware described in this paper also shows that the convolution operation with the rotor masks within GA belongs to a class of linear vector filters and can be applied to image or speech signals. The contribution of the proposed approach has been demonstrated by implementing three different types of edge detection schemes on the proposed hardware. Compared with existing software approaches, the proposed GA Co-Processor is more than 3.2× faster than GAIGEN and more than 2800× faster than GABLE. The fabricated GA co-processor is approximately an order of magnitude faster than previously published results for hardware implementations.


Introduction
With the pervasive nature of computing devices that use increasingly complex video processing, there is an ever greater need for faster and more efficient processing of image and video data. One of the key areas in this field is edge detection, particularly in color images, and while it is possible to carry out image processing using software, often it is simply too slow. In order to address this key issue of performance we have designed and implemented a Geometric Algebra Co-Processor that can be applied to image processing, in particular edge detection.
Edge detection is one of the most basic operations in image processing and can be applied to both gray scale and color images. For gray scale images, edges are defined as the discontinuity in the brightness function, whereas in color images they are defined as discontinuities along adjacent regions in the RGB color space. Traditional color edge detection involves applying the uncorrelated monochrome or scalar based technique to three correlated color channels. However, to smooth along a particular color component within the image, component-wise filtering gives incorrect results as described by [1][2][3]. Different techniques exist that treat color as a 3-D vector to avoid such problems [4,5]. The techniques of convolution and correlation are quite common to image processing algorithms for scalar fields. Standard techniques used to identify the edges and critical features of an image use the rotational and curvature properties of vector fields, which is an increasingly popular method [6]. A combination of the scalar and vector field techniques has been extended to vector fields for visualization and signal analysis in [2]. In [7], the author developed a hyper-complex Fourier transform of a vector field and has applied this to images.
Geometric Algebra (GA) methods were introduced in [8] and [9] for image processing where it was shown that hyper-complex convolution is in fact a subset within GA. Independently, in [10] the GA or Clifford convolution Fourier transform method was applied to pattern matching on vector fields which are used for the visualization of 3-D vector field flows. This work suggested that the convolution technique based on GA is superior because of the unified notation used to represent scalar and vector fields.
A large body of work exists in the general area of color edge detection which is concerned with the relative merits of different basic approaches and quantifying the ability of algorithms to identify features from images [11][12][13][14][15]. For example in [11] it is suggested that leveraging GA is an effective method of detecting edges and that it can also provide a "metric" for the assessment of the strength of those edges. One of the interesting aspects of the proposed work is taking advantage of the simplification of the terms, which has the potential for much simpler implementation, making this an ideal candidate for hardware implementation. We can contrast our approach and also [11] with the more general techniques (Euclidean) described in [12]. In [12] the author evaluates different approaches for feature detection; however, in many cases the relevance comes after the basic transformations, such as those presented in our approach or even in [11]. Similarly, [13] discusses the use of visual cues and how the use of an optimally combined set of rules can improve the ability of a vision system to identify edges. However, this would be useful in a system environment, again, after the initial transformation had been completed. In a similar manner, [16] describes derived classes of edge detection techniques and quantifies how effective they are in practice, particularly for natural images. The work described in [16] also highlights the possibilities for implementing vision systems that target specific operations or transformations such as GA, where significant performance benefits can accrue, and integrating this with a general purpose processor that can then leverage other algorithms taking advantage of the results of the efficient processing completed by the partner GA processor. If we consider the general area of research using GA techniques, in most cases the assumption is made that any algorithm can be implemented in general purpose hardware. Most real world scenarios (such as dynamic object recognition) require real time operation, making a relatively slow software solution based on a general purpose platform impractical.
In our proposed approach, the above ideas have been extended further by introducing color vector transformations and a difference subspace within GA. Based on certain properties of these transformations, a hardware architecture to compute all the different GA products has been proposed. The experimental results show the use of the GA methods and proposed hardware for three different edge detection algorithms.
This paper is organized as follows. Section 2 establishes some important GA fundamentals and introduces the techniques for implementing rotations in 3-D. This section also demonstrates how the GA and rotation techniques can be applied to the topics of transformation and difference subspace. This introduction therefore establishes the theoretical foundation of the relationship between GA operations and how the transformations in the GA framework can be exploited to identify chromatic or luminance shifts in an image when the color changes. Section 3 introduces the details of the proposed GA Micro Architecture (GAMA) designed specifically for this application, describes the custom ASIC implementation and reports the experimental results for this hardware implementation, with a focus on the geometric operations per second and instructions per second, providing a comparison with other hardware and software implementations. Section 4 demonstrates how the technique can be extended to use rotor convolution for color edge detection, with examples including color "block" images and natural images. Sections 5 and 6 extend this to color sensitive edge detection and a color smoothing filter, respectively. Finally, concluding remarks are given in Section 7.

Geometric Algebra (GA) in 3-D
GA has proven to be an extremely powerful and flexible approach for applying complex geometric transformations to objects in both software and hardware applications. One of the major advantages of GA is that once expressions have been defined for relatively low order systems (for example 2-D or 3-D) it becomes straightforward to extend these to much higher order systems. Expressions within GA embed and extend existing theories and methods to express geometric relations without the need for special case considerations in higher dimensions [17][18][19][20]. The key to this approach is in how the GA framework handles vectors of different types, and therefore in order to understand some of the key concepts in this paper, it is useful to describe the fundamentals of how vectors are handled in a GA context, particularly with reference to the conventional Euclidean space.
Consider the Euclidean 3-D vector space $E^3$, which is defined by the orthonormal basis vectors e1, e2 and e3. This 3-D space in $E^3$ can be decomposed into an eight dimensional real vector space having the eight elements in $G_3$ shown in Equation (1). The elements of this algebra or real vector space are called the multivectors, and essentially describe all the possible geometric objects within that vector space including scalars (with no vector), vectors (lines), bivectors (surfaces) and trivectors (volumes).
The concept is useful in that any or all of the individual elements can be considered together as a single entity called a multivector.

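$$\underbrace{1}_{\text{scalar}},\quad \underbrace{e_1,\ e_2,\ e_3}_{\text{vectors}},\quad \underbrace{e_1e_2,\ e_2e_3,\ e_3e_1}_{\text{bivectors}},\quad \underbrace{e_1e_2e_3}_{\text{trivector}} \tag{1}$$
Fundamental algebraic rules exist for objects defined within a GA framework. Multiplication of elements within GA is associative, bilinear and commutative for scalar and trivectors but anticommutative for bivectors and is defined by the rules in Equation (2).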
As already discussed, within GA it is possible to add or multiply different vector elements to form a multivector. For example, a generic multivector in 3-D is the linear combination of the fundamental eight elements shown in Equation (1) and is defined by Equation (3).
Within GA the multiplication of any two multivectors a and b results in the geometric product, which consists of the inner product or dot product (a • b) and the outer product (a ˄ b) and is given by Equation (4).
The inner product, or dot product, gives the magnitude of the vectors and the outer product a ˄ b gives the orientation of the plane or the oriented area that is formed by sweeping the vectors a and b. The geometric product is the most important element of this algebra and all the other meaningful operations are derived from it [17]. Clearly there are a large number of possible operations that can be carried out using this framework; however, we are particularly interested in the transformations that become possible within GA, especially rotations. The next section therefore describes how 3-D rotations are performed using GA.
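As a minimal illustration of this decomposition, the sketch below computes the geometric product of two 3-D vectors as a scalar part (the inner product) plus a bivector part (the outer product). The function name `geometric_product_vectors` and the ordering of the bivector coefficients on (e1e2, e2e3, e3e1) are assumptions made for this example; it covers only vector inputs, not general multivectors.

```python
from typing import Sequence, Tuple

Vec3 = Sequence[float]


def geometric_product_vectors(a: Vec3, b: Vec3) -> Tuple[float, Tuple[float, float, float]]:
    """Return (inner, outer) for two 3-D vectors a and b.

    inner is the scalar a . b; outer holds the coefficients of the
    bivector a ^ b on the basis (e1e2, e2e3, e3e1), an assumed ordering.
    """
    inner = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    outer = (
        a[0] * b[1] - a[1] * b[0],  # e1e2 coefficient
        a[1] * b[2] - a[2] * b[1],  # e2e3 coefficient
        a[2] * b[0] - a[0] * b[2],  # e3e1 coefficient
    )
    return inner, outer
```

For orthogonal unit vectors the scalar part vanishes and only the bivector survives; for parallel vectors the bivector part vanishes and the product is a pure scalar.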

Rotations in the Geometric Algebra 3-D Space
In Geometric Algebra, a rotor R is an element which is used to rotate any vector within the 3-D space and satisfies the relation $R\tilde{R} = 1$, where $\tilde{R}$ is the conjugate of R. One of the useful aspects of the GA framework is that if only the bivectors of the algebra are used, it can be shown that the quaternions are a subset of the GA [20]. If ℱ = $[q_0, q_1, q_2, q_3]$ is defined as a unit quaternion then the one to one mapping between the quaternion and the rotor which performs the same rotation in GA is given by Equation (5) (where I = e1 e2 e3 is the pseudoscalar).

$$R = q_0 + q_1 I e_1 - q_2 I e_2 + q_3 I e_3 \tag{5}$$
where θ represents a rotation about an axis parallel to the unit bivector B, and the direction of the rotation axis is given by a linear combination of the basis bivectors with coefficients μ. We will show in the following section how this rotational element R is important when discussing the transformation subspace for color images. A detailed discussion with an example is given in Section 4.3.

Transformation and Difference Subspace
In the context of this work, it is assumed that the perceived color is a vector in the 3-D Euclidean space and not separate r-g-b image planes. In this regard the bivector representation of color vectors in GA fits neatly for the 3-D Euclidean space. Using this approach the (r, g, b) vector of the color image $C(m,n)$ can be written as shown in Equation (7),
where $r(m,n)$, $g(m,n)$ and $b(m,n)$ are the rgb components of the image $C(m,n)$, and m and n are the row and column pixel indices, respectively. In this section the color is defined as a vector or a single entity (Equation (7)) and the image is treated as a superset of this entity. In the later experimental sections we will show how, by doing this, the image processing algorithms become straightforward and intuitive. For this discussion, μ is the diagonal axis, i.e., the gray line in the color cube (Figure 1), and is represented as in Equation (8).
For r = g = b the pixel is achromatic in nature and is represented as a gray line. For transformation a normalized color representation is chosen for clarity. This ensures that the orientation information is kept while the distance information is normalized. For a unit transformation on the normalized color, the rotation vector is expressed as shown in Equations (9) and (10).
The rotor R, together with its conjugate, rotates any vector by an angle θ in 3-D about an axis parallel to the rotation axis. The unit transformation on a color element C (Equation (7)) is given by Equation (11), whose Parts II, III and IV reduce as shown in Equations (12) to (15). Using these reductions, Equation (11) can therefore be rewritten as shown in Equation (16). The term "A" in Equation (16) is the rgb space component, "B" is the intensity component and the "C" term is the color difference or the chromaticity (hue and saturation) of the vector.
If the vector is rotated by an angle θ = π/2, the above equation reduces to only two components as shown in Equations (17) and (18), where the space component "A" is cancelled. This transformation is an RGB to HSI conversion where the two components in the equation describe the luminance and chrominance of the image. For θ = π/2 the transformation is given by Equation (17),
where the coefficients are scalar quantities. Similarly, the opposite rotation is given by Equation (18).
If the color vector is homogeneous then adding the two transforms results in the intensity of the image. Subtracting the two transforms gives the difference of the color vector, which is often referred to as the change in chromaticity or a shift in hue. It can therefore be concluded that if the color vectors are different then the above transformations do not cancel out and this will result in a hue or chromatic shift. On the other hand, a homogeneous color vector results in an intensity component that generates sharp edges when a color change occurs within an image. This is an extremely powerful and interesting result, which can be applied to edge detection in hardware if the required GA transformations are available.
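The sum and difference behaviour described above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the derivation used in this paper: it replaces the rotor sandwich product with the equivalent Rodrigues rotation formula about the gray axis, and the names `rotate_about_gray` and `sum_and_difference` are hypothetical.

```python
import math

GRAY_AXIS = (1 / math.sqrt(3),) * 3  # unit vector along r = g = b


def rotate_about_gray(c, theta):
    """Rotate the RGB vector c by theta about the gray axis (Rodrigues formula)."""
    k = GRAY_AXIS
    dot = sum(ki * ci for ki, ci in zip(k, c))
    cross = (
        k[1] * c[2] - k[2] * c[1],
        k[2] * c[0] - k[0] * c[2],
        k[0] * c[1] - k[1] * c[0],
    )
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return tuple(
        ci * cos_t + xi * sin_t + ki * dot * (1 - cos_t)
        for ci, xi, ki in zip(c, cross, k)
    )


def sum_and_difference(c, theta=math.pi / 2):
    """Return (sum, difference) of the +theta and -theta rotations of c."""
    fwd = rotate_about_gray(c, theta)
    bwd = rotate_about_gray(c, -theta)
    total = tuple(f + b for f, b in zip(fwd, bwd))
    diff = tuple(f - b for f, b in zip(fwd, bwd))
    return total, diff
```

For a single color such as (0.8, 0.2, 0.2) with θ = π/2, the sum has three equal components (a point on the gray axis, i.e., intensity only), while the components of the difference add to zero (no gray content, i.e., chromatic information only).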

Rotor Edge Detection Using Geometric Algebra Co-Processor
The previous section showed that in order to implement color edge detection using rotor transformations the main GA computations required are the geometric product, multi-vector addition and subtraction. From an implementation perspective, these functions can also be decomposed into parallelized tasks, each task involving Multiply and Accumulation (MAC) operations, which can map directly to specific hardware. It is already known that significant speed advantages can be gained when GA is implemented in hardware instead of software [21][22][23]. The hardware architecture is discussed in this section; however, a detailed description is beyond the scope of this paper, and the reader is referred to [23] for more information. An important advantage of the architecture is that by having dedicated hardware functions to calculate the GA operations directly, significant efficiencies can be achieved over general-purpose hardware. For example, although a general-purpose processor may have dedicated GA software, the underlying transformations will still use standard processing resources, which will be less efficient than hardware that directly maps the GA software operations.

Hardware Architecture Overview
The proposed rotor edge detection hardware architecture consists of an IO interface, control unit, memory unit and a central Geometric Algebra Core [21][22][23] consisting of adder, multiplier, blade logic and a result register (Figure 2). The architecture supports both single and double precision floating-point numbers, four rounding modes, and exceptions specified by the IEEE 754 standard [24]. This can be seen as effectively a GA Co-Processor, which operates in conjunction with a conventional general-purpose microprocessor. The floating-point multiplier is a five-stage pipeline that produces a result on every clock cycle. The floating-point adder is more complex than the multiplier and involves steps including checking for non-zero operands, subtraction of exponents, aligning the mantissas, adjusting the exponent, normalizing the result, and rounding off the result. The adder is a six-stage pipeline that produces a result on every clock cycle. It is important to state here that the proposed hardware can process other products of Geometric Algebra with ease. The state machine governing the processing stages of Geometric Algebra has six states: idle, clear, load, process, write, and memory dump. Firstly, the idle state waits for the start signal to go high to trigger the state machine. After that, the state machine enters the clear state, where it clears any registers, and then the load state, where it asserts load as "1" to load input data into the registers. Then it processes the result in the required number of clock cycles based on the control word (cfg_bits). Finally, it outputs the product in the output state. Apart from the initial transition, which is triggered by the start signal, the others proceed to the next state automatically after an expected number of cycles based on the control word.
The long 320-bit word datapath is coordinated by the controller (LOGIC) and the sequencer unit. The transfer of the data is done in the input and output interface unit (conversion logic). The signals are all defined as input and output registers to the system architecture. Selection of the data input and output is based on the 16-bit control word. The control bits (cfg_bits) are used for configuring the processing core for different operations. These control bits are responsible for defining the data interface, configuring the operators at the correct time and outputting the result. The control block along with the sequencer ensures effective queuing and stalling to balance the inputs in different stages in the datapath.
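The control flow described above can be sketched as a small cycle-level model. This is a simplified illustration, not the VHDL controller: it assumes that every state except process lasts exactly one cycle, that write is followed by memory dump and then a return to idle, and that the control word is reduced to a single `process_cycles` count (at least 1). The names `State` and `simulate` are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CLEAR = auto()
    LOAD = auto()
    PROCESS = auto()
    WRITE = auto()
    MEMORY_DUMP = auto()


def simulate(start_signal, process_cycles):
    """Return the state occupied on each clock cycle.

    start_signal is an iterable of booleans, one per cycle; process_cycles
    stands in for the cycle count that the control word would select.
    """
    state, remaining, trace = State.IDLE, 0, []
    for start in start_signal:
        trace.append(state)
        if state is State.IDLE:
            # Only this transition depends on the external start signal.
            state = State.CLEAR if start else State.IDLE
        elif state is State.CLEAR:
            state = State.LOAD
        elif state is State.LOAD:
            state, remaining = State.PROCESS, process_cycles
        elif state is State.PROCESS:
            remaining -= 1
            if remaining == 0:
                state = State.WRITE
        elif state is State.WRITE:
            state = State.MEMORY_DUMP
        else:  # MEMORY_DUMP: assumed to return to IDLE
            state = State.IDLE
    return trace
```

Calling `simulate([False, True] + [False] * 8, 3)` yields a trace that stays in idle until the start pulse, then steps through clear and load, spends three cycles in process, and finishes with write and memory dump before returning to idle.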

Blade Logic
Another important element of this architecture is the blade computation. The "blade" is simply the general definition of vectors, scalars, bivectors and trivectors. The blade index relationship is defined by the elements being computed, in a similar way that matrix operations result in an output matrix dimension that is defined by the input matrix dimensions. For example, multiplying two basis blades gives a single resultant basis blade whose index is fixed by the indices of the two operands. This can be implemented by a multiplication table, an approach followed by many software implementations. However, accessing a memory in hardware is a slower operation than a simple EXOR function (Figure 3a). Determining the sign due to the blade index is not straightforward due to the invertible nature of the geometric operation. For example, the blade index multiplication of two basis elements in one order gives a positive result, whereas multiplying them in the reverse order results in the negated blade. The resulting circuit, which is a cascade of EXOR gates, takes care of the swapping of the blade vector, and the AND gates compute the number of swaps that the blade element undergoes (see Figure 3b).
A detailed design description of the hardware implementation can be found in [23].
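The XOR and swap-counting idea can be illustrated in software with a bitmap encoding of basis blades. The sketch below is an assumption-laden illustration and not the circuit of Figure 3: it encodes each basis blade as an integer whose bit i is set when basis vector e(i+1) is present (so e1 = 0b001, e2 = 0b010, e1e2 = 0b011), and the name `blade_product` is hypothetical. It also assumes an orthonormal Euclidean basis in which each basis vector squares to +1.

```python
from typing import Tuple


def blade_product(a: int, b: int) -> Tuple[int, int]:
    """Return (sign, blade) for the product of basis blades a and b.

    The resulting blade index is the XOR of the operand bitmaps. The sign
    is the parity of the number of basis-vector swaps needed to bring the
    product into canonical order, counted with AND and shift operations.
    """
    swaps = 0
    t = a >> 1
    while t:
        swaps += bin(t & b).count("1")  # vectors of a that must hop over vectors of b
        t >>= 1
    sign = -1 if swaps & 1 else 1
    return sign, a ^ b
```

With this encoding, `blade_product(0b001, 0b010)` gives `(1, 0b011)` and `blade_product(0b010, 0b001)` gives `(-1, 0b011)`, which reproduces the order-dependent sign discussed above without a lookup table.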

ASIC Implementation
The proposed hardware is described using synthesizable VHDL, which means that the architecture can be targeted at any technology, FPGA or ASIC. Standard EDA tools perform all the translation from VHDL to silicon including synthesis and automatic place and route. The architecture was synthesized to a CMOS 130 nm standard cell process. The synthesis was carried out using the Synplify ASIC and the Cadence Silicon Ensemble place and route tools. The synthesis timing and area reports are summarized in Table 1. The chip area was 1.1 × 1.1 mm² (inset in Figure 4) and the estimated maximum clock frequency of the design was 130 MHz.
The prototype ASIC was packaged in a standard QFP package and this was then placed on a test printed circuit board (PCB), with power supply connections and a link to an FPGA development kit that supported a general purpose processor for programming and data handling (Figure 4). The test board was then used to evaluate the performance of the GA processor in handling a variety of test data, described in detail in the next section.

Comparison with Other GA Implementation Hardware and Software
This section describes the experimental verification results. When calculating the GA operations, the GOPS (Equation (20)) is particularly important because it lets the designer determine whether a particular implementation, given its clock cycles and GOPS, can meet the timing constraint.
Table 2 gives a comparison for different dimensions, GA processor performance in MHz, clock cycles, latency and Geometric Operations per Second (GOPS). This also provides a comparison to the GA hardware implementations in [21,23,25-30]. It was found that the proposed hardware is an order of magnitude faster as shown in Table 2 (columns 5-6).
In [27], the normalized GOPS (in Table 2) is found by dividing by the number of MAC units, which in this case was 74. However, the authors have used the hardware resources available in the FPGA to their advantage and shown a threefold performance improvement as compared to the proposed hardware architecture. The GA products are computed on every clock cycle after a specified latency. In a very recent paper [28], the same authors further improved the performance of their hardware to roughly two to three orders of magnitude beyond the proposed ASIC architecture discussed here.

FPGA and ASIC Test Results
Subsequently, the ASIC test was performed using a 125 MHz clock (a little less than the maximum clock frequency of 130 MHz). A set of 1 k, 4 k, and 40 k product evaluations was carried out to yield the raw performance (where all multivectors are present) of the hardware. The results for geometric product evaluation at different clock speeds along with the performance measures obtained when the design was targeted to an FPGA family (Xilinx XUPV2P, San Jose, CA, USA) are given in Tables 3 and 4.
The implementation results of the processor core at different dimensions suggest that the performance is comparable to that of some software packages (GAIGEN, Amsterdam, The Netherlands) running on high speed CPUs. To see how much advantage was gained by this hardware we compared the performance of GABLE [31], GAIGEN [32] and the proposed hardware (see Table 3). To ensure consistency with GABLE, only the three-dimensional GA implementation is considered. For GAIGEN the "e3ga.h" module is used for the performance evaluation. Tables 3 and 4 show the performance comparison between the software, which ran on 2 GHz Intel Pentium processors, and the proposed hardware. The software implementation GABLE (in MATLAB) was found to be extremely slow; however, the GAIGEN software running on a CPU was found to be of comparable performance to our proposed hardware (Table 3). However, when the number of elements is varied, e.g., vectors multiplied with a scalar and bivector, $(e_1 + e_2 + e_3) \times (1 + e_1e_2 + e_2e_3 + e_3e_1)$, the hardware showed a performance increase of more than 3× when compared to the software implementation, as shown in Table 4. Furthermore, it was seen from the experiments that the number of cycles can be optimized by hardwiring the operators which are constant over the operation. For example, for a windowing operator typical of an image-processing algorithm, four of the multivectors which define the window or the filter can be hardwired. This leads to a 25% saving in the processing cycle due to the savings associated with I/O transfers on every operation. Overall, the GA Co-Processor is more than 3.2× faster than GAIGEN [32] and more than 2800× faster than GABLE [31]. The above comparison shows that the proposed hardware provides speedup and performance improvement over the existing hardware (Table 2) as well as the software implementations on CPU.
In Section 2, the basic technique of rotor detection has already been discussed. This technique is applied to color difference edge detection on our proposed hardware in the next section.
The proposed hardware architecture supports the general computations involving GA operations. To show the usefulness of the framework, we chose an image processing application in general and an edge detection algorithm in particular. In this application, we demonstrate that even if the data is not a full multivector containing all the elements, one could still take advantage of the architecture with very few changes to the way the data is fed into the core. We also demonstrate the performance with two cores. The platform can be used for signal processing, vision and robotics applications.

Introduction
A conventional edge detection process involves convolving a left mask and a right mask, each of size X × Y, with the image of dimension M × N (Equation (22)). In a GA based approach, the convolution takes a similar form, with the exception that the scalar quantities are replaced with multivector masks (Equations (23) and (24)).
Previous work [33] has already reported the usefulness of hypercomplex masks for edge detection and this was extended to rotor convolution by Corrochano-Flores in [8]. The rotor convolution works exactly the same way as the hypercomplex convolution and operates on the color vectors of the image. The horizontal left and right masks for the rotor convolution are defined in Equations (23) and (24). The vertical masks are obtained by interchanging the rows and columns of the two masks.
where the rotors are given by Equation (25).
In the above equation, μ is the unit vector, given by $\mu = (e_1e_2 + e_2e_3 + e_3e_1)/\sqrt{3}$, and $1/\sqrt{6}$ is the scale factor. The left and right masks are applied to each of the color pixels, which gives the following convolution in the simplified form [9]. Therefore the convolution in Equation (22) results in Equation (26).
Here the color is taken as a single vector and is split into components parallel and perpendicular to the rotation axis. In the 3-D color cube (Figure 1), the rotation axis is the unit vector. As shown in Equation (26), if the colors are uniform in the upper and lower rows, then the amount of rotation applied to the perpendicular component will be the same for both rows. Therefore they cancel out and the resultant vector lies on the gray axis, indicating that no edge is present in the image. In the case of dissimilar regions, i.e., the upper and lower rows having different values, the rotated perpendicular components will be different and when added they will not cancel each other. The resultant vector will lie somewhere else in the cube, off the gray axis, indicating the presence of an edge.
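The cancellation argument can be reproduced with a few lines of code. The sketch below is a simplified illustration, assuming a single upper pixel and a single lower pixel instead of full mask rows and omitting the scale factor. It uses unit quaternions in place of GA rotors (the two perform the same rotation), and the names `qmul`, `rotor`, `sandwich`, `edge_response` and `chroma` are hypothetical.

```python
import math


def qmul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def rotor(theta):
    """Unit quaternion rotating by theta about the gray axis (1, 1, 1)/sqrt(3)."""
    s = math.sin(theta / 2) / math.sqrt(3)
    return (math.cos(theta / 2), s, s, s)


def conj(q):
    return (q[0], -q[1], -q[2], -q[3])


def sandwich(r, c):
    """Apply r c r~ to an RGB triple c embedded as a pure quaternion."""
    _, x, y, z = qmul(qmul(r, (0.0, *c)), conj(r))
    return (x, y, z)


def edge_response(upper, lower, theta=math.pi / 2):
    """Rotate the upper pixel by +theta and the lower pixel by -theta, then add."""
    r = rotor(theta)
    a = sandwich(r, upper)
    b = sandwich(conj(r), lower)
    return tuple(ai + bi for ai, bi in zip(a, b))


def chroma(v):
    """Distance of v from the gray axis; zero means the result is achromatic."""
    mean = sum(v) / 3
    return math.sqrt(sum((vi - mean) ** 2 for vi in v))
```

With identical upper and lower colors, for example red over red, `chroma(edge_response(...))` is zero to within rounding, so the response sits on the gray axis. With red over blue the response lies off the gray axis, which corresponds to a colored edge pixel.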

Experimental Configuration
The experimental setup for the convolution operation involving hardware is as shown in Figure 4. It consists of the two blocks (the "to bin matlab" and "to bmp matlab" blocks in Figure 5) and the ASIC (hardware) mounted on a PCB. The images were first read in MATLAB (Mathworks, Natick, MA, USA), converted into binary images (processed off line) and then fed to the hardware. The binary results from the hardware were then converted back to bitmap (bmp) images and processed off line. Three images (Figures 6a, 7a and 8a) of different sizes (128 × 128, 256 × 256 and 512 × 512) were taken. The masks defined in [8,33] were applied to all three images.

Image Convolution Results
The hardware performs all the necessary geometric calculations and generates the results. The results of the convolution of the three images are shown in Figures 6b, 7b and 8b. Rotor convolution using Equation (26) was applied on the test image of the "color block", which has an 8 × 8 array of colored squares (see Figure 6a). The result of the filtered image is shown in Figure 6b. The filtered image has gray areas where the squares had uniform color, but it has colored lines at the edges or where there was a change of color. This was also observed at the edges between the black and white blocks. In this approach the color is considered to be a vector and expressed as a single entity. As shown in Figure 1, the color vector ($c$) is split into two components: $c_\parallel$, which is the component parallel to the gray axis, and $c_\perp$, which is the perpendicular component. When the masks are applied on the color vector only the perpendicular component $c_\perp$ is affected and the parallel component $c_\parallel$ is unchanged. The convolution rotates the perpendicular component by an amount specified by the rotor angle θ. Since the rotor $R$ rotates any vector in a clockwise manner, the rotor $\tilde{R}$ rotates the color vector by the same amount as $R$ but in the opposite direction. Hence if the color vectors are homogeneous then both the components cancel. This point or pixel falls on the gray axis and the pixel is perceived as a gray picture or the intensity of the image.
However, if the color components are not homogeneous then the color vector will be rotated by an unequal amount by the two rotors. Thus the resultant vector would lie somewhere else in the color cube, far from the gray axis (Figure 1). For θ = π/2, $R$ rotates the color by π/2 and $\tilde{R}$ by −π/2. Hence the two color vectors cancel each other due to the rotation operation when they are uniform and fall on the black and white axis. Otherwise they fall outside this gray line where r = g = b, giving a color value to the pixel. This signifies that the rotor operation is a shift in hue (chromaticity) of the image. In the areas where the upper and lower pixels are similar the rotors produce a gray scale image. When these pixels differ in color, the rotors produce different colors, as they do not cancel in the chromatic sense. Hence the change of direction due to rotation of colors results in different colors on the edges. This type of change is also observed on the edges of the flowers in the filtered "Tulip" image (Figure 7b) and on the top edges of the hat in the "Lena" image (Figure 8b).

Discussion of Rotor Convolution Results
From the experiments using the proposed hardware, it was observed that for an image size of 128 × 128 pixels, the total number of geometric product multiplications is $1.96 \times 10^{5}$. Each color pixel is treated as a vector and each convolution operation consists of 12 geometric product multiplications (GP mul) and four geometric product additions (GP add). For image sizes of 256 × 256 and 512 × 512, the number of multiplications and additions increases proportionally.
It was observed that further optimization of these algorithms is also possible. For example, four additions followed by four convolutions on these vectors can be used with the same result due to the linearity property of the rotor operations. This results in huge savings in terms of geometric operations where the hardware is designed to handle such specialized computations efficiently. As shown in Table 5, the convolution of a 128 × 128 image takes $5.70 \times 10^{6}$ cycles, which amounts to 45.6 ms (third column, third row of Table 5) at the 125 MHz clock used for the ASIC test. The convolution times for different sized images are also given in Table 5. The hardware consists of two multipliers and three adders (labeled as ASIC with two cores in Tables 5 and 6) in an optimized structure already discussed in the previous section. Tables 5 and 6 show the time required for each calculation running on this hardware. Entries in Table 5 (columns 2 and 3) show the results for a 3-D vector space consisting of an eight-element multivector, i.e., a full multivector (MV). Therefore the multiplication of two multivectors would result in 64 products and 56 additions. However, as discussed earlier, the color is expressed with three bivector elements, which significantly reduces the number of processor cycles while computing the geometric product and geometric additions. The reduction is evident from the clock cycle and time columns in Table 5 (columns 2-3 compared to 4-5). The entries in columns 6 and 7 show the performance of the geometric product and additions with two cores (labeled as ASIC with two cores in the table), consisting of three adders and two multipliers.
Tables 5 and 6 give the timing of four GP mul and five GP add operations, respectively (one extra final addition is performed, hence five GP add rather than four). For the convolution of the image, the worst-case timing of the two is considered. For example, for an image size of 128 × 128, GP mul takes 19.6 ms and GP add takes 21.4 ms. In this case only the worse of the two, i.e., 21.4 ms, is taken into account when reporting the convolution operation, with the slack being wasted as a stall operation in the hardware. From the experiments we observed that if the color image was treated as a full multivector, many clock cycles were wasted as either stall operations or no-operations. If the color vector is expressed as a bivector containing four vectors, there is an immediate performance gain with both a single core and two cores. For example, in Tables 5 and 6, relative to color expressed as a full multivector (45.6 ms), the single core (38.3 ms) shows a 16% gain in speed and two cores (21.4 ms) show a 53% gain. A similar positive trend is observed for larger image sizes, as shown in Tables 5 and 6. Now that we have demonstrated the approach for color edge detection, we extend it to the special case of color sensitive edge detection in the following section.
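The worst-case rule and the quoted percentage gains follow from simple arithmetic on the 128 × 128 timings given above. The helper names below are hypothetical and the sketch only restates that arithmetic.

```python
# Sketch: worst-case convolution time and relative gain, using the
# 128 x 128 timings quoted in the text (values in milliseconds).


def convolution_time_ms(gp_mul_ms, gp_add_ms):
    """The slower of the two operation groups sets the convolution time."""
    return max(gp_mul_ms, gp_add_ms)


def relative_gain_percent(baseline_ms, improved_ms):
    """Reduction in time relative to the baseline, in percent."""
    return 100.0 * (baseline_ms - improved_ms) / baseline_ms


if __name__ == "__main__":
    two_cores = convolution_time_ms(19.6, 21.4)
    print(two_cores)                                    # GP add dominates
    print(round(relative_gain_percent(45.6, 38.3)))     # single core
    print(round(relative_gain_percent(45.6, two_cores)))  # two cores
```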

Color Sensitive Edge Detection-Red to Blue-Using the Geometric Algebra Co-Processor
In this section, the detection of homogeneous regions of particular colors C1 to C2 (C1 → C2) is discussed. The edges between these two colors are determined by GA methods, which extend the method discussed in [34]. To show the feasibility of this approach, only synthetic images are considered, with a strict thresholding criterion to find the edges, because the thresholding criterion for synthetic images is straightforward.
Let C1 and C2 be two color bivectors, and let μ be the normalized color bivector of a color bivector C, where μ is given by Equation (27).
To find the discontinuity between two regions, from C1 to C2, the convolution can be used with the following hypercomplex filter.
The convolution is therefore applied to the image with the horizontal and vertical masks of size 3 × 3 given in Equations (28) and (29), respectively (see Figure 9). This convolution results in a non-zero scalar part and a zero vector part in special cases. For example, to observe the discontinuity C1 → C2 from red to blue, let C1 be red and C2 be blue. Normalizing both to μ1 and μ2, the convolution results in the scalar value 1/6 (−3 − 3) = −1, that is, a non-zero scalar value and a zero vector value. As shown in Figure 10a, there is only one block where the red to blue (C1 → C2) color transition occurs. When convolving with the mask (Equation (30)), the vector part there is zero, and the non-zero scalar value is the intensity of the pixel shown in the whiter regions in Figure 10b. Everywhere else in the image, both the scalar and the vector parts are non-zero. It should also be noted that the masks are directional in nature: the white line (in the region of interest) appears where the red to blue change occurs, and not where the blue to red change occurs.
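The scalar value of −1 quoted above can be illustrated numerically, using the fact that a unit bivector squares to −1. The sketch below is not the filter of Equations (28) and (29). It assumes a simplified mask that weights three pixels on one side of the edge by μ1 and three pixels on the other side by μ2, with an assumed normalization of 1/6. Colors are stored as (R, G, B) coefficient triples, and all function names are hypothetical.

```python
# Sketch: response of a color-matched mask under ASSUMED weights.
# Assumptions: (R, G, B) triples are bivector coefficients; the mask applies
# mu1 to three pixels on one side, mu2 to three on the other, scaled by 1/6.
import math


def normalize(color):
    """Scale an (R, G, B) coefficient triple to unit magnitude."""
    norm = math.sqrt(sum(c * c for c in color))
    return tuple(c / norm for c in color)


def bivector_product(a, b):
    """Product of two pure bivectors, returned as (scalar, bivector part).

    The scalar part is minus the dot product of the coefficients, because a
    unit bivector squares to -1. The bivector part is cross-product-like; its
    sign depends on the basis ordering, so it is only compared with zero.
    """
    scalar = -sum(x * y for x, y in zip(a, b))
    part = (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
    return scalar, part


def mask_response(c1, c2, side_a, side_b):
    """Assumed mask: weight mu1 on the pixels of side_a, mu2 on side_b."""
    scalar, part = 0.0, [0.0, 0.0, 0.0]
    for mu, pixels in ((normalize(c1), side_a), (normalize(c2), side_b)):
        for pixel in pixels:
            s, p = bivector_product(mu, normalize(pixel))
            scalar += s
            part = [u + v for u, v in zip(part, p)]
    return scalar / 6.0, tuple(v / 6.0 for v in part)


RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)
on_edge = mask_response(RED, BLUE, [RED] * 3, [BLUE] * 3)
off_edge = mask_response(RED, BLUE, [GREEN] * 3, [GREEN] * 3)
print(on_edge, off_edge)
```

Under these assumptions the red to blue transition gives a scalar of −1 with a zero bivector part, while a region of another color (green here) leaves a non-zero bivector part, which is the property the thresholding relies on.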

Testing of Rotor Convolution in Hardware
Table 7 shows the timing and the clock cycles the hardware takes for the GA geometric product, and Table 8 shows the same for the GA addition operations. The results also show how the timing varies between the full multivector and the color bivector with a single core and with two cores. Again, only the worse of the two (geometric product and geometric addition) is taken into account when reporting the convolution operation (marked gray in Table 8, using only two cores).

Color Sensitive Smoothing Filter
Another application of GA in image processing is low pass filtering. The smoothing filter can be applied to smooth the color image components in the direction of any chosen color component C1.
Extending gray-scale low pass filtering techniques to color images provides limited color sensitivity, as only one color band is convolved with the filter [2]. If all the color bands are considered, traditional filtering smooths all the bands equally, which is undesirable. With GA, such an affine transformation with low pass filtering is achieved in one step [35].
The smoothing of red or cyan components is performed by a mask defined in terms of μ, the normalized color vector. The convolution, used to find the homogeneous regions within the image, results in a non-zero scalar part and a zero vector part for the homogeneous regions where a match is found. Everywhere else the masking operation results in non-zero scalar and vector parts. Convolution with this mask results in smoothing along the C1 direction: the red or cyan components, which are parallel to C1, are changed, while the other two perpendicular components remain untouched. Furthermore, the left convolution differs from the right convolution, so adding the left and right convolutions cancels the vector part and leaves only the scalar part.
Two sets of masks were chosen for the smoothing operation. The first was set to cyan, the color of the block at the lower left corner of the image (see Figure 11a). The normalized color filter is set for cyan, with RGB values (0,127,128). Convolution with this mask gives the smoothed result in Figure 11b (the smaller size of 128 × 128 is due to a formatting issue). The second mask, for smoothing the red component (255,0,0), results in the image shown in Figure 12.
As shown in Tables 9 and 10, the filtering operation can be performed in one GP and eight GA additions. Table 9 shows the timing and the clock cycles the hardware takes for the GA geometric product, and Table 10 shows the same for the GA addition operations. The results also show the variation in timing between the full multivector and color represented as three bivectors, with a single core and with two cores.
Both the color sensitive edge detection and the smoothing experiments demonstrate a substantial performance improvement when using the co-processor. Furthermore, employing a simple optimization technique leads to large benefits. For example, in color sensitive edge detection with an image size of 128 × 128, the full multivector expression takes 22.8 ms for two GP operations, whereas the single core ASIC requires 19.1 ms and two cores require 9.8 ms. Similarly, the four geometric additions require 36.2 ms, 20.1 ms and 17.1 ms, respectively. Since the additions are the worst case in each configuration, this leads to relative performance improvements of 44% (with a single core) and 52.7% (with two cores) over the full multivector. In the color smoothing experiment we observe a similar trend, with 44% and 52.8% performance improvements with a single core and two cores, respectively.
We have shown that with an ASIC implementation it is feasible to achieve data throughput rates that are fast enough for video applications, even with this relatively conservative process. For example, for an image size of 128 × 128 we can achieve 46.73 frames per second (fps), 58.48 fps and 29.24 fps for color difference edge detection, color sensitive edge detection and color sensitive smoothing, respectively. The ASIC was implemented as a "proof of concept" and was therefore not optimized for speed, but it does demonstrate the advantages of a dedicated co-processor for this type of application, without the compromises inherent in an equivalent FPGA implementation.
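The frame rates above are the reciprocals of the worst-case per-image times reported earlier for the 128 × 128 case. A minimal sketch of that conversion follows, with hypothetical helper names and timings taken from the text.

```python
# Sketch: converting worst-case per-image time to frames per second.
# Timings (ms) are the 128 x 128 two-core values quoted in the text.


def frames_per_second(frame_time_ms):
    """Frames per second for a given per-image processing time."""
    return 1000.0 / frame_time_ms


def frame_time_ms(fps):
    """Per-image processing time implied by a frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Color difference edge detection: worst case of GP mul and GP add.
    print(round(frames_per_second(max(19.6, 21.4)), 2))
    # Color sensitive edge detection: worst case of 2 GP and 4 GA additions.
    print(round(frames_per_second(max(9.8, 17.1)), 2))
    # Time per image implied by the quoted smoothing frame rate.
    print(round(frame_time_ms(29.24), 1))
```

The first two values reproduce the quoted 46.73 fps and 58.48 fps, and the quoted 29.24 fps for smoothing corresponds to roughly 34.2 ms per image.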

Conclusions
This paper presents an overview of the Geometric Algebra fundamentals and of convolution operations involving rotors for image processing applications. The discussion shows that the convolution operation with the rotor masks within GA belongs to a class of linear vector filters and can be applied to image or speech signals. Furthermore, it shows that this kind of edge detection is compact and is performed wholly using bivector color images.
The use of the ASIC GA Co-Processor for rotor operations was introduced, including a demonstration of its potential for other applications in vision and graphics. This hardware architecture is tailored for image processing applications and meets their performance requirements. The usefulness of the approach was demonstrated by analyzing and implementing three different edge detection algorithms on this hardware. The qualitative analysis of the edge detection algorithms shows the usefulness of GA based computations within image processing applications in general, and the potential of such hardware architectures in signal processing applications.
Details are presented of the custom Geometric Algebra Co-Processor, which directly targets GA operations and results in a significant performance improvement for color edge detection. The contribution of the proposed approach has been demonstrated by analyzing and implementing three different types of edge detection schemes on the GA Co-Processor and FPGA platforms, and the overall performance gains are reported. A detailed analysis was undertaken that describes not only the raw timings, but also the trade-offs that can be made in terms of resources, area and timing. In addition, the results were analyzed in the context of larger numbers of operations, where the potential performance benefits can be seen to scale to larger datasets. Future work will focus on a theoretical understanding of this linear filter and its frequency response, which would be useful in developing further algorithms.

Figure 1. RGB vectors and the color cube.

Figure 2. Rotor edge detection geometric algebra co-processor architecture (reproduced with permission from the authors of Ref. [23]).

Figure 5. Offline processing with the GA Co-Processor hardware.

Figure 7. (a) Original tulip image and (b) tulip image after rotor convolution.

Figure 8. (a) Original Lena image and (b) Lena image after rotor convolution.

Figure 9. Red to blue: convolution by left and right masks.

Figure 11. Output of the cyan block (a) before and (b) after the smoothing filter.

Table 2. GA performance figures and comparison with previously published results.

Table 3. Comparison of different software and the proposed hardware.

Table 4. Comparison of software and the hardware for vector and bivector calculations.

Table 5. Four geometric products (GP) for rotor convolution.

Table 6. Five geometric additions for rotor convolution.

Table 7. Two geometric products for rotor convolution.

Table 8. Four geometric additions for rotor convolution.

Table 9. One geometric product for rotor convolution.

Table 10. Eight geometric additions for rotor convolution.