Reconfigurable Morphological Processor for Grayscale Image Processing

Abstract: Grayscale morphology is a powerful tool in image, video, and visual applications. A reconfigurable processor is proposed for grayscale image morphological processing. The architecture of the processor is a combination of a reconfigurable grayscale processing module (RGPM) and peripheral circuits. The RGPM, which consists of four grayscale computing units, conducts grayscale morphological operations and implements related algorithms at more than 100 f/s for a 1024 × 1024 image. The peripheral circuits control the entire image processing and dynamic reconfiguration process. Synthesis results show that the proposed processor can provide 43.12 GOPS and achieve 8.87 GOPS/mm² at a 220-MHz system clock. The simulation and experimental results show that the processor is suitable for high-performance


Introduction
Mathematical morphology is extensively used in various areas, such as computer vision and machine learning [1], texture and image analysis [2,3], color image processing [4], remote sensing image analysis [5], image compression [6], and video segmentation [7]. Morphological computing has commonly been implemented using processors such as CPUs or DSPs. Some enhancements were implemented on GPUs to speed up the process [8][9][10]; however, even high-end GPUs have lower throughput than CPUs and FPGAs [11], and GPUs are more expensive and consume more power in embedded systems. Therefore, high-speed specialized image processing chips have received considerable attention due to their efficient performance and low power consumption. Partial-result-reuse (PRR) architecture, as proposed in [12], has been used for morphological operations and can be combined with systolic array architecture to enhance performance and maintain cost-efficiency. Pipeline architecture for morphological operations has been presented to reduce computing latency. A 2D stream processing architecture was implemented on Virtex-5 for sequential filters [13]. An algorithm for efficient computation of morphological operations and specific hardware were presented in [14]. A hardware architecture presented for gray-level image erosion and dilation utilized a systolic-like organization of a processing element array [15].
With technological advances, mathematical morphology continues to be introduced to new applications and the algorithms improve and expand constantly. Flexibility is important in image processing chips. Thus, the reconfigurable technique and processing element array architecture, which can solve the incompatibility between high performance and flexibility, are used in morphological image processing chips. A content-addressable memory (CAM) LSI for real-time pixel-parallel image processing was described in [16]. A programmable single-instruction multiple-data (SIMD) real-time vision chip was reported to achieve high-speed target tracking [17]. Recently, a vision chip with the architecture of a massively parallel cellular array of processing elements was reported for image processing using the asynchronous or synchronous processing technique [18]. In [7], a reconfigurable morphological image processing accelerator was proposed for video object segmentation, and watershed transform could be achieved in real time using 32 macro processing elements.
A high-performance, flexible reconfigurable processor that can conduct basic grayscale morphological operations and implement complicated algorithms is presented. The processor consists of a morphology processing module and peripheral circuits. The morphology processing module is a reconfigurable hardware accelerator for grayscale morphological operations, which can deal with structure elements (SEs) of arbitrary size and shape. The accelerator is a mixed-grained architecture, which has novel maximum and minimum computing circuits and a highly flexible and efficient structure. The operation can be reconfigured dynamically and the peripheral circuits can select and synchronize the input and output images. The processor can be designed easily and achieve an optimal hardware cost. The processor can process pixel-level images and extract image features, such as boundary and gradient. The processor is high speed, with a simple structure and an extensive application range.
This paper is organized as follows. A literature review is presented in Section 2. The maximum and minimum computing circuits, which are the key units of the morphological operations, are introduced in Section 3. The processor architecture is presented in Section 4. The processor implementation and algorithms involved in image processing by the processor are described in Section 5. In Section 6, the performance of the processor is evaluated and compared with that of existing processors. Finally, the discussion and conclusion are provided in Section 7.

Literature Review
In recent years, image and vision applications have been widely used in advanced manufacturing, medicine, national defense, public safety, and space technology; however, the traditional CPU and ASIC cannot provide both high performance and enough flexibility, which limits the development and application of these systems. To improve performance, GPU-based parallel algorithms are designed in [19,20]. The FPGA is also a good platform for high-performance systems [21,22]; however, the GPU and FPGA still do not resolve the conflict between flexibility and performance in image processing and machine vision. Reconfiguration, which can resolve this conflict, is introduced to realize chips with both high flexibility and high performance.
Several reconfigurable chips have realized general computing tasks through a PE array based on the SIMD structure. Thus, chips with multiple reconfigurable cores that are connected in a high-performance manner (such as a NoC) will be the trend. Some NPUs use reconfigurable computing units to improve flexibility [23][24][25][26]. For reconfigurable precision, fully reconfigurable compute-in-memory macros are presented in [27,28], and reconfigurable chips have been used for IoT [29] and biomedical AI processors [30].
In addition, in order to meet the requirements of general computing, it is necessary to study the general reconfigurable chip, and pay attention to the problems of software framework, auxiliary compilation, and task scheduling. The calculation model and algorithm research provide the basis for the architecture design of the reconfigurable chip.

Morphological Operations
The basic morphological operations are dilation and erosion. For a 2D grayscale image, A is the image and B is the SE. The flat dilation, which is denoted by ⊕, is computed using the following equation:

$$(A \oplus B)(x, y) = \max_{(\alpha, \beta) \in D_B} A(x - \alpha, y - \beta),$$

where α and β range over the domain D_B of the SE. Erosion, which is denoted by Θ, can be computed using the following equation:

$$(A \,\Theta\, B)(x, y) = \min_{(\alpha, \beta) \in D_B} A(x + \alpha, y + \beta).$$

The combinations of dilation and erosion can form other morphological operations. The opening and closing operations, denoted by ∘ and •, are expressed as follows:

$$A \circ B = (A \,\Theta\, B) \oplus B, \qquad A \bullet B = (A \oplus B) \,\Theta\, B.$$

The computation of a maximum and minimum value is the key operation for grayscale morphological operations. The design of the maximum and minimum computing circuits directly affects the performance of the morphological processor.
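As a software reference for the four basic operations, the following minimal sketch computes flat dilation as a maximum and flat erosion as a minimum over the SE support, and composes them into opening and closing. The function names, the representation of the SE as a list of (α, β) offsets, and the choice to ignore samples that fall outside the image are assumptions made for illustration; they do not describe the hardware's border handling.

```python
def _flat_op(image, se, pick, sign):
    """Apply max (dilation) or min (erosion) over the SE support."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Samples outside the image are ignored (a border assumption).
            samples = [
                image[y + sign * b][x + sign * a]
                for a, b in se
                if 0 <= y + sign * b < h and 0 <= x + sign * a < w
            ]
            out[y][x] = pick(samples)
    return out


def dilate(image, se):
    # Maximum of A(x - a, y - b) over all offsets (a, b) in the SE.
    return _flat_op(image, se, max, -1)


def erode(image, se):
    # Minimum of A(x + a, y + b) over all offsets (a, b) in the SE.
    return _flat_op(image, se, min, +1)


def opening(image, se):
    # Erosion followed by dilation.
    return dilate(erode(image, se), se)


def closing(image, se):
    # Dilation followed by erosion.
    return erode(dilate(image, se), se)


def square_se(n):
    """Offsets of a flat n x n SE centred on the origin (n odd)."""
    r = n // 2
    return [(a, b) for b in range(-r, r + 1) for a in range(-r, r + 1)]


# A single bright pixel on a dark background, processed with a 3 x 3 SE.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 9
se = square_se(3)
dilated = dilate(image, se)   # the bright pixel grows into a 3 x 3 block
opened = opening(image, se)   # the isolated bright pixel is removed
closed = closing(image, se)   # the image is unchanged
```

The SE must contain the origin so that every pixel has at least one valid sample.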

Algorithms of the Maximum and Minimum Computing
In this study, we implemented maximum and minimum computing by logic operations instead of comparators and delay elements. This new method reduces hardware resources. The maximum and minimum computing methods were proposed in [31], where the algorithms were used for the systolic array architecture. We modified the maximum and minimum computing for grayscale morphological operations, and the methods are described in Algorithms 1 and 2, respectively. In Algorithm 1, the maximum of m n-bit numbers is computed. All bits of an m-bit parameter, which is denoted by flag, are set to "1". The n m-bit parameters, which are denoted by Temp_(n−1), Temp_(n−2), ..., Temp_0, collect the i-th bit of all numbers. For example, Temp_0 is the combination of the least significant bits (bit 0) of all m numbers. Line 1 initializes the number of loops. Lines 3 to 5 calculate the maximum of the inputs. The minimum computing method is described in Algorithm 2, in which the flag is set to "0". An example is shown in Figure 1. Five 8-bit numbers, namely, 25, 237, 59, 179, and 48, are the inputs. Flags are set to "11111" and "00000" for maximum and minimum computing, respectively. Each loop is based on Algorithms 1 and 2. Finally, the maximum is 237 and the minimum is 25. The new algorithms proposed in this study are resource-light and easily extensible.
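The bit-serial idea described above can be sketched in software as follows. This is a common formulation of comparator-free maximum and minimum search that scans from the most significant bit to the least significant bit using only AND/OR operations on bit-sliced words; it is offered as an illustration under stated assumptions and is not claimed to reproduce Algorithms 1 and 2 line by line. The function names are hypothetical, and the interpretation of the flag (candidate mask for the maximum, elimination mask for the minimum) is an assumption chosen to match the stated initial values "11111" and "00000".

```python
def _bit_slice(values, i):
    """Pack bit i of every input into one m-bit word (Temp_i)."""
    word = 0
    for j, v in enumerate(values):
        word |= ((v >> i) & 1) << j
    return word


def bitserial_max(values, n_bits=8):
    flag = (1 << len(values)) - 1      # all ones: every input is a candidate
    result = 0
    for i in range(n_bits - 1, -1, -1):
        t = flag & _bit_slice(values, i)
        if t:                          # some candidate has a 1 at bit i
            flag = t                   # keep only those candidates
            result |= 1 << i
    return result


def bitserial_min(values, n_bits=8):
    all_ones = (1 << len(values)) - 1
    flag = 0                           # all zeros: no input eliminated yet
    result = 0
    for i in range(n_bits - 1, -1, -1):
        t = flag | _bit_slice(values, i)
        if t == all_ones:              # every remaining input has a 1 here
            result |= 1 << i
        else:                          # some remaining input has a 0 here
            flag = t                   # eliminate inputs with a 1 at bit i
    return result


inputs = [25, 237, 59, 179, 48]
maximum = bitserial_max(inputs)
minimum = bitserial_min(inputs)
```

With the five inputs from the example in the text, the sketch returns 237 as the maximum and 25 as the minimum.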

Maximum and Minimum Computing Circuits
The optimized circuit, which implements Algorithm 1 for the maximum computing of eight 8-bit numbers, is shown in Figure 2. The gate count of a computing circuit for n m-bit numbers is computed as: Gate count = 4 × n × (m − 1) − 1. Two gates are used to implement the equation in [31]: t(i) = t(i) · max j + y(i) · max j . As a result, the maximum computing circuit for eight 8-bit numbers uses 112 fewer gates than that in [31]. Four maximum computing circuits are synthesized with the Synopsys Design Compiler. The synthesis results of the circuits with various numbers of inputs are shown in Table 1. Due to the increase in the critical path and fan-in, the frequency of the circuit decreases when the number of inputs increases. The minimum computing circuit is implemented in our previous work [32].
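The gate count formula can be evaluated directly. The short sketch below only restates the formula given in the text; the function name is hypothetical, and no values from Table 1 are reproduced.

```python
def max_circuit_gate_count(n, m):
    """Gate count of the maximum circuit for n numbers of m bits each,
    using the formula stated in the text: 4 * n * (m - 1) - 1."""
    return 4 * n * (m - 1) - 1


# Eight 8-bit inputs, as in the circuit of Figure 2.
gates_8x8 = max_circuit_gate_count(8, 8)
```

For eight 8-bit numbers the formula evaluates to 4 × 8 × 7 − 1 = 223 gates.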

Architecture
The grayscale image processor is designed for various applications in computer vision, image analysis, medical image processing, and video segmentation systems. Such systems, especially embedded ones, should have high flexibility and performance for extensive applications, and the design focus should be on flexibility and speed. Therefore, a reconfigurable grayscale image processor with high speed and a simple structure is designed for extensive use with fewer hardware resources. The architecture is illustrated in Figure 3. The core of the processor is the reconfigurable grayscale processing module (RGPM). The processor also has two bus interfaces, an input control logic unit, an output control logic unit, a process control unit, and a configuration register group. The RGPM performs the image processing. The two bus interfaces connect the processor to the system bus when a SoC is constructed. The input control logic unit controls the inputs from video images, SDRAM, and registers to the RGPM. The output control logic unit writes the selected parallel image data from the RGPM into SDRAM through bus interface 1. The process control unit reads the configuration information in the configuration registers and controls the operation process of the RGPM. It also controls the input and output control logic units and bus interfaces in data access. After the processed image data are written to SDRAM, the process control unit transmits interrupt requests to complete the interaction of the processor with external systems.
The configuration register group is an extremely important part in the proposed processor. It contains control parameters, reconfiguration information, operation parameters and interaction information. Most of the registers in the configuration register group are written by an external CPU via the system bus, and the rest are written by the internal modules in the proposed processor.

RGPM
The diagram of the RGPM is shown in Figure 4. The RGPM consists of several grayscale computing units (GCUs) that conduct grayscale morphological operations at a high speed. The image processing algorithms are implemented using the operations in the individual GCUs and the connection pattern of these units. The units can be implemented in a pipelined or parallel manner. The converters select the output from all the GCU outputs based on the parameters and convert the 8-bit data series into a 32-bit data series. Considering the hardware resource, flexibility, and usability of the processor, we set the number of GCUs to 4. The unit contains two converters. The inputs of GCUs 1 to 4 are the outputs of the input control logic.

GCU
The architecture of the GCU is shown in Figure 5. The core of the GCU has two 5 × 5 maximum and minimum computing units (MMCUs). The GCU also has one grayscale operation element and one input control logic. The grayscale operation element can conduct 8-bit integer operations, such as two-input maximum, two-input minimum, shifting, complement, addition/subtraction, and straight-through output. This element can be used to support operations between images.

Input Control Logic
Image signals should be synchronized by the input control logic before being used as input for the MMCUs because one-to-one matching is needed between the pixels in different images. The input control logic selects and synchronizes the inputs from video images and SDRAM to the synchronization circuit. The block diagram of the input control logic is shown in Figure 6. The unit contains 4 data converters, 1 synchronization circuit, 8 line memories and 50 registers. Data converters 1 and 2 convert 8-bit image signals into 32-bit parallel data, which have the same format as the data from SDRAM. Converters 3 and 4 convert the 32-bit data into 8-bit image signals, which are then synchronized by the synchronization circuit. The line memories are needed to buffer the processing image signals. When the size of the SE is n × n, the number of the line buffers is n-1. The depth of each line buffer is the number of the image line pixels, and the width is the pixel intensity (8-bit usually). The n-1 line buffers and the current line video signals create an n-line parallel image input. The n × n pixel registers form an n × n image window. The outputs of line memories are selected by the multiplexers (MUXs) as inputs for the registers, and the image data in the registers are selected as inputs for the MMCU1. The inputs and outputs of the registers are transmitted via MUXs, which renders the architecture unit more flexible. The inputs transmitted to the MMCUs via MUXs can be reconfigured to two 5 × 5 arrays or one 7 × 7 array. The outputs of the registers transmitted via MUXs can be reconfigured to a 3 × 3, 5 × 5, or 7 × 7 array.
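The line-buffer scheme described above can be modelled in software as follows. This is a behavioural sketch only: it streams pixels in raster order, keeps n − 1 line buffers of one image line each, and shifts an n × n register window by one column per pixel. The function name, the list-based buffers, and the decision to emit only fully valid windows are assumptions for illustration and do not describe the MUX-based reconfiguration of the actual circuit.

```python
def stream_windows(pixels, width, n):
    """Yield (row, col, window) for every fully valid n x n window.

    pixels: iterable of pixel values in raster-scan order.
    width:  number of pixels per image line (depth of each line buffer).
    n:      side length of the SE / image window.
    """
    # n - 1 line buffers; index 0 holds the oldest buffered line.
    line_buffers = [[0] * width for _ in range(n - 1)]
    # n x n pixel registers; each row is shifted left by one per pixel.
    window = [[0] * n for _ in range(n)]
    for idx, p in enumerate(pixels):
        row, col = divmod(idx, width)
        # n-line parallel input: buffered lines plus the current line.
        column = [buf[col] for buf in line_buffers] + [p]
        # Advance the line buffers at this column position.
        for k in range(n - 2):
            line_buffers[k][col] = line_buffers[k + 1][col]
        if n > 1:
            line_buffers[n - 2][col] = p
        # Shift the new column into the window registers.
        for r in range(n):
            window[r] = window[r][1:] + [column[r]]
        if row >= n - 1 and col >= n - 1:
            yield row, col, [r[:] for r in window]


# 4 x 4 test image whose pixel values equal their raster index.
windows = list(stream_windows(range(16), width=4, n=3))
first_row, first_col, first_window = windows[0]
# Flat 3 x 3 dilation output for each valid window position.
dilated = [max(max(r) for r in w) for _, _, w in windows]
```

Each emitted window covers rows row − n + 1 to row and columns col − n + 1 to col, so a maximum or minimum over the window gives the flat dilation or erosion result for that position.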

MMCU
The algorithms and circuits of maximum and minimum computing have been presented. The circuit is easy to design and extend. The remaining design problem is the number of inputs of the MMCU. More inputs consume fewer hardware resources, albeit at a lower speed. For example, different circuits, which are built from computing circuits with different numbers of inputs, are implemented for 5 × 5 SE computing. The implementation results are presented in Table 2. The design of the MMCU focuses on the trade-off between resource and frequency. In this study, we selected the five-input MMCU as the basic computing element. To enhance flexibility, the sixth MMCU (five-input MMCU2) is different from the others, as shown in Figure 7. The operations implemented on the 5 × 5 MMCU are predetermined by configurable registers, including operation parameters, image resolution parameters, mask sizes, input and output selection parameters, and auxiliary parameters.
Some grayscale image operations, including flat dilation, erosion, opening, and closing, are given to demonstrate the flexibility of the GCU. The actual use of the GCU is not confined to the examples. The implementation of the operations is shown in Figure 9. The dynamic reconfiguration approach, which allows the reconfiguration to execute at run-time, is employed in the proposed processor. Therefore, the processor can transform the structure using different configurations as they are needed. Thus, the reconfigurable hardware will fit in more applications.

Expansibility
The architecture of the RGPM has good expansibility. Larger images can be supported by increasing the depth of the line memories. In this study, the line memories for each GCU are 8-line memories with a length of 1024. For example, if the maximum horizontal image size is 1920, then the line memory can be 8 × 1920 × 8 bits. For larger SEs, the number of line memories can be increased. For example, if the SE size is 31 × 31, then the line memory can be 30 × 1024 × 8 bits. For low hardware consumption, the number of GCUs can be decreased. For higher performance, the number of GCUs can be increased.
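The sizing rules above reduce to simple arithmetic, sketched below. The function names are hypothetical; the rule that an n × n SE needs n − 1 line buffers is taken from the description of the input control logic, and 8 bits per pixel is assumed as in the text.

```python
def line_buffers_for_se(se_size):
    """Number of line buffers needed for an se_size x se_size SE."""
    return se_size - 1


def line_memory_bits(num_lines, line_length, bits_per_pixel=8):
    """Total line memory size in bits: lines x pixels per line x bits."""
    return num_lines * line_length * bits_per_pixel


baseline_bits = line_memory_bits(8, 1024)    # 8 lines of 1024 pixels
wide_image_bits = line_memory_bits(8, 1920)  # 1920-pixel-wide image
large_se_bits = line_memory_bits(line_buffers_for_se(31), 1024)
```

Under these assumptions the baseline memory is 65,536 bits per GCU, the 1920-pixel case needs 122,880 bits, and the 31 × 31 SE case needs 245,760 bits.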

Synthesis Results
The proposed grayscale image processor was synthesized with the Synopsys Design Compiler and the SMIC 0.18 µm cell library. The maximum size of the image to be processed was 1024 × 1024. The maximum size of the mask was 13 × 13. The number of GCUs was 4. The reconfiguration parameter was 24 × 32 bits. The synthesis results are reported in Table 3.

Grayscale Image Processing Applications
An embedded grayscale image processing system with the proposed processor is presented in Figure 10. The Gaisler Research Leon 2 was selected as the system CPU, and the AHB and APB were selected as the system buses. The CPU was used as the controller. Register group 2 and the interrupt controller were used to control the system. SDRAM1 was used as the main memory, and SDRAM2 was used to store images. Register group 1 contains control, reconfiguration, and operation parameters (such as morphology operation, image resolution, mask sizes, input and output selection parameters, and auxiliary parameters), and interaction information. It can be written by the CPU via the bus. The control logic reads the parameters in Register group 1 and controls the operation process of the RGPM. It also controls the input and output control logic units and bus interfaces during data access. When the processor was reconfigured, the CPU changed the control registers in Register group 2. All the parameters that needed to be changed for the reconfiguration were transferred from SDRAM1 to Register group 1 through the bus interface, and the control registers were then changed according to the parameters in Register group 1. After the registers were modified, a signal was transmitted to Register group 2. The reconfiguration was then complete, and the image processing task was executed. After the processed image data were written to SDRAM, the control logic transmitted interrupt requests to complete the interaction. Some grayscale morphological operations and algorithms are provided to show the high flexibility and performance of the processor.

Basic Mathematical Morphological Operations
The basic operations are dilation, erosion, opening, and closing. The implementation and processing results of the four operations on the processor are shown. Figures 11 and 12 illustrate eight pipelined 5 × 5 dilation and erosion operations. Figure 13 presents opening and closing operations.

Applications
Some example tasks are also implemented on the proposed processor. These include the hit-and-miss operation, denoted by ⊗, and the thinning and thickening operations, which are built on the hit-and-miss result. The gradient, denoted by ρ_B, is expressed as follows:

$$\rho_B(A) = (A \oplus B) - (A \,\Theta\, B).$$

The white top-hat transform is expressed as follows:

$$\mathrm{WTH}_B(A) = A - (A \circ B),$$

where A ∘ B denotes the opening of A by B.
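Two of these tasks, the gradient and the white top-hat transform, follow directly from dilation, erosion, and opening, and can be sketched in software as below. The sketch uses the standard definitions (dilation minus erosion, and image minus opening) with a flat k × k square SE; the function names and the clipping of the window at the image border are assumptions for illustration.

```python
def _window_op(image, k, pick):
    """Max or min over a k x k window clipped at the image border."""
    h, w = len(image), len(image[0])
    r = k // 2
    return [
        [
            pick(
                image[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def gradient(image, k):
    """Morphological gradient: dilation minus erosion."""
    d = _window_op(image, k, max)
    e = _window_op(image, k, min)
    return [[dv - ev for dv, ev in zip(dr, er)] for dr, er in zip(d, e)]


def white_top_hat(image, k):
    """White top-hat: image minus its opening."""
    opened = _window_op(_window_op(image, k, min), k, max)
    return [[a - o for a, o in zip(ar, orow)]
            for ar, orow in zip(image, opened)]


# A small bright spot (50) on a flat background (10).
image = [[10] * 5 for _ in range(5)]
image[2][2] = 50
grad = gradient(image, 3)
top_hat = white_top_hat(image, 3)
```

In this example the gradient responds on the 3 × 3 neighbourhood of the spot, while the white top-hat isolates the spot itself because the opening removes structures smaller than the SE.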

Performance and Comparison
The grayscale image processing system, as shown in Figure 10, was applied to verify the proposed processor. The operations and tasks, as discussed in Section 4, were applied to the system. The system performance is presented in Table 4. The table shows that the frame rate of operations is more than 200 f/s, and the performance can be multiplied by parallel processing. At a 220 MHz system clock, each GCU in the processor can provide 10.78 GOPS. The entire RGPM containing four GCUs can provide 43.12 GOPS and achieve 8.87 GOPS/mm² area efficiency (the cell area of the RGPM is 4.86 mm²).
Table 4, note 1: The result of the hit-and-miss operation is written to the SDRAM and is then read by the input control logic for the thinning or thickening operation. The SDRAM access latency and the reconfiguration time are significantly shorter than the image processing time and can therefore be omitted.
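The headline figures in this section can be cross-checked with a few lines of arithmetic. The constants below are the values quoted in the text; the variable names are hypothetical, and the frame-rate estimate additionally assumes that one pixel is consumed per clock cycle, which is an assumption of this sketch and is not stated in the text.

```python
CLOCK_HZ = 220e6           # system clock
NUM_GCUS = 4               # GCUs in the RGPM
RGPM_GOPS = 43.12          # throughput of the entire RGPM
RGPM_AREA_MM2 = 4.86       # cell area of the RGPM
IMAGE_PIXELS = 1024 * 1024

# Throughput per GCU and area efficiency of the RGPM.
gops_per_gcu = RGPM_GOPS / NUM_GCUS
area_efficiency = RGPM_GOPS / RGPM_AREA_MM2

# Operations per clock cycle implied for one GCU.
ops_per_cycle_per_gcu = gops_per_gcu * 1e9 / CLOCK_HZ

# Frame rate if one pixel is processed per clock cycle (assumption).
frame_rate = CLOCK_HZ / IMAGE_PIXELS
```

These values reproduce 10.78 GOPS per GCU and 8.87 GOPS/mm², imply 49 operations per cycle per GCU, and give roughly 209.8 f/s for a 1024 × 1024 image under the one-pixel-per-clock assumption, which is consistent with the reported frame rate of more than 200 f/s.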
The processor is compared with the processors introduced in [12][13][14][15]. The power consumption of the processor in [12] is normalized to 0.18 µm. The technology scaling of power is expressed in terms of L_1 and L_2, the characteristic lengths of the two different processes. The area is normalized into the number of gates to compare the implementations on FPGAs and chips, where Count_gate = 12 × Count_ALUT. The summary of hardware features and the comparison with related studies [12][13][14][15] are shown in Table 5. The morphological filter in [13] shows the highest performance, and the performance can be multiplied by pipelining the filters; however, its hardware resource usage is too large. The processors in our study and [12] exhibit higher performance per K-gates than the processors used in [13][14][15]. Our processor yields a higher area/pixel ratio than the processors used in [12,15], and yields a higher power/pixel ratio than the processor used in [12]. However, the chip in [12] can only conduct dilation with a 5 × 5 disk SE and the chip in [15] can only conduct 7 × 7 dilation or erosion. The hardware in [12,15] is inflexible for embedded applications; therefore, compared with other processors, our processor, which features a small area, low power consumption, and high flexibility, is more suitable for embedded systems.
The architecture comparison shows that the ASIC and pipeline architectures have a small area and low power consumption. However, inflexibility and low overall performance are inevitable defects for embedded image processing systems. The reconfigurable MIMD array has high performance and flexibility, which makes it suitable for use in embedded systems for processing large images. The basic operations supported by the processors are shown in Table 6. The hardware in [13] supports dilation, erosion, opening, and closing, but not hit-and-miss or operations between images. This feature limits usage but is efficient for specific applications. As shown in Table 6, our processor supports a wider range of operations and is therefore more suitable for embedded image processing systems.