HLS Based Approach to Develop an Implementable HDR Algorithm

Abstract: The hardware suitability of an algorithm can only be verified when the algorithm is actually implemented in hardware. By hardware, we mean a system on chip (SoC) in which both a processor and a field-programmable gate array (FPGA) are available. Our goal is to develop a simple algorithm that can be implemented on such hardware, where high-level synthesis (HLS) reduces the tedious work of manual hardware description language (HDL) optimization. We propose an algorithm that produces a high dynamic range (HDR) image from a single low dynamic range (LDR) image using a highlight removal technique. To keep the algorithm simple and parameter free, and therefore easy to implement in hardware, we rely on statistical information of the image. The software implementation is compared against the state of the art, and the HLS approach confirms that the proposed algorithm is implementable in hardware. The performance of the algorithm is measured using four no-reference metrics. According to the structural similarity (SSIM) index and the peak signal-to-noise ratio (PSNR), the hardware-simulated output matches the software-simulated output with an SSIM of at least 98.87 percent and a PSNR of at least 39.90 dB. Our approach is novel and effective for developing a hardware-implementable HDR algorithm from a single LDR image using an HLS tool.


Introduction
Vision is one of the most important senses. What we observe is analyzed by the brain to help us make decisions. In modern technology, the camera is the instrument analogous to the eye, although the capability of the eye is far better than that of a camera. A camera can capture a moment as an image or video and store it. However, a camera cannot make a decision by itself; for that purpose, it needs a system similar to the human brain to analyze the data. A field-programmable gate array (FPGA) can be used in a camera as a real-time system to analyze the captured image. An FPGA design language describes how the resources of the FPGA are used, whereas in software programming we usually do not need to think about resource consumption. This limits the computational complexity of algorithms that can be implemented on an FPGA. Hence, research has been conducted on different types of computational acceleration techniques [1,2]. FPGA-based implementations in the field of image processing include image classification [1], real-time anomaly detection in hyperspectral images [3], synthetic aperture imaging [4], feature detection for image analysis [5], bilateral filter design for real-time image denoising [6], and a real-time panoramic video system with high-speed image distortion correction [7].
HLS has become popular in recent years because of its design performance, low complexity, and reduced product development time [8]. HLS connects hardware description languages (HDL), e.g., VHSIC hardware description language (VHDL) and Verilog, to high-level languages (HLL), e.g., C/C++. In simple terms, it converts HLL to HDL with optimization techniques. In [8], Nane et al. conducted a survey of HLS tools and compared their optimization techniques, e.g., operation chaining, bit-width analysis and optimization, memory space allocation, loop optimization, hardware resource library, speculation and code motion, exploiting spatial parallelism, and if-conversion. Because HLS relates HLL to HDL, we consider it feasible to develop an implementable algorithm using HLS. Therefore, we convert and optimize our proposed algorithm for FPGA implementation using HLS.
The dynamic range of an image depends on the exposure quality and visual quality of the scene. However, highlights hide information about the surfaces in an image, which adds difficulty to any image processing algorithm. Highlights correspond to regions in an image where the light intensity is so high that we cannot see the object behind it. Active light sources, e.g., the sun, light emitting diodes (LEDs), and tube lights, are not included in this definition. According to Lee [9], highlight regions are a combination of diffuse reflection and specular reflection in which specular reflection dominates. Based on this point, Shafer introduced the dichromatic reflection model [10]. The total reflection, $R$, from an inhomogeneous object is the sum of two independent parts: light reflected from the surface, $R_s$, and light reflected from the body, $R_b$, as shown in Equation (1). Inhomogeneous objects include varnishes, paper, plastics, and ceramics, while homogeneous objects are polished objects, e.g., metals and diamonds.
$$R = R_s + R_b \tag{1}$$
Based on the dichromatic model, Ren et al. defined the illumination chromaticity [11]. They estimated the illumination chromaticity from a novel color line constraint to remove highlights from an image [11]. In Figure 1, we mark three highlight areas where the dynamic range of the object is considerably low due to severe illumination. Our primary objective is to remove the highlights and recover the missing information, which results in a local improvement of the dynamic range and enhanced global visibility of the image.
Electronics 2018, 7, x FOR PEER REVIEW
In our previous papers [12,13], we proposed a new model for generating an HDR image from a single LDR image by a highlight removal technique and described an HLS-based implementation scenario, respectively. However, the performance of our previous technique depended completely on four parameters. This parameter dependency limited the technique with respect to hardware implementation. To remove this limitation, in this paper we describe a new technique based on the previous concept. Our new method is parameter free, which makes it more robust. Although our final target is a complete SoC implementation, we describe a significant portion of the implementation here using an HLS tool. On this basis, we can claim that our method is implementable in hardware, which increases the acceptability of our algorithm.
Our proposed HDR algorithm removes the highlights from the image and recovers image information, e.g., color and texture. The main assumptions of the algorithm are: (a) the pixels are not fully saturated; and (b) the surface is inhomogeneous. First, we detect the highlight area (HA) using statistical information of the image. We then modify the HA based on information from the non-highlight area (NHA). Finally, we improve the global brightness, since the highlight-free (HF) image is a dark image. For hardware implementation, our target is a system on chip (SoC) based implementation. Here, we elaborate on the programmable logic (PL) side development. Finally, we evaluate our method from various points of view using no-reference [14-17] and full-reference metrics [18,19].
The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3 describes the proposed algorithm; Section 4 describes hardware development; Section 5 presents the results of our software and hardware evaluation; and Section 6 concludes this study.

Related Works
Tan et al. [20] proposed intensity logarithmic differentiation to remove highlights iteratively by comparing the input image with its specular-free image. Yoon et al. [21] explained that the diffuse reflection component of a non-saturated input image under uniform illumination can be extracted by comparing the local ratios of the input image and the specular-free two-band image. Shen et al. [22] proposed an algorithm that separates reflections based on the error analysis of chromaticity. Shen et al. [23] described another method that adds an offset to obtain a modified specular free (MSF) image, whose chromaticity is close to the diffuse chromaticity. Yang et al. [24] removed highlights with a bilateral filter by propagating maximum diffuse chromaticity values from diffuse pixels to specular pixels.
Researchers have introduced several algorithms to produce an HDR image from a single low dynamic range (LDR) image [25-28]. Reinhard et al. [25] described a dodging-and-burning based tone-mapping method for high and low contrast regions of an LDR image. Dodging and burning is a printing approach that withholds or adds light to a portion of an image [25]. Rempel et al. [26] boosted the dynamic range of images for viewing on HDR displays by using a reverse tone-mapping algorithm. Banterle et al. [27] introduced a framework of inverse tone-mapping operators that boosts an LDR image to an HDR image by linear interpolation of the original LDR image. Huo et al. [28] showed that linear expansion of an HF LDR image can expand the dynamic range of the LDR image. They developed a highlight removal technique with the help of principal component analysis (PCA) and polynomial transformation. However, all of these methods target software analysis only, which does not guarantee real-time FPGA implementation. Our aim is to develop an algorithm and also to ensure that it is implementable in hardware.
Vonikakis et al. [29] presented an image enhancement-based HDR imaging technique. They stretched the luminance value of every frame by building a pipelined structure and implemented their algorithm on Altera's Stratix II. Stretching the luminance value adds brightness to the resulting image, especially in underexposed regions, although it can also affect overexposed regions. Multi-frame-based implementations have usually been adopted for HDR imaging [30,31]. Some researchers have also implemented algorithms with dual-camera settings to increase the frame rate [32]. Compared with these implemented HDR algorithms, the novelty of our work is that HDR images are generated from a single image. In addition, we describe the implementation scenario for a single image, while others focus on multi-frame implementation.
The applicability and reliability of HLS are discussed in [33-37]. Tambara et al. [33] analyzed the utilization and performance of HLS-based optimization techniques, e.g., pipelining, loop unrolling, array partitioning, and function inlining. These techniques were used in three different combinations on matrix multiplication (MM), advanced encryption standard (AES), and adaptive differential pulse code modulation (ADPCM). Choi et al. [34] measured the performance of different applications, e.g., Qsort, Log reg, Mat mul, and ConvNN, using HLS-based coding. Li et al. [35] focused on an algorithmic multi-loop optimization technique for applications such as image segmentation, denoising, edge minimization, and matrix multiplication. A data acquisition system based on HLS using finite impulse response (FIR) filtering was built by CERN researchers [36]. Daud et al. [37] used an HLS-based approach to develop an intellectual property (IP) core for glucose-insulin analysis. An IP is a package of HDL code that can be used directly in a system-level register-transfer level (RTL) design. Thus far, HLS-related research has mainly focused on the performance estimation of pre-built algorithms [33-35]. In [36,37], the authors presented application-based works, but none is related to image processing. The most novel aspect of our work is the development of a single-image HDR algorithm by HLS-based implementation in hardware.

Proposed Method
Our aim is to keep the algorithm simple while still producing competitive results. This simplicity allows an efficient hardware implementation. The algorithm is described by the block diagram in Figure 2.


Highlight Detection and Modification
$P_i(x, y)$ is the input image, where $(x, y)$ indicates the pixel location. Following previous research [20,22-24,28], the minimum channel value $P_{\min}(x, y)$ is used for highlight detection, under the assumption that the HA is not fully saturated. $P_{\min}(x, y)$ can be expressed as follows:
$$P_{\min}(x, y) = \min\big(P_{i,R}(x, y),\ P_{i,G}(x, y),\ P_{i,B}(x, y)\big) \tag{2}$$
The highlight is detected by a simple comparison, where $\bar{P}_{\min}$ is the average of $P_{\min}(x, y)$ over the image:
$$\text{pixel at } (x, y) = \begin{cases} \text{highlight}, & P_{\min}(x, y) > 2 \times \bar{P}_{\min} \\ \text{non-highlight}, & \text{otherwise} \end{cases} \tag{3}$$
We set this condition experimentally; for all test images, it detects the HA properly. We detect the HA so that we work only with that region, and we assume that the other parts of the image contain properly diffuse pixels. In the HA, we modify the highlight pixels by Equations (4) and (5). We call the resulting image the MSF image.

$$P_{MSF,i,C}(x, y) = P_{i,C}(x, y) - P_{\min}(x, y) + C_{MSF}, \quad (x, y) \text{ and } C \in \text{pixels belonging to HA} \tag{4}$$
$$C_{MSF} = \bar{P}_{\min,NHA} + P_{\min SD,NHA} \tag{5}$$
First, we subtract $P_{\min}(x, y)$ from each HA pixel, $P_{i,C}(x, y)$. The result is called the specular free (SF) image [22,23,28]. The SF image is usually dark and its texture is not rich enough. Therefore, we add an offset $C_{MSF}$ to obtain $P_{MSF,i,C}(x, y)$. Since we subtract $P_{\min}(x, y)$ from the HA, where $P_{\min}(x, y)$ is usually high, it is logical to add back a reasonable portion of $P_{\min}(x, y)$. To extract the appropriate diffuse information of the HA, we consider the $P_{\min}$ of the NHA, because the NHA is the diffuse area. We do not yet know the appropriate diffuse intensity of the HA, but we know the area where we can find it. Thus, we take the average of $P_{\min}$ over the NHA, $\bar{P}_{\min,NHA}$, to extract the diffuse information of the HA. However, the NHA comprises both object and background areas. Because of the highlight, the background becomes darker when captured by an LDR image sensor, which is a general characteristic of LDR cameras. Consequently, the NHA is a darker area, and $\bar{P}_{\min,NHA}$ is lower than the appropriate diffuse value of the HA. As $\bar{P}_{\min,NHA}$ is the average over the NHA, it carries the diffuse information of both the object and the background surfaces of the image. From the results in Figure 3d, the pixel distribution in the NHA is close to a Gaussian distribution. Adding the standard deviation $P_{\min SD,NHA}$ to $\bar{P}_{\min,NHA}$ moves toward the diffuse pixels of the HA, because $\bar{P}_{\min,NHA} - P_{\min SD,NHA}$ points toward the diffuse pixels of the background of the NHA, whereas $\bar{P}_{\min,NHA} + P_{\min SD,NHA}$ points toward the diffuse pixels of the object in the NHA, which is almost the same surface as the HA. This sum approximates the diffuse intensity of the HA and is the offset $C_{MSF}$ that is added to the SF image to produce a better MSF image. We use one standard deviation (SD) of $P_{\min}(x, y)$ in the NHA because two SDs may damage the diffuse pixels by moving toward the HA.
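The detection and modification steps described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Rgb` struct, the function names, the use of the population standard deviation, and the saturation of results to the 8-bit range are all choices made here for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One RGB pixel with 8-bit channels (hypothetical helper type).
struct Rgb { std::uint8_t r, g, b; };

// Eq. (2): minimum channel value of one pixel.
inline int minChannel(const Rgb& p) {
    return std::min({static_cast<int>(p.r), static_cast<int>(p.g), static_cast<int>(p.b)});
}

// Saturate an intermediate result to 0..255 (an assumption of this sketch).
inline std::uint8_t clamp8(double v) {
    return static_cast<std::uint8_t>(std::lround(std::min(255.0, std::max(0.0, v))));
}

// Returns the MSF image; isHa receives the highlight mask of Eq. (3).
std::vector<Rgb> modifiedSpecularFree(const std::vector<Rgb>& img, std::vector<bool>& isHa) {
    const std::size_t n = img.size();
    isHa.assign(n, false);
    if (n == 0) return {};

    // Image-wide average of P_min, used as the threshold base in Eq. (3).
    double sum = 0.0;
    for (const Rgb& p : img) sum += minChannel(p);
    const double minAvg = sum / static_cast<double>(n);

    // Eq. (3) mask and mean of P_min over the NHA.
    double nhaSum = 0.0;
    std::size_t nhaCount = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int m = minChannel(img[i]);
        isHa[i] = m > 2.0 * minAvg;
        if (!isHa[i]) { nhaSum += m; ++nhaCount; }
    }
    const double nhaMean = nhaCount ? nhaSum / static_cast<double>(nhaCount) : 0.0;

    // Population SD of P_min over the NHA.
    double sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (isHa[i]) continue;
        const double d = minChannel(img[i]) - nhaMean;
        sq += d * d;
    }
    const double nhaSd = nhaCount ? std::sqrt(sq / static_cast<double>(nhaCount)) : 0.0;
    const double cMsf = nhaMean + nhaSd;  // Eq. (5)

    // Eq. (4): only HA pixels are modified.
    std::vector<Rgb> out = img;
    for (std::size_t i = 0; i < n; ++i) {
        if (!isHa[i]) continue;
        const int m = minChannel(img[i]);
        out[i] = Rgb{clamp8(img[i].r - m + cMsf), clamp8(img[i].g - m + cMsf),
                     clamp8(img[i].b - m + cMsf)};
    }
    return out;
}
```

For example, with three pixels (10, 20, 30) and one pixel (200, 220, 240), only the last pixel exceeds twice the average minimum channel value; the NHA statistics give an offset of 10, so that pixel becomes (10, 30, 50).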

Removing Brightness Mismatch
The mismatch is clearly noticeable in Figure 3b: visually, the luminance of the two portions does not match. To remove the brightness mismatch between the HA and the NHA, we argue that a brightness offset (BO) is missing in the HA of the MSF-replaced image, $P_{MSF,i,C}(x, y)$, shown in Equation (4). We also argue that the value of the BO should be very small, because the information in the HA of the MSF image is clearly visible and well recovered but not bright enough. As the HA of the MSF image, $P_{MSF,i,C}(x, y)$, is not completely dark, we do not need to consider the average gray value of the NHA; instead, we consider the SD of the gray value of the NHA because, in most cases for a Gaussian-like distribution, the SD is much lower than the average value. Based on this analysis, we measure the SD of the gray value of the HA ($L_{SD,HA}$) and of the NHA ($L_{SD,NHA}$). Experimentally, we found that the smaller of $L_{SD,HA}$ and $L_{SD,NHA}$ is the appropriate BO to remove the mismatch. Equation (6) defines the BO, and Equation (7) gives the HF pixels, $P_{HF,i,C}(x, y)$, in the HA.
$$BO = \min\big(L_{SD,HA},\ L_{SD,NHA}\big) \tag{6}$$
$$P_{HF,i,C}(x, y) = P_{MSF,i,C}(x, y) + BO, \quad (x, y) \in \text{HA} \tag{7}$$
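The brightness offset can be sketched as below. The sketch assumes the caller supplies the gray values and the HA mask, because the text does not specify the gray conversion; the function names, the population standard deviation, and the saturation at 255 are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Population SD of the gray values whose mask entry equals wantHa.
double maskedSd(const std::vector<double>& gray, const std::vector<bool>& isHa, bool wantHa) {
    assert(gray.size() == isHa.size());
    double sum = 0.0;
    std::size_t count = 0;
    for (std::size_t i = 0; i < gray.size(); ++i) {
        if (isHa[i] == wantHa) { sum += gray[i]; ++count; }
    }
    if (count == 0) return 0.0;
    const double mean = sum / static_cast<double>(count);
    double sq = 0.0;
    for (std::size_t i = 0; i < gray.size(); ++i) {
        if (isHa[i] == wantHa) sq += (gray[i] - mean) * (gray[i] - mean);
    }
    return std::sqrt(sq / static_cast<double>(count));
}

// Eq. (6): the BO is the smaller of the HA and NHA standard deviations.
double brightnessOffset(const std::vector<double>& gray, const std::vector<bool>& isHa) {
    return std::min(maskedSd(gray, isHa, true), maskedSd(gray, isHa, false));
}

// Eq. (7): add the BO to one MSF channel value of an HA pixel, saturating at 255.
double highlightFree(double msfValue, double bo) {
    return std::min(255.0, msfValue + bo);
}
```

With gray values {10, 20, 30, 40} in the NHA and {100, 104} in the HA, the HA standard deviation (2) is the smaller one and is used as the offset.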

Low Light Area Enhancement
Most HF images have a histogram distribution similar to that in Figure 4b. Thus, according to Hsia et al. [14], we can conclude that these images are low light images. After increasing the brightness of these low light images, we can regard them as HDR images. Huo et al. [28] followed the same approach to obtain HDR images, using a linear expansion method. Instead of linear expansion, we follow the algorithm of Li et al. [38], who gave the following equation to brighten underexposed regions. We use Equation (8) to produce our final HDR image, $P_{HDR,i}(x, y)$. The advantage of Equation (8) is that it boosts only the dark regions while keeping the higher luminance values intact.
The values of γ and V in Equation (8)


Hardware Development
The HLS development is included in this paper as part of our next development scenario. Our final goal is an SoC-based development. We have chosen the Xilinx Zynq device because of its popularity in the SoC field. Our primary implementation scenario is described in Figure 5. The Zynq architecture has two sides, the processing system (PS) and the PL. The camera feeds the video to the PS. Each image frame is stored temporarily in off-chip memory such as DDR3, which we call the frame buffer. The required frame parameters are estimated on the PS side. The image frame and the parameters are supplied to the algorithm block on the PL side, which is generated by the HLS tool. Finally, the display driver is used to view the output.


HLS Development
The aim of this paper is to show that our algorithm can be implemented in hardware, whereas most researchers [20,22] focus only on software development and comparison. For this reason, we used an HLS tool to verify that our idea is implementable in hardware. Our main target is to simplify the PL development using the HLS tool, and we also describe our optimized development method. Since Zynq is our final device, the Vivado HLS tool is selected for the development. For our final design, we will need an AXI bus to communicate from the PS to the PL and from the PL to the PS. Other HLS tools are not compatible with Xilinx Zynq devices; for example, LegUp [39] and the Intel HLS tool [40] are only compatible with Altera/Intel FPGAs.
Each HLS tool may have different steps for achieving its goal (i.e., HLL-to-HDL translation). D. Bailey, in his survey of HLS tools [41], identified four steps: dataflow analysis, resource allocation, resource binding, and scheduling. The Vivado HLS tool has four basic steps [42]: C-synthesis, C-simulation, RTL verification, and IP packaging. We discuss the output of each step in our description. Optimization techniques have been generalized in the survey paper [8], whose authors discussed eight types of HLS optimization. Among them, according to the needs of our tool, we use bit-width analysis and optimization, loop optimization, and the hardware resource library.
At the beginning of our development, we separated the part designated for the PS from the part developed in the PL. For the PL side development, we use the HLS tool. The operations assigned to the PS and the PL, respectively, are shown in Figure 6.

The separation depends on the convenience of the task. Implementing the whole algorithm on the PS side would be easier. However, the PL side has an advantage in speed because of its capability for parallel processing. Therefore, we exploit this advantage by dividing the task between the PS and the PL. It is generally easy to calculate the average and SD on the PS side, while implementing an equation in the PL gives an advantage in latency and speed. There is a trade-off among latency, resources, and output quality. To calculate the average and SD in the PL, Popovic et al. [43] assumed no variation of light between two consecutive frames. If we adopted this method, we would need to assume that the light is constant over four consecutive frames, because three frames are needed to calculate the third, fourth, and fifth blocks of the PS side in Figure 6, and the resulting values could only be applied in the fourth frame. High-speed frame capture could be one solution, but such a requirement would prevent our algorithm from generalizing to low-speed (<60 fps) commercial cameras. To avoid this situation, we use a PS-based implementation in the SoC for this part: the calculation of $C_{MSF}$ and BO is assigned to the PS, while the other operations are conveniently performed on the PL side.
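As an illustration of why the average and SD are convenient to compute on the PS side, the following sketch accumulates both in a single pass over a buffered frame using running sums. The `RunningStats` type is hypothetical and not taken from the paper. Note that the accumulator does not remove the dependency discussed above: statistics restricted to the NHA still require the highlight mask, which in turn requires the image-wide average from a completed earlier pass.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Accumulates count, sum, and sum of squares so that the mean and the
// population SD are available after one pass over the data.
struct RunningStats {
    std::uint64_t count = 0;
    double sum = 0.0;
    double sumSq = 0.0;

    void add(double x) {
        ++count;
        sum += x;
        sumSq += x * x;
    }

    double mean() const {
        return count ? sum / static_cast<double>(count) : 0.0;
    }

    double sd() const {
        if (count == 0) return 0.0;
        const double m = mean();
        // Variance = E[x^2] - (E[x])^2, clamped at zero against rounding error.
        return std::sqrt(std::max(0.0, sumSq / static_cast<double>(count) - m * m));
    }
};
```

Feeding the values 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and an SD of 2.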

Coding for IP Optimization
HLS coding starts with the function declaration. An IP may contain multiple function declarations, one of which must be selected as the main (top-level) function. During synthesis, the input/output interface is generated from the arguments of the main function. Our IP has one function with seven arguments: input, output, row size, column size, minAvg (the average of $P_{\min}$ from Equation (2)), CMSF ($C_{MSF}$), and BO. An AXI bus is used for communication between the PS and the PL. AXI stream interfaces, which are used for pipeline-based designs, serve as the input/output interfaces for the image frame and are referred to as the data bus. The other arguments are received through AXI lite interfaces, referred to as the control bus. These interfaces are added by the pragma settings suggested in the Vivado HLS user guide [42]. We use the same clock for both the control bus and the data bus. HLS C code differs from general C code because HLS does not support every feature of C. For example, C allows dynamic memory allocation for an array, but the size of an array in HLS must be predefined. An array directly consumes space in the block RAM (BRAM) of the FPGA chip, and the BRAM size is limited. Usually, for stream-based input/output, all operations are done in a single for-loop. This is one of the main reasons why we calculate all averages and SDs on the PS side, so that we do not need multiple for-loops on the PL side.
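The single-loop, seven-argument structure described above might look like the following sketch. It is not the code from the paper's tables: the function name `ldr2hdrSketch`, the packing of RGB into the low 24 bits of a 32-bit word (R in bits 23 to 16), the plain pointers standing in for AXI stream ports, and the combination of Equations (3), (4), and (7) in the loop body are assumptions; the low light enhancement of Equation (8) is omitted. The HLS directives are written as comments so that the sketch compiles with an ordinary C++ compiler; a real Vivado HLS design would typically use `hls::stream` arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Extract one 8-bit channel from a packed RGB word (bit layout is an assumption).
inline int channel(std::uint32_t px, int shift) {
    return static_cast<int>((px >> shift) & 0xFFu);
}

// Saturate to 0..255 before repacking.
inline std::uint32_t sat8(int v) {
    return static_cast<std::uint32_t>(std::min(255, std::max(0, v)));
}

// Hypothetical top-level function with the seven arguments listed in the text.
// Interface directives as they would appear in Vivado HLS:
//   #pragma HLS INTERFACE axis      port=in
//   #pragma HLS INTERFACE axis      port=out
//   #pragma HLS INTERFACE s_axilite port=rows   (likewise cols, minAvg, cMsf, bo)
void ldr2hdrSketch(const std::uint32_t* in, std::uint32_t* out,
                   std::uint16_t rows, std::uint16_t cols,
                   std::uint8_t minAvg, std::uint8_t cMsf, std::uint8_t bo) {
    assert(rows <= 1023 && cols <= 1023);  // 10-bit frame dimensions
    const std::uint32_t total = static_cast<std::uint32_t>(rows) * cols;
    for (std::uint32_t i = 0; i < total; ++i) {
        //   #pragma HLS PIPELINE II=1
        const std::uint32_t px = in[i];
        const int r = channel(px, 16);
        const int g = channel(px, 8);
        const int b = channel(px, 0);
        const int pMin = std::min(r, std::min(g, b));  // Eq. (2)
        std::uint32_t result = px;                     // NHA pixels pass through
        if (pMin > 2 * minAvg) {                       // Eq. (3)
            const int offset = cMsf + bo - pMin;       // Eqs. (4) and (7) combined
            result = (sat8(r + offset) << 16) | (sat8(g + offset) << 8) | sat8(b + offset);
        }
        out[i] = result;
    }
}
```

Because every output word depends only on the current input word and the three scalar parameters, the loop needs no frame-sized array, which is what makes the one-pixel-per-iteration streaming structure possible.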
The optimized version of the code is shown in Tables 1-3. For optimization, we first apply the pipeline directive to the loop. We achieved an initiation interval (II) of one, which indicates high throughput [8]. Second, we optimize the bit-width, which requires carefully selecting the bit-width of every variable used in the code; the Vivado HLS tool supports customized bit-widths for this purpose. Since HLS code is written in general C/C++, the important point is to keep the hardware implementation in mind while writing it. We describe our code step by step.
In Table 1, Line 8, the main function is LDR2HDR. We define AXI_STREAM as a 24-bit stream data type. Hence, our input and output data are limited to 24 bits, which is reasonable because both the input and the output are RGB images. The rows (row size) and cols (column size) arguments are 10-bit because all of our test images are within 1023 × 1023. However, the IP can also be applied to higher resolutions by increasing the bit-width of rows and cols. The remaining arguments (minAvg, CMSF, and BO) are 8-bit. Although they are floating-point values by nature, we observed experimentally that rounding them does not degrade the image quality. The variables are declared according to the needs of each operation, and the constants and variables are declared according to the highest bit-width required.

Partial code for HLS environment
The beginning of the loop (Table 2) contains two loop pragmas. HLS LOOP_TRIPCOUNT indicates the minimum and maximum number of pixels. Since rows and cols are limited to 10 bits, the maximum is 1023 × 1023 = 1,046,529. During this implementation, we found that both operands of a comparison operator (>/<) or an arithmetic operator (+/−) should have the same kind of data type, i.e., either both a custom data type (e.g., ap_fixed) or both a regular data type (e.g., 32-bit float). We want to keep everything in custom data types with the minimum number of bits required. However, when reading the pixel (Lines 8-10), 8-bit data types are needed to receive the input data. We then cast these values in the following line so that everything is kept in 9-bit. Lines 12-14 select the minimum channel value, while Lines 16-21 perform HA detection and modification.
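The pixel read and the minimum channel selection described above can be sketched in plain C++ as follows. The channel order inside the 24-bit word (R in bits 23-16, G in bits 15-8, B in bits 7-0), the names `Rgb`, `unpack_rgb`, and `min_channel`, and the use of a 16-bit integer in place of the 9-bit custom type are all assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Channels widened beyond 8 bits; std::int16_t stands in for the 9-bit
// custom type so that later arithmetic uses one common type.
struct Rgb {
    std::int16_t r;
    std::int16_t g;
    std::int16_t b;
};

// Read each channel as 8-bit data first, then cast to the wider type.
inline Rgb unpack_rgb(std::uint32_t word) {
    const std::uint8_t r8 = static_cast<std::uint8_t>((word >> 16) & 0xFFu);
    const std::uint8_t g8 = static_cast<std::uint8_t>((word >> 8) & 0xFFu);
    const std::uint8_t b8 = static_cast<std::uint8_t>(word & 0xFFu);
    Rgb p;
    p.r = static_cast<std::int16_t>(r8);
    p.g = static_cast<std::int16_t>(g8);
    p.b = static_cast<std::int16_t>(b8);
    return p;
}

// Minimum of the three channel values of one pixel.
inline std::int16_t min_channel(const Rgb& p) {
    return std::min(p.r, std::min(p.g, p.b));
}
```

Keeping all three channels in one type means the comparisons inside `min_channel` never mix types, which mirrors the same-type constraint noted above.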

Partial code for HLS environment
In Table 3, luminance is calculated in Line 2. In Vivado HLS, X^y is written as expf(y × logf(X)) when the power y is not an integer. The HLS math library (hls_math.h) includes math functions that are synthesizable. These math functions are applicable only to the single-precision (32-bit) or double-precision (64-bit) floating-point types [42]. Therefore, we have written Equation (8) in this form in Line 3. Lines 8-10 prevent floating-point overflow. When writing the output image (Lines 11-16), we cast back to the 8-bit data type. Table 4 compares the resources of the optimized and unoptimized implementations. In Table 4, we use terms such as latency, iteration latency (IL), initiation interval (II), trip count (TC), dataflow, and pipeline; their definitions are provided in [34,42]. The optimized version is the one presented in Tables 1-3. The main difference is that, in the unoptimized version, every equation is calculated in 32-bit float and the arguments (minAvg, CMSF, and BO) are also taken as float. Our design does not need any BRAM and is optimized for a 100 MHz clock. Only 76 clock cycles are needed from input to output for one pixel, and the II of one means that a new pixel can be accepted every clock cycle. In the unoptimized version, the IL is 105. The optimized version requires fewer resources, and its DSP usage is reduced by almost half. A resource-to-quality comparison is given at the end of Section 5.
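The exponential/logarithm form of a non-integer power, together with a clamp before the final 8-bit cast, can be sketched as below. This uses the standard `<cmath>` functions rather than hls_math.h so that it compiles anywhere; the names `pow_via_exp_log` and `saturate_to_u8`, the round-to-nearest step, and the exact clamping bounds are assumptions, since the code of Table 3 is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// x^y for x > 0 and a non-integer y, using x^y = exp(y * ln x).
inline float pow_via_exp_log(float x, float y) {
    assert(x > 0.0f);  // the logarithm is undefined for x <= 0
    return std::exp(y * std::log(x));
}

// Clamp a float result to [0, 255] before casting to 8-bit, so that values
// outside the pixel range cannot wrap around.
inline std::uint8_t saturate_to_u8(float v) {
    if (!(v > 0.0f)) return 0;    // also catches NaN
    if (v >= 255.0f) return 255;
    return static_cast<std::uint8_t>(v + 0.5f);  // round to nearest (assumed)
}
```

For example, `pow_via_exp_log(2.0f, 0.5f)` approximates the square root of 2, and `saturate_to_u8(300.0f)` yields 255 rather than a wrapped value.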

Resource and Latency Comparison
During RTL verification, the Fish image passed completely. Figure 7 shows the RTL pass report for VHDL. Finally, the resource count of the optimized version at the IP packaging stage is shown in Table 5. At this stage, three fewer DSPs are used, and the flip-flop (FF) and look-up table (LUT) counts are also reduced. At the same time, 193 shift register LUTs (SRLs) are consumed. In addition, our IP achieved a post-implementation clock period of 9.546 ns, which is shorter than the target clock period of 10 ns. Therefore, the timing requirement was met.
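To put the reported figures in perspective, the usual estimate for a pipelined loop is cycles = IL + II × (TC − 1). The sketch below applies it to the largest supported frame, taking the 76-cycle input-to-output figure as the iteration latency; the function names are hypothetical, and the estimate ignores any function-level overhead that the tool may add.

```cpp
#include <cassert>
#include <cstdint>

// Cycle estimate for a pipelined loop: the first iteration takes IL cycles
// and each further iteration completes II cycles after the previous one.
constexpr std::uint64_t pipelined_loop_cycles(std::uint64_t il,
                                              std::uint64_t ii,
                                              std::uint64_t trip_count) {
    return trip_count == 0 ? 0 : il + ii * (trip_count - 1);
}

// Convert a cycle count to milliseconds for a clock given in MHz.
constexpr double cycles_to_ms(std::uint64_t cycles, double clock_mhz) {
    return static_cast<double>(cycles) / (clock_mhz * 1000.0);
}
```

With IL = 76, II = 1, and TC = 1023 × 1023 = 1,046,529, the estimate is 1,046,604 cycles, or about 10.47 ms per frame at 100 MHz. The latency term is negligible next to the trip count, which is why an II of one matters far more for frame rate than the difference between an IL of 76 and 105.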

Results and Discussion
We used ten test images in our experiment: Doll, Stone, Hen, Idol, Red Ball, Face, Fish, Bear, Green Pear, and Cups. MATLAB was used for software evaluation, and Vivado HLS v2016.4 was used as the HLS tool. Figure 8 shows the output of each step. We also compared our final HDR output with the HDR image generated from Shen's HF image [22]. Shen's goal was only to generate an HF image. We generated Shen's HF image using the code provided in [22], with the chromaticity threshold parameter set to 0.05, as in [22]. We then applied the same low light area enhancement algorithm (Equation (8)) to Shen's HF image [22] for comparison with our final HDR image. Comparing Figure 8(aiii-ciii) with Figure 8(aiv-civ), our HDR output is better than the HDR output from Shen's HF image [22]. For the Doll image, a color mismatch is noticeable in the area below the guitar in Figure 8(biii), whereas no such mismatch appears in Figure 8(biv). We also compared our MATLAB output with the optimized HLS C simulation output, as shown in Figure 8(aiv-civ) and Figure 8(av-cv). Although we reduced the bit-width in HLS, there is no visible difference between the MATLAB and HLS C simulation outputs, indicating that our IP can generate outputs at a software level of precision.
Figure 8. (ai-ci) Input LDR images (Idol, Doll, and Fish); (aii-cii) our HF images; (aiii-ciii) HDR images using Shen's HF [22] images; (aiv-civ) our HDR images; and (av-cv) our HDR images by HLS C simulation.
At this stage, we compared our outputs numerically. The no-reference metrics were selected for their relevance to HDR image quality, and each metric indicates the quality improvement of an image in a specific respect. A uniform distribution of light levels, good color, good contrast, and better overall visual quality characterize an HDR-quality image. A lower histogram balance (HB) value indicates that the image is visually better in the HA and the low light area [14]. Entropy (E) measures the contrast of an image; a larger E represents better contrast. The naturalness image quality evaluator (NIQE) and the colorfulness-based patch-based contrast quality index (CPCQI) assess the overall quality of the image [15][16][17]; a lower NIQE and a larger CPCQI represent better quality. Satisfying the conditions of these metrics therefore supports the validity of the proposed algorithm. The software output, validated by the no-reference metrics, was used as the reference for checking the accuracy of the hardware-stage simulated output using SSIM and PSNR. SSIM indicates the similarity between two images in terms of luminance, contrast, and image structure [18]. Paris et al. [19] assumed that a PSNR value greater than 40 dB indicates an almost invisible difference between two images. Table 6 shows the detailed numerical comparison among the input, the HDR images from Shen's HF, and our HDR images, with the superior values in bold. On average, our method performs numerically better than the HDR from Shen's HF. Shen [22] removed highlights iteratively by solving the least squares problem of the dichromatic reflection model, based on error analysis of chromaticity and appropriate selection of body color. The whole process consists of three steps. In the first step, Shen [22] classified diffuse and highlight pixels. In the second and third steps, the highlight was removed iteratively. In contrast, our proposed method uses only two steps.
First, we also classify diffuse and highlight pixels, as in [22]. After that, however, we process only the highlight pixels, using the idea of diffuse and highlight pixel distribution to direct each highlight pixel toward a diffuse pixel. Shen [22], on the other hand, iterated over the whole image in every step to remove the highlight, even though the diffuse and highlight components were classified in the first step. As a result, our proposed method has an advantage in processing speed over Shen [22], as shown in Table 7. At the same time, the quality of our output images is also better, as shown in Figures 9 and 10. We compared the processing speed of Shen's method [22] and our method on the same PC, with a Windows 7 64-bit operating system, an Intel Core i7-3770K CPU, and 12 GB of RAM. Table 7 shows that our method is around 76 times faster than Shen's method [22]. The MATLAB output images are used as the reference to measure the quality of the HLS C-simulation output. For HLS C-simulation, we include the hls_math.h library in our testbench as well as in the original source file. The HLS math library provides synthesizable math functions whose floating-point precision reflects the hardware implementation [44]. Although the HLS tool calls the GCC compiler for C-simulation, it uses the HLS math library, instead of the standard C library, to generate the output of the math functions [44]. Therefore, our C-simulation output represents the hardware-stage output. Since the precision of a math function (e.g., exp) for hardware differs from that of the standard C math libraries [44], we verify our C-simulation output against the software-stage output. Table 8 shows the average SSIM and PSNR for both the optimized and unoptimized versions. The unoptimized version is expected to be closer to the MATLAB output, since the bit-width of its data (e.g., minAvg, CMSF, and BO) is close to that of MATLAB.
In Table 8, the average PSNR of the unoptimized version is higher than that of the optimized version, which confirms our expectation numerically. However, in both cases, the average PSNR is above 40 dB, so, according to Paris et al. [19], the difference between the MATLAB and HLS C simulation outputs will not be visible. In addition, the average SSIM is almost the same in both cases, which indicates that the visual differences are indistinguishable.
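For reference, the PSNR between a reference image and a test image with 8-bit samples is 10 log10(255² / MSE). A minimal sketch follows; the function name `psnr_db` and the flat-vector image representation are assumptions for this example, not the evaluation code used to produce Table 8.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// PSNR in dB between two 8-bit images stored as flat sample vectors of the
// same size. Returns +infinity when the images are identical.
double psnr_db(const std::vector<std::uint8_t>& ref,
               const std::vector<std::uint8_t>& test) {
    assert(ref.size() == test.size() && !ref.empty());
    double sse = 0.0;  // sum of squared sample differences
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(test[i]);
        sse += d * d;
    }
    if (sse == 0.0) return std::numeric_limits<double>::infinity();
    const double mse = sse / static_cast<double>(ref.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

On this scale, the 40 dB threshold corresponds to a mean squared error of about 6.5, that is, a root-mean-square difference of roughly 2.5 gray levels out of 255.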
Figure 10. Visual comparison of Shen [22] and our method globally: (ai-di) input LDR images (Stone, Ball, Bear, and Cups); (aii-dii) HDR images using Shen's [22] HF images; and (aiii-diii) HDR images by our proposed method.

Conclusions
The main focus of this paper is the development of a parameter-free technique for generating an HDR image from a single LDR image using a highlight removal algorithm, which removes the parameter dependency of our previous paper [12]. We compared our method with the state of the art [22] and showed that it performs better in terms of quality and processing speed. At the same time, by describing an SoC-based implementation scenario in which the PL side was developed with the HLS tool, we sought to verify that our method is hardware friendly. This kind of hardware development approach is completely new for single-image LDR-to-HDR algorithms. From the SSIM and PSNR values, we can claim that, even though we limited the bit-width of the operations in HLS, the C-simulated output is similar to the MATLAB output. The main achievement is resolution independence, obtained by reaching a throughput of one pixel per clock cycle and by avoiding the use of BRAM. Finally, although our algorithm was developed on the assumption that the highlight part is not fully saturated, our method worked very well for all of the test images. In the future, we will develop a manually written HDL IP with more optimized resources as well as a complete SoC implementation.
Author Contributions: The manuscript was written by R.S.; his main contribution is the HLS part. P.P.B. contributed the algorithm part and also helped to write the paper. As the corresponding author, K.-D.K. proposed the idea and supervised the research.

Conflicts of Interest:
The authors declare no conflicts of interest.
