Effective Implementation of Edge-Preserving Filtering on CPU Microarchitectures

Abstract: In this paper, we propose acceleration methods for edge-preserving filtering. The filters natively involve denormalized numbers, which are defined in IEEE Standard 754. Processing denormalized numbers has a higher computational cost than processing normal numbers; thus, the computational performance of edge-preserving filtering is severely diminished. We propose approaches that prevent the occurrence of denormalized numbers for acceleration. Moreover, we verify effective vectorizations of edge-preserving filtering based on changes in the microarchitectures of central processing units by carefully treating kernel weights. The experimental results show that the proposed methods are up to five-times faster than the straightforward implementations of bilateral filtering and non-local means filtering, while the filters maintain high accuracy. In addition, we show effective vectorizations for each central processing unit microarchitecture. Our implementation of the bilateral filter is up to 14-times faster than that of OpenCV. The proposed methods and vectorizations are practical for real-time tasks such as image editing.

The kernel of the edge-preserving filter can typically be decomposed into range and/or spatial kernels, which depend on the differences between the values and positions of reference pixels. Bilateral filtering has a range kernel and a spatial kernel. Non-local means filtering has only a range kernel. The shape of the spatial kernel is invariant across all pixels. By contrast, that of the range kernel is variant; thus, the range kernel is computed adaptively for each pixel. This adaptive computation is expensive.
In the case of multi-channel image filtering with intermediate-sized kernels, the method tends to be slower than the naïve algorithms owing to the curse of dimensionality [27], which indicates that the computational cost increases exponentially with increasing dimensions. Furthermore, when the kernel radius is small, the naïve algorithm can be faster than algorithms of order O(r) or O(1) owing to the offset times, which refer to pre-processing and post-processing such as creating intermediate images. In the present study, we focus on accelerating the naïve algorithm of edge-preserving filtering based on the characteristics of computing hardware.
The edge-preserving filter usually involves denormalized numbers, which are special floating-point numbers defined in IEEE Standard 754 [37]. The definition of the denormalized numbers is discussed in Section 3. The formats are supported by various computing devices, such as most central processing units (CPUs) and graphics processing units (GPUs). The denormalized numbers represent rather small values that cannot be expressed by normal numbers. Although the denormalized numbers can improve arithmetic precision, their format differs from that of the normal numbers. Therefore, processing the denormalized numbers incurs a high computational cost [38][39][40]. Edge-preserving filters produce small weight values where a pixel lies across an edge, and these values tend to be denormalized numbers. Figure 1 shows the occurrence of denormalized numbers in various edge-preserving filters; the template window size is (3, 3), the search window size is (2r + 1, 2r + 1), and the image size is 768 × 512. In (b-e), the ratios of denormalized numbers among all weight calculations are 2.11%, 3.26%, 1.97% and 3.32%, respectively. The denormalized numbers do not influence the eventual results, because these values are almost zero in the calculations. Hence, we can compute edge-preserving filtering with high precision even when omitting the denormalized numbers. Moreover, this omission is critical for accelerating edge-preserving filtering.
A fast implementation requires effective utilization of the functionalities in CPUs and GPUs. In the present study, we focus on a CPU-centric implementation. Existing CPU microarchitectures are becoming complex. The architectures are based on multiple cores, complicated cache memories and short vector processing units. Single-instruction, multiple-data (SIMD) [41] instruction sets in the vector processing units have changed especially rapidly. The evolution of SIMD instructions has taken the form of increased vector length [42], an increased number of instruction types and decreased instruction latency. Therefore, it is essential to use SIMD instructions effectively to extract CPU performance. In edge-preserving filtering, the weight calculation is the main bottleneck; thus, vectorizing the weight calculation has a significant effect.
In the present study, we focus on two topics: the influence of denormalized numbers and effective vectorized implementation on CPU microarchitectures in edge-preserving filtering. For the first, we verify the influence of the denormalized numbers on edge-preserving filtering, and then we propose methods to accelerate the filters by removing the influence of the denormalized numbers. For the second, we compare several types of vectorization of bilateral filtering and non-local means filtering. We develop various implementations to clarify suitable representations of the latest CPU microarchitectures for these filters.
The remainder of this paper is organized as follows. In Section 2, we review bilateral filtering, non-local means filtering and their variants. Section 3 describes IEEE Standard 754 for floating point numbers and denormalized numbers. In Section 4, we present CPU microarchitectures and SIMD instruction sets. We propose novel methods for preventing the occurrence of the denormalized numbers in Section 5. In Section 6, we introduce several types of vectorization. Section 7 presents our experimental results. Finally, in Section 8, we show a few concluding remarks.

Edge-Preserving Filters
General edge-preserving filtering in finite impulse response (FIR) filtering is represented as follows:

J(p) = (1/η(p)) Σ_{q∈N(p)} f(p, q) I(q), (1)

where I and J are the input and output images, respectively. p and q are the present and reference positions of pixels, respectively. A kernel-shaped function N(p) comprises a set of reference pixel positions, and it varies for every pixel p. The function f(p, q) denotes the weight of position p with respect to the position q of the reference pixel. η is a normalizing function. If the gain of the FIR filter is one, we set the normalizing function as follows:

η(p) = Σ_{q∈N(p)} f(p, q). (2)

Various types of weight functions are employed in edge-preserving filtering. These weights are composed of spatial and range kernels or only a range kernel. The weight of the bilateral filter is expressed as follows:

f(p, q) = exp(−∥p − q∥₂² / 2σ_s²) exp(−∥I(p) − I(q)∥₂² / 2σ_r²), (3)

where ∥·∥₂ is the L2 norm and σ_s and σ_r are the standard deviations of the spatial and the range kernels, respectively. The weight of the non-local means filter is as follows:

f(p, q) = exp(−∥v(p) − v(q)∥₂² / h²), (4)

where v(p) represents a vector, which includes a square neighborhood of the center pixel p, and h is a smoothing parameter. The weight of the bilateral filter is determined by considering the color similarity and the spatial distance between a target pixel and a reference pixel. The weight of the non-local means filter is defined by computing the similarity between the patch on the target pixel and that on the reference pixel. The weight of the non-local means filter is similar to the range weight of the bilateral filter for a multi-channel image.
To discuss the influence of the denormalized numbers, we introduce two variants of the bilateral and non-local means filters, namely the Gaussian range filter and the bilateral non-local means filter. The weight of the Gaussian range filter is expressed as follows:

f(p, q) = exp(−∥I(p) − I(q)∥₂² / 2σ_r²). (5)

The weight of the bilateral non-local means filter [43] is as follows:

f(p, q) = exp(−∥p − q∥₂² / 2σ_s²) exp(−∥v(p) − v(q)∥₂² / h²). (6)

The Gaussian range filter is composed of the range kernel of the bilateral filtering alone. The bilateral non-local means filter is composed of the spatial kernel and the range kernel of the non-local means filtering.

Floating Point Numbers and Denormalized Numbers in IEEE Standard 754
The formats of floating point numbers are defined in IEEE Standard 754 [37]. A floating point number is either a normal number or one of four kinds of special numbers: not a number (NaN), infinities, zeroes and denormalized (or subnormal) numbers. The normal numbers are represented as follows:

x = (−1)^sign × 1.fraction × 2^(exponent − bias). (7)

For a single-precision floating point number (float), the parameters are as follows: the bit length of exponent is 8 bits, that of fraction is 23 bits and bias = 127. For a double-precision floating point number (double), the parameters are as follows: the bit length of exponent is 11 bits, that of fraction is 52 bits and bias = 1023.
For a normal number, exponent is neither zero nor the maximum value of exponent. For the special numbers, exponent is zero or it has the maximum value of exponent. When exponent and fraction are both zero, the format represents zero. When exponent has the maximum value, the format represents infinity or NaN. In the case of the denormalized numbers, exponent is zero, but fraction is not zero. The denormalized numbers are represented as follows:

x = (−1)^sign × 0.fraction × 2^(1 − bias). (8)

Note that exponent is set to zero as the special-number flag, but it is regarded as one in the formula even though the stored value is zero. The range of magnitudes of the denormalized numbers is smaller than that of the normal numbers. In terms of float, the range of magnitudes of the normal numbers is 1.17549435 × 10⁻³⁸ ≤ |x| ≤ 3.402823466 × 10³⁸, while that of the denormalized numbers is 1.40129846 × 10⁻⁴⁵ ≤ |x| ≤ 1.17549421 × 10⁻³⁸. Typical processing units are optimized for the normal numbers. Thus, the normal numbers are processed using specialized hardware. By contrast, the denormalized numbers are processed using general hardware. Therefore, the computational cost of handling the denormalized numbers is higher than that of the normal numbers.
There are three built-in methods for suppressing the speed reduction caused by the denormalized numbers. The first approach is computation with high-precision numbers. A high-precision number format has a larger range of magnitudes for normal numbers; most denormalized numbers in float are represented by normal numbers in double. However, the bit length of double is longer than that of float. Thus, computational performance degrades owing to the increased cost of memory write/read operations. The second approach is computation with the flush to zero (FTZ) and denormals are zero (DAZ) flags. These flags are implemented in most CPUs and GPUs. If the FTZ flag is enabled, a result that underflows to a denormalized number is set to zero, and the underflow flag is raised. If the DAZ flag is enabled, an operand that is already a denormalized number is set to zero. Together, the FTZ and DAZ flags ensure that denormalized numbers are set to zero whether they appear as computing results or as operands. These flags suppress the occurrence of denormalized numbers; thereby, computing is accelerated. However, computation with these flags still incurs the events that flush denormalized numbers to zero. Hence, the calculation time with these flags is not the same as that of computation entirely without denormalized numbers. In the third approach, a denormalized number is converted into a normal number by a min or max operation. This approach forcibly clips a calculated value to a normal number, whether the calculated value is a denormalized number or not. The approach suppresses the denormalized numbers only after their occurrence. Thus, it is not optimal for accelerating computation. Therefore, in this study, we propose a novel approach that prevents the occurrence of the denormalized numbers themselves, eliminating the computational time for handling them.

CPU Microarchitectures and SIMD Instruction Sets
Moore's law [44] states that the number of transistors on an integrated circuit doubles every two years. In the early stages, CPU frequencies were increased by using the growing number of transistors. In recent years, owing to heat and power constraints, further frequency increases have become difficult [45]; instead, the additional transistors are spent on multiple cores, complicated cache memories and short vector units. The latest microarchitectures used in Intel CPUs are presented in Table 1. The table indicates that the number of cores is increasing, cache memory size is expanding and the SIMD instruction sets are growing. SIMD instructions calculate multiple data simultaneously. Hence, high-performance computing utilizes SIMD. Typical SIMD instruction sets include streaming SIMD extensions (SSE), advanced vector extensions (AVX)/AVX2 and AVX512, in order from oldest to newest [46]. Moreover, fused multiply-add 3 (FMA3) [46] is a special instruction: FMA3 computes A × B + C in one instruction.
There are three notable changes in SIMD. First, the vector length is growing. For example, the lengths of SSE, AVX/AVX2 and AVX512 are 128 bits (4 float elements), 256 bits (8 float elements) and 512 bits (16 float elements), respectively. Second, several instructions have been added, notably the gather and scatter instructions [42]. These instructions load/store data at discontinuous positions in memory. gather was introduced in AVX2, and scatter was introduced in AVX512. Before gather was implemented, the set instruction was used. The set instruction stores data in a SIMD register from scalar registers (see Figure 2). Thus, the instruction incurs a high computational cost. Third, even for the same instruction, the instruction latency depends on the CPU microarchitecture (for example, the latency of the add instruction is indicated at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_add_ps). Therefore, the effective vectorization is different for each CPU microarchitecture. The expansion of SIMD instructions influences the vectorization of edge-preserving filtering. We can accelerate the filters through the increased number of vectorized elements. Furthermore, the gather instruction is useful for referencing lookup tables (LUTs). Using LUTs is a typical acceleration technique for arithmetic computation [47]. Weights are stored in the LUTs, and then the weights are used by loading them from the LUTs. The loading process is accelerated by gather. Moreover, FMA3 is beneficial for FIR filtering. The summation term of Equation (1) can be realized by using FMA3. Therefore, we can accelerate edge-preserving filtering with proper usage of SIMD.

Proposed Methods for the Prevention of Denormalized Numbers
An edge-preserving filter has a range kernel and a spatial kernel or only a range kernel. When the similarity of color intensity is low and the spatial distance is long, the weight of the kernel is exceedingly small. For example, in the bilateral filter for a color image, when the parameters are σ_r = 32, I(p) = (255, 255, 255) and I(q) = (0, 0, 0), the range weight is 4.29 × 10⁻⁴², which is the minimum value of the range kernel and is a denormalized number in float. Note that this value has not yet been multiplied by the spatial kernel; thus, the total weight becomes even smaller than the range weight. Moreover, from Equation (4), non-local means filtering is even more likely to involve denormalized numbers. Notably, the occurrence frequency of denormalized numbers is low when the smoothing parameters are large; however, large parameters overly smooth edges, so the smoothing parameters should be small in most cases.
We propose new methods to prevent the occurrence of denormalized numbers in edge-preserving filtering. The proposed methods deal with two cases: a weight function that contains only one term and one that contains multiple terms. For the former case, we consider two implementations: computing directly and referring to only one LUT. For the latter case, we also consider two implementations: computing directly and referring to each of multiple LUTs.
For the one-term case, the argument of the term is clipped using an appropriate value so that the resulting value is not a denormalized number. If the weight function is a Gaussian distribution, the argument value x must satisfy the following condition:

exp(x) > δ_max, (9)

where δ_max is the maximum value of the denormalized numbers. In other words, Equation (9) can be written as follows:

x ≥ ln(ν_min), (10)

where ν_min denotes the minimum value of the normal numbers. δ_max and ν_min are set based on the precision of the floating point numbers. In the proposed method, an argument value is clipped by ln(ν_min) = −87.3365478515625 in float.
For the multiple-term case, the clipping method is inadequate because denormalized numbers can still occur owing to the multiplication of the multiple terms. Therefore, in the proposed method, the weights are multiplied by an offset value:

f(p, q) = o Π_{k=1}^{N} w_k(x_k), (11)

where o is an offset value and w_k is the k-th weight function, which is a part of the decomposed weight function. N is the number of terms in the weight function. The offset value satisfies the following conditions:

255 |N(p)| o Π_{k=1}^{N} max_{x∈Λ_k} w_k(x) ≤ ν_max, (12)

o Π_{k=1}^{N} min_{x∈Λ_k} w_k(x) ≥ ν_min, (13)

where Λ_k is the set of possible arguments of the k-th weight function, and ν_max is the maximum value of the normal numbers. Equation (12) limits the summation in Equation (1) such that it does not exceed the normal number range when the image range is 0-255. Notably, min_x w_n(x) and its product are occasionally zero owing to underflow, even if the mathematical results of the weight function are non-zero. When the number of terms is large or min_x w_n(x) is very small, o becomes very large. Therefore, Equations (12) and (13) cannot both be satisfied. In this condition, we must reduce the number of significant figures of w_n(·) to eliminate the occurrence of denormalized numbers. Note that o should be large to ensure that the number of significant figures of w_n(·) is large. In edge-preserving filtering, max_x w_n(x) is one. Therefore, Equation (12) is transformed as follows:

o ≤ ν_max / (255 |N(p)|). (14)

Accordingly, o should be ν_max / (255 |N(p)|) if we aim for higher accuracy. Even when the number of significant figures cannot be decreased sufficiently, the method can decrease the rate of occurrence of denormalized numbers.
The proposed methods are implemented by using max and/or multiplication operations. The weight function of the bilateral filter can be treated as either one term or multiple terms. The one-term case of the bilateral filter is implemented as follows:

f(p, q) = exp(max(−∥p − q∥₂² / 2σ_s² − ∥I(p) − I(q)∥₂² / 2σ_r², ln(ν_min))). (15)

The multiple-term case is implemented as follows:

f(p, q) = o exp(−∥p − q∥₂² / 2σ_s²) exp(−∥I(p) − I(q)∥₂² / 2σ_r²), (16)

where o must satisfy Equations (12) and (13). Note that o takes effect only if it is multiplied into the first term of the decomposed weight function before the remaining terms are multiplied.
If these conditions cannot be satisfied because the minimal values of the range and spatial kernels are very small, Equation (16) is transformed by introducing a scale factor s, where s controls the number of significant figures and is obtained in the same manner as Equation (10). The other filters can be realized in the same way. The computational costs of the proposed methods are significantly lower than the cost of computing denormalized numbers. Moreover, when using LUTs, the costs of the proposed methods are negligible: in this case, the proposed methods are applied in the preprocessing that creates the LUTs. Therefore, the benefits of the proposed methods can be significant.

Effective Implementation of Edge-Preserving Filtering
In edge-preserving filtering, the weight calculation accounts for the largest share of processing time. Thus, we consider the implementation of the weight calculation. There are three approaches for the arithmetic computation of the weight function [47]: direct computation, using LUTs and a combination of both. Ordinary arithmetic functions have a lower cost than transcendental functions, i.e., exp, log, sin and cos, or heavy algebraic functions, e.g., sqrt. For a high-cost function, a LUT is effective when arithmetic computing is the bottleneck. By contrast, direct computation is preferable when memory I/O is the bottleneck. We can control this trade-off by combining LUTs and computation.
In the bilateral filter, the possible types of implementation are as follows:
• RCSC: range computing spatial computing; the range and spatial kernels are directly and separately computed.
• RCSL: range computing spatial LUT; the range kernel is directly computed, and a LUT is used for the spatial kernel.
• RLSC: range LUT spatial computing; a LUT is used for the range kernel, and the spatial kernel is directly computed.
• RLSL: range LUT spatial LUT; LUTs are used for both kernels.
• RLSqL: range LUT spatial quantized LUT; a LUT is used for the range kernel, and a quantized LUT is used for the spatial kernel.
• RqLSL: range quantized LUT spatial LUT; a quantized LUT is used for the range kernel, and a LUT is used for the spatial kernel.
• MC: merged computing; the merged kernel is directly computed.
• ML: merged LUT; a LUT is used for the merged kernel.
• MqL: merged quantized LUT; a quantized LUT is used for the merged kernel.
In the non-local means filtering process, the possible types of implementation are reduced, because the filter does not contain the spatial kernel. The possible types of implementation are as follows:
• RC: range computing; the range kernel is directly computed.
• RL: range LUT; LUTs are used for the range kernel.
• RqL: range quantized LUT; quantized LUTs are used for the range kernel.
We consider five types of implementation for bilateral filtering, namely, MC, RCSL, RLSL, RqLSL and MqL. Notably, we did not implement RCSC, RLSC, RLSqL and ML, because the cost of computing the spatial kernel is lower than that of computing the range kernel, and the size of the range/merged LUT is larger than that of the spatial LUT. We also implement three types for non-local means filtering, namely, RC, RL and RqL. Note that the pairs MC/RC, RLSL/RL and RqLSL/RqL are analogous, differing only in the absence of the spatial computation.
In the MC/RC implementation, the weights are directly computed in every iteration. In the bilateral and bilateral non-local means filtering, the two exponential terms are computed as one exponential term by using the nature of the exponential function. The implementation is computationally expensive because it performs the weight calculation every time. However, this point is not always a drawback. The calculation increases the arithmetic intensity, which is the ratio of the number of floating-point operations to the amount of accessed memory data. When the arithmetic intensity is low, the computational time is limited by memory reading/writing [48]. In image processing, the arithmetic intensity tends to be low, but the MC/RC implementation improves it. Therefore, the MC/RC implementation may be practical in a few cases.
RCSL can be applied to filters that contain a spatial kernel, i.e., the bilateral and bilateral non-local means filters. The exponential term of the range kernel is computed every time, and a LUT holds the weights of the spatial kernel. The size of the spatial LUT is the kernel size. In the bilateral filters, the weight function of this implementation can be expressed as follows:

f(p, q) = EXP_s[p − q] exp(−∥I(p) − I(q)∥₂² / 2σ_r²),

where EXP_s[·] is the spatial LUT. The spatial term is calculated for all possible arguments, and the weight values are stored in the LUT before filtering. Because the combinations of the relative distances of p and q are identical in all kernels, it is not necessary to calculate the relative distances for each kernel.
In RLSL/RL, LUTs hold the weights of the range and the spatial kernels, and a LUT is created for each kernel. For Gaussian range and non-local means filtering, the spatial kernel is omitted or always considered to be one. Only one LUT is used for the range kernel, but the LUT is referenced once per channel using the separable representation of the exponential function. Notably, this keeps the LUT size at 256. If we instead used a LUT for the merged representation of the exponential function, its size would become 255² × 3 + 1 = 195,076. In the bilateral filter for a color image, the weight of this implementation is obtained by multiplying the spatial LUT entry by the three per-channel range LUT entries. These LUTs are accessed frequently. Hence, the arithmetic intensity is low.
In RqLSL/RqL, the range LUT for the merged representation of the exponential function is quantized to reduce the LUT size. Therefore, the LUT is approximated. This implementation is faster than using a large LUT or accessing a LUT multiple times, as in RLSL/RL. In the bilateral filter, the weight of this implementation is obtained by referencing the quantized range LUT at index φ(∥I(p) − I(q)∥₂²), where φ(·) and ψ(·) denote a quantization function and an inverse quantization function, respectively. By converting the range of the argument through the quantization function, the size of the LUT can be reduced; the LUT size is φ(3 × 255²) + 1. We use the square root function (sqrt) and division (div) as quantization functions, where a parameter n controls the LUT size. In sqrt with n = 1, the size of the quantized range LUT is 442 = ⌊√(3 × 255²)⌋ + 1.

In MqL, the range and spatial LUTs are merged, and then the merged LUT is quantized. This implementation uses only one LUT; thus, we do not need to multiply the range and spatial kernels. In MqL, the weight function of the bilateral filter is referenced from EXP_rq[·], which is identical to the quantized range LUT (Equation (25)), and the merged LUT is accessed only once in the weight calculation. The other filters are implemented in the same way. The size of the quantized merged LUT is larger than that of the quantized range LUT, and the accuracy decreases, as with the quantized range LUT.

Furthermore, we consider the data type of the input image. The typical data type of input images is unsigned char, although floating point numbers are used in filter processing. Therefore, unsigned char values of the input image are converted to float/double before or during the filtering. Note that float is typically used: the bit length of double is longer than that of float, so computation using double is slower than that using float. When the input type is
unsigned char, we must redundantly convert pixel values to floating point numbers for every loaded pixel. The number of conversions is SK, where S and K are the image and kernel sizes, respectively. By contrast, if the input is a floating point number that is pre-converted before filtering, the conversion process can be omitted during filtering; the number of conversions is S. However, in the case of an unsigned char input, the arithmetic intensity is higher than that of a float input, because the bit length of unsigned char is shorter than that of float. We should consider the trade-off between the number of conversions and the arithmetic intensity, which depends on the sizes of the image and kernel.
When LUTs are used, these implementation approaches can be applied to arbitrary weight functions; they are not limited to the exponential weight functions used in the present study. In particular, if a weight function is computationally expensive, using a LUT is all the more practical.
Notably, in these types of implementation, the weight of the target pixel, which is at the center of the kernel, need not be calculated, because it is always one. When r is small, this optimization accelerates the filtering process somewhat.

Experimental Results
We verified the occurrence of denormalized numbers in bilateral filtering, non-local means filtering, Gaussian range filtering and bilateral non-local means filtering. Moreover, we discuss the effective vectorization of bilateral filtering and non-local means filtering on the latest CPU microarchitectures. These filters were implemented in C++ by using OpenCV [49]. Additionally, multi-core parallelization was executed using Concurrency (https://msdn.microsoft.com/en-us/library/dd492418.aspx), a parallelization library provided by Microsoft, also called the parallel patterns library (PPL). Table 2 shows the CPUs, SIMD instruction sets and memory employed in our experiments. Windows 10 64-bit was used as the OS, and Intel Compiler 18.0 was employed. For LUT references, the set or gather SIMD instructions were employed. The outermost loop was parallelized by multi-core threading, and the pixel loop was vectorized [50]; this implementation was found to be the most effective [50]. Notably, a vectorized exponential operation is not implemented in these CPUs. Hence, we employed a software implementation, which is available in Intel Compiler. The experimental code spans around 95,000 lines (https://github.com/yoshihiromaed/FastImplementation-BilateralFilter).

Influence of Denormalized Numbers
We compared the proposed methods with the straightforward approach (none) and four counter-approaches for denormalized numbers. These approaches involve converting denormalized numbers to normal numbers (convert) and using the FTZ and/or DAZ flags (FTZ, DAZ, and FTZ and DAZ). The convert implementation clips a calculated value to a normal number in the weight calculation by means of a max operation with the minimal value of the normal numbers, i.e., max(exp(a) × exp(b), v), where a and b are variables and v is the minimal value of the normal numbers. Figures 3-10 show the computational time and speedup ratio of the various types of implementation on an Intel Core i9 7980XE. The computational time was taken as the median of 100 trials. The parameters were identical for all filters. Notably, in the RqLSL/RqL and MqL implementation, sqrt was used as the quantization function, and n = 1. These figures indicate that the proposed methods for handling denormalized numbers were the fastest among the approaches for each implementation. The proposed methods were up to four-times faster than the straightforward approach. In many cases, the none implementation using double was faster than that using float: the range of magnitudes of double is larger than that of float, so the occurrence frequency of denormalized numbers is lower. In the case of double, however, there was no significant speedup in any approach for managing denormalized numbers. Because double has twice the byte length of float, the corresponding computational time was approximately twice as long as well. Therefore, the implementation using double was slower than that using float when the influence of denormalized numbers was eliminated. In the RqL of the Gaussian range and the non-local means filters and the MqL of the bilateral and bilateral non-local means filters, the speedup ratio of the proposed methods was almost the same as that of the convert, FTZ and FTZ and DAZ implementation. In
these approaches, denormalized numbers occurred only when the LUTs were created, and the denormalized numbers were eliminated during LUT creation. Thus, denormalized numbers did not occur during the filtering process. Therefore, the RqL and MqL implementation achieved the same effect as the proposed methods. In addition, DAZ had almost no effect because DAZ is executed only if an operand is a denormalized number. As shown in Figure 1, denormalized numbers occur at edges. Therefore, when the weight of the range kernel was small, or after the multiplication of the range kernel with the spatial kernel, denormalized numbers were likely to occur.

The speedup-ratio tables compare the MC/RC implementation of the proposed methods with the none implementation for each set of smoothing parameters (σ_s = 6, h = 4√2, template window size (3, 3), search window size (2 × 3σ_s + 1, 2 × 3σ_s + 1), image size 768 × 512; a ratio above one means that all implementations of the method are faster than the straightforward implementation (none)). Note that when σ_s, r and the search window size are larger, the amount of processing increases. The tables indicate that the proposed methods are 2-5-times faster than the none implementation. When the smoothing parameters are small and the amount of processing is large, the speedup ratio is high. Therefore, the influence of denormalized numbers is strong when the degree of smoothing is small and the amount of processing is large.

To verify the accuracy of the proposed methods, we compared the scalar implementation in double precision with the proposed methods and the other approaches regarding the peak signal-to-noise ratio (PSNR). The results are shown in Figure 11. The proposed methods hardly affect accuracy. Note that the accuracies of the RqL and MqL implementation are slightly lower than those of the other types of implementation because the LUTs are approximated. In these types, the accuracy deterioration is not significant because human vision does not sense differences higher than 50 dB [51,52].

Effective Implementation on CPU Microarchitectures
In this subsection, we verify the effective vectorization of the bilateral filter and the non-local means filter on the latest CPU microarchitectures. The proposed methods for denormalized numbers have already been applied to these filters. Figures 12 and 13 show the computational times of the bilateral filter and the non-local means filter for each CPU microarchitecture. These filters were implemented using float. Notably, in the RqLSL/RqL and MqL implementations, sqrt was used as the quantization function and n = 1. The RqLSL/RqL implementation is the fastest on these CPU microarchitectures: it has a lower computational cost than the MC/RC implementation, and it requires fewer LUT accesses than the RLSL/RL implementation. The computational time of the MC/RC implementation is almost the same as that of the RLSL/RL and RqLSL/RqL implementations, or faster than that of the RLSL/RL implementation. This tendency can be stronger on the latest CPU microarchitectures because the computational time is limited by memory read/write latency. In addition, the computation time of the RqLSL implementation is almost the same as, though slightly faster than, that of the MqL implementation, because the merged quantization LUT is larger than the range quantization LUT. The effects of the quantization LUT size and the quantization function are discussed in the following paragraph. The RLSL/RL, RqLSL/RqL, and MqL implementations that employ the gather instruction are faster than those that employ the set instruction. However, on the Intel Core i7 5960X and AMD Ryzen Threadripper 1920X CPUs, the gather-based implementations are slower than the set-based ones. For both the bilateral filter and the non-local means filter, as the SIMD vector length increases, all types of implementations with longer SIMD instructions are faster than those with shorter SIMD instructions. Furthermore, in a comparison of the implementations with and without FMA3, the FMA3 instruction improved computational performance slightly. These results indicate that the effective vectorization of the bilateral filter and the non-local means filter differs for each CPU microarchitecture. Figures 14 and 15 show the speedup ratio of the bilateral filter and the non-local means filter for each CPU microarchitecture; if the ratio exceeds one, the corresponding implementation is faster than the scalar implementation on that CPU microarchitecture. Both the scalar and the vectorized implementations were parallelized with multi-core threading so that the comparison focuses on vectorization performance; thus, the effect of using a multi-core CPU is not evaluated in this verification. For the bilateral filter, the fastest implementation is 170-times faster than the scalar one; for the non-local means filter, the fastest implementation is 200-times faster. The speedup is determined by the management of denormalized numbers and by effective vectorization.
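The sqrt-quantized range LUT described above can be sketched as follows (our own reconstruction of the RqL idea with n = 1; names and layout are assumptions, not the authors' code). The point is that the squared L2 color distance for an RGB pixel spans [0, 3·255²] ≈ 195,075 values, but indexing by the square root of that distance compresses the table to about 255·√3 ≈ 442 entries:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of a sqrt-quantized range LUT. The table is indexed by the
// (rounded) Euclidean color distance rather than by its square, so its
// size is ~442*n entries instead of ~195,076.
struct RangeLUT {
    int n;                      // resolution multiplier (n = 1 in the text)
    std::vector<float> table;
    explicit RangeLUT(float sigma_r, int n_ = 1) : n(n_) {
        const int size = static_cast<int>(std::ceil(255.0 * std::sqrt(3.0))) * n + 2;
        table.resize(static_cast<size_t>(size));
        const float coeff = -1.0f / (2.0f * sigma_r * sigma_r);
        for (int i = 0; i < size; ++i) {
            const float d = static_cast<float>(i) / static_cast<float>(n); // de-quantized distance
            table[static_cast<size_t>(i)] = std::exp(d * d * coeff);
        }
    }
    // sqrt is the quantization function: map squared distance -> index.
    float weight(float d2) const {
        const int idx = static_cast<int>(std::sqrt(d2) * static_cast<float>(n) + 0.5f);
        return table[static_cast<size_t>(idx)];
    }
};
```

Squared distances that round to the same index share one stored weight, which is the source of the small accuracy loss noted for the RqL/MqL implementations; increasing n trades memory for accuracy.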
We now discuss the relationship between accuracy and computational time for the implementations that use the quantization LUT. Figure 16 shows the relationship between computational time and accuracy of the RqL and MqL implementations. In this figure, we varied the quantization functions and the size of the quantized range/merged LUTs. The quantization functions were the square root function (sqrt) and division (div). Note that the maximal value in the RqLSL/RqL implementation is commensurate with that of the non-quantized cases. In the RqLSL/RqL and MqL implementations, accuracy and computational time are in a trade-off. These implementations accelerate the filters while maintaining high accuracy when the LUT size is practical. The performance characteristics depend on the quantization functions; therefore, the functions and the LUT size must be chosen by considering the required accuracy and computational time. In sqrt and div, the quantization function is the square root and division, respectively. These results were obtained on an Intel Core i9 7980XE; σr = 4, σs = 6, h = 4√2; search window size is (37,37); image size is 768 × 512.
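For contrast with the sqrt variant, the div quantization function can be sketched as a table over binned squared distances (again our own illustration; the step parameter and names are assumptions). Here the divisor directly controls the LUT size, and hence the accuracy:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of division as the quantization function: the squared color
// distance is divided by a fixed step, and each bin stores the Gaussian
// weight evaluated at the bin center. A larger step means a smaller,
// cache-friendlier LUT but a coarser approximation.
std::vector<float> build_div_lut(float sigma_r, float step, float max_d2) {
    const int size = static_cast<int>(max_d2 / step) + 1;
    std::vector<float> lut(static_cast<size_t>(size));
    const float coeff = -1.0f / (2.0f * sigma_r * sigma_r);
    for (int i = 0; i < size; ++i) {
        const float d2 = (static_cast<float>(i) + 0.5f) * step; // bin center
        lut[static_cast<size_t>(i)] = std::exp(d2 * coeff);
    }
    return lut;
}

float lookup_div(const std::vector<float>& lut, float d2, float step) {
    return lut[static_cast<size_t>(d2 / step)];
}
```

Unlike the sqrt variant, the index computation here is a single multiply by the reciprocal of the step, which is cheaper than a square root; the trade-off between the two functions is exactly what Figure 16 measures.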
Figure 17 shows the computational time of using unsigned char (8U) and float (32F) in the MC/RC implementation. Note that the computational time is plotted on a logarithmic scale. The 8U implementation of the bilateral filter is faster than the 32F implementation when r is small; by contrast, when r is large, the 8U implementation is slower. If the cost of the conversion process in 8U were low, the arithmetic intensity of the 8U implementation would be larger than that of the 32F implementation. However, in the 8U implementation, the conversion cost grows with the amount of processing owing to the redundant conversion of pixels. In the non-local means filter, when the template window size is (3,3), the 8U implementation is always faster than the 32F implementation; when the template window size is (5,5), the 8U implementation is slower than the 32F implementation for large search windows. The arithmetic intensity of the non-local means filter is low because many pixels are accessed; therefore, the 8U implementation can improve the arithmetic intensity. However, if the amount of processing is large, 32F should be used. In summary, the relative speeds of the 8U and 32F implementations depend on the amount of processing, which is determined by the filtering kernel size and the arithmetic intensity.
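The arithmetic-intensity argument can be made concrete with a scalar sketch of the two data paths (illustrative only; not the paper's code). In SIMD terms, a 256-bit register holds 32 uchar lanes but only 8 float lanes, so the 8U path processes more pixels per instruction, at the price of widening conversions before accumulation:

```cpp
#include <cassert>
#include <cstdint>

// 8U path: pixels stay as unsigned char in memory; each difference is
// widened to int before squaring so the accumulation cannot overflow.
// The widening is the "conversion cost" that grows with the kernel size.
int ssd_8u(const uint8_t* a, const uint8_t* b, int len) {
    int acc = 0;
    for (int i = 0; i < len; ++i) {
        const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        acc += d * d;
    }
    return acc;
}

// 32F path: pixels are pre-converted to float once; the inner loop is a
// plain multiply-accumulate (an FMA candidate), with no per-pixel widening.
float ssd_32f(const float* a, const float* b, int len) {
    float acc = 0.0f;
    for (int i = 0; i < len; ++i) {
        const float d = a[i] - b[i];
        acc += d * d;
    }
    return acc;
}
```

Both functions compute the same sum of squared differences used in template matching for the non-local means filter; which one is faster in vectorized form depends on how often the 8U path must widen its lanes, which is the effect Figure 17 measures.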

Conclusions
In this paper, we propose methods to accelerate bilateral filtering and non-local means filtering. The proposed methods prevent the occurrence of denormalized numbers, whose processing has a high computational cost. Moreover, we verify various types of vectorized implementations on the latest CPU microarchitectures. The results are summarized as follows: 1.
The proposed methods are up to five-times faster than the implementation without preventing the occurrence of denormalized numbers.

2.
In modern CPU microarchitectures, the gather instruction in the SIMD instruction set is effective for loading weights from the LUTs.

3.
By reducing the LUT size through quantization, the filtering can be accelerated while maintaining high accuracy. Moreover, the optimal quantization function and quantization LUT size depend on the required accuracy and computational time.

4.
When the kernel size is small, the 8U implementation is faster than the 32F implementation. By contrast, for large kernels, the 32F implementation is faster than the 8U implementation.
In the future, we will verify the influence of denormalized numbers on GPUs. In particular, we plan to implement edge-preserving filters on GPUs and to verify the influence of denormalized numbers on their computation time. Furthermore, we will design effective implementations of the filters on GPUs.
[Figure: occurrence of denormalized numbers. (a) Original image; (c) non-local means filter; (e) bilateral non-local means filter.]
⌊·⌋ is the floor function, ‖·‖1 is the L1 norm, and I(·)r, I(·)g, and I(·)b are the red, green, and blue channels of I(·), respectively. EXPr[·] is the range LUT, and EXPs[·] is identical to Equation (21).

Figure 19. Computational time and speedup ratio of the fastest implementation and the OpenCV implementation of the bilateral filter. These results were calculated using an Intel Core i9 7980XE. Note that the computational time is plotted on a logarithmic scale. If the speedup ratio exceeds one, the fastest implementation is faster than the OpenCV implementation. σr = 16, r = 3σs; and image size is 768 × 512. For σs = 4, the PSNRs of the proposed method and OpenCV are 84.63 dB and 44.08 dB, respectively; for σs = 8, they are 85.45 dB and 43.55 dB; for σs = 16, they are 84.41 dB and 43.19 dB.

Table 1 .
CPU microarchitectures of the Intel Core series Extreme Editions. In each CPU generation, the Extreme Edition offers the highest performance at the consumer level.

Table 2 .
Computers used in the experiments.

Table 3 .
Computational time and speedup ratio of bilateral filtering in the merged computing (MC) implementation using AVX512 for various parameters. These results were obtained on an Intel Core i9 7980XE. If the ratio exceeds 1, the proposed methods are faster than the straightforward implementation (none) for each parameter. r = 3σs, and image size is 768 × 512. (a) Computational time (proposed) [ms]; (b) Computational time (none) [ms]; (c) Speedup ratio.

Table 4 .
Computational time and speedup ratio of the Gaussian range filter in the range computing (RC) implementation using AVX512 for various parameters. These results were obtained on an Intel Core i9 7980XE. If the ratio exceeds 1, the proposed methods are faster than the straightforward implementation (none) for each parameter. Image size is 768 × 512.

Table 5 .
Computational time and speedup ratio of the non-local means filter in the RC implementation using AVX512 for various parameters. These results were obtained on an Intel Core i9 7980XE. If the ratio exceeds 1, the proposed methods are faster than the straightforward implementation (none) for each parameter. Template window size is (3,3), and image size is 768 × 512. (a) Computational time (proposed) [ms]; (b) Computational time (none) [ms]; (c) Speedup ratio.

Table 6 .
Computational time and speedup ratio of the bilateral non-local means filter in the MC implementation using AVX512 for various parameters. These results were calculated using an Intel Core i9 7980XE. If the ratio exceeds 1, the proposed methods are faster than the straightforward implementation (none) for each parameter. Template window size is (3,3); search window size is (2 × 3σs + 1, 2 × 3σs + 1); and image size is 768 × 512.
Search window size is (37,37), and image size is 768 × 512. If the ratio exceeds one, the implementation is faster than a scalar implementation for all CPU microarchitectures. Note that the scalar implementation is parallelized using multi-core. σr = 4, σs = 6, and r = 3σs.