VLSI Implementation of a Cost-Efﬁcient Loefﬂer DCT Algorithm with Recursive CORDIC for DCT-Based Encoder

This paper presents a low-cost, high-quality, hardware-oriented, two-dimensional discrete cosine transform (2-D DCT) signal analyzer for image and video encoders. In order to reduce the memory requirement and improve image quality, a novel Loeffler DCT based on the coordinate rotation digital computer (CORDIC) technique is proposed. In addition, the proposed algorithm is realized by a recursive CORDIC architecture instead of an unfolded CORDIC architecture with approximated scale factors. In the proposed design, a fully pipelined architecture is developed to efficiently increase the operating frequency and throughput, and the scale factors are implemented using four hardware-sharing machines for complexity reduction. Thus, the computational complexity can be decreased significantly with only 0.01 dB loss relative to the optimal image quality of the Loeffler DCT. Experimental results show that the proposed 2-D DCT spectral analyzer not only achieves a superior average peak signal-noise ratio (PSNR) compared to previous CORDIC-DCT algorithms but also offers a cost-efficient architecture for very large scale integration (VLSI) implementation. The proposed design was realized using a UMC 0.18-µm CMOS process with a synthesized gate count of 8.04 k and a core area of 75,100 µm². Its operating frequency was 100 MHz, and its power consumption was 4.17 mW. Moreover, this work achieved at least a 64.1% gate-count reduction and saved at least 22.5% in power consumption compared to previous designs.


Introduction
The Internet of Things (IoT) has drawn considerable research and business attention; it makes connecting everything possible and can be applied in the fields of human-to-human, human-to-machine, and machine-to-machine communications [1]. The most important applications of the IoT are wireless sensor networks (WSNs) [2][3][4][5]. WSN devices, including mobile phones, reached up to 26 billion nodes by 2020 and are set to reach 100 billion nodes by 2025 [6]. Therefore, WSNs will undoubtedly bring massive business opportunities and provide momentum for upgrading industry technology. The WSN is a promising candidate for wireless personal area networks with low transmission data rates [7]. When the number of WSN nodes increases, network management and heterogeneous node networks might become challenging; to overcome this problem, the software-defined network (SDN) approach was proposed for the WSN to improve its efficiency and robustness [7]. Moreover, sensor data must be transmitted wirelessly, and important biomedical signals, such as electroencephalography (EEG), need lossless data compression to save not only data bits but also power. Chen et al. [8] proposed an efficient method of lossless EEG compression using dynamic voting prediction for WSNs.
Regarding the communication of multiple nodes in WSNs, both bandwidth and power are major parameters to be considered for wireless transceivers. A novel power-efficient multiband antenna with four designed loops is proposed in [9]. Furthermore, a synchronous data line is always required in any data transmission; Chen et al. [10] provided a low-power chip design with a preamble data synchronizer for the case where the data come from different frequency domains.
The purpose of developing the WSN is not only to provide a platform for data and multimedia exchange [11] but also to construct smart and safe cities, including video surveillance [12], safe transportation [13], medical imaging [14], search and secure systems [15], and smart museums [16]. Therefore, it is essential to develop smart cities with low-complexity smart security systems. Considering the application of an outdoor image system with low power consumption, seven important image compression methods for binary images were investigated for the WSN [10] and were used to detect the number of objects (cyclists or pedestrians) to ensure traffic safety. For applications requiring high-quality color images, Kouginos et al. designed a platform with a digital camera and the better portable graphics (BPG) format for transmitting security images of search and secure operations over the WSN [15]. Then, a robotic camera network was introduced to monitor the environment [17]. In [16], the authors designed a WSN system for a smart museum that can automatically provide visitors with cultural contents related to the observed artworks. Moreover, in [14], the authors proposed the very large scale integration (VLSI) implementation of wireless capsule endoscopy with a low-complexity, high-image-quality algorithm for the wireless body sensor network. To this end, high-quality and low-complexity image compression techniques are a priority for future development. For the WSN system, the critical issue is how to reduce the size of the transmitted images for storage while still maintaining high image quality. To this end, image compression is widely applied to images before transmission to efficiently reduce the image data. Existing image compression techniques such as joint photographic experts group (JPEG) [18], JPEG-2000 [19], BPG [15], and secure BPG (SBPG) [20] are employed in the WSN.
The JPEG standard is the most popular still-image compression method and is widely used in business and industry. JPEG converts each image to its equivalent frequency-domain representation using the discrete cosine transform (DCT). After doing so, JPEG keeps the important lower-frequency information and discards the less important higher-frequency information to attain image compression. Finally, the compression ratio can be further boosted when an entropy-coding algorithm is subsequently applied to the compressed data. Recently, several studies applied machine learning and deep learning techniques to the JPEG standard [21][22][23]. First, MalJPEG, a machine learning-based solution for detecting malicious JPEG images, was proposed [21] to avoid the harmful actions of cyber attacks. Next, in [22], a novel deep learning-based approach for double JPEG compression detection was proposed, which used spatial- and frequency-domain information and a multi-column convolutional neural network (CNN) architecture for block classification. Moreover, the authors of [23] proposed a generic hybrid deep-learning framework for JPEG steganalysis, which combined the domain knowledge behind rich steganalytic models with compound deep neural networks; in doing so, the framework was insensitive to JPEG blocking-artifact alterations.
In image compression, the two-dimensional discrete cosine transform (2-D DCT) is widely used for signal analysis of the image data. In [24], the authors proposed a hardware architecture for the 2-D DCT with Loeffler factorization and algebraic integer representation; the design was completely error-free and eliminated the use of multipliers. Efficient hardware architectures of the 2-D DCT suitable for H.265/HEVC were further proposed [25][26][27]. Figure 1 depicts the flow chart of the standard image compression technique, which primarily contains a 2-D DCT signal analyzer and an entropy encoder. The former is used to spectrally analyze image data; the latter is used to improve the spectral efficiency. In this paper, the focus is to improve the image compression technique with low computing complexity for WSN applications, and the low-complexity VLSI architecture of the 2-D DCT signal analyzer is realized. To realize real-time WSN systems, many high-performance DCT algorithms for JPEG were proposed for VLSI implementation [28][29][30]. To reduce hardware costs, Loeffler proposed an efficient one-dimensional (1-D) DCT algorithm, which utilized only 11 multipliers and 29 adders [28]. In turn, the coordinate rotation digital computer (CORDIC)-based Loeffler DCT algorithm was proposed to avoid using multipliers, achieving a power consumption of only about 20% of that of Loeffler's work. CORDIC is an algorithm used to evaluate sinusoidal/hyperbolic functions and can be used for intelligent robots [31,32], along with applications in communication systems and signal processing [33]. In [30], a low-complexity CORDIC-DCT algorithm was proposed to implement the 2-D DCT based on the row-column method by using the 1-D DCT twice.
In previous studies [29,30], efficient unfolded CORDIC-based algorithms for DCT were proposed that allow small, indispensable accuracy errors. On the other hand, a low-complexity fully parallel hardware architecture for DCT, called FPAX-CORDIC, was proposed to avoid using the memory register of the Para-CORDIC [34]. Recently, a high-performance CORDIC-based algorithm that included square-root and inverse-trigonometric operators for biped robots was realized on a field-programmable gate array (FPGA) device. A high-accuracy CORDIC-based algorithm for biped robots was proposed in [35], in which pipeline and hardware-sharing techniques were used to improve performance and reduce hardware costs efficiently.
As mentioned above, it is necessary to develop a high-performance, high-quality, and low-complexity image compression technique for image and video encoders. In this paper, an efficient, low-complexity, and high-quality hardware-oriented Loeffler DCT algorithm with recursive CORDIC architecture for image and video encoders is proposed. The remaining parts of the paper are organized as follows: In Section 2, the image compression algorithm is described. Section 3 presents the VLSI architecture of the proposed cost-efficient hardware-oriented Loeffler DCT algorithm with recursive CORDIC. In Section 4, experimental results of the proposed design are demonstrated. Finally, concluding remarks are made in Section 5.

JPEG
JPEG is a widely used method of lossy image compression for digital data, images, and digital photos. The method discards some minor information to achieve image compression, and it can provide different levels of data compression by considering the tradeoff between storage space and image quality. In general, digital image pixels in neighboring regions are highly correlated with each other, and thus high data compression can be achieved. The image-coding algorithm is composed of data correlation reduction, value quantization, and entropy coding, as shown in Figure 1. To reduce correlation, DCT is one of the most well-known methods. After applying the DCT to the digital data, the resulting DCT coefficients are uniformly quantized using the quantization table.
The purpose of quantization is to achieve a higher compression ratio by obtaining DCT coefficients with adequate precision, which is enough to achieve the desired image quality. JPEG [18] uses the standard quantization table in Figure 2, which is the example quantization table for luminance components. Entropy coding is the final step of DCT-based encoder processing. In this step, additional compression by encoding the quantized DCT coefficients can be obtained. The most widely known and the most widely used algorithm for entropy coding is the Huffman coding algorithm.
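As a concrete illustration of the quantization step, the following Python sketch quantizes and dequantizes an 8 × 8 block of DCT coefficients using the standard JPEG luminance table (the table the paper shows as Figure 2); the function names are ours, chosen for illustration.

```python
# Standard JPEG luminance quantization table (Annex K of the JPEG
# standard; reproduced in this paper as Figure 2).
Q_LUM = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize(coeffs):
    """Uniformly quantize an 8x8 block of DCT coefficients."""
    return [[round(coeffs[u][v] / Q_LUM[u][v]) for v in range(8)]
            for u in range(8)]

def dequantize(levels):
    """Reconstruct approximate coefficients from quantized levels."""
    return [[levels[u][v] * Q_LUM[u][v] for v in range(8)]
            for u in range(8)]
```

Because the high-frequency entries of the table are large, small high-frequency coefficients quantize to zero, which is where most of the compression gain comes from.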
As mentioned above, most operations were focused on DCT design for hardware cost reduction.

DCT
The DCT algorithm is widely used in data/video compression; it allows unequal computation effort across the spatial-domain data when generating the frequency-domain outputs, since some DCT computations are crucial to image quality while others are not. The two-dimensional DCT in Equation (1) transforms an N × N block of samples from the spatial domain f(x, y) into the frequency domain F(k, l):

F(k, l) = (2/N) C(k) C(l) ∑_{x=0}^{N−1} ∑_{y=0}^{N−1} f(x, y) cos[(2x + 1)kπ/(2N)] cos[(2y + 1)lπ/(2N)], (1)

where the normalization factors are C(0) = 1/√2 and C(k) = C(l) = 1 for k, l ≠ 0. The 2-D DCT has a disadvantage, namely its hardware cost. In [28], Loeffler used only 11 multipliers to implement the 1-D DCT. The following subsection presents how the 1-D DCT is utilized to implement the 2-D DCT.
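Equation (1) can be checked with a short brute-force Python sketch; this O(N⁴) reference implementation is for verification only and is not the hardware algorithm proposed in the paper.

```python
import math

def dct2_direct(f):
    """Direct 2-D DCT of an N x N block, following Equation (1)."""
    N = len(f)
    C = lambda k: 1 / math.sqrt(2) if k == 0 else 1.0
    F = [[0.0] * N for _ in range(N)]
    for k in range(N):
        for l in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (f[x][y]
                          * math.cos((2 * x + 1) * k * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * l * math.pi / (2 * N)))
            F[k][l] = (2 / N) * C(k) * C(l) * s
    return F
```

For an 8 × 8 all-ones block, only the DC coefficient is nonzero: F(0, 0) = (2/8)(1/√2)(1/√2) · 64 = 8.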

2-D DCT Using Row-Column 1-D DCT Architecture
In this section, we adopt the row-column method based on the 1-D DCT to implement the 2-D DCT and thereby reduce complexity. As shown in Figure 3, the row-column method applies the 1-D DCT to the image block twice. Given an image block X of size N × N, the 1-D DCT is first performed on the rows of X, and then on the columns of X; between the two 1-D DCTs is the N × N transposition memory. The matrix C is defined as the orthonormal 1-D DCT matrix of size N × N, whose (m, n)-th element C(m, n) is defined by

C(m, n) = √(2/N) c(m) cos[(2n + 1)mπ/(2N)], m, n = 0, 1, ..., N − 1, (2)

where the normalization factors are c(0) = 1/√2 and c(m) = 1 for m ≠ 0. According to the row-column method, the 2-D DCT performed on the image block X is given by

F = C X Cᵀ, (3)

where the superscript T denotes the matrix transpose. In this paper, the value of N is set to eight according to the JPEG standard.
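The row-column method of Equations (2) and (3) can be sketched in Python as follows; the matrix helpers are illustrative, and the orthonormality of C (C Cᵀ = I) is what makes F = C X Cᵀ equivalent to the direct 2-D definition.

```python
import math

def dct_matrix(n=8):
    """Orthonormal 1-D DCT matrix C of Equation (2)."""
    c = lambda m: 1 / math.sqrt(2) if m == 0 else 1.0
    return [[math.sqrt(2 / n) * c(m)
             * math.cos((2 * k + 1) * m * math.pi / (2 * n))
             for k in range(n)] for m in range(n)]

def matmul(A, B):
    """Plain triple-loop matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def dct2_rowcol(X):
    """2-D DCT via the row-column method: F = C X C^T (Equation (3))."""
    C = dct_matrix(len(X))
    return matmul(matmul(C, X), transpose(C))
```

For the 8 × 8 all-ones block, only the DC coefficient F(0, 0) = 8 is nonzero, matching the direct definition in Equation (1).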

CORDIC Algorithm
The CORDIC algorithm for the rotation mode was summarized in [31]. Four important equations (Equations (4)-(7)) of the CORDIC algorithm are as follows:

x_{i+1} = x_i − σ_i · 2^{−i} · y_i, (4)
y_{i+1} = y_i + σ_i · 2^{−i} · x_i, (5)
ω_{i+1} = ω_i − σ_i · α_i, with α_i = tan⁻¹(2^{−i}), (6)
K = ∏_{i=0}^{n−1} √(1 + 2^{−2i}), (7)

where x and y denote the x-axis and y-axis components in the rectangular coordinate system, respectively; ω is the accumulated rotation angle; σ is the signum symbol, whose value of 1 or −1 defines the rotation direction; i denotes the ith iteration step; and α is the predefined angle value of each rotation step. The output data of the CORDIC are amplified by the scaling factor K, which depends on the number of iteration steps; therefore, the final values from the CORDIC algorithm have to be multiplied by 1/K. When the index n in K is large enough, the constant 1/K approaches 0.60725.
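A floating-point behavioral model of the rotation-mode iterations can be written as follows; this is a sketch of the standard rotation-mode CORDIC recurrence, not the fixed-point hardware datapath, and the function name is ours.

```python
import math

def cordic_rotate(x, y, angle, n_iter=11):
    """Rotate (x, y) by `angle` radians with rotation-mode CORDIC.

    Each micro-rotation uses only a shift (multiplication by 2^-i)
    and an add/subtract; the direction sigma is chosen from the sign
    of the residual angle omega.
    """
    omega = angle
    for i in range(n_iter):
        sigma = 1 if omega >= 0 else -1          # rotation direction
        x, y = (x - sigma * y * 2.0 ** -i,
                y + sigma * x * 2.0 ** -i)
        omega -= sigma * math.atan(2.0 ** -i)    # alpha_i = atan(2^-i)
    # The micro-rotations grow the vector by K; undo with 1/K ~ 0.60725.
    k_inv = 1.0
    for i in range(n_iter):
        k_inv /= math.sqrt(1 + 2.0 ** (-2 * i))
    return x * k_inv, y * k_inv
```

Rotating (1, 0) by θ yields (cos θ, sin θ) to within the residual-angle bound arctan(2⁻¹⁰) ≈ 10⁻³ after 11 iterations.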

Loeffler DCT Algorithm with Recursive CORDIC
Based on the work of the CORDIC-based Loeffler DCT [29], a novel Loeffler DCT algorithm with recursive CORDIC is proposed in this study. The proposed architecture of the 8-point Loeffler DCT algorithm with iterative CORDIC is shown in Figure 4. The improvement of this work over previous algorithms lies in four features. First, the quality of image compression is improved without using any further approximation or ignoring iteration compensation in the CORDIC algorithm. Second, a recursive architecture is applied to the proposed CORDIC algorithm to reduce complexity. Third, the algorithm in [29] is optimized as illustrated in Figure 4 (highlighted with a red box), where a pipelined hardware-sharing machine is applied in the VLSI implementation to reduce hardware costs. Fourth, the hardware-sharing machine technique is also applied to the three scale-factor circuits.
Figure 5 illustrates the VLSI architecture of the proposed 2-D DCT design. According to the row-column method, it contains only a 1-D DCT hardware-sharing machine and a transposition memory to implement the 2-D DCT. The transposition memory requires 64 12-bit words to temporarily store the 1-D DCT coefficients. Figure 6 depicts the VLSI architecture of the proposed 9-stage pipeline 1-D DCT circuit, where a hardware-sharing machine is applied to reduce hardware costs. The 1-D CORDIC DCT in the proposed design contains four hardware-sharing machines and 20 adders. The proposed design is implemented with a 9-stage pipeline architecture to increase the operating frequency and is used to calculate the eight DCT coefficients given in (2).
Table 1 lists an efficient and high-precision way to develop the scaling-factor generator. The design uses only adders and shifters, as far as possible, to replace multipliers and dividers. Moreover, this paper utilizes a hardware-sharing machine to decrease hardware complexity. Using hardware-sharing machines might somewhat increase the number of executing cycles; to tackle this problem, a fully pipelined architecture was developed in the proposed design to increase throughput and reduce cycle time. The third row (scale factor = 1/(2√2)) in Table 1 represents the 9-stage process Hardware-Sharing Scale factor_1 in Figure 6 and is shown in Figure 7. In a similar way, the fourth row (scale factor = 1/3.1694) in Table 1 expresses the Hardware-Sharing Scale factor_2 in Figure 6 and is shown in Figure 8. The proposed hardware-sharing scaling factors are more accurate and more cost-efficient than the scaling factors used in Sun et al. [29]. To conclude, the proposed scaling-factor design provides a cost-efficient, high-precision, and high-performance structure for designing the CORDIC-based DCT.

Table 1. Scale factors used in the work of Sun et al. [29] and in this work.
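The idea of replacing a constant multiplier with a few shift-and-add terms can be illustrated with a greedy power-of-two expansion; the decomposition below is only illustrative, and the actual shift/add counts used in the design are those listed in Table 1.

```python
import math

def pow2_expansion(value, n_terms):
    """Approximate a positive constant by a sum of powers of two.

    Each returned exponent e corresponds to one shift (by -e bits)
    and one add in a shift-and-add multiplier network.
    """
    terms, approx = [], 0.0
    for _ in range(n_terms):
        rem = value - approx
        if rem <= 0:
            break
        e = math.floor(math.log2(rem))   # largest power of two <= rem
        terms.append(e)
        approx += 2.0 ** e
    return terms, approx
```

For example, the scale factor 1/(2√2) ≈ 0.353553 is approximated by 2⁻² + 2⁻⁴ + 2⁻⁵ + 2⁻⁷ + 2⁻⁹ ≈ 0.353516, i.e., five shifts and four adds, with a quantization error below 10⁻⁴.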

(Table 1 columns: Scale Factor, Quantization Value, Quantization Error, Add, Shift.)

The proposed CORDIC was realized by an iterative structure in a recursive CORDIC architecture, as shown in Figure 9, instead of the unfolded CORDIC architecture with an approximated scale factor. According to the experiment, the output data from the CORDIC require 11 iteration cycles to become stable. This design also allows a flexible accuracy approximation, where a lookup table is employed to reduce hardware area and improve performance, and users can control the number of iteration cycles.
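The stabilization behavior can be modeled by quantizing a floating-point CORDIC output to 12-bit fixed point (matching the 12-bit transposition-memory words); the Q1.11 format below is our assumption for illustration, not a detail stated in the paper.

```python
import math

def cordic_unit(angle, n_iter):
    """Rotation-mode CORDIC applied to the unit vector (1, 0);
    a floating-point model of the recursive datapath of Figure 9."""
    x, y, omega = 1.0, 0.0, angle
    k_inv = 1.0
    for i in range(n_iter):
        sigma = 1 if omega >= 0 else -1
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        omega -= sigma * math.atan(2.0 ** -i)
        k_inv /= math.sqrt(1 + 2.0 ** (-2 * i))
    return x * k_inv, y * k_inv

def lsb12(v):
    """Quantize to 12-bit fixed point (assumed Q1.11 format)."""
    return round(v * 2 ** 11)
```

Once the angle resolution arctan(2⁻ⁱ) drops below one LSB of the 12-bit output, extra iterations no longer change the stored result; in this model, 11 iterations bring the quantized outputs within a few LSBs of the true cosine/sine values, consistent with the 11 cycles reported above.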
The hardware-sharing recursive CORDIC was realized by an iterative architecture, as shown in Figure 9. Figure 10 shows the hardware-sharing CORDIC scaling-factor generator, which uses only adders and shifters. As the number of iteration cycles increases, K ≈ 0.60725; this architecture is realized as illustrated in Figure 10. In addition, pipeline and operation-simplification techniques were used to improve performance and further reduce hardware costs.
The proposed design was composed of adders and shifters only and was realized with a 9-stage pipeline architecture, which enhances performance and reduces hardware costs efficiently. Table 2 lists the rotation directions σ (sign bits) of the angles at the ith stage of the CORDIC-DCT. Figures 9 and 10 show the architecture of the proposed hardware-sharing machine, where Equations (4)-(7) can be realized for VLSI implementation by using only adders, shifters, and Table 2.

Experimental Results of the Proposed Loeffler DCT Algorithm with Recursive CORDIC
In this section, the experimental results are evaluated in terms of the peak signal-noise ratio (PSNR) of the proposed Loeffler DCT algorithm with recursive CORDIC. The PSNR of a three-layer (R-G-B) image with 256 levels per layer and size H-by-V is defined as

PSNR = 10 log₁₀(255²/MSE), with MSE = (1/(3HV)) ∑_{u∈{R,G,B}} ∑_{i=1}^{H} ∑_{j=1}^{V} [I_u(i, j) − K_u(i, j)]²,

where I_u is the original image in the uth layer, and K_u is the reconstructed image in the uth layer, for u ∈ {R, G, B} corresponding to the colors red, green, and blue, respectively. Moreover, each image in Table 3 is 512 × 512 pixels with 24 bits per pixel, and each image in Table 4 is 768 × 512 pixels with 24 bits per pixel. Taking the image compression of Table 3 as an example, the reconstructed image K_u is obtained in four steps: (1) perform the 2-D DCT on each 8 × 8 block; (2) quantize the DCT coefficients; (3) dequantize the coefficients; and (4) apply the inverse 2-D DCT. In the above procedure, the luminance quantization matrix recommended by the JPEG standard was used to evaluate the PSNR of the image compression. Moreover, the number of CORDIC iterations in the proposed design was chosen so that the average PSNR is within 0.01 dB of that of the Loeffler DCT [28].
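Under the definition above, the PSNR can be computed with a small Python sketch; representing each layer as an H × V list of pixel values is our choice of data layout, and the MSE is averaged over the three layers.

```python
import math

def psnr_rgb(orig, recon, peak=255.0):
    """PSNR of a three-layer (R-G-B) image.

    `orig` and `recon` map 'R', 'G', 'B' to H x V lists of pixel
    values; the MSE is averaged over all three layers and pixels.
    """
    h = len(orig['R'])
    v = len(orig['R'][0])
    se = 0.0
    for u in ('R', 'G', 'B'):
        for i in range(h):
            for j in range(v):
                se += (orig[u][i][j] - recon[u][i][j]) ** 2
    mse = se / (3 * h * v)
    return float('inf') if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

A uniform error of one gray level in every pixel gives 10 log₁₀(255²) ≈ 48.13 dB, a useful sanity check.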
In the following experimental results, the image compression algorithm was executed, and the image quality and compression ratio were computed to evaluate the performance of the proposed image compression algorithm. The same image datasets as in previous works [28][29][30] are used in this work for the PSNR comparison and are shown in Figures 11 and 12. Tables 3 and 4 list the PSNR values obtained with the different image datasets. In addition, Table 5 lists the comparison results of the computing resources for the three aforementioned landmark DCT algorithms.
Electronics 2021, 10, x FOR PEER REVIEW

Figure 11. Eight images used for the PSNR comparison in Table 3.

Figure 12. 24 images used for the PSNR comparison in Table 4.

Table 3. Peak signal-noise ratio (PSNR) (dB) comparison of previous DCT algorithms and this work using the first image dataset shown in Figure 11.

Table 4. PSNR (dB) comparison of previous DCT algorithms and this work using the second image dataset shown in Figure 12.

In Table 5, the computing resources of the 2-D DCT were evaluated by doubling the computing resources needed for the 1-D DCT. Moreover, the required resources of the scale factor of the CORDIC and the normalization factor of the 1-D DCT were also included, for a fair and consistent comparison following the work of Lee et al. [30], which counted the resources of both types of factors. Thus, the required resources of the scale factors used in the work of Sun et al. [29] and in this work are listed in Table 1. Moreover, the complexity of a shifter and the complexity of an adder are assumed to be the same.
From Table 5, two observations are made. First, this work without a hardware-sharing machine achieves the lowest computing complexity. Second, this work with a hardwaresharing machine can significantly reduce the computing complexity. The lookup tables (LUTs) used for different rotation angles are listed in Table 2. According to Tables 3-5, the proposed work exhibits a high-quality and low-complexity hardware-oriented 2-D DCT algorithm for VLSI implementation. Finally, Table 6 compares the image quality between previous DCT algorithms and this work. It can be observed from Table 6 that the proposed novel recursive Loeffler DCT algorithm can achieve almost the same image quality as the original image. The two testing images, Lena and Kodak03, were widely applied to the image processing realm. This work with hardware sharing machine 0 28 11 In Table 5, the computing resources of 2-D DCT were evaluated by doubling the computing resources needed for 1-D DCT. Moreover, the required resources of the scale-factor of the CORDIC and the normalization factor of the 1-D DCT were also included for fair comparison of consistency following the work of Lee et al., which counted the resources of the two types of factors [30]. Thus, the required resources of scale factors used in the work of Sun et al. [29] and this work are listed in Table 1. Moreover, the complexity of the shifter and the complexity of the adder are assumed to be the same.
From Table 5, two observations are made. First, this work without a hardware-sharing machine achieves the lowest computing complexity. Second, this work with a hardware-sharing machine can significantly reduce the computing complexity. The lookup tables (LUTs) used for different rotation angles are listed in Table 2. According to Tables 3-5, the proposed work exhibits a high-quality and low-complexity hardware-oriented 2-D DCT algorithm for VLSI implementation. Finally, Table 6 compares the image quality between previous DCT algorithms and this work. It can be observed from Table 6 that the proposed novel recursive Loeffler DCT algorithm can achieve almost the same image quality as the original image. The two testing images, Lena and Kodak03, were widely applied to the image processing realm. The 24 images from the Kodak dataset were used to run the image processing, including the 2-D DCT based on the proposed design, quantization (quantized with the Sun [29] 0 120 Lee [30] 0 192 This work without hardware sharing machine 0 108 This work with hardware sharing machine 0 28 In Table 5, the computing resources of 2-D DCT were evaluated computing resources needed for 1-D DCT. Moreover, the required scale-factor of the CORDIC and the normalization factor of the 1-D included for fair comparison of consistency following the work of L counted the resources of the two types of factors [30]. Thus, the requi scale factors used in the work of Sun et al. [29] and this work are li Moreover, the complexity of the shifter and the complexity of the adde be the same.
From Table 5, two observations are made. First, this w hardware-sharing machine achieves the lowest computing complexi work with a hardware-sharing machine can significantly reduce complexity. The lookup tables (LUTs) used for different rotation ang Table 2. According to Tables 3-5, the proposed work exhibits a h low-complexity hardware-oriented 2-D DCT algorithm for VLSI Finally, Table 6 compares the image quality between previous DCT alg work. It can be observed from Table 6 that the proposed novel recursi algorithm can achieve almost the same image quality as the original testing images, Lena and Kodak03, were widely applied to the image pro The 24 images from the Kodak dataset were used to run the im including the 2-D DCT based on the proposed design, quantization (qu Sun [29] 0 Lee [30] 0 This work without hardware sharing machine 0 This work with hardware sharing machine 0 In Table 5, the computing resources of computing resources needed for 1-D DCT. scale-factor of the CORDIC and the normal included for fair comparison of consistency counted the resources of the two types of fa scale factors used in the work of Sun et al. Moreover, the complexity of the shifter and th be the same. From Table 5, two observations ar hardware-sharing machine achieves the low work with a hardware-sharing machine complexity. The lookup tables (LUTs) used Table 2. According to Tables 3-5, the pro low-complexity hardware-oriented 2-D DC Finally, Table 6 compares the image quality b work. It can be observed from Table 6 that t algorithm can achieve almost the same imag testing images, Lena and Kodak03, were wide The 24 images from the Kodak dataset including the 2-D DCT based on the proposed Sun [29] Lee [30] This work without hardware sharing machine This work with hardware sharing machine In Table 5, the computing resourc scale-factor of the included for fair c counted the resour scale factors used i Moreover, the comp be the same.
From Table  hardware-sharing m work with a hard complexity. The lo Table 2. According low-complexity ha Finally, Table 6 com work. It can be obs algorithm can achi testing images, Lena This work with hardware sharing machine 0 28 11 In Table 5, the computing resources of 2-D DCT were evaluated by doubling the computing resources needed for 1-D DCT. Moreover, the required resources of the scale-factor of the CORDIC and the normalization factor of the 1-D DCT were also included for fair comparison of consistency following the work of Lee et al., which counted the resources of the two types of factors [30]. Thus, the required resources of scale factors used in the work of Sun et al. [29] and this work are listed in Table 1. Moreover, the complexity of the shifter and the complexity of the adder are assumed to be the same.
From Table 5, two observations are made. First, this work without a hardware-sharing machine achieves the lowest computing complexity. Second, this work with a hardware-sharing machine can significantly reduce the computing complexity. The lookup tables (LUTs) used for different rotation angles are listed in Table 2. According to Tables 3-5, the proposed work exhibits a high-quality and low-complexity hardware-oriented 2-D DCT algorithm for VLSI implementation. Finally, Table 6 compares the image quality between previous DCT algorithms and this work. It can be observed from Table 6 that the proposed novel recursive Loeffler DCT algorithm can achieve almost the same image quality as the original image. The two testing images, Lena and Kodak03, were widely applied to the image processing realm. The 24 images from the Kodak dataset were used to run the image processing, including the 2-D DCT based on the proposed design, quantization (quantized with the Sun [29] 0 120 Lee [30] 0 192 This work without hardware sharing machine 0 108 This work with hardware sharing machine 0 28 In Table 5, the computing resources of 2-D DCT were evaluated computing resources needed for 1-D DCT. Moreover, the required scale-factor of the CORDIC and the normalization factor of the 1-D included for fair comparison of consistency following the work of L counted the resources of the two types of factors [30]. Thus, the requi scale factors used in the work of Sun et al. [29] and this work are li Moreover, the complexity of the shifter and the complexity of the adde be the same. From Table 5, two observations are made. First, this w hardware-sharing machine achieves the lowest computing complexi work with a hardware-sharing machine can significantly reduce complexity. The lookup tables (LUTs) used for different rotation ang Table 2. 
According to Tables 3-5, the proposed work exhibits a h low-complexity hardware-oriented 2-D DCT algorithm for VLSI Finally, Table 6 compares the image quality between previous DCT alg work. It can be observed from Table 6 that the proposed novel recursi algorithm can achieve almost the same image quality as the original testing images, Lena and Kodak03, were widely applied to the image pro In Table 5, the computing resources of computing resources needed for 1-D DCT. scale-factor of the CORDIC and the normal included for fair comparison of consistency counted the resources of the two types of fa scale factors used in the work of Sun et al. Moreover, the complexity of the shifter and th be the same.
From Table 5, two observations ar hardware-sharing machine achieves the low work with a hardware-sharing machine complexity. The lookup tables (LUTs) used Table 2. According to Tables 3-5, the pro low-complexity hardware-oriented 2-D DC Finally, Table 6 compares the image quality b work. It can be observed from Table 6 that th algorithm can achieve almost the same imag testing images, Lena and Kodak03, were wide The 24 images from the Kodak dataset including the 2-D DCT based on the proposed Sun [29] Lee [30] This work without hardware sharing machine This work with hardware sharing machine In Table 5, the computing resourc scale-factor of the included for fair c counted the resour scale factors used i Moreover, the comp be the same.
From Table  hardware-sharing m work with a hard complexity. The lo Table 2. According low-complexity ha Finally, Table 6 com work. It can be obs algorithm can achi testing images, Lena The 24 images from the Kodak dataset were used to run the image processing, including the 2-D DCT based on the proposed design, quantization (quantized with the standard quantization table), zig-zag coding, and Huffman entropy coding. At the end of processing the 24 images, the compression ratio was obtained with a value of 9.86. Significant comparisons of the previously proposed designs and this work of 2-D DCT VLSI design are listed in Table 7. The synthesized gate counts of [29,30,36,37] and for this work are 27.3 k, 22.4 k, 24.6 k, 31.5 k and 8.04 k, respectively. These were obtained from the results of [30], excluding the memory. The synthesized gate count for this work is 8.04 k, while the core area of the proposed design is 75,100 µm 2 . Compared with previous works, the proposed design in this study achieved a 64.1% gate count reduction. Moreover, its operating frequency and power consumption are 100 MHz and 4.17 mW, respectively. Thus, the proposed design can save at least 22.5% power compared to that of the previous designs. The proposed image algorithm has significant improvements with low complexity, lower memory required, low hardware costs, and reduced power consumption. Moreover, the resulting image quality of the proposed algorithm is better than that of previous works [29,30,36,37]. Considering the combined effect of the PSNR, compression ratio (CR), and gate counts, a figure of merit (FOM) is introduced and defined as The FOM of this work is 38.68, where the Huffman entropy coding with the CR of 9.86 is employed. To fairly evaluate the FOMs of previous works, the same entropy coding and the same CR are used in Table 7. From Table 7, it can be observed that the FOM of this work is superior to those of previous works. 
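The encoder steps named above can be sketched as follows. This is an illustrative floating-point model: it uses the textbook 2-D DCT-II and baseline-JPEG-style uniform quantization and zig-zag scanning rather than the paper's recursive CORDIC datapath, and the FOM function mirrors the definition combining PSNR, CR, and synthesized gate count (in k gates). All function names are illustrative.

```python
import math

N = 8  # JPEG-style block size

def dct_2d(block):
    """Textbook 2-D DCT-II of an 8x8 block (list of lists of floats)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, qtable):
    """Uniform quantization with rounding, as in baseline JPEG."""
    return [[int(round(coeffs[x][y] / qtable[x][y])) for y in range(N)]
            for x in range(N)]

def zigzag(block):
    """Zig-zag scan an 8x8 block into a 64-element list."""
    order = sorted(((x, y) for x in range(N) for y in range(N)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[x][y] for x, y in order]

def fom(psnr_db, cr, gate_count_k):
    """Figure of merit: quality x compression per k gates."""
    return psnr_db * cr / gate_count_k
```

In a full encoder the zig-zag output would feed Huffman entropy coding; that step is omitted here for brevity.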
Finally, given the improvements of this design, including low costs, high compression ratios, low power consumption, low memory requirements, and high performance, the design is appropriate for WSN and IoT applications.

Conclusions
This paper proposed a new hardware-oriented VLSI design with low costs and reduced memory requirements. A high-accuracy 2-D DCT spectral analyzer with a hardware-sharing recursive Loeffler CORDIC and a novel scaling-factor generator was developed. Compared with previous DCT algorithms, the proposed algorithm offers high quality and low computing resources for VLSI implementation. Moreover, with its low memory requirements, low complexity, and high image quality, the proposed design is suitable for wireless sensor network and IoT applications.