A Hidden DCT-Based Invisible Watermarking Method for Low-Cost Hardware Implementations

This paper presents an invisible and robust watermarking method and its hardware implementation. The proposed architecture is based on the discrete cosine transform (DCT) algorithm. Novel techniques are also applied to reduce the computational cost of the DCT and color space conversion to achieve low-cost and high-speed performance. Besides, a watermark embedder and a blind extractor are implemented in the same circuit using a resource-sharing method. Our approach is compatible with various watermark embedding ratios, such as 1/16 and 1/64, with a PSNR of over 45 dB and an NC value of 1. After Joint Photographic Experts Group (JPEG) compression with a quality factor (QF) of 50, our method can achieve an NC value of 0.99. Results from Design Compiler (DC) with TSMC 90 nm CMOS technology show that our design can achieve a frequency of 2.32 GHz with an area consumption of 304,980.08 µm² and a power consumption of 508.1835 mW. For the FPGA implementation, our method achieved a frequency of 421.94 MHz. Compared with the state-of-the-art works, our design improved the frequency by 4.26 times, saved 90.2% on area and increased the power efficiency by more than 1000 fold.


Introduction
With the rapid development of the Internet and social media, the spread of digital photos around the world is becoming more and more convenient. Therefore, copyright protection is of concern. Digital watermarking is the process of hiding information in signals such as image, text, video and audio, and is used mainly for the copyright protection of digital content [1].
In recent years, many researchers have been focusing on image watermarking techniques because still images are shared widely throughout the Internet, and many image watermarking methods can also be applied to video watermarking [2]. Generally, image watermarking can be processed in two different domains: the spatial domain [3–6] and the transform domain [1,7–17]. In the spatial domain, watermark data are directly embedded into the pixel values of the host image [18]. Although spatial-domain watermarking methods are easy to implement and usually require fewer computational resources, they are not robust enough against attacks such as JPEG compression and geometric attacks [19]. As for the transform domain, image watermarking methods are mostly carried out in the discrete cosine transform (DCT) domain [1,7,13], the discrete Fourier transform (DFT) domain [20–22] or the discrete wavelet transform (DWT) domain [9,22–26]. Watermarking in the transform domain is often based on the human vision system (HVS) model [2,4,10,27,28] to achieve invisibility and robustness. For example, human vision is more sensitive to low and middle frequency information [18], so most image compression techniques remove the imperceptible high-frequency components to save space. As a result, watermarks embedded in the low or middle frequencies are more robust against attacks. For instance, in [16], a watermarking technique that adjusts the DCT low-frequency coefficients through the concept of mathematical remainders was proposed; it enables the watermark to be almost fully extracted even under very high compression ratios.
In this paper, we propose a mixed-strength watermark-embedding approach based on the HVS model and the DCT approach. Theoretically, after the DCT, three types of coefficient can be chosen:
1. If the DC component is chosen, as in this paper, then there is a risk of changing the contrast of some blocks, as the DC is an average of luminance values.
2. If the low or mid-range frequency coefficients are chosen, the image quality could be affected because the human eye is especially sensitive to these frequencies.
3. If high frequencies are used, they are likely to be removed at compression time.
By specifying the pixel changing values ({0, ±1} or {0, ±1, ±2}) before embedding, we can strictly control the changes in the DC components, so visible block artifacts can be avoided in advance. To eliminate the errors introduced by the DCT and IDCT, the optimized workflow conducts them implicitly, which also saves computational effort. In addition, 5/6 of the color space conversion can be omitted using our DC component-based approach. As a result, the proposed approach has an obvious advantage over previous works after hardware implementation. Compared with the state-of-the-art works, our design improves the clock frequency 4.26 fold [3], saves 90.2% on area and increases the power efficiency by over 1000 fold [29]. Meanwhile, by changing the threshold and embedding strength in our approach, users can easily adapt the original method to applications with different demands. For example, users can choose a low threshold and high embedding strength to increase robustness, or a low embedding strength to pursue invisibility and high image quality.
The remainder of this paper is as follows. In Section 2, we list the related works and the motivation for our work. Section 3 presents some background by way of basic concepts. In Section 4, the proposed hidden DCT-based invisible watermarking method is introduced. Section 5 presents experimental results of the proposed approach and attack tests. In Section 6, we present the detailed hardware implementation results for both ASIC and FPGA platforms. The elaborated comparison with the state-of-the-art works is shown in Section 7. Section 8 gives a brief conclusion of our work.

Related Works and Motivation
Despite the advantages brought by transform-domain watermarking methods, the computational overheads make them hard to implement in many applications, especially on mobile phones where the computational resources are limited. Hence, many researchers have been devoted to inventing low-cost, robust watermark embedding methods. Some researchers focus on improvements in the spatial domain, and some others use novel approaches to reduce the computational burden of transform-domain watermarking.
In [29], the authors proposed an optimized image and video watermarking method using the spatial domain for low-cost applications such as wireless networks. A noise visibility function (NVF)-based mask was adopted in that paper, and the hardware implementation results show significant improvements over previous works in terms of resource utilization and power consumption. However, the highest clock frequency in 90 nm CMOS technology is relatively low due to the long critical path of the divider. Additionally, one divider is shared across the whole process to save computational resources, which limits the throughput of the design.
In [12], the authors used a novel DCT-based approach to achieve fast and robust watermarking. They calculated the DCT coefficient of a specific location, and watermarked bits were embedded by directly modifying the pixel values without the full-frame DCT. Test results proved that the proposed method performs well against noise attacks and compression attacks. However, a detailed hardware implementation was not presented, so its advantages over previous works in terms of area and power consumption remain unknown.
The spatial-domain watermarking and transform-domain watermarking techniques are both widely used in state-of-the-art works. Compared with transform-domain watermarking, the spatial-domain watermarking approaches are very simple and computationally efficient. However, they have relatively low information-hiding capacities and limited robustness to normal media operations, such as filtering or lossy compression. Spatial-domain watermarking approaches also have limited defense against cropping, during which some watermark information is lost [2]. The main focus of a watermarking technique is to achieve robustness and high power efficiency at the same time.
In this paper, a hidden DCT-based invisible watermarking method for low-cost hardware implementation is proposed. Traditionally, DCT-based watermarking methods are supposed to insert the watermark bits in low or middle frequencies. However, the authors of [30] argued that direct current (DC) components are more suitable for watermarking for the following reasons. Firstly, the magnitudes of DC components are much larger than those of any alternating current (AC) components in general, which makes the DC components capable of containing more watermarking information. Secondly, DC components are more robust to attacks such as data compression than the AC components. Meanwhile, the challenges of embedding watermarks in DC components of the DCT can be summarized as follows:
1. Compared to doing so with AC components, embedding watermarks into the DC components of the DCT is believed to cause visible block artifacts [30], thereby compromising the image quality. Hence, the first challenge is to embed the watermark in DC components without bringing visible changes to the original picture. Theoretically, any watermarking method would change the pixel values of the target image; otherwise the watermark could not be inserted. From this perspective, if the changing values are constrained to a certain small range, embedding the watermark in the DC components will not necessarily cause more visible block artifacts than any other method.
2. For the hardware implementation, the fixed-point DCT and IDCT are not completely reversible. DCT matrix elements contain real numbers represented by finite numbers of bits, which inevitably introduce truncation and rounding errors during computation [31]. Conventionally, the longer the bit length, the more accurate the results [32,33]. In general, a longer bit length for the DCT coefficients leads to higher energy consumption during the DCT compression process [34]. Even if we do not apply any changes after the DCT and carry out the IDCT directly, pixel values would also change by between 1 and 2 bits. Watermarks could be polluted by the errors introduced by the DCT and IDCT. Hence, the second challenge is how to optimize the watermark workflow to minimize or eliminate the errors in hardware implementation.

Introduction to Basic Concepts
In this section, we explain some fundamental concepts, including the DCT algorithm and the conventional DCT-based approach for digital watermarking. Meanwhile, the drawbacks of the regular approach will also be analyzed in this section.

The DCT Algorithm
The DCT algorithm is analogous to the discrete Fourier transform (DFT), but it only involves real numbers. For most natural signals (sounds and images), their energies are concentrated in the low-frequency domain after discrete cosine transformation, which is the principle of JPEG (Joint Photographic Experts Group) compression. Hence, the DCT algorithm is widely used in image processing.

A Conventional DCT-Based Watermarking Approach
DCT-based watermarking should have the qualities of robustness and invisibility [16]. The typical workflow [1,7] of DCT-based watermark insertion is summarized in Figure 1. For example, an original image of 512 × 512 pixels is first divided into a series of 8 × 8 RGB blocks. After color space conversion, the Y channel of each 8 × 8 block is transformed into the frequency domain using the DCT. Meanwhile, the original watermark is prepared and encrypted for embedding. Secondly, the watermark embedder inserts the encrypted watermark data into the 8 × 8 block according to the texture, frequency and direction of the image block. If the size of the watermark is 64 × 64, then one pixel of the watermark is inserted in each 8 × 8 block. Finally, the embedded Y channel is transformed back into the spatial domain and converted into the RGB color space along with the stored U and V channels. Traditional DCT-based watermarking techniques have the following three drawbacks:
1. The tradeoff between robustness and invisibility: According to the HVS theory, people cannot tell the difference between two pictures as long as the changes of pixel values are under a certain threshold. In the same way, there is also a threshold when it comes to watermarking. When the embedding strength exceeds the threshold, changes of pixel values are visible and the picture begins to appear distorted. However, if the embedding strength is not high enough, the embedded watermark may not be extracted successfully because of the changes introduced to the picture during color space conversion and DCT/IDCT. Consider the formulas for converting RGB signals to YUV signals:
$$\begin{aligned} Y &= 0.299R + 0.587G + 0.114B \\ U &= -0.1687R - 0.3313G + 0.5B + 128 \\ V &= 0.5R - 0.4187G - 0.0813B + 128 \end{aligned} \tag{4}$$
Coefficients such as −0.3313 are not represented precisely due to the bit length limits in digital signal processing systems. Pixel values may therefore differ after color space conversion, let alone after the DCT and IDCT processes. Hence, it is necessary to strike a balance between robustness and invisibility when using DCT-based watermarking methods, which restricts the applicable scenarios.

2. Computational complexity: Figure 1 reveals that each 8 × 8 RGB block needs to be converted twice between color spaces and twice between the spatial and frequency domains. Nine multiplications and eight additions are required for each pixel to complete the conversion from RGB to YUV judging from Equation (5). For the 8 × 8 2D-DCT, Equation (3) can be represented in matrix form as:
$$F = A\,f\,A^{T}$$
where
$$A(i,j) = c(i)\cos\frac{(2j+1)\,i\,\pi}{16}, \quad i, j = 0, 1, \dots, 7$$
and
$$c(i) = \begin{cases} \sqrt{1/8}, & i = 0 \\ \sqrt{2/8}, & i \neq 0 \end{cases}$$
For each coefficient matrix, 64 multiplications and 56 additions are needed. Therefore, 128 multiplications and 112 additions are needed to complete an 8 × 8 2D-DCT. The total numbers of calculations needed for changing color spaces, spatial and frequency domain can be summarized as follows: Additionally, the computational complexity increases dramatically when the picture's size increases. Take smart phones, for example: the most advanced cell phones have up to 100 megapixel sensors. If we want to add watermarks to the original pictures taken by these cell phones with 8 × 8 RGB blocks, more than 312 million multiplications and additions are required.

3. Unsuitability for higher embedding ratios: Although the 1/64 embedding ratio is mostly applied in current works, a higher embedding ratio such as 1/16 allows us to add more information to the host picture. Current DCT-based watermarking techniques usually add the watermark data to the low or middle frequency bands [18] after the DCT. This works well for the 1/64 embedding ratio but fails to achieve satisfactory results when the embedding ratio is increased to 1/16. It can be deduced that under the 1/16 embedding ratio, the accuracy losses brought by the DCT and IDCT are more severe, leading to failure to extract a valid watermark.
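The first drawback above (pixel values changing during color space conversion alone) can be illustrated with a minimal Python sketch. It assumes the JFIF-style coefficients, of which −0.3313 is the one quoted in the text, and models only 8-bit rounding and saturation of the intermediate values; coefficient quantization in a real fixed-point datapath would add further error. The helper names `clamp8`, `rgb_to_yuv`, `yuv_to_rgb` and `round_trip` are hypothetical.

```python
def clamp8(x):
    """Round to the nearest integer and saturate to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(x))))

def rgb_to_yuv(r, g, b):
    # Assumed JFIF-style forward conversion with 8-bit outputs.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    v = 0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return clamp8(y), clamp8(u), clamp8(v)

def yuv_to_rgb(y, u, v):
    # Assumed JFIF-style inverse conversion; every Y coefficient is 1.
    r = y + 1.402 * (v - 128)
    g = y - 0.34414 * (u - 128) - 0.71414 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp8(r), clamp8(g), clamp8(b)

def round_trip(rgb):
    """Convert RGB -> YUV -> RGB with 8-bit storage at every stage."""
    return yuv_to_rgb(*rgb_to_yuv(*rgb))

print(round_trip((255, 0, 0)))      # pure red does not survive: (254, 0, 0)
print(round_trip((128, 128, 128)))  # mid-gray survives: (128, 128, 128)
```

A one-level error of this kind is of the same order as the {0, ±1} changes used for weak embedding later in the paper, which is why avoiding the round trip matters.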

The Proposed Method
In this section, the hidden DCT-based invisible watermarking method is proposed. Detailed analysis and proof of our method are presented.

The Proposed Watermarking Method
As stated before, a low or middle frequency watermarking method is not suitable for higher embedding ratios. Besides a low or middle frequency, watermarking information can also be inserted into the DC component of a picture block. Our proposed method first extracts the DC component. Then, the encrypted watermark is embedded into each picture block according to the texture property of the image region. If the picture block belongs to a flat region, weak embedding is adopted in order to make the watermark invisible. If the picture block belongs to a texture region, strong embedding is adopted in order to make the watermark robust. At last, the change to the DC component brought about by either weak or strong embedding is reflected in RGB channels of the final image.
The workflow of our proposed method can be seen in Figure 2. Compared with the traditional DCT-based process in Figure 1, our method removes the DCT and IDCT steps and the conversion of 5/6 of the color space (our method only needs the RGB to Y transformation).

Elimination of DCT and IDCT
The DC component is usually calculated after the picture is transformed into the frequency domain using the DCT, which increases the overall computational cost. However, after analyzing the equations used to calculate the DC component, we found that the DCT step can be hidden. According to the definition of the direct current component and Equations (1) and (2), the DC value of an M × N block is calculated through the following equation:
$$DC = F(0,0) = \frac{1}{\sqrt{MN}}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} Y(x,y) \tag{9}$$
Theoretically, the proposed approach can be applied to any kind of rectangular block. For convenience, the N × N 2D-DCT is used to demonstrate our approach, so M in (9) is replaced with N to get the following equation:
$$DC = \frac{1}{N}\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} Y(x,y) \tag{10}$$
where Y(x, y) stands for each pixel's Y channel value.
In this way, after converting RGB signals to Y channels, it is effortless to get the DC component by summing up the Y channels of the N × N block without going through the computationally intensive DCT formula.
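The claim that the DC component is just the block sum divided by N can be checked numerically. The sketch below assumes the orthonormal 2D DCT-II (the normalization under which the DC term equals sum/N); the function names `dct2` and `dc_by_sum` and the sample block are illustrative only.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of an N x N block (direct O(N^4) evaluation)."""
    n = len(block)

    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out

def dc_by_sum(block):
    """DC component obtained without any transform: sum of the block over N."""
    n = len(block)
    return sum(sum(row) for row in block) / n

# Arbitrary 4 x 4 block of Y values used only for illustration.
block = [[52, 55, 61, 66],
         [70, 61, 64, 73],
         [63, 59, 55, 90],
         [67, 61, 68, 104]]

print(dc_by_sum(block))    # 267.25
print(dct2(block)[0][0])   # same value up to floating-point error
```

The direct sum needs N² − 1 additions and one scaling, whereas the full transform is evaluated here only to confirm the equality.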
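Watermark information is added to the DC component in our design. Assuming the change to the DC component is ∆M, the DC value after the insertion is
$$DC' = DC + \Delta M \tag{11}$$
Now we need to use the 2D-IDCT to calculate the corresponding Y channel values after adding the watermark. From (3), we can get that
$$Y'(x,y) = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} c(u)\,c(v)\,F'(u,v)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N} \tag{12}$$
with $c(0) = \sqrt{1/N}$ and $c(u) = \sqrt{2/N}$ for $u > 0$. Since the watermark information is only added to the DC component, it can be concluded that
$$F'(u,v) = \begin{cases} F(u,v) + \Delta M, & u = v = 0 \\ F(u,v), & \text{otherwise} \end{cases} \tag{13}$$
where F′(u, v) stands for the watermarked value and F(u, v) stands for the original value without the watermark. After combining (12) and (13), and noting that the DC term contributes c(0)c(0)F′(0, 0) = F′(0, 0)/N to every pixel, the following equation can be inferred:
$$Y'(x,y) = Y(x,y) + \frac{\Delta M}{N} \tag{14}$$
Hence, the IDCT can be saved by adding ∆M/N to the original Y channel values. In conclusion, the proposed hidden DCT-based approach can be used to embed the watermark into the DCT domain without actually carrying out the DCT and IDCT operations.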
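As a numerical check of this shortcut, the following sketch compares the explicit path (DCT, change the DC coefficient by ∆M, IDCT) with simply adding ∆M/N to every pixel. It assumes the orthonormal DCT-II in matrix form; the names `embed_via_dct` and `embed_directly` are hypothetical and the block values are arbitrary.

```python
import math

def _basis(n):
    """Orthonormal DCT-II matrix A with A[u][x] = c(u) cos((2x+1) u pi / 2n)."""
    return [[(math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * x + 1) * u * math.pi / (2 * n))
             for x in range(n)] for u in range(n)]

def _matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def _transpose(a):
    return [list(row) for row in zip(*a)]

def dct2(block):
    a = _basis(len(block))
    return _matmul(_matmul(a, block), _transpose(a))    # F = A f A^T

def idct2(coeffs):
    a = _basis(len(coeffs))
    return _matmul(_matmul(_transpose(a), coeffs), a)   # f = A^T F A

def embed_via_dct(block, delta_m):
    """Explicit path: transform, change only the DC coefficient, transform back."""
    f = dct2(block)
    f[0][0] += delta_m
    return idct2(f)

def embed_directly(block, delta_m):
    """Hidden-DCT path: add delta_m / N to every pixel."""
    n = len(block)
    return [[p + delta_m / n for p in row] for row in block]

block = [[52, 55, 61, 66],
         [70, 61, 64, 73],
         [63, 59, 55, 90],
         [67, 61, 68, 104]]

a = embed_via_dct(block, 8)
b = embed_directly(block, 8)
max_diff = max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))
print(max_diff < 1e-9)   # True: the two paths agree up to rounding
```

In floating point the two paths agree to rounding error; in fixed-point hardware only the direct path is exact, which is the motivation given in the text.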

Color Space Conversion
The watermark was inserted into the Y channel in the last step; now we need to convert YUV to RGB. For example, YUV can be converted to RGB using the following formula:
$$\begin{aligned} R &= Y + 1.402\,(V - 128) \\ G &= Y - 0.34414\,(U - 128) - 0.71414\,(V - 128) \\ B &= Y + 1.772\,(U - 128) \end{aligned} \tag{15}$$
From (15), it is obvious that the coefficients of Y in the R, G and B channels all equal 1. This indicates that the changes to Y values brought about by the watermark are reflected equally in the R, G and B channels. Hence, after calculating ∆M/N from the watermark, we can directly add it to the R, G and B channels:
$$R' = R + \frac{\Delta M}{N}, \qquad G' = G + \frac{\Delta M}{N}, \qquad B' = B + \frac{\Delta M}{N} \tag{16}$$
without first applying it to the Y channel and then converting the color space.
Since the watermark information can be directly added to R, G and B channels, U and V channels are no longer necessities. In this way, we only need 1/3 of the calculations listed in (4) to get the Y channel values in order to extract the DC component. The computational cost of (15) can also be omitted. In total, 5/6 of the computational cost can be saved using our approach when compared with traditional methods in terms of color space conversion. What is more, the calculation errors caused by color space conversion can be avoided too.
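The equivalence used here (adding the same offset to R, G and B shifts Y by exactly that offset) follows from the Y weights summing to 1. A minimal check, assuming the common 0.299/0.587/0.114 luma weights and ignoring saturation at 0 and 255; `rgb_to_y` and `add_offset` are hypothetical helper names:

```python
def rgb_to_y(r, g, b):
    """Luma only; the U and V channels are never computed in this path."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def add_offset(rgb, delta):
    """Apply the same watermark offset to the R, G and B values of a pixel."""
    return tuple(c + delta for c in rgb)

pixel = (120, 64, 200)
delta = 2

y_before = rgb_to_y(*pixel)
y_after = rgb_to_y(*add_offset(pixel, delta))
print(abs((y_after - y_before) - delta) < 1e-9)   # True
```

Only three multiplications per pixel are needed for the Y channel, against nine for a full RGB to YUV conversion and nine more for the way back, which is consistent with the 5/6 saving stated above.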

The Watermark Embedding and Blind-Extracting Approach
In order to balance robustness and imperceptibility, a mixed-strength watermark embedding approach is proposed. According to (16), the watermark can be directly added to the R, G and B channels. Considering that value changes in the RGB color space can only be integers, we introduce a strong-embedding approach, which changes the RGB channels by δ ∈ {0, ±1, ±2}, and a weak-embedding approach, which changes them by δ ∈ {0, ±1}. Modular arithmetic is also used for blind extraction:
$$W = \begin{cases} 0, & \operatorname{mod}(DC', F) \in [0, F/2) \\ 1, & \operatorname{mod}(DC', F) \in [F/2, F) \end{cases} \tag{17}$$
where F is a positive integer used to distinguish the value of the watermark pixel added to a certain N × N block, and DC′ is the DC value of the watermarked block. For any block that does not meet the requirement in (17), ∆M is combined with the watermark information and added to the original DC value.
• For strong embedding, we aim to get a δ ∈ {0, ±1, ±2}. According to (16), the range of ∆M_s can be obtained:
$$-\frac{5N}{2} \le \Delta M_s \le \frac{5N}{2} \tag{18}$$
Although δ ∈ {±3} when ∆M_s = ±5N/2, ∆M_s varies in a continuous range, so the probability of ∆M_s = ±5N/2 tends toward zero and does not affect the distribution of δ.
F is set as 5N, and the following equation is used to calculate ∆M_s to meet (17) and (18):
$$\Delta M_s = \begin{cases} \frac{5N}{4} - r, & W = 0,\ r \in [0, \frac{15N}{4}) \\ \frac{25N}{4} - r, & W = 0,\ r \in [\frac{15N}{4}, 5N) \\ -\frac{5N}{4} - r, & W = 1,\ r \in [0, \frac{5N}{4}) \\ \frac{15N}{4} - r, & W = 1,\ r \in [\frac{5N}{4}, 5N) \end{cases} \tag{19}$$
where r stands for mod(DC, 5N). From (19), the values of mod(DC′, 5N) are constrained to 5N/4 and 15N/4 after modification, which are the mid-points of the two judgment intervals in (17). This increases the robustness of the watermark: after an attack, the remainder is more likely to fall into the right judgment interval during extraction if it was placed at a mid-point when embedding. In (19), the range of r is divided into three sections labelled S1, S2 and S3 in Figure 3. Depending on the value of the watermark bit, different ∆M_s values are added to the original r to move it to the nearest target according to the principle of proximity. Take (19), for example: when W = 1 and r ∈ [0, 5N/4), r does not meet (17). Although r could be moved to 15N/4, the changes brought to the pixels would be bigger than when moving it to −5N/4. Hence, ∆M_s = −r − 5N/4 is added to the original DC value to move the remainder to its nearest target (the red trajectory in Figure 3), and the updated r will be:
$$r' = \operatorname{mod}(DC + \Delta M_s, 5N) = \operatorname{mod}\!\left(-\frac{5N}{4}, 5N\right) = \frac{15N}{4} \tag{20}$$
and
$$|\Delta M_s| = r + \frac{5N}{4} < \frac{5N}{2} \tag{21}$$
which meets the requirements of (17) and (18).
• For weak embedding, δ should be in the range of {0, ±1}. The range of ∆M_w will be:
$$-\frac{3N}{2} \le \Delta M_w \le \frac{3N}{2} \tag{22}$$
F is set as 3N, and the following equation is used to calculate ∆M_w to meet (17) and (22):
$$\Delta M_w = \begin{cases} \frac{3N}{4} - r, & W = 0,\ r \in [0, \frac{9N}{4}) \\ \frac{15N}{4} - r, & W = 0,\ r \in [\frac{9N}{4}, 3N) \\ -\frac{3N}{4} - r, & W = 1,\ r \in [0, \frac{3N}{4}) \\ \frac{9N}{4} - r, & W = 1,\ r \in [\frac{3N}{4}, 3N) \end{cases} \tag{23}$$
where r stands for mod(DC, 3N).
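The embedding and blind-extraction rule described in the two bullets above can be sketched as follows. This is one reading of the text, not the authors' implementation: it assumes that W = 1 maps to the upper judgment interval [F/2, F) (as implied by the W = 1 example), that the targets are F/4 and 3F/4, and that the per-pixel step is ∆M/N rounded to the nearest integer. The names `delta_m`, `pixel_step` and `extract`, and the parameter `k` (5 for strong, 3 for weak, so that F = kN), are introduced here for illustration.

```python
import math

def delta_m(dc, w, n, k):
    """DC change that moves mod(dc, F) to the nearest copy of the target
    remainder, where F = k * n (k = 5: strong embedding, k = 3: weak)."""
    f = k * n
    target = 3 * f / 4 if w == 1 else f / 4   # mid-point of the interval for w
    r = dc % f
    d = (target - r) % f                      # candidate move in [0, F)
    if d >= f / 2:                            # prefer the shorter, negative move
        d -= f
    return d

def pixel_step(dm, n):
    """Integer change applied to every pixel: dm / n rounded to nearest."""
    return math.floor(dm / n + 0.5)

def extract(dc, n, k):
    """Blind extraction: only the watermarked DC value is needed."""
    f = k * n
    return 1 if (dc % f) >= f / 2 else 0

# Example with N = 4 (F = 20 for strong embedding): DC = 100 gives r = 0.
dm = delta_m(100, 1, 4, 5)        # -5.0, i.e. -r - 5N/4
step = pixel_step(dm, 4)          # -1 per pixel
dc_marked = 100 + step * 4        # adding `step` to all N*N pixels moves DC by step*N
print(dm, step, extract(dc_marked, 4, 5))
```

Because the per-pixel step is rounded, the remainder lands within N/2 of the mid-point rather than exactly on it; the judgment intervals are 5N/4 (strong) or 3N/4 (weak) wide on each side, so the bit is still recovered in the absence of attacks.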
The strong-embedding approach is more robust than the weak-embedding approach because it brings more changes to the original picture block, which means that the inserted watermark is more likely to survive attacks. However, the maximum difference in the applied change between two blocks equals four in the strong-embedding approach. For some picture blocks with small luminance or color differences, strong embedding will cause visible changes and reduce the image quality, as shown in Figure 4. To solve this problem, a mixed-strength watermark embedding approach is adopted. Parameter L is introduced to represent the image texture of a picture block, and L is defined as follows:
$$L = \max_{x,y} Y(x,y) - \min_{x,y} Y(x,y) \tag{24}$$
We set a threshold T to determine whether to use the strong-embedding approach or the weak-embedding approach:
$$\text{embedding} = \begin{cases} \text{strong}, & L > T \\ \text{weak}, & L \le T \end{cases} \tag{25}$$
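The strength selection can be sketched in a few lines. The sketch assumes that the texture measure L is the difference between the largest and smallest Y value in the block, which is how the hardware section later describes the grain analysis module (maximum, minimum, then one subtraction); `texture` and `choose_strength` are hypothetical names.

```python
def texture(y_block):
    """Texture measure L: spread between the brightest and darkest Y value."""
    flat = [p for row in y_block for p in row]
    return max(flat) - min(flat)

def choose_strength(y_block, threshold):
    """Strong embedding for textured blocks (L > T), weak embedding otherwise."""
    return "strong" if texture(y_block) > threshold else "weak"

flat_block = [[100, 101, 100, 102]] * 4        # nearly uniform region
textured_block = [[20, 180, 35, 240]] * 4      # strong local contrast

print(choose_strength(flat_block, 16))         # weak
print(choose_strength(textured_block, 16))     # strong
print(choose_strength(textured_block, 255))    # weak: T = 255 disables strong embedding
```

With 8-bit Y values L never exceeds 255, so T = 255 gives pure weak embedding, while T = 0 uses strong embedding for every block that is not perfectly flat.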

Experiment Results
In this section, we report experiments of the proposed method conducted on color images. All the test images were 512 × 512 pixels in size. Watermark patterns were set as 32 × 32, 64 × 64 and 128 × 128 pixels (shown in Figure 5) to test our approach at 1/256, 1/64 and 1/16 embedding ratios, respectively. In order to provide objective judgement, metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) and normalized correlation (NC) were calculated. PSNR was used to evaluate the image quality of the watermarked pictures, and it is defined as:
$$PSNR = 10\log_{10}\frac{255^2}{MSE}, \qquad MSE = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\bigl(X(i,j) - Y(i,j)\bigr)^2 \tag{26}$$
where W × H represents the image size. X(i, j) and Y(i, j) represent the original image and the watermarked image, respectively.
SSIM is a criterion that reflects the structural similarity between two pictures and is defined as:
$$SSIM = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{27}$$
where µ_x and µ_y represent the mean values of the original and watermarked pictures, σ_x² and σ_y² represent their variances, and σ_xy represents the covariance between them. C_1 and C_2 are constants added to avoid unstable results; C_1 = (k_1 L)² and C_2 = (k_2 L)², with k_1 = 0.01 and k_2 = 0.03. L is the dynamic range of a pixel value, which was 255 in our experimental scenario. SSIM is within [0, 1], and it is 1 when two pictures are exactly the same.
NC represents the similarity between the original watermark and the extracted watermark. The definition of NC is as follows: where W_1 represents the original watermark and W_2 represents the extracted watermark. NC ranges from 0 to 1, and the extraction result is the best when NC equals 1.

Embedding Strength Tests
We first needed to determine the value of T in (25). For strong embedding with δ ∈ {0, ±1, ±2}, the following equation can be inferred:
$$P(\delta = k) = \frac{1}{5}, \qquad k \in \{0, \pm 1, \pm 2\} \tag{29}$$
where P stands for probability. Hence, the mean squared error (MSE) between the original and watermarked pictures is (0 + 1 + 1 + 4 + 4)/5 = 2. According to (26), the theoretical PSNR of the strong-embedding approach can be calculated:
$$PSNR_s = 10\log_{10}\frac{255^2}{2} \approx 45.12 \tag{30}$$
Similarly, the theoretical PSNR of the weak-embedding approach, for which the MSE is 2/3, is:
$$PSNR_w = 10\log_{10}\frac{255^2}{2/3} \approx 49.89 \tag{31}$$
Different thresholds for the 1/16 embedding ratio were tested, and the results are listed in Table 1. It is obvious that with the increase of the threshold (from 0 to 255, where 0 means pure strong embedding and 255 means pure weak embedding), both PSNR and SSIM improved. PSNR, which is one of the most important criteria in watermarking, saw a significant improvement. Although a satisfying PSNR of 45.1540 was achieved when the threshold was set as 0, we could still observe some visible changes in the watermarked images compared with the original images. As shown in Figure 4a,b, the color block marked with a red box has more mosaics in the processed image. In addition, Lena was already a relatively low-resolution and poor-quality image; the color bumps caused by pure strong embedding will be more severe in high-quality pictures. However, weak embedding cannot resist attacks effectively. In Table 1, the results for compressed watermarked images (QF = 80) are presented. When the embedding strength was decreased, the NC value gradually dropped from 0.9695 to 0.7994. We can also see from Figure 4c-g that the watermarks extracted after JPEG compression (QF = 80) had better quality when the embedding strength was guaranteed. The extracted watermark could hardly be recognized if the threshold was set as 255. To strike a balance between robustness and invisibility, the threshold was set as 16 in the following tests. The threshold can be adjusted for application scenarios with different demands; 16 was selected only for the demonstration.
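The theoretical PSNR figures can be recomputed with a short sketch. It assumes, as the MSE of 2 quoted above implies, that every allowed pixel change is equally likely, and that the peak value is 255; `expected_mse` and `psnr_from_mse` are hypothetical helper names.

```python
import math

def expected_mse(steps):
    """Mean squared pixel change when every step in `steps` is equally likely."""
    return sum(s * s for s in steps) / len(steps)

def psnr_from_mse(mse, peak=255):
    """PSNR in dB for a given mean squared error and 8-bit peak value."""
    return 10 * math.log10(peak ** 2 / mse)

strong_mse = expected_mse([-2, -1, 0, 1, 2])   # 2.0
weak_mse = expected_mse([-1, 0, 1])            # 2/3

print(round(psnr_from_mse(strong_mse), 2))     # 45.12
print(round(psnr_from_mse(weak_mse), 2))       # 49.89
```

Measured values for a mixed threshold are expected to fall between these two bounds, since each block uses one of the two step sets.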

Attack Tests
Although low-embedding-ratio watermarking (Figure 5a) can already meet the requirements for copyright protection, higher-resolution watermark patterns (Figure 5c) make the evidence more convincing. Therefore, attack experiments were conducted to test the robustness of our proposed method under different embedding ratios. The testing results are listed in Table 2. JPEG compression is one of the most common attacks seen on the Internet these days. The quality factor indicates the compression ratio: the lower the QF, the more the original picture is compressed. From Table 2, we can see that the NC results were best when the size of the watermark pattern was 32 × 32, representing a 1/256 embedding ratio. With increasing watermark size, the NC dropped under all compression ratios, but it was still 0.7732 when QF = 50 with a 1/16 embedding ratio. For salt and pepper noise, we can guarantee an NC of over 0.8132 when ρ = 0.005 (ρ represents the noise density). Extraction results were still over 0.7314 when ρ was increased to 0.01. However, even when ρ = 0.005, the image was significantly polluted, making it completely useless for attackers. We also tested Gaussian noise with variances (σ) of 0.001 and 0.002. The results were weaker under the 1/16 embedding ratio. In particular, when σ = 0.002, the NC value dropped to 0.5365. The reason why the method did not work so well at high embedding ratios is that the applied Gaussian noise had a mean value of zero. When the embedding ratio was 1/256, one bit of the binary watermark was added to a 16 × 16 block, in which the Gaussian noise had a mean value closer to zero than in a 4 × 4 block. Hence, the proposed method did not perform well at high embedding ratios after Gaussian noise attacks. However, the results were still quite good for low embedding ratios, with an NC value of 0.9833 when σ = 0.002. Higher variances were also tested under the 1/256 embedding ratio. We can still guarantee an NC value of 0.7014 when σ = 0.009, although the attacked image is completely ruined. Geometric attacks were also tested. As the watermark is randomly distributed in the host image, different kinds of cropping have similar effects on the watermarked image. The NC value will almost equal the percentage of the picture remaining. In our testing cases, the NC values fluctuated around 0.75 when 1/4 of the picture was cropped. To test the scaling attack, we first scaled the picture according to the scale factor and then rescaled it to its original size. Scaling 0.5 in Table 2 means that the original 512 × 512 watermarked picture was first scaled to the size of 256 × 256. Testing results met the expectations in scaling tests, providing NC values over 0.8898 under different embedding ratios even when the picture was scaled to 1/4 of the original picture.

Discussion
From the attack tests, we can see that although our approach provided satisfying results overall, it did not perform so well in some extreme cases. For example, after JPEG compression, the NC value was 0.7732 when QF = 50 under a 1/16 embedding ratio, whereas the NC in [18] approached 0.9999. Additionally, the results under Gaussian noise attacks were not good enough compared with other state-of-the-art works such as [18,35]. Although the embedding strength can be adjusted by changing the threshold to increase robustness, the NC values cannot be improved much. Even in pure strong embedding, the PSNR value still has a theoretical limit of 45.121 from (30). Robustness and invisibility are relatively antagonistic: if we want to further improve robustness, invisibility will be compromised.

Details of the Hardware Implementation
The block diagram of the proposed hardware design is shown in Figure 7a. The overall processing flow stays the same as that illustrated in Section 4. The 4 × 4 RGB data and the one-bit watermark information arrive synchronously and are stored in the buffer for further use. Meanwhile, the 4 × 4 RGB data are sent to the rgb_to_y module to complete the color space conversion according to (4). Then, the watermark_prepare module calculates the values of mod(DC, 12) and mod(DC, 20) by replacing N with 4, as in Section 4, for weak and strong embedding, respectively. The image texture of the 4 × 4 block is also calculated in this module and is compared with the threshold to determine the embedding strength of this block. Finally, the selected remainder and the threshold comparison result are sent to the watermark_process module. For example, if L > T (25), the strong-embedding approach is adopted: the T_result in Figure 7a must be 1 and rmd will be the value of mod(DC, 20). As the main computing operators used in embedding and extracting are the same (the rgb_to_y module and the watermark_prepare module can be used for both), we made our hardware design configurable in embedding mode and extracting mode with a mode selection signal. In the watermark_process module, either embedding or extracting is carried out according to the mode selection signal (1 for embedding and 0 for extracting). According to the detailed diagram of the watermark_process module in Figure 7b, T_result and rmd from the watermark_prepare module are first sent to the mod_sel module to complete the bypass. In embedding mode, T_result and rmd are sent to the embed module. Operations similar to (19) or (23) are executed in the ∆M calculation module to calculate ∆M, which is added to the original RGB channels with 48 adders (one adder for every R/G/B pixel value). As the value added to each pixel is constrained to {0, ±1, ±2}, its length is 3 bits. Hence, the two inputs of the adders in the embed module are 8 bits and 3 bits, respectively. Additionally, some extra logic components are required to make sure the computing results do not exceed [0, 255]. In extracting mode, the operations in (17) are executed and the extracted one-bit watermark information is obtained (marked as W_extract in Figure 7b). Some other optimizations are also adopted in the hardware implementation.
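The extra logic that keeps results inside [0, 255] behaves like a saturating adder. The sketch below is a behavioral model only, assuming that out-of-range results are clamped rather than wrapped (the text does not say which); `saturating_add` and `embed_block` are hypothetical names.

```python
def saturating_add(pixel, delta):
    """Add a small signed step to an 8-bit value and clamp the result to [0, 255]."""
    return max(0, min(255, pixel + delta))

def embed_block(rgb_values, delta):
    """Apply the same step to all 48 R/G/B values of a 4 x 4 block."""
    return [saturating_add(v, delta) for v in rgb_values]

print(saturating_add(254, 2))    # 255, not 256
print(saturating_add(1, -2))     # 0, not -1
print(saturating_add(100, -1))   # 99
```

Clamping means that blocks containing values at the extremes receive a slightly smaller DC change than intended, which is one of the places where an extracted bit could be lost.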

Parallel Computing and Pipeline
Our first improvement aims at latency reduction. In the rgb_to_y module, the input data are the 48 8-bit RGB values of a 4 × 4 block. According to (4), every channel needs to be multiplied by its constant coefficient. Hence, 48 constant-coefficient multipliers operate in parallel. Then, 16 three-input adder trees calculate the final Y channel data. In the watermark_prepare module, as shown in Figure 8a, the DC component and the image texture of the block need to be calculated. According to (10) and (24), the Y channel data are simultaneously sent to a sixteen-input adder tree and the grain analysis module to get the DC component and L, respectively. Since these two operations are independent of each other, they are carried out in parallel to reduce latency. As for pipelining, three 128-bit-wide FIFOs and one 1-bit-wide FIFO are inserted as buffers to control the data flow. All FIFOs have a depth of 18 to cover the latency of the overall watermarking process. Meanwhile, the adder trees and other operators are fully pipelined to maximize throughput.
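To illustrate why a sixteen-input tree takes four levels, the following Python sketch reduces the 16 Y values of a block pairwise, level by level, for both the sum and the maximum/minimum search. It is a software model under stated assumptions, not the authors' RTL: `tree_reduce` and `prepare_block` are hypothetical names, the texture is taken as maximum minus minimum as described in the timing analysis, and the scaling that turns the 16-value sum into the DC component of (10) is deliberately left out because (10) is not reproduced in this section.

```python
def tree_reduce(values, op):
    """Reduce values pairwise, level by level; return (result, number_of_levels)."""
    level = list(values)
    if not level:
        raise ValueError("at least one value is required")
    levels = 0
    while len(level) > 1:
        # Combine neighbours; every pass models one pipeline stage of the tree.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
        levels += 1
    return level[0], levels


def prepare_block(y_block):
    """Return (sum of Y, texture, adder-tree levels, comparison-tree levels)."""
    total, sum_levels = tree_reduce(y_block, lambda a, b: a + b)
    y_max, cmp_levels = tree_reduce(y_block, max)
    y_min, _ = tree_reduce(y_block, min)
    return total, y_max - y_min, sum_levels, cmp_levels
```

With 16 inputs, both trees report four levels; in hardware the two trees run side by side, so their latencies overlap rather than add.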

Remainder Calculation
As mentioned before, we need to calculate the values of mod(DC, 12) and mod(DC, 20) in the watermark_prepare module to get the remainders for weak embedding and strong embedding. However, conventional approaches involve dividers, which are costly in terms of timing and area in hardware implementations. Inspired by linear CORDIC, we introduced an iterative algorithm to compute the modulo. Taking mod(DC, 20) as an example, the procedure is given as Algorithm 1 (computing the remainder using linear CORDIC). In this way, it takes seven clock cycles to calculate mod(DC, 20), with a critical path of one adder. Similarly, computing mod(DC, 12) takes eight clock cycles.
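A divider-free remainder can be illustrated with a restoring shift-and-subtract loop, which uses only a comparison, a subtraction and a shift per iteration. The Python sketch below is written in the spirit of the iterative scheme described above, but it is not a transcription of Algorithm 1. The function name and the `max_value` bound are hypothetical, since this section does not state the numeric range of DC.

```python
def mod_shift_subtract(value: int, modulus: int, max_value: int):
    """Return (value mod modulus, iterations) without using division.

    max_value is an assumed upper bound on value; it fixes the iteration count.
    """
    if modulus <= 0 or not 0 <= value <= max_value:
        raise ValueError("need modulus > 0 and 0 <= value <= max_value")
    # Find the largest shift k such that modulus * 2**k does not exceed the bound.
    k = 0
    while (modulus << (k + 1)) <= max_value:
        k += 1
    remainder = value
    iterations = 0
    for shift in range(k, -1, -1):
        trial = modulus << shift
        if remainder >= trial:  # one comparator and one subtractor per step
            remainder -= trial
        iterations += 1
    return remainder, iterations
```

For a fixed input bound, the smaller modulus needs at least as many iterations as the larger one, which mirrors mod(DC, 12) taking one clock cycle more than mod(DC, 20). The exact cycle counts depend on the real DC range and on register placement, neither of which this sketch claims to reproduce.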

Timing Analysis
In the rgb_to_y module, the RGB data are first multiplied by constant coefficients and then pass through the three-input adder trees, so the latency of the rgb_to_y module is the sum of these two stages. For watermark_prepare, the sixteen-input adder tree needs four clock cycles to get the DC component. From Figure 8b, it can be deduced that the compare module has a latency of two clock cycles. In order to get the maximum and minimum values of the 16 pixels, two stages of comparison are needed, as shown in Figure 8a. Hence, the grain analysis module needs four clock cycles to get the maximum and minimum values, and one more clock cycle to complete the subtraction, which adds up to five clock cycles. The remainder calculation needs eight clock cycles, as stated before. Additionally, another clock cycle is needed to complete the threshold comparison. As the sixteen-input adder tree and the grain analysis module work in parallel, only the longer of these two paths counts toward the overall latency of the watermark_prepare module. As for the watermark_process module, when it works in embedding mode, the calculation of ∆M is similar to (19) or (23) and needs one clock cycle, and adding ∆M to the RGB channels requires another clock cycle; hence, the latency is two clock cycles. When it works in extracting mode, according to (17), only one clock cycle is needed. To summarize, the overall latency of our proposed hardware design is 19 clock cycles. It is also worth mentioning that the buffered original data are read out of the FIFOs one clock cycle before the final results are calculated, so the depth of the FIFOs is set to 18 to ensure that there is no overflow.
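The cycle counts quoted in this timing analysis can be collected into a small latency budget. The Python sketch below is one reading of the text, not the authors' derivation: it assumes that the remainder calculation and the threshold comparison follow the parallel DC and texture paths in series, and it infers the rgb_to_y latency from the 19-cycle first-block latency used in the implementation results. All names are hypothetical.

```python
ADDER_TREE_CYCLES = 4         # sixteen-input adder tree (DC component)
GRAIN_ANALYSIS_CYCLES = 5     # four cycles of max/min comparison plus one subtraction
REMAINDER_CYCLES = 8          # worst case quoted for mod(DC, 12)
THRESHOLD_COMPARE_CYCLES = 1
EMBED_CYCLES = 2              # one cycle for delta-M, one for the adders
EXTRACT_CYCLES = 1
FIRST_BLOCK_LATENCY = 19      # overall latency quoted in the implementation results


def prepare_latency() -> int:
    """Assumed composition: parallel paths first, then remainder, then compare."""
    parallel = max(ADDER_TREE_CYCLES, GRAIN_ANALYSIS_CYCLES)
    return parallel + REMAINDER_CYCLES + THRESHOLD_COMPARE_CYCLES


def inferred_rgb_to_y_latency() -> int:
    """Cycles left for rgb_to_y in embedding mode under the assumption above."""
    return FIRST_BLOCK_LATENCY - prepare_latency() - EMBED_CYCLES
```

Under this reading, watermark_prepare takes 14 cycles and rgb_to_y is left with 3, which would fit a one-cycle multiplier followed by a two-cycle three-input adder tree. A different internal split could equally satisfy the 19-cycle total.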

Implementation Results
Our design was coded in Verilog HDL, synthesized with TSMC 90-nm CMOS technology and implemented on the Xilinx Virtex-7 platform. It should be mentioned that the reported ASIC results are pre-layout synthesis results from Design Compiler, and the reported FPGA implementation results were estimated by the Xilinx Power Estimator (XPE) in the Vivado Design Suite. The overall architecture is as in Figure 7a. The synthesized circuit is able to process a single 4 × 4 block. For example, if we want to embed a 128 × 128 watermark pattern into a 512 × 512 picture with a single core, the overall latency in clock cycles will be:
$$19 + (128 \times 128 - 1) = 16{,}402 \tag{38}$$
where 19 is the latency of the first block, and each of the following blocks is sent to the process core continuously. In an image processing system, DDR is used to transfer data to the process core. Take DDR3, for example: a DDR3 chip usually has a throughput of approximately 20 GB/s, and the read channel consumes about half of it, which is 10 GB/s. As listed in Table 4, the maximum frequency of our proposed architecture is 2.32 GHz. According to (38), a single core is able to process a 512 × 512 picture within 7053 ns. Hence, a single core has a maximum throughput of 103 GB/s (141,783 fps) with an area of 304,980.08 µm² and a power consumption of 508.1835 mW. Consequently, for ASIC applications, the only thing that limits the performance is the interface speed. ASIC implementation results were obtained from DC reports.
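The single-core figures above follow from simple arithmetic, which the Python sketch below reproduces: one block per clock cycle after a 19-cycle first-block latency, 3 bytes per RGB pixel, and throughput expressed in units of 2^30 bytes. The function names are hypothetical, and the binary interpretation of "GB" is an assumption chosen because it matches the quoted 103 GB/s.

```python
def frame_cycles(width: int = 512, height: int = 512, block: int = 4,
                 first_block_latency: int = 19) -> int:
    """Clock cycles to stream every block of one picture through one core."""
    blocks = (width // block) * (height // block)
    return first_block_latency + (blocks - 1)


def frame_stats(freq_hz: float, width: int = 512, height: int = 512,
                bytes_per_pixel: int = 3):
    """Return (cycles, latency in ns, frames per second, throughput in GiB/s)."""
    cycles = frame_cycles(width, height)
    seconds = cycles / freq_hz
    fps = 1.0 / seconds
    gib_per_s = fps * width * height * bytes_per_pixel / 2**30
    return cycles, seconds * 1e9, fps, gib_per_s
```

Calling `frame_stats(2.32e9)` gives 16,402 cycles and roughly 7070 ns per picture. That is slightly above the quoted 7053 ns, presumably because the reported 2.32 GHz is a rounded value.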
For FPGA applications, the clock frequency may not be as high as with ASIC, so the bottleneck could be the processing speed. Since there are no interactions among the individual blocks in our approach, we can easily apply a multicore strategy to process a large picture in a shorter time in FPGA applications. In Table 5, the frequencies and numbers of cores were set to match the 10 GB/s interface speed of DDR3. For example, when we set the frequency to 50 MHz, the throughput of a single core was 2.2 GB/s and the number of cores was set to five. Results under different frequencies are listed in Tables 4 and 5. For the ASIC implementation, area and power consumption rose with frequency. However, the power efficiency stayed around $2.8 \times 10^{5}$ fps/W. For the FPGA implementation, although power efficiency decreased significantly when the frequency rose from 50 to 250 MHz, the area at 50 MHz was almost five times the area at 250 MHz. To conclude, our design is capable of handling watermarking tasks in various kinds of scenarios with satisfactory performance.

a. Xilinx Virtex-7 xc7vx485tffg1157-2 was used for the FPGA implementation.
b. Numbers of cores in the FPGA implementation were set to meet the transfer speed of the DDR3 chip. For 50 MHz, the core number was 5. For 100 MHz, the core number was 3. For 250 MHz, the core number was 1.
c. Power consumption was estimated by the Xilinx Power Estimator.
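The core counts in note b can be checked by dividing the 10 GB/s interface speed by the per-core throughput and rounding up. The Python sketch below does this under stated assumptions: 16,402 cycles per 512 × 512 picture, 3 bytes per pixel and throughput in units of 2^30 bytes. The function names are hypothetical.

```python
import math

CYCLES_PER_FRAME = 16402          # 19 + (128 * 128 - 1) for a 512 x 512 picture
BYTES_PER_FRAME = 512 * 512 * 3   # 8-bit R, G and B per pixel


def core_throughput_gib(freq_hz: float) -> float:
    """Per-core throughput in GiB/s at the given clock frequency."""
    fps = freq_hz / CYCLES_PER_FRAME
    return fps * BYTES_PER_FRAME / 2**30


def cores_needed(freq_hz: float, interface_gib_per_s: float = 10.0) -> int:
    """Smallest number of cores whose combined throughput reaches the interface speed."""
    return math.ceil(interface_gib_per_s / core_throughput_gib(freq_hz))
```

With these assumptions, `cores_needed` returns 5, 3 and 1 for 50, 100 and 250 MHz, which agrees with the core numbers listed in note b.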

Hardware Implementation Comparison
Different kinds of approaches have been proposed to realize the hardware acceleration of watermark embedding, and both ASIC and FPGA have been employed for their implementation. Hence, we compare our results with the most recent works in terms of area, power consumption and energy efficiency. For the FPGA implementation, we chose the Xilinx Virtex-7 series vx485t as our target device. The criteria used for evaluating the resources of a design in FPGA applications are the numbers of LUTs, flip flops and DSPs. In order to compare our results with previous works under the same conditions, we adjusted the synthesis strategy in the Vivado design tools so that DSPs were not allowed to be used. Hence, resources were constrained to LUTs and flip flops. It is also worth mentioning that in [29], an Altera Cyclone device was used. It consumes 4582 LEs, which are the logic equivalent of 4582 LUTs and 4582 single-bit registers. Compared with other works, our design has a notable advantage in saving logic resources, except for [29,36]. However, [29] showed relatively low throughput (11.8 im./sec), whereas our design is capable of reaching a throughput of $2.57 \times 10^{4}$ fps. FPGA implementation results show that our design achieved the highest frequency of 421.941 MHz with a power consumption of 754 mW. According to (38), the frame rate per Watt of our design equals:
$$\frac{421.941 \times 10^{6}}{16{,}402 \times 0.754} = 3.41 \times 10^{4} \ \text{fps/W}$$
Designs in [29] had a throughput of 30.1 im./sec, and the images processed were 640 × 480 grayscale images, from which the throughput per Watt can be deduced. Although our design consumes more resources than [29], its throughput and throughput per Watt are both three orders of magnitude higher. Compared with [36], our design consumes fewer LUTs but more flip flops. However, [36] uses four more DSP48E resources than our design. Hence, it is assumed that the design in [36] and ours are at the same level in terms of hardware utilization and power consumption. The throughput of the design in [36] was 1.676 GBps at 362.58 MHz, which is $1.34 \times 10^{4}$ Mbps. According to (40), the throughput of our design is $1.61 \times 10^{5}$ Mbps, which is one order of magnitude higher than that in [36].
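The FPGA comparison figures can be re-derived from the quoted frequency, power and picture size. The Python sketch below is a plausibility check rather than the authors' calculation: it assumes 24 bits per pixel for a 512 × 512 RGB picture, decimal prefixes for Mbps and GBps, and hypothetical function names.

```python
CYCLES_PER_FRAME = 16402           # 19 + (128 * 128 - 1)
BITS_PER_FRAME = 512 * 512 * 24    # 8-bit R, G and B per pixel


def fpga_metrics(freq_hz: float = 421.941e6, power_w: float = 0.754):
    """Return (frames per second, fps per Watt, throughput in Mbps)."""
    fps = freq_hz / CYCLES_PER_FRAME
    return fps, fps / power_w, fps * BITS_PER_FRAME / 1e6


def gbytes_to_mbps(gbytes_per_s: float) -> float:
    """Convert GB/s to Mbps using decimal prefixes."""
    return gbytes_per_s * 8 * 1000
```

Under these assumptions the sketch gives about $2.57 \times 10^{4}$ fps, $3.41 \times 10^{4}$ fps/W and roughly $1.6 \times 10^{5}$ Mbps for the proposed design, and about $1.34 \times 10^{4}$ Mbps for the 1.676 GBps reported in [36].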
Comparisons of ASIC implementation results are also listed in Table 6. Owing to the elimination of the DCT and IDCT and to the remainder calculation method applied in our approach, the critical path of our design is the path of a constant-coefficient multiplier. The maximum frequency of the synthesized circuit can reach 2.32 GHz, which is much faster than the other works listed in Table 6. Since we use registers to build the internal FIFOs, the overall power consumption reaches 508.1835 mW at 2.32 GHz. If we do not include the power of the internal buffer, the computational logic consumes 48.6163 mW at 2.32 GHz and 4.0938 mW at 200 MHz. The design in [29] also uses 90-nm CMOS technology, but it can only achieve a maximum frequency of 166.7 MHz for its parallel version due to the introduction of arithmetic operations such as division and square root. Additionally, [29] uses only one divider in the whole design and shares it among different blocks; thus, the throughput is limited to 30.1 im./sec. In contrast, our design can reach a maximum throughput of 141,783 fps because a fully pipelined architecture is adopted. Compared with [29] in terms of throughput per Watt, our design is three orders of magnitude higher, which is consistent with the FPGA implementation results. In conclusion, the power consumption of the proposed design is lower than in most recent works, such as [29,36,37], due to the elimination of the DCT and IDCT and the 5/6 of the computational cost saved in color space conversion. Owing to the low computational cost, parallel computing and a pipeline strategy can be used with the proposed method (unlike [29], where one divider is shared among all modules) to improve the throughput of our design, which contributes to the high power efficiency.
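The ASIC power figures lend themselves to a short worked calculation. The Python sketch below splits the total power into computational logic and everything else, and derives the efficiency figures from the quoted throughput. It assumes that the total is simply the sum of the logic power and the internal buffer power, that a picture carries 512 × 512 × 24 bits, and that Mbps uses decimal prefixes; the function names are hypothetical.

```python
TOTAL_POWER_MW = 508.1835    # whole design at 2.32 GHz
LOGIC_POWER_MW = 48.6163     # computational logic only at 2.32 GHz
ASIC_FPS = 141783            # maximum single-core throughput quoted in the text
BITS_PER_FRAME = 512 * 512 * 24


def buffer_power_share() -> float:
    """Fraction of total power not spent in computational logic (assumed: the FIFOs)."""
    return (TOTAL_POWER_MW - LOGIC_POWER_MW) / TOTAL_POWER_MW


def asic_efficiency():
    """Return (fps per Watt, Mbps per Watt) for the whole design."""
    watts = TOTAL_POWER_MW / 1000.0
    mbps = ASIC_FPS * BITS_PER_FRAME / 1e6
    return ASIC_FPS / watts, mbps / watts
```

Under these assumptions, roughly 90% of the power goes to the register-based buffers, and the efficiency comes out near $2.8 \times 10^{5}$ fps/W and $1.75 \times 10^{6}$ Mbps/W.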

Watermarking Performance Comparison
For the watermarking performance comparison, we mainly focus on invisibility and robustness.
The PSNR reflects the image quality of a watermarked picture. Watermarked bits can be considered as noise added to the original image. Hence, for the same picture, if the PSNR value is higher after watermarking, the invisibility of the watermark should be better. First, we compared some of the most recent works with our method using the same image Lena for the invisibility test under a 1/64 embedding ratio, which is widely used. The method in [12] works in the DCT domain, that in [18] works in the DWT domain and that in [34] is based on Tchebichef moments. As shown in Table 7, our PSNR and SSIM values are higher than those in [12,18], which indicates that the proposed method has better invisibility than [12,18].
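For reference, the PSNR of an 8-bit image can be computed from the mean squared error between the original and the watermarked pixels. The Python sketch below uses the standard definition with a peak value of 255; the flat-list image representation and the function name are choices made for illustration only.

```python
import math


def psnr(original, watermarked, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(watermarked) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

As a point of reference, changing every pixel value by exactly one level gives a mean squared error of 1 and therefore a PSNR of about 48.13 dB.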
The NC value between the extracted watermark after attack and the original watermark reflects the robustness of the watermarking approach. Following the variable-controlling approach, we set the threshold to 0 and F = 56, δ ∈ {0, ±1, ±2, ±3} to get a PSNR value of 42.1097, which is still better than those in [18,35]. The methods in [14,18,35] were among the best in terms of attack tests using DCT, DWT and Tchebichef moments, respectively. Typical attacks, such as JPEG compression, scaling, Gaussian noise and salt-and-pepper noise, were used for comparison. Experimental results are listed in Table 7. For JPEG compression, the proposed method achieved the best result with an NC value of 1 after 50% compression. For the scaling attack, the 512 × 512 image was first scaled to 256 × 256 and then rescaled to 512 × 512. The result shows that our approach was slightly worse than [18] but much better than [14] and [34]. For Gaussian noise and salt-and-pepper noise attacks, [18] still attained better results than our design, but the differences were very small. Overall, it can be stated that the proposed method shows great advantages in invisibility and is one of the best methods in terms of robustness. Although the DWT-based method in [18] has some slight advantages in robustness tests, the hardware overhead brought on by DWT makes it unsuitable for low-cost applications.
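A normalized correlation between two watermarks can be sketched as follows. Several NC definitions are in use, and the one applied in this paper is not reproduced in this section, so the cosine-style form below is an assumption chosen for illustration; the function name is hypothetical.

```python
import math


def normalized_correlation(original_bits, extracted_bits) -> float:
    """Cosine-style normalized correlation of two equally sized bit sequences."""
    if len(original_bits) != len(extracted_bits):
        raise ValueError("watermarks must have the same length")
    numerator = sum(a * b for a, b in zip(original_bits, extracted_bits))
    denominator = (math.sqrt(sum(a * a for a in original_bits))
                   * math.sqrt(sum(b * b for b in extracted_bits)))
    if denominator == 0:
        raise ValueError("NC is undefined for an all-zero watermark")
    return numerator / denominator
```

With this form, a perfectly extracted watermark yields an NC of 1, and every wrongly extracted bit lowers the value.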

Conclusions
In this paper, we proposed a hidden DCT-based watermarking method for low-cost hardware implementations. The proposed architecture combines a watermark embedder and a blind extractor in the same circuit using a resource-sharing method. The hardware implementation and simulation results demonstrate the performance of our design in terms of throughput, efficiency and accuracy. When synthesized with TSMC 90-nm CMOS technology, our proposed circuit was able to reach a frequency of 2.32 GHz, which is 4.26 times higher than the best result reported [3]. Compared to the state-of-the-art design in [29], our design improved throughput per Watt by more than 1000-fold, with an energy efficiency of $1.75 \times 10^{6}$ Mbps/W. In addition, invisibility and robustness tests showed that the proposed method is among the state-of-the-art methods. The principal contributions of the proposed scheme can be summarized as follows:

1. We improve the invisibility of the conventional DC component-based DCT watermarking method by introducing the HVS model. Changes in DC components are strictly controlled according to the characteristics of the HVS model to avoid visible block artifacts.

2. An optimized workflow is proposed to reduce the computational overhead of traditional DCT-based watermarking methods. The hidden DCT-based approach embeds the watermark into the DCT domain without actually carrying out the DCT and IDCT operations. Additionally, 5/6 of the color space conversion can be omitted using our DC component-based approach. Meanwhile, image quality after watermarking can be improved because the calculation errors of these operations are also avoided.

3. The optimized low-cost watermarking method in this paper is suitable for real-time applications with limited computing resources, such as mobile phone applications and wireless network applications [29]. It is worth noting that the proposed method is suitable for various embedding ratios. Additionally, by adjusting parameters such as the threshold in our approach, users can easily adapt it for invisibility-oriented or robustness-oriented applications. For example, if we pursue robustness, we can set the threshold to 0 and choose the high F and δ values mentioned to increase the embedding strength.

In the future, the proposed method should be optimized for better performance against geometric attacks and JPEG compression attacks. For example, in the cropping attack test, the proposed method did not perform as well as some state-of-the-art methods. Besides, the method in [18] showed advantages over our method in response to JPEG compression attacks. Hence, it is important for us to improve our approach in these aspects. Additionally, the high efficiency of our design makes it suitable for video watermarking applications, rather than just still pictures. Further studies will be undertaken in the video watermarking area.