ZWNet: A Deep-Learning-Powered Zero-Watermarking Scheme with High Robustness and Discriminability for Images

Abstract: To safeguard image copyrights, zero-watermarking technology extracts robust features and generates watermarks without altering the original image. Traditional zero-watermarking methods rely on handcrafted feature descriptors to enhance their performance. With the advancement of deep learning, this paper introduces “ZWNet”, an end-to-end zero-watermarking scheme that obviates the need for specialized knowledge of image features and is composed exclusively of artificial neural networks. The architecture of ZWNet combines ConvNeXt and LK-PAN to strengthen the extraction of local features while accounting for the global context. A key aspect of ZWNet is its watermark block, the head of the network, which performs feature optimization, identifier output, encryption, and copyright fusion. The training strategy addresses the challenge of simultaneously enhancing robustness and discriminability by producing the same identifier for attacked images and distinct identifiers for different images. Experiments demonstrate ZWNet’s robustness, with the normalized coefficient of the zero-watermark consistently exceeding 0.97 under rotation, noise, crop, and blur attacks. Regarding discriminability, the Hamming distance between generated watermarks exceeds 88 for images with the same copyright but different content. Watermark generation is also efficient, with an average processing time of 96 ms. These experimental results substantiate the superiority of the proposed scheme over existing zero-watermarking methods.


Introduction
In contrast to cryptography, which primarily focuses on ensuring message confidentiality, digital watermarking places greater emphasis on copyright protection and tracing [1,2]. Classical watermarking involves the covert embedding of a watermark (a sequence of data) within media files, allowing for the extraction of this watermark even after data distribution or manipulation, enabling the identification of data sources or copyright ownership [3]. However, this embedding process necessarily involves modifications to the host data, which can result in some degree of degradation to data quality and integrity. In response to the demand for high fidelity and zero tolerance for data loss, classical watermarking has been supplanted by zero-watermarking. Zero-watermarking focuses on extracting robust features and their fusion with copyright information [4,5]. Notably, a key characteristic of zero-watermarking lies in the generation or construction of the zero-watermark itself, as opposed to its embedding.
Zero-watermarking algorithms have traditionally relied on handcrafted features and typically involve a three-stage process. The first stage computes robust features, and the second stage converts these features into a numerical sequence. The third stage fuses the numerical sequence with copyright identifiers, generating a zero-watermark without any modification to the original data. The specific steps and features in these three stages are designed by experts, so the performance of the algorithm depends on expert knowledge. Moreover, once a zero-watermarking algorithm is established, continuous optimization becomes challenging, which is a limitation inherent in handcrafted approaches.
Introducing deep learning technology is a natural progression to overcome the reliance on expert knowledge and achieve greater optimization in zero-watermarking algorithms. Deep learning has recently ushered in significant transformations in computer vision and various other research domains [6][7][8][9]. Numerous tasks, including image matching, scene classification, and semantic segmentation, have exhibited remarkable improvements when contrasted with classical methods [10][11][12][13][14]. The defining feature of deep learning is its capacity to replace handcrafted methods reliant on expert knowledge with Artificial Neural Networks (ANNs). Through training ANNs with ample samples, these networks can effectively capture the intrinsic relationships among the samples and model the associations between inputs and outputs. Inspired by this paradigm shift, the zero-watermarking method can also transition towards an end-to-end mode with the support of ANNs, eliminating the need for handcrafted features.
In addressing watermark performance, conventional handcrafted methods encounter challenges in simultaneously enhancing robustness and discriminability. To address this limitation and achieve concurrent improvements, this paper proposes a novel zero-watermarking network named ZWNet, which employs three strategies. Firstly, ZWNet combines ConvNeXt and LK-PAN to enhance the extraction of local features and consider the global context more comprehensively. Secondly, the watermark block is designed as the head component, integrating with copyright information, encrypting the watermark, and generating a distinctive image identifier. Thirdly, the training strategy for ZWNet focuses on the image identifier it produces. To enhance robustness, ZWNet is trained to yield the same image identifier for both the original and attacked images. Simultaneously, to improve discriminability, ZWNet is trained to produce different image identifiers for distinct images. Experimental results further validate the effectiveness of ZWNet, as evidenced by the normalized coefficient of the zero-watermark consistently exceeding 0.97 for robustness and the Hamming distance between watermarks with the same copyright and different images surpassing 88 for discriminability.
The contributions of this paper can be summarized as follows:
(1) Introduction of a novel approach that combines ConvNeXt and LK-PAN to enhance feature extraction, effectively addressing both global context and local features.
(2) Transformation of the problem of improving watermark performance into a classification task, leveraging the common framework provided by deep learning.
(3) ZWNet exhibits notable discriminability, ensuring that the generated zero-watermark is distinct enough to differentiate between different images sharing the same copyright. This capability addresses the challenge of zero-watermark confusion.
The structure of this paper is organized as follows: Section 2 provides a concise background on zero-watermarking, ConvNeXt, and LK-PAN. Section 3 introduces the proposed zero-watermarking scheme, "ZWNet". Section 4 presents the experimental results and subsequent discussion. Finally, Section 5 presents the concluding remarks.

Related Work
This section introduces three key components of related work. Section 2.1 provides a concise introduction to zero-watermarking and analyzes similar methods based on deep learning. Section 2.2 presents an overview of ConvNeXt as a feature extraction network. Furthermore, Section 2.3 introduces LK-PAN, delineating its role in providing an optimization structure to enhance ConvNeXt.

Zero-Watermarking
The concept of zero-watermarking in image processing was originally introduced by Wen et al. [15]. This technology has garnered significant attention and research interest due to its unique characteristic of preserving the integrity of media data without any modifications. Taking images as an example, the zero-watermarking process can be broadly divided into three stages. The first stage involves the computation of robust features. In this phase, various handcrafted features such as Discrete Cosine Transform (DCT) [16,17], Discrete Wavelet Transform (DWT) [18], Lifting Wavelet Transform [19], Harmonic Transform [5], and Fast Quaternion Generic Polar Complex Exponential Transform (FQGPCET) [20] are calculated and utilized to represent the stable features of the host image. The second stage focuses on the numerical conversion of these features into a numerical sequence. Mathematical transformations such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are employed to filter out minor components and extract major features [16,21]. The resulting feature sequence from this stage serves as a condensed identifier of the original image. However, this sequence alone cannot serve as the final watermark since it lacks any copyright-related information. Hence, the third stage involves the fusion of the feature sequence with copyright identifiers. Copyright identifiers can encompass the owner's signature image, organization logos, text, fingerprints, or any digitized media. To ensure the zero-watermark cannot be forged or unlawfully generated, cryptographic methods such as Advanced Encryption Standard (AES) or Arnold Transformation [22] are often utilized to encrypt the copyright identifier and feature sequence. The final combination can be as straightforward as XOR operations [16]. Consequently, the zero-watermark is generated and can be registered with the Intellectual Property Rights (IPR) agency. Additionally, copyright verification is a straightforward process involving the regeneration of the feature sequence and its comparison with the registered zero-watermark. The process of zero-watermarking technology is illustrated in Figure 1.
Presently, several research efforts focus on deep-learning-based watermarking methods, encompassing both classical watermarking of embedding style and zero-watermarking of generative style. In the domain of embedding-style watermarking, a method inspired by the architecture of Autoencoder has been proposed. In this approach, Autoencoders encode the watermark and embed it using convolutional networks. For watermark extraction, Autoencoders are also employed to extract and decode the watermark [23]. Other Autoencoder-based methods aim to enhance robustness or improve efficiency [24,25]. Despite their superior performance in robustness and elimination of reliance on prior knowledge, embedding-style watermarking significantly differs from zero-watermarking methods, as the latter maintains the original data unchanged. Additionally, it is noteworthy that zero-watermarking places greater emphasis on discriminability, a focus less pronounced in embedding-style watermarking.
In the realm of deep-learning-based zero-watermarking methods, a hybrid scheme that combines traditional Discrete Wavelet Transform (DWT) and the deep neural network ResNet-101 has been proposed. This approach involves applying DWT to the host image and subsequently sending the wavelet coefficients to ResNet-101 [26]. While exhibiting strong robustness against translation and clipping, this scheme falls short of being an end-to-end solution. Regarding end-to-end zero-watermarking, some studies employ Convolutional Neural Networks (CNN), VGG-19 (developed by the Oxford Visual Geometry Group), or DenseNet to generate robust watermark sequences [27][28][29]. Another line of research predominantly revolves around the concept of style transfer [30]. In the watermark generation phase, it utilizes VGG to merge the content of the copyright logo with the style of the host image. In the verification stage, another CNN is employed to eliminate the style component and extract the copyright content. Although these approaches have demonstrated promising levels of robustness compared to handcrafted methods, we believe they fall short in adequately considering multi-level features within the image. This limitation arises because when using CNN or VGG to upsample the image, the higher-level features have a less effective receptive field than the theoretical receptive field [31]. Furthermore, one drawback of these zero-watermark networks is the insufficient emphasis on discriminability. This means the generated zero-watermarks for different images should be distinct enough to prevent copyright ambiguity.
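To make the three-stage handcrafted pipeline described at the start of this subsection concrete, the following is a minimal sketch rather than any cited method. It assumes block means as the "robust feature" (a stand-in for DCT, DWT, or SVD features), a median threshold for the numerical conversion, and random bits in place of a copyright logo; encryption is omitted, and all function names are illustrative.

```python
import random
from statistics import median


def block_means(image, block):
    """Stage 1: average each block x block tile as a crude robust feature."""
    size = len(image)
    feats = []
    for r in range(0, size, block):
        for c in range(0, size, block):
            tile = [image[i][j]
                    for i in range(r, r + block)
                    for j in range(c, c + block)]
            feats.append(sum(tile) / len(tile))
    return feats


def binarize(feats):
    """Stage 2: convert features to a bit sequence by thresholding at the median."""
    m = median(feats)
    return [1 if f > m else 0 for f in feats]


def xor_bits(a, b):
    """Stage 3 and verification: XOR is its own inverse."""
    return [x ^ y for x, y in zip(a, b)]


def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
host = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
copyright_bits = [rng.randrange(2) for _ in range(64)]  # stands in for a logo

# Generation: the host image itself is never modified.
zero_watermark = xor_bits(binarize(block_means(host, 4)), copyright_bits)

# Verification: regenerate features from a slightly brightened copy and
# XOR them with the registered zero-watermark to recover the copyright bits.
attacked = [[min(255, p + 3) for p in row] for row in host]
recovered = xor_bits(binarize(block_means(attacked, 4)), zero_watermark)
errors = hamming(recovered, copyright_bits)
print(f"bit errors after attack: {errors} of {len(copyright_bits)}")
```

Because the zero-watermark is stored separately from the image, verification only needs the suspect image and the registered sequence; the quality of the result depends entirely on how stable the stage-1 features are under attack.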

ConvNeXt
As discussed in Section 2.1, Convolutional Neural Networks (CNN) have been employed as the feature extraction component in existing watermarking methods. However, it is noteworthy that the performance of CNN has become outdated in various tasks. Hence, Liu et al. introduced ConvNeXt, a nomenclature devised to distinguish it from traditional Convolutional Networks (ConvNets) while signifying the next evolution in ConvNets [32]. Rather than presenting an entirely new architectural paradigm, ConvNeXt draws inspiration from the ideas and optimizations put forth in the Swin Transformer [33] and applies similar strategies to enhance a standard ResNet [8]. These optimization strategies can be summarized as follows:
(1) Modification of stage compute ratio: ConvNeXt adjusts the number of blocks within each stage from (3,4,6,3) to (3,3,9,3).
(2) Replacement of the stem cell: The introduction of a patchify layer achieved through non-overlapping 4 × 4 convolutions.
(4) Inverted Bottleneck design: This approach involves having the hidden layer dimension significantly larger than that of the input.
(6) Micro-level optimizations: These include the replacement of ReLU with GELU, fewer activation functions, reduced use of normalization layers, the substitution of Batch Normalization with Layer Normalization, and the implementation of separate downsampling layers.
Appl. Sci. 2024, 14, x FOR PEER REVIEW
Remarkably, the amalgamation of these strategies results in ConvNeXt achieving a state-of-the-art level of performance in image classification, all without requiring substantial changes to the network's underlying structure. Furthermore, a key feature of the ConvNeXt paper lies in its detailed presentation of how each optimization incrementally enhances performance, as summarized in Figure 2. From Figure 2, it is evident that modifying the stage ratio and adopting the patchify stem improves accuracy from 78.8% to 79.5%. Introducing depthwise convolution and a larger width raises accuracy further to 80.5%. The inverted bottleneck and larger kernel size bring accuracy to 80.6%. Finally, with micro-optimizations, the accuracy of ConvNeXt reaches 82.0%, surpassing that of Swin.

LK-PAN
While ConvNeXt offers a straight-line structure that effectively captures local features, it may fall short in dedicating sufficient attention to the global context. To address this limitation and enhance the capabilities of ConvNeXt, a path aggregation mechanism, LK-PAN, is introduced. LK-PAN originates from the Path Aggregation Network (PANet), which was initially introduced in the context of instance segmentation to bolster the hierarchy of feature extraction networks. The primary structure of PANet is depicted in Figure 3.
In Figure 3, part (a) represents the classical network structure of the Feature Pyramid Network (FPN), which is named for its pyramid-like arrangement [35]. However, the influence of low-level features on high-level features is limited due to the long paths, as indicated by the red dashed lines in Figure 3. These paths can comprise over 100 layers. As mentioned in Section 2.1, while the theoretical receptive field of P5 may be quite large, it does not manifest as such in practice due to the numerous convolution, pooling, and activation operations. Therefore, PANet introduced a bottom-up path augmentation, as depicted in Figure 3b. This approach aggregates the top-most features from both the low-level features and features at the same level. Consequently, this mechanism substantially shortens the connection between low-level features and the top-most features to around 10 layers. Thus, it effectively enhances feature expression for local areas and minor details. PANet's contributions also encompass adaptive feature pooling (Figure 3c) and fully-connected fusion (Figure 3d). However, these two mechanisms are more closely related to the task of instance segmentation and will not be elaborated upon here.
Building upon the foundation of PANet, the Large Kernel-PANet, abbreviated as LK-PAN [36], introduces some improvements. The primary feature of LK-PAN is the enlargement of the convolution kernel size. In contrast to PANet, LK-PAN utilizes 9 × 9 convolution kernels instead of the original 3 × 3 size. This enlargement is aimed at expanding the receptive field of the feature map, thereby enhancing the ability to discern minor features with greater precision. Another key change in LK-PAN is the adoption of a concatenation operation, replacing the adaptive feature pooling of Figure 3c, for fusing features from different levels.
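The effect of the kernel enlargement on the theoretical receptive field can be checked with a short calculation. The sketch below uses the standard recurrence for stacked convolutions; the layer stacks passed to it are illustrative and are not the actual LK-PAN layer configuration. As noted above, the effective receptive field in a trained network is typically smaller than this theoretical value.

```python
def receptive_field(layers):
    """Theoretical receptive field of stacked convolutions.

    layers: iterable of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # striding spreads later layers further apart
    return rf


print(receptive_field([(3, 1)] * 2))  # two 3 x 3 convolutions -> 5
print(receptive_field([(9, 1)] * 2))  # two 9 x 9 convolutions -> 17
```

Under this recurrence, two stride-1 9 × 9 layers cover the same theoretical extent as eight stride-1 3 × 3 layers, which illustrates why a larger kernel is a compact way to widen the field of view.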

Main Idea
At the heart of zero-watermarking technology lies the extraction of robust image features. While ConvNeXt offers deep insights for extracting dense features, it alone may not provide sufficient attention to fine-grained image semantics and local details. This can result in situations where the watermark differences between substantially different images are not distinct enough, affecting the discriminability of the zero-watermark. To fully leverage both local features and global context, ZWNet integrates ConvNeXt and LK-PAN as the backbone and neck components, enhancing the robustness and distinctiveness of multi-level features.
After the image feature extraction, a crucial challenge remains in training ZWNet to achieve robustness and discriminability. Additionally, there are requirements such as combining the watermark with copyright logos and encrypting the watermark. To address these issues, the watermark block is introduced as the head component of ZWNet. This block includes a linear layer for generating an image identifier, encryption layers, and copyright-mixture layers. In summary, the primary architecture of ZWNet is illustrated in Figure 4, with further network details elucidated below.

Backbone Component
Prior to entering ZWNet's backbone component, training images undergo various attacks, which are managed by the preprocessing module depicted in Figure 4. Subsequently, they are fed into the ConvNeXt network, the details of which are illustrated in Figure 5. The input image has dimensions of 224 × 224 with three color channels (Red, Green, and Blue). Both the training and test datasets are formatted as JPG images with a resolution of 72 dpi. The image initially undergoes processing through a convolutional layer and layer normalization. Subsequently, it is directed through four ConvNeXt blocks and three downsample blocks. Each ConvNeXt block includes a residual connection, a depthwise convolution layer, and standard convolution layers. Each downsample block comprises a normalization layer and a convolution layer. The feature maps generated after each ConvNeXt block are then passed to LK-PAN, which serves as the neck component of ZWNet.
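As a rough guide to the four feature maps handed to the neck, the sketch below traces their shapes through the backbone. It assumes the 4 × 4 non-overlapping patchify stem described in Section 2.2, a halving of resolution at each of the three downsample blocks, and ConvNeXt-T channel widths (96, 192, 384, 768); the text does not state the stride or the widths used in ZWNet, so these values are assumptions for illustration only.

```python
def backbone_shapes(input_size=224, stem_stride=4,
                    channels=(96, 192, 384, 768)):
    """Return (channels, height, width) after each of the four stages.

    The stride-2 downsampling and the channel widths are assumed values.
    """
    size = input_size // stem_stride  # non-overlapping 4 x 4 patchify stem
    shapes = []
    for stage, width in enumerate(channels):
        if stage > 0:
            size //= 2  # one downsample block sits between consecutive stages
        shapes.append((width, size, size))
    return shapes


print(backbone_shapes())
# [(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]
```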

Neck Component
The neck component of ZWNet draws inspiration from LK-PAN, and its specifics are outlined in Figure 6. The input to ZWNet's neck component comprises four branches, each corresponding to a feature map generated by one of the four ConvNeXt blocks from the backbone. Each branch begins with a 1 × 1 convolution operation, followed by the addition of upsampled features from higher levels and subsequent upsampling to match the low-level branch. These features then pass through a larger-kernel convolutional layer (9 × 9). Following the convolution operation, the features are combined with downsampled features and split into two branches. The first branch is downsampled using a 3 × 3 convolutional layer and sent to the higher level. The other branch undergoes an additional 9 × 9 convolutional layer and ultimately contributes to the final concatenate layer.

Head Component
The head component is the watermark block, encompassing four key functions: optimizing the feature maps, generating image identifiers, encryption, and merging with copyright information. The detailed structure is illustrated in Figure 7.
Within the watermark block, the input comprises feature maps generated by the neck component. These feature maps are processed by an exceptionally large depthwise convolutional layer with a kernel size of 21. Subsequently, they are subjected to adaptive max pooling, which fixes the size of the output feature map at 16 × 16 × 1. This 16 × 16 × 1 feature map can be viewed as the robust features of the input host image.
The feature map is then divided into two branches. The first branch is directed through a linear layer to produce an image identifier. This image identifier serves as a unique code differentiating the input image from others, which will be further elucidated in Section 3.5. The second branch is funneled through an encrypt-conv layer within a loop function. This loop function emulates the encryption process of the Arnold transformation, where the encrypt-conv layer, short for encryption-convolution layer, executes a single permutation of the Arnold transformation. Key1 represents the secret key of the Arnold transformation, which is fed into the loop function. Post-encryption, the feature map undergoes quantization based on a threshold, T, resulting in the conversion of the feature map into a binary sequence. This binary sequence is then merged with copyright information through an XOR operation.
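The second branch can be illustrated outside the network with plain index arithmetic. The sketch below is not the encrypt-conv layer itself: it assumes the common Arnold map (x, y) -> (x + y, x + 2y) mod n, treats Key1 as the number of iterations, and uses random values in place of the 16 × 16 feature map and the copyright logo. The function names and the threshold value are illustrative.

```python
import random


def arnold_step(grid):
    """One Arnold permutation of a square grid: (x, y) -> (x + y, x + 2y) mod n."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            out[(x + y) % n][(x + 2 * y) % n] = grid[x][y]
    return out


def watermark_head(features, copyright_bits, key1, threshold):
    """Scramble, quantize, and fuse a feature map with copyright bits."""
    scrambled = features
    for _ in range(key1):  # the loop applies one permutation per iteration
        scrambled = arnold_step(scrambled)
    # Quantization: values above the threshold T become 1, the rest 0.
    bits = [1 if v > threshold else 0 for row in scrambled for v in row]
    # Fusion with the copyright information through XOR.
    return [b ^ c for b, c in zip(bits, copyright_bits)]


rng = random.Random(1)
features = [[rng.random() for _ in range(16)] for _ in range(16)]
logo_bits = [rng.randrange(2) for _ in range(256)]  # stands in for a logo

zero_watermark = watermark_head(features, logo_bits, key1=5, threshold=0.5)

# Running the same head with the zero-watermark in place of the logo
# recovers the logo bits, since XOR is its own inverse.
assert watermark_head(features, zero_watermark, key1=5, threshold=0.5) == logo_bits
```

Since the map only permutes positions, the scrambled grid holds the same values as the input, and recovering the copyright bits requires both the same features and the same Key1.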

Training
The image identifier plays a crucial role in distinguishing the input image from others and serves as the training target for ZWNet. There are various methods for generating an image identifier, such as assigning a unique value or utilizing a hash function. The key requirement is that different images should map to distinct identifiers. In the case of ZWNet, the training process can be conceptualized as optimizing the identifier output by the entire network to match the target identifier.
To ensure the robustness and discriminability of ZWNet simultaneously, we employ two strategies. The first strategy involves training ZWNet to generate the same identifier for both the original input image and its attacked versions. This approach encourages the network to extract consistent features even for images subjected to different attacks. The second strategy entails training ZWNet to produce different identifiers for different host images, promoting the network's ability to create distinct feature maps for varying images. Through these strategies, we reframe the problem of improving zero-watermark performance as a common task in deep learning, akin to multi-label classification.
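The two strategies can be sketched as a target-assignment step. The sketch below assumes the hash-function option mentioned above and uses SHA-256, chosen here only because its digest is exactly 256 bits long; the image keys, attack labels, and function names are hypothetical, and the scheme's actual identifier assignment may differ.

```python
import hashlib


def target_identifier(image_key):
    """Derive a 256-bit target identifier from a unique image key."""
    digest = hashlib.sha256(image_key.encode("utf-8")).digest()  # 32 bytes
    # Unpack each byte into 8 bits, most significant bit first.
    return [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]


def build_training_pairs(image_keys, attacks):
    """Pair every original and attacked sample with its image's identifier."""
    pairs = []
    for key in image_keys:
        target = target_identifier(key)
        # Strategy 1: original and attacked versions share one target.
        pairs.append(((key, "original"), target))
        for attack in attacks:
            pairs.append(((key, attack), target))
    # Strategy 2 follows from hashing: distinct keys give distinct targets.
    return pairs
```

In this sketch, a sample is only a (key, attack) label; in practice each label would be replaced by the corresponding image tensor produced during data augmentation.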
In terms of implementation details, the image identifier in ZWNet is represented as a 256-bit binary sequence, and training is treated as a multi-label classification task. Consequently, BCEWithLogitsLoss is employed as the loss function. This loss function first applies a sigmoid to the output and then computes the difference between the target identifier T_i and the actual output value S_i as follows:

$$\ell_i = -\left[ T_i \log \sigma(S_i) + (1 - T_i) \log\left(1 - \sigma(S_i)\right) \right], \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

Here, i represents the sequence index.
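As a worked illustration, the standard definition of binary cross-entropy with logits, −[T_i log σ(S_i) + (1 − T_i) log(1 − σ(S_i))], can be written in a few lines of plain Python. This is a sketch rather than the training code: the function name is hypothetical, and averaging over the sequence is an assumption that mirrors the default reduction of PyTorch's BCEWithLogitsLoss.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between raw outputs S_i and target bits T_i."""
    total = 0.0
    for s, t in zip(logits, targets):
        # Numerically stable rewrite of
        #   -[t * log(sigmoid(s)) + (1 - t) * log(1 - sigmoid(s))]
        # that avoids overflow in exp() for large |s|.
        total += max(s, 0.0) - s * t + math.log1p(math.exp(-abs(s)))
    return total / len(logits)
```

A logit of 0 corresponds to a sigmoid output of 0.5, so every bit then contributes log 2 to the loss regardless of its target; confident and correct logits drive the loss toward 0.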
It is important to note that the image identifier is used solely during the training stage. In testing and deployment, the network still outputs an identifier, but it is not used: the identifier exists only to train the network, and the network is not updated after the training phase.
The training process involved updating ZWNet with the training data in each epoch and then evaluating the loss on the test images. If the loss on the test data no longer decreased or began to increase, training was terminated. Consequently, the test data were used solely to verify whether training was sufficient and did not influence the updating of ZWNet. Once successfully trained, ZWNet remained stable and deployable: it could process arbitrary images for zero-watermark service without retraining or adjustment. The changes in loss during the training stage are illustrated in Figure 10.
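The stopping rule described here can be sketched as a generic early-stopping loop. The callables `train_one_epoch` and `evaluate_loss`, as well as the `max_epochs` and `patience` parameters, are placeholders introduced for illustration; the text only states that training ends once the test loss no longer decreases.

```python
def train_with_early_stopping(train_one_epoch, evaluate_loss,
                              max_epochs=100, patience=1):
    """Train until the held-out loss stops decreasing; return the loss history."""
    best_loss = float("inf")
    stale_epochs = 0
    history = []
    for _ in range(max_epochs):
        train_one_epoch()       # updates the network on the training data only
        loss = evaluate_loss()  # held-out loss; never used for weight updates
        history.append(loss)
        if loss < best_loss:
            best_loss = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break           # loss has plateaued or started to rise
    return history
```

The evaluation call only reads the loss, which matches the statement that the test data do not influence the updating of the network.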

Robustness
Robustness is a critical feature of digital watermarking technology. We assess ZWNet's robustness by comparing the zero-watermark of the original image with the zero-watermark of the attacked image. The evaluation index employed is NC, as explained in Section 3.6, with a threshold set to 0.8. The attack methods used are consistent with those applied in data augmentation. Using the image Lena (Figure 9a) as an example, the original image and the attacked images are displayed in Figure 11.
The NC results of the images in Figure 9 under different attacks are listed in Table 2. The results in Table 2 indicate that the test images, even when subjected to different types and intensities of attacks, all exhibit NC values above the 0.8 threshold. These results demonstrate the robustness of the proposed scheme. The robustness can be attributed to two key factors: the utilization of ConvNeXt combined with LK-PAN and the effective training strategy for image identifiers.
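For readers who want to reproduce the comparison, the sketch below computes one common form of the normalized correlation between two binary watermark sequences and applies the 0.8 threshold. The exact NC formula is the one given in Section 3.6; the cosine-style definition used here is an assumption, as are the function names.

```python
import math


def normalized_correlation(w1, w2):
    """Cosine-style NC between two equal-length binary sequences (assumed form)."""
    if len(w1) != len(w2):
        raise ValueError("watermarks must have the same length")
    numerator = sum(a * b for a, b in zip(w1, w2))
    denominator = math.sqrt(sum(a * a for a in w1) * sum(b * b for b in w2))
    # An all-zero sequence has no direction; treat it as uncorrelated.
    return numerator / denominator if denominator else 0.0


def is_same_watermark(w1, w2, threshold=0.8):
    """Accept the attacked watermark if its NC reaches the threshold."""
    return normalized_correlation(w1, w2) >= threshold
```

Under this assumed definition, identical sequences give an NC of 1, and the value falls as more bits disagree.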
Beyond the numerical results, it is essential to address overfitting when evaluating the effectiveness of a neural network. Overfitting occurs when a network memorizes the training data instead of learning the target function. In the context of zero-watermarking, overfitting could lead to a network that memorizes images under various attacks instead of extracting robust features. However, ZWNet effectively avoids overfitting. This is primarily due to the strict separation of training and testing images. During the training stage, ZWNet has not been exposed to the four test images shown in Figure 9, preventing it from memorizing these images based on prior knowledge. Therefore, it is evident that ZWNet has successfully learned to extract robust image features rather than merely memorizing them.

Discriminability
Discriminability is a crucial aspect of a zero-watermarking algorithm, ensuring that different images generate distinct zero-watermarks, especially when copyright identifiers are the same. During ZWNet's training, we assessed the similarity between the images of Koala and Panda in evaluation mode and observed the changes in the NC values, as depicted in Figure 12. From Figure 12, it is evident that the NC value decreases from around 0.70 as the training epoch increases. After the 14th epoch, the NC value drops to approximately 0.54. Considering that the NC threshold is set at 0.8, a value of 0.54 is relatively low, indicating that the zero-watermarks are dissimilar and the copyright will not be confused.
To conduct a more precise assessment of ZWNet's discriminability, we used the Hamming distance to compare zero-watermarks generated by different images with the same copyright. The Hamming distance is calculated as follows:

$$D(A, B) = \sum_{i=1}^{n} A_i \oplus B_i$$

Here, A and B represent two zero-watermark sequences, n is the sequence length (which is 256 for ZWNet's watermark block), and ⊕ represents the exclusive OR operation. The Hamming distances among the zero-watermarks of the four test images are detailed in Table 3. The high Hamming distances indicate that the zero-watermarks differ significantly from one another, with more than 80 different bits. Given that the total length of the zero-watermark is 256, this observation suggests that the zero-watermark of each image is substantially distinct from the others. Hence, the discriminability of ZWNet is substantiated.
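The Hamming distance defined above, the number of positions at which two equal-length bit sequences differ, takes only a few lines of Python. The function name is illustrative.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # x ^ y is 1 exactly when the two bits differ.
    return sum(x ^ y for x, y in zip(a, b))
```

For two 256-bit zero-watermarks the result ranges from 0 (identical) to 256 (bitwise complements).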
Two factors contribute to this discriminability. Firstly, using LK-PAN within ZWNet helps extract local features and fuse them with the global context during feature map generation, as described in Section 2.2. Secondly, incorporating unique identifiers in the watermark block plays a crucial role. During the training phase, ZWNet is trained to extract robust features, assigning the same identifier to an image and its attacked versions and different identifiers to distinct images. As a result, ZWNet strives to produce dissimilar feature maps for different inputs, which gives it its discriminative capability. Importantly, this discriminative capability is not a result of overfitting since, as previously mentioned, ZWNet has not seen the test images during training.

Comparisons with Existing Methods
To provide a more objective evaluation of ZWNet's performance, we compare it with three other zero-watermarking methods. The first method, named Yang's method, is a classical zero-watermarking technique that utilizes FQGPCET and has been recognized for its robustness and discriminability [20]. The second method, Liu's method, is an end-to-end neural network-based approach centered on style transfer and removal; it is renowned for its robustness and is named after its creator, Liu [30]. The third method, referred to as Nawaz's method, is a hybrid scheme that combines DWT and ResNet-101 [26]. These three methods are assessed alongside ZWNet in terms of their robustness, discriminability, and efficiency.

Robustness
To compare the robustness of ZWNet with Yang's method, Liu's method, and Nawaz's method, we conducted tests using the same test image, Lena, and subjected it to identical attacks. The results are summarized in Table 4.
From Table 4, it is evident that under the same attack conditions, ZWNet exhibits higher NC values than the other methods in most cases. It excels in robustness, with only a slight decrease in performance under rotation attacks. This suggests that the features extracted by the convolutional layer may not be highly robust when it comes to rotation attacks. However, ZWNet still achieves an NC value greater than 0.9 in this scenario, which is more than adequate for copyright identification.

Discriminability
We utilized the four test images from Figure 9, generating zero-watermarks with identical copyrights using the comparison methods. To assess the differences between these zero-watermarks, we employed the Hamming distance, and the results are summarized in Table 5. As Table 5 shows, the Hamming distances among the test images are higher for ZWNet than for the other comparison methods. This result demonstrates the excellent discriminability of the proposed ZWNet.

Efficiency
While enhancing efficiency was not the primary focus of our study, we conducted an efficiency assessment as it is a crucial consideration for practical usage. For the same set of test images, we ran each of the three zero-watermarking methods 10 times and calculated the average processing time. The efficiency results for these methods are presented in Table 6. The efficiency comparison in Table 6 shows that the time required to generate a single zero-watermark with ZWNet is significantly lower than that of the other two methods. Yang's method appears to be less efficient than ZWNet because it uses the CPU for computation without GPU acceleration. In fact, classical zero-watermarking methods are generally impractical to run on GPUs, as many of their steps cannot be efficiently implemented with tensor operations.
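The measurement protocol (repeated runs over the same test images, then an average) can be sketched with the standard library. The callable `generate` stands in for any zero-watermarking method and is a placeholder, as is the function name; the default of 10 repeats follows the text.

```python
import time


def average_runtime_ms(generate, images, repeats=10):
    """Average wall-clock time in milliseconds to process a single image."""
    durations = []
    for _ in range(repeats):
        for image in images:
            start = time.perf_counter()
            generate(image)  # one zero-watermark generation
            durations.append((time.perf_counter() - start) * 1000.0)
    return sum(durations) / len(durations)
```

When the method runs on a GPU, kernel launches are asynchronous, so the device would need to be synchronized before each clock read for the wall-clock time to be meaningful.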
In the case of Liu's method, although both it and ZWNet are based on neural networks and can be accelerated by GPUs, Liu's method involves learning different style features for each host image. This learning process includes training over multiple epochs, making it more time-consuming than ZWNet. In contrast, ZWNet produces a zero-watermark for a new image without retraining, by passing the host image through ZWNet's layers only once.

Conclusions
This paper introduced an end-to-end zero-watermarking approach built on neural networks, which has practical applicability in scenarios such as image copyright registration, copyright authentication, and piracy detection. In contrast to traditional approaches that rely on handcrafted features, our methodology employs pure neural networks to learn robust features automatically. The structure of ZWNet consists of ConvNeXt and LK-PAN as the backbone and neck, respectively. Furthermore, we introduced the watermark block as the head component, transforming the challenge of enhancing robustness and discriminability into a multi-label classification task based on image identifiers. The experimental results demonstrate that ZWNet effectively extracts resilient image features and generates zero-watermarks without the need for retraining. Moreover, ZWNet exhibits superior robustness, discriminability, and efficiency compared with existing methods. The results suggest that the proposed training strategy on image identifiers notably enhances zero-watermark performance in terms of both robustness and discriminability simultaneously.

Figure 1. The process of zero-watermark generation and verification.

Figure 5. ZWNet's backbone details. (a) Main structure. (b) ConvNeXt block details. (c) Downsample layers details.
The input image has dimensions of 224 × 224 with three color channels (red, green, and blue). Both the training and test datasets are formatted as JPG images with a resolution of 72 dpi. The image initially undergoes processing through a convolutional layer and layer normalization. Subsequently, it is directed through four ConvNeXt blocks and three downsample blocks. Each ConvNeXt block includes a residual connection, a depthwise

Figure 7. Structure of the watermark block. (a) Main structure. (b) Illustration of encrypt-conv layer.

Figure 10. Loss changes in the training stage.

Figure 12. Similarity changes of test images in the training stage.

Table 2. NC results of four test images under different attacks.

Table 3. Hamming distances between the zero-watermarks of four test images.

Table 4. NC results of the comparison methods.

Table 5. Hamming distance of comparison methods.