4KSecure: A Universal Method for Active Manipulation Detection in Images of Any Resolution
Abstract
1. Introduction
- Full flexibility—enables embedding integrity markers in images of virtually any resolution and aspect ratio without additional model retraining;
- High transparency—an SSIM > 0.999 and a PSNR > 60 dB ensure that the interference with the original image is almost imperceptible (especially at high/native resolutions);
- A simple yet effective model—even a standard UNet architecture, enhanced with a classification module, achieves excellent results in embedding and detecting signatures thanks to the proposed data flow. Using an “intermediate” resolution and upsampling allows such a straightforward network to handle high-quality and diverse image formats effectively;
- Economical memory management—introducing an “intermediate” resolution significantly reduces GPU usage, leading to lower hardware requirements and greater scalability. Moreover, shifting the most resource-intensive operations to system RAM further relieves the graphics card;
- No additional payload—the end-to-end pipeline does not require separate hidden data, considerably simplifying the process and reducing the computational costs.
2. Related Works
2.1. Deep Learning in Data Hiding
2.2. Deep Learning in Ensuring Image Integrity
3. Method
3.1. Method Overview
3.2. Method Implementation
3.3. Training Process
- Loss 1 is responsible for ensuring that the binary mask generated by the decoder aligns with the reference mask (ground truth). In this case, the classic Binary Cross-Entropy (BCE) function is used, as each pixel is classified into one of two categories: modified or unmodified. The reference mask is created by the attack module, which designates the manipulated regions. For $N$ pixels with ground-truth labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$, the BCE is
  $$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right];$$
- Loss 2 concerns the classification accuracy of the entire image. As with Loss 1, BCE is applied, but instead of operating on a pixel-wise matrix, it uses a single logit value that determines whether any manipulation is present in the entire image. The ground truth for this function is derived from the reference mask: if any modification is detected, the image is classified as “positive”;
- Loss 3 ensures high image transparency at the intermediate resolution stage. It evaluates artifacts introduced by the encoding network (combined with the input image), while considering the final predictions from the decoder. SSIM and LPIPS metrics are used to achieve this, as they provide a more perceptually accurate assessment of differences as perceived by the human eye compared to simple pixel-space error measures;
- Loss 4 serves a similar purpose to Loss 3, but it is calculated for the final high-resolution image (after upsampling). Due to the larger matrix size and higher computational demands, MSE and PSNR metrics are used instead of SSIM and LPIPS. These metrics balance quality assessment accuracy and computational efficiency, which is particularly important for large images.
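The four loss terms above might be combined as a weighted sum. The following is a minimal PyTorch-style sketch under stated assumptions: it relies on the `pytorch_msssim` and `lpips` packages for the perceptual terms, and the function name `total_loss` and the weighting tuple `w` are illustrative rather than the authors' actual implementation or coefficients.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # differentiable SSIM (assumed helper package)
import lpips                      # Learned Perceptual Image Patch Similarity

lpips_fn = lpips.LPIPS(net="alex")   # AlexNet backbone, as used for evaluation in this paper

def total_loss(mask_logits, gt_mask,     # Loss 1: pixel-wise manipulation mask
               img_logit, gt_label,      # Loss 2: whole-image classification
               marked_mid, cover_mid,    # Loss 3: intermediate-resolution transparency
               marked_hi, cover_hi,      # Loss 4: high-resolution transparency
               w=(1.0, 1.0, 1.0, 1.0)):  # illustrative weights, not the paper's values
    # Loss 1: BCE between the predicted mask and the ground-truth mask from the attack module
    l1 = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    # Loss 2: BCE on a single logit deciding whether any manipulation is present
    l2 = F.binary_cross_entropy_with_logits(img_logit, gt_label)
    # Loss 3: perceptual transparency at the intermediate resolution (SSIM + LPIPS)
    l3 = (1.0 - ssim(marked_mid, cover_mid, data_range=1.0)) \
         + lpips_fn(marked_mid, cover_mid, normalize=True).mean()   # normalize=True expects [0, 1] inputs
    # Loss 4: pixel-space transparency at the final high resolution (PSNR is monotone in MSE)
    l4 = F.mse_loss(marked_hi, cover_hi)
    return w[0] * l1 + w[1] * l2 + w[2] * l3 + w[3] * l4
```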
3.3.1. Dataset
3.3.2. Noises
- Noise False
- This component is responsible for introducing various attacks or modifications that do not alter the semantics of the image. This means that, for a human observer, the image still conveys the same content and should not be classified as manipulated. In practice, this serves two main functions:
- (1) It acts as an additional augmentation technique, expanding the data distribution and increasing the diversity of sample scenes.
- (2) It trains the network to recognize that not every modification to an image should be classified as a manipulation, such as edits that enhance its visual appeal or improve its quality.
- The Noise False group includes operations such as the following:
  - JPEG compression [48] with a randomly selected compression factor ranging from 64.0 to 99.0;
  - Gaussian noise with random intensity and a standard deviation of 0.03;
  - Conversion of the image to grayscale while preserving the number of channels;
  - Brightness adjustment with random intensity in the range of 80% to 120%;
  - Contrast adjustment with random intensity in the range of 80% to 120%.
- Each of these transformations can occur within an acceptable range of intensity, which does not significantly affect the semantics of the image and should not be interpreted as intentional manipulation.
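To illustrate the kind of benign edits listed above, here is a minimal sketch of a Noise False step, assuming float images in [0, 1] with shape (B, 3, H, W); the function name `noise_false` and the sampling choices are illustrative, and the differentiable JPEG approximation [48] is omitted for brevity.

```python
import random
import torch

def noise_false(img: torch.Tensor) -> torch.Tensor:
    """Apply one randomly chosen semantics-preserving edit to a batch of images.
    `img` is assumed to be a float tensor in [0, 1] with shape (B, 3, H, W)."""
    op = random.choice(["gauss", "gray", "brightness", "contrast"])
    if op == "gauss":
        # Gaussian noise with a standard deviation of 0.03
        img = (img + 0.03 * torch.randn_like(img)).clamp(0.0, 1.0)
    elif op == "gray":
        # grayscale conversion that keeps the 3-channel layout
        img = img.mean(dim=1, keepdim=True).expand(-1, 3, -1, -1).contiguous()
    elif op == "brightness":
        # brightness scaled by a random factor in the 80-120% range
        img = (img * random.uniform(0.8, 1.2)).clamp(0.0, 1.0)
    else:
        # contrast adjusted around the per-image mean by a factor in the 80-120% range
        mean = img.mean(dim=(2, 3), keepdim=True)
        img = ((img - mean) * random.uniform(0.8, 1.2) + mean).clamp(0.0, 1.0)
    return img
```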
- Noise True
- After passing through the Noise False module, the image enters the attack sequence labeled Noise True. These transformations actively alter the image’s content by modifying its semantics, such as removing, replacing or obscuring essential elements. As a result, the image deviates from the original to the extent that the network should detect the modification by identifying the manipulated area through binary classification as manipulated or non-manipulated. Each Noise True transformation also produces a binary mask indicating the modified areas, which serves as ground truth during training (see Figure 3). The Noise True sequence includes the following operations:
  - Random Erasing: deletes a random portion of the image, filling it with zeros or a constant value. The affected area may range from 1% to 5% of the image’s surface;
  - Random Cut Mix: swaps image fragments between samples in the batch by inserting part of one image into another. The affected area may range from 1% to 7% of the image’s surface;
  - Cut Mix Restore: restores a fragment of the original image (without a hidden mark) in place of an already marked section. The affected area may range from 1% to 10% of the image’s surface;
  - Random Clipart Mix: inserts randomly positioned, scaled and oriented clipart into the image. The clipart’s size may range from 8% to 25% of the image’s surface;
  - Random Region Noise: applies noticeable blurring to selected image regions. The affected area may range from 8% to 25% of the image’s surface;
  - Random Crop Pair: extracts a portion of the already noisy image and its corresponding mask and resizes it back to the original dimensions. The selected area may range from 60% to 99% of the image’s surface.
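As an illustration of how a Noise True attack can produce both the attacked image and its ground-truth mask, here is a minimal sketch of Random Erasing under stated assumptions; the function name and the aspect-ratio sampling are illustrative, not the authors' exact procedure.

```python
import random
import torch

def random_erasing_with_mask(img: torch.Tensor, min_frac: float = 0.01, max_frac: float = 0.05):
    """Erase a random rectangle covering 1-5% of the image area and return the
    attacked image together with the binary ground-truth mask of the modified region.
    `img` is assumed to be a (C, H, W) float tensor in [0, 1]."""
    _, h, w = img.shape
    area = random.uniform(min_frac, max_frac) * h * w
    # pick a rectangle with roughly the sampled area and a random aspect ratio
    aspect = random.uniform(0.5, 2.0)
    rh = min(h, max(1, int(round((area * aspect) ** 0.5))))
    rw = min(w, max(1, int(round((area / aspect) ** 0.5))))
    top = random.randint(0, h - rh)
    left = random.randint(0, w - rw)

    attacked = img.clone()
    attacked[:, top:top + rh, left:left + rw] = 0.0   # fill with a constant value
    mask = torch.zeros(1, h, w)
    mask[:, top:top + rh, left:left + rw] = 1.0       # 1 = manipulated pixel
    return attacked, mask
```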
3.4. Evaluation
3.4.1. Transparency Metrics Obtained
- SSIM (Structural Similarity Index) is a metric used to evaluate the similarity between two images, considering factors such as luminance, contrast and structural details. It ranges from 0 to 1, where 1 represents perfect similarity between the images.
- In a simplified form, the SSIM of two images $x$ and $y$ can be defined by the following equation:
  $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
  where:
  - $\mu_x$, $\mu_y$ are the mean intensities of $x$ and $y$;
  - $\sigma_x^2$, $\sigma_y^2$ are their variances and $\sigma_{xy}$ is their covariance;
  - $C_1$, $C_2$ are small constants that stabilize the division.
- PSNR (Peak Signal-to-Noise Ratio) measures image quality based on the difference between pixel values in the original and distorted images. It is typically expressed in decibels (dB) and is widely used to evaluate image perceptual similarity. A higher PSNR value indicates better image quality and less distortion.
- For 8-bit images (maximum pixel value $\mathrm{MAX} = 255$), the PSNR is commonly calculated using the following formulas:
  $$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right);$$
  $$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ I(i, j) - K(i, j) \right]^2,$$
  where $I$ is the original image, $K$ is the distorted image and $m \times n$ is the image size.
- LPIPS (Learned Perceptual Image Patch Similarity) uses deep neural networks to assess differences between images in a way that approximates human perception. Instead of comparing pixel values, LPIPS compares feature embeddings extracted from the layers of a neural network trained on a large number of image examples. It takes values in the range of 0 to 1, where 0 indicates perfect similarity between images. The AlexNet [49] backbone was selected as the feature extractor.
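A minimal sketch of how these three transparency metrics can be computed, assuming uint8 RGB images and the `scikit-image` and `lpips` packages; the helper name `transparency_metrics` is illustrative.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net="alex")   # AlexNet feature extractor, as stated above

def transparency_metrics(cover: np.ndarray, marked: np.ndarray):
    """Compute SSIM, PSNR and LPIPS between the cover and the marked image.
    Both inputs are assumed to be uint8 RGB arrays of shape (H, W, 3)."""
    ssim_val = structural_similarity(cover, marked, channel_axis=2, data_range=255)
    psnr_val = peak_signal_noise_ratio(cover, marked, data_range=255)
    # LPIPS expects float tensors shaped (1, 3, H, W); normalize=True rescales [0, 1] inputs to [-1, 1]
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() / 255.0
    lpips_val = lpips_fn(to_tensor(cover), to_tensor(marked), normalize=True).item()
    return ssim_val, psnr_val, lpips_val
```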
3.4.2. Detection Metrics Obtained
Image Manipulation Detection Metrics Obtained
- IoU—measures the proportion of correctly detected pixels (the intersection of the predicted and actual manipulation area) relative to the total number of pixels present in both sets. With $P$ denoting the set of pixels predicted as manipulated and $G$ the ground-truth manipulated pixels, the formula is given by
  $$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|};$$
- Dice—similar to IoU, measures the overlap between the predicted manipulation area and the actual manipulation area but places greater emphasis on overall coverage. This formulation gives more weight to the consistency between predictions and the ground truth. The formula is given by
  $$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|};$$
- Coverage—measures the fraction of all pixels predicted as “manipulated” that actually overlap with the manipulated region in the ground truth. This metric answers the question of how accurate the detection was from the perspective of the predicted manipulation area, indicating what percentage of predicted pixels were truly modified. The formula is given by
  $$\mathrm{Coverage} = \frac{|P \cap G|}{|P|};$$
- AUC—measures the performance of a binary classifier by calculating the area under the ROC (Receiver Operating Characteristic) curve, which plots the following two rates against each other as the decision threshold varies. It indicates how well the model distinguishes between classes. The AUC value falls within the range [0, 1], where 1.0 represents perfect classification, 0.5 corresponds to random predictions, and 0.0 indicates completely incorrect classification.
  - TPR (True Positive Rate): $\mathrm{TPR} = \dfrac{TP}{TP + FN}$;
  - FPR (False Positive Rate): $\mathrm{FPR} = \dfrac{FP}{FP + TN}$.
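For reference, here is a minimal sketch of the pixel-level localization metrics computed from thresholded binary masks; the helper name and the `eps` smoothing term are illustrative additions, not part of the original definitions. The image-level AUC can then be obtained from the per-image scores with a standard routine such as `sklearn.metrics.roc_auc_score`.

```python
import numpy as np

def localization_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """IoU, Dice and Coverage for binary manipulation masks.
    `pred` and `gt` are boolean arrays of equal shape (the prediction already thresholded);
    `eps` avoids division by zero for empty masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    coverage = inter / (pred.sum() + eps)   # fraction of predicted pixels that are truly manipulated
    return iou, dice, coverage
```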
- The first row contains the original images;
- The second row presents the images after the integrity markers have been applied;
- The third row shows the artifacts introduced by the encoding network, which have been normalized (0–1) to make them visible;
- The fourth row includes false and actual manipulations introduced into the previously marked images, with some highlighted using orange arrows for better readability;
- The fifth row displays binary ground truth masks indicating areas of actual manipulation (true attacks);
- The last row presents the raw output of the decoding model (without any thresholds or filters), attempting to reconstruct the manipulated regions.
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, S.; Saharia, C.; Montgomery, C.; Pont-Tuset, J.; Noy, S.; Pellegrini, S.; Onoe, Y.; Laszlo, S.; Fleet, D.J.; Soricut, R.; et al. Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18359–18369. [Google Scholar]
- Duszejko, P.; Walczyna, T.; Piotrowski, Z. Detection of Manipulations in Digital Images: A Review of Passive and Active Methods Utilizing Deep Learning. Appl. Sci. 2025, 15, 881. [Google Scholar] [CrossRef]
- Farid, H. Image Forgery Detection. IEEE Signal Process. Mag. 2009, 26, 16–25. [Google Scholar] [CrossRef]
- Tyagi, S.; Yadav, D. A Detailed Analysis of Image and Video Forgery Detection Techniques. Vis. Comput. 2023, 39, 813–833. [Google Scholar] [CrossRef]
- Kaczyński, M.; Piotrowski, Z. High-Quality Video Watermarking Based on Deep Neural Networks and Adjustable Subsquares Properties Algorithm. Sensors 2022, 22, 5376. [Google Scholar] [CrossRef]
- Bistroń, M.; Piotrowski, Z. Efficient Video Watermarking Algorithm Based on Convolutional Neural Networks with Entropy-Based Information Mapper. Entropy 2023, 25, 284. [Google Scholar] [CrossRef]
- Lenarczyk, P.; Piotrowski, Z. Parallel Blind Digital Image Watermarking in Spatial and Frequency Domains. Telecommun. Syst. 2013, 54, 287–303. [Google Scholar] [CrossRef]
- Teca, G.; Natkaniec, M. StegoBackoff: Creating a Covert Channel in Smart Grids Using the Backoff Procedure of IEEE 802.11 Networks. Energies 2024, 17, 716. [Google Scholar] [CrossRef]
- Natkaniec, M.; Kępowicz, P. StegoEDCA: An Efficient Covert Channel for Smart Grids Based on IEEE 802.11e Standard. Energies 2025, 18, 330. [Google Scholar] [CrossRef]
- Jekateryńczuk, G.; Jankowski, D.; Veyland, R.; Piotrowski, Z. Detecting Malicious Devices in IPSEC Traffic with IPv4 Steganography. Appl. Sci. 2024, 14, 3934. [Google Scholar] [CrossRef]
- Teca, G.; Natkaniec, M. A Novel Covert Channel for IEEE 802.11 Networks Utilizing MAC Address Randomization. Appl. Sci. 2023, 13, 8000. [Google Scholar] [CrossRef]
- Wang, Z.; Byrnes, O.; Wang, H.; Sun, R.; Ma, C.; Chen, H.; Wu, Q.; Xue, M. Data Hiding With Deep Learning: A Survey Unifying Digital Watermarking and Steganography. IEEE Trans. Comput. Soc. Syst. 2023, 10, 2985–2999. [Google Scholar] [CrossRef]
- Ye, C.; Tan, S.; Wang, J.; Shi, L.; Zuo, Q.; Feng, W. Social Image Security with Encryption and Watermarking in Hybrid Domains. Entropy 2025, 27, 276. [Google Scholar] [CrossRef]
- Cao, F.; Ye, H.; Huang, L.; Qin, C. Multi-Image Based Self-Embedding Watermarking with Lossless Tampering Recovery Capability. Expert Syst. Appl. 2024, 258, 125176. [Google Scholar] [CrossRef]
- Zhu, J.; Kaplan, R.; Johnson, J.; Li, F.-F. HiDDeN: Hiding Data With Deep Networks. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11219, pp. 682–697. ISBN 978-3-030-01266-3. [Google Scholar]
- Luo, X.; Zhan, R.; Chang, H.; Yang, F.; Milanfar, P. Distortion Agnostic Deep Watermarking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13548–13557. [Google Scholar]
- Tancik, M.; Mildenhall, B.; Ng, R. StegaStamp: Invisible Hyperlinks in Physical Photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Jing, J.; Deng, X.; Xu, M.; Wang, J.; Guan, Z. HiNet: Deep Image Hiding by Invertible Network. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; IEEE: Montreal, QC, Canada, 2021; pp. 4713–4722. [Google Scholar]
- Yang, H.; Xu, Y.; Liu, X. DKiS: Decay Weight Invertible Image Steganography with Private Key. Neural Netw. 2025, 185, 107148. [Google Scholar] [CrossRef] [PubMed]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Liu, Y.; Guo, M.; Zhang, J.; Zhu, Y.; Xie, X. A Novel Two-Stage Separable Deep Learning Framework for Practical Blind Watermarking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1509–1517. [Google Scholar]
- Lyu, S.; Pan, X.; Zhang, X. Exposing Region Splicing Forgeries with Blind Local Noise Estimation. Int. J. Comput. Vis. 2014, 110, 202–221. [Google Scholar] [CrossRef]
- Fan, Y.; Carré, P.; Fernandez-Maloigne, C. Image Splicing Detection with Local Illumination Estimation. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 2940–2944. [Google Scholar]
- Chen, J.; Kang, X.; Liu, Y.; Wang, Z.J. Median Filtering Forensics Based on Convolutional Neural Networks. IEEE Signal Process. Lett. 2015, 22, 1849–1853. [Google Scholar] [CrossRef]
- Bayar, B.; Stamm, M.C. Constrained Convolutional Neural Networks: A New Approach towards General Purpose Image Manipulation Detection. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2691–2706. [Google Scholar] [CrossRef]
- Wu, Y.; AbdAlmageed, W.; Natarajan, P. Mantra-Net: Manipulation Tracing Network for Detection and Localization of Image Forgeries with Anomalous Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9543–9552. [Google Scholar]
- Hu, X.; Zhang, Z.; Jiang, Z.; Chaudhuri, S.; Yang, Z.; Nevatia, R. SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12366, pp. 312–328. ISBN 978-3-030-58588-4. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
- Wang, J.; Wu, Z.; Chen, J.; Han, X.; Shrivastava, A.; Lim, S.-N.; Jiang, Y.-G. ObjectFormer for Image Manipulation Detection and Localization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 2354–2363. [Google Scholar]
- Zhao, Y.; Liu, B.; Zhu, T.; Ding, M.; Yu, X.; Zhou, W. Proactive Image Manipulation Detection via Deep Semi-Fragile Watermark. Neurocomputing 2024, 585, 127593. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Walczyna, T.; Piotrowski, Z. Fast Fake: Easy-to-Train Face Swap Model. Appl. Sci. 2024, 14, 2149. [Google Scholar] [CrossRef]
- Duan, X.; Liu, N.; Gou, M.; Wang, W.; Qin, C. SteganoCNN: Image Steganography with Generalization Ability Based on Convolutional Neural Network. Entropy 2020, 22, 1140. [Google Scholar] [CrossRef] [PubMed]
- Riba, E.; Mishkin, D.; Ponsa, D.; Rublee, E.; Bradski, G. Kornia: An Open Source Differentiable Computer Vision Library for PyTorch. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020. [Google Scholar]
- Li, Y.; Wang, H.; Barni, M. A Survey of Deep Neural Network Watermarking Techniques. Neurocomputing 2021, 461, 171–193. [Google Scholar] [CrossRef]
- Zhong, X.; Das, A.; Alrasheedi, F.; Tanvir, A. A Brief, In-Depth Survey of Deep Learning-Based Image Watermarking. Appl. Sci. 2023, 13, 11852. [Google Scholar] [CrossRef]
- Singh, H.K.; Singh, A.K. Digital Image Watermarking Using Deep Learning. Multimed. Tools Appl. 2024, 83, 2979–2994. [Google Scholar] [CrossRef]
- Langguth, J.; Pogorelov, K.; Brenner, S.; Filkuková, P.; Schroeder, D.T. Don’t Trust Your Eyes: Image Manipulation in the Age of DeepFakes. Front. Commun. 2021, 6, 632317. [Google Scholar] [CrossRef]
- Capasso, P.; Cattaneo, G.; De Marsico, M. A Comprehensive Survey on Methods for Image Integrity. ACM Trans. Multimedia Comput. Commun. Appl. 2024, 20, 347. [Google Scholar] [CrossRef]
- Zheng, L.; Zhang, Y.; Thing, V.L.L. A Survey on Image Tampering and Its Detection in Real-World Photos. J. Vis. Commun. Image Represent. 2019, 58, 380–399. [Google Scholar] [CrossRef]
- Open Images V7. Available online: https://storage.googleapis.com/openimages/web/index.html (accessed on 15 January 2025).
- Marszałek, M.; Laptev, I.; Schmid, C. HOLLYWOOD2 Human Actions and Scenes Dataset. Available online: https://www.di.ens.fr/~laptev/actions/hollywood2/ (accessed on 15 January 2025).
- Free SVG Clip Art and Silhouettes for Cricut Cutting Machines. Available online: https://freesvg.org/ (accessed on 15 January 2025).
- bghira/photo-concept-bucket. Datasets at Hugging Face. Available online: https://huggingface.co/datasets/bghira/photo-concept-bucket (accessed on 23 January 2025).
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Shin, R. JPEG-Resistant Adversarial Images. In Proceedings of the Advances in Neural Information Processing Systems Workshops, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Li, G.; Li, S.; Luo, Z.; Qian, Z.; Zhang, X. Purified and Unified Steganographic Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024. [Google Scholar]
- Neekhara, P.; Hussain, S.; Zhang, X.; Huang, K.; McAuley, J.; Koushanfar, F. FaceSigns: Semi-Fragile Neural Watermarks for Media Authentication and Countering Deepfakes. arXiv 2022, arXiv:2204.01960. [Google Scholar]
- CelebA Dataset. Available online: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 31 January 2025).
- Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
- Asnani, V.; Yin, X.; Hassner, T.; Liu, X. MaLP: Manipulation Localization Using a Proactive Scheme. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
- Zhang, X.; Tang, Z.; Xu, Z.; Li, R.; Xu, Y.; Chen, B.; Gao, F.; Zhang, J. OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking. arXiv 2024, arXiv:2412.01615. [Google Scholar]
- The MIRFLICKR Retrieval Evaluation. Available online: https://press.liacs.nl/mirflickr/ (accessed on 31 January 2025).
- Wang, Y.; Zhu, X.; Ye, G.; Zhang, S.; Wei, X. Achieving Resolution-Agnostic DNN-Based Image Watermarking: A Novel Perspective of Implicit Neural Representation. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024. [Google Scholar]
- Zhang, X.; Xu, Y.; Li, R.; Yu, J.; Li, W.; Xu, Z.; Zhang, J. V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024. [Google Scholar]
- Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 724–732. [Google Scholar]
- Feng, J.; Wu, Y.; Sun, H.; Zhang, S.; Liu, D. Panther: Practical Secure Two-Party Neural Network Inference. IEEE Trans. Inf. Forensics Secur. 2025, 20, 1149–1162. [Google Scholar] [CrossRef]
- Zhao, P.; Lai, L. Minimax Optimal Q Learning With Nearest Neighbors. IEEE Trans. Inf. Theory 2025, 71, 1300–1322. [Google Scholar] [CrossRef]
- Zhang, P.; Fang, X.; Zhang, Z.; Fang, X.; Liu, Y.; Zhang, J. Horizontal Multi-Party Data Publishing via Discriminator Regularization and Adaptive Noise under Differential Privacy. Inf. Fusion 2025, 120, 103046. [Google Scholar] [CrossRef]
Method Name | Year | Dataset Used for Testing | Cover Size (Used in Paper) | SSIM | PSNR [dB] |
---|---|---|---|---|---|
HiDDeN [15] | 2018 | ImageNet | 256 × 256 | 0.9234 | 28.87 |
StegaStamp [17] | 2020 | ImageNet | 400 × 400 | 0.930 | 29.88 |
SteganoCNN [35] | 2020 | ImageNet | 256 × 256 | 0.981 | 35.852 |
HiNet [18] | 2021 | ImageNet | 256 × 256 | 0.9920 | 46.88 |
DKiS [19] | 2024 | ImageNet | 256 × 256 | 0.9948 | 41.91 |
PUSNet [50] | 2024 | ImageNet | 256 × 256 | 0.9756 | 38.94 |
Ours (indirect-resolution) | 2025 | ImageNet | 144 × 256 | 0.9995 | 58.9127 |
Ours (indirect-resolution) | 2025 | Our dataset | 144 × 256 | 0.9995 | 58.3834 |
Ours (high-resolution) | 2025 | ImageNet | Potentially any commercial size (tested up to 1936 × 2592) | 0.9996 | 60.7743 |
Ours (high-resolution) | 2025 | Our dataset | Potentially any commercial size (tested up to 3840 × 2160) | 0.9996 | 60.2165 |
Method Name | Year | Dataset Used for Train/Test | Image Size (Used in Paper) | AUC | SSIM | PSNR [dB] | Comments
---|---|---|---|---|---|---|---
FaceSigns [51] | 2022 | CelebA [52], FFHQ [53] | 256 × 256 | 0.996 | 0.975 | 36.08 | This method primarily works with face datasets (CelebA, FFHQ, etc.).
EditGuard [54] | 2023 | COCO [55], CelebA | 512 × 512 | 0.933 | 0.949 | 37.77 |
MaLP [56] | 2023 | COCO, CelebA | 256 × 256 | 1.0 | 0.7312 | 23.02 |
OmniGuard [57] | 2024 | MIR-FlickR [58] | 512 × 512 | 0.991 | 0.989 | 41.78 |
Wang et al. [59] | 2024 | COCO | Potentially any commercial size (tested up to 1920 × 1080) | - | - | 39.61 | The method can only work at the specific resolution on which it was trained. The authors achieved an accuracy of 99.90%.
V2A-Mark [60] | 2024 | Vimeo-90K [61], DAVIS [62] | 448 × 256 | 0.972 | 0.983 | 40.83 | A separate and coordinated watermark for the video track and audio, plus a “cross-modal” mechanism—the final key can combine what is decoded from both the image and the sound.
Zhao et al. [31] | 2024 | FFHQ, CelebA | 256 × 256 | 0.95 | 0.94 | 38.05 |
Ours (high-resolution) | 2025 | Our dataset | Potentially any commercial size (tested up to 3840 × 2160) | 0.9668 | 0.9996 | 60.2165 | |
Ours (high-resolution) | 2025 | ImageNet-1K | Potentially any commercial size (tested up to 1936 × 2592) | 0.9783 | 0.9996 | 60.7743 | (1936 × 2592) is the maximum image resolution in the ImageNet-1K dataset. |