A One-Class Classifier for the Detection of GAN Manipulated Multi-Spectral Satellite Images

The highly realistic image quality achieved by current image generative models has many academic and industrial applications. To limit the use of such models to benign applications, though, it is necessary to develop tools that can conclusively detect whether an image has been generated synthetically or not. For this reason, several detectors have been developed that provide excellent performance in computer vision applications; however, they cannot be applied as they are to multispectral satellite images, and hence new models must be trained. In general, two-class classifiers can achieve very good detection accuracies; however, they are not able to generalise to image domains and generative model architectures different from those used during training. For this reason, in this paper, we propose a one-class classifier based on Vector Quantized Variational Autoencoder 2 (VQ-VAE 2) features to overcome the limitations of two-class classifiers. First, we emphasize the generalization problem that binary classifiers suffer from by training and testing an EfficientNet-B4 architecture on multiple multispectral datasets. Then we show that, since the VQ-VAE 2-based classifier is trained only on pristine images, it is able to detect images belonging to different domains and generated by architectures that have not been used during training. Last, we compare the two classifiers head-to-head on the same generated datasets, highlighting the superior generalization capabilities of the VQ-VAE 2-based detector.


Introduction
Deep Learning (DL) techniques have established new State of the Art (SOTA) benchmarks in several fields: from bioinformatics to computer vision, from natural language processing to object detection [1][2][3][4][5]. In this context, the development of tools for the creation of image forgeries, on one hand, and for authenticity verification and other forensic applications, on the other, has seen a steep increase in the use of DL techniques [6]. Among the possible application domains, great attention is increasingly devoted to satellite image analysis.
As a matter of fact, satellite images play a crucial role in several application areas, such as meteorological forecasts, landscape analysis, agriculture, regional planning, monitoring and detection of natural disasters, and many others. As a result, the number of commercial satellites is constantly growing, and the accessibility of satellite images with higher and higher ground resolution [7] is increasing on a daily basis. As for other application domains, DL provides several tools to manipulate satellite images. Some examples of DL-based tools for satellite image manipulation are described in Refs. [8][9][10]. Such manipulations are often related to disinformation campaigns, as reported, for instance, in [11]. Hence, there is a growing need to develop DL forensic methods suited for the detection and identification of satellite image forgeries. The extension of image forensics tools developed for computer vision applications to satellite imagery, however, proves to be challenging from several points of view. First of all, forensic techniques developed for non-satellite images must be adapted to the specific content of overhead imagery. This is due to the inherently different features and characteristics of conventional RGB images and multispectral satellite images. In addition, quite often, satellite images have more than 3 bands. For example, Sentinel-2 Level 1C optical images have 13 bands, each group of bands characterized by a different Ground Sampling Distance (GSD). Moreover, unlike RGB images, each pixel is typically represented by more than 8 bits per band (12 bits in the case of Sentinel-2 Level 1C images). On top of that, synthetically generated datasets of multispectral images are missing, thus making it difficult to benchmark DL tools for satellite image forgery detection.
Generative Adversarial Networks (GANs) can be successfully used to create synthetic satellite images [8]. Such architectures have proven to be extremely useful for image generation or style transfer, and they are applicable not only to RGB images [12,13] but also to multispectral images [14] with only some minor modifications. A few works have also been proposed for the detection of GAN-generated satellite images. In general, most of these tools are trained only on the RGB bands of multispectral images and exhibit good detection capabilities when the training and test datasets are acquired under matched conditions, but they fail to generalize to unseen data. One tool that is actually trained on multispectral images, but also lacks generalization capabilities, is described in [15], where the authors present a detector based on an EfficientNet-B4 model trained on multispectral images, achieving very good performance. When the model is tested on a different dataset, though, the performance drops significantly.
To overcome the generalization weakness of SOTA techniques for the detection of GAN multispectral satellite images, in this work, we propose to use a one-class classifier based on a Vector Quantized Variational Autoencoder (VQ-VAE) trained only on pristine images, and to use the reconstruction loss between the input and the output images to distinguish GAN and pristine images.
We evaluated the performance of the one-class classifier on several multispectral GAN synthetic Sentinel-2 level-1C satellite datasets and compared it against that of a conventional manipulation detector based on EfficientNet-B4, trained on both pristine and GAN-generated satellite images. We ran the experiments on the full 13 bands of Sentinel-2 level-1C samples. The results demonstrate the superior performance of the one-class classifier in terms of generalization capability, with only a small performance loss with respect to the two-class classifier under matched conditions.
The paper is structured as follows: in Section 2, we overview the state of the art in the field of satellite imagery forgery detection. In Section 3, we describe the datasets we created or collected, differentiating between the various models and GAN methodologies used. In Section 4, we describe the VQ-VAE one-class classifier, and in Section 5, we present the experimental results proving the validity of the proposed method. Finally, Section 6 draws conclusions and offers future perspectives on our work.

State of the Art
In this section, we overview the prior art on satellite image forgery detection. In [16], the authors introduce a framework composed of two steps for the detection and localization of forgeries in satellite images. In the first step, a GAN is trained to obtain a set of features capable of representing pristine satellite images. In the second step, a one-class classifier based on a Support Vector Machine (SVM) is applied to distinguish pristine and non-pristine images. The method proposed in [17] is based on a conditional GAN architecture that is trained on two domains: the first is the domain of spliced images, while the second contains the forgery masks. Given a GAN image as input, the GAN generator is used to estimate a forgery mask that is as close as possible to the real one. In [18], the authors apply an analysis similar to [16], by jointly training an autoencoder and a Support Vector Data Descriptor (SVDD) [19]. Furthermore, in [20], the authors take advantage of a one-class Deep Belief Network (DBN), which is employed for detecting and localizing forged images. The same authors, in [21], propose to use a Vision Transformer for image reconstruction. In this case, the forgery mask is obtained by observing the differences between the input and output images.
Another interesting work is [22], where the heat maps of forged regions are estimated by using a novel architecture based on a U-Net nested within a GAN architecture.The system built in this way is able to localize RGB image forgeries generated by 3 different types of GANs: StyleGAN2 [23], ProGAN [24] and CycleGAN [25].The forgeries are created by using Sentinel-2 RGB images that are spliced within the generated images.The output of the architecture consists of a probability mask, where each pixel is associated with the probability of having been generated by one of the specific GANs that are tested in the experimental framework.
In [26], the authors implement a GAN-based approach for semantic satellite image translation. In the same work, they also present a data-driven approach for the detection of GAN-generated images and use an SVM to detect cycleGAN-generated images from a set of features (both spatial and spectral). In [27], the authors rely on the dataset presented in [5] to develop an additional method to detect semantically transformed RGB satellite images, starting from the detection of high-frequency details in the generated samples.
Most of the methods proposed so far focus on RGB 8-bit images. To the best of our knowledge, the only method that has been proposed to detect forgeries in multispectral images with 13 bands, like those acquired by Sentinel-2 sensors, is [15], which cannot generalize to mismatched data, as we already mentioned in the Introduction.

Datasets
In this section, we describe the datasets that we have used to train and test the detectors. The datasets consist of two main types of images: pristine Sentinel-2 level-1C images and GAN images generated by models trained on Sentinel-2 level-1C datasets. The pristine images were obtained from the ESA Copernicus hub [28]. In our datasets, the images consist of 13 bands. The spatial resolution of the blue, green, red, and Near Infrared (NIR) bands (bands 2, 3, 4, and 8, respectively) is 10m, with an overall size of 10980×10980 pixels. Bands 5, 6, 7, 8a, 11, and 12, instead, have a 20m spatial resolution and a size of 5490×5490 pixels. Finally, bands 1, 9, and 10 have a spatial resolution of 60m and a size equal to 1830×1830 pixels. The radiometric resolution of all the bands is 16 bits.
Prior to their use, all the bands that did not have a resolution of 10m were up-sampled. This allowed us to deal with images having the same size as the 10m image bands. We then divided the images into 512×512 patches. Tiling was implemented by using the gdal-retile utility of the GDAL software library [29]. Moreover, as a further pre-processing step, we removed from all the datasets the tiles with no-data pixels (0 brightness).
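As an illustration of the pre-processing described above, the following Python sketch up-samples a lower-resolution band and tiles a band stack in memory. It is not the gdal-retile workflow used for the paper: the nearest-neighbour interpolation, the function names `upsample_nearest` and `tile_image`, the choice to drop incomplete border patches, and the rule of discarding a tile as soon as it contains one zero-valued pixel are all assumptions made for the example.

```python
import numpy as np


def upsample_nearest(band, factor):
    """Up-sample a 2-D band by an integer factor (nearest neighbour, assumed)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)


def tile_image(image, tile=512):
    """Split a (bands, H, W) array into non-overlapping tile x tile patches.

    Incomplete border patches are skipped, and any patch containing a
    zero-valued (no-data) pixel is discarded. Both rules are assumptions.
    """
    _, height, width = image.shape
    patches = []
    for row in range(0, height - tile + 1, tile):
        for col in range(0, width - tile + 1, tile):
            patch = image[:, row:row + tile, col:col + tile]
            if not (patch == 0).any():
                patches.append(patch)
    return patches


# Toy example: a 12x12 "10m" band, a 6x6 "20m" band and a 2x2 "60m" band,
# mirroring the 1:2:6 size ratios of the Sentinel-2 band groups.
b10 = np.ones((12, 12), dtype=np.uint16)
b10[0, 0] = 0  # one no-data pixel in the top-left corner
b20 = np.ones((6, 6), dtype=np.uint16)
b60 = np.ones((2, 2), dtype=np.uint16)
stack = np.stack([b10, upsample_nearest(b20, 2), upsample_nearest(b60, 6)])
patches = tile_image(stack, tile=4)
```

In the toy example, the stack is cut into nine 4×4 patches, and the one containing the no-data pixel is dropped.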
For the generalization experiments, we collected an additional dataset, hereafter referred to as "This-city-does-not-exist" dataset, containing RGB images of size 512×512. A detailed description of all the datasets is given in the next subsections.
A summary of all the datasets described in this section is given in Table 1.

Table 1 Summary of the datasets used in our work (see Section 3). The first column reports the names of the datasets. The other columns report the number of bands, the size, the type of architecture used to generate the images, the type of transfer applied to the images, the total number of pristine images and the total number of GAN images.

The datasets were used for several tasks. To start with, we needed a set of pristine images to train the GAN architectures and the VQ-VAE 2-based detector. The pristine images and the GAN images were then used to train the 2-class EfficientNet detector. In addition, the GAN images were used to test both detectors. Finally, a small number of pristine images were used to calibrate the thresholds of both detectors, since the assessment metric we used was the probability of detection at a false alarm rate of 0.1. Table 2 shows how we split the images of the various datasets across the different tasks.

Land Cover (LC) Transfer Datasets
The first dataset we built is the land cover (LC) transfer dataset. The dataset is formed by two classes of images: pristine and GAN images. To generate the GAN images, we first gathered the pristine images that were used to train the cycleGAN architecture [25] in charge of transferring the land cover from one domain to another. As indicated in Table 2, the pristine images were also used to train the GAN image detectors. The cycleGAN was instructed to transfer barren to vegetation landscapes and vice versa. For this reason, the dataset had to contain two kinds of images: images dominated by vegetation, and images mostly containing barren terrain. To build the dataset, we exploited the land cover classification data of the Organisation for Economic Co-operation and Development (OECD) [30]. For the vegetation domain, we obtained images from the following countries: El Salvador, Congo, Montenegro, Gabon, and Guyana. For the barren domain, we downloaded images from South and Central America.
In total, we gathered 10000 images of size 512×512 per domain (vegetation and barren). The cycleGAN was trained on 16000 pristine images equally split into vegetation and barren terrains. The remaining 4000 images were used to generate 4000 GAN images using the trained cycleGAN model. We relied on this set to train the EfficientNet-B4 detector on 3000 pristine images and 3000 generated images. The remaining 1000 GAN images were used for testing, while 100 pristine images were used for threshold calibration. To train the VQ-VAE 2, we gathered 10000 additional pristine images (5000 barren and 5000 vegetation). Then, we used 29000 pristine images (all the images except 1000 images we left aside) as part of the VQ-VAE 2 training dataset. Figure 1a shows two examples from the pristine LC dataset. Figure 1b, instead, shows two examples of images generated by the cycleGAN.

China and Scandinavian Season Transfer Datasets
The China and Scandinavian (Scand) Season Transfer datasets are another example of image-to-image GAN image generation. Instead of changing the type of land cover, in this case the network was asked to generate the summer (resp. winter) counterpart of a winter (resp. summer) image. To create these datasets, we started from images taken at two different geographic locations (China and Scandinavia) characterized by very different seasonal changes. To generate the GAN images of the China season transfer dataset, we trained two pix2pix GAN models [31]. Since pix2pix is an architecture for one-way style transfer, we needed to train two different models to achieve a season-style transfer in both directions. For training, we needed a paired dataset, so we downloaded images taken in China at the same location but in two different months of the year. Specifically, we retrieved images taken in August 2020 and in January 2021. The total number of pristine images collected for the China dataset is 16000, forming 8000 pairs, 6000 of which were used to train the GANs. We then created a dataset from the 13-band GAN models that contained 8000 images, equally divided into pristine and GAN images and equally divided into summer and winter. Of these, 6000 images were used to train the EfficientNet-B4 detector, 100 pristine images were used for threshold calibration, and 1000 GAN images were used to test the two detectors. In addition, 15000 pristine images from this dataset (excluding the images used for calibration and testing) were used as part of the VQ-VAE 2 training dataset. Figure 1c shows two examples of pristine images, while Figure 1d shows two examples of GAN-generated images. The Scandinavian season transfer dataset was created in a similar manner, with summer images taken in June 2020 and winter images in February 2020. See Figures 1e and 1f for two examples of pristine and GAN images. The total number of pristine images collected for this dataset is 17044, which corresponds to 8522 pairs. The style transfer GANs were trained on 6522 pristine image pairs, while the EfficientNet-B4 detector was trained on 6000 images, the threshold calibration of the detectors was performed on 100 pristine images, and the tests were carried out on 1000 GAN images.

Alps Dataset
The Alps dataset is a pristine image dataset collected to help train the VQ-VAE 2 detector. We collected images from the exact same area in two different months, with each month representing a different season (June 2019 for summer and December 2019 for winter). To avoid including images with clouds, we selected images with limited cloud cover. Since it was not possible to obtain images with 0% cloud cover, we limited the search to images with cloud cover less than 9%. As a result, we obtained a dataset with 7872 pristine images.

This-city-does-not-exist Dataset
This dataset is used to test the ability of the various models to generalize to images generated by unknown architectures and with completely different content. It contains only GAN images downloaded from [32]. The images have been generated by a styleGAN2 model. We collected 140 images of size 1024×1024. The images of this dataset have only 3 bands (RGB). Figure 2a shows two examples from the this-city-does-not-exist dataset.

The VQ-VAE 2 One-Class Classifier
In this section, we describe the one-class classifier we have developed to distinguish pristine and GAN multispectral images. We start with a brief introduction to autoencoders and variational autoencoders; then, we describe the VQ-VAE 2 architecture, which is the one our system relies on. A neural network $A$ trained to reconstruct its input at the output is referred to as an autoencoder [33]. The reconstruction is constrained in such a way as to prevent the network from learning the identity function. An autoencoder is divided into two main parts:
• the encoder $A_e$, mapping the input $x$ into a hidden representation $h$ (i.e., $h = A_e(x)$);
• the decoder $A_d$, reconstructing an approximate version $\hat{x}$ of the input from the hidden representation (i.e., $\hat{x} = A_d(h)$).
In the case of tensor data, the input can be an image $X$ and the hidden representation a vector $h$. The encoder and decoder are trained in tandem to reduce the reconstruction loss, usually an $L_2$ loss term, between the input and output samples.
In Variational Autoencoders (VAEs), the input is encoded into a vectorial representation, and the features of the hidden representation are forced to follow a Gaussian distribution, denoted by $N(f(x); g(x))$, where $f(\cdot)$ denotes the mean and $g(\cdot)$ the variance of the distribution. During the decoding stage, a sample of the hidden representation is drawn and used as input to the decoder, which produces a reconstructed version of the original input data. When the hidden features are required to follow a Gaussian distribution, the total loss used during training is equal to:

$$L = \|x - \hat{x}\|^2 + \beta \, L_{kl}\big(N(f(x); g(x)),\, N(0; I_d)\big),$$

where the first term is a "data fidelity term" that measures the difference between the input sample $x$ and the estimated sample $\hat{x}$, the second term applies a kind of "regularization" by requiring the network to minimize the Kullback-Leibler divergence $L_{kl}$ between the learned hidden variable distribution and a desired normal distribution $N(0; I_d)$ (here, $I_d$ is the identity matrix), and $\beta$ is a hyperparameter balancing the two loss terms. After training the VAE, the decoder can be used to create new images by feeding it random samples drawn from the hidden distribution.
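To make the two loss terms concrete, the sketch below evaluates a VAE loss for a single sample in plain Python. It assumes a squared-error fidelity term and a diagonal Gaussian whose per-feature mean and variance play the roles of f(x) and g(x), so that the Kullback-Leibler divergence from N(0; I_d) has the usual closed form; the function name `vae_loss` and its arguments are hypothetical.

```python
import math


def vae_loss(x, x_hat, mean, variance, beta=1.0):
    """Fidelity term plus beta-weighted KL divergence for one sample.

    x, x_hat       : input and reconstruction, as flat sequences of floats
    mean, variance : per-feature mean and variance of the hidden Gaussian
                     (diagonal covariance is an assumption of this sketch)
    """
    # Data fidelity: squared Euclidean distance between input and output.
    fidelity = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL( N(mean, diag(variance)) || N(0, I) ).
    kl = 0.5 * sum(v + m * m - 1.0 - math.log(v)
                   for m, v in zip(mean, variance))
    return fidelity + beta * kl


# A perfect reconstruction with a standard-normal hidden code costs nothing.
zero_loss = vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
```

Increasing `beta` pushes the hidden distribution towards the standard normal at the expense of reconstruction quality, which is the balance the hyperparameter controls.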
VQ-VAE [34] is a variant of VAE [35] that uses vector quantization to learn a discrete latent representation instead of a continuous one. This is done by adding to the network a discrete codebook, which contains a list of vectors with their indices. The output of the encoder is then compared with all the codebook entries in terms of Euclidean distance, and the code that is closest to the output of the encoder is fed to the decoder. Unlike in a VAE, in VQ-VAE the prior is learned rather than static. The combination of a discrete latent representation and an autoregressive prior thus paves the way for the generation of high-quality images, videos, and speech.
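The codebook lookup described above can be sketched in a few lines of Python. This is only an illustration of the nearest-neighbour quantization step, not of a full VQ-VAE: the function name `quantize` and the toy codebook are hypothetical, and training of the codebook is not covered.

```python
import math


def quantize(encoder_outputs, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    Returns the selected indices and the corresponding code vectors,
    using the Euclidean distance as similarity measure.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
        for z in encoder_outputs
    ]
    return indices, [codebook[k] for k in indices]


# Toy codebook with three 2-D codes and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
outputs = [(0.1, -0.2), (3.0, 3.5), (0.9, 1.2)]
indices, codes = quantize(outputs, codebook)
```

Only the selected code vectors are passed on to the decoder, so the latent representation is fully described by the list of indices.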
VQ-VAE 2 [36] is similar to VQ-VAE with the only difference that it uses multi-scale latent maps to increase the resolution of the reconstructed image.
In our experiments, we used 3 levels of latent maps and relied only on a static prior instead of an autoregressive prior, since our goal was to detect GAN images rather than generate them. Figure 3 shows the architecture we used, where we opted for a three-level hierarchy: bottom, middle, and top, with latent space sizes of 512, 128, and 64, respectively.
With regard to the detection of GAN images, we used the VQ-VAE 2 architecture according to two different modalities. In the first approach, the autoencoder processes all the 13 bands together (the resulting architecture is referred to as VQ-VAE 2$_{13}$). For the second approach, we trained one model per band (referred to as VQ-VAE 2$_1$). The reconstruction loss of each band and the total reconstruction loss calculated on all bands can be used as features to be processed by an anomaly detection module, e.g., a one-class SVM. Based on the experiments we have run (see Section 5.2), we decided to use the reconstruction loss directly and to detect the GAN images by applying a threshold to the reconstruction error band by band.
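A minimal sketch of the band-by-band decision rule is given below. It assumes that the reconstruction error is the mean squared error of each band and that the reconstructed image has already been produced by the trained autoencoders; the function names `band_errors` and `detect_per_band` and the threshold values are hypothetical.

```python
import numpy as np


def band_errors(x, x_hat):
    """Mean squared reconstruction error of each band.

    x, x_hat : arrays of shape (bands, H, W) holding the input image and
               its reconstruction (MSE is an assumed error measure).
    """
    diff = x.astype(np.float64) - x_hat.astype(np.float64)
    return (diff ** 2).mean(axis=(1, 2))


def detect_per_band(x, x_hat, thresholds):
    """Return one boolean per band: True where the error exceeds the threshold."""
    return band_errors(x, x_hat) > np.asarray(thresholds)


# Toy 13-band image whose reconstruction is wrong only on the band at index 3.
x = np.zeros((13, 4, 4))
x_hat = np.zeros((13, 4, 4))
x_hat[3] += 2.0
flags = detect_per_band(x, x_hat, thresholds=np.ones(13))
```

Each entry of `flags` is the decision of one single-band detector, so the detection performance can be reported separately for every band.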

Experiments and Results
In this section, we evaluate the performance of the proposed VQ-VAE 2 and empirically prove that an off-the-shelf 2-class detector based on EfficientNet-B4 [37] lags behind the VQ-VAE 2 in terms of generalization capabilities. Both detectors are trained and tested with the datasets described in Section 3.

EfficientNet-B4 Detector
As a baseline 2-class detector to benchmark the performance of our system, we trained a model based on the EfficientNet architecture. The EfficientNet class of networks was proposed as a way to efficiently scale the network depth, width, and resolution based on the input dimensions [37]. In our experiments, we used EfficientNet-B4 (eff), with hyper-parameters set as in [37], the only exception being the input size, which we adapted to match the number of channels our images consist of (13). We built 3 different models by training the networks on the LC dataset (thus fitting to cycleGAN data), the Scandinavian dataset (adapting the model to distinguish pix2pix images), and a combination of the LC and Scandinavian datasets. The three models were cross-tested on the LC, Scandinavian, and China datasets. We also trained three additional models, this time removing the down-sampling from the initial layer, as suggested by [38], to enhance the generalization capabilities of the model. We called these models EfficientNet-B4 with no down (eff nodown). Augmentation, including Gaussian blur, random shift, random rotation, and random flip, was applied to train all models.
The results we got on the various datasets are shown in Table 3, reporting the correct detection probability at a false alarm rate equal to 0.1. For threshold calibration, for each test, we used 100 pristine images from the dataset to be tested. We also ran additional experiments, not reported here, in which the thresholds were obtained from 100 pristine images of the corresponding training dataset. The conclusions drawn are the same.
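The threshold calibration and the resulting metric can be sketched as follows. The sketch assumes that the detector outputs a scalar score that grows with the likelihood of the image being GAN-generated; the function names `calibrate_threshold` and `detection_probability` and the toy scores are hypothetical.

```python
import math


def calibrate_threshold(pristine_scores, false_alarm_rate=0.1):
    """Pick the threshold so that at most the given fraction of the
    pristine calibration scores lies strictly above it."""
    ranked = sorted(pristine_scores, reverse=True)
    # Number of pristine images allowed to be flagged by mistake.
    allowed = math.floor(false_alarm_rate * len(ranked) + 1e-9)
    return ranked[min(allowed, len(ranked) - 1)]


def detection_probability(gan_scores, threshold):
    """Fraction of GAN images whose score exceeds the threshold."""
    return sum(s > threshold for s in gan_scores) / len(gan_scores)


# 100 calibration scores from pristine images, as in the experiments.
pristine = [i / 100 for i in range(100)]
threshold = calibrate_threshold(pristine, false_alarm_rate=0.1)
false_alarms = sum(s > threshold for s in pristine)
p_detection = detection_probability([0.5, 0.95, 0.97, 0.99], threshold)
```

With 100 calibration images and a false alarm rate of 0.1, the threshold is placed so that the ten highest pristine scores are the only ones flagged.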
As expected, the probability of detection is very good when the datasets used for training and testing are matched, that is, when the GAN images have been generated by the same GAN model used for training. With regard to generalization, we observe that the eff nodown architecture has better generalization capabilities. In any case, the performance drops when the models are tested on images taken from datasets that were not used during training. The best results are obtained by training the detector on images generated by both pix2pix and cycleGAN. Even in this case, though, the performance deteriorates when the detector is tested on the images of the China dataset, which was not used during training.

Vector Quantized Variational Autoencoder 2
To build the VQ-VAE 2 one-class classifiers, we trained one VQ-VAE 2$_{13}$ model working on all bands together, and 13 VQ-VAE 2$_1$ models, each working on a different band. The models were trained on 50000 pristine Sentinel-2 level-1C images collected from the Alps, China and land cover datasets. The networks were trained for 100 epochs with early stopping and a batch size equal to 64. Some initial insights into the discrimination capability of the VQ-VAE 2 models can be obtained by plotting the reconstruction losses obtained by applying the trained autoencoders to the various datasets (see Figure 4). In order to obtain the scatter plots shown in the figure, we applied Principal Component Analysis (PCA) to reduce the feature dimensionality to 2. The scatter plots obtained with the VQ-VAE 2$_{13}$ and VQ-VAE 2$_1$ models are reported for two datasets: the China dataset, whose pristine images were part of the VQ-VAE 2 training dataset, and the Scandinavian dataset, which was never seen by the VQ-VAE 2. We observe that in both cases, pristine and GAN images are grouped into well-distinct clusters; however, the features are further apart when the reconstruction losses are taken from the 13 VQ-VAE 2$_1$ models. Hence, the rest of our experiments were carried out by training one model per band. Given the 13 single-band autoencoders, we used the reconstruction error of each autoencoder to detect the GAN images. To do so, we set the detection threshold by fixing the false alarm rate to 0.1 on 100 pristine images of each testing dataset. As for the EfficientNet-B4 results, we also experimented with obtaining the threshold from the training dataset, in this case using pristine images from the China and LC datasets. The results we obtained are slightly worse for the 10m bands but comparable for the rest of the bands.
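The PCA projection used to produce two-dimensional scatter plots of this kind can be sketched with NumPy as follows. The sketch assumes that each row of the feature matrix holds the 13 per-band reconstruction losses of one image; the function name `pca_2d` and the synthetic loss values are hypothetical and are not taken from the experiments.

```python
import numpy as np


def pca_2d(features):
    """Project the rows of an (n_samples, n_features) matrix onto their
    first two principal components, computed via SVD of the centred data."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic per-band losses: low for "pristine", higher for "GAN" images.
rng = np.random.default_rng(0)
pristine = rng.normal(1.0, 0.1, size=(10, 13))
gan = rng.normal(3.0, 0.1, size=(10, 13))
projected = pca_2d(np.vstack([pristine, gan]))
```

In this synthetic example the two groups end up on opposite sides of the first principal axis, which is the kind of separation a scatter plot of the projected features makes visible.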
The tests were carried out on the LC, Scand, and China datasets, and on all the datasets mixed together. Table 4 shows the results we have obtained. The correct detection probability is very good for most of the bands, with some room for improvement on the R (band 4), G (band 3), B (band 2), and NIR (band 8) bands. The results in Table 4 demonstrate the excellent generalization capability of the VQ-VAE 2-based detectors, as they are able to detect the GAN

Conclusions
We have introduced a one-class detector of GAN multispectral images generated by a variety of DL architectures. The model is based on a VQ-VAE 2 autoencoder and is trained only on pristine images. To the best of our knowledge, this is the first work proposing the use of a one-class classifier to detect 13-band Sentinel-2 level-1C artificially generated images.
We ran experiments on images generated by cycleGAN and pix2pix architectures. The results we obtained are particularly promising. In particular, the proposed detector exhibits a generalization capability superior to that of a baseline 2-class detector based on EfficientNet-B4. To evaluate the generalization capability in extreme conditions, we tested the detector on a small dataset of RGB satellite images generated by styleGAN2. The proposed detector outperformed the two-class classifier by far. Further work could diversify the generative models used to create the satellite images, so as to include additional GAN and diffusion models.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Authors' contributions. Lydia Abady wrote the first draft of this manuscript and ran the experiments. Prof. Mauro Barni provided feedback and guidance in all the steps of this research and was involved in reviewing and improving the manuscript. Giovanna Maria Dimitri provided feedback on the research and reviewed the manuscript. The research idea was developed with equal contributions from Lydia Abady and Prof. Barni.

Fig. 1 Some examples of the images contained in the datasets used throughout the paper. Only the RGB bands are shown.

Figure 2b shows an RGB representation of two examples of the Alps dataset.

Fig. 2 Examples of images taken from the this-city-does-not-exist and Alps datasets. For the Alps dataset, only the RGB bands are shown.

Fig. 4 Scatter plot after PCA feature reduction for GAN and pristine images.

Table 2 Summary of the number of images used for each task. P indicates pristine images and G denotes GAN-generated images. For VQ-VAE 2 training, all the images were used to train a single one-class detector. For EfficientNet-B4, the training images were used to train several versions of the detectors, each time using a different combination of the training sets (see Section 5.1). The images used for testing were never used during training.