SatelliteCloudGenerator: Controllable Cloud and Shadow Synthesis for Multi-Spectral Optical Satellite Images

: Optical satellite images of Earth frequently contain cloud cover and shadows. This requires processing pipelines to recognize the presence, location, and features of the cloud-affected regions. Models that make predictions about the ground behind the clouds face the challenge of lacking ground truth information, i


Introduction
Clouds are reported to affect a significant fraction of all acquired satellite imagery, with estimates as high as 55% over land and 72% over ocean [1].For many downstream applications, such as land and crop monitoring, this type of obstruction can have a severe impact on the performance or altogether feasibility of a system.A significant amount of research [2][3][4][5][6][7][8][9] has been carried out on the topic of understanding clouds in satellite imagery as well as attempts to remove them.
The need for paired cloudy-clear image samples is particularly prominent in the context of cloud removal.This need is present in both key aspects of the design pipeline, namely, training and evaluation.At the training stage, it is often necessary to access a clear-sky sample and use it as ground truth to compute a loss for optimization.For paired image-translation frameworks, this ground truth should ideally correspond precisely to the input cloudy sample.At the evaluation stage, access to ground truth data pairs is also crucial, as computing a similar loss at inference is one of the key sources of information about the model's performance.
There are currently two approaches for obtaining paired cloudy data (a corresponding pair of cloudy and cloud-free representations of the same image) explored in the literature.These include (i) the acquisition of temporally proximate real samples (achieving merely an approximation of the ground truth, with variable and hard-to-quantify precision) and (ii) the simulation of clouds based on a real source image with no clouds present (with precise ground truth).
This manuscript describes a new tool for simulating an unlimited number of diverse pairs of clear-sky and cloudy examples, following the second approach.The tool offers the capability to parameterize and control various features of the generated clouds.
The contributions of this work are as follows: • It proposes a framework to generate synthetic clouds for satellite imagery, with adjustable parameters to synthesise diverse cloud types.• It demonstrates that a cloud detection model trained solely on simulated data can approach the performance of a model trained on real data.• Similarly, it demonstrates that a cloud removal model trained exclusively on simulated data can approach the performance of a model trained on real data.• It demonstrates and highlights the advantages and disadvantages of synthetic and real paired cloudy data and it shows the different kinds of insight each data source can offer when evaluating cloud detection and cloud removal techniques.
Furthermore, the tool proposed in this work is openly shared on GitHub at https:// github.com/strath-ai/SatelliteCloudGenerator(accessed on 22 August 2023), allowing for fast and easy incorporation into PyTorch-based deep learning schemes as an augmentation operation (it can process batches of stacked tensor input directly on GPU), as well as for evaluation.
The tool can be used for developing new methods for cloud detection and cloud removal but also beyond that, where it could be used as an augmentation type for tasks that deal with the risk of cloud presence, such as multi-sensor fusion, multi-temporal fusion, satellite image segmentation, object detection, or change detection.

Related Work
There are at least two distinct paths to achieving paired cloudy samples explored in the literature.Both approaches are motivated by the following limitation of the real physical world, namely, that it is most likely not physically possible to acquire a clear-sky observation and a corresponding cloudy observation, where all factors (such as lighting, exact time, or exact conditions on the planet's surface), apart from the presence of the cloud, are kept constant.
The first approach to generating paired cloudy data relaxes this requirement for constant factors and it links cloudy and clear-sky images that are proximate in time.This occurs under the assumption that all other factors remain similar, rather than constant, within some feasible margin.However, the assumption may be too optimistic, and the conditions may vary enough to make the clear-sky sample an inaccurate approximation of the cloudy image with clouds removed.This is shown in Figure 1 with a sample from a commonly used SEN12MS-CR dataset [6].To highlight the resulting changes better, two crops from the original pair are shown in greater detail.The selected regions of considerable change are indicated using the colored circles.As a result, inaccurate ground truth samples could be used for evaluation or learning if this approach is followed.This approach has been explored in works such as [2][3][4][5][6][7].In [2] and [5], Landsat image pairs are gathered with a time gap of 16 or 32 days.In [3], this gap is up to 15 days apart, while in [4] it could be up to 35 days.These lengths between acquired paired samples mean that the changes in the ground surface view may be profound even without any clouds present in either image.In the SEN12MS-CR dataset, it is ensured that the optical cloudy and cloud-free images are captured within the same meteorological season, which appears to be a rather loose constraint, as demonstrated in Figure 1.In the related work on the SEN12MS-CR-TS dataset [7], 30 samples evenly spaced in time are captured for each ROI across the full year, leading to temporal gaps of about 12 days.
The alternative approach maintains the constant-factor requirement by simulating the cloud component and adding it to a source clear-sky image.In this case, the compromise is that it is not guaranteed how accurate the simulated clouds are as an approximation of real clouds.This approximation could be improved by increasing the capability and quality of the simulation engine, at the cost of heavier processing.
Another limitation of both approaches is that they both rely on the presence of clearsky data.This in itself enforces a very strong bias on the resulting datasets, since only the subset of all ground data is processed.The correlation between the cloud cover and the state of the planet's surface is likely not strong enough to completely hide away some features of the ground surface data.However, in the practical context, only a finite number of image samples are acquired over a finite temporal scope, which can contribute to substantial sampling noise.In the presence of this noisy sampling, the bias of the clear-sky samples could be quite strong, and it is not yet clear how that can be mitigated apart from aiming for larger datasets.
While the phenomenon of clouds in Earth's atmosphere is a complex process requiring a respectively complex and expensive computational simulation, the majority of the literature to date has focused on borrowing from the fields of computer graphics, where the generation of random shapes that resemble structures encountered in nature has been of interest for many years [10].In a notable paper, Perlin noise was introduced [10] as a relatively lightweight method for generating natural-looking random structures.The method produces a texture by interpolating unit-length gradients on a grid, where the direction of gradients is sampled randomly.Almost three decades later, the approaches of applying procedural noise for the simulation of clouds received some attention in the literature, starting with a rather brief description in [11] and eventually including some of the more developed use cases, such as in [12,13].
Other than that, many hybrid approaches were proposed to find a trade-off between the weaknesses of real and simulated data.In [14], cloud masks are extracted from real images using either layer separation methods or channel threshold and then used to synthesize a cloudy image.Some similar approaches were also previously applied to the problem of dehazing [15,16] and later revisited for the thin cloud removal problem [17].In the case of [16,17], the transparency is adjusted by channel wavelength.In [9], a framework for cloudy image arithmetic is proposed which relies on extracting real clouds (rather than masks) from images and then the addition of those clouds to new scenes, which limits the number of possible generated samples.
This work delivers a new type of approach, where the simplicity of Perlin noise is combined with a flexible and versatile framework for simulating realistic clouds using that mask.It includes previously unexplored features such as control over the scale of the synthesized clouds, the thickness of the clouds, the influence of the ground image over the perceived color of the cloud, the spatial misalignment of the cloud layer between image channels, the blurring of the cloud, the simulation of cloud shadows, and channelspecific magnitude.It is also shown how these settings can be easily managed by adjusting configuration objects and also how the generated cloud masks and shadows can be easily converted to segmentation masks, if needed.The key focus of the proposed techniques is to achieve a lightweight simulation of the visual features of clouds, aimed to improve the quality of computer vision machine learning models that detect or remove clouds on relatively small patches of satellite images.The precise visual features of clouds may be difficult to define explicitly, and hence, the quality of the simulator is measured by the impact it has on the performance of the machine learning models trained on its data.In this version of the simulator, the shadows and clouds are generated as uncorrelated factors, motivated by two observations: (i) a cloud removal or cloud detection model that can perform well on uncorrelated clouds and shadows should also perform well when they are correlated (the converse is not necessarily true) and (ii) for smaller patches of satellite imagery, the shadows in the patch will often be cast by clouds outside of the view, appearing uncorrelated with the clouds in the view.
Overall, the SatelliteCloudGenerator (https://github.com/strath-ai/SatelliteCloudGenerator,accessed on 22 August 2023) framework attempts to deliver a tool that provides a high level of flexibility for generating an unlimited number of cloudy-clear image data pairs.

Method
The proposed framework integrates several new techniques for cloud simulation (cloud locality degree, ground shadow, automated segmentation mask) and further incorporates several methods already discussed in the literature (cloud color, channel misalignment, ground blurring) under a unified scheme.It provides a set of interpretable control parameters for adjusting the types of generated clouds, which makes it easy to adopt in new studies.Additionally, the software framework has been designed to facilitate easy integration with PyTorch with support for parallel execution.This design is illustrated in Figure 2, where an input cloud-free source I clear undergoes several transforms to obtain the final output of a simulated cloudy image I cloudy .This consists of the two underlying simulation processes within the cloud generation (shown in Figure 3) and shadow generation blocks (Figure 4).More detail on the computation pipeline is provided in the subsequent sections describing the mechanism of generation.The tool takes a cloud-free input satellite image I clear and first mixes it with a shadow generated using a shadow generation process (detailed in Figure 4), then blurs the image locally, and finally mixes it with a cloud generated using the cloud generation process (detailed in Figure 3).

Synthetic Shape
The key structure of the generated clouds is derived using a function based on Perlin noise [10].The use of Perlin Noise has been explored in earlier literature [11][12][13], but little detail has been provided about the generation process.This work explicitly reports on the generation process and defines a set of parameters to simulate a diverse range of cloud transparency maps.
The first stage of the process involves generating the base shape of the cloud transparency map.This can be accomplished using many different synthetic noise generation methods, and here, a type of Perlin noise is employed, where Perlin noise generated as several harmonic scales is used.As shown in Figure 5a-c, Perlin noise can be generated at various scales, resulting in different frequencies present in the spectrum.These different scales can be weighted and summed in order to generate more complex-looking noise structures, as shown in Figure 5d.
The weights applied to individual shapes at each scale control the spectral content of the image (spatial frequency domain).Hence, they can be adjusted to obtain shapes that are smoother by applying a strong decay for finer scales or sharper by reducing that decay.This is controlled by the decay_factor parameter.By default, this factor is set to 1, and higher values will result in smoother shapes.The resulting shape computed using the Perlin noise method is likely to have a few sparse global minimum points, instead of a larger region of floor values that would simulate an area where no clouds are present.To produce that effect, a clear_threshold can be applied, which will assign a value of 0.0 to all values below that threshold, as illustrated in Figure 6.Once the shape passes through the threshold operation, the value range can be adjusted by setting the min_lvl and max_lvl parameters, which shift the minimum and maximum values of the shape to these two levels, correspondingly.For example, a min_lvl value of 0.0 will indicate that the most transparent pixels will have no cloud cover at all.By increasing the min_lvl, it can be ensured that all pixels have cloud presence at least at that level.An example of range adjustment is shown in Figure 7.The shape mask adjusted to the range of [min_lvl, max_lvl] is treated as the final transparency map of the simulated cloud.The final step of the process involves using this mask in a mixing operation with the clear-sky source sample.As in the earlier works of [9,14], the mixing operation for the clouds is defined as where the output is the cloudy image I cloudy , based on a source clear-sky image I clear and a cloud-component image I cloud .In the simplest setting, the cloud-component image I cloud could be equal to the static color of the ambient cloud.The cloud transparency mask is indicated by M C .

Cloud Locality Degree
Another desirable feature is to be able to make the generated clouds more local (meaning that individual cloud objects extend over a smaller portion of the image).In many cases, real clouds only occupy a limited area of the image.The approach proposed in this work is to multiply several generated cloud mask shapes I cloud (followed by normalization).Each shape element in multiplication decreases the likelihood of a high value being preserved for a given pixel in the resulting product, increasing the total number of cloud-free or almost-cloud-free pixels.An example is shown in Figure 8, where the parameter of locality_degree indicates the number of noise shapes multiplied by each other.As shown, the clouds become sparser with the increasing value of this parameter.Notably, a changed locality of the clouds can also be achieved by increasing the clear_threshold value, as shown in Figure 9. This, however, brings another effect of sharper cloud edges compared to the example with changed locality degree (Figure 8).

Channel Misalignment
The real cloud data will often exhibit an effect of channel misalignment, where individual channels of the cloud object are spatially misaligned due to the velocity of the acquiring sensor.This occurs when the individual sensor channels are acquired at slightly different time instants, which is often the case [14].
This effect can be simulated by adding an artificial offset between the individual channels of the raw cloud mask M C , as shown in Figure 3.The effect is controlled by a channel_offset parameter, which determines the maximum possible spatial offset between two consecutive channels in either the x or y direction, in terms of the number of pixels.During simulation, the exact value of shift in each dimension is sampled uniformly from [−channel_offset, +channel_offset], resulting in a range of potential discrete offsets.Subpixel values are not considered.An example of the feature is shown in Figure 10, where a three-channel (RGB) cloud mask M C is shown without any misalignment in the left panel (a) compared to one with misalignment (b), and the resulting mixture results in (c).

Cloud Color
As described in the earlier work [14], the clouds present in satellite imagery do not generally resemble a purely white component but instead are colored by the ambient light reflected from the ground.Based on the observation shared in [14], the cloud color will tend to be similar to the mean color reflected from the ground surface in that area.This color can be computed by averaging all pixels in the source clear-sky image.Furthermore, this effect is partially dependent on cloud thickness, meaning that a thicker cloud will let through less light from the ground and reflect light of a color closer to pure white.
To simulate this feature, the color cloud component I cloud is adjusted to have a color interpolated between pure white and the mean color of the underlying cloud-free image I clear , depending on the thickness of the cloud component I cloud .With the default scale of 1.0, the resulting I cloud color for a given pixel will be pure white if the input I cloud is equal to 1.0 (thick cloud) or equal to the mean cloud-free image color if the input I cloud is equal to 0.0 (no cloud).For the semi-transparent cloud values, the resulting color will be a linear mixture of the two extreme colors.This effect is demonstrated in Figure 11.

Channel-Specific Magnitude
In several works [17,18], it has been noted that the exact magnitude of the cloud transparency mask will depend on the wavelength of the specific image channel.This is further affected by the specific channel scaling applied during pre-processing.In the context of evaluating techniques for the detection and removal of clouds, this aspect is important for obtaining realistic simulated data.During training, this could also be important for many tasks, since optimizing the loss only on simulated clouds could make the real clouds be perceived as out of domain objects and, as a consequence, prevent successful detection or removal.
The intensity of the cloud component can be adjusted by applying a set of channelspecific weights to I cloud before mixing with the cloud-free input image, as shown in Figure 3.However, it is not immediately clear what the values of those weights should be.In this work, the channel magnitude weights are extracted from real cloudy images accounting for the ratio ρ between a selected statistic feature c clear in the cloud-free region and another statistic feature c cloud in the cloud-affected region of the image.Figure 12 demonstrates how a real cloudy sample can be used for sampling cloud-free and cloudy regions using the cloud detection technique of s2cloudless [19].
For example, the statistic feature can be chosen as the mean reflected color.However, in some cases, the cloudy region may be heavily influenced by the non-cloudy region due to the likely presence of nearly cloud-free pixels in the cloud mask, as illustrated in Figure 12 (this depends on the quality of the used cloud mask).This interference can lead to an underestimated reflection statistic from the cloudy region.To reduce this effect, the statistic c cloud is instead selected as the 95% quantile of the distribution observed in the cloudy region.This value can be expected to be close to the maximum reflected value in the true cloudy region, with more stability than the maximum value (100% quantile).This approach should work well for the vast majority of scenarios, as long as more than 5% of the cloud mask coverage does, in fact, contain a cloud.
A further illustration of the relationship between the distribution of values in the cloudy and cloud-free regions is shown in Figure 13, where histogram curves are shown for the distribution of pixels in the cloud-free (blue) and cloudy (orange) regions of the image, individually for each band.It is apparent that the cloudy region tends to contain higher values for each channel.The leakage of cloud-free pixels to the cloud mask can also be observed, manifested by similar histogram shapes for the lower values.In each case, it is shown that the cloudy region (orange plot) contains higher values but also exhibits some resemblance to the cloud-free histogram (blue) due to the leakage of the ground component in the cloudy region (as shown in Figure 12).
Since the cloud-free mask is generally unlikely to contain any clouds, the statistic c clear extracted from that region can be closer to the center and is indeed selected as the central 50% quantile (median) of the distribution.
The ratio ρ can then be multiplied with the statistic ĉclear extracted from a new cloudfree image and give a predicted channel weight vector ĉcloud for the cloud strength: The cloud component I cloud is then multiplied by ĉcloud to give a magnitude-adjusted cloud component: Figure 14 shows the effect of applying this approach, showing that the application of CSM scaling results in an image more visually similar to a real reference (in both cases, the values go well above 1.0 and are hence saturated in this figure).

Ground Shadow
Satellite imagery with cloud presence will often include shadows cast on the ground surface.Depending on the sun angle and other lighting parameters, the shadows present in the image could be a result of clouds that are outside of the view.For that reason, the shadows could be simulated in an uncorrelated fashion and look plausible.An example is shown in Figure 15.The mixture process for a shadow is similar to the cloud mixing operation, in a manner analogous to Equation (1): But since the shadow component image I shadow can be approximated by all zeroes, the operation will simply be

Ground Blurring
Another effect of the through-cloud scattering, besides changed cloud color, is the blurring of the underlying ground image, as noted in [14].There, the source clear-sky image is transformed into a mixture of itself and the result of a blur with a constant Gaussian kernel.The mixture is performed based on an alpha mask dependent on cloud thickness.
As shown in Figure 16, this approach is developed here further by convolving the source clear-sky image with a locally changing Gaussian blur kernel, as opposed to a static one.The variance of the used kernel is directly proportional to the cloud thickness and can be adjusted using the blur_scaling factor.

Configuring CloudGenerators
The use of a synthetic noise source allows us to effectively generate an unlimited number of cloudy samples for every single cloud-free source image.Consequently, it is possible to sample from a very wide distribution of samples, much larger than what can be stored and packaged into a single dataset.To model this source of data as a sampler, this work introduces modules termed as CloudGenerators.
The introduction of CloudGenerators allows us to encapsulate a specific simulation configuration (corresponding to a type of generated cloud) with a sampling function.A CloudGenerator module behaves in a manner similar to torchvision image augmentation modules and inherits from torch.nn.module.That way, new samples of specific types can be generated by simply passing through this module.
Four predefined configurations are provided in the software release, and new ones can be created as Python dictionaries.Table 1 contains   Furthermore, this allows for the easy combination of different configurations by using the "|" operator.The output of the operator will be another instance of a CloudGenerator that each time uses a random configuration from one of its parent objects.

Generation of Segmentation Masks
For many applications, especially that of cloud detection, segmentation labels are required to train the models.Since the cloud simulation tool described herein has direct access to the cloud mixing mask, it can be used to generate discrete segmentation data.
The process of transforming the exact cloud and shadow mixing masks to discrete segmentation-like labels is as follows.The format of the segmentation labels in this example will follow the approach in CloudSEN12 [8] but could be easily adapted to other formats.In this case, the segmentation map is composed of 4 classes, 0 for clear sky, 1 for thick cloud, 2 for thin cloud, and 3 for cloud shadow.This way, the label output M seg can be based on 3 binary masks: where B thick , B thin , and B shadow are the binary segmentation masks for thick clouds, thin clouds, and shadows, respectively, as shown in Figure 18.

Evaluation
The utility of images with clouds simulated using the proposed framework is tested on the two tasks directly related to clouds: cloud detection and cloud removal.In the experiments, the models are trained on the specific task from scratch on either real data or simulated data.Once trained, each variant is tested on both real and simulated datasets.This enables comparison between the two data sources to determine whether a model trained exclusively on simulated data can perform well on real data and vice versa.This can give some information about whether the simulated data are similar enough to the real data to produce models that generalise to real data.The main goal of these experiments is to preserve the training settings as much possible and compare the impact of the types of images (real versus simulated) used for training and testing, rather than push the state of the art in each task.The procedure should highlight differences with respect to both training and evaluating models for these tasks.

Cloud Detection Task
For the task of cloud detection, the lightweight fully convolutional architecture of Mo-bileNetV2 [20] (which achieved the highest performance on the CloudSEN12 high-quality cloud detection benchmark [8]) is trained from scratch on Sentinel Level-2A images.The networks are optimized without any pre-training in order to match the exact optimization conditions for the explored data variants.The baseline variant (a) is optimized on the manually annotated data of real clouds sourced from the official high-quality subset of the CloudSEN12 dataset (the first released version) [8].The alternative variants (b-d) use the clear images from that subset and simulate the clouds using the simulation method proposed in this work.The variants involving the use of simulator data include two variants that use samples simulated without channel-specific magnitude (b) and (c) and two variants (d) and (e) where channel-specific magnitude is used.In each case, a fully synthetic approach is tested as in (b) and (d), as well as a hybrid approach that mixes 50% of real data with 50% of simulated data as in (c) and (e).
For this experiment, the network operates on the bands contained in the Sentinel-2 L2A product.The networks are trained with the standard cross entropy loss for three classes (clear, cloud, shadow), until a point where the validation loss does not decrease for 240 epochs.Each batch contains 32 clear images and 32 cloudy images, but the loss on the clear images is multiplied by a factor of 0.1 so that the cloudy images are prioritized during learning.The weights parameters are optimized using an AdamW [21] optimizer with an initial learning rate of 10 −3 , which is scaled down by a factor of 0.1 whenever the validation loss does not decrease for 128 epochs.
As a result, each network was trained for about 20,000 optimization steps until the validation loss ceased to improve.Although this process could be tuned further to optimize various learning hyperparameters and yield a lower validation loss, the motivation for the experiments conducted here is to compare the effect of the real and simulated training data for these models.
The metrics reported for the performance were selected based on the CloudSEN12 work [8], where producer's accuracy (PA), user's accuracy (UA), and balanced overall accuracy (BOA) are reported.The first two metrics are more widely known in the field of object classification as recall (producer's accuracy) and precision (user's accuracy).Additionally, the false positive rate is reported in addition to the metrics used in CloudSEN12.
The producer's accuracy (PA), or recall, is computed as the fraction of positives that are correctly detected.For an example of the cloud class, it corresponds to the number of correctly detected cloud pixels divided by the number of all true cloud pixels.It can be interpreted as an approximate probability of a pixel containing a cloud being assigned cloud class by the model.A high level of PA means that a large portion of the cloud present in the image is contained in the cloud mask.
User's accuracy (UA), or precision, corresponds to the fraction of all positive detections that are correct detections.For the example of the cloud class, it is computed as the number of correctly detected cloud pixels divided by the number of all pixels detected as cloud.It can be interpreted as an approximate probability of a pixel detected as cloud containing, in fact, cloud.A high level of UA means that a large portion of the cloud mask produced in the model contains cloud pixels, with minimal leakage of non-cloudy pixels into the mask.UA = TP TP + FP (10) The balanced overall accuracy (BOA) is the average of true positive rate (producer's accuracy or recall) and true negative rate.This is particularly helpful for non-balanced datasets, where there is a significant imbalance between positives and negatives in the ground truth.The true negative rate corresponds to the ratio of all negative instances correctly labeled as negatives.
Finally, the false positive rate (FPR) is provided as the rate of falsely rejected positive instances.For the example of cloud detection, it can be interpreted as the number of pixels incorrectly detected as cloud divided by the total number of cloud-free pixels.
Table 2 contains the metrics computed on the cloudy images of the test dataset containing the original real cloudy samples.In terms of balanced overall accuracy for the cloud class, the performance is quite comparable across variants, with the model trained on real data performing best (0.79).Yet, models (b) and (d) trained exclusively on simulated data can achieve a non-trivial performance of 0.75 and 0.78, respectively.This also demonstrates the improvement achieved with the realistic channel-specific magnitude (CSM) feature of the simulator, where higher BOA is observed for all three classes, especially for the shadow This concludes the analysis of the cloud detection task, which demonstrates that cloud detection models can be trained exclusively on simulated cloudy data and achieve performance comparable to the models trained on real data.Furthermore, the realistic magnitude of the cloud component in each channel of multi-spectral data has been found to be beneficial for performance on real clouds when learning from simulated data.

Cloud Removal Task
For the task of cloud removal, the dataset of SEN12MS-CR is used, containing real pairs of cloudy and non-cloudy Sentinel-2 images, along with corresponding Sentinel-1 samples.The dataset contains Sentinel-2 Level-1C product, which consists of 13 bands of multi-spectral data.Furthermore, Band 10 is excluded from the experiment, since it primarily responds to the top of atmosphere reflections of cirrus clouds [22], which often have a different effect than in other bands, as can be observed in Figures 19 and 20.This effect is not currently modeled by the cloud simulator and hence the exclusion.The two figures contain visualizations of individual bands from a Sentinel-2 L1C product for a cloudy and a clear sample.In both cases, all bands except for Band-10 (SWIR-Cirrus) appear to be highly correlated.In the cloudy image, the bands tend to contain a similar presence of the cloud, while in Band-10 this object appears absent.Similarly, Band-10 in the clear image appears to detect a fairly different structure than the other bands.The baseline architecture used for the experiments on cloud removal is DSen2-CR [6], a simple residual-based deep neural network architecture consisting primarily of convolutional operations.In each case, it is trained from scratch to a point where no improvement in validation loss occurs for 30 epochs.The networks are trained using an AdamW optimizer [21] with a starting learning rate of 10 −4 and the same decay strategy as the cloud detection model above.Each batch of data contained four clear and four cloudy images, and, similarly to the cloud detection scheme, the loss on the clear images is multiplied by a factor of 0.1.Due to the large dataset size of SEN12MS-CR, during each epoch, the loss is optimized on 1000 random samples from the training dataset, and the validation loss is computed on 500 random samples from the validation dataset.8) and RMSE (Table 9) are used to evaluate the models.Furthermore, each metric is reported for the whole image (the first listed value) as well as the isolated cloud-affected region (the second listed value).
The results in Table 8 indicate that while the model trained on the real data (a) performs best on that type of data (an SSIM of 0.623 for the whole image and 0.561 for the inpainted region), the model (d) trained exclusively on simulated data with channel-specific magnitude achieves an SSIM of 0.544/0.462.On the other hand, for any type of simulated test data, model (d) outperforms model (a).Similar to the detection task, model (b), which was trained on simulated cloudy images without channel-specific magnitude, consistently produces results of the lowest quality.In Figures 21 and 22, it can be observed that model (b) does not really apply any visible changes to the input image, indicating that it does not recognize the cloud objects as something that should be removed.It achieves an SSIM of 0.444/0.323,compared to 0.544/0.462 in the equivalent model trained with channel-specific magnitude.This demonstrates that the channel-specific magnitude feature is crucial for cloud removal models to generalize from simulated data to real instances, indicated by the improved quality achieved with models (d) and (e).

Conclusions
This manuscript introduced a novel open-source SatelliteCloudGenerator (https: //github.com/strath-ai/SatelliteCloudGenerator,accessed on 22 August 2023) tool for simulating cloudy satellite images with a number of controllable parameters responsible for many features, including cloud structure, locality, color, thickness, or sensing channel misalignment.The tool can be used on image batches directly on GPU and act as an augmentation operation in the PyTorch library commonly used for training deep neural networks.
To shed more light onto the problem, deep neural networks have been trained in the tasks of cloud detection and cloud removal.The results show that learning from simulated data alone has the effect of good performance on real data, especially when channel-specific magnitudes are adjusted.Compared to real data, this approach incurs zero annotation cost.Furthermore, the simulation framework can be used to generate unlimited data for evaluation; in which case, the models trained on simulated data outperform the model trained solely on real data.
Ultimately, neither real nor simulated data appear to be universally more advantageous.The ideal, potentially impossible, case is an abundant source of real images with precise ground truth.Real image sources are not abundant and are likely to contain distorted ground truth.Simulated sources are abundant and with precise ground truth, but a certain gap between real and simulated images can be expected.Given this state of matters, a good way to achieve a trade-off is to perform training as well as evaluation on both data sources, which has been demonstrated herein.

Figure 1 .
Figure 1.The approach of using pairs of real data for training and evaluating cloud removal models often results in fundamental differences in the ground surfaces.Examples from SEN12MS-CR [6].It is apparent that some areas significantly change their state among acquisitions.

Figure 2 .
Figure 2. Diagram of the main pipeline of SatelliteCloudGenerator.The tool takes a cloud-free input satellite image I clear and first mixes it with a shadow generated using a shadow generation process (detailed in Figure4), then blurs the image locally, and finally mixes it with a cloud generated using the cloud generation process (detailed in Figure3).

Figure 3 .
Figure 3. Diagram of the cloud generation component responsible for generating the cloud shape using Perlin noise and applying several operations to control the final appearance.The output is a transparency mask and a channel-wise cloud image.The cloud-free source image is used to determine the magnitude of the cloud component in each channel.

Figure 4 .
Figure 4. Diagram of the shadow generation component.The shadow generation component, similar to the cloud generation component, is based on Perlin noise shape but is followed only by thresholding and range adjustment to produce the output shadow mask.

Figure 5 .
Figure 5. Demonstration of the Perlin noise generated at several scales (32, 16, 8) (a-c) and the result of their weighted sum (d).An example of the resulting cloud mask mixed with a real image is shown in (e).

Figure 6 .
Figure 6.Example of the clear_threshold of 0.30 applied to the cloud mask from Figure 5.This means that any value less than 0.30 will be clipped to zero, resulting in the black (0.00) regions in the mask (a), which appear as cloud-free portions of the satellite image (b).

Figure 7 .
Figure 7. Example of mask range adjusted to [min_lvl = 0.0, max_lvl = 0.5] from the original shape.This means that the strongest influence of the cloud in (a) is at 0.5, which corresponds to the reflectance of the ground still contributing to 50% of the value for that pixel, giving the impression of semi-transparent clouds in (b).

Figure 8 .
Figure 8. Varying levels of locality degree obtained for a range of locality_degree parameter values.For the value of 1, the cloud generates a single Perlin shape with default parameters (a).For the values of 2-4 in (b-d), the same Perlin shape is multiplied by another 1, 2, or 3 independent Perlin shapes, resulting in more local (smaller-area) clouds.

Figure 9 .
Figure 9. Increased locality can also be achieved by adjusting the clear_threshold parameter value (as shown with values between 0.00 and 0.90 illustrated in (a-d)), but the cloud edges tend to become sharper, giving the resulting clouds a distinct look, different from the default.

Figure 10 .
Figure 10.An example of a sample with channel-wise misalignment controlled by the channel_offset parameter.Without this effect, the cloud component is fully aligned across image channel as shown in (a).However, to simulate a common feature of satellite images, a small spatial shift can be introduced to individual channels as shown in (b), resulting in what often appears as colored cloud edges as in (c).

Figure 11 .
Figure 11.An example of a cloud with color adjusted by the ground reflectance.To simulate the global light reflected from the ground affecting the cloud color, the cloud image (a) is adjusted, depending on the thickness of the cloud, to a value between the mean reflected color of the ground and pure white, before it is used for the mixture (b).

Figure 12 .
Figure 12.Example of cloud-free (b) and cloudy (c) regions masked from a real cloudy sample (a) using the s2cloudless cloud detection tool.This allows us to approximate the reflectance characteristics of the cloud-free and cloudy regions separately.

Figure 13 .
Figure 13.Histogram curves for the example image.In each case, it is shown that the cloudy region (orange plot) contains higher values but also exhibits some resemblance to the cloud-free histogram (blue) due to the leakage of the ground component in the cloudy region (as shown in Figure12).

Figure 14 .
Figure 14.Example of a sample simulated with (c) and without (b) channel-specific magnitude (CSM) scaling.The real image reference used for magnitude scaling is shown in (a).

Figure 15 .
Figure 15.An example of a shadow generation feature.In this case, the mask (a) determines which regions of the ground image will appear darker, resulting in an image with shadows (b).

Figure 16 .
Figure 16.An example of the cloud-dependent ground blur effect.Before mixing with the cloud, the cloud-free image is blurred with the strength of the effect proportional to the thickness in the cloud mask, as shown in (a).This is later followed by the standard mixture process to give the final result (b) with a cloud on top of a locally blurred ground image.
the parameter levels for each of these four configurations, with examples of resulting samples shown in Figure 17.The first one uses a wide range between the min_lvl and max_lvl values to simulate large thick clouds in the image.The other three generators focus on more specific types of clouds, namely, local (thick clouds covering a smaller portion of the image), thin (local semi-transparent clouds), and fog (semi-transparent layer over the entire image).Thick clouds are achieved by setting the max_lvl parameter to 1.0, meaning that portions of the image will contain pixels completely dominated by the cloud component.The thick and local configurations differ only by the value of the locality_degree parameter, where thick has it set to 1 (large clouds) and local has it set to a [2, 4] range (various degrees of more local clouds).For any part of the configuration expressed as range, the used value is extracted by sampling from a uniform distribution (discretised, if necessary).The thin configuration limits the max_lvl parameter to the [0.4,0.7] range, to ensure semi-transparency.Finally, fog lifts the min_lvl parameter to the [0.3, 0.6] range, resulting in the full image containing a semi-transparent cloud.

Figure 17 .
Figure 17.Random samples synthesized using 4 different CloudGenerator configurations used for this work.The configurations contain parameters that can easily adjust the visual appearance of clouds, as listed in Table1.

Figure 18 .
Figure 18.An example of a precise segmentation mask derived from the simulation tool.It is possible to obtain an accurate segmentation representation for the simulated data, since the cloud mask is generated explicitly as part of the simulation process.

Figure 19 .
Figure 19.Example of content of individual multi-spectral bands in each channel for a cloudy Sentinel-2 L1C image.Intensity range for each band is listed in square brackets.

Figure 20 .
Figure 20.Example of content of individual multi-spectral bands in each channel for a cloud-free Sentinel-2 L1C image.Similar to the previous example with cloud detection, the models are tested on five different test datasets, one with real clouds and another four with simulated-only data of different cloud types.The commonly used metrics of SSIM (Table8) and RMSE (Table9) are used to evaluate the models.Furthermore, each metric is reported for the whole image (the first listed value) as well as the isolated cloud-affected region (the second listed value).The results in Table8indicate that while the model trained on the real data (a) performs best on that type of data (an SSIM of 0.623 for the whole image and 0.561 for the inpainted region), the model (d) trained exclusively on simulated data with channel-specific magnitude achieves an SSIM of 0.544/0.462.On the other hand, for any type of simulated test data, model (d) outperforms model (a).Similar to the detection task, model (b), which was trained on simulated cloudy images without channel-specific magnitude, consistently produces results of the lowest quality.In Figures21 and 22, it can be observed that model (b) does not really apply any visible changes to the input image, indicating that it does not recognize the cloud objects as something that should be removed.It achieves an SSIM of 0.444/0.323,compared to 0.544/0.462 in the equivalent model trained with channel-specific magnitude.This demonstrates that the channel-specific magnitude feature is crucial for cloud removal models to generalize from simulated data to real instances, indicated by the improved quality achieved with models (d) and (e).

Figure 21 .
Figure 21.Examples of cloud removal output for the five model variants on real cloudy images.

Figure 22 .
Figure 22.Examples of cloud removal output for the five model variants on simulated cloudy images (test images simulated with channel-specific magnitude).

Table 1 .
Configuration parameters for four types of clouds.Different cloud types have been achieved primarily by adjusting the range of cloud thickness and the locality degree of the clouds.

Table 7 .
Evaluation on the cloud-free images for the detection task.