2.1. Retinex-Based Tone Processing
Retinex (a combination of Retina and Cortex) is a theory first developed by Edwin Land in 1977. Later, Jobson et al. [16] proposed the two most popular Retinex algorithms: single-scale Retinex (SSR) and multiscale Retinex (MSR). The former can provide dynamic range compression or tonal rendition, depending on the scale (small or large, respectively). The superposition of these two processes is an obvious choice for balancing the two effects, and MSR mimics the human visual system. The illumination of objects by incident light is reflected into the imaging system, which ultimately forms the image that we see. The image obtained by the human eye depends on the incident light and its reflection by the surface of the object. The general expression for Retinex is
$$I(x, y) = R(x, y) \cdot L(x, y) \quad (1)$$
where $(x, y)$ is the spatial location, $I(x, y)$ is the input image, $R(x, y)$ is the reflection, and $L(x, y)$ is the illumination by incident light.
Because the signals are continuous and infinite, a computer cannot process them directly. Thus, the Fourier transform (FT) is applied to change the signal from the time domain to the frequency domain, which converts it to a sinusoidal signal with many frequencies and amplitudes that are processable by a computer. However, the calculation burden is exceptionally large because a convolution calculation is involved. Hence, the fast FT (FFT), which significantly reduces the calculation burden and enables the computer to process the signal, is used instead.
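To make this concrete, the short numpy sketch below (an illustration on our part, not code from the paper; `gaussian_kernel` and `blur_via_fft` are hypothetical helpers) applies a Gaussian surround by multiplying image and kernel spectra rather than convolving them directly, which is exactly the saving that the FFT provides.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a normalized 2-D Gaussian surround function F(x, y)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / sigma**2)
    return kernel / kernel.sum()

def blur_via_fft(image, kernel):
    """Convolve a 2-D image with a kernel by multiplying their spectra.

    Convolution in the spatial domain becomes a point-wise product in the
    frequency domain, which is what makes large Gaussian surrounds cheap
    enough for Retinex processing.
    """
    padded = np.zeros_like(image, dtype=np.float64)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Centre the kernel so the filtered image is not circularly shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))
```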
SSR is an image enhancement algorithm derived from Retinex theory, where the reflectance component is estimated by removing the illumination from the input image in the logarithmic domain [17]. This allows suppression of uneven illumination and enhancement of local details. For the $i$-th channel of the input image, denoted as $I_i(x, y)$, the SSR output is defined in Equation (2):
$$R_{SSR_i}(x, y) = \log I_i(x, y) - \log\left(F(x, y) * I_i(x, y)\right) \quad (2)$$
where $F(x, y)$ is a Gaussian low-pass filter and $*$ denotes the convolution operation. The Gaussian filter is given by Equation (3):
$$F(x, y) = K \exp\left(-\frac{x^2 + y^2}{\sigma^2}\right) \quad (3)$$
with $\sigma$ representing the Gaussian surround space constant, $\sqrt{x^2 + y^2}$ the spatial distance in the filter (blur radius), and $K$ a normalization constant chosen so that $F(x, y)$ integrates to one.
$R_{SSR_i}(x, y)$ represents the reflectance, obtained by subtracting the log of the estimated illumination (computed via Gaussian smoothing) from the log of the input image [18]. Smaller σ values preserve local details and reduce color distortion, whereas larger σ values emphasize global illumination correction and tonal balance in the spatial domain.
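A minimal Python sketch of Equation (2), using OpenCV's Gaussian blur as the surround function $F(x, y)$, is shown below; this is an illustrative implementation on our part, not the authors' code.

```python
import numpy as np
import cv2

def single_scale_retinex(channel, sigma):
    """SSR per Equation (2): log(I) - log(F * I) for one image channel.

    `channel` is a float32 array; a small epsilon avoids log(0).
    The Gaussian blur plays the role of the surround function F(x, y),
    so its sigma controls the trade-off described above.
    """
    eps = 1e-6
    illumination = cv2.GaussianBlur(channel, (0, 0), sigmaX=sigma)
    return np.log(channel + eps) - np.log(illumination + eps)
```

A small sigma (e.g., 15) keeps local detail, whereas a large sigma (e.g., 250) emphasizes global illumination correction.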
MSR is based on SSR using multiple sigma values, followed by the weighting of the end results. Compared with SSR, MSR has the advantage of maintaining a high fidelity of the image while achieving color enhancement, local dynamic range compression, and global dynamic compression, but with the disadvantage of haloing. It is defined as
$$R_{MSR}(x, y) = \sum_{k=1}^{N} w_k\, R_{SSR_k}(x, y) \quad (4)$$
where $N$ is the number of scales, $\sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_N\}$ is a vector of Gaussian blur coefficients, $w_k$ is the weight associated with the $k$-th scale ($w_1 + w_2 + \cdots + w_N = 1$), $(x, y)$ is the spatial location, and $R_{SSR_k}(x, y)$ is the result of the SSR at the $k$-th scale.
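In code, MSR is simply a weighted sum of SSR outputs over several sigma values. The self-contained sketch below uses the three scales mentioned for Figure 1 and equal weights of 1/3, which is an assumption consistent with the constraint that the weights sum to one.

```python
import numpy as np
import cv2

def multi_scale_retinex(channel, sigmas=(15, 80, 250), weights=None):
    """MSR per Equation (4): weighted sum of SSR outputs at several scales.

    Equal weights (1/N each) are assumed here, satisfying w_1 + ... + w_N = 1.
    """
    eps = 1e-6
    if weights is None:
        weights = [1.0 / len(sigmas)] * len(sigmas)
    log_input = np.log(channel + eps)
    result = np.zeros_like(channel, dtype=np.float32)
    for w, sigma in zip(weights, sigmas):
        surround = cv2.GaussianBlur(channel, (0, 0), sigmaX=sigma)
        result += w * (log_input - np.log(surround + eps))
    return result
```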
Figure 1 illustrates the effect of varying sigma (σ) values in the SSR algorithm. Specifically, changing the Gaussian kernel parameter σ in Equation (3) modifies the value of $F(x, y)$ in Equation (2), which alters the illumination component $F(x, y) * I_i(x, y)$ and generates different reflectance images depending on the sigma scale. In this study, σ = 15, 80, and 250 were used to produce the results shown in Figure 1. This Gaussian kernel–based comparative experiment was conducted as described in [19] to ensure reproducibility.
Figure 1a shows an original image during a sandstorm. After applying an SSR with a sigma value of five (
Figure 1b), the color of the image became more normal; however, the local details could be further improved. To achieve this, SSR processing with a sigma value of 40 was performed; however, haloing gradually appeared in the image (
Figure 1c). Finally, SSR processing with a sigma value of 150 rendered the image colorless, and the overall details were insufficient (
Figure 1d). Hence, small-scale SSR improved the overall tone and contrast of the image, whereas large-scale SSR provided improved local contrast and image details.
2.2. CycleGAN-Based Image Translation
Generative adversarial networks (GANs) [20,21] have achieved impressive results in image generation [22,23], image editing [24], and representation learning [25,26]. A basic GAN has a unique structure wherein two neural networks, the generator and the discriminator, compete [27]. The generator is used to generate samples, whereas the discriminator is used to determine whether each sample is real. The generator uses random noise to generate fake images, whereas the discriminator performs binary classification training based on both real and fake images. The discriminator generates a score based on the input image, which indicates whether the image generated by the generator is successful, and thereby further trains the latter to generate a better image. Based on Ian Goodfellow’s definition of a GAN [20], the optimization task is formulated as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \quad (5)$$
where $G$ is the generator, $D$ is the discriminator, $V(D, G)$ is a value function representing the discriminative performance of the discriminator (the larger the value, the better the performance), $p_{data}(x)$ represents the real data distribution, $p_z(z)$ represents the input data distribution of the generator, and $\mathbb{E}$ is the expectation.
The first term, $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$, is constructed based on the loss of the logarithmic function of real data. An ideal situation occurs when the discriminator, $D$, determines that a data sample is obtained from the real data distribution; optimization leads to $D(x) = 1$ for a real data sample.
The second term, $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$, is related to the data generated by the generator. The ideal situation is when the discriminator outputs zero in this scenario; optimization leads to $D(G(z)) = 0$, where $z$ represents the random noise input to the generator. The discriminator maximizes these two terms.
Because it is an adversarial relationship, optimizing the generator $G$ allows it to deceive the discriminator in the second term, thereby making the discriminator accept the generated data even when $D(G(z)) \approx 1$. Essentially, the discriminator maximizes the two terms, while the generator minimizes the second term, which results in minimizing the objective function. Given a fixed generator, the discriminator is trained to maximize the objective function $V(D, G)$ by correctly classifying real data as real and generated data as fake. Therefore, the discriminator optimization function can be expressed as
$$D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \quad (6)$$
where $D_G^*(x)$ represents the optimal value of discriminator $D$ for a given data point $x$, $p_{data}(x)$ denotes the probability of $x$ being from the real data distribution, and $p_g(x)$ denotes the probability of $x$ being generated by generator $G$. This implies that the discrimination result for generated data tends toward zero, whereas the value of $D_G^*(x)$ for real data tends toward one.
For the optimization of the generator $G$, when the value of the discriminator $D$ is fixed, the condition required to minimize $V(D, G)$ is given by Equation (7), which is satisfied when the distribution of the generator matches that of the real data. Intuitively, the generator is trained to minimize the value of the objective function $V(D, G)$, whereas the discriminator is trained to maximize it, forming a minimax optimization problem.
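For readers who prefer code, the two sides of this minimax game can be written as the PyTorch sketch below. `D` is assumed to output probabilities in (0, 1) and `fake_images` are generator outputs $G(z)$; this is only an illustration of Equation (5), not the training code used in this work.

```python
import torch

def discriminator_loss(D, real_images, fake_images):
    """The discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))],
    so we minimize the negation of that value function."""
    eps = 1e-8
    loss_real = torch.log(D(real_images) + eps).mean()
    loss_fake = torch.log(1.0 - D(fake_images.detach()) + eps).mean()
    return -(loss_real + loss_fake)

def generator_loss(D, fake_images):
    """The generator minimizes E[log(1 - D(G(z)))], the second term of
    Equation (5), driving the discriminator toward accepting the fakes."""
    eps = 1e-8
    return torch.log(1.0 - D(fake_images) + eps).mean()
```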
As its name suggests, the network in CycleGAN (a new type of GAN proposed by the Berkeley Artificial Intelligence Research Lab) is a circle [3]. It addresses the problem of image-to-image translation without sufficient paired-image datasets by adding a cycle-consistency loss that feeds the generated image back through an inverse mapping to ensure consistency with its source image. The formula for cycle-consistency loss is
$$L_{cyc}(G_1, G_2) = \mathbb{E}_{x \sim p_{data}(x)}\left[\left\lVert G_2(G_1(x)) - x \right\rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\left\lVert G_1(G_2(y)) - y \right\rVert_1\right]$$
where $\lVert \cdot \rVert_1$ is the pixel-level difference between two images, $x$ is the input image belonging to the A domain, $y$ is the input image belonging to the B domain, $G_1(x)$ and $G_2(y)$ are the generated images, and $G_2(G_1(x))$ and $G_1(G_2(y))$ are the cyclically reconstructed images obtained from the original images. Unlike Pix2Pix, CycleGAN does not require pairs of images. Moreover, the loss function ensures that converting an image from Domain A to Domain B, and then vice versa, reconstructs the original image.
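A hedged PyTorch sketch of this loss, assuming two generator networks `G1` (A to B) and `G2` (B to A) as in Figure 2, is:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G1, G2, real_A, real_B):
    """L1 cycle loss: A -> B -> A and B -> A -> B should both
    reconstruct the original inputs."""
    reconstructed_A = G2(G1(real_A))   # x -> G1(x) -> G2(G1(x))
    reconstructed_B = G1(G2(real_B))   # y -> G2(y) -> G1(G2(y))
    return (F.l1_loss(reconstructed_A, real_A)
            + F.l1_loss(reconstructed_B, real_B))
```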
As shown in Figure 2, generator $G_1$ converts the image $x$ with yellow sand–dust interference from Domain A to Domain B and generates a fake clear image, $G_1(x)$, which is judged by the discriminator as either 1 (true) or 0 (false). The difference between the fake clear image $G_1(x)$ and the real clear image is the GAN loss. Generator $G_2$ then converts the clear fake image $G_1(x)$ in Domain B back into an image $G_2(G_1(x))$ with yellow sand–dust interference in Domain A. The difference between images $G_2(G_1(x))$ and $x$ is the cycle-consistency loss. When the clear real image $y$ is input into $G_1$, the generated image $G_1(y)$ should be as close as possible to input $y$. The difference between $G_1(y)$ and $y$ is called identity loss.
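Putting the three terms of Figure 2 together, a generator-side training objective could look like the sketch below. The lambda weights are common CycleGAN defaults rather than values reported here, and `G1`, `G2`, and `D_B` are assumed network modules.

```python
import torch
import torch.nn.functional as F

def generator_objective(G1, G2, D_B, real_A, real_B,
                        lambda_cyc=10.0, lambda_id=5.0):
    """Sum of GAN loss, cycle-consistency loss, and identity loss for the
    A -> B (dusty -> clear) direction."""
    eps = 1e-8
    fake_B = G1(real_A)                                  # dusty -> clear
    gan = torch.log(1.0 - D_B(fake_B) + eps).mean()      # adversarial term
    cyc = F.l1_loss(G2(fake_B), real_A)                  # A -> B -> A reconstruction
    idt = F.l1_loss(G1(real_B), real_B)                  # clear input should stay clear
    return gan + lambda_cyc * cyc + lambda_id * idt
```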
In summary, the basic concept of GANs is to augment the encoder–decoder-based generator with a separate discriminator network that determines whether the generated images are real or fake, and to use its output as part of the loss function. The generator is trained to deceive the discriminator by producing results that appear realistic, which is the key advantage that distinguishes GANs from other deep learning–based generative models.
CycleGAN is designed to use unpaired data, and its generator objective function leverages the discriminator adversarial loss to drive the generated samples toward the distribution of the target domain. However, the primary advantage of training CycleGAN with paired data is its ability to use the pairing information to guide the learning process: the objective function of the generator directly includes the mapping error term. Paired data provide explicit pixel-level correspondence between images from different domains, which means that the model can directly learn both the mapping between the two domains and the pixel-level correspondence. This approach can better capture image features and reduce ambiguity and uncertainty in the translation process, resulting in more visually appealing and accurate transitions between domains. The results of using CycleGAN for clear and sandstorm-obscured images are shown in
Figure 3. The real clear, unpaired training-generated, and paired training-generated images are shown in
Figure 3a–c, respectively. The fake sandstorm-obscured image has the same characteristics as the real image with yellow-sand interference, that is, it has a yellowish color and is blurred. The result of applying CycleGAN to sandstorm-obscured and clear images of a tower crane in
Figure 4 shows that CycleGAN with paired data is better at removing the yellow-sand interference and has increased image details.
2.3. Proposed Method
Since Retinex has the property of removing background components while enhancing image sharpness and brightness, this study employed SSR to guide the learning direction. It was hypothesized that this approach could suppress the global yellow cast and background effects caused by sandstorms while improving image details. In addition, the characteristics of SSR vary with the sigma scale in the frequency domain: a larger sigma results in a narrower frequency response that suppresses high-frequency components and emphasizes low-frequency information, whereas a smaller sigma produces a wider frequency response that allows more high-frequency components to pass through, thereby preserving details with less color distortion. To exploit these properties, three sigma scales were individually trained to generate independent modules, and their outputs were then fused to incorporate the advantages of MSR. For color preservation, the A and B channels obtained from the small-sigma model (Model 1) were used to maintain natural color representation, while the L channel was derived by combining the outputs of the three scales and processed in the LAB space to ensure both detail enhancement and color fidelity.
The proposed method is divided into three stages, as shown in
Figure 5. The first training step involved the generation of fake dust-interference images. This stage was conducted using a dataset consisting of 2000 clear images (referred to as Clear Image 1) and 250 real sandstorm images. Because CycleGAN learning with unpaired data requires a substantial number of authentic sandstorm-obscured images, their scarcity in real life hampers its progress. Consequently, our approach employed CycleGAN with unpaired data for the initial learning process and ran the generator in the “clear to sandstorm” direction. This initial learning phase provided a substantial dataset of paired images to facilitate subsequent learning steps.
The second training step involved training CycleGAN with paired enhanced clear images. In this stage, another set of 2000 real images (referred to as Clear Image 2) was enhanced using the Retinex algorithm at three different scales. These enhanced images were then paired with 2000 synthetic sandstorm images generated in the first training step to form paired training data. Three models were obtained by applying Retinex enhancement to the dataset images at three different scales. Each resulting dataset exhibited distinctive characteristics, and this preprocessing step enriched the visual content, providing improved input data for the second phase of CycleGAN training.
In the test process step, after obtaining the modules and their respective characteristics, processing was performed on the L channel, which was then combined with the A and B channels of the small-sigma SSR model for favorable color preservation.
2.3.1. First Training Process: Fake Dust-Interference Image Training
This study proposes a method to enrich the paired dataset by initially training unpaired data and then using the trained model to generate paired data for subsequent training. Because paired data have a clear mapping relationship based on the comparison between paired and unpaired images, paired training can lead to high-quality translation as well as faster convergence, shorter learning time, and better control characteristics. As shown in
Figure 4, after training CycleGAN with both paired and unpaired data, the model was applied to the enlarged region highlighted by the blue box. The results indicated that training with paired data achieved superior dust removal and produced a more visually consistent clean image.
Although training CycleGAN with paired datasets yields better results, real-world image datasets obscured by yellow sand are lacking and paired datasets are almost nonexistent. However, synthetic paired data can be generated by translating images from one domain to another using an unpaired CycleGAN model. This method not only expands the dataset, reduces the dependence on limited paired data, and provides additional training samples for subsequent supervised learning tasks, but is also flexible and adaptable. We can control the distribution of the generated pairs, tailor it to a specific need, or create scenarios that are difficult to obtain from paired real-world data. It is important to note that the quality and validity of the generated paired data depend heavily on the performance and capability of the initial unpaired CycleGAN model. To ensure that the generated image pairs are of sufficient quality and preserve the desired characteristics of the target domain, careful consideration must be given during data generation. Therefore, we first trained CycleGAN using unpaired images and then used the trained model to generate paired data, referred to as “fake dust-interference image training,” as illustrated in
Figure 6.
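As a sketch of how such synthetic pairs might be produced once the unpaired model is trained (the directory layout, image size, and function names are our assumptions, not details from the paper), the clear-to-sandstorm generator can be run over every clear image and the output saved under the same file name:

```python
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

@torch.no_grad()
def synthesize_paired_dataset(G_clear_to_dust, clear_dir, out_dir, device="cpu"):
    """Run the first-stage clear-to-sandstorm generator over every clear image
    so that each clear photo gains a synthetic dusty counterpart, paired by
    file name."""
    to_tensor = transforms.Compose([transforms.Resize((256, 256)),
                                    transforms.ToTensor()])
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    G_clear_to_dust = G_clear_to_dust.to(device).eval()
    for path in sorted(Path(clear_dir).glob("*.jpg")):
        clear = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        fake_dust = G_clear_to_dust(clear)
        save_image(fake_dust, out_dir / path.name)   # paired with the clear original
```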
2.3.2. Second Training Process: Paired Training CycleGAN with Enhanced Clear Images
To enhance the performance of CycleGAN, the proposed method first augmented the dataset and then performed paired training. This involved using Retinex to enhance the data before training CycleGAN with paired data. Images were transformed from the spatial to the frequency domain before being processed using Retinex. In the LAB color space, the L channel provides luminance information, whereas the A and B channels provide color information. The LAB color space exhibits excellent adaptability to changes in illumination conditions. The luminance information on the L channel experiences minimal variation under different lighting conditions. In addition, the color uniformity is high, implying that during color preservation, the variation in color across different regions is smoother. In contrast to the relatively complex relationships between the color components in the RGB color space, which can lead to unexpected effects during color preservation, working in the LAB color space enables luminance to be separated from the color information. This ensures that the color information remains unaffected, thereby preserving the color fidelity.
Part of the car in
Figure 7 has color distortion, which is attributed to the Retinex enhancement process. During Retinex enhancement, luminance information is accentuated, while color information might undergo compression or alteration. Adjustments in brightness and contrast applied to individual color channels can disrupt the balance between them, leading to color distortion or loss and reducing the authenticity of image colors.
Retinex-based enhancement technology can significantly improve the visual quality of an image by enhancing brightness, contrast, color, fine details, textures, and edges, which ensures that the output image has improved visual features compared to the input image. Augmenting a dataset with Retinex better preserves these details during CycleGAN training with paired images, which can lead to more accurate and better preserved translations that retain important visual features. Retinex-based enhancement can also reduce noise and artifacts in images, resulting in clearer and smoother data. Training CycleGAN on a paired-image dataset with reduced noise and artifacts can help the model focus on learning the desired mapping between domains rather than being affected by undesirable artifacts. In addition, augmentation by Retinex can enhance uniformity and consistency, thereby improving training convergence. The enhanced image provides a more stable and reliable training signal, which can help the CycleGAN model trained on paired images to converge faster and more efficiently.
As shown in
Figure 8, CycleGAN training with Retinex-enhanced images involves transferring clear images from the spatial domain to the frequency domain and then performing SSR preprocessing in the LAB color space. The reason for moving to the frequency domain is that image processing therein can significantly reduce the computing time. The L channel of the LAB color space was then used for data augmentation.
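A simplified OpenCV sketch of this preprocessing step is shown below. For clarity the Gaussian surround is applied directly in the spatial domain rather than through the FFT shortcut of Section 2.1, and the min-max rescaling of the log-domain output back to 8 bits is an assumed normalization step, not one specified in the paper.

```python
import cv2
import numpy as np

def enhance_l_channel(bgr_image, sigma):
    """Apply SSR to the L (luminance) channel only, leaving A and B untouched,
    then convert the enhanced LAB image back to BGR."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float32)
    L, A, B = cv2.split(lab)
    eps = 1e-6
    surround = cv2.GaussianBlur(L, (0, 0), sigmaX=sigma)
    reflectance = np.log(L + eps) - np.log(surround + eps)   # SSR on luminance
    reflectance = cv2.normalize(reflectance, None, 0, 255, cv2.NORM_MINMAX)
    enhanced = cv2.merge([reflectance, A, B]).astype(np.uint8)
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR)
```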
The small-scale model (Model 1) can better preserve and transfer the color features of the image while enhancing both the overall detail and contrast. However, the drawbacks are also evident in
Figure 9b, with a halo appearing in the sky portion. While small-scale Retinex enhances image details, it tends to generate noise and artifacts that can be propagated or amplified during CycleGAN learning.
The medium-scale model (Model 2) is capable of better adjusting the image texture and details. In addition, it can appropriately enhance the image contrast, thereby making variations in brightness more pronounced and boosting the overall visual effect. Drawbacks include focusing on local details that can disrupt the overall balance of the image, which affects its natural feel. Furthermore, inadequate color fidelity is another disadvantage, which is evident in the building sign portion of the image in
Figure 9c.
A large-scale model (Model 3) can accentuate the features and structures within the image, thereby enhancing the sense of depth. It also adjusts the image contrast, which renders the edge texture clearer and emphasizes local variations in brightness, as shown in
Figure 9d. However, processing datasets in the frequency domain can result in the loss of local details. This can lead to a large-scale model potentially losing the ability to enhance certain finer details during learning. Furthermore, inadequate color fidelity is an unavoidable concern associated with this approach. Therefore, careful selection of an appropriate large-scale Retinex preprocessing model is necessary.
To consider both the local and overall details, the experimental results for the large-, medium-, and small-scale models were combined with a weight of 1/3 each to obtain the final image. Compared with the previous image, the processed image exhibits dynamic range compression and reveals details in the shadows, while both the local and overall details are greatly improved.
2.3.3. Image Preservation
The composite result of the three SSR models at different scales in the RGB color space still exhibited serious color deviation; therefore, it is necessary to reduce the color deviation through color preservation. As shown in Figure 5, after transferring the image to the LAB space, the small-scale sigma model (Model 1) provided normal color but poor local details, whereas the large-scale sigma model (Model 3) provided poor color but excellent local details and contrast. Passing the 1/3-weighted combination of the three SSR model outputs through the L channel, taking the color information from the A and B channels of the small-sigma SSR model in the LAB color space, and then transferring the result back to the RGB color space effectively reduced the color deviation while maintaining excellent contrast and details, as shown in
Figure 10.
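A compact sketch of this fusion step (assuming the three model outputs are already available as 8-bit BGR images; variable and function names are ours) is:

```python
import cv2
import numpy as np

def fuse_and_preserve_color(model1_bgr, model2_bgr, model3_bgr):
    """Average the L channels of the three scale-specific outputs with a
    weight of 1/3 each, take the A and B channels from the small-sigma
    output (Model 1), and convert the result back to the BGR color space."""
    labs = [cv2.cvtColor(img, cv2.COLOR_BGR2LAB).astype(np.float32)
            for img in (model1_bgr, model2_bgr, model3_bgr)]
    L = (labs[0][..., 0] + labs[1][..., 0] + labs[2][..., 0]) / 3.0
    A, B = labs[0][..., 1], labs[0][..., 2]      # color taken from Model 1
    fused = cv2.merge([L, A, B])
    return cv2.cvtColor(np.clip(fused, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)
```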