Denoising Vanilla Autoencoder for RGB and GS Images with Gaussian Noise

Noise suppression algorithms are used in a wide range of tasks, including computer vision, industrial inspection, and video surveillance. Robust image processing systems need to be fed with images as close as possible to the real scene; however, external factors can sometimes alter the captured data, which translates into a loss of information. Procedures are therefore required to recover information that is as close as possible to the real scene. This research project proposes a Denoising Vanilla Autoencoder (DVA) architecture based on unsupervised neural networks for Gaussian denoising in color and grayscale images. The methodology improves on other state-of-the-art architectures in terms of objective numerical results. Additionally, a validation set and a set of high-resolution noisy images are used, which reveal that our proposal outperforms other types of neural networks designed to suppress noise in images.


Introduction
Currently, there is growing interest in the use of artificial vision systems in daily tasks such as industrial processes, autonomous driving, telecommunication systems, surveillance systems, and medicine, among others [1]. Recent developments in the field of artificial vision have driven the need for increasingly robust systems that meet established quality requirements, and a main reason systems fail to meet such requirements lies in data acquisition. In image acquisition systems, several factors can alter the result of the capture, including failures in the camera sensors, adverse lighting conditions, electromagnetic interference, and noise generated by the hardware [2]. All of these phenomena are described using distribution models and are known, in a general way, as noise. The procedure used in image processing to diminish the effect of noise is known as the pre-processing stage of an image processing system. In recent years, various image denoising algorithms have been developed, and the field has recently attracted much interest in the scientific community; in this context, deep learning methods have emerged [3,4].
Deep learning methods have an inherent ability to overcome the deficiencies of some traditional algorithms [5]; however, despite their significant improvements over traditional filters, deep learning methods have a practical limitation: high computational complexity. Although, as previously mentioned, various methods have focused on noise suppression, in this work, autoencoders are proposed, which are neural networks capable of replicating an unknown image by applying convolutions whose weights were adjusted during previous training [6][7][8]. This research project highlights the value of autoencoders because they do not require high computational complexity, while showing a noticeable improvement compared to other deep learning architectures, such as the Denoising Convolutional Neural Network (DnCNN) [9], the Nonlinear Activation Free Network for Image Restoration (NAFNET) [10], and the Efficient Transformer for High-Resolution Image Restoration (Restormer) [11].
The rest of this paper is structured as follows. In Section 2, the theoretical background is described. The proposed model is described in Section 3. The experimental setup and results are discussed in Section 4. Finally, the conclusions of this research work are given in Section 5.

Background Work
In recent years, noise suppression has become a dynamic field within image processing. This is because, as technological advances emerge, a greater understanding of the scene in which a vision system operates is required [12]. For noise suppression, several processing techniques have been proposed. These techniques, known as filters, depend on the noise present in the image and are mainly classified into two types.

Spatial Domain Filtering
Spatial filtering is a traditional method for noise suppression in images. These filters suppress noise by being applied directly to the corrupted image. They can generally be classified as linear or non-linear. Among the most common filters are:

•
Mean Filter: For each pixel, samples with a neighborhood similar to the pixel's neighborhood are found, and the pixel value is updated with the weighted average of those samples [13].

•
Median Filter: The central pixel of a neighborhood is replaced by the median value of the corresponding window [14].

•
Fuzzy Methods: This type of filter differs from those mentioned above in that it is mainly built from fuzzy rules, which make it possible to preserve the edges and fine details of an image. Fuzzy rules are used to derive suitable weights for neighboring samples by considering local gradients and angle deviations. Finally, directional processing is used to improve the precision of the filter [15].
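For illustration, the median filter described above can be sketched in a few lines of NumPy. This is a minimal sketch for grayscale arrays; the function name and the reflective border handling are our own choices:

```python
import numpy as np

def median_filter(img: np.ndarray, size: int = 3) -> np.ndarray:
    """Replace each pixel with the median of its size x size neighborhood.

    Border pixels are handled by reflecting the image at the edges.
    """
    pad = size // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + size, j:j + size])
    return out

# A single impulse ("salt") pixel is removed by the 3x3 median.
noisy = np.zeros((5, 5))
noisy[2, 2] = 255.0
clean = median_filter(noisy)
```

This also illustrates why the median is a non-linear filter: a single outlier pixel does not influence the output at all, whereas a mean filter would spread it over the whole window.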

Transform Domain Filtering
Transform domain filtering is a very useful tool for signal and image processing due to its capacity for analysis at multiple resolutions and sub-bands and for localization in the time and frequency domains. An example of this type of filtering is the wavelet method, which operates in the frequency domain and attempts to distinguish the signal from the noise so as to preserve the signal during noise suppression. First, a wavelet basis is selected to determine the decomposition of its layers; the level of decomposition is then selected, and a threshold is established in all the sub-bands at all levels [16].
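The thresholding step at the heart of wavelet denoising can be illustrated independently of the transform itself. The following is a minimal sketch of soft thresholding on an array of coefficients; the helper name and the threshold value are our own:

```python
import numpy as np

def soft_threshold(coeffs: np.ndarray, t: float) -> np.ndarray:
    """Shrink coefficients toward zero by t; magnitudes below t become 0.

    In wavelet denoising, small coefficients are assumed to be noise,
    while large ones are assumed to carry the signal.
    """
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

# Small (noise-like) coefficients vanish; large ones are only shrunk.
c = np.array([-4.0, -0.5, 0.2, 3.0])
d = soft_threshold(c, 1.0)  # -> [-3.0, -0.0, 0.0, 2.0]
```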

Artificial Intelligence
A new approach to processing images has emerged under the umbrella of artificial intelligence. To address the issue of noise suppression, it is necessary to distinguish between artificial intelligence, machine learning, and deep learning, because these terms are often used synonymously even though there are subtle differences. Artificial intelligence involves machines that can perform tasks with characteristics of human intelligence, such as understanding language, recognizing objects, gestures, and sounds, and solving problems [17,18]. Machine learning is a subset of artificial intelligence whose goal is to obtain better performance on a learning task. The algorithms used are mainly statistical and probabilistic; they let machines improve with experience, allowing them to act and make decisions based on the input data [19]. Finally, deep learning is a subset of machine learning that uses learning techniques and algorithms with high performance in problems such as image recognition and sound recognition, since the basic functioning and structure of the brain and the visual system of animals are imitated [20].
There are two types of deep learning: the first is supervised learning, which takes a direct approach, using labels on the training data to build a reasonable understanding of how machines make decisions; the second is unsupervised learning, which takes a very different approach, learning by itself how to make decisions or perform specific tasks without the need for labels in a database [21].

Autoencoders
Autoencoders are unsupervised neural networks whose main characteristic is that the input and the output are the same [22]. This is an advantage over other models because, in each training phase of the neural network, the output is compared with the original image, and through the computed error, the weights in each of the layers that make up the autoencoder are adjusted. This adjustment is carried out by means of the backpropagation method. There are different types of autoencoders:

•
The Vanilla Autoencoder (VA) comprises only three layers: the encoding layer, in charge of reducing the dimensions of the input information; the hidden layer, better known as the latent space, which holds the representations of all the characteristics learned by the network; and the decoding layer, which is in charge of restoring the information to its original input dimensions, as shown in Figure 1 [23].

•

The Denoising Autoencoder (DA) is a robust modification of the Conv AE that changes the preparation of the input data. The information the autoencoder is trained on is divided into two groups: original and corrupted. For the autoencoder to learn to denoise an image, the corrupted information is fed to the input of the network to be processed. Once the information reaches the output, it is compared with the original [25]. This type of autoencoder is capable of generating clean images from noisy images, regardless of the type of noise present or the density with which the image was affected.

Proposed Model
The proposed model is based on the suppression of Gaussian noise in both RGB and grayscale (GS) images. Figure 3 shows the architecture of the proposed Denoising Vanilla Autoencoder (DVA) algorithm, which consists of a selection stage: if the image to be processed is of the RGB type, a multimodal model is applied, and if it is a GS image, a unimodal model is applied. This is described by Equation (1). The advantage of combining two types of autoencoder architectures (VA and DA) is that, by having only one encoding layer and one decoding layer, the reconstructed pixels do not suffer many alterations, which could translate into a loss of information, while the network remains capable of removing the noise present in images. The use of the autoencoder also implies a lower computational load, which, in turn, improves both training and processing times once the network models are generated.
X̂ = DVA_multimodal(X), if c = 3; X̂ = DVA_unimodal(X), if c = 1 (1)

where X̂ is the image processed by the DVA, and c is the number of channels of the corrupted image X.

(X * W)(i, j, c) = Σ_{p=0}^{m−1} Σ_{q=0}^{n−1} Σ_{r=1}^{c} X(i + p, j + q, r) W(p, q, r, k) (2)

where X is the corrupted image with width w, height h, and c channels, and W is the weight matrix with width m, height n, c channels, and k kernels.

S(i, j, c) = (X * W)(i, j, c) + b (3)

where (X * W)(i, j, c) is the intensity of the result of the k convolutions at position (i, j, c), and b is the bias.

Y(i, j, c) = f(S(i, j, c)) = max(0, S(i, j, c)) (4)

where Y(i, j, c) is the result of the ReLU activation function f at position (i, j, c).

Z(i, j, c) = max{Y(2i + p, 2j + q, c) : p, q ∈ {0, 1}} (5)

where Z is the image encoded by max pooling, and the offsets p, q ∈ {0, 1} implement a stride of 2 in each spatial dimension.

S′(i, j, c) = (Z * W′)(i, j, c) + b′ (6)

Z′(i, j, c) = f(S′(i, j, c)) (7)

where Z′(i, j, c) is the result of the second convolutional layer and activation function, and W′ is another weight matrix.

Ỹ(i, j, c) = Z′(⌊i/2⌋, ⌊j/2⌋, c) (8)

where Ỹ is the image decoded by upsampling.

X̂(i, j, c) = f((Ỹ * W″)(i, j, c) + b″) (9)

where X̂ is the final result of the processing, and W″ represents another weight matrix.
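The encode-decode operations described above (convolution, ReLU, max pooling, a second convolution, upsampling, and a final convolution) can be sketched as a single-channel forward pass in NumPy. This is an illustration of the pipeline, not the trained model; all weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(17)

def conv2d(x, w, b=0.0):
    """'Same' 2D convolution of a single-channel image with one kernel."""
    m, n = w.shape
    pad_h, pad_w = m // 2, n // 2
    xp = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    out = np.zeros_like(x)
    h, wd = x.shape
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + m, j:j + n] * w) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbor upsampling by a factor of 2."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Forward pass of the encode-decode pipeline on a toy 8x8 image.
x = rng.random((8, 8))
w1, w2, w3 = (rng.normal(size=(3, 3)) for _ in range(3))
z = maxpool2(relu(conv2d(x, w1)))        # encode: 8x8 -> 4x4
z2 = relu(conv2d(z, w2))                 # latent representation
x_hat = relu(conv2d(upsample2(z2), w3))  # decode: 4x4 -> 8x8
```

The max pooling halves each spatial dimension, and the nearest-neighbor upsampling restores them, so the output has the same dimensions as the input image.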
For the multimodal model, the image is separated into its three components (red, green, blue), and each component is processed independently with a model trained for that type of channel (Equations (2)-(9)); once the results are obtained, the three new components are concatenated to generate a new image in which the noise is smoothed out. In the unimodal model, a single trained model is applied. The main reason a multimodal model was trained for RGB images is that the noise, being completely random and defined by a Gaussian probability, affects each channel differently. In this case, processing the three channels of the image in the same way can cause the final smoothing to be carried out improperly and to leave a greater number of corrupted pixels. Figure 4a shows the histogram of the original Lenna image, and Figure 4b shows how the image behaves when corrupted with Gaussian noise of density σ = 0.20. In this example, the red channel tends to increase the intensity of its pixels, while the intensities of both the green channel and the blue channel tend to decrease. The DVA process is described in detail in Algorithm 1. Once the processing through the DVA is finished, we analyze the histogram of the resulting image, shown in Figure 5, observing how the DVA restores, to a certain extent, the intensities of the pixels in each of the channels. In this sense, the DVA is capable of restoring the image; however, the restoration is not optimal due to the nature of the noise, since the noise causes a significant loss of information, and bringing the intensities of the corrupted pixels fully back to those of the original image is an ideal scenario that the DVA can only approximate.
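The channel-wise scheme of the multimodal model can be sketched as follows. The per-channel `denoise` callables stand in for the three trained networks and are hypothetical placeholders (here, identity functions):

```python
import numpy as np

def denoise_multimodal(img: np.ndarray, models) -> np.ndarray:
    """Split an RGB image into its channels, denoise each channel with its
    own model, and concatenate the results back into one image."""
    assert img.ndim == 3 and img.shape[2] == 3
    channels = [models[c](img[:, :, c]) for c in range(3)]
    return np.stack(channels, axis=2)

# Identity "models" as placeholders for the three trained networks.
identity = lambda ch: ch
rgb = np.random.rand(4, 4, 3)
out = denoise_multimodal(rgb, [identity, identity, identity])
```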

Network Training
For the multimodal model, the "1 million faces" database was used, of which only 7000 images were taken [26] and resized to 420 × 420 pixels. The database was duplicated to generate the noisy database: the 7000 images were divided into batches of 700, and each batch was corrupted with a different noise density. The noise densities used are {0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}. Once the two databases were obtained, the DVA training was carried out. The databases were divided into 80% for the training phase and 20% for the validation phase. For the unimodal model, the original database was converted to GS, and the noisy database was created by repeating the above procedure.
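The construction of the noisy database can be sketched as follows. The array shapes are toy stand-ins for the 7000 resized face images, and pixel values are assumed to be normalized to [0, 1]; the function name is our own:

```python
import numpy as np

rng = np.random.default_rng(17)
densities = [0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

def make_noisy_set(clean: np.ndarray, sigmas) -> np.ndarray:
    """Duplicate the clean set and corrupt each equal-sized batch with a
    different Gaussian noise density (pixel values assumed in [0, 1])."""
    per_batch = len(clean) // len(sigmas)
    noisy = clean.copy()
    for k, sigma in enumerate(sigmas):
        batch = slice(k * per_batch, (k + 1) * per_batch)
        noise = rng.normal(0.0, sigma, size=noisy[batch].shape)
        noisy[batch] = np.clip(noisy[batch] + noise, 0.0, 1.0)
    return noisy

# Toy stand-in for the 7000 face images (here 20 tiny 8x8 images).
clean = rng.random((20, 8, 8))
noisy = make_noisy_set(clean, densities)
```

The first batch (density 0) remains unchanged, which gives the network examples of clean inputs that should pass through unaltered.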
The network was trained on an NVIDIA GeForce RTX 3070 (8 GB) GPU. The hyperparameters used were seed = 17, learning rate = 0.001, shuffle = true, optimizer = Adam, loss function = MSE, epochs = 100, batch size = 50, and validation split = 0.1. Figure 6 shows the learning curves for the training and validation phases over the 100 epochs, showing that the proposed architecture did not suffer from overtraining for either the unimodal model (Figure 6a) or the multimodal model (Figure 6b).

Experimental Results
The evaluation of the DVA was carried out using various images, both RGB and GS, of different dimensions. These images are unknown to the network, in order to verify its proper functioning. The evaluation images are shown in Figure 7. Each evaluation image was corrupted with Gaussian noise with densities from 0 to 0.50 in intervals of 0.01. To gain a better perspective on the proper functioning of the proposed algorithm, comparisons were made with three other neural networks that differ in structure but whose objective is also noise smoothing. Table 1 shows the visual comparisons of the results obtained by the DVA and the other neural networks used to validate the algorithm for the Lenna image in GS. Table 2 shows the same comparisons for the Lenna image, this time in RGB. It should be noted that a region of interest was enlarged to give a better perspective of the work of each of the networks on the image in question. In addition to the visual comparisons, evaluation metrics were used, such as:

•
Mean Square Error (MSE): Computes the mean of the squared differences between the original images and the processed images.

MSE = (1/(M N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (x(i, j) − y(i, j))²

where x and y are the images to compare, (i, j) are the coordinates of the pixel, and M and N are the dimensions of the images.
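The MSE can be computed directly; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean of the squared pixel differences between two same-size images."""
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

a = np.zeros((2, 2))
b = np.full((2, 2), 2.0)
mse(a, b)  # -> 4.0
```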
• Root Mean Squared Error (RMSE): Commonly used to compare the difference between the original images and the processed images by directly computing the variation in pixel values [27].
• Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS): Used to compute the quality of the processed images in terms of normalized average error of each band of processed image [28].
ERGAS = 100 (d_h/d_l) √((1/n) Σ_{i=1}^{n} (RMSE_i / µ_i)²)

where d_h/d_l is the ratio between the pixel sizes of the high-resolution and low-resolution images, n is the number of bands, and µ_i is the mean of the ith band.

•
Peak Signal-to-Noise Ratio (PSNR): A widely used metric computed as the ratio between the squared peak gray level of the image and the mean squared error between the original and processed images [29].

PSNR = 10 log₁₀((2^b − 1)² / MSE)

where b is the number of bits per pixel in the image.
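A minimal NumPy sketch of this metric for b-bit images (the function name is ours):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, bits: int = 8) -> float:
    """PSNR in dB: ratio of the squared peak intensity (2^b - 1)^2 to the MSE."""
    err = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if err == 0:
        return float("inf")  # identical images
    peak = (2 ** bits - 1) ** 2
    return float(10.0 * np.log10(peak / err))

a = np.zeros((4, 4))
b = np.full((4, 4), 255.0)
psnr(a, b)  # -> 0.0 dB (worst case for 8-bit images)
```

Higher values indicate a processed image closer to the original; identical images give an infinite PSNR.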

•
Relative Average Spectral Error (RASE): Characterizes the average performance of a method in the considered spectral bands [30].
RASE = (100/µ) √((1/n) Σ_{i=1}^{n} RMSE(B_i)²)

where µ is the mean radiance of the n spectral bands, and B_i represents the ith band of the image.
• Spectral Angle Mapper (SAM): Computes the spectral angle between the pixel vectors of the original images and the processed images [31].
• Structural Similarity Index (SSIM): Used to compare the local patterns of pixel intensities between the original images and the processed images [32].
SSIM(x, y) = ((2µ_x µ_y + C₁)(2σ_xy + C₂)) / ((µ_x² + µ_y² + C₁)(σ_x² + σ_y² + C₂))

where µ_x and µ_y are the means of the images; σ_x² and σ_y² are their variances; σ_xy is the covariance between the images to compare; C₁ = (k₁L)² and C₂ = (k₂L)² are two variables that stabilize the division with low denominators; L is the dynamic range of the pixel values; k₁ ≪ 1; and k₂ ≪ 1.
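The SSIM expression can be sketched in NumPy. Note this is a simplified single-window variant computed over the whole image, whereas the reference metric averages the same expression over local windows; the function name and default constants are our own:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM over whole images (the reference metric averages
    this expression over local windows)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

a = np.arange(16.0).reshape(4, 4)
ssim_global(a, a)  # -> 1.0 for identical images
```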

•
Universal Quality Image Index (UQI): Used to calculate how much of the relevant data is transferred from the original images into the processed images [33].
Table 3 summarizes the PSNR results obtained by each neural network on the validation GS images, and Table 4 summarizes the corresponding PSNR results for RGB images.

Table 2. Comparative visual results for the RGB image.

Table 3. Comparative results of PSNR in GS images.

To better present all the results of the metrics calculated from the validation database images processed by each of the aforementioned networks, Box-and-Whisker plots were made. This type of graph summarizes a large amount of data in five descriptive measures, in addition to hinting at its morphology and symmetry. This type of graph allows us to identify outliers and compare distributions.

Figure 8 shows the Box-and-Whisker plots for each of the metrics applied to the results on the GS images, and Figure 9 shows the corresponding plots for the RGB image results. In each of the diagrams, it can be seen that the DVA has smaller boxes than the other networks, which means that the results obtained oscillate in a smaller range, so the result of the processing is similar regardless of the density with which the image is corrupted. The median is also located near the center of the box, which indicates that the distribution is almost symmetrical. Another point to highlight in the diagrams is that there are fewer outliers for the DVA compared to the other networks. Recapitulating the previous results, the DVA obtained better results in comparison with the other neural networks. Although the difference in the metric calculations is not visually appreciable, this is mainly because these metrics do not accurately reflect the perceptual quality of the human eye. One measure of perceptual image quality is the Mean Opinion Score (MOS) [34]; however, this type of measure is not objective, as it differs depending on the user in question [35].
Another point in favor of the DVA is that it can be used on images of any dimension. As an example, Table 5 shows the visual and calculated results for high-definition images, for which it can be seen that good restoration results are obtained.
Additionally, the negative of the difference between the analyzed image and the original image is shown, in which all white pixels represent pixels that are equal to those of the original image; from this, it can be deduced that the DVA achieves a good restoration of the image when it is corrupted with Gaussian noise.
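This difference visualization can be produced as follows; a sketch assuming 8-bit images (the function name is ours):

```python
import numpy as np

def difference_negative(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Negative of the absolute difference of two 8-bit images: pixels that
    match in both images come out white (255), mismatches come out darker."""
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return (255 - diff).astype(np.uint8)

a = np.array([[0, 100], [200, 255]], dtype=np.uint8)
b = a.copy()
b[0, 0] = 10
neg = difference_negative(a, b)  # only the changed pixel darkens
```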

Conclusions
In this research work, the importance of the use of filters for artificial vision systems was highlighted, as well as the basic concepts that encompass artificial intelligence and some types of unsupervised networks that are used today. On this basis, a methodology based on autoencoders was proposed, which is capable of processing images of any size and type (RGB or GS). The analysis of the results shows that, with the DVA, it is possible to efficiently smooth the Gaussian noise of images through the deep learning techniques implemented in the proposed algorithm, regardless of the density of the noise present in the corrupted images. The DVA results, both visual and calculated using various quantitative metrics, show better noise suppression compared to the DnCNN, NAFNET, and Restormer algorithms, which, despite having different architectures, share the function of smoothing noise in images.
One limitation observed during this research work is that, when the image presents a low noise density, the results are similar to those of the architectures with which the DVA was compared. It is therefore suggested, as a starting point, to make improvements either through transfer learning or by combining this methodology with another, such as the one proposed in [36], in order to improve both the qualitative and quantitative results, since it is extremely important for vision systems to get as close as possible to the real scene in order to reduce errors.

Figure 1 .
Figure 1. Architecture of the vanilla autoencoder.

•

The Convolutional Autoencoder (Conv AE) makes use of convolution operators and extracts useful representations from the input data, as shown in Figure 2. The input image is sampled to obtain a latent representation, and the network is forced to learn that representation [24].
(a) Histogram of the original Lenna image. (b) Histogram of the corrupted Lenna image.

Figure 4 .
Figure 4. Difference between histogram of original Lenna image and histogram of corrupted Lenna image.

Algorithm 1. The DVA process.

Figure 5 .
Figure 5. Histogram of the result of the corrupted image of Lenna processed by DVA.
(a) Learning curves obtained in the training with GS images. (b) Learning curves obtained in the training with RGB images.

Figure 6 .
Figure 6. Learning curves obtained during the training of the DVA.

Figure 8 .
Figure 8. Box-and-Whisker plots of the quantitative results obtained on GS images.

Figure 9 .
Figure 9. Box-and-Whisker plots of the quantitative results obtained on RGB images.

Table 4 .
Comparative results of PSNR in RGB images.