Denoising and Dehazing an Image in a Cascaded Pattern for Continuous Casting

Abstract: Automatic vision systems are widely used in continuous casting in the steel industry, where they improve efficiency and reduce labor. However, high temperatures and evaporating fog make the captured images noisy and hazy, which impedes the use of advanced machine learning algorithms in this task. Instead of treating denoising and dehazing separately, as previous papers do, we exploit the ability of deep learning to model complex formulations: our proposed algorithm, called Cascaded Denoising and Dehazing Net (CDDNet), reduces noise and haze in a cascading pattern. Experimental results on both synthesized images and a real video from a continuous casting factory demonstrate our method's superior performance on various metrics. Compared with existing methods, CDDNet achieved a 50% improvement in peak signal-to-noise ratio on the validation dataset, and a nearly 5% improvement on a dataset it had never seen before. Moreover, our model generalizes well enough that processing a video from an operating continuous casting factory with CDDNet resulted in high visual quality.


Introduction
Computer vision has seen remarkable progress in the domains of image classification [1][2][3][4], semantic segmentation [5][6][7], and restoration from degradation [8][9][10][11]. Traditional industries naturally seek to gain safety and efficiency with the assistance of computer vision. However, since production environments require high robustness and stability, deploying advanced but fragile algorithms in these industries is questionable. For example, in continuous casting in the steel industry, the inevitably high-temperature environment causes imaging to suffer from noise and haze, leading to low visual quality. Figure 1 shows an example from a real continuous casting factory. The bottom area of the ladle (blue area in Figure 1) is where the vision system focuses in order to drive the robot arms (red area in Figure 1) to install the long casting nozzle, but both areas captured by the camera are heavily degraded by haze and noise due to the high temperature. High-level computer vision tasks, such as detection and recognition, may fail in this situation. Because noise and haze mislead computer vision algorithms and thus hinder the adoption of advanced vision systems in the continuous casting of steel, it is crucial for vision systems that pursue robustness and accuracy to denoise and dehaze at an early stage.
In this paper, we propose a high-performance end-to-end denoising and dehazing convolutional neural network (CNN) model, called Cascaded Denoising and Dehazing Net (CDDNet). While other approaches either neglect the influence of image noise or reduce noise separately, CDDNet is designed as a two-stage end-to-end pipeline that removes noise and haze in a cascading pattern. It was trained on synthesized hazy and noisy images, and was tested on synthetic images and a real video from an in-operation continuous casting factory. Experiments demonstrated that CDDNet performs well, and that it generalizes well enough for a real video processed by CDDNet to reach high visual quality.
To summarize, our contributions include:
• We analyze the atmospheric scattering model and additive noise from the perspective of combining them. We further expand this formula into a new denoising and dehazing algorithm that works in a cascading pattern.
• We propose CDDNet, a cascading two-stage U-Net for image denoising and dehazing.
• We provide in-depth experiments on both synthesized images and a real video from a continuous casting factory, which demonstrate that CDDNet achieves superior performance to other models.

Related Work
Continuous casting in the steel industry inevitably takes place in high-temperature environments with little ambient light. When industrial cameras are placed in such conditions, these disadvantages degrade the image in two ways. First, the camera sensor cannot collect enough photons; meanwhile, electric signals on the sensor tend to be unstable and leak out. Both factors result in significant noise. Second, the high temperature leads to heavy evaporation and condensation, generating haze that considerably obscures the targets being observed. In this section, we first review the basic models of haze and additive noise, which are essential to CDDNet's design.
In computer vision, the formation of hazy images is usually described by the atmospheric scattering model [12][13][14]:
$$ I(x) = J(x)\,t(x) + A\,(1 - t(x)) \qquad (1) $$
where I(x) is the observed hazy image and J(x) is the clean image to be recovered. Two parameters need to be determined: A denotes the global atmospheric light, and t(x) is the transmission matrix defined as:
$$ t(x) = e^{-\beta d(x)} \qquad (2) $$
where β is the scattering coefficient of the atmosphere, and d(x) is the distance from the target to the camera. The goal of image dehazing is to recover J(x) from I(x). In recent years, there has been significant progress in image haze removal, and these methods tend to rely on priors or assumptions about the hazy images. By observing that hazy images have lower contrast than clean images, Tan et al. [15] aimed to enlarge the local contrast to improve the visibility of hazy images. However, since different images have very different haze parameters, editing contrast arbitrarily causes a loss of fidelity.
He et al. [16] observed that the Dark Channel Prior (DCP) reliably estimates the transmission matrix, but the DCP fails when objects share a similar color with the atmospheric light A, because the estimated transmission value then tends to zero. Zhu et al. [17] proposed a color attenuation prior to recover images, with the model parameters learned in a supervised way. Since the success of convolutional neural networks in image processing, numerous CNN-based algorithms have emerged. DehazeNet, proposed by Cai et al. [18], is a CNN-based network carefully trained to predict the transmission matrix, while the global atmospheric light is estimated by empirical rules. All-in-One Net (AODNet) transforms the atmospheric scattering model into a unified model, and dehazes images directly without estimating A and t(x) separately [10].
When considering image noise, the main challenge is to recover a clean image x from the noisy observation y corrupted by the additive noise n, namely:
$$ y = x + n \qquad (3) $$
Most denoising methods can be classified into three groups: (1) methods based on the assumption of local image similarity, (2) transform-domain methods, and (3) convolutional neural network methods. The first group searches a local window of the image and restores pixels according to similar parts of the image. Non-Local Means (NLM), proposed by Buades et al. [19], and Block-matching and 3D filtering (BM3D), proposed by Dabov et al. [20], are the two most successful representatives of the first group. In the second group, since noise and signal usually exhibit different frequency spectra, Portilla et al. [21] utilized the sparsity of the image in the transformed domain. Deep learning methods have the advantage of fitting complex distributions, and are becoming promising techniques for denoising. NBNet, proposed by Cheng et al. [8], and MemNet, proposed by Tai et al. [22], are CNN-based methods with superior denoising performance. However, in the steel industry, pure denoising is insufficient, since the vast majority of interference comes from haze.
All of the above methods cope with either haze or noise, but not both; however, Matlin et al. [23] showed that the noise will be amplified if only the haze is processed. Denoising is vital, since image noise is everywhere, especially in a high-temperature environment such as the continuous casting areas of the steel industry. Liu et al. [24] observed that haze resides almost completely in the low frequencies, while noise tends to occupy higher frequencies. It is therefore natural to decompose the image in the frequency domain, filter out these degradations, and reconstruct the image from the frequency domain. However, textures and edges in images are inevitably smoothed, and extra detail enhancement tricks are needed to compensate for the loss of information.
In conclusion, dehazing and denoising are both ubiquitous and ill-posed problems. The methods described above each ignore one of the two problems, either because they do not address practical industrial scenarios or because they split haze and noise into different domains under the naive assumption that the two are not coupled. Our work not only handles dehazing and denoising jointly, but also uses no additional prior about the relationship between haze and noise: we propose an end-to-end, CNN-based solution with superior performance.
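The noise amplification reported by Matlin et al. [23] can be illustrated with a minimal NumPy sketch. It assumes the transmission t and the atmospheric light A are known exactly and simply inverts the atmospheric scattering model I = J t + A (1 - t) on a noisy input; the image, A, and the noise level are hypothetical values chosen for illustration. Under these assumptions the residual noise after dehazing has standard deviation σ / t, and since t = exp(-β d), this factor grows exponentially with distance.

```python
import numpy as np

def invert_scattering(I, t, A):
    """Naive dehazing: solve I = J*t + A*(1 - t) for J, with t and A known."""
    return (I - A) / t + A

def residual_noise_std(t, sigma=0.05, A=0.9, size=(64, 64), seed=0):
    """Std of the error left after dehazing a hazy image with additive noise."""
    rng = np.random.default_rng(seed)
    J = rng.uniform(0.2, 0.8, size=size)        # hypothetical clean image
    noise = rng.normal(0.0, sigma, size=size)   # i.i.d. zero-mean noise
    noisy_hazy = J * t + A * (1.0 - t) + noise
    return float(np.std(invert_scattering(noisy_hazy, t, A) - J))

for t in (0.8, 0.4, 0.2):
    # The residual is exactly noise / t, so its std is close to sigma / t.
    print(f"t={t:.1f}  residual std={residual_noise_std(t):.3f}  sigma/t={0.05 / t:.3f}")
```

The thinner the transmission (denser haze or larger distance), the stronger the amplification, which is the motivation for denoising before dehazing.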

Method
In this section, the proposed CDDNet is explained; see Figure 2a. CDDNet splits the restoration process into two steps (denoising and dehazing) in a cascading pattern in order to remove both noise and haze. The rest of this section is structured as follows: first, we combine the atmospheric scattering model and additive noise, leading to a more complex yet practical form. Second, our proposed method, CDDNet, which is inspired by this combined formula, is explained in detail.

Transformed Formula
According to Section 2, we are concerned not only with the haze, but also with the additive noise:
$$ I(x) = J(x)\,t(x) + A\,(1 - t(x)) + n \qquad (4) $$
where I is the observed image degraded by both haze and additive noise, and n is the noise contribution, assumed to be independent and identically distributed, with zero mean and variance $\sigma^2$. Matlin et al. [23] proved that if an image is processed only by dehazing algorithms, the noise level of the image will be amplified exponentially. In order to avoid this amplification, we split the restoration process into two stages in a cascading pattern, first denoising the image and then applying a dehazing procedure. During the first stage, the noise is treated as an additive term that can be removed by fitting its mean and variance, a task at which a neural network does a good job. After the first stage, we obtain
$$ Y(x) = I(x) - \hat{n} = J(x)\,t(x) + A\,(1 - t(x)) \qquad (5) $$
where $\hat{n}$ is the estimated noise term, and Y is the denoised restoration. The noise-free image is then passed to the second stage for dehazing. During the second stage, similarly to AODNet, we unify the two parameters t and A into one formula, i.e., K in Equation (6):
$$ J(x) = K(x)\,Y(x) - K(x) + b \qquad (6) $$
$$ K(x) = \frac{\frac{1}{t(x)}\,(Y(x) - A) + (A - b)}{Y(x) - 1} $$
where K(x) is an integrated parameter unifying the global atmospheric light A and the transmission matrix t(x), and b is a constant bias with a default value of 1. The clean image J(x) can be retrieved by Equation (6) once the unified parameter K is estimated.
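As a quick sanity check of the cascaded formulation, the following NumPy sketch runs both stages with oracle quantities in place of the networks: the first stage subtracts the true noise (standing in for the learned estimate), and the second stage uses the AODNet-style unified parameter K = ((Y - A) / t + (A - b)) / (Y - 1) so that J = K Y - K + b. The array sizes, value ranges, and function names are assumptions made for illustration only; in CDDNet the noise term and K are predicted by the two subnetworks.

```python
import numpy as np

def unified_K(Y, t, A, b=1.0):
    """Fold t(x) and A into one map K(x), AODNet style (requires Y != 1)."""
    return ((Y - A) / t + (A - b)) / (Y - 1.0)

def dehaze_with_K(Y, K, b=1.0):
    """Second stage: recover the clean image as J = K*Y - K + b."""
    return K * Y - K + b

rng = np.random.default_rng(0)
J = rng.uniform(0.1, 0.7, size=(32, 32))      # hypothetical clean image
t = rng.uniform(0.3, 0.9, size=J.shape)       # hypothetical transmission map
A = 0.8                                       # assumed atmospheric light
n = rng.normal(0.0, 0.05, size=J.shape)       # additive noise

I = J * t + A * (1.0 - t) + n                 # degraded observation (haze + noise)
Y = I - n                                     # stage 1 with an oracle noise estimate
J_hat = dehaze_with_K(Y, unified_K(Y, t, A))  # stage 2 with the oracle K

print("max reconstruction error:", float(np.max(np.abs(J_hat - J))))
```

With exact noise and K the clean image is recovered up to floating-point error, which confirms that the two stages compose consistently; any real error comes from the networks' estimates.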

CDDNet
The architecture of our proposed method, Cascaded Denoising and Dehazing Net (CDDNet), is shown in Figure 2a. CDDNet consists of two subnetworks, each of which is a U-Net.
U-Net, proposed by Ronneberger et al. [5], is a popular baseline network especially adept at pixelwise dense prediction (the input is a whole image and the output is a whole image too). Each U-Net consists of two significant phases called encoder and decoder, as shown in Figure 2b.
In the encoder, U-Net extracts features at different scales, level by level, while the spatial size of the input degraded image is scaled down. We replaced the regular extractor with a Half-Instance Normalization Block (HIN Block), since the HIN Block proposed by Chen et al. [9] achieves a balance between deep feature extraction and feature stability. We use five HIN Blocks in sequence to form the encoding phase, with four convolution blocks, each with a kernel size of 4, placed between the HIN Blocks to downsample the feature maps.
In the decoder, U-Net utilizes and rescales the information-rich features from the encoding phase to recover the image, level by level. Similarly to HINet proposed by Chen et al. [9], we use four ResBlocks to extract high-level features, and four transpose convolution blocks, each placed between ResBlocks, to upsample the features. Pixelwise prediction tasks, such as image restoration and semantic segmentation, face the difficulty that a network with numerous extractors, which digs into higher-level features to gain a deeper understanding of an image, tends to produce a low-resolution result. One of the reasons why U-Net is so popular is that its skip connections fuse features from the encoder to compensate for the loss of information caused by resampling, which leads to a high-resolution result. We also adopt skip connections in order to obtain a high-resolution restoration.
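To make the encoder's downsampling concrete, the short sketch below computes the spatial size of the feature maps after each of the four downsampling convolutions. The kernel size of 4 comes from the text; the stride of 2 and padding of 1 are assumptions (a common choice for halving the resolution), and the 256-pixel input is the training patch size used later in the paper.

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size=256, num_downsamples=4):
    """Spatial size at each encoder level, starting from the input size."""
    sizes = [size]
    for _ in range(num_downsamples):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

print(encoder_sizes())  # [256, 128, 64, 32, 16]
```

Under these assumptions each downsampling step halves the resolution, so the five encoder levels operate at 256, 128, 64, 32, and 16 pixels per side.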
The first subnetwork of CDDNet takes the degraded image with noise and haze as input X and generates a noise estimation R. The supervised attention module proposed by Mehri et al. [11] is deployed to generate the noise-free image by X + R according to Equation (5) and to connect the second subnetwork of CDDNet in a cascading pattern. The latter stage receives the noise-free estimation from the previous stage and produces the estimation of parameter K. Finally, the dehazing module combines K and the noise-free estimation, as shown in Figure 2c, to generate a haze-free result Ĵ. We also allow information exchange between the two subnetworks by using cross-stage feature fusion [11].

Experiments
We evaluated our approach on a wide range of datasets and report the standard metrics, including PSNR and SSIM. The datasets, brief introductions of metrics, and training details are as follows.

Implementation Details
Datasets. We first created synthetic hazy images from the NYU2 Dataset [25] by setting different atmospheric light levels and scattering coefficients. We took 27,000 synthetic hazy images as the training dataset and the remaining images as TestSet A. We then randomly added Gaussian noise (zero mean, with standard deviation varying from 0.01 to 0.3) to the hazy images, obtaining a set of images with both haze and additive noise. Besides synthetic hazy images, we took I-HAZE [26] as TestSet B, a dataset that contains 35 pairs of hazy and corresponding haze-free indoor images. Its hazy images were generated using real haze produced by a professional haze machine. In addition, we collected a real video (TestSet C) from an in-operation continuous casting factory, whose environment genuinely produces image noise and haze.
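The synthesis procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' data pipeline: it assumes each clean image comes with a per-pixel depth map, applies the atmospheric scattering model I = J t + A (1 - t) with t = exp(-β d), and then adds zero-mean Gaussian noise. The function name and the values of A and β are hypothetical; only the noise range of 0.01 to 0.3 is taken from the text.

```python
import numpy as np

def synthesize_degraded(clean, depth, A, beta, sigma, rng):
    """Return (hazy, hazy_and_noisy) versions of a clean image in [0, 1].

    clean: H x W x 3 array, depth: H x W array (assumed available per image).
    """
    t = np.exp(-beta * depth)[..., None]       # transmission, broadcast over channels
    hazy = clean * t + A * (1.0 - t)           # atmospheric scattering model
    noisy = hazy + rng.normal(0.0, sigma, size=clean.shape)  # additive Gaussian noise
    return np.clip(hazy, 0.0, 1.0), np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(48, 64, 3))   # stand-in for a clean image
depth = rng.uniform(0.5, 3.0, size=(48, 64))      # stand-in for its depth map
sigma = rng.uniform(0.01, 0.3)                    # noise level range from the text
hazy, degraded = synthesize_degraded(clean, depth, A=0.8, beta=1.0, sigma=sigma, rng=rng)
print(hazy.shape, degraded.shape, round(float(sigma), 3))
```

Keeping the intermediate hazy image is useful here because the first stage is supervised with the haze-only image and the second stage with the clean one.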
Metrics. We use peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) to quantify the restoration performance. PSNR refers to the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation [27]. In the field of image processing, given a noise-free image (ground truth) I and its estimation K, PSNR is defined as:
$$ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 $$
$$ \mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $$
where I is of size m × n and $\mathrm{MAX}_I$ is the maximum possible pixel value of the image. The higher the PSNR, the higher the quality of the restored image. While PSNR is used for estimating absolute errors between the ground truth and the restoration, SSIM is used for measuring the structural similarity between two images [28]. SSIM is defined as:
$$ \mathrm{SSIM}(I, K) = \frac{(2 \mu_I \mu_K + c_1)(2 \sigma_{IK} + c_2)}{(\mu_I^2 + \mu_K^2 + c_1)(\sigma_I^2 + \sigma_K^2 + c_2)} $$
where µ is the average of an image, $\sigma^2$ denotes the variance of an image, $\sigma_{IK}$ is the covariance of the two images, and $c_1$, $c_2$ are two variables that stabilize the division when the denominator is weak. The higher the SSIM, the closer the restoration is to the ground truth as perceived by a human observer.
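Both metrics are straightforward to compute, as the following NumPy sketch shows. PSNR follows the standard definition 10 log10(MAX² / MSE). The SSIM function is a simplified single-window version that evaluates the means, variances, and covariance over the whole image; practical implementations usually average the same expression over local windows, and the stabilizing constants c1 = (0.01 MAX)² and c2 = (0.03 MAX)² are the conventional defaults rather than values stated in this paper.

```python
import numpy as np

def psnr(I, K, max_val=1.0):
    """Peak signal-to-noise ratio in dB between ground truth I and estimate K."""
    mse = np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def global_ssim(I, K, max_val=1.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image (no sliding window)."""
    I = I.astype(np.float64)
    K = K.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_i, mu_k = I.mean(), K.mean()
    var_i, var_k = I.var(), K.var()
    cov = np.mean((I - mu_i) * (K - mu_k))
    num = (2 * mu_i * mu_k + c1) * (2 * cov + c2)
    den = (mu_i ** 2 + mu_k ** 2 + c1) * (var_i + var_k + c2)
    return float(num / den)

rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 0.8, size=(64, 64))
estimate = truth + 0.1                    # constant error of 0.1, so MSE = 0.01
print(round(psnr(truth, estimate), 2))    # 20.0
print(round(global_ssim(truth, truth), 6))  # 1.0
```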
Training. We used vanilla Stochastic Gradient Descent (SGD) with momentum 0.9 as the optimizer instead of Adam [29], since adaptive optimization methods can be sensitive and exhibit poor generalization if the training set is noisy [30]. The learning rate was initially set to $2 \times 10^{-4}$ and decreased to $1 \times 10^{-6}$ with the cosine annealing strategy [31]. We first center-cropped the training images from a resolution of 640 × 480 to 256 × 256, and we trained CDDNet on these patches with a batch size of 32 for 100 epochs. Since CDDNet is a two-stage model with two inputs and two outputs, we used two peak signal-to-noise ratio terms as the loss (PSNR loss). $X \in \mathbb{R}^{N \times C \times H \times W}$ denotes the input of the first stage, where N is the batch size, C is the number of channels, and H and W are the spatial dimensions of the image. $R \in \mathbb{R}^{N \times C \times H \times W}$ denotes the output of the first stage, which is a noise estimation. $Y \in \mathbb{R}^{N \times C \times H \times W}$ denotes the ground truth image degraded only by haze. $K \in \mathbb{R}^{N \times C \times H \times W}$ is the final product of stage two, and $J \in \mathbb{R}^{N \times C \times H \times W}$ is the ground truth without noise and haze. CDDNet is then optimized end-to-end with these two PSNR loss terms.
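The training schedule and loss can be sketched as follows. The cosine annealing function uses the learning rates and epoch count given above. The loss is only an assumed form: it sums the negative PSNR of the two stages with equal weight, comparing the denoised image X + R with Y and the final restoration (called J_hat here) with J; the paper's exact weighting is not reproduced, and the function names are hypothetical. The code evaluates values with NumPy and is not a differentiable training loop.

```python
import math
import numpy as np

def cosine_annealing_lr(epoch, total_epochs=100, lr_max=2e-4, lr_min=1e-6):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine

def psnr(pred, target, max_val=1.0, eps=1e-12):
    """PSNR in dB; eps avoids division by zero for identical inputs."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / (mse + eps)))

def two_stage_psnr_loss(X, R, Y, J_hat, J):
    """Assumed loss: negative PSNR of stage 1 plus negative PSNR of stage 2."""
    return -(psnr(X + R, Y) + psnr(J_hat, J))

rng = np.random.default_rng(0)
shape = (2, 3, 16, 16)                       # N x C x H x W
J = rng.uniform(0.0, 1.0, size=shape)        # clean ground truth
Y = 0.6 * J + 0.3                            # stand-in for the haze-only image
noise = rng.normal(0.0, 0.1, size=shape)
X = Y + noise                                # hazy and noisy input

good = two_stage_psnr_loss(X, -0.9 * noise, Y, J + 0.01, J)
bad = two_stage_psnr_loss(X, np.zeros(shape), Y, J + 0.10, J)
print(good < bad)                            # better estimates give a lower loss
print(cosine_annealing_lr(0), cosine_annealing_lr(100))
```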

Quantitative Results
Most dehazing algorithms cannot handle the degradation caused by noise. As mentioned in Section 3, dehazing approaches that ignore noise will significantly amplify it. Matlin et al. [23] suggest that combining denoising methods with dehazing methods mitigates this problem. Thus, we compare CDDNet with pure dehazing methods: All-in-One Net (AODNet) proposed by Li et al. [10] and Dark Channel Prior (DCP) proposed by He et al. [16]. In addition, we combined the denoising method BM3D with these dehazing methods to make a comprehensive comparison. Table 1 shows the average PSNR and SSIM results on TestSets A and B with σ ∈ {0.05, 0.1, 0.2, 0.3}, respectively. Since CDDNet was validated on TestSet A, it is unsurprising that it obtained superior PSNR and SSIM scores on this dataset. TestSet B is a challenge for deep learning methods like CDDNet and AODNet, because these methods have never seen TestSet B before, so the results mostly show how well the deep learning methods generalize. Compared with BM3D+AODNet on TestSet B, the PSNR and SSIM of CDDNet increased by 0.66 dB and 0.08 on average. These results indicate that CDDNet generalizes and performs better than BM3D+AODNet. The last two rows of Table 1 show the differences (improvements) between CDDNet's scores and the corresponding best scores of the other methods. We conclude that CDDNet obtained superior PSNR and SSIM performance to all competitors under a variety of challenging circumstances.
Figure 3 reveals that pure dehazing methods reduced haze from a global perspective, which widened the contrast of the image. The dehazing performance of these traditional methods decayed as σ increased. As can be seen from the third (BM3D+DCP) and fifth (BM3D+AODNet) rows of Figure 3, denoising before dehazing resulted in better visual quality. For example, in the fourth (Figure 3d) and eighth (Figure 3h) columns (σ = 0.3), BM3D+DCP could further remove the remaining haze that vanilla DCP could not. However, BM3D inevitably blurred the original image while denoising, leaving an over-smoothed, artistic-looking result. CDDNet, in the last row of Figure 3, obtained results of high visual quality compared with the other methods. We conclude that our method can retrieve clean images from severely deteriorated images.

Visual Results and Discussion
We collected a real video (TestSet C) from an in-operation continuous casting factory that genuinely suffers from image noise and haze, and evaluated it with CDDNet and the methods compared above. Figure 4 shows frames 3000 and 3400 of the video restored by these methods. Each restored image has a histogram below it, which is used for judging the overall tonal distribution of the image. The more evenly the histogram covers the full range of intensities, the higher the global contrast of the image. Low-contrast images tend to be either too bright or too dark, which makes rich details and textures invisible. As can be seen from Figure 4, the raw frames are hazy and noisy, and have low global contrast (most of the data points fall on the left side of the histogram). The compared methods have few positive effects on the histogram distribution. AODNet and BM3D+AODNet tended to set numerous pixel values to zero, causing the result to be too dark. For example, there are still some details in the bottom-right area of frame 3000, but AODNet made this area invisible. CDDNet not only restored the image from its hazy and noisy state, but also spread the histogram distribution, leaving appropriate contrast. Both the brightest part of the image (the red-hot long casting nozzle) and the darkest details, for example in the bottom-right part of frame 3000, remain visible.
We also analyze some regions of interest in Figure 5. As shown, pure dehazing methods (first and third columns) amplify the noise and bring about even more degradation. When combined with denoising approaches, these dehazing methods become more stable but tend to lose fine details. For instance, the second row of Figure 5 shows an electric wire. CDDNet retained fine details as much as possible, whereas the other methods blurred the region and caused some pixel overflow (red and green points). The last row is a region where dense and bright white haze obscures the background. The BM3D-based methods were confused by the light, resulting in incorrect color restoration (white mixed with rainbow colors). CDDNet not only made the background equipment visible, but also kept the colors accurate.
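The histogram criterion used for Figure 4 can be approximated numerically. The sketch below is an illustrative assumption rather than the measure used in the paper: it computes an intensity histogram and reports the distance between the 1st and 99th percentiles as a rough indicator of how much of the tonal range an image occupies. The function name and the two synthetic images are hypothetical.

```python
import numpy as np

def histogram_spread(gray, bins=256, low=1.0, high=99.0):
    """Return the intensity histogram and the low-to-high percentile range.

    gray: array of intensities in [0, 1]. A larger range means the pixels
    cover more of the tonal scale, i.e., higher global contrast.
    """
    hist, _ = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    p_low, p_high = np.percentile(gray, [low, high])
    return hist, float(p_high - p_low)

rng = np.random.default_rng(0)
dark_frame = rng.uniform(0.0, 0.2, size=(120, 160))      # values crowded on the left
restored_frame = rng.uniform(0.0, 1.0, size=(120, 160))  # values spread over the range

_, dark_spread = histogram_spread(dark_frame)
_, restored_spread = histogram_spread(restored_frame)
print(round(dark_spread, 2), round(restored_spread, 2))
```

A frame whose pixels crowd the left side of the histogram yields a small spread, while a restoration that uses the full intensity range yields a spread close to 1.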

Conclusions
In this paper, by combining additive noise with the atmospheric scattering model, we proposed CDDNet, a two-stage convolutional network that reduces haze and noise in a cascading pattern. We compared CDDNet with a variety of methods on synthetic images and on a video that we collected from an in-operation continuous casting factory, using both objective metrics (PSNR, SSIM) and histogram distributions. Extensive experimental results show that, when faced with haze and noise, CDDNet achieves superior performance in dehazing and denoising, and restores the images to a high visual standard. Since few restoration methods have been designed specifically for continuous casting in the steel industry, it is encouraging that our proposal shows promising generalization to a real continuous casting environment.