Noise Parameter Estimation Two-Stage Network for Single Infrared Dim Small Target Image Destriping

The existing nonuniformity correction methods generally suffer from image blur, artifacts, image over-smoothing, and nonuniformity residuals, and they have difficulty meeting the requirements of image enhancement in various complex application scenarios. In particular, when applied to dim small target images, they may remove dim small targets as noise points due to image over-smoothing. This paper draws on the idea of residual networks and proposes a two-stage learning network based on the imaging mechanism of an infrared line-scan system. We adopt a multi-scale feature extraction unit and design a gain correction sub-network and an offset correction sub-network, respectively. We first pre-train the two sub-networks independently; we then cascade them into a two-stage network and train it. The experimental results show that the PSNR gain of our method can exceed 15 dB and that it achieves excellent performance across different backgrounds and different intensities of nonuniform noise. Moreover, our method avoids losing texture details or dim small targets after effectively removing nonuniform noise.


Introduction
With the development of infrared remote sensing detection systems, infrared line-scan detectors have been widely used in military and civilian fields such as battlefield surveying, maritime surveillance, and urban traffic monitoring. However, limited by factors such as the manufacturing process, there is nonuniform response noise between the detection elements of a line-scan sensor, resulting in a strip effect in the raw infrared images. Nonuniformity correction must therefore be performed on the image before target detection [1]. In addition, an infrared remote sensing detection system not only has a long observation distance, but is also often interfered with by complex background clutter and noise. Therefore, the targets often show the characteristics of dim small targets, such as a low signal-to-noise ratio and a lack of effective structural information. According to the definition of the Society of Photo-Optical Instrumentation Engineers (SPIE), a target with a local signal-to-noise ratio <5 dB and a pixel size ≤ 9 × 9 is regarded as a weak target [2]. These characteristics make dim small targets very easy to mistake for noise points, which requires us to pay special attention to the problem of image over-smoothing when removing strip noise.
For the fixed pattern noise (FPN) generated by infrared imaging systems, researchers have proposed two kinds of nonuniformity correction methods, which are calibration-based and scene-based [3].
The calibration-based correction method utilizes the two-point or multi-point response of the sensor's detection elements to a black-body radiation source and calculates the correction parameters through mathematical fitting.

The main contributions of this paper are as follows:
1. A destriping method based on a noise parameter estimation two-stage network is proposed, which can adapt to the input image size and effectively correct real nonuniformity infrared images;
2. According to the nonuniformity response model of the line-scan detector, a deep learning dataset for strip noise parameter estimation and image reconstruction is produced;
3. A multi-scale feature extraction unit is designed to use image information more effectively, and the proposed network generalizes well to different intensities of nonuniform noise and different backgrounds;
4. The noise parameter estimation mechanism in our network fundamentally avoids the removal of texture details and dim small targets caused by image over-smoothing.

Methods
In this paper, two deep learning sub-networks for estimating the gain correction coefficient and estimating the offset correction factor are designed, respectively. After pre-training the sub-networks, we cascade the two sub-networks through a multiplication structure and an addition structure as a two-stage network. Finally, we train the two-stage network, and the resulting network model can perform nonuniformity correction on real infrared images.

Nonuniformity Response Model and Datasets
Our two-stage deep learning network model requires a large amount of data for training; however, due to less research on deep learning methods for nonuniformity correction, no ready-made datasets are available. We downloaded the MS-COCO datasets [36], converted the images in the datasets into grayscale images as clean infrared images, and then added nonuniform noise to these clean infrared images according to the nonuniformity response model of the infrared line-scan detector to produce our datasets.
The nonuniformity response of an infrared line-scan detector can be expressed as follows:

y_ij = g_i · x_ij + o_i, (1)

where x_ij and y_ij represent the actual response and the observed value of detector unit i at row j, respectively, and g_i and o_i represent the gain and offset of detector unit i, respectively. Image nonuniformity correction essentially removes the nonuniform noise generated by each detector unit, shown in Equation (1), from the observed signal y_ij, so as to obtain the estimated value x̂_ij of the actual response x_ij. The process can be expressed as follows:

x̂_ij = G_i · y_ij + O_i, (2)

where G_i and O_i are the gain correction coefficient and the offset correction factor, respectively:

G_i = 1 / g_i, (3)

O_i = −o_i / g_i. (4)

Therefore, the problem of image nonuniformity correction reduces to estimating the gain correction coefficient G_i and the offset correction factor O_i.

Based on the above nonuniformity response model, we produced our datasets. First, we cropped 128 × 128 patches from the images in the MS-COCO dataset. Second, we converted each cropped image to a grayscale image and backed it up as "Ori". Third, we added noise to the grayscale image according to Equation (1) and saved the noised image as "Nuf". At the same time, we substituted the noise factors g_i and o_i into Equations (3) and (4), respectively, to obtain the correction factors G_i and O_i. Finally, we obtained the mat data {"Nuf", "Ori", "G", "O"}, where "Nuf" is a 128 × 128 grayscale image with nonuniform noise added, "Ori" is the corresponding un-noised original grayscale image of size 128 × 128, "G" is the corresponding 1 × 128 vector of nonuniformity gain correction coefficients, and "O" is the corresponding 1 × 128 vector of nonuniformity offset correction factors. The specific procedure for producing the datasets is shown in Figure 1.
It should be noted that, in step 3, the multiplicative noise g_i of the same image obeys a uniform distribution on (1 − σ_g, 1 + σ_g), and the additive noise o_i of the same image follows a Gaussian distribution with a mean of 0 and a standard deviation of σ_o; each image corresponds to different random noise parameters. In this way, we generated 500,000 nonuniformity images, of which 200,000 were used to pre-train our two sub-networks, and the other 300,000 were used to train the two-stage network formed by cascading the two sub-networks.
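The noise-synthesis step above can be sketched numerically. The following NumPy snippet (our own illustration; variable names and the fixed σ values are assumptions, not the paper's released code) applies Equation (1) column-wise and derives the correction factors of Equations (3) and (4), confirming that the true factors recover the clean image exactly:

```python
# Sketch of the dataset-generation step: Equation (1) forward noise model and
# the correction factors of Equations (3)-(4). NumPy stand-in for illustration.
import numpy as np

def add_stripe_noise(clean, sigma_g, sigma_o, rng):
    """Apply column-wise gain/offset noise: y_ij = g_i * x_ij + o_i."""
    H, W = clean.shape
    g = rng.uniform(1.0 - sigma_g, 1.0 + sigma_g, size=(1, W))  # multiplicative
    o = rng.normal(0.0, sigma_o, size=(1, W))                   # additive
    noisy = g * clean + o
    G = 1.0 / g          # gain correction coefficient, Equation (3)
    O = -o / g           # offset correction factor, Equation (4)
    return noisy, G, O

rng = np.random.default_rng(0)
clean = rng.uniform(0, 255, size=(128, 128))
noisy, G, O = add_stripe_noise(clean, sigma_g=0.10, sigma_o=15.0, rng=rng)
restored = G * noisy + O   # Equation (2): exact recovery with the true factors
```

With the ground-truth factors the recovery is exact, which is why estimating (G_i, O_i) is equivalent to destriping the image.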

Network Design
Our two-stage network mainly consists of two parts: a gain correction sub-network and an offset correction sub-network. Our network adopts the design principle of a fully convolutional network (FCN), which allows it to adapt to the size of the input images [37]. In our network, each convolutional layer is a multi-scale feature extraction unit, as shown in Figure 2. As can be seen from Figure 2, we cascade 32 asymmetric 1 × 5 convolution kernels with 32 asymmetric 5 × 1 convolution kernels and connect this cascade in parallel with 32 convolution kernels of 1 × 1 and 64 convolution kernels of 3 × 3. The cascade of a 1 × 5 convolution kernel and a 5 × 1 convolution kernel is equivalent to a 5 × 5 convolution kernel in terms of receptive field [38]. Moreover, each convolution operation is followed by a Leaky ReLU activation function [39] to introduce nonlinearity. We then concatenate the multi-scale features obtained by convolution along the channel dimension. Finally, we adopt a 1 × 1 convolution kernel to fuse the features across channels and reduce the dimensionality. By adopting multi-scale feature extraction units, we not only increase the adaptability to targets of different scales, but also improve the expressive ability of the network without increasing its depth, thereby avoiding vanishing gradients.
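The receptive-field claim for the cascaded asymmetric kernels can be checked numerically. The sketch below (plain NumPy, with our own helper `corr2d_valid`) shows that, in the linear case, a 1 × 5 kernel followed by a 5 × 1 kernel is exactly a separable 5 × 5 kernel; with Leaky ReLU in between, only the receptive field, not the function, is equivalent:

```python
# Numerical check: cascading a 1x5 kernel and a 5x1 kernel (linear case)
# equals a single separable 5x5 kernel, i.e. the same receptive field.
import numpy as np

def corr2d_valid(img, k):
    """'Valid' 2-D cross-correlation with a small kernel (loop version)."""
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((16, 16))
row_k = rng.standard_normal((1, 5))   # 1x5 asymmetric kernel
col_k = rng.standard_normal((5, 1))   # 5x1 asymmetric kernel

cascaded = corr2d_valid(corr2d_valid(img, row_k), col_k)
full_5x5 = corr2d_valid(img, col_k @ row_k)  # rank-1 (separable) 5x5 kernel
```

The cascade uses 10 weights per channel instead of 25, which is the parameter saving behind the asymmetric design.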
Shown in Figure 3 is the gain correction sub-network structure. We use 8 convolution layers built from multi-scale feature extraction units and then use a 1 × 1 convolution kernel to reduce the output to a single channel, so as to obtain an output with the same size as the input image. Finally, we use the column mean of this output as the estimate of the gain correction coefficient. In this sub-network, we used the nonuniformity infrared image "Nuf" as input, the saved gain correction coefficient "G" as the ground-truth label, and the mean square error as the loss function for pre-training. The loss function of the gain correction sub-network can be expressed as:

Loss_G = (1/W) Σ_{i=1}^{W} (Ĝ_i − G_i)²,

where W is the width of the image, Ĝ_i is the estimated value of the gain correction coefficient output by the gain correction sub-network, and G_i is the saved true value of the gain correction coefficient.

Shown in Figure 4 is the offset correction sub-network structure. In this sub-network, we use 15 convolution layers built from multi-scale feature extraction units and a 1 × 1 convolution kernel, and finally use the column mean of the output as the estimate of the offset correction factor. Since the offset correction sub-network and the gain correction sub-network are pre-trained independently, both sub-networks can use the same nonuniformity infrared image datasets. Different from the gain correction sub-network, the offset correction sub-network uses the image obtained by multiplying the nonuniformity infrared image "Nuf" by the saved gain correction coefficient "G" as input and uses the saved offset correction factor "O" as the ground-truth label. In this sub-network, we also use the mean square error as the loss function for pre-training:

Loss_O = (1/W) Σ_{i=1}^{W} (Ô_i − O_i)²,

where W is the width of the image, Ô_i is the estimated value of the offset correction factor output by the offset correction sub-network, and O_i is the saved true value of the offset correction factor.
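The column-mean readout and the per-column MSE loss can be reduced to a few lines. This is a hedged NumPy sketch under our own naming (`feature_map` stands in for the single-channel output of the final 1 × 1 convolution):

```python
# Column-mean readout and per-column MSE loss of the sub-networks (sketch).
import numpy as np

def column_mean_estimate(feature_map):
    """Collapse an HxW map to a 1xW per-column estimate (e.g. G_hat or O_hat)."""
    return feature_map.mean(axis=0, keepdims=True)

def column_mse_loss(pred, truth):
    """(1/W) * sum_i (pred_i - truth_i)^2, as in the sub-network losses."""
    return float(np.mean((pred - truth) ** 2))

# Toy check: if every column of the map already equals the true coefficient,
# the column-mean estimate is exact and the loss is zero.
W = 128
G_true = np.random.default_rng(2).uniform(0.9, 1.1, size=(1, W))
feature_map = np.repeat(G_true, 64, axis=0)   # 64 identical rows
G_hat = column_mean_estimate(feature_map)
```

Averaging over rows is what makes the estimate per detector element rather than per pixel, which is central to the paper's argument about dim small targets.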
Shown in Figure 5 is the two-stage deep learning network structure of gain and offset correction. In our network, the gain correction coefficient and offset correction factor of the line-scan detector are estimated column by column, and the "small" size of a dim small target means that it cannot decisively influence the estimate for its column. This avoids the problem of traditional methods, which reconstruct images pixel by pixel based on image correlation and may therefore over-smooth the image and remove dim small targets.
It is important to emphasize that the training set of the cascaded two-stage network cannot be the same as the training set of the two sub-networks. When training the cascaded two-stage network, we use the nonuniformity infrared image "Nuf" as input, the saved clean infrared image "Ori" as the ground-truth label, and the mean square error as the loss function:

Loss = (1/(W·H)) Σ_{i=1}^{W} Σ_{j=1}^{H} (Î_ij − I_ij)²,

where W is the width of the image, H is the height of the image, Î_ij is the reconstructed image output by the network model, I_ij is the saved clean image, and i and j are the pixel coordinates of the image.
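The arithmetic skeleton of the cascade, with the convolutional estimators replaced by placeholder callables, might look as follows (a NumPy illustration, not the authors' implementation; the oracle "networks" below simply return the true factors to check the wiring):

```python
# Multiply-then-add cascade of the two-stage network, reduced to its arithmetic.
import numpy as np

def two_stage_correct(noisy, gain_net, offset_net):
    """Stage 1: multiply by estimated G. Stage 2: add estimated O."""
    G_hat = gain_net(noisy)        # shape (1, W), per-column gain correction
    stage1 = G_hat * noisy         # gain-corrected image
    O_hat = offset_net(stage1)     # shape (1, W), per-column offset correction
    return stage1 + O_hat          # reconstructed image

def image_mse(pred, clean):
    """(1/(W*H)) * sum_ij (I_hat_ij - I_ij)^2, the two-stage training loss."""
    return float(np.mean((pred - clean) ** 2))

rng = np.random.default_rng(3)
clean = rng.uniform(0, 255, size=(64, 64))
g = rng.uniform(0.9, 1.1, size=(1, 64))
o = rng.normal(0.0, 10.0, size=(1, 64))
noisy = g * clean + o
# Oracle estimators returning the true correction factors:
restored = two_stage_correct(noisy, lambda y: 1.0 / g, lambda s: -o / g)
```

Stage 1 maps g·x + o to x + o/g, so the offset sub-network sees a gain-free image; this is why its input is "Nuf" multiplied by "G" during pre-training.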

Results
In experiments, to verify the effectiveness of our method, we selected the most commonly used traditional methods for comparison, including MHE [14] and 1D-GF [15]. At the same time, we also compare our method with already open-source deep learning methods, including SNRCNN [27], ICSRN [28], DLS-NUC [29], and SNRWDNN [30].

Network Model Training
We used TensorFlow software (San Francisco, CA, USA) to build the network model and used the adaptive momentum optimizer (AdamOptimizer) in TensorFlow to pre-train and train the network model on a single GPU NVIDIA Tesla V100S-PCIe-32GB (Santa Clara, CA, USA).
In the pre-training of the gain correction sub-network and the offset correction sub-network, we randomly initialized the parameters: the batch size was set to 16, the initial learning rate was set to 0.001, the learning rate was reduced to 0.0001 after 25 epochs, and the total number of training epochs was set to 50. In the training of the cascaded two-stage network, we initialized with the parameters obtained in the pre-training of the two sub-networks: the batch size was set to 16, the initial learning rate was set to 0.0001, the learning rate was reduced to 0.00001 after 25 epochs, and the total number of training epochs was set to 50.
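The two step schedules described above can be written as simple epoch-to-learning-rate functions (an illustration of the stated hyper-parameters; whether the drop occurs at the start or end of epoch 25 is our assumption):

```python
# Step learning-rate schedules matching the stated training hyper-parameters.
def pretrain_lr(epoch):
    """Sub-network pre-training: 1e-3, dropped to 1e-4 after 25 of 50 epochs."""
    return 1e-3 if epoch < 25 else 1e-4

def finetune_lr(epoch):
    """Cascaded two-stage training: 1e-4, dropped to 1e-5 after 25 epochs."""
    return 1e-4 if epoch < 25 else 1e-5
```

Either function can be passed to a framework scheduler callback (e.g. a Keras `LearningRateScheduler`) without change.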

Quality Evaluation Metrics
In the experiments on simulated nonuniformity infrared dim small target images, we mainly compare the correction performance of each method for different intensity nonuniformity, and the influence of these methods on the dim small targets in nonuniformity infrared images. In the simulation experiments, we conducted a comparative analysis through five objective evaluation metrics, including root mean square error (RMSE) [40], peak signal-to-noise ratio (PSNR) [41], structural similarity (SSIM) [42], image roughness (IR) [43], and signal-to-clutter ratio (SCR) [44,45].
RMSE is the root mean square error between the clean reference image and the de-noised image, i.e., the image reconstruction error. The smaller the RMSE, the better the image reconstruction performance. It is defined as:

RMSE = sqrt( (1/(W·H)) Σ_{i=1}^{W} Σ_{j=1}^{H} (e_ij − r_ij)² ),

where W is the width of the image, H is the height of the image, e_ij is the image to be evaluated output by the network model, r_ij is the clean reference image, and i and j are the pixel coordinates of the image.

PSNR is one of the most commonly used evaluation metrics for image reconstruction quality and is calculated from the RMSE. The larger the PSNR, the better the image reconstruction performance. It is defined as:

PSNR = 20 · log₁₀(255 / RMSE).

Structural similarity is a metric that evaluates the similarity between the clean reference image and the de-noised image based on the characteristics of the human visual system, mainly considering three key image features: luminance, contrast, and structure. The value range of the structural similarity is (0, 1]; the larger the value, the more similar the two images, and the better the image reconstruction performance. It is defined as:

SSIM(r, e) = ((2·µ_r·µ_e + c₁)(2·σ_re + c₂)) / ((µ_r² + µ_e² + c₁)(σ_r² + σ_e² + c₂)),

where r and e are the reference image and the image to be evaluated, respectively; µ_r and µ_e are the means of r and e; σ_r² and σ_e² are the variances of r and e; and σ_re is the covariance of r and e. The constants c₁ = (k₁L)² and c₂ = (k₂L)² stabilize the division, where L = 255 is the dynamic range of the pixel values for 8-bit grayscale images, and the default values of k₁ and k₂ are 0.01 and 0.03, respectively.
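Under the definitions above, the three reference metrics can be sketched as follows (NumPy; note that the SSIM here uses global image statistics as in the formula, whereas library implementations typically average over local windows):

```python
# Reference-based metrics: RMSE, PSNR, and a global-statistics SSIM.
import numpy as np

def rmse(ref, est):
    """Root mean square error between reference and evaluated images."""
    return float(np.sqrt(np.mean((est - ref) ** 2)))

def psnr(ref, est, peak=255.0):
    """PSNR in dB, computed from the RMSE."""
    return float(20.0 * np.log10(peak / rmse(ref, est)))

def ssim_global(ref, est, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM per the formula above (global statistics)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_r, mu_e = ref.mean(), est.mean()
    var_r, var_e = ref.var(), est.var()
    cov = ((ref - mu_r) * (est - mu_e)).mean()
    return float(((2 * mu_r * mu_e + c1) * (2 * cov + c2)) /
                 ((mu_r ** 2 + mu_e ** 2 + c1) * (var_r + var_e + c2)))
```

For example, a uniform +1 offset gives RMSE = 1 and hence PSNR = 20·log₁₀(255) ≈ 48.1 dB.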
Image roughness is an important metric for measuring image sharpness. The smaller the image roughness, the better the image quality. It is defined as:

IR = (‖h ∗ e‖₁ + ‖hᵀ ∗ e‖₁) / ‖e‖₁,

where e is the reconstructed image to be evaluated, h = [1, −1] is the horizontal mask, hᵀ (the transpose of h) is the vertical mask, and the asterisk denotes discrete convolution. ‖·‖₁ is the L1 norm: ‖e‖₁ is the sum of the absolute pixel values of the image, and ‖h ∗ e‖₁ and ‖hᵀ ∗ e‖₁ are the sums of the absolute differences between adjacent pixels in the horizontal and vertical directions, respectively.
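Since convolution with h = [1, −1] reduces to first differences, the IR metric can be computed with `np.diff` (our own sketch):

```python
# Image roughness: L1 norm of horizontal/vertical first differences,
# normalized by the L1 norm of the image itself.
import numpy as np

def roughness(e):
    """IR = (||h*e||_1 + ||h^T*e||_1) / ||e||_1 with h = [1, -1]."""
    num = np.abs(np.diff(e, axis=1)).sum() + np.abs(np.diff(e, axis=0)).sum()
    return float(num / np.abs(e).sum())
```

A perfectly flat image has IR = 0, while column striping (the artifact this paper targets) drives the horizontal-difference term up.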
The signal-to-clutter ratio is an important metric for measuring the difficulty of detecting dim small targets. The higher the signal-to-clutter ratio of a dim small target, the easier it is to detect. It is defined as:

SCR = |µ_t − µ_b| / σ_b,

where µ_t is the average pixel value of the target, µ_b is the average pixel value of the target neighborhood background, and σ_b is the standard deviation of the pixel values of the target neighborhood background. Figure 6 shows the small target and its neighborhood background, in which the height h and width w of the small target are not greater than 5, and we set the width d of the neighborhood background to 5.

In the experiments on real nonuniformity infrared images, we mainly verified the correction performance of each method on real images and compared the effects of each method on texture details. Since there is no corresponding clean infrared image for a real nonuniformity infrared image, we can only use no-reference evaluation metrics. In the field of image processing, no-reference metrics generally have certain defects compared with full-reference metrics. In addition to the IR metric above, in order to comprehensively evaluate the denoising performance and the detail-preservation ability of each method, we also adopted the inverse coefficient of variation (ICV) [46,47] and the mean relative deviation (MRD) [48]. However, all of these metrics are affected by factors other than algorithm performance. Therefore, we performed edge detection [49] on the images processed by each method to help us compare the impact of each method on image texture details more objectively.
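A minimal SCR computation for a target box with a background ring of width d = 5 might look as follows (the box-coordinate convention is our assumption):

```python
# Signal-to-clutter ratio for a target box and its surrounding background ring.
import numpy as np

def scr(img, top, left, h, w, d=5):
    """SCR = |mu_t - mu_b| / sigma_b for target box (top, left, h, w) with a
    background ring of width d around it."""
    target = img[top:top + h, left:left + w]
    bg = img[top - d:top + h + d, left - d:left + w + d].astype(float).copy()
    bg[d:d + h, d:d + w] = np.nan          # mask the target out of the ring
    mu_t = target.mean()
    mu_b = np.nanmean(bg)
    sigma_b = np.nanstd(bg)
    return float(abs(mu_t - mu_b) / sigma_b)

# Synthetic check: a bright 3x3 target on a noisy flat background.
rng = np.random.default_rng(5)
img = rng.normal(10.0, 2.0, (40, 40))
img[18:21, 18:21] = 100.0
s = scr(img, 18, 18, 3, 3, d=5)
```

Here µ_t ≈ 100, µ_b ≈ 10 and σ_b ≈ 2, so the SCR lands near 45; a destriping method that blurs the target away would collapse this value.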
The inverse coefficient of variation (ICV) is an indicator for evaluating the smoothness of a homogeneous image region, which can measure the destriping performance of each method. It is defined as:

ICV = µ_h / σ_h,

where µ_h and σ_h are the mean and standard deviation of the pixel values in a homogeneous region, respectively. In general, the larger the ICV, the smoother the homogeneous region of the image, and the better the destriping performance.
In contrast to the ICV, the mean relative deviation (MRD) is an indicator for evaluating the relative distortion of a sharp region. It is defined as:

MRD = (1/N) Σ_ij (|ŝ_ij − s_ij| / s_ij) × 100%,

where s_ij and ŝ_ij are the pixel values before and after destriping in a sharp region, and N is the number of pixels in that region. In general, the smaller the MRD, the less distortion in the sharp region of the image, and the better the detail-retention ability. However, although the ICV and MRD are more targeted than the IR metric, these two metrics also have certain limitations, since the homogeneous and sharp regions of the image must be delineated manually.
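The two no-reference metrics reduce to a few lines (our sketch; MRD is reported here in percent, and the averaging convention is our assumption):

```python
# No-reference metrics: inverse coefficient of variation and mean relative
# deviation, evaluated on manually delineated regions.
import numpy as np

def icv(region):
    """ICV = mu / sigma of a homogeneous region; larger = smoother."""
    return float(region.mean() / region.std())

def mrd(before, after):
    """Mean relative deviation over a sharp region, in percent; smaller =
    less distortion of details."""
    return float(np.mean(np.abs(after - before) / np.abs(before)) * 100.0)
```

Note the tension the paper discusses: aggressive smoothing raises the ICV but also raises the MRD, which is why the two are reported together.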
Edge detection highlights structural information by finding the points with obvious brightness changes in an image. In this paper, we use the Sobel operator for edge detection. We convolve the image to be detected with the Sobel operator to obtain the gradient values G_x and G_y of each pixel in the horizontal and vertical directions, and then use the gradient magnitude of each pixel as the pixel value of the edge detection image:

G = sqrt(G_x² + G_y²),

where the Sobel operators are as follows:

S_x = [−1 0 1; −2 0 2; −1 0 1], S_y = [−1 −2 −1; 0 0 0; 1 2 1].
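The Sobel edge map described above can be sketched as follows (plain NumPy; the kernel layout is the usual convention, reconstructed rather than quoted from the paper):

```python
# Sobel edge detection: gradient magnitude from horizontal/vertical kernels.
import numpy as np

SX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal
SY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # vertical

def corr2d_valid(img, k):
    """'Valid' 2-D cross-correlation with a small kernel."""
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def sobel_edges(img):
    """Edge map G = sqrt(Gx^2 + Gy^2)."""
    gx = corr2d_valid(img, SX)
    gy = corr2d_valid(img, SY)
    return np.sqrt(gx ** 2 + gy ** 2)
```

A flat region produces zero response, while any brightness step (a genuine edge or residual stripe) shows up as a bright ridge, which is how the edge maps expose lost texture in Figures 19 and 20.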

Method Comparison on Simulated Data with Different Intensities of Nonuniformity
We selected one real, clean infrared small target image from the SIRST dataset [50]. As shown in Figure 7, the dim small target is located in the middle of the red box we marked. We added different intensities of nonuniform noise to this image. As a low-intensity nonuniformity, its multiplicative noise g i obeys a uniform distribution of (1 − 0.05, 1 + 0.05), and its additive noise o i obeys a Gaussian distribution with a mean of 0 and a standard deviation of 5. As a medium-intensity nonuniformity, its multiplicative noise g i obeys a uniform distribution of (1 − 0.10, 1 + 0.10), and its additive noise o i obeys a Gaussian distribution with a mean of 0 and a standard deviation of 15. As a high-intensity nonuniformity, its multiplicative noise g i obeys a uniform distribution of (1 − 0.15, 1 + 0.15), and its additive noise o i obeys a Gaussian distribution with a mean of 0 and a standard deviation of 25.
The correction results for infrared dim small target images with different intensities of nonuniformity are shown in Figures 8-10, respectively. Shown in Figure 8 are the correction results of different methods on a low-intensity nonuniformity infrared dim small target image. Shown in Figure 9 are the correction results of different methods on a medium-intensity nonuniformity infrared dim small target image. Shown in Figure 10 are the correction results of different methods on a high-intensity nonuniformity infrared dim small target image.
According to the evaluation metrics described in Section 3.2, we compare the correction performance of each method for different intensities of nonuniform noise, as shown in Table 1.
In order to more intuitively compare the generalization performance of each method for different intensities of nonuniformity, we created a line graph of each method regarding PSNR gain, as shown in Figure 11.

Method Comparison on Simulated Data with Different Backgrounds
We selected 10 real, clean infrared dim small target images from the SIRST dataset [50], measured the background complexity by the pixel value variance B_var (dynamic range of 0-255), and named these images Test-1 to Test-10 in order of increasing complexity. We then added equal-intensity nonuniform noise to these images: the multiplicative noise g_i obeys a uniform distribution on (1 − 0.12, 1 + 0.12), and the additive noise o_i obeys a Gaussian distribution with a mean of 0 and a standard deviation of 12. Figures 12-14 show the three most representative experimental results. Figure 12 shows the nonuniformity correction results of Test-1 (B_var = 19), Figure 13 those of Test-6 (B_var = 663), and Figure 14 those of Test-10 (B_var = 2880).
At the same time, we use the objective metrics described in Section 3.2 to quantitatively evaluate the experimental results, as shown in Table 2.
In order to compare the performance of each method on PSNR more intuitively, we created a line graph, as shown in Figure 15.
In order to verify the effectiveness of our method in solving the problem of image over-smoothing, we used the clean infrared image roughness as the abscissa reference and created a line graph of each method's performance on the IR, as shown in Figure 16. At the same time, in order to verify the friendliness of our method to dim small targets, we used the clean infrared image SCR as the abscissa reference and created a line graph of each method's performance on the SCR, as shown in Figure 17.

Method Comparison on Real Data
We compared the methods using a real nonuniformity infrared image, which was captured by an uncooled long-wave infrared camera and contains many texture details [29]. As shown in Figure 18a, we cropped out the (1:400, 1:480) area of the image as real data 1 and the (271:480, 481:640) area as real data 2; the red dividing line in the figure indicates the cropping. Figure 18b is the edge detection image, and the homogeneous region inside the blue box and the sharp region outside the blue box were used to calculate the evaluation metrics ICV and MRD. Figure 19 shows the correction results of real data 1 and the edge detection images of the correction results. Figure 20 shows the correction results of real data 2 and the edge detection images of the correction results. In the experiments with real nonuniformity infrared images, we also compared the correction performance of each method according to the above evaluation metrics, as shown in Table 3.

Discussion
The experimental results show that our method generalizes well to different backgrounds, can effectively remove nonuniform noise of different intensities, and was effectively verified on real data. In addition, our method avoids the image over-smoothing problem and, through its column-wise estimation mechanism, ensures that texture details and dim small targets are not removed in the process of nonuniformity correction.

Analysis of Simulation Experiment Results for Different Intensities of Nonuniformity
In the simulation experiments for different intensities of nonuniformity, MHE reduces the IR, but it performs poorly on the PSNR and SSIM metrics. The 1D-GF method can effectively correct low- and medium-intensity nonuniformity images, but a certain degree of residual strip noise remains, and for images with high-intensity nonuniform noise, 1D-GF cannot meet the needs of practical applications. As a preliminary attempt at applying deep learning to image nonuniformity correction, the SNRCNN method can remove nonuniform noise to a certain extent, but its correction ability cannot meet the needs of practical applications. Compared with SNRCNN, the correction ability of ICSRN is slightly improved, but it can only meet the needs of low-intensity nonuniformity correction. The correction ability of DLS-NUC is comparable to that of the traditional classical algorithm 1D-GF and can basically meet nonuniformity correction requirements. SNRWDNN performs well on images with different intensities of nonuniformity.
Compared with the above methods, our method has the best correction performance for nonuniform noise of different intensities. The PSNR of our corrected images improves by at least 15 dB over the PSNR of the uncorrected images, and the SSIM in the simulation experiments always reaches above 0.995. In addition, it can be seen from Figure 11 that the greater the intensity of nonuniformity, the higher the PSNR gain of MHE and DLS-NUC, while the PSNR gain of SNRCNN, ICSRN, and 1D-GF decreases. By contrast, the PSNR gains of SNRWDNN and our method change little. Overall, our method shows the best generalization to different intensities of nonuniform noise among these methods.

Analysis of Simulation Experiment Results for Different Backgrounds
In the experiment of Section 3.4, we added the same intensity of nonuniform noise to 10 infrared dim small target images with different backgrounds. From Table 2 and Figure 15, it can be seen that our method not only has the highest average PSNR, but also the highest PSNR on the correction result for each individual image. At the same time, our method also shows excellent performance on the SSIM compared with the other methods. More notably, the standard deviation of the PSNR in our experiments is also the smallest, which shows that our method generalizes excellently to infrared images with different backgrounds. In contrast, the standard deviations of the PSNR of MHE and DLS-NUC are high, indicating that their correction performance is easily affected by different backgrounds. Furthermore, as can be seen from Figures 16 and 17, our method fits the baseline best in terms of IR and SCR, which demonstrates the effectiveness of our proposed noise-parameter-estimation mechanism. Compared with other methods for suppressing image over-smoothing, our method fundamentally avoids the removal of dim small targets or texture details during correction through this change of mechanism.

Analysis of Real Experimental Results
In the correction experiments on real nonuniformity infrared images, the MHE, 1D-GF, and SNRCNN methods all leave very obvious strip noise residuals. From the edge detection images of the correction results, it can be seen that all of the other methods lose texture details to some extent.
In terms of the objective metrics in Table 3, although our method performs worse on IR than some other methods, the gap is small. In the experiments on real data 1, DLS-NUC achieves the best performance on ICV, SNRWDNN achieves the best performance on MRD, and our method achieves suboptimal performance on both metrics. Combined with the correction results of real data 1 in Figure 19, it can be considered that our method achieves a very good balance between these two metrics; that is, our method can effectively remove strip noise while preserving image details well. In the experiments on real data 2, we still achieve suboptimal performance. However, on the MRD metric, both SNRCNN and ICSRN, which performed poorly in the qualitative results, achieved top performance. We hypothesize that this is because their residual strip noise reduces the deviation between the correction result and the noisy image.

Conclusions
The main contribution of this paper is a two-stage deep learning network based on a noise parameter estimation mechanism, which can perform nonuniformity correction on a single-frame infrared line-scan image. Our network model abandons the pixel-by-pixel estimation mechanism of traditional end-to-end image reconstruction networks and instead estimates column by column, so that when applied to dim small target images, it ensures that dim small targets are not removed as noise points. According to the infrared line-scan imaging mechanism, we produced a noise parameter estimation and image reconstruction dataset and trained our nonuniformity correction model on it. In general, compared with existing methods, our method has many advantages, including excellent performance, intelligence, and friendliness to texture details and dim small targets. Of course, our model still has room for improvement in real-time performance. Therefore, in future work, we plan to simplify our model through transfer learning to improve its processing speed and practical value.

Data Availability Statement:
The synthetic data underlying this article will be shared upon reasonable request to the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.