Aerial-Image Denoising Based on Convolutional Neural Network with Multi-Scale Residual Learning Approach

Aerial images are subject to various types of noise, which restricts the recognition and analysis of images, target monitoring, and search services. At present, deep learning is successful in image recognition. However, traditional convolutional neural networks (CNNs) extract the main features of an image to predict directly and are limited by the requirements of the training sample size (i.e., a small data size is not successful enough). In this paper, using a small sample size, we propose an aerial-image denoising recognition model based on CNNs with a multi-scale residual learning approach. The proposed model has the following three advantages: (1) Instead of directly learning latent clean images, the proposed model learns the noise from noisy images and then subtracts the learned residual from the noisy images to obtain reconstructed (denoised) images; (2) The developed image denoising recognition model is beneficial to small training datasets; (3) We use multi-scale residual learning as a learning approach, and dropout is introduced into the model architecture to force the network to learn to generalize well enough. Our experimental results on aerial-image denoising recognition reveal that the proposed approach is highly superior to the other state-of-the-art methods.


Introduction
A classical problem in aerial imaging is that aerial images used for regional scenes, searching, and military scouting are subject to various types and degrees of noise, which may occur during transmission and acquisition. Noise degrades the quality of the original images, which directly or indirectly complicates timely recognition, analysis, target monitoring, and search services [1]. Therefore, image denoising is a crucial issue in the field of computer vision that has been widely discussed and studied by researchers [2][3][4]. At present, most existing denoising methods focus on the scenario of additive white Gaussian noise (AWGN), whereby an observed noisy image y is modeled as the sum of a clean image x and AWGN v; that is, y = x + v. Using a general model, Elad & Aharon [5] presented sparse and redundant representations over learned dictionaries, whereby the proposed algorithm simultaneously trained a dictionary on its content using the K-SVD algorithm. However, computational complexity is one of the shortcomings of this algorithm.
In recent years, with the development of deep learning, deep architectures have shown good performance [6][7][8][9]. In the task of image denoising, Frost et al. [10] and Kuan et al. [11] proposed spatial linear filters that assume that the resulting values of image filtering are linear with respect to the original image, by searching for the correlation between the intensity of the center pixel in the moving window and the average intensity of the filter window. Therefore, the spatial linear filter achieves a trade-off between the constant all-pass identity filter in the edge-containing region and balance in the uniform region. However, the spatial linear filter methods cannot fully preserve edges and details and are unable to preserve the average value, especially when the equivalent number of looks (ENL) of the original Synthetic Aperture Radar (SAR) image is very small. To address this problem, Zhang et al. [12] considered that image speckle noise can be expressed more accurately through nonlinear models than through linear models and proposed a novel deep-neural-network-based approach for SAR image despeckling, learning a nonlinear end-to-end mapping between the speckled and clean SAR images by a dilated residual network (SAR-DRN), which enlarges the receptive field and maintains the filter size and layer depth. Likewise, Wei et al. [13] proposed the deep residual pan-sharpening neural network (DRPNN); that is, the concept of residual learning was introduced to form a very deep convolutional neural network (CNN) to make full use of the high nonlinearity of deep learning models, which overcomes the drawbacks of the previously proposed methods and performs high-quality fusion of Panchromatic (PAN) and Multi-spectral (MS) images. Furthermore, Yuan et al. [14] introduced multi-scale feature extraction and residual learning into the basic CNN architecture and proposed a multi-scale and multi-depth CNN for the pan-sharpening of remote sensing imagery that overcomes some shortcomings of earlier methods. For example, in the Component Substitution (CS) and Multiresolution Analysis (MRA) based fusion methods, the transformation from observed images to fusion targets is not rigorously modeled, and distortion in the spectral domain is very common. In the Model-based optimization (MBO) methods, the linear simulation from the observed and fusion images is still a limitation, especially when the spectral coverages of the PAN and MS images do not fully overlap, which makes the fusion process highly nonlinear. Very recently, Zhang et al. [15] proposed an image denoising model using residual learning of deep CNNs (the feed-forward denoising CNN, DnCNN) that has provided promising performance among the state-of-the-art methods. Specifically, the model utilizes residual learning and batch normalization to speed up the training process as well as boost the denoising performance.
As mentioned above, deep architectures provide competitive results when the model can be trained with a very large dataset [16,17]. That is to say, these deep learning techniques require a large training dataset to establish a well-performing prediction model, which raises a serious problem: not only does training consume a lot of time, but the image datasets available for target region search are often limited. Therefore, applying deep learning to image-processing recognition problems with small datasets is still an open field of research.
In this paper, to improve the model denoising recognition performance for small training datasets, we follow Zhang et al. [15] and, with our own modified architecture, propose an aerial-image denoising recognition model based on a CNN with an improved multi-scale residual learning approach. Firstly, differently from [15], the image denoising recognition model developed by us is beneficial to small training datasets. Secondly, instead of directly outputting a denoised image, we separate the noise from a noisy image by a deep CNN model, and then the denoised image is obtained by subtracting the residual from the noisy observation. Finally, multi-scale residual learning is used as a learning algorithm, and batch normalization and dropout are also incorporated to speed up the training process and improve the model's denoising performance.
The rest of the paper is organized as follows. Section 2 presents related work, Section 3 explains the proposed aerial-image denoising model, Section 4 elaborates on the experimental results and evaluation, and Section 5 provides the conclusion of the paper.
Compared with shallow neural networks, deep networks can have a higher training error and test error, and training a deep network structure is also very difficult [18,19]. Supposing there exists a shallow network, there should be a deep network that is made up of that shallow network followed by multiple x → x mappings, and the training error of this deep neural network should not be higher than that of the corresponding shallow network. However, it is very difficult to use multiple nonlinear layers to fit an x → x direct mapping [20][21][22].
To overcome the above-stated problem, the residual learning approach along with some other approaches was introduced. The ResNet can be defined as $y = \wp(\{W_i\}, x) + x$, where $\wp(\{W_i\}, x)$ is the residual mapping function to be learned, $y$ is the output vector of the considered layers, and $x$ is the input vector of the considered layers. As shown in Figure 1a, $\wp = W_2\,\mathrm{ReLU}(W_1 x)$, and ReLU is defined as $\mathrm{ReLU}(x) = \max\{0, x\}$.
Furthermore, the idea of residual learning is illustrated as follows. We assume that $\mathcal{H}(x)$ is the target mapping that a sub-module ϕ of the neural network should learn and that $\mathcal{H}(x)$ may be very complex and difficult to learn. Now, instead of learning $\mathcal{H}(x)$ directly, the sub-module ϕ learns a residual function
$$\wp(x) = \mathcal{H}(x) - x.$$
The original mapping function can then be expressed as
$$\mathcal{H}(x) = \wp(x) + x.$$
That is, ϕ can be composed of two parts: a linear direct mapping x → x and a nonlinear mapping $\wp(x)$. If the direct mapping x → x is optimal, the learning algorithm can easily set the weight parameters of the nonlinear mapping $\wp(x)$ to 0. If there is no direct mapping, it is very difficult for the nonlinear mapping $\wp(x)$ to learn the linear mapping x → x. Additionally, the experiments of He et al. [26] showed that such a residual structure is significantly effective at extracting image texture and detail features.
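The behavior of a residual unit can be checked numerically. The following is a minimal sketch in plain Python, assuming small dense weight matrices in place of convolutions; the helper names (`relu`, `matvec`, `residual_block`) and the toy values are illustrative and do not come from the paper. It shows that when the weights of the nonlinear branch are zero, the block reduces to the direct mapping x → x.

```python
def relu(v):
    """Element-wise ReLU: max{0, x} for every entry."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix W (given as a list of rows) by a vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Compute y = W2 * ReLU(W1 * x) + x (residual branch plus skip connection)."""
    branch = matvec(W2, relu(matvec(W1, x)))
    return [b + xi for b, xi in zip(branch, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With zero weights the residual branch vanishes and the block is the identity.
y_identity = residual_block(x, zeros, zeros)

# With identity weights the branch contributes ReLU(x), which is added to x.
y_general = residual_block(x, identity, identity)
```

Here `y_identity` equals the input `[1.0, -2.0, 3.0]`, while `y_general` is `[2.0, -2.0, 6.0]`, because only the positive entries pass through the ReLU before being added back.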
At present, the ResNet is mainly composed of multiple residual learning modules. However, the computational cost of stacking many residual learning modules to build a deep neural network is still rather high. Therefore, Zhang et al. [15] presented a residual module structure called a bottleneck, as shown in Figure 1b. Additionally, with the residual learning strategy, a new method named the DnCNN [15] recently showed excellent performance, implicitly removing the latent clean image in the hidden layers. This property motivated us to train a single DnCNN model to tackle several general image denoising tasks, such as Gaussian denoising, single-image super-resolution, and JPEG image deblocking.
In addition, the bottleneck structure module is the key for the ResNet to achieve hundreds or thousands of layers. In Figure 1b, there are three layers in total for the nonlinear part of the bottleneck structure: one m * m (m = 3) convolution and two 1 * 1 convolutions. Assuming that the dimension of the input data x is 256, the first 1 * 1 convolution reduces the input dimension to 64; at the same time, it also implements cross-channel information fusion to some extent. The second 1 * 1 convolution raises the dimension, which returns the output dimension to 256. The input and output dimensions of the middle m * m convolution thus fall from the original 256 to 64 because of the two 1 * 1 convolutions, which greatly reduces the number of parameters and also increases the module depth [23,24]. In order to boost the denoising performance and improve image denoising recognition as much as possible, the bottleneck residual learning module is used to replace the ordinary residual learning module [25].
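The parameter saving of the bottleneck can be made concrete with a short calculation. The sketch below counts convolution weights only (biases are ignored) for the 256 → 64 → 64 → 256 bottleneck described above. The comparison module, two 3 * 3 convolutions that both keep 256 channels, is an assumption about what the "ordinary" residual module looks like and is not specified in the text.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Bottleneck: 1x1 reduces 256 -> 64, 3x3 keeps 64, 1x1 restores 64 -> 256.
bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

# Assumed ordinary module: two 3x3 convolutions at the full 256 channels.
plain = 2 * conv_params(3, 256, 256)

ratio = plain / bottleneck
```

Under these assumptions the bottleneck needs 69,632 weights against 1,179,648 for the plain module, roughly a 17-fold reduction, while using three layers instead of two.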

The Proposed Method Formulation
At present, image denoising methods share a common formulation:
$$y = x + v,$$
where $y$ is the observed noisy image generated as the addition of the clean image $x$ and some noise $v$.
Most of the discriminative denoising models aim to learn a mapping function (Equation (4)) to predict the latent clean image from $y$:
$$\mathcal{F}(y) = x. \tag{4}$$
Unlike previous methods, for aerial-image denoising, we follow DnCNN [15] and adopt the residual learning formulation to learn the residual mapping:
$$\wp(y) \approx v.$$
Thus, the clean image is recovered as
$$x = y - \wp(y).$$
Formally, to learn the trainable parameters $\lambda$ in the aerial-image denoising model, the averaged mean-square error between the desired image residuals $\{\Delta_t\}$ and the residuals $\{\wp(y_t; \lambda)\}$ estimated from the noisy input is adopted as the loss function of the residual estimation:
$$L(\lambda) = \frac{1}{2N} \sum_{t=1}^{N} \left\| \wp(y_t; \lambda) - \Delta_t \right\|_F^2, \qquad \Delta_t = y_t - x_t,$$
where $\{(y_t, x_t)\}_{t=1}^{N}$ are $N$ pairs of noisy-clean training images (patches) and $\|\cdot\|_F$ is the Frobenius norm, which, for a matrix $H$ with entries $h_{ij}$, can be calculated by the following formula:
$$\|H\|_F = \sqrt{\sum_{i} \sum_{j} |h_{ij}|^2}.$$
The architecture of the proposed aerial-image denoising model for learning $\wp(y)$ is shown in Figures 2 and 3. The reconstructed denoised image $x$ is obtained by subtracting the residual image $\wp(y)$ from the noisy observation image $y$. The module used in this architecture is a multi-scale residual learning module (see Figure 4). In the following, we explain the architecture of the multi-scale learning module for the aerial-image denoising model.
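The residual formulation and its loss can be illustrated on a toy example. The following is a minimal sketch in plain Python, assuming images stored as nested lists, a 1/(2N) normalization of the loss (an assumption in the style of DnCNN), and two hypothetical residual estimators (`oracle` and `zero`) that stand in for a trained network.

```python
import random


def add_awgn(x, sigma, rng):
    """Return y = x + v, with v drawn i.i.d. from a zero-mean Gaussian."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in x]


def subtract(a, b):
    """Element-wise difference of two images of equal size."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def frobenius_sq(h):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(p * p for row in h for p in row)


def residual_loss(predict_residual, pairs):
    """Average over N pairs of 0.5 * ||P(y_t) - (y_t - x_t)||_F^2."""
    total = 0.0
    for y, x in pairs:
        target = subtract(y, x)  # desired residual for this pair
        total += frobenius_sq(subtract(predict_residual(y), target))
    return total / (2 * len(pairs))


rng = random.Random(0)
clean = [[0.2, 0.4], [0.6, 0.8]]
noisy = add_awgn(clean, sigma=0.1, rng=rng)
true_noise = subtract(noisy, clean)


def oracle(y):
    """Hypothetical perfect estimator: returns the true noise."""
    return true_noise


def zero(y):
    """Hypothetical trivial estimator: predicts no noise at all."""
    return [[0.0] * len(row) for row in y]


denoised = subtract(noisy, oracle(noisy))  # x_hat = y - P(y)
loss_oracle = residual_loss(oracle, [(noisy, clean)])
loss_zero = residual_loss(zero, [(noisy, clean)])
```

With the perfect estimator the loss is zero and the denoised image matches the clean one up to rounding, whereas the trivial estimator leaves a positive loss equal to half the squared norm of the noise.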

Multi-Scale Learning Module
It is assumed that the output of the previous layer $X^{i-1}$ is composed of $n_{i-1}$ feature mappings (i.e., $n_{i-1}$ channels). Firstly, to produce a set of feature mappings $z_k^i$, multi-scale filters are applied to the input data, as follows:
$$z_k^i = W_k^i * X^{i-1} + B_k^i, \tag{9}$$
where $W_k^i$ and $B_k^i$ are the convolutional filters and biases of the $k$th type of the $i$th layer, respectively.
Secondly, each convolution produces $n_i$ feature maps. Therefore, the output of the multi-scale convolution [24] is composed of $K \times n_i$ feature maps. The feature maps are divided into $n_i$ nonoverlapping groups, and the $t$th group is made up of $K$ feature maps, that is, $\{z_{t,1}^i, z_{t,2}^i, \ldots, z_{t,K}^i\}$. Pooling within each group is performed by the maxout activation function $\delta(\cdot)$. The function
$$X_t^i(x, y) = \delta\left(z_{t,1}^i(x, y), \ldots, z_{t,K}^i(x, y)\right) = \max_{p = 1, \ldots, K} z_{t,p}^i(x, y)$$
is the maxout output of the $t$th group at the position $(x, y)$, where $z_{t,p}^i(x, y)$ $(p = 1, 2, \ldots, K)$ is the data at a particular position $(x, y)$ in the $p$th feature map of the $t$th group.
In our paper, the multi-scale convolutional layer comprises three ($K = 3$) types of filters, of sizes 5 × 5, 3 × 3, and 1 × 1, as shown in Figure 4. Across the feature maps, the maxout function executes maximum element-wise pooling. During each iteration of the training process, the activation function ensures that the unit with the maximum value in the group is activated. In turn, the convolution layer feeds its feature maps to the maxout activation function.
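The combination of multi-scale filtering and maxout can be sketched on a single-channel toy image. The code below is an illustration in plain Python under several assumptions: zero padding so that all three outputs keep the input size, no bias term, one group only, and fixed averaging kernels (`box_kernel`, a hypothetical helper) in place of learned weights.

```python
def conv2d_same(img, kernel):
    """Zero-padded 'same' 2-D correlation of a single-channel image."""
    h, w, k = len(img), len(img[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    ii, jj = i + a - r, j + b - r
                    if 0 <= ii < h and 0 <= jj < w:  # outside pixels count as 0
                        acc += kernel[a][b] * img[ii][jj]
            out[i][j] = acc
    return out


def maxout(feature_maps):
    """Element-wise maximum across K feature maps of equal size."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*feature_maps)]


def box_kernel(size):
    """Hypothetical averaging kernel; a trained network would learn its weights."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]

# One feature map per scale (1x1, 3x3, 5x5), then maxout over the K = 3 maps.
maps = [conv2d_same(image, box_kernel(s)) for s in (1, 3, 5)]
fused = maxout(maps)
```

At the center pixel the 1 × 1 response dominates, while at the border pixels the 3 × 3 response is the largest, so the fused map picks a different scale at different positions.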
The difference is that instead of the single 3 * 3 convolution kernel in the convolution module, multiple 1 * 1, 3 * 3, and 5 * 5 convolution kernels are used, so that the convolution layer can observe the input data at different scales. This helps the different convolution kernels converge to different values, which effectively prevents co-adaptation within the network [27,28]. In Figure 4, k is the scaling parameter, and the three convolutions of unit width are 1 * 1 * 16, 3 * 3 * 16, and 5 * 5 * 16. After the transformation by the scaling parameter k, the network is widened accordingly, and the corresponding three convolution kernels become 1 * 1 * (16 × k), 3 * 3 * (16 × k), and 5 * 5 * (16 × k). For the traditional ResNet, when the network depth reaches a certain degree, the benefit of additional depth becomes less obvious [26]. However, the proposed multi-scale ResNet not only increases the network depth, but also makes the network easier to train. The experimental results show the improvement in performance, as detailed in Section 4.2.

Dropout
Given the effectiveness of dropout [6] for CNNs, we incorporated dropout into the image denoising recognition model [29,30]; dropout is explained below.
We consider a neural network with $L$ hidden layers. The feed-forward operation of a standard neural network can be described as follows (for $l \in \{0, 1, \ldots, L-1\}$ and any hidden unit $i$) and can be seen in Figure 5a:
$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right),$$
where the hidden layers of the network are indexed by $\{1, 2, 3, \ldots, L\}$; $y^{(l)}$ represents the vector of outputs from layer $l$ (where $y^{(0)} = x$ is the input); $b^{(l)}$ and $W^{(l)}$ are the biases and weights of layer $l$, respectively; $z^{(l)}$ represents the vector of inputs into layer $l$; and $f$ is any activation function, such as $f(x) = 1 / (1 + \exp(-x))$.
However, with dropout, the feed-forward operation can be expressed as follows and can be seen in Figure 5b:
$$r_j^{(l)} \sim \mathrm{Bernoulli}(1 - p), \qquad \tilde{y}^{(l)} = r^{(l)} \odot y^{(l)},$$
$$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right),$$
where Bernoulli() is the Bernoulli distribution and $p$ is the probability that a neuron is discarded, ranging in the interval [0, 1], so that the probability of being retained is $1 - p$; $r^{(l)}$ is a vector of independent Bernoulli random variables for any layer $l$, and each random variable has a probability $1 - p$ of being 1. To create the thinned outputs $\tilde{y}^{(l)}$, the vector $r^{(l)}$ is sampled and multiplied element-wise with the outputs of the layer, $y^{(l)}$. The thinned outputs $\tilde{y}^{(l)}$ are then used as input to the next layer, and this process is applied at each layer.
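The masking step of dropout can be sketched in a few lines. The code below is an illustration in plain Python, taking p to be the probability that a unit is dropped (matching the drop ratio used in the experiments), so that each mask entry is 1 with probability 1 − p; the function name `dropout_forward` and the sample values are hypothetical.

```python
import random


def dropout_forward(y, p_drop, rng):
    """Sample a keep-mask r with P(r_j = 1) = 1 - p_drop and return (r, r * y)."""
    r = [1 if rng.random() >= p_drop else 0 for _ in y]
    thinned = [rj * yj for rj, yj in zip(r, y)]
    return r, thinned


rng = random.Random(42)
outputs = [0.5, -1.2, 3.0, 0.7, 2.1, -0.4]

# Each output is kept with probability 0.7 and zeroed with probability 0.3.
mask, thinned = dropout_forward(outputs, p_drop=0.3, rng=rng)

# Over many units, the kept fraction approaches 1 - p_drop.
big_mask, _ = dropout_forward([1.0] * 20000, p_drop=0.3, rng=random.Random(1))
kept_fraction = sum(big_mask) / len(big_mask)
```

The thinned vector is what the next layer receives during training; a fresh mask is drawn for every forward pass, which is why a different subnet is trained each time.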
In the multi-scale learning module, while the input and output neurons remain unchanged, half of the hidden neurons in the network are randomly and temporarily deleted, so that the corresponding parameters are not updated during backpropagation. At training time, dropout is equivalent to training a subnet of the whole network at every step. If there are n nodes in the network, the number of available subnets is $2^n$. When n is large enough, the subnets used in different training steps will not be the same. Finally, the whole network can be regarded as the average of the multiple subnet models. In this way, training avoids overfitting to any single subnet, and the generalization ability of the network is enhanced.
As described above, dropout helps the network retain robustness. Figure 6 shows the test error rates obtained for many different datasets (CIFAR-10, MNIST, etc.) as training progressed. As seen from the two separate clusters of trajectories, the same datasets trained with and without dropout had significantly different test errors, and dropout brought a great improvement across all architectures. For our aerial-image denoising model, we found that incorporating dropout boosted the accuracy of the model and made the training process faster. The experimental results show that the dropout location and drop ratio p affected the network performance, as detailed in Section 4.2.2.

Experimental Results and Setting
The main purpose of the experimental work of this paper was to confirm the performance of our proposed approach in the aerial-image denoising recognition model.

System Environment
The computer configuration used in the experimental environment is shown in Table 1. In our experiment, the model for Gaussian denoising was trained with noise levels σ = 10, σ = 15, and σ = 25 independently. Firstly, we used the images of the Caltech [31] database to train the model. The database is a publicly available aerial database that comprises an aircraft dataset and a vehicle dataset. We mixed 475 images from the vehicle dataset and 350 images from the aircraft dataset to train the model. Finally, 55 images that were not included in the training set were randomly selected from the two datasets and used to test the model.

Experimental Parameters
In the experiment, stochastic gradient descent with momentum was used. At the 50th and 100th epochs, the learning rate was multiplied by 0.1 (that is, reduced by a factor of 10), and the total number of epochs was 150. The experiment ran on two NVIDIA GeForce GTX 1050 Ti GPUs, and each mini-batch had 100 samples per GPU. The other parameters are shown in Table 2.
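The learning-rate schedule can be written as a small step function. The sketch below reads the schedule as a multiplication by 0.1 at epochs 50 and 100 and counts epochs from 1; the base learning rate of 0.1 is a placeholder assumption, since the actual value is given only in Table 2.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(50, 100), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch.

    base_lr is a hypothetical placeholder; epochs are counted from 1.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


TOTAL_EPOCHS = 150
schedule = [learning_rate(e) for e in range(1, TOTAL_EPOCHS + 1)]
```

With these assumptions the rate stays at its base value for epochs 1 to 49, is ten times smaller for epochs 50 to 99, and one hundred times smaller for epochs 100 to 150.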

Selecting Dropout Location
From Section 3.3, we know that dropout can effectively avoid network overfitting and improve the denoising performance. Therefore, it is necessary to discuss the influence of dropout location on the network performance. Figures 7 and 8 show the different dropout locations in the multi-scale residual learning module and the influence of parameter p (drop ratio) on the network image denoising recognition performance, respectively.
From Figure 8, we can see that in Scenario 5, when p was 0.3, the error rate of the test set was the smallest, 3.88%. In contrast, in Scenario 4, when p was 0.3, the error rate of the test set was the largest, 4.32%. Therefore, for the proposed multi-scale ResNet, we adopted Scenario 5, in which the node discarding probability p is 0.3 and a dropout layer is added before the fully connected layer.

Network Depth
Rather than choosing the network depth from experience, we determined the optimal depth experimentally. The multi-scale ResNet structure includes 3n multi-scale residual learning modules with two convolution layers per module, so the entire network has (6n + 2) layers. In addition, each convolution layer has a scaling parameter k. Each layer consists of convolutions at three different scales, 1 * 1, 3 * 3, and 5 * 5, and the network width is 3 * k. To explore the effect of different scaling parameters and depths on network performance, we conducted experiments on the Caltech dataset. The results are shown in Figure 9 and Table 3.
It can be seen from Table 3 that as the parameter k increased, the network depth decreased. However, the model became increasingly complex, with more parameters to train, and the recognition error rate on the test set became steadily lower. In Table 3, the scaling parameter of model 8 was 5, larger than that of model 7, and its number of parameters exceeded 10 M, yet its accuracy declined. The likely reason is that the structure of model 8 was too complex and had too many parameters, which made training more difficult and caused slight overfitting. In addition, Figure 9 shows that the error rate tended to stabilize as training progressed. Therefore, to obtain the best network performance, the scaling parameter was set to 4 and the network depth to about 26.
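The depth and width relations stated above can be checked with a few lines of arithmetic. The following is an illustrative sketch only: the function names are hypothetical, and the code encodes nothing beyond the (6n + 2) depth and 3 * k width relations given in the text.

```python
def network_depth(n: int) -> int:
    """Depth of the multi-scale ResNet: 3n modules, 2 conv layers each, plus 2."""
    modules = 3 * n             # three groups of n learning modules
    return 2 * modules + 2      # equals 6n + 2


def layer_width(k: int) -> int:
    """Network width 3 * k: three scales (1 * 1, 3 * 3, 5 * 5) times k."""
    return 3 * k


def modules_for_depth(depth: int) -> int:
    """Invert depth = 6n + 2; raise if the depth is not of that form."""
    n, rem = divmod(depth - 2, 6)
    if rem != 0 or n < 1:
        raise ValueError(f"depth {depth} is not of the form 6n + 2")
    return n


if __name__ == "__main__":
    print(network_depth(4))        # 26
    print(layer_width(4))          # 12
    print(modules_for_depth(26))   # 4
```

Under these relations, the chosen setting (scaling parameter 4, depth 26) corresponds to n = 4 modules per group and a width of 12.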

To further evaluate the proposed model, we tested its running state under different scaling parameters and network depths, as shown in Table 4. As Table 4 shows, the number of trainable parameters increased with the scaling parameter k. With a scaling parameter of 4, the network could be extended to at most 26 layers, with 66.4 M trainable parameters; adding further layers exhausted the GPU resources. With a scaling parameter of 5, at most 76.4 M parameters could be trained. Moreover, even when the number of parameters reached about 70 M, the proposed multi-scale ResNet did not show serious overfitting.
It can also be seen from Table 4 that the proposed multi-scale ResNet with a scaling parameter of 4 and a network depth of 26 performed best, with an error rate of about 3.883% on the Caltech dataset. To further study the impact of the network depth on network performance, we fixed the scaling parameter to 4 and slightly adjusted the number of learning modules in each group.
The multi-scale ResNet consists of three groups of learning modules, and each group contains n learning modules. To find the ideal number of learning modules per group, we varied the number of modules in each group so that the groups were not all equal. Suppose that the numbers of learning modules in the groups are x, y, and z (x being the number of modules close to the input layer, y being the number of modules close to the output layer, and z being the fixed scaling parameter); the value of z was fixed at 4, and the values of x and y were varied. The experimental results are shown in Figure 10. As can be seen from Figure 10, when x = 5, y = 3, and z = 4, the error rate was 3.693%, which was better than when all variables were equal to 4. This indicates that appropriately increasing the number of modules close to the input layer helps capture the original features of the picture and improves the network performance. In the next part, we use these parameters to evaluate our proposed aerial-image denoising recognition methods.
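One way to read this experiment is sketched below. It assumes that x, y, and z are the per-group module counts (the text also describes z as the fixed scaling parameter, so this reading is an assumption), in which case the depth relation (6n + 2) generalizes to 2(x + y + z) + 2. The helper name is hypothetical.

```python
def depth_from_groups(x: int, y: int, z: int) -> int:
    """Depth when the three groups hold x, y and z modules (2 conv layers each)."""
    return 2 * (x + y + z) + 2


equal = depth_from_groups(4, 4, 4)     # all groups equal, n = 4
unequal = depth_from_groups(5, 3, 4)   # more modules near the input layer
print(equal, unequal)                  # 26 26
```

Under this reading, both configurations have the same depth of 26, so the comparison isolates where the modules are placed rather than how many there are.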

Model Architecture
In this study, as shown in the model architecture in Figure 2, the model was trained by minimizing the cost function in Equation (11), and the architecture had a depth of (6n + 2) layers. An input image of size 32 * 32 * 3 was passed through a 3 * 3 convolution layer with 16 kernels, producing an output of size 32 * 32 * 16; these data were then passed through the 6n convolution layers and an average pooling layer, followed by a dropout operation, and were finally passed through a fully connected layer and a softmax operation. In this model, residual learning was adopted to learn the mapping ℘(y), and batch normalization was incorporated to increase the model's accuracy and speed up training. The proposed multi-scale ResNet thus shows promising advantages for training deep networks on a small training dataset and further improves the network accuracy.
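To make the data flow concrete, the sketch below traces tensor shapes through the pipeline described above. It is not the authors' implementation. It assumes padding that preserves the 32 * 32 spatial size, global average pooling, a constant channel count of 16 through the 6n layers, and a hypothetical number of output classes (num_classes); none of these are specified in the text.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def trace_shapes(n: int, num_classes: int = 10) -> List[Tuple[str, Shape]]:
    """Return (stage, output shape) pairs for a 32 * 32 * 3 input."""
    h, w, c = 32, 32, 3
    stages: List[Tuple[str, Shape]] = [("input", (h, w, c))]

    c = 16  # first layer: 3 * 3 convolution with 16 kernels
    stages.append(("conv 3*3, 16 kernels", (h, w, c)))

    for i in range(6 * n):  # 3n modules, two convolution layers each
        stages.append((f"module conv {i + 1}", (h, w, c)))  # assumed size-preserving

    stages.append(("average pool", (c,)))        # assumed global pooling
    stages.append(("dropout", (c,)))             # dropout leaves the shape unchanged
    stages.append(("fully connected", (num_classes,)))
    stages.append(("softmax", (num_classes,)))
    return stages


def weighted_depth(n: int) -> int:
    """Assumed reading of (6n + 2): first conv + 6n convs + fully connected."""
    return 1 + 6 * n + 1


if __name__ == "__main__":
    for name, shape in trace_shapes(n=4):
        print(f"{name:24s} {shape}")
    print("depth:", weighted_depth(4))
```

Counting only the layers with weights in this trace reproduces the (6n + 2) depth, which is 26 for n = 4.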

Comparison with State-of-the-Art Algorithms
In this section, we use the optimal parameters obtained in Section 4.2 to evaluate the recognition ability of the proposed improved multi-scale residual learning approach. We compared our results with other aerial-image denoising methods from the existing literature, such as PNN [32], BM3D [33], SRCNN [34], WNNM [35], and DnCNN [15], with x = 5, y = 3, and z = 4. The experimental results revealed that our model benefits from multi-scale residual learning, parameter setting, and particular image sizes, as can be seen in Table 5.
Figures 11-16, with different noise parameters, are used for comparison to illustrate our experimental results. We randomly selected some images from the test set, as shown in Figures 11-13, which show the evaluation of the proposed model trained on 825 images, each with an image size of 64 × 64. From the three figures, we can see that the reconstructed images were similar to the original images; Figure 11 shows the effect best, as the reconstructed image was closest to the original image because of the low noise level (σ = 10). A similar conclusion can be drawn by comparing Figures 14-16 (Figure 14 is the most apparent).
We compared our model performance with images of the same image size. Table 5 shows the average results of the proposed model trained with an image size of 64 × 64, a training sample size of 825 images, and dropout; the proposed model improved the reconstructed image quality by about 0.167/0.001 over DnCNN and by about 0.931/0.006 over SRCNN at σ = 10 and was better than the other methods in terms of PSNR/SSIM values at σ = 10, σ = 15, and σ = 25.
Similarly, Table 6 shows the model performance with an image size of 180 × 180, a training sample size of 825 images, and dropout. Again, the proposed model improved the average reconstructed image quality at σ = 10, σ = 15, and σ = 25. However, comparing the results in Tables 5 and 6, we found that the image size affected the model performance; the model trained with an image size of 180 × 180 with dropout performed better than the model trained with an image size of 64 × 64 with dropout. This conclusion is consistent with the above (Figures 11-16). Furthermore, comparing Tables 6 and 7, we also found that incorporating dropout for image recognition (the dropout location is critical) significantly improved the network performance. This conclusion is consistent with Section 4.2.2.
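For reference, PSNR values such as those compared above follow the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below is a minimal pure-Python illustration on a toy 8-bit image corrupted by synthetic AWGN with standard deviation σ. The helper names and the flat toy image are assumptions made for illustration, no clipping to the 8-bit range is applied, and SSIM is omitted because it requires windowed local statistics.

```python
import math
import random
from typing import List

Image = List[List[float]]


def add_awgn(clean: Image, sigma: float, seed: int = 0) -> Image:
    """Return y = x + v, with v drawn i.i.d. from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[px + rng.gauss(0.0, sigma) for px in row] for row in clean]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images of equal size."""
    total = sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def psnr(reference: Image, estimate: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, estimate)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    clean = [[128.0] * 64 for _ in range(64)]   # toy flat 64 x 64 image
    for sigma in (10, 15, 25):
        noisy = add_awgn(clean, sigma, seed=sigma)
        print(sigma, round(psnr(clean, noisy), 2))
```

A denoiser is then scored by computing the PSNR between the clean image and the reconstructed image; higher values indicate a reconstruction closer to the original.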

Summary and Conclusions
In this paper, we propose an efficient deep CNN model for the denoising of aerial images with a small training dataset. Unlike most aerial-image denoising methods, which approximate a latent clean image from an observed noisy image, the proposed model approximates the noise from the observed noisy image. By combining multi-scale residual learning and dropout, and by accounting for the influence of the network depth and the number of learning modules in each group on the network performance, we not only speed up the training process but also improve the denoising performance of the model. More importantly, although deep architectures perform better with a large training dataset, the proposed aerial-image denoising model was trained with a small dataset and achieved competitive results. With a small training dataset, the experimental results showed that the proposed aerial-image denoising model performs better than the existing image denoising methods. In the future, the performance of the model is expected to be improved by stacking more multi-scale competitive modules.

Figure 2. The architecture of the model for aerial-image denoising; n is the number of learning modules (the size of n is discussed in Section 4.2.3), and the module is a multi-scale residual learning module (see Figure 4).

Figure 3. Reconstruction phase of the proposed model.

Figure 5. A comparison of the basic operations between the standard and dropout networks.

Figure 6. Test error for different datasets with and without dropout.

Figure 7. The module structure of different dropout locations.

Figure 8. The influence of parameter p and dropout for the multi-scale residual learning network (ResNet).

Figure 9. The curve for the training times and the error rates for the test set.

Figure 10. The error rate of the test set under different learning module numbers (where z = 4).

Figures 11-16. Figures 11-13 show randomly selected images reconstructed by the proposed model trained on 825 images, each with an image size of 64 × 64. Figures 14-16 show the evaluation of the proposed model trained on 825 images, each with an image size of 180 × 180. Compared with Figure 11, the reconstructed images of Figure 14 are more similar to the original image, which implies that the proposed model benefited from a particular image size. Comparing Figures 12 and 15 and Figures 13 and 16 separately leads to a similar conclusion. The same conclusion is also reached from Figure 17.
Author Contributions: C.C. designed the experiments and wrote the manuscript; Z.X. provided the instructions and helped during the design. Both authors have read and approved the final manuscript.

Funding: This article was funded by the Shanghai Municipal Committee of Science and Technology Project (No. 18030501400) and by the Shanghai University of Engineering Science Innovation Fund for Graduate Students (No. E3-0903-18-01319).

Table 1. The machine configuration used in the experimental environment.

Table 2. The parameters used in the experimental environment.

Table 3. The error rate of the test set under different network structures, where k is the scaling parameter and n is the number of learning modules.

Table 4. The running state of the model under different scaling parameters and network depths, where "-" indicates that GPU resources were exhausted and the model could not be trained (blue font indicates the best).

Table 6. The average results of our aerial-image model, where the multi-scale residual learning network (ResNet) is the aerial-image denoising method proposed in the paper. The corresponding experimental parameters were x = 5, y = 3, scaling parameter z = 4, and a network depth of 26 (blue font indicates the best).