Infrared Image Super Resolution by Combining Compressive Sensing and Deep Learning

Super resolution methods alleviate the high cost and difficulty of applying high resolution infrared image sensors. In this paper we present a novel single image super resolution method for infrared images that combines compressive sensing theory and deep learning. In compressive sensing, low resolution images can be regarded as compressed samples of the high resolution ones. By exploiting sparsity under this theory, higher resolution images can be reconstructed. However, because the level of sparsity differs from image to image, the output contains noise and loses high frequency information. A deep convolutional neural network provides a way to relieve the noise and supplement some of the missing high frequency information. By concatenating the two methods, we produce better results in super resolution tasks for infrared images than SRCNN and ScSR. PSNR and SSIM values are used to quantify the performance. Applying our method to open datasets and actual infrared imaging experiments, we also find that better visual results are preserved.


Introduction
Nowadays high resolution (HR) images, which possess richer scene information and better visual quality than low resolution (LR) ones, are desirable in many circumstances. However, instrumentation limits make HR images expensive and hard to obtain [1]. This problem is much more severe for infrared (IR) image sensors than for visible (VIS) ones. Owing to the long wavelength, low resolution IR images suffer from missing details, including texture, contexture and edge information [2]. Because it avoids the difficulty of manufacturing better optics and sensors, super-resolution (SR) is a common approach that is widely used in areas such as medical imaging [3], remote sensing [4], face recognition [5] and microscopy [6].
SR solutions are grouped into two categories: multi-frame SR (MFSR) and single-image SR (SISR) [7]. For MFSR, a sequence of LR images is captured to compose an HR image using the relative geometric and/or photometric displacements from the target HR image [8]. However, the necessary highly related sequences of images are often not available. In this paper, we focus on single image super resolution (SISR). As it is an inherently ill-posed problem, we have to rely on strong prior information to accomplish the task [9]. Sparsity-based methods and learning-based methods represent two typical ways of utilizing prior information [10].
Images, as 2-D signals, exhibit sparsity in some domain, which enables Compressive Sensing (CS) theory to reconstruct the original HR images from LR ones acquired at a lower sampling rate. CS theory has already proved to be effective and powerful in SR tasks [11]. Many SISR problems have been analyzed under different sparse bases, such as Wavelet [12], Discrete Cosine Transformation (DCT) [13] and Discrete Fourier Transform (DFT) [14]. Recent practical applications reveal that a signal is sparser with respect to an over-complete dictionary than to a basis [15]. Besides, in order to accurately reconstruct the coefficients of the original signal in the sparse domain, a suitable reconstruction method is needed. Different methods perform differently; we choose iteratively reweighted least squares (IRLS) as the reconstruction algorithm because of its high reconstruction performance in our experiments [16]. Its mechanism will be discussed later in this paper.
Apart from sparsity-based methods, learning-based ones also benefit from prior information. As deep learning has recently prospered, many learning-based architectures have been used in SISR, such as VGG [17], ResNet [18] and GAN [19]. SRCNN [20], a 3-layer Convolutional Neural Network (CNN), was the first CNN used for SR tasks. Since then, network structures have become much deeper for better performance. Besides, residual learning networks, when used in SR tasks, have proved to deliver better visual quality and Peak Signal to Noise Ratio (PSNR).
In recent years, many researchers have tried to combine CS and deep learning to produce better solutions to SR tasks. Duan et al. [21] used deep learning to capture image features and applied them to reconstruct HR images with the help of the sparsity in CS. Bora et al. [22] used generative models to replace the sparsity bases in CS and achieved satisfying results.
In this paper, we provide a novel combination architecture. We take advantage of sparsity in CS to recover the high frequency information in HR images. Then we build a deep CNN to improve on the output of IRLS in CS. Residual learning [23] makes it easier for our algorithm to optimize the results by denoising and reconstructing the output image of CS. By concatenating the two methods we achieve better performance than SRCNN [20] and ScSR [24], which rely on a neural network alone and on sparsity alone, respectively. In simulations and actual infrared imaging experiments, we apply our method to IR images and verify its performance both visually and quantitatively.

Super-Resolution with Compressive Sensing Theory
CS theory combines sampling and compression into a non-adaptive linear measurement process [25] at a rate significantly below the Nyquist rate [26]. The classical CS acquisition process can be written as:

$$y = \Phi x = \Phi \Psi s = \Theta s \tag{1}$$

Here $y \in \mathbb{R}^{M}$ is the vector of stacked measurements, $x \in \mathbb{R}^{N}$ ($M < N$) is the original compressible signal, $\Phi$ is the $M \times N$ measurement matrix and $\Theta = \Phi \Psi$, where $\Psi$ is the $N \times N$ basis matrix. The vector $s$ holds the coefficients of $x$ in the $\Psi$ domain. Usually a Gaussian random matrix is used as $\Phi$. In SISR tasks, $y$ is regarded as the low resolution projection of the HR image $x$, and $\Phi$ corresponds to a downsampling matrix [13]. Referring to the binning process of image sensors [27], we assume that one pixel in an LR image equals the average of the corresponding $k \times k$ neighboring pixels in the HR one. Therefore $\Phi$, of dimension $M \times N$ with $N/M = k^2$, should implement this downsampling process [28].
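The binning interpretation of Φ can be made concrete with a small sketch. The helper name `binning_matrix`, the row-major flattening of the image and the toy 4 × 4 input are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def binning_matrix(height, width, k):
    """Build the M x N matrix that averages each k x k block of a
    row-major flattened height x width image (N = height * width,
    M = N / k**2). Hypothetical helper, for illustration only."""
    if height % k or width % k:
        raise ValueError("image size must be divisible by k")
    out_h, out_w = height // k, width // k
    phi = np.zeros((out_h * out_w, height * width))
    for r in range(out_h):
        for c in range(out_w):
            row = r * out_w + c                      # index of the LR pixel
            for dr in range(k):
                for dc in range(k):
                    col = (r * k + dr) * width + (c * k + dc)
                    phi[row, col] = 1.0 / (k * k)    # equal-weight average
    return phi

# Toy check: y = Phi x reproduces plain 2 x 2 block averaging.
hr = np.arange(16, dtype=float).reshape(4, 4)
phi = binning_matrix(4, 4, 2)
lr = (phi @ hr.ravel()).reshape(2, 2)
block_mean = hr.reshape(2, 2, 2, 2).mean(axis=(1, 3))
```

For realistic image sizes a dense Φ becomes very large, so a sparse or patch-wise representation would likely be preferable; the dense form is used here only because it mirrors the matrix notation directly.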
$x$ in the spatial domain can be represented by the vector $s$ in the $\Psi$ domain, which is $K$-sparse (only $K < N$ coefficients in $s$ are non-zero). With a sufficient sampling rate, $s$ can be correctly recovered from Equation (1) by solving the following $\ell_p$-norm optimization problem:

$$\min_{s} \|s\|_p^p \quad \text{subject to} \quad \Theta s = y \tag{2}$$

The wavelet basis has been widely validated as the sparsity basis $\Psi$ [29]. In our algorithm we use the DCT basis instead, because it performs better under numerous experimental conditions. In this paper, we use Peak Signal to Noise Ratio (PSNR) and the structural similarity index (SSIM) to quantify the performance of the SR method. After testing on a widely used set of 400 images [30] of size 180 × 180, we find that on average the DCT basis outperforms the wavelet basis by 14% in PSNR and 26% in SSIM.
Besides the basis, the reconstruction algorithm is also important for our SR tasks. To solve this underdetermined system and find an accurate $x$, many optimization methods have been developed in recent years, such as Orthogonal Matching Pursuit (OMP) [31], Subspace Pursuit [32], Relevance Vector Machine (RVM) [33] and iteratively reweighted least squares (IRLS) [16]. We select IRLS with $p = 1$ because it gives better visual and quantitative results.
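As an illustration of what sparsity in the DCT domain means, the following sketch builds an orthonormal 2-D DCT synthesis matrix Ψ and checks that a smooth test image has only a few non-zero coefficients. The function names, the 8 × 8 image size and the Kronecker-product construction for row-major flattened images are assumptions for this example, not details given in the paper.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal 1-D DCT-II synthesis matrix Psi (n x n), so x = Psi @ s."""
    k = np.arange(n)[:, None]          # frequency index
    i = np.arange(n)[None, :]          # sample index
    analysis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    analysis[0, :] /= np.sqrt(2.0)     # rescale the DC row for orthonormality
    return analysis.T

def dct_basis_2d(height, width):
    """Separable 2-D basis acting on row-major flattened images."""
    return np.kron(dct_basis(height), dct_basis(width))

psi = dct_basis_2d(8, 8)

# Test image: a constant offset plus one horizontal cosine mode.
row = 0.5 + np.cos(np.pi * (2 * np.arange(8) + 1) * 2 / 16.0)
image = np.tile(row, (8, 1))

s = psi.T @ image.ravel()              # Psi is orthonormal, so s = Psi^T x
n_significant = int(np.sum(np.abs(s) > 1e-9))
```

In this toy case only two of the 64 coefficients are non-zero, which is the kind of K-sparse representation that the recovery problem (2) relies on. Natural images are only approximately sparse, so the count would be larger in practice.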
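The IRLS method we use solves (2) with a modified objective function that, at each iteration, approaches $\sum_{i=1}^{N} |s_i|^p$ [27]. Simply put, we substitute the $\ell_p$ objective function in (2) with a weighted $\ell_2$ norm:

$$\min_{s} \sum_{i=1}^{N} w_i s_i^2 \quad \text{subject to} \quad \Theta s = y \tag{3}$$

where

$$w_i = \left( \left( s_i^{(n-1)} \right)^2 + \mu \right)^{p/2 - 1} \tag{4}$$

is the first-order approximation to the $\ell_p$ objective function, with a small regularization parameter $\mu > 0$. $w_i$ changes at each iteration until $\sum_i w_i s_i^2$, with the weights of (4), is sufficiently close to $\|s\|_p^p$ after convergence. Then the solution of (3) is:

$$s^{(n)} = Q_n \Theta^{T} \left( \Theta Q_n \Theta^{T} \right)^{-1} y \tag{5}$$

where $Q_n$ is the diagonal matrix with entries:

$$\frac{1}{w_i} = \left( \left( s_i^{(n-1)} \right)^2 + \mu \right)^{1 - p/2}$$

The convergence criterion for each iteration stage can be written as:

$$\left\| s^{(n)} - s^{(n-1)} \right\|_2 < \frac{\sqrt{\mu}}{100} \tag{6}$$

After (6) is attained, $\mu$ is reduced by a factor of 10, and the iterative procedure is repeated until $\mu < 10^{-8}$ [34].
In conclusion, the HR image $x$ can be described by the sparse vector $s$ in the $\Psi$ domain. The input of the algorithm, the original LR image, is regarded as the compressed measurements. Finally, $x$ is recovered with the reconstruction algorithm. The detailed parameters of this algorithm are listed in Algorithm 1.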
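A minimal NumPy sketch of an IRLS loop of this kind is given below. It assumes the commonly used µ-regularized weights, a minimum-norm least-squares starting point, an initial µ of 1 and a stopping threshold of sqrt(µ)/100 per stage; these choices, the function name `irls` and the random toy problem are illustrative assumptions and not a transcription of the authors' Algorithm 1.

```python
import numpy as np

def irls(theta, y, p=1.0, mu_min=1e-8, max_inner=100):
    """Sketch of regularized IRLS for min ||s||_p^p subject to theta @ s = y."""
    s = np.linalg.lstsq(theta, y, rcond=None)[0]   # minimum l2-norm start
    mu = 1.0                                       # assumed initial regularization
    while mu >= mu_min:
        for _ in range(max_inner):
            q = (s ** 2 + mu) ** (1.0 - p / 2.0)   # diagonal of Q_n, i.e. 1 / w_i
            tq = theta * q                         # theta @ diag(q) via broadcasting
            s_new = q * (theta.T @ np.linalg.solve(tq @ theta.T, y))
            step = np.linalg.norm(s_new - s)
            s = s_new
            if step < np.sqrt(mu) / 100.0:         # per-stage convergence test
                break
        mu /= 10.0                                 # shrink mu and repeat
    return s

# Toy problem: recover a 3-sparse vector of length 50 from 20 Gaussian measurements.
rng = np.random.default_rng(0)
theta_demo = rng.standard_normal((20, 50))
s_true = np.zeros(50)
s_true[[4, 17, 33]] = [1.5, -2.0, 1.0]
y_demo = theta_demo @ s_true
s_hat = irls(theta_demo, y_demo, p=1.0)
```

Each iterate satisfies the constraint Θs = y by construction, so the loop only changes how the energy is distributed among the coefficients. The cost per iteration is dominated by the M × M linear solve, which is consistent with the high execution time the paper reports for the CS stage.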
Step 1: Initialize the size of output image and the formation of sparsity basis.

Image Denoising and Reconstruction with Deep Learning
In practice, it is hard to find an exactly correct x, which represents the HR image in SR tasks. In most cases the algorithm reaches a locally optimal solution, so the output images contain fixed pattern noise, as illustrated in Figure 1. Comparing the output of CS, the bicubic method and the original HR image, we find visually that CS preserves more texture information and exhibits less blur, but contains some fixed pattern noise. After using our CNN, the noise is visually alleviated. Although the PSNR of the CS output is 0.64 dB higher than that of bicubic interpolation, the SSIM of CS is 0.031 lower. As SSIM calculates the covariance of the images, which represents the structural information of the objects in them [35], studies show that it is more vulnerable to fixed pattern noise than the pixel-difference-based measure, PSNR [36]. Therefore a method that protects the high spatial frequency information while wiping out the fixed pattern noise is necessary. After using our CNN, whose structure is discussed below, the PSNR increases to 34.18 dB and the SSIM increases to 0.9719. These values show that our CNN is effective in denoising and reconstruction.
From the results, we believe that the CNN not only deals with the fixed pattern noise, but also helps supplement more high frequency information. As the images change, the level of sparsity changes as well. Some HR images may contain high frequency information that cannot be recovered with a given sparsity basis, which limits the CS method: using CS alone will not recover all the high frequency information. In that case, more effort is needed to supplement the missing information during the SR process. Deep learning, with its powerful image processing ability, has been applied to many tasks such as image denoising, demosaicing [37] and reconstruction [38]. Zhang et al. [29] designed a deep convolutional neural network (CNN) for Gaussian image denoising, called DnCNN. Residual learning and batch normalization greatly benefit its performance. Inspired by DnCNN, we modified its network architecture to accomplish denoising and high frequency information supplementation in our SR tasks.
The most essential part of our CNN model is residual learning. Although the output of CS contains fixed pattern noise, we are not able to describe its formation with a designed rule in order to eliminate it. However, deep learning provides trainable convolutional filters, so the noise of each HR image can be detected and eliminated after training the CNN model. Residual learning enables us to train each layer of the CNN to fit the residual mapping instead of the original image. Formally, we denote the HR output of CS as H(j), and the original HR image, which is the ground truth, as G(j). Here j denotes the index of each image. The residual image R(j) = G(j) − H(j) represents the fixed pattern noise of each image. Research has shown that the residual image is easier for a CNN to optimize [23]. Figure 2 shows the proposed SR architecture during training.
The target of our CNN is to estimate the residual image of every CS output in order to improve performance. The averaged mean square error between the estimated residual image and the true one,

$$L(\Theta) = \frac{1}{2T} \sum_{i=1}^{T} \left\| \hat{R}(\Theta, i) - R(i) \right\|_F^2$$

is the loss function used to learn the trainable parameters $\Theta$ of the CNN. For the $i$-th of the $T$ training images, $\hat{R}(\Theta, i)$ represents the estimated residual image produced by our CNN, while $R(i)$ represents the true residual image used for training. Research shows that network depth is of great importance for better results [23]. Therefore, we modify the CNN into a deeper network with 30 layers. Inspired by DnCNN, our network consists of three types of layers, as shown in Figure 3.
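The residual target and its squared-error loss can be sketched in a few lines of NumPy. The function names, the synthetic data and the 1/(2T) scaling are assumptions for illustration; the paper only states that an averaged mean square error over the training images is used.

```python
import numpy as np

def residual_targets(ground_truth, cs_output):
    """True residual R = G - H for a batch of images (batch axis first)."""
    return ground_truth - cs_output

def residual_loss(predicted, target):
    """Squared Frobenius error summed over the batch, scaled by 1 / (2T).
    The factor 1/2 is an assumed convention, not stated in the text."""
    t = predicted.shape[0]
    return np.sum((predicted - target) ** 2) / (2.0 * t)

rng = np.random.default_rng(0)
g = rng.random((4, 8, 8))                         # stand-in ground-truth HR images
h = g + 0.05 * rng.standard_normal((4, 8, 8))     # stand-in CS outputs with noise
r = residual_targets(g, h)

loss_zero_guess = residual_loss(np.zeros_like(r), r)   # network predicts nothing
loss_perfect = residual_loss(r, r)                     # network predicts R exactly
restored = h + r                                       # CS output plus residual
```

At inference time the estimated residual is added to the CS output, so a perfect estimate would reproduce the ground truth exactly, as `restored` does here.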
In the first layer, we use 64 filters of size 3 × 3 as the convolution kernels to generate 64 feature maps, and rectified linear units (ReLU, max(0, ·)) as the nonlinear activation function to speed up the optimization. The 28 hidden layers share the same form: 64 filters of size 3 × 3 × 64 are combined with batch normalization (BN) [39] to accelerate training. For the last layer, a 3 × 3 × 64 convolution is used to reconstruct the residual image.
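To give a sense of scale for a 30-layer network of this shape, the following arithmetic sketch counts trainable parameters and the receptive field. It assumes a single-channel input, biases on the first and last convolutions, and hidden layers in which batch normalization contributes a scale and a shift per channel in place of a bias; none of these details is stated in the text, so the totals are indicative only.

```python
def conv_params(kernel, in_ch, out_ch, bias=True):
    """Number of weights (and optional biases) in one convolution layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)

depth, kernel, width = 30, 3, 64

first = conv_params(kernel, 1, width)                       # 1 -> 64 channels
hidden_each = conv_params(kernel, width, width, bias=False) + 2 * width  # conv + BN
hidden = (depth - 2) * hidden_each                          # 28 hidden layers
last = conv_params(kernel, width, 1)                        # 64 -> 1 channel

total = first + hidden + last
receptive_field = 1 + depth * (kernel - 1)                  # stacked 3 x 3 convs

print(total, receptive_field)
```

Under these assumptions the network has roughly one million parameters and each output pixel depends on a 61 × 61 input neighborhood, which helps explain why depth and mini-batch size together press on GPU memory.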

After simulation experiments, we find that the Adaptive Moment Estimation (Adam) optimization algorithm [40] outperforms Stochastic Gradient Descent (SGD) [41]. Therefore, we choose Adam as the optimization method for our CNN. Adam is a first-order gradient-based optimization algorithm based on adaptive estimates of lower-order moments of the gradients. The pseudo-code is shown in Algorithm 2.

Algorithm 2. Adam Method for Optimization
Parameters: $\alpha$ is the stepsize; $\beta_1, \beta_2 \in [0, 1)$ are the exponential decay rates for the moment estimates; $L(\Theta)$ is the loss function with parameter $\Theta$.
Step 2: Initialize the vectors. $m_0 \leftarrow 0$ is the initial first moment vector. $v_0 \leftarrow 0$ is the initial second moment vector. $t \leftarrow 0$ is the initial timestep.
Step 3: Do the inner loop:
$t \leftarrow t + 1$
$g_t \leftarrow \nabla_{\Theta} L(\Theta_{t-1})$ (gradient of the loss)
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$ (first moment estimate)
$v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ (second moment estimate)
$\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (bias-corrected first moment)
$\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (bias-corrected second moment)
$\Theta_t \leftarrow \Theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters, where $\epsilon$ prevents the denominator from being zero)
If $\Theta_t$ has converged, stop; otherwise repeat Step 3.
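The Adam update can be sketched in NumPy on a toy quadratic loss. The function name `adam`, the fixed step count used in place of a convergence test, the default hyperparameters and the quadratic objective are assumptions for illustration; they do not reproduce the training setup of the paper.

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Plain Adam loop; `grad` maps parameters to the gradient of the loss."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)                      # first moment vector
    v = np.zeros_like(theta)                      # second moment vector
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g           # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g       # biased second moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss ||theta - target||^2 with gradient 2 * (theta - target).
target = np.array([1.0, -2.0])
theta_star = adam(lambda th: 2.0 * (th - target), np.zeros(2), alpha=0.05, steps=2000)
```

Because each coordinate is scaled by its own second-moment estimate, the early steps have nearly constant magnitude regardless of the gradient scale, which is one reason Adam is often less sensitive to the initial learning rate than plain SGD.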
Most parameters of Adam are set as in [40]; the mini-batch size is 128 and the learning rate decays exponentially from 1 × 10⁻¹ to 1 × 10⁻⁴ over 50 epochs of training.
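One way to realize such an exponential decay is a geometric sequence with one learning rate per epoch. The sketch below assumes that the first epoch uses 1 × 10⁻¹ and the last epoch uses 1 × 10⁻⁴; the exact schedule used in the experiments is not specified beyond its endpoints and length.

```python
import numpy as np

def exponential_schedule(lr_start=1e-1, lr_end=1e-4, epochs=50):
    """Per-epoch learning rates that shrink by a constant factor."""
    return np.geomspace(lr_start, lr_end, num=epochs)

rates = exponential_schedule()
ratio = rates[1] / rates[0]        # constant multiplicative decay per epoch
```

With these endpoints the rate is multiplied by 10^(−3/49) at every epoch, so it drops by about 13% per epoch.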
We use the MatConvNet package in Matlab 2017a to train our CNN, on an Intel® Core™ i5-4670K CPU operating at 3.4 GHz and an Nvidia 1080Ti GPU. Experiments show that the deeper the network, the better the PSNR performance, as shown in Figure 4. However, a network of 30 layers with a mini-batch size of 128 places a great burden on the GPU memory; this configuration reaches the limit of the available GPU memory.

The Whole Super-Resolution Algorithm Architecture
In Figure 5 we show the whole architecture used by the proposed method to accomplish the SR target. After the training process, the CNN is used to eliminate the fixed pattern noise in the output of the CS SR method and to supplement it with some high spatial frequency information.

Simulation Results
Before applying our method to real scenes captured by infrared sensors, we test it on some open datasets and compare it with SRCNN [20] and ScSR [24], which rely on a neural network alone and on sparsity alone, respectively. Considering that there are not enough open image datasets for training at infrared wavelengths, we choose a widely used set of 400 VIS images [29] of size 180 × 180 as the training dataset. The experimental results show that the model trained on the VIS dataset works well on IR images. A larger training dataset is preferable, but increases the training time. After testing we find that 400 images are enough to reach high performance with an acceptable training time: about 10 h of training is needed for our CNN. The model trained on VIS images is used for super resolution of both VIS and IR images.

We apply our method to six infrared images collected from the OSU Thermal Pedestrian Database, the OSU Color and Thermal Database and the Terravic Motion Infrared Database of the OTCBVS dataset collection [42]. In addition, we apply our method to six widely used VIS images to demonstrate its robustness. Figure 6 gives an overview of the 12 images that form the test set. It is worth highlighting that the training set must not share images with the test set, so that the evaluation is not biased. Therefore the 12 IR and VIS images are not included in the 400 training images.
In this paper the upscaling factors are set to 2 and 3. We down-sample each HR image into two LR images by averaging 2 × 2 or 3 × 3 neighboring pixels, in order to simulate two kinds of LR images. We compare the SR images with the original HR ones by quantifying the performance in PSNR and SSIM; the results are shown in Tables 1 and 2. The execution time is also provided in the tables to indicate the complexity of our algorithm.
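The simulation protocol (block-average down-sampling followed by a PSNR comparison) can be sketched as follows. The helper names, the 8-bit peak value of 255 and the synthetic test arrays are assumptions for illustration; SSIM is omitted because its windowed statistics would make the sketch considerably longer.

```python
import numpy as np

def bin_average(image, k):
    """Down-sample by averaging non-overlapping k x k blocks."""
    h, w = image.shape
    if h % k or w % k:
        raise ValueError("image size must be divisible by k")
    return image.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

lr_demo = bin_average(np.arange(16, dtype=float).reshape(4, 4), 2)

reference = np.full((4, 4), 100.0)
estimate = reference + 5.0                 # constant error of 5 grey levels
value = psnr(reference, estimate)
```

A constant error of 5 grey levels gives a mean squared error of 25, so the PSNR here is 10·log10(255² / 25) dB.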
Before discussing the SR reconstruction performance, we note the execution time of the three methods. SRCNN takes the least time, while ScSR and our algorithm need far more. Most of the time of our algorithm is spent on solving the optimization problem (2) of the compressive sensing stage, because the time complexity of IRLS is high, despite its better reconstruction accuracy. Another notable fact is that our algorithm needs far less time for an upscaling factor of 3 than for 2, unlike ScSR and SRCNN. The reason is that LR images produced by merging 3 × 3 neighboring pixels of the HR ones have lower spatial resolution and contain less information than those produced by merging 2 × 2 pixels, which means fewer constraints in (3) and fewer dimensions of the vector s in the Ψ domain in (3). IRLS then reaches a nearly accurate HR estimate after fewer iterations. Therefore, we may predict that for even larger upscaling factors our algorithm may compare even more favorably in execution time.
We find that the proposed method has a great advantage in PSNR, while performing slightly better than SRCNN and ScSR in SSIM. We choose image 6, the infrared surveillance image, from the test set as an example to show the performance visually. Figure 7 illustrates the visual comparison of the three methods. The original HR image has 360 × 240 pixels. After down-sampling, two LR images are produced, of 180 × 120 pixels and 120 × 80 pixels. We produce the SR images with SRCNN, ScSR and our method. The zoomed HR images are placed on the right.
The whole-image comparison gives an overall impression of the different methods: the texture in our results appears clearer. Moreover, in our results, the surroundings of the objects show less distraction and less noise. From the zoomed images, we find that the edges in the output of our method are more distinct. In detail, the contours of the zebra crossing in our result have higher fidelity and higher contrast than those of SRCNN and ScSR. We believe this advantage may help considerably in subsequent image recognition tasks.

Imaging Experiments
In this section, we apply our method to an infrared image sensor to verify its portability and generality. As shown in Figure 8, we use the MARS-VLW-RM4 from the Sofradir Company (Palaiseau, France) as the infrared image sensor, whose original resolution is 320 × 256. Its sensitivity to infrared radiation in the Very Long-Wave band (8-12 μm) ensures its applicability for military and civilian surveillance purposes. However, due to the high cost of manufacturing, it is difficult to increase the resolution. Using CS theory and deep learning, we are able to produce higher resolution infrared images without changing the original sensor.
The parameters and trained models are the same as in the simulation section. Because ground truth is lacking for HR infrared images, the performance is judged visually in this section. With upscaling factors of 2 and 3, we produce HR images of 640 × 512 and 960 × 768 resolution. The results are shown in Figure 9.
Visual comparison between the LR and HR images and among the 3 methods demonstrates the advantage of our method. In the HR images, details such as textures and contours are richer, and the mosaic effects caused by the LR image sensor are relieved. Therefore, higher resolution infrared images that surpass the original image sensor's resolution are available by using our method. Moreover, compared with the zoomed images of ScSR, our results contain less blur and sharper features. As for SRCNN, the reconstruction noise on the windowsill in its zoomed images shows its inferiority to our method.

Figure 9. Imaging results comparison of the SR output. (c) SR images using SRCNN with upscaling factors of 2 and 3; (d) SR images using the proposed method with upscaling factors of 2 and 3.

Conclusions
In this paper we present a novel super resolution method that combines compressive sensing theory and deep learning. Our method consists of two parts. The first part uses the spatial sparsity exploited by CS theory to reconstruct an HR image that contains higher frequency information. The second part uses the trained network to remove the fixed pattern noise introduced in the first part and to supplement some additional high frequency information learnt from the training set. Its high performance helps us to acquire higher resolution infrared images without suffering from the high cost and difficulty of applying large infrared sensors. The performance has been demonstrated visually and quantitatively in the simulation tasks. Our method achieves higher PSNR and SSIM values than SRCNN and ScSR on both visible and infrared datasets. We apply our method to a Very-Long-Wave band infrared sensor to verify its portability and generality. With a low resolution infrared sensor, we are able to produce higher resolution images.
As our work only analyzes monochrome images, we expect that more studies will focus on the super-resolution of spectral images.

Conflicts of Interest:
The authors declare no conflict of interest.