EnGe-CSNet: A Trainable Image Compressed Sensing Model Based on Variational Encoder and Generative Networks

The present study investigates image reconstruction at high compression rates. Compressed sensing theory proves that an appropriate algorithm can reconstruct natural images from a few measurements, since such images are sparse in several transform domains (e.g., the Discrete Cosine Transform and the Wavelet Transform). To enhance the quality of reconstructed images in specific applications, this paper builds a trainable deep compressed sensing model, termed EnGe-CSNet, by combining Deep Convolutional Generative Adversarial Networks and a Variational Autoencoder. Given the significant structural similarity among natural images of the same type collected by image sensors, deep convolutional networks are pre-trained on an image set to learn the low dimensional manifold of high dimensional images. The generative network then serves as prior information for reconstructing images from compressed measurements. Experimental results show that the proposed model outperforms competitive algorithms at high compression rates. Furthermore, several reconstructed samples of noisy images indicate that the model is robust to pattern noise. The present study helps facilitate the application of image compressed sensing.


Introduction
Compressed Sensing (CS) [1] integrates the sampling and compression steps of information acquisition and significantly reduces the sampling rate of the measurement system. The typical CS problem is to reconstruct a sparse signal x ∈ R^n from measurements

y = Φx, (1)

where y ∈ R^m and Φ ∈ R^{m×n} (m ≪ n) is a general Gaussian random matrix with far fewer rows than columns, termed the measurement matrix. x can be any generalized sparse signal (e.g., sound, images, or sensor data with spatial correlation). The compressed measurement y is transmitted by an energy-limited system (e.g., a Wireless Sensor Network); it carries nearly the same information as x but has a much smaller size. Equation (1) is a system of under-determined equations, and a unique solution cannot be found unless x has special structure. Fortunately, most natural images are sparse under some transformation, for example, the Discrete Cosine Transform (DCT) [2] or the Wavelet Transform (WT) [3]. For this reason, the required signal may be sought as the sparsest solution of this system.
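The sampling step of Equation (1) can be sketched in a few lines of NumPy; the dimensions n, m, and the sparsity level below are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 256, 64, 5          # signal length, number of measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)  # k-sparse signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian measurement matrix
y = Phi @ x                                      # compressed measurements, Eq. (1)
```

The measurement vector y has far fewer entries than x, which is what makes the system under-determined.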
Seeking the sparsest solution to an under-determined system of equations is a Non-deterministic Polynomial hard (NP-hard) problem, and exact algorithms are incapable of solving NP-hard problems efficiently. As indicated in existing works, if the matrix Φ satisfies certain conditions, for example, the Restricted Isometry Property (RIP) [4,5] or the related Restricted Eigenvalue Condition (REC) [6], the unique solution can be identified by convex optimization. Conventional reconstruction algorithms include greedy algorithms [7], convex relaxation [8], and the Bayesian framework [9]. However, the compression ratio and image quality achieved by these methods on image recovery remain unsatisfactory. Deep Convolutional Generative Adversarial Networks (DCGAN) [10] and the Variational Autoencoder (VAE) [11] exhibit prominent performance on low dimensional representation of images. As indicated by Bora's work [12], applying deep generative networks to image reconstruction as prior information has a positive effect.
As inspired by Larsen's work [13], the present study proposes a deep CS model termed EnGe-CSNet to improve the quality of reconstructed images in compressed sensing applications. EnGe-CSNet integrates a variational encoder and a generative network. Owing to the high structural similarity between images in such applications (e.g., crop monitoring and face detection), EnGe-CSNet is pre-trained on the image set to learn general features. The learned features act as prior information when the algorithm reconstructs images from compressed measurements. Specifically, EnGe-CSNet is pre-trained on Arabidopsis thaliana seedling images [14] or the Large-scale CelebFaces Attributes (CelebA) dataset [15] to determine a mapping G between a low dimensional representation z ∈ R^k (k ≤ m) and an image x ∈ R^n. When EnGe-CSNet reconstructs unknown images, the image x̂ generated by G is driven to approach the original image x by optimizing z. Compared with existing approaches, the proposed model conducts more effective reconstruction and exhibits a higher ability to resist noise.
The major contributions of the present study are summarized below:
• The present study builds a novel convolutional generative network, termed EnGe-CSNet, that is well suited to compressed sensing applications. EnGe-CSNet extracts the general features of target images in specific applications more effectively by integrating the advantages of VAE and DCGAN.
• The present study designs a novel deep CS framework to raise the compression rate and improve the reconstruction quality in CS applications (e.g., crop monitoring and face detection). The proposed model employs pre-trained generative networks as prior information to fully exploit the structural similarity of images collected by sensors.
• The study verifies that image reconstruction algorithms based on generative networks exhibit a strong anti-noise ability.
The present study is organized as follows: In Section 1, the research background, significance, contributions, article structure, and the fundamental idea of the proposed model are introduced. In Section 2, the relevant works on CS based on neural networks are outlined. In Section 3, the architecture and specific design of the proposed algorithm are presented. In Section 4, the experiment is elucidated, and the proposed model is compared with existing methods. In Section 5, the study is summarized, and the prospects for subsequent works are proposed.

Compressed Sensing
Compressed Sensing theory aims to overcome the limitation of the Nyquist theorem and reconstruct the original signal from a few measurements. This theory requires the original signal x to be sparse, that is, only a few elements x_i of vector x are nonzero. The reconstruction problem is expressed as

min ||x||_0  s.t.  y = Φx, (2)

where the L0-norm ||x||_0 counts the nonzero elements x_i of vector x. Candès [5] proved that sparse signals can be reconstructed with high probability if the measurement matrix Φ satisfies the RIP:

(1 − δ)||x||_2^2 ≤ ||Φx||_2^2 ≤ (1 + δ)||x||_2^2, (3)

where δ > 0 is a small constant and ||·||_2 denotes the L2-norm.
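As a quick sanity check on the RIP intuition, the near-isometry of a properly scaled Gaussian matrix on sparse vectors can be probed empirically; the sizes and trial count below are illustrative values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 512, 128, 10
Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # scaled so E||Phi x||^2 = ||x||^2

# Empirically probe the RIP-style bounds on random k-sparse vectors.
ratios = []
for _ in range(200):
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    ratios.append(np.sum((Phi @ x) ** 2) / np.sum(x ** 2))

delta = max(1 - min(ratios), max(ratios) - 1)   # empirical isometry constant
```

The observed ratios concentrate around 1, illustrating why random Gaussian matrices satisfy the RIP with high probability when m is large enough relative to the sparsity.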
RIP ensures that the measurements of two distinct sparse vectors remain distinguishable. Most natural signals are not sparse in their original form, but they are sparse under some transformation basis. Moreover, minimizing the L0-norm is NP-hard. Thus, in practice, Equation (2) is transformed into an L1-norm minimization problem

min ||s||_1  s.t.  y = ΦΨs, (5)

where Ψ denotes the sparse basis and s the sparse representation coefficient, with x = Ψs. Once s is obtained, the original signal x can be reconstructed, as Ψ is known. Equation (5) has been studied extensively, and its solution can be obtained quickly and accurately by numerous algorithms [16]. Besides, some previous works have designed useful sensors for image CS. Zhang [17] presented a low power all-CMOS implementation of temporal compressive sensing with pixel-wise coded exposure, which can reconstruct 100 fps videos from coded images sampled at 5 fps. Another important application of image CS is to mine effective information from compressed measurements directly. Kwan [18] proposed a real-time framework for processing compressive measurements without image reconstruction. That study adopted pixel-wise coded exposure (PCE) to condense multiple frames into a single frame; the resulting real-time system is applied to object detection and classification with the help of YOLO.
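A minimal sketch of one of the many solvers for the L1-relaxed problem is the iterative soft-thresholding algorithm (ISTA), shown here on a synthetic sparse signal with Ψ = I for simplicity; the problem sizes and regularization weight are illustrative assumptions, not values from the paper:

```python
import numpy as np

def ista(Phi, y, lam=0.01, steps=1500):
    """Iterative soft-thresholding for min 0.5*||Phi s - y||^2 + lam*||s||_1."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    s = np.zeros(Phi.shape[1])
    for _ in range(steps):
        z = s - Phi.T @ (Phi @ s - y) / L    # gradient step on the data term
        s = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return s

rng = np.random.default_rng(1)
n, m, k = 128, 48, 4
s_true = np.zeros(n)
s_true[rng.choice(n, k, replace=False)] = 1.0
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
s_hat = ista(Phi, Phi @ s_true)              # recover s from noiseless measurements
```

With enough measurements relative to the sparsity, the recovered s_hat closely matches the true sparse coefficients.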

Image Reconstruction Based on Neural Networks
The conventional CS reconstruction algorithms require the signal x to be k-sparse in a known basis, and these algorithms are effective for simple signal reconstruction. However, a suitable transform basis Ψ for a particular kind of image is hard to find. Bora [12] proposed to solve the CS problem using generative models (dubbed CSGM). A more general Set Restricted Eigenvalue Condition (S-REC) is proposed in [12]. Let X ⊆ R^n denote the set of all possible images of interest. For parameters γ > 0 and δ ≥ 0, Φ satisfies the S-REC if, for all x_1, x_2 ∈ X,

||Φ(x_1 − x_2)||_2 ≥ γ||x_1 − x_2||_2 − δ. (6)

The sparsity of images essentially originates from the redundancy of the information they carry: the colors and outlines in a meaningful image are regular, and two photos of one type are similar (e.g., two different face images contain numerous similar features). Assisted by convolutional neural networks, this study directly excavates the structural similarity of images instead of finding a sparse basis. Unlike conventional methods, which reconstruct the sparse representation s of image x, compressed sensing based on generative networks aims to find the low dimensional representation z. The present study builds the mapping G from the low dimensional latent space to the image space in advance for a particular type of image in a specific application. The original image x is restored after the low dimensional representation z is determined. Since the dimension of the hidden vector z is lower than that of the measurements y, the goal of CS changes to solving a system of over-determined equations instead of under-determined equations. The present study solves this problem with the least squares method. The objective of the reconstruction model is

min_z ||ΦG(z) − y||_2^2, (7)

which can be solved by gradient descent methods. After the optimal ẑ is found through several iterations, the reconstructed image is generated by x̂ = G(ẑ). The optimizer adopted here to solve Equation (7) is Adam [23].
It is helpful to add penalty terms when solving practical problems. As mentioned in [12], ||z||_2^2 is an appropriate regularizer since both VAE and GAN typically impose an isotropic Gaussian prior on z. The penalty ||ΦG(z) − y||_1 can also improve the performance because it tends to make the error vector e = ΦG(z) − y more sparse (i.e., more elements of e equal zero). Thus, the objective function employed for minimization is expressed as

min_z ||ΦG(z) − y||_2^2 + α||z||_2^2 + β||ΦG(z) − y||_1, (8)

where α weighs the importance of the prior ||z||_2^2 and β denotes the ratio of the L1-norm penalty to the measurement error ||ΦG(z) − y||_2^2. Since gradient descent is a greedy algorithm, it easily falls into a locally optimal solution. This study attempts to find the global optimum through multiple restarts. Algorithm 1 gives the pseudo code of the proposed model. The maximum number of restarts is set to r_max, and a separate search is conducted after each restart. First, z is initialized to a random point in the space R^k. Next, i_max iterations are taken to optimize z: in each iteration, the loss is calculated by Equation (8) and z is updated by Adam with learning rate lr. Then the squared Euclidean distance d = ||ΦG(z) − y||_2^2 is employed to estimate the quality of the reconstructed image. If d falls below a threshold t, the correct image is considered found and the algorithm terminates; otherwise, z is reinitialized randomly and the next search begins. Figure 1 illustrates the process of image acquisition in crop monitoring based on deep CS. The image x of a plant seedling is compressively sampled by a wireless sensor node. The dimension of the measurements y is significantly lower than that of the original image, so the amount of data transmitted decreases noticeably. Subsequently, the small number of measurements is sent to the base station via energy-limited wireless networks. When images are reconstructed, the pre-trained generative network G from EnGe-CSNet acts as the prior.
The reconstructed image x̂ is obtained after several iterations.

Algorithm 1 Proposed Approach
Input: Φ, y, α, β, threshold t, learning rate lr, maximum restarts r_max, iterations i_max
Output: reconstructed image x̂
for r = 0 to r_max do
    initialize z randomly in R^k
    for i = 0 to i_max do
        compute the loss by Equation (8)
        update z by Adam with learning rate lr
    end for
    d ← ||ΦG(z) − y||_2^2
    if d < t then break
end for
return x̂ = G(z)
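A toy NumPy version of the random-restart search can illustrate the procedure; the linear "generator" G(z) = Wz stands in for the trained deep network, plain gradient descent replaces Adam, and all dimensions and hyper-parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 64, 16, 8
W = rng.standard_normal((n, k)) / np.sqrt(k)   # toy linear "generator" G(z) = W z
G = lambda z: W @ z
Phi = rng.standard_normal((m, n)) / np.sqrt(m)

z_true = rng.standard_normal(k)
y = Phi @ G(z_true)                            # measurements of an in-range image

alpha, beta, lr, i_max, r_max, t = 0.1, 0.1, 0.01, 1000, 5, 0.05

def loss_grad(z):
    e = Phi @ G(z) - y
    loss = e @ e + alpha * z @ z + beta * np.abs(e).sum()      # Eq. (8)
    grad = W.T @ Phi.T @ (2 * e + beta * np.sign(e)) + 2 * alpha * z
    return loss, grad

best_z, best_d = None, np.inf
for r in range(r_max):                         # up to r_max random restarts
    z = rng.standard_normal(k)                 # random point in R^k
    for _ in range(i_max):
        _, g = loss_grad(z)
        z -= lr * g                            # gradient step (Adam in the paper)
    d = np.sum((Phi @ G(z) - y) ** 2)          # measurement error
    if d < best_d:
        best_z, best_d = z, d
    if d < t:                                  # good enough: stop searching
        break

x_hat = G(best_z)                              # reconstructed "image"
```

The measurement error of the best restart drops far below its initial value, mirroring the stopping criterion d < t in Algorithm 1.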

Generative Adversarial Networks with Variational Encoder
The framework of the proposed EnGe-CSNet is presented in Figure 2. Z ⊆ R^k denotes the set of all possible low dimensional latent vectors, p(X) the distribution of images, and p(Z) the distribution of latent vectors. As mentioned in [11], the Generator, called the Decoder in VAE, is trained to learn a model p(X|Z) that maximizes p(X); in DCGAN, p(Z) is taken to be the standard Gaussian prior. In deep CS, more stress is placed on the vectors that are more likely to generate natural images, so the Encoder is trained to learn a posterior distribution q(Z|X). The tendency of VAE to produce blurred images is countered by the Discriminator (D), which comprises three convolutional layers and two fully connected layers and attempts to distinguish real images from generated ones. When the Discriminator is trained, the target score of a real image x is 1, and that of an image x_η generated from a vector η sampled from the standard Gaussian distribution is 0. As inspired by [10], cross-entropy acts as the loss function of the Discriminator:

L_D = −[log D(x) + log(1 − D(x_η))]. (11)

The Generator (G) comprises a fully connected layer and four deconvolutional layers and is updated twice in each iteration. In the DCGAN process, the image x_η is generated from a vector η sampled from the standard Gaussian distribution to fool the Discriminator; when the Generator is trained, the target score of x_η is 1. Thus, Equation (12) is the loss function of the Generator in the DCGAN process:

L_G^GAN = −log D(x_η). (12)

In the VAE process, G generates the image x_z from a vector z sampled from q(Z|X), aiming to minimize the Euclidean distance between x_z and x. Thus, Equation (13) is the loss function of the Generator in the VAE process:

L_G^VAE = ||x_z − x||_2^2. (13)

The Encoder (E) comprises three convolutional layers and a fully connected layer. It outputs two vectors, a mean vector µ and a variance vector σ, and z is randomly sampled from the normal distribution N(µ, σ²).
As derived in [24], the latent loss of VAE is

L_latent = (1/2) Σ_i (µ_i² + σ_i² − log σ_i² − 1), (14)

where µ_i denotes the ith element of vector µ and σ_i the ith element of vector σ. The Encoder aims to find a latent vector z such that G(z) approaches x and fools the Discriminator. Thus, Equation (15) is the loss function of the Encoder:

L_E = L_latent + ||x_z − x||_2^2 − log D(x_z). (15)

The regularizer ||z||_2^2 is introduced because a Gaussian representation is selected for the latent prior p(z) and the approximate posterior q(z|x).
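The training losses described above can be written compactly; the scalar discriminator scores and toy pixel vectors below are purely illustrative stand-ins for network outputs:

```python
import numpy as np

def bce(score, target):
    """Binary cross-entropy for one discriminator score in (0, 1)."""
    return -(target * np.log(score) + (1 - target) * np.log(1 - score))

def latent_loss(mu, sigma):
    """KL divergence between N(mu, sigma^2) and the standard Gaussian prior, Eq. (14)."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

# Discriminator: real image scored toward 1, generated image toward 0 (Eq. 11).
L_D = bce(0.9, 1.0) + bce(0.1, 0.0)
# Generator, GAN step: the generated image should be scored toward 1 (Eq. 12).
L_G_gan = bce(0.1, 1.0)
# Generator, VAE step: pixel-wise reconstruction error (Eq. 13).
x = np.array([0.2, 0.4])
x_z = np.array([0.25, 0.35])
L_G_vae = np.sum((x_z - x) ** 2)
```

Note that the latent loss vanishes exactly when the encoder outputs the standard Gaussian (µ = 0, σ = 1), which is what pulls q(z|x) toward the prior.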

Dataset and Training Details
The performance of the proposed model is tested on plant seedling and face images. The Aberystwyth leaf evaluation dataset contains four sets of 20 Arabidopsis thaliana plants grown in trays; images of each tray are taken in a 15-min time-lapse sequence. Interested readers can obtain this dataset from the following link: https://zenodo.org/record/168158#, accessed on 18 January 2021. CelebA is a large-scale face dataset with over 200,000 images. It can be obtained from this link: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, accessed on 25 January 2021. For both datasets, we randomly sample 19,200 images for training and 1000 for evaluation. Each sample is cut from a large-scale image and scaled to an RGB image of size 64 × 64, giving n = 64 × 64 × 3 inputs, and each input value is rescaled accordingly. The Encoder of EnGe-CSNet consists of three latent convolutional layers and one fully connected output layer, with a kernel size of 5 × 5 and strides of 2; the latent layers use ReLU as the activation function. The dimension of the latent vector z is set to k = 100. The Generator comprises one fully connected layer and four transposed convolutional layers, and all of its layers adopt the hyperbolic tangent activation. The Discriminator consists of three convolutional layers and two fully connected layers; all layers except the output employ LeakyReLU activation. Adam is employed to update all parameters of EnGe-CSNet with batch size 32 and learning rate 0.0001.
The optimizer of the reconstruction model is Adam with learning rate lr = 0.1. This study searches up to r_max = 5 times for the optimal low dimensional representation of each image. After each search of i_max = 500 iterations, the search is terminated if d < 50, as the optimal solution is then considered found.
The experiments' hardware environment includes an Intel(R) Xeon(R) W-2145 @3.70 GHz CPU, 64.0 GB of DDR4 memory, and an NVIDIA Quadro RTX4000 GPU. The software environment is the Windows 10 64-bit operating system, with Python as the programming language and TensorFlow 2.0 as the deep learning library.

Experimental Setup
For baselines, the proposed model is compared with Lasso [4], TVAL3 [25], NLRCS [26], DAMP [27], GAPTV [28], and CSGM [12]. The first five are optimization-based methods, whose implementation codes are downloaded from the authors' websites. Lasso is applied to reconstruct images in the WT domain because we found it performs better there than in the DCT domain. To reconstruct RGB images, we use CBM3D as the denoiser of DAMP. RGB versions of TVAL3, NLRCS, and GAPTV do not currently exist, so we reconstruct the three color channels independently. Other parameters of these methods, including the number of iterations, are set to the default values suggested by the authors without any changes. CSGM is a reconstruction method based on generative models; the architecture of its generator is identical to that of EnGe-CSNet. Following [12], we optimize CSGM using Adam with a learning rate of 0.1, perform two random restarts with 500 update steps per restart, and pick the reconstruction with the best measurement error.
The performance of the mentioned methods is tested at different compression rates (cr = n/m, i.e., the ratio of the dimension of the images x to that of the measurements y). The regularization parameters in Equation (8) are set to α = 0.1 and β = 0.1 in the experiments, since these values were empirically found to give the best performance; the parameters' influence is discussed in Section 4.3. For all methods, the measurement matrix Φ is a random Gaussian matrix with each entry sampled i.i.d. from N(0, 1).
The Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are adopted for quantitative comparison. The PSNR is calculated by Equation (16):

PSNR = 10 log_10(MAX² / MSE), (16)

where MAX is the maximum possible pixel value and MSE is the mean squared error between x and x̂. The SSIM is calculated as

SSIM = [(2µ_x µ_x̂ + c_1)(2σ_xx̂ + c_2)] / [(µ_x² + µ_x̂² + c_1)(σ_x² + σ_x̂² + c_2)],

where µ_x is the mean of x, µ_x̂ is the mean of x̂, σ_x² is the variance of x, σ_x̂² is the variance of x̂, σ_xx̂ is the covariance between x and x̂, and c_1 and c_2 are constants that avoid division by zero.
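For reference, the two metrics can be computed directly; this is a single-window (global) SSIM sketch, whereas standard implementations average SSIM over local sliding windows:

```python
import numpy as np

def psnr(x, x_hat, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB, Eq. (16)."""
    mse = np.mean((x.astype(float) - x_hat.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, x_hat, max_val=255.0):
    """Global (single-window) SSIM; c1, c2 use the conventional constants."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x, var_y = x.var(), x_hat.var()
    cov = ((x - mu_x) * (x_hat - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

For a uniform image shifted by one gray level, the PSNR is 10·log10(255²) ≈ 48.13 dB, and SSIM of an image with itself is 1.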

Results and Discussion
Some visual examples reconstructed by the various methods at the compression rate cr = 20 are illustrated in Figures 3 and 4: Figure 3 gives four plant seedling images and Figure 4 shows four face images. Lasso's reconstructed results are illegible. The results of TVAL3 are composed of many small color blocks, which makes the images unnatural. The reconstructed images of NLRCS and DAMP are excessively smooth and lose considerable useful information. The face reconstruction results of GAPTV contain numerous jagged edges. CSGM is capable of reconstructing the outline of plants with high probability, but many details are lost. By comparison, the images reconstructed by the proposed model are much clearer than the others at cr = 20: the edges of objects are easy to distinguish, which is the more valuable information in applications. In terms of visual effect, the present study's approach achieves better results. Quantitatively, on the plant seedling dataset at a high compression rate, the best baseline's PSNR&SSIM drop to 26.68&0.6780, whereas our method's performance is almost maintained, with PSNR&SSIM of 30.01&0.8020. The performance of the proposed method on the face dataset is slightly worse than on the plant seedling images, but it remains better than the comparison methods at high compression rates. DAMP achieves the best results, with PSNR&SSIM of 29.05&0.8506, at a compression rate of 10; nevertheless, its performance declines rapidly as the compression rate increases. At cr = 50, DAMP completely collapses, while the proposed model attains PSNR&SSIM of 25.01&0.7492, significantly better than the other methods. When cr = 10, the performance of the proposed model is highly consistent with CSGM's on the face dataset, because neither has sufficient information to reconstruct high quality images; yet both outperform the other algorithms, which indicates that compressed sensing methods based on generative networks perform better at high compression rates.
The intuitive comparison presented in Figure 5 combines PSNR and SSIM. At the compression rate of 20, the SSIM of the proposed model reaches over 0.8 on both datasets, noticeably better than the others. Moreover, Figures 3 and 4 show that the proposed model retains more significant details. In brief, it is a more suitable method for specific applications of compressed sensing image reconstruction: with sufficient datasets, the proposed method is capable of obtaining high-quality images at meager sampling rates. Table 2 shows the average runtimes of the comparison methods on plant seedling images. Runtime is not the focus of our work, because our method runs on a GPU while the first five baselines are implemented in MATLAB or scikit-learn and only utilize the CPU. Our method runs faster than CSGM when the compression rate is low (e.g., cr = 10 and cr = 50) because it reconstructs high quality images with fewer restarts. It is hard to seek the optimal solution when the compression rate is high (e.g., cr = 100), so the proposed method then costs more time than CSGM. Figure 6 shows the curves of the MSE of the reconstructed images against the number of iterations for different hyper-parameters in Equation (8). In this subsection, the effect of α and β on the performance of our algorithm is evaluated. The reconstruction model is considered over-fitting if the MSE decreases rapidly in the early stage of iteration but finally settles at a large value; it is considered under-fitting if the MSE decreases slowly and cannot reach the minimum error. In Figure 6 (left), β is set to 0.1 and α is set to 0, 0.1, and 0.5, respectively. The curves show that α = 0.1 gives the best performance: the proposed method over-fits when α = 0 and under-fits when α = 0.5. In Figure 6 (right), α is set to 0.1 and β is set to 0, 0.1, and 0.5, respectively.
The three combinations exhibit nearly identical convergence rates in the early stage of iteration, but β = 0.1 finally reaches the minimum test error; the proposed method over-fits when β = 0.5 and under-fits when β = 0. Hence, the recommended parameter combination is α = 0.1, β = 0.1.

Anti-Noise Performance
The images captured by sensors are commonly accompanied by noise. Several possible causes are as follows:
1. The scene is insufficiently bright, or the brightness is non-uniform, when the image sensor is operating.
2. The temperature of the image sensor is excessively high due to a long working time.
3. Noise affecting the sensor's circuit components may have an impact on the captured output image.
Accordingly, the image acquisition system should have a strong anti-noise ability. The anti-noise performance is tested with Gaussian noise and salt-and-pepper noise. The image collected by the sensor is assumed to be

x* = x + σn,

where σ denotes the ratio of noise and n ∈ R^n is Gaussian random noise with n_i ~ N(0, 1). The algorithm aims to reconstruct the image x̂ from the measurements y = Φx*; the anti-noise performance is assessed by comparing the similarity between x̂ and x. Figure 7 presents some reconstructed samples of the various methods with Gaussian noise added to seedling and face images. As revealed by the visual comparison, the quality of the images reconstructed by the proposed approach is significantly higher than that of the others. Specifically, the images reconstructed by TVAL3 and NLRCS are extremely blurry, even worse than the noisy inputs. GAPTV is capable of recovering the target's outline in general, but many color blocks cause the whole image to lack details. DAMP achieves good results on several images, such as the female face, while its other images are illegible. CSGM's reconstructed face samples are unnatural: though the outline is consistent with the ground truth, details (e.g., eyes and mouth) differ. Besides, CSGM completely collapses when reconstructing the first plant seedling image. In contrast, the proposed model reconstructs image details more accurately, and the edges of the target are clearer. Table 3 shows the quantitative comparison of the various methods with cr = 10 and σ = 0.2, with the best results marked in bold. The proposed method's PSNR and SSIM are higher than the others' on both the plant seedling and face datasets. GAPTV also achieves good quantitative results, whereas the samples in Figure 7 reveal that its visual effect is poor. In another experiment, salt-and-pepper (i.e., pixel loss) noise is added to the images.
Each pixel of the noisy image is given by

x*_i = 0 if p_i < θ, and x*_i = x_i otherwise,

where θ denotes the ratio of pixels lost and p_i follows the uniform distribution between 0 and 1. Figure 8 presents some reconstructed samples, and Table 4 gives the quantitative evaluation with cr = 10 and θ = 0.02. The results on images with salt-and-pepper noise lead to the same conclusion as those with Gaussian noise. In brief, the proposed model's anti-noise performance is significantly better than that of the other methods, and the model is strongly robust.
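The two noise models can be sketched as follows; the random seed and default parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def add_gaussian_noise(x, sigma=0.2):
    """x* = x + sigma * n, with each n_i ~ N(0, 1)."""
    return x + sigma * rng.standard_normal(x.shape)

def add_salt_pepper(x, theta=0.02):
    """Zero out each pixel independently with probability theta (pixel loss)."""
    p = rng.uniform(size=x.shape)
    return np.where(p < theta, 0.0, x)
```

In the experiments, the reconstruction algorithm only ever sees the measurements Φx* of the corrupted image, never the clean x.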

Conclusions
In this study, a deep CS model is proposed and applied to specific CS applications (e.g., plant seedling or face image acquisition). The proposed model reconstructs images more effectively than existing models and significantly reduces the amount of data to be transmitted. As verified by experiments on both datasets, the proposed method outperforms existing algorithms. Its major advantage is that it can effectively reconstruct images at high compression rates (above 20). Thanks to convolutional neural networks, it is a data-driven method, so another important advantage is that, as the training set grows, the generative model's performance improves and image details can be restored more effectively. The model also exhibits a strong anti-noise ability. For these reasons, it is a more practical method.
Moreover, the reconstruction of images with complex backgrounds remains an open question; given the different characteristics of target and background, multiple generative networks are considered helpful. The recovery of fine details is another question, and in subsequent studies we will introduce post-processing methods to further improve reconstruction performance.

Data Availability Statement: All data generated or analyzed during this study are included in this article.