An Image Compression Method for Video Surveillance System in Underground Mines Based on Residual Networks and Discrete Wavelet Transform

Abstract: Video surveillance systems play an important role in underground mines. Providing clear surveillance images is the fundamental basis for safe mining and disaster alarming. It is important to investigate image compression methods because underground wireless channels allow only low transmission bandwidth. In this paper, we propose a new image compression method based on residual networks and discrete wavelet transform (DWT) to solve the image compression problem.


The Image Compression Demand from Underground Mines
Coal is one of the major resources in China. In the foreseeable future, China will remain the largest consumer and producer of coal [1]. Therefore, it is of great importance to research technologies that contribute to the advancement of intelligent mine monitoring and safe mining practices.
One of the key components of intelligent mine monitoring is the video surveillance system, since visual information plays a key role in how a human perceives the world. Because digital images usually require large storage, it is natural to think of transmitting images over high-bandwidth channels such as cable networks. Although cable networks could potentially provide enough bandwidth, they are inflexible: the cables are fixed and have to be extended as the working surface expands. In favor of mobility, wireless networks are usually chosen as the information channel in mines. However, the bandwidth can be limited because of the narrow spaces, the harsh environment, diffraction, attenuation, and multi-path effects in underground mines. The problem can be especially serious when disasters such as explosions and collapses occur [2]. Therefore, it is necessary to investigate image compression methods in order to save transmission bandwidth.

From Conventional Image Compression to Compressed Sensing
There have been extensive investigations into image compression. Among them, JPEG (Joint Photographic Experts Group) [3] has been quite popular and influential. JPEG mainly employs the discrete cosine transform (DCT) and entropy coding to compress images. While JPEG has gained widespread popularity, it introduces visible artifacts including blurring, ringing, and blocking [4]. JPEG2000 [5] was put forward to address these problems. It adopts the 2D wavelet transform and arithmetic coding to achieve higher compression efficiency.
Besides transforms and entropy coding, a theoretical framework known as compressed sensing (CS) [6][7][8] was proposed to overcome the limitation that a signal must be sampled at the Nyquist sampling rate [9]. The CS theory has shed light on the problem of compression and reconstruction. Optimization techniques such as total variation (TV) minimization [10] and approximate message passing (AMP) [11] can be used in the recovery phase of the CS framework. TV minimization for image denoising was first introduced in [12]. Its advantage is that it preserves edges and boundaries more accurately at certain compression ratios. In [13], the method "total variation minimization by augmented Lagrangian and alternating direction algorithms" (TVAL3) is proposed, and it has been used widely in image recovery problems. Comparisons in [14] suggest that the TVAL3 solver is fast and efficient so long as the reconstruction parameters are sufficient for a satisfactory reconstruction. Meanwhile, based on the AMP [11] recovery algorithm, the D-AMP [15] algorithm is proposed to enhance CS recovery. D-AMP exploits the rich existing knowledge of signal denoisers to design the solver. Tests in [15] show that D-AMP maintains a low computational footprint. Compressed sensing-based techniques have been explored in real-life scenarios such as mine monitoring image compression [16] and landslide monitoring systems [17]. The non-learning compressed sensing methods do achieve some success, but they struggle to produce sound recoveries at low compression ratios.

Data-driven Approaches
Due to the advancement of information technology, more data is within the reach of researchers. Data-driven approaches have found their way into various fields, including signal processing [18], control systems [19][20][21][22], and especially vision tasks [23][24][25][26][27]. In particular, deep learning-based methods have stood out among the data-driven approaches. This section explores recent developments in deep learning-based image compression methods.

Convolutional Neural Network Based Image Compression
In recent years, convolutional neural networks (CNNs) have gained great attention due to the improvement of computing devices. Image compression with CNNs generally involves designing image codecs with neural networks and constructing appropriate loss functions.
One genre of compression methods incorporates the ideas of compressed sensing into CNNs. For instance, the network DeepInverse proposed in [28] uses fully connected layers to simulate the compression process and stacks convolution layers for decompression. Back-propagation is applied to train the networks. This idea is extended further by ReconNet [29], which uses more convolution layers to attack the decompression problem. In [30], a deep residual reconstruction network is proposed to recover images more accurately. However, according to the results reported by their authors, this series of methods tends to blur edges in the recovered image, especially at low compression ratios.
Another genre of CNN-based compression methods utilizes the semantic information in images, since preserving semantic information renders the recovered image more pleasing to the eye. Ballé et al. introduce an end-to-end optimized CNN image compression network in [31]. The method is based on non-linear coding rather than the linear coding used by JPEG. One important contribution of [31] is a method that simulates the quantizer in the training procedure to deal with the problem of zero derivatives due to quantization. Li et al. point out in [4] that it is inappropriate to allocate the same number of codes to each spatial position in an image. They propose an importance map to guide spatially variant bit allocation. To further compress the data, they introduce a convolutional entropy encoder to compress the binary codes and the importance map. In [32], the authors incorporate deep learning-based image semantic analysis into image compression as well. Unlike [4], which focuses more on the edges of objects, the method in [32] emphasizes the semantic analysis of the whole region. Their experiments show that the method can improve the visual quality under the same compression overhead. However, it can be quite complicated to adjust the compression ratios of this genre of methods. Moreover, these methods are rarely applied at very low compression ratios.

Recurrent Neural Network Based Image Compression
Unlike the feed-forward CNN, the recurrent neural network (RNN) is state-aware. The output of an RNN depends not only on the current input but also on previous inputs. Lyu et al. propose to incorporate the knowledge of block-sparsity recovery into RNN deep learning in [33]. Their method captures the spatial correlations between nonzero elements of block-sparse signals. It is applied not only to images but also to audio data. However, the method proposed in [33] requires the input data to be sparse, which limits its compression capability. In [34], Toderici et al. incorporate the scaled-additive coding framework into an RNN-based image compression scheme. The highlight of [34] is that the proposed architectures can provide variable compression rates during deployment without retraining the network. In [35], Minnen et al. propose a spatially adaptive image compression framework with quality-sensitive bit rate adaptation. However, though their method outperforms JPEG, it is still inferior to JPEG2000 [36].

Generative Adversarial Network Based Image Compression
The generative adversarial network (GAN) is another promising deep learning method developed in recent years. In the GAN scheme, a generator network and a discriminator network are optimized simultaneously. The discriminator network is trained to determine whether a sample was produced by the generator network, while the generator network tries to fool the discriminator into wrong decisions. With regard to image compression utilizing GANs, Rippel and Bourdev in [37] propose an autoencoder architecture featuring pyramidal analysis, an adaptive coding module, and regularization of the expected code length. It produces files 2.5 times smaller than JPEG and JPEG2000, while achieving real-time performance on a GPU. Jia et al. in [38] propose a light field image compression framework driven by GAN-based sub-aperture image generation and a cascaded hierarchical coding structure. Their method outperforms the state-of-the-art learning-based light field image compression approach with BD-rate [39] reductions of 4.9% on average. In [40], Agustsson et al. propose a GAN-based framework targeting extremely low bitrate compression. Their method pushes the bitrate below 0.1 bpp while still achieving visually pleasing results.

The Objectives and the Organization of the Paper
Considering the demand for image compression at very low compression ratios in underground mines, we propose in this paper an image codec network based on CNNs and a new loss function based on the discrete wavelet transform. The new loss function is dedicated to preserving edges in the images of underground mines. The remainder of the paper is organized as follows: Section 2 elaborates the proposed method by discussing the network architecture and the construction of the loss function. Section 3 provides experiments that demonstrate and analyze the performance of the proposed method. Section 4 concludes the paper with further discussion of the proposed method.

Overview
Before introducing the network architecture, it is necessary to understand the workflow of the proposed compression method. As shown in Figure 1, a gray-scale image or one of the channels of an RGB color image is taken as the input. We view the input image as a matrix $x$. For simplicity, we assume the input image is square, which means $x$ has the same number of rows and columns. The image matrix $x$ is "vectorized" into a vector $x_v$ by concatenating the rows of the matrix. The encoder module compresses $x_v$ to a feature vector $y$. Then the decoder module approximates $x$ from the feature vector $y$. The approximation of $x$ is denoted as $\hat{x}$. During training, both the recovered image and the original image are fed into the loss function. Back-propagation minimizes the loss by updating the weights in the encoder and decoder modules. If there are $N$ numbers in the image matrix $x$ and $M$ numbers in the feature vector $y$, then we define the compression ratio $r$ as

$$r = \frac{M}{N}. \tag{1}$$
In short, the encoder module is responsible for compressing the image and determining the compression ratio, while the decoder module takes care of the recovery process.

The Encoder Module
The weight matrix $W$ of size $M \times N$ is multiplied by the "vectorized" image $x_v$. The bias vector $b$ is then added to the product to obtain the feature vector $y$:

$$y = W x_v + b. \tag{2}$$

In Equation (2), both the weight matrix $W$ and the bias vector $b$ are parameters to be learnt during back-propagation. $W$ is initialized using He initialization [41], while $b$ is initialized with zeros.
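The encoder described above amounts to a single affine map applied to the flattened image. A minimal pure-Python sketch follows. The helper names (`vectorize`, `init_encoder`, `encode`) and the 4 × 4 toy image are illustrative assumptions, and the He initialization is written here as zero-mean Gaussian weights with variance 2/N, which is one common form of [41] and not necessarily the exact variant used in the authors' code.

```python
import math
import random

def vectorize(image):
    """Concatenate the rows of a 2D image (list of lists) into one flat list."""
    return [pixel for row in image for pixel in row]

def init_encoder(n_in, ratio, seed=0):
    """Create W (M x N) and b (length M) for a compression ratio r = M / N.

    Assumptions: M is obtained by rounding r * N, and W uses He-style
    Gaussian initialization with standard deviation sqrt(2 / N).
    """
    rng = random.Random(seed)
    m_out = max(1, round(ratio * n_in))
    std = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(m_out)]
    bias = [0.0] * m_out
    return weights, bias

def encode(weights, bias, x_v):
    """Compute the feature vector y = W x_v + b."""
    return [sum(w * x for w, x in zip(row, x_v)) + b
            for row, b in zip(weights, bias)]

# Toy 4 x 4 image compressed at r = 0.25, so N = 16 and M = 4.
image = [[float(4 * i + j) for j in range(4)] for i in range(4)]
x_v = vectorize(image)
W, b = init_encoder(len(x_v), ratio=0.25)
y = encode(W, b, x_v)
```

In a deep learning framework the same map would typically be a single fully connected layer whose weights are updated by back-propagation together with the decoder.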

The Decoder Module
The network architecture of the decoder module is illustrated in Figure 2. The feature vector $y$ is first upsampled to $\tilde{y}$ using nearest-neighbor interpolation [42]. The length of $\tilde{y}$ is determined by Equation (3), where $M$ is the length of vector $y$ and $\lceil z \rceil$ denotes rounding $z$ up to the nearest integer greater than or equal to $z$. The vector $\tilde{y}$ is then reshaped into the initial feature map $F$ using Equation (4). Afterwards, the initial feature map $F$ is convolved with 96 filters of size 3 × 3. We empirically add a batch-normalization [43] layer after the first convolution layer to accelerate training. Then the feature maps go through several residual units. Some residual units are followed by a nearest-neighbor upsampling operation, as shown in Figure 2. Finally, the feature maps are convolved with one filter of size 1 × 1 to obtain the recovered image $\hat{x}$.

The residual units. The introduction of residual units is inspired by [44]. As depicted in Figure 3, two types of residual units are used. Both types follow the two-branch connection pattern. The feature maps go through the two branches and are added at the output summator. The upper branches of the two types are identical. The lower branches differ in that residual unit (1) connects the input and the output with a stack of layers, whereas residual unit (2) connects the input and the output directly. Each convolution layer that appears in Figure 3 is composed of 96 filters of size 3 × 3. After each convolution layer, there is a batch-normalization layer [43]. Each batch-normalization layer is then followed by a Leaky ReLU activation layer [45] unless the batch-normalization layer is directly connected to the output summator.
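Nearest-neighbor upsampling of the feature vector, followed by reshaping into a square feature map, can be illustrated with a short pure-Python sketch. The target length of 4 and the index rule `floor(i * len(vec) / target_len)` are assumptions made for illustration. The source states only that nearest-neighbor interpolation [42] is used, and frameworks may apply a slightly different index convention.

```python
def upsample_nearest(vec, target_len):
    """Stretch vec to target_len by copying the nearest source element.

    Output position i copies the source element at
    floor(i * len(vec) / target_len).
    """
    if target_len < len(vec):
        raise ValueError("target_len must be at least len(vec)")
    n = len(vec)
    return [vec[(i * n) // target_len] for i in range(target_len)]

def reshape_square(vec, side):
    """Reshape a flat list of side * side values into a square map, row by row."""
    if len(vec) != side * side:
        raise ValueError("vector length must equal side * side")
    return [vec[r * side:(r + 1) * side] for r in range(side)]

y = [1.0, 2.0, 3.0]
y_up = upsample_nearest(y, 4)          # hypothetical target length 4 = 2 * 2
feature_map = reshape_square(y_up, 2)  # 2 x 2 initial feature map
```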
The nearest-neighbor upsampling operations.If the input image x is of size n × n, then the second, third, and fourth upsampling operation in Figure 2 resize the feature maps to size 1  2 n ×

Combination of Two Types of Loss Functions
Image recovery problems are conventionally seen as optimization problems that minimize the $l_2$ loss between the recovered and original image. However, from the perspective of image recovery quality assessment, the $l_2$ metric does not reflect every aspect of signal fidelity [46]. Therefore, when constructing the loss function, it is necessary to combine other metrics that compensate for what is missing in the $l_2$ loss.
In this section, we propose a metric termed discrete wavelet structural similarity (DW-SSIM) that focuses on the recovery of edges in the images. Our loss function is the weighted sum of the DW-SSIM loss and the $l_2$ loss (Equation (5)), where $\Omega$ represents a set of training images, $L_F(x, \hat{x})$ denotes the $l_2$ loss, $L_S(x, \hat{x})$ denotes the DW-SSIM loss, and $\beta_1 = 0.5$ and $\beta_2 = 0.5$ are weights. Both $L_F(x, \hat{x})$ and $L_S(x, \hat{x})$ are set up to fall in the range [0, 1). Section 2.3.2 provides the expression of $L_F(x, \hat{x})$, while Section 2.3.3 explains $L_S(x, \hat{x})$ in detail.

$l_2$ Loss
We propose to use the Frobenius norm in $L_F(x, \hat{x})$ to derive the $l_2$ loss, as given in Equation (6). Note that the denominator of Equation (6) must not be zero. Since $x$ is taken from natural images rather than artificially generated matrices, it is impossible for $x$ to be a zero matrix.

Discrete Wavelet Similarity (DW-SSIM) and DW-SSIM Loss
Inspired by structural similarity (SSIM) [47] and complex-wavelet structural similarity (CW-SSIM) [48], we propose to use the two-dimensional discrete wavelet transform (2D-DWT) [49] to analyze the similarity between the recovered image and the original image. The similarity is termed DW-SSIM, which stands for discrete-wavelet similarity.
2D-DWT. The 2D-DWT decomposes an image into different levels of subbands. The first level is the decomposition of the original image. Each level is composed of four subband images, referred to as low-low (LL), low-high (LH), high-low (HL), and high-high (HH). The LL image at each level can be further decomposed into the next level of subbands. The LH image represents the variation along the vertical direction, the HL image the horizontal direction, and the HH image the diagonal direction [49]. The high-frequency LH, HL, and HH subband images together form the details of the original image. As the decomposition level goes higher, the subband images become coarser, so details at different scales can be analyzed. Figure 4 provides an example of a three-level 2D-DWT decomposition of an image.

DW-SSIM. We divide the calculation of DW-SSIM between the original image and the recovered image into two stages. The first stage computes the local DW-SSIM, where a "window" slides through the original image and the recovered image. 2D-DWT is performed on the image patches within the "window" to obtain the decomposition. We define the local low-frequency DW-SSIM $S_{L,t}(c^{(1)}, c^{(2)})$ and the local high-frequency DW-SSIM $S_{H,t}(c^{(1)}, c^{(2)})$ of the image patches in Equations (7) and (8), respectively. In Equations (7) and (8), $K$ is a small positive constant for arithmetic robustness and is set to 0.01. $c^{(1)}$ and $c^{(2)}$ refer to the corresponding subband images of the original image patch and the recovered image patch after 2D-DWT, respectively. The wavelet function we use is the Haar wavelet. $t$ is the patch index. $J = 3$ is the maximum decomposition level, and $c_j^{LH}$, $c_j^{HL}$, $c_j^{HH}$ are the high-frequency subband images at the $j$-th level.
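To make the subband terminology concrete, here is a minimal pure-Python sketch of the 2D Haar DWT, applied recursively to the LL subband. It assumes the patch height and width are divisible by 2 at every level (no border padding). It also assigns LH to the top-minus-bottom difference and HL to the left-minus-right difference, which is one reading of the directions described above. Wavelet libraries differ in naming and padding conventions, so this is an illustration and not a drop-in replacement for the package used by the authors.

```python
def haar_dwt2(img):
    """One level of the orthonormal 2D Haar transform on a list-of-lists image.

    Each 2 x 2 block [[a, b], [c, d]] yields one coefficient per subband.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image sides must be even in this sketch")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for col in range(0, cols, 2):
            a, b = img[r][col], img[r][col + 1]
            c, d = img[r + 1][col], img[r + 1][col + 1]
            ll_row.append((a + b + c + d) / 2)  # local average
            lh_row.append((a + b - c - d) / 2)  # variation along the vertical direction
            hl_row.append((a - b + c - d) / 2)  # variation along the horizontal direction
            hh_row.append((a - b - c + d) / 2)  # diagonal variation
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh

def haar_wavedec2(img, levels):
    """Multi-level decomposition: final LL plus a (LH, HL, HH) tuple per level."""
    details = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        details.append((lh, hl, hh))
    return ll, details

# Toy 8 x 8 patch decomposed to three levels (LL shrinks 8 -> 4 -> 2 -> 1).
patch = [[float((i * 8 + j) % 5) for j in range(8)] for i in range(8)]
ll, details = haar_wavedec2(patch, 3)
```

Because this form of the Haar transform is orthonormal, the total energy of the coefficients equals the energy of the input patch, which is a convenient sanity check.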
To better understand Equation (7), one can ignore $K$, "vectorize" (as in Section 2.2.1) $c$ into $c_v$, and rewrite it as

$$S_{L,t}(c^{(1)}, c^{(2)}) = \frac{2\,\|c_v^{(1)}\|\,\|c_v^{(2)}\|}{\|c_v^{(1)}\|^2 + \|c_v^{(2)}\|^2} \cdot |\cos\theta|. \tag{9}$$

In Equation (9), the first term is determined by the energy of the subband images. It reaches its maximum value 1 only if $\|c_v^{(1)}\| = \|c_v^{(2)}\|$. In the second term, $\cos\theta = \frac{c_v^{(1)} \cdot c_v^{(2)}}{\|c_v^{(1)}\|\,\|c_v^{(2)}\|}$ is the cosine similarity [50]. If $c_v^{(1)}$ and $c_v^{(2)}$ point in roughly the same direction, the cosine similarity will be close to 1. However, the cosine function falls in the range [−1, 1]. Therefore, we take the absolute value so that it falls in [0, 1]. The interpretation of Equation (8) is largely the same as that of Equation (7). Equation (8) additionally averages the contribution of each level of subbands to the high-frequency DW-SSIM in order to cope with the patterned noise in underground mine images. This can be better understood through the discussion in Section 3.3.
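The cosine-similarity interpretation can be checked numerically. The sketch below assumes a CW-SSIM-style form for the local similarity, $(2\,|c^{(1)} \cdot c^{(2)}| + K) / (\|c^{(1)}\|^2 + \|c^{(2)}\|^2 + K)$, which reduces to an energy term times $|\cos\theta|$ when $K$ is ignored. The function names are hypothetical, and the exact placement of $K$ is an assumption rather than a confirmed transcription of Equation (7).

```python
import math

def local_similarity(c1, c2, k=0.01):
    """Assumed CW-SSIM-style similarity between two flattened subbands."""
    dot = sum(a * b for a, b in zip(c1, c2))
    energy = sum(a * a for a in c1) + sum(b * b for b in c2)
    return (2.0 * abs(dot) + k) / (energy + k)

def energy_times_cosine(c1, c2):
    """The same quantity with K ignored, written as (energy term) * |cos(theta)|."""
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    dot = sum(a * b for a, b in zip(c1, c2))
    energy_term = 2.0 * n1 * n2 / (n1 * n1 + n2 * n2)
    cos_theta = dot / (n1 * n2)
    return energy_term * abs(cos_theta)

c_ori = [1.0, 2.0, 3.0]   # toy subband of the original patch
c_rec = [1.1, 1.9, 3.2]   # toy subband of the recovered patch
s_with_k = local_similarity(c_ori, c_rec)
s_no_k = local_similarity(c_ori, c_rec, k=0.0)
s_factored = energy_times_cosine(c_ori, c_rec)
```

Under this assumed form, identical subbands score 1, orthogonal subbands score close to 0, and a sign flip does not change the score because of the absolute value.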
In the second stage, a weighted sum of $S_{H,t}$ and $S_{L,t}$ is computed to form the final DW-SSIM $S$:

$$S = \frac{1}{T} \sum_{t=1}^{T} \left( \gamma_1 S_{L,t} + \gamma_2 S_{H,t} \right), \tag{10}$$

where $T$ is the total number of image patches, and $\gamma_1$ and $\gamma_2$ are parameters that adjust the weights of the low-frequency subband and the high-frequency subbands. Since we want to emphasize high-frequency details such as edges and spikes in the image, we set $\gamma_1 = 0.2$ and $\gamma_2 = 0.8$. The computation of DW-SSIM is summarized in Algorithm 1. The window length $l$ in the proposed method is set to 15. The stride $s$ by which the window moves in each iteration is set to 8.
Algorithm 1. Computation of DW-SSIM.
Input: The original image img-ori and the recovered image img-rec of the same height $H$ and width $W$ ($H > 0$, $W > 0$); the decomposition level $J$; the stride $s$ by which the window moves in each iteration; the window length $l$; the weights $\gamma_1$ and $\gamma_2$ in Equation (10).
Output: The DW-SSIM similarity $S$ between img-ori and img-rec.
Get image patch patch-ori within the window [up, down, left, right] from img-ori;
Get image patch patch-rec within the window [up, down, left, right] from img-rec;
Derive $c^{(1)}$ by performing 2D-DWT on patch-ori;
Derive $c^{(2)}$ by performing 2D-DWT on patch-rec;

DW-SSIM loss. The DW-SSIM defined in Equation (10) falls in the range (0, 1]. The better the original image and the recovered image match each other, the closer the DW-SSIM $S$ is to 1. However, the loss should be near 0 if the model has achieved a perfect recovery. Moreover, the loss should fall in the range [0, 1). Therefore, we define the DW-SSIM loss as:

$$L_S(x, \hat{x}) = 1 - S. \tag{11}$$
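The window sweep and the final aggregation can be sketched as follows. The sketch assumes that windows which would cross the image border are skipped, that the final score is the mean over patches of the weighted low-frequency and high-frequency similarities, and that the loss is one minus the similarity, which maps a similarity in (0, 1] to a loss in [0, 1). The per-patch similarities are passed in as plain numbers, so the 2D-DWT step is outside this example, and all function names are hypothetical.

```python
def window_positions(height, width, length=15, stride=8):
    """List [up, down, left, right] windows (down/right exclusive) that fit in the image."""
    return [
        [up, up + length, left, left + length]
        for up in range(0, height - length + 1, stride)
        for left in range(0, width - length + 1, stride)
    ]

def dw_ssim_from_patches(low_scores, high_scores, gamma1=0.2, gamma2=0.8):
    """Average the weighted low- and high-frequency similarities over all patches."""
    if len(low_scores) != len(high_scores) or not low_scores:
        raise ValueError("need one low and one high score per patch")
    total = sum(gamma1 * s_l + gamma2 * s_h
                for s_l, s_h in zip(low_scores, high_scores))
    return total / len(low_scores)

def dw_ssim_loss(similarity):
    """Turn a similarity in (0, 1] into a loss in [0, 1)."""
    return 1.0 - similarity

windows = window_positions(100, 100)              # 100 x 100 image, l = 15, s = 8
s = dw_ssim_from_patches([1.0, 0.9], [1.0, 0.5])  # toy per-patch scores
loss = dw_ssim_loss(s)
```

With these assumptions, a 100 × 100 image yields an 11 × 11 grid of windows, and the toy scores show how a poor high-frequency match dominates the loss because of the larger weight on the high-frequency term.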

Learning the Parameters
The encoder module and the decoder module can be trained in an end-to-end manner using the proposed network architecture and the proposed loss function. Mini-batch gradient descent is used to train the model with a batch size of 64. The Adam [51] optimizer is utilized as well. We set the initial learning rate to $5 \times 10^{-4}$. The learning rate is multiplied by 0.2 when the loss stops decreasing during training. The training is stopped if the learning rate drops below $1 \times 10^{-6}$.
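The learning-rate rule can be illustrated with a small pure-Python simulation. The source does not state how long the loss must stagnate before the rate is reduced, so the `patience` parameter is an assumption, as are the function name and the toy loss sequence.

```python
def train_schedule(losses, lr=5e-4, factor=0.2, min_lr=1e-6, patience=1):
    """Walk through per-epoch losses and return (epochs_run, lr_history).

    The rate is multiplied by `factor` after `patience` consecutive epochs
    without improvement; training stops once the rate drops below `min_lr`.
    """
    best = float("inf")
    stale = 0
    history = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
        history.append(lr)
        if lr < min_lr:
            return epoch, history
    return len(losses), history

# Toy run: the loss improves once and then plateaus.
epochs_run, lr_history = train_schedule([0.5, 0.4] + [0.4] * 10)
```

Starting from $5 \times 10^{-4}$, four reductions by a factor of 0.2 bring the rate to about $8 \times 10^{-7}$, which is the first value below the stopping threshold.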

Overview
In order to generalize the recovery capability, the network of the proposed method is trained on both video frames we have collected in underground mines and images from the COCO 2014 dataset [52]. We build the training set by extracting the 100 × 100 center-crop patches from the images and converting them to grayscale images.
After the model is trained, test images (as in Figure 5) are passed to the model to perform compression and recovery. We test our method on the standard images Barbara, Fingerprint, and Lena to verify its effectiveness. In addition, we test the proposed method on images of a coal cutter and a tunnel boring machine (TBM) taken in real underground mines to evaluate the performance in the application-specific environment.
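Building a training patch can be sketched in pure Python as below. The luma weights (0.299, 0.587, 0.114) are the common ITU-R BT.601 coefficients and are an assumption, since the source does not say how the grayscale conversion is done. The function names and the toy image are illustrative.

```python
def center_crop(image, size=100):
    """Return the size x size patch at the center of a 2D list-of-rows image."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to luma values (assumed BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

# Toy 120 x 160 RGB image whose red channel encodes the column index.
rgb = [[(col, 0, 0) for col in range(160)] for _ in range(120)]
patch = to_grayscale(center_crop(rgb))
```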
The recovery quality is quantitatively evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [46]:

$$\mathrm{PSNR}(x, \hat{x}) = 10 \log_{10} \frac{d^2}{\frac{1}{N} \|x - \hat{x}\|_F^2}, \tag{12}$$

$$\mathrm{SSIM}(x, \hat{x}) = \frac{2\mu_x \mu_{\hat{x}} + C_1}{\mu_x^2 + \mu_{\hat{x}}^2 + C_1} \cdot \frac{2\sigma_x \sigma_{\hat{x}} + C_2}{\sigma_x^2 + \sigma_{\hat{x}}^2 + C_2} \cdot \frac{\sigma_{x\hat{x}} + C_3}{\sigma_x \sigma_{\hat{x}} + C_3}. \tag{13}$$

In Equation (12), $d$ is the dynamic range of pixel intensities, and $N$ is the number of pixels in the image. In Equation (13), $\mu_x$ and $\mu_{\hat{x}}$ are the means of $x$ and $\hat{x}$, and $\sigma_x^2$ and $\sigma_{\hat{x}}^2$ are the variances of $x$ and $\hat{x}$. $\sigma_{x\hat{x}}$ is the cross correlation of $x$ and $\hat{x}$. The small positive constants $C_1 = C_2 = C_3 = 0.01$ prevent numerical instability of each term.
To verify the effectiveness of the proposed method, quantitative evaluation is carried out at compression ratios of 0.25, 0.20, 0.15, 0.10, 0.04, and 0.01, with the compression ratio defined in Equation (1). In addition, the proposed method is compared to D-AMP [15], ReconNet [29], and TVAL3 [13] at different compression ratios. For simplicity, we do not re-implement these algorithms but use the demo code provided on the authors' websites instead.
Further, a visual quality evaluation of the recoveries is presented at some specific compression ratios. Finally, the robustness of the proposed method is tested by recovering images contaminated by different levels of Gaussian noise.
The proposed method was implemented with PyTorch [53] and the pytorch_wavelets package (https://github.com/fbcotter/pytorch_wavelets). The training process is carried out on Ubuntu 18.04.2 with an Nvidia Tesla K80 GPU and an Intel Xeon CPU. More details about the implementation can be found in the code, which we have made public on the Internet (https://github.com/y0umu/ResCSNet).
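As a concrete reference for the first metric, PSNR can be computed as in the sketch below, using the standard definition of ten times the base-10 logarithm of the squared dynamic range over the mean squared pixel error. The default dynamic range of 255 assumes 8-bit images, which the source does not state explicitly, and the function name is illustrative.

```python
import math

def psnr(original, recovered, dynamic_range=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_o = [p for row in original for p in row]
    flat_r = [p for row in recovered for p in row]
    if len(flat_o) != len(flat_r):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_r)) / len(flat_o)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(dynamic_range ** 2 / mse)

original = [[100.0, 110.0], [120.0, 130.0]]
recovered = [[101.0, 109.0], [121.0, 129.0]]  # every pixel is off by 1
value = psnr(original, recovered)
```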

Quantitative Evaluation
Tables 1 and 2 provide quantitative measurements of the proposed method and other algorithms at different compression ratios. As the compression ratio r decreases, the PSNR and SSIM of all the compared algorithms decrease. Table 1 shows that the proposed method is second only to D-AMP at compression ratios r ≥ 0.20 for both standard test images and real underground mine images. Yet the proposed method achieves the highest PSNR among the algorithms at compression ratios r ≤ 0.15. It should also be noted that for the recoveries of the coal cutter and TBM images at compression ratios r ≤ 0.04, the proposed method has an edge over the other algorithms by a margin of at least 1.8 dB, indicating its potential for application-specific usage in mines. Table 2 shows that the proposed method achieves the highest SSIM at every compression ratio for all the images except the Fingerprint image. Since the SSIM metric describes the structural similarity between the recovered and the original images, it can be concluded that the proposed method preserves specific characteristics of the images better.

Visual Quality Evaluation
Figures 6 and 7 illustrate the recovered images of the proposed method and the algorithms being compared. The green boxes show zoomed-in views of the image patches within the red boxes so that the details can be seen clearly. As can be seen in most of the pictures, the proposed method recovers sharper edges with less blurring than the other algorithms. In Figure 7, where the compression ratio is relatively low, the edges can still be discerned in the recovered image of the proposed method, while the other recoveries tend to be more blurred. Combined with Tables 1 and 2, this indicates that the characteristic preserved by the proposed method is the edges in the image.
Figures 6 and 7 also demonstrate an interesting phenomenon. In the recovery of the Fingerprint image, the proposed method fails to recover the details at both r = 0.15 and r = 0.04. This is intended behavior: the proposed method deliberately "blurs" dense patterns in the recovered images to cope with the noise that is often seen in underground mine images. To explain the rationale behind this, suppose we take image patches of size 15 × 15 at the same location from the recovered image and the original image of Fingerprint. A 3-level 2D-DWT is then applied to both patches, and it can be observed that the level 2 and level 3 subband images are almost identical. The major difference between the subbands lies in the level 1 decomposition. Recall that Equation (8) gives each level the same significance, so the difference between the recovered and original patch in the level 1 decomposition is in effect "averaged out". Therefore, the DW-SSIM loss between the original densely patterned patch and the recovered blurred patch is small, leading the proposed network to learn to blur the dense patterns.

Robustness against Noise
Since the tests in the previous sections indicate that the proposed method has an advantage when the compression ratio is low, we test the noise robustness of the proposed method at a compression ratio r = 0.04 in this section. As depicted in Figures 8 and 9, Gaussian noise is added to the Lena and TBM test images to simulate the dusty environment in underground mines. The noise is zero-mean. The standard deviation σ of the noise is set to 5, 10, 15, 20, 25, and 30 to emulate different levels of noise. The noise-contaminated images are compressed at ratio r = 0.04. The similarity between the recovered images and the original test images is then evaluated using PSNR and SSIM. As shown in Figures 8 and 9, at all noise levels, the recovered images of the proposed method show fewer artifacts while preserving sharp edges. Further, Figure 10 plots the PSNR and SSIM curves as σ varies. The PSNR and SSIM of all algorithms drop as σ increases, yet the PSNR and SSIM of the proposed method are higher than those of the algorithms being compared. As σ grows from 5 to 30, the decrease in PSNR and SSIM of the proposed method, which is no more than 1.6 dB and 0.11 respectively, is the smallest among the algorithms. Therefore, it can be concluded that the proposed method is robust against noise when the compression ratio is low.
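The noise contamination step can be reproduced along the following lines. The clipping to [0, 255] and the fixed seed are assumptions made for an 8-bit image and for reproducibility. The source specifies only zero-mean Gaussian noise with standard deviation σ, and the function name and the flat test image are illustrative.

```python
import random

def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma to a 2D image.

    Assumption: pixel values are clipped to the 8-bit range [0, 255].
    """
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
            for row in image]

# Flat mid-gray 64 x 64 image contaminated at the six noise levels of the experiment.
clean = [[128.0] * 64 for _ in range(64)]
noisy_by_sigma = {sigma: add_gaussian_noise(clean, sigma)
                  for sigma in (5, 10, 15, 20, 25, 30)}
```

Each noisy image would then be compressed and recovered, and PSNR and SSIM would be measured against the clean image rather than the noisy one, as described for Figure 10.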

Conclusions
In this paper, we propose a CNN-based image codec network which acts as the basis for the compression and recovery of images. We also propose a novel loss function that incorporates the knowledge of the discrete wavelet transform to attack the problem of edge blurring in the recovered images. The proposed method is more suitable for the compression and recovery of underground mine images in that:

• The proposed method recovers sharp edges in the images. For underground mines, edges in the image are the key component for distinguishing the foreground from the background. Once the boundaries of miners and equipment are determined, further image analysis can be carried out.

• The proposed method features noise robustness. By blurring the dense patterns, the proposed method can filter out the noise commonly seen in underground mines.

• Compared to other algorithms, the proposed method excels at low compression ratios. General image compression methods tend to strike a balance between the compression ratio and the recovery quality. They do not have to work at extremely low compression ratios, as the available transmission bandwidth is comparatively high. However, the proposed method is designed to work at low compression ratios to adapt to the harsh communication environment in underground mines.
In future work, we will combine other denoising techniques with the work presented in this paper in an attempt to achieve noise robustness without blurring the patterned areas. The current design of the DW-SSIM loss is not perfect in that the merits of cosine similarity are not fully preserved. Thus, the design of the loss function is worth further investigation. We will also train the model on other datasets in order to expand the application of the proposed method.

Figure 1. The workflow of the proposed method.

Figure 2. The network architecture of the decoder module.

Figure 4. Illustration of 2D-discrete wavelet transform (DWT) image decomposition. (a) Original image. (b) Three-level decomposition of the image. For clarity, every intermediate low-low (LL) image is put in its place, yet DWT only preserves the LL image of the highest level.

Figure 8. Comparison of recoveries of the Lena image in the presence of noise. σ denotes the standard deviation of the noise. The compression ratio r is 0.04.

Figure 9. Comparison of recoveries of the TBM image in the presence of noise. σ denotes the standard deviation of the noise. The compression ratio r is 0.04.

Figure 10. Plots of PSNR and SSIM against σ for the recoveries of the noise-contaminated (a) Lena image, (b) TBM image. The PSNR and SSIM are computed between the original test image (no noise added) and the recovered images. The compression ratio r is 0.04.

Funding:
This research was funded by Foundation of the National Key Research and Development Program grant number 2016YFC0801800, National Natural Science Foundation of China grant number 51874300, National Natural Science Foundation of China and Shanxi Provincial People's Government Jointly Funded Project of China for Coal Base and Low Carbon grant number U1510115, and the Open Research Fund of Key Laboratory of Wireless Sensor Network and Communication, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences grant numbers 20190902 and 20190913.

Table 1. Peak signal-to-noise ratio (PSNR) (in dB) comparison for different algorithms on test images. r is the compression ratio.

Table 2. SSIM comparison for different algorithms on test images. r is the compression ratio.