JsrNet: A Joint Sampling–Reconstruction Framework for Distributed Compressive Video Sensing

The sheer volume of video data poses great challenges for computing power and storage space, which has motivated distributed compressive video sensing (DCVS). Its hardware-friendly characteristics have made DCVS one of the most powerful architectures in source-limited scenarios such as wireless video sensor networks (WVSNs). Recently, deep convolutional neural networks (DCNNs) have been successfully applied to DCVS because traditional optimization-based methods are computationally expensive and struggle to meet the requirements of real-time applications. In this paper, we propose a joint sampling–reconstruction framework for DCVS, named “JsrNet”. JsrNet uses the whole group of frames as the reference to reconstruct each frame, whether key or non-key, whereas existing frameworks use only key frames as the reference to reconstruct non-key frames. Moreover, unlike existing frameworks, which exploit complementary information between frames only in joint reconstruction, JsrNet also applies this idea to joint sampling by adopting learnable convolutions that sample multiple frames jointly and simultaneously in an encoder. JsrNet exploits spatial–temporal correlation in both sampling and reconstruction and achieves competitive performance in both reconstruction quality and computational complexity, making it a promising candidate for source-limited, real-time scenarios.


Introduction
Compressive sensing (CS) [1,2] is a powerful framework for signal acquisition and processing. By adopting a measurement matrix, CS integrates sampling and compression, making it desirable in many applications such as magnetic resonance imaging (MRI) [3] and cognitive radio communication [4]. CS states that if the measurement matrix satisfies the restricted isometry property (RIP), a sparse or compressible signal can be recovered from fewer measurements than the Nyquist sampling theorem suggests [5]. Frame-based sampling [6,7] is impractical because of limited storage space. To overcome this problem, Lu [8] proposed block-based CS, which greatly reduces the implementation cost.
Over the past decade, CS has been successfully applied in video signal processing because, compared with still images, video signals contain more spatial and temporal redundancy that can be further exploited. One of the most powerful architectures in video CS reconstruction in the literature
1. JsrNet uses the whole group of frames as the reference to reconstruct each frame, whether key or non-key.
2. JsrNet applies the idea of exploiting complementary information between frames not only in joint reconstruction but also in joint sampling, by adopting learnable convolutions that sample multiple frames jointly and simultaneously in an encoder.
3. JsrNet exploits spatial-temporal correlation in both sampling and reconstruction and achieves competitive performance in both reconstruction quality and computational complexity, making it a promising candidate for source-limited, real-time scenarios.
The remainder of this paper is organized as follows. In Section 2, we review the background of our work. Section 3 gives a detailed description of the proposed JsrNet. In Section 4, we provide the experimental results. Conclusions are drawn in Section 5.

Preliminary of CS Theory
CS theory states that we can measure a signal $x \in \mathbb{R}^{n \times 1}$ at a sub-Nyquist rate through a measurement matrix $\phi \in \mathbb{R}^{m \times n}$:
$$y = \phi x,$$
where $y \in \mathbb{R}^{m \times 1}$ denotes the measurement vector and $SR = m/n$ denotes the sampling rate.
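The measurement step described above can be illustrated with a minimal sketch. The sizes (n = 16, m = 4), the Gaussian entries of the matrix, and the helper name `measure` are illustrative assumptions, not values or names taken from the paper.

```python
import random


def measure(phi, x):
    """Return y = phi x for an m x n matrix phi (list of rows) and a length-n signal x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


random.seed(0)
n, m = 16, 4  # illustrative sizes (assumption), with m < n
phi = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]  # random Gaussian matrix
x = [random.random() for _ in range(n)]  # stand-in signal

y = measure(phi, x)
sampling_rate = m / n  # SR = m / n
print(len(y), sampling_rate)  # prints: 4 0.25
```

The signal of length 16 is reduced to 4 measurements, which corresponds to a sampling rate of 0.25.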
In block-based CS, $n = B^2$, where $B$ denotes the block size. Since $m \ll n$, the recovery of $x$ from $y$ is ill-posed. Regularized iterative algorithms [7,27] have been the standard approach to this ill-posed inverse problem over the past few decades:
$$\hat{x} = \arg\min_{x} \frac{1}{2} \| y - \phi x \|_2^2 + \lambda R(x),$$
where $\lambda$ is a non-negative constant and $R(x)$ represents a prior on the signal structure, such as a sparsity prior [28,29] or a low-rank prior [30,31]. These methods suffer from high computational complexity and require careful parameter tuning. Owing to the powerful learning capability of deep networks, deep learning-based algorithms [19–25,32] have shown great potential in solving this inverse problem.
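To make the cost of regularized iterative recovery concrete, the following sketch runs the iterative shrinkage-thresholding algorithm (ISTA) with an l1 sparsity prior, R(x) = ||x||_1, and a least-squares data term. ISTA is used here only as a well-known representative of this family; it is not the specific algorithm of [7] or [27], and the problem sizes, the weight `lam`, and the iteration count are assumptions chosen for illustration.

```python
import random


def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(p * q for p, q in zip(row, v)) for row in a]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied element-wise."""
    return [(abs(u) - t) * (1.0 if u > 0 else -1.0) if abs(u) > t else 0.0 for u in v]


def objective(phi, y, x, lam):
    """0.5 * ||y - phi x||_2^2 + lam * ||x||_1."""
    residual = [a - b for a, b in zip(y, matvec(phi, x))]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(u) for u in x)


def ista(phi, y, lam, iters):
    """Iterative shrinkage-thresholding for l1-regularized least squares."""
    phi_t = [list(col) for col in zip(*phi)]
    # The squared Frobenius norm upper-bounds the squared spectral norm,
    # so 1 / lip is a safe (conservative) step size.
    lip = sum(p * p for row in phi for p in row)
    x = [0.0] * len(phi[0])
    for _ in range(iters):
        residual = [a - b for a, b in zip(y, matvec(phi, x))]
        gradient_step = [u + g / lip for u, g in zip(x, matvec(phi_t, residual))]
        x = soft_threshold(gradient_step, lam / lip)
    return x


random.seed(0)
n, m, lam = 32, 12, 0.05  # illustrative sizes and weight (assumptions)
phi = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x_true = [0.0] * n
for index in (3, 11, 20):  # a 3-sparse test signal
    x_true[index] = 1.0
y = matvec(phi, x_true)

x_hat = ista(phi, y, lam, iters=300)
print(objective(phi, y, [0.0] * n, lam), objective(phi, y, x_hat, lam))
```

Every iteration needs two matrix-vector products, and both the step size and `lam` must be chosen by hand, which illustrates the complexity and parameter-tuning issues mentioned above.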

Unsupervised Learning
Both supervised and unsupervised learning have been successfully applied in image CS frameworks. For video CS frameworks, however, we highlight the need for unsupervised learning to find and represent structure, because videos contain a large amount of spatial and temporal redundancy, which makes them particularly suitable for building unsupervised learning models. This is consistent with one motivation of our work: applying the idea of exploiting complementary information between frames to joint sampling.
Given a T-length group of pictures $\{x_1, \ldots, x_T\}$, we use the mean square error (MSE) as the loss function, which favors a high peak signal-to-noise ratio (PSNR):
$$L(\Theta) = \frac{1}{T} \sum_{i=1}^{T} \| F(x_i; \Theta) - x_i \|_2^2,$$
where $\Theta$ represents the parameters of the designed network and $F(x_i; \Theta)$ denotes the output of the network. One advantage of these algorithms is low computational complexity, because signals are reconstructed by a single forward pass through the model instead of by iterative optimization.
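The link between the MSE loss and PSNR can be checked with a small sketch. Each frame is its own training target, so no labels are needed. Averaging over the T frames, the pixel range [0, 1], and the function names `mse_loss` and `psnr` are assumptions made for this illustration.

```python
import math


def mse_loss(outputs, targets):
    """Squared l2 error between each network output and its frame, averaged over the T frames."""
    per_frame = [sum((o - t) ** 2 for o, t in zip(out, ref)) for out, ref in zip(outputs, targets)]
    return sum(per_frame) / len(per_frame)


def psnr(output, target, peak=1.0):
    """PSNR in dB for one frame whose pixel values lie in [0, peak]."""
    per_pixel = sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)
    return math.inf if per_pixel == 0 else 10.0 * math.log10(peak ** 2 / per_pixel)


# Two flattened toy frames (T = 2) and hypothetical network outputs for them.
targets = [[0.0, 0.5, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
outputs = [[0.1, 0.5, 0.9, 0.5], [1.0, 0.8, 0.0, 0.2]]

print(mse_loss(outputs, targets))
print([round(psnr(o, t), 2) for o, t in zip(outputs, targets)])
```

The frame with the smaller squared error has the higher PSNR, which is why minimizing the MSE favors a high PSNR.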

The Proposed JsrNet
In this section, we propose a joint sampling-reconstruction framework for DCVS, named "JsrNet". JsrNet measures signals in a block-based manner but reconstructs them in a frame-based manner. Figure 1 shows the overall architecture of JsrNet, which contains three modules: (1) a convolutional neural network (CNN) for joint sampling, in which multiple frames are sampled jointly and simultaneously by learnable convolutions in a block-based manner; (2) a spatial DCNN for initial recovery, in which all measurements of the blocks from one image are used simultaneously to output the intermediate reconstructed image; and (3) a temporal DCNN for joint reconstruction, in which each frame is reconstructed by exploiting temporal correlation within the whole group of frames. These three modules constitute an integrated end-to-end model whose parameters are jointly trained. Sensors 2020, 20, x FOR PEER REVIEW 4 of 12

CNN for Joint Sampling
Different from traditional approaches, which commonly use a random Gaussian matrix [22] as the measurement matrix, we use a convolutional layer [33], whose parameters depend only on the size and number of convolution kernels, to mimic the sampling operation. Figure 2 shows the structure of the encoder for joint sampling. First, video sequences are divided into several T-length groups of frames, in which a key frame $x_1$ is followed by non-key frames $\{x_2, \ldots, x_T\}$. Each frame goes through its own convolutional layer, from which the rectified linear unit (ReLU) activation [34] is removed, to obtain measurements in a block-based manner. A high sampling rate, $SR_K = m_K/n$, is allocated to key frames, whereas a relatively low sampling rate, $SR_N = m_N/n$, is allocated to non-key frames. During training, the sampling of multiple frames is jointly optimized, exploiting spatial-temporal correlation in the encoder. Unlike existing frameworks, which exploit complementary information between frames only in joint reconstruction, JsrNet also applies this idea to joint sampling by adopting learnable convolutions that sample multiple frames jointly and simultaneously in the encoder.
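A convolution can mimic block-based sampling when each kernel covers one block and the stride equals the block size, so that every kernel acts as one row of the measurement matrix. The sketch below assumes this B × B, stride-B layout without bias or activation. The toy sizes and the random kernels, which stand in for learned weights, are assumptions, and the joint training of the kernels is not shown.

```python
import random


def sample_frame(frame, kernels, block):
    """Bias-free, activation-free convolution with block x block kernels and stride block.

    frame: 2-D list whose height and width are multiples of block.
    kernels: m flattened kernels of length block * block (rows of the measurement matrix).
    Returns a grid in which entry [by][bx] holds the m measurements of block (by, bx).
    """
    grid = []
    for top in range(0, len(frame), block):
        grid_row = []
        for left in range(0, len(frame[0]), block):
            patch = [frame[top + i][left + j] for i in range(block) for j in range(block)]
            grid_row.append([sum(k * p for k, p in zip(kernel, patch)) for kernel in kernels])
        grid.append(grid_row)
    return grid


def random_kernels(sampling_rate, block, rng):
    """Stand-in for learned kernels: m = round(SR * B^2) random filters."""
    m = max(1, round(sampling_rate * block * block))
    return [[rng.gauss(0.0, 1.0) for _ in range(block * block)] for _ in range(m)]


rng = random.Random(0)
B, T, SIZE = 4, 4, 8  # toy block size, group length, and frame size (assumptions)
group = [[[rng.random() for _ in range(SIZE)] for _ in range(SIZE)] for _ in range(T)]

# One kernel set per frame position: a high rate for the key frame, a low rate for the others.
kernel_sets = [random_kernels(0.25 if t == 0 else 0.1, B, rng) for t in range(T)]
measurements = [sample_frame(frame, kernels, B) for frame, kernels in zip(group, kernel_sets)]

print([len(y[0][0]) for y in measurements])  # prints: [4, 2, 2, 2]
```

With B = 4, each block has n = 16 pixels, so the key frame keeps 4 measurements per block and each non-key frame keeps 2.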

Spatial DCNN for Initial Recovery
In this subsection, we design a spatial DCNN for the initial recovery of each frame, shown in Figure 3. Inspired by [25], which effectively removes blocking artifacts, all measurements of the blocks from one image are used simultaneously to reconstruct the full image. Unlike typical DCNNs used for classification and segmentation, we remove the pooling layers, which can cause information loss. We first use a convolutional layer with n kernels of size 1 × 1 and stride 1, followed by a reshape layer, to transform the measurements into a feature map with the same dimensions as the final reconstructed frame. Then, we stack 12 convolutional layers to obtain the intermediate reconstructions $\tilde{x}_i$, $i = 1, \ldots, T$. All the convolutional layers except the final one are followed by ReLU activation, and each frame has its own spatial DCNN instead of a universal one.
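The first two layers of the spatial DCNN can be sketched as follows. A 1 × 1 convolution with n = B² kernels is a linear map applied independently to the m measurements of each block, and the reshape layer then places the n outputs of each block back into a B × B patch of the frame. The row-major pixel order, the function name `initial_frame`, and the identity weights used to check the layout are assumptions. In the network the weights are learned and m is smaller than n. The 12 subsequent convolutional layers are not shown.

```python
def initial_frame(measurements, weights, block):
    """Map per-block measurements to pixels (1 x 1 convolution) and tile them into a frame.

    measurements[by][bx]: length-m measurement vector of block (by, bx).
    weights: n x m matrix with n = block * block, one row per 1 x 1 kernel.
    """
    blocks_y, blocks_x = len(measurements), len(measurements[0])
    frame = [[0.0] * (blocks_x * block) for _ in range(blocks_y * block)]
    for by in range(blocks_y):
        for bx in range(blocks_x):
            y = measurements[by][bx]
            # The 1 x 1 convolution is the same linear map at every block position.
            pixels = [sum(w * v for w, v in zip(row, y)) for row in weights]
            # Reshape: write the n values into the block in row-major order.
            for index, value in enumerate(pixels):
                frame[by * block + index // block][bx * block + index % block] = value
    return frame


# Layout check with m = n = 4 and identity weights (illustration only).
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
measurements = [[[1, 2, 3, 4], [5, 6, 7, 8]]]  # a 1 x 2 grid of blocks
frame = initial_frame(measurements, identity, block=2)
print(frame)  # prints: [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
```

Because the whole frame is assembled before the remaining convolutional layers run, those layers can filter across block boundaries.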

Temporal DCNN for Joint Reconstruction
JsrNet uses the whole group of frames as the reference to reconstruct each frame, whether key or non-key, whereas existing frameworks use only key frames as the reference to reconstruct non-key frames. Figure 4 shows the structure of the temporal DCNN for joint reconstruction, which is made up of several basic units (BUs). As shown in Figure 5, each BU consists of a concatenating layer, an inception layer, and a convolutional layer. In the concatenating layer, we concatenate the intermediately reconstructed key frame and the output of the previous layer into a single tensor. Adaptively exploiting temporal correlation is key to improving the overall reconstruction quality in traditional DCVS [17,18]. Therefore, we adopt the inception module [35] in the inception layer to let the DCNN adaptively select the optimal kernel size for exploiting temporal correlation. In the last convolutional layer, 3 × 3 convolution kernels reduce the number of channels from T to T − 1, and the ReLU activation is removed. After stacking 5 BUs, we add a shortcut connection to the plain network, which makes the DCNN easier to train [36]. Finally, we de-concatenate the output to obtain the final reconstructed frames.
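The channel bookkeeping of the temporal DCNN can be traced with a short sketch. It only tracks tensor widths, not the convolutions themselves. It takes from the text that concatenating the key frame with the previous output gives T channels and that the final 3 × 3 convolution maps T channels to T − 1. That the first BU receives the T − 1 intermediate non-key reconstructions is an assumption, and the function names are hypothetical.

```python
def basic_unit_channels(previous_channels, group_length):
    """Trace the channel count through one basic unit (BU)."""
    concatenated = 1 + previous_channels  # key frame + output of the previous layer
    if concatenated != group_length:
        raise ValueError("concatenation is expected to yield T channels")
    inception_out = group_length   # the final convolution receives T channels
    conv_out = group_length - 1    # 3 x 3 convolution reduces T -> T - 1
    return [("concat", concatenated), ("inception", inception_out), ("conv3x3", conv_out)]


def temporal_dcnn_channels(group_length, num_units=5):
    """Trace channel counts through the stacked BUs."""
    channels = group_length - 1  # assumption: input is the T - 1 non-key reconstructions
    trace = []
    for unit in range(1, num_units + 1):
        for name, width in basic_unit_channels(channels, group_length):
            trace.append((unit, name, width))
        channels = trace[-1][2]
    return trace


for unit, name, width in temporal_dcnn_channels(group_length=4):
    print(unit, name, width)
```

For T = 4, every BU widens the tensor to 4 channels and narrows it back to 3, so the key frame is re-injected as a reference at every unit.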

Training Settings
We implemented the proposed JsrNet in the TensorFlow framework on an NVIDIA Titan Xp GPU. The UCF-101 dataset [37] was used to benchmark the proposed network because there is no standard dataset designed for DCVS. Due to limited GPU memory, we cropped the central 160 × 160 patch from each frame and retained only the luminance component. The size of a group of frames was set to 4 and the batch size was set to 16. Groups were randomly split into 80% for training, 10% for validation, and the remaining 10% for testing. The sampling rate of key frames, $SR_K$, was set to 0.25, whereas the sampling rate of non-key frames, $SR_N$, was set to 0.01, 0.04, or 0.1. We adopted the Adam optimizer [38] with a learning rate of 0.0001 to train JsrNet for 50 epochs.
In DCVS, the reconstruction quality of key frames plays a significant role in the overall reconstruction performance, because key frames are allocated high sampling rates so that their high-quality reconstructions can serve as references in the recovery of non-key frames. The reconstruction quality of key frames, however, can easily be degraded by the poor reconstruction quality of non-key frames in joint optimization. Hence, we pre-trained the sampling part and the spatial DCNN for key frames on the VOC dataset [39]. The learning rate was set to 0.0001 and the batch size was set to 128. We pre-trained the subnetwork for 200 epochs.
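The sampling settings above translate into measurement counts as sketched below. The block size of 32 is hypothetical, since the text does not state it, and rounding m to the nearest integer is also an assumption. The crop size, group length, and sampling rates are the values given above.

```python
def measurements_per_block(sampling_rate, block):
    """m = round(SR * n) with n = B^2 (the rounding rule is an assumption)."""
    return max(1, round(sampling_rate * block * block))


def group_sampling_rate(sr_key, sr_non_key, group_length):
    """Average sampling rate of a group with one key frame and T - 1 non-key frames."""
    return (sr_key + (group_length - 1) * sr_non_key) / group_length


BLOCK = 32        # hypothetical block size, not stated in the text
CROP = 160        # side length of the central crop
GROUP_LENGTH = 4  # frames per group
SR_KEY = 0.25     # sampling rate of key frames

print("blocks per frame:", (CROP // BLOCK) ** 2)
print("key-frame measurements per block:", measurements_per_block(SR_KEY, BLOCK))
for sr_non_key in (0.01, 0.04, 0.1):
    print(sr_non_key,
          measurements_per_block(sr_non_key, BLOCK),
          round(group_sampling_rate(SR_KEY, sr_non_key, GROUP_LENGTH), 4))
```

Under these assumptions, a non-key frame at a sampling rate of 0.01 keeps only about 10 of the 1024 values of each block, which is why the reference information from the rest of the group matters for its recovery.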

Performance Comparisons
We compared the proposed JsrNet with four state-of-the-art algorithms experimentally: (1) D-AMP [7], a representative state-of-the-art iterative algorithm developed for CS; (2) ReconNet [22], a dedicated DCNN-based approach for block-based CS; (3) FIR [25], a full-image recovery framework for block-based CS; and (4) MH-BCS-SPL [11], which achieves state-of-the-art performance in DCVS. CSVideoNet [24] is another architecture designed for DCVS and was intended to be compared; however, we could not present its results due to limited GPU memory. The parameters of these methods were left at their default values for a fair comparison. We adopted PSNR and structural similarity (SSIM) as objective measures of reconstruction performance (Table 1). The main reason was that they compressed and recovered signals in a block-wise manner but ignored edge continuity between blocks. Benefiting from exploiting temporal correlation instead of treating each frame independently, MH-BCS-SPL slightly alleviated the blocking artifacts and achieved acceptable performance. Although FIR reduced the blocking artifacts because all the measurements of the blocks from one image were used simultaneously to reconstruct the full image, it failed to preserve image details. JsrNet clearly achieved the best performance. Several factors contributed to this improvement. First, combining the advantages of FIR and MH-BCS-SPL, JsrNet used the whole group of frames as the reference to reconstruct each frame, whether key or non-key. Second, JsrNet applied the idea of exploiting complementary information between frames to joint sampling by adopting learnable convolutions that sample multiple frames jointly and simultaneously in the encoder.
Table 2 compares the average reconstruction speed per frame. Compared with MH-BCS-SPL and D-AMP, JsrNet reconstructed frames nearly 1000 times faster. This is because DCNN-based approaches reconstruct video sequences via a single forward pass instead of solving an iterative optimization problem. More importantly, the speed of DCNN-based approaches depends only on the model capacity, whereas the speed of traditional approaches depends on the sampling rate. Compared with ReconNet and FIR, which treat each frame independently, JsrNet reconstructed frames simultaneously and achieved the best performance.

Conclusions
A DCNN-based learning framework, named "JsrNet", is proposed with the aim of applying DCVS in real-time applications. JsrNet uses the whole group of frames as the reference to reconstruct each frame, whether key or non-key. Moreover, JsrNet applies the idea of exploiting complementary information between frames to joint sampling by adopting learnable convolutions that sample multiple frames jointly and simultaneously in an encoder. By exploiting spatial-temporal correlation in both sampling and reconstruction, JsrNet achieves satisfactory reconstruction quality without blocking artifacts. Moreover, the non-iterative nature of DCNNs leads to low computational complexity, making JsrNet a promising candidate for source-limited, real-time scenarios. In the future, we will focus on utilizing generative models for the representation and reconstruction of video sequences.