1. Introduction
Hyperspectral sensors collect data as a set of images with high spatial and spectral resolutions, with each spectral band image covering a narrow wavelength range of the electromagnetic spectrum. The large quantities of hyperspectral data present great challenges in storage, transmission, and analysis; as a consequence, data compression is becoming a common process for such imagery. In general, compression can be either lossy or lossless. Lossy compression typically provides lower bit rates but incurs loss of the original data. On the other hand, lossless compression guarantees perfect reconstruction of the original data, albeit at higher bit rates. This work focuses on improving the performance of onboard predictive lossless compression of hyperspectral imagery. The techniques are useful for many precision-demanding applications where strictly no data loss is acceptable [1,2,3]. Below is a brief survey of existing work on the subject.
Lossless compression of hyperspectral images has been performed very successfully using prediction-based methods. The Context-based Adaptive Lossless Image Codec (3D-CALIC) [4] and its variant M-CALIC [5] consider both inter-band and intra-band correlations to reduce prediction errors. The Lookup Table (LUT) approach in [6] exploits the calibration-induced data correlation specific to hyperspectral imagery to facilitate accurate prediction. This scheme was enhanced by a Locally Averaged Inter-band Scaling (LAIS-LUT) approach using a band-adaptive quantization factor [7].
Transform-based approaches, such as the Discrete Wavelet Transform (DWT) and Principal Component Analysis (PCA), aim to exploit the relations in the spectral and spatial dimensions through a redundancy-reducing transform. The problem of selecting an appropriate signal representation for transform-based compression is equivalent to the feature extraction problem in machine learning. Recently, Regression Wavelet Analysis (RWA) [8,9] was introduced for lossless compression by exploiting the relationships among wavelet-transformed components, and it outperforms the traditional approaches.
Low-complexity filter-based compressors, such as Fast Lossless (FL) [10] and Spectral-oriented Least Squares (SLSQ) [11], utilize linear models to decorrelate the co-located pixels from different spectral bands. An optimized version of the Fast Lossless (FL) algorithm developed by the NASA Jet Propulsion Laboratory (JPL) has been selected as the core predictor in the new Consultative Committee for Space Data Systems (CCSDS) standard for multispectral and hyperspectral data compression [12]. In addition, traditional Wiener, Kalman, and least-mean-square filters have been adopted for hyperspectral image compression; examples include the Backward Pixel Search (BPS) [13], Kalman Spectral Prediction (KSP) [14], and Maximum Correntropy Criterion-based Least Mean Square (MCC-LMS) [15] algorithms. Similar to linear predictors, nonlinear predictors such as Context-based Conditional Average Prediction (CCAP) [16] and Two-Stage Prediction (TSP) [17] have also brought improvements in compressed bit rates.
High-complexity compressors such as Clustered Differential Pulse Code Modulation (C-DPCM) have been studied in [18]; this approach partitions the data into several clusters with similar statistics and applies separate least-squares-optimized linear predictors to the different clusters. Ref. [19] presents an Adaptive Prediction Length C-DPCM (C-DPCM-APL) method, a brute-force variant of the C-DPCM approach in which the number of previous bands selected for prediction is determined by a brute-force search ranging from 10 to 200 bands in steps of 10. Two other C-DPCM variants also use a large portion of the previous bands for prediction: the Spectral Relaxation Labeled Prediction (SRLP) and Spectral Fuzzy Matching Pursuits (SFMP) in [20]. However, the computational complexity of these clustering-based compressors is very high.
Recently, deep-learning-based approaches have been widely applied to lossy and lossless hyperspectral data compression. For lossy compression, [21,22,23,24] focused on designing deep networks to reconstruct the original imagery with a reasonable loss of information. These models have an encoder-decoder structure, where representative features are usually extracted by an autoencoder (AE) or a convolutional neural network (CNN). Ref. [25] proposed an onboard CNN-based lossy compressor, where the neural network is pre-trained on other datasets in a ground-based setting. For lossless compression, deep neural networks [26,27] and recurrent neural networks (RNN) [28] have been proposed to compress hyperspectral data by appropriately pre-training the networks.
Nonetheless, the above-mentioned deep-learning methods are not suitable for the challenging task of onboard lossless compression of hyperspectral images. The main reason is that deep learning relies on the availability of data from all the spectral bands during training or clustering. However, in many real-time compression applications, neither the entire original dataset nor the decompressed dataset is normally available, or it is only partially available. Furthermore, a pre-trained model cannot in general adapt well to new datasets, which necessitates model retraining for each new dataset.
To address those limitations, we propose an adaptive-filtering-based Concatenated Shallow Neural Network (CSNN) model for predictive lossless compression. The contributions of the proposed method are twofold: (1) The CSNN is designed as an adaptive prediction filter rather than a training-based network. Thus, the model need not be pre-trained before being used for pixel prediction. To the best of our knowledge, this might be the first neural-network-based method requiring no training proposed for hyperspectral data compression. (2) The shallow, two-hidden-layer structure of the proposed model is capable of capturing both spatial and spectral correlations to provide more accurate pixel prediction, with only a few contexts from four previous bands. Consequently, its computational complexity is much lower than that of other deep-learning-based methods.
The rest of the paper is organized as follows. Section 2 discusses context selection for prediction and provides an information-theoretic analysis of the prediction performance. Section 3 describes the proposed method in detail. Simulation results are given in Section 4. Finally, conclusions are drawn in Section 5.
2. Context Selection and Prediction Performance Analysis
Let ${s}_{x,y,z}$ denote the pixel value at line x and column y in band z of a hyperspectral image cube. Rather than directly encoding the value of ${s}_{x,y,z}$, a predictive compression algorithm uses previously decoded pixel values to compute a predicted pixel value ${\widehat{s}}_{x,y,z}$. Then the prediction residual, $({s}_{x,y,z}-{\widehat{s}}_{x,y,z})$, which is the difference between the actual pixel value and its estimate, is encoded losslessly using an entropy coder.
2.1. Context Selection
The predictor attempts to exploit correlations between the contexts and the current pixel value. Thus, the first step is to select the contexts appropriately. Typically, neighboring pixels tend to be correlated. Considering the fact that spectral correlations tend to be much stronger than spatial correlations, the FL and MCC-LMS methods use only spectral contexts, while the CCAP, TSP, and M-CALIC methods combine spatial contexts with spectral contexts for prediction. Following the practice in [10], as a pre-processing step, we perform a simple local averaging of neighboring pixels to better exploit the spatial correlations (Equation (1)).
Figure 1 shows an example of context selection. The spatial context is selected from the current band and two previous bands. For each band, four neighboring pixels are reshaped into a 1-D vector $\left\{{\overline{s}}_{x,y-1,z},{\overline{s}}_{x-1,y-1,z},{\overline{s}}_{x-1,y,z},{\overline{s}}_{x-1,y+1,z}\right\}$; note that the selected pixels are averaged values. Thus, the combined spatial context, denoted as ${C}_{t}$, is a 3 × 4 matrix. If a pixel lies on the image boundary, we can still use the four nearest pixels to construct the context vector. For example, for pixels in the first row of the image, four previous pixels in the same row are selected as the context: $\left\{{\overline{s}}_{x,y-4,z},{\overline{s}}_{x,y-3,z},{\overline{s}}_{x,y-2,z},{\overline{s}}_{x,y-1,z}\right\}$; we repeat previous pixel values to keep the size of the context vector equal to four even if $y<4$. For pixels in the first column, the spatial context vector becomes $\left\{{\overline{s}}_{x-1,y,z},{\overline{s}}_{x-1,y+1,z},{\overline{s}}_{x-1,y+2,z},{\overline{s}}_{x-1,y+3,z}\right\}$. Similarly, the context vector for pixels in the last column of the image can be written as $\left\{{\overline{s}}_{x,y-2,z},{\overline{s}}_{x,y-1,z},{\overline{s}}_{x-1,y-1,z},{\overline{s}}_{x-1,y,z}\right\}$. In addition, the four pixels from previous bands co-located with the current pixel are chosen as the spectral context, denoted as ${C}_{l}=\left\{{\overline{s}}_{x,y,z-4},{\overline{s}}_{x,y,z-3},{\overline{s}}_{x,y,z-2},{\overline{s}}_{x,y,z-1}\right\}$. Note that for the prediction of pixels in the spectral bands where $z<4$, we use spatial contexts only.
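As a concrete illustration, the context construction above can be sketched in Python. This is a minimal sketch, not the authors' implementation; the `img` array (height × width × bands, holding the locally averaged values) and the index conventions are assumptions of this sketch.

```python
import numpy as np

def spatial_context(img, x, y, z):
    """Four causal neighbors of (x, y) in band z, with the paper's
    boundary handling (index conventions are assumptions of this sketch)."""
    h, w, _ = img.shape
    if x == 0:   # first row: previous pixels in the same row (repeat when y < 4)
        cols = [max(y - k, 0) for k in (4, 3, 2, 1)]
        return np.array([img[x, c, z] for c in cols])
    if y == 0:   # first column: pixels from the row above
        cols = [min(y + k, w - 1) for k in (0, 1, 2, 3)]
        return np.array([img[x - 1, c, z] for c in cols])
    if y == w - 1:   # last column
        return np.array([img[x, y - 2, z], img[x, y - 1, z],
                         img[x - 1, y - 1, z], img[x - 1, y, z]])
    return np.array([img[x, y - 1, z], img[x - 1, y - 1, z],
                     img[x - 1, y, z], img[x - 1, y + 1, z]])

def contexts(img, x, y, z):
    """Build the 3x4 spatial context C_t (current + two previous bands)
    and the length-4 spectral context C_l (four previous bands)."""
    C_t = np.stack([spatial_context(img, x, y, b) for b in (z, z - 1, z - 2)])
    C_l = img[x, y, z - 4:z]   # co-located pixels of the 4 previous bands
    return C_t, C_l
```

A call such as `contexts(img, x, y, z)` for an interior pixel with `z >= 4` returns the 3 × 4 matrix $C_t$ and the length-4 vector $C_l$ used below.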
2.2. Prediction Performance Analysis
The performance of prediction-based algorithms largely depends on the choice of the context. Information-theoretic analysis can provide an upper bound on the amount of compression achievable with a specific context. The analysis employs the concept of conditional entropy, as a measure of information gain, based on a simple model of the prediction process [29].
Let ${X}_{i}$ be a two-dimensional spectral image of the hyperspectral dataset, with $i\in \left\{1,2,\dots ,K\right\}$, where $K$ is the total number of spectral bands in the data cube. We reshape the pixel values of ${X}_{i}$ into a vector; the occurrences of pixels in the vector can then be viewed as a random process. For a hyperspectral image with 16 bits/pixel, the first-order statistical properties of ${X}_{i}$ are defined in terms of the probabilities ${p}_{j}=P(x=j),\ j\in \varphi$, where $\varphi$ is the set of distinct pixel values in ${X}_{i}$, with the range $\left[0,{2}^{16}-1\right]$. The entropy of the source can then be written as [30]:

$$H\left({X}_{i}\right)=-\sum_{j\in \varphi }{p}_{j}\,{\mathrm{log}}_{2}\,{p}_{j},$$

where $H\left({X}_{i}\right)$ is the minimum bit rate that lossless compression can possibly achieve using an ideal entropy coder.
The bit rate of ${X}_{i}$ can be further reduced by exploiting the first-order statistical information of the contexts. The entropy $H\left({X}_{i}\right)$ can be easily extended to the conditional entropy of band ${X}_{i}$ given the spatial context ${C}_{t}$ and the spectral context ${C}_{l}$:

$$H\left({X}_{i}|{C}_{t},{C}_{l}\right)=-\sum_{{C}_{t},{C}_{l}}P\left({C}_{t},{C}_{l}\right)\sum_{j\in \varphi }{p}_{j|{C}_{t},{C}_{l}}\,{\mathrm{log}}_{2}\,{p}_{j|{C}_{t},{C}_{l}},$$

where ${p}_{j|{C}_{t},{C}_{l}}=P(x=j|{C}_{t},{C}_{l})$ is the conditional probability. By applying the chain rule, the conditional entropy can be further rewritten as:

$$H\left({X}_{i}|{C}_{t},{C}_{l}\right)=H\left({X}_{i},{C}_{t},{C}_{l}\right)-H\left({C}_{t},{C}_{l}\right).$$

The conditional entropy gives the minimum achievable bit rate of ${X}_{i}$ given the contexts ${C}_{t}$ and ${C}_{l}$. In general, by exploiting the spectral and spatial correlations, we have $H\left({X}_{i}|{C}_{t},{C}_{l}\right)<H\left({X}_{i}\right)$.
In practice, as stated in [14], conditional entropy estimation becomes very inaccurate when two or more previous bands are used for prediction. This is because estimating the conditional entropy requires estimating the joint entropy by counting occurrence frequencies over a very large alphabet space, i.e., ${\left({2}^{16}\right)}^{{N}_{l}+{N}_{t}+1}$ in our case, where ${N}_{t}$ and ${N}_{l}$ are the numbers of bands used for selecting the spatial and spectral contexts. As a consequence, a band might not contain enough pixels to provide a statistically meaningful estimate of the probabilities. Ref. [14] proposes to use the bit-planes of ${X}_{i}$ as a set of 16 binary sources to greatly reduce the alphabet size. However, results obtained from the binary sources might not be representative of the actual alphabet sources, since the correlations between the bit-planes cannot be ignored. To address this problem, we propose to use a neural network to extract features from the selected contexts, which are more representative of the context sources.
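For intuition, the first-order entropy $H(X_i)$ discussed above can be estimated empirically from a single band as sketched below; extending the same counting to joint spatial/spectral contexts would require the infeasibly large alphabet noted above, which motivates the feature-extraction approach.

```python
import numpy as np

def first_order_entropy(band):
    """Empirical first-order entropy H(X_i) of one spectral band,
    in bits per pixel: H = -sum_j p_j * log2(p_j)."""
    _, counts = np.unique(np.asarray(band).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```

For a band with two equiprobable symbols, this returns 1 bit/pixel, the ideal-coder lower bound for that source.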
3. Proposed Method
The proposed approach was motivated by the state-of-the-art CCSDS-123 standard onboard compressor [12], which has proved very efficient in lossless compression [10,12]. In CCSDS-123, the core FL algorithm is mainly a gradient-based adaptive filter. The predicted value ${\widehat{s}}_{t}$ and the prediction error ${\Delta}_{t}$ can be expressed as:

$${\widehat{s}}_{t}={W}_{t}^{T}{U}_{t},\qquad {\Delta}_{t}={s}_{t}-{\widehat{s}}_{t},$$

where ${W}_{t}$ and ${U}_{t}$ are the weight vector and the input context vector. The weight vector is then updated adaptively based on ${\Delta}_{t}$:
where $clip$ denotes the clipping of a real number to the range $\left[{w}_{min},{w}_{max}\right]$, $sgn$ is the sign function defined as $sgn(x)=\frac{d}{dx}\left|x\right|,\ x\ne 0$, ${\rho}_{t}$ is the weight update scaling exponent, and $\zeta$ is the inter-band weight exponent offset in the range $-6\le \zeta \le 5$.
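The FL-style predict-then-update cycle can be illustrated with a simplified sign-LMS step. This is a stand-in for, not the exact, CCSDS-123 update rule; the learning rate `mu` and the clip bounds are illustrative values.

```python
import numpy as np

W_MIN, W_MAX = -6.0, 6.0   # illustrative clip bounds, not the standard's

def predict_and_update(w, u, s, mu=2.0 ** -6):
    """One step of a simplified sign-LMS predictor: predict s from the
    context u, then nudge the weights along sign(error) * u and clip."""
    s_hat = float(w @ u)               # s_hat_t = W_t^T U_t
    delta = s - s_hat                  # prediction error Delta_t
    w_new = np.clip(w + mu * np.sign(delta) * u, W_MIN, W_MAX)
    return s_hat, delta, w_new
```

The sign of the error, rather than its magnitude, drives the update, which is what makes this family of filters cheap enough for onboard use.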
Our goal here is to improve the traditional gradient-based adaptive filter with a neural network. The main idea is that training a neural network can be interpreted as a nonlinear filtering procedure. Compared to the FL algorithm, the corresponding prediction value and error can be rewritten in a neural-network setting:

$${\widehat{s}}_{t}={F}_{net}\left({U}_{t}\right),\qquad {\Delta}_{t}={F}_{loss}\left({s}_{t},{\widehat{s}}_{t}\right),$$

where ${F}_{net}$ and ${F}_{loss}$ are the designed network and the loss function. The weights and biases are then updated by batch gradient descent with a small learning rate:

$${w}_{t+1}={w}_{t}-\eta \frac{\partial {F}_{loss}}{\partial {w}_{t}}.$$

As we can see, the prediction and updating scheme of the neural network are very similar to those of the FL algorithm, which indicates that the neural network can play the role of a nonlinear adaptive filter.
With onboard data compression (with limited data for pre-training the network) in mind, we propose a filtering-based concatenated shallow neural network (CSNN) for predictive lossless compression. The CSNN behaves as an adaptive filter, updating the network parameters on-the-fly with each incoming batch. Specifically, the input samples (following a zigzag scanning order) flow through the network only once. The prediction error of each sample is recorded simultaneously for further mapping (to nonnegative integers) and entropy coding. The weights and biases are adjusted for each batch according to the prediction errors. Algorithm 1 provides more details of the proposed adaptive scheme.
Algorithm 1 Filtering-based CSNN adaptive prediction.
1: Initialize the neural networks.
2: Calculate the local sample mean using Equation (1).
3: for t = 1:N do %% N is the number of batches; the batch size equals the number of columns in the spectral band.
4: Select the spatial and spectral contexts ${C}_{t}$ and ${C}_{l}$ for each pixel in the batch, and prepare the data pair $\left\{\left({C}_{t},{C}_{l}\right),{s}_{x,y,z}\right\}$.
5: Extract spatial and spectral features ${F}_{t}$ and ${F}_{l}$ from the contexts using one-layer shallow neural networks.
6: Concatenate the features: $F=\left[{F}_{t},{F}_{l}\right]$.
7: Predict the pixel values based on $F$ using Equation (11).
8: Calculate and record the prediction error ${e}_{x,y,z}$ for further mapping and coding.
9: Calculate the weight updates $\Delta w$ using Equation (13).
10: Adjust the parameters in every batch: $w=w+\Delta w$.
11: end for
We design the prediction method in such a way that the training procedure associated with most conventional neural-network-based algorithms is not required. The filtering-based network aims to obtain the loss of each input sample rather than a well-trained network. Moreover, training a neural network relies on the availability of a large amount of training data, as well as an iterative optimization process of high computational complexity. In contrast, the proposed method filters each input sample only once, in a zigzag scanning order band by band. Thus, the computational time of the filtering-based network is significantly lower than that of a conventional training-based neural network. In a nutshell, the filtering-based CSNN provides a robust solution for a wide variety of hyperspectral datasets, without any pre-training or prior knowledge needed.
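The one-pass, batch-per-column scheme of Algorithm 1 might be sketched as follows. Here `model` (with `predict`/`update` methods) and `context_fn` are placeholders for the CSNN and for the context selection of Section 2.1, and a simple raster scan stands in for the paper's zigzag order.

```python
import numpy as np

def filter_compress(cube, model, context_fn):
    """One-pass adaptive prediction: each batch (one column of a band,
    as in Algorithm 1) flows through the model exactly once; prediction
    errors are recorded for mapping and entropy coding."""
    h, w, bands = cube.shape
    residuals = np.zeros_like(cube)
    for z in range(4, bands):          # bands z < 4 use spatial contexts only
        for y in range(w):             # one batch = one column of the band
            batch = [context_fn(cube, x, y, z) for x in range(h)]
            preds = model.predict(batch)
            errors = cube[:, y, z] - np.rint(preds).astype(cube.dtype)
            residuals[:, y, z] = errors   # kept for mapping + entropy coding
            model.update(batch, errors)   # adjust weights once per batch
    return residuals
```

The key point is that `model.update` is called exactly once per batch, so no sample is revisited: filtering, not training.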
3.1. Concatenated Shallow Neural Network (CSNN)
Figure 2 illustrates the framework of the proposed method. The processing flow of the proposed method consists of two channels, i.e., the spatial channel and spectral channel. Both channels extract representative features from the corresponding contexts. The features from these two channels are then combined to obtain the final predicted pixel values.
The spatial contexts and spectral contexts tend to correlate with the current pixel to be predicted in very different ways. Conventional prediction methods either directly combine these two types of contexts or use the spectral contexts only. In this work, we introduce two parallel shallow neural networks to learn the spatial and spectral correlations separately, in light of the demonstrated ability of neural networks on many ill-posed tasks.
Figure 3 shows the structure of the concatenated shallow neural network for pixel prediction. To extract spatial correlations, the hidden neurons connected to the spatial context convert the input into a series of feature maps via a nonlinear mapping,

$${F}_{t}=ReLU\left({w}_{t}{C}_{t}+{b}_{t}\right),$$

where ${F}_{t}$ is the extracted spatial feature vector, and ${w}_{t}$ and ${b}_{t}$ denote the corresponding weights and bias. A Rectified Linear Unit, $ReLU(x)=max\left(0,x\right)$, is used as the activation function for the spatial channel. In this work, the 3 × 4 spatial context matrix is converted to a 1 × 5 feature vector by means of five connected neurons.
For the spectral channel, exactly the same number of hidden neurons is used to extract the spectral feature. However, the ReLU activation function is not used here to obtain the spectral feature ${F}_{l}$,

$${F}_{l}={w}_{l}{C}_{l}+{b}_{l},$$

based on our observation in an extensive empirical study that the spectral contexts tend to be correlated in a less nonlinear fashion than the spatial contexts. Similarly, ${w}_{l}$ and ${b}_{l}$ denote the corresponding weights and bias.
Note that shallow neural networks, as opposed to deep neural networks, are employed for both channels. Although deep networks are capable of capturing high-dimensional features, they may suffer from overfitting. To predict a spectral band whose context changes rapidly, the optimization of a deep network might be trapped in local optima and thus fail to react promptly. Besides, since a large number of weights and biases need to be adjusted by the prediction errors using back propagation, training a deep network can be time-consuming.
The extracted features from the two channels are concatenated together, denoted as $F=\left[{F}_{t},{F}_{l}\right]$. The combined features jointly decide the final predicted pixel value through a linear output layer,

$${\widehat{s}}_{x,y,z}={w}_{f}F+{b}_{f},$$

where ${w}_{f}$ and ${b}_{f}$ are the weights and bias of the final layer. Thus, the CSNN model contains two hidden layers: the first hidden layer extracts the spatial and spectral features, and the second, concatenated hidden layer performs the final pixel value prediction.
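A minimal NumPy sketch of this forward pass follows. The weight shapes match the layer sizes stated in the text (five neurons per channel, 3 × 4 spatial and length-4 spectral inputs); the random initialization here is purely illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter shapes: 12 spatial inputs (3x4 context, flattened)
# and 4 spectral inputs, each mapped to 5 hidden features, then one output.
w_t, b_t = rng.normal(size=(5, 12)) * 0.1, np.zeros(5)
w_l, b_l = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
w_f, b_f = rng.normal(size=10) * 0.1, 0.0

def csnn_predict(C_t, C_l):
    """Forward pass: ReLU spatial channel, linear spectral channel,
    concatenation, then a single linear output neuron."""
    F_t = np.maximum(0.0, w_t @ C_t.ravel() + b_t)  # spatial features (ReLU)
    F_l = w_l @ C_l + b_l                           # spectral features (linear)
    F = np.concatenate([F_t, F_l])                  # F = [F_t, F_l]
    return float(w_f @ F + b_f)                     # predicted pixel value
```

With fewer than 20 neurons in total, the whole model reduces to three small matrix-vector products per pixel.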
Figure 4 shows the first-order entropies of the prediction residuals for a segment of the IP dataset, obtained with the joint spatial/spectral contexts and with spectral contexts only, respectively. We can see that the combined contexts allow for more accurate prediction (i.e., lower entropy values). The improvement in prediction is especially pronounced for band images with large pixel-intensity variations, for example, an entropy reduction of 0.6 bit for band 148 and 0.9 bit for band 154.
The CSNN is a typical end-to-end fully connected neural network, with the weights and biases updated by the Adadelta optimizer [31] using the ${L}_{1}$ loss function:

$${L}_{F}=\left|{s}_{x,y,z}-{\widehat{s}}_{x,y,z}\right|.$$
Note that the ${L}_{1}$ loss function was adopted because our study found that it leads to lower residual entropies than the ${L}_{2}$ loss function, which favors quality assessment based on mean square errors. If we let ${g}_{t}=\frac{\partial {L}_{F}\left(t\right)}{\partial {w}_{t}}$ be the gradient of the parameters at the $t$-th input, the update $\Delta {w}_{t}$ can be calculated as follows:

$$\Delta {w}_{t}=-\frac{RMS{\left[\Delta w\right]}_{t-1}}{RMS{\left[g\right]}_{t}}\,{g}_{t},$$

where RMS is the root mean square, defined as $RMS\left[{g}_{t}\right]=\sqrt{E\left[{g}_{t}^{2}\right]+\epsilon }$.
$E\left[{g}_{t}^{2}\right]$ is an exponentially decaying average of the squared gradients:

$$E\left[{g}_{t}^{2}\right]=\rho E\left[{g}_{t-1}^{2}\right]+\left(1-\rho \right){g}_{t}^{2},$$

where $\rho$ is a decay constant, and $\epsilon$ is added to the numerator RMS to ensure that progress continues to be made even if the previous updates become small. We set $\rho =0.95$ and $\epsilon ={e}^{-6}$ in the simulations. The code of the CSNN model can be found at [32].
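A single Adadelta update step under these settings might look as follows; this is a sketch of the standard Adadelta rule with the paper's constants, and the accumulator dictionary `state` is an implementation detail of the sketch.

```python
import numpy as np

RHO, EPS = 0.95, np.e ** -6   # decay constant and epsilon from the text

def adadelta_step(w, g, state):
    """One Adadelta update: accumulate E[g^2], scale the gradient by
    RMS[dw]/RMS[g], apply the update, then accumulate E[dw^2]."""
    state["Eg2"] = RHO * state["Eg2"] + (1 - RHO) * g ** 2
    rms_g = np.sqrt(state["Eg2"] + EPS)
    rms_dw = np.sqrt(state["Edw2"] + EPS)
    dw = -(rms_dw / rms_g) * g
    state["Edw2"] = RHO * state["Edw2"] + (1 - RHO) * dw ** 2
    return w + dw
```

Because the step size adapts per parameter from the two running averages, no global learning rate needs to be tuned, which suits a compressor that cannot be re-tuned per dataset.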
3.2. Entropy Coding
After prediction, all the residuals are mapped to nonnegative values [33] and then losslessly coded into a bitstream using a Golomb-Rice codec (GRC) [34]:

$$M\left(n\right)=\begin{cases}2n, & n\ge 0,\\ -2n-1, & n<0,\end{cases}$$

where $n$ is the value of the prediction residual. The GRC is selected as the entropy coder due to its computational efficiency. We observed that arithmetic coding [35] can offer slightly lower bit rates, albeit at a much higher computational cost.
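The residual mapping together with a basic Golomb-Rice codeword can be sketched as below; the overlap-and-interleave mapping is the standard one, while the divisor parameter `k` is assumed to be chosen externally, and the bit-string output is for illustration only.

```python
def map_residual(n):
    """Map a signed residual to a nonnegative integer
    (standard overlap-and-interleave mapping)."""
    return 2 * n if n >= 0 else -2 * n - 1

def golomb_rice(v, k):
    """Golomb-Rice codeword of nonnegative v with divisor 2^k:
    unary quotient, a '0' terminator, then the k-bit binary remainder."""
    q, r = v >> k, v & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b") if k else "1" * q + "0"
```

Small residuals map to short codewords, so the better the predictor, the shorter the bitstream.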
Besides the GR codewords, other side information needs to be transmitted to the decoder in order to recover the original data losslessly. For example, the weights and biases that initialize the neural networks need to be encoded too. Since the CSNN model has fewer than 20 neurons, such side information is negligible and is therefore not included in the total bit rates reported in the following.
4. Simulation Results
We tested the proposed method on five public hyperspectral datasets [36] and the standard CCSDS test datasets. We selected 20 datasets from the CCSDS test sets covering different collecting instruments: the Atmospheric Infrared Sounder (AIRS), the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), the SWIR Full Spectrum Imager (SFSI), and the Compact Airborne Spectrographic Imager (CASI). The test results are compared with the state-of-the-art CCSDS-123 method. Before presenting the results, we first provide a convergence analysis of the proposed model and discuss the sensitivity of the compression results to parameter initialization.
4.1. Convergence Analysis and Parameter Sensitivity
Deep neural networks are the mainstream technique for many machine-learning tasks. Despite their success, theoretically explaining why deep neural networks can be efficiently trained in practice using simple gradient-based methods remains an open problem. Filtering the input data with the CSNN model is similar to the training process, which requires optimizing a complex non-convex problem. Over the past few years, much research has been devoted to this problem [37,38,39,40,41]. Based on this existing work, we found that the convergence of the neural network is closely related to the choice of hyperparameters such as the number of hidden layers, the number of hidden units, and the learning rate. Although training a network may seem intractable, [37] provides some practical guidelines for determining the hyperparameters of the model.
The convergence of neural networks is studied in [39,40]. Early convergence of the filtering-based network provides a much smaller prediction loss and, in return, lower compression bit rates. To demonstrate the convergence of the filtering-based CSNN model, we run it on four public hyperspectral datasets: Indian Pines (IP), Pavia University (PU), Salinas (SAL), and Botswana (BOT).
Figure 5 shows the prediction loss in Mean Square Error (MSE) across all the spectral bands.
Clearly, the prediction loss curves of the four datasets converge to relatively small error values. For example, the losses on the PU dataset decrease to a very low level after filtering the first 20 bands. For the other datasets, even if the losses fluctuate intensely at the beginning, as with the IP dataset, they converge to small values after half of the data have been filtered. This demonstrates the ability of the proposed CSNN model to find a fairly good solution with only a single pass over the data. We believe that the similarity (correlation) of data samples across spectral bands helps accelerate the convergence of the model.
Another factor that can influence convergence is the initialization of the network parameters. In our simulations, we adopt Xavier initialization [42], one of the most commonly used initialization methods for neural networks, in which all weights are initialized independently from a zero-mean distribution whose variance is scaled by the numbers of input and output units. It is interesting to see how different weight initialization methods affect the prediction loss and entropy. Thus, we conduct several experiments by initializing the network with weight values ranging from 0.1 to 0.9.
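For reference, a common form of Xavier (Glorot) uniform initialization is sketched below, using the illustrative layer sizes implied by the text (five hidden neurons per channel); the seed and usage are assumptions of this sketch.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng(42)):
    """Xavier/Glorot uniform initialization: zero-mean weights drawn
    from U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)),
    giving variance 2 / (n_in + n_out)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

w_t = xavier_uniform(12, 5)   # spatial channel: 3x4 context -> 5 features
w_l = xavier_uniform(4, 5)    # spectral channel: 4 context pixels -> 5 features
```

Scaling the variance by fan-in and fan-out keeps activations and gradients at a comparable magnitude across layers, which matters even for a two-hidden-layer model filtered in a single pass.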
Figure 6 shows the variations of the MSE values and compression bit rates on four different datasets using the proposed method. Note that the weight parameters include the weights and biases in each layer.
We can see that the MSE values and compressed bit rates do not appear to be very sensitive to changes in the weight parameters. For example, the MSE values of the IP dataset range from 89 to 90 under different initialized weight values. For the other datasets, the MSE values and bit rates all fluctuate within a small range. These results are consistent with the conclusion in [40] that gradient-based linear neural networks have a strong convergence ability. They also indicate that the filtering-based CSNN is robust to the initial conditions.
4.2. Simulations on Five Public Hyperspectral Datasets
The five hyperspectral image datasets include IP, PU, SAL, BOT, and Kennedy Space Center (KSC). All these datasets contain 12-bit non-calibrated raw images. More detailed information on these datasets is given in Table 1.
We select three representative adaptive filtering methods to benchmark the proposed method:
As shown in Table 2, the proposed CSNN method achieves the lowest bit rates on all five hyperspectral image datasets, while the LMS method has the highest. Specifically, the CSNN method improves on the FL and MCC-LMS methods by nearly 0.2 bit/pixel and 0.25 bit/pixel on average, respectively, with a more significant reduction of 0.58 bit/pixel over the LMS method. The CSNN appears to provide more efficient compression by jointly exploiting the spatial and spectral correlations of the contexts.
In terms of the prediction residuals, Figure 7 shows that the CSNN method consistently achieves the lowest entropies on most bands of the IP, PU, SAL, and BOT datasets, with more obvious improvement over the last 50 bands. For example, in Figure 7b, the curves of the last 50 bands of the MCC-LMS and FL methods almost overlap, while the CSNN curve goes much lower. The curves for the KSC dataset in Figure 7d exhibit significant fluctuations for all methods. This dataset contains a substantial amount of impulse noise, which may cause many sudden changes in the contexts; for example, the residual entropy fluctuates rapidly after band 100. Still, the proposed method appears to be the most stable among the compared methods. Considering also the compression bit rates of the KSC dataset in Table 2, we can see that the adaptive filtering methods (including the proposed method) are not robust enough for noisy data.
4.3. Simulations on CCSDS Test Datasets
The CCSDS hyperspectral and multispectral test corpus [43] is publicly available for hyperspectral image compression testing and evaluation. The corpus includes images from many different instruments. To diversify the testing datasets, we selected 20 hyperspectral datasets from the AVIRIS, AIRS, SFSI, and CASI instruments for further evaluation of the algorithms. Seven hyperspectral images are from the AVIRIS instrument, including five 16-bit non-calibrated Yellowstone scenes and two 12-bit scenes. The AIRS instrument has ten scenes; each scene has 1501 spectral bands and 90 lines with a width of 153 pixels. The remaining three images are from the SFSI and CASI instruments. Table 3 provides detailed information about the selected datasets. As an example, the grayscale versions of the five AVIRIS Yellowstone scenes are shown in Figure 8.
Lossless hyperspectral image compression techniques fall into two main categories: transform-based compression and prediction-based compression. Transform-based techniques achieve compression by taking advantage of a frequency-domain representation of the images (e.g., based on wavelet transforms). Predictive compression, on the other hand, operates directly in the pixel domain, followed by entropy coding of the prediction residuals (e.g., using Golomb-Rice codes). We selected a total of seven lossless compressors from both categories: JPEG2000 [3], JPEG-LS [2], LUT [6], European Space Agency (ESA) [44], CCSDS-122 [45], MCC-LMS [15], and CCSDS-123 [12]. Note that the state-of-the-art predictive compressor CCSDS-123 is included for comparison.
Table 4 provides the lossless coding results for all the images in terms of bit rate (in bits per pixel per component, bpppc). The compression efficiency of each algorithm can be appreciated by observing how far its resulting bit rate falls below the bit depth of the original images. We can see that the overall performance of the proposed filtering-based CSNN method exceeds the other state-of-the-art methods included in the comparison.
For the 16 bpppc non-calibrated AVIRIS Yellowstone scenes, the filtering-based CSNN model outperforms CCSDS-123 and MCC-LMS by 0.12 bpppc and 0.17 bpppc on average, respectively. Compared with transform-based compression, the coding gain of CSNN over JPEG2000 and CCSDS-122 is 0.52 bpppc and 0.60 bpppc, respectively.
For the 12 bpppc non-calibrated AVIRIS scenes (Hawaii and Maine), the coding performance of the proposed model is comparable with that of CCSDS-123. Specifically, the compressed bit rate of CSNN is 0.03 bpppc higher than that of CCSDS-123, but much lower than those of the other methods. Compared with other 12 bpppc images from different instruments, the Hawaii and Maine scenes have relatively small pixel values: the average pixel values of the Hawaii and Maine scenes are 267.10 and 328.75, respectively, whereas the average pixel value of the AIRS-gran9 image is 2091. This suggests that the filtering-based CSNN obtains more compression gain on images with larger pixel values. These results are also consistent with the behavior of gradient-based networks, which can respond quickly to large variations in the data. Linear predictors such as CCSDS-123, however, might be more suitable for slowly changing data with small values.
For the other images from the AIRS, SFSI, and CASI instruments, the filtering-based CSNN model also provides superior performance compared to other prominent predictive coding techniques. To summarize, the results show that the proposed filtering-based CSNN yields the best overall performance among the compared predictive compressors. It also offers additional desirable features, such as requiring no pre-training in the compression procedure, making it an appealing approach for lossless hyperspectral image compression.
Figure 9 shows the residual entropy variations of the Yellowstone scenes. It is interesting to observe that the five distinct scenes seem to follow a similar trend; for example, the residual entropies around spectral band 160 are almost the same. This indicates the robust prediction performance that can be achieved by jointly considering both the spatial and spectral contexts using the proposed method. There is potential to use transfer learning to exploit such similarity to further improve the compression performance.
4.4. Computational Complexity
The computation of the proposed CSNN method consists of feed-forward propagation and back propagation. Note that the filtering of the CSNN model is mainly implemented as matrix multiplications. Assume there are $i$ nodes in the input layer, corresponding to $i$ context pixels fed to the network, $j$ and $k$ denote the numbers of nodes in the two hidden layers, and $l$ denotes the number of nodes in the output layer. In a four-layer neural network, there are three weight matrices representing the weights between these layers. For example, ${W}_{ji}$ is a weight matrix with $j$ rows and $i$ columns, which contains the weights going from layer $i$ to layer $j$. In a feed-forward pass, propagating a sample from layer $i$ to layer $j$ takes $O(j\cdot i)$ time, so the overall time complexity from the input layer to the output layer becomes $O(j\cdot i+k\cdot j+l\cdot k)$. Back propagation starts from the last layer of the model and, similar to the feed-forward pass, has time complexity $O(l\cdot k+k\cdot j+j\cdot i)$. We can see that the computational complexity of the neural network depends largely on the number of hidden layers and the number of nodes in each layer. Also, as shown in Equations (9) and (10), an activation function is only needed for the spatial channel, and the $ReLU$ function requires very little computation. Compressing the IP dataset takes 0.8 seconds/band; the experiments were carried out on a ThinkPad laptop with an Intel Core i5 CPU and 8 GB of installed memory, running Windows 7 Professional (64-bit operating system). Note that the matrix operations can be greatly parallelized on GPUs to further reduce the computation time.