Spectral–Spatial Feature Partitioned Extraction Based on CNN for Multispectral Image Compression

Abstract: Recently, the rapid development of multispectral imaging technology has received great attention from many fields, which inevitably involves the problems of image transmission and storage. To solve this issue, a novel end-to-end multispectral image compression method based on spectral–spatial feature partitioned extraction is proposed. The whole compression framework is based on a convolutional neural network (CNN), whose innovation lies in a feature extraction module that is divided into two parallel parts, one for spectral features and the other for spatial features. First, the spectral feature extraction module extracts spectral features independently, while the spatial feature extraction module obtains the separated spatial features. After feature extraction, the spectral and spatial features are fused element-by-element, followed by downsampling to reduce the size of the feature maps. The data are then converted to a bit stream through quantization and lossless entropy encoding. To make the data more compact, a rate-distortion optimizer is added to the network. The decoder performs the approximate inverse of the encoder. For comparison, the proposed method is tested against JPEG2000, 3D-SPIHT, and ResConv, another CNN-based algorithm, on datasets from the Landsat-8 and WorldView-3 satellites. The results show that the proposed algorithm outperforms the other methods at the same bit rate.


Introduction
By capturing digital images of several continuous narrow spectral bands, remote sensors can generate three-dimensional multispectral images that contain rich spectral and spatial information [1]. This abundant information is very useful and has been employed in various applications, such as military reconnaissance, target surveillance, crop condition assessment, surface resource surveys, environmental research, marine applications, and so on. However, with the rapid development of multispectral imaging technology, the spectral-spatial resolution of multispectral data has become higher and higher, resulting in the rapid growth of its data volume. This huge amount of data is not conducive to image transmission, storage, and application, which hinders the development of related technologies. Therefore, it is necessary to find an effective multispectral image compression method to process images before use.
Research on multispectral image compression methods has always received widespread attention. After decades of unremitting effort, various multispectral image compression algorithms for different application needs have been developed, which can be summarized as follows: the predictive coding-based framework [2], the vector quantization coding-based framework [3], and the transform coding-based framework [4,5]. Predictive coding is mainly applied to lossless compression. Its rationale is to use the correlation between pixels to predict the unknown data based on its neighbors, and then to encode the residual

Proposed Method
In this section, we introduce the proposed multispectral image compression network framework in detail and describe the training flow diagram. We elaborate on several key operations: the spectral feature extraction module, the spatial feature extraction module, and the rate-distortion optimizer.

Spectral Feature Extraction Module
2D convolution has shown great promise and has been successfully applied to many aspects of image vision and processing, such as target detection, image classification, and image compression. However, multispectral images are three-dimensional and more complex, and their rich spectral information is even more important, so information loss is inevitably encountered when 2D convolution is used to process them. Although there have been many precedents of applying deep learning to multispectral image compression, achieving great performance and exceeding some traditional compression methods such as JPEG and JPEG2000, the convolution kernel used in feature extraction is two-dimensional, so the spectral redundancy along the third dimension cannot be effectively removed, which inhibits the performance of the network.
To deal with this problem, we have come up with the idea of extracting spectral and spatial features separately. The inspiration for extracting spectral features derives from [21], which uses three-dimensional kernels for the convolution operation and can thereby maintain the integrity of spectral features in multispectral image data. To avoid the data volume becoming too large, we use a 1 × 1 × n convolution kernel on the spectral dimension, named 1D spectral convolution, to extract spectral features independently. Figure 1 shows the differences between 2D convolution and 1D spectral convolution.
As shown in Figure 1a, the image is convolved by 2D convolution, whose kernel is two-dimensional, generally followed by activation functions such as rectified linear units (ReLU) [14], parametric rectified linear units (PReLU) [22], etc. This operation can be expressed as follows:

v_{ij}^{xy} = f( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{ijm}^{pq} v_{(i-1)m}^{(x+p)(y+q)} ), (1)

where i indicates the current layer, j indicates the current feature map of this layer, v_{ij}^{xy} is the output value at (x, y) of the j-th feature map in the i-th layer, f(·) represents the activation function, w_{ijm}^{pq} denotes the weight of the convolution kernel at position (p, q) connected to the m-th feature map (m indexes over the set of feature maps in the (i − 1)-th layer connected to the current feature map), b_{ij} is the bias of the j-th feature map in the i-th layer, M_{i−1} is the number of feature maps in the (i − 1)-th layer, and P_i and Q_i are the height and width of the convolution kernel, respectively.
Similarly, considering the dimension of the spectrum, 1D spectral convolution operated on 3D images can be formulated as follows:

v_{ij}^{xyz} = f( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} v_{(i-1)m}^{(x+p)(y+q)(z+r)} ), (2)

where R_i is the size of the convolution kernel in the spectral dimension, v_{ij}^{xyz} is the output value at (x, y, z) of the j-th feature map in the i-th layer, and w_{ijm}^{pqr} is the weight of the kernel at position (p, q, r) connected to the m-th feature map. As the size of the kernel is 1 × 1 × n, P_i and Q_i are set to 1, so Equation (2) can be written as:

v_{ij}^{xyz} = f( b_{ij} + \sum_{m} \sum_{r=0}^{R_i - 1} w_{ijm}^{r} v_{(i-1)m}^{xy(z+r)} ). (3)

In regard to the activation function, we adopt ReLU as our first choice, as its gradient is usually constant in back propagation, which alleviates the problem of gradient disappearance in deep network training and contributes to network convergence. Additionally, the computational cost of ReLU is much lower than that of other functions (e.g., sigmoid). In addition, ReLU can make the output of some neurons zero, which ensures the sparsity of the network so as to alleviate the overfitting problem. The ReLU function can be formulated as below:

f(x) = max(0, x). (4)
In summary, when 2D convolution is operated on three-dimensional images, the output is always two-dimensional, which may cause a large amount of spectral information loss. Therefore, we adopt 1D spectral convolution to retain more feature data of the multispectral image.
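As an illustration, the 1 × 1 × n spectral convolution of Equation (3) can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions (a single input and output feature map, "valid" boundary handling, and ReLU activation); the function name `spectral_conv1d` is ours, not from the paper.

```python
import numpy as np

def spectral_conv1d(volume, kernel, bias=0.0):
    """Convolve a (bands, H, W) cube with a 1 x 1 x n kernel along the
    spectral axis only, then apply ReLU (Eq. (4)). Spatial positions are
    never mixed, so spatial structure is left untouched."""
    n = len(kernel)
    bands = volume.shape[0]
    out = np.zeros((bands - n + 1,) + volume.shape[1:])
    for r in range(n):                      # slide over the spectral axis
        out += kernel[r] * volume[r:r + bands - n + 1]
    return np.maximum(out + bias, 0.0)      # ReLU

cube = np.arange(28, dtype=float).reshape(7, 2, 2)    # 7 bands, 2 x 2 pixels
feat = spectral_conv1d(cube, kernel=[1.0, 1.0, 1.0])  # n = 3, as in our blocks
```

Note that each output value mixes only the n neighboring bands of one pixel, which is exactly why spectral redundancy can be removed without disturbing spatial information.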

Spatial Feature Extraction Module
In order to ensure the spatial information does not mingle with the spectral features, we use group convolution instead of normal 2D convolution in spatial dimensions. Group convolution first appeared in AlexNet, in order to solve the problem of limited hardware resources at that time. Feature maps were distributed to several GPUs for simultaneous processing, and finally concatenated together. Figure 2 shows the differences between normal convolution and group convolution.
As shown in Figure 2a, the size of the input data is C × H × W, representing the number of channels, height, and width of the feature map, respectively. The size of the convolution kernel is k × k, and the number of kernels is N. At this point, the size of the output feature map is N × H′ × W′. The parameter number of the N convolution kernels is:

Params = N × C × k × k. (5)

In group convolution, just as its name implies, the input feature maps are divided into several groups and then convolved separately. Assume that the size of the input is still C × H × W and the number of output feature maps is N. If the input is divided into G groups, the number of input feature maps in each group is C/G, the number of output feature maps in each group is N/G, and the size of the convolution kernel is k × k; that is, the total amount of convolution kernels remains unchanged and the number of kernels in each group is N/G. Since the feature maps are only convolved by the convolution kernels of the same group, the total number of parameters can be calculated as:

Params = N × (C/G) × k × k. (6)

By comparing Equations (5) and (6), it can easily be seen that group convolution greatly reduces the number of parameters; precisely speaking, it reduces them to 1/G of the original. Moreover, as group convolution can increase the diagonal correlation between filters according to [14], filter relationships become sparse after grouping.
Figure 3 shows the correlation matrix between filters of adjacent layers [23]: highly correlated filters are brighter, while less correlated filters are darker. The role of filter groups, namely group convolution, is to take advantage of this block-diagonal sparsity to learn information about the channel dimension. Filters with low correlation do not need to be learned, that is to say, they do not need to be given parameters. What is more, as seen in Figure 3, the highly correlated filters can be trained in a more structured way when using group convolution. Therefore, with structured sparsity, group convolution can not only reduce the number of parameters, but also learn more accurately, making the network more efficient.
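To make the savings of Equations (5) and (6) concrete, the two parameter counts can be compared with a few lines of Python. This is only a sketch: the channel numbers are illustrative, and biases are omitted.

```python
def conv_params(c_in, n_out, k, groups=1):
    """Weight count of a k x k convolution layer: each of the n_out filters
    sees only c_in / groups input channels (Eqs. (5) and (6), no biases)."""
    assert c_in % groups == 0 and n_out % groups == 0
    return n_out * (c_in // groups) * k * k

normal = conv_params(8, 16, 3)             # Eq. (5): N*C*k*k = 1152
grouped = conv_params(8, 16, 3, groups=8)  # Eq. (6): 1152 / G = 144
```

With G = 8 groups, as used for the 8 band images, the weight count drops to exactly 1/G of the normal convolution.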

Framework of the Proposed Network
The whole framework of the proposed compression network is illustrated in Figure 4. The multispectral images are first fed into the forward network; after feature extraction, the data are compressed and converted to a bit stream successively through quantization and entropy encoding. The structure of the decoder is symmetrical with that of the encoder. Accordingly, for decoding, the bit stream goes through entropy decoding, inverse quantization, and the backward network, in turn, to restore the images. The detailed architecture of the forward and backward network is demonstrated in Section 2.3.1.

The architecture of the forward and backward network is shown in Figure 5, and the spectral block and the spatial block are shown in Figure 6. Figure 5 illustrates the detailed process of our network. First of all, the input multispectral images are simultaneously fed into the spectral feature extraction network and the spatial feature extraction network, which consist of the corresponding function modules. In the spectral part, there are several spectral blocks (Figure 6a), which are based on the residual block structure; we replace their convolution layers with 1D spectral convolution, with a kernel size of 1 × 1 × 3. Likewise, the spatial part is composed of several spatial blocks with a similar structure, as shown in Figure 6b, in which group convolution is used so that the channels do not interact with each other. To be specific, GROUP is set to 7 or 8, as the input multispectral images have seven or eight bands.
Additionally, some convolution layers with a kernel size of 3 × 3 are added to enhance the feature learning ability. After extraction, the two parts of the features are fused together, and then downsampling is carried out to reduce the size of the feature maps. At the end of the forward network, the sigmoid function limits the value of the intermediate output; in addition, like ReLU, it introduces nonlinear factors that make the network more expressive.
Symmetric with the forward network, the backward network is formed with upsampling layers, some convolution layers, and the partitioned extraction part. In particular, upsampling is implemented with PixelShuffle [24], which can turn low resolution images into high resolution images using sub-pixel operation.
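The sub-pixel operation behind PixelShuffle [24] is a deterministic rearrangement of channels into space. A NumPy sketch for a single image follows; a channels-first layout is assumed, matching the common convention, and the function name is ours.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r): each group of
    r*r channels supplies the r x r sub-pixel offsets of one output channel."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (row, col) offsets
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, row, w, col)
    return x.reshape(c, h * r, w * r)

up = pixel_shuffle(np.arange(4.0).reshape(4, 1, 1), r=2)  # 4 channels -> 2 x 2
```

No parameters are involved in the rearrangement itself; the preceding convolution learns to place the right values in the right channels, which is why sub-pixel upsampling avoids the checkerboard artifacts of naive deconvolution.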

Quantization and Entropy Coding
After the forward network, the intermediate data are first quantized into a succession of discrete integers by the quantizer. Since gradient descent is used in backward propagation to update the parameters when training the network, the gradient needs to be passed down. However, the rounding function is not differentiable [25], which hinders the optimization of the network. Therefore, we relax the function, and the quantization is calculated as:

X_Q = round[(2^Q − 1) · X_s], (7)

where Q is the quantization level, X_s ∈ (0, 1) is the intermediate datum after sigmoid activation, round[·] is the rounding function, and X_Q is the quantized data. The function rounds the data in the forward pass and is skipped during backward propagation, passing the gradient directly to the previous layer. Then, we adopt ZPAQ as the lossless entropy coding standard and select "Method-6" as the compression pattern, in order to further process the quantized X_Q and generate the binary bit stream. In the decoder, the bit stream goes through the entropy decoder and de-quantization, and the data X_Q / (2^Q − 1) are finally fed into the backward network to recover the image.
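The quantizer and its inverse can be sketched as below. We assume the relaxed rounding takes the form X_Q = round[(2^Q − 1) · X_s], which is consistent with the de-quantization X_Q / (2^Q − 1) described above; in training, the rounding would be skipped in the backward pass so that the gradient flows through unchanged (a straight-through relaxation).

```python
import numpy as np

def quantize(x_s, q):
    """Map sigmoid outputs in (0, 1) to the integers 0 .. 2^q - 1."""
    return np.round((2 ** q - 1) * x_s)

def dequantize(x_q, q):
    """Inverse mapping applied before the backward network."""
    return x_q / (2 ** q - 1)

x = np.array([0.1, 0.4, 0.9])
codes = quantize(x, q=4)         # integers in 0 .. 15, ready for entropy coding
approx = dequantize(codes, q=4)  # close to x, up to half a quantization step
```

The error introduced is at most half a quantization step, 0.5 / (2^Q − 1), which shrinks as the quantization level Q grows.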

Rate-Distortion Optimizer
There are two criteria for evaluating a compression method: one is the bit rate, and the other is the quality of the recovered image. To enhance the performance of the network, it is vital to strike a balance between these two criteria. In consequence, rate-distortion optimization is introduced:

L = L_D + λ L_R, (8)

where L is the loss function that should be minimized during training, L_D indicates the distortion loss, and L_R represents the rate loss, which is weighted by the penalty λ. As we use MSE to measure the distortion loss of the recovered image, L_D can be expressed as follows:

L_D = (1 / (N · H · W · C)) Σ || Î − I ||², (9)

where N denotes the batch size, I represents the original multispectral image, Î is the recovered image, and H, W, and C are, respectively, the height, width, and number of spectral bands of the image. In order to estimate the rate loss, we adopt an importance-net to replace the entropy computation with a continuous approximation of the code length. The importance network is used to generate an importance map P(X) learned from the input images [26]. The intention is to assign the bit rate according to the importance of the image content: more bits are assigned to complex regions, and fewer bits to smooth regions. The importance-net is simply composed of four layers, two 1 × 1 convolution layers and a residual block that consists of two 3 × 3 convolution layers, as shown in Figure 7.
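The combined objective can be sketched as follows (names are ours; the importance map array stands in for the learned P(X)):

```python
import numpy as np

def rd_loss(orig, recon, importance_map, lam):
    """L = L_D + lambda * L_R (Eq. (8)): MSE distortion plus the mean of
    the importance map as a differentiable stand-in for the bit rate."""
    l_d = np.mean((orig - recon) ** 2)  # Eq. (9): averaged squared error
    l_r = np.mean(importance_map)       # L_R = avg(P(X)), Eq. (13)
    return l_d + lam * l_r
```

A larger penalty λ pushes the importance map, and hence the assigned bit rate, down at the cost of higher distortion, tracing out the rate-distortion curve.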
The activation function used in the importance-net is Mish [27], which has been proven to be smoother than ReLU and to achieve better results. Nonetheless, considering the time cost and limited hardware conditions due to the increased complexity of Mish, we only adopt Mish in the importance-net rather than in the whole network. Mish can be formulated as:

f(x) = x · tanh(ln(1 + e^x)).

After sigmoid activation, the value range of the output is [0, 1]. The importance map is computed from the un-quantized output of the encoder, where X̂ represents the un-quantized output after the encoder, w indicates the weight of the importance-net, (x, y) is the spatial location of the pixel, and b, k, and s_0 are the bias, kernel size, and stride, respectively. Unlike [26], we use the mean of P(X) instead of the sum to define the rate loss:

L_R = avg(P(X)), (13)

where N is the number of intermediate channels.

Datasets
The 7 band image datasets come from the Landsat-8 satellite. The training dataset contains about 80,000 images. The images we selected include various terrains under different seasons and weather conditions, which enables the network to learn multiple features and prevents the training from overfitting. We pick 17 representative images from the 80,000 images as a test set and make sure that there are no identical images in the two sets. The sizes of the training images and test images are 128 × 128 and 512 × 512, respectively. The 8 band image datasets come from the WorldView-3 satellite and contain about 8700 images of size 512 × 512. Likewise, we ensure that the datasets include various terrains under different weather conditions to guarantee the diversity of the features. The test set has 14 images of size 128 × 128 and shares no images with the training set.

Parameter Settings
We use the Adam optimizer to train the model and update the network. To accelerate convergence, the initial learning rate is set to 0.0001; once the loss function has dropped to a certain degree, the learning rate is reduced to 0.00001 to seek the optimal solution. The experimental settings of the training network are listed in Table 1:

The Training Process
First of all, we initialize the weights of the network randomly and utilize the Adam optimizer to train the network. In the first stage of training, MSE is introduced into the loss function. Since optimization is a process of restoring the image as close as possible to the original one, we can express it by the following formula:

(θ_1, θ_2) = arg min_{θ_1, θ_2} || Re(En(Se(x; θ_1) ⊕ Sa(x; θ_2))) − x ||², (14)

where x is the original image, θ_1 and θ_2 are the parameters of the spectral feature extraction network and the spatial feature extraction network, respectively, ⊕ denotes the element-wise fusion, Se(·) represents the 1D spectral convolution network, Sa(·) is the spatial group convolution network, En(·) denotes quantization coding, and Re(·) denotes the whole decoding and recovering process. To make the loss function decline as quickly as possible, θ_1 and θ_2 are updated along the gradient descent direction. By fixing θ_1, we can obtain θ_2; and we can obtain θ_1 by fixing θ_2. During the backward propagation, the quantization needs to be skipped. Accordingly, to simplify the representation of Equation (14), an auxiliary variable x_m is introduced, and Equation (14) can be rewritten in terms of x_m. As the first stage of training optimization is completed, we then bring the rate loss into the loss function. Combining Equations (13) and (19), the final optimization procedure can be formulated. When the loss function no longer declines, the training has reached the optimal solution. Moreover, in the second stage of training, different compression rates can easily be obtained by changing the penalty λ. The value of λ in our experiment is listed in Table 1.
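The alternating scheme, fix θ_1 and descend on θ_2, then swap, can be illustrated on a toy objective. The quadratic below is only a stand-in for the true reconstruction loss; all names and numbers are illustrative.

```python
def alternate_minimize(t1, t2, steps=200, lr=0.1):
    """Coordinate-style gradient descent: update t1 with t2 fixed,
    then t2 with t1 fixed, mimicking the two alternating sub-problems."""
    loss = lambda a, b: (a + b - 3.0) ** 2 + 0.1 * a ** 2 + 0.1 * b ** 2
    for _ in range(steps):
        t1 -= lr * (2.0 * (t1 + t2 - 3.0) + 0.2 * t1)  # grad w.r.t. t1, t2 fixed
        t2 -= lr * (2.0 * (t1 + t2 - 3.0) + 0.2 * t2)  # grad w.r.t. t2, t1 fixed
    return t1, t2, loss(t1, t2)

t1, t2, final = alternate_minimize(0.0, 0.0)
```

Because each sub-step only ever moves one parameter group, the loss is non-increasing per coordinate, and the two groups settle at the joint minimum.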

Results and Discussion
In this section, we record the experimental results, including the performance comparison of our network with other traditional methods at the same bit rate; the different bit rates have been obtained by adjusting the penalty λ. Meanwhile, to make the results more convincing, the CNN-based compression method using an optimized residual unit in [20] is also added for comparison. For presentation purposes, it is written as ResConv.

The Evaluation Criterion
To evaluate the performance of the network comprehensively, apart from PSNR measuring the image recovery, we also utilize another metric known as spectral angle (SA) on the spectral dimension to verify the validity of the proposed partitioned extraction method. SA indicates the angle between two spectra, which can be viewed as two vectors [28], and it can be used to measure the similarity between two spectral signatures. The formula is written as follows:

SA = arccos( (x · y) / (||x|| ||y||) ),

where the cosine term ranges from −1 to 1. The closer the SA is to zero, the more similar the two vectors are. Figure 8 shows the average PSNR of the 7 band test sets. As seen from the figure, our proposed method is about 1 dB better than 3D-SPIHT and exceeds JPEG2000 by 3 dB. Compared with ResConv, the partitioned extraction method still gains a small advantage of about 0.6 dB. Figure 9 gives a detailed comparison of four selected test images, showing the recovered results of the four methods. It is easy to tell that the partitioned extraction algorithm has an obvious superiority when the bit rate ranges from 0.3 to 0.4.
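The two criteria can be computed as follows. Standard definitions are assumed here; for SA we use the usual arccos form, whose cosine argument is what ranges from −1 to 1.

```python
import numpy as np

def psnr(orig, recon, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((orig - recon) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def spectral_angle(u, v):
    """Angle (radians) between two spectral vectors [28]; 0 means the
    pixel spectra are identical up to a positive scale factor."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

PSNR is computed per image over all bands, while SA is computed per pixel between the original and recovered spectral vectors and then averaged.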

Spatial Information Recovery
For illustrative purposes, Figure 10 shows the visual effects of four test images when the bit rate is around 0.4. To be specific, we display the grayscale image of the third band of each test image to show the differences more clearly. As can be seen, with the JPEG2000 and 3D-SPIHT algorithms, the recovered images have obvious block effects and the textures and margins are seriously blurred, whereas the proposed partitioned extraction algorithm performs well under the same bit rate, as does ResConv; these two methods preserve more details than the other methods. Figure 11 shows a partial enlarged view of ah_xia for a clearer demonstration. When the bit rate is around 0.4, ResConv and our proposed method both demonstrate impressive performance. However, according to Figure 8, ResConv starts to lose its edge as the bit rate drops to 0.3 or lower, where our method is more stable.
To further illustrate the advantage of our partitioned extraction method, we add the 8 band test sets into the experiment for better comparison. The average PSNR is shown in Figure 12. As seen from the figure, our method obtains a higher PSNR than JPEG2000 and 3D-SPIHT, by approximately 8 dB and 4 dB, respectively. With regard to ResConv, the proposed method maintains its competitive edge and obtains about 2.5 dB higher PSNR on average.
Figure 13 represents the comparison of the PSNR of four test images from the 8 band test sets at different bit rates. It can be observed that the advantage of the partitioned extraction method becomes quite prominent compared with the results of the 7 band test images. Regarding ResConv, in spite of it surpassing JPEG2000 and 3D-SPIHT, its inferiority to our proposed method is more distinct compared with the results of the 7 band test sets. When it comes to processing multispectral images with more bands, or even hyperspectral images, some traditional compression methods will ineluctably be in an even more inferior position, as they rarely take the abundant spectral correlation into account.
For visual comparison, as shown in Figures 14 and 15, the JPEG2000 method inevitably produces serious blocking and ringing artifacts, and detailed texture is lost as well. 3D-SPIHT performs somewhat better than JPEG2000, but its recovered images still contain many blurred texture details. ResConv likewise yields recovered images with blurred texture. In contrast, the proposed algorithm retains the detailed texture and edge information of the images to a great extent.
All of the comparison results indicate that these traditional compression methods are not suitable for multispectral image compression, as they may cause considerable spatial–spectral information loss. Some CNN-based algorithms obtain better results; however, their inadequacy also manifests as the number of bands increases. The partitioned extraction method of spatial–spectral features that we propose has been shown to be effective for multispectral image compression, with a higher PSNR and much smoother visual results.

Spectral Information Recovery
To adapt to our partitioned extraction network structure and to verify the effectiveness of spectral information recovery, we adopt the spectral angle (SA) as the second evaluation criterion. The average SA curves of the 7 band and 8 band test sets are shown in Figures 16 and 17, respectively. It can be seen that the SA of the images reconstructed by the partitioned extraction algorithm is always smaller than that of JPEG2000, 3D-SPIHT and ResConv. Tables 2 and 3 list the detailed SA values of four representative test images from the 7 band and 8 band sets, respectively. Supported by these charts and data, the partitioned extraction algorithm obtains the smallest SA at all bit rates among the four methods, and a smaller SA indicates that the images reconstructed by the proposed partitioned extraction method achieve better spectral information recovery.

Figure 15. Partial enlarged view of test8.

Conclusions
In this paper, a novel end-to-end framework with partitioned extraction of spatial–spectral features for multispectral image compression is proposed. The algorithm pays close attention to the abundant spectral features of multispectral images and is committed to preserving the integrity of the spectral–spatial features. The spectral and spatial feature modules extract their corresponding features separately, after which the features are fused for further processing. Likewise, the spectral and spatial features are recovered separately when reconstructing the images, which helps obtain images of high quality. To verify the validity of the framework, experiments are conducted on both 7 band and 8 band test sets. The results show that the proposed algorithm surpasses JPEG2000, 3D-SPIHT and ResConv in PSNR, visual quality and SA. The results on the 8 band test sets show an even more obvious superiority, which suggests that spectral information plays an indispensable role in multispectral image processing.