1. Introduction
Hyperspectral imaging technology has been successfully applied to target detection [1], food surveillance [2], environmental protection [3], biomedicine [4,5], and remote sensing [6]. A hyperspectral image (HSI) can be represented by a three-dimensional (3D) data cube containing two spatial dimensions (x and y) and one spectral dimension (λ). Traditional spectral imaging uses a two-dimensional (2D) detector to sample the 3D data cube by sequentially scanning the target scene along the spatial or spectral coordinates. However, this scanning prolongs the data acquisition time, which makes it unsuitable for some applications, especially dynamic target imaging [7].
Compressive spectral imaging (CSI) relies on compressive sensing theory to solve the above problem [8,9]. The schematic diagram of a typical CSI system is shown in Figure 1 [10,11]. First, the HSI of the target scene is spatially modulated by a binary coded aperture, and then the spectral bands of the HSI are shifted by different displacements through a dispersive element. Afterwards, the encoded HSI is projected onto a grayscale detector along the spectral dimension to form the 2D compressive measurement. Finally, the 3D HSI can be reconstructed from a single or a few compressive measurements using numerical algorithms.
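The encoding steps of a typical CSI system described above (spatial modulation, band-dependent shift, and projection along the spectral dimension) can be sketched numerically. This is a minimal illustration under stated assumptions, not the model of any specific system: the function name `cassi_measurement`, the horizontal shift direction, and the one-pixel-per-band `step` are all hypothetical choices.

```python
import numpy as np

def cassi_measurement(hsi, mask, step=1):
    """Simulate one snapshot of a dispersive coded-aperture system.

    hsi  : (H, W, L) spectral cube
    mask : (H, W) binary coded aperture
    step : assumed dispersion shift in pixels per band (hypothetical)
    """
    H, W, L = hsi.shape
    meas = np.zeros((H, W + step * (L - 1)))
    for k in range(L):
        coded = hsi[:, :, k] * mask                 # spatial modulation
        meas[:, k * step:k * step + W] += coded     # shift band k, then sum
    return meas

rng = np.random.default_rng(0)
cube = rng.random((4, 4, 3))
aperture = rng.integers(0, 2, size=(4, 4)).astype(float)
snapshot = cassi_measurement(cube, aperture)
print(snapshot.shape)  # (4, 6)
```

The detector is wider than the scene because each band lands at a different horizontal offset before the bands are summed.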
However, traditional CSI systems suffer from several limitations. Firstly, conventional CSI systems are composed of separate optical modulation elements, which hinders system integration and miniaturization. Secondly, the encoding ability of the binary coded aperture is limited due to the lack of spectral modulation [12,13]; thus, multiple snapshots with varying coding patterns are necessary to further improve the reconstruction quality. However, switching the coding patterns increases the cost and complexity of the spatial light modulator. Finally, most conventional HSI reconstruction algorithms for CSI systems are computationally intensive [14,15].
Recently, color-coded compressive spectral imaging (CCSI) was proposed to reduce the detection time [16,17]. The color-coded aperture (CCA) replaces the binary coded aperture and dispersive element of the traditional CSI system. The CCA is a 2D array composed of various optical filters with different spectral responses; thus, it has modulation capabilities in both the spatial and spectral domains. Previous studies attached the CCA to the camera sensor to obtain a compact system and proposed different fabrication methods for the CCA. Specifically, Zhao et al. proposed a low-cost CCSI system based on a color printed mask [18], which was fabricated by a consumer-level printer with color inks. However, the uncontrollable placement of ink droplets may cause deviations from the designed coding pattern. A spectral imaging system based on a diffractive optical element and a CCA was proposed in [19]. This CCA was produced using an ordinary film photography technique, which improved manufacturing accuracy at low cost. However, the periodic arrangement of the CCA limited its modulation freedom. In addition, using a diffractive lens in concert with a CCA provides a limited field of view. Yako et al. proposed a hyperspectral camera based on Fabry–Pérot filters [20], where multiple filters with different spectral responses were randomly arranged to obtain color coding. This kind of CCA required advanced manufacturing techniques that increased the fabrication cost. Zhang et al. developed a low-cost multi-spectral camera at tens-of-megapixel spatial resolution [21]. In that work, the color-coded mask was generated by imaging a binary coding mask through a disperser, but the wavelength-dependent dispersive effect limited the modulation freedom. In general, traditional CCAs face a trade-off among coding freedom, modulation accuracy, and fabrication cost. It is preferable to design a low-cost freeform CCA component that can be fabricated precisely.
In addition, the CCSI can reconstruct the HSI using only one snapshot without switching the coding pattern, thus saving detection time. However, the compression ratio of the original HSI data over the compressive measurement is very high, which poses a major challenge for the reconstruction algorithms. To address this problem, deep learning was introduced to improve the reconstruction efficiency and accuracy of the CCSI. Convolutional neural networks (CNNs) were used to construct end-to-end mapping models between the compressive measurement and the reconstructed HSI [18,19,20]. Transformer-based networks were also proposed to further improve the reconstruction quality of traditional CSI and CCSI by capturing the long-range inter-spectra dependencies of the target HSIs [21,22,23]. However, the convolution-based down-sampling and up-sampling operations used in the Transformer-based models may lose feature map information, which is detrimental to the reconstruction accuracy. Moreover, there is currently a lack of training datasets for real CCSI systems, which makes it difficult for deep learning methods to reconstruct high-quality HSIs in real applications.
This paper proposes a low-cost color-coded compressive spectral imager (LCCSI), and a corresponding deep learning approach, dubbed Focus-based Mask-guided Spectral-wise Transformer (F-MST), that can obtain high-quality reconstruction for the LCCSI system. Figure 2 shows the schematic diagram of the LCCSI system. It jointly uses a CCA fabricated using a color film and an RGB detector to achieve higher degrees of freedom in the spatio-spectral modulations. First, the target’s light field is projected through an objective lens onto the CCA. The CCA is produced by imaging a color-coded pattern onto a transparent color film using conventional film photography, which offers varying spectral modulations on different pixels. The HSI data cube is modulated by the CCA in both the spatial and spectral domains and then projected onto an RGB detector by a relay lens. The HSI data cube is modulated again by the Bayer filter integrated in the RGB detector. The Bayer filter is a three-color filter array, i.e., red, green, and blue filters. Finally, the twice-encoded HSI is captured by the focal plane array (FPA) of the detector to form a 2D compressive measurement. The joint spatial and spectral modulations of the CCA and Bayer filter further increase the modulation freedom, thus yielding an incoherent sensing matrix for the LCCSI system, which is beneficial for improving the HSI reconstruction quality. The proposed CCA is very thin; thus, it can be tightly attached to the detector surface to miniaturize the LCCSI system.
In order to reconstruct the spectral data cube, this work also develops the F-MST network, which is inspired by the Mask-guided Spectral-wise Transformer (MST) network designed for the traditional CSI system [22]. We embed the focus-based down-sampling and up-sampling modules in the MST network to improve the reconstruction accuracy. To overcome the lack of training sets, we first use simulation data to pre-train the deep learning models, and then we retrain them with the real dataset collected by the LCCSI testbed established by our group. The proposed LCCSI system and F-MST reconstruction network are verified and assessed based on both simulations and real experiments. The results show that the proposed F-MST method achieves superior reconstruction performance over the commonly used iterative reconstruction algorithms (GPSR [24], TwIST [25], and GAP-TV [26]) and some other state-of-the-art learning-based algorithms (TSA-Net [27] and MST [22]).
The main contributions of this paper are summarized as follows. This work proposes a low-cost LCCSI system based on a cascaded encoding method and a corresponding supervised learning reconstruction algorithm for compressive spectral imaging. The combination of the film-based CCA and the Bayer filter array of the RGB detector can effectively enhance the coding freedom of the entire optical system. The proposed F-MST algorithm utilizes the focus-based down-sampling and up-sampling modules to retain more feature information and improve reconstruction quality. The proposed LCCSI system and F-MST algorithm have the potential to enable miniaturized, high-quality hyperspectral imaging.
2. Imaging Model of LCCSI System
The imaging process of the LCCSI system is shown in Figure 2. The HSI of the target scene is represented as
with the spatial size of
and the spectral depth of
. The coding effects of the CCA and the Bayer filter are represented by the matrices of
and
, respectively. The CCA consists of a 2D array of color pixels with random arrangement, where each color pixel has a pre-defined spectral response that transmits specific components of the light spectrum and suppresses the others. The Bayer filter is a 2D array arranged in a 2 × 2 cycle consisting of three kinds of optical filters (red, green, and blue). It is noted that
and
have the same dimensionality. These two matrices are composed of the spectral modulation curves (along the dimension
) over all of the spatial coordinates (along the dimensions
). The voxels of
and
are defined as
and
, respectively. Finally, the 2D compressive measurement on the detector is denoted by
. Then, the imaging model of the LCCSI system is given by
where
is referred to as the cascade coding cube that represents the total coding effects attributed to both the CCA and Bayer filter. It can be formulated as
Next, we transform Equation (1) to a matrix multiplication format. Let
and
be the vectorized representations of
and
, respectively. Then, Equation (1) is rewritten as
where
represents the sensing matrix of the LCCSI system, and
represents the measurement noise. The sensing matrix
is the 2D representation of the cascade coding cube
.
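For reference, the imaging model described in this section can be written compactly as follows. The symbols are an assumed notation chosen for illustration and may differ from those used in Equations (1)–(3); the element-wise product in the second line is likewise an assumption that follows from the description of the cascade coding cube as the combined effect of the CCA and the Bayer filter.

```latex
% Assumed notation (not taken from the source):
%   F   : HSI data cube,             T_C : coding cube of the CCA,
%   T_B : coding cube of the Bayer filter,
%   T   : cascade coding cube,       Y   : 2D compressive measurement,
%   f, y: vectorized F and Y,        H   : sensing matrix,  \omega : noise.
\begin{align}
Y(x, y) &= \sum_{\lambda = 1}^{L} T(x, y, \lambda)\, F(x, y, \lambda), \\
T(x, y, \lambda) &= T_C(x, y, \lambda)\, T_B(x, y, \lambda), \\
\mathbf{y} &= \mathbf{H}\mathbf{f} + \boldsymbol{\omega}.
\end{align}
```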
Figure 3 provides an intuitive illustration of the sensing matrix for one snapshot with the spatial dimensions
and the spectral dimension
. The three groups of diagonal elements in
Figure 3 correspond to the modulation coefficients in the three spectral bands (
), where the grayscale elements from black to white represent element values from 0 to 1. The sensing matrix is therefore not binary but contains many grayscale elements with different values. This results from the freeform color coding and provides more freedom in the spatio-spectral modulations.
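The diagonal-block structure described above can be checked with a small numerical sketch. It assumes that the HSI is vectorized band by band, so that the sensing matrix is a horizontal concatenation of one diagonal block per spectral band; the helper name `sensing_matrix`, the 4 × 4 spatial size, and the random coding values are illustrative assumptions, with only the three spectral bands taken from the text.

```python
import numpy as np

def sensing_matrix(coding_cube):
    """Build H of shape (N*N, N*N*L) from a coding cube of shape (N, N, L).

    Each spectral band contributes one diagonal block holding that
    band's per-pixel modulation coefficients.
    """
    n_bands = coding_cube.shape[2]
    blocks = [np.diag(coding_cube[:, :, k].ravel()) for k in range(n_bands)]
    return np.hstack(blocks)

rng = np.random.default_rng(1)
N, L = 4, 3                      # illustrative sizes (assumed)
T = rng.random((N, N, L))        # grayscale coding values in [0, 1)
F = rng.random((N, N, L))        # toy HSI cube

H = sensing_matrix(T)
# Band-by-band vectorization of the HSI, matching the block order in H.
f = np.concatenate([F[:, :, k].ravel() for k in range(L)])
y = H @ f

# The matrix form agrees with modulating each voxel and summing over bands.
print(np.allclose(y, (T * F).sum(axis=2).ravel()))  # True
```

Each row of `H` has at most `L` nonzero entries, one per band, which is why the matrix appears as groups of diagonals.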
Reconstructing the 3D HSI data cube from the 2D measurement is an underdetermined problem that can be solved by compressive sensing methods. Because HSIs are highly correlated along the spatial and spectral dimensions,
can be sparsely represented as
, where
and
denote the sparse basis matrix and sparse coefficient vector, respectively. In this paper, the basis matrix is defined as
, where
denotes the Kronecker product,
is the 2D wavelet Symmlet 8 basis in the spatial domain, and
is the one-dimensional discrete cosine basis in the spectral domain. The wavelet Symmlet 8 basis can effectively capture the components of different frequencies at multiple scales in the spatial dimension, while the discrete cosine basis can maintain the major low-frequency characteristics of the spectrum. The HSI of the target scene can be reconstructed by solving the following optimization problem:
where
and
denote the
-norm and
-norm respectively, and
is the regularization parameter. As mentioned in Section 1, deep learning approaches have recently been introduced to reconstruct the HSI directly from the measurement rather than iteratively updating the solutions by gradient-based algorithms. Our proposed deep learning method will be described in Section 4.
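The sparse representation and the reconstruction problem described above take the standard form of an l1-regularized least-squares problem. The following restatement uses an assumed notation that may differ from the symbols of the original equations; in particular, the ordering of the Kronecker factors depends on the vectorization convention and is an assumption here.

```latex
% Assumed notation (not taken from the source):
%   f      : vectorized HSI,          \Psi   : sparse basis matrix,
%   \theta : sparse coefficients,     \Psi_1 : 2D Symmlet 8 wavelet basis,
%   \Psi_2 : 1D discrete cosine basis,
%   H      : sensing matrix,          y : vectorized measurement,
%   \tau   : regularization parameter.
\begin{align}
\mathbf{f} &= \boldsymbol{\Psi}\boldsymbol{\theta}, \qquad
\boldsymbol{\Psi} = \boldsymbol{\Psi}_1 \otimes \boldsymbol{\Psi}_2, \\
\hat{\boldsymbol{\theta}} &= \arg\min_{\boldsymbol{\theta}}
  \left\lVert \mathbf{y} - \mathbf{H}\boldsymbol{\Psi}\boldsymbol{\theta} \right\rVert_2^2
  + \tau \left\lVert \boldsymbol{\theta} \right\rVert_1 .
\end{align}
```

The first term enforces consistency with the measurement, and the second promotes sparsity of the coefficients; the reconstructed HSI is then obtained by applying the basis to the estimated coefficients.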
3. Experimental System of LCCSI
In order to verify the proposed LCCSI method, our group established an experimental LCCSI system, as shown in Figure 4. This system consists of an objective lens (FM5014-8MP, CW Lens, Shenzhen, China), a pair of bandpass filters (GCC-300115 and GCC-211002, Daheng Optics, Beijing, China), a CCA, an XY translation mount (CXY1-M, OEABT, Guangzhou, China), a precision rotation mount (S/N0007, OEABT, Guangzhou, China), a relay lens (FM3514-10MP-A, CW Lens, Shenzhen, China), and an RGB detector (MER2-231-41U3C, Daheng Imaging, Beijing, China). The focal lengths of the objective lens and relay lens are 50 mm and 35 mm, respectively. The bandpass filters are placed behind the objective lens to limit the spectral range of the HSI to 450–650 nm. The XY translation mount and rotation mount are used to finely adjust the position and rotation angle of the CCA, thus ensuring precise pixel matching between the CCA and the RGB detector.
The CCA in our system was produced by the research group at the University of Delaware using conventional film photography, and its production process is similar to that of the CCA in [19]. Compared to the CCA with periodic arrangement in [19], our CCA adopts a completely random arrangement, which has higher modulation freedom. It was shown in [28] that a CCA with more than five color codes could achieve good reconstruction results. Thus, we use six color codes in our CCA. We use a spectrometer (USB2000+, Ocean Optics, Orlando, FL, USA) to obtain the spectral curves of a variety of color codes, and then we select six distinct codes that cover the visible spectrum to form the coding pattern with a random arrangement. Figure 5a shows the spectral modulation curves of the six codes in the CCA.
In our previous work, we designed a compact compressive spectral imaging system in which we attempted to attach the film-based coding mask to the surface of the detector [29]. However, that system lacks a relay imaging scheme; thus, it is hard to achieve perfect matching between the pixels on the CCA and the detector. Some advanced manufacturing methods may realize a high-resolution CCA with precise pixel size control, but this significantly increases the manufacturing difficulty and cost of the CCA. Additionally, there is a protective glass on the detector’s sensor, which creates a gap between the CCA and the detector. This cover glass and air gap induce refraction and diffraction effects, which may cause crosstalk among adjacent encoded pixels, so that the actual encoding process deviates from the ideal encoding model.
To address these issues, this paper proposes an LCCSI system based on a relay imaging scheme as shown in Figure 4, where the CCA pixel size does not need to be consistent with the detector pixel size, thereby reducing the manufacturing difficulty of the CCA. Furthermore, we introduce a matching scheme between one CCA pixel and a 2 × 2 Bayer filter array, which not only relaxes the requirement on the resolution of the CCA but also enhances the system’s modulation freedom. The CCA used in this work contains a
color-coded array with a pixel size of 62.5 μm. This array is combined with a 2 × 2 Bayer filter array of the RGB detector to double the spatial resolution of the cascade coding cube
, which means that one CCA pixel is matched with a 2 × 2 cycle array of the Bayer filter. Through this specific matching mechanism, the modulation freedom of the coding cube is increased in both the spatial and spectral domains. In the spatial dimension, this matching procedure doubles the spatial size of
from
to
, which enhances the spatial modulation freedom of the coding cube. In the spectral dimension, the red, green, and blue filters contained in the 2 × 2 Bayer cycle array have different spectral modulation curves, as shown in Figure 5b. When a CCA pixel is matched with a 2 × 2 Bayer filter cycle, more diverse spectral modulation curves can be generated, thereby enhancing the spectral modulation freedom of the coding cube. It is notable that a 2 × 2 Bayer filter cycle of the RGB camera contains two green filters with the same spectral response. This slightly reduces the diversity of spectral modulation in the coding cube. However, considering cost and generality, we use a common RGB camera as the detector for our LCCSI system.
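The matching between one CCA pixel and one Bayer cycle can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `cascade_coding_cube`, the RGGB ordering inside the cycle, the array sizes, and the random spectral curves are all hypothetical, and the element-wise product is assumed as the way the two modulations combine.

```python
import numpy as np

def cascade_coding_cube(cca_cube, bayer_cycle):
    """Combine a CCA coding cube with a tiled Bayer cycle.

    cca_cube    : (n, m, L) spectral transmittance of each CCA pixel
    bayer_cycle : (2, 2, L) spectral responses of one Bayer cycle
    returns     : (2n, 2m, L) cascade coding cube
    """
    rows, cols = cca_cube.shape[:2]
    # Each CCA pixel covers a 2 x 2 block of detector pixels.
    cca_up = np.repeat(np.repeat(cca_cube, 2, axis=0), 2, axis=1)
    # The Bayer cycle repeats periodically over the whole detector.
    bayer = np.tile(bayer_cycle, (rows, cols, 1))
    return cca_up * bayer

rng = np.random.default_rng(2)
L = 5                                      # illustrative number of bands
cca = rng.random((3, 3, L))                # toy CCA with random spectral curves
r, g, b = rng.random((3, L))               # toy red, green, blue responses
bayer_cycle = np.array([[r, g], [g, b]])   # assumed RGGB layout

cube = cascade_coding_cube(cca, bayer_cycle)
print(cube.shape)  # (6, 6, 5)
```

Within each 2 × 2 block the CCA curve is shared, while the Bayer response changes from pixel to pixel, so one CCA pixel yields up to three distinct cascade curves (the two green positions coincide).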
However, the modulation behavior of the real experimental system may deviate from the design due to the non-ideal fabrication conditions of the CCA and system assembly errors. In addition, strict pixel matching between the CCA and the RGB sensor is difficult to achieve through manual adjustment. These problems may introduce errors in the imaging model, thus degrading the reconstruction results of the LCCSI system. To solve these problems, after manually matching the pixels between the CCA and the RGB sensor as closely as possible, we need to calibrate the coding cube
in Equation (1) for the real LCCSI system, and the calibrated coding cube is denoted by
. It is noted that
conforms to the real modulation process of the LCCSI system. In particular, we illuminate the CCA with a monochromatic light source that is composed of a xenon lamp light source (GLORIA-X500A, ZOLIX, Beijing, China) and a monochromator (Omni-λ300i, ZOLIX, Beijing, China), scanning over nineteen spectral bands with the center wavelengths from 460 nm to 640 nm. The xenon lamp light source first provides composite light illumination covering the entire visible spectrum for the monochromator. Then, the monochromator uses an internal grating to emit monochromatic light, which is transmitted through an optical fiber to an annular illuminator, ultimately achieving monochromatic light illumination for the LCCSI system. For each spectral band, we record the image of the CCA using the RGB detector. Those images are exactly the 2D transmittance matrices of the calibrated coding cube
in the corresponding spectral bands. Then, the
is obtained by stacking all transmittance matrices together along the spectral dimension.
Figure 5c shows the images of the
for ten selected spectral bands, where the spatial size of the coding cube is
.
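The stacking step of the calibration can be sketched as below. The nineteen bands from 460 nm to 640 nm follow the text, which implies a 10 nm spacing between center wavelengths. Everything else is an assumption for illustration: the function name `build_calibrated_cube`, the synthetic 8 × 8 frames, and the scaling by the global peak value, which stands in for whatever normalization a real calibration would apply.

```python
import numpy as np

def build_calibrated_cube(band_images):
    """Stack per-band calibration images into a coding cube scaled to [0, 1].

    band_images : sequence of (H, W) detector frames, one per spectral band
    returns     : (H, W, L) array
    """
    cube = np.stack(band_images, axis=-1).astype(float)
    peak = cube.max()
    # Global peak scaling is an assumed normalization, not the source's.
    return cube / peak if peak > 0 else cube

# Nineteen center wavelengths from 460 nm to 640 nm.
wavelengths = np.linspace(460, 640, 19)

rng = np.random.default_rng(3)
# Synthetic stand-ins for the frames recorded under monochromatic light.
frames = [rng.integers(0, 256, size=(8, 8)) for _ in wavelengths]

cube = build_calibrated_cube(frames)
print(cube.shape)  # (8, 8, 19)
```

Because the frames are recorded through both the CCA and the Bayer filter, each slice of the stacked cube already contains the combined modulation of the real system for that band.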
4. HSI Reconstruction Based on F-MST Network
A Transformer-based F-MST network was developed to achieve fast and high-quality reconstruction of HSIs. The proposed F-MST network is derived from the MST network designed for the traditional CSI system [22]. Unlike the original MST network, the F-MST network removes the shift modules [22] corresponding to the dispersive effect in the traditional CSI system. Then, the mask [22] corresponding to the binary coded aperture of traditional CSI is replaced by the calibrated coding cube
of the proposed LCCSI system. In addition, the convolution-based down-sampling and up-sampling operations [22] in the MST network are replaced by the focus-based down-sampling and up-sampling modules to improve the reconstruction accuracy.
Figure 6a shows the overall structure of the proposed F-MST network, which is composed of a
convolutional layer, an embedding layer, an encoder, a bottleneck, a decoder, and a mapping layer. On the whole, this network inherits the basic framework of the U-net [
30]. First, the compressive measurement
of the LCCSI system is imported into the
convolutional layer, which expands the spectral dimension and generates the feature map
. Then, the embedding layer (
convolutional layer) is used to map
into another feature
. Afterwards,
is fed into the encoder, which contains two sets of three Mask-guided Spectral-wise Attention Blocks (MSABs) [22] and two down-sampling modules. The encoder outputs the feature map
, which then passes through the bottleneck containing three MSABs. The bottleneck outputs the feature map
, which serves as the input of the following decoder. The decoder contains two sets of three MSABs and two up-sampling modules. As shown in
Figure 6a, the skip connections indicated by the green arrows are used for feature fusion between the encoder and decoder, where the channel concatenation and the
convolutional layer are used to reduce the information loss caused by the down-sampling operations. The shallow and deep features are fused through skip connections to improve the reconstruction quality of HSIs. The decoder outputs the feature map
, which is then converted to the feature map
through the mapping layer (
convolutional layer). Finally, the feature map
is added to the feature map
to obtain the reconstructed HSI
.
Figure 6b shows the structure of the MSAB, which consists of a Feed-Forward Network (FFN), a Mask-guided Spectral-wise Multi-head Self-Attention (MS-MSA) module, and two layer normalization functions [22]. Figure 6c,d illustrate the structures of the FFN and the MS-MSA module, respectively. The MS-MSA can capture the long-range inter-spectra dependencies, which is conducive to learning the mapping functions between the 2D measurement and the 3D HSI [22]. In addition, the mask-guided mechanism (MM) [22] based on the proposed calibrated coding cube
is used to guide the network to focus on the regions with highly credible information in both the spatial and spectral dimensions. This mechanism can further improve the reconstruction quality.
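The idea of spectral-wise attention, in which each spectral channel acts as a token and the attention map relates channels rather than pixels, can be illustrated with a simplified single-head sketch. This is not the MS-MSA implementation of [22]: the function names, the plain matrix projections, the fixed `scale`, and the way `mask_attn` re-weights the values are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spectral_wise_attention(x, w_q, w_k, w_v, mask_attn=None, scale=1.0):
    """Single-head spectral-wise self-attention (simplified sketch).

    x         : (HW, C) feature map flattened over the spatial dimensions
    w_q/k/v   : (C, C) projection matrices
    mask_attn : optional (HW, C) weights derived from a coding cube
    returns   : (HW, C) attended features
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    if mask_attn is not None:
        v = v * mask_attn                        # mask-guided re-weighting
    attn = softmax(scale * (k.T @ q), axis=-1)   # (C, C) channel-to-channel map
    return v @ attn

rng = np.random.default_rng(4)
HW, C = 16, 4
x = rng.random((HW, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
out = spectral_wise_attention(x, w_q, w_k, w_v, mask_attn=rng.random((HW, C)))
print(out.shape)  # (16, 4)
```

The attention map has size C × C regardless of the spatial size, so its cost grows linearly with the number of pixels, unlike attention over spatial tokens.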
Next, we describe the differences between the proposed F-MST network and the original MST network in [22]:
(1) We remove the shifting modules [22] in the MST network corresponding to the dispersive component of the traditional CSI system. Then, in order to adapt the network to the LCCSI system, we add a convolutional layer before the embedding layer to expand the spectral dimensionality of the compressive measurement
to
.
(2) The mask [22] in the MM module of the MST network corresponds to the binary coded aperture in the traditional CSI system. Therefore, it is replaced by the 3D calibrated coding cube
of the proposed system to enhance the attention to the regions with highly credible information in both the spatial and spectral domains.
(3) The convolution-based down-sampling and up-sampling operations [22] in the MST network are replaced by the focus-based down-sampling and up-sampling modules, which can reduce the loss of feature map information and thus improve the reconstruction accuracy of HSIs. The structure of the focus-based down-sampling module is shown in Figure 6e. The focus module first obtains four feature maps of half spatial size by interval sampling and then concatenates them along the channel dimension. Combined with a
convolutional layer, a batch normalization function, and a ReLU activation function, the focus-based down-sampling module can convert the input feature map
to the output feature map
. On the other hand, the structure of the focus-based up-sampling module is shown in Figure 6f. The inverse focus module uses convolutions with different kernel sizes to obtain three new feature maps, which together with the original feature map can achieve up-sampling in the spatial domain. Combined with a
convolutional layer, a batch normalization, and a ReLU function, the focus-based up-sampling module can convert the input feature map
to the output feature map
. The focus module can retain all the information of the original feature map in the down-sampling process. In addition, the convolutions with different kernel sizes can capture the feature information at different scales in the up-sampling process. With the help of the focus-based down-sampling and up-sampling modules, the proposed F-MST network can further improve the reconstruction quality of HSIs.
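The interval sampling performed by the focus module, and the reason it retains all values of the input feature map, can be illustrated with a short sketch. The function names and the ordering of the four sub-maps are assumptions; the subsequent convolution, batch normalization, and ReLU are omitted. The `unfocus` helper is a plain rearrangement used only to demonstrate that the sampling is lossless, and it is not the inverse focus module of the paper, which relies on convolutions with different kernel sizes.

```python
import numpy as np

def focus(x):
    """Interval sampling: (C, H, W) -> (4C, H/2, W/2), for even H and W."""
    return np.concatenate(
        [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
        axis=0,
    )

def unfocus(y):
    """Exact rearrangement back to (C, 2H, 2W); shows no values were lost."""
    c = y.shape[0] // 4
    h, w = y.shape[1] * 2, y.shape[2] * 2
    x = np.empty((c, h, w), dtype=y.dtype)
    x[:, 0::2, 0::2] = y[0:c]
    x[:, 1::2, 0::2] = y[c:2 * c]
    x[:, 0::2, 1::2] = y[2 * c:3 * c]
    x[:, 1::2, 1::2] = y[3 * c:4 * c]
    return x

feat = np.arange(2 * 4 * 6).reshape(2, 4, 6)
down = focus(feat)
print(down.shape)                             # (8, 2, 3)
print(np.array_equal(unfocus(down), feat))    # True
```

A strided convolution, by contrast, mixes and discards values during down-sampling, which is the information loss that the focus-based module is intended to avoid.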