Article

Pixel-Based Long-Wave Infrared Spectral Image Reconstruction Using a Hierarchical Spectral Transformer

1  Key Laboratory of Space Active Opto-Electronics Technology, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2  University of Chinese Academy of Sciences, Beijing 100049, China
3  School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
4  Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
*  Authors to whom correspondence should be addressed.
Sensors 2024, 24(23), 7658; https://doi.org/10.3390/s24237658
Submission received: 22 October 2024 / Revised: 26 November 2024 / Accepted: 28 November 2024 / Published: 29 November 2024
(This article belongs to the Collection Advances in Spectroscopy and Spectral Imaging)

Abstract

Long-wave infrared (LWIR) spectral imaging plays a critical role in various applications such as gas monitoring, mineral exploration, and fire detection. Recent advancements in computational spectral imaging, powered by advanced algorithms, have enabled the acquisition of high-quality spectral images in real time, such as with the Uncooled Snapshot Infrared Spectrometer (USIRS). However, the USIRS system faces challenges, particularly a low spectral resolution and substantial measurement noise, which can degrade the image quality. Deep learning has emerged as a promising solution to these challenges, as it is particularly effective at handling noisy data and has demonstrated significant success in hyperspectral imaging tasks. Nevertheless, the application of deep learning in LWIR imaging is hindered by the severe scarcity of long-wave hyperspectral image data, which limits the training of robust models. Moreover, existing networks that rely on convolutional layers or attention mechanisms struggle to effectively capture both local and global spectral correlations. To address these limitations, we propose the pixel-based Hierarchical Spectral Transformer (HST), a novel deep learning architecture that learns from publicly available single-pixel long-wave infrared spectral databases. The HST is designed to achieve a high spectral resolution for LWIR spectral image reconstruction, enhancing both the local and global contextual understanding of the spectral data. We evaluated the performance of the proposed method on both simulated and real-world LWIR data, demonstrating the robustness and effectiveness of the HST in improving the spectral resolution and mitigating noise, even with limited data.

1. Introduction

Long-wave infrared (LWIR) or thermal infrared refers to wavelengths ranging between 7 and 14 μm. A range of crucial materials, including minerals and gases, exhibit distinctive spectral characteristics within this LWIR range, enabling their identification [1]. Consequently, LWIR spectral imaging plays a crucial role in various applications such as environmental monitoring, climate studies, gas monitoring, mineral exploration, and fire detection [2,3].
In recent years, computational spectral imaging techniques, powered by advanced algorithms, have enabled the acquisition of high-quality spectral images in real time [4]. A prime example is the recently proposed Uncooled Snapshot Infrared Spectrometer (USIRS), which has enabled real-time video imaging in the LWIR range [5,6]. This is achieved by replicating the target scene using optical multi-aperture imaging and assigning a filter of a different wavelength to each replicated scene [7], allowing for real-time LWIR spectral video imaging at 8 Hz. However, the USIRS presents two significant challenges. Firstly, the spectral resolution is achieved via the band-pass spectral filtering of the multi-aperture scene, creating a trade-off between the spectral and spatial resolution. To preserve an adequate spatial resolution, the spectral resolution is kept relatively low, limiting the USIRS to only 18 spectral channels. Secondly, the uncooled LWIR system, unlike commercial visible cameras that measure reflected sunlight, measures the target scene’s inherent radiation, typically low-intensity blackbody radiation at room temperature. This radiation is highly sensitive to background noise, resulting in a low signal-to-noise ratio (SNR) [8,9], which further degrades the quality of spectrum acquisition and imaging performance. Therefore, a specifically designed computational algorithm is required to address these challenges.
Over the past decade, deep learning has demonstrated extraordinary robustness to noise [10,11] and high accuracy in hyperspectral image reconstruction and spectroscopy reconstruction [12], making it a promising solution to the challenges faced by the USIRS. This approach successfully integrates both spatial and spectral information to reduce noise and preserve details in hyperspectral images, leading to significant improvements in image quality [13,14,15,16]. Shi et al. advanced hyperspectral image denoising using a 3D attention denoising network [17]. By incorporating a three-dimensional attention mechanism, they managed to effectively capture intricate patterns and relationships within the hyperspectral data, resulting in superior denoising performance. Xiong et al. enhanced hyperspectral image denoising using a multitask learning approach combined with sparse representation techniques [18].
On the other hand, the field of spectroscopy reconstruction has introduced many pixel-based spectrum reconstruction methods, which mitigate the need for a large number of spectral images to train a neural network [19,20,21,22]. For instance, Wen et al. utilized the Multilayer Perceptron (MLP) to efficiently reconstruct the absolute spectra of various nonluminous samples, thereby developing a miniaturized spectrometer [23]. J. Zhang et al. suggested feeding the initial results of optimization-based methods into the MLP instead of directly feeding the raw measurements [24]. This reconstruction method has shown great potential for the spectral recovery of filter-based miniature spectrometers. However, in the task of LWIR spectral reconstruction, existing neural network structures struggle to handle the noise and unique characteristics of the LWIR spectrum. LWIR data can be noisy, requiring neural networks to be robust against such noise to avoid inaccurate reconstructions. Traditional Convolutional Neural Networks (CNNs) [25,26,27,28] often struggle with this task due to their limited receptive fields and inefficient modeling of long-range dependencies [29]. Furthermore, the reconstruction problem is inherently ill-posed, meaning that multiple possible solutions often exist for a given set of measurements. This complexity makes it challenging for MLPs [30,31,32] to learn meaningful representations without significant computational overhead and a risk of overfitting [33]. Therefore, there is a demand for bespoke network architectures for LWIR spectral reconstruction, capable of capturing the complex spectral correlations and dependencies that are inherent to the LWIR spectrum.
To address these challenges, we propose the pixel-based Hierarchical Spectral Transformer (HST). First, instead of directly training on LWIR spectral images, we train the pixel-based neural network on public LWIR spectral databases [34]. This strategy reduces the need for high-quality paired data for neural network training. Second, we incorporate an LWIR spectral noise model into the USIRS data simulation pipeline. The noise model serves as a bridge between synthetic and experimental data, enabling our network, trained on synthetic data, to effectively handle experimental data collected by the USIRS. Third, the HST employs a multistage architecture, where each stage progressively refines the spectral features, enhancing the detail and accuracy of the reconstruction. By employing spectral-wise multi-head self-attention [35], the HST can concentrate on the most relevant spectral features, thereby reducing the computational complexity and enhancing the reconstruction quality. This hierarchical approach preserves both global and local spectral information, effectively mitigating the heavy noise and achieving superior performance in LWIR spectral image reconstruction tasks. Furthermore, while the HST is specifically designed and evaluated for the USIRS system, its versatility allows it to be applied to other computational spectral imaging systems.
Our contributions are summarized as follows:
  • We propose training the neural network on pixel-based LWIR spectra and integrating an LWIR spectral noise model into the USIRS data simulation pipeline. This approach reduces the need for high-quality paired data for training on LWIR spectral images.
  • We introduce the Hierarchical Spectral Transformer (HST), designed to effectively learn and preserve both global and local spectral information, thus mitigating the large amount of noise and enhancing the reconstruction accuracy.
  • We evaluate our pipeline using both synthetic and experimental data to demonstrate its effectiveness in handling real-world scenarios.

2. Materials and Methods

2.1. The USIRS

The Uncooled Snapshot Infrared Spectrometer (USIRS) is an industrial instrument developed by the Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, as shown in Figure 1a.
The USIRS uses filters and multi-aperture imaging to capture images at various wavelengths, providing a detailed spectral image of the target object. Figure 1b illustrates the optical design of the USIRS. The process begins with a telescopic lens collecting light from the target. This light is then directed to a collimating lens, ensuring that the rays are parallel. Following this, a lens array composed of nine smaller lenses recreates the target scene. The light then travels through a filter array, in which each filter corresponds to a specific waveband, before it finally reaches the focal plane, where images of different wavelengths are created and captured.
The USIRS utilizes two identical optical systems, with the only difference being the wavelengths of the filter arrays. This effectively makes it a dual system with 17 band-pass wavelengths: 7.22, 7.67, 7.96, 8.33, 8.70, 9.07, 9.48, 9.81, 10.18, 10.55, 10.92, 11.29, 11.66, 12.03, 12.40, 12.77, and 13.14 μm.
Figure 1c displays one of the USIRS images for a target scenario involving two individuals and two gas bottles releasing NH3 and SF6, respectively. The spectral curves of the gases marked by the blue box are depicted in Figure 1d. The USIRS does not have a high spectral resolution, given that it has only 17 band-pass spectral channels, and its measurements tend to be quite noisy. For further applications, it is crucial to denoise the measurements and reconstruct a high-resolution spectrum.

2.2. Imaging and Noise Model

Deep learning has emerged as a promising solution to the problem of hyperspectral reconstruction with large amounts of noise. However, the challenge in using neural networks for long-wave infrared spectral video reconstruction lies in the scarcity of sufficient high-quality data for training. Most existing long-wave infrared hyperspectral datasets consist of single-point spectral data collected through spectrometers. Therefore, we propose to develop the imaging and noise model of the USIRS, allowing us to generate corresponding low-resolution data from high-resolution ones using this model and subsequently train the neural network at the pixel spectrum level.
The signal received by the USIRS is a combination of the target radiation and background radiation:
$$L(\lambda) = L_t(\lambda)\,\tau(\lambda) + L_p(\lambda)$$
Here, $L_t(\lambda)$ represents the radiation spectrum of the target on the lens of the USIRS, $L(\lambda)$ is the radiation reaching the detector, $\tau(\lambda)$ is the spectral response of the imaging system, and $L_p(\lambda)$ is the background radiation. The radiation reaching the detector is converted into a current and then quantified as a digital number (DN):
$$DN = \mathcal{P}(\phi L) + N_d + N_q$$
In this equation, $\mathcal{P}$ is the photon noise operator, $\phi$ is the quantum efficiency of the detector, $N_d$ is the dark current in the circuits, and $N_q$ is the quantization noise. Both the dark current $N_d$ and the quantization noise $N_q$ are assumed to obey Gaussian distributions, $\mathcal{N}(\mu_d, \sigma_d)$ and $\mathcal{N}(\mu_q, \sigma_q)$, respectively.
The imaging and noise model provides a pipeline for converting publicly available high-resolution spectral data into synthetic measurements of the USIRS, thus providing paired data for neural network training.
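To make this conversion concrete, the following is a minimal NumPy sketch of the pipeline under the model above: band-pass filtering of a high-resolution spectrum, Poisson photon noise, and additive Gaussian dark-current and quantization noise. The filter matrix, photon scale, and noise parameters here are illustrative assumptions, not calibrated USIRS values.

```python
import numpy as np

def simulate_usirs_measurement(spectrum, filters, phi=0.5,
                               mu_d=0.0, sigma_d=0.01,
                               mu_q=0.0, sigma_q=0.005, rng=None):
    """Convert a high-resolution spectrum into a noisy synthetic USIRS
    measurement. `filters` is a (17, len(spectrum)) array of band-pass
    spectral responses; all parameter values are placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    # Band-pass filtering: integrate the spectrum over each channel.
    channels = filters @ spectrum                      # shape (17,)
    # Photon noise P(phi * L): Poisson statistics on the detected signal.
    scale = 1e4                                        # assumed photons per radiance unit
    signal = rng.poisson(np.clip(phi * channels, 0, None) * scale) / scale
    # Dark current N_d and quantization noise N_q: additive Gaussians.
    signal += rng.normal(mu_d, sigma_d, size=signal.shape)
    signal += rng.normal(mu_q, sigma_q, size=signal.shape)
    return signal
```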

2.3. Pixel-Based Hierarchical Spectral Transformer

We propose the Hierarchical Spectral Transformer (HST), a model that utilizes self-attention to capture long-range dependencies and spectral correlations. This model employs a multistage architecture to progressively refine spectral and spatial features, thereby enhancing the reconstruction accuracy. This approach preserves both global and local spectral information, improving the performance in LWIR spectral image reconstruction tasks.
The network structure of the proposed HST is illustrated in Figure 2.

2.3.1. Positional Encoding

The input of the network is the measured intensity in the different spectral channels, denoted as $I \in \mathbb{R}^{l}$, concatenated with the corresponding positional encoding, denoted as $Pos \in \mathbb{R}^{2M \times l}$. Each feature’s position, denoted as $m$, is transformed into a positional embedding via a trigonometric transformation:
$$Pos(m) = [\sin(m \cdot 2^{0}), \cos(m \cdot 2^{0}), \sin(m \cdot 2^{1}), \cos(m \cdot 2^{1}), \ldots, \sin(m \cdot 2^{M}), \cos(m \cdot 2^{M})]$$
In this equation, $M$ is the maximum encoding dimension, set to 5 in our system. The combined input is as follows:
$$S = I \oplus Pos \in \mathbb{R}^{(2M+1) \times l}$$
where $\oplus$ denotes concatenation along the feature dimension.
The combined input is fed into the multistage transformer. This process ensures that each position has a unique encoding, allowing the model to comprehend the order of the input data.
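A short PyTorch sketch of this encoding is given below. The printed formula lists sine/cosine pairs up to exponent $2^{M}$ while $Pos$ is stated to have $2M$ rows; we reconcile this by using exponents $0$ to $M-1$, which is our reading rather than a detail confirmed above.

```python
import torch

def positional_encoding(l=17, M=5):
    """Trigonometric positional embedding Pos of shape (2M, l).
    Sines and cosines are stacked rather than interleaved; the
    ordering does not affect what the network can learn."""
    m = torch.arange(l, dtype=torch.float32)              # channel positions
    freqs = 2.0 ** torch.arange(M, dtype=torch.float32)   # 2^0 ... 2^(M-1)
    angles = m[None, :] * freqs[:, None]                  # (M, l)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=0)

def combine_input(intensity, M=5):
    """Concatenate the measured intensity I (shape (l,)) with its
    positional encoding to form S of shape (2M + 1, l)."""
    pos = positional_encoding(intensity.shape[-1], M)
    return torch.cat([intensity[None, :], pos], dim=0)
```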

2.3.2. Hierarchical Representation

We employ a point-based transformer to process the input intensity. In contrast to standard visual transformers that work with patches, this method divides the input spectral intensity into separate point tokens. These tokens are then processed using multiple transformer blocks, each integrating multi-head self-attention computations.
We construct a hierarchical representation by reducing the token count with the help of merging layers as we progress deeper into the stages. These merging layers combine the features of every two adjacent tokens and apply a linear layer to the combined features. This process results in halving the token count while doubling the channel depth. All these stages collectively yield a hierarchical representation. In the k-th stage, the output of the transformer is
$$S^{(k)} = \mathrm{Transformer}\{S^{(k-1)}\}$$
In this equation, $S^{(k)} \in \mathbb{R}^{(2M+1)2^{k} \times l/2^{k}}$ is the output of the $k$-th stage transformer, and $S^{(k-1)} \in \mathbb{R}^{(2M+1)2^{k-1} \times l/2^{k-1}}$ is its input, with $k = 1, 2, 3, \ldots$. For the first stage, we define the input as $S^{(0)} = S$.
In our experiments, we use a 5-stage representation to reduce the 17 channels to a single token with a feature dimension of 32. In each stage, the transformer block is repeated $L = 4$ times.
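The merging step between stages can be sketched as follows. This is a minimal PyTorch reading of the description above; in particular, padding an odd token count (such as the 17 input channels) by duplicating the last token is our assumption.

```python
import torch
import torch.nn as nn

class TokenMerging(nn.Module):
    """Merging layer between HST stages: concatenates the features of
    every two adjacent tokens and applies a linear layer, halving the
    token count while doubling the channel depth."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, 2 * dim)

    def forward(self, x):
        # x: (batch, tokens, dim)
        b, n, c = x.shape
        if n % 2 == 1:                        # pad an odd token count
            x = torch.cat([x, x[:, -1:, :]], dim=1)
            n += 1
        x = x.reshape(b, n // 2, 2 * c)       # pair adjacent tokens
        return self.proj(x)                   # (batch, n / 2, 2 * dim)
```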

2.3.3. Attention Mechanism in Transformer

Each stage employs multi-head attention to facilitate token reduction. The attention mechanism allows the model to focus on the most relevant parts of the input data by assigning different weights to different parts of the input.
In the transformer, the input of the $k$-th stage, $S^{(k-1)}$, is first linearly transformed by three learnable weight matrices, $W_Q$, $W_K$, and $W_V$, to obtain the queries $Q$, keys $K$, and values $V$, respectively. We denote the query at the $i$-th position as $Q_i$; the keys and values follow the same convention. The attention weight between the $i$-th and $j$-th positions is computed as follows:
$$\hat{\alpha}_{i,j} = \mathrm{softmax}\!\left(\frac{Q_i K_j^{T}}{\sqrt{h}}\right)$$
where $h$ is the dimension of the queries and keys. The attention output at the $i$-th position is computed as follows:
$$B_i = \sum_{j} \hat{\alpha}_{i,j} V_j$$
The output is a weighted sum of the values, allowing the model to focus on the most relevant parts of the input data. At the end of this stage, we concatenate the attention outputs, $B = \mathrm{concat}(B_1, B_2, \ldots, B_l)$, and further pass the resulting features through a linear layer:
$$S^{(k)} = \mathrm{Linear}\{B\}$$
In each stage, the attention mechanism can be repeated $L$ times to produce a deep neural network. In this case, the linear layer is applied only to the output of the last attention block.
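A single-head version of this computation, written as a minimal PyTorch sketch of the equations above rather than the full multi-head spectral-wise attention used in the HST, could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention: learnable projections W_Q,
    W_K, W_V, weights softmax(Q K^T / sqrt(h)), output a weighted sum
    of the values followed by a linear layer."""
    def __init__(self, dim):
        super().__init__()
        self.h = dim                                  # query/key dimension
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim)                # final linear layer

    def forward(self, s):
        # s: (batch, tokens, dim)
        q, k, v = self.W_q(s), self.W_k(s), self.W_v(s)
        alpha = F.softmax(q @ k.transpose(-2, -1) / self.h ** 0.5, dim=-1)
        b = alpha @ v                                 # B_i = sum_j alpha_ij V_j
        return self.out(b)
```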

2.3.4. Implementation Details

Our network training employs the Mean Squared Error (MSE) loss, which is mathematically defined as follows:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(s_i - \hat{s}_i\right)^2$$
In this equation, $n$ represents the total number of data points, $s_i$ is the ground truth spectrum, and $\hat{s}_i$ is the reconstructed spectrum. This formula calculates the average of the squared differences between the ground truth and predicted spectrum. Larger errors are penalized more heavily, pushing the model to reduce significant discrepancies and aim for more accurate predictions, ensuring consistent and reliable results.
We trained our network with a batch size of 100 for 1000 epochs. The Adam optimizer was used with an initial learning rate of 0.001, which decays by a factor of 0.5 every 100 epochs. The minimum learning rate was set to $1 \times 10^{-5}$. The HST was implemented using PyTorch 1.8, and the training process took approximately 3 h on an NVIDIA 3090 GPU.
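For reference, a sketch of this training configuration is shown below. The `model` and `train_loader` objects are placeholders for the HST and the paired synthetic dataset, and clamping the learning rate to the stated floor is our interpretation of how the minimum is enforced.

```python
from torch import nn, optim

def train(model, train_loader, device="cuda"):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    # Halve the learning rate every 100 epochs, as stated above.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
    for epoch in range(1000):
        for measured, target in train_loader:         # paired synthetic data
            measured, target = measured.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(measured), target)
            loss.backward()
            optimizer.step()
        scheduler.step()
        for group in optimizer.param_groups:          # enforce the 1e-5 floor
            group["lr"] = max(group["lr"], 1e-5)
```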

3. Results

We first evaluated the effectiveness of our method using synthetic data.
This synthetic dataset, employed for the evaluation, comprises four sections. The first section includes 1289 samples from the ASTER LWIR spectral library, selected for their distinctive LWIR attributes. The ASTER library is a comprehensive collection housing over 2300 spectra of natural and human-made materials. It covers a wide range of materials, with wavelengths ranging from 0.4 to 15.4 µm. The library includes the spectra of various substances such as rocks, minerals, lunar soils, terrestrial soils, human-made materials, meteorites, vegetation, snow, and ice. The second section contains 1000 simulated samples of blackbody radiation, ranging from 100 K to 1000 K, in line with Planck’s law. The third section presents the gas emission spectra of 24 characteristic LWIR gases, including methane (CH4), ethylene (C2H4), ammonia (NH3), sulfur hexafluoride (SF6), cyclohexane (C6H12), methanol (CH3OH), acetone (CH3COCH3), trimethylamine ((CH3)3N), cyclopropane (C3H6), propylene (C3H6), trans-2-butene (C4H8), butadiene (C4H6), ethylene oxide (C2H4O), dimethylamine (HN(CH3)2), vinyl chloride (C2H3Cl), 1-butene (C4H8), acetylene (C2H2), propyne (C3H4), dimethyl ether (C2H6O), acetaldehyde (CH3CHO), chloroethane (C2H5Cl), chloromethane (CH3Cl), methylamine (CH3NH2), and sulfur dioxide (SO2). Lastly, the fourth section simulates the radiation spectrum of gas in front of a blackbody, based on the model in [6], and includes 2400 samples.
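As an illustration of how the blackbody section of such a dataset could be generated, the sketch below evaluates Planck’s law over the LWIR band; the wavelength sampling and temperature grid are assumptions chosen for illustration.

```python
import numpy as np

def planck_radiance(wavelength_um, T):
    """Spectral radiance of a blackbody at temperature T (Planck's law),
    with wavelengths given in micrometers."""
    h = 6.62607015e-34    # Planck constant, J*s
    c = 2.99792458e8      # speed of light, m/s
    kB = 1.380649e-23     # Boltzmann constant, J/K
    lam = wavelength_um * 1e-6
    return (2 * h * c**2 / lam**5) / np.expm1(h * c / (lam * kB * T))

# Example: 1000 blackbody spectra spanning 100-1000 K over 7-14 um.
wavelengths = np.linspace(7.0, 14.0, 281)
temperatures = np.linspace(100.0, 1000.0, 1000)
blackbody_set = np.stack([planck_radiance(wavelengths, T) for T in temperatures])
```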
We began with a qualitative evaluation of the proposed HST’s ability to handle four different spectrum types—minerals, blackbody, gas emission, and gas transmission—and to generate high-resolution spectrum reconstruction, as depicted in Figure 3. The reconstructed data aligned closely with the ground truth across all types, highlighting the effectiveness of our reconstruction method. This figure emphasizes the importance of accurate spectral data replication in applications such as remote sensing, spectroscopy, and material identification.
For a quantitative evaluation of the reconstruction’s performance, we employed general metrics. The correlation coefficient, for example, provides insights into the linear relationship between two spectra:
$$r = \frac{\sum_{i=1}^{l}\left(s_i - \bar{s}\right)\left(g_i - \bar{g}\right)}{l\,\sigma_s\,\sigma_g}$$
where $s$ and $g$ represent the target and ground truth spectra, respectively; $s_i$ and $g_i$ denote the sampled spectra indexed by $i$; $\bar{s}$ and $\bar{g}$ are the means of $s$ and $g$; $\sigma_s$ and $\sigma_g$ are their standard deviations; and $l$ is the length of each spectrum.
The Root Mean Squared Error (RMSE) is another commonly used metric for comparing the differences between a reconstructed spectrum and the ground truth. The RMSE is the square root of the average of these differences squared. Using s and g as the target and ground truth spectra, respectively, the RMSE is defined as follows:
$$RMSE = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\left(s_i - g_i\right)^2}$$
where $l$ is the length of each spectrum vector, $s_i$ is the sampled result for the reconstructed spectrum at index $i$, $g_i$ is the sampled result for the ground truth at index $i$, and $(s_i - g_i)^2$ is their squared difference. The RMSE is always non-negative, and a lower value is generally more desirable.
The Peak Signal-to-Noise Ratio (PSNR) is a widely used metric in spectrum reconstruction for assessing the quality of a reconstructed spectrum in comparison to the ground truth [36]. Using s and g as the target and ground truth spectra, respectively, the PSNR is defined as follows:
$$PSNR = 20 \cdot \log_{10}\frac{MAX_s}{\sqrt{MSE}}$$
where $MAX_s$ is the maximum possible spectrum intensity value; for a normalized spectrum, this is set to 1. $MSE$ is the Mean Squared Error between the target and ground truth spectra and is defined as follows:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(s_i - g_i\right)^2$$
where $n$ is the length of the spectrum, and $s_i$ and $g_i$ are the intensity values of the reconstructed and ground truth spectra, respectively, at the $i$-th position. A higher PSNR indicates better reconstruction quality, i.e., less distortion due to noise.
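All three metrics follow directly from these definitions; a short sketch, assuming spectra normalized to $[0, 1]$ so that $MAX_s = 1$, is given below.

```python
import numpy as np

def spectrum_metrics(s, g):
    """Correlation coefficient, RMSE, and PSNR between a reconstructed
    spectrum s and ground truth g (1-D arrays normalized to [0, 1])."""
    l = len(s)
    corr = np.sum((s - s.mean()) * (g - g.mean())) / (l * s.std() * g.std())
    mse = np.mean((s - g) ** 2)
    rmse = np.sqrt(mse)
    psnr = 20 * np.log10(1.0 / rmse)    # MAX_s = 1 for normalized spectra
    return corr, rmse, psnr
```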
We quantitatively evaluated our proposed methods and compared them with various other methods. In Figure 4, our proposed HST is compared with other neural network methods and interpolation methods for spectrum reconstruction for four materials: kaolin, a 770 K blackbody, vinyl chloride gas emission, and propene gas transmission. The correlation coefficient of the reconstructed spectrum with the ground truth is indicated in the legend. Our proposed HST consistently demonstrated the highest fidelity to the ground truth across all materials. The Linear and Cubic methods offer simple approximations, while deep learning approaches—MLP [32], CNN [26], and Transformer [35]—demonstrate varying degrees of improvement, with the HST model achieving superior performance.
We further assessed the proposed HST and other methods at different noise levels on synthetic data, as shown in Table 1. The performances of various interpolation and neural network methods (Linear Interpolation, Cubic Interpolation, MLP, CNN, Transformer, and HST) in spectrum reconstruction using synthetic data across different noise levels (0, 0.1, 0.2, 0.5) were compared. Performance was measured using the Root Mean Squared Error (RMSE), correlation, and Peak Signal-to-Noise Ratio (PSNR). The HST model exhibited the best performance, with the lowest RMSE and highest correlation and PSNR values across almost all noise levels, as highlighted in red. The CNN and Transformer models follow closely, as marked in blue.
We then evaluated our method using experimental data. The experimental data used for the evaluation consist of laboratory experiments with 11 types of gases and field experiments with a blackbody and two gases. The gases in the laboratory experiments included methane (CH4), ethylene (C2H4), ammonia (NH3), sulfur hexafluoride (SF6), trimethylamine ((CH3)3N), cyclopropane (C3H6), propylene (C3H6), trans-2-butene (C4H8), butadiene (C4H6), ethylene oxide (C2H4O), and vinyl chloride (C2H3Cl), while the gases in the field experiments included ammonia (NH3) and sulfur hexafluoride (SF6). In the laboratory experiments, a blackbody was used as an ideal background, and a customized gas chamber was placed between the blackbody and the prototype with a 0.5 m optical path. Under these conditions, the measured gases were released from gas bottles into the gas cell. During the experiment, the background blackbody temperature was set at 50 °C. The gas temperature, humidity, and pressure in the gas cell were 23.1 °C, 40%, and 1 atm, respectively. The gas was charged into the gas cell at purities of 0.5%, 1%, and 2%, corresponding to path-integrated concentrations of 2500, 5000, and 10,000 ppm·m, respectively. The path-integrated concentration is determined by multiplying the concentration by the gas cell’s optical path.
We initially compared our proposed HST with other methods using laboratory experiments with gas transmission spectra. As illustrated in Figure 5, our proposed HST accurately reconstructed the high-resolution spectrum from the low-resolution input (Linear Interpolation, solid blue line) for four different types of gas transmission spectra (cyclopropane, butadiene, vinyl chloride, and methane) across a wavelength range of 7 to 14 μm. In comparison to the MLP, CNN, and Transformer methods, our proposed HST effectively captured the details that were lost by the other methods, such as the peak at 9.5 μm for cyclopropane. Overall, our proposed HST outperforms the baseline networks.
We also quantitatively evaluated the performance of different methods in Table 2. This table illustrates the performance of various interpolation and learning methods in terms of three metrics: Root Mean Square Error (RMSE), correlation, and Peak Signal-to-Noise Ratio (PSNR). The Linear and Cubic Interpolation methods show moderate performance, with RMSE values of 0.1149 and 0.1338, respectively. The MLP method performs comparably, with an RMSE of 0.1241 and a correlation of 0.9248. The CNN method significantly enhances the performance, achieving an RMSE of 0.0437 and a correlation of 0.9866. Among all the methods, the HST (ours) demonstrates the best performance, with the lowest RMSE of 0.0333, the highest correlation of 0.9915, and the highest PSNR of 26.78, indicating its superior predictive capabilities. The second best performance is achieved by the Transformer method, marked in blue, with an RMSE of 0.0422 and a correlation of 0.9880. This table emphasizes the effectiveness of our HST method in accurately reconstructing the high-resolution spectrum.
Figure 6 illustrates the evaluation of our method using field experimental data. The experiment’s goal was to test the effectiveness of the spectral imaging in complex scenes. In an open field, we released SF6 and NH3 gases, each with a concentration of 40,000 ppm, marked by red and green boxes, respectively. The images were captured from approximately 100 m away and 15 m above ground. The ground’s temperature, around 35 °C, created a strong thermal contrast against the cooler gas temperature of 22 °C. This contrast helped highlight the gas targets. However, it also introduced significant background noise into the raw data, alongside the inherent Poisson noise. Our HST was trained using synthetic data. We then used the HST to reconstruct the high-resolution spectrum from the noisy input intensity for each pixel. We selected three different targets for this process: a road, SF6 gas, and NH3 gas. The middle section of the figure offers a detailed comparison of the input, reconstructed, and reference spectral data for these three materials. The input data appear noisy and distorted when compared to the reference spectrum, primarily due to background and Poisson noise. Nevertheless, the graphs in this section show that the network’s reconstructions correlate strongly with the reference spectra, emphasizing the network’s high accuracy in preserving spectral features. The displayed quantitative results further highlight the consistency and precision of our reconstructions.

4. Conclusions

The proposed Hierarchical Spectral Transformer presents an innovative solution to challenges in long-wave infrared spectral image reconstruction. By concentrating training on pixel-based LWIR spectra and incorporating an LWIR spectral noise model into the data simulation pipeline, the HST effectively bridges the gap between synthetic and experimental data, reducing the need for high-quality paired data. The multistage architecture of the HST provides a detailed and accurate reconstruction by progressively refining the spectral features and using spectral-wise multi-head self-attention to focus on the relevant spectral features. This approach surpasses traditional CNNs, which struggle with the noise and unique characteristics of the LWIR spectrum. Despite the inherently ill-posed nature of the reconstruction problem, the HST demonstrates promising results on both synthetic and experimental data, indicating its potential for real-world applications in environmental monitoring, climate studies, and even carbon neutrality efforts.

Author Contributions

Conceptualization, Z.W. and Y.Y.; methodology, Z.W.; software, Z.W.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W. and Y.Y.; resources, Y.Y.; data curation, Y.Y.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W.; visualization, Z.W.; supervision, L.Y. and C.L.; project administration, C.L. and J.W.; funding acquisition, L.Y., C.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Commission of Shanghai Municipality Technology Plan Project (21DZ1200403), the Preliminary Research Project on Civil Aerospace Technology (D040107), and the National Key Research and Development Program of China (2023YFF0713303).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and data are publicly available at https://github.com/zeromakerplus/LWIR_spectral_reconstruction_public (accessed on 27 November 2024).

Acknowledgments

We would like to acknowledge the assistance provided by Shijie Liu and Guoliang Tang in the hardware experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

For the convenience of our readers, we have provided a table outlining the variables used in this paper, along with their meanings.
Greek Symbol      Meaning
P                 photon noise
τ                 response
σ                 variance
μ                 mean
α                 attention weight

Lowercase Label   Meaning                 Uppercase Label   Meaning
t                 target                  L                 radiation
p                 background              M                 encoding dimension
d                 dark current            N                 noise
q                 quantization            Q                 query
n                 total number of data    K                 key
s                 spectrum                V                 value
g                 ground truth spectrum   S                 network input
l                 length of spectrum      B                 concatenated attention

References

  1. Shaw, G.A.; Burke, H.-H.K. Spectral Imaging for Remote Sensing. Linc. Lab. J. 2003, 14, 3–28. [Google Scholar]
  2. Jia, J.; Wang, Y.-M.; Chen, J.; Guo, R.; Shu, R.; Wang, J. Status and application of advanced airborne hyperspectral imaging technology: A review. Infrared Phys. Technol. 2020, 104, 103–115. [Google Scholar] [CrossRef]
  3. Manolakis, D.G.; Pieper, M.L.; Truslow, E.; Lockwood, R.B.; Weisner, A.; Jacobson, J.; Cooley, T.W. Longwave Infrared Hyperspectral Imaging: Principles, Progress, and Challenges. IEEE Geosci. Remote Sens. Mag. 2019, 7, 72–100. [Google Scholar] [CrossRef]
  4. Bacca, J.; Martinez, E.; Arguello, H. Longwave Computational Spectral Imaging: A Contemporary Overview. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 2023, 40, 115–125. [Google Scholar] [CrossRef] [PubMed]
  5. Yang, Y.; Liu, S.; Wang, P.; Yuan, L.; Tang, G.; Liu, X.; Lu, J.; Kong, Y.; Li, C.; Wang, J. Uncooled Snapshot Infrared Spectrometer With Improved Sensitivity for Gas Imaging. IEEE Trans. Instrum. Meas. 2024, 73, 4506009. [Google Scholar] [CrossRef]
  6. Yang, Y.; Wang, Z.; Wang, P.; Tang, G.; Liu, C.; Li, C.; Wang, J. Robust gas species and concentration monitoring via cross-talk transformer with snapshot infrared spectral imager. Sens. Actuators B Chem. 2024, 413, 135780. [Google Scholar] [CrossRef]
  7. Oiknine, Y.; August, I.; Stern, A. Multi-aperture snapshot compressive hyperspectral camera. Opt. Lett. 2018, 43, 5042–5045. [Google Scholar] [CrossRef]
  8. Takasawa, S. Uncooled LWIR imaging: Applications and market analysis. Image Sens. Technol. Mater. Devices Syst. Appl. II 2015, 9481. [Google Scholar] [CrossRef]
  9. Vollmer, M. Infrared Thermal Imaging. In Computer Vision: A Reference Guide; Springer International Publishing: Cham, Switzerland, 2020; pp. 1–4. [Google Scholar]
  10. Zhang, Q.; Zheng, Y.; Yuan, Q.; Song, M.; Yu, H.; Xiao, Y. Hyperspectral Image Denoising: From Model-Driven, Data-Driven, to Model-Data-Driven. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 13143–13163. [Google Scholar] [CrossRef]
  11. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  12. Yuan, X.; Wu, Z.; Luo, T. Coded Aperture Snapshot Spectral Imager. In Coded Optical Imaging; Liang, J., Ed.; Springer: Cham, Switzerland, 2024. [Google Scholar]
  13. Yuan, Q.; Zhang, Q.; Li, J.; Shen, H.; Zhang, L. Hyperspectral Image Denoising Employing a Spatial–Spectral Deep Residual Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1205–1218. [Google Scholar] [CrossRef]
  14. Wang, L.; Sun, C.; Fu, Y.; Kim, M.H.; Huang, H. Hyperspectral Image Reconstruction Using a Deep Spatial-Spectral Prior. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8024–8033. [Google Scholar]
  15. Cai, Y.; Lin, J.; Hu, X.; Wang, H.; Yuan, X.; Zhang, Y.; Timofte, R.; Gool, L.V. Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17481–17490. [Google Scholar]
  16. Cai, Y.; Lin, J.; Hu, X.; Wang, H.; Yuan, X.; Zhang, Y.; Timofte, R.; Gool, L.V. Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction. In Proceedings of the 2022 European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Volume 13677. [Google Scholar]
  17. Shi, Q.; Tang, X.; Yang, T.; Liu, R.; Zhang, L. Hyperspectral Image Denoising Using a 3-D Attention Denoising Network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10348–10363. [Google Scholar] [CrossRef]
  18. Xiong, F.; Zhou, J.; Zhou, J.; Lu, J.; Qian, Y. Multitask Sparse Representation Model Inspired Network for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518515. [Google Scholar] [CrossRef]
  19. Zhang, J.; Zhu, X.; Bao, J. Denoising autoencoder aided spectrum reconstruction for colloidal quantum dot spectrometers. IEEE Sens. 2021, 21, 6450–6458. [Google Scholar] [CrossRef]
  20. Brown, C.; Goncharov, A.; Ballard, Z.S.; Fordham, M.; Clemens, A.; Qiu, Y.; Rivenson, Y.; Ozcan, A. Neural network-based on-chip spectroscopy using a scalable plasmonic encoder. ACS Nano 2021, 15, 6305–6315. [Google Scholar] [CrossRef]
  21. Zhang, W.; Song, H.; He, X.; Huang, L.; Zhang, X.; Zheng, J.; Shen, W.; Hao, X.; Liu, X. Deeply learned broadband encoding stochastic hyperspectral imaging. Light Sci. Appl. 2021, 10, 108. [Google Scholar] [CrossRef]
  22. Wang, J.; Pan, B.; Wang, Z.; Zhang, J.; Zhou, Z.; Yao, L.; Wu, Y.; Ren, W.; Wang, J.; Ji, H.; et al. Single-pixel p-graded-n junction spectrometers. Nat. Commun. 2024, 15, 1773. [Google Scholar] [CrossRef]
  23. Wen, J.; Hao, L.; Gao, C.; Wang, H.; Mo, K.; Yuan, W.; Chen, X.; Wang, Y.; Zhang, Y.; Shao, Y.; et al. Deep learning-based miniaturized all-dielectric ultracompact film spectrometer. Acs Photonics 2022, 10, 225–233. [Google Scholar] [CrossRef]
  24. Zhang, J.; Zhu, X.; Bao, J. Solver-informed neural networks for spectrum reconstruction of colloidal quantum dot spectrometers. Opt. Express 2020, 28, 33656–33672. [Google Scholar] [CrossRef]
  25. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 1, 25. [Google Scholar] [CrossRef]
  27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  28. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  29. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R.S. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–8 December 2016; Volume 29. [Google Scholar]
  30. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  31. Gardner, M.W.; Dorling, S.R. Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences. Atmos. Environ. 1998, 32, 2627–2636. [Google Scholar] [CrossRef]
  32. Haykin, S. Neural Networks and Learning Machines, 3/E; Pearson Education India: Noida, Uttar Pradesh, India, 2009. [Google Scholar]
  33. Cherubini, D.; Fanni, A.; Montisci, A.; Testoni, P. Inversion of MLP neural networks for direct solution of inverse problems. IEEE Trans. Magn. 2005, 41, 1784–1787. [Google Scholar] [CrossRef]
  34. Baldridge, A.M.; Hook, S.J.; Grove, C.I.; Rivera, G. The ASTER spectral library version 2.0. Remote Sens. Environ. 2009, 113, 711–715. [Google Scholar] [CrossRef]
  35. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  36. Yoon, H.H.; Fernandez, H.A.; Nigmatulin, F.; Cai, W.; Yang, Z.; Cui, H.; Ahmed, F.; Cui, X.; Uddin, M.G.; Minot, E.D.; et al. Miniaturized spectrometers with a tunable van der Waals junction. Science 2022, 378, 296–299. [Google Scholar] [CrossRef]
Figure 1. (a) Photograph of the USIRS for monitoring the gas emission in an industrial zone. (b) Optical schematic diagram of the USIRS. (c) Part of the captured raw data of the USIRS. (d) The captured noise intensity of the target gas in the blue zone.
Figure 2. Schematic representation of the proposed HST. This network employs a point-based transformer to process the input spectral intensity by breaking it down into individual point tokens. These tokens are processed through several transformer blocks using multi-head self-attention. As the process advances, the merging layers decrease the number of tokens while simultaneously doubling the channel depth, thereby creating an effective hierarchical representation.
Figure 3. Qualitative evaluation of the proposed HST for spectrum reconstruction of four different types of spectra. The blue dashed lines in the top row represent the input intensity. The bottom row shows the ground truth in gray compared with the reconstructed spectrum in red. The reconstructed data closely match the ground truth across all types, highlighting the effectiveness of our proposed HST.
Figure 4. Comparison of different methods using synthetic datasets. We compared the proposed HST with other neural network methods and interpolation methods for spectrum reconstruction for four materials: kaolin, a 770 K blackbody, vinyl chloride gas emission, and propene gas transmission. Each subplot presents the ground truth data (solid gray line) against predictions from six methods: Linear Interpolation (dashed blue line), Cubic Spline Interpolation (dashed yellow line), MLP (dotted green line), CNN (dotted cyan line), Transformer model (dotted orange line), and HST (solid red line). The correlation coefficient of the reconstructed spectrum with the ground truth is shown in the legend.
Figure 5. A comparison of different methods using experimental laboratory data of four different gas transmission spectra: Cyclopropane, Butadiene, vinyl chloride, and Methane. Each subplot shows the reference spectrum (in gray) alongside the spectra predicted by various models: Cubic Interpolation, Linear Interpolation, MLP, CNN, Transformer, and HST (ours). The correlation coefficients (corr) for each model are also provided, indicating the degree of similarity between the predicted and reference spectra. The HST model consistently shows the highest correlation across all molecules, suggesting superior performance in predicting the spectral features accurately.
Figure 6. Illustration of HST using field experiments. The top row shows the input multi-spectral image, the network, and the reconstructed high-spectral resolution image. Below, three sets of graphs compare the input, reconstructed, and reference spectral data for three materials: a road, SF6 gas, and NH3 gas. The results highlight the network’s ability to accurately reconstruct high-resolution spectra.
Table 1. Comparison of different methods with various noise levels using synthetic dataset.
Method            Metric        Noise 0   Noise 0.1   Noise 0.2   Noise 0.5
Linear interp.    RMSE          0.1133    0.1346      0.1793      0.3532
                  Correlation   0.9558    0.9433      0.9134      0.8008
                  PSNR          17.54     16.28       13.98       8.16
Cubic interp.     RMSE          0.1139    0.1686      0.2629      0.5953
                  Correlation   0.9593    0.9180      0.8420      0.6868
                  PSNR          17.93     14.33       10.13       2.75
MLP [32]          RMSE          0.1208    0.1130      0.1767      0.2153
                  Correlation   0.9568    0.9599      0.9011      0.8420
                  PSNR          18.04     18.35       14.63       12.83
CNN [26]          RMSE          0.0103    0.0233      0.0428      0.1076
                  Correlation   0.9989    0.9968      0.9906      0.9565
                  PSNR          33.87     29.22       24.53       18.03
Transformer [35]  RMSE          0.0061    0.0238      0.0436      0.1089
                  Correlation   0.9994    0.9970      0.9904      0.9563
                  PSNR          36.51     29.46       24.45       17.99
HST (ours)        RMSE          0.0059    0.0212      0.0378      0.1074
                  Correlation   0.9995    0.9976      0.9929      0.9566
                  PSNR          37.16     30.41       25.74       18.02
Red: the best performance. Blue: the second best performance.
Table 2. Quantitative comparison of different methods using experimental laboratory data. The HST method (ours) outperforms all the others, with the lowest RMSE, highest correlation, and highest PSNR, highlighting its superior accuracy in spectral data prediction. The second best performance is achieved by the Transformer method.
Method            RMSE      Correlation   PSNR
Linear interp.    0.1149    0.9386        18.30
Cubic interp.     0.1338    0.9059        16.28
MLP [32]          0.1241    0.9248        17.67
CNN [26]          0.0437    0.9866        24.87
Transformer [35]  0.0422    0.9880        25.32
HST (ours)        0.0333    0.9915        26.78
Red: the best performance. Blue: the second best performance.
