1. Introduction
Hyperspectral imagery has emerged as a vital tool in the field of remote sensing because it samples a scene across a vast number of narrow spectral bands. This fine spectral resolution allows analysts to precisely identify and characterize the unique spectral signatures of different materials. Consequently, hyperspectral images are proving invaluable in a broad range of applications, including environmental conservation, agricultural management, pollution control, and mineral exploration. By examining the subtle reflectance properties of surfaces, e.g., to assess vegetation health, soil composition, or building materials, hyperspectral imaging supports enhanced data-driven decision-making and accurate assessment.
Despite these distinct advantages, the benefits of hyperspectral imaging come with certain trade-offs. One of the most significant drawbacks lies in its typically coarser spatial resolution when compared to multispectral sensors. While the latter easily provide spatial resolutions on the order of a few meters per pixel, hyperspectral payloads are limited to tens of meters per pixel or worse. For instance, the recent PRISMA [1], EnMAP [2], and HySIS [3] missions all provided a resolution of 30 m/pixel. For this reason, spatial super-resolution in hyperspectral images is a topic of great interest. The recent literature [4,5,6,7], particularly regarding deep neural networks, has shown that powerful data priors can be used to accurately enhance spatial resolutions, even from a single image.
At the same time, there is growing interest in the remote sensing community in shifting image analysis procedures from ground-based operations to onboard satellite platforms [8,9,10,11]. By processing data directly in space, as they are acquired, satellites can rapidly identify and respond to critical occurrences such as natural disasters or sudden environmental changes. This approach significantly reduces the latency associated with transferring large volumes of raw data to Earth, followed by subsequent processing at ground stations. Ultimately, real-time or near-real-time analytics could provide timely information to decision-makers and emergency response units, potentially saving lives and mitigating damage. However, onboard processing faces significant constraints in terms of the available computational resources, which calls for the development of efficient and low-complexity models. Hyperspectral imagers represent a good case study since they generate very large volumes of spatial–spectral data, which may be difficult to process efficiently, particularly in terms of memory requirements.
Since the super-resolution of hyperspectral images is highly desirable for more accurate image processing in downstream applications, it is natural to ask whether efficient models could be developed so that a satellite payload could perform super-resolution in real time as it acquires images. However, the current literature on hyperspectral super-resolution [4,12] predominantly emphasizes maximizing the output quality at the expense of complexity, often relying on sophisticated and computationally expensive deep learning architectures. While these models achieve impressive results, they are poorly suited for potential onboard usage, where constraints related to power, memory, and processing speeds are present. Specifically, the current high-performance models are orders of magnitude heavier than what a small, low-power accelerator can sustain. For example, on the HySpecNet-11k dataset, the state-of-the-art methods MSDformer [4] and CST [12] require 714 K and 245 K FLOPs per pixel (FLOPs/px), respectively, for 4× super-resolution and 528 K and 121 K FLOPs/px for 2× super-resolution, which are excessive for existing low-power accelerators. Under a realistic input size of 1000 × 1000 × 66, such as that of the PRISMA VNIR instrument, they also exceed 24 GB of required memory, again precluding onboard use. Balancing the need for high-fidelity enhancements with the practical limitations of satellite hardware remains a challenging area of study.
In this paper, we depart from the literature and present a highly efficient model for hyperspectral image super-resolution for onboard usage, called Deep Pushbroom Super-Resolution (DPSR), which can run in real time on low-power hardware and with limited memory requirements, while providing performance comparable to that of state-of-the-art methods exploiting one order of magnitude more floating point operations (FLOPs) per pixel. On HySpecNet-11k with a super-resolution factor of 4, DPSR only requires 31 K FLOPs/px, while, with a factor of 2, it requires 20 K FLOPs/px. In contrast to other lightweight models, such as EUNet [13] and SNLSR [7], the core of our contribution is a neural network that sustains the acquisition dynamics of a pushbroom sensor by processing one line at a time as it is acquired. A memory mechanism based on selective state space models (SSMs, e.g., Mamba [14]) processes the image as a sequence of lines (as opposed to a 2D tile) and ensures that information from previously acquired lines is exploited to super-resolve the current line. This design minimizes the memory requirement since only the feature maps for the current line (and a small memory state) need to fit in the accelerator memory, instead of the buffers of hundreds of lines and respective features needed by state-of-the-art methods. We show that DPSR can process an entire PRISMA VNIR frame of size 1000 × 1000 × 66 with less than 1 GB of memory—at least an order of magnitude less than in other existing methods. Note that this line-based paradigm naturally matches the pushbroom imaging method commonly used in hyperspectral sensors on satellites, which acquires one line with all its across-track pixels and spectral channels at a given time and uses the movement of the satellite in the along-track direction to capture successive lines. Consequently, our super-resolution module could be directly pipelined to the sensor output, enabling real-time image enhancement, as a line is super-resolved in the acquisition time of the next one, without the need for extra line buffering. As an example, the line acquisition time for the PRISMA satellite is 4.34 ms [15], and our experimental results show that a largely unoptimized implementation of DPSR on a 15 W system-on-chip super-resolves a line (2× scaling factor, thus producing two output lines with twice as many across-track pixels; the line size is that of the VNIR PRISMA instrument, i.e., 1000 pixels over 66 bands) in 4.25 ms, demonstrating, for the first time, real-time performance. On the other hand, we show that existing state-of-the-art models in the literature, including those that are more efficiency-oriented, such as EUNet [13] and SNLSR [7], all run far slower than the real-time threshold of 4.34 ms.
A preliminary version of this work appeared in [16]. The present paper substantially extends and improves upon the earlier version regarding both the methodology and evaluation. Specifically, the model architecture has been redesigned in several ways. First, we now treat the spectral dimension as a feature axis rather than as part of a 3D spatial cube, leading to a more efficient and expressive representation. Moreover, the new design of the basic neural block (the SFE block in this paper) uses attention operations rather than simple convolutions, and a residual connection with bilinear interpolation is introduced. Following these modifications, substantial improvements have been obtained, with DPSR delivering a markedly higher MPSNR on HySpecNet-11k than that achieved by the earlier LineSR model. This paper also significantly expands the discussion and the experimental validation, providing results under two super-resolution factors (2× and 4×) across four different datasets, comparing the method against seven baselines, and presenting extensive ablation studies that dissect the contributions of individual architectural components and design choices. Finally, experiments on low-power hardware are also given to support the low-complexity claim and suitability for onboard deployment.
We can summarize our contributions as follows:
We introduce a lightweight deep learning architecture tailored to hyperspectral single-image super-resolution onboard satellites.
Drawing inspiration from the operational principles of pushbroom sensors, our framework performs super-resolution in a causal, line-by-line fashion. To accomplish this, we leverage deep SSMs, which maintain effective memory of previously processed lines. This design choice allows the network to access relevant historical features without storing or reprocessing the entire image, leading to substantial efficiency gains.
Experimental results on multiple datasets show image quality comparable to that of state-of-the-art models at a fraction of the complexity, with significantly lower runtime and memory requirements.
3. Materials and Methods
In this section, we introduce the method that we propose to address the problem of onboard single-image super-resolution and the structure of our line-by-line neural network architecture, called DPSR. A high-level illustration of the concept is given in Figure 1, while the details of the DPSR neural network architecture are shown in Figure 2.
3.1. Preliminaries and Proposed Strategy
The overall design goal is to super-resolve an image line by line in the along-track direction, aiming to match a pushbroom acquisition system and minimize the memory and computational requirements. In the remainder of this paper, we will use H to denote the number of lines in the along-track direction y, W to denote the number of columns in the across-track direction x, and C to denote the number of spectral channels. We also remark that super-resolving a line, with all its spectral channels, of size W × C by a factor r produces an output of size r × rW × C, i.e., r lines in the along-track direction, expanded by a factor of r in the across-track direction.
More formally, we suppose that we have acquired the current LR line of size W × C at along-track LR line location y, together with a memory of past lines, which will be described in Section 3.2. From this, we estimate the set of r super-resolved (SR) lines at the high-resolution (HR) along-track spatial locations between the LR spatial location y − 1 (included) and y (excluded). It is important to note that we are framing the problem within an interpolation setting, where line y is used to predict HR locations strictly preceding it. This is in contrast with an extrapolation setting, which would predict spatial locations beyond y and would be more challenging, leading to worse image quality. Moreover, the estimation of the SR lines is performed in a residual fashion, i.e., a neural network only estimates an additive correction to the result of a bilinear interpolator.
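To make the interpolation setting concrete, the following minimal NumPy sketch builds the bilinear baseline that the network's residual correction is added to: r HR lines between LR locations y − 1 (included) and y (excluded), each upsampled by r across-track. The function name and the uniform placement of HR samples are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def bilinear_baseline(prev_line, curr_line, r):
    """Interpolate r HR lines between LR along-track locations y-1 (included)
    and y (excluded), and linearly upsample each by r across-track.
    prev_line, curr_line: (W, C) arrays. Returns an (r, r*W, C) array."""
    W, C = curr_line.shape
    out = np.empty((r, r * W, C), dtype=float)
    x_hr = np.arange(r * W) / r            # HR across-track sample positions
    for k in range(r):
        t = k / r                          # along-track fraction between y-1 and y
        line = (1 - t) * prev_line + t * curr_line   # (W, C) along-track blend
        for c in range(C):                 # across-track linear upsampling
            out[k, :, c] = np.interp(x_hr, np.arange(W), line[:, c])
    return out
```

The network's output is then an additive correction on top of this baseline, as described above.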
We note that this configuration restricts the model to using only previously acquired lines, with no access to future lines. This departs from existing methods, which require a 2D tile of the image. This choice is intentional, since such methods would require buffering a large number of lines to form a complete tile, and their processing would require significantly increased memory and computational resources. Essentially, DPSR trades access to future lines, which may potentially limit quality, for lower complexity and higher speed.
3.2. DPSR Architecture
This section introduces the neural network architecture of DPSR, which is shown in Figure 2 and Figure 3.
As a first step, we decouple the along-track dimension from the across-track dimension by extracting the current line along with all its spectral channels. This leads to the input of the architecture having a size of W × C. Conceptually, DPSR will first extract features that jointly represent the pixels in the across-track and spectral dimensions of the current line and then use a selective SSM (Mamba [14]) to integrate these features with a memory of those of previous lines.
This results in a design featuring an initial shallow feature extraction block (SFE block) comprising 1D operations operating on across-track pixels and spectral channels. In particular, a 1D convolution layer, with a spatial kernel size of 3, transforms the C input channels into an arbitrary number of features F. This is then followed by layer normalization and non-linear activation, for which we use the sigmoid linear unit (SiLU) function [58]. Subsequently, a channel attention block [59] is introduced. The output of the SFE block is a tensor of size W × F, where each pixel in the across-track direction is associated with a feature vector jointly capturing correlations between the columns and the original spectral bands. The SFE block is then followed by a backbone composed of two cross-line feature fusion (CLFF) blocks, each consisting of a NAFBlock followed by a Mamba block. The design of the CLFF blocks is motivated by the different but complementary roles of the NAFBlock and the Mamba block. The tensor obtained as the output of the SFE block is fed to the CLFF blocks.
In detail, the NAFBlock uses a sequence of layers inspired by the building blocks of the NAFNet architecture [60] to expand the across-track receptive field and extract deeper features in the across-track and spectral dimensions. In particular, this block uses separable convolutions (size-1 1D convolution, followed by depthwise 1D convolution) to achieve lightweight receptive field expansion. Moreover, a SimpleGate and a simplified channel attention operation implement an attention-like mechanism of input-dependent feature extraction, while maintaining a lightweight design employing no specific activation function. We refer the reader to [60] for more details of these operations.
On the other hand, the Mamba block [14] has the crucial task of learning the interactions between successive lines, i.e., the propagation of information in the along-track dimension. The output of the NAFBlock consists of one feature vector for each of the W pixels in the line, which can be regarded as a token whose evolution is modeled over the along-track direction by means of Mamba. In the Mamba block, the input feature dimension F is expanded to EF, where E is an arbitrary constant. Then, a causal convolution operation in the along-track direction further processes the feature sequence. Note that this requires storing the receptive field of the causal convolution operator, i.e., the features of K past lines, where K is the convolution kernel size. We remark that K is typically a small value (e.g., K = 4, as we used in the experiments), so that a very modest number of past lines needs to be retained. After non-linear activation with a SiLU function, the line features are processed by the Mamba selective SSM formulation introduced in Section 2.2 so that long-term dependencies encoded in the latent state can be exploited. The latent state of the SSM is a vector of length N for each expanded feature of the line pixels, i.e., a tensor of size W × EF × N, and can be seen as a condensed running memory of all the previous lines scanned by the model, which allows the model to uncover self-similarities in the far past, dropping the requirement of a buffer with a high number of lines in order to perform effective super-resolution. When the SiLU-processed features reach the state-space module, they are processed by a discrete-time version of the linear SSM in Section 2.2, operating in parallel over all W pixels in the current line. The state is updated at each time step, i.e., every time a new line is processed, following Equation (4). New context-rich line features are then produced as the output of Equation (5) using the updated latent state. Finally, the SSM-transformed features are gated by the original features input to the Mamba block (after feature expansion and SiLU non-linearity) and projected back to an F-dimensional space. The Mamba block is the main driver of the memory requirements of the architecture, and, based on the aforementioned design, we can summarize such requirements as follows: (i) a tensor of size K × W × EF with the features of the current line and of the past lines in the receptive field of the causal convolution; (ii) a tensor of size W × EF × N with the latent state of the SSM. These are quite modest memory requirements; for instance, for realistic values of W, F, E, and N, only about 10 MB of memory is required to store the aforementioned tensors as single-precision floating point values.
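The per-line state update of Equations (4) and (5) can be sketched as follows with a simplified, fixed-parameter diagonal SSM in NumPy. In the actual Mamba block the discretized parameters are input-dependent (selective); all names here are illustrative, and only the recurrence structure and the persistent tensors are meant to match the description above.

```python
import numpy as np

def ssm_line_step(h, x, A_bar, B_bar, C_out):
    """One along-track step of a simplified diagonal discrete SSM.
    h:     (W, EF, N) latent state carried across lines
    x:     (W, EF)    features of the current line
    A_bar, B_bar, C_out: (EF, N) per-feature discretized parameters
    Returns the updated state and the output line features (W, EF)."""
    # Eq. (4): h_t = A_bar * h_{t-1} + B_bar * x_t (elementwise, diagonal A)
    h_new = A_bar[None] * h + B_bar[None] * x[..., None]
    # Eq. (5): y_t = <C, h_t>, contracting the state dimension N
    y = (C_out[None] * h_new).sum(axis=-1)
    return h_new, y

# The only tensors persisting between lines are the latent state (W, EF, N)
# and a small causal-convolution buffer of K past lines (K, W, EF).
```

Only the state tensor is carried forward, which is why the memory footprint is independent of the number of lines processed.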
The final part of the architecture is the upsampler module, which is responsible for the output tensor having the appropriate spatial dimensions, according to the desired super-resolution scale factor. The upsampler first expands the input tensor channels with a 1D convolution to r²f channels, where r is the SR factor and f is an arbitrary number of features that will remain after the subsequent Pixel Shuffle operation [57]. The last layer is a 1D convolution that restores the number of channels of the input image. We note that choosing a suitable value for f involves a trade-off between limiting the amount of floating point operations performed by the upsampler module and properly representing all the spectral channels. Indeed, it could be argued that the optimal value of f is tied to how many of the spectral channels are redundant and could thus have their dimensionality reduced. Hence, the final super-resolved lines are obtained by adding this residual estimate to the bilinear interpolation baseline.
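The rearrangement performed by the Pixel Shuffle stage on a single line can be sketched as below, a 1D analogue of the standard operation that turns r²f channels into r output lines, each with r times the across-track samples. The exact layout of the channel groups is an assumption of ours for illustration.

```python
import numpy as np

def line_pixel_shuffle(feat, r):
    """Rearrange a line's expanded channels into r HR lines, each with
    r-times the across-track samples (1D analogue of PixelShuffle).
    feat: (W, r*r*f) array. Returns an (r, r*W, f) array."""
    W, c_exp = feat.shape
    f = c_exp // (r * r)
    assert c_exp == r * r * f
    # split channels into (r along-track, r across-track, f features)
    x = feat.reshape(W, r, r, f)   # (W, r_y, r_x, f)
    x = x.transpose(1, 0, 2, 3)    # (r_y, W, r_x, f)
    return x.reshape(r, r * W, f)  # interleave r_x samples across-track
```

A final convolution then maps the f features back to the C spectral channels, as described above.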
3.3. Implementation Details and Experimental Settings
The proposed architecture makes use of an internal feature dimension F. This hyperparameter was kept fixed for all experiments, except those where we considered scaling down the model (the DPSR-S variant). The expansion factor E of the Mamba block and the internal state size N of the SSM were set following [14] and common practice in the literature that builds on the Mamba model. The reduced features factor f in the upsampler, which controls how much we compress the feature information while recovering the appropriate spatial resolution, was also kept fixed throughout all experiments. For the compact DPSR-S variant, our target was real-time operation on low-power hardware; this requirement drove the choice of the internal parameters. For the full DPSR model, both the internal feature width and the Mamba expansion factor were selected to match a predefined complexity envelope (30 K FLOPs/px on HySpecNet-11k with an SR factor of 4 and 20 K FLOPs/px on the same dataset with an SR factor of 2). For the training procedure, we used the Adam optimizer with a fixed learning rate and a variable number of epochs to let the model converge until overfitting was observed. The optimal number of training epochs depends on the dataset, and it amounted to 300 for Pavia, 1600 for Houston, and 2000 for HySpecNet-11k and Chikusei. This value was determined empirically by monitoring convergence and stopping training once the improvement in the validation MPSNR fell below 0.01 dB over the preceding 50 epochs. No weight decay was used. We used the same loss as in [12], with its weighting hyperparameters set as is common practice for other methods in the literature.
We present results for four different hyperspectral datasets. Besides the widely used Chikusei [61], Houston [62], and Pavia [63], we add the HySpecNet-11k dataset [64]. This is motivated by the significantly larger number of images present in HySpecNet-11k, which makes the results less sensitive to overfitting. See Section 4.1 for details regarding the preprocessing of each of the four datasets. Since our model deploys an interpolation strategy, i.e., it super-resolves the previous line, in order to train and test with SR factor r, we discard the first r lines of the output of the network (as they estimate lines before the start of the image) and the last r lines of the ground truth (as these cannot be estimated by DPSR, since this would require a line after the end of the image). Note that, in a real-world scenario, we would not have to discard any line since the stream of input lines would be continuous. For the sake of fairness, we use the same discarding approach for all other baselines that we compare against in the evaluation of the quality metrics.
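The evaluation alignment described above amounts to the following NumPy sketch; the function and array names are illustrative, not from the released code.

```python
import numpy as np

def align_for_metrics(sr_out, gt, r):
    """Align SR output and ground truth before computing metrics:
    drop the first r SR lines (they estimate lines before the image start)
    and the last r GT lines (estimating them would require a line past the
    image end). sr_out, gt: (H_hr, W_hr, C) arrays with matching shapes."""
    return sr_out[r:], gt[:-r]
```

After this step, SR line i corresponds to GT line i for all retained lines, so the metrics of all methods are computed on identical spatial supports.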
We compare DPSR against the following state-of-the-art and baseline methods: CST [12], MSDformer [4], SNLSR [7], EUNet [13], SSPSR [24], and GDRRN [23]. The official code released by the authors has been used to retrain and test the models, with all hyperparameters set as specified in the published reference papers. We note that we do not report results for SNLSR at a 2× SR factor since the model has been intrinsically designed and proposed for the 4× factor. In particular, the architecture uses a two-stage spatial expansion, with each stage building its upsampling module with a scale equal to 2. For the method to support a true 2× SR factor, major modifications to the model architecture and size would be required. Additionally, we report baseline results obtained using bicubic interpolation when processing the entire image.
To quantify the fidelity of the super-resolved hyperspectral images, we report four commonly used metrics: the peak signal-to-noise ratio (PSNR) [65], structural similarity (SSIM) [66], spectral angle mapper (SAM) [67], and root mean squared error (RMSE). Together, these four metrics provide a balanced view of the spatial quality (PSNR, SSIM), spectral faithfulness (SAM), and overall error magnitude (RMSE). Following [4,12], for a hyperspectral image with C spectral channels, we compute each metric channelwise, accumulate over all channels, and then take the average as the metric value.
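For PSNR, the channelwise protocol can be sketched as follows in NumPy; the function name and the `data_range` handling are our own assumptions, and the paper's evaluation code may normalize differently.

```python
import numpy as np

def mean_psnr(ref, est, data_range=1.0):
    """Channelwise PSNR averaged over bands (MPSNR): compute the PSNR of
    each spectral channel separately, then average over the C channels.
    ref, est: (H, W, C) arrays on the same intensity scale."""
    psnrs = []
    for c in range(ref.shape[-1]):
        mse = np.mean((ref[..., c] - est[..., c]) ** 2)
        psnrs.append(10.0 * np.log10(data_range ** 2 / mse)
                     if mse > 0 else np.inf)
    return float(np.mean(psnrs))
```

SSIM, SAM, and RMSE follow the same channelwise-then-average convention.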
For model training and testing, we used one NVIDIA TITAN RTX GPU, whereas, for the low-power inference speed experiments, we used an NVIDIA Jetson Orin Nano in 15 W mode.
We remark that all reported results were obtained with a fixed random seed for every model and dataset, for both the 2× and 4× settings and for all ablation studies, thus representing a typical run, as is commonly done in the literature.
5. Discussion
This work explored whether an onboard HSI SR model could approach the reconstruction fidelity of state-of-the-art networks while meeting the latency and memory constraints of pushbroom imagers. Our results quantitatively confirm that DPSR achieves an excellent quality–efficiency trade-off: it trails the best-performing high-complexity Transformer models by only small margins (see Table 1, Table 3, Table 5 and Table 7), while requiring roughly an order of magnitude fewer FLOPs per pixel and substantially less memory (see Figure 10, Figure 11 and Figure 12). For instance, on the Houston dataset, DPSR is only 0.10 dB below the best model, CST, yet it runs in 10.28 ms instead of 90 ms and with less than 1 GB of memory instead of 20 GB on realistic input sizes. This is all thanks to the line-by-line design of DPSR, which eliminates the need for large spatial buffers and allows for low-complexity models.
5.1. Relationship with Prior Work and Working Hypotheses
DPSR was designed around two hypotheses: (i) that optimizing the architecture of a neural network in terms of inference speed and minimal memory and power requirements can be achieved without sacrificing reconstruction quality and (ii) that causal, linewise processing that matches pushbroom acquisition can retain sufficient spatial–spectral context via a compact memory to yield competitive SR quality while satisfying the constraints in (i). The proposed architecture implements these ideas with across-track and spectral feature extraction followed by cross-line fusion, where a selective SSM (Mamba) stores information in the along-track direction and a lightweight upsampler recovers the target spatial resolution. This departs from 2D tile-based designs that must stage large windows to aggregate the context, thus better aligning with the onboard streaming constraint. The proposed method DPSR stems from the work in [16], which gave promising indications that exploiting the Mamba memory to match the physics of pushbroom sensors could be effective. However, only limited experiments and comparisons with state-of-the-art methods were presented there. This work changed the strategy used to process the input hyperspectral cube, expanded and refined the architecture with insightful modifications, and showed the effectiveness of the method through extensive evaluation across four datasets and several ablations assessing the architectural choices. In particular, the hyperspectral images are no longer treated as cubes; instead, we use the spectral dimension as a feature dimension throughout the whole architecture, allowing all operations involved to change from 3D to 2D. In addition, we introduced a new block, named the SFE block, to first refine the input based on channel attention, which improves upon the simple first convolution of the LineSR model. Finally, we introduced a way to perform residual learning using buffers, with no tangible increase in memory or computational requirements. These modifications yield an entirely different final model, with significantly improved performance (see Table 11) and even lower complexity.
5.2. Main Findings
On HySpecNet-11k (4× SR), DPSR achieves a 43.17 dB MPSNR with only 31 K FLOPs/px, within 0.28–0.44 dB of MSDformer and CST while being significantly lighter (e.g., 714 K and 245 K FLOPs/px for MSDformer and CST, respectively). A similar pattern holds for 2× SR. On the Chikusei, Houston, and Pavia benchmarks, DPSR is consistently competitive, often within 0.1–0.3 dB of the strongest baselines and occasionally surpassing them, while maintaining a much lower FLOPs/px profile. In urban scenes (Houston), DPSR’s gap with CST remains small at both 2× SR and 4× SR, indicating that linewise memory is effective even in highly structured environments with many details. For Pavia, the method remains close to the best scores, delivering the second-best quality metrics, only behind CST, as in Houston. These results demonstrate that the long-range along-track context captured by DPSR through Mamba’s selective memory enables the accurate reconstruction of fine spatial and spectral details.
Beyond quality, the efficiency analysis reinforces the practical relevance of the design. Memory scales linearly with columns and bands and is independent of the number of lines processed, so a PRISMA-like frame of 1000 × 1000 × 66 can be super-resolved with less than 1 GB of memory, whereas several 2D counterparts exceed 24 GB for similar sizes. The runtime on an NVIDIA Jetson Orin Nano (15 W) reaches 4.25 ms/line (2× SR) and 5.70 ms/line (4× SR) for the small variant (DPSR-S), crossing the real-time threshold set by representative line times (e.g., PRISMA VNIR 4.3 ms/line). The base model remains substantially faster than state-of-the-art models and, given that our implementation is not yet hardware-optimal, future work will explore hardware-aware optimizations, such as kernel-level tuning and FPGA deployment, to further enhance the throughput.
5.3. Why Does the Line-by-Line Strategy Work?
Two design factors appear central. First, the feature extraction pipeline explicitly separates across-track and spectral mixing from along-track memory. The former is addressed by shallow and NAFNet-style separable convolutions with lightweight gating simulating attention, avoiding heavy 2D context aggregation. This preserves spectral information while keeping a sufficient across-track receptive field. On the other hand, Mamba handles the latter: the selective SSM maintains a compact latent state per across-track position, acting as a content-adaptive accumulator of information from past lines. Together with the residual learning on top of bilinear interpolation, this mitigates hallucinations and stabilizes spectral metrics despite the low compute budget.
5.4. Limitations
Despite its strengths, DPSR is not the state of the art in peak image quality: CST still retains a modest PSNR edge. We attribute this to three factors intrinsic to the line-causal constraint: (i) the model never accesses future lines, whereas 2D tile-based methods can implicitly fuse both past and future spatial contexts; (ii) the selective memory must compress many details into a compact state and can therefore sometimes discard or forget important information, since the network cannot directly revisit lines far in the past (more than four lines ago) other than through this state; and (iii) the residual baseline is not followed by learnable layers, which can underfit high-frequency textures in challenging regions, possibly limiting the top quality reachable by the residual learner.
5.5. Implications and Future Directions
In a possible practical onboard pipeline, SR would not be the end task, but it should be paired with a downstream task. DPSR’s causal, streaming nature admits straightforward composition with downstream analytics without increasing buffering or breaking the real-time throughput. We see several promising directions:
Designing multi-task architectures that jointly perform real-time denoising and super-resolution, incorporating the advantages of the two proposed methods while delivering real-time performance without increasing the complexity;
Integrating downstream tasks, such as semantic segmentation, into the pipeline after inverse tasks to build an end-to-end efficient onboard processing model;
Developing and incorporating selective compression strategies that leverage the improved quality of super-resolved and denoised images to optimize the downlink efficiency;
Developing hardware-aware kernels and implementing the model on FPGA, in an attempt to bring the base model within real-time speeds.
These extensions would maintain the central constraint of the present work—the single-line causality—so that the complexity and memory remain compatible with small onboard accelerators.