Task-Driven Real-World Super-Resolution of Document Scans

Zyrek, Maciej; Tarasiewicz, Tomasz; Sadel, Jakub; Krzywon, Aleksandra; Kawulok, Michal

doi:10.3390/app15148063

Open AccessArticle

Task-Driven Real-World Super-Resolution of Document Scans

by

Maciej Zyrek

¹

,

Tomasz Tarasiewicz

¹

,

Jakub Sadel

¹

,

Aleksandra Krzywon

²

and

Michal Kawulok

^1,*

¹

Department of Algorithmics and Software, Silesian University of Technology, 44-100 Gliwice, Poland

²

Department of Biostatistics and Bioinformatics, Maria Sklodowska-Curie National Research Institute of Oncology, Gliwice Branch, 44-102 Gliwice, Poland

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 8063; https://doi.org/10.3390/app15148063

Submission received: 10 June 2025 / Revised: 29 June 2025 / Accepted: 17 July 2025 / Published: 20 July 2025

Download

Browse Figures

Versions Notes

Abstract

Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets—with low-resolution images obtained by degrading and downsampling high-resolution ones—they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network (CTPN), text recognition via a convolutional recurrent neural network (CRNN), keypoints localization using Key.Net, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging (DWA) mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. Experimental evaluation demonstrates that the proposed approach improves text detection, measured with intersection over union, by 1.09% for simulated and 1.94% for real-world datasets containing scanned documents, while preserving overall image fidelity. These improvements are statistically significant as confirmed by the Kruskal–Wallis H test and the post hoc Dunn test with Benjamini–Hochberg p-value correction. Our findings highlight the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.

Keywords:

super-resolution; document image processing; task-driven training; multi-task learning; real-world super-resolution

1. Introduction

Limited spatial resolution of scanned documents often poses a substantial challenge for optical character recognition (OCR) systems, in particular when input images suffer from the noise and other distortions [1] that result from sensor limitations, compression artifacts, motion blur, or suboptimal lighting conditions. In order to allow for the effective processing of low-resolution (LR) scans with existing OCR systems, their quality can be enhanced with super-resolution (SR) techniques. They operate either from a single image [2,3] or multiple observations of the same scene [4,5]. The state-of-the-art single-image SR (SISR) approaches have been underpinned with deep learning, since Dong et al. proposed the first convolutional neural network (CNN) for SR (SRCNN) [6]. It was followed by more advanced solutions, including fast SRCNN (FSRCNN) [7], a very deep SR network (VDSR) [8], and a residual SR network (SRResNet) [9] that was trained within a generative adversarial network (GAN) setting. These techniques, as well as recently developed solutions [2], demonstrate impressive performance on simulated benchmarks, with LR images obtained from original images that are later treated as HR references during training and validation. However, as noted by Cai et al. [10], these models often fail to generalize to real-world data where degradation is neither uniform nor well modeled by the kernels used for downsampling. This gave rise to real-world SR that is aimed at super-resolving original rather than simulated images [11]. While such techniques consist in exploiting real-world datasets, they are rarely validated in task-specific scenarios [12].

To address these limitations, we propose a task-driven SISR framework designed specifically for text document images. Our approach shifts the training objective from pure image fidelity—commonly expressed with peak signal-to-noise ratio (PSNR)—to task-oriented performance. We introduce additional loss function components derived from pretrained semantic models into our training framework: (i) text detection components based on connectionist text proposal network (CTPN) [13], which encourage reconstruction of text-presence features; (ii) a loss derived from intermediate activations of a pretrained convolutional recurrent neural network (CRNN) [14,15], promoting character recognition; (iii) a keypoint alignment loss using the Key.Net detector [16], which enhances structural consistency between the output and ground truth; and (iv) a color consistency loss based on the hue component of the hue–saturation-value (HSV) color space that helps maintain chromatic coherence without distorting tonal relationships.

These complementary objectives are integrated into a unified multi-task loss function. A major challenge in such multi-objective optimization is determining appropriate weights for each task, taking into account that static weights may lead to unstable convergence or underfitting. To overcome this problem, we employ dynamic weight averaging (DWA) [17], which automatically adjusts task weights during training based on their relative learning dynamics. This encourages balanced training and ensures that no single loss component dominates the remaining ones.

The research reported in this paper was aimed at verifying whether including the task-oriented loss function components contributes to finding more valuable regions of the real-world SR solution space than relying on conventional loss functions underpinned with pixel-wise image fidelity. Therefore, we have designed a comprehensive experimental setup involving real and simulated LR–HR document pairs. Real-world scans are acquired using a controlled multi-resolution acquisition procedure, while simulated pairs are generated through standard downsampling. We demonstrate that our task-driven training framework improves text detection performance measured with the intersection over union (IoU) metric, in particular for real-world scans. Importantly, the model shows strong generalization across different document types. Overall, our contribution can be summarized as follows:

We propose a task-driven framework for training networks aimed at super-resolving text document images (see Figure 1), guided by text detection and recognition, keypoint detection, and color consistency.
We employ DWA to dynamically balance the loss components, ensuring stable and balanced convergence across the tasks.
We introduce a real-world dataset with accurate registration at 4× magnification ratio, which allows for realistic evaluation of SR for OCR-related tasks. In this work, we exploit the dataset for validating SISR techniques; however, the dataset can also be used for multi-image SR, as it contains multiple scans for each document.
We report the results of a thorough quantitative and qualitative experimental analysis supported with statistical tests. It allows for understanding the model behavior on both real-world and simulated datasets, highlighting the practical benefits and risks of task-aware SR. Importantly, the elaborated methodology constitutes a solid foundation for our future work aimed at multi-image SR.

The paper is structured as follows. In Section 2, we outline the state of the art in SR, with particular attention given to task-driven approaches and SR of text documents. Our approach is specified in Section 3, and the results of experimental validation are reported in Section 4. Section 5 concludes the paper.

2. Related Work

In this section, we present the state of the art regarding SR techniques underpinned with deep learning (Section 2.1). Moreover, we outline the task-specific approaches, including super-resolving text documents (Section 2.2), as well as the methods to handle multi-task optimization (Section 2.3).

2.1. Deep Learning for SR

The introduction of CNNs has revolutionized the field of SISR. The SRCNN model by Dong et al., composed of just three convolutional layers, outperformed the sparse-coding techniques in terms of the PSNR score reported for natural images [18]. However, this was achieved at the cost of over-smoothed edges and poor generalization beyond bicubic downsampling. It was followed by the very deep SR (VDSR) network composed of 20 layers, which improved reconstruction quality via residual learning [8]. However, it was originally trained on simulated training pairs, which limits its robustness against real-world degradations. Ledig et al. introduced SRResNet, a generator built from deep residual blocks, achieving state-of-the-art scores for Set5 and Set14 datasets [9]. SRResNet also formed the backbone of SRGAN [9], which augments the

L 2

loss function with an adversarial perceptual loss based on the features extracted using the VGG network [19]. This combination fixed the blurry edges and low realism seen in earlier models, which boosted the perceptual scores, but it also deteriorated the PSNR metric and introduced some artifacts. Subsequent architectures, such as enhanced deep SR network (EDSR) [20] and a residual dense network (RDN) [21], refined residual and attention mechanisms to push perceptual quality further. However, these models are more likely to hallucinate, generating plausible results, which do not necessarily reflect the actual ground truth [22]. This limits their applicability in the context of specific tasks like the OCR case considered here. Furthermore, vision transformers have also been successfully employed for SISR [23]. However, these methods focus primarily on natural-image benchmarks with synthetic downsampling and do not integrate higher-level semantic objectives during training.

Recent developments in SR architectures increasingly emphasize reducing model size without compromising reconstruction quality [24]. Lu et al. have shown that, with vision transformers, it is possible to dynamically adjust feature map sizes to lower model complexity [25]. With just

6.5 \cdot 10^{5}

parameters, this approach has been reported to outperform many state-of-the-art models that are significantly larger. However, it is sensitive to hyper-parameter tuning and still shows higher latency than tiny CNNs on low-resolution inputs. The swift parameter-free attention network [26] introduces a novel attention mechanism devoid of parameters, striking a balance between image quality and inference speed, making it suitable for real-time applications. Xie et al. employed kernel distillation to simplify the model structure and enhance the attention modules [27], reducing the computational cost while improving the performance.

In sum, even the best generic SR models now deliver high fidelity with tiny parameter budgets, yet they still learn from synthetic degradations and optimize only pixel-wise error. It is therefore not ensured that they preserve task-relevant details—an issue the next section on task-specific SR tackles directly.

2.2. Task-Specific SR

Although most of the efforts are concerned with purpose-agnostic enhancement, the validation of emerging SR techniques in the context of specific computer vision tasks has received growing research attention [12]. This also includes research focused on text detection and recognition. Wang et al. used GAN-based SR to boost scene-text recognition performance [28], and Honda et al. leveraged multi-task transformers for scene-text SR [29]. These methods typically report improvements in the recognition accuracy, but rely almost entirely on synthetically blurred images for training and evaluation. Hence, their effectiveness under real, uncontrolled degradations remains unclear. In the document analysis domain, the ICDAR Robust Reading Challenges have driven the development of SR applied to scanned receipts and book pages [30]. This has provided the first large sets of real LR documents, but without paired HR ground truth, which makes systematic training and evaluation of SR methods difficult. Inspired by these findings, our approach explicitly incorporates OCR-relevant feature losses to guide SR toward text-clean reconstructions, while monitoring and mitigating potential hallucinations.

There were some attempts to train the networks for SR in a task-oriented manner by making the loss function focus on these specific tasks, thus guiding the training process accordingly. Haris et al. applied an object detection loss for training an SISR network [31]. Although the task-driven training leads to worse PSNR scores than relying on the

L 1

loss, it was demonstrated that the super-resolved images constitute a more valuable source for object detection. Similar task-driven loss functions were also defined for semantic image segmentation [32,33]. The losses based on segmentation masks raise mean IoU, but softens the textures. Importantly, all of these techniques were applied only to simulated LR images, and they were not tested in real-world scenarios. Task-driven OCR has been explored by Madi et al. and Wang et al. [28,34], who report lower character-related error rates, yet their GAN-based networks occasionally hallucinate glyphs on unseen degradations. This is also a concern of our recent work on task-driven SR [35], in which we trained several architectures for SISR in a task-oriented way. In the work reported here, we adapt that technique for real-world images, and we incorporate multiple semantic losses that are dynamically balanced during training.

2.3. Multi-Task Loss Optimization

Balancing heterogeneous loss terms is critical in multi-task learning. Fixed weights often cause one objective to dominate, leading to suboptimal convergence. Several dynamic strategies have been proposed in the literature [36], including gradient normalization [37] that equalizes the gradient magnitudes across tasks, uncertainty weighting [38], which uses task uncertainty as adaptive coefficients, dynamic task prioritization [39], and DWA [17], which updates each weight based on recent loss change rates. DWA assigns greater emphasis to tasks whose losses decrease more slowly, promoting a balanced convergence. Yet each strategy has practical drawbacks: the gradient normalization doubles gradient-memory usage and needs a tuned exponent; the uncertainty weighting assumes homoscedastic noise and can mis-scale tasks on very different numeric ranges; dynamic task prioritization requires heuristic phase scheduling that may oscillate; and DWA reacts with an epoch-level delay and tends to stall once all losses converge to a plateau. Moreover, none of these methods has been tested for OCR SR, where over-emphasizing a single semantic loss can quickly produce hallucinated character-resembling structures.

3. The Proposed Framework for Task-Driven Training

In this section, we present our approach, which allows for adapting an SISR technique for enhancing real-world document scans. While the numerous studies outlined in Section 2 have demonstrated promising performance in simulated settings, achieving robust and task-relevant reconstructions in real scenarios remains a significant challenge. Motivated by the observed limitations of conventional SISR techniques, we propose a task-driven, multi-loss training framework. Our method integrates multiple objectives into the training pipeline to achieve not only pixel-wise reconstruction fidelity, but also enhance the semantic and structural domains critical to document analysis.

The remainder of this section details our proposed solution. In Section 3.1, we describe the SRResNet architecture that we selected for our study, along with the specific loss functions employed. Section 3.2 elaborates on our strategy for combining these objectives using DWA.

3.1. Network Architecture and Loss Function

To address the challenge of super-resolving real-world document scans for OCR-related tasks, we propose to enrich the training process of a baseline SR technique with auxiliary components for semantic supervision. In our study, we have selected the well-established SRResNet network [9]. Its simple and well-understood architecture allows for fast training, offering a good compromise between the model size and reconstruction accuracy, as confirmed in many studies [40,41]. Although more complex models like EDSR [20] or SwinIR [23] retrieve higher reconstruction quality scores, they are also more likely to hallucinate image details, which is not beneficial in task-oriented applications. Importantly, in our earlier study [35], we considered several different architectures for OCR-oriented SISR—even though it was limited to simulated datasets, it showed that SRResNet is an optimal choice.

In our work, we focus on the ability to reconstruct task-relevant semantics essential for OCR, in addition to the pixel-wise similarity between the super-resolved image and the HR reference. A key element of our framework is the integration of pretrained CTPN, CRNN, and Key.Net models into the loss function exploited for training the SR network. Importantly, the weights of these pretrained models are not updated during training; hence, the models preserve their original behavior. The models for text detection (CTPN) and recognition (CRNN) were integrated into a single pipeline (the CTPN and CRNN models are available at https://github.com/courao/ocr.pytorch (accessed on 1 June 2025)) that allows for solving the OCR task in an end-to-end manner [28]. During training, these networks serve as proxy-supervisors: rather than requiring manually annotated labels, we extract high-level feature activations from their deepest layers and compare them between the super-resolved image

\hat{I}

and the ground-truth HR image

I_{HR}

using the

L 1

distance. In this way, each previously-trained network implicitly generates its own “labels” based on learned representations, yielding a semi-self-supervised training regime that enforces semantic consistency without explicit annotations.

Furthermore, we incorporate a pretrained Key.Net model (independent from OCR) to extract image keypoints, because text regions in a super-resolved image and the HR reference should exhibit similar keypoint patterns. Although Key.Net was originally designed for general-purpose keypoint detection, we test its ability to complement OCR-driven losses by encouraging the similarity in terms of the structural alignment. These auxiliary loss functions guide the SR model to preserve both textual and structural cues.

3.1.1. SRResNet Architecture Overview

The SRResNet model [9] serves as a backbone to validate our approach. It is composed of an initial feature extraction layer that transforms a three-channel LR input image

I_{LR} \in R^{3 \times H \times W}

into a higher-dimensional representation using a convolutional layer with

9 \times 9

kernels. This is followed by a deep stack of N residual blocks, each of which refines the extracted features while maintaining stable training through skip connections. Formally, if

F_{i - 1}

is the input to the i-th residual block, then

F_{i} = F_{i - 1} + R (F_{i - 1}), i = 1, \dots, N,

(1)

where

R (\cdot)

denotes the two-layer convolutional transformation with parametric rectified linear unit (PReLU) activations. These residual connections allow the network to focus on learning a high-frequency difference between the HR and LR representations, rather than the entire mapping, thus improving the convergence and final performance.

After the residual blocks, another convolutional layer merges the learned features into a single tensor, which is then upsampled by successive subpixel convolution (i.e., pixel shuffle) layers. Each subpixel block increases spatial resolution by a factor of 2× (for a total scaling factor of 4×), rearranging feature-channel information into finer-grained pixel grids. Finally, a reconstruction layer (with

9 \times 9

convolutional kernels followed by a hyperbolic tangent activation) maps the upsampled feature maps back to a three-channel RGB image:

\hat{I} = G (I_{LR}; θ_{G}) \in R^{3 \times (4 H) \times (4 W)} .

(2)

In our work,

θ_{G}

denotes all trainable parameters of SRResNet. At first, they are optimized by relying on simulated images from the MS COCO dataset [28] using a mean squared error (MSE) loss function, and the network is subsequently fine-tuned with our multi-task loss function.

3.1.2. CTPN Architecture for Text Region Detection

To extract spatial features relevant to textual content, we incorporate the CTPN network [28], which was originally introduced to detect horizontal text lines. CTPN is particularly effective at identifying localized regions likely to contain text, making it a suitable auxiliary supervision module within our framework.

The CTPN architecture is based on a truncated VGG16 backbone, pretrained on the ImageNet dataset [42]. Given a super-resolved image

\hat{I}

or an HR reference

I_{HR}

, the convolutional encoder outputs a feature map of size

(C \times H^{'} \times W^{'})

. A subsequent convolutional layer with

3 \times 3

kernels reduces the dimensionality to 512, resulting in a tensor

X \in R^{512 \times H^{'} \times W^{'}}

. This tensor is reshaped into a sequence of vectors of length 512 and processed row-wise by a bidirectional gated recurrent unit (Bi-GRU), which captures horizontal contextual dependencies across the image width—essential for accurate text region detection. The output sequence, consisting of 256-dimensional features from the Bi-GRU, is reshaped back to a spatial map

X_{seq} \in R^{256 \times H^{'} \times W^{'}}

.

This intermediate representation is then passed through a

1 \times 1

convolution to restore the dimensionality to 512 channels. The resulting tensor is processed by two parallel

1 \times 1

convolutional branches. The first of these is a classification head (

ψ_{CTPN - clss}

), which predicts, for each anchor (i.e., a fixed-width vertical slice of the input), the text-presence probability. Training of this branch is carried out using a masked cross-entropy loss, where neutral anchors are excluded from supervision. The second branch is the regression head (

ψ_{CTPN - reg}

), which estimates the vertical coordinate offsets for the bounding box corresponding to each anchor. This output is trained with a smooth

L 1

loss applied only to positive samples. Overall, CTPN detects the text and estimates its location. When it is used as a loss function, it learns an SR network to produce super-resolved images, from which the text regions could be easily detected.

Let

F_{clss} (\cdot)

and

F_{reg} (\cdot)

denote the outputs of the classification and regression convolutions. Then:

L_{CTPN - clss} = CrossEntropy (F_{clss} (\hat{I}), F_{clss} (I_{HR})),

(3)

L_{CTPN - reg} = SmoothL 1 (F_{reg} (\hat{I}), F_{reg} (I_{HR})) .

(4)

However, instead of comparing the final predicted boxes directly, we extract deep features from the last pre-activation layers in each head, denoted

ψ_{CTPN - deep} (\cdot) \in R^{D}

, and we measure their

L 1

distance between

\hat{I}

and

I_{HR}

:

L_{CTPN - deep} = {∥ψ_{CTPN - deep} (\hat{I}) - ψ_{CTPN - deep} (I_{HR})∥}_{1} .

(5)

These losses computed in the feature space make it possible to train the SR network to reconstruct the details that are helpful for text detection and localization.

3.1.3. CRNN Architecture for Text Recognition

To provide a higher-level OCR-related semantic supervision, we also exploit a text-recognition CRNN module [15] that has been earlier trained on a large corpus of scanned documents. CRNN consists of a convolutional encoder that transforms an RGB input into a compact feature map whose height is reduced to 1, effectively converting a 2D image into a 1D sequence of feature vectors along the width axis. This convolutional encoder comprises several stacked convolutional layers with batch normalization, a rectified linear unit (ReLU), and pooling layers that progressively halve the height dimension until it equals 1. The output of this stage is

H \in R^{C \times 1 \times W^{'}}

, which is reshaped into a sequence

S \in R^{W^{'} \times C}

.

Next, S is fed into two stacked bidirectional long short-term memory (Bi-LSTM) layers. Each Bi-LSTM layer processes the sequence in both forward and backward directions, capturing context across neighboring character regions. The final recurrent output is projected via a linear embedding to a distribution over character classes plus a blank token. In our multi-loss setup, we extract intermediate recurrent features denoted as

ψ_{CRNN} (\cdot) \in R^{W^{'} \times d}

from the last Bi-LSTM layer and compare them between

\hat{I}

and

I_{HR}

:

L_{CRNN} = {∥ψ_{CRNN} (\hat{I}) - ψ_{CRNN} (I_{HR})∥}_{1} .

(6)

This guides the training of an SR network to teach it reconstructing features that match those of HR references, thus enhancing the character-level discriminability in super-resolved images.

3.1.4. Key.Net for Structural Consistency

While CTPN and CRNN focus on OCR-related tasks, we also incorporate Key.Net [16], a pretrained network for generic keypoint detection in natural images. Our hypothesis is that text regions induce characteristic local keypoints (e.g., stroke junctions), so aligning these keypoints between

\hat{I}

and

I_{HR}

may further reinforce structural similarity. We use the already trained Key.Net model (the Key.Net implementation along with a pretrained model is available at https://github.com/axelBarroso/Key.Net (accessed on 1 June 2025)) as a loss function component—the weights of the Key.Net network are not updated in our training procedure.

Given an image X, let

p_{k} (X) \in R^{2}

denote the 2D coordinates (or heatmap response) of the k-th detected keypoint. We then define the keypoint alignment loss as:

L_{Key . Net} = \sum_{k} {∥p_{k} (\hat{I}) - p_{k} (I_{HR})∥}_{2}^{2},

(7)

summing over all keypoints detected in the HR image. In practice, we sample a fixed number of top-scoring keypoints from Key.Net and enforce that their spatial arrangement in a super-resolved image resembles that of the HR reference.

3.1.5. Multi-Loss Approach

In the study reported here, we consider eight loss components to capture pixel-level fidelity, structural consistency, and high-level semantic alignment. These are (i) the pixel-wise MSE, (ii) the consistency loss component, (iii) distances in the CTPN feature spaces (three components), (iv) the distance in the CRNN feature space, (v) the Key.Net loss, and (vi) the hue difference.

The pixel-wise MSE loss enforces that the super-resolved output matches the HR ground truth:

L_{MSE} = {∥\hat{I} - I_{HR}∥}_{2}^{2} .

(8)

Minimizing

L_{MSE}

drives the network to reduce pixel-wise differences, which increases the PSNR values.

The goal of the consistency loss is to ensure that when a super-resolved image is downsampled back to the original size, it resembles the input image. We employ bicubic interpolation to downsample

\hat{I}

to the LR domain, and we compare it with the original

I_{LR}

image to impose the cycle-consistency:

L_{cons} = {∥D (\hat{I}) - I_{LR}∥}_{2}^{2},

(9)

where

D (\cdot)

denotes bicubic downsampling by a factor of 4×. This term penalizes the artifacts that do not match with the input LR image.

To guide the network toward preserving spatial features relevant to text detection, we incorporate a set of auxiliary loss functions derived from the pretrained CTPN model. Specifically, we extract and compare three types of intermediate representations from both the super-resolved image

\hat{I}

and its HR counterpart

I_{HR}

: (i) deep convolutional features, (ii) classification logits, and (iii) regression outputs. Each feature space yields a separate

L 1

loss, formulated as:

L_{CTPN - deep} = {∥ψ_{CTPN - deep} (\hat{I}) - ψ_{CTPN - deep} (I_{HR})∥}_{1},

(10)

L_{CTPN - clss} = {∥ψ_{CTPN - clss} (\hat{I}) - ψ_{CTPN - clss} (I_{HR})∥}_{1},

(11)

L_{CTPN - reg} = {∥ψ_{CTPN - reg} (\hat{I}) - ψ_{CTPN - reg} (I_{HR})∥}_{1} .

(12)

Here,

ψ_{CTPN - deep} (\cdot)

denotes the deepest convolutional activation maps,

ψ_{CTPN - clss} (\cdot)

refers to the classification output logits for text proposal confidence, and

ψ_{CTPN - reg} (\cdot)

captures the predicted vertical bounding-box regressions.

By encouraging alignment between these latent representations of

\hat{I}

and

I_{HR}

, we guide the network to reconstruct text regions in a manner that is focused on text-relevant geometry, even in the absence of explicit bounding-box labels. To further ensure that the super-resolved output improves the capabilities of character recognition, we incorporate the CRNN loss [15]. Let

ψ_{CRNN} (\cdot)

denote the feature extraction function of the CRNN model, which includes both convolutional and recurrent representations. We compute the

L 1

distance between the feature maps extracted from the super-resolved image

\hat{I}

and its HR reference

I_{HR}

:

L_{CRNN} = {∥ψ_{CRNN} (\hat{I}) - ψ_{CRNN} (I_{HR})∥}_{1} .

(13)

This formulation encourages the network to reconstruct the images of enhanced semantic structure required for accurate text recognition. By aligning CRNN-derived features across the input pairs, the model is guided to reconstruct OCR-relevant details even in the absence of explicit textual labels.

To teach an SR network reconstructing structural patterns, we incorporate the multi-scale index proposal (MSIP) loss derived from the Key.Net architecture. Unlike previous losses that compare intermediate feature maps, MSIP operates directly on local descriptors extracted from keypoint regions across multiple scales. Given the HR reference image

I_{HR}

and the super-resolved image

\hat{I}

, Key.Net identifies repeatable keypoints and computes local gradient-based descriptors. The MSIP loss compares the responses around corresponding keypoints at multiple scales, enforcing stability and robustness in the reconstructed geometry. Formally, this loss is defined as:

L_{Key . Net} = \sum_{s \in S} \sum_{k} {∥ϕ_{s, k} (\hat{I}) - ϕ_{s, k} (I_{HR})∥}_{2}^{2},

(14)

where

ϕ_{s, k} (\cdot)

denotes the local descriptor at a keypoint k and a scale s, and

S

is the set of scales used. This loss penalizes discrepancies in the geometric structure and increases the keypoint similarity between a super-resolved image and its HR reference. By leveraging the MSIP loss, we promote alignment, not only in texture or color but also in intrinsic structural cues, which are critical in documents containing fine typographic details or graphical annotations.

The overall loss function (

L_{total}

) combines the individual loss components described previously:

\begin{matrix} L_{total} = & λ_{MSE} L_{MSE} + λ_{cons} L_{cons} + λ_{CTPN - deep} L_{CTPN - deep} \\ + λ_{CTPN - clss} L_{CTPN - clss} + λ_{CTPN - reg} L_{CTPN - reg} \\ + λ_{CRNN} L_{CRNN} + λ_{Key . Net} L_{Key . Net} + λ_{Hue} L_{Hue} . \end{matrix}

(15)

Their aggregation allows the network to simultaneously optimize for pixel-wise fidelity, structural alignment, semantic consistency, and chromatic coherence. To ensure that no single task dominates the training process, each weight

λ_{i}

is dynamically adjusted during optimization using the DWA strategy [43]. Specifically, higher emphasis is placed on the tasks whose loss decreases less over time, thus promoting balanced convergence across heterogeneous objectives.

The proposed loss function (Equation (15)) ensures that the super-resolved output not only minimizes pixel-wise reconstruction error but also retains structural and semantic integrity relevant for document analysis tasks such as text detection and recognition. The DWA mechanism is recomputed at the end of each epoch, based on the relative loss descent rates, thereby allowing the model to adapt its focus throughout the training trajectory. The weights

{λ_{i} (t)}

are updated at each epoch t. Let

r_{i} (t) = \frac{L_{i} (t - 1)}{L_{i} (t - 2)}

(16)

denote the relative loss improvement between two earlier epochs. Then:

λ_{i} (t) = \frac{N exp (r_{i} (t - 1) / T)}{\sum_{j = 1}^{N} exp (r_{j} (t - 1) / T)},

(17)

where

N = 8

is the number of loss terms and

T = 2

controls the weight softness. This encourages balanced convergence, preventing any single task from dominating the training process.

3.2. Multi-Task Loss Aggregation and Optimization Strategy

The loss function components are combined into a unified multi-task training objective. Each component contributes to the overall optimization according to its dynamically updated weight. This strategy enables the model to simultaneously satisfy multiple reconstruction goals—ranging from pixel-wise accuracy to high-level semantic preservation—during training. All loss values are aggregated into the total loss:

L_{total} = \sum_{i = 1}^{N} λ_{i} (t) \cdot L_{i},

(18)

and the gradients are propagated accordingly to update the SRResNet parameters

θ

using the Adam optimizer.

4. Experimental Validation

In this section, we report our experiments conducted to evaluate the proposed task-driven SR framework. We describe the dataset construction and training setup (Section 4.1), followed by quantitative and qualitative comparisons across multiple test scenarios (Section 4.2) and the final discussion (Section 4.3).

4.1. Experimental Setup

To evaluate our approach, we carried out a series of experiments that compare the models trained under various supervision regimes. Each model was evaluated on a diverse set of test datasets, with both real-world and simulated degradations, to assess its generalization ability and robustness to real-world distortions. In this section, we describe the dataset preparation procedure, the training configuration, and the evaluation protocol used in our study.

4.1.1. Datasets

To evaluate our task-driven SISR framework, we constructed a dataset with real-world document scans acquired under different conditions (the dataset is available upon request). The dataset design was guided by our prior studies, and it is focused on the magnification factor of

4 \times

. This offers a practical balance: while

2 \times

magnification may not sufficiently challenge the model to reconstruct fine details, higher ratios often lead to unstable optimization and a hallucinated content [31,44]. Thus, the

4 \times

setting provides a meaningful and realistic benchmark for restoration quality. The dataset is composed of several parts, namely, (i) University Bulletin (scans of a university bulletin), (ii) Scientific Article (scans of a printed scientific article), and (iii) COVID Test Leaflet (a scan of a medical leaflet), as well as (iv) a publicly available Old Books dataset [45]. Furthermore, we have exploited the MS COCO dataset [28] that contains natural images.

The University Bulletin and Scientific Article scans were acquired using a Samsung SCX-3400 flatbed scanner—a widely available consumer-grade device. Each page was scanned sequentially at five resolution levels: 75, 150, 200, 300, and 600 dots per inch (DPIs). The scanning was automated via a custom script to maintain consistent acquisition conditions. After scanning at all DPI settings, each page was translated by a small amount and re-scanned, yielding nine spatial shifts per page. This resulted in a total of

32 \times 5 \times 9 = 1440

scanned images. The resulting multi-resolution, multi-shift dataset supports both single-image and multi-frame SR scenarios. For our study, we focus on resolution pairs with a scaling ratio of exactly

4 \times

(75–300 DPI and 150–600 DPI), ensuring both practical fidelity and alignment with our experimental goals. To ensure spatial alignment between the LR and HR image pairs, we applied global registration based on a rigid 2D translation. Each LR image was first upsampled using bicubic interpolation to match the resolution of its HR counterpart. To assess the alignment quality, we computed an absolute difference map, followed by the calculation of the MSE within a central crop region (excluding a 32-pixel margin). A stochastic grid search was then performed to identify the integer-pixel translation vector that minimized the MSE. To ensure the reliability of the estimated transformation, we validated the chosen shift across 20 randomly selected image pairs. The final translations (in pixels, px) were determined to be

[- 5 px, - 1 px]

for 75–300 DPI pairs and

[- 6 px, - 7 px]

for 150–600 DPI pairs. These displacements were subsequently applied to the LR images using affine transformation. The aligned image pairs were cropped into patch pairs of

256 \times 256

pixels (LR) and

1024 \times 1024

pixels (HR), preserving the

4 \times

ratio. The stride matched the LR patch size, and patches were clamped at image borders to avoid out-of-bound sampling. For each page, the zero-shift scan was incorporated into the test set for all DPI levels. The remaining eight shifted scans served as the training set.

The COVID Test Leaflet dataset was captured using a consumer-grade Canon (Tokyo, Japan) imageFORMULA P-208II linear scanner equipped with a contact image sensor. These scans pose significant challenges due to transport instabilities during scanning, which result in nonlinear geometric distortions such as local stretching and warping. Consequently, this dataset provides a robust benchmark for assessing the spatial resilience of SR models under realistic and uncontrolled acquisition conditions. Each scan was divided into overlapping tiles with

512 \times 512

pixels. Local alignment was performed using phase correlation, and the registered tiles were reassembled into full images. Due to the non-rigid nature of distortions, all registration results were manually verified. These scans were incorporated into the test set.

The Old Books dataset [45] is publicly available and comprises HR scans of historical printed documents. It features a wide variety of typographic styles, physical degradation due to paper aging, and uneven illumination. Importantly, the dataset includes text transcriptions and binarized ground truth versions of each scan, making it particularly suitable for evaluating the OCR-related performance on super-resolved outputs.

To examine generalization beyond the document-centric imagery, we employ a simulated subset of the MS COCO dataset [28]. From this corpus, we extracted image regions that contain textual content, and we generated corresponding LR images via bicubic downsampling. This controlled setup enables benchmarking of our document-oriented SR model against natural-scene inputs, offering insights into its cross-domain adaptability.

For the University Bulletin, Scientific Article, COVID Test Leaflet, and Old Books datasets, we prepared real LR inputs obtained from physical scans performed at different resolutions, as well as simulated LR–HR pairs generated by downsampling the corresponding HR images. This dual-mode evaluation facilitates fair and consistent comparison across different degradation procedures, helping to disentangle the effects of acquisition noise from purely resolution-related loss. In contrast, the MS COCO subset is included solely into the simulated dataset, as no real LR counterparts are available. The MS COCO, University Bulletin, and Scientific Article datasets were split into training and test parts without any overlaps, ensuring that there were no information leakages between training and test sets. The simulated MS COCO images were used to pretrain the model, and the images from the University Bulletin and Scientific Article datasets (both in real-world and simulated modes) were used to train the models in a task-driven way (20% of the training data were used as a validation set). The Old Books and COVID Test Leaflet datasets were not exploited for training, thus allowing for verifying the cross-device robustness of the elaborated models.

All the datasets exploited in our study are summarized in Table 1. For each dataset, we present the split into training, validation, and test sets, and we provide the number of images (i.e., the original distinct pages) alongside the size and number of extracted patches in the real-world and simulated parts. For real-world datasets, each image was scanned at different DPIs to obtain LR and HR images. For the simulated images, the HR scan was downsampled to obtain a simulated LR image.

4.1.2. Training Strategy

The SRResNet model was initialized using weights from a variant pretrained on the MS COCO dataset with the pixel-wise MSE loss, following the procedure described in [9]. To investigate the effect of degradation realism on SR performance, we fine-tuned the model, relying on both real-world and simulated datasets. In the former case, we trained the model using image pairs acquired through actual scanning processes from the University Bulletin and Scientific Article datasets, specifically at resolution ratios of

4 \times

(e.g., 75–300 DPI and 150–600 DPI). These pairs contain realistic distortions, including scanner-specific blur, noise, compression artifacts, and physical imperfections from printed media. This setup enables the model to learn reconstruction patterns reflective of real-world scanning degradations. Furthermore, we exploited simulated datasets—we used the same HR images as in the case of real-world datasets, but the LR counterparts were obtained by bicubic downsampling. This allowed for controlled degradation modeling under matched semantic content, isolating the learning dynamics attributable solely to the downsampling process. This protocol was designed to disentangle the contributions of physical acquisition artifacts and interpolation-based degradation, facilitating a fair comparison between models trained under real-world and simulated conditions.

To balance the multiple objectives of our task-driven framework, we employed the DWA strategy for adaptive loss weighting. In practice, we observed that DWA allowed the model to dynamically shift focus depending on input characteristics: for instance, when processing patches with sparse or low-contrast text, the contribution of CRNN-based supervision increased, whereas for the texture-rich regions a greater emphasis was placed on pixel-wise losses and Key.Net-based structural alignment. This adaptive mechanism promoted stable optimization and contributed to the model’s ability to generalize effectively across the test dataset.

4.1.3. Investigated Variants and Evaluation Metrics

In Table 2, we list the variants investigated within our study. As it was demonstrated in [35] (for simulated datasets) that CTPN losses are crucial for optimizing SR networks for OCR-related tasks, here we focus on investigating the influence of the components concerned with CRNN, Key.Net, and hue features. In addition to that, we report the results obtained with bicubic interpolation (Int.) and with a baseline SRResNet model (

M_{B}

) that was both pretrained and fine-tuned using the simulated MS COCO dataset.

For each variant, we report the PSNR, structural similarity index (SSIM) [46], learned perceptual image patch similarity (LPIPS) [47], and IoU between the text localization extracted from super-resolved and HR reference images. In order to verify whether the differences between the investigated variants are statistically significant, we have employed statistical tests. Comparisons between groups were made using the Kruskal–Wallis H test and the post hoc Dunn test with Benjamini–Hochberg p-value correction. A two-sided p-value < 0.05 was considered statistically significant. All computational analysis was performed in the R environment for statistical computing (version 4.4.3).

4.2. Experimental Results

The mean reconstruction quality scores obtained using the investigated variants over the test datasets are reported in Table 3. It presents the summarized results, and the outcome of detailed statistical analysis is presented in Figure 2 and Figure 3 and in Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11. Finally, examples of the qualitative results are showcased in Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9.

From Table 3, it can be seen that all of the trained SR models receive worse image fidelity scores (PSNR, SSIM, and LPIPS) than bicubic interpolation, but the IoU scores are better on the models trained in a task-driven way from real-world images (for test sets composed of both simulated and real-world LR images). The violin plots showing distribution of the scores alongside median and quartile values are presented in Figure 2 and Figure 3 for the simulated and real-world test sets, respectively. In Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11, we report the outcomes of statistical tests performed for PSNR, SSIM, LPIPS, and IoU metrics for all test sets—for each pair of methods, we indicate which one was significantly better considering the median score. It can be seen that the IoU scores are significantly better for the models trained in a task-driven way for both the simulated (Table 7) and real-world (Table 11) datasets.

In Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8, we show examples of the reconstruction outcomes for images from the University Bulletin (Figure 4), Scientific Article (Figure 5), COVID Test Leaflet (Figure 6), Old Books (Figure 7), and MS COCO (Figure 8) datasets. It can be appreciated that the models trained in a task-driven way reconstruct more high-frequency features than the baseline

M_{B}

configuration; however, this is achieved at a cost of some artifacts that are reflected in lower image fidelity scores reported earlier in Table 3. In Figure 9, we showcase the outcomes retrieved with the

M_{+ + +}^{R}

and

M_{+ + +}^{S}

models for scans of documents with various fonts and backgrounds. It can be seen that, in all cases, the SR models trained in a task-driven manner retrieve the text that is visually much clearer and sharper than in the outcomes of bicubic interpolation. Overall, the text appearance in super-resolved images is closer to that in the HR references.

4.3. Discussion

4.3.1. Interpreting the Quantitative Results

It may be surprising that the average image similarity scores (PSNR, SSIM, and LPIPS) reported in Table 3 are consistently best for bicubic interpolation, even outperforming the baseline model

M_{B}

trained from the simulated MS COCO dataset; their median values are also significantly better than those retrieved with all the remaining models (see Table 4, Table 5 and Table 6 and Table 8, Table 9 and Table 10). However, it must be taken into account that the goal of the proposed task-driven SR is to optimize the models for OCR performance rather than for pixel-wise fidelity. SISR is a severely ill-posed problem with an extensive solution space [48]. The task-driven training navigates the models toward those regions of the solution space that are optimal considering the specific tasks, hence they may not overlap with the regions of high pixel-wise fidelity. Another important issue to consider is that the images in the test set are quite specific, as they mainly present scans of printed text documents. It is well-known that the pixel-wise metrics (PSNR and SSIM) are of limited robustness in assessing the reconstruction quality [49,50], and this effect may be amplified for high-contrast images like document scans [51]. While LPIPS is commonly more robust in such circumstances [47], it has been trained over natural images and hence may not be optimal for assessing the quality of document scans [52]. Furthermore, the fact that the models trained in a task-driven manner deteriorate the image similarity scores compared with the model trained in a conventional way is in line with the results reported in other works on task-driven SR [31].

In contrast to the image similarity metrics, the task-oriented IoU scores indicate clearly that text detection is more effective after super-resolving the images using task-driven models. Interestingly, the models trained from real-world images outperform those trained from the simulated ones also when they are applied to the simulated test dataset. In particular,

M_{- + +}^{R}

renders the best average and median scores (Table 3 and Figure 2), which are significantly better than all 17 other variants (see Table 7). This may be caused by the fact that both simulated and real-world test sets contain data from various sources, and a real-world training dataset with a wider variety of LR–HR relationships enables the model to learn more complex degradations, resulting in more generalizable performance even for the simulated datasets. On the contrary, the use of the simulated dataset for task-driven training does not improve the IoU scores. For the real-world test set, the best IoU scores were retrieved with

M_{+ + +}^{R}

and

M_{- + +}^{R}

variants, each of which is significantly better than the 15 other variants, including bicubically-interpolated images and those obtained with the baseline

M_{B}

variant (Table 11). Here, the models trained from the simulated data were also rather poor in terms of the IoU scores, which confirms the necessity of elaborating real-world training datasets for training SR networks.

The statistical analysis reported for the IoU scores in Table 7 and Table 11 also sheds light on the necessity of employing task-driven components of the loss function. For the simulated test set (Table 7), the

L_{Key . Net}

and

L_{Hue}

components are crucial (in addition to those based on the CTPN features). For the real-world test set, the best average score was obtained with all components switched on (

M_{+ + +}^{R}

); however, the use of

L_{CRNN}

does not improve the scores in a statistically significant way. Clearly, incorporating a loss function component related to text recognition does not enhance text detection capabilities, quantified with the IoU metric; however, it may be relevant in our future work, which will be aimed at incorporating text recognition to evaluate the SR outcome. It must also be noted that the improvements in IoU scores, although statistically significant, are limited due to the inherent constraints of running reconstruction from just a single image. While SISR techniques allow for substantial enhancement of perceived image quality, they contribute little to recovering genuine HR information. In contrast, multi-image SR benefits from the fact that every LR observation contains a different portion of HR information—as a result, reconstructed images are much closer to the actual ground truth than any of the LR inputs [4]. Importantly, the elaborated framework is not restricted to SISR, and it can be employed for training multi-image SR techniques. They may be expected to achieve considerably larger gains in terms of the OCR performance, benefiting from the task-driven guidance proposed in the study reported here.

4.3.2. Qualitative Analysis

Inspecting the qualitative results in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 allows us to confirm that the image similarity metrics (PSNR, SSIM, and LPIPS) do not correlate well with the visual quality of reconstructed text documents. Employing task-driven loss function components allows the models to learn preserving and enhancing text structures. At the same time, they become blurred after applying bicubic interpolation, and the baseline

M_{B}

model trained from natural images tends to impair the connectivity within the characters, affecting text detection and recognition capabilities. In our opinion, the

M_{+ + +}^{R}

model leads to the visually best outcome for both real-world and simulated images presenting text documents (Figure 4, Figure 5, Figure 6 and Figure 7), making the text most legible, which is coherent with the quantitative results. Even though in some cases (like in Figure 7) there are some color artifacts, they are weaker than those caused by other variants trained from the real-world datasets (e.g.,

M_{+ + -}^{R}

or

M_{- + -}^{R}

), as well as from the simulated ones (e.g.,

M_{+ + +}^{S}

). Also, the baseline model

M_{B}

distorts the letters (see the letter ‘e’ in the zoomed example in Figure 7). We have also verified the behavior of the investigated models for natural images (Figure 8)—the task-driven models tend to render sharper edges, better reconstructing high-contrast details (compare the Big Ben clock for

M_{B}

and

M_{+ + +}^{R}

). However, this is achieved at a cost of some artifacts, especially for smooth regions (see the upper left corner in

M_{+ + +}^{R}

) that substantially deteriorate image similarity scores.

4.3.3. Limitations

Even though we have achieved promising results in improving OCR-related performance, the proposed task-driven SR approach has certain limitations. First, the models trained in a task-oriented way may introduce artifacts or distortions that, while beneficial for OCR, reduce the overall visual realism or general usability of the super-resolved images. This task-specific optimization limits their applicability in scenarios where perceptual quality and fidelity are crucial alongside OCR-related aspects. Furthermore, the elaborated framework has been validated in the context of SISR, which restricts the quality of the reconstruction, as genuine HR details cannot be fully recovered without additional observations, inherent to multi-image SR. Although in our experimental validation we have extensively studied the behavior of SRResNet, the proposed approach has not been evaluated considering a larger model like EDSR [20] or SwinIR [23]. However, we have managed to observe that introducing the task-oriented loss function components enhances the performance in a statistically significant way—this establishes the methodology, which may help in improving the task-oriented performance for multi-image SR. Finally, the current loss design is focused on text detection, and while recognition-related components have been included, their contribution has not yet been validated in the context of an end-to-end OCR process. This draws an interesting direction for future research, where the SR training procedure could be more tightly integrated with the full character recognition process. Furthermore, extending the task-driven framework to multi-image SR, which inherently provides richer information content, may allow for achieving improvements in both perceptual quality and OCR performance.

5. Conclusions and Future Work

In this paper, we have presented a multi-task training framework for SR networks aimed at real-world text document scans. By incorporating text detection, recognition, and keypoint alignment components into the loss function, and balancing them relying on DWA, our model generates super-resolved images, which significantly improves the OCR-related text detection metric on challenging scans. Compared to conventional SR trained on simulated data, our approach better captures the semantics of text content and produces images more appropriate for further processing. Through our extensive experiments backed by statistical analysis, we demonstrated that it is critical to both exploit real-world training data and to incorporate task-specific components into the loss function. Compared with the baseline, the IoU text detection score was improved by 1.09% and 1.94% for simulated and real-world datasets, respectively. Although these gains are of a relatively small magnitude, they are statistically significant and they allow us to positively verify the hypothesis that including the task-oriented loss function components contributes to finding more valuable regions of the SR solution space. In the reported case, the task-related performance is limited by the very nature of SISR, which cannot introduce any new information beyond that present in the input image and learned from the training data. Importantly, the elaborated methodology can be further exploited to improve SR approaches underpinned with information fusion.

Our ongoing research is aimed at exploiting the developed framework for training multi-image SR techniques, including our recent graph attention network [53], which we expect to allow for enhancing the OCR performance. Importantly, our dataset already contains multiple images per scene, and it can be exploited for the training and validation of multi-image fusion techniques. At the same time, we plan to elaborate validation procedures that will incorporate text recognition metrics in addition to text localization in document scans. This approach will enable the evaluation of SR techniques within an end-to-end character recognition pipeline, thereby bridging the gap to practical real-world applications.

Author Contributions

Conceptualization, M.K.; data curation, M.Z.; formal analysis, A.K. and M.K.; funding acquisition, M.K.; investigation, M.Z., T.T., A.K. and M.K.; methodology, A.K. and M.K.; project administration, M.K.; resources, M.K.; software, M.Z., T.T. and J.S.; supervision, M.K.; validation, M.Z. and T.T.; visualization, M.Z., A.K. and M.K.; writing—original draft, M.Z., A.K. and M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Centre, Poland, under Research Grant no. 2022/47/B/ST6/03009. M.K. was supported by the SUT funds through the Rector’s Research and Development Grant 02/080/RGJ25/0053.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data exploited in this papers are available upon request.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Guo, H.; Dai, T.; Zhu, M.; Meng, G.; Chen, B.; Wang, Z.; Xia, S.T. One-stage low-resolution text recognition with high-resolution knowledge transfer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2189–2198. [Google Scholar]
Yu, M.; Shi, J.; Xue, C.; Hao, X.; Yan, G. A review of single image super-resolution reconstruction based on deep learning. Multimed. Tools Appl. 2024, 83, 55921–55962. [Google Scholar] [CrossRef]
Wang, Z.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-Resolution: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
Yue, L.; Shen, H.; Li, J.; Yuan, Q.; Zhang, H.; Zhang, L. Image super-resolution: The techniques, applications, and future. Signal Process. 2016, 128, 389–408. [Google Scholar] [CrossRef]
Valsesia, D.; Magli, E. Permutation Invariance and Uncertainty in Multitemporal Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 184–199. [Google Scholar]
Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the IEEE/CVF European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 391–407. [Google Scholar]
Kim, J.; Kwon Lee, J.; Mu Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
Cai, J.; Zeng, H.; Yong, H.; Cao, Z.; Zhang, L. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Chen, H.; He, X.; Qing, L.; Wu, Y.; Ren, C.; Sheriff, R.E.; Zhu, C. Real-world single image super-resolution: A brief review. Inf. Fusion 2022, 79, 124–145. [Google Scholar] [CrossRef]
Kawulok, M.; Kowaleczko, P.; Ziaja, M.; Nalepa, J.; Kostrzewa, D.; Latini, D.; De Santis, D.; Salvucci, G.; Petracca, I.; La Pegna, V.; et al. Hyperspectral image super-resolution: Task-based evaluation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18949–18966. [Google Scholar] [CrossRef]
Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the IEEE/CVF European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 56–72. [Google Scholar]
Liu, Y.; Wang, Y.; Shi, H. A convolutional recurrent neural-network-based machine learning for scene text recognition application. Symmetry 2023, 15, 849. [Google Scholar] [CrossRef]
Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
Barroso-Laguna, A.; Riba, E.; Ponsa, D.; Mikolajczyk, K. Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar]
Dong, C.; Zhu, X.; Deng, Y.; Loy, C.C.; Qiao, Y. Boosting optical character recognition: A super-resolution approach. arXiv 2015, arXiv:1506.02211. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6228–6237. [Google Scholar]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
Ayazoglu, M. Extremely lightweight quantization robust real-time single-image super resolution for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2472–2479. [Google Scholar]
Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
Wan, C.; Yu, H.; Li, Z.; Chen, Y.; Zou, Y.; Liu, Y.; Yin, X.; Zuo, K. Swift parameter-free attention network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 6246–6256. [Google Scholar]
Xie, C.; Zhang, X.; Li, L.; Meng, H.; Zhang, T.; Li, T.; Zhao, X. Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1283–1292. [Google Scholar]
Wang, W.; Xie, E.; Sun, P.; Wang, W.; Tian, L.; Shen, C.; Luo, P. TextSR: Content-aware text super-resolution guided by recognition. arXiv 2019, arXiv:1909.07113. [Google Scholar]
Honda, K.; Fujita, H.; Kurematsu, M. Improvement of Text Image Super-Resolution Benefiting Multi-task Learning. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan, 19–22 July 2022; Springer: Cham, Switzerland, 2022; pp. 275–286. [Google Scholar]
Gomez, R.; Shi, B.; Gomez, L.; Numann, L.; Veit, A.; Matas, J.; Belongie, S.; Karatzas, D. ICDAR2017 robust reading challenge on COCO-text. In Proceedings of the IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1435–1443. [Google Scholar]
Haris, M.; Shakhnarovich, G.; Ukita, N. Task-driven super resolution: Object detection in low-resolution images. In Proceedings of the International Conference on Neural Information Processing (ICONIP), Sanur, Bali, Indonesia, 8–12 December 2021; Springer: Cham, Switzerland, 2021; pp. 387–395. [Google Scholar]
Frizza, T.; Dansereau, D.G.; Seresht, N.M.; Bewley, M. Semantically accurate super-resolution generative adversarial networks. Comput. Vis. Image Underst. 2022, 221, 103464. [Google Scholar] [CrossRef]
Rad, M.S.; Bozorgtabar, B.; Musat, C.; Marti, U.V.; Basler, M.; Ekenel, H.K.; Thiran, J.P. Benefiting from multitask learning to improve single image super-resolution. Neurocomputing 2020, 398, 304–313. [Google Scholar] [CrossRef]
Madi, B.; Alaasam, R.; El-Sana, J. Text Edges Guided Network for Historical Document Super Resolution. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Hyderabad, India, 4–7 December 2022; Springer: Cham, Switzerland, 2022; pp. 18–33. [Google Scholar]
Zyrek, M.; Kawulok, M. Task-driven single-image super-resolution reconstruction of document scans. In Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), Belgrade, Serbia, 8–11 September 2024; pp. 259–264. [Google Scholar]
Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633. [Google Scholar] [CrossRef] [PubMed]
Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 794–803. [Google Scholar]
Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar]
Guo, M.; Haque, A.; Huang, D.A.; Yeung, S.; Li, F.-F. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 270–287. [Google Scholar]
Tan, C.; Cheng, S.; Wang, L. Efficient Image Super-Resolution via Self-Calibrated Feature Fuse. Sensors 2022, 22, 329. [Google Scholar] [CrossRef] [PubMed]
Prajapati, K.; Chudasama, V.; Upla, K. A light weight convolutional neural network for single image super-resolution. Procedia Comput. Sci. 2020, 171, 139–147. [Google Scholar] [CrossRef]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida, 20–25 June 2009; pp. 248–255. [Google Scholar]
Liu, Z.; Li, L.; Wu, Y.; Zhang, C. Facial Expression Restoration Based on Improved Graph Convolutional Networks. arXiv 2019, arXiv:1910.10344. [Google Scholar] [CrossRef]
Shermeyer, J.; Van Etten, A. The effects of super-resolution on object detection performance in satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1–10. [Google Scholar]
Correia, P.H.B.; Rivera, G.A.R. Evaluation of OCR free software applied to old books. Rev. Trab. Iniciação Científica UNICAMP 2018, 26. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Lugmayr, A.; Danelljan, M.; Timofte, R.; Kim, K.-w.; Kim, Y.; Lee, J.-y.; Li, Z.; Pan, J.; Shim, D.; Song, K.-U.; et al. NTIRE 2022 challenge on learning the super-resolution space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 786–797. [Google Scholar]
Benecki, P.; Kawulok, M.; Kostrzewa, D.; Skonieczny, L. Evaluating super-resolution reconstruction of satellite images. Acta Astronaut. 2018, 153, 15–25. [Google Scholar] [CrossRef]
Lin, L.; Chen, H.; Kuruoglu, E.E.; Zhou, W. Robust structural similarity index measure for images with non-Gaussian distortions. Pattern Recognit. Lett. 2022, 163, 10–16. [Google Scholar] [CrossRef]
Zhang, F.; Li, S.; Ma, L.; Ngan, K.N. Limitation and challenges of image quality measurement. In Proceedings of the Visual Communications and Image Processing 2010, Huangshan, China, 11–14 July 2010; SPIE: Bellingham, WA, USA, 2010; Volume 7744, pp. 25–32. [Google Scholar]
Alaei, A.; Bui, V.; Doermann, D.; Pal, U. Document image quality assessment: A survey. ACM Comput. Surv. 2023, 56, 1–36. [Google Scholar] [CrossRef]
Tarasiewicz, T.; Kawulok, M. A graph attention network for real-world multi-image super-resolution. Inf. Fusion 2025, 124, 103325. [Google Scholar] [CrossRef]

Figure 1. Outline of the proposed task-driven training of an SR network. CNNs performing OCR-related tasks are applied to process both the HR reference and the super-resolved image—the differences between their outcomes establish task-driven components of the loss function, which are coupled with the commonly applied image-driven components to guide the training of the SR network.

Figure 2. Violin plots illustrating the distribution of the IoU scores obtained with different models for the simulated images in the test set. The median and quartiles are indicated with the box plots.

Figure 3. Violin plots illustrating the distribution of the IoU scores obtained with different models for the real-world images in the test set. The median and quartiles are indicated with the box plots.

Figure 4. Example of SR outcomes based on a real-world image (top two rows) and a simulated image (top bottom rows) from the University Bulletin dataset performed with the models trained from real-world and simulated training datasets using different loss functions.

Figure 5. Example of SR outcomes based on a real-world image (top two rows) and a simulated image (top bottom rows) from the Scientific Article dataset performed with the models trained from real-world and simulated training datasets using different loss functions.

Figure 6. Example of SR outcomes based on a real-world image (top two rows) and a simulated image (top bottom rows) from the COVID Test Leaflet dataset performed with the models trained from real-world and simulated training datasets using different loss functions.

Figure 7. Example of SR outcomes based on a real-world image (top two rows) and a simulated image (top bottom rows) from the Old Books dataset performed with the models trained from real-world and simulated training datasets using different loss functions.

Figure 8. Example of reconstructing a simulated LR image from the MS COCO dataset performed with the models trained using different loss functions relying on real-world and simulated training datasets.

Figure 9. Examples of real-world document scans with different fonts derived from the Scientific Article (a,b), University Bulletin (c), COVID Test Leaflet (d), and Old Books (e–h) datasets, upsampled with bicubic interpolation and reconstructed using

M_{+ + +}^{S}

and

M_{+ + +}^{R}

SRResNet models.

Figure 9. Examples of real-world document scans with different fonts derived from the Scientific Article (a,b), University Bulletin (c), COVID Test Leaflet (d), and Old Books (e–h) datasets, upsampled with bicubic interpolation and reconstructed using

M_{+ + +}^{S}

and

M_{+ + +}^{R}

SRResNet models.

Table 1. Detailed specification of all datasets used in our study. For real LR–HR pairs, LR scans were acquired at specific DPI settings and aligned to HR references. Simulated counterparts were generated by the bicubic downsampling of HR images to match the number and size of real-world LR patches.

		No. of	LR–HR Resolution	LR Patch Size	No. of Patches
	Data Source	Images	(in DPI)	(in px)	Real-World	Simulated
Pretraining set
	MS COCO [28]	81,574	—	32 × 32	—	81,574
Training and validation set
	University Bulletin	208	75–300	256 × 256	2046	2046
		208	150–600	256 × 256	6004	6004
	Scientific Article	48	75–300	256 × 256	482	482
		48	150–600	256 × 256	1392	1392
Test set
	MS COCO [28]	1209	—	32 × 32	—	1209
	University Bulletin	26	75–300	256 × 256	300	300
		26	150–600	256 × 256	875	875
	Scientific Article	6	75–300	256 × 256	72	72
		6	150–600	256 × 256	210	210
	COVID Test Leaflet	1	75–300	128 × 128	95	95
	Old Books [45]	45	125–500	256 × 256	860	860

Table 2. The variants considered in our experimental study that have been trained from real-world and simulated datasets, employing different loss function components (

L_{CRNN}

,

L_{Key . Net}

, and

L_{Hue}

).

Table 2. The variants considered in our experimental study that have been trained from real-world and simulated datasets, employing different loss function components (

L_{CRNN}

,

L_{Key . Net}

, and

L_{Hue}

).

Variant Name	Training Dataset	$L_{CRNN}$	$L_{Key . Net}$	$L_{Hue}$
$M_{+ + +}^{R}$	Real-world	✓	✓	✓
$M_{+ + -}^{R}$	Real-world	✓	✓
$M_{+ - +}^{R}$	Real-world	✓		✓
$M_{+ - -}^{R}$	Real-world	✓
$M_{- + +}^{R}$	Real-world		✓	✓
$M_{- + -}^{R}$	Real-world		✓
$M_{- - +}^{R}$	Real-world			✓
$M_{- - -}^{R}$	Real-world
$M_{+ + +}^{S}$	Simulated	✓	✓	✓
$M_{+ + -}^{S}$	Simulated	✓	✓
$M_{+ - +}^{S}$	Simulated	✓		✓
$M_{+ - -}^{S}$	Simulated	✓
$M_{- + +}^{S}$	Simulated		✓	✓
$M_{- + -}^{S}$	Simulated		✓
$M_{- - +}^{S}$	Simulated			✓
$M_{- - -}^{S}$	Simulated

Table 3. Reconstruction quality scores (mean value and standard deviation) retrieved using different models for simulated and real-world test sets. For each metric, we show whether higher (↑) or lower (↓) scores are better. The best scores for each metric are boldfaced.

	Simulated Images				Real-World Images
	PSNR [dB] ↑	SSIM ↑	LPIPS ↓	IoU ↑	PSNR [dB] ↑	SSIM ↑	LPIPS ↓	IoU ↑
Int.	$20.36 \pm 6.03$	$0.7043 \pm 0.1890$	$0.2973 \pm 0.1893$	$0.9038 \pm 0.0993$	$19.02 \pm 6.32$	$0.7403 \pm 0.1934$	$0.2516 \pm 0.1863$	$0.8820 \pm 0.0961$
$M_{B}$	$18.09 \pm 5.35$	$0.6535 \pm 0.1926$	$0.3222 \pm 0.1914$	$0.9033 \pm 0.0943$	$15.67 \pm 5.39$	$0.6509 \pm 0.2182$	$0.3073 \pm 0.2074$	$0.8703 \pm 0.0960$
$M_{+ + +}^{R}$	$15.52 \pm 5.31$	$0.4512 \pm 0.2423$	$0.4186 \pm 0.1976$	$0.9119 \pm 0.1029$	$13.72 \pm 5.20$	$0.4182 \pm 0.2698$	$0.4424 \pm 0.2139$	$0.8894 \pm 0.0979$
$M_{+ + -}^{R}$	$17.62 \pm 3.98$	$0.6195 \pm 0.1995$	$0.3083 \pm 0.1541$	$0.9071 \pm 0.1022$	$17.19 \pm 4.38$	$0.6708 \pm 0.1892$	$0.2799 \pm 0.1471$	$0.8834 \pm 0.1005$
$M_{+ - +}^{R}$	$17.39 \pm 3.86$	$0.6290 \pm 0.2014$	$0.3032 \pm 0.1608$	$0.9094 \pm 0.0989$	$16.74 \pm 4.04$	$0.6793 \pm 0.1859$	$0.2723 \pm 0.1544$	$0.8823 \pm 0.1027$
$M_{+ - -}^{R}$	$14.93 \pm 4.80$	$0.4990 \pm 0.2206$	$0.4071 \pm 0.1894$	$0.8904 \pm 0.1191$	$13.97 \pm 5.14$	$0.4901 \pm 0.2579$	$0.4116 \pm 0.2206$	$0.8635 \pm 0.1233$
$M_{- + +}^{R}$	$16.12 \pm 4.55$	$0.5360 \pm 0.2069$	$0.3471 \pm 0.1637$	$0.9142 \pm 0.1041$	$14.65 \pm 4.70$	$0.5214 \pm 0.2435$	$0.3619 \pm 0.1784$	$0.8876 \pm 0.0990$
$M_{- + -}^{R}$	$16.77 \pm 4.10$	$0.4628 \pm 0.1842$	$0.3760 \pm 0.1488$	$0.9038 \pm 0.1112$	$15.42 \pm 4.38$	$0.4364 \pm 0.2158$	$0.3923 \pm 0.1711$	$0.8776 \pm 0.1041$
$M_{- - +}^{R}$	$17.44 \pm 4.17$	$0.6053 \pm 0.1793$	$0.3204 \pm 0.1549$	$0.9123 \pm 0.0959$	$16.24 \pm 4.21$	$0.6417 \pm 0.1728$	$0.2965 \pm 0.1555$	$0.8873 \pm 0.0968$
$M_{- - -}^{R}$	$14.90 \pm 4.87$	$0.5027 \pm 0.2197$	$0.4085 \pm 0.1785$	$0.8795 \pm 0.1335$	$13.99 \pm 5.30$	$0.5016 \pm 0.2536$	$0.4052 \pm 0.2050$	$0.8549 \pm 0.1386$
$M_{+ + +}^{S}$	$15.95 \pm 5.40$	$0.5516 \pm 0.2094$	$0.3239 \pm 0.1896$	$0.9062 \pm 0.0883$	$13.58 \pm 5.45$	$0.5229 \pm 0.2530$	$0.3473 \pm 0.2114$	$0.8692 \pm 0.1036$
$M_{+ + -}^{S}$	$16.57 \pm 5.03$	$0.5468 \pm 0.2057$	$0.3260 \pm 0.1803$	$0.9010 \pm 0.0892$	$14.26 \pm 4.75$	$0.5168 \pm 0.2457$	$0.3591 \pm 0.2038$	$0.8662 \pm 0.1033$
$M_{+ - +}^{S}$	$16.42 \pm 4.53$	$0.5808 \pm 0.1796$	$0.3253 \pm 0.1642$	$0.9028 \pm 0.0903$	$14.38 \pm 4.27$	$0.5715 \pm 0.2046$	$0.3406 \pm 0.1780$	$0.8682 \pm 0.1023$
$M_{+ - -}^{S}$	$16.67 \pm 5.19$	$0.5463 \pm 0.2181$	$0.3217 \pm 0.1785$	$0.9006 \pm 0.0894$	$14.41 \pm 4.98$	$0.5193 \pm 0.2541$	$0.3466 \pm 0.2007$	$0.8658 \pm 0.1015$
$M_{- + +}^{S}$	$16.64 \pm 4.59$	$0.5925 \pm 0.1691$	$0.3224 \pm 0.1595$	$0.9020 \pm 0.0876$	$14.56 \pm 4.30$	$0.5848 \pm 0.1901$	$0.3444 \pm 0.1734$	$0.8663 \pm 0.1030$
$M_{- + -}^{S}$	$16.58 \pm 5.07$	$0.5505 \pm 0.2075$	$0.3253 \pm 0.1787$	$0.9007 \pm 0.0881$	$14.37 \pm 4.83$	$0.5250 \pm 0.2459$	$0.3564 \pm 0.2005$	$0.8652 \pm 0.1020$
$M_{- - +}^{S}$	$17.26 \pm 4.79$	$0.6069 \pm 0.1737$	$0.3058 \pm 0.1577$	$0.9015 \pm 0.0933$	$15.58 \pm 4.77$	$0.6049 \pm 0.1949$	$0.3104 \pm 0.1655$	$0.8664 \pm 0.1045$
$M_{- - -}^{S}$	$16.71 \pm 5.17$	$0.5609 \pm 0.2096$	$0.3163 \pm 0.1730$	$0.9007 \pm 0.0874$	$14.49 \pm 5.00$	$0.5429 \pm 0.2417$	$0.3357 \pm 0.1909$	$0.8652 \pm 0.1006$

Table 4. Statistical significance of the differences between the PSNR scores obtained using the investigated models for simulated images. Green color indicates that a variant in the row is significantly better than a variant in the column; red means the opposite. For each row, we present the number of variants that were outperformed (in a statistically significant way) by the variant in that row.

	Int.	$M_{B}$	$M_{+ + +}^{R}$	$M_{+ + -}^{R}$	$M_{+ - +}^{R}$	$M_{+ - -}^{R}$	$M_{- + +}^{R}$	$M_{- + -}^{R}$	$M_{- - +}^{R}$	$M_{- - -}^{R}$	$M_{+ + +}^{S}$	$M_{+ + -}^{S}$	$M_{+ - +}^{S}$	$M_{+ - -}^{S}$	$M_{- + +}^{S}$	$M_{- + -}^{S}$	$M_{- - +}^{S}$	$M_{- - -}^{S}$	$Σ$
Int.	—	***	***	***	***	***	***	***	***	***	***	***	***	***	***	***	***	***	17
$M_{B}$	***	—	***	***	***	***	***	***	***	***	***	***	***	***	***	***	***	***	16
$M_{+ + +}^{R}$	***	***	—	***	***	***	**	***	***	***	***	***	***	***	***	***	***	***	2
$M_{+ + -}^{R}$	***	***	***	—	NS	***	***	***	NS	***	***	***	***	***	***	***	NS	***	12
$M_{+ - +}^{R}$	***	***	***	NS	—	***	***	***	NS	***	***	***	***	***	***	***	NS	**	12
$M_{+ - -}^{R}$	***	***	***	***	***	—	***	***	***	NS	***	***	***	***	***	***	***	***	0
$M_{- + +}^{R}$	***	***	**	***	***	***	—	***	***	***	*	***	*	***	***	***	***	***	3
$M_{- + -}^{R}$	***	***	***	***	***	***	***	—	***	***	**	NS	**	NS	NS	NS	***	NS	5
$M_{- - +}^{R}$	***	***	***	NS	NS	***	***	***	—	***	***	***	***	***	***	***	NS	***	12
$M_{- - -}^{R}$	***	***	***	***	***	NS	***	***	***	—	***	***	***	***	***	***	***	***	0
$M_{+ + +}^{S}$	***	***	***	***	***	***	*	**	***	***	—	NS	NS	***	*	*	***	***	7
$M_{+ + -}^{S}$	***	***	***	***	***	***	***	NS	***	***	NS	—	NS	NS	NS	NS	***	NS	4
$M_{+ - +}^{S}$	***	***	***	***	***	***	*	**	***	***	NS	NS	—	***	NS	NS	***	***	4
$M_{+ - -}^{S}$	***	***	***	***	***	***	***	NS	***	***	***	NS	***	—	NS	NS	***	NS	6
$M_{- + +}^{S}$	***	***	***	***	***	***	***	NS	***	***	*	NS	NS	NS	—	NS	***	NS	4
$M_{- + -}^{S}$	***	***	***	***	***	***	***	NS	***	***	*	NS	NS	NS	NS	—	***	NS	4
$M_{- - +}^{S}$	***	***	***	NS	NS	***	***	***	NS	***	***	***	***	***	***	***	—	***	12
$M_{- - -}^{S}$	***	***	***	***	**	***	***	NS	***	***	***	NS	***	NS	NS	NS	***	—	6