Super-Resolution Learning Strategy Based on Expert Knowledge Supervision

Abstract: Existing Super-Resolution (SR) methods are typically trained using bicubic degradation simulations, resulting in unsatisfactory results when applied to remote sensing images that contain a wide variety of object shapes and sizes. This insufficient learning approach reduces the focus of models on critical object regions within the images. As a result, their practical performance is significantly hindered, especially in real-world applications where accuracy in object reconstruction is crucial. In this work, we propose a general learning strategy for SR models based on expert knowledge supervision, named EKS-SR, which incorporates a small amount of coarse-grained semantic information derived from high-level visual tasks into the SR reconstruction process. It utilizes prior information from three perspectives (regional constraints, feature constraints, and attributive constraints) to guide the model to focus more on the object regions within the images. By integrating these expert knowledge-driven constraints, EKS-SR can enhance the model’s ability to accurately reconstruct object regions and capture the key information needed for practical applications. Importantly, this improvement does not increase the inference time and does not require full annotation of large-scale datasets, only a few labels, making EKS-SR both efficient and effective. Experimental results demonstrate that the proposed method achieves improvements in both reconstruction quality and machine vision analysis performance.


Introduction
With the rapid development of deep learning (DL), image super-resolution (SR) algorithms based on DL have achieved remarkable results. Remote sensing (RS) images are commonly used in many fields such as change detection [1][2][3], hyperspectral application [4][5][6][7], object detection [8,9], and environmental monitoring [10][11][12], making them highly valuable for practical applications. However, RS images are often degraded due to imaging limitations of the sensors and transmission noise. This leads to unsatisfactory results in practical applications. Using SR algorithms for image reconstruction is an effective approach to improve the accuracy of practical applications.
Single Image Super-Resolution (SISR) methods can be divided into two categories based on their learning approach: Peak Signal-to-Noise Ratio (PSNR)-Oriented SR (PSNR-SR) [13][14][15][16][17][18][19] and Generative Adversarial Network (GAN)-Based SR (GAN-SR) [20][21][22][23][24][25][26][27][28]. PSNR-SR methods use only an L1 or L2 loss function, whereas GAN-SR methods additionally incorporate adversarial training. However, SISR models trained solely on downsampled images obtained using bicubic interpolation are unable to cope with severe degradation or with RS images that contain diverse objects. If the SR models cannot accurately restore the degraded RS images, performance in practical applications will be poor. As shown in Figure 1, existing SR methods can achieve visually pleasing results, but still exhibit discrepancies with the ground truth at object boundaries and edges. Incorporating the semantic information reflected by labels from high-level visual tasks into low-level SR reconstruction is therefore desirable, but it remains a challenging task.

Figure 1. Expert annotation label (bottom). The Mask R-CNN model [29] with a ResNet101 [30] backbone is used here, and the input image size is 800 × 800. SRGAN fails to achieve satisfactory results in most areas where objects are located. Although SISR can achieve satisfactory visual results, practical applications still have shortcomings.

Using LR (Low-Resolution)/HR (High-Resolution) image pairs for SR model training is often insufficient to significantly improve the downstream application performance. As shown in Figure 2, some researchers have proposed methods that leverage the labels from high-level tasks to supervise the SR task. Figure 2a,b illustrate two learning strategies that utilize expert prior knowledge. Specifically, Figure 2a represents cascading SR and high-level visual tasks [24,31], which can fine-tune the parameters of the SR network through the loss function in the high-level visual task. However, this approach fails to establish a holistic connection between the two models, resulting in a lack of coherence. Figure 2b represents merging the SR and high-level visual networks into a multi-task network and optimizing it using a multi-objective loss [32][33][34][35][36][37][38]. However, even with optimization using a multi-objective loss, it is difficult to guide the SR model to focus on the regions required for the high-level visual task. Additionally, a multi-task network not only increases network complexity but also introduces the problem of task competition between multiple objectives, making network training difficult.
To address the above-mentioned issues, we propose a new learning strategy for SR based on expert knowledge supervision (EKS-SR), as shown in Figure 2c. The main contributions of this work can be summarized as follows:

1. An Expert Knowledge Guided SR Framework: EKS-SR incorporates expert annotations for high-level tasks to supervise the SR network, achieving significant improvements in fine-grained tasks with coarse-grained annotations.

2. Multi-Constraint Approach to Focus on Object Reconstruction: Unlike existing learning strategies that overlook the challenge of object area recovery, EKS-SR leverages prior information from three perspectives (regional constraints, feature constraints, and attributive constraints) to guide the SR model in achieving more accurate reconstructions of multi-scale objects in RS images, especially small objects.

3. Enhancing Practicality Under Limited Annotations without Increasing Inference Time: Even when expert annotations are limited, EKS-SR can improve practical task performance without increasing the model parameters or inference time, which provides a new solution for resource-limited RS devices.

4. Plug-and-Play: The design of EKS-SR does not rely on specific SR models or high-level task models, so it can be applied to any model and scales well. This scalability ensures that as new models and tasks emerge, EKS-SR can continue to be relevant and beneficial, offering ongoing improvements in performance and utility.

The remainder of this article is organized as follows. Section 2 summarizes the related works. Section 3 introduces the details of EKS-SR. The experimental results are given in Section 4. Finally, the discussion and conclusion are presented in Sections 5 and 6, respectively.

Single Image Super Resolution
SISR algorithms aim to restore an HR image from an LR image that has undergone a complex degradation process causing information loss. This means that image SR algorithms need to compensate for the limited information in the LR image.
Early SR methods are primarily based on traditional signal processing techniques such as interpolation and reconstruction. These methods are often computationally simple, but the reconstruction quality is unsatisfactory, and they fail to capture the detailed information in the images. To address this issue, researchers have proposed learning-based SR methods. These methods first establish a mapping relationship between LR and HR images, and then use this mapping to restore the HR image.
With the rapid development of DL technology, SR methods based on deep neural networks have received widespread attention. These methods leverage the powerful feature representation capabilities of DL to not only preserve image details but also improve the overall quality of the restoration.
Based on their different optimization approaches, the mainstream deep learning-based image SR algorithms can be categorized into PSNR-SR and GAN-SR methods. PSNR-SR methods typically use the L1 or L2 loss function, constraining the recovered SR image in the pixel domain against the HR ground truth image. The super-resolution convolutional neural network (SRCNN) [13] was the first deep learning-based image SR method; it used a three-layer convolutional neural network to learn the complex mapping relationship between LR and HR images. To effectively capture the important features in the input image, the Residual Channel Attention Network (RCAN) [14] proposed a deep SR network structure based on attention mechanisms. As multi-scale features have been shown to be effective in improving the performance of recent computer vision models [39,40], an improved multi-scale residual network is proposed in [41,42] for remote sensing SR tasks.
In recent years, transformers [43] have achieved success in the field of computer vision [44,45]. Liang et al. [16] proposed the SwinIR network based on the Swin Transformer [45] for image reconstruction. Benefiting from the shifted window mechanism, SwinIR can better model the image context and achieve better results with fewer parameters. Zhou et al. [46] proposed SRFormer, which introduces a novel mechanism, Permuted Self-Attention (PSA), that balances the channel and spatial information within the self-attention process, allowing the model to leverage the benefits of large window sizes for self-attention without incurring additional computational costs.
Directly constraining pixel values is challenging, and this optimization method tends to generate overly smooth SR images that do not match human perception, so the resulting SR images lack realism. The Super-Resolution Generative Adversarial Network (SRGAN) [20] introduced GANs into the image SR field. Specifically, SRGAN incorporates perceptual loss and adversarial training on top of the L1 and L2 loss functions. Instead of pursuing identical pixel values, GAN-based image SR methods aim to generate images that better match human visual perception. To estimate the probability that real images are more realistic than fake images, rather than simply determining whether an image is real or not, Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) [27] use a relativistic average discriminator [47] to replace the regular discriminator, achieving better visual results.

Image Super-Resolution with High-Level Tasks
Although existing image SR algorithms can achieve good visual quality, most of the work has not focused on the performance of the reconstructed SR images in actual downstream applications, such as object detection and instance segmentation. For some images with practical application value, such as remote sensing images and medical images, improving the performance of SR images in real-world downstream applications is just as important as improving their visual quality.
Pereira and Santos [34] proposed an end-to-end framework, which cascades an SR network and a semantic segmentation network and trains the two networks at the same time. The SR-guided deep network (SRGDN) [37] utilizes SR to guide land cover classification by sharing low-level features. The multi-task generative adversarial network (MTGAN) [32] integrates target detection into the discriminator of the SR network and improves the network performance through multi-task optimization. Bashir and Wang [48] proposed an image super-resolution cyclic GAN with residual feature aggregation and YOLO as the detection network (SRCGAN-RFA-YOLO). Yang et al. [49] proposed mutual-feed learning for SR and object detection and designed a closed-loop structure by building a feedback connection between the two tasks. Tang et al. [50] proposed a Super Resolution Domain Adaptation Network (SRDA-Net) to adapt to the change from LR images to HR images, which sets the LR image domain and HR image domain as the source domain and target domain, respectively.
However, whether they serialize model optimization or perform multi-objective optimization, these methods struggle to guide SR models to focus on the regions where objects are located in the image. Therefore, we propose a learning strategy, EKS-SR, which utilizes expert knowledge from high-level computer vision tasks to constrain the SR model.

Method
The two existing mainstream families of SR methods, PSNR-SR and GAN-SR, typically use the loss functions shown in Equations (1) and (2), respectively, where $I_{HR}$ and $I_{SR}$ refer to the HR image and the reconstructed SR image, $N$ denotes the number of pixels in the image, and $\alpha$ and $\beta$ represent the weights of the adversarial loss $L_{adv}$ and the perceptual loss $L_{per}$, respectively. For GAN-SR methods, $L_{adv}$ and $L_{per}$ are incorporated to enhance the authenticity of the reconstructed image; in their definitions, $\phi_l(\cdot)$ and $\omega_l$ represent the $l$-th layer of the VGG19 [51] network and its corresponding weight, and $D(\cdot)$ refers to the discriminator network.
From the calculation process of the loss function, it can be observed that it treats each pixel in the image equally and ignores the actual significance of each pixel. Therefore, some "hard samples" such as object edges may be averaged down during the loss calculation by "easy samples" such as background regions. As the easy samples dominate the loss, the model training process pays little attention to hard samples, resulting in poor practical performance in difficult areas, such as densely packed cars.
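To make the dominance of easy samples concrete, the following minimal sketch computes a plain, uniformly weighted L1 loss on a toy image and reports how much of the total comes from object pixels. The 8 × 8 image, the 2 × 2 object, and the error magnitudes are illustrative assumptions, not values from the experiments.

```python
def l1_loss(hr, sr):
    """Mean absolute error over all pixels (every pixel weighted equally)."""
    n = len(hr) * len(hr[0])
    return sum(abs(h - s)
               for row_h, row_s in zip(hr, sr)
               for h, s in zip(row_h, row_s)) / n

# Toy setup; all values below are illustrative assumptions.
size = 8
object_pixels = {(r, c) for r in range(3, 5) for c in range(3, 5)}  # 2x2 object

hr = [[0.5] * size for _ in range(size)]
# Background is reconstructed with a small error, the object with a large one.
sr = [[0.5 + (0.20 if (r, c) in object_pixels else 0.02)
       for c in range(size)] for r in range(size)]

total = l1_loss(hr, sr)
object_error = sum(abs(hr[r][c] - sr[r][c]) for r, c in object_pixels)
object_share = object_error / (total * size * size)
print(f"L1 = {total:.4f}, object share of loss = {object_share:.2%}")
```

In this toy case the per-pixel error on the object is ten times the background error, yet the object contributes less than half of the loss because it covers only 4 of the 64 pixels.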

Regional Constraint
In the L1 loss function, the object region $I_o$ and the background region $I_b$ are treated equally, even though $I_b$ contains far more pixels than $I_o$. This makes it difficult for the SR model to learn the features of hard samples in $I_o$, resulting in unsatisfactory reconstruction results, as shown in Figure 1. To address this issue, we propose a regional constraint $L_{RC}$, as shown in Figure 3a, which calculates the loss with adaptive weights for the object region $I_o$ and the background region $I_b$ based on expert-annotated prior knowledge from the high-level task.
Specifically, we divide an image $I$ into two disjoint sets, $I_o$ and $I_b$, i.e., $I = I_o \cup I_b$ and $I_o \cap I_b = \emptyset$. The object region $I_o$ is formed by taking the union of the bounding boxes annotated by the experts, i.e., $I_o = \bigcup_{i=0}^{K} O_i$. $L_{RC}$ aims to enhance the SR model's reconstruction performance on hard samples by selectively focusing on the foreground objects rather than the entire image. In the definition of $L_{RC}$, $w_o$ and $w_b$ denote the learnable weights of the object region and the background region, respectively.
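The region split and the weighted loss can be sketched as follows. This is a minimal illustration under stated assumptions: the loss is assumed to take the form $w_o \cdot \mathrm{mean}_{I_o}|I_{HR} - I_{SR}| + w_b \cdot \mathrm{mean}_{I_b}|I_{HR} - I_{SR}|$, the weights are passed in as fixed numbers rather than learned, and the helper names `object_mask` and `regional_l1` are hypothetical.

```python
def object_mask(height, width, boxes):
    """Union of bounding boxes as a boolean mask.

    Boxes are (x1, y1, x2, y2) with exclusive upper bounds (an assumption).
    """
    mask = [[False] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = True
    return mask

def regional_l1(hr, sr, mask, w_o, w_b):
    """Assumed form: w_o * mean|diff| over objects + w_b * mean|diff| over background."""
    obj_sum = bg_sum = 0.0
    obj_n = bg_n = 0
    for row_h, row_s, row_m in zip(hr, sr, mask):
        for h, s, is_object in zip(row_h, row_s, row_m):
            if is_object:
                obj_sum += abs(h - s)
                obj_n += 1
            else:
                bg_sum += abs(h - s)
                bg_n += 1
    obj_term = obj_sum / obj_n if obj_n else 0.0
    bg_term = bg_sum / bg_n if bg_n else 0.0
    return w_o * obj_term + w_b * bg_term

# Toy example: one 2x2 box in a 4x4 image.
mask = object_mask(4, 4, [(0, 0, 2, 2)])
hr = [[0.0] * 4 for _ in range(4)]
sr = [[1.0 if mask[y][x] else 0.5 for x in range(4)] for y in range(4)]
loss = regional_l1(hr, sr, mask, w_o=2.0, w_b=1.0)
print(loss)
```

Normalizing each region by its own pixel count keeps the small object region from being swamped by the background, which is the effect the regional constraint is after; whether the original formulation normalizes this way is an assumption of the sketch.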

Feature Constraint
The perceptual loss $L_{per}$ is typically computed by extracting the feature representation of an image with a pre-trained VGG network. The SR image is then constrained by minimizing the difference between the SR image and the HR image in the feature domain. However, perceptual loss only focuses on overall feature matching and neglects the local details. When an image contains many objects, $L_{per}$ fails to pay sufficient attention to key features where the objects are located, resulting in SR images that lack details or appear blurry.
To enhance the model's ability to focus on key features during feature domain constraints, we measure the difficulty level of a feature $F_l$ extracted from the $l$-th layer of VGG by counting the number of object pixels $W_l$ within its receptive field $M(F_l)$, where $M(\cdot)$ represents the mapping from the feature $F_l$ to its receptive field on the original image $I$. Based on the difficulty level $W_l$ of each feature's corresponding receptive field, we propose the feature constraint $L_{FC}$ shown in Figure 3b, which incorporates expert knowledge. In its expression, $\odot$ denotes the element-wise product. $L_{FC}$ weights each feature with $W_l$, encouraging the model to pay more attention to hard samples that contain more objects, thereby achieving better reconstruction results.
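A rough sketch of the weighting idea is given below. It makes two simplifying assumptions that are not taken from the paper: the receptive field of each feature location is approximated by a non-overlapping `stride` × `stride` block of the input, and the constraint is assumed to be the mean of $W_l \odot |F_l(I_{HR}) - F_l(I_{SR})|$ over a single-channel feature map. The function names are hypothetical.

```python
def object_pixel_counts(mask, stride):
    """W_l: number of object pixels in each stride x stride block of the mask."""
    h, w = len(mask), len(mask[0])
    return [[sum(mask[y][x]
                 for y in range(by, min(by + stride, h))
                 for x in range(bx, min(bx + stride, w)))
             for bx in range(0, w, stride)]
            for by in range(0, h, stride)]

def weighted_feature_l1(feat_hr, feat_sr, weights):
    """Assumed form: mean over locations of W_l * |F_l(HR) - F_l(SR)|."""
    total, n = 0.0, 0
    for row_h, row_s, row_w in zip(feat_hr, feat_sr, weights):
        for fh, fs, w in zip(row_h, row_s, row_w):
            total += w * abs(fh - fs)
            n += 1
    return total / n

# Toy example: a 4x4 object mask with a 2x2 object in the top-left corner,
# and 2x2 single-channel feature maps (stride 2).
mask = [[x < 2 and y < 2 for x in range(4)] for y in range(4)]
weights = object_pixel_counts(mask, stride=2)
feat_hr = [[1.0, 1.0], [1.0, 1.0]]
feat_sr = [[0.0, 0.0], [0.0, 0.0]]
loss = weighted_feature_l1(feat_hr, feat_sr, weights)
print(weights, loss)
```

With raw counts as weights, feature locations whose receptive field contains no object pixel drop out of the loss entirely; a practical variant might add a constant offset or normalize the counts, which is a design choice the sketch leaves open.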

Attributive Constraint
Due to the difficulty of fully annotating large-scale datasets, more explicit constraint methods are needed to guide the model learning, so that it can focus on the object regions even in the case of limited annotations. Previous work [52] has demonstrated that using bounding boxes for attribution map (AM) supervision can guide high-level visual tasks. Specifically, the model's results are more accurate when the model's output relies more on the pixels inside the bounding box. However, the AM calculation and AM supervision process are not suitable for low-level tasks. Therefore, we have designed an attributive constraint based on AM supervision for the SR task, as shown in Figure 3c, which can guide the SR model to pay more attention to the object region during reconstruction.
AM analysis [53][54][55][56] is an interpretability analysis method in DL used to explain the results of deep neural network models. For an input image $I \in \mathbb{R}^{h \times w}$ and a classification model $M: \mathbb{R}^{h \times w} \to \mathbb{R}$, the model $M$ calculates the probability that the input image $I$ belongs to each category. Attribution analysis outputs an AM that scores the importance of each pixel in the input image to the output of the model. Sundararajan et al. [55] proposed the Integrated Gradients (IG) method for attribution analysis and stated that an attribution analysis method should satisfy two basic axioms:

1. Sensitivity: For any input image $I$ and baseline image $I'$, when any part of the image changes and causes a change in the model's prediction result, the AM should also be able to express this change.

2. Implementation Invariance: For two networks, even though their implementation methods are different, if their outputs are equal for all inputs, then the AM obtained by performing attribution analysis on these two networks should be the same.
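As a small illustration of IG itself, the sketch below approximates the path integral with a midpoint Riemann sum on a toy differentiable function with a hand-written gradient. It uses the straight-line path from the baseline to the input, as in the original IG formulation; the toy function, its gradient, and the step count are assumptions made purely for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG along a straight-line path."""
    avg_grads = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]

def model(x):
    """Toy scalar model (illustrative only)."""
    return x[0] * x[1] + x[2] ** 2

def model_grad(x):
    """Analytic gradient of the toy model."""
    return [x[1], x[0], 2.0 * x[2]]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)
print(attributions, sum(attributions), model(x) - model(baseline))
```

The attributions sum to the difference between the model output at the input and at the baseline, which is the completeness property that IG is known to satisfy.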
Unlike high-level visual tasks that analyze attribution maps for a whole image, low-level visual tasks exhibit strong local correlations. Therefore, it is common to select a specific region for AM analysis. Local Attribution Maps (LAM) [57] is an AM analysis method for image SR tasks, which can identify which input pixels are important for the model's output and their relative contribution levels. For an LR image $I_{LR} \in \mathbb{R}^{m/s \times n/s}$, an SR image $I_{SR} \in \mathbb{R}^{m \times n}$, and an SR model $G: \mathbb{R}^{m/s \times n/s} \to \mathbb{R}^{m \times n}$ with upsampling scale factor $s$, consider a patch $p \in \mathbb{R}^{q \times t}$ in the LR image whose range is $[(p_{x1}, p_{y1}), (p_{x2}, p_{y2})]$. LAM designs the detector $D_p: \mathbb{R}^{q \times t} \to \mathbb{R}$ to quantify the edge features within the patch, where $\nabla_{ij} I$ represents the gradient of image $I$ at the pixel location $(i, j)$.
In high-level attribution analysis methods, the baseline image $I'$ is generally an all-black image. However, for the SR task, the low-frequency component of the LR image is not important to the SR model performance, and the high-frequency component is more important. Therefore, to decrease the high-frequency component in the input LR image $I$, LAM uses Gaussian blur to generate the baseline image $I'$, which can be expressed as $I' = \omega(\sigma) \otimes I$, where $\omega(\sigma)$ is a Gaussian blur kernel of size $\sigma \times \sigma$ and $\otimes$ denotes the convolution operation.
Let $\gamma_{pb} = (\gamma_{pb}^{1}, \ldots, \gamma_{pb}^{n}): [0, 1] \to \mathbb{R}^{n}$ be a progressive blurring path function from the baseline image to the input image, with $\gamma_{pb}(0) = I'$ and $\gamma_{pb}(1) = I$. The $i$-th dimension of the LAM, $LAM_{F,D}(\gamma_{pb})_i$, is calculated by the IG method; Equations (14) and (15) are approximations of Equation (13) that make the calculation faster, and $m$ is the number of steps used to approximate the integral. The entire LAM is denoted $LAM_{F,D}(\gamma_{pb})$.
To determine which pixels the SR model uses to reconstruct each object's region $O_i$, and their relative contribution levels, we perform AM analysis on each $O_i$ separately, resulting in the corresponding AM $A_i = LAM_{G,D,O_i}(\gamma_{pb})$, where $LAM_{G,D,O_i}(\gamma_{pb})$ denotes the calculation of the AM on the patch $O_i$ of image $I$ with the generator network $G$ as the model. Furthermore, we define the AM in the object region as $A_{i+} = A_i \cap O_i$, which represents the collection of pixels inside the object region that the SR model uses to reconstruct that region. Let $v(A) = \sum_{m=1}^{h} \sum_{n=1}^{w} A_{m,n}$, with $v: \mathbb{R}^{h \times w} \to \mathbb{R}$, be a function that sums the relative contribution levels in an AM $A$. To achieve better SR results, the model should be constrained to rely as much as possible on pixels located within the bounding box. This is equivalent to maximizing the ratio $v(A_{i+})/v(A_i)$. The attributive constraint $L_{AC}$ is defined accordingly, where $L_{AC}^{i}$ represents the attributive constraint for the $i$-th object's region.
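The ratio $v(A_{i+})/v(A_i)$ and a loss built from it can be sketched as below. The attribution maps are taken as given non-negative arrays (computing LAM itself is out of scope here), and the specific loss form, one minus the ratio averaged over objects, is an assumption chosen so that minimizing the loss maximizes the ratio. The function names and the `eps` stabilizer are hypothetical.

```python
def contribution_sum(attr_map):
    """v(A): total relative contribution over all pixels of an attribution map."""
    return sum(sum(row) for row in attr_map)

def inside_box(attr_map, box):
    """A_i+: the attribution map with everything outside the box zeroed.

    The box is (x1, y1, x2, y2) with exclusive upper bounds (an assumption).
    """
    x1, y1, x2, y2 = box
    return [[a if (y1 <= y < y2 and x1 <= x < x2) else 0.0
             for x, a in enumerate(row)]
            for y, row in enumerate(attr_map)]

def attributive_loss(attr_maps, boxes, eps=1e-8):
    """Assumed form: mean over objects of 1 - v(A_i+) / v(A_i)."""
    per_object = []
    for attr_map, box in zip(attr_maps, boxes):
        ratio = (contribution_sum(inside_box(attr_map, box))
                 / (contribution_sum(attr_map) + eps))
        per_object.append(1.0 - ratio)
    return sum(per_object) / len(per_object)

# Toy example: uniform 3x3 attribution map, box covering the top row only.
attr_map = [[1.0] * 3 for _ in range(3)]
loss = attributive_loss([attr_map], [(0, 0, 3, 1)])
print(loss)
```

Here one third of the contribution falls inside the box, so the loss is about two thirds; it approaches zero as the model relies more exclusively on pixels inside the bounding box.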

Proposed Learning Strategy
Based on the analysis in the previous sections, we have derived three different SR loss functions, $L_{RC}$, $L_{FC}$, and $L_{AC}$, that incorporate expert knowledge from different perspectives. To obtain a more versatile learning strategy, we have designed two distinct loss functions, $L_{PSNR}$ and $L_{GAN}$, targeting PSNR-SR and GAN-SR methods, respectively. The loss functions are defined as

$$L_{GAN} = \xi L_{RC} + \mu L_{FC} + \delta L_{adv} + 10^{-4}\,\xi L_{AC} \quad (22)$$

where $\xi$, $\mu$, and $\delta$ denote the weights of each component, chosen to ensure that no individual loss dominates and impedes convergence. In this work, we set them to $1 \times 10^{-2}$, $1$, and $5 \times 10^{-3}$, respectively, following previous literature [20]. $L_{GAN}$ incorporates $L_{FC}$ and $L_{adv}$ on top of $L_{PSNR}$ to enhance the visual quality and detail fidelity.
Since the model cannot obtain accurate attribution maps in the initial stages of training, using $L_{AC}$ in the early training phase can cause network oscillation and make convergence difficult. Therefore, we introduce $L_{AC}$ after 10,000 iterations to guide the model to utilize more effective pixels for SR reconstruction. Meanwhile, we noticed that the computation of LAM is time-consuming. To avoid significantly increasing the training time, we perform the $L_{AC}$ calculation only every 100 iterations. The overview of the proposed learning strategy is presented in Algorithm 1.

Algorithm 1 EKS-SR Learning Strategy
l_RC ← L_RC(I_HR, I_SR)
if i > N_AC and mod(i, f) = 0 then
    l_AC ← L_AC(I_HR, I_SR, A_i)
end if
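The gating condition for the attributive constraint can be sketched as a small helper. The defaults follow the values stated in the text (introduction after 10,000 iterations, evaluation every 100 iterations, 300,000 training iterations); the function name and the assumption that iterations are counted from 1 are hypothetical.

```python
def use_attributive_constraint(iteration, n_ac=10_000, f=100):
    """True when L_AC should be computed: i > N_AC and mod(i, f) == 0."""
    return iteration > n_ac and iteration % f == 0

# Count how often L_AC is evaluated over a full training run,
# assuming iterations are numbered 1..300000.
total_iterations = 300_000
ac_steps = sum(use_attributive_constraint(i)
               for i in range(1, total_iterations + 1))
print(ac_steps, ac_steps / total_iterations)
```

Under these assumptions the LAM-based term is evaluated on fewer than 1% of the iterations, which is consistent with the stated goal of keeping the added training time small.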

Evaluation Metrics for SR
We select PSNR, Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [60] as evaluation metrics for the SR task.

(1) PSNR: PSNR is the most widely used objective quality assessment metric in SR tasks. Given the HR image $I_{HR}$ and the SR image $I_{SR}$, PSNR is defined as

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_I^{2}}{\mathrm{MSE}(I_{HR}, I_{SR})}\right)$$

where $MAX_I$ represents the maximum pixel value (255 for 8-bit images) and $\mathrm{MSE}(I_{HR}, I_{SR})$ represents the mean squared error (MSE) between $I_{HR}$ and $I_{SR}$, which can be calculated as

$$\mathrm{MSE}(I_{HR}, I_{SR}) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(I_{HR}(i, j) - I_{SR}(i, j)\right)^{2}$$

where $m$ and $n$ represent the height and width of the SR image $I_{SR}$. A larger PSNR value indicates greater similarity between the two images.

(2) SSIM: SSIM is an index that quantifies the structural similarity between two images. Unlike PSNR, SSIM is designed to mimic the human visual system's perception of structural similarity. SSIM quantifies the image's attributes of brightness, contrast, and structure, using the mean to estimate brightness, variance to estimate contrast, and covariance to estimate structural similarity. SSIM is defined as

$$\mathrm{SSIM} = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)}$$

where $\mu_x$ and $\mu_y$ denote the mean values of $I_{HR}$ and $I_{SR}$, respectively, $\sigma_x^{2}$ and $\sigma_y^{2}$ denote the variances of $I_{HR}$ and $I_{SR}$, respectively, and $\sigma_{xy}$ denotes the covariance of $I_{HR}$ and $I_{SR}$. $C_1$ and $C_2$ are two constants used to maintain the stability of the denominator. The SSIM value ranges from 0 to 1, with a higher value indicating greater similarity between the two images.

(3) LPIPS: To better simulate human visual perception, Zhang et al. [60] proposed LPIPS, which measures the difference between two images in the feature domain using a pretrained VGG [51] feature extraction network $\varphi$. Compared to PSNR and SSIM, LPIPS evaluates the similarity between two images in a way that is more consistent with human visual habits. LPIPS is defined as

$$\mathrm{LPIPS} = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert \eta_l \odot \left(\varphi^{l}_{h,w}(I_{HR}) - \varphi^{l}_{h,w}(I_{SR})\right) \right\rVert_2^{2}$$

where $\varphi^{l}_{h,w}$ and $\eta_l$ denote the $l$-th layer features of $\varphi$ at location $(h, w)$ and the corresponding channel weights, and $H_l$ and $W_l$ are the height and width of the $l$-th layer feature map. A smaller LPIPS value indicates greater similarity between the two images.
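The PSNR and SSIM definitions can be sketched in plain Python as follows. The SSIM here is computed from whole-image statistics rather than over local sliding windows, and it uses the commonly adopted constants $C_1 = (0.01 \cdot MAX_I)^2$ and $C_2 = (0.03 \cdot MAX_I)^2$; both choices are assumptions of this sketch rather than details given in the text.

```python
import math

def mse(hr, sr):
    """Mean squared error between two equally sized 2-D images."""
    n = len(hr) * len(hr[0])
    return sum((h - s) ** 2
               for row_h, row_s in zip(hr, sr)
               for h, s in zip(row_h, row_s)) / n

def psnr(hr, sr, max_value=255.0):
    """PSNR in dB; returns infinity for identical images."""
    err = mse(hr, sr)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)

def global_ssim(hr, sr, max_value=255.0):
    """SSIM from whole-image mean, variance, and covariance (no windowing)."""
    x = [v for row in hr for v in row]
    y = [v for row in sr for v in row]
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Toy 8-bit images that differ by exactly one gray level everywhere.
hr = [[100.0, 110.0], [120.0, 130.0]]
sr = [[101.0, 111.0], [121.0, 131.0]]
print(round(psnr(hr, sr), 2), global_ssim(hr, hr))
```

With an MSE of 1 on 8-bit images, the PSNR reduces to $20 \log_{10} 255 \approx 48.13$ dB, and the SSIM of an image with itself is 1.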

Evaluation Metrics for Object Detection and Instance Segmentation
We use COCO evaluation metrics for object detection and instance segmentation tasks. These include AP b and AP m for bounding boxes and masks, which measure the average precision (AP) values for ten intersection over union (IoU) thresholds ranging from 0.
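As a reminder of how IoU thresholds enter these metrics, the sketch below computes the IoU of two boxes and checks it against the ten thresholds of the standard COCO protocol (0.50 to 0.95 in steps of 0.05). It only illustrates the matching criterion, not the full AP computation, and the box coordinates are made up for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Standard COCO thresholds: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

ground_truth = (0, 0, 10, 10)
prediction = (0, 0, 9, 8)  # illustrative prediction
iou = box_iou(prediction, ground_truth)
matched_at = [t for t in COCO_IOU_THRESHOLDS if iou >= t]
print(iou, matched_at)
```

A prediction that overlaps the ground truth with an IoU of 0.72 counts as a true positive at the five loosest thresholds and as a miss at the stricter ones, which is why small localization errors from poor reconstruction lower the averaged AP.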

Implementation Details
To validate the effectiveness of our proposed EKS-SR, we select three classical SR works, SRGAN [20], SRFormer [46], and SwinIR [16], which are widely used in the SR field and together cover the GAN-SR and PSNR-SR categories. The SR models are trained on patches of size 256 × 256. For SRGAN and SwinIR, the batch sizes are set to 32 and 16, respectively. The initial learning rate is 2 × 10^−4 and is halved at 50,000, 100,000, and 200,000 iterations. We use the Adam optimizer [61] with β1 = 0.9, β2 = 0.99, and ϵ = 10^−8 to optimize the SR models. All SR models are trained for 300,000 iterations.
Furthermore, we use the Faster region-based Convolutional Neural Network (Faster R-CNN) [62] model as the object detection network to evaluate the SR images obtained from different SR models. For the instance segmentation task evaluation, we select the Mask region-based Convolutional Neural Network (Mask R-CNN) [29] model with a ResNet101 [30] backbone. The Faster R-CNN and Mask R-CNN are trained on the HR images in the original dataset. For Faster R-CNN, we use the Adam optimizer [61] with β1 = 0.9, β2 = 0.99, and ϵ = 10^−8 for training. For Mask R-CNN, we use stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0001 to train the entire network. The experiments are run on an NVIDIA RTX 3090 graphics processing unit (GPU) under Ubuntu 18.04.
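The step schedule for the SR learning rate can be written as a small function. The sketch assumes the halving takes effect at the milestone iteration itself (that is, from iteration 50,000 onward), which the text does not specify; the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=2e-4,
                  milestones=(50_000, 100_000, 200_000), gamma=0.5):
    """Step schedule: multiply the rate by gamma at each milestone reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed

for it in (0, 50_000, 150_000, 299_999):
    print(it, learning_rate(it))
```

Under this assumption the rate ends at one eighth of its initial value, 2.5 × 10^−5, for the final 100,000 iterations.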

Results Achieved Using the Learning Strategy EKS-SR on Different SR Models
To demonstrate the effectiveness of our proposed learning strategy, EKS-SR, we conducted training on three representative works from the two major categories of SR methods: SRGAN [20], SRFormer [46], and SwinIR [16]. Specifically, we trained each SR model using three different approaches: the original learning strategy, locally discriminative learning (LDL) [63], and our proposed EKS-SR. We then input the LR images into each trained SR model to reconstruct the SR images. Finally, we input the SR images into the same pre-trained Faster R-CNN or Mask R-CNN network to obtain object detection and instance segmentation results.

Quantitative Results on COWC
As shown in Table 1, SRGAN and SwinIR trained with EKS-SR achieve performance improvements on three SR evaluation metrics and three object detection metrics. Specifically, EKS-SR demonstrates a significant improvement over SRGAN in the object detection task, with gains of 9.8, 10.7, and 13.9 across the three evaluation metrics, respectively. Moreover, the SR model also achieves better visual performance, with improvements of 0.557, 0.0760, and 0.0216 in PSNR, SSIM, and LPIPS, respectively. SwinIR, which has a larger number of parameters and higher computational complexity, already achieves good visual quality and object detection results using the original SR learning strategy. Even so, EKS-SR, guided by expert knowledge, can more deeply explore the key information in the regions where objects are located within the image, making the SR model learn more effectively and enhancing its image reconstruction capabilities. Consequently, all six metrics obtained by the SwinIR model improve.
We compare the three models based on different learning strategies, including the original strategy, LDL, and our proposed EKS-SR. For SRGAN, a representative work of GAN-SR, the model obtained using the EKS-SR learning strategy gives the best performance in the object detection task. Although LDL shows a larger increase in the PSNR and SSIM metrics than EKS-SR, its gain in practical applications is limited. For the SRFormer model, the EKS-SR learning strategy gives the best results in both SR metrics and object detection metrics, while the LDL method instead results in a decrease in accuracy in the object detection task. For the SwinIR model, we compare the results of the model pre-trained on the DF2K [64,65] dataset and our own implementation trained on the COWC dataset. It can be observed that although LDL can improve the LPIPS metric, the accuracy drop on the object detection task is severe. This indicates that EKS-SR has better utility than LDL.
The experimental results reveal that LDL can improve the visual quality of the image by removing artifacts, which yields some progress in SR metrics but degrades performance in real-world RS applications. In contrast, with the regional constraint, feature constraint, and attributive constraint, EKS-SR makes the SR model pay more attention to complex object regions in RS images during the training process, which enhances the visual quality of SR images and improves the performance of practical tasks.

2 These results are obtained using the weights trained on the COWC dataset.

Quantitative Results on iSAID
Table 2 presents the results of SRGAN and SwinIR using different learning strategies for upscale factor ×4. For the PSNR-SR method SwinIR, the original learning strategy has already yielded favorable results with a PSNR of 38.533 dB. However, models trained using EKS-SR can still achieve a gain of 0.017 dB. Because PSNR is a logarithmic-scale metric, even small numerical changes reflect a measurable change in reconstruction error. Additionally, in terms of human perception, EKS-SR improves the LPIPS metric by 4% compared to SwinIR. For the GAN-SR method SRGAN, the original learning strategy exhibited notable shortcomings. Thanks to the utilization of expert knowledge supervision in EKS-SR, SRGAN achieves an improvement of 1.053 dB in PSNR, along with improvements of 0.0164 in SSIM and 0.0178 in LPIPS.
Meanwhile, it can be observed that the models trained using the EKS-SR learning strategy show improvements in all six metrics, whether in the PSNR-SR model or the GAN-SR model. Additionally, for the smaller-scale SRGAN network, employing the EKS-SR learning strategy significantly narrows the performance gap between it and the larger SwinIR model in practical tasks. With advanced learning strategies, satisfactory performance can be achieved even with smaller models, which contributes significantly to the practical application of DL models on resource-limited RS devices.
To intuitively compare the performance of EKS-SR in practical applications, we have visualized the instance segmentation results on the iSAID dataset, as shown in Figure 4, and the object detection results on the COWC dataset, as shown in Figures 5 and 6. It can be seen that in the SR images reconstructed by the SR model using the original learning strategy, the instance segmentation model fails to detect objects in many areas, especially small objects. This is because small objects tend to lose their original structure and become more difficult to reconstruct after degradation. In contrast, the SR model trained with EKS-SR demonstrates a significant improvement on this problem, which indicates that EKS-SR has strong practical application value and can effectively mitigate the issues associated with severe degradation of RS images. Consequently, EKS-SR contributes to improved performance in various machine vision applications.
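Since PSNR is defined as $10 \log_{10}(MAX_I^2 / \mathrm{MSE})$, a PSNR gain of $\Delta$ dB corresponds to multiplying the MSE by $10^{-\Delta/10}$. The short sketch below applies this conversion to the two gains reported above; the function name is hypothetical.

```python
def mse_ratio_from_psnr_gain(gain_db):
    """Ratio MSE_new / MSE_old implied by a PSNR gain in dB."""
    return 10 ** (-gain_db / 10)

for name, gain in (("SwinIR", 0.017), ("SRGAN", 1.053)):
    ratio = mse_ratio_from_psnr_gain(gain)
    print(f"{name}: +{gain} dB -> MSE reduced by {1 - ratio:.2%}")
```

In MSE terms, the 0.017 dB gain corresponds to a reduction of roughly 0.4%, while the 1.053 dB gain corresponds to a reduction of roughly 21.5%.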

Performance under Different Upscale Factors
To validate the applicability of the proposed EKS-SR learning strategy in SR tasks with different upscale factors, we conducted comparative experiments using the SwinIR model on the COWC dataset with ×4 and ×8 upscale factors. Specifically, we trained the SR network with both the original and EKS-SR learning strategies at ×4 and ×8 upscale factors, respectively.
As shown in Table 3, the results obtained by the SR model for images downsampled by a factor of eight are significantly inferior to those for images downsampled by a factor of four. This is due to the greater information loss caused by ×8 downsampling, which substantially increases the reconstruction difficulty. Additionally, the accuracy of SR images in the object detection task drops significantly, with AP b decreasing from 80.5 to 50.6. However, training with the EKS-SR learning strategy yields greater gains in both visual quality and practical tasks. This indicates that EKS-SR can still assist the SR model in identifying more complex object regions within highly degraded images, thereby achieving better SR reconstruction performance. Figures 6 and 7 show the object detection results on the ×4 and ×8 COWC datasets.

Performance under Limited Annotation
Annotating large datasets comprehensively is an extremely challenging task. To validate the performance of EKS-SR with limited annotations, we constrained the model using only 25% of the annotations.
As shown in Tables 4 and 5, even a small number of bounding box annotations can effectively guide the model to focus on the object regions in the image. Figure 8 intuitively compares LPIPS and $AP^b$ across various annotation utilization rates. Although it uses only 25% of the annotations, the SRGAN model trained with EKS-SR still achieves PSNR gains of 0.661 dB on the iSAID dataset and 0.293 dB on the COWC dataset. Meanwhile, the LPIPS metric improves by 0.0064 and 0.0157, respectively. Furthermore, for the instance segmentation task, the five metrics are improved by EKS-SR under limited annotation. For the object detection task, the three evaluation metrics increase by 8.6, 8.4, and 12.9, respectively. It should be noted that the gains obtained in machine vision applications with only 25% of the annotations are already close to the gains obtained with 100% of the annotations, which are 9.8, 10.7, and 13.9, respectively. The proposed learning strategy EKS-SR not only enhances object detection performance but also improves the effectiveness of fine-grained semantic segmentation tasks, further confirming its practicality.
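How close the 25% setting comes to full annotation can be checked directly from the reported object detection gains. The short calculation below only uses the six numbers quoted above, in the order they are reported; the variable names are chosen for this illustration.

```python
# Gains in the three object detection metrics, in the order reported above.
gains_25 = [8.6, 8.4, 12.9]    # 25% of the annotations
gains_100 = [9.8, 10.7, 13.9]  # 100% of the annotations

# Share of the full-annotation gain that is retained with 25% of the labels.
retained = [g25 / g100 for g25, g100 in zip(gains_25, gains_100)]

for g25, g100, r in zip(gains_25, gains_100, retained):
    print(f"{g25:>5.1f} of {g100:>5.1f} points -> {r:.1%} of the full gain")
```

With a quarter of the labels, roughly 79% to 93% of the full-annotation gain is retained across the three metrics.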

Ablation Studies
To validate the effectiveness of all three constraints in EKS-SR, we designed various learning strategies for conducting ablation experiments. Specifically, we replaced $L_1$ with $L_{RC}$ and $L_{per}$ with $L_{FC}$ in the original SRGAN, and the results are shown in Table 6. When $L_{RC}$ is used alone, PSNR improves by 0.523 dB. When $L_{RC}$ and $L_{FC}$ are used together, not only is a gain of 0.702 dB in PSNR achieved, but also gains of 0.5 and 0.4 in $AP^b$ and $AP^m$, respectively. This demonstrates the effectiveness of $L_{RC}$ and $L_{FC}$. Additionally, experiments on the SwinIR model verified the effectiveness of $L_{RC}$ and $L_{AC}$. The ablation experiments suggest that the proposed constraints play a vital role in achieving SR reconstruction and high-level task performance.
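To give an intuition for what replacing the plain $L_1$ loss with a region-aware loss can look like, the sketch below weights the per-pixel absolute error more heavily inside annotated bounding boxes. This is a hypothetical illustration and not the definition of $L_{RC}$ used in this work: the mask construction, the `object_weight` parameter, and the function names are all assumptions.

```python
import numpy as np


def boxes_to_mask(shape, boxes):
    """Binary mask that is 1 inside any (x0, y0, x1, y1) box, 0 elsewhere."""
    mask = np.zeros(shape, dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0
    return mask


def region_weighted_l1(sr, hr, mask, object_weight=2.0):
    """Weighted mean absolute error (hypothetical stand-in for a regional loss).

    Pixels inside the mask count `object_weight` times as much as background
    pixels; with object_weight = 1 this reduces to the plain L1 loss.
    """
    weights = 1.0 + (object_weight - 1.0) * mask
    return float((weights * np.abs(sr - hr)).sum() / weights.sum())


hr = np.zeros((4, 4))
mask = boxes_to_mask(hr.shape, [(0, 0, 2, 2)])  # one 2 x 2 object box

sr_object_error = hr.copy()
sr_object_error[0, 0] = 1.0      # same-size error, inside the object
sr_background_error = hr.copy()
sr_background_error[3, 3] = 1.0  # same-size error, in the background

print(region_weighted_l1(sr_object_error, hr, mask))      # larger penalty
print(region_weighted_l1(sr_background_error, hr, mask))  # smaller penalty
```

Under this weighting, an error of the same magnitude costs more when it falls inside an object region, which is the kind of behavior a regional constraint is meant to encourage.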

Discussion
This work proposes a new SR learning strategy based on expert knowledge supervision, aiming to address the shortcomings of existing SR methods in practical applications. EKS-SR integrates expert knowledge from high-level vision tasks into the SR reconstruction process, significantly improving the model's ability to reconstruct object regions in low-resolution images, especially for small objects. Experimental results show that EKS-SR not only improves visual quality but, more importantly, achieves significant performance improvements in downstream tasks such as object detection and instance segmentation. This confirms the effectiveness of our approach in bridging the gap between low-level SR tasks and high-level visual tasks.
Importantly, EKS-SR improves performance without increasing the number of model parameters or the inference time. This feature is particularly important for resource-constrained RS devices. In addition, EKS-SR significantly improves real-world task performance even with limited expert annotation. This efficiency and utility make EKS-SR a strong candidate for real-world applications, especially in cases where comprehensive annotation is difficult to obtain for large-scale datasets.
The design of EKS-SR does not depend on a specific SR model or high-level task model, and this "plug-and-play" feature makes it highly scalable and general. This means that as new models and tasks emerge, EKS-SR can remain relevant and beneficial, providing the potential for continuous improvement in performance and utility.
Despite the remarkable results achieved by EKS-SR, some limitations remain to be addressed in future research. The performance of EKS-SR relies to some extent on the quality of the expert knowledge provided by high-level tasks. Future research could explore how to improve the robustness of the model in the presence of noisy or incomplete expert knowledge. Furthermore, investigating a self-supervised learning framework that does not require expert labeling is a challenging but rewarding task. Meanwhile, the improvement brought by EKS-SR is slight for some SR models that already perform well. Further research is needed on loss functions that can fully exploit the intrinsic correlation of the data.
With its advantages in improving image quality and target recognition capabilities, EKS-SR can play an important role in disaster monitoring, environmental monitoring, and smart cities in the future.

Conclusions
In this work, to address the issue that SISR methods cannot accurately reconstruct object regions, we propose the EKS-SR learning strategy. EKS-SR integrates a set of coarse-grained labels, which are typically used in high-level visual tasks, into the training process of the SR task. By leveraging prior information from three key constraints (the regional constraint, the feature constraint, and the attributive constraint), EKS-SR guides the SR model to achieve a more precise reconstruction of object areas. Experiments demonstrate that EKS-SR can be easily adapted to both PSNR-SR and GAN-SR methods, enhancing the performance of SR and its practical applications in RS.

Figure 1. The first column: HR image $I_{HR}$ and its instance segmentation result. The second column: SRGAN output $I_{SR}$ and its instance segmentation result. The third column: Difference between $I_{HR}$ and $I_{SR}$, i.e., $|I_{HR} - I_{SR}|$ (top). Expert annotation label (bottom). The Mask R-CNN model [29] with a ResNet101 [30] backbone is used here and the input image size is 800 × 800. SRGAN fails to achieve satisfactory results in most areas where objects are located. Although SISR can achieve satisfactory visual results, it still falls short in practical applications.

Figure 2. The different ways to utilize the expert knowledge in SR. (a) Independent learning strategy. (b) Multitask learning strategy. (c) Our proposed EKS-SR learning strategy.

Figure 3. The three constraints included in the proposed EKS-SR learning strategy.

... 5 to 0.95. Specifically, $AP^b_{50}$, $AP^m_{50}$, $AP^b_{75}$, and $AP^m_{75}$ represent the AP values for detection and segmentation at IoU thresholds of 0.5 and 0.75, respectively. The object detection task uses $AP^b$, $AP^b_{50}$, and $AP^b_{75}$ as evaluation metrics. The instance segmentation task uses all six metrics.

Figure 4. Qualitative comparison results on the ×4 iSAID dataset. The red boxes indicate the areas where the reconstructed images obtained using SRGAN were not correctly identified under the same Mask R-CNN model.

Figure 6. Qualitative comparison results on the ×4 COWC dataset. The red boxes indicate the areas where the reconstructed images obtained using SwinIR were not correctly identified under the same Faster R-CNN model.

Figure 7. Qualitative comparison results on the ×8 COWC dataset. The red boxes indicate the areas where the reconstructed images obtained using SwinIR were not correctly identified under the same Faster R-CNN model.

Figure 8. The LPIPS and $AP^b$ results obtained by the SRGAN network trained under limited annotation. (a) Performance on the COWC dataset. (b) Performance on the iSAID dataset. The blue curve denotes the LPIPS value and the orange curve denotes the $AP^b$ value.

The iSAID dataset consists of 2806 images with different sizes and 655,451 annotated instances. Due to the large size of the original images in the iSAID dataset, we divide them into 800 × 800 image patches for training and testing. We create the SR dataset using bicubic downsampling and Gaussian blur to obtain LR images of size 200 × 200. The original training set is used as the training set for the SR task. Additionally, the validation set of iSAID is used as the test set for the SR task.

... Columbus and Utah in the United States, and Toronto in Canada. We crop the images to 256 × 256 and randomly select 80% of the images in Potsdam for training, 10% of the images in Potsdam for validation, and the rest for testing. The LR images of the COWC dataset have sizes of 64 × 64 and 32 × 32, corresponding to the ×4 and ×8 upscale factor SR tasks, respectively.

Input: A dataset of image pairs with expert knowledge $\{(I_{HR}, I_{LR}, O_i)\}$, initial model parameters $\Theta$, number of iterations $N_{iter}$, start iteration of the attributive constraint $N_{AC}$, attributive constraint frequency $f$.
Output: Trained model parameters $\Theta$.
1: for $i = 1$ to $N_{iter}$ do
...
10:   if $\Theta$ is a PSNR-SR model then
11:     $l_{PSNR} \leftarrow l_{RC} + 10^{-4} l_{AC}$
12:   else $\Theta$ is a GAN-SR model
13:     $l_{adv} \leftarrow L_{adv}(I_{SR})$
14:     $l_{FC} \leftarrow L_{FC}(I_{HR}, I_{SR})$
15:     $l_{GAN} \leftarrow \xi l_{RC} + \mu l_{FC} + \delta l_{adv} + 10^{-4} \xi l_{AC}$
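The loss-combination steps of the training algorithm (steps 10 to 15) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the individual loss values are passed in as plain numbers, the default weights for $\xi$, $\mu$, and $\delta$ are placeholders because their values are not given here, and the scheduling rule in `use_attributive_constraint` is only one plausible reading of $N_{AC}$ and $f$, since the corresponding algorithm steps are not available.

```python
def use_attributive_constraint(i: int, n_ac: int, f: int) -> bool:
    """Assumed schedule: apply the attributive constraint from iteration
    n_ac onward, once every f iterations. The exact rule is an assumption."""
    return i >= n_ac and (i - n_ac) % f == 0


def total_loss(model_type: str, l_rc: float, l_ac: float,
               l_fc: float = 0.0, l_adv: float = 0.0,
               xi: float = 1.0, mu: float = 1.0, delta: float = 1.0) -> float:
    """Combine the constraint losses as in steps 10 to 15.

    xi, mu, and delta default to 1.0 as placeholders (assumption).
    """
    if model_type == "psnr":
        # Step 11: l_PSNR = l_RC + 1e-4 * l_AC
        return l_rc + 1e-4 * l_ac
    if model_type == "gan":
        # Step 15: l_GAN = xi*l_RC + mu*l_FC + delta*l_adv + 1e-4*xi*l_AC
        return xi * l_rc + mu * l_fc + delta * l_adv + 1e-4 * xi * l_ac
    raise ValueError(f"unknown model type: {model_type!r}")


# Example with made-up loss values, for illustration only.
psnr_loss = total_loss("psnr", l_rc=1.0, l_ac=100.0)
gan_loss = total_loss("gan", l_rc=1.0, l_ac=100.0, l_fc=2.0, l_adv=3.0,
                      xi=2.0, mu=0.5, delta=0.1)
print(psnr_loss, gan_loss)
```

All of these terms are evaluated only during training, which is consistent with the statement that EKS-SR adds no parameters and no inference time to the SR model.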

Table 1. Results of comparison with different learning strategies on the ×4 COWC dataset. The best result of each metric is in bold font.

Table 2. Results of comparison with different learning strategies on the ×4 iSAID dataset. The best result of each metric is in bold font.

Table 3. Results of comparison with different learning strategies on the ×4 and ×8 COWC datasets. The best result of each metric is in bold font.

Table 4. Comparison results of different label utilization rates for SRGAN on the ×4 iSAID dataset.

Table 5. Comparison results of different label utilization rates for SRGAN on the ×4 COWC dataset.

Table 6. Ablation results of comparison with different learning strategies on the iSAID dataset. The best result of each metric is in bold font.