SR-FEINR: Continuous Remote Sensing Image Super-Resolution Using Feature-Enhanced Implicit Neural Representation

Remote sensing images often have limited resolution, which can hinder their effectiveness in various applications. Super-resolution techniques can enhance the resolution of remote sensing images, and arbitrary-resolution super-resolution techniques provide additional flexibility in choosing appropriate image resolutions for different tasks. However, for subsequent processing, such as detection and classification, different methods may require input images of very different resolutions. In this paper, we propose a method for continuous remote sensing image super-resolution using feature-enhanced implicit neural representation (SR-FEINR). Continuous remote sensing image super-resolution means that users can scale a low-resolution image to an image of arbitrary resolution. Our algorithm is composed of three main components: a low-resolution image feature extraction module, a positional encoding module, and a feature-enhanced multi-layer perceptron module. We are the first to apply implicit neural representation to a continuous remote sensing image super-resolution task. Through extensive experiments on two popular remote sensing image datasets, we show that our SR-FEINR outperforms state-of-the-art algorithms in terms of accuracy. Our algorithm achieves an average improvement of 0.05 dB over the existing method at the ×30 scale across three datasets.


Introduction
With the development of satellite image processing technology, the applications of remote sensing have increased [1][2][3][4][5]. However, the low spatial, spectral, radiometric, and temporal resolutions of current image sensors, together with complicated atmospheric conditions, limit the usability of remote sensing images. Consequently, numerous super-resolution (SR) methods have been proposed to improve the quality and resolution of remote sensing images.
SR reconstruction is a method for generating high-resolution remote sensing images, typically by combining a large number of images with similar content. Generally, remote sensing image SR reconstruction algorithms can be classified into three categories: single remote sensing image SR reconstruction [6][7][8][9][10][11], multiple remote sensing image SR reconstruction [12,13], and multi/hyperspectral remote sensing image SR reconstruction [14]. Since the latter two approaches suffer from weak SR effects and issues such as registration fusion and multi-source information fusion, most research has focused on single remote sensing image SR reconstruction.
Single remote sensing image SR (SISR) methods can be divided into two categories: those based on generative adversarial networks (GANs) and those based on convolutional neural networks (CNNs). Although both GAN-based and CNN-based networks can achieve good results in SISR, they can only scale the low-resolution (LR) image by an integer factor, which makes the obtained high-resolution (HR) image inconvenient for downstream tasks. One way to solve this problem is to represent a discrete image continuously with an implicit neural representation. Continuous image representation allows recovering images at arbitrary resolution by modeling the image as a function defined over a continuous domain. In a continuous domain, the natural way to describe an image is to fit it as a function of continuous coordinates. Our method is motivated by recent advances in implicit neural representation for 3D shape reconstruction [15]. The concept behind implicit functions is to represent a signal as a function that maps coordinates to the corresponding signal values (e.g., the signed distance to a 3D object surface). In remote sensing image super-resolution, the signals can be the RGB values of an image. A multi-layer perceptron (MLP) is a common way to implement an implicit neural representation. Instead of fitting a unique implicit function for each object, encoder-based approaches predict a latent code for each object in order to share information across instances. The implicit function is then shared by all objects and accepts the latent code as an extra input. Although the encoder-based implicit function method is effective for 3D tasks, it can only represent simple images well and is unable to accurately represent remote sensing images.
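The idea of an implicit neural representation can be illustrated with a toy example. The NumPy sketch below (weights are random and untrained, purely for illustration) shows how an MLP-style implicit function maps continuous 2D coordinates to RGB values, so the same function can be queried on a grid of any resolution:

```python
import numpy as np

def implicit_mlp(coords, weights):
    """Toy implicit representation: map 2-D coordinates to RGB with a small MLP."""
    h = coords
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layers
    W, b = weights[-1]
    return h @ W + b                     # linear output: one RGB triple per coordinate

rng = np.random.default_rng(0)
dims = [2, 32, 32, 3]
weights = [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

# Query at arbitrary resolution: any grid of coordinates in [-1, 1]^2 works.
ys, xs = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64), indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)
rgb = implicit_mlp(coords, weights)
print(rgb.shape)  # one RGB value per queried coordinate
```

Because the function is defined on continuous coordinates, changing the grid size changes the output resolution without retraining.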
To solve the problem of the expression ability of encoder-based implicit neural representations, this paper explores different positional encoding methods in image representation for the image SR task, and proposes a novel feature-enhanced MLP network to enhance the approximation ability of the original MLP. Our main contributions are as follows:

1.
We are the first to adopt the implicit neural representation for remote sensing image SR tasks. With our method, one can obtain significant improvements on the AID and UC Merced datasets.

2.
We propose a novel feature-enhanced MLP architecture to make use of the feature information of the low-resolution image.

3.
The performances of different positional encoding methods are investigated in implicit neural representations for continuous remote sensing image SR tasks.

Related Works
In this section, we will briefly review the implicit neural representation and the related methods, including positional encoding and continuous image SR.

Implicit Neural Representation
The implicit neural representation is essentially a continuously differentiable function that maps coordinates to signals. It has been widely used in many fields, such as shape parts [16,17], objects [18][19][20][21], and scenes [22][23][24][25]. The implicit neural representation is a data-driven method; it is trained from some form of data, for example as a signed distance function. Many 3D-aware image generation methods use convolutional architectures. Park et al. [18] proposed using neural networks to fit scalar functions for the representation of 3D scenes. Mildenhall et al. [26] proposed the neural radiance field (NeRF) to implicitly represent a scene. It takes images of the same scene captured from different viewpoints as inputs and uses a neural network to implicitly learn a static 3D scene. Based on these images, the trained neural network can render images from any perspective. However, existing work based on implicit neural representation does not perform very well in terms of spatial and temporal derivatives. For image generation, Chen et al. [27] proposed the local implicit image function (LIIF). It feeds the coordinates and the corresponding features to an MLP, which outputs an RGB signal for each coordinate. Since the coordinates of images of arbitrary resolution are continuous, LIIF can represent images at arbitrary resolutions.

Positional Encoding
In order to capture positional relationships, a method called positional encoding was introduced in [28,29]. Positional encoding is essentially a map from a position space to a high-dimensional vector space. For the continuous image SR task, 2D image coordinates are mapped to high-dimensional vectors. The common method used in [29] employs manually designed sinusoidal positional encoding. The performance of this hand-designed approach depends on the weights of the sinusoidal positional encoding, which lacks flexibility. To improve the flexibility of the positional encoding, Parmar et al. [30] introduced a learnable embedding vector for each position in the 1D case. Although the trainable embedding method has the potential to capture more complex positional relationships, the number of learnable parameters grows rapidly with the dimensionality of the positional input coordinates. To capture more complex positional relationships, for instance, the similarity of positions in an image, a novel learnable positional encoding was proposed in [31]. In their method, a function based on the Fourier transform is learned to map multi-dimensional positions into a vector space, and the obtained vectors are fed into the MLP. In our work, we also focus on the learnable positional encoding method.

Continuous Image SR
Image SR is a reconstruction task that restores a realistic and more detailed high-resolution image from a LR image. It is an important class of computer vision image processing techniques. However, it is an ill-posed problem because a specific LR image corresponds to a set of possible high-resolution images. Due to the powerful characterization and extraction capabilities of deep learning in both low-resolution and high-resolution spaces, deep learning-based image SR has improved significantly in both qualitative and quantitative terms. Dong et al. [32] were the first to study single natural image SR based on deep learning, with a model called SRCNN. It uses bicubic interpolation to scale a LR image to the target size. These images are then fed into a three-layer convolutional network to fit a nonlinear map, and the output is a HR image. In [33], a novel network, FSRCNN, was proposed to improve the inference speed of SRCNN. However, the SRCNN model not only learns how to generate high-frequency information, but it also needs to reconstruct low-frequency information, which greatly reduces its efficiency. Kim et al. [34] proposed VDSR, which increases the depth of the network by employing residual connections. Remote sensing images differ from natural images: they often have coupled objects and environments, and they span a wide range of scales. In order to make full use of the environmental information, Lei et al. [35] proposed a VDSR-based network called the local-global combined network (LGCNet).
All the methods mentioned above upsample the input LR images before feeding them into the model for learning, which slows down the convergence of the model and greatly increases the memory overhead. The ESPCN model [36] proposed a sub-pixel convolution operation as an efficient, fast, and non-parametric pixel rearrangement upsampling method, which significantly improved the training efficiency of the network. To further improve the expressive power of the model, the SRResNet model was proposed in [37], which utilized the residual module widely used in image classification tasks. At the same time, an adversarial loss was applied to the image SR problem for the first time, which achieved satisfactory results. In [38], the EDSR model was proposed to further optimize the above network structure; its performance was improved by removing the batch normalization layer and the second activation layer from the residual module. Later, several models were proposed to further enhance performance, including the RDN model [39] and the RCAN model [40]. To adaptively fuse the extracted multi-scale information, Wang et al. [41] proposed an adaptive multi-scale feature fusion network for SR of remote sensing images.
However, the above methods can only upsample an image to a specific scale. To generate HR images of arbitrary resolution, MetaSR [42] introduced a meta-upscale module, which employs a single model to upsample the input image to arbitrary resolution by dynamically predicting weights. However, it cannot achieve satisfactory results for resolutions outside the training distribution. Therefore, Chen et al. [27] proposed the local implicit image function (LIIF) by taking advantage of the neural implicit representation. In their method, the coordinates and the corresponding features are fed to the MLP to obtain an RGB signal. Since the coordinates are continuous, the HR image can be presented at arbitrary resolution. However, LIIF ignores the influence of positional encoding on image generation. Therefore, in this work, the coordinates are encoded to obtain more high-dimensional information about them, which produces more realistic HR images. Figure 1 shows the results of our method, which can scale the input image to an arbitrary resolution.

Method
Image SR is a common task in computer vision that outputs a high-resolution image I_H from the input LR image I_L. In other words, for each continuous coordinate p in the high-resolution image I_H, we need to calculate a signal at this coordinate, denoted as c_p. In the image SR task, the signal for a coordinate is the RGB value. In the following sections, we introduce the details of our method.

Network Overview
The main part of the proposed network is illustrated in Figure 2. It is composed of three major components: the feature extraction module E_ψ, the positional encoding module E_φ, and the feature-enhanced MLP module M_θ.
For a given discrete image I ∈ R^(H×W×3), we define the coordinate bank B_I as the set of normalized pixel-center coordinates, a subset of [−1, 1]^2:

B_I = { p = (p_x, p_y) | p_x = (2i + 1)/H − 1, p_y = (2j + 1)/W − 1, 0 ≤ i < H, 0 ≤ j < W }. (1)

For a LR image I_L, the feature extraction module E_ψ is used to extract the features F ∈ R^((#B_{I_L})×l) of the LR image. For a coordinate p ∈ B_{I_H} in a HR image I_H, the feature at p is set to the feature of the nearest point in B_{I_L}, which can be formulated as:

f_p = F_{q*}, q* = argmin_{q ∈ B_{I_L}} d(p, q), (2)

where d(·, ·) is a distance metric between coordinates. The positional encoding module E_φ is used to encode the coordinate p into a high-dimensional space. The encoding vector at this position is formulated as:

g_p = E_φ(p). (3)

We will discuss the performances of three commonly used positional encoding methods in Section 5. With the feature f_p and the encoding vector g_p, the feature-enhanced MLP module M_θ is used to reconstruct the signal c_p, which can be formulated as:

c_p = M_θ(f_p, g_p). (4)

Consequently, for any coordinate p ∈ P, where P is the set of coordinates in the high-resolution image I_H, the L1 loss is used as the reconstruction loss:

L = Σ_{p ∈ P} ||c_p − c_p^gt||_1, (5)

where c_p^gt is the ground-truth signal value at coordinate p. The complete training and inference processes are presented in Algorithms 1 and 2, respectively.

Algorithm 1: Training process of continuous super-resolution using SR-FEINR.
Input: A low-resolution image I_L, a high-resolution image I_H
Output: A trained model M_θ
1 Initialize the parameters of the model M_θ
2 Extract features F from I_L using the feature extractor E_ψ
3 Encode the coordinates of I_H using the position encoder E_φ
4 for p ∈ B_{I_H} do
5   Find the nearest point q* in B_{I_L} to p using a distance metric d
6   Set the feature at p to f_p = F_{q*}
7   Set the encoding vector at p to g_p = E_φ(p)
8 Update the parameters of the model M_θ using stochastic gradient descent with the loss function in Equation (5), where c_p^gt is the ground-truth signal value at coordinate p

Algorithm 2: Inference process of continuous super-resolution using SR-FEINR.
Input: A low-resolution image I_L
Output: A reconstructed high-resolution image Î_H
1 Define the coordinate banks B_{I_L} and B_{Î_H}
2 Extract features F from I_L using the feature extractor E_ψ
3 Encode the coordinates of Î_H using the position encoder E_φ
4 for p ∈ B_{Î_H} do
5   Find the nearest point q* in B_{I_L} to p using a distance metric d
6   Set the feature at p to f_p = F_{q*}
7   Set the encoding vector at p to g_p = E_φ(p)
8   Reconstruct the signal at p: c_p = M_θ(f_p, g_p)
9 Construct the high-resolution image Î_H from the signals c_p
10 return Î_H
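As a rough illustration of the inference process, the NumPy sketch below runs the pipeline end to end with random, untrained stand-ins for E_ψ, E_φ, and M_θ; only the coordinate bank and the nearest-feature lookup are implemented faithfully, and all dimensions are illustrative:

```python
import numpy as np

def coord_bank(h, w):
    """Pixel-center coordinates in [-1, 1]^2 for an h x w grid."""
    ys = (np.arange(h) + 0.5) / h * 2 - 1
    xs = (np.arange(w) + 0.5) / w * 2 - 1
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gy, gx], axis=-1).reshape(-1, 2)

def nearest_feature(p, lr_coords, lr_feats):
    """f_p: feature of the LR point closest to query coordinate p."""
    d = np.sum((lr_coords - p) ** 2, axis=1)  # squared Euclidean distance
    return lr_feats[np.argmin(d)]

rng = np.random.default_rng(0)
H_lr, W_lr, C = 4, 4, 8
lr_feats = rng.standard_normal((H_lr * W_lr, C))  # stand-in for E_psi(I_L)
lr_coords = coord_bank(H_lr, W_lr)

def encode(p, n=4):  # stand-in positional encoding E_phi
    w = 2.0 ** np.arange(n)
    return np.concatenate([np.sin(np.outer(w, p).ravel() * np.pi),
                           np.cos(np.outer(w, p).ravel() * np.pi)])

def mlp(f, g):  # stand-in M_theta: any map (f_p, g_p) -> RGB
    x = np.concatenate([f, g])
    return np.tanh(x[:3])

H_hr, W_hr = 12, 12  # arbitrary target resolution
out = np.array([mlp(nearest_feature(p, lr_coords, lr_feats), encode(p))
                for p in coord_bank(H_hr, W_hr)]).reshape(H_hr, W_hr, 3)
print(out.shape)
```

Because the loop runs over an arbitrary coordinate bank, the same LR features can be decoded to any target resolution.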

Feature Extraction
As mentioned in [27], we used EDSR and RDN to extract the features of the low-resolution image. The feature extraction process in EDSR includes inputting a low-resolution image, extracting high-level features through convolutional layers, enhancing features through residual blocks, fusing features through feature fusion modules, and outputting a feature map. The feature extraction process in RDN includes inputting a low-resolution image, extracting feature maps through convolutional layers and residual dense networks, expanding features through feature expansion modules, fusing features through feature fusion modules, and finally upsampling and reconstructing the image.
For a low-resolution image I_L ∈ R^(H×W×3), to enrich the information of each latent code in the feature space, we update the features using the feature-unfolding method, which can be formulated as:

F_{i,j} = Concat({F_{i+m, j+n}}_{m, n ∈ {−1, 0, 1}}), (6)

where Concat denotes concatenation along the channel dimension. Afterward, we obtain the features F of the low-resolution image; the feature f_p of a continuous coordinate p can be calculated using Equation (2) and fed into the feature-enhanced MLP module M_θ.
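The feature-unfolding step above can be sketched in NumPy as follows: each latent code is concatenated with its 3 × 3 neighborhood along the channel axis. Border handling by edge replication is an assumption here, since the text does not specify it:

```python
import numpy as np

def feature_unfold(F):
    """3x3 feature unfolding: concatenate each latent code with its 8 neighbours
    along the channel axis, so a (H, W, C) map becomes (H, W, 9C)."""
    H, W, C = F.shape
    Fp = np.pad(F, ((1, 1), (1, 1), (0, 0)), mode="edge")  # replicate borders
    shifts = [Fp[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.concatenate(shifts, axis=-1)

F = np.random.default_rng(0).standard_normal((6, 6, 16))
print(feature_unfold(F).shape)  # channel dimension grows by a factor of 9
```

In a framework implementation, the same operation is typically expressed with an unfold/im2col primitive rather than explicit shifts.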

Positional Encoding
To encode the coordinate p, we use the following equation:

E_φ(p) = (sin(ω_0 πp), cos(ω_0 πp), sin(ω_1 πp), cos(ω_1 πp), · · · , sin(ω_n πp), cos(ω_n πp)), (7)

where ω_0, ω_1, . . . , ω_n are coefficients and n is related to the dimension of the encoding space. As illustrated in Figure 3, three common positional encoding methods are considered: the hand-craft approach, the random approach, and the learnable approach.

In the hand-craft approach, ω_i is fixed as ω_0 = b^0, · · · , ω_n = b^L, where b and L are hyperparameters. The difference between the random approach and the normal positional encoding is that the weights ω_i are randomly selected rather than specified: they are sampled from a normal distribution N(µ, Σ), where µ and Σ are hyperparameters.

For the learnable approach, the encoding vector of each position is represented as a trainable code produced by a learnable mapping of the coordinate. A major advantage of this method for multi-dimensional coordinates is that it is naturally inductive and can handle test samples of arbitrary length. Another major advantage is that the number of parameters does not increase with the sequence length. This method is composed of two components: learnable Fourier features and an MLP layer. To extract useful features, the learnable Fourier features map an M-dimensional position p into an F-dimensional Fourier feature vector r_p. The definition of the learnable Fourier features is roughly the same as Equation (7):

r_p = (sin(ω_0 πp), cos(ω_0 πp), · · · , sin(ω_n πp), cos(ω_n πp)), (8)

where ω_0, · · · , ω_n are trainable parameters and n = F/2 − 1 defines both the orientation and wavelength of the Fourier features. The linear projection coefficients ω_0, · · · , ω_n are initialized from a normal distribution N(0, γ^−2). The MLP layer is a simple neural network architecture for implicit neural representation with a GELU activation function:

g_p = τ(r_p), (9)

where τ(·) is the perceptron parameterized by η.
Since the weights are learnable, the expression power of the encoding vector is more flexible. Therefore, in our work, we focus on learnable positional encoding.
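A minimal NumPy sketch of the learnable approach follows. Random matrices stand in for the trainable frequency and MLP parameters (in training they would be optimized jointly with the network), and the dimensions are illustrative rather than the ones used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, F_dim, M = 10.0, 16, 2               # encoding dim F, coordinate dim M
n = F_dim // 2                               # number of frequency vectors
W_f = rng.normal(0.0, 1.0 / gamma, (n, M))   # trainable frequencies, N(0, gamma^-2)

def fourier_features(p):
    """r_p = (sin(w_0 pi p), cos(w_0 pi p), ..., sin(w_{n-1} pi p), cos(w_{n-1} pi p))."""
    phase = np.pi * (W_f @ p)                # one scalar phase per frequency vector
    return np.concatenate([np.sin(phase), np.cos(phase)])

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

# tau(.): a small MLP with GELU on top of the Fourier features
W1 = rng.standard_normal((F_dim, 32)) * 0.1
W2 = rng.standard_normal((32, F_dim)) * 0.1
def encode(p):
    return gelu(fourier_features(p) @ W1) @ W2

g = encode(np.array([0.25, -0.5]))
print(g.shape)
```

Note that the Fourier feature vector always has unit energy per frequency (sin² + cos² = 1), which keeps the encoding magnitude independent of the coordinate value.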

Feature-Enhanced MLP for Reconstruction
In order to make use of the information in the LR image, we propose a feature-enhanced MLP module M_θ that reuses the features of the LR image. The latent code f_p at the coordinate p of the LR image and the encoded coordinate feature vector g_p are fed into the first hidden layer of the MLP. This process is defined as

c_p^1 = h_1(Concat(f_p, g_p)), (10)

where h_1 is the first hidden layer of the MLP and c_p^1 is the output vector of the first hidden layer. Then we concatenate the image feature vector f_p with the output of the previous hidden layer. At this point, Equation (10) is transformed into

c_p^2 = h_2(Concat(f_p, c_p^1)), (11)

where h_2 is the second hidden layer of the MLP and c_p^2 is the output vector of the second hidden layer.
In our method, the MLP is constructed with five perceptron layers to obtain better results compared to LIIF [27]. The MLP model can be written as:

c_p = h_5(h_4(h_3(c_p^2))), (12)

where h_i(·) is the ith hidden layer and c_p is the predicted RGB value for coordinate p.
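A NumPy sketch of the feature-enhanced MLP follows. The layer widths and initialization are illustrative (a real implementation would use a trained framework model); the key point it demonstrates is that f_p is concatenated into the inputs of the first two hidden layers, matching Equations (10) and (11):

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class FeatureEnhancedMLP:
    """Five-layer MLP; the LR latent code f_p is re-injected (concatenated)
    into the inputs of the first two hidden layers."""
    def __init__(self, f_dim, g_dim, hidden=256, rng=None):
        rng = rng or np.random.default_rng(0)
        dims_in = [f_dim + g_dim, f_dim + hidden, hidden, hidden, hidden]
        dims_out = [hidden, hidden, hidden, hidden, 3]
        self.layers = [(rng.standard_normal((i, o)) * 0.02, np.zeros(o))
                       for i, o in zip(dims_in, dims_out)]

    def __call__(self, f_p, g_p):
        x = np.concatenate([f_p, g_p])       # Concat(f_p, g_p) into h_1
        for k, (W, b) in enumerate(self.layers):
            x = x @ W + b
            if k < len(self.layers) - 1:
                x = gelu(x)
            if k == 0:                        # feature enhancement: reuse f_p in h_2
                x = np.concatenate([f_p, x])
        return x                              # predicted RGB value c_p

mlp = FeatureEnhancedMLP(f_dim=64, g_dim=16)
print(mlp(np.zeros(64), np.zeros(16)).shape)
```

The re-injection only changes the input width of the second layer, so the extra cost over a plain MLP is negligible.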

Implementation Details
Two feature extraction modules are considered in this work, which are EDSR and RDN. In the three positional encoding approaches, we chose the learnable positional encoding because it was more conducive to the learning of the network and it performed better in our experiment. As for the MLP setting of the feature-enhanced MLP network M θ , we chose a five-layer 256-d multilayer perceptron (MLP) with the GELU activation function.

Experimental Dataset and Settings
In our experiments, we used the common natural image dataset DIV2K [43] for the ablation study and two common remote sensing datasets: UC Merced [44] and AID [45]. These datasets have been heavily utilized in the field of remote sensing SISR [35,46,47].
• AID dataset [45]: This dataset contains 30 classes of remote sensing scenes, such as airport, railway station, square, and so on. Each class contains hundreds of images with a resolution of 600 × 600. In our experiment, we chose two types of scenes, the airport and the railway station, to evaluate the different methods. The images in each scene were split into the train set and test set with a ratio of 8:2, and we then randomly picked five images from the train set of each scene as the valid set.
• UC Merced dataset [44]: This dataset contains 21 classes of remote sensing scenes, such as airport, baseball diamond, beach, and so on. Each class contains 100 images with a resolution of 256 × 256. We split the dataset into the train set, test set, and valid set with a ratio of 4:5:1.
• DIV2K dataset [43]: This dataset contains 1000 high-resolution natural images and corresponding LR images with scales ×2, ×3, and ×4. Following prior work [27], we used 800 images as the training set and the 100 images in the DIV2K validation set as the test set.
In our training process, the low-resolution image I_L and the coordinate-RGB pairs O = {(p, c_p)}_{p∈A} of the high-resolution image are obtained by the following steps: (1) the high-resolution image in the training dataset is cropped into a 48r_i × 48r_i patch I_P, where r_i is sampled from a uniform distribution U(1, 4); (2) I_P is downsampled with the bicubic interpolation method to generate its LR image I_L with a resolution of 48 × 48; (3) for the original 48r_i × 48r_i image patch I_P, the coordinate bank B_{I_P} is constructed, and for each coordinate p ∈ B_{I_P}, its RGB value is denoted as c_p; the coordinate-RGB pair set of I_P is then constructed as O_full = {(p, c_p)}_{p∈B_{I_P}}; (4) 48 × 48 coordinate-RGB pairs O = {(p, c_p)}_{p∈A} are randomly chosen from O_full to supervise the network.
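The patch-sampling steps above can be sketched as follows. Nearest-neighbor downsampling stands in for bicubic interpolation to keep the example dependency-free, and the helper name is illustrative:

```python
import numpy as np

def make_training_pair(hr_image, rng, base=48, n_samples=48 * 48):
    """Build (LR patch, coordinate-RGB pairs) from one HR training image."""
    r = rng.uniform(1, 4)                     # (1) scale r_i ~ U(1, 4)
    s = round(base * r)                       # HR patch side, 48 * r_i
    H, W, _ = hr_image.shape
    y, x = rng.integers(0, H - s + 1), rng.integers(0, W - s + 1)
    patch = hr_image[y:y + s, x:x + s]        # I_P

    # (2) downsample to 48x48 -- nearest neighbour as a stand-in for bicubic
    idx = ((np.arange(base) + 0.5) * s / base).astype(int)
    lr = patch[idx][:, idx]                   # I_L

    # (3)-(4) coordinate bank of I_P and a random subset of coordinate-RGB pairs
    ys = (np.arange(s) + 0.5) / s * 2 - 1
    coords = np.stack(np.meshgrid(ys, ys, indexing="ij"), -1).reshape(-1, 2)
    rgbs = patch.reshape(-1, 3)
    sel = rng.choice(len(coords), size=n_samples, replace=False)
    return lr, coords[sel], rgbs[sel]

rng = np.random.default_rng(0)
hr = rng.random((300, 300, 3))                # dummy HR training image
lr, coords, rgbs = make_training_pair(hr, rng)
print(lr.shape, coords.shape, rgbs.shape)
```

Fixing the number of sampled pairs at 48 × 48 keeps the per-step supervision cost constant regardless of the sampled scale r_i.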
We implemented SRCNN, VDSR, and LGCNet based on the settings given in [48]. For the other experiments, we adopted the same training settings given in [27]. Specifically, we used the Adam optimizer [49] with an initial learning rate of 1 × 10^−4. All experiments were trained for 1000 epochs with a batch size of 16, and the learning rate decayed by a factor of 0.5 every 200 epochs.
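The learning-rate schedule described above is a simple step decay and can be written directly:

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, step=200):
    """Step schedule used for training: halve the learning rate every 200 epochs."""
    return base_lr * decay ** (epoch // step)

# Learning rate at the start, just before and after the first decay, and near the end.
print([learning_rate(e) for e in (0, 199, 200, 999)])
```

In a framework such as PyTorch this corresponds to a standard step scheduler with step size 200 and factor 0.5.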

Evaluation Metrics
To evaluate the effectiveness of the proposed method, we used two evaluation indicators that are commonly used in [50][51][52][53]. The most popular metric for evaluating the quality of the results is the peak signal-to-noise ratio (PSNR). For a RGB image, the PSNR can be calculated as follows:

PSNR = 10 log_10 ((255^2 × N_p) / MSE), (13)

where N_p is the total number of pixels in the image and MSE is the mean squared error, which can be calculated as:

MSE = Σ_{c=1}^{3} Σ_{i=1}^{N_p} (I(i)_c − K(i)_c)^2, (14)

where I(i)_c and K(i)_c represent the intensity values of the ith pixel in the original and reconstructed images in the cth color channel, respectively.

The structural similarity index (SSIM) can be used to measure the similarity between two RGB images. The SSIM index can be calculated as follows:

SSIM(I, K) = ((2 µ_I µ_K + c_1)(2 σ_IK + c_2)) / ((µ_I^2 + µ_K^2 + c_1)(σ_I^2 + σ_K^2 + c_2)), (15)

where µ_I, µ_K, σ_I, σ_K, and σ_IK are the means, standard deviations, and cross-covariance of the intensity values of the original and reconstructed images over the three color channels, respectively. The constants c_1 and c_2 are small positive constants that avoid instability when the denominator is close to zero.

Note that the above equations assume that the original and reconstructed RGB images have the same resolution. If the images have different resolutions, they need to be resampled before calculating the PSNR and SSIM.
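For reference, a NumPy sketch of both metrics follows. The PSNR here uses the MSE averaged over all pixels and channels, and the SSIM is computed over the whole image in a single window; practical evaluations typically use a sliding-window SSIM:

```python
import numpy as np

def psnr(img, ref):
    """PSNR for 8-bit images; MSE averaged over all pixels and channels."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def global_ssim(img, ref, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over the whole image (real evaluations use a sliding window)."""
    x = img.astype(np.float64).ravel()
    y = ref.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (32, 32, 3))
noisy = np.clip(a + 1, 0, 255)               # small synthetic distortion
print(round(psnr(a, noisy), 2), round(global_ssim(a, noisy), 4))
```

The c1 and c2 defaults follow the common (0.01 L)^2 and (0.03 L)^2 convention with dynamic range L = 255.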

Comparison Results on the AID Dataset
Since the AID dataset has 30 scene categories, we randomly selected only 2 categories, the airport and the railway station, to show the comparison results. The results are listed in Table 1 for upscale factors ×2, ×3, ×4, ×6, ×12, and ×18, where bold text represents the best results. It can be observed that our method obtains competitive results for in-distribution scales compared to the previous methods. For out-of-distribution scales, our method significantly outperforms the other methods in both PSNR and SSIM. In addition to the quantitative analysis, we also conducted qualitative comparisons, which are shown in Figures 4 and 5. In Figure 4, the ×3 SR results of a railway station are shown for the different methods, where two regions are zoomed in to show the details (see the red and green rectangles). The PSNR values are listed in the bottom-left corner of each image. In Figure 5, we show the ×4 SR results of an airport for the different methods. From these figures, we can see that our method produces the clearest details and the highest PSNR values. Table 1. Quantitative comparisons on the AID test set (PSNR (dB) and SSIM). (RS*: railway station; bold indicates the best value).

Comparison Results on UCMerced Dataset
Different from the AID dataset, the UCMerced dataset has a smaller number of images and categories. Therefore, our model was trained and tested on the whole dataset. The quantitative comparison results of the different methods on the UCMerced dataset are listed in Table 2. From this table, we can see that our results are higher than those of LIIF at all magnification scales. In addition, we also visualize the SR results of the different methods in Figure 6. From a visual point of view, both LIIF and our method outperform the other methods. Although the visualization results of LIIF and our method are similar, the PSNR values of our method, both for the whole image and for the local regions, are larger than those of LIIF, which means our method is slightly better.

Comparison Results on the DIV2K Dataset
Unlike the above two datasets, the images in the DIV2K dataset are mainly natural images. Since our method is proposed for remote sensing image SR, we only conducted quantitative comparisons on this dataset. On this dataset, we compare two versions of our method with Bicubic, EDSR, EDSR-MetaSR, EDSR-LIIF, and RDN-LIIF. EDSR-ours and RDN-ours use EDSR and RDN to extract features, respectively. The comparison results are listed in Table 3. From this table, we can see that with EDSR features, our method has the best performance from the ×3 scale onward. For the ×2 scale, EDSR-LIIF and EDSR-MetaSR are better than our method, as they are trained for this scale. Regarding RDN, we only compare with LIIF. The comparison results demonstrate that our method achieves the best results at high scales.

Ablation Study
In this section, we perform ablation studies to assess the effectiveness of each module, where EDSR is used as the feature encoder. Based on the baseline LIIF model, we progressively add the positional encoding module and the feature-enhanced MLP module to evaluate their effectiveness. In order to further evaluate the effectiveness of the proposed feature-enhanced MLP module, we replace the features with coordinates and embed them into the MLP. The results of the ablation study are shown in Table 4. In this table, LIIF is our baseline. LIIF + PE is the combination of LIIF and the positional encoding module. LIIF + PE + FE is the combination of the positional encoding module and the feature-enhanced MLP module, which is our method. Based on LIIF + PE + FE, the features in the feature-enhanced MLP module are replaced with coordinates, and the resulting network is LIIF + PE + PF*. From this table, we can see that LIIF + PE + FE (our method) outperforms LIIF at all scales except the ×2 scale. This result proves that the learning ability of the network can be effectively improved by embedding the image features into the hidden layers of the MLP.

The positional encoding module is an important module in the proposed method. As described in Section 3.2, there are three commonly used positional encoding methods: the hand-craft approach, the random approach, and the learnable approach. Therefore, in this section, we discuss the effectiveness of these methods on the remote sensing image SR task. The comparison results are listed in Table 5. In this table, LIIF + PE-hand represents the network with the hand-craft positional encoding method, where b = 2 and L = 10, i.e., ω_i = 2^i, i = 0, 1, · · · , 9. LIIF + PE-random indicates that the weights are chosen randomly from a normal distribution; in this network, the hyperparameters are set to µ = 100 and Σ = 0. LIIF + PE-learning is the network with the learnable positional encoding method.
The weights are learned through an MLP. The function τ(·) is a 2-layer MLP with the GELU activation and hidden dimensions of 256. The dimension F of the Fourier feature vector is set to 768. γ is set to 10 in the normal distribution N(0, γ^−2). From Table 5, we can see that LIIF outperforms the other methods for the in-distribution scales, which are ×2, ×3, and ×4. However, from the ×6 scale onward, LIIF + PE-learning achieves the best performance among all methods. Therefore, the learnable positional encoding method is used in our network. Table 5. Quantitative comparison of the three different positional encoding approaches in Figure 3 (PSNR (dB)); bold indicates the highest value.

Conclusions
In this paper, we propose a novel network structure for continuous remote sensing image SR. Using LIIF as our baseline, two important modules are introduced to improve its performance: the positional encoding module and the feature-enhanced MLP module. The positional encoding module can capture complex positional relationships by using more coordinate information. The feature-enhanced MLP module is constructed by adding prior information from the LR image to the hidden layers of the MLP, which improves the expression and learning ability of the network. Extensive experimental results demonstrate the effectiveness of the proposed method. It is worth noting that our method outperforms the state-of-the-art methods for magnifications outside the training distribution, which is important in practical applications.
A limitation of our method is that the inference speed of the MLP is relatively slow, which restricts its application. In the literature, there are acceleration algorithms for the MLP architecture that can be used to decrease the inference time. Therefore, we will attempt to integrate these methods into our algorithm to improve its efficiency.