OW-SLR: Overlapping Windows on Semi-Local Region for Image Super-Resolution

Implicit neural representations have made considerable progress in upscaling an image to an arbitrary resolution. However, existing methods define a function that predicts the Red, Green and Blue (RGB) value from just four specific loci. Relying on only four loci is insufficient, as it loses fine details from the neighboring region(s). We show that taking the semi-local region into account improves performance. In this paper, we propose a new technique called Overlapping Windows on Semi-Local Region (OW-SLR), which upscales an image to any arbitrary resolution by taking the coordinates of the semi-local region around a point in the latent space. The extracted detail is used to predict the RGB value of that point. We illustrate the technique by applying the algorithm to Optical Coherence Tomography-Angiography (OCT-A) images and show that it can upscale them to arbitrary resolution. The technique outperforms the existing state-of-the-art methods when applied to the OCT500 dataset. OW-SLR also provides better results for classifying healthy and diseased retinal images, such as diabetic retinopathy and normal cases, from the given set of OCT-A images.


Introduction
The primary objective of super resolution (SR) is to obtain a credible high resolution (HR) image from a low resolution (LR) image. The major challenge is to retrieve information that is too minute or almost non-existent, and to extrapolate it to higher dimensions in a way that is plausible to the human eye. Furthermore, the availability of paired HR-LR image data poses another concern. Typically, an image is downsampled using a specific method in the hope that the result resembles a real-life LR image. The aim of SR models is to fill in the information missing between the LR and HR images, thereby bridging the gap. For high-dimensional inputs such as videos and 3D scans, there are also several works in the literature [1][2][3][4][5][6].
Most of the architectures [7][8][9][10][11] proposed for SR of images upsample them by a fixed factor only. This means that a separate architecture needs to be trained for each unseen upscaling factor. However, the real world is continuous in nature, whereas images are represented and stored as discrete values in 2D arrays. Inspired by the use of implicit neural representations for 3D shape reconstruction [12][13][14][15], ref. [16] proposed the Local Implicit Image Function (LIIF) to represent images in a continuous fashion. Some postprocessing is performed to obtain the RGB value of the query point. This approach enables representing and manipulating images in a continuous manner, departing from the traditional discrete representation in 2D arrays.
In our work, we draw partial inspiration from advancements in 3D shape reconstruction, but we extend the approach by considering a semi-local region rather than relying solely on four specific locations. Our method allows for extrapolation to any arbitrary upscaling factor using the same architecture. This architecture takes the semi-local region into account and specifically learns to extract important details related to a query point in the latent space that needs to be upscaled. In this paper, we propose an image representation technique in a continuous domain called Overlapping Windows on Semi-Local Region, and we define our work as follows: (i) Each image is represented as a set of latent codes, establishing a continuous nature. To determine the RGB value of a point in the HR image within the latent space, we employ a decoding function. (ii) The semi-local region is fed into a network, which generates embeddings of the intricate details in it; these details have a high probability of getting lost when networks take the entire image into consideration. (iii) The overlapping window technique allows for effective learning of features within the semi-local region around a point in the latent space using the embeddings. (iv) A decoder takes the features derived from the overlapping window technique and produces the RGB value of the corresponding point in the HR image.
In summary, our work makes two key contributions. Firstly, we introduce a novel technique called overlapping windows, which enables efficient learning of features within the semi-local region around a point. This approach allows for more effective representation and extraction of important details. Secondly, our architecture is capable of upscaling an image by any arbitrary factor, without the need for separate architectures for different upscaling factors. This enables consistent image upscaling within a unified framework.

Related Work
During the early stages of SR research, images were typically upsampled by a certain factor using simple interpolation techniques, and the network was trained to learn the extrapolation of the LR images [17,18]. However, this approach presents some issues. Firstly, the pre-upsampling process introduces more parameters than the post-upsampling process. Pre-upsampling is defined as upscaling the input image and then passing it through the network, whereas post-upsampling is defined as passing the image through the network and then upscaling the feature map. Secondly, the larger number of parameters requires more training time. The network also needs to learn the intricacies of the pre-upsampling method, which adds to the overall training complexity. Finally, pre-upsampling with traditional bicubic interpolation does not yield realistic results during testing. Since it is the first step of the SR pipeline, the network often attempts to mimic this interpolation, which limits the realism of the output images. On the other hand, post-upsampling approaches, in which the image is downscaled in the very first step, typically use bicubic interpolation for that resizing. However, downscaling an image, even with bicubic interpolation, tends to yield more realistic results than upscaling. As a result, the research focus has shifted towards post-upsampling techniques, which provide more efficient and realistic SR results by leveraging downscaling with appropriate interpolation methods in the very first step.
As already mentioned, downscaling of images happens as the initial step in the post-upsampling process. The network learns features from the downscaled image and then upsamples the learned features towards the very end. A technique proposed by Shi et al. [8] is known as sub-pixel convolution. Sub-pixel convolution handles the extrapolation of each pixel by accumulating the features along the channel of that pixel. By rearranging the feature channels, sub-pixel convolution enables the network to effectively upscale the LR image to a higher resolution. While sub-pixel convolution provides a practical solution for upsampling by integral factors (×1, ×2, ×3, etc.), it does not support fractional upsampling factors (×1.4, ×2.9, etc.). However, for cases where fixed integral upsampling factors are sufficient, sub-pixel convolution offers an efficient approach to achieving high-quality upsampling. The work by Ledig et al. [19] introduced the use of multiple residual blocks for feature extraction in SR tasks. Their approach demonstrated the effectiveness of residual blocks in capturing and enhancing image details.
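The channel rearrangement behind sub-pixel convolution can be illustrated with a small sketch. It covers only the rearrangement step (often called pixel shuffle), not the preceding convolution, and it assumes the common channel ordering in which the r × r sub-pixels of one output channel are stored in consecutive input channels. The function name `pixel_shuffle` and the nested-list layout are illustrative choices, not taken from [8].

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed channel ordering: input channel c*r*r + i*r + j supplies the
    sub-pixel at row offset i and column offset j of output channel c.
    """
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        source = features[c * r * r + i * r + j][h][w]
                        out[c][h * r + i][w * r + j] = source
    return out
```

Because r must be an integer for the channel count to divide evenly, the sketch also makes visible why this operation cannot express fractional factors such as ×1.4.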
Building upon Ledig et al.'s work, Lim et al. [11] proposed an enhanced SR model that incorporated insights regarding batch normalization. They postulated that removing batch normalization from the residual blocks could lead to improved performance for SR tasks. This is because batch normalization tends to normalize the input, which may reduce the network's ability to capture and amplify the fine details required for SR. Removing batch normalization not only reduces memory requirements but also makes the network faster. Additionally, the work by Shi et al. [8] contributed to the development of various CNN-based approaches for SR, including the methods proposed in [9,19,20,21]. These methods aimed to enhance feature extraction capabilities specifically tailored for SR problems, further advancing the state of the art in SR research.
After the success of CNNs in SR tasks, researchers explored the use of generative adversarial networks (GANs) to further improve SR performance. Several works, such as [19,22,23], introduced different GAN architectures for extrapolating LR images to higher resolution. ESRGAN (Enhanced Super-Resolution Generative Adversarial Network), proposed by Wang et al. [24], introduced a perceptual loss function and modified the generator network to produce HR images. This perceptual loss function aimed to align the visual quality of the generated HR images with that of the ground truth HR images, improving the perceptual realism of the results.
In Real-ESRGAN [25], the authors addressed the issue of training on LR images that were downsampled with simple techniques such as bicubic interpolation. They note that real-world LR images undergo various types of degradation, compression, and noise, unlike simple interpolation-based downsampling. To simulate realistic LR images during training, they proposed a novel technique that subjected the training images to various degradation processes, mimicking real-life scenarios. Additionally, Real-ESRGAN introduced a U-Net discriminator to enhance the adversarial training process and improve the quality of the generated HR images.

Method
We describe the three main components of our approach in this section, along with a pictorial representation in Figure 1. In Section 3.1, we introduce the backbone of our framework. We represent the LR image as a feature map, which serves as the basis for subsequent processing and analysis. In Section 3.2, we demonstrate how we find the semi-local region of an arbitrary point in the HR image. This region contains valuable information that helps determine the corresponding RGB value. In Section 3.3, we present the Overlapping Windows technique, which plays a crucial role in predicting the RGB value of a point in the HR image. We accomplish this by leveraging the semi-local region extracted around the sampling points of the feature map. These three parts collectively form the foundation of our approach.

Backbone Framework
To extract features from the LR image, we employ the enhanced deep residual network (EDSR) [11]. Specifically, we utilize the baseline architecture of EDSR, which consists of 16 residual blocks.

$$\psi = \mathrm{EDSR}(I_{LR})$$

Given an LR image denoted as $I_{LR} \in \mathbb{R}^{H \times W \times C}$, we express it in the form of a feature map $\psi \in \mathbb{R}^{P \times Q \times D}$. Here, H and W represent the height and width of the LR image, respectively, and C signifies the number of channels. P and Q represent the spatial dimensions of the feature map, and D denotes the depth of the feature map.

Locating the Semi-Local Region
In our scenario, we aim to predict the RGB value at any arbitrary point in a continuous HR image of arbitrary dimensions. Let $I_{HR} \in \mathbb{R}^{X \times Y \times C}$ represent the HR image. To predict the RGB value at a specific point, we first select a point of interest. Then, we identify its spatially equivalent point in the feature map ψ obtained from the LR image using bilinear interpolation, denoted as $\mho_{BI}$.
This mapping relates the 2D coordinates of the point in ψ to those of the point in $I_{HR}$. Furthermore, we extract a square semi-local region around this corresponding point. The size of this region is set by a length parameter of M units, where one unit along the length and breadth of the square corresponds to the inverse of the feature-map dimensions P and Q, respectively, as defined in Equation (5); these units are used to find the discrete positions in the semi-local region. Once we have identified the square semi-local region around the corresponding point in the feature map ψ, we extract M × M depth features from this region using Equation (3). These depth features capture the information needed to predict the RGB value at the desired point in the HR image. To extract them, we employ a closest Euclidean distance approach, denoted by $\eth_{ED}$. Each point within the M × M region in ψ is mapped to the nearest point in the latent space, which represents the extracted depth feature. Figure 2 illustrates how features are selected from the feature map. This mapping ensures that we capture the most relevant information from the semi-local region.
Thus X holds the 2D coordinates of all the M × M points.
Figure 3 illustrates how the semi-local region is identified and used to extract the M × M depth features from the feature map ψ.This depiction helps to visualize the steps involved in the feature extraction process.
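The procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the HR point is mapped into the feature map through normalised pixel-centre coordinates in [0, 1], the M × M grid is centred on that point with spacings 1/P and 1/Q, and positions that fall outside the feature map are clamped to the border. The names `nearest_cell` and `semi_local_region` are hypothetical.

```python
import math


def nearest_cell(coord, size):
    """Index of the cell whose centre is closest to a normalised coordinate.

    Cell k spans [k/size, (k+1)/size), so flooring coord*size selects the
    nearest centre. Out-of-range indices are clamped (an assumption).
    """
    index = math.floor(coord * size)
    return min(max(index, 0), size - 1)


def semi_local_region(feature_map, hr_index, hr_shape, m):
    """Collect the M x M depth features around an HR pixel.

    feature_map: P x Q nested list whose entries are depth vectors.
    hr_index:    (row, column) of the query pixel in the HR image.
    hr_shape:    (rows, columns) of the HR image.
    """
    p, q = len(feature_map), len(feature_map[0])
    # Assumed mapping: normalised pixel-centre coordinates shared by both grids.
    centre_y = (hr_index[0] + 0.5) / hr_shape[0]
    centre_x = (hr_index[1] + 0.5) / hr_shape[1]
    # Offsets in units of one feature-map cell, symmetric around the centre.
    offsets = [k - (m - 1) / 2 for k in range(m)]
    region = []
    for dy in offsets:
        row = []
        for dx in offsets:
            y = centre_y + dy / p  # unit length along rows is 1/P
            x = centre_x + dx / q  # unit length along columns is 1/Q
            row.append(feature_map[nearest_cell(y, p)][nearest_cell(x, q)])
        region.append(row)
    return region
```

With a 48 × 48 feature map and M = 6, each query point would gather 36 depth vectors, compared with the four used when only the nearest loci are considered.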

Overlapping Windows
After extracting the semi-local region $S \in \mathbb{R}^{M \times M \times D}$, our objective is to obtain the RGB value of the center point using this region. To achieve this, we employ an overlapping window-based approach. We start with four windows, each of size M − 1, positioned at the four corners of S. Each window extracts information from its respective region and passes it on to the next window in the process. With each iteration, the size of the window decreases by 1 until it reaches a final size of $M/2$. This iterative process ensures that information is progressively gathered and refined towards the center point. It allows us to capture and utilize the information from the semi-local region while focusing on the features that are most relevant for determining the RGB value.
In each iteration i, where the window size decreases by 1 for the next step, we use weights $w_i$ to combine the features from all four corners. This ensures that the information from each corner is properly incorporated and made available for the subsequent iteration. In the last step, we take a final window of size 2, but instead of being positioned at the corners as in previous iterations, it is centered around the target point of interest. The features extracted from this final window are then passed through a multilayer perceptron (MLP) to make the final prediction.
By varying the window positions and sizes throughout the iterations, we capture and aggregate the relevant information from the semi-local region. This approach allows us to make accurate predictions at the target point, utilizing the combined features from all iterations and the final MLP-based processing. Figure 4 shows the working of the overlapping windows.
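The window geometry described above can be enumerated with a short sketch. It covers only where the windows sit, not the learned weights that combine the four corners or the MLP. It rests on two assumptions: the smallest corner window has size M/2 (read from the text for an even M), and every window is laid over the same M × M grid. The function name `window_schedule` is hypothetical.

```python
def window_schedule(m):
    """List (window_size, top_left_positions) for each iteration.

    Corner windows shrink from M-1 down to M/2 (assumed lower bound),
    followed by one window of size 2 centred on the target point.
    """
    if m < 4 or m % 2 != 0:
        raise ValueError("this sketch assumes an even M of at least 4")
    schedule = []
    for size in range(m - 1, m // 2 - 1, -1):
        last = m - size  # largest top-left index that keeps the window inside
        corners = [(0, 0), (0, last), (last, 0), (last, last)]
        schedule.append((size, corners))
    centre = m // 2 - 1  # top-left index of the centred 2 x 2 window
    schedule.append((2, [(centre, centre)]))
    return schedule
```

For M = 6 the sketch yields corner windows of sizes 5, 4 and 3, then a single 2 × 2 window whose top-left index is (2, 2), so that it straddles the centre of the region.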

Dataset
We used the OCT500 [26] dataset and randomly sampled 524 images from it to train our network. The sample consists of 300 3 × 3 OCTA images and 224 6 × 6 OCTA images. For evaluation, 80 images were selected, and we report the results using the peak signal-to-noise ratio (PSNR) metric.

Implementation Details
During the training process, we downsample each image using bicubic interpolation in PyTorch [27]. The downsampling factor is selected at random, which introduces the desired level of degradation to the images. For training, we use a batch size of 16 images. From each HR image, we randomly select 1500 points for which we aim to calculate the RGB values. These points serve as the targets for our network during the optimization process.
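The point selection can be sketched as below. The text does not say whether points are drawn with or without replacement, so this sketch assumes independent uniform draws with replacement; the function name `sample_query_points` and the `seed` parameter are illustrative.

```python
import random


def sample_query_points(height, width, count=1500, seed=None):
    """Draw `count` (row, column) pixel positions uniformly from an HR image.

    Sampling is with replacement (an assumption); pass a seed for
    reproducible draws.
    """
    rng = random.Random(seed)
    return [(rng.randrange(height), rng.randrange(width)) for _ in range(count)]
```

The RGB values at these positions in the HR image would then serve as the regression targets for the corresponding predictions.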
To optimize the network, we employ the L1 loss function and use the Adam optimizer [28]. The learning rate is initialized as $1 \times 10^{-4}$ and is decayed by a factor of 0.3 at epochs 40, 60 and 70. We train the network for a total of 100 epochs. Furthermore, each LR image is converted into a feature map of size 48 × 48 with a depth of 64 using the EDSR-baseline architecture.
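The resulting learning-rate schedule can be written out explicitly. This sketch assumes zero-indexed epochs and that each decay takes effect from the listed epoch onward, which is how a multi-step schedule such as PyTorch's `MultiStepLR` behaves; whether the authors used that scheduler is not stated. The function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.3, milestones=(40, 60, 70)):
    """Learning rate in effect at a given zero-indexed epoch.

    The rate is multiplied by `gamma` once for every milestone that has
    been reached (assumed to apply from the milestone epoch onward).
    """
    reached = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** reached
```

Under these assumptions the rate stays at 1 × 10⁻⁴ for epochs 0 to 39, then drops to 3 × 10⁻⁵, 9 × 10⁻⁶ and finally 2.7 × 10⁻⁶ for epochs 70 to 99.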

Quantitative Results
In Figure 5, we present a comparison of the performance of our proposed OW-SLR method against existing works. The original image patch is first downsampled to a lower resolution using bicubic interpolation: a 96 × 96 patch is reduced to 24 × 24 (first row), 32 × 32 (second row) and 48 × 48 (third row). It is evident that there is a significant loss of image quality in the LR patches compared to the ground truth (GT) image. However, our model outperforms the other existing methods, showing better preservation of details and higher fidelity when the LR image is extrapolated to a higher scale. Our architecture uses the same set of weights to reproduce all of the given results, whereas the other methods require a different set of weights to be trained for each new scale. The PSNR results of each image are shown in Table 1.
It is worth noting that our model achieves these results for different scaling factors using the same set of weights trained once. In contrast, the other models would need to be retrained for each new scale to which the LR image is extrapolated. This highlights the versatility and efficiency of our model in handling various scaling factors without the need for additional training.

Table 1. PSNR result of each of the input images across different methods shown in Figure 5.

In Table 2, we provide the time taken by the proposed model to upscale by different factors after being trained just once. In Table 3, we present the results of this technique compared to the existing state-of-the-art methods on the OCT500 [26] dataset. The evaluation metric used in this case is the peak signal-to-noise ratio (PSNR). Our work demonstrates superior performance compared to LIIF, highlighting the effectiveness of considering the semi-local region instead of focusing solely on four specific locations. By incorporating the information from the semi-local region, our approach achieves improved PSNR.
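For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it for single-channel images stored as nested lists and assumes a peak value of 255, as for 8-bit images; the peak value and any colour-channel handling used in the reported experiments are not stated in the text. The function name `psnr` is illustrative.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two equal-sized images.

    Both images are nested lists of rows. Returns infinity when the
    images are identical (zero mean squared error).
    """
    flat_reference = [value for row in reference for value in row]
    flat_estimate = [value for row in estimate for value in row]
    if len(flat_reference) != len(flat_estimate):
        raise ValueError("images must have the same number of pixels")
    squared_errors = [(a - b) ** 2 for a, b in zip(flat_reference, flat_estimate)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the upscaled image is closer to the ground truth, so the metric rewards methods that preserve fine detail.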

Conclusions and Future Work
OCTA images help in the diagnosis of retinal diseases. However, due to factors such as speckle noise, movement of the eye, and hardware limitations, we lose intricate details in the capillaries that play a crucial role in correct diagnosis. We propose an architecture that upscales a given LR image to arbitrary higher dimensions with enhanced image quality. First, we extract the image features using a backbone architecture. We then select a random point in the HR image and calculate its equivalent spatial point in the extracted feature map. We find the semi-local region around this calculated point and pass it through the proposed Overlapping Windows architecture. Finally, an MLP is used to predict the RGB value from the output of the overlapping window architecture. A PSNR of 17.93 is achieved on the OCT500 dataset, which outperforms other state-of-the-art work. The technique also allows upscaling images to arbitrary resolution by training the architecture just once. We hope our work will help people in the medical field in their diagnosis.
While effective, this algorithm comes with a slightly higher computational cost because it considers the semi-local region. There remains potential for further enhancements in both computational efficiency and accuracy while taking the semi-local region into account. We hope this work provides a stepping stone for future research in this direction.

Figure 1. (a) An LR image is taken. (b) It is passed through EDSR [11] and a feature map is produced. (c) The semi-local region (M = 6) is located around a randomly selected point from the HR image. (d) The semi-local region is passed through the proposed Overlapping Windows. (e) This output is passed through the MLP to give the RGB value of the randomly selected point. Steps (c-e) are performed for all the points in the HR image.

Figure 2. To extract features from a feature map of size 3 × 3, we focus on a specific query point represented by a red dot. To determine which pixel location in the feature map corresponds to this query point, we compute the Euclidean distance between the query point and the center point of each pixel location. In the provided image, the black line indicates the pixel location in the feature map that is closest to the query point.

Figure 4. The iteration of overlapping windows, where the window size = M − 1 (M = 6). The feature map is assumed to be of negligible depth, and four windows are positioned at the four corners of the feature map.

Figure 5. A 96 × 96 patch is taken and its size is reduced to 24 × 24 (first row), 32 × 32 (second row) and 48 × 48 (third row) using bicubic interpolation. Our architecture uses the same set of weights to reproduce the given results, whereas the other methods require a different set of weights to be trained for each new scale. The PSNR results of each image are shown in Table 1.

Table 2. Time taken to extrapolate a 320 × image on a single Nvidia Titan V with 12 GB of memory.