In image processing, enhancement techniques focus on improving the visual quality of an image, while restoration techniques focus on recovering the image's original quality. Super-resolution and contrast stretching are examples of image enhancement, while image inpainting is an example of image restoration. Next, we present existing methods for SISR and discuss the visual tasks that are typically applied to the output of SISR models.
2.1. Super-Resolution
Several SISR methods have been proposed to recover an HR image from a single LR image. These methods can be broadly divided into handcrafted methods and deep learning methods.
Early handcrafted SISR methods include interpolation-based, statistics-based, and example-based methods. Interpolation-based methods, such as bilinear [12], bicubic [13], and edge-directed interpolation [14], estimate the HR image by interpolating LR pixel values onto the HR grid. These methods tend to generate overly smooth or blurry images. Statistics-based methods (e.g., [15]) learn the statistical relationship of the gradient profile between the HR and LR images, motivated by the observation that the shape statistics of gradient profiles are invariant to image resolution. These methods tend to produce watercolor-like artifacts when applied to visually complex images. Finally, example-based methods use conventional machine-learning algorithms, such as Random Forests (RFs) [16,17,18] and Markov Random Fields (MRFs) [19], to learn the mapping from LR to HR images. These methods tend to generate reconstructed images that contain irrelevant (hallucinated) details.
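As a concrete illustration of the interpolation-based family, the following minimal NumPy sketch upscales a grayscale image with bilinear interpolation; each HR pixel is a weighted average of its four nearest LR neighbors, which is precisely why such methods tend to look overly smooth. The function name and the toy input are ours, not from any of the cited works.

```python
import numpy as np

def bilinear_upscale(img, scale):
    """Upscale a 2-D grayscale image by `scale` with bilinear interpolation."""
    h, w = img.shape
    out_h, out_w = int(h * scale), int(w * scale)
    # Map each HR pixel coordinate back into LR space.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

lr = np.array([[0.0, 1.0], [1.0, 0.0]])
hr = bilinear_upscale(lr, 2)  # 4x4 result; corner values are preserved
```

Note that every output value lies between the neighboring input values, so no high-frequency detail can be created: the smoothing is structural, not a tuning issue.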
Recently, convolutional neural networks (CNNs) and generative adversarial networks (GANs) have been used to learn the mapping from LR to HR images. For example, Dong et al. [20] proposed SR-CNN, the first CNN-based network to efficiently learn this mapping. Kim et al. [21] proposed a CNN model, known as very deep super-resolution (VDSR), that predicts the mapping from LR to HR images using residual learning. Inspired by the VGG architecture [22], VDSR has 20 layers and is trained using extremely high learning rates. VDSR achieved state-of-the-art performance and outperformed SR-CNN [20]. It also resolved several issues of SR-CNN [20], such as its reliance on contextual information from a small image region, its slow convergence, and its need for training individual scale-dependent models.
A GAN-based network, named P-SRGAN, has recently been proposed [23] to learn the mapping from LR to HR images. P-SRGAN consists of a generator network and a discriminator network. The generator takes an LR image as input and produces an HR image, while the discriminator distinguishes generated images from real HR images, pushing the generator toward good-quality reconstructions. Although this approach achieved excellent performance, GAN-based networks are hard to train due to the Nash equilibrium problem: training is a zero-sum game between the generator (player 1) and the discriminator (player 2), in which the two players contest with each other to improve their respective objective functions [24]. In addition, GAN-based networks are highly sensitive to hyperparameter selection and often suffer from mode collapse (i.e., the generator maps different inputs to the same output) [24]. To regularize the training of GAN-based SR models and enforce the right mapping between the input and output domains, You et al. [25] proposed the GAN-CIRCLE network for constructing HR CT images from their LR counterparts. The network combines four loss functions, viz., an adversarial loss, a cycle-consistency loss, an identity loss, and a joint sparsifying transform loss, to stabilize training and enforce the right input-output mapping. Evaluation by expert radiologists on three CT datasets demonstrated GAN-CIRCLE's ability to construct HR images from noisy LR inputs. Although GAN-CIRCLE mitigates the training and mapping problems of regular GANs, the network is computationally complex and requires a relatively large GPU as well as much longer training time than regular GAN networks. Further, it fails to faithfully recover subtle structures in CT images, as discussed in [25]. We refer the reader to [26,27] for comprehensive reviews of other handcrafted and deep learning SISR methods.
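To make the structure of such a four-term objective concrete, the sketch below combines toy surrogates for the four loss components in the style described above. This is emphatically not the authors' implementation: the networks are stand-in lambdas, the joint sparsifying transform loss is approximated by total variation, and the weights lam_* are illustrative values we chose.

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).mean()

def total_variation(img):
    # Stand-in for the joint sparsifying transform loss (illustrative only).
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

def combined_loss(x_lr, y_hr, g_ab, g_ba, d_score,
                  lam_cyc=10.0, lam_idt=5.0, lam_tv=0.01):
    fake_hr = g_ab(x_lr)
    adv = -np.log(d_score(fake_hr) + 1e-8)   # adversarial term
    cyc = l1(g_ba(fake_hr), x_lr)            # cycle-consistency term
    idt = l1(g_ab(y_hr), y_hr)               # identity term
    tv = total_variation(fake_hr)            # sparsifying (TV) term
    return adv + lam_cyc * cyc + lam_idt * idt + lam_tv * tv

# Toy stand-ins for the generator and discriminator networks.
identity_net = lambda z: z
disc = lambda z: 0.9
x = np.ones((4, 4))
y = np.ones((4, 4))
loss = combined_loss(x, y, identity_net, identity_net, disc)
```

With identity generators and constant images, only the adversarial term contributes; in real training, all four terms pull the generator toward a stable, consistent LR-to-HR mapping.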
While existing deep learning models achieve excellent performance and successfully construct HR images, these models are trained using a single scale (e.g., [20,25]), have deep architectures with a very large number of training parameters (e.g., [21,23,25]), and are hard to train (e.g., [23]). To resolve these issues, we propose a simple SISR model that learns the mapping from LR to HR images. Our customized SISR model has the following virtues:
Simplicity and Stability: Our customized SISR model, inspired by VDSR (20 conv. layers) [21], has a shallower structure (7 conv. layers) with fewer training parameters. In addition, our proposed SISR model is easy to train and uses a single network, contrary to GAN-based networks, which are difficult to train and require both generator and discriminator networks. Further, our proposed model is more stable and less sensitive to hyperparameter selection than most GAN-based models. As large models with a massive number of parameters are restricted to computing platforms with large memory banks and computing capability, developing smaller, stable networks without losing representative accuracy is important for reducing the number of parameters and the storage size of the networks. This would boost the usage of these networks in limited-resource settings and embedded healthcare systems.
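A back-of-the-envelope parameter count shows why depth dominates model size for plain conv stacks. The channel width (64) and kernel size (3) below are illustrative choices, not the exact configurations of VDSR or our network.

```python
def conv_params(layers, channels=64, kernel=3, in_ch=1, out_ch=1):
    """Count weights + biases of a plain conv stack: in -> hidden... -> out."""
    widths = [in_ch] + [channels] * (layers - 1) + [out_ch]
    return sum(kernel * kernel * widths[i] * widths[i + 1] + widths[i + 1]
               for i in range(layers))

p20 = conv_params(20)  # VDSR-like depth
p7 = conv_params(7)    # a shallower 7-layer trunk
```

Under these assumptions, the 20-layer stack has roughly 3.6x the parameters of the 7-layer one, since almost all parameters sit in the 64-to-64 hidden convolutions.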
Multiple Scales Training: Our SISR model is trained with different scale factors at once. The trained network can then be tested with any scale used during training. As discussed in [21], training a single model with multiple scale factors is more efficient, accurate, and practical than training and storing several scale-dependent models.
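The multi-scale training data can be sketched as follows: for each scale factor, the HR image is degraded down and brought back to the HR grid, and all resulting pairs feed the same model. This is a schematic of the pipeline only; the average-pool/nearest-neighbor degradation below is our simplification (bicubic resampling is the usual choice).

```python
import numpy as np

def degrade(hr, scale):
    """Build an LR training input for one scale factor: average-pool the HR
    image down by `scale`, then upsample back to the HR grid (nearest-
    neighbor here for simplicity)."""
    h, w = hr.shape
    hr_c = hr[:h - h % scale, :w - w % scale]          # crop to a multiple
    lr = hr_c.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

hr = np.arange(144, dtype=float).reshape(12, 12)
# One shared model sees every scale during training:
batch = [(degrade(hr, s), hr) for s in (2, 3, 4)]
```

Because the inputs and targets share one spatial grid regardless of scale, a single network can be trained on, and later applied to, any of these factors.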
Context: We utilize information from the entire image region. Existing methods rely on the context of either small image regions (e.g., [20]) or large image regions (e.g., [21]), but not the entire image. Our experimental results demonstrate that using the entire image region leads to better overall performance while reducing computation.
Raw Image Channels: We propose to compute the residual image from the raw image (RGB or grayscale) directly, instead of converting the image to a different color space (e.g., YCbCr [21]). The residual image is computed by subtracting the LR image, upscaled via interpolation to match the size of the reference, from the HR reference image. The computed residual image thus contains the image's high-frequency details. The main benefit of working directly in the raw color space is that we decrease the total computational time by dropping two operations: (1) converting from the raw color space to another color space (e.g., YCbCr), and (2) converting the image back to its original color space. Our customized SISR model computes the residual images directly from the original color space and learns to estimate them. To construct an HR image, the estimated residual image is added to the upsampled LR image.
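The residual construction above can be written in a few lines. The arrays and the nearest-neighbor upscaling below are toy stand-ins (the paper's interpolation would be bicubic); the point is only the arithmetic: residual = HR - upscale(LR), and adding a perfect residual back to the upscaled LR recovers the HR image exactly.

```python
import numpy as np

def upscale_nearest(lr, scale):
    # Stand-in for interpolation to the reference size.
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

hr = np.array([[0.0, 0.2], [0.4, 1.0]])   # toy HR reference
lr = np.array([[0.4]])                    # toy 1-pixel LR image
up = upscale_nearest(lr, 2)               # upscaled to the HR grid
residual = hr - up                        # high-frequency details
reconstructed = up + residual             # exact when the residual is perfect
```

In training, the network only has to estimate `residual`; at inference, its prediction is added to the upscaled LR input, all in the raw color space.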
Combined Learning Loss: We propose to train the SISR model using a loss function that combines the advantages of the mean absolute error (MAE) and the multi-scale structural similarity (MS-SSIM). Our experimental results show that MAE assesses average model performance better than other loss metrics, and that MS-SSIM preserves the contrast in high-frequency regions better than other loss functions (e.g., SSIM). To capture the best characteristics of both loss functions, we combine the two terms (MAE + MS-SSIM).
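A sketch of such a combined objective is L = alpha * MAE + (1 - alpha) * (1 - MS-SSIM). The MS-SSIM below uses global image statistics per scale instead of the usual sliding Gaussian window, so it illustrates the structure of the loss rather than a faithful implementation; alpha = 0.84 is a commonly used weight, not necessarily the paper's value.

```python
import numpy as np

C1, C2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizers (data range 1)

def ssim_terms(x, y):
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)       # luminance
    cs = (2 * cov + C2) / (x.var() + y.var() + C2)            # contrast-structure
    return lum, cs

def downsample2(x):
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def ms_ssim(x, y, levels=3):
    score = 1.0
    for level in range(levels):
        lum, cs = ssim_terms(x, y)
        score *= cs                      # contrast-structure at every scale
        if level == levels - 1:
            score *= lum                 # luminance only at the coarsest scale
        else:
            x, y = downsample2(x), downsample2(y)
    return score

def combined_loss(pred, target, alpha=0.84):
    mae = np.abs(pred - target).mean()
    return alpha * mae + (1 - alpha) * (1 - ms_ssim(pred, target))

x = np.linspace(0.0, 1.0, 64).reshape(8, 8)
perfect = combined_loss(x, x)        # identical images -> zero loss
shifted = combined_loss(x, x + 0.1)  # brightness shift -> positive loss
```

The MAE term penalizes average pixel error, while the MS-SSIM term penalizes losses of structure and contrast across scales; the weighted sum captures both.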
In summary, our proposed SISR model has a simple architecture, is stable, and is trained using the entire raw image and multiple scales at once to minimize the proposed loss function. It reduces the number of training parameters and the storage size without performance degradation. These advantages would boost the usage of SISR models under limited computational resources and facilitate their deployment in clinical settings for potential real-time healthcare applications.
2.2. Visual Task Analysis
Existing methods for medical image analysis apply SISR models to construct HR images and then feed the constructed HR images into separate models for individual tasks. For example, various methods apply deep learning-based SISR models to LR images and use the constructed HR images as input to separate segmentation (e.g., U-Net [23,28,29]) and classification (e.g., VGG [23,28] and DenseNet [30]) models. This traditional approach is inefficient because it involves unnecessary repetition: multiple task-specific models are learned (end-to-end) in isolation.
Contrary to previous works that separate image enhancement from other visual tasks, we propose to use our customized SISR model as a shared representation to simultaneously learn multiple subsequent visual tasks. Specifically, the weights of our SISR model, which learns the mapping from LR to HR, are directly used to simultaneously learn tasks such as image segmentation and classification. Using the proposed SISR model as a shared backbone improves generalization and avoids the unnecessary repetition of learning visual task models in isolation. This can decrease resource utilization and training time, and boost the use of deep learning models in limited-resource settings.
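The shared-representation idea can be sketched without any deep learning framework: one trunk computes features once, and each task head consumes those features. All layer shapes and functions below are illustrative toys, not our actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk(img, w):
    # Stand-in for the pretrained SISR trunk (the shared representation).
    return np.maximum(img @ w, 0.0)

def seg_head(feat, w):
    # Visual task 1: per-pixel scores with the same spatial size as the features.
    return 1.0 / (1.0 + np.exp(-(feat @ w)))

def cls_head(feat):
    # Visual task 2: a single abnormality score pooled over the feature map.
    return 1.0 / (1.0 + np.exp(-feat.mean()))

img = rng.random((16, 16))
w_t = rng.standard_normal((16, 16)) * 0.1
w_s = rng.standard_normal((16, 16)) * 0.1

feat = trunk(img, w_t)     # computed once, shared by every head
mask = seg_head(feat, w_s) # segmentation output
score = cls_head(feat)     # classification output
```

Because `feat` is computed a single time, adding another head costs only that head's parameters, rather than a full end-to-end model per task.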
2.3. Contributions
In this paper, we propose the Hydra, a deep learning approach that consists of two components: a shared trunk and computing heads. The trunk is a customized SISR model that learns the mapping from LR to HR. The trained trunk is then appended with task-specific layers to learn multiple visual tasks in medical images. Figure 1 depicts the main difference between the Hydra and existing works. As can be seen from the figure, the Hydra trunk is used as a shared backbone to learn multiple visual tasks (heads). In contrast, the majority of existing methods (the traditional approach to task analysis) use the constructed HR image as input to multiple individual models. The main contributions of this paper can be summarized as follows:
We propose the Hydra approach for enhancing medical image resolution and for visual task analysis. The Hydra consists of two components: a shared trunk and computing heads.
The Hydra trunk is a customized SISR model that learns the mapping from LR to HR. This SISR model has a simple architecture and is trained using the entire raw image and multiple scales at once to minimize a proposed loss function. Our experimental results show that the proposed SISR model, which requires markedly less training time and fewer parameters, achieves state-of-the-art performance.
We propose to append the customized SISR trunk with multiple computing heads to learn different visual tasks in medical images. We evaluate our approach on CXR datasets, generating HR representations and then jointly performing lung segmentation (visual task 1) and abnormality classification (visual task 2). We focus mainly on these two tasks because classification and segmentation are the key tasks in most medical image analysis applications.
We empirically demonstrate the superiority and efficiency of our approach, in terms of performance and computation, for SR and medical image analysis as compared to the traditional approach.
Next, we present the CXR datasets used to evaluate the proposed Hydra and provide detailed descriptions of the Hydra trunk and computing heads.