3.2.1. Subjective Assessment Techniques
1. Expert Radiologist Review
Expert radiologist review is a widely used method for subjective image quality assessment, in which radiologists evaluate images based on contrast, sharpness, and the visibility of anatomical structures. Their assessments draw on clinical experience and expertise, making them valuable for identifying subtle image quality issues that objective metrics may not capture.
Several studies have investigated the reliability of expert radiologist reviews for MIQA; Table 2 shows some of the techniques used by various researchers.
These studies have shown that radiologists can provide consistent and accurate image quality assessments, but there is still some variability between radiologists. A Mean Opinion Score (MOS) is used to overcome these challenges: it averages the scores given by multiple reviewers to achieve a consensus on image quality. This method helps mitigate individual biases and provides a more balanced image quality evaluation. According to the study in [13], MRI images are rated on a subjective scale, and the average scores from multiple reviewers provide an overall quality metric. This method is beneficial when a single expert’s opinion may not be sufficient and a collective judgment is preferred. Nevertheless, MOS has limitations such as subjectivity, observer bias, limited granularity, and difficulty handling complex image quality issues.
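In its simplest form, the MOS of an image is the arithmetic mean of the individual reviewer scores:

$$\mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{N} s_i,$$

where $s_i$ is the rating assigned by the $i$-th of $N$ reviewers.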
2. Visual Grading Analysis (VGA)
VGA is a subjective assessment technique that involves grading images on a predefined scale based on visual quality and diagnostic acceptability. VGA is widely used in clinical practice to evaluate image quality, providing a standardized method for comparing images. The procedures of some selected studies are summarized in Table 3.
Several studies have investigated the reliability and validity of VGA for MIQA, along with the expert radiologist review. These studies have shown that both can provide consistent and reliable image quality assessments, but there is still some variability between observers. Training and calibration can help reduce this variability, but objective metrics are also needed to provide consistent and reliable assessments.
VGA focuses on image quality evaluation through subjective grading based on predefined criteria, assessing attributes such as contrast, noise, and sharpness. It is moderately subjective, with standardized scales helping to improve consistency, and is primarily used in research and image optimization. In contrast, an expert radiologist review involves comprehensive image interpretation to identify pathological findings such as tumors, fractures, and anomalies. This process is highly subjective, relying on clinical expertise and contextual analysis, and is essential for patient diagnosis and treatment planning.
3. Inter-Observer and Intra-Observer Reliability
Inter-observer reliability concerns the agreement between different observers assessing medical image quality, ensuring that the criteria are applied consistently across individuals. Intra-observer reliability refers to the repeatability of an individual observer’s assessments over time [7,29]. Achieving high inter- and intra-observer reliability is challenging when human observers are inconsistent or vary in their subjective judgments. Approaches to improving consistency include standardized protocols, rigorous training, and AI/ML programs, with deep learning models such as CNNs offering the potential to provide objective and reproducible evaluations. These are essential considerations for accurate diagnosis, improved patient outcomes, and reliable quality assessment in medical imaging. To overcome the challenges of this method, the Intraclass Correlation Coefficient (ICC) [27,35,39], a statistical index, has been employed. ICC is a statistical measure commonly used to assess the consistency or agreement between multiple observers. It can be applied to binary classification and scoring systems, depending on how the image quality is rated.
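Several ICC variants exist; one widely used form, ICC(2,1) (two-way random effects, absolute agreement, single rater), is computed from the mean squares of a two-way ANOVA:

$$\mathrm{ICC}(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\left(MS_C - MS_E\right)},$$

where $MS_R$, $MS_C$, and $MS_E$ are the mean squares for subjects (images), raters, and residual error, respectively, $k$ is the number of raters, and $n$ is the number of images.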
4. Calibration and Consensus Methods
Calibration and consensus methods are employed to reduce variability among radiologists. These methods involve training sessions and establishing standardized evaluation criteria, ensuring assessment consistency [11]. Calibration sessions help align grading criteria among radiologists for prostate MRI images, minimizing subjective differences and enhancing the reliability of image quality evaluations [16]; the Double Stimulus Continuous Quality Scale (DSCQS) method achieves consensus in CT image assessments, where images are evaluated in pairs to compare quality, providing a more structured approach to subjective evaluations.
These subjective image quality assessment techniques provide valuable insights into the quality of medical images, ensuring they meet the necessary standards for accurate diagnosis. Each method has its strengths and limitations, and the choice of technique often depends on the specific requirements of the study and the type of images being assessed. Combining these subjective techniques with objective measures can enhance the overall image quality assessment and improve diagnostic outcomes. In most instances, the subjective assessments are combined and used for evaluations and annotation of datasets.
3.2.2. Objective Assessment Techniques
Accurate image quality assessment is crucial for ensuring reliable diagnostics and effective treatment planning in medical imaging. Various image quality assessment (IQA) metrics are employed to evaluate the quality of images, particularly in modalities such as chest X-rays, MRI, and CT scans. These metrics are categorized into three main types: full-reference image quality assessment (FR-IQA), reduced-reference image quality assessment (RR-IQA), and no-reference image quality assessment (NR-IQA) [1,2,13,16,20]. In some instances, however, the categories used are full-reference (FR-IQA), distribution-based (DB-IQA), and no-reference image quality assessment [12].
1. Full-Reference Image Quality Assessment (FR-IQA)
FR-IQA methods require a reference image of perfect quality against which the test image is compared, using metrics such as the peak signal–noise ratio (PSNR) and the Structural Similarity Index (SSIM).
The following studies have investigated the reliability and validity of full reference methods for MIQA. These studies have shown that full reference methods can provide consistent and reliable assessments of image quality, but they require the availability of a high-quality reference image.
Signal–Noise Ratio (SNR): This measurement evaluates the clarity of an image by comparing signal strength to background noise [27,28,31,32,33,34].
Contrast–Noise Ratio (CNR): CNR focuses on the visibility of structures by measuring the contrast between two regions relative to noise [27,28,31,32,33,34].
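As an illustration, a minimal Python sketch of region-based SNR and CNR from user-selected regions of interest is given below; the synthetic image, the ROI placement, and the particular definitional variant (several coexist in the literature) are assumptions for illustration only.

```python
import numpy as np

def snr(signal_roi: np.ndarray, noise_roi: np.ndarray) -> float:
    """Region-based SNR: mean signal intensity divided by the standard
    deviation of a background (noise-only) region."""
    return float(signal_roi.mean() / noise_roi.std())

def cnr(roi_a: np.ndarray, roi_b: np.ndarray, noise_roi: np.ndarray) -> float:
    """CNR: absolute difference between the mean intensities of two regions,
    relative to the background noise level."""
    return float(abs(roi_a.mean() - roi_b.mean()) / noise_roi.std())

# Synthetic example; in practice the ROIs are drawn on tissue and
# background regions of the actual scan.
rng = np.random.default_rng(0)
img = rng.normal(100.0, 5.0, (256, 256))
img[64:128, 64:128] += 50.0  # bright "tissue" patch
print(snr(img[64:128, 64:128], img[:32, :32]))
print(cnr(img[64:128, 64:128], img[160:224, 160:224], img[:32, :32]))
```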
Structural Similarity Index Measure (SSIM): SSIM measures the similarity between two images by focusing on structural information and assessing luminance, contrast, and structure. It is widely used due to its ability to closely correlate with human visual perception [16,20].
The SSIM index is calculated over local image windows: the measure is computed between two windows x and y of common size N × N, where x is a window from the reference image and y is the corresponding window from the test image.
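For reference, the standard SSIM formulation between two such windows is

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$ and $\mu_y$ are the window means, $\sigma_x^2$ and $\sigma_y^2$ the variances, $\sigma_{xy}$ the covariance, and $c_1$ and $c_2$ small constants that stabilize the division.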
Multi-Scale Structural Similarity Index Measure (MS-SSIM): An extension of SSIM, MS-SSIM evaluates images at multiple scales, providing a more comprehensive analysis by considering luminance, contrast, and structural similarity across different resolutions [16,20].
Feature Similarity Index Measure (FSIM): FSIM focuses on the similarity of features between images, particularly those that are perceptually significant. This metric is crucial for assessing the visual quality of images [16,20].
Peak Signal–Noise Ratio (PSNR): PSNR measures the ratio between the maximum possible power of a signal and the power of the noise that degrades the image’s quality. It is a standard metric in FR-IQA, commonly used for evaluating image compression and restoration [16,20].
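As a brief illustration, both PSNR and SSIM can be computed with scikit-image; the synthetic 8-bit images and the noise level below are placeholders standing in for a reference scan and its degraded counterpart.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, (256, 256), dtype=np.uint8)
# Simulate a degraded version by adding Gaussian noise to the reference.
noisy = reference.astype(float) + rng.normal(0.0, 10.0, reference.shape)
degraded = np.clip(noisy, 0, 255).astype(np.uint8)

print(peak_signal_noise_ratio(reference, degraded))  # in dB; higher is better
print(structural_similarity(reference, degraded))    # in [-1, 1]; 1 = identical
```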
Visual Information Fidelity (VIF): This metric quantifies the amount of information that can be extracted by a human observer from the distorted image relative to the reference image, providing insight into the visual quality loss [16].
Information Fidelity Criterion (IFC): Similarly to VIF, IFC measures the fidelity of visual information in the distorted image, offering a quantitative evaluation of image quality degradation [16].
Noise Quality Measure (NQM): NQM evaluates image quality by considering noise characteristics, particularly in assessing medical images, where noise can obscure essential details [16].
Visual Signal–Noise Ratio (VSNR): This metric focuses on the signal–noise ratio in images, which is particularly important in medical imaging to differentiate between meaningful signal and noise [16].
Information Content-Weighted SSIM (IWSSIM): A variant of SSIM, IWSSIM weights the structural similarity by the information content, providing a more nuanced assessment of image quality based on content importance [17].
2. No-Reference Image Quality Assessment (NR-IQA)
No-reference methods are widely used for the objective assessment of image quality. These methods assess image quality without needing a reference image, using algorithms to evaluate features such as noise, blur, and contrast. The following studies used no-reference methods for MIQA and have shown that no-reference scores can provide consistent and reliable image quality assessments.
Natural Image Quality Evaluator (NIQE): A reference-free metric that assesses image quality based on natural scene statistics, often used when reference images are unavailable [9].
Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE): BRISQUE uses ML models trained on natural images with known distortions to evaluate the quality of a given image. It is effective for various types of distortion [10,12,19,20].
Blind Image Quality Assessment (BIQA): Techniques under this category assess image quality without reference images, often leveraging DL approaches to model complex distortions [2].
Maximum Mean Discrepancy (MMD): MMD measures the distance between the distributions of authentic and generated images in a feature space, which helps evaluate the consistency of generated images with real ones [9].
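A minimal numpy sketch of the (biased) MMD estimator with an RBF kernel over feature vectors is shown below; in practice the features would come from a pretrained network, and the kernel bandwidth used here is an illustrative choice.

```python
import numpy as np

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 0.01) -> float:
    """Biased estimate of squared MMD between samples X (n, d) and Y (m, d)
    under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return float(kernel(X, X).mean() + kernel(Y, Y).mean()
                 - 2.0 * kernel(X, Y).mean())

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, (100, 64))  # stand-ins for real-image features
gen_feats = rng.normal(0.5, 1.0, (100, 64))   # stand-ins for generated-image features
print(mmd_rbf(real_feats, gen_feats))         # larger value = distributions differ more
```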
3. Reduced-Reference Image Quality Assessment (RR-IQA)
Reduced-reference methods use partial information from the reference image to assess quality, striking a balance between full-reference and no-reference techniques.
Spatial Efficient Entropic Differencing: According to the study by Nikiforaki et al. [30], this metric measures the difference in entropy between reference and distorted images, providing a measure of quality degradation.
4. Distribution-Based Metrics
Distribution-based image quality metrics assess the quality of images by comparing statistical distributions of image features between reference (or real) images and generated (or distorted) images. These metrics often focus on the overall distribution of pixel intensities, textures, or other image features rather than individual pixel comparisons. As in [12], the following metrics are distribution-based.
Fréchet Inception Distance (FID): This metric compares the distribution of feature representations of authentic and generated images using the Fréchet distance. It considers the mean and covariance of these features, often extracted using a neural network trained on a large dataset like ImageNet [9].
Kernel Inception Distance (KID): Like FID, KID uses the Inception network to extract features from images and compares these using polynomial kernel Maximum Mean Discrepancy (MMD). It has the advantage of providing unbiased estimates even with small sample sizes.
Inception Score (IS): IS uses the output distribution of a classifier (usually InceptionV3) to evaluate the diversity and quality of generated images by comparing each image’s conditional label distribution with the marginal label distribution over all generated images.
These metrics are widely used to assess the quality of images generated by GANs. They enable the comparison of statistical properties in a deep feature space, ensuring that generated images closely resemble real ones. Operating in feature space rather than raw pixels makes them perceptually relevant.
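Concretely, FID fits a Gaussian $(\mu, \Sigma)$ to each feature set and computes $\|\mu_r - \mu_g\|^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$. Below is a minimal numpy/scipy sketch over precomputed feature vectors; the Inception feature-extraction step is omitted, and the random features are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts introduced by sqrtm
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(0.0, 1.0, (500, 64)), rng.normal(0.3, 1.0, (500, 64))))
```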
3.2.3. DL Model-Based Approaches
Medical images are traditionally assessed manually by technicians or radiologists, making quality evaluation subjective and time-consuming. DL models that replace or support manual inspection can improve the efficiency of image quality evaluation and ensure the stability of the evaluation outcome [13]. IQA is essential for accurate diagnostics and effective treatment results [1]. DL methods are considered valuable for determining which factors may affect image quality by providing automated and objective image assessments. As predicted, the integration of AI in health will optimize the workflow of clinical practice, improve patient outcomes, and change the current delivery system of healthcare. However, the data quality gap is a significant bottleneck from model development to clinical application in medical AI research. Many solutions have relied on already overburdened medical practitioners to perform data classification, often resulting in slow progress [23].
The following models are frequently used in these selected studies:
As mentioned in the methodology of [21], the authors proposed an automated IQA framework based on multi-task learning. Convolutional neural networks (CNNs), such as VGG, ResNet, and DenseNet, serve as feature extractors that learn shared knowledge from images. Among VGG16, ResNet18, ResNet34, ResNet50, and DenseNet121, VGG16 achieved the best performance.
The study by Stępień et al. [1] introduced a method for enhancing MRI quality assessment (MRIQA) by fusing multiple DL architectures. The approach leveraged the feature extraction and classification strengths of well-known models such as VGG, ResNet, and Inception. Utilizing transfer learning with networks pre-trained on ImageNet, the final classification layers were replaced with regression layers to tailor the models for quality prediction. The process involved three steps: feature extraction, in which MRI images are processed by each network to extract relevant features; feature fusion, in which the features from the various networks are combined and fed into a Support Vector Regression (SVR) module; and quality prediction, in which the SVR model maps these combined features to quality scores. The method adapted the network layers to MRI characteristics and employed a radial basis function kernel in the SVR for regression. The results demonstrated that this fusion approach, including networks like DenseNet-201, GoogLeNet, and ResNet, significantly improves quality prediction. Additionally, DeepDream visualizations revealed that the fused networks better manage distortions and provide more detailed feature responses than single networks.
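A minimal sketch of the fusion-plus-SVR stage is given below; the random feature vectors are placeholders standing in for deep features extracted by the pretrained backbones, and the MOS targets are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_images = 200
# Placeholder features from three hypothetical backbones (e.g., VGG, ResNet,
# Inception); in the actual method these come from pretrained CNNs.
feats_vgg = rng.normal(size=(n_images, 512))
feats_resnet = rng.normal(size=(n_images, 512))
feats_inception = rng.normal(size=(n_images, 1024))
fused = np.concatenate([feats_vgg, feats_resnet, feats_inception], axis=1)
mos = rng.uniform(1.0, 5.0, n_images)  # synthetic subjective quality scores

model = SVR(kernel="rbf")        # radial basis function kernel, as in the study
model.fit(fused, mos)
print(model.predict(fused[:3]))  # predicted quality scores for three images
```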
The technique proposed in [13], the Multi lEvel and multi-model Deep Quality Evaluator of MR Images (MEDQEMRIs), fused two DL networks of different complexities belonging to the same family: re-input data are fed to ResNet18, ResNet50, and their fusion. The main finding was that the quality assessment is performed by a high-level quality model trained on the scores of quality models obtained for layers of the networks.
In [23], known as DeepFundus, the InceptionResNetV2 architecture was used for model construction, trained with TensorFlow and Keras, with early stopping to prevent overfitting. This addressed the data quality gap and achieved areas under the curve (AUCs) over 0.9 in image classification concerning overall quality, clinical quality factors, and structural quality analysis on both the internal test and national validation datasets, signifying that the model accurately distinguishes between image quality categories.
The DL model in [11] is InceptionResNetV2, trained on the development set using probabilistic prostate masks and an ordinal loss function. The research examined how well a DL model could assess the diagnostic quality of bi-parametric prostate MRI, comparing the model’s results to expert consensus and to less experienced readers. This revealed that the DL model performs about as well as less experienced readers in judging quality. The study concluded that DL models, trained on more typical datasets with input from more experts, could provide reliable automatic quality checks and might even assist with or replace human visual quality assessment.
In assessing CT image quality, a study [17] introduced M2IQA, which consisted of a fusion of YOLOv8 and U-Net. This retrospective study analyzed chest CT images from 327 patients, using 1613 images from 286 patients for model training and validation, while 41 patients were reserved for ablation, comparative, and observer studies. The M2IQA method, driven by DL, utilizes a multi-view fusion strategy across three scanning planes (coronal, axial, and sagittal) to evaluate image quality for tasks like inspiration, position, radiation protection, and artifact assessment. It achieved 87% precision, 93% sensitivity, 69% specificity, and a 0.90 F1-score on an additional test set. Comparably, [37] also described a DL-based method that combines image selection, tracheal carina segmentation, and bronchial beam detection; the score obtained by this method was compared with the MOS given in the observer study.
The study [15] developed a DL model using DenseNet to assess PET image quality, encompassing acquisition, preprocessing, and training with data augmentation and cross-validation. The DL-based assessment tool reliably categorized PET images into “Good” or “Poor” quality and provided detailed image and scanning information, supporting clinical research through accurate quality assessment.
The study introduced Deep Detector IQA (D2IQA) [10], a CT image quality assessment system that mimics radiologists’ evaluation by detecting simulated lesion-like objects under different noise levels. D2IQA used a self-supervised Cascade R-CNN model trained with synthetic lesions of various shapes, sizes, and contrasts, eliminating the need for manual labels. The system showed strong detection accuracy across object sizes, contrasts, and noise levels, and it maintained a strong correlation with radiologists’ judgments across different dose levels. Compared with conventional full-reference (FR-IQA) and no-reference (NR-IQA) measures, D2IQA is better at picking up quality changes, and it worked well across different body areas and artifact types. This approach looks set to advance CT image quality assessment and improvement, with plans to extend its use to other kinds of medical imaging in future studies.
In [6], the study introduced an enhanced semi-supervised learning approach for fetal brain MRI quality assessment, integrating region of interest (ROI) consistency into the Mean Teacher model. The model employed a DL architecture based on ResNet-34 for the student and teacher networks. The student network was trained with a combined loss function consisting of a classification loss and a consistency loss between the student and teacher networks, while the teacher network’s parameters were updated using an exponential moving average. The method introduced an ROI consistency loss targeting the brain ROI, ensuring that the network focused on brain features by comparing features from masked and original images, and used conditional entropy for further refinement. It showed greater benefits with smaller labeled datasets and confirmed the importance of the additional regularization terms. This novel approach enhanced MRI quality assessment and offers potential for integration with fetal motion tracking algorithms to optimize imaging workflows. Similarly, the work in [36] described the development of MD-IQA, which used multi-scale distribution regression to reduce prediction uncertainty and improve robustness. The study used vision transformer (ViT) and CNN modules to extract global and local features to enhance the representation capability.
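A minimal PyTorch-style sketch of the exponential moving average update used in the Mean Teacher scheme described above is shown here; the toy linear model and the decay value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   decay: float = 0.99) -> None:
    """EMA update: teacher <- decay * teacher + (1 - decay) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Toy example: the teacher shares the student's architecture and starts
# from the same weights; only the student receives gradient updates.
student = torch.nn.Linear(8, 2)
teacher = torch.nn.Linear(8, 2)
teacher.load_state_dict(student.state_dict())
update_teacher(teacher, student)
```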
According to Schwyzer et al. [14], two experienced readers scored image quality from 1 (not applicable for diagnosis) to 4 (best for diagnosis). Their scores helped train a ResNet-34 DL model built with the fast.ai library on 400 images split into training, validation, and test groups. The main takeaways included the classifier’s trustworthy assessment of image quality, performance matching standard SNR measures, and steady results across different reconstruction settings.
The work in [8] presented a multi-task DL model that analyzes cardiac MRI, focusing on classifying image quality and segmenting cardiac structures. The classification branch sorts images into three motion artifact levels, with an auxiliary task of predicting the breath-hold type during imaging. The model adapted a 3D ResNet-18 architecture and used a multi-task loss; for segmentation, a 3D U-Net with a mixed loss was chosen. The results demonstrate the model’s effectiveness in tackling motion artifacts in cardiac MRI.
According to the study in [9], automated methods using image quality metrics and statistical analyses were explored while addressing the limitations of human assessment. Experts’ ratings and reaction times indicated sensitivity to image quality, with FID, MMD, and NIQE showing good correspondence, particularly for lower-quality images. A deep quality assessment model best captured subtle differences in high-quality images. The study recommended combining group analyses, spatial correlations, and distortion and perceptual metrics for comprehensive evaluation.
The GAN-guided nuance perceptual module (G2NPAN) system [42] used a GAN to assess the quality of medical fused images by integrating several advanced techniques. The Generator, a convolutional neural network with down-sampling and up-sampling phases, converted a fused image into a high-quality version using residual blocks and LeakyReLU activations. The Discriminator, another CNN, distinguished between authentic and generated images using LeakyReLU and mean absolute error. The Unique Feature Warehouse (UFW) extracted and integrated spatial features from the images at multiple scales, focusing on subtle details critical for medical assessments. Additionally, the Attention-Based Quality Assessment Network (AQA), built on VGG11, evaluated the quality of the fused image by comparing it to the high-quality reference image produced by the GAN. This network used an attention mechanism (Class Activation Mapping, CAM) to enhance interpretability and focus on significant features.
The Q-Net [25] architecture was a multi-stream, multi-output regression model designed for multi-labeled predictions, specifically targeting anatomical features in medical images. It used a combination of spatial and temporal modules, with each spatial module consisting of several convolutional layers and an LSTM layer for temporal feature extraction. The model employed Rectified Linear Unit (ReLU) activation, batch normalization, and dropout to prevent overfitting. It was trained using 5-fold cross-validation with data augmentation, aiming for real-time application with optimal performance, memory efficiency, and inference speed. The architecture was compared with other state-of-the-art models like DenseNet121, ResNet, and VggNet, focusing on quality attributes and using mean absolute error as the cost function.
At the training stage of the unsupervised anomaly-aware framework (UNO-QA) [21], an encoder with multi-scale pyramid pooling and multiple decoders for multiple scales was trained with only outstanding samples. At the inference stage, all testing samples were fed into a low-quality representation module, after which outstanding and non-outstanding samples were classified. For the non-outstanding samples, the output features of the decoders were extracted and concatenated, and feature dimension reduction and clustering were applied to subdivide the non-outstanding samples into gradable and unreadable samples. To assess the adaptability of this framework, the authors analyzed different anomaly detection models (PaDiM, PatchCore, and FastFlow) incorporated into the pipeline and observed that their low-quality representation module combined with hierarchical clustering achieved the best classification performance.
The deep convolutional neural network for automated image quality assessment (IQ-DCNN) [7] takes as input 2D patches from 3D image volumes in the axial, sagittal, and coronal orientations. The network comprised four convolutional layers, three fully connected layers, and a final regression layer to output a quantitative image quality score. It employed an anti-bias L1 loss function to align the predicted grades with the ground truth, compensating for the non-uniform grade distribution. Inspired by the VGG-16 model, the architecture was optimized using three-fold cross-validation on a dataset of 424 scans, divided into training/validation (324) and test (100) sets. Dropout regularization prevented overfitting, and patch quality grades were averaged per patient for a final assessment. The IQ-DCNN demonstrated performance comparable to human intra- and inter-observer agreement, achieving an R² of 0.78 and a kappa coefficient of 0.67, indicating substantial agreement with human experts in assessing image quality. The model effectively tracked image quality during compressed sensing reconstruction, aligning closely with expert evaluations. The IQ-DCNN successfully mimicked expert assessments of 3D whole-heart MR images and could automatically compare different reconstructed volumes. However, increasing dataset size and diversity, implementing data augmentation, or fine-tuning hyperparameters could be incorporated to achieve higher consistency with expert evaluations.
The Optimized Deep Knowledge-based NIQI (ODK-NIQI) [2] for MRI image quality assessment used a three-step approach involving a deep image prior for creating a diverse denoised image database, noise removal, and feature extraction from noisy and denoised images. It employed a ConvNet model enhanced by shuffle shepherd optimization and Mish activation, with weighted average pooling to consolidate results. Based on a pre-trained VGG-16 model, the improved deep knowledge algorithm refined hyperparameters and enhanced feature extraction. The ODK-NIQI method outperformed traditional NIQI techniques, demonstrating superior performance and consistency across standard metrics such as SROCC, RMSE, MAE, and PLCC.
In the UnSupervised learning-based Ultrasound image Quality assessment Network (US2QNet) [24], a Variational Autoencoder (VAE) was trained using a reconstruction loss to learn feature representations from pre-processed ultrasound images, optimized for both the reconstruction of images and the latent space distribution. The VAE was enhanced with a clustering module to improve the quality of feature representations, jointly optimizing the VAE’s reconstruction and clustering losses to better align image features with quality clusters. Validation on urinary bladder ultrasound images demonstrated that the proposed framework could generate clusters with 78% accuracy and perform better than state-of-the-art methods.
DistilIQA, described in [4], evaluated CT image quality using a distillation-based vision transformer. It combined a vision transformer network, which merges self-attention with convolutional operations, with a distillation setup in which a group of “teacher networks” passed knowledge on to a single “student network”. The primary goals were training and testing the model on different CT datasets, including CT scans of the chest and abdomen, and predicting image quality scores. The model’s structure comprises a convolutional stem and a vision transformer.
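As a generic sketch of the knowledge-distillation idea (not the paper’s exact loss), the snippet below matches temperature-softened student predictions to teacher predictions; the five quality levels and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target distillation: KL divergence between temperature-scaled
    teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

student_logits = torch.randn(8, 5)  # e.g., logits over 5 quality levels
teacher_logits = torch.randn(8, 5)  # e.g., averaged over the teacher ensemble
print(distillation_loss(student_logits, teacher_logits))
```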
In the work of [5], the Swin transformer is described. The Shifted Window Transformer is a vision transformer known for its hierarchical and efficient image processing capabilities. Unlike traditional CNNs, Swin transformers excel at capturing long-range dependencies within an image. They achieve this through a hierarchical structure that processes images at multiple scales, allowing the model to capture local details and broader contextual information. The “shifted window” mechanism involves dividing the image into non-overlapping windows and then shifting these windows in subsequent stages, ensuring that the model captures information from neighboring windows and offering a more comprehensive image understanding. Similarly, the authors of [38] proposed a quality control method based on YOLOv8, the Convolutional Block Attention Module (CBAM), and Swin transformers. As specified by the authors, the suggested model can be utilized in evaluating CT phantom images and holds the potential to set a new standard in quantitative assessment methodologies.
The Swin transformer-based Multiple Color-space Fusion network (SwinMCSFNet) classifier [41] detected subtle abnormalities and differentiated between various tissue types, which is crucial for accurate medical diagnoses. SwinMCSFNet was particularly useful in medical applications: its design allowed it to handle the complexity and high dimensionality typical of medical images, often with fewer parameters than traditional CNN-based methods, yet achieving superior or comparable performance. The authors focused on implementing a multiple color-space fusion network to integrate representations from various color spaces.
Semantic Aware Contrast Learning (SCL) [3] is an advanced ML technique, beneficial for tasks that require distinguishing between subtle differences in data, such as IQA in medical imaging. The core idea behind SCL is to enhance the ability of a model to differentiate between classes by focusing on semantic features, which are meaningful and relevant attributes that capture the essence of the data. In the context of IQA, SCL involves training a model to recognize both the general quality of an image and the specific semantic content that may influence the quality assessment. This approach is particularly beneficial in medical imaging, where specific anatomical structures or pathological features can significantly impact the perceived quality of an image.
The work in [29] described the design of a two-stage dual-task network framework, which takes fetal MRI slices as input and outputs the image quality label. This framework included two stages: brain localization, followed by a dual task combining brain segmentation and quality assessment. The model consisted of a brain localization module using U-Net for coarse segmentation, followed by a dual-task module with a feature extraction network, a segmentation head, and a quality assessment head, using parameter sharing for efficient fetal brain MRI analysis. The training typically involved a contrastive loss function that encouraged the model to push apart representations of images with different quality levels while pulling together representations with similar quality levels. By incorporating semantic awareness, the model can learn to prioritize the features most relevant for assessing image quality, such as clarity, presence of artifacts, and contrast between different tissues.
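A minimal PyTorch sketch of a pairwise contrastive loss of the kind described above is shown below; the embedding size, margin, and pair labels are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def quality_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                             same_quality: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """Pull together embeddings of images with similar quality levels
    (same_quality = 1) and push apart dissimilar ones (same_quality = 0)."""
    d = F.pairwise_distance(z1, z2)
    return (same_quality * d.pow(2)
            + (1 - same_quality) * F.relu(margin - d).pow(2)).mean()

z1 = torch.randn(16, 128)                    # embeddings of the first images in pairs
z2 = torch.randn(16, 128)                    # embeddings of the second images
labels = torch.randint(0, 2, (16,)).float()  # 1 = same quality level, 0 = different
print(quality_contrastive_loss(z1, z2, labels))
```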
Table 4 summarizes the IQA methods used in the selected studies, and Table 5 summarizes the DL techniques in medical imaging.