1. Introduction
Image quality in the medical domain is a key concept for diagnostic image interpretation. It is especially important when assessed by human experts as a measure of the diagnostic value of an image and of its ability to depict disease manifestations and reveal subtle differences among pathological or physiological entities. However, it is equally important when images are used as input to artificial intelligence (AI) algorithms, where adequate quality is needed to achieve good performance on segmentation or classification tasks. Image quality information is an important factor not only for AI model explainability but also for assessing fairness and robustness when images acquired from different sites, vendors, or imaging protocols are considered [1]. This is extremely important, as current AI practice promotes the aggregation of large data volumes in image repositories, such as “The Cancer Imaging Archive” (TCIA) [2] or the EUCAIM initiative [3], where many clinical sites are invited and encouraged to contribute data with high variability in acquisition settings.
The notion of image quality is more complex where medical images are concerned, as it is related to the ability to deliver an accurate and specific clinical outcome. Moreover, diagnostic quality may be characterized by different attributes or weaknesses depending on the recipient of the image, whether a diagnostic decision support system or a field expert relying on human perception, even for the same clinical question. Objective image quality metrics provide an alternative to manual subjective scoring, avoiding issues of inter- or intra-observer variability. These metrics evaluate image quality based on expected image statistics. Ideally, a valuable quality metric correlates well with the subjective perception of quality by a human rater. The selection of objective metrics that correlate with evaluation based on human perception is still an open field of research [4,5,6,7,8].
Assessing image quality can contribute to efficiently building large databases, as it is relevant to a number of aspects that AI repositories value. It serves image harmonization, as it can highlight images requiring post-processing actions such as noise reduction or motion correction. Image Quality Assessment (IQA) supports fairness [9] by assessing sufficiency across different scanners, sites, or other conditions. It holds an important role in explainability [10], as data perturbations with regard to quality may affect the consistency of results and the transferability of outcomes to different sites. Moreover, it can support the evaluation of real-world clinical usability by applying exclusion criteria and linking them to specific image attributes, such as local or global signal-to-noise ratio (SNR), pixel size, or the presence of artifacts. It can also be part of robustness evaluation, as data quality can serve to produce group-specific studies, either by eliminating image groups through thresholding or by creating mixed groups with desired percentages of quality levels. Finally, local measurements can assist explanations focusing on specific regions of the image when local degradation factors are observed, such as variable signal reception sensitivity from different radiofrequency coils or coil configurations.
To date, the human expert’s opinion remains the gold standard for assessing the quality of medical images, as experts are the individuals responsible for delivering and signing the final diagnosis. Consequently, AI-based applications for assessing image quality are oriented and evaluated according to their ability to mimic human opinion. This work presents a software tool that enables, in a single session, a combined evaluation of the quality of a medical image by human experts, supported by a number of automatically calculated objective criteria.
2. Material and Methods
The IQA tool proposed in this work is oriented to explore the sequences and register the degrading factors most relevant to the current practice of multi-parametric (mp) breast magnetic resonance imaging (MRI) examinations in the frame of an EU-funded project (https://radioval.eu/, accessed on 5 May 2023). A typical mp breast MRI protocol consists of a number of conventional (T1- or T2-weighted) images in axial or sagittal orientation and dynamic T1-weighted acquisitions (Dynamic Contrast Enhanced, DCE) with approximately 90 s temporal resolution for a number of time points (3 to 6), usually in the same orientation as the conventional imaging. Diffusion Weighted Imaging (DWI) is also part of an mp breast MRI protocol.
2.1. Working Environment
The tool can work as a plugin within the freely available imaging software Mango V4.1 (https://mangoviewer.com/, accessed on 1 May 2024) or as a stand-alone windowed application. The application can also be executed via the command line, provided the appropriate arguments are given, although it will launch its Graphical User Interface (GUI) once the analysis is completed. The advantage of integrating into Mango’s environment is the ability to use additional functionalities related to viewing or post-processing. The latter enables faster and more efficient delineation of user-defined regions of interest (ROIs), which are used to perform local quality measurements. However, this action is optional and is not a prerequisite for the successful completion of the evaluation task.
The windowed application of the tool is a comprehensive User Interface (UI) image quality app for conventional and dynamic MR series, used to report image quality and types of artifacts. The Tkinter package (the standard Python interface to the Tcl/Tk GUI toolkit) was used to build the application’s UI. The ttkbootstrap (1.10.×) theme extension for Tkinter was also used to provide a modern flat-style theme inspired by the widely used front-end toolkit Bootstrap. The image quality metrics, namely the No-Reference (NR) and Full-Reference (FR) metrics, were calculated using the PyTorch Image Quality (PIQ) library (version 0.8.0) [11]. Furthermore, fundamental Python libraries for data manipulation and mathematical calculations were used, such as numpy (v1.26.0), pandas (v2.1.3), pydicom (v2.4.3), matplotlib (v3.8.1), and pytorch (v2.1.1).
2.2. Input
The tool supports the prevalent file format in radiology, DICOM (.dcm), though it could easily be extended in a future version to support other medical imaging file formats, e.g., NIfTI (.nii) or ITK MetaImages (.mha). The tool is provided with two inputs: the type of images to be utilized (DCE or Conventional), which serves initialization and configuration purposes and defines the processing that will be performed, and the directory path to the medical images. The tool iteratively parses either a single directory input (containing DICOM files) or recurses through a directory of directories to extract the image volumes and the respective image metadata.
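The directory traversal described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `collect_dicom_series` is hypothetical, and the metadata extraction itself (performed with pydicom in the tool) is omitted for brevity:

```python
from pathlib import Path


def collect_dicom_series(root):
    """Group DICOM files by their parent directory.

    Handles both accepted input layouts: a single directory of
    .dcm files and a directory of directories, each holding one
    series. Returns {directory name: sorted list of file names}.
    """
    series = {}
    for path in sorted(Path(root).rglob("*.dcm")):
        series.setdefault(path.parent.name, []).append(path.name)
    return series
```

In the tool itself, each collected group would then be loaded with pydicom to assemble the image volume and read the header metadata.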
2.3. IQA Tool Components
The full functionality of the IQA tool includes the presentation of descriptive attributes for the scanner/sequence, the submission of the subjective metrics, as well as the presentation of objective image metrics (NR metrics for conventional images, NR and FR metrics for dynamic studies, and statistical metrics for any type of image concerning relative or local measurements in user-defined regions) (Table 1).
Full-Reference metrics are measurements that compare the quality of a target image to that of a reference image. The FR metrics deployed in the tool are Peak SNR (PSNR), the Structural Similarity Index Measure (SSIM) [12], the Multi-Scale Structural Similarity Index Measure (MS-SSIM) [13], and the Feature Similarity Index Measure (FSIM) [14]. The PSNR metric is defined as the ratio between the maximum possible power of a signal and the power of corrupting noise. SSIM predicts image quality by focusing on structural distortions, measuring the structural similarity between two images; luminance, contrast, and structural similarity components are calculated automatically from the image and then combined into a single quality map. MS-SSIM is an extension of SSIM that measures SSIM on five different scales, computing contrast and similarity at all levels while measuring luminance only at the final scale. FSIM measures the feature similarity between two images. For SSIM, MS-SSIM, and FSIM, a value of one indicates perfect similarity between the compared images. They can be used to compare the target image and the reference image directly. In this application, the FR metrics are calculated only for the dynamic series, in which the pre-contrast image is used as the reference image and each consecutive time point of the dynamic acquisition as the target image. These metrics are used to identify motion between identical acquisitions in dynamic series, manifested as a local decrease in the index. The PSNR metric can be used as a measure of global SNR, but it is also indicative of the correct timing of contrast administration.
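The tool computes these metrics with the PIQ library; purely to illustrate the underlying definitions, a minimal numpy sketch of PSNR and of a single-window (global) SSIM is given below. The function names are hypothetical, and the global SSIM deliberately omits the sliding Gaussian window used by full implementations:

```python
import numpy as np


def psnr(reference, target, data_range=255.0):
    """PSNR in dB: ratio of the maximum possible signal power
    to the power of the corrupting noise, estimated via MSE."""
    mse = np.mean((reference.astype(float) - target.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


def global_ssim(x, y, data_range=255.0):
    """Simplified SSIM computed over the whole image at once,
    combining luminance, contrast, and structure terms."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )
```

As in the tool's dynamic-series workflow, the pre-contrast image would be passed as `reference`/`x` and each later time point as `target`/`y`; both sketches return their maximum (infinity and 1.0, respectively) for identical inputs and decrease with added noise.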
No-Reference metrics compute quality scores based on expected image statistics. The metrics used in this implementation are the BRISQUE score [15] and Total Variation, which are applied to any image series. Total Variation is the integral of the absolute image gradient; high values indicate a strong image gradient and are thus likely to identify images with higher contrast. The BRISQUE score is computed using a support vector regression (SVR) model trained on an image database [16] with corresponding differential mean opinion score (DMOS) values that represent the subjective quality of the images. The database contains natural images with known distortions, such as compression artifacts, blurring, and noise, together with pristine versions of the distorted images. The image to be scored must have at least one of the distortions for which the model was trained. A smaller BRISQUE score indicates higher image quality.
2.4. Output
Outputs, in the form of graphs or text files, are stored in the “IQA” folder within the same directory as the inspected images. This folder contains questionnaire responses in text format, along with image metadata in an Excel file.
2.5. Dataset for Module Demonstration
The publicly available Duke Breast Cancer MRI dataset was used [17,18] to perform basic testing of the tool. The results from the experiments performed on the images of an indicative patient from the dataset (ID Breast_MRI_056) are presented.
2.6. Metrics Evaluation
The NR and FR metrics were calculated under different conditions in order to evaluate the potential of these objective image metrics in assessing the quality of medical images. More specifically, the NR metrics were calculated for three different slices (indicatively, Slices 98, 99, and 113) of the same image of the first post-contrast phase to measure the effect of slight in-plane (z-axis, head–foot) patient movement. The three slices of the same image were selected to simulate slightly different image positions for conventional series. The NR metrics were also calculated for the same slices across time phases to investigate their change when no patient movement is simulated. As a further step for testing the tool’s ability to highlight images of degraded quality in a dynamic acquisition, the FR metrics were calculated for identical slices across time phases (no patient motion) and after simulating through-plane patient translocation by misaligning the different time phase series (patient motion during acquisition). To this end, the images of these three slices from the first post-contrast phase were used as input and reference images for the calculation of the FR metrics, expecting to verify the maximum output value in similarity indices. The image of Slice 98 was compared to the images of Slice 98 (identical), Slice 99 (close to the image), and Slice 113 (further in the volume). This comparison simulates the differences in the images when patient movement is present. Moreover, Slice 98 of the first post-contrast phase was blurred using a Gaussian blurring filter with a kernel size of 7 × 7 to further simulate through-plane (transversal) patient motion. The image of the same slice of the second post-contrast phase was used as the reference image, while the original and blurred versions of the image of this slice were used as target images to calculate the FR metrics.
To further assess whether the FR metrics can detect the degradation in image quality due to the blurring effect, the FR metrics were calculated using the original pre-contrast phase image as the reference image and the original and blurred versions of the images of subsequent phases as target images. More precisely, Gaussian blurring was performed on the third and fifth of the seven phases of a DCE image. The FR metrics were calculated between the same pre-contrast phase image (reference image) and all subsequent phases with and without blurring (target images).
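The blurring step of these simulations can be sketched as below. The experiments specify only the 7 × 7 kernel size, so the sigma value here is an assumption, and the helper names are hypothetical:

```python
import numpy as np


def gaussian_kernel(size=7, sigma=1.5):
    """Normalised size x size Gaussian kernel.

    NOTE: the 7 x 7 size matches the experiment; sigma=1.5
    is an illustrative assumption, not a reported parameter.
    """
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def blur(img, kernel):
    """'Same'-size 2D convolution with edge padding,
    producing the degraded target image for FR comparison."""
    p = kernel.shape[0] // 2
    padded = np.pad(img.astype(float), p, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = (window * kernel).sum()
    return out
```

The blurred slice would then be passed as the target image to the FR metrics, with the corresponding slice of another phase as the reference, where a drop in PSNR and the similarity indices is expected.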
4. Discussion
The IQA tool presented in this work has dedicated components for sequence evaluation with respect to its parameters, a means to register a number of user-defined image quality attributes, and quantitative objective metrics to support the user. Testing the ability of the tool to identify the slices most suspicious for image degradation was conducted as a proof of feasibility, aiming at a time-efficient working environment that highlights the slices or sequences most affected by degradation factors. The feasibility testing comprised a comparison among metrics when patient motion was simulated in-plane and through-plane of acquisition by applying blurring filters and translocating slices across time phases, respectively.
Regarding conventional diagnostics, the tool can be used to assess the ability to visualize and identify disease manifestation signs, or to compare images acquired under different protocol setups both visually and numerically under certain circumstances. In the frame of a multi-centric or variable-protocol study, it can help optimize protocol parameters under real clinical conditions rather than resorting to test objects, which cannot capture disease heterogeneity, the effect of artifacts, and human behavior. For AI methodologies, the proposed tool could serve to enhance robustness and explainability as well as to promote the fairness of AI algorithms. Such actions are in the core interest of AI professionals and the medical community, as they are essential to building confidence and trust and to adoption in everyday clinical practice.
The only mandatory part of the evaluation is the human-based evaluation, performed through the completion of a questionnaire by the user. The automatic calculation of objective metrics and the manual ROI-based analysis are optional features supporting users in defending their evaluation. Since the tool was designed to meet the needs of the RadioVal project for IQA, specific characteristics related to mp MRI of the breast are examined in the current version of the tool’s questionnaire, e.g., the ability to suppress fat, the presence and degrading effect of metal clips, dynamic acquisitions, etc. However, the tool can be applied to other body parts or imaging modalities, as it can be configured by selecting the parts relevant to other use cases. Conventional series of any contrast and in any orientation are assessed by human experts and NR objective metrics, while dynamic series are assessed by human experts and FR objective metrics, taking the pre-contrast phase as reference. The tool can be used as a stand-alone application or hosted as a plugin in a DICOM viewer. The latter option enables local assessment of different image sub-parts by calculating medically oriented objective metrics in user-defined image areas. Local assessment of relative SNR and contrast-to-noise ratio (CNR) metrics can also be applied for tissue-specific evaluation or assessment in different sub-parts of the image.
Each attribute of the tool serves a specific purpose. The presentation of selected quality-related attributes derived from the DICOM header can be used to verify that the perceived image characteristics correlate with the objective attributes of the sequence. For example, a slightly blurred image can receive a low score in the expert’s evaluation; identifying the etiology of the blurring becomes feasible once the sequence characteristics are taken into account, so that patient-related and protocol-related blurring can be distinguished. An image of low resolution, as revealed by sub-optimal values in the “number of frequency/phase encoding steps”, can be differentiated from an image compromised by involuntary patient motion during acquisition, even though the image parameters are tailored for excellent image quality. Moreover, an image with high SNR can score differently once a slice thickness above the diagnostic threshold is chosen.
The tool registers the human experts’ opinions by addressing several questions related to image-degrading factors. They are also asked to provide an overall score of the perceived image quality. Each question aims to draw attention to one specific aspect of the image to decompose the overall impression into specific etiologies of degradation (contrast, noise, or artifacts) to provide a concise but comprehensive image profile.
The graphical representation of objective metrics across series or time points provides a time-efficient experience, acting as an alert for selectively inspecting the group of images where the metrics decrease. It has to be noted that a single time point of a dynamic series can be considered a conventional series when used in isolation from the rest of the time phases and is therefore assessed by NR metrics. A local or global minimum value can suggest a selective evaluation by the user of those particular slices scoring low in one or both objective evaluations, as being hypothetically the worst representations of the anatomy.
ROI-based local evaluation of the SNR and CNR statistical metrics has been added because none of the deployed FR or NR metrics is medically oriented, and it is possible either to find discordance among them in the scoring of image quality or that they fail to disassociate diagnostic value from image quality. Examples include ignoring areas of imperfect excitation appearing black, variable coil sensitivity, or, on the contrary, a persistent, unavoidable artifact not covering the area of diagnostic interest.
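Under the common definitions of relative SNR and CNR (ROI mean over the standard deviation of a background/noise ROI), the local measurements can be sketched as follows. The mask-based interface and the helper names are a hypothetical illustration, not the tool's actual API:

```python
import numpy as np


def roi_snr(img, signal_mask, noise_mask):
    """Relative SNR: mean intensity of a signal ROI over the
    standard deviation of a background/noise ROI."""
    return img[signal_mask].mean() / img[noise_mask].std()


def roi_cnr(img, roi_a, roi_b, noise_mask):
    """CNR: absolute difference between the mean intensities of
    two tissue ROIs over the noise standard deviation."""
    return abs(img[roi_a].mean() - img[roi_b].mean()) / img[noise_mask].std()
```

In practice, the ROIs would be drawn by the user (e.g., within Mango), converted to boolean masks over the slice, and the two ratios reported as the tissue-specific local quality measures.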
Similar implementations have been developed, with the vast majority focusing on brain MRI studies (MRIQC [22], Qoala-T [23], MRQy [24]). Compared to the existing software solutions for MRI IQA, the main advantages of the tool presented herein are the simultaneous subjective and objective evaluation, the ability to assess images independently of the anatomical site (also possible in [24]), and the ability to perform local tissue-specific calculations of medically oriented metrics. The flexibility of localized measures widely accepted for medical image evaluation (ROI-based statistical metrics of mean and standard deviation for specific sub-regions) has not been part of any IQA tool currently available.
Additionally, the integration of subjective and objective evaluation of an image in a single session is an important attribute of the tool. The automatic extraction of a number of objective metrics alongside the experts’ opinions on various aspects of possible image degradation can be a very powerful starting point for creating associations that identify the metrics most capable of capturing the users’ opinion, which remains the holy grail. Moreover, it can highlight the correlation of specific artifacts or degrading factors with the objective markers most sensitive to each factor, apart from the overall image evaluation.
Limitations of this version of the tool are as follows. (1) Only the DICOM image format is supported. Future work will add support for more imaging formats, such as NIfTI or ITK MetaImages (.mha); however, these will provide incomplete information concerning the DICOM header tags that are an important part of the rationale of this work. (2) The current version of the application is not tailored to the evaluation of other 4D datasets such as DWI, which are very important for diagnosis. These present specific challenges regarding IQA, as artifacts (geometric distortion) and SNR drops at high b-values are expected.
Both of the aforementioned limitations are acknowledged, and the tool can accommodate changes to address the need for more imaging formats as well as DWI sequences. Future actions will include further testing in clinical environments, and the tool is expected to be optimized through experts’ feedback in the frame of the RadioVal project and beyond.