Image Quality Assessment to Emulate Experts’ Perception in Lumbar MRI Using Machine Learning

Abstract: Medical image quality is crucial to obtaining reliable diagnostics. Most quality controls rely on routine tests using phantoms, which do not closely reflect the reality of images obtained on patients and do not directly reflect the quality perceived by radiologists. The purpose of this work is to develop a method that classifies the image quality perceived by radiologists in MR images. The focus was set on lumbar images, as they are widely used and present different challenges. Three neuroradiologists evaluated the image quality of a dataset that included T1-weighted images in axial and sagittal orientation, and sagittal T2-weighted images. In parallel, we introduced a computational assessment using a wide range of features extracted from the images, which were then fed into a classifier system. A total of 95 exams were used, from our local hospital and a public database, and part of the images was manipulated to broaden the distribution of quality in the dataset. A good recall of 82% and an area under the curve (AUC) of 77% were obtained on average in the testing condition, using a Support Vector Machine. Even though the current implementation still relies on user interaction to extract features, the results are promising with respect to a potential implementation for monitoring image quality online with the acquisition process.


Introduction
Many medical diagnoses nowadays rely on medical images and thus depend on the quality of the acquired images. It is therefore of prime importance to monitor the quality of these images regularly [1]. Currently, most quality controls are conducted in the context of equipment maintenance, using phantoms that usually consist of geometrical shapes filled with materials emulating biological properties, in highly standardized measurements [2]. Yet the image of a phantom does not perfectly reflect the quality and complexity of images obtained from a living human body, and this type of quality control may not be sufficient. During a patient examination, problems can arise from the patient (motion, difficulties linked to the patient's body mass index, etc.), from an acquisition protocol that might not be optimal, or from the general state of the system, which could need additional maintenance. If the image quality is affected by specific patient characteristics, not much can be done. However, if the image quality is affected by a suboptimal acquisition protocol or by the state of the equipment, one might need to take some time to analyze the issue and make the corresponding adjustments.
Besides quality control performed with phantoms, image quality checking can be divided into two categories: subjective assessment, based on human judgment, and objective assessment, which is computed with mathematical algorithms on the resulting images [1]. A subjective assessment gives the results that are closest to the expert's appreciation (in our case, that of a radiologist interpreting a medical image), but it consumes the highly valued time of an expert and is therefore not practical to implement as a regular image quality control. An example can be found in [3], where a set of anatomical criteria for MR images of knee joints is proposed based on expert criteria. Objective assessment, on the other hand, can be divided into two main categories: assessment by comparison with a reference image, and assessment with no reference at all [4][5][6]. Image quality assessment based on a reference image is very useful when working on problems such as lossy compression, but in the context of medical image acquisition, usually no reference is available. No-reference image quality assessment (NR-IQA) is a challenging task, mainly based on strategies such as signal-to-noise ratio (SNR) estimation, entropy, or different families of mathematical measures that are far removed from human perception itself [7][8][9][10][11][12][13][14][15].
Several works have explored the possibility of modeling natural images [16] and the estimation of natural scene statistics features [17]. Yet medical images are not purely natural scenes. An interesting solution has been proposed in the last few years: evaluating image quality not directly from the acquired images but through the evaluation of the result of an automatic processing pipeline [18][19][20]. This can be done, however, only for a specific subset of applications, such as anatomical brain images, where a processing pipeline is consolidated. Image quality assessment has also been applied through the evaluation of diagnostic performance using the receiver operating characteristic (ROC) curve [21].
The purpose of this work is to propose a method to classify the image quality perceived by radiologists in magnetic resonance (MR) images. To be more specific, the focus was set on MR lumbar images because, according to our local radiologists, they are among the most commonly acquired MR images that present quality issues. We aim at qualifying the MR lumbar image quality by emulating the expert perception. In a future step, this evaluation of "good" or "bad" quality images, reflecting what radiologists would judge, could be obtained automatically at the moment of image acquisition and serve as a "traffic light" indication. A recurrent "bad" quality rating would support actions to re-evaluate the acquisition protocol or the magnet maintenance, or to further analyze what causes the reduced image quality.
The main contribution of this work is to show the feasibility of emulating the experts' perception of medical image quality based on feature extraction using machine learning. The proposed method was divided into two parts: in the first part, three neuroradiologists (NR) evaluated the image quality of a dataset that included different types of lumbar MR images commonly used in clinical practice. In the second part, we introduced the computational assessment using a wide range of features extracted from the images, which were then fed into a classifier system. The machine is trained to learn the classification made by the experts, based on the features extracted from the images. The feasibility of this method of automatic labeling of image quality is evaluated here in three different cases of MR lumbar images: T1-weighted acquisitions in axial and sagittal orientation, and T2-weighted acquisitions in sagittal orientation.
The article is structured as follows. In Section 2 we present the processing pipeline, where we explain how the medical exams were obtained and evaluated by experts, and how we implement the machine learning techniques for image quality assessment of medical images by means of no-reference features. The main results are given in Section 3. We discuss these results and the limitations of our proposal in Section 4. In Section 5 we give some concluding remarks and outline future work.

Methods and Materials
The global processing pipeline is schematized in Figure 1 and was applied to each of the three image types independently. On the one hand, each exam is evaluated separately by three neuro-radiologists in their regular settings for image visualization, each one blind to the evaluation of the other NR, according to a previously agreed list of criteria detailed in Section 2.2. According to its average evaluation across the three NR, the exam is classified as "good image quality" if its Mean Opinion Score (MOS) is greater than or equal to 3, corresponding to the qualifiers regular, good or excellent in the subjective evaluation, or as "deficient image quality" otherwise, corresponding to the qualifiers bad or poor in the subjective evaluation. On the other hand, a list of features is extracted from the images, as detailed in Section 2.3. These features are fed to a classifier. Five systems were tested: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Logistic Regression (LogReg) and Multilayer Perceptron (MLP). We will refer to the evaluation by experts as the "subjective evaluation", and to the evaluation by machine learning from extracted features as the "objective evaluation". Ethical approval by our local Ethics Committee was obtained.

Data Set
The MR lumbar image data set comprised 95 exams from different origins: 41 exams came from our local hospital and 12 exams from the public database SpineWeb (we used images from dataset 1, available at http://spineweb.digitalimaginggroup.ca/Index.php?n=Main.Datasets, accessed on 31 October 2017). Moreover, 42 exams were generated by modifying other original exams in order to cover a wider range of image quality variations. The modification includes one or a combination of the following:
• Noise addition, with a standard deviation ranging from 0.001 to 0.8
• Contrast manipulation using a power transform, with gamma values ranging from 0.7 to 1.15
• Convolution with a Gaussian kernel, with kernel sizes from 3 × 3 to 6 × 6
All image modifications were developed in Matlab (MathWorks, Natick, MA, USA). Each exam includes 3 types of images: T1-weighted axial and sagittal slices, and T2-weighted sagittal slices. Each image type was analyzed separately.
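The three kinds of modification listed above can be illustrated with a minimal Python sketch. It is not the Matlab code used in this work: the Gaussian noise model, the intensity scaling to [0, 1], the kernel standard deviation `sigma`, and all function names are assumptions made for illustration.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise (assumed noise model); img scaled to [0, 1]."""
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def gamma_transform(img, gamma):
    """Power-law contrast manipulation: out = img ** gamma."""
    return np.clip(img, 0.0, 1.0) ** gamma

def gaussian_kernel(size, sigma):
    """Normalized size x size Gaussian kernel (sigma is a hypothetical parameter)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def gaussian_blur(img, size, sigma=1.0):
    """Convolve with a Gaussian kernel, using edge padding to keep the shape."""
    kernel = gaussian_kernel(size, sigma)
    before = (size - 1) // 2
    after = size - 1 - before
    padded = np.pad(img, ((before, after), (before, after)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    rows, cols = img.shape
    # The kernel is symmetric, so correlation and convolution coincide.
    for i in range(size):
        for j in range(size):
            out += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return out

# Example: combine the three modifications on a synthetic slice.
rng = np.random.default_rng(0)
slice_ = rng.random((64, 64))
degraded = gaussian_blur(
    gamma_transform(add_gaussian_noise(slice_, 0.05, rng), 0.7), size=3
)
```

The parameter values in the example fall inside the ranges quoted above, but the specific combination is arbitrary.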

Subjective Evaluation
Three neuro-radiologists, with 6 to 26 years of professional experience, participated in this study. First, the DELPHI method was used among these three experts to establish the image evaluation criteria and the relative weightings of these criteria [22][23][24], as shown in Table 1. To obtain such a list, we proceeded as follows: a list of criteria relevant to image quality was first obtained by external observation of the radiologists' method of reviewing each type of image, while they were verbally expressing their observations. This list was then submitted as a questionnaire to each NR. Agreement on which criteria to use was obtained. Once the list of criteria was defined, agreement was easily obtained among the three NR on their respective weights. Each criterion is evaluated using a Likert scale, with scores corresponding to: 1-bad, 2-poor, 3-regular, 4-good, 5-excellent. According to this scale and the weight of each criterion, each exam obtains one grade per NR, and a weighted Mean Opinion Score (wMOS) is then calculated by averaging the grades given by the three NR.
Table 1. List of criteria used for the subjective image quality evaluation, elaborated by three neuroradiologists using the DELPHI method. The weighted Mean Opinion Score (wMOS) was calculated using the weights listed in the second column of this table. A higher criterion weight implies a greater importance for this specific criterion.

Exam Type Criterion Weight Criterion
Sagittal
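The scoring rule described in this section can be sketched in Python as follows. The criterion weights and Likert scores below are invented for illustration (they are not the values of Table 1), and normalizing each grade by the sum of the weights, so that it stays on the 1 to 5 scale, is an assumption.

```python
def exam_grade(scores, weights):
    """Weighted average of the Likert scores (1-5) given by one radiologist.

    Dividing by the sum of the weights is an assumption that keeps the
    grade on the same 1-5 scale as the individual criteria.
    """
    if len(scores) != len(weights):
        raise ValueError("one score per criterion is required")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def weighted_mos(scores_per_reader, weights):
    """wMOS: average of the per-radiologist grades."""
    grades = [exam_grade(scores, weights) for scores in scores_per_reader]
    return sum(grades) / len(grades)

def quality_label(wmos, threshold=3.0):
    """Threshold at 3: 'good' if the score is >= 3, 'deficient' otherwise."""
    return "good" if wmos >= threshold else "deficient"

# Hypothetical weights for three criteria and scores from three readers.
weights = [3, 2, 1]
scores_per_reader = [[4, 3, 2], [3, 3, 3], [5, 4, 2]]
wmos = weighted_mos(scores_per_reader, weights)
label = quality_label(wmos)
```

With these made-up numbers the three grades are 20/6, 18/6 and 25/6, so the wMOS is 3.5 and the exam is labeled "good".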

Objective Evaluation
Image feature extraction was undertaken in Matlab, in a semi-automatic way. Some features were evaluated on a Region Of Interest (ROI), others over the entire slice. The ROIs were positioned manually within vertebral bodies, intervertebral discs, fatty tissues, psoas, and paravertebral muscle in three different slices located in the center of the acquired volume. Image manipulation was conducted by engineers, blinded to the process and results of the "subjective evaluation". The dataset is composed of three different cases of MR lumbar images: T1-weighted acquisitions in axial and sagittal orientation, and T2-weighted acquisitions in sagittal orientation. Three different slices were obtained for each exam modality. For the sagittal exams (T1 and T2), 26 features were extracted from each of the three slices, giving a total of 78 variables. Of these features, 12 were computed from the whole image and 14 from several ROIs (8 SNR, 2 CNR, 3 Uniformity, and 1 Image Sharpness in fat). For the axial exams, 16 features were extracted from each of the three slices, giving 48 variables. Of these features, 12 were computed from the whole image and four from several ROIs (2 SNR, 1 CNR, and 1 Image Sharpness in fat).
Some features were selected to depict different characteristics known to influence image perception, such as spatial resolution or presence of noise; other features correspond to a mathematical description of the image not directly related to human perception. Some of the features are sensitive to spatial resolution, such as pixel dimension, slice thickness, or quantification of "image sharpness" relative to the presence of borders within the image. Other features are sensitive to the presence of noise or to signal homogeneity, such as signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR), or "uniformity" of the signal within an ROI. Some features are sensitive to the presence of artifacts: we used the index proposed by Wang et al. [25], and also a quantification of the ratio of the energy present in the signal in the foreground and in the background. In the case of aliasing artifacts, the background energy is altered. The Wang index is a no-reference image quality metric originally designed to measure distortions caused by JPEG compression on natural images. This measure is explored here for its utility in detecting noise and intensity non-uniformity. It is used in the implementation made public by its authors (https://github.com/dcatteeu/JpegQuality, accessed on 20 July 2019).
Less "intuitive" characteristics were also included, so that another approach to image description is taken into account, different from the one that tries to quantify parameters that could explain human perception directly, such as contrast or spatial resolution. In this category, we find measures of entropy, spatial flatness, and spectral flatness. Image representations based on histograms are quite popular, and entropy is among the most widely used. Image distortions have been observed to affect the histograms of pixel intensities [26]. The histogram-based Shannon entropy could be an indicator of noise and intensity non-uniformity. An unpredictable (i.e., nonredundant) image in the spatial domain will tend to have a white, or flat-looking, spectrum. Conversely, predictable images will possess colored spectra; that is, their spectral shapes exhibit peaks. The spectral flatness measure is widely used to quantify signal information and compressibility [27]. A complementary quantity has been proposed, spatial flatness, which quantifies image shape [28].
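As an illustration of two of the measures mentioned above, the following Python sketch computes a histogram-based Shannon entropy and a spectral flatness value using their common textbook definitions (entropy as the negative sum of p_k log2 p_k, and flatness as the ratio of geometric to arithmetic mean of the power spectrum). The exact expressions of Table 2 may differ; the number of bins, the removal of the DC term, and the small constant `eps` are assumptions.

```python
import numpy as np

def shannon_entropy(img, bins=256):
    """Histogram-based Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    hist, _ = np.histogram(img, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))

def spectral_flatness(img, eps=1e-12):
    """Geometric mean over arithmetic mean of the 2D power spectrum.

    Values near 1 indicate a flat ("white") spectrum; values near 0
    indicate a peaked ("colored") spectrum. The DC term is dropped,
    which is an assumption of this sketch.
    """
    power = np.abs(np.fft.fft2(img)) ** 2
    power = power.ravel()[1:] + eps  # element 0 is the DC component
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return float(geometric_mean / arithmetic_mean)

# Example: white noise versus a smooth, highly predictable pattern.
rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 64))
x = np.arange(64)
sine = np.tile(np.sin(2 * np.pi * 4 * x / 64), (64, 1))
flat_noise = spectral_flatness(noise)
flat_sine = spectral_flatness(sine)
```

In this example the noise image yields a much higher flatness than the sinusoidal pattern, in line with the description of white versus colored spectra.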
The features extracted from the images are detailed in Table 2. $S$ represents the pixel intensity, and $S_i$ represents the average of intensities in region $i$. $I$ represents the image and $I_v$ a vector created from the image columns. $B$ is the number of intensity levels present in the image, and $p_k$ an estimation of the probability of occurrence of the $k$-th gray level. $F$ represents the Fourier transform of the image and $\nabla S_{uv}$ the gradient evaluated at pixel $(u, v)$. $N_x$ and $N_y$ represent the matrix size in the $x$ and $y$ directions, respectively, and $N_{pix}$ stands for the number of pixels present within a specific ROI, the foreground, or the background. The foreground was separated from the background first by user interaction, identifying a pixel from each region; then contrast was enhanced by histogram manipulation and equalization, a Wiener filter was applied, and a unique threshold was identified by the Otsu method.
In sagittal exams: applied on three different ROIs in vertebral bodies, in fatty tissues and two intervertebral discs. In axial exams: applied on ROI in psoas and paravertebral muscles and one in fatty tissues.
In sagittal exams: applied on vertebral bodies vs. disc, and vertebral body and disc vs. fat. In axial exams: applied on fatty tissues vs. psoas and paravertebral muscles.
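The thresholding step used to separate foreground from background can be sketched in Python with a plain implementation of the Otsu method. This sketch omits the user-seeded initialization, the histogram equalization and the Wiener filter described above, and the definition of the foreground-to-background energy ratio (mean squared intensity in each region) is an assumption, since the exact expression belongs to Table 2.

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Threshold that maximizes the between-class variance (Otsu method)."""
    hist, edges = np.histogram(img, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    p = hist / hist.sum()
    w0 = np.cumsum(p)            # probability of the lower class up to each bin
    w1 = 1.0 - w0                # probability of the upper class
    mu = np.cumsum(p * centers)  # cumulative first moment
    mu_total = mu[-1]
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros_like(w0)
    sigma_b[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (
        w0[valid] * w1[valid]
    )
    k = int(np.argmax(sigma_b))
    return float(edges[k + 1])   # upper edge of the last lower-class bin

def energy_ratio(img, foreground_mask):
    """Mean squared intensity in foreground over background (assumed form)."""
    fg = img[foreground_mask].astype(float)
    bg = img[~foreground_mask].astype(float)
    return float(np.mean(fg ** 2) / np.mean(bg ** 2))

# Example: a bright square "object" on a dim, slightly noisy background.
rng = np.random.default_rng(0)
img = 0.05 + 0.01 * rng.random((64, 64))
img[16:48, 16:48] = 0.8 + 0.01 * rng.random((32, 32))
threshold = otsu_threshold(img)
mask = img > threshold
ratio = energy_ratio(img, mask)
```

On this synthetic image the mask recovers the bright square, and the energy ratio is far above 1; an aliasing artifact that adds signal to the background would lower it.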

Machine Learning
To classify exams as "good" or "deficient", five machine learning techniques that are well known and widely used in the state of the art were applied. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are classifiers with a linear or quadratic decision boundary, respectively, generated by fitting class-conditional densities to the data and applying Bayes' rule [29]. Another widely used algorithm is the Support Vector Machine (SVM), a supervised learning classifier used for the prediction of class labels. It transforms features into a higher-dimensional space, where it finds the optimal hyperplane that separates the classes. The hyperplane is chosen to maximize the margin between itself and the points nearest to it, which are called support vectors [30]. Another method is logistic regression (LogReg), a statistical approach for predicting binary classes. The outcome or target variable is dichotomous in nature, and the model computes the probability of occurrence of an event using a logit function [31]. Finally, the multilayer perceptron (MLP) is a feedforward neural network that consists of three types of layers: the input layer (which receives the input signal to be processed), the output layer (which performs the task) and the hidden layers (the true computational engine of the MLP). In an MLP the data flow in the forward direction from the input to the output layer, and the neurons are trained with the backpropagation learning algorithm. MLPs can approximate any continuous function and can solve problems that are not linearly separable. The major use cases of MLP are pattern classification, recognition, prediction and approximation [32]. These five algorithms (LDA, QDA, SVM, LogReg and MLP) were implemented in Python 3.6 using the scikit-learn toolbox, version 0.16.1 [33].
In order to reduce the dimensionality of the input data, Principal Component Analysis (PCA) was applied, which decomposes the data set into a series of orthogonal components that explain a desired amount of variance. For the proposed data set, the number of features obtained in the objective evaluation was reduced by applying PCA so as to retain 99% of the variance, which yielded 12 principal components. For LDA, a singular value decomposition (SVD) solver was used, which does not rely on the calculation of the covariance matrix; SVD generalizes the eigendecomposition of a square matrix to any m × n matrix. The SVM was implemented using a radial basis function (RBF) kernel, described by $\phi_\gamma(x, l) = \exp(-\gamma \lVert x - l \rVert^2)$, where $\gamma$ controls the kernel width; this kernel implicitly maps the data into a higher-dimensional space in which the classes can become linearly separable. The hyperparameters were selected using a grid search with cross-validation.
To limit overfitting and validate the model, we carried out the study with a 10-fold cross-validation scheme. The averages of the accuracy, precision, recall, F1-score and area under the curve (AUC) in testing were estimated for each of the machine learning models. Moreover, the Kappa index was estimated as the agreement coefficient.
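The dimensionality reduction and classification chain described above can be sketched with scikit-learn as follows. This is an illustrative sketch that uses the current scikit-learn API (`sklearn.model_selection`), not the version cited in the text; the standardization step, the hyperparameter grid, the scoring metric and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix: 95 exams x 78 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(95, 78))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = "good", 0 = "deficient"

pipeline = Pipeline([
    ("scale", StandardScaler()),                         # assumption: not stated in the text
    ("pca", PCA(n_components=0.99, svd_solver="full")),  # keep 99% of the variance
    ("svm", SVC(kernel="rbf")),                          # exp(-gamma * ||x - l||^2)
])

# Hypothetical grid; the values actually searched are not reported.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)

n_components = search.best_estimator_.named_steps["pca"].n_components_
predictions = search.predict(X)
```

Placing PCA inside the pipeline means that the projection is re-estimated on each training fold, which avoids leaking information from the held-out fold into the dimensionality reduction.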
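For reference, the threshold-based metrics and the Kappa index named above can be computed from predicted and reference labels as in the following plain-Python sketch. The function names and the example labels are hypothetical; the AUC, which requires continuous classifier scores, is not covered here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    p_e = sum((list(rater_a).count(l) / n) * (list(rater_b).count(l) / n)
              for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)

# Example with ten hypothetical exams (1 = "good", 0 = "deficient").
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

In this example there are 4 true positives, 4 true negatives, 1 false positive and 1 false negative, so all four metrics equal 0.8 and the kappa index is 0.6.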

Results
Typical images of "bad" and "good" quality as defined by the experts are shown in Figure 2. The examples aim to emphasize that, even if the difference between a clearly good and a clearly bad image is easy to assess, the subtleties of the in-between range are more difficult to distinguish. Figure 3 presents the distribution of the image quality of our three data sets as evaluated by the experts: each of the data sets presents a balanced number of "good" and "bad" quality images. Axial images are globally rated better than sagittal images by all experts.
Agreement between neuro-radiologists about the image quality ranges from fair to substantial in the different types of data sets, as detailed in Table 3. The best agreement in image quality evaluation is found in the Sagittal T2 data set, with Kappa values of 0.56 ± 0.08 over all cases, while the worst agreement is found in the Axial T1 data set, with Kappa values of 0.38 ± 0.22. NR 1 and NR 2 present the closest agreement in image quality evaluation, even though Kappa values reach only 0.63 ± 0.03, underlining that excellent agreement between experts is not easy to achieve. Agreement between NR 2 and NR 3 is considerably lower, at only 0.35 ± 0.12.
Table 3. Kappa index of agreement between neuro-radiologists (NR). The column on the right and the bottom line indicate mean ± standard deviation. We considered as "substantial" agreement kappa indices between 0.61 and 0.80, as "moderate" agreement values between 0.41 and 0.60, and as "fair" agreement values between 0.21 and 0.40.
Figure 4 shows correlation coefficients between MOS and each feature used in the objective quality assessment. Most correlation coefficients calculated in the case of Axial T1 images are close to 0, showing that there is no clear linear relation between MOS and any single feature that could by itself identify "good" or "bad" quality images. A very similar trend is observed in the case of Sagittal T2 images. Considering Sagittal T1 images, a few correlation coefficients rise to a magnitude of 0.6, for instance for uniformity measured in the vertebra or for spectral flatness, but this correlation is not strong enough to explain the classification obtained by subjective assessment.
Classification performance metrics obtained for the five ML models are described in Table 4. In general, the SVM showed superior performance in most cases, followed by LDA and MLP, while QDA showed the lowest performance. The SVM achieved an accuracy of over 73% in all cases; it reached 77% in Sagittal T2, together with a recall of 89% for this same image type. In addition, the reported AUC reflects a good discriminatory ability, with values higher than 70%.
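The per-feature correlation analysis summarized in Figure 4 can be sketched in Python as follows. The use of the Pearson coefficient is an assumption (the type of correlation coefficient is not specified beyond its linear interpretation), and the feature matrix and MOS values are synthetic.

```python
import numpy as np

def feature_mos_correlation(features, mos):
    """Pearson correlation between each feature column and the MOS.

    features: array of shape (n_exams, n_features); mos: shape (n_exams,).
    Constant columns, for which the coefficient is undefined, are set to 0.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(mos, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    numerator = (Xc * yc[:, None]).sum(axis=0)
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.zeros(X.shape[1])
    nonzero = denominator > 0
    r[nonzero] = numerator[nonzero] / denominator[nonzero]
    return r

# Synthetic example: four features with known relations to the MOS.
rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, size=95)
features = np.column_stack([
    mos,                   # perfectly correlated
    -2.0 * mos + 1.0,      # perfectly anti-correlated
    np.ones(95),           # constant feature
    rng.normal(size=95),   # feature unrelated to the MOS
])
r = feature_mos_correlation(features, mos)
```

A coefficient near 0 for a given feature, as for the unrelated column here, indicates that this feature alone carries no linear information about the MOS, even though it may still contribute to a nonlinear classifier.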

The agreement between each neuro-radiologist and the classification by SVM, the best-performing model, was evaluated and is displayed in Table 5. Agreement in all cases was moderate, with the best results between SVM and NR 1. The exam type that showed the best agreement between experts, Sagittal T2, showed one of the best agreements between SVM and NR, 0.48 ± 0.05. Conversely, Axial T1 images, which showed the lowest kappa index between NR, are also the ones that present the lowest kappa index between SVM and neuro-radiologists, 0.37 ± 0.04. In comparison with what is shown in Table 4, results are slightly improved for LDA, but no significant change is seen in SVM (data not shown).

Discussion
The agreement observed between the three radiologists, in Table 3, varies from fair to substantial. Even though it is usually easy to reach agreement when establishing a diagnosis, reaching agreement on subjective quality perception led to more discussion. Two examples can be found in [34,35], where agreement among experts was moderate in the subjective determination of glaucomatous visual field progression and in the subjective evaluation of sublingual microcirculation images, respectively. In our understanding, the observed differences in subjective evaluations could reflect the different ways in which the neuroradiologists interact with the images. Each of them was trained in a different school, and some began practicing neurosurgery before neuro-radiology, which might lead to different subjective expectations of image quality. The training radiologists receive has been shown to influence their behavior in reading images [36]. Differences in perception could also come from their differences in experience, as emphasized in [36,37]. In our understanding, the differences in subjective perception seen in this work reflect the reality of the existence of a range of expert evaluations. The system proposed here includes the variety of the experts' perceptions, in a way that would potentially be more robust and more generalizable in future works than one that reflects only the subjective evaluation of one expert, be it the most experienced one or not.
The experts' perception of image quality is emulated with good accuracy, 75.3 ± 2.4% on average in the testing condition, using the Support Vector Machine. A wide range of features was extracted either from the entire image or from specific user-defined ROIs in relation to lumbar anatomy. Features include characteristics known to influence image perception, such as signal-to-noise ratio or spatial resolution, but other less "intuitive" parameters were also taken into account, such as spectral flatness or entropy. The image quality evaluation obtained from the non-linear combination of these characteristics, shown in Table 4, is in better agreement with the experts' view than each of the features taken individually, as depicted in Figure 4. Comparing the results obtained with the literature is complex, since there is no similar model based on the same set of image types and machine learning techniques. However, it can be mentioned that the performance obtained is lower than that reported by Nakanishi et al. [38], who evaluated the efficacy of a fully automated method for assessing the image quality (IQ) of coronary computed tomography angiography (CCTA), obtaining an AUC of 0.96 and a kappa index of 0.67 for the agreement between automated and visual IQ assessment. Similarly, the performance obtained is lower than that reported by Küstner et al. [39], who proposed a new machine-learning-based reference-free MR image quality assessment framework, including the concept of active learning and applying classifiers such as SVM and Deep Neural Networks. Although these authors report a high accuracy (93.7%), they did not evaluate concordance with experts, making this comparison difficult. However, our performance results were closer to those reported by Pizarro et al. [40], who applied an SVM algorithm to the quality assessment of structural brain images, using global and region of interest (ROI) automated image quality features developed in-house and obtaining an accuracy of 80%. On the other hand, on natural images, correlation coefficients between subjective and predicted MOS close to 80% have been obtained [41]. When using reference images, published correlation coefficients are close to 95% [17] or 96% [42], using sparse representation and kernel ridge regression. Yet these implementations were applied to natural images, with the possibility of estimating visual information fidelity. It would be interesting to apply the methodologies described in the state of the art to the lumbar MRI data set and compare their performance with the algorithm proposed in this research. However, some conditions prevent this. For example, the algorithm proposed by Küstner et al. involves a Deep Learning model among its classifiers, which cannot be applied in our work due to the limited size of the database. On the other hand, Nakanishi et al. proposed an automated estimation using specific features of CCTA, and the novelty of this method lies largely in obtaining these features prior to the application of the Machine Learning model. Since the characteristics obtained in lumbar MRI are different, the method proposed by Nakanishi et al. is not readily transferable to our work. Finally, Pizarro et al. present a model very similar to the one developed in our research, since they use the MR image, extract its main characteristics and use the SVM as a classifier. Beyond the type of image used, a key step that differs between the two methods is the dimensionality reduction performed with PCA in our work, so we consider it unnecessary to replicate what was developed by these authors.
It is interesting to note that the results obtained here were obtained through machine learning from the three experts, taking into account more than an individual point of view. In the case of this application, the agreement between experts was not always qualified as "excellent", and the machine must learn different points of view. This is a common problem faced in machine learning, and the results obtained here show good performance in general. Human judgments or decisions are based on several variables that may seem reasonable; however, there is much unobserved information that cannot be captured. Meanwhile, machine learning methods rely only on the available data obtained from quantifiable features [43][44][45][46].
These results are encouraging with respect to the possibility of developing an automated system that could monitor not only "mathematical image quality" but also image quality as perceived by experts, who are the real final users of medical images. The method still needs to be fully automated to avoid human interaction for the positioning of the regions of interest in the analysis. These are only preliminary results, as the proposed method has been tested on three types of exams so far, and the number of exams, sites and vendors needs to be increased. This would increase the number of observations of "bad quality" images obtained with no artificial manipulation, and should therefore refine the capacity of the method proposed here to discriminate between image qualities. One of the specificities of the image types selected here is that they were acquired with a spine MRI coil, which means that the signal within the images was not homogeneous. Working with these kinds of images does not represent conditions common to all kinds of MR images.
There is a discussion on how to define image quality in medical applications. A crisp definition of good vs. bad quality was used in the present case. The use of a fuzzy definition of the transition between types might bring the behavior of this system closer to that of the human experts. It is important to emphasize that this system should be understood as a constant monitoring solution, and that the interest is not in detecting when one single exam was of poor quality but when, as a whole, the trend of the MR system is deteriorating from an image quality point of view.

Conclusions
In conclusion, a method is presented here that shows the feasibility of emulating expert perception of image quality in three types of lumbar MR images. Good accuracy is obtained on the set of images used (Sagittal T1, Sagittal T2, and Axial T1), 75.3 ± 2.4% on average in the testing condition, using a Support Vector Machine to construct the classifier. Using a nonlinear combination of quality features extracted from the images, an emulation is obtained of the combined views of three different experts, whose agreement on image quality varies between fair and substantial. Even though the current implementation still relies on user interaction to extract certain features from the images, the results are promising with respect to a potential implementation for monitoring image quality online with the image acquisition process.
Future work could include further features, such as block kurtosis of DCT coefficients [47], dominant eigenvalues of the covariance matrix [48] or kernel ridge regression [42], for instance. Other machine learning techniques, closer to deep neural networks [49], might also improve the performance of image assessment. The method in its essence can be applied to other kinds of images, by modifying the definition and localization of the ROIs in agreement with the organs observed. All other features proposed in Table 2 can be extracted for different kinds of medical images.
Further work is needed to confirm the observations in other experimental conditions and on other types of images, using for instance MR images of other anatomical areas, or computed tomography images. The automatic assessment of medical image quality is probably an issue that will arise more frequently as many artificial intelligence systems are developed based on large-scale databases, whose quality has been questioned [50]. Moreover, this study could be extended by increasing the number of subjective evaluators and introducing multi-criteria decision-making techniques [51] to account for the variability among the users.
Informed Consent Statement: As data was extracted from a previously anonymized database in the hospital, with no possibility for the investigators or any other person to access the identity of the individuals, and with the aim of the study focused on the technical quality of images, the Ethics Committee authorized waiving patient consent.