Generating Virtual Short Tau Inversion Recovery (STIR) Images from T1- and T2-Weighted Images Using a Conditional Generative Adversarial Network in Spine Imaging

Short tau inversion recovery (STIR) sequences are frequently used in magnetic resonance imaging (MRI) of the spine. However, STIR sequences require a significant amount of scanning time. The purpose of the present study was to generate virtual STIR (vSTIR) images from non-contrast, non-fat-suppressed T1- and T2-weighted images using a conditional generative adversarial network (cGAN). The training dataset comprised 612 studies from 514 patients, and the validation dataset comprised 141 studies from 133 patients. For validation, 100 original STIR and the respective vSTIR series were presented to six senior radiologists (blinded to the STIR type) in independent A/B-testing sessions. Additionally, for 141 real or virtual STIR sequences, the testers were required to produce a structured report of 15 different findings. In the A/B-test, most testers could not reliably identify the real STIR (mean error of testers 1–6: 41%; 44%; 58%; 48%; 39%; 45%). In the evaluation of the structured reports, the vSTIR was equivalent to the real STIR in 13 of 15 categories. For the number of STIR-hyperintense vertebral bodies (p = 0.08) and the diagnosis of bone metastases (p = 0.055), equivalence narrowly missed statistical significance. By virtually generating STIR images of diagnostic quality from T1- and T2-weighted images using a cGAN, examination times can be shortened and throughput increased.


Introduction
The spine is one of the most frequently examined body regions in MRI. The main indications are back pain, sensory deficits, and paralysis [1,2]. To visualize the most common pathologies, short tau inversion recovery (STIR) images are often used along with T1- and T2-weighted images. The STIR contrast is particularly useful in the diagnosis of acute pathologies, such as inflammation or acute vertebral fractures. In the example of a vertebral body fracture, STIR is used to detect vertebral edema and thus often enables a therapy-relevant differentiation between new and old fractures. Apart from that, the STIR sequence can inform the decision of whether contrast agent administration is required [3]. This is especially important considering the continuously increasing number of MRI examinations worldwide [4]. However, the acquisition of a STIR sequence requires a significant amount of scanning time, about three minutes [5], and is therefore susceptible to motion artifacts. In recent years, the introduction of new techniques based on deep learning has enabled advances in image processing that were previously widely considered impossible. For image processing, the use of generative adversarial networks (GANs) has become the predominant approach. GANs have been shown to be highly effective in CT denoising [6] and in inserting virtual contrast media into non-contrast MRI [7].
The aim of the present study was to generate virtual STIR (vSTIR) sequences from non-contrast, non-fat-suppressed sagittal T1- and T2-weighted sequences using a cGAN and to validate these synthetic images against experienced radiologists in blinded A/B-tests on clinical MR examinations of the spine.

Network Architecture and Preprocessing
Each scan was preprocessed by converting it into a 16-bit PNG image. The size of each slice was, in general, 512 × 512 px; in the few cases where the slice was larger, a central crop was performed across the entire scan. If the slices were smaller, the images were either padded with black to the required size of 512 × 512 px or, if the height was smaller than 256 px, dropped from the training set.
The T1- and T2-weighted images were used as input images. Additionally, a contrast-limited adaptive histogram equalization (CLAHE) filter [8] (tile size 32 × 32, clip limit 1.0) was applied to the T2 image and added as another channel of the input image. The intensities of all input images were rescaled to the range [−1, 1].
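The crop/pad and intensity-rescaling steps above can be sketched as follows. This is a minimal illustration assuming NumPy arrays; the function names are hypothetical, and the CLAHE channel (e.g., via OpenCV's `cv2.createCLAHE`) is omitted here.

```python
from typing import Optional

import numpy as np

TARGET = 512       # required slice size in px
MIN_HEIGHT = 256   # slices below this height were dropped from training

def crop_or_pad(slice_2d: np.ndarray, target: int = TARGET) -> Optional[np.ndarray]:
    """Center-crop larger slices and zero-pad ("black") smaller ones to
    target x target; return None for slices below the minimum height."""
    h, w = slice_2d.shape
    if h < MIN_HEIGHT:
        return None
    # central crop along any dimension that exceeds the target
    if h > target:
        top = (h - target) // 2
        slice_2d = slice_2d[top:top + target, :]
    if w > target:
        left = (w - target) // 2
        slice_2d = slice_2d[:, left:left + target]
    # zero-pad along any dimension below the target
    h, w = slice_2d.shape
    pad_h, pad_w = target - h, target - w
    return np.pad(slice_2d, ((pad_h // 2, pad_h - pad_h // 2),
                             (pad_w // 2, pad_w - pad_w // 2)))

def rescale_to_unit_range(img: np.ndarray) -> np.ndarray:
    """Linearly rescale intensities to [-1, 1], as done for all input channels."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    return 2.0 * (img - lo) / (hi - lo) - 1.0
```

In this sketch the per-slice minimum/maximum define the rescaling; whether normalization was applied per slice or per scan is not specified in the text.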
The Pix2PixHD framework was employed, as it has exhibited excellent performance in image-to-image tasks [9]. It is a conditional generative adversarial network using a combination of two residual networks, called the local and global generators. The global generator produces lower-resolution images that are enhanced by the local generator. The Pix2PixHD architecture was otherwise left unchanged for this study. As the output images were single-channel 16-bit vSTIR, the last layer of the network was modified to produce such output. The feature-matching (VGG) part of the loss function, which is defined on RGB images only, was adapted by simple averaging to work with gray-scale images. The network was trained for 300 epochs, and all other parameters were left at their defaults (learning rate 0.0002, Adam optimizer with momentum 0.5).
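The exact form of the gray-scale adaptation is not spelled out in the text; a common way to reconcile single-channel images with an RGB-pretrained feature extractor such as VGG is to replicate the channel, so that averaging the three channels recovers the original gray values. The sketch below illustrates this assumption only; it is not necessarily the authors' exact implementation.

```python
import numpy as np

def gray_to_rgb(gray: np.ndarray) -> np.ndarray:
    """Replicate a single-channel image (H, W) into three channels (H, W, 3)
    so that an RGB-defined loss (e.g., the VGG feature-matching loss) can be
    evaluated; the channel-wise average equals the original gray image."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)
```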

MRI
The MRIs were performed on 1.5T and 3T MRI machines (MAGNETOM Symphony, MAGNETOM Sonata, MAGNETOM Avanto, MAGNETOM Aera, MAGNETOM Skyra) from a single vendor (Siemens Healthineers AG, Erlangen, Germany) between 2007 and 2019 at a single center (Table 1). All MRI examinations contained a sagittal non-contrast, non-fat-suppressed T1 and T2 as well as a STIR sequence with a matching field of view. The MRI scan parameters are listed in the supplementary material (Tables S1–S3).

Dataset
Using our clinical PACS, a set of 980 MRI examinations of the spine from the years 2007–2019 was identified for this study. The dataset was curated by removing studies with incomplete series or non-matching T1, T2, and STIR sequences (e.g., a non-sagittal STIR). The remaining scans were then visually inspected by an experienced radiologist to ensure that no misalignment between the sequences was present, resulting in 753 scans with T1/T2 and STIR images of 637 patients that were finally selected for training. For validation, two datasets were assembled, whose minimum size was calculated beforehand with a power analysis. For the power analysis, a two-sided equivalence test with a statistical significance level alpha of 0.05 was assumed. A power calculation [10] with a power of 0.8, an accepted equivalence margin (delta) of 0.1, and expected confusion rates p01 and p10 of 0.05 resulted in a minimum sample size of n = 86. For the first cohort, which was to be evaluated in an A/B-test to verify whether the vSTIR is identifiable by a radiologist, 100 studies that were not part of the training cohort were randomly selected.
However, with this sample size, it is possible that certain pathologies are not sufficiently represented. Therefore, the second validation cohort was designed so that at least 20 studies with each of the most important pathologies (bone metastases, myelopathy, acute vertebral fractures, spondylodiscitis, epidural abscess, intraspinal masses, and muscular lesions) were represented. Furthermore, at least 20 healthy patients were included to check whether pathologies were artificially inserted by our GAN [11]. In total, the cohort meeting all of the above-mentioned requirements comprised 141 studies (Table 2). None of these studies were part of the training cohort. The distribution of the pathologies among the validation cohort is illustrated in Table 3.

Validation
Two different validations were used. First, to determine whether the vSTIR was visually distinguishable from the real STIR (rSTIR), 100 MRI series from 100 distinct patients were presented to six senior radiologists in independent A/B-testing sessions. In each case, the STIR and vSTIR series were presented in randomized order (Figure 1), and the radiologist was asked to identify the rSTIR sequence.
Second, to validate whether pathologies were represented qualitatively and quantitatively correctly in the vSTIR images, the validation dataset comprising 141 STIR sequences from 131 distinct patients was presented. Blinded to which series they were shown, the radiologists were asked to perform a structured assessment of the pathological findings (Figure 2). The readers received no information about the type of STIR sequence (virtual/real) and no clinical information about the patient. In total, each vSTIR and rSTIR sequence was assessed by three different senior radiologists. This number of readers was chosen in order to calculate the mean and standard deviation of all quantitative findings for both the virtual and the real STIR. The structured reports were compared independently for each pathology to determine whether the vSTIR was diagnostically equivalent to the rSTIR. For this purpose, the number of collapsed vertebral bodies, the number of vertebral bodies with edema, and the number of STIR-hyperintense discs were reported as ordinal values. Additionally, the testers were asked to determine whether it was an rSTIR and whether the following pathologies and findings were present: intraspinal mass, myelopathy, muscular edema, muscular abscess, epidural abscess, spondylodiscitis, bone metastases, intraspinal neoplasia, acute traumatic fracture, pathological fracture, or benign bone neoplasia. Finally, the testers were asked to determine whether the case was normal. Both validation steps were performed using a generic framework for A/B-testing developed in-house (Figures 1 and 2). The validation images were processed in the same way as the training data. For display purposes, all images were rescaled to the intensity range 0–65,535.

Statistical Analysis
The ordinal values were converted to three categories: none, low (1–2), or high (>2). The ground truth for the rSTIR ratings was determined as the median of the three ratings; the vSTIR ratings were gathered in the same way. An equivalence test for categorical data was used to compare the ratings [12], while an equivalence test of proportions following the procedure of Liu 2002 [10] was employed for binary outcomes. Inter-rater agreements were computed following Cohen.
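The categorization and ground-truth steps above can be made concrete with a short sketch (function names are hypothetical):

```python
from statistics import median

def categorize(count: int) -> str:
    """Map an ordinal count to the three analysis categories."""
    if count == 0:
        return "none"
    return "low" if count <= 2 else "high"

def ground_truth(ratings: list) -> float:
    """Ground truth per finding: the median of the three raters' values."""
    return median(ratings)
```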
A one-sided Fisher test was employed to determine whether each rater was able to distinguish the real and virtual images. Fleiss' kappa was used to determine the inter-rater agreement.
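For reference, Fleiss' kappa compares observed per-subject agreement with the agreement expected by chance from the marginal category proportions. A minimal pure-Python sketch (the study itself used the R `irr` library):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-x-categories table, where counts[i][j]
    is the number of raters assigning subject i to category j. All subjects
    must be rated by the same number of raters."""
    N = len(counts)            # number of subjects
    n = sum(counts[0])         # raters per subject
    k = len(counts[0])         # number of categories
    # observed per-subject agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1, while values near or below 0 (as observed in the A/B-test) indicate chance-level or worse agreement.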
All statistical tests were computed using R 3.6 and the irr library.

Results
In the A/B-test, only two of the six raters showed a statistically significant ability to distinguish virtual from real images. However, the error rate was rather high in both cases (39% and 41%), and the inter-rater agreement was quite low (Fleiss' kappa = −0.03, p = 0.25). In 34% of the cases, the raters were split (i.e., three raters chose the rSTIR while the other three chose the vSTIR); in 41% of the cases, a majority (i.e., five or six raters) chose the rSTIR, while in 25% of the cases, a majority chose the vSTIR. Overall, the testers were only marginally better than a coin toss, and a single tester was even worse than an average coin toss. The results of the individual testers are listed in Table 4. Several examples of the validation cohort are shown in Figure 3.
The analysis of the structured reports revealed that the vSTIR was equivalent to the rSTIR in 13 of 15 categories (Table 5). The two categories in which equivalence was not shown were the number of STIR-hyperintense vertebral bodies and the diagnosis of bone metastases. With p-values of 0.08 and 0.055, respectively, both categories narrowly missed statistical equivalence. In the category of detecting the true STIR, an average detection rate of only 57% was found, with a very low inter-rater agreement of 0.01–0.02, consistent with the previous A/B-test. Mean STIR/vSTIR represents how often the pathology was identified, on average, in the images. The inter-rater agreements describe how often the raters were in agreement for a given pathology, and the p-value measures whether the agreement was significantly different from 0 (i.e., no agreement at all). The significance of the difference tests whether the two inter-rater agreements differ significantly. For the equivalence tests, the null hypothesis is that there is a difference between STIR and vSTIR, while the alternative hypothesis is their equivalence.
To calculate the time saved by generating the vSTIR sequence, the acquisition times of the T1, T2, and STIR sequences were extracted from the DICOM headers. The acquisition of the STIR sequence took 188.5 ± 46.7 s, on average, compared to 164.6 ± 48.1 s for the T1 and 132.2 ± 40.3 s for the T2 scan (Figures 4 and 5).
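As a quick sanity check of these figures, dropping the STIR acquisition saves roughly 39% of the three-sequence sagittal protocol time:

```python
# Mean acquisition times from the DICOM headers (seconds), as reported above
t_stir, t_t1, t_t2 = 188.5, 164.6, 132.2

total = t_stir + t_t1 + t_t2      # full sagittal T1 + T2 + STIR protocol
saved_fraction = t_stir / total   # share of protocol time saved by dropping STIR
```

This corresponds to the "about three of eight minutes" of sagittal scan time discussed below; absolute savings per examination depend on the full protocol actually acquired.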

Discussion
The aim of the present study was to generate virtual STIR (vSTIR) sequences from non-contrast, non-fat-suppressed sagittal T1- and T2-weighted sequences using a cGAN and to validate these synthetic images against experienced radiologists in blinded A/B-tests on clinical MR examinations of the spine. With this approach, we were able to generate high-quality synthetic STIR images that could not be distinguished from the real images even by experienced radiologists in a blinded A/B-test. In addition, a qualitative and quantitative evaluation of the pathologies depicted in the sequences showed no relevant difference between synthetic and real images, although there was a relatively high inter-rater variability.
Applications based on artificial intelligence have demonstrated high potential in a variety of medical tasks, such as the prediction of tumor histology [13], the detection of lung nodules [14], or artifact reduction in PET imaging [15]. At the same time, there are few applications that increase time efficiency in daily radiological image acquisition, even though this is in great demand considering the continuously increasing number of MRI examinations worldwide [16].
In this study, we developed a method to generate STIR images from non-fat-suppressed T1 and T2 images using a cGAN to reduce the scan time and the recall rate of spinal MRI. For this purpose, a paired image-to-image translation was used [9], as this offers, on the one hand, higher accuracy compared to an unpaired approach [17] and, on the other hand, more efficient monitoring of the training cohort. This is especially important when considering the possible dangers of a completely unsupervised cohort. In this context, Cohen et al. were able to demonstrate that pathologies can be artificially inserted or removed by an unpaired image-to-image conversion when the network is trained on an imbalanced cohort [11]. However, a system based on image pairs also carries risks of its own, as GANs are known to produce artifacts such as checkerboard artifacts [18]. Therefore, we qualitatively and quantitatively evaluated the similarity of the vSTIR images to the rSTIR images. When vSTIR and rSTIR images were directly compared in an A/B-test, six consultant radiologists, each with at least seven years' experience in musculoskeletal imaging, were not able to reliably identify the rSTIR images.
At the same time, several studies have already indicated that image data generated by GANs can look deceptively real without representing reality [19,20]. After demonstrating that the generated STIR sequence looks real, it was therefore important to reveal that the sequence also reflects reality.
For this reason, we had the examinations assessed by experienced radiologists in blinded A/B-tests with regard to the pathologies depicted. In this analysis, the vSTIR was equivalent to the rSTIR in 13 of 15 categories. Very similar values were obtained in the two remaining categories (number of STIR-hyperintense vertebral bodies and diagnosis of bone metastases), which narrowly missed statistical equivalence. On the one hand, the lack of significance could be a product of chance due to the variance between the evaluators. Alternatively, in a few cases, the vSTIR may not correspond equivalently to the rSTIR in these two categories. Ultimately, this should be tested in a prospective trial with a larger number of patients.
To evaluate how much time is saved by the vSTIR, the average acquisition times of the T1, T2, and STIR images were compared. For an average spine MRI, which consists of a sagittal T1, T2, and STIR, about three of eight minutes of scan time could be saved. This time saving can increase the number of patients that can be scanned with one device by about one third, which significantly improves the cost efficiency of the system. In the future, this method could be combined with GAN-based compressed sensing [21] to further speed up MRI and cope with the increasing demand.
A similar model for converting a T1 or T2 sequence into a STIR sequence has been developed by Galbusera et al. [22]. In comparison, they achieved very mixed results for different pathologies. This may be because T1 and T2 images contain independent information [23]; combining them naturally increases the information available for different pathologies. To date, the only publication that combines T1 and T2 to generate a STIR sequence using deep learning was recently published by Kim et al. [24]. With only 12 healthy volunteers, this study demonstrated that deep learning can be used to generate real-looking STIR images of a knee MRI. However, it could not demonstrate whether this virtual STIR sequence also depicts clinical reality and correctly represents pathologies. Therefore, our study is the first to generate a virtual STIR sequence with a large cohort of 657 patients that is significantly equivalent to the rSTIR in 13 of 15 categories. By means of this method, it is not only possible to generate real-looking STIR images but, above all, to generate images that depict reality.
Regarding limitations, the datasets contained only MRI examinations from a single vendor; therefore, the network may not generalize to other MRI vendors [25,26]. Furthermore, our method was validated for only 15 different categories of pathologies/findings; it remains uncertain whether the vSTIR equals the rSTIR in demonstrating other pathologies. A true 3D or 2.5D network may be able to incorporate more local information into the generation of the vSTIRs, thereby increasing output quality.

Conclusions
In conclusion, our study underlines the potential of a cGAN for generating STIR images from T1 and T2 images. Overall, we obtained very good results regarding the similarity of the vSTIR to the rSTIR images and the display of the most important pathologies. This may lead to reduced MR scanning times and a reduced re-scan rate. As a next step, our database must be enlarged and validated on a multi-center basis to avoid overfitting to a single vendor.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/diagnostics11091542/s1: Table S1: TE and TR relaxation times in ms (mean ± standard deviation); Table S2: slice thickness in mm (mean ± standard deviation); Table S3: pixel spacing in mm (mean ± standard deviation).

The DFG had no role in the study design, data collection, data interpretation, data analysis, or writing of the report. The corresponding authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of the University Hospital Essen (Approval Code: 19-8891-BO; Approval date: 30 August 2019).

Informed Consent Statement:
The Institutional Review Board has waived the requirement of written informed consent due to the retrospective nature of the study. All data were anonymized before inclusion in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available for data protection reasons.