Detection of Degenerative Changes on MR Images of the Lumbar Spine with a Convolutional Neural Network: A Feasibility Study

Our objective was to evaluate the diagnostic performance of a convolutional neural network (CNN) trained on multiple MR imaging features of the lumbar spine, to detect a variety of different degenerative changes of the lumbar spine. One hundred and forty-six consecutive patients underwent routine clinical MRI of the lumbar spine including T2-weighted imaging and were retrospectively analyzed using a CNN for detection and labeling of vertebrae, disc segments, as well as presence of disc herniation, disc bulging, spinal canal stenosis, nerve root compression, and spondylolisthesis. The assessment of a radiologist served as the diagnostic reference standard. We assessed the CNN’s diagnostic accuracy and consistency using confusion matrices and McNemar’s test. In our data, 77 disc herniations (thereof 46 further classified as extrusions), 133 disc bulgings, 35 spinal canal stenoses, 59 nerve root compressions, and 20 segments with spondylolisthesis were present in a total of 888 lumbar spine segments. The CNN yielded a perfect accuracy score for intervertebral disc detection and labeling (100%), and moderate to high diagnostic accuracy for the detection of disc herniations (87%; 95% CI: 0.84, 0.89), extrusions (86%; 95% CI: 0.84, 0.89), bulgings (76%; 95% CI: 0.73, 0.78), spinal canal stenoses (98%; 95% CI: 0.97, 0.99), nerve root compressions (91%; 95% CI: 0.89, 0.92), and spondylolisthesis (87.61%; 95% CI: 85.26, 89.21), respectively. Our data suggest that automatic diagnosis of multiple different degenerative changes of the lumbar spine is feasible using a single comprehensive CNN. The CNN provides high diagnostic accuracy for intervertebral disc labeling and detection of clinically relevant degenerative changes such as spinal canal stenosis and disc extrusion of the lumbar spine.


Introduction
Lower back pain is among the leading causes of morbidity and disability, with an increasing prevalence due to the steadily aging population worldwide [1]. According to the American College of Radiology, lumbar spine magnetic resonance imaging (MRI) is the preferred imaging modality to rule out causes of complicated lower back pain and to decide whether conservative or invasive therapeutic approaches should be considered [2,3].
Consequently, the number of MRI studies of the lumbar spine has been rising over the last decades at a much higher rate than the number of trained radiologists who could adequately interpret the MR images [4]. To address this challenge, automated systems for pathology detection and grading on MRI scans can be utilized to support clinical reporting and reduce the steadily increasing workload for radiologists.
Machine learning algorithms to analyze lumbar spine imaging have successfully been used for segmentation, single measurements, and labeling tasks in radiographs [5,6], computed tomography (CT) [7][8][9][10], and MRI studies of the spine [11][12][13][14][15][16]. To date, a few attempts have been made to perform multiple automated radiological gradings such as Pfirrmann grading, disc narrowing, spondylolisthesis, central canal stenosis, endplate defects, and bone marrow alterations with a single software solution [17]. However, to the best of our knowledge, there are no software solutions available yet that address the detection of multiple pathologies in MR imaging studies of the lumbar spine. Here, we evaluated the automated detection of degenerative changes in the lumbar spine using a comprehensive software solution ("CoLumbo", SmartSoft Ltd., Varna, Bulgaria). This algorithm is designed to label the segments of the lumbar spine and to detect a broad variety of degenerative pathologies based on a convolutional neural network (CNN). CNNs have been widely used in the field of medical imaging. They are known to be efficient and accurate, usually outperforming other machine learning approaches, including other deep-learning-based approaches, in medical image analysis and beyond [18][19][20]. CNNs are space invariant networks. Because the response to a shifted image should be a correspondingly shifted feature map, they use a shared-weight architecture. Generally, convolution filters (for example a 3 × 3 matrix) slide along the input features and produce a corresponding output feature map. This weight sharing can significantly reduce the number of trainable parameters when compared to standard fully connected neural networks, and hence reduces the need for large data sets [21]. Moreover, from a practical point of view, CNNs are more straightforward to train than recurrent neural networks, as the latter face issues like exploding or vanishing gradients [22,23].
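The weight sharing and shift behavior described above can be illustrated with a minimal Python sketch. It is not taken from the software under study: the toy image, the 3 × 3 kernel values, and the function names `conv2d_valid` and `shift_right` are assumptions chosen purely for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide a small kernel over a 2D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # The same kernel weights are reused at every position (weight sharing).
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def shift_right(image, pixels):
    """Shift a 2D array to the right, padding with zeros on the left."""
    return [[0.0] * pixels + row[:len(row) - pixels] for row in image]


# Toy 6 x 8 image containing a single bright pixel (illustrative values only).
image = [[0.0] * 8 for _ in range(6)]
image[2][2] = 1.0
kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]

feature_map = conv2d_valid(image, kernel)
feature_map_of_shifted = conv2d_valid(shift_right(image, 2), kernel)

# Shifting the input shifts the feature map by the same amount.
print(feature_map_of_shifted == shift_right(feature_map, 2))

# Parameter count: one 3 x 3 filter versus a dense layer mapping
# a 512 x 512 image to a 512 x 512 output.
conv_weights = 3 * 3
dense_weights = (512 * 512) ** 2
print(conv_weights, dense_weights)
```

In this toy case the bright pixel lies far enough from the image border that the equality holds exactly; near the border, padding effects would break it.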
At present, the leading software solutions in general image segmentation challenges and in spine segmentation challenges are based on CNNs. Specifically, CoLumbo aims to detect the presence and the location of disc herniation, disc bulging, nerve root compression, spinal canal stenosis, and spondylolisthesis. Currently, this CNN-based algorithm (entry smartsoftv2) is leading the IVDM3Seg challenge on automatic intervertebral disc localization and segmentation from 3D multimodality MR (M3) images [24], a spine segmentation competition that was established in association with the international conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2018, Granada, Spain.
The primary aim of this study was to validate the algorithm's diagnostic performance and to determine whether the automated approach is generally feasible and provides the potential for clinical use. Therefore, we evaluated the diagnostic accuracy of the findings detected by the algorithm and the generalizability of the CNN to a new previously unseen data set.

Materials and Methods
Institutional Review Board approval was obtained for this retrospective diagnostic study and the need for written informed consent was waived.

Case Selection and Expert Reading
Patients aged between 18 and 70 years who underwent routine MRI of the lumbar spine due to lower back pain were eligible for inclusion in this study. Exclusion criteria were presence of vertebral fractures and/or active inflammation as determined on routine clinical MRI, history of previous or concomitant malignancy, prior spine surgery, and metallic implants at the spinal level. Additionally, patients with incomplete examinations or severe motion and/or susceptibility artifacts were excluded. In patients who underwent repeated MRI scans, only the first examination was included in the analysis to avoid inclusion of repeated observations made in the same patient. The MR images included in the study were evaluated retrospectively by an experienced radiologist (with 5 years of experience in interpreting spine MRI) under full consideration of the originally written reports and the clinical history of the patients. This expert reading served as the diagnostic reference standard. The expert reader analyzed a total of 888 lumbar spine segments (TH12/L1-L5/S1) for the presence or absence of disc herniation, spinal canal stenosis, and bulging. For nerve root compression, 1036 segments (TH12-S1) were analyzed. Disc herniation was further subdivided into protrusion and extrusion in order to differentiate smaller and potentially less relevant herniations from larger and clinically more relevant findings. Disc extrusion was defined as a focal herniation with the distance between the edges of the herniating disc material into the spinal canal being greater than the diameter at the base of the herniation. Disc bulging was defined as the annulus fibrosus extending beyond the edges of the disc space, affecting more than 25% of the circumference of the disc.
These classifications were done according to the nomenclature proposed by the consensus statement of the American Society of Spine Radiology (ASSR), American Society of Neuroradiology (ASNR), and North American Spine Society (NASS) [25]. Listhesis was graded according to the Meyerding classification [26]. Spinal canal stenosis was deemed present when there was a loss of anterior CSF signal on the sagittal images and a loss of CSF signal in the spinal canal on the axial images with an aggregation of the cauda equina fibers, according to moderate or severe spinal canal stenosis as proposed by Lee et al. [27].

MR Imaging Protocol
All lumbar spine MR examinations were performed using either a clinical 3.0 T MR scanner (Achieva, Philips Healthcare, Best, The Netherlands) or a 1.5 T MR scanner (Achieva, Philips Healthcare, Best, The Netherlands). All patients were placed in the supine position. MR imaging of the spine was acquired according to the routine clinical MRI protocol used at our institution which included at least a sagittal T1-weighted spin-echo (450-750/6-12 [repetition time (TR) msec/echo time (TE) msec]), a sagittal and axial T2-weighted turbo spin-echo sequence (3000-5000/80-120 [TR/TE]) as well as a sagittal T2 spectral attenuated inversion recovery (SPAIR)-weighted turbo spin-echo sequence (3000-5000/50-120 [TR/TE]) or T2-weighted mDixon sequence (3000-5000/50-120 [TR/TE]). Slice thickness was 4 mm for sagittal imaging (slice gap 0.4 mm) and 3.5 mm for axial imaging (slice gap 0.35 mm). Field of view and matrix size were tailored to the individual patients' characteristics by the radiological technician. T2-weighted axial and sagittal images of the 146 patients were anonymized and extracted from the institute's archive as DICOM files to make them accessible to the algorithm for image analysis.

Machine Learning Algorithm and Image Analysis
The software automatically registers anatomic structures of the lumbar spine such as the vertebral bodies, intervertebral discs, the pedicles, the spinous processes and laminae, the ligamenta flava, the dural sac, the nerve roots, the aorta (to measure its diameter), and the erector spinae muscle (to analyze its fatty degeneration). Additionally, the CNN is designed to detect multiple pathologies, such as disc herniation and disc bulging, spinal canal stenosis, nerve root compression, and spondylolisthesis.
The algorithm in this study incorporates a three-step process for pathology identification and characterization: 1. segmentation of different tissue types; 2. measurements of clinically used distances observable in the image; 3. diagnosis, which is described in detail elsewhere [28]. In brief, a 2D single modality algorithm using a U-Net-based convolutional neural network (CNN) [29] is utilized in each plane of view. Each of the four up-/downsampling steps is achieved using 2 × 2 upsampling/maxpool operations. Additional feature maps from the downscaling part are incorporated at each upscaling layer. These feature maps are based on a fully convolutional network, similar to ResNet-50 [30]. The initial low-resolution feature map (64 × 64) is then upscaled using a U-Net-like architecture to the resolution of the input image (512 × 512) ( Figure 1). In this way, higher-level, lower-resolution classification features are used as context and the higher resolution is used for finer details.
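The 2 × 2 down- and upsampling operations and the reuse of higher-resolution feature maps mentioned above can be sketched in plain Python. This is only a schematic illustration on a toy 4 × 4 feature map. Nearest-neighbour upsampling is an assumption (the text specifies only "2 × 2 upsampling"), and the names `maxpool2x2`, `upsample2x2`, and `fused` are hypothetical.

```python
def maxpool2x2(fmap):
    """Halve height and width by taking the maximum of each 2 x 2 block."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def upsample2x2(fmap):
    """Double height and width by repeating each value (nearest neighbour)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Toy high-resolution feature map (illustrative values only).
fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

coarse = maxpool2x2(fine)        # downscaling path
upsampled = upsample2x2(coarse)  # upscaling path, back to 4 x 4

# Skip connection: pair each fine-resolution value with the upsampled
# coarse value at the same pixel, so later layers see detail and context.
fused = [[(f, c) for f, c in zip(fine_row, coarse_row)]
         for fine_row, coarse_row in zip(fine, upsampled)]
print(coarse)
print(fused[0])
```

In an actual U-Net the paired values would be channel stacks processed by further convolutions; the tuple pairing here only shows which features are combined.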
As a next step, the segmentation is utilized to perform measurements and to classify the different pathologies. For disc herniation and disc bulging, the segmented contour of the vertebral bodies adjacent to the intervertebral disc is projected over the intervertebral disc on the axial slice at disc level. The algorithm then identifies the parts of the disc that exceed the projection of the contour of the neighboring vertebral bodies, and the anterior-posterior distance of the disc exceeding the projected contour of the vertebral bodies is measured. Herniation or bulging is defined as present when the diameter exceeds 3 mm by default. The differentiation between disc herniation and disc bulging is based on the lumbar disc nomenclature: version 2.0 [25].
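The decision logic described in this paragraph, combined with the definitions of bulging and extrusion given under Case Selection and Expert Reading, might be sketched as follows. How the software implements these rules internally is not described here, so the sketch is an assumption: the `DiscExcess` fields, the function name `classify_disc`, and the handling of values exactly at a threshold are all hypothetical. It classifies one contiguous region of displaced disc material at a time.

```python
from dataclasses import dataclass


@dataclass
class DiscExcess:
    """Hypothetical measurements for one region of displaced disc material."""
    depth_mm: float                # anterior-posterior distance beyond the projected contour
    circumference_fraction: float  # share of the disc circumference involved (0 to 1)
    max_width_mm: float            # greatest distance between the edges of the displaced material
    base_width_mm: float           # width at the base of the displaced material


def classify_disc(excess, threshold_mm=3.0):
    """Apply the 3 mm default threshold, then the nomenclature-based split."""
    if excess.depth_mm <= threshold_mm:
        return "normal"
    # Bulging: more than 25% of the disc circumference is affected.
    if excess.circumference_fraction > 0.25:
        return "bulging"
    # Focal displacement is a herniation; extrusion if wider than its base.
    if excess.max_width_mm > excess.base_width_mm:
        return "herniation (extrusion)"
    return "herniation (protrusion)"


print(classify_disc(DiscExcess(4.0, 0.40, 0.0, 0.0)))
print(classify_disc(DiscExcess(6.5, 0.15, 9.0, 6.0)))
```

Because each region is classified separately, a single disc can carry both a herniation and a bulging finding.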
For spinal canal stenosis, the cross-sectional area of the dural sac is measured. A dural sac area of 75 to 100 mm² is defined as relative stenosis, and a dural sac area of less than 75 mm² is defined as absolute stenosis by the developers. Both relative and absolute spinal canal stenosis reported by the software were counted as positive findings by the authors.
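The area thresholds above translate into a simple rule, sketched below in Python. The function names are hypothetical, and treating an area of exactly 100 mm² as relative stenosis is an assumption, since the text does not state how boundary values are handled.

```python
def classify_stenosis(dural_sac_area_mm2):
    """Map a dural sac cross-sectional area (in square millimetres) to a category."""
    if dural_sac_area_mm2 < 75.0:
        return "absolute"
    if dural_sac_area_mm2 <= 100.0:  # boundary handling is an assumption
        return "relative"
    return "none"


def is_positive_finding(dural_sac_area_mm2):
    """Relative and absolute stenosis both count as positive in this study."""
    return classify_stenosis(dural_sac_area_mm2) != "none"


# 201 is the dural sac area shown in the Figure 2 example.
for area in (201.0, 90.0, 60.0):
    print(area, classify_stenosis(area), is_positive_finding(area))
```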
Nerve root compression is reported when a herniated disc or other tissue is in contact with the nerve root and the nerve root is deviated. To assess deviation, the algorithm estimates where a nerve root in contact with a herniated disc would be expected to lie, based on the position of the nerve root on the adjacent axial slices.
For spondylolisthesis, a tangent through the posterior aspect of the vertebral bodies adjacent to the specific intervertebral disc is drawn. These tangents cross a line drawn along the superior endplate of the inferior vertebral body at different points. The distance between these points is measured and compared to the length of the superior endplate of the inferior vertebral body and the ratio of the two distances is calculated, determining the percentage of spondylolisthesis. Spondylolisthesis is graded according to the Meyerding classification.
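As a worked illustration of this ratio, the following Python sketch computes the slip percentage and maps it to a Meyerding grade using the commonly cited 25% steps. The function names, the example distances, and the assignment of boundary values to the lower grade are assumptions made for illustration; the software's actual cut-offs, including any tolerance for very small offsets, are not described here.

```python
def slip_percentage(slip_distance_mm, endplate_length_mm):
    """Offset between the two posterior tangents relative to the endplate length."""
    if endplate_length_mm <= 0:
        raise ValueError("endplate length must be positive")
    return 100.0 * slip_distance_mm / endplate_length_mm


def meyerding_grade(slip_percent):
    """Meyerding grade: I up to 25%, II up to 50%, III up to 75%, IV up to 100%, V above."""
    if slip_percent <= 0:
        return 0  # no measurable slip
    for grade, upper_limit in enumerate((25.0, 50.0, 75.0, 100.0), start=1):
        if slip_percent <= upper_limit:
            return grade
    return 5


# Hypothetical measurement: 4 mm offset on a 32 mm superior endplate.
percent = slip_percentage(4.0, 32.0)
print(percent, meyerding_grade(percent))
```

A rule of this kind labels any non-zero offset as at least grade I, so a practical implementation would need a minimum-slip tolerance to avoid over-reporting.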
The process of measurement and classification of disc herniation, disc bulging, nerve root compression and spinal canal stenosis is illustrated in Figure 2.
Figure 2. Measurement and classification of disc herniation, disc bulging, nerve root compression and spinal canal stenosis. T2-weighted, axial slice through the segment L5/S1 at disc level showing disc herniation, disc bulging, no nerve root compression and no spinal canal stenosis. The projected contour of the vertebral body adjacent to the disc is represented by the rounded blue line, the intervertebral disc is represented by the blue area. The red area represents herniated disc material; the distance between the red crosses is measured as 6.5 mm and is therefore correctly classified as disc herniation. The blue lines and the single red line on the right side of the disc perpendicular to the projection of the contour of the vertebral body represent measurements of the disc exceeding the boundaries of the adjacent vertebral bodies, correctly reported as 4 mm bulging. The nerve roots (pink) have no contact to either the herniated or the bulging parts of the disc, therefore nerve root compression was correctly classified as absent. The light blue area represents the dural sac of 201 mm². There was no spinal canal stenosis reported.
The initial training set consisted of 1500 patient studies, including a total number of 20,000 axial slices and 10,000 sagittal slices manually annotated by board-certified radiologists. These patient studies were provided by three different European medical centers. The training started with a model whose feature extractor was pretrained on the COCO dataset [31], while the rest of the model was trained without pretraining. For the segmentations, pixelwise cross-entropy loss was used. The optimization algorithm was gradient descent with momentum. L2 regularization, dropout, and data augmentation were used. The training process lasted for 300,000 epochs.
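To make the training objective concrete, the following Python sketch computes a pixelwise cross-entropy loss on a toy 2 × 2 segmentation and performs one gradient descent step with momentum and L2 regularization. None of the numerical settings are reported in the text: the learning rate, momentum coefficient, weight decay, toy logits, and all function names are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class scores of one pixel."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_cross_entropy(logit_map, label_map):
    """Mean over all pixels of -log(probability assigned to the true class)."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            total += -math.log(softmax(logits)[label])
            count += 1
    return total / count


def momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One gradient descent update with momentum; weight_decay adds the L2 term."""
    new_velocity = [momentum * v - lr * (g + weight_decay * w)
                    for w, g, v in zip(weights, grads, velocity)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy 2 x 2 image with two classes (0 = background, 1 = disc); values are made up.
logit_map = [[[2.0, 0.0], [0.0, 2.0]],
             [[0.0, 0.0], [1.0, 3.0]]]
label_map = [[0, 1],
             [0, 1]]
print(pixelwise_cross_entropy(logit_map, label_map))

weights, velocity = momentum_step([1.0], [0.5], [0.0], lr=0.1, weight_decay=0.0)
print(weights, velocity)
```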
Each MRI study was analyzed by the algorithm, thereby segmented and labeled for each lumbar spine segment and subsequently classified for presence or absence of the previously mentioned pathologies. The results were taken from an automatically generated report on the software user interface and compared to the findings detected by the expert reader.

Statistical Analysis
Statistical analyses were performed with R version 4.0.3 and RStudio version 1.2.5033 (RStudio, Inc., Boston, MA, USA) using the caret package [32]. All applicable clinical and imaging data are given as mean ± standard deviation, unless otherwise specified. Statistical significance level was set at p < 0.05. The diagnostic performance of the CNN was compared to the radiologist's findings using confusion matrices and the McNemar test. Additionally, we calculated sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy.
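For readers who wish to reproduce the McNemar test outside R, the following Python sketch implements the continuity-corrected statistic on the two discordant cell counts of a 2 × 2 table. It is an illustrative re-implementation, not the code used in this study; the function name is hypothetical. The example counts are the discordant disc herniation classifications reported in the Results (117 discordant segments, of which 19 were false negatives, leaving 98 false positives).

```python
import math


def mcnemar_test(b, c, continuity_correction=True):
    """McNemar test on the two discordant counts b and c of a paired 2 x 2 table.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity_correction:
        diff = max(diff - 1.0, 0.0)
    statistic = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Disc herniation: 98 false positives versus 19 false negatives.
statistic, p_value = mcnemar_test(98, 19)
print(statistic, p_value)
```

A small p-value indicates that the two kinds of disagreement are not balanced, that is, that the CNN systematically over- or under-calls the finding relative to the reader.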

Patient Cohort
Between November 2018 and July 2020, 567 consecutive patients with lower back pain underwent MRI examinations of the lumbar spine at our institution, 421 of whom were excluded from study enrollment based on the exclusion criteria stated above. The remaining patient cohort consisted of 146 subjects, 81 men (55.5%) and 65 women (44.5%). Mean age was 48.7 years (median 49.5 years, range 19-70 years). An additional subset analysis revealed no statistically significant differences in gender or age distribution between the patient groups examined at 1.5 T (59 men, 47 women; mean age, 48.36 years, range 19-70 years) and 3 T (22 men, 18 women; mean age, 49.55 years, range 21-69 years). In a total of 888 segments analyzed, expert reading detected 77 disc herniations (8.7%), of which 46 were labelled as disc extrusions (5.2%), 133 disc bulgings (15.0%), 35 spinal canal stenoses (3.9%), and 20 spondylolistheses (2.3%). In the 1036 segments analyzed for nerve root compression, 59 nerve root compressions (5.7%) were found. All spondylolistheses detected by the radiologist were reported as grade I according to the Meyerding classification.

CNN Diagnostic Performance
Mean processing time for each individual MRI study was 9:25 min (range, 8:34 to 11:20 min). All imaging studies could successfully be processed by the software.
Results are summarized in Table 1. Examples of accurately and inaccurately classified findings by the algorithm are shown in Figures 3-5. For vertebrae and segment detection and labeling, the CNN correctly identified all lumbar spine segments in our cohort. Concordant classification between the expert reader and the CNN was observed in 771 of the 888 segments for disc herniation, with 58 (7.5%) being true positives. Discordant classification was observed in 117 of the 888 segments for disc herniation, with 19 of these 117 segments being false negatives. Out of 46 disc herniations that were further classified as disc extrusions by the radiologist, 41 were detected by the algorithm. For disc bulging, concordant classification between the reader and the CNN was observed in 671 of the 888 segments, with 69 (10.3%) of the 671 segments being true positive findings. Discordant classification for disc bulging was observed in 217 segments, with 64 of these segments being false negatives. For nerve root compression, concordant classification was observed in 938 of the 1036 spine segments, with 42 (4.5%) of the 938 segments being true positives. Discordant classification for nerve root compression was observed in 98 segments with 17 of these segments being false negatives. For spinal canal stenosis, concordant classification was observed in 871 of the 888 segments, with 27 (3.1%) of the 871 segments being true positives. Discordant classification was observed in 17 segments, out of which nine were false negatives. For spondylolisthesis, discordant classification was observed in 110 segments, with four listheses being present according to the radiologist but not reported by the CNN, and 106 listheses being present according to the CNN but not diagnosed by the radiologist.
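The 2 × 2 cell counts behind these statements can be reconstructed from the reported totals, and the standard performance measures follow directly. The Python sketch below is illustrative rather than the analysis code of this study, and its function names are hypothetical. It uses only counts stated in the text: total segments, concordant segments, true positives, and false negatives (for spondylolisthesis, the 20 reference-positive segments and the 4 + 106 discordant ones).

```python
def counts_from_report(n_total, n_concordant, true_pos, false_neg):
    """Derive (TP, FP, FN, TN) from the totals given in the text."""
    false_pos = (n_total - n_concordant) - false_neg
    true_neg = n_concordant - true_pos
    return true_pos, false_pos, false_neg, true_neg


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard measures derived from a 2 x 2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


herniation = diagnostic_metrics(*counts_from_report(888, 771, 58, 19))
bulging = diagnostic_metrics(*counts_from_report(888, 671, 69, 64))
nerve_root = diagnostic_metrics(*counts_from_report(1036, 938, 42, 17))
# Spondylolisthesis: 20 reference positives, 4 missed, 106 false positives.
listhesis = diagnostic_metrics(tp=20 - 4, fp=106, fn=4, tn=888 - 20 - 106)

for name, metrics in [("herniation", herniation), ("bulging", bulging),
                      ("nerve root compression", nerve_root),
                      ("spondylolisthesis", listhesis)]:
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

The accuracies obtained this way agree with the values given in the abstract, and the spondylolisthesis PPV agrees with the 13.11% discussed later in the text.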
Additional and dedicated subset analyses of our cohort scanned at 1.5 T and 3 T revealed no statistically significant differences regarding diagnostic performance and can be found in the Supplemental Materials in Tables S1 and S2.
Figure 5. Intraforaminal disc extrusion missed by the CNN. Upper row: Sagittal and axial T2 weighted images of the lumbar spine showing an intraforaminal disc extrusion at the level of L4/L5 on the right side (arrows) with compression of the nerve root L4 on the right side (arrowhead). Middle row: User interface of the CNN. The CNN misinterpreted the herniated disc material as the nerve root L4, highlighted in pink. The actual nerve root L4 is situated lateral to the herniated disc material and has not been identified by the algorithm. Lower row: Excerpt from the written report automatically generated by the software.

Discussion
This single-center study compared the diagnostic performance of a single comprehensive CNN, trained on lumbar spine degenerative changes, with an expert radiologist's reading for labeling spine segments and detecting disc herniations, nerve root compression, spinal canal stenosis, and spondylolisthesis on lumbar spine MR images in a randomly selected retrospective cohort. As mentioned earlier, the software is designed to solve three tasks: segmentation, measurements and diagnosis, while our study focused on the third task, diagnosis, to determine the clinical applicability, without analyzing the accuracy of the segmentation and the measurements in detail.
Our major findings revealed that, in a patient cohort with a wide age range, the CNN was highly consistent with the radiologist's expert reading and yielded moderate to high diagnostic sensitivities and specificities for the detection of lumbar degenerative changes, ranging from 52% to 89% and from 80% to 99%, respectively. Owing to the high diagnostic accuracies and NPVs of the CNN, our data suggest that both clinical reading times and human input can possibly be reduced in lumbar MR imaging studies that were reported to be unremarkable by the CNN. Our data also demonstrated promising results for the detection of both disc herniation and extrusion. Substantial disagreement between the expert reader and the CNN was found for disc bulging and spondylolisthesis. This might be due to the classification of even minor bulging by the automated method, whereas the expert reader might tend to neglect clinically irrelevant findings, such as minor bulging that does not cause spinal stenosis or nerve root compression. The positive predictive value of only 13.11% for the detection of spondylolisthesis is partially due to the algorithm classifying a subset of segments as spondylolisthesis grade 0-1 according to the Meyerding classification. The authors counted these segments as grade 1 spondylolisthesis being present in the CNN analysis.
Most of the previous approaches mainly focused on automated detection and classification systems for a variety of individual spine pathologies. The automated detection methods used to date can be divided into three main types. The first type is the automatic localization of one or two types of spine structures, which is capable of depicting specific anatomical structures [33]. The second type comprises automatic segmentation of one or two types of spine structures [34], and the third type is simultaneous localization and segmentation of different spine structures [35]. These previously used detection methods have achieved accurate detection of one or two types of spine structures, but they have not been able to simultaneously perform radiological grading or diagnosis. Therefore, most AI-assisted radiological grading systems have been limited to one type of spinal structure, such as spondylolisthesis grading [36]. To date, there have been few attempts to simultaneously classify multiple pathologies on MR imaging studies of the lumbar spine to support diagnosis with a single comprehensive software solution. Recently, Jamaludin et al. trained the Oxford SpineNet software, a machine-learning-based system for automatic analysis of T2-weighted spine MRI scans obtained from DICOM files. They reported that the system can automatically classify a variety of degenerative changes in the lumbar spine including Pfirrmann grade, disc narrowing, spondylolisthesis, central canal stenosis, Modic changes, and bone marrow changes. The software was able to classify multiple radiological features at the same time. However, a major drawback of this approach is that only the preprocessed disc volumes were used as inputs, so only one type of spinal structure could be analyzed. Another drawback of this approach was that only sagittal T2-weighted images could be analyzed, which particularly limited the ability to correctly classify spinal canal stenosis [17].
Moreover, automatic detection of disc herniation or nerve root compression was not considered in that study, although these findings must be considered when attempting to automate the reading of lumbar spine MRIs to reduce the radiologist's workload in clinical practice. To our knowledge, our study is the first to externally validate a comprehensive software solution that focuses on the diagnosis of clinically relevant degenerative changes such as disc herniation and nerve root compression, in addition to the mere grading of single degenerative changes. Our approach differs from most previous studies since it can process both axial and sagittal slices of lumbar MRIs, leading to markedly enhanced detection rates of spinal pathologies compared to the aforementioned study by Jamaludin et al. and a recent study by Won et al. (e.g., in case of spinal canal stenosis, our accuracy was 98% vs. 95% vs. 83%, respectively) [37], even though direct comparison of the methods on different data sets is methodologically difficult.
External validation is an important step in the validation of a deep-learning-based, predictive model to avoid overfitting to the training data set, potentially resulting in an overestimation of the model's diagnostic performance [38,39]. Although various studies have shown that the diagnostic performance of AI algorithms can vary across different institutions, only 6% of studies that evaluated the diagnostic performance of AI algorithms performed external validation to ensure their generalizability [40,41]. The current study served as an external validation for the CNN that had been trained on 1500 data sets. Our results showed that the CNN was capable of processing lumbar MRI acquired independently of the training and test datasets at a different institution, using different MR scanners, different field strengths, and varying sequence parameters, thereby still providing fair to high diagnostic accuracy.
Although attempts have been made to standardize the nomenclature in the reporting of lumbar spine disease, such as the nomenclature of lumbar discs [25] and the grading of spinal canal stenosis [27], a general reporting and data system for spine MRI is still lacking. In other organ systems, such as the genitourinary tract, the recently established Vesical Imaging Reporting and Data System (VI-RADS) [42][43][44] demonstrated near-perfect interrater agreement in the evaluation of non-muscle-invasive vs. muscle-invasive bladder cancer. However, only moderate intra- and interrater agreement was observed for degenerative conditions of the lumbar spine across different readers, even after adjustment to standardized evaluation criteria [45]. This was mainly attributed to the different weighting of specific disease-defining conditions by the radiologic or surgical observer and may eventually result in limited concordance, even between the same observers over time. Given the potential importance for therapeutic decision making, a more standardized and reproducible approach is therefore still needed to homogenize spine MRI reporting. An automated approach, as we have attempted with CNN-based software, potentially offers the possibility of standardization of spinal MRI findings based on previously defined diagnostic criteria and is not prima vista biased by subjective assessment by the investigator. Nevertheless, CNNs should not be viewed as the sole basis for reporting, but rather as additional tools that can speed up and facilitate the radiologist's work and possibly reduce interrater disagreement. Recently, Pacilè et al. investigated the impact of concurrent use of an AI algorithm on the diagnostic performance of radiologists reading mammograms and observed a positive effect on interrater agreement and reading times for cases with low suspicion for malignancy as classified by the CNN [46].
Our results suggest that the CNN we used can especially improve the radiologist's efficiency for studies that do not show any relevant pathologies, although our study design did not assess the actual performance of a radiologist using the software. No comparable literature on the concurrent use of AI exists for lumbar spine imaging, so whether the concurrent use of a CNN can indeed influence the radiologist's performance remains a subject for further research.
All of the MR examinations selected for this study could be processed by the software, and written reports could be generated, indicating that the software is technically reliable for daily practice. The mean processing time for each MRI study was 9:25 min. The preclinical version of the CNN available to us used the computer's CPU instead of the GPU, resulting in longer processing times. Processing times of approximately 10 min can be acceptable in the clinical setting when the image processing is not performed on the radiologist's workstation while the report is being written, but is completed before the radiologist analyzes the images. Still, we see some potential for improvement here to ensure that the algorithm causes no delay in the clinical workflow.

Limitations
We acknowledge several limitations of this study. The retrospective nature generally limits the conclusions to be drawn since the reproducibility of the obtained clinical data cannot be determined. In addition, we are aware of the relatively small sample size. However, considering that many patients admitted to tertiary care hospitals had already been diagnosed with acute inflammation or malignancy and/or had previously undergone spine surgery, only a limited number of patients could be included for final assessment. Further, the exclusion of patients older than 70 years or with a prior history of surgery, trauma, or cancer does not reflect the clinical reality in most aging societies. However, prior research showed that the mean age is 41 years for lumbar disc herniation and 64 years for spinal canal stenosis; thus, the majority of the population affected by lumbar disc herniation and spinal canal stenosis is covered by our study [47]. Our results showed acceptable specificities and sensitivities for the diagnosis of a variety of different degenerative spine pathologies, although the observed PPVs of 13.11% to 75.00% appear rather low. The main reason for this low PPV is most likely the exclusion of findings by the examining radiologist that were not considered clinically significant, even if they may have been considered abnormal on the basis of objective criteria. This may be considered a limitation of our CNN-based approach, which may not sufficiently reduce the workload of radiologists. However, this will depend strongly on the final version of CoLumbo and on the extent to which the developers succeed in designing a practical user interface that enables the radiologist to identify false positive findings and flag them as insignificant. For a final assessment of whether the software can positively influence the standardization of findings, reduce reading times, and increase interrater agreement, further studies are necessary and currently pending.
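As background for interpreting the PPVs and NPVs, the following minimal Python sketch shows how both values follow from sensitivity, specificity, and prevalence via Bayes' theorem. The per-finding counts and the total of 888 segments are taken from our data; the sensitivity and specificity values are purely illustrative assumptions and are not results of this study. The function name `predictive_values` is likewise introduced here for illustration only.

```python
# Illustrative sketch: PPV and NPV as functions of prevalence.
# ASSUMED_SENSITIVITY and ASSUMED_SPECIFICITY are hypothetical values,
# NOT measurements reported in this study.

TOTAL_SEGMENTS = 888  # lumbar spine segments in the study cohort

# Number of affected segments per finding, as reported for our data.
FINDING_COUNTS = {
    "disc herniation": 77,
    "disc bulging": 133,
    "spinal canal stenosis": 35,
    "nerve root compression": 59,
    "spondylolisthesis": 20,
}

ASSUMED_SENSITIVITY = 0.85  # hypothetical
ASSUMED_SPECIFICITY = 0.90  # hypothetical


def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test using Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    for finding, count in FINDING_COUNTS.items():
        prevalence = count / TOTAL_SEGMENTS
        ppv, npv = predictive_values(
            ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, prevalence
        )
        print(f"{finding:24s} prevalence={prevalence:.3f} "
              f"PPV={ppv:.3f} NPV={npv:.3f}")
```

Under fixed sensitivity and specificity, the PPV falls and the NPV rises as a finding becomes rarer, so low-prevalence findings can show a low PPV alongside a high NPV regardless of how false positives arise.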
The diagnostic objectivity that can result from the automated application of clearly defined diagnostic criteria has the potential to significantly impact the future evaluation of degenerative spine disease and its treatment planning across treatment centers. It represents a first milestone in the homogenization of diagnostic reports.

Conclusions
There is a growing interest in machine-learning algorithms supporting radiologists in their daily work. In the current study, we found that support of diagnosis of various lumbar degenerative changes is feasible with moderate to high diagnostic accuracy using a single comprehensive CNN. Additionally, the high NPVs make this CNN a promising approach to rule out vertebral pathologies in clinical practice. The proposed method paves the way for the transition of machine-learning-based diagnostic tools into clinical routine. However, further research is needed to evaluate whether the CNN can influence detection rates, interrater agreement, and reading times of radiologists using the software.

Informed Consent Statement: Patient consent was waived due to the retrospective nature of this study.