Artificial Intelligence and Computer Aided Diagnosis in Chronic Low Back Pain: A Systematic Review

Low Back Pain (LBP) is currently the leading cause of disability in the world, with a significant socioeconomic burden. Diagnosis and treatment of LBP often involve a multidisciplinary, individualized approach consisting of several outcome measures and imaging data along with emerging technologies. The increasing amount of data generated in this process has led to the development of methods based on artificial intelligence (AI), and on computer-aided diagnosis (CAD) in particular, which aim to assist and improve the diagnosis and treatment of LBP. In this manuscript, we have systematically reviewed the available literature on the use of CAD in the diagnosis and treatment of chronic LBP. A systematic search of the PubMed, Scopus, and Web of Science electronic databases was performed. The search strategy was set as a combination of the following keywords: “Artificial Intelligence”, “Machine Learning”, “Deep Learning”, “Neural Network”, “Computer Aided Diagnosis”, “Low Back Pain”, “Lumbar”, “Intervertebral Disc Degeneration”, “Spine Surgery”, etc. The search returned a total of 1536 articles. After duplicate removal and evaluation of the abstracts, 1386 articles were excluded, and a further 93 papers were excluded after full-text examination, taking the number of eligible articles to 57. The main applications of CAD in LBP included classification and regression. Classification is used to identify or categorize a disease, whereas regression is used to produce a numerical output as a quantitative evaluation of some measure. The best performing systems were developed to diagnose degenerative changes of the spine from imaging data, with average accuracy rates >80%. However, notable outcomes were also reported for CAD tools executing different tasks, including the analysis of clinical, biomechanical, electrophysiological, and functional imaging data. Further studies are needed to better define the role of CAD in LBP care.


Introduction
In the last few decades, Artificial Intelligence (AI) has been revolutionizing the healthcare industry thanks to innovative computational tools able to support, and in some specific tasks even substitute for, human intelligence [1]. To date, AI is being applied to almost every aspect of daily life, thanks to its capacity to handle the unprecedented amount of information recorded every nanosecond by computer systems, e.g., voice assistants, car security devices, and smart home detectors. Due to the huge quantity of data and the ever-increasing use of digital processing in clinical practice, the employment of AI in medical research has been increasingly investigated in several studies [2]. Indeed, AI-based systems have been shown to perform automatic segmentation and data extraction from radiological datasets [3], as well as to support diagnosis, treatment, and outcome evaluation in different fields, including spine surgery [2].
The use of AI in spine surgery has been exploited for different tasks, including the segmentation of spinal structures [4], identification of degenerated discs [5], detection of vertebral fractures [6], classification of scoliotic curves [7], and several more. In a previous systematic review on the application of Computer Vision to the management of low back pain (LBP), we have demonstrated that AI systems achieved Sørensen-Dice scores >90% with regard to segmentation of vertebrae, intervertebral discs (IVDs), spinal canal, and lumbar muscles, whereas studies focusing on structure localization and identification demonstrated an accuracy >80% [8].
LBP, which is primarily caused by intervertebral disc degeneration, represents the leading cause of disability in the world, with a huge impact on patients' quality of life as well as on socioeconomic and working conditions [9]. Diagnosing and treating LBP often requires a multidisciplinary approach involving the acquisition of radiological images, patient-reported outcome measures (PROMs) questionnaires, and angular and linear measurements. Therefore, the ultimate decision is often guided by the elaboration of several data sources using an algorithmic approach [10,11]. Computer-aided diagnosis (CAD) is a field of AI employing machine learning methods to analyze both imaging and non-imaging data in order to classify patients' conditions and to support clinicians in formulating a correct diagnosis [12]. Although first adopted for the diagnosis of breast cancer [12], CAD systems are now utilized in several fields, including the detection of osteoporosis [13], the identification of missed polyps during colonoscopy [14], and many others. Applications of CAD to LBP are numerous and involve several data sources (e.g., magnetic resonance imaging (MRI) and computed tomography (CT) datasets, clinical notes, surface sensor and electrophysiological measurements), as well as numerous ancillary AI tasks (e.g., segmentation, classification, regression).
The diagnosis of disc abnormalities can be easily performed by an experienced professional, even if it is affected by notable variability among experts (Alomari et al. [15] report that "there is over 50% inter-and intra-observer variability in the MRI interpretation that urges the need for standardized mechanisms in MRI interpretation"). This aspect can be automated by AI systems specifically focusing on Computer Vision, with encouraging results from preliminary studies. For example, Won et al. [16] reported: "Spinal stenosis Grading agreement between the experts was 77.5% and 75.0% in terms of accuracy and F1 scores". The main advantage of CAD systems is their ability to carry out multiple tasks on large datasets, producing a definite outcome with a high degree of accuracy compared to their human counterpart. However, the real added value of AI in CAD systems is the combination of different pieces of information (demographics, patient-reported outcome measures, clinical notes, radiological data, etc.) in order to better predict a specific diagnosis and improve patient outcomes. All of these aspects have been recently reviewed by Mallow et al. [17]. Briefly, although clinicians achieve high accuracy scores in some easy tasks such as detecting disc bulging, AI models achieve very similar results while reducing the diagnosis time and excluding inter- and intra-observer variability. In addition, the diagnosis of some diseases is still challenging for medical practitioners, and can be aided and improved by AI.
In this review, we have systematically examined the available literature on the application of CAD systems to the management of LBP. The state of the art of this technology and the individual results of the included studies will be thoroughly discussed, in order to describe the current evidence and the potential future applications of these groundbreaking tools.

Materials and Methods
In order to perform an exhaustive search for AI articles related to LBP, we performed a literature search on PubMed, Scopus, and Web of Science. The search keywords utilized for both the medical and the AI part are reported in Table 1. At least one of the search keywords for the medical part and one for the AI part had to be included in the title or in the abstract of the articles.

Inclusion and Exclusion Criteria
The aim of this study was to gather all the articles concerning the utilization of AI in the diagnosis of LBP and lumbar degenerative diseases. Accordingly, all the selected articles had to meet all of the following inclusion criteria:
• Chronic LBP or lumbar degenerative diseases must have been among the main topics of the article. We included articles on the diagnosis of diseases related to chronic LBP, treating at least one of the structures involved in LBP (i.e., vertebrae, discs, muscles, spinal canal);
• AI must have been used in the article. We included articles exploiting AI methods falling within the areas of computer vision, machine learning and neural networks (NNs), regardless of the type of data utilized (e.g., images, text data, clinical data);
• Aim of the study: all the articles included must have been focused on a CAD system;
• Subjects included in the study: all the articles must have been based on studies of the human low back and related pathology, regardless of the age or employment of the included individuals;
• Validation procedures: results must have been reported on a test set different from the training set;
• Language: all articles must have been written in English.
Conversely, articles were excluded when they did not meet the inclusion criteria for one of the following reasons:
• A different medical problem was considered: we excluded articles which did not consider chronic LBP and its related anatomical structures and medical data. For example, we excluded studies that focused only on cervical or thoracic vertebrae, and studies focusing on acute LBP or osteoporosis;
• AI was not considered: we excluded studies that did not utilize AI-based techniques in the diagnosis or management of LBP;
• Diagnosis was not provided: we excluded studies using Computer-Vision-based methods that, although focusing on LBP-related structures, were limited to the segmentation or identification of lumbar structures;
• Animal studies: we excluded studies based on vertebral structures of animals, e.g., goats or mice;
• Results reproducibility: we excluded articles that did not use a K-fold cross-validation procedure nor report a clear division of the dataset between a training set and a test set;
• Not in English: we excluded all the articles written in a language different from English.
In our previous study, we have defined three main categories in which the utilization of AI in LBP can be split, namely Computer Vision, CAD, and Decision Support Systems (DSS) (Figure 1). Computer Vision is the field of AI that deals with how computers can gain a high-level understanding from digital images or videos. With regard to LBP, its main applications concern feature extraction and image segmentation, which have been widely discussed in our previous systematic review [8]. CAD is a group of techniques which help medical practitioners identify a pathology or quantify the grade of a disease. It can be divided into two distinct tasks, namely classification and regression, in which machine or deep learning models are used to assign a predefined label or to generate a numeric output, respectively. In practice, classification is used to identify or categorize a disease, whereas regression is used to produce a numerical output as a quantitative evaluation of some measure [18].
DSS are systems that allow medical practitioners or patients to enhance the decision-making process in order to improve the outcome of subjects suffering from a specific disease. The goal of the vast majority of DSS is outcome prediction, i.e., the prediction of the improvement that a patient would experience after exposure to a defined therapy. By predicting the extent to which a patient would benefit from a specific treatment, DSS may provide the physician with practical tools to assess, for example, whether or not surgery may be preferable to conservative treatment. However, a DSS only provides a suggestion to the physician, who is responsible for the final decision on the treatment to be undertaken. Finally, DSS can be used for prevention, e.g., by providing the user with recommendations on correct practices for preventing the onset of a disease [19].

Evaluation Metrics
Among the articles included, different tasks resorted to different metrics to evaluate the performance of the systems under investigation. Moreover, given the large number of studies reported in this review, different metrics were sometimes used even within the same task.
With regard to the Classification task, we reported the results in terms of Accuracy (Acc), where available. For brevity, let us consider a binary Classification task, e.g., Positive vs. Negative. Given a test set composed of N samples, and defining the True Positives TP as the number of Positive samples correctly classified and the True Negatives TN as the number of Negative samples correctly classified, Accuracy is defined as:

Acc = (TP + TN) / N

With regard to the Regression task, the vast majority of the studies included in this review report the performance in terms of the Mean Absolute Error (MAE). Let us consider a sequence of original values x(t) and a sequence of predicted values x̂(t). The MAE for a sequence of N timestamps is defined as:

MAE = (1/N) · Σ_{t=1}^{N} |x(t) − x̂(t)|

Thus, the closer the value is to 0, the better the performance. In some cases, percentage error values are used to evaluate performance, the meaning of which varies according to the investigated task.
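As a minimal illustration, the two metrics defined above can be computed as follows (the values are toy examples, not data from any included study):

```python
def accuracy(y_true, y_pred):
    """Acc = (TP + TN) / N, i.e., the fraction of correctly classified samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_absolute_error(x_true, x_pred):
    """MAE = (1/N) * sum |x(t) - x_hat(t)|; the closer to 0, the better."""
    return sum(abs(t - p) for t, p in zip(x_true, x_pred)) / len(x_true)

# Binary classification example: 1 = Positive, 0 = Negative.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 samples correct -> 0.75

# Regression example: predicted vs. measured angles (degrees).
x_true = [42.0, 38.5, 51.0]
x_pred = [40.0, 39.5, 50.0]
print(mean_absolute_error(x_true, x_pred))  # (2 + 1 + 1) / 3 degrees
```

Note that Accuracy counts correct predictions regardless of class, which is why several included studies complement it with sensitivity, specificity, or AUC on imbalanced datasets.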

Quality of Evidence
The methodological quality of the included studies was assessed independently by two reviewers (L.A. and F.R.), and any disagreement was solved by the intervention of a third reviewer (G.V.). The risk of bias and applicability of the included studies were evaluated using customized assessment criteria based on the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [20]. This tool is based on four domains: patient selection, index test, reference standard, and flow and timing. Each domain is evaluated in terms of risk of bias, and the first three domains are also assessed in terms of concerns regarding applicability. Fifty studies were rated on a 3-point scale, reflecting concerns about risk of bias and applicability as low, unclear or high, as shown in Figure 2 (the details of the analysis are presented in Tables S1 and S2 in the Supplementary Materials).
Figure 2. Summary of the methodological quality of the included studies regarding the four domains assessing the risk of bias (left) and the three domains assessing applicability concerns (right) of the QUADAS-2 score. The portion of studies with a low risk of bias is highlighted in green, the portion with an unclear risk of bias is depicted in blue, and the portion with a high risk of bias is represented in orange.

Results
The search was performed on 5 November 2021, and returned 1536 articles. After removing duplicates and performing a first screening based on article titles and abstracts, we reduced the number of eligible articles to 150, as many of them focused on a different topic. A second screening phase was performed after reading the full text of each article, which brought the total number of included articles to 57. We created a flow-chart diagram according to the PRISMA protocol showing the selection process of the studies (Figure 3). The articles were screened by two independent reviewers and, in the case of discrepancies regarding the inclusion or exclusion of an article, they discussed until consensus was reached. It is worth noting that the number of published papers is increasing year by year, and that the number of articles published in 2020 is almost double that of 2019. This may be due to two main reasons: first, the ever-increasing amount of clinical images and data available to researchers and, secondly, the improvement in computing capacity observed in recent years.

Computer Aided Diagnosis
Computer Aided Diagnosis (CAD) is a branch of AI that resorts to machine learning techniques to help physicians diagnose a disease or quantify its severity. Several studies resulting from the search utilized CAD systems, which addressed two main tasks, namely classification and regression. CAD systems can be based on clinical and physiological data and/or on clinical images; in the latter case, the diagnosis may follow a segmentation phase. In this systematic review, we found a total of 57 articles employing CAD systems, 45 of which were based on classification and 12 on regression.

Classification
Classification is a task that consists of assigning an input sample to one of a finite number of predetermined classes, and can be based on machine learning models such as Support Vector Machines (SVM) and Decision Trees, or on deep learning models. In this review, we identified a total of 45 papers performing a classification task as a CAD, and their main features are reported in Table 2. Briefly, we included:
• 27 studies on clinical lumbar imaging, in detail:
 - 20 studies on MRI;
 - 4 studies on X-ray images;
 - 3 studies on other typologies of medical images;
• 4 studies on brain MRI (1 in combination with other physiological data);
• 14 studies on physiological data, in detail:
 - 8 studies using kinematic variables or sensor data;
 - 3 studies using clinical data and text notes;
 - 3 studies using electromyography (EMG) data.
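For illustration, the classification task can be sketched with a deliberately simple nearest-centroid rule on a single hypothetical feature (e.g., a normalized disc signal intensity); this is not the method of any included study, which rely on far richer models (SVMs, Decision Trees, CNNs) and feature sets:

```python
def fit_centroids(samples, labels):
    """Compute the mean feature value (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def classify(x, centroids):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

# Hypothetical training data: intensity values labeled healthy/degenerated.
features = [0.82, 0.75, 0.79, 0.31, 0.40, 0.35]
labels = ["healthy", "healthy", "healthy",
          "degenerated", "degenerated", "degenerated"]
centroids = fit_centroids(features, labels)
print(classify(0.77, centroids))  # -> healthy
print(classify(0.33, centroids))  # -> degenerated
```

The essential point is the discrete output: whatever the model, the classifier maps each sample to one of a finite set of labels, which is then scored with metrics such as Accuracy.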
Specifically, 18 out of the 45 papers focused on LBP diagnosis, 13 studies investigated disc degeneration, 4 studied spinal stenosis, and 3 approached spondylolisthesis, whereas the remaining studies focused on different conditions such as scoliosis, osteoarthritis, disc and bone diseases, and routine reporting. It is worth noting that 22 studies exploited NNs and deep learning, 22 exploited machine learning models, and 1 study exploited both approaches.
With regard to the studies focusing on the diagnosis of LBP, 4 articles utilized brain MRI to identify morphological factors predicting LBP, whereas 8 studies exploited other types of data such as EMG signals, kinematic variables or biomechanical measures; 3 papers were based on clinical data and text notes, and 3 studies aimed to diagnose LBP based on clinical images of the lumbar region. All of the studies exploiting brain MRI chose a Support Vector Machine (SVM) as a classifier to discriminate between healthy and unhealthy subjects. In detail, Lee et al. [21] used brain MRI in combination with physiological parameters of 53 subjects to discriminate between healthy and LBP subjects, achieving an accuracy of 92.5%; Lamichhane et al. [22] searched for multimodal biomarkers of LBP on brain MRI images of 24 patients and 27 healthy control subjects with an accuracy of 78.7%; in addition, the same group [23] expanded the previous work by adding an Enet-subset feature selection, improving the SVM accuracy to 83.1%. Shen et al. [24] searched for alterations in brain functional connectivity due to chronic LBP, achieving 79.3% accuracy on brain MRI images of 90 patients.
Among the studies that aimed to diagnose LBP based on clinical data, Mathew et al. [25] used Inductive Learning in an early study to diagnose LBP based on clinical data from 200 subjects, achieving accuracy values ranging between 82 and 90%. Staartjes et al. [26] performed a Fuzzy-rule-based classification based on Chi's method to diagnose LBP using clinical data from 262 subjects, with an accuracy of 96.2%. Parsaeian et al. [27] compared a Feedforward NN and Logistic Regression on clinical data from more than 34,000 subjects to diagnose LBP, achieving an equal AUC of 0.75.
Table 2. Summary table of the works performing Classification. If more than one structure/task was investigated in a study, the corresponding results are reported in the same order in which the structures are presented in the "Structures involved"/"Target" column.
Among the studies that aimed to diagnose LBP exploiting EMG signals and kinematic/biomechanical measures, Caza-Szoka et al. [53] performed a surrogate analysis of fractal dimensions from an sEMG sensor array in order to identify a predictor of chronic LBP in 24 subjects, using a Feedforward NN with an accuracy of 80%. Wang et al. [54] proposed DeepLap, a system for the automatic diagnosis of LBP-symptomatic muscles. The system includes a belt for sEMG recording of lumbar muscles, and exploits a Spanning CNN for the recognition of symptomatic muscles; the model was validated on data from 288 patients with 92.9% accuracy. Liew et al. [55] used Logistic Regression on EMG signals and physiological parameters of 49 subjects for classifying LBP, achieving an AUC of 0.97. Abdollahi et al. [56] used kinematic variables from a motion sensor to categorize 94 nonspecific LBP patients, using an SVM with an accuracy of 75%. Bishop et al. [57] used a Feedforward NN to classify 183 LBP patients using dynamic motion characteristics, achieving 85% accuracy. Hu et al. [58] used a Long Short-Term Memory (LSTM) NN on static-standing physiological variables of 44 subjects to diagnose LBP with an accuracy of 97.2%. Ashouri et al. [59] used an SVM to evaluate LBP from inertial sensor data of 53 subjects, achieving an accuracy of 96%. Karabulut et al. [60] used Synthetic Minority Over-sampling Technique (SMOTE) preprocessing and a Logistic Model Tree to predict LBP from biomechanical measures of 310 subjects with an accuracy of 89.7%.
Among the studies that aimed to diagnose LBP based on medical images of the lumbar region, Ketola et al. [61] performed texture feature extraction and applied Logistic Regression on MRI images of 518 subjects to identify predictors of LBP, achieving an accuracy of 83%. Torrado-Carvajal et al. [62] used a Random Forest to establish thalamic neuroinflammation as a discriminating signature for chronic LBP from Positron Emission Tomography (PET) images of 33 subjects, achieving an AUC of 0.88. Sanders et al. [63] used a Feedforward NN to develop an automated scoring of the pain drawings of 250 subjects to identify LBP, achieving 49% sensitivity for a 5-class problem.
With regard to disc degeneration, the majority of the included studies used MRI. Gao et al. [5] fed MRI images of 500 patients to different CNNs, namely VGG-M, VGG-16, GoogLeNet, and ResNet-34, in order to quantify disc degeneration, achieving a maximum accuracy of 86%. Ruiz-España et al. [29] extracted features from MRI images of 67 patients using Gradient Vector Flow, and tested several machine learning models to classify degenerated IVDs, achieving accuracies greater than 90%. Oktay et al. [30] used MRI images of 102 patients as input for an SVM to classify degenerative disc diseases with an accuracy of 92.8%. Alomari et al. [31] used MRI images of 80 subjects to develop three probabilistic Gaussian models related to disc appearance, location and context, in order to generate the inputs for a Gibbs probabilistic model to discriminate between healthy and unhealthy IVDs. Koh et al. [32] fed MRI images of 70 subjects to an ensemble of machine learning models composed of a perceptron classifier, a least mean square classifier, an SVM, and a k-Means, using a weighted sum of the model outputs in order to detect lumbar disc herniation, achieving 99% detection accuracy. Tsai et al. [33] trained a YOLO v3 CNN to detect lumbar disc herniation on MRI images of 168 subjects, achieving 81.1% accuracy after data augmentation. Pan et al. [34] used MRI images from 500 subjects to train a Faster R-CNN to automatically diagnose disc bulging and herniation, with a mean accuracy of 88.8% over the five lumbar IVDs, after having performed IVD localization and identification. Salehi et al. [37] used MRI images of 50 subjects to detect disc herniation using a K-nearest neighbor classifier after having extracted features from the region of interest using Active Contour snakes and K-Means, achieving 97.9% accuracy. Beulah et al. [35] automatically segmented IVDs and extracted Gabor features from MRI images of 93 patients to discriminate between degenerated and healthy discs, achieving 92.5% accuracy using an SVM. Sundarsingh et al. [36] proposed Local Sub-Rhombus Binary Relation Pattern techniques to extract features from MRI images of 63 subjects to discriminate between healthy, bulging and desiccated discs, achieving an average accuracy of 94.7% by feeding such features to a Random Forest classifier. Three additional studies diagnosed disc degeneration without using MRI: Šušteršič et al. [38] used features extracted from force sensors embedded in a foot force platform in order to diagnose the type of disc herniation in 33 patients; they tested several machine learning models and achieved the best accuracy of 85% using a Decision Tree. Rankovic et al. [39] used measures extracted from a platform for the detection of foot pressure distribution in order to diagnose disc herniation at four different disc levels; they trained an adaptive network-based fuzzy inference system on data of 29 patients, correctly grading the side and level of herniation in 8 out of the 9 test subjects. Oyedotun et al. [40] used biomechanical measures of 310 subjects from the UCI Machine Learning Repository to train a Feedforward NN to discriminate between healthy subjects and those suffering from disc herniation or spondylolisthesis; they achieved 92.5% accuracy on the three-class task, and 96.8% accuracy on the task of discriminating between healthy and unhealthy subjects.
With regard to spondylolisthesis, one study [51] used GoogLeNet, and compared its results to those achieved using AlexNet, on X-ray images of 286 patients to diagnose the presence of spondylolisthesis, achieving 93.9% accuracy on images of 48 patients kept as the test set. In addition, the same group extended the study [52] by using a transfer-learning-based CNN for spondylolisthesis detection: they extracted features from a total of 2707 images with a YOLO v3, and fed them to a fine-tuned MobileNet, achieving 99% test diagnosis accuracy.
However, some articles did not fall into any of the aforementioned categories. In the frame of routine clinical reporting, Lewandrowski et al. [28] used a Tiramisu NN and a CNN for the reporting of 17,800 IVDs from MRI data related to IVDs and the spinal canal, achieving an accuracy of 85.2% for disc herniation. In the frame of scoliosis diagnosis, Adankon et al. [48] used 3D images of the surface of the human back of 165 patients, extracting features with local geometric descriptors and feeding them to a least-squares SVM for the classification of scoliosis curve types, achieving 95% accuracy; Lin [49] fed X-ray images of 37 subjects to a Feedforward NN to diagnose scoliosis, with an identification rate of 84%. Veronezi et al. [47] used a Feedforward NN on X-ray images of 206 subjects to diagnose osteoarthritis, achieving an accuracy of 62.9%.
Finally, three articles aimed at the detection and classification of different lumbar structures and abnormalities at once: Jamaludin et al. [41,42] presented a CNN, namely SpineNet, which achieved a detection accuracy of 71.5% for disc degeneration, 75.0% for disc narrowing, 95.2% for spondylolisthesis, 94.3% for stenosis, 86.3% for endplate defects, and 90.7% for marrow changes; Lehnen et al. [43] proposed a U-net for the identification of IVDs on MRI images of 146 subjects, and exploited measurement differences between the original and the segmented image for the detection of abnormalities, achieving an accuracy of 87% for disc herniation, 86% for disc extrusions, 76% for disc bulging, 98% for spinal canal stenosis, 91% for nerve root compression, and 87.6% for spondylolisthesis.

Regression
Regression is a task that consists of assigning a numerical value to an input sample. Differently from classification, the number of classes is not predetermined; in other words, regression can be regarded as a classification task with an infinite number of classes. In this review, we identified a total of 12 papers performing a regression task as a CAD, and their main characteristics are reported in Table 3. Vertebrae were the most investigated structures (5 papers), whereas other studies focused on IVDs, muscles, and the definition and quantification of LBP-related measures. In more detail, three studies focused on spinal deformity, three studies focused on the measurement of lumbar structures, two studies focused on the quantification of LBP, one investigated spondylolisthesis, and one assessed intramuscular fat quantification. It is worth noting that eight studies resorted to NNs and deep learning, two studies resorted to machine learning models, whereas two exploited threshold methods. One study [73] used a U-net for the automated segmentation and measurement of lumbar lordosis on X-ray images of 629 patients, achieving an MAE on the curve angle of 8.06°. Garcia-Cano et al. [74] extracted features from X-ray images of 150 patients by means of Independent Component Analysis, and used Random Forest Regression to predict spinal curve progression in adolescents with idiopathic scoliosis, achieving a Mean Absolute Error (MAE) of 4.79° for the Cobb angle.
With regard to the studies focusing on the measurement of lumbar structures, Pang et al. [64] used a Cascade Amplifier Regression Network (CARN) on MRI of 215 subjects for the quantification of 30 lumbar spinal indices, achieving an overall MAE of 1.22 mm. Neubert et al. [65] used Active Shape Modeling for the measurement of IVDs from MRI of seven patients, achieving estimation errors of 4.1% and 0.1% for disc height and area, respectively. Natalia et al. [68] used a SegNet and a Contour Evolution Algorithm to measure the anteroposterior diameter and foraminal widths in MRI images of 515 patients suffering from lumbar spinal stenosis, with a mean error of 0.9 mm. Nguyen et al. [75] used a CNN trained on X-ray images of 1000 spondylolisthesis patients to measure structure deviation, achieving a mean deviation angle of 1.76° on 20 further test patients.
With regard to the studies focusing on LBP quantification, Sari et al. [69] tested a Feedforward NN and an Adaptive Neuro-Fuzzy Inference System for the objective assessment of LBP intensity, using skin resistance and visual analog scale values of 169 patients as input and achieving a pain intensity error of 4%.
In addition, Fortin et al. [70] used a threshold algorithm for the segmentation and quantification of paraspinal muscle composition, with a reliability coefficient ranging between 97 and 99%. Niemeyer et al. [66] developed a CNN to frame Pfirrmann grading as a regression problem, achieving an MAE of 0.08 on MRI images of 1599 subjects. Finally, Sneath et al. [67] proposed an ensemble of machine learning models to calculate a predicted "age estimate" for age-related changes based on MRI images of 60 subjects, achieving a "predicted age" differing from the true subject age by less than 11 years in 80% of cases.
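As a minimal sketch of the regression task underlying the studies above, a one-variable least-squares fit can map a hypothetical image-derived feature to a continuous angle and be evaluated with the MAE; both the feature and the angles below are invented for illustration, and the included studies use far more sophisticated models (CNNs, Random Forest Regression) on real measurements:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical training pairs: (feature value, measured angle in degrees).
xs = [0.10, 0.20, 0.30, 0.40]
ys = [10.0, 20.0, 31.0, 39.0]
a, b = fit_line(xs, ys)

# Evaluate the fit with the MAE defined in the Evaluation Metrics section.
preds = [a * x + b for x in xs]
mae = sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)
print(round(mae, 2))  # mean absolute error in degrees
```

Unlike the classification sketch, the output here is an unbounded real number (an angle in degrees), which is why error metrics such as the MAE, rather than Accuracy, are reported for this task.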

Discussion
The management of patients affected by spine-related problems, first and foremost LBP, is a demanding process which often involves gathering a thorough patient history, conducting a structured physical examination, and combining multiple imaging sources to accurately formulate the diagnosis and plan an appropriate treatment [76]. The use of multiple scales and measurements, as well as different imaging technologies, generates a vast amount of data which, while being fundamental to individualizing the treatment approach, often becomes difficult to handle and fully interpret.
The advent of AI has been revolutionizing several research and clinical fields, including spine surgery, in which the development of automated systems may increase the accuracy and repeatability of tasks critical to the diagnostic process [2]. More specifically, the application of such tools, namely CAD systems, has been extensively reported in the recent literature, with application to both conventional datasets (e.g., clinical data, lumbar MRI) and innovative technologies (e.g., brain fMRI, kinematic sensors). In this review, most included studies focused on classification, through which AI systems assign each input sample to one of a finite number of predetermined classes. Lumbar MRI was the main input source in the majority of studies. Indeed, the investigated CAD systems were able to diagnose intervertebral disc degeneration based on IVD intensity in sagittal T2-weighted MRI images, with an accuracy of 86-92.8% [5,29-31]. In addition, several studies proposed different models for the automatic classification of IVD degenerative changes based on the Pfirrmann grading system [5,29], while the preliminary manuscript from Oktay et al. [30] described a machine learning system able to discriminate between normal and degenerated IVDs only. Collectively, these studies showed an accuracy rate between 86% and 92.8%. Similarly, three studies showed significantly high accuracy in detecting disc bulging and herniation, with rates of 81.1-99% [32-34]. Lewandrowski et al. [28] trained deep neural networks on a dataset of 17,800 IVDs and complemented them with a natural language processing (NLP) module capable of performing routine reporting for each disc level, achieving an accuracy of 81% for the diagnosis of foraminal stenosis, 86.2% for central stenosis, and 85.2% for disc herniation.
In addition, other studies described CAD systems able to detect and grade central canal stenosis as well as foraminal and lateral recess stenosis, with almost perfect or at least substantial inter-reader agreement [44-46]. Jamaludin and colleagues developed a CNN capable of segmenting vertebrae and IVDs (with an accuracy of 95.6%) and of identifying disc narrowing, marrow changes, endplate defects, spondylolisthesis, and central canal stenosis, as well as performing Pfirrmann grading, with accuracy rates ranging from 70.1% to 95.4%. Furthermore, this model can directly mark disc and vertebral abnormalities in the form of heatmaps, namely "evidence hotspots" [41,42]. Similarly, Lehnen et al. [43] presented a CNN trained to segment the IVDs and detect disc herniation, extrusion, bulging, spinal canal stenosis, nerve root compression, and spondylolisthesis, with accuracy scores between 76% and 100%. In a study by Ketola and colleagues [61], a machine learning system achieved accuracy, specificity, and sensitivity scores >80% in classifying patients as either symptomatic or nonsymptomatic based on LBP-related degenerative changes. However, the high incidence of false positives (asymptomatic individuals with disc degenerative changes) markedly reduced the precision of the system. X-rays were utilized as an input source in only two studies [47,49]. Veronezi et al. reported a considerably lower accuracy (62.85%) in recognizing osteoarthritic changes of the lumbar spine compared to other studies, owing both to the heterogeneity of the digital images and to the small number of images used for training [47]. In another study, lumbar X-rays of scoliotic patients were used to build a 3D spine model, and a multilayer feed-forward, back-propagation (MLFF/BP) artificial NN was developed to identify the pattern of the scoliotic deformity [49]. However, AI applications for CAD are not limited to radiological images of the spine. Indeed, Lee et al.
[21] developed a system able to predict the intensity of LBP by integrating brain fMRI data and heart rate variability. The model was able to anticipate exacerbations of LBP in patients showing an increase in cerebral blood flow in the thalamus and in the prefrontal and posterior cingulate cortices, together with an increment in heart rate variability, with an accuracy of 92.5%. In a similar study, Lamichhane and colleagues [22] showed that a machine learning approach could associate the reduction of cortical thickness in specific brain areas involved in the processing of pain, emotions, and vision with LBP, with an accuracy of 74.51%. In a subsequent analysis, the same authors tested a new hybrid feature selection technique (namely, Enet-subset) to extract local graph measures from functional connectomes and determine their capacity to predict LBP using an SVM, achieving an average classification accuracy of 83.1% [23]. The alteration of visual network connectivity in individuals with chronic LBP was also documented by Shen et al. [24], whose machine learning model distinguished patients with LBP with an accuracy of 79.3%. Additionally, Torrado-Carvajal and coauthors demonstrated the accumulation of the glial activation marker 18 kDa translocator protein (TSPO) in the thalamus of patients with chronic LBP using PET imaging and a Random Forest system [62].
The use of AI has also been exploited in the diagnosis of LBP from clinical data. As early as 1988, a preliminary study by Mathew et al. [25] showed that AI was able to outperform clinicians in the differential diagnosis of LBP, sciatica, and other spinal pathology. Other studies have demonstrated the possibility of training AI systems to predict the diagnosis of lumbar disc herniation, lumbar spinal stenosis, and chronic LBP based on patients' performance in the five-repetition sit-to-stand test [26], to predict the risk factors associated with LBP from a population survey [27], to refine the diagnosis and personalize the treatment of LBP in a primary care context using free-text clinical notes [77], and to automatically score pain drawings [63]. Additional inputs utilized to develop CAD systems for LBP diagnosis include sEMG during weightlifting [55] or an endurance test [53], spinopelvic parameters [40,60], and kinematic data during static standing [58] and during trunk flexion/extension and lateral bending [56,57,59]; these systems were able to detect LBP in affected patients with an accuracy >80%. Šušteršič et al. [38] tested five different classifier algorithms to diagnose the side and level of disc herniation based on the force exerted during normal standing or leaning towards either the forefeet or the heels; using a Random Forest algorithm, the system reached an accuracy of 87.9%. Adankon and colleagues [48] proposed an SVM able to classify scoliotic deformities based on a 3D model of patients' spines built with four optical digitizers, reaching an overall accuracy of 95%.
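The classification pipelines described above (e.g., the Random Forest approach of Šušteršič et al.) generally follow the same pattern: a feature table, a train/test split, a fitted classifier, and a held-out accuracy score. A minimal sketch of this workflow, using entirely synthetic data (not the data of any cited study), might look as follows:

```python
# Minimal sketch of a Random Forest classification pipeline for LBP
# diagnosis; features and labels are synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400
# Hypothetical tabular features (e.g., force-plate or sit-to-stand measures).
X = rng.normal(size=(n, 6))
# Synthetic labels with some signal: 1 = LBP, 0 = control.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"held-out accuracy: {acc:.3f}")
```

The reviewed studies additionally report specificity, sensitivity, and precision, which can be obtained analogously from the held-out predictions; accuracy alone, as noted for Ketola et al. [61], can mask a high false-positive rate.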
Several studies have described the use of CAD systems for regression tasks, such as the calculation of radiological indexes and the quantification of LBP. The investigations from the groups of Pang [64] and Neubert [65] presented automated systems able to extract numerous quantitative measurements from lumbar spine MRI, including vertebral height as well as disc height and area, whereas Natalia et al. [68] reported a model capable of calculating foraminal width and canal diameter following automatic segmentation of the surrounding structures. In each of these studies, the mean absolute error was no higher than 1.22 mm. Similarly, the system presented by Niemeyer et al. [66] was shown to perform intervertebral disc degeneration grading with an average sensitivity >90%. An interesting study by Sneath et al. utilized a machine learning technique to aggregate degenerative changes of the spine and surrounding structures in order to estimate patients' age, which was ultimately within 11 years of the subjects' chronological age [67]. In addition, the AI systems proposed by Chae [71] and Cho [73] were able to automatically calculate several spinopelvic parameters predictive of lumbar spine deformity from lumbar X-rays, reaching an average error range of 1.45-3.51° in the former and 8.055° in the latter. With regard to scoliosis, Watanabe et al. [72] utilized a CNN able to estimate vertebral position, Cobb angle, and vertebral rotation using a combination of X-rays and Moiré topography, with a mean absolute error of 5.4 mm, 3.42°, and 2.9°, respectively. In another study, 3D models of scoliotic spines were built from X-rays and updated every three months for 18 months to check for curve progression. A Random Forest system trained on this dataset was able to predict curve progression with a difference <5° from the real curvature [74].
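The regression errors quoted above are typically summarized as the mean absolute error between model outputs and expert reference measurements. A minimal sketch of this computation, with invented values standing in for disc-height measurements (not data from the cited studies):

```python
# Illustrative mean absolute error between predicted and reference
# disc heights; the millimetre values are invented for illustration.
import numpy as np

reference = np.array([9.8, 10.2, 8.7, 11.1, 9.5])   # expert measurements (mm)
predicted = np.array([10.1, 10.0, 9.3, 10.6, 9.9])  # model outputs (mm)

mae = np.mean(np.abs(predicted - reference))
print(f"MAE: {mae:.2f} mm")  # prints "MAE: 0.40 mm"
```

An MAE below about 1.22 mm, as in the studies of Pang, Neubert, and Natalia et al., indicates that the automated measurements fall within roughly a millimetre of the expert annotations on average.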
Fortin and colleagues were the only authors to analyze paraspinal muscle composition in patients with LBP, reaching an intra-rater reliability coefficient of 0.95-0.99 [70]. Another study described an AI-based model able to predict the severity of LBP based on skin resistance and pain expressed through a visual analog scale (VAS), with an error of 4% [69].
With regard to the classification task, most studies addressed LBP diagnosis or disc degeneration. Figure 4 reports the accuracy of methods aiming at the diagnosis of LBP, i.e., at classifying whether or not a subject is suffering from LBP. The reported results differ in the type of data considered as model input and in whether machine learning or deep learning techniques were utilized. The accuracy results are all greater than 75%, and three studies achieved an accuracy greater than 95%, reaching human-level diagnostic capability. Two of them exploited kinematic or biomechanical measures [58,59], whereas one exploited clinical data [26]. It is worth noting that the best performance was achieved by a deep LSTM network [58], although the majority of studies exploited machine learning techniques. Figure 5 presents a boxplot reporting the accuracy of the disc degeneration classification task, considering nine studies that used machine learning (median accuracy = 92.5%) and four studies that used deep learning (median accuracy = 88.8%) techniques. Briefly, machine learning techniques achieved slightly better results, both in terms of median accuracy and best performance. However, it must be taken into account that the number of studies performing this task was not sufficient for a thorough statistical analysis, and the same applies to the LBP diagnosis task. Thus, these results should be interpreted as a preliminary effort to identify the most promising approach in the frame of CAD applications to LBP. Finally, with regard to regression, no single task was addressed more than the others; rather, each research group focused on a characteristic task. Nonetheless, some technically sound studies have been presented, and their results are noteworthy when considering a specific task.
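The pooled comparison underlying Figure 5 amounts to taking the median of the per-study accuracies in each group. A minimal sketch of this computation; the individual accuracy values below are illustrative placeholders chosen only so that the group medians match the reported 92.5% and 88.8%, not the reviewed studies' actual results:

```python
# Median accuracy per technique group, as in the Figure 5 comparison.
# The per-study accuracies are placeholders, not the reviewed values.
from statistics import median

ml_accuracies = [86.0, 88.5, 90.2, 91.0, 92.5, 92.8, 93.0, 94.1, 96.3]  # 9 ML studies
dl_accuracies = [81.1, 88.0, 89.6, 95.4]                                # 4 DL studies

print(f"ML median accuracy: {median(ml_accuracies):.1f}%")  # prints 92.5%
print(f"DL median accuracy: {median(dl_accuracies):.1f}%")  # prints 88.8%
```

With only nine and four studies per group, such medians are descriptive summaries rather than a basis for statistical inference, which is why the text above frames the comparison as preliminary.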
The implementation of AI systems in healthcare, particularly of tools with direct clinical repercussions on the formulation of diagnoses or clinical decisions, is undoubtedly determining a paradigm shift, raising significant ethical and regulatory issues [2]. More specifically, although apparently autonomous, such systems must always be accompanied by the judgement of clinicians throughout the diagnostic process. Furthermore, exceptional care should be taken with the vast amount of personal data used to train AI systems, in order to avoid the unintended disclosure of private information.
This study has some limitations. First of all, the significant heterogeneity across studies in terms of methodology, data sources, and outcomes prevented a meta-analysis from being performed. Second, as the search included English-language manuscripts only, we may have missed articles in other languages matching our inclusion criteria.

Conclusions
AI is undoubtedly revolutionizing medical research and patient care through its multiple applications in several fields, including spine surgery. In this study, we have systematically reviewed the available literature on the use of AI, and more specifically CAD, in supporting the diagnostic process in patients affected by LBP. The majority of included studies showed a high degree of accuracy and low margins of error in performing various tasks, most frequently the identification of degenerative changes (disc degeneration or herniation, stenosis of the central canal and foramina, spondylolisthesis), while also presenting promising results from innovative data acquisition techniques. Against this background, the use of AI and CAD may effectively improve the diagnostic process and, consequently, patients' outcomes.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijerph19105971/s1, Table S1: Summary of the methodological quality of included studies regarding the four domains assessing the risk of bias of the QUADAS-2 score; Table S2: Summary of the methodological quality of included studies regarding the three domains assessing applicability concerns of the QUADAS-2 score.