Knee Osteoarthritis Classiﬁcation Using 3D CNN and MRI

: Osteoarthritis (OA) is the most common form of arthritis and can often occur in the knee. While convolutional neural networks (CNNs) have been widely used to study medical images, the application of a 3-dimensional (3D) CNN in knee OA diagnosis is limited. This study utilizes a 3D CNN model to analyze sequences of knee magnetic resonance (MR) images to perform knee OA classiﬁcation. An advantage of using 3D CNNs is the ability to analyze the whole sequence of 3D MR images as a single unit as opposed to a traditional 2D CNN, which examines one image at a time. Therefore, 3D features could be extracted from adjacent slices, which may not be detectable from a single 2D image. The input data for each knee were a sequence of double-echo steady-state (DESS) MR images, and each knee was labeled by the Kellgren and Lawrence (KL) grade of severity at levels 0–4. In addition to the 5-category KL grade classiﬁcation, we further examined a 2-category classiﬁcation that distinguishes non-OA (KL ≤ 1) from OA (KL ≥ 2) knees. Clinically, diagnosing a patient with knee OA is the ultimate goal of assigning a KL grade. On a dataset with 1100 knees, the 3D CNN model that classiﬁes knees with and without OA achieved an accuracy of 86.5% on the validation set and 83.0% on the testing set. We further conducted a comparative study between MRI and X-ray. Compared with a CNN model using X-ray images trained from the same group of patients, the proposed 3D model with MR images achieved higher accuracy in both the 5-category classiﬁcation (54.0% vs. 50.0%) and the 2-category classiﬁcation (83.0% vs. 77.0%). The result indicates that MRI, with the application of a 3D CNN model, has greater potential to improve diagnosis accuracy for knee OA clinically than the currently used X-ray methods.


Introduction
The most common form of joint disorder in the United States is osteoarthritis (OA) [1]. Knee OA can cause pain and is the number one disease at causing loss of ability to perform daily activities such as walking and stair climbing [2]. Knee OA is associated with age [3] and is characterized by the loss of articular cartilage volume [4]. OA is viewed as a "wholeorgan" disorder, manifesting damage to a range of articular structures, especially the hyaline cartilage, meniscus, periarticular bone, ligaments, and tendons [5]. Despite its importance for public health, we have no interventions that effectively modify the OA disease process [6]. The absence of useful biomarkers to detect OA progression is a major technological obstacle to the development of treatment and prevention of knee OA [7].
While joint replacement is effective for treating end-stage OA, the evaluation of potential disease-modifying treatments in populations meeting current clinical criteria for OA has had limited success [6]. In the past decade, early diagnosis and early treatment strategies in rheumatoid arthritis have reduced patient morbidity and associated costs [8]. The early diagnosis and treatment of OA conditions may similarly improve outcomes and reduce disability and costs for OA. However, the absence of useful image biomarkers to detect OA progression has been a critical technology gap in the early diagnosis and treatment of OA [9].
The conventional radiographs (X-rays) are commonly used for routine knee OA examinations. An X-ray of a joint with osteoarthritis typically shows a narrowing of the space between the bones of the knee joint where the cartilage has worn away. However, symptoms of knee OA may arise before the damage can be seen in standard X-rays. For example, Roemer et al. [10] described how X-rays are unable to show certain structural phenotypes of OA and cannot detect some detrimental findings which can indicate risk of disease that would progress rapidly.
The advent of magnetic resonance imaging (MRI) offers the promise of addressing the critical technology gap by allowing quantification of structural damage in joints. For this reason, radiologists at hospitals often use the more sensitive magnetic resonance imaging (MRI) for OA early detection. Juras et al. [11] pointed out that OA needs early detection, and MRI is a noninvasive way for detecting early biomarkers. To promote the evaluation of OA MRI biomarkers, the National Institutes of Health (NIH) launched the Osteoarthritis Initiative (OAI) cohort study, which includes four clinical centers that recruited approximately 4800 men and women (ages 45-79 years) with or at risk for knee OA. The OAI collected a wealth of data on its participants over an eight-year span. The study included annual knee MRIs for the first four years and then biannual knee MRIs for the subsequent 4 years [12]. One of the goals of creating the OAI dataset was to discover the objective, measurable standards of disease diagnosis and progression, and to determine the predictive role of MRI changes for subsequent radiographic and clinical changes related to the development of knee OA.
The 3-dimensional (3D) MR images allow for both viewing the knee as a "whole organ" and depicting all of the tissues of the joint [13]. While cartilage degradation and other biomarkers can be manually detected, it is time-consuming to process the volume of 3D MR images. Thus, there is a need to automate these processes with machine learning techniques.
Convolutional neural networks (CNNs) are a class of deep learning techniques that are designed to work with images and can remove the need for handcrafted feature extractors [14]. CNNs have been used for various image classification tasks, with recent studies developing CNN models for medical image analysis. The early work of using CNNs to classify knee OA was mainly applied to radiographic (X-ray) images [15][16][17][18].
Anthony et al. employed the classical VGG-16 CNN architecture and transferred learning with X-ray images to determine the OA severity level [15]. These images were preprocessed using an SVM and Sobel edge detector in order to locate the knee joint area. Their study used X-ray images from the OAI. A set of 4446 X-rays were used in this study, representing a total of 8892 knees. When classifying the five Kellgren and Lawrence (KL) grades, they achieved an accuracy of 59.6%. Later, in another work, the same group updated the preprocessing step to use a fully convolutional neural network (FCN) to determine the bounding box of the knee joint. The FCN method was found to be highly accurate in determining regions of interest (ROI), and when combined with a CNN for classification, the method achieved an accuracy of 61.9% [16].
Unlike the two-stage frameworks developed in [15,16], a recent work [17] proposed an end-to-end CNN architecture for knee OA severity assessment without using a neural network for preprocessing. This method used branches in its CNN that are referred to as "attention modules", which provide an unsupervised determination of the ROI of X-ray images. Another recent work added a long short-term memory (LSTM) classifying step following the CNN layers in their network [18]. Given the nature of LSTM for processing sequential data, additional images were generated in a preprocessing step by cropping a fixed ROI and rotating the cropped image by 5, 10, −5, and −10 degrees. The original image and augmented images were stacked, giving about 4600 images used for training and about 480 for testing. Their work also used images from the OAI and achieved an accuracy of 75.28% for the 5-category classification.
It can be seen that 3D CNNs have developed quickly and are attracting interest as a method for analyzing sequences of images or other volumetric data. In a recent study, a 3D CNN was used for classifying real-world objects [19]. Depth information was used to create a 3D shape that was converted into a volumetric representation (voxels) to be classified by the 3D CNN. In addition, 3D CNNs have shown to be useful in medical image processing. When classifying lung nodules, working with 3D volumetric data in a 3D CNN outperformed 2D CNNs [20]. Wang et al. applied a 3D CNN model to calculate the probability of needing a total knee replacement (TKR) within the next nine years [21]. Their work demonstrated that the automated discovery of OA biomarkers from turbo spin echo (TSE) and double-echo steady-state (DESS) images could outperform models that use only demographic and clinical data. Another work explored this area using the popular 2D U-Net architecture for the segmentation of cartilage and meniscus in the knee, which were fed into a 3D CNN for classifying the severity of the cartilage and meniscus lesions [22]. Given the large amount of volumetric data, another recent work for classifying knee lesions used cropping of 3 ROIs from knee MRI to reduce the dimensionality before processing by multiple 3D CNN [23]. Aside from these applications, 3D CNNs have also been applied to segmentation problems including knee cartilage segmentation [24] and segmentation of brain lesions [25].
For knee OA severity classification, while previous methods used 2D CNNs to analyze X-ray images, in this work we propose a method using a 3D CNN and MR images. The details of the proposed method is introduced in the next section.

Method Overview
Knee MR imaging produces a 3D representation of the knee joint, utilizing a sequence of 2D images taken laterally across the knee. Given the 3D nature of MR images, 3D CNN can be advantageous in evaluating the whole sequence of images as one unit. Through the implementation of 3D kernels, information from adjacent slices could be integrated. Therefore, 3D features that may not be detectable using 2D CNN could be potentially captured.
For this study, we built a machine-learning model capable of analyzing sequences of MR images for each knee as input, with output given as one of the five KL grades. We further trained another model by relabeling the samples into OA and non-OA categories according to the clinical standard, i.e., patients with KL ≤ 1 are diagnosed as non-OA cases while patients with KL ≥ 2 are considered as OA. An overview of the proposed models is described in Figure 1. Overview of the models proposed in this study. The input data for each patient are a sequence of MR images. Each sequence is preprocessed by cropping and then removing slices. This reduced data are fed into a 3D CNN. The second pipeline applies the same 3D CNN architecture but is trained separately for the binary OA/non-OA classification.

Dataset
The dataset used in this study was from the public database Osteoarthritis Initiative (OAI). Most of the patient samples in the OAI dataset include an X-ray image; however, many do not have an accompanying MRI sequence. For this study, we used a subset of Figure 1. Overview of the models proposed in this study. The input data for each patient are a sequence of MR images. Each sequence is preprocessed by cropping and then removing slices. This reduced data are fed into a 3D CNN. The second pipeline applies the same 3D CNN architecture but is trained separately for the binary OA/non-OA classification.
In addition to MRI, we also studied traditional X-ray images, with an interest of finding out which imaging modality coupled with the modern CNNs can achieve better accuracy for knee OA diagnosis. We employed several state-of-the-art 2D CNN models, including VGG 16, ResNet50, DenseNet, etc. These models were trained to classify X-ray images into five KL categories. The one with the best accuracy was selected and further applied for the binary OA/non-OA classification. The pipelines for X-ray images are similar to those illustrated in Figure 1, except the 3D CNN is replaced by a 2D CNN and the preprocessing step for X-ray is to cut each pair of knees into individual ones. The X-ray images and MR images were obtained from the same group of patients, and the separation of training, validation, and testing sets were kept the same at the patient level for all the models trained and compared in this work.

Dataset
The dataset used in this study was from the public database Osteoarthritis Initiative (OAI). Most of the patient samples in the OAI dataset include an X-ray image; however, many do not have an accompanying MRI sequence. For this study, we used a subset of the OAI data with 1100 knees, with each knee having both MRI and X-ray available. The 3D DESS MRI data for each knee contain a sequence of 160 2D images, while there is one X-ray image containing both knees from a patient. The dataset was selected with an equal distribution among different OA severity levels (0-4) measured by the Kellgren and Lawrence (KL) grades.
A common practice in machine learning is to split the available dataset into three subsets known as the training set, validation set, and testing set [26]. Machine learning models learn from the training set with the validation set being used during the training process to tune parameters [27]. The testing set is not seen during the training process but rather is held back until the end of the study. The available dataset was randomized and then split into groups balanced by the KL grade with 800 training samples, 200 validation samples, and 100 testing samples. Each set contains a balanced number of samples from each of the five KL categories. Table 1 shows the distribution of the data.  Training  160  160  160  160  160  800  Validation  40  40  40  40  40  200  Testing  20  20  20  20  20  100  Total  220  220  220 220 220 1100

Preprocessing: Subregion Selection
Unlike natural images where useful information could appear anywhere, in medical images, features are usually located in fixed locations. As an example, the knee MR image shown in Figure 2 contains large bone areas and many other tissues. The important indicators for knee OA are often observed near the cartilage and joint region of femur and tibia bones. Therefore, we can reduce the input dimensionality by cropping a subregion of the images.
We cropped each image as a preprocessing step that helps to keep the cartilage region while removing regions that are less informative for knee OA classification. We cropped the image (original size 384 × 384) from the center using both square and rectangular regions. An example of cropping a knee MR image to various sizes is provided in Figure 3. Through various tests we discovered that the window size of 160 × 160 achieved the highest F-measure. Appl. Sci. 2021, 11, x FOR PEER REVIEW 5 of 13 We cropped each image as a preprocessing step that helps to keep the cartilage region while removing regions that are less informative for knee OA classification. We cropped the image (original size 384 × 384) from the center using both square and rectangular regions. An example of cropping a knee MR image to various sizes is provided in Figure 3. Through various tests we discovered that the window size of 160 × 160 achieved the highest F-measure.

Preprocessing: Slice Removal
Each sample in the database contains a sequence of 160 MR images. After cropping, the input dimensions of 160 × 160 × 160 were still very high. To further reduce the input data dimensionality, we removed some of the outer and center slices. The reason for removing a few beginning and ending slices is that they do not contain bone or cartilage information. Therefore, they are not likely to contain information related to OA. The reason for removing the middle range slices is that they have ill-defined cartilage regions and blurry bone boundaries due to the transition of medial and lateral bone happening in this range. Example slices are provided in Figure 4. Slice #20 has a small bone region starting, while slice #40 and #60 have larger bones with clearly defined bone boundaries and cartilage. Slice #80 is in the transition range, and therefore the cartilage and bone boundaries are unclear. For each sequence, we excluded the first 10 slices (1-10), middle 20 slices (71-90) and final 10 slices (151-160). The remaining 120 slices (11-70, 91-150) from the 160 slices were fed into the 3D CNN model. This is about 13% of the original 384 × 384 × 160 volume. We cropped each image as a preprocessing step that helps to keep the cartilage region while removing regions that are less informative for knee OA classification. We cropped the image (original size 384 × 384) from the center using both square and rectangular regions. An example of cropping a knee MR image to various sizes is provided in Figure 3. Through various tests we discovered that the window size of 160 × 160 achieved the highest F-measure.

Preprocessing: Slice Removal
Each sample in the database contains a sequence of 160 MR images. After cropping, the input dimensions of 160 × 160 × 160 were still very high. To further reduce the input data dimensionality, we removed some of the outer and center slices. The reason for removing a few beginning and ending slices is that they do not contain bone or cartilage information. Therefore, they are not likely to contain information related to OA. The reason for removing the middle range slices is that they have ill-defined cartilage regions and blurry bone boundaries due to the transition of medial and lateral bone happening in this range. Example slices are provided in Figure 4. Slice #20 has a small bone region starting, while slice #40 and #60 have larger bones with clearly defined bone boundaries and cartilage. Slice #80 is in the transition range, and therefore the cartilage and bone boundaries are unclear. For each sequence, we excluded the first 10 slices (1-10), middle 20 slices (71-90) and final 10 slices (151-160). The remaining 120 slices (11-70, 91-150) from the 160 slices were fed into the 3D CNN model. This is about 13% of the original 384 × 384 × 160 volume.

Preprocessing: Slice Removal
Each sample in the database contains a sequence of 160 MR images. After cropping, the input dimensions of 160 × 160 × 160 were still very high. To further reduce the input data dimensionality, we removed some of the outer and center slices. The reason for removing a few beginning and ending slices is that they do not contain bone or cartilage information. Therefore, they are not likely to contain information related to OA. The reason for removing the middle range slices is that they have ill-defined cartilage regions and blurry bone boundaries due to the transition of medial and lateral bone happening in this range. Example slices are provided in Figure 4. Slice #20 has a small bone region starting, while slice #40 and #60 have larger bones with clearly defined bone boundaries and cartilage. Slice #80 is in the transition range, and therefore the cartilage and bone boundaries are unclear. For each sequence, we excluded the first 10 slices (1-10), middle 20 slices (71-90) and final 10 slices (151-160). The remaining 120 slices  from the 160 slices were fed into the 3D CNN model. This is about 13% of the original 384 × 384 × 160 volume.

3D CNN Model for MRI
The architecture of the 3D CNN model proposed in this study is shown in Figure 5. The structure was inspired by the work performed by Wang et al. [21]. The most important difference between 3D CNNs and 2D CNNs is that 3D CNNs use 3D convolutional kernels

3D CNN Model for MRI
The architecture of the 3D CNN model proposed in this study is shown in Figure 5. The structure was inspired by the work performed by Wang et al. [21]. The most important difference between 3D CNNs and 2D CNNs is that 3D CNNs use 3D convolutional kernels to process a volumetric patch of a scan, while 2D CNNs process a single anatomical plane. The 3D convolutional kernels incorporate information from adjacent slices and are therefore able to extract 3D features, which are not detectable from 2D CNNs. As shown in Figure 5, three stages are included in the proposed 3D CNN model before the final softmax layer. The details of the model are discussed below.

3D CNN Model for MRI
The architecture of the 3D CNN model proposed in this study is shown in Figure 5. The structure was inspired by the work performed by Wang et al. [21]. The most important difference between 3D CNNs and 2D CNNs is that 3D CNNs use 3D convolutional kernels to process a volumetric patch of a scan, while 2D CNNs process a single anatomical plane. The 3D convolutional kernels incorporate information from adjacent slices and are therefore able to extract 3D features, which are not detectable from 2D CNNs. As shown in Figure 5, three stages are included in the proposed 3D CNN model before the final softmax layer. The details of the model are discussed below. Stage 1 of the model began with a convolutional layer containing 32 kernels of size 7 × 7 × 7 with a stride of 2 × 2 × 2. This was followed by batch normalization and an activation layer using the ReLU function. A max-pooling layer was added with a window size of 2 × 2 × 2 and a stride of 2 × 2 × 2. Finally, a dropout layer was placed before the start of the residual blocks. We used dropout layers in each stage of our 3D CNN model to help reduce overfitting. Each dropout layer used a rate of 0.5, which gives each node a 50% chance of being set to 0.
The second stage of the model contained a sequence of six residual blocks [28]. Each residual block featured a shortcut connection from the input to the output. There were two types of blocks used in this model as shown in Figure 6. The convolutional block features a convolutional layer in the shortcut path. This layer was used when the input dimensions were changed. The identity block did not have any layers in the shortcut path and was used when the input and output dimensions matched. This stage also ended with a dropout layer. The final stage of the model used global max pooling, followed by a fully connected layer of size 1024 and a dropout layer. The last layer of the model uses the softmax function, which outputs the possibilities of the sample belonging to each category.
The model was implemented in Python using the Keras library with TensorFlow as a backend. The model was trained using a batch size of 15 with early stopping based on validation loss. The Adam optimization function was used with a learning rate 0.001. The training was performed on a high-performance computer with a NVIDIA Tesla V100 32G GPU.

Classic 2D CNN Architectures for X-ray
When building our dataset, we selected patients which had an MRI volume as well as an X-ray image of the same knees. To develop the model for X-ray, we employed a variety of state-of-the-art 2D CNN architectures. VGG16 [29] was one of the earlier deep learning models, and it showed superior performance in many applications. ResNet50 [28] used the concept of residual blocks in which a shortcut connection is added after a series of layers. Our proposed 3D model utilizes a 3D variation of the ResNet50 convolutional and residual blocks as well. Inception-v3 [30] is the representation of the deep learning networks with inception modules and one of the first models to make use of batch normalization. Inception-ResNet [31] is a hybrid of Inception-v3 with residual connections. DenseNet [32] implements dense blocks in which convolutional layers of the same size are connected to every other layer in front of them.
While an MRI volume contains just one knee, an X-ray image contains both. These Xray images were split in half, and all the left knees were flipped so the right and left knees are aligned. The pretrained ImageNet weights were used for transfer learning. The last Stage 1 of the model began with a convolutional layer containing 32 kernels of size 7 × 7 × 7 with a stride of 2 × 2 × 2. This was followed by batch normalization and an activation layer using the ReLU function. A max-pooling layer was added with a window size of 2 × 2 × 2 and a stride of 2 × 2 × 2. Finally, a dropout layer was placed before the start of the residual blocks. We used dropout layers in each stage of our 3D CNN model to help reduce overfitting. Each dropout layer used a rate of 0.5, which gives each node a 50% chance of being set to 0.
The second stage of the model contained a sequence of six residual blocks [28]. Each residual block featured a shortcut connection from the input to the output. There were two types of blocks used in this model as shown in Figure 6. The convolutional block features a convolutional layer in the shortcut path. This layer was used when the input dimensions were changed. The identity block did not have any layers in the shortcut path and was used when the input and output dimensions matched. This stage also ended with a dropout layer.
The final stage of the model used global max pooling, followed by a fully connected layer of size 1024 and a dropout layer. The last layer of the model uses the softmax function, which outputs the possibilities of the sample belonging to each category.
The model was implemented in Python using the Keras library with TensorFlow as a backend. The model was trained using a batch size of 15 with early stopping based on validation loss. The Adam optimization function was used with a learning rate 0.001. The training was performed on a high-performance computer with a NVIDIA Tesla V100 32 G GPU.

Classic 2D CNN Architectures for X-ray
When building our dataset, we selected patients which had an MRI volume as well as an X-ray image of the same knees. To develop the model for X-ray, we employed a variety of state-of-the-art 2D CNN architectures. VGG16 [29] was one of the earlier deep learning models, and it showed superior performance in many applications. ResNet50 [28] used the concept of residual blocks in which a shortcut connection is added after a series of layers. Our proposed 3D model utilizes a 3D variation of the ResNet50 convolutional and residual blocks as well. Inception-v3 [30] is the representation of the deep learning networks with inception modules and one of the first models to make use of batch normalization. Inception-ResNet [31] is a hybrid of Inception-v3 with residual connections. DenseNet [32] implements dense blocks in which convolutional layers of the same size are connected to every other layer in front of them.
While an MRI volume contains just one knee, an X-ray image contains both. These X-ray images were split in half, and all the left knees were flipped so the right and left knees are aligned. The pretrained ImageNet weights were used for transfer learning. The last softmax layer was retrained using the X-ray data while the previous layers were not changed. Input images were scaled to a size of 224 × 224 or 299 × 299, depending on the architectures of different models. Since the pretrained networks were trained with RGB images, we duplicated each gray level X-ray image three times to feed it into the three input channels. Table 2 shows the performance of our 3D CNN model in a confusion matrix of actual vs. predicted level for the 5-category classification. It should be noted that results presented in this subsection and the following two subsections are based on the validation set. The testing set was used once as a final evaluation step in Section 3.4 only. Observing the results in Table 2, we found that the two boundary categories, i.e., KL = 0 and KL = 4, are relatively easier to classify with a high accuracy of 70% and 90%, respectively. The middle categories are more difficult to classify. For category KL = 1, the accuracy is only 45%. However, most misclassified cases (15 out of 22 misclassified ones) were for KL = 0, which is less severe than those misclassified into higher KL grades, since clinically, both KL = 0 and KL = 1 are considered as non-OA. Category KL = 2 had the lowest accuracy 37.5%, and most of these misclassified cases went to KL = 0 and KL = 1. It is worth further examining why this category was considered more similar to non-OA class by the model when clinically this category is considered as an OA class. Using the same 3D CNN architecture, we trained another model with the input da-ta labeled as non-OA (KL ≤ 1) and OA (KL ≥ 2). Table 3 shows the confusion matrix of the 2-category model.

Ablation Study
The accuracy of our 3D CNN model is based on the preprocessing steps as well as the architecture. Table 4 demonstrates the effects of removing individual aspects of our model. The first row shows the accuracy of the model with all strategies used. While our final model used a cropped size of 160 × 160, for this ablation study we used a larger size of 224 × 224, which was the largest that could be used-given memory constraints-to generate the performance without cropping (second row in Table 4). It is worth noting that when we provided more information (larger crop), the accuracy dropped by 5%. To test the effect of slice removal, we removed the slice selection step and used all 160 slices. Similar to cropping, providing more information caused worse performance. Dropout layers are a common method to avoid overfitting, and in this study, it can be seen that removing these layers caused a drop in performance. Finally, we experimented with halving the number of our final residual layers. This caused the most significant drop in accuracy and demonstrates the need of a deep model for working with the 3D MR image data. Note: " √ " indicates the feature is included; "X" indicates the feature was removed.

2D CNN Using X-ray Data
For the same group of patients, we compared the performance of various 2D CNN architectures that use X-rays. Table 5 compares the performance of each architecture for the 5-category classification and the 2-category OA/non-OA classification. Based on the above experiments, we selected InceptionResNetV2, given its best performance for the averages of the 5-category and 2-category accuracy percentages. Tables 6 and 7 show the confusion matrix for InceptionResNetV2 for the 5-category classification and the 2-category classification, respectively. The last columns of Tables 6 and 7 are the results of MR images with 3D CNN models copied from Tables 2 and 3 in order to make it easier to compare the two imaging modalities. Overall, MRI outperformed X-ray in both the 5-category (60.5% vs. 55.5%) and 2-category (86.5% vs. 80.0%) classifications. From Table 6, we can see that MRI has higher accuracy in classifying the categories of KL = 0, 1, and 3, but lower in the KL = 2 and 4 categories. Correspondingly, in Table 7, the accuracy of MRI is much higher than X-ray in classifying the non-OA category (88.8% vs. 70.0%) while a little lower in classifying the OA category (85.0% vs. 86.7%).

Comparison and Further Evaluation with Testing Set
The results presented in the above subsections are from the validation set. We set aside a testing set of 100 samples that has not been seen by any model yet. This testing set was balanced between the 5 KL grades. Table 8 shows the performance of our 3D model with MRI data as well as the best 2D CNN model (InceptionResNetV2) for X-ray, against both the validation set and the testing set. We can see that the 3D CNN model with MRI outperformed the 2D CNN model with X-ray significantly for the 5-category and 2-category classifications on both the validation and testing sets. Table 9 further shows a group of different measures to evaluate OA and non-OA classification using the testing set. Our 3D CNN model with MRI achieved a much higher specificity (0.850 vs. 0.650), F1 score (0.831 vs. 0.767), and AUC (area under ROC curve; 0.911 vs 0.867), while the sensitivity is lower than that of the X-ray model (0.817 vs. 0.850). Figure 7 plots the ROC curves for the two models in the OA/non-OA classification.

Discussion
Currently, X-ray is the basic routine imaging modality for examining a patient with OA potentials clinically. While X-ray is more cost-efficient than MRI, it is not as sensitive as MRI, which can show much more structure and tissue details. Therefore, MRI is considered as an alternative imaging tool, especially for detecting early osteoarthritis with slight structure change.
In the 5-category results of this study we found that MRI had higher accuracy in classifying KL = 0 and KL = 1 ( Table 6). This aligns with the previous studies that found MRI

Discussion
Currently, X-ray is the basic routine imaging modality for examining a patient with OA potentials clinically. While X-ray is more cost-efficient than MRI, it is not as sensitive as MRI, which can show much more structure and tissue details. Therefore, MRI is considered as an alternative imaging tool, especially for detecting early osteoarthritis with slight structure change.
In the 5-category results of this study we found that MRI had higher accuracy in classifying KL = 0 and KL = 1 ( Table 6). This aligns with the previous studies that found MRI to be better at capturing detailed and small structure change and therefore more sensitive to early signs of OA development. When classifying the category KL = 4, X-rays have higher accuracy than MRI, indicating that X-rays are better at detecting OA in a more severe situation. The 2-category results in this study were consistent with those in the 5-category, in that X-ray has higher accuracy for detecting severe OA cases while MRI is more sensitive to small structure changes and early indicators of OA (Table 7). The complementary performance of the two imaging modalities is interesting and indicates the possibility that they could be combined to develop a comprehensive and more accurate diagnosis system than using each individual imaging modality alone.
A limitation of this study is that we have a limited number of samples. This is because we have to include patients with both MRI and X-ray scanned on the same knee. Another limitation is that MRI is not widely used in clinical diagnosis due to the cost. However, MRI is a new trend of imaging to study the pathology of knee OA in many clinical trials since MRI can offer a better view of soft tissues such as cartilage, bone marrow lesions, and effusions.
Future work includes further examination of the KL categories with lower accuracy, e.g., the KL = 2 category, which was often misclassified into KL = 1 (non-OA) by the model. This may be solved with a weighting system during training. Additionally, our current preprocessing uses a fixed offset for cropping as well as a fixed range for slice removing. This could be updated as a dynamic setting for each sequence, which may retrieve more accurate information and therefore generate better classification performance. Combining the two imaging modalities for a comprehensive and more accurate model is also a promising direction.

Conclusions
In this study, we proposed a novel 3D CNN model coupled with 3D MRI for knee OA classification. Guided by clinical knowledge, we reduced the input dimensionality of each image sequence by using subregions of MR images and removing less informative