Exploring the Efficacy of Base Data Augmentation Methods in Deep Learning-Based Radiograph Classification of Knee Joint Osteoarthritis

Diagnosing knee joint osteoarthritis (KOA), a major cause of disability worldwide, is challenging due to subtle radiographic indicators and the varied progression of the disease. Using deep learning for KOA diagnosis requires broad, comprehensive datasets. However, obtaining these datasets poses significant challenges due to patient privacy concerns and data collection restrictions. Additive data augmentation, which enhances data variability, emerges as a promising solution. Yet, it is unclear which augmentation techniques are most effective for KOA. This study explored various data augmentation methods, including adversarial augmentations, and their impact on KOA classification model performance. While some techniques improved performance, other commonly used ones underperformed. We identified potential confounding regions within the images using adversarial augmentation. This was evidenced by our models' ability to classify KL0 and KL4 grades accurately even with the knee joint omitted. This observation suggested a model bias, whereby the models may leverage features present in the radiographs that are unrelated to the knee joint. Interestingly, removing the knee joint also led to an unexpected improvement in KL1 classification accuracy. To better visualize these paradoxical effects, we employed Grad-CAM, highlighting the associated regions. Our study underscores the need for careful technique selection for improved model performance and for identifying and managing potential confounding regions in radiographic KOA deep learning.


Introduction
The past decade has seen a considerable surge in the integration of artificial intelligence into medicine 1,2 , riding on the wave of the dramatic growth in deep machine learning methods 3 . Medicine has emerged as a crucial field for applying these advanced technologies, with deep learning primarily targeting clinical decision support and data analysis. These systems, adept at examining medical data to discover patterns and relationships, span a diverse range of applications. They have demonstrated significant progress in predicting patient outcomes 4-6 , as well as enhancing diagnostics and disease classification 7-11 . Beyond analysis and classification, deep learning has proven effective in data segmentation 12,13 and has even made strides in the generation 14-18 and anonymization of medical data 19-24 . The utility of these advances in the field of osteoarthritis (OA), however, presents its unique set of challenges.
OA, with knee joint osteoarthritis 25-27 (KOA) being especially prevalent 28 , is a primary global cause of disability 29 , with estimated expenditures reaching up to 2.5% of the Gross National Product in western countries 29 . Its early detection is often impeded by subtle radiographic markers and disease progression variability 25,28 . Leveraging deep learning for diagnosing KOA 30-32 depends heavily on the availability of diverse and extensive datasets. However, obtaining such datasets is problematic, constrained by patient privacy considerations 33,34 , data collection restrictions, and the inherent progression of OA. Various studies have utilized data augmentation techniques as a workaround, creating artificial data variability. For KOA specifically, two leading data augmentation approaches are employed: affine (online) augmentation, where random transformations occur during training, and additive (offline) augmentation, which manipulates the base (original) data prior to training to generate more data points. These techniques, often used in tandem, have proven successful in enhancing performance and mitigating overfitting. However, until now, there has been no systematic exploration to determine which technique is most effective for the task at hand, nor which ones might be less beneficial. In addition, no prior research has probed the realm of adversarial augmentation, a tactic designed to deceive the underlying system into delivering high performance while excluding or distorting essential radiograph characteristics. This approach could potentially identify confounding regions within images, thereby enhancing validation processes.
In this study, we address these research gaps. Our focus centers on discerning the most suitable base augmentation technique for the task at hand and pinpointing potential confounding regions present within the radiographs (with adversarial augmentation).

Materials and Methods
In this study, we present a comprehensive augmentation methodology for the classification of knee joint X-ray images sourced from the Osteoarthritis Initiative 35 . Our approach is three-fold: data collection and preprocessing, image augmentation, and the application of a convolutional neural network (CNN) for classification. We utilized a dataset of 8260 images, graded via the Kellgren and Lawrence 36 system, and subjected them to both positive/supportive and negative/adversarial augmentations. This was done to explore the benefits of artificial diversity during training and to challenge the classifier's resilience. The CNN model of choice was the EfficientNetV2-M 37 , which was trained over 15 epochs with a dataset split into training, validation, and testing sets. To enhance the interpretability of our CNN model, we employed the Grad-CAM 38 algorithm, providing visual insights into the decision-making process of the network. Our evaluation metrics included accuracy, precision, recall, and the F1 score, offering a wider view of the model's performance. Figure 1 illustrates the study's operation sequence using numeric markers. This section elaborates on each step in the order indicated by the numeric markers in the figure.

Data Collection
Our research utilized knee joint X-ray images from the Chen 2019 study 39 , sourced initially from the Osteoarthritis Initiative (OAI) 35 . The OAI, a multi-center study focused on biomarkers for knee osteoarthritis, included 4796 participants aged 45 to 79. We employed the pre-processed primary cohort data from Chen 2019 39 , which had been subject to automatic knee joint detection, bounding, and zoom standardization to 0.14 mm/pixel. This led to 8260 images (224 × 224 pixels) derived from 4130 X-rays containing both knee joints. The images were graded via the Kellgren and Lawrence (KL) system 36 , as shown in Figure 2. The KL grade distribution was as follows: 3253 images for Grade 0, 1495 for Grade 1, 2175 for Grade 2, 1086 for Grade 3, and 251 for Grade 4.

Image Pre-Processing
We flipped each right knee joint image to mirror a left knee orientation. Then, we identified and inverted any negative channel images, resulting in 189 such alterations for KL01 and 77 for KL234. We then equalized each image's contrast using histogram equalization (Equation 1): for a given grayscale image I with dimensions of m × n, each pixel intensity is remapped through the cumulative distribution function (CDF) of the image's intensity histogram, yielding an approximately uniform output intensity distribution.
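The equalization step described above can be sketched in a few lines of NumPy. This is a minimal illustration of standard CDF-based histogram equalization, not the authors' exact implementation; the function name and the 256-level default are our assumptions.

```python
import numpy as np

def equalize_histogram(image: np.ndarray, levels: int = 256) -> np.ndarray:
    """Contrast-equalize an integer grayscale image via its cumulative distribution."""
    flat = image.ravel()
    hist = np.bincount(flat, minlength=levels)
    cdf = np.cumsum(hist)
    cdf_min = cdf[np.nonzero(cdf)][0]  # first non-zero CDF value
    total = flat.size
    # Build a lookup table so output intensities are approximately uniform
    lut = np.round((cdf - cdf_min) / max(total - cdf_min, 1) * (levels - 1))
    return lut.astype(np.uint8)[image]
```

Applied to a 2 × 2 image with intensities 0, 1, 2, 3, the four values are spread across the full 0-255 range.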

Base Data Augmentation Sets
In our research, we divided our dataset into distinct splits and applied base data augmentations to each of these splits. We crafted two base data augmentation sets. The term 'base data' refers to enduring modifications made to all the data ('offline') before introducing any 'online' affine augmentations during the training phase (Table 1). The first augmentation set focused on positive or supportive modifications, exploring the potential benefits of incorporating artificial diversity during training. The second set, conversely, incorporated negative/adversarial augmentations intended to challenge the classifier. This was done to help pinpoint potential confounds in the classification task and test the model's resilience. Table 2 showcases all conditions used, while Figure 3 visualizes the applied base augmentations.
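Two of the adversarial base augmentations described in Table 2 can be sketched as follows. The noise parameterization (standard deviation as a percentage of the 0-255 intensity range) is our assumption about what "5% Gaussian pixel noise" means, and the cube helper only performs the grid division; how the resulting tiles are recombined is not specified here.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, percent: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian pixel noise; sigma is `percent` of the 0-255 range
    (an assumed parameterization of the Noise 05/10/20/50 conditions)."""
    sigma = percent / 100.0 * 255.0
    noisy = image.astype(float) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def cube_tiles(image: np.ndarray, n: int) -> list:
    """Divide a grayscale image into an n x n grid of equidistant tiles
    (the division step of the Cube 2/3/6 conditions)."""
    h, w = image.shape
    return [image[i * h // n:(i + 1) * h // n, j * w // n:(j + 1) * w // n]
            for i in range(n) for j in range(n)]
```

For a 224 × 224 radiograph, `cube_tiles(img, 2)` yields four 112 × 112 tiles.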

Convolutional Neural Networks
Convolutional neural networks (CNNs) 42 are foundational in the recent deep learning revolution 3 . CNNs are a type of neural network often used for computer vision. These neural networks employ the convolution operation between an input and a filter-kernel. Filters slide across inputs to highlight features in a response known as a feature map. Various feature maps are combined to produce higher-level feature maps corresponding to higher-level concepts. Formally 43 , for an image I of u × v dimensions and filter-kernel H of s × t dimensions, we can obtain feature map G by convolution across the two axes with kernel H as:

G(i, j) = Σ_s Σ_t I(i − s, j − t) H(s, t)

Typically, the feature map values are filtered with an activation function. The role of the activation function is to re-map the values across a given function. For example, a rectified linear unit activation function 44 (ReLU) zeros out negative values. Such an approach offers computational efficiency due to replacing redundant values with zero. For any feature map value z, the ReLU activation is defined as:

ReLU(z) = max(0, z)

In addition to the activation function operation, the max pooling operation is often used. Max pooling down-samples the convolution result such that cascades of max pooling and convolution result in an ever-decreasing number of features. For image I of u × v dimensions, the max pooled value g(u_I) given dimension u can be simply defined as follows:

g(u_I)_j = max_{i = 0, …, r−1} u_I[j · h + i]

Where u_I is only dimension u from image I, r is the pooling window size, and h is the stride value.

[Rows of Table 2 displaced here by extraction:]
Noise 05 | Negative | 5% Gaussian pixel noise added to the baseline
Noise 10 | Negative | 10% Gaussian pixel noise added to the baseline
Noise 20 | Negative | 20% Gaussian pixel noise added to the baseline
Noise 50 | Negative | 50% Gaussian pixel noise added to the baseline
Cube 2 | Negative | Image divided into equidistant parts in a 2x2 grid
Cube 3 41 | Negative | Image divided into equidistant parts in a 3x3 grid
Cube 6 | Negative | Image divided into equidistant parts in a 6x6 grid
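The three operations above (convolution, ReLU, max pooling) can be illustrated with a naive NumPy sketch; real CNN libraries use heavily optimized equivalents. Function names and the valid-mode/stride-2 defaults are illustrative choices, not from the paper.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2-D sliding-window filtering of an image with a kernel."""
    u, v = image.shape
    s, t = kernel.shape
    out = np.empty((u - s + 1, v - t + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + s, j:j + t] * kernel)
    return out

def relu(z: np.ndarray) -> np.ndarray:
    """Zero out negative feature-map values: ReLU(z) = max(0, z)."""
    return np.maximum(z, 0)

def max_pool(feature_map: np.ndarray, r: int = 2, h: int = 2) -> np.ndarray:
    """Down-sample with pooling window size r and stride h."""
    u, v = feature_map.shape
    vals = [feature_map[i:i + r, j:j + r].max()
            for i in range(0, u - r + 1, h)
            for j in range(0, v - r + 1, h)]
    n_rows = len(range(0, u - r + 1, h))
    return np.array(vals).reshape(n_rows, -1)
```

Cascading `conv2d`, `relu`, and `max_pool` shrinks a 4 × 4 input to a 2 × 2 map, mirroring the ever-decreasing feature counts described above.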

Convolutional Neural Network Architecture
EfficientNet 45 , a well-recognized deep learning model, employs compound scaling, balancing depth (number of layers), width (size of the layers), and resolution (size of the input image) in a structured manner. This scaling process is mathematically represented as:

d = α^φ · d_0,  w = β^φ · w_0,  r = γ^φ · r_0

Here, α, β, γ are constants, φ is a user-defined coefficient, and d_0, w_0, r_0 represent the depth, width, and resolution of the base model.
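The compound-scaling rule can be computed directly. The default constants below (α = 1.2, β = 1.1, γ = 1.15) are the values reported for the original EfficientNet; the base depth/width multipliers of 1 and base resolution of 224 are illustrative assumptions.

```python
def compound_scale(phi: float, alpha: float = 1.2, beta: float = 1.1,
                   gamma: float = 1.15, d0: float = 1.0, w0: float = 1.0,
                   r0: float = 224.0):
    """Jointly scale depth, width, and resolution from a base model:
    d = alpha^phi * d0, w = beta^phi * w0, r = gamma^phi * r0."""
    depth = d0 * alpha ** phi
    width = w0 * beta ** phi
    resolution = r0 * gamma ** phi
    return depth, width, resolution
```

With φ = 0 the base model is recovered; φ = 1 multiplies depth by 1.2, width by 1.1, and resolution by 1.15.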
A vital component of EfficientNet is the MBConv block. It sequences transformations starting with a 1 × 1 convolution, a depth-wise convolution, a Squeeze-and-Excitation (SE) operation 46 , and another 1 × 1 convolution:

MBConv(x) = K_2(SE(D(K_1(x))))

In this formula, K_1 and K_2 are 1 × 1 convolutional filters, D represents the depth-wise convolutional filter, and SE is the Squeeze-and-Excitation operation. EfficientNetV2 37 extends the original model by incorporating a Fused-MBConv block, which combines the initial 1 × 1 and depth-wise convolutions into a single 3 × 3 convolution, followed by an SE operation and a final 1 × 1 convolution:

Fused-MBConv(x) = K_2(SE(K_f(x)))

Here, K_f is the 3 × 3 convolutional filter combining the initial 1 × 1 and depth-wise convolutions, and K_2 is the final 1 × 1 convolutional filter. An activation follows each convolution and may include a skip connection. All online affine augmentations are consistently applied to each image, but the degree to which they are applied is randomized within the specified ranges of each technique. These augmentations are directly incorporated and executed using the Keras library 48 .
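The randomized online augmentations can be sketched as a parameter sampler over the ranges from Table 1. This is not the paper's Keras pipeline: rotation, shear, and zoom are sampled here only to show the ranges, since applying them faithfully requires an interpolating image warp (which Keras handles internally); this sketch applies only the flip and integer pixel shifts.

```python
import numpy as np

def random_affine(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample the Table 1 augmentation ranges and apply the flip/shift subset."""
    theta = rng.uniform(-40, 40)    # rotation range in degrees (sampled only)
    shear = rng.uniform(-0.2, 0.2)  # shear angle (sampled only)
    zoom = rng.uniform(0.8, 1.2)    # up to 20% zoom (sampled only)
    dx = int(rng.uniform(-45, 45))  # width shift in pixels
    dy = int(rng.uniform(-45, 45))  # height shift in pixels
    out = image.copy()
    if rng.random() < 0.5:          # horizontal flip
        out = out[:, ::-1]
    # Shift with zero padding (no wrap-around)
    shifted = np.zeros_like(out)
    h, w = out.shape
    dst_rows = slice(max(dy, 0), min(h, h + dy))
    dst_cols = slice(max(dx, 0), min(w, w + dx))
    src_rows = slice(max(-dy, 0), min(h, h - dy))
    src_cols = slice(max(-dx, 0), min(w, w - dx))
    shifted[dst_rows, dst_cols] = out[src_rows, src_cols]
    return shifted
```

Because the parameters are re-sampled per image per epoch, every training pass sees a slightly different version of each radiograph.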

CNN Interpretability
While the complexity of neural networks increases their capabilities, it also complicates the interpretation of their predictions 49 . Due to this complexity, these systems are often deemed 'black boxes.' However, the Grad-CAM 38 algorithm (based on the CAM 50 framework) helps reduce this 'black box' effect. At a high level, Grad-CAM is an algorithm that visualizes how a convolutional neural network makes its decisions. It creates what are known as "heat maps" or "activation maps" that highlight the areas in an input image that the model considers important for making its prediction. The Grad-CAM spatial activation map M^p_Grad-CAM can be calculated using the ReLU activation function on the sum of neuron importance weights b^p_k multiplied by feature maps Ψ^k as shown below:

M^p_Grad-CAM = ReLU(Σ_k b^p_k Ψ^k),  with  b^p_k = (1/Z) Σ_m Σ_n ∂y^p / ∂Ψ^k_mn

In this equation, b^p_k are the neuron importance weights of feature map k for class p, obtained by spatially averaging the partial derivatives ∂y^p / ∂Ψ^k_mn of the final layer prediction for class p (y^p) with respect to the last convolutional layer's kth feature map Ψ^k. Z is the total number of pixels, and m, n are the indexes for each element within feature map k. In our study, we extracted Grad-CAM activations from the layer immediately preceding the flattening operation.
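Given the feature maps of the last convolutional layer and the gradients of the class score with respect to them (both obtainable from any deep learning framework), the Grad-CAM map reduces to a few array operations. This is a framework-agnostic sketch, not the paper's exact implementation.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM map.

    feature_maps: last conv layer's maps, shape (k, m, n).
    gradients: d(class score)/d(feature_maps), same shape.
    """
    # Neuron importance weights b_k^p: spatial average of the gradients
    weights = gradients.mean(axis=(1, 2))              # shape (k,)
    # Weighted sum of the feature maps over k
    cam = np.tensordot(weights, feature_maps, axes=1)  # shape (m, n)
    # ReLU keeps only regions with positive evidence for the class
    return np.maximum(cam, 0)
```

The resulting (m, n) map is typically upsampled to the input resolution and overlaid on the radiograph as a heat map.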

Figures of Merit
In evaluating the results of our experiment, we employ several key figures of merit to quantify the performance.
Accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined. It can be calculated using the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. It is calculated as follows:

Precision = TP / (TP + FP)

Recall (also known as sensitivity, hit rate, or true positive rate) is the fraction of the total amount of relevant instances that were actually retrieved. The equation for recall is:

Recall = TP / (TP + FN)

The F1 score is the harmonic mean of precision and recall. It tries to find the balance between precision and recall. The F1 score can be calculated as follows:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

In these formulas, TP is True Positives, TN is True Negatives, FP is False Positives, and FN is False Negatives.
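The four figures of merit follow directly from the confusion-matrix counts; a minimal sketch (with zero-division guards, an implementation choice not discussed in the text):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For example, with TP = TN = FP = FN = 1, all four metrics equal 0.5.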

Positive Augmentations
In Table 3, the model with the best performance appeared to be the "Baseline Rotated" model, obtaining an Accuracy of 0.655, Precision of 0.621, Recall of 0.645, and an F1-Score of 0.618. The high accuracy indicated that this model was successful in correctly predicting the classification most of the time, while the substantial F1-Score, which is a harmonic mean of precision and recall, indicated a balanced high performance in both these areas. This suggested that the model could retrieve a high proportion of relevant instances (high recall), while ensuring the proportion of instances it claimed to be relevant were indeed relevant (high precision).
In contrast, the model with the lowest performance in the evaluation was the "Horizontal Split" model. With an Accuracy of 0.560, Precision of 0.497, Recall of 0.563, and an F1-Score of 0.501, this model consistently fell behind the other models across all performance metrics, indicating lower overall performance. It can be observed that there was a clear downward trend in performance metrics from the "Baseline Rotated" model to the "Horizontal Split" model. Interestingly, models using the "Flip" modification, such as "Horizontal Split Flip" and "ROI Split Flip", tended to have a lower performance than their non-flip counterparts. The only exception to this was the "Horizontal Split 20% Flip" model, which slightly outperformed the "ROI Split" and "Horizontal Split" models, suggesting that the impact of the "Flip" modification could be influenced by broader image overlap (0.20 in that case).

The confusion matrices in Figure 5 revealed the performance of nine distinct models: Baseline, Baseline Rotated, Horizontal Split, Horizontal Split Flip, ROI, ROI Split, ROI Split Flip, Horizontal Split 20%, and Horizontal Split 20% Flip. The Baseline model performed well for KL0 and KL4 but encountered difficulties with the intermediate classes. This issue was partly alleviated in the Baseline Rotated model for KL0. The Horizontal Split and Flip models demonstrated variability in performance across classes, with the Flip version slightly improving the accuracy for KL2 and KL3. The Region of Interest (ROI) models exhibited improvements for the intermediate classes, especially KL2. The ROI Split and Flip models offered a more balanced performance across classes, particularly for KL0 and KL1. Lastly, the Horizontal Split 20% and its Flip variant showed high misclassification rates between KL0 and KL1, although the Flip version brought some improvement. However, KL4 was well classified across all models, suggesting distinct features that differentiated it from other classes. Models with the "Flip" modification seemed to have a more evenly distributed confusion matrix, indicating a more balanced prediction across different classes. However, this did not always result in overall higher performance, as seen in the "Horizontal Split Flip" model. The "Baseline Rotated" model seemed to perform well for the first and last class, but its performance decreased notably for the other classes. This behavior was shared across models, where models often performed better for the first and last class. The "ROI" and "ROI Split" models presented a similar pattern, with typical performance in the first and last classes. Reviewing the results from the different model iterations, it was quite unexpected to find that the ROI (Region of Interest) models did not manage to outperform the Baseline models. Given that the Baseline models leveraged the entire image and the ROI models focused on the
specific region expected to contain more relevant information for the task, it was anticipated that the ROI models would perform better. Figure 6 showcases ROC curves for the ROI model against the baseline.
Upon evaluation of AUCs in the one-vs-all scheme (Figure 6), it was observed that Classes 0, 2, and 3 had marginally higher AUC scores in the Baseline Model compared to the ROI Model. Specifically, the AUC for Class 0 was 0.87 in the Baseline Model versus 0.85 in the ROI Model, suggesting that the Baseline Model was slightly more successful in distinguishing between positive and negative instances for this class. Similarly, for Class 2, the Baseline Model had an AUC of 0.84 compared to 0.81 in the ROI Model, and for Class 3, the AUC was 0.96 in the Baseline Model versus 0.95 in the ROI Model. In contrast, for Class 1, the ROI Model outperformed the Baseline Model, albeit slightly, with an AUC of 0.69 against 0.68. For Class 4, both models performed impeccably, achieving a perfect AUC score of 1.00, demonstrating their ability to distinguish instances of this class perfectly.
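The one-vs-all AUC values discussed above can be computed without a plotting step via the rank (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. This pure-NumPy sketch is illustrative; the study's exact tooling is not specified here.

```python
import numpy as np

def auc_one_vs_all(scores: np.ndarray, labels: np.ndarray,
                   positive: int) -> float:
    """One-vs-all AUC: P(score of a random positive > score of a random
    negative), counting ties as half a win."""
    pos = scores[labels == positive]
    neg = scores[labels != positive]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 1.00, as seen for Class 4, means every positive-class score exceeds every negative-class score; 0.5 corresponds to chance-level ranking.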
In summary, while the performance of both models was similar for all classes, the Baseline Model showed a slight edge in Classes 0, 2, and 3. The ROI Model only performed marginally better in Class 1, and both models were equally successful in Class 4. Despite these differences in AUC values, it's noted that the curvature along the axis was similar between the two models. This suggested that the trade-off between sensitivity and specificity (true positive rate and false positive rate) was similar for both models across different decision thresholds. This similarity in shape indicated that both models had similar performance trade-offs, even if the absolute performance (as measured by AUC) varied slightly.

Negative (Adversarial) Augmentations
In Table 4, we find a comparison of several adversarial augmentation models based on metrics including Accuracy, Precision, Recall, and F1-Score. The Noise 05 model had the highest performance across all metrics. As noise increased (i.e., Noise 10 and Noise 20), a corresponding decrease was observed in all performance metrics. This suggested that lower levels of noise improved the model's ability to generalize. In comparison, higher noise levels degraded the performance, likely due to interference with essential radiograph features. The models using cube techniques also showed varying levels of performance. Cube 2, for instance, performed better than Cube 3 and Cube 6 in all aspects. This could imply a potential optimal size or representation for the cube that best captured critical information. The models with no Region of Interest (ROI) performed poorly compared to others. The Noise 50 model recorded the lowest performance. In Figure 7, the Cube 3 model performed well when identifying KL0; however, as the Kellgren-Lawrence (KL) grades increased, this performance gradually diminished, culminating in a notable difficulty when classifying KL4. This suggested that while the model could easily distinguish KL0 from other classes, the higher grades posed more of a challenge. In contrast,

the Cube 2 model showed a more even performance across all KL grades, with a gradual decrease in accuracy from KL0 to KL4. While the model also performed best on KL0 and worst on KL4, it demonstrated a more balanced confusion across different classes. The Cube 6 model continued the trend observed in Cube 3 and Cube 2, struggling with higher KL grades and decreasing performance from KL0 to KL4. Interestingly, it confused KL0 with KL1 and KL3 more than Cube 3 and Cube 2, which indicated its difficulty differentiating between these classes. Lastly, the No ROI model showed exceptional performance when classifying KL0, but it struggled to differentiate KL0 from KL2, and performed notably poorly on KL3. Similarly, the No ROI Split model showed a distinct performance pattern. It was particularly adept at identifying KL0 and KL4.
In the No ROI model, the high score for the KL0 class (0.84) indicated that the model effectively identified patterns associated with KL0, despite the absence of the primary region of interest. Similarly, the No ROI Split model achieved a surprisingly high score of 0.75 for the KL4 class. This suggested that these models were identifying other image features unrelated to the knee joint to make the classification decisions for these particular grades.
Figure 8 illustrates the Receiver Operating Characteristic (ROC) curves for both the "No ROI" and the "No ROI Split" configurations. Notably, the model's performance appeared virtually indistinguishable across these settings when employing a one-versus-all scheme. However, we must also pay attention to the markedly high Area Under Curve (AUC) values for KL0 (> 0.70) and KL4 (> 0.88). These elevated values, alongside the accompanying confusion matrices, underscored the prevalence of potential confounding regions within these images. Remarkably, these potential confounding regions enabled a level of classification precision that is both significant and surprising, particularly given the absence of a region of interest, such as the entire knee joint. We further investigated these outlier results in the next section.

Adversarial Outliers
In this section, we extend our results corresponding to the identified outlier classifications, specifically "No ROI" and "No ROI Split". To enhance the depth and clarity of our analysis, we juxtapose these outcomes with the baseline results, enabling a more thorough comparative evaluation.
In Figure 9, we noticed identical scores from the 'Baseline' and 'No ROI' models in the case of KL0, which suggested either that the absence of a large region of interest (ROI) does not affect KL0 classification or that true class region confounds are visible. Moreover, the 'No ROI Split' model demonstrated performance remarkably similar to the 'Baseline' model for KL4. Although it did not reach complete alignment with the 'Baseline,' its relative success hinted at similar causes as observed in KL0. Most interestingly, we observed a clear performance boost for KL1 in the 'No ROI Split' model. This class is historically the most significant challenge for classification in knee osteoarthritis. Remarkably, this score was the highest individual KL1 score across all examined models of this study.

To delve deeper into these paradoxical results, we applied the Grad-CAM technique to the top examples from each of the previously mentioned classes. As shown in Figure 10, in the first row (No ROI KL0), we observed activations focused on texture and potential outlines of the patella. On the other hand, the second row (Baseline KL0) displayed control images that distinctly highlighted the knee joint. However, it is essential to note the broad spread of activation extending across and above the knee joint. Observations in the third row (No ROI Split KL4) revealed unclear patterns, primarily centered around what seemed to be a focus on wear-related texture. Despite the ambiguity, the controls (row 4) highlighted the knee joint, albeit with significantly less spread than the KL0 control (second row). Examining the KL1 focus in the first image of the last set revealed what appeared to be a part of the patella outline. In contrast, the other two images of that set depicted a non-specific texture focus. Finally, all controls (KL0, KL4, KL1) presented a wide activation range that extended slightly over the knee joint. Overall, our observations suggested that the baseline models for KL0 tended to rely not only on the knee joint but also incorporated broader areas for their classifications. This starkly contrasted with the KL1 baseline models, which seemed to concentrate more on the knee joint.

Discussion
Our analysis has highlighted the marked effectiveness of certain positive data augmentations in improving model performance. Specifically, the 'Baseline Rotated' model showed the highest performance. Incorporating rotation into our baseline model could have increased its robustness to orientation variations in the images, contributing to its superior performance. Furthermore, the confusion matrix analysis demonstrated an excellent performance for the KL0 and KL4 grades, a result that may be associated with the distinct radiographic features of these classes. Conversely, the 'Horizontal Split' model, which divided the image into two parts along the horizontal axis, performed the worst across all the considered metrics. This could be because this approach might eliminate or distort crucial radiographic features, thereby reducing the model's ability to classify the images accurately. Notably, the results contradicted our initial expectation that the ROI models would outperform the baseline models, given the assumption that focusing on specific regions containing more relevant information would increase performance. The results, however, indicated that models which utilize the entire image might have a slight edge in performance over those that focus on a particular ROI, suggesting either that potentially important information outside the ROI would otherwise be missed or that confounds are integrated to inflate the performance.
While data augmentation techniques have been widely adopted in deep learning, studies specifically investigating their effects in the medical imaging domain remain sparse. Often, the choice of augmentation techniques relies heavily on informal recommendations or generic best practices that are not always tailored to medical images' unique challenges and characteristics 51 . This lack of systematic exploration can lead to suboptimal model performance or even introduce biases. Against this backdrop, the studies by Goceri 51 and Hussain et al. 52 stand out as notable exceptions that delve deep into the intricacies of various additive augmentation methods and their impact on model performance in medical imaging tasks.
The study by Goceri 51 , spanning different medical imaging domains such as lung CT, mammography, and brain MR images, observed distinct patterns in additive augmentation effectiveness. For lung CT images, translating and shearing produced the highest accuracy of 0.857, whereas a mere translation yielded the lowest at 0.610. In mammography images, the combination of translation, shearing, and clockwise rotation was most effective with an accuracy of 0.833, while adding 'salt-and-pepper' noise and shearing underperformed, achieving only 0.667. For brain MR images, the same combination of translation, shearing, and clockwise rotation outperformed other methods with an accuracy of 0.882, while adding 'salt-and-pepper' noise and shearing showed the lowest accuracy at 0.624. Another investigation by Hussain et al. 52 explored different mammography additive augmentation techniques, producing varied results. Notably, the Shear augmentation achieved the highest training accuracy of 0.891 and a validation accuracy of 0.879. Conversely, Noise augmentation was the least effective, with training and validation accuracies of 0.625 and 0.660, respectively. Augmentations such as Gaussian Filter, Rotate, and Scale also demonstrated high accuracy in training and validation phases. By comparing our results, those of Goceri, and the findings from Hussain et al., it becomes evident that while some augmentation methods consistently show effectiveness across studies, the efficacy can vary based on domain specificity and dataset nuances. Our results, especially those pertaining to the 'Baseline Rotated' model, suggest that certain augmentations, such as rotation, might have unique advantages in the context of KOA.
Negative augmentations, in the form of adversarial attacks, were explored in our study. It was observed that as the noise level increased, the models' performance deteriorated, suggesting that the introduction of excessive noise could disrupt the discernment of relevant features within the images. Interestingly, the models lacking a Region of Interest (ROI) performed poorly overall. However, these models did exhibit exceptionally high performance for specific KL grades, such as KL0 and KL4, which may indicate the presence of confounding variables that the model is leveraging to make its predictions. The results for the "No ROI" and "No ROI Split" models were particularly intriguing. Despite the absence of a region of interest, the high performance scores achieved by these models for specific KL grades suggest that these models might be identifying other image features unrelated directly to knee joint osteoarthritis to make the classification decisions. In the case of KL0, the identical scores from the 'Baseline' and 'No ROI' models suggested either that the absence of a region of interest (ROI) may not affect KL0 classification or that true class confounds are visible. For KL4, the 'No ROI Split' model demonstrated performance remarkably similar to the 'Baseline' model, hinting at similar influences. The most notable result, however, was the clear performance boost for KL1 in the 'No ROI Split' model. This class is historically challenging to classify in knee osteoarthritis studies, which makes this finding of particular interest.
Our Grad-CAM visualization analysis revealed insights into the potential confounding regions that might affect our models' decision-making processes. Notably, in the absence of a designated region of interest (ROI) for KL0, the models show a tendency towards the texture and contours of the patella. Interestingly, this pattern shifts with the baseline KL0 models, where the joint and its eminence are distinctly highlighted. However, the spread of activation that extends broadly across and above the knee joint suggests the model might be considering features beyond the knee joint for classification. In the No ROI Split KL4 case, the model appeared to be using general wear-and-tear texture indications for its classifications, which may not be directly related to disease progression but rather to the participant's age. Finally, in the KL1 category, the models oscillate between specific and non-specific textures, further underscoring potential confounding regions.
Our best-performing set involved rotation, which may be partly attributed to the rotated orientation of the knee joint along the vertical axis of the radiograph. In this configuration, the convolution operation repeatedly encounters relevant features as it slides across the image, potentially leading to more condensed and effective feature maps. This is in contrast to a non-rotated radiograph, where the knee joint occupies only a single vertical segment of the image. In this latter case, the convolution would likely traverse the entire joint just once or twice, depending on the receptive field, making feature extraction potentially slightly less effective. Our findings regarding pixel noise underscore an essential consideration in raw radiograph data, which may contain parts of suboptimal quality with substantial pixel noise. These observations are in line with similar findings reported in other studies. The sensitivity to noise is especially evident for the early stages of the condition, as even a minimal noise level of 5% led to a decline in performance. This raises the question of whether including low-quality radiographs might be more detrimental than beneficial. This observation suggests a potential direction for future research to establish a guideline for acceptable noise levels. An alternative approach could be incorporating blur effects in the augmentation process to counteract noise-related challenges. However, the efficacy of such methods is beyond the scope of this study and warrants further investigation.

Conclusion
In this study, we evaluated the effectiveness of various data augmentation techniques to enhance model performance in knee-joint osteoarthritis classification. These findings have potential implications for future work in this area, particularly for improving the robustness and accuracy of deep-learning models in medical image analysis. However, our results also highlight the need to carefully consider potential confounding regions to ensure that the models primarily base their predictions on relevant features. To facilitate further analyses, we provide open access to all data, trained models, and an extensive set of the top 20 Grad-CAM images, ranked by prediction confidence. This information is available in our data availability section.

Figure 1 .
Figure 1. The methodological pipeline for the study. Green represents positive/supportive augmentations, red signifies adversarial augmentations, blue indicates data processing, and purple signifies CNN training. Numeric markers indicate the order of operations.

Figure 2 .
Figure 2. Sample images showing various KL grades, ranging from 0 (no OA signs) to 4 (severe OA). From left to right, OA severity increases. Joint space narrowing is denoted as JSN.
The following table visualizes the base augmentations made.

Augmentation Method | Description | Training Configuration
Image Rotation | Implements a counter-clockwise rotation on images | Allows a rotation of up to 40 degrees
Width Shifting | Moves the image laterally along the x-axis | Allows a shift of up to 45 pixels along the x-axis
Height Shifting | Moves the image vertically along the y-axis | Allows a shift of up to 45 pixels along the y-axis
Shearing | Distorts the image along the width or height axis | Implements a maximum shear angle of 0.2 degrees
Zooming | Modifies the image scale, zooming in or out within the frame | Enables a maximum of 20% zoom
Horizontal Flipping | Creates a mirror image along the vertical axis | Implements flipping only along the horizontal axis
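Two of the simpler augmentations listed above can be sketched directly in NumPy. Zero-padding the columns vacated by a shift is our illustrative choice here, not necessarily the fill mode used in the study, and the function names are our own.

```python
import numpy as np

def horizontal_flip(image):
    # Create a mirror image along the vertical axis.
    return image[:, ::-1]

def width_shift(image, shift_px):
    # Shift the image along the x-axis by shift_px pixels (positive = right),
    # filling the vacated columns with zeros (background).
    shifted = np.zeros_like(image)
    if shift_px >= 0:
        shifted[:, shift_px:] = image[:, :image.shape[1] - shift_px]
    else:
        shifted[:, :image.shape[1] + shift_px] = image[:, -shift_px:]
    return shifted

# Demo on a tiny array standing in for a radiograph patch.
img = np.arange(12).reshape(3, 4)
flipped = horizontal_flip(img)
shifted = width_shift(img, 1)
```

During training, the shift amount would be drawn at random up to the configured 45-pixel maximum, mirroring the online augmentation setup described in Table 1.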

Figure 3 .
Figure 3. Visualization of the study's base data augmentations: red indicates negative/adversarial augmentations and green shows positive/supportive augmentations. Each transformation is demonstrated on the provided baseline image.

Figure 5 .
Figure 5. Confusion matrices for the test set using positive/supportive base data augmentations.

Figure 6 .
Figure 6. Using a one-vs-all scheme, ROC curves are shown for the Baseline condition (left) and the ROI condition (right).

Figure 7 .
Figure 7. Confusion matrices for the test set using negative/adversarial base data augmentations.

Figure 8 .
Figure 8. Using a one-vs-all scheme, ROC curves are shown for the No ROI condition (left) and the No ROI Split condition (right).

Figure 9 .
Figure 9. Confusion matrices for the test set using the No ROI, baseline, and No ROI Split base data augmentations.

Figure 10 .
Figure 10. Grad-CAM displays for outlier conditions with baseline comparisons. 'Confidence' denotes the output layer's value for the correct class.

Table 1 .
Online augmentation approaches, occurring randomly during training

Table 3 .
Performance metrics of the model on the test set using positive/supportive base augmentations.

Table 4 .
Performance metrics of the model on the test set using negative/adversarial base augmentations.