Ensemble Learning for Breast Cancer Lesion Classification: A Pilot Validation Using Correlated Spectroscopic Imaging and Diffusion-Weighted Imaging

The main objective of this work was to evaluate the application of individual and ensemble machine learning models to classify malignant and benign breast masses using features from two-dimensional (2D) correlated spectroscopy spectra extracted from five-dimensional echo-planar correlated spectroscopic imaging (5D EP-COSI) and diffusion-weighted imaging (DWI). Twenty-four different metabolite and lipid ratios with respect to diagonal fat peaks (1.4 ppm, 5.4 ppm) from 2D spectra, and water and fat peaks (4.7 ppm, 1.4 ppm) from one-dimensional non-water-suppressed (NWS) spectra were used as the features. Additionally, water fraction, fat fraction and water-to-fat ratios from NWS spectra and apparent diffusion coefficients (ADC) from DWI were included. The nine most important features were identified using recursive feature elimination, sequential forward selection and correlation analysis. XGBoost (AUC: 93.0%, Accuracy: 85.7%, F1-score: 88.9%, Precision: 88.2%, Sensitivity: 90.4%, Specificity: 84.6%) and GradientBoost (AUC: 94.3%, Accuracy: 89.3%, F1-score: 90.7%, Precision: 87.9%, Sensitivity: 94.2%, Specificity: 83.4%) were the best-performing models. Conventional biomarkers like choline, myo-Inositol, and glycine were statistically significant predictors. Key features contributing to the classification were ADC, 2D diagonal peaks at 0.9 ppm, 2.1 ppm, 3.5 ppm, and 5.4 ppm, cross peaks between 1.4 and 0.9 ppm, 4.3 and 4.1 ppm, 2.3 and 1.6 ppm, and the triglyceryl–fat cross peak. The results highlight the contribution of the 2D spectral peaks to the model, and they demonstrate the potential of 5D EP-COSI for early breast cancer detection.


Introduction
Breast cancer is one of the most prevalent cancers in females and one of the leading causes of cancer death worldwide [1,2]. Early detection and accurate characterization of breast malignancies are crucial factors in breast cancer management and positive treatment outcomes [3][4][5][6][7][8][9][10][11][12]. Differentiation of benign from malignant breast lesions can aid clinicians in determining appropriate therapeutic plans. While histopathological examination of breast tissue extracted by biopsy is often required to confirm a suspicious lesion, mammography continues to be the gold standard for the detection of breast cancer, although it has a high false positive rate [13]. Multi-parametric MRI (mp-MRI), which includes dynamic contrast-enhanced MRI (DCE-MRI), T2-weighted MRI and diffusion-weighted imaging (DWI), may allow differentiation between benign and malignant breast lesions that present highly overlapping enhancement patterns. However, despite its potential to eliminate unnecessary biopsies and follow-up examinations of benign tumors, mp-MRI-based breast tumor differentiation still suffers from an increased rate of false positive findings.
Cell density, organization, membrane integrity and cellular metabolism of breast tissues undergo changes in the presence of cancer. Magnetic resonance spectroscopic imaging (MRSI) is capable of detecting the changes in concentrations of various metabolites and lipids in the tissue that are altered due to cancer-related changes in cellular metabolism [14][15][16][17][18][19][20][21][22][23]. High cell density and altered tissue structure due to cancer also lead to restricted motion of water molecules in the tissue, which can be measured by the apparent diffusion coefficient (ADC) on DWI [12,24-31]. DCE-MRI, one of the most sensitive diagnostic techniques, highlights the areas of increased blood flow and blood volume in the breast tissues due to cancer with the help of a contrast agent [5-7,12,32-35].
Even though the sensitivity of mp-MRI methods can be affected by various factors like tumor size and aggressiveness, these methods are often reported to have relatively high sensitivity (in the range of 88-100% for DCE-MRI, 85-95% for DWI and 80% for MRSI) [9,12,36-39]. Reported specificity, on the other hand, is relatively low (69-74% for DCE-MRI, 75-82% for DWI and 74% for MRSI), restricting the capability for classification of benign and malignant lesions [37][38][39][40]. While single-voxel spectroscopy has a reported 64-82% sensitivity and 85-91% specificity [41], the multi-voxel technique of MRSI can cover a larger area of the breast with a relatively higher spatial resolution. Advanced MRSI techniques like five-dimensional (5D) echo-planar correlated spectroscopic imaging (EP-COSI) can record two-dimensional (2D) correlated spectroscopy (COSY) from multiple regions in three-dimensional (3D) space [42]. Achieving high specificity is also challenging in MRSI due to overlapping patterns of the measures between benign and malignant lesions.
One option to potentially improve the specificity while retaining the benefits of the non-invasive nature of these imaging modalities is to use machine learning (ML) models to identify subtle or complex differences in the multi-modal data that differentiate benign and malignant lesions [43,44]. Development and validation of machine learning models have seen impressive growth in the last decade due to their high accuracy and flexibility in handling a wide range of data types and features [45]. While individual machine learning models may perform well, ensemble learning, a meta-approach that combines individual models, could generate even more generalizable models by reducing the variance or bias of the individual base learners [46]. In particular, advanced ensemble models like gradient-boosted tree-based algorithms, which combine multiple weak learners (decision trees), have been shown to be capable of detecting key features of multi-modal, multi-parametric imaging information for applications such as tissue/cancer grade classification [47][48][49][50].
Multiple studies have recently shown that the features extracted from DCE-MRI and DWI of breast tissues used in ML models are capable of predicting tumor grades and classifying benign and malignant breast lesions [48][49][50]. However, metabolite and lipid information from MRSI data has not been used in this context so far. Therefore, a major goal of this work was to evaluate the application of different machine learning models, including ensemble learning techniques, for the classification of benign and malignant breast lesions based on the 5D EP-COSI data along with the corresponding ADC information from DWI data.

Subjects and Data Acquisition
The dataset consisted of 5D EP-COSI and DWI data from twenty-three subjects with malignant breast masses (mean age 53 [range:  ] years) and seventeen with benign breast masses (mean age 37 [range: 19-60] years). All scans were acquired on a Siemens 3T Skyra scanner (Siemens Healthineers, Erlangen, Germany). Consent was obtained from all volunteers included in the study according to the on-site institutional review board guidelines. The 5D EP-COSI data were acquired using FOV = 160 × 160 × 120 mm³, matrix size = 16 × 16 × 8, TR/TE = 1500/35 ms, 64 t1 points and 512 t2 points, with spectral widths of 1250 Hz and 1190 Hz along F1 and F2, respectively. A non-water-suppressed (NWS) 1D MRSI scan with one t1 point was acquired for eddy current phase correction and for combining signals from multiple receiver coils [51]. The data were non-uniformly undersampled (NUS) along the two spatial (ky-kz) dimensions and the spectral t1 dimension with a total acceleration factor of 8, and were reconstructed using a Group Sparsity (GS)-based compressed sensing technique [52,53].
The DWI acquisition protocol used a two-dimensional spin-echo echo-planar imaging (EPI) sequence (TR/TE of 3800/93 ms; data matrix, 192 × 192; signal averages, 3; slice thickness, 3 mm; and distance factor, 20%) in the axial plane. Diffusion sensitizing gradients (DSG) were applied in three orthogonal directions with b values of 50 and 800 s/mm². The ADC maps were created automatically by the in-line scanner software using the trace-weighted images with b values of 50 and 800 s/mm².
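For orientation, a two-point ADC estimate can be sketched as below. The text does not describe the algorithm used by the in-line scanner software, so this assumes the standard mono-exponential signal model S(b) = S0·exp(−b·ADC). The function name and the synthetic signal values are illustrative only.

```python
import math

def adc_two_point(s_low, s_high, b_low=50.0, b_high=800.0):
    """Mono-exponential ADC (mm^2/s) from trace-weighted signals at two b-values (s/mm^2).

    Assumes S(b) = S0 * exp(-b * ADC); this is a sketch, not the scanner's implementation.
    """
    if s_low <= 0 or s_high <= 0:
        raise ValueError("signals must be positive")
    return math.log(s_low / s_high) / (b_high - b_low)

# Synthetic check: simulate signals for a known (made-up) ADC and recover it.
true_adc = 1.2e-3   # mm^2/s, illustrative value only
s0 = 1000.0         # arbitrary signal at b = 0
s50 = s0 * math.exp(-50.0 * true_adc)
s800 = s0 * math.exp(-800.0 * true_adc)
estimated = adc_two_point(s50, s800)
print(f"estimated ADC = {estimated:.6f} mm^2/s")
```

Using b = 50 s/mm² rather than b = 0 as the lower value is commonly done to reduce the perfusion contribution to the estimate.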

Pre-Processing
Tumor-containing slices in the DWI were selected and the boundaries of the lesion were marked by a radiologist. ADC values were then extracted from this delineated region of interest (ROI). The MRSI data were interpolated by a factor of 2 and the slices containing the tumor were identified similarly to DWI. Spectroscopic voxels within the delineated region were extracted and the metabolite and lipid ratios were quantified in these voxels as described in [42]. All variables were standardized with z-score normalization (zero mean and unit standard deviation) and voxels containing outlier measurements were removed. For the variables that followed a normal distribution, outliers were identified as three standard deviations away from the mean. For other variables, previously reported ranges of metabolite and lipid ratios were used as a guideline for outliers [42].
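The standardization and three-standard-deviation outlier rule for normally distributed variables can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation is used. The helper names and the ratio values are hypothetical, and the literature-range rule applied to the non-normal variables is not covered.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize to zero mean and unit (sample) standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def keep_within_three_sd(values):
    """Return indices of voxels whose value lies within 3 SD of the mean."""
    return [i for i, z in enumerate(zscores(values)) if abs(z) <= 3.0]

# Hypothetical ratio values for 20 voxels; the last one is an obvious outlier.
ratios = [0.90 + 0.01 * i for i in range(19)] + [100.0]
kept = keep_within_three_sd(ratios)
print(f"kept {len(kept)} of {len(ratios)} voxels")
```

A single extreme value inflates the standard deviation itself, so robust alternatives such as median-based rules are sometimes preferred for small samples.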

Feature Extraction
Ten to twelve voxels from multiple slices were selected for each MRSI dataset, resulting in 241 malignant voxels and 195 benign voxels after removing outliers. The Apparent Diffusion Coefficient value was calculated for each dataset and assigned to voxels under the respective dataset.
A total of 99 features were available for the study. These were derived from the DWI and MRSI data as follows:
DWI: 1 feature (ADC).
2D MRSI: 95 features, which are the ratios of 24 metabolite and lipid peaks with respect to 4 different reference peaks. The reference peaks are methylene fat, olefinic fat and water at 1.4 ppm, 5.4 ppm and 4.7 ppm from the 1D spectrum, and the methylene fat diagonal peak at 1.4 ppm from the 2D spectrum. These amount to 96 ratios (24 × 4), of which the ratio of the 2D methylene fat diagonal peak (FAT14) with itself is excluded, resulting in 95 features.
1D NWS MRSI: 3 features (water fraction, fat fraction and water-to-fat ratio).
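The feature count can be checked with a short enumeration. Apart from FAT14 and ADC, the peak and reference names below are placeholders, not the labels used in Table 1. The three NWS-derived features are taken from the abstract (water fraction, fat fraction and water-to-fat ratio).

```python
from itertools import product

# 24 peaks from the 2D spectrum; only FAT14 is a real label, the rest are placeholders.
peaks_2d = [f"PEAK{i:02d}" for i in range(1, 24)] + ["FAT14"]

# Four reference peaks: three from the 1D spectrum (placeholder names) and the
# 2D methylene fat diagonal peak FAT14.
references = ["FAT14_1D", "UFD54_1D", "WAT47_1D", "FAT14"]

# All peak/reference ratios except FAT14 with respect to itself.
ratios = [f"{p}/{r}" for p, r in product(peaks_2d, references)
          if not (p == "FAT14" and r == "FAT14")]

nws_features = ["water_fraction", "fat_fraction", "water_to_fat_ratio"]
all_features = ["ADC"] + ratios + nws_features

print(len(ratios), len(all_features))
```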
These features were then narrowed down using statistical tests and feature selection algorithms. The full list of metabolites and lipids, including choline (Cho), myo-Inositol + glycine (mI + Gly), unsaturated fatty acid and triglyceryl fat cross-peaks identified in the 2D correlated spectroscopy (COSY) and 1D NWS spectra, is shown in Table 1. A representative 2D COSY spectrum with labeled metabolite and lipid diagonal and cross peaks, along with the corresponding ADC map, is shown in Figure 1.
Figure 1. [...] An extracted COSY spectrum and 1D NWS spectrum are shown on the right side. The bottom-left panel shows the corresponding ADC map for the same subject, with the lesion region marked in green. These metabolite and lipid ratios and ADC values were entered into the feature pool, which was then narrowed down using statistical tests, recursive feature elimination and sequential forward selection.
[...] and 5 malignant). This ensured that the samples in the training and testing sets were independent, which in turn avoided overestimation of model performance due to data leakage, so that the model would be generalizable to new data.
One of the main considerations for the feature selection method was the handling of high-dimensional data with a relatively limited sample size. A statistical significance test was used to narrow down the variable space before running the feature selection algorithms, so that only the significant features capable of distinguishing the benign and malignant classes were retained. Normality and homogeneity of variance of the features were checked using quantile-quantile (Q-Q) plots of the data and Levene's test. Based on these checks, either a t-test or a Mann-Whitney U (MWU) test was used, with a significance level of p-value < 0.01. For the next level of analysis, we considered machine learning model-based feature selection algorithms, namely sequential forward selection (SFS) and recursive feature elimination (RFE) [54]. SFS and RFE with cross-validation were selected based on the model performance when all the significant features identified in the statistical test were considered. However, both RFE and SFS have the drawback that they do not exclude redundant features. This was addressed using a correlation analysis that removed moderately to strongly correlated features based on a Spearman's rank correlation coefficient threshold of ±0.6. A correlation p-value < 0.05 was used to check the statistical significance of the observed correlation between features. Redundant features with a statistically significant high correlation were removed from the feature list.
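The correlation-based pruning step can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation. Features are visited in a given importance order and a feature is dropped when its absolute Spearman coefficient with an already kept feature reaches 0.6. The significance check on the correlation (p-value < 0.05) is omitted, and the feature names other than ADC and all values are made up.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def prune_correlated(features, ranked_names, threshold=0.6):
    """Keep features in ranked order, skipping any with |rho| >= threshold to a kept one."""
    kept = []
    for name in ranked_names:
        if all(abs(spearman(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up values: feat_a rises monotonically with ADC (rho = 1), feat_b does not.
features = {
    "ADC": [1, 2, 3, 4, 5, 6],
    "feat_a": [2, 4, 6, 8, 10, 13],
    "feat_b": [3, 1, 6, 2, 5, 4],
}
selected = prune_correlated(features, ["ADC", "feat_a", "feat_b"])
print(selected)
```

Because the outcome depends on the visiting order, the ranking supplied to such a routine determines which member of a correlated pair survives.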

Machine Learning Algorithms
The open-source machine learning library for Python, 'scikit-learn', was used for implementing different supervised learning algorithms for classification [55]. These included support vector machine (SVM), Decision Tree, Logistic Regression, Naive Bayes and K-nearest neighbors (KNN), as well as ensemble learning techniques including Adaptive Boosting (AdaBoost), GradientBoost, Extreme Gradient Boost (XGBoost), Light Gradient Boost, Categorical Boost (CatBoost), RandomForest and Decision Tree-based bagging classifiers [56][57][58]. In bagging, the training data was divided into different subsets by random sampling with replacement, multiple models were trained on these subsets, and their predictions were combined by averaging. Boosting, on the other hand, used multiple base learners like decision trees in a sequential manner, where each successive learner corrected for the prediction error of the previous one.

Cross-Validation and Parameter Tuning
A stratified grouped K-fold cross-validation method was used in both feature selection and hyperparameter tuning to return stratified folds with non-overlapping groups that are representative of the class distribution of the dataset. During the 5-fold cross-validation stage, the entire dataset was divided into five non-overlapping folds, grouped by subject dataset, using the StratifiedGroupKFold method. In each iteration, one of the five folds (20% of the data) was held out as the testing set, and the remaining folds served as the training set. The cross-validated score was then the average accuracy score across the five folds. The training set was z-score standardized, and the testing set was standardized using the training set's statistics. The models were optimized using a cross-validated grid search: the possible values of the hyperparameters of each ML model were first defined, and the combination of these values that optimized the classification accuracy was then found by exhaustive search.

Evaluation Metrics
The classification performance of the different machine learning models in the testing stage was compared based on the scores of (a) accuracy (ratio of correct predictions to total number of predictions), (b) area under the receiver operating characteristic (ROC) curve (AUC), (c) precision (True positives/(True positives + False positives)), (d) sensitivity (True positives/(True positives + False negatives)), (e) specificity (True negatives/(True negatives + False positives)) and (f) F1 score (2 × ((precision × sensitivity)/(precision + sensitivity))).
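As a worked illustration of these definitions, the count-based metrics can be computed directly from confusion-matrix entries. The counts below are made up for the example and are not results from this study. AUC is omitted because it requires the predicted scores rather than the counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics as defined in the text, with malignant taken as the positive class."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * (precision * sensitivity) / (precision + sensitivity),
    }

# Made-up counts: 50 malignant and 50 benign test voxels.
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the sketch gives an accuracy of 0.850, precision of 0.818, sensitivity of 0.900, specificity of 0.800 and F1 score of 0.857.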

Statistical Analysis
Statistical tests were performed to compare the performance of the machine learning models. A one-way analysis of variance (ANOVA) test, run in RStudio (version 4.1.1), was used for this comparison, based on the evaluation metrics, at a significance level of p-value < 0.05. Tukey's HSD (honestly significant difference) post hoc test was used for pair-wise analysis of these models.
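The analysis in the study was run in R. Purely as an illustration of the quantity being tested, the one-way ANOVA F statistic can be computed as below. The function name and accuracy values are made up, and the p-value and the Tukey HSD step are left out.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-model score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up accuracy scores from repeated runs of three hypothetical models.
scores = [
    [0.70, 0.72, 0.74, 0.71],
    [0.86, 0.88, 0.87, 0.85],
    [0.87, 0.89, 0.86, 0.88],
]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

In Python, `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd` provide the corresponding p-values and pair-wise comparisons.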

Feature Importance and Model Comparison
Average feature importance was determined by repeating the cross-validation 100 times using the best-performing ensemble models. Then, one-feature models were trained using each of the top features separately to compare the relative classification capability of the individual features. A linear combination of the one-feature models using linear SVM and logistic regression was also studied to show the relative advantage of more complex ML models, like the ensemble models. Five-fold cross-validation was repeated 20 times and the scores were averaged from these 100 repetitions.

Feature Selection and Comparison
Based on the results of the MWU test comparing the benign and malignant classes, the feature set was narrowed down to 86 features that were statistically significant at p-value ≤ 0.01. Nine of these 86 features were identified as the most important by RFE, SFS and correlation analysis. These included ADC and the ratios of CP8, FAT21, CP2, FMETD, mI + Gly, CP4, TGFRupper and UFD54 with respect to the diagonal FAT14 peak. The boxplots of these most significant features are shown in Figure 2a for both malignant and benign classes. The larger interquartile range (IQR) of ADC and CP8/FAT14 indicated a larger spread of these features. TGFRupper/FAT14 and UFD54/FAT14, on the other hand, showed the least variability for the malignant class, while CP4/FAT14 showed the least variability for the benign class. The values were z-score normalized. Figure 2b shows the correlation heatmap of these features. Since the feature selection process also included correlation analysis-based exclusion of redundant features, the heatmap showed a correlation coefficient less than 0.6 and greater than −0.6 between any pair of features.

Comparison of Models
The comparative performance of linear SVM, Decision Tree, DT-based bagging classifier, RandomForest, AdaBoost, GradientBoost, XGBoost and CatBoost is shown in Figures 3-5. These were the best-performing of all the models considered in terms of their accuracy scores. Figure 3 shows the AUC, F1 score, accuracy, precision, sensitivity and specificity of these eight classifiers in the testing stage, repeated 100 times with randomized dataset splits and model initializations, and Figure 4 shows these scores in the cross-validation stage, repeated 50 times using the entire dataset. The respective box plots show the median and IQR of these metrics, along with outliers. The corresponding means and standard deviations are listed in Table 2, and the ROC curves of the different models are shown in Figure 5.

While GradientBoost was the model with the highest AUC, accuracy, sensitivity and F1 scores, XGBoost had the maximum precision and specificity, as shown in Table 2. However, Tukey's HSD post hoc test following the ANOVA, with p-values adjusted for multiple comparisons, showed that the differences between the ensemble models XGBoost, GradientBoost, CatBoost, AdaBoost, Decision Tree-based bagging and RandomForest were not statistically significant in terms of accuracy, AUC, precision, sensitivity, specificity and F1 scores at a significance level of p-value ≤ 0.05. Significant differences were, however, observed between the ensemble models and base models like linear SVM and Decision Tree. The ROC curves in Figure 5 also show a better performance for the ensemble models as compared to the base models, linear SVM and Decision Tree, which is consistent with the cross-validation scores of the other performance metrics shown in Figure 4.

Feature Importance and Linear Combination Models
Average feature importance, measured by repeating the cross-validation 100 times using the ensemble models, is shown in Figure 6. Single-feature models were trained and a linear combination of these one-feature classifiers was performed using SVM and logistic regression. Bar charts in Figures 7 and 8 show the average cross-validation accuracy of logistic regression and linear SVM, repeated over 100 iterations. Horizontal axis shows different feature combinations used. Features 1 to 9 are ADC and ratios of CP8, FAT21, CP2, FMETD, mI + Gly, CP4, TGFRupper and UFD54 with respect to the diagonal FAT14 peak. Error bars represent standard deviation. The vertical axis indicates the average accuracy score. The average accuracy of these linear models was reduced when more than the top six features were used, indicating overfitting.

Discussion
This study showed the feasibility of using metabolite ratios from 5D EP-COSI and ADC values from the DWI data of breast cancer patients to train machine learning models for classifying benign and malignant lesions. While earlier studies have attempted lesion characterization using features extracted from the DWI and DCE-MRI data, these models did not use the quantitative measures of metabolite and lipid features which can be obtained with an MRSI examination [48][49][50]. Although variations in water and fat levels can become ambiguous in glandular regions, especially in benign and healthy tissues, various lipid and metabolite ratios are reported to have statistically significant differences between benign and malignant lesions [42]. Building on this fact, our study pursued a detailed analysis of lesion characterization using 5D EP-COSI features in a machine-learning framework.
The ensemble models were found to perform better than the individual models. This is expected since they combine the strengths of multiple individual models [45]. Ensemble models can use multiple base models to learn different aspects of the data and hence learn more complex relationships between the variables. They are also more robust to outliers and are expected to reduce overfitting, since they can compensate for the prediction errors of individual models. XGBoost, GradientBoost, RandomForest, AdaBoost and CatBoost were found to be the best-performing ensemble models in this study, with 92% to 95% AUC, 86% to 90% accuracy, 87% to 89% F1 scores, 84% to 89% precision, 89% to 95% sensitivity and 79% to 85% specificity. The highest sensitivity of 94.2% was achieved with GradientBoost and the highest specificity of 84.6% with XGBoost, which is higher than the reported performance metrics of DWI or MRSI without the application of machine learning techniques [37][38][39][40].
While the feature importance scores varied slightly among the top-performing models, ADC was ranked first on average over 100 iterations using different ensemble models. Four of the top nine features were ratios of cross-peaks, which are specific to the 2D COSY technique. The remaining four main features were ratios of diagonal lipid peaks. It is interesting to note that the ratios of lipid cross peaks ranked higher than some of the conventional biomarkers, like the Cho and mI + Gly ratios, for classifying benign and malignant lesions in the ML framework. While both the Cho and mI + Gly ratios were in the list of statistically significant variables in the MWU tests, only the mI + Gly ratio was selected among the top nine features. This is mainly due to the high correlation between the two features. Therefore, Cho may also be used in place of mI + Gly, or a combination of the two could be used as a single feature to achieve a similar classification performance. The same argument applies to some of the lipid peaks as well; for example, different fat peaks in the range of 1 to 2 ppm could be highly correlated, especially with large linewidths and quantitation by peak integration.
ADC and CP8/FAT14 had the highest average feature importance scores. Linear SVM and logistic regression favored ADC more than the other features. This could be because linear classification models like linear SVM and logistic regression favor a more linear relationship between ADC and the target classes. The linear combination of the one-feature models showed a maximum average accuracy lower than that of the ensemble models, and the accuracy dropped as the number of features increased. Possible reasons include the complex relationship between the features and the reduced ratio of data points to features. Other possible factors, like redundancy and relevancy of the individual features, are less likely to be the cause, since the correlated features were not included and these features are known to be relevant to breast cancer-related changes in cellular metabolism and tissue structure. This indicates that a better classification would require advanced models capable of learning non-linear relationships, like the ensemble machine learning models studied in this work, and could also benefit from more data points.
The number of datasets is one of the limitations of this study. Even though we have multiple voxels from the same dataset giving metabolite and lipid ratios, it is important to split the data based on the actual number of subjects rather than the voxels. It would be tempting to consider the individual voxels as separate data when splitting the data into training and testing sets. However, this approach could lead to severe data leakage, since multiple voxels from the same subject can have similar statistics, especially when interpolation is used to increase the number of voxels. In addition, if the lesion spans multiple voxels in the spectroscopic data, the relatively low resolution and partial volume effects can potentially cause slightly overlapping information between neighboring voxels. Therefore, if the train-test split is performed based on the voxels rather than individual subjects, it is reasonable to assume that during the training stage, the model would already see some of the statistics present in the testing data. This would artificially increase the test and validation performance metrics, but the resulting model would not be generalizable to a new subject.
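A subject-level hold-out split of the kind argued for here can be sketched as follows. This is an illustration and not the authors' code. The subject IDs, the 20% test fraction and the ten voxels per subject are assumptions, and subjects are sampled per class so that both classes appear in the test set.

```python
import random
from collections import defaultdict

def subject_level_split(voxel_subjects, subject_labels, test_fraction=0.2, seed=0):
    """Split voxel indices so that every subject lands entirely in train or in test.

    voxel_subjects: subject ID of each voxel.
    subject_labels: dict mapping subject ID -> class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for subject, label in sorted(subject_labels.items()):
        by_class[label].append(subject)

    test_subjects = set()
    for subjects in by_class.values():
        rng.shuffle(subjects)
        n_test = max(1, round(test_fraction * len(subjects)))
        test_subjects.update(subjects[:n_test])

    train_idx = [i for i, s in enumerate(voxel_subjects) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(voxel_subjects) if s in test_subjects]
    return train_idx, test_idx

# Hypothetical cohort: 23 malignant and 17 benign subjects with 10 voxels each.
subject_labels = {f"M{i:02d}": "malignant" for i in range(23)}
subject_labels.update({f"B{i:02d}": "benign" for i in range(17)})
voxel_subjects = [s for s in subject_labels for _ in range(10)]

train_idx, test_idx = subject_level_split(voxel_subjects, subject_labels)
train_subjects = {voxel_subjects[i] for i in train_idx}
test_subjects = {voxel_subjects[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_subjects.isdisjoint(test_subjects))
```

Splitting the voxel indices at random instead would typically place voxels of the same subject on both sides, which is the leakage scenario described above.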
Even though these ML models should be generalizable to MRSI/DWI data from different scanners and sites, this was not tested here, which may be considered another limitation of this study: there could be subtle or complex variations in the datasets from different scanners and sites, so that the list of most important features could differ. A future study with a larger sample size, ideally from different scanners and sites, can further validate the results presented in this work.
Since the focus of this study was to analyze the performance of ML models with features from the 5D EP-COSI data, we have not considered some of the image-based features potentially available from DWI. For example, it has been recently shown that the features based on continuous-time random-walk (CTRW) and intravoxel incoherent motion (IVIM) models from DWI using multiple b-values can classify benign and malignant breast lesions using ensemble ML models [48]. More radiomics features from DWI as well as other modalities like DCE-MRI can be used in a future study to potentially further improve the model performance.

Conclusions
In this pilot validation of the multi-dimensional (5D EP-COSI) data for the characterization of breast tissues, we have shown that ML-based classification models can be trained using spectroscopic features in conjunction with ADC values from DWI to classify benign and malignant lesions. Multiple diagonal and cross-peaks from 2D COSY spectra were identified as important features, further asserting the advantage of 2D COSY spectra as compared to features derived from 1D spectra. GradientBoost, CatBoost, RandomForest, AdaBoost and XGBoost were the best-performing models with 92% to 95% AUC, 86% to