Over-the-Counter Breast Cancer Classification Using Machine Learning and Patient Registration Records

This study aims to determine the feasibility of using machine learning (ML) and patient registration records to develop an over-the-counter (OTC) screening model for breast cancer risk estimation. Data were retrospectively collected from women who presented to Hospital Universiti Sains Malaysia, Malaysia, with breast-related problems. Eight ML models were used: k-nearest neighbour (kNN), elastic-net logistic regression, multivariate adaptive regression splines, artificial neural network, partial least squares, random forest, support vector machine (SVM), and extreme gradient boosting. Features utilised for the development of the screening models were limited to information in the patient registration form. The final model was evaluated in terms of performance across mammographic density groups. Additionally, the feature importance of the final model was assessed using a model-agnostic approach. kNN had the highest Youden J index, precision, and PR-AUC, while SVM had the highest F2 score. The kNN model was selected as the final model. The model had a balanced performance in terms of sensitivity, specificity, and PR-AUC across the mammographic density groups. The most important feature was age at examination. In conclusion, this study showed that ML and patient registration information can feasibly be used to develop an OTC screening model for breast cancer.


Introduction
Breast cancer is the most common cancer among women in at least 140 countries [1]. The WHO aims to reduce global breast cancer mortality by 2.5% annually between 2020 and 2040, which is equivalent to averting 2.5 million breast cancer deaths worldwide [2]. Generally, breast cancer affects women above the age of 50, and the risk of the disease increases with age [3][4][5]. The risk factors for breast cancer are mainly divided into two groups [6]. The inherent risk factors include a family history of breast cancer, age, and gender, while the extrinsic risk factors include diet and lifestyle. The risk factors differ according to the individual and population. One of the important risk factors for breast cancer is mammographic density, which reflects the amount of dense and fatty tissue in the breast [7,8]. Women with denser breasts had four to six times higher chances of developing breast cancer.

Data Collection
Records in the BestARi were limited to the period between 1 January 2014 and 30 June 2021. Twenty-seven variables were collected in this study. Twenty-four features were collected from the BestARi: the date of examination; eight features related to sociodemographic and personal information: (1) age at examination, (2) race, (3) marital status, (4) number of children, (5) age at menarche, (6) weight, (7) height, and (8) handedness; six features regarding symptoms or patient complaints: (1) lump, (2) nipple discharge, (3) nipple retraction, (4) axillary mass, (5) pain, and (6) skin changes; and nine features regarding medical history: (1) history of breast surgery or implant, (2) history of breast trauma, (3) history of birth control or hormone replacement therapy, (4) history of previous mammography, (5) history of breast self-examination, (6) breastfeeding history, (7) history of total abdominal hysterectomy bilateral salpingo-oophorectomy (TAHBSO), (8) family history of breast cancer, and (9) menopausal status.
All features were used in the ML model development except for the date of examination, as this feature provided no information for the model development.
Another two variables collected from the Department of Radiology, HUSM, were the Breast Imaging-Reporting and Data System (BIRADS) classification and the BIRADS density (or mammographic density). Both variables were used to classify the cases into dense vs. non-dense groups and normal vs. suspicious groups. Finally, the last piece of information, collected from the Department of Pathology, HUSM, was the histopathological examination (HPE) result. The latter three variables were used to determine the outcome variable.
Data from the Department of Radiology and the Department of Pathology were combined with a patient's BestARi record if they were dated within a year after the BestARi record. If a patient had several records in the BestARi and a single record from the Department of Radiology or Department of Pathology, the latest medical record was taken. Afterwards, body mass index (BMI) was calculated from each patient's weight and height and added to the existing list of features. Each patient was classified into a normal or suspicious class. The normal class comprised patients with a BIRADS classification of 1 or a diagnosis of normal from the HPE result. The suspicious class comprised patients with a BIRADS classification of 2, 3, 4, 5, or 6 or a diagnosis of a benign or malignant subtype of breast cancer from the HPE result. Patients with a BIRADS classification of 0, or with a missing BIRADS classification or mammographic density, were excluded from the study. Additionally, non-dense breast women were those with a BIRADS density of A or B, while dense breast women were those with a BIRADS density of C or D. Table 1 presents the characteristics of the collected data.
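The labelling and exclusion rules above can be sketched in a few lines. The following is an illustrative Python sketch (the study itself used R), with hypothetical value encodings for the BIRADS fields and HPE results:

```python
def outcome_class(birads, hpe):
    """Return 'normal'/'suspicious' per the study's rules, or None if the
    case is excluded (BIRADS 0 or missing BIRADS classification)."""
    if birads is None or birads == 0:
        return None
    if birads == 1 or hpe == "normal":
        return "normal"
    if birads in {2, 3, 4, 5, 6} or hpe in {"benign", "malignant"}:
        return "suspicious"
    return None

def density_group(birads_density):
    """BIRADS density A/B -> non-dense; C/D -> dense; missing -> excluded."""
    if birads_density in {"A", "B"}:
        return "non-dense"
    if birads_density in {"C", "D"}:
        return "dense"
    return None

print(outcome_class(1, None))        # normal
print(outcome_class(4, None))        # suspicious
print(outcome_class(0, "benign"))    # None (excluded)
print(density_group("C"))            # dense
```

Note that the sketch resolves the overlap between the two class definitions by checking the normal rule first; the paper does not state how conflicting BIRADS and HPE findings were resolved.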

Pre-Processing Steps
Initially, all 24 features, including the additional BMI variable, were included in the model development. Next, missing values in the data were imputed using a bagged tree model. Subsequently, numerical variables with absolute correlations above 0.8 with other numerical variables were removed. Then, the training dataset was balanced using the random over-sampling examples (ROSE) algorithm [36]. All numerical features were normalised and transformed using a Yeo-Johnson transformation [37]. Dummy coding variables were created for all categorical features for all ML models except the random forest model, which had been shown to perform at least as well, if not better, when categorical features were used as factor variables rather than as dummy variables [38]. The ROSE algorithm was implemented using the themis package version 1.0.0 [39]. The remaining pre-processing steps were implemented using the recipes package version 1.0.1 [40].
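As a rough illustration of two of these steps, the correlation filter and class balancing can be sketched in Python. The study used the recipes and themis R packages; the naive over-sampler below is a simplification of ROSE, which additionally perturbs the resampled cases with smoothed noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_correlated(X, threshold=0.8):
    """Keep the first column of each highly correlated group; drop any
    column whose absolute correlation with a kept column exceeds 0.8."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], kept

def random_oversample(X, y):
    """Naive random over-sampling to balance the classes (ROSE instead
    draws new cases from a smoothed density around the originals)."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# toy data: column 2 duplicates column 0; class 1 is the minority
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0]
y = np.array([0] * 80 + [1] * 20)

Xr, kept = drop_correlated(X)
Xb, yb = random_oversample(Xr, y)
print(kept)             # [0, 1] -- the duplicated column is dropped
print(np.bincount(yb))  # [80 80] -- classes balanced
```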

Machine Learning Models
Eight OTC screening models were developed from ML methods: k-nearest neighbour (kNN), elastic-net logistic regression, multivariate adaptive regression splines (MARS), artificial neural network (ANN), partial least squares (PLS), random forest, support vector machine (SVM), and extreme gradient boosting (XGBoost). SVM was implemented using a radial basis function kernel, which uses a nonlinear class boundary to maximise the margin width between the classes. All ML algorithms were implemented using the parsnip package version 1.0.1 [41], with the kknn package version 1.3.1 [42] as a backend for kNN, the glmnet package version 4.1-4 [43] for elastic-net logistic regression, the earth package version 5.3.1 [44] for MARS, the nnet package version 7.3-17 [45] for ANN, the mixOmics package version 6.16.3 [46] for PLS, the ranger package version 0.14.1 [47] for random forest, the kernlab package version 0.9-31 [48] as a backend for SVM, and the xgboost package version 1.6.0.1 [49] for XGBoost. R version 4.1.3 was used to develop all the screening models [50].
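For intuition about the model that was eventually selected, a minimal from-scratch kNN classifier can be sketched as follows. This is an illustrative Python toy with made-up points, not the kknn backend used in the study:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Minimal k-nearest-neighbour classifier: majority vote among the
    k training points closest (Euclidean distance) to the query x."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# toy training set: two clusters, one per class
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # 1
```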

Model Comparison and Hyperparameter Tuning
The data were split into 80% development dataset and 20% validation dataset. The development dataset was further split into nested cross-validation groups for model comparison and hyperparameter tuning. The outer folds were split into 10-fold cross-validation groups of 80% training and 20% testing datasets. Each training dataset of each fold was further split into 25 bootstrap samples (inner folds). The validation dataset was further split into a dense breast dataset and a non-dense breast dataset. Thus, there were three validation datasets available: (1) the whole validation dataset, (2) the dense breast validation dataset, and (3) the non-dense breast validation dataset.
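The nested resampling scheme can be sketched as follows. This is an illustrative Python sketch with an assumed dataset size of 1000 cases, not the study's R code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                                   # assumed number of cases
idx = rng.permutation(n)
dev, val = idx[:800], idx[800:]            # 80% development / 20% validation

# outer folds: 10-fold cross-validation over the development data
outer_folds = np.array_split(rng.permutation(dev), 10)
inner_counts = []
for test_idx in outer_folds:
    train_idx = np.setdiff1d(dev, test_idx)
    # inner folds: 25 bootstrap resamples of the fold's training data,
    # used for model comparison and hyperparameter tuning
    boots = [rng.choice(train_idx, size=train_idx.size, replace=True)
             for _ in range(25)]
    inner_counts.append(len(boots))

print(len(dev), len(val))                  # 800 200
print([len(f) for f in outer_folds])       # ten folds of 80 cases each
```

The validation indices (`val`) would then be further split by mammographic density, giving the three validation datasets described above.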
A random search with a Latin hypercube grid design of 500 combinations of hyperparameters was used for model comparison and hyperparameter tuning. Firstly, all the performance metrics from the results of the bootstrapped samples were summarised by the mean and standard deviation to obtain the descriptive result for each model. The performance metrics of each model were compared using a one-way ANOVA and subsequently pairwise independent t-test if the former test was significant. A p-value below 0.05 was considered significant. Additionally, the p-values for the post hoc pairwise independent t-test were adjusted using Bonferroni corrections. Once the best model was identified, the hyperparameters were chosen based on the highest performance metrics from the bootstrapped sample. Figure 1 elucidates the flow of the analysis for this study. Finally, the best model was re-fit using the chosen hyperparameters on the whole development dataset to obtain the final model.
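Latin hypercube designs stratify each hyperparameter's range so that the candidate settings cover it evenly rather than clustering at random. A minimal Python sketch is shown below; the two hyperparameters are hypothetical, and the study used R's tuning infrastructure rather than this code:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Draw n points in [0,1]^d where each dimension is stratified into
    n equal bins with exactly one sample per bin."""
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n  # one point per bin
    for j in range(d):
        rng.shuffle(u[:, j])                              # decouple dimensions
    return u

rng = np.random.default_rng(42)
# e.g. 500 candidate settings over two hypothetical kNN hyperparameters
grid = latin_hypercube(500, 2, rng)
neighbours = 1 + np.floor(grid[:, 0] * 30).astype(int)    # k in [1, 30]
weight_power = grid[:, 1] * 2                             # distance-weight exponent

print(grid.shape)                                  # (500, 2)
# each column hits every 1/500-wide bin exactly once:
print(np.unique(np.floor(grid[:, 0] * 500)).size)  # 500
```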

Performance Metrics
Four performance metrics used for model comparison were precision, precision-recall area under the curve (PR-AUC), F2 score, and Youden J index. Once the final model was identified, the four hyperparameter tuning results with the highest mean of the aforementioned performance metrics were determined. The best hyperparameter result was selected from these four tuning results based on the highest sensitivity value. The performance metrics were defined as follows: a true positive (TP) case was a suspicious case predicted as suspicious by the model, while a true negative (TN) case was a normal case predicted as normal by the model. A false negative (FN) case was a suspicious case predicted as normal by the model, while a false positive (FP) case was a normal case predicted as suspicious by the model.

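From these definitions, the threshold-based comparison metrics can be computed directly from the confusion-matrix counts. The PR-AUC is omitted here because it requires ranked prediction scores rather than a single confusion matrix. An illustrative Python sketch with made-up counts:

```python
def screening_metrics(tp, tn, fp, fn):
    """Comparison metrics from the confusion-matrix counts defined above."""
    sensitivity = tp / (tp + fn)               # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    beta2 = 2 ** 2                             # F2 weights recall over precision
    f2 = (1 + beta2) * precision * sensitivity / (beta2 * precision + sensitivity)
    youden_j = sensitivity + specificity - 1
    return sensitivity, specificity, precision, f2, youden_j

# made-up counts for illustration
sens, spec, prec, f2, j = screening_metrics(tp=80, tn=70, fp=30, fn=20)
print(round(sens, 2), round(spec, 2), round(prec, 2), round(f2, 2), round(j, 2))
# 0.8 0.7 0.73 0.78 0.5
```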

Explainable Approach
A model-agnostic approach was used to estimate the variable importance for the final ML model. The variable importance was estimated as the mean change in the value of the loss function after variable permutations. The number of permutations was set to 50. The loss function was defined as 1 − PR-AUC, where the PR-AUC reflects the performance of the ML model. Thus, if a feature was important, the performance of the ML model would worsen after permuting the feature, which in turn would result in a high value of the loss function. Hence, the most important feature was the one with the highest value of 1 − PR-AUC. Only the top fifteen important variables were displayed in the variable importance plot. The explainable approach was applied using the DALEX and DALEXtra packages, versions 2.4.2 and 2.2.1, respectively [51,52].
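The permutation scheme can be sketched as follows. This is an illustrative Python sketch, not the DALEX R implementation used in the study; the PR-AUC is approximated by average precision, and the "model" is a toy scoring function built for demonstration:

```python
import numpy as np

def pr_auc(y_true, scores):
    """Average precision: a step-wise approximation of the area under
    the precision-recall curve."""
    order = np.argsort(-scores)
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    return float(np.sum(precision * y) / y.sum())

def permutation_importance(model, X, y, n_repeats=50, rng=None):
    """Mean increase in the loss (1 - PR-AUC) after permuting each feature."""
    rng = rng if rng is not None else np.random.default_rng(0)
    base_loss = 1 - pr_auc(y, model(X))
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break the feature-outcome link
            losses.append(1 - pr_auc(y, model(Xp)))
        importance[j] = np.mean(losses) - base_loss
    return importance

# toy model that scores cases using feature 0 only
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
imp = permutation_importance(lambda X: X[:, 0], X, y, rng=rng)
print(imp[0] > imp[1])                     # True: only feature 0 matters
```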

Related Works
Numerous studies have been conducted on breast cancer and ML. Previous studies used different types of data, including imaging modalities, genomic data, and clinical data. Most studies involving ML and breast cancer utilised imaging data, especially mammograms and ultrasound [53], while only a few studies utilised tabular data. Additionally, although public datasets such as the Wisconsin diagnostic breast cancer (WDBC) dataset are tabular in nature, their features were derived from fine needle aspirate imaging of breast masses [54]. Other types of tabular data used for ML classification of breast cancer were sociodemographic, clinical, histological, and pathological data. These types of tabular data were used to predict breast cancer recurrence [55] and survival [56]. Additionally, for breast cancer risk estimation, such as screening and diagnosis, imaging data and imaging-derived features were commonly utilised [53]. The use of imaging data in previous studies limited the utilisation of ML models in the early screening stage, prior to medical consultation.
Several ML algorithms have been used in previous studies that utilised tabular data for the prediction of breast cancer, breast cancer recurrence, and the survival of breast cancer patients. Table 2 presents a summary of the previous research related to machine learning classification and breast cancer that utilised tabular data such as sociodemographic, medical history, clinical, pathological, histological, molecular, and genomic data. SVM has been shown to outperform other ML models in several studies involving the prediction of breast cancer recurrence and distant recurrence, with the best accuracy at 0.96 [57][58][59]. However, other studies found that ANN and random forest had the best performance in predicting breast cancer recurrence [60,61]. Moreover, for the prediction of the survival of breast cancer patients, naïve Bayes, deep learning, and multilayer perceptron (MLP) had the best accuracy at 0.80, 0.83, and 0.88, respectively [60,62,63]. All the aforementioned studies utilised different datasets, which may contribute to the differences in model performance. Additionally, for breast cancer prediction, random forest showed a promising result with an accuracy and area under the curve (AUC) of 0.98 [64]. Other studies showed that XGBoost and MLP outperformed random forest in their respective studies [65,66]. However, all three studies except for Hout et al. [66] used clinical data such as the levels of glucose, insulin, leptin, and adiponectin, which are beyond the initial screening stage of breast cancer. Additionally, a meta-analysis had shown that SVM outperformed other classifiers such as ANN, decision tree, naïve Bayes, and kNN in breast cancer risk estimation [67]. That meta-analysis was limited to ML models performed on imaging data; thus, the performance of the aforementioned ML models as an initial breast cancer screening model utilising a tabular dataset had yet to be explored.

Model Comparison
Eight OTC screening models were developed from ML. kNN had the highest Youden J index, precision, and PR-AUC, while the ML model with the highest F2 score was SVM. Table 3 presents the descriptive performance of all ML models, while Figure 2 further illustrates the performance comparison of all models. One-way ANOVA showed that there was a significant difference between the means of the Youden J index, F2 score, precision, and PR-AUC among all ML models (Table 4). Further post hoc pairwise comparison using t-tests indicated that all pairwise comparisons were significant after Bonferroni correction, except for XGBoost vs. elastic-net logistic regression for the Youden J index, and XGBoost vs. elastic-net logistic regression, ANN vs. elastic-net logistic regression, and XGBoost vs. ANN for the F2 score (Figure 3). Thus, kNN was identified as the best ML model for the purpose of OTC breast cancer screening in this study.

Table 5 presents the four results of hyperparameter tuning with the highest Youden J index, F2 score, precision, and PR-AUC. Models 1, 2, and 4 had lower specificity than sensitivity, while model 3 showed the opposite. kNN model 3 was selected as the best hyperparameter tuning result as it had the highest sensitivity.

Table 6 displays the performance of the final kNN model on the validation dataset across mammographic density. The model had a higher sensitivity on the non-dense cases and a higher specificity on the dense cases. Additionally, the performance differences across mammographic density were very minimal. Furthermore, Figure 4 indicated that there was no difference between the PR-AUC of non-dense and dense breast women for the final kNN model, as both curves overlapped.

Figure 5 illustrates the top fifteen influential features of the final ML model. The top three most influential variables were age at examination, birth control/hormone replacement, and race. In terms of patient complaints, breast pain, breast lump, and breast trauma were the most important factors influencing the model's prediction, as opposed to the other complaints.

Discussion
In this study, we evaluated the feasibility of OTC breast cancer screening models developed from ML. The model aimed to identify women with suspicious breast problems or women with a high probability of developing breast cancer. The screening model used the information obtained during patient registration, prior to a medical consultation with the clinician. Thus, patients with a suspicious breast issue would be prioritised at the screening stage and referred to a breast cancer specialist for timely consultation. Previous studies showed that early detection of breast cancer reduces its mortality [68,69]. Additionally, one of the factors behind severe breast cancer presentation and poor survival among breast cancer patients was a delay in seeking medical treatment [70][71][72][73]. The development of an OTC screening model would be beneficial in minimising the time between a woman first noticing a symptom and arranging a medical consultation. At least about 17% of women with breast cancer symptoms in European countries delayed medical consultation by 3 months or more [74]. In southeast Asian countries such as Malaysia, the delay in medical consultation was estimated at 2 months [75]. In general, shortening the delay in arranging medical consultations would improve the prognosis of women with breast cancer.
OTC models were developed from eight ML models in this study. The kNN models were significantly better than the other seven models in terms of the Youden J index, precision, and PR-AUC, while SVM had the highest F2 score. Thus, the best model based on the four performance metrics was kNN, followed by random forest and ANN. The SVM model had the lowest Youden J index and precision and one of the lowest PR-AUC values, despite having the highest F2 score. SVM was believed to work well with imbalanced datasets compared to other ML models; however, this was not the case in our study [76]. Additionally, the final kNN model had a balanced performance between sensitivity and specificity (Table 4). In the hyperparameter tuning stage, we prioritised ML models with a higher sensitivity value, as the OTC model is intended to be deployed in the breast clinic during registration, prior to the medical consultation.
The model with high sensitivity would prioritise women with a suspicious breast issue which in turn accelerates the needed process for those with medical urgency.
The features used for the development of the ML screening models were sociodemographic information, medical history, and patient complaints. A study that developed ML models to predict breast cancer in Chinese women included ten risk factors and achieved a best sensitivity and specificity of 0.66 and 0.69 using XGBoost [66]. Our study achieved a sensitivity and specificity of 0.82 and 0.79, respectively, using kNN. Therefore, our study showed that adding patient symptoms or complaints to the features used in the development of the screening model improved its predictive performance. Another study that predicted breast cancer using laboratory data reported a best precision of 0.85 using ANN [65], while the precision of our final kNN model was 0.81. Although the performance of our model was slightly lower, obtaining laboratory data before a medical consultation was unfeasible and impractical in our setting.
Mammographic density is a known risk factor for developing breast cancer [77]. Asian women have a higher mammographic density than non-Asian women [78,79] and thus a higher risk of getting breast cancer. For example, in Malaysia, Chinese women have been shown to have denser breasts than the other races [80,81]. A few studies reported that at least half of the women who attended mammogram procedures in Malaysia had dense breasts [82,83]. An ML screening model intended for this population should take this information into account. However, it was inappropriate to include mammographic density as one of the features in the screening model, as the density is only known at a later stage, after the medical examination. The final kNN model had a slightly higher sensitivity in the non-dense group and a slightly higher specificity in the dense group (Table 6). However, the comparison of the PR-AUC of the model indicated that there was no performance difference between the two groups. Additionally, the explainable ML approach revealed that the most significant feature in the final model was age at examination. The incidence of breast cancer has been shown to increase with age [84]. However, breast cancer presenting at a younger age tends to be more aggressive and at a higher stage [84][85][86]. Thus, in developing the ML screening model, misclassification of suspicious cases as normal cases, especially in younger women, could be a catastrophic error. Moreover, there were two modifiable features, namely weight and breast self-examination (BSE). Weight control has been suggested to reduce breast cancer risk [87,88]. Although BSE was not related to breast cancer risk, frequent BSE led to an increased incidence of breast cancer [87]. Additionally, there were three influential features related to patient complaints: breast pain, breast lump, and breast trauma.
This study used secondary data collected from a university- and research-based hospital in Kelantan, Malaysia. The data were further validated by a radiologist and a pathologist to ensure their quality. However, our study still had a few limitations. One of the main limitations was the size of the data available to develop our screening models. The lack of data is a prevalent issue in the application of ML in healthcare [89]. This issue was worsened in our study as the dataset had missing values and an imbalanced outcome classification. We used a bagged tree model and the ROSE algorithm to overcome these issues, and undeniably larger data would further improve our model. Additionally, we included only one hospital in our study, as we utilised information from patient registration records that were specific to the BestARi, HUSM at the time this study was conducted. Including more hospitals was not feasible due to the lack of standardisation of patient registration records among hospitals. However, future studies should aim to include more hospitals, if possible, thus increasing the size of the data. Nonetheless, the challenges and approaches presented in the study reflect a real workflow in the development and application of an OTC ML model for breast cancer screening.

Conclusions
We evaluated eight ML models for development as an OTC screening model for breast cancer. We used patient registration records, including sociodemographic information, medical history, and patient complaints, as features for the development of the screening models. This study found that the OTC screening models developed from ML and patient registration records show promising performance. The screening models can be deployed in a breast clinic and improve the workflow of breast cancer management. Thus, the deployment of the model will reduce patient delays in arranging investigations and consultations with the breast cancer team.

Informed Consent Statement: Patient consent was waived due to the retrospective nature of this study and the use of secondary data.

Data Availability Statement:
The data are available upon reasonable request to the corresponding author.