A Proactive Explainable Artificial Neural Network Model for the Early Diagnosis of Thyroid Cancer

Abstract: Early diagnosis of thyroid cancer can reduce mortality, and can decrease the risk of recurrence, side effects, or the need for lengthy surgery. In this study, an explainable artificial neural network (EANN) model was developed to distinguish between malignant and benign nodules and to understand the factors that are predictive of malignancy. The study was conducted using the records of 724 patients who were admitted to Shengjing Hospital of China Medical University. The dataset contained the patients’ demographic information, nodule characteristics, blood test findings, and thyroid characteristics. The performance of the model was evaluated using the metrics of accuracy, sensitivity, specificity, F1 score, and area under the curve (AUC). The SMOTEENN combined sampling method was used to correct for a significant imbalance between malignant and benign nodules in the dataset. The proposed model outperformed a baseline study, with an accuracy of 0.99 and an AUC of 0.99. The proposed EANN model can assist health care professionals by enabling them to make effective early cancer diagnoses.


Introduction
Thyroid cancer is the 13th most common cancer diagnosis among adolescents and young adults, with 43,800 cases expected in 2022 [1,2]. It accounts for more than 90% of endocrine system malignancies and ranks as the sixth most common among women [1]. Follicular cells are responsible for 96% of thyroid tumors, and 99% of these are differentiated thyroid cancers (DTC) [3]. The main treatments for DTC are surgery, radioactive iodine remnant ablation, and TSH-suppressive therapy with levothyroxine [4]. Each patient's treatment plan is unique and dependent on the lesion's characteristics, but surgery is the most common option [5]. The surgical procedure places an emphasis on identifying benign and malignant thyroid nodules. An accurate prior diagnosis reduces the risk of recurrence, reduces side effects, and allows surgery to proceed more quickly [6]. Thorough post-surgery care also aids in improving survival rates. As a result, it is essential to make precise diagnoses and predictions using thyroid ultrasonography, blood work, and other clinical data [6,7]. Presently, physicians' clinical judgment is used to determine whether a nodule is malignant, but this can be laborious and error prone. To improve treatment decisions and minimize labor, accurate and comprehensive prediction models are urgently required.
Artificial intelligence (AI) is a field of computer science that develops techniques for machines to use knowledge to solve problems. AI is commonly used in the healthcare industry and can contribute to creating new treatments, selecting treatments for complex illnesses, improving the course of care for patients with chronic diseases, and reducing medical mistakes [8]. However, these applications have some shortcomings, such as the "black box" nature of some AI models: because neither the underlying learning mechanisms nor the output models are fully understood, it can be difficult for medical professionals to understand these models and draw clinical conclusions that can be explained to patients and colleagues. Transparency in medical AI applications is necessary for clinicians and patients to trust them. In recent years, interest in the study of explainable artificial intelligence (EAI) has grown rapidly [9], but to the author's knowledge few studies have investigated the use of EAI for the early detection of thyroid cancer.
In this study, an EAI model is developed, which is capable of diagnosing thyroid nodules with better performance than the benchmark study [10]. Crucially, the model's output can be readily explained.

Related Studies
Physicians can benefit from AI in a range of scenarios. Machine learning is widely used in healthcare for the diagnosis of diseases, the development of new drugs, and the determination of patient risk factors. To detect diseases using AI approaches, a variety of medical data sources, including ultrasound images, MRI, and medical records are needed. In recent years, AI techniques have led to significant improvements in healthcare systems and have been employed to accelerate the process of getting patients ready to continue their recovery at home [11].

AI-Based Studies on the Early Diagnosis of Thyroid Cancer
Chan et al. [12] used deep convolutional neural networks (CNN) to diagnose DTC. The dataset used in this study consisted of ultrasound images of 421 DTC patients and 319 patients with benign tumors. Three CNN architectures were trained on the dataset: InceptionV3, ResNet101, and VGG19. The results showed that ResNet101 outperformed the other methods with an accuracy of 77.6%, while InceptionV3 and VGG19 achieved accuracies of 76.5% and 76.1%, respectively. Similarly, Naglah et al. [13] used a computer-aided system to distinguish between benign and malignant thyroid nodules. Their model was developed using a CNN and was evaluated using the magnetic resonance images of 49 patients. The model achieved an accuracy of 87%, a specificity of 97%, and a sensitivity of 69%.
Teknologi et al. [14] proposed a diagnostic for thyroid follicular carcinoma and its subtypes. This study used biopsy images from Pakistani hospitals. An ensemble method was applied using multiclass support-vector machines (SVM) to develop the model. The results showed that the model successfully diagnosed cancers with an accuracy of 98.5%. Olatunji et al. [15] developed a machine learning-based model to detect thyroid cancer by comparing four classifiers: random forest (RF), artificial neural network (ANN), SVM, and naïve Bayes. These classifiers were compared using a dataset collected from Saudi hospitals. The RF model outperformed the other models with an accuracy of 90.91% and used only seven of the fifteen measured features of the tumors. In a similar vein, Yang et al. [16] used the machine learning classifiers RF, SVM, and decision tree with genomic data for the early diagnosis of thyroid cancer. The model was evaluated on 506 patients using 703 RNA features. The highest accuracy achieved was 99.2% using RF.
Zhao et al. [17] developed a machine learning-based model using five classifiers: RF, logistic regression, SVM, gradient boost, and ANN. This model was evaluated using 177 ultrasound images with 10 features: shape, composition, size, margins, micro-calcification, halo sign, strain ratio, the echogenicity of the solid portion, vascularity, and the color scale scoring system of real-time elastography. RF outperformed the other classifiers with an accuracy of 86% and an area under the curve (AUC) of 93.4%.

Applications of EAI in Diagnosis and Surgery
Explainability is the extent to which the reasoning behind AI's decisions can be communicated in language that can be understood by a wide range of end users. A concept closely related to explainability is interpretability. Explainability focuses on how the AI's decisions are explained in novel cases, whereas interpretability refers to how the model is understood after being trained on data [18]. There have been several recent studies applying EAI to medicine. Aslam [18] proposed an EAI approach for the early prediction of ventilator support and mortality in COVID-19 patients. This study used data on patients hospitalized with COVID-19 from King Abdulaziz Medical City in Riyadh. Additionally, the synthetic minority oversampling technique (SMOTE) was used as the dataset suffered from imbalance. The balanced accuracy of the model was 98%, with an AUC of 99.8%, for predicting mortality. For predicting ventilator support, the model achieved a balanced accuracy of 97.9% and an AUC of 98.1%.
El-Sappagh et al. [19] developed an EAI detection and prediction model for Alzheimer's disease. The dataset included 1048 patients, of whom 294 were cognitively normal, 254 had stable mild cognitive impairment, 232 had progressive mild cognitive impairment, and 268 had Alzheimer's disease. The study used RF as a classifier in a two-layer model. In the first layer, the model performed a multi-class classification for the early diagnosis of Alzheimer's patients. In the second layer, it used binary classification to look for potential progression from mild impairment to Alzheimer's within three years after a baseline diagnosis. In terms of explainability, Shapley Additive Explanations (SHAP) was used for the RF classifier. The explanations were delivered in plain language so that they were easily understood by doctors. The model achieved 93.95% cross-validation accuracy and a 93.94% F1 score in the first layer, and 87.08% cross-validation accuracy and an 87.09% F1 score in the second layer.
Chen et al. [20] proposed an EAI model using Bayesian networks and CNN for an interpretable clinical diagnosis system. The authors used annotated medical records from multiple hospitals. The results demonstrated that the proposed framework performed more accurately than the prior automatic diagnosis techniques, and its diagnosis explanation was understandable. Magesh et al. [21] developed a machine learning model based on EAI for the diagnosis of Parkinson's disease through dopaminergic imaging techniques such as SPECT and DaTSCAN. The model was trained using CNN and explained using the Local Interpretable Model-Agnostic Explainer (LIME). The results showed that the model successfully detected Parkinson's disease with an accuracy of 95.2%, a sensitivity of 97.5%, and a specificity of 90.9%.
Similarly, Aghamohammadi et al. [22] developed a prediction model for diagnosing heart attacks using EAI techniques. They combined a genetic algorithm with an adaptive fuzzy inference system. Explainable graphs were offered to address the forecasts' explainability. The proposed algorithm performed well, and the study demonstrated that several signs are crucial for the accurate prediction of heart attacks.

Materials and Methods
All computer analysis in this study was performed using Python version 3.6 (Python Software Foundation, Wilmington, DE, USA).

Exploratory Dataset Analysis
The dataset used in this study was the same as that used by Xi et al. [10] and consisted of the records of 724 patients who were admitted to Shengjing Hospital of China Medical University. All participants received a thyroidectomy, and the dataset included the following 19 attributes:
• Patient demographics: gender and age;
• Nodule characteristics: size, site (right, left, or isthmus), multifocality (unifocal or multifocal, i.e., whether there were multiple nodules in one location), shape (regular or irregular), calcification (absent or present), ultrasound echo strength (none, isoechoic, medium-echogenic, hyperechogenic, or hypoechogenic), margin (clear or unclear), blood flow (normal or enriched), composition (cystic, mixed, or solid), laterality (unilateral or multilateral), and malignancy (benign or malignant);
• Blood test findings: free triiodothyronine (FT3), free thyroxine (FT4), thyroid-stimulating hormone (TSH), thyroid peroxidase antibodies (TPO), and thyroglobulin antibodies (TgAb);
• Thyroid characteristics: echo pattern (even or uneven).
Only the largest nodule was included in the dataset if the patient had numerous nodules in a single location; the dataset contained 1232 nodules in total. The target attribute was malignancy, which has two possible values: benign and malignant. Figure 1 shows the number of nodules for each of these outcomes.
The ages of the patients ranged from 13 to 82, and the mean age was 46.6. Figure 2 is a representation of the nodule data binned by patient ages, while in Figure 3 the data are binned according to patient age and nodule malignancy, and in Figure 4 the nodules are also binned by patient data. These plots demonstrate that the modal age group was 43-52 and that there were far more nodules from female patients (1032) than male patients (200). Figure 5 shows the correlation values for all 19 features, and it shows a significant correlation between the nodule outcome and five features: site, shape, calcification, blood flow, and margin.
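The correlation-based ranking of features against the nodule outcome can be illustrated with a minimal sketch. It assumes the Pearson coefficient and a 0/1 encoding of the categorical attributes; the text does not state which correlation measure or encoding was used, and the toy values and the helper names `pearson` and `rank_features` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target, top_k):
    """Return the top_k feature names ordered by |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical toy data (assumption): 1 = present / irregular / malignant, 0 otherwise.
toy_columns = {
    "calcification": [1, 1, 0, 1, 0, 0],
    "shape":         [1, 1, 0, 1, 0, 1],
    "gender":        [0, 1, 0, 1, 0, 1],
}
toy_malignancy = [1, 1, 0, 1, 0, 0]

top_two = rank_features(toy_columns, toy_malignancy, top_k=2)
print(top_two)
```

The same ranking with `top_k=10` would correspond to choosing a reduced feature set by correlation strength, under the assumptions stated above.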

Artificial Neural Network Model
An Artificial Neural Network (ANN) is a simulated network of neurons designed to emulate the human brain [23]. ANNs may learn without supervision, unlike most other AI techniques. Users simply need to provide the ANN with training data; they do not need to give it specific instructions on how to make decisions, so no complex programming is required [24]. Many diseases currently have unknown causes, and the symptoms of some conditions evolve over time. Diagnosis and treatment decisions are typically based on the experience and considered judgment of medical professionals. As a result, an ANN's learning, memory, and induction capabilities define how well-suited it is for use in medicine [25].

ANNs have four key advantages: their capacity for self-learning, their ability to conduct a large number of processes concurrently, their strong ability to generalize and predict on unobserved data after learning from the original inputs and their associations, and their capacity to synthesize information [26]. Conceptually, an ANN can be split into three layers: the input layer, in which data enters the network; the hidden layer, which derives patterns from the data; and the output layer, which uses these patterns for classification or regression [26].
In this study, two ANN models were developed: the first model includes all dataset features, while the second model is limited to the 10 features with the highest correlation values. The neural tangent kernel (NTK) was used to implement the models; the NTK describes how a deep ANN evolves during training by gradient descent. A sequential method was used to initialize the model and add the layers. In the first model with the full set of features, the input layer has 18 neurons, whereas the second model has an input layer of 10 neurons. Both models have five hidden layers of 512, 256, 128, 64, and 32 neurons, respectively. For the activation function, the rectified linear unit (ReLU) was used, which processes the inputs by converting any negative values to zero:
$$f(x) = \max(0, x)$$
In both models, the output layer has one neuron and uses the sigmoid activation function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
To avoid overfitting, dropout layers were added after the third layer with a rate of 20%. The Adam optimization algorithm [27] was used to train the models. Binary and categorical cross-entropy were used to calculate the loss, while the accuracy metric was used to evaluate the model. The training process was carried out using optimum values of the hyperparameters of epochs and batch size, which were found using a grid search optimization algorithm [28]. Specifically, 100 epochs and a batch size of 10 were used.
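As a rough illustration of the architecture described above, the following sketch implements the two activation functions and counts the trainable parameters implied by the stated layer sizes. It assumes fully connected layers with one bias per neuron, which the text does not state explicitly; dropout adds no parameters. The function names are hypothetical and this is not the authors' implementation.

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

HIDDEN_LAYERS = [512, 256, 128, 64, 32]  # hidden layer widths given in the text
OUTPUT_UNITS = 1                         # single sigmoid output neuron

def count_parameters(n_inputs):
    """Weights plus biases of a fully connected network (assumed layer type)."""
    sizes = [n_inputs] + HIDDEN_LAYERS + [OUTPUT_UNITS]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

full_model_params = count_parameters(18)      # model using all features
selected_model_params = count_parameters(10)  # model using 10 selected features
print(full_model_params, selected_model_params)
```

Under these assumptions the full-feature network has 184,321 trainable parameters and the 10-feature network has 180,225; the difference comes entirely from the first layer, since the two models differ only in input width.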

Explainability of the Proposed Model
An explainable AI component was added to the proposed model to study its interpretability. SHAP values [29] were used to quantify the size of the impact of the various features on the model; for the second model, these are shown in Figure 6.
A decision tree was created to further enhance the interpretability of the model. Figure 7 shows the decision tree for the classification of benign and malignant thyroid cancers. For each decision level, the node shows the following information:
• The name of the feature used;
• The number of samples used;
• The number of samples at the node that fall into each class;
• The Gini score: a metric that quantifies the purity of the node. A Gini score of zero means that all samples at the node belong to a single class.
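The Gini score listed for each node can be computed directly from the per-class sample counts at that node. The sketch below uses the standard Gini impurity, one minus the sum of squared class fractions; the function name and the example counts are hypothetical.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node has no samples")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical nodes, given as [benign, malignant] sample counts
pure_node = gini_impurity([40, 0])    # every sample in one class
mixed_node = gini_impurity([30, 10])  # 75% / 25% split
even_node = gini_impurity([20, 20])   # maximally impure for two classes
print(pure_node, mixed_node, even_node)
```

For a two-class problem the score ranges from 0 (a pure node) to 0.5 (an even split), so values close to zero indicate nodes that separate benign from malignant nodules cleanly.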
Computation 2022, 10, x FOR PEER REVIEW 8 of 12
Figure 7. A decision tree representation of the ANN model. Here "class" represents the malignancy of the cancer, with a value of 0 for a benign cancer and 1 for a malignant cancer. "value" represents the fraction of samples that fall into each class.
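SHAP values are based on Shapley values from cooperative game theory, which attribute a prediction to individual features by averaging each feature's marginal contribution over all orderings in which features can be added. The sketch below computes exact Shapley values for a small hypothetical scoring function. The function `toy_model`, its coefficients, and the all-zero baseline are assumptions for illustration only and are not the ANN from this study; exact enumeration is only feasible for a handful of features, so practical SHAP explainers rely on approximations.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)           # start from the reference input
        previous_output = model(current)
        for i in order:
            current[i] = instance[i]       # switch feature i to its actual value
            output = model(current)
            totals[i] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]

def toy_model(x):
    """Hypothetical risk score with one interaction term (not the paper's ANN)."""
    calcification, multifocal, blood_flow = x
    return (0.5 * calcification + 0.2 * multifocal + 0.1 * blood_flow
            + 0.2 * calcification * blood_flow)

instance = [1, 1, 1]   # all three findings present
baseline = [0, 0, 0]   # reference case with none present
values = shapley_values(toy_model, instance, baseline)
print(values)
```

In this toy case the interaction term is shared equally between calcification and blood flow, and the three values sum to the difference between the prediction for the instance and the prediction for the baseline.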

Evaluation Measures
The performance of the model was evaluated using the metrics of accuracy, sensitivity, specificity, precision, F1 score, and AUC.
Accuracy is simply the percentage of samples that were classified correctly. In a medical context it can be expressed in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Sensitivity is a measure of how good the model is at picking up positive cases, and is calculated as the percentage of positive cases that were classified correctly:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
Similarly, specificity is a measure of how often the model correctly classifies negative cases:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Precision is a measure of the quality of the model's positive classifications. It is calculated as the percentage of positive classifications that were correct:
$$\text{Precision} = \frac{TP}{TP + FP}$$
The F1 score is the harmonic mean of the precision and the sensitivity:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
It takes into account both the false positive and false negative rates, with an F1 score close to 1 implying that both are low.
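The measures defined above follow directly from the four confusion-matrix counts. A minimal sketch is given below; the counts are hypothetical and are not results from this study, and the function assumes every denominator is nonzero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative, and one
    positive prediction, so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of actual positives found
    specificity = tn / (tn + fp)   # share of actual negatives found
    precision = tp / (tp + fp)     # share of positive calls that were right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for illustration only
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these example counts the sensitivity (0.90) is higher than the specificity (0.80), which shows why reporting accuracy alone (0.85) can hide how errors are distributed between the two classes.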
The AUC value was also used to measure the performance of the model. Its value varies between 0 and 1, with higher values corresponding to better performance.
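One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it in that pairwise form, which is equivalent to the area under the ROC curve; the labels, scores, and function name are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities of malignancy
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_score(toy_labels, toy_scores)
print(toy_auc)
```

Here three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.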

Results
The two different models (one using all features, the other using 10 selected features) were trained on the data, and their performance was compared using several evaluation measures.
The dataset suffered from an imbalance (there were far more malignant nodules than benign ones), so it was necessary to use a technique to avoid biasing the models toward one of the outcomes. The SMOTEENN algorithm [30] was employed for this purpose. This is a combination of oversampling using SMOTE and undersampling using the Edited Nearest Neighbors (ENN) method. SMOTEENN first applies SMOTE, a statistical method for increasing the number of instances in the dataset so that the classes become evenly represented. For each instance of the target class, the algorithm finds its nearest neighbors in the feature space and creates new instances by combining the features of the target case with those of its neighbors. ENN is then applied: for each observation, its K nearest neighbors are found, and the majority class among them is compared with the observation's own class. If the two differ, the observation and its K nearest neighbors are removed from the dataset.
The dataset was split and evaluated using the K-fold cross-validation technique (K = 10) in order to avoid any bias in the training and testing sets. This involved splitting the data into ten subsets, using nine subsets for training, and leaving the last one for testing the trained model. The training subsets were also split into training and validation sets.
Table 1 shows the testing results for the proposed model for the diagnosis of thyroid cancer. The evaluation measures described in Section 3.4 were calculated for the two models using the original dataset and two different data sampling techniques. The highest measures were achieved using the full set of features and the SMOTEENN combined sampling technique, which yielded a score of at least 0.99 for all of the evaluation measures.
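The two stages of the SMOTEENN resampling described above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the study: it assumes Euclidean distance, uses a common ENN variant that removes only the sample that disagrees with the majority of its neighbors, and runs on hypothetical two-dimensional toy data. All function names are illustrative.

```python
import math
import random
from collections import Counter

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(point, candidates, k):
    """Indices of the k candidates closest to point."""
    order = sorted(range(len(candidates)),
                   key=lambda i: distance(point, candidates[i]))
    return order[:k]

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating toward minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = others[rng.choice(nearest_neighbors(base, others, k))]
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic

def edited_nearest_neighbors(points, labels, k):
    """Keep only samples whose label matches the majority of their k neighbors."""
    kept_points, kept_labels = [], []
    for i, (point, label) in enumerate(zip(points, labels)):
        others = points[:i] + points[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        votes = Counter(other_labels[j] for j in nearest_neighbors(point, others, k))
        if votes.most_common(1)[0][0] == label:
            kept_points.append(point)
            kept_labels.append(label)
    return kept_points, kept_labels

# Hypothetical toy data: label 0 = benign (minority), 1 = malignant (majority)
benign = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
malignant = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.3], [1.0, 0.8],
             [0.15, 0.15]]  # last point is a noisy sample inside the benign cluster

rng = random.Random(0)
synthetic = smote(benign, n_new=3, k=2, rng=rng)        # oversample to 6 benign
points = benign + synthetic + malignant
labels = [0] * (len(benign) + len(synthetic)) + [1] * len(malignant)
kept_points, kept_labels = edited_nearest_neighbors(points, labels, k=3)
print(Counter(kept_labels))
```

In the toy data, oversampling brings the benign class up to six samples, and the cleaning step then discards the single malignant point that lies inside the benign cluster, leaving a more balanced and less noisy set.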
Figure 8 shows a comparison between the evaluation measures for the two models and the three approaches to data sampling. In the baseline study by Xi et al. [10], the best accuracy of 0.79 and AUC of 0.84 were achieved using random forest and the original dataset with the full set of features. The present study's model achieved an accuracy of 0.82 and an AUC of 0.86. However, applying the SMOTEENN sampling technique yielded a significantly higher accuracy and an AUC of 0.99. In the case of SMOTE, the model achieved an accuracy of 0.80 and an AUC of 0.89 using the full set of features.
In a medical setting, it is conventional to focus on the sensitivity and specificity, which are the probabilities of correctly classifying patients with and without the disease, respectively. The highest sensitivity and specificity were achieved using the full set of features and the SMOTEENN sampling technique; in this case the sensitivity was 0.99 and the specificity was 1.00. Similarly, in the selected features model, the highest sensitivity and specificity values (0.98 and 0.97, respectively) were achieved using the SMOTEENN sampling technique.

Discussion
A deep learning model was used to predict the malignancy of thyroid nodules, using previously published data from medical tests. An EAI model was used to explain the impacts of the various nodule features in the deep learning model. The results suggest that nodule calcification, multifocality, and enriched blood flow have the greatest impact in diagnosing malignant nodules. Previous studies have found that nodule calcification is the strongest indicator for malignancy [31-33]. Other studies have found that multifocality has a significant impact [34-36]. Debnam et al. [36] raised doubts about whether enriched blood flow is a significant diagnostic factor, whereas the results in the present study found that it did have a significant impact based on the EAI model.
The site of the nodule is also considered an important feature by the proposed model. This aligns with many prior studies that have concluded that the thyroid nodule site is an important indicator of thyroid cancer, with isthmic nodules being particularly likely to be malignant [37-39].

Conclusions
In this study, an EANN model was developed to diagnose malignant thyroid nodules. The model used a dataset of 724 patients including their clinical and demographic information. The dataset had 19 features in total, including the target attribute of malignancy, which has two possible values: benign and malignant. The model was trained on this dataset and optimized using the Adam optimizer.
To evaluate the model, six measurements were used: accuracy, sensitivity, specificity, F1 score, precision, and AUC. As the dataset suffered from imbalance, the SMOTEENN balancing technique was used. A variant of the model using only 10 of the features was also studied. The results showed that the proposed model outperformed the baseline study, with an accuracy of 0.99 using the full features and 0.98 using the selected 10 features. EAI was employed to interpret the proposed model and to understand the impact of the different features in predicting malignant thyroid nodules. A decision tree was created to specify the rules behind the model.
The proposed model can assist healthcare professionals in diagnosing malignant thyroid nodules and explaining the reasoning behind their diagnoses. In future, this study could be expanded by applying the model to a larger dataset.