Artificial Intelligence Methods for Identifying and Localizing Abnormal Parathyroid Glands: A Review Study

Background: Recent advances in Artificial Intelligence (AI) algorithms, and specifically Deep Learning (DL) methods, demonstrate substantial performance in detecting and classifying medical images. Recent clinical studies have reported novel optical technologies which enhance the localization or assess the viability of Parathyroid Glands (PG) during surgery, or preoperatively. These technologies could become complementary to the surgeon’s eyes and may improve surgical outcomes in thyroidectomy and parathyroidectomy. Methods: The study explores and reports the use of AI methods for identifying and localizing PGs, Primary Hyperparathyroidism (PHPT), Parathyroid Adenoma (PTA), and Multiglandular Disease (MGD). Results: The review identified 13 publications that employ Machine Learning and DL methods for preoperative and operative implementations. Conclusions: AI can aid in PG, PHPT, PTA, and MGD detection, as well as PG abnormality discrimination, both during surgery and non-invasively.


Introduction
A parathyroid adenoma (PTA) is a noncancerous (benign) tumor of the Parathyroid Glands (PGs). PGs are located in the neck, near or attached to the back side of the thyroid gland. PTA is part of a spectrum of parathyroid proliferative disorders that includes parathyroid hyperplasia, PTA, and parathyroid carcinoma [1].
Approximately eighty percent of primary hyperparathyroidism (PHPT) is caused by a PTA [2], followed by four-gland hyperplasia with ten to fifteen percent [2], and multiple adenomas with five percent [1].
Computer-Aided Diagnostic (CAD) tools for PTA identification could significantly reduce clinician fatigue and routine workload in everyday clinical practice, allowing experts to focus on nontrivial tasks. In addition, online surgical assistance tools that detect and localize important areas can aid in error prevention. Identifying and preserving the PGs during thyroid surgery is very important: damaging, devascularizing, autotransplanting, or inadvertently removing PGs can cause post-operative hypocalcemia. To this end, near-infrared-induced autofluorescence (NIRAF) can deliver normal and pathologic PG localization in real time. Such tools are already embedded into modern image acquisition technologies and computer-enabled surgery frameworks. However, more modern solutions are worth examining; recent advances in Artificial Intelligence (AI) algorithms, specifically Deep Learning (DL) methods, demonstrate substantial performance in detecting and classifying medical images [2][3][4].
DL brought a revolution in feature-extraction from image data, enabling the computer-suggested capture of millions of potentially significant image features. DL algorithms can learn to detect and distinguish important features that characterize an image according to a predefined label. For example, such methods have achieved remarkable success in various cancer-detection studies utilizing various imaging modalities [3][4][5]. DL implementations are also found in video processing and biomedical signal processing.
Recent clinical studies report novel optical technologies which enhance the localization or assess the viability of Parathyroid Glands (PG). These technologies could become complementary to the surgeon's eyes and may improve surgical outcomes in thyroidectomy and parathyroidectomy [6]. More importantly, combining such technologies with state-of-the-art image and video processing computational models can multiply the capabilities of these systems and greatly increase their necessity and utility in hospitals.
Non-invasive medical imaging acquisition modalities, such as SPECT, aid in the preoperative identification of hyperparathyroidism and abnormal PG localization. Again, AI methods can substantially contribute to the detection task and assist medical staff.
The present review investigates the implementation of AI for identifying and localizing abnormal PGs and PHPT. The Literature Review identifies 13 related papers from the year 2000 to July of 2022 and discusses their findings and methods. Current limitations and future suggestions are provided in the Discussion section.

Literature Review
The relevant publications were identified through extensive searches in established publication-indexing websites and repositories. PubMed, Scopus, and Google Scholar were the major sources of information. Multiple keyword combinations were used to discover research papers and constitute the initial library. The survey covered publications from January 2000 to July 2022. A total of thirty-three publications constituted the initial library. Each publication's title and abstract were used to exclude irrelevant entries. Subsequently, a total of 13 research studies qualified for the review. The complete process is presented in Figure 1.

Machine Learning and Deep Learning in a Nutshell
This section describes the AI methods and algorithms reported in the literature review.

Machine Learning
ML is a part of AI [7]. It uses structured or unstructured data to learn patterns, forecast future values, or discover underlying knowledge [8]. The general idea of a machine that learns through a set of past observations is not an idea of our time [9]. The large amounts of data of any kind which are at the disposal of medical research centres and hospitals do not guarantee the successful development of an ML model. One of the most difficult challenges for engineers and programmers is labelling [10]. An ML model is commonly built upon a specific question or hypothesis to be investigated. For example, the malignancy suspiciousness rating of nodules inside specific organs and tissues in our body could be the focal point of ML methods. In essence, the medical dilemma of whether an observed nodule is malignant or benign is a likely domain for applying an ML model. We can assign varying levels of discretion to the methods and algorithms of any ML implementation concerning the focal point. Hence, we discuss supervised ML, unsupervised ML, and semi-supervised ML [10].
Supervised learning involves working with labelled datasets for training and testing. Every instance in the training data is accompanied by a specific and desired output/target, which is utilized by the algorithm in order to learn [11]. Examples of data where the desired output is known and their values are predefined are called labelled examples. In the case of parathyroid gland detection, the actual location of an image finding is considered to be the label of each instance. Based on this label, the ML model learns to identify patterns in the image related to this location. In a similar example, research may focus on distinguishing between normal and abnormal parathyroid images of various modalities (e.g., SPECT or histopathological images). In that case, the actual label of the image (normal or abnormal) is considered the ground truth.
Contrary to supervised learning, unsupervised learning utilizes unlabeled data, aiming to discover hidden patterns that group the data into clear and sufficiently demarcated sets [12]. Unsupervised ML can reveal new knowledge from data by analyzing the suggested patterns and performing cross-examination [13]. Unsupervised learning is often equated with data mining, a broader field that aims to discover patterns in data by deploying both ML and statistical or mathematical tools.
Dealing with labelled and unlabeled data is the objective of semi-supervised learning [14]. Not necessarily reliant upon discovering underlying patterns within the unlabeled data, this method instead focuses on discovering basic patterns from a set of labelled data and matching them with similar patterns of a set of unlabeled data [15]. Based on the confidence of the prediction, a certain amount of unlabeled data is incrementally incorporated into the labelled data to increase their size.
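The incremental pseudo-labelling loop described above can be sketched as follows. This is a minimal illustration only: the one-dimensional nearest-centroid classifier, the distance-based confidence score, the function names, and the 0.9 threshold are all assumptions chosen for brevity and are not taken from any of the reviewed studies.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Return the mean feature value per class (a toy 1-D nearest-centroid model)."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def predict_with_confidence(centroids, x):
    """Predict the nearest class; confidence is 1 minus its share of the total distance."""
    dists = {label: abs(x - c) for label, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence


def self_train(labelled_x, labelled_y, unlabelled_x, threshold=0.9, max_rounds=10):
    """Move confidently predicted unlabelled samples into the labelled set, round by round."""
    xs, ys = list(labelled_x), list(labelled_y)
    pool = list(unlabelled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break  # nothing left that the model is sure about
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = [x for x, _ in remaining]
    return fit_centroids(xs, ys), xs, ys, pool


# Hypothetical toy data: one feature value per sample.
model, xs, ys, pool = self_train(
    [0.0, 1.0, 9.0, 10.0],
    ["normal", "normal", "abnormal", "abnormal"],
    [0.5, 9.5, 5.2],
)
```

In this toy run, the two unlabelled samples close to a class centroid are absorbed into the labelled set, while the ambiguous sample at 5.2 stays in the unlabelled pool because its confidence never reaches the threshold.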

Deep Learning
DL refers to various ML approaches that use many nonlinear processing units, grouped into layers, to process the input information by gradually applying specific transformations [22]. In the basic approach, the layers are usually sequentially connected. In essence, each layer processes the previous layer's output [23]. In this way, different levels of abstraction can acquire hierarchical representations of the input data. Specialized neural networks are used in DL applications that involve image feature extraction.
Those networks are known as Convolutional Neural Networks (CNNs), and their name comes from the convolution operation, which is the cornerstone of such methods. CNNs were introduced by LeCun [24]. A CNN is a deep neural network that mainly uses convolution layers to extract useful information from the input data, usually feeding a final Fully Connected (FC) layer [25]. CNNs exhibit impressive performance in a variety of tasks. The convolution process works as follows. A filter, which is a table of weights, slides across the input image. The output pixel produced at each position is a weighted sum of the input pixels that the filter currently covers. The weights of the filter, as well as the size of the table (kernel), remain constant for the duration of the scan. Therefore, convolutional layers can capture the shift-invariance of visual patterns and extract robust features.
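The sliding-filter operation described above can be illustrated with a short sketch. It assumes no padding and a stride of one, and it computes the weighted sum without flipping the kernel, which is the convention used by most DL frameworks (strictly speaking, a cross-correlation). The function name and the toy image are illustrative assumptions.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            # Weighted sum of the input pixels currently covered by the filter.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# The same image with the edge shifted one column to the left.
shifted = [[0, 1, 1, 1] for _ in range(4)]
# A 2x2 filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

response = convolve2d(image, kernel)            # strongest response in the middle column
shifted_response = convolve2d(shifted, kernel)  # the response moves with the edge
```

Because the same weights are applied at every position, shifting the edge in the input simply shifts the location of the strong response in the output, which is the property the paragraph above refers to.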
After several convolutional and pooling layers, one or more FC layers may be added to perform high-level reasoning. An FC layer connects every neuron of the previous layer to every neuron of the FC layer.
The last layer of a CNN is an output layer. The Softmax [26] operator is a common classifier for CNNs. Alternatively, a Support Vector Machine (SVM) trained on CNN features is sometimes used. CNNs have been widely used for medical image classification [27][28][29][30][31][32].
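As a brief illustration of the Softmax operator mentioned above, the sketch below converts the raw scores (logits) of an output layer into class probabilities. The function name and the example logits are assumptions for illustration; subtracting the maximum logit is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(logits):
    """Map raw output-layer scores to probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two classes, e.g. (abnormal, normal).
probabilities = softmax([2.0, 0.5])
predicted_class = probabilities.index(max(probabilities))
```

The class with the largest logit always receives the largest probability, so the predicted label is unchanged by the transformation; Softmax only makes the scores interpretable as a distribution.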

Results
The review identified two major categories, namely, thyroidectomy-assisting methods for localizing PGs and preoperative PG detection and abnormality identification. Tables 1 and 2 summarize the type and the results of the reported 13 studies, respectively.

Thyroidectomy Assisting Methods for Localizing Parathyroid Glands
Early and precise detection of PGs is a challenging problem in thyroidectomy due to their small size and an appearance similar to that of surrounding tissues. Several AI methods have been designed and proposed to assist surgeons in localizing and identifying PGs. Recent literature makes extensive use of emerging ML and DL algorithms to achieve high detection rates.
Kim et al. [33] introduced a prototype solution for the reduction of false-positive PGs localized using near-infrared autofluorescence (NIRAF) [34] methods. Their device is equipped with a coaxial excitation light (785 nm) and a dual sensor. Under this setup, the authors employed the YOLO v5 [35] network, a real-time object detection DL model, to identify and localize PGs. The authors evaluated their solution's clinical feasibility in situ and ex vivo using sterile drapes on ten human subjects. Video data of 1287 images of well-visualized and localized PGs from six human subjects were utilized. This method yielded a mean average precision of 94.7% and a processing time of 19.5 milliseconds per detection. Whether the proposed method maintains this performance after the inclusion of more human participants is a matter for future research.
Akbulut et al. [36] proposed a decision tree for intraoperative autofluorescence assessment of PGs in PHPT. The study involved 102 patients and 333 confirmed PGs. The authors extracted predictors from each PG, and the developed decision tree used normalized autofluorescence intensity, heterogeneity index, and gland volume to predict normal versus abnormal glands and subclasses of parathyroid pathologies. The algorithm achieved 95% accuracy in distinguishing between normal and abnormal PGs and 84% in predicting parathyroid pathologies' subclasses. However, the authors do not report the training and evaluation samples.
Wang et al. [37] benchmarked the YOLO V3, Faster R-CNN, and Cascade algorithms for identifying PGs during endoscopic approaches. The study involved 166 endoscopic thyroidectomy videos, from which 1700 images (frames) were employed. The experiments revealed the superiority of Faster R-CNN in this task, which achieved a precision, recall, and F1 score of 88.7%, 92.3%, and 90.5%, respectively. The authors evaluated this network further using an independent external cohort of 20 videos. The visual estimations of senior and junior surgeons were used for comparison. In this test set, the parathyroid identification rate of their method was 96.9%, while senior surgeons and junior surgeons achieved 87.5% and 71.9%, respectively.
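The F1 score is the harmonic mean of precision and recall, so the three figures quoted above for Faster R-CNN can be checked against each other. The short sketch below does this; the function name is an illustrative assumption, and the inputs are the precision and recall values reported in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Faster R-CNN in [37].
f1 = f1_score(0.887, 0.923)
print(f"F1 = {f1:.3f}")
```

Rounded to three decimals, the computed value agrees with the reported F1 score of 90.5%.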
Avci et al. [38] used the Google AutoML platform to identify an optimal DL model to localize parathyroid-specific autofluorescence on near-infrared imaging. The study involved 466 intraoperative near-infrared images of 197 participants undergoing thyroidectomy or parathyroidectomy procedures. The dataset was split into three sets: training, validation, and test. A total of 527 PG autofluorescence signals from the near-infrared images obtained intraoperatively from these procedures were used to develop the model's training set. The method yielded a recall of 90.5% and a precision of 95.7%. Those scores correspond to a 91.9% accuracy in detecting PGs.
Avci et al. [39] repeated the above study using a total of 906 intraoperative parathyroid autofluorescence images of 303 patients undergoing parathyroidectomy/thyroidectomy. The dataset was split, and 20% was kept for evaluation. The authors evaluated their models based on AUROC and AUPRC, which were found to be 0.9 and 0.93, respectively. Precision and recall were reported at 89% each.
Wang et al. [40] proposed an innovative method for identifying PGs based on laser-induced breakdown spectroscopy (LIBS). The study involved 1525 original spectra (773 PG spectra and 752 nPG spectra) from 20 smear samples of three rabbits. The authors extracted the emission lines related to K, Na, Ca, N, O, CN, and C2 and built several ML algorithms to distinguish between PGs and nPGs. The predictive attributes were ranked based on the importance weight calculated by Random Forest. The Artificial Neural Network model, combined with the Random-Forest-based feature selection, achieved a 92% accuracy.

Preoperative Parathyroid Gland Detection and Abnormality Identification
Sandqvist et al. [41] proposed an ensemble of decision trees with Bayesian hyperparameter optimization for predicting the presence of overlooked PTAs at a preoperative level using 99mTc-Sestamibi-SPECT/CT technology in Multiglandular Disease (MGD) patients. The authors used six predictors, namely, the preoperative plasma concentrations of parathyroid hormone, total calcium, and thyroid stimulating hormone, the serum concentration of ionized calcium, the 24-h urine calcium, and the histopathological weight of the localized PTA at imaging. The retrospective study involved 349 patients, whilst the dataset was split into 70% for training and 30% for testing. The authors designed their framework utilizing two response classes; patients with Single-Gland Disease (SGD) correctly localized at imaging and MGD patients in whom only one PTA was localized on imaging. Their algorithm achieved a 72% true positive prediction rate for MGD patients and a misclassification rate of 6% for SGD patients. This study confirmed that AI could aid in identifying patients with MGD for whom 99mTc-Sestamibi-SPECT/CT failed to visualize all PTAs.
Stefaniak et al. [42] developed an ANN to detect and locate pathological parathyroid tissue in planar neck scintigrams. This study involved 35 participants. The detailed data consisted of sets of three single pixels, each belonging to one of the three consecutive neck scintigrams generated 20 min after (99m)TcO(4)-administration, 10 min after (99m)Tc-MIBI injection, and 120 min after (99m)Tc-MIBI injection, respectively. The results of the ANN were compared to the conventional assessment of two radionuclide parathyroid examinations, namely, the subtraction method and (99m)Tc-MIBI double-phase imaging. The ANN yielded a close relationship with the visual assessment of original neck scintigrams, with a coefficient of determination (R²) of 0.717 and a standard error of 0.243 during its training. Multidimensional regression analysis yielded a weaker relationship, with an R² of 0.543 and a standard error of 0.567.
Yoshida et al. [43] employed RetinaNet [44], a DL network, for the detection of PTA by parathyroid scintigraphy with 99m-technetium sestamibi (99mTc-MIBI) before surgery. The study enrolled 237 patients who underwent parathyroid scintigraphy using 99mTc-MIBI, each of whom was determined to be a positive or negative case. Those patients' scans included 948 scintigrams with 660 annotations, which were used for training and validation purposes. The test set included 44 patients (176 scintigrams and 120 annotations). The models' lesion-based sensitivity and mean false positive indications per image (mFPI) were assessed with the test dataset. The early-phase model yielded a sensitivity of 82%, with an mFPI of 0.44. The delayed-phase model yielded a sensitivity of 83%, with an mFPI of 0.31.
Somnay et al. [45] employed several ML models for recognizing PHPT using clinical predictors, such as age, sex, and serum levels of preoperative calcium, phosphate, parathyroid hormone, vitamin D, and creatinine. The study enrolled 11,830 patients managed operatively at three high-volume endocrine surgery centers. Under a 10-fold cross-validation procedure, the Bayesian network was found superior to the rest of the ML models, achieving 95.2% accuracy and an AUC score of 0.989. This performance by the Bayesian network is interesting because, in general, such networks tend to overfit and their generalization capabilities are very limited.
Imbus et al. [46] benchmarked ML classifiers for predicting MGD in PHPT patients. The study involved 2010 participants (1532 patients with SGD and 478 with MGD). The fourteen predictor variables included patient demographic, clinical, and laboratory attributes. The boosted tree classifier was found superior to the other ML models, reaching an accuracy of 94.1%, a sensitivity of 94.1%, a specificity of 83.8%, a PPV of 94.1%, and an AUC score of 0.984.
Chen et al. [47] applied transfer learning for the automatic detection of PHPT from ultrasound images annotated by senior radiologists. The study involved 1000 ultrasound images containing PHPTs, of which 200 images were used to evaluate the developed model. For this purpose, they employed three well-established Convolutional Neural Networks to analyze the PHPT ultrasound images and suggest potential features underlying the presence of PHPT. The best-performing network achieved a recall of 0.956.
In a recent work by Apostolopoulos et al. [48], the authors developed a three-path VGG19-based network to identify abnormal PGs in the early MIBI, late MIBI and TcO4 thyroid scan images. The study includes 632 parathyroid scans (414 PG, 168 nPG). The proposed model, which is called ParaNet, exhibits top performance, reaching an accuracy of 96.56% in distinguishing between abnormal PGs and normal PGs scans. Its sensitivity and specificity are 96.38% and 97.02%, respectively. PPV and NPV values are 98.76% and 91.57%, respectively.
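The five metrics quoted above all derive from a single confusion matrix, which makes it possible to check that they are mutually consistent. The sketch below uses the class counts quoted in the text (414 abnormal and 168 normal scans). The individual cell counts (399 true positives, 15 false negatives, 163 true negatives, 5 false positives) are an assumption inferred from the reported percentages; they are not reported in this review.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # abnormal scans correctly flagged
        "specificity": tn / (tn + fp),  # normal scans correctly cleared
        "ppv": tp / (tp + fp),          # flagged scans that are truly abnormal
        "npv": tn / (tn + fn),          # cleared scans that are truly normal
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed cell counts, inferred from the reported percentages (414 PG, 168 nPG).
metrics = diagnostic_metrics(tp=399, fn=15, tn=163, fp=5)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under these assumed counts, the computed values reproduce the reported sensitivity, specificity, PPV, NPV, and accuracy to two decimals. Note that the counts sum to 582 scans (414 + 168).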

Discussion
The research study identified and described 13 studies addressing the issue of PG identification and localization, PHPT, PTA, and MGD detection. Most studies focus on PG detection (42%), while PG localization is addressed in 33% of the total studies.
A significant amount of research has addressed preoperative detection using ultrasound and scintigraphy image sources. Preoperative detection of abnormalities is also addressed using ML approaches without deploying any imaging modality. Significant clinical and demographical predictors are revealed in the literature, contributing to the diagnosis of PHPT and MGD. Overall, preoperative methods are introduced in 54% of the reviewed publications (Figure 2). The studies report very promising results in preoperative classification tasks, such as normal-abnormal image discrimination or MGD prediction using clinical factors. The observed sensitivity varies between 82 and 96 per cent. The majority of studies report an accuracy that ranges between 91 and 96 per cent. However, preoperative PG localization is not yet explored. It is expected that localizing each abnormal PG in thyroid scans would yield a number of false positive findings, thereby making this task very challenging.
The research community is also making efforts to provide novel appliances and topologies to improve the detection of findings during surgery. Most relevant publications accompany their technological solutions with traditional ML and DL approaches to enhance detection accuracy or to provide assisting computational tools. Studies presenting technological and AI solutions that deliver during surgery report better results regarding PG localization. It is observed that none of the reviewed research works integrates clinical factors and imaging data. It is expected that combining any available demographic, clinical, and biological data, where existent, would improve the diagnostic accuracy of image-based approaches and reduce the many reported false positive cases. Despite their promising results, most studies use very few participants to train and evaluate their models. Most studies address this issue by extracting many video frames and slices from each patient. Therefore, the number of samples is adequate for model training. However, the datasets remain biased because the utilized frames/slices share the same origin. As a result, the reported results might be misleading. Still, there are studies that use more participants and report acceptable results and meaningful conclusions [41,46,48]. Most studies are validated on cohorts that do not exceed 500 participants. As a result, the reported results, though undeniably encouraging, are not yet well-grounded. While the number of studies published peaks after 2021, the research on PG identification and localization, PHPT, PTA, and MGD detection is still constrained. The absence of publicly available data repositories covering relevant tasks impedes Biomedical Engineering experts from exploring the full potential of Artificial Intelligence in this domain. Nevertheless, the significant results reported in the literature undeniably open new horizons.
Specifically, in PG detection and localization, the emergence of large-scale image datasets could accelerate the exploration of novel and state-of-the-art DL approaches and provide trustworthy solutions for medical assisting tools.
The emerging field of eXplainable Artificial Intelligence (XAI) as a set of algorithms and methods providing explanations can increase the medical importance and usefulness of AI methods in PHPT detection and PG abnormality discrimination. Most studies do not use explainable algorithms that inform the user of their decisions. As an example of explainable AI, the study of Imbus et al. [46] uses a decision tree for discriminating MGD from SGD. Decision trees are inherently self-explanatory. However, in studies where an ensemble of decision trees is employed (e.g., [36]), it is difficult to provide explanations. In studies where DL is employed (e.g., [43,48]), post-hoc explainability methods, such as the Grad-CAM algorithm [49], are not considered. Future studies could consider adopting explainable strategies to enhance their results and provide frameworks that are meaningful in everyday practice.
It was observed that many studies do not extensively report their methodology in terms of the employed ML and DL algorithms. Moreover, the majority of studies employ basic AI methods without mentioning any parameter tuning. For example, in studies where decision trees are designed, the maximum number of leaf nodes and the maximum depth are not documented.
It is concluded that more effort should be put into designing and furnishing problem-specific models with well-grounded parameter selection. As an example of such a methodology, the authors of [41] performed Bayesian hyperparameter tuning, which improved their results.
Finally, there is no established and documented method for validating the results. Some studies rely solely on a train-test split, without any cross-validation method. This approach is only suitable when large amounts of data are involved. In studies with few samples, partitioning the dataset at random may introduce biases. Other studies perform cross-validation (e.g., 3-fold, 10-fold) but do not consider control groups and external test sets. As a result, comparisons between studies are difficult.
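One way to combine cross-validation with the concern that frames or slices from the same patient share an origin is to assign whole patients, rather than individual samples, to folds. The sketch below is a minimal illustration of this idea; the function name, the round-robin fold assignment, and the toy patient identifiers are assumptions and do not describe the procedure of any reviewed study.

```python
def grouped_k_fold(patient_ids, k):
    """Yield (train_idx, test_idx) pairs in which all samples of a patient share one fold."""
    unique_patients = sorted(set(patient_ids))
    # Deterministic round-robin assignment of patients to k folds.
    folds = [unique_patients[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for i, p in enumerate(patient_ids) if p in held]
        train_idx = [i for i, p in enumerate(patient_ids) if p not in held]
        yield train_idx, test_idx


# Hypothetical example: eight frames extracted from four patients.
patient_ids = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
for train_idx, test_idx in grouped_k_fold(patient_ids, k=2):
    print("train:", train_idx, "test:", test_idx)
```

With this kind of split, no patient contributes samples to both the training and the test partition of a fold, so the evaluation is not inflated by near-duplicate frames. A production pipeline would typically also shuffle patients and balance class proportions across folds, which this sketch omits.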
Moreover, the robustness of the proposed pipelines regarding acquisition device variation is not explored. It is usual that different devices yield different image characteristics, e.g., resolution, pixel intensities, and video frames. Some variations regarding the models' effectiveness are expected and should be investigated.

Conclusions
This review study presented 13 works addressing the issue of PG identification and localization, PHPT, PTA, and MGD detection. The reviewed studies were focused on both preoperative and operative solutions. Significant clinical and demographical predictors are revealed in the literature, contributing to the effective diagnosis of PHPT and MGD. Most relevant publications accompany their technological solutions with traditional ML and DL approaches to enhance the detection accuracy or to provide assisting computational tools. In the task of PG detection and localization, the emergence of large-scale image datasets could accelerate the exploration of novel and state-of-the-art DL approaches and provide trustworthy solutions for medical assisting tools. Moreover, explainable algorithms must be introduced to enhance the results and increase the significance of the proposed methods.