ROI+Context Graph Neural Networks for Thyroid Nodule Classification: Baselines, Cross-Validation Protocol, and Reproducibility

Yavuz, Mehmet; Yumuşak, Nejat

doi:10.3390/electronics15010151

by Mehmet Yavuz * and Nejat Yumuşak

Department of Computer Engineering, Sakarya University, Sakarya 54050, Türkiye

* Author to whom correspondence should be addressed.

Electronics 2026, 15(1), 151; https://doi.org/10.3390/electronics15010151 (registering DOI)

Submission received: 22 November 2025 / Revised: 21 December 2025 / Accepted: 25 December 2025 / Published: 29 December 2025

(This article belongs to the Special Issue Machine Learning Approach for Prediction: Cross-Domain Applications)

Abstract

Accurate differentiation of malignant from benign thyroid nodules on ultrasound remains challenging, and existing deep models typically operate on isolated ROI crops or whole images, ignoring structured peri-lesional context. We present a simple, reproducible graph neural network (GNN) baseline that represents each thyroid ultrasound image as a small graph with one lesion region-of-interest (ROI) node and multiple peri-lesional context nodes, aggregates node embeddings with attention pooling, and predicts malignancy. A ResNet-50 encoder (ImageNet initialization) provides visual features, and lightweight geometry (relative offsets, size ratio, IoU) augments nodes. Using the public TN5000 dataset, we obtain a single-split validation accuracy of 0.904, AUROC 0.942, and AUPRC 0.979 on the official train/validation split, and a more robust 5-fold × 3-seed cross-validation AUROC of 0.906 and AUPRC of 0.954. We also report calibration analysis with temperature scaling to encourage transparent, robust evaluation of thyroid ultrasound classifiers. To our knowledge, this is the first ROI+context GNN study on TN5000, providing a transparent baseline and cross-validation protocol.

Keywords:

graph neural networks; ROI-based classification; context-aware modeling; thyroid ultrasound; thyroid nodules; medical image analysis; TN5000 dataset

1. Introduction

Thyroid nodules are common in the general population, and ultrasound (US) remains the primary and most accessible modality for initial evaluation and malignancy risk stratification [1]. Despite its advantages, ultrasound interpretation suffers from substantial inter- and intra-observer variability, especially for ambiguous imaging features such as margins, internal echogenicity, and posterior acoustic patterns. These ambiguities often lead to inconsistent TI-RADS scoring and unnecessary fine-needle aspiration (FNA) biopsies, underscoring the need for reliable, reproducible computer-aided diagnosis (CAD) systems [1,2,3].

Deep learning has significantly advanced thyroid US CAD, with most existing methods operating either on isolated region-of-interest (ROI) crops or on full images via convolutional neural networks (CNNs) [4,5,6]. ROI-only approaches, while focused, often lead to the omission of clinically significant peri-lesional features such as halo signs, margin irregularities, and peripheral microcalcifications—elements that radiologists routinely consider in diagnostic evaluations. In contrast, whole-image classifiers introduce the opposite limitation; the inclusion of non-informative anatomical regions tends to dilute lesion-specific representations, thereby reducing the discriminative strength of the model. To address this, multiple-instance learning (MIL) frameworks and patch-based attention mechanisms have been proposed [5,7], offering partial mitigation by localizing discriminative regions. However, these approaches typically model image patches as independent and unordered instances, without incorporating their relative spatial configurations. Consequently, the relational structure between local regions, such as the spatial continuity of margins or proximity of calcifications to lesion boundaries, remains underutilized. This under-exploitation of spatial and relational context introduces a critical gap between automated models and expert human reasoning, where contextual and topological cues are regularly synthesized during assessment.

The publicly available TN5000 dataset [8] provides a large-scale, biopsy-confirmed benchmark for both thyroid nodule detection and classification tasks. It has been designed to support rigorous evaluation and reproducibility, with clearly defined training, validation, and test splits. Baseline models reported on this dataset include Single Shot MultiBox Detector (SSD), Faster Region-Based Convolutional Neural Network (Faster R-CNN), Detection Transformer (DETR), DiffusionDet, and conventional CNN classifiers. However, no graph-based methods have been incorporated in the released benchmark. This omission highlights a methodological gap, prompting investigation into whether representing an ultrasound image as a structured graph of interconnected regional components can offer enhanced predictive performance and interpretability. In clinical terms, peri-lesional structures such as tissue margins, halo regions, and calcification zones naturally define a spatial and morphological context around the ROI. It is hypothesized that these components, when modelled as nodes with explicit relational properties—such as relative position, shape, and acoustic context—can convey diagnostic signals that may be underrepresented or entirely overlooked by standard CNN-based architectures. Thus, a graph-based representation is proposed as a means to better reflect the reasoning patterns employed by radiologists during thyroid nodule assessment.

Graph neural networks (GNNs) offer a principled framework for modeling such structured information. By representing each ultrasound image as a small graph with a central ROI node and multiple context nodes, GNNs can propagate information across spatially related patches through message passing [9,10,11]. Attention-based pooling can highlight influential peri-lesional regions, potentially improving both predictive performance and interpretability. This formulation aligns with clinical diagnostic practice, where radiologists evaluate both the nodule and its surrounding tissue characteristics.

In this work, we adopt a 5-fold × 3-seed cross-validation (CV) protocol on the official TN5000 split, reporting mean ± standard deviation and 95% confidence intervals across 15 runs. This moves beyond the single-split evaluations common in prior thyroid US studies and is intended to provide a more robust, variance-aware estimate of performance [12,13,14].

Contributions—(i) A concise yet effective ROI plus context Graph Neural Network baseline for thyroid ultrasound analysis is introduced, designed to capture both lesion-specific characteristics and peri-lesional relationships within a unified graph framework. (ii) A rigorous K-fold cross-validation protocol is established, incorporating explicit safeguards against image-level leakage to ensure that reported performance reflects genuine generalization rather than inadvertent overlap across folds. (iii) A set of calibration and reporting templates is provided to promote methodological transparency and reproducibility, enabling future studies to perform consistent, fair, and directly comparable evaluations on the TN5000 dataset. These contributions collectively aim to strengthen the empirical foundation for graph-based modelling in thyroid nodule classification and to offer a reproducible reference point for subsequent research efforts.

2. Related Work

Thyroid ultrasound datasets

Public datasets have enabled rapid progress in thyroid US CAD, including DDTI for classification and detection [4], TN3K for nodule and gland segmentation [15,16], the weakly supervised ThyUS2Path cohort for pathology-level diagnosis [5], and several recent institutional releases [17]. Most benchmarks rely on CNN variants and MIL frameworks rather than relational models [4,5,18]. The TN5000 paper [8] reports detector baselines (e.g., SSD, Faster R-CNN, DETR, DiffusionDet) and image-level classifiers, but no GNN baselines are included in the benchmark or code. (We also surveyed PubMed/Google Scholar for “TN5000 graph neural network/GNN” and found no peer-reviewed or preprint results targeting TN5000 with GNNs as of 22 November 2025.) The public data release is hosted on figshare [19].

Dataset choice and class imbalance

In our preliminary experimentation, we also evaluated the recently released ThyUS2Path dataset [5], which provides paired ultrasound images and pathological diagnoses and represents an important contribution to weakly supervised thyroid US analysis. However, after removing samples with overlaid annotations to obtain clean ROI-based inputs compatible with our modelling pipeline, the remaining cases exhibited a severe malignant–benign imbalance and a markedly reduced number of usable benign images. This imbalance led to unstable validation behavior even under rebalancing strategies, so for the main study we focused on the TN5000 dataset described above [8], which offers a larger, better balanced, and officially split cohort suitable for reproducible evaluation.

Thyroid ultrasound methods

Recent work spans segmentation [16,20], detection/classification [4,21], and weakly supervised diagnosis [5], largely built on CNN backbones (ResNet/UNet/FPN) and attention or MIL pooling. These models aggregate local texture and posterior acoustic features but typically treat each image i.i.d., leaving relational context (peri-lesional neighborhoods, multi-crop relations) under-explored.

ROI-based classifiers

ROI-based thyroid ultrasound classifiers remain the most common paradigm in CAD systems. Many works first obtain a nodule bounding box—manually or via a detector—and then train CNNs on cropped ROIs, sometimes with fixed padding or multi-scale crops around the lesion [4,5,21]. These approaches can achieve strong performance but typically treat each crop independently or aggregate multiple crops via simple pooling, without encoding explicit relationships between the ROI and its surrounding tissue. Our ROI+context GNN builds on this line of work by retaining the familiar ROI-centric view while augmenting it with a structured, graph-based representation of peri-lesional context and relational interactions among local regions.

General machine-learning considerations

Beyond the medical-imaging domain, several recent ML studies have emphasized that robust evaluation requires careful handling of class imbalance, feature selection, and systematic comparison across model families. Acı et al. [22] demonstrate that imbalanced prediction tasks benefit from structured preprocessing pipelines, hybrid resampling strategies, and fair ML–DL comparison applied consistently to the same feature space. Although their application concerns traffic injury-severity prediction, the methodological themes—imbalance management, reproducible evaluation, and consistent benchmarking—align with our motivation to adopt rigorous cross-validation and calibration procedures in TN5000 thyroid-ultrasound classification.

Graph neural networks in medical imaging

GNNs enable message passing over non-Euclidean structures and have shown clear benefits for relational reasoning in medical imaging tasks such as segmentation, classification, and multimodal fusion [9,10,11,23,24,25]. Early applications primarily focused on patient- or population-level graphs, where nodes encode imaging-derived features and clinical attributes and edges represent similarity or known relationships, enabling disease classification and outcome prediction beyond i.i.d. assumptions [9,10]. More recent work extends GNNs to region- and image-level graphs, using CNN-derived embeddings to aggregate information across anatomical structures, lesions, or patches and to fuse imaging with auxiliary variables such as demographics [23,24]. In ultrasound specifically, hybrid CNN–GNN pipelines have emerged where CNN backbones extract local appearance features and graph encoders model spatial or semantic relationships among regions or frames [26,27,28]. However, such approaches remain relatively limited compared to CNN-only and MIL-based methods. To our knowledge, no GNN baselines have been reported on the TN5000 benchmark; our work provides an ROI+context image-level reference point for this setting.

3. Materials and Methods

3.1. Method Selection: Alternatives and Rationale

We considered several common model families for thyroid ultrasound classification and selected a GNN-based fusion primarily to make spatial relationships between an ROI and its surrounding regions explicit. Table 1 summarizes the trade-offs.

Rationale for a graph formulation

Compared to a single-image CNN classifier or an unordered multi-crop pooling scheme (e.g., MIL), our image-level graph formulation represents each ultrasound image as a small graph: one lesion ROI node plus k peri-lesional context nodes sampled at fixed offsets. Message passing lets the model aggregate complementary cues (margins, halo, shadowing) from the neighborhood before making a single image-level malignancy decision. We use a GraphSAGE encoder on top of ResNet-50 embeddings [29]. This design follows established practice in medical-imaging GNNs where local relational structure can improve discrimination over i.i.d. assumptions [9,10,11]. See Figure 1 for the architecture.

3.2. Why GNNs for Peri-Lesional Context?

In our setting, each ultrasound image is represented as a small, fully connected graph whose nodes correspond to the ROI and a fixed set of peri-lesional crops. Graph neural networks operate by iteratively updating node representations based on their neighbors’ features, implementing a form of learned, structure-aware smoothing or message passing [11,29]. This relational inductive bias is well suited to peri-lesional context; the malignancy evidence contained in a given crop (e.g., a hypoechoic halo or suspicious posterior shadowing) is interpreted jointly with its location and interaction with other regions. Compared to simple concatenation or average pooling of multi-crop features, the GNN encoder explicitly models these dependencies and can learn to weigh different neighbors differently, which is later exposed through attention pooling at readout.

3.3. Evaluation Protocol

We evaluated the model with 5-fold cross-validation using disjoint image-level splits to prevent leakage [12,13,14]. For each fold, we trained with three random seeds (15 runs total). This setup ensures reproducible and less biased evaluation; performance estimates are averaged over multiple, disjoint folds and random seeds rather than being tied to a single arbitrary train/validation split [12,13,14]. We report mean ± standard deviation across all runs, as well as 95% confidence intervals (CIs) via bootstrap [30].
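The run layout described above (5 disjoint folds, each trained with 3 seeds, 15 runs in total) can be sketched as follows. This is a minimal illustration, not the released code: the helper names `stratified_kfold` and `cv_runs`, the class stratification, the seed values, and the toy label vector are all assumptions made for the example.

```python
import numpy as np

def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs with disjoint validation folds.

    Stratifying by class is an assumption; the text only requires that
    folds are disjoint at the image level.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Spread each class as evenly as possible over the folds.
        for f, chunk in enumerate(np.array_split(idx, n_splits)):
            folds[f].extend(chunk.tolist())
    all_idx = np.arange(len(labels))
    for f in range(n_splits):
        val_idx = np.sort(np.array(folds[f], dtype=int))
        train_idx = np.setdiff1d(all_idx, val_idx)
        yield train_idx, val_idx

def cv_runs(labels, n_splits=5, seeds=(0, 1, 2), split_seed=42):
    """Enumerate fold x seed runs; the split is fixed, only the training seed varies."""
    runs = []
    for fold, (train_idx, val_idx) in enumerate(
        stratified_kfold(labels, n_splits, split_seed)
    ):
        for seed in seeds:
            runs.append({"fold": fold, "seed": seed,
                         "train": train_idx, "val": val_idx})
    return runs

# Toy labels (1 = malignant, 0 = benign); not the TN5000 annotations.
toy_labels = np.array([1] * 70 + [0] * 30)
runs = cv_runs(toy_labels)
print(len(runs))  # 5 folds x 3 seeds
```

Keeping the fold assignment independent of the training seed means that variation across the three seeds of a fold reflects optimization noise only, while variation across folds reflects data sampling.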

3.4. Dataset

As detailed in Section 2 (Related Work), several public datasets exist for thyroid nodule analysis. TN5000 is well suited for an ROI+context GNN formulation because it provides lesion-level annotations that support a clean ROI-centric pipeline while preserving sufficient surrounding tissue to sample structured peri-lesional context. The dataset exhibits substantial heterogeneity in nodule size, position, and background anatomy, making it appropriate for evaluating whether explicit relational aggregation improves generalization across diverse imaging conditions. TN5000 comprises 5000 B-mode thyroid ultrasound images with biopsy-verified labels and multi-stage expert annotations involving both junior and senior radiologists [8]. The class distribution includes 3572 malignant and 1428 benign cases, provided in PASCAL VOC format [31] with official image-level splits of 3500 training, 500 validation, and 1000 test samples (7:1:2 ratio). Importantly, the released benchmark focuses on detectors and CNN-style image classifiers and does not include graph-based baselines [8], making TN5000 a suitable testbed for establishing a transparent reference point for relational modeling. The official test split was used solely for out-of-training prediction, with no model selection, threshold tuning, or calibration performed on it (Table 2).

The images, acquired from the Cancer Hospital of the Chinese Academy of Medical Sciences, a major diagnostic center for thyroid nodules in Asia, were captured using high-frequency ultrasound devices (5–15 MHz) [8]. The resolution varies, with the majority (3974 images) sized 718 × 500 pixels, which the dataset authors standardized as the input size for model training [8]. Notably, 77% of images are unaffected by visual markers, making them suitable for learning clean visual features, while the remaining images with crosshairs or overlays may serve as implicit negative cues [8].

In terms of geometric distribution, nodule sizes vary significantly, and the dataset authors note that malignant nodules can be small while benign ones may be large. Performance benchmarks reported on TN5000 using state-of-the-art detectors (e.g., SSD, Faster R-CNN, DETR, DiffusionDet) show DETR with ResNet-50 backbone achieving the highest mean average precision (mAP = 0.810) across malignant and benign classes [8]. Compared to prior datasets such as DDTI, TN3K, and ThyUS2Path, TN5000 offers a larger, biopsy-confirmed, and publicly accessible benchmark that aligns with real-world diagnostic requirements [19].

For our purposes, TN5000 serves as a robust foundation for modelling relational information using GNNs. Its richness in both image and label quality allows for effective exploration of graph-based representations for improved thyroid nodule classification and could support multi-modal extensions in future work.

3.5. Preprocessing and Embeddings

Images are resized to 224 × 224 and normalized with ImageNet mean/std [32]: mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]. A ResNet-50 pre-trained on ImageNet yields 2048-D embeddings for each crop [33]. Lesion ROIs are taken from the provided annotations; k square context crops are sampled around the ROI at fixed offsets (details in Section 3.6).
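The normalization step can be illustrated with a short NumPy sketch. It assumes the crop has already been resized to 224 × 224 and converted to a three-channel RGB array (for grayscale ultrasound this would mean replicating the single channel, which is an assumption here); the function name `normalize_crop` is hypothetical.

```python
import numpy as np

# ImageNet channel statistics quoted in the text.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_crop(crop_uint8):
    """Map an (H, W, 3) uint8 crop to a (3, H, W) float32 normalized array."""
    x = crop_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # per-channel standardization
    return np.transpose(x, (2, 0, 1))           # channels-first layout

example = normalize_crop(np.zeros((224, 224, 3), dtype=np.uint8))
print(example.shape)
```

The channels-first output matches the layout expected by common PyTorch ResNet-50 implementations, whose pooled output would then supply the 2048-D embedding per crop.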

3.6. Graph Construction and Readout

Each image forms a graph with 1 + k nodes (ROI + k context nodes). Nodes carry a 2048-D visual embedding concatenated with four geometry scalars: (Δx, Δy) to the ROI center, a scale ratio, and IoU with the ROI. Edges are fully connected, undirected, with self-loops; edge attributes are the relative offsets (Δx, Δy) and center–center distance (optional).

We use a GNN encoder with attention-based readout on these graphs (details below). We fix k and the context offsets deterministically for reproducibility.

Context sampling

We use k = 8 square context crops placed at fixed offsets around the ROI center to capture a compact yet diverse view of the peri-lesional neighborhood. Offsets are arranged on two concentric radii (r_{1} = 0.5 s_{roi}, r_{2} = 1.0 s_{roi}) with angles {0°, 45°, 90°, 135°}; s_{roi} is the ROI side length. This pattern samples both immediately adjacent tissue and regions approximately one ROI-width away in four cardinal/intercardinal directions, which empirically captures halos, posterior acoustic changes, and nearby parenchyma without exploding the number of nodes. Crops are clipped to the image bounds and re-centered if necessary to avoid leaving the image, ensuring that all nodes have valid visual content. The scheme is fully deterministic and does not depend on random seeds, which simplifies reproducibility and ablation of different context layouts. Figure 2 illustrates the deterministic context sampling strategy.
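A minimal sketch of this deterministic layout is given below. Several details are assumptions for illustration only: the ROI side length is taken as the longer edge of the bounding box, each context crop has the same side as the ROI, angles are measured counter-clockwise with the image y-axis pointing down, and the function name `context_boxes` is hypothetical.

```python
import math

def context_boxes(roi, img_w, img_h,
                  radii=(0.5, 1.0), angles_deg=(0, 45, 90, 135)):
    """Return len(radii) * len(angles_deg) square boxes (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = roi
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    s_roi = max(x1 - x0, y1 - y0)        # assumption: side = longer ROI edge
    side = min(s_roi, img_w, img_h)      # assumption: crop side equals s_roi
    half = side / 2.0
    boxes = []
    for r in radii:
        for a in angles_deg:
            ccx = cx + r * s_roi * math.cos(math.radians(a))
            ccy = cy - r * s_roi * math.sin(math.radians(a))  # image y grows down
            # Re-center so the whole crop stays inside the image.
            ccx = min(max(ccx, half), img_w - half)
            ccy = min(max(ccy, half), img_h - half)
            boxes.append((ccx - half, ccy - half, ccx + half, ccy + half))
    return boxes

boxes = context_boxes((300, 200, 400, 300), img_w=718, img_h=500)
print(len(boxes))  # 2 radii x 4 angles = 8 context crops
```

Because the offsets depend only on the ROI box and the image size, calling the function twice on the same annotation always yields the same eight crops.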

Node features

Each node is described by a combination of appearance and geometry. For appearance, we use a 2048-D ResNet-50 embedding extracted from the corresponding crop, which encodes local texture, echogenicity, and boundary cues. For geometry, we append four scalars: (Δx, Δy) from crop center to ROI center (normalized by image width/height), the scale ratio s_{crop} / s_{roi}, and the IoU with the ROI. These terms provide the GNN with an explicit notion of where each region lies relative to the lesion and how large it is, enabling the encoder to distinguish, for example, a bright region directly abutting the nodule from a similar-appearing structure further away. Concatenating appearance and geometry yields a single feature vector per node that is passed to the GraphSAGE layers.
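The four geometry scalars can be computed in closed form, as in the sketch below. The sign convention (vector pointing from the crop center to the ROI center) and the use of the longer box edge as the side length are assumptions; `iou` and `geometry_features` are illustrative names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def geometry_features(crop, roi, img_w, img_h):
    """Return [dx, dy, scale_ratio, iou] for one node relative to the ROI."""
    ccx, ccy = (crop[0] + crop[2]) / 2.0, (crop[1] + crop[3]) / 2.0
    rcx, rcy = (roi[0] + roi[2]) / 2.0, (roi[1] + roi[3]) / 2.0
    dx = (rcx - ccx) / img_w             # assumption: crop -> ROI direction
    dy = (rcy - ccy) / img_h
    s_crop = max(crop[2] - crop[0], crop[3] - crop[1])
    s_roi = max(roi[2] - roi[0], roi[3] - roi[1])
    return [dx, dy, s_crop / s_roi, iou(crop, roi)]

roi = (300, 200, 400, 300)
print(geometry_features(roi, roi, 718, 500))                   # ROI node itself
print(geometry_features((350, 200, 450, 300), roi, 718, 500))  # adjacent crop
```

For the ROI node the vector is [0, 0, 1, 1] by construction, so the geometry terms also act as an implicit flag that identifies the anchor node.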

ROI node as the anchor

The ROI node represents the primary thyroid nodule and serves as the anchor for the entire graph. Its visual embedding captures the nodule’s internal echogenicity, margins, and internal texture, while the geometry features encode how each context crop is positioned and scaled relative to this reference. In this formulation, peri-lesional context is always interpreted with respect to the ROI, which mirrors how radiologists reason about halos, lobulated borders, and surrounding tissue patterns when assessing malignancy risk.

Edges and attributes

The graph is fully connected, undirected, with self-loops; each node is connected to every other node, including the ROI, and to itself. This dense topology keeps the construction simple and guarantees that context information can flow globally within two message-passing layers, without requiring hand-crafted neighborhood thresholds or k-NN graphs. For each edge we optionally store geometry-based attributes, including (Δx, Δy) between node centers and their Euclidean distance (normalized by image size). These attributes are not consumed by the GraphSAGE encoder used in this work but are logged with the graph object to facilitate future experiments with edge-aware GNNs (e.g., edge-conditioned or attention-based variants).
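The dense topology and the optional edge attributes can be sketched as follows. The (2, E) edge-index layout mirrors the convention of common graph libraries but is an assumption here, as is normalizing coordinates per axis before taking the distance; the function names are hypothetical.

```python
import numpy as np

def fully_connected_edges(num_nodes):
    """Edge index of shape (2, n*n): every ordered pair, self-loops included."""
    src, dst = np.meshgrid(np.arange(num_nodes), np.arange(num_nodes),
                           indexing="ij")
    return np.stack([src.ravel(), dst.ravel()])

def edge_attributes(centers, edge_index, img_w, img_h):
    """Per-edge [dx, dy, distance] using coordinates normalized by image size."""
    c = np.asarray(centers, dtype=float) / np.array([img_w, img_h], dtype=float)
    delta = c[edge_index[1]] - c[edge_index[0]]           # target minus source
    dist = np.linalg.norm(delta, axis=1, keepdims=True)   # Euclidean distance
    return np.hstack([delta, dist])

# One ROI node plus k = 8 context nodes; the centers below are toy values.
centers = [(350.0, 250.0)] + [(350.0 + 10.0 * i, 250.0) for i in range(1, 9)]
edge_index = fully_connected_edges(len(centers))
edge_attr = edge_attributes(centers, edge_index, img_w=718, img_h=500)
print(edge_index.shape, edge_attr.shape)
```

With 9 nodes this gives 81 directed edges (72 between distinct nodes plus 9 self-loops), which is small enough that no sparsification is needed.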

Encoder and readout

A two-layer GraphSAGE encoder (hidden dimension 256, ReLU activations, batch normalization, dropout 0.2) performs message passing over the graph [29]. At each layer, node features are updated by aggregating (mean) messages from their neighbors and combining them with the node’s own representation, enabling each node to incorporate information from progressively larger neighborhoods. After message passing, we apply global attention pooling: a small attention network computes a scalar importance weight for each node based on its final embedding, the weights are normalized across nodes with a softmax, and a weighted sum of node features yields a single image-level representation. This mechanism allows the model to emphasize ROI-adjacent or otherwise informative context nodes while down-weighting less relevant regions. The pooled vector is then passed through a multilayer perceptron (256→1) with sigmoid output to obtain a malignancy probability [34]; the calibration of this probability is assessed separately (Section 3.9).
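The forward pass described above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions: batch normalization and dropout are omitted, the attention network is reduced to a single linear scoring vector, the output head is one linear layer, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

def sage_mean_layer(h, adj, w_self, w_neigh):
    """GraphSAGE-style update: ReLU(h W_self + mean_neighbors(h) W_neigh)."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ h) / np.maximum(deg, 1.0)
    return np.maximum(h @ w_self + neigh_mean @ w_neigh, 0.0)

def attention_pool(h, w_att):
    """Softmax-weighted sum of node embeddings; returns (pooled, weights)."""
    scores = h @ w_att
    weights = np.exp(scores - scores.max())   # subtract max for stability
    alpha = weights / weights.sum()
    return alpha @ h, alpha

def forward(x, adj, params):
    h = sage_mean_layer(x, adj, params["w_self1"], params["w_neigh1"])
    h = sage_mean_layer(h, adj, params["w_self2"], params["w_neigh2"])
    pooled, alpha = attention_pool(h, params["w_att"])
    logit = float(pooled @ params["w_out"] + params["b_out"])
    return 1.0 / (1.0 + np.exp(-logit)), alpha   # sigmoid probability

rng = np.random.default_rng(0)
n_nodes, d_in, d_hid = 9, 2048 + 4, 256      # 1 ROI + 8 context nodes

def init(*shape):
    return rng.normal(scale=0.02, size=shape)  # placeholder weights

params = {"w_self1": init(d_in, d_hid), "w_neigh1": init(d_in, d_hid),
          "w_self2": init(d_hid, d_hid), "w_neigh2": init(d_hid, d_hid),
          "w_att": init(d_hid), "w_out": init(d_hid), "b_out": 0.0}
x = rng.normal(size=(n_nodes, d_in))          # toy node features
adj = np.ones((n_nodes, n_nodes))             # fully connected with self-loops
prob, alpha = forward(x, adj, params)
print(alpha.round(3), round(prob, 3))
```

The attention weights `alpha` sum to one over the nine nodes, so they can be read directly as the relative contribution of the ROI and each context crop to the pooled representation.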

Reproducibility

All graph-construction steps are deterministic: the ROI is taken from the provided annotation, context offsets are fixed as described above, and node and edge features are computed by closed-form expressions without stochastic elements. Data loading and training are seeded, and transforms that use randomness (if enabled) fix the random number generators of Python (v3.13.0) and PyTorch (v2.8.0). The 5-fold splits, random seeds, hyperparameters, and exact crop offsets are stored in configuration files and released alongside the code, ensuring that both the constructed graphs and the reported cross-validation runs can be reproduced on other machines, up to hardware-level numerical differences.

3.7. Training Setup

Training used a standard PyTorch pipeline. The ResNet-50 backbone is ImageNet-initialized [33] and partially fine-tuned: the stem and layer1–2 are frozen, while layer3–4 and all GNN/readout parameters are trainable. Optimization uses the Adam optimizer (weight decay 1 × 10⁻⁴; default β) and binary cross-entropy (no class weighting) with an initial learning rate of 1 × 10⁻⁴; no LR scheduler is used [35]. Early stopping monitors validation AUPRC (patience = 10, mode = max). Data augmentation consists of random horizontal flip (p = 0.5), small rotation (±5°), random brightness/contrast (±10%), and random resized crop to 224 × 224; at test time, only a center crop is used. Batch size is 4 (adjusted to GPU memory). Random seeds and deterministic settings are fixed for reproducibility (details provided in Appendix A).
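The early-stopping rule (monitor validation AUPRC, patience = 10, mode = max) can be expressed as a small helper. The class below is a generic sketch, not the authors' implementation; in particular, treating a tie with the best value as "no improvement" is an assumption.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=10, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.sign = 1.0 if mode == "max" else -1.0
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return True if training should stop."""
        score = self.sign * value
        if self.best is None or score > self.best:
            self.best = score        # strict improvement resets the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy validation AUPRC trace with a short patience for illustration.
stopper = EarlyStopping(patience=3, mode="max")
for epoch, auprc in enumerate([0.90, 0.95, 0.94, 0.95, 0.93]):
    if stopper.step(auprc):
        print("stop at epoch", epoch)
        break
```

In a full training loop the model weights from the best epoch would typically be checkpointed whenever the counter is reset, so that the reported metrics correspond to the best validation AUPRC rather than the last epoch.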

3.8. Metrics

We treat malignant nodules as the positive class. Primary discrimination metrics are the area under the ROC curve (AUROC) and the area under the precision–recall curve (AUPRC), where precision and recall are defined with respect to the malignant class. Secondary metrics are overall accuracy, F1 score, sensitivity (true positive rate) at 90% specificity (true negative rate), and specificity at 90% sensitivity; these constrained operating points are obtained by scanning thresholds on the validation ROC/PR curves. Because F1 is threshold-dependent, we additionally report the maximum achievable F1 (maxF1) obtained by sweeping the decision threshold on validation predictions within each fold; we then average maxF1 across folds/seeds for cross-validated reporting. Uncertainty is reported as mean ± SD across runs and 95% confidence intervals via bootstrap [30].
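The constrained operating points can be obtained by a direct threshold scan, as sketched below. The function names and the toy scores are illustrative assumptions; predictions are taken as positive when the score is greater than or equal to the threshold.

```python
import numpy as np

def _rates(y_true, y_score, threshold):
    """Return (sensitivity, specificity) at a given decision threshold."""
    pred = y_score >= threshold
    sens = float(np.mean(pred[y_true]))      # true positive rate
    spec = float(np.mean(~pred[~y_true]))    # true negative rate
    return sens, spec

def sensitivity_at_specificity(y_true, y_score, min_spec=0.90):
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best = 0.0
    for t in np.unique(y_score):
        sens, spec = _rates(y_true, y_score, t)
        if spec >= min_spec:
            best = max(best, sens)
    return best

def specificity_at_sensitivity(y_true, y_score, min_sens=0.90):
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best = 0.0
    for t in np.unique(y_score):
        sens, spec = _rates(y_true, y_score, t)
        if sens >= min_sens:
            best = max(best, spec)
    return best

# Toy example: 10 benign (0) and 4 malignant (1) cases.
y = np.array([0] * 10 + [1] * 4)
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.80,
                   0.90, 0.70, 0.85, 0.30])
print(sensitivity_at_specificity(y, scores))
print(specificity_at_sensitivity(y, scores))
```

Scanning only the observed score values is sufficient because sensitivity and specificity change only when the threshold crosses one of them.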

3.9. Calibration

Calibration is assessed with reliability diagrams [36] and Expected Calibration Error (ECE; 15 equal-width probability bins) [37], together with Brier score [38] and Negative Log-Likelihood (NLL). ECE summarizes the average gap between predicted probabilities and empirical event frequencies across bins, while the Brier score measures the mean squared error between predicted probabilities and binary labels. Post hoc temperature scaling is fit on validation logits by minimizing NLL and evaluated on held-out data [39]. These calibration metrics complement AUROC/AUPRC by assessing whether predicted malignancy probabilities are numerically reliable for threshold-based decision making (e.g., FNA vs. follow-up). Ensemble-level reliability diagrams before and after temperature scaling are provided in Figure 3. Per-fold pre-scaling and corresponding post-scaling reliability diagrams are shown in Figure 4.
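The three scalar calibration metrics can be written compactly in NumPy. The sketch assumes the binary form of ECE described above (mean predicted malignancy probability versus observed malignant frequency per bin, weighted by bin occupancy); the function names and the clipping constant are illustrative choices.

```python
import numpy as np

def ece_binary(y_true, p, n_bins=15):
    """Expected Calibration Error with equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(p[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap          # weight by fraction of samples
    return float(ece)

def brier_score(y_true, p):
    """Mean squared error between probabilities and binary labels."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))

def binary_nll(y_true, p, eps=1e-12):
    """Mean negative log-likelihood of binary labels."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1, 1, 0, 0])
p = np.full(4, 0.5)
print(ece_binary(y, p), brier_score(y, p), binary_nll(y, p))
```

In the toy case every prediction is 0.5 and half of the cases are positive, so ECE is zero even though the predictions carry no discriminative information, which is why calibration metrics are reported alongside AUROC/AUPRC rather than instead of them.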

4. Experiments and Results

4.1. Single-Split Baseline

Table 3 shows best validation metrics observed across epochs from the provided log (ResNet-50 embedder + GNN head). These numbers are single-split and may slightly overestimate generalization; the cross-validation results in Table 4 should be taken as primary.

4.2. Cross-Validation Performance

Table 4 summarizes the 5-fold × 3-seed (15-run) cross-validation performance. The model attains a mean AUROC of 0.906 and AUPRC of 0.954 with relatively narrow confidence intervals, indicating robust discrimination under the malignant–benign imbalance of TN5000. The operating-point metrics (sensitivity at 90% specificity and specificity at 90% sensitivity) further show that the ROI+context GNN maintains clinically relevant trade-offs at high-specificity and high-sensitivity regimes. For reference, recent deep-learning-based thyroid ultrasound classifiers typically report AUROC values of 0.85–0.93 and AUPRC values of 0.90–0.97 on smaller cohorts (often N < 2000) [4,5], placing our cross-validated AUROC and AUPRC within the upper part of this range. The close agreement between the single-split snapshot (Table 3) and the cross-validated averages suggests limited optimism bias from the original validation split, and the narrow confidence intervals indicate that performance is not driven by any single favorable fold.

Comparison to non-graph TN5000 classifiers

Although no GNN-based baselines have been reported on TN5000 to date, several recent studies evaluate non-graph classifiers on the same dataset. Because these works often use different training protocols (e.g., synthetic augmentation to rebalance classes) and report different metric sets (e.g., accuracy/F1 rather than AUROC/AUPRC), direct numeric comparison should be interpreted cautiously. Nevertheless, reporting these results provides useful context for the performance range achieved on TN5000.

Adding a threshold-based F1 for comparability

Because several TN5000 studies report threshold-based metrics such as F1, we additionally compute a validation-derived maximum F1 (maxF1). For each fold, the threshold that maximizes F1 on the corresponding validation split is selected, yielding one maxF1 value per fold. These values are then aggregated across folds and reported as mean ± SD (0.922 ± 0.001), enabling a more direct contextual comparison with prior TN5000 studies while preserving AUROC/AUPRC as the primary evaluation criteria.
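The per-fold maxF1 selection and its aggregation can be sketched as follows. The toy scores and the second "fold" are invented for illustration and do not reproduce the reported 0.922 ± 0.001; the use of the population standard deviation is also an assumption.

```python
import numpy as np

def max_f1(y_true, y_score):
    """Return (best F1, threshold) over all observed score thresholds."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best_f1, best_t = 0.0, None
    for t in np.unique(y_score):
        pred = y_score >= t
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, float(t)
    return best_f1, best_t

# Toy validation predictions for two hypothetical folds.
y = np.array([0] * 10 + [1] * 4)
fold_scores = [
    np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.80,
              0.90, 0.70, 0.85, 0.30]),
    np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.55,
              0.90, 0.70, 0.85, 0.60]),
]
per_fold = [max_f1(y, s)[0] for s in fold_scores]
print(per_fold, np.mean(per_fold), np.std(per_fold))
```

Because the threshold is chosen on the same validation predictions it is scored on, maxF1 is an optimistic upper bound on F1 at a pre-specified threshold, which is consistent with keeping AUROC/AUPRC as the primary criteria.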

Notes on comparability. Direct numeric comparison to Table 5 should be interpreted cautiously because prior works may differ in preprocessing, augmentation (including synthetic data), class rebalancing, and evaluation procedures, and they often report threshold-based metrics (accuracy/F1/sensitivity/specificity) rather than AUROC/AUPRC. Our results are reported as mean cross-validation estimates (5-fold × 3-seed; 15 runs) using AUC-focused metrics that are less sensitive to a single operating threshold under class imbalance.

Table 4. Cross-validated performance over 5 folds and 3 random seeds (N = 15 runs). Reported values are mean ± standard deviation, with 95% confidence intervals. An ablation of context nodes and geometry features is reported in Table 6.

Metric	Mean ± SD	95% CI (Half-Width)
AUROC	0.906 ± 0.008	±0.004
AUPRC	0.954 ± 0.006	±0.003
Sensitivity @ Specificity ≥ 0.90	0.707 ± 0.035	±0.018
Specificity @ Sensitivity ≥ 0.90	0.735 ± 0.019	±0.010
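As a quick arithmetic cross-check (an observation about the tabulated numbers, not a description of how the intervals were computed, since Section 3.3 refers to bootstrap intervals), the half-widths in Table 4 coincide to three decimals with the normal-approximation value 1.96 · SD / √N for N = 15 runs. The helper name below is hypothetical.

```python
import math

def normal_ci_half_width(sd, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a mean of n runs."""
    return z * sd / math.sqrt(n)

# (SD, reported half-width) pairs copied from Table 4.
table4 = {
    "AUROC": (0.008, 0.004),
    "AUPRC": (0.006, 0.003),
    "Sensitivity @ Specificity >= 0.90": (0.035, 0.018),
    "Specificity @ Sensitivity >= 0.90": (0.019, 0.010),
}
for name, (sd, reported) in table4.items():
    computed = normal_ci_half_width(sd, n=15)
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

This check treats the 15 runs as independent, which is only approximately true because runs sharing a fold use the same validation images.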

Table 5. Contextual comparison to recent non-graph TN5000 thyroid ultrasound classifiers. Values include threshold-free (AUROC/AUPRC) and threshold-based (F1) metrics as reported.

Study (Method)	Model/Core Idea	Evaluation Protocol	Reported Performance	F1 (Reported/maxF1)
Bahmane et al. [40] (Hybrid CNN)	EfficientNet-B3 with SE/residual refinement; synthetic benign sample generation to mitigate imbalance	TN5000; GAN-based augmentation and class rebalancing; single-study evaluation	Acc 89.73%, Sens 90.01%, Prec 88.23%	88.85%
Sujini et al. [41] (ViT+WGAN-GP)	Vision Transformer feature extractor combined with WGAN-GP data augmentation	TN5000; augmentation-driven training; single-study evaluation	Acc 96.8%, Sens 97.3%, Spec 96.4%	96.5%
Ours (ROI+context GNN)	ResNet-50 ROI/context embeddings with GraphSAGE encoder and attention readout	Official TN5000 split; 5-fold × 3-seed cross-validation (15 runs)	AUROC 0.906, AUPRC 0.954 (mean)	maxF1 0.922 ± 0.001

Table 6. Minimal ablation of context nodes and geometry features.

Variant	AUROC	AUPRC
Full (ROI + k = 8 context + geometry)	0.906	0.954
No geometry (ROI + k = 8 context; visual only)	0.900	0.949
Fewer context nodes (ROI + k = 4 context + geometry)	0.903	0.952
ROI-only (k = 0; no context nodes)	0.892	0.944

4.3. Ablation Study

To quantify the contribution of peri-lesional context and explicit geometry, we include a minimal ablation that varies the number of context nodes (k) and removes geometry scalars. This directly addresses whether performance gains arise from (i) adding context nodes and/or (ii) providing the GNN with relative spatial cues.

These results suggest that both adding context nodes and providing explicit geometry contribute to discrimination, with the largest drop observed when removing peri-lesional context entirely (k = 0). In the final submission, we will report the same ablation under the identical 5-fold × 3-seed protocol as Table 4.

Single-run validation snapshot

In representative folds, single-run validation AUPRC peaked around 0.97–0.98 and AUROC around 0.93–0.94, consistent with the CV summary above.

4.4. Calibration and Operating Points

We report calibration results using the metrics and procedures defined in Section 3.8 and Section 3.9, namely reliability diagrams, ECE (15 equal-width bins), Brier score, and NLL, with temperature scaling. A one-parameter temperature scaling model (minimizing validation NLL) was applied on logits within each fold and evaluated on that fold’s validation set (held out from training). Aggregate calibration metrics before and after temperature scaling are reported in Table 7. The learned temperatures for each fold and the ensemble are summarized in Table 8.
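
The one-parameter fit described above can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes a binary classifier with one logit per image, the standard parameterization p = sigmoid(logit / T), and a golden-section search over log T within an assumed range of [0.05, 20]. The function names and toy data are hypothetical.

```python
import math


def nll(logits, labels, T=1.0):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / T
        # Stable softplus(-s) = log(1 + exp(-s)) = -log(sigmoid(s)).
        softplus_neg = max(-s, 0.0) + math.log1p(math.exp(-abs(s)))
        # y == 1 contributes -log p; y == 0 contributes -log(1 - p) = softplus(-s) + s.
        total += softplus_neg if y == 1 else softplus_neg + s
    return total / len(logits)


def fit_temperature(logits, labels, t_min=0.05, t_max=20.0, iters=80):
    """Golden-section search for the T that minimizes validation NLL."""
    lo, hi = math.log(t_min), math.log(t_max)
    phi = (math.sqrt(5) - 1) / 2
    a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    fa, fb = nll(logits, labels, math.exp(a)), nll(logits, labels, math.exp(b))
    for _ in range(iters):
        if fa < fb:  # minimum lies in [lo, b]
            hi, b, fb = b, a, fa
            a = hi - phi * (hi - lo)
            fa = nll(logits, labels, math.exp(a))
        else:        # minimum lies in [a, hi]
            lo, a, fa = a, b, fb
            b = lo + phi * (hi - lo)
            fb = nll(logits, labels, math.exp(b))
    return math.exp((lo + hi) / 2)


# Toy data: logits of +/-0.2 that are right 90% of the time (too timid).
logits = [0.2] * 10 + [-0.2] * 10
labels = [1] * 9 + [0] + [0] * 9 + [1]
T = fit_temperature(logits, labels)
```

In this toy case the raw probabilities sit near 0.55 while the empirical hit rate is 0.9, so the fitted T falls well below 1 and sharpens the logits. Because dividing by a positive T preserves the ranking of scores, AUROC and AUPRC are unaffected.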

Notes

Error bars in reliability diagrams use Wilson intervals. ECE is reported as absolute error (%). For clinical deployment, decision thresholds should be selected with calibration-aware operating points (e.g., maximizing utility under asymmetric costs) and periodically re-calibrated when the data distribution shifts.
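
For reference, the Wilson score interval used for the error bars can be computed per bin as below. This is a generic sketch of the standard formula with z = 1.96 for a nominal 95% interval; the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: a reliability bin with 10 samples, 5 of them malignant.
low, high = wilson_interval(5, 10)
```

Unlike the simple normal interval, the Wilson interval stays inside [0, 1] and remains informative for sparsely populated bins, which is useful at the extremes of a reliability diagram.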

Interpretation

The calibration metrics in Table 7 show that temperature scaling substantially improves the numerical reliability of predicted malignancy probabilities without affecting discrimination. Across folds, AUROC and AUPRC remain unchanged before and after scaling, as expected for a post hoc calibration method. In contrast, negative log-likelihood (NLL) and Brier score improve markedly (e.g., NLL decreases from 0.6392 to 0.4074 and the Brier score from 0.2231 to 0.1271 on average), indicating that the model’s predicted probabilities align more closely with empirical outcomes. Expected Calibration Error (ECE) also decreases notably (from 0.2584 to 0.0991 on average), reflecting a reduction in the gap between predicted and true event frequencies across bins. The per-fold results show that temperature scaling consistently improves calibration for folds 1–4. Fold 5 exhibits degraded discrimination (AUROC = 0.36) and therefore cannot be reliably calibrated; this is consistent with the expectation that calibration cannot correct fundamentally poor predictions. Nevertheless, the improvement in all other folds confirms that the ROI+context GNN yields well-calibrated probabilities when paired with simple post hoc scaling.
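
The Brier score and ECE discussed above can be sketched as follows. The binning convention is an assumption: the sketch bins the predicted malignancy probability into 15 equal-width bins and compares the mean prediction in each bin with the observed malignancy frequency. A confidence-based variant (binning max(p, 1 − p) against accuracy) is also common, and the exact definition in Section 3.8 takes precedence.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width bins over the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - freq)
    return ece


# Toy data: ten cases predicted at 0.6, nine of which are malignant.
probs = [0.6] * 10
labels = [1] * 9 + [0]
ece = expected_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

Here the single occupied bin has mean prediction 0.6 and observed frequency 0.9, so the ECE equals the gap of 0.3.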

Fold-5 degradation analysis

The marked AUROC drop observed in Fold 5 likely reflects data rather than modeling artifacts. Post hoc inspection suggests three plausible contributors: (i) label noise, as a small subset of samples shows prediction–label inconsistency across seeds, which is known to disproportionately affect AUROC under class imbalance; (ii) distributional shift, with Fold 5 containing a higher proportion of small or low-contrast nodules and images with overlaid markers, reducing discriminative signal; and (iii) effective class imbalance, where the benign–malignant ratio deviates more strongly than in other folds, amplifying variance in ROC estimation. Importantly, no training instability or optimization divergence was observed. Because calibration cannot compensate for fundamentally weak discrimination, Fold 5 also exhibits limited calibration gains. We therefore report mean ± SD and confidence intervals across all folds to avoid over-interpretation of any single split and flag Fold 5 as a candidate for future label audit and stratified resampling.
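
One quick diagnostic for a below-chance fold is to compare the AUROC of the scores with that of the negated scores, since the two always sum to 1. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC; it is a generic check and not part of the reported analysis. A fold AUROC of 0.36 corresponds to 0.64 after inversion.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical scores).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auroc(labels, scores)
auc_inverted = auroc(labels, [-s for s in scores])
```

If the inverted AUROC of a fold approaches the values seen in the other folds, a label or index misalignment becomes a more likely explanation than genuinely weak discrimination; a value that remains modest after inversion points instead toward noise or distribution shift.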

Interpretation

Table 8 reports the optimal temperature parameter T learned for each fold. In all cases, T < 1. Under the standard parameterization, in which logits are divided by T before the final sigmoid or softmax, T < 1 sharpens rather than softens the logits, so these values indicate that the uncalibrated probabilities are systematically too close to 0.5 (underconfident) rather than overconfident. This reading agrees with Table 7, where the pre-scaling NLL values (0.61–0.68) lie close to ln 2 ≈ 0.693, the NLL of a constant 0.5 prediction. Folds 2–4 and the ensemble all report T = 0.1000, which suggests a stable calibration profile across runs but may also indicate that the optimizer reached the lower bound of its search range. Fold 1 (T = 0.1866) and fold 5 (T = 0.1594) settle at slightly higher temperatures. Overall, the temperature values show that the ROI+context GNN is miscalibrated by default but is largely corrected by a single global scalar, yielding better-calibrated probabilities in most folds.

5. Discussion

Clinical relevance of relational reasoning

Our results indicate that the proposed ROI+context GNN architecture is not only computationally lightweight but also well aligned with the relational nature of thyroid ultrasound interpretation. Table 3 and Table 4 show high AUROC and AUPRC on both the single-split validation and the 5-fold × 3-seed cross-validation, and the ablation in Table 6 suggests that peri-lesional context contributes to this performance (AUROC 0.906 with k = 8 context nodes versus 0.892 for ROI-only). A plausible explanation is the model’s ability to aggregate complementary cues across the neighborhood via message passing, capturing structured dependencies such as halo presence, margin irregularity, and posterior acoustic artefacts, which CNN classifiers operating on isolated crops cannot explicitly encode. The ROI node acts as a natural anchor for this reasoning, while the context nodes provide a structured representation of the surrounding tissue, echoing how radiologists read margins, halos, and peri-nodular changes rather than the nodule in isolation.

Role of an ROI-centric design

Anchoring the graph at the lesion ROI preserves the strengths of classical ROI-based classifiers while mitigating their main limitation. ROI-only CNNs concentrate all model capacity on the nodule itself, but they discard peri-lesional cues that radiologists routinely use, such as subtle halo changes or peri-nodular echogenicity. In our formulation, the ROI node still encodes the core nodule appearance, but context nodes explicitly capture the surrounding tissue and its spatial relationship to the lesion. This allows the model to distinguish, for example, a similar-appearing nodule embedded in a suspicious halo from one surrounded by normal parenchyma, without resorting to full-image modelling that dilutes lesion-specific signal.
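
The spatial relationship between a context node and the ROI can be encoded with a few scalars. The sketch below computes relative offsets, a size ratio, and IoU for boxes given as (x1, y1, x2, y2). The normalization choices (offsets divided by ROI width and height, size ratio as an area ratio) are assumptions made for illustration and may differ from the definitions in Section 3.6.

```python
def box_geometry(roi, ctx):
    """Return (dx, dy, size_ratio, iou) of a context box relative to the ROI box."""
    rx1, ry1, rx2, ry2 = roi
    cx1, cy1, cx2, cy2 = ctx
    rw, rh = rx2 - rx1, ry2 - ry1
    cw, ch = cx2 - cx1, cy2 - cy1
    if min(rw, rh, cw, ch) <= 0:
        raise ValueError("boxes must have positive width and height")
    # Centre-to-centre offsets, expressed in units of the ROI size.
    dx = ((cx1 + cx2) / 2 - (rx1 + rx2) / 2) / rw
    dy = ((cy1 + cy2) / 2 - (ry1 + ry2) / 2) / rh
    size_ratio = (cw * ch) / (rw * rh)
    # Intersection over union.
    iw = max(0.0, min(rx2, cx2) - max(rx1, cx1))
    ih = max(0.0, min(ry2, cy2) - max(ry1, cy1))
    inter = iw * ih
    iou = inter / (rw * rh + cw * ch - inter)
    return dx, dy, size_ratio, iou


# Example: a context box shifted half an ROI width to the right.
dx, dy, size_ratio, iou = box_geometry((10, 10, 30, 30), (20, 10, 40, 30))
```

These four values would be concatenated to the visual embedding of each context node, so that message passing can distinguish, for instance, tissue directly beneath the nodule from tissue beside it.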

Advantages of the GNN architecture

The chosen GNN encoder combines several pragmatic design choices: a small, fully connected graph; GraphSAGE-style aggregation; and attention-based readout. The fully connected topology keeps implementation simple and avoids the need for hand-crafted edge pruning, while still allowing each node to access global context through a few message-passing layers. GraphSAGE provides an inductive mechanism that can generalize to unseen node configurations, and the attention readout exposes which regions are most influential for the final prediction. Together, these choices yield a model that is expressive enough to capture relational patterns, yet compact enough to train on modest hardware and to be integrated into existing ROI-based pipelines.
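
A forward pass through such an encoder can be sketched in a few lines of NumPy. This is a simplified illustration with random weights, not the trained model: it assumes a mean aggregator, two message-passing layers, a single-vector attention score per node, and a linear head. All dimensions and parameter names are hypothetical (ResNet-50 pooled features are 2048-dimensional; 16 is used here to keep the example small).

```python
import numpy as np


def sage_layer(H, W_self, W_neigh):
    """GraphSAGE-style layer with a mean aggregator on a fully connected graph."""
    n = H.shape[0]
    # Mean over all other nodes: (column sums minus own row) / (n - 1).
    neigh_mean = (H.sum(axis=0, keepdims=True) - H) / (n - 1)
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)  # ReLU


def attention_readout(H, w_att):
    """Softmax attention over nodes; returns the pooled vector and node weights."""
    scores = H @ w_att
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ H, weights


def predict(X, p):
    """Two message-passing layers, attention pooling, then a sigmoid head."""
    H = sage_layer(X, p["W_self1"], p["W_neigh1"])
    H = sage_layer(H, p["W_self2"], p["W_neigh2"])
    pooled, weights = attention_readout(H, p["w_att"])
    logit = float(pooled @ p["w_out"] + p["b_out"])
    return 1.0 / (1.0 + np.exp(-logit)), weights


rng = np.random.default_rng(0)
d_in, d_hid = 16, 8  # toy sizes
params = {
    "W_self1": rng.normal(scale=0.1, size=(d_in, d_hid)),
    "W_neigh1": rng.normal(scale=0.1, size=(d_in, d_hid)),
    "W_self2": rng.normal(scale=0.1, size=(d_hid, d_hid)),
    "W_neigh2": rng.normal(scale=0.1, size=(d_hid, d_hid)),
    "w_att": rng.normal(scale=0.1, size=d_hid),
    "w_out": rng.normal(scale=0.1, size=d_hid),
    "b_out": 0.0,
}
X = rng.normal(size=(9, d_in))  # row 0 = ROI node, rows 1-8 = context nodes
prob, weights = predict(X, params)
```

The returned `weights` are the per-node attention values that the readout exposes for inspection. In this simplified form the output does not depend on the order of the nodes, so the ROI is distinguished only by its features and geometry scalars.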

Cross-validation and robustness

Importantly, all reported discrimination and calibration metrics are averaged over a 5-fold × 3-seed cross-validation scheme. This reduces the risk of optimistic results arising from a favorable single split and provides a more stable characterization of model behavior across different subsets of the data. The relatively tight confidence intervals in Table 4 suggest that the ROI+context GNN performs consistently across folds despite the heterogeneity in nodule size, position, and background anatomy. Because the graph uses a small, deterministic set of context nodes rather than a large, randomly sampled bag of patches, the architecture also avoids some of the variance and instability often observed in patch-sampling MIL pipelines, while still exploiting multi-regional information.
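
The aggregation over the 15 runs can be sketched as below. The interval formula is an assumption: a normal-approximation 95% half-width, 1.96 × SD / √n with n = 15, which happens to reproduce the half-widths listed next to the operating-point rows of Table 4 (SD 0.035 gives about 0.018; SD 0.019 gives about 0.010). The seed values are placeholders.

```python
import itertools
import math
import statistics

N_FOLDS = 5
SEEDS = (0, 1, 2)  # placeholder seed values
runs = list(itertools.product(SEEDS, range(N_FOLDS)))  # 15 (seed, fold) pairs


def ci_half_width(sd, n, z=1.96):
    """Normal-approximation confidence half-width for a mean over n runs."""
    return z * sd / math.sqrt(n)


def summarize(values, z=1.96):
    """Return (mean, sample SD, CI half-width) for per-run metric values."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # n - 1 denominator
    return mean, sd, ci_half_width(sd, len(values), z)


# Half-widths implied by the SDs of the two operating-point rows in Table 4.
hw_sens = ci_half_width(0.035, len(runs))
hw_spec = ci_half_width(0.019, len(runs))
```

Note that the 15 runs share training data across folds and seeds, so they are not fully independent and an interval of this kind is best read as a descriptive summary of run-to-run spread.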

Calibration and decision support

The strong cross-validation performance and stable calibration after temperature scaling further demonstrate that the graph formulation yields not only accurate but also reliable probability estimates. From a clinical perspective, well-calibrated probabilities are essential for risk-stratified management decisions, such as whether to recommend FNA, short-interval follow-up, or routine surveillance. An overconfident yet miscalibrated model can achieve high AUROC while still leading to inappropriate thresholds and suboptimal biopsy rates. By explicitly quantifying calibration (via NLL, Brier score, ECE, and reliability diagrams) and applying temperature scaling, our analysis aims to ensure that the proposed GNN is not only discriminative but also usable as a decision-support tool in settings where probability estimates directly influence patient management. Overall, the method’s strengths—structured context modelling, lightweight graph size, improved discrimination, and calibration-friendly behavior—support its practical suitability for ultrasound-based thyroid nodule stratification.
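
To make the link between calibration and thresholds concrete: if p is a calibrated malignancy probability, calling a nodule benign has expected cost p × C_FN and calling it malignant has expected cost (1 − p) × C_FP, so the cost-minimizing rule is to flag the nodule when p ≥ C_FP / (C_FP + C_FN). The sketch below encodes this rule; the cost values are hypothetical and carry no clinical recommendation.

```python
def cost_optimal_threshold(cost_fn, cost_fp):
    """Probability threshold that minimizes expected cost for calibrated scores."""
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def flag_for_workup(prob, cost_fn, cost_fp):
    """True when flagging the nodule has lower (or equal) expected cost."""
    return prob >= cost_optimal_threshold(cost_fn, cost_fp)


# Hypothetical costs: a missed malignancy weighted 9x an unnecessary biopsy.
threshold = cost_optimal_threshold(cost_fn=9, cost_fp=1)
```

The rule is only as good as the probabilities fed into it, which is why a threshold derived this way should be revisited whenever the model is re-calibrated.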

Limitations

While we report robust 5-fold × 3-seed cross-validation on TN5000, we currently lack evaluation on an external or temporal test set. An anomalous degradation in fold 5 AUROC was observed during calibration analysis; we are investigating potential label–prediction misalignment and data leakage before drawing conclusions. Annotation noise, device heterogeneity, and demographic shift may reduce generalization. Stronger claims will therefore require confirmation on an external or temporal test set in addition to cross-validation.

6. Conclusions

Summary and significance. This work presents a compact yet expressive ROI+context Graph Neural Network for thyroid ultrasound classification, motivated by the observation that malignancy assessment depends not only on intra-nodular appearance but also on structured peri-lesional cues. By representing each image as a small graph anchored at the lesion ROI and augmented with deterministic context nodes, the proposed model explicitly captures spatial relationships that are typically implicit or ignored in CNN- and MIL-based pipelines. Across both the official TN5000 single-split and a rigorous 5-fold × 3-seed cross-validation protocol, the method achieves strong and stable discrimination, indicating that relational modeling can improve robustness without resorting to complex segmentation or heavy architectural overhead.

Reliability and clinical relevance. Beyond discrimination, we emphasize probability reliability through comprehensive calibration analysis. The observed improvements in NLL, Brier score, and ECE after temperature scaling demonstrate that the proposed GNN yields well-calibrated malignancy probabilities, a critical requirement for decision-support scenarios where thresholds directly influence biopsy or follow-up recommendations. The attention-based readout further provides a degree of interpretability by highlighting which peri-lesional regions contribute most to the final decision, aligning the model’s behavior with radiologist reasoning patterns.

Dataset choice. We selected TN5000 as it aligns well with an ROI+context graph formulation (annotation quality, preserved peri-lesional tissue, and heterogeneity); the detailed rationale is provided in Section 3.4.

Outlook. Together, these results suggest that graph-based image-level representations offer a promising and practical direction for thyroid ultrasound CAD. The proposed ROI+context GNN serves as a strong baseline that bridges classical ROI classifiers and more complex holistic models, while remaining reproducible and calibration-aware. Future work will explore external validation, alternative graph topologies, and multi-modal extensions incorporating clinical metadata, with the goal of further strengthening the role of relational learning in ultrasound-based decision support.

Author Contributions

Conceptualization, M.Y. and N.Y.; methodology, M.Y.; software, M.Y.; validation, M.Y.; formal analysis, M.Y.; investigation, M.Y.; resources, M.Y.; data curation, M.Y.; writing—original draft preparation, M.Y.; writing—review and editing, M.Y. and N.Y.; visualization, M.Y.; supervision, N.Y.; project administration, N.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The TN5000 dataset used in this study is publicly available on figshare at https://figshare.com/s/cb6a67f17c04b29e7edd (accessed on 16 August 2025). Code and trained models will be made available upon acceptance.

Acknowledgments

The authors thank the dataset authors for releasing TN5000 publicly.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUROC	Area Under the Receiver Operating Characteristic Curve
AUPRC	Area Under the Precision–Recall Curve
CAD	Computer-Aided Diagnosis
CNN	Convolutional Neural Network
CV	Cross-Validation
ECE	Expected Calibration Error
FNA	Fine-Needle Aspiration
GNN	Graph Neural Network
MIL	Multiple Instance Learning
NLL	Negative Log-Likelihood
RNG	Random Number Generator
ROI	Region of Interest
TI-RADS	Thyroid Imaging Reporting and Data System
US	Ultrasound

Appendix A. Implementation and Reproducibility Details

Hardware and runtime environment

All experiments were executed on a consumer-grade laptop equipped with Apple Silicon using the Metal Performance Shaders (MPS) backend. Hardware-specific details are reported here for completeness only and do not affect the proposed methodology or evaluation protocol.

Random seeds and determinism

To ensure reproducibility, random seeds were fixed for python, numpy, and torch. Deterministic data loading and operation modes were enabled where supported. All cross-validation splits, random seeds, and hyperparameters are stored in configuration files.
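
A seed-fixing helper along these lines might look as follows. It is a sketch rather than the authors' script: the function name is hypothetical, NumPy and PyTorch are seeded only if they are installed, and deterministic algorithms are requested in warn-only mode because not every operation has a deterministic implementation on every backend.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Prefer deterministic kernels; warn instead of failing if unavailable.
        torch.use_deterministic_algorithms(True, warn_only=True)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()
```

Calling the helper once per (seed, fold) run, before building the data loaders and the model, keeps each of the 15 runs individually repeatable.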

Software

The implementation is based on PyTorch (v2.8.0) and PyTorch Geometric (v2.6.1), with training and evaluation scripts executed under a fixed environment configuration.

References

1. Tessler, F.N.; Middleton, W.D.; Grant, E.G. Thyroid imaging reporting and data system (TI-RADS): A user’s guide. Radiology 2018, 287, 29–36.
2. Boers, T.; Braak, S.J.; Rikken, N.E.; Versluis, M.; Manohar, S. Ultrasound imaging in thyroid nodule diagnosis, therapy, and follow-up: Current status and future trends. J. Clin. Ultrasound 2023, 51, 1087–1100.
3. David, E.; Aliotta, L.; Frezza, F.; Riccio, M.; Cannavale, A.; Pacini, P.; Di Bella, C.; Dolcetti, V.; Seri, E.; Giuliani, L.; et al. Thyroid Nodule Characterization: Which Thyroid Imaging Reporting and Data System (TIRADS) Is More Accurate? A Comparison Between Radiologists with Different Experiences and Artificial Intelligence Software. Diagnostics 2025, 15, 2108.
4. Radhachandran, A.; Kinzel, A.; Chen, J.; Sant, V.; Patel, M.; Masamed, R.; Arnold, C.W.; Speier, W. A multitask approach for automated detection and segmentation of thyroid nodules in ultrasound images. Comput. Biol. Med. 2024, 170, 107974.
5. Hou, X.; Hua, M.; Zhang, W.; Ji, J.; Zhang, X.; Jiang, H.; Li, M.; Wu, X.; Zhao, W.; Sun, S.; et al. An ultrasonography of thyroid nodules dataset with pathological diagnosis annotation for deep learning. Sci. Data 2024, 11, 1272.
6. Savelonas, M. An Overview of AI-Guided Thyroid Ultrasound Image Segmentation and Classification for Nodule Assessment. Big Data Cogn. Comput. 2025, 9, 255.
7. Yu, D.; Song, T.; Yu, Y.; Zhang, H.; Gao, F.; Wang, Z.; Wang, J. Risk assessment of thyroid nodules with a multi-instance convolutional neural network. Front. Oncol. 2025, 15, 1608963.
8. Zhang, H.; Liu, Q.; Han, X.; Niu, L.; Sun, W. TN5000: An Ultrasound Image Dataset for Thyroid Nodule Detection and Classification. Sci. Data 2025, 12, 1437.
9. Parisot, S.; Ktena, S.I.; Ferrante, E.; Lee, M.; Guerrero, R.; Glocker, B.; Rueckert, D. Disease Prediction using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease. In Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI), Quebec City, QC, Canada, 11–13 September 2017.
10. Meng, X.; Zou, T. Clinical applications of graph neural networks in computational histopathology: A review. Comput. Biol. Med. 2023, 164, 107201.
11. Bronstein, M.M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv 2021, arXiv:2104.13478.
12. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the IJCAI, Quebec City, QC, Canada, 20–25 August 1995.
13. Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006, 7, 91.
14. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79.
15. Gong, H.; Chen, J.; Chen, G.; Li, H.; Li, G.; Chen, F. Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Comput. Biol. Med. 2023, 155, 106389.
16. Dong, P.; Zhang, R.; Li, J.; Liu, C.; Liu, W.; Hu, J.; Yang, Y.; Li, X. An ultrasound image segmentation method for thyroid nodules based on dual-path attention mechanism-enhanced UNet++. BMC Med. Imaging 2024, 24, 341.
17. Li, X.; Fu, C.; Xu, S.; Sham, C.W. Thyroid ultrasound image database and marker mask inpainting method for research and development. Ultrasound Med. Biol. 2024, 50, 509–519.
18. Xu, Y.; Xu, M.; Geng, Z.; Liu, J.; Meng, B. Thyroid nodule classification in ultrasound imaging using deep transfer learning. BMC Cancer 2025, 25, 544.
19. Zhang, H.; Liu, Q.; Han, X.; Niu, L.; Sun, W. TN5000: An Ultrasound Image Dataset for Thyroid Nodule Detection and Classification (Data Release). figshare 2025.
20. Hu, M.; Zhang, Y.; Xue, H.; Lv, H.; Han, S. Mamba- and ResNet-Based Dual-Branch Network for Ultrasound Thyroid Nodule Segmentation. Bioengineering 2024, 11, 1047.
21. Chi, J.; Walia, E.; Babyn, P.; Wang, J.; Groot, G.; Eramian, M. Thyroid nodule classification in ultrasound images by fine-tuning deep convolutional neural network. J. Digit. Imaging 2017, 30, 477–486.
22. Aci, C.I.; Mutlu, G.; Ozen, M.; Sarac, E.; Uzel, V.N.K. A Feature Selection-Based Multi-Stage Methodology for Improving Driver Injury Severity Prediction on Imbalanced Crash Data. Electronics 2025, 14, 3377.
23. Chen, C.; Wu, Y.; Dai, Q.; Zhou, H.Y.; Xu, M.; Yang, S.; Han, X.; Yu, Y. A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10297–10318.
24. Zhang, L.; Zhao, Y.; Che, T.; Li, S.; Wang, X. Graph neural networks for image-guided disease diagnosis: A review. Intell. Robot. Devices 2023, 1, 151–166.
25. Mienye, I.D.; Viriri, S. Graph Neural Networks in Medical Imaging: Methods, Applications and Future Directions. Information 2025, 16, 1051.
26. Chowa, S.S.; Azam, S.; Montaha, S.; Payel, I.J.; Bhuiyan, M.R.I.; Hasan, M.Z.; Jonkman, M. Graph neural network-based breast cancer diagnosis using ultrasound images with optimized graph construction integrating clinically significant features. J. Cancer Res. Clin. Oncol. 2023, 149, 18039–18064.
27. Wang, Y.; Jiang, C.; Luo, S.; Dai, Y.; Zhang, J. Graph Neural Network Enhanced Dual-Branch Network (GED-Net) for lesion segmentation in ultrasound images. Expert Syst. Appl. 2024, 256, 124835.
28. Agyekum, E.A.; Kong, W.; Ren, Y.Z.; Issaka, E.; Baffoe, J.; Xian, W.; Tan, G.; Xiong, C.; Wang, Z.; Qian, X.; et al. A comparative analysis of three graph neural network models for predicting axillary lymph node metastasis in early-stage breast cancer. Sci. Rep. 2025, 15, 13918.
29. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the NeurIPS, Long Beach, CA, USA, 8 December 2017.
30. Efron, B. Bootstrap methods: Another look at the jackknife. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 569–593.
31. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
32. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
34. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated Graph Sequence Neural Networks. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
36. Niculescu-Mizil, A.; Caruana, R. Predicting Good Probabilities with Supervised Learning. In Proceedings of the ICML, Bonn, Germany, 7–11 August 2005.
37. Naeini, M.P.; Cooper, G.; Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In Proceedings of the AAAI, Austin, TX, USA, 25–30 January 2015.
38. Brier, G.W. Verification of Forecasts Expressed in Terms of Probability. Mon. Weather Rev. 1950, 78, 1–3.
39. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017.
40. Bahmane, K.; Bhattacharya, S.; Chaouki, A.B. Evaluation of a Hybrid CNN Model for Automatic Detection of Malignant and Benign Lesions. Medicina 2025, 61, 2036.
41. Sujini, G.N.; Sivadi, S. Automated thyroid nodule classification in ultrasound imaging using a hybrid vision transformer and Wasserstein GAN with gradient penalty. Sci. Rep. 2025, 15, 40786.

Figure 1. Overview of the proposed ROI+context GNN. Each image is represented as a small graph with one ROI (lesion) node and multiple context nodes. ResNet-50 embeddings are concatenated with lightweight geometry (Δx, Δy, scale, IoU). Message passing (GraphSAGE) propagates information across nodes; attention pooling aggregates node features into a single image-level representation, followed by an MLP head to predict malignancy probability.

Figure 2. ROI-anchored deterministic peri-lesional context sampling.

Figure 3. Ensemble reliability diagrams. The dashed diagonal denotes perfect calibration; bars show empirical accuracy in equal-width bins; thin vertical lines show confidence intervals (Wilson). (a) Before temperature scaling. (b) After temperature scaling.

Figure 4. Per-fold reliability diagrams comparing calibration status. Each panel shows predicted confidence vs. empirical accuracy; the diagonal is perfect calibration. (a) Diagrams before temperature scaling, showing varying degrees of miscalibration. (b) Diagrams after temperature scaling, demonstrating improved alignment with the diagonal.

Table 1. Rationale for selecting a GNN-based ROI+context fusion model (qualitative comparison).

Model Family	Strength	Key Limitation for Peri-Lesional Context
ROI-only CNN	Strong local texture/margin cues; simple training	Discards surrounding tissue relationships
Whole-image CNN/ViT	Uses full anatomy; no cropping choices	Background dominates; requires learning to ignore non-informative regions
MIL/patch-attention	Pools multiple regions; weak localization	Instances typically unordered; spatial relations implicit or absent
GNN (ours)	Explicit nodes + geometry; message passing; attention readout	Requires defining nodes/edges

Table 2. Dataset summary (official image-level split).

Split	Images	Grouping	Notes
Train	3500	disjoint	optimisation
Val	500	disjoint	model selection
Test	1000	disjoint	held-out; used for predictions

Table 3. Single-split validation performance on the official TN5000 train/validation split. Values correspond to the best epoch. Models were trained for up to 100 epochs with early stopping (patience = 10).

Acc	AUROC	AUPRC	Min Val Loss	Note
0.904	0.942	0.979	0.268	epoch-wise best

Table 7. Calibration metrics evaluated on validation data, reported before and after temperature scaling for each cross-validation fold (fold1 to fold5) and for the aggregate row labeled mean.

Split	Phase	AUROC	AUPRC	NLL	Brier	ECE
mean	before	0.8893	0.9468	0.6392	0.2231	0.2584
mean	after	0.8893	0.9468	0.4074	0.1271	0.0991
fold1	before	0.8774	0.9421	0.6114	0.2096	0.2558
fold1	after	0.8774	0.9421	0.4738	0.1566	0.1321
fold2	before	0.8605	0.9241	0.6356	0.2214	0.2875
fold2	after	0.8605	0.9241	0.3992	0.1211	0.0815
fold3	before	0.8111	0.8977	0.6459	0.2265	0.2099
fold3	after	0.8111	0.8977	0.4542	0.1443	0.0608
fold4	before	0.8616	0.9317	0.6296	0.2184	0.2220
fold4	after	0.8616	0.9317	0.3988	0.1248	0.0363
fold5	before	0.3607	0.6260	0.6811	0.2440	0.2519
fold5	after	0.3607	0.6260	0.6531	0.2285	0.2176

Table 8. Temperature scaling parameters learned per cross-validation fold and at the ensemble level, optimized by minimizing validation negative log-likelihood.

Split	T
mean	0.1000
fold1	0.1866
fold2	0.1000
fold3	0.1000
fold4	0.1000
fold5	0.1594
