Article

Symmetry-Aware SwinUNet with Integrated Attention for Transformer-Based Segmentation of Thyroid Ultrasound Images

1 School of Information Engineering, Shaoyang University, Shaoyang 422000, China
2 Information Technology Center, Sindh Agriculture University, Tandojam 70050, Pakistan
3 School of Computer Science and Engineering, Central South University, Changsha 410083, China
* Authors to whom correspondence should be addressed.
Symmetry 2026, 18(1), 141; https://doi.org/10.3390/sym18010141 (registering DOI)
Submission received: 5 December 2025 / Revised: 31 December 2025 / Accepted: 4 January 2026 / Published: 10 January 2026

Abstract

Accurate segmentation of thyroid nodules in ultrasound images remains challenging due to low contrast, speckle noise, and inter-patient variability that disrupt the inherent spatial symmetry of thyroid anatomy. This study proposes a symmetry-aware SwinUNet framework with integrated spatial attention for thyroid nodule segmentation. The hierarchical window-based Swin Transformer encoder preserves spatial symmetry and scale consistency while capturing both global contextual information and fine-grained local features. Attention modules in the decoder emphasize symmetry-consistent anatomical regions and asymmetric nodule boundaries, effectively suppressing irrelevant background responses. The proposed method was evaluated on the publicly available TN3K thyroid ultrasound dataset. Experimental results demonstrate strong performance, achieving a Dice Similarity Coefficient of 85.51%, precision of 87.05%, recall of 89.13%, an IoU of 78.00%, accuracy of 97.02%, and an AUC of 99.02%. Compared with the baseline model, the proposed approach improves the IoU and Dice score by 15.38% and 12.05%, respectively, confirming its ability to capture symmetry-preserving nodule morphology and boundary asymmetry. These findings indicate that the proposed symmetry-aware SwinUNet provides a robust and clinically promising solution for thyroid ultrasound image analysis and computer-aided diagnosis.

1. Introduction

The frequency of thyroid nodules, which are characterized as abnormal growths in the thyroid gland, has significantly increased in the last few decades, demanding prioritization of early detection and evaluation [1,2]. Most epidemiological studies state that although only a small number of nodules can be felt, around 50–70% of the adult population has nodules that can be identified using ultrasound [3]. As a result, the need for computer-aided diagnosis (CAD) systems that provide consistent, objective, and reproducible assessments has grown. More specifically, automated segmentation is essential in CAD systems, as it provides detailed delineation of normal and abnormal anatomical regions, which is critical for diagnosis and treatment planning. Convolutional neural networks (CNNs) are presently the most preferred architectures for the segmentation of medical images, owing to their profound capability for learning hierarchical feature representations [4]. In thyroid ultrasound images, however, the persistent challenges are low contrast between nodules and surrounding tissue, speckle noise, acoustic shadowing, and irregular boundaries. Conventional thyroid nodule segmentation methods therefore generally reach accuracies of 79.00% to 83.70%, whereas deep learning architectures have shown better performance, achieving accuracies of 91.30% [5] or higher. Nevertheless, sensitivity to noise, human error, and image diversity remains an important limitation of currently proposed deep learning architectures.
However, recent advancements in the use of attention-based deep learning models appear promising in overcoming these limitations. SwinUNet models, which incorporate hierarchical vision transformers and self-attention mechanisms, enhance the accuracy of the process by allowing the network to pay attention to regions of interest while also accounting for distant contextual relationships that can aid in diagnoses [6]. This particular aid would be very useful in the field of thyroid ultrasound scans since nodules can vary significantly in size, shape, and backscatter characteristics. The process of segmentation is even more important in the context of distinguishing malignant from benign lesions, because it has been found that it has a direct impact on the extraction of texture features. Even though fine needle aspiration (FNA) cytology is still considered state-of-the-art for diagnostic techniques, it is still invasive, costly, and lacks availability worldwide, and it depends on experienced cytopathologists [7].
The earlier CAD systems based on ultrasound B-mode images have shown potential, but usually depend on a rigid, multi-stage machine learning pipeline comprising preprocessing, segmentation, feature extraction, and classification [8]. These kinds of systems are able to differentiate between benign versus malignant nodules, yet they are still susceptible to ultrasound-specific artifacts such as speckle noise and overlapping tissue textures. Furthermore, the easy availability of ultrasound imaging has presently caused overdiagnosis, biopsy, and extra workload for radiologists due to inter-observer variability in image interpretations [9,10]. In view of these limitations, certain recent studies have examined even more sophisticated paradigms under the realm of deep learning, such as graph attention networks such as the Multiple Attention Graph Convolution Networks (MAGCNs), which focus on abstract semantic patterns and have been observed to generalize better in various tasks under medical imaging [11]. These methods are, however, predominantly data-driven, without any focus on anatomy or priors. In addition, despite the reasonable prediction results that have been achieved with traditional machine learning approaches such as artificial neural networks and Random Forest classifiers, big limitations to adapting these approaches to the clinical environment remain, namely, they are not interpretable and cannot capture the reasoning processes of clinicians when they are making diagnoses. In thyroid image analysis, because of the importance of anatomical consistency and comparison of both sides, predictions without explanations related to anatomy could have a negative effect on confidence and on translating results to the clinical environment [12].
The other significant hurdle is that of limitations within the datasets. The majority of publicly available datasets related to thyroid ultrasound studies consist of images from single centers or particular machines and usually involve just one nodule per image. This reduces their ability to generalize, especially when they have to deal with real-life cases involving several nodules, asymmetric thyroid glands, and varying data acquisition methods [13]. Although multi-center and multi-machine datasets have been recommended to overcome these challenges, further advancements are also necessary within their architecture to enhance robustness. The integration of ultrasound, CT, and MRI modalities has further enhanced the information content, which counteracts the low resolution and noise associated with ultrasound images [14]. The integration of attention mechanisms in CNN and transformer models has improved the effectiveness of segmentation tasks involving complex and low-contrast regions, as it allows highlighting the most discriminative regions in the image [15]. Meanwhile, learning paradigms such as few-shot learning, unsupervised learning, and self-supervised learning have been developed as alternatives to overcome the issue of limited availability of large-scale datasets in the medical field [16]. Knowledge transfer, learning from models, and data augmentation have improved the generalizability and robustness of models, making it even more representative of clinical reality [17].
Despite these benefits, one of the crucial prior aspects that remains an uncharted region in existing thyroid segmentation tasks is the inherent symmetry of the thyroid gland. As a biological entity, the thyroid gland itself comprises two lobes that are joined together by an isthmus, with high levels of overall symmetry between the two sides. Pathological conditions such as thyroid nodules, cancerous tissues, or the development of unusual tissue that grows inside the thyroid gland tend to appear otherwise asymmetric. From the imaging viewpoint, the explicit representation of such symmetric characteristics helps define an important discriminant feature separating actual pathologies from the noise characteristics generally observed inside ultrasound images. Nonetheless, existing deep learning-based representative techniques tend to ignore such an important image attribute simply because it is an ultrasound image of the thyroid gland. The attention-augmented transformer models, especially the SwinUNet models, can provide a suitable background for learning symmetry-aware representations given their hierarchical architecture, ability to learn features on different scales, and efficient self-attention mechanisms [18]. Prior studies have also demonstrated that attention-augmented multi-task models significantly improve performance in difficult low-contrast imaging scenarios [19], while the integration of CEUS, CT, and MR imaging provides complementary diagnostic information [20]. Expert-guided and semi-structured deep learning models further improved clinical accuracy and decision support [21], and radiomics-based deep learning approaches proved to be strongly generalizable across thyroid imaging modalities [22].
The contributions of this research can be summarized as follows. We developed a symmetry-aware segmentation framework based on the SwinUNet architecture that takes into account the bilateral symmetry of the thyroid gland, allowing precise delineation of nodules in low-contrast, low-quality ultrasound images. To facilitate this, we included attention gates in the skip connections to focus on clinically relevant features and reduce background noise, achieving better boundary localization. The proposed method showcases a task-specific adaptation of the Swin Transformer, pairing a hierarchical encoder with a symmetry-oriented decoder to capture sufficient global contextual information while retaining fine-grained local details. The method was validated on the TN3K dataset and demonstrated substantial gains over the baseline SwinUNet and other leading segmentation algorithms, with increases of 15.38% in IoU and 12.05% in Dice. Most importantly, this research addresses a gap left by other models that entirely disregarded thyroid symmetry, actively using it to improve segmentation performance and the clinical relevance of the results.
Novelty of Methods:
A symmetry-aware SwinUNet with spatial attention for thyroid nodule segmentation.
A window-based hierarchical Swin Transformer encoder that preserves spatial symmetry while concurrently modeling global and local information.
An architecture designed to handle low contrast, speckle noise, and inter-patient variability in thyroid ultrasound images.
Spatial attention mechanisms that attend to relevant anatomical regions and suppress background responses.
Significant gain over the baseline model, by +15.38% IoU and +12.05% Dice.
A large-scale multi-epoch training study (10–800 epochs) with stable performance improvements.
Significant improvements in segmentation metrics such as precision, recall, and F1-score with longer training periods.
This paper outlines our research on the segmentation of thyroid nodules. Section 2 presents related work on nodule segmentation in ultrasound imaging data. Section 3 presents the materials and methods, providing a detailed explanation of the SwinUnet_withAttention architecture illustrated in Figure 1 and Figure 2, and describing how spatial attention mechanisms are integrated to enhance feature discrimination and localization accuracy. Section 3.4, Experimental Setup and Evaluation Metrics, outlines the experimental setup, including specifics of the TN3K dataset, the data preprocessing and augmentation pipeline, the training approach, and the evaluation metrics employed to measure model performance. Section 4 provides the results and discussion, explains the effect of the proposed architecture and extended training epochs on segmentation performance, and presents a comparative study against other state-of-the-art approaches. Lastly, Section 5 concludes the paper by summarizing the main findings and contributions, and suggests possible directions for future work aimed at enhancing model generalization, efficiency, and clinical applicability.

2. Related Work

2.1. Deep Learning for Thyroid Scintigraphy and Ultrasound Segmentation

While thyroid scintigraphy is one of the most important tests for evaluating thyroid function, its interpretation has long relied on slow, user-dependent readings. Recent advancements in artificial intelligence, especially deep learning, have improved diagnostic accuracy. A five-layer U-Net model has been trained to automatically and precisely delineate the borders of the thyroid gland and estimate uptake for scintigraphy. Trained on 2734 scans, it achieved 92% accuracy for predicting the lobe edges and 94% accuracy for the overall thyroid shape, along with a median gland area of 3.520 cm2 and 0.029% uptake prediction accuracy; such results support the clinical use of the U-Net model [23]. Further research is warranted on the usefulness of AI, predominantly the U-Net and its variants, in nuclear medicine. In terms of models, there is credible evidence suggesting that dataset diversity, data cleaning, and data augmentation are of greater importance than the model structure, especially for the ResU-Net model in assisting the diagnosis of thyroid nodules [24]. Recent work incorporates deep learning and attention mechanisms to accomplish complex tasks in medical imaging. With self-attention, residual connections, and multi-scale feature learning, the SwinUNet model surpasses standard methods in ultrasound image analysis and thyroid nodule identification, and its versatility holds great potential for other imaging modalities such as MRI and CT; current projects are devoted to real-time and multimodal optimization [25]. Ultrasound imaging of pediatric thyroid glands is particularly challenging, as the images typically exhibit low contrast, substantial scattered noise from surrounding structures, and unusual anatomical appearances. One approach employs a DC-Contrast UNet with specialized techniques to overcome these challenges, lifting IoU and mIoU performance by slightly more than 6%, along with precision and recall. The remaining difficulty lies in isolating noisy boundary regions, which are considered prime opportunities for further refinement of the model [26]. More recently, transformer-based U-Net models have demonstrated high capability in detecting thyroid nodules. Other models have incorporated frequency-based filtering as well as overlapping multi-scale patch assessments. Notably, the Cooperative Transformer Fusion model, which integrates self-attention and cross-attention mechanisms, demonstrated high accuracy across several datasets: 98.2% on DDTI and TN3K and 97.8% on TG3K. Nonetheless, it showed limited generalization [27].

2.2. Hybrid and Residual U-Net Architectures for Thyroid Imaging

Another multimodal approach features segmentation using active contour models. That leads to classification based on ResUNet. Using the TDID database, the two-process pipeline was more accurate and efficient than the other models, thereby strengthening the contribution of ResUNet to facilitate holistic analysis of thyroid nodules [28]. While there are many deep learning algorithms to solve the vanishing gradient problem, residual connections, which also speed up convergence, are the ones chosen by ResUNet. Furthermore, when applied to segment thyroid CT images, Residual U-Net outperformed both U-Net and U-Net++, achieving Dice and IoU scores of 90.87% and 94.58%, respectively. Even with limited training data, the performance of ResUNet remains strong, confirming the versatility and capability of this architecture [29].
To tackle the scarcity of high-quality training data for ultrasound images, researchers developed a Super-Pixel U-Net system with three stages, edge detection, refinement, and classification, each of which improves the result. It obtained a Dice score of 0.9279 and an F1-score of 0.9161, showing that a cascade of U-Nets with a few additional steps is helpful for ultrasound data [30]. The Hybrid Transformer U-Net (H-TUNet) was designed to address weak spatial and contextual understanding. By combining a 2D transformer for intra-frame features with a 3D transformer for inter-frame motion, H-TUNet substantially improves anatomical representation. Results on the TSUD and TG3K datasets show that it segments images and learns features better than competing methods [31]. Furthermore, different encoder–decoder designs, especially those with fewer residual links, have shown that good segmentation can be balanced against modest computational cost. Hybrid ResUNet designs of this kind delineate thyroid nodules well and are feasible in real clinical settings [32].
A deep learning architecture for the combined classification and segmentation of thyroid images was constructed and demonstrated good performance. The architecture incorporated adaptive median filtering and histogram equalization to optimize the images and combined the segmentation capabilities of SegNet with the classification of a CB-CNN. The system attained an accuracy of 96%, outperforming competitors in the classification of thyroid cancer such as DCNN and ResNet101 [33]. While traditional segmentation methods (thresholding and edge detection) are simple and fast, their quality degrades in noisy or complicated images. Deep learning models, while more computationally intensive, offer better feature learning ability. Transfer learning, model compression, and semi-supervised learning mitigate data and hardware constraints. Interpretability remains a concern, however; explainable AI techniques such as attention maps help build clinician trust. Hybrid systems combining rule-based and deep learning strategies excel, particularly in environments with weak boundaries or variable intensities [34].

2.3. Advanced Hybrid, Transformer, and Attention-Based Models

More recent work has investigated combining deep and handcrafted features. For example, SegNet built on VGG19 was blended with fuzzy gray-level co-occurrence matrices and deep features, and the result was classified using an RBF-kernel SVM. With 99.25% classification accuracy, the approach outperforms current models, although it is hampered by high-dimensional feature vectors and the associated computational expense [35]. Within a multi-task learning setting, FFANet proposed a dual-head segmentation and classification model. Through feature fusion and a loss function adapted to it, it attained a Dice coefficient of 0.935 and 79% classification accuracy, showing the efficacy of joint-task optimization in thyroid ultrasound analysis [36]. Efforts toward image acquisition standardization and the incorporation of deep learning models in thyroid diagnosis remain promising. Although not yet ideal predictors, these models reduce radiological subjectivity and potentially increase access to healthcare in rural areas [37]. New hybrid architectures such as Enhanced-TransUNet employ Transformers for global context and U-Nets for spatial localization, resolving issues of overfitting as well as low contrast. Experimental evidence on the TN3K and DDTI datasets verifies the model's higher Dice and Hausdorff scores, affirming its usability in clinical practice [38]. FCG-Net offers a leaner architecture than parameter-rich models such as UNet3+, achieved through Ghost Bottlenecks and full-resolution skip connections that reduce the parameter count and increase computational efficiency. This leads to improved Dice and sensitivity metrics, making it well suited for mobile deployment [39]. On the encoder side, improvements lie in comprehensive feature learning via Residual Networks and Multi-Scale Attention Modules, supported by Deep Supervision and hybrid loss functions such as Focal Loss for class balancing, which together improve spatial feature and contextual understanding [40]. To combat noisy ultrasound settings, SGUNet was designed with a semantic-guided architecture. It provides pixel-wise semantic maps in the decoding process for better propagation of low-level features, achieving more than a 2% improvement in Dice score over U-Net and U-Net++ [41]. The Swin Transformer-enhanced U-Net (ST-UNet) utilizes global context via Transformer encoders and performs feature enhancement [42]. The notable performance of H-TUNet is highlighted by benchmark evaluations on datasets such as Synapse and ISIC 2018, achieving a Dice score of approximately 78.86 and a recall of 0.9243 [43]. The model has garnered considerable attention due to its dual-transformer architecture, which integrates 2D MSC-AT and 3D SAT modules, enabling effective intra-frame and inter-frame feature learning. This design has been reported to outperform existing state-of-the-art methods for thyroid ultrasound image analysis [44]. The present research improves thyroid ultrasound analysis using multi-level self-attention with a SwinUNet. Models with Swin Transformer backbones capture both global and local context and cope with the challenges of blurry boundaries, low contrast, and distracting anatomical noise. As a result, they perform well at thyroid contouring and nodule detection without high computational costs, supporting their usability in real-time clinical settings.

3. Proposed Method

A symmetry-aware SwinUNet with integrated attention is applied for transformer-based segmentation of thyroid ultrasound images. The method brings together the global context modeling of vision transformers and the spatial precision of attention mechanisms. It is designed to address problems in medical image analysis, where detecting and delineating small anatomical structures and lesions is essential. The approach has four main components: a data preparation and preprocessing pipeline, a hybrid architecture comprising a Swin Transformer encoder and an attention-gated decoder, a training approach defined by a custom set of loss functions, and a comprehensive evaluation framework, as illustrated in Figure 1. This integrated approach retains the advantages of traditional CNN-based segmentation methods, while avoiding their shortcomings in computational efficiency and clinical applicability.
Figure 1. Flowchart of Transformer + Attention + Decoder.

3.1. Flowchart

This diagram presents a clean, six-step machine learning workflow that converts raw data into ready-to-use predictive models. It begins with data scientists (external entities) submitting raw data to Process 1.0, Data Loading and Preprocessing, where the information is cleansed and stored in the Preprocessed Data store (D2). Next, Process 2.0, Data Splitting, splits the cleaned data into training and testing sets, which are stored in D3 and D4, respectively. Subsequently, Process 3.0, Model Training, utilizes the training data to construct predictive models, which are stored in the Model Repository (D5). Simultaneously, Process 4.0, Model Testing, evaluates the models on the testing data in order to estimate predictive performance. Process 5.0, Model Evaluation, calculates the results and logs performance metrics in the Evaluation Metrics repository (D6). Lastly, Process 6.0, Model Deployment, selects the best-performing model and delivers real-time predictions to end users, who are likewise external entities.

3.2. Algorithm

Algorithm 1 introduces the entire training pipeline for SwinUNet with attention gates for medical image segmentation. As illustrated in Figure 2, the approach starts by initializing the network architecture and loading pre-trained Swin Transformer weights to leverage learned visual representations. During training, the algorithm passes medical images via a hierarchical encoder to extract multi-scale features via the Swin Transformer backbone. The attention gate mechanism selectively amplifies relevant features from skip connections and suppresses irrelevant information to improve segmentation accuracy. The decoder reconstructs the segmentation mask from attention-weighted features. The model is optimized using a combination of the cross-entropy loss and Dice loss to handle class imbalance and improve boundary delineation. At the end of each epoch, the algorithm evaluates performance on validation data and saves the best-performing model checkpoint. This is repeated until convergence or until maximum epochs have been achieved, returning a trained model that can successfully perform medical image segmentation.
Algorithm 1 End-to-End Training Algorithm for SwinUNet-Based Medical Image Segmentation
1: Input:
2: -Medical image I ∈ ℝ^(H × W × 3)
3: -Ground truth mask M ∈ ℝ^(H × W × 1)
4: Output:
5: -Segmentation Prediction P ∈ ℝ^(H × W × C)
6: -Trained model parameters
7: Steps:
8: Initialize SwinUNet model with attention gates
9: Load pre-trained Swin Transformer weights
10: for each epoch, e = 1 to max_epochs do
11: for each batch (I, M) ∈ training_data do
12: // Forward Pass
13: features ← SwinEncoder(I)
14: attended_features ← AttentionGates(features)
15: P ← Decoder(attended_features)
16: // Loss Computation
17: loss ← CombinedLoss(P, M)
18: // Backward Pass
19: optimizer.zero_grad()
20: loss.backward()
21: optimizer.step()
22: end for
23: // Validation Phase
24: val_metrics ← Evaluate (model, validation_data)
25: if val_metrics.improved then
26: save_checkpoint(model)
27: end if
28: end for
29: return trained SwinUnet_withAttention model
Figure 2. SwinUNet with attention gates end-to-end medical image segmentation.
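As a concrete illustration, the following PyTorch sketch mirrors the training loop of Algorithm 1 under simplifying assumptions: the model is any module mapping 224 × 224 RGB images to two-channel logits, plain cross-entropy stands in for the hybrid CE + Dice objective of Section 3.3.4, and the dice_score helper and checkpoint filename are illustrative choices rather than the authors' released implementation.

import torch
from torch import nn

# Minimal sketch of the Algorithm 1 training loop. The model is assumed to be
# an nn.Module mapping (B, 3, 224, 224) images to (B, 2, 224, 224) logits
# (e.g., the SwinUNet with attention gates described above). Plain
# cross-entropy stands in here for the hybrid CE + Dice objective.
def dice_score(logits, masks, eps=1e-7):
    pred = logits.argmax(dim=1)                       # (B, H, W) hard foreground mask
    inter = ((pred == 1) & (masks == 1)).sum().float()
    total = (pred == 1).sum() + (masks == 1).sum()
    return ((2 * inter + eps) / (total + eps)).item()

def train(model, train_loader, val_loader, max_epochs=800, lr=1e-4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_dice = 0.0

    for epoch in range(1, max_epochs + 1):
        model.train()
        for images, masks in train_loader:            # masks hold class indices {0, 1}
            images, masks = images.to(device), masks.to(device).long()
            logits = model(images)                    # encoder -> attention gates -> decoder
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation phase: keep the checkpoint with the best mean Dice score
        model.eval()
        with torch.no_grad():
            scores = [dice_score(model(x.to(device)), y.to(device)) for x, y in val_loader]
        mean_dice = sum(scores) / max(len(scores), 1)
        if mean_dice > best_dice:
            best_dice = mean_dice
            torch.save(model.state_dict(), "best_swinunet_attention.pt")

    return model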

3.3. Hierarchical Feature Extraction Process

The hierarchical structure of the Swin Transformer is specialized for vision tasks, especially those requiring dense predictions such as image segmentation. The architecture starts with Stage 1, where the image is split into patches and represented as 56 × 56 × C feature maps, extracting shallow features. In Stage 2, patch merging reduces the spatial resolution to 28 × 28 while deepening the channels to 512, extracting mid-level features through Swin Blocks. Stage 3 further downsamples the features to 14 × 14 × 512, from which multiple Swin Blocks extract deeper features. Stage 4 uses advanced Power Blocks and Swin Blocks to derive deep contextual features at a very low resolution of 7 × 7 × 1024, richly capturing semantics. Like traditional CNNs, the model reduces the spatial resolution from 56 to 7 and increases feature depth, but it uses shifted window attention in place of convolutions to learn local detail and global context simultaneously, which is crucial for precise tasks such as medical image segmentation, as shown in Figure 3 and Table 1.
Table 1 illustrates how the network gradually develops features of increasing complexity as it progresses through levels, in a manner analogous to convolutional neural networks. In Level 1, the network produces feature maps of size 56 × 56 × C that capture primitive features such as lines and small textures. Advancing to Level 2, with feature maps of size 28 × 28 × 512, it recognizes medium-level features consisting of simple shapes and textural patterns lying between primitive and high-level representations. In Level 3, it outputs maps of size 14 × 14 × 512 representing high-level semantic features involving object parts and their spatial arrangements. Finally, in Level 4, with map sizes of 7 × 7 × 1024, it extracts deep, global features encapsulating the information of the whole image. This successive progression through levels of increasing feature complexity is crucial for understanding and dissecting complex visual representations.
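The four-stage pyramid described above can be inspected directly from a pre-trained backbone. The sketch below assumes a recent timm release in which Swin models support features_only (newer versions return Swin feature maps in channels-last layout); the exact channel widths depend on the chosen variant and may differ from the values quoted in Table 1.

import timm
import torch

# Sketch: extracting the four-stage feature pyramid from a pre-trained
# Swin Transformer backbone via timm. Depending on the timm version,
# Swin feature maps may be returned in channels-last (NHWC) layout.
backbone = timm.create_model(
    "swin_base_patch4_window7_224",
    pretrained=True,
    features_only=True,          # return one feature map per stage
)

x = torch.randn(1, 3, 224, 224)  # a dummy ultrasound image batch
features = backbone(x)

for i, f in enumerate(features, start=1):
    print(f"Stage {i}: {tuple(f.shape)}")
# Expected pattern: the spatial size halves at each stage (56 -> 28 -> 14 -> 7)
# while the channel depth grows, mirroring the hierarchy described above.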

3.3.1. Decoder Architecture for Medical Image Segmentation

This section explains the decoder's upsampling path in the segmentation network. It shows how features are progressively restored from low resolution back to the original image size, enabling per-pixel prediction. The process is carried out by three ConvTranspose2D layers, each using a 2 × 2 kernel with a stride of 2. Upsampling starts from the deep, compressed features, gradually increases the spatial dimensions to 56 × 56 with 256 channels, and finally reaches 224 × 224 while the channel count is reduced. The last Conv2D layer outputs a 224 × 224 map with two channels performing background-versus-object segmentation. Important design aspects are learnable upsampling through transposed convolution layers, reduction of channels to the output classes, and a binary output separating background from foreground objects. This allows the decoder to restore accurate and detailed full-resolution segmentation maps from abstract feature representations, which is important in medical image analysis for delineating structures precisely, as shown in Figure 4.
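The following sketch shows one way such a decoder path could be written in PyTorch. The channel widths, and the final bilinear upsampling stage used here to reach 224 × 224 from 56 × 56, are illustrative assumptions; the paper's exact configuration may differ.

import torch
from torch import nn

# Sketch of the upsampling decoder path: repeated ConvTranspose2d blocks
# (kernel 2, stride 2) restore spatial resolution, and a final 1x1
# convolution produces the two-channel (background vs. nodule) map.
class SimpleDecoder(nn.Module):
    def __init__(self, in_channels=1024, num_classes=2):
        super().__init__()
        def up_block(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.up1 = up_block(in_channels, 512)   # 7x7   -> 14x14
        self.up2 = up_block(512, 256)           # 14x14 -> 28x28
        self.up3 = up_block(256, 128)           # 28x28 -> 56x56
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),  # 56 -> 224
            nn.Conv2d(128, num_classes, kernel_size=1),
        )

    def forward(self, x):                        # x: (B, 1024, 7, 7)
        x = self.up3(self.up2(self.up1(x)))
        return self.head(x)                      # (B, 2, 224, 224)

logits = SimpleDecoder()(torch.randn(1, 1024, 7, 7))
print(logits.shape)  # torch.Size([1, 2, 224, 224])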

3.3.2. Swin Transformer Block Architecture

The components within Swin Transformer Blocks lay the groundwork for the Swin Transformer used in computer vision tasks. Two block types are introduced: the standard Swin Block on the left and the Shifted Window Swin Block on the right. Both blocks share a common framework, which includes a feature-transforming Multi-Layer Perceptron (MLP), Layer Normalization (LN) for stabilizing training, and residual connections (⊕) for gradient flow. The standard block employs Window Multi-Head Self-Attention (W-MSA), which computes attention within fixed local windows. The shifted block, on the other hand, employs Shifted Window Multi-Head Self-Attention (SW-MSA), which allows cross-window interaction through shifted partitioning. This alternating stack of W-MSA and SW-MSA blocks provides a balance between computational cost and the ability to capture long-range dependencies. Notable design highlights include the pre-norm layout, residual connections as used in ResNet, and a hierarchical attention design that supports CNN-style multi-scale feature extraction. Such a design empowers Swin Transformers to process high-resolution images and provides the advantages of self-attention with the scalability required for dense vision tasks such as segmentation, as shown in Figure 5.
ẑ^(l) = W-MSA(LN(z^(l-1))) + z^(l-1)
This equation produces the intermediate output ẑ^(l) of block l: the output z^(l-1) of the previous block is layer-normalized (LN), passed through window-based multi-head self-attention (W-MSA), and added back to z^(l-1) through a residual connection, so attention refines the features without discarding the original information.
z^(l) = MLP(LN(ẑ^(l))) + ẑ^(l)
The equation applies layer normalization followed by a Multi-Layer Perceptron (MLP) to ẑ^(l) and then adds the original activation back. This is the principle of the residual connection, in which the model retains the original information while transforming it in the MLP.
ẑ^(l+1) = SW-MSA(LN(z^(l))) + z^(l)
Shifted-window multi-head self-attention (SW-MSA) operates on the layer-normalized activation z^(l), and the result is added to the original z^(l) to form the residual connection, allowing information to flow across window boundaries while preserving the block input.
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
The final equation applies the MLP to the layer-normalized ẑ^(l+1) and adds ẑ^(l+1) back residually, completing the alternating W-MSA/SW-MSA block pair and further aiding the information flow across layers [46].
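To make the block structure concrete, the following simplified PyTorch sketch implements a (W-MSA, SW-MSA) block pair in the spirit of the equations above: attention is computed inside non-overlapping windows, and the shifted variant cyclically rolls the feature map by half a window so neighbouring windows exchange information. Relative position bias and the attention masking of shifted windows are deliberately omitted; this is an illustration, not the production Swin implementation.

import torch
from torch import nn

# Simplified sketch of a (W-MSA, SW-MSA) Swin block pair. Attention is applied
# within non-overlapping windows; the shifted block rolls the feature map so
# that adjacent windows interact. Relative position bias and shift masking
# are omitted for brevity.
class WindowBlock(nn.Module):
    def __init__(self, dim, heads=4, window=7, shift=0, mlp_ratio=4):
        super().__init__()
        self.window, self.shift = window, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                       # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                          # cyclic shift for SW-MSA
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        w = self.window
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)             # each window becomes a token sequence
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                        # z_hat = (S)W-MSA(LN(z)) + z
        return x + self.mlp(self.norm2(x))      # z = MLP(LN(z_hat)) + z_hat

pair = nn.Sequential(WindowBlock(96, shift=0), WindowBlock(96, shift=3))
print(pair(torch.randn(1, 56, 56, 96)).shape)   # torch.Size([1, 56, 56, 96])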

3.3.3. Swin Transformer Encoder and Attention Mechanism

The encoder is initialized from the Swin Transformer architecture, using the timm model swin_base_patch4_window7_224, which is known to provide excellent foundational representations, especially after pre-training on ImageNet. This design possesses several critical elements tailored for the precise segmentation of medical images. First, its hierarchical feature extraction consists of four stages that progressively lower the spatial resolution while increasing the channel depth, serving to capture small anatomical parts alongside broader contextual cues. Second, the shifted window attention mechanism replaces global self-attention with self-attention computed within local windows, enabling cross-window feature exchange through window shifts while maintaining linear computational complexity. Third, the spatial resolution is downsampled through the patch merging strategy to form a CNN-like feature pyramid of successively diminishing resolution, which improves multi-scale representation and is vital for segmentation performance. Attention gates are strategically positioned along the connections between the encoder and decoder to improve localization by amplifying relevant features while suppressing background clutter and refining feature selection. These integrated components allow the model to learn both local detail and gross anatomy, which is vital for precise segmentation in complicated, low-contrast ultrasound images.
The attention coefficient α is calculated as
α = γ · σ2( ψ( σ1( W_x · x + W_g · g + b_g ) ) + b_ψ )
Here, x refers to the input feature map from the skip connection, while g refers to a gating signal that delivers coarser contextual information from the decoder. Learnable weight matrices W_x and W_g apply to x and g, respectively, and b_g and b_ψ are bias terms. The inner activation σ1 (typically ReLU) applies a nonlinear transformation to the combined projections of x and g. The function ψ, usually a 1 × 1 convolution or linear transformation, maps this intermediate representation to a single channel. The outer activation σ2 (a sigmoid) normalizes the output to the range [0, 1], and the resulting attention map α may additionally be rescaled by γ. α highlights the most relevant spatial locations or features of x and, guided by the contextual information in g, enables the network to focus on relevant regions while suppressing less important ones, realizing a form of soft attention.
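A minimal sketch of such an additive attention gate is given below, assuming the gating signal has already been brought to the spatial size of the skip features; the channel widths are illustrative.

import torch
from torch import nn

# Sketch of an additive attention gate in the spirit of the equation above:
# the skip-connection features x and the coarser gating signal g are
# projected, combined, passed through ReLU, reduced to one channel by psi,
# and squashed with a sigmoid to give the attention map alpha.
class AttentionGate(nn.Module):
    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x, g):
        # g is assumed to have been upsampled to the spatial size of x
        q = torch.relu(self.w_x(x) + self.w_g(g))
        alpha = torch.sigmoid(self.psi(q))      # (B, 1, H, W), values in [0, 1]
        return x * alpha                        # emphasize relevant regions of x

x = torch.randn(1, 256, 56, 56)                 # encoder skip features
g = torch.randn(1, 512, 56, 56)                 # decoder gating signal
print(AttentionGate(256, 512, 128)(x, g).shape) # torch.Size([1, 256, 56, 56])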

3.3.4. Decoder Architecture and Attention Integration

The decoder architecture improves segmentation accuracy through its multiscale upsampling capabilities combined with sophisticated attention mechanisms, shown in Figure 6A,B. Attention gates play a central role by dynamically weighting encoder features according to their relevance to the decoder's contextual information, effectively suppressing unimportant background areas while enhancing the anatomical regions that represent the foreground; this is important in medical imaging applications, where target structures are often tiny and embedded in cluttered backgrounds. These gates blend multiscale attention across the levels of the decoder, refining features from coarse to fine over different spatial extents. In the decoder pathway, the input features are upsampled by 2 × 2 transposed convolutions followed by batch normalization and ReLU, which maintains the spatial coherence of the expanded features. After passing through the attention gates, the skip connections merge encoder features with the corresponding decoder outputs, preserving fine spatial details and adding fine-grained localization to the context-rich features. The refined decoder maps are projected by a 1 × 1 convolution segmentation head to obtain the output class map, generating two channels representing background and foreground regions for binary segmentation and therefore allowing accurate identification of target structures. Training uses a loss designed for binary segmentation that addresses the class imbalance frequently occurring in medical segmentation, since target structures usually cover much smaller regions than the background. To ensure optimal segmentation performance, we utilize a hybrid loss function, which combines cross-entropy (CE) loss with Dice loss. The CE loss is effective for pixel-level classification, with well-behaved gradients that ensure smooth convergence during training. The Dice loss, on the other hand, directly maximizes the overlap between the predicted and ground truth masks, which is critical in medical image segmentation, where the region of interest, namely the thyroid nodules, occupies only a very small fraction of the total image. The hybrid CE-Dice loss combines the benefits of both terms: CE ensures stable training over all pixels, while Dice focuses on overlap, especially for irregular nodules, to better define their boundaries. The hybrid CE-Dice loss has been demonstrated to improve both training stability and segmentation performance, with evidence provided in terms of improved Dice, IoU, and F1 metrics as the number of training epochs increases (Table 2, Table 3, Table 4 and Table 5, Figure 7, Figure 8, Figure 9 and Figure 10).
L_CE = -(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
Equation (6) represents the binary cross-entropy (BCE) loss, a loss function commonly used for classification problems. Here, N is the number of samples, y_i is the ground truth label of the i-th sample (0 or 1), and p_i is the predicted probability that the sample belongs to class 1. The loss measures the discrepancy between the predicted probabilities and the true labels. The term y_i log(p_i) penalizes the model when it assigns a low probability to a positive sample, while (1 - y_i) log(1 - p_i) penalizes it when it assigns a high probability to a negative sample. The leading negative sign converts the log-likelihood into a quantity to be minimized, so the loss decreases as the predictions agree with the true labels. The total loss L_CE is obtained by averaging over all N samples, providing a single scalar value for optimization.
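A possible implementation of the hybrid CE + Dice objective described above is sketched below; the equal weighting of the two terms and the smoothing constant are illustrative choices, not reported hyperparameters.

import torch
from torch import nn

# Sketch of the hybrid CE + Dice objective. The weighting between the two
# terms and the smoothing constant are illustrative, not the paper's values.
class CombinedLoss(nn.Module):
    def __init__(self, ce_weight=0.5, dice_weight=0.5, smooth=1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.ce_weight, self.dice_weight, self.smooth = ce_weight, dice_weight, smooth

    def forward(self, logits, target):
        # logits: (B, 2, H, W); target: (B, H, W) with values {0, 1}
        ce_loss = self.ce(logits, target)

        probs = torch.softmax(logits, dim=1)[:, 1]        # foreground probability
        target_f = target.float()
        intersection = (probs * target_f).sum(dim=(1, 2))
        union = probs.sum(dim=(1, 2)) + target_f.sum(dim=(1, 2))
        dice = (2 * intersection + self.smooth) / (union + self.smooth)
        dice_loss = 1.0 - dice.mean()

        return self.ce_weight * ce_loss + self.dice_weight * dice_loss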

3.4. Experimental Setup and Evaluation Metrics

3.4.1. Dataset, Preparation and Preprocessing

We trained our model on the TN3K Thyroid Nodule Region Segmentation Dataset, a publicly available dataset [47] specifically intended for the development and evaluation of automated thyroid nodule segmentation methods. It contains a total of 3493 ultrasound images, comprising 2879 images with segmentation masks for training and 614 images with masks for testing. All images have been annotated by clinical experts, rendering the dataset a reliable source for meaningful advancement of computer-aided diagnosis in thyroid ultrasound imaging. The experimental dataset comprises medical images along with their associated ground truth segmentation masks. Images are provided in RGB format, and masks are given as grayscale images in which pixel values represent class membership. The dataset is intentionally constructed to represent actual clinical situations with differing image qualities, anatomical differences, and pathological features.

3.4.2. Preprocessing, Standardization, and Label Encoding

The preprocessing pipeline consists of key steps that standardize the inputs and optimize processing speed for the downstream model. Each image is resized to 224 × 224 pixels using bilinear interpolation, producing the fixed input dimension required by the pre-trained Swin Transformer. Next, the resized samples are converted into PyTorch (version 2.7.x) tensors, and each channel is normalized with the mean and standard deviation derived from the ImageNet dataset: mean values of 0.485 for the red channel, 0.456 for green, and 0.406 for blue, with standard deviations of 0.229, 0.224, and 0.225, respectively, thereby maximizing the benefit of transfer learning. Mirroring the image preprocessing, the target masks undergo a simple binarization: pixels above a set threshold are marked as foreground (label 1), while the remaining pixels are marked as background (label 0). This concise binary encoding pairs neatly with the cross-entropy loss function used during training, ensuring predictable and efficient gradient calculations.
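The torchvision-based sketch below reflects this preprocessing, assuming a recent torchvision release that exposes InterpolationMode; the binarization threshold and the helper name load_pair are illustrative.

import torch
from torchvision import transforms
from PIL import Image

# Sketch of the preprocessing described above: bilinear resize to 224x224,
# tensor conversion, ImageNet-statistics normalization for the image, and
# thresholding of the mask. The threshold value is an illustrative choice.
image_tf = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def load_pair(image_path, mask_path, threshold=0.5):
    image = image_tf(Image.open(image_path).convert("RGB"))             # (3, 224, 224)
    mask = transforms.Resize(
        (224, 224), interpolation=transforms.InterpolationMode.NEAREST
    )(Image.open(mask_path).convert("L"))
    mask = (transforms.ToTensor()(mask) > threshold).long().squeeze(0)  # (224, 224), {0, 1}
    return image, mask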

3.4.3. Data Augmentation Strategy

To enhance the model's ability to generalize while coping with the limited availability of medical imaging data, we implement a comprehensive data augmentation pipeline. The pipeline first introduces random horizontal and vertical flips, each applied with a 50% probability. Next, we permit rotations of up to -15 ≤ θ ≤ +15 degrees to account for variations in probe positioning during acquisition. Color jitter applies small perturbations in brightness, contrast, and saturation. Finally, Gaussian noise is superimposed to mimic noise typically introduced during image acquisition, further diversifying the training set.
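One way to express this augmentation pipeline with torchvision is sketched below; the jitter strengths and noise level are illustrative values rather than the exact settings used in the study, and in practice the geometric transforms must be applied jointly to the image and its mask.

import torch
from torchvision import transforms

# Sketch of the augmentation pipeline described above. Geometric transforms
# (flips, rotation) must be applied identically to the mask in practice.
class AddGaussianNoise:
    def __init__(self, std=0.01):
        self.std = std
    def __call__(self, tensor):
        return tensor + self.std * torch.randn_like(tensor)

train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),           # rotations in [-15, +15] degrees
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.01),                      # simulate acquisition noise
])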
  • Intersection over Union (IoU)
Calculates the spatial overlap between predicted segmentations and ground truth segmentations:
IoU = |P ∩ GT| / |P ∪ GT|
The numerator |P ∩ GT| is the overlapping region of the prediction and the ground truth, while the denominator |P ∪ GT| is the total area covered by both the prediction and the ground truth. IoU ranges from 0 to 1, where 0 indicates no overlap and 1 indicates complete overlap; a higher IoU means better alignment of the segmentation with the ground truth.
  • Dice Similarity Coefficient:
Evaluates segmentation accuracy with emphasis on overlap:
Dice = 2|P ∩ GT| / (|P| + |GT|)
This equation defines the Dice coefficient (Dice), also referred to as the Sørensen–Dice index or the F1-score of segmentation tasks. It is a well-known similarity metric in medical image segmentation and computer vision, defined as twice the intersection of the predicted (P) and ground truth (GT) segmentations divided by the sum of their individual areas. The numerator 2|P ∩ GT| is twice the overlap between prediction and ground truth, and the denominator |P| + |GT| is the sum of the cardinalities of P and GT, i.e., the number of pixels in each segmentation. Like IoU, Dice ranges from 0 to 1, where 0 means no overlap and 1 means perfect correspondence between the predicted and ground truth masks. It is one of the standard measures in medical imaging, especially when classes are imbalanced, because it evaluates the segmentation with respect to overlap rather than including true negatives. Practitioners rely on the Dice value both as an evaluation metric and as a loss function (Dice loss) when training segmentation networks, especially in biomedicine, where accurate contouring is crucial for diagnosis and treatment.
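Both metrics can be computed directly from binary masks, as in the short sketch below; the eps term is an assumption added to guard against empty masks.

import torch

# Sketch of IoU and Dice computed from binary prediction and ground truth
# masks, following the two equations above; eps guards against empty masks.
def iou_and_dice(pred, target, eps=1e-7):
    pred = pred.bool()
    target = target.bool()
    intersection = (pred & target).sum().float()
    union = (pred | target).sum().float()
    iou = (intersection + eps) / (union + eps)
    dice = (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return iou.item(), dice.item()

pred = torch.tensor([[0, 1, 1], [0, 1, 0]])
gt   = torch.tensor([[0, 1, 0], [0, 1, 1]])
print(iou_and_dice(pred, gt))   # IoU = 2/4 = 0.5, Dice = 4/6 ≈ 0.667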

3.5. Complete SwinUNet Architecture for Medical Image Segmentation

Figure 6 summarizes the complete SwinUNet pipeline for medical image segmentation, illustrating how the various components integrate to perform accurate, pixel-level predictions. The process begins with input preprocessing, where a 224 × 224 × 3 medical image undergoes patch embedding and position encoding, followed by data loading, batch preparation, and training using cross-entropy and Dice loss with the Adam optimizer over 50 epochs. Evaluation metrics such as Dice score, IoU, and intermediate loss tracking monitor performance. The Swin Transformer encoder consists of four hierarchical stages: Stage 1 with 2 Swin Blocks at high resolution (224 × 224), Stage 2 downsampling to 112 × 112, Stage 3 with 18 Swin Blocks at 56 × 56, and a bottleneck Stage 4 capturing deep features at 2 × 2 resolution. The decoder pathway employs a series of ConvTranspose2D layers to progressively upsample from 7 × 7 to 224 × 224, while attention gates selectively refine skip connections from the encoder stages by combining gating and feature signals. The final segmentation head produces a binary mask distinguishing the target structure, as shown in Figure 6A,B.
Figure 6. (A) Input Processing and Model Training Pipeline. (B) Detailed SwinUNet Architecture for Medical Image Segmentation.

4. Results and Discussion

4.1. Performance Evaluation

The proposed SwinUNet with an attention mechanism has about 88 million parameters. For the implementation of the proposed models, the machine used is a Supermicro Super Server with an Intel® Xeon® Platinum 8352V processor × 114, 256 GB of RAM, and a 17 TB disk drive, assisted by an NVIDIA RTX A6000 GPU. The inference time of the model on a 224 × 224 image is about 25–30 milliseconds, and the maximum GPU memory usage is about 12–14 GB.
The performance (Table 2) evaluation of a machine learning model across several training epochs (10, 20, 50, 100, 300, and 800) is provided in this thorough analysis. With the model reaching top performance at 800 epochs, the study shows continuous improvement in all assessment measures as training epochs rise. From epoch 10 to epoch 800, important discoveries include a 12.3% increase in precision, 9.4% boost in recall, and 12.1% improvement in F1-score.
Table 2. Performance Metrics Overview of Our Model.

Epoch   Precision   Recall   F1-Score   Accuracy   IoU      AUC
10      0.7752      0.8150   0.7631     0.9512     0.6760   0.9835
20      0.8104      0.8357   0.7882     0.9541     0.7020   0.9857
50      0.8250      0.8481   0.8009     0.9560     0.7100   0.9864
100     0.8368      0.8602   0.8154     0.9584     0.7160   0.9871
300     0.8512      0.8701   0.8327     0.9610     0.7300   0.9886
800     0.8705      0.8913   0.8551     0.9691     0.7800   0.9902
The table reports precision, recall, F1-score, accuracy, Intersection over Union (IoU), and Area Under the Curve (AUC), the six key performance metrics used to assess the model, and summarizes the complete results across all training epochs.
Figure 7 shows how accuracy, recall, F1-score, and precision vary over the training epochs. Every indicator shows a steady increasing trend with diminishing returns as training progresses.
Figure 8 presents the development of the AUC and IoU metrics. Although AUC displays continuous growth in small steps, IoU exhibits more notable improvements, especially between epochs 300 and 800.
All performance indicators across training epochs are shown in Figure 9 as a heatmap. Darker hues indicate higher performance values, clearly demonstrating the development and relative performance of every criterion.
Figure 10 shows the percentage increase in every indicator with respect to the baseline performance at epoch 10. IoU has the largest improvement (15.38%), followed by F1-score at 12.05%.
Patterns in learning effectiveness are seen in Figure 11, which shows the rate of change between successive epoch intervals. Notable results include consistent AUC increases throughout training and faster IoU growth in the final phase of training.
Figure 7. Primary Metrics Performance Across Training Epochs.
Figure 8. Secondary Metrics Performance (IoU and AUC).
Figure 9. Performance Metrics Heatmap.
Figure 10. Percentage Improvement from Baseline (Epoch 10).
Figure 11. Learning Rate Analysis—Rate of Change between Epoch Intervals.
Figure 12 shows the percentage increases in several performance indicators of the model following the applied modifications (model fine-tuning or architectural alteration). The development of every indicator is plotted on the Y-axis, offering a relative view of the extent of change in each area.
Figure 13 shows the average change across five epoch intervals during training, aggregated over all measures. Between epochs 10 and 20, it reveals a sharp improvement of 4.1%, marking the most important learning period. Following this, improvement rates progressively fall to 2.1% between epochs 50 and 100, suggesting early convergence. Interestingly, the curve shows a slight upward trend again in the 300–800 epoch range (2.7%), suggesting that extended training yields subtle yet meaningful gains, particularly in spatial accuracy metrics such as IoU, which gains 6.85% in this interval. The general pattern matches the conventional diminishing returns of deep learning training cycles, while the renewed improvement between epochs 300 and 800 implies that some elements of the model, particularly those governing spatial accuracy, benefit from the prolonged training session.
Table 3 displays the change in metric values from one epoch to the next, and highlights where learning progress slows down or accelerates across training duration.
Table 3. Improvement over Time (Δ from Previous Epoch).

Epoch   Precision   Recall   F1-Score   Accuracy   IoU     AUC
10      -           -        -          -          -       -
20      0.0352      0.0207   0.0251     0.0029     0.026   0.0022
50      0.0146      0.0124   0.0127     0.0019     0.008   0.0007
100     0.0118      0.0121   0.0145     0.0024     0.006   0.0007
300     0.0144      0.0099   0.0173     0.0026     0.014   0.0015
800     0.0193      0.0212   0.0224     0.0081     0.050   0.0016
Table 4 identifies the highest achieved value for each metric and the epoch at which it occurred. It is useful for selecting the optimal model checkpoint for specific goals.
Table 4. Best Values Highlight.

Metric      Best Value   Epoch
Precision   0.8705       800
Recall      0.8913       800
F1-score    0.8551       800
Accuracy    0.9691       800
IoU         0.780        800
AUC         0.9902       800
Table 5 shows the relative improvement in each metric between epochs 10 and 800, clearly indicating the long-term training benefits and metric-wise learning efficiency.
Table 5. Percentage Gain from Epoch 10 to 800.

Metric      Epoch 10   Epoch 800   % Increase
Precision   0.7752     0.8705      +12.3%
Recall      0.8150     0.8913      +9.37%
F1-score    0.7631     0.8551      +12.05%
Accuracy    0.9512     0.9691      +1.88%
IoU         0.6760     0.780       +15.38%
AUC         0.9835     0.9902      +0.68%
Figure 14 compares the results of several deep learning architectures applied to medical image segmentation on the four key metrics of precision, F1-score, accuracy, and IoU (Intersection over Union). The graph clearly demonstrates that SwinUnet_withAttention (in red) performs exceptionally well on all metrics, achieving scores that approach 1.0 in some cases. Other architectures such as CGNet, TransUNet, and DeepLabV3+ rank next, while U-Net and the baseline SwinUNet rank lower. The figure thus illustrates the relative performance of the various architectures on these key metrics for medical imaging evaluation.
In the thyroid ultrasound images, the figure demonstrates the stages of development of the SwinUnet_withAttention segmentation model when trained for increasing numbers of epochs (10, 20, and 50, shown in Figure 15A, and 100, 300, and 800, shown in Figure 15B). Early in training (10 and 20 epochs), the model starts to locate cystic or nodular structures approximately, with uneven and rough borders. As training advances toward 50 and 100 epochs, the segmentation shows better shape consistency and better alignment with the ground truth annotations. The model peaks at 300 epochs, creating smooth, anatomically correct contours that closely match the actual borders of the hypoechoic areas. Although the results at 800 epochs remain remarkably precise, small changes point to a plateau or some overfitting in certain cases. With maximum segmentation performance observed between 300 and 800 epochs, the figure shows the model's improving ability to capture intricate textures and structural features in low-contrast ultrasound images.
Among all evaluated methods, it is clear that SwinUnet_withAttention has the strongest segmentation performance, achieving the highest reported IoU (78.00 ± 0.22) and Dice score (87.60 ± 0.33). The CNN-based models U-Net, FCN, and SegNet remain around 66 IoU and 80 Dice, lagging far behind the rest of the competitors. Even in comparison to more advanced models such as ResUNet (73.38 IoU, 84.80 Dice), SwinUnet_withAttention outperforms them by more than four points in IoU and almost three in Dice. This illustrates how effective the model is at capturing small, intricate details of the target structures. The accuracy stays within the same narrow band as most other methods (~96.4–97.2%), reinforcing that the improvements in IoU and Dice are not the result of background classification inflation but rather a true enhancement in segmentation quality.
Improvements were observed in all benchmark assessments during the training phase, signifying an enhancement in the model's ability to learn. Precision increased by 12.30%, rising from 77.52% to 87.05%, and recall improved by 9.36%, from 81.50% to 89.13%, demonstrating increased reliability and stability of classification. The F1-score improved by 12.05%, from 76.31% to 85.51%, indicating a better overall balance of precision and recall. Accuracy increased by 1.88%, from 95.12% to 96.91%, a smaller but still meaningful improvement. The IoU showed one of the largest relative improvements at 15.38%, increasing from 67.60% to 78.00%, signifying better segmentation overlap. The AUC improved from 98.35% to 99.02%, indicating improved confidence and better discrimination over the course of training.

4.2. Discussion

For this research, a SwinUNet with attention was proposed to tackle the segmentation of thyroid ultrasound images, and its effectiveness was shown in terms of accuracy and efficiency. Limiting training to a maximum of 800 epochs improves the segmentation outcome (IoU and Dice), whereas training beyond this threshold could result in overfitting. The progression of the validation loss helps in understanding convergence and the small boundary fluctuations, suggesting the need for early stopping or other regularization techniques when training such models. With the hierarchical representation provided by the Swin Transformer architecture and attention gating, the model achieves a balance between efficiency and effectiveness. The model takes about 20–25 ms to process a single 224 × 224 image on the NVIDIA RTX A6000 GPU and operates at approximately 40–50 frames per second. The results of this extensive assessment highlight the importance of prolonged training as well as of architectural developments in medical image segmentation, specifically through the proposed SwinUnet_withAttention framework shown in Figure 2. This research provides strong empirical evidence that longer training drastically improves performance on all primary assessment metrics, notably those related to spatial accuracy, which are crucial in the clinical environment. As shown in Table 2, the model improves consistently from epoch 10 to epoch 800, with precision increasing from 0.7752 to 0.8705, recall from 0.8150 to 0.8913, and F1-score from 0.7631 to 0.8551. These observations are visualized in Figure 7, which clearly shows the gradual upward trajectory over the training epochs and affirms the continuous improvement the model achieves as training continues. The most impressive enhancements are seen in Table 5, where the percentage improvement from the baseline at epoch 10 to epoch 800 is substantial: +12.3% in precision, +9.37% in recall, +12.05% in F1-score, and a remarkable +15.38% in IoU. These gains, above all in IoU, demonstrate the system's ability to localize and delineate anatomy, which is critical for sophisticated medical diagnostics. The learning dynamics are documented in Figure 13, which covers the decisive training phases: a rapid-learning phase from epochs 10 to 20 (an improvement of 4.1%) continuing through epochs 20 to 50, a plateau between epochs 50 and 100 indicating partial convergence, and a late phase between epochs 300 and 800 in which learning efficiency recovers. This is further illustrated by the IoU improvements shown in Figure 11, Figure 15A,B and Figure 16. The pattern reflects a general trend in deep learning training, and this three-stage learning behavior underscores SwinUnet_withAttention's ability to keep improving over time, particularly for spatially complex segmentation tasks.
A closer inspection of metric sensitivity, supported by the heatmap in Figure 9 and the comparison in Figure 10, indicates that IoU responds most strongly to longer training, with an improvement of 15.38%, followed by F1-score (12.05%) and precision (12.3%). In contrast, accuracy improves only marginally (+1.88%) owing to its already high starting value of 95.12% at epoch 10, indicating diminishing returns for near-saturated measures. The differential metric analysis in Figure 12 confirms that spatially oriented measures benefit more from longer training than discriminative measures such as AUC, which show only incremental increases from an already high baseline. This has practical implications for model optimization: applications requiring accurate boundary delineation, such as tumor detection or organ contouring, justify longer training times, whereas resource-limited applications that already achieve acceptable discriminability may prefer earlier termination to save computational cost. The comparative assessment in Table 6 and Figure 14 further underlines the architectural advantage of SwinUNet_withAttention, which outperforms all benchmark models on every reported metric. With a precision of 0.8705, recall of 0.8913, accuracy of 0.9691, IoU of 0.78, and AUC of 0.9902, it surpasses state-of-the-art models such as UCTransNet (IoU: 0.73, F1-score: 0.92) and established baselines such as DeepLabV3+, U-Net++, and TransUNet. Conventional architectures such as SwinUNet and CGNet are less competitive, with F1-scores of 0.78 and 0.82 and IoUs of 0.675 and 0.643, respectively, highlighting the gain obtained by combining attention gates with hierarchical transformer-based encoding. The rate-of-change analysis in Table 3 identifies strong interval gains between epochs 10–20 and near epoch 800, with precision increasing by 0.0193 and recall by 0.0212 in the final interval alone. These observations, illustrated in Figure 11, support the view that meaningful improvements continue even in advanced stages of training, particularly in spatial comprehension. Although learning efficiency decreases as training progresses, the marginal gains remain worthwhile, as shown by the 2.7% improvement over epochs 300–800 in Figure 13. While the proposed architecture and augmentation strategy are designed to enhance generalization, formal validation on multi-center or multi-device datasets without retraining has not yet been performed, so the model’s performance across different clinical settings and imaging devices remains to be fully assessed. Overall, this discussion demonstrates the need for long training schedules in problems requiring high spatial accuracy, offering useful guidance for the medical imaging community. SwinUNet with Attention clearly improves on earlier architectures for medical image segmentation by combining the focus provided by attention mechanisms with the skip connections of U-Net-style decoders. It also shows the benefit of extended training: the ability to learn deeper, more complex representations over time is a crucial component whenever performance is spatially sensitive.
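Since the spatially oriented metrics drive most of the reported gains, a brief reminder of how they are computed may help. The following NumPy sketch of per-image Dice and IoU for binary masks is illustrative only; the helper name and toy masks are not taken from the study’s code:

```python
import numpy as np

def dice_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Dice coefficient and IoU for binary masks (values in {0, 1})."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

# Toy example: a predicted nodule mask shifted by one pixel against the ground truth.
gt = np.zeros((8, 8), dtype=np.uint8); gt[2:6, 2:6] = 1
pr = np.zeros((8, 8), dtype=np.uint8); pr[3:7, 2:6] = 1
d, i = dice_iou(pr, gt)
print(f"Dice={d:.3f}, IoU={i:.3f}")  # Dice=0.750, IoU=0.600
```

Per image, Dice and IoU are monotonically related (Dice = 2·IoU/(1 + IoU)), which explains why the two metrics improve together, although dataset-level averages need not satisfy this identity exactly.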
Furthermore, it underscores the importance of well-balanced, metric-aware training strategies, including adaptive early stopping that accounts for when discriminative metrics saturate; a minimal sketch of such a rule follows.
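As an illustration only (the stopping rule below is an assumption, not the procedure used in this study), the sketch stops training once both a spatial metric (IoU) and a discriminative metric (AUC) have plateaued; the class name, patience, and threshold values are hypothetical:

```python
class MetricAwareEarlyStopper:
    """Stops training only after BOTH a spatial metric (e.g. IoU) and a
    discriminative metric (e.g. AUC) have plateaued, reflecting the observation
    that spatial metrics keep improving long after AUC saturates."""
    def __init__(self, patience: int = 5, min_delta: float = 1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = {"iou": 0.0, "auc": 0.0}
        self.stale = {"iou": 0, "auc": 0}

    def update(self, iou: float, auc: float) -> bool:
        """Returns True when training should stop."""
        for name, value in (("iou", iou), ("auc", auc)):
            if value > self.best[name] + self.min_delta:
                self.best[name] = value
                self.stale[name] = 0
            else:
                self.stale[name] += 1
        return all(s >= self.patience for s in self.stale.values())

# Hypothetical validation history: AUC saturates early, IoU keeps improving for a while.
stopper = MetricAwareEarlyStopper(patience=3)
history = [(0.70, 0.985), (0.72, 0.99), (0.74, 0.99), (0.75, 0.99),
           (0.75, 0.99), (0.75, 0.99), (0.75, 0.99)]
for epoch, (iou, auc) in enumerate(history):
    if stopper.update(iou, auc):
        print(f"stop at epoch {epoch}")  # stops only after IoU has also plateaued
        break
```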

5. Conclusions

In this work, we proposed a symmetry-aware SwinUNet with integrated attention for transformer-based segmentation of thyroid ultrasound images. By embedding attention gates within a hierarchical encoder–decoder architecture, the model preserves the strong representational capacity of Swin Transformers while maintaining a balanced interaction between global contextual information and fine-grained spatial features. This symmetric feature modeling improves structural consistency and boundary accuracy in challenging ultrasound images. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across multiple evaluation metrics, including precision, recall, F1-score, IoU, accuracy, and AUC, establishing a new benchmark for thyroid nodule segmentation. Ablation studies further reveal the influence of architectural symmetry and extended training on convergence behavior and spatial accuracy, offering practical guidance for optimizing training strategies in clinically relevant scenarios. Future research may explore adaptive training schedules guided by spatial symmetry metrics, lightweight attention designs to reduce computational complexity, broader validation across diverse imaging modalities and pathologies, and the integration of multimodal data (for example, combining ultrasound with other clinical data) to further enhance segmentation performance. Incorporating clinician-in-the-loop feedback mechanisms may further enhance the interpretability, trust, and translational applicability of symmetry-aware transformer models in real-world medical imaging tasks.

Author Contributions

Conceptualization, A.O.; Methodology, A.O.; Software, A.O., I.H.K. and W.L. (Weichun Liu); Validation, I.H.K.; Investigation, A.O., W.L. (Weibing Liu) and W.L. (Weichun Liu); Resources, I.H.K., F.D., W.L. (Weibing Liu), B.Z., W.L. (Weichun Liu), Y.C. and Y.W.; Data curation, I.H.K., B.Z. and Y.C.; Writing—original draft, A.O.; Writing—review and editing, I.H.K., F.D., W.L. (Weibing Liu), W.L. (Weichun Liu), Y.C. and Y.W.; Visualization, A.O. and W.L. (Weibing Liu); Supervision, B.Z.; Project administration, A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (No. 2018AAA0102100); the National Natural Science Foundation of China (Nos. U22A2034 and 62177047); the Key Research and Development Program of Hunan Province (No. 2022SK2054); the Projects of Natural Science Research in China (No. 2022JJ50191); the Hunan Provincial Natural Science Foundation Committee (No. 2024JJ7508); and a grant from the Shaoyang Science and Technology Bureau, Hunan, China (No. 2024RC2028).

Data Availability Statement

The data presented in this study are openly available on Kaggle at https://www.kaggle.com/datasets/tjahan/tn3k-thyroid-nodule-region-segmentation-dataset (accessed on 16 July 2025), reference [47].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zuo, X.; Zhang, Y.; Wang, L. Federated learning via multi-attention guided UNet for thyroid nodule segmentation of ultrasound images. Neural Netw. 2025, 181, 106754. [Google Scholar] [CrossRef]
  2. Grani, G.; Sponziello, M.; Filetti, S.; Durante, C. Thyroid nodules: Diagnosis and management. Nat. Rev. Endocrinol. 2024, 20, 715–728. [Google Scholar] [CrossRef]
  3. Munsterman, R.; van der Velden, T.; Jansen, K. 3D Ultrasound Segmentation of Thyroid. WFUMB Ultrasound Open 2024, 2, 100055. [Google Scholar] [CrossRef]
  4. Li, X.; Chen, Y.; Liu, Z. DMSA-UNet for medical image segmentation. Knowl.-Based Syst. 2024, 299, 112050. [Google Scholar] [CrossRef]
  5. Li, X.; Fu, C.; Xu, S.; Sham, C.-W. Thyroid Ultrasound Image Database and Marker Mask Inpainting Method for Research and Development. Ultrasound Med. Biol. 2024, 50, 509–519. [Google Scholar] [CrossRef] [PubMed]
  6. Chaphekar, M.; Chandrakar, O. An improved deep learning models with hybrid architectures thyroid disease classification diagnosis. J. Neonatal Surg. 2025, 14, 1151–1162. [Google Scholar] [CrossRef]
  7. Wang, J.; Zheng, N.; Wan, H.; Yao, Q.; Jia, S.; Zhang, X.; Fu, S.; Ruan, J.; He, G.; Ouyang, N.; et al. Deep learning models for thyroid nodules diagnosis of fine-needle aspiration biopsy: A retrospective, prospective, multicentre study in China. Lancet Digit. Health 2024, 6, e458–e469. [Google Scholar] [CrossRef]
  8. Yadav, N.; Dass, R.; Virmani, J. A systematic review of machine learning based thyroid tumor characterisation using ultrasonographic images. J. Ultrasound 2024, 27, 209–224. [Google Scholar] [CrossRef]
  9. Cantisani, V.; Bojunga, J.; Durante, C.; Dolcetti, V.; Pacini, P. Multiparametric Ultrasound Evaluation of Thyroid Nodules. Ultraschall Med. 2025, 46, 14–35. [Google Scholar] [CrossRef]
  10. Gulame, M.B.; Dixit, V.V. Hybrid deep learning assisted multi-classification: Grading of malignant thyroid nodules. Int. J. Numer. Meth. Biomed. Engng 2024, 40, e3824. [Google Scholar] [CrossRef]
  11. Lu, X.; Chen, G.; Li, J.; Hu, X.; Sun, F. MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 2681–2689. [Google Scholar] [CrossRef]
  12. Xu, C.; Liu, W.; Chen, Y.; Ding, X. A Supervised Case-Based Reasoning Approach for Explainable Thyroid Nodule Diagnosis. Knowl.-Based Syst. 2022, 251, 109200. [Google Scholar] [CrossRef]
  13. Nie, X.; Zhou, X.; Tong, T.; Lin, X.; Wang, L.; Zheng, H.; Li, J.; Xue, E.; Chen, S.; Zheng, M.; et al. N-Net: A novel dense fully convolutional neural network for thyroid nodule segmentation. Front. Neurosci. 2022, 16, 872601. [Google Scholar] [CrossRef]
  14. Pan, S.; Liu, X.; Xie, N.; Zhang, Y.; Chen, L.; Li, H. EG-TransUNet: A transformer-based U-Net with enhanced and guided models for biomedical image segmentation. BMC Bioinform. 2023, 24, 85. [Google Scholar] [CrossRef] [PubMed]
  15. Dong, P.; Zhang, R.; Li, J.; Liu, C.; Liu, W.; Hu, J.; Yang, Y.; Li, X. An ultrasound image segmentation method for thyroid nodules based on dual-path attention mechanism-enhanced UNet++. BMC Med. Imaging 2024, 24, 341. [Google Scholar] [CrossRef]
  16. Chen, Y.; Zhang, X.; Li, D.; Park, H.; Li, X.; Liu, P.; Jin, J.; Shen, Y. Automatic Segmentation of Thyroid with the Assistance of the Devised Boundary Improvement Based on Multicomponent Small Dataset. Appl. Intell. 2023, 53, 19708–19723. [Google Scholar] [CrossRef] [PubMed]
  17. Das, D.; Iyengar, M.S.; Majdi, M.S.; Rodriguez, J.J.; Alsayed, M. Deep Learning for Thyroid Nodule Examination: A Technical Review. Artif. Intell. Rev. 2024, 57, 47. [Google Scholar] [CrossRef]
  18. Ma, X.; Sun, B.; Liu, W.; Sui, D.; Chen, J.; Tian, Z. AMSeg: A Novel Adversarial Architecture Based Multi-Scale Fusion Framework for Thyroid Nodule Segmentation. IEEE Access 2023, 11, 72911–72924. [Google Scholar] [CrossRef]
  19. Beyyala, A.; Priya, R.; Choudari, S.R.; Bhavani, R. Swin Transformer and Attention Guided Thyroid Nodule Segmentation on Ultrasound Images. Ingénierie Systèmes D’information 2024, 29, 75–81. [Google Scholar] [CrossRef]
  20. Yang, W.T.; Ma, B.Y.; Chen, Y. A Narrative Review of Deep Learning in Thyroid Imaging: Current Progress and Future Prospects. Quant. Imaging Med. Surg. 2024, 14, 2069–2088. [Google Scholar] [CrossRef]
  21. Sureshkumar, V.; Jaganathan, D.; Ravi, V.; Velleangiri, V.; Ravi, P. A Comparative Study on Thyroid Nodule Classification Using Transfer Learning Methods. Open Bioinform. J. 2024, 17, e18750362305982. [Google Scholar] [CrossRef]
  22. Sabouri, M.; Ahamed, S.; Asadzadeh, A.; Avval, A.H.; Bagheri, S.; Arabi, M.; Zakavi, S.R.; Askari, E.; Rasouli, A.; Aghaee, A.; et al. Thyroidiomics: An Automated Pipeline for Segmentation and Classification of Thyroid Pathologies from Scintigraphy Images. In Proceedings of the 12th European Workshop on Visual Information Processing (EUVIP), Geneva, Switzerland, 8–11 September 2024; pp. 1–6. [Google Scholar] [CrossRef]
  23. Mau, M.A.; Krusen, M.; Ernst, F. Automatic Thyroid Scintigram Segmentation Using U-Net. In Bildverarbeitung für die Medizin 2025; Palm, C., Breininger, K., Deserno, T., Handels, H., Maier, A., Maier-Hein, K.H., Tolxdorff, T.M., Eds.; Springer Fachmedien Wiesbaden: Wiesbaden, Germany, 2025; pp. 229–234. [Google Scholar] [CrossRef]
  24. Ludwig, M.; Ludwig, B.; Mikuła, A.; Biernat, S.; Rudnicki, J.; Kaliszewski, K. The Use of Artificial Intelligence in the Diagnosis and Classification of Thyroid Nodules: An Update. Cancers 2023, 15, 708. [Google Scholar] [CrossRef] [PubMed]
  25. Chi, J.; Li, Z.; Sun, Z.; Yu, X.; Wang, H. Hybrid transformer UNet for thyroid segmentation from ultrasound scans. Comput. Biol. Med. 2023, 153, 106453. [Google Scholar] [CrossRef]
  26. Peng, B.; Lin, W.; Zhou, W.; Bai, Y.; Luo, A.; Xie, S.; Yin, L. Enhanced Pediatric Thyroid Ultrasound Image Segmentation Using DC-Contrast U-Net. BMC Med. Imaging 2024, 24, 275. [Google Scholar] [CrossRef]
  27. Haribabu, K.; Prasath, R.; Praveen Joe, I.R. MLRT-UNet: An Efficient Multi-Level Relation Transformer-Based U-Net for Thyroid Nodule Segmentation. Comput. Model. Eng. Sci. 2025, 143, 413–448. [Google Scholar] [CrossRef]
  28. Pavithra, S.; Yamuna, G.; Arunkumar, R. Deep Learning Method for Classifying Thyroid Nodules Using Ultrasound Images. In Proceedings of the 2022 International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 25–26 March 2022; pp. 1–6. [Google Scholar] [CrossRef]
  29. Zeng, Y.; Zhang, Y.; Gong, N.; Li, M.; Wang, M. Research on Thyroid CT Image Segmentation Based on U-Shaped Convolutional Neural Network. Proc. SPIE 2023, 12705, 127051I. [Google Scholar] [CrossRef]
  30. Chen, Y.; Li, D.; Zhang, X.; Liu, F.; Shen, Y. A Devised Thyroid Segmentation with Multi-Stage Modification Based on Super-Pixel U-Net under Insufficient Data. Ultrasound Med. Biol. 2023, 49, 1728–1741. [Google Scholar] [CrossRef]
  31. Liu, X.; Hu, Y.; Chen, J. Hybrid CNN-Transformer model for medical image segmentation with pyramid convolution and multi-layer perceptron. Biomed. Signal Process. Control 2023, 86, 105331. [Google Scholar] [CrossRef]
  32. Inan, N.G.; Kocadağlı, O.; Yıldırım, D.; Meşe, İ.; Kovan, Ö. Multi-class classification of thyroid nodules from automatic segmented ultrasound images: Hybrid ResNet-based U-Net convolutional neural network approach. Comput. Methods Programs Biomed. 2024, 243, 107921. [Google Scholar] [CrossRef]
  33. Arepalli, L.; Kasukiurthi, V.R.; Dabbiru, M. Channel Boosted Convolutional Neural Network with SegNet-Based Segmentation for an Automatic Prediction of Thyroid Cancer. Soft Comput. 2025, 29, 2399–2415. [Google Scholar] [CrossRef]
  34. Xu, Y.; Quan, R.; Xu, W.; Huang, Y.; Chen, X.; Liu, F. Advances in Medical Image Segmentation: A Comprehensive Review of Traditional, Deep Learning and Hybrid Approaches. Bioengineering 2024, 11, 1034. [Google Scholar] [CrossRef]
  35. Al-Mukhtar, F.H.; Ali, A.A.; Al-Dahan, Z.T. Joint Segmentation and Classification. Zanco J. Pure Appl. Sci. 2024, 35, 60–71. [Google Scholar] [CrossRef]
  36. Yang, D.; Li, Y.; Yu, J. Multi-Task Thyroid Tumor Segmentation Based on the Joint Loss Function. Biomed. Signal Process. Control 2023, 79, 104249. [Google Scholar] [CrossRef]
  37. Xu, P. Research on thyroid nodule segmentation using an improved U-Net network. Rev. Int. Métodos Numér. Cálc. Diseño Ing. 2024, 40, 1–7. [Google Scholar] [CrossRef]
  38. Ozcan, A.; Tosun, Ö.; Donmez, E.; Sanwal, M. Enhanced-TransUNet for Ultrasound Segmentation of Thyroid Nodules. Biomed. Signal Process. Control 2024, 95, 106472. [Google Scholar] [CrossRef]
  39. Shao, J.; Pan, T.; Fan, L.; Li, Z.; Yang, J.; Zhang, S.; Zhang, J.; Chen, D.; Zhu, X.; Chen, H.; et al. FCG-Net: An innovative full-scale connected network for thyroid nodule segmentation in ultrasound images. Biomed. Signal Process. Control 2023, 86, 105048. [Google Scholar] [CrossRef]
  40. Hu, R.; Wang, H.; Zhang, S.; Zhang, W.; Xu, P. Improved U-Net Segmentation Model for Thyroid Nodules. IAENG Int. J. Comput. Sci. 2025, 52, 1407–1416. Available online: https://www.iaeng.org/IJCS/issues_v52/issue_5/index.html (accessed on 1 May 2025).
  41. Zheng, T.; Qin, H.; Cui, Y.; Wang, R.; Zhao, W.; Zhang, S.; Geng, S.; Zhao, L. Segmentation of thyroid glands and nodules in ultrasound images using the improved U-Net architecture. BMC Med. Imaging 2023, 23, 56. [Google Scholar] [CrossRef]
  42. Yetginler, B.; Atacak, İ. An Improved V-Net Model for Thyroid Nodule Segmentation. Appl. Sci. 2025, 15, 3873. [Google Scholar] [CrossRef]
  43. Zhang, J.; Qin, Q.; Ye, Q.; Ruan, T. ST-UNet: Swin Transformer Boosted U-Net with Cross-Layer Feature Enhancement for Medical Image Segmentation. Comput. Biol. Med. 2023, 153, 106516. [Google Scholar] [CrossRef] [PubMed]
  44. Li, X.; Pang, S.; Zhang, R.; Zhu, J.; Fu, X.; Tian, Y.; Gao, J. ATTransUNet: An Enhanced Hybrid Transformer Architecture for Ultrasound and Histopathology Image Segmentation. Comput. Biol. Med. 2023, 152, 106365. [Google Scholar] [CrossRef] [PubMed]
  45. Yang, C.; Ashraf, M.A.; Riaz, M.; Umwanzavugaye, P.; Chipusu, K.; Huang, H.; Xu, Y. Improving diagnostic precision in thyroid nodule segmentation from ultrasound images with a self-attention mechanism-based Swin U-Net model. Front. Oncol. 2025, 15, 1456563. [Google Scholar] [CrossRef] [PubMed]
  46. Ajilisa, O.A.; Jagathy Raj, V.P.; Sabu, M.K. Segmentation of thyroid nodules from ultrasound images using convolutional neural network architectures. J. Intell. Fuzzy Syst. 2022, 43, 687–705. [Google Scholar] [CrossRef]
  47. TN3K: Thyroid Nodule Region Segmentation Dataset. Available online: https://www.kaggle.com/datasets/tjahan/tn3k-thyroid-nodule-region-segmentation-dataset?select=trainval-mask (accessed on 16 July 2025).
  48. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  49. Gong, H.; Chen, J.; Chen, G.; Li, H.; Li, G.; Chen, F. Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Comput. Biol. Med. 2023, 155, 106389. [Google Scholar] [CrossRef]
  50. Pan, H.; Zhou, Q.; Latecki, L.J. SGUNET: Semantic Guided UNET for Thyroid Nodule Segmentation. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 630–634. [Google Scholar] [CrossRef]
  51. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  52. Prochazka, A.; Zeman, J. Thyroid Nodule Segmentation in Ultrasound Images Using U-Net with ResNet Encoder: Achieving State-of-the-Art Performance on All Public Datasets. AIMS Med. Sci. 2025, 12, 124–144. [Google Scholar] [CrossRef]
  53. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  54. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3431–3440. [Google Scholar]
  55. Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context Pyramid Fusion Network for Medical Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018. [Google Scholar] [CrossRef]
  56. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision–ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 801–818. [Google Scholar] [CrossRef]
  57. Gong, H.; Chen, G.; Wang, R.; Xie, X.; Mao, M.; Yu, Y.; Chen, F.; Li, G. Multi-Task Learning for Thyroid Nodule Segmentation with Thyroid Region Prior. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 257–261. [Google Scholar] [CrossRef]
Figure 3. Swin Transformer feature extraction.
Figure 4. Swin Feature Decoding and Mask Prediction.
Figure 5. Swin Transformer block [45].
Figure 12. Distribution of Model Performance Improvements Across Evaluation Metrics.
Figure 13. Average Performance Improvement across Epoch Intervals.
Figure 14. Comparison of Semantic Segmentation Methods.
Figure 15. (A) Predicted result with 10, 20, and 50 epochs. (B) Predicted result with 100, 300, and 800 epochs.
Figure 16. Performance of Segmentation Models: Accuracy vs. IoU.
Table 1. Feature hierarchy.

Stage | Output Size | Feature Depth | Feature Type
1 | 56 × 56 × C | Low | Shallow (local textures)
2 | 28 × 28 × 512 | Medium | Mid-level patterns
3 | 14 × 14 × 512 | High | Deep semantic features
4 | 7 × 7 × 1024 | Very high | Global contextual features
Table 6. Comparison of Semantic Segmentation Methods with the TN3K dataset.

Model | Dataset | Accuracy (%) | IoU (%) | Dice (%)
TransUNet [48] | TN3K train | 96.86 ± 0.05 | 69.26 ± 0.55 | 81.84 ± 1.09
TRFE+ [49] | TN3K train | 97.04 ± 0.10 | 71.38 ± 0.43 | 83.30 ± 0.26
SGUNet [50] | TN3K train | 96.54 ± 0.09 | 66.05 ± 0.43 | 79.55 ± 0.86
U-Net [51] | TN3K train | 96.46 ± 0.11 | 65.99 ± 0.66 | 79.51 ± 1.31
ResUNet [52] | TN3K train | 97.18 ± 0.03 | 75.09 ± 0.22 | 83.77 ± 0.20
SegNet [53] | TN3K train | 96.72 ± 0.12 | 66.54 ± 0.85 | 79.91 ± 1.69
FCN [54] | TN3K train | 96.92 ± 0.04 | 68.18 ± 0.25 | 81.08 ± 0.50
CPFNet [55] | TN3K train | 97.17 ± 0.06 | 70.50 ± 0.39 | 82.70 ± 0.78
DeepLabv3+ [56] | TN3K train | 97.19 ± 0.05 | 70.60 ± 0.49 | 82.77 ± 0.98
TRFE [57] | TN3K train | 96.71 ± 0.07 | 68.33 ± 0.68 | 81.19 ± 1.35
SwinUNet_withAttention (ours) | TN3K train | 96.91 ± 0.00 | 78.00 ± 0.00 | 87.60 ± 0.00

