Article

Prediction Multiscale Cross-Level Fusion U-Net with Combined Wavelet Convolutions for Thyroid Nodule Segmentation

Shengzhi Liu, Haotian Tang, Junhao Zhao, Rundong Liu, Sirui Zheng, Kaiyao Hou, Xiyu Zhang, Fuyong Liu and Chen Ding
1 School of Information Science and Engineering, Xinjiang College of Science and Technology, Korla 841000, China
2 School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710129, China
3 School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 1013; https://doi.org/10.3390/info16111013
Submission received: 13 October 2025 / Revised: 16 November 2025 / Accepted: 17 November 2025 / Published: 20 November 2025

Abstract

The precise segmentation of thyroid nodules in ultrasound images is essential for computer-aided diagnosis and treatment. Although various deep learning methods have been proposed, similar intensity distributions and variable nodule morphology often lead to blurred segmentation boundaries and missed detection of small nodules. To address this problem, we propose a multiscale cross-level fusion U-net with combined wavelet convolutions (MCFU-net) for thyroid nodule segmentation. Firstly, we design a multi-branch wavelet convolution (MBWC) block, which decouples texture features through wavelet-domain multiresolution analysis and reorganizes cross-channel features, thereby enhancing context extraction and aggregation during the encoding stage. Secondly, a scale-selective atrous pyramid (SSAP) module based on multi-level dynamic perception is constructed to enhance the saliency of nodules of varying sizes and thus improve the detection of small nodules. Thirdly, to reduce the loss of fine-grained information during upsampling, a cross-level fusion module (CLFM) with hierarchical refinement mechanisms is designed, which progressively reconstructs ambiguous boundary areas through multistage upsampling. Experiments on two public ultrasound datasets, TN3K and DDTI, demonstrate the effectiveness and superiority of our method, which achieves Dice coefficients of 85.22% and 78.21% and IoU values of 74.25% and 64.23%, respectively.

1. Introduction

Thyroid nodules are typically defined as palpable cystic or solid masses formed within the thyroid [1]. A cancer statistics report showed that there were 821,173 new cases of thyroid cancer in 2022 [2]. Accurate detection and evaluation of thyroid nodules can therefore help detect lesions early and support the formulation of reasonable treatment plans. Ultrasound has become the preferred method for screening thyroid nodules due to its non-invasiveness, safety, and advantages in soft tissue imaging [3]. However, the low contrast of ultrasound images [4], speckle noise [5], and the diversity of nodule morphology [6] pose considerable challenges for the precise segmentation of thyroid nodules.
With the rise of deep convolutional neural networks, a series of CNN-based methods [7,8,9,10], such as FCN, Segnet and U-net, have demonstrated outstanding performance in medical image segmentation due to their powerful nonlinear learning ability. Consequently, several deep learning networks based on the U-net architecture have been employed for thyroid nodule segmentation [11,12,13,14]. U-net employs an encoder–decoder structure. The encoder operation is responsible for extracting image features and capturing context, while the decoder operation performs target localization and prediction. Additionally, skip connections can better preserve the details and structures of the target in the decoding process, thus enhancing the accuracy of target localization [8]. However, as shown in Figure 1, the U-net architecture still faces two issues in thyroid nodule segmentation.
Insufficient sensitivity to small targets: As seen in the first row of Figure 1, U-net exhibits missed detections for small nodules. The reason is that, in the process of encoding from shallow to deep, the fixed receptive field of convolutional kernels limits the network’s feature extraction ability. Moreover, the diversity of thyroid nodules exacerbates the loss of global context information [15].
Difficulty in boundary pixel recognition: As seen in the second row of Figure 1, the performance of U-net-based methods in segmenting nodule edges is suboptimal. Due to the similar intensity distributions in ultrasound images, skip connections can carry noise during feature supplementation [16], leading to feature confusion and pixel misclassification.
To solve these two issues, two strategies have been widely used: multiscale representation and reconstructed skip connections.
Multiscale representation [17] aims to extract and fuse features at different levels to simultaneously capture both details and global information in the image. Dai et al. [18] introduced the SK block [19] into the encoder part of Unet++ [10], dynamically adjusting the receptive field to better capture multi-scale spatial features, thereby improving segmentation accuracy. Cui et al. [13] introduced the ASPP [20] module in the encoding stage to carry out multi-scale feature extraction using dilated convolution with different rates. Zheng et al. [21] redesigned the ASPP using deformable convolutions to accommodate variations in the shape and scale of targets. Recent advancements further enrich the multi-scale feature learning paradigm for thyroid nodule segmentation. Ozcan et al. [22] proposed Enhanced-TransUNet, fusing Transformer and UNet to capture global context and segment small targets, with an information bottleneck to compress redundant features and mitigate overfitting. Zheng et al. [23] developed GWUNet, integrating a Swin-Transformer-based gated attention mechanism and an improved wavelet transform module to extract frequency–spatial domain features and enhance nodule texture capture. However, these methods have limitations in context awareness and feature aggregation, which can easily lead to the loss of small nodule targets.
In the strategy of reconstructed skip connections, Gan et al. [24] utilized a polarized attention module to assist skip connections, helping the model focus on regions relevant to the nodule segmentation task. Chen et al. [25] incorporated the SE attention [26] mechanism at the skip connections to enhance the relevance of distant information, which can suppress the interference of irrelevant regions in edge pixel classification. Yang et al. [27] enhanced the interaction between global and local information through multi-stage feature integration in the channel and spatial dimensions. Xie et al. [28] selected channels with richer texture information by calculating the information entropy of different channels, incorporating them into the decoding process to recover shallow semantic information. Nie et al. [29] proposed an attention guidance module in N-Net, which filters features before the skip connections to remove noise, reduce background interference, and ensure effective transfer of structural information between network layers. However, most existing methods supply only same-level features to the decoding process, failing to fully exploit the guiding role of high-level features.
To address the issues outlined above, this paper designs a multiscale cross-level fusion U-net with combined wavelet convolutions for thyroid nodule segmentation. Motivated by the first issue, we embed a multi-branch wavelet convolution block in the encoder to enhance context awareness. At the same time, the scale-selective atrous pyramid module, which aggregates global context information, is used in the top-level encoder to improve the detection of small targets. In response to the second issue and the discussion of reconstructed skip connections, we propose a cross-level fusion module that maximally reuses previously learned high-level features, enhancing the fine-grained information of nodules.
In summary, the key contributions of this study are as follows:
A network called MCFU-net for thyroid nodule segmentation is proposed, which introduces a collaborative learning framework integrating wavelet-domain feature decoupling, dynamic scale selection, and cross-level fusion. A series of experiments on the TN3K and DDTI datasets demonstrate that this network has significant advantages over other models.
A multi-branch wavelet convolution (MBWC) block and a scale-selective atrous pyramid (SSAP) module are proposed. These enhance the network’s adaptability to various nodules through multi-scale feature sensing and aggregation strategies, thereby enabling precise segmentation of thyroid nodules with significant morphological differences.
A cross-level fusion module (CLFM) is designed. During the image restoration process, high-level semantic information is supplemented through a cross-level fusion unit (CLFU), and the multi-level semantic information is used in the decoding process. After the integration of the CLFM, the multi-level fusion features can effectively filter out irrelevant clutter in the thyroid ultrasound data, making the contours and details of the nodules clearer.
The remainder of this paper is organized as follows: Section 2 provides a detailed description of the proposed network, its important modules, and the loss function; Section 3 describes the ultrasound datasets used in the experiments, the experimental setup, and the evaluation metrics; Section 4 presents the experimental results. The discussion and conclusion are given in Section 5 and Section 6, respectively.

2. The Proposed Method

This section describes MCFU-net in detail. Section 2.1 provides an overview of the overall architecture of the network; Sections 2.2–2.4 detail the important modules of the network; and Section 2.5 describes the loss function used by the network.

2.1. Overview

The proposed MCFU-net is a network based on an encoder–decoder architecture, and the overall framework is shown in Figure 2. Given a thyroid ultrasound image as input, the network employs MBWC blocks for downsampling in the encoding process, continuously aggregating global information. The top-level encoder uses the SSAP module to capture multi-scale semantic information from high-level abstract features, enhancing sensitivity to small nodules. Additionally, multiple CLFMs are employed to reconstruct skip connections, propagating high-level features as prior knowledge to the decoding process to guide image restoration. Ultimately, the network outputs pixel-level semantic prediction results.
Specifically, given an input image $Input \in \mathbb{R}^{3 \times h \times w}$, encoding it with the MBWC blocks and the SSAP module yields five levels of feature outputs $F_i^e$ ($i = 1, 2, \ldots, 5$), corresponding to the feature levels of the network. Features $F_i^d$ ($i = 1, 2, 3, 4$) are obtained at different stages of decoding and are aggregated through $CLFM_n$ ($n = 1, 2, 3$) to enhance fine-grained information recovery during the upsampling phase. The filter sizes of MCFU-net are 32, 64, 128, 256, 512, 256, 128, 64 and 32, respectively. The single-stage feature mappings for encoding and decoding are computed by (1) and (2), where $c = 32$:
$F_i^e \in \mathbb{R}^{2^{i-1}c \times (h/2^{i-1}) \times (w/2^{i-1})}$ (1)
$F_i^d \in \mathbb{R}^{2^{i-1}c \times (h/2^{i-1}) \times (w/2^{i-1})}$ (2)
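For concreteness, the following minimal Python sketch enumerates the encoder feature shapes implied by Eq. (1) for $c = 32$ and a $3 \times 256 \times 256$ input; it is purely illustrative and independent of the actual network code.

```python
# Per-level encoder feature shapes implied by Eq. (1) for c = 32 and a
# 256 x 256 input; decoder shapes follow Eq. (2) in the same way.
c, h, w = 32, 256, 256
for i in range(1, 6):  # five encoder levels F_1^e ... F_5^e
    print(f"F{i}^e: {2 ** (i - 1) * c} x {h // 2 ** (i - 1)} x {w // 2 ** (i - 1)}")
# Output: 32x256x256, 64x128x128, 128x64x64, 256x32x32, 512x16x16,
# matching the encoder filter sizes 32, 64, 128, 256, 512 listed above.
```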

2.2. Multi-Branch Wavelet Convolution

The network needs to consider information surrounding the foreground to avoid making ambiguous decisions [30]. Therefore, it is particularly important to enhance the extraction and aggregation of multiscale context information in the encoding process [31]. As shown in Figure 3, inspired by the Inception [32] architecture and [33], we design the MBWC block as a fundamental component of the network encoder. The MBWC block enhances the ability to extract and aggregate context by applying three parallel convolution branches with different kernels and attention mechanisms, obtaining feature maps with varying receptive fields. The computation of the MBWC block is as follows:
Firstly, discrete wavelet transform and attention mechanisms are utilized to achieve joint frequency–spatial domain modeling. Specifically, given the input feature map $F_{in} \in \mathbb{R}^{c \times h \times w}$, feature extraction is performed through a three-branch structure:
$F_1 = B(\delta(CA(WTConv_{3 \times 3}(F_{in}))))$ (3)
$F_2 = B(\delta(SA(WTConv_{5 \times 5}(F_{in}))))$ (4)
$F_3 = B(\delta(SA(WTConv_{3 \times 3}(F_{in}))))$ (5)
where $WTConv_{3 \times 3}(\cdot)$ and $WTConv_{5 \times 5}(\cdot)$ represent wavelet convolutions with kernel sizes 3 and 5, respectively; $CA(\cdot)$ and $SA(\cdot)$ represent the channel attention and spatial attention mechanisms, respectively; $B(\cdot)$ denotes the batch normalization operation; and $\delta(\cdot)$ denotes the ReLU activation function. The three branch outputs are then shuffled and aggregated:
$F_{fusion} = Conv_{1 \times 1}(S_3(Concat[F_1, F_2, F_3]))$ (6)
where $S_3(\cdot)$ denotes the channel shuffle operation with 3 groups, and $F_{fusion}$ represents the aggregated features from all branches.
Finally, $F_{enhanced}$ is obtained by applying spatial attention enhancement to $F_{fusion}$, and is then added to the output $F_{residual}$ of the residual branch to produce the final multiscale feature output $F_{out} \in \mathbb{R}^{c \times h \times w}$:
$F_{enhanced} = \delta(B(SA(F_{fusion})))$ (7)
$F_{residual} = \delta(B(CA(Conv_{1 \times 1}(F_{in}))))$ (8)
$F_{out} = B(F_{enhanced} + F_{residual})$ (9)
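To make the data flow of Eqs. (3)–(9) concrete, the following PyTorch sketch wires up the three attention-gated branches, channel shuffle, fusion, and residual path. It is a simplified illustration, not the authors' implementation: plain convolutions stand in for the wavelet convolutions $WTConv$, the $CA$/$SA$ modules are generic SE-style and CBAM-style attention, and the downsampling between encoder stages is omitted.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention, a stand-in for the paper's CA."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention, a stand-in for the paper's SA."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)
    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.conv(stats))

def channel_shuffle(x, groups=3):
    """S_3 in Eq. (6): interleave channels across the three branch groups."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class MBWC(nn.Module):
    """Sketch of the MBWC block (Eqs. (3)-(9)); plain convs replace WTConv."""
    def __init__(self, cin, cout):
        super().__init__()
        def branch(k, attn):  # conv -> attention -> ReLU -> BN, as in Eqs. (3)-(5)
            return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                                 attn, nn.ReLU(inplace=True), nn.BatchNorm2d(cout))
        self.b1 = branch(3, ChannelAttention(cout))   # Eq. (3)
        self.b2 = branch(5, SpatialAttention())       # Eq. (4)
        self.b3 = branch(3, SpatialAttention())       # Eq. (5)
        self.fuse = nn.Conv2d(3 * cout, cout, 1)      # Eq. (6)
        self.enhance = nn.Sequential(SpatialAttention(),
                                     nn.BatchNorm2d(cout), nn.ReLU(inplace=True))  # Eq. (7)
        self.residual = nn.Sequential(nn.Conv2d(cin, cout, 1), ChannelAttention(cout),
                                      nn.BatchNorm2d(cout), nn.ReLU(inplace=True))  # Eq. (8)
        self.bn_out = nn.BatchNorm2d(cout)            # Eq. (9)
    def forward(self, x):
        f = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        f_fusion = self.fuse(channel_shuffle(f, groups=3))
        return self.bn_out(self.enhance(f_fusion) + self.residual(x))

# Quick shape check: one block at the first encoder level.
print(MBWC(3, 32)(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 32, 256, 256])
```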

2.3. Scale-Selective Atrous Pyramid

In continuous image downsampling operations, shallow features typically focus on the texture information of the target, while deep features emphasize positional information. However, in the process of extracting image features, sub-region information may be lost, which makes the feature expression of some pixel regions inadequate and causes segmentation results to deviate from the ground truth. To alleviate this issue, we propose the SSAP module, which further aggregates high-level semantic information through multiscale context fusion and cross-receptive-field feature calibration, selecting features that are highly relevant to the target region.
As shown in Figure 4, to improve the model’s ability to capture multiscale context information, the proposed SSAP module first uses dilated convolutions with rates of 1, 2, and 4 (all with a kernel size of $3 \times 3$) to widen the receptive field and obtain feature maps at different scales. Then, spatial weights are calculated on each branch to re-weight the features at different scales, yielding a multiscale feature representation of the nodule location. Simultaneously, feature maps $F_{pool,i}$ ($i = 1, 2, 4$) containing different local context information are obtained through three adaptive pooling paths. Finally, the adjusted feature map is concatenated with all the feature maps obtained from the scale selection and pooling paths, and the final global context aggregation $F_{SSAP}$ is obtained through a convolution operation.
For the top-level encoder, the feature map $F_{in} \in \mathbb{R}^{c \times h \times w}$ is fed into the SSAP module, resulting in the encoded $F_{out} \in \mathbb{R}^{c \times h \times w}$. In the pooling path, adaptive average pooling compresses the spatial dimensions of each channel to $i \times i$ ($i = 1, 2, 4$). Subsequently, upsampling restores $h$ and $w$ to the same values as the original feature map:
$F_{pool,i} = interpol(Conv_{1 \times 1}(\delta(APool(F_{in}, i)))), \quad i = 1, 2, 4$ (10)
where $interpol(\cdot)$ denotes bilinear interpolation, and $APool(\cdot)$ represents the adaptive average pooling operation.
In the scale selection path, dilated convolutions with rates $n$ ($n = 1, 2, 4$) are applied to the original features to obtain feature maps $F_{atrous,n}$ at three different scales. Subsequently, these three feature maps are concatenated to produce the multiscale feature $F_{atrous}$:
$F_{atrous,n} = \delta(B(Conv_{3 \times 3}^{rate=n}(F_{in}))), \quad n = 1, 2, 4$ (11)
$F_{atrous} = Concat(F_{atrous,1}, F_{atrous,2}, F_{atrous,4})$ (12)
Then, a regular convolution operation is applied to adjust the channels of $F_{atrous}$, and a SoftMax classifier is used to obtain attention weight maps $F_{attention,n}$ ($n = 1, 2, 4$) at three different scales. The attention scores are then multiplied by the corresponding feature maps $F_{atrous,n}$ at each scale, and the resulting outputs are summed to produce the output $F_{conv}$ of the scale selection path:
$F_{attention,n} = Softmax(Conv^{*}(F_{atrous}))$ (13)
$F_{conv} = F_{atrous,1} \times F_{attention,1} + F_{atrous,2} \times F_{attention,2} + F_{atrous,4} \times F_{attention,4}$ (14)
where $Conv^{*}(\cdot)$ denotes a conventional convolution combination using a $1 \times 1$ convolution, a $3 \times 3$ convolution, and the ReLU activation function.
Finally, the features from each path are concatenated, and the adjusted original features $F_{in}'$ are added to supplement global information, resulting in the final global contextual and multiscale selected features $F_{out}$:
$F_{in}' = Conv_{1 \times 1}(F_{in})$ (15)
$F_{out} = F_{in}' + \delta(B(Conv_{1 \times 1}(Concat(F_{conv}, F_{in}', F_{pool,i}))))$ (16)
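A hedged PyTorch sketch of the SSAP computation in Eqs. (10)–(16) follows; the width of the internal $Conv^{*}$ layers and similar details are our own assumptions, since they are not fully specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSAP(nn.Module):
    """Sketch of the scale-selective atrous pyramid (Eqs. (10)-(16)):
    dilated 3x3 branches (rates 1, 2, 4) re-weighted by per-pixel softmax
    attention, adaptive-pool context paths (1x1, 2x2, 4x4), and a fused
    output with an adjusted skip path."""
    def __init__(self, ch):
        super().__init__()
        self.atrous = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=r, dilation=r),
                          nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
            for r in (1, 2, 4)])                                       # Eq. (11)
        # Conv* in Eq. (13): 1x1 conv -> ReLU -> 3x3 conv, one weight map per scale.
        self.select = nn.Sequential(nn.Conv2d(3 * ch, ch, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(ch, 3, 3, padding=1))
        self.pool_proj = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(3)])
        self.skip = nn.Conv2d(ch, ch, 1)                               # Eq. (15)
        self.out = nn.Sequential(nn.Conv2d(5 * ch, ch, 1),
                                 nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.atrous]                  # F_atrous,n
        attn = torch.softmax(self.select(torch.cat(feats, 1)), dim=1)  # Eq. (13)
        f_conv = sum(f * attn[:, n:n + 1] for n, f in enumerate(feats))  # Eq. (14)
        pools = [F.interpolate(proj(F.relu(F.adaptive_avg_pool2d(x, s))),
                               size=(h, w), mode="bilinear", align_corners=False)
                 for proj, s in zip(self.pool_proj, (1, 2, 4))]        # Eq. (10)
        x_adj = self.skip(x)
        return x_adj + self.out(torch.cat([f_conv, x_adj, *pools], 1))  # Eq. (16)

# Top-level encoder features for c = 32: 512 channels at 16 x 16.
print(SSAP(512)(torch.randn(1, 512, 16, 16)).shape)  # torch.Size([1, 512, 16, 16])
```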

2.4. Cross-Level Fusion Module

In the image encoding process, detailed information about the target may be lost. Recovering this information in the upsampling process of the decoder is challenging and typically requires the use of skip connections for information supplementation [34]. However, since shallow features contain more noise, supplementing information from the same level into the decoding process limits decoding ability and affects segmentation accuracy. Therefore, we propose the CLFM module. This module leverages the characteristics of high-level features, which have less noise and greater target focus, to serve as prior knowledge that guides the attention of low-level features toward the target region, achieving directed coupling between deep semantic information and shallow details during the decoding process. As shown in Figure 5, CLFU is an important component of the CLFM module.
Specifically, the CLFU integrates high-level and low-level features step by step to create a richer, more semantic feature representation, thereby enhancing the ability of low-level features to represent the target area and aiding the recovery of fine-grained information. Firstly, the CLFU receives two adjacent-level features $F_i$ and $F_{i-1}$; the deeper feature $F_i$ is mapped to the same dimensions as $F_{i-1}$ using a transposed convolution, and the two are concatenated to obtain $F_c$. Secondly, two depthwise separable convolutions with dilation rates of 1 and 2 are applied to obtain two feature maps $F_1$ and $F_2$ at different scales. Finally, the output $F_{out}$ is obtained by concatenating the two feature maps. The specific computation process is as follows:
$F_c = Concat(U(F_i), F_{i-1})$ (17)
$F_1, F_2 = DSConv_{3 \times 3}^{rate=n}(F_c), \quad n = 1, 2$ (18)
$F_{out} = Concat(F_1, F_2)$ (19)
where $DSConv_{3 \times 3}^{rate=n}(\cdot)$ denotes a $3 \times 3$ depthwise separable convolution with dilation rate $n$, $U(\cdot)$ represents a transposed convolution, and $Concat(\cdot)$ indicates the concatenation of different features along the channel dimension.
Taking CLFM3 as an example, Figure 5 illustrates the multistage progressive upsampling and dual-path feature guidance strategy of the CLFM: the encoder output feature maps $F_5^e$, $F_4^e$ and $F_3^e$ are progressively aggregated through two CLFUs to obtain the prior knowledge $F_3^{e'}$, which is rich in high-level semantic information. Finally, $F_3^{e'}$ is fused with $F_2^e$ through another CLFU, and a $3 \times 3$ convolution is applied to obtain the final output $F_{CLFM3}$:
$F_4^{e'} = CLFU(F_5^e, F_4^e)$ (20)
$F_3^{e'} = CLFU(F_4^{e'}, F_3^e)$ (21)
$F_2^{e'} = CLFU(F_3^{e'}, F_2^e)$ (22)
$F_{CLFM3} = \delta(Conv_{3 \times 3}(F_2^{e'}))$ (23)
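The following PyTorch sketch illustrates the CLFU of Eqs. (17)–(19) and its chaining into CLFM3 per Eqs. (20)–(23). The channel widths follow the encoder sizes listed in Section 2.1, while the per-path output width of $c_{lo}/2$ is our own assumption.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """3x3 depthwise separable convolution with configurable dilation."""
    def __init__(self, cin, cout, dilation):
        super().__init__()
        self.dw = nn.Conv2d(cin, cin, 3, padding=dilation, dilation=dilation, groups=cin)
        self.pw = nn.Conv2d(cin, cout, 1)
    def forward(self, x):
        return self.pw(self.dw(x))

class CLFU(nn.Module):
    """Cross-level fusion unit (Eqs. (17)-(19)): transposed-conv upsampling U
    of the deeper feature, concatenation with the shallower one, then two
    dilated depthwise separable paths whose outputs are concatenated."""
    def __init__(self, c_hi, c_lo):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_hi, c_lo, 2, stride=2)  # U(.) in Eq. (17)
        self.path1 = DSConv(2 * c_lo, c_lo // 2, dilation=1)   # Eq. (18), rate 1
        self.path2 = DSConv(2 * c_lo, c_lo // 2, dilation=2)   # Eq. (18), rate 2
    def forward(self, f_hi, f_lo):
        fc = torch.cat([self.up(f_hi), f_lo], dim=1)           # F_c
        return torch.cat([self.path1(fc), self.path2(fc)], dim=1)  # Eq. (19)

class CLFM3(nn.Module):
    """CLFM3 (Eqs. (20)-(23)): chain three CLFUs over F5^e..F2^e, then 3x3 conv."""
    def __init__(self):
        super().__init__()
        self.u1, self.u2, self.u3 = CLFU(512, 256), CLFU(256, 128), CLFU(128, 64)
        self.conv = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, f5, f4, f3, f2):
        return self.conv(self.u3(self.u2(self.u1(f5, f4), f3), f2))

# Encoder feature shapes for c = 32 and a 256 x 256 input (see Eq. (1)).
f2, f3, f4, f5 = (torch.randn(1, c, s, s) for c, s in
                  [(64, 128), (128, 64), (256, 32), (512, 16)])
print(CLFM3()(f5, f4, f3, f2).shape)  # torch.Size([1, 64, 128, 128])
```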

2.5. Loss Function

In the task of thyroid nodule segmentation, there is a significant class imbalance issue. We use the Binary Cross-Entropy Loss (BCE Loss) as the loss function for model training. BCE is designed to measure the difference between predicted probabilities and true labels. Specifically, for each pixel, the BCE Loss calculates the logarithmic loss between the predicted class probability and the actual class label:
$L_{BCE} = -\sum_{i,j}^{N} \left[ Y_{i,j} \log X_{i,j} + (1 - Y_{i,j}) \log(1 - X_{i,j}) \right]$ (24)
where $N$ represents the total number of pixels in the image, $X_{i,j}$ denotes the model’s predicted mask value at position $(i, j)$, and $Y_{i,j}$ refers to the corresponding ground-truth label.
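As a sanity check, the summation form of Eq. (24) can be reproduced with PyTorch’s built-in binary cross-entropy; the small eps term below is a numerical guard, not part of the formula.

```python
import torch
import torch.nn.functional as F

pred = torch.sigmoid(torch.randn(1, 1, 4, 4))        # X: predicted probabilities
target = torch.randint(0, 2, (1, 1, 4, 4)).float()   # Y: ground-truth mask
eps = 1e-7                                           # numerical guard only
manual = -(target * torch.log(pred + eps)
           + (1 - target) * torch.log(1 - pred + eps)).sum()
builtin = F.binary_cross_entropy(pred, target, reduction="sum")
print(torch.isclose(manual, builtin, atol=1e-3))     # tensor(True)
```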

3. Experiments

In this section, we introduce the datasets used in the experiments, the evaluation metrics, the hyperparameter settings, and the operating environment.

3.1. Datasets

To evaluate our method, we use two public datasets: the DDTI dataset and the TN3K dataset.
The TN3K dataset was contributed by Gong et al. [35], collected at the Zhujiang Hospital of Southern Medical University, containing 3493 ultrasound images from 2421 patients. All images are in grayscale, and each image contains at least one region corresponding to a thyroid nodule. We use 2303 images as the training set, 576 images as the validation set, and 614 images for testing.
The DDTI dataset, provided by Pedraza et al. [36], contains 637 thyroid ultrasound images with pixel-wise annotations from two devices (TOSHIBA Nemio 30 and TOSHIBA Nemio MX, Tokyo, Japan). We use 458 of these images as the training set, 115 images as the validation set, and 64 images for testing.
All images in these two datasets have had patient privacy and other irrelevant information removed. The image sizes are adjusted to $256 \times 256$, and data augmentation operations such as normalization and random flipping are applied.
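A plausible torchvision preprocessing pipeline matching this description is sketched below. The normalization statistics and flip probabilities are illustrative assumptions, and in a real segmentation pipeline the geometric transforms must be applied identically to each image and its mask (e.g., via the functional API with shared random decisions).

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(p=0.5),   # random flipping; a shared random state
    T.RandomVerticalFlip(p=0.5),     # with the mask is needed in practice
    T.ToTensor(),                    # scales grayscale pixels to [0, 1]
    T.Normalize(mean=[0.5], std=[0.5]),
])
eval_tf = T.Compose([T.Resize((256, 256)), T.ToTensor(),
                     T.Normalize(mean=[0.5], std=[0.5])])
```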

3.2. Experimental Details

During model training, we use the Adam optimizer for dynamic parameter optimization, setting the initial learning rate to 0.0001 and the weight decay to 0.0001. All models are trained on a PC with an NVIDIA GeForce RTX 4090 GPU and an Intel(R) Core(TM) i9-10900K CPU @ 3.7 GHz. The development environment consists of Python 3.9.19, PyTorch 2.3.0, and CUDA 11.8. For the experiments on the DDTI and TN3K datasets, five-fold cross-validation is employed, with the batch size set to 6 and the number of epochs to 150.
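The stated settings translate into a PyTorch training step roughly as follows; the tiny model and random batch are placeholders standing in for MCFU-net and the real data loaders, shown only to make the configuration concrete.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1),
                            torch.nn.Sigmoid())       # stand-in for MCFU-net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

images = torch.randn(6, 3, 256, 256)                  # batch size 6, as stated
masks = torch.randint(0, 2, (6, 1, 256, 256)).float()
loss = F.binary_cross_entropy(model(images), masks)   # BCE Loss from Section 2.5
optimizer.zero_grad()
loss.backward()
optimizer.step()  # one step of the loop repeated over 150 epochs
```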

3.3. Evaluation Metrics

In order to evaluate the model’s performance objectively, we employ a range of evaluation metrics, including Precision (Pre), Recall, Specificity (Spe), Accuracy (Acc), IoU, Dice, and the 95th percentile Hausdorff distance (HD95). The mathematical expressions for these metrics are as follows:
$Precision = \frac{TP}{TP + FP}$ (25)
$Recall = \frac{TP}{TP + FN}$ (26)
$Specificity = \frac{TN}{TN + FP}$ (27)
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (28)
$IoU = \frac{TP}{TP + FP + FN}$ (29)
$Dice = \frac{2TP}{2TP + FP + FN}$ (30)
$HD95 = \max\{d_{95\%}(X, Y),\ d_{95\%}(Y, X)\}$ (31)
where $TP$, $FP$, $TN$, and $FN$ refer to true positives, false positives, true negatives, and false negatives, respectively; $d_{95\%}(\cdot, \cdot)$ denotes the 95th percentile of the one-way Hausdorff distance between two sets; and $X$ and $Y$ denote the predicted and ground-truth sets, respectively.
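The pixel-overlap metrics above follow directly from the confusion counts, as in this small NumPy sketch. HD95 is omitted because it requires boundary distance computations (e.g., via scipy or MedPy) rather than confusion counts, and the example assumes non-empty masks.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Pixel-overlap metrics of Eqs. (25)-(30) from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {"precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "iou": tp / (tp + fp + fn),
            "dice": 2 * tp / (2 * tp + fp + fn)}

# Example: a 4x4 prediction shifted by one pixel against a 4x4 ground truth.
gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=int); pred[3:7, 3:7] = 1
print(overlap_metrics(pred, gt))  # dice = 18 / 32 = 0.5625, iou = 9 / 23 ≈ 0.391
```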

4. Experimental Results

In Section 4.1, we conduct ablation experiments on different components of the network to assess their effect on model performance. In Section 4.2, we compare our method with advanced deep learning segmentation methods.

4.1. Ablation Experiment

To evaluate the impact of each component on network performance, we conduct ablation experiments using U-net as the baseline network. Table 1 and Table 2 present the ablation results of the different components on the TN3K and DDTI datasets, respectively. Comparing the segmentation results of Baseline and Baseline + MBWC, we find that using the multi-branch architecture in the encoding phase significantly improves the segmentation performance of the network. Adding the SSAP module on this basis further improves segmentation performance considerably, which indicates that enhancing single-stage context extraction during encoding is particularly beneficial for thyroid nodule segmentation. Next, after adding the CLFM to the network, the HD95 values decrease by 2.22 and 5.92 on the two datasets, respectively, indicating a significant improvement in the model’s capacity to capture edge details. In summary, each component helps the network learn more robust feature representations from thyroid ultrasound images.

4.2. Comparison with the Other Methods

Our comparative methods include U-net [8], Unet++ [10], AttUnet [37], Sgunet [12], ASPP-UNet [13], TransUnet [38], SmaAt-UNet [39], and DCSAU-Net [40]. Among them, TransUnet is trained using a pre-trained ResNet-50 [41] backbone. The performance of these methods is evaluated on the two datasets, with the results shown in Table 3 and Table 4, respectively.
According to the experimental results in Table 3 and Table 4, we draw several conclusions. Generally, U-net variants (such as Unet++) achieve better segmentation results than the original U-net, suggesting that fusing low-level features with high-level features using skip connections is beneficial for the segmentation of thyroid nodules. Sgunet leverages high-level semantics to guide auxiliary segmentation in the decoding stage, demonstrating better performance compared to the basic U-net architecture. ASPP-UNet employs the ASPP module within the deep encoder to perform multiscale sampling, retaining additional fine details, which effectively improves segmentation accuracy. By analyzing the segmentation results of AttUnet, it can be concluded that the introduction of an attention mechanism can also improve the segmentation performance of the network. SmaAt-UNet uses a CBAM attention mechanism to reconstruct skip connections, which helps optimize network performance. Additionally, TransUnet utilizes a Transformer architecture, which offers significant advantages in global feature modeling compared to conventional CNNs. Moreover, results of DCSAU-Net suggest that the optimized attention mechanism can further enhance the network’s performance.
Compared to these methods, our approach designs a cross-level fusion strategy to reduce semantic discrepancies among different levels of features and enhances the single-stage feature extraction ability of the U-shaped architecture, effectively improving the network’s performance in thyroid nodule segmentation. On both datasets, the proposed method achieves the best segmentation performance. The gaps in the IoU and Dice indicators are visualized in Figure 6 and Figure 7. On the TN3K dataset, our method attains 92.14% Precision, 79.86% Recall, 99.12% Specificity, 96.85% Accuracy, 74.25% IoU, 85.22% Dice, and an HD95 of 23.45. Compared to the best of the other methods, these metrics improve by 0.63%, 2.06%, 0.04%, 0.19%, 1.72%, 1.14%, and 2.21, respectively. On the DDTI dataset, our method improves these seven metrics by 2.34%, 1.40%, 0.27%, 0.84%, 3.92%, 3.00%, and 6.10, respectively. Additionally, Figure 8 displays the ROC curves and AUC scores of the different segmentation methods on the TN3K and DDTI datasets. The ROC curve reflects the confidence of a method’s correct predictions, while the AUC score summarizes the area under the ROC curve. Based on this comparison, the proposed method achieves the highest segmentation confidence on both the TN3K and DDTI datasets.
Figure 9 and Figure 10 present the visual segmentation results of the different methods on the TN3K and DDTI datasets. Based on these results, we summarize four key points. Firstly, the segmentation results in the first row indicate that the intensity distribution of surrounding tissues is similar to that of the nodules, leading to significant missed detections and false positives; this pronounced heterogeneity makes it challenging for the various methods to detect the nodule areas. The images in the second and third rows show that when nodules differ significantly in shape, it is difficult for the various methods to segment them consistently well. According to the results in the fourth and fifth rows, every method makes errors when detecting small nodules, and in some cases small nodules are missed entirely. In particular, the sixth and seventh rows in Figure 9 show that spatial texture differences in multi-nodule images significantly increase the missed detection rate. Compared to the other methods, our approach effectively mitigates the impact of nodule shape differences and surrounding tissue interference on the segmentation results, achieving outcomes closer to the ground truth. Overall, both the quantitative and visual results indicate that our method reduces missed and false detections of thyroid nodules while achieving better segmentation results.

5. Discussion

5.1. The Impact of the MBWC Block Structure

The encoder employs multi-branch wavelet convolutions for feature extraction. Different combinations of branches in the MBWC block may affect the network’s performance, so we conduct experiments on the TN3K and DDTI datasets. Specifically, using U-net + SSAP + CLFM as the baseline model, we start with single-branch configurations and progressively incorporate additional branches. As shown in Table 5 and Table 6, when only the single branch of $5 \times 5$ wavelet convolution with spatial attention is used, the network performs worst, with IoU values of 69.93% and 59.72% on the TN3K and DDTI datasets, respectively. With the introduction of additional paths, segmentation accuracy gradually improves, ultimately reaching IoU values of 74.25% and 64.23%. Therefore, adding convolution branches with different receptive fields contributes to improved network performance.

5.2. The Impact of the Number of CLFMs

At each level, the semantic information from high-level feature maps provides important context for the learning of low-level features, helping them focus on the target area and reducing noise interference. To explore the role of the CLFM in the decoding process, we conduct experiments on the TN3K and DDTI datasets. Specifically, we use U-net + MBWC + SSAP as the baseline model and incrementally increase the number of CLFMs, reconstructing skip connections from bottom to top to find the optimal number. As shown in Table 7, as the number of CLFMs increases, the model performance first increases and then decreases. Note that CLFM1,2,3,4 in Table 7 replaces the last layer of skip connections and aggregates the encoder output features $F_1^e$ to $F_5^e$. The model performance is optimal when the number of CLFMs is three.
We suspect that shallow features with more noise may have led to this phenomenon. Therefore, we attempt to remove the remaining skip connections while adding CLFM to validate this hypothesis. As shown in Table 8, when the number of CLFMs is set to three and the skip connections are not retained, the model achieves optimal performance on both datasets. This indicates that increasing the number of CLFMs does not always have a positive impact on model performance.

6. Conclusions

Accurate segmentation of thyroid nodules is of significant importance in clinical diagnosis. This paper proposes a novel multiscale cross-level fusion U-net with combined wavelet convolutions (MCFU-net), which includes three key components: the MBWC block, the SSAP module, and the CLFM. The MBWC block and SSAP module strengthen the network’s ability to extract and aggregate multiscale context features, improving its representation of small nodules. The CLFM propagates high-level features as prior knowledge to the decoding stage, supplementing fine-grained information during upsampling. Compared with advanced methods, MCFU-net demonstrates strong competitiveness on the TN3K and DDTI datasets, outperforming the other models.
Despite the promising performance of MCFU-net, this study has several limitations: first, it is trained mainly on the TN3K and DDTI datasets with single-center collection and limited ultrasound equipment parameter variations, restricting generalizability to multi-institutional clinical data; second, for complex ultrasound images with dense multi-nodule distributions or blurred boundaries, it still has occasional false positives/negatives due to insufficient tissue discrimination; third, the integrated MBWC and CLFM modules increase computational complexity, hindering real-time clinical applications. To address these, future work will focus on expanding to multi-center, multi-device datasets with diverse nodule characteristics to enhance robustness; optimizing the architecture via lightweight techniques to reduce computation while maintaining accuracy; and fusing multi-modal data and integrating clinical prior knowledge to improve ambiguous nodule discrimination.

Author Contributions

Conceptualization: S.L. and H.T.; methodology: S.L., H.T. and J.Z.; validation: H.T., J.Z., R.L. and S.Z.; formal analysis: X.Z. and F.L.; investigation: H.T., J.Z., R.L. and S.Z.; resources: S.L.; data curation: J.Z. and H.T.; writing—original draft preparation: S.L., J.Z. and H.T.; writing—review and editing: S.L., X.Z. and F.L.; supervision: C.D., F.L. and K.H.; project administration: J.Z. and H.T.; funding acquisition: F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Central Guidance for Local Science and Technology Development Fund (Grant ZYYD2025QY19).

Institutional Review Board Statement

This study does not involve human subjects, animal experiments, or any research content related to personal privacy, biological samples, or sensitive data. Therefore, it is exempt from the review and approval procedures of the Ethics Committee (EC) or Institutional Review Board (IRB) in accordance with “Measures for the Ethical Review of Medical Research” (issued by the National Health Commission of the People’s Republic of China, 2020 Edition, Article 16).

Informed Consent Statement

This study only collects publicly available information (without interfering with the research objects or involving their personal sensitive information). Informed consent for participation is not required as per “Measures for the Ethical Review of Medical Research” (issued by the National Health Commission of the People’s Republic of China, 2020 Edition, Article 16).

Data Availability Statement

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Chang, C.Y.; Lei, Y.F.; Tseng, C.H.; Shih, S.R. Thyroid segmentation and volume estimation in ultrasound images. IEEE Trans. Biomed. Eng. 2010, 57, 1348–1357. [Google Scholar] [CrossRef] [PubMed]
  2. Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2024, 74, 229–263. [Google Scholar] [CrossRef]
  3. Peng, B.; Lin, W.; Zhou, W.; Bai, Y.; Luo, A.; Xie, S.; Yin, L. Enhanced pediatric thyroid ultrasound image segmentation using DC-Contrast U-Net. BMC Med. Imaging 2024, 24, 275. [Google Scholar] [CrossRef]
  4. Gong, Y.; Zhu, H.; Li, J.; Yang, J.; Cheng, J.; Chang, Y.; Bai, X.; Ji, X. SCCNet: Self-correction boundary preservation with a dynamic class prior filter for high-variability ultrasound image segmentation. Comput. Med. Imaging Graph. 2023, 104, 102183. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, J.; Mu, J.; Sun, H.; Dai, C.; Ji, Z.; Ganchev, I. BFG&MSF-Net: Boundary Feature Guidance and Multi-Scale Fusion Network for Thyroid Nodule Segmentation. IEEE Access 2024, 12, 78701–78713. [Google Scholar]
  6. Sun, S.; Fu, C.; Xu, S.; Wen, Y.; Ma, T. GLFNet: Global-local fusion network for the segmentation in ultrasound images. Comput. Biol. Med. 2024, 171, 108103. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  10. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  11. Liu, M.; Yuan, X.; Zhang, Y.; Chang, K.; Deng, Z.; Xue, J. An end to end thyroid nodule segmentation model based on optimized U-net convolutional neural network. In Proceedings of the 1st International Symposium on Artificial Intelligence in Medical Sciences, Beijing, China, 11–13 September 2020; pp. 74–78. [Google Scholar]
  12. Pan, H.; Zhou, Q.; Latecki, L.J. Sgunet: Semantic guided unet for thyroid nodule segmentation. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; IEEE: New York, NY, USA, 2021; pp. 630–634. [Google Scholar]
  13. Cui, S.; Zhang, Y.; Wen, H.; Tang, Y.; Wang, H. ASPP-UNet: A new semantic segmentation algorithm for thyroid nodule ultrasonic image. In Proceedings of the 2022 International Conference on Artificial Intelligence, Information Processing and Cloud Computing (AIIPCC), Kunming, China, 21–23 June 2022; IEEE: New York, NY, USA, 2022; pp. 323–328. [Google Scholar]
  14. Bi, H.; Cai, C.; Sun, J.; Jiang, Y.; Lu, G.; Shu, H.; Ni, X. BPAT-UNet: Boundary preserving assembled transformer UNet for ultrasound thyroid nodule segmentation. Comput. Methods Programs Biomed. 2023, 238, 107614. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, G.; Tan, G.; Duan, M.; Pu, B.; Luo, H.; Li, S.; Li, K. MLMSeg: A multi-view learning model for ultrasound thyroid nodule segmentation. Comput. Biol. Med. 2024, 169, 107898. [Google Scholar] [CrossRef]
  16. Yang, X.; Qu, S.; Wang, Z.; Li, L.; An, X.; Cong, Z. The study on ultrasound image classification using a dual-branch model based on Resnet50 guided by U-net segmentation results. BMC Med. Imaging 2024, 24, 314. [Google Scholar] [CrossRef]
  17. Chen, G.; Wang, H.; Chen, K.; Li, Z.; Song, Z.; Liu, Y.; Chen, W.; Knoll, A. A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 936–953. [Google Scholar] [CrossRef]
  18. Dai, H.; Xie, W.; Xia, E. SK-Unet++: An improved Unet++ network with adaptive receptive fields for automatic segmentation of ultrasound thyroid nodule images. Med. Phys. 2024, 51, 1798–1811. [Google Scholar] [CrossRef]
  19. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  20. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  21. Zheng, T.; Qin, H.; Cui, Y.; Wang, R.; Zhao, W.; Zhang, S.; Geng, S.; Zhao, L. Segmentation of thyroid glands and nodules in ultrasound images using the improved U-Net architecture. BMC Med. Imaging 2023, 23, 56. [Google Scholar] [CrossRef]
  22. Ozcan, A.; Tosun, Ö.; Donmez, E.; Sanwal, M. Enhanced-TransUNet for ultrasound segmentation of thyroid nodules. Biomed. Signal Process. Control 2024, 95, 106472. [Google Scholar] [CrossRef]
  23. Zheng, S.J.; Yu, S.X.; Wang, Y.; Wen, J. GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation. In Proceedings of the 31st International Conference on Multimedia Modeling (MMM 2025), Nara, Japan, 8–10 January 2025; pp. 31–44. [Google Scholar]
  24. Gan, J.; Zhang, R. Ultrasound image segmentation algorithm of thyroid nodules based on improved U-Net network. In Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System, Xi’an, China, 26–28 August 2022; pp. 61–66. [Google Scholar]
  25. Chen, G.; Liu, Y.; Qian, J.; Zhang, J.; Yin, X.; Cui, L.; Dai, Y. DSEU-net: A novel deep supervision SEU-net for medical ultrasound image segmentation. Expert Syst. Appl. 2023, 223, 119939. [Google Scholar] [CrossRef]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  27. Yang, Y.; Huang, H.; Shao, Y.; Chen, B. DAC-Net: A light-weight U-shaped network based efficient convolution and attention for thyroid nodule segmentation. Comput. Biol. Med. 2024, 180, 108972. [Google Scholar] [CrossRef]
  28. Xie, X.; Liu, P.; Lang, Y.; Guo, Z.; Yang, Z.; Zhao, Y. US-Net: U-shaped network with Convolutional Attention Mechanism for ultrasound medical images. Comput. Graph. 2024, 124, 104054. [Google Scholar] [CrossRef]
  29. Nie, X.Q.; Zhou, X.G.; Tong, T.; Lin, X.; Wang, L.; Zheng, H.; Li, J.; Xue, E.; Chen, S.; Zheng, M.; et al. N-Net: A novel dense fully convolutional neural network for thyroid nodule segmentation. Front. Neurosci. 2022, 16, 872601. [Google Scholar] [CrossRef]
  30. Ma, X.; Sun, B.; Liu, W.; Sui, D.; Shan, S.; Chen, J.; Tian, Z. Tnseg: Adversarial networks with multi-scale joint loss for thyroid nodule segmentation. J. Supercomput. 2024, 80, 6093–6118. [Google Scholar] [CrossRef]
  31. Ali, H.; Wang, M.; Xie, J. Cil-net: Densely connected context information learning network for boosting thyroid nodule segmentation using ultrasound images. Cogn. Comput. 2024, 16, 1176–1197. [Google Scholar] [CrossRef]
  32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  33. Chen, Z.; Zhu, H.; Liu, Y.; Gao, X. MSCA-UNet: Multi-scale channel attention-based UNet for segmentation of medical ultrasound images. Clust. Comput. 2024, 27, 6787–6804. [Google Scholar] [CrossRef]
  34. Gong, H.; Chen, J.; Chen, G. Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Comput. Biol. Med. 2023, 155, 106389. [Google Scholar] [CrossRef] [PubMed]
  35. Gong, H.; Chen, G.; Wang, R.; Xie, X.; Mao, M.; Yu, Y.; Chen, F.; Li, G. Multi-task learning for thyroid nodule segmentation with thyroid region prior. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; IEEE: New York, NY, USA, 2021; pp. 257–261. [Google Scholar]
  36. Pedraza, L.; Vargas, C.; Narváez, F.; Durán, O.; Muñoz, E.; Romero, E. An open access thyroid ultrasound image database. In Proceedings of the 10th International Symposium on Medical Information Processing and Analysis; SPIE: Bellingham, WA, USA, 2015; Volume 9287, pp. 188–193. [Google Scholar]
  37. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef]
  38. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  39. Trebing, K.; Stańczyk, T.; Mehrkanoon, S. SmaAt-UNet: Precipitation nowcasting using a small attention-UNet architecture. Pattern Recognit. Lett. 2021, 145, 178–186. [Google Scholar] [CrossRef]
  40. Xu, Q.; Ma, Z.; Duan, W. DCSAU-Net: A deeper and more compact split-attention U-Net for medical image segmentation. Comput. Biol. Med. 2023, 154, 106626. [Google Scholar] [CrossRef]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. Examples of issues with U-net in thyroid ultrasound image segmentation.
Figure 2. Overall structure of the proposed MCFU-net network.
Figure 3. Architecture of the MBWC block.
Figure 4. Architecture of the SSAP: The * operator indicates element-wise multiplication between feature maps, and the © represents feature concatenation.
Figure 5. Architecture of the CLFU and CLFMi (i = 1, 2, 3).
Figure 6. IoU and Dice metric histograms of the above models on the TN3K dataset: Dotted lines represent the metric values of MCFU-net (Ours).
Figure 7. IoU and Dice metric histograms of the above models on the DDTI dataset: Dotted lines represent the metric values of MCFU-net (Ours).
Figure 8. ROC curves of different algorithms on the TN3K and DDTI datasets.
Figure 9. Segmentation results of various methods on the TN3K dataset. (a) U-net; (b) Unet++; (c) AttUnet; (d) Sgunet; (e) ASPP-UNet; (f) TransUnet; (g) SmaAt-UNet; (h) DCSAU-Net; (i) MCFU-net.
Figure 10. Segmentation results of various methods on the DDTI dataset. (a) U-net; (b) Unet++; (c) AttUnet; (d) Sgunet; (e) ASPP-UNet; (f) TransUnet; (g) SmaAt-UNet; (h) DCSAU-Net; (i) MCFU-net.
Table 1. Ablation experiments on the TN3K dataset.

| Methods | Pre (%) | Recall (%) | Spe (%) | Acc (%) | IoU (%) | Dice (%) | HD95 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 90.26 ± 2.67 | 70.98 ± 3.45 | 99.00 ± 0.34 | 95.82 ± 0.15 | 65.79 ± 1.79 | 79.35 ± 1.32 | 32.26 ± 1.69 |
| MBWC | 91.41 ± 1.46 | 73.66 ± 1.90 | 99.11 ± 0.18 | 96.22 ± 0.11 | 68.85 ± 1.09 | 81.55 ± 0.76 | 30.56 ± 2.15 |
| MBWC + SSAP | 91.42 ± 1.03 | 76.83 ± 0.50 | 99.04 ± 0.13 | 96.63 ± 0.07 | 72.12 ± 0.52 | 83.80 ± 0.35 | 25.67 ± 1.98 |
| MBWC + SSAP + CLFM (Ours) | 92.14 ± 0.66 | 79.86 ± 0.97 | 99.12 ± 0.14 | 96.85 ± 0.08 | 74.25 ± 0.53 | 85.22 ± 0.34 | 23.45 ± 1.65 |

Bold values indicate the best performance in each metric across all methods.
Table 2. Ablation experiments on the DDTI dataset.

| Methods | Pre (%) | Recall (%) | Spe (%) | Acc (%) | IoU (%) | Dice (%) | HD95 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 69.70 ± 2.24 | 72.81 ± 2.36 | 94.65 ± 0.67 | 91.51 ± 0.42 | 55.25 ± 1.35 | 71.17 ± 1.13 | 37.77 ± 1.11 |
| MBWC | 75.28 ± 1.89 | 73.51 ± 2.26 | 95.92 ± 0.49 | 92.70 ± 0.22 | 59.15 ± 0.96 | 74.33 ± 0.76 | 33.74 ± 2.96 |
| MBWC + SSAP | 76.16 ± 1.42 | 74.11 ± 2.61 | 96.15 ± 0.40 | 93.00 ± 0.17 | 60.04 ± 1.23 | 75.03 ± 0.96 | 30.70 ± 2.97 |
| MBWC + SSAP + CLFM (Ours) | 78.58 ± 1.15 | 77.89 ± 1.75 | 96.42 ± 0.29 | 93.76 ± 0.20 | 64.23 ± 1.05 | 78.21 ± 0.77 | 24.78 ± 1.55 |

Bold values indicate the best performance in each metric across all methods.
Table 3. Comparisons with other segmentation models on the TN3K dataset.

| Methods | Pre (%) | Recall (%) | Spe (%) | Acc (%) | IoU (%) | Dice (%) | HD95 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| U-net | 90.26 ± 2.67 | 70.98 ± 3.45 | 99.00 ± 0.34 | 95.82 ± 0.15 | 65.79 ± 1.79 | 79.35 ± 1.32 | 32.26 ± 1.69 |
| Unet++ | 90.87 ± 1.44 | 74.19 ± 2.21 | 99.04 ± 0.19 | 96.22 ± 0.10 | 68.99 ± 1.18 | 81.64 ± 0.84 | 30.10 ± 2.06 |
| AttUnet | 91.04 ± 1.48 | 73.96 ± 2.69 | 99.06 ± 0.20 | 96.21 ± 0.14 | 68.88 ± 1.57 | 81.56 ± 1.10 | 33.26 ± 1.98 |
| Sgunet | 91.13 ± 0.77 | 71.46 ± 2.06 | 99.08 ± 0.16 | 95.97 ± 0.15 | 66.78 ± 1.47 | 80.07 ± 1.06 | 32.66 ± 1.76 |
| ASPP-UNet | 89.70 ± 1.40 | 76.32 ± 2.05 | 98.87 ± 0.20 | 96.31 ± 0.10 | 70.13 ± 1.05 | 82.44 ± 0.73 | 29.15 ± 2.44 |
| TransUnet | 90.53 ± 1.55 | 76.07 ± 2.54 | 98.97 ± 0.22 | 96.37 ± 0.14 | 70.40 ± 1.42 | 82.62 ± 0.98 | 26.07 ± 2.44 |
| SmaAt-UNet | 89.98 ± 0.84 | 77.14 ± 2.33 | 98.90 ± 0.14 | 96.43 ± 0.15 | 71.00 ± 1.46 | 83.03 ± 0.99 | 25.66 ± 2.59 |
| DCSAU-Net | 91.51 ± 0.94 | 77.80 ± 1.68 | 99.07 ± 0.13 | 96.66 ± 0.08 | 72.53 ± 0.92 | 84.08 ± 0.62 | 25.94 ± 1.83 |
| MCFU-net (Ours) | 92.14 ± 0.66 | 79.86 ± 0.97 | 99.12 ± 0.14 | 96.85 ± 0.08 | 74.25 ± 0.53 | 85.22 ± 0.34 | 23.45 ± 1.65 |

Bold values indicate the best performance in each metric across all methods.
Table 4. Comparisons with other segmentation models on the DDTI dataset.

| Methods | Pre (%) | Recall (%) | Spe (%) | Acc (%) | IoU (%) | Dice (%) | HD95 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| U-net | 69.70 ± 2.24 | 72.81 ± 2.36 | 94.65 ± 0.67 | 91.51 ± 0.42 | 55.25 ± 1.35 | 71.17 ± 1.13 | 37.77 ± 1.11 |
| Unet++ | 71.51 ± 2.21 | 72.46 ± 6.65 | 95.09 ± 0.92 | 91.83 ± 0.32 | 55.97 ± 3.00 | 71.73 ± 2.47 | 37.48 ± 5.40 |
| AttUnet | 71.77 ± 1.41 | 72.06 ± 1.70 | 95.23 ± 0.32 | 91.90 ± 0.35 | 56.14 ± 1.47 | 71.90 ± 1.21 | 37.53 ± 3.77 |
| Sgunet | 76.24 ± 2.03 | 74.06 ± 3.75 | 96.09 ± 0.60 | 92.92 ± 0.12 | 60.05 ± 1.42 | 75.03 ± 1.11 | 31.86 ± 3.97 |
| ASPP-UNet | 73.91 ± 2.82 | 72.04 ± 3.54 | 95.67 ± 0.80 | 92.27 ± 0.22 | 57.27 ± 0.89 | 72.82 ± 0.73 | 31.31 ± 1.72 |
| TransUnet | 73.99 ± 2.00 | 64.79 ± 3.93 | 96.15 ± 0.59 | 91.63 ± 0.32 | 52.67 ± 2.15 | 68.98 ± 1.86 | 38.64 ± 2.86 |
| SmaAt-UNet | 71.41 ± 1.11 | 73.71 ± 2.38 | 95.03 ± 0.39 | 91.96 ± 0.20 | 56.88 ± 1.15 | 72.51 ± 0.93 | 36.16 ± 2.06 |
| DCSAU-Net | 74.07 ± 2.66 | 76.49 ± 3.09 | 95.48 ± 0.65 | 92.74 ± 0.60 | 60.31 ± 2.61 | 75.21 ± 2.04 | 30.88 ± 4.23 |
| MCFU-net (Ours) | 78.58 ± 1.15 | 77.89 ± 1.75 | 96.42 ± 0.29 | 93.76 ± 0.20 | 64.23 ± 1.05 | 78.21 ± 0.77 | 24.78 ± 1.55 |

Bold values indicate the best performance in each metric across all methods.
Table 5. Impact of different branch numbers in MBWC block on model performance on the TN3K dataset.

| Branch1 | Branch2 | Branch3 | IoU (%) | Dice (%) |
| --- | --- | --- | --- | --- |
| √ |  |  | 70.97 ± 0.65 | 83.02 ± 0.44 |
|  | √ |  | 69.93 ± 0.59 | 82.30 ± 0.40 |
|  |  | √ | 70.46 ± 0.41 | 82.67 ± 0.31 |
| √ | √ |  | 72.09 ± 1.17 | 83.78 ± 0.78 |
| √ |  | √ | 72.80 ± 0.64 | 84.26 ± 0.43 |
|  | √ | √ | 71.72 ± 0.91 | 83.53 ± 0.21 |
| √ | √ | √ | 74.25 ± 0.53 | 85.22 ± 0.34 |

Bold values indicate the best performance in each metric across all methods, and the checkmark (√) denotes that the corresponding branch (Branch1, Branch2, Branch3) is enabled.
Table 6. Impact of different branch numbers in MBWC block on model performance on the DDTI dataset.

| Branch1 | Branch2 | Branch3 | IoU (%) | Dice (%) |
| --- | --- | --- | --- | --- |
| √ |  |  | 61.67 ± 0.95 | 76.00 ± 0.72 |
|  | √ |  | 59.72 ± 1.29 | 74.76 ± 0.84 |
|  |  | √ | 60.29 ± 0.91 | 75.22 ± 0.61 |
| √ | √ |  | 62.38 ± 1.16 | 76.75 ± 0.87 |
| √ |  | √ | 62.88 ± 1.07 | 77.21 ± 0.83 |
|  | √ | √ | 61.30 ± 1.31 | 76.29 ± 1.01 |
| √ | √ | √ | 64.23 ± 1.05 | 78.21 ± 0.77 |

Bold values indicate the best performance in each metric across all methods, and the checkmark (√) denotes that the corresponding branch (Branch1, Branch2, Branch3) is enabled.
Table 7. Impact of different quantities of CLFMs (retaining remaining skip connections) on model performance on the TN3K and DDTI datasets.

| Methods | TN3K IoU (%) | TN3K Dice (%) | DDTI IoU (%) | DDTI Dice (%) |
| --- | --- | --- | --- | --- |
| Model | 72.12 ± 0.52 | 83.80 ± 0.35 | 60.04 ± 1.23 | 75.03 ± 0.96 |
| Model + CLFM1 | 72.23 ± 0.62 | 83.97 ± 0.42 | 60.83 ± 0.74 | 75.64 ± 0.57 |
| Model + CLFM1,2 | 73.08 ± 0.73 | 84.45 ± 0.49 | 61.55 ± 1.53 | 76.19 ± 1.18 |
| Model + CLFM1,2,3 | 73.40 ± 0.43 | 84.69 ± 0.28 | 63.09 ± 0.53 | 77.31 ± 0.41 |
| Model + CLFM1,2,3,4 | 73.05 ± 0.60 | 84.42 ± 0.40 | 61.90 ± 0.59 | 76.47 ± 0.45 |

Bold values indicate the best performance in each metric across all methods.
Table 8. Impact of different quantities of CLFMs (without retaining remaining skip connections) on model performance on the TN3K and DDTI datasets.

| Methods | TN3K IoU (%) | TN3K Dice (%) | DDTI IoU (%) | DDTI Dice (%) |
| --- | --- | --- | --- | --- |
| Model | 73.01 ± 0.86 | 84.40 ± 0.58 | 60.04 ± 1.40 | 75.02 ± 1.10 |
| Model + CLFM1 | 73.19 ± 1.52 | 84.51 ± 1.02 | 60.97 ± 1.23 | 75.74 ± 0.95 |
| Model + CLFM1,2 | 73.30 ± 1.10 | 84.59 ± 0.73 | 61.42 ± 1.79 | 76.09 ± 1.38 |
| Model + CLFM1,2,3 | 74.25 ± 0.53 | 85.22 ± 0.34 | 64.23 ± 1.05 | 78.21 ± 0.77 |

Bold values indicate the best performance in each metric across all methods.
