Daisy-Net: Dual-Attention and Inter-Scale-Aware Yield Network for Lung Nodule Object Detection

Zhu, Zhijian; Zhao, Yiwen; Zhao, Xingang; Ying, Yuhan; Gu, Haoran; Song, Guoli; Wang, Qinghui

doi:10.3390/math14081350

by Zhijian Zhu ^1,2, Yiwen Zhao ^2, Xingang Zhao ^2, Yuhan Ying ^2,3, Haoran Gu ^1,2, Guoli Song ^2,* and Qinghui Wang ^1,*

^1 School of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China

^2 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China

^3 University of Chinese Academy of Sciences, Beijing 100049, China

^* Authors to whom correspondence should be addressed.

Mathematics 2026, 14(8), 1350; https://doi.org/10.3390/math14081350

Submission received: 13 March 2026 / Revised: 14 April 2026 / Accepted: 16 April 2026 / Published: 17 April 2026

(This article belongs to the Special Issue Intelligent Computing Methods for Medical Image Analysis and Computer Vision)

Abstract

Lung nodule detection remains a critical challenge in clinical diagnostics due to the small size, weak contrast, and high background interference of nodules in CT scans. To address these issues, a novel deep neural network architecture, termed Daisy-Net, is proposed. This model incorporates dual attention mechanisms and inter-scale feature perception, consisting of two primary components: the Parallelized Patch and Spatial Context Aware (PPSCA) module and the Omni-domain Multistage Fusion (OMF) module. The PPSCA module enhances the extraction of fine-grained textures and boundary information through multi-branch patch perception and spatial attention. The OMF module employs omni-domain feature fusion and progressive stage-wise supervision to improve robustness and discrimination under complex conditions. The lung nodule detection task is formulated as a two-dimensional segmentation problem and evaluated on the LUNA16 dataset. In the post-binarization comparative evaluation, Daisy-Net achieves the best overall performance among all compared methods, with an Intersection over Union (IoU) of 81.41, a Dice coefficient of 89.75, a precision of 95.34, a sensitivity of 84.78, and a specificity of 99.9974. These findings indicate the model’s strong capability in detecting small pulmonary nodules accurately and reliably.

Keywords: attention mechanisms; convolutional neural networks; detection algorithms; feature extraction; supervised learning

MSC: 68T07; 68U10; 92C55

1. Introduction

Lung nodules are frequently detected in chest computed tomography (CT) scans and are of clinical concern due to their potential risk of malignancy, despite many being benign [1,2,3]. While low-dose spiral CT has improved early detection capabilities by offering high-resolution imaging of pulmonary structures, several challenges persist. The most critical issue lies in the intrinsic nature of pulmonary nodules as small-sized targets, typically occupying only a limited number of pixels in CT slices. Their diminutive size leads to weak and often ambiguous visual features, making them highly susceptible to background interference from complex anatomical structures such as blood vessels, bronchi, and normal lung parenchyma. These surrounding tissues introduce substantial visual noise, which can obscure or mimic nodule characteristics, thereby increasing the rates of false negatives and false positives. Moreover, pulmonary nodules exhibit considerable heterogeneity in shape, density, and boundary definition, requiring radiologists to possess both extensive expertise and sustained concentration for accurate interpretation [4]. In large-scale screening scenarios, the overwhelming volume of CT slices further amplifies the cognitive and workload burden. Therefore, enhancing small object feature extraction while effectively suppressing background interference is essential for improving the accuracy and reliability of pulmonary nodule detection.

1.1. Related Work

1.1.1. Traditional Detection Methods

In the early stages, lung nodule detection primarily relied on manual inspection of CT images by radiologists. Common traditional techniques include grayscale thresholding and region growing, which perform image segmentation based on pixel intensity and spatial continuity. Ye et al. (2009) developed a rule-based automated lung nodule detection method by combining shape feature analysis with adaptive thresholding, systematically evaluating geometric models for boundary delineation [5]. Li et al. (2011) proposed a hybrid framework integrating rule inference and support vector machines (SVM), using region growing and thresholding to extract candidate regions [6]. Soleymanpour et al. (2011) implemented an automatic lung segmentation and rib suppression algorithm based on multi-thresholding and morphological operations, offering robust preprocessing for downstream nodule detection [7]. Choi et al. (2014) combined 3D shape descriptors with region growing techniques to construct a rule-based detection pipeline, particularly effective for round-shaped nodules [8]. Halder et al. (2020) provided a comprehensive review of traditional workflows involving thresholding, region growing, and morphological operations, highlighting their continued relevance in low-contrast scenarios [9]. These references are cited here specifically to illustrate representative thresholding-, region-growing-, and rule-based paradigms that dominated early pulmonary nodule analysis. However, these traditional techniques rely heavily on handcrafted features, making them inefficient for processing large volumes of CT scans and limiting their robustness and clinical applicability.

1.1.2. Deep Learning-Based Two-Dimensional Methods

In recent years, the rise of deep learning, especially convolutional neural networks (CNNs), has improved the accuracy and efficiency of lung nodule detection. Shah et al. (2023) demonstrated the ability of CNNs to automatically learn hierarchical feature representations from raw CT data [10]. Xu et al. (2023) and Su et al. (2021) applied Faster R-CNN architectures to the nodule detection task and achieved substantial performance gains [11,12]. Zheng et al. (2019, 2020) integrated maximum intensity projection (MIP) images with CNNs to enhance the sensitivity toward small nodules (3–10 mm) and reduce false positives [13,14]. Furthermore, Fu et al. (2022) proposed a semi-supervised generative framework to improve model robustness under limited labeled data [15]. However, 2D CNN-based methods inherently lack volumetric contextual information, leading to spatial information loss and suboptimal performance in detecting complex nodules.

1.1.3. Advances in Attention Mechanisms

To further improve detection performance, attention mechanisms have emerged as a promising enhancement in recent studies. Cai et al. (2025) introduced an interference-resistant attention model that emphasizes nodule regions while suppressing irrelevant background noise [16]. Urrehman et al. (2024) and Nasrullah et al. (2019) incorporated dual attention and channel–spatial joint attention, respectively, to better localize discriminative nodule features [17,18]. Wang et al. (2024) proposed the Attention Pyramid Pooling Network (APPN), which improves feature fusion through multi-scale spatial attention for benign–malignant classification [19]. Liu et al. (2025) developed the CSEA-Net, which combines both channel and spatial attention to enhance robustness in complex CT scenarios [20]. Despite these advances, most attention-based models still face challenges in effectively integrating multi-scale features, and the coupling between attention modules and backbone networks remains suboptimal.

1.1.4. Three-Dimensional Deep Learning Approaches

Three-dimensional convolutional neural networks (3D CNNs) have become a major focus in recent years for volumetric medical image analysis. Kamnitsas et al. (2017) introduced an efficient 3D CNN architecture for segmenting complex anatomical structures in CT scans [21]. Song et al. (2024) proposed a multi-scale anchor-free framework (M3N) with adaptive nodule modeling (ANM) to improve detection of nodules with varying sizes [22]. Sardar et al. (2025) presented a contextual 3D CNN (3D-CCNN) that integrates spatial context and multi-scale features to enhance detection sensitivity and robustness [23]. Luo et al. (2021) designed SCPM-Net, which leverages spherical representations and central point matching to achieve anchor-free and robust detection [24]. Karampidis et al. (2025) developed a patch-based 3D residual segmentation network for low-dose CT, enabling precise boundary delineation and volumetric estimation [25]. Yang et al. (2025) proposed a multi-layer 3D CNN framework for risk assessment based on subtle and sub-centimeter nodules, demonstrating the feasibility of early diagnosis without large nodular structures [26]. Additionally, Ozdemir et al. (2020) introduced an end-to-end framework that jointly integrates detection and diagnosis, significantly reducing false positives [27].

Despite their superior performance, 3D CNNs have practical limitations. Training these models requires substantial computational resources and large annotated datasets. Furthermore, the high cost of obtaining volumetric annotations and hardware demands of 3D models make them difficult to deploy, especially in resource-limited clinical settings, thus restricting their widespread adoption.

1.1.5. Motivation for Daisy-Net

In summary, traditional methods suffer from poor scalability and manual dependency; 2D CNNs are limited by insufficient spatial context; attention mechanisms are still under-explored in multi-scale integration; and 3D CNNs, although effective, are constrained by computational overhead. To address these challenges, this study aims to develop a network that efficiently utilizes 2D image representations while effectively integrating multi-scale features and spatial contextual information. The goal is to improve detection performance while ensuring feasibility for real-world clinical deployment.

Taken together, these observations define the central gap addressed in this study: a clinically practical two-dimensional framework is needed that preserves small-nodule details, strengthens cross-scale feature interaction, and avoids the computational burden of fully volumetric models.

Although 2D CNNs may lose inter-slice volumetric continuity, this study intentionally formulates pulmonary nodule detection as a 2D slice-based segmentation problem. This choice is motivated by several practical considerations: (1) annotation availability and consistency—most public lung nodule datasets (including LUNA16) provide slice-level masks and are commonly benchmarked under 2D protocols, which makes a 2D formulation more reproducible and directly comparable; (2) deployment feasibility—in many clinical workflows, fast screening and localization can be performed by processing axial slices sequentially, where a high-quality slice-wise segmentation can efficiently indicate suspicious regions without requiring full-volume inference; and (3) computational efficiency and stability—3D CNNs require substantially higher GPU memory and computation due to volumetric convolutions and larger input tensors, which often forces aggressive downsampling or patch-based training, potentially weakening fine boundary cues for small nodules and complicating training/inference pipelines.

Therefore, instead of adopting a computationally expensive 3D backbone, this work focuses on enhancing 2D representations by explicitly strengthening fine-grained feature extraction and cross-scale fusion (via PPSCA and OMF), aiming to compensate for the lack of volumetric context to a certain extent while maintaining efficiency and practicality. Incorporating 3D context, such as lightweight inter-slice aggregation or hybrid 2.5D/3D extensions, remains a meaningful direction for future studies.

To address the aforementioned challenges, we propose a deep neural network architecture named Dual-Attention and Inter-Scale-aware Yield Network (Daisy-Net). In the feature fusion stage, inspired by the Feature Fusion Module (FFM) [28], we design an Omni-domain Multistage Fusion (OMF) module, which leverages a multi-stage supervision strategy to enhance the quality control and integration of feature maps across different scales, thereby improving the precision of feature representation. Additionally, we introduce the Parallelized Patch and Spatial Context Aware (PPSCA) module. Building upon the advantages of the Patch Perception Attention (PPA) [29] mechanism, PPSCA further incorporates a Spatial Context Aware Module (SCAM) [28] to strengthen the network’s ability to capture complex spatial structures and local details. Compared with conventional variants, these two modules provide a more explicit combination of fine-detail enhancement and cross-scale fusion. The proposed architecture is further specified through channel-reweighted cross-scale fusion, the stage-wise refinement pipeline in OMF, and the weighted multi-scale supervision scheme in Daisy-Net, which together make the feature interaction and optimization process more transparent and reproducible.

2. Materials and Methods

2.1. LUNA16 Dataset

In this study, we utilize the publicly available LUNA16 dataset [30,31], a widely used benchmark in pulmonary nodule detection. Derived from the LIDC/IDRI database, LUNA16 excludes CT scans with slice thickness greater than 2.5 mm, ultimately including 888 high-quality 3D CT volumes.

The LUNA16 annotations used in this study were provided by the public LUNA16 benchmark [30,31]. The dataset contains nodule annotations derived from the LIDC/IDRI reference standard, including nodule center coordinates and diameters. In this work, these public annotations were used to generate the corresponding slice-wise binary masks for supervised training.

We reformulate the detection task as a nodule segmentation problem. Compared to traditional detection methods, segmentation provides richer localization and size estimation, improving diagnostic value. Moreover, by analyzing pixel-level features, segmentation captures fine-grained textures and boundaries—particularly advantageous for small nodule detection—enhancing accuracy and stability.

2.2. Preprocessing

The original CT images in LUNA16 are in 3D format, and each pixel is expressed in Hounsfield Units (HU), quantifying tissue radiodensity. To improve model performance, we designed a specialized preprocessing pipeline, as illustrated in Figure 1. The purpose of this pipeline is threefold: to maintain spatial consistency between images and annotations, to reduce redundant slices that do not contribute effective nodule information, and to generate stable binary supervision masks for slice-wise training.

Coordinate transformation and metadata preservation. All CT images were resampled to a standardized size of $(Z, 512, 512)$, where $Z$ is the number of slices. The in-plane size of $512 \times 512$ was retained because it is consistent with the common matrix size of chest CT slices in LUNA16 and helps preserve fine anatomical boundaries while keeping the network input format uniform across scans. This standardization reduces implementation variability in batch training and avoids introducing an additional resize target that could blur tiny nodules. Both image and annotation coordinates were converted to a relative reference system. The origin and pixel spacing were stored to allow mapping predictions back to the original space.
Slice selection based on nodule location. To reduce redundancy and balance data distribution, we selected a range of slices centered on each nodule, based on its diameter and z-axis position. This ensures the model learns from effective and relevant image regions.
Mask generation. For each selected slice, a corresponding binary nodule mask was generated. If the slice contains part of a nodule, the region corresponding to its projection is marked as 1; other pixels remain 0. These masks provide fine supervision signals for training.
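
The three preprocessing steps can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the pipeline used in the experiments: each nodule is modeled as a sphere given by its annotated center and diameter, the selected slice range covers exactly the slices that sphere intersects, and the function names (`world_to_voxel`, `nodule_slice_masks`) are hypothetical.

```python
import numpy as np


def world_to_voxel(world_xyz, origin_xyz, spacing_xyz):
    """Map world coordinates (mm) to fractional voxel indices (x, y, z)."""
    world = np.asarray(world_xyz, dtype=float)
    origin = np.asarray(origin_xyz, dtype=float)
    spacing = np.asarray(spacing_xyz, dtype=float)
    return (world - origin) / spacing


def nodule_slice_masks(center_world, diameter_mm, origin, spacing, shape_zyx):
    """Return {z_index: binary mask} for every slice a spherical nodule intersects.

    Assumption: the nodule is a sphere, so its cross-section on each slice is a disk.
    """
    cx, cy, cz = world_to_voxel(center_world, origin, spacing)
    radius = diameter_mm / 2.0
    z_dim, height, width = shape_zyx
    yy, xx = np.mgrid[0:height, 0:width]  # row (y) and column (x) index grids

    # Slice selection: keep only slices within one radius of the center along z.
    z_lo = max(int(np.ceil(cz - radius / spacing[2])), 0)
    z_hi = min(int(np.floor(cz + radius / spacing[2])), z_dim - 1)

    masks = {}
    for z in range(z_lo, z_hi + 1):
        dz_mm = (z - cz) * spacing[2]
        r_sq = radius ** 2 - dz_mm ** 2  # squared in-plane radius (mm^2)
        if r_sq < 0:
            continue
        dist_sq = ((xx - cx) * spacing[0]) ** 2 + ((yy - cy) * spacing[1]) ** 2
        masks[z] = (dist_sq <= r_sq).astype(np.uint8)  # 1 inside the disk, else 0
    return masks
```

For a nodule of diameter 4 mm centered at z = 5 in a volume with 1 mm isotropic spacing, this sketch yields masks for slices 3 to 7, with the largest disk on the central slice.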

Note that we did not perform normalization during preprocessing to avoid losing contrast or misclassifying tissue types.

Finally, we split the preprocessed data into training and test sets with a ratio of 8:2, providing a solid foundation for model training and evaluation.

2.3. Proposed Method

As illustrated in Figure 2, Daisy-Net adopts an enhanced U-Net framework that incorporates two key modules: the Parallelized Patch and Spatial Context Aware (PPSCA) module and the Omni-domain Multistage Fusion (OMF) module. These carefully designed components improve the network’s capability to detect small pulmonary nodules, addressing the challenges of small object loss and high background interference. The following paragraphs present an overview of the entire architecture, followed by detailed descriptions of the OMF and PPSCA modules, highlighting their design principles and advantages.

2.3.1. Network Architecture Overview

Daisy-Net is a dual-attention-based multi-scale feature perception framework that consists of two main modules, OMF and PPSCA, as shown in Figure 2.

The OMF module supervises training at multiple stages by collecting and fusing features from different levels of the encoder and the decoder. In the encoder, features have not yet been attenuated by repeated downsampling, so many fine details are retained. In the decoder, the model is expected to ignore background noise and strengthen target features, which limits the influence of complex backgrounds.

The PPSCA module examines feature maps with small and medium fields of view rather than a single large one. It adopts a multi-branch structure in which each branch attends to the feature map at a different scale, so small targets that are hard to detect in a large field of view can be found at a finer scale. This creates the conditions for fully extracting small-target features.

During encoding, the input image is passed through four consecutive PPSCA modules, each followed by a downsampling operation to progressively extract features at different scales. Four PPSCA stages were selected to match the four-level encoder–decoder hierarchy of the backbone, so that detail enhancement and spatial-context refinement are maintained consistently from high-resolution shallow features to low-resolution deep features rather than being applied at only a single scale. With PPSCA’s ability to capture both spatial context and local details, the network effectively learns rich multi-scale representations of pulmonary nodules.

In the decoding stage, four additional PPSCA modules are applied alongside upsampling operations to gradually restore spatial resolution and generate accurate nodule segmentation masks. To further enhance the fusion of multi-scale features, we design three OMF modules, which supervise and integrate multi-scale feature maps during the decoding process. These modules ensure effective information propagation across scales.

For supervision, the ground truth segmentation masks are downsampled to match the resolutions of multi-scale feature maps, enabling scale-specific loss computation. The overall loss function is defined as:

$$\text{Total Loss} = \sum_{i=0}^{4} \frac{l_{i}}{2^{i}} \tag{1}$$

where the loss at the $i$-th scale, $l_{i}$, is defined as:

$$l_{i} = \mathrm{BCE}(y, \hat{y}) + \left(1 - \mathrm{IoU}(y, \hat{y})\right) \tag{2}$$

Here, $\mathrm{BCE}$ denotes the binary cross-entropy loss and $\mathrm{IoU}$ represents the intersection over union metric. The loss weights $\frac{1}{16}, \frac{1}{8}, \frac{1}{4}, \frac{1}{2}, 1$ reflect the importance of higher-resolution feature maps from bottom to top, emphasizing the contribution of finer-grained predictions.
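
Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration rather than the training code: it assumes that the IoU term is evaluated in a soft (probabilistic) form so that it stays differentiable, that scale $i$ has resolution reduced by a factor of $2^{i}$, and that masks are downsampled by max pooling. The helper names are hypothetical.

```python
import numpy as np


def downsample_mask(mask, i):
    """Downsample a binary mask by a factor 2**i using max pooling (assumption)."""
    f = 2 ** i
    h, w = mask.shape
    return mask.reshape(h // f, f, w // f, f).max(axis=(1, 3))


def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between mask y and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def soft_iou(y, p, eps=1e-7):
    """IoU computed from probabilities instead of hard labels (assumption)."""
    inter = np.sum(p * y)
    union = np.sum(p) + np.sum(y) - inter
    return float((inter + eps) / (union + eps))


def scale_loss(y, p):
    """Equation (2): l_i = BCE + (1 - IoU) at a single scale."""
    return bce(y, p) + (1.0 - soft_iou(y, p))


def total_loss(masks, preds):
    """Equation (1): sum over scales i of l_i / 2**i, with i = 0 the finest scale."""
    return sum(scale_loss(y, p) / 2 ** i for i, (y, p) in enumerate(zip(masks, preds)))
```

With five scales the weights are 1, 1/2, 1/4, 1/8, and 1/16, so the full-resolution prediction contributes most to the objective.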

Through this design, Daisy-Net effectively addresses the limitations of insufficient multi-scale feature fusion and constrained use of attention mechanisms, substantially improving detection performance and stability in pulmonary nodule analysis.

2.3.2. Omni-Domain Multistage Fusion Module

One of the core challenges in small object detection lies in insufficient feature extraction. After multiple convolutional and downsampling operations, fine-grained details such as edges and textures are often lost, making it difficult for models to accurately identify objects, especially under complex background interference. To address this issue, we propose the Omni-domain Multistage Fusion (OMF) module, which systematically enhances the network’s ability to extract and utilize fine-scale features by introducing multi-stage supervision and multi-scale fusion strategies.

Unlike conventional small object detection models, the OMF module adopts a progressive supervision mechanism that provides feature-level guidance at different training stages. In the early phase of training, the model focuses on learning basic features such as boundaries and textures. As training progresses, the supervision gradually shifts toward more abstract semantic features. This stage-wise learning paradigm ensures both the sufficiency and accuracy of feature extraction and significantly mitigates false positives or missed detections caused by incomplete features in traditional methods.

In addition, the OMF module performs multi-scale feature fusion to address the inherent scale variability of small targets. At early stages, high-resolution feature maps are emphasized to capture fine-grained textures and localize nodules precisely. In later stages, lower-resolution, large-scale features are progressively integrated to incorporate global contextual information. This cross-scale fusion mechanism enables the network to localize small nodules accurately while simultaneously understanding their relationship with the surrounding background, thereby improving detection robustness and precision under complex imaging conditions.

As illustrated in Figure 3, the OMF module contains two core units: Cross Stage Partial (CSP) blocks and Channel Re-weighting Convolution (CRC) blocks. The CSP block alleviates gradient vanishing and improves computational efficiency and feature expression, while the CRC block adaptively adjusts channel-wise feature weights for finer fusion. Together, they support more informative and discriminative feature learning along the channel dimension. Unlike a standard CSP-style fusion block that usually processes one input tensor or a single concatenated feature once, OMF uses repeated aligned multi-input fusion, intermediate mid-level refinement ($X_{2}''$), and stage-wise supervision-oriented aggregation. Therefore, its role is not limited to channel reduction or residual propagation; it is specifically designed to improve cross-scale consistency before the final prediction stage.

Specifically, the OMF module takes three input feature maps: $X_{1} \in \mathbb{R}^{C_{1} \times H \times W}$, $X_{2} \in \mathbb{R}^{C_{2} \times H \times W}$, and $X_{3} \in \mathbb{R}^{C_{3} \times H \times W}$, derived from different stages of the backbone, with channel ratios $C_{1} : C_{2} : C_{3} = 4:2:1$.

For reproducibility, we emphasize that all inputs involved in each fusion step are first spatially aligned to the same resolution $(H, W)$ (via upsampling/downsampling when necessary), and the subsequent fusion operations are performed only along the channel dimension. The fusion procedure consists of the following steps:

Fuse $X_{3}$ and $X_{2}$ using CRC and CSP to produce $X_{2}' \in \mathbb{R}^{C_{2} \times H \times W}$;
Fuse $X_{2}'$ and $X_{1}$ to generate $X_{1}' \in \mathbb{R}^{C_{1} \times H \times W}$;
Combine $X_{2}'$, $X_{1}$, and $X_{2}$ to compute $X_{2}'' \in \mathbb{R}^{C_{2} \times H \times W}$;
Fuse $X_{2}''$ with $X_{3}$ to obtain $X_{3}'' \in \mathbb{R}^{C_{2} \times H \times W}$;
Concatenate $X_{1}'$, $X_{2}''$, and $X_{3}''$ along the channel dimension and pass through a $1 \times 1$ convolution to yield the final output $Y \in \mathbb{R}^{1 \times H \times W}$.

In the OMF pipeline, $X_{2}''$ serves as an intermediate refined mid-level representation that bridges high-resolution details and high-level semantics. Specifically, after obtaining $X_{1}'$ (detail-enhanced) and $X_{2}'$ (context-enhanced), we further refine the mid-level feature by re-fusing $\{X_{2}', X_{1}, X_{2}\}$ to produce $X_{2}''$. This second refinement is designed to (i) re-inject reliable boundary/detail cues from the shallower branch ($X_{1}$) while preserving spatial stability, and (ii) mitigate cross-scale inconsistency introduced by progressive fusion, yielding a more discriminative yet spatially coherent feature for subsequent aggregation.

We perform the second refinement on the mid-level feature rather than on $X_{1}$ for two reasons. First, $X_{1}'$ already corresponds to the finest scale and is directly optimized by the segmentation objective; repeatedly refining $X_{1}'$ would increase computation and may amplify high-frequency noise without clear benefit. Second, the mid-level feature provides a better trade-off between resolution and semantics, making it a suitable “fusion hub” to absorb complementary cues from both the high-semantic branch and the fine-detail branch. Therefore, refining $X_{2}$ into $X_{2}''$ improves feature consistency before generating $X_{3}''$ and the final output.

In each fusion, CRC first performs channel-wise reweighting and concatenation, which increases the channel dimension additively (e.g., $C_{2} + C_{3}$). The following CSP then deterministically projects the concatenated channels back to the required target width (e.g., $C_{2}$), ensuring consistent channel dimensions for all subsequent steps.

Definition of CRC. Given two aligned feature maps $A \in \mathbb{R}^{C_{a} \times H \times W}$ and $B \in \mathbb{R}^{C_{b} \times H \times W}$, CRC learns a channel-wise weight vector $w \in \mathbb{R}^{C_{a} + C_{b}}$ and normalizes it as $\hat{w} = w / (\sum_{j} w_{j} + \epsilon)$. It then reweights channels and concatenates them:

$$\mathrm{CRC}[A, B] = \mathrm{Concat}\left(\hat{w}_{1 : C_{a}} \odot A,\; \hat{w}_{C_{a} + 1 : C_{a} + C_{b}} \odot B\right) \in \mathbb{R}^{(C_{a} + C_{b}) \times H \times W}. \tag{3}$$

The three-input case is defined analogously, with one weight per concatenated channel: for $\mathrm{CRC}[X_{2}', X_{1}, X_{2}]$, $w \in \mathbb{R}^{C_{2}' + C_{1} + C_{2}}$ and the output lies in $\mathbb{R}^{(C_{2}' + C_{1} + C_{2}) \times H \times W}$, where $C_{2}' = C_{2}$ is the channel width of $X_{2}'$.
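
Equation (3) can be illustrated with a small NumPy sketch that accepts any number of spatially aligned inputs. It is a hedged example, not the authors' implementation: the weight vector is passed in as a plain array rather than learned, and it is assumed to be non-negative so that the normalization is well behaved. The function name `crc` is hypothetical.

```python
import numpy as np


def crc(features, w, eps=1e-6):
    """Channel re-weighting followed by concatenation, as in Equation (3).

    features: list of arrays shaped (C_k, H, W) with identical H and W.
    w: weight vector with one entry per concatenated channel (assumed >= 0).
    """
    widths = [f.shape[0] for f in features]
    if w.shape != (sum(widths),):
        raise ValueError("w needs one entry per concatenated channel")
    w_hat = w / (w.sum() + eps)  # normalize so the weights sum to about 1
    # Split the normalized weights into one chunk per input feature map.
    chunks = np.split(w_hat, np.cumsum(widths)[:-1])
    scaled = [wk[:, None, None] * f for wk, f in zip(chunks, features)]
    return np.concatenate(scaled, axis=0)  # shape (sum of C_k, H, W)
```

Because the output width is the sum of the input widths, a subsequent projection is needed to return to a fixed channel count.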

Definition of CSP. CSP is implemented as a CrossConv mapping that projects an input $X \in \mathbb{R}^{c_{1} \times H \times W}$ to a target width $c_{2}$ via two factorized convolutions ($1 \times k$ followed by $k \times 1$) with intermediate width $c' = \lfloor c_{2} \cdot e \rfloor$:

$$\mathrm{CSP}(X) = \mathrm{Conv}_{k \times 1}\left(\mathrm{Conv}_{1 \times k}(X; c'); c_{2}\right) \in \mathbb{R}^{c_{2} \times H \times W}, \tag{4}$$

where each $\mathrm{Conv}$ denotes Conv-BN-SiLU in our implementation, and an optional shortcut is used only when $c_{1} = c_{2}$.
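
To make the cost of the factorized mapping in Equation (4) concrete, the following sketch counts convolution weights for the two factorized layers and, for reference, for a single dense $k \times k$ convolution with the same input and output widths. The values $k = 3$ and $e = 0.5$ are hypothetical and chosen only for illustration; bias terms and batch-normalization parameters are not counted.

```python
import math


def crossconv_weights(c1, c2, k=3, e=0.5):
    """Weight count of a 1 x k conv (c1 -> c') followed by a k x 1 conv (c' -> c2)."""
    c_mid = math.floor(c2 * e)       # intermediate width c' from Equation (4)
    first = c1 * c_mid * k           # 1 x k kernel
    second = c_mid * c2 * k          # k x 1 kernel
    return c_mid, first + second


def dense_conv_weights(c1, c2, k=3):
    """Weight count of a single dense k x k convolution (c1 -> c2)."""
    return c1 * c2 * k * k


# Example: projecting a 48-channel concatenation back to 32 channels.
c_mid, factorized = crossconv_weights(48, 32)
dense = dense_conv_weights(48, 32)
```

Under these assumed settings the factorized pair uses 3840 weights against 13,824 for the dense convolution.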

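The process is formally defined as:

$$X_{2}' = \mathrm{CSP}\left(\mathrm{CRC}[X_{3}, X_{2}]\right) \tag{5}$$

$$X_{1}' = \mathrm{CSP}\left(\mathrm{CRC}[X_{2}', X_{1}]\right) \tag{6}$$

$$X_{2}'' = \mathrm{CSP}\left(\mathrm{CRC}[X_{2}', X_{1}, X_{2}]\right) \tag{7}$$

$$X_{3}'' = \mathrm{CSP}\left(\mathrm{CRC}[X_{2}'', X_{3}]\right) \tag{8}$$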
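
The channel bookkeeping implied by Equations (5) to (8) and the final concatenation can be traced with a few lines of Python. The sketch tracks channel widths only, with no tensors involved. The target width of each CSP projection follows the step list above, and the concrete widths 64, 32, and 16 are hypothetical values that satisfy the 4:2:1 ratio.

```python
def crc_width(*widths):
    """CRC concatenates its inputs, so the output width is the sum of input widths."""
    return sum(widths)


def omf_widths(c1, c2, c3):
    """Return {feature: (width after CRC or concat, width after projection)}."""
    return {
        "X2'": (crc_width(c3, c2), c2),       # Equation (5)
        "X1'": (crc_width(c2, c1), c1),       # Equation (6)
        "X2''": (crc_width(c2, c1, c2), c2),  # Equation (7)
        "X3''": (crc_width(c2, c3), c2),      # Equation (8), target width as listed
        "Y": (c1 + c2 + c2, 1),               # concat of X1', X2'', X3'' then 1 x 1 conv
    }


for name, (concat_width, out_width) in omf_widths(64, 32, 16).items():
    print(f"{name}: {concat_width} -> {out_width}")
```

The three-input fusion in Equation (7) produces the widest intermediate tensor, which is where the CSP projection saves the most channels.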

In summary, the OMF module effectively improves the model’s capability to detect small objects in complex environments by combining progressive supervision and refined cross-scale feature fusion.

2.3.3. Parallelized Patch and Spatial Context Aware Module

The conventional U-Net architecture typically employs convolution and pooling layers in the encoder and relies on convolution and upsampling operations in the decoder. While effective for general feature extraction, this structure struggles to suppress background interference in small object detection tasks, often resulting in the loss of fine details such as edges and textures. Accurately capturing subtle features of small objects thus remains a significant challenge.

To address this issue, we propose the Parallelized Patch and Spatial Context Aware (PPSCA) module, an optimized design that significantly enhances multi-scale fine-grained feature extraction. Specifically, PPSCA adopts a multi-branch structure to capture information at different scales, preventing key features of small objects from being lost during successive downsampling. Unlike earlier patch-aware modules, PPSCA removes the full-resolution branch and retains only two branches with patch scales $p = 2$ and $p = 4$, thereby reducing parameter overhead and improving computational efficiency while preserving capacity for attention enhancement. This means that PPSCA is not a direct reuse of the original PPA-style design: it simplifies the branch configuration, keeps only the practically useful mid-scale patch partitions, and then combines the resulting patch-aware responses with SCAM-based spatial context modeling through residual fusion. Notably, in our patch partition setting, $p = 1$ corresponds to treating the entire feature map as a single patch, which yields a largely spatially invariant global descriptor that is broadcast to all locations; this behavior provides limited location-dependent refinement for small-nodule boundaries and tends to overlap with the global/spatial context modeling already offered by the spatial-attention pathway. At the other extreme, overly large $p$ approaches a per-pixel partition, weakening meaningful regional aggregation and becoming closer to a point-wise local mapping, which is also less complementary. Therefore, we select $p = 2$ and $p = 4$ as a practical mid-scale trade-off that maintains non-trivial regional context perception and boundary sensitivity, and the efficiency gain is an additional benefit of removing such degenerate branches.

Following multi-scale patch processing, PPSCA incorporates the Spatial Context Aware Module (SCAM) to further improve its representational capability. SCAM leverages both global max pooling (GMP) and global average pooling (GAP) to effectively extract global spatial contextual information. This combination allows the network to model pixel-level spatial relationships and significantly improves its sensitivity to small object features.

As shown in Figure 4, the PPSCA module takes an input feature map $X \in \mathbb{R}^{C \times H \times W}$ and outputs a feature map $Y$ of the same dimensions. The internal process includes the following steps:

Multi-branch patch preparation: The input $X$ is first passed through a channel adjustment layer to produce $P^{(0)}$. This intermediate tensor is then fed into two parallel patch-aware branches to extract features at scales $P_{1} \in \mathbb{R}^{C_{p} \times \frac{H}{2} \times \frac{W}{2}}$ and $P_{2} \in \mathbb{R}^{C_{p} \times \frac{H}{4} \times \frac{W}{4}}$.
Patch processing in each branch: Within each branch, patches are extracted via the Unfold operation, followed by channel-wise average pooling to obtain $P^{(1)}$. A feed-forward network (FFN) computes patch-wise attention weights, normalized via Softmax. The weighted feature map $P^{(2)}$ is then reassembled and upsampled to the original resolution, yielding $P^{(3)}$.
Residual fusion and enhancement: The outputs from both branches are combined with $P^{(0)}$ through residual connections to form the enhanced feature map $P^{*}$, which is passed into the subsequent attention module.
Spatial context enhancement: The attention module, implemented as SCAM, captures spatial context information across the entire image to reinforce feature representation.
Final output: A final channel adjustment layer transforms $P^{*}$ into the output $Y \in \mathbb{R}^{C \times H \times W}$.
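
The first part of the patch-processing step (Unfold followed by channel-wise average pooling) can be sketched in NumPy. This is only an illustration under stated assumptions: patches are taken to be non-overlapping $p \times p$ blocks with stride $p$, so the patch grid is $\frac{H}{p} \times \frac{W}{p}$, which matches the branch shapes listed in the steps above. The FFN, Softmax weighting, reassembly, and upsampling are omitted, and the function names are hypothetical.

```python
import numpy as np


def unfold_patches(x, p):
    """Split a (C, H, W) feature map into non-overlapping p x p patches.

    Returns an array shaped (C, H/p, W/p, p*p), one flattened patch per grid cell.
    """
    c, h, w = x.shape
    if h % p or w % p:
        raise ValueError("H and W must be divisible by p")
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(0, 1, 3, 2, 4)  # (C, H/p, W/p, p, p)
    return x.reshape(c, h // p, w // p, p * p)


def channel_average(patches):
    """Average over the channel axis, giving one (p*p)-vector per patch."""
    return patches.mean(axis=0)  # (H/p, W/p, p*p)


# Example with the two patch scales retained in PPSCA.
feature = np.random.default_rng(0).normal(size=(8, 16, 16))
for p in (2, 4):
    pooled = channel_average(unfold_patches(feature, p))
    print(p, pooled.shape)
```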

Through the above design, the PPSCA module enhances the model’s ability to extract fine features of small objects in complex backgrounds, thereby improving the overall detection performance and robustness of the network.

2.4. Evaluation Metrics and Model Training

To comprehensively evaluate the performance of the proposed Daisy-Net on the LUNA16 dataset, we adopt the following standard metrics (pixel-level): Intersection over Union (IoU), Sørensen–Dice Coefficient (Dice), Sensitivity (Sens), and Specificity (Spec). The overlap metrics (IoU and Dice) and the classification-style metrics (Sensitivity and Specificity) are computed under different protocols. IoU and Dice are computed as soft overlap measures from probabilistic predictions, whereas Sensitivity and Specificity are computed after binarization of the predicted masks at a predefined threshold. Therefore, the latter two metrics are threshold-dependent and are used here as reference indicators of binary classification behavior.

The definitions are provided below:

Intersection over Union (IoU) measures the overlap between the predicted segmentation and the ground truth. In this work, it is calculated in a soft form from probabilistic predictions, defined as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{9}$$

Dice Coefficient (Dice) also quantifies the similarity between prediction and ground truth and is widely used in medical image segmentation. In this work, it is likewise evaluated in a soft form:

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN} \tag{10}$$

Sensitivity (Sens) indicates the model’s ability to correctly identify positive samples (i.e., true nodules). In contrast to the overlap metrics above, it is computed after binarizing the prediction map at a predefined threshold:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{11}$$

Specificity (Spec) reflects the model’s capacity to correctly classify negative samples (i.e., non-nodule regions). It is also computed on binarized predictions:

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{12}$$
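
The difference between the soft and the binarized protocols can be made concrete with a minimal sketch. It assumes flat lists of per-pixel probabilities and labels, uses the soft counts of Eqs. (13)–(15) for the overlap metrics, and applies the 0.3 threshold listed in Table 1 for binarization. Whether the threshold comparison is inclusive (`>=`) is an assumption, and the function names are illustrative only.

```python
def soft_counts(probs, labels):
    """Soft TP, FP, FN computed from probabilities, following Eqs. (13)-(15)."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    return tp, fp, fn


def binary_counts(probs, labels, threshold=0.3):
    """Hard TP, FP, FN, TN after thresholding (inclusive comparison assumed)."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        b = 1 if p >= threshold else 0
        if b == 1 and y == 1:
            tp += 1
        elif b == 1 and y == 0:
            fp += 1
        elif b == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def iou(tp, fp, fn):
    return tp / (tp + fp + fn)          # Eq. (9); assumes a non-zero denominator


def dice(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)  # Eq. (10)


def sensitivity(tp, fn):
    return tp / (tp + fn)               # Eq. (11)


def specificity(tn, fp):
    return tn / (tn + fp)               # Eq. (12)


# Four toy pixels: two nodule pixels followed by two background pixels.
probs = [0.9, 0.2, 0.4, 0.1]
labels = [1, 1, 0, 0]

s_tp, s_fp, s_fn = soft_counts(probs, labels)
b_tp, b_fp, b_fn, b_tn = binary_counts(probs, labels)

soft_iou, soft_dice = iou(s_tp, s_fp, s_fn), dice(s_tp, s_fp, s_fn)
hard_iou, hard_dice = iou(b_tp, b_fp, b_fn), dice(b_tp, b_fp, b_fn)
sens, spec = sensitivity(b_tp, b_fn), specificity(b_tn, b_fp)
```

On these toy values the soft and binarized overlap scores differ for the same prediction, so results from the two protocols should not be compared directly.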

For the soft overlap metrics, $TP$ denotes true positives, $FP$ false positives, and $FN$ false negatives, which are defined as

$$TP = \sum_{i} p_{i} y_{i}, \tag{13}$$

$$FP = \sum_{i} p_{i} (1 - y_{i}), \tag{14}$$

$$FN = \sum_{i} (1 - p_{i}) y_{i}, \tag{15}$$

where $p_{i} \in [0, 1]$ is the predicted probability at pixel $i$ and $y_{i} \in \{0, 1\}$ is the corresponding ground-truth label. For Sensitivity and Specificity, the probability map is first binarized as ${\tilde{p}}_{i} \in \{0, 1\}$ using a predefined threshold, and the corresponding $TP$, $FP$, $FN$, and $TN$ are then computed in the usual binary form from ${\tilde{p}}_{i}$ and $y_{i}$.

In addition, the comparative results reported in the Results section are recomputed from the binarized prediction masks using the same threshold, so that the reported overlap metrics and the added Precision value follow a unified post-binarization protocol. For the main comparison experiment, 95% confidence intervals (CIs) were estimated using a non-parametric bootstrap procedure at the image level. Specifically, the 1186 paired ground-truth and binarized prediction masks were resampled with replacement for 10,000 iterations, and each metric was recalculated from the aggregated $TP$, $FP$, $FN$, and $TN$ values in each bootstrap sample. The 2.5th and 97.5th percentiles of the bootstrap distribution were reported as the 95% CI.
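
The image-level bootstrap described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation script: each image is represented by a hypothetical tuple of binary counts `(TP, FP, FN)`, the percentile is taken by a simple nearest-rank index, and the convention of returning 0 for an empty denominator is a choice made here.

```python
import random


def dice_from_counts(tp, fp, fn):
    """Dice from aggregated counts; returns 0.0 for an empty denominator (assumed)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(per_image_counts, metric, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI with resampling at the image level."""
    rng = random.Random(seed)
    n = len(per_image_counts)
    stats = []
    for _ in range(n_boot):
        # Resample images with replacement, then aggregate counts before
        # computing the metric, as in the described procedure.
        sample = [per_image_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        stats.append(metric(tp, fp, fn))
    stats.sort()
    # Nearest-rank percentiles (a simplification of interpolated percentiles).
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical per-image counts; real use would pass one tuple per test image.
counts = [(8, 1, 1), (5, 2, 3), (9, 0, 1), (2, 2, 6)]
ci_low, ci_high = bootstrap_ci(counts, dice_from_counts, n_boot=500)
```

Aggregating counts before computing the metric, rather than averaging per-image scores, keeps images without any foreground from producing undefined ratios.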

Model training and evaluation were conducted using four NVIDIA Tesla V100-SXM2-32GB GPUs. Each input CT slice has a resolution of $512 \times 512$ with a single-channel grayscale representation in Hounsfield Units (HU). The total number of parameters in Daisy-Net is approximately 20.5 million.

During training, we used the Stochastic Gradient Descent (SGD) optimizer with a batch size of 6 and a total of 800 training epochs. These hyperparameters were selected empirically under the available hardware resources. In particular, the small batch size is mainly constrained by the $512 \times 512$ input resolution, the multi-branch PPSCA design, and the multi-stage supervision in OMF, which together increase memory consumption during forward and backward propagation. The relatively long training schedule was adopted to allow the optimization to converge stably. A larger batch size or fewer epochs could reduce training time, but such a hyperparameter sensitivity study is beyond the scope of the present work. As shown in Figure 5, both the IoU loss and the total loss converge stably in the later training stages, which suggests that the chosen settings are adequate for stable optimization.

For reproducibility, the main training hyperparameters used in the present study are summarized in Table 1.
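
Table 1 lists the loss as BCE + (1 − IoU) with multi-scale weighting. A minimal sketch of such a composite loss is given below for illustration. The soft IoU term follows Eqs. (13)–(15), while the clamping constant `eps`, the per-scale weights, and the function names are assumptions introduced here and are not taken from the paper.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def soft_iou_loss(probs, labels, eps=1e-7):
    """1 - soft IoU, with soft TP/FP/FN as in Eqs. (13)-(15)."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    return 1.0 - (tp + eps) / (tp + fp + fn + eps)


def multiscale_loss(outputs, targets, weights):
    """Weighted sum of (BCE + IoU loss) over supervised scales.

    `outputs` and `targets` hold one flat list per scale; `weights` are
    hypothetical per-scale coefficients.
    """
    return sum(
        w * (bce(p, y) + soft_iou_loss(p, y))
        for w, p, y in zip(weights, outputs, targets)
    )


labels = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]   # close to the labels
poor = [0.4, 0.3, 0.6, 0.7]   # mostly wrong

# Two "scales" are simulated by reusing the same toy prediction twice.
loss_good = multiscale_loss([good, good], [labels, labels], [1.0, 0.5])
loss_poor = multiscale_loss([poor, poor], [labels, labels], [1.0, 0.5])
```

Because the two terms respond differently to intensity scaling and to overlap, the total loss and the overlap metrics need not move together.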

It should be noted that this work focuses on slice-wise, pixel-level segmentation, and we do not aggregate slice-level predictions into patient-level or nodule-level outcomes.

3. Results

3.1. Comparison with Other Models

To comprehensively evaluate the effectiveness of Daisy-Net for pulmonary nodule detection, we conducted comparative experiments against recent state-of-the-art methods. Specifically, we selected HCF-Net as the baseline model and compared Daisy-Net with several well-known architectures on the LUNA16 dataset: U-Net [32], ACM (Asymmetric Contextual Modulation) [33], UIUNet (U-Net in U-Net) [34], and HintUNet and HintHCFNet [35]. A direct experimental comparison with recent 3D pulmonary-nodule methods was not performed because the present work is formulated as a 2D slice-wise segmentation framework, whereas many 3D methods adopt substantially different input organization, volumetric context usage, and computational settings. Therefore, such comparisons are informative at the discussion level, but they are not strictly one-to-one benchmarks under an identical protocol.

All models were trained under the same settings to ensure fairness. As shown in Table 2, Daisy-Net achieved the best overall post-binarization segmentation performance, with the highest IoU, Dice, Sensitivity, and Accuracy among the compared methods. The bootstrap confidence intervals further indicate that its advantage over the baseline HCF-Net and most other comparison models is stable across image-level resampling. The reported runtime is provided only as a reference for the training cost of each model under the same experimental setting; it is not used as a standalone criterion for model quality.
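
As a quick consistency check when reading Table 2, note that Eqs. (9) and (10) imply a fixed relation between Dice and IoU whenever both are computed from the same aggregated $TP$, $FP$, and $FN$. The derivation below is a sketch under that assumption and is not an additional result.

```latex
\mathrm{Dice}
  = \frac{2TP}{2TP + FP + FN}
  = \frac{2TP}{TP + (TP + FP + FN)}
  = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}},
\qquad
\mathrm{IoU} = 0.814134
  \;\Rightarrow\;
\mathrm{Dice} = \frac{1.628268}{1.814134} \approx 0.8975
```

This agrees with the Daisy-Net row of Table 2 (IoU 81.4134%, Dice 89.7545%).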

Compared to ACM and UIUNet, Daisy-Net also demonstrates superior post-binarization accuracy and robustness. The corresponding confidence intervals in Table 2 show that Daisy-Net maintains a clear advantage in overlap-based segmentation quality while preserving a favorable precision–recall balance under the same binary evaluation protocol.

U-Net shows the highest precision and specificity among all models, but its lower overlap and sensitivity performance suggests a stronger bias toward conservative foreground prediction, which is not ideal for small-target segmentation.

HintUNet and HintHCFNet demonstrate different advantages and limitations. HintUNet performs weakly in overlap and sensitivity, suggesting that its architecture remains inadequate for this segmentation task. HintHCFNet is the strongest competitor, but Daisy-Net still shows better overall segmentation quality under the post-binarization evaluation protocol.

Visual comparison results are presented in Figure 6, further illustrating the segmentation behavior of Daisy-Net on subtle pulmonary nodules.

3.2. Ablation Study of Daisy-Net Components

To further investigate the contributions of the PPSCA and OMF modules, we conducted ablation experiments. The ablation results were also supplemented with 95% CIs to provide a clearer statistical description of the contribution of each component.

As shown in Table 3, replacing the original patch module with PPSCA alone improves Dice and sensitivity, indicating that PPSCA primarily improves fine-grained feature extraction and local detail perception. When the OMF module is additionally introduced, the model achieves much larger gains in IoU, Dice, and sensitivity compared with the baseline, although its specificity (99.9974%) remains marginally below the baseline value (99.9991%). This trend suggests that OMF contributes more strongly to the final performance improvement, while PPSCA supplies complementary detail-enhancement capability; therefore, the best result is obtained when both modules are used together rather than when either module is considered in isolation. Because all ablation variants were trained under the same settings and these post-binarization results remain consistent across the three ablation variants, this improvement should not be interpreted as an artifact of unequal optimization. Instead, it suggests that cross-scale feature interaction is a key bottleneck in this task setting, and that OMF directly addresses that bottleneck more strongly than a local patch-only refinement strategy.

3.3. Impact of Normalization on Nodule Detection Accuracy

To examine whether normalization on CT slices affects fine-detail preservation, we conducted a comparative experiment. We trained the same model (Daisy-Net) on two versions of the dataset: one with normalization and the other without. Table 4 reports the post-binarization comparison results, while the training loss values remain the originally recorded optimization results. Consistent with the main comparison experiment, 95% CIs were also added to the post-binarization metrics in the normalization comparison.

Although normalization reduced the total training loss, this reduction was largely driven by the BCE term, which is sensitive to the overall intensity scaling and can dominate the optimization objective. Therefore, a lower total loss under normalization does not necessarily translate into better boundary delineation or overlap-based performance. From a practical performance perspective, the model trained without normalization achieved better results in terms of IoU, Dice, Sensitivity, and Specificity in the post-binarization comparison, indicating better retention of fine-grained features in the CT slices.

A plausible explanation is that certain normalization schemes may compress subtle HU variations into very small numeric ranges, weakening effective contrast and gradient signals for small-target boundaries (especially under limited floating-point precision), while per-sample min–max normalization may introduce inconsistent intensity mapping across samples and enlarge inter-sample distribution shifts.

Moreover, although lung-window clipping is widely used, it also removes out-of-window intensities a priori. Since we did not verify that extremely high/low HU values are completely irrelevant to the nodule region and its surrounding context, we chose to preserve the original HU information to avoid discarding potentially informative cues at the preprocessing stage.
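
The two effects discussed above, inconsistent mapping under per-sample min–max scaling and information loss under window clipping, can be seen in a small numerical sketch. The HU values are made up for illustration, and the window bounds of −1000 and 400 HU are an assumed example of a fixed lung-style window rather than a setting taken from this study.

```python
def minmax_per_sample(slice_hu):
    """Scale one slice to [0, 1] using its own min and max (assumes max > min)."""
    lo, hi = min(slice_hu), max(slice_hu)
    return [(v - lo) / (hi - lo) for v in slice_hu]


def fixed_window(slice_hu, low=-1000.0, high=400.0):
    """Clip to a fixed HU window, then scale to [0, 1] (window bounds assumed)."""
    return [(min(max(v, low), high) - low) / (high - low) for v in slice_hu]


# Two toy "slices" that share the tissue value -600 HU; slice_b also contains
# a very dense voxel (3000 HU).
slice_a = [-1000.0, -600.0, 40.0, 400.0]
slice_b = [-1000.0, -600.0, 40.0, 3000.0]

a_mm, b_mm = minmax_per_sample(slice_a), minmax_per_sample(slice_b)
a_fw, b_fw = fixed_window(slice_a), fixed_window(slice_b)

# Per-sample min-max: the same -600 HU maps to different values per slice.
same_tissue_minmax = (a_mm[1], b_mm[1])
# Fixed window: the mapping is consistent, but 3000 HU and 400 HU both
# saturate at the upper bound.
same_tissue_window = (a_fw[1], b_fw[1])
saturated = (a_fw[3], b_fw[3])
```

Under per-sample scaling a single dense voxel compresses the whole intensity range of its slice, whereas the fixed window keeps the mapping stable at the cost of discarding out-of-window contrast.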

Overall, our results suggest that, under the current experimental setting and implementation, skipping normalization yields better overlap-based segmentation performance, with the post-binarization IoU/Dice/Sensitivity/Specificity changing from 69.65/76.91/80.88/99.9938 to 81.41/89.75/84.78/99.9974. This observation should be interpreted as strategy-dependent rather than as a universal claim, and lung-window-based or fixed-range clipping-and-scaling normalization remains an important direction for future systematic evaluation.

3.4. Threshold Analysis

Threshold analysis is a critical component of model performance evaluation, as it offers insights into the model’s behavior under varying classification thresholds. In this study, we employ the Receiver Operating Characteristic (ROC) curve to analyze Daisy-Net’s pixel-level classification performance. Figure 7 illustrates the ROC curve of our model, which depicts the relationship between sensitivity and the complement of specificity ($1 - \mathrm{Specificity}$) across different threshold values.

An ROC curve that approaches the top-left corner of the plot indicates better classification performance. To quantify this performance, we calculate the Area Under the Curve (AUC). An AUC close to 1 indicates strong classification capability, whereas an AUC near 0.5 implies performance comparable to random guessing. As shown in Figure 7, Daisy-Net achieves an AUC of 0.99, indicating strong separability between nodule and non-nodule pixels under the present evaluation setting. However, because extremely high AUC values may also be associated with overfitting or with the strong class imbalance between small nodule regions and the large non-nodule background, this result should be interpreted together with the post-binarization metrics and bootstrap confidence intervals reported in Table 2, rather than as standalone evidence of generalization.
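
For readers who want to reproduce this kind of analysis, the following sketch shows one common way to build pixel-level ROC points and integrate them with the trapezoidal rule. The scores and labels are toy values, the function names are illustrative, and the sketch assumes that both classes are present in the input.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the threshold downward."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume all pixels that share the same score before emitting a point,
        # so that ties do not create an artificial staircase.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: three positive and three negative pixels.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9. With heavily imbalanced pixel classes, the false-positive rate is computed over a very large background, which is one reason a high AUC can coexist with a modest overlap score.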

These results further support the effectiveness of the proposed PPSCA and OMF modules under the present evaluation setting.

4. Discussion

4.1. Added Value of the Study

In summary, the main added value of this study is as follows:

We design two dedicated modules to address the challenge of inadequate feature extraction for small objects and interference from complex backgrounds: the Parallelized Patch and Spatial Context Aware (PPSCA) module and the Omni-domain Multistage Fusion (OMF) module. PPSCA improves the extraction of spatial and channel features of tiny nodules, while OMF ensures effective integration of multi-scale features to improve nodule–background discrimination.
Extensive experiments conducted on the publicly available LUNA16 dataset validate the performance of Daisy-Net. The proposed method achieves higher overlap-based scores than the selected comparison methods across multiple evaluation metrics, indicating its potential for computer-aided analysis.
We formulate pulmonary nodule detection as a 2D image segmentation task and propose Daisy-Net, a network that integrates dual attention mechanisms and multi-scale feature awareness for efficient and accurate detection of pulmonary nodules.

4.2. Comparison with State-of-the-Art

Daisy-Net was benchmarked against several small-object detection methods, including U-Net, ACM, UIUNet, HCF-Net, HintUNet, and HintHCFNet, on the lung-nodule detection task [29,32,33,34,35]. Results on the LUNA16 dataset show that Daisy-Net performs better than these selected comparison models, particularly on small nodules. While existing models often suffer from semantic loss in deep layers and limited boundary precision, Daisy-Net’s PPSCA and OMF modules help preserve local details and strengthen cross-scale contextual interaction. The detailed post-binarization comparison is reported in Table 2. At the same time, this comparison should be interpreted within the scope of 2D slice-wise segmentation. Recent 3D pulmonary-nodule methods often benefit from explicit volumetric context [22,24,27], but they also rely on substantially different training and inference protocols. Classification-oriented lung-nodule analysis has also benefited from combining multistage filtering with modified segmentation networks [36], further supporting the practical value of integrating preprocessing and segmentation-oriented design choices. Therefore, the present study positions Daisy-Net as a practically oriented 2D framework rather than as a direct replacement for all 3D solutions.

4.3. Innovations and Limitations

The experimental results indicate that Daisy-Net benefits from the complementary roles of PPSCA and OMF rather than from a single isolated component. In the ablation study, introducing PPSCA alone improves Dice and sensitivity, which suggests that multi-branch patch perception and spatial-context refinement improve the representation of fine nodule boundaries. When OMF is further incorporated, the complete model achieves the strongest overall post-binarization performance in the ablation setting, indicating that cross-scale feature supervision contributes strongly to the final segmentation quality. In addition, Table 4 shows that preserving raw CT intensities is beneficial in the present setting, where the model without normalization outperforms the normalized counterpart in overlap-based metrics.

At the same time, the present study still has several limitations. First, the method is intentionally formulated as a slice-wise 2D segmentation framework, so inter-slice volumetric continuity is not explicitly modeled. This design improves practicality and reduces computational burden, but it may limit performance on nodules whose appearance depends strongly on adjacent slices. Second, the current evaluation remains focused on slice-level segmentation on LUNA16 rather than on nodule-level or patient-level clinical endpoints. Third, although the post-binarization metrics are strong overall, certain difficult cases remain challenging, especially when nodules are attached to vessels or bronchi, occupy only a few pixels, or present ambiguous low-contrast boundaries. In such situations, background structures may mimic the target appearance, while weak local intensity transitions reduce the stability of the final binary prediction. Finally, the generalization capability of Daisy-Net has not yet been verified on external multi-center cohorts with different acquisition settings. These points define the main directions for subsequent work.

4.4. Clinical Relevance and Future Directions

In principle, Daisy-Net can be adapted to other CT datasets because it operates on 2D slices and does not rely on dataset-specific metadata. However, its performance may degrade under domain shift caused by differences in scanner vendors, reconstruction kernels, slice thickness, voxel spacing, and intensity distributions. Therefore, transfer to a new dataset may require adaptation of: (i) preprocessing (consistent resampling to a target in-plane resolution and slice thickness if needed, lung-region cropping/segmentation, and intensity handling such as windowing or dataset-specific scaling); (ii) label/interface alignment (converting 3D annotations to slice-wise masks, matching class definitions, and ensuring consistent mask encoding); and (iii) training strategy (fine-tuning with a small labeled subset of the target domain, threshold calibration for probability-to-mask binarization, and optionally domain adaptation/normalization variants). Cross-center/device external validation remains necessary to rigorously assess clinical robustness. At present, the experimental evidence is limited to LUNA16 only, which restricts the strength of any generalization claim. Additional experiments on independent datasets will therefore be an important part of future work.
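
The threshold-calibration step mentioned in item (iii) could, for example, be carried out by sweeping candidate thresholds on a small labeled subset of the target domain and keeping the one with the best Dice. The sketch below illustrates this idea only: the data are toy values, the candidate grid and the inclusive comparison are assumptions, and the function names are hypothetical.

```python
def dice_at_threshold(probs, labels, threshold):
    """Binary Dice after thresholding; returns 0.0 when nothing is predicted or labeled."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        b = 1 if p >= threshold else 0
        tp += b * y
        fp += b * (1 - y)
        fn += (1 - b) * y
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(probs, labels, candidates):
    """Pick the candidate threshold with the highest Dice on the given pixels."""
    return max(candidates, key=lambda t: dice_at_threshold(probs, labels, t))


# Toy validation pixels from a hypothetical target-domain subset.
val_probs = [0.95, 0.8, 0.6, 0.45, 0.35, 0.05]
val_labels = [1, 1, 1, 0, 0, 0]

best = calibrate_threshold(val_probs, val_labels, [0.1, 0.3, 0.5, 0.7])
```

In practice the selected threshold should be fixed on held-out validation data before it is applied to the test set, so that the reported metrics are not tuned on the data they describe.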

Daisy-Net may provide useful support for early lung cancer screening. By improving the sensitivity and accuracy of pulmonary nodule detection, it can function as a computer-aided diagnosis (CAD) tool that helps reduce diagnostic errors and radiologist workload. In future work, the study aims to:

Integrate 3D convolutional architectures to better exploit volumetric information and spatial continuity.
Evaluate the model on multi-center datasets to ensure robustness and generalization across patient populations.
Extend the framework to other small-lesion detection tasks such as liver metastases, cerebral microbleeds, and retinal abnormalities.

Overall, Daisy-Net provides a useful 2D segmentation framework for pulmonary nodule analysis under the present experimental setting.

5. Conclusions

Experiments on the public LUNA16 dataset show that Daisy-Net achieves favorable post-binarization comparative performance, as reported in Table 2. The ablation study further indicates that PPSCA provides stable fine-detail enhancement, whereas OMF contributes the larger gain in cross-scale feature integration, and the combined model yields the best overall segmentation quality. These findings indicate that Daisy-Net is a practical candidate for computer-aided diagnosis scenarios in which small nodules are difficult to localize reliably on individual CT slices. Future work will focus on extending the framework to 3D or hybrid volumetric settings, exploring more deployment-oriented optimization, and validating the method on broader clinical datasets and imaging conditions. Overall, Daisy-Net provides an interpretable segmentation framework for pulmonary nodule analysis and offers a useful basis for further investigation of small-target detection in CT imaging.

Author Contributions

Conceptualization, Q.W.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z. and Y.Y.; formal analysis, Z.Z.; investigation, Z.Z.; data curation, H.G.; writing—original draft preparation, Z.Z.; writing—review and editing, Y.Z., X.Z., Y.Y., H.G., G.S. and Q.W.; visualization, Y.Y.; supervision, Y.Z., X.Z., G.S. and Q.W.; project administration, Q.W.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Innovation Promotion Association CAS, grant number Y2023058, and the Natural Science Foundation of Liaoning Province, grant number 2022-MS-078.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

ACM	Asymmetric Contextual Modulation
ANM	Adaptive Nodule Modeling
AUC	Area Under the Curve
BCE	Binary Cross-Entropy
CAD	Computer-Aided Diagnosis
CNN	Convolutional Neural Network
CRC	Channel Re-weighting Convolution
CSP	Cross Stage Partial
CT	Computed Tomography
FFN	Feed-Forward Network
FN	False Negative
FP	False Positive
GAP	Global Average Pooling
GMP	Global Max Pooling
HU	Hounsfield Unit
IoU	Intersection over Union
LIDC/IDRI	Lung Image Database Consortium and Image Database Resource Initiative
LUNA16	LUng Nodule Analysis 2016
MIP	Maximum Intensity Projection
OMF	Omni-domain Multistage Fusion
PPA	Patch Perception Attention
PPSCA	Parallelized Patch and Spatial Context Aware
ROC	Receiver Operating Characteristic
SCAM	Spatial Context Aware Module
Sens	Sensitivity
SGD	Stochastic Gradient Descent
Spec	Specificity
SVM	Support Vector Machine
TN	True Negative
TP	True Positive
U-Net	U-shaped Convolutional Network

References

Thanoon, M.A.; Zulkifley, M.A.; Mohd Zainuri, M.A.A.; Abdani, S.R. A Review of Deep Learning Techniques for Lung Cancer Screening and Diagnosis Based on CT Images. Diagnostics 2023, 13, 2617.
Rampinelli, C.; Calloni, S.F.; Minotti, M.; Bellomi, M. Spectrum of early lung cancer presentation in low-dose screening CT: A pictorial review. Insights Imaging 2016, 7, 449–459.
Lenzen, H.; Roos, N.; Heindel, W.; Semik, M.; Diederich, S.; Thomas, M.; Weber, A.; Wormanns, D. Screening for early lung cancer with low-dose spiral computed tomography: Results of annual follow-up examinations in asymptomatic smokers. Eur. Radiol. 2004, 14, 691–702.
Jett, J.R. Limitations of Screening for Lung Cancer with Low-Dose Spiral Computed Tomography. Clin. Cancer Res. 2005, 11, 4988s–4992s.
Ye, X.; Lin, X.; Dehmeshki, J.; Slabaugh, G.; Beddoe, G. Shape-Based Computer-Aided Detection of Lung Nodules in Thoracic CT Images. IEEE Trans. Biomed. Eng. 2009, 56, 1810–1820.
Li, B.; Zhang, J.; Tian, L.; Tan, L.; Xiang, S.; Ou, S. Intelligent Recognition of Lung Nodule Combining Rule-based and C-SVM Classifiers. Int. J. Comput. Intell. Syst. 2011, 4, 960.
Soleymanpour, E.; Pourreza, H.R.; ansaripour, E.; Yazdi, M.S. Fully Automatic Lung Segmentation and Rib Suppression Methods to Improve Nodule Detection in Chest Radiographs. J. Med. Signals Sens. 2011, 1, 191–199.
Choi, W.J.; Choi, T.S. Automated pulmonary nodule detection based on three-dimensional shape-based feature descriptor. Comput. Methods Programs Biomed. 2014, 113, 37–54.
Halder, A.; Dey, D.; Sadhu, A.K. Lung Nodule Detection from Feature Engineering to Deep Learning in Thoracic CT Images: A Comprehensive Review. J. Digit. Imaging 2020, 33, 655–677.
Shah, A.A.; Malik, H.A.M.; Muhammad, A.; Alourani, A.; Butt, Z.A. Deep learning ensemble 2D CNN approach towards the detection of lung cancer. Sci. Rep. 2023, 13, 2987.
Xu, J.; Ren, H.; Cai, S.; Zhang, X. An improved faster R-CNN algorithm for assisted detection of lung nodules. Comput. Biol. Med. 2023, 153, 106470.
Su, Y.; Li, D.; Chen, X. Lung Nodule Detection based on Faster R-CNN Framework. Comput. Methods Programs Biomed. 2021, 200, 105866.
Zheng, S.; Guo, J.; Cui, X.; Veldhuis, R.N.J.; Oudkerk, M.; van Ooijen, P.M.A. Automatic Pulmonary Nodule Detection in CT Scans Using Convolutional Neural Networks Based on Maximum Intensity Projection. arXiv 2019, arXiv:1904.05956.
Zheng, S.; Cornelissen, L.J.; Cui, X.; Jing, X.; Veldhuis, R.N.J.; Oudkerk, M.; van Ooijen, P.M.A. Deep convolutional neural networks for multi-planar lung nodule detection: Improvement in small nodule identification. arXiv 2020, arXiv:2001.04537.
Fu, Y.; Xue, P.; Xiao, T.; Zhang, Z.; Zhang, Y.; Dong, E. Semi-Supervised Adversarial Learning for Improving the Diagnosis of Pulmonary Nodules. IEEE J. Biomed. Health Inform. 2022, 27, 109–120.
Cai, J.; Wang, L.; Cai, J.; Deng, Z.; Yang, Z.; Feng, H. Contactless Intelligent Anti-interference Lung Nodule Detection Method for Early Disease Detection. IEEE J. Biomed. Health Inform. 2025, 30, 2939–2950.
UrRehman, Z.; Qiang, Y.; Wang, L.; Shi, Y.; Yang, Q.; Khattak, S.U.; Aftab, R.; Zhao, J. Effective lung nodule detection using deep CNN with dual attention mechanisms. Sci. Rep. 2024, 14, 3934.
Nasrullah, N.; Sang, J.; Alam, M.S.; Mateen, M.; Cai, B.; Hu, H. Automated Lung Nodule Detection and Classification Using Deep Learning Combined with Multiple Strategies. Sensors 2019, 19, 3722.
Wang, H.; Zhu, H.; Ding, L.; Yang, K. Attention pyramid pooling network for artificial diagnosis on pulmonary nodules. PLoS ONE 2024, 19, e0302641.
Liu, W.; Sun, J.; Li, H.; Wang, Y.; Wang, Z. CSEA-Net: A channel–spatial enhanced attention network for lung tumor segmentation on CT images. iScience 2025, 28, 111974.
Kamnitsas, K.; Ledig, C.; Newcombe, V.F.; Simpson, J.P.; Kane, A.D.; Menon, D.K.; Rueckert, D.; Glocker, B. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 2017, 36, 61–78.
Song, W.; Tang, F.; Marshall, H.; Fong, K.M.; Liu, F. A multiscale 3D network for lung nodule detection using flexible nodule modeling. Med. Phys. 2024, 51, 7356–7368.
Hamidian, S.; Sahiner, B.; Petrick, N.; Pezeshk, A. 3D convolutional neural network for automatic detection of lung nodules in chest CT. In Proceedings of the SPIE Medical Imaging, Orlando, FL, USA, 3 March 2017; p. 1013409.
Luo, X.; Song, T.; Wang, G.; Chen, J.; Chen, Y.; Li, K.; Metaxas, D.N.; Zhang, S. SCPM-Net: An Anchor-free 3D Lung Nodule Detection Network using Sphere Representation and Center Points Matching. arXiv 2021, arXiv:2104.05215.
Marinakis, I.D.; Karampidis, K.; Papadourakis, G.; Kara, M. Dynamic Patch-Based Sample Generation for Pulmonary Nodule Segmentation in Low-Dose CT Scans Using 3D Residual Networks for Lung Cancer Screening. Appl. Biosci. 2025, 4, 14.
Yang, S.; Lim, S.H.; Hong, J.H.; Park, J.S.; Kim, J.; Kim, H.W. Deep learning-based lung cancer risk assessment using chest computed tomography images without pulmonary nodules ≥ 8 mm. Transl. Lung Cancer Res. 2025, 14, 150–162.
Ozdemir, O.; Russell, R.L.; Berlin, A.A. A 3D Probabilistic Deep Learning System for Detection and Diagnosis of Lung Cancer Using Low-Dose CT Scans. IEEE Trans. Med. Imaging 2020, 39, 1419–1429.
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215.
Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
van Ginneken, B.; Jacobs, C. LUNA16 Part 1/2. Zenodo 2019.
van Ginneken, B.; Jacobs, C. LUNA16 Part 2/2. Zenodo 2019.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241.
Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. arXiv 2020, arXiv:2009.14530.
Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376.
Quan, W.; Zhao, W.; Wang, W.; Xie, H.; Lee Wang, F.; Wei, M. Lost in UNet: Improving Infrared Small Target Detection by Underappreciated Local Features. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5000115.
Gunawan, R.; Tran, Y.; Zheng, J.; Nguyen, H.; Carrigan, A.; Mills, M.K.; Chai, R. Combining Multistaged Filters and Modified Segmentation Network for Improving Lung Nodules Classification. IEEE J. Biomed. Health Inform. 2024, 28, 5519–5527.

Figure 1. Overview of the proposed preprocessing pipeline, including coordinate transformation, nodule-centered slice selection, and binary mask generation.

Figure 2. Daisy Network Architecture.

Figure 3. Structure of the OMF Module.

Figure 4. Structure of the PPSCA Module.

Figure 5. IoU loss and total loss on training and validation sets throughout the training process.

Figure 6. Comparison of experimental results between Daisy-Net and other state-of-the-art methods. The regions highlighted by green boxes are shown as zoomed-in local views, indicated by yellow boxes in the upper-left corner of each image.

Figure 7. Pixel-level ROC curve of Daisy-Net.

Table 1. Main Training Hyperparameters of Daisy-Net.

Item	Setting
Input resolution	$512 \times 512$
Optimizer	SGD
Learning rate	$4 \times 10^{- 3}$
Momentum	0.9
Weight decay	$1 \times 10^{- 4}$
Batch size	6
Training epochs	800
Dropout	0.1
Binarization threshold	0.3
Loss	BCE + $(1 - IoU)$ with multi-scale weighting
GPU configuration	$4 \times$ NVIDIA Tesla V100-SXM2-32GB
Parameter count	≈20.5 M

Table 2. Post-binarization performance comparison with other models. Values are reported as point estimates with bootstrap 95% CIs in parentheses.

	IoU (%)	Dice (%)	Precision (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	Total Loss	Runtime (h)
Baseline (HCF-Net)	61.1309 (56.7180–65.4509)	75.8773 (72.3823–79.1182)	95.0295 (92.3706–97.1994)	63.1501 (58.6862–67.4493)	99.9979 (99.9968–99.9989)	99.9750 (99.9718–99.9781)	0.9846	7.77
U-Net	52.2023 (47.3823–56.8784)	68.5959 (64.2985–72.5127)	97.3866 (94.6778–99.2788)	52.9440 (48.1230–57.6247)	99.9991 (99.9982–99.9998)	99.9699 (99.9663–99.9732)	0.6252	5.95
ACM	56.3188 (52.1214–60.3663)	72.0563 (68.5261–75.2855)	96.0852 (93.6192–97.7764)	57.6414 (53.4276–61.6748)	99.9985 (99.9976–99.9992)	99.9722 (99.9692–99.9751)	0.6583	1.97
UIUNet	62.8487 (59.2687–66.1993)	77.1866 (74.4261–79.6625)	85.1420 (82.2017–87.7702)	70.5908 (67.1808–73.7182)	99.9923 (99.9907–99.9938)	99.9741 (99.9712–99.9766)	0.8248	8.12
HintUNet	43.1266 (38.9726–47.1955)	60.2636 (56.0867–64.1263)	82.6198 (79.3638–85.4477)	47.4296 (43.0273–51.7941)	99.9938 (99.9926–99.9948)	99.9611 (99.9574–99.9647)	0.7251	10.21
HintHCFNet	78.5465 (75.0126–81.6417)	87.9844 (85.7225–89.8931)	95.4145 (93.4688–96.8643)	81.6278 (78.3335–84.4658)	99.9976 (99.9965–99.9983)	99.9861 (99.9838–99.9882)	0.2965	15.98
Daisy-Net (Ours)	81.4134 (78.3253–84.2531)	89.7545 (87.8454–91.4537)	95.3436 (93.4224–96.7934)	84.7845 (81.9568–87.3162)	99.9974 (99.9963–99.9982)	99.9880 (99.9859–99.9898)	0.5001	11.30
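
The post-binarization metrics in Table 2 follow from pixel-level confusion counts. The sketch below shows one way to compute them, assuming predictions are binarized at the 0.3 threshold listed in Table 1 and that counts are pooled over pixels. Whether the comparison is `>=` or `>`, and whether the authors pool or average per image, are assumptions; the function names are hypothetical.

```python
def binarize(probs, threshold=0.3):
    """Turn probabilities into 0/1 labels (the >= comparison is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def segmentation_metrics(pred, truth):
    """Pixel-level metrics from binary prediction and ground-truth lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return {
        "iou": _ratio(tp, tp + fp + fn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }
```

When IoU and Dice are computed from the same counts they satisfy Dice = 2·IoU / (1 + IoU). Applying this to the reported Daisy-Net IoU of 81.4134% gives a Dice of about 89.75%, in agreement with the Dice value in Table 2.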

Table 3. Ablation Study on LUNA16. Values are reported as point estimates with 95% CIs in parentheses.

HCF-Net	PPSCA	OMF	IoU (%)	Dice (%)	Sensitivity (%)	Specificity (%)
✓			55.08 (52.99–57.00)	60.49 (59.21–61.64)	47.99 (46.39–49.43)	99.9991 (99.9980–99.9999)
✓	✓		55.51 (53.41–57.45)	65.80 (64.40–67.05)	51.17 (49.47–52.70)	99.9818 (99.9807–99.9826)
✓	✓	✓	81.41 (78.33–84.25)	89.75 (87.85–91.45)	84.78 (81.96–87.32)	99.9974 (99.9963–99.9982)

Note: ✓ indicates that the corresponding module is included in the model variant. Bold values indicate the best result in each metric column.

Table 4. Comparison experiments with and without normalization operations. Values are reported as point estimates with 95% CIs in parentheses.

Setting	IoU (%)	Dice (%)	Sensitivity (%)	Specificity (%)	Total Loss
Without data normalization	81.41 (78.33–84.25)	89.75 (87.85–91.45)	84.78 (81.96–87.32)	99.9974 (99.9963–99.9982)	0.5001
With data normalization	69.65 (67.01–72.08)	76.91 (75.28–78.37)	80.88 (78.19–83.30)	99.9938 (99.9927–99.9946)	0.3647

Note: Bold values indicate the best result in each column; for Total Loss, a lower value is better.
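
Tables 2 to 4 report point estimates with 95% confidence intervals, and Table 2 identifies them as bootstrap intervals. A minimal sketch of a percentile bootstrap is given below. It assumes that the resampling unit is a per-image score and that the statistic is the mean; the number of resamples, the seed, and the function name are hypothetical, since the tables do not state the resampling protocol.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image scores.

    values      : one score per test image (assumed resampling unit)
    n_resamples : number of bootstrap resamples (assumed, not from the paper)
    alpha       : 1 - confidence level; 0.05 gives a 95% interval
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n scores with replacement and record their mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, lower, upper
```

For a pooled metric such as pixel-level IoU, the same idea would apply with images resampled and the metric recomputed from the pooled counts of each resample, rather than averaging per-image scores.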
