Multi-Granularity Dilated Transformer for Lung Nodule Classification via Local Focus Scheme

Wu, Kunlun; Peng, Bo; Zhai, Donghai

doi:10.3390/app13010377

by Kunlun Wu, Bo Peng and Donghai Zhai *

School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(1), 377; https://doi.org/10.3390/app13010377

Submission received: 6 December 2022 / Revised: 22 December 2022 / Accepted: 24 December 2022 / Published: 28 December 2022

Abstract

Intelligent lung nodule classification is a meaningful and challenging research topic for the early prevention of lung cancer; it aims to diagnose the malignancy of candidate nodules from pulmonary computed tomography images. Deep learning methods have made significant achievements in the medical field and have advanced lung nodule classification. Nevertheless, mainstream CNN-based networks typically excel at learning coarse-grained local feature representations via stacked local-aware and weight-shared convolutions, and cannot practically model long-range context interactions and spatial dependencies. To tackle these difficulties, we propose a Multi-Granularity Dilated Transformer that learns long-range context relations and explores fine-grained local details via the proposed Local Focus Scheme. Specifically, we design a Deformable Dilated Transformer that incorporates diverse contextual information with self-attention to learn long-range global spatial dependencies. Moreover, numerous investigations indicate that local details are crucial for classifying indistinguishable lung nodules. Thus, we propose the Local Focus Scheme to focus on the more discriminative local features by modeling channel-wise grouped topology. The Multi-Granularity Dilated Transformer is then constructed by leveraging the Local Focus Scheme to guide the Deformable Dilated Transformer in learning fine-grained local cues. Experimental results on the mainstream benchmark LIDC-IDRI demonstrate the superiority of our model over state-of-the-art methods.

Keywords: lung nodule classification; dilated transformer; multi-granularity; local focus scheme

1. Introduction

Lung cancer has become one of the most lethal cancers, and improving this alarming situation is urgent. Doctors can provide an early diagnosis based on pulmonary computed tomography (CT) images (e.g., cancerous nodules frequently have rougher boundaries than non-cancerous ones), which can significantly reduce the mortality of lung cancer. However, manual analysis is excessively time-consuming for the radiologist, since each CT scan is composed of multiple slices. Additionally, benign and malignant nodules may look similar in morphology, and they are hard to distinguish with the naked eye. The rapid development of deep learning drives us to focus on automatic lung nodule classification to judge cancer heterogeneity efficiently. Recently, advanced lung diagnosis systems have been widely used in hospitals to assist doctors in making judgments.

Conventionally, researchers have studied hand-crafted features [1,2,3,4] and proposed pre-designed structures to predict whether a candidate nodule is benign or malignant. Although existing deep learning-based methods have been shown to notably outperform the previous hand-crafted architectures, the performance of deep learning-based lung nodule classification is still not comparable to that of universal vision tasks. The critical reason is that the wide variation of nodule sizes (3–30 mm) and the heterogeneity of the nodules hinder classification performance. The previous work [5] displayed the diameter distribution of all the nodules in the LIDC-IDRI dataset: the malignant nodules typically have diameters exceeding 12 mm, and the benign nodules generally have diameters of less than 5 mm. Early approaches seek to capture salient information from the whole region in the CT image and obtain a global feature representation of the lung nodules. However, partial representation errors may lead to a breakdown in exploring holistic structures and notably limit the performance of global features. Furthermore, the global representation generally ignores useful local detail, which also results in relatively poor performance in lung nodule classification. Recent works therefore commonly treat this problem with multi-scale formulations from computer vision, using patches of multiple sizes [6] to learn scale-invariant representations. However, multi-stage downsampling may still discard many contextual features. As the above analysis emphasizes, part-based representations are crucial for distinguishing the category of lung nodules, and they can guide the network to learn fine-grained local features.

Unlike previous CNN-based methods, which cannot practically model long-range context interactions and global spatial dependencies, we propose a multi-granularity dilated transformer (MGDFormer) that learns robust long-range global representations of various nodules and simultaneously explores discriminative part-based representations to learn fine-grained local features. Specifically, we aim to leverage Multi-Head Self-Attention in the Transformer to capture global dependencies, but the regular structure calculates the relationship between all pixels, which easily leads to overfitting. We argue that calculating pixel-wise attention over the objective region can obtain discriminative global features, and we adopt dilated convolutional layers as the feedforward layer to enlarge receptive fields. As shown in Figure 1, generalized global-local features can be utilized to classify much larger (>12 mm) and smaller (<5 mm) nodules, but they are not effective for distinguishing hard samples (i.e., 5–12 mm lung nodules). To better focus on the most discriminative region for judging these middle-sized nodules, we propose a Local Focus Scheme (LFS) to exploit fine-grained local information. Inspired by channel attention [7], which models the channel-wise topology of each sample, different channels of features can be regarded as multiple neurons corresponding to objects; understanding the relations between channels is thus a vital way to focus on local discriminative features. We formulate a simple yet effective scheme to implement this intention, i.e., calculating a regularized channel-wise grouped attention and using the proposed KeyMask to force the network to explore the most discriminative local information.

Our key contributions are three-fold:

We propose a novel Multi-Granularity Dilated Transformer (MGDFormer) to learn pixel-wise global attention over the objective region, with the aim of constructing a more robust long-range global representation;
To better focus on local discriminative features for classifying hard samples (i.e., middle-sized nodules), we design a Local Focus Scheme (LFS) that forces the network to learn fine-grained local information by modeling channel-wise topology;
Experimental results on a mainstream dataset demonstrate that our model performs competitively against state-of-the-art approaches.

2. Related Work

2.1. Lung Nodule Classification

Automated lung nodule classification algorithms have been extensively studied given the increasing threat of lung cancer. Similar to other computer vision tasks, lung nodule classification methods can be categorized into two groups: hand-crafted and deep learning-based methods. The former traditionally use a SIFT descriptor [1] and a Support Vector Machine [8] to predict the property of lung nodules. With the development of hardware technology, deep learning-based methods have become predominant, and Convolutional Neural Networks (CNNs) are widely used classifiers. Nibali et al. [9] adopted three Residual CNN branches to learn multiple views of nodules and then merged the outputs of the different views with multi-layer perceptron layers. Hussein et al. [10] proposed a 3D CNN to handle lung nodule risk stratification with volumetric information. Later, 3D dual path networks were developed [11] with auxiliary features (gradient boosting machine) to improve classification performance. Jiang et al. [12] subsequently proposed spatial and contextual attention modules to boost the representation quality. Moreover, researchers introduced principles of evolutionary computation into CNNs [13], segmenting the nodules first and feeding the transformed features to CNNs for classification. In the training phase, the hyperparameters of CNNs (e.g., kernel size, stride, and padding) commonly need to be designed by hand; Jiang et al. [14] therefore designed a Neural Architecture Search (NAS) scheme to select the optimal parameters of networks. ProCAN [15] enhanced the non-local network with curriculum learning to progressively improve the classification performance of lung nodules. The above CNN-based methods can learn a useful inductive bias but lack full global understanding, so transformer-based methods have been developed to capture global dependencies. DETR [16] used the regular self-attention scheme to learn pixel-wise relations, and Deformable DETR [17] was designed with deformable self-attention to focus on the global attention of the objective region. Inspired by Deformable DETR [17], we propose the Deformable Dilated Transformer to capture a more discriminative global representation by calculating the self-attention of the objective regions.

2.2. Attention Mechanism

Self-attention structures have become mainstream in natural language processing, computer vision, and other fields. In particular, the basic principle and variants of self-attention have played an important role in image recognition [18,19,20,21] and image captioning [22,23]. Attention is a paradigm comparable to convolution and can be divided into spatial attention, channel attention, and hybrid attention. Spatial attention [16,24,25,26,27] guides models to adaptively explore the most relevant regions via an explicit procedure. STN [25] is the earliest method to explicitly mine important regions and transformation-invariant features. Analogous to STN, Refs. [16,24] propose a tailored self-attention mechanism for image classification. Channel attention [28,29,30] models the topology of different channels and adaptively captures discriminative spatial relations via channel-wise dependencies. Hybrid attention [18,23,31,32] combines the characteristics of channel attention and spatial attention to enhance channel-wise and pixel-wise attention learning jointly. CBAM [33] is a typical prototype, which uses MaxPooling and AvgPooling to realize channel and spatial attention, but the inherent structure of pooling can only mine coarse-grained spatial relations. Inspired by channel-wise attention [28], the proposed Local Focus Scheme aims to capture more fine-grained local features by modeling channel-wise relations, which can improve the classification performance on hard samples.

3. Methods

In this section, we describe our new Multi-Granularity Dilated Transformer and Local Focus Scheme in detail. In Section 3.1, we first formulate and explain the Local Focus Scheme (LFS). In Section 3.2 and Section 3.3, we then explain how to combine the fine-grained local representation extracted by LFS with the long-range global features extracted by the Deformable Dilated Transformer (DDFormer) to construct a novel Multi-Granularity Dilated Transformer.

3.1. Local Focus Scheme

As previously emphasized in Section 1, part-based local information is extremely crucial for distinguishing hard samples. Therefore, we propose a Local Focus Scheme (Figure 2) to mine the most discriminative local information, which consists of two parts: Regularized Channel-wise Grouped Attention and KeyMask. We will sequentially detail these components as follows.

3.1.1. Regularized Channel-Wise Grouped Attention

Generally, CNNs transform high-level semantic information into representative feature maps. Each channel of the feature maps responds to different regions of the CT image, and channels with higher correlations focus on more salient features of lung nodules. Therefore, understanding channel-wise topology is essential to learn fine-grained spatial relations. To fully mine discriminative channel-wise dependencies in the feature maps, as shown in Figure 2, we propose a Regularized Channel-wise Grouped Attention to obtain the fine-grained local feature representation. Specifically, we first split the feature map into G groups along the channel dimension to prepare for calculating grouped channel-wise attention, since each channel group corresponds to a part of the pulmonary computed tomography (CT) image. However, due to similar patterns and inherent noise in medical images, it is difficult to learn a precise channel-wise feature distribution. To learn better-distributed and more discriminative channel-wise relations, we leverage information from each channel group to enhance feature modeling in key regions and obtain the group-wise channel attention by calculating the correlation between global statistical information and the high-level semantic features $X_{in} \in \mathbb{R}^{C \times H \times W}$ extracted by the backbone. That is, we divide $X_{in}$ into $\{x_{hw}^{1}, x_{hw}^{2}, \dots, x_{hw}^{G}\}$ along the channel dimension, and adaptive global average and max pooling are performed on each group to obtain the statistical features over the whole CT image region $R_{hw}$, which can be formulated as:

$$X^{m} = \sum_{g=1}^{G} \left( \max_{(h,w) \in R_{hw}} x_{hw}^{g} + \frac{1}{|R_{hw}|} \sum_{(h,w) \in R_{hw}} x_{hw}^{g} \right) \tag{1}$$

In Equation (1), $x_{hw}^{g}$ is the pixel-wise feature of the g-th channel group; C denotes the channel dimension, and H and W are the height and width of the image, respectively. As shown in Equation (2), we then leverage the inner product M to calculate the correlation between the high-level semantic features $X_{in}$ and the statistical features $X^{m}$. Ideally, features that are closer to $X^{m}$ receive a larger initial weight, which prevents the features of the whole space from being dominated by unrelated information:

$$X^{r} = M(X^{m}, X_{in}) \tag{2}$$

Next, we use group normalization [34] to reduce the internal covariate shift between the grouped channels and accelerate convergence. The corresponding mean and standard deviation are given in Equation (3):

$$\begin{aligned} \mu &= \frac{1}{(C/G)HW} \sum_{c=kC/G}^{(k+1)C/G} \sum_{i=1}^{H} \sum_{j=1}^{W} x^{cij} \\ \sigma &= \sqrt{\frac{1}{(C/G)HW} \sum_{c=kC/G}^{(k+1)C/G} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(x^{cij} - \mu\right)^{2}} \end{aligned} \tag{3}$$

where G is the number of groups, $k \in \{0, 1, \dots, G-1\}$ is the group index, and $x^{cij}$ is the pixel-wise feature of each grouped channel. The normalized feature representation is:

$$X^{r'} = \frac{X^{r} - \mu}{\sigma + \epsilon} \tag{4}$$

In Equation (4), $\epsilon$ is the hyperparameter for numerical stability. Finally, we combine the regularized channel-wise attention $X^{r'}$ and the high-level semantic features $X_{in}$ to focus on more discriminative local relations:

$$X' = \delta(W_{1} X^{r'}) * X_{in} \tag{5}$$

In Equation (5), $\delta$ is the sigmoid function, $*$ is element-wise multiplication, and $W_{1}$ is the weight matrix of a fully-connected layer.
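
As a minimal sketch of the normalization step in Equations (3) and (4), the following pure-Python function normalizes a feature map stored as a nested list of shape [C][H][W], one channel group at a time. The function name `group_normalize`, the nested-list layout, and the default `eps=1e-5` are illustrative assumptions and are not taken from the paper; the division by `sigma + eps` follows Equation (4) as written.

```python
import math


def group_normalize(x, groups, eps=1e-5):
    """Normalize a [C][H][W] nested list group by group (Equations (3)-(4)).

    Each group of C/G consecutive channels shares one mean and one
    standard deviation. `eps` is an assumed value, not one from the paper.
    """
    channels = len(x)
    if channels % groups != 0:
        raise ValueError("C must be divisible by G")
    size = channels // groups
    out = [None] * channels
    for k in range(groups):
        idx = range(k * size, (k + 1) * size)
        # Gather every value belonging to group k (all channels, all pixels).
        values = [v for c in idx for row in x[c] for v in row]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        for c in idx:
            out[c] = [[(v - mu) / (sigma + eps) for v in row] for row in x[c]]
    return out


# Four channels of size 1 x 2, split into two groups of two channels each.
x = [[[1.0, 2.0]], [[3.0, 4.0]], [[10.0, 10.0]], [[20.0, 20.0]]]
normalized = group_normalize(x, groups=2)
```

After normalization each group has approximately zero mean, so groups with very different activation scales (here roughly 1 to 4 versus 10 to 20) become comparable before the sigmoid gating of Equation (5).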

3.1.2. KeyMask

The primary objective of KeyMask is to further suppress noise and thereby mitigate overfitting. Specifically, we use KeyMask to randomly discard a key channel whose importance ranks in the top 1%. We first sort the channels by importance along the channel dimension and then randomly select one of the top-ranked channels as the candidate to discard. We argue that discarding an arbitrary channel without such guidance has a weaker effect and cannot force the model to mine more discriminative local information. At this point, we have obtained a representation focusing on the local features, which is useful for classifying hard samples.
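
The KeyMask step can be sketched as follows. This is only an illustration under stated assumptions: the per-channel importance scores are passed in by the caller (the text does not specify how they are computed), "discarding" a channel is modeled as zeroing it, and the top 1% set is rounded up so that it always contains at least one channel. The name `key_mask` and its parameters are hypothetical.

```python
import math
import random


def key_mask(features, importance, top_fraction=0.01, rng=random):
    """Zero out one channel drawn at random from the top-ranked channels.

    features:   list of C channels, each a flat list of floats.
    importance: list of C scores (assumed to be given, e.g. attention weights).
    Returns the masked copy and the index of the dropped channel.
    """
    channels = len(features)
    # Size of the top-ranked set, rounded up to at least one channel.
    top_k = max(1, math.ceil(channels * top_fraction))
    ranked = sorted(range(channels), key=lambda c: importance[c], reverse=True)
    dropped = rng.choice(ranked[:top_k])
    masked = [channel[:] for channel in features]
    masked[dropped] = [0.0] * len(features[dropped])
    return masked, dropped


# 50 channels with increasing importance: the top 1% is the single channel 49.
features = [[1.0, 1.0] for _ in range(50)]
masked, dropped = key_mask(features, importance=list(range(50)),
                           rng=random.Random(0))
```

Restricting the random choice to the top-ranked set is what distinguishes this from dropping an arbitrary channel: the network loses one of its currently most informative channels and has to rely on the remaining ones.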

3.2. Deformable Dilated Transformer

Inspired by [17,35], we combine their merits to construct a Deformable Dilated Transformer as the backbone. The key characteristic of the regular transformer is that it can explore all locations in an image for long-range spatial modeling. To mitigate the resulting problems of spatial resolution and slow convergence, as shown in Figure 3, we leverage the deformable attention of [17] to learn a small set of vital sampling points around a reference point, regardless of the resolution of the feature maps. Given the original input $x \in \mathbb{R}^{C \times H \times W}$, the deformable attention is calculated as Equation (6):

$$\mathrm{DefAtt}(z, p_{q}, x) = \sum_{n=1}^{N} W_{n} \left[ \sum_{k=1}^{K} A_{nqk} \cdot x(p_{q} + \Delta P_{nqk}) W_{n}^{'} \right] \tag{6}$$

where $W_{n}$ and $W_{n}^{'}$ are learnable weight matrices, and q is the index of a query element with content feature $z$ and a 2D reference point $p_{q}$. In addition, n and k index the attention head and the sampled keys, respectively. K is the total number of sampled keys, which is much smaller than $H \times W$. $\Delta P_{nqk}$ denotes the sampling offset, and $A_{nqk}$ is the attention weight of the $k$-th sampling point in the $n$-th attention head. $\Delta P_{nqk}$ and $A_{nqk}$ are obtained by a linear transformation of the content feature $z$. $\Delta P_{nqk}$ are 2-d real numbers with unconstrained range, and the attention weights satisfy $\sum_{k=1}^{K} A_{nqk} = 1$. We implement the above principle as in [17], with the aim of reducing the computational cost and the risk of overfitting.

Moreover, we replace the feedforward layer (i.e., regular 2D convolution) with two dilated convolutional layers to further enlarge spatial receptive fields. Let m be the number of stacked $k \times k$ dilated convolutions, where k is the filter size. We denote the dilation rate as r; the kernel size after dilation, $k^{'}$, is then calculated as Equation (7):

$$k^{'} = (k - 1) * r + 1 \tag{7}$$

Let $R_{m}$ be the effective receptive field of layer m, which is defined as Equation (8):

$$R_{m} = R_{m-1} + (k^{'} - 1) \times \prod_{i=1}^{m} s_{i} \tag{8}$$

where $R_{m-1}$ is the receptive field of layer $m - 1$, and $s_{i}$ is the stride of layer i. From the above equations, we can see that the dilated layer increases the receptive field without introducing additional parameters. By combining the deformable attention and dilated convolutional layers (Figure 3), we can obtain a generalized high-level global-local representation.
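
To make Equations (7) and (8) concrete, the sketch below evaluates them layer by layer for a stack of convolutions. It follows Equation (8) exactly as written in the text (stride product taken up to layer m). The function names and the starting value `r0=1` (a single input pixel) are illustrative assumptions.

```python
def effective_kernel(k, r):
    """Equation (7): size of a k x k kernel after dilation with rate r."""
    return (k - 1) * r + 1


def receptive_fields(layers, r0=1):
    """Apply Equation (8) layer by layer, as written in the text.

    layers: list of (kernel_size, dilation_rate, stride) tuples.
    r0:     receptive field before the first layer (assumed to be 1).
    Returns the receptive field after each layer.
    """
    fields = []
    field = r0
    stride_product = 1
    for k, r, s in layers:
        stride_product *= s  # product of strides s_1 ... s_m
        field += (effective_kernel(k, r) - 1) * stride_product
        fields.append(field)
    return fields


# Two stacked 3 x 3 layers with stride 1: plain versus dilation rate 2.
plain = receptive_fields([(3, 1, 1), (3, 1, 1)])
dilated = receptive_fields([(3, 2, 1), (3, 2, 1)])
```

With the same 3 x 3 filters, and therefore the same number of weights, the dilated stack covers a 9 x 9 region after two layers, whereas the plain stack covers 5 x 5.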

3.3. Network Architecture

Figure 3 gives an overview of the Multi-Granularity Dilated Transformer. Unlike previous CNN-based work on extracting semantic features, we use four stacked blocks (4×) of the proposed Deformable Dilated Transformer as our backbone, which can capture a more discriminative long-range global understanding. Following the backbone, we leverage global pooling and dual sub-pooling to obtain the coarse-grained global information and local features of lung nodules. Global pooling helps the model obtain a statistical description of lung nodules, and this operation can guide the other two branches to learn a better representation. As for dual sub-pooling, to seek more specific structural features, we empirically split the learned feature maps into two parts based on human biological structure (i.e., left lung and right lung) and perform pooling on each part separately. Notably, we use the proposed Local Focus Scheme (LFS) to mine fine-grained local features by modeling grouped channel-wise topology. Specifically, the Regularized Channel-wise Grouped Attention in LFS first extracts the fine-grained channel-wise relations. Then, we use KeyMask to discard a channel ranking in the top 1% of importance, forcing our model to learn a more discriminative local representation. We train the above three branches simultaneously, so the model can learn multi-granularity features and has a more robust representation ability.
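
The global pooling and dual sub-pooling branches can be illustrated with a small pure-Python sketch. Two details are assumptions made only for this example and are not stated in the text: pooling is taken to be average pooling, and the two parts are obtained by splitting the feature map into left and right halves along the width axis. The function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def global_pool(fmap):
    """Average each channel of a [C][H][W] nested list over the full map."""
    return [mean([v for row in channel for v in row]) for channel in fmap]


def dual_sub_pool(fmap):
    """Pool the left and right halves of each channel separately.

    The split along the width axis is an assumption for illustration.
    """
    half = len(fmap[0][0]) // 2
    left = [mean([v for row in channel for v in row[:half]]) for channel in fmap]
    right = [mean([v for row in channel for v in row[half:]]) for channel in fmap]
    return left, right


# One channel of size 2 x 4.
fmap = [[[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0]]]
coarse = global_pool(fmap)          # one descriptor for the whole map
left, right = dual_sub_pool(fmap)   # one descriptor per half
```

The two sub-pooled descriptors differ when the activations are not symmetric, which is the extra structural information the dual sub-pooling branch contributes beyond the single global descriptor.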

4. Experiments

In this section, we will present the experimental settings and analyze the corresponding results below.

4.1. Dataset

In the experiments, we adopt the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) [36] dataset released by the National Cancer Institute, NIH. The LIDC-IDRI dataset is the largest and most comprehensive publicly available lung nodule dataset. It contains 1018 CT scans in total, collected from 1010 patients. The malignancy of each nodule in the LIDC-IDRI dataset was judged by four professional radiologists. The experts annotated nodules with diameters between 3 and 30 mm in each CT scan, but only a portion of the nodules was annotated by at least three-quarters of the radiologists. Therefore, we use a unified standard to define the nodules used in the experiments. The four radiologists also rated the malignancy suspiciousness of each nodule from one to five, where one denotes that a nodule is clearly benign and five that it is clearly malignant. Following previous works [5,15], we aggregate all of the malignancy ratings by the experienced radiologists into the final level. Eventually, we obtain 848 nodules in total, of which 442 are benign and 406 are malignant.

4.2. Experimental Settings

We implement our method using the PyTorch framework on one GTX 3080Ti GPU. The MGDFormer network was trained with a batch size of 256 for 50 epochs. The initial learning rate is set to $5.0 \times 10^{-4}$ and is decreased to $5.0 \times 10^{-5}$ after the 25th epoch. We choose the Adam optimizer [37] and use the default parameters ($\beta_{1} = 0.9$ and $\beta_{2} = 0.99$). In addition, we apply a weight decay of 0.0001 to avoid overfitting. We evaluate the proposed MGDFormer on the LIDC-IDRI dataset using 10-fold cross-validation, where 9 folds are used for training and the remaining fold for testing. With regard to evaluation criteria, we adopt the same standard as previous works for a fair comparison, namely accuracy, precision, sensitivity, F1-score, and AUC. The specific metrics are listed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{11}$$

$$\mathrm{F1\text{-}Score} = \frac{2TP}{2TP + FN + FP} \tag{12}$$

$$\mathrm{AUC} = \int_{0}^{1} \eta_{tp}(\eta_{fp}) \, d\eta_{fp} = P(X^{t} > X^{f}) \tag{13}$$

In Equations (9)–(13), $FN$, $FP$, $TP$, and $TN$ denote the numbers of false negative, false positive, true positive, and true negative nodules, respectively; $\eta_{fp}$ is the false positive rate and $\eta_{tp}$ is the true positive rate. $X^{t}$ and $X^{f}$ denote the confidence scores for a positive and a negative sample, respectively.
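
The metrics in Equations (9)–(13) can be computed with a few lines of pure Python. The sketch below is illustrative rather than the evaluation code used in the experiments: labels are assumed to be 1 for malignant and 0 for benign, AUC is computed from its probabilistic form $P(X^{t} > X^{f})$ over all positive/negative pairs, and counting tied scores as one half is an added convention. The function names are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Return (TP, TN, FP, FN) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(labels, predictions):
    """Equations (9)-(12). Assumes every denominator is non-zero."""
    tp, tn, fp, fn = confusion_counts(labels, predictions)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fn + fp),
    }


def auc(labels, scores):
    """Equation (13) as P(X^t > X^f); ties count as one half (assumption)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(labels, predictions)
area = auc(labels, scores)
```

In this toy case one malignant nodule is missed and one benign nodule is flagged, so accuracy, precision, sensitivity, and F1-score are all 0.5, while three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75. AUC depends only on the ranking of the scores, not on the 0.5 decision threshold.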

4.3. Data Preprocessing

Firstly, we use trilinear interpolation to normalize the CT scans, which ensures isotropic resolution in all of the x, y, and z dimensions. Next, we resize the images to $32 \times 32$, because the largest nodule diameter in the LIDC-IDRI dataset is 30 mm. Then, we normalize the nodules as in ProCAN [15]. Finally, we augment the nodules by rotating each nodule around the three (x, y, and z) axes. Specifically, we rotate each nodule in seven directions ($0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}, 180^{\circ}, 225^{\circ}, 270^{\circ}$) along each axis.
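
The rotation schedule described above can be enumerated as follows. This sketch only builds the list of (axis, angle) pairs and the corresponding 3D rotation matrices and applies one of them to a single coordinate; resampling a full voxel volume, including the interpolation it requires, is not shown. All names are illustrative assumptions.

```python
import math

ANGLES = [0, 45, 90, 135, 180, 225, 270]  # the seven directions, in degrees


def rotation_matrix(axis, degrees):
    """Right-handed 3 x 3 rotation matrix about the x, y, or z axis."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError("axis must be 'x', 'y', or 'z'")


def rotate_point(matrix, point):
    """Apply a rotation matrix to an (x, y, z) coordinate."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)


def augmentation_plan():
    """All (axis, angle) pairs: three axes times seven angles."""
    return [(axis, angle) for axis in "xyz" for angle in ANGLES]


plan = augmentation_plan()
rotated = rotate_point(rotation_matrix("z", 90), (1.0, 0.0, 0.0))
```

The plan contains 21 (axis, angle) pairs; the 0° entry leaves the nodule unchanged on every axis, so it reproduces the original sample.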

4.4. Results

In this section, we compare our method with state-of-the-art methods on the LIDC-IDRI dataset. As shown in Table 1, we analyze the evaluation criteria of our MGDFormer against other methods, and we can observe that our MGDFormer outperforms the state-of-the-art methods on all evaluation criteria with a clear margin. For example, MGDFormer has a 1.4% performance gain compared with ProCAN in AUC, and we can also see a clear superiority on other metrics. We argue that the main cause is that MGDFormer can capture more fine-grained spatial relations and discriminative local representation.

5. Discussion

In this section, we analyze the effectiveness of our proposed MGDFormer, performing ablation studies on each proposed component of the network architecture and empirically analyzing the corresponding reason.

5.1. Effectiveness of the Deformable Dilated Transformer

In this part, we analyze the effectiveness of deformable attention (DA) and the dilated layer in the Deformable Dilated Transformer (DDFormer) through an extensive ablation study. As shown in Figure 3, the key parts of DDFormer are Deformable Attention (DA) and the Dilated Layer. First, we replace only the deformable attention in the DDFormer with regular multi-head attention (RA) to examine the effectiveness of the deformable attention mechanism. Next, we use the dilated layer as the feedforward layer (the original one is a multi-layer perceptron (MLP)). Finally, we jointly use deformable attention and the dilated layer to construct the network. We compare the above four settings and tabulate the results in Table 2. From the results, we can observe that combining the dilated layer with either RA or DA improves the performance across all metrics, as expected. This is because the dilated layer increases spatial receptive fields without introducing extra computational cost. Moreover, Deformable Attention has a clear performance gain (1%) compared with Regular Attention, which indicates that DA indeed mitigates overfitting and obtains a more discriminative global representation. Notably, the combination of DA and the Dilated Layer (i.e., DDFormer) outperforms all the other settings over multiple criteria.

Moreover, we conduct an ablation study to determine the optimal number of DDFormer blocks for our final structure. Specifically, we vary the number of blocks and report the corresponding AUC and accuracy in Table 3. We can observe that accuracy is highest with four DDFormer blocks and starts to decline when the number exceeds four. We believe the main reason is that too few blocks lack sufficient feature extraction capability, whereas too many lead to overfitting; both effects reduce performance.

5.2. Influence of Multi-Granularity Structure

We also conduct an ablation study to analyze the effect of applying the Local Focus Scheme (LFS) and to further verify the performance of different granularities. Specifically, we first use only one of the three branches at a time to obtain the final results. Next, we test the performance of different combinations. The ablation studies in Table 4 show the importance of multi-granularity for the overall results: Dual Sub-Pooling (DSP) has a performance gain compared with global pooling (GP), and LFS outperforms both GP and DSP across all performance metrics. Moreover, the combination of all three performs best. These results indicate that LFS can learn a fine-grained representation to mine more discriminative local information, and that the granularities provide complementary features to one another.

5.3. Performance on Different Size Nodules

To validate the performance of our MGDFormer on nodules of different sizes, especially the classification results on hard samples, we compare our MGDFormer with other state-of-the-art methods on three subsets, as shown in Figure 4: small nodules (<5 mm), middle nodules (5–12 mm), and big nodules (>12 mm). We can observe that all the methods obtain high accuracy on the big and small subsets. However, for middle nodules, the previous works mostly perform worse, and our MGDFormer shows a clear advantage in classifying these hard samples, which indicates that MGDFormer can indeed learn more discriminative local features.
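
A per-size breakdown of this kind can be computed with a short helper. The sketch below is an illustration, not the evaluation code behind Figure 4: it buckets nodules by diameter using the three ranges named above and reports accuracy per bucket. Placing the boundary values 5 mm and 12 mm in the middle bucket is an assumption, as are the function names.

```python
def size_bucket(diameter_mm):
    """Map a nodule diameter in mm to 'small', 'middle', or 'big'."""
    if diameter_mm < 5:
        return "small"
    if diameter_mm <= 12:
        return "middle"
    return "big"


def accuracy_by_size(diameters, labels, predictions):
    """Accuracy computed separately for each size bucket that occurs."""
    totals, correct = {}, {}
    for d, y, p in zip(diameters, labels, predictions):
        bucket = size_bucket(d)
        totals[bucket] = totals.get(bucket, 0) + 1
        correct[bucket] = correct.get(bucket, 0) + (1 if y == p else 0)
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


# Toy data: one small, two middle-sized, and one big nodule.
result = accuracy_by_size(
    diameters=[3, 8, 10, 20],
    labels=[0, 1, 0, 1],
    predictions=[0, 1, 1, 1],
)
```

Reporting accuracy per bucket rather than overall makes errors on the middle-sized hard samples visible even when the small and big buckets are classified almost perfectly.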

6. Conclusions

In this study, we propose a novel Multi-Granularity Dilated Transformer to address the challenging problem of lung nodule malignancy classification. Lung nodules vary widely in diameter, ranging between 3 and 30 mm. We thus propose a Local Focus Scheme to mine fine-grained local features and reduce the ambiguity of highly similar lung nodules. Moreover, we design a Deformable Dilated Transformer that consists of two parts: deformable attention and a dilated layer. The former attends only to sampling points around reference points, which guides the model to learn long-range global dependencies and avoids overfitting. Meanwhile, the dilated layer is used to enlarge local spatial receptive fields. The combination of the two helps us obtain a more discriminative global representation. Finally, to further improve the performance on hard samples, we adopt the multi-granularity structure to capture relations at different levels. The proposed model achieves state-of-the-art results on the LIDC-IDRI dataset, and the ablation studies also demonstrate significant improvements on all metrics for nodules of all sizes, especially nodules with diameters between 5 and 12 mm. Although MGDFormer has a clear performance gain compared with state-of-the-art methods, it needs more time to converge. In future work, we will modify the MGDFormer design to accelerate convergence and use the modified structure to build an end-to-end early-stage lung cancer detection and classification model, examining its performance on bigger, more diverse datasets.

Author Contributions

Methodology, K.W.; Writing—original draft, K.W.; Writing—review & editing, B.P. and D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Sichuan Province Grant No. 23ZDYF0090.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available at https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhang, F.; Song, Y.; Cai, W.; Lee, M.Z.; Zhou, Y.; Huang, H.; Shan, S.; Fulham, M.J.; Feng, D.D. Lung nodule classification with multilevel patch-based context analysis. IEEE Trans. Biomed. Eng. 2013, 61, 1155–1166.
2. Tajbakhsh, N.; Suzuki, K. Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs. Pattern Recognit. 2017, 63, 476–486.
3. Hu, Z.; Tang, J.; Wang, Z.; Zhang, K.; Zhang, L.; Sun, Q. Deep learning for image-based cancer detection and diagnosis – A survey. Pattern Recognit. 2018, 83, 134–149.
4. Xie, H.; Yang, D.; Sun, N.; Chen, Z.; Zhang, Y. Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognit. 2019, 85, 109–119.
5. Al-Shabi, M.; Lee, H.K.; Tan, M. Gated-dilated networks for lung nodule classification in CT scans. IEEE Access 2019, 7, 178827–178838.
6. Xu, X.; Wang, C.; Guo, J.; Gan, Y.; Wang, J.; Bai, H.; Zhang, L.; Li, W.; Yi, Z. MSCS-DeepLN: Evaluating lung nodule malignancy using multi-scale cost-sensitive neural networks. Med. Image Anal. 2020, 65, 101772.
7. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
8. Netto, S.M.B.; Silva, A.C.; Nunes, R.A.; Gattass, M. Automatic segmentation of lung nodules with growing neural gas and support vector machine. Comput. Biol. Med. 2012, 42, 1110–1121.
9. Nibali, A.; He, Z.; Wollersheim, D. Pulmonary nodule classification with deep residual networks. Int. J. Comput. Assist. Radiol. Surg. 2017, 12, 1799–1808.
10. Hussein, S.; Cao, K.; Song, Q.; Bagci, U. Risk stratification of lung nodules using 3D CNN-based multi-task learning. In Proceedings of the International Conference on Information Processing in Medical Imaging, Boone, NC, USA, 25–30 June 2017; pp. 249–260.
11. Zhu, W.; Liu, C.; Fan, W.; Xie, X. Deeplung: Deep 3d dual path nets for automated pulmonary nodule detection and classification. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 673–681.
12. Jiang, H.; Gao, F.; Xu, X.; Huang, F.; Zhu, S. Attentive and ensemble 3D dual path networks for pulmonary nodules classification. Neurocomputing 2020, 398, 422–430.
13. da Silva, G.L.; da Silva Neto, O.P.; Silva, A.C.; de Paiva, A.C.; Gattass, M. Lung nodules diagnosis based on evolutionary convolutional neural network. Multimed. Tools Appl. 2017, 76, 19039–19055.
14. Jiang, H.; Shen, F.; Gao, F.; Han, W. Learning efficient, explainable and discriminative representations for pulmonary nodules classification. Pattern Recognit. 2021, 113, 107825.
15. Al-Shabi, M.; Shak, K.; Tan, M. ProCAN: Progressive growing channel attentive non-local network for lung nodule classification. Pattern Recognit. 2022, 122, 108309.
16. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
17. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
18. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
19. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3286–3295.
20. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. Adv. Neural Inf. Process. Syst. 2019, 32.
21. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9308–9316.
22. Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola, A. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 21–29.
23. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667.
24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
25. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28.
26. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27.
27. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057.
28. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792.
29. Choi, M.; Kim, H.; Han, B.; Xu, N.; Lee, K.M. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10663–10671.
30. Bastidas, A.A.; Tang, H. Channel attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
31. Zhou, T.; Canu, S.; Ruan, S. Automatic COVID-19 CT segmentation using U-Net integrated spatial and channel attention mechanism. Int. J. Imaging Syst. Technol. 2021, 31, 16–27.
32. Fang, W.; Han, X.H. Spatial and channel attention modulated network for medical image segmentation. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November 2020.
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
34. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
35. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
36. Armato, S.G., III; McLennan, G.; Bidaut, L.; McNitt-Gray, M.F.; Meyer, C.R.; Reeves, A.P.; Zhao, B.; Aberle, D.R.; Henschke, C.I.; Hoffman, E.A.; et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys. 2011, 38, 915–931.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Shen, S.; Han, S.X.; Aberle, D.R.; Bui, A.A.; Hsu, W. An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Syst. Appl. 2019, 128, 84–95.
39. Shen, W.; Zhou, M.; Yang, F.; Yu, D.; Dong, D.; Yang, C.; Zang, Y.; Tian, J. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognit. 2017, 61, 663–673.
40. Al-Shabi, M.; Lan, B.L.; Chan, W.Y.; Ng, K.H.; Tan, M. Lung nodule classification using deep local–global networks. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 1815–1819.
41. de Pinho Pinheiro, C.A.; Nedjah, N.; de Macedo Mourelle, L. Detection and classification of pulmonary nodules using deep learning and swarm intelligence. Multimed. Tools Appl. 2020, 79, 15437–15465.
42. Xie, Y.; Xia, Y.; Zhang, J.; Song, Y.; Feng, D.; Fulham, M.; Cai, W. Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT. IEEE Trans. Med. Imaging 2018, 38, 991–1004.
43. Xie, Y.; Zhang, J.; Xia, Y. Semi-supervised adversarial model for benign–malignant lung nodule classification on chest CT. Med. Image Anal. 2019, 57, 237–248.

Figure 1. Visualizations of different samples in the LIDC-IDRI dataset. Here, be and mal denote benign and malignant, respectively.

Figure 2. Illustration of the proposed Local Focus Scheme. The Local Focus Scheme consists of two key parts: the regularized channel-wise grouped attention and KeyMask.

Figure 3. Illustration of the proposed Multi-Granularity Dilated Transformer. We use the proposed Deformable Dilated Transformer as our backbone and then leverage the global pooling, dual sub-pooling, and LFS to jointly capture multi-granularity features.

Figure 4. The performance of multiple methods on nodules of different sizes.

Table 1. Comparisons of state-of-the-art methods on the LIDC-IDRI dataset.

Method	AUC	Accuracy	Precision	Sensitivity	F1-Score
HSCNN [38]	85.6	84.2	-	70.5	-
Multi-Crop [39]	93.0	87.14	-	77.0	-
Local-Global [40]	95.6	88.4	87.3	88.6	88.3
Gated-Dilated [5]	95.1	92.5	91.8	92.2	92.6
Swarm [41]	-	93.7	93.5	92.9	-
3D DPN [12]	-	90.2	-	92.0	90.4
MV-KBC [42]	95.7	91.6	87.7	86.5	87.1
MSCS-DeepLN [6]	94.0	92.6	90.3	85.5	87.9
MK-SSAC [43]	95.8	92.5	-	84.9	-
ProCAN [15]	97.1	94.1	94.5	93.1	93.8
MGDFormer (ours)	98.5	96.1	95.9	94.4	95.2
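
The threshold-based columns in Table 1 follow the standard confusion-matrix definitions. As a minimal sketch, assuming malignant is treated as the positive class, the function below computes them from raw counts. The function name and the example counts are hypothetical and are not taken from the LIDC-IDRI experiments; AUC is omitted because it requires the predicted scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: malignant nodules predicted malignant/benign (assumed positive class).
    tn/fp: benign nodules predicted benign/malignant.
    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
example = classification_metrics(tp=90, fp=10, tn=95, fn=5)
for name, value in example.items():
    print(f"{name}: {100 * value:.1f}")
```

Values reported as averages over several folds need not satisfy the F1 identity exactly, since the mean of per-fold F1-scores generally differs from the F1-score of the mean precision and sensitivity.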

Table 2. Performance comparisons of each component in the Deformable Dilated Transformer.

Method	AUC	Accuracy	Precision	Sensitivity	F1-Score
RA + Dilated Layer	97.6	94.8	94.9	93.6	94.3
RA + MLP	97.3	94.1	94.4	93.1	93.8
DA + MLP	98.1	95.5	95.4	94.0	94.8
DA + Dilated Layer	98.5	96.1	95.9	94.4	95.2

Table 3. Performance comparisons of the different number of DDFormer blocks.

Number	AUC	Accuracy	Precision	Sensitivity	F1-Score
2	97.7	94.2	94.9	93.6	94.2
3	98.0	94.7	95.4	94.1	94.5
4	98.5	96.1	95.9	94.4	95.2
5	98.3	95.6	95.6	94.1	95.3
6	98.0	94.8	95.2	93.8	94.7

Table 4. Influence of multi-granularity structure.

Method	AUC	Accuracy	Precision	Sensitivity	F1-Score
GP	97.4	94.0	94.8	93.2	94.1
DSP	97.7	94.6	95.3	93.6	94.3
LFS	98.2	95.8	95.7	94.1	94.8
GP+DSP	97.9	95.2	95.3	93.8	94.5
GP+DSP+LFS	98.5	96.1	95.9	94.4	95.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

