1. Introduction
In classical machine-learning-based image classification, each image typically belongs to a specific category. In some practical scenarios, however, a label is assigned only after observing multiple images. This process is analogous to analyzing a video, where a judgment is made only after a comprehensive view. Such scenarios align with the general concept of multi-instance learning (MIL). In 1997, Dietterich et al. proposed multi-instance learning to address the challenge of drug activity prediction, employing the axis-parallel rectangle (APR) learning algorithm within this framework. Multi-instance learning offers a way of handling weakly supervised problems: labels are not assigned to individual samples. Instead, based on the problem's specifications, a subset of instances (samples) meeting the problem's criteria is grouped into a bag, and a label is then assigned to each bag. If at least one positive instance exists within a bag, the bag is labeled positive; otherwise, if all instances are negative, the bag is considered negative [1]. The primary advantage of multi-instance learning lies in its simplification of dataset construction for the respective problem, thereby reducing the workload for professionals. This advantage is particularly pronounced in medical imaging tasks, such as computational histopathology, X-ray imaging, or CT imaging screening. Given the potential associations and shared features among different image data, organizing related images into bags enables multi-instance learning to better capture this information. Consequently, this approach enhances the performance and accuracy of the model [2].
Currently, numerous medical models rely on multi-instance learning frameworks to predict bag labels. Another crucial task involves assigning prediction weights to the instances within the bag. This helps identify the key instances that contribute to the bag label, which holds significant importance in certain clinical tasks, as it enables doctors to locate the disease and formulate precise treatment plans accordingly [3]. To address this challenge, researchers have proposed various solutions. One approach makes judgments based on bag-level similarity: the difference vector between a bag and all other bags is used as the feature representation, followed by prediction from this representation [4]. Another method embeds all instances within a bag into a compact representation and then employs a bag-level classifier for prediction [5]. The third approach predicts low-dimensional embedded features via an instance-level classifier [6]. Among these methods, only the third can offer interpretable results. However, many researchers note that instance-level classifiers often exhibit lower prediction accuracy than bag-level classifiers [7]. Furthermore, when instances are strongly correlated, this method may lead the model to detect abnormalities in different data segments under minor variations in the training data, resulting in discrepancies in disease localization [8].
Computed tomography (CT) scan images currently serve as an effective method for clinically examining various diseases, including lung tumors, pneumonia, and others. According to the American Cancer Society, patients whose lung cancer is detected at an early stage through CT scan images have a 47% survival rate, while 80% of patients are diagnosed and treated only at the middle or advanced stages of the disease. A notable feature of CT scanning is its unobtrusiveness: the scanning process is relatively quiet, and, compared with other imaging techniques such as parallel imaging, changes in scanning angle seldom draw attention [9]. When X-rays capture parts with a 3D structure, they are projected onto a 2D view. In some severe cases, this may not offer effective information to professional radiologists, because geometric structures overlap during the projection process, leading to confusion about the actual anatomy. Therefore, the reconstruction of 3D scan images is the preferred solution for visualizing the structural information of the patient's lesion site and facilitating clinical analysis [10]. Three-dimensional CT scan images exhibit high resolution. Presently, the slice thickness of high-resolution CT scans can range from 1 to 1.5 mm, enabling clear observation of tissue structures and lesion sites. Typically, a single scan generates dozens to hundreds of slices, depending on the patient's condition. However, disease diagnosis based on CT images relies on analyzing various permutations and combinations of features, a task that proves exceedingly challenging even for experienced physicians. In recent years, the advancement of artificial intelligence and deep learning has contributed significantly to image classification. This technology has matured substantially and is capable of extracting, from different angles, subtle features imperceptible to the human eye; common examples include histogram features, texture features, and shape features [11].
The features of 3D CT scan images can be categorized into high-frequency and low-frequency features. High-frequency features typically represent fine structures within the image, such as edges and textures. These features correspond to rapidly changing or high-gradient areas in the image, such as object boundaries or structural textures. In CT images, they may manifest as clear contours, small tissue structures, or subtle tissue changes. On the other hand, low-frequency features usually depict the overall shape of the image, encompassing wide-ranging density changes or major tissue structures. These features correspond to slowly changing or low-gradient regions in the image, such as the overall shape or density distribution of the tissue. In CT images, low-frequency features may appear as the general organ shape and variations in tissue density across a wide area. CT images thus offer diverse levels of feature information at both high and low frequencies. However, accurately discerning these features demands considerable time and expertise from professional doctors [12]. A 3D CT image typically comprises multiple adjacent slices. The spacing between these slices must be carefully set to ensure that the patient's disease information is not lost. In three-dimensional space, adjacent slices often overlap to some extent, ensuring the continuity of tissue structure within the CT image. This continuity entails seamless connection and smooth transition between adjacent slices, allowing spatial changes at the lesion site to be reflected. Such continuity is crucial for doctors to make accurate diagnoses [13].
In this paper, our objective is to address the practical significance of CT image classification using the concept of multi-instance learning. We carefully consider the fusion of high- and low-frequency features, as well as the continuity of 3D CT images, aiming to demonstrate the applicability of the approach in medical imaging on 3D CT image data. Our experiments evaluate two extensive public datasets focusing on lung cancer (TCIA) and pneumonia (CC-CCII), aiming to substantiate the superiority of our model over other methods.
Figure 1 illustrates the comparison of CT slices depicting normal lungs and three types of lung cancer from the dataset utilized in this study. Many medical experts assert that scanning a greater number of CT images decreases the risk of misjudging cancer during examinations. However, CT scan images often contain extensive nodule information. Assessing a larger number of images poses a considerable challenge but is crucial for accurate diagnosis [14].
In this study, to classify CT images into lung cancer patient and normal categories, we propose a multi-instance learning model called the High–Low Frequency Sliding Recurrent Neural Network (HLFSRNN-MIL) to assist doctors in diagnosing the patient's condition. HLFSRNN-MIL leverages the principles of multi-instance learning: the model requires only patient-level labels as input and defines a patient's 3D CT image as a bag. The size of each bag is determined by the patient's specific circumstances, typically ranging from 30 to 300 slices. Each bag undergoes basic feature extraction to obtain feature information. Subsequently, the high- and low-frequency feature information derived from these features is fused, and the prediction probability of each instance within the bag is determined using our modified recurrent neural network. Finally, this information is fed into the MIL classifier to generate bag-level predictions. The paper's main contributions are summarized below:
We use a wide range of preprocessing techniques to mitigate the noise in the data. In this process, resampling, grayscale standardization, noise reduction, and other technologies are used to reduce the impact of noise on training.
The HLFFF method is used in the process of feature extraction, which can effectively capture and fuse the high- and low-frequency information from the basic feature information, and improve the sensitivity of the network to various distributions of disease data.
The SRNN method is proposed to learn and capture the temporal relationship information in the 2D feature map. This method is based on the Long Short-Term Memory Network (LSTM), which can effectively capture the relationship information between instances and identify the sequential instance fragments that affect the prediction results (that is, the continuous instance fragments of the window size in the package).
We propose the HLFSRNN-MIL model based on the ResNet-50 network. The model adheres to the 'end-to-end' architecture concept and, through experiments on the TCIA and CC-CCII datasets, addresses the problem of disease diagnosis from 3D CT images in clinical practice.
2. Related Works
In recent years, numerous models based on various algorithms have emerged to identify and detect specific features, aiming to overcome the limitations of standard CNN methods such as AlexNet [15], Inception [16], ShuffleNet [17], and others. These models offer the flexibility to focus on different aspects of CT images by adjusting parameters. However, despite these advancements, the involvement of medical professionals remains crucial in this process, and significant limitations persist in clinical applications [18]. Hence, researchers in this field are actively exploring ways to enhance feature extraction within models, since improved feature extraction can facilitate accurate diagnosis by medical practitioners in clinical settings. Currently, a plethora of studies aim to enhance model performance in this domain.
Qin et al. [19] introduced a deep learning framework that integrates fine-grained features from PET and CT images. This framework utilizes a multi-dimensional attention mechanism to mitigate noise in the feature information during the extraction of fine-grained features across multiple imaging modalities. Through experiments on feature fusion and attention concentration, they achieved an ROC score of 0.92 and observed more focused network attention. Raza et al. [20] proposed the Lung-EffNet framework for lung cancer CT scan image classification. By adapting the classification head of the EfficientNet model and incorporating a predictor based on transfer learning, they achieved an accuracy of 99.10% and ROC scores ranging from 0.97 to 0.99 on a public dataset. Compared to other CNN models, Lung-EffNet demonstrated faster training, required fewer parameters, and exhibited performance comparable to that of general practitioners. Xu et al. [21] introduced ISANET, a CNN-based classification model that integrates channel attention and spatial attention mechanisms into InceptionV3 to prioritize the lesion area. Compared to traditional models such as AlexNet, VGG-19, and MobileNetV3, ISANET achieved an accuracy of 95.25% in classifying non-small-cell lung cancer data. Siddiqui et al. [22] addressed the perceived limitations of existing methods for early cancer diagnosis, namely, low performance and long processing times. To overcome these challenges, they proposed E-DBN, a support vector machine based on the deep belief network. This method incorporates an improved Gabor filter with fewer parameters, resulting in reduced processing delays and times.
Chang et al. [23] proposed a model based on deep multi-instance learning (DMIL), which utilizes a deep convolutional neural network (CNN) as the backbone for feature extraction. In DMIL, each instance corresponds to one path of the model input. The bag-level features obtained from three different methods—maximum pooling, convolution pooling, and attention-mechanism pooling—are passed to fully connected layers for prediction. This feature fusion technique allows the importance of each instance to be estimated, thereby enhancing the model's performance. Fuhrman et al. [24] used transfer learning to extract more complex and richer feature representations than untrained models. They connected attention-based MIL pooling (AMIL) through two fully connected layers, which can evaluate influential slices in a CT bag according to the attention weights. The strong classification ability of the model shows great potential for clinical implementation. Tang et al. [25] proposed a masked hard instance mining model (MHIM-MIL) using an Exponential Moving Average (EMA), which identifies difficult instances through an instance attention mechanism and introduces parameter-free momentum into the Siamese structure to obtain more effective and stable instance prediction scores. Meng et al. [26] proposed an uncertainty-aware consensus-assisted multi-instance learning (UC-MIL) model to deal with any number of CT slices. Through training, a certain number of top instances that contribute most to the bag label are selected as the input for subsequent bag prediction. Subsequently, an adaptive bilateral adjacency matrix is constructed to capture vertex relationships at different granularity levels, and a graph inference module (BA-GCN) is used to predict the patient's diagnostic probability.
Xue et al. [27] proposed a multi-instance learning approach based on two-stage attention combined with transfer learning (TSA-MIL), which adjusts the attention weight of each instance by determining the distance between the key instance and the current instance. During feature space processing, they adopted a clustering constraint to refine the classification features, thereby enhancing the classification performance of TSA-MIL. Wang et al. [28] proposed a self-tuning module to mitigate the impact of an unbalanced label distribution in positive bags. Their confidence-based regularization strategy generates pseudo-labels for each weak input through prediction. By incorporating a Student–Teacher architecture, the model can leverage unlabeled data for training, thereby enhancing its recognition ability in a cost-effective manner. Liu et al. [29] proposed an adversarial multi-instance learning (AdvMIL) model, which enhances distribution estimation performance by employing time-adversarial instances. This approach eliminates the need for fully supervised data in the model, effectively reducing the computational cost of AdvMIL and enhancing the robustness of mainstream survival analysis methods. Schmidt et al. [30] introduced the Attention Gaussian Process (AGP) method for multi-instance learning, which combines a Gaussian process with an attention mechanism to predict the probability of instance-level labels. The bag-level predictions can then be explained by the instance-level prediction results. This approach ensures accurate fitting and predictive estimation performance even when the data are unbalanced. Wang et al. [31] proposed an Iteratively Coupled Multi-Instance Learning (ICMIL) framework, which fine-tunes the parameters of the instance embedder used for feature extraction based on bag-level category information. This framework facilitates information exchange between the low-cost coupled instance embedder and the bag classifier, thus helping to generate improved instance representations. Addressing the challenge of early clinical diagnosis of low-level characteristic symptoms, Li et al. [32] introduced a multi-instance learning framework based on a causal-driven graph neural network. This framework encodes spatial proximity and feature similarity between instances while emphasizing the causal components of instance features through a causal comparison mechanism, thus enhancing the diagnostic performance of the model.
These methods have played a significant role in advancing multi-instance learning, and a common trend among them is the exploration of more complex feature extraction techniques. Many current methods rely on feature extraction modules based on deep neural networks, such as ResNet [33], MobileNetV3 [34], and GoogleNet [35], among others. These network models differ greatly in design concept and performance. ResNet connects residual blocks through identity mappings to address the problems of exploding and vanishing gradients during training. Using the inverted residual structure and separable convolutions, MobileNetV3 reduces the required computing resources, enhances the feature relationships between channels, and further improves model performance. GoogleNet uses a combination of convolution kernels of different shapes and pooling methods to extract features at multiple scales, improving the utilization of computing resources. However, lesions in CT images often exhibit complex characteristics such as uncertain location, blurred shape, and varying size. Therefore, further research is necessary to effectively enhance the classification performance of multi-instance learning methods in CT image settings. This entails developing more sophisticated feature extraction methods that can accurately capture the nuanced characteristics of lesions in CT images, ultimately leading to improved diagnostic accuracy and clinical utility.
3. Proposed Methods
The proposed HLFSRNN-MIL method draws on currently popular network architecture concepts and combines more advanced and effective techniques on top of a variety of feature extractors to solve the problem of 3D CT image classification. This design pattern facilitates training and subsequent parameter adjustment of the model, and it demonstrates strong feature selection ability and excellent performance on multi-instance learning problems.
Figure 2 shows an overview of the classification model proposed in this paper, which can be explained in three parts: (1) the preprocessing of 3D CT image data; (2) the extraction of the high- and low-frequency features of each instance at the 2D level (HLFFF); and (3) the search for key bag fragments for classification at the 2D and 3D levels (SRNN). To accept a patient's 3D CT image with an arbitrary number of slices, the original CT data are first preprocessed. The processed data are then sent, one bag at a time, to the HLFFF module to extract the high- and low-frequency features of each instance, and the representation of the feature space is improved by increasing its dimensionality. This output then serves as the input to the proposed SRNN, which learns the temporal information of the instances in 3D space. It is worth noting that the output of SRNN can be interpreted as finding the bag fragment with the highest contribution given the sliding-window size W, which is set to 5 as an empirical hyperparameter. The development details and explanations of each part are given as follows.
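To make the data flow described above concrete, the following PyTorch-style sketch outlines the three-stage forward pass for one bag. The module names (backbone, hlfff, srnn) and tensor shapes are illustrative assumptions, not the authors' released implementation; preprocessing is assumed to happen before the bag reaches the model.

```python
import torch
import torch.nn as nn

class HLFSRNNMIL(nn.Module):
    """Illustrative three-stage pipeline: truncated backbone + HLFFF + SRNN + MIL head."""
    def __init__(self, backbone: nn.Module, hlfff: nn.Module, srnn: nn.Module, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone        # truncated ResNet-50 feature extractor (Section 3.2)
        self.hlfff = hlfff              # high/low-frequency feature fusion (Section 3.4)
        self.srnn = srnn                # sliding recurrent network, window W = 5 (Section 3.5)
        self.classifier = nn.LazyLinear(num_classes)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (K, 1, 256, 256) -- K preprocessed CT slices (instances) of one patient
        feats = self.backbone(bag)              # (K, C, 8, 8) deep feature maps
        fused = self.hlfff(feats)               # (K, D) fused high/low-frequency features
        fragment_feats = self.srnn(fused)       # (M, H) one representation per sliding fragment
        bag_repr = fragment_feats.mean(dim=0)   # aggregate the fragment evidence
        return self.classifier(bag_repr)        # bag-level prediction (e.g., cancer vs. normal)
```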
3.1. 3D CT Datasets and Dataset Preprocessing
This paper verifies the feasibility of the method on two datasets, the Cancer Imaging Archive (TCIA) and the China Consortium of Chest CT Image Investigation (CC-CCII), with the number of slices per bag controlled between 30 and 300.
Table 1 presents the CT image acquisition parameter information of the two datasets.
TCIA is a large public dataset on lung cancer. A total of 300 3D CT bags from 234 patients with lung cancer were used, including 216 bags of adenocarcinoma, 41 bags of small-cell carcinoma, and 43 bags of squamous-cell carcinoma. Patients were required to fast for at least 6 h before the examination. Whole-body emission scanning was performed within 12 min after injection of 18F-FDG (18.4 MBq/kg, 44.0 mCi/kg). During the scan, the blood glucose of each patient was less than 11 mmol/L. The original CT images were composed of anisotropic voxels with different in-plane resolutions. Because of differences between scanners and acquisition protocols, the voxel spacing of the CT dataset was also inconsistent. To facilitate training, we resampled all medical images according to the voxel spacing information provided by the DICOM files to ensure a consistent resolution. The slice thickness usually ranged between 0.625 mm and 5 mm, and the scanning modes included plain, contrast-enhanced, and 3D reconstruction.
CC-CCII is a novel coronavirus pneumonia dataset established by Zhang et al. [36], comprising cases of novel coronavirus pneumonia (COVID-19) caused by the SARS-CoV-2 virus as well as cases of common pneumonia. A total of 300 3D CT scans of COVID-19 and common pneumonia were included, with each scan acquired during the patient's admission examination. This dataset is essential for distinguishing between COVID-19 and other types of pneumonia, providing valuable data for developing and validating diagnostic models.
Since X-ray absorption varies among different tissues and substances within the human body, Hounsfield Unit (HU) values require adjustment for different conditions. Taking the absorption of water as the reference, a positive HU value signifies greater X-ray absorption by a tissue or substance compared to water, while a negative HU value indicates lesser absorption than water. To delineate diverse types of abnormal tissues and lesions in the lungs, it is crucial to set an appropriate window width (WW) and window level (WL) to obtain clear lung features. In our experiment, we configured the WW to 1050 HU and the WL to −475 HU. The disparity between the initial CT and the modified input CT is illustrated in Figure 1.
Voxel spacing denotes the distance between adjacent volume pixels of the patient's CT image in three-dimensional space. This spacing determines the minimum spatial unit discernible within the image; smaller voxel spacing implies higher spatial resolution. Typically, the voxel spacing is governed by the parameter settings of the scanning device. In our experiment, the voxel spacing of the original CT image data ranged from 0.585937 to 0.841796 mm, while the resolution of the resampled images ranged between 300 and 431. To streamline subsequent training, the resolution of all CT images had to be standardized; hence, using the OpenCV tool, all images were uniformly scaled to 256 × 256.
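As a concrete illustration of the preprocessing described in this subsection (lung windowing with WW = 1050 HU and WL = −475 HU, followed by rescaling every slice to 256 × 256 with OpenCV), a minimal sketch is given below. The function names are hypothetical, and resampling to a common voxel spacing is omitted for brevity.

```python
import numpy as np
import cv2  # OpenCV, used here for the final resizing step

def window_hu(volume_hu: np.ndarray, ww: float = 1050.0, wl: float = -475.0) -> np.ndarray:
    """Clip a HU volume to the lung window and rescale it to [0, 1]."""
    lo, hi = wl - ww / 2.0, wl + ww / 2.0
    windowed = np.clip(volume_hu, lo, hi)
    return (windowed - lo) / (hi - lo)

def preprocess_bag(volume_hu: np.ndarray, out_size: int = 256) -> np.ndarray:
    """Window each slice and resize it to out_size x out_size (one bag = one patient)."""
    windowed = window_hu(volume_hu)
    slices = [cv2.resize(s, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
              for s in windowed]
    return np.stack(slices).astype(np.float32)  # shape: (K, 256, 256)
```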
3.2. Extraction of Deep Features
The fine-tuned ResNet-50 serves as the foundational network for the deep feature extraction process. The input channel of the ResNet-50 network was adjusted to 1, and the final global average pooling layer and fully connected layer were omitted. The retained portion functions as a feature extractor devoid of a classifier, thereby preserving the spatial structure of image features and encompassing feature information across various depth levels.
After the final layer, a series of feature maps was obtained, which represent image features at various levels within the network and encompass information at differing abstract levels. In the context of this paper’s problem, the number of channels in the last layer determined the quantity of features captured by each instance in the package across different aspects, corresponding to distinct feature extractors. The dimensions of the feature map depend on the size of the input image. Following convolution and pooling operations, the size transitions from the original 256 × 256 resolution to 8 × 8. Throughout this process, the feature map’s dimensions decreased while the semantic information contained within them increased.
In general, the process of acquiring CT images can be influenced by the scanning equipment and parameters, necessitating consideration of the model’s flexibility and generalization capability. The adapted ResNet-50 network effectively addresses these challenges, enabling adaptation to various tasks and application scenarios during subsequent processing.
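The truncated, single-channel ResNet-50 extractor described above can be assembled, for example, as in the following sketch. The torchvision-based construction is an assumption about the implementation rather than the authors' code, but it reproduces the stated behavior (1-channel input, no global average pooling or fully connected layer, 8 × 8 output maps for 256 × 256 slices).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_feature_extractor() -> nn.Module:
    """Truncated ResNet-50: 1-channel input, no global average pooling, no FC head."""
    net = resnet50(weights=None)
    # Adapt the first convolution to single-channel (grayscale CT) input.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Keep everything up to the last residual block; drop avgpool and fc.
    return nn.Sequential(*list(net.children())[:-2])

if __name__ == "__main__":
    extractor = build_feature_extractor()
    bag = torch.randn(30, 1, 256, 256)   # 30 slices (instances) of one bag
    feats = extractor(bag)
    print(feats.shape)                   # torch.Size([30, 2048, 8, 8])
```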
3.3. Multiple Instance Learning
Unlike typical binary supervised learning problems, the prediction target in the Multiple Instance Learning (MIL) problem is a collection (bag) of instances, denoted as $X = \{x_1, x_2, \ldots, x_K\}$. Instances within the bag exhibit no interdependence or ordering among each other. Throughout this process, it is presumed that the number of instances $K$ may vary across different bags, and each bag is associated with a label $Y \in \{0, 1\}$ that exists independently of the others. Traditional multi-instance learning assumes that each instance in the bag possesses an individual label $y_k \in \{0, 1\}$, $k = 1, \ldots, K$ [37]. The assumption underlying the MIL problem can be articulated as follows:
$$ Y = \begin{cases} 0, & \text{if } \sum_{k=1}^{K} y_k = 0, \\ 1, & \text{otherwise.} \end{cases} $$
However, obtaining labels at the instance level can be challenging and time-consuming in many practical scenarios. For instance, in CT images of lung cancer and COVID-19, medical professionals must examine all slices to identify lesion locations, and a lesion may span a contiguous series of adjacent slices. Therefore, it becomes imperative to define an aggregation function $\sigma$ that aggregates the spatial features of instances within a specified range, such as:
$$ z_j = \sigma\big(x_j, x_{j+1}, \ldots, x_{j+W-1}\big), \quad j = 1, \ldots, M, $$
where $z_j$ is the result of the $j$-th instance aggregation, $M$ denotes the number of primary aggregations within the bag, with $1 \le j \le M$, and $W$ is a hyperparameter signifying the range requiring particular attention for each instance.
In many cases, the number of instances with actual positive labels in a positive bag may be limited and non-contiguous, influenced by factors such as the slice thickness during CT scanning and the actual number of lesions present in the patient. Given this scenario, it becomes essential to adapt the two requisite transformation functions, $f$ and $g$, of the prediction model for the MIL problem, as follows:
$$ \hat{Y} = g\Big(\sigma\big(f(x_1), f(x_2), \ldots, f(x_K)\big)\Big). $$
In the current study, the choice of $f$ and $g$ is approached from different perspectives to address the problem [38]. The first is the instance-based method. Here, the transformation function $f$ is regarded as an instance-level classifier, while $g$ is considered an identity-like function that consolidates the instance scores. While this method offers high interpretability, scholars [39] have observed its inferior effectiveness compared to the embedding-based approach across all facets. The embedding-based method uses $f$ as a low-dimensional embedding feature extractor, with $g$ serving as a bag-level aggregation operator. By aggregating instance-level features into bag-level features, predictions for the bag label can be made.
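To make the distinction between the two paradigms concrete, the sketch below contrasts an instance-based and an embedding-based MIL head for a single bag, using mean pooling as a stand-in aggregation operator; it is a simplified illustration, not the classifier used in HLFSRNN-MIL.

```python
import torch
import torch.nn as nn

class InstanceBasedMIL(nn.Module):
    """f scores each instance; g simply aggregates the instance scores."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.f = nn.Linear(feat_dim, 1)          # instance-level classifier

    def forward(self, bag_feats: torch.Tensor) -> torch.Tensor:   # bag_feats: (K, feat_dim)
        instance_scores = torch.sigmoid(self.f(bag_feats))        # (K, 1), interpretable
        return instance_scores.mean()                             # bag-level probability

class EmbeddingBasedMIL(nn.Module):
    """f embeds each instance; g classifies the pooled bag-level embedding."""
    def __init__(self, feat_dim: int, emb_dim: int = 128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.g = nn.Linear(emb_dim, 1)           # bag-level classifier

    def forward(self, bag_feats: torch.Tensor) -> torch.Tensor:
        embeddings = self.f(bag_feats)           # (K, emb_dim)
        bag_embedding = embeddings.mean(dim=0)   # aggregation over instances
        return torch.sigmoid(self.g(bag_embedding))
```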
3.4. HLFFF Method for Processing High-Frequency and Low-Frequency Features
For the MIL problem parameterized by a neural network, $f$ is used as the feature extractor. This paper proposes an HLFFF method that combines the basic feature extraction network with HiLo attention [40], as shown in Figure 3. When facing the complexity of high-resolution images, the semantic information of the features can be re-enriched from the perspectives of high frequency and low frequency. This section explains the HLFFF method using the ResNet-50-based feature extraction network.
The output of the last residual block of ResNet-50 is preserved as the feature matrix of each instance, obtained under identical conditions; its dimensions are determined by the number of instances in the bag and the size of the hidden dimension. Subsequently, dimensionality reduction is applied to the feature vector of each instance, followed by global average pooling over a specified window size. The data before pooling and the data after pooling are used as the inputs for high-frequency attention and low-frequency attention, respectively.
Initially, leveraging the multi-head self-attention mechanism (MSA) [41], $N_h$ self-attention heads are designated, each assigned its own Query ($Q$), Key ($K$), and Value ($V$) matrices:
$$ Q_i = Z W_i^{q}, \qquad K_i = Z W_i^{k}, \qquad V_i = Z W_i^{v}, $$
where the high-frequency and low-frequency attention branches each comprise a subset of the heads, and $W_i^{q}$, $W_i^{k}$, and $W_i^{v}$ are the attention weights that each head can learn. The input $Z$ to the two branches is the feature map produced by the feature extractor, where $D$ represents the number of hidden dimensions and $D / N_h$ is the number of hidden dimensions corresponding to each self-attention head. Subsequently, the weighted sum of the output vectors of each head is computed:
$$ \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{D / N_h}}\right) V_i. $$
Then, the outputs of all heads are combined:
$$ \mathrm{MSA}(Z) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_{N_h}\big)\, W^{P}. $$
In the formulas, $W^{P}$ denotes a linear projection layer for the concatenated output, and $N_h$ represents the total number of self-attention heads in this layer, harmonizing the dimensions of the output and input.
In this process, a hyperparameter $\alpha$ is defined to regulate how many of the self-attention heads participate in the high-frequency and low-frequency branches, respectively: one portion of the $N_h$ heads is assigned to high frequency, while the remaining heads are allocated to low frequency. Depending on the nature of the problem at hand, the value of $\alpha$ can be adjusted to prioritize either high frequency or low frequency, thereby potentially reducing computational complexity to some extent.
High-frequency attention (Hi-Fi): the intuitive idea here is to use the high-frequency components to encode local details of the target, which is achieved by capturing high-frequency information within a local self-attention window (for example, a 2 × 2 window). Low-frequency attention (Lo-Fi): each window employs average pooling to acquire down-sampled feature information, which is then mapped to the Key and Value matrices, while the Query utilizes the original feature information. This approach not only captures abundant low-frequency information but also alleviates the complexity of Formulas (5) and (7).
Finally, the output of HLFFF is obtained by concatenating the two attention results:
$$ \mathrm{HLFFF}(Z) = \mathrm{Concat}\big(\text{Hi-Fi}(Z), \text{Lo-Fi}(Z)\big). $$
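The following sketch illustrates the head-splitting idea in a HiLo-style layer as assumed above: a fraction alpha of the heads (the low-frequency branch) takes its Keys and Values from average-pooled windows while keeping the original Queries, and the remaining heads form the high-frequency branch (simplified here to full-resolution self-attention instead of local windows). The module name, shapes, and defaults are illustrative assumptions rather than the exact HLFFF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoSplitAttention(nn.Module):
    """Splits self-attention heads between high- and low-frequency branches."""
    def __init__(self, dim: int, num_heads: int = 8, alpha: float = 0.5, window: int = 2):
        super().__init__()
        self.lo_heads = int(alpha * num_heads)        # heads attending to pooled features
        self.hi_heads = num_heads - self.lo_heads     # heads attending to full-resolution tokens
        self.window = window
        head_dim = dim // num_heads
        self.lo_attn = nn.MultiheadAttention(head_dim * self.lo_heads, self.lo_heads,
                                             batch_first=True)
        self.hi_attn = nn.MultiheadAttention(head_dim * self.hi_heads, self.hi_heads,
                                             batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature maps; B = instances in the bag
        lo_dim = self.lo_attn.embed_dim
        tokens = x.flatten(2).transpose(1, 2)                     # (B, H*W, C)
        q_lo, q_hi = tokens[..., :lo_dim], tokens[..., lo_dim:]
        # Low-frequency branch: Keys/Values come from average-pooled windows, Queries do not.
        pooled = F.avg_pool2d(x, self.window).flatten(2).transpose(1, 2)
        kv_lo = pooled[..., :lo_dim]
        lo_out, _ = self.lo_attn(q_lo, kv_lo, kv_lo)
        # High-frequency branch: self-attention on full-resolution tokens
        # (the paper restricts this to local windows; omitted here for brevity).
        hi_out, _ = self.hi_attn(q_hi, q_hi, q_hi)
        return torch.cat([lo_out, hi_out], dim=-1)                # concatenated high/low features
```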
3.5. SRNN Method for Processing Instances with Temporal Relationships
In the classic MIL scenario, there is a fundamental assumption that the instances within a bag are independent of one another. However, when a patient's 3D CT image is employed as a MIL bag, there exists a contiguous relationship between the slices. In this context, the target lesion at the same position may appear consecutively across several adjacent slices. Given that these lesions actually pertain to the same, morphologically similar structure, we propose using a recurrent neural network to process instance sequences of a certain length. In doing so, it is crucial to maintain the spatial order of the instances upon input. A group of sequentially continuous instances is termed a sequence.
The significance of the SRNN method lies in its ability to identify an instance similarity vector that aligns with the target variations within a sequence. However, the uncertainty regarding the number of instances in each bag presents a challenge. One existing approach is to employ a fixed-length sequence template, whose length must exceed that of every bag used; when a bag falls short of this length, the vacant portion is filled with zeros. This approach gives rise to two issues. Firstly, it can lead to inefficient utilization of computational resources when dealing with bags containing a small number of instances (the largest bag is approximately 10 times the size of the smallest bag). Secondly, the zero padding may degrade the learning performance of the network, particularly in cases of highly unbalanced data.
This paper proposes to employ a sliding window of size $W$ to encode sequential bags. The selection of $W$ entails additional configuration, yet within bags of varying sizes, encoded sample fragments emerge as the window slides over the ordered instances. After such processing, the sequence length of each input to the recurrent neural network equals $W$. Here, the sliding encoder produces the fragments, and the recurrent neural network is implemented as a long short-term memory (LSTM) network. The input is the feature map obtained by the HLFFF method, as shown in Figure 4. The $i$-th instance sequence obtained by the sliding encoder, in order, is considered to be the state at time $i$, and the gated unit at time $i$ is calculated by referring to the mathematical formulation of the LSTM. In this formulation, $\sigma$ represents the sigmoid function, while the remaining matrices denote the learnable weights within the LSTM; the sequence vector of the $t$-th time step of the $i$-th sequence serves as the input, and a hidden state and a memory unit are maintained at each time step. In particular, one dedicated weight matrix is employed to calculate the memory unit required for the next sequence: it maps the hidden state of the current input and of the previous time step to the memory unit needed for the subsequent sequence.
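A minimal sketch of the sliding-window-plus-LSTM idea is shown below, assuming a window of W = 5 and a standard LSTM cell; the paper's modified gating that carries the memory unit across sequences is not reproduced here, and the module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class SlidingRNN(nn.Module):
    """Scores every window of W consecutive instances with an LSTM."""
    def __init__(self, feat_dim: int, hidden: int = 256, window: int = 5):
        super().__init__()
        self.window = window
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, instance_feats: torch.Tensor) -> torch.Tensor:
        # instance_feats: (K, feat_dim), K = number of slices in the bag (slice order preserved)
        K = instance_feats.size(0)
        fragments = torch.stack([instance_feats[j:j + self.window]
                                 for j in range(K - self.window + 1)])  # (K-W+1, W, feat_dim)
        _, (h_last, _) = self.lstm(fragments)            # h_last: (1, K-W+1, hidden)
        return self.score(h_last.squeeze(0)).squeeze(-1) # one score per sliding fragment

# Usage idea: the fragment with the highest score marks the run of slices
# most relevant to the bag-level label.
```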
4. Experiments and Results
In this section, the effectiveness of the proposed method is validated using two large public datasets. The first comprises lung cancer data from TCIA, while the second is the CC-CCII dataset, which currently holds a prominent position in COVID-19 research.
4.1. Datasets and Experimental Settings
The TCIA lung cancer dataset comprises comprehensive lung cancer images updated in 2020, encompassing adenocarcinoma (A), small-cell carcinoma (B), and squamous-cell carcinoma (G). It provides DICOM format medical data files, facilitating the adjustment of imaging parameters as per specific requirements. On the other hand, the CC-CCII novel coronavirus pneumonia dataset, released in 2020, consists of 3D CT images of novel coronavirus pneumonia (NCP) and common pneumonia (CP). This dataset offers processed PNG format data files suitable for direct training.
In the experiments of this project, the learning ability and generalization ability of the proposed model are assessed on these two datasets separately. Detailed information on the two datasets is provided in Table 2, with the number of slices per bag ranging from 30 to 300. The TCIA dataset comprises 36,031 slices, while the CC-CCII dataset contains 36,950 slices. To evaluate the model's performance, 5-fold cross-validation is employed during the training and testing phases. All bags are split into training and test sets in an approximate 4:1 ratio. During each batch's training process, four-fifths of the training set is utilized to learn the weight parameters of the model, while the remaining one-fifth is used to assess the model's generalization ability. The experimental results are averaged across the five experiments.
When transforming the data into the tensor format required for the experiment, each 512 × 512 CT image is scaled down to 256 × 256, only one-quarter of the initial image size. This resizing reduces the overall file size and training time of the training data, while also alleviating the demand on video memory to some extent. Importantly, the test results obtained by the model on the resized data show minimal disparity from those on the unmodified data.
All experiments for this evaluation are conducted on a local workstation equipped with an Intel® Core™ i9-12900H CPU and an NVIDIA GeForce RTX 3070 Ti Laptop GPU. The CPU and GPU are manufactured by Intel and NVIDIA, respectively, and both were purchased in China. Given the large dataset, 32 GB of memory is used. The initial learning rate of the model is configured to 0.0005, and the learning rate of the Adam optimizer is subsequently decayed using the cosine annealing method. The number of training epochs for each model is set to 50.
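For reference, a training-loop configuration matching the stated settings (Adam with an initial learning rate of 0.0005, cosine-annealing decay, 50 epochs) could be sketched as follows; the model, data loader, and loss choice are placeholders.

```python
from torch import nn, optim

def train(model: nn.Module, train_loader, epochs: int = 50, device: str = "cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=5e-4)          # initial learning rate 0.0005
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        model.train()
        for bag, label in train_loader:                          # one bag (patient) per step
            bag, label = bag.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(bag).unsqueeze(0), label)     # bag-level prediction vs. label
            loss.backward()
            optimizer.step()
        scheduler.step()                                          # cosine annealing per epoch
```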
4.2. Evaluation Metrics
The problem solved in this paper is a binary classification problem. To evaluate the performance of the classification model in practical scenarios, typical classification evaluation indicators are used, including Sensitivity (SEN), Specificity (SPE), Precision, F1 score, accuracy (ACC), and the area under the receiver operating characteristic (ROC) curve (AUC). The area under the ROC curve is used to assess the model's ability to distinguish positive and negative instances across various thresholds. When the class distribution of the data is unbalanced, the use of AUC and ACC allows a more comprehensive evaluation of the overall performance of the model. The different classification models are evaluated by the above six evaluation indicators:
$$ SEN = \frac{TP}{TP + FN}, \qquad SPE = \frac{TN}{TN + FP}, \qquad Precision = \frac{TP}{TP + FP}, $$
$$ ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. $$
In the formulas, TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. Recall is calculated in the same way as Sensitivity (SEN). By computing these six indicators on the actual dataset, we can compare them with those of currently popular models to demonstrate the effectiveness of the proposed method.
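As a small illustration of how these indicators follow from the confusion-matrix counts, the generic helper below computes SEN, SPE, Precision, ACC, and F1 from binary predictions; it is not code from the paper (AUC, which requires continuous scores, is omitted).

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute SEN, SPE, Precision, ACC, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sen = tp / (tp + fn) if tp + fn else 0.0          # Sensitivity (= Recall)
    spe = tn / (tn + fp) if tn + fp else 0.0          # Specificity
    pre = tp / (tp + fp) if tp + fp else 0.0          # Precision
    acc = (tp + tn) / len(y_true) if y_true else 0.0  # Accuracy
    f1 = 2 * pre * sen / (pre + sen) if pre + sen else 0.0
    return {"SEN": sen, "SPE": spe, "Precision": pre, "ACC": acc, "F1": f1}
```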
4.3. Ablation Experiments on TCIA
This paper aims to investigate the contribution of the HLFFF and SRNN methods to the classification network. The HLFFF method requires setting the value of $\alpha$ to adjust the weight distribution between high frequency and low frequency in the method. Initially, ResNet-50 is utilized as the fundamental feature extraction method of the model. The value of $\alpha$ is set in the range of 0.1 to 0.9, with an interval of 0.1, yielding a total of nine values. The benchmark model of ResNet-50 combined with the HLFFF method is tested on the TCIA dataset. Table 3 summarizes the performance indicators under the different $\alpha$ values, and Figure 5A illustrates the area under the ROC curve under the varying $\alpha$ values. The results indicate that, at the optimal $\alpha$ value, the model performs best, with the AUC and ACC increasing by 7.8% and 6.7%, respectively, compared to the classification model utilizing only ResNet-50.
The SRNN method requires setting the value of $w$ to adjust the length of the instance fragments processed by the method. The experiment is carried out on the basis of the combination of the ResNet-50 and HLFFF methods. The value of $w$ is set between 2 and 9 with an interval of 1, yielding eight values in total. The benchmark model combining ResNet-50 with the HLFFF and SRNN methods is likewise tested on the TCIA dataset. Table 4 shows the performance indicators of the model under the different $w$ values, and Figure 5B shows the area under the ROC curve under the different $w$ values. The results show that the performance of the model is optimal when $w = 5$, meaning that sample fragments of length five are input into the SRNN method each time; the AUC and ACC are increased by 3.3% and 3.5%, respectively, compared with the model using only the HLFFF method.
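The two sweeps described above amount to a simple two-stage grid search, sketched below; evaluate_hlfff and evaluate_srnn are hypothetical placeholders standing in for one full training-plus-validation run per configuration.

```python
def evaluate_hlfff(alpha: float) -> float:
    """Placeholder: train ResNet-50 + HLFFF with this alpha and return the validation AUC."""
    raise NotImplementedError

def evaluate_srnn(alpha: float, w: int) -> float:
    """Placeholder: train ResNet-50 + HLFFF + SRNN with this fragment length and return AUC."""
    raise NotImplementedError

alphas = [round(0.1 * i, 1) for i in range(1, 10)]     # alpha = 0.1 ... 0.9 (HLFFF head split)
best_alpha = max(alphas, key=evaluate_hlfff)           # stage 1: pick the head-split ratio

windows = range(2, 10)                                 # w = 2 ... 9 (SRNN fragment length)
best_w = max(windows, key=lambda w: evaluate_srnn(best_alpha, w))  # stage 2: pick w
```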
In particular, we evaluated the accuracy of the model predictions based on the lesion site annotations in the original TCIA dataset. By generating weight heat maps for each sample fragment after applying the SRNN method, Figure 6 displays the weight heat maps for four bags, where each weight represents a fragment of length 5. The results demonstrate that the SRNN method adeptly identifies the fragment with the highest lesion weight. The red boxes in the graph denote annotation boxes marked according to the corresponding annotation information, highlighting the lesion sites identified by the SRNN method.
4.4. HLFSRNN-MIL Method and Its Performance on TCIA
To eliminate the influence of the feature extraction network on the proposed method, the experiments evaluate the performance of the proposed method combined with several currently popular feature extraction networks. Table 5 shows the basic information of the networks used. The initial learning rate is set to 0.0005, and the learning rate is decayed using the cosine annealing algorithm.
Table 6, corresponding to Table 5, shows the performance indices of the proposed method combined with the different feature extraction networks. It is evident that although AlexNet boasts a shorter training time, its performance across all metrics is notably poorer than that of ResNet-50, which exhibits the best performance metrics. Specifically, ResNet-50 achieves an AUC of 0.997, ACC of 0.992, SEN of 0.995, SPE of 0.997, and F1 score of 0.995.
Figure 7A depicts the area under the ROC curve of different feature extraction networks combined with the proposed method.
4.5. Attention Heat Maps Visualization for HLFSRNN-MIL on TCIA
Figure 8 illustrates the heat maps generated by the different feature extraction networks under various conditions, produced with the gradient-weighted class activation mapping (Grad-CAM) method. The first column in the figure shows the result for a slice from a lung cancer patient, with each row displaying the original CT slice image and the Grad-CAM results under the different network models. The observations reveal that models with better performance metrics tend to focus their attention on more concentrated and accurate areas, whereas models with lower indicators often concentrate attention on areas unrelated to the lesion site. Notably, the areas of interest for ResNet-50, MobileNetV3, and GoogleNet are predominantly centered on the lesion.
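Heat maps of this kind can be produced with a compact, hook-based Grad-CAM routine such as the sketch below, applied to the last convolutional stage of the backbone; this generic implementation is an assumption for illustration, not the visualization code used in the study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return a Grad-CAM heat map (H, W) for one CT slice; image is (1, H, W)."""
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    logits = model(image.unsqueeze(0))                 # (1, num_classes)
    model.zero_grad()
    logits[0, target_class].backward()                 # gradients w.r.t. the chosen class
    h1.remove(); h2.remove()

    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients per channel
    cam = F.relu((weights * activations["a"]).sum(dim=1))     # weighted channel sum, (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # normalize to [0, 1]
```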
4.6. Comparison of HLFSRNN-MIL with Other Methods on CC-CCII
Finally, the performance of the proposed HLFSRNN-MIL on the CC-CCII dataset is evaluated in terms of generalization and robustness. Firstly, we select the most outstanding performance indicators based on the different feature extraction networks. The results are presented in Table 7, revealing that ResNet-50 exhibits superior performance with an AUC of 0.997, ACC of 0.994, SEN of 0.996, SPE of 0.995, and F1 score of 0.992. To gain a more intuitive understanding of the classification effect across the various feature networks, Figure 7B illustrates the area under the ROC curve associated with each model.
In order to validate the efficacy of our proposed approach, we employ the HLFSRNN-MIL model on the CC-CCII dataset and compare it with state-of-the-art research.
Table 8 presents the performance results of related studies from 2021 to 2023 compared with ours. It should be noted, however, that owing to practical constraints (GPU memory and system memory), many studies selected only a portion of CC-CCII for experimental testing, whereas we divided CC-CCII into several parts, trained on them separately, and ultimately obtained very good results.