1. Introduction
Computational pathology has become a critical pillar of modern clinical diagnostics, particularly in oncology [1,2,3]. By analyzing whole slide images (WSIs), key tasks such as tumor subtyping, prognosis prediction, and treatment response assessment can be performed effectively, thereby advancing the development of personalized medicine [4,5,6]. WSIs are high-resolution two-dimensional color images generated by scanning tissue sections layer by layer using whole-slide scanners, capturing subcellular-level detail [7]. These images are typically derived from surgical tissue specimens or biopsy samples. Owing to their ultra-high resolution, as shown in Figure 1, WSIs provide comprehensive representations of tissue architecture and microscopic morphology, making them a valuable data foundation for AI-assisted diagnosis [8,9,10].
However, WSIs pose two major challenges for efficient and accurate computational analysis due to their large file sizes (often containing billions of pixels) and complex spatial structures [11,12]. First, acquiring high-quality annotations is extremely costly, making it difficult to obtain precise pixel-level labels for every region of a slide. Second, the computational burden of model training and inference is substantial, particularly when entire slides are processed at their original resolution. These issues significantly hinder the practical deployment of deep learning methods in clinical settings.
To address these challenges, researchers have proposed and widely adopted weakly supervised approaches based on multiple instance learning (MIL) [13,14]. In MIL, a WSI is treated as a bag of instances (e.g., image patches), and the model is trained using only slide-level labels, without requiring fine-grained annotations for each patch [15,16]. This framework greatly reduces the dependency on detailed labels. By learning the contribution of each instance to the overall classification outcome, MIL-based models can automatically identify diagnostically relevant regions, improving both model generalizability and interpretability. Nevertheless, traditional MIL frameworks suffer from limited receptive fields and insufficient capacity for modeling long-range dependencies, making it difficult to fully leverage the global context of WSIs.
In recent years, deep learning has achieved widespread success in the medical domain [17,18,19]. Architectures based on Transformers [20] and models such as state space models (SSMs) [21,22] have demonstrated strong performance owing to their global modeling capabilities and efficient sequence processing, and these advances have gradually attracted attention in computational pathology. ABMIL [23] introduces an attention-based aggregation mechanism that allows the model to weigh instance contributions differently, but it still lacks spatial feature refinement. CLAM [13], a clustering-constrained attention MIL framework, improves robustness by learning discriminative subpopulations within a slide; however, its reliance on multiple instance-level attention heads increases model complexity and training instability in certain settings. TransMIL [24] introduces a transformer-based attention mechanism to capture pairwise interactions between instances, improving the model's ability to reason over multiple patches; however, its reliance on self-attention can lead to high computational costs and scalability challenges when applied to gigapixel WSIs with thousands of instances per slide. S4MIL [25] leverages structured state space models to capture long-range dependencies efficiently and has demonstrated strong performance with reduced memory consumption; nevertheless, it does not explicitly address local spatial representation learning, which is critical in histopathological image analysis. MambaMIL [26], a recent model inspired by state-space dynamics and gated sequence modeling, further improves the efficiency and representation capacity of MIL, but it focuses primarily on sequential modeling and may overlook the importance of strong convolutional priors tailored to histological textures.
To overcome these bottlenecks, we propose a novel architecture called ConvMixerSSM, which integrates the local modeling capacity of convolutional neural networks with the long-range dependency modeling power of SSMs to enhance performance in cancer subtyping tasks. The proposed model consists of three main components: (1) A ConvMixer block, which employs depthwise separable convolutions to efficiently mix local spatial features. (2) An SSM block, a novel linear state-space sequence model with sub-quadratic complexity that enables global modeling of contextual dependencies across patches. (3) A feature-gated block, which incorporates a ReLU-based gating structure to dynamically focus on key instance features, thereby improving the identification of diagnostically important regions within the MIL framework. We conducted extensive experiments on The Cancer Genome Atlas (TCGA) lung cancer subtyping dataset and the CAMELYON16 breast cancer diagnosis dataset, and the results demonstrate that our approach consistently outperforms other methods. Our work represents a significant advancement in WSI analysis within the field of computational pathology. It holds substantial potential for promoting intelligent pathological slide analysis and assisting in tumor subtyping, thereby contributing meaningfully to the development of precision medicine.
2. Materials and Methods
2.1. Dataset
The datasets used in this study were obtained from TCGA and CAMELYON16 [27], both of which are publicly available and widely used in computational pathology research. TCGA is a public cancer genomics initiative led by the National Cancer Institute (NCI) of the United States. Specifically, we selected 1053 WSIs from the TCGA-NSCLC (Non-Small Cell Lung Cancer) subproject as the primary source for NSCLC subtyping. The dataset comprises diagnostic-grade hematoxylin and eosin (H&E)-stained slides from primary tumor tissues and includes two major subtypes: 512 WSIs of Lung Squamous Cell Carcinoma (LUSC) and 541 WSIs of Lung Adenocarcinoma (LUAD). All data are publicly accessible through the Genomic Data Commons (GDC) data portal at https://portal.gdc.cancer.gov/ (accessed on 1 January 2025). In addition, we incorporated the CAMELYON16 dataset to further evaluate the generalizability of our model in metastasis detection tasks. CAMELYON16 is a grand challenge dataset focused on detecting lymph node metastases in breast cancer patients, providing a total of 395 WSIs of H&E-stained sentinel lymph node sections. Among these, 236 WSIs are normal (i.e., without metastatic regions) and 159 WSIs contain metastases. The CAMELYON16 dataset is publicly available at https://camelyon16.grand-challenge.org/ (accessed on 1 January 2025).
2.2. Preliminary: Assumptions of MIL
In the multiple instance learning paradigm, each whole slide image $X_i$ is treated as a bag composed of a collection of instances (image patches) extracted from the tissue region. Due to the gigapixel scale of WSIs, each bag $X_i$ is expressed as a set of $L$ instance-level features:
$$ X_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,L}\}, \quad x_{i,l} \in \mathbb{R}^{D}, $$
where $L$ is the number of patches in the $i$-th slide and $D$ denotes the dimensionality of each feature embedding.
The label $Y_i$ (e.g., cancer subtype) is known at the bag level, but the labels $y_{i,l}$ of individual instances $x_{i,l}$ are unobserved. The fundamental assumption in MIL is that only a subset of instances contributes meaningfully to the bag label. Specifically, in binary classification tasks, the standard MIL assumption can be formalized as:
$$ Y_i = \begin{cases} 1, & \text{if } \exists\, l \text{ such that } y_{i,l} = 1, \\ 0, & \text{otherwise}, \end{cases} $$
where $y_{i,l} \in \{0, 1\}$. This assumption reflects many real-world biomedical scenarios, such as cancer detection, where only a small fraction of image patches within a slide may contain malignancies, and the rest may appear benign or irrelevant.
Given this setting, the goal of the MIL model is to learn a bag-level classifier:
$$ \hat{Y}_i = \mathcal{M}(X_i). $$
There is no explicit supervision on instance labels. However, due to memory and computational limitations, direct end-to-end modeling from raw WSIs to slide-level outputs is infeasible. Thus, the MIL process is typically decomposed into two stages:
Instance-level feature extraction: A feature extractor $f(\cdot)$ (e.g., domain-specific models like CONCH [28]) maps raw image patches $p_{i,l}$ to embeddings:
$$ x_{i,l} = f(p_{i,l}), \quad l = 1, \dots, L. $$
Bag-level classification: An aggregation function $g(\cdot)$ consumes the sequence $\{x_{i,1}, \dots, x_{i,L}\}$ and produces a slide-level prediction:
$$ \hat{Y}_i = g(x_{i,1}, x_{i,2}, \dots, x_{i,L}). $$
Equation (5) summarizes this two-stage MIL framework.
This setup underpins the design of our proposed ConvMixerSSM architecture, which focuses on optimizing the aggregation stage via convolution and long sequence modeling. The central challenge remains in identifying and amplifying the contribution of informative instances while suppressing irrelevant ones, all without access to ground-truth instance labels.
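To make this two-stage setup concrete, the following minimal PyTorch sketch shows how pre-extracted instance embeddings can be aggregated into a slide-level prediction by a simple attention-based function g. The module name, feature dimension, and bag size are illustrative assumptions, not our exact implementation.

import torch
import torch.nn as nn

class SimpleAttentionMIL(nn.Module):
    """Minimal bag-level aggregator g: instance embeddings -> slide-level logits."""
    def __init__(self, feat_dim=512, n_classes=2):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)            # one attention score per instance
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                            # bag: (L, D) pre-extracted features
        attn = torch.softmax(self.score(bag), dim=0)   # (L, 1), sums to 1 over instances
        slide_feat = (attn * bag).sum(dim=0)           # weighted pooling -> (D,)
        return self.classifier(slide_feat)             # slide-level logits

# Stage 1 (offline): x_il = f(p_il) with a frozen patch encoder such as CONCH.
# Stage 2 (trainable): the aggregator above consumes the saved embeddings.
bag = torch.randn(1500, 512)                           # one slide with 1500 patch embeddings
logits = SimpleAttentionMIL()(bag)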
2.3. Overview of ConvMixerSSM
Figure 2 illustrates the overall architecture of ConvMixerSSM, a novel framework designed for cancer subtyping under the MIL paradigm. The pipeline begins with background removal, a critical preprocessing step to eliminate irrelevant regions. Given the gigapixel scale of WSIs, retaining background areas not only introduces noise but also leads to a substantial computational burden. To mitigate this, tissue regions are extracted using a threshold-based filtering strategy. Subsequently, each WSI is divided into non-overlapping image patches, from which discriminative features are extracted using the pretrained CONCH [28] encoder. This process produces an instance-level feature sequence, where each instance corresponds to a localized tissue patch within the WSI.
The core of ConvMixerSSM consists of three main components: (1) A ConvMixer block, which leverages depthwise separable convolutions to efficiently capture local spatial features across patches. (2) An SSM block, which models long-range dependencies and contextual relationships among the patch-level features through state-space sequence modeling. (3) A feature-gated block, which incorporates a ReLU-based gating structure to dynamically focus on key instance features, thereby improving the identification of diagnostically important regions. Finally, the aggregated instance representations are fed into a multi-layer perceptron (MLP), which produces the slide-level prediction. This hierarchical design allows ConvMixerSSM to effectively handle the inherent complexity of WSIs while maintaining computational efficiency.
2.4. ConvMixer Block
To effectively extract local patch-level features from the input instances, we adopt a ConvMixer block that combines a depthwise convolution with a pointwise convolution. Specifically, given an input tensor $X \in \mathbb{R}^{B \times L \times D}$, where $B$ is the batch size, $L$ is the bag size (number of instances), and $D$ is the patch dimension, the block first applies a depthwise convolution:
$$ \tilde{X} = \mathrm{DWConv}(\mathrm{permute}(X)), $$
where $\mathrm{permute}$ denotes the permutation of the input tensor from $\mathbb{R}^{B \times L \times D}$ to $\mathbb{R}^{B \times D \times L}$.
The 1D depthwise convolution applies a separate convolutional filter to each individual channel (i.e., dimension $d$), without mixing information across channels. For a given channel $d \in \{1, \dots, D\}$, the output at position $l$ is computed as:
$$ \tilde{X}_{d,l} = \sum_{j=1}^{k} W^{\mathrm{dw}}_{d,j}\, X_{d,\, l+j-\lceil k/2 \rceil}, $$
where $W^{\mathrm{dw}}_{d} \in \mathbb{R}^{k}$ is the depthwise kernel for channel $d$ with kernel size $k$, $\tilde{X}$ is the output after depthwise convolution, and zero-padding is applied on both sides to maintain the same sequence length.
Then, it is followed by a pointwise convolution and a non-linear activation:
$$ Z = \sigma\big(\mathrm{PWConv}(\tilde{X})\big), $$
where $\sigma(\cdot)$ denotes a non-linear activation function (e.g., LeakyReLU). The pointwise convolution performs a $1 \times 1$ convolution that linearly combines the $D$ channels into $D'$ output channels at each position $l$. The output is computed as:
$$ Z_{d',l} = \sigma\!\left(\sum_{d=1}^{D} W^{\mathrm{pw}}_{d',d}\, \tilde{X}_{d,l}\right), $$
where $W^{\mathrm{pw}}_{d'} \in \mathbb{R}^{D}$ is the weight vector for output channel $d'$ and $Z$ is the final output after PWConv.
A residual connection is also applied:
$$ X_{\mathrm{out}} = \mathrm{permute}(Z) + X. $$
This design allows the block to capture local contextual dependencies across neighboring instance patches while preserving computational efficiency.
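A minimal PyTorch sketch of this block is given below. It follows the equations above (depthwise 1D convolution along the instance dimension, a pointwise convolution with LeakyReLU, and a residual connection); the kernel size and feature dimension are illustrative assumptions rather than our exact settings.

import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Sketch of the ConvMixer block: DWConv + PWConv + residual (dimensions are illustrative)."""
    def __init__(self, dim=512, kernel_size=3):
        super().__init__()
        # Depthwise conv: one filter per channel; padding keeps the sequence length L.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # Pointwise (1x1) conv: mixes the D channels at each position.
        self.pwconv = nn.Conv1d(dim, dim, kernel_size=1)
        self.act = nn.LeakyReLU()

    def forward(self, x):                  # x: (B, L, D)
        z = x.permute(0, 2, 1)             # -> (B, D, L) for Conv1d
        z = self.dwconv(z)                 # depthwise convolution
        z = self.act(self.pwconv(z))       # pointwise convolution + activation
        return z.permute(0, 2, 1) + x      # residual connection, back to (B, L, D)

x = torch.randn(1, 1500, 512)              # one bag of 1500 instances
out = ConvMixerBlock()(x)                  # same shape as the input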
2.5. SSM Block
To capture long-range dependencies among instance embeddings, the SSM block processes the input through a sequence of a linear projection, an SSM layer, a causal 1D convolution, and a final linear projection. Given a sequence of locally encoded features $F \in \mathbb{R}^{B \times L \times D}$, each feature vector is first projected to the SSM hidden dimension:
$$ U = \mathrm{Linear}_{\mathrm{in}}(F). $$
We then apply a continuous-to-discrete state-space transform inspired by the Mamba formulation:
$$ h_t = \bar{A} h_{t-1} + \bar{B} u_t, \qquad o_t = C h_t, $$
where $h_t$ is the state at time step $t$, $o_t$ is the output vector, and $\bar{A}$, $\bar{B}$, and $C$ are the learnable parameter matrices. The SSM improves the global modeling capability and maintains high efficiency through parallel processing.
To inject additional temporal locality and ensure causality, we apply a causal 1D convolution:
$$ c = \mathrm{CausalConv1D}(o), $$
where $o$ is the output of the SSM layer. Finally, a second linear layer projects back to the model dimension:
$$ F' = \mathrm{Linear}_{\mathrm{out}}(c). $$
An optional residual connection adds the original input $F$, yielding the block output:
$$ F_{\mathrm{out}} = F' + F. $$
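For illustration, the sketch below implements this block with a simplified, non-selective discrete state-space recurrence computed sequentially. The actual model follows a Mamba-style formulation with efficient parallel computation, so this should be read as a readable approximation under assumed dimensions rather than our exact implementation.

import torch
import torch.nn as nn

class SimpleSSMBlock(nn.Module):
    """Sketch: Linear -> discrete SSM recurrence -> causal Conv1d -> Linear, plus residual."""
    def __init__(self, dim=512, hidden=64):
        super().__init__()
        self.in_proj = nn.Linear(dim, hidden)
        self.A = nn.Parameter(torch.eye(hidden) * 0.9)              # state transition (learnable)
        self.B = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
        self.C = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
        self.causal_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=2)
        self.out_proj = nn.Linear(hidden, dim)

    def forward(self, f):                                            # f: (B, L, D)
        u = self.in_proj(f)                                          # (B, L, H)
        h = torch.zeros(u.size(0), u.size(2), device=u.device)
        outs = []
        for t in range(u.size(1)):                                   # h_t = A h_{t-1} + B u_t ; o_t = C h_t
            h = h @ self.A.T + u[:, t] @ self.B.T
            outs.append(h @ self.C.T)
        o = torch.stack(outs, dim=1)                                 # (B, L, H)
        c = self.causal_conv(o.permute(0, 2, 1))[:, :, : u.size(1)]  # trim right side -> causal
        c = c.permute(0, 2, 1)
        return self.out_proj(c) + f                                  # residual connection

f = torch.randn(1, 1500, 512)
out = SimpleSSMBlock()(f)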
2.6. Feature-Gated Block
For final representation aggregation and classification, we utilize a feature-gated block composed of an attention-based pooling mechanism followed by a linear classifier. Given the instance representations $H = \{h_1, h_2, \dots, h_L\}$, an attention score $a_l$ is computed for each instance as:
$$ a_l = \frac{\exp\!\big(w^{\top} \mathrm{ReLU}(V h_l)\big)}{\sum_{j=1}^{L} \exp\!\big(w^{\top} \mathrm{ReLU}(V h_j)\big)}, $$
where $V$ and $w$ are learnable parameters and ReLU serves as the gating activation. The bag-level representation is obtained via weighted pooling:
$$ z = \sum_{l=1}^{L} a_l\, h_l. $$
Finally, the bag-level feature vector is passed to an MLP to produce the final prediction:
$$ \hat{Y} = \mathrm{MLP}(z). $$
This feature-gated mechanism adaptively emphasizes informative instances while suppressing irrelevant ones, leading to more accurate bag-level predictions.
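The following PyTorch sketch illustrates this block (ReLU-gated attention scores, softmax-normalized weighted pooling, and an MLP head); the layer sizes and the exact parameterization of the attention scorer are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureGatedPooling(nn.Module):
    """Sketch: ReLU-gated attention scores -> weighted pooling -> MLP classifier."""
    def __init__(self, dim=512, attn_dim=128, n_classes=2):
        super().__init__()
        self.V = nn.Linear(dim, attn_dim)         # projects each instance before gating
        self.w = nn.Linear(attn_dim, 1)           # scalar attention score per instance
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, h):                          # h: (B, L, D) instance representations
        scores = self.w(torch.relu(self.V(h)))     # (B, L, 1), ReLU acts as the feature gate
        a = torch.softmax(scores, dim=1)           # attention weights over instances
        z = (a * h).sum(dim=1)                     # bag-level representation (B, D)
        return self.mlp(z)                         # slide-level prediction

h = torch.randn(1, 1500, 512)
logits = FeatureGatedPooling()(h)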
2.7. Implementation Details
We implemented our model using PyTorch 2.5.0 and conducted all experiments on a workstation equipped with an NVIDIA RTX 3090 GPU. Following the training strategy described by Yang et al. [26] (i.e., MambaMIL), we used the Adam optimizer with the same initial learning rate and weight decay as in that work. The model was trained with a slide-level batch size of 1, which is consistent with prior work in WSI-based MIL settings due to the large memory footprint of whole-slide feature bags. We trained for 50 epochs with an early stopping patience of 20 epochs, based on validation AUC.
For data preprocessing, as shown in Figure 3, each WSI was tiled into non-overlapping patches of size 512 × 512 pixels at ×20 magnification, resulting in a set of instance-level inputs for each slide. To extract meaningful features from each patch, we employed CONCH [28], a state-of-the-art foundation model pretrained on large-scale histopathological datasets, as our feature encoder.
To ensure a robust evaluation and reduce the influence of data partitioning and training randomness, we adopted a 5-fold cross-validation strategy on the TCGA-NSCLC dataset. In each fold, the data were randomly split into training, validation, and testing subsets following an 8:1:1 ratio. Specifically, 80% of the data were used for training, 10% for validation, and the remaining 10% for testing. Final performance metrics were averaged across the five folds.
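As a concrete illustration of this protocol, the sketch below draws one random 8:1:1 split of slide identifiers per fold; the identifier format and seeds are illustrative and do not reproduce our actual split files.

import random

def split_fold(slide_ids, seed):
    """Randomly split slide IDs into train/val/test with an 8:1:1 ratio."""
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

slide_ids = [f"TCGA-{i:04d}" for i in range(1053)]      # illustrative slide identifiers
folds = [split_fold(slide_ids, seed=fold) for fold in range(5)]
for fold, (train, val, test) in enumerate(folds):
    print(f"fold {fold}: train={len(train)}, val={len(val)}, test={len(test)}")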
2.8. Evaluation Metrics
To comprehensively evaluate the performance of our model on cancer subtyping, we adopted the following standard metrics:
Area Under the Curve of ROC (AUC). AUC measures the model's ability to distinguish between classes across different decision thresholds. It is widely used in medical image classification tasks due to its robustness to class imbalance. A higher AUC indicates better discriminative power. We computed AUC using the trapezoidal rule over the ROC curve derived from the true positive rate (TPR) and false positive rate (FPR):
$$ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\,\mathrm{d}(\mathrm{FPR}) \approx \sum_{i=1}^{n-1} \frac{(\mathrm{FPR}_{i+1} - \mathrm{FPR}_{i})\,(\mathrm{TPR}_{i+1} + \mathrm{TPR}_{i})}{2}. $$
Accuracy (ACC). Accuracy represents the proportion of correctly predicted WSIs among all predictions. While intuitive and easy to interpret, accuracy can be biased in the presence of class imbalance. It is defined as:
$$ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}. $$
F1 Score. The F1 score is the harmonic mean of precision and recall, balancing false positives and false negatives. It provides a more informative measure than accuracy when class distributions are skewed and is particularly useful for assessing the model's robustness in identifying minority classes. It is defined as:
$$ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}. $$
All metrics were computed on the slide level, and the final reported results represent the average across five cross-validation folds to ensure statistical reliability and reduce variance due to data partitioning.
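As an example of how these slide-level metrics can be computed, the sketch below uses scikit-learn on toy labels and predicted probabilities (the arrays are illustrative only).

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Toy slide-level outputs for one fold: ground-truth labels and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.8, 0.65, 0.3, 0.9, 0.4, 0.55, 0.7])
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)     # trapezoidal AUC over the ROC curve
acc = accuracy_score(y_true, y_pred)    # fraction of correctly classified slides
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(f"AUC={auc:.4f}, ACC={acc:.4f}, F1={f1:.4f}")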
3. Results
3.1. Results of Data Preprocessing
As shown in Table 1, the TCGA-NSCLC dataset contains 1053 WSIs, while the CAMELYON16 dataset includes 395 WSIs. Both datasets exhibit substantial variability in image resolution. For TCGA-NSCLC, the image sizes range from a minimum of 10,000 × 4617 pixels to a maximum of 191,352 × 97,078 pixels. Similarly, CAMELYON16 slides range from 45,056 × 35,840 pixels to 217,088 × 111,104 pixels. This diversity in slide dimensions reflects differences in tissue sampling, staining, and scanning protocols across institutions, and it introduces challenges for model standardization and computational efficiency.
Following the preprocessing procedure described in Section 2.7, the bag size (i.e., the number of patches per WSI) varied significantly across the datasets. In the TCGA-NSCLC dataset, the bag size ranged from 35 to 11,747 patches per slide. For CAMELYON16, the range was similarly wide, from 40 to 11,221 patches per slide. This variation highlights the heterogeneous nature of histopathological content in clinical slides and emphasizes the need for robust feature aggregation strategies in downstream tasks.
3.2. Main Results
Table 2 presents the cancer subtyping performance on the TCGA-NSCLC dataset using various MIL methods. Our proposed model, ConvMixerSSM, consistently outperforms all compared approaches across the three evaluation metrics.
Specifically, ConvMixerSSM achieves the highest AUC of 97.83%, surpassing both traditional pooling strategies (e.g., Max Pooling: 97.21%, Mean Pooling: 97.07%) and recent advanced MIL frameworks such as TransMIL (97.06%), S4MIL (97.43%), MambaMIL (97.34%), and TMIL (97.29%). This demonstrates the superior ability of ConvMixerSSM to distinguish between LUAD and LUSC on the slide level.
In terms of classification accuracy, ConvMixerSSM attains 91.82%, outperforming TMIL (91.34%), MambaMIL (91.21%) and S4MIL (91.03%). Moreover, ConvMixerSSM achieves the highest F1 score of 91.18%, reflecting its robust balance between precision and recall in identifying cancer subtypes. These improvements are attributed to the synergy of its three components.
In particular, although both ConvMixerSSM and MambaMIL leverage the sequence modeling capabilities of SSM, our method consistently achieves superior performance across all three evaluation metrics. ConvMixerSSM attains higher AUC (97.83% vs. 97.34%), accuracy (91.82% vs. 91.21%), and F1 score (91.18% vs. 90.76%), indicating its enhanced capability in discriminating between LUAD and LUSC. Furthermore, ConvMixerSSM exhibits lower standard deviations on all metrics, suggesting greater stability and robustness. This consistent improvement can be attributed to the effective integration of convolutional feature mixing and global dependency modeling via SSM, enabling ConvMixerSSM to extract both local and contextual information more efficiently than MambaMIL.
Figure 4 compares ConvMixerSSM with other models regarding loss reduction on the validation set, demonstrating that ConvMixerSSM achieves a faster decrease in loss throughout the training process.
To further evaluate the generalizability of our approach, we conducted experiments on the CAMELYON16 dataset, as shown in Table 3. While ConvMixerSSM does not achieve the highest scores on all evaluation metrics, it attains the best AUC of 98.95%, outperforming all baseline and state-of-the-art MIL methods, including S4MIL (98.85%), MambaMIL (98.33%), and TMIL (97.80%).
These findings highlight the strength of ConvMixerSSM in capturing nuanced morphological differences between positive and negative samples. The high AUC further supports the effectiveness of combining convolutional feature mixing with state-space modeling, enabling the model to generalize well beyond its training distribution and maintain robust performance on histopathological tasks.
3.3. Comparison with Different Feature-Gated Methods
To investigate the effectiveness of different activation functions within the feature-gated block of our proposed ConvMixerSSM model, we conducted a comparison study by replacing the default ReLU activation with several commonly used alternatives, including SiLU, Sigmoid, and Tanh. We also tested a variant without any activation function to isolate the impact of non-linearity. The results of this comparison on the TCGA-NSCLC dataset are summarized in Table 4.
Among all the configurations, ConvMixerSSM with ReLU activation achieves the best overall performance, yielding the highest AUC (97.83%), ACC (91.82%), and F1 score (91.18%). In contrast, other activation functions, such as Tanh and Sigmoid, showed inferior results. Although the variant with no activation performed reasonably well (AUC 97.29%, F1 90.87%), it still fell short of the performance achieved by ReLU.
The superior performance of ReLU in our MIL framework can be attributed to its ability to introduce sparse activations and highlight discriminative features more effectively. This sparsity is particularly beneficial in the MIL setting, where only a small subset of instances (patches) within a bag (WSI) may be truly informative for the slide-level classification task. ReLU helps the model focus on these key instances by suppressing less relevant ones, thus enhancing the instance selection process within the attention-based aggregation module. Moreover, ReLU’s simple and efficient non-linearity avoids potential issues like vanishing gradients or overly smooth feature transformations, which can occur with functions such as Sigmoid or Tanh. The results suggest that ReLU serves as a strong gating mechanism in the context of pathological image classification, effectively balancing feature suppression and enhancement for robust subtype discrimination.
This comparison highlights the critical role of activation design in feature gating. The ReLU-based gating block, as used in ConvMixerSSM, proves to be the most effective choice for capturing informative patterns in WSIs under the MIL framework.
3.4. Comparison with Different Depths of ConvMixerSSM
To assess the impact of model depth on performance and computational efficiency, we conducted experiments with our proposed ConvMixerSSM framework at varying depths (i.e., the number of stacked ConvMixer + SSM blocks). Specifically, we evaluated depths of 1, 2, and 3 and report the results in Table 5.
The configuration with depth ×1 achieved the highest AUC of 97.83% and tied for the best F1 score (91.18%), while maintaining a strong ACC of 91.82%, slightly below the best observed (92.30% with depth ×3). The depth ×3 variant achieved the highest accuracy (92.30%) and competitive F1 score (91.10%), but at the cost of increased model complexity and computational burden. The depth ×2 variant yielded similar performance but did not surpass either of the other two in any metric.
These results suggest that a shallower ConvMixerSSM (depth ×1) strikes an optimal balance between model performance and computational efficiency. While deeper architectures may slightly improve certain metrics such as accuracy, the gains are marginal and may not justify the significantly increased inference time and memory usage, particularly in the context of WSI classification, where input sizes are enormous and scalability is critical.
Furthermore, the strong performance of the depth ×1 configuration highlights the robust representational capacity of the ConvMixerSSM building blocks. Even a single block is sufficient to extract meaningful local and global features, perform effective instance selection, and yield high discriminative power under the MIL framework for lung cancer subtype classification.
3.5. Computational Complexity Analysis
To evaluate the computational efficiency of different MIL methods, we compare the number of parameters and floating point operations (FLOPs) in Table 6. Compared with TransMIL, our proposed ConvMixerSSM significantly reduces the model size, with 1.98 M parameters versus 2.67 M, and achieves lower computational complexity in terms of FLOPs (17.8 G vs. 24.8 G). Similarly, ConvMixerSSM also requires fewer parameters than TMIL (1.98 M vs. 2.70 M), with only a slightly higher FLOPs count (17.8 G vs. 15.42 G).
Although S4MIL and MambaMIL are more lightweight in terms of both parameters (1.05 M and 0.59 M, respectively) and FLOPs (9.46 G and 5.36 G, respectively), our model consistently outperforms them in classification accuracy, AUC, and F1 score across multiple datasets. This demonstrates that ConvMixerSSM offers a favorable trade-off between computational cost and predictive performance.
Furthermore, our model achieves fast inference speed in practice. For a WSI of size 51,200 × 51,200 pixels, ConvMixerSSM requires only 1.65 ms to complete inference, making it suitable for real-world deployment scenarios where both accuracy and efficiency are critical. These results confirm that ConvMixerSSM achieves a good balance between model complexity and performance, offering competitive or superior predictive capability with acceptable computational overhead.
3.6. Ablation Study
To evaluate the effectiveness of individual components within our proposed ConvMixerSSM architecture, we conducted an ablation study on the TCGA-NSCLC dataset. Specifically, we assessed the contribution of three key modules: the SSM block, the ConvMixer block, and the feature-gated block. The results are summarized in Table 7.
Starting with a baseline model that includes only the SSM block, we achieved an AUC of 97.21%, an ACC of 90.66%, and an F1 score of 89.96%, indicating that SSM alone already provides a strong foundation for WSI classification. In terms of computational complexity, this version requires 15.42 GFLOPs and 1.71 M parameters. Incorporating the feature-gated block on top of the SSM block led to a notable improvement across all metrics, particularly increasing the AUC to 97.76% and the F1 score to 91.09%, demonstrating the effectiveness of feature-level recalibration in enhancing discriminative power under the MIL setting. Similarly, when the ConvMixer block was integrated with the SSM block (without feature gating), the model achieved better results than SSM alone. This validates the benefit of ConvMixer's feature mixing in capturing local dependencies across image patches. Next, we evaluated a configuration that includes the ConvMixer and feature-gated modules but excludes SSM. This variant achieved the highest ACC (91.79%) and a tied F1 score (91.09%) among all ablated versions, along with an AUC of 97.39%. Despite being the lightest model in the ablation group, with only 5.34 GFLOPs and 0.59 M parameters, it demonstrates strong discriminative performance, highlighting the effectiveness of convolutional mixing and adaptive feature selection in capturing local patterns.
The full ConvMixerSSM model, which integrates all three components, achieved the best performance across all metrics. It has a moderate computational cost of 17.8 GFLOPs and 1.98 M parameters, which remains acceptable for practical deployment. Compared with the version without the feature-gated block (i.e., only SSM + ConvMixer), the full model shows absolute gains of +0.54% in AUC, +0.48% in ACC, and +0.31% in F1 score. These improvements demonstrate the complementary advantages of each module and highlight the importance of jointly leveraging feature mixing, sequential modeling, and feature selection to capture complex patterns in histopathological images.
3.7. Visualization
To further demonstrate the interpretability and effectiveness of our proposed ConvMixerSSM model in the task of cancer subtyping, we visualize the model's attention response using heatmaps and top-scoring patches, as shown in Figure 5. The visualizations include three components for each sample: the original whole-slide image (WSI), the corresponding attention heatmap generated by our model, and the top-k patches with the highest prediction scores.
The heatmaps clearly highlight the regions of interest that contribute most to the model’s decision, aligning closely with known tumor areas as verified by pathologists. The top-scoring patches consistently correspond to morphologically abnormal regions indicative of malignancy. These results indicate that ConvMixerSSM not only achieves strong classification performance but also provides reliable spatial localization of discriminative tumor regions under the weakly supervised multiple instance learning (MIL) setting. Such visualization enhances the transparency and interpretability of our model and demonstrates its potential utility in assisting pathologists with diagnostic insights at the subtype level.
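As a simplified illustration of how such heatmaps can be produced, the sketch below maps patch-level attention scores onto a low-resolution slide grid and selects the top-k patches; the grid construction, colormap, and file name are illustrative and do not reproduce our exact rendering pipeline.

import numpy as np
import matplotlib.pyplot as plt

def attention_heatmap(coords, scores, grid_shape):
    """Place each patch's attention score on a low-resolution slide grid."""
    heat = np.zeros(grid_shape)
    for (row, col), s in zip(coords, scores):
        heat[row, col] = s
    return heat

# Toy example: a 6 x 8 grid of patches with random attention scores.
rng = np.random.default_rng(0)
coords = [(r, c) for r in range(6) for c in range(8)]
scores = rng.random(len(coords))
heat = attention_heatmap(coords, scores, grid_shape=(6, 8))

plt.imshow(heat, cmap="jet", interpolation="bilinear")   # overlay on the WSI thumbnail in practice
plt.colorbar(label="attention score")
plt.savefig("heatmap_example.png")

top_k = np.argsort(scores)[::-1][:5]                     # indices of the k highest-scoring patches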
To qualitatively assess the interpretability of our model’s visual outputs, we invited two experienced pathologists from tertiary hospitals to evaluate the heatmaps generated by ConvMixerSSM. We randomly selected a total of seven WSIs and their corresponding heatmaps: three from LUAD cases and four from LUSC cases. Each heatmap was overlaid on the original WSI to highlight regions identified as highly predictive of the cancer subtype. The evaluation followed a structured questionnaire titled WSI Heatmap Evaluation. For each WSI–heatmap pair, pathologists were asked to rate the relevance and consistency of the highlighted tumor regions using a 5-point Likert scale (1—very poor, 2—poor, 3—fair, 4—good, 5—excellent).
Each expert independently reviewed all seven cases. As shown in
Figure 6, the first pathologist rated five out of seven cases as good (4) and the remaining two as fair (3). The second pathologist provided one rating of excellent (5), four ratings of good (4), and two ratings of fair (3).
These evaluations indicate that the majority of the visualized results were regarded as clinically meaningful and aligned with expert knowledge of tumor morphology. The consistent recognition of relevant tumor regions, especially in LUAD and LUSC slides, supports the potential of our model not only for automated prediction but also for assisting interpretability and decision-making in computational pathology workflows.
4. Discussion
In this study, we proposed ConvMixerSSM, a novel and effective multiple instance learning framework for cancer subtype classification based on whole-slide images. The model is composed of three key components: (1) a ConvMixer block for extracting localized spatial features from each patch, (2) an SSM block to capture long-range and global dependencies among instance embeddings, and (3) a feature-gated block that adaptively emphasizes informative instances through learnable activation mechanisms. This architecture is designed to address the high heterogeneity and weak-label nature of pathology images. Our ablation studies demonstrate that each of these blocks contributes significantly to performance improvement. Furthermore, the proposed model achieves not only excellent quantitative results in terms of AUC, accuracy, and F1 score, but also provides reliable visual interpretability through heatmap-based patch importance visualization, which highlights tumor-related regions and confirms the model’s ability to focus on relevant histopathological features.
The development of artificial intelligence models capable of automatically distinguishing cancer subtypes and accurately localizing tumor regions holds profound implications for both computational pathology and clinical practice [29,30,31]. In computational pathology, such models enable scalable and objective analysis of large-scale whole-slide images, significantly reducing the reliance on manual annotations and inter-observer variability [32,33]. The ability to differentiate between histologically similar but clinically distinct cancer subtypes is critical for guiding downstream molecular testing, prognosis estimation, and personalized treatment decisions. Traditional workflows often involve time-consuming and subjective assessment by expert pathologists, whereas AI-driven subtype classification can serve as a robust decision-support system, enhancing diagnostic consistency and throughput [34,35,36].
We systematically compared ConvMixerSSM with a variety of existing MIL-based models. Traditional MIL pooling strategies, such as Max Pooling and Mean Pooling, treat all instances in a bag either equally or select only the most salient one, potentially ignoring useful contextual or distributional information across patches. Although these simple strategies show competitive results, they lack the ability to model complex inter-instance dependencies and rich local patterns, which limits their capacity for accurate subtyping in heterogeneous tumor environments.
Our proposed ConvMixerSSM effectively integrates the strengths of convolutional modeling and structured sequence processing. By combining ConvMixer blocks, which are well-suited for extracting fine-grained spatial features from image patches, with an SSM for efficient long-range dependency modeling, ConvMixerSSM achieves a balanced and comprehensive representation. From a structural perspective, the enhanced performance of ConvMixerSSM can be attributed to its ability to disentangle and hierarchically process both local and global information. The ConvMixer block acts as a localized filter bank, capturing texture, shape, and edge-level features inherent to histopathological structures. These features correspond to morphological variations such as nuclear density, glandular formations, or stromal textures that are critical for tumor subtyping. In contrast, the SSM (state space model) captures sequence-level context by treating the patch sequence as a structured signal, enabling it to model tumor growth patterns, cellular organization gradients, and tissue architecture over large spatial extents. This emulates how pathologists scan slides by integrating both focal detail and overall structure. Furthermore, the inclusion of a feature-gated mechanism enhances the model’s ability to emphasize relevant instance-level features, further improving the discriminative power under weak supervision. Notably, ConvMixerSSM outperforms all comparison methods in AUC, accuracy, and F1 score, indicating its robustness and generalization ability across diverse WSIs. Its relatively shallow architecture (as shown in the depth comparison study) also suggests that the model can achieve strong performance with fewer layers, leading to lower computational cost and faster inference, which are desirable for clinical deployment.
Importantly, our visualization strategy, which highlights the top-k most informative patches and generates corresponding attention heatmaps, has significant clinical value. The visual outputs clearly delineate tumor-related regions within WSIs, aligning well with pathologist-annotated areas. Such interpretability not only enhances the trustworthiness of the model but also provides actionable cues for clinical pathologists, enabling them to quickly identify suspicious regions for further analysis. This can potentially reduce diagnostic time, improve consistency among human readers, and assist in educational settings for trainee pathologists. In addition, explainability is foundational for clinical trust. This interpretability fosters confidence among medical professionals, facilitating acceptance and integration of AI tools in daily practice. We believe that combining strong performance with explainable visual outputs is essential for the successful deployment of AI-based pathology tools in real-world clinical workflows.
Moreover, this study successfully applies structured state space models (SSMs), particularly in combination with convolutional layers, to large-scale histopathology image analysis. While prior work often relied on transformer-based or recurrent approaches for sequence modeling, our results highlight the efficiency and expressiveness of SSMs in modeling patch-wise dependencies [37,38,39]. This demonstrates the potential of SSMs to become a new paradigm for large-scale, context-aware modeling in medical imaging. The success of ConvMixerSSM underscores the effectiveness of integrating local feature extraction (via convolution), global dependency modeling (via SSM), and adaptive instance selection (via gating) within a unified MIL architecture. This hybrid design not only improves performance but also enhances robustness and interpretability, providing a valuable blueprint for future WSI analysis models in computational pathology.
While the proposed method shows promising results, we acknowledge certain limitations in our current work. All experiments were conducted solely on the TCGA-NSCLC and CAMELYON16 datasets, so the generalizability of our model to other cancer types or histopathological conditions remains to be validated. In future work, we plan to extend ConvMixerSSM to additional publicly available WSI datasets involving other cancers (e.g., colon, kidney) and multi-class classification settings to assess its robustness and applicability across domains. Additionally, ConvMixerSSM can be extended to tackle more complex pathological tasks, such as tumor grading, multi-class cancer classification, or WSI-level segmentation. We also envision adapting this architecture to self-supervised or few-shot learning scenarios, thereby enhancing its applicability to rare diseases or small datasets.
5. Conclusions
In this study, we proposed ConvMixerSSM, a novel and effective MIL framework for cancer subtype classification based on whole-slide images (WSIs). The model integrates a state space model (SSM) block for long-range dependency modeling, a ConvMixer block for local feature extraction, and a feature-gated module to adaptively enhance discriminative representations under weak supervision. Extensive experiments on the TCGA-NSCLC dataset demonstrate that ConvMixerSSM achieves state-of-the-art performance, with an AUC of 97.83%, an ACC of 91.82%, and an F1 score of 91.18%, outperforming all comparison methods. In addition, validation on the CAMELYON16 dataset further confirms the model’s generalization ability, where ConvMixerSSM achieves the best AUC (98.95%) among all methods. Beyond accuracy, ConvMixerSSM demonstrates high computational efficiency, requiring only 1.65 ms to process a 51,200 × 51,200 WSI, which makes it feasible for real-world deployment. Furthermore, our visualization results show that ConvMixerSSM can accurately localize tumor regions, offering clinically meaningful interpretability and highlighting its potential as a trustworthy decision support tool for pathologists. Taken together, these findings indicate that ConvMixerSSM is not only a powerful and efficient classifier but also a generalizable and interpretable model well-suited for clinical integration.