1. Introduction
Computational pathology has become a critical pillar of modern clinical diagnostics, particularly in oncology [1,2,3]. By analyzing whole slide images (WSIs), key tasks such as tumor subtyping, prognosis prediction, and treatment response assessment can be performed effectively, thereby advancing the development of personalized medicine [4,5,6]. WSIs are high-resolution two-dimensional color images generated by scanning tissue sections layer by layer using whole-slide scanners, capturing subcellular-level detail [7]. These images are typically derived from surgical tissue specimens or biopsy samples. Owing to their ultra-high resolution, as shown in Figure 1, WSIs provide comprehensive representations of tissue architecture and microscopic morphology, making them a valuable data foundation for AI-assisted diagnosis [8,9,10].
However, WSIs pose two major challenges for efficient and accurate computational analysis due to their large file sizes (often containing billions of pixels) and complex spatial structures [11,12]. First, acquiring high-quality annotations is extremely costly, making it difficult to obtain precise pixel-level labels for every region of a slide. Second, the computational burden of model training and inference is substantial, particularly when entire slides are processed at their original resolution. These issues significantly hinder the practical deployment of deep learning methods in clinical settings.
To address these challenges, researchers have proposed and widely adopted weakly supervised approaches based on multiple instance learning (MIL) [13,14]. In MIL, a WSI is treated as a bag of instances (e.g., image patches), and the model is trained using only slide-level labels, without requiring fine-grained annotations for each patch [15,16]. This framework greatly reduces the dependency on detailed labels. By learning the contribution of each instance to the overall classification outcome, MIL-based models can automatically identify diagnostically relevant regions, improving both model generalizability and interpretability. Nevertheless, traditional MIL frameworks suffer from limited receptive fields and insufficient capacity for modeling long-range dependencies, making it difficult to fully leverage the global context of WSIs.
In recent years, deep learning has achieved widespread success in the medical domain [17,18,19]. Architectures based on Transformers [20] and models such as state space models (SSMs) [21,22] have demonstrated strong performance owing to their global modeling capabilities and efficient sequence processing, and these advances have gradually attracted attention in computational pathology. ABMIL [23] introduces an attention-based aggregation mechanism that allows the model to weigh instance contributions differently, but it still lacks spatial feature refinement. CLAM [13], a clustering-constrained attention MIL framework, improves robustness by learning discriminative subpopulations within a slide; however, its reliance on multiple instance-level attention heads increases model complexity and training instability in certain settings. TransMIL [24] introduces a transformer-based attention mechanism to capture pairwise interactions between instances, improving the model's ability to reason over multiple patches; however, its reliance on self-attention can lead to high computational costs and scalability challenges when applied to gigapixel WSIs with thousands of instances per slide. S4MIL [25] leverages structured state space models to capture long-range dependencies efficiently and has demonstrated strong performance with reduced memory consumption; nevertheless, it does not explicitly address local spatial representation learning, which is critical in histopathological image analysis. MambaMIL [26], a recent model inspired by state-space dynamics and gated sequence modeling, further improves the efficiency and representation capacity of MIL, but it focuses primarily on sequential modeling and may overlook the importance of strong convolutional priors tailored to histological textures.
To overcome these bottlenecks, we propose a novel architecture called ConvMixerSSM, which integrates the local modeling capacity of convolutional neural networks with the long-range dependency modeling power of SSMs to enhance performance in cancer subtyping tasks. The proposed model consists of three main components: (1) A ConvMixer block, which employs depthwise separable convolutions to efficiently mix local spatial features. (2) An SSM block, a novel linear state-space sequence model with sub-quadratic complexity that enables global modeling of contextual dependencies across patches. (3) A feature-gated block, which incorporates a ReLU-based gating structure to dynamically focus on key instance features, thereby improving the identification of diagnostically important regions within the MIL framework. We conducted extensive experiments on The Cancer Genome Atlas (TCGA) lung cancer subtyping dataset and the CAMELYON16 breast cancer diagnosis dataset, and the results demonstrate that our approach consistently outperforms other methods. Our work represents a significant advancement in WSI analysis within the field of computational pathology. It holds substantial potential for promoting intelligent pathological slide analysis and assisting in tumor subtyping, thereby contributing meaningfully to the development of precision medicine.
2. Materials and Methods
2.1. Dataset
The datasets used in this study were obtained from TCGA and CAMELYON16 [27], both of which are publicly available and widely used in computational pathology research. TCGA is a public cancer genomics initiative led by the National Cancer Institute (NCI) of the United States. Specifically, we selected 1053 WSIs from the TCGA-NSCLC (Non-Small Cell Lung Cancer) subproject as the primary source for NSCLC subtyping. The dataset comprises diagnostic-grade hematoxylin and eosin (H&E)-stained slides from primary tumor tissues and includes two major subtypes: 512 WSIs of Lung Squamous Cell Carcinoma (LUSC) and 541 WSIs of Lung Adenocarcinoma (LUAD). All data are publicly accessible through the Genomic Data Commons (GDC) data portal at https://portal.gdc.cancer.gov/ (accessed on 1 January 2025). In addition, we incorporated the CAMELYON16 dataset to further evaluate the generalizability of our model in metastasis detection tasks. CAMELYON16 is a grand challenge dataset focused on detecting lymph node metastases in breast cancer patients, providing a total of 395 WSIs of H&E-stained sentinel lymph node sections. Among these, 236 WSIs are normal (i.e., without metastatic regions) and 159 WSIs contain metastases. The CAMELYON16 dataset is publicly available at https://camelyon16.grand-challenge.org/ (accessed on 1 January 2025).
2.2. Preliminary: Assumptions of MIL
In the multiple instance learning paradigm, each whole slide image $X_i$ is treated as a bag composed of a collection of instances (image patches) extracted from the tissue region. Due to the gigapixel scale of WSIs, each bag $X_i$ is expressed as a set of $L$ instance-level features:
$$ X_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,L}\}, \quad x_{i,l} \in \mathbb{R}^{D}, $$
where $L$ is the number of patches in the $i$-th slide and $D$ denotes the dimensionality of each feature embedding.
The label $Y_i$ (e.g., cancer subtype) is known at the bag level, but the labels $y_{i,l}$ of individual instances $x_{i,l}$ are unobserved. The fundamental assumption in MIL is that only a subset of instances contributes meaningfully to the bag label. Specifically, in binary classification tasks, the standard MIL assumption can be formalized as:
$$ Y_i = \begin{cases} 1, & \text{if } \exists\, l \text{ such that } y_{i,l} = 1, \\ 0, & \text{otherwise}, \end{cases} $$
where $y_{i,l} \in \{0, 1\}$. This assumption reflects many real-world biomedical scenarios, such as cancer detection, where only a small fraction of image patches within a slide may contain malignancies, and the rest may appear benign or irrelevant.
Given this setting, the goal of the MIL model is to learn a bag-level classifier:
$$ \hat{Y}_i = \mathcal{M}(X_i). $$
There is no explicit supervision on instance labels. However, due to memory and computational limitations, direct end-to-end modeling from raw WSIs to slide-level outputs is infeasible. Thus, the MIL process is typically decomposed into two stages:
Instance-level feature extraction: A feature extractor $f(\cdot)$ (e.g., domain-specific models like CONCH [28]) maps raw image patches $p_{i,l}$ to embeddings:
$$ x_{i,l} = f(p_{i,l}), \quad l = 1, \dots, L. $$
Bag-level classification: An aggregation function $g(\cdot)$ consumes the sequence $\{x_{i,1}, \dots, x_{i,L}\}$ and produces a slide-level prediction:
$$ \hat{Y}_i = g(x_{i,1}, x_{i,2}, \dots, x_{i,L}). $$
Equation (5) summarizes this two-stage MIL framework.
This setup underpins the design of our proposed ConvMixerSSM architecture, which focuses on optimizing the aggregation stage via convolution and long sequence modeling. The central challenge remains in identifying and amplifying the contribution of informative instances while suppressing irrelevant ones, all without access to ground-truth instance labels.
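To make this two-stage setup concrete, the following minimal PyTorch sketch shows how pre-extracted instance embeddings can be aggregated into a slide-level prediction by a simple attention-based function g. The module name, feature dimension, and bag size are illustrative assumptions, not our exact implementation.

import torch
import torch.nn as nn

class SimpleAttentionMIL(nn.Module):
    """Minimal bag-level aggregator g: instance embeddings -> slide-level logits."""
    def __init__(self, feat_dim=512, n_classes=2):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)            # one attention score per instance
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                            # bag: (L, D) pre-extracted features
        attn = torch.softmax(self.score(bag), dim=0)   # (L, 1), sums to 1 over instances
        slide_feat = (attn * bag).sum(dim=0)           # weighted pooling -> (D,)
        return self.classifier(slide_feat)             # slide-level logits

# Stage 1 (offline): x_il = f(p_il) with a frozen patch encoder such as CONCH.
# Stage 2 (trainable): the aggregator above consumes the saved embeddings.
bag = torch.randn(1500, 512)                           # one slide with 1500 patch embeddings
logits = SimpleAttentionMIL()(bag)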
2.3. Overview of ConvMixerSSM
Figure 2 illustrates the overall architecture of ConvMixerSSM, a novel framework designed for cancer subtyping under the MIL paradigm. The pipeline begins with background removal, a critical preprocessing step to eliminate irrelevant regions. Given the gigapixel scale of WSIs, retaining background areas not only introduces noise but also leads to a substantial computational burden. To mitigate this, tissue regions are extracted using a threshold-based filtering strategy. Subsequently, each WSI is divided into non-overlapping image patches, from which discriminative features are extracted using the pretrained CONCH [28] encoder. This process produces an instance-level feature sequence, where each instance corresponds to a localized tissue patch within the WSI.
The core of ConvMixerSSM consists of three main components: (1) A ConvMixer block, which leverages depthwise separable convolutions to efficiently capture local spatial features across patches. (2) An SSM block, which models long-range dependencies and contextual relationships among the patch-level features through state-space sequence modeling. (3) A feature-gated block, which incorporates a ReLU-based gating structure to dynamically focus on key instance features, thereby improving the identification of diagnostically important regions. Finally, the aggregated instance representations are fed into a multi-layer perceptron (MLP), which produces the slide-level prediction. This hierarchical design allows ConvMixerSSM to effectively handle the inherent complexity of WSIs while maintaining computational efficiency.
2.4. ConvMixer Block
To effectively extract local patch-level features from the input instances, we adopt a ConvMixer block that combines a depthwise convolution with a pointwise convolution. Specifically, given an input tensor $X \in \mathbb{R}^{B \times L \times D}$, where $B$ is the batch size, $L$ is the bag size (number of instances), and $D$ is the patch dimension, the block first applies a depthwise convolution:
$$ \tilde{X} = \mathrm{DWConv}(\mathrm{permute}(X)), $$
where $\mathrm{permute}$ denotes the permutation of the input tensor from $\mathbb{R}^{B \times L \times D}$ to $\mathbb{R}^{B \times D \times L}$.
The 1D depthwise convolution applies a separate convolutional filter to each individual channel (i.e., dimension $d$), without mixing information across channels. For a given channel $d \in \{1, \dots, D\}$, the output at position $l$ is computed as:
$$ \tilde{X}_{d,l} = \sum_{j=1}^{k} W^{\mathrm{dw}}_{d,j}\, X_{d,\, l+j-\lceil k/2 \rceil}, $$
where $W^{\mathrm{dw}}_{d} \in \mathbb{R}^{k}$ is the depthwise kernel for channel $d$ with kernel size $k$, $\tilde{X}$ is the output after depthwise convolution, and zero-padding is applied on both sides to maintain the same sequence length.
Then, it is followed by a pointwise convolution and a non-linear activation:
$$ Z = \sigma\big(\mathrm{PWConv}(\tilde{X})\big), $$
where $\sigma(\cdot)$ denotes a non-linear activation function (e.g., LeakyReLU). The pointwise convolution performs a $1 \times 1$ convolution that linearly combines the $D$ channels into $D'$ output channels at each position $l$. The output is computed as:
$$ Z_{d',l} = \sigma\!\left(\sum_{d=1}^{D} W^{\mathrm{pw}}_{d',d}\, \tilde{X}_{d,l}\right), $$
where $W^{\mathrm{pw}}_{d'} \in \mathbb{R}^{D}$ is the weight vector for output channel $d'$ and $Z$ is the final output after PWConv.
A residual connection is also applied:
$$ X_{\mathrm{out}} = \mathrm{permute}(Z) + X. $$
This design allows the block to capture local contextual dependencies across neighboring instance patches while preserving computational efficiency.
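A minimal PyTorch sketch of this block is given below. It follows the equations above (depthwise 1D convolution along the instance dimension, a pointwise convolution with LeakyReLU, and a residual connection); the kernel size and feature dimension are illustrative assumptions rather than our exact settings.

import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Sketch of the ConvMixer block: DWConv + PWConv + residual (dimensions are illustrative)."""
    def __init__(self, dim=512, kernel_size=3):
        super().__init__()
        # Depthwise conv: one filter per channel; padding keeps the sequence length L.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # Pointwise (1x1) conv: mixes the D channels at each position.
        self.pwconv = nn.Conv1d(dim, dim, kernel_size=1)
        self.act = nn.LeakyReLU()

    def forward(self, x):                  # x: (B, L, D)
        z = x.permute(0, 2, 1)             # -> (B, D, L) for Conv1d
        z = self.dwconv(z)                 # depthwise convolution
        z = self.act(self.pwconv(z))       # pointwise convolution + activation
        return z.permute(0, 2, 1) + x      # residual connection, back to (B, L, D)

x = torch.randn(1, 1500, 512)              # one bag of 1500 instances
out = ConvMixerBlock()(x)                  # same shape as the input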
2.5. SSM Block
To capture long-range dependencies among instance embeddings, the SSM block processes the input through a sequence of a linear projection, an SSM layer, a causal 1D convolution, and a final linear projection. Given a sequence of locally encoded features $F \in \mathbb{R}^{B \times L \times D}$, each feature vector is first projected to the SSM hidden dimension:
$$ U = \mathrm{Linear}_{\mathrm{in}}(F). $$
We then apply a continuous-to-discrete state-space transform inspired by the Mamba formulation:
$$ h_t = \bar{A} h_{t-1} + \bar{B} u_t, \qquad o_t = C h_t, $$
where $h_t$ is the state at time step $t$, $o_t$ is the output vector, and $\bar{A}$, $\bar{B}$, and $C$ are the learnable parameter matrices. The SSM improves the global modeling capability and maintains high efficiency through parallel processing.
To inject additional temporal locality and ensure causality, we apply a causal 1D convolution:
$$ c = \mathrm{CausalConv1D}(o), $$
where $o$ is the output of the SSM layer. Finally, a second linear layer projects back to the model dimension:
$$ F' = \mathrm{Linear}_{\mathrm{out}}(c). $$
An optional residual connection adds the original input $F$, yielding the block output:
$$ F_{\mathrm{out}} = F' + F. $$
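For illustration, the sketch below implements this block with a simplified, non-selective discrete state-space recurrence computed sequentially. The actual model follows a Mamba-style formulation with efficient parallel computation, so this should be read as a readable approximation under assumed dimensions rather than our exact implementation.

import torch
import torch.nn as nn

class SimpleSSMBlock(nn.Module):
    """Sketch: Linear -> discrete SSM recurrence -> causal Conv1d -> Linear, plus residual."""
    def __init__(self, dim=512, hidden=64):
        super().__init__()
        self.in_proj = nn.Linear(dim, hidden)
        self.A = nn.Parameter(torch.eye(hidden) * 0.9)              # state transition (learnable)
        self.B = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
        self.C = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
        self.causal_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=2)
        self.out_proj = nn.Linear(hidden, dim)

    def forward(self, f):                                            # f: (B, L, D)
        u = self.in_proj(f)                                          # (B, L, H)
        h = torch.zeros(u.size(0), u.size(2), device=u.device)
        outs = []
        for t in range(u.size(1)):                                   # h_t = A h_{t-1} + B u_t ; o_t = C h_t
            h = h @ self.A.T + u[:, t] @ self.B.T
            outs.append(h @ self.C.T)
        o = torch.stack(outs, dim=1)                                 # (B, L, H)
        c = self.causal_conv(o.permute(0, 2, 1))[:, :, : u.size(1)]  # trim right side -> causal
        c = c.permute(0, 2, 1)
        return self.out_proj(c) + f                                  # residual connection

f = torch.randn(1, 1500, 512)
out = SimpleSSMBlock()(f)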
2.6. Feature-Gated Block
For final representation aggregation and classification, we utilize a feature-gated block composed of an attention-based pooling mechanism followed by a linear classifier. Given the instance representations $H = \{h_1, h_2, \dots, h_L\}$, an attention score $a_l$ is computed for each instance as:
$$ a_l = \frac{\exp\!\big(w^{\top} \mathrm{ReLU}(V h_l)\big)}{\sum_{j=1}^{L} \exp\!\big(w^{\top} \mathrm{ReLU}(V h_j)\big)}, $$
where $V$ and $w$ are learnable parameters and ReLU serves as the gating activation. The bag-level representation is obtained via weighted pooling:
$$ z = \sum_{l=1}^{L} a_l\, h_l. $$
Finally, the bag-level feature vector is passed to an MLP to produce the final prediction:
$$ \hat{Y} = \mathrm{MLP}(z). $$
This feature-gated mechanism adaptively emphasizes informative instances while suppressing irrelevant ones, leading to more accurate bag-level predictions.
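The following PyTorch sketch illustrates this block (ReLU-gated attention scores, softmax-normalized weighted pooling, and an MLP head); the layer sizes and the exact parameterization of the attention scorer are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureGatedPooling(nn.Module):
    """Sketch: ReLU-gated attention scores -> weighted pooling -> MLP classifier."""
    def __init__(self, dim=512, attn_dim=128, n_classes=2):
        super().__init__()
        self.V = nn.Linear(dim, attn_dim)         # projects each instance before gating
        self.w = nn.Linear(attn_dim, 1)           # scalar attention score per instance
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, h):                          # h: (B, L, D) instance representations
        scores = self.w(torch.relu(self.V(h)))     # (B, L, 1), ReLU acts as the feature gate
        a = torch.softmax(scores, dim=1)           # attention weights over instances
        z = (a * h).sum(dim=1)                     # bag-level representation (B, D)
        return self.mlp(z)                         # slide-level prediction

h = torch.randn(1, 1500, 512)
logits = FeatureGatedPooling()(h)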
2.7. Implementation Details
We implemented our model using PyTorch 2.5.0 and conducted all experiments on a workstation equipped with an NVIDIA RTX 3090 GPU. Following the training strategy described by Yang et al. [26] (i.e., MambaMIL), we used the Adam optimizer with the same initial learning rate and weight decay as in that work. The model was trained with a slide-level batch size of 1, which is consistent with prior work in WSI-based MIL settings due to the large memory footprint of whole-slide feature bags. We trained for 50 epochs with an early stopping patience of 20 epochs, based on validation AUC.
For data preprocessing, as shown in Figure 3, each WSI was tiled into non-overlapping patches of size 512 × 512 pixels at ×20 magnification, resulting in a set of instance-level inputs for each slide. To extract meaningful features from each patch, we employed CONCH [28], a state-of-the-art foundation model pretrained on large-scale histopathological datasets, as our feature encoder.
To ensure a robust evaluation and reduce the influence of data partitioning and training randomness, we adopted a 5-fold cross-validation strategy on the TCGA-NSCLC dataset. In each fold, the data were randomly split into training, validation, and testing subsets following an 8:1:1 ratio. Specifically, 80% of the data were used for training, 10% for validation, and the remaining 10% for testing. Final performance metrics were averaged across the five folds.
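As a concrete illustration of this protocol, the sketch below draws one random 8:1:1 split of slide identifiers per fold; the identifier format and seeds are illustrative and do not reproduce our actual split files.

import random

def split_fold(slide_ids, seed):
    """Randomly split slide IDs into train/val/test with an 8:1:1 ratio."""
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

slide_ids = [f"TCGA-{i:04d}" for i in range(1053)]      # illustrative slide identifiers
folds = [split_fold(slide_ids, seed=fold) for fold in range(5)]
for fold, (train, val, test) in enumerate(folds):
    print(f"fold {fold}: train={len(train)}, val={len(val)}, test={len(test)}")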
2.8. Evaluation Metrics
To comprehensively evaluate the performance of our model on cancer subtyping, we adopted the following standard metrics:
Area Under the Curve of ROC (AUC). AUC measures the model's ability to distinguish between classes across different decision thresholds. It is widely used in medical image classification tasks due to its robustness to class imbalance. A higher AUC indicates better discriminative power. We computed AUC using the trapezoidal rule over the ROC curve derived from the true positive rate (TPR) and false positive rate (FPR):
$$ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\,\mathrm{d}(\mathrm{FPR}) \approx \sum_{i=1}^{n-1} \frac{(\mathrm{FPR}_{i+1} - \mathrm{FPR}_{i})\,(\mathrm{TPR}_{i+1} + \mathrm{TPR}_{i})}{2}. $$
Accuracy (ACC). Accuracy represents the proportion of correctly predicted WSIs among all predictions. While intuitive and easy to interpret, accuracy can be biased in the presence of class imbalance. It is defined as:
$$ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}. $$
F1 Score. The F1 score is the harmonic mean of precision and recall, balancing false positives and false negatives. It provides a more informative measure than accuracy when class distributions are skewed and is particularly useful for assessing the model's robustness in identifying minority classes. It is defined as:
$$ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}. $$
All metrics were computed on the slide level, and the final reported results represent the average across five cross-validation folds to ensure statistical reliability and reduce variance due to data partitioning.
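As an example of how these slide-level metrics can be computed, the sketch below uses scikit-learn on toy labels and predicted probabilities (the arrays are illustrative only).

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Toy slide-level outputs for one fold: ground-truth labels and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.8, 0.65, 0.3, 0.9, 0.4, 0.55, 0.7])
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)     # trapezoidal AUC over the ROC curve
acc = accuracy_score(y_true, y_pred)    # fraction of correctly classified slides
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(f"AUC={auc:.4f}, ACC={acc:.4f}, F1={f1:.4f}")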
3. Results
3.1. Results of Data Preprocessing
As shown in Table 1, the TCGA-NSCLC dataset contains 1053 WSIs, while the CAMELYON16 dataset includes 395 WSIs. Both datasets exhibit substantial variability in image resolution. For TCGA-NSCLC, the image sizes range from a minimum of 10,000 × 4617 pixels to a maximum of 191,352 × 97,078 pixels. Similarly, CAMELYON16 slides range from 45,056 × 35,840 pixels to 217,088 × 111,104 pixels. This diversity in slide dimensions reflects differences in tissue sampling, staining, and scanning protocols across institutions, and it introduces challenges for model standardization and computational efficiency.
Following the preprocessing procedure described in Section 2.7, the bag size (i.e., the number of patches per WSI) varied significantly across the datasets. In the TCGA-NSCLC dataset, the bag size ranged from 35 to 11,747 patches per slide. For CAMELYON16, the range was similarly wide, from 40 to 11,221 patches per slide. This variation highlights the heterogeneous nature of histopathological content in clinical slides and emphasizes the need for robust feature aggregation strategies in downstream tasks.
3.2. Main Results
Table 2 presents the cancer subtyping performance on the TCGA-NSCLC dataset using various MIL methods. Our proposed model, ConvMixerSSM, consistently outperforms all compared approaches across the three evaluation metrics.
Specifically, ConvMixerSSM achieves the highest AUC of 97.83%, surpassing both traditional pooling strategies (e.g., Max Pooling: 97.21%, Mean Pooling: 97.07%) and recent advanced MIL frameworks such as TransMIL (97.06%), S4MIL (97.43%), MambaMIL (97.34%), and TMIL (97.29%). This demonstrates the superior ability of ConvMixerSSM to distinguish between LUAD and LUSC on the slide level.
In terms of classification accuracy, ConvMixerSSM attains 91.82%, outperforming TMIL (91.34%), MambaMIL (91.21%) and S4MIL (91.03%). Moreover, ConvMixerSSM achieves the highest F1 score of 91.18%, reflecting its robust balance between precision and recall in identifying cancer subtypes. These improvements are attributed to the synergy of its three components.
In particular, although both ConvMixerSSM and MambaMIL leverage the sequence modeling capabilities of SSM, our method consistently achieves superior performance across all three evaluation metrics. ConvMixerSSM attains higher AUC (97.83% vs. 97.34%), accuracy (91.82% vs. 91.21%), and F1 score (91.18% vs. 90.76%), indicating its enhanced capability in discriminating between LUAD and LUSC. Furthermore, ConvMixerSSM exhibits lower standard deviations on all metrics, suggesting greater stability and robustness. This consistent improvement can be attributed to the effective integration of convolutional feature mixing and global dependency modeling via SSM, enabling ConvMixerSSM to extract both local and contextual information more efficiently than MambaMIL.
Figure 4 compares ConvMixerSSM with other models regarding loss reduction on the validation set, demonstrating that ConvMixerSSM achieves a faster decrease in loss throughout the training process.
To further evaluate the generalizability of our approach, we conducted experiments on the CAMELYON16 dataset, as shown in Table 3. While ConvMixerSSM does not achieve the highest scores on all evaluation metrics, it attains the best AUC of 98.95%, outperforming all baseline and state-of-the-art MIL methods, including S4MIL (98.85%), MambaMIL (98.33%), and TMIL (97.80%).
These findings highlight the strength of ConvMixerSSM in capturing nuanced morphological differences between positive and negative samples. The high AUC further supports the effectiveness of combining convolutional feature mixing with state-space modeling, enabling the model to generalize well beyond its training distribution and maintain robust performance on histopathological tasks.
3.3. Comparison with Different Feature-Gated Methods
To investigate the effectiveness of different activation functions within the feature-gated block of our proposed ConvMixerSSM model, we conducted a comparison study by replacing the default ReLU activation with several commonly used alternatives, including SiLU, Sigmoid, and Tanh. We also tested a variant without any activation function to isolate the impact of non-linearity. The results of this comparison on the TCGA-NSCLC dataset are summarized in Table 4.
Among all the configurations, ConvMixerSSM with ReLU activation achieves the best overall performance, yielding the highest AUC (97.83%), ACC (91.82%), and F1 score (91.18%). In contrast, other activation functions, such as Tanh and Sigmoid, showed inferior results. Although the variant with no activation performed reasonably well (AUC 97.29%, F1 90.87%), it still fell short of the performance achieved by ReLU.
The superior performance of ReLU in our MIL framework can be attributed to its ability to introduce sparse activations and highlight discriminative features more effectively. This sparsity is particularly beneficial in the MIL setting, where only a small subset of instances (patches) within a bag (WSI) may be truly informative for the slide-level classification task. ReLU helps the model focus on these key instances by suppressing less relevant ones, thus enhancing the instance selection process within the attention-based aggregation module. Moreover, ReLU’s simple and efficient non-linearity avoids potential issues like vanishing gradients or overly smooth feature transformations, which can occur with functions such as Sigmoid or Tanh. The results suggest that ReLU serves as a strong gating mechanism in the context of pathological image classification, effectively balancing feature suppression and enhancement for robust subtype discrimination.
This comparison highlights the critical role of activation design in feature gating. The ReLU-based gating block, as used in ConvMixerSSM, proves to be the most effective choice for capturing informative patterns in WSIs under the MIL framework.
3.4. Comparison with Different Depths of ConvMixerSSM
To assess the impact of model depth on performance and computational efficiency, we conducted experiments with our proposed ConvMixerSSM framework at varying depths (i.e., the number of stacked ConvMixer + SSM blocks). Specifically, we evaluated depths of 1, 2, and 3 and report the results in Table 5.
The configuration with depth ×1 achieved the highest AUC of 97.83% and tied for the best F1 score (91.18%), while maintaining a strong ACC of 91.82%, slightly below the best observed (92.30% with depth ×3). The depth ×3 variant achieved the highest accuracy (92.30%) and competitive F1 score (91.10%), but at the cost of increased model complexity and computational burden. The depth ×2 variant yielded similar performance but did not surpass either of the other two in any metric.
These results suggest that a shallower ConvMixerSSM (depth ×1) strikes an optimal balance between model performance and computational efficiency. While deeper architectures may slightly improve certain metrics such as accuracy, the gains are marginal and may not justify the significantly increased inference time and memory usage, particularly in the context of WSI classification, where input sizes are enormous and scalability is critical.
Furthermore, the strong performance of the depth ×1 configuration highlights the robust representational capacity of the ConvMixerSSM building blocks. Even a single block is sufficient to extract meaningful local and global features, perform effective instance selection, and yield high discriminative power under the MIL framework for lung cancer subtype classification.
3.5. Computational Complexity Analysis
To evaluate the computational efficiency of different MIL methods, we compare the number of parameters and floating point operations (FLOPs) in Table 6. Compared with TransMIL, our proposed ConvMixerSSM significantly reduces the model size, with 1.98 M parameters versus 2.67 M, and achieves lower computational complexity in terms of FLOPs (17.8 G vs. 24.8 G). Similarly, ConvMixerSSM also requires fewer parameters than TMIL (1.98 M vs. 2.70 M), with only a slightly higher FLOPs count (17.8 G vs. 15.42 G).
Although S4MIL and MambaMIL are more lightweight in terms of both parameters (1.05 M and 0.59 M, respectively) and FLOPs (9.46 G and 5.36 G, respectively), our model consistently outperforms them in classification accuracy, AUC, and F1 score across multiple datasets. This demonstrates that ConvMixerSSM offers a favorable trade-off between computational cost and predictive performance.
Furthermore, our model achieves fast inference speed in practice. For a WSI of size 51,200 × 51,200 pixels, ConvMixerSSM requires only 1.65 ms to complete inference, making it suitable for real-world deployment scenarios where both accuracy and efficiency are critical. These results confirm that ConvMixerSSM achieves a good balance between model complexity and performance, offering competitive or superior predictive capability with acceptable computational overhead.
3.6. Ablation Study
To evaluate the effectiveness of individual components within our proposed ConvMixerSSM architecture, we conducted an ablation study on the TCGA-NSCLC dataset. Specifically, we assessed the contribution of three key modules: the SSM block, the ConvMixer block, and the feature-gated block. The results are summarized in Table 7.
Starting with a baseline model that includes only the SSM block, we achieved an AUC of 97.21%, an ACC of 90.66%, and an F1 score of 89.96%, indicating that SSM alone already provides a strong foundation for WSI classification. In terms of computational complexity, this version requires 15.42 GFLOPs and 1.71 M parameters. Incorporating the feature-gated block on top of the SSM block led to a notable improvement across all metrics, particularly increasing the AUC to 97.76% and the F1 score to 91.09%, demonstrating the effectiveness of feature-level recalibration in enhancing discriminative power under the MIL setting. Similarly, when the ConvMixer block was integrated with the SSM block (without feature gating), the model achieved better results than SSM alone. This validates the benefit of ConvMixer's feature mixing in capturing local dependencies across image patches. Next, we evaluated a configuration that includes the ConvMixer and feature-gated modules but excludes SSM. This variant achieved the highest ACC (91.79%) and a tied F1 score (91.09%) among all ablated versions, along with an AUC of 97.39%. Despite being the lightest model in the ablation group, with only 5.34 GFLOPs and 0.59 M parameters, it demonstrates strong discriminative performance, highlighting the effectiveness of convolutional mixing and adaptive feature selection in capturing local patterns.
The full ConvMixerSSM model, which integrates all three components, achieved the best performance across all metrics. It has a moderate computational cost of 17.8 GFLOPs and 1.98 M parameters, which remains acceptable for practical deployment. Compared with the version without the feature-gated block (i.e., only SSM + ConvMixer), the full model shows absolute gains of +0.54% in AUC, +0.48% in ACC, and +0.31% in F1 score. These improvements demonstrate the complementary advantages of each module and highlight the importance of jointly leveraging feature mixing, sequential modeling, and feature selection to capture complex patterns in histopathological images.
3.7. Visualization
To further demonstrate the interpretability and effectiveness of our proposed ConvMixerSSM model in the task of cancer subtyping, we visualize the model's attention response using heatmaps and top-scoring patches, as shown in Figure 5. The visualizations include three components for each sample: the original whole-slide image (WSI), the corresponding attention heatmap generated by our model, and the top-k patches with the highest prediction scores.
The heatmaps clearly highlight the regions of interest that contribute most to the model’s decision, aligning closely with known tumor areas as verified by pathologists. The top-scoring patches consistently correspond to morphologically abnormal regions indicative of malignancy. These results indicate that ConvMixerSSM not only achieves strong classification performance but also provides reliable spatial localization of discriminative tumor regions under the weakly supervised multiple instance learning (MIL) setting. Such visualization enhances the transparency and interpretability of our model and demonstrates its potential utility in assisting pathologists with diagnostic insights at the subtype level.
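As a simplified illustration of how such heatmaps can be produced, the sketch below maps patch-level attention scores onto a low-resolution slide grid and selects the top-k patches; the grid construction, colormap, and file name are illustrative and do not reproduce our exact rendering pipeline.

import numpy as np
import matplotlib.pyplot as plt

def attention_heatmap(coords, scores, grid_shape):
    """Place each patch's attention score on a low-resolution slide grid."""
    heat = np.zeros(grid_shape)
    for (row, col), s in zip(coords, scores):
        heat[row, col] = s
    return heat

# Toy example: a 6 x 8 grid of patches with random attention scores.
rng = np.random.default_rng(0)
coords = [(r, c) for r in range(6) for c in range(8)]
scores = rng.random(len(coords))
heat = attention_heatmap(coords, scores, grid_shape=(6, 8))

plt.imshow(heat, cmap="jet", interpolation="bilinear")   # overlay on the WSI thumbnail in practice
plt.colorbar(label="attention score")
plt.savefig("heatmap_example.png")

top_k = np.argsort(scores)[::-1][:5]                     # indices of the k highest-scoring patches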
To qualitatively assess the interpretability of our model’s visual outputs, we invited two experienced pathologists from tertiary hospitals to evaluate the heatmaps generated by ConvMixerSSM. We randomly selected a total of seven WSIs and their corresponding heatmaps: three from LUAD cases and four from LUSC cases. Each heatmap was overlaid on the original WSI to highlight regions identified as highly predictive of the cancer subtype. The evaluation followed a structured questionnaire titled WSI Heatmap Evaluation. For each WSI–heatmap pair, pathologists were asked to rate the relevance and consistency of the highlighted tumor regions using a 5-point Likert scale (1—very poor, 2—poor, 3—fair, 4—good, 5—excellent).
Each expert independently reviewed all seven cases. As shown in
Figure 6, the first pathologist rated five out of seven cases as good (4) and the remaining two as fair (3). The second pathologist provided one rating of excellent (5), four ratings of good (4), and two ratings of fair (3).
These evaluations indicate that the majority of the visualized results were regarded as clinically meaningful and aligned with expert knowledge of tumor morphology. The consistent recognition of relevant tumor regions, especially in LUAD and LUSC slides, supports the potential of our model not only for automated prediction but also for assisting interpretability and decision-making in computational pathology workflows.
4. Discussion
In this study, we proposed ConvMixerSSM, a novel and effective multiple instance learning framework for cancer subtype classification based on whole-slide images. The model is composed of three key components: (1) a ConvMixer block for extracting localized spatial features from each patch, (2) an SSM block to capture long-range and global dependencies among instance embeddings, and (3) a feature-gated block that adaptively emphasizes informative instances through learnable activation mechanisms. This architecture is designed to address the high heterogeneity and weak-label nature of pathology images. Our ablation studies demonstrate that each of these blocks contributes significantly to performance improvement. Furthermore, the proposed model achieves not only excellent quantitative results in terms of AUC, accuracy, and F1 score, but also provides reliable visual interpretability through heatmap-based patch importance visualization, which highlights tumor-related regions and confirms the model’s ability to focus on relevant histopathological features.
The development of artificial intelligence models capable of automatically distinguishing cancer subtypes and accurately localizing tumor regions holds profound implications for both computational pathology and clinical practice [29,30,31]. In computational pathology, such models enable scalable and objective analysis of large-scale whole-slide images, significantly reducing the reliance on manual annotations and inter-observer variability [32,33]. The ability to differentiate between histologically similar but clinically distinct cancer subtypes is critical for guiding downstream molecular testing, prognosis estimation, and personalized treatment decisions. Traditional workflows often involve time-consuming and subjective assessment by expert pathologists, whereas AI-driven subtype classification can serve as a robust decision-support system, enhancing diagnostic consistency and throughput [34,35,36].
We systematically compared ConvMixerSSM with a variety of existing MIL-based models. Traditional MIL pooling strategies, such as Max Pooling and Mean Pooling, treat all instances in a bag either equally or select only the most salient one, potentially ignoring useful contextual or distributional information across patches. Although these simple strategies show competitive results, they lack the ability to model complex inter-instance dependencies and rich local patterns, which limits their capacity for accurate subtyping in heterogeneous tumor environments.
Our proposed ConvMixerSSM effectively integrates the strengths of convolutional modeling and structured sequence processing. By combining ConvMixer blocks, which are well-suited for extracting fine-grained spatial features from image patches, with an SSM for efficient long-range dependency modeling, ConvMixerSSM achieves a balanced and comprehensive representation. From a structural perspective, the enhanced performance of ConvMixerSSM can be attributed to its ability to disentangle and hierarchically process both local and global information. The ConvMixer block acts as a localized filter bank, capturing texture, shape, and edge-level features inherent to histopathological structures. These features correspond to morphological variations such as nuclear density, glandular formations, or stromal textures that are critical for tumor subtyping. In contrast, the SSM (state space model) captures sequence-level context by treating the patch sequence as a structured signal, enabling it to model tumor growth patterns, cellular organization gradients, and tissue architecture over large spatial extents. This emulates how pathologists scan slides by integrating both focal detail and overall structure. Furthermore, the inclusion of a feature-gated mechanism enhances the model’s ability to emphasize relevant instance-level features, further improving the discriminative power under weak supervision. Notably, ConvMixerSSM outperforms all comparison methods in AUC, accuracy, and F1 score, indicating its robustness and generalization ability across diverse WSIs. Its relatively shallow architecture (as shown in the depth comparison study) also suggests that the model can achieve strong performance with fewer layers, leading to lower computational cost and faster inference, which are desirable for clinical deployment.
Importantly, our visualization strategy, which highlights the top-k most informative patches and generates corresponding attention heatmaps, has significant clinical value. The visual outputs clearly delineate tumor-related regions within WSIs, aligning well with pathologist-annotated areas. Such interpretability not only enhances the trustworthiness of the model but also provides actionable cues for clinical pathologists, enabling them to quickly identify suspicious regions for further analysis. This can potentially reduce diagnostic time, improve consistency among human readers, and assist in educational settings for trainee pathologists. In addition, explainability is foundational for clinical trust. This interpretability fosters confidence among medical professionals, facilitating acceptance and integration of AI tools in daily practice. We believe that combining strong performance with explainable visual outputs is essential for the successful deployment of AI-based pathology tools in real-world clinical workflows.
Moreover, this study successfully applies structured state space models (SSMs), particularly in combination with convolutional layers, to large-scale histopathology image analysis. While prior work often relied on transformer-based or recurrent approaches for sequence modeling, our results highlight the efficiency and expressiveness of SSMs in modeling patch-wise dependencies [37,38,39]. This demonstrates the potential of SSMs to become a new paradigm for large-scale, context-aware modeling in medical imaging. The success of ConvMixerSSM underscores the effectiveness of integrating local feature extraction (via convolution), global dependency modeling (via SSM), and adaptive instance selection (via gating) within a unified MIL architecture. This hybrid design not only improves performance but also enhances robustness and interpretability, providing a valuable blueprint for future WSI analysis models in computational pathology.
While the proposed method shows promising results, we acknowledge certain limitations in our current work. All experiments were conducted solely on the TCGA-NSCLC and CAMELYON16 datasets, so the generalizability of our model to other cancer types or histopathological conditions remains to be validated. In future work, we plan to extend ConvMixerSSM to additional publicly available WSI datasets involving other cancers (e.g., colon, kidney) and multi-class classification settings to assess its robustness and applicability across domains. Additionally, ConvMixerSSM can be extended to tackle more complex pathological tasks, such as tumor grading, multi-class cancer classification, or WSI-level segmentation. We also envision adapting this architecture to self-supervised or few-shot learning scenarios, thereby enhancing its applicability to rare diseases or small datasets.
5. Conclusions
In this study, we proposed ConvMixerSSM, a novel and effective MIL framework for cancer subtype classification based on whole-slide images (WSIs). The model integrates a state space model (SSM) block for long-range dependency modeling, a ConvMixer block for local feature extraction, and a feature-gated module to adaptively enhance discriminative representations under weak supervision. Extensive experiments on the TCGA-NSCLC dataset demonstrate that ConvMixerSSM achieves state-of-the-art performance, with an AUC of 97.83%, an ACC of 91.82%, and an F1 score of 91.18%, outperforming all comparison methods. In addition, validation on the CAMELYON16 dataset further confirms the model’s generalization ability, where ConvMixerSSM achieves the best AUC (98.95%) among all methods. Beyond accuracy, ConvMixerSSM demonstrates high computational efficiency, requiring only 1.65 ms to process a 51,200 × 51,200 WSI, which makes it feasible for real-world deployment. Furthermore, our visualization results show that ConvMixerSSM can accurately localize tumor regions, offering clinically meaningful interpretability and highlighting its potential as a trustworthy decision support tool for pathologists. Taken together, these findings indicate that ConvMixerSSM is not only a powerful and efficient classifier but also a generalizable and interpretable model well-suited for clinical integration.