1. Introduction
Convolutional neural networks (CNNs) are widely used to solve a large variety of problems in classification [1], segmentation [2], regression [3], and many other domains with state-of-the-art (SOTA) performance [4,5,6]. However, to keep up with increasing performance demands, the models must be able to handle ever more diverse data correctly. The simplest and most common approach is to increase the number of filters in the convolutional layers, so that individual filters react to specific objects and their transformations. This improves performance, but it also increases training time and model size. Moreover, a larger number of filters helps only with the transformations encountered during the training phase, and it introduces a large redundancy that is difficult to scale to a very large number of objects and their transformations [7]. As a result, the classification of a completely novel viewpoint of an object becomes close to a random guess. The dataset therefore needs to be large enough to contain most transformations of the objects, which is impossible in some cases.
A different approach to mitigating some of these flaws was introduced in [8] with the idea of capsules. In traditional neural networks, the neurons in the classification part are scalar values that represent the presence of features extracted from a given image, and the weights act as simple encoded feature detectors that build representations of more complex features. Capsules, on the other hand, were designed to handle the spatial information of the target. This is accomplished by grouping neurons into vector values that represent different features of the same entity, with the total length (magnitude) of the capsule vector representing the probability of that entity being present in the image. After training, the values of the vector contain information about the pose, transformation, and other important features of the entity it represents.
The mentioned flaws of CNNs were largely mitigated by the introduction of a novel dynamic routing algorithm and a complete capsule network (CapsNet) with multiple capsule layers in [9]. This algorithm dynamically transforms information from lower-level capsules into higher-level ones: the weights embed the relationships between features of different capsules, and these are combined with dynamically computed coupling coefficients to obtain the final capsule outputs. CapsNets have an advantage in generalizing to new perspectives because they utilize vector representations of entities and learned relationships among them. As a result, they require fewer filters, in terms of both size and number, to achieve satisfactory performance [7]. In addition, this approach helps fix other problems of CNNs, such as Picasso's face problem: the routing algorithm ensures that the positions and rotations of lower-level entities, such as the eyes and nose, must be correct for the face to be recognized as a whole.
Capsule models can reach SOTA performance on simple datasets, and there have been some good results on more complex ones as well [7,10,11]. However, capsules have multiple limitations. Longer training times and larger memory consumption are among them, but these can be mitigated with more computational power. We believe the main problem is the low performance on more complex datasets with more varied backgrounds and objects [9,12,13]. This is caused by the networks not being able to find efficient representations of the features of such entities and thus not being able to fully utilize the powerful routing algorithm [14,15]. Reconstruction of the input images can be used to extract more efficient features, but the relationships between more complicated parts can be very hard to encode using a limited number of capsules [16]. Another problem is the scalability of capsules [17]. With a limit to the number of convolutional modules used, the required number of capsules and their dimensionality grows almost as quickly as when adding fully connected layers to traditional networks. These limitations need to be addressed by creating a stronger feature extraction (convolutional) part of the network.
To address the first issue, we propose employing segmentation masks as the primary reconstruction target. This approach aids feature extraction and reduces the number of features required for background elements and less significant parts of the image. It increases the classification performance, which is the main goal of the network, and, in addition, it can use image segmentation of the given data for better interpretability. Most works in this field focus solely on one task, such as classification [7], segmentation [18], regression [19], or even affordance detection [20]. We will show that combining classification and segmentation can improve the classification performance of the model and also help increase user trust, since the created segmentations can be used to verify that the model is looking at the correct parts of the image. In addition, the experiments show that building models with the same convolutional topology as a traditional CNN keeps the network simple and addresses the scalability problem by limiting the number of needed parameters, condensing the features through multiple convolutional blocks. The main contributions of this work are as follows:
Comprehensive comparison of diverse implementations of capsule models across datasets containing more intricate images. This analysis sheds light on the relative strengths and weaknesses of various CapsNet architectures in handling complex image data.
Proposal of a refined C-CapsNet structure featuring multiple incremental enhancements capable of concurrently performing classification and segmentation tasks. These improvements result in superior classification performance, surpassing both existing CapsNet models and traditional CNNs with a comparable number of layers.
Conceptualization of an explanation module serving as the output component of CapsNet, advocating for the adoption of Explainable Artificial Intelligence (XAI) principles. This module aims to bolster the trustworthiness of CapsNet models among human users by providing transparent and interpretable insights into the model’s decision-making process.
This article presents a comparative study that evaluates the performance of CapsNets in image classification tasks, specifically focusing on our novel approach that utilizes segmentation masks as targets.
Section 2 provides an overview of the existing literature in two key areas: CapsNets and Segmentation, establishing the necessary background knowledge for this research.
Section 3 details the network architecture employed, including the specific modifications made, and provides a comprehensive explanation of the losses used and the various reconstruction targets. The datasets utilized in our experiments are described, highlighting the number of images and classes, as well as the transformations applied during training. The network structures and configuration parameters used for training are presented meticulously, enabling easy reproduction of the results in Section 4. In Section 5, we provide a quantitative evaluation of our proposed approach on benchmark datasets and extensively discuss and compare the obtained results in Section 6. Finally, Section 7 summarizes our contributions and outlines potential avenues for future research.
3. Combined-CapsNet (C-CapsNet)
The overall architecture of our proposed network is a combination of ideas from the Efficient-CapsNet [7] and the original CapsNet [9], with many small improvements that we introduced. The schematic representation can be seen in Figure 1. The feature extraction part uses traditional convolutional layers with added BatchNormalization and ReLU layers [46]. MaxPool layers are also integrated to reduce the number of parameters and capsules [47]. Through stacking multiple convolutional blocks, the extraction of essential features is facilitated, effectively removing unnecessary information and generating a condensed set of capsules. This stacking mechanism enables the model to project the input image onto a higher-dimensional space, thereby facilitating the subsequent capsule-formation steps. A key innovation lies in utilizing the same convolutional blocks found in conventional CNNs, allowing the feature extraction part and its learned weights to be transferred seamlessly to different networks. This approach also enables integration with SOTA models for connecting to the subsequent capsule components. The number of layers and filters is customizable by the user and varies depending on the dataset's complexity: simpler datasets need fewer layers and filters, while more intricate datasets require more. Nevertheless, the minimum required number of filters remains smaller than that of traditional convolutional networks.
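To make the structure of one such feature-extraction block concrete, the following is a minimal PyTorch sketch (Conv, BatchNorm, ReLU, optional MaxPool); the channel counts, kernel sizes, and number of stacked blocks are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

def conv_block(in_channels, out_channels, pool=True):
    """One feature-extraction block: Conv -> BatchNorm -> ReLU (-> MaxPool)."""
    layers = [
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    ]
    if pool:
        # MaxPool halves the spatial size, reducing parameters and the later capsule count.
        layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# The number of blocks and filters depends on dataset complexity.
feature_extractor = nn.Sequential(
    conv_block(3, 32),
    conv_block(32, 64),
    conv_block(64, 128),
)

x = torch.randn(2, 3, 128, 128)    # a batch of RGB images
features = feature_extractor(x)    # -> (2, 128, 16, 16)
```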
Following the final convolutional block, another convolutional layer is applied whose output channels are split into multiple capsules in the same way as in the original CapsNet, with the small difference of adding a BatchNormalization and ReLU layer for this experiment. This block produces an output of dimensions $H \times W \times G$, with $H$ representing the height of the transformed image, $W$ denoting the width, and $G$ denoting the number of output channels within the layer. Subsequently, the output is reshaped into primary capsules of size $N \times D$, where $N$ denotes the number of primary capsules and $D$ represents the dimensionality of each capsule. The user determines the number and dimensionality of the primary capsules, ensuring that their product matches the product of the dimensions of the convolutional layer's output, with $G$ being divisible by $D$. This reshaping operation serves to create a vector-based representation of the previously extracted features. This operation can be seen in Figure 2.
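A minimal sketch of this capsule-creation step, assuming PyTorch; the grouping of channels into capsules and the concrete values of G, N, and D are illustrative choices, not necessarily the exact ones used here.

```python
import torch
import torch.nn as nn

class PrimaryCapsules(nn.Module):
    """Turns a G-channel feature map into N primary capsules of dimensionality D."""
    def __init__(self, in_channels, out_channels, capsule_dim, kernel_size=3):
        super().__init__()
        assert out_channels % capsule_dim == 0, "G must be divisible by D"
        self.capsule_dim = capsule_dim
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn(self.conv(x)))     # (B, G, H, W)
        B, G, H, W = x.shape
        # One possible grouping: D consecutive channels at each position form a capsule,
        # so that N = H * W * G / D and N * D equals H * W * G.
        x = x.permute(0, 2, 3, 1).contiguous()   # (B, H, W, G)
        return x.view(B, -1, self.capsule_dim)   # (B, N, D)

features = torch.randn(2, 128, 16, 16)           # output of the feature extractor
primary = PrimaryCapsules(in_channels=128, out_channels=128, capsule_dim=8)
capsules = primary(features)                     # -> (2, 4096, 8)
```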
The introduction of capsules has altered how location information is encoded. In conventional convolutions, location is encoded using a high-value pixel known as place-coded location. However, capsules employ a different approach by encoding location information within vector coordinates. This transformation results in a rate-coded location, compressing the location information into a single numerical value.
CapsNets utilize a dedicated capsule-wise squashing function as the primary activation function within the capsule component of the network. This function serves multiple roles, including introducing non-linearity into the routing process to facilitate learning. Furthermore, the squashing function plays a crucial role in encoding the likelihood of an entity's presence in the input image by adjusting the length of the capsule. To fulfill this purpose, two conditions must be met: the direction of each capsule should remain unchanged, and the length of the capsule must be confined to the range of zero to one. In this experiment, the squashing function originally proposed in [7] is employed, as depicted in Equation (1). This choice is made due to better sensitivity near zero than the original squash function used in [9]. All equations are written in Einstein notation, so $A_i B^i$ and $\sum_i A_i B^i$ denote the same tensor, with the summation performed over the subscript dimension of the first tensor and the superscript dimension of the second tensor.
$$v_n = \left(1 - \frac{1}{e^{\lVert u_n \rVert}}\right) \frac{u_n}{\lVert u_n \rVert} \qquad (1)$$

where $v_n$ is the squashed vector output of one capsule $n$ and $u_n$ is its input. This function satisfies the required properties by returning a matrix $v$ with capsules of the same dimensionality and direction as the input; the only difference is that the lengths of all capsules are squashed into the required interval, and non-linearity is introduced into the network to make it more robust.
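A minimal sketch of the squashing operation as reconstructed in Equation (1), assuming PyTorch; the small epsilon added for numerical stability is an implementation assumption.

```python
import torch

def squash(u, dim=-1, eps=1e-8):
    """Capsule-wise squash: keeps each capsule's direction, maps its length into (0, 1)."""
    norm = u.norm(dim=dim, keepdim=True)          # ||u_n|| for every capsule
    scale = 1.0 - torch.exp(-norm)                # equals 1 - 1 / e^||u_n||
    return scale * u / (norm + eps)               # same direction, squashed length

capsules = torch.randn(2, 16, 8)                  # (batch, capsules, capsule_dim)
v = squash(capsules)
print(v.norm(dim=-1).max())                       # every capsule length is below 1
```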
In CapsNets, routing algorithms play a vital role in transforming raw input data into meaningful representations. The algorithm used in the experiments is based on the original dynamic routing proposed in [9], with multiple improvements. The pseudocode for the algorithm can be seen in Table 1. In the Table and all equations, $b$ is the batch size, $n$ is the number of input capsules, $m$ is the number of output capsules, $i$ is the dimensionality of the input capsules, and $j$ is the dimensionality of the output capsules. Capsules, which are groups of neurons representing various properties of an entity, rely on routing algorithms to iteratively update their parameters. This process first multiplies the input capsules with a trained weight matrix to obtain, from every input capsule, a prediction of the pose of each output capsule. This operation essentially learns the transformation from the lower-level features to the higher-level features. Each element of the output tensor $\hat{u}^{\,j}_{nm}$ represents the predicted instantiation parameters (such as position, orientation, scale, etc.) of each output capsule, given the input from each lower-level capsule. This multiplication operation can be seen in (2).
$$\hat{u}^{\,j}_{nm} = W^{\,j}_{nmi}\, u^{\,i}_{n} \qquad (2)$$

where $W$ denotes the weight matrix, $u_n$ represents the output of the $n$-th lower-level capsule, and $\hat{u}^{\,j}_{nm}$ is the prediction for the $j$-th dimension of the $m$-th output capsule, based on the $n$-th input capsule.
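Under the index convention above ($b$, $n$, $m$, $i$, $j$), the prediction tensor of Equation (2) can be computed with a single einsum; the tensor names and sizes below are illustrative.

```python
import torch

b, n, m, i, j = 2, 32, 10, 8, 16
u = torch.randn(b, n, i)                         # lower-level capsule outputs
W = torch.randn(n, m, i, j)                      # trained transformation matrices
# Equation (2): sum over the input dimension i to get the predictions u_hat.
u_hat = torch.einsum('bni,nmij->bnmj', u, W)     # -> (b, n, m, j)
```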
After obtaining these predictions, the algorithm iteratively calculates the coupling coefficients to determine the relevance of each lower-level capsule's output to each higher-level capsule's output. Initially, the agreements are initialized to zero, and they are then transformed into coupling coefficients using the softmax function along the output capsule dimension, ensuring that, for each lower-level capsule, the coupling coefficients across all output capsules sum to one. This can be seen in (3).
$$c_{nm} = \frac{\exp(b_{nm})}{\sum_{k} \exp(b_{nk})} \qquad (3)$$

where $c_{nm}$ represents the coupling coefficient between the $n$-th lower-level capsule and the $m$-th output capsule, and $b_{nm}$ is the initial agreement, initialized as zero.
With the coupling coefficients determined, the algorithm calculates the intermediate and final outputs by combining the predictions from the lower-level capsules weighted by their respective coupling coefficients. This step effectively aggregates information from the lower-level capsules to form the output of each higher-level capsule, as can be seen in (4).
$$v^{\,j}_{m} = \mathrm{squash}\!\left(c_{nm}\, \hat{u}^{\,j}_{nm}\right) \qquad (4)$$

where $v^{\,j}_{m}$ represents the $j$-th dimension of the output of the $m$-th capsule, and $\hat{u}^{\,j}_{nm}$ represents the prediction for the $j$-th dimension of the $m$-th output capsule coming from the $n$-th lower-level capsule, with squash being the function described before in (1).
During each routing iteration, the agreement is updated based on the current predictions and outputs to improve the coupling coefficients. This is achieved by using only the newly calculated agreement matrix $b$, based on the dot product of the output vector and the prediction tensor, instead of adding it to the previous agreements, as was the case in [9]. An additional small adjustment in this equation is the learnable parameter $p$, which acts as a multiplier controlling how strongly the agreements influence the coupling coefficients. These adjustments are made to enhance the network's adaptability to changes in predictions. The operation can be seen in (5).
$$b_{nm} = p \cdot v^{\,j}_{m}\, \hat{u}^{\,j}_{nm} \qquad (5)$$

where $b_{nm}$ represents the newly calculated agreements, $v^{\,j}_{m}$ is the intermediate output of capsule $m$, and $\hat{u}^{\,j}_{nm}$ is the prediction from the $n$-th input capsule for the $j$-th dimension of the $m$-th output capsule, with $p$ being the added trainable parameter.
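Putting Equations (2) to (5) together, a possible PyTorch sketch of the described routing loop is shown below; the number of routing iterations, the weight initialization, and the initial value of the learnable multiplier p are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(u, dim=-1, eps=1e-8):
    norm = u.norm(dim=dim, keepdim=True)
    return (1.0 - torch.exp(-norm)) * u / (norm + eps)

class RoutingCapsules(nn.Module):
    """Routes n input capsules of dimension i to m output capsules of dimension j."""
    def __init__(self, n_in, dim_in, n_out, dim_out, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.W = nn.Parameter(0.01 * torch.randn(n_in, n_out, dim_in, dim_out))
        self.p = nn.Parameter(torch.ones(1))      # learnable multiplier on the agreements

    def forward(self, u):                                     # u: (b, n, i)
        u_hat = torch.einsum('bni,nmij->bnmj', u, self.W)     # Eq. (2): (b, n, m, j)
        b_agree = torch.zeros(u.size(0), u.size(1), self.W.size(1), device=u.device)
        for _ in range(self.iterations):
            c = F.softmax(b_agree, dim=2)                     # Eq. (3): coupling coefficients
            s = torch.einsum('bnm,bnmj->bmj', c, u_hat)       # weighted sum over input capsules
            v = squash(s)                                     # Eq. (4): (b, m, j)
            # Eq. (5): agreements are recomputed from scratch (not accumulated as in [9]).
            b_agree = self.p * torch.einsum('bmj,bnmj->bnm', v, u_hat)
        return v

layer = RoutingCapsules(n_in=4096, dim_in=8, n_out=10, dim_out=16)
v = layer(torch.randn(2, 4096, 8))                            # -> (2, 10, 16)
```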
Capsule layers can be stacked, and this algorithm remains unchanged, so experiments with multiple capsule topologies can be carried out to reach better performance. The user chooses the number and dimensionality of higher-level capsules, but it is traditional to decrease the number of capsules and increase their dimensionality. This is implemented to reduce the computational complexity for further layers, as the data are more compressed. The number of capsules in the final layer is fixed, as each capsule represents one class from the dataset. An additional capsule can be used, which would not represent any class and would not be used for classification, but it would contain information about the background to help with the reconstruction.
To be able to reconstruct more complicated images from only a few features, a novel idea of this paper is a different and more sophisticated decoder architecture. The decoder block selectively masks the outputs of all capsules in the final capsule layer, except for the longest capsule (representing the classification result) or a capsule manually designated for a specific purpose; this is achieved by element-wise multiplication with a one-hot encoded vector corresponding to the target label. The manual-choice option enables segmentation based on the positive class alone, potentially yielding valuable insights from images classified as negative. Moreover, during training, the true label is consistently designated as the target, rather than the predicted label, to facilitate the training process. The masking operation is shown in Figure 3. The whole layer is then flattened into one vector, which is processed by a fully connected layer and reshaped into a small 2D image with multiple channels. This is a novel approach for CapsNets, which previously used only 1D fully connected layers in the whole decoder to create smaller images. Using multiple blocks of ConvTranspose, Convolutional, and ReLU layers with a decreasing number of channels and an increasing image size, the new model produces the reconstructed images in the desired shape. The number and size of these blocks depend on the initial and desired sizes of the images. The easiest way to choose them is by mirroring the feature extraction part of the network: the number of output neurons of the fully connected layer is the same as the total number of inputs of the grouped convolution used in the capsule creation process, and the following blocks then mirror the convolutional blocks backwards to ensure the correct shape of the reconstruction. The final number of channels can differ depending on the reconstruction target: three channels for RGB and one channel for greyscale images.
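A minimal sketch of the described decoder, assuming PyTorch: the class capsules are masked with a one-hot vector, flattened, projected by a fully connected layer into a small multi-channel 2D map, and upsampled with ConvTranspose/Conv/ReLU blocks. All sizes, channel counts, and the final Sigmoid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedDecoder(nn.Module):
    """Masks all but one class capsule and reconstructs an image or segmentation mask."""
    def __init__(self, n_classes, capsule_dim, base_channels=64, out_channels=1):
        super().__init__()
        self.base_channels = base_channels
        # Fully connected projection of the masked capsules into a small 8x8 feature map.
        self.fc = nn.Linear(n_classes * capsule_dim, base_channels * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base_channels, 32, kernel_size=2, stride=2),  # 8 -> 16
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),             # 16 -> 32
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, out_channels, kernel_size=1),
            nn.Sigmoid(),                                                    # pixels in (0, 1)
        )

    def forward(self, class_caps, target=None):          # class_caps: (B, n_classes, D)
        lengths = class_caps.norm(dim=-1)                # (B, n_classes)
        if target is None:                               # at inference: keep the longest capsule
            target = lengths.argmax(dim=1)
        mask = torch.zeros_like(lengths)
        mask.scatter_(1, target.unsqueeze(1), 1.0)       # one-hot mask over the classes
        masked = (class_caps * mask.unsqueeze(-1)).flatten(1)
        x = self.fc(masked).view(-1, self.base_channels, 8, 8)
        return self.up(x)                                # (B, out_channels, 32, 32)

decoder = MaskedDecoder(n_classes=2, capsule_dim=16)
recon = decoder(torch.randn(4, 2, 16), target=torch.tensor([0, 1, 1, 0]))
```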
3.1. Classification Loss
The main goal of the CapsNet is accurate image classification. Consequently, the primary and most influential loss is typically the Margin Loss, which is used to train the capsules.
Margin Loss operates similarly to the mean squared error (MSE). To make the output of the final capsule layer compatible with MSE, the Euclidean length of the capsules is computed. This transforms the capsule output into a conventional output vector representing probabilities for each class, facilitating the calculation of the total error. The fundamental principle is that the length of a capsule should be close to one if the corresponding entity class is detected in the image, and close to zero otherwise. In contrast to MSE, Margin Loss deals with multiple classes within the same image by computing the loss independently for each class and aggregating the errors. It is calculated using (6).
$$L_n = T_n \max(0,\, m^{+} - \lVert v_n \rVert)^2 + \lambda\, (1 - T_n) \max(0,\, \lVert v_n \rVert - m^{-})^2 \qquad (6)$$

where $T_n$ corresponds to the label and $v_n$ represents the output capsule for class $n$. The hyperparameters $m^{+}$, $m^{-}$, and $\lambda$ are used to fine-tune the training process, particularly in the initial phase, to prevent excessive contraction of all capsules, which could impede progress.
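A minimal sketch of the margin loss of Equation (6), assuming PyTorch; the hyperparameter values m+ = 0.9, m- = 0.1, and lambda = 0.5 follow the values commonly used in the CapsNet literature and are assumptions here, not necessarily the values used in our experiments.

```python
import torch

def margin_loss(class_caps, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over capsule lengths; labels is a one-hot (or multi-hot) tensor."""
    lengths = class_caps.norm(dim=-1)                               # (B, n_classes)
    positive = labels * torch.clamp(m_pos - lengths, min=0.0) ** 2
    negative = lam * (1.0 - labels) * torch.clamp(lengths - m_neg, min=0.0) ** 2
    return (positive + negative).sum(dim=1).mean()                  # sum over classes, mean over batch

caps = torch.randn(4, 2, 16)                                        # (batch, classes, capsule_dim)
labels = torch.tensor([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
loss = margin_loss(caps, labels)
```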
To assess the distinctions between different loss functions, the binary cross-entropy (BCE) loss has also been employed as the classification loss, which can be seen in (7). This choice aligns with the conventional usage of this loss function in classification tasks and convolutional networks.
3.2. Reconstruction Loss
Reconstruction plays a fundamental role in most CapsNets, as it enables capsules to learn more generalized information about classes, contributing to the overall concept and improving performance. However, a significant challenge with reconstruction arises when dealing with more complex images beyond the basic Modified National Institute of Standards and Technology (MNIST) dataset [48]. In such cases, capsules require additional information to recreate the input image accurately [9]. Moreover, real-world images often exhibit diverse backgrounds that the network must also learn. The complexity further increases when attempting to reconstruct colored images, as opposed to simpler greyscale images, which can negatively impact performance.
An additional step to decrease the complexity of the reconstruction is to try to segment the object in the image. If the input image is colored, the reconstruction targets can be the original colored image, its greyscale version, or the segmentation mask of the target object.
These reconstructed versions are then utilized in the secondary loss, known as the reconstruction loss. Multiple computation methods for this loss exist, but in this experiment only the MSE and BCE losses seen in (7) and (8) are used and compared against their corresponding targets. MSE is the default loss used for reconstruction in CapsNet models, and BCE is a loss often used for segmentation. The Dice loss and IoU were considered as well, but as they cannot be used for the reconstruction of input images, a comparison could not be made. It should be noted that normalization with mean and standard deviation cannot be applied with BCE, as it would place the targets outside the 0-1 range. The actual loss was calculated as the mean of the per-pixel losses between the original and the predicted segmentation mask.
$$L_{BCE} = -\frac{1}{N}\sum_{n} \left[ t_n \log(y_n) + (1 - t_n)\log(1 - y_n) \right] \qquad (7)$$

$$L_{MSE} = \frac{1}{N}\sum_{n} (t_n - y_n)^2 \qquad (8)$$

where $t_n$ is the pixel value of the ground truth mask and $y_n$ is the value of pixel $n$ from the output mask.
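The two reconstruction losses of Equations (7) and (8) can be sketched as per-pixel means, assuming PyTorch; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target, kind="mse"):
    """Mean per-pixel loss between the predicted and ground-truth image or mask."""
    if kind == "mse":
        return F.mse_loss(pred, target)               # Eq. (8)
    # BCE expects both tensors in the 0-1 range (no mean/std normalization of targets).
    return F.binary_cross_entropy(pred, target)       # Eq. (7)

pred = torch.rand(4, 1, 32, 32)                       # e.g. sigmoid-activated decoder output
mask = torch.randint(0, 2, (4, 1, 32, 32)).float()    # ground-truth segmentation mask
print(reconstruction_loss(pred, mask, kind="bce"))
```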
To prevent the reconstruction loss from overpowering the margin loss, a coefficient is introduced, which can be adjusted to observe how the weight of the reconstruction loss influences performance. This scaled loss is then added to the classification loss to obtain the final loss, which is employed for backpropagation, as seen in (9).
$$L = L_{class} + \alpha\, L_{rec} \qquad (9)$$

where $L$ is the total loss of the model, $L_{class}$ is the classification loss (either the margin loss or $L_{BCE}$ in the case of binary classification), $\alpha$ is the scaling coefficient, and $L_{rec}$ is the reconstruction loss (either $L_{MSE}$ or $L_{BCE}$). In certain cases, the introduction of a dedicated capsule solely for reconstruction purposes may be beneficial. This capsule is disregarded during classification and remains unmasked with the others. Its inclusion aids in encoding features and information related to shared background characteristics among classes, thereby improving reconstructions and, consequently, classification performance.
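Equation (9) then reduces to a one-line combination of the two terms; the value of the scaling coefficient below is illustrative.

```python
def total_loss(class_loss, recon_loss, alpha=0.0005):
    """Eq. (9): classification loss plus the scaled reconstruction loss."""
    return class_loss + alpha * recon_loss

# Example, reusing the margin_loss and reconstruction_loss helpers sketched above:
# loss = total_loss(margin_loss(caps, labels), reconstruction_loss(pred, mask, kind="bce"))
```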
6. Discussion
Based on the results presented in the previous section, several general observations can be made. The difference in model performance between those using the background capsule and those that do not is significant only when the input images share many similar aspects, which are then saved in the shared capsule, while the differences between classes are saved in other capsules. However, this distinction is less prominent in the Oxford Pets dataset, where backgrounds vary greatly, even between classes. Models using segmentation masks experience small but consistent performance gains as the compositions of scenes with masks of different classes have many similarities, and the class capsules are only used for accurate reconstruction of specific shapes of different breeds.
The CapsNet models using EMRouting perform poorly on large and complex images. This limitation is a general drawback of EMRouting models due to the limited scalability of this topology and the high GPU memory requirements for effective training. The original CapsNet implementation performs well on the Oxford Pets dataset, while the Efficient-CapsNet implementation performs better on the SIIM-ACR Pneumothorax dataset, but the best overall performance is achieved by our improved network, which reached the best results on both datasets. The use of segmentation masks as reconstruction targets shows an improvement in performance on both datasets, with increases of 0.5% and 0.3%, respectively. While not highly statistically significant, the gain appears to be consistent. Although the difference between greyscale and color images as targets is small, different models achieve better classification performance with different targets. This phenomenon may be attributed to how background information is stored in the capsules of each topology, which can make the reconstructions easier for a specific implementation.
From the results on both datasets, it is evident that the variability of image backgrounds does not have a straightforward correlation with the performance of the segmentation-based reconstruction method. This observation is supported by the proportional performance improvement achieved by using segmentation masks across both datasets, despite the significantly different background variability in each dataset. However, the quality and consistency of the segmentation masks used can cause multiple issues. If the segmentation masks are completely incorrect, the model may learn information from the wrong regions, hindering classification performance, as the class capsule should contain the features for segmentation. Similar problems arise if there are inconsistencies in the masks or inaccuracies in the boundaries. These errors can mislead the model, causing it to focus on irrelevant or incorrect features, ultimately reducing the effectiveness and accuracy of the classification process. Therefore, ensuring the accuracy and consistency of segmentation masks is crucial for optimal model performance [56].
Compared to traditional CNNs of various sizes, our capsule model achieves better peak performance with similar or smaller standard deviations, with an average difference of 3% and 2.5%, respectively. Although the differences between convolutional networks are minimal, models with more parameters show a slight improvement on the first dataset. To achieve the performance level of our C-CapsNet, a CNN would need approximately 20 million parameters, based on the observed performance increase when transitioning from 64 to 128 channels in the last block on the first dataset. This is evident in the SOTA performance of the TWIST method, which achieved its best results with 25 million parameters. However, for the Pneumothorax dataset, increasing the number of parameters does not improve the performance of convolutional models; in fact, it worsens it. This deterioration is likely because the additional filters capture variations not present in lung images, leading to overfitting. Although using a pre-trained ResNet model helps, our C-CapsNet model still outperforms it when using a segmentation mask as the reconstruction target.
The less conclusive results on the SIIM-ACR Pneumothorax dataset compared to the Oxford Pets dataset may be attributed to several factors inherent to the nature and challenges of medical image analysis. First, the variability and subtlety of features in medical images make it inherently more difficult to achieve high classification accuracy. Pneumothorax, being a condition that can present with very subtle visual cues, poses a significant challenge even for experienced radiologists. Consequently, the model may primarily learn the main differences between clearly positive and negative images but struggle with those that are less distinct.
The choice of a better general reconstruction loss depends on the dataset. Models using MSE loss achieved the same performance on the Oxford Pets dataset, with a maximum difference of 0.3%, while BCE loss proved superior for both reconstruction tasks and helped outperform other models on the SIIM-ACR dataset with an average increase of 1%.
When considering the classification loss function, the one that consistently yields better performance is not definitively clear. On the Oxford Pets dataset, the average performance difference between models trained with margin loss and cross-entropy loss was 1%, indicating that margin loss tends to be superior for similar datasets. However, on the Pneumothorax dataset, the results were nearly reversed, with cross-entropy loss exhibiting an average improvement of 0.4% across all models except Efficient-CapsNet. Nevertheless, it can be concluded that when utilizing Efficient-CapsNet, margin loss consistently outperforms other approaches.
Training time for models within the same category is similar across datasets, with minimal differences. CNN models exhibit the shortest training time due to the lower complexity of the fully connected layers compared to the routing algorithm. The longer training time of the CapsNet implementations is attributed to the complexity of the routing process, which takes more time than simple fully connected layers. The increase in training time for models using segmentation masks as targets is caused by the need to read the additional masks from memory and apply more transformations, which slows down data loading. The EMRouting model's lengthy training time is a result of the algorithm's poor scalability to larger images.
The performance progress across epochs differs between capsule models and CNNs. Capsule models exhibit a slower start but eventually surpass convolutional models in performance. This can be attributed to the more complex nature of training capsules, as they need to learn connections between features in an additional dimension and require multiple passes over the dataset to achieve the desired generalization. In contrast, training the fully connected layers of CNNs, which operate in one dimension, can be accomplished in fewer epochs. However, as the structure is simpler, the CNNs are not able to reach the performance of capsules.
Reconstruction results shown in Figure 8 and Figure 11 vary greatly for each dataset, with capsules struggling to learn and generalize the colors of smaller image parts, often resorting to the average color as the main hue in the colored reconstructions of the first dataset. Differences between greyscale and colored reconstructions are not apparent, except for minor discrepancies in the shape and length of the ears. The overall shape of the cat is visible, but finer details cannot be recognized. Segmentation masks show promise, as the placement and rotation of the cat are correct, although the shape of the ears and body appear wider than in the original image.
For the second dataset, input image reconstructions are better than those from the Oxford Pets dataset due to lower variability in overall shape and the absence of background. While some details are lost, they are not crucial for the overall classification process. However, all reconstructions capture the shape, size, and location of the lungs, which are important for pneumothorax detection.
The quality of segmentation reconstructions depends on the specific input image. In most cases, the placement of segmented regions is correct, with various deformations applied. However, some images only segment a single part or provide entirely incorrect segmentations. This issue also arises in incorrectly classified images when the wrong capsule is used for reconstruction.
Reconstructions provide insight into what our model requires for accurate classification, aiding model explainability. Doctors can identify regions highlighted by the model as collapsed areas and focus on confirming the classification result. With time, this can lead to increased trust in the model, enabling more autonomous use and speeding up the diagnostic process by avoiding numerous negative images. With slight modifications, this approach can be used as an explanation module, taking an X-ray image as input and returning a diagnosis with identified regions for positive cases. Adding textual information to provide details such as severity and other useful information could further improve the module.
One possible explanation module prototype can be seen in Figure 12; it uses the classification output of the network and the segmentation map as a prompt for ChatGPT. ChatGPT then takes the role of the doctor and describes the detected illness, with the possibility of further communication to ask what to do with this information and how to get better.
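Purely as an illustration, the kind of textual prompt such a module could assemble from the classification output and the segmentation map might look as follows; the wording, the mask statistics used, and the downstream call to ChatGPT are hypothetical and not part of the evaluated system.

```python
import numpy as np

def build_explanation_prompt(is_positive, seg_mask):
    """Assembles a textual prompt from the classification result and the segmentation mask."""
    if not is_positive:
        return ("The model classified this chest X-ray as negative for pneumothorax. "
                "Briefly explain this result to the patient.")
    rows, cols = np.nonzero(seg_mask > 0.5)
    area = 100.0 * rows.size / seg_mask.size          # rough size of the highlighted region in %
    return ("Act as a radiologist. The model classified this chest X-ray as positive for "
            f"pneumothorax. The highlighted region covers roughly {area:.1f}% of the image, "
            f"centered around row {int(rows.mean())} and column {int(cols.mean())}. "
            "Describe the finding, suggest next steps, and answer follow-up questions.")

mask = np.zeros((256, 256))
mask[60:120, 80:150] = 1.0
prompt = build_explanation_prompt(True, mask)          # this text would then be sent to ChatGPT
```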
7. Conclusions
In this study, we present a modified CapsNet architecture called C-CapsNet, which combines elements from Efficient-CapsNet and the original CapsNet, while incorporating segmentation masks as reconstruction targets. By using segmentation masks for reconstructions, the models can effectively generalize the essential information required for accurate reconstruction, leading to improved performance. However, generalizing information from fully detailed input images poses significant challenges for our models. Through comprehensive evaluations on two diverse datasets, we demonstrate that utilizing segmentation masks enhances model performance in classification, surpassing other capsule and convolutional models.
Our findings indicate that the BCE loss function yields superior results for reconstruction, even when reconstructing original images. The best-performing C-CapsNet model achieved a mean F1 score of 93% on the Oxford Pets dataset, while reaching a final F1 score of 67% on the SIIM-ACR Pneumothorax dataset. Different CapsNet implementations can be used for different datasets, as no universally superior implementation emerged from our experiments. The performance of CapsNet models can be increased by using segmentation masks. Furthermore, these models offer potential applications in enhancing explainability and fostering trust among medical professionals and other users by creating images that showcase the exact regions the models are looking at during their diagnosis.
Future investigations will explore more intricate reconstruction topologies, incorporating skip connections to leverage the benefits observed in traditional U-Net architectures. Additionally, we will explore an alternative reconstruction target, namely the original input image multiplied by the mask, which could facilitate improved generalization by downplaying the significance of the background. Moreover, we will consider employing the Dice loss as the primary reconstruction loss metric.
However, our approach entails certain limitations. Firstly, it relies on the availability of training datasets that contain both classification labels and corresponding segmentation masks. Given the scarcity of such datasets, the widespread adoption of this method may be hindered. Additionally, the performance of the network, although superior to other methods, is not yet at a level suitable for clinical use. Mitigating this limitation could involve leveraging pre-trained feature extractors and incorporating more extensive datasets to improve overall performance.