Simpliﬁed Routing Mechanism for Capsule Networks

: Classifying digital images using neural networks is one of the most fundamental tasks within the ﬁeld of artiﬁcial intelligence. For a long time, convolutional neural networks have proven to be the most efﬁcient solution for processing visual data, such as classiﬁcation, detection, or segmentation. The efﬁcient operation of convolutional neural networks requires the use of data augmentation and a high number of feature maps to embed object transformations. Especially for large datasets, this approach is not very efﬁcient. In 2017, Geoffrey Hinton and his research team introduced the theory of capsule networks. Capsule networks offer a solution to the problems of convolutional neural networks. In this approach, sufﬁcient efﬁciency can be achieved without large-scale data augmentation. However, the training time for Hinton’s capsule network is much longer than for convolutional neural networks. We have examined the capsule networks and propose a modiﬁcation in the routing mechanism to speed up the algorithm. This could reduce the training time of capsule networks by almost half in some cases. Moreover, our solution achieves performance improvements in the ﬁeld of image classiﬁcation.


Introduction
For processing visual data, convolutional neural networks (CNNs) are proving to be the best solutions nowadays. The most popular applications of convolutional neural networks in the field of image processing are image classification [1,2], object detection [3,4], semantic segmentation [5,6] and instance segmentation [7,8]. However, the biggest challenge of convolutional neural networks is their inability to recognize pose, texture and deformations of an object, caused by the pooling layers. Pooling layers are used in the feature maps. Where we can find several types of this layer: max pooling, min pooling, average pooling and sum pooling are the most common types of pooling layers [9]. Due to this layer, the efficiency of the convolutional neural network to recognize the same object in different input images under different conditions is high. At the same time, the size of the tensors is reduced due to the pooling layer, thus reducing the computational complexity of the network. In most cases, pooling layers are one of the best tools for feature extraction; however, they introduce spatial invariance in convolutional neural networks. Due to the nature of the pooling layer, a great amount of information is lost, which in some cases may even be important features in the image. To compensate for this, the convolutional neural network needs a substantial amount of training data where data augmentation is necessary. Geoffrey Hinton and his research team introduced capsule network theory as an alternative to convolutional neural networks. Hinton et al. published the first paper in the field of capsule networks in 2011 [10], where the potential of the new theory is explained, but the solution for effective training it is not yet available. The next important milestone came in 2017, when Sabour et al. introduced the dynamic routing algorithm between capsule layers [11]. Thanks to this dynamic routing algorithm, the training and optimization of capsule-based networks can be performed efficiently. Finally, Hinton et al. published a matrix capsule-based approach in 2018 [12]. These are the three most important results that the inventors of the theory have published in the field of capsule networks. The basic building block of convolutional neural networks is the neuron, while capsule networks are made up of so-called capsules. A capsule is a group of related neurons, where each neuron's output represents a different property of the same feature. Hence, the input and output of the capsule networks are both vectors (n-dimensional capsules), while the neural network works with scalar values (neurons). Instead of pooling layers, a dynamic routing algorithm was introduced in capsule networks. In this approach, the lower-level features (lower-level capsules) will only be sent to higher-level capsules that match its contents. This property makes capsule networks a more effective solution than convolutional neural networks in some use cases.
However, the training process for capsule networks can be much longer than for convolutional neural networks, where due to the high number of parameters, the memory requirements of the network can be much higher. Therefore, for complex datasets (e.g., large input images, high number of output classes), presently, capsule networks do not perform well yet. This is due to the complexity of the dynamic routing algorithm. For this reason, we have attempted to make modifications to the dynamic routing algorithm. Our primary aim was to reduce the time of the training process, and secondly to achieve a higher efficiency. In our method, we reduced the weight of the input capsule vector during the optimization in the routing process. We also proposed a parameterizable activation function interpreted in terms of vectors, based on the squash function. In this paper, we demonstrate the effectiveness of our proposed modified routing algorithm and compare it with other capsule network-based methods and convolutional neural network-based approaches.
This paper is structured as follows. In Section 2, we provide the theoretical background of the capsule network theory proposed by Hinton et al. [10] and Sabour et al. [11]. Section 3 clarifies our improved routing mechanism for capsule network and our parameterizable activation squash function. Section 4 describes the capsule network architecture used in this research. In Section 5, we present the datasets used to compare the dynamic routing algorithm and our proposed solution. Our results are summarized in Section 6, where we compare our improved routing solution with Sabour et al.'s method, and with some recently published neural network-based solutions. Finally, our conclusions based on our results are summarized in Section 7.

Theory of Capsule Network
The capsule network [10][11][12] (or CapsNet) is very similar to the classical neural network. The main difference is the basic building block. In the neural network, we use neurons, but in the capsule network, we can find capsules. Figures 1 and 2 show the main differences between the classical artificial neurons and the capsules.
A capsule is a group of neurons that perform a multitude of internal computation and encapsulate the results of the computations into an n-dimensional vector. This vector is the output of the capsule. The length of this output vector is the probability and the direction of the vector, indicating certain properties about the entity.
In a capsule-based network, we use routing-by-agreement, where the output vector of any capsule is sent to all higher-level capsules. Each capsule output is compared with the actual output of the higher-level capsules. Where the outputs match, the coupling coefficient between the two capsules are increased.
Let i be a lower-level capsule and j be a higher-level capsule. The prediction vector is calculated as follows:û where W ij is a trainable weighting matrix and u i is an output pose vector from the i-th capsule to the j-th capsule. The coupling coefficients are calculated with a simple SoftMax function, as follows: where b ij is the log probability of capsule i coupled with capsule j, and it is initialized with zero values. The total input to capsule j is a weighted sum over the prediction vectors, calculated as follows: In capsule networks, we use the length of the output vector to represent the probability for the capsule. Therefore, we use a non-linear activation function, which is called the squashing function. The squashing function is the next: We can use the dynamic routing algorithm (by Sabour et al. [11]) to update the c ij values in every iteration. In this case, the goal is to optimize the v j vector. In the dynamic routing algorithm, the b ij vector is updated in every iteration, as follows: Algorithms 2023, 16, x FOR PEER REVIEW 3 of 23  A capsule is a group of neurons that perform a multitude of internal computation and encapsulate the results of the computations into an n-dimensional vector. This vector is the output of the capsule. The length of this output vector is the probability and the direction of the vector, indicating certain properties about the entity.
In a capsule-based network, we use routing-by-agreement, where the output vector of any capsule is sent to all higher-level capsules. Each capsule output is compared with the actual output of the higher-level capsules. Where the outputs match, the coupling co-  A capsule is a group of neurons that perform a multitude of internal computation and encapsulate the results of the computations into an n-dimensional vector. This vector is the output of the capsule. The length of this output vector is the probability and the direction of the vector, indicating certain properties about the entity.
In a capsule-based network, we use routing-by-agreement, where the output vector of any capsule is sent to all higher-level capsules. Each capsule output is compared with the actual output of the higher-level capsules. Where the outputs match, the coupling co-

Improved Routing Algorithm
Our experiments on capsule network theory have shown that theû (j|i) input tensor in the dynamic routing algorithm has too large an impact on the output tensor and greatly increaes the processing time. When calculating the output vector v j , the formula includes the inputû (j|i) twice: To improve the routing mechanism between lower-level and higher-level capsules, the following modifications to the routing algorithm are proposed: Let where c kl is the value of the l-th neuron of the k-th capsule. If v j is an intermediate capsule layer, then n is the number of output capsules. If v j is an output capsule layer, then n is the number of possible object categories. Let This minimal modification makes the routing algorithm simpler and faster to compute. Our other proposed change concerns the squashing function. In the last capsule layer, we use a modified squashing function, as follows: where ε is a fine-tuning parameter. Based on our experience, we used ε = 1 × 10 −7 in this work. Figure 3 shows a simple example of our squash function in a one-dimensional case for different values of ε. Figures 4 and 5 show a block diagram of the dynamic routing algorithm and our improved routing solution, where the main differences between the two methods are clearly visible. pute. Our other proposed change concerns the squashing function. In the last capsule layer, we use a modified squashing function, as follows: where is a fine-tuning parameter. Based on our experience, we used 1 10 in this work. Figure 3 shows a simple example of our squash function in a one-dimensional case for different values of .
where is a fine-tuning parameter. Based on our experience, we used 1 10 in this work. Figure 3 shows a simple example of our squash function in a one-dimensional case for different values of .

Network Architecture
In this work, we have used the network architecture proposed by Sabour et al. [11] to compare our proposed routing mechanism with other optimization solutions in the

Network Architecture
In this work, we have used the network architecture proposed by Sabour et al. [11] to compare our proposed routing mechanism with other optimization solutions in the field of capsule networks. This capsule network architecture is shown in Figure 6. The original paper used a fixed 32 × 32 × 1-sized input tensor, because they only tested the network efficiency for the MNIST [13] dataset. In contrast, we trained and tested the capsule networks for six fundamentally different datasets in the field of image classification. In our work, the shape of the input layer varies depending on the dataset. We used the following input shapes: 28 × 28 × 1, 48 × 48 × 1 and 32 × 32 × 3. After the input layer, the capsule network architecture consisted of three main components: the first is a convolutional layer, the next is the primary capsule layer, and the last one is the secondary capsule layer.

Network Architecture
In this work, we have used the network architecture proposed by Sabour et al. [11] to compare our proposed routing mechanism with other optimization solutions in the field of capsule networks. This capsule network architecture is shown in Figure 6. The original paper used a fixed 32 32 1-sized input tensor, because they only tested the network efficiency for the MNIST [13] dataset. In contrast, we trained and tested the capsule networks for six fundamentally different datasets in the field of image classification. In our work, the shape of the input layer varies depending on the dataset. We used the following input shapes: 28 28 1, 48 48 1 and 32 32 3. After the input layer, the capsule network architecture consisted of three main components: the first is a convolutional layer, the next is the primary capsule layer, and the last one is the secondary capsule layer. The convolution layer contains 256 convolution kernels of size 9 9 with a stride of 1, and a ReLU (rectified linear unit) [14] activation layer. This convolutional layer generates the main visual features based on intensities for the primary capsule layer.
The primary capsule block contains a convolutional layer, where both the input and the output are of the size 256. This capsule block also contains a squash layer. In this case, Figure 6. The capsule network architecture used in the research, based on work by Sabour et al. [11]. (green: input, purple: convolutional layer, yellow: primary capsule layer, red: secondary capsule layer, gray: prediction).
The convolution layer contains 256 convolution kernels of size 9 × 9 with a stride of 1, and a ReLU (rectified linear unit) [14] activation layer. This convolutional layer generates the main visual features based on intensities for the primary capsule layer.
The primary capsule block contains a convolutional layer, where both the input and the output are of the size 256. This capsule block also contains a squash layer. In this case, the original squash function (Equation (4)) is used for both implementations. The output of this block contains 32 capsules, where each capsule has 8 dimensions. This capsule block contains advanced features, which are passed onto the secondary capsule block.
The secondary capsule block has one capsule per class. As mentioned earlier, we worked with several different datasets, so the number of capsules in this capsule block varied, always according to the class number of the dataset: 5, 10 or 43. This capsule block contains the routing mechanism, which is responsible for determining the connection weights between the lower and higher capsules. Therefore, this capsule block represents the main difference between the solution of Sabour et al. and our presented method. In this block, we applied our proposed squash function (Equation (11)). The secondary capsule block contains a trainable matrix, called W (Equation (14)). The shape of the W matrix, for both solutions, is pc × n × 16 × 8, where n is the number of output classes and pc depends on the input image shape as follows: where im size is the size of the input image. The routing algorithm was run through r = 3 iterations in both cases. The output of this capsule block is a 16-dimensional vector per each class. This means that the block produces n 16-dimensional capsules, where n is the number of output classes. The length of the output capsules represents the probability values belonging to the given class.

Datasets
In this work, six different datasets were used. A classification task was performed for each dataset in our work. The datasets needed to have different levels of complexity. This allowed us to test as wide a range of datasets as possible. The selected datasets include grayscale and color images. The number of classes also varies, from 5 to 43. The size of the images is typically small, but here, again, we tried to experiment with different sizes. The datasets included a fixed background which had been used, as well as a variable color background which had been applied. Table 1 shows the main properties of the six datasets that we used in this research.  [15] (28, 28) 1 10 60,000 10,000 false SmallNORB [16] (48, 48) 1 5 48,600 48,600 true CIFAR10 [17] (32, 32) 3 10 50,000 10,000 true SVHN [18] (32, 32) 3 10 73,257 26,032 true GTSRB [19] (32, 32) 3 43 26,640 12,630 true

MNIST
The MNIST dataset (Modified National Institute of Standards and Technology dataset) is a large set of handwritten digits (from 0 to 9) that is one of the most widely used datasets in the field of image classification. The MNIST dataset contains 60, 000 training images and 10, 000 testing images, where every image is grayscale with a 28 pixel width and 28 pixel height. Figure 7 shows some samples from this dataset.

Fashion-MNIST
The Fashion-MNIST (or F-MNIST) dataset is very similar to the MNIST dataset. The main parameters are the same. It contains 60,000 training and 10,000 testing examples. Every sample is 28 pixels in width and 28 pixels in height, and a grayscale image where colors are inverted (an intensity value of 255 represents the darkest color). The Fashion-MINST dataset contains 10 fashion categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. Figure 8 shows some samples from this dataset.

Fashion-MNIST
The Fashion-MNIST (or F-MNIST) dataset is very similar to the MNIST dataset. The main parameters are the same. It contains 60,000 training and 10,000 testing examples. Every sample is 28 pixels in width and 28 pixels in height, and a grayscale image where colors are inverted (an intensity value of 255 represents the darkest color). The Fashion-MINST dataset contains 10 fashion categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. Figure 8 shows some samples from this dataset.

Fashion-MNIST
The Fashion-MNIST (or F-MNIST) dataset is very similar to the MNIST dataset. The main parameters are the same. It contains 60,000 training and 10,000 testing examples. Every sample is 28 pixels in width and 28 pixels in height, and a grayscale image where colors are inverted (an intensity value of 255 represents the darkest color). The Fashion-MINST dataset contains 10 fashion categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. Figure

SmallNORB
The SmallNORB dataset contains images of 3D objects. The specialty of this dataset is that the images were taken under several different lighting conditions and poses. This dataset contains images of toys belonging to five different categories: four-legged animals, human figures, airplanes, trucks and cars. The images were taken with two cameras under six lighting conditions, nine elevations and eighteen azimuths. All images are grayscale, with a size of 96 pixels in width by 96 pixels in height. However, in this work, we resized the images to 48 48 pixels. Figure 9 shows some samples from this dataset.

SmallNORB
The SmallNORB dataset contains images of 3D objects. The specialty of this dataset is that the images were taken under several different lighting conditions and poses. This dataset contains images of toys belonging to five different categories: four-legged animals, human figures, airplanes, trucks and cars. The images were taken with two cameras under six lighting conditions, nine elevations and eighteen azimuths. All images are grayscale, with a size of 96 pixels in width by 96 pixels in height. However, in this work, we resized the images to 48 × 48 pixels. Figure 9 shows some samples from this dataset.

CIFAR10
The CIFAR10 (Canadian Institute for Advanced Research) dataset is one of the most widely used datasets in the field of machine learning-based image classification. This dataset is composed of 60,000 RGB colored images, where 50,000 images are the training samples and 10,000 images are the testing samples. Each image is 32 pixels wide and 32 pixels high. The object categories are the following: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. Figure 10 shows some samples from this dataset.

CIFAR10
The CIFAR10 (Canadian Institute for Advanced Research) dataset is one of the most widely used datasets in the field of machine learning-based image classification. This dataset is composed of 60,000 RGB colored images, where 50,000 images are the training samples and 10,000 images are the testing samples. Each image is 32 pixels wide and 32 pixels high. The object categories are the following: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. Figure 10 shows some samples from this dataset.

CIFAR10
The CIFAR10 (Canadian Institute for Advanced Research) dataset is one of the most widely used datasets in the field of machine learning-based image classification. This dataset is composed of 60,000 RGB colored images, where 50,000 images are the training samples and 10,000 images are the testing samples. Each image is 32 pixels wide and 32 pixels high. The object categories are the following: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. Figure 10 shows some samples from this dataset.

SVHN
The SVHN (Street View House Numbers) dataset contains small, cropped digits, like the MNIST dataset. However, the SVHN dataset is slightly more complex than the MNIST dataset. The SVHN is obtained from house numbers in Google Street View, where the background of the digits is not homogeneous and images may also include part of the adjacent digit. This property makes the SVHN dataset more difficult to classify than the MNIST dataset. The size of the images in this dataset is 32 pixels wide and 32 pixels

SVHN
The SVHN (Street View House Numbers) dataset contains small, cropped digits, like the MNIST dataset. However, the SVHN dataset is slightly more complex than the MNIST dataset. The SVHN is obtained from house numbers in Google Street View, where the background of the digits is not homogeneous and images may also include part of the adjacent digit. This property makes the SVHN dataset more difficult to classify than the MNIST dataset. The size of the images in this dataset is 32 pixels wide and 32 pixels high. The SVHN dataset contains 73,257 training samples and 26,032 testing samples. Figure 11 shows some samples from this dataset.

GTSRB
The GTSRB (German Traffic Sign Recognition Benchmark) dataset includes 43 classes of traffic signs. Each image contains one traffic sign with varying light conditions and rich backgrounds. Images are 32 pixels wide and 32 pixels high. The traffic sign classes are the following: speed limit {20, 30, 50, 60, 70, 80, 100, 120} km/h, end of speed limit (80 km/h), no passing, no passing for vehicles over 3.5 metric tons, right-of-way at the next intersection, priority road, yield, stop, no vehicles, vehicles over 3.5 metric tons prohibited, no entry, general caution, dangerous curve to the {left, right}, double curve, bumpy road, slippery road, road narrows on the right, road work, traffic signals, pedestrians, children crossing, bicycles crossing, beware of ice/snow, wild animals crossing, end of all

GTSRB
The GTSRB (German Traffic Sign Recognition Benchmark) dataset includes 43 classes of traffic signs. Each image contains one traffic sign with varying light conditions and rich backgrounds. Images are 32 pixels wide and 32 pixels high. The traffic sign classes are the following: speed limit {20, 30, 50, 60, 70, 80, 100, 120} km/h, end of speed limit (80 km/h), no passing, no passing for vehicles over 3.5 metric tons, right-of-way at the next intersection, priority road, yield, stop, no vehicles, vehicles over 3.5 metric tons prohibited, no entry, general caution, dangerous curve to the {left, right}, double curve, bumpy road, slippery road, road narrows on the right, road work, traffic signals, pedestrians, children crossing, bicycles crossing, beware of ice/snow, wild animals crossing, end of all speed and passing limits, turn {right, left} ahead, ahead only, go straight or {right, left}, keep {right, left}, roundabout mandatory, end of no passing and end of no passing by vehicles over 3.5 metric tons. Figure 12 shows some samples from this dataset. Figure 11. Sample data from SVHN dataset.

GTSRB
The GTSRB (German Traffic Sign Recognition Benchmark) dataset includes 43 classes of traffic signs. Each image contains one traffic sign with varying light conditions and rich backgrounds. Images are 32 pixels wide and 32 pixels high. The traffic sign classes are the following: speed limit {20, 30, 50, 60, 70, 80, 100, 120} km/h, end of speed limit (80 km/h), no passing, no passing for vehicles over 3.5 metric tons, right-of-way at the next intersection, priority road, yield, stop, no vehicles, vehicles over 3.5 metric tons prohibited, no entry, general caution, dangerous curve to the {left, right}, double curve, bumpy road, slippery road, road narrows on the right, road work, traffic signals, pedestrians, children crossing, bicycles crossing, beware of ice/snow, wild animals crossing, end of all speed and passing limits, turn {right, left} ahead, ahead only, go straight or {right, left}, keep {right, left}, roundabout mandatory, end of no passing and end of no passing by vehicles over 3.5 metric tons. Figure 12 shows some samples from this dataset. speed limit 80 km/h priority road keep right speed limit 60 km/h speed limit 30 km/h speed limit 70 km/h priority road speed limit 120 km/h no passing no passing ahead only traffic signals Figure 12. Sample data from GTSRB dataset. Figure 12. Sample data from GTSRB dataset.

Results
In this work, the network architecture presented in Section 4 has been designed in three different ways and trained separately on the datasets presented in Section 5. The difference between the three networks is the routing algorithm used: the first is the original capsule network by Sabour et al., the second is our modified capsule network with some improvements, and the third is the efficient vector routing by Heinsen [20]. Capsule networks are trained separately on the six presented dataset. For the implementation, we used Python 3.9.16 [21] programming language with PyTorch 1.12.1 [22] machine learning framework and CUDA toolkit 11.6 platform. The capsule networks are trained on the Paperspace [23] online artificial intelligence platform with an Nvidia Quadro RTX4000 series graphical processing unit.
We trained all networks for 35 epochs with the Adam [24] optimizer algorithm where the train and test batch size are both 128. We also attempted to train the networks over many more epochs, but found that the difference between the three solutions does not change significantly after 35 epochs. In this study, we used 5 × 10 −4 initial learning rate. In each epoch, we reduced the learning rate as follows: where lr init is the initial learning rate and lr i is the learning rate in the i-th epoch. We used β 1 = 0.9 and β 2 = 0.999 hyperparameter values to control the exponential decay, and ε = 1 × 10 −8 to prevent any division by zero in the implementation. In the training process, we used the same loss function as proposed by Sabour et al. where T k = 1, if object of class k present 0, otherwise (15) m + , m − and λ are hyperparameters. In the present work, we used the same values for these three hyperparameters as proposed by Sabour et al., in this case, m + = 0.9, m − = 0.1 and λ = 0.5. The original study also used reconstruction in the training process; however, we did not apply this in our work. During training without reconstruction, the efficiency of the capsule network is reduced. In the long term, we want to translate our results in the field of capsule networks into real-world applications. In this respect, reconstruction of the input image is a rarely necessary step. Therefore, we explicitly investigated the ability of capsule networks without reconstruction. Figures 13-18 show the accuracy of the training processes for the test sets of the six presented datasets with different numbers of routings. The value of r indicates the number of iterations which the routing algorithm has optimized the coefficients. Based on Sabour et al.'s experiment, the r = 3 is a good choice; however, we showed the efficiency with r = 1 and r = 10. This makes the difference more visible between the routing methods. As can be seen, for all six datasets, we have achieved efficiency gains compared to the two other capsule network solutions. The difference in efficiency between our and Sabour et al.'s solutions is minimal for about the first 10 epochs; however, after that, there is a noticeable difference in the learning curve. Although Heinsen's solution also proves to be effective in most cases; its performance is slightly lower than the other two solutions. It is also noticeable that changing the number of iterations has a much larger impact on Sabour et al.'s solution and that of Heinsen. Our proposed solution is less sensitive to the iteration value chosen during the optimization.          Figures 19-24 show the test losses (Equation (14)) during the training processes with different numbers of routings. There is not much difference in the loss function, but it is noticeable that our solution is less noisy and converges more smoothly.    Figures 19-24 show the test losses (Equation (14)) during the training processes with different numbers of routings. There is not much difference in the loss function, but it is noticeable that our solution is less noisy and converges more smoothly.    Figures 19-24 show the test losses (Equation (14)) during the training processes with different numbers of routings. There is not much difference in the loss function, but it is noticeable that our solution is less noisy and converges more smoothly.   (14)) during the training processes with different numbers of routings. There is not much difference in the loss function, but it is noticeable that our solution is less noisy and converges more smoothly.          Figure 25 shows the processing times for the three capsule-based networks. It can be seen that for all six datasets, the capsule network was faster with our proposed routing algorithm than with the dynamic routing algorithm introduced by Sabour et al. The smallest increase was achieved for the Fashion-MNIST and MNIST datasets, but this still represents an 18.60% and a 19.33% speedup. For more complex datasets, much higher speed increases were achieved. A running time reduction of 25.55% was achieved for SVHN and 26.54% for CIFAR10. The best results were observed for the SmallNORB and    Figure 25 shows the processing times for the three capsule-based networks. It can be seen that for all six datasets, the capsule network was faster with our proposed routing algorithm than with the dynamic routing algorithm introduced by Sabour et al. The smallest increase was achieved for the Fashion-MNIST and MNIST datasets, but this still represents an 18.60% and a 19.33% speedup. For more complex datasets, much higher speed increases were achieved. A running time reduction of 25.55% was achieved for SVHN and 26.54% for CIFAR10. The best results were observed for the SmallNORB and    Figure 25 shows the processing times for the three capsule-based networks. It can be seen that for all six datasets, the capsule network was faster with our proposed routing algorithm than with the dynamic routing algorithm introduced by Sabour et al. The smallest increase was achieved for the Fashion-MNIST and MNIST datasets, but this still represents an 18.60% and a 19.33% speedup. For more complex datasets, much higher speed increases were achieved. A running time reduction of 25.55% was achieved for SVHN and 26.54% for CIFAR10. The best results were observed for the SmallNORB and  Figure 25 shows the processing times for the three capsule-based networks. It can be seen that for all six datasets, the capsule network was faster with our proposed routing algorithm than with the dynamic routing algorithm introduced by Sabour et al. The smallest increase was achieved for the Fashion-MNIST and MNIST datasets, but this still represents an 18.60% and a 19.33% speedup. For more complex datasets, much higher speed increases were achieved. A running time reduction of 25.55% was achieved for SVHN and 26.54% for CIFAR10. The best results were observed for the SmallNORB and GTSRB datasets. For SmallNORB it was 35.28%, while for GTSRB, it was 48.30%. Compared to Heinsen's solution, our proposed algorithm performed worse, but the difference in efficiency between the two solutions is significant. GTSRB datasets. For SmallNORB it was 35.28%, while for GTSRB, it was 48.30%. Compared to Heinsen's solution, our proposed algorithm performed worse, but the difference in efficiency between the two solutions is significant. The test errors during the training process are shown in Table 2, where the capsulebased solutions were compared with the recently released neural network-based approaches. It can be clearly seen that our proposed modifications to the routing algorithm have led to efficiency gains. It is important to note that our capsule-based solution does not always approach the effectiveness of the state-of-the-art solutions, however, the capsule network used consists of only three layers with a very minimal number of parameters ( 5.4 M). Its architecture is quite simple, but with further improvements, a higher efficiency can be achieved. Prior to this, we felt it necessary to improve the efficiency of the routing algorithm. Experience has shown that designing deep network architecture in the area of capsule networks is too resource-intensive. For this reason, it is necessary to increase the processing speed of the routing algorithm.  Figure 26 shows the confusion matrices for the capsule network-based approaches in the case of the six datasets used. It can be seen that for simpler datasets, such as MNIST, the difference between the three optimization algorithms is minimal. For more complex datasets, the differences are more pronounced. Tables 3-8  The test errors during the training process are shown in Table 2, where the capsulebased solutions were compared with the recently released neural network-based approaches. It can be clearly seen that our proposed modifications to the routing algorithm have led to efficiency gains. It is important to note that our capsule-based solution does not always approach the effectiveness of the state-of-the-art solutions, however, the capsule network used consists of only three layers with a very minimal number of parameters (<5.4 M). Its architecture is quite simple, but with further improvements, a higher efficiency can be achieved. Prior to this, we felt it necessary to improve the efficiency of the routing algorithm. Experience has shown that designing deep network architecture in the area of capsule networks is too resource-intensive. For this reason, it is necessary to increase the processing speed of the routing algorithm.  Figure 26 shows the confusion matrices for the capsule network-based approaches in the case of the six datasets used. It can be seen that for simpler datasets, such as MNIST, the difference between the three optimization algorithms is minimal. For more complex datasets, the differences are more pronounced. Tables 3-8 show the efficiencies achieved by capsule networks for each class, separately. This table also shows that our proposed method is in most cases able to provide a more effective solution than the other solutions tested. However, there are cases where our solution falls short compared with solutions by Sabour et al. and Heinsen.  Table 9 summarizes the percentage of classes per dataset that were able to provide the best result for a given solution. From this approach as well, our solution performed the best. Only in the case of the GTSRB dataset was the method proposed by Sabour et al. more efficient. For the other five datasets, our proposed method was able to achieve the best accuracy for most classes. Tables 10-12 summarize the recall score, dice score and F1-score for the capsule-based implementations under study. It can be observed that, according to all three metrics, our proposed method performs the best. The solution by Sabour et al. performs better only for the Fashion-MNIST dataset, but there was no difference in accuracy of this dataset. The solution by Heinsen underperforms the other two solutions in the cases studied. It can also be seen that, where the method of Sabour et al. and our approach perform worse, the score of Heinsen's solution also decreases in a similar way.

Conclusions
Our work involved research in the field of capsule networks. We showed the main differences between classical convolutional neural networks and capsule networks, highlighting the new potential of capsule networks. We have shown that the dynamic routing algorithm for capsule networks is too complex and that the training time makes it difficult to build more deep and complex networks. At the same time, capsule networks can achieve very good efficiency, but their practical application is difficult due to the complexity of routing. Therefore, it is important to improve the optimization algorithm and introduce novel solutions.
We proposed a modified routing algorithm for capsule networks and a parameterizable activation function for capsules, based on the dynamic routing algorithm introduced by Sabour et al. In this approach, we aimed to reduce the computational complexity of the current dynamic routing algorithm. Thanks to our proposed routing algorithm and activation function, the training time can be reduced. In our work, we have shown its effectiveness on six different datasets, compared with neural network-based solutions and capsule-based solutions. As can be seen, the training time was reduced in all cases, by almost 30% on average. Even in the worst case, a speed increase of almost 20% was achieved. And in some cases, an increase in speed of almost 50% can be seen. Despite the increase in speed, the efficiency of the network has not decreased. For several different metrics, our proposed solution was compared with other capsule-based methods. As we have shown, our proposed approach can increase the efficiency of the routing mechanism. For all six datasets tested in this research, our solution provided the highest results in almost all cases.
In the future, we would like to perform further research on routing algorithms and capsule networks to be able to achieve even greater improvements. We would like to carry out more complex studies on larger datasets, compared with other solutions. We would like to further optimize our solution based on the test results. Our goal is to be able to provide an efficient and fast solution for more complex tasks, such as instance segmentation or reconstruction, in the field of capsule networks. This is necessary to create much deeper and more complex capsule networks, so it is important to address the issue of optimization. This will allow us to apply the theory of capsule networks to real practical applications with great efficiency.