An Enhanced Scheme for Reducing the Complexity of Pointwise Convolutions in CNNs for Image Classification Based on Interleaved Grouped Filters without Divisibility Constraints

In image classification with Deep Convolutional Neural Networks (DCNNs), the number of parameters in pointwise convolutions grows rapidly due to the multiplication of the number of filters by the number of input channels coming from the previous layer. Existing studies demonstrated that a subnetwork can replace pointwise convolutional layers with significantly fewer parameters and fewer floating-point computations, while maintaining the learning capacity. In this paper, we propose an improved scheme for reducing the complexity of pointwise convolutions in DCNNs for image classification, based on interleaved grouped filters without divisibility constraints. The proposed scheme utilizes grouped pointwise convolutions, in which each group processes a fraction of the input channels. It requires the number of channels per group as a hyperparameter, Ch. The subnetwork of the proposed scheme contains two consecutive convolutional layers, K and L, connected by an interleaving layer in the middle and summed at the end. The numbers of filter groups and filters per group for layers K and L are determined by exact divisions of the original number of input channels and filters by Ch. If these divisions were not exact, the original layer could not be substituted. In this paper, we refine the previous algorithm so that input channels are replicated and groups can have different numbers of filters, to cope with non-exact divisibility situations. Thus, the proposed scheme further reduces the number of floating-point computations (by 11%) and trainable parameters (by 10%) achieved by the previous method. We tested our optimization on an EfficientNet-B0 baseline architecture and ran classification tests on the CIFAR-10, Colorectal Cancer Histology, and Malaria datasets. For these datasets, our optimization saves 76%, 89%, and 91%, respectively, of the trainable parameters of EfficientNet-B0, while keeping its test classification accuracy.


Introduction
In 2012, Krizhevsky et al. [1] reported a breakthrough in the ImageNet Large Scale Visual Recognition Challenge [2] using their AlexNet architecture, which contains 5 convolutional layers and 3 dense layers. Since 2012, many other architectures have been introduced, like ZFNet [3], VGG [4], GoogLeNet [5], ResNet [6] and DenseNet [7]. Since the number of layers of proposed convolutional neural networks has increased from 5 to more than 200, those models are usually referred to as "Deep Learning" or DCNN.
In 2013, Min Lin et al. introduced the Network in Network architecture (NiN) [8]. It has 3 spatial convolutional layers with 192 filters, separated by pairs of pointwise convolutional layers. These pointwise convolutions enable the architecture to learn patterns without the computational cost of a spatial convolution. In 2016, ResNet [6] was introduced.
Following VGG [4], all ResNet spatial filters have 3 × 3 pixels. Their paper conjectures that deeper CNNs have exponentially low convergence rates. To deal with this problem, they introduced skip connections every 2 convolutional layers. In 2017, Ioannou et al. [9] adapted the NiN architecture to use 2 to 16 convolutional groups per layer for classifying the CIFAR-10 dataset.
A grouped convolution separates input channels and filters into groups. Each filter processes only the input channels entering its group. Each group of filters can be understood as an independent (parallel) path for information flow. This aspect drastically reduces the number of weights in each filter and, therefore, reduces the number of floating-point computations. Grouping 3 × 3 and 5 × 5 spatial convolutions, Ioannou et al. were able to decrease the number of parameters by more than 50% while keeping the NiN classification accuracy. Ioannou et al. also adapted the ResNet-50, ResNet-200, and GoogLeNet architectures, applying 2 to 64 groups per layer when classifying the ImageNet dataset, obtaining parameter reductions while maintaining or improving the classification accuracy. Also in 2017, an improvement of the ResNet architecture called ResNeXt [10] was introduced, replacing the spatial convolutions with parallel paths (groups) and thereby reducing the number of parameters.
Several studies have also reported the creation of parameter-efficient architectures with grouped convolutions [11][12][13][14][15]. In 2019, Mingxing Tan et al. [16] developed the EfficientNet architecture. At that time, their EfficientNet-B7 variant was 8.4 times more parameter-efficient and 6.1 times faster than the best existing architecture, achieving 84.3% top-1 accuracy on ImageNet. More than 90% of the parameters of EfficientNets come from standard pointwise convolutions. This aspect opens an opportunity for a huge reduction in the number of parameters and floating-point operations, as we have exploited in the present paper.
Most parameters in DCNNs are redundant [17][18][19][20][21]. Pruning methods remove connections and neurons found to be irrelevant by different techniques. After training the original network with the full set of connections, the removal is carried out [22][23][24][25][26][27]. Our method differs from pruning in that we reduce the number of connections before training starts, whereas pruning does so after training. Therefore, our method can save computing resources during training time.
In previous works [28,29], we proposed replacing standard pointwise convolutions with a sub-architecture that contains two grouped pointwise convolutional layers (K and L), an interleaving layer that mixes channels from layer K before feeding the layer L, and a summation at the end that sums the results from both convolutional layers. Our original method accepts a hyperparameter Ch, which denotes the number of input channels fed to each group of filters. Then, our method computes the number of groups of filters and filters per group according to the division of original input channels and filters by Ch. Our original method avoided substituting the layers where the divisions were not exact.
In this paper, we propose an enhanced scheme to allow computing the number of groups in a flexible manner, in the sense that the divisibility constraints do not have to be considered anymore. By applying our method to all pointwise convolutional layers of an EfficientNet-B0 architecture, we are able to reduce a huge amount of resources (trainable parameters, floating-point computations) while maintaining the learning capacity. This paper is structured as follows: Section 2 details our improved solution for grouping pointwise convolutions while skipping the constraints of divisibility found in our previous method. Section 3 details the experiments carried out for testing our solution. Section 4 summarizes the conclusions and limitations of our proposal.

Mathematical Ground for Regular Pointwise Convolutions
Let X^i = {x^i_1, x^i_2, . . . , x^i_{Ic^i}} be a set of input feature maps (2D lattices) for a convolutional layer i in a DCNN, where Ic^i denotes the number of input channels for this layer. Let W^i = {w^i_1, w^i_2, . . . , w^i_{F^i}} be a set of filters containing the weights for convolutions, where F^i denotes the number of filters at layer i, which is also the number of output channels of this layer. Following the notation proposed in [30], a regular DCNN convolution can be mathematically expressed as in Equation (1):

X^{i+1} = W^i ⊛ X^i = {w^i_1 * X^i, w^i_2 * X^i, . . . , w^i_{F^i} * X^i},    (1)

where the operator ⊛ indicates that the filters in W^i are convolved with the feature maps in X^i, using the * operator to indicate a 3D tensor multiplication and the shifting of a filter w^i_j across all patches of the size of the filter in all feature maps. For simplicity, we are ignoring the bias terms. Consequently, X^{i+1} will contain F^i feature maps that will feed the next layer i + 1. The tensor shapes of the involved elements are the following:

X^i ∈ R^{H×W×Ic^i},  w^i_j ∈ R^{S×S×Ic^i},  X^{i+1} ∈ R^{H×W×F^i},    (2)

where H × W is the size (height, width) of the feature maps, and S × S is the size of a filter (usually square). In this paper we work with S = 1 because we are focused on pointwise convolutions. In this case, each filter w^i_j carries Ic^i weights. The total number of weights P^i in layer i is obtained with a simple multiplication:

P^i = F^i · Ic^i.    (3)
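As a quick sanity check, the parameter count of Equation (3) can be sketched in a few lines of Python (the function name is ours, and bias terms are ignored as in the text):

```python
def regular_pointwise_params(ic: int, f: int) -> int:
    """Number of weights in a standard pointwise (1x1) convolution,
    ignoring bias terms: each of the f filters carries ic weights."""
    return f * ic

# Example: a layer with 14 input channels and 10 filters holds 140 weights.
```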

Definition of Grouped Pointwise Convolutions
For expressing a grouped pointwise convolution, let us split the input feature maps and the set of filters into G^i groups, as X^i = {X^i_1, X^i_2, . . . , X^i_{G^i}} and W^i = {W^i_1, W^i_2, . . . , W^i_{G^i}}. Assuming that both Ic^i and F^i are divisible by G^i, the elements in X^i and W^i can be evenly distributed through all their subsets X^i_m and W^i_m. Then, Equation (1) can be reformulated as Equation (4):

X^{i+1} = {W^i_1 ⊛ X^i_1, W^i_2 ⊛ X^i_2, . . . , W^i_{G^i} ⊛ X^i_{G^i}}.    (4)

The shapes of the subsets are the following:

X^i_m ∈ R^{H×W×(Ic^i/G^i)},  W^i_m ∈ R^{1×1×(Ic^i/G^i)×Fg^i},    (5)

where Fg^i is the number of filters per group, namely, Fg^i = F^i/G^i. Since each filter w^{i,m}_j only convolves on a fraction of the input channels (Ic^i/G^i), the total number of weights per subset W^i_m is (F^i/G^i) · (Ic^i/G^i). Multiplying this expression by the number of groups provides the total number of weights P^i in a grouped pointwise convolutional layer i:

P^i = G^i · (F^i/G^i) · (Ic^i/G^i) = (F^i · Ic^i)/G^i.    (6)

Equation (6) shows that the number of trainable parameters is inversely proportional to the number of groups. However, grouping has the evident drawback of preventing each filter from being connected to all input channels, which reduces the possible combinations of input channels for learning new patterns. As this may lower the learning capacity of the DCNN, one must be cautious when using such a grouping technique.
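Under the divisibility assumption of Equation (6), the grouped parameter count can be sketched as follows (a minimal illustration; the function name and example sizes are ours):

```python
def grouped_pointwise_params(ic: int, f: int, g: int) -> int:
    """Weights in a grouped pointwise convolution (Equation (6)).
    Assumes ic and f are both divisible by g, as most deep learning
    APIs require."""
    if ic % g or f % g:
        raise ValueError("ic and f must be divisible by g")
    fg = f // g             # filters per group (Fg)
    ic_per_group = ic // g  # input channels seen by each filter
    return g * fg * ic_per_group  # = f * ic / g

# Example: 192 channels, 192 filters, 4 groups -> 36,864 / 4 = 9,216 weights.
```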

Improved Scheme for Reducing the Complexity of Pointwise Convolutions
Two major limitations of our previous method were inherited from constraints found in most deep learning APIs:
• The number of input channels Ic^i must be a multiple of the number of groups G^i.
• The number of filters F^i must be a multiple of the number of groups G^i.
The present work circumvents the first limitation by replicating channels from the input. The second limitation is circumvented by adding a second parallel path with another pointwise grouped convolution when required. Figure 1 shows an example of our updated architecture.
Details of this process, which is applied to substitute each pointwise convolutional layer i found in the original architecture, are described below. To explain the method, we start by detailing the construction of the layer K shown in Figure 1. For simplicity, we drop the index i and use the index K to refer to the original hyperparameters, i.e., we use Ic^K instead of Ic^i and F^K instead of F^i. Also, we will use the indexes K1 and K2 to refer to the parameters of the two parallel paths that may exist in layer K.
First of all, we must manually specify the value of the hyperparameter Ch. In the graphical example shown in Figure 1, we set Ch = 4. The rest of the hyperparameters, such as the numbers of groups in layers K and L, are determined automatically by the rules of our algorithm, according to the chosen value of Ch, the number of input channels Ic^K, and the number of filters F^K. We do not have a procedure to find the optimal value of Ch; hence, we apply ablation studies on a range of Ch values, as shown in the results section. For the example in Figure 1, we have chosen the value of Ch to obtain a full variety of situations that must be tackled by our algorithm, i.e., non-divisibility conditions.

Figure 1. A schematic diagram of our pointwise convolution replacement. This example replaces a pointwise convolution with 14 input channels and 10 filters. It contains two convolutional layers, K and L, one interleaving layer, and one summation layer. Channels surrounded by a red border represent replicated channels.

Definition of Layer K
The first step of the algorithm is to compute the number of groups in path K1, as in Equation (7):

G^{K1} = ⌈Ic^K / Ch⌉.    (7)

Since the number of input channels Ic^K may not be divisible by Ch, we use the ceiling operator on the division to obtain an integer number of groups. In the example, G^{K1} = ⌈14/4⌉ = 4. Thus, the output of the filters in path K1 can be defined as in (8):

X^{K1} = {W^{K1}_1 ⊛ X^K_1, W^{K1}_2 ⊛ X^K_2, . . . , W^{K1}_{G^{K1}} ⊛ X^K_{G^{K1}}}.    (8)

The subsets X^K_m are composed of input feature maps x_j, collected in a sorted manner; (9) provides a general definition of which feature maps x_j are included in any feature subset X^K_m:

X^K_m = {x_j | (m − 1) · Ch + 1 ≤ j ≤ m · Ch}.    (9)

However, if Ic^K is not divisible by Ch, the last group m = G^{K1} would not have Ch channels. In this case, the method completes this last group by replicating Ch − b initial input channels, where b is computed as stated in Equation (10):

b = Ic^K − Ch · (G^{K1} − 1).    (10)

It can be proved that b will always be less than or equal to Ch, since b is the excess of the integer division Ic^K/Ch, i.e., G^{K1} · Ch will always be greater than or equal to Ic^K, but less than Ic^K + Ch, because otherwise G^{K1} would increase its value (as the quotient of Ic^K/Ch). In the example, b = 2, hence X^{K1}_4 = {x_13, x_14, x_1, x_2}. Then, the method calculates the number of filters per group Fg^{K1} as in (11):

Fg^{K1} = ⌊F^K / G^{K1}⌋.    (11)

To avoid divisibility conflicts, this time we have chosen the floor integer division. For the first path K1, each of the filter subsets shown in (8) contains the following filters:

W^{K1}_m = {w^{K1}_j | (m − 1) · Fg^{K1} + 1 ≤ j ≤ m · Fg^{K1}},  w^{K1}_j ∈ R^{1×1×Ch}.    (12)

For the first path of the example, the number of filters per group is Fg^{K1} = ⌊10/4⌋ = 2. So, the first path has 4 groups (G^{K1}) of 2 filters (Fg^{K1}), each filter being connected to 4 input channels (Ch).
If F^K is not divisible by G^{K1}, a second path K2 will provide as many groups as filters not provided in K1, with one filter per group, to complete the total number of filters F^K:

G^{K2} = F^K − G^{K1} · Fg^{K1}.    (13)

In the example, G^{K2} = 2. The number of input channels required for the second path is Ch · G^{K2}. The method obtains those channels by reusing the same subsets of input feature maps X^K_m shown in (9). Hence, the output of the filters in path K2 can be defined as in (14):

X^{K2} = {w^{K2}_1 * X^K_1, w^{K2}_2 * X^K_2, . . . , w^{K2}_{G^{K2}} * X^K_{G^{K2}}},    (14)

where w^{K2}_j ∈ R^{1×1×Ch}. Therefore, each filter in K2 operates on exactly the same subset of input channels as the corresponding subset of filters in K1. Hence, each filter in the second path can be considered as belonging to one of the groups of the first path.
It must be noticed that G^{K2} will always be less than G^{K1}. This is true because G^{K2} is the remainder of the integer division F^K/G^{K1}, as can be deduced from (11) and (13). This property guarantees that there will be enough subsets X^K_m for this second path. After defining paths K1 and K2 in layer K, the output of this layer is the concatenation of both paths:

X^K = concat(X^{K1}, X^{K2}).    (15)

The total number of channels after the concatenation is equal to F^K = G^{K1} · Fg^{K1} + G^{K2}.
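The arithmetic of Equations (7)–(13) can be sketched in Python as follows (a minimal sketch with our own names; channel indexes are 1-based, as in the text):

```python
import math

def build_layer_k(ic_k: int, f_k: int, ch: int):
    """Sketch of the layer-K arithmetic. Returns the group count and
    filters per group of path K1, the number of single-filter groups
    in path K2, and the 1-based channel indexes of each input group,
    with replication of initial channels in the last group."""
    g_k1 = math.ceil(ic_k / ch)       # Equation (7)
    b = ic_k - ch * (g_k1 - 1)        # Equation (10): channels left for the last group
    groups = [list(range((m - 1) * ch + 1, m * ch + 1)) for m in range(1, g_k1)]
    # Last group: the b remaining channels plus ch - b replicated initial ones.
    groups.append(list(range(ch * (g_k1 - 1) + 1, ic_k + 1))
                  + list(range(1, ch - b + 1)))
    fg_k1 = f_k // g_k1               # Equation (11)
    g_k2 = f_k - g_k1 * fg_k1         # Equation (13)
    return g_k1, fg_k1, g_k2, groups

# Figure 1 example: 14 input channels, 10 filters, Ch = 4 gives
# 4 groups of 2 filters in K1, 2 extra single-filter groups in K2,
# and a last input group {13, 14, 1, 2} with two replicated channels.
```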

Interleaving Stage
As mentioned above, grouped convolutions inherently face a limitation: each parallel group of filters computes its output from its own subset of input channels, preventing combinations of channels connected to different groups. To alleviate this limitation, we propose to interleave the output channels of the convolutional layer K.
The interleaving process simply consists of arranging the odd channels first and the even channels last, as noted in Equation (16):

I^K = {k_1, k_3, . . . , k_{F^K − 1}, k_2, k_4, . . . , k_{F^K}}.    (16)

Here we are assuming that F^K is even. Otherwise, the list of odd channels will include an extra channel k_{2c+1}.
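The permutation of Equation (16) can be sketched with two list slices (a minimal sketch; the function name is ours):

```python
def interleave(channels: list) -> list:
    """Equation (16): odd-positioned channels first, then even-positioned.
    Works for both even and odd channel counts."""
    return channels[0::2] + channels[1::2]

# Example: channels 1..10 become [1, 3, 5, 7, 9, 2, 4, 6, 8, 10].
```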

Definition of Layer L
The interleaved output feeds the grouped convolutions in layer L, so that layer L processes data coming from more than one group of the preceding layer K.
To create layer L, we apply the same algorithm as for layer K, but now the number of input channels is equal to F K instead of Ic K .
The number of groups in path L1 is computed as:

G^{L1} = ⌈F^K / Ch⌉.    (17)

Note that G^{L1} may not be equal to G^{K1}. In the example, G^{L1} = ⌈10/4⌉ = 3.
Then, the output of L1 is computed as in (18), where the input channel groups I^K_m come from the interleaving stage:

X^{L1} = {W^{L1}_1 ⊛ I^K_1, W^{L1}_2 ⊛ I^K_2, . . . , W^{L1}_{G^{L1}} ⊛ I^K_{G^{L1}}}.    (18)

Each group is composed of Ch channels, whose indexes are generically defined in (19):

I^K_m = {i_j | (m − 1) · Ch + 1 ≤ j ≤ m · Ch}.    (19)

Again, the last group of indexes may not contain Ch channels due to a non-exact division in (17). Similar to path K1, the missing channels in the last group will be supplied by replicating the Ch − b initial interleaved channels, where b is computed as stated in Equation (20):

b = F^K − Ch · (G^{L1} − 1).    (20)

The number of filters per group Fg^{L1} is computed as in (21):

Fg^{L1} = ⌊F^K / G^{L1}⌋.    (21)

In the example, Fg^{L1} = ⌊10/3⌋ = 3. Each group of filters W^{L1}_m shown in (18) can be defined as in (22), each one containing Fg^{L1} convolutional filters of Ch inputs:

W^{L1}_m = {w^{L1}_j | (m − 1) · Fg^{L1} + 1 ≤ j ≤ m · Fg^{L1}},  w^{L1}_j ∈ R^{1×1×Ch}.    (22)

It should be noted that if the division in (21) is not exact, the number of output channels from layer L may not reach the required F^K outputs. In this case, a second path L2 will be added, with the following parameters:

G^{L2} = F^K − G^{L1} · Fg^{L1},  Fg^{L2} = 1.    (23)

In the example, G^{L2} = 1. The output of path L2 is computed as in (24), defining one extra convolutional filter for some initial groups of interleaved channels declared in (18) and (19), taking into account that G^{L2} will always be less than G^{L1}, according to the same reasoning done for G^{K2} and G^{K1}:

X^{L2} = {w^{L2}_1 * I^K_1, w^{L2}_2 * I^K_2, . . . , w^{L2}_{G^{L2}} * I^K_{G^{L2}}}.    (24)

The last step in defining the output of layer L is to join the outputs of paths L1 and L2:

X^L = concat(X^{L1}, X^{L2}).    (25)
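Since layer L follows the same rules as layer K but with F^K input channels, its configuration for the Figure 1 example can be sketched as follows (a minimal sketch; names are ours):

```python
import math

def layer_l_config(f_k: int, ch: int):
    """Layer-L counterpart of the layer-K rules (Equations (17), (21), (23)):
    the number of input channels is now F_K itself."""
    g_l1 = math.ceil(f_k / ch)   # Equation (17)
    fg_l1 = f_k // g_l1          # Equation (21)
    g_l2 = f_k - g_l1 * fg_l1    # Equation (23)
    return g_l1, fg_l1, g_l2

# Figure 1 example (F_K = 10, Ch = 4): 3 groups of 3 filters in L1
# plus 1 extra single-filter group in L2, recovering the 10 required outputs.
```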

Joining of Layers
Finally, the outputs of both convolutional layers K and L are summed to create the output of the original layer:

X^{i+1} = X^K + X^L.    (26)

Compared to concatenation, summation has the advantage of allowing residual learning in the filters of layer L, because the gradient can be backpropagated through the L filters or directly to the K filters. In other words, residual layers provide more learning capacity with a low degree of the downsides associated with increasing the number of layers (i.e., overfitting, longer training time, etc.). In the results section, we present an ablation study that contains experiments done without the interleaving and L layers (rows labeled "no L"). These experiments empirically show that the interleaving mechanism and the secondary layer L help improve the sub-architecture's accuracy, with low impact.

It is worth mentioning that we only add layer L and the interleaving when the number of input channels is greater than or equal to the number of filters in layer K.

Computing the Number of Parameters
We can compute the total number of parameters of our sub-architecture. First, Equation (27) shows that the number of filters in layer K is equal to the number of filters in layer L, which in turn is equal to the total number of filters of the original convolutional layer F^i:

F^K = F^L = F^i.    (27)

Then, the total number of parameters P^i is twice the number of original filters multiplied by the number of input channels per filter:

P^i = 2 · F^i · Ch.    (28)

Therefore, comparing Equation (28) with (3), it is clear that Ch must be significantly less than Ic^i/2 to reduce the number of parameters of a regular pointwise convolutional layer. Also, comparing Equation (28) with (6), our sub-architecture provides a parameter reduction similar to a plain grouped convolutional layer when Ch is around Ic^i/(2G^i), although we cannot specify a general G^i term because of the complexity of our pair of layers with possibly two paths per layer.
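The comparison between Equations (28) and (3) can be illustrated with a short sketch (the layer sizes below are illustrative, not taken from an actual network):

```python
def substitution_params(f: int, ch: int) -> int:
    """Equation (28): layers K and L each hold f filters of ch weights."""
    return 2 * f * ch

# Illustrative (hypothetical) layer: a 1152-channel, 320-filter pointwise
# convolution with Ch = 32 drops from 1152 * 320 = 368,640 weights to
# 2 * 320 * 32 = 20,480 weights, i.e., about 5.6% of the original count.
```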
The requirement for a low value of Ch is also necessary to ensure that the divisions in Equations (7) and (17) provide quotients above one; otherwise, our method will not create any grouping. Hence, Ch must be less than or equal to both Ic^i/2 and F^i/2. These are the only two constraints that restrict our method.
As shown in Table 1, pointwise convolutional layers found in real networks such as EfficientNet-B0 have large values of Ic^i and F^i, in the hundreds or thousands. Therefore, values of Ch less than or equal to 32 will ensure a good parameter-reduction ratio for most of these pointwise convolutional layers.
EfficientNet is one of the most complex (yet efficient) architectures that can be found in the literature. For our method, the degree of complexity of a DCNN is mainly related to the maximum number of input channels and output features in any pointwise convolutional layer. Our method does not depend on the number of layers, either in depth or in parallel, because it works on each layer independently. Therefore, the degree of complexity of EfficientNet-B0 can be considered significantly high, taking into account the values shown in the last row of Table 1. Arguably, other versions of EfficientNet (B1, B2, etc.) and other types of DCNNs can exceed those values. In such cases, higher values of Ch may be necessary, but we cannot provide a rule to forecast the optimum value for the configuration of any pointwise convolutional layer.

Table 1. For a standard pointwise convolution with Ic input channels, F filters, P parameters, and a given number of channels per group Ch, this table shows the calculated parameters for layers K and L: the number of groups G_{<layer><path>} and the number of filters per group Fg_{<layer><path>}. The last 2 columns show the total number of parameters and its percentage with respect to the original layer.

Activation Function
In 2018, Ramachandran et al. [31] tested a number of activation functions. In their experimentation, they found that the best-performing one was the so-called "swish", shown in Equation (29):

swish(x) = x · sigmoid(x) = x / (1 + e^{−x}).    (29)
In previous works [28,29], we used the ReLU activation function. In this work, we use the swish activation function instead. This change gives us better results in our ablation experiments, shown in Table 5.
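For reference, a minimal scalar implementation of the swish function of Equation (29):

```python
import math

def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))

# swish(0) = 0; for large positive x, swish(x) approaches x.
```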

Implementation Details
We tested our optimization by replacing the original pointwise convolutions in EfficientNet-B0, naming the resulting model "kEffNet-B0 V2". With CIFAR-10, we tested an additional modification that skips the first 4 convolutional strides, allowing input images of 32 × 32 pixels instead of the original resolution of 224 × 224 pixels.
In all our experiments, we saved the trained network from the epoch that achieved the lowest validation loss for testing with the test dataset. Convolutional layers are initialized with Glorot's method [32]. All experiments were trained with the RMSProp optimizer, data augmentation [33], and a cyclical learning rate schedule [34]. We worked with various hardware configurations with NVIDIA video cards. Regarding software, we ran our experiments with K-CAI [35] and Keras [36] on top of TensorFlow [37].

Horizontal Flip
In some experiments, we run the model twice: once with the input image and once with its horizontally flipped version. The softmax outputs from both runs are summed before class prediction. In these experiments, the number of floating-point computations doubles, although the number of trainable parameters remains the same.
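A minimal sketch of this prediction scheme, assuming the two softmax vectors have already been computed (the function name and probability values are illustrative only):

```python
def flipped_prediction(probs_original: list, probs_flipped: list) -> int:
    """Sum the softmax outputs of the original and horizontally flipped
    runs, then return the index of the winning class."""
    summed = [a + b for a, b in zip(probs_original, probs_flipped)]
    return max(range(len(summed)), key=summed.__getitem__)
```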

Results and Discussion
In this section, we present and discuss the results of the proposed scheme on three image classification datasets: the CIFAR-10 dataset [38], the Malaria dataset [40], and the colorectal cancer histology dataset [39].

Results on the CIFAR-10 Dataset
The CIFAR-10 dataset [38] is a subset of [41] and consists of 60k 32 × 32 images belonging to 10 different classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. These images are taken in natural and uncontrolled lighting environments. They contain only one prominent instance of the object to which the class refers. The object may be partially occluded or seen from an unusual viewpoint. This dataset has 50k images for training and 10k images for testing. We picked 5k images for validation and left the training set with 45k images. We ran experiments with 50 and 180 epochs.
In Table 2 we compare kEffNet-B0 V1 (our previous method) and V2 (our current method) for two values of Ch. Our V2 models achieve a slightly larger reduction in both the number of parameters and floating-point computations than their V1 counterparts, while achieving slightly higher accuracy. Specifically, V2 models save 10% of the parameters (from 1,059,202 to 950,650) and 11% of the floating-point computations (from 138,410,206 to 123,209,110) of the V1 models. All of our variants obtain accuracy similar to the baseline with a remarkable reduction in resources (using at most 26.3% of the trainable parameters and 35.5% of the computations).

As the scope of this work is limited to small datasets and small architectures, we only experimented with the smallest EfficientNet variant (EfficientNet-B0) and our modified variant (kEffNet-B0). Nevertheless, Table 3 provides the number of trainable parameters of the other EfficientNet variants (original and parameter-reduced). Equation (3) indicates that the number of parameters grows with the number of filters and the number of input channels, while Equation (6) indicates that it decreases with the number of groups. As we create more groups when the number of input channels grows, we expect to find bigger parameter savings on larger models. This saving can be seen in Table 3.

We also tested our kEffNet-B0 with 2, 4, 8, 16, and 32 channels per group for 50 epochs, as shown in Table 4. As expected, the test classification accuracy increases when allocating more channels per group: from 84.26% for Ch = 2 to 93.67% for Ch = 32. Also, the resource saving decreases as the number of channels per group increases: from 7.8% of the parameters and 11.4% of the computations for Ch = 2, to 23.6% of the parameters and 31.6% of the computations for Ch = 32 (compared to the baseline). For CIFAR-10, if we aim to achieve an accuracy comparable to the baseline, we must choose at least 16 channels per group.
If we add an extra run per image sample with horizontal flipping when training kEffNet-B0 V2 32ch, the classification accuracy increases from 93.67% to 94.01%.

Table 4. Ablation study done with the CIFAR-10 dataset for 50 epochs, comparing the effect of varying the number of channels per group. It also includes the improvement achieved by running kEffNet-B0 V2 32ch with both original and horizontally flipped images.

Table 5 replicates most of the results shown in Table 4, but compares the effect of not including layer L and the interleaving, and also of substituting the swish activation function with the typical ReLU. As can be observed, disabling layer L causes a noticeable degradation in test accuracy when the values of Ch are smaller. For example, when Ch = 4, the performance drops by more than 5%; on the other hand, when Ch = 32 the drop is less than 0.5%. This is logical considering that the more channels are included per group, the more chances there are to combine input features in the filters. Therefore, a second layer and the corresponding interleaving are not as crucial as when the filters of layer K are fed with fewer channels.
In the comparison of activation functions, the same effect can be appreciated: the swish function works better than the ReLU function, but provides less improvement for larger numbers of channels per group. Nevertheless, the gain in the case with the least difference (32ch) is still worthwhile, with more than 1.5% extra test accuracy when using the swish activation function.

Table 5. Extra experiments made for the kEffNet-B0 V2 4ch, 8ch, 16ch, and 32ch variants. Rows labeled "no L" indicate experiments done using only layer K, i.e., disabling layer L and the interleaving. Rows labeled "ReLU" replace the swish activation function with ReLU.

Table 6 shows the effect on accuracy when classifying the CIFAR-10 dataset with EfficientNet-B0 and our kEffNet-B0 V2 32ch variant for 180 epochs instead of 50. The additional training epochs give the baseline slightly higher test accuracy than our core variant. When adding horizontal flipping, our variant slightly surpasses the baseline results. Nevertheless, all three results can be considered similar to each other, while our variant offers a significant saving in parameters and computations. Although the horizontal flipping doubles the computational cost of our core variant, it still remains only a fraction (63.3%) of the baseline computational cost.

Results on the Malaria Dataset
The Malaria dataset [40] has 27,558 cell images from infected and healthy cells separated into 2 classes, with the same number of images for healthy and infected cells. From the original set of 27,558 images, we separated 10% of the images (2756 images) for validation and another 10% for testing. Each of the training, validation, and test subsets contains 50% healthy cell images. We quadrupled the number of validation images by flipping them horizontally and vertically, resulting in 11,024 images for validation.
On this dataset, we tested our kEffNet-B0 with 2, 4, 8, 12, 16, and 32 channels per group, as well as the baseline architecture, as shown in Table 7. Our variants require from 7.5% to 23.5% of the trainable parameters and from 15.7% to 42.2% of the computations allocated by the baseline architecture. Although the worst classification accuracy was found with the smallest variant (2ch), its classification accuracy is less than 1% below the best-performing variant (16ch) and only 0.69% below the baseline performance. With only 8 channels per group, our method equals the baseline accuracy with a small portion of the parameters (10.8%) and computations (22.5%) required by the baseline architecture. Curiously, our 32ch variant is slightly worse than the 16ch variant, but still better than the baseline. This is an example of how a rather low complexity of the input images may require fewer channels per filter (and more parallel groups of filters) to optimally capture the relevant features of the images.

Results on the Colorectal Cancer Histology Dataset
The colorectal cancer histology dataset [39] contains 5000 images of 150 × 150 pixels separated into 8 classes: adipose, complex, debris, empty, lympho, mucosa, stroma, and tumor. As with the Malaria dataset, we separated 10% of the images for validation and another 10% for testing. We also quadrupled the number of validation images by flipping them horizontally and vertically.
On this dataset, we tested our kEffNet-B0 with 2, 4, 8, 12, and 16 channels per group, as well as the baseline architecture, as shown in Table 8. As with the Malaria dataset, higher values of channels per group do not lead to better performance. In this case, the variants with the highest classification accuracy are 4ch and 8ch, achieving 98.02% classification accuracy and outperforming the baseline accuracy by 0.41%. The 16ch variant obtained the same accuracy as the 2ch variant, but doubles the required resources. Again, this indicates that the complexity of the images plays a role in the selection of the optimal number of channels per group. In other words, simpler images may require fewer channels per group. Unfortunately, the only method we know to find this optimal value is to perform these scanning experiments.

Conclusions and Future Work
This paper presented an efficient scheme for decreasing the complexity of pointwise convolutions in DCNNs for image classification, based on interleaved grouped filters with no divisibility constraints. From our experiments, we can conclude that connecting all input channels from the previous layer to all filters is unnecessary: grouped convolutional filters can achieve the same learning power with a small fraction of the resources (1/3 of the floating-point computations, 1/4 of the parameters). Our enhanced scheme avoids the divisibility constraints, further reducing the required resources (up to 10% less) while maintaining or slightly surpassing the accuracy of our previous method.
We have made ablation studies to obtain the optimal number of channels per group for each dataset. For the colorectal cancer dataset, this number is surprisingly low (4 channels per group). On the other hand, for CIFAR-10 the best results require at least 16 channels per group. This fact indicates that the complexity of the input images affects the optimal configuration of our sub-architecture.
As the main limitation of our method, it cannot automatically determine the optimal number of channels per group according to the complexity of each pointwise convolutional layer to be substituted and the complexity of the input images. A second limitation is that the same number of channels per group is applied to all pointwise convolutional layers of the target architecture, regardless of the specific complexity of each layer. This limitation could be easily tackled by setting Ch as a fraction of the total number of parameters of each layer, which is a straightforward task for future research. Besides, we will apply our method to different problems, such as instance and semantic image segmentation, developing an efficient deep learning-based seismic acoustic impedance inversion method [42], object detection, and forecasting.

Data Availability Statement: Datasets used in this study are publicly available: CIFAR-10 [38], Colorectal cancer histology [39], and Malaria [40]. Software APIs are also publicly available: K-CAI [35] and Keras [36]. Our source code and raw experiment results are publicly available: https://github.com/joaopauloschuler/kEffNetV2, accessed on 1 September 2022.