As discussed in this section, an extensive set of experiments was conducted on three real-world datasets: CIFAR-10 [31], Tiny-ImageNet [6], and SVHN [32]. To study the influence of Mqscm on network performance in isolation from complex architectures and gain mechanisms, the bottleneck-free architecture of ResNet-50 [13] was selected. First, only the 3 × 3 standard convolution in each residual basic block was replaced, keeping the remaining parameters unchanged, so that multiple convolution blocks could be compared fairly. Second, the stacking times of the residual blocks were adjusted to evaluate the performance of Mqscm under varying computational overheads. Subsequently, the MqscmNet network was constructed and compared against a series of classic networks, which eliminated the interference of additional gain mechanisms and demonstrated, more intuitively, the impact of network simplification on performance. Finally, variants of Mqscm were compared within MqscmNet to verify the influence of different structures on its performance. All architectures were implemented in PyTorch [33], and the experimental machine was equipped with an NVIDIA GeForce RTX 3060 graphics card.
4.1. Datasets and Training Settings
CIFAR-10: The CIFAR-10 dataset [31] is a small dataset for recognizing common objects, created by Alex Krizhevsky and Ilya Sutskever. It includes 60,000 color natural images sized at 32 × 32, divided into 10 categories of 6000 images each, with 50,000 images allocated for training and 10,000 for testing and no overlap between the two splits. In this paper, a standard data augmentation scheme was employed: first, a padding operation (padding = 4) resized each image to 40 × 40; second, the image was randomly cropped back to 32 × 32, followed by random horizontal flipping and random occlusion; finally, normalization was performed.
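A minimal torchvision sketch of this augmentation pipeline is given below. The normalization statistics and the RandomErasing (random occlusion) parameters are assumptions based on common CIFAR-10 practice, not values reported in this paper.

```python
import torchvision.transforms as T

# Sketch of the CIFAR-10 training augmentation described above.
train_transform = T.Compose([
    T.Pad(4),                    # 32x32 -> 40x40
    T.RandomCrop(32),            # random 32x32 crop
    T.RandomHorizontalFlip(),    # random horizontal flip
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465),    # commonly used CIFAR-10
                (0.2470, 0.2435, 0.2616)),   # channel statistics (assumed)
    T.RandomErasing(),           # random occlusion (default parameters)
])
```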
Tiny-ImageNet: The Tiny-ImageNet dataset [6] is a subset of ImageNet [34] with 200 categories, each containing 500 training images and 50 validation images. The images are 64 × 64 color images; the higher resolution makes the features to be learned more complex and lengthens training. This paper adopted the same image-enhancement method as for CIFAR-10.
SVHN: The SVHN dataset [32] comprises 10 categories of color digit images (0 to 9) with a resolution of 32 × 32. The training set consists of 73,257 images, and the test set contains 26,032 images. The same image-enhancement method was applied.
Training settings: For all networks, the settings were batch size = 256, epochs = 100, and an initial learning rate (Lr) of 0.01. The SGD optimizer was employed, with a momentum of 0.9 and a weight decay of 5 × 10⁻³, a fairly strong L2 regularization to mitigate overfitting. The learning rate was adjusted with the MultiStepLR step scheduler, with Lr decreasing to 0.1 times its previous value at 0.45× and 0.7× of the total epochs. Batch normalization (BN) [35] was utilized to address the vanishing-gradient problem, and ReLU was chosen as the activation function. The CIFAR-10, Tiny-ImageNet, and SVHN datasets employed the same training settings.
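A PyTorch sketch of these settings follows; `model` is a placeholder standing in for any of the compared networks.

```python
import torch
import torch.nn as nn

EPOCHS = 100
model = nn.Linear(8, 8)  # placeholder; substitute any network under test

# SGD with momentum 0.9 and the strong L2 penalty (weight decay) noted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-3)
# Lr falls to 0.1x its previous value at 45% and 70% of the total epochs,
# i.e., after epochs 45 and 70 when training for 100 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.45 * EPOCHS), int(0.7 * EPOCHS)], gamma=0.1)
```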
4.2. Results and Analysis
This paper evaluates the performance of Mqscm on three public image-classification datasets, utilizing the bottleneck-free architecture of ResNet-50 [13] to demonstrate that Mqscm enhances network performance without additional gain mechanisms. We replaced each standard 3 × 3 convolution (Conv3) in the residual block with Mqscm, a 3 × 3 depthwise convolution (DWConv3), or a 2 × 2 convolution with a symmetric padding strategy (C2sp), and then compared the performance of the resulting convolution blocks. All data processing and training details remained the same to ensure the fairness of the experiment.
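The sketch below illustrates the kind of drop-in replacement used here. DWConv3 is rendered as a depthwise separable pair (depthwise 3 × 3 plus pointwise 1 × 1), an assumption consistent with the later references to depthwise separable convolution; the internal structures of Mqscm and C2sp follow their respective descriptions in the text and are not reproduced.

```python
import torch.nn as nn

def conv3(c_in, c_out, stride=1):
    # Standard 3x3 convolution (Conv3) as used in the residual basic block.
    return nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)

def dwconv3(c_in, c_out, stride=1):
    # Assumed form of DWConv3: depthwise 3x3 followed by a 1x1 pointwise
    # projection (a depthwise separable convolution).
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                  groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )
```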
Table 1 shows the classification results of the Mqscm, Conv3, DWConv3, and C2sp convolutions on the CIFAR-10 dataset. The metrics compared were computational complexity (FLOPs), parameter size (Params), accuracy, and average training time per epoch (Time/Epoch). Ranked by computational complexity, Conv3 > C2sp > DWConv3 > Mqscm; by accuracy, Conv3 > Mqscm > DWConv3 > C2sp; by speed, DWConv3 > Mqscm > C2sp > Conv3. The standard convolution Conv3 unsurprisingly had the highest accuracy, but also the largest computational complexity, the most parameters, and the longest training time. DWConv3 was the fastest and achieved a good efficiency tradeoff, but in computational complexity and accuracy it was slightly inferior to Mqscm. C2sp performed well but was weaker than Mqscm in all respects. This is attributed to the heightened sensitivity of standard convolution to complex feature information in images and its strong representation capability; however, considering channels and regions simultaneously increases computing and memory requirements, thereby reducing training efficiency. Conversely, DWConv3 and Mqscm effectively reduced the number of parameters and increased training speed through distinct separation methods, at the cost of occasionally losing some global information and hence a slight decrease in accuracy. Compared with Conv3, Mqscm saved about 44.5% of the computational cost and accelerated model training by about 32.5%. The comprehensive performance of Mqscm was comparable to that of DWConv3.
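For intuition about where such savings come from, a back-of-the-envelope parameter count for a single 3 × 3 layer is shown below; the channel widths are illustrative and are not taken from Table 1.

```python
# Parameters of one 3x3 layer at an assumed width of 256 in / 256 out.
c_in, c_out, k = 256, 256, 3
conv3_params = k * k * c_in * c_out            # standard conv: 589,824
dw_sep_params = k * k * c_in + c_in * c_out    # depthwise separable: 67,840
print(conv3_params / dw_sep_params)            # ~8.7x fewer parameters
```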
Table 2 presents the classification performance of the same convolution methods on the Tiny-ImageNet dataset. Mqscm outperformed the other convolutional modules in the table in terms of FLOPs, Params, and accuracy, and it trained faster than Conv3 and C2sp, trailing only DWConv3. This occurred because the standard convolution's extraction of large amounts of intricate information obscured some crucial details in this dataset, leading to lower accuracy for Conv3 and C2sp than for Mqscm and DWConv3. The unique structure of Mqscm not only diminished information redundancy but also integrated multi-scale information, yielding the best accuracy on this dataset. Compared to Conv3, Mqscm improved accuracy by 1.82% and model training speed by 14.93%.
Table 3 shows the classification performance of the convolution modules on the SVHN dataset. Mqscm exhibited the lowest FLOPs (M) and Params (M) and surpassed Conv3 and C2sp in training time, which is attributed to Mqscm's reduction of computing and memory requirements through secondary separation, thereby enhancing training efficiency. In accuracy, it outperformed DWConv3 and C2sp, because Mqscm integrated multi-scale information, to some extent mitigating the loss of global information caused by secondary separation. Compared with Conv3, training speed improved by 35.41%. The experiments demonstrated that the performance of Mqscm on the SVHN dataset was highly competitive and comparable to that of depthwise separable convolution.
In summary, Mqscm only requires a direct replacement of the original standard convolution, resulting in substantial savings in computational resources and memory overhead. Moreover, it accelerates model training while maintaining a high level of accuracy. Experiments conducted on three real public datasets demonstrated that Mqscm not only effectively achieved efficiency tradeoffs in networks, but also exhibited good generalization.
This paper provides tradeoff curves of accuracy and speed under different FLOPs on the CIFAR-10 dataset, as shown in Figure 4: subgraph (a) displays the accuracy tradeoff curve and subgraph (b) the speed tradeoff curve under different FLOPs.
In ResNet-50, the stacking times of the blocks from stage 1 to stage 4 are (3, 4, 6, 3). In this paper, the stacking times were changed to (1, 1, 3, 1) and (2, 2, 6, 2) to obtain different computational complexities and parameter counts, while the remaining settings stayed unchanged. The experimental results indicated that Mqscm achieved faster speeds than Conv3 and C2sp while maintaining higher accuracy than DWConv3 across the various computational budgets, all without additional gain mechanisms such as attention. In summary, Mqscm exhibited an overall performance comparable to that of depthwise separable convolution, striking a good balance between accuracy and network efficiency.
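As a sketch of how such stacking variants can be instantiated, torchvision's generic ResNet constructor accepts the per-stage block counts directly; the paper's own implementation additionally swaps in the convolution modules under test, which is not shown here.

```python
from torchvision.models.resnet import ResNet, BasicBlock

# Bottleneck-free ResNets with the three stacking configurations above.
small  = ResNet(BasicBlock, [1, 1, 3, 1], num_classes=10)
medium = ResNet(BasicBlock, [2, 2, 6, 2], num_classes=10)
full   = ResNet(BasicBlock, [3, 4, 6, 3], num_classes=10)
```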
To further explore the performance of Mqscm, this paper proposes MqscmNet, a convolutional network with a deliberately simple architecture, as illustrated in Figure 3. Table 4 presents a performance comparison against various classical classification networks on the CIFAR-10 dataset. To ensure fairness, BN regularization was employed in all networks to mitigate the vanishing-gradient problem, and ReLU was used uniformly as the activation function. The experimental results showed that the accuracy of MqscmNet was lower only than that of Vgg-16 and much higher than that of the other classical networks. Compared with ResNet-50, MqscmNet had fewer layers, about 59.5% fewer parameters, about 29.7% less training time, and 0.59% higher accuracy. This arose from task-specific challenges: the excessive depth of ResNet-50 led to repeated extraction of complex information, masking certain details, whereas MqscmNet simplified the number of layers and used Mqscm to widen the network while reducing computational workload and memory overhead, improving both accuracy and efficiency.
It was also found that reducing the computational load is a necessary, but not a sufficient, condition for a lightweight network: for example, the computational load and parameters of Vgg-16 were larger than those of ResNet-50, yet Vgg-16 achieved higher accuracy and faster speed. These findings affirmed that MqscmNet effectively leveraged the performance of Mqscm and balanced network speed against accuracy.
Table 5 presents comparative experiments on several variant structures of Mqscm on the CIFAR-10 dataset; their structures are depicted in Figure 5. Both the ResNet-50 and MqscmNet architectures were used to assess the impact of the different structures on accuracy and speed. Among the four structures, Mqscm-d had the highest accuracy, but its training speed was slightly slower than that of Mqscm-b: the multi-scale structure of Mqscm-d enhanced model accuracy but also increased the degree of fragmentation, potentially reducing hardware parallelism. Mqscm-b trained the fastest but with lower accuracy. All four variants of Mqscm performed better in MqscmNet than in ResNet-50, confirming the effectiveness of MqscmNet in enhancing performance.
It is evident from Table 1, Table 2 and Table 3 that DWConv3 exhibited performance characteristics highly similar to those of Mqscm. To conduct a more in-depth comparison, two widely recognized architectures, ResNet-50 and Mobilenet_v3_large, were employed for k-fold cross-validation across the three datasets: the training set was divided into five subsets, each model was trained and evaluated five times, and the models were then tested on the respective test sets.
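A minimal sketch of this 5-fold protocol is given below; `train_and_eval` is a hypothetical routine that trains on one split and returns its validation accuracy, and the training-set size shown is that of CIFAR-10.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_and_eval(train_idx, val_idx):
    # Hypothetical: train a model on `train_idx`, return accuracy on `val_idx`.
    ...

num_train = 50_000                      # CIFAR-10 training-set size
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_acc = [train_and_eval(tr, va)      # one accuracy score per fold
            for tr, va in kf.split(np.arange(num_train))]
```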
Table 6 presents the accuracy comparison of Mqscm and DWConv3 through k-fold cross-validation.
The data presented in Table 6 reveal that Mqscm exhibited superior accuracy compared to DWConv3 across the three datasets and the two distinct classic architectures. This superiority can be attributed to the multi-scale topology inherent in Mqscm, which enhanced its information-extraction capability. Furthermore, both Mqscm and DWConv3 achieved higher accuracy within ResNet-50 than within Mobilenet_v3_large: ResNet-50, with its deeper architecture, extracted more intricate feature information, whereas Mobilenet_v3_large, a classic lightweight architecture, significantly reduced FLOPs and Params at the cost of sacrificing part of its information-extraction capability. In summary, across the three data scenarios presented in Table 6, Mqscm consistently outperformed DWConv3 in terms of accuracy.