CDTNet: Improved Image Classification Method Using Standard, Dilated and Transposed Convolutions

Convolutional neural networks (CNNs) have achieved great success in image classification tasks. In a convolutional operation, a larger input area can capture more context information. Stacking several convolutional layers can enlarge the receptive field, but this increases the number of parameters. Most CNN models use pooling layers to extract important features, but pooling operations cause information loss. Transposed convolution can increase the spatial size of the feature maps to recover the lost low-resolution information. In this study, we used two branches with different dilation rates to obtain features of different sizes. The dilated convolution can capture richer information, and the outputs from the two branches are concatenated together as input for the next block. The small feature maps of the top blocks are transposed to increase their spatial size and recover low-resolution prediction maps. We evaluated the model on three image classification benchmark datasets (CIFAR-10, SVHN, and FMNIST) against four state-of-the-art models, namely, VGG16, VGG19, ResNeXt, and DenseNet. The experimental results show that CDTNet achieved lower loss, higher accuracy, and faster convergence in the training and test stages. The average test accuracy of CDTNet increased by at most 54.81% (on SVHN, compared with VGG19) and by at least 1.28% (on FMNIST, compared with VGG16), which indicates that CDTNet has better performance and strong generalization ability, as well as fewer parameters.

The excellent performance of CNNs comes from their wider and deeper models [4]; however, these models have also faced an increasing memory burden [21], which limits their application in resource-constrained and high real-time requirement scenarios, such as mobile terminals and embedded systems with low hardware resources [22,23].
The CNN operation usually extracts features through the convolutional layer and integrates features by subsampling and fully connected (FC) layers; the method based on deep features can learn the most distinguishable semantic-level features from the original input [24]. Most image classification networks [2,3,25,26] employ successive pooling operations to gradually reduce the resolution of features and extend the receptive field (RF) size, but the pooling operations will cause information loss [27].
In CNNs, each feature map of the output depends only on a certain area of the input, and a larger input area can capture more context information [15]. As Figure 1 shows, dilated convolution offers a convenient way to control the RF of a model. When the dilation rate is 1, dilated convolution is standard convolution. When the dilation rate is 2 or 3, the size of the RF of a 3 × 3 filter is 5 × 5 or 7 × 7, respectively, as presented in Figure 1a-c. All the values are zero except the value at the red dot.
Dilated convolution with rate r introduces r − 1 zeros between successive filter parameters, which effectively enlarges the kernel size of the filter. The RF can be calculated by the following formula:
$$RF = k + (k - 1)(r - 1) \tag{1}$$
where k represents the filter size, and r is the dilation rate. Compared with networks that use only standard convolution, dilated convolution can capture richer information. The number of parameters in dilated convolution is the same as that in standard convolution, but the size of the RF increases linearly with the dilation rate [14].
Dilated convolution layers with different dilation rates can extract multi-scale features. To capture more contextual information at multiple scales, Lu et al. [49] and Xia et al. [14] set the dilation rates to 1, 3, and 5 to extract richer features. DeepLabv2 [20] used dilation rates of 6, 12, 18, and 24, and Yao et al. [50] proposed PYolo, which uses multi-branch convolution with dilation rates of 1, 3, 6, and 12 to detect pneumonia; these features complement each other to ensure that the information distributed in different ranges can be sampled [51].
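The relation between filter size, dilation rate, and RF can be checked with a few lines of Python. This is a minimal sketch, assuming a single layer with stride 1; the function names `dilated_rf` and `num_weights` are illustrative and do not come from the original work.

```python
def dilated_rf(k: int, r: int) -> int:
    """Receptive field (per side) of one k x k filter with dilation rate r.

    Inserting r - 1 zeros between successive filter taps stretches the
    filter from k to k + (k - 1) * (r - 1) positions.
    """
    if k < 1 or r < 1:
        raise ValueError("k and r must be positive integers")
    return k + (k - 1) * (r - 1)


def num_weights(k: int) -> int:
    """Weights per input/output channel pair; independent of the rate."""
    return k * k


if __name__ == "__main__":
    for rate in (1, 2, 3):
        side = dilated_rf(3, rate)
        print(f"rate={rate}: RF {side} x {side}, weights {num_weights(3)}")
```

For a 3 × 3 filter, the sketch gives RFs of 3, 5, and 7 per side for rates 1, 2, and 3, while the number of weights stays at 9.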

Transposed Convolution
Transposed convolution (TC) was introduced by Zeiler et al. [52] and is widely used in generative models for computer vision [53,54]. TC works by exchanging the forward and backward passes of a convolution [55]. It is used to increase the spatial size of the feature maps to recover the low-resolution prediction maps. The process of standard and transposed convolutions is shown in Figure 2 [56]. TC is not exactly the reverse operation of standard convolution, but it reconstructs the high-dimensional state by gradually up-sampling low-dimensional representations [57].
By constructing a transposed model, the low-resolution features can be mapped to the high-resolution ones, and accurate boundary location can be generated through pixel-level supervision [58].
For example, refs. [18,59–61] employed TC after the pooling layer to enlarge the low-resolution features and make the output the same size as the input. Zeiler et al. used TC in computer vision to capture mid- and high-level image structures [45], and Gulrajani and Yu et al. used TC to generate high-resolution feature maps [53,54], achieving remarkable performance in the up-sampling process.
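The up-sampling effect of TC can be illustrated with the usual output-size arithmetic along one spatial axis. The sketch below is plain Python with hypothetical function names; the explicit-padding formula follows the common deep-learning convention for transposed convolution and is an assumption here, since the text does not state a size formula.

```python
def conv_out(size: int, k: int, stride: int = 1, pad: int = 0) -> int:
    """Output length of a standard convolution along one axis."""
    return (size + 2 * pad - k) // stride + 1


def tconv_out(size: int, k: int, stride: int = 1, pad: int = 0,
              output_pad: int = 0) -> int:
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + k + output_pad


def tconv_out_same(size: int, stride: int) -> int:
    """With size-preserving ('same') padding, TC multiplies the size by the stride."""
    return size * stride


if __name__ == "__main__":
    down = conv_out(32, k=3, stride=2, pad=1)                  # 32 -> 16
    up = tconv_out(down, k=3, stride=2, pad=1, output_pad=1)   # 16 -> 32
    print(down, up)
```

The round trip restores the spatial size but not the original values, which is consistent with the remark that TC is not the exact inverse of standard convolution.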

Feature Fusion Methods
Feature fusion is an important operation in the CNN models, which can transmit the information of lower layers directly to higher layers [21]. Merging the features from different layers can achieve the effect of aggregating information from different RFs. For the operation method of the joining layer, many researchers chose addition and concatenation, both of which seem reasonable [9].
The experimental results of [61] showed that the concatenation operation is more effective in their architectures. Fu et al. [58] adopted concatenation on the outputs of two branches; because this increases the memory demand, they then performed a convolution with fewer filters to reduce the number of channels of the feature map. Li et al. [29] used two dense dilated blocks with dilated convolution; the features extracted from the last three blocks were concatenated as the input of the inference layer for multi-scale attention competition. Many other researchers have utilized concatenation to fuse features [16,62–64].
However, the concatenation operation naturally raises the processing time. With element-wise summation, the input channels of each layer remain unchanged, so the running time of a model using summation in the skip connection layer is almost equal to that of the model without skip connections [51]. Xie et al. [46] used the summation operation to merge features, and many other researchers have also used summation to fuse features [24,51,65–67].
Unlike Huang and Xie et al., Dai [68] used concatenation operation to combine multiscale features, then used sum operation after blocks to supplement the missing information in the pooling layers.
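The trade-off between the two fusion operations can be made concrete by tracking tensor shapes. The following plain-Python sketch uses hypothetical helper names and an illustrative 16 × 16 × 64 feature map; it only counts channels and the weights of a following 3 × 3 convolution (bias ignored), which are assumptions made for illustration.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def fuse_concat(a: Shape, b: Shape) -> Shape:
    """Concatenate along the channel axis: channels add up."""
    if a[:2] != b[:2]:
        raise ValueError("spatial sizes must match")
    return (a[0], a[1], a[2] + b[2])


def fuse_sum(a: Shape, b: Shape) -> Shape:
    """Element-wise summation: shapes must be identical and stay unchanged."""
    if a != b:
        raise ValueError("shapes must match exactly")
    return a


def next_conv_weights(inp: Shape, filters: int, k: int = 3) -> int:
    """Weights of a k x k convolution applied to the fused map (no bias)."""
    return k * k * inp[2] * filters


if __name__ == "__main__":
    x = (16, 16, 64)
    y = (16, 16, 64)
    for name, fused in (("concat", fuse_concat(x, y)), ("sum", fuse_sum(x, y))):
        print(name, fused, next_conv_weights(fused, filters=64))
```

In this example, concatenation doubles the channel count and therefore doubles the weights of the next convolution, whereas summation leaves both unchanged.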

Skip Connection
In CNNs, deeper layers can capture global features by stacking convolutional layers, but this does not guarantee that the features extracted by the last layer are the final representation for any task [69]. This indicates that combining information of low and high layers can yield the contextual and abstraction information of objects, which can improve the accuracy of image super-resolution restoration. Different intermediate layers extract features at different semantic levels and RFs; the features merged through skip connections contain contextual and abstraction features that are extracted in different blocks [51]. Skip connection is a suitable method to combine local and global features and strengthen feature propagation [9,61], which helps describe objects of different sizes comprehensively. The skip connection can also alleviate vanishing gradients, which benefits back propagation [17] and provides rich information flow to the next layer.
Ronneberger et al. added skip connections between the encoder and the corresponding size decoder in their proposed model, U-Net [44]. Yu et al. [69] showed that the skip connection method is an effective method to make the following layers acquire the information from the previous layers. Shelhamer et al. [18] used skip connections to connect the coarse granularity features with the fine granularity features to improve the prediction effect.

Fuse Different Features of CDTNet
Inspired by the previously mentioned research, we proposed CDTNet to fuse different features of standard, dilated, and transposed convolutions for image classification. Similar to the VGG models, we used 3 × 3 filters and doubled the filter numbers after every pooling operation [3], except the last one.
In our model, we used two parallel convolution operations in the beginning stage: one uses standard convolution and the other uses dilated convolution. The results of the two branches are fused by a concatenation operation, which naturally raises the channel dimension. After the concatenation, we used a block of four consecutive operations, namely 1 × 1 convolution, Batch Normalization (BN) [70], ReLU [71], and 2 × 2 max-pooling (CBRP), to reduce the channel number of the feature map. The process of standard and dilated convolutions is shown in Figure 3.
CDTNet comprises five blocks, as shown in Figure 3, and each block contains three parts: standard and dilated convolutions, concatenation, and CBRP operations. Using standard and dilated convolution layers together can capture image clues at multiple scales by expanding the RF without increasing the number of model parameters. In each block, rate = 1 represents standard convolution, and rate = 2 represents dilated convolution with a dilation rate of 2. The concatenation operation combines the features from the two branches, which represent the global and local features of the image.
After each concatenation operation, we added a CBRP operation to extract features. The size of the convolutional kernel is 1 × 1, which reduces the parameters of the network and the computational cost. In addition, the third dimension of the feature maps, i.e., the number of channels, is controlled by the number of 1 × 1 filters.
It is useful to increase the dilation rate moderately for better performance [61]. Xia et al. [14] used four dilation rates to extract features and then fused the four levels of features to make full use of the low-level and high-level features. A larger RF can be obtained by enlarging the dilation rate; however, because the padding size increases with the dilation rate, a boundary effect is introduced, which counteracts the benefit of the larger RF [61]. We therefore used two dilation rates in CDTNet, i.e., rate = 1 and rate = 2. There are five blocks with the process of standard and dilated convolutions, and the output size of each block is half that of the preceding block's output.
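The shape bookkeeping of the five blocks can be sketched as follows. This is a minimal plain-Python illustration, not the authors' TensorFlow code: the helper names are hypothetical, and the per-block filter numbers (64, 128, 256, 512, 512) are an assumption based on the VGG-style doubling described above.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def same_conv(inp: Shape, filters: int) -> Shape:
    """3 x 3 convolution, stride 1, size-preserving padding (any dilation rate)."""
    return (inp[0], inp[1], filters)


def cdt_block(inp: Shape, filters: int) -> Shape:
    """One block: two parallel branches -> concatenation -> CBRP."""
    standard = same_conv(inp, filters)   # rate = 1
    dilated = same_conv(inp, filters)    # rate = 2, same output size
    concat = (standard[0], standard[1], standard[2] + dilated[2])
    reduced = (concat[0], concat[1], filters)               # 1 x 1 conv (+ BN, ReLU)
    return (reduced[0] // 2, reduced[1] // 2, reduced[2])   # 2 x 2 max-pooling


def trace(inp: Shape, filters_per_block: List[int]) -> List[Shape]:
    """Return the output shape of every block (pool1 ... poolN)."""
    shapes = []
    for filters in filters_per_block:
        inp = cdt_block(inp, filters)
        shapes.append(inp)
    return shapes


if __name__ == "__main__":
    # Assumed VGG-style filter numbers; 32 x 32 x 3 input as in CIFAR-10.
    for i, shape in enumerate(trace((32, 32, 3), [64, 128, 256, 512, 512]), 1):
        print(f"pool{i}: {shape}")
```

Under these assumptions, the spatial size halves from 16 × 16 at pool1 down to 1 × 1 at pool5, which is why the later TC stage needs strides of up to 16 to return to the pool1 size.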
The output of each block is fed into up-sampling units for finer information recovery, which can generate high-resolution features. The comparative experiments of [61] suggest that features extracted with a small upscaling factor retain more detailed information. Thus, we set the TC filter number relative to the filter number in the block: it is divided by 2 for ×2 TC, by 4 for ×4 TC, and so on. The last two TCs have the same filter number. All TC results of the same size are concatenated together with the pooling result of the same size; the process is shown in Figure 4.
In Figure 4, pool1 to pool5 and block1 to block5 correspond to those in Figure 3. The polylines with an arrow represent TC operations, which are marked with "TC", and the straight lines without arrows under each pool block represent skip connections, which are marked with "copy". The pooling operation discards some important feature information. The skip connection is widely used in many popular deep networks, and its advantage is that it allows more lower-level information to reach the top level. We used skip connections in CDTNet, so the fused feature map retains the high resolution of the lower-level feature map and represents better semantic information.
In Figure 4, the feature maps in previous layers are fused with other TC results. We used two methods to fuse the features, concatenation (CDT_C) and addition (CDT_A). The details of the fuse process in the lower left part of Figure 4 are shown in Figure 5.
In Figure 5, each pooling result is transposed with different strides and filter numbers, where TC represents transposed convolution, s is the stride, and num is the filter number. Several successive max-pooling and transpose operations may lead to the loss of low-level feature information, so we set the filter numbers by successively dividing by 2, except for the last TC, whose filter number equals that of the second-to-last TC.
After TC, the outputs of pool2 to pool5 have the same size as pool1. The feature maps of all transposed results and pool1, in Figure 5a, are concatenated into a single tensor as the input of the next layer:
$$O_i = \left[\, p_i,\; p_{i+1}^{t_{2}},\; p_{i+2}^{t_{2^2}},\; \ldots,\; p_{i+n}^{t_{2^n}} \,\right] \tag{2}$$
where $O_i$ concatenates all feature maps, and $p_{i+n}^{t_{2^n}}$ is the result of the $(i+n)$-th pooling layer transposed with stride $2^n$.
The feature maps of all transposed results in Figure 5b are concatenated into a single tensor and then added to pool1 as the input of the next layer:
$$C_i = \left[\, p_{i+1}^{t_{2}},\; p_{i+2}^{t_{2^2}},\; \ldots,\; p_{i+n}^{t_{2^n}} \,\right] \tag{3}$$
where $C_i$ concatenates all transposed feature maps except $p_i$, and $O_i$ averages $p_i$ and $C_i$.
According to the output size, each pooling result undergoes a different number of TCs. The TCs of the pooling results are shown in Figure 6.
In Figure 6, pool5 is transposed four times, and the stride is 2, 4, 8, and 16, respectively. The number of channels is 512, 128, 32, and 8, which is based on the filter number of each convolutional layer and the ratio in Figure 5.
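The strides listed for Figure 6 follow directly from the pooling sizes. The sketch below is a plain-Python illustration with hypothetical function names; it assumes a 32 × 32 input (as in CIFAR-10 and SVHN), five 2 × 2 max-pooling steps, and TCs that multiply the spatial size exactly by their stride.

```python
from typing import Dict, List


def pool_sizes(input_size: int, num_pools: int = 5) -> List[int]:
    """Spatial size after each 2 x 2, stride-2 max-pooling (pool1 ... poolN)."""
    if input_size % (2 ** num_pools) != 0:
        raise ValueError("input size must be divisible by 2 ** num_pools")
    sizes = []
    size = input_size
    for _ in range(num_pools):
        size //= 2
        sizes.append(size)
    return sizes


def tc_strides(input_size: int, num_pools: int = 5) -> Dict[int, List[int]]:
    """For each pooling layer, the TC strides that recover every larger pool size."""
    sizes = pool_sizes(input_size, num_pools)
    plan = {}
    for idx, size in enumerate(sizes):  # idx 0 corresponds to pool1
        plan[idx + 1] = [sizes[j] // size for j in range(idx - 1, -1, -1)]
    return plan


if __name__ == "__main__":
    print(pool_sizes(32))
    for pool, strides in tc_strides(32).items():
        print(f"pool{pool}: strides {strides}")
```

The guard in `pool_sizes` rejects inputs such as 28 × 28, which do not halve evenly five times; sizes of that kind need extra size adjustment that this sketch does not model.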
All experiments were conducted on a server with Intel Xeon Gold 6139 M (2.3-3.7 GHz) processors, 88 GB memory, and an NVIDIA GeForce RTX 2080 Ti graphics card. The operating system was 64-bit Ubuntu 16.04. TensorFlow was used for building the model, and the main source code of ResNeXt and DenseNet was taken from the websites (https://github.com/taki0112/ResNeXt-Tensorflow, accessed on 12 July 2021) and (https://github.com/taki0112/Densenet-Tensorflow, accessed on 12 July 2021), respectively.

Datasets
CIFAR-10: The CIFAR-10 [72] dataset contains 60,000 color images from 10 different classes: trucks, cats, cars, horses, airplanes, dogs, ships, deer, birds, and frogs. The size of these images is 32 × 32 pixels. The dataset contains 50,000 images for training (5000 images in each category) and 10,000 images for testing.
SVHN: The Street View House Number (SVHN) [73] dataset comprises color images of house numbers, collected by Google Street View. SVHN comprises 73,257 training images and 26,032 test images. The digits 0 to 9 offer a multi-class classification with resolution of 32 × 32 pixels. The SVHN shows vast intra-class variations and includes complex photometric distortions, which makes the recognition problem a challenge [75].

Parameter Settings
Parallel convolutional layer: The 3 × 3 convolutional kernel has been shown to be the most effective kernel size for natural images [76]. We used 3 × 3 convolutional kernels and 1-padding with stride 1 to guarantee that the size of the outputs equals that of the inputs. In the dilated convolution channel, the dilation rate is 2. The outputs of the two channels are fused by concatenation. BN leads to considerable improvements in convergence while eliminating the need for other forms of regularization, so every convolution operation is followed by a BN operation.
Pooling layer: We used max-pooling with a 2 × 2 pixel window, and stride is 2 in each pooling operation.
TC layer: For CIFAR-10 and SVHN, the input size is 32 × 32, and the output size is divided by 2 after each max-pool. In the TC layer, the input size is multiplied by 2, 4, 8, and 16 for pool5 to recover the size corresponding to pooling layers, and by 2, 4, and 8 for pool4, and so on.
Feature fusion layer: Based on the output size of TC and pooling layers, we used the concatenation (CDT_C) and addition (CDT_A) method to fuse the features.
FC layer: Because the FC layers in VGGs have a large number of redundant parameters, according to the ablation experiments of [33], we added two FC layers in our model; the neuron numbers are 1024 and 10.
The last FC layer is connected with a 10-class layer with cross-entropy loss. Softmax was selected to obtain the category probability, formulated as Formula (5):
$$p(y \mid x) = \operatorname{softmax}\left(w^{\top} x\right) \tag{5}$$
where C is the number of channels, $w \in \mathbb{R}^{C \times N}$, N is the number of classes, and $p(y \mid x) \in \mathbb{R}^{N}$ is the scaled classification score.
On the CIFAR-10 dataset, we used Nesterov momentum with a momentum weight of 0.9 and a weight decay of 0.0003. All models were trained with an initial learning rate of 0.1, which was divided by 10 after 80 and 120 epochs, and the batch size was 250. On the SVHN dataset, the models were trained for 50 epochs with a batch size of 96. On the FMNIST dataset, all models were trained for 50 epochs with a batch size of 100. We used the Adam optimization method with a learning rate of 0.001 for the SVHN and FMNIST datasets.
We adopted data augmentation methods such as translation [37] and horizontal flipping [38] for CIFAR-10 and SVHN datasets, and used L2-regularization [34] techniques for limiting network complexity.
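The step schedule described for CIFAR-10 can be written compactly. The function below is a plain-Python sketch with a hypothetical name; it assumes that epochs are counted from 0 and that the rate drops as soon as epoch 80 and epoch 120 are reached, a boundary convention the text does not specify.

```python
from typing import Sequence


def step_lr(epoch: int, base_lr: float = 0.1,
            milestones: Sequence[int] = (80, 120),
            factor: float = 10.0) -> float:
    """Learning rate at a given epoch: base_lr divided by `factor` per milestone reached."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** reached)


if __name__ == "__main__":
    for epoch in (0, 79, 80, 119, 120):
        print(epoch, step_lr(epoch))
```

With the defaults, the rate is 0.1 up to epoch 79, roughly 0.01 from epoch 80, and roughly 0.001 from epoch 120 onward.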

Results and Discussion
We used four evaluation metrics (training loss, training accuracy, test loss, test accuracy) for CIFAR-10 and SVHN, and three evaluation metrics (training loss, training accuracy, test accuracy) for FMNIST. We compared the experimental results of CDTNet with VGGs, ResNeXt, and DenseNet. At the same time, we also compared the parameters of these models. The parameters can be calculated as Formula (6) [21], where $P$ represents the parameters; $fn_f$ and $fn_n$ denote the filter numbers of the front and next layers, respectively; and $C$ represents the filter size.
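A convolutional layer's parameter count can be estimated from the quantities named above. The sketch below assumes the common weight-only form P = C × C × fn_f × fn_n (no bias term); this form, the helper names, and the 64-filter example are assumptions for illustration and are not confirmed by the text.

```python
def conv_params(fn_front: int, fn_next: int, c: int) -> int:
    """Weights of one convolutional layer with c x c filters (bias ignored)."""
    return c * c * fn_front * fn_next


def block_params(fn_front: int, filters: int) -> int:
    """Two parallel 3 x 3 branches followed by a 1 x 1 reduction convolution."""
    branches = 2 * conv_params(fn_front, filters, 3)   # standard + dilated
    reduction = conv_params(2 * filters, filters, 1)   # after concatenation
    return branches + reduction


if __name__ == "__main__":
    print(conv_params(64, 64, 3))   # one 3 x 3 layer, 64 -> 64 filters
    print(block_params(3, 64))      # first block on a 3-channel input
```

Because the dilation rate does not appear in the count, the standard and dilated branches contribute the same number of weights under this assumption.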

CIFAR-10
We performed the experiments using six models (VGG16, VGG19, ResNeXt, DenseNet, CDT_C, and CDT_A) on CIFAR-10. Four evaluation metrics were used to evaluate the performance of the models. The experimental comparison results are plotted in Figure 7.
Figure 7 reveals that CDTNet performs better in the training and test stages. CDT_C improves the average test accuracy by 9.43%, 1.48%, 19.92%, and 8.89% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively. CDT_A improves the average test accuracy by 9.61%, 1.65%, 20.13%, and 9.07% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively. Both variants also reduce the average training loss and improve the average training accuracy. The specific values are shown in Table 1, where the symbol ↓ indicates a reduction and the symbol ↑ indicates an improvement.

SVHN
We performed the same experiments on SVHN as on CIFAR-10, but the VGG models performed very poorly, so we changed the kernel size to 1 for layers 7, 10, and 13 in VGG16 and for layers 8, 12, and 16 in VGG19. Figure 8 shows the experimental comparison on the SVHN dataset. From Figure 8, we can draw the same conclusion: CDTNet performs better than the VGGs, ResNeXt, and DenseNet. CDT_C improves the average test accuracy by 9.13%, 54.81%, 17.49%, and 3.82% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively. CDT_A improves the average test accuracy by 6.79%, 51.48%, 14.97%, and 1.59% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively. Both variants also reduce the average training loss and improve the average training accuracy. The specific values are shown in Table 2.

FMNIST
FMNIST images are 28 × 28, unlike those of CIFAR-10 and SVHN. As a result, when we transposed the features from blocks 3 and 4, the feature size of the TC layers did not equal that of the previous layers. We therefore used additional layers with stride 2 and 0-padding to make the output size equal to that of the previous features, at the cost of losing some boundary information. The experimental results are shown in Figure 9. As seen in Figure 9, the training loss and training accuracy of the VGGs are better than those of CDTNet at the beginning, which may be due to the boundary information lost by the additional 0-padding convolution. However, CDTNet overtakes them after epoch 28, and its test accuracy is also higher than that of VGG16, VGG19, ResNeXt, and DenseNet. The test accuracies of all models are shown in Table 3.
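The size mismatch on 28 × 28 FMNIST inputs can be illustrated with the standard output-size formulas for pooling and transposed convolution. This is only a sketch under stated assumptions: each block is assumed to halve the feature map with an unpadded 2 × 2, stride-2 pooling, and each TC layer is assumed to use kernel 2 and stride 2. These settings and the function names are hypothetical and are not confirmed by the text.

```python
from typing import List


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output size of an unpadded pooling (or conv) layer, using floor division."""
    return (size - kernel) // stride + 1


def transposed_out(size: int, kernel: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Output size of a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel


def block_sizes(input_size: int, n_blocks: int = 4) -> List[int]:
    """Feature sizes: index 0 is the input, index b is the output of block b."""
    sizes = [input_size]
    for _ in range(n_blocks):
        sizes.append(pool_out(sizes[-1]))
    return sizes


def mismatched_blocks(input_size: int, n_blocks: int = 4) -> List[int]:
    """Blocks (from block 2 on) whose upsampled output differs from the previous block's size."""
    sizes = block_sizes(input_size, n_blocks)
    return [b for b in range(2, n_blocks + 1)
            if transposed_out(sizes[b]) != sizes[b - 1]]


if __name__ == "__main__":
    for input_size in (32, 28):
        print(input_size, block_sizes(input_size), mismatched_blocks(input_size))
```

Under these assumptions a 32 × 32 input halves cleanly (32, 16, 8, 4, 2), so every transposed output matches the previous block. A 28 × 28 input yields 28, 14, 7, 3, 1, and the odd size 7 makes the upsampled outputs of blocks 3 and 4 (6 and 2) differ from the sizes they should match (7 and 3).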

On FMNIST, CDTNet performs better in the training stage and improves the test accuracy. CDT_C improves the test accuracy by 1.35%, 1.41%, 1.80%, and 2.23% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively. CDT_A improves the test accuracy by 1.28%, 1.35%, 1.73%, and 2.17% compared to VGG16, VGG19, ResNeXt, and DenseNet, respectively.

Parameters of Models
The parameters of these models, calculated according to Figures 3-6 and Formula (6), are shown in Table 4. From Table 4, we can see that CDT_C has slightly more parameters than ResNeXt, whereas CDT_A has far fewer parameters than the other models.
To sum up, the experimental results on the three datasets show that CDTNet reduces the training and test losses and improves the accuracies. There are outliers during the training stage on FMNIST because the additional layers used to adjust the feature size lose some boundary information, but this does not affect the overall performance of CDTNet. The gain in average test accuracy ranges from 1.28% (on FMNIST, compared to VGG16) to 54.81% (on SVHN, compared to VGG19).

Conclusions
In this study, we proposed CDTNet, which combines standard, dilated, and transposed convolutions. The standard and dilated convolutions extract multi-scale features, and the transposed convolution transmits features from low level to high level, which can recover part of the information lost in the pooling layers. Because the object size is small, we used a dilation rate of 2 to extract features that are concatenated with the output of the standard convolution. Each block except block 1 was followed by a transposed operation to increase the spatial size of the feature maps and recover low-resolution prediction maps. We evaluated the model on the CIFAR-10, SVHN, and FMNIST datasets against VGG16, VGG19, ResNeXt, and DenseNet. On CIFAR-10, CDTNet improves the average test accuracy by 1.48% to 20.13% and reduces the average test loss by 14.1% to 56.6%. On SVHN, CDTNet improves the average test accuracy by 1.59% to 54.81% and reduces the average test loss by 15.51% to 62.85%. On FMNIST, CDTNet improves the average test accuracy by 1.28% to 2.23%. The experimental results show that all evaluation metrics of CDTNet are better than those of the state-of-the-art models, which indicates that CDTNet has better performance and strong generalization abilities, as well as fewer parameters.
In future work, we will explore more effective architectures to fuse features of different granularity and adopt more diverse evaluation metrics to analyze the performance. In addition, because not all input image sizes are powers of 2, we will explore a more effective method to set the number of TC channels and to design the feature size after the TC operation.